Commit Graph

211 Commits

Author SHA1 Message Date
Trey t 5d8559b495 chore(deploy): mark deploy_prod.sh as deprecated; point at k3s flow
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Production migrated from Docker Swarm to k3s on 2026-04-24, but
deploy_prod.sh continued to target the old hetzner1 Swarm manager.
Without dockerd running there it spent 30+ seconds doing SSH probes
before dying on a confusing "Got: false" Swarm-state error.

Add an early guard that fails immediately with a pointer to
deploy-k3s/scripts/03-deploy.sh and the kubeconfig-fetch one-liner.
ALLOW_LEGACY_SWARM_DEPLOY=1 still bypasses if anyone needs the old path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 13:46:13 -05:00
Trey t 191c9b08e0 feat(static): rebuild landing page on amber-on-midnight brand system
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Replace the off-brand "VIBRANT EDITION" CSS (generic SaaS blue/purple/
teal/pink) with a strict palette derived from icon.png and style_guide.md:

  --gold #FCCE38, --amber #F5A623, --pollen #FFE082, --sun-bloom #F9BB2F
  --midnight #181E37, --deep #162140, --comb-line #232230
  --cream #FFF1D0

Spacing/radius scale mirrors iOS DesignSystem.swift (AppSpacing 4/8/12/
16/24/32/48/64; AppRadius 4/8/12/16/20/24) so the web feels native to
the same brand system. 56px button height, 16px card radius, identical
elevation language.

Page architecture:
- Sticky translucent nav with hex brand mark (1+6 cluster)
- Hero with iPhone frame mock showing real kanban view (overdue/due
  soon/in progress/done with priority dots and meta chips)
- Cream "What's due, what's done, what's yours" pillars
- Four feature deep-dives (residences, tasks/kanban, contractors,
  documents/warranties) with product UI mocks built from real app
  concepts
- "Each cell, a task" comb section with JS-generated 8x10 honeycomb
  completion grid that fills more densely toward the top
- iOS polish section: Home Screen widget mock with quick-complete,
  push notification with inline actions, Face ID lock, 11 themes
- Sharing section with share-card mock (HIVE-7K2D-Q9 code + 3 keepers)
- Free vs Pro pricing with "Most chosen" tag
- Final CTA with brand mark + golden glow

Honeycomb motif:
- Brand mark uses gold-on-navy with a radial halo (no currentColor
  dependency — renders identically everywhere)
- Hex grid background uses a properly tessellating flat-top tile
  (3 hexes per 126x73 unit, sharing full edges, no seams)
- Hex bullets, hex pills, completion grid all flat-top per style guide

Copy follows style_guide.md voice — calm, specific, no banned words
(chore, simplified, seamless), sentence case throughout. Canonical
tagline "A home is a hive. We'll help you keep it." used verbatim
in the hero and footer.

JS: mobile nav toggle, scroll-state nav, IntersectionObserver reveal,
deterministic comb-grid generator. Respects prefers-reduced-motion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 13:34:32 -05:00
Trey t 4efc87559a fix(uploads): force virtual-hosted-style URLs for B2 presigned POST
Backblaze B2's S3-compatible endpoint only implements POST Object on
virtual-hosted-style URLs (https://<bucket>.s3.<region>.backblazeb2.com/).
Path-style POST returns HTTP 501 Not Implemented.

minio-go's BucketLookupAuto only flips to virtual-hosted for AWS,
Google, and Aliyun endpoints — for B2 it falls through to path-style,
which is why every PresignedPostPolicy() call has been handing the
mobile clients a URL that B2 then refuses with 501.

Force BucketLookupDNS only when the endpoint is backblazeb2.com so
MinIO dev (no DNS for arbitrary buckets at minio:9000) keeps its
path-style default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 13:34:05 -05:00
Trey t 1347ffadf5 docs: presigned-URL upload flow + B2 lifecycle setup
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
09-storage.md:
  - Replaced the "Upload flow" section. The previous text described the
    multipart-via-API path that was removed in b7f8329. Now documents
    the three-step direct-to-B2 flow (presign → POST to B2 → attach
    via upload_ids[]) with an ASCII diagram and a server-side
    enforcement-points table.
  - Replaced the "Future: signed URLs" placeholder (since presigned
    URLs are now the present, not the future).
  - Added "Lifecycle and retention" subsections covering the
    pending_uploads cleanup cron (worker, 30 * * * *), the B2 bucket
    lifecycle as backstop (uploads/ prefix, 7-day hide + 1-day delete),
    and the still-open user-deletion cascade gap.

14-deployment-process.md:
  - Added a "One-time B2 bucket lifecycle (manual)" section explaining
    why the rule can't live in the deploy script (B2's S3 lifecycle
    API is partial), the exact rule to apply via the Backblaze
    console, and a verification command.

docs/deployment/README.md:
  - Updated the chapter 9 description to mention presigned-URL uploads.

README.md (root):
  - Added a paragraph under "Object storage" pointing to the new
    upload architecture and the relevant deployment-book chapters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:44:08 -07:00
Trey t 14026251b7 fix(worker): wire B2 credentials so pending_uploads cleanup cron can run
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
The new TypeUploadCleanup cron (30 * * * *) constructs a StorageService at
worker startup so it can call b.client.RemoveObject on B2 when reaping
expired pending_uploads rows. Without B2_KEY_ID + B2_APP_KEY the storage
service falls back to local disk and crashes on this pod's read-only
root filesystem, leaving the cleanup as a no-op.

Mirrors the api deployment which already wires these.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 15:25:53 -07:00
Trey t b7f83293b8 refactor(uploads): drop legacy multipart code paths
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
The presigned-URL upload flow (POST /api/uploads/presign + direct B2 POST
+ upload_ids[] in entity creation) is now the only image upload path. The
legacy multipart routes and DTO fields used by older clients are removed:

Removed:
  - POST /api/uploads/image/        (legacy multipart upload → URL)
  - POST /api/uploads/document/     (legacy multipart upload → URL)
  - POST /api/uploads/completion/   (legacy multipart upload → URL)
  - Multipart branch in POST /api/task-completions/ (now JSON-only)
  - CreateTaskCompletionRequest.ImageURLs DTO field
  - UpdateTaskCompletionRequest.ImageURLs DTO field
  - CreateDocumentRequest.ImageURLs DTO field
  - Service-layer ImageURLs loops in task_service.CreateCompletion,
    task_service.UpdateCompletion, document_service.CreateDocument
  - Tests exercising the removed paths
  - Now-unused imports (strings/time/decimal) in task_handler.go

Kept:
  - DELETE /api/uploads/  (orphan-cleanup endpoint, still useful)
  - POST /api/uploads/presign/  (the new path)
  - POST /api/documents/:id/images/  (uses storage_service.Upload directly,
    same multipart pattern but separate code path; deferred for now)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 15:19:21 -07:00
Trey t 29c9014a33 feat(uploads): direct-to-B2 presigned uploads with content-length-range policy
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Replaces the multipart-via-API path for image uploads with a three-step
direct-to-storage flow:

  1. Client POSTs /api/uploads/presign with content_length + content_type;
     server validates size (10 MB cap), mime allow-list per category, rate
     limit (50/hour/user via Redis sliding window), and concurrent unclaimed
     cap (10 in-flight per user). On success it persists a pending_uploads
     row, signs an S3 POST policy with content-length-range bound to the
     claimed length ±256 bytes, and returns the URL+fields.
  2. Client POSTs the bytes directly to B2 using the signed policy. B2
     enforces size, content-type, and key match before accepting.
  3. Client passes upload_ids[] to /api/task-completions/ or /api/documents/.
     Service HEADs each B2 object, verifies size matches expected_bytes
     within slack, marks pending_uploads claimed_at, and creates the
     associated TaskCompletionImage / DocumentImage rows.

Bytes never traverse our API server. The 1 MB Echo BodyLimit middleware
that was rejecting all task-completion image uploads becomes irrelevant
for this path. Existing multipart endpoints stay functional alongside,
soak-testing the new path before legacy removal.

Cleanup:
  - cmd/worker registers a new hourly cron (TypeUploadCleanup, "30 * * * *")
    that reaps pending_uploads where claimed_at IS NULL AND expires_at < NOW().
    Reaps both the B2 object and the row.
  - B2 bucket lifecycle rule on `uploads/` prefix (7 days hide → 1 day delete)
    documented in deploy-k3s/manifests/b2-lifecycle.md as a backstop.

Schema:
  - migrations/000002_pending_uploads.sql adds the table + partial index for
    cleanup + nullable pending_upload_id FKs on task_taskcompletionimage and
    task_documentimage.

Policy (single tier, no free/pro split):
  - 10 MB cap per upload
  - 50 presigns/hour/user
  - 10 concurrent unclaimed uploads/user
  - allow-list: jpeg/png/heic/heif/webp for image categories;
    + pdf for document_file

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 14:36:42 -07:00
Trey t 9bee436e86 perf(subscription-status): cache + parallelize + invalidate on mutations
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
GET /api/subscription/status/ was the slowest endpoint in the API at
p50≈1750ms / p95≈2425ms — about 12× the floor for our cluster→Neon
geography. Jaeger traces showed seven sequential SQL queries each
costing roughly one transatlantic RTT (~110ms), with the actual queries
running in 0.073ms at the database. Pure network serialization, not slow
SQL.

Three changes, in order of leverage:

1. Cache the assembled SubscriptionStatusResponse per-user in Redis with
   a 5-minute TTL. Hot path collapses to a single Redis GET (~5ms) on
   warm reads; the TTL is a safety net against missed invalidations.

2. Parallelize the three independent COUNT queries in getUserUsage
   (task_task / task_contractor / task_document) via golang.org/x/sync
   errgroup. Three RTTs collapse to one. Also dropped the redundant
   residence_residence COUNT — len(residenceIDs) from FindResidenceIDsByOwner
   is the same number, no need to re-query.

3. Wire explicit invalidation into every mutation that could change a
   user's response — residence/task/contractor/document CRUD,
   residence membership changes (JoinWithCode, RemoveUser, DeleteResidence),
   and every subscription tier flip across the IAP/Stripe/webhook surface.
   Residence-scoped invalidations fan out to every user with access via a
   new ResidenceRepository.FindUserIDsByResidence helper, so members of a
   shared residence don't see stale `usage` numbers when another member
   adds a task.

Net effect: warm path goes from ~1350ms to ~5ms (Redis hit). Cold path
goes from ~1350ms to ~250-450ms (5 sequential queries → 2 phases:
residence IDs lookup, then parallel task/contractor/document counts).

Also fixed a pre-existing CheckLimit signature drift in
internal/integration/subscription_is_free_test.go that was blocking the
package build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 11:00:23 -07:00
Trey t 0798ae8d74 fix(testutil): use shared-cache SQLite so concurrent reads see same DB
SetupTestDB used `sqlite.Open(":memory:")`, which creates a *separate*
in-memory database for every connection in GORM's pool. Sequential tests
never noticed because the pool keeps reusing one connection — but the
moment any code path issued concurrent reads (e.g. errgroup-driven
parallel COUNT queries), a goroutine could pull a fresh connection, see
no migrated tables, and explode with "no such table".

Switched to `file:testdb_<n>?mode=memory&cache=shared&_journal=memory`
with a per-test atomic counter so every connection in the pool sees the
same in-memory DB and tests stay isolated from each other through the
unique cache namespace. As a bonus, this also resolves the pre-existing
TestTaskHandler_QuickComplete flake — same root cause, just intermittent
because the pool occasionally handed out a second connection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 11:00:03 -07:00
Trey t ce4d49caef tools: add send-test-push for one-shot Asynq push verification
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Tiny CLI that enqueues a notification:send_push task into Redis. The
worker picks it up and routes through internal/push/Client.SendToAll —
which is exactly the path used by HandleSmartReminder, HandleDailyDigest,
and any other in-process push, so a successful round-trip here proves the
production push pipeline end-to-end without waiting for the next cron tick.

Requires Redis to be reachable. Easiest path:

  kubectl -n honeydue port-forward svc/redis 6379:6379

  go run ./cmd/send-test-push --user-id 6 --title "..." --message "..."

The worker logs `Sending push notification...` followed by the APNs
batch result; failure modes (BadDeviceToken, circuit breaker, etc.)
surface as the same error_message rows the existing notif-diag tool
already reports on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 08:59:51 -07:00
Trey t cb1dc383b4 tools: add admin-reset and notif-diag operational CLIs
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Two small Go CLIs for production ops that previously required ad-hoc
psql or kubectl gymnastics. Both load DB credentials from prod.env-style
env vars and read POSTGRES_PASSWORD from deploy/secrets/postgres_password.txt
by default, so the workflow is `set -a && source deploy/prod.env && set +a`
followed by go run.

cmd/admin-reset/main.go:
  --list                  print all admin_users rows
  --verify --email X      bcrypt-check a password against the stored hash
                          using the same case-insensitive lookup the live
                          /api/admin/auth/login endpoint uses
  --new-email Y           rename an admin's email (with unique-index check)
  default (--email X)     prompt for a new password twice (no echo, min 12
                          chars), bcrypt at DefaultCost, update the row

cmd/notif-diag/main.go:
  default                 print pending/sent counts, breakdown by type and
                          age, the 5 most recent pending rows with their
                          error_message, and registered APNs/FCM device
                          counts
  --mark-failed-as-sent   cosmetic cleanup — UPDATE pending rows that have
                          a recorded error to sent=true,
                          sent_at=COALESCE(updated_at, NOW())
  --yes                   skip the interactive confirmation prompt

Both bypass internal/config.Load() entirely so they don't need
SECRET_KEY or other unrelated env vars to run. .gitignore excludes the
build artifacts at /admin-reset and /notif-diag.

go.mod adds golang.org/x/term v0.41.0 (promoted from indirect to direct)
for no-echo password input in admin-reset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 08:36:13 -07:00
Trey t 8fce568532 fix(config): replace sync.Once reset-from-Do with mutex
Load()'s validation-failure path reassigned cfgOnce = sync.Once{} from
inside Do(). When Do() returned and tried to unlock the original mutex,
the Once struct had already been replaced with a fresh one whose mutex
was unlocked, panicking with "sync: unlock of unlocked mutex" on every
boot where any required env var was missing or invalid.

Replaced the Once with a plain sync.Mutex around a nil-check on the
package-level cfg, building the candidate into a local first and only
assigning to cfg after validate() succeeds. Same caching semantics, no
race, and a failed Load() leaves cfg nil so the next caller retries
cleanly.

Also documented AppleAuthConfig.TeamID as currently dead — it's loaded
from APPLE_TEAM_ID but no service reads it. Wire-up point noted for
when Sign in with Apple revocation/refresh is added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 08:35:54 -07:00
Trey t 289a23f7e6 deploy(ingress): drop obsolete scaffold ingress.yaml
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
The directory had two ingress manifests that both define
honeydue-api and honeydue-admin:

  - ingress.yaml          (Mar 28, scaffold from `deploy-k3s/`
                           greenfield template)
  - ingress-simple.yaml   (Apr 24, corrected for our actual cluster
                           shape per MIGRATION_NOTES.md)

`kubectl apply -f manifests/ingress/` applies both, and ingress.yaml
happens to apply last alphabetically (`-` < `.` so `ingress-simple`
sorts before `ingress.yaml`), clobbering the corrected manifest.
That left the live cluster with two regressions:

  1. honeydue-admin had `admin-auth` Traefik middleware in its chain,
     referencing the `admin-basic-auth` secret. Per MIGRATION_NOTES
     basic auth is intentionally not applied on this cluster (admin
     uses in-app auth), so the secret was never created.  Traefik
     logs `secret 'honeydue/admin-basic-auth' not found` on every
     reconcile and refuses to materialize the admin router → 404.

  2. honeydue-api lost the apex `myhoneydue.com` rule that
     ingress-simple.yaml adds for the marketing landing page →
     apex 404.

`kubectl apply -f ingress-simple.yaml` against the live cluster
restored both routes (admin/apex back to 200). Removing the stale
file from the repo prevents the next deploy from regressing.

Refs: deploy-k3s/MIGRATION_NOTES.md ("Admin basic auth | Not applied
— in-app auth only").
2026-04-26 23:44:21 -05:00
Trey t 8d9ca2e6ed docs(deployment): rewrite migration prose for goose adoption
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Update the deployment book and glossary to reflect the goose-based
schema migration flow shipped in 12b2f9d/0f7450a:

- ch07: clarify startup probe assumes migrations ran out-of-band
- ch08: drop AutoMigrate-with-advisory-lock prose; describe goose Job
- ch12: pod startup checks goose_db_version, no longer runs migrations
- ch14: document the Job→wait→roll deploy gate and how to debug failures
- ch16: add "Migrate Job fails during deploy" + "Schema precondition
  failed" failure modes
- ch17: new runbook entries §26 (run migrations manually), §27 (recover
  from failed/dirty migration), §28 (bootstrap goose on fresh clone)
- ch19: postscript on §13 noting MigrateWithLock approach is superseded
- ch20: mark "Migration Job for schema changes" task done
- glossary: add `goose` and `goose_db_version`; flag AutoMigrate as
  tests-only
- references: add goose links; flag AutoMigrate as tests-only
2026-04-26 23:01:32 -05:00
Trey t 0f7450ada9 build: fix goose binary copy path for cross-compile
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
go install with GOOS/GOARCH set drops binaries in /go/bin/<goos>_<goarch>/,
which broke the COPY in go-base. Switching to git clone + go build with
explicit -o /app/goose so the output path is stable regardless of host
platform.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 22:48:08 -05:00
Trey t 12b2f9d43b Adopt pressly/goose for schema migrations
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Replaces the previous hand-rolled MigrateWithLock + GORM AutoMigrate path,
which had two compounding problems:
- AutoMigrate ran on every pod startup (~5 min over the transatlantic
  link) even when no schema changes had landed
- pg_advisory_lock is session-scoped, which silently fails through
  Neon's pgbouncer transaction-mode pooler — turns out this is a
  known and documented limitation that bites golang-migrate too

Goose was chosen over golang-migrate (the other heavyweight) because:
- Goose wraps each migration file in a transaction by default, so a
  failure rolls back cleanly instead of leaving a "dirty" version
  state requiring manual force-reset (golang-migrate's known
  weakness, per its own issue tracker — see #1001 + Atlas's writeup)
- Goose's locking is opt-in. We don't opt in: migrations run as a
  single Kubernetes Job, which IS the singleton process. No advisory
  lock needed at all.

Layout:
- migrations/000001_init.sql — schema-only pg_dump of the live Neon
  DB at adoption, stripped of psql-only directives that block goose's
  bookkeeping insert. Pre-goose hand-numbered migrations 002-022 had
  their effects folded into this baseline; deleted from the live tree
  but preserved in git history at 58e6997.
- Dockerfile installs `goose v3.22.1` at build time and copies the
  binary into the api image. The migrate Job reuses the api image with
  command=goose, so no separate image to build/push/version.
- deploy-k3s/manifests/migrate/job.yaml: a one-shot Job that strips
  the -pooler segment from DB_HOST (advisory lock won't survive
  pgbouncer transaction-mode), runs `goose up`, exits.
- deploy-k3s/scripts/03-deploy.sh: deletes any prior Job, applies the
  fresh one, `kubectl wait --for=condition=complete --timeout=10m`,
  then proceeds with api/worker rollout. Job failure aborts the deploy
  before any new app pod sees a stale schema.
- internal/database/database.go::RequireSchemaApplied checks
  goose_db_version on startup. api/worker refuse to boot if the
  table is missing or its latest row has is_applied=false — the
  fail-fast for "operator forgot to run migrate."
- Makefile: migrate-up / migrate-down / migrate-status / migrate-new
  for local workflow.

Production DB was bootstrapped manually:
  $ goose -dir migrations postgres "$DSN" version  # creates table
  $ psql ... -c "INSERT INTO goose_db_version (version_id, is_applied, tstamp) VALUES (1, true, NOW());"

Smoke test against fresh Postgres locally: 50 user tables created in
284ms via `goose up`, version_id=1 + is_applied=t recorded.

Verified the local goose CLI talks to prod successfully:
  $ goose ... status
  Applied At                  Migration
  =======================================
  Mon Apr 27 03:43:55 2026 -- 000001_init.sql

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 22:46:36 -05:00
Trey t d96f317d20 Revert "Fix migration deadlock under Neon pooler"
This reverts commit 30966c6f5e.
2026-04-26 22:22:07 -05:00
Trey t 4049b704c3 Revert "deployment: extend api startup probe budget for direct-endpoint migrations"
This reverts commit a94744061e.
2026-04-26 22:22:07 -05:00
Trey t a94744061e deployment: extend api startup probe budget for direct-endpoint migrations
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
The migration-pooler fix (commit 30966c6) routes AutoMigrate through
Neon's direct compute endpoint to keep the session-scoped advisory lock
alive. That swap means each DDL pays a fresh transatlantic RTT instead
of riding warm pooler connections, so AutoMigrate's runtime climbs from
~90s to 4-6 min on the first pod of a cold boot. With the previous 240s
grace the startup probe was killing pods mid-migration.

Bumping to 120 × 5s = 600s grace. Subsequent pods inherit the schema
and finish their migrate-no-op in seconds, so this only matters for the
single first-pod migration window after a deploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 22:05:58 -05:00
Trey t 30966c6f5e Fix migration deadlock under Neon pooler
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Today's pooler-endpoint switch broke MigrateWithLock: pg_advisory_lock is
session-scoped, but PgBouncer transaction-mode releases the underlying
Postgres session after every transaction. The lock was being released
the moment we acquired it, and on the next pod's startup the migration
either deadlocked or proceeded without serialization. Visible as
\"Acquiring migration advisory lock...\" hanging until the startup
probe killed the pod (as just happened on the b67f7f9 deploy).

Fix: open a parallel *gorm.DB pointed at the *direct* Neon endpoint
(DB_HOST with the -pooler segment stripped) for migrations only. That
keeps a real persistent session, so pg_advisory_lock works correctly.
The migration runs against this direct connection; the runtime pool
keeps using the pooler for everything else.

Side effects:
- Migrate() now delegates to migrate(target *gorm.DB) which lets
  MigrateWithLock pass the direct DB. Tests on SQLite still call
  Migrate() through the global pool unchanged.
- Migration DB uses MaxOpenConns=1, MaxIdleConns=1 — we just hold
  the lock on it, no connection pressure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 21:53:52 -05:00
Trey t b67f7f9e6b Cache SubscriptionSettings + cut monitoring poll noise
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Trace data revealed subscription_subscriptionsettings was consuming
1,983s of cumulative DB time per day (180× more than the next-largest
table) for a 32-byte singleton row of admin-toggleable global flags.
Root cause was a 30-second poll loop in monitoring.Service per pod
plus uncached reads on every authed status check / CreateResidence /
Stripe webhook. Fix is layered:

1. Redis cache for SubscriptionSettings — same shape as the
   residence-IDs cache. 30-min TTL, explicit invalidation on admin
   write. New CacheService.{Cache,GetCached,Invalidate}SubscriptionSettings
   plus a cachedSubscriptionSettings helper in services/.

2. SubscriptionService, StripeService, and both admin handlers
   (settings + limitations) now read through the cache. Admin write
   handlers invalidate so toggles propagate cluster-wide within ms
   instead of waiting for the TTL.

3. monitoring.Service.syncSettingsFromDB also reads from Redis first
   (raw redis.Client to avoid a services→monitoring import cycle).
   Polling interval bumped 30s → 5min. Combined with Redis-shared
   cache, cluster-wide DB hits from this poll go from ~480/hour to
   ~2/hour — a 240× reduction.

4. StripeService.CreateCheckoutSession now takes ctx so the cached
   settings span (and the Stripe webhook trace) stay attached to the
   request. Handler call site updated.

5. Admin handlers' direct h.db.First calls switched to
   db.WithContext(ctx) so the resulting orphan SQL spans nest under
   the admin request span in Jaeger.

Net DB query rate for subscription_subscriptionsettings should drop
from 0.101/sec to ~0/sec with occasional invalidation-driven refills,
and the table's cumulative DB time from 1,983s/day to ~10s/day.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 21:29:30 -05:00
Trey t c9ac273dbd docs: capture latency optimizations + new caching invariants
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Shipping commit 88fb175 changed the trace shape and added a new caching
layer with required invalidation rules. Updating the operator-facing
docs so they match the running system.

ch08 (database):
- DB_HOST is the -pooler Neon endpoint, not direct compute
- Connection pool: MaxIdleConns 20 (was 10), MaxLifetime 30m (was 10m),
  MaxIdleTime 0 (never close idle)
- New \"Pool warm-up at boot\" section documenting the 20-parallel-ping
  warm-up in database.Connect
- Replaced the \"Neon regions\" section: explicit RTT numbers, the
  optimization stack that minimizes round-trips, when this still matters

ch15 (observability):
- Replaced the 2,473ms/5-span sample trace with the new 229ms/2-span
  post-optimization trace; kept the old one underneath for diff context

ch16 (failure modes):
- Added: stale residence-IDs cache (data freshness bug + recovery)
- Added: Redis at maxmemory limit (verify allkeys-lru policy)
- Added: Neon pooler unreachable but direct endpoint up — emergency
  switchover procedure

ch17 (runbook):
- §23 Invalidate residence-IDs cache for a user (DEL key + grep for
  missing invalidation in new code)
- §24 Verify DB pool warm-up is working (log pattern + impact test)
- §25 Switch DB host between pooler and direct endpoints

observability-plan.md status flipped from \"plan only\" to shipped
with the latency-cut summary.

README links to the new ch08 latency section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 17:36:36 -05:00
Trey t 88fb1751c7 Cut /api/tasks/ p99 from ~2500ms toward ~150-300ms
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Stack of optimizations against the same Hetzner→Neon transatlantic link.
The trace revealed every visible ms was network/proxy overhead — DB
execution itself is sub-millisecond per query (verified via EXPLAIN
ANALYZE: index scans on every hot path).

Connection layer:
- DB_HOST → Neon pooler endpoint (-pooler suffix). PgBouncer
  transaction-mode keeps backend Postgres connections warm so we no
  longer pay the ~110ms Postgres-startup RTT on cold queries.
- GORM pool tuned: MaxIdleConns 10→20, MaxLifetime 600s→1800s,
  MaxIdleTime added (default 0 = never close idle).
- Eager pool warm-up at boot via parallel pings — first user request
  no longer pays the ~440ms TCP+TLS+startup handshake.
- Redis maxmemory-policy noeviction → allkeys-lru. Cache writes will
  evict cold keys instead of erroring at the 256MB limit.

Auth layer:
- TokenCacheTTL 5min → 1 hour (Redis token cache).
- UserCacheTTL 30s → 5min (in-memory User cache, per pod).
- UserCache gains a 5,000-entry LRU cap so a flood of unique users
  can't blow up pod RSS. ~5MB worst-case per pod.
- Token + user lookup collapsed from 2 GORM Preload queries into a
  single INNER JOIN. Saves 1 RTT per cold-cache request.
- Auth middleware's m.db.* now use db.WithContext(ctx) so the SQL
  spans nest under the parent HTTP request in Jaeger.

Service layer:
- TaskService.ListTasks: replaced two-step
  FindResidenceIDsByUser → GetKanbanDataForMultipleResidences
  with a single GetKanbanDataForUser that uses a Postgres subquery
  for residence-access. One round-trip instead of two.
- New CacheService residence-IDs cache: \"residence_ids_user:<id>\"
  with 5-min TTL. Wired into Task/Residence/Contractor/Document
  services for the four hot read paths that need this list.
- Cache invalidation on every relevant mutation: CreateResidence,
  DeleteResidence, JoinWithCode, RemoveUser. DeleteResidence
  invalidates every member of the residence, not just the owner.

What this stacks up to (Hetzner→Neon, before US migration):
  Path                                 Before        After (target)
  Cache-warm authed read               ~800ms        ~100-200ms
  Cache-cold authed read (1st in 1hr)  ~2500ms       ~500-700ms
  First request after deploy           ~2500ms       ~700-900ms

The endgame US-region migration on top of this gets us to ~30-50ms
warm-cache, but we're shippable at ~150ms warm right now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 17:13:50 -05:00
Trey t 9410da7497 docs/ch15: mark distributed tracing fully integrated
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Every authed API endpoint now produces a nested flame graph
(HTTP → auth → service → SQL). Replaces the in-flight section with the
final span-source matrix and a sample 5-span /api/tasks/ trace. Notes
the visible Hetzner→Neon transatlantic RTT as the perf bottleneck the
flame graph surfaced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 16:44:31 -05:00
Trey t d9b5f85c3d Thread ctx through auth middleware DB lookups
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
The auth middleware's m.db.Preload + m.db.First calls were running without
ctx, so on cache miss the resulting SQL queries appeared as orphan
gorm.Query / gorm.Row spans in Jaeger. Now they nest under the parent
HTTP request span like every other repo call.

This was the last orphaned-SQL source on the request hot path. Combined
with the seven service migrations, every authenticated API call now
produces a fully-nested flame graph: HTTP → auth-token-lookup (cache hit)
or HTTP → auth-token-SQL (cache miss) → service → service-SQL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 16:36:47 -05:00
Trey t e881d37de0 Migrate Auth/Contractor/Document/Notification/Subscription services to ctx
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Every public method on these five services now takes ctx context.Context as
the first arg and routes its repo calls through .WithContext(ctx). With
TaskService and ResidenceService already migrated, this means every
in-process service that touches Postgres now produces a flame graph in
Jaeger where the SQL spans nest under the parent HTTP request span.

Endpoints now fully traced (HTTP → service → SQL):
- /api/auth/login, /register, /logout, /me, /verify-email, /resend-verification
- /api/auth/forgot-password, /verify-reset, /reset-password, /update-profile
- /api/contractors/* (CRUD + favorite + by-residence + tasks)
- /api/documents/* (CRUD + activate/deactivate + image upload/delete)
- /api/notifications/* (list, count, mark-read, prefs, devices)
- /api/subscription/* (status, purchase, cancel, triggers, promotions)
- All previously-migrated /api/tasks/* and /api/residences/* paths

Internal helpers also threaded:
- TaskService.sendTaskCompletedNotification → forwards ctx
- TaskService.UpdateUserTimezone → forwards ctx to NotificationService
- ResidenceService.CreateResidence → forwards ctx to SubscriptionService.CheckLimit
- NotificationService.registerAPNSDevice / registerGCMDevice → both take ctx

~75 method signatures, ~120 handler/test call sites updated. Tests pass
green; the only failure is the pre-existing flaky TaskHandler_QuickComplete
SQLite race that fails ~60% of runs on master.

Step 3 of the observability plan is now genuinely complete: every API
endpoint backed by a Go service emits a per-request flame graph with
HTTP → service → SQL spans, plus B2/APNs/FCM/asynq spans where applicable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 16:26:21 -05:00
Trey t 65a9aae4e5 Migrate TaskService + ResidenceService to ctx-aware repos
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Every public method on TaskService and ResidenceService now takes
ctx context.Context as the first arg and routes its repo calls through
.WithContext(ctx). With otelgorm registered, this means every API
endpoint backed by these two services produces a flame graph in Jaeger
where the SQL spans nest under the parent HTTP request span — instead
of appearing as orphaned queries.

Endpoints now fully traced (HTTP → service → SQL):
- GET    /api/tasks/                       (already shipped)
- GET    /api/tasks/by-residence/:id/      (already shipped)
- GET    /api/tasks/:id/
- POST   /api/tasks/
- POST   /api/tasks/bulk/
- PUT    /api/tasks/:id/
- DELETE /api/tasks/:id/
- POST   /api/tasks/:id/in-progress/
- POST   /api/tasks/:id/cancel/
- POST   /api/tasks/:id/uncancel/
- POST   /api/tasks/:id/archive/
- POST   /api/tasks/:id/unarchive/
- POST   /api/tasks/:id/complete/
- POST   /api/tasks/:id/quick-complete/
- GET    /api/tasks/completions/* (CRUD)
- GET    /api/static_data/ (categories, priorities, frequencies)
- GET    /api/residences/
- GET    /api/residences/my/
- GET    /api/residences/summary/
- GET    /api/residences/:id/
- POST   /api/residences/
- PUT    /api/residences/:id/
- DELETE /api/residences/:id/
- Share-code + member management endpoints
- GET    /api/residences/:id/report/

Mechanical work: ~50 method signatures, ~80 handler call sites,
~25 test call sites updated. Internal sendTaskCompletedNotification
helper also takes ctx so background notification SQL nests correctly.

The remaining services (ContractorService, DocumentService,
AuthService, NotificationService, SubscriptionService) follow the same
pattern; they continue to emit untraced SQL until migrated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 16:04:01 -05:00
Trey t 3f5bf21e09 tracing: bump semconv to v1.40.0 to match runtime resource schema
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Pods crashed at startup with "build resource: conflicting Schema URL:
https://opentelemetry.io/schemas/1.40.0 and https://opentelemetry.io/schemas/1.27.0"
because resource.Default() in the SDK targets v1.40.0. Aligning here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 15:35:46 -05:00
Trey t bc3da007db Wire OpenTelemetry tracing — HTTP, B2, APNs, FCM, asynq, GORM (partial)
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Step 1 — OTel SDK: cmd/api and cmd/worker initialize a tracer provider
that exports OTLP/HTTP to obs.88oakapps.com (Jaeger all-in-one). Sampling
is AlwaysSample in dev (DEBUG=true) and TraceIDRatioBased(0.1) in prod,
overridable via OTEL_TRACES_SAMPLER_ARG. Service names are honeydue-api
and honeydue-worker. otelecho.Middleware opens a span per HTTP request.

Step 2 — Manual spans: storage_service.Upload now takes ctx and emits
storage.upload + b2.PutObject spans (size_bytes, key, mime_type, bucket,
result attrs). APNs Send/SendWithCategory and FCM sendOne emit per-token
spans with topic, status_code, reason. Asynq middleware emits
asynq.handle:<task_type> per job with retry/payload attrs and records
asynq_job_duration_seconds.

Step 3 — Database: otelgorm plugin registered in database.Connect, so
any SQL emitted via db.WithContext(ctx) attaches to the request span.
Every repository now exposes WithContext(ctx) *XRepository as the
migration helper. TaskService.ListTasks and GetTasksByResidence are
migrated end-to-end (ctx threaded through handler → service → repo);
remaining services adopt the same pattern incrementally — pre-migration
methods still emit untraced SQL via the unchanged db field.

OBS_TRACES_URL and OBS_INGEST_TOKEN flow from deploy/prod.env →
honeydue-secrets → api+worker Deployments via secretKeyRef (optional).
02-setup-secrets.sh sources them from prod.env on next run; manifests
mark both env vars optional so the deployment rolls without traces if
the secret is absent.

ch15 observability doc now lists what produces spans today vs the
remaining migration work, with the explicit per-method pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 15:28:05 -05:00
Trey t 77cfcc0b27 docs: rewrite ch15 observability + cross-refs for the live obs stack
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
ch15 is now an account of what's actually running, not a roadmap for
what we'd add: VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate
fronted by Cloudflare and bearer-gated nginx, vmagent in-cluster, the
internal/prom histogram set, the rollout's NetworkPolicy footprint,
the obs.88oakapps.com endpoint shape, the ~$0/700MB resource budget,
and a token-rotation runbook. The "what we still don't have" section
keeps log aggregation, alerting, and full distributed tracing as the
honest gap list.

Other touched docs:
- 00-overview: \"deliberately absent\" no longer claims we have no
  metrics — calls out the cross-cluster shape instead.
- 14-deployment-process: TL;DR now points at deploy-k3s/scripts/03-deploy.sh
  (full build + push + apply + obs vmagent), with the manual
  kubectl-set-image flow kept as the single-service path. Notes the
  IfNotPresent gotcha that bit us during the rollout.
- 16-failure-modes: adds vmagent-can't-reach-obs and Grafana-no-data.
- 18-cost: $0 line item for the obs stack on 88oakappsUpdate, with the
  CX32 migration trigger.
- 17/18 README + appendix b: link the new ch15, add the obs cheat
  sheet block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 15:05:06 -05:00
Trey t d3708e6c72 Fix /metrics double-gzip + deploy script for amd64 build
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
The Echo gzip middleware was wrapping promhttp's pre-gzipped output, so
vmagent received double-compressed bytes that failed the Prometheus
parser with binary garbage. Skipping /metrics in the gzip Skipper.

Three deploy-script fixes uncovered while shipping this:
- _config.sh had backticks around \"kubectl get cm\" inside the python
  heredoc, which bash treated as command substitution when KUBECONFIG
  was set. Quoted the literal instead.
- 03-deploy.sh now passes --platform linux/amd64 to all docker builds
  so arm64 Macs don't push images that fail with \"exec format error\"
  on the Hetzner CX nodes.
- OBS_INGEST_TOKEN lookup was reading deploy-k3s/prod.env instead of
  the actual deploy/prod.env at the repo root.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 14:42:15 -05:00
Trey t 372d4d2d37 deploy-k3s: apply observability manifests during 03-deploy
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
vmagent.yaml lives under manifests/observability/; the deploy script now
substitutes the OBS_INGEST_TOKEN from deploy/prod.env into the manifest
before apply, and waits on the vmagent rollout. Manual kubectl apply is
no longer needed after the next deploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 14:16:59 -05:00
Trey t df78d9ccd8 Add Prometheus metrics + vmagent push to obs.88oakapps.com
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Adds internal/prom package with histograms for HTTP, GORM, B2, APNs, and
FCM, wired into the Echo router (HTTPMiddleware + /metrics) and GORM via
statement-level callbacks (no ctx plumbing needed). Storage and push
clients call ObserveB2Upload / ObserveAPNsSend / ObserveFCMSend at the
network round-trip points.

Existing internal/monitoring metrics move to /metrics/legacy so the
canonical /metrics emits proper histogram buckets for p50/p95/p99 rollups.

deploy-k3s/manifests/observability/vmagent.yaml deploys a single-replica
vmagent in the honeydue namespace that scrapes api Pods on :8000/metrics
every 15s and remote-writes to https://obs.88oakapps.com/api/v1/write
with a bearer token (substituted at deploy time from OBS_INGEST_TOKEN
in deploy/prod.env). NetworkPolicies allow vmagent egress to api Pods
and to the public obs endpoint over :443; the obs side runs
VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate.

docs/observability-plan.md captures the full plan including resource
budget, instrumentation table, 4-step rollout, and migration triggers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 14:16:17 -05:00
Trey t 1cd6cafa9d deploy-k3s: wire B2_KEY_ID/B2_APP_KEY into api Deployment
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
The B2 credentials existed in honeydue-secrets (created by
02-setup-secrets.sh) but were never referenced from the api
Deployment, so StorageConfig.IsS3() returned false at runtime →
StorageService fell back to local filesystem. With
readOnlyRootFilesystem=true on the api container, that local
fallback would silently fail on every upload — meaning every
photo, document, and task-completion upload was broken in prod
since the k3s migration on 2026-04-24.

Adding both as secretKeyRef on the api container only (the worker
doesn't perform uploads). Verified end-to-end with a registered
test user: source PDF (sha256=3af3a645...) → POST /api/uploads/document/
→ POST /api/documents/ → GET /api/media/document/:id → byte-identical
download. Storage init log now reports "Storage service initialized (S3)".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 00:53:25 -05:00
Trey t 57cef36379 deploy-k3s: align _config.sh::generate_env with live ConfigMap
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
generate_env was missing 5 keys that exist in the live honeydue-config
ConfigMap (drift introduced over time by manual kubectl patches):
STATIC_DIR, STORAGE_UPLOAD_DIR, STORAGE_BASE_URL, B2_REGION, B2_USE_SSL.
Without these, running 03-deploy.sh would silently drop them and
break static asset serving + B2 region/TLS.

Also:
- Move B2_KEY_ID/B2_APP_KEY out of generate_env: they're credentials
  and belong in honeydue-secrets, not cleartext in the ConfigMap. The
  api/worker deployments still need to be wired to read them via
  envFrom: secretRef before B2 uploads will work — pre-existing gap,
  not caused by this commit.
- Use the in-namespace short DNS form for REDIS_URL ('redis:6379') to
  match what the live cluster has — pods' resolv.conf search path
  already covers honeydue.svc.cluster.local.
- config.yaml.example: add b2_region, b2_use_ssl, upload_dir, base_url,
  static_dir under storage so a fresh bootstrap sets them correctly.

Verified by sourcing _config.sh and diffing generate_env output against
`kubectl get cm honeydue-config -o jsonpath='{.data}'`: clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 00:38:37 -05:00
Trey t 9ea058347f Fix Apple Sign In: update bundle IDs from old com.tt.honeyDue.* to com.myhoneydue.*
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
The iOS app was renamed (MyCrib → Casera → honeyDue) and the bundle ID
was updated to com.myhoneydue.honeyDue (release) / .dev (debug), but
APPLE_CLIENT_ID and APNS_TOPIC across env templates and k3s configs
still pointed at the old com.tt.honeyDue.honeyDueDev value. This made
verifyAudience reject every Apple identity token (aud claim mismatch).

Updated:
- deploy/prod.env.example: bundle ID + comment that empty client_id
  rejects all tokens with DEBUG=false
- .env.example: add Sign in with Apple block (was missing entirely)
- deploy-k3s{,-dev}/config.yaml.example: apple_auth.client_id default
- deploy-k3s-dev/scripts/00-init.sh: same
- docker-compose.dev.yml: APNS_TOPIC fallback
- docs/deployment/10-secrets-config.md: doc reference

The live deploy/prod.env and local .env are .gitignored — they were
edited in place and need to ship via deploy_prod.sh to take effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 23:58:44 -05:00
Trey t 7e77e3bbab docs/deployment: record security hardening pass + webapp + APNs
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Mark roadmap items done (network policies, Traefik middleware, CF Full
strict, CF IP UFW restriction, webapp deploy, APNs wired up, admin
URL-baking fix, admin probe bug). Update Chapter 4 (firewall rule
inventory now shows CF-only :443, no :80), Chapter 6 (request flow
walks through TLS on :443 and middleware hops), Chapter 13 (CF SSL
mode is Full strict, not Flexible; documents the origin cert
install), Chapter 7 (adds the web service section — proxy pattern,
3 replicas, PostHog build-args), and Appendix C (web manifests, CF
origin cert paths on disk, APNs .p8 path, updated network-policies
applied status).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 15:50:59 -05:00
Trey t ace03d2340 Security hardening: TLS at origin, security headers, network policies, admin probe fix
Four related hardening changes made on the live cluster during this
session. Each manifest captures the final working state so a fresh
`kubectl apply` of the repo reproduces it.

1. Cloudflare Full (strict) TLS — ingresses now carry `tls:` blocks
   pointing at `cloudflare-origin-cert` secret (installed imperatively
   from the CF Origin CA PEM). CF SSL mode flipped from Flexible to
   Full (strict). CF↔origin is now HTTPS; origin serves a CF-issued
   cert that only CF can validate.

2. Traefik middleware attached to all three ingresses — `rate-limit`
   (100/min avg, 200 burst) and `security-headers` (frame-deny,
   nosniff, HSTS, referrer policy, permissions policy). `admin-auth`
   middleware was also defined in middleware.yaml but is not attached
   (needs an unset basic-auth secret) and was deleted at runtime.

3. `security-headers` middleware: stripped the
   Content-Security-Policy entry. The Go API sets its own CSP in
   internal/router/router.go that permits Google Fonts for the
   landing page. Two CSP headers combine via intersection (most
   restrictive wins), which would break the landing page. Next.js
   apps set their own CSP via middleware. Header kept documentation
   comments explain this.

4. NetworkPolicies — default-deny + explicit allows, applied. Added
   missing policies for `web`. Corrected the Traefik ingress rule: the
   scaffold used `namespaceSelector: kube-system`, but our Traefik
   runs as a DaemonSet with `hostNetwork: true`, so traffic arrives
   with the NODE IP as source. Fixed to an `ipBlock` list of the
   three node IPs plus the cluster pod CIDR (10.42.0.0/16).

5. admin livenessProbe path fix: was hitting /admin/ (404) which
   caused a 6-hour crashloop cycle (87 restarts) before the bug was
   caught. Fixed to / — matches the startupProbe and readinessProbe
   paths that were corrected earlier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 15:50:47 -05:00
Trey t 15359401fa Deploy honeyDueAPI-Web to k3s at app.myhoneydue.com
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
The Next.js 16 webapp in sibling repo honeyDueAPI-Web now runs
alongside api/worker/admin on the cluster. Uses a server-side proxy
pattern: browser hits app.myhoneydue.com, Next.js route handlers
forward to the Go API with an httpOnly cookie, so no CORS entry or
Allowed-Hosts change is needed on the API side.

Availability mirrors api (3 replicas, PDB minAvailable:2,
topologySpreadConstraints across nodes).

Changes:
- deploy-k3s/manifests/web/deployment.yaml: 3 replicas, readOnly root
  FS, drops all caps, mounts emptyDir for /app/.next/cache and /tmp,
  reads API_URL from honeydue-config.
- deploy-k3s/manifests/web/service.yaml: ClusterIP :3000.
- deploy-k3s/manifests/rbac.yaml: ServiceAccount web with
  automountServiceAccountToken: false.
- deploy-k3s/manifests/pod-disruption-budgets.yaml: web-pdb
  minAvailable: 2.
- deploy-k3s/manifests/ingress/ingress-simple.yaml: route
  app.myhoneydue.com → web:3000.
- deploy-k3s/scripts/_config.sh: emit API_URL into the ConfigMap.
- deploy-k3s/scripts/03-deploy.sh: build + push + apply the web image
  alongside api/worker/admin. Reads NEXT_PUBLIC_POSTHOG_KEY and
  NEXT_PUBLIC_POSTHOG_HOST from the operator shell env (not committed).
  Also adds the --build-arg NEXT_PUBLIC_API_URL wiring for the admin
  image that was previously only done manually.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 10:11:17 -05:00
Trey t 082b5fd3cd Fix admin URL baking: bake NEXT_PUBLIC_API_URL at Docker build time
Next.js bakes NEXT_PUBLIC_* vars into the client JS bundle at build
time, not runtime. The admin image was being built with
admin/.env.local containing NEXT_PUBLIC_API_URL=http://localhost:8000,
hardcoding localhost into the browser bundle. The runtime configMap
value had no effect on the already-compiled JS, causing prod admin
login to throw CORS errors hitting localhost.

Fix:
- Dockerfile: admin-builder stage accepts ARG NEXT_PUBLIC_API_URL and
  strips any committed .env.local/.env.development.local before
  npm run build.
- .dockerignore: explicitly exclude admin/.env.* (root-level .env.*
  pattern doesn't match nested paths), so a local dev .env.local can
  never sneak into the build context again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 10:10:53 -05:00
Trey t 6d39875ef2 README: reflect auto-seed, expand env var reference, link deployment book
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
- Getting Started (all 3 install paths): simplify to one "start the app,
  first-boot auto-seeds lookups + admin + templates" step + optional
  dev test-data seed
- Seed Data table: mark 001_lookups / 003_admin_user / 003_task_templates
  as auto-seeded via internal/database/migration_seed_initial_data.go;
  only 002_test_data.sql is manual now
- Environment Variables: split into logical groups (core, server, admin
  seed, email, push, B2, worker schedules, feature flags, Apple/Google),
  added ~45 vars that weren't documented. Defer full reference to
  docs/deployment/10-secrets-config.md
- Add ADMIN_EMAIL / ADMIN_PASSWORD to the admin-seed group
- Tech Stack: add Backblaze B2 (minio-go), Fastmail/go-mail, Cloudflare,
  K3s (production orchestrator)
- Project Structure: add deploy-k3s/ and docs/deployment/; mark
  docker-compose.yml as Swarm-era legacy
- Docker subsection: clarify compose files are local dev; point at
  deployment book for prod workflow

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 07:30:55 -05:00
Trey t 6f303dbbaa Migrate prod deploy from Swarm to K3s; add full deployment book
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
  callback (was causing 'unlock of unlocked mutex' fatal after
  Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
  + allowlist fonts.googleapis.com so the marketing landing page CSS
  actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
  --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
  images runnable on x86_64 Hetzner nodes; fix array expansion under
  set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
  top-level aliases (the '\${X_SECRET}' form never actually resolved);
  dozzle ports: long-form host_ip is rejected by Swarm, switched to
  short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
  worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
  (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
  12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
  real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
  and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
  hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
  without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations,
    consolidated citations
- README.md: Production Deployment section replaced with pointer to
  the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 07:20:54 -05:00
Trey T 4ec4bbbfe8 Auto-seed lookups + admin + templates on first API boot
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Add a data_migration that runs seeds/001_lookups.sql,
seeds/003_admin_user.sql, and seeds/003_task_templates.sql exactly
once on startup and invalidates the Redis seeded_data cache afterwards
so /api/static_data/ returns fresh results. Removes the need to
remember `./dev.sh seed-all`; the data_migrations tracking row prevents
re-runs, and each INSERT uses ON CONFLICT DO UPDATE so re-execution is
safe.
2026-04-15 08:37:55 -05:00
Trey T 58e6997eee Fix migration numbering collision and bump Dockerfile to Go 1.25
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
The `000016_task_template_id` and `000017_drop_task_template_regions_join`
migrations introduced on gitea collided with the existing unpadded 016/017
migrations (authtoken_created_at, fk_indexes). Renamed them to 021/022 so
they extend the shipped sequence instead of replacing real migrations.
Also removed the padded 000012-000015 files which were duplicate content
of the shipped 012-015 unpadded migrations.

Dockerfile builder image bumped from golang:1.24-alpine to 1.25-alpine to
match go.mod's `go 1.25` directive.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 16:17:23 -05:00
Trey t 237c6b84ee Onboarding: template backlink, bulk-create endpoint, climate-region scoring
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Clients that send users through a multi-task onboarding step no longer
loop N POST /api/tasks/ calls and no longer create "orphan" tasks with
no reference to the TaskTemplate they came from.

Task model
- New task_template_id column + GORM FK (migration 000016)
- CreateTaskRequest.template_id, TaskResponse.template_id
- task_service.CreateTask persists the backlink

Bulk endpoint
- POST /api/tasks/bulk/ — 1-50 tasks in a single transaction,
  returns every created row + TotalSummary. Single residence access
  check, per-entry residence_id is overridden with batch value
- task_handler.BulkCreateTasks + task_service.BulkCreateTasks using
  db.Transaction; task_repo.CreateTx + FindByIDTx helpers

Climate-region scoring
- templateConditions gains ClimateRegionID; suggestion_service scores
  residence.PostalCode -> ZipToState -> GetClimateRegionIDByState against
  the template's conditions JSON (no penalty on mismatch / unknown ZIP)
- regionMatchBonus 0.35, totalProfileFields 14 -> 15
- Standalone GET /api/tasks/templates/by-region/ removed; legacy
  task_tasktemplate_regions many-to-many dropped (migration 000017).
  Region affinity now lives entirely in the template's conditions JSON

Tests
- +11 cases across task_service_test, task_handler_test, suggestion_
  service_test: template_id persistence, bulk rollback + cap + auth,
  region match / mismatch / no-ZIP / unknown-ZIP / stacks-with-others

Docs
- docs/openapi.yaml: /tasks/bulk/ + BulkCreateTasks schemas, template_id
  on TaskResponse + CreateTaskRequest, /templates/by-region/ removed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 15:23:57 -05:00
Trey t 33eee812b6 Harden prod deploy: versioned secrets, healthchecks, migration lock, dry-run
Swarm stack
- Resource limits on all services, stop_grace_period 60s on api/worker/admin
- Dozzle bound to manager loopback only (ssh -L required for access)
- Worker health server on :6060, admin /api/health endpoint
- Redis 200M LRU cap, B2/S3 env vars wired through to api service

Deploy script
- DRY_RUN=1 prints plan + exits
- Auto-rollback on failed healthcheck, docker logout at end
- Versioned-secret pruning keeps last SECRET_KEEP_VERSIONS (default 3)
- PUSH_LATEST_TAG default flipped to false
- B2 all-or-none validation before deploy

Code
- cmd/api takes pg_advisory_lock on a dedicated connection before
  AutoMigrate, serialising boot-time migrations across replicas
- cmd/worker exposes an HTTP /health endpoint with graceful shutdown

Docs
- deploy/DEPLOYING.md: step-by-step walkthrough for a real deploy
- deploy/shit_deploy_cant_do.md: manual prerequisites + recurring ops
- deploy/README.md updated with storage toggle, worker-replica caveat,
  multi-arch recipe, connection-pool tuning, renumbered sections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 15:22:43 -05:00
Trey t ca818e8478 Merge branch 'master' of github.com:akatreyt/MyCribAPI_GO
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
2026-04-01 20:45:43 -05:00
Trey T bec880886b Coverage priorities 1-5: test pure functions, extract interfaces, mock-based handler tests
- Priority 1: Test NewSendEmailTask + NewSendPushTask (5 tests)
- Priority 2: Test customHTTPErrorHandler — all 15+ branches (21 tests)
- Priority 3: Extract Enqueuer interface + payload builders in worker pkg (5 tests)
- Priority 4: Extract ClassifyFile/ComputeRelPath in migrate-encrypt (6 tests)
- Priority 5: Define Handler interfaces, refactor to accept them, mock-based tests (14 tests)
- Fix .gitignore: /worker instead of worker to stop ignoring internal/worker/

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-01 20:30:09 -05:00
Trey t 2e10822e5a Add S3-compatible storage backend (B2, MinIO, AWS S3)
Introduces a StorageBackend interface with local filesystem and S3
implementations. The StorageService delegates raw I/O to the backend
while keeping validation, encryption, and URL generation unchanged.

Backend selection is config-driven: set B2_ENDPOINT + B2_KEY_ID +
B2_APP_KEY + B2_BUCKET_NAME for S3 mode, or STORAGE_UPLOAD_DIR for
local mode. STORAGE_USE_SSL=false for in-cluster MinIO (HTTP).

All existing tests pass unchanged — the local backend preserves
identical behavior to the previous direct-filesystem implementation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 21:31:24 -05:00
Trey t 34553f3bec Add K3s dev deployment setup for single-node VPS
Mirrors the prod deploy-k3s/ setup but runs all services in-cluster
on a single node: PostgreSQL (replaces Neon), MinIO S3-compatible
storage (replaces B2), Redis, API, worker, and admin.

Includes fully automated setup scripts (00-init through 04-verify),
server hardening (SSH, fail2ban, ufw), Let's Encrypt TLS via Traefik,
network policies, RBAC, and security contexts matching prod.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 21:30:39 -05:00