docs(deployment): rewrite migration prose for goose adoption

Update the deployment book and glossary to reflect the goose-based schema
migration flow shipped in 12b2f9d/0f7450a:

- ch07: clarify startup probe assumes migrations ran out-of-band
- ch08: drop AutoMigrate-with-advisory-lock prose; describe goose Job
- ch12: pod startup checks goose_db_version, no longer runs migrations
- ch14: document the Job→wait→roll deploy gate and how to debug failures
- ch16: add "Migrate Job fails during deploy" + "Schema precondition failed" failure modes
- ch17: new runbook entries §26 (run migrations manually), §27 (recover from failed/dirty migration), §28 (bootstrap goose on fresh clone)
- ch19: postscript on §13 noting MigrateWithLock approach is superseded
- ch20: mark "Migration Job for schema changes" task done
- glossary: add `goose` and `goose_db_version`; flag AutoMigrate as tests-only
- references: add goose links; flag AutoMigrate as tests-only
2026-04-26 23:01:32 -05:00
parent 0f7450ada9
commit 8d9ca2e6ed
10 changed files with 260 additions and 39 deletions
@@ -175,13 +175,15 @@ doesn't run as root.
 file writes to the image layer. Go binary doesn't need to write to `/`;
 only `/tmp` is mutable.
-**`startupProbe.failureThreshold: 48`** (= 48 × 5s = 240s grace) — this
+**`startupProbe.failureThreshold: 48`** (= 48 × 5s = 240s grace) —
-was bumped up from the scaffold default of 12. Reason: on first boot,
+historically bumped from the scaffold default of 12 to absorb in-replica
-the Go app runs `MigrateWithLock()` which acquires a Postgres advisory
+migration time. Now that migrations run out-of-band as a Kubernetes
-lock and runs AutoMigrate. First replica takes ~90s; subsequent
+Job ([Chapter 8 §Schema management](./08-database.md)), pods boot in
-replicas wait on the lock. With 3 replicas all starting simultaneously
+seconds and only need a few probe failures of grace, but the budget
-and the lock serializing them, 240s is the right grace. See
+stays at 240s because cold pods on a fresh Hetzner node still pay
-[Chapter 19](./19-postmortem-swarm.md) for the detailed story.
+~10s for image pull + startup. See
 [Chapter 19 §13](./19-postmortem-swarm.md) for the historical
 context (the in-replica advisory-lock approach this replaced).
 **`readinessProbe.initialDelaySeconds: 5`** — after the startupProbe
 passes, wait 5s before starting readiness checks. Prevents a racy
@@ -4,8 +4,10 @@
 Authoritative user data lives in a Neon-managed Postgres database in AWS
 us-east-1. Connections use TLS (`DB_SSLMODE=require`). Schema is managed
-via GORM AutoMigrate inside the api binary, coordinated across replicas
+via [pressly/goose](https://github.com/pressly/goose) running as a
-by a Postgres advisory lock to prevent concurrent migration attempts.
+one-shot Kubernetes Job before every api/worker rollout. See §Schema
 management below for the full shape; ch19 §13 documents the previous
 in-replica AutoMigrate approach this replaced.
 ## Why Neon
@@ -78,13 +80,13 @@ Modes PgBouncer supports:
 - **statement** — per-statement (most aggressive; breaks many features)
 Neon's pooler runs in **transaction mode**. This is compatible with GORM
-out of the box (we don't use session-level features like LISTEN/NOTIFY
+runtime queries (we don't use session-level features like LISTEN/NOTIFY
-or session-scope advisory locks). Note: `database.MigrateWithLock()`
+or session-scope advisory locks in the data path). The one place this
-needs the *direct* (non-pooler) endpoint because session-level
+matters is migrations: goose's session-scoped advisory lock can't
-advisory locks don't survive PgBouncer's per-transaction cycling — but
+survive PgBouncer transaction-mode pooling. The migrate Job
-the migration helper opens its own ad-hoc connection bypassing the
+(`deploy-k3s/manifests/migrate/job.yaml`) handles this by stripping
-configured pool, so this happens automatically. See `MigrateWithLock`
+the `-pooler` segment from `DB_HOST` before invoking goose — runtime
-in `internal/database/database.go`.
+keeps using the pooler, only migrations bypass it.
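A minimal sketch of the host rewrite described above, assuming the pooled hostname differs from the direct one only by a `-pooler` suffix on the first DNS label. The helper name `strip_pooler` and the hostname are made up for illustration; this is not the contents of `job.yaml`.

```bash
# Hypothetical helper: derive the direct endpoint from a pooled host by
# dropping the "-pooler" suffix from the first DNS label.
strip_pooler() {
  printf '%s\n' "$1" | sed 's/-pooler\././'
}

# Made-up hostname, for illustration only.
strip_pooler "ep-example-123456-pooler.us-east-1.aws.neon.tech"
# prints: ep-example-123456.us-east-1.aws.neon.tech
```

A host that is already direct passes through unchanged, so the rewrite is safe to apply unconditionally.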
 ### Connection pool settings
@@ -404,11 +406,13 @@ GROUP BY usename, state, application_name;
 - [Neon docs][neon-docs]
 - [Neon pricing][neon-pricing]
 - [Postgres advisory locks][pg-locks]
-- [GORM AutoMigrate][gorm-automigrate]
+- [pressly/goose][goose] — production migration tool
 - [GORM AutoMigrate][gorm-automigrate] (tests only)
 - [honeyDue task architecture][task-arch] (repo-local)
 [neon-docs]: https://neon.com/docs/introduction
 [neon-pricing]: https://neon.com/pricing
 [pg-locks]: https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS
 [goose]: https://github.com/pressly/goose
 [gorm-automigrate]: https://gorm.io/docs/migration.html
 [task-arch]: ../../docs/TASK_LOGIC_ARCHITECTURE.md
@@ -272,7 +272,7 @@ sequenceDiagram
    participant NewPod as api pod v2 (starting)
    Note over NewPod: kubelet starts new pod
-    Note over NewPod: pod connects to Postgres<br/>MigrateWithLock runs (no-op)<br/>HTTP server starts<br/>readinessProbe passes
+    Note over NewPod: pod connects to Postgres<br/>RequireSchemaApplied checks goose_db_version<br/>HTTP server starts<br/>readinessProbe passes
    Note over NewPod: kube-proxy updates endpoints<br/>NewPod added to Service pool
    CF->>Traefik: request 1
    Traefik->>OldPod: routed (old pod still in pool)
@@ -317,10 +317,47 @@ Timeline (approximate, warm state):
 - t=60s: another old pod terminates
 - ...continues until all on new RS
-For cold-boot (e.g., first deploy on a rebuilt cluster), the
+Migrations run as a separate Kubernetes Job that completes before any
-MigrateWithLock advisory lock extends this to several minutes. But the
+api/worker pod is rolled. So the rollout above never includes migration
-rollout is serialized — only one pod starts per iteration, so the lock
+work — pods that boot are guaranteed to find the schema already at the
-queue is small.
+expected version. See §"Migrations are gated, not interleaved" below.
 ## Migrations are gated, not interleaved
 `03-deploy.sh` runs `goose up` as a one-shot Job before applying any
 api/worker manifests:
 ```
 1. kubectl delete job honeydue-migrate (idempotent, removes prior run)
 2. kubectl apply -f manifests/migrate/job.yaml (with current api image)
 3. kubectl wait --for=condition=complete --timeout=10m job/honeydue-migrate
 4. (only if Job succeeded) kubectl apply -f manifests/api/...
 ```
 The Job uses the api image — `/usr/local/bin/goose` is baked in at
 Dockerfile build time. The Job script strips the `-pooler` segment
 from `DB_HOST` before connecting (goose's session-scoped advisory
 lock can't survive PgBouncer transaction-mode), runs `goose up`, exits.
 If the Job fails, the script aborts before any new app pod sees a
 stale schema. To debug:
 ```bash
 kubectl -n honeydue logs job/honeydue-migrate --tail=200
 kubectl -n honeydue describe job honeydue-migrate
 ```
 After investigating, fix the migration file and re-run `03-deploy.sh`.
 The Job is idempotent — successful migrations stay applied, only the
 new/failed file gets retried.
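The ordering in this section can be sketched as a small shell function. This is an illustration of the gate only, not the contents of `03-deploy.sh`: the function name `deploy_gate` and its two arguments (an already-rendered Job manifest and the app manifest path) are hypothetical.

```bash
# Sketch of the gate ordering only; not the real deploy script.
# $1 = rendered migrate Job manifest, $2 = app manifest path (hypothetical args).
deploy_gate() {
  job_manifest="$1"
  app_manifests="$2"

  # Remove any prior run so the Job can be re-created.
  kubectl -n honeydue delete job honeydue-migrate --ignore-not-found || return 1
  kubectl -n honeydue apply -f "$job_manifest" || return 1

  # Block until the Job completes; a failure or timeout stops the deploy here.
  if ! kubectl -n honeydue wait --for=condition=complete --timeout=10m job/honeydue-migrate; then
    echo "[deploy][error] migrations did not complete cleanly; aborting deploy" >&2
    return 1
  fi

  # Only reached when the Job succeeded.
  kubectl -n honeydue apply -f "$app_manifests"
}
```

The point of the shape is that the app manifests are applied on exactly one code path, the one where `kubectl wait` returned success.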
 api/worker pods run a `RequireSchemaApplied` check at startup that
 queries `goose_db_version` and refuses to boot if the table is missing
 or the latest row is `is_applied=false`. This is the fail-fast for
 "someone bypassed the deploy script and the schema isn't current."
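To see by hand roughly what that startup check would see, an operator-side approximation might look like the sketch below. The real `RequireSchemaApplied` lives in the Go code and may differ; the helper name `schema_state` is hypothetical, and the commented-out probe assumes `psql` is installed and `DATABASE_URL` points at the direct endpoint.

```bash
# Hypothetical approximation of the startup check, for manual inspection.
# $1 = output of the probe query below: "t", "f", or "" (no table / no rows).
schema_state() {
  case "$1" in
    t) echo "ok" ;;
    f) echo "latest migration not applied"; return 1 ;;
    *) echo "goose_db_version missing or empty"; return 1 ;;
  esac
}

# Probe (left commented out so the sketch runs without a database).
# -A/-t print the bare value; a missing table makes psql fail, leaving "".
# latest=$(psql "$DATABASE_URL" -At -c \
#   "SELECT is_applied FROM goose_db_version ORDER BY id DESC LIMIT 1" 2>/dev/null)
# schema_state "$latest"
```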
 For full schema management background, see
 [Chapter 8 §Schema management](./08-database.md).
 ## Hotfix workflow
@@ -327,6 +327,55 @@ KUBECONFIG=~/.kube/honeydue.yaml bash deploy-k3s/scripts/03-deploy.sh --skip-bui
 Cold-handshake latency goes back up (~440ms first hit) but the API
 keeps serving. Switch back when the pooler recovers.
 #### Migrate Job fails during deploy
 **Symptom**: `03-deploy.sh` aborts at the migrations step:
 ```
 [deploy][error] migrations did not complete cleanly; aborting deploy
 ```
 api/worker pods are NOT updated — they keep running the previous
 revision. This is the intentional fail-fast.
 **Recovery**:
 ```bash
 # 1. See the failure
 kubectl -n honeydue logs job/honeydue-migrate --tail=200
 # 2. Common cause: a SQL error in the migration file. Fix the file
 #    locally, commit, retry the deploy. The Job is idempotent —
 #    successful prior versions stay applied; only the failed file
 #    re-runs.
 git add migrations/000NNN_*.sql
 git commit -m "Fix migration NNN"
 git push gitea master
 bash deploy-k3s/scripts/03-deploy.sh
 # 3. Other cause: Neon down or auth changed. Test direct connection:
 DB_PASS=$(kubectl -n honeydue get secret honeydue-secrets \
  -o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d)
 docker run --rm -e PGPASSWORD="$DB_PASS" postgres:17-alpine \
  psql "host=ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
        user=neondb_owner dbname=honeyDue sslmode=require" -c "SELECT 1;"
 ```
 **Why no automatic retry**: `backoffLimit: 0` on the Job is deliberate.
 A failing migration almost never gets unstuck by retrying; it needs an
 operator to look. See [Chapter 17 §27](./17-runbook.md) for the recovery
 playbook.
 #### api refuses to start: "Schema precondition failed"
 **Symptom**: api pods log `Schema precondition failed` and exit
 immediately after DB connect.
 **Cause**: `goose_db_version` table is missing or its latest row has
 `is_applied=false`. This means the migrate Job either never ran or
 ran and rolled back.
 **Recovery**: run the migrate Job manually (see
 [Chapter 17 §26](./17-runbook.md)). After it completes successfully,
 restart the api Deployment so its pods re-run the schema check:
 ```bash
 kubectl -n honeydue rollout restart deploy/api
 ```
 #### Backblaze B2 outage
 **Symptom**: image uploads fail; image downloads fail unless cached by
@@ -428,10 +428,94 @@ KUBECONFIG=~/.kube/honeydue.yaml bash deploy-k3s/scripts/03-deploy.sh --skip-bui
 ```
 The pooler runs in transaction mode so any session-scope feature
-(LISTEN/NOTIFY, session advisory locks for migrations) auto-falls
+(LISTEN/NOTIFY, session advisory locks) won't work over it. Migrations
-through to direct via `MigrateWithLock` opening its own connection.
+already handle this — the migrate Job script strips `-pooler` from
-But if you ever add session-level features in the data path, they'll
+`DB_HOST` before invoking goose. If you add new session-level features
-need the direct endpoint.
+in the data path, they'll need the same workaround.
 ## 26. Run migrations manually (rare)
 Day-to-day, migrations run as part of every `03-deploy.sh`. But
 sometimes you want to apply or inspect them outside a deploy:
 ```bash
 # Direct-endpoint DSN (goose's advisory lock won't survive the pooler)
 DB_PASS=$(kubectl -n honeydue get secret honeydue-secrets \
  -o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d)
 export DATABASE_URL="host=ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
                     port=5432 user=neondb_owner password=$DB_PASS \
                     dbname=honeyDue sslmode=require"
 # What's pending? (read-only; safe to run anytime)
 make migrate-status
 # Apply pending migrations (or `goose -dir migrations postgres "$DATABASE_URL" up`)
 make migrate-up
 # Roll back the most recent migration
 make migrate-down
 # Scaffold a new migration file
 make migrate-new name=add_widget_count_to_residences
 # → migrations/000002_add_widget_count_to_residences.sql
 # Edit, then `make migrate-up` to test, then commit.
 ```
 To run goose from inside the cluster (e.g., to bypass a network policy
 that blocks Neon from your laptop), use the migrate Job manifest as a
 one-shot:
 ```bash
 # Re-run the migrate Job with the image the api Deployment currently runs
 kubectl -n honeydue delete job honeydue-migrate --ignore-not-found
 sed "s|image: IMAGE_PLACEHOLDER|image: $(kubectl -n honeydue get deploy api -o jsonpath='{.spec.template.spec.containers[0].image}')|" \
  deploy-k3s/manifests/migrate/job.yaml | kubectl apply -f -
 kubectl -n honeydue wait --for=condition=complete --timeout=5m job/honeydue-migrate
 kubectl -n honeydue logs job/honeydue-migrate
 ```
 ## 27. Recover from a failed/dirty migration
 If `goose up` fails partway through, the migration file's transaction
 rolls back and `goose_db_version` reflects the last *complete*
 version. Goose marks no row as "dirty" — that's a golang-migrate
 concept. So recovery is just: fix the migration file, re-run.
 If you've genuinely corrupted state (dropped tables you shouldn't have,
 applied a destructive migration in error):
 ```bash
 # See current goose state
 make migrate-status
 psql "$DATABASE_URL" -c \
  "SELECT version_id, is_applied, tstamp FROM goose_db_version ORDER BY id DESC LIMIT 10;"
 # To force the version table back to a known-good number after
 # manually fixing the schema:
 psql "$DATABASE_URL" -c \
  "INSERT INTO goose_db_version (version_id, is_applied, tstamp) VALUES (<N>, true, NOW());"
 ```
 ## 28. Bootstrap goose on a fresh clone of the schema
 If you create a new Neon branch / dev DB and need to bring it under
 goose management:
 ```bash
 export DATABASE_URL="...<the new DB>..."
 # Option A: fresh DB, no schema → just run up
 make migrate-up
 # Option B: schema already populated (e.g., restored from a dump) →
 #          mark v1 as already-applied
 goose -dir migrations postgres "$DATABASE_URL" version  # creates table
 psql "$DATABASE_URL" -c \
  "INSERT INTO goose_db_version (version_id, is_applied, tstamp) VALUES (1, true, NOW());"
 ```
 This is also what was done for the live prod DB at goose-adoption time
 (commit `12b2f9d`).
 ## References
@@ -397,6 +397,35 @@ should reflect reality, not be optimistic.
 **Moral**: Healthchecks should be realistic, not aspirational. Know
 what your app actually does at startup.
 #### Postscript (2026-04-26): the whole `MigrateWithLock` shape was wrong
 A few months after the Swarm migration, switching `DB_HOST` to Neon's
 `-pooler` endpoint for runtime perf wins broke this code completely:
 `pg_advisory_lock` is session-scoped, but PgBouncer transaction-mode
 multiplexes statements across backend Postgres sessions, so the lock
 appeared to be held but actually wasn't. Pods hung at
 "Acquiring migration advisory lock..." and the startup probe killed
 them in turn.
 After a brief band-aid (route migrations through the direct endpoint;
 bump probe to 600s to absorb 5-minute AutoMigrate runs over the slow
 direct connection — both reverted), we abandoned the runtime-side
 migration story entirely and adopted [pressly/goose](https://github.com/pressly/goose)
 in commit `12b2f9d`:
 - Migrations run as a one-shot Kubernetes Job before any api/worker
  pod rolls. No more in-replica migration, no more advisory lock,
  no more startup probe gymnastics.
 - `RequireSchemaApplied` checks `goose_db_version` at startup and
  refuses to boot on a stale schema — fail-fast for "operator
  forgot to run migrate," instead of mysterious runtime errors.
 - `failureThreshold` reverted to its pre-MigrateWithLock value.
  Pods boot in seconds again.
 See [Chapter 8 §Schema management](./08-database.md) for the goose
 shape. This entire sub-section is preserved as historical context
 for why we walked the path we did.
 ## What we learned
 ### Docker Swarm is in a bad place in 2026
@@ -69,20 +69,22 @@ Flexible to Full (strict). Verified by:
 - CF edge continues to serve its own Let's Encrypt cert to browsers
 - both layers now TLS-encrypted
-### Migration Job for schema changes
+### ~~Migration Job for schema changes~~ — done (2026-04-26, commit 12b2f9d)
-**Why**: Currently every api pod runs `MigrateWithLock()` on startup,
+**What shipped**: pressly/goose as the migration tool, run as a one-shot
-serializing on a Postgres advisory lock. Adds 90-240s to cold startup
+Kubernetes Job from `deploy-k3s/manifests/migrate/job.yaml` before
-and caused bug #13 in Chapter 19.
+api/worker rollout. The Job uses the api image (goose CLI is baked in
 during the Dockerfile build), strips `-pooler` from `DB_HOST` for the
 direct-endpoint connection migrations need, and exits in seconds when
 there's nothing to apply. `RequireSchemaApplied` in the api/worker
 startup checks `goose_db_version` and fails fast on a stale schema.
-**How**: Create a Kubernetes `Job` resource that runs the api image
+The Go-code-with-`--migrate-only` shape originally proposed here was
-with a `--migrate-only` flag. Job runs once per deploy, completes when
+rejected in favor of using the upstream goose binary directly — see
-schema is current. api pods get an initContainer that waits for the
+[Chapter 8 §Schema management](./08-database.md) for the trade-offs.
 Job to complete.
-Requires Go code change to support `--migrate-only` flag.
+Pre-goose `MigrateWithLock` is gone; ch19 §13 has the historical
-
+postmortem context.
 **Effort**: 3-4 hours (code + job manifest + testing).
 ### Redis password
@@ -173,11 +173,21 @@ suffix. (Chapter 8)
 ## Go + Asynq
 **AutoMigrate**: GORM function that syncs DB schema to Go structs.
-(Chapter 8)
+We used this in production until 2026-04, replaced by goose. Tests
 still use it via `testutil.SetupTestDB`. (Chapter 8)
 **Asynq**: Go library for background job queues. Redis-backed.
 (Chapter 7)
 **goose**: pressly/goose — the SQL migration tool we use in production
 (commit 12b2f9d onward). Migration files live in `migrations/`, one
 file per version with `-- +goose Up` / `-- +goose Down` markers.
 (Chapter 8)
 **goose_db_version**: goose's version-tracking table. One row per
 applied migration. `RequireSchemaApplied` reads the latest row at
 api/worker startup to fail fast on a stale schema. (Chapter 8)
 **GORM**: Go ORM we use. (Chapter 8)
 **pgx**: Go Postgres driver used by GORM. (Chapter 8)
@@ -65,7 +65,9 @@ Every external link cited anywhere in this book, grouped by topic.
 - [Neon usage-based pricing announcement][neon-blog]
 - [Neon connect from any app][neon-connect]
 - [Postgres advisory locks][pg-locks]
-- [GORM AutoMigrate][gorm-automigrate]
+- [GORM AutoMigrate][gorm-automigrate] (tests only — production migrations use goose)
 - [pressly/goose — SQL migration tool][goose]
 - [Goose documentation][goose-docs]
 ## Backblaze B2
@@ -168,6 +170,8 @@ Every external link cited anywhere in this book, grouped by topic.
 [neon-connect]: https://neon.com/docs/connect/connect-from-any-app
 [pg-locks]: https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS
 [gorm-automigrate]: https://gorm.io/docs/migration.html
 [goose]: https://github.com/pressly/goose
 [goose-docs]: https://pressly.github.io/goose/
 <!-- B2 -->
 [b2-docs]: https://www.backblaze.com/docs/