From 8d9ca2e6ed5ff48e1cc56df00557111c80ba3344 Mon Sep 17 00:00:00 2001
From: Trey t <treytartt@fastmail.com>
Date: Sun, 26 Apr 2026 23:01:32 -0500
Subject: [PATCH] docs(deployment): rewrite migration prose for goose adoption
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Update the deployment book and glossary to reflect the goose-based
schema migration flow shipped in 12b2f9d/0f7450a:

- ch07: clarify startup probe assumes migrations ran out-of-band
- ch08: drop AutoMigrate-with-advisory-lock prose; describe goose Job
- ch12: pod startup checks goose_db_version, no longer runs migrations
- ch14: document the Job→wait→roll deploy gate and how to debug failures
- ch16: add "Migrate Job fails during deploy" + "Schema precondition
  failed" failure modes
- ch17: new runbook entries §26 (run migrations manually), §27 (recover
  from failed/dirty migration), §28 (bootstrap goose on fresh clone)
- ch19: postscript on §13 noting MigrateWithLock approach is superseded
- ch20: mark "Migration Job for schema changes" task done
- glossary: add `goose` and `goose_db_version`; flag AutoMigrate as
  tests-only
- references: add goose links; flag AutoMigrate as tests-only
---
 docs/deployment/07-services.md             | 16 ++--
 docs/deployment/08-database.md             | 24 +++---
 docs/deployment/12-data-flow.md            |  2 +-
 docs/deployment/14-deployment-process.md   | 45 ++++++++++-
 docs/deployment/16-failure-modes.md        | 49 ++++++++++++
 docs/deployment/17-runbook.md              | 92 +++++++++++++++++++++-
 docs/deployment/19-postmortem-swarm.md     | 29 +++++++
 docs/deployment/20-roadmap.md              | 24 +++---
 docs/deployment/appendices/a-glossary.md   | 12 ++-
 docs/deployment/appendices/d-references.md |  6 +-
 10 files changed, 260 insertions(+), 39 deletions(-)

diff --git a/docs/deployment/07-services.md b/docs/deployment/07-services.md
index fd2d044..057e776 100644
--- a/docs/deployment/07-services.md
+++ b/docs/deployment/07-services.md
@@ -175,13 +175,15 @@ doesn't run as root.
 file writes to the image layer. Go binary doesn't need to write to `/`;
 only `/tmp` is mutable.
 
-**`startupProbe.failureThreshold: 48`** (= 48 × 5s = 240s grace) — this
-was bumped up from the scaffold default of 12. Reason: on first boot,
-the Go app runs `MigrateWithLock()` which acquires a Postgres advisory
-lock and runs AutoMigrate. First replica takes ~90s; subsequent
-replicas wait on the lock. With 3 replicas all starting simultaneously
-and the lock serializing them, 240s is the right grace. See
-[Chapter 19](./19-postmortem-swarm.md) for the detailed story.
+**`startupProbe.failureThreshold: 48`** (= 48 × 5s = 240s grace) —
+historically bumped from the scaffold default of 12 to absorb in-replica
+migration time. Now that migrations run out-of-band as a Kubernetes
+Job ([Chapter 8 §Schema management](./08-database.md)), pods boot in
+seconds and only need a few probe failures of grace, but the budget
+stays at 240s because cold pods on a fresh Hetzner node still pay
+~10s for image pull + startup. See
+[Chapter 19 §13](./19-postmortem-swarm.md) for the historical
+context (the in-replica advisory-lock approach this replaced).
 
 **`readinessProbe.initialDelaySeconds: 5`** — after the startupProbe
 passes, wait 5s before starting readiness checks. Prevents a racy
diff --git a/docs/deployment/08-database.md b/docs/deployment/08-database.md
index a26cf1a..2cfd28e 100644
--- a/docs/deployment/08-database.md
+++ b/docs/deployment/08-database.md
@@ -4,8 +4,10 @@
 
 Authoritative user data lives in a Neon-managed Postgres database in AWS
 us-east-1. Connections use TLS (`DB_SSLMODE=require`). Schema is managed
-via GORM AutoMigrate inside the api binary, coordinated across replicas
-by a Postgres advisory lock to prevent concurrent migration attempts.
+via [pressly/goose](https://github.com/pressly/goose) running as a
+one-shot Kubernetes Job before every api/worker rollout. See §Schema
+management below for the full shape; ch19 §13 documents the previous
+in-replica AutoMigrate approach this replaced.
 
 ## Why Neon
 
@@ -78,13 +80,13 @@ Modes PgBouncer supports:
 - **statement** — per-statement (most aggressive; breaks many features)
 
 Neon's pooler runs in **transaction mode**. This is compatible with GORM
-out of the box (we don't use session-level features like LISTEN/NOTIFY
-or session-scope advisory locks). Note: `database.MigrateWithLock()`
-needs the *direct* (non-pooler) endpoint because session-level
-advisory locks don't survive PgBouncer's per-transaction cycling — but
-the migration helper opens its own ad-hoc connection bypassing the
-configured pool, so this happens automatically. See `MigrateWithLock`
-in `internal/database/database.go`.
+runtime queries (we don't use session-level features like LISTEN/NOTIFY
+or session-scope advisory locks in the data path). The one place this
+matters is migrations: goose's session-scoped advisory lock can't
+survive PgBouncer transaction-mode pooling. The migrate Job
+(`deploy-k3s/manifests/migrate/job.yaml`) handles this by stripping
+the `-pooler` segment from `DB_HOST` before invoking goose — runtime
+keeps using the pooler, only migrations bypass it.
 
 ### Connection pool settings
 
@@ -404,11 +406,13 @@ GROUP BY usename, state, application_name;
 - [Neon docs][neon-docs]
 - [Neon pricing][neon-pricing]
 - [Postgres advisory locks][pg-locks]
-- [GORM AutoMigrate][gorm-automigrate]
+- [pressly/goose][goose] — production migration tool
+- [GORM AutoMigrate][gorm-automigrate] (tests only)
 - [honeyDue task architecture][task-arch] (repo-local)
 
 [neon-docs]: https://neon.com/docs/introduction
 [neon-pricing]: https://neon.com/pricing
 [pg-locks]: https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS
+[goose]: https://github.com/pressly/goose
 [gorm-automigrate]: https://gorm.io/docs/migration.html
 [task-arch]: ../../docs/TASK_LOGIC_ARCHITECTURE.md
diff --git a/docs/deployment/12-data-flow.md b/docs/deployment/12-data-flow.md
index 6648dfe..fdc2f19 100644
--- a/docs/deployment/12-data-flow.md
+++ b/docs/deployment/12-data-flow.md
@@ -272,7 +272,7 @@ sequenceDiagram
     participant NewPod as api pod v2 (starting)
 
     Note over NewPod: kubelet starts new pod
-    Note over NewPod: pod connects to Postgres<br/>MigrateWithLock runs (no-op)<br/>HTTP server starts<br/>readinessProbe passes
+    Note over NewPod: pod connects to Postgres<br/>RequireSchemaApplied checks goose_db_version<br/>HTTP server starts<br/>readinessProbe passes
     Note over NewPod: kube-proxy updates endpoints<br/>NewPod added to Service pool
     CF->>Traefik: request 1
     Traefik->>OldPod: routed (old pod still in pool)
diff --git a/docs/deployment/14-deployment-process.md b/docs/deployment/14-deployment-process.md
index ea22763..c32444f 100644
--- a/docs/deployment/14-deployment-process.md
+++ b/docs/deployment/14-deployment-process.md
@@ -317,10 +317,47 @@ Timeline (approximate, warm state):
 - t=60s: another old pod terminates
 - ...continues until all on new RS
 
-For cold-boot (e.g., first deploy on a rebuilt cluster), the
-MigrateWithLock advisory lock extends this to several minutes. But the
-rollout is serialized — only one pod starts per iteration, so the lock
-queue is small.
+Migrations run as a separate Kubernetes Job that completes before any
+api/worker pod is rolled. So the rollout above never includes migration
+work — pods that boot are guaranteed to find the schema already at the
+expected version. See §"Migrations are gated, not interleaved" below.
+
+## Migrations are gated, not interleaved
+
+`03-deploy.sh` runs `goose up` as a one-shot Job before applying any
+api/worker manifests:
+
+```
+1. kubectl delete job honeydue-migrate (idempotent, removes prior run)
+2. kubectl apply -f manifests/migrate/job.yaml (with current api image)
+3. kubectl wait --for=condition=complete --timeout=10m job/honeydue-migrate
+4. (only if Job succeeded) kubectl apply -f manifests/api/...
+```
+
+The Job uses the api image — `/usr/local/bin/goose` is baked in at
+Dockerfile build time. The Job script strips the `-pooler` segment
+from `DB_HOST` before connecting (goose's session-scoped advisory
+lock can't survive PgBouncer transaction-mode), runs `goose up`, and exits.
+
+If the Job fails, the script aborts before any new app pod sees a
+stale schema. To debug:
+
+```bash
+kubectl -n honeydue logs job/honeydue-migrate --tail=200
+kubectl -n honeydue describe job honeydue-migrate
+```
+
+After investigating, fix the migration file and re-run `03-deploy.sh`.
+The Job is idempotent: successful migrations stay applied, and only the
+new or failed file is retried.
+
+api/worker pods run a `RequireSchemaApplied` check at startup that
+queries `goose_db_version` and refuses to boot if the table is missing
+or the latest row is `is_applied=false`. This is the fail-fast for
+"someone bypassed the deploy script and the schema isn't current."
+
+For full schema management background, see
+[Chapter 8 §Schema management](./08-database.md).
 
 ## Hotfix workflow
 
diff --git a/docs/deployment/16-failure-modes.md b/docs/deployment/16-failure-modes.md
index f3b0155..b6c82a4 100644
--- a/docs/deployment/16-failure-modes.md
+++ b/docs/deployment/16-failure-modes.md
@@ -327,6 +327,55 @@ KUBECONFIG=~/.kube/honeydue.yaml bash deploy-k3s/scripts/03-deploy.sh --skip-bui
 Cold-handshake latency goes back up (~440ms first hit) but the API
 keeps serving. Switch back when the pooler recovers.
 
+#### Migrate Job fails during deploy
+
+**Symptom**: `03-deploy.sh` aborts at the migrations step:
+```
+[deploy][error] migrations did not complete cleanly; aborting deploy
+```
+api/worker pods are NOT updated — they keep running the previous
+revision. This is the intentional fail-fast.
+
+**Recovery**:
+```bash
+# 1. See the failure
+kubectl -n honeydue logs job/honeydue-migrate --tail=200
+
+# 2. Common cause: a SQL error in the migration file. Fix the file
+#    locally, commit, retry the deploy. The Job is idempotent —
+#    successful prior versions stay applied; only the failed file
+#    re-runs.
+git add migrations/000NNN_*.sql
+git commit -m "Fix migration NNN"
+git push gitea master
+bash deploy-k3s/scripts/03-deploy.sh
+
+# 3. Other cause: Neon down or auth changed. Test direct connection:
+DB_PASS=$(kubectl -n honeydue get secret honeydue-secrets \
+  -o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d)
+docker run --rm -e PGPASSWORD="$DB_PASS" postgres:17-alpine \
+  psql "host=ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
+        user=neondb_owner dbname=honeyDue sslmode=require" -c "SELECT 1;"
+```
+**Why no automatic retry**: `backoffLimit: 0` on the Job is deliberate.
+A failing migration almost never gets unstuck by retrying; it needs an
+operator to look. See [Chapter 17 §27](./17-runbook.md) for the recovery
+playbook.
+
+#### api refuses to start: "Schema precondition failed"
+
+**Symptom**: api pods log `Schema precondition failed` and exit
+immediately after DB connect.
+**Cause**: `goose_db_version` table is missing or its latest row has
+`is_applied=false`. This means the migrate Job either never ran or
+ran and rolled back.
+**Recovery**: run the migrate Job manually (see
+[Chapter 17 §26](./17-runbook.md)). After it completes successfully,
+restart the api Deployment so new pods re-run the schema check:
+```bash
+kubectl -n honeydue rollout restart deploy/api
+```
+
 #### Backblaze B2 outage
 
 **Symptom**: image uploads fail; image downloads fail unless cached by
diff --git a/docs/deployment/17-runbook.md b/docs/deployment/17-runbook.md
index 9bd9512..048adc0 100644
--- a/docs/deployment/17-runbook.md
+++ b/docs/deployment/17-runbook.md
@@ -428,10 +428,94 @@ KUBECONFIG=~/.kube/honeydue.yaml bash deploy-k3s/scripts/03-deploy.sh --skip-bui
 ```
 
 The pooler runs in transaction mode so any session-scope feature
-(LISTEN/NOTIFY, session advisory locks for migrations) auto-falls
-through to direct via `MigrateWithLock` opening its own connection.
-But if you ever add session-level features in the data path, they'll
-need the direct endpoint.
+(LISTEN/NOTIFY, session advisory locks) won't work over it. Migrations
+already handle this — the migrate Job script strips `-pooler` from
+`DB_HOST` before invoking goose. If you add new session-level features
+in the data path, they'll need the same workaround.
+
+## 26. Run migrations manually (rare)
+
+Day-to-day, migrations run as part of every `03-deploy.sh`. But
+sometimes you want to apply or inspect them outside a deploy:
+
+```bash
+# Direct-endpoint DSN (goose's advisory lock won't survive the pooler)
+DB_PASS=$(kubectl -n honeydue get secret honeydue-secrets \
+  -o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d)
+export DATABASE_URL="host=ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
+                     port=5432 user=neondb_owner password=$DB_PASS \
+                     dbname=honeyDue sslmode=require"
+
+# What's pending? (read-only; safe to run anytime)
+make migrate-status
+
+# Apply pending migrations (or `goose -dir migrations postgres "$DATABASE_URL" up`)
+make migrate-up
+
+# Roll back the most recent migration
+make migrate-down
+
+# Scaffold a new migration file
+make migrate-new name=add_widget_count_to_residences
+# → migrations/000002_add_widget_count_to_residences.sql
+# Edit, then `make migrate-up` to test, then commit.
+```
+
+To run goose from inside the cluster (e.g., to bypass a network policy
+that blocks Neon from your laptop), use the migrate Job manifest as a
+one-shot:
+
+```bash
+# Re-create the migrate Job using the image the api Deployment currently runs
+kubectl -n honeydue delete job honeydue-migrate --ignore-not-found
+sed "s|image: IMAGE_PLACEHOLDER|image: $(kubectl -n honeydue get deploy api -o jsonpath='{.spec.template.spec.containers[0].image}')|" \
+  deploy-k3s/manifests/migrate/job.yaml | kubectl apply -f -
+kubectl -n honeydue wait --for=condition=complete --timeout=5m job/honeydue-migrate
+kubectl -n honeydue logs job/honeydue-migrate
+```
+
+## 27. Recover from a failed/dirty migration
+
+If `goose up` fails partway through, the migration file's transaction
+rolls back and `goose_db_version` reflects the last *complete*
+version. Goose marks no row as "dirty" — that's a golang-migrate
+concept. So recovery is just: fix the migration file, re-run.
+
+If you've genuinely corrupted state (dropped tables you shouldn't have,
+applied a destructive migration in error):
+
+```bash
+# See current goose state
+make migrate-status
+psql "$DATABASE_URL" -c \
+  "SELECT version_id, is_applied, tstamp FROM goose_db_version ORDER BY id DESC LIMIT 10;"
+
+# To force the version table back to a known-good number after
+# manually fixing the schema:
+psql "$DATABASE_URL" -c \
+  "INSERT INTO goose_db_version (version_id, is_applied, tstamp) VALUES (<N>, true, NOW());"
+```
+
+## 28. Bootstrap goose on a fresh clone of the schema
+
+If you create a new Neon branch / dev DB and need to bring it under
+goose management:
+
+```bash
+export DATABASE_URL="...<the new DB>..."
+
+# Option A: fresh DB, no schema → just run up
+make migrate-up
+
+# Option B: schema already populated (e.g., restored from a dump) →
+#          mark v1 as already-applied
+goose -dir migrations postgres "$DATABASE_URL" version  # creates table
+psql "$DATABASE_URL" -c \
+  "INSERT INTO goose_db_version (version_id, is_applied, tstamp) VALUES (1, true, NOW());"
+```
+
+This is also what was done for the live prod DB at goose-adoption time
+(commit `12b2f9d`).
 
 ## References
 
diff --git a/docs/deployment/19-postmortem-swarm.md b/docs/deployment/19-postmortem-swarm.md
index 5ac12b4..b42fe49 100644
--- a/docs/deployment/19-postmortem-swarm.md
+++ b/docs/deployment/19-postmortem-swarm.md
@@ -397,6 +397,35 @@ should reflect reality, not be optimistic.
 **Moral**: Healthchecks should be realistic, not aspirational. Know
 what your app actually does at startup.
 
+#### Postscript (2026-04-26): the whole `MigrateWithLock` shape was wrong
+
+A few months after the Swarm migration, switching `DB_HOST` to Neon's
+`-pooler` endpoint for runtime perf wins broke this code completely:
+`pg_advisory_lock` is session-scoped, but PgBouncer transaction-mode
+multiplexes statements across backend Postgres sessions, so the lock
+appeared to be held but actually wasn't. Pods hung at
+"Acquiring migration advisory lock..." and the startup probe killed
+them in turn.
+
+After a brief band-aid (route migrations through the direct endpoint;
+bump probe to 600s to absorb 5-minute AutoMigrate runs over the slow
+direct connection — both reverted), we abandoned the runtime-side
+migration story entirely and adopted [pressly/goose](https://github.com/pressly/goose)
+in commit `12b2f9d`:
+
+- Migrations run as a one-shot Kubernetes Job before any api/worker
+  pod rolls. No more in-replica migration, no more advisory lock,
+  no more startup probe gymnastics.
+- `RequireSchemaApplied` checks `goose_db_version` at startup and
+  refuses to boot on a stale schema — fail-fast for "operator
+  forgot to run migrate," instead of mysterious runtime errors.
+- `failureThreshold` reverted to its pre-MigrateWithLock value.
+  Pods boot in seconds again.
+
+See [Chapter 8 §Schema management](./08-database.md) for the goose
+shape. This entire sub-section is preserved as historical context
+for why we walked the path we did.
+
 ## What we learned
 
 ### Docker Swarm is in a bad place in 2026
diff --git a/docs/deployment/20-roadmap.md b/docs/deployment/20-roadmap.md
index 04a6c44..b31bad6 100644
--- a/docs/deployment/20-roadmap.md
+++ b/docs/deployment/20-roadmap.md
@@ -69,20 +69,22 @@ Flexible to Full (strict). Verified by:
 - CF edge continues to serve its own Let's Encrypt cert to browsers
 - both layers now TLS-encrypted
 
-### Migration Job for schema changes
+### ~~Migration Job for schema changes~~ — done (2026-04-26, commit 12b2f9d)
 
-**Why**: Currently every api pod runs `MigrateWithLock()` on startup,
-serializing on a Postgres advisory lock. Adds 90-240s to cold startup
-and caused bug #13 in Chapter 19.
+**What shipped**: pressly/goose as the migration tool, run as a one-shot
+Kubernetes Job from `deploy-k3s/manifests/migrate/job.yaml` before
+api/worker rollout. The Job uses the api image (goose CLI is baked in
+during the Dockerfile build), strips `-pooler` from `DB_HOST` for the
+direct-endpoint connection migrations need, and exits in seconds when
+there's nothing to apply. `RequireSchemaApplied` in the api/worker
+startup checks `goose_db_version` and fails fast on a stale schema.
 
-**How**: Create a Kubernetes `Job` resource that runs the api image
-with a `--migrate-only` flag. Job runs once per deploy, completes when
-schema is current. api pods get an initContainer that waits for the
-Job to complete.
+The Go-code-with-`--migrate-only` shape originally proposed here was
+rejected in favor of using the upstream goose binary directly — see
+[Chapter 8 §Schema management](./08-database.md) for the trade-offs.
 
-Requires Go code change to support `--migrate-only` flag.
-
-**Effort**: 3-4 hours (code + job manifest + testing).
+Pre-goose `MigrateWithLock` is gone; ch19 §13 has the historical
+postmortem context.
 
 ### Redis password
 
diff --git a/docs/deployment/appendices/a-glossary.md b/docs/deployment/appendices/a-glossary.md
index badee6f..a663f15 100644
--- a/docs/deployment/appendices/a-glossary.md
+++ b/docs/deployment/appendices/a-glossary.md
@@ -173,11 +173,21 @@ suffix. (Chapter 8)
 ## Go + Asynq
 
 **AutoMigrate**: GORM function that syncs DB schema to Go structs.
-(Chapter 8)
+Used in production until 2026-04, when goose replaced it. Tests
+still use it via `testutil.SetupTestDB`. (Chapter 8)
 
 **Asynq**: Go library for background job queues. Redis-backed.
 (Chapter 7)
 
+**goose**: pressly/goose — the SQL migration tool we use in production
+(commit 12b2f9d onward). Migration files live in `migrations/`, one
+file per version with `-- +goose Up` / `-- +goose Down` markers.
+(Chapter 8)
+
+**goose_db_version**: goose's version-tracking table. One row per
+applied migration. `RequireSchemaApplied` reads the latest row at
+api/worker startup to fail fast on a stale schema. (Chapter 8)
+
 **GORM**: Go ORM we use. (Chapter 8)
 
 **pgx**: Go Postgres driver used by GORM. (Chapter 8)
diff --git a/docs/deployment/appendices/d-references.md b/docs/deployment/appendices/d-references.md
index 31f3ea4..cdc1be5 100644
--- a/docs/deployment/appendices/d-references.md
+++ b/docs/deployment/appendices/d-references.md
@@ -65,7 +65,9 @@ Every external link cited anywhere in this book, grouped by topic.
 - [Neon usage-based pricing announcement][neon-blog]
 - [Neon connect from any app][neon-connect]
 - [Postgres advisory locks][pg-locks]
-- [GORM AutoMigrate][gorm-automigrate]
+- [GORM AutoMigrate][gorm-automigrate] (tests only — production migrations use goose)
+- [pressly/goose — SQL migration tool][goose]
+- [Goose documentation][goose-docs]
 
 ## Backblaze B2
 
@@ -168,6 +170,8 @@ Every external link cited anywhere in this book, grouped by topic.
 [neon-connect]: https://neon.com/docs/connect/connect-from-any-app
 [pg-locks]: https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS
 [gorm-automigrate]: https://gorm.io/docs/migration.html
+[goose]: https://github.com/pressly/goose
+[goose-docs]: https://pressly.github.io/goose/
 
 <!-- B2 -->
 [b2-docs]: https://www.backblaze.com/docs/