docs(deployment): rewrite migration prose for goose adoption

Update the deployment book and glossary to reflect the goose-based schema migration flow shipped in 12b2f9d/0f7450a: - ch07: clarify startup probe assumes migrations ran out-of-band - ch08: drop AutoMigrate-with-advisory-lock prose; describe goose Job - ch12: pod startup checks goose_db_version, no longer runs migrations - ch14: document the Job→wait→roll deploy gate and how to debug failures - ch16: add "Migrate Job fails during deploy" + "Schema precondition failed" failure modes - ch17: new runbook entries §26 (run migrations manually), §27 (recover from failed/dirty migration), §28 (bootstrap goose on fresh clone) - ch19: postscript on §13 noting MigrateWithLock approach is superseded - ch20: mark "Migration Job for schema changes" task done - glossary: add `goose` and `goose_db_version`; flag AutoMigrate as tests-only - references: add goose links; flag AutoMigrate as tests-only
2026-04-26 23:01:32 -05:00
parent 0f7450ada9
commit 8d9ca2e6ed
10 changed files with 260 additions and 39 deletions
@@ -327,6 +327,55 @@ KUBECONFIG=~/.kube/honeydue.yaml bash deploy-k3s/scripts/03-deploy.sh --skip-bui
 Cold-handshake latency goes back up (~440ms first hit) but the API
 keeps serving. Switch back when the pooler recovers.

+#### Migrate Job fails during deploy
+
+**Symptom**: `03-deploy.sh` aborts at the migrations step:
+```
+[deploy][error] migrations did not complete cleanly; aborting deploy
+```
+api/worker pods are NOT updated — they keep running the previous
+revision. This is the intentional fail-fast.
+
+**Recovery**:
+```bash
+# 1. See the failure
+kubectl -n honeydue logs job/honeydue-migrate --tail=200
+
+# 2. Common cause: a SQL error in the migration file. Fix the file
+#    locally, commit, retry the deploy. The Job is idempotent —
+#    successful prior versions stay applied; only the failed file
+#    re-runs.
+git add migrations/000NNN_*.sql
+git commit -m "Fix migration NNN"
+git push gitea master
+bash deploy-k3s/scripts/03-deploy.sh
+
+# 3. Other cause: Neon down or auth changed. Test direct connection:
+DB_PASS=$(kubectl -n honeydue get secret honeydue-secrets \
+  -o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d)
+docker run --rm -e PGPASSWORD="$DB_PASS" postgres:17-alpine \
+  psql "host=ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
+        user=neondb_owner dbname=honeyDue sslmode=require" -c "SELECT 1;"
+```
+**Why no automatic retry**: `backoffLimit: 0` on the Job is deliberate.
+A failing migration almost never gets unstuck by retrying — needs an
+operator to look. See [Chapter 17 §27](./17-runbook.md) for recovery
+playbook.
+
+#### api refuses to start: "Schema precondition failed"
+
+**Symptom**: api pods log `Schema precondition failed` and exit
+immediately after DB connect.
+**Cause**: `goose_db_version` table is missing or its latest row has
+`is_applied=false`. Means the migrate Job either was never run or
+ran and rolled back.
+**Recovery**: run the migrate Job manually (see
+[Chapter 17 §26](./17-runbook.md)). After it completes successfully,
+delete the failing api pods so they restart with a fresh schema check:
+```bash
+kubectl -n honeydue rollout restart deploy/api
+```
+
 #### Backblaze B2 outage

 **Symptom**: image uploads fail; image downloads fail unless cached by