docs(deployment): rewrite migration prose for goose adoption
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled

Update the deployment book and glossary to reflect the goose-based
schema migration flow shipped in 12b2f9d/0f7450a:

- ch07: clarify startup probe assumes migrations ran out-of-band
- ch08: drop AutoMigrate-with-advisory-lock prose; describe goose Job
- ch12: pod startup checks goose_db_version, no longer runs migrations
- ch14: document the Job→wait→roll deploy gate and how to debug failures
- ch16: add "Migrate Job fails during deploy" + "Schema precondition
  failed" failure modes
- ch17: new runbook entries §26 (run migrations manually), §27 (recover
  from failed/dirty migration), §28 (bootstrap goose on fresh clone)
- ch19: postscript on §13 noting MigrateWithLock approach is superseded
- ch20: mark "Migration Job for schema changes" task done
- glossary: add `goose` and `goose_db_version`; flag AutoMigrate as
  tests-only
- references: add goose links; flag AutoMigrate as tests-only
This commit is contained in:
Trey t
2026-04-26 23:01:32 -05:00
parent 0f7450ada9
commit 8d9ca2e6ed
10 changed files with 260 additions and 39 deletions
+41 -4
View File
@@ -317,10 +317,47 @@ Timeline (approximate, warm state):
- t=60s: another old pod terminates
- ...continues until all on new RS
For cold-boot (e.g., first deploy on a rebuilt cluster), the
MigrateWithLock advisory lock extends this to several minutes. But the
rollout is serialized — only one pod starts per iteration, so the lock
queue is small.
Migrations run as a separate Kubernetes Job that completes before any
api/worker pod is rolled. So the rollout above never includes migration
work — pods that boot are guaranteed to find the schema already at the
expected version. See §"Migrations are gated, not interleaved" below.
## Migrations are gated, not interleaved
`03-deploy.sh` runs `goose up` as a one-shot Job before applying any
api/worker manifests:
```
1. kubectl delete job honeydue-migrate (idempotent, removes prior run)
2. kubectl apply -f manifests/migrate/job.yaml (with current api image)
3. kubectl wait --for=condition=complete --timeout=10m job/honeydue-migrate
4. (only if Job succeeded) kubectl apply -f manifests/api/...
```
The Job uses the api image — `/usr/local/bin/goose` is baked in at
Dockerfile build time. The Job script strips the `-pooler` segment
from `DB_HOST` before connecting (goose's session-scoped advisory
lock can't survive PgBouncer transaction-mode), runs `goose up`, exits.
If the Job fails, the script aborts before any new app pod sees a
stale schema. To debug:
```bash
kubectl -n honeydue logs job/honeydue-migrate --tail=200
kubectl -n honeydue describe job honeydue-migrate
```
After investigating, fix the migration file and re-run `03-deploy.sh`.
The Job is idempotent — successful migrations stay applied, only the
new/failed file gets retried.
api/worker pods run a `RequireSchemaApplied` check at startup that
queries `goose_db_version` and refuses to boot if the table is missing
or the latest row is `is_applied=false`. This is the fail-fast for
"someone bypassed the deploy script and the schema isn't current."
For full schema management background, see
[Chapter 8 §Schema management](./08-database.md).
## Hotfix workflow