docs(deployment): rewrite migration prose for goose adoption
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled

Update the deployment book and glossary to reflect the goose-based
schema migration flow shipped in 12b2f9d/0f7450a:

- ch07: clarify startup probe assumes migrations ran out-of-band
- ch08: drop AutoMigrate-with-advisory-lock prose; describe goose Job
- ch12: pod startup checks goose_db_version, no longer runs migrations
- ch14: document the Job→wait→roll deploy gate and how to debug failures
- ch16: add "Migrate Job fails during deploy" + "Schema precondition
  failed" failure modes
- ch17: new runbook entries §26 (run migrations manually), §27 (recover
  from failed/dirty migration), §28 (bootstrap goose on fresh clone)
- ch19: postscript on §13 noting MigrateWithLock approach is superseded
- ch20: mark "Migration Job for schema changes" task done
- glossary: add `goose` and `goose_db_version`; flag AutoMigrate as
  tests-only
- references: add goose links; flag AutoMigrate as tests-only
This commit is contained in:
Trey t
2026-04-26 23:01:32 -05:00
parent 0f7450ada9
commit 8d9ca2e6ed
10 changed files with 260 additions and 39 deletions
+29
View File
@@ -397,6 +397,35 @@ should reflect reality, not be optimistic.
**Moral**: Healthchecks should be realistic, not aspirational. Know
what your app actually does at startup.
#### Postscript (2026-04-26): the whole `MigrateWithLock` shape was wrong
A few months after the Swarm migration, switching `DB_HOST` to Neon's
`-pooler` endpoint for runtime perf wins broke this code completely:
`pg_advisory_lock` is session-scoped, but PgBouncer transaction-mode
multiplexes statements across backend Postgres sessions, so the lock
appeared to be held but actually wasn't. Pods hung at
"Acquiring migration advisory lock..." and the startup probe killed
them in turn.
After a brief band-aid (route migrations through the direct endpoint;
bump probe to 600s to absorb 5-minute AutoMigrate runs over the slow
direct connection — both reverted), we abandoned the runtime-side
migration story entirely and adopted [pressly/goose](https://github.com/pressly/goose)
in commit `12b2f9d`:
- Migrations run as a one-shot Kubernetes Job before any api/worker
pod rolls. No more in-replica migration, no more advisory lock,
no more startup probe gymnastics.
- `RequireSchemaApplied` checks `goose_db_version` at startup and
refuses to boot on a stale schema — fail-fast for "operator
forgot to run migrate," instead of mysterious runtime errors.
- `failureThreshold` reverted to its pre-MigrateWithLock value.
Pods boot in seconds again.
See [Chapter 8 §Schema management](./08-database.md) for the goose
shape. This entire sub-section is preserved as historical context
for why we walked the path we did.
## What we learned
### Docker Swarm is in a bad place in 2026