docs(deployment): rewrite migration prose for goose adoption
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled

Update the deployment book and glossary to reflect the goose-based
schema migration flow shipped in 12b2f9d/0f7450a:

- ch07: clarify startup probe assumes migrations ran out-of-band
- ch08: drop AutoMigrate-with-advisory-lock prose; describe goose Job
- ch12: pod startup checks goose_db_version, no longer runs migrations
- ch14: document the Job→wait→roll deploy gate and how to debug failures
- ch16: add "Migrate Job fails during deploy" + "Schema precondition
  failed" failure modes
- ch17: new runbook entries §26 (run migrations manually), §27 (recover
  from failed/dirty migration), §28 (bootstrap goose on fresh clone)
- ch19: postscript on §13 noting MigrateWithLock approach is superseded
- ch20: mark "Migration Job for schema changes" task done
- glossary: add `goose` and `goose_db_version`; flag AutoMigrate as
  tests-only
- references: add goose links; flag AutoMigrate as tests-only
This commit is contained in:
Trey t
2026-04-26 23:01:32 -05:00
parent 0f7450ada9
commit 8d9ca2e6ed
10 changed files with 260 additions and 39 deletions
+9 -7
View File
@@ -175,13 +175,15 @@ doesn't run as root.
file writes to the image layer. Go binary doesn't need to write to `/`;
only `/tmp` is mutable.
**`startupProbe.failureThreshold: 48`** (= 48 × 5s = 240s grace) — this
was bumped up from the scaffold default of 12. Reason: on first boot,
the Go app runs `MigrateWithLock()` which acquires a Postgres advisory
lock and runs AutoMigrate. First replica takes ~90s; subsequent
replicas wait on the lock. With 3 replicas all starting simultaneously
and the lock serializing them, 240s is the right grace. See
[Chapter 19](./19-postmortem-swarm.md) for the detailed story.
**`startupProbe.failureThreshold: 48`** (= 48 × 5s = 240s grace) —
historically bumped from the scaffold default of 12 to absorb in-replica
migration time. Now that migrations run out-of-band as a Kubernetes
Job ([Chapter 8 §Schema management](./08-database.md)), pods boot in
seconds and only need a few probe failures of grace, but the budget
stays at 240s because cold pods on a fresh Hetzner node still pay
~10s for image pull + startup. See
[Chapter 19 §13](./19-postmortem-swarm.md) for the historical
context (the in-replica advisory-lock approach this replaced).
**`readinessProbe.initialDelaySeconds: 5`** — after the startupProbe
passes, wait 5s before starting readiness checks. Prevents a racy