docs(deployment): rewrite migration prose for goose adoption

Update the deployment book and glossary to reflect the goose-based schema migration flow shipped in 12b2f9d/0f7450a: - ch07: clarify startup probe assumes migrations ran out-of-band - ch08: drop AutoMigrate-with-advisory-lock prose; describe goose Job - ch12: pod startup checks goose_db_version, no longer runs migrations - ch14: document the Job→wait→roll deploy gate and how to debug failures - ch16: add "Migrate Job fails during deploy" + "Schema precondition failed" failure modes - ch17: new runbook entries §26 (run migrations manually), §27 (recover from failed/dirty migration), §28 (bootstrap goose on fresh clone) - ch19: postscript on §13 noting MigrateWithLock approach is superseded - ch20: mark "Migration Job for schema changes" task done - glossary: add `goose` and `goose_db_version`; flag AutoMigrate as tests-only - references: add goose links; flag AutoMigrate as tests-only
2026-04-26 23:01:32 -05:00
parent 0f7450ada9
commit 8d9ca2e6ed
10 changed files with 260 additions and 39 deletions
@@ -175,13 +175,15 @@ doesn't run as root.
 file writes to the image layer. Go binary doesn't need to write to `/`;
 only `/tmp` is mutable.

-**`startupProbe.failureThreshold: 48`** (= 48 × 5s = 240s grace) — this
-was bumped up from the scaffold default of 12. Reason: on first boot,
-the Go app runs `MigrateWithLock()` which acquires a Postgres advisory
-lock and runs AutoMigrate. First replica takes ~90s; subsequent
-replicas wait on the lock. With 3 replicas all starting simultaneously
-and the lock serializing them, 240s is the right grace. See
-[Chapter 19](./19-postmortem-swarm.md) for the detailed story.
+**`startupProbe.failureThreshold: 48`** (= 48 × 5s = 240s grace) —
+historically bumped from the scaffold default of 12 to absorb in-replica
+migration time. Now that migrations run out-of-band as a Kubernetes
+Job ([Chapter 8 §Schema management](./08-database.md)), pods boot in
+seconds and only need a few probe failures of grace, but the budget
+stays at 240s because cold pods on a fresh Hetzner node still pay
+~10s for image pull + startup. See
+[Chapter 19 §13](./19-postmortem-swarm.md) for the historical
+context (the in-replica advisory-lock approach this replaced).

 **`readinessProbe.initialDelaySeconds: 5`** — after the startupProbe
 passes, wait 5s before starting readiness checks. Prevents a racy