From 8d9ca2e6ed5ff48e1cc56df00557111c80ba3344 Mon Sep 17 00:00:00 2001 From: Trey t Date: Sun, 26 Apr 2026 23:01:32 -0500 Subject: [PATCH] docs(deployment): rewrite migration prose for goose adoption MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Update the deployment book and glossary to reflect the goose-based schema migration flow shipped in 12b2f9d/0f7450a: - ch07: clarify startup probe assumes migrations ran out-of-band - ch08: drop AutoMigrate-with-advisory-lock prose; describe goose Job - ch12: pod startup checks goose_db_version, no longer runs migrations - ch14: document the Job→wait→roll deploy gate and how to debug failures - ch16: add "Migrate Job fails during deploy" + "Schema precondition failed" failure modes - ch17: new runbook entries §26 (run migrations manually), §27 (recover from failed/dirty migration), §28 (bootstrap goose on fresh clone) - ch19: postscript on §13 noting MigrateWithLock approach is superseded - ch20: mark "Migration Job for schema changes" task done - glossary: add `goose` and `goose_db_version`; flag AutoMigrate as tests-only - references: add goose links; flag AutoMigrate as tests-only --- docs/deployment/07-services.md | 16 ++-- docs/deployment/08-database.md | 24 +++--- docs/deployment/12-data-flow.md | 2 +- docs/deployment/14-deployment-process.md | 45 ++++++++++- docs/deployment/16-failure-modes.md | 49 ++++++++++++ docs/deployment/17-runbook.md | 92 +++++++++++++++++++++- docs/deployment/19-postmortem-swarm.md | 29 +++++++ docs/deployment/20-roadmap.md | 24 +++--- docs/deployment/appendices/a-glossary.md | 12 ++- docs/deployment/appendices/d-references.md | 6 +- 10 files changed, 260 insertions(+), 39 deletions(-) diff --git a/docs/deployment/07-services.md b/docs/deployment/07-services.md index fd2d044..057e776 100644 --- a/docs/deployment/07-services.md +++ b/docs/deployment/07-services.md @@ -175,13 +175,15 @@ doesn't run as root. file writes to the image layer. 
Go binary doesn't need to write to `/`; only `/tmp` is mutable. -**`startupProbe.failureThreshold: 48`** (= 48 × 5s = 240s grace) — this -was bumped up from the scaffold default of 12. Reason: on first boot, -the Go app runs `MigrateWithLock()` which acquires a Postgres advisory -lock and runs AutoMigrate. First replica takes ~90s; subsequent -replicas wait on the lock. With 3 replicas all starting simultaneously -and the lock serializing them, 240s is the right grace. See -[Chapter 19](./19-postmortem-swarm.md) for the detailed story. +**`startupProbe.failureThreshold: 48`** (= 48 × 5s = 240s grace) — +historically bumped from the scaffold default of 12 to absorb in-replica +migration time. Now that migrations run out-of-band as a Kubernetes +Job ([Chapter 8 §Schema management](./08-database.md)), pods boot in +seconds and only need a few probe failures of grace, but the budget +stays at 240s because cold pods on a fresh Hetzner node still pay +~10s for image pull + startup. See +[Chapter 19 §13](./19-postmortem-swarm.md) for the historical +context (the in-replica advisory-lock approach this replaced). **`readinessProbe.initialDelaySeconds: 5`** — after the startupProbe passes, wait 5s before starting readiness checks. Prevents a racy diff --git a/docs/deployment/08-database.md b/docs/deployment/08-database.md index a26cf1a..2cfd28e 100644 --- a/docs/deployment/08-database.md +++ b/docs/deployment/08-database.md @@ -4,8 +4,10 @@ Authoritative user data lives in a Neon-managed Postgres database in AWS us-east-1. Connections use TLS (`DB_SSLMODE=require`). Schema is managed -via GORM AutoMigrate inside the api binary, coordinated across replicas -by a Postgres advisory lock to prevent concurrent migration attempts. +via [pressly/goose](https://github.com/pressly/goose) running as a +one-shot Kubernetes Job before every api/worker rollout. 
See §Schema +management below for the full shape; ch19 §13 documents the previous +in-replica AutoMigrate approach this replaced. ## Why Neon @@ -78,13 +80,13 @@ Modes PgBouncer supports: - **statement** — per-statement (most aggressive; breaks many features) Neon's pooler runs in **transaction mode**. This is compatible with GORM -out of the box (we don't use session-level features like LISTEN/NOTIFY -or session-scope advisory locks). Note: `database.MigrateWithLock()` -needs the *direct* (non-pooler) endpoint because session-level -advisory locks don't survive PgBouncer's per-transaction cycling — but -the migration helper opens its own ad-hoc connection bypassing the -configured pool, so this happens automatically. See `MigrateWithLock` -in `internal/database/database.go`. +runtime queries (we don't use session-level features like LISTEN/NOTIFY +or session-scope advisory locks in the data path). The one place this +matters is migrations: goose's session-scoped advisory lock can't +survive PgBouncer transaction-mode pooling. The migrate Job +(`deploy-k3s/manifests/migrate/job.yaml`) handles this by stripping +the `-pooler` segment from `DB_HOST` before invoking goose — runtime +keeps using the pooler, only migrations bypass it. 
### Connection pool settings @@ -404,11 +406,13 @@ GROUP BY usename, state, application_name; - [Neon docs][neon-docs] - [Neon pricing][neon-pricing] - [Postgres advisory locks][pg-locks] -- [GORM AutoMigrate][gorm-automigrate] +- [pressly/goose][goose] — production migration tool +- [GORM AutoMigrate][gorm-automigrate] (tests only) - [honeyDue task architecture][task-arch] (repo-local) [neon-docs]: https://neon.com/docs/introduction [neon-pricing]: https://neon.com/pricing [pg-locks]: https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS +[goose]: https://github.com/pressly/goose [gorm-automigrate]: https://gorm.io/docs/migration.html [task-arch]: ../../docs/TASK_LOGIC_ARCHITECTURE.md diff --git a/docs/deployment/12-data-flow.md b/docs/deployment/12-data-flow.md index 6648dfe..fdc2f19 100644 --- a/docs/deployment/12-data-flow.md +++ b/docs/deployment/12-data-flow.md @@ -272,7 +272,7 @@ sequenceDiagram participant NewPod as api pod v2 (starting) Note over NewPod: kubelet starts new pod - Note over NewPod: pod connects to Postgres
MigrateWithLock runs (no-op)
HTTP server starts
readinessProbe passes + Note over NewPod: pod connects to Postgres
RequireSchemaApplied checks goose_db_version
HTTP server starts
readinessProbe passes Note over NewPod: kube-proxy updates endpoints
NewPod added to Service pool CF->>Traefik: request 1 Traefik->>OldPod: routed (old pod still in pool) diff --git a/docs/deployment/14-deployment-process.md b/docs/deployment/14-deployment-process.md index ea22763..c32444f 100644 --- a/docs/deployment/14-deployment-process.md +++ b/docs/deployment/14-deployment-process.md @@ -317,10 +317,47 @@ Timeline (approximate, warm state): - t=60s: another old pod terminates - ...continues until all on new RS -For cold-boot (e.g., first deploy on a rebuilt cluster), the -MigrateWithLock advisory lock extends this to several minutes. But the -rollout is serialized — only one pod starts per iteration, so the lock -queue is small. +Migrations run as a separate Kubernetes Job that completes before any +api/worker pod is rolled. So the rollout above never includes migration +work — pods that boot are guaranteed to find the schema already at the +expected version. See §"Migrations are gated, not interleaved" below. + +## Migrations are gated, not interleaved + +`03-deploy.sh` runs `goose up` as a one-shot Job before applying any +api/worker manifests: + +``` +1. kubectl delete job honeydue-migrate (idempotent, removes prior run) +2. kubectl apply -f manifests/migrate/job.yaml (with current api image) +3. kubectl wait --for=condition=complete --timeout=10m job/honeydue-migrate +4. (only if Job succeeded) kubectl apply -f manifests/api/... +``` + +The Job uses the api image — `/usr/local/bin/goose` is baked in at +Dockerfile build time. The Job script strips the `-pooler` segment +from `DB_HOST` before connecting (goose's session-scoped advisory +lock can't survive PgBouncer transaction-mode), runs `goose up`, exits. + +If the Job fails, the script aborts before any new app pod sees a +stale schema. To debug: + +```bash +kubectl -n honeydue logs job/honeydue-migrate --tail=200 +kubectl -n honeydue describe job honeydue-migrate +``` + +After investigating, fix the migration file and re-run `03-deploy.sh`. 
+The Job is idempotent — successful migrations stay applied, only the +new/failed file gets retried. + +api/worker pods run a `RequireSchemaApplied` check at startup that +queries `goose_db_version` and refuses to boot if the table is missing +or the latest row is `is_applied=false`. This is the fail-fast for +"someone bypassed the deploy script and the schema isn't current." + +For full schema management background, see +[Chapter 8 §Schema management](./08-database.md). ## Hotfix workflow diff --git a/docs/deployment/16-failure-modes.md b/docs/deployment/16-failure-modes.md index f3b0155..b6c82a4 100644 --- a/docs/deployment/16-failure-modes.md +++ b/docs/deployment/16-failure-modes.md @@ -327,6 +327,55 @@ KUBECONFIG=~/.kube/honeydue.yaml bash deploy-k3s/scripts/03-deploy.sh --skip-bui Cold-handshake latency goes back up (~440ms first hit) but the API keeps serving. Switch back when the pooler recovers. +#### Migrate Job fails during deploy + +**Symptom**: `03-deploy.sh` aborts at the migrations step: +``` +[deploy][error] migrations did not complete cleanly; aborting deploy +``` +api/worker pods are NOT updated — they keep running the previous +revision. This is the intentional fail-fast. + +**Recovery**: +```bash +# 1. See the failure +kubectl -n honeydue logs job/honeydue-migrate --tail=200 + +# 2. Common cause: a SQL error in the migration file. Fix the file +# locally, commit, retry the deploy. The Job is idempotent — +# successful prior versions stay applied; only the failed file +# re-runs. +git add migrations/000NNN_*.sql +git commit -m "Fix migration NNN" +git push gitea master +bash deploy-k3s/scripts/03-deploy.sh + +# 3. Other cause: Neon down or auth changed. 
Test direct connection: +DB_PASS=$(kubectl -n honeydue get secret honeydue-secrets \ + -o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d) +docker run --rm -e PGPASSWORD="$DB_PASS" postgres:17-alpine \ + psql "host=ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \ + user=neondb_owner dbname=honeyDue sslmode=require" -c "SELECT 1;" +``` +**Why no automatic retry**: `backoffLimit: 0` on the Job is deliberate. +A failing migration almost never gets unstuck by retrying; it needs an +operator to look. See [Chapter 17 §27](./17-runbook.md) for the recovery +playbook. + +#### api refuses to start: "Schema precondition failed" + +**Symptom**: api pods log `Schema precondition failed` and exit +immediately after DB connect. +**Cause**: the `goose_db_version` table is missing or its latest row has +`is_applied=false`. This means the migrate Job either was never run or +ran and rolled back. +**Recovery**: run the migrate Job manually (see +[Chapter 17 §26](./17-runbook.md)). After it completes successfully, +restart the api Deployment so its pods rerun the schema check: +```bash +kubectl -n honeydue rollout restart deploy/api +``` + #### Backblaze B2 outage **Symptom**: image uploads fail; image downloads fail unless cached by diff --git a/docs/deployment/17-runbook.md b/docs/deployment/17-runbook.md index 9bd9512..048adc0 100644 --- a/docs/deployment/17-runbook.md +++ b/docs/deployment/17-runbook.md @@ -428,10 +428,94 @@ KUBECONFIG=~/.kube/honeydue.yaml bash deploy-k3s/scripts/03-deploy.sh --skip-bui ``` The pooler runs in transaction mode so any session-scope feature -(LISTEN/NOTIFY, session advisory locks for migrations) auto-falls -through to direct via `MigrateWithLock` opening its own connection. -But if you ever add session-level features in the data path, they'll -need the direct endpoint. +(LISTEN/NOTIFY, session advisory locks) won't work over it. Migrations +already handle this — the migrate Job script strips `-pooler` from +`DB_HOST` before invoking goose. 
If you add new session-level features +in the data path, they'll need the same workaround. + +## 26. Run migrations manually (rare) + +Day-to-day, migrations run as part of every `03-deploy.sh`. But +sometimes you want to apply or inspect them outside a deploy: + +```bash +# Direct-endpoint DSN (goose's advisory lock won't survive the pooler) +DB_PASS=$(kubectl -n honeydue get secret honeydue-secrets \ + -o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d) +export DATABASE_URL="host=ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \ + port=5432 user=neondb_owner password=$DB_PASS \ + dbname=honeyDue sslmode=require" + +# What's pending? (read-only; safe to run anytime) +make migrate-status + +# Apply pending migrations (or `goose -dir migrations postgres "$DATABASE_URL" up`) +make migrate-up + +# Roll back the most recent migration +make migrate-down + +# Scaffold a new migration file +make migrate-new name=add_widget_count_to_residences +# → migrations/000002_add_widget_count_to_residences.sql +# Edit, then `make migrate-up` to test, then commit. +``` + +To run goose from inside the cluster (e.g., to bypass a network policy +that blocks Neon from your laptop), use the migrate Job manifest as a +one-shot: + +```bash +# Re-runs the migrate Job with the image the api Deployment currently uses +kubectl -n honeydue delete job honeydue-migrate --ignore-not-found +sed "s|image: IMAGE_PLACEHOLDER|image: $(kubectl -n honeydue get deploy api -o jsonpath='{.spec.template.spec.containers[0].image}')|" \ + deploy-k3s/manifests/migrate/job.yaml | kubectl apply -f - +kubectl -n honeydue wait --for=condition=complete --timeout=5m job/honeydue-migrate +kubectl -n honeydue logs job/honeydue-migrate +``` + +## 27. Recover from a failed/dirty migration + +If `goose up` fails partway through, the migration file's transaction +rolls back and `goose_db_version` reflects the last *complete* +version. Goose marks no row as "dirty" — that's a golang-migrate +concept. 
So recovery is just: fix the migration file, re-run. + +If you've genuinely corrupted state (dropped tables you shouldn't have, +applied a destructive migration in error): + +```bash +# See current goose state +make migrate-status +psql "$DATABASE_URL" -c \ + "SELECT version_id, is_applied, tstamp FROM goose_db_version ORDER BY id DESC LIMIT 10;" + +# To force the version table back to a known-good number after +# manually fixing the schema (substitute that number for <version_id>): +psql "$DATABASE_URL" -c \ + "INSERT INTO goose_db_version (version_id, is_applied, tstamp) VALUES (<version_id>, true, NOW());" +``` + +## 28. Bootstrap goose on a fresh clone of the schema + +If you create a new Neon branch / dev DB and need to bring it under +goose management: + +```bash +export DATABASE_URL="......" + +# Option A: fresh DB, no schema → just run up +make migrate-up + +# Option B: schema already populated (e.g., restored from a dump) → +# mark v1 as already-applied +goose -dir migrations postgres "$DATABASE_URL" version # creates table +psql "$DATABASE_URL" -c \ + "INSERT INTO goose_db_version (version_id, is_applied, tstamp) VALUES (1, true, NOW());" +``` + +This is also what was done for the live prod DB at goose-adoption time +(commit `12b2f9d`). ## References diff --git a/docs/deployment/19-postmortem-swarm.md b/docs/deployment/19-postmortem-swarm.md index 5ac12b4..b42fe49 100644 --- a/docs/deployment/19-postmortem-swarm.md +++ b/docs/deployment/19-postmortem-swarm.md @@ -397,6 +397,35 @@ should reflect reality, not be optimistic. **Moral**: Healthchecks should be realistic, not aspirational. Know what your app actually does at startup. 
+#### Postscript (2026-04-26): the whole `MigrateWithLock` shape was wrong + +A few months after the Swarm migration, switching `DB_HOST` to Neon's +`-pooler` endpoint for runtime perf wins broke this code completely: +`pg_advisory_lock` is session-scoped, but PgBouncer transaction-mode +multiplexes statements across backend Postgres sessions, so the lock +appeared to be held but actually wasn't. Pods hung at +"Acquiring migration advisory lock..." and the startup probe killed +them in turn. + +After a brief band-aid (route migrations through the direct endpoint; +bump probe to 600s to absorb 5-minute AutoMigrate runs over the slow +direct connection — both reverted), we abandoned the runtime-side +migration story entirely and adopted [pressly/goose](https://github.com/pressly/goose) +in commit `12b2f9d`: + +- Migrations run as a one-shot Kubernetes Job before any api/worker + pod rolls. No more in-replica migration, no more advisory lock, + no more startup probe gymnastics. +- `RequireSchemaApplied` checks `goose_db_version` at startup and + refuses to boot on a stale schema — fail-fast for "operator + forgot to run migrate," instead of mysterious runtime errors. +- `failureThreshold` went back to 48 (240s) once the 600s band-aid was + reverted; see Chapter 7. Pods boot in seconds again. + +See [Chapter 8 §Schema management](./08-database.md) for the goose +shape. This entire sub-section is preserved as historical context +for why we walked the path we did. + ## What we learned ### Docker Swarm is in a bad place in 2026 diff --git a/docs/deployment/20-roadmap.md b/docs/deployment/20-roadmap.md index 04a6c44..b31bad6 100644 --- a/docs/deployment/20-roadmap.md +++ b/docs/deployment/20-roadmap.md @@ -69,20 +69,22 @@ Flexible to Full (strict). 
Verified by: - CF edge continues to serve its own Let's Encrypt cert to browsers - both layers now TLS-encrypted -### Migration Job for schema changes +### ~~Migration Job for schema changes~~ — done (2026-04-26, commit 12b2f9d) -**Why**: Currently every api pod runs `MigrateWithLock()` on startup, -serializing on a Postgres advisory lock. Adds 90-240s to cold startup -and caused bug #13 in Chapter 19. +**What shipped**: pressly/goose as the migration tool, run as a one-shot +Kubernetes Job from `deploy-k3s/manifests/migrate/job.yaml` before +api/worker rollout. The Job uses the api image (goose CLI is baked in +during the Dockerfile build), strips `-pooler` from `DB_HOST` for the +direct-endpoint connection migrations need, and exits in seconds when +there's nothing to apply. `RequireSchemaApplied` in the api/worker +startup checks `goose_db_version` and fails fast on a stale schema. -**How**: Create a Kubernetes `Job` resource that runs the api image -with a `--migrate-only` flag. Job runs once per deploy, completes when -schema is current. api pods get an initContainer that waits for the -Job to complete. +The Go-code-with-`--migrate-only` shape originally proposed here was +rejected in favor of using the upstream goose binary directly — see +[Chapter 8 §Schema management](./08-database.md) for the trade-offs. -Requires Go code change to support `--migrate-only` flag. - -**Effort**: 3-4 hours (code + job manifest + testing). +Pre-goose `MigrateWithLock` is gone; ch19 §13 has the historical +postmortem context. ### Redis password diff --git a/docs/deployment/appendices/a-glossary.md b/docs/deployment/appendices/a-glossary.md index badee6f..a663f15 100644 --- a/docs/deployment/appendices/a-glossary.md +++ b/docs/deployment/appendices/a-glossary.md @@ -173,11 +173,21 @@ suffix. (Chapter 8) ## Go + Asynq **AutoMigrate**: GORM function that syncs DB schema to Go structs. -(Chapter 8) +We used this in production until 2026-04, replaced by goose. 
Tests +still use it via `testutil.SetupTestDB`. (Chapter 8) **Asynq**: Go library for background job queues. Redis-backed. (Chapter 7) +**goose**: pressly/goose — the SQL migration tool we use in production +(commit 12b2f9d onward). Migration files live in `migrations/`, one +file per version with `-- +goose Up` / `-- +goose Down` markers. +(Chapter 8) + +**goose_db_version**: goose's version-tracking table. One row per +applied migration. `RequireSchemaApplied` reads the latest row at +api/worker startup to fail fast on a stale schema. (Chapter 8) + **GORM**: Go ORM we use. (Chapter 8) **pgx**: Go Postgres driver used by GORM. (Chapter 8) diff --git a/docs/deployment/appendices/d-references.md b/docs/deployment/appendices/d-references.md index 31f3ea4..cdc1be5 100644 --- a/docs/deployment/appendices/d-references.md +++ b/docs/deployment/appendices/d-references.md @@ -65,7 +65,9 @@ Every external link cited anywhere in this book, grouped by topic. - [Neon usage-based pricing announcement][neon-blog] - [Neon connect from any app][neon-connect] - [Postgres advisory locks][pg-locks] -- [GORM AutoMigrate][gorm-automigrate] +- [GORM AutoMigrate][gorm-automigrate] (tests only — production migrations use goose) +- [pressly/goose — SQL migration tool][goose] +- [Goose documentation][goose-docs] ## Backblaze B2 @@ -168,6 +170,8 @@ Every external link cited anywhere in this book, grouped by topic. [neon-connect]: https://neon.com/docs/connect/connect-from-any-app [pg-locks]: https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS [gorm-automigrate]: https://gorm.io/docs/migration.html +[goose]: https://github.com/pressly/goose +[goose-docs]: https://pressly.github.io/goose/ [b2-docs]: https://www.backblaze.com/docs/