docs(deployment): rewrite migration prose for goose adoption

Update the deployment book and glossary to reflect the goose-based
schema migration flow shipped in 12b2f9d/0f7450a:

- ch07: clarify startup probe assumes migrations ran out-of-band
- ch08: drop AutoMigrate-with-advisory-lock prose; describe goose Job
- ch12: pod startup checks goose_db_version, no longer runs migrations
- ch14: document the Job→wait→roll deploy gate and how to debug failures
- ch16: add "Migrate Job fails during deploy" + "Schema precondition
  failed" failure modes
- ch17: new runbook entries §26 (run migrations manually), §27 (recover
  from failed/dirty migration), §28 (bootstrap goose on fresh clone)
- ch19: postscript on §13 noting MigrateWithLock approach is superseded
- ch20: mark "Migration Job for schema changes" task done
- glossary: add `goose` and `goose_db_version`; flag AutoMigrate as
  tests-only
- references: add goose links; flag AutoMigrate as tests-only
Trey t
2026-04-26 23:01:32 -05:00
parent 0f7450ada9
commit 8d9ca2e6ed
10 changed files with 260 additions and 39 deletions
+9 -7
@@ -175,13 +175,15 @@ doesn't run as root.
 file writes to the image layer. Go binary doesn't need to write to `/`;
 only `/tmp` is mutable.

-**`startupProbe.failureThreshold: 48`** (= 48 × 5s = 240s grace) — this
-was bumped up from the scaffold default of 12. Reason: on first boot,
-the Go app runs `MigrateWithLock()` which acquires a Postgres advisory
-lock and runs AutoMigrate. First replica takes ~90s; subsequent
-replicas wait on the lock. With 3 replicas all starting simultaneously
-and the lock serializing them, 240s is the right grace. See
-[Chapter 19](./19-postmortem-swarm.md) for the detailed story.
+**`startupProbe.failureThreshold: 48`** (= 48 × 5s = 240s grace) —
+historically bumped from the scaffold default of 12 to absorb in-replica
+migration time. Now that migrations run out-of-band as a Kubernetes
+Job ([Chapter 8 §Schema management](./08-database.md)), pods boot in
+seconds and only need a few probe failures of grace, but the budget
+stays at 240s because cold pods on a fresh Hetzner node still pay
+~10s for image pull + startup. See
+[Chapter 19 §13](./19-postmortem-swarm.md) for the historical
+context (the in-replica advisory-lock approach this replaced).

 **`readinessProbe.initialDelaySeconds: 5`** — after the startupProbe
 passes, wait 5s before starting readiness checks. Prevents a racy
+14 -10
@@ -4,8 +4,10 @@
 Authoritative user data lives in a Neon-managed Postgres database in AWS
 us-east-1. Connections use TLS (`DB_SSLMODE=require`). Schema is managed
-via GORM AutoMigrate inside the api binary, coordinated across replicas
-by a Postgres advisory lock to prevent concurrent migration attempts.
+via [pressly/goose](https://github.com/pressly/goose) running as a
+one-shot Kubernetes Job before every api/worker rollout. See §Schema
+management below for the full shape; ch19 §13 documents the previous
+in-replica AutoMigrate approach this replaced.

 ## Why Neon
@@ -78,13 +80,13 @@ Modes PgBouncer supports:
 - **statement** — per-statement (most aggressive; breaks many features)

 Neon's pooler runs in **transaction mode**. This is compatible with GORM
-out of the box (we don't use session-level features like LISTEN/NOTIFY
-or session-scope advisory locks). Note: `database.MigrateWithLock()`
-needs the *direct* (non-pooler) endpoint because session-level
-advisory locks don't survive PgBouncer's per-transaction cycling — but
-the migration helper opens its own ad-hoc connection bypassing the
-configured pool, so this happens automatically. See `MigrateWithLock`
-in `internal/database/database.go`.
+runtime queries (we don't use session-level features like LISTEN/NOTIFY
+or session-scope advisory locks in the data path). The one place this
+matters is migrations: goose's session-scoped advisory lock can't
+survive PgBouncer transaction-mode pooling. The migrate Job
+(`deploy-k3s/manifests/migrate/job.yaml`) handles this by stripping
+the `-pooler` segment from `DB_HOST` before invoking goose — runtime
+keeps using the pooler, only migrations bypass it.

 ### Connection pool settings
@@ -404,11 +406,13 @@ GROUP BY usename, state, application_name;
 - [Neon docs][neon-docs]
 - [Neon pricing][neon-pricing]
 - [Postgres advisory locks][pg-locks]
-- [GORM AutoMigrate][gorm-automigrate]
+- [pressly/goose][goose] — production migration tool
+- [GORM AutoMigrate][gorm-automigrate] (tests only)
 - [honeyDue task architecture][task-arch] (repo-local)

 [neon-docs]: https://neon.com/docs/introduction
 [neon-pricing]: https://neon.com/pricing
 [pg-locks]: https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS
+[goose]: https://github.com/pressly/goose
 [gorm-automigrate]: https://gorm.io/docs/migration.html
 [task-arch]: ../../docs/TASK_LOGIC_ARCHITECTURE.md
+1 -1
@@ -272,7 +272,7 @@ sequenceDiagram
 participant NewPod as api pod v2 (starting)
 Note over NewPod: kubelet starts new pod
-Note over NewPod: pod connects to Postgres<br/>MigrateWithLock runs (no-op)<br/>HTTP server starts<br/>readinessProbe passes
+Note over NewPod: pod connects to Postgres<br/>RequireSchemaApplied checks goose_db_version<br/>HTTP server starts<br/>readinessProbe passes
 Note over NewPod: kube-proxy updates endpoints<br/>NewPod added to Service pool
 CF->>Traefik: request 1
 Traefik->>OldPod: routed (old pod still in pool)
+41 -4
@@ -317,10 +317,47 @@ Timeline (approximate, warm state):
 - t=60s: another old pod terminates
 - ...continues until all on new RS

-For cold-boot (e.g., first deploy on a rebuilt cluster), the
-MigrateWithLock advisory lock extends this to several minutes. But the
-rollout is serialized — only one pod starts per iteration, so the lock
-queue is small.
+Migrations run as a separate Kubernetes Job that completes before any
+api/worker pod is rolled. So the rollout above never includes migration
+work — pods that boot are guaranteed to find the schema already at the
+expected version. See §"Migrations are gated, not interleaved" below.

## Migrations are gated, not interleaved
`03-deploy.sh` runs `goose up` as a one-shot Job before applying any
api/worker manifests:
```
1. kubectl delete job honeydue-migrate (idempotent, removes prior run)
2. kubectl apply -f manifests/migrate/job.yaml (with current api image)
3. kubectl wait --for=condition=complete --timeout=10m job/honeydue-migrate
4. (only if Job succeeded) kubectl apply -f manifests/api/...
```
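
The four steps above can be sketched as a shell function. This is a minimal illustration, not the contents of `03-deploy.sh`: the function name `run_migrate_gate` and its manifest-path argument are hypothetical, the `honeydue` namespace is assumed from the other `kubectl` commands in this book, and the sketch skips substituting the current api image into the Job manifest.

```bash
#!/usr/bin/env bash
# Sketch of the Job -> wait -> roll gate. Hypothetical helper, not the real
# 03-deploy.sh. $1 is the path to the api/worker manifests (caller-supplied).
run_migrate_gate() {
  local api_manifests="${1:?usage: run_migrate_gate <api-manifest-path>}"
  local ns="honeydue" job="honeydue-migrate"

  # 1. Remove the Job left behind by a previous deploy, if there is one.
  kubectl -n "$ns" delete job "$job" --ignore-not-found || return 1

  # 2. Create a fresh migrate Job.
  kubectl -n "$ns" apply -f deploy-k3s/manifests/migrate/job.yaml || return 1

  # 3. Block until the Job reports completion, or give up after 10 minutes.
  if ! kubectl -n "$ns" wait --for=condition=complete --timeout=10m "job/$job"; then
    echo "[deploy][error] migrations did not complete cleanly; aborting deploy" >&2
    return 1
  fi

  # 4. Only reached when the Job succeeded: roll the app manifests.
  kubectl -n "$ns" apply -f "$api_manifests"
}
```

Returning before step 4 is what keeps the previous revision serving when a migration fails: no app manifest is applied unless the wait in step 3 succeeds.
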
The Job uses the api image — `/usr/local/bin/goose` is baked in at
Dockerfile build time. The Job script strips the `-pooler` segment
from `DB_HOST` before connecting (goose's session-scoped advisory
lock can't survive PgBouncer transaction-mode), runs `goose up`, exits.
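
As an illustration of the `-pooler` stripping, the snippet below derives a direct hostname from a pooled one with plain parameter expansion. It assumes the pooled hostname differs from the direct one only by a `-pooler` suffix on the first DNS label; the function name `direct_host` is hypothetical, and the real Job script may do this differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a pooled Neon hostname to its direct endpoint by
# dropping a trailing "-pooler" from the first DNS label. Hosts without the
# suffix are printed unchanged.
direct_host() {
  local host="$1"
  local label="${host%%.*}"       # first DNS label, e.g. "ep-xyz-pooler"
  local rest="${host#"$label"}"   # remainder, including the leading dot
  printf '%s%s\n' "${label%-pooler}" "$rest"
}
```

In a Job script this could be used as `DB_HOST="$(direct_host "$DB_HOST")"` just before calling goose, leaving the runtime configuration on the pooler untouched.
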
If the Job fails, the script aborts before any new app pod sees a
stale schema. To debug:
```bash
kubectl -n honeydue logs job/honeydue-migrate --tail=200
kubectl -n honeydue describe job honeydue-migrate
```
After investigating, fix the migration file and re-run `03-deploy.sh`.
The Job is idempotent — successful migrations stay applied, only the
new/failed file gets retried.
api/worker pods run a `RequireSchemaApplied` check at startup that
queries `goose_db_version` and refuses to boot if the table is missing
or the latest row is `is_applied=false`. This is the fail-fast for
"someone bypassed the deploy script and the schema isn't current."
For full schema management background, see
[Chapter 8 §Schema management](./08-database.md).
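
The condition that `RequireSchemaApplied` enforces can also be checked by hand from a shell. The sketch below is an operator-side approximation, not the Go implementation: the function name `schema_applied` and the wording after the `Schema precondition failed` prefix are hypothetical, and it assumes `DATABASE_URL` holds a direct-endpoint connection string that `psql` accepts.

```bash
#!/usr/bin/env bash
# Operator-side approximation of the startup schema check (hypothetical
# helper). Succeeds only when goose_db_version exists and its newest row
# has is_applied = true.
schema_applied() {
  local latest
  # -A -t makes psql print the bare value: "t" for true, "f" for false.
  latest="$(psql "$DATABASE_URL" -At -c \
    "SELECT is_applied FROM goose_db_version ORDER BY id DESC LIMIT 1;" 2>/dev/null)" || {
    echo "Schema precondition failed: goose_db_version is missing or unreadable" >&2
    return 1
  }
  if [ "$latest" != "t" ]; then
    echo "Schema precondition failed: latest goose_db_version row is not applied" >&2
    return 1
  fi
}
```

Running `schema_applied && echo "schema OK"` before restarting pods gives a quick signal about whether they would pass the startup check.
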
## Hotfix workflow
+49
@@ -327,6 +327,55 @@ KUBECONFIG=~/.kube/honeydue.yaml bash deploy-k3s/scripts/03-deploy.sh --skip-bui
Cold-handshake latency goes back up (~440ms first hit) but the API
keeps serving. Switch back when the pooler recovers.
#### Migrate Job fails during deploy
**Symptom**: `03-deploy.sh` aborts at the migrations step:
```
[deploy][error] migrations did not complete cleanly; aborting deploy
```
api/worker pods are NOT updated — they keep running the previous
revision. This is the intentional fail-fast.
**Recovery**:
```bash
# 1. See the failure
kubectl -n honeydue logs job/honeydue-migrate --tail=200
# 2. Common cause: a SQL error in the migration file. Fix the file
# locally, commit, retry the deploy. The Job is idempotent —
# successful prior versions stay applied; only the failed file
# re-runs.
git add migrations/000NNN_*.sql
git commit -m "Fix migration NNN"
git push gitea master
bash deploy-k3s/scripts/03-deploy.sh
# 3. Other cause: Neon down or auth changed. Test direct connection:
DB_PASS=$(kubectl -n honeydue get secret honeydue-secrets \
-o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d)
docker run --rm -e PGPASSWORD="$DB_PASS" postgres:17-alpine \
psql "host=ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
user=neondb_owner dbname=honeyDue sslmode=require" -c "SELECT 1;"
```
**Why no automatic retry**: `backoffLimit: 0` on the Job is deliberate.
A failing migration almost never gets unstuck by retrying; it needs an
operator to look. See [Chapter 17 §27](./17-runbook.md) for the recovery
playbook.
#### api refuses to start: "Schema precondition failed"
**Symptom**: api pods log `Schema precondition failed` and exit
immediately after DB connect.
**Cause**: the `goose_db_version` table is missing or its latest row has
`is_applied=false`. This means the migrate Job either never ran or ran
and was rolled back.
**Recovery**: run the migrate Job manually (see
[Chapter 17 §26](./17-runbook.md)). After it completes successfully,
restart the api Deployment so the failing pods are replaced and rerun
the schema check:
```bash
kubectl -n honeydue rollout restart deploy/api
```
#### Backblaze B2 outage

**Symptom**: image uploads fail; image downloads fail unless cached by
+88 -4
@@ -428,10 +428,94 @@ KUBECONFIG=~/.kube/honeydue.yaml bash deploy-k3s/scripts/03-deploy.sh --skip-bui
``` ```
 The pooler runs in transaction mode so any session-scope feature
-(LISTEN/NOTIFY, session advisory locks for migrations) auto-falls
-through to direct via `MigrateWithLock` opening its own connection.
-But if you ever add session-level features in the data path, they'll
-need the direct endpoint.
+(LISTEN/NOTIFY, session advisory locks) won't work over it. Migrations
+already handle this — the migrate Job script strips `-pooler` from
+`DB_HOST` before invoking goose. If you add new session-level features
+in the data path, they'll need the same workaround.

## 26. Run migrations manually (rare)
Day-to-day, migrations run as part of every `03-deploy.sh`. But
sometimes you want to apply or inspect them outside a deploy:
```bash
# Direct-endpoint DSN (goose's advisory lock won't survive the pooler)
DB_PASS=$(kubectl -n honeydue get secret honeydue-secrets \
-o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d)
export DATABASE_URL="host=ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
port=5432 user=neondb_owner password=$DB_PASS \
dbname=honeyDue sslmode=require"
# What's pending? (read-only; safe to run anytime)
make migrate-status
# Apply pending migrations (or `goose -dir migrations postgres "$DATABASE_URL" up`)
make migrate-up
# Roll back the most recent migration
make migrate-down
# Scaffold a new migration file
make migrate-new name=add_widget_count_to_residences
# → migrations/000002_add_widget_count_to_residences.sql
# Edit, then `make migrate-up` to test, then commit.
```
To run goose from inside the cluster (e.g., when a network policy
blocks your laptop from reaching Neon), use the migrate Job manifest as
a one-shot:
```bash
# Re-runs the migrate Job with the image the api Deployment currently runs
kubectl -n honeydue delete job honeydue-migrate --ignore-not-found
sed "s|image: IMAGE_PLACEHOLDER|image: $(kubectl -n honeydue get deploy api -o jsonpath='{.spec.template.spec.containers[0].image}')|" \
deploy-k3s/manifests/migrate/job.yaml | kubectl apply -f -
kubectl -n honeydue wait --for=condition=complete --timeout=5m job/honeydue-migrate
kubectl -n honeydue logs job/honeydue-migrate
```
## 27. Recover from a failed/dirty migration
If `goose up` fails partway through, the migration file's transaction
rolls back and `goose_db_version` reflects the last *complete*
version. Goose marks no row as "dirty" — that's a golang-migrate
concept. So recovery is just: fix the migration file, re-run.
If you've genuinely corrupted state (dropped tables you shouldn't have,
applied a destructive migration in error):
```bash
# See current goose state
make migrate-status
psql "$DATABASE_URL" -c \
"SELECT version_id, is_applied, tstamp FROM goose_db_version ORDER BY id DESC LIMIT 10;"
# To force the version table back to a known-good number after
# manually fixing the schema:
psql "$DATABASE_URL" -c \
"INSERT INTO goose_db_version (version_id, is_applied, tstamp) VALUES (<N>, true, NOW());"
```
## 28. Bootstrap goose on a fresh clone of the schema
If you create a new Neon branch / dev DB and need to bring it under
goose management:
```bash
export DATABASE_URL="...<the new DB>..."
# Option A: fresh DB, no schema → just run up
make migrate-up
# Option B: schema already populated (e.g., restored from a dump) →
# mark v1 as already-applied
goose -dir migrations postgres "$DATABASE_URL" version # creates table
psql "$DATABASE_URL" -c \
"INSERT INTO goose_db_version (version_id, is_applied, tstamp) VALUES (1, true, NOW());"
```
This is also what was done for the live prod DB at goose-adoption time
(commit `12b2f9d`).
## References
+29
@@ -397,6 +397,35 @@ should reflect reality, not be optimistic.
**Moral**: Healthchecks should be realistic, not aspirational. Know
what your app actually does at startup.
#### Postscript (2026-04-26): the whole `MigrateWithLock` shape was wrong
A few months after the Swarm migration, switching `DB_HOST` to Neon's
`-pooler` endpoint for runtime perf wins broke this code completely:
`pg_advisory_lock` is session-scoped, but PgBouncer transaction-mode
multiplexes statements across backend Postgres sessions, so the lock
appeared to be held but actually wasn't. Pods hung at
"Acquiring migration advisory lock..." and the startup probe killed
them in turn.
After a brief band-aid (route migrations through the direct endpoint;
bump probe to 600s to absorb 5-minute AutoMigrate runs over the slow
direct connection — both reverted), we abandoned the runtime-side
migration story entirely and adopted [pressly/goose](https://github.com/pressly/goose)
in commit `12b2f9d`:
- Migrations run as a one-shot Kubernetes Job before any api/worker
pod rolls. No more in-replica migration, no more advisory lock,
no more startup probe gymnastics.
- `RequireSchemaApplied` checks `goose_db_version` at startup and
refuses to boot on a stale schema — fail-fast for "operator
forgot to run migrate," instead of mysterious runtime errors.
- The startup probe budget went back from the 600s band-aid to 240s
(`failureThreshold: 48`, see Chapter 7). Pods boot in seconds again.
See [Chapter 8 §Schema management](./08-database.md) for the goose
shape. This entire sub-section is preserved as historical context
for why we walked the path we did.
## What we learned

### Docker Swarm is in a bad place in 2026
+13 -11
@@ -69,20 +69,22 @@ Flexible to Full (strict). Verified by:
 - CF edge continues to serve its own Let's Encrypt cert to browsers
 - both layers now TLS-encrypted

-### Migration Job for schema changes
+### ~~Migration Job for schema changes~~ — done (2026-04-26, commit 12b2f9d)

-**Why**: Currently every api pod runs `MigrateWithLock()` on startup,
-serializing on a Postgres advisory lock. Adds 90-240s to cold startup
-and caused bug #13 in Chapter 19.
+**What shipped**: pressly/goose as the migration tool, run as a one-shot
+Kubernetes Job from `deploy-k3s/manifests/migrate/job.yaml` before
+api/worker rollout. The Job uses the api image (goose CLI is baked in
+during the Dockerfile build), strips `-pooler` from `DB_HOST` for the
+direct-endpoint connection migrations need, and exits in seconds when
+there's nothing to apply. `RequireSchemaApplied` in the api/worker
+startup checks `goose_db_version` and fails fast on a stale schema.

-**How**: Create a Kubernetes `Job` resource that runs the api image
-with a `--migrate-only` flag. Job runs once per deploy, completes when
-schema is current. api pods get an initContainer that waits for the
-Job to complete.
+The Go-code-with-`--migrate-only` shape originally proposed here was
+rejected in favor of using the upstream goose binary directly — see
+[Chapter 8 §Schema management](./08-database.md) for the trade-offs.

-Requires Go code change to support `--migrate-only` flag.
-
-**Effort**: 3-4 hours (code + job manifest + testing).
+Pre-goose `MigrateWithLock` is gone; ch19 §13 has the historical
+postmortem context.

 ### Redis password
+11 -1
@@ -173,11 +173,21 @@ suffix. (Chapter 8)
 ## Go + Asynq

 **AutoMigrate**: GORM function that syncs DB schema to Go structs.
-(Chapter 8)
+We used this in production until 2026-04, replaced by goose. Tests
+still use it via `testutil.SetupTestDB`. (Chapter 8)

 **Asynq**: Go library for background job queues. Redis-backed.
 (Chapter 7)

+**goose**: pressly/goose — the SQL migration tool we use in production
+(commit 12b2f9d onward). Migration files live in `migrations/`, one
+file per version with `-- +goose Up` / `-- +goose Down` markers.
+(Chapter 8)
+
+**goose_db_version**: goose's version-tracking table. One row per
+applied migration. `RequireSchemaApplied` reads the latest row at
+api/worker startup to fail fast on a stale schema. (Chapter 8)
+
 **GORM**: Go ORM we use. (Chapter 8)

 **pgx**: Go Postgres driver used by GORM. (Chapter 8)
+5 -1
@@ -65,7 +65,9 @@ Every external link cited anywhere in this book, grouped by topic.
 - [Neon usage-based pricing announcement][neon-blog]
 - [Neon connect from any app][neon-connect]
 - [Postgres advisory locks][pg-locks]
-- [GORM AutoMigrate][gorm-automigrate]
+- [GORM AutoMigrate][gorm-automigrate] (tests only — production migrations use goose)
+- [pressly/goose — SQL migration tool][goose]
+- [Goose documentation][goose-docs]

 ## Backblaze B2

@@ -168,6 +170,8 @@ Every external link cited anywhere in this book, grouped by topic.
 [neon-connect]: https://neon.com/docs/connect/connect-from-any-app
 [pg-locks]: https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS
 [gorm-automigrate]: https://gorm.io/docs/migration.html
+[goose]: https://github.com/pressly/goose
+[goose-docs]: https://pressly.github.io/goose/

 <!-- B2 -->
 [b2-docs]: https://www.backblaze.com/docs/