Adopt pressly/goose for schema migrations

Replaces the previous hand-rolled MigrateWithLock + GORM AutoMigrate path,
which had two compounding problems:

  - AutoMigrate ran on every pod startup (~5 min over the transatlantic
    link) even when no schema changes had landed
  - pg_advisory_lock is session-scoped, which silently fails through
    Neon's pgbouncer transaction-mode pooler; this turns out to be a
    known and documented limitation that bites golang-migrate too

Goose was chosen over golang-migrate (the other heavyweight) because:

  - Goose wraps each migration file in a transaction by default, so a
    failure rolls back cleanly instead of leaving a "dirty" version
    state requiring manual force-reset (golang-migrate's known
    weakness, per its own issue tracker; see #1001 + Atlas's writeup)
  - Goose's locking is opt-in. We don't opt in: migrations run as a
    single Kubernetes Job, which IS the singleton process. No advisory
    lock needed at all.

Layout:

  - migrations/000001_init.sql: schema-only pg_dump of the live Neon
    DB at adoption, stripped of psql-only directives that block goose's
    bookkeeping insert. Pre-goose hand-numbered migrations 002-022 had
    their effects folded into this baseline; deleted from the live tree
    but preserved in git history at 58e6997.
  - Dockerfile installs `goose v3.22.1` at build time and copies the
    binary into the api image. The migrate Job reuses the api image with
    command=goose, so no separate image to build/push/version.
  - deploy-k3s/manifests/migrate/job.yaml: a one-shot Job that strips
    the -pooler segment from DB_HOST (advisory lock won't survive
    pgbouncer transaction-mode), runs `goose up`, exits.
  - deploy-k3s/scripts/03-deploy.sh: deletes any prior Job, applies the
    fresh one, `kubectl wait --for=condition=complete --timeout=10m`,
    then proceeds with api/worker rollout. Job failure aborts the deploy
    before any new app pod sees a stale schema.
  - internal/database/database.go::RequireSchemaApplied checks
    goose_db_version on startup. api/worker refuse to boot if the
    table is missing or its latest row has is_applied=false. This is
    the fail-fast for "operator forgot to run migrate."
  - Makefile: migrate-up / migrate-down / migrate-status / migrate-new
    for local workflow.

Production DB was bootstrapped manually:

    $ goose -dir migrations postgres "$DSN" version   # creates table
    $ psql ... -c "INSERT INTO goose_db_version (version_id, is_applied, tstamp) VALUES (1, true, NOW());"

Smoke test against fresh Postgres locally: 50 user tables created in
284ms via `goose up`, version_id=1 + is_applied=t recorded.

Verified the local goose CLI talks to prod successfully:

    $ goose ... status
        Applied At                  Migration
        =======================================
        Mon Apr 27 03:43:55 2026 -- 000001_init.sql

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -150,66 +150,110 @@ the default 25/10. If we hit connection errors in prod, adjust.

 ## Schema management

-### GORM AutoMigrate
+### goose

-On startup, the Go API's `cmd/api/main.go` calls
-`database.MigrateWithLock()` which:
+We use [pressly/goose](https://github.com/pressly/goose) (pinned in the
+api `Dockerfile` to v3.22.1) for schema migrations. Why goose specifically:

-1. Opens a dedicated Postgres connection
-2. `SELECT pg_advisory_lock(1751412071)` — acquires a session-level
-advisory lock on a hardcoded key
-3. Calls `db.AutoMigrate(&models.*{})` for every GORM model
-4. `SELECT pg_advisory_unlock(...)` via deferred function
-5. Close the connection
+- Each migration file runs inside its own transaction by default —
+partial-failure recovery is built in (no "dirty" state to manually
+unstick like golang-migrate).
+- Locking is opt-in. We *don't* opt in. Migrations run as a single
+Kubernetes Job — that's the singleton process. No advisory-lock vs
+PgBouncer-transaction-mode foot-gun.
+- Plain SQL files. No DSL, no library integration in our Go code.

-The advisory lock serializes migrations across replicas: when 3 api
-pods start simultaneously, one acquires the lock and migrates; the
-others block on the lock. Once the first finishes (≤2s for already-
-migrated schema, up to 90s on first cold boot), the next acquires and
-sees the schema is current (no-op migrate).
+See `docs/deployment/19-postmortem-swarm.md` (Schema Versioning section)
+for the AutoMigrate-with-advisory-lock approach this replaced and why.

-### Why an advisory lock
+### Migration files

-Without it, concurrent `CREATE TABLE IF NOT EXISTS ...` statements from
-multiple replicas would race — Postgres usually handles it, but GORM's
-AutoMigrate also alters tables (adds columns, indexes) which can deadlock
-under concurrency.
+Live under `migrations/`, named `<NNNNNN>_<short_name>.sql`. Each file
+has both the up and down migration in one file, separated by goose
+markers:

-The advisory lock pattern (also used by Rails + Django + Alembic) is the
-canonical solution.
+```sql
+-- +goose Up
+CREATE TABLE example (id bigint PRIMARY KEY);

-### The lock key
+-- +goose Down
+DROP TABLE example;
+```

-`1751412071` is a hardcoded integer in `internal/database/database.go`.
-Arbitrary but unique — as long as nothing else in the Postgres instance
-uses the same advisory lock key, no conflicts.
+Multi-statement constructs (`CREATE FUNCTION`, `DO $$ BEGIN ... END $$`)
+need `-- +goose StatementBegin` / `-- +goose StatementEnd` wrappers
+because goose splits on semicolons by default.

-### First-boot behavior
+`migrations/000001_init.sql` is the baseline — captures every
+table/index/sequence as it existed when goose was adopted, generated
+via `pg_dump --schema-only --no-owner --no-privileges`. The pre-goose
+hand-numbered migrations (002-022 in git history at commit
+58e6997) had their effects folded into this baseline; they're gone
+from the live tree but remain in git for archaeology.

-On a **fresh database** (new Neon project), the first api pod runs
-through every model's `CREATE TABLE` statement. This is ~50 tables for
-honeyDue and takes ~90 seconds.
+### Production migration flow

-On a **warm database** (tables already exist), AutoMigrate is fast —
-typically under 2 seconds. It still runs (GORM checks every model
-against the schema) but finds no work to do.
+`deploy-k3s/scripts/03-deploy.sh` runs migrations as part of every
+deploy, **before** the api/worker rollout starts:

-### Where this bit us
+```
+1. kubectl delete job honeydue-migrate (idempotent)
+2. kubectl apply -f manifests/migrate/job.yaml (with current api image)
+3. kubectl wait --for=condition=complete --timeout=10m job/honeydue-migrate
+4. (only if Job succeeded) kubectl apply -f manifests/api/...
+```

-With 3 api pods starting simultaneously and migrations taking 90s first
-time, the lock queue for the last replica is ~180s. We needed a
-startupProbe grace of 240s to cover this without false restart loops.
-See Chapter 7 §startupProbe and Chapter 19 §MigrateWithLock.
+The Job uses the api image — we install the goose CLI binary at
+`/usr/local/bin/goose` during the api Dockerfile build, so any pod that
+can run api can run goose. No separate image to build/push.

-### Downside: no schema versioning
+The Job's `command` runs `goose ... up` against the **direct**
+(non-pooler) Neon endpoint. Goose's session-scoped advisory lock can't
+survive PgBouncer transaction-mode pooling, so the Job script strips
+the `-pooler` segment from `DB_HOST` before connecting. The api/worker
+runtime continues to use the pooler endpoint for everything else; only
+this one Job needs the direct connection.

-AutoMigrate can only *add* — new tables, new columns, new indexes. It
-won't drop columns, rename them, or change types destructively. For
-those we'd need raw SQL migrations (a tool like `golang-migrate` or
-`dbmate`).
+### Schema-version precondition

-Today: we accept that schema changes are additive-only. When we need
-destructive changes, we'd hand-write them.
+`internal/database/database.go::RequireSchemaApplied()` runs at api and
+worker startup. It queries `goose_db_version` for the highest applied
+version and refuses to start if the table is missing or the latest row
+is `is_applied=false`. This catches "operator forgot to run migrate" as
+a clear boot error instead of a mysterious runtime "relation does not
+exist" later.
+
+### Local migration workflow
+
+```bash
+# Set the direct-endpoint DSN once
+export DATABASE_URL='host=ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
+user=neondb_owner password=$PG_PASSWORD dbname=honeyDue sslmode=require'
+
+make migrate-status # what's pending
+make migrate-up # apply
+make migrate-down # roll back the latest
+make migrate-new name=add_widget_col # scaffold a new SQL file
+```
+
+Each new migration file goes through code review like any other code
+change. The deploy-script Job applies it on the next deploy.
+
+### Bootstrap (one-time, when the prod DB already had a schema)
+
+Bootstrapping a goose-managed DB whose schema already exists requires
+seeding `goose_db_version` so goose treats version 1 as already-applied:
+
+```bash
+# Once. After this, future migrations append normally.
+goose -dir migrations postgres "$DATABASE_URL" version # creates the table
+psql "$DATABASE_URL" -c \
+"INSERT INTO goose_db_version (version_id, is_applied, tstamp) VALUES (1, true, NOW());"
+```
+
+This was done for honeyDue's prod Neon project at the time of goose
+adoption — no need to repeat unless we set up a fresh DB from a
+schema dump.

 ## What's in the database
