Adopt pressly/goose for schema migrations

Replaces the previous hand-rolled MigrateWithLock + GORM AutoMigrate path,
which had two compounding problems:
- AutoMigrate ran on every pod startup (~5 min over the transatlantic
  link) even when no schema changes had landed
- pg_advisory_lock is session-scoped, so it silently fails through
  Neon's pgbouncer transaction-mode pooler, which can route each
  transaction to a different server connection. This is a known,
  documented limitation that affects golang-migrate too

Goose was chosen over golang-migrate (the other heavyweight) because:
- Goose wraps each migration file in a transaction by default, so a
  failure rolls back cleanly instead of leaving a "dirty" version
  state requiring manual force-reset (golang-migrate's known
  weakness, per its own issue tracker — see #1001 + Atlas's writeup)
- Goose's locking is opt-in. We don't opt in: migrations run as a
  single Kubernetes Job, which IS the singleton process. No advisory
  lock needed at all.

Layout:
- migrations/000001_init.sql — schema-only pg_dump of the live Neon
  DB at adoption, stripped of psql-only directives that block goose's
  bookkeeping insert. Pre-goose hand-numbered migrations 002-022 had
  their effects folded into this baseline; deleted from the live tree
  but preserved in git history at 58e6997.
- Dockerfile installs `goose v3.22.1` at build time and copies the
  binary into the api image. The migrate Job reuses the api image with
  command=goose, so no separate image to build/push/version.
- deploy-k3s/manifests/migrate/job.yaml: a one-shot Job that strips
  the -pooler segment from DB_HOST (advisory lock won't survive
  pgbouncer transaction-mode), runs `goose up`, exits.
- deploy-k3s/scripts/03-deploy.sh: deletes any prior Job, applies the
  fresh one, `kubectl wait --for=condition=complete --timeout=10m`,
  then proceeds with api/worker rollout. Job failure aborts the deploy
  before any new app pod sees a stale schema.
- internal/database/database.go::RequireSchemaApplied checks
  goose_db_version on startup. api/worker refuse to boot if the
  table is missing or its latest row has is_applied=false — the
  fail-fast for "operator forgot to run migrate."
- Makefile: migrate-up / migrate-down / migrate-status / migrate-new
  for local workflow.
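
The host rewrite that the migrate Job performs can be sketched in Go. This is an illustration only: the committed job.yaml is not shown here, and the helper name `directHost` and the example hostnames are hypothetical. It assumes the pooled hostname differs from the direct one only by a `-pooler` suffix on the first DNS label.

```go
package main

import (
	"fmt"
	"strings"
)

// directHost returns the non-pooled form of a hostname by removing a
// "-pooler" suffix from its first DNS label. Hosts without that suffix
// are returned unchanged. Hypothetical helper, not part of the commit.
func directHost(host string) string {
	label, rest, hasRest := strings.Cut(host, ".")
	label = strings.TrimSuffix(label, "-pooler")
	if !hasRest {
		return label
	}
	return label + "." + rest
}

func main() {
	// Made-up hostnames for illustration.
	fmt.Println(directHost("ep-example-123456-pooler.us-east-2.aws.neon.tech"))
	fmt.Println(directHost("ep-example-123456.us-east-2.aws.neon.tech"))
}
```

Both calls print the same direct hostname, so applying the rewrite to an already-direct DB_HOST would be harmless.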

Production DB was bootstrapped manually:
  $ goose -dir migrations postgres "$DSN" version  # creates table
  $ psql ... -c "INSERT INTO goose_db_version (version_id, is_applied, tstamp) VALUES (1, true, NOW());"

Smoke test against fresh Postgres locally: 50 user tables created in
284ms via `goose up`, version_id=1 + is_applied=t recorded.
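
The startup precondition reads that same row. A minimal sketch of the decision it makes, separated from the database query so it can be exercised on its own, might look like the following. The names `versionRow` and `checkSchema` and the `found` flag are hypothetical and not taken from the commit. Version 0 is assumed here to mean that no real migration has been applied yet.

```go
package main

import (
	"errors"
	"fmt"
)

// versionRow mirrors the two goose_db_version columns the check reads.
type versionRow struct {
	VersionID int64
	IsApplied bool
}

// checkSchema decides whether the app may boot, given the newest
// goose_db_version row. found is false when the query returned no row.
func checkSchema(row versionRow, found bool) error {
	switch {
	case !found:
		return errors.New("goose_db_version is empty: run goose up")
	case !row.IsApplied:
		return fmt.Errorf("latest row has is_applied=false at version %d", row.VersionID)
	case row.VersionID < 1:
		return errors.New("no migration applied yet: run goose up")
	}
	return nil
}

func main() {
	fmt.Println(checkSchema(versionRow{}, false))
	fmt.Println(checkSchema(versionRow{VersionID: 1, IsApplied: true}, true))
}
```

Testing for the no-row case first keeps the two failure messages distinct. In the committed RequireSchemaApplied, a zero-row result appears to reach the is_applied=false branch first, because the zero-value struct has IsApplied=false. With GORM, one way to obtain the `found` flag would be to check `RowsAffected` on the result of `Raw(...).Scan(...)`.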

Verified the local goose CLI talks to prod successfully:
  $ goose ... status
  Applied At                  Migration
  =======================================
  Mon Apr 27 03:43:55 2026 -- 000001_init.sql

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trey t
2026-04-26 22:46:36 -05:00
parent d96f317d20
commit 12b2f9d43b
53 changed files with 3716 additions and 968 deletions
+30 -41
@@ -19,11 +19,6 @@ import (
 	"github.com/uptrace/opentelemetry-go-extra/otelgorm"
 )
-// migrationAdvisoryLockKey is the pg_advisory_lock key that serializes
-// Migrate() across API replicas booting in parallel. Value is arbitrary but
-// stable ("hdmg" as bytes = honeydue migration).
-const migrationAdvisoryLockKey int64 = 0x68646d67
 // zerologGormWriter adapts zerolog for GORM's logger interface
 type zerologGormWriter struct{}
@@ -189,52 +184,46 @@ func Paginate(page, pageSize int) func(db *gorm.DB) *gorm.DB {
 	}
 }
-// MigrateWithLock runs Migrate() under a Postgres session-level advisory lock
-// so that multiple API replicas booting in parallel don't race on AutoMigrate.
-// On non-Postgres dialects (sqlite in tests) it falls through to Migrate().
-func MigrateWithLock() error {
+// RequireSchemaApplied verifies that goose's version table exists and has
+// at least one applied entry. This is the fail-fast that runs at api/worker
+// boot: if the operator forgot to run the migrate Job, the pod refuses to
+// start with a clear error instead of throwing mysterious "relation does
+// not exist" errors deep in a request handler.
+//
+// On non-Postgres dialects (sqlite in tests) this is a no-op — tests use
+// AutoMigrate via testutil.SetupTestDB to create a fresh schema per run.
+// goose isn't involved in the test path.
+func RequireSchemaApplied() error {
 	if db == nil {
 		return fmt.Errorf("database not initialised")
 	}
 	if db.Dialector.Name() != "postgres" {
-		return Migrate()
+		return nil
 	}
-	sqlDB, err := db.DB()
+	// goose_db_version stores one row per applied migration, not a single
+	// "current version" row — so we look for the highest version_id with
+	// is_applied=true. ORDER BY id DESC LIMIT 1 also catches the case where
+	// the table exists but is empty (no rows returned, scan leaves Version
+	// at zero).
+	type migrationRow struct {
+		VersionID int64 `gorm:"column:version_id"`
+		IsApplied bool `gorm:"column:is_applied"`
+	}
+	var row migrationRow
+	err := db.Raw(`SELECT version_id, is_applied FROM goose_db_version ORDER BY id DESC LIMIT 1`).Scan(&row).Error
 	if err != nil {
-		return fmt.Errorf("get underlying sql.DB: %w", err)
+		return fmt.Errorf("goose_db_version check failed (run the migrate Job to bootstrap): %w", err)
 	}
-	// Give ourselves up to 5 min to acquire the lock — long enough for a
-	// slow migration on a peer replica, short enough to fail fast if Postgres
-	// is hung.
-	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
-	defer cancel()
-	conn, err := sqlDB.Conn(ctx)
-	if err != nil {
-		return fmt.Errorf("acquire dedicated migration connection: %w", err)
+	if !row.IsApplied {
+		return fmt.Errorf("goose_db_version latest row is_applied=false at version=%d — last migration was rolled back or aborted; investigate before starting", row.VersionID)
 	}
-	defer conn.Close()
-	log.Info().Int64("lock_key", migrationAdvisoryLockKey).Msg("Acquiring migration advisory lock...")
-	if _, err := conn.ExecContext(ctx, "SELECT pg_advisory_lock($1)", migrationAdvisoryLockKey); err != nil {
-		return fmt.Errorf("pg_advisory_lock: %w", err)
+	if row.VersionID < 1 {
+		return fmt.Errorf("goose_db_version is empty — run goose up (or seed a row marking version 1 as applied if the schema already exists)")
 	}
-	log.Info().Msg("Migration advisory lock acquired")
-	defer func() {
-		// Unlock with a fresh context — the outer ctx may have expired.
-		unlockCtx, unlockCancel := context.WithTimeout(context.Background(), 10*time.Second)
-		defer unlockCancel()
-		if _, err := conn.ExecContext(unlockCtx, "SELECT pg_advisory_unlock($1)", migrationAdvisoryLockKey); err != nil {
-			log.Warn().Err(err).Msg("Failed to release migration advisory lock (session close will also release)")
-		} else {
-			log.Info().Msg("Migration advisory lock released")
-		}
-	}()
-	return Migrate()
+	log.Info().Int64("schema_version", row.VersionID).Msg("Schema precondition satisfied")
+	return nil
 }
 // Migrate runs database migrations for all models