Adopt pressly/goose for schema migrations

Replaces the previous hand-rolled MigrateWithLock + GORM AutoMigrate
path, which had two compounding problems:

- AutoMigrate ran on every pod startup (~5 min over the transatlantic
  link) even when no schema changes had landed.
- pg_advisory_lock is session-scoped, so it silently fails through
  Neon's pgbouncer transaction-mode pooler. This is a known and
  documented limitation that bites golang-migrate too.

Goose was chosen over golang-migrate (the other heavyweight) because:

- Goose wraps each migration file in a transaction by default, so a
  failure rolls back cleanly instead of leaving a "dirty" version state
  requiring manual force-reset (golang-migrate's known weakness, per its
  own issue tracker; see #1001 + Atlas's writeup).
- Goose's locking is opt-in. We don't opt in: migrations run as a single
  Kubernetes Job, which IS the singleton process. No advisory lock
  needed at all.

Layout:

- migrations/000001_init.sql: schema-only pg_dump of the live Neon DB at
  adoption, stripped of psql-only directives that block goose's
  bookkeeping insert. Pre-goose hand-numbered migrations 002-022 had
  their effects folded into this baseline; they are deleted from the
  live tree but preserved in git history at 58e6997.
- Dockerfile installs `goose v3.22.1` at build time and copies the
  binary into the api image. The migrate Job reuses the api image with
  command=goose, so there is no separate image to build/push/version.
- deploy-k3s/manifests/migrate/job.yaml: a one-shot Job that strips the
  -pooler segment from DB_HOST (session-scoped state such as an advisory
  lock won't survive pgbouncer transaction-mode), runs `goose up`, exits.
- deploy-k3s/scripts/03-deploy.sh: deletes any prior Job, applies the
  fresh one, runs `kubectl wait --for=condition=complete --timeout=10m`,
  then proceeds with the api/worker rollout. A Job failure aborts the
  deploy before any new app pod sees a stale schema.
- internal/database/database.go::RequireSchemaApplied checks
  goose_db_version on startup. api/worker refuse to boot if the table is
  missing or its latest row has is_applied=false. This is the fail-fast
  for "operator forgot to run migrate."
- Makefile: migrate-up / migrate-down / migrate-status / migrate-new for
  the local workflow.

Production DB was bootstrapped manually:

    $ goose -dir migrations postgres "$DSN" version    # creates table
    $ psql ... -c "INSERT INTO goose_db_version (version_id, is_applied, tstamp) VALUES (1, true, NOW());"

Smoke test against fresh Postgres locally: 50 user tables created in
284ms via `goose up`, version_id=1 + is_applied=t recorded.

Verified the local goose CLI talks to prod successfully:

    $ goose ... status
        Applied At                  Migration
        =======================================
        Mon Apr 27 03:43:55 2026 -- 000001_init.sql

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 22:46:36 -05:00
parent d96f317d20
commit 12b2f9d43b
53 changed files with 3716 additions and 968 deletions
@@ -0,0 +1,75 @@
+# One-shot migration Job. Runs goose against Neon's *direct* (non-pooler)
+# endpoint, applies any pending migrations from /app/migrations (baked into
+# the api image), exits.
+#
+# 03-deploy.sh deletes any prior Job, applies this one, waits for completion
+# with `kubectl wait --for=condition=complete`, and rolls api/worker only
+# after the Job succeeds. A Job failure aborts the whole deploy.
+#
+# We reuse the api image rather than build a separate one — the api Dockerfile
+# already installs the goose CLI to /usr/local/bin/goose and copies the
+# migrations directory to /app/migrations.
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: honeydue-migrate
+  namespace: honeydue
+  labels:
+    app.kubernetes.io/name: migrate
+    app.kubernetes.io/part-of: honeydue
+spec:
+  backoffLimit: 0                  # fail fast — no silent retries on a bad migration
+  ttlSecondsAfterFinished: 86400   # keep finished Job for 24h so logs are inspectable
+  template:
+    metadata:
+      labels:
+        app.kubernetes.io/name: migrate
+        app.kubernetes.io/part-of: honeydue
+    spec:
+      restartPolicy: Never
+      imagePullSecrets:
+        - name: ghcr-credentials
+      securityContext:
+        runAsNonRoot: true
+        runAsUser: 1000
+        runAsGroup: 1000
+        seccompProfile:
+          type: RuntimeDefault
+      containers:
+        - name: goose
+          image: IMAGE_PLACEHOLDER  # Replaced by 03-deploy.sh — same as api
+          command: ["/bin/sh", "-c"]
+          # DB_HOST in the ConfigMap points at the -pooler endpoint for runtime.
+          # Session-scoped state (e.g. an advisory lock) can't survive PgBouncer
+          # transaction-mode, so we strip the -pooler segment for migrations.
+          # `set -e` so any sub-command failure exits non-zero.
+          args:
+            - |
+              set -e
+              DIRECT_HOST=$(echo "$DB_HOST" | sed 's/-pooler\.\(.*\)$/.\1/')
+              echo "[migrate] running goose up against $DIRECT_HOST"
+              exec /usr/local/bin/goose \
+                -dir /app/migrations \
+                postgres "host=$DIRECT_HOST port=$DB_PORT user=$POSTGRES_USER password=$POSTGRES_PASSWORD dbname=$POSTGRES_DB sslmode=$DB_SSLMODE" \
+                up
+          securityContext:
+            allowPrivilegeEscalation: false
+            readOnlyRootFilesystem: true
+            capabilities:
+              drop: ["ALL"]
+          envFrom:
+            - configMapRef:
+                name: honeydue-config
+          env:
+            - name: POSTGRES_PASSWORD
+              valueFrom:
+                secretKeyRef:
+                  name: honeydue-secrets
+                  key: POSTGRES_PASSWORD
+          resources:
+            requests:
+              cpu: 100m
+              memory: 64Mi
+            limits:
+              cpu: 500m
+              memory: 256Mi
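
The host rewrite in the Job's args can be checked in isolation. The following is a minimal sketch that wraps the same sed expression in a function; the helper name `direct_host` and both hostnames are hypothetical and used only for illustration.

```shell
#!/bin/sh
# Sketch of the DB_HOST rewrite performed by the migrate Job.
# direct_host is a hypothetical helper name; the hostnames are made up.
set -e

direct_host() {
  # Remove a "-pooler" segment that sits directly before a dot.
  # Hosts without that segment pass through unchanged.
  printf '%s\n' "$1" | sed 's/-pooler\.\(.*\)$/.\1/'
}

# Pooled hostname: the -pooler segment is stripped.
direct_host "ep-example-123456-pooler.us-east-2.aws.neon.tech"
# prints: ep-example-123456.us-east-2.aws.neon.tech

# Already-direct hostname: left as-is.
direct_host "db.internal.example"
# prints: db.internal.example
```

Because a host without the -pooler segment is passed through untouched, the same Job should also work when DB_HOST already points at a direct endpoint, for example a local Postgres.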