Files
honeyDueAPI/docs/deployment/08-database.md
T
Trey t 12b2f9d43b
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Adopt pressly/goose for schema migrations
Replaces the previous hand-rolled MigrateWithLock + GORM AutoMigrate path,
which had two compounding problems:
- AutoMigrate ran on every pod startup (~5 min over the transatlantic
  link) even when no schema changes had landed
- pg_advisory_lock is session-scoped, which silently fails through
  Neon's pgbouncer transaction-mode pooler — turns out this is a
  known and documented limitation that bites golang-migrate too

Goose was chosen over golang-migrate (the other heavyweight) because:
- Goose wraps each migration file in a transaction by default, so a
  failure rolls back cleanly instead of leaving a "dirty" version
  state requiring manual force-reset (golang-migrate's known
  weakness, per its own issue tracker — see #1001 + Atlas's writeup)
- Goose's locking is opt-in. We don't opt in: migrations run as a
  single Kubernetes Job, which IS the singleton process. No advisory
  lock needed at all.

Layout:
- migrations/000001_init.sql — schema-only pg_dump of the live Neon
  DB at adoption, stripped of psql-only directives that block goose's
  bookkeeping insert. Pre-goose hand-numbered migrations 002-022 had
  their effects folded into this baseline; deleted from the live tree
  but preserved in git history at 58e6997.
- Dockerfile installs `goose v3.22.1` at build time and copies the
  binary into the api image. The migrate Job reuses the api image with
  command=goose, so no separate image to build/push/version.
- deploy-k3s/manifests/migrate/job.yaml: a one-shot Job that strips
  the -pooler segment from DB_HOST (advisory lock won't survive
  pgbouncer transaction-mode), runs `goose up`, exits.
- deploy-k3s/scripts/03-deploy.sh: deletes any prior Job, applies the
  fresh one, `kubectl wait --for=condition=complete --timeout=10m`,
  then proceeds with api/worker rollout. Job failure aborts the deploy
  before any new app pod sees a stale schema.
- internal/database/database.go::RequireSchemaApplied checks
  goose_db_version on startup. api/worker refuse to boot if the
  table is missing or its latest row has is_applied=false — the
  fail-fast for "operator forgot to run migrate."
- Makefile: migrate-up / migrate-down / migrate-status / migrate-new
  for local workflow.

Production DB was bootstrapped manually:
  $ goose -dir migrations postgres "$DSN" version  # creates table
  $ psql ... -c "INSERT INTO goose_db_version (version_id, is_applied, tstamp) VALUES (1, true, NOW());"

Smoke test against fresh Postgres locally: 50 user tables created in
284ms via `goose up`, version_id=1 + is_applied=t recorded.

Verified the local goose CLI talks to prod successfully:
  $ goose ... status
  Applied At                  Migration
  =======================================
  Mon Apr 27 03:43:55 2026 -- 000001_init.sql

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 22:46:36 -05:00

15 KiB
Raw Blame History

08 — Database (Neon Postgres)

Summary

Authoritative user data lives in a Neon-managed Postgres database in AWS us-east-1. Connections use TLS (DB_SSLMODE=require). Schema is managed via GORM AutoMigrate inside the api binary, coordinated across replicas by a Postgres advisory lock to prevent concurrent migration attempts.

Why Neon

Decision matrix

At deploy time we considered:

Option Setup effort Monthly cost Backup/PITR Scale ceiling Notes
Neon Launch Zero (managed) $5-15 Included Large Picked
Postgres on a Hetzner VPS High $8 (VPS) Manual Medium More ops
AWS RDS Medium $30+ Included Huge Overkill, expensive
Supabase Free Zero $0 Limited Small Free tier has quota limits
CNPG on our k3s High (Helm) $0 (using cluster) Self-rolled Medium Operational burden

Neon Launch won on:

  • Serverless: scales compute to zero when idle (cheap)
  • Branch databases: we can create dev/staging branches from prod in seconds
  • Connection pooling built-in: PgBouncer on the hostname suffix -pooler
  • Point-in-time recovery included (paid tier)
  • Pay-as-you-go with a $5 minimum — fits a bootstrapped app

Connection details

Field Value
Hostname ep-floral-truth-amttbc5a-pooler.c-5.us-east-1.aws.neon.tech
Port 5432
Username neondb_owner
Database honeyDue (case-sensitive!)
TLS mode require (enforced by Neon; app pg driver verifies)
Branch production (Neon's concept — isolated DB within the project)

The database name is case-sensitive

Postgres identifiers are lowercase unless quoted. Neon's UI created the database as "honeyDue" (quoted, camelCase preserved). In prod.env / ConfigMap we must use exactly POSTGRES_DB=honeyDue — lowercase honeydue gets a database "honeydue" does not exist error. This bit us during the initial Swarm deploy (Chapter 19 §Neon DB name).

Connection pooling

Why it matters

Postgres is memory-hungry per connection (~5-10 MB each). 3 api replicas × DB_MAX_OPEN_CONNS=25 = up to 75 direct Postgres connections. Add the worker's 25. Neon's free tier caps at 100 concurrent connections; paid tiers much higher.

PgBouncer on Neon

Neon provides a built-in PgBouncer at the -pooler subdomain. The non-pooler endpoint (ep-floral-truth-amttbc5a.c-5.us-east-1...) is the direct compute endpoint and connects straight to Postgres, paying the full TCP+TLS+startup handshake on every cold connection. The -pooler endpoint multiplexes through PgBouncer in Neon's infrastructure.

We use the -pooler endpoint because the direct endpoint paid ~440ms per cold handshake on a transatlantic link, visible as 1500ms-tail spikes in /api/tasks/ traces. The pooler keeps backend Postgres connections warm in Neon's data center, so the only latency our Go pods see is one TCP+TLS to PgBouncer (already warm via our pool) plus one query round-trip.

Modes PgBouncer supports:

  • session — one server connection held per client session (transparent)
  • transaction — server connection released after each transaction (high-throughput)
  • statement — per-statement (most aggressive; breaks many features)

Neon's pooler runs in transaction mode. This is compatible with GORM out of the box (we don't use session-level features like LISTEN/NOTIFY or session-scope advisory locks). Note: database.MigrateWithLock() needs the direct (non-pooler) endpoint because session-level advisory locks don't survive PgBouncer's per-transaction cycling — but the migration helper opens its own ad-hoc connection bypassing the configured pool, so this happens automatically. See MigrateWithLock in internal/database/database.go.

Connection pool settings

In config.yaml (rendered into ConfigMap → env vars):

database:
  max_open_conns: 25
  max_idle_conns: 20
  max_lifetime: "1800s"
  max_idle_time: "0s"

These map to Go database/sql pool settings:

  • MaxOpenConns: 25 — at most 25 concurrent connections per replica.
  • MaxIdleConns: 20 — keep up to 20 warm connections per replica ready to reuse. Bumped from 10 because the pooler tolerates many client connections cheaply, and the cost of a cold handshake (~440ms transatlantic) is far higher than the cost of holding an idle connection.
  • MaxLifetime: 1800s — recycle connections after 30 min. Bumped from 600s; with the pooler keeping things warm, longer lifetime reduces churn.
  • MaxIdleTime: 0s — never close idle connections. Lifetime drives recycling instead.

Pool warm-up at boot

database.Connect() issues 20 parallel PingContext calls immediately after opening the pool. This pre-establishes MaxIdleConns connections to the pooler so the first user request doesn't pay any handshake.

The warm-up is bounded by one round-trip time (~440ms cold), not one round-trip per connection — pings run concurrently. Confirmed in pod logs at boot:

{"level":"info","requested":20,"warmed":20,"message":"DB pool warm-up complete"}

If warm-up partially fails (e.g., 18/20 succeed), the pod still starts; the pool fills the rest under traffic. Failure to ping at all would be caught by the synchronous sqlDB.Ping() immediately before, which is fatal.

Worst-case connection count

3 api + 1 worker replicas × 25 conns = 100 peak. Right at Neon free tier's ceiling, with zero margin. This is a real risk — a spike that saturates the pool on all replicas simultaneously would exhaust Neon's limit.

Mitigations to consider:

  • Drop DB_MAX_OPEN_CONNS to 15 → 60 peak. Safe on free tier.
  • Upgrade to Neon Scale plan (1000+ connections).
  • Rely on Neon's PgBouncer to multiplex — the raw backend connections to Postgres-proper are pooled, not our TCP connections to Neon.

Currently we trust Neon's pooler to handle the multiplexing and run with the default 25/10. If we hit connection errors in prod, adjust.

Schema management

goose

We use pressly/goose (pinned in the api Dockerfile to v3.22.1) for schema migrations. Why goose specifically:

  • Each migration file runs inside its own transaction by default — partial-failure recovery is built in (no "dirty" state to manually unstick like golang-migrate).
  • Locking is opt-in. We don't opt in. Migrations run as a single Kubernetes Job — that's the singleton process. No advisory-lock vs PgBouncer-transaction-mode foot-gun.
  • Plain SQL files. No DSL, no library integration in our Go code.

See docs/deployment/19-postmortem-swarm.md (Schema Versioning section) for the AutoMigrate-with-advisory-lock approach this replaced and why.

Migration files

Live under migrations/, named <NNNNNN>_<short_name>.sql. Each file has both the up and down migration in one file, separated by goose markers:

-- +goose Up
CREATE TABLE example (id bigint PRIMARY KEY);

-- +goose Down
DROP TABLE example;

Multi-statement constructs (CREATE FUNCTION, DO $$ BEGIN ... END $$) need -- +goose StatementBegin / -- +goose StatementEnd wrappers because goose splits on semicolons by default.

migrations/000001_init.sql is the baseline — captures every table/index/sequence as it existed when goose was adopted, generated via pg_dump --schema-only --no-owner --no-privileges. The pre-goose hand-numbered migrations (002-022 in git history at commit 58e6997) had their effects folded into this baseline; they're gone from the live tree but remain in git for archaeology.

Production migration flow

deploy-k3s/scripts/03-deploy.sh runs migrations as part of every deploy, before the api/worker rollout starts:

1. kubectl delete job honeydue-migrate (idempotent)
2. kubectl apply -f manifests/migrate/job.yaml (with current api image)
3. kubectl wait --for=condition=complete --timeout=10m job/honeydue-migrate
4. (only if Job succeeded) kubectl apply -f manifests/api/...

The Job uses the api image — we install the goose CLI binary at /usr/local/bin/goose during the api Dockerfile build, so any pod that can run api can run goose. No separate image to build/push.

The Job's command runs goose ... up against the direct (non-pooler) Neon endpoint. Goose's session-scoped advisory lock can't survive PgBouncer transaction-mode pooling, so the Job script strips the -pooler segment from DB_HOST before connecting. The api/worker runtime continues to use the pooler endpoint for everything else; only this one Job needs the direct connection.

Schema-version precondition

internal/database/database.go::RequireSchemaApplied() runs at api and worker startup. It queries goose_db_version for the highest applied version and refuses to start if the table is missing or the latest row is is_applied=false. This catches "operator forgot to run migrate" as a clear boot error instead of a mysterious runtime "relation does not exist" later.

Local migration workflow

# Set the direct-endpoint DSN once
export DATABASE_URL='host=ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
                     user=neondb_owner password=$PG_PASSWORD dbname=honeyDue sslmode=require'

make migrate-status                  # what's pending
make migrate-up                      # apply
make migrate-down                    # roll back the latest
make migrate-new name=add_widget_col # scaffold a new SQL file

Each new migration file goes through code review like any other code change. The deploy-script Job applies it on the next deploy.

Bootstrap (one-time, when the prod DB already had a schema)

Bootstrapping a goose-managed DB whose schema already exists requires seeding goose_db_version so goose treats version 1 as already-applied:

# Once. After this, future migrations append normally.
goose -dir migrations postgres "$DATABASE_URL" version  # creates the table
psql "$DATABASE_URL" -c \
  "INSERT INTO goose_db_version (version_id, is_applied, tstamp) VALUES (1, true, NOW());"

This was done for honeyDue's prod Neon project at the time of goose adoption — no need to repeat unless we set up a fresh DB from a schema dump.

What's in the database

Major tables (see honeyDueAPI-go/internal/models/):

Table Purpose
auth_user Users (Django legacy name kept for compatibility)
user_userprofile Profile data
authtoken_token API auth tokens
residence_residence Properties users manage
task_task Maintenance tasks
task_taskcompletion Task completion history
contractor_contractor Contractor contacts
documents_document Document records (files in B2)
notification_notification In-app notifications
subscription_usersubscription IAP subscriptions
admin_users Next.js admin panel users

See honeyDueAPI-go/docs/TASK_LOGIC_ARCHITECTURE.md for the task logic model details.

Backup and recovery

Neon's built-in

Neon Launch includes point-in-time recovery within the last 24h (longer on Scale plan). To restore:

  1. Go to Neon console → project → Backups
  2. Create a branch from a timestamp
  3. Point the app at the new branch (change DB_HOST in our ConfigMap)

Done. No tape-wrangling.

What we don't have

  • Off-site backup (if Neon itself is compromised, we have no exfil). A nightly pg_dump to Backblaze B2 would close this gap. TODO (Chapter 20).
  • Tested DR drills. We've never actually restored from a Neon backup into a new branch and pointed the app at it. Should be routine; hasn't been exercised.

Migrations from old MyCrib/Casera data

honeyDue originally ran on a Django codebase (MyCrib / Casera-era). The schema inherits Django's naming (app_model table names, _id suffix foreign keys). The Go app's GORM models have TableName() methods that preserve this:

func (Task) TableName() string { return "task_task" }

This isn't ideal (GORM's default tasks would be cleaner), but changing would require a migration that renames every table — more risk than value.

Neon regions

Neon's default region for new projects is aws-us-east-1 (Virginia). Our DB is there. Latency from Nuremberg to us-east-1 is ~108ms one-way TCP-level (verified by nc -z -w 5 from hetzner1), so ~220ms RTT through Neon's pooler stack.

This is the slowest hop in our data flow. Every api request that needs a DB query pays this latency at least once. Sub-millisecond Postgres execution time (verified via EXPLAIN ANALYZE: 0.04-0.34 ms on every hot path) means wall-clock latency = network + Neon proxy overhead.

Optimizations layered on top to minimize round trips

We don't move the DB region (yet) but we cut the number of RTTs per request via:

  1. Auth caching (Chapter 7 §Redis) — token + user lookups served from Redis (1-hour TTL) and per-pod in-memory cache (5-min TTL). On warm cache: 0 SQL round-trips for auth.
  2. JOIN consolidation — two-step find residence-IDs → find tasks IN ids collapsed into a single query with a Postgres subquery. One RTT instead of two.
  3. Single-query auth — token + user fetched in one INNER JOIN instead of GORM's two-query Preload pattern.
  4. Residence-IDs Redis cache — cached per user with 5-min TTL, invalidated on Create/Delete/Join/Remove. Saves 1 RTT per /api/documents/, /api/contractors/, /api/residences/summary/ request.

After these, a fully-warm /api/tasks/ is 1 SQL round-trip total (~220ms wall-clock). Verified via Jaeger trace — see Chapter 15.

When this still matters

  • Any cold-cache request still pays 2-3 RTTs (~500-700ms).
  • Pod startup pays 1 RTT × 20 (warm-up), but that runs in parallel: ~440ms one-shot.

Long-term fix: migrate Neon to aws-eu-central-1 (Frankfurt) — drops RTT to ~5ms and brings warm-cache requests under 50ms. Tracked in docs/observability-plan.md and Chapter 18 §migration triggers.

Environment variables the app reads

From ConfigMap:

Var Purpose
DB_HOST Neon pooler hostname (-pooler suffix)
DB_PORT 5432
POSTGRES_USER neondb_owner
POSTGRES_DB honeyDue
DB_SSLMODE require
DB_MAX_OPEN_CONNS 25
DB_MAX_IDLE_CONNS 20
DB_MAX_LIFETIME 1800s
DB_MAX_IDLE_TIME 0s (never close idle)

From Secret (honeydue-secrets):

Var Purpose
POSTGRES_PASSWORD Neon DB password

Operator cheat sheet

# Connect to Neon from workstation (requires psql + the password)
PGPASSWORD="<pw>" psql -h ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
  -U neondb_owner -d honeyDue

# From a pod (lets you debug against the actual in-cluster network path)
kubectl exec -n honeydue -it deploy/api -- sh
# inside the pod (no psql by default, but wget + JSON API works)
wget -qO- http://127.0.0.1:8000/api/health/

# See current migration state (no direct CLI, but the api logs show it)
kubectl logs -n honeydue deploy/api | grep -i migration

# See active connections (run against Neon)
SELECT count(*), usename, state, application_name
FROM pg_stat_activity
GROUP BY usename, state, application_name;

References