Update the deployment book and glossary to reflect the goose-based schema migration flow shipped in 12b2f9d/0f7450a: - ch07: clarify startup probe assumes migrations ran out-of-band - ch08: drop AutoMigrate-with-advisory-lock prose; describe goose Job - ch12: pod startup checks goose_db_version, no longer runs migrations - ch14: document the Job→wait→roll deploy gate and how to debug failures - ch16: add "Migrate Job fails during deploy" + "Schema precondition failed" failure modes - ch17: new runbook entries §26 (run migrations manually), §27 (recover from failed/dirty migration), §28 (bootstrap goose on fresh clone) - ch19: postscript on §13 noting MigrateWithLock approach is superseded - ch20: mark "Migration Job for schema changes" task done - glossary: add `goose` and `goose_db_version`; flag AutoMigrate as tests-only - references: add goose links; flag AutoMigrate as tests-only
16 KiB
08 — Database (Neon Postgres)
Summary
Authoritative user data lives in a Neon-managed Postgres database in AWS
us-east-1. Connections use TLS (DB_SSLMODE=require). Schema is managed
via pressly/goose running as a
one-shot Kubernetes Job before every api/worker rollout. See §Schema
management below for the full shape; ch19 §13 documents the previous
in-replica AutoMigrate approach this replaced.
Why Neon
Decision matrix
At deploy time we considered:
| Option | Setup effort | Monthly cost | Backup/PITR | Scale ceiling | Notes |
|---|---|---|---|---|---|
| Neon Launch | Zero (managed) | $5-15 | Included | Large | Picked |
| Postgres on a Hetzner VPS | High | $8 (VPS) | Manual | Medium | More ops |
| AWS RDS | Medium | $30+ | Included | Huge | Overkill, expensive |
| Supabase Free | Zero | $0 | Limited | Small | Free tier has quota limits |
| CNPG on our k3s | High (Helm) | $0 (using cluster) | Self-rolled | Medium | Operational burden |
Neon Launch won on:
- Serverless: scales compute to zero when idle (cheap)
- Branch databases: we can create dev/staging branches from prod in seconds
- Connection pooling built-in: PgBouncer on the hostname suffix
-pooler - Point-in-time recovery included (paid tier)
- Pay-as-you-go with a $5 minimum — fits a bootstrapped app
Connection details
| Field | Value |
|---|---|
| Hostname | ep-floral-truth-amttbc5a-pooler.c-5.us-east-1.aws.neon.tech |
| Port | 5432 |
| Username | neondb_owner |
| Database | honeyDue (case-sensitive!) |
| TLS mode | require (enforced by Neon; app pg driver verifies) |
| Branch | production (Neon's concept — isolated DB within the project) |
The database name is case-sensitive
Postgres identifiers are lowercase unless quoted. Neon's UI created the
database as "honeyDue" (quoted, camelCase preserved). In prod.env /
ConfigMap we must use exactly POSTGRES_DB=honeyDue — lowercase
honeydue gets a database "honeydue" does not exist error. This bit
us during the initial Swarm deploy (Chapter 19 §Neon DB name).
Connection pooling
Why it matters
Postgres is memory-hungry per connection (~5-10 MB each). 3 api replicas
× DB_MAX_OPEN_CONNS=25 = up to 75 direct Postgres connections. Add
the worker's 25. Neon's free tier caps at 100 concurrent connections;
paid tiers much higher.
PgBouncer on Neon
Neon provides a built-in PgBouncer at the -pooler subdomain. The
non-pooler endpoint (ep-floral-truth-amttbc5a.c-5.us-east-1...) is
the direct compute endpoint and connects straight to Postgres,
paying the full TCP+TLS+startup handshake on every cold connection.
The -pooler endpoint multiplexes through PgBouncer in Neon's
infrastructure.
We use the -pooler endpoint because the direct endpoint paid
~440ms per cold handshake on a transatlantic link, visible as
1500ms-tail spikes in /api/tasks/ traces. The pooler keeps backend
Postgres connections warm in Neon's data center, so the only
latency our Go pods see is one TCP+TLS to PgBouncer (already
warm via our pool) plus one query round-trip.
Modes PgBouncer supports:
- session — one server connection held per client session (transparent)
- transaction — server connection released after each transaction (high-throughput)
- statement — per-statement (most aggressive; breaks many features)
Neon's pooler runs in transaction mode. This is compatible with GORM
runtime queries (we don't use session-level features like LISTEN/NOTIFY
or session-scope advisory locks in the data path). The one place this
matters is migrations: goose's session-scoped advisory lock can't
survive PgBouncer transaction-mode pooling. The migrate Job
(deploy-k3s/manifests/migrate/job.yaml) handles this by stripping
the -pooler segment from DB_HOST before invoking goose — runtime
keeps using the pooler, only migrations bypass it.
Connection pool settings
In config.yaml (rendered into ConfigMap → env vars):
database:
max_open_conns: 25
max_idle_conns: 20
max_lifetime: "1800s"
max_idle_time: "0s"
These map to Go database/sql pool settings:
- MaxOpenConns: 25 — at most 25 concurrent connections per replica.
- MaxIdleConns: 20 — keep up to 20 warm connections per replica ready to reuse. Bumped from 10 because the pooler tolerates many client connections cheaply, and the cost of a cold handshake (~440ms transatlantic) is far higher than the cost of holding an idle connection.
- MaxLifetime: 1800s — recycle connections after 30 min. Bumped from 600s; with the pooler keeping things warm, longer lifetime reduces churn.
- MaxIdleTime: 0s — never close idle connections. Lifetime drives recycling instead.
Pool warm-up at boot
database.Connect() issues 20 parallel PingContext calls
immediately after opening the pool. This pre-establishes
MaxIdleConns connections to the pooler so the first user request
doesn't pay any handshake.
The warm-up is bounded by one round-trip time (~440ms cold), not one round-trip per connection — pings run concurrently. Confirmed in pod logs at boot:
{"level":"info","requested":20,"warmed":20,"message":"DB pool warm-up complete"}
If warm-up partially fails (e.g., 18/20 succeed), the pod still
starts; the pool fills the rest under traffic. Failure to ping at all
would be caught by the synchronous sqlDB.Ping() immediately before,
which is fatal.
Worst-case connection count
3 api + 1 worker replicas × 25 conns = 100 peak. Right at Neon free tier's ceiling, with zero margin. This is a real risk — a spike that saturates the pool on all replicas simultaneously would exhaust Neon's limit.
Mitigations to consider:
- Drop
DB_MAX_OPEN_CONNSto 15 → 60 peak. Safe on free tier. - Upgrade to Neon Scale plan (1000+ connections).
- Rely on Neon's PgBouncer to multiplex — the raw backend connections to Postgres-proper are pooled, not our TCP connections to Neon.
Currently we trust Neon's pooler to handle the multiplexing and run with the default 25/10. If we hit connection errors in prod, adjust.
Schema management
goose
We use pressly/goose (pinned in the
api Dockerfile to v3.22.1) for schema migrations. Why goose specifically:
- Each migration file runs inside its own transaction by default — partial-failure recovery is built in (no "dirty" state to manually unstick like golang-migrate).
- Locking is opt-in. We don't opt in. Migrations run as a single Kubernetes Job — that's the singleton process. No advisory-lock vs PgBouncer-transaction-mode foot-gun.
- Plain SQL files. No DSL, no library integration in our Go code.
See docs/deployment/19-postmortem-swarm.md (Schema Versioning section)
for the AutoMigrate-with-advisory-lock approach this replaced and why.
Migration files
Live under migrations/, named <NNNNNN>_<short_name>.sql. Each file
has both the up and down migration in one file, separated by goose
markers:
-- +goose Up
CREATE TABLE example (id bigint PRIMARY KEY);
-- +goose Down
DROP TABLE example;
Multi-statement constructs (CREATE FUNCTION, DO $$ BEGIN ... END $$)
need -- +goose StatementBegin / -- +goose StatementEnd wrappers
because goose splits on semicolons by default.
migrations/000001_init.sql is the baseline — captures every
table/index/sequence as it existed when goose was adopted, generated
via pg_dump --schema-only --no-owner --no-privileges. The pre-goose
hand-numbered migrations (002-022 in git history at commit
58e6997) had their effects folded into this baseline; they're gone
from the live tree but remain in git for archaeology.
Production migration flow
deploy-k3s/scripts/03-deploy.sh runs migrations as part of every
deploy, before the api/worker rollout starts:
1. kubectl delete job honeydue-migrate (idempotent)
2. kubectl apply -f manifests/migrate/job.yaml (with current api image)
3. kubectl wait --for=condition=complete --timeout=10m job/honeydue-migrate
4. (only if Job succeeded) kubectl apply -f manifests/api/...
The Job uses the api image — we install the goose CLI binary at
/usr/local/bin/goose during the api Dockerfile build, so any pod that
can run api can run goose. No separate image to build/push.
The Job's command runs goose ... up against the direct
(non-pooler) Neon endpoint. Goose's session-scoped advisory lock can't
survive PgBouncer transaction-mode pooling, so the Job script strips
the -pooler segment from DB_HOST before connecting. The api/worker
runtime continues to use the pooler endpoint for everything else; only
this one Job needs the direct connection.
Schema-version precondition
internal/database/database.go::RequireSchemaApplied() runs at api and
worker startup. It queries goose_db_version for the highest applied
version and refuses to start if the table is missing or the latest row
is is_applied=false. This catches "operator forgot to run migrate" as
a clear boot error instead of a mysterious runtime "relation does not
exist" later.
Local migration workflow
# Set the direct-endpoint DSN once
export DATABASE_URL='host=ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
user=neondb_owner password=$PG_PASSWORD dbname=honeyDue sslmode=require'
make migrate-status # what's pending
make migrate-up # apply
make migrate-down # roll back the latest
make migrate-new name=add_widget_col # scaffold a new SQL file
Each new migration file goes through code review like any other code change. The deploy-script Job applies it on the next deploy.
Bootstrap (one-time, when the prod DB already had a schema)
Bootstrapping a goose-managed DB whose schema already exists requires
seeding goose_db_version so goose treats version 1 as already-applied:
# Once. After this, future migrations append normally.
goose -dir migrations postgres "$DATABASE_URL" version # creates the table
psql "$DATABASE_URL" -c \
"INSERT INTO goose_db_version (version_id, is_applied, tstamp) VALUES (1, true, NOW());"
This was done for honeyDue's prod Neon project at the time of goose adoption — no need to repeat unless we set up a fresh DB from a schema dump.
What's in the database
Major tables (see honeyDueAPI-go/internal/models/):
| Table | Purpose |
|---|---|
auth_user |
Users (Django legacy name kept for compatibility) |
user_userprofile |
Profile data |
authtoken_token |
API auth tokens |
residence_residence |
Properties users manage |
task_task |
Maintenance tasks |
task_taskcompletion |
Task completion history |
contractor_contractor |
Contractor contacts |
documents_document |
Document records (files in B2) |
notification_notification |
In-app notifications |
subscription_usersubscription |
IAP subscriptions |
admin_users |
Next.js admin panel users |
See honeyDueAPI-go/docs/TASK_LOGIC_ARCHITECTURE.md for the task logic
model details.
Backup and recovery
Neon's built-in
Neon Launch includes point-in-time recovery within the last 24h (longer on Scale plan). To restore:
- Go to Neon console → project → Backups
- Create a branch from a timestamp
- Point the app at the new branch (change
DB_HOSTin our ConfigMap)
Done. No tape-wrangling.
What we don't have
- Off-site backup (if Neon itself is compromised, we have no exfil). A
nightly
pg_dumpto Backblaze B2 would close this gap. TODO (Chapter 20). - Tested DR drills. We've never actually restored from a Neon backup into a new branch and pointed the app at it. Should be routine; hasn't been exercised.
Migrations from old MyCrib/Casera data
honeyDue originally ran on a Django codebase (MyCrib / Casera-era). The
schema inherits Django's naming (app_model table names, _id suffix
foreign keys). The Go app's GORM models have TableName() methods that
preserve this:
func (Task) TableName() string { return "task_task" }
This isn't ideal (GORM's default tasks would be cleaner), but changing
would require a migration that renames every table — more risk than
value.
Neon regions
Neon's default region for new projects is aws-us-east-1 (Virginia).
Our DB is there. Latency from Nuremberg to us-east-1 is ~108ms one-way
TCP-level (verified by nc -z -w 5 from hetzner1), so ~220ms RTT
through Neon's pooler stack.
This is the slowest hop in our data flow. Every api request that needs
a DB query pays this latency at least once. Sub-millisecond Postgres
execution time (verified via EXPLAIN ANALYZE: 0.04-0.34 ms on every
hot path) means wall-clock latency = network + Neon proxy overhead.
Optimizations layered on top to minimize round trips
We don't move the DB region (yet) but we cut the number of RTTs per request via:
- Auth caching (Chapter 7 §Redis) — token + user lookups served from Redis (1-hour TTL) and per-pod in-memory cache (5-min TTL). On warm cache: 0 SQL round-trips for auth.
- JOIN consolidation — two-step
find residence-IDs → find tasks IN idscollapsed into a single query with a Postgres subquery. One RTT instead of two. - Single-query auth — token + user fetched in one INNER JOIN instead of GORM's two-query Preload pattern.
- Residence-IDs Redis cache — cached per user with 5-min TTL,
invalidated on Create/Delete/Join/Remove. Saves 1 RTT per
/api/documents/,/api/contractors/,/api/residences/summary/request.
After these, a fully-warm /api/tasks/ is 1 SQL round-trip total
(~220ms wall-clock). Verified via Jaeger trace — see Chapter 15.
When this still matters
- Any cold-cache request still pays 2-3 RTTs (~500-700ms).
- Pod startup pays 1 RTT × 20 (warm-up), but that runs in parallel: ~440ms one-shot.
Long-term fix: migrate Neon to aws-eu-central-1 (Frankfurt) — drops
RTT to ~5ms and brings warm-cache requests under 50ms. Tracked in
docs/observability-plan.md and Chapter 18 §migration triggers.
Environment variables the app reads
From ConfigMap:
| Var | Purpose |
|---|---|
DB_HOST |
Neon pooler hostname (-pooler suffix) |
DB_PORT |
5432 |
POSTGRES_USER |
neondb_owner |
POSTGRES_DB |
honeyDue |
DB_SSLMODE |
require |
DB_MAX_OPEN_CONNS |
25 |
DB_MAX_IDLE_CONNS |
20 |
DB_MAX_LIFETIME |
1800s |
DB_MAX_IDLE_TIME |
0s (never close idle) |
From Secret (honeydue-secrets):
| Var | Purpose |
|---|---|
POSTGRES_PASSWORD |
Neon DB password |
Operator cheat sheet
# Connect to Neon from workstation (requires psql + the password)
PGPASSWORD="<pw>" psql -h ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
-U neondb_owner -d honeyDue
# From a pod (lets you debug against the actual in-cluster network path)
kubectl exec -n honeydue -it deploy/api -- sh
# inside the pod (no psql by default, but wget + JSON API works)
wget -qO- http://127.0.0.1:8000/api/health/
# See current migration state (no direct CLI, but the api logs show it)
kubectl logs -n honeydue deploy/api | grep -i migration
# See active connections (run against Neon)
SELECT count(*), usename, state, application_name
FROM pg_stat_activity
GROUP BY usename, state, application_name;
References
- Neon docs
- Neon pricing
- Postgres advisory locks
- pressly/goose — production migration tool
- GORM AutoMigrate (tests only)
- honeyDue task architecture (repo-local)