honeyDueAPI/docs/deployment/08-database.md
Trey t 6f303dbbaa
2026-04-24 07:20:54 -05:00


08 — Database (Neon Postgres)

Summary

Authoritative user data lives in a Neon-managed Postgres database in AWS us-east-1. Connections use TLS (DB_SSLMODE=require). Schema is managed via GORM AutoMigrate inside the api binary, coordinated across replicas by a Postgres advisory lock to prevent concurrent migration attempts.

Why Neon

Decision matrix

At deploy time we considered:

| Option | Setup effort | Monthly cost | Backup/PITR | Scale ceiling | Notes |
|---|---|---|---|---|---|
| Neon Launch | Zero (managed) | $5-15 | Included | Large | Picked |
| Postgres on a Hetzner VPS | High | $8 (VPS) | Manual | Medium | More ops |
| AWS RDS | Medium | $30+ | Included | Huge | Overkill, expensive |
| Supabase Free | Zero | $0 | Limited | Small | Free tier has quota limits |
| CNPG on our k3s | High (Helm) | $0 (using cluster) | Self-rolled | Medium | Operational burden |

Neon Launch won on:

  • Serverless: scales compute to zero when idle (cheap)
  • Branch databases: we can create dev/staging branches from prod in seconds
  • Connection pooling built-in: PgBouncer on the hostname suffix -pooler
  • Point-in-time recovery included (paid tier)
  • Pay-as-you-go with a $5 minimum — fits a bootstrapped app

Connection details

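| Field | Value |
|---|---|
| Hostname | ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech |
| Port | 5432 |
| Username | neondb_owner |
| Database | honeyDue (case-sensitive!) |
| TLS mode | require (enforced by Neon; encrypts the connection, but under standard libpq semantics does not verify the server certificate; verify-full would) |
| Branch | production (Neon's concept: an isolated DB within the project) |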

The database name is case-sensitive

In SQL, Postgres folds unquoted identifiers to lowercase, but the database name in a connection string is matched literally. Neon's UI created the database as "honeyDue" (quoted, camelCase preserved), so prod.env / the ConfigMap must use exactly POSTGRES_DB=honeyDue; lowercase honeydue fails with a database "honeydue" does not exist error. This bit us during the initial Swarm deploy (Chapter 19 §Neon DB name).
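As an illustration of the point above, here is a minimal sketch of assembling a connection URL with the exact database name. The helper name buildDSN and the URL-style DSN are assumptions made for this example; the real code in internal/database may build its connection string differently.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildDSN (hypothetical helper) assembles a Postgres connection URL.
// The database name goes into the path verbatim, so its case is preserved.
func buildDSN(host string, port int, user, password, dbName, sslMode string) string {
	u := url.URL{
		Scheme:   "postgres",
		User:     url.UserPassword(user, password),
		Host:     fmt.Sprintf("%s:%d", host, port),
		Path:     "/" + dbName,
		RawQuery: url.Values{"sslmode": {sslMode}}.Encode(),
	}
	return u.String()
}

func main() {
	// "example-password" is a placeholder, not a real credential.
	dsn := buildDSN("ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech", 5432,
		"neondb_owner", "example-password", "honeyDue", "require")
	fmt.Println(dsn)
}
```

Passing "honeydue" instead of "honeyDue" would produce a syntactically valid URL that Postgres rejects at connect time, which is why the value is worth pinning in one place.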

Connection pooling

Why it matters

Each Postgres connection costs server memory (~5-10 MB each). 3 api replicas × DB_MAX_OPEN_CONNS=25 gives up to 75 direct Postgres connections, and the worker adds another 25, for 100 in total. Neon's free tier caps at 100 concurrent connections; paid tiers allow far more.

PgBouncer on Neon

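Neon provides a built-in PgBouncer, reached by appending -pooler to the endpoint ID in the hostname. The hostname listed under Connection details carries no -pooler suffix, so it is worth confirming that DB_HOST in the ConfigMap points at the pooled endpoint; only then do connections go through PgBouncer transparently.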

Modes PgBouncer supports:

  • session — one server connection held per client session (transparent)
  • transaction — server connection released after each transaction (high-throughput)
  • statement — per-statement (most aggressive; breaks many features)

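Neon's pooler runs in transaction mode. This is compatible with GORM's ordinary query traffic out of the box (the request path doesn't use session-level features like prepared statements or session variables). One caveat: the pg_advisory_lock call in MigrateWithLock (see Schema management) is itself a session-level feature, and PgBouncer's transaction mode does not preserve session-level advisory locks, so the migration lock is only reliable over a direct (non-pooled) connection.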

Connection pool settings

In prod.env:

```
DB_MAX_OPEN_CONNS=25
DB_MAX_IDLE_CONNS=10
DB_MAX_LIFETIME=600s
```

These are the Go database/sql pool settings (GORM uses database/sql underneath):

  • MaxOpenConns: 25 — at most 25 concurrent connections per replica
  • MaxIdleConns: 10 — keep up to 10 warm connections ready to reuse
  • MaxLifetime: 600s — recycle connections after 10 min (prevents stale state in long-lived connections, good for Neon's idle timeout)
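The mapping from these variables to database/sql can be sketched as follows. This is an illustration under assumptions: poolConfig, parsePoolConfig, applyPool, and prodEnv are hypothetical names, and the real config loader may parse the values differently. The three setter methods are the standard database/sql ones.

```go
package main

import (
	"database/sql"
	"fmt"
	"strconv"
	"time"
)

// poolConfig (hypothetical) holds the three database/sql pool settings.
type poolConfig struct {
	MaxOpenConns int
	MaxIdleConns int
	MaxLifetime  time.Duration
}

// prodEnv mirrors the prod.env values quoted above.
var prodEnv = map[string]string{
	"DB_MAX_OPEN_CONNS": "25",
	"DB_MAX_IDLE_CONNS": "10",
	"DB_MAX_LIFETIME":   "600s",
}

// parsePoolConfig reads the settings through getenv (os.Getenv in a real binary).
func parsePoolConfig(getenv func(string) string) (poolConfig, error) {
	var cfg poolConfig
	var err error
	if cfg.MaxOpenConns, err = strconv.Atoi(getenv("DB_MAX_OPEN_CONNS")); err != nil {
		return cfg, fmt.Errorf("DB_MAX_OPEN_CONNS: %w", err)
	}
	if cfg.MaxIdleConns, err = strconv.Atoi(getenv("DB_MAX_IDLE_CONNS")); err != nil {
		return cfg, fmt.Errorf("DB_MAX_IDLE_CONNS: %w", err)
	}
	if cfg.MaxLifetime, err = time.ParseDuration(getenv("DB_MAX_LIFETIME")); err != nil {
		return cfg, fmt.Errorf("DB_MAX_LIFETIME: %w", err)
	}
	return cfg, nil
}

// applyPool copies the settings onto a *sql.DB (GORM exposes one via db.DB()).
func applyPool(db *sql.DB, cfg poolConfig) {
	db.SetMaxOpenConns(cfg.MaxOpenConns)
	db.SetMaxIdleConns(cfg.MaxIdleConns)
	db.SetConnMaxLifetime(cfg.MaxLifetime)
}

func main() {
	cfg, err := parsePoolConfig(func(key string) string { return prodEnv[key] })
	if err != nil {
		panic(err)
	}
	fmt.Printf("open=%d idle=%d lifetime=%s\n", cfg.MaxOpenConns, cfg.MaxIdleConns, cfg.MaxLifetime)
}
```

Injecting getenv as a function keeps the parser testable without touching the process environment.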

Worst-case connection count

(3 api + 1 worker) replicas × 25 conns = 100 peak. That sits exactly at the Neon free tier's ceiling, with zero margin. This is a real risk: a spike that saturates the pool on all replicas simultaneously would exhaust Neon's limit.

Mitigations to consider:

  • Drop DB_MAX_OPEN_CONNS to 15 → 60 peak. Safe on free tier.
  • Upgrade to Neon Scale plan (1000+ connections).
  • Rely on Neon's PgBouncer to multiplex: our TCP connections terminate at the pooler, which shares a smaller set of backend connections to Postgres proper.

Currently we trust Neon's pooler to handle the multiplexing and run with the default 25/10. If we hit connection errors in prod, adjust.
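The arithmetic behind the worst case and the first mitigation can be checked with a few lines of Go. The function name peakConnections is illustrative, not taken from the codebase; the replica counts and limits are the ones quoted in this section.

```go
package main

import "fmt"

// peakConnections returns the worst case in which every replica's pool is fully open.
func peakConnections(apiReplicas, workerReplicas, maxOpenConns int) int {
	return (apiReplicas + workerReplicas) * maxOpenConns
}

func main() {
	const neonFreeTierLimit = 100 // limit quoted in this chapter
	for _, perReplica := range []int{25, 15} {
		peak := peakConnections(3, 1, perReplica)
		fmt.Printf("DB_MAX_OPEN_CONNS=%d -> peak %d, headroom %d\n",
			perReplica, peak, neonFreeTierLimit-peak)
	}
}
```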

Schema management

GORM AutoMigrate

On startup, the Go API's cmd/api/main.go calls database.MigrateWithLock() which:

  1. Opens a dedicated Postgres connection
  2. Runs SELECT pg_advisory_lock(1751412071), acquiring a session-level advisory lock on a hardcoded key
  3. Calls db.AutoMigrate(&models.*{}) for every GORM model
  4. Runs SELECT pg_advisory_unlock(...) via a deferred function
  5. Closes the connection

The advisory lock serializes migrations across replicas: when 3 api pods start simultaneously, one acquires the lock and migrates; the others block on the lock. Once the first finishes (≤2s for an already-migrated schema, up to 90s on first cold boot), the next acquires the lock and sees the schema is current (no-op migrate).

Why an advisory lock

Without it, concurrent CREATE TABLE IF NOT EXISTS ... statements from multiple replicas would race — Postgres usually handles it, but GORM's AutoMigrate also alters tables (adds columns, indexes) which can deadlock under concurrency.

The advisory lock pattern (used, for example, by Rails migrations and by golang-migrate's Postgres driver) is a common solution.

The lock key

1751412071 is a hardcoded integer in internal/database/database.go. Arbitrary but unique — as long as nothing else in the Postgres instance uses the same advisory lock key, no conflicts.

First-boot behavior

On a fresh database (new Neon project), the first api pod runs through every model's CREATE TABLE statement. This is ~50 tables for honeyDue and takes ~90 seconds.

On a warm database (tables already exist), AutoMigrate is fast — typically under 2 seconds. It still runs (GORM checks every model against the schema) but finds no work to do.

Where this bit us

With 3 api pods starting simultaneously and migrations taking 90s first time, the lock queue for the last replica is ~180s. We needed a startupProbe grace of 240s to cover this without false restart loops. See Chapter 7 §startupProbe and Chapter 19 §MigrateWithLock.
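The ~180s figure follows from a pessimistic assumption: every replica ahead in the queue holds the lock for the full first-boot duration. A small sketch of that arithmetic, with lockQueueWait as an illustrative name and the 90s and 240s values taken from this section:

```go
package main

import (
	"fmt"
	"time"
)

var (
	firstBootMigration = 90 * time.Second  // first cold-boot migration time quoted above
	startupGrace       = 240 * time.Second // startupProbe grace quoted above
)

// lockQueueWait returns how long the replica at a 1-based queue position waits,
// assuming each replica ahead of it holds the lock for holdTime.
func lockQueueWait(position int, holdTime time.Duration) time.Duration {
	if position < 1 {
		return 0
	}
	return time.Duration(position-1) * holdTime
}

func main() {
	for pos := 1; pos <= 3; pos++ {
		wait := lockQueueWait(pos, firstBootMigration)
		fmt.Printf("replica %d waits %v (grace remaining: %v)\n", pos, wait, startupGrace-wait)
	}
}
```

If later lock holders find the schema already current and finish in about 2s, the real queue is shorter than this bound; the sketch only reproduces the conservative estimate.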

Downside: no schema versioning

AutoMigrate can only add — new tables, new columns, new indexes. It won't drop columns, rename them, or change types destructively. For those we'd need raw SQL migrations (a tool like golang-migrate or dbmate).

Today: we accept that schema changes are additive-only. When we need destructive changes, we will hand-write them as raw SQL.

What's in the database

Major tables (see honeyDueAPI-go/internal/models/):

| Table | Purpose |
|---|---|
| auth_user | Users (Django legacy name kept for compatibility) |
| user_userprofile | Profile data |
| authtoken_token | API auth tokens |
| residence_residence | Properties users manage |
| task_task | Maintenance tasks |
| task_taskcompletion | Task completion history |
| contractor_contractor | Contractor contacts |
| documents_document | Document records (files in B2) |
| notification_notification | In-app notifications |
| subscription_usersubscription | IAP subscriptions |
| admin_users | Next.js admin panel users |

See honeyDueAPI-go/docs/TASK_LOGIC_ARCHITECTURE.md for the task logic model details.

Backup and recovery

Neon's built-in

Neon Launch includes point-in-time recovery within the last 24h (longer on Scale plan). To restore:

  1. Go to Neon console → project → Backups
  2. Create a branch from a timestamp
  3. Point the app at the new branch (change DB_HOST in our ConfigMap)

Done. No tape-wrangling.

What we don't have

  • Off-site backup (if Neon itself is compromised or unavailable, we have no independent copy of the data). A nightly pg_dump to Backblaze B2 would close this gap. TODO (Chapter 20).
  • Tested DR drills. We've never actually restored from a Neon backup into a new branch and pointed the app at it. Should be routine; hasn't been exercised.

Migrations from old MyCrib/Casera data

honeyDue originally ran on a Django codebase (MyCrib / Casera-era). The schema inherits Django's naming (app_model table names, _id suffix foreign keys). The Go app's GORM models have TableName() methods that preserve this:

```go
func (Task) TableName() string { return "task_task" }
```

This isn't ideal (GORM's default name, tasks, would be cleaner), but changing it would require a migration that renames every table, which carries more risk than value.

Neon regions

Neon's default region for new projects is aws-us-east-1 (Virginia). Our DB is there. Latency from Nuremberg to us-east-1 is ~90-120ms round trip.

This is the slowest hop in our data flow. Every api request that needs a DB query (most of them) pays this latency at least once.

When this matters: once complex endpoints start showing response times of ~200ms or more, DB latency is likely the dominant cost. Options:

  • Migrate Neon to aws-eu-central-1 (Frankfurt) — shaves ~90ms off
  • Add Redis caching for hot reads (Chapter 7)
  • Read replicas (Neon supports them on paid tiers)
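To see why ~200ms shows up quickly, note that each sequential query pays one full round trip. The sketch below is a rough model only: estimateDBWait is an illustrative name, 100ms is taken as the midpoint of the 90-120ms range above, and the ~10ms Frankfurt figure is an assumption derived from the "shaves ~90ms off" estimate rather than a measurement.

```go
package main

import (
	"fmt"
	"time"
)

var (
	rttVirginia  = 100 * time.Millisecond // midpoint of the 90-120ms range quoted above
	rttFrankfurt = 10 * time.Millisecond  // assumed: ~100ms minus the ~90ms saving
)

// estimateDBWait approximates network wait for a request that issues
// sequentialQueries one after another, each costing one round trip.
func estimateDBWait(sequentialQueries int, roundTrip time.Duration) time.Duration {
	return time.Duration(sequentialQueries) * roundTrip
}

func main() {
	for _, queries := range []int{1, 2, 5} {
		fmt.Printf("%d sequential queries: us-east-1 %v, eu-central-1 %v\n",
			queries, estimateDBWait(queries, rttVirginia), estimateDBWait(queries, rttFrankfurt))
	}
}
```

The model ignores query execution time and connection setup, so it is a lower bound on what an endpoint actually spends on the database.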

Environment variables the app reads

From ConfigMap:

| Var | Value / purpose |
|---|---|
| DB_HOST | Neon pooler hostname |
| DB_PORT | 5432 |
| POSTGRES_USER | neondb_owner |
| POSTGRES_DB | honeyDue |
| DB_SSLMODE | require |
| DB_MAX_OPEN_CONNS | 25 |
| DB_MAX_IDLE_CONNS | 10 |
| DB_MAX_LIFETIME | 600s |

From Secret (honeydue-secrets):

| Var | Purpose |
|---|---|
| POSTGRES_PASSWORD | Neon DB password |

Operator cheat sheet

```sh
# Connect to Neon from workstation (requires psql + the password)
PGPASSWORD="<pw>" psql -h ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
  -U neondb_owner -d honeyDue

# From a pod (lets you debug against the actual in-cluster network path)
kubectl exec -n honeydue -it deploy/api -- sh
# inside the pod (no psql by default, but wget + JSON API works)
wget -qO- http://127.0.0.1:8000/api/health/

# See current migration state (no direct CLI, but the api logs show it)
kubectl logs -n honeydue deploy/api | grep -i migration
```

```sql
-- See active connections (run against Neon)
SELECT count(*), usename, state, application_name
FROM pg_stat_activity
GROUP BY usename, state, application_name;
```

References