honeyDueAPI/docs/deployment/08-database.md

# 08 — Database (Neon Postgres)

## Summary

Authoritative user data lives in a Neon-managed Postgres database in AWS
us-east-1. Connections use TLS (`DB_SSLMODE=require`). Schema is managed
via GORM AutoMigrate inside the api binary, coordinated across replicas
by a Postgres advisory lock to prevent concurrent migration attempts.

## Why Neon

### Decision matrix

At deploy time we considered:

| Option | Setup effort | Monthly cost | Backup/PITR | Scale ceiling | Notes |
|---|---|---|---|---|---|
| **Neon Launch** | Zero (managed) | $5-15 | Included | Large | **Picked** |
| Postgres on a Hetzner VPS | High | $8 (VPS) | Manual | Medium | More ops |
| AWS RDS | Medium | $30+ | Included | Huge | Overkill, expensive |
| Supabase Free | Zero | $0 | Limited | Small | Free tier has quota limits |
| CNPG on our k3s | High (Helm) | $0 (using cluster) | Self-rolled | Medium | Operational burden |

Neon Launch won on:
- **Serverless**: scales compute to zero when idle (cheap)
- **Branch databases**: we can create dev/staging branches from prod in seconds
- **Connection pooling built-in**: PgBouncer on the hostname suffix `-pooler`
- **Point-in-time recovery** included (paid tier)
- **Pay-as-you-go** with a $5 minimum — fits a bootstrapped app

### Connection details

| Field | Value |
|---|---|
| Hostname | `ep-floral-truth-amttbc5a-pooler.c-5.us-east-1.aws.neon.tech` |
| Port | 5432 |
| Username | `neondb_owner` |
| Database | `honeyDue` (case-sensitive!) |
| TLS mode | `require` (enforced by Neon; app pg driver verifies) |
| Branch | production (Neon's concept — isolated DB within the project) |

### The database name is case-sensitive

Postgres identifiers are lowercase unless quoted. Neon's UI created the
database as `"honeyDue"` (quoted, camelCase preserved). In `prod.env` /
ConfigMap we must use exactly `POSTGRES_DB=honeyDue` — lowercase
`honeydue` gets a `database "honeydue" does not exist` error. This bit
us during the initial Swarm deploy (Chapter 19 §Neon DB name).

## Connection pooling

### Why it matters

Postgres is memory-hungry per connection (~5-10 MB each). 3 api replicas
× `DB_MAX_OPEN_CONNS=25` = up to 75 direct Postgres connections. Add
the worker's 25. Neon's free tier caps at 100 concurrent connections;
paid tiers much higher.

### PgBouncer on Neon

Neon provides a built-in PgBouncer at the `-pooler` subdomain. The
non-pooler endpoint (`ep-floral-truth-amttbc5a.c-5.us-east-1...`) is
the direct compute endpoint and connects straight to Postgres,
paying the full TCP+TLS+startup handshake on every cold connection.
The `-pooler` endpoint multiplexes through PgBouncer in Neon's
infrastructure.

**We use the `-pooler` endpoint** because the direct endpoint paid
~440ms per cold handshake on a transatlantic link, visible as
1500ms-tail spikes in /api/tasks/ traces. The pooler keeps backend
Postgres connections warm in Neon's data center, so the only
latency our Go pods see is one TCP+TLS to PgBouncer (already
warm via our pool) plus one query round-trip.

Modes PgBouncer supports:
- **session** — one server connection held per client session (transparent)
- **transaction** — server connection released after each transaction (high-throughput)
- **statement** — per-statement (most aggressive; breaks many features)

Neon's pooler runs in **transaction mode**. This is compatible with GORM
out of the box (we don't use session-level features like LISTEN/NOTIFY
or session-scope advisory locks). Note: `database.MigrateWithLock()`
needs the *direct* (non-pooler) endpoint because session-level
advisory locks don't survive PgBouncer's per-transaction cycling — but
the migration helper opens its own ad-hoc connection bypassing the
configured pool, so this happens automatically. See `MigrateWithLock`
in `internal/database/database.go`.

### Connection pool settings

In `config.yaml` (rendered into ConfigMap → env vars):

```yaml
database:
  max_open_conns: 25
  max_idle_conns: 20
  max_lifetime: "1800s"
  max_idle_time: "0s"
```

These map to Go `database/sql` pool settings:

- **MaxOpenConns: 25** — at most 25 concurrent connections per replica.
- **MaxIdleConns: 20** — keep up to 20 warm connections per replica
  ready to reuse. Bumped from 10 because the pooler tolerates many
  client connections cheaply, and the cost of a cold handshake (~440ms
  transatlantic) is far higher than the cost of holding an idle
  connection.
- **MaxLifetime: 1800s** — recycle connections after 30 min. Bumped
  from 600s; with the pooler keeping things warm, longer lifetime
  reduces churn.
- **MaxIdleTime: 0s** — never close idle connections. Lifetime drives
  recycling instead.

### Pool warm-up at boot

`database.Connect()` issues 20 parallel `PingContext` calls
immediately after opening the pool. This pre-establishes
`MaxIdleConns` connections to the pooler so the first user request
doesn't pay any handshake.

The warm-up is bounded by *one* round-trip time (~440ms cold), not
one round-trip per connection — pings run concurrently. Confirmed
in pod logs at boot:

```
{"level":"info","requested":20,"warmed":20,"message":"DB pool warm-up complete"}
```

If warm-up partially fails (e.g., 18/20 succeed), the pod still
starts; the pool fills the rest under traffic. Failure to ping at all
would be caught by the synchronous `sqlDB.Ping()` immediately before,
which is fatal.

### Worst-case connection count

3 api + 1 worker replicas × 25 conns = 100 peak. Right at Neon free
tier's ceiling, with zero margin. **This is a real risk** — a spike that
saturates the pool on all replicas simultaneously would exhaust Neon's
limit.

Mitigations to consider:
- Drop `DB_MAX_OPEN_CONNS` to 15 → 60 peak. Safe on free tier.
- Upgrade to Neon Scale plan (1000+ connections).
- Rely on Neon's PgBouncer to multiplex — the raw backend connections
  to Postgres-proper are pooled, not our TCP connections to Neon.

Currently we trust Neon's pooler to handle the multiplexing and run with
the default 25/10. If we hit connection errors in prod, adjust.

## Schema management

### GORM AutoMigrate

On startup, the Go API's `cmd/api/main.go` calls
`database.MigrateWithLock()` which:

1. Opens a dedicated Postgres connection
2. `SELECT pg_advisory_lock(1751412071)` — acquires a session-level
   advisory lock on a hardcoded key
3. Calls `db.AutoMigrate(&models.*{})` for every GORM model
4. `SELECT pg_advisory_unlock(...)` via deferred function
5. Close the connection

The advisory lock serializes migrations across replicas: when 3 api
pods start simultaneously, one acquires the lock and migrates; the
others block on the lock. Once the first finishes (≤2s for already-
migrated schema, up to 90s on first cold boot), the next acquires and
sees the schema is current (no-op migrate).

### Why an advisory lock

Without it, concurrent `CREATE TABLE IF NOT EXISTS ...` statements from
multiple replicas would race — Postgres usually handles it, but GORM's
AutoMigrate also alters tables (adds columns, indexes) which can deadlock
under concurrency.

The advisory lock pattern (also used by Rails + Django + Alembic) is the
canonical solution.

### The lock key

`1751412071` is a hardcoded integer in `internal/database/database.go`.
Arbitrary but unique — as long as nothing else in the Postgres instance
uses the same advisory lock key, no conflicts.

### First-boot behavior

On a **fresh database** (new Neon project), the first api pod runs
through every model's `CREATE TABLE` statement. This is ~50 tables for
honeyDue and takes ~90 seconds.

On a **warm database** (tables already exist), AutoMigrate is fast —
typically under 2 seconds. It still runs (GORM checks every model
against the schema) but finds no work to do.

### Where this bit us

With 3 api pods starting simultaneously and migrations taking 90s first
time, the lock queue for the last replica is ~180s. We needed a
startupProbe grace of 240s to cover this without false restart loops.
See Chapter 7 §startupProbe and Chapter 19 §MigrateWithLock.

### Downside: no schema versioning

AutoMigrate can only *add* — new tables, new columns, new indexes. It
won't drop columns, rename them, or change types destructively. For
those we'd need raw SQL migrations (a tool like `golang-migrate` or
`dbmate`).

Today: we accept that schema changes are additive-only. When we need
destructive changes, we'd hand-write them.

## What's in the database

Major tables (see `honeyDueAPI-go/internal/models/`):

| Table | Purpose |
|---|---|
| `auth_user` | Users (Django legacy name kept for compatibility) |
| `user_userprofile` | Profile data |
| `authtoken_token` | API auth tokens |
| `residence_residence` | Properties users manage |
| `task_task` | Maintenance tasks |
| `task_taskcompletion` | Task completion history |
| `contractor_contractor` | Contractor contacts |
| `documents_document` | Document records (files in B2) |
| `notification_notification` | In-app notifications |
| `subscription_usersubscription` | IAP subscriptions |
| `admin_users` | Next.js admin panel users |

See `honeyDueAPI-go/docs/TASK_LOGIC_ARCHITECTURE.md` for the task logic
model details.

## Backup and recovery

### Neon's built-in

Neon Launch includes **point-in-time recovery** within the last 24h
(longer on Scale plan). To restore:

1. Go to Neon console → project → Backups
2. Create a branch from a timestamp
3. Point the app at the new branch (change `DB_HOST` in our ConfigMap)

Done. No tape-wrangling.

### What we don't have

- Off-site backup (if Neon itself is compromised, we have no exfil). A
  nightly `pg_dump` to Backblaze B2 would close this gap. **TODO**
  (Chapter 20).
- Tested DR drills. We've never actually restored from a Neon backup
  into a new branch and pointed the app at it. Should be routine; hasn't
  been exercised.

## Migrations from old MyCrib/Casera data

honeyDue originally ran on a Django codebase (MyCrib / Casera-era). The
schema inherits Django's naming (`app_model` table names, `_id` suffix
foreign keys). The Go app's GORM models have `TableName()` methods that
preserve this:

```go
func (Task) TableName() string { return "task_task" }
```

This isn't ideal (GORM's default `tasks` would be cleaner), but changing
would require a migration that renames every table — more risk than
value.

## Neon regions

Neon's default region for new projects is `aws-us-east-1` (Virginia).
Our DB is there. Latency from Nuremberg to us-east-1 is **~108ms one-way**
TCP-level (verified by `nc -z -w 5` from `hetzner1`), so **~220ms RTT
through Neon's pooler stack**.

This is the slowest hop in our data flow. Every api request that needs
a DB query pays this latency at least once. Sub-millisecond Postgres
execution time (verified via `EXPLAIN ANALYZE`: 0.04-0.34 ms on every
hot path) means **wall-clock latency = network + Neon proxy overhead**.

### Optimizations layered on top to minimize round trips

We don't move the DB region (yet) but we cut the *number* of RTTs per
request via:

1. **Auth caching** (Chapter 7 §Redis) — token + user lookups served
   from Redis (1-hour TTL) and per-pod in-memory cache (5-min TTL).
   On warm cache: 0 SQL round-trips for auth.
2. **JOIN consolidation** — two-step
   `find residence-IDs → find tasks IN ids` collapsed into a single
   query with a Postgres subquery. One RTT instead of two.
3. **Single-query auth** — token + user fetched in one INNER JOIN
   instead of GORM's two-query Preload pattern.
4. **Residence-IDs Redis cache** — cached per user with 5-min TTL,
   invalidated on Create/Delete/Join/Remove. Saves 1 RTT per
   `/api/documents/`, `/api/contractors/`, `/api/residences/summary/`
   request.

After these, a fully-warm `/api/tasks/` is **1 SQL round-trip total
(~220ms wall-clock)**. Verified via Jaeger trace — see Chapter 15.

### When this still matters

- Any cold-cache request still pays 2-3 RTTs (~500-700ms).
- Pod startup pays 1 RTT × 20 (warm-up), but that runs in parallel:
  ~440ms one-shot.

Long-term fix: migrate Neon to `aws-eu-central-1` (Frankfurt) — drops
RTT to ~5ms and brings warm-cache requests under 50ms. Tracked in
`docs/observability-plan.md` and Chapter 18 §migration triggers.

## Environment variables the app reads

From ConfigMap:

| Var | Purpose |
|---|---|
| `DB_HOST` | Neon pooler hostname (`-pooler` suffix) |
| `DB_PORT` | 5432 |
| `POSTGRES_USER` | `neondb_owner` |
| `POSTGRES_DB` | `honeyDue` |
| `DB_SSLMODE` | `require` |
| `DB_MAX_OPEN_CONNS` | 25 |
| `DB_MAX_IDLE_CONNS` | 20 |
| `DB_MAX_LIFETIME` | `1800s` |
| `DB_MAX_IDLE_TIME` | `0s` (never close idle) |

From Secret (`honeydue-secrets`):

| Var | Purpose |
|---|---|
| `POSTGRES_PASSWORD` | Neon DB password |

## Operator cheat sheet

```bash
# Connect to Neon from workstation (requires psql + the password)
PGPASSWORD="<pw>" psql -h ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
  -U neondb_owner -d honeyDue

# From a pod (lets you debug against the actual in-cluster network path)
kubectl exec -n honeydue -it deploy/api -- sh
# inside the pod (no psql by default, but wget + JSON API works)
wget -qO- http://127.0.0.1:8000/api/health/

# See current migration state (no direct CLI, but the api logs show it)
kubectl logs -n honeydue deploy/api | grep -i migration

# See active connections (run against Neon)
SELECT count(*), usename, state, application_name
FROM pg_stat_activity
GROUP BY usename, state, application_name;
```

## References

- [Neon docs][neon-docs]
- [Neon pricing][neon-pricing]
- [Postgres advisory locks][pg-locks]
- [GORM AutoMigrate][gorm-automigrate]
- [honeyDue task architecture][task-arch] (repo-local)

[neon-docs]: https://neon.com/docs/introduction
[neon-pricing]: https://neon.com/pricing
[pg-locks]: https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS
[gorm-automigrate]: https://gorm.io/docs/migration.html
[task-arch]: ../../docs/TASK_LOGIC_ARCHITECTURE.md