docs: capture latency optimizations + new caching invariants

Shipping commit 88fb175 changed the trace shape and added a new caching layer with required invalidation rules. Updating the operator-facing docs so they match the running system.

ch08 (database):
- DB_HOST is the -pooler Neon endpoint, not direct compute
- Connection pool: MaxIdleConns 20 (was 10), MaxLifetime 30m (was 10m), MaxIdleTime 0 (never close idle)
- New "Pool warm-up at boot" section documenting the 20-parallel-ping warm-up in database.Connect
- Replaced the "Neon regions" section: explicit RTT numbers, the optimization stack that minimizes round-trips, when this still matters

ch15 (observability):
- Replaced the 2,473ms/5-span sample trace with the new 229ms/2-span post-optimization trace; kept the old one underneath for diff context

ch16 (failure modes):
- Added: stale residence-IDs cache (data freshness bug + recovery)
- Added: Redis at maxmemory limit (verify allkeys-lru policy)
- Added: Neon pooler unreachable but direct endpoint up: emergency switchover procedure

ch17 (runbook):
- §23 Invalidate residence-IDs cache for a user (DEL key + grep for missing invalidation in new code)
- §24 Verify DB pool warm-up is working (log pattern + impact test)
- §25 Switch DB host between pooler and direct endpoints

observability-plan.md status flipped from "plan only" to shipped with the latency-cut summary. README links to the new ch08 latency section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 17:36:36 -05:00
parent 88fb1751c7
commit c9ac273dbd
6 changed files with 264 additions and 34 deletions
@@ -32,7 +32,7 @@ Neon Launch won on:

 | Field | Value |
 |---|---|
-| Hostname | `ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech` |
+| Hostname | `ep-floral-truth-amttbc5a-pooler.c-5.us-east-1.aws.neon.tech` |
 | Port | 5432 |
 | Username | `neondb_owner` |
 | Database | `honeyDue` (case-sensitive!) |
@@ -58,9 +58,19 @@ paid tiers much higher.

 ### PgBouncer on Neon

-Neon provides a built-in PgBouncer at `-pooler` subdomain. Our hostname
-already includes `-pooler` handling in the route, so connections go
-through PgBouncer transparently.
+Neon provides a built-in PgBouncer at the `-pooler` subdomain. The
+non-pooler endpoint (`ep-floral-truth-amttbc5a.c-5.us-east-1...`) is
+the direct compute endpoint and connects straight to Postgres,
+paying the full TCP+TLS+startup handshake on every cold connection.
+The `-pooler` endpoint multiplexes through PgBouncer in Neon's
+infrastructure.
+
+**We use the `-pooler` endpoint** because the direct endpoint paid
+~440ms per cold handshake on a transatlantic link, visible as
+1500ms-tail spikes in /api/tasks/ traces. The pooler keeps backend
+Postgres connections warm in Neon's data center, so the only
+latency our Go pods see is one TCP+TLS to PgBouncer (already
+warm via our pool) plus one query round-trip.
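
Editor's sketch (not part of the commit): the two hostnames in the table above differ only by a `-pooler` suffix on the first DNS label. A minimal Go helper for switching between them might look like the following. The function names `poolerHost` and `directHost` are hypothetical and do not come from the repository.

```go
package main

import (
	"fmt"
	"strings"
)

const poolerSuffix = "-pooler"

// poolerHost (hypothetical helper) appends "-pooler" to the first DNS label.
// Hosts without a dot, or already carrying the suffix, are returned unchanged.
func poolerHost(host string) string {
	label, rest, found := strings.Cut(host, ".")
	if !found || strings.HasSuffix(label, poolerSuffix) {
		return host
	}
	return label + poolerSuffix + "." + rest
}

// directHost (hypothetical helper) strips "-pooler" from the first DNS label,
// yielding the direct compute endpoint.
func directHost(host string) string {
	label, rest, found := strings.Cut(host, ".")
	if !found {
		return host
	}
	return strings.TrimSuffix(label, poolerSuffix) + "." + rest
}

func main() {
	direct := "ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech"
	pooled := poolerHost(direct)
	fmt.Println(pooled)
	fmt.Println(directHost(pooled) == direct)
}
```

Both helpers are idempotent, so applying one to a host that is already in the target form is harmless.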

 Modes PgBouncer supports:
 - **session** — one server connection held per client session (transparent)
@@ -68,26 +78,59 @@ Modes PgBouncer supports:
 - **statement** — per-statement (most aggressive; breaks many features)

 Neon's pooler runs in **transaction mode**. This is compatible with GORM
-out of the box (we don't use session-level features like prepared
-statements or session variables).
+out of the box (we don't use session-level features like LISTEN/NOTIFY
+or session-scope advisory locks). Note: `database.MigrateWithLock()`
+needs the *direct* (non-pooler) endpoint because session-level
+advisory locks don't survive PgBouncer's per-transaction cycling — but
+the migration helper opens its own ad-hoc connection bypassing the
+configured pool, so this happens automatically. See `MigrateWithLock`
+in `internal/database/database.go`.

 ### Connection pool settings

-In `prod.env`:
+In `config.yaml` (rendered into ConfigMap → env vars):

-```
-DB_MAX_OPEN_CONNS=25
-DB_MAX_IDLE_CONNS=10
-DB_MAX_LIFETIME=600s
+```yaml
+database:
+  max_open_conns: 25
+  max_idle_conns: 20
+  max_lifetime: "1800s"
+  max_idle_time: "0s"
 ```

-These are the Go `database/sql` pool settings (GORM uses `database/sql`
-underneath):
+These map to Go `database/sql` pool settings:

- **MaxOpenConns: 25** — at most 25 concurrent connections per replica
- **MaxIdleConns: 10** — keep up to 10 warm connections ready to reuse
- **MaxLifetime: 600s** — recycle connections after 10 min (prevents
-  stale state in long-lived connections, good for Neon's idle timeout)
+- **MaxOpenConns: 25** — at most 25 concurrent connections per replica.
+- **MaxIdleConns: 20** — keep up to 20 warm connections per replica
+  ready to reuse. Bumped from 10 because the pooler tolerates many
+  client connections cheaply, and the cost of a cold handshake (~440ms
+  transatlantic) is far higher than the cost of holding an idle
+  connection.
+- **MaxLifetime: 1800s** — recycle connections after 30 min. Bumped
+  from 600s; with the pooler keeping things warm, longer lifetime
+  reduces churn.
+- **MaxIdleTime: 0s** — never close idle connections. Lifetime drives
+  recycling instead.
+
+### Pool warm-up at boot
+
+`database.Connect()` issues 20 parallel `PingContext` calls
+immediately after opening the pool. This pre-establishes
+`MaxIdleConns` connections to the pooler so the first user request
+doesn't pay any handshake.
+
+The warm-up costs about *one* cold handshake (~440ms) of wall-clock
+time, not one handshake per connection, because the pings run
+concurrently. Pod logs at boot confirm it:
+
+```
+{"level":"info","requested":20,"warmed":20,"message":"DB pool warm-up complete"}
+```
+
+If warm-up partially fails (e.g., 18/20 succeed), the pod still
+starts; the pool fills the rest under traffic. A database that cannot
+be reached at all is caught earlier by the synchronous `sqlDB.Ping()`
+that runs immediately before the warm-up, and that failure is fatal.

 ### Worst-case connection count

@@ -229,17 +272,45 @@ value.
 ## Neon regions

 Neon's default region for new projects is `aws-us-east-1` (Virginia).
-Our DB is there. Latency from Nuremberg to us-east-1 is **~90-120ms
-round trip**.
+Our DB is there. Latency from Nuremberg to us-east-1 is **~108ms one-way**
+TCP-level (verified by `nc -z -w 5` from `hetzner1`), so **~220ms RTT
+through Neon's pooler stack**.

 This is the slowest hop in our data flow. Every API request that needs
-a DB query (most of them) pays this latency at least once.
+a DB query pays this latency at least once. Sub-millisecond Postgres
+execution time (verified via `EXPLAIN ANALYZE`: 0.04-0.34 ms on every
+hot path) means **wall-clock latency = network + Neon proxy overhead**.

-**When this matters**: When we start seeing ~200ms+ response times from
-complex endpoints, it's likely DB latency dominant. Options:
- Migrate Neon to `aws-eu-central-1` (Frankfurt) — shaves ~90ms off
- Add Redis caching for hot reads (Chapter 7)
- Read replicas (Neon supports them on paid tiers)
+### Optimizations layered on top to minimize round trips
+
+We don't move the DB region (yet) but we cut the *number* of RTTs per
+request via:
+
+1. **Auth caching** (Chapter 7 §Redis) — token + user lookups served
+   from Redis (1-hour TTL) and per-pod in-memory cache (5-min TTL).
+   On warm cache: 0 SQL round-trips for auth.
+2. **JOIN consolidation** — two-step
+   `find residence-IDs → find tasks IN ids` collapsed into a single
+   query with a Postgres subquery. One RTT instead of two.
+3. **Single-query auth** — token + user fetched in one INNER JOIN
+   instead of GORM's two-query Preload pattern.
+4. **Residence-IDs Redis cache** — cached per user with 5-min TTL,
+   invalidated on Create/Delete/Join/Remove. Saves 1 RTT per
+   `/api/documents/`, `/api/contractors/`, `/api/residences/summary/`
+   request.
+
+After these, a fully-warm `/api/tasks/` is **1 SQL round-trip total
+(~220ms wall-clock)**. Verified via Jaeger trace — see Chapter 15.
+
+### When this still matters
+
+- Any cold-cache request still pays 2-3 RTTs (~500-700ms).
+- Pod startup pays 20 cold handshakes for the warm-up, but they run
+  in parallel: ~440ms in total, once per pod.
+
+Long-term fix: migrate Neon to `aws-eu-central-1` (Frankfurt) — drops
+RTT to ~5ms and brings warm-cache requests under 50ms. Tracked in
+`docs/observability-plan.md` and Chapter 18 §migration triggers.

 ## Environment variables the app reads

@@ -247,14 +318,15 @@ From ConfigMap:

 | Var | Purpose |
 |---|---|
-| `DB_HOST` | Neon pooler hostname |
+| `DB_HOST` | Neon pooler hostname (`-pooler` suffix) |
 | `DB_PORT` | 5432 |
 | `POSTGRES_USER` | `neondb_owner` |
 | `POSTGRES_DB` | `honeyDue` |
 | `DB_SSLMODE` | `require` |
 | `DB_MAX_OPEN_CONNS` | 25 |
-| `DB_MAX_IDLE_CONNS` | 10 |
-| `DB_MAX_LIFETIME` | `600s` |
+| `DB_MAX_IDLE_CONNS` | 20 |
+| `DB_MAX_LIFETIME` | `1800s` |
+| `DB_MAX_IDLE_TIME` | `0s` (never close idle) |

 From Secret (`honeydue-secrets`):