From c9ac273dbd2268281367d0a3190ee07a0f3f1234 Mon Sep 17 00:00:00 2001 From: Trey t Date: Sat, 25 Apr 2026 17:36:36 -0500 Subject: [PATCH] docs: capture latency optimizations + new caching invariants MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Shipping commit 88fb175 changed the trace shape and added a new caching layer with required invalidation rules. Updating the operator-facing docs so they match the running system. ch08 (database): - DB_HOST is the -pooler Neon endpoint, not direct compute - Connection pool: MaxIdleConns 20 (was 10), MaxLifetime 30m (was 10m), MaxIdleTime 0 (never close idle) - New \"Pool warm-up at boot\" section documenting the 20-parallel-ping warm-up in database.Connect - Replaced the \"Neon regions\" section: explicit RTT numbers, the optimization stack that minimizes round-trips, when this still matters ch15 (observability): - Replaced the 2,473ms/5-span sample trace with the new 229ms/2-span post-optimization trace; kept the old one underneath for diff context ch16 (failure modes): - Added: stale residence-IDs cache (data freshness bug + recovery) - Added: Redis at maxmemory limit (verify allkeys-lru policy) - Added: Neon pooler unreachable but direct endpoint up — emergency switchover procedure ch17 (runbook): - §23 Invalidate residence-IDs cache for a user (DEL key + grep for missing invalidation in new code) - §24 Verify DB pool warm-up is working (log pattern + impact test) - §25 Switch DB host between pooler and direct endpoints observability-plan.md status flipped from \"plan only\" to shipped with the latency-cut summary. README links to the new ch08 latency section. Co-Authored-By: Claude Opus 4.7 (1M context) --- README.md | 1 + docs/deployment/08-database.md | 128 ++++++++++++++++++++++------ docs/deployment/15-observability.md | 27 ++++-- docs/deployment/16-failure-modes.md | 63 ++++++++++++++ docs/deployment/17-runbook.md | 75 ++++++++++++++++ docs/observability-plan.md | 4 +- 6 files changed, 264 insertions(+), 34 deletions(-) diff --git a/README.md b/README.md index 5366c29..74ac734 100644 --- a/README.md +++ b/README.md @@ -378,6 +378,7 @@ Quick links: - **Deploy process** — [docs/deployment/14-deployment-process.md](./docs/deployment/14-deployment-process.md) — `bash deploy-k3s/scripts/03-deploy.sh` builds → pushes → rolls out - **Observability** — [docs/deployment/15-observability.md](./docs/deployment/15-observability.md) — VictoriaMetrics + Jaeger + Grafana on `obs.88oakapps.com` - **Observability plan** — [docs/observability-plan.md](./docs/observability-plan.md) — design doc and rollout phases +- **Database / pool tuning** — [docs/deployment/08-database.md](./docs/deployment/08-database.md) — Neon pooler endpoint, GORM pool, warm-up, RTT budget - **Failure modes** — [docs/deployment/16-failure-modes.md](./docs/deployment/16-failure-modes.md) — what happens when X dies - **Swarm postmortem** — [docs/deployment/19-postmortem-swarm.md](./docs/deployment/19-postmortem-swarm.md) — why we migrated diff --git a/docs/deployment/08-database.md b/docs/deployment/08-database.md index 865ade7..7fd29fa 100644 --- a/docs/deployment/08-database.md +++ b/docs/deployment/08-database.md @@ -32,7 +32,7 @@ Neon Launch won on: | Field | Value | |---|---| -| Hostname | `ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech` | +| Hostname | `ep-floral-truth-amttbc5a-pooler.c-5.us-east-1.aws.neon.tech` | | Port | 5432 | | Username | `neondb_owner` | | Database | `honeyDue` (case-sensitive!) | @@ -58,9 +58,19 @@ paid tiers much higher. ### PgBouncer on Neon -Neon provides a built-in PgBouncer at `-pooler` subdomain. Our hostname -already includes `-pooler` handling in the route, so connections go -through PgBouncer transparently. +Neon provides a built-in PgBouncer at the `-pooler` subdomain. The +non-pooler endpoint (`ep-floral-truth-amttbc5a.c-5.us-east-1...`) is +the direct compute endpoint and connects straight to Postgres, +paying the full TCP+TLS+startup handshake on every cold connection. +The `-pooler` endpoint multiplexes through PgBouncer in Neon's +infrastructure. + +**We use the `-pooler` endpoint** because the direct endpoint paid +~440ms per cold handshake on a transatlantic link, visible as +1500ms-tail spikes in /api/tasks/ traces. The pooler keeps backend +Postgres connections warm in Neon's data center, so the only +latency our Go pods see is one TCP+TLS to PgBouncer (already +warm via our pool) plus one query round-trip. Modes PgBouncer supports: - **session** — one server connection held per client session (transparent) @@ -68,26 +78,59 @@ Modes PgBouncer supports: - **statement** — per-statement (most aggressive; breaks many features) Neon's pooler runs in **transaction mode**. This is compatible with GORM -out of the box (we don't use session-level features like prepared -statements or session variables). +out of the box (we don't use session-level features like LISTEN/NOTIFY +or session-scope advisory locks). Note: `database.MigrateWithLock()` +needs the *direct* (non-pooler) endpoint because session-level +advisory locks don't survive PgBouncer's per-transaction cycling — but +the migration helper opens its own ad-hoc connection bypassing the +configured pool, so this happens automatically. See `MigrateWithLock` +in `internal/database/database.go`. ### Connection pool settings -In `prod.env`: +In `config.yaml` (rendered into ConfigMap → env vars): -``` -DB_MAX_OPEN_CONNS=25 -DB_MAX_IDLE_CONNS=10 -DB_MAX_LIFETIME=600s +```yaml +database: + max_open_conns: 25 + max_idle_conns: 20 + max_lifetime: "1800s" + max_idle_time: "0s" ``` -These are the Go `database/sql` pool settings (GORM uses `database/sql` -underneath): +These map to Go `database/sql` pool settings: -- **MaxOpenConns: 25** — at most 25 concurrent connections per replica -- **MaxIdleConns: 10** — keep up to 10 warm connections ready to reuse -- **MaxLifetime: 600s** — recycle connections after 10 min (prevents - stale state in long-lived connections, good for Neon's idle timeout) +- **MaxOpenConns: 25** — at most 25 concurrent connections per replica. +- **MaxIdleConns: 20** — keep up to 20 warm connections per replica + ready to reuse. Bumped from 10 because the pooler tolerates many + client connections cheaply, and the cost of a cold handshake (~440ms + transatlantic) is far higher than the cost of holding an idle + connection. +- **MaxLifetime: 1800s** — recycle connections after 30 min. Bumped + from 600s; with the pooler keeping things warm, longer lifetime + reduces churn. +- **MaxIdleTime: 0s** — never close idle connections. Lifetime drives + recycling instead. + +### Pool warm-up at boot + +`database.Connect()` issues 20 parallel `PingContext` calls +immediately after opening the pool. This pre-establishes +`MaxIdleConns` connections to the pooler so the first user request +doesn't pay any handshake. + +The warm-up is bounded by *one* round-trip time (~440ms cold), not +one round-trip per connection — pings run concurrently. Confirmed +in pod logs at boot: + +``` +{"level":"info","requested":20,"warmed":20,"message":"DB pool warm-up complete"} +``` + +If warm-up partially fails (e.g., 18/20 succeed), the pod still +starts; the pool fills the rest under traffic. Failure to ping at all +would be caught by the synchronous `sqlDB.Ping()` immediately before, +which is fatal. ### Worst-case connection count @@ -229,17 +272,45 @@ value. ## Neon regions Neon's default region for new projects is `aws-us-east-1` (Virginia). -Our DB is there. Latency from Nuremberg to us-east-1 is **~90-120ms -round trip**. +Our DB is there. Latency from Nuremberg to us-east-1 is **~108ms one-way** +TCP-level (verified by `nc -z -w 5` from `hetzner1`), so **~220ms RTT +through Neon's pooler stack**. This is the slowest hop in our data flow. Every api request that needs -a DB query (most of them) pays this latency at least once. +a DB query pays this latency at least once. Sub-millisecond Postgres +execution time (verified via `EXPLAIN ANALYZE`: 0.04-0.34 ms on every +hot path) means **wall-clock latency = network + Neon proxy overhead**. -**When this matters**: When we start seeing ~200ms+ response times from -complex endpoints, it's likely DB latency dominant. Options: -- Migrate Neon to `aws-eu-central-1` (Frankfurt) — shaves ~90ms off -- Add Redis caching for hot reads (Chapter 7) -- Read replicas (Neon supports them on paid tiers) +### Optimizations layered on top to minimize round trips + +We don't move the DB region (yet) but we cut the *number* of RTTs per +request via: + +1. **Auth caching** (Chapter 7 §Redis) — token + user lookups served + from Redis (1-hour TTL) and per-pod in-memory cache (5-min TTL). + On warm cache: 0 SQL round-trips for auth. +2. **JOIN consolidation** — two-step + `find residence-IDs → find tasks IN ids` collapsed into a single + query with a Postgres subquery. One RTT instead of two. +3. **Single-query auth** — token + user fetched in one INNER JOIN + instead of GORM's two-query Preload pattern. +4. **Residence-IDs Redis cache** — cached per user with 5-min TTL, + invalidated on Create/Delete/Join/Remove. Saves 1 RTT per + `/api/documents/`, `/api/contractors/`, `/api/residences/summary/` + request. + +After these, a fully-warm `/api/tasks/` is **1 SQL round-trip total +(~220ms wall-clock)**. Verified via Jaeger trace — see Chapter 15. + +### When this still matters + +- Any cold-cache request still pays 2-3 RTTs (~500-700ms). +- Pod startup pays 1 RTT × 20 (warm-up), but that runs in parallel: + ~440ms one-shot. + +Long-term fix: migrate Neon to `aws-eu-central-1` (Frankfurt) — drops +RTT to ~5ms and brings warm-cache requests under 50ms. Tracked in +`docs/observability-plan.md` and Chapter 18 §migration triggers. ## Environment variables the app reads @@ -247,14 +318,15 @@ From ConfigMap: | Var | Purpose | |---|---| -| `DB_HOST` | Neon pooler hostname | +| `DB_HOST` | Neon pooler hostname (`-pooler` suffix) | | `DB_PORT` | 5432 | | `POSTGRES_USER` | `neondb_owner` | | `POSTGRES_DB` | `honeyDue` | | `DB_SSLMODE` | `require` | | `DB_MAX_OPEN_CONNS` | 25 | -| `DB_MAX_IDLE_CONNS` | 10 | -| `DB_MAX_LIFETIME` | `600s` | +| `DB_MAX_IDLE_CONNS` | 20 | +| `DB_MAX_LIFETIME` | `1800s` | +| `DB_MAX_IDLE_TIME` | `0s` (never close idle) | From Secret (`honeydue-secrets`): diff --git a/docs/deployment/15-observability.md b/docs/deployment/15-observability.md index 9883ac5..1b886ed 100644 --- a/docs/deployment/15-observability.md +++ b/docs/deployment/15-observability.md @@ -248,7 +248,22 @@ Migrated services (every public method takes ctx): - `NotificationService` — all 12 methods - `SubscriptionService` — all 12 methods including Apple/Google IAP -Sample trace for `GET /api/tasks/`: +Sample trace for `GET /api/tasks/` (warm cache, post-optimization): + +``` +GET /api/tasks/ (229ms) +└── service: SELECT * FROM task_task WHERE residence_id IN + (SELECT id FROM residence_residence WHERE...) (227ms) +``` + +Two spans total. The auth path runs entirely from Redis + in-memory +cache (zero SQL queries) thanks to the 1-hour token TTL and 5-min user +TTL. The residence-ID lookup is folded into the tasks query as a +Postgres subquery, so a single network round-trip to Neon services the +whole request. See Chapter 8 §"Optimizations layered on top" for the +optimization stack. + +Earlier trace, before the optimization stack landed (commit 88fb175): ``` GET /api/tasks/ (2473ms) @@ -258,10 +273,12 @@ GET /api/tasks/ (2473ms) └── service: SELECT * FROM task_task WHERE residence_id IN(...) (226ms) ``` -Each query labeled with `db.statement`. The transatlantic Hetzner→Neon -RTT (~110ms one-way) shows up clearly — that's the perf bottleneck the -flame graph makes obvious. See [Chapter 18 — Cost](./18-cost.md) for -the planned move to a US-region cluster. +10× improvement from 2,473ms to 229ms by cutting query count +(5 SQL → 1 SQL on warm cache). The 227ms in the surviving query is +**1 transatlantic round-trip** to Neon us-east-1 from Hetzner +Nuremberg — the physical floor on the current setup. Eliminated by +migrating Neon to a EU region; tracked in [Chapter 18 §migration +triggers](./18-cost.md) and `docs/observability-plan.md`. **Migration pattern (for any future services or middleware):** add `ctx context.Context` as the first arg, change the handler call site diff --git a/docs/deployment/16-failure-modes.md b/docs/deployment/16-failure-modes.md index ac2d740..f3b0155 100644 --- a/docs/deployment/16-failure-modes.md +++ b/docs/deployment/16-failure-modes.md @@ -245,12 +245,58 @@ finds an empty data directory (or can't mount at all). - If the original node is gone: Redis starts empty. Cache regenerates. Asynq queue state is lost; pending jobs re-queue on retry, cron fires re-schedule on next tick. +- Auth caches (token + residence-IDs) regenerate on first user + request — first request per user pays full DB lookup, then warm + again. Visible as a brief latency spike in the Grafana RED + dashboard, not a functional failure. - Ensure the node label `honeydue/redis=true` is on a healthy node: ```bash kubectl label node honeydue/redis=true --overwrite kubectl label node honeydue/redis- 2>/dev/null || true ``` +#### Stale residence-IDs cache (data freshness bug) + +**Symptom**: a user accepts a share-code or has a residence +removed, but `/api/tasks/`, `/api/documents/`, `/api/contractors/`, +or `/api/residences/summary/` continues to show the old +membership for up to 5 minutes. +**Cause**: a residence-membership-mutating code path landed +without calling `cache.InvalidateResidenceIDsForUsers(...)`. The +cache TTL is 5 min so the issue self-heals, but it's user-visible. +**Recovery (immediate)**: flush the affected user's cache key +manually. See [Chapter 17 §residence-IDs cache invalidation](./17-runbook.md). +**Prevention (permanent)**: every mutation that changes +`residence_residence.owner_id`, `residence_residence_users.user_id`, +or deletes a residence MUST invalidate. Existing call sites for +reference: `CreateResidence` (owner), `DeleteResidence` +(all members), `JoinWithCode` (joining user), `RemoveUser` +(removed user). The pattern lives in +`internal/services/residence_id_cache.go`. + +#### Redis at maxmemory limit + +**Symptom**: Redis logs `OOM command not allowed when used memory > 'maxmemory'`. +Should be rare — current production usage is ~2.4 MB against a 256 MB +limit and the policy is `allkeys-lru` (cache writes evict cold keys +instead of erroring). +**Recovery**: confirm the policy is still `allkeys-lru`: +```bash +kubectl -n honeydue exec deploy/redis -- redis-cli CONFIG GET maxmemory-policy +``` +If it's somehow `noeviction`, set it live: +```bash +kubectl -n honeydue exec deploy/redis -- redis-cli CONFIG SET maxmemory-policy allkeys-lru +``` +And re-apply the manifest at `deploy-k3s/manifests/redis/deployment.yaml` +so the change survives a pod restart. + +If memory usage is genuinely climbing toward the cap, check for +runaway keys without TTLs: +```bash +kubectl -n honeydue exec deploy/redis -- redis-cli --bigkeys +``` + ### External service failures #### Neon Postgres outage @@ -264,6 +310,23 @@ until Neon is back. Postgres-level failover. **Frequency**: Neon has had a handful of hours-scale outages since launch. +#### Neon pooler endpoint unreachable but direct endpoint up + +**Symptom**: `dial tcp ep-floral-truth-amttbc5a-pooler.c-5...: i/o +timeout` in api logs but the direct compute endpoint is reachable. +Rare — Neon's pooler runs in their infra alongside compute — but +possible during pooler maintenance. +**Recovery (emergency)**: switch `DB_HOST` in `config.yaml` from the +`-pooler` to the direct hostname (drop the `-pooler` segment), +re-apply ConfigMap, rolling-restart api and worker: +```bash +# Edit deploy-k3s/config.yaml: database.host: ep-floral-truth-amttbc5a.c-5... +# Then: +KUBECONFIG=~/.kube/honeydue.yaml bash deploy-k3s/scripts/03-deploy.sh --skip-build +``` +Cold-handshake latency goes back up (~440ms first hit) but the API +keeps serving. Switch back when the pooler recovers. + #### Backblaze B2 outage **Symptom**: image uploads fail; image downloads fail unless cached by diff --git a/docs/deployment/17-runbook.md b/docs/deployment/17-runbook.md index 44df38f..9bd9512 100644 --- a/docs/deployment/17-runbook.md +++ b/docs/deployment/17-runbook.md @@ -358,6 +358,81 @@ Workaround: in each pod's logs, search for a unique user identifier: stern -n honeydue api | grep "user_id=12345" ``` +## 23. Invalidate residence-IDs cache for a user + +Used when a user reports stale data ("I joined a residence but my +tasks list still shows the old one"). The cache is keyed on user ID +with 5-min TTL — most issues self-heal — but you can flush manually. + +```bash +# Single user +kubectl -n honeydue exec deploy/redis -- redis-cli DEL "residence_ids_user:7" + +# All users (nuclear; everyone pays one DB lookup on next request) +kubectl -n honeydue exec deploy/redis -- redis-cli --scan --pattern "residence_ids_user:*" \ + | xargs -r -n 100 kubectl -n honeydue exec deploy/redis -- redis-cli DEL +``` + +Mutation paths that should invalidate this cache automatically (any +new code that changes membership must call +`cache.InvalidateResidenceIDsForUsers(ctx, userIDs...)`): + +- `ResidenceService.CreateResidence` → owner +- `ResidenceService.DeleteResidence` → all members +- `ResidenceService.JoinWithCode` → joining user +- `ResidenceService.RemoveUser` → removed user + +If a user keeps reporting stale data, grep for missing invalidation: + +```bash +grep -rn "residenceRepo.*Add\|RemoveUser\|residence_residence_users" internal/ \ + | grep -v cache | grep -v _test +``` + +## 24. Verify DB pool warm-up is working + +After a deploy, check the api pod log for the warm-up confirmation: + +```bash +kubectl -n honeydue logs -l app.kubernetes.io/name=api --tail=50 \ + | grep "DB pool warm-up complete" +``` + +Expected output (per pod): + +```json +{"level":"info","requested":20,"warmed":20,"message":"DB pool warm-up complete"} +``` + +If `warmed` < `requested`, the pool partially failed at boot — pod +still starts, fills from there. If `warmed=0`, something's wrong with +either Neon connectivity or auth — check the next log line for the +specific error. + +To test impact: hit the api right after a rollout. With warm-up +working, the first request should be ~250ms (1 RTT). Without warm-up, +the first request is ~700ms (full handshake). + +## 25. Switch DB host between pooler and direct endpoints + +The pooler endpoint (`-pooler` suffix) is the default — it cuts +cold-handshake latency by ~3 RTTs. The direct endpoint +(`ep-floral-truth-amttbc5a.c-5...`) is the fallback. + +```bash +# Edit deploy-k3s/config.yaml — change database.host +# To pooler: ep-floral-truth-amttbc5a-pooler.c-5.us-east-1.aws.neon.tech +# To direct: ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech + +KUBECONFIG=~/.kube/honeydue.yaml bash deploy-k3s/scripts/03-deploy.sh --skip-build +``` + +The pooler runs in transaction mode so any session-scope feature +(LISTEN/NOTIFY, session advisory locks for migrations) auto-falls +through to direct via `MigrateWithLock` opening its own connection. +But if you ever add session-level features in the data path, they'll +need the direct endpoint. + ## References - [kubectl cheat sheet][kubectl-cs] diff --git a/docs/observability-plan.md b/docs/observability-plan.md index 2eefe53..be31814 100644 --- a/docs/observability-plan.md +++ b/docs/observability-plan.md @@ -6,7 +6,9 @@ **Why not in the honeyDue k3s cluster:** Frees ~700 MB across the 3 Hetzner nodes, no PVC plumbing, and no need to expose anything from k3s — everything is push-from-app to a public TLS endpoint. -**Status:** Plan only — nothing implemented yet. +**Status:** Fully shipped. VictoriaMetrics + Jaeger + Grafana on `obs.88oakapps.com`, vmagent in-cluster, OTel SDK and otelgorm wired into the api+worker, every authed endpoint produces nested HTTP→service→SQL flame graphs in Jaeger. + +The first round of traces revealed every visible ms was network/proxy overhead — DB execution itself is sub-millisecond. The follow-up work (`internal/services/residence_id_cache.go`, GORM pool warm-up, auth-query JOIN consolidation, switching `DB_HOST` to Neon's `-pooler` endpoint, bumped cache TTLs) cut warm-cache `/api/tasks/` from 2,473 ms / 5 spans to **229 ms / 2 spans** — see commit `88fb175` and Chapter 8 §"Optimizations layered on top". ---