docs: capture latency optimizations + new caching invariants

Shipping commit 88fb175 changed the trace shape and added a new caching layer with required invalidation rules. Updating the operator-facing docs so they match the running system.

ch08 (database):
- DB_HOST is the -pooler Neon endpoint, not direct compute
- Connection pool: MaxIdleConns 20 (was 10), MaxLifetime 30m (was 10m), MaxIdleTime 0 (never close idle)
- New "Pool warm-up at boot" section documenting the 20-parallel-ping warm-up in database.Connect
- Replaced the "Neon regions" section: explicit RTT numbers, the optimization stack that minimizes round-trips, when this still matters

ch15 (observability):
- Replaced the 2,473ms/5-span sample trace with the new 229ms/2-span post-optimization trace; kept the old one underneath for diff context

ch16 (failure modes):
- Added: stale residence-IDs cache (data freshness bug + recovery)
- Added: Redis at maxmemory limit (verify allkeys-lru policy)
- Added: Neon pooler unreachable but direct endpoint up (emergency switchover procedure)

ch17 (runbook):
- §23 Invalidate residence-IDs cache for a user (DEL key + grep for missing invalidation in new code)
- §24 Verify DB pool warm-up is working (log pattern + impact test)
- §25 Switch DB host between pooler and direct endpoints

observability-plan.md status flipped from "plan only" to shipped with the latency-cut summary. README links to the new ch08 latency section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 17:36:36 -05:00
parent 88fb1751c7
commit c9ac273dbd
6 changed files with 264 additions and 34 deletions
@@ -6,7 +6,9 @@

 **Why not in the honeyDue k3s cluster:** Frees ~700 MB across the 3 Hetzner nodes, no PVC plumbing, and no need to expose anything from k3s — everything is push-from-app to a public TLS endpoint.

-**Status:** Plan only — nothing implemented yet.
+**Status:** Fully shipped. VictoriaMetrics + Jaeger + Grafana on `obs.88oakapps.com`, vmagent in-cluster, OTel SDK and otelgorm wired into the api+worker, every authed endpoint produces nested HTTP→service→SQL flame graphs in Jaeger.
+
+The first round of traces revealed every visible ms was network/proxy overhead — DB execution itself is sub-millisecond. The follow-up work (`internal/services/residence_id_cache.go`, GORM pool warm-up, auth-query JOIN consolidation, switching `DB_HOST` to Neon's `-pooler` endpoint, bumped cache TTLs) cut warm-cache `/api/tasks/` from 2,473 ms / 5 spans to **229 ms / 2 spans** — see commit `88fb175` and Chapter 8 §"Optimizations layered on top".

 ---