docs: capture latency optimizations + new caching invariants

Shipping commit 88fb175 changed the trace shape and added a new caching layer with required invalidation rules. Updating the operator-facing docs so they match the running system.

ch08 (database):
- DB_HOST is the -pooler Neon endpoint, not direct compute
- Connection pool: MaxIdleConns 20 (was 10), MaxLifetime 30m (was 10m), MaxIdleTime 0 (never close idle)
- New "Pool warm-up at boot" section documenting the 20-parallel-ping warm-up in database.Connect
- Replaced the "Neon regions" section: explicit RTT numbers, the optimization stack that minimizes round-trips, when this still matters

ch15 (observability):
- Replaced the 2,473ms/5-span sample trace with the new 229ms/2-span post-optimization trace; kept the old one underneath for diff context

ch16 (failure modes):
- Added: stale residence-IDs cache (data freshness bug + recovery)
- Added: Redis at maxmemory limit (verify allkeys-lru policy)
- Added: Neon pooler unreachable but direct endpoint up (emergency switchover procedure)

ch17 (runbook):
- §23 Invalidate residence-IDs cache for a user (DEL key + grep for missing invalidation in new code)
- §24 Verify DB pool warm-up is working (log pattern + impact test)
- §25 Switch DB host between pooler and direct endpoints

observability-plan.md status flipped from "plan only" to shipped with the latency-cut summary. README links to the new ch08 latency section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 17:36:36 -05:00
parent 88fb1751c7
commit c9ac273dbd
6 changed files with 264 additions and 34 deletions
@@ -245,12 +245,58 @@ finds an empty data directory (or can't mount at all).
 - If the original node is gone: Redis starts empty. Cache regenerates.
  Asynq queue state is lost; pending jobs re-queue on retry, cron
  fires re-schedule on next tick.
+- Auth caches (token + residence-IDs) regenerate on first user
+  request: the first request per user pays a full DB lookup, and
+  later requests hit the warm cache again. Visible as a brief latency
+  spike in the Grafana RED dashboard, not a functional failure.
 - Ensure the node label `honeydue/redis=true` is on a healthy node:
 ```bash
 kubectl label node <new-node> honeydue/redis=true --overwrite
 kubectl label node <dead-node> honeydue/redis- 2>/dev/null || true
 ```

+#### Stale residence-IDs cache (data freshness bug)
+
+**Symptom**: a user accepts a share-code or has a residence
+removed, but `/api/tasks/`, `/api/documents/`, `/api/contractors/`,
+or `/api/residences/summary/` continues to show the old
+membership for up to 5 minutes.
+**Cause**: a residence-membership-mutating code path landed
+without calling `cache.InvalidateResidenceIDsForUsers(...)`. The
+cache TTL is 5 min so the issue self-heals, but it's user-visible.
+**Recovery (immediate)**: flush the affected user's cache key
+manually. See [Chapter 17 §residence-IDs cache invalidation](./17-runbook.md).
+**Prevention (permanent)**: every mutation that changes
+`residence_residence.owner_id`, `residence_residence_users.user_id`,
+or deletes a residence MUST invalidate. Existing call sites for
+reference: `CreateResidence` (owner), `DeleteResidence`
+(all members), `JoinWithCode` (joining user), `RemoveUser`
+(removed user). The pattern lives in
+`internal/services/residence_id_cache.go`.
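
A rough way to audit new code for the missing-invalidation case is sketched below. It is a heuristic under stated assumptions, not something that ships in the repository: the function name `find_unguarded_files` is hypothetical, and the search pattern is whatever string your membership-mutating code is expected to contain (the table name in the usage line is only an example).

```bash
# Hypothetical audit helper (not part of the repo).
# Lists Go files under DIR that match PATTERN but never mention
# InvalidateResidenceIDsForUsers. A hit is a candidate for review,
# not proof of a bug.
find_unguarded_files() {
  local dir="$1" pattern="$2"
  grep -rlE --include='*.go' -e "$pattern" "$dir" | while IFS= read -r f; do
    grep -q 'InvalidateResidenceIDsForUsers' "$f" || printf '%s\n' "$f"
  done
}
```

Example: `find_unguarded_files internal/ 'residence_residence_users'`. The check is per file, so a file with one guarded and one unguarded mutation will not be reported.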
+
+#### Redis at maxmemory limit
+
+**Symptom**: Redis logs `OOM command not allowed when used memory > 'maxmemory'`.
+Should be rare — current production usage is ~2.4 MB against a 256 MB
+limit and the policy is `allkeys-lru` (cache writes evict cold keys
+instead of erroring).
+**Recovery**: confirm the policy is still `allkeys-lru`:
+```bash
+kubectl -n honeydue exec deploy/redis -- redis-cli CONFIG GET maxmemory-policy
+```
+If it's somehow `noeviction`, set it live:
+```bash
+kubectl -n honeydue exec deploy/redis -- redis-cli CONFIG SET maxmemory-policy allkeys-lru
+```
+Then re-apply the manifest at `deploy-k3s/manifests/redis/deployment.yaml`
+so the change survives a pod restart.
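
The check-then-set steps above can be combined into one idempotent helper. This is a minimal sketch, assuming `redis-cli` is run non-interactively (so `CONFIG GET` prints the key on line 1 and the value on line 2); the function names `policy_from_config_get` and `ensure_lru_policy` are hypothetical.

```bash
# Hypothetical helper: extract the value from `redis-cli CONFIG GET <key>`
# output read on stdin (key on the first line, value on the second).
policy_from_config_get() {
  sed -n '2p'
}

# Hypothetical wrapper: set allkeys-lru only if the live policy differs.
ensure_lru_policy() {
  local current
  current="$(kubectl -n honeydue exec deploy/redis -- \
    redis-cli CONFIG GET maxmemory-policy | policy_from_config_get)"
  if [ "$current" = "allkeys-lru" ]; then
    echo "maxmemory-policy already allkeys-lru"
  else
    echo "maxmemory-policy is '${current}', setting allkeys-lru"
    kubectl -n honeydue exec deploy/redis -- \
      redis-cli CONFIG SET maxmemory-policy allkeys-lru
  fi
}
```

A live `CONFIG SET` is still lost on pod restart, so the manifest re-apply remains necessary.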
+
+If memory usage is genuinely climbing toward the cap, check for
+runaway keys without TTLs:
+```bash
+kubectl -n honeydue exec deploy/redis -- redis-cli --bigkeys
+```
+
 ### External service failures

 #### Neon Postgres outage
@@ -264,6 +310,23 @@ until Neon is back.
 Postgres-level failover.
 **Frequency**: Neon has had a handful of hours-scale outages since launch.

+#### Neon pooler endpoint unreachable but direct endpoint up
+
+**Symptom**: `dial tcp ep-floral-truth-amttbc5a-pooler.c-5...: i/o
+timeout` in api logs but the direct compute endpoint is reachable.
+Rare — Neon's pooler runs in their infra alongside compute — but
+possible during pooler maintenance.
+**Recovery (emergency)**: switch `DB_HOST` (`database.host` in
+`deploy-k3s/config.yaml`) from the `-pooler` hostname to the direct
+hostname (drop the `-pooler` segment), re-apply the ConfigMap, and
+rolling-restart api and worker:
+```bash
+# Edit deploy-k3s/config.yaml: database.host: ep-floral-truth-amttbc5a.c-5...
+# Then:
+KUBECONFIG=~/.kube/honeydue.yaml bash deploy-k3s/scripts/03-deploy.sh --skip-build
+```
+Cold-handshake latency goes back up (~440ms first hit) but the API
+keeps serving. Switch back when the pooler recovers.
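
Dropping the `-pooler` segment by hand is easy to get wrong under pressure. The sketch below derives the direct hostname from a pooler hostname, assuming the suffix always sits at the end of the first DNS label, as in the hostname shown in the symptom. The function name `to_direct_host` is hypothetical and the hostname in the usage line is a placeholder.

```bash
# Hypothetical helper: strip a trailing "-pooler" from the first DNS label.
# A hostname that is already direct is returned unchanged.
to_direct_host() {
  local host="$1"
  local first="${host%%.*}"       # first label, e.g. "ep-example-pooler"
  local rest="${host#"$first"}"   # remainder, including the leading dot
  printf '%s%s\n' "${first%-pooler}" "$rest"
}
```

Example: `to_direct_host ep-example-pooler.region.example.com` prints `ep-example.region.example.com`. To switch back once the pooler recovers, restore the original hostname from version control rather than re-deriving it.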
+
 #### Backblaze B2 outage

 **Symptom**: image uploads fail; image downloads fail unless cached by