docs: capture latency optimizations + new caching invariants

Shipping commit 88fb175 changed the trace shape and added a new caching
layer with required invalidation rules. Updating the operator-facing
docs so they match the running system.

ch08 (database):
- DB_HOST is the -pooler Neon endpoint, not direct compute
- Connection pool: MaxIdleConns 20 (was 10), MaxLifetime 30m (was 10m),
  MaxIdleTime 0 (never close idle)
- New "Pool warm-up at boot" section documenting the 20-parallel-ping
  warm-up in database.Connect
- Replaced the "Neon regions" section: explicit RTT numbers, the
  optimization stack that minimizes round-trips, when this still matters

ch15 (observability):
- Replaced the 2,473ms/5-span sample trace with the new 229ms/2-span
  post-optimization trace; kept the old one underneath for diff context

ch16 (failure modes):
- Added: stale residence-IDs cache (data freshness bug + recovery)
- Added: Redis at maxmemory limit (verify allkeys-lru policy)
- Added: Neon pooler unreachable but direct endpoint up — emergency
  switchover procedure

ch17 (runbook):
- §23 Invalidate residence-IDs cache for a user (DEL key + grep for
  missing invalidation in new code)
- §24 Verify DB pool warm-up is working (log pattern + impact test)
- §25 Switch DB host between pooler and direct endpoints

observability-plan.md status flipped from "plan only" to shipped
with the latency-cut summary.

README links to the new ch08 latency section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trey t
2026-04-25 17:36:36 -05:00
parent 88fb1751c7
commit c9ac273dbd
6 changed files with 264 additions and 34 deletions
@@ -245,12 +245,58 @@ finds an empty data directory (or can't mount at all).
- If the original node is gone: Redis starts empty. Cache regenerates.
Asynq queue state is lost; pending jobs re-queue on retry, cron
fires re-schedule on next tick.
- Auth caches (token + residence-IDs) regenerate on each user's first
request: that request pays the full DB lookup, and later requests hit
the warm cache. This is visible as a brief latency spike in the Grafana
RED dashboard, not a functional failure.
- Ensure the node label `honeydue/redis=true` is on a healthy node:
```bash
kubectl label node <new-node> honeydue/redis=true --overwrite
kubectl label node <dead-node> honeydue/redis- 2>/dev/null || true
```
#### Stale residence-IDs cache (data freshness bug)
**Symptom**: a user accepts a share-code or has a residence
removed, but `/api/tasks/`, `/api/documents/`, `/api/contractors/`,
or `/api/residences/summary/` continues to show the old
membership for up to 5 minutes.
**Cause**: a residence-membership-mutating code path landed
without calling `cache.InvalidateResidenceIDsForUsers(...)`. The
cache TTL is 5 min so the issue self-heals, but it's user-visible.
**Recovery (immediate)**: flush the affected user's cache key
manually. See [Chapter 17 §residence-IDs cache invalidation](./17-runbook.md).
**Prevention (permanent)**: every mutation that changes
`residence_residence.owner_id`, `residence_residence_users.user_id`,
or deletes a residence MUST invalidate. Existing call sites for
reference: `CreateResidence` (owner), `DeleteResidence`
(all members), `JoinWithCode` (joining user), `RemoveUser`
(removed user). The pattern lives in
`internal/services/residence_id_cache.go`.
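One rough way to audit new code for this bug is a grep pass over the service code. The sketch below is a heuristic, not part of the project's tooling: the function name `missing_invalidation`, the default directory, and the assumption that membership-mutating files mention `residence_residence_users` literally are all illustrative. It lists Go files that match the pattern but never call `InvalidateResidenceIDsForUsers`.

```bash
# Heuristic audit. Assumption: membership-mutating code mentions the
# search pattern literally; adjust the pattern to fit the codebase.
# Prints Go files that match the pattern but never call the helper.
missing_invalidation() {
  mi_dir="${1:-internal}"
  mi_pattern="${2:-residence_residence_users}"
  grep -rl --include='*.go' -e "$mi_pattern" "$mi_dir" 2>/dev/null |
    while IFS= read -r mi_file; do
      if ! grep -q 'InvalidateResidenceIDsForUsers' "$mi_file"; then
        printf '%s\n' "$mi_file"
      fi
    done
}

# Example: missing_invalidation internal/services
```

The check is per file, so a file with one correct call site and one missing call site still passes. Treat the output as a starting point for review rather than proof that everything is covered.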
#### Redis at maxmemory limit
**Symptom**: Redis logs `OOM command not allowed when used memory > 'maxmemory'`.
This should be rare: current production usage is ~2.4 MB against a
256 MB limit, and the policy is `allkeys-lru` (cache writes evict cold
keys instead of erroring).
**Recovery**: confirm the policy is still `allkeys-lru`:
```bash
kubectl -n honeydue exec deploy/redis -- redis-cli CONFIG GET maxmemory-policy
```
If it's somehow `noeviction`, set it live:
```bash
kubectl -n honeydue exec deploy/redis -- redis-cli CONFIG SET maxmemory-policy allkeys-lru
```
Then re-apply the manifest at `deploy-k3s/manifests/redis/deployment.yaml`
so the change survives a pod restart.
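If memory usage is genuinely climbing toward the cap, look for runaway
keys. `--bigkeys` reports the largest key of each type; run
`redis-cli TTL <key>` on a suspect to see whether it expires (`-1`
means no TTL):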
```bash
kubectl -n honeydue exec deploy/redis -- redis-cli --bigkeys
```
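Since `--bigkeys` reports size but not expiry, a small loop over `--scan` can list the keys that have no TTL at all. This is a minimal sketch: the function name `keys_without_ttl` and the `REDIS_CLI` variable are illustrative, and the default prefix simply reuses the `kubectl exec` form from the commands above.

```bash
# Assumption: REDIS_CLI holds the same command prefix used above.
REDIS_CLI="${REDIS_CLI:-kubectl -n honeydue exec deploy/redis -- redis-cli}"

# Print every key that has no expiry. TTL returns -1 for such keys.
# --scan iterates with SCAN, so it avoids the blocking behaviour of KEYS.
# $REDIS_CLI is intentionally unquoted so the prefix splits into words.
keys_without_ttl() {
  $REDIS_CLI --scan </dev/null |
    while IFS= read -r key; do
      ttl=$($REDIS_CLI TTL "$key" </dev/null)
      if [ "$ttl" = "-1" ]; then
        printf '%s\n' "$key"
      fi
    done
}

# Example: keys_without_ttl | head -n 20
```

This issues one `TTL` call per key, which is fine for a small keyspace but slow for a large one.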
### External service failures
#### Neon Postgres outage
@@ -264,6 +310,23 @@ until Neon is back.
Postgres-level failover.
**Frequency**: Neon has had a handful of hours-scale outages since launch.
#### Neon pooler endpoint unreachable but direct endpoint up
**Symptom**: `dial tcp ep-floral-truth-amttbc5a-pooler.c-5...: i/o
timeout` in api logs but the direct compute endpoint is reachable.
This is rare, since Neon's pooler runs in their infrastructure alongside
compute, but it is possible during pooler maintenance.
**Recovery (emergency)**: switch `DB_HOST` in `config.yaml` from the
`-pooler` hostname to the direct hostname (drop the `-pooler` segment),
re-apply the ConfigMap, and rolling-restart api and worker:
```bash
# Edit deploy-k3s/config.yaml: database.host: ep-floral-truth-amttbc5a.c-5...
# Then:
KUBECONFIG=~/.kube/honeydue.yaml bash deploy-k3s/scripts/03-deploy.sh --skip-build
```
Cold-handshake latency goes back up (~440ms first hit) but the API
keeps serving. Switch back when the pooler recovers.
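Before switching, it can help to confirm that the direct endpoint really is reachable while the pooler is not. The sketch below assumes `nc` is installed and that Postgres listens on port 5432. The helper names `direct_host` and `port_open` and the `POOLER_HOST` variable are illustrative placeholders for the full hostname in `config.yaml`.

```bash
# Derive the direct hostname by dropping the first "-pooler" segment.
direct_host() {
  printf '%s\n' "$1" | sed 's/-pooler//'
}

# Succeeds if a TCP connection to host:port opens within 5 seconds.
# Assumptions: nc (netcat) is available; the port defaults to 5432.
port_open() {
  nc -z -w 5 "$1" "${2:-5432}" >/dev/null 2>&1
}

# Example (POOLER_HOST is a placeholder for the hostname in config.yaml):
#   direct=$(direct_host "$POOLER_HOST")
#   if ! port_open "$POOLER_HOST" && port_open "$direct"; then
#     echo "pooler down, direct endpoint up: switch DB_HOST to $direct"
#   fi
```

A check run from a workstation only approximates what the api pods see, so running it from inside the cluster gives a more reliable answer.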
#### Backblaze B2 outage
**Symptom**: image uploads fail; image downloads fail unless cached by