docs: capture latency optimizations + new caching invariants
Shipping commit 88fb175 changed the trace shape and added a new caching
layer with required invalidation rules. Updating the operator-facing
docs so they match the running system.
ch08 (database):
- DB_HOST is the -pooler Neon endpoint, not direct compute
- Connection pool: MaxIdleConns 20 (was 10), MaxLifetime 30m (was 10m),
MaxIdleTime 0 (never close idle)
- New "Pool warm-up at boot" section documenting the 20-parallel-ping
warm-up in database.Connect
- Replaced the "Neon regions" section: explicit RTT numbers, the
optimization stack that minimizes round-trips, when this still matters
ch15 (observability):
- Replaced the 2,473ms/5-span sample trace with the new 229ms/2-span
post-optimization trace; kept the old one underneath for diff context
ch16 (failure modes):
- Added: stale residence-IDs cache (data freshness bug + recovery)
- Added: Redis at maxmemory limit (verify allkeys-lru policy)
- Added: Neon pooler unreachable but direct endpoint up — emergency
switchover procedure
ch17 (runbook):
- §23 Invalidate residence-IDs cache for a user (DEL key + grep for
missing invalidation in new code)
- §24 Verify DB pool warm-up is working (log pattern + impact test)
- §25 Switch DB host between pooler and direct endpoints
observability-plan.md status flipped from "plan only" to shipped
with the latency-cut summary.
README links to the new ch08 latency section.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -245,12 +245,58 @@ finds an empty data directory (or can't mount at all).
- If the original node is gone: Redis starts empty. Cache regenerates.
  Asynq queue state is lost; pending jobs re-queue on retry, cron
  fires re-schedule on next tick.
- Auth caches (token + residence-IDs) regenerate on first user
  request: the first request per user pays a full DB lookup, then the
  cache is warm again. Visible as a brief latency spike in the Grafana RED
  dashboard, not a functional failure.
- Ensure the node label `honeydue/redis=true` is on a healthy node:
  ```bash
  kubectl label node <new-node> honeydue/redis=true --overwrite
  kubectl label node <dead-node> honeydue/redis- 2>/dev/null || true
  ```

#### Stale residence-IDs cache (data freshness bug)

**Symptom**: a user accepts a share-code or has a residence
removed, but `/api/tasks/`, `/api/documents/`, `/api/contractors/`,
or `/api/residences/summary/` continues to show the old
membership for up to 5 minutes.
**Cause**: a residence-membership-mutating code path landed
without calling `cache.InvalidateResidenceIDsForUsers(...)`. The
cache TTL is 5 min so the issue self-heals, but it's user-visible.
**Recovery (immediate)**: flush the affected user's cache key
manually. See [Chapter 17 §residence-IDs cache invalidation](./17-runbook.md).
**Prevention (permanent)**: every mutation that changes
`residence_residence.owner_id`, `residence_residence_users.user_id`,
or deletes a residence MUST invalidate. Existing call sites for
reference: `CreateResidence` (owner), `DeleteResidence`
(all members), `JoinWithCode` (joining user), `RemoveUser`
(removed user). The pattern lives in
`internal/services/residence_id_cache.go`.
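
The invalidate-on-mutation rule can be illustrated with a minimal, self-contained sketch. It is not the code in `internal/services/residence_id_cache.go`: the in-memory `residenceIDCache`, the `membership` map standing in for the database, and this `joinWithCode` helper are all hypothetical names chosen for illustration, and the real cache is Redis-backed.

```go
package main

import (
	"sync"
	"time"
)

// defaultTTL mirrors the 5-minute cache TTL described above.
const defaultTTL = 5 * time.Minute

// cacheEntry holds one user's residence IDs and when they expire.
type cacheEntry struct {
	ids     []int64
	expires time.Time
}

// residenceIDCache is a hypothetical in-memory stand-in for the
// Redis-backed residence-IDs cache.
type residenceIDCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[int64]cacheEntry
}

func newResidenceIDCache(ttl time.Duration) *residenceIDCache {
	return &residenceIDCache{ttl: ttl, entries: make(map[int64]cacheEntry)}
}

// Get returns the cached IDs for a user, calling load (the DB lookup)
// only on a miss or after the entry has expired.
func (c *residenceIDCache) Get(userID int64, load func(int64) []int64) []int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[userID]; ok && time.Now().Before(e.expires) {
		return e.ids
	}
	ids := load(userID)
	c.entries[userID] = cacheEntry{ids: ids, expires: time.Now().Add(c.ttl)}
	return ids
}

// InvalidateForUsers drops the cached entries for the given users so the
// next Get reloads from the database.
func (c *residenceIDCache) InvalidateForUsers(userIDs ...int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, id := range userIDs {
		delete(c.entries, id)
	}
}

// membership is a toy "database": user ID -> residence IDs.
type membership map[int64][]int64

// joinWithCode (hypothetical) shows the required shape of a mutation:
// write to the database first, then invalidate every affected user.
func joinWithCode(db membership, c *residenceIDCache, userID, residenceID int64) {
	db[userID] = append(db[userID], residenceID)
	c.InvalidateForUsers(userID)
}
```

In this sketch, a write that skips `InvalidateForUsers` keeps serving the old list from `Get` until the TTL expires, which is the stale-membership symptom described above. A mutation that touches several members, such as deleting a residence, would pass all of their IDs in one call.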

#### Redis at maxmemory limit

**Symptom**: Redis rejects write commands with `OOM command not allowed when used memory > 'maxmemory'`.
Should be rare: current production usage is ~2.4 MB against a 256 MB
limit and the policy is `allkeys-lru` (cache writes evict cold keys
instead of erroring).
**Recovery**: confirm the policy is still `allkeys-lru`:
```bash
kubectl -n honeydue exec deploy/redis -- redis-cli CONFIG GET maxmemory-policy
```
If it's somehow `noeviction`, set it live:
```bash
kubectl -n honeydue exec deploy/redis -- redis-cli CONFIG SET maxmemory-policy allkeys-lru
```
Then re-apply the manifest at `deploy-k3s/manifests/redis/deployment.yaml`
so the change survives a pod restart.

If memory usage is genuinely climbing toward the cap, check for
runaway keys without TTLs. `--bigkeys` reports the largest keys by
type, not their TTLs; run `TTL <key>` on any suspect (`-1` means no
expiry):
```bash
kubectl -n honeydue exec deploy/redis -- redis-cli --bigkeys
```

### External service failures

#### Neon Postgres outage
@@ -264,6 +310,23 @@ until Neon is back.
Postgres-level failover.
**Frequency**: Neon has had a handful of hours-scale outages since launch.

#### Neon pooler endpoint unreachable but direct endpoint up

**Symptom**: `dial tcp ep-floral-truth-amttbc5a-pooler.c-5...: i/o timeout`
in api logs but the direct compute endpoint is reachable.
Rare, since Neon's pooler runs in their infra alongside compute, but
possible during pooler maintenance.
**Recovery (emergency)**: switch `DB_HOST` in `config.yaml` from the
`-pooler` hostname to the direct hostname (drop the `-pooler` segment),
re-apply the ConfigMap, and rolling-restart api and worker:
```bash
# Edit deploy-k3s/config.yaml: database.host: ep-floral-truth-amttbc5a.c-5...
# Then:
KUBECONFIG=~/.kube/honeydue.yaml bash deploy-k3s/scripts/03-deploy.sh --skip-build
```
Cold-handshake latency goes back up (~440ms first hit) but the API
keeps serving. Switch back when the pooler recovers.
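
The "drop the `-pooler` segment" step is simple string surgery on the first DNS label. A minimal sketch, assuming the pooler hostname always carries `-pooler` as the suffix of its first label (as the hostname above does); `directHost` is a hypothetical helper for illustration, not something in the codebase:

```go
package main

import (
	"errors"
	"strings"
)

// poolerSuffix is the marker that distinguishes the pooled endpoint's
// first DNS label from the direct compute endpoint's.
const poolerSuffix = "-pooler"

// directHost (hypothetical helper) derives the direct compute hostname
// from a pooler hostname by removing "-pooler" from the first label.
// It returns an error if the input does not look like a pooler hostname,
// so a direct hostname is never silently passed through.
func directHost(poolerHost string) (string, error) {
	label, rest, found := strings.Cut(poolerHost, ".")
	if !found || !strings.HasSuffix(label, poolerSuffix) {
		return "", errors.New("not a -pooler hostname: " + poolerHost)
	}
	return strings.TrimSuffix(label, poolerSuffix) + "." + rest, nil
}
```

For a placeholder hostname such as `ep-example-123456-pooler.region.example.invalid`, this returns `ep-example-123456.region.example.invalid`. Only the first label is touched, so a `-pooler` substring elsewhere in the name is left alone.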

#### Backblaze B2 outage

**Symptom**: image uploads fail; image downloads fail unless cached by