docs: capture latency optimizations + new caching invariants
Shipping commit 88fb175 changed the trace shape and added a new caching
layer with required invalidation rules. Updating the operator-facing
docs so they match the running system.
ch08 (database):
- DB_HOST is the -pooler Neon endpoint, not direct compute
- Connection pool: MaxIdleConns 20 (was 10), MaxLifetime 30m (was 10m),
MaxIdleTime 0 (never close idle)
- New "Pool warm-up at boot" section documenting the 20-parallel-ping
warm-up in database.Connect
- Replaced the "Neon regions" section: explicit RTT numbers, the
optimization stack that minimizes round-trips, when this still matters
ch15 (observability):
- Replaced the 2,473ms/5-span sample trace with the new 229ms/2-span
post-optimization trace; kept the old one underneath for diff context
ch16 (failure modes):
- Added: stale residence-IDs cache (data freshness bug + recovery)
- Added: Redis at maxmemory limit (verify allkeys-lru policy)
- Added: Neon pooler unreachable but direct endpoint up — emergency
switchover procedure
ch17 (runbook):
- §23 Invalidate residence-IDs cache for a user (DEL key + grep for
missing invalidation in new code)
- §24 Verify DB pool warm-up is working (log pattern + impact test)
- §25 Switch DB host between pooler and direct endpoints
observability-plan.md status flipped from "plan only" to shipped
with the latency-cut summary.
README links to the new ch08 latency section.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -248,7 +248,22 @@ Migrated services (every public method takes ctx):
 - `NotificationService` — all 12 methods
 - `SubscriptionService` — all 12 methods including Apple/Google IAP
 
-Sample trace for `GET /api/tasks/`:
+Sample trace for `GET /api/tasks/` (warm cache, post-optimization):
+
+```
+GET /api/tasks/ (229ms)
+└── service: SELECT * FROM task_task WHERE residence_id IN
+    (SELECT id FROM residence_residence WHERE...) (227ms)
+```
+
+Two spans total. The auth path runs entirely from Redis + in-memory
+cache (zero SQL queries) thanks to the 1-hour token TTL and 5-min user
+TTL. The residence-ID lookup is folded into the tasks query as a
+Postgres subquery, so a single network round-trip to Neon services the
+whole request. See Chapter 8 §"Optimizations layered on top" for the
+optimization stack.
+
+Earlier trace, before the optimization stack landed (commit 88fb175):
 
 ```
 GET /api/tasks/ (2473ms)
@@ -258,10 +273,12 @@ GET /api/tasks/ (2473ms)
 └── service: SELECT * FROM task_task WHERE residence_id IN(...) (226ms)
 ```
 
-Each query labeled with `db.statement`. The transatlantic Hetzner→Neon
-RTT (~110ms one-way) shows up clearly — that's the perf bottleneck the
-flame graph makes obvious. See [Chapter 18 — Cost](./18-cost.md) for
-the planned move to a US-region cluster.
+10× improvement from 2,473ms to 229ms by cutting query count
+(5 SQL → 1 SQL on warm cache). The 227ms in the surviving query is
+**1 transatlantic round-trip** to Neon us-east-1 from Hetzner
+Nuremberg — the physical floor on the current setup. Eliminated by
+migrating Neon to a EU region; tracked in [Chapter 18 §migration
+triggers](./18-cost.md) and `docs/observability-plan.md`.
 
 **Migration pattern (for any future services or middleware):** add
 `ctx context.Context` as the first arg, change the handler call site