docs/ch15: mark distributed tracing fully integrated

Every authed API endpoint now produces a nested flame graph
(HTTP → auth → service → SQL). Replaces the in-flight section with the
final span-source matrix and a sample 5-span /api/tasks/ trace. Notes
the visible Hetzner→Neon transatlantic RTT as the perf bottleneck the
flame graph surfaced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trey t
2026-04-25 16:44:31 -05:00
parent d9b5f85c3d
commit 9410da7497
+34 -22
@@ -220,42 +220,54 @@ Cheapest fix path:
We'd want both eventually. Grafana alerting first because the data is
already there.
### Distributed tracing — adoption is in flight
### Distributed tracing — fully integrated
The OTel SDK is **wired** in `cmd/api/main.go` and `cmd/worker/main.go`
and ships traces to Jaeger via `obs.88oakapps.com/v1/traces`. What's
already producing spans:
The OTel SDK is wired in `cmd/api/main.go` and `cmd/worker/main.go` and
ships traces to Jaeger via `obs.88oakapps.com/v1/traces`. Every public
service method now takes `ctx context.Context` and routes its SQL through
`repo.WithContext(ctx)`, which means **every authenticated API endpoint
produces a fully-nested flame graph** in Jaeger.
| Span source | Status |
|---|---|
| `otelecho.Middleware` — span per HTTP request | ✅ live |
| Auth middleware DB lookups (`m.db.WithContext(ctx)`) | ✅ live |
| All repos via `repo.WithContext(ctx)` (`otelgorm` plugin) | ✅ live |
| Manual span around `storage_service.Upload` (B2 PutObject) | ✅ live |
| Manual span around APNs `Send` / `SendWithCategory` | ✅ live |
| Manual span around FCM `sendOne` | ✅ live |
| Asynq middleware — span per task type with retry/payload attrs | ✅ live |
| `otelgorm` plugin — span per SQL statement | ✅ plugin registered |
What's still in flight: SQL spans appear in a request's trace **only when
the service method took the request's `ctx` and called
`repo.WithContext(ctx)`** before issuing queries. Every repository now
exposes `WithContext(ctx) *XRepository`, but services need to be
migrated one method at a time.
Migrated services (every public method takes ctx):
- `AuthService` — login, register, refresh, logout, me, verify-email,
forgot/reset-password, update-profile
- `TaskService` — all 25+ task and completion methods
- `ResidenceService` — all 15 methods including share-codes
- `ContractorService` — all 9 methods
- `DocumentService` — all 10 methods
- `NotificationService` — all 12 methods
- `SubscriptionService` — all 12 methods including Apple/Google IAP
**Migration pattern:** for each service method on the request hot path,
add `ctx context.Context` as the first arg, change the handler call site
Sample trace for `GET /api/tasks/`:
```
GET /api/tasks/ (2473ms)
├── auth: SELECT * FROM user_authtoken WHERE key=... (1506ms)
├── auth: SELECT * FROM auth_user WHERE id=7 (333ms)
├── service: SELECT id FROM residence_residence WHERE... (736ms)
└── service: SELECT * FROM task_task WHERE residence_id IN(...) (226ms)
```
Each query is labeled with `db.statement`. The transatlantic Hetzner→Neon
RTT (~110ms one-way) shows up clearly — that's the perf bottleneck the
flame graph makes obvious. See [Chapter 18 — Cost](./18-cost.md) for
the planned move to a US-region cluster.
**Migration pattern (for any future services or middleware):** add
`ctx context.Context` as the first arg, change the handler call site
to pass `c.Request().Context()`, and replace `s.repo.X(...)` with
`s.repo.WithContext(ctx).X(...)`. Tests pass `context.Background()`.
Already migrated:
- `TaskService.ListTasks` → `GET /api/tasks/`
- `TaskService.GetTasksByResidence` → `GET /api/tasks/by-residence/:id/`
Remaining: every other public method on `TaskService`, `ResidenceService`,
`ContractorService`, `DocumentService`, `AuthService`,
`NotificationService`, `SubscriptionService`. Mechanical work; can be
done a method at a time without breaking anything (untouched methods
just emit untraced SQL like before).
### No APM (Application Performance Monitoring)
No continuous profiling. We can answer "which endpoint has the highest