docs/ch15: mark distributed tracing fully integrated
Every authed API endpoint now produces a nested flame graph (HTTP → auth → service → SQL). Replaces the in-flight section with the final span-source matrix and a sample 5-span /api/tasks/ trace. Notes the visible Hetzner→Neon transatlantic RTT as the perf bottleneck the flame graph surfaced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -220,42 +220,54 @@ Cheapest fix path:
 We'd want both eventually. Grafana alerting first because the data is
 already there.
 
-### Distributed tracing — adoption is in flight
+### Distributed tracing — fully integrated
 
-The OTel SDK is **wired** in `cmd/api/main.go` and `cmd/worker/main.go`
-and ships traces to Jaeger via `obs.88oakapps.com/v1/traces`. What's
-already producing spans:
+The OTel SDK is wired in `cmd/api/main.go` and `cmd/worker/main.go` and
+ships traces to Jaeger via `obs.88oakapps.com/v1/traces`. Every public
+service method now takes `ctx context.Context` and routes its SQL through
+`repo.WithContext(ctx)`, which means **every authenticated API endpoint
+produces a fully-nested flame graph** in Jaeger.
 
 | Span source | Status |
 |---|---|
 | `otelecho.Middleware` — span per HTTP request | ✅ live |
+| Auth middleware DB lookups (`m.db.WithContext(ctx)`) | ✅ live |
+| All repos via `repo.WithContext(ctx)` (`otelgorm` plugin) | ✅ live |
 | Manual span around `storage_service.Upload` (B2 PutObject) | ✅ live |
 | Manual span around APNs `Send` / `SendWithCategory` | ✅ live |
 | Manual span around FCM `sendOne` | ✅ live |
 | Asynq middleware — span per task type with retry/payload attrs | ✅ live |
-| `otelgorm` plugin — span per SQL statement | ✅ plugin registered |
 
-What's still in flight: SQL spans appear in a request's trace **only when
-the service method took the request's `ctx` and called
-`repo.WithContext(ctx)`** before issuing queries. Every repository now
-exposes `WithContext(ctx) *XRepository`, but services need to be
-migrated one method at a time.
+Migrated services (every public method takes ctx):
+- `AuthService` — login, register, refresh, logout, me, verify-email,
+  forgot/reset-password, update-profile
+- `TaskService` — all 25+ task and completion methods
+- `ResidenceService` — all 15 methods including share-codes
+- `ContractorService` — all 9 methods
+- `DocumentService` — all 10 methods
+- `NotificationService` — all 12 methods
+- `SubscriptionService` — all 12 methods including Apple/Google IAP
 
-**Migration pattern:** for each service method on the request hot path,
-add `ctx context.Context` as the first arg, change the handler call site
+Sample trace for `GET /api/tasks/`:
+
+```
+GET /api/tasks/ (2473ms)
+├── auth: SELECT * FROM user_authtoken WHERE key=... (1506ms)
+├── auth: SELECT * FROM auth_user WHERE id=7 (333ms)
+├── service: SELECT id FROM residence_residence WHERE... (736ms)
+└── service: SELECT * FROM task_task WHERE residence_id IN(...) (226ms)
+```
+
+Each query labeled with `db.statement`. The transatlantic Hetzner→Neon
+RTT (~110ms one-way) shows up clearly — that's the perf bottleneck the
+flame graph makes obvious. See [Chapter 18 — Cost](./18-cost.md) for
+the planned move to a US-region cluster.
+
+**Migration pattern (for any future services or middleware):** add
+`ctx context.Context` as the first arg, change the handler call site
 to pass `c.Request().Context()`, and replace `s.repo.X(...)` with
 `s.repo.WithContext(ctx).X(...)`. Tests pass `context.Background()`.
 
-Already migrated:
-- `TaskService.ListTasks` → `GET /api/tasks/`
-- `TaskService.GetTasksByResidence` → `GET /api/tasks/by-residence/:id/`
-
-Remaining: every other public method on `TaskService`, `ResidenceService`,
-`ContractorService`, `DocumentService`, `AuthService`,
-`NotificationService`, `SubscriptionService`. Mechanical work; can be
-done a method at a time without breaking anything (untouched methods
-just emit untraced SQL like before).
-
 ### No APM (Application Performance Monitoring)
 
 No continuous profiling. We can answer "which endpoint has the highest