docs/ch15: mark distributed tracing fully integrated

Every authed API endpoint now produces a nested flame graph (HTTP → auth → service → SQL). Replaces the in-flight section with the final span-source matrix and a sample 5-span /api/tasks/ trace. Notes the visible Hetzner→Neon transatlantic RTT as the perf bottleneck the flame graph surfaced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -220,42 +220,54 @@ Cheapest fix path:
 We'd want both eventually. Grafana alerting first because the data is
 already there.
 
-### Distributed tracing — adoption is in flight
+### Distributed tracing — fully integrated
 
-The OTel SDK is **wired** in `cmd/api/main.go` and `cmd/worker/main.go`
-and ships traces to Jaeger via `obs.88oakapps.com/v1/traces`. What's
-already producing spans:
+The OTel SDK is wired in `cmd/api/main.go` and `cmd/worker/main.go` and
+ships traces to Jaeger via `obs.88oakapps.com/v1/traces`. Every public
+service method now takes `ctx context.Context` and routes its SQL through
+`repo.WithContext(ctx)`, which means **every authenticated API endpoint
+produces a fully-nested flame graph** in Jaeger.
 
 | Span source | Status |
 |---|---|
 | `otelecho.Middleware` — span per HTTP request | ✅ live |
+| Auth middleware DB lookups (`m.db.WithContext(ctx)`) | ✅ live |
+| All repos via `repo.WithContext(ctx)` (`otelgorm` plugin) | ✅ live |
 | Manual span around `storage_service.Upload` (B2 PutObject) | ✅ live |
 | Manual span around APNs `Send` / `SendWithCategory` | ✅ live |
 | Manual span around FCM `sendOne` | ✅ live |
 | Asynq middleware — span per task type with retry/payload attrs | ✅ live |
-| `otelgorm` plugin — span per SQL statement | ✅ plugin registered |
 
-What's still in flight: SQL spans appear in a request's trace **only when
-the service method took the request's `ctx` and called
-`repo.WithContext(ctx)`** before issuing queries. Every repository now
-exposes `WithContext(ctx) *XRepository`, but services need to be
-migrated one method at a time.
+Migrated services (every public method takes ctx):
+- `AuthService` — login, register, refresh, logout, me, verify-email,
+forgot/reset-password, update-profile
+- `TaskService` — all 25+ task and completion methods
+- `ResidenceService` — all 15 methods including share-codes
+- `ContractorService` — all 9 methods
+- `DocumentService` — all 10 methods
+- `NotificationService` — all 12 methods
+- `SubscriptionService` — all 12 methods including Apple/Google IAP
 
-**Migration pattern:** for each service method on the request hot path,
-add `ctx context.Context` as the first arg, change the handler call site
+Sample trace for `GET /api/tasks/`:
+
+```
+GET /api/tasks/ (2473ms)
+├── auth: SELECT * FROM user_authtoken WHERE key=... (1506ms)
+├── auth: SELECT * FROM auth_user WHERE id=7 (333ms)
+├── service: SELECT id FROM residence_residence WHERE... (736ms)
+└── service: SELECT * FROM task_task WHERE residence_id IN(...) (226ms)
+```
+
+Each query labeled with `db.statement`. The transatlantic Hetzner→Neon
+RTT (~110ms one-way) shows up clearly — that's the perf bottleneck the
+flame graph makes obvious. See [Chapter 18 — Cost](./18-cost.md) for
+the planned move to a US-region cluster.
+
+**Migration pattern (for any future services or middleware):** add
+`ctx context.Context` as the first arg, change the handler call site
 to pass `c.Request().Context()`, and replace `s.repo.X(...)` with
 `s.repo.WithContext(ctx).X(...)`. Tests pass `context.Background()`.
 
-Already migrated:
-- `TaskService.ListTasks` → `GET /api/tasks/`
-- `TaskService.GetTasksByResidence` → `GET /api/tasks/by-residence/:id/`
-
-Remaining: every other public method on `TaskService`, `ResidenceService`,
-`ContractorService`, `DocumentService`, `AuthService`,
-`NotificationService`, `SubscriptionService`. Mechanical work; can be
-done a method at a time without breaking anything (untouched methods
-just emit untraced SQL like before).
-
 ### No APM (Application Performance Monitoring)
 
 No continuous profiling. We can answer "which endpoint has the highest