diff --git a/docs/deployment/15-observability.md b/docs/deployment/15-observability.md
index dca5fa2..9883ac5 100644
--- a/docs/deployment/15-observability.md
+++ b/docs/deployment/15-observability.md
@@ -220,42 +220,54 @@ Cheapest fix path:
 We'd want both eventually. Grafana alerting first because the data is
 already there.
 
-### Distributed tracing — adoption is in flight
+### Distributed tracing — fully integrated
 
-The OTel SDK is **wired** in `cmd/api/main.go` and `cmd/worker/main.go`
-and ships traces to Jaeger via `obs.88oakapps.com/v1/traces`. What's
-already producing spans:
+The OTel SDK is wired in `cmd/api/main.go` and `cmd/worker/main.go` and
+ships traces to Jaeger via `obs.88oakapps.com/v1/traces`. Every public
+service method now takes `ctx context.Context` and routes its SQL through
+`repo.WithContext(ctx)`, which means **every authenticated API endpoint
+produces a fully-nested flame graph** in Jaeger.
 
 | Span source | Status |
 |---|---|
 | `otelecho.Middleware` — span per HTTP request | ✅ live |
+| Auth middleware DB lookups (`m.db.WithContext(ctx)`) | ✅ live |
+| All repos via `repo.WithContext(ctx)` (`otelgorm` plugin) | ✅ live |
 | Manual span around `storage_service.Upload` (B2 PutObject) | ✅ live |
 | Manual span around APNs `Send` / `SendWithCategory` | ✅ live |
 | Manual span around FCM `sendOne` | ✅ live |
 | Asynq middleware — span per task type with retry/payload attrs | ✅ live |
-| `otelgorm` plugin — span per SQL statement | ✅ plugin registered |
 
-What's still in flight: SQL spans appear in a request's trace **only when
-the service method took the request's `ctx` and called
-`repo.WithContext(ctx)`** before issuing queries. Every repository now
-exposes `WithContext(ctx) *XRepository`, but services need to be
-migrated one method at a time.
+Migrated services (every public method takes ctx):
+- `AuthService` — login, register, refresh, logout, me, verify-email,
+  forgot/reset-password, update-profile
+- `TaskService` — all 25+ task and completion methods
+- `ResidenceService` — all 15 methods including share-codes
+- `ContractorService` — all 9 methods
+- `DocumentService` — all 10 methods
+- `NotificationService` — all 12 methods
+- `SubscriptionService` — all 12 methods including Apple/Google IAP
 
-**Migration pattern:** for each service method on the request hot path,
-add `ctx context.Context` as the first arg, change the handler call site
+Sample trace for `GET /api/tasks/`:
+
+```
+GET /api/tasks/ (2473ms)
+├── auth: SELECT * FROM user_authtoken WHERE key=... (1506ms)
+├── auth: SELECT * FROM auth_user WHERE id=7 (333ms)
+├── service: SELECT id FROM residence_residence WHERE... (736ms)
+└── service: SELECT * FROM task_task WHERE residence_id IN(...) (226ms)
+```
+
+Each query is labeled with `db.statement`. The transatlantic Hetzner→Neon
+latency (~110ms one-way) shows up clearly — that's the perf bottleneck the
+flame graph makes obvious. See [Chapter 18 — Cost](./18-cost.md) for
+the planned move to a US-region cluster.
+
+**Migration pattern (for any future services or middleware):** add
+`ctx context.Context` as the first arg, change the handler call site
 to pass `c.Request().Context()`, and replace `s.repo.X(...)` with
 `s.repo.WithContext(ctx).X(...)`. Tests pass `context.Background()`.
 
-Already migrated:
-- `TaskService.ListTasks` → `GET /api/tasks/`
-- `TaskService.GetTasksByResidence` → `GET /api/tasks/by-residence/:id/`
-
-Remaining: every other public method on `TaskService`, `ResidenceService`,
-`ContractorService`, `DocumentService`, `AuthService`,
-`NotificationService`, `SubscriptionService`. Mechanical work; can be
-done a method at a time without breaking anything (untouched methods
-just emit untraced SQL like before).
-
 ### No APM (Application Performance Monitoring)
 
 No continuous profiling. We can answer "which endpoint has the highest