Wire OpenTelemetry tracing — HTTP, B2, APNs, FCM, asynq, GORM (partial)
Step 1 — OTel SDK: cmd/api and cmd/worker initialize a tracer provider that exports OTLP/HTTP to obs.88oakapps.com (Jaeger all-in-one). Sampling is AlwaysSample in dev (DEBUG=true) and TraceIDRatioBased(0.1) in prod, overridable via OTEL_TRACES_SAMPLER_ARG. Service names are honeydue-api and honeydue-worker. otelecho.Middleware opens a span per HTTP request.

Step 2 — Manual spans: storage_service.Upload now takes ctx and emits storage.upload + b2.PutObject spans (size_bytes, key, mime_type, bucket, result attrs). APNs Send/SendWithCategory and FCM sendOne emit per-token spans with topic, status_code, reason. Asynq middleware emits asynq.handle:<task_type> per job with retry/payload attrs and records asynq_job_duration_seconds.

Step 3 — Database: otelgorm plugin registered in database.Connect, so any SQL emitted via db.WithContext(ctx) attaches to the request span. Every repository now exposes WithContext(ctx) *XRepository as the migration helper. TaskService.ListTasks and GetTasksByResidence are migrated end-to-end (ctx threaded through handler → service → repo); remaining services adopt the same pattern incrementally — pre-migration methods still emit untraced SQL via the unchanged db field.

OBS_TRACES_URL and OBS_INGEST_TOKEN flow from deploy/prod.env → honeydue-secrets → api+worker Deployments via secretKeyRef (optional). 02-setup-secrets.sh sources them from prod.env on next run; manifests mark both env vars optional so the deployment rolls without traces if the secret is absent.

ch15 observability doc now lists what produces spans today vs the remaining migration work, with the explicit per-method pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
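The sampling rule in Step 1 can be sketched with the standard library alone. This is an illustration under assumptions, not the code in this commit: `samplerConfig` and `chooseSampler` are hypothetical names, and the sketch assumes that `OTEL_TRACES_SAMPLER_ARG` overrides only the prod ratio and that out-of-range or unparsable values fall back to 0.1.

```go
package main

import (
	"fmt"
	"strconv"
)

// samplerConfig is a hypothetical stand-in for the sampler choice.
// In a real setup the result would be turned into sdktrace.AlwaysSample()
// or sdktrace.TraceIDRatioBased(ratio) from go.opentelemetry.io/otel/sdk/trace.
type samplerConfig struct {
	AlwaysSample bool
	Ratio        float64
}

// chooseSampler applies the rule from the commit message:
// dev (DEBUG=true) samples everything, prod samples 10% of traces
// unless samplerArg (the value of OTEL_TRACES_SAMPLER_ARG) holds a
// valid ratio in [0, 1]. The fallback behaviour is an assumption.
func chooseSampler(debug bool, samplerArg string) samplerConfig {
	if debug {
		return samplerConfig{AlwaysSample: true, Ratio: 1.0}
	}
	ratio := 0.1
	if v, err := strconv.ParseFloat(samplerArg, 64); err == nil && v >= 0 && v <= 1 {
		ratio = v
	}
	return samplerConfig{Ratio: ratio}
}

func main() {
	for _, tc := range []struct {
		debug bool
		arg   string
	}{
		{true, ""},      // dev: always sample
		{false, ""},     // prod default
		{false, "0.25"}, // prod with override
		{false, "oops"}, // invalid override falls back to the default
	} {
		c := chooseSampler(tc.debug, tc.arg)
		fmt.Printf("debug=%v arg=%q always=%v ratio=%.2f\n",
			tc.debug, tc.arg, c.AlwaysSample, c.Ratio)
	}
}
```

Keeping the decision in a pure function like this makes the dev/prod split easy to unit-test without constructing a tracer provider.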
@@ -220,17 +220,41 @@ Cheapest fix path:
 We'd want both eventually. Grafana alerting first because the data is
 already there.
 
-### Partial distributed tracing
+### Distributed tracing — adoption is in flight
 
-The OTel SDK is **not yet wired** in `cmd/api/main.go`. When it ships:
-- `otelecho.Middleware` produces a span per HTTP request
-- `otelgorm` plugin produces a span per SQL query (requires threading
-`ctx` through repositories — the largest diff in the rollout)
-- Manual spans wrap B2 uploads, APNs/FCM sends, asynq jobs
+The OTel SDK is **wired** in `cmd/api/main.go` and `cmd/worker/main.go`
+and ships traces to Jaeger via `obs.88oakapps.com/v1/traces`. What's
+already producing spans:
 
-Until then, we have aggregate latency by route from the histograms but
-no per-request flame graph. For "why is *this one* request slow" we
-still rely on logs + the GORM duration histogram.
+| Span source | Status |
+|---|---|
+| `otelecho.Middleware` — span per HTTP request | ✅ live |
+| Manual span around `storage_service.Upload` (B2 PutObject) | ✅ live |
+| Manual span around APNs `Send` / `SendWithCategory` | ✅ live |
+| Manual span around FCM `sendOne` | ✅ live |
+| Asynq middleware — span per task type with retry/payload attrs | ✅ live |
+| `otelgorm` plugin — span per SQL statement | ✅ plugin registered |
+
+What's still in flight: SQL spans appear in a request's trace **only when
+the service method took the request's `ctx` and called
+`repo.WithContext(ctx)`** before issuing queries. Every repository now
+exposes `WithContext(ctx) *XRepository`, but services need to be
+migrated one method at a time.
+
+**Migration pattern:** for each service method on the request hot path,
+add `ctx context.Context` as the first arg, change the handler call site
+to pass `c.Request().Context()`, and replace `s.repo.X(...)` with
+`s.repo.WithContext(ctx).X(...)`. Tests pass `context.Background()`.
+
+Already migrated:
+- `TaskService.ListTasks` → `GET /api/tasks/`
+- `TaskService.GetTasksByResidence` → `GET /api/tasks/by-residence/:id/`
+
+Remaining: every other public method on `TaskService`, `ResidenceService`,
+`ContractorService`, `DocumentService`, `AuthService`,
+`NotificationService`, `SubscriptionService`. Mechanical work; can be
+done a method at a time without breaking anything (untouched methods
+just emit untraced SQL like before).
 
 ### No APM (Application Performance Monitoring)
 