Wire OpenTelemetry tracing — HTTP, B2, APNs, FCM, asynq, GORM (partial)

Step 1 — OTel SDK: cmd/api and cmd/worker initialize a tracer provider
that exports OTLP/HTTP to obs.88oakapps.com (Jaeger all-in-one). Sampling
is AlwaysSample in dev (DEBUG=true) and TraceIDRatioBased(0.1) in prod,
overridable via OTEL_TRACES_SAMPLER_ARG. Service names are honeydue-api
and honeydue-worker. otelecho.Middleware opens a span per HTTP request.

Step 2 — Manual spans: storage_service.Upload now takes ctx and emits
storage.upload + b2.PutObject spans (size_bytes, key, mime_type, bucket,
result attrs). APNs Send/SendWithCategory and FCM sendOne emit per-token
spans with topic, status_code, reason. Asynq middleware emits
asynq.handle:<task_type> per job with retry/payload attrs and records
asynq_job_duration_seconds.

Step 3 — Database: otelgorm plugin registered in database.Connect, so
any SQL emitted via db.WithContext(ctx) attaches to the request span.
Every repository now exposes WithContext(ctx) *XRepository as the
migration helper. TaskService.ListTasks and GetTasksByResidence are
migrated end-to-end (ctx threaded through handler → service → repo);
remaining services adopt the same pattern incrementally — pre-migration
methods still emit untraced SQL via the unchanged db field.

OBS_TRACES_URL and OBS_INGEST_TOKEN flow from deploy/prod.env →
honeydue-secrets → api+worker Deployments via secretKeyRef (optional).
02-setup-secrets.sh sources them from prod.env on next run; manifests
mark both env vars optional so the deployment rolls without traces if
the secret is absent.

ch15 observability doc now lists what produces spans today vs the
remaining migration work, with the explicit per-method pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trey t
2026-04-25 15:28:05 -05:00
parent 77cfcc0b27
commit bc3da007db
30 changed files with 655 additions and 91 deletions
+33 -9
@@ -220,17 +220,41 @@ Cheapest fix path:
We'd want both eventually. Grafana alerting first because the data is
already there.
-### Partial distributed tracing
### Distributed tracing — adoption is in flight
-The OTel SDK is **not yet wired** in `cmd/api/main.go`. When it ships:
-- `otelecho.Middleware` produces a span per HTTP request
-- `otelgorm` plugin produces a span per SQL query (requires threading
-`ctx` through repositories — the largest diff in the rollout)
-- Manual spans wrap B2 uploads, APNs/FCM sends, asynq jobs
The OTel SDK is **wired** in `cmd/api/main.go` and `cmd/worker/main.go`
and ships traces to Jaeger via `obs.88oakapps.com/v1/traces`. What's
already producing spans:
-Until then, we have aggregate latency by route from the histograms but
-no per-request flame graph. For "why is *this one* request slow" we
-still rely on logs + the GORM duration histogram.
| Span source | Status |
|---|---|
| `otelecho.Middleware` — span per HTTP request | ✅ live |
| Manual span around `storage_service.Upload` (B2 PutObject) | ✅ live |
| Manual span around APNs `Send` / `SendWithCategory` | ✅ live |
| Manual span around FCM `sendOne` | ✅ live |
| Asynq middleware — span per task type with retry/payload attrs | ✅ live |
| `otelgorm` plugin — span per SQL statement | ✅ plugin registered |
What's still in flight: SQL spans appear in a request's trace **only when
the service method took the request's `ctx` and called
`repo.WithContext(ctx)`** before issuing queries. Every repository now
exposes `WithContext(ctx) *XRepository`, but services need to be
migrated one method at a time.
**Migration pattern:** for each service method on the request hot path,
add `ctx context.Context` as the first arg, change the handler call site
to pass `c.Request().Context()`, and replace `s.repo.X(...)` with
`s.repo.WithContext(ctx).X(...)`. Tests pass `context.Background()`.
Already migrated:
- `TaskService.ListTasks` (`GET /api/tasks/`)
- `TaskService.GetTasksByResidence` (`GET /api/tasks/by-residence/:id/`)
Remaining: every other public method on `TaskService`, `ResidenceService`,
`ContractorService`, `DocumentService`, `AuthService`,
`NotificationService`, `SubscriptionService`. Mechanical work; can be
done a method at a time without breaking anything (untouched methods
just emit untraced SQL like before).
### No APM (Application Performance Monitoring)