Shipping commit 88fb175 changed the trace shape and added a new caching
layer with required invalidation rules. Updating the operator-facing
docs so they match the running system.
ch08 (database):
- DB_HOST is the -pooler Neon endpoint, not direct compute
- Connection pool: MaxIdleConns 20 (was 10), MaxLifetime 30m (was 10m),
MaxIdleTime 0 (never close idle)
- New "Pool warm-up at boot" section documenting the 20-parallel-ping
warm-up in database.Connect
- Replaced the "Neon regions" section: explicit RTT numbers, the
optimization stack that minimizes round-trips, when this still matters
ch15 (observability):
- Replaced the 2,473ms/5-span sample trace with the new 229ms/2-span
post-optimization trace; kept the old one underneath for diff context
ch16 (failure modes):
- Added: stale residence-IDs cache (data freshness bug + recovery)
- Added: Redis at maxmemory limit (verify allkeys-lru policy)
- Added: Neon pooler unreachable but direct endpoint up — emergency
switchover procedure
ch17 (runbook):
- §23 Invalidate residence-IDs cache for a user (DEL key + grep for
missing invalidation in new code)
- §24 Verify DB pool warm-up is working (log pattern + impact test)
- §25 Switch DB host between pooler and direct endpoints
observability-plan.md status flipped from "plan only" to shipped
with the latency-cut summary.
README links to the new ch08 latency section.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Observability Plan — honeyDue (100% self-hosted)
Goal: Live request-timing visibility (HTTP, DB, B2 uploads, APNs, asynq jobs) without paying any SaaS vendor.
Deployment target: 88oakappsUpdate (Linode VPS at 185.143.228.16, Ubuntu 24.04, 8 vCPU / 32 GB RAM / 193 GB disk). This box already runs the self-hosted PostHog stack and has nginx + Let's Encrypt set up for *.88oakapps.com. Free RAM at rest ≈ 15 GB; the obs stack budget is ≈ 700 MB → ~5% of free RAM. Costs $0 incremental.
Why not in the honeyDue k3s cluster: Frees ~700 MB across the 3 Hetzner nodes, no PVC plumbing, and no need to expose anything from k3s — everything is push-from-app to a public TLS endpoint.
Status: Fully shipped. VictoriaMetrics + Jaeger + Grafana on obs.88oakapps.com, vmagent in-cluster, OTel SDK and otelgorm wired into the api+worker, every authed endpoint produces nested HTTP→service→SQL flame graphs in Jaeger.
The first round of traces revealed every visible ms was network/proxy overhead — DB execution itself is sub-millisecond. The follow-up work (internal/services/residence_id_cache.go, GORM pool warm-up, auth-query JOIN consolidation, switching DB_HOST to Neon's -pooler endpoint, bumped cache TTLs) cut warm-cache /api/tasks/ from 2,473 ms / 5 spans to 229 ms / 2 spans — see commit 88fb175 and Chapter 8 §"Optimizations layered on top".
Stack
| Role | Choice | Why this vs. the obvious alternative |
|---|---|---|
| Metrics store | VictoriaMetrics (single-node) | Drop-in Prometheus-compatible. ~4× lower RAM (~200 MB vs ~500 MB) and ~7× better compression. Single binary. |
| Tracing | Jaeger all-in-one | ~150 MB RAM with embedded badger storage. Tempo monolithic mode needs 1-2 GB minimum — overkill for honeyDue's scale. |
| Dashboards | Grafana OSS | Connects to both VM (Prometheus protocol) and Jaeger natively. |
| App instrumentation | OpenTelemetry SDK + prometheus/client_golang | OTel is vendor-neutral; backends are swappable without code change. |
| Logs | Keep Dozzle; add Loki only when log search becomes painful | Loki adds ~512 MB RAM + a daemonset for log shipping. Not worth it until there's a concrete pain point. |
Why not the LGTM stack (Loki + Grafana + Tempo + Mimir)?
- Tempo wants 1-2 GB RAM minimum in monolithic mode (Grafana community report). Stacking that on top of Loki + Mimir would consume ~3-4 GB RAM. On a 3×8 GB cluster that's 12-17% of capacity for observability infra.
- Mimir is wonderful for multi-tenant Prometheus at scale — you have one tenant.
- Loki is great if you live in `kubectl logs` and need full-text search across them. You currently use Dozzle and are not feeling that pain.
VictoriaMetrics + Jaeger all-in-one gives you 90% of the value at 25% of the resource cost.
Resource budget on 88oakappsUpdate
Three Docker containers in a separate compose project under /opt/honeydue-obs/ — fully isolated from the existing PostHog compose stack so PostHog's lifecycle never touches the obs stack and vice versa.
| Service | mem_limit | Disk (bind mount) | Retention |
|---|---|---|---|
| VictoriaMetrics single-node | 256 MB | 10 GB | 30 days metrics |
| Jaeger all-in-one (badger storage) | 256 MB | 10 GB | 7 days traces |
| Grafana OSS | 256 MB | 1 GB | — |
| Total | ~768 MB hard cap | 21 GB | |
~5% of the box's free RAM and ~14% of free disk. The hard mem_limit per container matters: ClickHouse on the same VM can spike under PostHog analytics load, so bounding the obs stack prevents it from competing in a memory pinch.
Don't reuse PostHog's ClickHouse / Kafka / Redis. Tempting because they're sitting right there, but coupling honeyDue's observability to PostHog's storage means a PostHog incident takes honeyDue's incident-response telemetry down with it. Keep them fully separate.
Shared blast radius caveat: A kernel panic on 88oakappsUpdate loses both PostHog and honeyDue obs at once. At current scale, fine — call it out, don't fix.
App-side instrumentation
| Surface | Library / approach | Import path |
|---|---|---|
| Echo HTTP middleware | `otelecho`: span per request, tagged route/method/status | `go.opentelemetry.io/contrib/instrumentation/github.com/labstack/echo/otelecho` |
| GORM queries | `uptrace/otelgorm` plugin: `db.Use(otelgorm.NewPlugin())`. Requires threading ctx through repositories so `db.WithContext(ctx)` works. | `github.com/uptrace/opentelemetry-go-extra/otelgorm` |
| B2 / minio-go uploads | Manual span around `storage_service.Upload` with attributes for bucket, object size, MIME type | `go.opentelemetry.io/otel` |
| APNs / FCM | Manual span in `internal/push/apns.go` and `fcm.go`; record device-token, response status code | `go.opentelemetry.io/otel` |
| asynq jobs | Custom `asynq.MiddlewareFunc` (~20 lines): span per task type, attached to ctx, records duration + retry count | `go.opentelemetry.io/otel` + `asynq.MiddlewareFunc` |
| Prometheus `/metrics` endpoint | `prometheus/client_golang` direct: register histograms for HTTP duration / GORM op / B2 op / APNs send | `github.com/prometheus/client_golang/prometheus`, `.../prometheus/promhttp` |
| OTLP exporter | OTLP/HTTP → `https://obs.88oakapps.com/v1/traces` with bearer token. 100% sample in dev, 10% in prod. | `go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp` |
| Metrics push | `vmagent` sidecar in k3s scrapes the api Pod's `/metrics` and remote-writes to `https://obs.88oakapps.com/api/v1/write` with bearer token. Cleaner than exposing `/metrics` publicly. | `victoriametrics/vmagent` image |
Note on GORM context propagation: the existing repository methods don't take ctx context.Context. Adding otelgorm requires plumbing ctx down from the Echo handler through the service layer to the repository call site. ~10 repository files, many call sites. Save for last because the diff is large.
Implementation order (smallest first)
Step 1 — Metrics + dashboards (highest immediate ROI)
On 88oakappsUpdate:
1. `mkdir -p /opt/honeydue-obs/{data/vm,data/jaeger,data/grafana}` and a `docker-compose.yml` defining the three services with `mem_limit: 256m`, bind mounts for persistence, and an isolated bridge network
2. Add nginx vhosts (DNS A records first):
   - `grafana.88oakapps.com` → `127.0.0.1:3000` (basic auth via htpasswd, Let's Encrypt)
   - `obs.88oakapps.com` → routes by path:
     - `/api/v1/write` → `127.0.0.1:8428` (VictoriaMetrics remote-write, bearer-token check)
     - `/v1/traces` → `127.0.0.1:4318` (OTLP/HTTP traces, bearer-token check)
3. Generate a 32-byte token, store in `/etc/honeydue-obs/token` (mode 0600), reference from nginx as `auth_request` or simple `if ($http_authorization != ...)`
4. Pre-provision Grafana with the VM datasource pointing at `http://victoriametrics:8428` (in-network)
On the honeyDue k3s cluster:
5. Add prometheus/client_golang to honeyDueAPI-go/go.mod and a /metrics endpoint to the Go API
6. Register histograms:
   - `http_request_duration_seconds{route,method,status}` via Echo middleware
   - `gorm_query_duration_seconds{table,operation}` via a GORM `Plugin` callback (no ctx needed for this one; it operates at the SQL string level)
   - `b2_upload_duration_seconds{bucket,result}`
   - `apns_send_duration_seconds{result}`
7. Deploy a `vmagent` sidecar (or DaemonSet) in the `honeydue` namespace with:
   - Scrape: api Service `/metrics` every 15s
   - `remote_write.url`: `https://obs.88oakapps.com/api/v1/write`
   - `remote_write.bearer_token`: from k8s Secret
8. Build the RED dashboard in Grafana: rate, errors, duration p50/p95/p99 per route
ROI: "Is the API healthy? Where is time being spent right now?" answered live, served from grafana.88oakapps.com.
Step 2 — Tracing baseline
(Jaeger is already up from Step 1. This step adds the app-side wiring.)
- Add Grafana datasource for Jaeger pointing at `http://jaeger:16686` (in-network)
- Wire OTel SDK in `cmd/api/main.go`:
  - `otel.SetTracerProvider(tracerProvider)`
  - `otelecho.Middleware("honeydue-api")` on Echo
  - OTLP/HTTP exporter pointing at `https://obs.88oakapps.com/v1/traces` with `Authorization: Bearer <token>` header (token from env)
  - Sampling: `TraceIDRatioBased(0.1)` in prod, `AlwaysSample()` in dev
- Verify: a single `POST /api/auth/login/` produces a trace in Jaeger
ROI: "Why is this one request slow?" — answered with a flame graph.
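To make the 10% production sampling concrete, the sketch below shows the general idea behind ratio-based sampling: the decision is derived from the trace ID, so every span of a given trace gets the same verdict. It is an illustration only and is not claimed to be the OTel SDK's code; `shouldSample` is a hypothetical name.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

// shouldSample decides from the trace ID alone, so all spans in one trace
// agree. ratio is the fraction of traces to keep, between 0 and 1.
func shouldSample(traceID [16]byte, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	// Treat the low 8 bytes as a 63-bit number and keep the trace when it
	// falls below ratio * 2^63.
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	bound := uint64(ratio * (1 << 63))
	return x < bound
}

func main() {
	const total = 10000
	kept := 0
	for i := 0; i < total; i++ {
		var id [16]byte
		if _, err := rand.Read(id[:]); err != nil {
			panic(err)
		}
		if shouldSample(id, 0.1) {
			kept++
		}
	}
	fmt.Printf("kept %d of %d traces\n", kept, total)
}
```

With random trace IDs the kept fraction hovers around the configured ratio, which is why a rare slow request may simply be absent from Jaeger in prod while always present in dev.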
Step 3 — Manual spans for the work that actually matters
Wrap each in tracer.Start(ctx, ...) with attributes:
- `storage_service.Upload` → span "b2.PutObject" with `bucket`, `key`, `size_bytes`, result
- `push/apns.go` → span "apns.send" with `device_token_hash`, `status_code`, `reason`
- `asynq` middleware → span per task type with `task.type`, `retry_count`, `payload_size`
ROI: Specific high-value debugging questions ("why did this upload take 30 seconds", "why did these 5 push notifications fail") answered without code archaeology.
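One detail worth sketching is the `device_token_hash` attribute: hashing keeps raw push tokens out of trace storage while still letting failures for the same device be correlated. The helper below is a minimal sketch under assumptions; the name `deviceTokenHash`, the choice of SHA-256, and the 8-byte truncation are illustrative, not taken from the codebase.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// deviceTokenHash returns a short, stable fingerprint of a push device
// token, intended as a span attribute in place of the raw token.
func deviceTokenHash(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:8]) // first 8 bytes = 16 hex characters
}

func main() {
	// Hypothetical token value, for illustration only.
	fmt.Println(deviceTokenHash("example-device-token"))
}
```

The same input always yields the same fingerprint, so searching Jaeger for one hash surfaces every send attempt to that device.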
Step 4 — Repository ctx + otelgorm (biggest diff, save for last)
- Refactor every repository method to accept `ctx context.Context` as first arg
- Update every call site to pass `c.Request().Context()` from handlers / propagate through services
- Add `db.Use(otelgorm.NewPlugin())` in `internal/database/database.go`
- Verify: a request now has nested spans `http → service → query → query → b2.PutObject → apns.send` with full SQL on the query spans
ROI: Every DB query in every trace, with SQL + table + rows. The "find the N+1" tool you'd otherwise build by hand.
Hard skips (revisit only when explicitly proven needed)
| Tool | Why skip |
|---|---|
| Loki / Promtail | Dozzle covers the immediate need. Loki adds 512 Mi RAM + a daemonset; defer until log search becomes a hot pain point. |
| Mimir / VM cluster mode | Single-node VM handles honeyDue scale for years. |
| Pyroscope continuous profiling | Overkill at 3 small nodes. Use pprof endpoints ad-hoc when CPU pressure shows up. |
| OTel Collector | Only worth running when 3+ services emit telemetry. App → Jaeger direct is fine for now. |
| Any SaaS vendor (Datadog, NR, Honeycomb, Grafana Cloud, Sentry Performance) | User constraint: nothing paid. |
When to move off 88oakappsUpdate
Triggers — any one is enough:
- `88oakappsUpdate` available memory drops below ~3 GB sustained (PostHog growth squeezing it)
- ClickHouse OOM events start showing up in `dmesg` (PostHog under load)
- You want fully separate failure domains for honeyDue vs. 88oakapps
Migration path: the obs stack is a single docker-compose project on a bind-mount, so moving it = rsync /opt/honeydue-obs/ to a new box, update DNS for grafana.88oakapps.com and obs.88oakapps.com, docker compose up -d. ~30 min of work. Until then: cohabiting on 88oakappsUpdate is correct.
Quick reference: what shows up where
| Question | Where to look |
|---|---|
| Is the API up right now? Latency? Errors? | Grafana RED dashboard |
| Why is this specific request slow? | Jaeger trace view |
| What did the slow part of that request actually do (which SQL, which B2 PUT)? | Span details inside the trace |
| Background job throughput / queue depth | VictoriaMetrics + asynq metrics |
| What did the app print to stdout 5 minutes ago? | Dozzle |
| What error did the app log? | Dozzle (search) — or Loki if/when added |