honeyDueAPI/docs/observability-plan.md
Trey t c9ac273dbd
docs: capture latency optimizations + new caching invariants
Shipping commit 88fb175 changed the trace shape and added a new caching
layer with required invalidation rules. Updating the operator-facing
docs so they match the running system.

ch08 (database):
- DB_HOST is the -pooler Neon endpoint, not direct compute
- Connection pool: MaxIdleConns 20 (was 10), MaxLifetime 30m (was 10m),
  MaxIdleTime 0 (never close idle)
- New "Pool warm-up at boot" section documenting the 20-parallel-ping
  warm-up in database.Connect
- Replaced the "Neon regions" section: explicit RTT numbers, the
  optimization stack that minimizes round-trips, when this still matters

ch15 (observability):
- Replaced the 2,473ms/5-span sample trace with the new 229ms/2-span
  post-optimization trace; kept the old one underneath for diff context

ch16 (failure modes):
- Added: stale residence-IDs cache (data freshness bug + recovery)
- Added: Redis at maxmemory limit (verify allkeys-lru policy)
- Added: Neon pooler unreachable but direct endpoint up — emergency
  switchover procedure

ch17 (runbook):
- §23 Invalidate residence-IDs cache for a user (DEL key + grep for
  missing invalidation in new code)
- §24 Verify DB pool warm-up is working (log pattern + impact test)
- §25 Switch DB host between pooler and direct endpoints

observability-plan.md status flipped from "plan only" to shipped
with the latency-cut summary.

README links to the new ch08 latency section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 17:36:36 -05:00


Observability Plan — honeyDue (100% self-hosted)

Goal: Live request-timing visibility (HTTP, DB, B2 uploads, APNs, asynq jobs) without paying any SaaS vendor.

Deployment target: 88oakappsUpdate (Linode VPS at 185.143.228.16, Ubuntu 24.04, 8 vCPU / 32 GB RAM / 193 GB disk). This box already runs the self-hosted PostHog stack and has nginx + Let's Encrypt set up for *.88oakapps.com. Free RAM at rest ≈ 15 GB; the obs stack budget is ≈ 700 MB → ~5% of free RAM. Costs $0 incremental.

Why not in the honeyDue k3s cluster: Frees ~700 MB across the 3 Hetzner nodes, no PVC plumbing, and no need to expose anything from k3s — everything is push-from-app to a public TLS endpoint.

Status: Fully shipped. VictoriaMetrics + Jaeger + Grafana run on obs.88oakapps.com, vmagent runs in-cluster, and the OTel SDK and otelgorm are wired into the api+worker. Every authenticated endpoint produces nested HTTP→service→SQL flame graphs in Jaeger.

The first round of traces showed that every visible millisecond was network/proxy overhead; DB execution itself is sub-millisecond. The follow-up work (internal/services/residence_id_cache.go, GORM pool warm-up, auth-query JOIN consolidation, switching DB_HOST to Neon's -pooler endpoint, bumped cache TTLs) cut warm-cache /api/tasks/ from 2,473 ms / 5 spans to 229 ms / 2 spans. See commit 88fb175 and Chapter 8 §"Optimizations layered on top".


Stack

| Role | Choice | Why this vs. the obvious alternative |
|---|---|---|
| Metrics store | VictoriaMetrics (single-node) | Drop-in Prometheus-compatible. ~4× lower RAM (~200 MB vs ~500 MB) and ~7× better compression. Single binary. |
| Tracing | Jaeger all-in-one | ~150 MB RAM with embedded badger storage. Tempo monolithic mode needs 1-2 GB minimum, which is overkill for honeyDue's scale. |
| Dashboards | Grafana OSS | Connects to both VM (Prometheus protocol) and Jaeger natively. |
| App instrumentation | OpenTelemetry SDK + prometheus/client_golang | OTel is vendor-neutral, so backends are swappable without code changes. |
| Logs | Keep Dozzle; add Loki only when log search becomes painful | Loki adds ~512 MB RAM + a daemonset for log shipping. Not worth it until there's a concrete pain point. |

Why not the LGTM stack (Loki + Grafana + Tempo + Mimir)?

  • Tempo wants 1-2 GB RAM minimum in monolithic mode (Grafana community report). Stacking that on top of Loki + Mimir would consume ~3-4 GB RAM. On a 3×8 GB cluster that's 12-17% of capacity for observability infra.
  • Mimir is wonderful for multi-tenant Prometheus at scale — you have one tenant.
  • Loki is great if you live in kubectl logs and need full-text search across them. You currently use Dozzle and are not feeling that pain.

VictoriaMetrics + Jaeger all-in-one gives you roughly 90% of the value at about a quarter of the resource cost (~768 MB capped here vs. ~3-4 GB for the LGTM stack).


Resource budget on 88oakappsUpdate

Three Docker containers in a separate compose project under /opt/honeydue-obs/ — fully isolated from the existing PostHog compose stack so PostHog's lifecycle never touches the obs stack and vice versa.

| Service | mem_limit | Disk (bind mount) | Retention |
|---|---|---|---|
| VictoriaMetrics single-node | 256 MB | 10 GB | 30 days metrics |
| Jaeger all-in-one (badger storage) | 256 MB | 10 GB | 7 days traces |
| Grafana OSS | 256 MB | 1 GB | |
| Total | ~768 MB hard cap | 21 GB | |

~5% of the box's free RAM and ~14% of free disk. The hard mem_limit per container matters: ClickHouse on the same VM can spike under PostHog analytics load, so bounding the obs stack prevents it from competing in a memory pinch.
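The budget arithmetic above can be checked with a few lines of Go. This is only a sketch: the `container` type and helper names are made up for illustration, and the 15 GB free-RAM figure is the at-rest estimate quoted earlier in this document, not a measured value.

```go
package main

import "fmt"

// container holds one service's hard limits from the budget table.
// The type and field names are illustrative, not part of any real config.
type container struct {
	name   string
	memMB  int // docker mem_limit, in MB
	diskGB int // bind-mount disk budget, in GB
}

// totals sums the memory and disk budgets across all containers.
func totals(cs []container) (memMB, diskGB int) {
	for _, c := range cs {
		memMB += c.memMB
		diskGB += c.diskGB
	}
	return memMB, diskGB
}

// percentOf returns part as a percentage of whole.
func percentOf(part, whole float64) float64 {
	return part / whole * 100
}

func main() {
	stack := []container{
		{"victoriametrics", 256, 10},
		{"jaeger", 256, 10},
		{"grafana", 256, 1},
	}
	memMB, diskGB := totals(stack)
	freeRAMMB := 15.0 * 1024 // assumption: ~15 GB free at rest, as stated above

	fmt.Printf("RAM cap: %d MB (%.1f%% of free RAM)\n", memMB, percentOf(float64(memMB), freeRAMMB))
	fmt.Printf("Disk budget: %d GB\n", diskGB)
}
```

Three 256 MB caps sum to 768 MB, which is 5.0% of 15 GB, and the disk budgets sum to 21 GB, matching the Total row.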

Don't reuse PostHog's ClickHouse / Kafka / Redis. Tempting because they're sitting right there, but coupling honeyDue's observability to PostHog's storage means a PostHog incident takes honeyDue's incident-response telemetry down with it. Keep them fully separate.

Shared blast radius caveat: A kernel panic on 88oakappsUpdate loses both PostHog and honeyDue obs at once. At current scale, fine — call it out, don't fix.


App-side instrumentation

| Surface | Library / approach | Import path |
|---|---|---|
| Echo HTTP middleware | otelecho: span per request, tagged route/method/status | go.opentelemetry.io/contrib/instrumentation/github.com/labstack/echo/otelecho |
| GORM queries | uptrace/otelgorm plugin: db.Use(otelgorm.NewPlugin()). Requires threading ctx through repositories so db.WithContext(ctx) works. | github.com/uptrace/opentelemetry-go-extra/otelgorm |
| B2 / minio-go uploads | Manual span around storage_service.Upload with attributes for bucket, object size, MIME type | go.opentelemetry.io/otel |
| APNs / FCM | Manual span in internal/push/apns.go and fcm.go; record device-token, response status code | go.opentelemetry.io/otel |
| asynq jobs | Custom asynq.MiddlewareFunc (~20 lines): span per task type, attached to ctx, records duration + retry count | go.opentelemetry.io/otel + asynq.MiddlewareFunc |
| Prometheus /metrics endpoint | prometheus/client_golang direct: register histograms for HTTP duration / GORM op / B2 op / APNs send | github.com/prometheus/client_golang/prometheus, .../prometheus/promhttp |
| OTLP exporter | OTLP/HTTP → https://obs.88oakapps.com/v1/traces with bearer token. 100% sample in dev, 10% in prod. | go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp |
| Metrics push | vmagent sidecar in k3s scrapes the api Pod's /metrics and remote-writes to https://obs.88oakapps.com/api/v1/write with bearer token. Cleaner than exposing /metrics publicly. | victoriametrics/vmagent image |

Note on GORM context propagation: the existing repository methods don't take ctx context.Context. Adding otelgorm requires plumbing ctx down from the Echo handler through the service layer to the repository call site. ~10 repository files, many call sites. Save for last because the diff is large.


Implementation order (smallest first)

Step 1 — Metrics + dashboards (highest immediate ROI)

On 88oakappsUpdate:

  1. mkdir -p /opt/honeydue-obs/{data/vm,data/jaeger,data/grafana} and a docker-compose.yml defining the three services with mem_limit: 256m, bind mounts for persistence, and an isolated bridge network
  2. Add nginx vhosts (DNS A records first):
    • grafana.88oakapps.com → 127.0.0.1:3000 (basic auth via htpasswd, Let's Encrypt)
    • obs.88oakapps.com → routes by path:
      • /api/v1/write → 127.0.0.1:8428 (VictoriaMetrics remote-write, bearer-token check)
      • /v1/traces → 127.0.0.1:4318 (OTLP/HTTP traces, bearer-token check)
  3. Generate a 32-byte token, store it in /etc/honeydue-obs/token (mode 0600), and reference it from nginx via auth_request or a simple if ($http_authorization != ...) check
  4. Pre-provision Grafana with the VM datasource pointing at http://victoriametrics:8428 (in-network)
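
In this plan nginx performs the bearer-token check, so no Go code is involved on the server. The sketch below only illustrates the two pieces of logic from step 3: generating a 32-byte random token and comparing an incoming Authorization header against it. Hex encoding and the helper names `newToken` and `checkBearer` are assumptions made for the example.

```go
package main

import (
	"crypto/rand"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"strings"
)

// newToken returns n cryptographically random bytes, hex-encoded.
// With n = 32 the result is 64 hex characters. Hex is an assumed encoding.
func newToken(n int) (string, error) {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// checkBearer reports whether an Authorization header value carries the
// expected token. The comparison is constant-time for equal-length inputs.
func checkBearer(header, want string) bool {
	const prefix = "Bearer "
	if !strings.HasPrefix(header, prefix) {
		return false
	}
	got := strings.TrimPrefix(header, prefix)
	return subtle.ConstantTimeCompare([]byte(got), []byte(want)) == 1
}

func main() {
	tok, err := newToken(32)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(tok))                         // 64
	fmt.Println(checkBearer("Bearer "+tok, tok))  // true
	fmt.Println(checkBearer("Bearer wrong", tok)) // false
}
```

The same token value is what the app and vmagent would send as `Authorization: Bearer <token>` when pushing to obs.88oakapps.com.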

On the honeyDue k3s cluster:

  5. Add prometheus/client_golang to honeyDueAPI-go/go.mod and a /metrics endpoint to the Go API
  6. Register histograms:
    • http_request_duration_seconds{route,method,status} via Echo middleware
    • gorm_query_duration_seconds{table,operation} via a GORM Plugin callback (no ctx needed for this one; it operates at the SQL string level)
    • b2_upload_duration_seconds{bucket,result}
    • apns_send_duration_seconds{result}
  7. Deploy a vmagent sidecar (or DaemonSet) in the honeydue namespace with:
    • Scrape: api Service /metrics every 15s
    • remote_write.url: https://obs.88oakapps.com/api/v1/write
    • remote_write.bearer_token: from k8s Secret
  8. Build the RED dashboard in Grafana: rate, errors, duration p50/p95/p99 per route

ROI: "Is the API healthy? Where is time being spent right now?" answered live, served from grafana.88oakapps.com.

Step 2 — Tracing baseline

(Jaeger is already up from Step 1. This step adds the app-side wiring.)

  1. Add Grafana datasource for Jaeger pointing at http://jaeger:16686 (in-network)
  2. Wire OTel SDK in cmd/api/main.go:
    • otel.SetTracerProvider(tracerProvider)
    • otelecho.Middleware("honeydue-api") on Echo
    • OTLP/HTTP exporter pointing at https://obs.88oakapps.com/v1/traces with Authorization: Bearer <token> header (token from env)
    • Sampling: TraceIDRatioBased(0.1) in prod, AlwaysSample() in dev
  3. Verify: a single POST /api/auth/login/ produces a trace in Jaeger

ROI: "Why is this one request slow?" — answered with a flame graph.
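
The 10% prod sampling is decided per trace ID, not per span. The following stdlib-only sketch shows the general idea of ratio-based sampling under that assumption: derive a number from the trace ID and keep the trace if it falls below a threshold. It is a simplified illustration, not the OTel SDK's code, and `shouldSample` and `sampledShare` are hypothetical helpers.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/rand"
)

// shouldSample makes a deterministic keep/drop decision from the trace ID
// alone, so the same trace ID always yields the same decision.
func shouldSample(traceID [16]byte, fraction float64) bool {
	if fraction >= 1 {
		return true
	}
	if fraction <= 0 {
		return false
	}
	bound := uint64(fraction * (1 << 63))
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1 // 63-bit value from the ID
	return x < bound
}

// sampledShare returns the share of n pseudo-random trace IDs that are kept.
func sampledShare(n int, fraction float64, seed int64) float64 {
	rng := rand.New(rand.NewSource(seed))
	kept := 0
	for i := 0; i < n; i++ {
		var id [16]byte
		rng.Read(id[:])
		if shouldSample(id, fraction) {
			kept++
		}
	}
	return float64(kept) / float64(n)
}

func main() {
	fmt.Printf("prod (0.1) share kept: %.3f\n", sampledShare(100000, 0.1, 1))
	fmt.Printf("dev  (1.0) share kept: %.3f\n", sampledShare(100000, 1.0, 1))
}
```

One practical consequence: at 10% sampling, a single POST /api/auth/login/ in prod will usually not appear in Jaeger, so the verification in step 3 is easiest to run in dev, where every trace is kept.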

Step 3 — Manual spans for the work that actually matters

Wrap each in tracer.Start(ctx, ...) with attributes:

  • storage_service.Upload → span "b2.PutObject" with bucket, key, size_bytes, result
  • push/apns.go → span "apns.send" with device_token_hash, status_code, reason
  • asynq middleware → span per task type with task.type, retry_count, payload_size

ROI: Specific high-value debugging questions ("why did this upload take 30 seconds", "why did these 5 push notifications fail") answered without code archaeology.
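
The list above names device_token_hash rather than the raw device token as a span attribute. A minimal sketch of how such a hash could be derived follows. The choice of SHA-256 truncated to 8 bytes is an assumption, as are the helper names; the attribute keys are the ones listed above.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// deviceTokenHash returns a short, stable identifier for a push token so
// spans can be correlated per device without storing the raw token.
// Assumption: SHA-256, first 8 bytes, hex-encoded (16 characters).
func deviceTokenHash(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:8])
}

// apnsSpanAttrs builds the attribute set for an "apns.send" span.
// A real implementation would pass these to the OTel span instead of a map.
func apnsSpanAttrs(token string, statusCode int, reason string) map[string]any {
	return map[string]any{
		"device_token_hash": deviceTokenHash(token),
		"status_code":       statusCode,
		"reason":            reason,
	}
}

func main() {
	attrs := apnsSpanAttrs("example-device-token", 410, "Unregistered")
	fmt.Println(attrs["device_token_hash"], attrs["status_code"], attrs["reason"])
}
```

Because the hash is stable, filtering Jaeger by device_token_hash groups all push attempts for one device, which is what the "why did these 5 push notifications fail" question needs.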

Step 4 — Repository ctx + otelgorm (biggest diff, save for last)

  1. Refactor every repository method to accept ctx context.Context as first arg
  2. Update every call site to pass c.Request().Context() from handlers / propagate through services
  3. Add db.Use(otelgorm.NewPlugin()) in internal/database/database.go
  4. Verify: a request now has nested spans http → service → query → query → b2.PutObject → apns.send with full SQL on the query spans

ROI: Every DB query in every trace, with SQL + table + rows. The "find the N+1" tool you'd otherwise build by hand.


Hard skips (revisit only when explicitly proven needed)

| Tool | Why skip |
|---|---|
| Loki / Promtail | Dozzle covers the immediate need. Loki adds 512 Mi RAM + a daemonset; defer until log search becomes a hot pain point. |
| Mimir / VM cluster mode | Single-node VM handles honeyDue scale for years. |
| Pyroscope continuous profiling | Overkill at 3 small nodes. Use pprof endpoints ad-hoc when CPU pressure shows up. |
| OTel Collector | Only worth running when 3+ services emit telemetry. App → Jaeger direct is fine for now. |
| Any SaaS vendor (Datadog, NR, Honeycomb, Grafana Cloud, Sentry Performance) | User constraint: nothing paid. |

When to move off 88oakappsUpdate

Triggers — any one is enough:

  • 88oakappsUpdate available memory drops below ~3 GB sustained (PostHog growth squeezing it)
  • ClickHouse OOM events start showing up in dmesg (PostHog under load)
  • You want fully separate failure domains for honeyDue vs. 88oakapps

Migration path: the obs stack is a single docker-compose project on a bind-mount, so moving it = rsync /opt/honeydue-obs/ to a new box, update DNS for grafana.88oakapps.com and obs.88oakapps.com, docker compose up -d. ~30 min of work. Until then: cohabiting on 88oakappsUpdate is correct.


Quick reference: what shows up where

| Question | Where to look |
|---|---|
| Is the API up right now? Latency? Errors? | Grafana RED dashboard |
| Why is this specific request slow? | Jaeger trace view |
| What did the slow part of that request actually do (which SQL, which B2 PUT)? | Span details inside the trace |
| Background job throughput / queue depth | VictoriaMetrics + asynq metrics |
| What did the app print to stdout 5 minutes ago? | Dozzle |
| What error did the app log? | Dozzle (search), or Loki if/when added |