admin/honeyDueAPI

Trey t c9ac273dbd

docs: capture latency optimizations + new caching invariants

Shipping commit 88fb175 changed the trace shape and added a new caching
layer with required invalidation rules. Updating the operator-facing
docs so they match the running system.

ch08 (database):
- DB_HOST is the -pooler Neon endpoint, not direct compute
- Connection pool: MaxIdleConns 20 (was 10), MaxLifetime 30m (was 10m),
  MaxIdleTime 0 (never close idle)
- New "Pool warm-up at boot" section documenting the 20-parallel-ping
  warm-up in database.Connect
- Replaced the "Neon regions" section: explicit RTT numbers, the
  optimization stack that minimizes round-trips, when this still matters

ch15 (observability):
- Replaced the 2,473ms/5-span sample trace with the new 229ms/2-span
  post-optimization trace; kept the old one underneath for diff context

ch16 (failure modes):
- Added: stale residence-IDs cache (data freshness bug + recovery)
- Added: Redis at maxmemory limit (verify allkeys-lru policy)
- Added: Neon pooler unreachable but direct endpoint up — emergency
  switchover procedure

ch17 (runbook):
- §23 Invalidate residence-IDs cache for a user (DEL key + grep for
  missing invalidation in new code)
- §24 Verify DB pool warm-up is working (log pattern + impact test)
- §25 Switch DB host between pooler and direct endpoints

observability-plan.md status flipped from "plan only" to shipped
with the latency-cut summary.

README links to the new ch08 latency section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-25 17:36:36 -05:00


Observability Plan — honeyDue (100% self-hosted)

Goal: Live request-timing visibility (HTTP, DB, B2 uploads, APNs, asynq jobs) without paying any SaaS vendor.

Deployment target: 88oakappsUpdate (Linode VPS at 185.143.228.16, Ubuntu 24.04, 8 vCPU / 32 GB RAM / 193 GB disk). This box already runs the self-hosted PostHog stack and has nginx + Let's Encrypt set up for *.88oakapps.com. Free RAM at rest ≈ 15 GB; the obs stack budget is ≈ 700 MB → ~5% of free RAM. Costs $0 incremental.

Why not in the honeyDue k3s cluster: Frees ~700 MB across the 3 Hetzner nodes, no PVC plumbing, and no need to expose anything from k3s — everything is push-from-app to a public TLS endpoint.

Status: Fully shipped. VictoriaMetrics + Jaeger + Grafana run on obs.88oakapps.com, vmagent runs in-cluster, and the OTel SDK and otelgorm are wired into the api+worker. Every authenticated endpoint produces nested HTTP→service→SQL flame graphs in Jaeger.

The first round of traces revealed every visible ms was network/proxy overhead — DB execution itself is sub-millisecond. The follow-up work (internal/services/residence_id_cache.go, GORM pool warm-up, auth-query JOIN consolidation, switching DB_HOST to Neon's -pooler endpoint, bumped cache TTLs) cut warm-cache /api/tasks/ from 2,473 ms / 5 spans to 229 ms / 2 spans — see commit 88fb175 and Chapter 8 §"Optimizations layered on top".

Stack

Role	Choice	Why this vs. the obvious alternative
Metrics store	VictoriaMetrics (single-node)	Drop-in Prometheus-compatible. ~2.5× lower RAM (~200 MB vs ~500 MB) and ~7× better compression. Single binary.
Tracing	Jaeger all-in-one	~150 MB RAM with embedded badger storage. Tempo monolithic mode needs 1-2 GB minimum — overkill for honeyDue's scale.
Dashboards	Grafana OSS	Connects to both VM (Prometheus protocol) and Jaeger natively.
App instrumentation	OpenTelemetry SDK + `prometheus/client_golang`	OTel is vendor-neutral — backends are swappable without code change.
Logs	Keep Dozzle; add Loki only when log search becomes painful	Loki adds ~512 MB RAM + a daemonset for log shipping. Not worth it until there's a concrete pain point.

Why not the LGTM stack (Loki + Grafana + Tempo + Mimir)?

Tempo wants 1-2 GB RAM minimum in monolithic mode (Grafana community report). Stacking that on top of Loki + Mimir would consume ~3-4 GB RAM. On a 3×8 GB cluster that's 12-17% of capacity for observability infra.
Mimir is wonderful for multi-tenant Prometheus at scale — you have one tenant.
Loki is great if you live in kubectl logs and need full-text search across them. You currently use Dozzle and are not feeling that pain.

VictoriaMetrics + Jaeger all-in-one gives you 90% of the value at 25% of the resource cost.

Resource budget on `88oakappsUpdate`

Three Docker containers in a separate compose project under /opt/honeydue-obs/ — fully isolated from the existing PostHog compose stack so PostHog's lifecycle never touches the obs stack and vice versa.

Service	`mem_limit`	Disk (bind mount)	Retention
VictoriaMetrics single-node	256 MB	10 GB	30 days metrics
Jaeger all-in-one (badger storage)	256 MB	10 GB	7 days traces
Grafana OSS	256 MB	1 GB	—
Total	~768 MB hard cap	21 GB

~5% of the box's free RAM and ~14% of free disk. The hard mem_limit per container matters: ClickHouse on the same VM can spike under PostHog analytics load, so bounding the obs stack prevents it from competing in a memory pinch.
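The arithmetic behind those percentages can be checked with a few lines of Go. This is only a sketch: the per-service figures are copied from the table above, the 15 GB free-RAM figure comes from the deployment-target paragraph, and the type and function names are made up for illustration.

```go
package main

import "fmt"

// service mirrors one row of the budget table above.
type service struct {
	name   string
	memMB  int // mem_limit
	diskGB int // bind-mount size
}

// budget is copied from the table; nothing here is measured.
var budget = []service{
	{"victoriametrics", 256, 10},
	{"jaeger", 256, 10},
	{"grafana", 256, 1},
}

// freeRAMMB is the "free RAM at rest ≈ 15 GB" figure, in MB.
const freeRAMMB = 15 * 1024

// totals sums the per-container memory limits and disk allocations.
func totals(svcs []service) (memMB, diskGB int) {
	for _, s := range svcs {
		memMB += s.memMB
		diskGB += s.diskGB
	}
	return memMB, diskGB
}

// ramShare returns the memory cap as a percentage of free RAM.
func ramShare(memMB, freeMB int) float64 {
	return 100 * float64(memMB) / float64(freeMB)
}

func main() {
	mem, disk := totals(budget)
	// prints: mem cap: 768 MB, disk: 21 GB, share of free RAM: 5.0%
	fmt.Printf("mem cap: %d MB, disk: %d GB, share of free RAM: %.1f%%\n",
		mem, disk, ramShare(mem, freeRAMMB))
}
```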

Don't reuse PostHog's ClickHouse / Kafka / Redis. Tempting because they're sitting right there, but coupling honeyDue's observability to PostHog's storage means a PostHog incident takes honeyDue's incident-response telemetry down with it. Keep them fully separate.

Shared blast radius caveat: A kernel panic on 88oakappsUpdate loses both PostHog and honeyDue obs at once. At current scale, fine — call it out, don't fix.

App-side instrumentation

Surface	Library / approach	Import path
Echo HTTP middleware	`otelecho` — span per request, tagged route/method/status	`go.opentelemetry.io/contrib/instrumentation/github.com/labstack/echo/otelecho`
GORM queries	`uptrace/otelgorm` plugin — `db.Use(otelgorm.NewPlugin())`. Requires threading `ctx` through repositories so `db.WithContext(ctx)` works.	`github.com/uptrace/opentelemetry-go-extra/otelgorm`
B2 / minio-go uploads	Manual span around `storage_service.Upload` with attributes for bucket, object size, MIME type	`go.opentelemetry.io/otel`
APNs / FCM	Manual span in `internal/push/apns.go` and `fcm.go`; record a hash of the device token (not the raw token) and the response status code	`go.opentelemetry.io/otel`
asynq jobs	Custom `asynq.MiddlewareFunc` (~20 lines) — span per task type, attached to ctx, records duration + retry count	`go.opentelemetry.io/otel` + `asynq.MiddlewareFunc`
Prometheus `/metrics` endpoint	`prometheus/client_golang` direct — register histograms for HTTP duration / GORM op / B2 op / APNs send	`github.com/prometheus/client_golang/prometheus`, `.../prometheus/promhttp`
OTLP exporter	OTLP/HTTP → `https://obs.88oakapps.com/v1/traces` with bearer token. 100% sample in dev, 10% in prod.	`go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp`
Metrics push	`vmagent` sidecar in k3s scrapes the api Pod's `/metrics` and remote-writes to `https://obs.88oakapps.com/api/v1/write` with bearer token. Cleaner than exposing `/metrics` publicly.	`victoriametrics/vmagent` image

Note on GORM context propagation: the existing repository methods don't take ctx context.Context. Adding otelgorm requires plumbing ctx down from the Echo handler through the service layer to the repository call site. ~10 repository files, many call sites. Save for last because the diff is large.

Implementation order (smallest first)

Step 1 — Metrics + dashboards (highest immediate ROI)

On 88oakappsUpdate:

Run mkdir -p /opt/honeydue-obs/{data/vm,data/jaeger,data/grafana} and create a docker-compose.yml defining the three services with mem_limit: 256m, bind mounts for persistence, and an isolated bridge network
Add nginx vhosts (DNS A records first):
- grafana.88oakapps.com → 127.0.0.1:3000 (basic auth via htpasswd, Let's Encrypt)
- obs.88oakapps.com → routes by path:
  - /api/v1/write → 127.0.0.1:8428 (VictoriaMetrics remote-write, bearer-token check)
  - /v1/traces → 127.0.0.1:4318 (OTLP/HTTP traces, bearer-token check)
Generate a 32-byte token, store it in /etc/honeydue-obs/token (mode 0600), and reference it from nginx via auth_request or a simple if ($http_authorization != ...)
Pre-provision Grafana with the VM datasource pointing at http://victoriametrics:8428 (in-network)
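The token step in the list above could look roughly like the sketch below. It is not the nginx configuration: the plan does the check in nginx, and the Go version only spells out the same logic. The hex encoding and both function names are assumptions, and writing the token to /etc/honeydue-obs/token is left out.

```go
package main

import (
	"crypto/rand"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"strings"
)

// newToken returns 32 random bytes, hex-encoded (64 characters).
// Hex is an assumed encoding; the plan only says "32-byte token".
func newToken() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// authorized reports whether an Authorization header value carries the
// expected bearer token. The comparison is constant-time so the check
// does not leak how many leading characters matched.
func authorized(header, token string) bool {
	const prefix = "Bearer "
	if token == "" || !strings.HasPrefix(header, prefix) {
		return false
	}
	got := strings.TrimPrefix(header, prefix)
	return subtle.ConstantTimeCompare([]byte(got), []byte(token)) == 1
}

func main() {
	tok, err := newToken()
	if err != nil {
		panic(err)
	}
	// prints: 64 true false
	fmt.Println(len(tok), authorized("Bearer "+tok, tok), authorized("Bearer nope", tok))
}
```

The same token is what the app and vmagent later send as Authorization: Bearer <token> when pushing to obs.88oakapps.com.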

On the honeyDue k3s cluster:

Add prometheus/client_golang to honeyDueAPI-go/go.mod and a /metrics endpoint to the Go API
Register histograms:

http_request_duration_seconds{route,method,status} via Echo middleware
gorm_query_duration_seconds{table,operation} via a GORM Plugin callback (no ctx needed for this one — operates at the SQL string level)
b2_upload_duration_seconds{bucket,result}
apns_send_duration_seconds{result}

Deploy a vmagent sidecar (or DaemonSet) in the honeydue namespace with:
- Scrape: api Service /metrics every 15s
- remote_write.url: https://obs.88oakapps.com/api/v1/write
- remote_write.bearer_token: from k8s Secret
Build the RED dashboard in Grafana: rate, errors, duration p50/p95/p99 per route

ROI: "Is the API healthy? Where is time being spent right now?" answered live, served from grafana.88oakapps.com.
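To make concrete what a histogram such as http_request_duration_seconds{route,method,status} records, here is a standard-library stand-in. In the service itself prometheus/client_golang provides the real histogram type. The sketch only illustrates the data shape (cumulative bucket counts per label set), and its types and bucket bounds are invented for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// labels is the label set from http_request_duration_seconds{route,method,status}.
type labels struct{ Route, Method, Status string }

// histogram keeps cumulative bucket counts per label set, the same shape
// Prometheus exposes as <name>_bucket{le="..."}.
type histogram struct {
	bounds []float64           // upper bounds in seconds, ascending
	counts map[labels][]uint64 // counts[l][i] = observations <= bounds[i]
	total  map[labels]uint64   // the implicit +Inf bucket
}

func newHistogram(bounds ...float64) *histogram {
	b := append([]float64(nil), bounds...)
	sort.Float64s(b)
	return &histogram{bounds: b, counts: map[labels][]uint64{}, total: map[labels]uint64{}}
}

// Observe records one duration under the given label set.
func (h *histogram) Observe(l labels, seconds float64) {
	if h.counts[l] == nil {
		h.counts[l] = make([]uint64, len(h.bounds))
	}
	for i, ub := range h.bounds {
		if seconds <= ub {
			h.counts[l][i]++ // cumulative: every bucket at or above the value
		}
	}
	h.total[l]++
}

func main() {
	h := newHistogram(0.05, 0.25, 1) // example bounds, not the real buckets
	l := labels{"/api/tasks/", "GET", "200"}
	for _, s := range []float64{0.02, 0.229, 2.473} {
		h.Observe(l, s)
	}
	// prints: [1 2 2] 3
	fmt.Println(h.counts[l], h.total[l])
}
```

The p50/p95/p99 panels on the RED dashboard are estimated from exactly these cumulative counts, which is why the bucket bounds should bracket the latencies you care about.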

Step 2 — Tracing baseline

(Jaeger is already up from Step 1. This step adds the app-side wiring.)

Add Grafana datasource for Jaeger pointing at http://jaeger:16686 (in-network)
Wire OTel SDK in cmd/api/main.go:
- otel.SetTracerProvider(tracerProvider)
- otelecho.Middleware("honeydue-api") on Echo
- OTLP/HTTP exporter pointing at https://obs.88oakapps.com/v1/traces with Authorization: Bearer <token> header (token from env)
- Sampling: TraceIDRatioBased(0.1) in prod, AlwaysSample() in dev
Verify: a single POST /api/auth/login/ produces a trace in Jaeger

ROI: "Why is this one request slow?" — answered with a flame graph.
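The 10% production sampling is worth a short illustration, because ratio sampling is deterministic per trace rather than a coin flip per span. The sketch below shows the general idea of deciding from the trace ID alone. The OTel Go SDK's TraceIDRatioBased sampler works on the same principle, but the exact bit handling here is an assumption for illustration, not the SDK's code.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// shouldSample makes a keep/drop decision from the trace ID alone, so
// every service that sees the same trace ID reaches the same answer.
func shouldSample(traceID [16]byte, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	// Treat the low 8 bytes as a 63-bit number and keep the trace when
	// it falls below ratio * 2^63.
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	bound := uint64(ratio * (1 << 63))
	return x < bound
}

func main() {
	var low, high [16]byte // low stays all zeros
	for i := 8; i < 16; i++ {
		high[i] = 0xff
	}
	// prints: true false true
	fmt.Println(shouldSample(low, 0.1), shouldSample(high, 0.1), shouldSample(high, 1))
}
```

One practical consequence: at 10% a specific slow request may simply not have a trace, so reproduce it in dev (100% sampling) when chasing a single request.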

Step 3 — Manual spans for the work that actually matters

Wrap each in tracer.Start(ctx, ...) with attributes:

storage_service.Upload → span "b2.PutObject" with bucket, key, size_bytes, result
push/apns.go → span "apns.send" with device_token_hash, status_code, reason
asynq middleware → span per task type with task.type, retry_count, payload_size

ROI: Specific high-value debugging questions ("why did this upload take 30 seconds", "why did these 5 push notifications fail") answered without code archaeology.
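For the apns.send span, the device_token_hash attribute implies hashing the token before it reaches the tracing backend. A minimal sketch, assuming SHA-256 truncated to 8 bytes (both are arbitrary choices, and the helper names are hypothetical). In the real code these values would be set as OTel span attributes rather than returned as a map.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// deviceTokenHash returns a short, stable fingerprint of a push token so
// traces can correlate sends to one device without storing the token.
func deviceTokenHash(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:8]) // 16 hex chars; the length is an assumption
}

// apnsSpanAttrs collects the attributes named in the list above.
func apnsSpanAttrs(deviceToken string, statusCode int, reason string) map[string]any {
	return map[string]any{
		"device_token_hash": deviceTokenHash(deviceToken),
		"status_code":       statusCode,
		"reason":            reason,
	}
}

func main() {
	attrs := apnsSpanAttrs("example-device-token", 410, "Unregistered")
	// prints: 16 410
	fmt.Println(len(attrs["device_token_hash"].(string)), attrs["status_code"])
}
```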

Step 4 — Repository ctx + `otelgorm` (biggest diff, save for last)

Refactor every repository method to accept ctx context.Context as first arg
Update every call site to pass c.Request().Context() from handlers / propagate through services
Add db.Use(otelgorm.NewPlugin()) in internal/database/database.go
Verify: a request now has nested spans http → service → query → query → b2.PutObject → apns.send with full SQL on the query spans

ROI: Every DB query in every trace, with SQL + table + rows. The "find the N+1" tool you'd otherwise build by hand.

Hard skips (revisit only when explicitly proven needed)

Tool	Why skip
Loki / Promtail	Dozzle covers the immediate need. Loki adds ~512 MB RAM + a daemonset; defer until log search becomes a hot pain point.
Mimir / VM cluster mode	Single-node VM handles honeyDue scale for years.
Pyroscope continuous profiling	Overkill at 3 small nodes. Use `pprof` endpoints ad-hoc when CPU pressure shows up.
OTel Collector	Only worth running when 3+ services emit telemetry. App → Jaeger direct is fine for now.
Any SaaS vendor (Datadog, NR, Honeycomb, Grafana Cloud, Sentry Performance)	User constraint: nothing paid.

When to move off `88oakappsUpdate`

Triggers — any one is enough:

88oakappsUpdate available memory drops below ~3 GB sustained (PostHog growth squeezing it)
ClickHouse OOM events start showing up in dmesg (PostHog under load)
You want fully separate failure domains for honeyDue vs. 88oakapps

Migration path: the obs stack is a single docker-compose project on a bind-mount, so moving it = rsync /opt/honeydue-obs/ to a new box, update DNS for grafana.88oakapps.com and obs.88oakapps.com, docker compose up -d. ~30 min of work. Until then: cohabiting on 88oakappsUpdate is correct.

Quick reference: what shows up where

Question	Where to look
Is the API up right now? Latency? Errors?	Grafana RED dashboard
Why is this specific request slow?	Jaeger trace view
What did the slow part of that request actually do (which SQL, which B2 PUT)?	Span details inside the trace
Background job throughput / queue depth	VictoriaMetrics + asynq metrics
What did the app print to stdout 5 minutes ago?	Dozzle
What error did the app log?	Dozzle (search) — or Loki if/when added
