Add Prometheus metrics + vmagent push to obs.88oakapps.com

Adds internal/prom package with histograms for HTTP, GORM, B2, APNs, and FCM, wired into the Echo router (HTTPMiddleware + /metrics) and GORM via statement-level callbacks (no ctx plumbing needed). Storage and push clients call ObserveB2Upload / ObserveAPNsSend / ObserveFCMSend at the network round-trip points.

Existing internal/monitoring metrics move to /metrics/legacy so the canonical /metrics emits proper histogram buckets for p50/p95/p99 rollups.

deploy-k3s/manifests/observability/vmagent.yaml deploys a single-replica vmagent in the honeydue namespace that scrapes api Pods on :8000/metrics every 15s and remote-writes to https://obs.88oakapps.com/api/v1/write with a bearer token (substituted at deploy time from OBS_INGEST_TOKEN in deploy/prod.env).

NetworkPolicies allow vmagent egress to api Pods and to the public obs endpoint over :443; the obs side runs VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate.

docs/observability-plan.md captures the full plan including resource budget, instrumentation table, 4-step rollout, and migration triggers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 14:16:17 -05:00
parent 1cd6cafa9d
commit df78d9ccd8
10 changed files with 622 additions and 3 deletions
@@ -0,0 +1,164 @@
+# Observability Plan — honeyDue (100% self-hosted)
+
+**Goal:** Live request-timing visibility (HTTP, DB, B2 uploads, APNs, asynq jobs) without paying any SaaS vendor.
+
+**Deployment target:** `88oakappsUpdate` (Linode VPS at `185.143.228.16`, Ubuntu 24.04, 8 vCPU / 32 GB RAM / 193 GB disk). This box already runs the self-hosted PostHog stack and has nginx + Let's Encrypt set up for `*.88oakapps.com`. Free RAM at rest ≈ 15 GB; the obs stack budget is ≈ 700 MB → ~5% of free RAM. Costs $0 incremental.
+
+**Why not in the honeyDue k3s cluster:** Frees ~700 MB across the 3 Hetzner nodes, no PVC plumbing, and no need to expose anything from k3s — everything is push-from-app to a public TLS endpoint.
+
+**Status:** Step 1 (app-side metrics and the vmagent push) is implemented; Steps 2-4 are plan only.
+
+---
+
+## Stack
+
+| Role | Choice | Why this vs. the obvious alternative |
+|---|---|---|
+| Metrics store | **VictoriaMetrics** (single-node) | Drop-in Prometheus-compatible. ~2.5× lower RAM (~200 MB vs ~500 MB) and ~7× better compression. Single binary. |
+| Tracing | **Jaeger all-in-one** | ~150 MB RAM with embedded badger storage. Tempo monolithic mode needs 1-2 GB minimum — overkill for honeyDue's scale. |
+| Dashboards | **Grafana OSS** | Connects to both VM (Prometheus protocol) and Jaeger natively. |
+| App instrumentation | **OpenTelemetry SDK** + `prometheus/client_golang` | OTel is vendor-neutral — backends are swappable without code change. |
+| Logs | **Keep Dozzle**; add Loki only when log search becomes painful | Loki adds ~512 MB RAM + a daemonset for log shipping. Not worth it until there's a concrete pain point. |
+
+### Why not the LGTM stack (Loki + Grafana + Tempo + Mimir)?
+
+- **Tempo** wants 1-2 GB RAM minimum in monolithic mode ([Grafana community report](https://community.grafana.com/t/tempo-ram-usage-for-6k-spans-per-hour/63801)). Stacking that on top of Loki + Mimir would consume ~3-4 GB RAM. On a 3×8 GB cluster that's 12-17% of capacity for observability infra.
+- **Mimir** is wonderful for multi-tenant Prometheus at scale — you have one tenant.
+- **Loki** is great if you live in `kubectl logs` and need full-text search across them. You currently use Dozzle and are not feeling that pain.
+
+VictoriaMetrics + Jaeger all-in-one gives you most of the value at roughly a quarter of the RAM (~768 MB hard-capped vs. ~3-4 GB for LGTM).
+
+---
+
+## Resource budget on `88oakappsUpdate`
+
+Three Docker containers in a separate compose project under `/opt/honeydue-obs/` — fully isolated from the existing PostHog compose stack so PostHog's lifecycle never touches the obs stack and vice versa.
+
+| Service | `mem_limit` | Disk (bind mount) | Retention |
+|---|---|---|---|
+| VictoriaMetrics single-node | 256 MB | 10 GB | 30 days metrics |
+| Jaeger all-in-one (badger storage) | 256 MB | 10 GB | 7 days traces |
+| Grafana OSS | 256 MB | 1 GB | — |
+| **Total** | **~768 MB hard cap** | **21 GB** | |
+
+**~5% of the box's free RAM and ~14% of free disk.** The hard `mem_limit` per container matters: ClickHouse on the same VM can spike under PostHog analytics load, so bounding the obs stack prevents it from competing in a memory pinch.
+
+**Don't reuse PostHog's ClickHouse / Kafka / Redis.** Tempting because they're sitting right there, but coupling honeyDue's observability to PostHog's storage means a PostHog incident takes honeyDue's incident-response telemetry down with it. Keep them fully separate.
+
+**Shared blast radius caveat:** A kernel panic on `88oakappsUpdate` loses both PostHog and honeyDue obs at once. At current scale, fine — call it out, don't fix.
+
+---
+
+## App-side instrumentation
+
+| Surface | Library / approach | Import path |
+|---|---|---|
+| Echo HTTP middleware | `otelecho` — span per request, tagged route/method/status | `go.opentelemetry.io/contrib/instrumentation/github.com/labstack/echo/otelecho` |
+| GORM queries | `uptrace/otelgorm` plugin — `db.Use(otelgorm.NewPlugin())`. Requires threading `ctx` through repositories so `db.WithContext(ctx)` works. | `github.com/uptrace/opentelemetry-go-extra/otelgorm` |
+| B2 / minio-go uploads | Manual span around `storage_service.Upload` with attributes for bucket, object size, MIME type | `go.opentelemetry.io/otel` |
+| APNs / FCM | Manual span in `internal/push/apns.go` and `fcm.go`; record a hash of the device token (never the raw token) and the response status code | `go.opentelemetry.io/otel` |
+| asynq jobs | Custom `asynq.MiddlewareFunc` (~20 lines) — span per task type, attached to ctx, records duration + retry count | `go.opentelemetry.io/otel` + `asynq.MiddlewareFunc` |
+| Prometheus `/metrics` endpoint | `prometheus/client_golang` direct — register histograms for HTTP duration / GORM op / B2 op / APNs send | `github.com/prometheus/client_golang/prometheus`, `.../prometheus/promhttp` |
+| OTLP exporter | OTLP/HTTP → `https://obs.88oakapps.com/v1/traces` with bearer token. 100% sample in dev, 10% in prod. | `go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp` |
+| Metrics push | `vmagent` sidecar in k3s scrapes the api Pod's `/metrics` and remote-writes to `https://obs.88oakapps.com/api/v1/write` with bearer token. Cleaner than exposing `/metrics` publicly. | `victoriametrics/vmagent` image |
+
+**Note on GORM context propagation:** the existing repository methods don't take `ctx context.Context`. Adding `otelgorm` requires plumbing ctx down from the Echo handler through the service layer to the repository call site. ~10 repository files, many call sites. Save for last because the diff is large.
+
+---
+
+## Implementation order (smallest first)
+
+### Step 1 — Metrics + dashboards (highest immediate ROI)
+
+**On `88oakappsUpdate`:**
+1. `mkdir -p /opt/honeydue-obs/{data/vm,data/jaeger,data/grafana}` and a `docker-compose.yml` defining the three services with `mem_limit: 256m`, bind mounts for persistence, and an isolated bridge network
+2. Add nginx vhosts (DNS A records first):
+   - `grafana.88oakapps.com` → `127.0.0.1:3000` (basic auth via htpasswd, Let's Encrypt)
+   - `obs.88oakapps.com` → routes by path:
+     - `/api/v1/write` → `127.0.0.1:8428` (VictoriaMetrics remote-write, bearer-token check)
+     - `/v1/traces`     → `127.0.0.1:4318` (OTLP/HTTP traces, bearer-token check)
+3. Generate a 32-byte token, store in `/etc/honeydue-obs/token` (mode 0600), reference from nginx as `auth_request` or simple `if ($http_authorization != ...)`
+4. Pre-provision Grafana with the VM datasource pointing at `http://victoriametrics:8428` (in-network)
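The plan performs the bearer-token check in nginx; the Go sketch below only illustrates what "a 32-byte token" and an exact-match check look like. The names `newIngestToken`, `bearerOK`, `requireBearer`, and `statusFor` are hypothetical and do not come from the honeyDue codebase.

```go
package main

import (
	"crypto/rand"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newIngestToken returns 32 bytes from the OS CSPRNG, hex-encoded (64 characters).
func newIngestToken() (string, error) {
	b := make([]byte, 32)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// bearerOK reports whether the header is exactly "Bearer <token>".
// For equal-length inputs the comparison runs in constant time.
func bearerOK(header, token string) bool {
	want := "Bearer " + token
	return subtle.ConstantTimeCompare([]byte(header), []byte(want)) == 1
}

// requireBearer is a hypothetical guard that rejects requests without the token.
func requireBearer(token string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !bearerOK(r.Header.Get("Authorization"), token) {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor sends one in-memory request through h and returns the status code.
func statusFor(h http.Handler, authHeader string) int {
	req := httptest.NewRequest(http.MethodPost, "/api/v1/write", nil)
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	token, err := newIngestToken()
	if err != nil {
		panic(err)
	}
	accepted := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusNoContent)
	})
	h := requireBearer(token, accepted)
	fmt.Println("token length:", len(token))
	fmt.Println("valid token:", statusFor(h, "Bearer "+token))
	fmt.Println("wrong token:", statusFor(h, "Bearer wrong"))
}
```

The same token value is what vmagent and the OTLP exporter would send in the `Authorization` header, so it needs to be generated once and distributed to both sides.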
+
+**On the honeyDue k3s cluster:**
+5. Add `prometheus/client_golang` to `honeyDueAPI-go/go.mod` and a `/metrics` endpoint to the Go API
+6. Register histograms:
+   - `http_request_duration_seconds{route,method,status}` via Echo middleware
+   - `gorm_query_duration_seconds{table,operation}` via a GORM `Plugin` callback (no ctx needed for this one; it times each statement in before/after callbacks and reads the table and operation from the statement)
+   - `b2_upload_duration_seconds{bucket,result}`
+   - `apns_send_duration_seconds{result}`
+7. Deploy a `vmagent` sidecar (or DaemonSet) in the `honeydue` namespace with:
+   - Scrape: each api Pod's `/metrics` every 15s (scrape Pods rather than the Service, so every replica's counters are collected)
+   - `-remoteWrite.url`: `https://obs.88oakapps.com/api/v1/write`
+   - `-remoteWrite.bearerToken`: from k8s Secret
+8. Build the RED dashboard in Grafana: rate, errors, duration p50/p95/p99 per route
+
+**ROI:** "Is the API healthy? Where is time being spent right now?" answered live, served from `grafana.88oakapps.com`.
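The p50/p95/p99 panels depend on the histograms exposing cumulative buckets, because quantiles are estimated from bucket counts rather than stored per request. The sketch below shows that estimation with linear interpolation inside the matching bucket, in the spirit of PromQL's `histogram_quantile`; it is an illustration, not the query engine's code, and the bucket bounds and counts in `sampleBuckets` are made up.

```go
package main

import (
	"fmt"
	"math"
)

// bucket is one cumulative histogram bucket: count holds the number of
// observations less than or equal to upperBound (the `le` label).
type bucket struct {
	upperBound float64 // seconds; math.Inf(1) for the +Inf bucket
	count      float64 // cumulative count
}

// estimateQuantile approximates the q-quantile (0 <= q <= 1) by linear
// interpolation inside the bucket that contains the target rank.
// Buckets must be sorted by upperBound and end with the +Inf bucket.
func estimateQuantile(q float64, buckets []bucket) float64 {
	n := len(buckets)
	if n < 2 || q < 0 || q > 1 || !math.IsInf(buckets[n-1].upperBound, 1) {
		return math.NaN()
	}
	total := buckets[n-1].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	for i, b := range buckets {
		if b.count < rank {
			continue
		}
		if i == n-1 {
			// The rank falls in +Inf: report the highest finite bound.
			return buckets[n-2].upperBound
		}
		lower, prev := 0.0, 0.0
		if i > 0 {
			lower, prev = buckets[i-1].upperBound, buckets[i-1].count
		}
		if b.count == prev {
			return lower
		}
		return lower + (b.upperBound-lower)*(rank-prev)/(b.count-prev)
	}
	return math.NaN()
}

// approxEqual compares floats with a small absolute tolerance.
func approxEqual(a, b float64) bool { return math.Abs(a-b) < 1e-9 }

// sampleBuckets is made-up data: 100 requests spread over five buckets.
func sampleBuckets() []bucket {
	return []bucket{
		{0.05, 40}, {0.1, 80}, {0.25, 95}, {0.5, 99}, {math.Inf(1), 100},
	}
}

func main() {
	b := sampleBuckets()
	fmt.Printf("p50=%.4f p95=%.4f p99=%.4f\n",
		estimateQuantile(0.50, b), estimateQuantile(0.95, b), estimateQuantile(0.99, b))
}
```

With this data the estimates come out at about 62.5 ms, 250 ms, and 500 ms. The precision of any percentile is limited by the bucket boundaries, so the bounds chosen for `http_request_duration_seconds` should bracket the latencies you care about.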
+
+### Step 2 — Tracing baseline
+
+(Jaeger is already up from Step 1. This step adds the app-side wiring.)
+
+1. Add Grafana datasource for Jaeger pointing at `http://jaeger:16686` (in-network)
+2. Wire OTel SDK in `cmd/api/main.go`:
+   - `otel.SetTracerProvider(tracerProvider)`
+   - `otelecho.Middleware("honeydue-api")` on Echo
+   - OTLP/HTTP exporter pointing at `https://obs.88oakapps.com/v1/traces` with `Authorization: Bearer <token>` header (token from env)
+   - Sampling: `TraceIDRatioBased(0.1)` in prod, `AlwaysSample()` in dev
+3. Verify: a single `POST /api/auth/login/` produces a trace in Jaeger
+
+**ROI:** "Why is this one request slow?" — answered with a flame graph.
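A ratio-based sampler decides from the trace ID itself, so every service that sees the same trace makes the same keep/drop decision without coordination. The sketch below illustrates that idea for the 10% prod setting; it is a stand-alone approximation of the technique, not the OpenTelemetry SDK's source, and `ratioSampler`, `shouldSample`, and `sampledCount` are hypothetical names.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/rand"
)

// traceID mirrors the 16-byte trace identifier used by W3C Trace Context.
type traceID [16]byte

// ratioSampler keeps a trace when a 63-bit integer derived from the
// trace ID falls below ratio * 2^63.
type ratioSampler struct {
	upperBound uint64
}

func newRatioSampler(ratio float64) ratioSampler {
	switch {
	case ratio >= 1:
		return ratioSampler{upperBound: 1 << 63} // keep everything
	case ratio <= 0:
		return ratioSampler{upperBound: 0} // keep nothing
	default:
		return ratioSampler{upperBound: uint64(ratio * (1 << 63))}
	}
}

// shouldSample is deterministic: the same trace ID always yields the same answer.
func (s ratioSampler) shouldSample(id traceID) bool {
	x := binary.BigEndian.Uint64(id[8:16]) >> 1
	return x < s.upperBound
}

// sampledCount generates n pseudo-random trace IDs from seed and returns
// how many of them the sampler keeps.
func sampledCount(s ratioSampler, n int, seed int64) int {
	r := rand.New(rand.NewSource(seed))
	kept := 0
	for i := 0; i < n; i++ {
		var id traceID
		binary.BigEndian.PutUint64(id[0:8], r.Uint64())
		binary.BigEndian.PutUint64(id[8:16], r.Uint64())
		if s.shouldSample(id) {
			kept++
		}
	}
	return kept
}

func main() {
	prod := newRatioSampler(0.1)
	dev := newRatioSampler(1)
	const n = 100000
	fmt.Printf("prod kept %d of %d traces\n", sampledCount(prod, n, 1), n)
	fmt.Printf("dev kept %d of %d traces\n", sampledCount(dev, n, 1), n)
}
```

One consequence worth keeping in mind when verifying item 3: at 10% sampling in prod, a single login request will usually not produce a trace, so the verification is easiest in dev with `AlwaysSample()`.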
+
+### Step 3 — Manual spans for the work that actually matters
+
+Wrap each in `tracer.Start(ctx, ...)` with attributes:
+- `storage_service.Upload` → span "b2.PutObject" with `bucket`, `key`, `size_bytes`, result
+- `push/apns.go` → span "apns.send" with `device_token_hash`, `status_code`, `reason`
+- `asynq` middleware → span per task type with `task.type`, `retry_count`, `payload_size`
+
+**ROI:** Specific high-value debugging questions ("why did this upload take 30 seconds", "why did these 5 push notifications fail") answered without code archaeology.
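The `device_token_hash` attribute lets failed pushes be correlated per device without writing raw push tokens into trace storage. A minimal sketch, assuming a truncated SHA-256 is acceptable for correlation; `deviceTokenHash` and `apnsSpanAttributes` are hypothetical helpers, and the plain string map stands in for real span attributes.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// deviceTokenHash returns the first 8 bytes of SHA-256(token) as 16 hex
// characters: stable per device, but not reversible to the raw token.
func deviceTokenHash(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:8])
}

// apnsSpanAttributes builds the attribute set named in the plan for the
// "apns.send" span. The map is a stand-in for real span attributes.
func apnsSpanAttributes(token string, statusCode int, reason string) map[string]string {
	return map[string]string{
		"device_token_hash": deviceTokenHash(token),
		"status_code":       strconv.Itoa(statusCode),
		"reason":            reason,
	}
}

func main() {
	// The token below is a made-up placeholder, not a real APNs token.
	attrs := apnsSpanAttributes("example-device-token", 410, "Unregistered")
	fmt.Println(attrs["device_token_hash"], attrs["status_code"], attrs["reason"])
}
```

Truncating to 8 bytes keeps attribute values short; at honeyDue's device count, collisions between two different tokens are not a practical concern.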
+
+### Step 4 — Repository ctx + `otelgorm` (biggest diff, save for last)
+
+1. Refactor every repository method to accept `ctx context.Context` as first arg
+2. Update every call site to pass `c.Request().Context()` from handlers / propagate through services
+3. Add `db.Use(otelgorm.NewPlugin())` in `internal/database/database.go`
+4. Verify: a request now has nested spans `http → service → query → query → b2.PutObject → apns.send` with full SQL on the query spans
+
+**ROI:** Every DB query in every trace, with SQL + table + rows. The "find the N+1" tool you'd otherwise build by hand.
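The reason Step 4 needs the ctx refactor is that the active span travels inside `context.Context`: a query issued with a fresh background context cannot be attached to the request's trace. The sketch below demonstrates that with standard-library stand-ins only; `withTrace`, `traceFrom`, `recordingDB`, and `taskRepository` are hypothetical, and with GORM the ctx-aware call would go through `db.WithContext(ctx)`.

```go
package main

import (
	"context"
	"fmt"
)

type traceKey struct{}

// withTrace and traceFrom stand in for the span that a tracing SDK stores in ctx.
func withTrace(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, traceKey{}, id)
}

func traceFrom(ctx context.Context) string {
	if id, ok := ctx.Value(traceKey{}).(string); ok {
		return id
	}
	return "orphan"
}

// recordingDB is a fake database handle that notes which trace each
// query was attributed to.
type recordingDB struct{ log []string }

func (d *recordingDB) query(ctx context.Context, sql string) {
	d.log = append(d.log, traceFrom(ctx)+" | "+sql)
}

type taskRepository struct{ db *recordingDB }

// findByIDLegacy has the pre-refactor shape: no ctx parameter, so the
// query can only run under a background context.
func (r taskRepository) findByIDLegacy(id int) {
	r.db.query(context.Background(), fmt.Sprintf("SELECT * FROM tasks WHERE id = %d", id))
}

// findByID has the Step 4 shape: ctx is the first argument and is passed down.
func (r taskRepository) findByID(ctx context.Context, id int) {
	r.db.query(ctx, fmt.Sprintf("SELECT * FROM tasks WHERE id = %d", id))
}

// requestContext plays the role of c.Request().Context() in an Echo handler.
func requestContext(traceID string) context.Context {
	return withTrace(context.Background(), traceID)
}

func main() {
	db := &recordingDB{}
	repo := taskRepository{db: db}
	ctx := requestContext("trace-abc")

	repo.findByIDLegacy(1) // recorded as "orphan"
	repo.findByID(ctx, 1)  // recorded under "trace-abc"

	for _, line := range db.log {
		fmt.Println(line)
	}
}
```

Any repository method left in the legacy shape after the refactor will show up the same way in Jaeger: its queries appear as detached root spans instead of children of the HTTP span.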
+
+---
+
+## Hard skips (revisit only when explicitly proven needed)
+
+| Tool | Why skip |
+|---|---|
+| Loki / Promtail | Dozzle covers the immediate need. Loki adds ~512 MB RAM + a daemonset; defer until log search becomes a hot pain point. |
+| Mimir / VM cluster mode | Single-node VM handles honeyDue scale for years. |
+| Pyroscope continuous profiling | Overkill at 3 small nodes. Use `pprof` endpoints ad-hoc when CPU pressure shows up. |
+| OTel Collector | Only worth running when 3+ services emit telemetry. App → Jaeger direct is fine for now. |
+| Any SaaS vendor (Datadog, New Relic, Honeycomb, Grafana Cloud, Sentry Performance) | User constraint: nothing paid. |
+
+---
+
+## When to move off `88oakappsUpdate`
+
+Triggers — any one is enough:
+- `88oakappsUpdate` available memory drops below ~3 GB sustained (PostHog growth squeezing it)
+- ClickHouse OOM events start showing up in `dmesg` (PostHog under load)
+- You want fully separate failure domains for honeyDue vs. 88oakapps
+
+Migration path: the obs stack is a single docker-compose project on a bind-mount, so moving it = `rsync /opt/honeydue-obs/` to a new box, update DNS for `grafana.88oakapps.com` and `obs.88oakapps.com`, `docker compose up -d`. ~30 min of work. Until then: cohabiting on `88oakappsUpdate` is correct.
+
+---
+
+## Quick reference: what shows up where
+
+| Question | Where to look |
+|---|---|
+| Is the API up right now? Latency? Errors? | Grafana RED dashboard |
+| Why is this specific request slow? | Jaeger trace view |
+| What did the slow part of that request actually do (which SQL, which B2 PUT)? | Span details inside the trace |
+| Background job throughput / queue depth | VictoriaMetrics + asynq metrics |
+| What did the app print to stdout 5 minutes ago? | Dozzle |
+| What error did the app log? | Dozzle (search) — or Loki if/when added |