Add Prometheus metrics + vmagent push to obs.88oakapps.com

Adds internal/prom package with histograms for HTTP, GORM, B2, APNs, and
FCM, wired into the Echo router (HTTPMiddleware + /metrics) and GORM via
statement-level callbacks (no ctx plumbing needed). Storage and push
clients call ObserveB2Upload / ObserveAPNsSend / ObserveFCMSend at the
network round-trip points.

Existing internal/monitoring metrics move to /metrics/legacy so the
canonical /metrics emits proper histogram buckets for p50/p95/p99 rollups.

deploy-k3s/manifests/observability/vmagent.yaml deploys a single-replica
vmagent in the honeydue namespace that scrapes api Pods on :8000/metrics
every 15s and remote-writes to https://obs.88oakapps.com/api/v1/write
with a bearer token (substituted at deploy time from OBS_INGEST_TOKEN
in deploy/prod.env). NetworkPolicies allow vmagent egress to api Pods
and to the public obs endpoint over :443; the obs side runs
VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate.

docs/observability-plan.md captures the full plan including resource
budget, instrumentation table, 4-step rollout, and migration triggers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trey t
2026-04-25 14:16:17 -05:00
parent 1cd6cafa9d
commit df78d9ccd8
10 changed files with 622 additions and 3 deletions
@@ -0,0 +1,164 @@
# Observability Plan — honeyDue (100% self-hosted)
**Goal:** Live request-timing visibility (HTTP, DB, B2 uploads, APNs, asynq jobs) without paying any SaaS vendor.
**Deployment target:** `88oakappsUpdate` (Linode VPS at `185.143.228.16`, Ubuntu 24.04, 8 vCPU / 32 GB RAM / 193 GB disk). This box already runs the self-hosted PostHog stack and has nginx + Let's Encrypt set up for `*.88oakapps.com`. Free RAM at rest ≈ 15 GB; the obs stack budget is ≈ 700 MB → ~5% of free RAM. Costs $0 incremental.
**Why not in the honeyDue k3s cluster:** Frees ~700 MB across the 3 Hetzner nodes, no PVC plumbing, and no need to expose anything from k3s — everything is push-from-app to a public TLS endpoint.
**Status:** Plan only — nothing implemented yet.
---
## Stack
| Role | Choice | Why this vs. the obvious alternative |
|---|---|---|
| Metrics store | **VictoriaMetrics** (single-node) | Drop-in Prometheus-compatible. ~2.5× lower RAM (~200 MB vs ~500 MB) and ~7× better compression. Single binary. |
| Tracing | **Jaeger all-in-one** | ~150 MB RAM with embedded badger storage. Tempo monolithic mode needs 1-2 GB minimum — overkill for honeyDue's scale. |
| Dashboards | **Grafana OSS** | Connects to both VM (Prometheus protocol) and Jaeger natively. |
| App instrumentation | **OpenTelemetry SDK** + `prometheus/client_golang` | OTel is vendor-neutral — backends are swappable without code change. |
| Logs | **Keep Dozzle**; add Loki only when log search becomes painful | Loki adds ~512 MB RAM + a daemonset for log shipping. Not worth it until there's a concrete pain point. |
### Why not the LGTM stack (Loki + Grafana + Tempo + Mimir)?
- **Tempo** wants 1-2 GB RAM minimum in monolithic mode ([Grafana community report](https://community.grafana.com/t/tempo-ram-usage-for-6k-spans-per-hour/63801)). Stacking that on top of Loki + Mimir would consume ~3-4 GB RAM. On a 3×8 GB cluster that's 12-17% of capacity for observability infra.
- **Mimir** is wonderful for multi-tenant Prometheus at scale — you have one tenant.
- **Loki** is great if you live in `kubectl logs` and need full-text search across them. You currently use Dozzle and are not feeling that pain.
VictoriaMetrics + Jaeger all-in-one gives you 90% of the value at 25% of the resource cost.
---
## Resource budget on `88oakappsUpdate`
Three Docker containers in a separate compose project under `/opt/honeydue-obs/` — fully isolated from the existing PostHog compose stack so PostHog's lifecycle never touches the obs stack and vice versa.
| Service | `mem_limit` | Disk (bind mount) | Retention |
|---|---|---|---|
| VictoriaMetrics single-node | 256 MB | 10 GB | 30 days metrics |
| Jaeger all-in-one (badger storage) | 256 MB | 10 GB | 7 days traces |
| Grafana OSS | 256 MB | 1 GB | — |
| **Total** | **~768 MB hard cap** | **21 GB** | |
**~5% of the box's free RAM and ~14% of free disk.** The hard `mem_limit` per container matters: ClickHouse on the same VM can spike under PostHog analytics load, so bounding the obs stack prevents it from competing in a memory pinch.
**Don't reuse PostHog's ClickHouse / Kafka / Redis.** Tempting because they're sitting right there, but coupling honeyDue's observability to PostHog's storage means a PostHog incident takes honeyDue's incident-response telemetry down with it. Keep them fully separate.
**Shared blast radius caveat:** A kernel panic on `88oakappsUpdate` loses both PostHog and honeyDue obs at once. At current scale, fine — call it out, don't fix.
---
## App-side instrumentation
| Surface | Library / approach | Import path |
|---|---|---|
| Echo HTTP middleware | `otelecho` — span per request, tagged route/method/status | `go.opentelemetry.io/contrib/instrumentation/github.com/labstack/echo/otelecho` |
| GORM queries | `uptrace/otelgorm` plugin — `db.Use(otelgorm.NewPlugin())`. Requires threading `ctx` through repositories so `db.WithContext(ctx)` works. | `github.com/uptrace/opentelemetry-go-extra/otelgorm` |
| B2 / minio-go uploads | Manual span around `storage_service.Upload` with attributes for bucket, object size, MIME type | `go.opentelemetry.io/otel` |
| APNs / FCM | Manual span in `internal/push/apns.go` and `fcm.go`; record a hash of the device token (never the raw token) and the response status code | `go.opentelemetry.io/otel` |
| asynq jobs | Custom `asynq.MiddlewareFunc` (~20 lines) — span per task type, attached to ctx, records duration + retry count | `go.opentelemetry.io/otel` + `asynq.MiddlewareFunc` |
| Prometheus `/metrics` endpoint | `prometheus/client_golang` direct — register histograms for HTTP duration / GORM op / B2 op / APNs send | `github.com/prometheus/client_golang/prometheus`, `.../prometheus/promhttp` |
| OTLP exporter | OTLP/HTTP → `https://obs.88oakapps.com/v1/traces` with bearer token. 100% sample in dev, 10% in prod. | `go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp` |
| Metrics push | `vmagent` sidecar in k3s scrapes the api Pod's `/metrics` and remote-writes to `https://obs.88oakapps.com/api/v1/write` with bearer token. Cleaner than exposing `/metrics` publicly. | `victoriametrics/vmagent` image |
**Note on GORM context propagation:** the existing repository methods don't take `ctx context.Context`. Adding `otelgorm` requires plumbing ctx down from the Echo handler through the service layer to the repository call site. ~10 repository files, many call sites. Save for last because the diff is large.
---
## Implementation order (smallest first)
### Step 1 — Metrics + dashboards (highest immediate ROI)
**On `88oakappsUpdate`:**
1. `mkdir -p /opt/honeydue-obs/{data/vm,data/jaeger,data/grafana}` and create a `docker-compose.yml` defining the three services with `mem_limit: 256m`, bind mounts for persistence, and an isolated bridge network
2. Add nginx vhosts (DNS A records first):
- `grafana.88oakapps.com` → `127.0.0.1:3000` (basic auth via htpasswd, Let's Encrypt)
- `obs.88oakapps.com` → routes by path:
- `/api/v1/write` → `127.0.0.1:8428` (VictoriaMetrics remote-write, bearer-token check)
- `/v1/traces` → `127.0.0.1:4318` (OTLP/HTTP traces, bearer-token check)
3. Generate a 32-byte token, store it in `/etc/honeydue-obs/token` (mode 0600), and reference it from nginx via `auth_request` or a simple `if ($http_authorization != ...)`
4. Pre-provision Grafana with the VM datasource pointing at `http://victoriametrics:8428` (in-network)
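
Step 3's token and the bearer check can be sketched in Go. In this plan nginx performs the check; the code below only illustrates the same logic, and `newIngestToken` and `bearerMatches` are hypothetical names, not part of any existing package.

```go
package main

import (
	"crypto/rand"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"strings"
)

// newIngestToken returns 32 random bytes, hex-encoded (64 characters).
// Hypothetical helper for illustration only.
func newIngestToken() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// bearerMatches reports whether an Authorization header value carries the
// expected token. The comparison is constant-time so a mismatch does not
// reveal how many leading characters were correct.
func bearerMatches(header, want string) bool {
	const prefix = "Bearer "
	if !strings.HasPrefix(header, prefix) {
		return false
	}
	got := strings.TrimPrefix(header, prefix)
	return subtle.ConstantTimeCompare([]byte(got), []byte(want)) == 1
}

func main() {
	tok, err := newIngestToken()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(tok))                           // 64
	fmt.Println(bearerMatches("Bearer "+tok, tok))  // true
	fmt.Println(bearerMatches("Bearer wrong", tok)) // false
}
```

The same token value is what vmagent and the OTLP exporter would send in the `Authorization: Bearer <token>` header.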
**On the honeyDue k3s cluster:**
5. Add `prometheus/client_golang` to `honeyDueAPI-go/go.mod` and a `/metrics` endpoint to the Go API
6. Register histograms:
- `http_request_duration_seconds{route,method,status}` via Echo middleware
- `gorm_query_duration_seconds{table,operation}` via a GORM `Plugin` callback (no ctx needed for this one — operates at the SQL string level)
- `b2_upload_duration_seconds{bucket,result}`
- `apns_send_duration_seconds{result}`
7. Deploy a `vmagent` sidecar (or DaemonSet) in the `honeydue` namespace with:
- Scrape: api Service `/metrics` every 15s
- `remote_write.url`: `https://obs.88oakapps.com/api/v1/write`
- `remote_write.bearer_token`: from k8s Secret
8. Build the RED dashboard in Grafana: rate, errors, duration p50/p95/p99 per route
**ROI:** "Is the API healthy? Where is time being spent right now?" answered live, served from `grafana.88oakapps.com`.
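
The p50/p95/p99 panels in step 8 are estimates computed from the histogram's cumulative buckets, which is why bucket boundaries matter. Below is a stdlib-only sketch of the linear interpolation that a Prometheus-style `histogram_quantile` performs. It is a simplified illustration, not the query engine's code; `estimateQuantile` is a hypothetical helper and the bucket bounds and counts are made-up sample data.

```go
package main

import (
	"fmt"
	"math"
)

// bucket is one cumulative histogram bucket: the number of observations
// with a duration <= upperBound (seconds). The last bucket is +Inf.
type bucket struct {
	upperBound float64
	cumCount   float64
}

// estimateQuantile interpolates linearly inside the bucket that contains
// the requested rank. Buckets must be sorted ascending and end with +Inf.
func estimateQuantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 || q < 0 || q > 1 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].cumCount
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	for i, b := range buckets {
		if b.cumCount < rank {
			continue
		}
		if math.IsInf(b.upperBound, 1) {
			// Cannot interpolate into +Inf: report the highest finite bound.
			return buckets[len(buckets)-2].upperBound
		}
		lower, below := 0.0, 0.0
		if i > 0 {
			lower, below = buckets[i-1].upperBound, buckets[i-1].cumCount
		}
		inBucket := b.cumCount - below
		if inBucket == 0 {
			return lower
		}
		return lower + (b.upperBound-lower)*(rank-below)/inBucket
	}
	return math.NaN()
}

// sampleBuckets is invented data: 100 requests, half of them under 100 ms.
func sampleBuckets() []bucket {
	return []bucket{
		{0.1, 50}, {0.5, 90}, {1, 99}, {math.Inf(1), 100},
	}
}

func main() {
	b := sampleBuckets()
	fmt.Printf("p50=%.3f p95=%.3f p99=%.3f\n",
		estimateQuantile(0.50, b), estimateQuantile(0.95, b), estimateQuantile(0.99, b))
	// prints: p50=0.100 p95=0.778 p99=1.000
}
```

Note that the p95 of 0.778 s is an interpolation between the 0.5 s and 1 s bounds. If most requests land in one wide bucket, the estimate gets coarse, so pick bucket bounds around the latencies you care about.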
### Step 2 — Tracing baseline
(Jaeger is already up from Step 1. This step adds the app-side wiring.)
1. Add Grafana datasource for Jaeger pointing at `http://jaeger:16686` (in-network)
2. Wire OTel SDK in `cmd/api/main.go`:
- `otel.SetTracerProvider(tracerProvider)`
- `otelecho.Middleware("honeydue-api")` on Echo
- OTLP/HTTP exporter pointing at `https://obs.88oakapps.com/v1/traces` with `Authorization: Bearer <token>` header (token from env)
- Sampling: `TraceIDRatioBased(0.1)` in prod, `AlwaysSample()` in dev
3. Verify: a single `POST /api/auth/login/` produces a trace in Jaeger
**ROI:** "Why is this one request slow?" — answered with a flame graph.
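
Ratio-based sampling at 10% is deterministic per trace: the decision is derived from the trace ID, so every span of a trace gets the same verdict. The sketch below illustrates that idea with the standard library only. It is an assumption-laden approximation of how a ratio sampler can work, not the OpenTelemetry SDK's code; `shouldSample` and `countSampled` are hypothetical names.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// shouldSample maps the low 8 bytes of the trace ID to a 63-bit number and
// keeps the trace if that number falls below fraction * 2^63.
func shouldSample(traceID [16]byte, fraction float64) bool {
	if fraction >= 1 {
		return true
	}
	if fraction <= 0 {
		return false
	}
	upperBound := uint64(fraction * (1 << 63))
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < upperBound
}

// countSampled feeds n synthetic, well-spread trace IDs through the sampler
// and returns how many were kept. The IDs are generated by multiplicative
// hashing purely for the demo; real trace IDs are random.
func countSampled(n int, fraction float64) int {
	sampled := 0
	for i := 0; i < n; i++ {
		var id [16]byte
		binary.BigEndian.PutUint64(id[8:], uint64(i)*0x9E3779B97F4A7C15)
		if shouldSample(id, fraction) {
			sampled++
		}
	}
	return sampled
}

func main() {
	// Roughly one in ten synthetic traces is kept at fraction 0.1.
	fmt.Println(countSampled(10000, 0.1))
}
```

A consequence worth remembering when debugging: at 10% in prod, a specific slow request is more likely than not to have no trace, so reproduce it in dev (100%) or raise the ratio temporarily.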
### Step 3 — Manual spans for the work that actually matters
Wrap each in `tracer.Start(ctx, ...)` with attributes:
- `storage_service.Upload` → span "b2.PutObject" with `bucket`, `key`, `size_bytes`, result
- `push/apns.go` → span "apns.send" with `device_token_hash`, `status_code`, `reason`
- `asynq` middleware → span per task type with `task.type`, `retry_count`, `payload_size`
**ROI:** Specific high-value debugging questions ("why did this upload take 30 seconds", "why did these 5 push notifications fail") answered without code archaeology.
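
The `device_token_hash` attribute keeps raw push tokens out of trace storage while still letting you group failures by device. A minimal sketch, assuming a truncated SHA-256 is acceptable for correlation; `deviceTokenHash` is a hypothetical helper and the sample token is invented.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// deviceTokenHash returns the first 8 bytes of SHA-256(token) as 16 hex
// characters. Enough to correlate spans for one device, not reversible
// into the token itself.
func deviceTokenHash(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:8])
}

func main() {
	// Invented token value, for demonstration only.
	h := deviceTokenHash("example-device-token")
	fmt.Println(len(h)) // 16
}
```

The returned string is what would go into the span attribute in `push/apns.go` in place of the token.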
### Step 4 — Repository ctx + `otelgorm` (biggest diff, save for last)
1. Refactor every repository method to accept `ctx context.Context` as first arg
2. Update every call site to pass `c.Request().Context()` from handlers / propagate through services
3. Add `db.Use(otelgorm.NewPlugin())` in `internal/database/database.go`
4. Verify: a request now has nested spans `http → service → query → query → b2.PutObject → apns.send` with full SQL on the query spans
**ROI:** Every DB query in every trace, with SQL + table + rows. The "find the N+1" tool you'd otherwise build by hand.
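
Why the ctx refactor is unavoidable: the parent span travels in `context.Context`, so a repository method that never receives ctx cannot attach its query span to the request. The stdlib-only sketch below mimics that with a plain context value. `withSpan` is a hypothetical stand-in for `tracer.Start`, and `taskRepo` / `taskService` are invented types, not the project's real ones.

```go
package main

import (
	"context"
	"fmt"
)

type spanKey struct{}

// withSpan stands in for tracer.Start: it appends name to the span path
// already stored in ctx, if any.
func withSpan(ctx context.Context, name string) context.Context {
	if parent, _ := ctx.Value(spanKey{}).(string); parent != "" {
		name = parent + " > " + name
	}
	return context.WithValue(ctx, spanKey{}, name)
}

// spanPath returns the span path recorded in ctx.
func spanPath(ctx context.Context) string {
	p, _ := ctx.Value(spanKey{}).(string)
	return p
}

// taskRepo is a hypothetical repository whose method takes ctx first.
type taskRepo struct{}

func (taskRepo) FindByID(ctx context.Context, id int) string {
	ctx = withSpan(ctx, "query")
	return spanPath(ctx)
}

// taskService passes the ctx it was given down to the repository.
type taskService struct{ repo taskRepo }

func (s taskService) Get(ctx context.Context, id int) string {
	ctx = withSpan(ctx, "service")
	return s.repo.FindByID(ctx, id)
}

// handleRequest plays the Echo handler. In the real handler the ctx would
// come from c.Request().Context() instead of context.Background().
func handleRequest() string {
	ctx := withSpan(context.Background(), "http")
	return taskService{}.Get(ctx, 42)
}

// orphanQuery shows what happens when the request ctx is not propagated.
func orphanQuery() string {
	return taskRepo{}.FindByID(context.Background(), 42)
}

func main() {
	fmt.Println(handleRequest()) // http > service > query
	fmt.Println(orphanQuery())   // query
}
```

The second line is the failure mode to look for while migrating: a query span with no parent means some call site on the path still passes a fresh context.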
---
## Hard skips (revisit only when explicitly proven needed)
| Tool | Why skip |
|---|---|
| Loki / Promtail | Dozzle covers the immediate need. Loki adds ~512 MB RAM + a daemonset; defer until log search becomes a hot pain point. |
| Mimir / VM cluster mode | Single-node VM handles honeyDue scale for years. |
| Pyroscope continuous profiling | Overkill at 3 small nodes. Use `pprof` endpoints ad-hoc when CPU pressure shows up. |
| OTel Collector | Only worth running when 3+ services emit telemetry. App → Jaeger direct is fine for now. |
| Any SaaS vendor (Datadog, NR, Honeycomb, Grafana Cloud, Sentry Performance) | User constraint: nothing paid. |
---
## When to move off `88oakappsUpdate`
Triggers — any one is enough:
- `88oakappsUpdate` available memory drops below ~3 GB sustained (PostHog growth squeezing it)
- ClickHouse OOM events start showing up in `dmesg` (PostHog under load)
- You want fully separate failure domains for honeyDue vs. 88oakapps
Migration path: the obs stack is a single docker-compose project on a bind-mount, so moving it = `rsync /opt/honeydue-obs/` to a new box, update DNS for `grafana.88oakapps.com` and `obs.88oakapps.com`, `docker compose up -d`. ~30 min of work. Until then: cohabiting on `88oakappsUpdate` is correct.
---
## Quick reference: what shows up where
| Question | Where to look |
|---|---|
| Is the API up right now? Latency? Errors? | Grafana RED dashboard |
| Why is this specific request slow? | Jaeger trace view |
| What did the slow part of that request actually do (which SQL, which B2 PUT)? | Span details inside the trace |
| Background job throughput / queue depth | VictoriaMetrics + asynq metrics |
| What did the app print to stdout 5 minutes ago? | Dozzle |
| What error did the app log? | Dozzle (search) — or Loki if/when added |