honeyDueAPI/docs/observability-plan.md
Trey t df78d9ccd8
Add Prometheus metrics + vmagent push to obs.88oakapps.com
Adds internal/prom package with histograms for HTTP, GORM, B2, APNs, and
FCM, wired into the Echo router (HTTPMiddleware + /metrics) and GORM via
statement-level callbacks (no ctx plumbing needed). Storage and push
clients call ObserveB2Upload / ObserveAPNsSend / ObserveFCMSend at the
network round-trip points.

Existing internal/monitoring metrics move to /metrics/legacy so the
canonical /metrics emits proper histogram buckets for p50/p95/p99 rollups.

deploy-k3s/manifests/observability/vmagent.yaml deploys a single-replica
vmagent in the honeydue namespace that scrapes api Pods on :8000/metrics
every 15s and remote-writes to https://obs.88oakapps.com/api/v1/write
with a bearer token (substituted at deploy time from OBS_INGEST_TOKEN
in deploy/prod.env). NetworkPolicies allow vmagent egress to api Pods
and to the public obs endpoint over :443; the obs side runs
VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate.

docs/observability-plan.md captures the full plan including resource
budget, instrumentation table, 4-step rollout, and migration triggers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 14:16:17 -05:00
# Observability Plan — honeyDue (100% self-hosted)
**Goal:** Live request-timing visibility (HTTP, DB, B2 uploads, APNs, asynq jobs) without paying any SaaS vendor.
**Deployment target:** `88oakappsUpdate` (Linode VPS at `185.143.228.16`, Ubuntu 24.04, 8 vCPU / 32 GB RAM / 193 GB disk). This box already runs the self-hosted PostHog stack and has nginx + Let's Encrypt set up for `*.88oakapps.com`. Free RAM at rest ≈ 15 GB; the obs stack budget is ≈ 700 MB → ~5% of free RAM. Costs $0 incremental.
**Why not in the honeyDue k3s cluster:** Frees ~700 MB across the 3 Hetzner nodes, no PVC plumbing, and no need to expose anything from k3s — everything is push-from-app to a public TLS endpoint.
**Status:** Plan only — nothing implemented yet.
---
## Stack
| Role | Choice | Why this vs. the obvious alternative |
|---|---|---|
| Metrics store | **VictoriaMetrics** (single-node) | Drop-in Prometheus-compatible. Lower RAM (~200 MB vs ~500 MB, about 2.5×) and ~7× better compression. Single binary. |
| Tracing | **Jaeger all-in-one** | ~150 MB RAM with embedded badger storage. Tempo monolithic mode needs 1-2 GB minimum — overkill for honeyDue's scale. |
| Dashboards | **Grafana OSS** | Connects to both VM (Prometheus protocol) and Jaeger natively. |
| App instrumentation | **OpenTelemetry SDK** + `prometheus/client_golang` | OTel is vendor-neutral — backends are swappable without code change. |
| Logs | **Keep Dozzle**; add Loki only when log search becomes painful | Loki adds ~512 MB RAM + a daemonset for log shipping. Not worth it until there's a concrete pain point. |
### Why not the LGTM stack (Loki + Grafana + Tempo + Mimir)?
- **Tempo** wants 1-2 GB RAM minimum in monolithic mode ([Grafana community report](https://community.grafana.com/t/tempo-ram-usage-for-6k-spans-per-hour/63801)). Stacking that on top of Loki + Mimir would consume ~3-4 GB RAM. On a 3×8 GB cluster that's 12-17% of capacity for observability infra.
- **Mimir** is wonderful for multi-tenant Prometheus at scale — you have one tenant.
- **Loki** is great if you live in `kubectl logs` and need full-text search across them. You currently use Dozzle and are not feeling that pain.
VictoriaMetrics + Jaeger all-in-one gives you 90% of the value at 25% of the resource cost.
---
## Resource budget on `88oakappsUpdate`
Three Docker containers in a separate compose project under `/opt/honeydue-obs/` — fully isolated from the existing PostHog compose stack so PostHog's lifecycle never touches the obs stack and vice versa.
| Service | `mem_limit` | Disk (bind mount) | Retention |
|---|---|---|---|
| VictoriaMetrics single-node | 256 MB | 10 GB | 30 days metrics |
| Jaeger all-in-one (badger storage) | 256 MB | 10 GB | 7 days traces |
| Grafana OSS | 256 MB | 1 GB | — |
| **Total** | **~768 MB hard cap** | **21 GB** | |
**~5% of the box's free RAM and ~14% of free disk.** The hard `mem_limit` per container matters: ClickHouse on the same VM can spike under PostHog analytics load, so bounding the obs stack prevents it from competing in a memory pinch.
**Don't reuse PostHog's ClickHouse / Kafka / Redis.** Tempting because they're sitting right there, but coupling honeyDue's observability to PostHog's storage means a PostHog incident takes honeyDue's incident-response telemetry down with it. Keep them fully separate.
**Shared blast radius caveat:** A kernel panic on `88oakappsUpdate` loses both PostHog and honeyDue obs at once. At current scale, fine — call it out, don't fix.
---
## App-side instrumentation
| Surface | Library / approach | Import path |
|---|---|---|
| Echo HTTP middleware | `otelecho` — span per request, tagged route/method/status | `go.opentelemetry.io/contrib/instrumentation/github.com/labstack/echo/otelecho` |
| GORM queries | `uptrace/otelgorm` plugin — `db.Use(otelgorm.NewPlugin())`. Requires threading `ctx` through repositories so `db.WithContext(ctx)` works. | `github.com/uptrace/opentelemetry-go-extra/otelgorm` |
| B2 / minio-go uploads | Manual span around `storage_service.Upload` with attributes for bucket, object size, MIME type | `go.opentelemetry.io/otel` |
| APNs / FCM | Manual span in `internal/push/apns.go` and `fcm.go`; record a hash of the device token (never the raw token) and the response status code | `go.opentelemetry.io/otel` |
| asynq jobs | Custom `asynq.MiddlewareFunc` (~20 lines) — span per task type, attached to ctx, records duration + retry count | `go.opentelemetry.io/otel` + `asynq.MiddlewareFunc` |
| Prometheus `/metrics` endpoint | `prometheus/client_golang` direct — register histograms for HTTP duration / GORM op / B2 op / APNs send | `github.com/prometheus/client_golang/prometheus`, `.../prometheus/promhttp` |
| OTLP exporter | OTLP/HTTP → `https://obs.88oakapps.com/v1/traces` with bearer token. 100% sample in dev, 10% in prod. | `go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp` |
| Metrics push | `vmagent` sidecar in k3s scrapes the api Pod's `/metrics` and remote-writes to `https://obs.88oakapps.com/api/v1/write` with bearer token. Cleaner than exposing `/metrics` publicly. | `victoriametrics/vmagent` image |
**Note on GORM context propagation:** the existing repository methods don't take `ctx context.Context`. Adding `otelgorm` requires plumbing ctx down from the Echo handler through the service layer to the repository call site. ~10 repository files, many call sites. Save for last because the diff is large.
---
## Implementation order (smallest first)
### Step 1 — Metrics + dashboards (highest immediate ROI)
**On `88oakappsUpdate`:**
1. `mkdir -p /opt/honeydue-obs/{data/vm,data/jaeger,data/grafana}` and a `docker-compose.yml` defining the three services with `mem_limit: 256m`, bind mounts for persistence, and an isolated bridge network
2. Add nginx vhosts (DNS A records first):
   - `grafana.88oakapps.com` → `127.0.0.1:3000` (basic auth via htpasswd, Let's Encrypt)
   - `obs.88oakapps.com` → routes by path:
     - `/api/v1/write` → `127.0.0.1:8428` (VictoriaMetrics remote-write, bearer-token check)
     - `/v1/traces` → `127.0.0.1:4318` (OTLP/HTTP traces, bearer-token check)
3. Generate a 32-byte token, store in `/etc/honeydue-obs/token` (mode 0600), reference from nginx as `auth_request` or simple `if ($http_authorization != ...)`
4. Pre-provision Grafana with the VM datasource pointing at `http://victoriametrics:8428` (in-network)
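The plan performs the bearer-token check in nginx. As an illustration of the logic involved in step 3, here is a minimal Go sketch that generates a 32-byte token and validates an `Authorization` header against it. `NewIngestToken` and `ValidBearer` are hypothetical names, and the hex encoding of the token is an assumption, not something the plan specifies.

```go
package main

import (
	"crypto/rand"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"strings"
)

// NewIngestToken returns 32 random bytes, hex-encoded (64 characters).
func NewIngestToken() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// ValidBearer reports whether an Authorization header value carries the
// expected token. An empty expected token never matches, so a missing
// config value cannot accidentally open the endpoint.
func ValidBearer(header, want string) bool {
	const prefix = "Bearer "
	if want == "" || !strings.HasPrefix(header, prefix) {
		return false
	}
	got := strings.TrimPrefix(header, prefix)
	// Constant-time comparison avoids leaking how many leading bytes matched.
	return subtle.ConstantTimeCompare([]byte(got), []byte(want)) == 1
}

func main() {
	token, err := NewIngestToken()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(token))                          // 64
	fmt.Println(ValidBearer("Bearer "+token, token)) // true
	fmt.Println(ValidBearer("Bearer nope", token))   // false
	fmt.Println(ValidBearer(token, token))           // false: scheme missing
}
```

The same token is what `vmagent` and the OTLP exporter would send, so a helper like this is mainly useful for generating the secret and for testing the ingest path locally.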
**On the honeyDue k3s cluster:**
5. Add `prometheus/client_golang` to `honeyDueAPI-go/go.mod` and a `/metrics` endpoint to the Go API
6. Register histograms:
   - `http_request_duration_seconds{route,method,status}` via Echo middleware
   - `gorm_query_duration_seconds{table,operation}` via a GORM `Plugin` callback (no ctx needed for this one; it operates at the SQL string level)
   - `b2_upload_duration_seconds{bucket,result}`
   - `apns_send_duration_seconds{result}`
7. Deploy a `vmagent` sidecar (or DaemonSet) in the `honeydue` namespace with:
   - Scrape: api Service `/metrics` every 15s
   - `remote_write.url`: `https://obs.88oakapps.com/api/v1/write`
   - `remote_write.bearer_token`: from k8s Secret
8. Build the RED dashboard in Grafana: rate, errors, duration p50/p95/p99 per route
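The p50/p95/p99 panels in step 8 depend on the histograms from step 6 exposing cumulative buckets. The sketch below shows, in simplified form, how a percentile can be estimated from such buckets by linear interpolation. It is a standalone illustration, not the `client_golang` API; the `Histogram` type, its methods, and the bucket bounds are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// Histogram is a minimal cumulative-bucket histogram: counts[i] is the
// number of observations less than or equal to bounds[i].
type Histogram struct {
	bounds []float64 // sorted upper bounds, in seconds
	counts []uint64  // cumulative counts, one per bound
	total  uint64    // all observations, including any above the last bound
}

func NewHistogram(bounds ...float64) *Histogram {
	b := append([]float64(nil), bounds...)
	sort.Float64s(b)
	return &Histogram{bounds: b, counts: make([]uint64, len(b))}
}

// Observe adds one measurement to every bucket whose bound covers it.
func (h *Histogram) Observe(seconds float64) {
	h.total++
	for i, ub := range h.bounds {
		if seconds <= ub {
			h.counts[i]++
		}
	}
}

// Quantile estimates the q-quantile by interpolating linearly inside the
// first bucket whose cumulative count reaches the target rank. Values above
// the last bound are reported as the last bound.
func (h *Histogram) Quantile(q float64) float64 {
	if h.total == 0 || len(h.bounds) == 0 {
		return 0
	}
	rank := q * float64(h.total)
	lower, prev := 0.0, uint64(0)
	for i, ub := range h.bounds {
		if float64(h.counts[i]) >= rank {
			inBucket := float64(h.counts[i] - prev)
			if inBucket == 0 {
				return ub
			}
			return lower + (ub-lower)*(rank-float64(prev))/inBucket
		}
		lower, prev = ub, h.counts[i]
	}
	return h.bounds[len(h.bounds)-1]
}

func main() {
	h := NewHistogram(0.05, 0.1, 0.25, 0.5, 1)
	for i := 0; i < 80; i++ {
		h.Observe(0.03) // fast requests
	}
	for i := 0; i < 15; i++ {
		h.Observe(0.2) // slower requests
	}
	for i := 0; i < 5; i++ {
		h.Observe(0.8) // slow tail
	}
	fmt.Printf("p50=%.3fs p95=%.3fs p99=%.3fs\n",
		h.Quantile(0.5), h.Quantile(0.95), h.Quantile(0.99))
}
```

With this data the estimates come out at roughly 31 ms, 250 ms, and 900 ms. The p95 lands on a bucket edge (250 ms) even though every request in that bucket took 200 ms, which is why bucket bounds should be chosen around the latencies that matter for the dashboard.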
**ROI:** "Is the API healthy? Where is time being spent right now?" answered live, served from `grafana.88oakapps.com`.
### Step 2 — Tracing baseline
(Jaeger is already up from Step 1. This step adds the app-side wiring.)
1. Add Grafana datasource for Jaeger pointing at `http://jaeger:16686` (in-network)
2. Wire OTel SDK in `cmd/api/main.go`:
   - `otel.SetTracerProvider(tracerProvider)`
   - `otelecho.Middleware("honeydue-api")` on Echo
   - OTLP/HTTP exporter pointing at `https://obs.88oakapps.com/v1/traces` with `Authorization: Bearer <token>` header (token from env)
   - Sampling: `TraceIDRatioBased(0.1)` in prod, `AlwaysSample()` in dev
3. Verify: a single `POST /api/auth/login/` produces a trace in Jaeger
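To make the 10% production sampling in step 2 concrete, here is a simplified sketch of ratio-based sampling keyed on the trace ID. It illustrates the general idea only; the plan uses the SDK's `TraceIDRatioBased`, whose internals may differ. `ShouldSample` and `upperBound` are hypothetical names.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/rand"
)

// upperBound converts a sampling fraction into a threshold over 63-bit values.
func upperBound(fraction float64) uint64 {
	if fraction <= 0 {
		return 0
	}
	if fraction >= 1 {
		return 1 << 63
	}
	return uint64(fraction * (1 << 63))
}

// ShouldSample decides from the trace ID alone, so every service that sees
// the same trace ID reaches the same verdict without coordination.
func ShouldSample(traceID [16]byte, fraction float64) bool {
	// Use the low 8 bytes as a number and drop one bit to stay within 63 bits.
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < upperBound(fraction)
}

func main() {
	r := rand.New(rand.NewSource(1)) // fixed seed so runs are repeatable
	const n = 100000
	kept := 0
	for i := 0; i < n; i++ {
		var id [16]byte
		binary.BigEndian.PutUint64(id[0:8], r.Uint64())
		binary.BigEndian.PutUint64(id[8:16], r.Uint64())
		if ShouldSample(id, 0.1) {
			kept++
		}
	}
	fmt.Printf("kept %d of %d traces\n", kept, n)
}
```

Because the decision is a pure function of the trace ID, a trace is either kept in full or dropped in full; there are no half-recorded requests. The trade-off is that a rare slow request has only a 1-in-10 chance of being captured in prod.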
**ROI:** "Why is this one request slow?" — answered with a flame graph.
### Step 3 — Manual spans for the work that actually matters
Wrap each in `tracer.Start(ctx, ...)` with attributes:
- `storage_service.Upload` → span "b2.PutObject" with `bucket`, `key`, `size_bytes`, result
- `push/apns.go` → span "apns.send" with `device_token_hash`, `status_code`, `reason`
- `asynq` middleware → span per task type with `task.type`, `retry_count`, `payload_size`
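The `device_token_hash` attribute above implies the raw push token never reaches the trace backend. A minimal sketch of one way to derive it follows; the function name `DeviceTokenHash` and the truncation to 8 bytes are assumptions chosen for the example, not something the plan prescribes.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// DeviceTokenHash returns a short, stable identifier for a push token that
// is safe to attach to a span. The same token always maps to the same value,
// so failures for one device can still be grouped together.
func DeviceTokenHash(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:8]) // first 8 bytes -> 16 hex characters
}

func main() {
	// The token below is a made-up placeholder, not a real APNs token.
	fmt.Println(DeviceTokenHash("example-device-token"))
	fmt.Println(len(DeviceTokenHash("example-device-token"))) // 16
}
```

Sixteen hex characters are enough to correlate spans for a single device while keeping attribute values short; use the full digest if collisions across a very large device population are a concern.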
**ROI:** Specific high-value debugging questions ("why did this upload take 30 seconds", "why did these 5 push notifications fail") answered without code archaeology.
### Step 4 — Repository ctx + `otelgorm` (biggest diff, save for last)
1. Refactor every repository method to accept `ctx context.Context` as first arg
2. Update every call site to pass `c.Request().Context()` from handlers / propagate through services
3. Add `db.Use(otelgorm.NewPlugin())` in `internal/database/database.go`
4. Verify: a request now has nested spans `http → service → query → query → b2.PutObject → apns.send` with full SQL on the query spans
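The reason every call site in steps 1 and 2 has to be updated can be shown with a small stdlib-only sketch: a repository can only attach its query to the request's trace if it receives the request's context. All names here (`TaskRepository`, `TaskService`, `withTrace`, `traceFrom`, `handle`, the `tasks` table) are hypothetical, and a plain context value stands in for a real span.

```go
package main

import (
	"context"
	"fmt"
)

type traceKey struct{}

// withTrace stands in for what tracing middleware does on each request:
// it attaches the request's trace identity to the context.
func withTrace(ctx context.Context, traceID string) context.Context {
	return context.WithValue(ctx, traceKey{}, traceID)
}

// traceFrom returns the trace identity, or "orphan" when none was propagated.
func traceFrom(ctx context.Context) string {
	if id, ok := ctx.Value(traceKey{}).(string); ok {
		return id
	}
	return "orphan"
}

// TaskRepository is a hypothetical repository after the refactor: every
// method takes ctx as its first argument.
type TaskRepository struct{ Queries []string }

func (r *TaskRepository) FindByID(ctx context.Context, id int) {
	r.Queries = append(r.Queries,
		fmt.Sprintf("trace=%s SELECT * FROM tasks WHERE id=%d", traceFrom(ctx), id))
}

// TaskService passes the caller's ctx through unchanged.
type TaskService struct{ Repo *TaskRepository }

func (s *TaskService) Load(ctx context.Context, id int) { s.Repo.FindByID(ctx, id) }

// handle simulates one request plus one call site missed during the refactor.
func handle(traceID string, id int) []string {
	repo := &TaskRepository{}
	svc := &TaskService{Repo: repo}

	svc.Load(withTrace(context.Background(), traceID), id) // ctx propagated
	repo.FindByID(context.Background(), id)                // ctx dropped

	return repo.Queries
}

func main() {
	for _, q := range handle("abc123", 7) {
		fmt.Println(q)
	}
}
```

The second query runs fine but is recorded as `orphan`: it would appear as a detached root span rather than nested under the request. Missed call sites therefore fail silently, so the verification in step 4 is worth doing per endpoint.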
**ROI:** Every DB query in every trace, with SQL + table + rows. The "find the N+1" tool you'd otherwise build by hand.
---
## Hard skips (revisit only when explicitly proven needed)
| Tool | Why skip |
|---|---|
| Loki / Promtail | Dozzle covers the immediate need. Loki adds ~512 MB RAM + a daemonset; defer until log search becomes a hot pain point. |
| Mimir / VM cluster mode | Single-node VM handles honeyDue scale for years. |
| Pyroscope continuous profiling | Overkill at 3 small nodes. Use `pprof` endpoints ad-hoc when CPU pressure shows up. |
| OTel Collector | Only worth running when 3+ services emit telemetry. App → Jaeger direct is fine for now. |
| Any SaaS vendor (Datadog, NR, Honeycomb, Grafana Cloud, Sentry Performance) | User constraint: nothing paid. |
---
## When to move off `88oakappsUpdate`
Triggers — any one is enough:
- `88oakappsUpdate` available memory drops below ~3 GB sustained (PostHog growth squeezing it)
- ClickHouse OOM events start showing up in `dmesg` (PostHog under load)
- You want fully separate failure domains for honeyDue vs. 88oakapps
Migration path: the obs stack is a single docker-compose project on a bind-mount, so moving it = `rsync /opt/honeydue-obs/` to a new box, update DNS for `grafana.88oakapps.com` and `obs.88oakapps.com`, `docker compose up -d`. ~30 min of work. Until then: cohabiting on `88oakappsUpdate` is correct.
---
## Quick reference: what shows up where
| Question | Where to look |
|---|---|
| Is the API up right now? Latency? Errors? | Grafana RED dashboard |
| Why is this specific request slow? | Jaeger trace view |
| What did the slow part of that request actually do (which SQL, which B2 PUT)? | Span details inside the trace |
| Background job throughput / queue depth | VictoriaMetrics + asynq metrics |
| What did the app print to stdout 5 minutes ago? | Dozzle |
| What error did the app log? | Dozzle (search) — or Loki if/when added |