honeyDueAPI/docs/observability-plan.md

# Observability Plan — honeyDue (100% self-hosted)

**Goal:** Live request-timing visibility (HTTP, DB, B2 uploads, APNs, asynq jobs) without paying any SaaS vendor.

**Deployment target:** `88oakappsUpdate` (Linode VPS at `185.143.228.16`, Ubuntu 24.04, 8 vCPU / 32 GB RAM / 193 GB disk). This box already runs the self-hosted PostHog stack and has nginx + Let's Encrypt set up for `*.88oakapps.com`. Free RAM at rest ≈ 15 GB; the obs stack budget is ≈ 700 MB, or ~5% of free RAM. Incremental cost: $0.

**Why not in the honeyDue k3s cluster:** Frees ~700 MB across the 3 Hetzner nodes, no PVC plumbing, and no need to expose anything from k3s — everything is push-from-app to a public TLS endpoint.

**Status:** Fully shipped. VictoriaMetrics + Jaeger + Grafana on `obs.88oakapps.com`, vmagent in-cluster, OTel SDK and otelgorm wired into the api+worker, every authed endpoint produces nested HTTP→service→SQL flame graphs in Jaeger.

The first round of traces showed that every visible millisecond was network/proxy overhead; DB execution itself is sub-millisecond. The follow-up work (`internal/services/residence_id_cache.go`, GORM pool warm-up, auth-query JOIN consolidation, switching `DB_HOST` to Neon's `-pooler` endpoint, and longer cache TTLs) cut warm-cache `/api/tasks/` from 2,473 ms / 5 spans to **229 ms / 2 spans**. See commit `88fb175` and Chapter 8 §"Optimizations layered on top".

---

## Stack

| Role | Choice | Why this vs. the obvious alternative |
|---|---|---|
| Metrics store | **VictoriaMetrics** (single-node) | Drop-in Prometheus-compatible. ~2.5× lower RAM (~200 MB vs ~500 MB) and ~7× better compression. Single binary. |
| Tracing | **Jaeger all-in-one** | ~150 MB RAM with embedded badger storage. Tempo monolithic mode needs 1-2 GB minimum — overkill for honeyDue's scale. |
| Dashboards | **Grafana OSS** | Connects to both VM (Prometheus protocol) and Jaeger natively. |
| App instrumentation | **OpenTelemetry SDK** + `prometheus/client_golang` | OTel is vendor-neutral — backends are swappable without code change. |
| Logs | **Keep Dozzle**; add Loki only when log search becomes painful | Loki adds ~512 MB RAM + a daemonset for log shipping. Not worth it until there's a concrete pain point. |

### Why not the LGTM stack (Loki + Grafana + Tempo + Mimir)?

- **Tempo** wants 1-2 GB RAM minimum in monolithic mode ([Grafana community report](https://community.grafana.com/t/tempo-ram-usage-for-6k-spans-per-hour/63801)). Stacking that on top of Loki + Mimir would consume ~3-4 GB RAM. On a 3×8 GB cluster that's 12-17% of capacity for observability infra.
- **Mimir** is wonderful for multi-tenant Prometheus at scale — you have one tenant.
- **Loki** is great if you live in `kubectl logs` and need full-text search across them. You currently use Dozzle and are not feeling that pain.

VictoriaMetrics + Jaeger all-in-one gives you 90% of the value at 25% of the resource cost.

---

## Resource budget on `88oakappsUpdate`

Three Docker containers in a separate compose project under `/opt/honeydue-obs/` — fully isolated from the existing PostHog compose stack so PostHog's lifecycle never touches the obs stack and vice versa.

| Service | `mem_limit` | Disk (bind mount) | Retention |
|---|---|---|---|
| VictoriaMetrics single-node | 256 MB | 10 GB | 30 days metrics |
| Jaeger all-in-one (badger storage) | 256 MB | 10 GB | 7 days traces |
| Grafana OSS | 256 MB | 1 GB | — |
| **Total** | **~768 MB hard cap** | **21 GB** | |

**~5% of the box's free RAM and ~14% of free disk.** The hard `mem_limit` per container matters: ClickHouse on the same VM can spike under PostHog analytics load, so bounding the obs stack prevents it from competing in a memory pinch.

**Don't reuse PostHog's ClickHouse / Kafka / Redis.** Tempting because they're sitting right there, but coupling honeyDue's observability to PostHog's storage means a PostHog incident takes honeyDue's incident-response telemetry down with it. Keep them fully separate.

**Shared blast radius caveat:** A kernel panic on `88oakappsUpdate` loses both PostHog and honeyDue obs at once. At current scale, fine — call it out, don't fix.

---

## App-side instrumentation

| Surface | Library / approach | Import path |
|---|---|---|
| Echo HTTP middleware | `otelecho` — span per request, tagged route/method/status | `go.opentelemetry.io/contrib/instrumentation/github.com/labstack/echo/otelecho` |
| GORM queries | `uptrace/otelgorm` plugin — `db.Use(otelgorm.NewPlugin())`. Requires threading `ctx` through repositories so `db.WithContext(ctx)` works. | `github.com/uptrace/opentelemetry-go-extra/otelgorm` |
| B2 / minio-go uploads | Manual span around `storage_service.Upload` with attributes for bucket, object size, MIME type | `go.opentelemetry.io/otel` |
| APNs / FCM | Manual span in `internal/push/apns.go` and `fcm.go`; record a hash of the device token (never the raw token) and the response status code | `go.opentelemetry.io/otel` |
| asynq jobs | Custom `asynq.MiddlewareFunc` (~20 lines) — span per task type, attached to ctx, records duration + retry count | `go.opentelemetry.io/otel` + `asynq.MiddlewareFunc` |
| Prometheus `/metrics` endpoint | `prometheus/client_golang` direct — register histograms for HTTP duration / GORM op / B2 op / APNs send | `github.com/prometheus/client_golang/prometheus`, `.../prometheus/promhttp` |
| OTLP exporter | OTLP/HTTP → `https://obs.88oakapps.com/v1/traces` with bearer token. 100% sample in dev, 10% in prod. | `go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp` |
| Metrics push | `vmagent` sidecar in k3s scrapes the api Pod's `/metrics` and remote-writes to `https://obs.88oakapps.com/api/v1/write` with bearer token. Cleaner than exposing `/metrics` publicly. | `victoriametrics/vmagent` image |

**Note on GORM context propagation:** the existing repository methods don't take `ctx context.Context`. Adding `otelgorm` requires plumbing ctx down from the Echo handler through the service layer to the repository call site. ~10 repository files, many call sites. Save for last because the diff is large.

---

## Implementation order (smallest first)

### Step 1 — Metrics + dashboards (highest immediate ROI)

**On `88oakappsUpdate`:**
1. `mkdir -p /opt/honeydue-obs/{data/vm,data/jaeger,data/grafana}` and a `docker-compose.yml` defining the three services with `mem_limit: 256m`, bind mounts for persistence, and an isolated bridge network
2. Add nginx vhosts (DNS A records first):
   - `grafana.88oakapps.com` → `127.0.0.1:3000` (basic auth via htpasswd, Let's Encrypt)
   - `obs.88oakapps.com` → routes by path:
     - `/api/v1/write` → `127.0.0.1:8428` (VictoriaMetrics remote-write, bearer-token check)
     - `/v1/traces`     → `127.0.0.1:4318` (OTLP/HTTP traces, bearer-token check)
3. Generate a 32-byte token, store it in `/etc/honeydue-obs/token` (mode 0600), and enforce it in nginx via `auth_request` or a simple `if ($http_authorization != ...)` check
4. Pre-provision Grafana with the VM datasource pointing at `http://victoriametrics:8428` (in-network)
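
Item 3 can be sketched in Go as follows. This is a minimal illustration, not part of the shipped setup: the helper names `newToken`, `writeToken`, and `bearerHeader` are hypothetical, and a temp directory stands in for `/etc/honeydue-obs` so the sketch runs without root. A shell one-liner would do the same job; Go is used only to match the rest of the codebase.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

// newToken returns n random bytes from the OS CSPRNG, hex-encoded.
func newToken(n int) (string, error) {
	buf := make([]byte, n)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// writeToken creates the token file with owner-only permissions (0600).
func writeToken(path, token string) error {
	return os.WriteFile(path, []byte(token+"\n"), 0o600)
}

// bearerHeader formats the Authorization header value that the app and
// vmagent send, and that the nginx rule compares against.
func bearerHeader(token string) string {
	return "Bearer " + token
}

func main() {
	token, err := newToken(32)
	if err != nil {
		panic(err)
	}
	// Assumption: a temp dir stands in for /etc/honeydue-obs.
	dir, err := os.MkdirTemp("", "honeydue-obs")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "token")
	if err := writeToken(path, token); err != nil {
		panic(err)
	}
	info, err := os.Stat(path)
	if err != nil {
		panic(err)
	}
	fmt.Printf("wrote %d hex chars, mode %v\n", len(token), info.Mode().Perm())
	fmt.Println("header prefix:", bearerHeader(token)[:7])
}
```

Hex encoding keeps the token safe to paste into an nginx config and a k8s Secret without quoting issues; 32 random bytes become 64 hex characters.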

**On the honeyDue k3s cluster:**
5. Add `prometheus/client_golang` to `honeyDueAPI-go/go.mod` and a `/metrics` endpoint to the Go API
6. Register histograms:
   - `http_request_duration_seconds{route,method,status}` via Echo middleware
   - `gorm_query_duration_seconds{table,operation}` via a GORM `Plugin` callback (no ctx needed for this one — operates at the SQL string level)
   - `b2_upload_duration_seconds{bucket,result}`
   - `apns_send_duration_seconds{result}`
7. Deploy a `vmagent` sidecar (or DaemonSet) in the `honeydue` namespace with:
   - Scrape: api Service `/metrics` every 15s
   - `-remoteWrite.url`: `https://obs.88oakapps.com/api/v1/write`
   - `-remoteWrite.bearerToken`: from k8s Secret
8. Build the RED dashboard in Grafana: rate, errors, duration p50/p95/p99 per route
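
The p50/p95/p99 panels in item 8 are estimated from the cumulative bucket counters of the histograms in item 6, not from raw request timings. The sketch below shows the general linear-interpolation idea behind that estimate. It is a simplified stand-in, not the actual query-engine implementation, and the bucket boundaries and counts in `demoBuckets` are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// bucket is one cumulative histogram bucket: count is the number of
// observations with a duration <= upperBound (in seconds).
type bucket struct {
	upperBound float64
	count      float64
}

// quantile estimates the q-quantile (0 < q <= 1) from cumulative buckets
// sorted by upperBound, interpolating linearly inside the bucket that
// contains the target rank.
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) == 0 || q <= 0 || q > 1 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	for i, b := range buckets {
		if b.count < rank {
			continue
		}
		if math.IsInf(b.upperBound, 1) {
			// Cannot interpolate into +Inf: report the highest finite bound.
			if i == 0 {
				return math.NaN()
			}
			return buckets[i-1].upperBound
		}
		lower, below := 0.0, 0.0
		if i > 0 {
			lower, below = buckets[i-1].upperBound, buckets[i-1].count
		}
		return lower + (b.upperBound-lower)*(rank-below)/(b.count-below)
	}
	return math.NaN()
}

// demoBuckets returns hypothetical data: 100 requests, 50 of them under
// 100 ms, 90 under 500 ms, and all of them under 1 s.
func demoBuckets() []bucket {
	return []bucket{
		{0.1, 50},
		{0.5, 90},
		{1, 100},
		{math.Inf(1), 100},
	}
}

func main() {
	b := demoBuckets()
	fmt.Printf("p50 = %.3fs\n", quantile(0.50, b)) // about 0.1
	fmt.Printf("p95 = %.3fs\n", quantile(0.95, b)) // about 0.75
}
```

The practical consequence is that percentile precision is bounded by bucket width: with these buckets, any p95 between 0.5 s and 1 s is a straight-line guess, so bucket boundaries are worth tuning around the latencies that matter.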

**ROI:** "Is the API healthy? Where is time being spent right now?" answered live, served from `grafana.88oakapps.com`.

### Step 2 — Tracing baseline

(Jaeger is already up from Step 1. This step adds the app-side wiring.)

1. Add Grafana datasource for Jaeger pointing at `http://jaeger:16686` (in-network)
2. Wire OTel SDK in `cmd/api/main.go`:
   - `otel.SetTracerProvider(tracerProvider)`
   - `otelecho.Middleware("honeydue-api")` on Echo
   - OTLP/HTTP exporter pointing at `https://obs.88oakapps.com/v1/traces` with `Authorization: Bearer <token>` header (token from env)
   - Sampling: `TraceIDRatioBased(0.1)` in prod, `AlwaysSample()` in dev
3. Verify: a single `POST /api/auth/login/` produces a trace in Jaeger
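
One consequence of the 10% prod sampling in item 2: a ratio sampler decides from the trace ID alone, so every span of a given request is either kept or dropped together, and roughly 90% of prod requests will have no trace at all. The sketch below illustrates that idea with the standard library only. It is a simplified model for intuition, and the `traceID` type, `sampled`, and `sampleRate` are hypothetical names; the real wiring uses the SDK's `TraceIDRatioBased(0.1)`, whose internals may differ in detail.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/rand"
)

// traceID mirrors the 16-byte trace identifier used by OpenTelemetry.
type traceID [16]byte

// sampled makes a deterministic keep/drop decision from the trace ID:
// the low 8 bytes are read as a number and compared against a threshold
// proportional to ratio. The same ID always yields the same decision.
func sampled(id traceID, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	bound := uint64(ratio * (1 << 63))
	return binary.BigEndian.Uint64(id[8:16])>>1 < bound
}

// sampleRate generates n pseudo-random trace IDs (fixed seed, so the run
// is repeatable) and returns the fraction that would be kept.
func sampleRate(n int, ratio float64) float64 {
	rng := rand.New(rand.NewSource(1))
	kept := 0
	for i := 0; i < n; i++ {
		var id traceID
		binary.BigEndian.PutUint64(id[0:8], rng.Uint64())
		binary.BigEndian.PutUint64(id[8:16], rng.Uint64())
		if sampled(id, ratio) {
			kept++
		}
	}
	return float64(kept) / float64(n)
}

func main() {
	fmt.Printf("kept fraction at ratio 0.1: %.3f\n", sampleRate(100000, 0.1))
}
```

When chasing one specific slow request in prod, this means the trace may simply not exist; reproducing it in dev (100% sampling) or temporarily raising the ratio is the fallback.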

**ROI:** "Why is this one request slow?" — answered with a flame graph.

### Step 3 — Manual spans for the work that actually matters

Wrap each in `tracer.Start(ctx, ...)` with attributes:
- `storage_service.Upload` → span "b2.PutObject" with `bucket`, `key`, `size_bytes`, result
- `push/apns.go` → span "apns.send" with `device_token_hash`, `status_code`, `reason`
- `asynq` middleware → span per task type with `task.type`, `retry_count`, `payload_size`
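
The `device_token_hash` attribute above exists so that traces can correlate sends to the same device without storing the raw push token. A minimal sketch of one way to derive it, assuming a truncated SHA-256 digest is acceptable; the function name `tokenHash` and the 12-character length are illustrative choices, not something the plan specifies.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// tokenHash returns a short, stable fingerprint of a device token:
// the first 12 hex characters of its SHA-256 digest. The raw token
// cannot be read back from the span attribute.
func tokenHash(deviceToken string) string {
	sum := sha256.Sum256([]byte(deviceToken))
	return hex.EncodeToString(sum[:])[:12]
}

func main() {
	// Hypothetical token value, for illustration only.
	token := "example-device-token"
	fmt.Println("device_token_hash =", tokenHash(token))
}
```

Twelve hex characters (48 bits) keep span attributes compact while making accidental collisions between devices unlikely at this scale; the same token always maps to the same value, so failed sends to one device group together in Jaeger.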

**ROI:** Specific high-value debugging questions ("why did this upload take 30 seconds", "why did these 5 push notifications fail") answered without code archaeology.

### Step 4 — Repository ctx + `otelgorm` (biggest diff, save for last)

1. Refactor every repository method to accept `ctx context.Context` as first arg
2. Update every call site to pass `c.Request().Context()` from handlers / propagate through services
3. Add `db.Use(otelgorm.NewPlugin())` in `internal/database/database.go`
4. Verify: a request now has nested spans `http → service → query → query → b2.PutObject → apns.send` with full SQL on the query spans
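
Why items 1 and 2 are unavoidable: a tracing library finds the parent span by looking in the `ctx` it is handed, so a repository that never receives the request's context cannot attach its query span to the HTTP trace. The sketch below models that with the standard library only. The `span` type and `start` helper are stand-ins for the OpenTelemetry equivalents, and `findTasks`, `listTasks`, and `listTasksNoCtx` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
)

type spanKey struct{}

// span is a stand-in for a tracing span: a name plus a parent pointer.
type span struct {
	name   string
	parent *span
}

// start creates a child of whatever span ctx carries (if any) and
// returns a new ctx that carries the child.
func start(ctx context.Context, name string) (context.Context, *span) {
	parent, _ := ctx.Value(spanKey{}).(*span)
	s := &span{name: name, parent: parent}
	return context.WithValue(ctx, spanKey{}, s), s
}

// path renders the chain from the root span down to s.
func path(s *span) string {
	if s == nil {
		return ""
	}
	if s.parent == nil {
		return s.name
	}
	return path(s.parent) + " → " + s.name
}

// findTasks models a repository method after the refactor: ctx first.
func findTasks(ctx context.Context) *span {
	_, s := start(ctx, "query")
	return s
}

// listTasks models a service method that passes ctx through.
func listTasks(ctx context.Context) *span {
	ctx, _ = start(ctx, "service")
	return findTasks(ctx)
}

// listTasksNoCtx models the pre-refactor shape: the repository has no
// access to the request context and effectively starts from scratch.
func listTasksNoCtx(ctx context.Context) *span {
	start(ctx, "service") // the service span exists, but the link is lost below
	return findTasks(context.Background())
}

// demo returns the span chain seen by the query in both variants.
func demo() (nested, orphan string) {
	ctx, _ := start(context.Background(), "http")
	return path(listTasks(ctx)), path(listTasksNoCtx(ctx))
}

func main() {
	nested, orphan := demo()
	fmt.Println("with ctx:   ", nested) // http → service → query
	fmt.Println("without ctx:", orphan) // query
}
```

The orphaned `query` in the second case is what an un-threaded repository would produce in Jaeger: a span with no parent, disconnected from the request that caused it.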

**ROI:** Every DB query in every trace, with SQL + table + rows. The "find the N+1" tool you'd otherwise build by hand.

---

## Hard skips (revisit only when explicitly proven needed)

| Tool | Why skip |
|---|---|
| Loki / Promtail | Dozzle covers the immediate need. Loki adds ~512 MB RAM + a daemonset; defer until log search becomes a hot pain point. |
| Mimir / VM cluster mode | Single-node VM handles honeyDue scale for years. |
| Pyroscope continuous profiling | Overkill at 3 small nodes. Use `pprof` endpoints ad-hoc when CPU pressure shows up. |
| OTel Collector | Only worth running when 3+ services emit telemetry. App → Jaeger direct is fine for now. |
| Any SaaS vendor (Datadog, NR, Honeycomb, Grafana Cloud, Sentry Performance) | User constraint: nothing paid. |

---

## When to move off `88oakappsUpdate`

Triggers — any one is enough:
- `88oakappsUpdate` available memory drops below ~3 GB sustained (PostHog growth squeezing it)
- ClickHouse OOM events start showing up in `dmesg` (PostHog under load)
- You want fully separate failure domains for honeyDue vs. 88oakapps

Migration path: the obs stack is a single docker-compose project on a bind-mount, so moving it = `rsync /opt/honeydue-obs/` to a new box, update DNS for `grafana.88oakapps.com` and `obs.88oakapps.com`, `docker compose up -d`. ~30 min of work. Until then: cohabiting on `88oakappsUpdate` is correct.

---

## Quick reference: what shows up where

| Question | Where to look |
|---|---|
| Is the API up right now? Latency? Errors? | Grafana RED dashboard |
| Why is this specific request slow? | Jaeger trace view |
| What did the slow part of that request actually do (which SQL, which B2 PUT)? | Span details inside the trace |
| Background job throughput / queue depth | VictoriaMetrics + asynq metrics |
| What did the app print to stdout 5 minutes ago? | Dozzle |
| What error did the app log? | Dozzle (search) — or Loki if/when added |