# Observability Plan: honeyDue (100% self-hosted)

**Goal:** Live request-timing visibility (HTTP, DB, B2 uploads, APNs, asynq jobs) without paying any SaaS vendor.

**Deployment target:** `88oakappsUpdate` (Linode VPS at `185.143.228.16`, Ubuntu 24.04, 8 vCPU / 32 GB RAM / 193 GB disk). This box already runs the self-hosted PostHog stack and has nginx + Let's Encrypt set up for `*.88oakapps.com`. Free RAM at rest ≈ 15 GB; the obs stack budget is ≈ 700 MB, or ~5% of free RAM. Incremental cost: $0.

**Why not in the honeyDue k3s cluster:** Hosting off-cluster frees ~700 MB across the 3 Hetzner nodes, avoids PVC plumbing, and removes the need to expose anything from k3s: everything is pushed from the app to a public TLS endpoint.

**Status:** Plan only. Nothing is implemented yet.

---

## Stack

| Role | Choice | Why this vs. the obvious alternative |
|---|---|---|
| Metrics store | **VictoriaMetrics** (single-node) | Drop-in Prometheus-compatible. ~4× lower RAM (~200 MB vs ~500 MB) and ~7× better compression. Single binary. |
| Tracing | **Jaeger all-in-one** | ~150 MB RAM with embedded badger storage. Tempo monolithic mode needs 1-2 GB minimum, which is overkill for honeyDue's scale. |
| Dashboards | **Grafana OSS** | Connects to both VM (Prometheus protocol) and Jaeger natively. |
| App instrumentation | **OpenTelemetry SDK** + `prometheus/client_golang` | OTel is vendor-neutral, so backends are swappable without code changes. |
| Logs | **Keep Dozzle**; add Loki only when log search becomes painful | Loki adds ~512 MB RAM + a daemonset for log shipping. Not worth it until there's a concrete pain point. |

### Why not the LGTM stack (Loki + Grafana + Tempo + Mimir)?

- **Tempo** wants 1-2 GB RAM minimum in monolithic mode ([Grafana community report](https://community.grafana.com/t/tempo-ram-usage-for-6k-spans-per-hour/63801)). Stacking that on top of Loki + Mimir would consume ~3-4 GB RAM. On a 3×8 GB cluster that's 12-17% of capacity for observability infra.
- **Mimir** is wonderful for multi-tenant Prometheus at scale, but you have one tenant.
- **Loki** is great if you live in `kubectl logs` and need full-text search across them. You currently use Dozzle and are not feeling that pain.

VictoriaMetrics + Jaeger all-in-one gives you roughly 90% of the value at about a quarter of the resource cost.

---

## Resource budget on `88oakappsUpdate`

Three Docker containers run in a separate compose project under `/opt/honeydue-obs/`, fully isolated from the existing PostHog compose stack so PostHog's lifecycle never touches the obs stack and vice versa.

| Service | `mem_limit` | Disk (bind mount) | Retention |
|---|---|---|---|
| VictoriaMetrics single-node | 256 MB | 10 GB | 30 days metrics |
| Jaeger all-in-one (badger storage) | 256 MB | 10 GB | 7 days traces |
| Grafana OSS | 256 MB | 1 GB | n/a |
| **Total** | **768 MB hard cap** | **21 GB** | |

**~5% of the box's free RAM and ~14% of free disk.** The hard `mem_limit` per container matters: ClickHouse on the same VM can spike under PostHog analytics load, so bounding the obs stack prevents it from competing in a memory pinch.

**Don't reuse PostHog's ClickHouse / Kafka / Redis.** Tempting because they're sitting right there, but coupling honeyDue's observability to PostHog's storage means a PostHog incident takes honeyDue's incident-response telemetry down with it. Keep them fully separate.

**Shared blast radius caveat:** A kernel panic on `88oakappsUpdate` loses both PostHog and honeyDue obs at once. At current scale that is acceptable: call it out, don't fix it.

---

## App-side instrumentation

| Surface | Library / approach | Import path |
|---|---|---|
| Echo HTTP middleware | `otelecho`: span per request, tagged with route/method/status | `go.opentelemetry.io/contrib/instrumentation/github.com/labstack/echo/otelecho` |
| GORM queries | `uptrace/otelgorm` plugin: `db.Use(otelgorm.NewPlugin())`. Requires threading `ctx` through repositories so `db.WithContext(ctx)` works. | `github.com/uptrace/opentelemetry-go-extra/otelgorm` |
| B2 / minio-go uploads | Manual span around `storage_service.Upload` with attributes for bucket, object size, MIME type | `go.opentelemetry.io/otel` |
| APNs / FCM | Manual span in `internal/push/apns.go` and `fcm.go`; record a hashed device token (never the raw token) and the response status code | `go.opentelemetry.io/otel` |
| asynq jobs | Custom `asynq.MiddlewareFunc` (~20 lines): span per task type, attached to ctx, records duration + retry count | `go.opentelemetry.io/otel` + `asynq.MiddlewareFunc` |
| Prometheus `/metrics` endpoint | `prometheus/client_golang` directly: register histograms for HTTP duration / GORM op / B2 op / APNs send | `github.com/prometheus/client_golang/prometheus`, `.../prometheus/promhttp` |
| OTLP exporter | OTLP/HTTP → `https://obs.88oakapps.com/v1/traces` with bearer token. 100% sampling in dev, 10% in prod. | `go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp` |
| Metrics push | `vmagent` sidecar in k3s scrapes the api Pod's `/metrics` and remote-writes to `https://obs.88oakapps.com/api/v1/write` with bearer token. Cleaner than exposing `/metrics` publicly. | `victoriametrics/vmagent` image |

**Note on GORM context propagation:** the existing repository methods don't take `ctx context.Context`. Adding `otelgorm` requires plumbing ctx down from the Echo handler through the service layer to the repository call site. That touches ~10 repository files and many call sites, so save it for last because the diff is large.

---

## Implementation order (smallest first)

### Step 1: Metrics + dashboards (highest immediate ROI)

**On `88oakappsUpdate`:**

1. Run `mkdir -p /opt/honeydue-obs/{data/vm,data/jaeger,data/grafana}` and add a `docker-compose.yml` defining the three services with `mem_limit: 256m`, bind mounts for persistence, and an isolated bridge network
2. Add nginx vhosts (DNS A records first):
   - `grafana.88oakapps.com` → `127.0.0.1:3000` (basic auth via htpasswd, Let's Encrypt)
   - `obs.88oakapps.com` → routes by path:
     - `/api/v1/write` → `127.0.0.1:8428` (VictoriaMetrics remote-write, bearer-token check)
     - `/v1/traces` → `127.0.0.1:4318` (OTLP/HTTP traces, bearer-token check)
3. Generate a 32-byte token, store it in `/etc/honeydue-obs/token` (mode 0600), and reference it from nginx via `auth_request` or a simple `if ($http_authorization != ...)`
4. Pre-provision Grafana with the VM datasource pointing at `http://victoriametrics:8428` (in-network)

**On the honeyDue k3s cluster:**

5. Add `prometheus/client_golang` to `honeyDueAPI-go/go.mod` and a `/metrics` endpoint to the Go API
6. Register histograms:
   - `http_request_duration_seconds{route,method,status}` via Echo middleware
   - `gorm_query_duration_seconds{table,operation}` via a GORM `Plugin` callback (no ctx needed for this one, since it operates at the SQL string level)
   - `b2_upload_duration_seconds{bucket,result}`
   - `apns_send_duration_seconds{result}`
7. Deploy a `vmagent` sidecar (or DaemonSet) in the `honeydue` namespace with:
   - Scrape: api Service `/metrics` every 15s
   - `-remoteWrite.url`: `https://obs.88oakapps.com/api/v1/write`
   - `-remoteWrite.bearerToken`: from k8s Secret
8. Build the RED dashboard in Grafana: rate, errors, duration p50/p95/p99 per route

**ROI:** "Is the API healthy? Where is time being spent right now?" answered live, served from `grafana.88oakapps.com`.

### Step 2: Tracing baseline

Jaeger is already up from Step 1. This step adds the app-side wiring.

1. Add a Grafana datasource for Jaeger pointing at `http://jaeger:16686` (in-network)
2. Wire the OTel SDK in `cmd/api/main.go`:
   - `otel.SetTracerProvider(tracerProvider)`
   - `otelecho.Middleware("honeydue-api")` on Echo
   - OTLP/HTTP exporter pointing at `https://obs.88oakapps.com/v1/traces` with an `Authorization: Bearer <token>` header (token from env)
   - Sampling: `TraceIDRatioBased(0.1)` in prod, `AlwaysSample()` in dev
3. Verify: a single `POST /api/auth/login/` produces a trace in Jaeger

**ROI:** "Why is this one request slow?" answered with a flame graph.

### Step 3: Manual spans for the work that actually matters

Wrap each in `tracer.Start(ctx, ...)` with attributes:

- `storage_service.Upload` → span "b2.PutObject" with `bucket`, `key`, `size_bytes`, result
- `push/apns.go` → span "apns.send" with `device_token_hash`, `status_code`, `reason`
- `asynq` middleware → span per task type with `task.type`, `retry_count`, `payload_size`

**ROI:** Specific high-value debugging questions ("why did this upload take 30 seconds", "why did these 5 push notifications fail") answered without code archaeology.

### Step 4: Repository ctx + `otelgorm` (biggest diff, save for last)

1. Refactor every repository method to accept `ctx context.Context` as its first argument
2. Update every call site to pass `c.Request().Context()` from handlers and propagate it through services
3. Add `db.Use(otelgorm.NewPlugin())` in `internal/database/database.go`
4. Verify: a request now has nested spans `http → service → query → query → b2.PutObject → apns.send` with full SQL on the query spans

**ROI:** Every DB query in every trace, with SQL + table + rows. The "find the N+1" tool you'd otherwise build by hand.

---

## Hard skips (revisit only when explicitly proven needed)

| Tool | Why skip |
|---|---|
| Loki / Promtail | Dozzle covers the immediate need. Loki adds ~512 MB RAM + a daemonset; defer until log search becomes a hot pain point. |
| Mimir / VM cluster mode | Single-node VM handles honeyDue scale for years. |
| Pyroscope continuous profiling | Overkill at 3 small nodes. Use `pprof` endpoints ad hoc when CPU pressure shows up. |
| OTel Collector | Only worth running when 3+ services emit telemetry. App → Jaeger direct is fine for now. |
| Any SaaS vendor (Datadog, NR, Honeycomb, Grafana Cloud, Sentry Performance) | User constraint: nothing paid. |

---

## When to move off `88oakappsUpdate`

Triggers (any one is enough):

- `88oakappsUpdate` available memory drops below ~3 GB sustained (PostHog growth squeezing it)
- ClickHouse OOM events start showing up in `dmesg` (PostHog under load)
- You want fully separate failure domains for honeyDue vs. 88oakapps

Migration path: the obs stack is a single docker-compose project on a bind mount, so moving it means `rsync /opt/honeydue-obs/` to a new box, update DNS for `grafana.88oakapps.com` and `obs.88oakapps.com`, then `docker compose up -d`. About 30 minutes of work.

Until then: cohabiting on `88oakappsUpdate` is correct.

---

## Quick reference: what shows up where

| Question | Where to look |
|---|---|
| Is the API up right now? Latency? Errors? | Grafana RED dashboard |
| Why is this specific request slow? | Jaeger trace view |
| What did the slow part of that request actually do (which SQL, which B2 PUT)? | Span details inside the trace |
| Background job throughput / queue depth | VictoriaMetrics + asynq metrics |
| What did the app print to stdout 5 minutes ago? | Dozzle |
| What error did the app log? | Dozzle (search), or Loki if/when added |
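
### Appendix: sketch of the request-duration histogram

As a rough illustration of the `http_request_duration_seconds{route,method,status}` histogram from Step 1, the sketch below times each request in a middleware and records it into cumulative buckets. It uses only the standard library so it runs on its own: `net/http` stands in for Echo, and the hand-rolled `registry` type stands in for a `prometheus/client_golang` histogram vector. All names here (`registry`, `series`, `labels`, `timed`) and the bucket bounds are hypothetical and are not part of the honeyDue codebase or of any library API.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// labels mirrors the label set named in Step 1: {route, method, status}.
type labels struct {
	route, method string
	status        int
}

// series holds cumulative bucket counts for one label combination.
type series struct {
	buckets []uint64 // buckets[i] counts observations <= bounds[i]
	count   uint64   // all observations (the implicit +Inf bucket)
	sum     float64  // total observed seconds
}

// registry is a hypothetical stand-in for a histogram vector.
// It is NOT the Prometheus client API.
type registry struct {
	mu     sync.Mutex
	bounds []float64 // ascending upper bounds, in seconds
	data   map[labels]*series
}

func newRegistry(bounds []float64) *registry {
	return &registry{bounds: bounds, data: make(map[labels]*series)}
}

// observe adds one duration to every bucket whose upper bound it fits under.
func (r *registry) observe(l labels, seconds float64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	s, ok := r.data[l]
	if !ok {
		s = &series{buckets: make([]uint64, len(r.bounds))}
		r.data[l] = s
	}
	for i, b := range r.bounds {
		if seconds <= b {
			s.buckets[i]++
		}
	}
	s.count++
	s.sum += seconds
}

// statusRecorder remembers the status code the wrapped handler wrote.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (w *statusRecorder) WriteHeader(code int) {
	w.status = code
	w.ResponseWriter.WriteHeader(code)
}

// timed records one observation per request. The route is passed in as a
// fixed template (not the raw URL) so label cardinality stays bounded.
func timed(reg *registry, route string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, req)
		l := labels{route: route, method: req.Method, status: rec.status}
		reg.observe(l, time.Since(start).Seconds())
	})
}

func main() {
	reg := newRegistry([]float64{0.005, 0.05, 0.5, 5}) // assumed bounds
	h := timed(reg, "/api/auth/login/", http.HandlerFunc(
		func(w http.ResponseWriter, _ *http.Request) {
			time.Sleep(10 * time.Millisecond) // pretend to do work
			w.WriteHeader(http.StatusUnauthorized)
		}))
	req := httptest.NewRequest(http.MethodPost, "/api/auth/login/", nil)
	h.ServeHTTP(httptest.NewRecorder(), req)

	for l, s := range reg.data {
		fmt.Printf("%s %s %d: count=%d buckets=%v\n",
			l.method, l.route, l.status, s.count, s.buckets)
	}
}
```

In the real service the same shape would presumably be an Echo middleware observing into a `client_golang` histogram, with Grafana deriving p50/p95/p99 from the bucket counts. Passing the route template rather than the request path is the design choice worth keeping, since per-URL labels would inflate the number of series stored in VictoriaMetrics.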