c9ac273dbd
Shipping commit 88fb175 changed the trace shape and added a new caching
layer with required invalidation rules. Updating the operator-facing
docs so they match the running system.
ch08 (database):
- DB_HOST is the -pooler Neon endpoint, not direct compute
- Connection pool: MaxIdleConns 20 (was 10), MaxLifetime 30m (was 10m),
MaxIdleTime 0 (never close idle)
- New "Pool warm-up at boot" section documenting the 20-parallel-ping
warm-up in database.Connect
- Replaced the "Neon regions" section: explicit RTT numbers, the
optimization stack that minimizes round-trips, when this still matters
ch15 (observability):
- Replaced the 2,473ms/5-span sample trace with the new 229ms/2-span
post-optimization trace; kept the old one underneath for diff context
ch16 (failure modes):
- Added: stale residence-IDs cache (data freshness bug + recovery)
- Added: Redis at maxmemory limit (verify allkeys-lru policy)
- Added: Neon pooler unreachable but direct endpoint up — emergency
switchover procedure
ch17 (runbook):
- §23 Invalidate residence-IDs cache for a user (DEL key + grep for
missing invalidation in new code)
- §24 Verify DB pool warm-up is working (log pattern + impact test)
- §25 Switch DB host between pooler and direct endpoints
observability-plan.md status flipped from "plan only" to shipped
with the latency-cut summary.
README links to the new ch08 latency section.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
167 lines
11 KiB
Markdown
# Observability Plan — honeyDue (100% self-hosted)

**Goal:** Live request-timing visibility (HTTP, DB, B2 uploads, APNs, asynq jobs) without paying any SaaS vendor.

**Deployment target:** `88oakappsUpdate` (Linode VPS at `185.143.228.16`, Ubuntu 24.04, 8 vCPU / 32 GB RAM / 193 GB disk). This box already runs the self-hosted PostHog stack and has nginx + Let's Encrypt set up for `*.88oakapps.com`. Free RAM at rest ≈ 15 GB; the obs stack budget is ≈ 700 MB → ~5% of free RAM. Costs $0 incremental.

**Why not in the honeyDue k3s cluster:** Frees ~700 MB across the 3 Hetzner nodes, no PVC plumbing, and no need to expose anything from k3s — everything is push-from-app to a public TLS endpoint.

**Status:** Fully shipped. VictoriaMetrics + Jaeger + Grafana on `obs.88oakapps.com`, vmagent in-cluster, OTel SDK and otelgorm wired into the api+worker, every authed endpoint produces nested HTTP→service→SQL flame graphs in Jaeger.

The first round of traces revealed every visible ms was network/proxy overhead — DB execution itself is sub-millisecond. The follow-up work (`internal/services/residence_id_cache.go`, GORM pool warm-up, auth-query JOIN consolidation, switching `DB_HOST` to Neon's `-pooler` endpoint, bumped cache TTLs) cut warm-cache `/api/tasks/` from 2,473 ms / 5 spans to **229 ms / 2 spans** — see commit `88fb175` and Chapter 8 §"Optimizations layered on top".

---

## Stack

| Role | Choice | Why this vs. the obvious alternative |
|---|---|---|
| Metrics store | **VictoriaMetrics** (single-node) | Drop-in Prometheus-compatible. ~2.5× lower RAM (~200 MB vs ~500 MB) and ~7× better compression. Single binary. |
| Tracing | **Jaeger all-in-one** | ~150 MB RAM with embedded badger storage. Tempo monolithic mode needs 1-2 GB minimum — overkill for honeyDue's scale. |
| Dashboards | **Grafana OSS** | Connects to both VM (Prometheus protocol) and Jaeger natively. |
| App instrumentation | **OpenTelemetry SDK** + `prometheus/client_golang` | OTel is vendor-neutral — backends are swappable without code change. |
| Logs | **Keep Dozzle**; add Loki only when log search becomes painful | Loki adds ~512 MB RAM + a daemonset for log shipping. Not worth it until there's a concrete pain point. |

### Why not the LGTM stack (Loki + Grafana + Tempo + Mimir)?

- **Tempo** wants 1-2 GB RAM minimum in monolithic mode ([Grafana community report](https://community.grafana.com/t/tempo-ram-usage-for-6k-spans-per-hour/63801)). Stacking that on top of Loki + Mimir would consume ~3-4 GB RAM. On a 3×8 GB cluster that's 12-17% of capacity for observability infra.
- **Mimir** is wonderful for multi-tenant Prometheus at scale — you have one tenant.
- **Loki** is great if you live in `kubectl logs` and need full-text search across them. You currently use Dozzle and are not feeling that pain.

VictoriaMetrics + Jaeger all-in-one gives you 90% of the value at 25% of the resource cost.

---

## Resource budget on `88oakappsUpdate`

Three Docker containers in a separate compose project under `/opt/honeydue-obs/` — fully isolated from the existing PostHog compose stack so PostHog's lifecycle never touches the obs stack and vice versa.

| Service | `mem_limit` | Disk (bind mount) | Retention |
|---|---|---|---|
| VictoriaMetrics single-node | 256 MB | 10 GB | 30 days metrics |
| Jaeger all-in-one (badger storage) | 256 MB | 10 GB | 7 days traces |
| Grafana OSS | 256 MB | 1 GB | — |
| **Total** | **~768 MB hard cap** | **21 GB** | |

**~5% of the box's free RAM and ~14% of free disk.** The hard `mem_limit` per container matters: ClickHouse on the same VM can spike under PostHog analytics load, so bounding the obs stack prevents it from competing in a memory pinch.

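
The RAM figure can be re-derived from the table with a few lines of arithmetic. This is only a sketch: the names are illustrative rather than taken from the repository, and it assumes the ≈ 15 GB of free RAM quoted earlier means 15 × 1024 MB.

```go
package main

import "fmt"

// Per-container hard caps from the table above, in MB.
var memLimitsMB = map[string]int{
	"victoriametrics": 256,
	"jaeger":          256,
	"grafana":         256,
}

// freeRAMMB is the approximate free RAM at rest on the host
// (assumption: 15 GB expressed as 15 * 1024 MB).
const freeRAMMB = 15 * 1024

// totalMB sums the per-container memory limits.
func totalMB(limits map[string]int) int {
	total := 0
	for _, mb := range limits {
		total += mb
	}
	return total
}

// shareOfFreeRAM returns the budget as a percentage of free RAM.
func shareOfFreeRAM(budgetMB, freeMB int) float64 {
	return 100 * float64(budgetMB) / float64(freeMB)
}

func main() {
	total := totalMB(memLimitsMB)
	// prints "hard cap: 768 MB, 5.0% of free RAM"
	fmt.Printf("hard cap: %d MB, %.1f%% of free RAM\n", total, shareOfFreeRAM(total, freeRAMMB))
}
```
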
**Don't reuse PostHog's ClickHouse / Kafka / Redis.** Tempting because they're sitting right there, but coupling honeyDue's observability to PostHog's storage means a PostHog incident takes honeyDue's incident-response telemetry down with it. Keep them fully separate.

**Shared blast radius caveat:** A kernel panic on `88oakappsUpdate` loses both PostHog and honeyDue obs at once. At current scale, fine — call it out, don't fix.

---

## App-side instrumentation

| Surface | Library / approach | Import path |
|---|---|---|
| Echo HTTP middleware | `otelecho` — span per request, tagged route/method/status | `go.opentelemetry.io/contrib/instrumentation/github.com/labstack/echo/otelecho` |
| GORM queries | `uptrace/otelgorm` plugin — `db.Use(otelgorm.NewPlugin())`. Requires threading `ctx` through repositories so `db.WithContext(ctx)` works. | `github.com/uptrace/opentelemetry-go-extra/otelgorm` |
| B2 / minio-go uploads | Manual span around `storage_service.Upload` with attributes for bucket, object size, MIME type | `go.opentelemetry.io/otel` |
| APNs / FCM | Manual span in `internal/push/apns.go` and `fcm.go`; record a hashed device token and the response status code | `go.opentelemetry.io/otel` |
| asynq jobs | Custom `asynq.MiddlewareFunc` (~20 lines) — span per task type, attached to ctx, records duration + retry count | `go.opentelemetry.io/otel` + `asynq.MiddlewareFunc` |
| Prometheus `/metrics` endpoint | `prometheus/client_golang` direct — register histograms for HTTP duration / GORM op / B2 op / APNs send | `github.com/prometheus/client_golang/prometheus`, `.../prometheus/promhttp` |
| OTLP exporter | OTLP/HTTP → `https://obs.88oakapps.com/v1/traces` with bearer token. 100% sample in dev, 10% in prod. | `go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp` |
| Metrics push | `vmagent` sidecar in k3s scrapes the api Pod's `/metrics` and remote-writes to `https://obs.88oakapps.com/api/v1/write` with bearer token. Cleaner than exposing `/metrics` publicly. | `victoriametrics/vmagent` image |

**Note on GORM context propagation:** the existing repository methods don't take `ctx context.Context`. Adding `otelgorm` requires plumbing ctx down from the Echo handler through the service layer to the repository call site. ~10 repository files, many call sites. Save for last because the diff is large.

---

## Implementation order (smallest first)

### Step 1 — Metrics + dashboards (highest immediate ROI)

**On `88oakappsUpdate`:**

1. `mkdir -p /opt/honeydue-obs/{data/vm,data/jaeger,data/grafana}` and a `docker-compose.yml` defining the three services with `mem_limit: 256m`, bind mounts for persistence, and an isolated bridge network
2. Add nginx vhosts (DNS A records first):
   - `grafana.88oakapps.com` → `127.0.0.1:3000` (basic auth via htpasswd, Let's Encrypt)
   - `obs.88oakapps.com` → routes by path:
     - `/api/v1/write` → `127.0.0.1:8428` (VictoriaMetrics remote-write, bearer-token check)
     - `/v1/traces` → `127.0.0.1:4318` (OTLP/HTTP traces, bearer-token check)
3. Generate a 32-byte token, store in `/etc/honeydue-obs/token` (mode 0600), reference from nginx as `auth_request` or simple `if ($http_authorization != ...)`
4. Pre-provision Grafana with the VM datasource pointing at `http://victoriametrics:8428` (in-network)

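
Any CSPRNG works for the 32-byte token in item 3. As one possible sketch in Go (the function names `newToken` and `validBearer` are hypothetical, and in this plan the check itself lives in nginx, not in Go), the token could be generated and an `Authorization` header compared in constant time like this:

```go
package main

import (
	"crypto/rand"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"strings"
)

// newToken returns 32 random bytes from the OS CSPRNG, hex-encoded (64 chars).
func newToken() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// validBearer reports whether an Authorization header value carries the
// expected token. The comparison is constant-time so the check does not
// leak how many leading characters matched.
func validBearer(header, want string) bool {
	const prefix = "Bearer "
	if !strings.HasPrefix(header, prefix) {
		return false
	}
	got := strings.TrimPrefix(header, prefix)
	return subtle.ConstantTimeCompare([]byte(got), []byte(want)) == 1
}

func main() {
	token, err := newToken()
	if err != nil {
		panic(err)
	}
	// prints "64 true"
	fmt.Println(len(token), validBearer("Bearer "+token, token))
}
```
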
**On the honeyDue k3s cluster:**

5. Add `prometheus/client_golang` to `honeyDueAPI-go/go.mod` and a `/metrics` endpoint to the Go API
6. Register histograms:
   - `http_request_duration_seconds{route,method,status}` via Echo middleware
   - `gorm_query_duration_seconds{table,operation}` via a GORM `Plugin` callback (no ctx needed for this one — operates at the SQL string level)
   - `b2_upload_duration_seconds{bucket,result}`
   - `apns_send_duration_seconds{result}`
7. Deploy a `vmagent` sidecar (or DaemonSet) in the `honeydue` namespace with:
   - Scrape: api Service `/metrics` every 15s
   - `remote_write.url`: `https://obs.88oakapps.com/api/v1/write`
   - `remote_write.bearer_token`: from k8s Secret
8. Build the RED dashboard in Grafana: rate, errors, duration p50/p95/p99 per route

**ROI:** "Is the API healthy? Where is time being spent right now?" answered live, served from `grafana.88oakapps.com`.

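
The duration histograms in item 6 and the p50/p95/p99 panels in item 8 rest on one idea: count observations into fixed buckets, then estimate a quantile by interpolating inside the bucket that holds the target rank. The sketch below is a standard-library stand-in for that idea, not `prometheus/client_golang` and not PromQL's `histogram_quantile`. The bucket bounds and sample values are made up, labels are omitted, the type is not safe for concurrent use, and at least one bound is assumed.

```go
package main

import (
	"fmt"
	"sort"
)

// histogram is a minimal fixed-bucket histogram. counts[i] holds observations
// in (bounds[i-1], bounds[i]]; the final slot is the overflow (+Inf) bucket.
type histogram struct {
	bounds []float64 // upper bounds in seconds, ascending, non-empty
	counts []uint64  // len(bounds)+1
	total  uint64
}

func newHistogram(bounds ...float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// observe records one duration, given in seconds.
func (h *histogram) observe(seconds float64) {
	i := sort.SearchFloat64s(h.bounds, seconds) // first bound >= value
	h.counts[i]++
	h.total++
}

// quantile estimates the q-quantile (0 < q <= 1) by linear interpolation
// inside the bucket that contains the target rank.
func (h *histogram) quantile(q float64) float64 {
	if h.total == 0 {
		return 0
	}
	top := h.bounds[len(h.bounds)-1]
	rank := q * float64(h.total)
	cum := 0.0
	for i, c := range h.counts {
		prev := cum
		cum += float64(c)
		if c == 0 || cum < rank {
			continue
		}
		if i == len(h.bounds) {
			return top // overflow bucket: report the highest finite bound
		}
		lower := 0.0
		if i > 0 {
			lower = h.bounds[i-1]
		}
		return lower + (h.bounds[i]-lower)*(rank-prev)/float64(c)
	}
	return top
}

func main() {
	h := newHistogram(0.1, 0.25, 0.5, 1)
	for i := 0; i < 90; i++ {
		h.observe(0.05) // fast requests
	}
	for i := 0; i < 10; i++ {
		h.observe(0.4) // slow requests
	}
	for _, q := range []float64{0.5, 0.95, 0.99} {
		fmt.Printf("p%.0f is about %.3fs\n", q*100, h.quantile(q))
	}
}
```

With this sample data the p95 estimate falls inside the 0.25 to 0.5 s bucket even though every slow request took 0.4 s, which shows why bucket boundaries should bracket the latencies you care about.
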
### Step 2 — Tracing baseline

(Jaeger is already up from Step 1. This step adds the app-side wiring.)

1. Add Grafana datasource for Jaeger pointing at `http://jaeger:16686` (in-network)
2. Wire OTel SDK in `cmd/api/main.go`:
   - `otel.SetTracerProvider(tracerProvider)`
   - `otelecho.Middleware("honeydue-api")` on Echo
   - OTLP/HTTP exporter pointing at `https://obs.88oakapps.com/v1/traces` with `Authorization: Bearer <token>` header (token from env)
   - Sampling: `TraceIDRatioBased(0.1)` in prod, `AlwaysSample()` in dev
3. Verify: a single `POST /api/auth/login/` produces a trace in Jaeger

**ROI:** "Why is this one request slow?" — answered with a flame graph.

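
A ratio sampler such as `TraceIDRatioBased(0.1)` decides from the trace ID itself rather than rolling dice per span, so every span of a given trace gets the same verdict. The standard-library sketch below illustrates that idea only. It is not the SDK's code, and which bytes are read and how the threshold is derived are assumptions made for the example.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/rand"
)

// traceID mirrors the 16-byte trace identifier used by OpenTelemetry.
type traceID [16]byte

// shouldSample keeps a trace when the number formed by its last 8 bytes
// (shifted down to 63 bits) falls below fraction * 2^63. The same ID always
// yields the same answer.
func shouldSample(id traceID, fraction float64) bool {
	if fraction >= 1 {
		return true
	}
	if fraction <= 0 {
		return false
	}
	bound := uint64(fraction * (1 << 63))
	x := binary.BigEndian.Uint64(id[8:16]) >> 1 // x < 2^63
	return x < bound
}

func main() {
	r := rand.New(rand.NewSource(1)) // fixed seed so runs are repeatable
	const n = 100000
	kept := 0
	for i := 0; i < n; i++ {
		var id traceID
		binary.BigEndian.PutUint64(id[0:8], r.Uint64())
		binary.BigEndian.PutUint64(id[8:16], r.Uint64())
		if shouldSample(id, 0.1) {
			kept++
		}
	}
	fmt.Printf("kept %.1f%% of %d traces\n", 100*float64(kept)/n, n)
}
```

Over many uniformly random IDs the kept share should sit close to the configured fraction, which is what makes a 10% production setting predictable in storage terms.
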
### Step 3 — Manual spans for the work that actually matters

Wrap each in `tracer.Start(ctx, ...)` with attributes:

- `storage_service.Upload` → span "b2.PutObject" with `bucket`, `key`, `size_bytes`, result
- `push/apns.go` → span "apns.send" with `device_token_hash`, `status_code`, `reason`
- `asynq` middleware → span per task type with `task.type`, `retry_count`, `payload_size`

**ROI:** Specific high-value debugging questions ("why did this upload take 30 seconds", "why did these 5 push notifications fail") answered without code archaeology.

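
The `device_token_hash` attribute implies the raw push token never reaches the trace backend. One way to derive such a value, shown here as a sketch (the function name `tokenHash`, the use of SHA-256, and the 8-byte truncation are assumptions, not the repository's implementation), is a short, stable fingerprint:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// tokenHash returns a short, stable fingerprint of a device token:
// the first 8 bytes of its SHA-256 digest, hex-encoded (16 chars).
// The same token always maps to the same value, so failed sends for one
// device can be grouped without storing the token itself.
func tokenHash(deviceToken string) string {
	sum := sha256.Sum256([]byte(deviceToken))
	return hex.EncodeToString(sum[:8])
}

func main() {
	h := tokenHash("example-device-token") // placeholder, not a real token
	// prints "16 true"
	fmt.Println(len(h), h == tokenHash("example-device-token"))
}
```
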
### Step 4 — Repository ctx + `otelgorm` (biggest diff, save for last)

1. Refactor every repository method to accept `ctx context.Context` as first arg
2. Update every call site to pass `c.Request().Context()` from handlers / propagate through services
3. Add `db.Use(otelgorm.NewPlugin())` in `internal/database/database.go`
4. Verify: a request now has nested spans `http → service → query → query → b2.PutObject → apns.send` with full SQL on the query spans

**ROI:** Every DB query in every trace, with SQL + table + rows. The "find the N+1" tool you'd otherwise build by hand.

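
The shape of the refactor in items 1 and 2 can be sketched with the standard library alone. `TaskRepository`, `TaskService`, and `spanKey` below are hypothetical names, and a context value stands in for the active span. In the real code the repository would call `db.WithContext(ctx)` so that `otelgorm` can attach its query span to the request's trace.

```go
package main

import (
	"context"
	"fmt"
)

// spanKey is a private context key; its value stands in for the active span.
type spanKey struct{}

// TaskRepository is a hypothetical repository with ctx as the first argument.
type TaskRepository struct{}

// ListByUser refuses to run for a cancelled request and can see whatever
// the handler attached to ctx.
func (TaskRepository) ListByUser(ctx context.Context, userID int) ([]string, error) {
	if err := ctx.Err(); err != nil {
		return nil, err // request already cancelled or timed out
	}
	parent, _ := ctx.Value(spanKey{}).(string)
	return []string{fmt.Sprintf("query for user %d under %q", userID, parent)}, nil
}

// TaskService forwards the ctx it receives unchanged.
type TaskService struct{ repo TaskRepository }

func (s TaskService) Tasks(ctx context.Context, userID int) ([]string, error) {
	return s.repo.ListByUser(ctx, userID)
}

// demo runs one live and one cancelled request through both layers.
func demo() (rows []string, cancelledErr error) {
	svc := TaskService{}
	// In an Echo handler this ctx would come from c.Request().Context().
	ctx := context.WithValue(context.Background(), spanKey{}, "GET /api/tasks/")
	rows, _ = svc.Tasks(ctx, 42)

	cancelled, cancel := context.WithCancel(ctx)
	cancel()
	_, cancelledErr = svc.Tasks(cancelled, 42)
	return rows, cancelledErr
}

func main() {
	rows, err := demo()
	fmt.Println(rows)
	fmt.Println("cancelled request error:", err)
}
```
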
---

## Hard skips (revisit only when explicitly proven needed)

| Tool | Why skip |
|---|---|
| Loki / Promtail | Dozzle covers the immediate need. Loki adds ~512 MB RAM + a daemonset; defer until log search becomes a hot pain point. |
| Mimir / VM cluster mode | Single-node VM handles honeyDue scale for years. |
| Pyroscope continuous profiling | Overkill at 3 small nodes. Use `pprof` endpoints ad-hoc when CPU pressure shows up. |
| OTel Collector | Only worth running when 3+ services emit telemetry. App → Jaeger direct is fine for now. |
| Any SaaS vendor (Datadog, NR, Honeycomb, Grafana Cloud, Sentry Performance) | User constraint: nothing paid. |

---

## When to move off `88oakappsUpdate`

Triggers — any one is enough:

- `88oakappsUpdate` available memory drops below ~3 GB sustained (PostHog growth squeezing it)
- ClickHouse OOM events start showing up in `dmesg` (PostHog under load)
- You want fully separate failure domains for honeyDue vs. 88oakapps

Migration path: the obs stack is a single docker-compose project on a bind-mount, so moving it = `rsync /opt/honeydue-obs/` to a new box, update DNS for `grafana.88oakapps.com` and `obs.88oakapps.com`, `docker compose up -d`. ~30 min of work. Until then: cohabiting on `88oakappsUpdate` is correct.

---

## Quick reference: what shows up where

| Question | Where to look |
|---|---|
| Is the API up right now? Latency? Errors? | Grafana RED dashboard |
| Why is this specific request slow? | Jaeger trace view |
| What did the slow part of that request actually do (which SQL, which B2 PUT)? | Span details inside the trace |
| Background job throughput / queue depth | VictoriaMetrics + asynq metrics |
| What did the app print to stdout 5 minutes ago? | Dozzle |
| What error did the app log? | Dozzle (search) — or Loki if/when added |