Files
honeyDueAPI/docs/deployment/15-observability.md
T
Trey t 77cfcc0b27
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
docs: rewrite ch15 observability + cross-refs for the live obs stack
ch15 is now an account of what's actually running, not a roadmap for
what we'd add: VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate
fronted by Cloudflare and bearer-gated nginx, vmagent in-cluster, the
internal/prom histogram set, the rollout's NetworkPolicy footprint,
the obs.88oakapps.com endpoint shape, the ~$0/700MB resource budget,
and a token-rotation runbook. The "what we still don't have" section
keeps log aggregation, alerting, and full distributed tracing as the
honest gap list.

Other touched docs:
- 00-overview: \"deliberately absent\" no longer claims we have no
  metrics — calls out the cross-cluster shape instead.
- 14-deployment-process: TL;DR now points at deploy-k3s/scripts/03-deploy.sh
  (full build + push + apply + obs vmagent), with the manual
  kubectl-set-image flow kept as the single-service path. Notes the
  IfNotPresent gotcha that bit us during the rollout.
- 16-failure-modes: adds vmagent-can't-reach-obs and Grafana-no-data.
- 18-cost: $0 line item for the obs stack on 88oakappsUpdate, with the
  CX32 migration trigger.
- 17/18 README + appendix b: link the new ch15, add the obs cheat
  sheet block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 15:05:06 -05:00

16 KiB
Raw Blame History

15 — Observability

Summary

Production has live metrics and tracing infrastructure as of 2026-04-25. A self-hosted VictoriaMetrics + Jaeger + Grafana stack runs on 88oakappsUpdate (Linode VPS, also home to the self-hosted PostHog deployment). A vmagent sidecar in the honeyDue k3s namespace scrapes the api Pods' /metrics endpoint every 15 seconds and remote-writes to https://obs.88oakapps.com/api/v1/write. Grafana is at https://grafana.88oakapps.com with a pre-provisioned RED dashboard.

What we still don't have: log aggregation (Dozzle and kubectl logs fill the niche for now), alerting (no PagerDuty/Slack on errors), and full distributed tracing (OTel SDK is wired in app code but app-side instrumentation beyond HTTP routes hasn't shipped yet).

The whole observability stack costs $0 incremental and uses ~700 MB RAM on 88oakappsUpdate (5% of its free RAM). It runs as a separate docker-compose project from PostHog so neither product's lifecycle touches the other.

What we have

1. Metrics — VictoriaMetrics + vmagent

honeyDue k3s (Hetzner)                   88oakappsUpdate (Linode)
┌───────────────────────────┐            ┌──────────────────────────┐
│ api Pods (3) :8000/metrics│            │ /opt/honeydue-obs/       │
│   prometheus/client_golang│            │ ┌──────────────────┐     │
│                           │            │ │ VictoriaMetrics  │     │
│ vmagent ──── scrape 15s   │            │ │  30d retention   │     │
│         remote_write ─────┼────────────┼─→ /api/v1/write   │     │
│         (HTTPS, bearer)   │            │ │  (mem 256 MB)    │     │
└───────────────────────────┘            │ └──────────────────┘     │
                                          └──────────────────────────┘

The Go API exposes /metrics in Prometheus exposition format. Histograms are defined in internal/prom/metrics.go and registered globally:

Metric Labels Source
http_request_duration_seconds route, method, status Echo middleware around every handler
gorm_query_duration_seconds table, operation GORM before/after callbacks (no ctx threading needed)
b2_upload_duration_seconds bucket, result Wrapped s.backend.Write in internal/services/storage_service.go
b2_upload_bytes_total bucket, result Counter alongside the duration histogram
apns_send_duration_seconds result (ok/bad_token/error) Wrapped APNs PushWithContext in internal/push/apns.go
fcm_send_duration_seconds result Wrapped FCM HTTP v1 send in internal/push/fcm.go
asynq_job_duration_seconds task_type, result Histograms registered; middleware not yet attached (Step 3)
go_*, process_* (standard) prometheus/client_golang/prometheus/collectors defaults

The previous custom monitoring at /metrics was renamed to /metrics/legacy so the canonical /metrics emits proper histograms suitable for histogram_quantile() rollups. The legacy endpoint stays because the GoAdmin dashboard reads it.

vmagent in k3s

Lives at deploy-k3s/manifests/observability/vmagent.yaml. One replica, mem_limit: 256Mi, scrapes by Kubernetes pod-discovery filtered to app.kubernetes.io/name=api and remote-writes to https://obs.88oakapps.com/api/v1/write with a bearer token from OBS_INGEST_TOKEN in deploy/prod.env (substituted into a Secret at deploy time).

The agent buffers locally to /tmp/vmagent (emptyDir, 512 MB cap), so brief obs outages don't drop samples. Persistent queue replays on reconnect.

NetworkPolicies in the honeydue namespace allow egress from vmagent to:

  • DNS (kube-dns / coredns)
  • Kubernetes API (10.43.0.0/16:443) for pod discovery
  • api Pods on 10.42.0.0/16:8000
  • The public obs endpoint over 0.0.0.0/0:443

These are scoped tight — vmagent can't reach Postgres, Redis, B2, or any other external service.

2. Tracing — Jaeger all-in-one

Jaeger 1.62 with badger storage runs alongside VictoriaMetrics. The collector accepts:

  • OTLP/HTTP at https://obs.88oakapps.com/v1/traces (bearer-token gated)
  • OTLP/gRPC at :4317 (localhost-only)
  • Native Jaeger protocols at :14268 etc. (localhost-only)

Retention: ~7 days at current scale before badger rotates. UI at https://grafana.88oakapps.com via the Jaeger datasource.

Status of app-side instrumentation: the histograms are populating metrics. The OTel exporter wiring in cmd/api/main.go is not yet shipped. When it does ship, every POST /api/auth/login/ will produce a flame-graph trace with HTTP → handler → SQL → B2 → APNs spans. Tracking issue: gitea#3.

3. Dashboards — Grafana

https://grafana.88oakapps.com (Cloudflare-fronted, basic auth via Grafana itself, admin credentials in deploy/prod.env).

Datasources auto-provisioned at container startup from /opt/honeydue-obs/data/grafana-provisioning/datasources/datasources.yaml:

  • VictoriaMetrics (Prometheus type, http://victoriametrics:8428 in-network)
  • Jaeger (http://jaeger:16686 in-network)

Pre-provisioned dashboard: honeyDue API — RED at /d/honeydue-red. Top row uses the legacy custom metrics (http_endpoint_requests_total, http_requests_total) which started flowing the moment vmagent attached. Lower rows use the new histograms (http_request_duration_seconds_bucket p50/p95/p99 by route, GORM p95 by table, B2 upload p95, APNs/FCM send p95, Go memory + goroutines). Lower rows populated immediately after the api rebuild that shipped internal/prom.

4. kubectl logs

Every container's stdout/stderr is captured by containerd and readable via kubectl:

# Live tail from all api pods
kubectl logs -n honeydue -l app.kubernetes.io/name=api -f --prefix

# Last 100 lines
kubectl logs -n honeydue -l app.kubernetes.io/name=api --tail=100

# Previous pod's logs (before the most recent restart)
kubectl logs -n honeydue <pod-name> --previous

# Events (not logs — k8s-level state changes)
kubectl get events -n honeydue --sort-by=.lastTimestamp

Retention: containerd rotates logs when they exceed 10 MB (default). Only the last ~20 MB of logs is retained per container, on-disk on the node. Once a pod is deleted, its logs are gone.

For persistent log access we'd need aggregation (see §What we still don't have).

5. kubectl top

Pod and node resource usage via metrics-server:

kubectl top nodes
# NAME                CPU(cores)   CPU(%)   MEMORY(bytes)   MEMORY(%)
# ubuntu-8gb-nbg1-1   169m         4%       748Mi           9%

kubectl top pods -n honeydue

In-memory only; last few minutes of data. For historical trends use the Grafana dashboard, which exposes the same data via the go_* and container_* (kubelet cAdvisor) metrics.

6. Cloudflare Analytics

CF Dashboard → Analytics & Logs. Per-zone aggregate stats: requests/sec, bandwidth, cache hit ratio, top status codes, top paths, bot traffic score. Good for spotting macro trends ("suddenly 10× more 502s today") that wouldn't show up in a single-pod sample.

Free tier retention: 7 days of aggregate stats.

7. Neon dashboard

Neon console → project → Monitoring: compute utilization (CU-hours), slow queries, active connections, storage usage. Useful for "is the DB busy?" and free-tier limit watching. The new gorm_query_duration_seconds histogram covers the application side of the same question with much better latency tail visibility.

8. Kubernetes events

kubectl get events shows cluster-level state changes: pod scheduling, failures, image pulls, probe failures. Useful for post-mortem on deploys.

Retention: events are stored in etcd but default to 1 hour.

What we still don't have

No log aggregation

Individual pod logs are on the node. For multi-pod debugging ("show me all api pod logs for user X") we have to:

# Query all at once with stern (if installed)
stern -n honeydue api

# Or per-pod
kubectl logs -n honeydue <pod> | grep user_id=12345

This works but doesn't scale across many pods.

What we'd add: Loki on 88oakappsUpdate next to the existing obs stack. Adds ~512 MB RAM plus a Promtail (or Vector/Alloy) DaemonSet in k3s. Defer until log search becomes a recurring pain point — stern + grep is fine at current pod count.

No alerting

No PagerDuty, no Slack webhooks, no email on "api is returning 500s." The operator finds out when users complain.

Cheapest fix path:

  1. Grafana alerting (built into Grafana 11) — alert rules over the existing histograms (e.g., histogram_quantile(0.95, ...) > 1s). Routes to Slack via webhook. Zero infra cost.
  2. Uptime Kuma on 88oakappsUpdate — pings /api/health/ from outside the cluster every minute; complements the in-cluster view.

We'd want both eventually. Grafana alerting first because the data is already there.

Partial distributed tracing

The OTel SDK is not yet wired in cmd/api/main.go. When it ships:

  • otelecho.Middleware produces a span per HTTP request
  • otelgorm plugin produces a span per SQL query (requires threading ctx through repositories — the largest diff in the rollout)
  • Manual spans wrap B2 uploads, APNs/FCM sends, asynq jobs

Until then, we have aggregate latency by route from the histograms but no per-request flame graph. For "why is this one request slow" we still rely on logs + the GORM duration histogram.

No APM (Application Performance Monitoring)

No continuous profiling. We can answer "which endpoint has the highest p99 latency?" from the histograms, but not "where in the call stack is the time going?" without ad-hoc pprof runs.

If/when needed: Grafana Pyroscope is the OSS continuous profiler that fits our stack. Adds ~512 MB RAM. Defer until a CPU performance incident shows up.

The app's logging conventions

The Go app uses zerolog and emits structured JSON:

{
  "level": "info",
  "time": "2026-04-24T05:29:40Z",
  "caller": "/app/cmd/api/main.go:189",
  "addr": ":8000",
  "message": "HTTP server listening"
}

Log levels: debug, info, warn, error, fatal. Controlled by DEBUG=true|false in the ConfigMap (true sets level to debug, false sets level to info).

Every request is logged with method, path, status, request_id, user_id (if authenticated), latency. Queryable by grep today; ready to ingest into Loki when we add it.

Health endpoints

Each service exposes a health endpoint:

Service Endpoint What it checks
api /api/health/ Process alive (doesn't verify DB)
api /api/health/live Process alive
admin / Next.js is up
worker (none public) Internal Asynq status
api /metrics Prometheus exposition (vmagent scrapes here)
api /metrics/legacy Custom monitoring metrics for GoAdmin

Health endpoints are shallow — they return 200 if the process is running and listening. They don't try to reach Postgres/Redis/etc. Rationale: if Postgres is briefly down, we don't want all api pods to start failing liveness and cascade-restart.

obs.88oakapps.com — the ingest endpoint

Public hostname for cross-cluster metric and trace ingest. Cloudflare in front, nginx on 88oakappsUpdate enforces a bearer-token check before forwarding to the local VM/Jaeger containers.

Path Forwards to Purpose
/api/v1/write http://127.0.0.1:8428 Prometheus remote-write (vmagent → VM)
/v1/traces http://127.0.0.1:4318/v1/traces OTLP/HTTP traces (app → Jaeger)
/health (returns 200) Reachability probe — also requires auth
anything else 404

Token lives at /etc/honeydue-obs/secrets.env (mode 0600 on the box) and at OBS_INGEST_TOKEN= in deploy/prod.env (gitignored). To rotate: generate a new value, update both ends, restart vmagent.

# Operator: rotate the bearer token
NEW=$(openssl rand -hex 32)
ssh 88oakappsUpdate "sudo sed -i 's|OBS_INGEST_TOKEN=.*|OBS_INGEST_TOKEN=$NEW|' /etc/honeydue-obs/secrets.env"
ssh 88oakappsUpdate "sudo sed -i 's|Bearer [a-f0-9]\{64\}|Bearer $NEW|' /etc/nginx/sites-available/obs.88oakapps.com && sudo nginx -s reload"
sed -i.bak "s|^OBS_INGEST_TOKEN=.*|OBS_INGEST_TOKEN=$NEW|" deploy/prod.env
KUBECONFIG=~/.kube/honeydue.yaml kubectl -n honeydue create secret generic vmagent-remote-write \
  --from-literal=bearer_token=$NEW --dry-run=client -o yaml | kubectl apply -f -
KUBECONFIG=~/.kube/honeydue.yaml kubectl -n honeydue rollout restart deploy/vmagent

Resource budget

Service mem_limit Disk Retention
VictoriaMetrics 256 MB 10 GB 30 days
Jaeger all-in-one (badger) 256 MB 10 GB ~7 days
Grafana OSS 256 MB 1 GB
vmagent (in k3s) 256 MB 512 MB emptyDir
Total ~1 GB hard cap ~21 GB

Resident usage at idle is much lower (~90 MB on the obs side, ~30 MB for vmagent). Hard limits exist so a memory leak in any one component can't squeeze the cohabiting PostHog stack on 88oakappsUpdate.

Operator cheat sheet

# Tail all logs in the namespace
kubectl logs -n honeydue --all-containers=true --tail=50 -l app.kubernetes.io/part-of=honeydue

# Scrape state from vmagent self-metrics
kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://127.0.0.1:8429/metrics \
  | grep -E "scrapes_total|targets|remotewrite"

# Force vmagent to reload scrape config
kubectl -n honeydue rollout restart deploy/vmagent

# Query VictoriaMetrics directly (PromQL)
ssh 88oakappsUpdate 'curl -s "http://127.0.0.1:8428/api/v1/query?query=histogram_quantile(0.95,sum%20by%20(route,le)(rate(http_request_duration_seconds_bucket%5B5m%5D)))" | python3 -m json.tool'

# Restart the obs stack on 88oakappsUpdate
ssh 88oakappsUpdate 'cd /opt/honeydue-obs && sudo docker compose restart'

# Live obs container memory
ssh 88oakappsUpdate 'sudo docker stats --no-stream | grep honeydue-obs'

# Pod resource usage (k3s side)
kubectl top pods -n honeydue --sort-by=memory

# With stern (if installed: brew install stern)
stern -n honeydue .

# Full state dump for a pod (debugging)
kubectl describe pod -n honeydue <pod> > /tmp/pod-dump.txt
kubectl logs -n honeydue <pod> > /tmp/pod-logs.txt

Future: what to add and when

Trigger Add
First production incident Grafana alerting (free, data already there)
10k+ daily users Loki + Vector for log aggregation
Performance incident the histograms can't explain Wire OTel exporter → Jaeger from the Go app
CPU pressure on api pods Pyroscope continuous profiler
Multi-product obs needs Migrate obs stack to dedicated CX32 ($8/mo)

The overall philosophy: observability is an investment that compounds. Add it before you need it, not after. But also don't over-invest at idle.

References