docs: rewrite ch15 observability + cross-refs for the live obs stack

ch15 is now an account of what's actually running, not a roadmap for
what we'd add: VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate
fronted by Cloudflare and bearer-gated nginx, vmagent in-cluster, the
internal/prom histogram set, the rollout's NetworkPolicy footprint,
the obs.88oakapps.com endpoint shape, the ~$0/700MB resource budget,
and a token-rotation runbook. The "what we still don't have" section
keeps log aggregation, alerting, and full distributed tracing as the
honest gap list.

Other touched docs:
- 00-overview: "deliberately absent" no longer claims we have no
  metrics — calls out the cross-cluster shape instead.
- 14-deployment-process: TL;DR now points at deploy-k3s/scripts/03-deploy.sh
  (full build + push + apply + obs vmagent), with the manual
  kubectl-set-image flow kept as the single-service path. Notes the
  IfNotPresent gotcha that bit us during the rollout.
- 16-failure-modes: adds vmagent-can't-reach-obs and Grafana-no-data.
- 18-cost: $0 line item for the obs stack on 88oakappsUpdate, with the
  CX32 migration trigger.
- 17/18 README + appendix b: link the new ch15, add the obs cheat
  sheet block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trey t
2026-04-25 15:05:06 -05:00
parent d3708e6c72
commit 77cfcc0b27
8 changed files with 414 additions and 187 deletions
+249 -164
@@ -2,15 +2,119 @@
## Summary
We have minimal observability today: `kubectl logs`, `kubectl top`,
Cloudflare Analytics, and the Neon dashboard. No Prometheus, no Grafana,
no centralized log aggregator, no APM. This is adequate for the
current traffic volume (low) but is a known gap. This chapter documents
what we *have* and what we'd add as traffic grows.
Production has live metrics and tracing infrastructure as of 2026-04-25.
A self-hosted **VictoriaMetrics + Jaeger + Grafana** stack runs on
`88oakappsUpdate` (Linode VPS, also home to the self-hosted PostHog
deployment). A `vmagent` Deployment in the honeyDue k3s namespace scrapes
the api Pods' `/metrics` endpoint every 15 seconds and remote-writes to
`https://obs.88oakapps.com/api/v1/write`. Grafana is at
`https://grafana.88oakapps.com` with a pre-provisioned RED dashboard.
What we still don't have: log aggregation (Dozzle and `kubectl logs`
fill the niche for now), alerting (no PagerDuty/Slack on errors), and
full distributed tracing (Jaeger is ready to receive spans, but the OTel
exporter wiring in `cmd/api/main.go` hasn't shipped yet).
The whole observability stack costs **$0** incremental and uses ~700 MB
RAM on `88oakappsUpdate` (5% of its free RAM). It runs as a separate
docker-compose project from PostHog so neither product's lifecycle
touches the other.
## What we have
### 1. `kubectl logs`
### 1. Metrics — VictoriaMetrics + vmagent
```
honeyDue k3s (Hetzner) 88oakappsUpdate (Linode)
┌───────────────────────────┐ ┌──────────────────────────┐
│ api Pods (3) :8000/metrics│ │ /opt/honeydue-obs/ │
│ prometheus/client_golang│ │ ┌──────────────────┐ │
│ │ │ │ VictoriaMetrics │ │
│ vmagent ──── scrape 15s │ │ │ 30d retention │ │
│ remote_write ─────┼────────────┼─→ /api/v1/write │ │
│ (HTTPS, bearer) │ │ │ (mem 256 MB) │ │
└───────────────────────────┘ │ └──────────────────┘ │
└──────────────────────────┘
```
The Go API exposes `/metrics` in Prometheus exposition format. Histograms
are defined in `internal/prom/metrics.go` and registered globally:
| Metric | Labels | Source |
|---|---|---|
| `http_request_duration_seconds` | `route, method, status` | Echo middleware around every handler |
| `gorm_query_duration_seconds` | `table, operation` | GORM before/after callbacks (no ctx threading needed) |
| `b2_upload_duration_seconds` | `bucket, result` | Wrapped `s.backend.Write` in `internal/services/storage_service.go` |
| `b2_upload_bytes_total` | `bucket, result` | Counter alongside the duration histogram |
| `apns_send_duration_seconds` | `result` (`ok`/`bad_token`/`error`) | Wrapped APNs `PushWithContext` in `internal/push/apns.go` |
| `fcm_send_duration_seconds` | `result` | Wrapped FCM HTTP v1 send in `internal/push/fcm.go` |
| `asynq_job_duration_seconds` | `task_type, result` | Histograms registered; middleware not yet attached (Step 3) |
| `go_*`, `process_*` | (standard) | `prometheus/client_golang/prometheus/collectors` defaults |
The previous custom monitoring at `/metrics` was renamed to
`/metrics/legacy` so the canonical `/metrics` emits proper histograms
suitable for `histogram_quantile()` rollups. The legacy endpoint stays
because the GoAdmin dashboard reads it.
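
As a rough illustration of why proper `_bucket` series matter, the sketch below mimics the linear interpolation that `histogram_quantile()` performs over cumulative buckets. The bucket bounds and counts are made-up sample values, not real output from the api Pods, and the function is a simplified stand-in rather than the actual PromQL implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    The last bucket is expected to be the +Inf bucket, as in the
    Prometheus exposition format.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical http_request_duration_seconds buckets for one route.
sample = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, sample))  # about 0.75 with these sample buckets
```

The estimate can only be as precise as the bucket boundaries, which is one reason the bucket layout in `internal/prom/metrics.go` matters for the p95/p99 panels.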
#### vmagent in k3s
Lives at `deploy-k3s/manifests/observability/vmagent.yaml`. One replica,
memory limit `256Mi`, scrapes by Kubernetes pod-discovery filtered to
`app.kubernetes.io/name=api` and remote-writes to
`https://obs.88oakapps.com/api/v1/write` with a bearer token from
`OBS_INGEST_TOKEN` in `deploy/prod.env` (substituted into a Secret at
deploy time).
The agent buffers locally to `/tmp/vmagent` (emptyDir, 512 MB cap), so
brief obs outages don't drop samples. Persistent queue replays on
reconnect.
NetworkPolicies in the honeydue namespace allow egress from vmagent to:
- DNS (kube-dns / coredns)
- Kubernetes API (`10.43.0.0/16:443`) for pod discovery
- api Pods on `10.42.0.0/16:8000`
- The public obs endpoint over `0.0.0.0/0:443`
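These rules keep vmagent away from Postgres and Redis, since no rule
matches their ports. The `0.0.0.0/0:443` rule is scoped by port rather
than by host, so any external HTTPS endpoint (B2 included) is still
reachable at the network layer.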
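
A small sketch of how the egress list above evaluates a destination, using Python's `ipaddress` module. It models only the CIDR and port pairs from the list; the DNS entry (service CIDR, port 53) is an assumption, since the list names kube-dns without an address, and the sample destination IPs are hypothetical.

```python
import ipaddress

# (CIDR, port) pairs taken from the egress list; the DNS row is assumed.
EGRESS_RULES = [
    ("10.43.0.0/16", 53),    # DNS (assumed to live in the service CIDR)
    ("10.43.0.0/16", 443),   # Kubernetes API for pod discovery
    ("10.42.0.0/16", 8000),  # api Pods
    ("0.0.0.0/0", 443),      # public obs endpoint
]


def egress_allowed(ip, port, rules=EGRESS_RULES):
    """Return True if any rule covers both the destination IP and port."""
    addr = ipaddress.ip_address(ip)
    return any(
        addr in ipaddress.ip_network(cidr) and port == rule_port
        for cidr, rule_port in rules
    )


print(egress_allowed("10.42.1.7", 8000))  # scrape target on the pod network
print(egress_allowed("10.42.1.9", 6379))  # hypothetical Redis pod
print(egress_allowed("203.0.113.10", 5432))  # hypothetical external Postgres
```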
### 2. Tracing — Jaeger all-in-one
Jaeger 1.62 with badger storage runs alongside VictoriaMetrics. The
collector accepts:
- OTLP/HTTP at `https://obs.88oakapps.com/v1/traces` (bearer-token gated)
- OTLP/gRPC at `:4317` (localhost-only)
- Native Jaeger protocols at `:14268` etc. (localhost-only)
Retention: ~7 days at current scale before badger rotates. UI at
`https://grafana.88oakapps.com` via the Jaeger datasource.
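
To smoke-test the OTLP/HTTP path before the app-side exporter exists, one option is to hand-build a single span. The sketch below only constructs the request; the service name `honeydue-api`, the scope name, and the helper itself are hypothetical, and the `Authorization: Bearer` header is assumed to match what the nginx gate expects.

```python
import json
import secrets
import time
import urllib.request

OTLP_URL = "https://obs.88oakapps.com/v1/traces"


def build_otlp_request(token, span_name, service="honeydue-api"):
    """Build (but do not send) an OTLP/HTTP JSON request with one span."""
    now = time.time_ns()
    payload = {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "manual-smoke-test"},
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 32 hex chars
                    "spanId": secrets.token_hex(8),    # 16 hex chars
                    "name": span_name,
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(now - 1_000_000),
                    "endTimeUnixNano": str(now),
                }],
            }],
        }],
    }
    return urllib.request.Request(
        OTLP_URL,
        data=json.dumps(payload).encode(),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


req = build_otlp_request("example-token", "smoke-test-span")
print(req.get_method(), req.full_url)
```

Passing the returned request to `urllib.request.urlopen` would perform the actual POST; the span should then be searchable through the Jaeger datasource in Grafana.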
**Status of app-side instrumentation**: metrics are live (the histograms
from section 1 are populating), but the OTel exporter wiring in
`cmd/api/main.go` is **not yet shipped**. Once it ships, each request
(for example `POST /api/auth/login/`) should produce a flame-graph trace
with HTTP → handler → SQL → B2 → APNs spans. Tracking issue: gitea#3.
### 3. Dashboards — Grafana
`https://grafana.88oakapps.com` (Cloudflare-fronted, basic auth via
Grafana itself, admin credentials in `deploy/prod.env`).
Datasources auto-provisioned at container startup from
`/opt/honeydue-obs/data/grafana-provisioning/datasources/datasources.yaml`:
- VictoriaMetrics (Prometheus type, `http://victoriametrics:8428` in-network)
- Jaeger (`http://jaeger:16686` in-network)
Pre-provisioned dashboard: `honeyDue API — RED` at
`/d/honeydue-red`. Top row uses the legacy custom metrics
(`http_endpoint_requests_total`, `http_requests_total`) which started
flowing the moment vmagent attached. Lower rows use the new histograms
(`http_request_duration_seconds_bucket` p50/p95/p99 by route, GORM p95
by table, B2 upload p95, APNs/FCM send p95, Go memory + goroutines).
Lower rows populated immediately after the api rebuild that shipped
`internal/prom`.
### 4. `kubectl logs`
Every container's stdout/stderr is captured by containerd and readable
via kubectl:
@@ -33,9 +137,10 @@ kubectl get events -n honeydue --sort-by=.lastTimestamp
Only the last ~20 MB of logs is retained per container, on-disk on the
node. Once a pod is deleted, its logs are gone.
For persistent log access we'd need aggregation (see §what we'd add).
For persistent log access we'd need aggregation (see §What we still
don't have).
### 2. `kubectl top`
### 5. `kubectl top`
Pod and node resource usage via metrics-server:
@@ -43,43 +148,32 @@ Pod and node resource usage via metrics-server:
kubectl top nodes
# NAME CPU(cores) CPU(%) MEMORY(bytes) MEMORY(%)
# ubuntu-8gb-nbg1-1 169m 4% 748Mi 9%
# ubuntu-8gb-nbg1-2 229m 5% 1043Mi 13%
# ubuntu-8gb-nbg1-3 124m 3% 770Mi 9%
kubectl top pods -n honeydue
```
**Retention**: In-memory only. Last few minutes of data. No
historical view.
In-memory only; last few minutes of data. For historical trends use
the Grafana dashboard, which exposes the same data via the `go_*` and
`container_*` (kubelet cAdvisor) metrics.
### 3. Cloudflare Analytics
### 6. Cloudflare Analytics
CF Dashboard → Analytics & Logs. Per-zone stats:
- Requests per second
- Bandwidth
- Cache hit ratio
- Top HTTP status codes
- Top request paths
- Bot traffic score
CF Dashboard → Analytics & Logs. Per-zone aggregate stats:
requests/sec, bandwidth, cache hit ratio, top status codes, top paths,
bot traffic score. Good for spotting macro trends ("suddenly 10× more
502s today") that wouldn't show up in a single-pod sample.
All aggregated, no individual request traces. Good for spotting macro
trends ("suddenly 10× more 502s today"), poor for debugging specific
issues.
Free tier retention: 7 days of aggregate stats.
Free tier retention: 7 days of aggregate stats. Pro extends this.
### 7. Neon dashboard
### 4. Neon dashboard
Neon console → project → Monitoring: compute utilization (CU-hours),
slow queries, active connections, storage usage. Useful for "is the
DB busy?" and free-tier limit watching. The new
`gorm_query_duration_seconds` histogram covers the application side
of the same question with much better latency tail visibility.
Neon console → project → Monitoring:
- Compute utilization (CU-hours consumed)
- Query performance (slow queries)
- Active connections
- Storage usage
Good for "is the DB busy?" and "am I close to my free tier limit?"
Not real-time.
### 5. Kubernetes events
### 8. Kubernetes events
`kubectl get events` shows cluster-level state changes: pod scheduling,
failures, image pulls, probe failures. Useful for post-mortem on
@@ -87,7 +181,7 @@ deploys.
Retention: events are stored in etcd but default to 1 hour.
## What we don't have (the gap)
## What we still don't have
### No log aggregation
@@ -98,64 +192,55 @@ all api pod logs for user X") we have to:
# Query all at once with stern (if installed)
stern -n honeydue api
# Or for specific pod
# Or per-pod
kubectl logs -n honeydue <pod> | grep user_id=12345
```
This works but doesn't scale. Grep across 3 pods for a specific
user_id is OK. Across 30 pods, intractable.
This works but doesn't scale across many pods.
**What we'd add**: [Loki](https://grafana.com/oss/loki/) — a lightweight
log aggregator designed for k8s. ~$0 to self-host; integrates with
Grafana for queries. Or [Betterstack](https://betterstack.com/logs)
($10/mo, hosted).
### No metrics/dashboards
`kubectl top` tells us "is this pod hot right now?" but not "has CPU
been climbing over the past hour?" We'd need:
- **Prometheus** — scrapes metrics from kubelet and pods' `/metrics`
endpoints, stores time series
- **Grafana** — queries Prometheus, renders dashboards
K3s can install these via Helm in ~10 minutes. Adds ~500MB RAM to the
cluster. Stability and operational load: moderate.
**Alternative**: [Kubernetes Dashboard](https://github.com/kubernetes/dashboard)
bundled with k3s (disabled by default). Minimal UI over the existing
metrics API. Cheaper than Prometheus but less queryable.
### No distributed tracing
"This request took 800ms — which hop was slow?" is currently unanswerable
beyond "the DB query, probably." A real trace would show:
- TLS handshake time
- Traefik routing time
- Go handler time
- Postgres query time
- Redis call time
- Each B2 request time
We'd add OpenTelemetry to the Go app and export to Jaeger/Tempo. Work
is moderate; value kicks in when we have complex request flows.
**What we'd add**: [Loki](https://grafana.com/oss/loki/) on
`88oakappsUpdate` next to the existing obs stack. Adds ~512 MB RAM
plus a Promtail (or Vector/Alloy) DaemonSet in k3s. Defer until log
search becomes a recurring pain point — `stern` + `grep` is fine at
current pod count.
### No alerting
No PagerDuty, no Slack webhooks, no email on "api is returning 500s."
The operator finds out when users complain.
Cheapest fix: [Uptime Kuma](https://github.com/louislam/uptime-kuma)
(self-hosted) or Better Stack Uptime (free for small teams). Ping
`https://api.myhoneydue.com/api/health/` every minute; alert if it fails.
Cheapest fix path:
1. Grafana alerting (built into Grafana 11) — alert rules over the
existing histograms (e.g., `histogram_quantile(0.95, ...) > 1`, where the value is in seconds).
Routes to Slack via webhook. **Zero infra cost.**
2. [Uptime Kuma](https://github.com/louislam/uptime-kuma) on
`88oakappsUpdate` — pings `/api/health/` from outside the cluster
every minute; complements the in-cluster view.
We'd want both eventually. Grafana alerting first because the data is
already there.
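
Until alert rules exist, the same condition can be checked by hand against the Prometheus-compatible query API that VictoriaMetrics exposes. The sketch below parses an instant-query response and lists routes above a threshold; the sample response body and its route names are invented for illustration.

```python
import json


def routes_over_threshold(response_body, threshold_seconds=1.0):
    """Return {route: value} for series whose value exceeds the threshold.

    Expects the JSON body of a Prometheus-style /api/v1/query instant query.
    """
    doc = json.loads(response_body)
    if doc.get("status") != "success":
        raise ValueError(f"query failed: {doc.get('error', 'unknown error')}")
    slow = {}
    for series in doc["data"]["result"]:
        route = series["metric"].get("route", "<none>")
        value = float(series["value"][1])  # value is [timestamp, "string"]
        if value > threshold_seconds:      # NaN compares False and is skipped
            slow[route] = value
    return slow


# Invented sample response for a p95-by-route query.
SAMPLE = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"route": "/api/tasks/"}, "value": [1745600000, "0.42"]},
        {"metric": {"route": "/api/uploads/"}, "value": [1745600000, "1.8"]},
        {"metric": {"route": "/api/idle/"}, "value": [1745600000, "NaN"]},
    ]},
})
print(routes_over_threshold(SAMPLE))
```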
### Partial distributed tracing
The OTel SDK is **not yet wired** in `cmd/api/main.go`. When it ships:
- `otelecho.Middleware` produces a span per HTTP request
- `otelgorm` plugin produces a span per SQL query (requires threading
`ctx` through repositories — the largest diff in the rollout)
- Manual spans wrap B2 uploads, APNs/FCM sends, asynq jobs
Until then, we have aggregate latency by route from the histograms but
no per-request flame graph. For "why is *this one* request slow" we
still rely on logs + the GORM duration histogram.
### No APM (Application Performance Monitoring)
No request-level profiling. We can't see "which endpoint has the highest
p99 latency?" or "which SQL query is hot this week?"
No continuous profiling. We can answer "which endpoint has the highest
p99 latency?" from the histograms, but not "where in the call stack is
the time going?" without ad-hoc `pprof` runs.
Options: Datadog, New Relic, Honeycomb, self-hosted Tempo+Grafana.
All are meaningful work to set up and cost $$$.
If/when needed: Grafana Pyroscope is the OSS continuous profiler that
fits our stack. Adds ~512 MB RAM. Defer until a CPU performance
incident shows up.
## The app's logging conventions
@@ -172,28 +257,12 @@ The Go app uses zerolog and emits structured JSON:
```
Log levels: `debug`, `info`, `warn`, `error`, `fatal`. Controlled by
`DEBUG=true|false` in ConfigMap (true sets level to debug, false sets
level to info).
`DEBUG=true|false` in the ConfigMap (true sets level to debug, false
sets level to info).
Every request is logged with:
- Method, path, status code
- Request ID (for correlating logs across pods)
- User ID (if authenticated)
- Latency
```json
{
"level": "info",
"method": "GET",
"path": "/api/tasks/",
"status": 200,
"latency_ms": 42,
"user_id": 123,
"request_id": "a6b5db35-..."
}
```
This is queryable by grep. Better with log aggregation.
Every request is logged with method, path, status, request_id, user_id
(if authenticated), and latency. Queryable by grep today; ready to ingest
into Loki when we add it.
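
Because the lines are structured JSON, a filter that parses them is a little more robust than plain `grep`. A minimal sketch, assuming the field names `user_id` and `latency_ms` appear as in the request log entries; the helper name and the sample lines are hypothetical.

```python
import json


def filter_logs(lines, user_id=None, min_latency_ms=None):
    """Yield parsed log entries matching the given filters.

    Lines that are not JSON objects (panics, startup banners) are skipped.
    """
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        if user_id is not None and entry.get("user_id") != user_id:
            continue
        if min_latency_ms is not None and entry.get("latency_ms", 0) < min_latency_ms:
            continue
        yield entry


# Hypothetical sample lines in the shape the api emits.
SAMPLE_LINES = [
    '{"level":"info","method":"GET","path":"/api/tasks/","status":200,"latency_ms":42,"user_id":123}',
    '{"level":"info","method":"POST","path":"/api/uploads/","status":201,"latency_ms":910,"user_id":123}',
    '{"level":"info","method":"GET","path":"/api/tasks/","status":200,"latency_ms":15,"user_id":456}',
    'not json at all',
]
for entry in filter_logs(SAMPLE_LINES, user_id=123, min_latency_ms=500):
    print(entry["method"], entry["path"], entry["latency_ms"])
```

In practice the input would come from a pipe, for example by iterating over `sys.stdin` with `kubectl logs` or `stern` output on the other end.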
## Health endpoints
@@ -202,71 +271,58 @@ Each service exposes a health endpoint:
| Service | Endpoint | What it checks |
|---|---|---|
| api | `/api/health/` | Process alive (doesn't verify DB) |
| api | `/api/health/live` | Process alive |
| admin | `/` | Next.js is up |
| worker | (none public) | Internal Asynq status |
| api | `/metrics` | Prometheus exposition (vmagent scrapes here) |
| api | `/metrics/legacy` | Custom monitoring metrics for GoAdmin |
Health endpoints are **shallow** — they return 200 if the process is
running and listening. They don't try to reach Postgres/Redis/etc.
Rationale: if Postgres is briefly down, we don't want all api pods to
start failing liveness and cascade-restart.
## Dozzle (deprecated)
## obs.88oakapps.com — the ingest endpoint
The Swarm era had [Dozzle](https://github.com/amir20/dozzle) — a
lightweight web UI for Docker logs. Accessible via SSH tunnel to the
manager node. Not deployed on k3s; `kubectl logs` + `stern` fills the
niche.
Public hostname for cross-cluster metric and trace ingest. Cloudflare
in front, nginx on `88oakappsUpdate` enforces a bearer-token check
before forwarding to the local VM/Jaeger containers.
## Kubernetes metrics the k8s API exposes
| Path | Forwards to | Purpose |
|---|---|---|
| `/api/v1/write` | `http://127.0.0.1:8428` | Prometheus remote-write (vmagent → VM) |
| `/v1/traces` | `http://127.0.0.1:4318/v1/traces` | OTLP/HTTP traces (app → Jaeger) |
| `/health` | (returns 200) | Reachability probe — also requires auth |
| anything else | 404 | |
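
The routing table can be read as a small decision function. The sketch below is a model of that behaviour, not the nginx config itself: the `401` status for a bad token and the choice to check the token before the path are assumptions, since the table only states that every listed path requires auth.

```python
import hmac

# Path -> upstream, copied from the table; None means nginx answers directly.
ROUTES = {
    "/api/v1/write": "http://127.0.0.1:8428",
    "/v1/traces": "http://127.0.0.1:4318/v1/traces",
    "/health": None,
}


def route(path, authorization, token):
    """Return ("forward", upstream) or ("respond", status) for a request."""
    expected = f"Bearer {token}".encode()
    supplied = (authorization or "").encode()
    # Constant-time comparison avoids leaking token prefixes via timing.
    if not hmac.compare_digest(supplied, expected):
        return ("respond", 401)  # assumed status for a missing or bad token
    if path not in ROUTES:
        return ("respond", 404)
    upstream = ROUTES[path]
    if upstream is None:
        return ("respond", 200)
    return ("forward", upstream)


print(route("/api/v1/write", "Bearer s3cret", "s3cret"))
print(route("/health", None, "s3cret"))
```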
Even without Prometheus, these are queryable:
Token lives at `/etc/honeydue-obs/secrets.env` (mode 0600 on the box)
and at `OBS_INGEST_TOKEN=` in `deploy/prod.env` (gitignored). To rotate:
generate a new value, update it on the obs host (secrets file and nginx
config), in `deploy/prod.env`, and in the k3s Secret, then restart vmagent.
```bash
# Resource metrics (via metrics-server)
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/honeydue/pods
# Core API (k8s state)
kubectl get --raw /api/v1/namespaces/honeydue/pods/<name>
# Kubelet metrics (per-node; requires tunneling)
kubectl get --raw /api/v1/nodes/<node>/proxy/metrics
# Operator: rotate the bearer token
NEW=$(openssl rand -hex 32)
ssh 88oakappsUpdate "sudo sed -i 's|OBS_INGEST_TOKEN=.*|OBS_INGEST_TOKEN=$NEW|' /etc/honeydue-obs/secrets.env"
ssh 88oakappsUpdate "sudo sed -i 's|Bearer [a-f0-9]\{64\}|Bearer $NEW|' /etc/nginx/sites-available/obs.88oakapps.com && sudo nginx -s reload"
sed -i.bak "s|^OBS_INGEST_TOKEN=.*|OBS_INGEST_TOKEN=$NEW|" deploy/prod.env
# Export once: a VAR=value prefix would only reach the first command in the pipe
export KUBECONFIG=~/.kube/honeydue.yaml
kubectl -n honeydue create secret generic vmagent-remote-write \
  --from-literal=bearer_token="$NEW" --dry-run=client -o yaml | kubectl apply -f -
kubectl -n honeydue rollout restart deploy/vmagent
```
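
One detail of the rotation worth seeing in isolation: the nginx substitution only matches a 64-character lowercase hex token, which is exactly what `openssl rand -hex 32` produces. The sketch below reproduces that match in Python; the nginx config line is a hypothetical example, not the real file contents.

```python
import re
import secrets

# Same pattern the sed command uses against the nginx site config.
TOKEN_RE = re.compile(r"Bearer [a-f0-9]{64}")


def new_token():
    """Equivalent of `openssl rand -hex 32`: 32 random bytes as 64 hex chars."""
    return secrets.token_hex(32)


def rotate_line(line, token):
    """Replace the bearer token in one config line; return (line, replaced?)."""
    rotated, count = TOKEN_RE.subn(f"Bearer {token}", line, count=1)
    return rotated, count == 1


old = new_token()
# Hypothetical nginx line; the real config may be shaped differently.
config_line = f'if ($http_authorization != "Bearer {old}") {{ return 401; }}'
rotated, replaced = rotate_line(config_line, new_token())
print(replaced)
```

If the token format ever changes (different length, uppercase hex), the pattern stops matching and the substitution silently does nothing, so it is worth confirming the replacement happened before reloading nginx.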
If we ever spin up Prometheus, these are the endpoints it would scrape.
## Resource budget
## Future: what to add and when
| Service | mem_limit | Disk | Retention |
|---|---|---|---|
| VictoriaMetrics | 256 MB | 10 GB | 30 days |
| Jaeger all-in-one (badger) | 256 MB | 10 GB | ~7 days |
| Grafana OSS | 256 MB | 1 GB | — |
| vmagent (in k3s) | 256 MB | 512 MB emptyDir | — |
| **Total** | **~1 GB hard cap** | **~21 GB** | |
| Trigger | Add |
|---|---|
| 10k+ daily users | Loki + Grafana for logs |
| 100+ req/s sustained | Prometheus + Grafana for metrics |
| Performance incidents | OpenTelemetry tracing |
| Revenue > $5k/mo | Paid monitoring (Datadog or similar) |
| First production outage | Alerting to phone/Slack |
The overall philosophy: observability is an investment that compounds.
Add it before you need it, not after. But also don't over-invest at
idle.
**Next quarter**: set up Uptime Kuma + Loki at minimum.
## Checking what's installed
```bash
# In kube-system namespace
kubectl get pods -n kube-system
# Should see: coredns, metrics-server, traefik, local-path-provisioner,
# and some k3s-related helm install jobs
# In honeydue namespace
kubectl get pods -n honeydue
# api, admin, worker, redis
# No monitoring namespace (yet)
kubectl get namespaces
# default, honeydue, kube-node-lease, kube-public, kube-system
```
Resident usage at idle is much lower (~90 MB on the obs side, ~30 MB
for vmagent). Hard limits exist so a memory leak in any one component
can't squeeze the cohabiting PostHog stack on `88oakappsUpdate`.
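
The totals in the table are plain sums of the per-service limits. A quick sketch of that arithmetic, using only figures from the table; the split into host-side and cluster-side groups is an illustrative grouping, not something the manifests define.

```python
# Memory limits (MB) and disk allocations (GB) from the resource budget table.
MEM_LIMIT_MB = {"victoriametrics": 256, "jaeger": 256, "grafana": 256, "vmagent": 256}
DISK_GB = {"victoriametrics": 10, "jaeger": 10, "grafana": 1}

# Services that run on 88oakappsUpdate; vmagent runs in k3s instead.
OBS_HOST_SERVICES = ("victoriametrics", "jaeger", "grafana")


def total_mem_mb(services=None):
    """Sum the memory limits for the given services (all services by default)."""
    names = MEM_LIMIT_MB if services is None else services
    return sum(MEM_LIMIT_MB[name] for name in names)


print(total_mem_mb())                   # hard cap across all four components
print(total_mem_mb(OBS_HOST_SERVICES))  # hard cap on the obs host alone
print(sum(DISK_GB.values()))            # disk on the obs host, excluding the emptyDir
```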
## Operator cheat sheet
@@ -274,32 +330,61 @@ kubectl get namespaces
# Tail all logs in the namespace
kubectl logs -n honeydue --all-containers=true --tail=50 -l app.kubernetes.io/part-of=honeydue
# Scrape state from vmagent self-metrics
kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://127.0.0.1:8429/metrics \
| grep -E "scrapes_total|targets|remotewrite"
# Force vmagent to reload scrape config
kubectl -n honeydue rollout restart deploy/vmagent
# Query VictoriaMetrics directly (PromQL)
ssh 88oakappsUpdate 'curl -s "http://127.0.0.1:8428/api/v1/query?query=histogram_quantile(0.95,sum%20by%20(route,le)(rate(http_request_duration_seconds_bucket%5B5m%5D)))" | python3 -m json.tool'
# Restart the obs stack on 88oakappsUpdate
ssh 88oakappsUpdate 'cd /opt/honeydue-obs && sudo docker compose restart'
# Live obs container memory
ssh 88oakappsUpdate 'sudo docker stats --no-stream | grep honeydue-obs'
# Pod resource usage (k3s side)
kubectl top pods -n honeydue --sort-by=memory
# With stern (if installed: brew install stern)
stern -n honeydue .
# Follow a specific pod's current container (swap -f for --previous to read the prior run)
kubectl logs -n honeydue <pod> -f --previous=false
# Pod resource usage
kubectl top pods -n honeydue --sort-by=memory
kubectl top pods -n honeydue --sort-by=cpu
# Events (cluster-wide)
kubectl get events -A --sort-by=.lastTimestamp | tail -20
# Full state dump for a pod (debugging)
kubectl describe pod -n honeydue <pod> > /tmp/pod-dump.txt
kubectl logs -n honeydue <pod> > /tmp/pod-logs.txt
```
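
The percent-encoded PromQL in the direct VictoriaMetrics query is awkward to edit by hand. A minimal sketch that builds the same URL from a readable query string with `urllib.parse.quote`; the helper names are hypothetical, and the endpoint is the localhost address used in the cheat sheet.

```python
from urllib.parse import quote

VM_QUERY_ENDPOINT = "http://127.0.0.1:8428/api/v1/query"


def latency_quantile_by_route(q=0.95, window="5m"):
    """PromQL for a latency quantile per route over the given rate window."""
    return (
        f"histogram_quantile({q},sum by (route,le)"
        f"(rate(http_request_duration_seconds_bucket[{window}])))"
    )


def query_url(promql):
    """Percent-encode spaces and brackets; keep parentheses and commas readable."""
    return f"{VM_QUERY_ENDPOINT}?query={quote(promql, safe='(),')}"


print(query_url(latency_quantile_by_route()))
print(query_url(latency_quantile_by_route(q=0.99, window="15m")))
```

The first printed URL is the one used in the cheat sheet; changing `q` or `window` yields the p99 or a longer lookback without re-encoding anything manually.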
## Future: what to add and when
| Trigger | Add |
|---|---|
| First production incident | Grafana alerting (free, data already there) |
| 10k+ daily users | Loki + Vector for log aggregation |
| Performance incident the histograms can't explain | Wire OTel exporter → Jaeger from the Go app |
| CPU pressure on api pods | Pyroscope continuous profiler |
| Multi-product obs needs | Migrate obs stack to dedicated CX32 ($8/mo) |
The overall philosophy: observability is an investment that compounds.
Add it before you need it, not after. But also don't over-invest at
idle.
## References
- [Kubernetes metrics-server][ms]
- [K3s metrics][k3s-metrics]
- [Loki][loki]
- [VictoriaMetrics docs][vm]
- [vmagent kubernetes_sd_configs][vmagent-k8s]
- [Jaeger all-in-one with badger][jaeger]
- [prometheus/client_golang][promclient]
- [Grafana provisioning datasources][gf-prov]
- [Loki][loki] (future)
- [Stern (multi-pod log tail)][stern]
[ms]: https://github.com/kubernetes-sigs/metrics-server
[k3s-metrics]: https://docs.k3s.io/advanced#enabling-metrics-server
[vm]: https://docs.victoriametrics.com/single-server-victoriametrics/
[vmagent-k8s]: https://docs.victoriametrics.com/vmagent.html#kubernetes-monitoring-with-vmagent
[jaeger]: https://www.jaegertracing.io/docs/1.62/getting-started/#all-in-one
[promclient]: https://pkg.go.dev/github.com/prometheus/client_golang
[gf-prov]: https://grafana.com/docs/grafana/latest/administration/provisioning/#datasources
[loki]: https://grafana.com/oss/loki/
[stern]: https://github.com/stern/stern