77cfcc0b27
ch15 is now an account of what's actually running, not a roadmap for what we'd add: VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate fronted by Cloudflare and bearer-gated nginx, vmagent in-cluster, the internal/prom histogram set, the rollout's NetworkPolicy footprint, the obs.88oakapps.com endpoint shape, the ~$0/700MB resource budget, and a token-rotation runbook. The "what we still don't have" section keeps log aggregation, alerting, and full distributed tracing as the honest gap list. Other touched docs: - 00-overview: \"deliberately absent\" no longer claims we have no metrics — calls out the cross-cluster shape instead. - 14-deployment-process: TL;DR now points at deploy-k3s/scripts/03-deploy.sh (full build + push + apply + obs vmagent), with the manual kubectl-set-image flow kept as the single-service path. Notes the IfNotPresent gotcha that bit us during the rollout. - 16-failure-modes: adds vmagent-can't-reach-obs and Grafana-no-data. - 18-cost: $0 line item for the obs stack on 88oakappsUpdate, with the CX32 migration trigger. - 17/18 README + appendix b: link the new ch15, add the obs cheat sheet block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
391 lines
16 KiB
Markdown
391 lines
16 KiB
Markdown
# 15 — Observability
|
||
|
||
## Summary
|
||
|
||
Production has live metrics and tracing infrastructure as of 2026-04-25.
|
||
A self-hosted **VictoriaMetrics + Jaeger + Grafana** stack runs on
|
||
`88oakappsUpdate` (Linode VPS, also home to the self-hosted PostHog
|
||
deployment). A `vmagent` sidecar in the honeyDue k3s namespace scrapes
|
||
the api Pods' `/metrics` endpoint every 15 seconds and remote-writes to
|
||
`https://obs.88oakapps.com/api/v1/write`. Grafana is at
|
||
`https://grafana.88oakapps.com` with a pre-provisioned RED dashboard.
|
||
|
||
What we still don't have: log aggregation (Dozzle and `kubectl logs`
|
||
fill the niche for now), alerting (no PagerDuty/Slack on errors), and
|
||
full distributed tracing (OTel SDK is wired in app code but app-side
|
||
instrumentation beyond HTTP routes hasn't shipped yet).
|
||
|
||
The whole observability stack costs **$0** incremental and uses ~700 MB
|
||
RAM on `88oakappsUpdate` (5% of its free RAM). It runs as a separate
|
||
docker-compose project from PostHog so neither product's lifecycle
|
||
touches the other.
|
||
|
||
## What we have
|
||
|
||
### 1. Metrics — VictoriaMetrics + vmagent
|
||
|
||
```
|
||
honeyDue k3s (Hetzner) 88oakappsUpdate (Linode)
|
||
┌───────────────────────────┐ ┌──────────────────────────┐
|
||
│ api Pods (3) :8000/metrics│ │ /opt/honeydue-obs/ │
|
||
│ prometheus/client_golang│ │ ┌──────────────────┐ │
|
||
│ │ │ │ VictoriaMetrics │ │
|
||
│ vmagent ──── scrape 15s │ │ │ 30d retention │ │
|
||
│ remote_write ─────┼────────────┼─→ /api/v1/write │ │
|
||
│ (HTTPS, bearer) │ │ │ (mem 256 MB) │ │
|
||
└───────────────────────────┘ │ └──────────────────┘ │
|
||
└──────────────────────────┘
|
||
```
|
||
|
||
The Go API exposes `/metrics` in Prometheus exposition format. Histograms
|
||
are defined in `internal/prom/metrics.go` and registered globally:
|
||
|
||
| Metric | Labels | Source |
|
||
|---|---|---|
|
||
| `http_request_duration_seconds` | `route, method, status` | Echo middleware around every handler |
|
||
| `gorm_query_duration_seconds` | `table, operation` | GORM before/after callbacks (no ctx threading needed) |
|
||
| `b2_upload_duration_seconds` | `bucket, result` | Wrapped `s.backend.Write` in `internal/services/storage_service.go` |
|
||
| `b2_upload_bytes_total` | `bucket, result` | Counter alongside the duration histogram |
|
||
| `apns_send_duration_seconds` | `result` (`ok`/`bad_token`/`error`) | Wrapped APNs `PushWithContext` in `internal/push/apns.go` |
|
||
| `fcm_send_duration_seconds` | `result` | Wrapped FCM HTTP v1 send in `internal/push/fcm.go` |
|
||
| `asynq_job_duration_seconds` | `task_type, result` | Histograms registered; middleware not yet attached (Step 3) |
|
||
| `go_*`, `process_*` | (standard) | `prometheus/client_golang/prometheus/collectors` defaults |
|
||
|
||
The previous custom monitoring at `/metrics` was renamed to
|
||
`/metrics/legacy` so the canonical `/metrics` emits proper histograms
|
||
suitable for `histogram_quantile()` rollups. The legacy endpoint stays
|
||
because the GoAdmin dashboard reads it.
|
||
|
||
#### vmagent in k3s
|
||
|
||
Lives at `deploy-k3s/manifests/observability/vmagent.yaml`. One replica,
|
||
`mem_limit: 256Mi`, scrapes by Kubernetes pod-discovery filtered to
|
||
`app.kubernetes.io/name=api` and remote-writes to
|
||
`https://obs.88oakapps.com/api/v1/write` with a bearer token from
|
||
`OBS_INGEST_TOKEN` in `deploy/prod.env` (substituted into a Secret at
|
||
deploy time).
|
||
|
||
The agent buffers locally to `/tmp/vmagent` (emptyDir, 512 MB cap), so
|
||
brief obs outages don't drop samples. Persistent queue replays on
|
||
reconnect.
|
||
|
||
NetworkPolicies in the honeydue namespace allow egress from vmagent to:
|
||
- DNS (kube-dns / coredns)
|
||
- Kubernetes API (`10.43.0.0/16:443`) for pod discovery
|
||
- api Pods on `10.42.0.0/16:8000`
|
||
- The public obs endpoint over `0.0.0.0/0:443`
|
||
|
||
These are scoped tight — vmagent can't reach Postgres, Redis, B2, or
|
||
any other external service.
|
||
|
||
### 2. Tracing — Jaeger all-in-one
|
||
|
||
Jaeger 1.62 with badger storage runs alongside VictoriaMetrics. The
|
||
collector accepts:
|
||
- OTLP/HTTP at `https://obs.88oakapps.com/v1/traces` (bearer-token gated)
|
||
- OTLP/gRPC at `:4317` (localhost-only)
|
||
- Native Jaeger protocols at `:14268` etc. (localhost-only)
|
||
|
||
Retention: ~7 days at current scale before badger rotates. UI at
|
||
`https://grafana.88oakapps.com` via the Jaeger datasource.
|
||
|
||
**Status of app-side instrumentation**: the histograms are populating
|
||
metrics. The OTel exporter wiring in `cmd/api/main.go` is **not yet
|
||
shipped**. When it does ship, every `POST /api/auth/login/` will produce
|
||
a flame-graph trace with HTTP → handler → SQL → B2 → APNs spans.
|
||
Tracking issue: gitea#3.
|
||
|
||
### 3. Dashboards — Grafana
|
||
|
||
`https://grafana.88oakapps.com` (Cloudflare-fronted, basic auth via
|
||
Grafana itself, admin credentials in `deploy/prod.env`).
|
||
|
||
Datasources auto-provisioned at container startup from
|
||
`/opt/honeydue-obs/data/grafana-provisioning/datasources/datasources.yaml`:
|
||
- VictoriaMetrics (Prometheus type, `http://victoriametrics:8428` in-network)
|
||
- Jaeger (`http://jaeger:16686` in-network)
|
||
|
||
Pre-provisioned dashboard: `honeyDue API — RED` at
|
||
`/d/honeydue-red`. Top row uses the legacy custom metrics
|
||
(`http_endpoint_requests_total`, `http_requests_total`) which started
|
||
flowing the moment vmagent attached. Lower rows use the new histograms
|
||
(`http_request_duration_seconds_bucket` p50/p95/p99 by route, GORM p95
|
||
by table, B2 upload p95, APNs/FCM send p95, Go memory + goroutines).
|
||
Lower rows populated immediately after the api rebuild that shipped
|
||
`internal/prom`.
|
||
|
||
### 4. `kubectl logs`
|
||
|
||
Every container's stdout/stderr is captured by containerd and readable
|
||
via kubectl:
|
||
|
||
```bash
|
||
# Live tail from all api pods
|
||
kubectl logs -n honeydue -l app.kubernetes.io/name=api -f --prefix
|
||
|
||
# Last 100 lines
|
||
kubectl logs -n honeydue -l app.kubernetes.io/name=api --tail=100
|
||
|
||
# Previous pod's logs (before the most recent restart)
|
||
kubectl logs -n honeydue <pod-name> --previous
|
||
|
||
# Events (not logs — k8s-level state changes)
|
||
kubectl get events -n honeydue --sort-by=.lastTimestamp
|
||
```
|
||
|
||
**Retention**: containerd rotates logs when they exceed 10 MB (default).
|
||
Only the last ~20 MB of logs is retained per container, on-disk on the
|
||
node. Once a pod is deleted, its logs are gone.
|
||
|
||
For persistent log access we'd need aggregation (see §What we still
|
||
don't have).
|
||
|
||
### 5. `kubectl top`
|
||
|
||
Pod and node resource usage via metrics-server:
|
||
|
||
```bash
|
||
kubectl top nodes
|
||
# NAME CPU(cores) CPU(%) MEMORY(bytes) MEMORY(%)
|
||
# ubuntu-8gb-nbg1-1 169m 4% 748Mi 9%
|
||
|
||
kubectl top pods -n honeydue
|
||
```
|
||
|
||
In-memory only; last few minutes of data. For historical trends use
|
||
the Grafana dashboard, which exposes the same data via the `go_*` and
|
||
`container_*` (kubelet cAdvisor) metrics.
|
||
|
||
### 6. Cloudflare Analytics
|
||
|
||
CF Dashboard → Analytics & Logs. Per-zone aggregate stats:
|
||
requests/sec, bandwidth, cache hit ratio, top status codes, top paths,
|
||
bot traffic score. Good for spotting macro trends ("suddenly 10× more
|
||
502s today") that wouldn't show up in a single-pod sample.
|
||
|
||
Free tier retention: 7 days of aggregate stats.
|
||
|
||
### 7. Neon dashboard
|
||
|
||
Neon console → project → Monitoring: compute utilization (CU-hours),
|
||
slow queries, active connections, storage usage. Useful for "is the
|
||
DB busy?" and free-tier limit watching. The new
|
||
`gorm_query_duration_seconds` histogram covers the application side
|
||
of the same question with much better latency tail visibility.
|
||
|
||
### 8. Kubernetes events
|
||
|
||
`kubectl get events` shows cluster-level state changes: pod scheduling,
|
||
failures, image pulls, probe failures. Useful for post-mortem on
|
||
deploys.
|
||
|
||
Retention: events are stored in etcd but default to 1 hour.
|
||
|
||
## What we still don't have
|
||
|
||
### No log aggregation
|
||
|
||
Individual pod logs are on the node. For multi-pod debugging ("show me
|
||
all api pod logs for user X") we have to:
|
||
|
||
```bash
|
||
# Query all at once with stern (if installed)
|
||
stern -n honeydue api
|
||
|
||
# Or per-pod
|
||
kubectl logs -n honeydue <pod> | grep user_id=12345
|
||
```
|
||
|
||
This works but doesn't scale across many pods.
|
||
|
||
**What we'd add**: [Loki](https://grafana.com/oss/loki/) on
|
||
`88oakappsUpdate` next to the existing obs stack. Adds ~512 MB RAM
|
||
plus a Promtail (or Vector/Alloy) DaemonSet in k3s. Defer until log
|
||
search becomes a recurring pain point — `stern` + `grep` is fine at
|
||
current pod count.
|
||
|
||
### No alerting
|
||
|
||
No PagerDuty, no Slack webhooks, no email on "api is returning 500s."
|
||
The operator finds out when users complain.
|
||
|
||
Cheapest fix path:
|
||
1. Grafana alerting (built into Grafana 11) — alert rules over the
|
||
existing histograms (e.g., `histogram_quantile(0.95, ...) > 1s`).
|
||
Routes to Slack via webhook. **Zero infra cost.**
|
||
2. [Uptime Kuma](https://github.com/louislam/uptime-kuma) on
|
||
`88oakappsUpdate` — pings `/api/health/` from outside the cluster
|
||
every minute; complements the in-cluster view.
|
||
|
||
We'd want both eventually. Grafana alerting first because the data is
|
||
already there.
|
||
|
||
### Partial distributed tracing
|
||
|
||
The OTel SDK is **not yet wired** in `cmd/api/main.go`. When it ships:
|
||
- `otelecho.Middleware` produces a span per HTTP request
|
||
- `otelgorm` plugin produces a span per SQL query (requires threading
|
||
`ctx` through repositories — the largest diff in the rollout)
|
||
- Manual spans wrap B2 uploads, APNs/FCM sends, asynq jobs
|
||
|
||
Until then, we have aggregate latency by route from the histograms but
|
||
no per-request flame graph. For "why is *this one* request slow" we
|
||
still rely on logs + the GORM duration histogram.
|
||
|
||
### No APM (Application Performance Monitoring)
|
||
|
||
No continuous profiling. We can answer "which endpoint has the highest
|
||
p99 latency?" from the histograms, but not "where in the call stack is
|
||
the time going?" without ad-hoc `pprof` runs.
|
||
|
||
If/when needed: Grafana Pyroscope is the OSS continuous profiler that
|
||
fits our stack. Adds ~512 MB RAM. Defer until a CPU performance
|
||
incident shows up.
|
||
|
||
## The app's logging conventions
|
||
|
||
The Go app uses zerolog and emits structured JSON:
|
||
|
||
```json
|
||
{
|
||
"level": "info",
|
||
"time": "2026-04-24T05:29:40Z",
|
||
"caller": "/app/cmd/api/main.go:189",
|
||
"addr": ":8000",
|
||
"message": "HTTP server listening"
|
||
}
|
||
```
|
||
|
||
Log levels: `debug`, `info`, `warn`, `error`, `fatal`. Controlled by
|
||
`DEBUG=true|false` in the ConfigMap (true sets level to debug, false
|
||
sets level to info).
|
||
|
||
Every request is logged with method, path, status, request_id, user_id
|
||
(if authenticated), latency. Queryable by grep today; ready to ingest
|
||
into Loki when we add it.
|
||
|
||
## Health endpoints
|
||
|
||
Each service exposes a health endpoint:
|
||
|
||
| Service | Endpoint | What it checks |
|
||
|---|---|---|
|
||
| api | `/api/health/` | Process alive (doesn't verify DB) |
|
||
| api | `/api/health/live` | Process alive |
|
||
| admin | `/` | Next.js is up |
|
||
| worker | (none public) | Internal Asynq status |
|
||
| api | `/metrics` | Prometheus exposition (vmagent scrapes here) |
|
||
| api | `/metrics/legacy` | Custom monitoring metrics for GoAdmin |
|
||
|
||
Health endpoints are **shallow** — they return 200 if the process is
|
||
running and listening. They don't try to reach Postgres/Redis/etc.
|
||
Rationale: if Postgres is briefly down, we don't want all api pods to
|
||
start failing liveness and cascade-restart.
|
||
|
||
## obs.88oakapps.com — the ingest endpoint
|
||
|
||
Public hostname for cross-cluster metric and trace ingest. Cloudflare
|
||
in front, nginx on `88oakappsUpdate` enforces a bearer-token check
|
||
before forwarding to the local VM/Jaeger containers.
|
||
|
||
| Path | Forwards to | Purpose |
|
||
|---|---|---|
|
||
| `/api/v1/write` | `http://127.0.0.1:8428` | Prometheus remote-write (vmagent → VM) |
|
||
| `/v1/traces` | `http://127.0.0.1:4318/v1/traces` | OTLP/HTTP traces (app → Jaeger) |
|
||
| `/health` | (returns 200) | Reachability probe — also requires auth |
|
||
| anything else | 404 | |
|
||
|
||
Token lives at `/etc/honeydue-obs/secrets.env` (mode 0600 on the box)
|
||
and at `OBS_INGEST_TOKEN=` in `deploy/prod.env` (gitignored). To rotate:
|
||
generate a new value, update both ends, restart vmagent.
|
||
|
||
```bash
|
||
# Operator: rotate the bearer token
|
||
NEW=$(openssl rand -hex 32)
|
||
ssh 88oakappsUpdate "sudo sed -i 's|OBS_INGEST_TOKEN=.*|OBS_INGEST_TOKEN=$NEW|' /etc/honeydue-obs/secrets.env"
|
||
ssh 88oakappsUpdate "sudo sed -i 's|Bearer [a-f0-9]\{64\}|Bearer $NEW|' /etc/nginx/sites-available/obs.88oakapps.com && sudo nginx -s reload"
|
||
sed -i.bak "s|^OBS_INGEST_TOKEN=.*|OBS_INGEST_TOKEN=$NEW|" deploy/prod.env
|
||
KUBECONFIG=~/.kube/honeydue.yaml kubectl -n honeydue create secret generic vmagent-remote-write \
|
||
--from-literal=bearer_token=$NEW --dry-run=client -o yaml | kubectl apply -f -
|
||
KUBECONFIG=~/.kube/honeydue.yaml kubectl -n honeydue rollout restart deploy/vmagent
|
||
```
|
||
|
||
## Resource budget
|
||
|
||
| Service | mem_limit | Disk | Retention |
|
||
|---|---|---|---|
|
||
| VictoriaMetrics | 256 MB | 10 GB | 30 days |
|
||
| Jaeger all-in-one (badger) | 256 MB | 10 GB | ~7 days |
|
||
| Grafana OSS | 256 MB | 1 GB | — |
|
||
| vmagent (in k3s) | 256 MB | 512 MB emptyDir | — |
|
||
| **Total** | **~1 GB hard cap** | **~21 GB** | |
|
||
|
||
Resident usage at idle is much lower (~90 MB on the obs side, ~30 MB
|
||
for vmagent). Hard limits exist so a memory leak in any one component
|
||
can't squeeze the cohabiting PostHog stack on `88oakappsUpdate`.
|
||
|
||
## Operator cheat sheet
|
||
|
||
```bash
|
||
# Tail all logs in the namespace
|
||
kubectl logs -n honeydue --all-containers=true --tail=50 -l app.kubernetes.io/part-of=honeydue
|
||
|
||
# Scrape state from vmagent self-metrics
|
||
kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://127.0.0.1:8429/metrics \
|
||
| grep -E "scrapes_total|targets|remotewrite"
|
||
|
||
# Force vmagent to reload scrape config
|
||
kubectl -n honeydue rollout restart deploy/vmagent
|
||
|
||
# Query VictoriaMetrics directly (PromQL)
|
||
ssh 88oakappsUpdate 'curl -s "http://127.0.0.1:8428/api/v1/query?query=histogram_quantile(0.95,sum%20by%20(route,le)(rate(http_request_duration_seconds_bucket%5B5m%5D)))" | python3 -m json.tool'
|
||
|
||
# Restart the obs stack on 88oakappsUpdate
|
||
ssh 88oakappsUpdate 'cd /opt/honeydue-obs && sudo docker compose restart'
|
||
|
||
# Live obs container memory
|
||
ssh 88oakappsUpdate 'sudo docker stats --no-stream | grep honeydue-obs'
|
||
|
||
# Pod resource usage (k3s side)
|
||
kubectl top pods -n honeydue --sort-by=memory
|
||
|
||
# With stern (if installed: brew install stern)
|
||
stern -n honeydue .
|
||
|
||
# Full state dump for a pod (debugging)
|
||
kubectl describe pod -n honeydue <pod> > /tmp/pod-dump.txt
|
||
kubectl logs -n honeydue <pod> > /tmp/pod-logs.txt
|
||
```
|
||
|
||
## Future: what to add and when
|
||
|
||
| Trigger | Add |
|
||
|---|---|
|
||
| First production incident | Grafana alerting (free, data already there) |
|
||
| 10k+ daily users | Loki + Vector for log aggregation |
|
||
| Performance incident the histograms can't explain | Wire OTel exporter → Jaeger from the Go app |
|
||
| CPU pressure on api pods | Pyroscope continuous profiler |
|
||
| Multi-product obs needs | Migrate obs stack to dedicated CX32 ($8/mo) |
|
||
|
||
The overall philosophy: observability is an investment that compounds.
|
||
Add it before you need it, not after. But also don't over-invest at
|
||
idle.
|
||
|
||
## References
|
||
|
||
- [VictoriaMetrics docs][vm]
|
||
- [vmagent kubernetes_sd_configs][vmagent-k8s]
|
||
- [Jaeger all-in-one with badger][jaeger]
|
||
- [prometheus/client_golang][promclient]
|
||
- [Grafana provisioning datasources][gf-prov]
|
||
- [Loki][loki] (future)
|
||
- [Stern (multi-pod log tail)][stern]
|
||
|
||
[vm]: https://docs.victoriametrics.com/single-server-victoriametrics/
|
||
[vmagent-k8s]: https://docs.victoriametrics.com/vmagent.html#kubernetes-monitoring-with-vmagent
|
||
[jaeger]: https://www.jaegertracing.io/docs/1.62/getting-started/#all-in-one
|
||
[promclient]: https://pkg.go.dev/github.com/prometheus/client_golang
|
||
[gf-prov]: https://grafana.com/docs/grafana/latest/administration/provisioning/#datasources
|
||
[loki]: https://grafana.com/oss/loki/
|
||
[stern]: https://github.com/stern/stern
|