# 15 — Observability

## Summary

Production has live metrics and tracing infrastructure as of 2026-04-25.
A self-hosted **VictoriaMetrics + Jaeger + Grafana** stack runs on
`88oakappsUpdate` (Linode VPS, also home to the self-hosted PostHog
deployment). A `vmagent` Deployment in the honeyDue k3s namespace scrapes
the api Pods' `/metrics` endpoint every 15 seconds and remote-writes to
`https://obs.88oakapps.com/api/v1/write`. Grafana is at
`https://grafana.88oakapps.com` with a pre-provisioned RED dashboard.

What we still don't have: log aggregation (Dozzle and `kubectl logs`
fill the niche for now), alerting (no PagerDuty/Slack on errors), and
end-to-end distributed tracing (the OTel SDK is wired and HTTP, B2,
APNs, FCM, and Asynq spans are live, but most service methods don't yet
pass `ctx` to their repositories, so their SQL is untraced).

The whole observability stack costs **$0** incremental and uses ~700 MB
RAM on `88oakappsUpdate` (5% of its free RAM). It runs as a separate
docker-compose project from PostHog so neither product's lifecycle
touches the other.

## What we have

### 1. Metrics — VictoriaMetrics + vmagent

```
honeyDue k3s (Hetzner)                   88oakappsUpdate (Linode)
┌───────────────────────────┐            ┌──────────────────────────┐
│ api Pods (3) :8000/metrics│            │ /opt/honeydue-obs/       │
│   prometheus/client_golang│            │ ┌──────────────────┐     │
│                           │            │ │ VictoriaMetrics  │     │
│ vmagent ──── scrape 15s   │            │ │ 30d retention    │     │
│         remote_write ─────┼────────────┼─→ /api/v1/write    │     │
│         (HTTPS, bearer)   │            │ │ (mem 256 MB)     │     │
└───────────────────────────┘            │ └──────────────────┘     │
                                         └──────────────────────────┘
```

The Go API exposes `/metrics` in Prometheus exposition format. Metrics
are defined in `internal/prom/metrics.go` and registered globally:

| Metric | Labels | Source |
|---|---|---|
| `http_request_duration_seconds` | `route, method, status` | Echo middleware around every handler |
| `gorm_query_duration_seconds` | `table, operation` | GORM before/after callbacks (no ctx threading needed) |
| `b2_upload_duration_seconds` | `bucket, result` | Wrapped `s.backend.Write` in `internal/services/storage_service.go` |
| `b2_upload_bytes_total` | `bucket, result` | Counter alongside the duration histogram |
| `apns_send_duration_seconds` | `result` (`ok`/`bad_token`/`error`) | Wrapped APNs `PushWithContext` in `internal/push/apns.go` |
| `fcm_send_duration_seconds` | `result` | Wrapped FCM HTTP v1 send in `internal/push/fcm.go` |
| `asynq_job_duration_seconds` | `task_type, result` | Asynq middleware around each task handler |
| `go_*`, `process_*` | (standard) | `prometheus/client_golang/prometheus/collectors` defaults |

The previous custom monitoring at `/metrics` was renamed to
`/metrics/legacy` so the canonical `/metrics` emits proper histograms
suitable for `histogram_quantile()` rollups. The legacy endpoint stays
because the GoAdmin dashboard reads it.


#### vmagent in k3s

Lives at `deploy-k3s/manifests/observability/vmagent.yaml`. One replica,
256Mi memory limit, scrapes by Kubernetes pod-discovery filtered to
`app.kubernetes.io/name=api` and remote-writes to
`https://obs.88oakapps.com/api/v1/write` with a bearer token from
`OBS_INGEST_TOKEN` in `deploy/prod.env` (substituted into a Secret at
deploy time).

The agent buffers locally to `/tmp/vmagent` (emptyDir, 512 MB cap), so
brief obs outages don't drop samples; the persistent queue replays on
reconnect.

NetworkPolicies in the honeydue namespace allow egress from vmagent to:
- DNS (kube-dns / coredns)
- Kubernetes API (`10.43.0.0/16:443`) for pod discovery
- api Pods on `10.42.0.0/16:8000`
- The public obs endpoint over `0.0.0.0/0:443`

These rules are scoped tight: vmagent can't reach Postgres, Redis, or
anything else on a port outside this list. Note that the `0.0.0.0/0:443`
rule matches any HTTPS destination, not only the obs endpoint.

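The egress list above can be read as an allow-list of (CIDR, port) pairs. The sketch below models that evaluation with `net/netip` so the effect of each rule is easy to check. It is an illustration only: the rule names and the sample destination addresses are hypothetical, and the DNS rule (which is selector-based in a real NetworkPolicy) is left out.

```go
package main

import (
	"fmt"
	"net/netip"
)

// egressRule is one (CIDR, port) entry from the allow-list.
type egressRule struct {
	name string
	cidr netip.Prefix
	port uint16
}

// rules mirrors the CIDR-based entries in the list above; names are made up.
var rules = []egressRule{
	{"kubernetes-api", netip.MustParsePrefix("10.43.0.0/16"), 443},
	{"api-pods", netip.MustParsePrefix("10.42.0.0/16"), 8000},
	{"public-https", netip.MustParsePrefix("0.0.0.0/0"), 443},
}

// allowed reports the first rule that permits dst:port, if any.
func allowed(dst string, port uint16) (string, bool) {
	addr, err := netip.ParseAddr(dst)
	if err != nil {
		return "", false
	}
	for _, r := range rules {
		if r.port == port && r.cidr.Contains(addr) {
			return r.name, true
		}
	}
	return "", false
}

func main() {
	// Destination addresses below are hypothetical examples.
	checks := []struct {
		dst  string
		port uint16
	}{
		{"10.42.1.7", 8000},    // an api Pod's metrics port
		{"10.42.1.7", 5432},    // Postgres port on a pod IP
		{"203.0.113.10", 443},  // any external HTTPS host
		{"203.0.113.10", 6379}, // Redis port on an external host
	}
	for _, c := range checks {
		name, ok := allowed(c.dst, c.port)
		fmt.Printf("%s:%d allowed=%v rule=%q\n", c.dst, c.port, ok, name)
	}
}
```

In this model the database and cache ports are rejected on every address, while port 443 is accepted for any IPv4 destination because of the `0.0.0.0/0` entry.
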

### 2. Tracing — Jaeger all-in-one

Jaeger 1.62 with badger storage runs alongside VictoriaMetrics. The
collector accepts:
- OTLP/HTTP at `https://obs.88oakapps.com/v1/traces` (bearer-token gated)
- OTLP/gRPC at `:4317` (localhost-only)
- Native Jaeger protocols at `:14268` etc. (localhost-only)

Retention: ~7 days at current scale before badger rotates. UI at
`https://grafana.88oakapps.com` via the Jaeger datasource.

**Status of app-side instrumentation**: the histograms are populating
metrics, and the OTel exporter is wired in `cmd/api/main.go` and
`cmd/worker/main.go`. HTTP, B2, APNs, FCM, and Asynq spans are live. SQL
spans join a request's trace only for service methods that already pass
`ctx` to their repository (see §Distributed tracing). Once that
migration is complete, a request such as `POST /api/auth/login/` will
produce a flame-graph trace with HTTP → handler → SQL → B2 → APNs spans.
Tracking issue: gitea#3.

### 3. Dashboards — Grafana

`https://grafana.88oakapps.com` (Cloudflare-fronted, basic auth via
Grafana itself, admin credentials in `deploy/prod.env`).

Datasources auto-provisioned at container startup from
`/opt/honeydue-obs/data/grafana-provisioning/datasources/datasources.yaml`:
- VictoriaMetrics (Prometheus type, `http://victoriametrics:8428` in-network)
- Jaeger (`http://jaeger:16686` in-network)

Pre-provisioned dashboard: `honeyDue API — RED` at
`/d/honeydue-red`. Top row uses the legacy custom metrics
(`http_endpoint_requests_total`, `http_requests_total`) which started
flowing the moment vmagent attached. Lower rows use the new histograms
(`http_request_duration_seconds_bucket` p50/p95/p99 by route, GORM p95
by table, B2 upload p95, APNs/FCM send p95, Go memory + goroutines).
Lower rows populated immediately after the api rebuild that shipped
`internal/prom`.

### 4. `kubectl logs`

Every container's stdout/stderr is captured by containerd and readable
via kubectl:

```bash
# Live tail from all api pods
kubectl logs -n honeydue -l app.kubernetes.io/name=api -f --prefix

# Last 100 lines
kubectl logs -n honeydue -l app.kubernetes.io/name=api --tail=100

# Previous pod's logs (before the most recent restart)
kubectl logs -n honeydue <pod-name> --previous

# Events (not logs; k8s-level state changes)
kubectl get events -n honeydue --sort-by=.lastTimestamp
```

**Retention**: the kubelet rotates a container's log when it exceeds
10 MB (default). Only the last ~20 MB of logs is retained per container,
on-disk on the node. Once a pod is deleted, its logs are gone.

For persistent log access we'd need aggregation (see §What we still
don't have).

### 5. `kubectl top`

Pod and node resource usage via metrics-server:

```bash
kubectl top nodes
# NAME               CPU(cores)   CPU(%)   MEMORY(bytes)   MEMORY(%)
# ubuntu-8gb-nbg1-1  169m         4%       748Mi           9%

kubectl top pods -n honeydue
```

In-memory only; last few minutes of data. For historical trends use
the Grafana dashboard, which exposes the same data via the `go_*` and
`container_*` (kubelet cAdvisor) metrics.

### 6. Cloudflare Analytics

CF Dashboard → Analytics & Logs. Per-zone aggregate stats:
requests/sec, bandwidth, cache hit ratio, top status codes, top paths,
bot traffic score. Good for spotting macro trends ("suddenly 10× more
502s today") that wouldn't show up in a single-pod sample.

Free tier retention: 7 days of aggregate stats.

### 7. Neon dashboard

Neon console → project → Monitoring: compute utilization (CU-hours),
slow queries, active connections, storage usage. Useful for "is the
DB busy?" and free-tier limit watching. The new
`gorm_query_duration_seconds` histogram covers the application side
of the same question with much better latency tail visibility.

### 8. Kubernetes events

`kubectl get events` shows cluster-level state changes: pod scheduling,
failures, image pulls, probe failures. Useful for post-mortem on
deploys.

Retention: events are stored in etcd but expire after 1 hour by default.

## What we still don't have

### No log aggregation

Individual pod logs are on the node. For multi-pod debugging ("show me
all api pod logs for user X") we have to:

```bash
# Query all at once with stern (if installed)
stern -n honeydue api

# Or per-pod
kubectl logs -n honeydue <pod> | grep user_id=12345
```

This works but doesn't scale across many pods.

**What we'd add**: [Loki](https://grafana.com/oss/loki/) on
`88oakappsUpdate` next to the existing obs stack. Adds ~512 MB RAM
plus a Promtail (or Vector/Alloy) DaemonSet in k3s. Defer until log
search becomes a recurring pain point; `stern` + `grep` is fine at
current pod count.

### No alerting

No PagerDuty, no Slack webhooks, no email on "api is returning 500s."
The operator finds out when users complain.

Cheapest fix path:
1. Grafana alerting (built into Grafana 11): alert rules over the
   existing histograms (e.g., `histogram_quantile(0.95, ...) > 1s`).
   Routes to Slack via webhook. **Zero infra cost.**
2. [Uptime Kuma](https://github.com/louislam/uptime-kuma) on
   `88oakappsUpdate`: pings `/api/health/` from outside the cluster
   every minute; complements the in-cluster view.

We'd want both eventually. Grafana alerting first because the data is
already there.

### Distributed tracing — adoption is in flight

The OTel SDK is **wired** in `cmd/api/main.go` and `cmd/worker/main.go`
and ships traces to Jaeger via `obs.88oakapps.com/v1/traces`. What's
already producing spans:

| Span source | Status |
|---|---|
| `otelecho.Middleware`: span per HTTP request | ✅ live |
| Manual span around `storage_service.Upload` (B2 PutObject) | ✅ live |
| Manual span around APNs `Send` / `SendWithCategory` | ✅ live |
| Manual span around FCM `sendOne` | ✅ live |
| Asynq middleware: span per task type with retry/payload attrs | ✅ live |
| `otelgorm` plugin: span per SQL statement | ✅ plugin registered |

What's still in flight: SQL spans appear in a request's trace **only when
the service method took the request's `ctx` and called
`repo.WithContext(ctx)`** before issuing queries. Every repository now
exposes `WithContext(ctx) *XRepository`, but services need to be
migrated one method at a time.

**Migration pattern:** for each service method on the request hot path,
add `ctx context.Context` as the first arg, change the handler call site
to pass `c.Request().Context()`, and replace `s.repo.X(...)` with
`s.repo.WithContext(ctx).X(...)`. Tests pass `context.Background()`.


Already migrated:
- `TaskService.ListTasks` → `GET /api/tasks/`
- `TaskService.GetTasksByResidence` → `GET /api/tasks/by-residence/:id/`

Remaining: every other public method on `TaskService`, `ResidenceService`,
`ContractorService`, `DocumentService`, `AuthService`,
`NotificationService`, `SubscriptionService`. Mechanical work; can be
done a method at a time without breaking anything (untouched methods
just emit untraced SQL like before).

### No APM (Application Performance Monitoring)

No continuous profiling. We can answer "which endpoint has the highest
p99 latency?" from the histograms, but not "where in the call stack is
the time going?" without ad-hoc `pprof` runs.

If/when needed: Grafana Pyroscope is the OSS continuous profiler that
fits our stack. Adds ~512 MB RAM. Defer until a CPU performance
incident shows up.

## The app's logging conventions

The Go app uses zerolog and emits structured JSON:

```json
{
  "level": "info",
  "time": "2026-04-24T05:29:40Z",
  "caller": "/app/cmd/api/main.go:189",
  "addr": ":8000",
  "message": "HTTP server listening"
}
```

Log levels: `debug`, `info`, `warn`, `error`, `fatal`. Controlled by
`DEBUG=true|false` in the ConfigMap (true sets level to debug, false
sets level to info).

Every request is logged with method, path, status, request_id, user_id
(if authenticated), latency. Queryable by grep today; ready to ingest
into Loki when we add it.

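Because each line is a JSON object, log output can also be filtered structurally rather than with grep. The sketch below decodes lines with `encoding/json` and keeps the ones for a given user. The JSON key names `user_id` and `status` and the sample lines are assumptions based on the field list above; the real keys may differ.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// entry holds the fields this sketch filters on; other keys are ignored.
// The user_id and status key names are assumed, not confirmed.
type entry struct {
	Level   string `json:"level"`
	Message string `json:"message"`
	UserID  int64  `json:"user_id"`
	Status  int    `json:"status"`
}

// sample is made-up log output, including one non-JSON line.
const sample = `{"level":"info","method":"GET","path":"/api/tasks/","status":200,"user_id":12345,"message":"request"}
{"level":"info","addr":":8000","message":"HTTP server listening"}
this line is not JSON
{"level":"error","method":"POST","path":"/api/tasks/","status":500,"user_id":12345,"message":"request"}
{"level":"info","status":200,"user_id":7,"message":"request"}`

// filterByUser returns the entries logged for userID, skipping any line
// that does not decode as JSON.
func filterByUser(logs string, userID int64) []entry {
	var out []entry
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		var e entry
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			continue
		}
		if e.UserID == userID {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	for _, e := range filterByUser(sample, 12345) {
		fmt.Printf("%s status=%d %s\n", e.Level, e.Status, e.Message)
	}
}
```

In practice the input would come from `kubectl logs` or `stern` piped to stdin rather than a constant. Lines without a `user_id` decode to `0`, so filtering on `0` would match every unauthenticated request.
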

## Health endpoints

Each service exposes a health endpoint:

| Service | Endpoint | What it checks |
|---|---|---|
| api | `/api/health/` | Process alive (doesn't verify DB) |
| api | `/api/health/live` | Process alive |
| admin | `/` | Next.js is up |
| worker | (none public) | Internal Asynq status |
| api | `/metrics` | Prometheus exposition (vmagent scrapes here) |
| api | `/metrics/legacy` | Custom monitoring metrics for GoAdmin |

Health endpoints are **shallow**: they return 200 if the process is
running and listening. They don't try to reach Postgres/Redis/etc.
Rationale: if Postgres is briefly down, we don't want all api pods to
start failing liveness and cascade-restart.


## obs.88oakapps.com — the ingest endpoint

Public hostname for cross-cluster metric and trace ingest. Cloudflare
in front, nginx on `88oakappsUpdate` enforces a bearer-token check
before forwarding to the local VM/Jaeger containers.

| Path | Forwards to | Purpose |
|---|---|---|
| `/api/v1/write` | `http://127.0.0.1:8428` | Prometheus remote-write (vmagent → VM) |
| `/v1/traces` | `http://127.0.0.1:4318/v1/traces` | OTLP/HTTP traces (app → Jaeger) |
| `/health` | (returns 200) | Reachability probe, also requires auth |
| anything else | 404 | |


Token lives at `/etc/honeydue-obs/secrets.env` (mode 0600 on the box)
and at `OBS_INGEST_TOKEN=` in `deploy/prod.env` (gitignored). To rotate:
generate a new value, update both ends, restart vmagent.

```bash
# Operator: rotate the bearer token
# Export KUBECONFIG so it applies to both kubectl calls in the pipe below
export KUBECONFIG=~/.kube/honeydue.yaml
NEW=$(openssl rand -hex 32)
ssh 88oakappsUpdate "sudo sed -i 's|OBS_INGEST_TOKEN=.*|OBS_INGEST_TOKEN=$NEW|' /etc/honeydue-obs/secrets.env"
ssh 88oakappsUpdate "sudo sed -i 's|Bearer [a-f0-9]\{64\}|Bearer $NEW|' /etc/nginx/sites-available/obs.88oakapps.com && sudo nginx -s reload"
sed -i.bak "s|^OBS_INGEST_TOKEN=.*|OBS_INGEST_TOKEN=$NEW|" deploy/prod.env
kubectl -n honeydue create secret generic vmagent-remote-write \
  --from-literal=bearer_token="$NEW" --dry-run=client -o yaml | kubectl apply -f -
kubectl -n honeydue rollout restart deploy/vmagent
```

## Resource budget

| Service | mem_limit | Disk | Retention |
|---|---|---|---|
| VictoriaMetrics | 256 MB | 10 GB | 30 days |
| Jaeger all-in-one (badger) | 256 MB | 10 GB | ~7 days |
| Grafana OSS | 256 MB | 1 GB | — |
| vmagent (in k3s) | 256 MB | 512 MB emptyDir | — |
| **Total** | **~1 GB hard cap** | **~21 GB** | |

Resident usage at idle is much lower (~90 MB on the obs side, ~30 MB
for vmagent). Hard limits exist so a memory leak in any one component
can't squeeze the cohabiting PostHog stack on `88oakappsUpdate`.

## Operator cheat sheet

```bash
# Tail all logs in the namespace
kubectl logs -n honeydue --all-containers=true --tail=50 -l app.kubernetes.io/part-of=honeydue

# Scrape state from vmagent self-metrics
kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://127.0.0.1:8429/metrics \
  | grep -E "scrapes_total|targets|remotewrite"

# Force vmagent to reload scrape config
kubectl -n honeydue rollout restart deploy/vmagent

# Query VictoriaMetrics directly (PromQL)
ssh 88oakappsUpdate 'curl -s "http://127.0.0.1:8428/api/v1/query?query=histogram_quantile(0.95,sum%20by%20(route,le)(rate(http_request_duration_seconds_bucket%5B5m%5D)))" | python3 -m json.tool'

# Restart the obs stack on 88oakappsUpdate
ssh 88oakappsUpdate 'cd /opt/honeydue-obs && sudo docker compose restart'

# Live obs container memory
ssh 88oakappsUpdate 'sudo docker stats --no-stream | grep honeydue-obs'

# Pod resource usage (k3s side)
kubectl top pods -n honeydue --sort-by=memory

# With stern (if installed: brew install stern)
stern -n honeydue .

# Full state dump for a pod (debugging)
kubectl describe pod -n honeydue <pod> > /tmp/pod-dump.txt
kubectl logs -n honeydue <pod> > /tmp/pod-logs.txt
```

## Future: what to add and when

| Trigger | Add |
|---|---|
| First production incident | Grafana alerting (free, data already there) |
| 10k+ daily users | Loki + Vector for log aggregation |
| Performance incident the histograms can't explain | Finish the `WithContext(ctx)` migration so SQL spans join request traces in Jaeger |
| CPU pressure on api pods | Pyroscope continuous profiler |
| Multi-product obs needs | Migrate obs stack to dedicated CX32 ($8/mo) |

The overall philosophy: observability is an investment that compounds.
Add it before you need it, not after. But also don't over-invest at
idle.

## References

- [VictoriaMetrics docs][vm]
- [vmagent kubernetes_sd_configs][vmagent-k8s]
- [Jaeger all-in-one with badger][jaeger]
- [prometheus/client_golang][promclient]
- [Grafana provisioning datasources][gf-prov]
- [Loki][loki] (future)
- [Stern (multi-pod log tail)][stern]

[vm]: https://docs.victoriametrics.com/single-server-victoriametrics/
[vmagent-k8s]: https://docs.victoriametrics.com/vmagent.html#kubernetes-monitoring-with-vmagent
[jaeger]: https://www.jaegertracing.io/docs/1.62/getting-started/#all-in-one
[promclient]: https://pkg.go.dev/github.com/prometheus/client_golang
[gf-prov]: https://grafana.com/docs/grafana/latest/administration/provisioning/#datasources
[loki]: https://grafana.com/oss/loki/
[stern]: https://github.com/stern/stern