docs: rewrite ch15 observability + cross-refs for the live obs stack

ch15 is now an account of what's actually running, not a roadmap for
what we'd add: VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate
fronted by Cloudflare and bearer-gated nginx, vmagent in-cluster, the
internal/prom histogram set, the rollout's NetworkPolicy footprint,
the obs.88oakapps.com endpoint shape, the ~$0/700MB resource budget,
and a token-rotation runbook. The "what we still don't have" section
keeps log aggregation, alerting, and full distributed tracing as the
honest gap list.

Other touched docs:
- 00-overview: "deliberately absent" no longer claims we have no
  metrics — calls out the cross-cluster shape instead.
- 14-deployment-process: TL;DR now points at deploy-k3s/scripts/03-deploy.sh
  (full build + push + apply + obs vmagent), with the manual
  kubectl-set-image flow kept as the single-service path. Notes the
  IfNotPresent gotcha that bit us during the rollout.
- 16-failure-modes: adds vmagent-can't-reach-obs and Grafana-no-data.
- 18-cost: $0 line item for the obs stack on 88oakappsUpdate, with the
  CX32 migration trigger.
- 17/18 README + appendix b: link the new ch15, add the obs cheat
  sheet block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trey t
2026-04-25 15:05:06 -05:00
parent d3708e6c72
commit 77cfcc0b27
8 changed files with 414 additions and 187 deletions
+8 -2
@@ -349,7 +349,11 @@ All protected endpoints require an `Authorization: Token <token>` header.
Production runs on a **3-node K3s HA cluster** on Hetzner Cloud, fronted
by Cloudflare, with Neon Postgres, Backblaze B2, and a self-hosted Gitea
container registry. See the full deployment book for every detail: container registry. Live observability (VictoriaMetrics + Jaeger +
Grafana) runs on a separate Linode VPS at
[`grafana.88oakapps.com`](https://grafana.88oakapps.com) and is fed by a
`vmagent` sidecar in-cluster. See the full deployment book for every
detail:
**→ [docs/deployment/](./docs/deployment/README.md) — The Deployment Book**
@@ -371,7 +375,9 @@ Quick links:
- **Runbook** — [docs/deployment/17-runbook.md](./docs/deployment/17-runbook.md) — 22 common ops procedures
- **kubectl cheat sheet** — [docs/deployment/appendices/b-commands.md](./docs/deployment/appendices/b-commands.md)
- **Deploy process** — [docs/deployment/14-deployment-process.md](./docs/deployment/14-deployment-process.md) — build → push → rollout - **Deploy process** — [docs/deployment/14-deployment-process.md](./docs/deployment/14-deployment-process.md) — `bash deploy-k3s/scripts/03-deploy.sh` builds → pushes → rolls out
- **Observability** — [docs/deployment/15-observability.md](./docs/deployment/15-observability.md) — VictoriaMetrics + Jaeger + Grafana on `obs.88oakapps.com`
- **Observability plan** — [docs/observability-plan.md](./docs/observability-plan.md) — design doc and rollout phases
- **Failure modes** — [docs/deployment/16-failure-modes.md](./docs/deployment/16-failure-modes.md) — what happens when X dies
- **Swarm postmortem** — [docs/deployment/19-postmortem-swarm.md](./docs/deployment/19-postmortem-swarm.md) — why we migrated
+11 -4
@@ -194,10 +194,17 @@ See [Chapter 8](./08-database.md), [9](./09-storage.md), and
until we have Apple Developer / Google Play accounts. The env vars are
set to sentinel values that let the Go app boot; `FEATURE_PUSH_ENABLED=false`
gates all call sites.
- **External metrics/monitoring (Prometheus, Grafana, Betterstack).** - **In-cluster Prometheus / Grafana.** Self-hosted Prometheus-compatible
Right now we rely on `kubectl logs`, `kubectl top`, and Cloudflare's own metrics + tracing + dashboards live **outside** the k3s cluster on
analytics. See [Chapter 15](./15-observability.md) for what's there and `88oakappsUpdate` (the same Linode VPS that hosts PostHog), reached
what we'd add. via `https://obs.88oakapps.com` (Cloudflare-fronted, bearer-gated).
A `vmagent` sidecar in the honeydue namespace scrapes the api Pods
and remote-writes out. This frees ~700 MB of cluster RAM and means
observability survives a k3s control-plane incident. See
[Chapter 15](./15-observability.md).
- **Alerting.** No PagerDuty, Slack hooks, or pages-on-error wired up
yet. Histograms are flowing into Grafana — alert rules on top of them
is the next add. See [Chapter 15 — Future](./15-observability.md).
- **Automated backups of Redis state.** Redis is configured with AOF
(append-only file) persistence, but the PVC is only on one node. Redis
holds only cache + Asynq queue state; losing it re-populates on first
+58 -16
@@ -8,23 +8,62 @@ No downtime if the change is backward-compatible. Rollback is
`kubectl rollout undo`. This chapter walks through the full process,
plus alternate paths (config-only changes, manifest changes, hotfixes).
## TL;DR for a code change ## TL;DR using the unified deploy script
The recommended path. `deploy-k3s/scripts/03-deploy.sh` builds all four
images (api, worker, admin, web), pushes to Gitea, regenerates the
ConfigMap from `config.yaml`, applies every manifest under
`deploy-k3s/manifests/` (including the observability vmagent), and
waits for all rollouts.
```bash
cd /Users/treyt/Desktop/code/honeyDue/honeyDueAPI-go
git add . && git commit -m "..." && git push gitea master
export KUBECONFIG=~/.kube/honeydue.yaml
bash deploy-k3s/scripts/03-deploy.sh # full build + push + rollout
# or, to redeploy without rebuilding:
bash deploy-k3s/scripts/03-deploy.sh --skip-build
# or, to pin a specific tag:
bash deploy-k3s/scripts/03-deploy.sh --tag d3708e6
```
What the script does, in order:
1. Read registry creds from `deploy-k3s/config.yaml`.
2. `docker login gitea.treytartt.com`.
3. Build all four images with `--platform linux/amd64` (so arm64 Macs
don't push images that crash on Hetzner amd64 nodes with
"exec format error").
4. Push to the gitea registry, plus tag and push `:latest`.
5. Generate the env file from `config.yaml` and apply as ConfigMap
`honeydue-config` (uses dry-run + apply for diff-free idempotence).
6. Apply `manifests/namespace.yaml`, `redis/`, `ingress/`,
`api/{deployment,service,hpa}`, `worker/`, `admin/`, `web/`.
7. Apply `manifests/observability/vmagent.yaml`, substituting
`TOKEN_PLACEHOLDER` with `OBS_INGEST_TOKEN` from `deploy/prod.env`
(gitignored). Skipped with a warning if the token isn't present.
8. `kubectl rollout status` for every Deployment, including vmagent.
~7-10 minutes for a full rebuild. ~1-2 minutes with `--skip-build`.
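Step 7 (token substitution into the vmagent manifest) can be sketched as a small filter. This is an illustration only, not the contents of `03-deploy.sh`: the function name `render_vmagent` is hypothetical, and it assumes the token contains only hex characters, so `|` is a safe `sed` delimiter.

```bash
# Hypothetical helper (not taken from 03-deploy.sh): read a manifest on
# stdin and write it to stdout with TOKEN_PLACEHOLDER replaced.
render_vmagent() {
  if [ -z "${OBS_INGEST_TOKEN:-}" ]; then
    # Mirror the "skipped with a warning" behaviour described in step 7.
    echo "WARN: OBS_INGEST_TOKEN not set; skipping vmagent" >&2
    return 1
  fi
  # Assumes a hex token, so it cannot contain the '|' delimiter or '&'.
  sed "s|TOKEN_PLACEHOLDER|${OBS_INGEST_TOKEN}|g"
}

# Usage sketch (paths as named in this chapter):
#   set -a; . deploy/prod.env; set +a
#   if rendered=$(render_vmagent < deploy-k3s/manifests/observability/vmagent.yaml); then
#     printf '%s\n' "$rendered" | kubectl apply -f -
#   fi
```

Rendering to a variable before piping to `kubectl apply` keeps a missing token from applying an empty manifest.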
## TL;DR for a single-service code change (manual)
```bash ```bash
# 1. Commit + get SHA
cd /Users/treyt/Desktop/code/honeyDue/honeyDueAPI-go
git add . && git commit -m "..." && SHA=$(git rev-parse --short HEAD)
# 2. Login to Gitea registry # 2. Login to Gitea registry (creds in config.yaml)
set -a; source deploy/registry.env; set +a docker login gitea.treytartt.com -u admin
printf '%s' "$REGISTRY_TOKEN" | docker login "$REGISTRY" -u "$REGISTRY_USERNAME" --password-stdin
# 3. Build + push amd64 image # 3. Build + push amd64 image
docker buildx build --platform linux/amd64 --target api \ docker build --platform linux/amd64 --target api \
-t "gitea.treytartt.com/admin/honeydue-api:${SHA}" --push . -t "gitea.treytartt.com/admin/honeydue-api:${SHA}" .
docker push "gitea.treytartt.com/admin/honeydue-api:${SHA}"
# 4. Roll it in # 4. Roll it in
export KUBECONFIG=~/.kube/honeydue-k3s.yaml export KUBECONFIG=~/.kube/honeydue.yaml
kubectl set image deployment/api -n honeydue \
api="gitea.treytartt.com/admin/honeydue-api:${SHA}"
@@ -32,11 +71,18 @@ kubectl set image deployment/api -n honeydue \
kubectl rollout status -n honeydue deployment/api
# 6. Log out
docker logout "$REGISTRY" docker logout gitea.treytartt.com
``` ```
~3-5 minutes end to end for api.
> **Gotcha:** Deployments default to `imagePullPolicy: IfNotPresent`,
> which means kubelet won't re-fetch an image with a tag it already
> has cached locally — even if the registry now has different bytes
> at that tag. Always change tags (use the SHA), or temporarily flip
> `imagePullPolicy: Always` and `kubectl rollout restart` if you need
> to overwrite a tag.
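One way to make the "always change tags" rule hard to forget is to build the image reference through a helper that rejects mutable tags. This is a minimal sketch; the function name `image_ref` is hypothetical and is not part of the deploy scripts. The registry path is the one used in the commands above.

```bash
# Hypothetical guard: build an image reference from a commit SHA and
# refuse empty or mutable tags that IfNotPresent would serve from cache.
image_ref() {
  tag="$1"
  case "$tag" in
    ""|latest)
      echo "refusing mutable or empty tag: '${tag}'" >&2
      return 1
      ;;
  esac
  printf 'gitea.treytartt.com/admin/honeydue-api:%s\n' "$tag"
}

# Usage sketch:
#   kubectl set image deployment/api -n honeydue \
#     api="$(image_ref "$(git rev-parse --short HEAD)")"
```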
## The build ## The build
### Step 1 — Prepare ### Step 1 — Prepare
@@ -314,14 +360,10 @@ Contrast: `deploy/scripts/deploy_prod.sh` (Swarm-era) did:
9. Healthcheck the final URL; auto-rollback on failure 9. Healthcheck the final URL; auto-rollback on failure
10. Log out of registries 10. Log out of registries
Our current k3s deploy is more manual but simpler. We'd write a similar The current k3s replacement, `deploy-k3s/scripts/03-deploy.sh`, covers
script for k3s if deploys become frequent: the same ground in fewer steps because Kubernetes does the
versioning/rollout/health bookkeeping natively. See the TL;DR section
```bash at the top of this chapter.
# deploy-k3s/scripts/04-deploy.sh (not yet updated for Gitea)
```
See the scaffold in `deploy-k3s/scripts/`.
## Common deploy failures ## Common deploy failures
+249 -164
@@ -2,15 +2,119 @@
## Summary ## Summary
We have minimal observability today: `kubectl logs`, `kubectl top`, Production has live metrics and tracing infrastructure as of 2026-04-25.
Cloudflare Analytics, and the Neon dashboard. No Prometheus, no Grafana, A self-hosted **VictoriaMetrics + Jaeger + Grafana** stack runs on
no centralized log aggregator, no APM. This is adequate for the `88oakappsUpdate` (Linode VPS, also home to the self-hosted PostHog
current traffic volume (low) but is a known gap. This chapter documents deployment). A `vmagent` sidecar in the honeyDue k3s namespace scrapes
what we *have* and what we'd add as traffic grows. the api Pods' `/metrics` endpoint every 15 seconds and remote-writes to
`https://obs.88oakapps.com/api/v1/write`. Grafana is at
`https://grafana.88oakapps.com` with a pre-provisioned RED dashboard.
What we still don't have: log aggregation (Dozzle and `kubectl logs`
fill the niche for now), alerting (no PagerDuty/Slack on errors), and
full distributed tracing (the Jaeger collector is running, but the OTel
exporter wiring in `cmd/api/main.go` hasn't shipped yet).
The whole observability stack costs **$0** incremental and uses ~700 MB
RAM on `88oakappsUpdate` (5% of its free RAM). It runs as a separate
docker-compose project from PostHog so neither product's lifecycle
touches the other.
## What we have ## What we have
### 1. `kubectl logs` ### 1. Metrics — VictoriaMetrics + vmagent
```
honeyDue k3s (Hetzner) 88oakappsUpdate (Linode)
┌───────────────────────────┐ ┌──────────────────────────┐
│ api Pods (3) :8000/metrics│ │ /opt/honeydue-obs/ │
│ prometheus/client_golang│ │ ┌──────────────────┐ │
│ │ │ │ VictoriaMetrics │ │
│ vmagent ──── scrape 15s │ │ │ 30d retention │ │
│ remote_write ─────┼────────────┼─→ /api/v1/write │ │
│ (HTTPS, bearer) │ │ │ (mem 256 MB) │ │
└───────────────────────────┘ │ └──────────────────┘ │
└──────────────────────────┘
```
The Go API exposes `/metrics` in Prometheus exposition format. Histograms
are defined in `internal/prom/metrics.go` and registered globally:
| Metric | Labels | Source |
|---|---|---|
| `http_request_duration_seconds` | `route, method, status` | Echo middleware around every handler |
| `gorm_query_duration_seconds` | `table, operation` | GORM before/after callbacks (no ctx threading needed) |
| `b2_upload_duration_seconds` | `bucket, result` | Wrapped `s.backend.Write` in `internal/services/storage_service.go` |
| `b2_upload_bytes_total` | `bucket, result` | Counter alongside the duration histogram |
| `apns_send_duration_seconds` | `result` (`ok`/`bad_token`/`error`) | Wrapped APNs `PushWithContext` in `internal/push/apns.go` |
| `fcm_send_duration_seconds` | `result` | Wrapped FCM HTTP v1 send in `internal/push/fcm.go` |
| `asynq_job_duration_seconds` | `task_type, result` | Histograms registered; middleware not yet attached (Step 3) |
| `go_*`, `process_*` | (standard) | `prometheus/client_golang/prometheus/collectors` defaults |
The previous custom monitoring at `/metrics` was renamed to
`/metrics/legacy` so the canonical `/metrics` emits proper histograms
suitable for `histogram_quantile()` rollups. The legacy endpoint stays
because the GoAdmin dashboard reads it.
#### vmagent in k3s
Lives at `deploy-k3s/manifests/observability/vmagent.yaml`. One replica,
`mem_limit: 256Mi`, scrapes by Kubernetes pod-discovery filtered to
`app.kubernetes.io/name=api` and remote-writes to
`https://obs.88oakapps.com/api/v1/write` with a bearer token from
`OBS_INGEST_TOKEN` in `deploy/prod.env` (substituted into a Secret at
deploy time).
The agent buffers locally to `/tmp/vmagent` (emptyDir, 512 MB cap), so
brief obs outages don't drop samples. Persistent queue replays on
reconnect.
NetworkPolicies in the honeydue namespace allow egress from vmagent to:
- DNS (kube-dns / coredns)
- Kubernetes API (`10.43.0.0/16:443`) for pod discovery
- api Pods on `10.42.0.0/16:8000`
- The public obs endpoint over `0.0.0.0/0:443`
These are scoped tight — vmagent can't reach Postgres, Redis, B2, or
any other external service.
### 2. Tracing — Jaeger all-in-one
Jaeger 1.62 with badger storage runs alongside VictoriaMetrics. The
collector accepts:
- OTLP/HTTP at `https://obs.88oakapps.com/v1/traces` (bearer-token gated)
- OTLP/gRPC at `:4317` (localhost-only)
- Native Jaeger protocols at `:14268` etc. (localhost-only)
Retention: ~7 days at current scale before badger rotates. UI at
`https://grafana.88oakapps.com` via the Jaeger datasource.
**Status of app-side instrumentation**: the histograms are populating
metrics. The OTel exporter wiring in `cmd/api/main.go` is **not yet
shipped**. When it does ship, every `POST /api/auth/login/` will produce
a flame-graph trace with HTTP → handler → SQL → B2 → APNs spans.
Tracking issue: gitea#3.
### 3. Dashboards — Grafana
`https://grafana.88oakapps.com` (Cloudflare-fronted, basic auth via
Grafana itself, admin credentials in `deploy/prod.env`).
Datasources auto-provisioned at container startup from
`/opt/honeydue-obs/data/grafana-provisioning/datasources/datasources.yaml`:
- VictoriaMetrics (Prometheus type, `http://victoriametrics:8428` in-network)
- Jaeger (`http://jaeger:16686` in-network)
Pre-provisioned dashboard: `honeyDue API — RED` at
`/d/honeydue-red`. Top row uses the legacy custom metrics
(`http_endpoint_requests_total`, `http_requests_total`) which started
flowing the moment vmagent attached. Lower rows use the new histograms
(`http_request_duration_seconds_bucket` p50/p95/p99 by route, GORM p95
by table, B2 upload p95, APNs/FCM send p95, Go memory + goroutines).
Lower rows populated immediately after the api rebuild that shipped
`internal/prom`.
### 4. `kubectl logs`
Every container's stdout/stderr is captured by containerd and readable
via kubectl:
@@ -33,9 +137,10 @@ kubectl get events -n honeydue --sort-by=.lastTimestamp
Only the last ~20 MB of logs is retained per container, on-disk on the
node. Once a pod is deleted, its logs are gone.
For persistent log access we'd need aggregation (see §what we'd add). For persistent log access we'd need aggregation (see §What we still
don't have).
### 2. `kubectl top` ### 5. `kubectl top`
Pod and node resource usage via metrics-server: Pod and node resource usage via metrics-server:
@@ -43,43 +148,32 @@ Pod and node resource usage via metrics-server:
kubectl top nodes
# NAME CPU(cores) CPU(%) MEMORY(bytes) MEMORY(%)
# ubuntu-8gb-nbg1-1 169m 4% 748Mi 9%
# ubuntu-8gb-nbg1-2 229m 5% 1043Mi 13%
# ubuntu-8gb-nbg1-3 124m 3% 770Mi 9%
kubectl top pods -n honeydue
``` ```
**Retention**: In-memory only. Last few minutes of data. No In-memory only; last few minutes of data. For historical trends use
historical view. the Grafana dashboard, which exposes the same data via the `go_*` and
`container_*` (kubelet cAdvisor) metrics.
### 3. Cloudflare Analytics ### 6. Cloudflare Analytics
CF Dashboard → Analytics & Logs. Per-zone stats: CF Dashboard → Analytics & Logs. Per-zone aggregate stats:
- Requests per second requests/sec, bandwidth, cache hit ratio, top status codes, top paths,
- Bandwidth bot traffic score. Good for spotting macro trends ("suddenly 10× more
- Cache hit ratio 502s today") that wouldn't show up in a single-pod sample.
- Top HTTP status codes
- Top request paths
- Bot traffic score
All aggregated, no individual request traces. Good for spotting macro Free tier retention: 7 days of aggregate stats.
trends ("suddenly 10× more 502s today"), poor for debugging specific
issues.
Free tier retention: 7 days of aggregate stats. Pro extends this. ### 7. Neon dashboard
### 4. Neon dashboard Neon console → project → Monitoring: compute utilization (CU-hours),
slow queries, active connections, storage usage. Useful for "is the
DB busy?" and free-tier limit watching. The new
`gorm_query_duration_seconds` histogram covers the application side
of the same question with much better latency tail visibility.
Neon console → project → Monitoring: ### 8. Kubernetes events
- Compute utilization (CU-hours consumed)
- Query performance (slow queries)
- Active connections
- Storage usage
Good for "is the DB busy?" and "am I close to my free tier limit?"
Not real-time.
### 5. Kubernetes events
`kubectl get events` shows cluster-level state changes: pod scheduling,
failures, image pulls, probe failures. Useful for post-mortem on
@@ -87,7 +181,7 @@ deploys.
Retention: events are stored in etcd but default to 1 hour.
## What we don't have (the gap) ## What we still don't have
### No log aggregation ### No log aggregation
@@ -98,64 +192,55 @@ all api pod logs for user X") we have to:
# Query all at once with stern (if installed)
stern -n honeydue api
# Or for specific pod # Or per-pod
kubectl logs -n honeydue <pod> | grep user_id=12345
``` ```
This works but doesn't scale. Grep across 3 pods for a specific This works but doesn't scale across many pods.
user_id is OK. Across 30 pods, intractable.
**What we'd add**: [Loki](https://grafana.com/oss/loki/) — a lightweight **What we'd add**: [Loki](https://grafana.com/oss/loki/) on
log aggregator designed for k8s. ~$0 to self-host; integrates with `88oakappsUpdate` next to the existing obs stack. Adds ~512 MB RAM
Grafana for queries. Or [Betterstack](https://betterstack.com/logs) plus a Promtail (or Vector/Alloy) DaemonSet in k3s. Defer until log
($10/mo, hosted). search becomes a recurring pain point — `stern` + `grep` is fine at
current pod count.
### No metrics/dashboards
`kubectl top` tells us "is this pod hot right now?" but not "has CPU
been climbing over the past hour?" We'd need:
- **Prometheus** — scrapes metrics from kubelet and pods' `/metrics`
endpoints, stores time series
- **Grafana** — queries Prometheus, renders dashboards
K3s can install these via Helm in ~10 minutes. Adds ~500MB RAM to the
cluster. Stability and operational load: moderate.
**Alternative**: [Kubernetes Dashboard](https://github.com/kubernetes/dashboard)
bundled with k3s (disabled by default). Minimal UI over the existing
metrics API. Cheaper than Prometheus but less queryable.
### No distributed tracing
"This request took 800ms — which hop was slow?" is currently unanswerable
beyond "the DB query, probably." A real trace would show:
- TLS handshake time
- Traefik routing time
- Go handler time
- Postgres query time
- Redis call time
- Each B2 request time
We'd add OpenTelemetry to the Go app and export to Jaeger/Tempo. Work
is moderate; value kicks in when we have complex request flows.
### No alerting
No PagerDuty, no Slack webhooks, no email on "api is returning 500s."
The operator finds out when users complain.
Cheapest fix: [Uptime Kuma](https://github.com/louislam/uptime-kuma) Cheapest fix path:
(self-hosted) or Better Stack Uptime (free for small teams). Ping 1. Grafana alerting (built into Grafana 11) — alert rules over the
`https://api.myhoneydue.com/api/health/` every minute; alert if it fails. existing histograms (e.g., `histogram_quantile(0.95, ...) > 1s`).
Routes to Slack via webhook. **Zero infra cost.**
2. [Uptime Kuma](https://github.com/louislam/uptime-kuma) on
`88oakappsUpdate` — pings `/api/health/` from outside the cluster
every minute; complements the in-cluster view.
We'd want both eventually. Grafana alerting first because the data is
already there.
### Partial distributed tracing
The OTel SDK is **not yet wired** in `cmd/api/main.go`. When it ships:
- `otelecho.Middleware` produces a span per HTTP request
- `otelgorm` plugin produces a span per SQL query (requires threading
`ctx` through repositories — the largest diff in the rollout)
- Manual spans wrap B2 uploads, APNs/FCM sends, asynq jobs
Until then, we have aggregate latency by route from the histograms but
no per-request flame graph. For "why is *this one* request slow" we
still rely on logs + the GORM duration histogram.
### No APM (Application Performance Monitoring) ### No APM (Application Performance Monitoring)
No request-level profiling. We can't see "which endpoint has the highest No continuous profiling. We can answer "which endpoint has the highest
p99 latency?" or "which SQL query is hot this week?" p99 latency?" from the histograms, but not "where in the call stack is
the time going?" without ad-hoc `pprof` runs.
Options: Datadog, New Relic, Honeycomb, self-hosted Tempo+Grafana. If/when needed: Grafana Pyroscope is the OSS continuous profiler that
All are meaningful work to set up and cost $$$. fits our stack. Adds ~512 MB RAM. Defer until a CPU performance
incident shows up.
## The app's logging conventions ## The app's logging conventions
@@ -172,28 +257,12 @@ The Go app uses zerolog and emits structured JSON:
``` ```
Log levels: `debug`, `info`, `warn`, `error`, `fatal`. Controlled by
`DEBUG=true|false` in ConfigMap (true sets level to debug, false sets `DEBUG=true|false` in the ConfigMap (true sets level to debug, false
level to info). sets level to info).
Every request is logged with: Every request is logged with method, path, status, request_id, user_id
- Method, path, status code (if authenticated), latency. Queryable by grep today; ready to ingest
- Request ID (for correlating logs across pods) into Loki when we add it.
- User ID (if authenticated)
- Latency
```json
{
"level": "info",
"method": "GET",
"path": "/api/tasks/",
"status": 200,
"latency_ms": 42,
"user_id": 123,
"request_id": "a6b5db35-..."
}
```
This is queryable by grep. Better with log aggregation.
## Health endpoints ## Health endpoints
@@ -202,71 +271,58 @@ Each service exposes a health endpoint:
| Service | Endpoint | What it checks |
|---|---|---|
| api | `/api/health/` | Process alive (doesn't verify DB) |
| api | `/api/health/live` | Process alive |
| admin | `/` | Next.js is up |
| worker | (none public) | Internal Asynq status |
| api | `/metrics` | Prometheus exposition (vmagent scrapes here) |
| api | `/metrics/legacy` | Custom monitoring metrics for GoAdmin |
Health endpoints are **shallow** — they return 200 if the process is
running and listening. They don't try to reach Postgres/Redis/etc.
Rationale: if Postgres is briefly down, we don't want all api pods to
start failing liveness and cascade-restart.
## Dozzle (deprecated) ## obs.88oakapps.com — the ingest endpoint
The Swarm era had [Dozzle](https://github.com/amir20/dozzle) — a Public hostname for cross-cluster metric and trace ingest. Cloudflare
lightweight web UI for Docker logs. Accessible via SSH tunnel to the in front, nginx on `88oakappsUpdate` enforces a bearer-token check
manager node. Not deployed on k3s; `kubectl logs` + `stern` fills the before forwarding to the local VM/Jaeger containers.
niche.
## Kubernetes metrics the k8s API exposes | Path | Forwards to | Purpose |
|---|---|---|
| `/api/v1/write` | `http://127.0.0.1:8428` | Prometheus remote-write (vmagent → VM) |
| `/v1/traces` | `http://127.0.0.1:4318/v1/traces` | OTLP/HTTP traces (app → Jaeger) |
| `/health` | (returns 200) | Reachability probe — also requires auth |
| anything else | 404 | |
Even without Prometheus, these are queryable: Token lives at `/etc/honeydue-obs/secrets.env` (mode 0600 on the box)
and at `OBS_INGEST_TOKEN=` in `deploy/prod.env` (gitignored). To rotate:
generate a new value, update both ends, restart vmagent.
```bash ```bash
# Resource metrics (via metrics-server) # Operator: rotate the bearer token
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes NEW=$(openssl rand -hex 32)
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/honeydue/pods ssh 88oakappsUpdate "sudo sed -i 's|OBS_INGEST_TOKEN=.*|OBS_INGEST_TOKEN=$NEW|' /etc/honeydue-obs/secrets.env"
ssh 88oakappsUpdate "sudo sed -i 's|Bearer [a-f0-9]\{64\}|Bearer $NEW|' /etc/nginx/sites-available/obs.88oakapps.com && sudo nginx -s reload"
# Core API (k8s state) sed -i.bak "s|^OBS_INGEST_TOKEN=.*|OBS_INGEST_TOKEN=$NEW|" deploy/prod.env
kubectl get --raw /api/v1/namespaces/honeydue/pods/<name> KUBECONFIG=~/.kube/honeydue.yaml kubectl -n honeydue create secret generic vmagent-remote-write \
--from-literal=bearer_token=$NEW --dry-run=client -o yaml | kubectl apply -f -
# Kubelet metrics (per-node; requires tunneling) KUBECONFIG=~/.kube/honeydue.yaml kubectl -n honeydue rollout restart deploy/vmagent
kubectl get --raw /api/v1/nodes/<node>/proxy/metrics
``` ```
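The nginx `sed` in the rotation script only matches a bearer token of exactly 64 lowercase hex characters, so a token of any other shape would leave the nginx config unchanged while the other three locations move on. A pre-flight check can catch that before anything is edited. This is a sketch; the function name `is_valid_obs_token` is hypothetical.

```bash
# Hypothetical pre-flight check: succeed only if the argument is exactly
# 64 lowercase hex characters, the shape `openssl rand -hex 32` produces
# and the shape the nginx sed pattern above expects.
is_valid_obs_token() {
  printf '%s' "$1" | grep -Eq '^[a-f0-9]{64}$'
}

# Usage sketch, before running the rotation commands:
#   NEW=$(openssl rand -hex 32)
#   is_valid_obs_token "$NEW" || { echo "unexpected token shape" >&2; exit 1; }
```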
If we ever spin up Prometheus, these are the endpoints it would scrape. ## Resource budget
## Future: what to add and when | Service | mem_limit | Disk | Retention |
|---|---|---|---|
| VictoriaMetrics | 256 MB | 10 GB | 30 days |
| Jaeger all-in-one (badger) | 256 MB | 10 GB | ~7 days |
| Grafana OSS | 256 MB | 1 GB | — |
| vmagent (in k3s) | 256 MB | 512 MB emptyDir | — |
| **Total** | **~1 GB hard cap** | **~21 GB** | |
| Trigger | Add | Resident usage at idle is much lower (~90 MB on the obs side, ~30 MB
|---|---| for vmagent). Hard limits exist so a memory leak in any one component
| 10k+ daily users | Loki + Grafana for logs | can't squeeze the cohabiting PostHog stack on `88oakappsUpdate`.
| 100+ req/s sustained | Prometheus + Grafana for metrics |
| Performance incidents | OpenTelemetry tracing |
| Revenue > $5k/mo | Paid monitoring (Datadog or similar) |
| First production outage | Alerting to phone/Slack |
The overall philosophy: observability is an investment that compounds.
Add it before you need it, not after. But also don't over-invest at
idle.
**Next quarter**: set up Uptime Kuma + Loki at minimum.
## Checking what's installed
```bash
# In kube-system namespace
kubectl get pods -n kube-system
# Should see: coredns, metrics-server, traefik, local-path-provisioner,
# and some k3s-related helm install jobs
# In honeydue namespace
kubectl get pods -n honeydue
# api, admin, worker, redis
# No monitoring namespace (yet)
kubectl get namespaces
# default, honeydue, kube-node-lease, kube-public, kube-system
```
## Operator cheat sheet ## Operator cheat sheet
@@ -274,32 +330,61 @@ kubectl get namespaces
# Tail all logs in the namespace
kubectl logs -n honeydue --all-containers=true --tail=50 -l app.kubernetes.io/part-of=honeydue
# Scrape state from vmagent self-metrics
kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://127.0.0.1:8429/metrics \
| grep -E "scrapes_total|targets|remotewrite"
# Force vmagent to reload scrape config
kubectl -n honeydue rollout restart deploy/vmagent
# Query VictoriaMetrics directly (PromQL)
ssh 88oakappsUpdate 'curl -s "http://127.0.0.1:8428/api/v1/query?query=histogram_quantile(0.95,sum%20by%20(route,le)(rate(http_request_duration_seconds_bucket%5B5m%5D)))" | python3 -m json.tool'
# Restart the obs stack on 88oakappsUpdate
ssh 88oakappsUpdate 'cd /opt/honeydue-obs && sudo docker compose restart'
# Live obs container memory
ssh 88oakappsUpdate 'sudo docker stats --no-stream | grep honeydue-obs'
# Pod resource usage (k3s side)
kubectl top pods -n honeydue --sort-by=memory
# With stern (if installed: brew install stern)
stern -n honeydue .
# Follow specific pod, including previous runs
kubectl logs -n honeydue <pod> -f --previous=false
# Pod resource usage
kubectl top pods -n honeydue --sort-by=memory
kubectl top pods -n honeydue --sort-by=cpu
# Events (cluster-wide)
kubectl get events -A --sort-by=.lastTimestamp | tail -20
# Full state dump for a pod (debugging)
kubectl describe pod -n honeydue <pod> > /tmp/pod-dump.txt
kubectl logs -n honeydue <pod> > /tmp/pod-logs.txt
``` ```
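The hand-encoded PromQL in the cheat sheet is easy to mistype. A readable alternative is to build the query as plain text and let `curl -G --data-urlencode` do the escaping. This is a sketch; the function name `p95_by_route` is hypothetical, and the expression is the same p95-by-route rollup used in the cheat sheet.

```bash
# Hypothetical helper: print the p95-by-route PromQL for a histogram
# metric, given its base name (without the _bucket suffix).
p95_by_route() {
  printf 'histogram_quantile(0.95, sum by (route, le) (rate(%s_bucket[5m])))\n' "$1"
}

# Usage sketch: curl encodes the query; the expression contains no
# single quotes, so wrapping it in them on the remote side is safe.
#   ssh 88oakappsUpdate "curl -sG http://127.0.0.1:8428/api/v1/query \
#     --data-urlencode 'query=$(p95_by_route http_request_duration_seconds)'" \
#     | python3 -m json.tool
```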
## Future: what to add and when
| Trigger | Add |
|---|---|
| First production incident | Grafana alerting (free, data already there) |
| 10k+ daily users | Loki + Vector for log aggregation |
| Performance incident the histograms can't explain | Wire OTel exporter → Jaeger from the Go app |
| CPU pressure on api pods | Pyroscope continuous profiler |
| Multi-product obs needs | Migrate obs stack to dedicated CX32 ($8/mo) |
The overall philosophy: observability is an investment that compounds,
so add each piece before you need it, not after. But don't over-invest
while the system sits idle.
## References
- [Kubernetes metrics-server][ms]
- [K3s metrics][k3s-metrics]
- [VictoriaMetrics docs][vm]
- [vmagent kubernetes_sd_configs][vmagent-k8s]
- [Jaeger all-in-one with badger][jaeger]
- [prometheus/client_golang][promclient]
- [Grafana provisioning datasources][gf-prov]
- [Loki][loki] (future)
- [Stern (multi-pod log tail)][stern]
[ms]: https://github.com/kubernetes-sigs/metrics-server
[k3s-metrics]: https://docs.k3s.io/advanced#enabling-metrics-server
[vm]: https://docs.victoriametrics.com/single-server-victoriametrics/
[vmagent-k8s]: https://docs.victoriametrics.com/vmagent.html#kubernetes-monitoring-with-vmagent
[jaeger]: https://www.jaegertracing.io/docs/1.62/getting-started/#all-in-one
[promclient]: https://pkg.go.dev/github.com/prometheus/client_golang
[gf-prov]: https://grafana.com/docs/grafana/latest/administration/provisioning/#datasources
[loki]: https://grafana.com/oss/loki/
[stern]: https://github.com/stern/stern
@@ -115,6 +115,41 @@ kubectl rollout restart deployment/coredns -n kube-system
kubectl rollout restart deployment/metrics-server -n kube-system
```
#### vmagent can't reach obs.88oakapps.com
**Symptom**: dashboards stop updating; vmagent logs show 401 / TLS /
network errors against `obs.88oakapps.com`. App is unaffected.
**Recovery**: vmagent buffers up to 512 MB locally and replays on
reconnect, so brief outages self-heal. If sustained:
```bash
# Is the obs endpoint up?
curl -s -o /dev/null -w "%{http_code}\n" https://obs.88oakapps.com/health \
-H "Authorization: Bearer $(grep ^OBS_INGEST_TOKEN= deploy/prod.env | cut -d= -f2)"
# 200 = ingest endpoint healthy.
# Inspect vmagent's failure metric
kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://127.0.0.1:8429/metrics \
| grep -E "remotewrite_(packets|samples)_dropped|persistentqueue_blocks_dropped"
# Restart vmagent (forces config reload + drains queue)
kubectl -n honeydue rollout restart deploy/vmagent
```
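The status code printed by the health check maps roughly onto the symptoms listed above. The following is a sketch only: `classify_obs_status` is a hypothetical helper, and every mapping other than `200` is an assumption about how the nginx and Cloudflare front door behaves, not something taken from the deployed config.

```bash
# classify_obs_status CODE: turn curl's %{http_code} output into a likely cause.
# Hypothetical helper; only "200 = healthy" comes from the runbook itself.
classify_obs_status() {
  case "$1" in
    200)     echo "healthy" ;;
    401|403) echo "auth rejected: check OBS_INGEST_TOKEN on both sides" ;;
    000)     echo "network or TLS: no HTTP response" ;;  # curl prints 000 when no response arrives
    5??)     echo "obs box error: check nginx and VictoriaMetrics" ;;
    *)       echo "unexpected status $1" ;;
  esac
}

# Usage:
#   code=$(curl -s -o /dev/null -w "%{http_code}" https://obs.88oakapps.com/health \
#     -H "Authorization: Bearer $TOKEN")
#   classify_obs_status "$code"
```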
**If 88oakappsUpdate itself is down** (PostHog runs there too):
SSH and check `sudo docker compose -f /opt/honeydue-obs/docker-compose.yml ps`.
**Non-critical**: nothing app-facing depends on the obs stack.
#### Grafana dashboard shows "no data"
**Possible causes, in order of frequency**:
1. New histogram name: the query targets a metric the api hasn't
emitted yet. Check
`kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://api:8000/metrics`
for the metric name.
2. vmagent isn't scraping (see above).
3. Time range is before the obs stack came up (2026-04-25). Adjust
the dashboard time picker.
4. Cardinality blowup — VM rejected high-label-count series. Check
`vm_rows_inserted_total` vs `vm_rows_dropped_total` on the obs box.
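For cause 1, the check can be scripted instead of eyeballed. A minimal sketch, assuming the api's `/metrics` output is piped in; `has_metric` is a hypothetical helper, not part of the repo:

```bash
# has_metric NAME: exit 0 if stdin (Prometheus text exposition) contains at
# least one sample whose metric name is exactly NAME, else exit non-zero.
# Hypothetical helper for illustration only.
has_metric() {
  # The name must be followed by "{" (labels) or a space (bare sample), so
  # "http_request_duration_seconds" does not match "..._seconds_bucket".
  grep -qE "^$1(\{| )"
}

# Usage:
#   kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://api:8000/metrics \
#     | has_metric http_request_duration_seconds_bucket && echo "emitted"
```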
### Networking failures
#### UFW rule accidentally blocks essential traffic
@@ -58,6 +58,20 @@ honeyDue.
|---|---:|
| Gitea container registry | **$0** |
### Observability (88oakappsUpdate)
VictoriaMetrics + Jaeger + Grafana co-tenant on the existing Linode
VPS that hosts PostHog. ~700 MB RAM, 21 GB disk — fits inside the
existing instance. Not charged to honeyDue.
| Item | Monthly |
|---|---:|
| Self-hosted obs stack on `88oakappsUpdate` | **$0** |
Migration trigger: when the obs stack starts pressuring PostHog or
needs hard isolation, move to a dedicated Hetzner CX32 (~$8/mo).
See [Chapter 15 — When to move off](./15-observability.md).
### Total infrastructure
| Category | Monthly |
@@ -67,6 +81,7 @@ honeyDue.
| Storage | ~$0.30 |
| Edge | $0 |
| Registry | $0 |
| Observability | $0 |
| **Total** | **~$30** |
## External SaaS
@@ -48,7 +48,7 @@ they do, and how to operate them.
- [12 — Data Flow](./12-data-flow.md) — end-to-end request lifecycle
- [14 — Deployment Process](./14-deployment-process.md) — how to roll new code
- [15 — Observability](./15-observability.md) — VictoriaMetrics + Jaeger + Grafana on `obs.88oakapps.com`, vmagent in-cluster, Prometheus histograms in the Go API
- [16 — Failure Modes](./16-failure-modes.md) — what happens when X dies
- [17 — Runbook](./17-runbook.md) — common ops tasks
@@ -278,6 +278,43 @@ ssh -i ~/.ssh/hetzner deploy@<node> 'sudo systemctl start k3s'
# then re-join via the k3s install command
```
## Observability
```bash
# Hit api /metrics from inside the cluster
kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://api:8000/metrics | head -30
# vmagent self-stats: scrapes succeeded, samples shipped, queue health
kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://127.0.0.1:8429/metrics \
| grep -E "scrapes_total|targets|remotewrite_samples_dropped|persistentqueue_blocks_dropped"
# Force vmagent to reload config (after editing the ConfigMap)
kubectl -n honeydue rollout restart deploy/vmagent
# Query VictoriaMetrics by SSH'ing to the obs box
ssh 88oakappsUpdate 'curl -s "http://127.0.0.1:8428/api/v1/query?query=up"'
# p95 latency by route, last 5m
ssh 88oakappsUpdate 'curl -s "http://127.0.0.1:8428/api/v1/query?query=histogram_quantile(0.95,sum%20by%20(route,le)(rate(http_request_duration_seconds_bucket%5B5m%5D)))" | python3 -m json.tool'
# All metric names landing in VM
ssh 88oakappsUpdate 'curl -s http://127.0.0.1:8428/api/v1/label/__name__/values | python3 -m json.tool'
# Restart the obs stack on 88oakappsUpdate (VM + Jaeger + Grafana)
ssh 88oakappsUpdate 'cd /opt/honeydue-obs && sudo docker compose restart'
# Live RAM usage of the obs containers
ssh 88oakappsUpdate 'sudo docker stats --no-stream | grep honeydue-obs'
# Test the obs ingest endpoint with auth
TOKEN=$(grep ^OBS_INGEST_TOKEN= deploy/prod.env | cut -d= -f2)
curl -s -o /dev/null -w "%{http_code}\n" https://obs.88oakapps.com/health \
-H "Authorization: Bearer $TOKEN" # 200 = healthy
```
Dashboards live at `https://grafana.88oakapps.com/d/honeydue-red`.
Admin credentials in `deploy/prod.env`.
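The p95 query above is percent-encoded by hand, which is easy to get wrong when editing the PromQL. A possible sketch for generating the URL instead, assuming `python3` is available on the machine running it; `vm_query_url` is a hypothetical helper, and the base URL is the local VictoriaMetrics address used in the commands above:

```bash
# vm_query_url PROMQL: print the VictoriaMetrics instant-query URL with the
# PromQL expression percent-encoded. Hypothetical helper for illustration only.
vm_query_url() {
  python3 -c 'import sys, urllib.parse
base = "http://127.0.0.1:8428/api/v1/query?query="
print(base + urllib.parse.quote(sys.argv[1], safe=""))' "$1"
}

# Usage:
#   url=$(vm_query_url 'histogram_quantile(0.95, sum by (route, le) (rate(http_request_duration_seconds_bucket[5m])))')
#   ssh 88oakappsUpdate "curl -s '$url' | python3 -m json.tool"
```

An alternative that avoids the helper is to let curl do the encoding with `curl -G --data-urlencode 'query=...'`, at the cost of nesting another layer of quotes inside the `ssh` command.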
## One-liners worth memorizing
```bash