diff --git a/README.md b/README.md index 79ae8fd..5366c29 100644 --- a/README.md +++ b/README.md @@ -349,7 +349,11 @@ All protected endpoints require an `Authorization: Token ` header. Production runs on a **3-node K3s HA cluster** on Hetzner Cloud, fronted by Cloudflare, with Neon Postgres, Backblaze B2, and a self-hosted Gitea -container registry. See the full deployment book for every detail: +container registry. Live observability (VictoriaMetrics + Jaeger + +Grafana) runs on a separate Linode VPS at +[`grafana.88oakapps.com`](https://grafana.88oakapps.com) and is fed by a +`vmagent` sidecar in-cluster. See the full deployment book for every +detail: **→ [docs/deployment/](./docs/deployment/README.md) — The Deployment Book** @@ -371,7 +375,9 @@ Quick links: - **Runbook** — [docs/deployment/17-runbook.md](./docs/deployment/17-runbook.md) — 22 common ops procedures - **kubectl cheat sheet** — [docs/deployment/appendices/b-commands.md](./docs/deployment/appendices/b-commands.md) -- **Deploy process** — [docs/deployment/14-deployment-process.md](./docs/deployment/14-deployment-process.md) — build → push → rollout +- **Deploy process** — [docs/deployment/14-deployment-process.md](./docs/deployment/14-deployment-process.md) — `bash deploy-k3s/scripts/03-deploy.sh` builds → pushes → rolls out +- **Observability** — [docs/deployment/15-observability.md](./docs/deployment/15-observability.md) — VictoriaMetrics + Jaeger + Grafana on `obs.88oakapps.com` +- **Observability plan** — [docs/observability-plan.md](./docs/observability-plan.md) — design doc and rollout phases - **Failure modes** — [docs/deployment/16-failure-modes.md](./docs/deployment/16-failure-modes.md) — what happens when X dies - **Swarm postmortem** — [docs/deployment/19-postmortem-swarm.md](./docs/deployment/19-postmortem-swarm.md) — why we migrated diff --git a/docs/deployment/00-overview.md b/docs/deployment/00-overview.md index 2448e8d..4141ec4 100644 --- a/docs/deployment/00-overview.md +++ 
b/docs/deployment/00-overview.md @@ -194,10 +194,17 @@ See [Chapter 8](./08-database.md), [9](./09-storage.md), and until we have Apple Developer / Google Play accounts. The env vars are set to sentinel values that let the Go app boot; `FEATURE_PUSH_ENABLED=false` gates all call sites. -- **External metrics/monitoring (Prometheus, Grafana, Betterstack).** - Right now we rely on `kubectl logs`, `kubectl top`, and Cloudflare's own - analytics. See [Chapter 15](./15-observability.md) for what's there and - what we'd add. +- **In-cluster Prometheus / Grafana.** Self-hosted Prometheus-compatible + metrics + tracing + dashboards live **outside** the k3s cluster on + `88oakappsUpdate` (the same Linode VPS that hosts PostHog), reached + via `https://obs.88oakapps.com` (Cloudflare-fronted, bearer-gated). + A `vmagent` sidecar in the honeydue namespace scrapes the api Pods + and remote-writes out. This frees ~700 MB of cluster RAM and means + observability survives a k3s control-plane incident. See + [Chapter 15](./15-observability.md). +- **Alerting.** No PagerDuty, Slack hooks, or pages-on-error wired up + yet. Histograms are flowing into Grafana — alert rules on top of them + is the next add. See [Chapter 15 — Future](./15-observability.md). - **Automated backups of Redis state.** Redis is configured with AOF (append-only file) persistence, but the PVC is only on one node. Redis holds only cache + Asynq queue state; losing it re-populates on first diff --git a/docs/deployment/14-deployment-process.md b/docs/deployment/14-deployment-process.md index f2021ef..ea22763 100644 --- a/docs/deployment/14-deployment-process.md +++ b/docs/deployment/14-deployment-process.md @@ -8,23 +8,62 @@ No downtime if the change is backward-compatible. Rollback is `kubectl rollout undo`. This chapter walks through the full process, plus alternate paths (config-only changes, manifest changes, hotfixes). 
-## TL;DR for a code change +## TL;DR using the unified deploy script + +The recommended path. `deploy-k3s/scripts/03-deploy.sh` builds all four +images (api, worker, admin, web), pushes to Gitea, regenerates the +ConfigMap from `config.yaml`, applies every manifest under +`deploy-k3s/manifests/` (including the observability vmagent), and +waits for all rollouts. + +```bash +cd /Users/treyt/Desktop/code/honeyDue/honeyDueAPI-go +git add . && git commit -m "..." && git push gitea master + +export KUBECONFIG=~/.kube/honeydue.yaml +bash deploy-k3s/scripts/03-deploy.sh # full build + push + rollout +# or, to redeploy without rebuilding: +bash deploy-k3s/scripts/03-deploy.sh --skip-build +# or, to pin a specific tag: +bash deploy-k3s/scripts/03-deploy.sh --tag d3708e6 +``` + +What the script does, in order: + +1. Read registry creds from `deploy-k3s/config.yaml`. +2. `docker login gitea.treytartt.com`. +3. Build all four images with `--platform linux/amd64` (so arm64 Macs + don't push images that crash on Hetzner amd64 nodes with + "exec format error"). +4. Push to the gitea registry, plus tag and push `:latest`. +5. Generate the env file from `config.yaml` and apply as ConfigMap + `honeydue-config` (uses dry-run + apply for diff-free idempotence). +6. Apply `manifests/namespace.yaml`, `redis/`, `ingress/`, + `api/{deployment,service,hpa}`, `worker/`, `admin/`, `web/`. +7. Apply `manifests/observability/vmagent.yaml`, substituting + `TOKEN_PLACEHOLDER` with `OBS_INGEST_TOKEN` from `deploy/prod.env` + (gitignored). Skipped with a warning if the token isn't present. +8. `kubectl rollout status` for every Deployment, including vmagent. + +~7–10 minutes for a full rebuild. ~1–2 minutes with `--skip-build`. + +## TL;DR for a single-service code change (manual) ```bash # 1. Commit + get SHA cd /Users/treyt/Desktop/code/honeyDue/honeyDueAPI-go git add . && git commit -m "..." && SHA=$(git rev-parse --short HEAD) -# 2. 
Login to Gitea registry -set -a; source deploy/registry.env; set +a -printf '%s' "$REGISTRY_TOKEN" | docker login "$REGISTRY" -u "$REGISTRY_USERNAME" --password-stdin +# 2. Login to Gitea registry (creds in config.yaml) +docker login gitea.treytartt.com -u admin # 3. Build + push amd64 image -docker buildx build --platform linux/amd64 --target api \ - -t "gitea.treytartt.com/admin/honeydue-api:${SHA}" --push . +docker build --platform linux/amd64 --target api \ + -t "gitea.treytartt.com/admin/honeydue-api:${SHA}" . +docker push "gitea.treytartt.com/admin/honeydue-api:${SHA}" # 4. Roll it in -export KUBECONFIG=~/.kube/honeydue-k3s.yaml +export KUBECONFIG=~/.kube/honeydue.yaml kubectl set image deployment/api -n honeydue \ api="gitea.treytartt.com/admin/honeydue-api:${SHA}" @@ -32,11 +71,18 @@ kubectl set image deployment/api -n honeydue \ kubectl rollout status -n honeydue deployment/api # 6. Log out -docker logout "$REGISTRY" +docker logout gitea.treytartt.com ``` ~3–5 minutes end to end for api. +> **Gotcha:** Deployments default to `imagePullPolicy: IfNotPresent`, +> which means kubelet won't re-fetch an image with a tag it already +> has cached locally — even if the registry now has different bytes +> at that tag. Always change tags (use the SHA), or temporarily flip +> `imagePullPolicy: Always` and `kubectl rollout restart` if you need +> to overwrite a tag. + ## The build ### Step 1 — Prepare @@ -314,14 +360,10 @@ Contrast: `deploy/scripts/deploy_prod.sh` (Swarm-era) did: 9. Healthcheck the final URL; auto-rollback on failure 10. Log out of registries -Our current k3s deploy is more manual but simpler. We'd write a similar -script for k3s if deploys become frequent: - -```bash -# deploy-k3s/scripts/04-deploy.sh (not yet updated for Gitea) -``` - -See the scaffold in `deploy-k3s/scripts/`. 
+The current k3s replacement, `deploy-k3s/scripts/03-deploy.sh`, covers +the same ground in fewer steps because Kubernetes does the +versioning/rollout/health bookkeeping natively. See the TL;DR section +at the top of this chapter. ## Common deploy failures diff --git a/docs/deployment/15-observability.md b/docs/deployment/15-observability.md index 84bc5e7..b19e677 100644 --- a/docs/deployment/15-observability.md +++ b/docs/deployment/15-observability.md @@ -2,15 +2,119 @@ ## Summary -We have minimal observability today: `kubectl logs`, `kubectl top`, -Cloudflare Analytics, and the Neon dashboard. No Prometheus, no Grafana, -no centralized log aggregator, no APM. This is adequate for the -current traffic volume (low) but is a known gap. This chapter documents -what we *have* and what we'd add as traffic grows. +Production has live metrics and tracing infrastructure as of 2026-04-25. +A self-hosted **VictoriaMetrics + Jaeger + Grafana** stack runs on +`88oakappsUpdate` (Linode VPS, also home to the self-hosted PostHog +deployment). A `vmagent` sidecar in the honeyDue k3s namespace scrapes +the api Pods' `/metrics` endpoint every 15 seconds and remote-writes to +`https://obs.88oakapps.com/api/v1/write`. Grafana is at +`https://grafana.88oakapps.com` with a pre-provisioned RED dashboard. + +What we still don't have: log aggregation (`stern` and `kubectl logs` +fill the niche for now), alerting (no PagerDuty/Slack on errors), and +distributed tracing (Jaeger is running, but the OTel exporter is not +yet wired into the Go app, so no spans are emitted yet). + +The whole observability stack adds **$0** in incremental cost and uses +~700 MB RAM on `88oakappsUpdate` (5% of its free RAM). It runs as a +separate docker-compose project from PostHog so neither product's +lifecycle touches the other. ## What we have -### 1. `kubectl logs` +### 1. 
Metrics — VictoriaMetrics + vmagent + +``` +honeyDue k3s (Hetzner) 88oakappsUpdate (Linode) +┌───────────────────────────┐ ┌──────────────────────────┐ +│ api Pods (3) :8000/metrics│ │ /opt/honeydue-obs/ │ +│ prometheus/client_golang│ │ ┌──────────────────┐ │ +│ │ │ │ VictoriaMetrics │ │ +│ vmagent ──── scrape 15s │ │ │ 30d retention │ │ +│ remote_write ─────┼────────────┼─→ /api/v1/write │ │ +│ (HTTPS, bearer) │ │ │ (mem 256 MB) │ │ +└───────────────────────────┘ │ └──────────────────┘ │ + └──────────────────────────┘ +``` + +The Go API exposes `/metrics` in Prometheus exposition format. Histograms +are defined in `internal/prom/metrics.go` and registered globally: + +| Metric | Labels | Source | +|---|---|---| +| `http_request_duration_seconds` | `route, method, status` | Echo middleware around every handler | +| `gorm_query_duration_seconds` | `table, operation` | GORM before/after callbacks (no ctx threading needed) | +| `b2_upload_duration_seconds` | `bucket, result` | Wrapped `s.backend.Write` in `internal/services/storage_service.go` | +| `b2_upload_bytes_total` | `bucket, result` | Counter alongside the duration histogram | +| `apns_send_duration_seconds` | `result` (`ok`/`bad_token`/`error`) | Wrapped APNs `PushWithContext` in `internal/push/apns.go` | +| `fcm_send_duration_seconds` | `result` | Wrapped FCM HTTP v1 send in `internal/push/fcm.go` | +| `asynq_job_duration_seconds` | `task_type, result` | Histograms registered; middleware not yet attached (Step 3) | +| `go_*`, `process_*` | (standard) | `prometheus/client_golang/prometheus/collectors` defaults | + +The previous custom monitoring at `/metrics` was renamed to +`/metrics/legacy` so the canonical `/metrics` emits proper histograms +suitable for `histogram_quantile()` rollups. The legacy endpoint stays +because the GoAdmin dashboard reads it. + +#### vmagent in k3s + +Lives at `deploy-k3s/manifests/observability/vmagent.yaml`. 
One replica, +memory limit `256Mi`. It discovers targets through Kubernetes pod +discovery filtered to `app.kubernetes.io/name=api` and remote-writes to +`https://obs.88oakapps.com/api/v1/write` with a bearer token from +`OBS_INGEST_TOKEN` in `deploy/prod.env` (substituted into a Secret at +deploy time). + +The agent buffers locally to `/tmp/vmagent` (emptyDir, 512 MB cap), so +brief obs outages don't drop samples. The persistent queue replays on +reconnect. + +NetworkPolicies in the honeydue namespace allow egress from vmagent to: +- DNS (kube-dns / coredns) +- Kubernetes API (`10.43.0.0/16:443`) for pod discovery +- api Pods on `10.42.0.0/16:8000` +- The public obs endpoint over `0.0.0.0/0:443` + +These are scoped tightly: vmagent can't reach Postgres or Redis. The +`0.0.0.0/0:443` rule does still permit outbound HTTPS to any host, so +external egress is restricted by port rather than by destination. + +### 2. Tracing — Jaeger all-in-one + +Jaeger 1.62 with badger storage runs alongside VictoriaMetrics. The +collector accepts: +- OTLP/HTTP at `https://obs.88oakapps.com/v1/traces` (bearer-token gated) +- OTLP/gRPC at `:4317` (localhost-only) +- Native Jaeger protocols at `:14268` etc. (localhost-only) + +Retention: ~7 days at current scale before badger rotates. UI at +`https://grafana.88oakapps.com` via the Jaeger datasource. + +**Status of app-side instrumentation**: the histograms are populating +metrics. The OTel exporter wiring in `cmd/api/main.go` is **not yet +shipped**. When it does ship, every `POST /api/auth/login/` will produce +a flame-graph trace with HTTP → handler → SQL → B2 → APNs spans. +Tracking issue: gitea#3. + +### 3. Dashboards — Grafana + +`https://grafana.88oakapps.com` (Cloudflare-fronted, basic auth via +Grafana itself, admin credentials in `deploy/prod.env`). 
+ +Datasources auto-provisioned at container startup from +`/opt/honeydue-obs/data/grafana-provisioning/datasources/datasources.yaml`: +- VictoriaMetrics (Prometheus type, `http://victoriametrics:8428` in-network) +- Jaeger (`http://jaeger:16686` in-network) + +Pre-provisioned dashboard: `honeyDue API — RED` at +`/d/honeydue-red`. Top row uses the legacy custom metrics +(`http_endpoint_requests_total`, `http_requests_total`) which started +flowing the moment vmagent attached. Lower rows use the new histograms +(`http_request_duration_seconds_bucket` p50/p95/p99 by route, GORM p95 +by table, B2 upload p95, APNs/FCM send p95, Go memory + goroutines). +Lower rows populated immediately after the api rebuild that shipped +`internal/prom`. + +### 4. `kubectl logs` Every container's stdout/stderr is captured by containerd and readable via kubectl: @@ -33,9 +137,10 @@ kubectl get events -n honeydue --sort-by=.lastTimestamp Only the last ~20 MB of logs is retained per container, on-disk on the node. Once a pod is deleted, its logs are gone. -For persistent log access we'd need aggregation (see §what we'd add). +For persistent log access we'd need aggregation (see §What we still +don't have). -### 2. `kubectl top` +### 5. `kubectl top` Pod and node resource usage via metrics-server: @@ -43,43 +148,32 @@ Pod and node resource usage via metrics-server: kubectl top nodes # NAME CPU(cores) CPU(%) MEMORY(bytes) MEMORY(%) # ubuntu-8gb-nbg1-1 169m 4% 748Mi 9% -# ubuntu-8gb-nbg1-2 229m 5% 1043Mi 13% -# ubuntu-8gb-nbg1-3 124m 3% 770Mi 9% kubectl top pods -n honeydue ``` -**Retention**: In-memory only. Last few minutes of data. No -historical view. +In-memory only; last few minutes of data. For historical trends use +the Grafana dashboard, which exposes the same data via the `go_*` and +`container_*` (kubelet cAdvisor) metrics. -### 3. Cloudflare Analytics +### 6. Cloudflare Analytics -CF Dashboard → Analytics & Logs. 
Per-zone stats: -- Requests per second -- Bandwidth -- Cache hit ratio -- Top HTTP status codes -- Top request paths -- Bot traffic score +CF Dashboard → Analytics & Logs. Per-zone aggregate stats: +requests/sec, bandwidth, cache hit ratio, top status codes, top paths, +bot traffic score. Good for spotting macro trends ("suddenly 10× more +502s today") that wouldn't show up in a single-pod sample. -All aggregated, no individual request traces. Good for spotting macro -trends ("suddenly 10× more 502s today"), poor for debugging specific -issues. +Free tier retention: 7 days of aggregate stats. -Free tier retention: 7 days of aggregate stats. Pro extends this. +### 7. Neon dashboard -### 4. Neon dashboard +Neon console → project → Monitoring: compute utilization (CU-hours), +slow queries, active connections, storage usage. Useful for "is the +DB busy?" and free-tier limit watching. The new +`gorm_query_duration_seconds` histogram covers the application side +of the same question with much better latency tail visibility. -Neon console → project → Monitoring: -- Compute utilization (CU-hours consumed) -- Query performance (slow queries) -- Active connections -- Storage usage - -Good for "is the DB busy?" and "am I close to my free tier limit?" -Not real-time. - -### 5. Kubernetes events +### 8. Kubernetes events `kubectl get events` shows cluster-level state changes: pod scheduling, failures, image pulls, probe failures. Useful for post-mortem on @@ -87,7 +181,7 @@ deploys. Retention: events are stored in etcd but default to 1 hour. -## What we don't have (the gap) +## What we still don't have ### No log aggregation @@ -98,64 +192,55 @@ all api pod logs for user X") we have to: # Query all at once with stern (if installed) stern -n honeydue api -# Or for specific pod +# Or per-pod kubectl logs -n honeydue | grep user_id=12345 ``` -This works but doesn't scale. Grep across 3 pods for a specific -user_id is OK. Across 30 pods, intractable. 
+This works but doesn't scale across many pods. -**What we'd add**: [Loki](https://grafana.com/oss/loki/) — a lightweight -log aggregator designed for k8s. ~$0 to self-host; integrates with -Grafana for queries. Or [Betterstack](https://betterstack.com/logs) -($10/mo, hosted). - -### No metrics/dashboards - -`kubectl top` tells us "is this pod hot right now?" but not "has CPU -been climbing over the past hour?" We'd need: - -- **Prometheus** — scrapes metrics from kubelet and pods' `/metrics` - endpoints, stores time series -- **Grafana** — queries Prometheus, renders dashboards - -K3s can install these via Helm in ~10 minutes. Adds ~500MB RAM to the -cluster. Stability and operational load: moderate. - -**Alternative**: [Kubernetes Dashboard](https://github.com/kubernetes/dashboard) -bundled with k3s (disabled by default). Minimal UI over the existing -metrics API. Cheaper than Prometheus but less queryable. - -### No distributed tracing - -"This request took 800ms — which hop was slow?" is currently unanswerable -beyond "the DB query, probably." A real trace would show: -- TLS handshake time -- Traefik routing time -- Go handler time -- Postgres query time -- Redis call time -- Each B2 request time - -We'd add OpenTelemetry to the Go app and export to Jaeger/Tempo. Work -is moderate; value kicks in when we have complex request flows. +**What we'd add**: [Loki](https://grafana.com/oss/loki/) on +`88oakappsUpdate` next to the existing obs stack. Adds ~512 MB RAM +plus a Promtail (or Vector/Alloy) DaemonSet in k3s. Defer until log +search becomes a recurring pain point — `stern` + `grep` is fine at +current pod count. ### No alerting No PagerDuty, no Slack webhooks, no email on "api is returning 500s." The operator finds out when users complain. -Cheapest fix: [Uptime Kuma](https://github.com/louislam/uptime-kuma) -(self-hosted) or Better Stack Uptime (free for small teams). Ping -`https://api.myhoneydue.com/api/health/` every minute; alert if it fails. 
+Cheapest fix path: +1. Grafana alerting (built into Grafana 11) — alert rules over the + existing histograms (e.g., `histogram_quantile(0.95, ...) > 1s`). + Routes to Slack via webhook. **Zero infra cost.** +2. [Uptime Kuma](https://github.com/louislam/uptime-kuma) on + `88oakappsUpdate` — pings `/api/health/` from outside the cluster + every minute; complements the in-cluster view. + +We'd want both eventually. Grafana alerting first because the data is +already there. + +### Partial distributed tracing + +The OTel SDK is **not yet wired** in `cmd/api/main.go`. When it ships: +- `otelecho.Middleware` produces a span per HTTP request +- `otelgorm` plugin produces a span per SQL query (requires threading + `ctx` through repositories — the largest diff in the rollout) +- Manual spans wrap B2 uploads, APNs/FCM sends, asynq jobs + +Until then, we have aggregate latency by route from the histograms but +no per-request flame graph. For "why is *this one* request slow" we +still rely on logs + the GORM duration histogram. ### No APM (Application Performance Monitoring) -No request-level profiling. We can't see "which endpoint has the highest -p99 latency?" or "which SQL query is hot this week?" +No continuous profiling. We can answer "which endpoint has the highest +p99 latency?" from the histograms, but not "where in the call stack is +the time going?" without ad-hoc `pprof` runs. -Options: Datadog, New Relic, Honeycomb, self-hosted Tempo+Grafana. -All are meaningful work to set up and cost $$$. +If/when needed: Grafana Pyroscope is the OSS continuous profiler that +fits our stack. Adds ~512 MB RAM. Defer until a CPU performance +incident shows up. ## The app's logging conventions @@ -172,28 +257,12 @@ The Go app uses zerolog and emits structured JSON: ``` Log levels: `debug`, `info`, `warn`, `error`, `fatal`. Controlled by -`DEBUG=true|false` in ConfigMap (true sets level to debug, false sets -level to info). 
+`DEBUG=true|false` in the ConfigMap (true sets level to debug, false +sets level to info). -Every request is logged with: -- Method, path, status code -- Request ID (for correlating logs across pods) -- User ID (if authenticated) -- Latency - -```json -{ - "level": "info", - "method": "GET", - "path": "/api/tasks/", - "status": 200, - "latency_ms": 42, - "user_id": 123, - "request_id": "a6b5db35-..." -} -``` - -This is queryable by grep. Better with log aggregation. +Every request is logged with method, path, status, request_id, user_id +(if authenticated), latency. Queryable by grep today; ready to ingest +into Loki when we add it. ## Health endpoints @@ -202,71 +271,58 @@ Each service exposes a health endpoint: | Service | Endpoint | What it checks | |---|---|---| | api | `/api/health/` | Process alive (doesn't verify DB) | +| api | `/api/health/live` | Process alive | | admin | `/` | Next.js is up | | worker | (none public) | Internal Asynq status | +| api | `/metrics` | Prometheus exposition (vmagent scrapes here) | +| api | `/metrics/legacy` | Custom monitoring metrics for GoAdmin | Health endpoints are **shallow** — they return 200 if the process is running and listening. They don't try to reach Postgres/Redis/etc. Rationale: if Postgres is briefly down, we don't want all api pods to start failing liveness and cascade-restart. -## Dozzle (deprecated) +## obs.88oakapps.com — the ingest endpoint -The Swarm era had [Dozzle](https://github.com/amir20/dozzle) — a -lightweight web UI for Docker logs. Accessible via SSH tunnel to the -manager node. Not deployed on k3s; `kubectl logs` + `stern` fills the -niche. +Public hostname for cross-cluster metric and trace ingest. Cloudflare +in front, nginx on `88oakappsUpdate` enforces a bearer-token check +before forwarding to the local VM/Jaeger containers. 
-## Kubernetes metrics the k8s API exposes +| Path | Forwards to | Purpose | +|---|---|---| +| `/api/v1/write` | `http://127.0.0.1:8428` | Prometheus remote-write (vmagent → VM) | +| `/v1/traces` | `http://127.0.0.1:4318/v1/traces` | OTLP/HTTP traces (app → Jaeger) | +| `/health` | (returns 200) | Reachability probe — also requires auth | +| anything else | 404 | | -Even without Prometheus, these are queryable: +Token lives at `/etc/honeydue-obs/secrets.env` (mode 0600 on the box) +and at `OBS_INGEST_TOKEN=` in `deploy/prod.env` (gitignored). To rotate: +generate a new value, update both ends, restart vmagent. ```bash -# Resource metrics (via metrics-server) -kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes -kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/honeydue/pods - -# Core API (k8s state) -kubectl get --raw /api/v1/namespaces/honeydue/pods/ - -# Kubelet metrics (per-node; requires tunneling) -kubectl get --raw /api/v1/nodes//proxy/metrics +# Operator: rotate the bearer token +NEW=$(openssl rand -hex 32) +ssh 88oakappsUpdate "sudo sed -i 's|OBS_INGEST_TOKEN=.*|OBS_INGEST_TOKEN=$NEW|' /etc/honeydue-obs/secrets.env" +ssh 88oakappsUpdate "sudo sed -i 's|Bearer [a-f0-9]\{64\}|Bearer $NEW|' /etc/nginx/sites-available/obs.88oakapps.com && sudo nginx -s reload" +sed -i.bak "s|^OBS_INGEST_TOKEN=.*|OBS_INGEST_TOKEN=$NEW|" deploy/prod.env +KUBECONFIG=~/.kube/honeydue.yaml kubectl -n honeydue create secret generic vmagent-remote-write \ + --from-literal=bearer_token=$NEW --dry-run=client -o yaml | kubectl apply -f - +KUBECONFIG=~/.kube/honeydue.yaml kubectl -n honeydue rollout restart deploy/vmagent ``` -If we ever spin up Prometheus, these are the endpoints it would scrape. 
+## Resource budget -## Future: what to add and when +| Service | mem_limit | Disk | Retention | +|---|---|---|---| +| VictoriaMetrics | 256 MB | 10 GB | 30 days | +| Jaeger all-in-one (badger) | 256 MB | 10 GB | ~7 days | +| Grafana OSS | 256 MB | 1 GB | — | +| vmagent (in k3s) | 256 MB | 512 MB emptyDir | — | +| **Total** | **~1 GB hard cap** | **~21 GB** | | -| Trigger | Add | -|---|---| -| 10k+ daily users | Loki + Grafana for logs | -| 100+ req/s sustained | Prometheus + Grafana for metrics | -| Performance incidents | OpenTelemetry tracing | -| Revenue > $5k/mo | Paid monitoring (Datadog or similar) | -| First production outage | Alerting to phone/Slack | - -The overall philosophy: observability is an investment that compounds. -Add it before you need it, not after. But also don't over-invest at -idle. - -**Next quarter**: set up Uptime Kuma + Loki at minimum. - -## Checking what's installed - -```bash -# In kube-system namespace -kubectl get pods -n kube-system -# Should see: coredns, metrics-server, traefik, local-path-provisioner, -# and some k3s-related helm install jobs - -# In honeydue namespace -kubectl get pods -n honeydue -# api, admin, worker, redis - -# No monitoring namespace (yet) -kubectl get namespaces -# default, honeydue, kube-node-lease, kube-public, kube-system -``` +Resident usage at idle is much lower (~90 MB on the obs side, ~30 MB +for vmagent). Hard limits exist so a memory leak in any one component +can't squeeze the cohabiting PostHog stack on `88oakappsUpdate`. 
## Operator cheat sheet @@ -274,32 +330,61 @@ kubectl get namespaces # Tail all logs in the namespace kubectl logs -n honeydue --all-containers=true --tail=50 -l app.kubernetes.io/part-of=honeydue +# Scrape state from vmagent self-metrics +kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://127.0.0.1:8429/metrics \ + | grep -E "scrapes_total|targets|remotewrite" + +# Force vmagent to reload scrape config +kubectl -n honeydue rollout restart deploy/vmagent + +# Query VictoriaMetrics directly (PromQL) +ssh 88oakappsUpdate 'curl -s "http://127.0.0.1:8428/api/v1/query?query=histogram_quantile(0.95,sum%20by%20(route,le)(rate(http_request_duration_seconds_bucket%5B5m%5D)))" | python3 -m json.tool' + +# Restart the obs stack on 88oakappsUpdate +ssh 88oakappsUpdate 'cd /opt/honeydue-obs && sudo docker compose restart' + +# Live obs container memory +ssh 88oakappsUpdate 'sudo docker stats --no-stream | grep honeydue-obs' + +# Pod resource usage (k3s side) +kubectl top pods -n honeydue --sort-by=memory + # With stern (if installed: brew install stern) stern -n honeydue . 
-# Follow specific pod, including previous runs -kubectl logs -n honeydue -f --previous=false - -# Pod resource usage -kubectl top pods -n honeydue --sort-by=memory -kubectl top pods -n honeydue --sort-by=cpu - -# Events (cluster-wide) -kubectl get events -A --sort-by=.lastTimestamp | tail -20 - # Full state dump for a pod (debugging) kubectl describe pod -n honeydue > /tmp/pod-dump.txt kubectl logs -n honeydue > /tmp/pod-logs.txt ``` +## Future: what to add and when + +| Trigger | Add | +|---|---| +| First production incident | Grafana alerting (free, data already there) | +| 10k+ daily users | Loki + Vector for log aggregation | +| Performance incident the histograms can't explain | Wire OTel exporter → Jaeger from the Go app | +| CPU pressure on api pods | Pyroscope continuous profiler | +| Multi-product obs needs | Migrate obs stack to dedicated CX32 ($8/mo) | + +The overall philosophy: observability is an investment that compounds. +Add it before you need it, not after. But also don't over-invest at +idle. 
+ ## References -- [Kubernetes metrics-server][ms] -- [K3s metrics][k3s-metrics] -- [Loki][loki] +- [VictoriaMetrics docs][vm] +- [vmagent kubernetes_sd_configs][vmagent-k8s] +- [Jaeger all-in-one with badger][jaeger] +- [prometheus/client_golang][promclient] +- [Grafana provisioning datasources][gf-prov] +- [Loki][loki] (future) - [Stern (multi-pod log tail)][stern] -[ms]: https://github.com/kubernetes-sigs/metrics-server -[k3s-metrics]: https://docs.k3s.io/advanced#enabling-metrics-server +[vm]: https://docs.victoriametrics.com/single-server-victoriametrics/ +[vmagent-k8s]: https://docs.victoriametrics.com/vmagent.html#kubernetes-monitoring-with-vmagent +[jaeger]: https://www.jaegertracing.io/docs/1.62/getting-started/#all-in-one +[promclient]: https://pkg.go.dev/github.com/prometheus/client_golang +[gf-prov]: https://grafana.com/docs/grafana/latest/administration/provisioning/#datasources [loki]: https://grafana.com/oss/loki/ [stern]: https://github.com/stern/stern diff --git a/docs/deployment/16-failure-modes.md b/docs/deployment/16-failure-modes.md index ef5585c..ac2d740 100644 --- a/docs/deployment/16-failure-modes.md +++ b/docs/deployment/16-failure-modes.md @@ -115,6 +115,41 @@ kubectl rollout restart deployment/coredns -n kube-system kubectl rollout restart deployment/metrics-server -n kube-system ``` +#### vmagent can't reach obs.88oakapps.com + +**Symptom**: dashboards stop updating; vmagent logs show 401 / TLS / +network errors against `obs.88oakapps.com`. App is unaffected. +**Recovery**: vmagent buffers up to 512 MB locally and replays on +reconnect, so brief outages self-heal. If sustained: +```bash +# Is the obs endpoint up? +curl -s -o /dev/null -w "%{http_code}\n" https://obs.88oakapps.com/health \ + -H "Authorization: Bearer $(grep ^OBS_INGEST_TOKEN= deploy/prod.env | cut -d= -f2)" +# 200 = ingest endpoint healthy. 
+ +# Inspect vmagent's failure metric +kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://127.0.0.1:8429/metrics \ + | grep -E "remotewrite_(packets|samples)_dropped|persistentqueue_blocks_dropped" + +# Restart vmagent (forces config reload + drains queue) +kubectl -n honeydue rollout restart deploy/vmagent +``` +**If 88oakappsUpdate itself is down** (PostHog runs there too): +SSH and check `sudo docker compose -f /opt/honeydue-obs/docker-compose.yml ps`. +**Non-critical**: nothing app-facing depends on the obs stack. + +#### Grafana dashboard shows "no data" + +**Possible causes, in order of frequency**: +1. New histogram name — query targets a metric the api hasn't emitted + yet. Check `kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://api:8000/metrics` + for the metric name. +2. vmagent isn't scraping (see above). +3. Time range is before the obs stack came up (2026-04-25). Adjust + the dashboard time picker. +4. Cardinality blowup — VM rejected high-label-count series. Check + `vm_rows_inserted_total` vs `vm_rows_dropped_total` on the obs box. + ### Networking failures #### UFW rule accidentally blocks essential traffic diff --git a/docs/deployment/18-cost.md b/docs/deployment/18-cost.md index 7015ddf..764c0ce 100644 --- a/docs/deployment/18-cost.md +++ b/docs/deployment/18-cost.md @@ -58,6 +58,20 @@ honeyDue. |---|---:| | Gitea container registry | **$0** | +### Observability (88oakappsUpdate) + +VictoriaMetrics + Jaeger + Grafana run as co-tenants on the existing +Linode VPS that hosts PostHog. ~700 MB RAM, 21 GB disk — fits inside +the existing instance. Not charged to honeyDue. + +| Item | Monthly | +|---|---:| +| Self-hosted obs stack on `88oakappsUpdate` | **$0** | + +Migration trigger: when the obs stack starts pressuring PostHog or +needs hard isolation, move to a dedicated Hetzner CX32 (~$8/mo). +See [Chapter 15 — Future](./15-observability.md). + ### Total infrastructure | Category | Monthly | @@ -67,6 +81,7 @@ honeyDue. 
| Storage | ~$0.30 | | Edge | $0 | | Registry | $0 | +| Observability | $0 | | **Total** | **~$30** | ## External SaaS diff --git a/docs/deployment/README.md b/docs/deployment/README.md index a84f056..69dc7c1 100644 --- a/docs/deployment/README.md +++ b/docs/deployment/README.md @@ -48,7 +48,7 @@ they do, and how to operate them. - [12 — Data Flow](./12-data-flow.md) — end-to-end request lifecycle - [14 — Deployment Process](./14-deployment-process.md) — how to roll new code -- [15 — Observability](./15-observability.md) — logs, metrics, tracing +- [15 — Observability](./15-observability.md) — VictoriaMetrics + Jaeger + Grafana on `obs.88oakapps.com`, vmagent in-cluster, Prometheus histograms in the Go API - [16 — Failure Modes](./16-failure-modes.md) — what happens when X dies - [17 — Runbook](./17-runbook.md) — common ops tasks diff --git a/docs/deployment/appendices/b-commands.md b/docs/deployment/appendices/b-commands.md index baec7b9..497d7a4 100644 --- a/docs/deployment/appendices/b-commands.md +++ b/docs/deployment/appendices/b-commands.md @@ -278,6 +278,43 @@ ssh -i ~/.ssh/hetzner deploy@ 'sudo systemctl start k3s' # then re-join via the k3s install command ``` +## Observability + +```bash +# Hit api /metrics from inside the cluster +kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://api:8000/metrics | head -30 + +# vmagent self-stats: scrapes succeeded, samples shipped, queue health +kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://127.0.0.1:8429/metrics \ + | grep -E "scrapes_total|targets|remotewrite_samples_dropped|persistentqueue_blocks_dropped" + +# Force vmagent to reload config (after editing the ConfigMap) +kubectl -n honeydue rollout restart deploy/vmagent + +# Query VictoriaMetrics by SSH'ing to the obs box +ssh 88oakappsUpdate 'curl -s "http://127.0.0.1:8428/api/v1/query?query=up"' + +# p95 latency by route, last 5m +ssh 88oakappsUpdate 'curl -s 
"http://127.0.0.1:8428/api/v1/query?query=histogram_quantile(0.95,sum%20by%20(route,le)(rate(http_request_duration_seconds_bucket%5B5m%5D)))" | python3 -m json.tool' + +# All metric names landing in VM +ssh 88oakappsUpdate 'curl -s http://127.0.0.1:8428/api/v1/label/__name__/values | python3 -m json.tool' + +# Restart the obs stack on 88oakappsUpdate (VM + Jaeger + Grafana) +ssh 88oakappsUpdate 'cd /opt/honeydue-obs && sudo docker compose restart' + +# Live RAM usage of the obs containers +ssh 88oakappsUpdate 'sudo docker stats --no-stream | grep honeydue-obs' + +# Test the obs ingest endpoint with auth +TOKEN=$(grep ^OBS_INGEST_TOKEN= deploy/prod.env | cut -d= -f2) +curl -s -o /dev/null -w "%{http_code}\n" https://obs.88oakapps.com/health \ + -H "Authorization: Bearer $TOKEN" # 200 = healthy +``` + +Dashboards live at `https://grafana.88oakapps.com/d/honeydue-red`. +Admin credentials in `deploy/prod.env`. + ## One-liners worth memorizing ```bash