docs: rewrite ch15 observability + cross-refs for the live obs stack

ch15 is now an account of what's actually running, not a roadmap for
what we'd add: VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate
fronted by Cloudflare and bearer-gated nginx, vmagent in-cluster, the
internal/prom histogram set, the rollout's NetworkPolicy footprint,
the obs.88oakapps.com endpoint shape, the ~$0/700MB resource budget,
and a token-rotation runbook. The "what we still don't have" section
keeps log aggregation, alerting, and full distributed tracing as the
honest gap list.
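
A minimal sketch of how the bearer-gated endpoint could be smoke-tested after a token rotation. It only builds the request and sends nothing. The host is the `obs.88oakapps.com` endpoint named above. The `/api/v1/query` path is VictoriaMetrics' Prometheus-compatible query API, and whether nginx exposes it at that path is an assumption. `build_query_request` is a hypothetical helper, not something in the repo.

```python
import urllib.request
from urllib.parse import urlencode

# Assumption: nginx forwards this path to VictoriaMetrics unchanged.
OBS_BASE_URL = "https://obs.88oakapps.com"


def build_query_request(token: str, promql: str = "up") -> urllib.request.Request:
    """Build (but do not send) a bearer-authenticated instant-query request."""
    url = f"{OBS_BASE_URL}/api/v1/query?{urlencode({'query': promql})}"
    req = urllib.request.Request(url, method="GET")
    # The gate described above is a bearer token checked by nginx.
    req.add_header("Authorization", f"Bearer {token}")
    return req
```

Passing the result to `urllib.request.urlopen` with the old and then the new token would show whether a rotation took effect. The response codes to expect depend on the nginx config, which this commit does not show.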

Other touched docs:
- 00-overview: "deliberately absent" no longer claims we have no
  metrics, and calls out the cross-cluster shape instead.
- 14-deployment-process: TL;DR now points at deploy-k3s/scripts/03-deploy.sh
  (full build + push + apply + obs vmagent), with the manual
  kubectl-set-image flow kept as the single-service path. Notes the
  IfNotPresent gotcha that bit us during the rollout.
- 16-failure-modes: adds vmagent-can't-reach-obs and Grafana-no-data.
- 18-cost: $0 line item for the obs stack on 88oakappsUpdate, with the
  CX32 migration trigger.
- 17/18 README + appendix b: link the new ch15, add the obs cheat
  sheet block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trey t
2026-04-25 15:05:06 -05:00
parent d3708e6c72
commit 77cfcc0b27
8 changed files with 414 additions and 187 deletions
+8 -2
@@ -349,7 +349,11 @@ All protected endpoints require an `Authorization: Token <token>` header.
Production runs on a **3-node K3s HA cluster** on Hetzner Cloud, fronted
by Cloudflare, with Neon Postgres, Backblaze B2, and a self-hosted Gitea
-container registry. See the full deployment book for every detail:
+container registry. Live observability (VictoriaMetrics + Jaeger +
+Grafana) runs on a separate Linode VPS at
+[`grafana.88oakapps.com`](https://grafana.88oakapps.com) and is fed by a
+`vmagent` sidecar in-cluster. See the full deployment book for every
+detail:
**→ [docs/deployment/](./docs/deployment/README.md) — The Deployment Book**
@@ -371,7 +375,9 @@ Quick links:
- **Runbook** — [docs/deployment/17-runbook.md](./docs/deployment/17-runbook.md) — 22 common ops procedures
- **kubectl cheat sheet** — [docs/deployment/appendices/b-commands.md](./docs/deployment/appendices/b-commands.md)
-- **Deploy process** — [docs/deployment/14-deployment-process.md](./docs/deployment/14-deployment-process.md) — build → push → rollout
+- **Deploy process** — [docs/deployment/14-deployment-process.md](./docs/deployment/14-deployment-process.md) — `bash deploy-k3s/scripts/03-deploy.sh` builds → pushes → rolls out
+- **Observability** — [docs/deployment/15-observability.md](./docs/deployment/15-observability.md) — VictoriaMetrics + Jaeger + Grafana on `obs.88oakapps.com`
+- **Observability plan** — [docs/observability-plan.md](./docs/observability-plan.md) — design doc and rollout phases
- **Failure modes** — [docs/deployment/16-failure-modes.md](./docs/deployment/16-failure-modes.md) — what happens when X dies
- **Swarm postmortem** — [docs/deployment/19-postmortem-swarm.md](./docs/deployment/19-postmortem-swarm.md) — why we migrated