docs: rewrite ch15 observability + cross-refs for the live obs stack
ch15 is now an account of what's actually running, not a roadmap for what
we'd add: VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate fronted by
Cloudflare and bearer-gated nginx, vmagent in-cluster, the internal/prom
histogram set, the rollout's NetworkPolicy footprint, the obs.88oakapps.com
endpoint shape, the ~$0/700MB resource budget, and a token-rotation runbook.
The "what we still don't have" section keeps log aggregation, alerting, and
full distributed tracing as the honest gap list.

Other touched docs:
- 00-overview: "deliberately absent" no longer claims we have no metrics;
  it calls out the cross-cluster shape instead.
- 14-deployment-process: TL;DR now points at deploy-k3s/scripts/03-deploy.sh
  (full build + push + apply + obs vmagent), with the manual
  kubectl-set-image flow kept as the single-service path. Notes the
  IfNotPresent gotcha that bit us during the rollout.
- 16-failure-modes: adds vmagent-can't-reach-obs and Grafana-no-data.
- 18-cost: $0 line item for the obs stack on 88oakappsUpdate, with the CX32
  migration trigger.
- 17/18 README + appendix b: link the new ch15, add the obs cheat sheet block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -194,10 +194,17 @@ See [Chapter 8](./08-database.md), [9](./09-storage.md), and
 until we have Apple Developer / Google Play accounts. The env vars are
 set to sentinel values that let the Go app boot; `FEATURE_PUSH_ENABLED=false`
 gates all call sites.
-- **External metrics/monitoring (Prometheus, Grafana, Betterstack).**
-Right now we rely on `kubectl logs`, `kubectl top`, and Cloudflare's own
-analytics. See [Chapter 15](./15-observability.md) for what's there and
-what we'd add.
+- **In-cluster Prometheus / Grafana.** Self-hosted Prometheus-compatible
+metrics + tracing + dashboards live **outside** the k3s cluster on
+`88oakappsUpdate` (the same Linode VPS that hosts PostHog), reached
+via `https://obs.88oakapps.com` (Cloudflare-fronted, bearer-gated).
+A `vmagent` sidecar in the honeydue namespace scrapes the api Pods
+and remote-writes out. This frees ~700 MB of cluster RAM and means
+observability survives a k3s control-plane incident. See
+[Chapter 15](./15-observability.md).
+- **Alerting.** No PagerDuty, Slack hooks, or pages-on-error wired up
+yet. Histograms are flowing into Grafana — alert rules on top of them
+is the next add. See [Chapter 15 — Future](./15-observability.md).
 - **Automated backups of Redis state.** Redis is configured with AOF
 (append-only file) persistence, but the PVC is only on one node. Redis
 holds only cache + Asynq queue state; losing it re-populates on first
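The added bullet describes `https://obs.88oakapps.com` as bearer-gated. As a rough illustration only, the Go sketch below shows what building a bearer-authenticated request to such an endpoint could look like. The path `/api/v1/write` (the Prometheus remote-write path that VictoriaMetrics accepts), the `OBS_TOKEN` environment variable, and the helper name `newObsRequest` are assumptions for illustration and do not come from this commit. In the setup described here it is `vmagent`, not application code, that performs the remote write. The sketch only constructs the request and never sends it.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

// newObsRequest builds (but does not send) a POST request that carries a
// bearer token in the Authorization header. The name and signature are
// hypothetical; they are not part of the repository this commit touches.
func newObsRequest(baseURL, path, token string, body io.Reader) (*http.Request, error) {
	if token == "" {
		// A bearer-gated proxy would reject an unauthenticated request,
		// so fail early instead of building one.
		return nil, fmt.Errorf("empty bearer token")
	}
	url := strings.TrimRight(baseURL, "/") + path
	req, err := http.NewRequest(http.MethodPost, url, body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

func main() {
	// OBS_TOKEN is an assumed variable name used only in this sketch.
	token := os.Getenv("OBS_TOKEN")
	// "/api/v1/write" is assumed; the real endpoint shape is documented in ch15.
	req, err := newObsRequest("https://obs.88oakapps.com", "/api/v1/write", token, bytes.NewReader(nil))
	if err != nil {
		fmt.Println("cannot build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Keeping the token out of the source and failing early on an empty value mirrors the token-rotation concern the commit message mentions: rotating the secret should then need only a new environment value, not a code change.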