docs: rewrite ch15 observability + cross-refs for the live obs stack
ch15 is now an account of what's actually running, not a roadmap for what we'd add: VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate fronted by Cloudflare and bearer-gated nginx, vmagent in-cluster, the internal/prom histogram set, the rollout's NetworkPolicy footprint, the obs.88oakapps.com endpoint shape, the ~$0/700MB resource budget, and a token-rotation runbook. The "what we still don't have" section keeps log aggregation, alerting, and full distributed tracing as the honest gap list. Other touched docs: - 00-overview: \"deliberately absent\" no longer claims we have no metrics — calls out the cross-cluster shape instead. - 14-deployment-process: TL;DR now points at deploy-k3s/scripts/03-deploy.sh (full build + push + apply + obs vmagent), with the manual kubectl-set-image flow kept as the single-service path. Notes the IfNotPresent gotcha that bit us during the rollout. - 16-failure-modes: adds vmagent-can't-reach-obs and Grafana-no-data. - 18-cost: $0 line item for the obs stack on 88oakappsUpdate, with the CX32 migration trigger. - 17/18 README + appendix b: link the new ch15, add the obs cheat sheet block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -115,6 +115,41 @@ kubectl rollout restart deployment/coredns -n kube-system
|
||||
kubectl rollout restart deployment/metrics-server -n kube-system
|
||||
```
|
||||
|
||||
#### vmagent can't reach obs.88oakapps.com
|
||||
|
||||
**Symptom**: dashboards stop updating; vmagent logs show 401 / TLS /
|
||||
network errors against `obs.88oakapps.com`. App is unaffected.
|
||||
**Recovery**: vmagent buffers up to 512 MB locally and replays on
|
||||
reconnect, so brief outages self-heal. If sustained:
|
||||
```bash
|
||||
# Is the obs endpoint up?
|
||||
curl -s -o /dev/null -w "%{http_code}\n" https://obs.88oakapps.com/health \
|
||||
-H "Authorization: Bearer $(grep ^OBS_INGEST_TOKEN= deploy/prod.env | cut -d= -f2)"
|
||||
# 200 = ingest endpoint healthy.
|
||||
|
||||
# Inspect vmagent's failure metric
|
||||
kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://127.0.0.1:8429/metrics \
|
||||
| grep -E "remotewrite_(packets|samples)_dropped|persistentqueue_blocks_dropped"
|
||||
|
||||
# Restart vmagent (forces config reload + drains queue)
|
||||
kubectl -n honeydue rollout restart deploy/vmagent
|
||||
```
|
||||
**If 88oakappsUpdate itself is down** (PostHog runs there too):
|
||||
SSH and check `sudo docker compose -f /opt/honeydue-obs/docker-compose.yml ps`.
|
||||
**Non-critical**: nothing app-facing depends on the obs stack.
|
||||
|
||||
#### Grafana dashboard shows "no data"
|
||||
|
||||
**Possible causes, in order of frequency**:
|
||||
1. New histogram name — query targets a metric the api hasn't emitted
|
||||
yet. Check `kubectl exec deploy/vmagent -- wget -qO- http://api:8000/metrics`
|
||||
for the metric name.
|
||||
2. vmagent isn't scraping (see above).
|
||||
3. Time range is before the obs stack came up (2026-04-25). Adjust
|
||||
the dashboard time picker.
|
||||
4. Cardinality blowup — VM rejected high-label-count series. Check
|
||||
`vm_rows_inserted_total` vs `vm_rows_dropped_total` on the obs box.
|
||||
|
||||
### Networking failures
|
||||
|
||||
#### UFW rule accidentally blocks essential traffic
|
||||
|
||||
Reference in New Issue
Block a user