docs: rewrite ch15 observability + cross-refs for the live obs stack
ch15 is now an account of what's actually running, not a roadmap for what we'd add: VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate fronted by Cloudflare and bearer-gated nginx, vmagent in-cluster, the internal/prom histogram set, the rollout's NetworkPolicy footprint, the obs.88oakapps.com endpoint shape, the ~$0/700MB resource budget, and a token-rotation runbook. The "what we still don't have" section keeps log aggregation, alerting, and full distributed tracing as the honest gap list.

Other touched docs:

- 00-overview: "deliberately absent" no longer claims we have no metrics; it calls out the cross-cluster shape instead.
- 14-deployment-process: TL;DR now points at deploy-k3s/scripts/03-deploy.sh (full build + push + apply + obs vmagent), with the manual kubectl-set-image flow kept as the single-service path. Notes the IfNotPresent gotcha that bit us during the rollout.
- 16-failure-modes: adds vmagent-can't-reach-obs and Grafana-no-data.
- 18-cost: $0 line item for the obs stack on 88oakappsUpdate, with the CX32 migration trigger.
- 17/18 README + appendix b: link the new ch15, add the obs cheat sheet block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -278,6 +278,43 @@ ssh -i ~/.ssh/hetzner deploy@<node> 'sudo systemctl start k3s'
# then re-join via the k3s install command
```

## Observability
```bash
# Hit api /metrics from inside the cluster
kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://api:8000/metrics | head -30

# vmagent self-stats: scrapes succeeded, samples shipped, queue health
kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://127.0.0.1:8429/metrics \
  | grep -E "scrapes_total|targets|remotewrite_samples_dropped|persistentqueue_blocks_dropped"

# Force vmagent to reload config (after editing the ConfigMap)
kubectl -n honeydue rollout restart deploy/vmagent

# Query VictoriaMetrics by SSH'ing to the obs box
ssh 88oakappsUpdate 'curl -s "http://127.0.0.1:8428/api/v1/query?query=up"'

# p95 latency by route, last 5m
ssh 88oakappsUpdate 'curl -s "http://127.0.0.1:8428/api/v1/query?query=histogram_quantile(0.95,sum%20by%20(route,le)(rate(http_request_duration_seconds_bucket%5B5m%5D)))" | python3 -m json.tool'

# All metric names landing in VM
ssh 88oakappsUpdate 'curl -s http://127.0.0.1:8428/api/v1/label/__name__/values | python3 -m json.tool'

# Restart the obs stack on 88oakappsUpdate (VM + Jaeger + Grafana)
ssh 88oakappsUpdate 'cd /opt/honeydue-obs && sudo docker compose restart'

# Live RAM usage of the obs containers
ssh 88oakappsUpdate 'sudo docker stats --no-stream | grep honeydue-obs'

# Test the obs ingest endpoint with auth
TOKEN=$(grep '^OBS_INGEST_TOKEN=' deploy/prod.env | cut -d= -f2-)  # -f2- keeps any "=" inside the token
curl -s -o /dev/null -w "%{http_code}\n" https://obs.88oakapps.com/health \
  -H "Authorization: Bearer $TOKEN"  # 200 = healthy
```
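The ingest check above can also be wrapped in two small helpers. This is a minimal sketch, not part of the committed cheat sheet: the names `read_env_value` and `obs_status_hint` are hypothetical, and only the "200 = healthy" meaning comes from the block above. The other status interpretations are assumptions about how a bearer-gated proxy typically responds.

```bash
# Sketch only: read_env_value and obs_status_hint are hypothetical helper names.

# Print the value of KEY from a KEY=value file. "-f2-" keeps everything after
# the first "=", so values that themselves contain "=" are not truncated.
read_env_value() {
  grep "^$1=" "$2" | head -n 1 | cut -d= -f2-
}

# Map the status code printed by curl's %{http_code} to a hint.
# Only "200 = healthy" comes from the cheat sheet; the rest are assumptions.
obs_status_hint() {
  case "$1" in
    200)     echo "healthy" ;;
    401|403) echo "token rejected: check OBS_INGEST_TOKEN" ;;
    000)     echo "no HTTP response: check DNS, TLS, or the network path" ;;
    5??)     echo "server-side error" ;;
    *)       echo "unexpected status $1" ;;
  esac
}

# Usage (makes a network call, so it is left commented out here):
#   TOKEN=$(read_env_value OBS_INGEST_TOKEN deploy/prod.env)
#   code=$(curl -s -o /dev/null -w "%{http_code}" https://obs.88oakapps.com/health \
#     -H "Authorization: Bearer $TOKEN")
#   obs_status_hint "$code"
```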

Dashboards live at `https://grafana.88oakapps.com/d/honeydue-red`.
Admin credentials are in `deploy/prod.env`.

## One-liners worth memorizing

```bash