docs: rewrite ch15 observability + cross-refs for the live obs stack

ch15 is now an account of what's actually running, not a roadmap for
what we'd add: VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate
fronted by Cloudflare and bearer-gated nginx, vmagent in-cluster, the
internal/prom histogram set, the rollout's NetworkPolicy footprint, the
obs.88oakapps.com endpoint shape, the ~$0/700MB resource budget, and a
token-rotation runbook. The "what we still don't have" section keeps
log aggregation, alerting, and full distributed tracing as the honest
gap list.

Other touched docs:
- 00-overview: "deliberately absent" no longer claims we have no
  metrics; it calls out the cross-cluster shape instead.
- 14-deployment-process: TL;DR now points at
  deploy-k3s/scripts/03-deploy.sh (full build + push + apply + obs
  vmagent), with the manual kubectl-set-image flow kept as the
  single-service path. Notes the IfNotPresent gotcha that bit us during
  the rollout.
- 16-failure-modes: adds vmagent-can't-reach-obs and Grafana-no-data.
- 18-cost: $0 line item for the obs stack on 88oakappsUpdate, with the
  CX32 migration trigger.
- 17/18 README + appendix b: link the new ch15, add the obs cheat sheet
  block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -349,7 +349,11 @@ All protected endpoints require an `Authorization: Token <token>` header.
 
 Production runs on a **3-node K3s HA cluster** on Hetzner Cloud, fronted
 by Cloudflare, with Neon Postgres, Backblaze B2, and a self-hosted Gitea
-container registry. See the full deployment book for every detail:
+container registry. Live observability (VictoriaMetrics + Jaeger +
+Grafana) runs on a separate Linode VPS at
+[`grafana.88oakapps.com`](https://grafana.88oakapps.com) and is fed by a
+`vmagent` Deployment in-cluster. See the full deployment book for every
+detail:
 
 **→ [docs/deployment/](./docs/deployment/README.md) — The Deployment Book**
 
@@ -371,7 +375,9 @@ Quick links:
 
 - **Runbook** — [docs/deployment/17-runbook.md](./docs/deployment/17-runbook.md) — 22 common ops procedures
 - **kubectl cheat sheet** — [docs/deployment/appendices/b-commands.md](./docs/deployment/appendices/b-commands.md)
-- **Deploy process** — [docs/deployment/14-deployment-process.md](./docs/deployment/14-deployment-process.md) — build → push → rollout
+- **Deploy process** — [docs/deployment/14-deployment-process.md](./docs/deployment/14-deployment-process.md) — `bash deploy-k3s/scripts/03-deploy.sh` builds → pushes → rolls out
+- **Observability** — [docs/deployment/15-observability.md](./docs/deployment/15-observability.md) — VictoriaMetrics + Jaeger + Grafana on `obs.88oakapps.com`
+- **Observability plan** — [docs/observability-plan.md](./docs/observability-plan.md) — design doc and rollout phases
 - **Failure modes** — [docs/deployment/16-failure-modes.md](./docs/deployment/16-failure-modes.md) — what happens when X dies
 - **Swarm postmortem** — [docs/deployment/19-postmortem-swarm.md](./docs/deployment/19-postmortem-swarm.md) — why we migrated
 
@@ -194,10 +194,17 @@ See [Chapter 8](./08-database.md), [9](./09-storage.md), and
 until we have Apple Developer / Google Play accounts. The env vars are
 set to sentinel values that let the Go app boot; `FEATURE_PUSH_ENABLED=false`
 gates all call sites.
-- **External metrics/monitoring (Prometheus, Grafana, Betterstack).**
-Right now we rely on `kubectl logs`, `kubectl top`, and Cloudflare's own
-analytics. See [Chapter 15](./15-observability.md) for what's there and
-what we'd add.
+- **In-cluster Prometheus / Grafana.** Self-hosted Prometheus-compatible
+metrics + tracing + dashboards live **outside** the k3s cluster on
+`88oakappsUpdate` (the same Linode VPS that hosts PostHog), reached
+via `https://obs.88oakapps.com` (Cloudflare-fronted, bearer-gated).
+A `vmagent` Deployment in the honeydue namespace scrapes the api Pods
+and remote-writes out. This frees ~700 MB of cluster RAM and means
+observability survives a k3s control-plane incident. See
+[Chapter 15](./15-observability.md).
+- **Alerting.** No PagerDuty, Slack hooks, or pages-on-error wired up
+yet. Histograms are flowing into Grafana — alert rules on top of them
+are the next add. See [Chapter 15 — Future](./15-observability.md).
 - **Automated backups of Redis state.** Redis is configured with AOF
 (append-only file) persistence, but the PVC is only on one node. Redis
 holds only cache + Asynq queue state; losing it re-populates on first
@@ -8,23 +8,62 @@ No downtime if the change is backward-compatible. Rollback is
 `kubectl rollout undo`. This chapter walks through the full process,
 plus alternate paths (config-only changes, manifest changes, hotfixes).
 
-## TL;DR for a code change
+## TL;DR using the unified deploy script
 
+The recommended path. `deploy-k3s/scripts/03-deploy.sh` builds all four
+images (api, worker, admin, web), pushes to Gitea, regenerates the
+ConfigMap from `config.yaml`, applies every manifest under
+`deploy-k3s/manifests/` (including the observability vmagent), and
+waits for all rollouts.
+
+```bash
+cd /Users/treyt/Desktop/code/honeyDue/honeyDueAPI-go
+git add . && git commit -m "..." && git push gitea master
+
+export KUBECONFIG=~/.kube/honeydue.yaml
+bash deploy-k3s/scripts/03-deploy.sh # full build + push + rollout
+# or, to redeploy without rebuilding:
+bash deploy-k3s/scripts/03-deploy.sh --skip-build
+# or, to pin a specific tag:
+bash deploy-k3s/scripts/03-deploy.sh --tag d3708e6
+```
+
+What the script does, in order:
+
+1. Read registry creds from `deploy-k3s/config.yaml`.
+2. `docker login gitea.treytartt.com`.
+3. Build all four images with `--platform linux/amd64` (so arm64 Macs
+don't push images that crash on Hetzner amd64 nodes with
+"exec format error").
+4. Push to the gitea registry, plus tag and push `:latest`.
+5. Generate the env file from `config.yaml` and apply as ConfigMap
+`honeydue-config` (uses dry-run + apply for diff-free idempotence).
+6. Apply `manifests/namespace.yaml`, `redis/`, `ingress/`,
+`api/{deployment,service,hpa}`, `worker/`, `admin/`, `web/`.
+7. Apply `manifests/observability/vmagent.yaml`, substituting
+`TOKEN_PLACEHOLDER` with `OBS_INGEST_TOKEN` from `deploy/prod.env`
+(gitignored). Skipped with a warning if the token isn't present.
+8. `kubectl rollout status` for every Deployment, including vmagent.
+
+~7–10 minutes for a full rebuild. ~1–2 minutes with `--skip-build`.
+
+## TL;DR for a single-service code change (manual)
+
 ```bash
 # 1. Commit + get SHA
 cd /Users/treyt/Desktop/code/honeyDue/honeyDueAPI-go
 git add . && git commit -m "..." && SHA=$(git rev-parse --short HEAD)
 
-# 2. Login to Gitea registry
-set -a; source deploy/registry.env; set +a
-printf '%s' "$REGISTRY_TOKEN" | docker login "$REGISTRY" -u "$REGISTRY_USERNAME" --password-stdin
+# 2. Login to Gitea registry (creds in config.yaml)
+docker login gitea.treytartt.com -u admin
 
 # 3. Build + push amd64 image
-docker buildx build --platform linux/amd64 --target api \
--t "gitea.treytartt.com/admin/honeydue-api:${SHA}" --push .
+docker build --platform linux/amd64 --target api \
+-t "gitea.treytartt.com/admin/honeydue-api:${SHA}" .
+docker push "gitea.treytartt.com/admin/honeydue-api:${SHA}"
 
 # 4. Roll it in
-export KUBECONFIG=~/.kube/honeydue-k3s.yaml
+export KUBECONFIG=~/.kube/honeydue.yaml
 kubectl set image deployment/api -n honeydue \
 api="gitea.treytartt.com/admin/honeydue-api:${SHA}"
 
@@ -32,11 +71,18 @@ kubectl set image deployment/api -n honeydue \
 kubectl rollout status -n honeydue deployment/api
 
 # 6. Log out
-docker logout "$REGISTRY"
+docker logout gitea.treytartt.com
 ```
 
 ~3–5 minutes end to end for api.
 
+> **Gotcha:** Deployments default to `imagePullPolicy: IfNotPresent`,
+> which means kubelet won't re-fetch an image with a tag it already
+> has cached locally — even if the registry now has different bytes
+> at that tag. Always change tags (use the SHA), or temporarily flip
+> `imagePullPolicy: Always` and `kubectl rollout restart` if you need
+> to overwrite a tag.
+
 ## The build
 
 ### Step 1 — Prepare
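The `IfNotPresent` gotcha above comes down to never reusing a tag. As a rough sketch of how that could be enforced before `kubectl set image`, the helper below builds an image reference from a commit SHA and refuses mutable tags. The function name `image_ref` and the `honeydue-<service>` naming for services other than api are assumptions made for illustration; nothing here is taken from `03-deploy.sh`.

```shell
# Hypothetical helper: build an immutable image reference for a service.
# The registry path mirrors the api image shown in this chapter; applying
# the same naming to other services is an assumption.
REGISTRY_PREFIX="gitea.treytartt.com/admin"

image_ref() {
  local service="$1" tag="$2"
  # Reject empty or mutable tags: with IfNotPresent, kubelet will not
  # re-pull a tag that is already cached on the node.
  case "$tag" in
    ""|latest)
      echo "refusing mutable or empty tag: '${tag}'" >&2
      return 1
      ;;
  esac
  # Accept only short or full git SHAs (7 to 40 lowercase hex characters).
  if ! printf '%s' "$tag" | grep -Eq '^[0-9a-f]{7,40}$'; then
    echo "tag does not look like a git SHA: '${tag}'" >&2
    return 1
  fi
  printf '%s/honeydue-%s:%s\n' "$REGISTRY_PREFIX" "$service" "$tag"
}

# Possible usage with the manual flow:
#   kubectl set image deployment/api -n honeydue \
#     api="$(image_ref api "$(git rev-parse --short HEAD)")"
```

Failing fast on `latest` keeps the rollout path on SHA tags only, which is the first remedy the gotcha recommends.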
@@ -314,14 +360,10 @@ Contrast: `deploy/scripts/deploy_prod.sh` (Swarm-era) did:
 9. Healthcheck the final URL; auto-rollback on failure
 10. Log out of registries
 
-Our current k3s deploy is more manual but simpler. We'd write a similar
-script for k3s if deploys become frequent:
-
-```bash
-# deploy-k3s/scripts/04-deploy.sh (not yet updated for Gitea)
-```
-
-See the scaffold in `deploy-k3s/scripts/`.
+The current k3s replacement, `deploy-k3s/scripts/03-deploy.sh`, covers
+the same ground in fewer steps because Kubernetes does the
+versioning/rollout/health bookkeeping natively. See the TL;DR section
+at the top of this chapter.
 
 ## Common deploy failures
 
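Step 7 of the deploy script (substitute `TOKEN_PLACEHOLDER` with `OBS_INGEST_TOKEN` from `deploy/prod.env`, or skip with a warning) could look roughly like the sketch below. This is an illustration under assumptions, not the script's actual code: the function name `render_vmagent_manifest`, the warning text, and the return code are all hypothetical.

```shell
# Hypothetical sketch of the vmagent step; the name and messages are
# illustrative and not taken from 03-deploy.sh.
render_vmagent_manifest() {
  local env_file="$1" manifest="$2" token
  # Take the last OBS_INGEST_TOKEN= line, if any, and strip the key.
  token="$(grep -E '^OBS_INGEST_TOKEN=' "$env_file" 2>/dev/null | tail -n 1 | cut -d= -f2-)"
  if [ -z "$token" ]; then
    echo "warning: OBS_INGEST_TOKEN not set in ${env_file}; skipping vmagent" >&2
    return 2
  fi
  # '|' is a safe sed delimiter for hex tokens such as the ones produced
  # by `openssl rand -hex 32`.
  sed "s|TOKEN_PLACEHOLDER|${token}|g" "$manifest"
}

# Possible usage: apply only when rendering succeeded.
#   if rendered="$(render_vmagent_manifest deploy/prod.env \
#        deploy-k3s/manifests/observability/vmagent.yaml)"; then
#     printf '%s\n' "$rendered" | kubectl apply -f -
#   fi
```

Rendering to stdout keeps the token out of any file in the working tree, which fits `deploy/prod.env` being gitignored.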
+249
-164
@@ -2,15 +2,119 @@
 
 ## Summary
 
-We have minimal observability today: `kubectl logs`, `kubectl top`,
-Cloudflare Analytics, and the Neon dashboard. No Prometheus, no Grafana,
-no centralized log aggregator, no APM. This is adequate for the
-current traffic volume (low) but is a known gap. This chapter documents
-what we *have* and what we'd add as traffic grows.
+Production has live metrics and tracing infrastructure as of 2026-04-25.
+A self-hosted **VictoriaMetrics + Jaeger + Grafana** stack runs on
+`88oakappsUpdate` (Linode VPS, also home to the self-hosted PostHog
+deployment). A `vmagent` Deployment in the honeyDue k3s namespace scrapes
+the api Pods' `/metrics` endpoint every 15 seconds and remote-writes to
+`https://obs.88oakapps.com/api/v1/write`. Grafana is at
+`https://grafana.88oakapps.com` with a pre-provisioned RED dashboard.
+
+What we still don't have: log aggregation (`stern` and `kubectl logs`
+fill the niche for now), alerting (no PagerDuty/Slack on errors), and
+full distributed tracing (the Jaeger collector is running, but the OTel
+SDK is not yet wired into the app code).
+
+The whole observability stack costs **$0** incremental and uses ~700 MB
+RAM on `88oakappsUpdate` (5% of its free RAM). It runs as a separate
+docker-compose project from PostHog so neither product's lifecycle
+touches the other.
 
 ## What we have
 
-### 1. `kubectl logs`
+### 1. Metrics — VictoriaMetrics + vmagent
 
+```
+honeyDue k3s (Hetzner)                   88oakappsUpdate (Linode)
+┌───────────────────────────┐            ┌──────────────────────────┐
+│ api Pods (3) :8000/metrics│            │  /opt/honeydue-obs/      │
+│   prometheus/client_golang│            │  ┌──────────────────┐    │
+│                           │            │  │ VictoriaMetrics  │    │
+│ vmagent ──── scrape 15s   │            │  │ 30d retention    │    │
+│         remote_write ─────┼────────────┼─→  /api/v1/write    │    │
+│         (HTTPS, bearer)   │            │  │ (mem 256 MB)     │    │
+└───────────────────────────┘            │  └──────────────────┘    │
+                                         └──────────────────────────┘
+```
+
+The Go API exposes `/metrics` in Prometheus exposition format. Histograms
+are defined in `internal/prom/metrics.go` and registered globally:
+
+| Metric | Labels | Source |
+|---|---|---|
+| `http_request_duration_seconds` | `route, method, status` | Echo middleware around every handler |
+| `gorm_query_duration_seconds` | `table, operation` | GORM before/after callbacks (no ctx threading needed) |
+| `b2_upload_duration_seconds` | `bucket, result` | Wrapped `s.backend.Write` in `internal/services/storage_service.go` |
+| `b2_upload_bytes_total` | `bucket, result` | Counter alongside the duration histogram |
+| `apns_send_duration_seconds` | `result` (`ok`/`bad_token`/`error`) | Wrapped APNs `PushWithContext` in `internal/push/apns.go` |
+| `fcm_send_duration_seconds` | `result` | Wrapped FCM HTTP v1 send in `internal/push/fcm.go` |
+| `asynq_job_duration_seconds` | `task_type, result` | Histograms registered; middleware not yet attached (Step 3) |
+| `go_*`, `process_*` | (standard) | `prometheus/client_golang/prometheus/collectors` defaults |
+
+The previous custom monitoring at `/metrics` was renamed to
+`/metrics/legacy` so the canonical `/metrics` emits proper histograms
+suitable for `histogram_quantile()` rollups. The legacy endpoint stays
+because the GoAdmin dashboard reads it.
+
+#### vmagent in k3s
+
+Lives at `deploy-k3s/manifests/observability/vmagent.yaml`. One replica,
+`mem_limit: 256Mi`, scrapes by Kubernetes pod-discovery filtered to
+`app.kubernetes.io/name=api` and remote-writes to
+`https://obs.88oakapps.com/api/v1/write` with a bearer token from
+`OBS_INGEST_TOKEN` in `deploy/prod.env` (substituted into a Secret at
+deploy time).
+
+The agent buffers locally to `/tmp/vmagent` (emptyDir, 512 MB cap), so
+brief obs outages don't drop samples. Persistent queue replays on
+reconnect.
+
+NetworkPolicies in the honeydue namespace allow egress from vmagent to:
+- DNS (kube-dns / coredns)
+- Kubernetes API (`10.43.0.0/16:443`) for pod discovery
+- api Pods on `10.42.0.0/16:8000`
+- The public obs endpoint over `0.0.0.0/0:443`
+
+These are scoped tight — vmagent can't reach Postgres, Redis, B2, or
+any other external service.
+
+### 2. Tracing — Jaeger all-in-one
+
+Jaeger 1.62 with badger storage runs alongside VictoriaMetrics. The
+collector accepts:
+- OTLP/HTTP at `https://obs.88oakapps.com/v1/traces` (bearer-token gated)
+- OTLP/gRPC at `:4317` (localhost-only)
+- Native Jaeger protocols at `:14268` etc. (localhost-only)
+
+Retention: ~7 days at current scale before badger rotates. UI at
+`https://grafana.88oakapps.com` via the Jaeger datasource.
+
+**Status of app-side instrumentation**: the histograms are populating
+metrics. The OTel exporter wiring in `cmd/api/main.go` is **not yet
+shipped**. When it does ship, every `POST /api/auth/login/` will produce
+a flame-graph trace with HTTP → handler → SQL → B2 → APNs spans.
+Tracking issue: gitea#3.
+
+### 3. Dashboards — Grafana
+
+`https://grafana.88oakapps.com` (Cloudflare-fronted, basic auth via
+Grafana itself, admin credentials in `deploy/prod.env`).
+
+Datasources auto-provisioned at container startup from
+`/opt/honeydue-obs/data/grafana-provisioning/datasources/datasources.yaml`:
+- VictoriaMetrics (Prometheus type, `http://victoriametrics:8428` in-network)
+- Jaeger (`http://jaeger:16686` in-network)
+
+Pre-provisioned dashboard: `honeyDue API — RED` at
+`/d/honeydue-red`. Top row uses the legacy custom metrics
+(`http_endpoint_requests_total`, `http_requests_total`) which started
+flowing the moment vmagent attached. Lower rows use the new histograms
+(`http_request_duration_seconds_bucket` p50/p95/p99 by route, GORM p95
+by table, B2 upload p95, APNs/FCM send p95, Go memory + goroutines).
+Lower rows populated immediately after the api rebuild that shipped
+`internal/prom`.
+
+### 4. `kubectl logs`
+
 Every container's stdout/stderr is captured by containerd and readable
 via kubectl:
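The rewritten chapter relies on `/metrics` emitting histograms "suitable for `histogram_quantile()` rollups", which is where the dashboard's p50/p95/p99 rows come from. As a rough sketch of the idea, the function below applies the same linear interpolation over cumulative `le` buckets for a single series. It is a simplified illustration: the bucket bounds and counts in the usage comment are made up, and it ignores `rate()`, label aggregation, and several edge cases the real PromQL function handles.

```shell
# Simplified sketch of histogram_quantile() for one series.
# Input on stdin: one "<le> <cumulative_count>" pair per line, sorted by
# le and ending with the +Inf bucket. Bucket values are hypothetical.
histogram_quantile() {
  awk -v q="$1" '
    { le[NR] = $1; cum[NR] = $2 }
    END {
      # Need at least one finite bucket plus +Inf, and some observations.
      if (NR < 2 || cum[NR] == 0) { print "NaN"; exit }
      rank = q * cum[NR]               # target rank among all observations
      b = 1
      while (b < NR && cum[b] < rank) b++
      # Rank falls in the +Inf bucket: report the highest finite bound.
      if (b == NR) { printf "%.4f\n", le[NR - 1]; exit }
      lower = (b == 1) ? 0 : le[b - 1]
      below = (b == 1) ? 0 : cum[b - 1]
      if (cum[b] == below) { printf "%.4f\n", lower; exit }
      # Assume observations are spread evenly inside the bucket.
      printf "%.4f\n", lower + (le[b] - lower) * (rank - below) / (cum[b] - below)
    }'
}

# Example: 100 requests, 98 of them at or under 1s.
#   printf "0.1 50\n0.5 90\n1 98\n+Inf 100\n" | histogram_quantile 0.95
#   prints 0.8125
```

The even-spread assumption inside a bucket is why bucket boundaries matter: a p95 that lands in a wide bucket is only as precise as that bucket's width.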
@@ -33,9 +137,10 @@ kubectl get events -n honeydue --sort-by=.lastTimestamp
 Only the last ~20 MB of logs is retained per container, on-disk on the
 node. Once a pod is deleted, its logs are gone.
 
-For persistent log access we'd need aggregation (see §what we'd add).
+For persistent log access we'd need aggregation (see §What we still
+don't have).
 
-### 2. `kubectl top`
+### 5. `kubectl top`
 
 Pod and node resource usage via metrics-server:
 
@@ -43,43 +148,32 @@ Pod and node resource usage via metrics-server:
 kubectl top nodes
 # NAME                CPU(cores)   CPU(%)   MEMORY(bytes)   MEMORY(%)
 # ubuntu-8gb-nbg1-1   169m         4%       748Mi           9%
-# ubuntu-8gb-nbg1-2   229m         5%       1043Mi          13%
-# ubuntu-8gb-nbg1-3   124m         3%       770Mi           9%
 
 kubectl top pods -n honeydue
 ```
 
-**Retention**: In-memory only. Last few minutes of data. No
-historical view.
+In-memory only; last few minutes of data. For historical trends use
+the Grafana dashboard, which exposes the same data via the `go_*` and
+`container_*` (kubelet cAdvisor) metrics.
 
-### 3. Cloudflare Analytics
+### 6. Cloudflare Analytics
 
-CF Dashboard → Analytics & Logs. Per-zone stats:
-- Requests per second
-- Bandwidth
-- Cache hit ratio
-- Top HTTP status codes
-- Top request paths
-- Bot traffic score
+CF Dashboard → Analytics & Logs. Per-zone aggregate stats:
+requests/sec, bandwidth, cache hit ratio, top status codes, top paths,
+bot traffic score. Good for spotting macro trends ("suddenly 10× more
+502s today") that wouldn't show up in a single-pod sample.
 
-All aggregated, no individual request traces. Good for spotting macro
-trends ("suddenly 10× more 502s today"), poor for debugging specific
-issues.
+Free tier retention: 7 days of aggregate stats.
 
-Free tier retention: 7 days of aggregate stats. Pro extends this.
+### 7. Neon dashboard
 
-### 4. Neon dashboard
+Neon console → project → Monitoring: compute utilization (CU-hours),
+slow queries, active connections, storage usage. Useful for "is the
+DB busy?" and free-tier limit watching. The new
+`gorm_query_duration_seconds` histogram covers the application side
+of the same question with much better latency tail visibility.
 
-Neon console → project → Monitoring:
-- Compute utilization (CU-hours consumed)
-- Query performance (slow queries)
-- Active connections
-- Storage usage
-
-Good for "is the DB busy?" and "am I close to my free tier limit?"
-Not real-time.
-
-### 5. Kubernetes events
+### 8. Kubernetes events
 
 `kubectl get events` shows cluster-level state changes: pod scheduling,
 failures, image pulls, probe failures. Useful for post-mortem on
@@ -87,7 +181,7 @@ deploys.
 
 Retention: events are stored in etcd but default to 1 hour.
 
-## What we don't have (the gap)
+## What we still don't have
 
 ### No log aggregation
 
@@ -98,64 +192,55 @@ all api pod logs for user X") we have to:
 # Query all at once with stern (if installed)
 stern -n honeydue api
 
-# Or for specific pod
+# Or per-pod
 kubectl logs -n honeydue <pod> | grep user_id=12345
 ```
 
-This works but doesn't scale. Grep across 3 pods for a specific
-user_id is OK. Across 30 pods, intractable.
+This works but doesn't scale across many pods.
 
-**What we'd add**: [Loki](https://grafana.com/oss/loki/) — a lightweight
-log aggregator designed for k8s. ~$0 to self-host; integrates with
-Grafana for queries. Or [Betterstack](https://betterstack.com/logs)
-($10/mo, hosted).
+**What we'd add**: [Loki](https://grafana.com/oss/loki/) on
+`88oakappsUpdate` next to the existing obs stack. Adds ~512 MB RAM
+plus a Promtail (or Vector/Alloy) DaemonSet in k3s. Defer until log
+search becomes a recurring pain point — `stern` + `grep` is fine at
+current pod count.
 
-### No metrics/dashboards
-
-`kubectl top` tells us "is this pod hot right now?" but not "has CPU
-been climbing over the past hour?" We'd need:
-
-- **Prometheus** — scrapes metrics from kubelet and pods' `/metrics`
-endpoints, stores time series
-- **Grafana** — queries Prometheus, renders dashboards
-
-K3s can install these via Helm in ~10 minutes. Adds ~500MB RAM to the
-cluster. Stability and operational load: moderate.
-
-**Alternative**: [Kubernetes Dashboard](https://github.com/kubernetes/dashboard)
-bundled with k3s (disabled by default). Minimal UI over the existing
-metrics API. Cheaper than Prometheus but less queryable.
-
-### No distributed tracing
-
-"This request took 800ms — which hop was slow?" is currently unanswerable
-beyond "the DB query, probably." A real trace would show:
-- TLS handshake time
-- Traefik routing time
-- Go handler time
-- Postgres query time
-- Redis call time
-- Each B2 request time
-
-We'd add OpenTelemetry to the Go app and export to Jaeger/Tempo. Work
-is moderate; value kicks in when we have complex request flows.
-
 ### No alerting
 
 No PagerDuty, no Slack webhooks, no email on "api is returning 500s."
 The operator finds out when users complain.
 
-Cheapest fix: [Uptime Kuma](https://github.com/louislam/uptime-kuma)
-(self-hosted) or Better Stack Uptime (free for small teams). Ping
-`https://api.myhoneydue.com/api/health/` every minute; alert if it fails.
+Cheapest fix path:
+1. Grafana alerting (built into Grafana 11) — alert rules over the
+existing histograms (e.g., `histogram_quantile(0.95, ...) > 1s`).
+Routes to Slack via webhook. **Zero infra cost.**
+2. [Uptime Kuma](https://github.com/louislam/uptime-kuma) on
+`88oakappsUpdate` — pings `/api/health/` from outside the cluster
+every minute; complements the in-cluster view.
+
+We'd want both eventually. Grafana alerting first because the data is
+already there.
+
+### Partial distributed tracing
+
+The OTel SDK is **not yet wired** in `cmd/api/main.go`. When it ships:
+- `otelecho.Middleware` produces a span per HTTP request
+- `otelgorm` plugin produces a span per SQL query (requires threading
+`ctx` through repositories — the largest diff in the rollout)
+- Manual spans wrap B2 uploads, APNs/FCM sends, asynq jobs
+
+Until then, we have aggregate latency by route from the histograms but
+no per-request flame graph. For "why is *this one* request slow" we
+still rely on logs + the GORM duration histogram.
 
 ### No APM (Application Performance Monitoring)
 
-No request-level profiling. We can't see "which endpoint has the highest
-p99 latency?" or "which SQL query is hot this week?"
+No continuous profiling. We can answer "which endpoint has the highest
+p99 latency?" from the histograms, but not "where in the call stack is
+the time going?" without ad-hoc `pprof` runs.
 
-Options: Datadog, New Relic, Honeycomb, self-hosted Tempo+Grafana.
-All are meaningful work to set up and cost $$$.
+If/when needed: Grafana Pyroscope is the OSS continuous profiler that
+fits our stack. Adds ~512 MB RAM. Defer until a CPU performance
+incident shows up.
 
 ## The app's logging conventions
 
@@ -172,28 +257,12 @@ The Go app uses zerolog and emits structured JSON:
 ```
 
 Log levels: `debug`, `info`, `warn`, `error`, `fatal`. Controlled by
-`DEBUG=true|false` in ConfigMap (true sets level to debug, false sets
-level to info).
+`DEBUG=true|false` in the ConfigMap (true sets level to debug, false
+sets level to info).
 
-Every request is logged with:
-- Method, path, status code
-- Request ID (for correlating logs across pods)
-- User ID (if authenticated)
-- Latency
-
-```json
-{
-  "level": "info",
-  "method": "GET",
-  "path": "/api/tasks/",
-  "status": 200,
-  "latency_ms": 42,
-  "user_id": 123,
-  "request_id": "a6b5db35-..."
-}
-```
-
-This is queryable by grep. Better with log aggregation.
+Every request is logged with method, path, status, request_id, user_id
+(if authenticated), latency. Queryable by grep today; ready to ingest
+into Loki when we add it.
 
 ## Health endpoints
 
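Since request logs are "queryable by grep today", a small filter can make per-user searches less error-prone. The sketch below is only an illustration: the function name `filter_user_logs` is hypothetical, and it assumes the logs are zerolog's compact JSON with a numeric `user_id` field, as in the example this hunk removes.

```shell
# Hypothetical helper: keep only JSON log lines for one user_id.
# Assumes compact JSON such as {"level":"info",...,"user_id":123,...}.
filter_user_logs() {
  local user_id="$1"
  # Anchor on the closing delimiter so 123 does not also match 1234.
  grep -E "\"user_id\":[[:space:]]*${user_id}([,}]|\$)"
}

# Possible usage:
#   kubectl logs -n honeydue <pod> | filter_user_logs 12345
```

A plain substring grep would also return users whose IDs merely start with the same digits, which becomes noisy as IDs grow.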
@@ -202,71 +271,58 @@ Each service exposes a health endpoint:
|
|||||||
| Service | Endpoint | What it checks |
|
| Service | Endpoint | What it checks |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
| api | `/api/health/` | Process alive (doesn't verify DB) |
|
| api | `/api/health/` | Process alive (doesn't verify DB) |
|
||||||
|
| api | `/api/health/live` | Process alive |
|
||||||
| admin | `/` | Next.js is up |
|
| admin | `/` | Next.js is up |
|
||||||
| worker | (none public) | Internal Asynq status |
|
| worker | (none public) | Internal Asynq status |
|
||||||
|
| api | `/metrics` | Prometheus exposition (vmagent scrapes here) |
|
||||||
|
| api | `/metrics/legacy` | Custom monitoring metrics for GoAdmin |
|
||||||
|
|
||||||
Health endpoints are **shallow** — they return 200 if the process is
|
Health endpoints are **shallow** — they return 200 if the process is
|
||||||
running and listening. They don't try to reach Postgres/Redis/etc.
|
running and listening. They don't try to reach Postgres/Redis/etc.
|
||||||
Rationale: if Postgres is briefly down, we don't want all api pods to
|
Rationale: if Postgres is briefly down, we don't want all api pods to
|
||||||
start failing liveness and cascade-restart.
|
start failing liveness and cascade-restart.
|
||||||
|
|
||||||
## Dozzle (deprecated)
|
## obs.88oakapps.com — the ingest endpoint
|
||||||
|
|
||||||
The Swarm era had [Dozzle](https://github.com/amir20/dozzle) — a
|
Public hostname for cross-cluster metric and trace ingest. Cloudflare
|
||||||
lightweight web UI for Docker logs. Accessible via SSH tunnel to the
|
in front, nginx on `88oakappsUpdate` enforces a bearer-token check
|
||||||
manager node. Not deployed on k3s; `kubectl logs` + `stern` fills the
|
before forwarding to the local VM/Jaeger containers.
|
||||||
niche.
|
|
||||||
|
|
||||||
## Kubernetes metrics the k8s API exposes
|
| Path | Forwards to | Purpose |
|
||||||
|
|---|---|---|
|
||||||
|
| `/api/v1/write` | `http://127.0.0.1:8428` | Prometheus remote-write (vmagent → VM) |
|
||||||
|
| `/v1/traces` | `http://127.0.0.1:4318/v1/traces` | OTLP/HTTP traces (app → Jaeger) |
|
||||||
|
| `/health` | (returns 200) | Reachability probe — also requires auth |
|
||||||
|
| anything else | 404 | |
|
||||||
|
|
||||||
Even without Prometheus, these are queryable:
|
Token lives at `/etc/honeydue-obs/secrets.env` (mode 0600 on the box)
|
||||||
|
and at `OBS_INGEST_TOKEN=` in `deploy/prod.env` (gitignored). To rotate:
|
||||||
|
generate a new value, update both ends, restart vmagent.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Resource metrics (via metrics-server)
|
# Operator: rotate the bearer token
|
||||||
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
|
NEW=$(openssl rand -hex 32)
|
||||||
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/honeydue/pods
|
ssh 88oakappsUpdate "sudo sed -i 's|OBS_INGEST_TOKEN=.*|OBS_INGEST_TOKEN=$NEW|' /etc/honeydue-obs/secrets.env"
|
||||||
|
ssh 88oakappsUpdate "sudo sed -i 's|Bearer [a-f0-9]\{64\}|Bearer $NEW|' /etc/nginx/sites-available/obs.88oakapps.com && sudo nginx -s reload"
|
||||||
# Core API (k8s state)
|
sed -i.bak "s|^OBS_INGEST_TOKEN=.*|OBS_INGEST_TOKEN=$NEW|" deploy/prod.env
|
||||||
kubectl get --raw /api/v1/namespaces/honeydue/pods/<name>
|
KUBECONFIG=~/.kube/honeydue.yaml kubectl -n honeydue create secret generic vmagent-remote-write \
|
||||||
|
--from-literal=bearer_token=$NEW --dry-run=client -o yaml | KUBECONFIG=~/.kube/honeydue.yaml kubectl apply -f -
|
||||||
# Kubelet metrics (per-node; requires tunneling)
|
KUBECONFIG=~/.kube/honeydue.yaml kubectl -n honeydue rollout restart deploy/vmagent
|
||||||
kubectl get --raw /api/v1/nodes/<node>/proxy/metrics
|
|
||||||
```
|
```
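
The local half of the rotation can be rehearsed against a scratch copy before touching the real files. What follows is a minimal sketch, not part of the deployed tooling: the scratch file and its contents are made up for illustration, and only the `openssl rand` and `sed` invocations mirror the runbook above.

```shell
# Rehearse the deploy/prod.env edit on a throwaway file (hypothetical contents).
scratch=$(mktemp)
printf 'DATABASE_URL=placeholder\nOBS_INGEST_TOKEN=oldvalue\n' > "$scratch"

# Same generator as the runbook: 32 random bytes, hex-encoded (64 characters).
NEW=$(openssl rand -hex 32)

# Same substitution as the runbook; -i.bak keeps the previous file for rollback.
sed -i.bak "s|^OBS_INGEST_TOKEN=.*|OBS_INGEST_TOKEN=$NEW|" "$scratch"

# Confirm exactly one line carries the new token and the other key is untouched.
grep -c "^OBS_INGEST_TOKEN=$NEW\$" "$scratch"
grep -c '^DATABASE_URL=placeholder$' "$scratch"
```

Both counts should print `1`. The 64-character lowercase hex value is also what the nginx `sed` pattern `[a-f0-9]\{64\}` will need to match on the next rotation, so switching to a different token format would require changing that pattern too.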
|
||||||
|
|
||||||
If we ever spin up Prometheus, these are the endpoints it would scrape.
|
## Resource budget
|
||||||
|
|
||||||
## Future: what to add and when
|
| Service | mem_limit | Disk | Retention |
|
||||||
|
|---|---|---|---|
|
||||||
|
| VictoriaMetrics | 256 MB | 10 GB | 30 days |
|
||||||
|
| Jaeger all-in-one (badger) | 256 MB | 10 GB | ~7 days |
|
||||||
|
| Grafana OSS | 256 MB | 1 GB | — |
|
||||||
|
| vmagent (in k3s) | 256 MB | 512 MB emptyDir | — |
|
||||||
|
| **Total** | **~1 GB hard cap** | **~21 GB** | |
|
||||||
|
|
||||||
| Trigger | Add |
|
Resident usage at idle is much lower (~90 MB on the obs side, ~30 MB
|
||||||
|---|---|
|
for vmagent). Hard limits exist so a memory leak in any one component
|
||||||
| 10k+ daily users | Loki + Grafana for logs |
|
can't squeeze the cohabiting PostHog stack on `88oakappsUpdate`.
|
||||||
| 100+ req/s sustained | Prometheus + Grafana for metrics |
|
|
||||||
| Performance incidents | OpenTelemetry tracing |
|
|
||||||
| Revenue > $5k/mo | Paid monitoring (Datadog or similar) |
|
|
||||||
| First production outage | Alerting to phone/Slack |
|
|
||||||
|
|
||||||
The overall philosophy: observability is an investment that compounds.
|
|
||||||
Add it before you need it, not after. But also don't over-invest at
|
|
||||||
idle.
|
|
||||||
|
|
||||||
**Next quarter**: set up Uptime Kuma + Loki at minimum.
|
|
||||||
|
|
||||||
## Checking what's installed
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# In kube-system namespace
|
|
||||||
kubectl get pods -n kube-system
|
|
||||||
# Should see: coredns, metrics-server, traefik, local-path-provisioner,
|
|
||||||
# and some k3s-related helm install jobs
|
|
||||||
|
|
||||||
# In honeydue namespace
|
|
||||||
kubectl get pods -n honeydue
|
|
||||||
# api, admin, worker, redis
|
|
||||||
|
|
||||||
# No monitoring namespace (yet)
|
|
||||||
kubectl get namespaces
|
|
||||||
# default, honeydue, kube-node-lease, kube-public, kube-system
|
|
||||||
```
|
|
||||||
|
|
||||||
## Operator cheat sheet
|
## Operator cheat sheet
|
||||||
|
|
||||||
@@ -274,32 +330,61 @@ kubectl get namespaces
|
|||||||
# Tail all logs in the namespace
|
# Tail all logs in the namespace
|
||||||
kubectl logs -n honeydue --all-containers=true --tail=50 -l app.kubernetes.io/part-of=honeydue
|
kubectl logs -n honeydue --all-containers=true --tail=50 -l app.kubernetes.io/part-of=honeydue
|
||||||
|
|
||||||
|
# Scrape state from vmagent self-metrics
|
||||||
|
kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://127.0.0.1:8429/metrics \
|
||||||
|
| grep -E "scrapes_total|targets|remotewrite"
|
||||||
|
|
||||||
|
# Force vmagent to reload scrape config
|
||||||
|
kubectl -n honeydue rollout restart deploy/vmagent
|
||||||
|
|
||||||
|
# Query VictoriaMetrics directly (PromQL)
|
||||||
|
ssh 88oakappsUpdate 'curl -s "http://127.0.0.1:8428/api/v1/query?query=histogram_quantile(0.95,sum%20by%20(route,le)(rate(http_request_duration_seconds_bucket%5B5m%5D)))" | python3 -m json.tool'
|
||||||
|
|
||||||
|
# Restart the obs stack on 88oakappsUpdate
|
||||||
|
ssh 88oakappsUpdate 'cd /opt/honeydue-obs && sudo docker compose restart'
|
||||||
|
|
||||||
|
# Live obs container memory
|
||||||
|
ssh 88oakappsUpdate 'sudo docker stats --no-stream | grep honeydue-obs'
|
||||||
|
|
||||||
|
# Pod resource usage (k3s side)
|
||||||
|
kubectl top pods -n honeydue --sort-by=memory
|
||||||
|
|
||||||
# With stern (if installed: brew install stern)
|
# With stern (if installed: brew install stern)
|
||||||
stern -n honeydue .
|
stern -n honeydue .
|
||||||
|
|
||||||
# Follow a specific pod's current container (--previous=true shows the last terminated one instead)
|
|
||||||
kubectl logs -n honeydue <pod> -f --previous=false
|
|
||||||
|
|
||||||
# Pod resource usage
|
|
||||||
kubectl top pods -n honeydue --sort-by=memory
|
|
||||||
kubectl top pods -n honeydue --sort-by=cpu
|
|
||||||
|
|
||||||
# Events (cluster-wide)
|
|
||||||
kubectl get events -A --sort-by=.lastTimestamp | tail -20
|
|
||||||
|
|
||||||
# Full state dump for a pod (debugging)
|
# Full state dump for a pod (debugging)
|
||||||
kubectl describe pod -n honeydue <pod> > /tmp/pod-dump.txt
|
kubectl describe pod -n honeydue <pod> > /tmp/pod-dump.txt
|
||||||
kubectl logs -n honeydue <pod> > /tmp/pod-logs.txt
|
kubectl logs -n honeydue <pod> > /tmp/pod-logs.txt
|
||||||
```
|
```
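
The hand-encoded PromQL URL in the cheat sheet (`%20`, `%5B`, `%5D`) is easy to mistype. As a hedged alternative, curl can do the encoding itself with `-G` plus `--data-urlencode`. The function name `p95_by_route` is illustrative and does not exist in the repo; the host, port, and metric name are the ones used above.

```shell
# PromQL kept readable; curl percent-encodes it on the remote side.
QUERY='histogram_quantile(0.95, sum by (route, le) (rate(http_request_duration_seconds_bucket[5m])))'

# Hypothetical helper: p95 latency by route over the last 5 minutes.
# $QUERY is expanded locally, then single-quoted for the remote shell.
p95_by_route() {
  ssh 88oakappsUpdate \
    "curl -sG http://127.0.0.1:8428/api/v1/query --data-urlencode 'query=${QUERY}'" \
    | python3 -m json.tool
}

# Usage (requires SSH access to the obs box):
#   p95_by_route
```

This only works while the query contains no single quotes, because the remote command wraps it in them.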
|
||||||
|
|
||||||
|
## Future: what to add and when
|
||||||
|
|
||||||
|
| Trigger | Add |
|
||||||
|
|---|---|
|
||||||
|
| First production incident | Grafana alerting (free, data already there) |
|
||||||
|
| 10k+ daily users | Loki + Vector for log aggregation |
|
||||||
|
| Performance incident the histograms can't explain | Wire OTel exporter → Jaeger from the Go app |
|
||||||
|
| CPU pressure on api pods | Pyroscope continuous profiler |
|
||||||
|
| Multi-product obs needs | Migrate obs stack to dedicated CX32 ($8/mo) |
|
||||||
|
|
||||||
|
The overall philosophy: observability is an investment that compounds.
|
||||||
|
Add it before you need it, not after. But also don't over-invest at
|
||||||
|
idle.
|
||||||
|
|
||||||
## References
|
## References
|
||||||
|
|
||||||
- [Kubernetes metrics-server][ms]
|
- [VictoriaMetrics docs][vm]
|
||||||
- [K3s metrics][k3s-metrics]
|
- [vmagent kubernetes_sd_configs][vmagent-k8s]
|
||||||
- [Loki][loki]
|
- [Jaeger all-in-one with badger][jaeger]
|
||||||
|
- [prometheus/client_golang][promclient]
|
||||||
|
- [Grafana provisioning datasources][gf-prov]
|
||||||
|
- [Loki][loki] (future)
|
||||||
- [Stern (multi-pod log tail)][stern]
|
- [Stern (multi-pod log tail)][stern]
|
||||||
|
|
||||||
[ms]: https://github.com/kubernetes-sigs/metrics-server
|
[vm]: https://docs.victoriametrics.com/single-server-victoriametrics/
|
||||||
[k3s-metrics]: https://docs.k3s.io/advanced#enabling-metrics-server
|
[vmagent-k8s]: https://docs.victoriametrics.com/vmagent.html#kubernetes-monitoring-with-vmagent
|
||||||
|
[jaeger]: https://www.jaegertracing.io/docs/1.62/getting-started/#all-in-one
|
||||||
|
[promclient]: https://pkg.go.dev/github.com/prometheus/client_golang
|
||||||
|
[gf-prov]: https://grafana.com/docs/grafana/latest/administration/provisioning/#datasources
|
||||||
[loki]: https://grafana.com/oss/loki/
|
[loki]: https://grafana.com/oss/loki/
|
||||||
[stern]: https://github.com/stern/stern
|
[stern]: https://github.com/stern/stern
|
||||||
|
|||||||
@@ -115,6 +115,41 @@ kubectl rollout restart deployment/coredns -n kube-system
|
|||||||
kubectl rollout restart deployment/metrics-server -n kube-system
|
kubectl rollout restart deployment/metrics-server -n kube-system
|
||||||
```
|
```
|
||||||
|
|
||||||
|
#### vmagent can't reach obs.88oakapps.com
|
||||||
|
|
||||||
|
**Symptom**: dashboards stop updating; vmagent logs show 401 / TLS /
|
||||||
|
network errors against `obs.88oakapps.com`. App is unaffected.
|
||||||
|
**Recovery**: vmagent buffers up to 512 MB locally and replays on
|
||||||
|
reconnect, so brief outages self-heal. If sustained:
|
||||||
|
```bash
|
||||||
|
# Is the obs endpoint up?
|
||||||
|
curl -s -o /dev/null -w "%{http_code}\n" https://obs.88oakapps.com/health \
|
||||||
|
-H "Authorization: Bearer $(grep ^OBS_INGEST_TOKEN= deploy/prod.env | cut -d= -f2)"
|
||||||
|
# 200 = ingest endpoint healthy.
|
||||||
|
|
||||||
|
# Inspect vmagent's failure metric
|
||||||
|
kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://127.0.0.1:8429/metrics \
|
||||||
|
| grep -E "remotewrite_(packets|samples)_dropped|persistentqueue_blocks_dropped"
|
||||||
|
|
||||||
|
# Restart vmagent (forces config reload + drains queue)
|
||||||
|
kubectl -n honeydue rollout restart deploy/vmagent
|
||||||
|
```
|
||||||
|
**If 88oakappsUpdate itself is down** (PostHog runs there too):
|
||||||
|
SSH and check `sudo docker compose -f /opt/honeydue-obs/docker-compose.yml ps`.
|
||||||
|
**Non-critical**: nothing app-facing depends on the obs stack.
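
The status code from the `/health` probe above narrows down which layer is failing. A rough triage sketch follows; the function name `classify_obs_status` and the wording of each diagnosis are illustrative assumptions, and the 5xx branch in particular is a guess at how Cloudflare or nginx would surface an origin problem.

```shell
# Hypothetical helper: map the curl %{http_code} value to a likely cause.
classify_obs_status() {
  case "$1" in
    200) echo "healthy: token accepted, nginx reachable" ;;
    401) echo "token rejected: compare deploy/prod.env with the nginx config" ;;
    404) echo "wrong path: nginx only serves the documented ingest paths" ;;
    000) echo "no HTTP response: DNS, TLS, or network failure before nginx" ;;
    5??) echo "edge or origin error: check Cloudflare and the obs box" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

classify_obs_status 200
classify_obs_status 000

# Usage against the live endpoint:
#   code=$(curl -s -o /dev/null -w "%{http_code}" https://obs.88oakapps.com/health \
#     -H "Authorization: Bearer $TOKEN")
#   classify_obs_status "$code"
```

curl reports `000` when it never received an HTTP response, which is why that branch covers the TLS and network errors named in the symptom.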
|
||||||
|
|
||||||
|
#### Grafana dashboard shows "no data"
|
||||||
|
|
||||||
|
**Possible causes, in order of frequency**:
|
||||||
|
1. New histogram name — query targets a metric the api hasn't emitted
|
||||||
|
yet. Check `kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://api:8000/metrics`
|
||||||
|
for the metric name.
|
||||||
|
2. vmagent isn't scraping (see above).
|
||||||
|
3. Time range is before the obs stack came up (2026-04-25). Adjust
|
||||||
|
the dashboard time picker.
|
||||||
|
4. Cardinality blowup — VM rejected high-label-count series. Check
|
||||||
|
`vm_rows_inserted_total` vs `vm_rows_dropped_total` on the obs box.
|
||||||
|
|
||||||
### Networking failures
|
### Networking failures
|
||||||
|
|
||||||
#### UFW rule accidentally blocks essential traffic
|
#### UFW rule accidentally blocks essential traffic
|
||||||
|
|||||||
@@ -58,6 +58,20 @@ honeyDue.
|
|||||||
|---|---:|
|
|---|---:|
|
||||||
| Gitea container registry | **$0** |
|
| Gitea container registry | **$0** |
|
||||||
|
|
||||||
|
### Observability (88oakappsUpdate)
|
||||||
|
|
||||||
|
VictoriaMetrics + Jaeger + Grafana co-tenant on the existing Linode
|
||||||
|
VPS that hosts PostHog. ~700 MB RAM, 21 GB disk — fits inside the
|
||||||
|
existing instance. Not charged to honeyDue.
|
||||||
|
|
||||||
|
| Item | Monthly |
|
||||||
|
|---|---:|
|
||||||
|
| Self-hosted obs stack on `88oakappsUpdate` | **$0** |
|
||||||
|
|
||||||
|
Migration trigger: when the obs stack starts pressuring PostHog or
|
||||||
|
needs hard isolation, move to a dedicated Hetzner CX32 (~$8/mo).
|
||||||
|
See [Chapter 15 — When to move off](./15-observability.md).
|
||||||
|
|
||||||
### Total infrastructure
|
### Total infrastructure
|
||||||
|
|
||||||
| Category | Monthly |
|
| Category | Monthly |
|
||||||
@@ -67,6 +81,7 @@ honeyDue.
|
|||||||
| Storage | ~$0.30 |
|
| Storage | ~$0.30 |
|
||||||
| Edge | $0 |
|
| Edge | $0 |
|
||||||
| Registry | $0 |
|
| Registry | $0 |
|
||||||
|
| Observability | $0 |
|
||||||
| **Total** | **~$30** |
|
| **Total** | **~$30** |
|
||||||
|
|
||||||
## External SaaS
|
## External SaaS
|
||||||
|
|||||||
@@ -48,7 +48,7 @@ they do, and how to operate them.
|
|||||||
|
|
||||||
- [12 — Data Flow](./12-data-flow.md) — end-to-end request lifecycle
|
- [12 — Data Flow](./12-data-flow.md) — end-to-end request lifecycle
|
||||||
- [14 — Deployment Process](./14-deployment-process.md) — how to roll new code
|
- [14 — Deployment Process](./14-deployment-process.md) — how to roll new code
|
||||||
- [15 — Observability](./15-observability.md) — logs, metrics, tracing
|
- [15 — Observability](./15-observability.md) — VictoriaMetrics + Jaeger + Grafana on `obs.88oakapps.com`, vmagent in-cluster, Prometheus histograms in the Go API
|
||||||
- [16 — Failure Modes](./16-failure-modes.md) — what happens when X dies
|
- [16 — Failure Modes](./16-failure-modes.md) — what happens when X dies
|
||||||
- [17 — Runbook](./17-runbook.md) — common ops tasks
|
- [17 — Runbook](./17-runbook.md) — common ops tasks
|
||||||
|
|
||||||
|
|||||||
@@ -278,6 +278,43 @@ ssh -i ~/.ssh/hetzner deploy@<node> 'sudo systemctl start k3s'
|
|||||||
# then re-join via the k3s install command
|
# then re-join via the k3s install command
|
||||||
```
|
```
|
||||||
|
|
||||||
|
## Observability
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Hit api /metrics from inside the cluster
|
||||||
|
kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://api:8000/metrics | head -30
|
||||||
|
|
||||||
|
# vmagent self-stats: scrapes succeeded, samples shipped, queue health
|
||||||
|
kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://127.0.0.1:8429/metrics \
|
||||||
|
| grep -E "scrapes_total|targets|remotewrite_samples_dropped|persistentqueue_blocks_dropped"
|
||||||
|
|
||||||
|
# Force vmagent to reload config (after editing the ConfigMap)
|
||||||
|
kubectl -n honeydue rollout restart deploy/vmagent
|
||||||
|
|
||||||
|
# Query VictoriaMetrics by SSH'ing to the obs box
|
||||||
|
ssh 88oakappsUpdate 'curl -s "http://127.0.0.1:8428/api/v1/query?query=up"'
|
||||||
|
|
||||||
|
# p95 latency by route, last 5m
|
||||||
|
ssh 88oakappsUpdate 'curl -s "http://127.0.0.1:8428/api/v1/query?query=histogram_quantile(0.95,sum%20by%20(route,le)(rate(http_request_duration_seconds_bucket%5B5m%5D)))" | python3 -m json.tool'
|
||||||
|
|
||||||
|
# All metric names landing in VM
|
||||||
|
ssh 88oakappsUpdate 'curl -s http://127.0.0.1:8428/api/v1/label/__name__/values | python3 -m json.tool'
|
||||||
|
|
||||||
|
# Restart the obs stack on 88oakappsUpdate (VM + Jaeger + Grafana)
|
||||||
|
ssh 88oakappsUpdate 'cd /opt/honeydue-obs && sudo docker compose restart'
|
||||||
|
|
||||||
|
# Live RAM usage of the obs containers
|
||||||
|
ssh 88oakappsUpdate 'sudo docker stats --no-stream | grep honeydue-obs'
|
||||||
|
|
||||||
|
# Test the obs ingest endpoint with auth
|
||||||
|
TOKEN=$(grep ^OBS_INGEST_TOKEN= deploy/prod.env | cut -d= -f2)
|
||||||
|
curl -s -o /dev/null -w "%{http_code}\n" https://obs.88oakapps.com/health \
|
||||||
|
-H "Authorization: Bearer $TOKEN" # 200 = healthy
|
||||||
|
```
|
||||||
|
|
||||||
|
Dashboards live at `https://grafana.88oakapps.com/d/honeydue-red`.
|
||||||
|
Admin credentials are in `deploy/prod.env`.
|
||||||
|
|
||||||
## One-liners worth memorizing
|
## One-liners worth memorizing
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
|
|||||||