fix(observability): unbreak vmagent SD on fresh deploy + ship kube-state-metrics

vmagent's k8s service discovery has been silently broken for 17+ days
because k3s's NetworkPolicy controller evaluates egress AFTER kube-proxy's
DNAT (contrary to the k8s spec). Pod → ClusterIP 10.43.0.1:443 was DNAT'd
to <node_public_ip>:6443, and the resulting :6443 destination matched none
of vmagent's egress rules → TCP RST → "connection refused" on every SD
watch attempt. Grafana panels using kube_* or up{} metrics returned empty
as a result.

Changes:

- network-policies.yaml: commit the previously-cluster-only NetPols
  (allow-egress-from-vmagent, allow-vmagent-to-api) so a fresh deploy
  produces a working cluster. The vmagent egress rule now includes :6443
  to public IPs (the post-DNAT path) and :8080 to the pod CIDR (for
  scraping kube-state-metrics).
- observability/kube-state-metrics.yaml: new manifest. Provides the
  kube_pod_*, kube_deployment_*, kube_service_* metrics that Grafana
  panels need to count pods, replicas, etc. Runs in kube-system with
  cluster-scoped RBAC.
- observability/vmagent.yaml:
  * add kube-state-metrics scrape job to the ConfigMap
  * add vmagent-kube-system Role+RoleBinding so cross-namespace SD works
  * replace the misleading liveness probe (was /-/healthy, which lies
    while SD is broken) with an exec probe that checks /api/v1/targets
    for at least one healthy target, giving automatic recovery from
    future stale-SD incidents
- scripts/03-deploy.sh: actually apply network-policies.yaml (was
  committed but never applied) and apply kube-state-metrics.yaml.
- RUNBOOK.md (new): documents the post-DNAT gotcha, the liveness probe
  trap, bearer-token recovery procedure, drift-detection diff, and a
  post-redeploy verification checklist.
- .gitignore: cover kubeconfig.tunnel (created during SSH-tunnelled
  kubectl sessions) so admin client cert can't be committed by accident.

Verified via kubectl --dry-run on all three modified manifests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 00:30:11 -05:00
parent 7cc5448a7c
commit 139a990ebc
6 changed files with 666 additions and 6 deletions
@@ -0,0 +1,262 @@
+# k3s Cluster Operations Runbook
+
+Living document for honeyDue k3s cluster operations. Add entries when you
+hit something non-obvious so future-you (or your replacement) doesn't have
+to rediscover it.
+
+---
+
+## Deployment
+
+The canonical deploy path is `deploy-k3s/scripts/03-deploy.sh`. It applies
+everything in `deploy-k3s/manifests/` in the right order.
+
+What it touches (in order):
+
+1. `namespace.yaml`
+2. `network-policies.yaml` — **all** NetPols including the vmagent ones
+3. `redis/`
+4. `ingress/`
+5. `migrate/job.yaml` (with image substitution; blocks on success)
+6. `api/deployment.yaml`, `api/service.yaml`, `api/hpa.yaml` (image-subbed)
+7. `worker/deployment.yaml` (image-subbed)
+8. `admin/deployment.yaml`, `admin/service.yaml` (image-subbed)
+9. `web/deployment.yaml`, `web/service.yaml` (image-subbed; optional dir)
+10. `observability/kube-state-metrics.yaml`
+11. `observability/vmagent.yaml` (with `TOKEN_PLACEHOLDER` sed-substituted from `deploy/prod.env`)
+
+If you add a new manifest, also add a `kubectl apply -f` line to
+`03-deploy.sh` — there's no kustomization or `apply -R`. **A manifest
+that exists in the repo but isn't applied by the script will silently
+not deploy.**
+
+### Pre-deploy checklist
+
+- [ ] `deploy/prod.env` exists and contains `OBS_INGEST_TOKEN=...`
+       (otherwise vmagent gets skipped with a warning)
+- [ ] `KUBECONFIG` points at the right cluster
+- [ ] The Gitea image registry is reachable from k3s nodes
+- [ ] Schema migrations in `migrations/` are tested locally first
+       (the deploy aborts if `honeydue-migrate` Job fails)
+
+---
+
+## Known gotchas
+
+### vmagent SD broken on fresh deploy ("0 pods up" in Grafana)
+
+**Symptoms:**
+- Grafana panels using `kube_*` metrics or `up{job=...}` show 0
+- vmagent logs: `dial tcp 10.43.0.1:443: connect: connection refused`
+       repeating every ~30s
+- Direct test from a pod also refused: `kubectl -n honeydue exec deploy/vmagent
+       -- wget --no-check-certificate -qO- -T 3 https://10.43.0.1:443/livez`
+
+**Cause:** k3s's built-in NetworkPolicy controller evaluates egress rules
+**after** kube-proxy's DNAT, not before (contrary to the k8s spec).
+Traffic from a pod to the `kubernetes` Service (ClusterIP `10.43.0.1:443`)
+gets DNAT'd to `<node_public_ip>:6443`, and **then** the policy check
+runs. Without an explicit egress rule for `:6443`, the packet is rejected
+with a TCP RST → "connection refused".
+
+The `allow-egress-from-vmagent` NetPol in `network-policies.yaml` includes
+both rules:
+
+```yaml
+# Pre-DNAT view (correct per spec; harmless if unused)
+- to:
+    - ipBlock: { cidr: 10.43.0.0/16 }
+  ports:
+    - { port: 443, protocol: TCP }
+# Post-DNAT path (what k3s NetPol enforcer actually sees) — REQUIRED
+- to:
+    - ipBlock:
+        cidr: 0.0.0.0/0
+        except: [10.42.0.0/16]
+  ports:
+    - { port: 6443, protocol: TCP }
+```
+
+**If this happens on a fresh deploy:** confirm `network-policies.yaml`
+was applied:
+```bash
+kubectl -n honeydue get netpol allow-egress-from-vmagent -o yaml
+```
+Look for the port-6443 egress rule. If missing, the apply step in
+`03-deploy.sh` was skipped or the file was edited and the rule got
+dropped.
+
+**Corroborating check:** kube-state-metrics in `kube-system` works fine,
+because `kube-system` has no NetPols. So if ksm is healthy but workloads
+in `honeydue` can't reach the apiserver ClusterIP, this gotcha is the
+likely cause.
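+
+For orientation, a rough sketch of how the egress rules might sit inside
+the NetworkPolicy object. The pod label is the one used by the `describe`
+command elsewhere in this runbook; the `:8080` rule for scraping
+kube-state-metrics and the `10.42.0.0/16` pod CIDR are assumptions, and
+the committed `network-policies.yaml` remains authoritative.
+
+```yaml
+# Sketch only; not a copy of the committed manifest.
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata:
+  name: allow-egress-from-vmagent
+  namespace: honeydue
+spec:
+  podSelector:
+    matchLabels:
+      app.kubernetes.io/name: vmagent
+  policyTypes: [Egress]
+  egress:
+    # (the two apiserver rules from the snippet above go here)
+    # Scrape kube-state-metrics inside the pod CIDR (assumed 10.42.0.0/16)
+    - to:
+        - ipBlock: { cidr: 10.42.0.0/16 }
+      ports:
+        - { port: 8080, protocol: TCP }
+    # Other rules (for example DNS and remote-write) omitted from this sketch
+```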
+
+---
+
+### vmagent appears healthy but no data in Grafana
+
+vmagent's `/-/healthy` endpoint returns 200 as long as the process is
+alive and remote-write is functional (TCP-level) — it does **not**
+check whether scrapes are succeeding. We saw this fail once: vmagent
+was "healthy" for 17 days while having zero healthy targets due to a
+broken k8s SD watch.
+
+The liveness probe in `vmagent.yaml` queries the agent's `/api/v1/targets`
+endpoint and fails if no target is in state `up`. After 3 consecutive
+failures (~3 min), kubelet restarts the container and SD starts over
+clean.
+
+**Verify it's working:** `kubectl -n honeydue describe pod -l app.kubernetes.io/name=vmagent`
+should show `Liveness: exec [sh -c ...]`. If you ever see vmagent running
+for weeks but no metrics in Grafana, the probe was disabled or the exec
+command broke.
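+
+A rough sketch of what such an exec probe could look like. Port 8429 is
+taken from the port-forward commands in this runbook, and `wget` is
+assumed to be in the image because the symptom check above uses it. The
+timings are assumptions chosen so that 3 failures take about 3 minutes;
+the exact command in `vmagent.yaml` may differ.
+
+```yaml
+# Hypothetical sketch; the committed vmagent.yaml is authoritative.
+livenessProbe:
+  exec:
+    command:
+      - sh
+      - -c
+      # Exits 0 only if at least one target reports health "up".
+      - >-
+        wget -qO- -T 5 http://127.0.0.1:8429/api/v1/targets
+        | grep -Eq '"health": *"up"'
+  initialDelaySeconds: 120   # assumed: leave time for the first SD + scrape cycle
+  periodSeconds: 60          # 3 failures x 60s matches the ~3 min noted above
+  timeoutSeconds: 10
+  failureThreshold: 3
+```
+
+The trade-off of a probe like this is that vmagent also gets restarted
+when every target is genuinely down, which a restart will not fix.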
+
+---
+
+### vmagent's bearer token got blown away after `kubectl apply -f vmagent.yaml`
+
+The committed `vmagent.yaml` has `bearer_token: TOKEN_PLACEHOLDER`. The
+real token is sed-substituted at deploy time by `03-deploy.sh`. If you
+ever apply `vmagent.yaml` directly:
+
+```bash
+kubectl apply -f deploy-k3s/manifests/observability/vmagent.yaml   # WRONG
+```
+
+the Secret gets overwritten with the literal string `TOKEN_PLACEHOLDER`
+and all remote-writes start returning 401 from obs.88oakapps.com.
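+
+For orientation, the committed Secret presumably looks something like the
+sketch below. The Secret name and key come from the recovery `kubectl patch`
+command later in this section; the use of `stringData` (rather than
+base64-encoded `data`) is an assumption, so check `vmagent.yaml` for the
+real shape.
+
+```yaml
+# Sketch only; field layout is assumed, not copied from vmagent.yaml.
+apiVersion: v1
+kind: Secret
+metadata:
+  name: vmagent-remote-write
+  namespace: honeydue
+type: Opaque
+stringData:
+  # 03-deploy.sh replaces this placeholder with the real token via sed.
+  # Applying the file as-is stores the placeholder itself as the token.
+  bearer_token: TOKEN_PLACEHOLDER
+```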
+
+**To restore without a full redeploy** (the safe inline path):
+
+```bash
+export KUBECONFIG=...
+OBS_TOKEN_B64=$(kubectl -n honeydue get secret honeydue-secrets \
+                  -o jsonpath='{.data.OBS_INGEST_TOKEN}')
+kubectl -n honeydue patch secret vmagent-remote-write --type=json \
+  -p="[{\"op\":\"replace\",\"path\":\"/data/bearer_token\",\"value\":\"${OBS_TOKEN_B64}\"}]"
+kubectl -n honeydue rollout restart deploy/vmagent
+```
+
+This works because the OBS token also lives in
+`honeydue-secrets.OBS_INGEST_TOKEN` (the api pods use it for traces), so
+both Secrets hold the same token value.
+
+**Or just re-run the deploy:** `./deploy-k3s/scripts/03-deploy.sh`. The
+sed step handles the substitution correctly.
+
+---
+
+### Node kubeconfig is world-readable
+
+`/etc/rancher/k3s/k3s.yaml` is mode `0644` per the `--write-kubeconfig-mode=644`
+k3s install flag. Any process on the host (including any container that
+mounts the host filesystem) can read full cluster-admin credentials.
+
+This is intentional for the deploy user but worth knowing — any container
+escape becomes immediate cluster-admin. Tracked as finding **F4** in
+`k3_audit_5_12.md`.
+
+To tighten (if you ever turn this knob): change to `--write-kubeconfig-mode=600`
+in the k3s install command, then re-fetch `deploy-k3s/kubeconfig`.
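+
+If the nodes are configured through the k3s config file instead of
+install flags, the same setting can be expressed there. A minimal sketch,
+assuming the default config path and that nothing on the host relies on
+reading the kubeconfig as a non-root user:
+
+```yaml
+# /etc/rancher/k3s/config.yaml (default k3s config path)
+# Equivalent to --write-kubeconfig-mode=600 on the command line.
+write-kubeconfig-mode: "0600"
+```
+
+k3s reads this file at startup, so the k3s service needs a restart for
+the new mode to apply. The fetch command under "Common operations" uses
+`sudo cat`, so it should keep working with the tighter mode.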
+
+---
+
+## Common operations
+
+### Fetch a working kubeconfig or kubectl tunnel (if `deploy-k3s/kubeconfig` is missing or stale)
+
+```bash
+ssh -i ~/.ssh/hetzner deploy@hetzner1 'sudo cat /etc/rancher/k3s/k3s.yaml' \
+  | sed 's|server: https://127.0.0.1:6443|server: https://178.104.247.152:6443|' \
+  > deploy-k3s/kubeconfig
+chmod 600 deploy-k3s/kubeconfig
+```
+
+If the public :6443 is firewalled from your IP (the default — only
+Cloudflare ranges are allowed for app traffic; admin is locked down):
+
+```bash
+# SSH tunnel; -f backgrounds it once connected, so it keeps running after this returns
+ssh -fN -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 \
+    -i ~/.ssh/hetzner \
+    -L 127.0.0.1:6443:127.0.0.1:6443 \
+    deploy@hetzner1
+
+# Then use a kubeconfig pointing at localhost (written with sed directly,
+# so no .bak copy of the admin cert is left behind)
+sed 's|https://178.104.247.152:6443|https://127.0.0.1:6443|' \
+  deploy-k3s/kubeconfig > deploy-k3s/kubeconfig.tunnel
+chmod 600 deploy-k3s/kubeconfig.tunnel
+export KUBECONFIG="$(pwd)/deploy-k3s/kubeconfig.tunnel"
+```
+
+### Restore vmagent after a "0 targets" incident
+
+```bash
+export KUBECONFIG="$(pwd)/deploy-k3s/kubeconfig.tunnel"
+
+# 1. Confirm the diagnosis
+kubectl -n honeydue logs deploy/vmagent --tail=20 | grep -i "connect: connection refused"
+
+# 2. Check the NetPol has the :6443 rule
+kubectl -n honeydue get netpol allow-egress-from-vmagent -o yaml | grep -A 5 6443
+
+# 3. If missing, re-apply
+kubectl apply -f deploy-k3s/manifests/network-policies.yaml
+
+# 4. Restart vmagent
+kubectl -n honeydue rollout restart deploy/vmagent
+
+# 5. Verify targets after ~60s
+kubectl -n honeydue port-forward deploy/vmagent 8429:8429 &
+PF_PID=$!
+sleep 3   # let the port-forward bind before curling
+curl -s http://localhost:8429/api/v1/targets \
+  | python3 -c "import json,sys; d=json.load(sys.stdin); \
+                a=d['data']['activeTargets']; \
+                print(f'targets={len(a)} up={sum(1 for t in a if t[\"health\"]==\"up\")}')"
+kill "$PF_PID"   # stop the background port-forward
+```
+
+### Verify NetPols match the repo
+
+If you suspect drift between cluster and repo:
+
+```bash
+diff <(kubectl -n honeydue get netpol -o name | sort) \
+     <(grep -E '^\s*name: ' deploy-k3s/manifests/network-policies.yaml \
+       | sed 's/.*name: /networkpolicy.networking.k8s.io\//' | sort)
+```
+
+Empty output = match. Any differences need investigation — either the
+cluster has policies that aren't in repo (manual `kubectl apply` did it)
+or repo has policies that didn't apply.
+
+---
+
+## Disaster recovery notes
+
+### "I have to redeploy the whole stack"
+
+The deploy path is designed to be re-runnable. From a fresh cluster:
+
+1. Install k3s on all 3 nodes (use existing `deploy-k3s/scripts/01-install-k3s.sh`)
+2. Fetch a kubeconfig (see "Common operations" above)
+3. Confirm `deploy/prod.env` has all required secrets:
+   - `POSTGRES_PASSWORD`, `SECRET_KEY`, `EMAIL_HOST_PASSWORD`,
+     `FCM_SERVER_KEY`, `B2_KEY_ID`, `B2_APP_KEY`, `OBS_INGEST_TOKEN`,
+     `OBS_TRACES_URL`, `REDIS_PASSWORD` (optional), `ADMIN_EMAIL`, `ADMIN_PASSWORD`
+4. Run `./deploy-k3s/scripts/02-setup-secrets.sh` (creates `honeydue-secrets`)
+5. Run `./deploy-k3s/scripts/03-deploy.sh` (deploys everything; sed-injects
+   the obs token into vmagent at apply time)
+6. Verify: `kubectl -n honeydue get pods` should show all workloads Running
+
+### Post-redeploy verification checklist
+
+- [ ] `kubectl -n honeydue get netpol` shows **12 NetPols** (default-deny +
+       6 egress + 5 ingress)
+- [ ] `kubectl -n honeydue get netpol allow-egress-from-vmagent -o yaml | grep 6443`
+       returns the rule (if missing → see "vmagent SD broken" gotcha)
+- [ ] `kubectl -n kube-system get pod -l app.kubernetes.io/name=kube-state-metrics`
+       shows 1 Running pod
+- [ ] `kubectl -n honeydue port-forward deploy/vmagent 8429:8429` + curl
+       `localhost:8429/api/v1/targets` shows 4+ targets, all `up`
+- [ ] Grafana panel "pods up" in `honeydue` namespace populates within 60s
+
+If any of those fail, the "Known gotchas" section above should tell you
+which gotcha you hit.