honeyDueAPI/deploy-k3s/RUNBOOK.md
Trey t 139a990ebc
fix(observability): unbreak vmagent SD on fresh deploy + ship kube-state-metrics
vmagent's k8s service discovery has been silently broken for 17+ days
because k3s's NetworkPolicy controller evaluates egress AFTER kube-proxy's
DNAT (contrary to the k8s spec). Pod → ClusterIP 10.43.0.1:443 was
DNAT'd to <node_public_ip>:6443, and the resulting :6443 destination
matched none of vmagent's egress rules → TCP RST → "connection refused"
on every SD watch attempt. Grafana panels using kube_* or up{} metrics
returned empty as a result.

Changes:

- network-policies.yaml: commit the previously-cluster-only NetPols
  (allow-egress-from-vmagent, allow-vmagent-to-api) so a fresh deploy
  produces a working cluster. The vmagent egress rule now includes :6443
  to public IPs (the post-DNAT path) and :8080 to the pod CIDR (for
  scraping kube-state-metrics).

- observability/kube-state-metrics.yaml: new manifest. Provides the
  kube_pod_*, kube_deployment_*, kube_service_* metrics that Grafana
  panels need to count pods, replicas, etc. Runs in kube-system with
  cluster-scoped RBAC.

- observability/vmagent.yaml:
  * add kube-state-metrics scrape job to the ConfigMap
  * add vmagent-kube-system Role+RoleBinding so cross-namespace SD works
  * replace the misleading liveness probe (was /-/healthy, which lies
    while SD is broken) with an exec probe that checks /api/v1/targets
    for at least one healthy target — automatic recovery from future
    stale-SD incidents

- scripts/03-deploy.sh: actually apply network-policies.yaml (was
  committed but never applied) and apply kube-state-metrics.yaml.

- RUNBOOK.md (new): documents the post-DNAT gotcha, the liveness probe
  trap, bearer-token recovery procedure, drift-detection diff, and a
  post-redeploy verification checklist.

- .gitignore: cover kubeconfig.tunnel (created during SSH-tunnelled
  kubectl sessions) so admin client cert can't be committed by accident.

Verified via kubectl --dry-run on all three modified manifests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 00:30:11 -05:00


k3s Cluster Operations Runbook

Living document for honeyDue k3s cluster operations. Add entries when you hit something non-obvious so future-you (or your replacement) doesn't have to rediscover it.


Deployment

The canonical deploy path is deploy-k3s/scripts/03-deploy.sh. It applies everything in deploy-k3s/manifests/ in the right order.

What it touches (in order):

  1. namespace.yaml
  2. network-policies.yaml (all NetPols, including the vmagent ones)
  3. redis/
  4. ingress/
  5. migrate/job.yaml (with image substitution; blocks on success)
  6. api/deployment.yaml, api/service.yaml, api/hpa.yaml (image-subbed)
  7. worker/deployment.yaml (image-subbed)
  8. admin/deployment.yaml, admin/service.yaml (image-subbed)
  9. web/deployment.yaml, web/service.yaml (image-subbed; optional dir)
  10. observability/kube-state-metrics.yaml
  11. observability/vmagent.yaml (with TOKEN_PLACEHOLDER sed-substituted from deploy/prod.env)

If you add a new manifest, also add a kubectl apply -f line to 03-deploy.sh — there's no kustomization or apply -R. A manifest that exists in the repo but isn't applied by the script will silently not deploy.
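A rough way to catch a manifest that was committed but never wired into the script is to compare the files on disk with what the script mentions. The sketch below is hypothetical (the function name is made up) and assumes 03-deploy.sh names each manifest by a path relative to the manifests root, or by its directory. It uses substring matching, so treat a clean result as a hint, not proof.

```shell
# List manifests under a directory that the deploy script never mentions.
# Assumption: the script names each manifest by a path relative to the
# manifests root ("api/service.yaml") or by its directory ("redis/").
find_unapplied_manifests() {
  local manifests_dir="${1%/}" deploy_script="$2" file rel dir
  find "$manifests_dir" -type f -name '*.yaml' | sort | while IFS= read -r file; do
    rel="${file#"$manifests_dir"/}"   # e.g. observability/vmagent.yaml
    dir="$(dirname "$rel")/"          # e.g. observability/
    # Referenced by relative path (substring match, so it can over-match).
    if grep -qF -- "$rel" "$deploy_script"; then
      continue
    fi
    # Referenced as a whole directory, e.g. `kubectl apply -f .../redis/`.
    # Directory names are assumed to contain no regex metacharacters.
    if [ "$dir" != "./" ] && grep -qE -- "${dir}([\"'[:space:]]|\$)" "$deploy_script"; then
      continue
    fi
    echo "$rel"
  done
}

# Example (paths taken from this runbook):
# find_unapplied_manifests deploy-k3s/manifests deploy-k3s/scripts/03-deploy.sh
```

Any path it prints is a manifest the script does not appear to apply.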

Pre-deploy checklist

  • deploy/prod.env exists and contains OBS_INGEST_TOKEN=... (otherwise vmagent gets skipped with a warning)
  • KUBECONFIG points at the right cluster
  • The Gitea image registry is reachable from k3s nodes
  • Schema migrations in migrations/ are tested locally first (the deploy aborts if honeydue-migrate Job fails)
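The "right cluster" item can be checked mechanically by reading the server: entry out of the kubeconfig. This is a minimal sketch with hypothetical function names. The two endpoints in the usage line are the ones that appear elsewhere in this runbook (the public address and the SSH-tunnel address); adjust them if yours differ.

```shell
# Print the API server URL(s) a kubeconfig file points at.
kubeconfig_servers() {
  sed -n 's/^[[:space:]]*server:[[:space:]]*//p' "$1"
}

# Succeed only if every server in the kubeconfig is one of the expected endpoints.
# Usage: points_at_expected_cluster <kubeconfig> <expected-url>...
points_at_expected_cluster() {
  local kubeconfig="$1" servers server
  shift
  servers="$(kubeconfig_servers "$kubeconfig")"
  if [ -z "$servers" ]; then
    echo "no server entry found in $kubeconfig"
    return 1
  fi
  for server in $servers; do
    case " $* " in
      *" $server "*) ;;   # known endpoint, keep going
      *) echo "unexpected server: $server"; return 1 ;;
    esac
  done
}

# Example:
# points_at_expected_cluster "$KUBECONFIG" \
#   https://178.104.247.152:6443 https://127.0.0.1:6443 || echo "wrong cluster?"
```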

Known gotchas

vmagent SD broken on fresh deploy ("0 pods up" in Grafana)

Symptoms:

  • Grafana panels using kube_* metrics or up{job=...} show 0
  • vmagent logs: dial tcp 10.43.0.1:443: connect: connection refused repeating every ~30s
  • Direct test from a pod also refused: kubectl -n honeydue exec deploy/vmagent -- wget --no-check-certificate -qO- -T 3 https://10.43.0.1:443/livez

Cause: k3s's built-in NetworkPolicy controller evaluates egress rules after kube-proxy's DNAT, not before (contrary to the k8s spec). Traffic from a pod to the kubernetes Service (ClusterIP 10.43.0.1:443) gets DNAT'd to <node_public_ip>:6443, and then the policy check runs. Without an explicit egress rule for :6443, the packet is rejected with a TCP RST → "connection refused".

The allow-egress-from-vmagent NetPol in network-policies.yaml includes both rules:

# Pre-DNAT view (correct per spec; harmless if unused)
- to:
    - ipBlock: { cidr: 10.43.0.0/16 }
  ports:
    - { port: 443, protocol: TCP }
# Post-DNAT path (what k3s NetPol enforcer actually sees) — REQUIRED
- to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except: [10.42.0.0/16]
  ports:
    - { port: 6443, protocol: TCP }

If this happens on a fresh deploy, confirm that network-policies.yaml was applied:

kubectl -n honeydue get netpol allow-egress-from-vmagent -o yaml

Look for the port-6443 egress rule. If missing, the apply step in 03-deploy.sh was skipped or the file was edited and the rule got dropped.

Counter-evidence that confirms the diagnosis: kube-state-metrics (ksm) in kube-system works fine, because kube-system has no NetPols. So if ksm is healthy but workloads in honeydue can't reach the apiserver ClusterIP, this gotcha is the cause.


vmagent appears healthy but no data in Grafana

vmagent's /-/healthy endpoint returns 200 as long as the process is alive and remote-write is functional (TCP-level) — it does not check whether scrapes are succeeding. We saw this fail once: vmagent was "healthy" for 17 days while having zero healthy targets due to a broken k8s SD watch.

The liveness probe in vmagent.yaml queries the agent's /api/v1/targets endpoint and fails the pod if no target is in state up. After 3 consecutive failures (~3 min), kubelet recycles the pod and SD restarts clean.

Verify it's working: kubectl -n honeydue describe pod -l app.kubernetes.io/name=vmagent should show Liveness: exec [sh -c ...]. If you ever see vmagent running for weeks but no metrics in Grafana, the probe was disabled or the exec command broke.
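The exact exec command lives in vmagent.yaml and is not reproduced here. As an illustration only, the check it is described as making ("at least one target in state up") can be sketched as a filter over the /api/v1/targets response. The function name is hypothetical, and the sketch assumes the response carries a "health" field per target, as the verification one-liner later in this runbook also does.

```shell
# Read a /api/v1/targets JSON response on stdin.
# Succeeds (exit 0) if at least one target reports "health":"up".
any_target_up() {
  grep -Eq '"health"[[:space:]]*:[[:space:]]*"up"'
}

# Example of how a probe command could use it (assumes wget exists in the image):
# wget -qO- http://127.0.0.1:8429/api/v1/targets | any_target_up || exit 1
```

A plain grep keeps the probe free of a JSON parser, at the cost of not distinguishing which job the healthy target belongs to.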


vmagent's bearer token got blown away after kubectl apply -f vmagent.yaml

The committed vmagent.yaml has bearer_token: TOKEN_PLACEHOLDER. The real token is sed-substituted at deploy time by 03-deploy.sh. If you ever apply vmagent.yaml directly:

kubectl apply -f deploy-k3s/manifests/observability/vmagent.yaml   # WRONG

the Secret gets overwritten with the literal string TOKEN_PLACEHOLDER and all remote-writes start returning 401 from obs.88oakapps.com.

To restore without a full redeploy (the safe inline path):

export KUBECONFIG=...
OBS_TOKEN_B64=$(kubectl -n honeydue get secret honeydue-secrets \
                  -o jsonpath='{.data.OBS_INGEST_TOKEN}')
kubectl -n honeydue patch secret vmagent-remote-write --type=json \
  -p="[{\"op\":\"replace\",\"path\":\"/data/bearer_token\",\"value\":\"${OBS_TOKEN_B64}\"}]"
kubectl -n honeydue rollout restart deploy/vmagent

The OBS token also lives in honeydue-secrets.OBS_INGEST_TOKEN because the api pods use it for traces. It is the same token value stored in a second Secret, so the base64 data can be copied across as-is.

Or just re-run the deploy: ./deploy-k3s/scripts/03-deploy.sh. The sed step handles the substitution correctly.
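Before restoring, it can help to confirm that the Secret really holds the placeholder and not a rotated or mistyped token. A minimal sketch, with a hypothetical function name; the Secret name and key are the ones used in the patch command above.

```shell
# Succeeds if the given base64 string decodes to the literal TOKEN_PLACEHOLDER.
is_placeholder_token() {
  local decoded
  decoded="$(printf '%s' "$1" | base64 --decode 2>/dev/null)"
  [ "$decoded" = "TOKEN_PLACEHOLDER" ]
}

# Example:
# current="$(kubectl -n honeydue get secret vmagent-remote-write \
#              -o jsonpath='{.data.bearer_token}')"
# is_placeholder_token "$current" && echo "bearer_token was overwritten; restore it"
```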


Node kubeconfig is world-readable

/etc/rancher/k3s/k3s.yaml is mode 0644 per the --write-kubeconfig-mode=644 k3s install flag. Any process on the host (including any container that mounts the host filesystem) can read full cluster-admin credentials.

This is intentional for the deploy user but worth knowing — any container escape becomes immediate cluster-admin. Tracked as finding F4 in k3_audit_5_12.md.

To tighten (if you ever turn this knob): change to --write-kubeconfig-mode=600 in the k3s install command, then re-fetch deploy-k3s/kubeconfig.
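To see which state a node is currently in, check the "other" read bit on the file. This is a small sketch with a hypothetical function name; it relies only on find's -perm test, so it should behave the same on GNU and BSD userlands.

```shell
# Succeeds if the file is readable by "other" (the 004 permission bit is set).
is_world_readable() {
  [ -n "$(find "$1" -prune -perm -004 2>/dev/null)" ]
}

# Example (run on a node):
# is_world_readable /etc/rancher/k3s/k3s.yaml \
#   && echo "k3s.yaml is world-readable (expected while the mode is 644)"
```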


Common operations

Fetch a working kubeconfig, or open a kubectl tunnel (if deploy-k3s/kubeconfig is missing or stale)

ssh -i ~/.ssh/hetzner deploy@hetzner1 'sudo cat /etc/rancher/k3s/k3s.yaml' \
  | sed 's|server: https://127.0.0.1:6443|server: https://178.104.247.152:6443|' \
  > deploy-k3s/kubeconfig
chmod 600 deploy-k3s/kubeconfig

If the public :6443 is firewalled from your IP (the default — only Cloudflare ranges are allowed for app traffic; admin is locked down):

# SSH tunnel; -f sends it to the background once connected, so no second terminal is needed
ssh -fN -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 \
    -i ~/.ssh/hetzner \
    -L 127.0.0.1:6443:127.0.0.1:6443 \
    deploy@hetzner1

# Then write a kubeconfig pointing at localhost
# (no sed -i.bak, so no stray .bak copy of the admin cert is left behind)
sed 's|https://178.104.247.152:6443|https://127.0.0.1:6443|' \
  deploy-k3s/kubeconfig > deploy-k3s/kubeconfig.tunnel
chmod 600 deploy-k3s/kubeconfig.tunnel
export KUBECONFIG="$(pwd)/deploy-k3s/kubeconfig.tunnel"

Restore vmagent after a "0 targets" incident

export KUBECONFIG="$(pwd)/deploy-k3s/kubeconfig.tunnel"

# 1. Confirm the diagnosis
kubectl -n honeydue logs deploy/vmagent --tail=20 | grep -i "connect: connection refused"

# 2. Check the NetPol has the :6443 rule
kubectl -n honeydue get netpol allow-egress-from-vmagent -o yaml | grep -A 5 6443

# 3. If missing, re-apply
kubectl apply -f deploy-k3s/manifests/network-policies.yaml

# 4. Restart vmagent
kubectl -n honeydue rollout restart deploy/vmagent

# 5. Verify targets after ~60s
kubectl -n honeydue port-forward deploy/vmagent 8429:8429 &
PF_PID=$!
sleep 2   # give the port-forward a moment to bind
curl -s http://localhost:8429/api/v1/targets \
  | python3 -c "import json,sys; d=json.load(sys.stdin); \
                a=d['data']['activeTargets']; \
                print(f'targets={len(a)} up={sum(1 for t in a if t[\"health\"]==\"up\")}')"
kill "$PF_PID"   # stop the background port-forward

Verify NetPols match the repo

If you suspect drift between cluster and repo:

diff <(kubectl -n honeydue get netpol -o name | sort) \
     <(grep -E '^\s*name: ' deploy-k3s/manifests/network-policies.yaml \
       | sed 's/.*name: /networkpolicy.networking.k8s.io\//' | sort)

Empty output = match. Any difference needs investigation: either the cluster has policies that aren't in the repo (someone ran a manual kubectl apply) or the repo has policies that were never applied.


Disaster recovery notes

"I have to redeploy the whole stack"

The deploy path is designed to be re-runnable. From a fresh cluster:

  1. Install k3s on all 3 nodes (use existing deploy-k3s/scripts/01-install-k3s.sh)
  2. Fetch a kubeconfig (see "Common operations" above)
  3. Confirm deploy/prod.env has all required secrets:
    • POSTGRES_PASSWORD, SECRET_KEY, EMAIL_HOST_PASSWORD, FCM_SERVER_KEY, B2_KEY_ID, B2_APP_KEY, OBS_INGEST_TOKEN, OBS_TRACES_URL, REDIS_PASSWORD (optional), ADMIN_EMAIL, ADMIN_PASSWORD
  4. Run ./deploy-k3s/scripts/02-setup-secrets.sh (creates honeydue-secrets)
  5. Run ./deploy-k3s/scripts/03-deploy.sh (deploys everything; sed-injects the obs token into vmagent at apply time)
  6. Verify: kubectl -n honeydue get pods should show all workloads Running
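Step 3 is easy to get wrong by eye, so a quick scan of deploy/prod.env before running the scripts can save a failed deploy. The sketch below is hypothetical (function and variable names are made up) and assumes plain KEY=value lines with no "export " prefix and no spaces around "=".

```shell
# Keys listed in step 3 above; REDIS_PASSWORD is left out because it is optional.
REQUIRED_KEYS="POSTGRES_PASSWORD SECRET_KEY EMAIL_HOST_PASSWORD FCM_SERVER_KEY B2_KEY_ID B2_APP_KEY OBS_INGEST_TOKEN OBS_TRACES_URL ADMIN_EMAIL ADMIN_PASSWORD"

# Print every required key that is absent or empty in the env file.
# A missing file reports every key as missing.
missing_env_keys() {
  local env_file="$1" key
  for key in $REQUIRED_KEYS; do
    grep -qE "^${key}=.+" "$env_file" 2>/dev/null || echo "$key"
  done
}

# Example:
# missing="$(missing_env_keys deploy/prod.env)"
# [ -z "$missing" ] || { echo "missing in deploy/prod.env:"; echo "$missing"; }
```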

Post-redeploy verification checklist

  • kubectl -n honeydue get netpol shows 12 NetPols (default-deny + 6 egress + 5 ingress)
  • kubectl -n honeydue get netpol allow-egress-from-vmagent -o yaml | grep 6443 returns the rule (if missing → see "vmagent SD broken" gotcha)
  • kubectl -n kube-system get pod -l app.kubernetes.io/name=kube-state-metrics shows 1 Running pod
  • kubectl -n honeydue port-forward deploy/vmagent 8429:8429 + curl localhost:8429/api/v1/targets shows 4+ targets, all up
  • Grafana panel "pods up" in honeydue namespace populates within 60s

If any of those fail, the matching entry under "Known gotchas" above tells you which problem you hit.
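The first two checklist items can be turned into pass/fail helpers that read kubectl output on stdin. This is a sketch under stated assumptions: the function names are hypothetical, and the expected count of 12 comes from the checklist above, so it must be updated whenever a NetPol is added or removed.

```shell
# Count NetPol names on stdin (one per line, as printed by `-o name`)
# and compare against the expected number.
netpol_count_ok() {
  local expected="$1" actual
  actual="$(grep -c . || true)"   # grep -c prints 0 and exits 1 on empty input
  echo "netpols: $actual (expected $expected)"
  [ "$actual" -eq "$expected" ]
}

# Succeeds if the NetPol YAML on stdin contains a rule for port 6443.
has_6443_rule() {
  grep -Eq 'port:[[:space:]]*6443([^0-9]|$)'
}

# Example:
# kubectl -n honeydue get netpol -o name | netpol_count_ok 12
# kubectl -n honeydue get netpol allow-egress-from-vmagent -o yaml | has_6443_rule \
#   || echo "missing :6443 egress rule"
```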