vmagent's k8s service discovery has been silently broken for 17+ days
because k3s's NetworkPolicy controller evaluates egress AFTER kube-proxy's
DNAT (contrary to the k8s spec). Pod → ClusterIP 10.43.0.1:443 was
DNAT'd to <node_public_ip>:6443, and the resulting :6443 destination
matched none of vmagent's egress rules → TCP RST → "connection refused"
on every SD watch attempt. Grafana panels using kube_* or up{} metrics
returned empty as a result.
Changes:
- network-policies.yaml: commit the previously-cluster-only NetPols
(allow-egress-from-vmagent, allow-vmagent-to-api) so a fresh deploy
produces a working cluster. The vmagent egress rule now includes :6443
to public IPs (the post-DNAT path) and :8080 to the pod CIDR (for
scraping kube-state-metrics).
- observability/kube-state-metrics.yaml: new manifest. Provides the
kube_pod_*, kube_deployment_*, kube_service_* metrics that Grafana
panels need to count pods, replicas, etc. Runs in kube-system with
cluster-scoped RBAC.
- observability/vmagent.yaml:
  * add kube-state-metrics scrape job to the ConfigMap
  * add vmagent-kube-system Role+RoleBinding so cross-namespace SD works
  * replace the misleading liveness probe (was /-/healthy, which lies
    while SD is broken) with an exec probe that checks /api/v1/targets
    for at least one healthy target, giving automatic recovery from
    future stale-SD incidents
- scripts/03-deploy.sh: actually apply network-policies.yaml (was
committed but never applied) and apply kube-state-metrics.yaml.
- RUNBOOK.md (new): documents the post-DNAT gotcha, the liveness probe
trap, bearer-token recovery procedure, drift-detection diff, and a
post-redeploy verification checklist.
- .gitignore: cover kubeconfig.tunnel (created during SSH-tunnelled
kubectl sessions) so admin client cert can't be committed by accident.
Verified via kubectl --dry-run on all three modified manifests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
k3s Cluster Operations Runbook
Living document for honeyDue k3s cluster operations. Add entries when you hit something non-obvious so future-you (or your replacement) doesn't have to rediscover it.
Deployment
The canonical deploy path is deploy-k3s/scripts/03-deploy.sh. It applies
everything in deploy-k3s/manifests/ in the right order.
What it touches (in order):
- namespace.yaml
- network-policies.yaml (all NetPols, including the vmagent ones)
- redis/
- ingress/
- migrate/job.yaml (with image substitution; blocks on success)
- api/deployment.yaml, api/service.yaml, api/hpa.yaml (image-subbed)
- worker/deployment.yaml (image-subbed)
- admin/deployment.yaml, admin/service.yaml (image-subbed)
- web/deployment.yaml, web/service.yaml (image-subbed; optional dir)
- observability/kube-state-metrics.yaml
- observability/vmagent.yaml (with TOKEN_PLACEHOLDER sed-substituted from deploy/prod.env)
If you add a new manifest, also add a kubectl apply -f line to
03-deploy.sh — there's no kustomization or apply -R. A manifest
that exists in the repo but isn't applied by the script will silently
not deploy.
Pre-deploy checklist
- deploy/prod.env exists and contains OBS_INGEST_TOKEN=... (otherwise vmagent gets skipped with a warning)
- KUBECONFIG points at the right cluster
- The Gitea image registry is reachable from k3s nodes
- Schema migrations in migrations/ are tested locally first (the deploy aborts if the honeydue-migrate Job fails)
Known gotchas
vmagent SD broken on fresh deploy ("0 pods up" in Grafana)
Symptoms:
- Grafana panels using kube_* metrics or up{job=...} show 0
- vmagent logs: dial tcp 10.43.0.1:443: connect: connection refused, repeating every ~30s
- Direct test from a pod is also refused:
  kubectl -n honeydue exec deploy/vmagent -- wget --no-check-certificate -qO- -T 3 https://10.43.0.1:443/livez
Cause: k3s's built-in NetworkPolicy controller evaluates egress rules
after kube-proxy's DNAT, not before (contrary to the k8s spec).
Traffic from a pod to the kubernetes Service (ClusterIP 10.43.0.1:443)
gets DNAT'd to <node_public_ip>:6443, and then the policy check
runs. Without an explicit egress rule for :6443, the packet is rejected
with a TCP RST → "connection refused".
The allow-egress-from-vmagent NetPol in network-policies.yaml includes
both rules:
# Pre-DNAT view (correct per spec; harmless if unused)
- to:
- ipBlock: { cidr: 10.43.0.0/16 }
ports:
- { port: 443, protocol: TCP }
# Post-DNAT path (what k3s NetPol enforcer actually sees) — REQUIRED
- to:
- ipBlock:
cidr: 0.0.0.0/0
except: [10.42.0.0/16]
ports:
- { port: 6443, protocol: TCP }
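To make the pre-DNAT versus post-DNAT difference concrete, here is a minimal Python sketch that evaluates the two rules shown above against a destination address and port. It is only a model of the matching logic, not the k3s enforcer itself. NODE_IP is a hypothetical address from the documentation range standing in for <node_public_ip>, and the function name is an illustration.

```python
import ipaddress

# The two egress rules shown above, as (cidr, except-cidrs, port).
RULES = [
    ("10.43.0.0/16", [], 443),              # pre-DNAT view (Service CIDR)
    ("0.0.0.0/0", ["10.42.0.0/16"], 6443),  # post-DNAT path
]

def egress_allowed(dst_ip, dst_port, rules=RULES):
    """Return True if any rule admits TCP traffic to dst_ip:dst_port."""
    ip = ipaddress.ip_address(dst_ip)
    for cidr, excepts, port in rules:
        if port != dst_port:
            continue
        if ip not in ipaddress.ip_network(cidr):
            continue
        if any(ip in ipaddress.ip_network(e) for e in excepts):
            continue
        return True
    return False

# Hypothetical stand-in for <node_public_ip> (RFC 5737 documentation range).
NODE_IP = "203.0.113.10"

print(egress_allowed("10.43.0.1", 443))          # what the pod dials
print(egress_allowed(NODE_IP, 6443))             # what the enforcer sees after DNAT
print(egress_allowed(NODE_IP, 6443, RULES[:1]))  # same packet without the :6443 rule
```

With only the first rule present, the post-DNAT destination matches nothing, which is the situation that produced the "connection refused" symptom.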
If this happens on a fresh deploy: confirm network-policies.yaml
was applied:
kubectl -n honeydue get netpol allow-egress-from-vmagent -o yaml
Look for the port-6443 egress rule. If missing, the apply step in
03-deploy.sh was skipped or the file was edited and the rule got
dropped.
Counter-evidence that confirms diagnosis: kube-state-metrics in
kube-system works fine, because kube-system has no NetPols. So if
ksm is healthy but workloads in honeydue can't reach the apiserver
ClusterIP, this gotcha is the cause.
vmagent appears healthy but no data in Grafana
vmagent's /-/healthy endpoint returns 200 as long as the process is
alive and remote-write is functional (TCP-level) — it does not
check whether scrapes are succeeding. We saw this fail once: vmagent
was "healthy" for 17 days while having zero healthy targets due to a
broken k8s SD watch.
The liveness probe in vmagent.yaml queries the agent's /api/v1/targets
endpoint and fails the pod if no target is in state up. After 3
consecutive failures (~3 min), kubelet recycles the pod and SD restarts
clean.
Verify it's working: kubectl -n honeydue describe pod -l app.kubernetes.io/name=vmagent
should show Liveness: exec [sh -c ...]. If you ever see vmagent running
for weeks but no metrics in Grafana, the probe was disabled or the exec
command broke.
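The check the probe is described as making can be sketched in Python as follows. This is not the exec command from vmagent.yaml (which is not reproduced here); it only illustrates the "at least one target is up" predicate, assuming the same JSON shape the verification one-liner under "Restore vmagent" relies on (data.activeTargets[].health). The function name is hypothetical.

```python
import json

def has_healthy_target(payload: str) -> bool:
    """True if at least one active target reports health 'up'."""
    try:
        targets = json.loads(payload)["data"]["activeTargets"]
        return any(t.get("health") == "up" for t in targets)
    except (ValueError, KeyError, TypeError, AttributeError):
        # An unparseable or unexpected response is treated as a failed probe.
        return False

# Sample payloads (made up for illustration).
healthy = '{"data": {"activeTargets": [{"health": "down"}, {"health": "up"}]}}'
stale = '{"data": {"activeTargets": []}}'

print(has_healthy_target(healthy))
print(has_healthy_target(stale))
```

A probe built on this would exit non-zero whenever the function returns False, so an empty target list fails the pod even though /-/healthy still returns 200.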
vmagent's bearer token got blown away after kubectl apply -f vmagent.yaml
The committed vmagent.yaml has bearer_token: TOKEN_PLACEHOLDER. The
real token is sed-substituted at deploy time by 03-deploy.sh. If you
ever apply vmagent.yaml directly:
kubectl apply -f deploy-k3s/manifests/observability/vmagent.yaml # WRONG
the Secret gets overwritten with the literal string TOKEN_PLACEHOLDER
and all remote-writes start returning 401 from obs.88oakapps.com.
To restore without a full redeploy (the safe inline path):
export KUBECONFIG=...
OBS_TOKEN_B64=$(kubectl -n honeydue get secret honeydue-secrets \
-o jsonpath='{.data.OBS_INGEST_TOKEN}')
kubectl -n honeydue patch secret vmagent-remote-write --type=json \
-p="[{\"op\":\"replace\",\"path\":\"/data/bearer_token\",\"value\":\"${OBS_TOKEN_B64}\"}]"
kubectl -n honeydue rollout restart deploy/vmagent
The OBS token also lives in honeydue-secrets.OBS_INGEST_TOKEN because
the api pods use it for traces — same secret, same value.
Or just re-run the deploy: ./deploy-k3s/scripts/03-deploy.sh. The
sed step handles the substitution correctly.
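To tell quickly whether the Secret was clobbered, one option is to decode the stored value and compare it with the placeholder. The sketch below assumes you pass it the base64 string read from the bearer_token key of the vmagent-remote-write Secret (the same path the patch command above writes to); the function name is hypothetical.

```python
import base64

PLACEHOLDER = "TOKEN_PLACEHOLDER"

def token_is_placeholder(b64_value: str) -> bool:
    """Decode a Secret data value and report whether it is the unsubstituted placeholder."""
    decoded = base64.b64decode(b64_value).decode("utf-8", errors="replace").strip()
    return decoded == PLACEHOLDER

# Example values, encoded the way Kubernetes stores Secret data.
clobbered = base64.b64encode(b"TOKEN_PLACEHOLDER").decode()
real = base64.b64encode(b"some-real-token").decode()  # made-up token for illustration

print(token_is_placeholder(clobbered))
print(token_is_placeholder(real))
```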
Node kubeconfig is world-readable
/etc/rancher/k3s/k3s.yaml is mode 0644 per the --write-kubeconfig-mode=644
k3s install flag. Any process on the host (including any container that
mounts the host filesystem) can read full cluster-admin credentials.
This is intentional for the deploy user but worth knowing — any container
escape becomes immediate cluster-admin. Tracked as finding F4 in
k3_audit_5_12.md.
To tighten (if you ever turn this knob): change to --write-kubeconfig-mode=600
in the k3s install command, then re-fetch deploy-k3s/kubeconfig.
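After changing the flag, the file mode can be confirmed on the node with a small check like the one below. It is a sketch that assumes a POSIX host and tests only the "other" read bit; the function name is hypothetical, and the demo uses a temporary file rather than the real kubeconfig.

```python
import os
import stat
import tempfile

def world_readable(path: str) -> bool:
    """True if the 'other' read bit is set on path (as with mode 0644)."""
    return bool(stat.S_IMODE(os.stat(path).st_mode) & stat.S_IROTH)

# Demo on a temporary file so nothing on the host is touched.
with tempfile.NamedTemporaryFile() as f:
    os.chmod(f.name, 0o644)
    print(world_readable(f.name))  # mode 0644: readable by everyone
    os.chmod(f.name, 0o600)
    print(world_readable(f.name))  # mode 0600: owner only
```

On the node itself the call would be world_readable("/etc/rancher/k3s/k3s.yaml").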
Common operations
Fetch a working kubectl tunnel (if deploy-k3s/kubeconfig is missing or stale)
ssh -i ~/.ssh/hetzner deploy@hetzner1 'sudo cat /etc/rancher/k3s/k3s.yaml' \
| sed 's|server: https://127.0.0.1:6443|server: https://178.104.247.152:6443|' \
> deploy-k3s/kubeconfig
chmod 600 deploy-k3s/kubeconfig
If the public :6443 is firewalled from your IP (the default — only Cloudflare ranges are allowed for app traffic; admin is locked down):
# SSH tunnel (-f sends it to the background once connected; kill the ssh process to close it)
ssh -fN -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 \
-i ~/.ssh/hetzner \
-L 127.0.0.1:6443:127.0.0.1:6443 \
deploy@hetzner1
# Then use a kubeconfig pointing at localhost
cp deploy-k3s/kubeconfig deploy-k3s/kubeconfig.tunnel
sed -i.bak 's|https://178.104.247.152:6443|https://127.0.0.1:6443|' \
deploy-k3s/kubeconfig.tunnel
export KUBECONFIG="$(pwd)/deploy-k3s/kubeconfig.tunnel"
Restore vmagent after a "0 targets" incident
export KUBECONFIG="$(pwd)/deploy-k3s/kubeconfig.tunnel"
# 1. Confirm the diagnosis
kubectl -n honeydue logs deploy/vmagent --tail=20 | grep -i "connect: connection refused"
# 2. Check the NetPol has the :6443 rule
kubectl -n honeydue get netpol allow-egress-from-vmagent -o yaml | grep -A 5 6443
# 3. If missing, re-apply
kubectl apply -f deploy-k3s/manifests/network-policies.yaml
# 4. Restart vmagent
kubectl -n honeydue rollout restart deploy/vmagent
# 5. Verify targets after ~60s
kubectl -n honeydue port-forward deploy/vmagent 8429:8429 &
PF_PID=$!
sleep 2  # give the port-forward a moment to start listening
curl -s http://localhost:8429/api/v1/targets \
| python3 -c "import json,sys; d=json.load(sys.stdin); \
a=d['data']['activeTargets']; \
print(f'targets={len(a)} up={sum(1 for t in a if t[\"health\"]==\"up\")}')"
kill "$PF_PID"  # stop the background port-forward
Verify NetPols match the repo
If you suspect drift between cluster and repo:
diff <(kubectl -n honeydue get netpol -o name | sort) \
<(grep -E '^\s*name: ' deploy-k3s/manifests/network-policies.yaml \
| sed 's/.*name: /networkpolicy.networking.k8s.io\//' | sort)
Empty output = match. Any differences need investigation — either the
cluster has policies that aren't in repo (manual kubectl apply did it)
or repo has policies that didn't apply.
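The diff shows that something differs but not which side is ahead. A minimal Python sketch of the same comparison, reporting each direction separately, could look like this. It assumes the cluster names come from kubectl get netpol -o name and uses the same name-matching pattern as the grep above; the function name and the sample policy manual-hotfix are hypothetical.

```python
import re

# Same pattern as the grep above: a line that is only indentation plus "name: <value>".
NAME_RE = re.compile(r"^[ \t]*name: (\S+)", re.MULTILINE)

def netpol_drift(cluster_names, manifest_text):
    """Return (only_in_cluster, only_in_repo) as sorted lists of policy names."""
    in_cluster = {n.strip().split("/", 1)[-1] for n in cluster_names if n.strip()}
    in_repo = set(NAME_RE.findall(manifest_text))
    return sorted(in_cluster - in_repo), sorted(in_repo - in_cluster)

manifest = """\
metadata:
  name: default-deny
---
metadata:
  name: allow-egress-from-vmagent
"""
cluster = [
    "networkpolicy.networking.k8s.io/default-deny",
    "networkpolicy.networking.k8s.io/manual-hotfix",  # hypothetical manually applied policy
]

only_in_cluster, only_in_repo = netpol_drift(cluster, manifest)
print("only in cluster:", only_in_cluster)
print("only in repo:", only_in_repo)
```

Names that appear only in the cluster point at a manual kubectl apply; names that appear only in the repo point at a policy that never got applied.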
Disaster recovery notes
"I have to redeploy the whole stack"
The deploy path is designed to be re-runnable. From a fresh cluster:
- Install k3s on all 3 nodes (use existing deploy-k3s/scripts/01-install-k3s.sh)
- Fetch a kubeconfig (see "Common operations" above)
- Confirm deploy/prod.env has all required secrets: POSTGRES_PASSWORD, SECRET_KEY, EMAIL_HOST_PASSWORD, FCM_SERVER_KEY, B2_KEY_ID, B2_APP_KEY, OBS_INGEST_TOKEN, OBS_TRACES_URL, REDIS_PASSWORD (optional), ADMIN_EMAIL, ADMIN_PASSWORD
- Run ./deploy-k3s/scripts/02-setup-secrets.sh (creates honeydue-secrets)
- Run ./deploy-k3s/scripts/03-deploy.sh (deploys everything; sed-injects the obs token into vmagent at apply time)
- Verify: kubectl -n honeydue get pods should show all workloads Running
Post-redeploy verification checklist
- kubectl -n honeydue get netpol shows 12 NetPols (default-deny + 6 egress + 5 ingress)
- kubectl -n honeydue get netpol allow-egress-from-vmagent -o yaml | grep 6443 returns the rule (if missing → see "vmagent SD broken" gotcha)
- kubectl -n kube-system get pod -l app.kubernetes.io/name=kube-state-metrics shows 1 Running pod
- kubectl -n honeydue port-forward deploy/vmagent 8429:8429 + curl localhost:8429/api/v1/targets shows 4+ targets, all up
- Grafana panel "pods up" in honeydue namespace populates within 60s
If any of those fail, the "Known gotchas" section above covers the likely cause.
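Two of the checks above are numeric and can be scripted. The Python sketch below is one way to do that, assuming you pass it the lines from kubectl -n honeydue get netpol -o name and the raw JSON from localhost:8429/api/v1/targets. The thresholds come from the checklist; the function and constant names are hypothetical.

```python
import json

EXPECTED_NETPOLS = 12  # default-deny + 6 egress + 5 ingress
MIN_TARGETS = 4        # the checklist asks for 4+ targets, all up

def checklist_failures(netpol_names, targets_payload):
    """Return human-readable failures; an empty list means both checks pass."""
    failures = []
    if len(netpol_names) != EXPECTED_NETPOLS:
        failures.append(f"expected {EXPECTED_NETPOLS} NetPols, found {len(netpol_names)}")
    targets = json.loads(targets_payload)["data"]["activeTargets"]
    if len(targets) < MIN_TARGETS:
        failures.append(f"expected at least {MIN_TARGETS} targets, found {len(targets)}")
    down = sum(1 for t in targets if t.get("health") != "up")
    if down:
        failures.append(f"{down} target(s) not up")
    return failures

# Made-up sample input: 11 policies and two targets, one of them down.
sample = json.dumps({"data": {"activeTargets": [{"health": "up"}, {"health": "down"}]}})
for failure in checklist_failures(["np"] * 11, sample):
    print(failure)
```

The remaining items (the kube-state-metrics pod and the Grafana panel) still need to be checked by hand.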