Brings the runbook in line with the 2026-06-03 Hetzner → OVH cutover:
- Section 1-5: topology, machines (3x OVH VPS-1 BHS), software versions,
network/firewall, DNS, filesystem layout — all reflect the live OVH
install instead of the historical Hetzner setup.
- Section 6: canonical install-from-clean-boxes procedure (the literal
commands run on 2026-06-03), so anyone can stand up a backup cluster
by following along.
- Section 9: keeps existing gotchas (vmagent NetPol, token-blown-away,
healthy-but-empty) and adds four new ones discovered during the OVH
build: rbac.yaml not in 03-deploy.sh, namespace label missing from api
metrics (use service="api"), cluster-label collision when two clusters
push concurrently, worker double-firing on cutover.
- Section 11.1: enumerates Tier-3 observability gaps surfaced while
building the honeydue-eli5-overview dashboard (node-exporter not
deployed, Traefik metrics off, push success counters absent, worker
/metrics endpoint absent, cache hit rate uninstrumented, APNs latency
uninstrumented).
- Section 12: dated audit trail of cluster changes.
Pure documentation; no code or manifest changes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
vmagent's k8s service discovery has been silently broken for 17+ days
because k3s's NetworkPolicy controller evaluates egress AFTER kube-proxy's
DNAT (contrary to the k8s spec). Pod → ClusterIP 10.43.0.1:443 was
DNAT'd to <node_public_ip>:6443, and the resulting :6443 destination
matched none of vmagent's egress rules → TCP RST → "connection refused"
on every SD watch attempt. Grafana panels using kube_* or up{} metrics
returned empty as a result.
Changes:
- network-policies.yaml: commit the previously-cluster-only NetPols
(allow-egress-from-vmagent, allow-vmagent-to-api) so a fresh deploy
produces a working cluster. The vmagent egress rule now includes :6443
to public IPs (the post-DNAT path) and :8080 to the pod CIDR (for
scraping kube-state-metrics).
- observability/kube-state-metrics.yaml: new manifest. Provides the
kube_pod_*, kube_deployment_*, kube_service_* metrics that Grafana
panels need to count pods, replicas, etc. Runs in kube-system with
cluster-scoped RBAC.
- observability/vmagent.yaml:
* add kube-state-metrics scrape job to the ConfigMap
* add vmagent-kube-system Role+RoleBinding so cross-namespace SD works
* replace the misleading liveness probe (was /-/healthy, which lies
while SD is broken) with an exec probe that checks /api/v1/targets
for at least one healthy target — automatic recovery from future
stale-SD incidents
- scripts/03-deploy.sh: actually apply network-policies.yaml (was
committed but never applied) and apply kube-state-metrics.yaml.
- RUNBOOK.md (new): documents the post-DNAT gotcha, the liveness probe
trap, bearer-token recovery procedure, drift-detection diff, and a
post-redeploy verification checklist.
- .gitignore: cover kubeconfig.tunnel (created during SSH-tunnelled
kubectl sessions) so admin client cert can't be committed by accident.
Verified via kubectl --dry-run on all three modified manifests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>