fix(observability): unbreak vmagent SD on fresh deploy + ship kube-state-metrics
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled

vmagent's k8s service discovery has been silently broken for 17+ days
because k3s's NetworkPolicy controller evaluates egress AFTER kube-proxy's
DNAT (contrary to the k8s spec). Pod → ClusterIP 10.43.0.1:443 was
DNAT'd to <node_public_ip>:6443, and the resulting :6443 destination
matched none of vmagent's egress rules → TCP RST → "connection refused"
on every SD watch attempt. Grafana panels using kube_* or up{} metrics
returned empty as a result.

Changes:

- network-policies.yaml: commit the previously-cluster-only NetPols
  (allow-egress-from-vmagent, allow-vmagent-to-api) so a fresh deploy
  produces a working cluster. The vmagent egress rule now includes :6443
  to public IPs (the post-DNAT path) and :8080 to the pod CIDR (for
  scraping kube-state-metrics).

- observability/kube-state-metrics.yaml: new manifest. Provides the
  kube_pod_*, kube_deployment_*, kube_service_* metrics that Grafana
  panels need to count pods, replicas, etc. Runs in kube-system with
  cluster-scoped RBAC.

- observability/vmagent.yaml:
  * add kube-state-metrics scrape job to the ConfigMap
  * add vmagent-kube-system Role+RoleBinding so cross-namespace SD works
  * replace the misleading liveness probe (was /-/healthy, which lies
    while SD is broken) with an exec probe that checks /api/v1/targets
    for at least one healthy target — automatic recovery from future
    stale-SD incidents

- scripts/03-deploy.sh: actually apply network-policies.yaml (was
  committed but never applied) and apply kube-state-metrics.yaml.

- RUNBOOK.md (new): documents the post-DNAT gotcha, the liveness probe
  trap, bearer-token recovery procedure, drift-detection diff, and a
  post-redeploy verification checklist.

- .gitignore: cover kubeconfig.tunnel (created during SSH-tunnelled
  kubectl sessions) so admin client cert can't be committed by accident.

Verified via kubectl --dry-run on all three modified manifests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Trey t
2026-05-13 00:30:11 -05:00
parent 7cc5448a7c
commit 139a990ebc
6 changed files with 666 additions and 6 deletions
@@ -275,3 +275,100 @@ spec:
ports:
- protocol: TCP
port: 443
---
# vmagent egress.
#
# IMPORTANT (gotcha): k3s's built-in NetworkPolicy controller appears to
# evaluate egress rules AFTER kube-proxy's DNAT, not before (contrary to
# the k8s spec). So traffic from a pod to the kubernetes Service
# (ClusterIP 10.43.0.1:443) is policy-checked as dst=<node_public_ip>:6443.
# That's why we need an explicit rule for :6443 to public IPs, even though
# we already allow :443 to the cluster service CIDR.
#
# Without the :6443 rule, vmagent's k8s service discovery silently fails
# and zero pods get scraped. See deploy-k3s/RUNBOOK.md ("vmagent SD broken").
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-egress-from-vmagent
namespace: honeydue
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: vmagent
policyTypes:
- Egress
egress:
# DNS (cluster-internal)
- to:
- namespaceSelector: {}
ports:
- port: 53
protocol: UDP
- port: 53
protocol: TCP
# k8s API server via ClusterIP (pre-DNAT view)
- to:
- ipBlock:
cidr: 10.43.0.0/16
ports:
- port: 443
protocol: TCP
# k8s API server post-DNAT (real path k3s NetPol enforcer sees) — REQUIRED
- to:
- ipBlock:
cidr: 0.0.0.0/0
except:
- 10.42.0.0/16
ports:
- port: 6443
protocol: TCP
# Scrape api Pods on :8000
- to:
- ipBlock:
cidr: 10.42.0.0/16
ports:
- port: 8000
protocol: TCP
# Scrape kube-state-metrics Pod on :8080 (pod CIDR)
- to:
- ipBlock:
cidr: 10.42.0.0/16
ports:
- port: 8080
protocol: TCP
# HTTPS to public (remote-write to obs.88oakapps.com via Cloudflare)
- to:
- ipBlock:
cidr: 0.0.0.0/0
except:
- 10.42.0.0/16
- 10.43.0.0/16
ports:
- port: 443
protocol: TCP
---
# Allow vmagent → api ingress on :8000 so api pods accept scrapes.
# api Pods are otherwise locked down by default-deny-all + allow-ingress-to-api
# (which only allows Traefik). This adds vmagent specifically.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-vmagent-to-api
namespace: honeydue
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: api
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
app.kubernetes.io/name: vmagent
ports:
- port: 8000
protocol: TCP