fix(observability): unbreak vmagent SD on fresh deploy + ship kube-state-metrics

vmagent's k8s service discovery has been silently broken for 17+ days
because k3s's NetworkPolicy controller evaluates egress AFTER kube-proxy's
DNAT (contrary to the k8s spec). Pod → ClusterIP 10.43.0.1:443 was
DNAT'd to <node_public_ip>:6443, and the resulting :6443 destination
matched none of vmagent's egress rules → TCP RST → "connection refused"
on every SD watch attempt. Grafana panels using kube_* or up{} metrics
returned empty as a result.
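
To make the ordering problem concrete, here is a toy model in bash. It is only a sketch of the reasoning above, not how the k3s controller is implemented: `egress_allowed` and `verdict` are hypothetical helpers, and the "old" rule set (only :443 allowed) is an assumption, since the pre-fix rules are not shown in this commit. The "new" rule set adds the :6443 and :8080 ports named in the change list.

```shell
#!/usr/bin/env bash
# Toy model: is the destination port among a policy's allowed egress ports?
egress_allowed() {
  local dest_port="$1"; shift
  local allowed
  for allowed in "$@"; do
    [[ "$dest_port" == "$allowed" ]] && return 0
  done
  return 1
}

# Print "allow" or "reject" for a destination port and a list of allowed ports.
verdict() {
  if egress_allowed "$@"; then echo allow; else echo reject; fi
}

old_rules=(443)            # assumption: what the policy allowed before the fix
new_rules=(443 6443 8080)  # :6443 and :8080 come from this commit

# The pod dials 10.43.0.1:443, but the policy is evaluated after DNAT,
# so the port it sees is 6443.
post_dnat_port=6443

echo "old rules: $(verdict "$post_dnat_port" "${old_rules[@]}")"  # old rules: reject
echo "new rules: $(verdict "$post_dnat_port" "${new_rules[@]}")"  # new rules: allow
```

The same connection flips from reject to allow only because the rule now names the port the packet carries after translation.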
Changes:
- network-policies.yaml: commit the previously-cluster-only NetPols
  (allow-egress-from-vmagent, allow-vmagent-to-api) so a fresh deploy
  produces a working cluster. The vmagent egress rule now includes :6443
  to public IPs (the post-DNAT path) and :8080 to the pod CIDR (for
  scraping kube-state-metrics).
- observability/kube-state-metrics.yaml: new manifest. Provides the
  kube_pod_*, kube_deployment_*, kube_service_* metrics that Grafana
  panels need to count pods, replicas, etc. Runs in kube-system with
  cluster-scoped RBAC.
- observability/vmagent.yaml:
  * add kube-state-metrics scrape job to the ConfigMap
  * add vmagent-kube-system Role+RoleBinding so cross-namespace SD works
  * replace the misleading liveness probe (was /-/healthy, which lies
    while SD is broken) with an exec probe that checks /api/v1/targets
    for at least one healthy target, so future stale-SD incidents
    recover automatically
- scripts/03-deploy.sh: actually apply network-policies.yaml (was
  committed but never applied) and apply kube-state-metrics.yaml.
- RUNBOOK.md (new): documents the post-DNAT gotcha, the liveness probe
  trap, bearer-token recovery procedure, drift-detection diff, and a
  post-redeploy verification checklist.
- .gitignore: cover kubeconfig.tunnel (created during SSH-tunnelled
  kubectl sessions) so the admin client cert can't be committed by accident.
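
The exec probe itself lives in observability/vmagent.yaml, which is not part of the diff shown here. As a rough sketch of the "at least one healthy target" check, assuming the targets endpoint returns Prometheus-style JSON with a `"health":"up"` field per target, a grep-based test could look like this. `has_healthy_target` is a hypothetical helper name, and the real probe may well be written differently.

```shell
#!/usr/bin/env bash
# Succeeds when the targets JSON on stdin contains at least one target
# whose health field is "up"; fails otherwise (including on empty input).
has_healthy_target() {
  grep -Eq '"health"[[:space:]]*:[[:space:]]*"up"'
}

# Possible use inside an exec probe. Assumptions: wget exists in the
# container image and vmagent listens on its default port 8429.
#   wget -qO- http://127.0.0.1:8429/api/v1/targets | has_healthy_target
```

A text match keeps the probe free of extra tooling such as jq, at the cost of not validating the JSON structure.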
Verified via kubectl --dry-run on all three modified manifests.
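
The exact dry-run invocation is not recorded in this message. One way to repeat the check, assuming client-side validation is enough, is a small loop over the manifests. `dry_run_manifests` is a hypothetical helper; `--dry-run=client` is a standard `kubectl apply` option, though the author may have used the server-side variant instead.

```shell
#!/usr/bin/env bash
# Run a client-side dry-run apply for each manifest path given.
# Stops at the first failure and returns kubectl's exit status.
dry_run_manifests() {
  local manifest
  for manifest in "$@"; do
    kubectl apply --dry-run=client -f "$manifest" || return
  done
}

# Example call for the three manifests touched by this commit
# (MANIFESTS is the variable used in scripts/03-deploy.sh):
#   dry_run_manifests \
#     "${MANIFESTS}/network-policies.yaml" \
#     "${MANIFESTS}/observability/kube-state-metrics.yaml" \
#     "${MANIFESTS}/observability/vmagent.yaml"
```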
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -146,6 +146,14 @@ kubectl create configmap honeydue-config \
 log "Applying manifests..."
 
 kubectl apply -f "${MANIFESTS}/namespace.yaml"
+
+# NetworkPolicies first — default-deny-all + per-app allow rules.
+# These MUST be applied; without them the cluster falls back to default-allow
+# (worse posture) AND the vmagent egress rule for :6443 (which fixes a k3s
+# post-DNAT enforcement quirk for k8s API discovery) is missing.
+# See deploy-k3s/RUNBOOK.md ("vmagent SD broken on fresh deploy").
+kubectl apply -f "${MANIFESTS}/network-policies.yaml"
+
 kubectl apply -f "${MANIFESTS}/redis/"
 kubectl apply -f "${MANIFESTS}/ingress/"
 
@@ -181,10 +189,16 @@ if [[ -d "${MANIFESTS}/web" ]]; then
 kubectl apply -f "${MANIFESTS}/web/service.yaml"
 fi
 
-# Observability — vmagent scrapes api Pods :8000/metrics and remote-writes
-# to obs.88oakapps.com. The bearer token comes from deploy/prod.env so it
-# stays out of the repo; the manifest holds TOKEN_PLACEHOLDER.
+# Observability — vmagent scrapes api Pods :8000/metrics + kube-state-metrics
+# :8080/metrics and remote-writes everything to obs.88oakapps.com. The bearer
+# token comes from deploy/prod.env so it stays out of the repo; the manifest
+# holds TOKEN_PLACEHOLDER. kube-state-metrics provides the kube_* metrics
+# Grafana panels need to count pods, deployments, etc.
 if [[ -d "${MANIFESTS}/observability" ]]; then
+# kube-state-metrics — no secrets, plain apply
+kubectl apply -f "${MANIFESTS}/observability/kube-state-metrics.yaml"
+
+# vmagent — needs the bearer-token substitution
 # prod.env lives at the repo's deploy/ dir (sibling of deploy-k3s/), not
 # under deploy-k3s/. It's gitignored — operator copies values there once.
 OBS_TOKEN="$(grep -E '^OBS_INGEST_TOKEN=' "${REPO_DIR}/deploy/prod.env" 2>/dev/null | cut -d= -f2- || true)"
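
The diff stops at the OBS_TOKEN line, so the substitution step itself is not visible here. The following is a minimal sketch of how the token lookup and the TOKEN_PLACEHOLDER replacement could fit together. `read_env_value` and `render_with_token` are hypothetical helpers, not functions from scripts/03-deploy.sh, and the sed-based replacement is an assumption about the approach.

```shell
#!/usr/bin/env bash
# Print the value of KEY from a KEY=value env file. Everything after the
# first '=' is kept, so values may themselves contain '='. Prints nothing
# if the key or the file is missing.
read_env_value() {
  local key="$1" file="$2"
  grep -E "^${key}=" "$file" 2>/dev/null | head -n1 | cut -d= -f2- || true
}

# Replace TOKEN_PLACEHOLDER in the manifest read from stdin.
# Backslash, '&' and the '|' delimiter are escaped so sed treats the
# token as literal text.
render_with_token() {
  local token="$1" escaped
  escaped="$(printf '%s' "$token" | sed -e 's/[\\&|]/\\&/g')"
  sed -e "s|TOKEN_PLACEHOLDER|${escaped}|g"
}

# Possible use (paths follow the names in the diff above):
#   OBS_TOKEN="$(read_env_value OBS_INGEST_TOKEN "${REPO_DIR}/deploy/prod.env")"
#   render_with_token "$OBS_TOKEN" < "${MANIFESTS}/observability/vmagent.yaml" \
#     | kubectl apply -f -
```

Piping the rendered manifest straight into `kubectl apply -f -` keeps the real token out of any file on disk.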