fix(observability): unbreak vmagent SD on fresh deploy + ship kube-state-metrics

vmagent's k8s service discovery has been silently broken for 17+ days
because k3s's NetworkPolicy controller evaluates egress AFTER kube-proxy's
DNAT (contrary to the k8s spec). Pod → ClusterIP 10.43.0.1:443 was DNAT'd
to <node_public_ip>:6443, and the resulting :6443 destination matched none
of vmagent's egress rules → TCP RST → "connection refused" on every SD
watch attempt. Grafana panels using kube_* or up{} metrics returned empty
as a result.

Changes:

- network-policies.yaml: commit the previously-cluster-only NetPols
  (allow-egress-from-vmagent, allow-vmagent-to-api) so a fresh deploy
  produces a working cluster. The vmagent egress rule now includes :6443
  to public IPs (the post-DNAT path) and :8080 to the pod CIDR (for
  scraping kube-state-metrics).

- observability/kube-state-metrics.yaml: new manifest. Provides the
  kube_pod_*, kube_deployment_*, kube_service_* metrics that Grafana
  panels need to count pods, replicas, etc. Runs in kube-system with
  cluster-scoped RBAC.

- observability/vmagent.yaml:
  * add kube-state-metrics scrape job to the ConfigMap
  * add vmagent-kube-system Role+RoleBinding so cross-namespace SD works
  * replace the misleading liveness probe (was /-/healthy, which lies
    while SD is broken) with an exec probe that checks /api/v1/targets
    for at least one healthy target, giving automatic recovery from
    future stale-SD incidents

- scripts/03-deploy.sh: actually apply network-policies.yaml (was
  committed but never applied) and apply kube-state-metrics.yaml.

- RUNBOOK.md (new): documents the post-DNAT gotcha, the liveness probe
  trap, bearer-token recovery procedure, drift-detection diff, and a
  post-redeploy verification checklist.

- .gitignore: cover kubeconfig.tunnel (created during SSH-tunnelled
  kubectl sessions) so admin client cert can't be committed by accident.

Verified via kubectl --dry-run on all three modified manifests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 00:30:11 -05:00
parent 7cc5448a7c
commit 139a990ebc
6 changed files with 666 additions and 6 deletions
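The liveness check described in the commit message (at least one healthy target in /api/v1/targets) can be sketched in Python as follows. This is a minimal illustration, not the probe that ships in the manifest: it assumes the endpoint returns a Prometheus-compatible JSON document shaped like `data.activeTargets[].health`, and the function names `count_up_targets` and `is_live` are hypothetical.

```python
import json


def count_up_targets(payload: str) -> int:
    """Count active targets reporting health "up".

    Assumes a Prometheus-compatible targets response:
    {"status": "success", "data": {"activeTargets": [{"health": "up"}, ...]}}
    Any payload that cannot be parsed (for example an empty body after a
    failed fetch) counts as zero healthy targets.
    """
    try:
        doc = json.loads(payload)
    except json.JSONDecodeError:
        return 0
    if not isinstance(doc, dict):
        return 0
    data = doc.get("data")
    if not isinstance(data, dict):
        return 0
    targets = data.get("activeTargets")
    if not isinstance(targets, list):
        return 0
    return sum(
        1 for t in targets if isinstance(t, dict) and t.get("health") == "up"
    )


def is_live(payload: str) -> bool:
    """Liveness passes only when at least one target is healthy."""
    return count_up_targets(payload) > 0


# Hand-written sample payloads (not captured from a real vmagent).
SAMPLE_OK = json.dumps(
    {"status": "success",
     "data": {"activeTargets": [{"health": "up"}, {"health": "down"}]}}
)
SAMPLE_STALE = json.dumps({"status": "success", "data": {"activeTargets": []}})
```

The shell probe in the manifest uses `grep -c`, which counts matching lines rather than matching targets, so for a single-line response it yields 0 or 1. That is enough for a "greater than zero" test; the sketch above counts individual targets only to make the intent explicit.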
@@ -275,3 +275,100 @@ spec:
      ports:
        - protocol: TCP
          port: 443
+
+---
+# vmagent egress.
+#
+# IMPORTANT (gotcha): k3s's built-in NetworkPolicy controller appears to
+# evaluate egress rules AFTER kube-proxy's DNAT, not before (contrary to
+# the k8s spec). So traffic from a pod to the kubernetes Service
+# (ClusterIP 10.43.0.1:443) is policy-checked as dst=<node_public_ip>:6443.
+# That's why we need an explicit rule for :6443 to public IPs, even though
+# we already allow :443 to the cluster service CIDR.
+#
+# Without the :6443 rule, vmagent's k8s service discovery silently fails
+# and zero pods get scraped. See deploy-k3s/RUNBOOK.md ("vmagent SD broken").
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata:
+  name: allow-egress-from-vmagent
+  namespace: honeydue
+spec:
+  podSelector:
+    matchLabels:
+      app.kubernetes.io/name: vmagent
+  policyTypes:
+    - Egress
+  egress:
+    # DNS (cluster-internal)
+    - to:
+        - namespaceSelector: {}
+      ports:
+        - port: 53
+          protocol: UDP
+        - port: 53
+          protocol: TCP
+    # k8s API server via ClusterIP (pre-DNAT view)
+    - to:
+        - ipBlock:
+            cidr: 10.43.0.0/16
+      ports:
+        - port: 443
+          protocol: TCP
+    # k8s API server post-DNAT (real path k3s NetPol enforcer sees) — REQUIRED
+    - to:
+        - ipBlock:
+            cidr: 0.0.0.0/0
+            except:
+              - 10.42.0.0/16
+      ports:
+        - port: 6443
+          protocol: TCP
+    # Scrape api Pods on :8000
+    - to:
+        - ipBlock:
+            cidr: 10.42.0.0/16
+      ports:
+        - port: 8000
+          protocol: TCP
+    # Scrape kube-state-metrics Pod on :8080 (pod CIDR)
+    - to:
+        - ipBlock:
+            cidr: 10.42.0.0/16
+      ports:
+        - port: 8080
+          protocol: TCP
+    # HTTPS to public (remote-write to obs.88oakapps.com via Cloudflare)
+    - to:
+        - ipBlock:
+            cidr: 0.0.0.0/0
+            except:
+              - 10.42.0.0/16
+              - 10.43.0.0/16
+      ports:
+        - port: 443
+          protocol: TCP
+
+---
+# Allow vmagent → api ingress on :8000 so api pods accept scrapes.
+# api Pods are otherwise locked down by default-deny-all + allow-ingress-to-api
+# (which only allows Traefik). This adds vmagent specifically.
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata:
+  name: allow-vmagent-to-api
+  namespace: honeydue
+spec:
+  podSelector:
+    matchLabels:
+      app.kubernetes.io/name: api
+  policyTypes:
+    - Ingress
+  ingress:
+    - from:
+        - podSelector:
+            matchLabels:
+              app.kubernetes.io/name: vmagent
+      ports:
+        - port: 8000
+          protocol: TCP
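To see why the :6443 rule matters, the ipBlock rules of allow-egress-from-vmagent can be modelled as a small first-match check in Python. This is a sketch under stated assumptions, not how the k3s enforcer is implemented: `egress_allowed` is a hypothetical helper, `203.0.113.10` is a documentation address standing in for `<node_public_ip>`, and the DNS rule is left out because it selects namespaces rather than CIDRs.

```python
from ipaddress import ip_address, ip_network

# (allowed CIDR, excluded CIDRs, port, protocol), transcribed from the
# ipBlock rules of allow-egress-from-vmagent.
RULES = [
    ("10.43.0.0/16", [], 443, "TCP"),                  # API via ClusterIP
    ("0.0.0.0/0", ["10.42.0.0/16"], 6443, "TCP"),      # API post-DNAT
    ("10.42.0.0/16", [], 8000, "TCP"),                 # api pods
    ("10.42.0.0/16", [], 8080, "TCP"),                 # kube-state-metrics
    ("0.0.0.0/0", ["10.42.0.0/16", "10.43.0.0/16"], 443, "TCP"),  # remote-write
]

# Hypothetical stand-in for <node_public_ip>.
NODE_IP = "203.0.113.10"


def egress_allowed(dst_ip, port, rules=RULES, protocol="TCP"):
    """Return True if any rule permits traffic to dst_ip:port."""
    addr = ip_address(dst_ip)
    for cidr, excepts, rule_port, rule_proto in rules:
        if port != rule_port or protocol != rule_proto:
            continue
        if addr not in ip_network(cidr):
            continue
        if any(addr in ip_network(e) for e in excepts):
            continue
        return True
    return False


# The rule set as it would look without the post-DNAT rule.
RULES_WITHOUT_6443 = [r for r in RULES if r[2] != 6443]
```

With the full rule set, both the pre-DNAT view (`10.43.0.1:443`) and the post-DNAT view (`NODE_IP:6443`) pass. With `RULES_WITHOUT_6443`, only the pre-DNAT view passes, which matches the failure described in the comment above if the enforcer really checks the post-DNAT destination.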
@@ -0,0 +1,223 @@
+# kube-state-metrics — exposes cluster object state (pods, deployments,
+# services, etc.) as Prometheus metrics. vmagent scrapes it via the
+# kube-state-metrics job in vmagent-config; Grafana panels that count
+# pods, replicas, etc. consume the `kube_*` metrics this produces.
+#
+# Lives in kube-system because it watches resources cluster-wide.
+# RBAC is cluster-scoped (ClusterRole + ClusterRoleBinding).
+#
+# Image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.13.0
+# (latest stable as of authoring; bump when a newer minor is released)
+
+---
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: kube-state-metrics
+  namespace: kube-system
+  labels:
+    app.kubernetes.io/name: kube-state-metrics
+    app.kubernetes.io/part-of: honeydue-observability
+
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: kube-state-metrics
+  labels:
+    app.kubernetes.io/name: kube-state-metrics
+    app.kubernetes.io/part-of: honeydue-observability
+rules:
+  # Core resources
+  - apiGroups: [""]
+    resources:
+      - configmaps
+      - secrets
+      - nodes
+      - pods
+      - services
+      - serviceaccounts
+      - resourcequotas
+      - replicationcontrollers
+      - limitranges
+      - persistentvolumeclaims
+      - persistentvolumes
+      - namespaces
+      - endpoints
+    verbs: [list, watch]
+  # Apps
+  - apiGroups: ["apps"]
+    resources:
+      - statefulsets
+      - daemonsets
+      - deployments
+      - replicasets
+    verbs: [list, watch]
+  # Batch
+  - apiGroups: ["batch"]
+    resources:
+      - cronjobs
+      - jobs
+    verbs: [list, watch]
+  # Autoscaling
+  - apiGroups: ["autoscaling"]
+    resources:
+      - horizontalpodautoscalers
+    verbs: [list, watch]
+  # Authentication / authorization reviews (create-only; typically needed when the metrics endpoint sits behind an auth proxy such as kube-rbac-proxy)
+  - apiGroups: ["authentication.k8s.io"]
+    resources: [tokenreviews]
+    verbs: [create]
+  - apiGroups: ["authorization.k8s.io"]
+    resources: [subjectaccessreviews]
+    verbs: [create]
+  # Policy
+  - apiGroups: ["policy"]
+    resources: [poddisruptionbudgets]
+    verbs: [list, watch]
+  # Certificate signing
+  - apiGroups: ["certificates.k8s.io"]
+    resources: [certificatesigningrequests]
+    verbs: [list, watch]
+  # Discovery
+  - apiGroups: ["discovery.k8s.io"]
+    resources: [endpointslices]
+    verbs: [list, watch]
+  # Storage
+  - apiGroups: ["storage.k8s.io"]
+    resources:
+      - storageclasses
+      - volumeattachments
+    verbs: [list, watch]
+  # Admission webhooks
+  - apiGroups: ["admissionregistration.k8s.io"]
+    resources:
+      - mutatingwebhookconfigurations
+      - validatingwebhookconfigurations
+    verbs: [list, watch]
+  # Networking
+  - apiGroups: ["networking.k8s.io"]
+    resources:
+      - networkpolicies
+      - ingressclasses
+      - ingresses
+    verbs: [list, watch]
+  # Coordination (leader election)
+  - apiGroups: ["coordination.k8s.io"]
+    resources: [leases]
+    verbs: [list, watch]
+  # RBAC
+  - apiGroups: ["rbac.authorization.k8s.io"]
+    resources:
+      - clusterrolebindings
+      - clusterroles
+      - rolebindings
+      - roles
+    verbs: [list, watch]
+
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: kube-state-metrics
+  labels:
+    app.kubernetes.io/name: kube-state-metrics
+    app.kubernetes.io/part-of: honeydue-observability
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: kube-state-metrics
+subjects:
+  - kind: ServiceAccount
+    name: kube-state-metrics
+    namespace: kube-system
+
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: kube-state-metrics
+  namespace: kube-system
+  labels:
+    app.kubernetes.io/name: kube-state-metrics
+    app.kubernetes.io/part-of: honeydue-observability
+spec:
+  type: ClusterIP
+  selector:
+    app.kubernetes.io/name: kube-state-metrics
+  ports:
+    - name: http-metrics
+      port: 8080
+      targetPort: http-metrics
+      protocol: TCP
+    - name: telemetry
+      port: 8081
+      targetPort: telemetry
+      protocol: TCP
+
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: kube-state-metrics
+  namespace: kube-system
+  labels:
+    app.kubernetes.io/name: kube-state-metrics
+    app.kubernetes.io/part-of: honeydue-observability
+spec:
+  replicas: 1
+  strategy:
+    type: Recreate
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: kube-state-metrics
+  template:
+    metadata:
+      labels:
+        app.kubernetes.io/name: kube-state-metrics
+        app.kubernetes.io/part-of: honeydue-observability
+    spec:
+      serviceAccountName: kube-state-metrics
+      automountServiceAccountToken: true
+      securityContext:
+        runAsNonRoot: true
+        runAsUser: 65534
+        fsGroup: 65534
+        seccompProfile:
+          type: RuntimeDefault
+      containers:
+        - name: kube-state-metrics
+          image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.13.0
+          imagePullPolicy: IfNotPresent
+          ports:
+            - containerPort: 8080
+              name: http-metrics
+            - containerPort: 8081
+              name: telemetry
+          args:
+            - --port=8080
+            - --telemetry-port=8081
+          resources:
+            requests:
+              cpu: 25m
+              memory: 64Mi
+            limits:
+              cpu: 200m
+              memory: 256Mi
+          securityContext:
+            allowPrivilegeEscalation: false
+            capabilities:
+              drop: [ALL]
+            readOnlyRootFilesystem: true
+          livenessProbe:
+            httpGet:
+              path: /livez
+              port: http-metrics
+            initialDelaySeconds: 5
+            periodSeconds: 30
+          readinessProbe:
+            httpGet:
+              path: /readyz
+              port: http-metrics
+            initialDelaySeconds: 5
+            periodSeconds: 10
@@ -42,6 +42,21 @@ data:
          - target_label: service
            replacement: api

+      # kube-state-metrics — cluster object state (kube_pod_*, kube_deployment_*,
+      # etc.) needed for Grafana panels that count pods/replicas/etc.
+      - job_name: kube-state-metrics
+        kubernetes_sd_configs:
+          - role: endpoints
+            namespaces:
+              names: [kube-system]
+        relabel_configs:
+          - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
+            action: keep
+            regex: kube-state-metrics
+          - source_labels: [__meta_kubernetes_endpoint_port_name]
+            action: keep
+            regex: http-metrics
+
      # honeyDue worker — also exposes /metrics if/when we add it.
      # Keep this stanza commented until the worker has a /metrics endpoint;
      # uncommented form drops scrapes silently.
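The two `keep` rules in the kube-state-metrics job above act as a conjunction: a discovered endpoint survives only if both labels match. A rough Python model, assuming Prometheus-style relabel semantics (the regex is anchored at both ends and a missing label is treated as an empty string); `keep_target` and the sample label sets are hypothetical.

```python
import re

# (source label, regex) pairs transcribed from the relabel_configs above.
KEEP_RULES = [
    ("__meta_kubernetes_service_label_app_kubernetes_io_name",
     "kube-state-metrics"),
    ("__meta_kubernetes_endpoint_port_name", "http-metrics"),
]


def keep_target(labels, rules=KEEP_RULES):
    """Return True if every keep rule fully matches its source label."""
    return all(
        re.fullmatch(regex, labels.get(name, "")) is not None
        for name, regex in rules
    )


# The Service exposes two named ports, so endpoints discovery would yield
# one candidate per port; only the http-metrics one should be scraped.
METRICS_PORT = {
    "__meta_kubernetes_service_label_app_kubernetes_io_name": "kube-state-metrics",
    "__meta_kubernetes_endpoint_port_name": "http-metrics",
}
TELEMETRY_PORT = dict(METRICS_PORT, __meta_kubernetes_endpoint_port_name="telemetry")
OTHER_SERVICE = {"__meta_kubernetes_endpoint_port_name": "http-metrics"}
```

Under these assumptions the `telemetry` port on :8081 and every other Service in kube-system are dropped, which is consistent with the NetworkPolicy allowing only :8080 to the pod CIDR.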
@@ -104,6 +119,35 @@ roleRef:
  name: vmagent
  apiGroup: rbac.authorization.k8s.io

+---
+# Allow vmagent to discover the kube-state-metrics Service/Endpoints in
+# kube-system so the kube-state-metrics scrape job can find its target.
+# Cross-namespace SD needs an explicit RoleBinding here.
+apiVersion: rbac.authorization.k8s.io/v1
+kind: Role
+metadata:
+  name: vmagent-kube-system
+  namespace: kube-system
+rules:
+  - apiGroups: [""]
+    resources: [services, endpoints, pods]
+    verbs: [get, list, watch]
+
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: RoleBinding
+metadata:
+  name: vmagent-kube-system
+  namespace: kube-system
+subjects:
+  - kind: ServiceAccount
+    name: vmagent
+    namespace: honeydue
+roleRef:
+  kind: Role
+  name: vmagent-kube-system
+  apiGroup: rbac.authorization.k8s.io
+
 ---
 apiVersion: apps/v1
 kind: Deployment
@@ -162,12 +206,31 @@ spec:
              readOnly: true
            - name: buffer
              mountPath: /tmp/vmagent
-          livenessProbe:
+          # Process startup gate. /-/healthy returns 200 once vmagent has
+          # parsed config — gives the agent up to 2 min to come up before
+          # liveness starts evaluating.
+          startupProbe:
            httpGet:
              path: /-/healthy
              port: http
-            initialDelaySeconds: 10
-            periodSeconds: 30
+            initialDelaySeconds: 5
+            periodSeconds: 5
+            failureThreshold: 24
+          # Real liveness check: are scrapes actually succeeding?
+          # /-/healthy was the old probe and returned 200 for 17 days even
+          # while vmagent had zero healthy targets (stale k8s SD watch).
+          # This exec probe queries vmagent's own targets API and fails if
+          # NO target is in state "up". Three consecutive failures (3 min)
+          # → kubelet restarts the container → fresh SD watch.
+          livenessProbe:
+            exec:
+              command:
+                - sh
+                - -c
+                - 'n=$(wget -qO- http://localhost:8429/api/v1/targets 2>/dev/null | grep -c ''"health":"up"''); [ "$n" -gt 0 ]'
+            initialDelaySeconds: 120
+            periodSeconds: 60
+            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /-/healthy