fix(observability): unbreak vmagent SD on fresh deploy + ship kube-state-metrics

vmagent's k8s service discovery has been silently broken for 17+ days
because k3s's NetworkPolicy controller evaluates egress AFTER kube-proxy's
DNAT (contrary to the k8s spec). Pod → ClusterIP 10.43.0.1:443 was DNAT'd
to <node_public_ip>:6443, and the resulting :6443 destination matched none
of vmagent's egress rules → TCP RST → "connection refused" on every SD
watch attempt. Grafana panels using kube_* or up{} metrics returned empty
as a result.

Changes:

- network-policies.yaml: commit the previously-cluster-only NetPols
  (allow-egress-from-vmagent, allow-vmagent-to-api) so a fresh deploy
  produces a working cluster. The vmagent egress rule now includes :6443
  to public IPs (the post-DNAT path) and :8080 to the pod CIDR (for
  scraping kube-state-metrics).
- observability/kube-state-metrics.yaml: new manifest. Provides the
  kube_pod_*, kube_deployment_*, kube_service_* metrics that Grafana
  panels need to count pods, replicas, etc. Runs in kube-system with
  cluster-scoped RBAC.
- observability/vmagent.yaml:
  * add kube-state-metrics scrape job to the ConfigMap
  * add vmagent-kube-system Role+RoleBinding so cross-namespace SD works
  * replace the misleading liveness probe (was /-/healthy, which lies
    while SD is broken) with an exec probe that checks /api/v1/targets
    for at least one healthy target, giving automatic recovery from
    future stale-SD incidents
- scripts/03-deploy.sh: actually apply network-policies.yaml (was
  committed but never applied) and apply kube-state-metrics.yaml.
- RUNBOOK.md (new): documents the post-DNAT gotcha, the liveness probe
  trap, bearer-token recovery procedure, drift-detection diff, and a
  post-redeploy verification checklist.
- .gitignore: cover kubeconfig.tunnel (created during SSH-tunnelled
  kubectl sessions) so admin client cert can't be committed by accident.

Verified via kubectl --dry-run on all three modified manifests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 00:30:11 -05:00
parent 7cc5448a7c
commit 139a990ebc
6 changed files with 666 additions and 6 deletions
@@ -3,6 +3,7 @@ config.yaml
 # Generated files
 kubeconfig
 kubeconfig.*
 cluster-config.yaml
 prod.env
@@ -0,0 +1,262 @@
 # k3s Cluster Operations Runbook
 Living document for honeyDue k3s cluster operations. Add entries when you
 hit something non-obvious so future-you (or your replacement) doesn't have
 to rediscover it.
 ---
 ## Deployment
 The canonical deploy path is `deploy-k3s/scripts/03-deploy.sh`. It applies
 everything in `deploy-k3s/manifests/` in the right order.
 What it touches (in order):
 1. `namespace.yaml`
 2. `network-policies.yaml` — **all** NetPols including the vmagent ones
 3. `redis/`
 4. `ingress/`
 5. `migrate/job.yaml` (with image substitution; blocks on success)
 6. `api/deployment.yaml`, `api/service.yaml`, `api/hpa.yaml` (image-subbed)
 7. `worker/deployment.yaml` (image-subbed)
 8. `admin/deployment.yaml`, `admin/service.yaml` (image-subbed)
 9. `web/deployment.yaml`, `web/service.yaml` (image-subbed; optional dir)
 10. `observability/kube-state-metrics.yaml`
 11. `observability/vmagent.yaml` (with `TOKEN_PLACEHOLDER` sed-substituted from `deploy/prod.env`)
 If you add a new manifest, also add a `kubectl apply -f` line to
 `03-deploy.sh` — there's no kustomization or `apply -R`. **A manifest
 that exists in the repo but isn't applied by the script will silently
 not deploy.**
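
One way to catch an unapplied manifest early is to compare the files under `deploy-k3s/manifests/` with what `03-deploy.sh` mentions. The sketch below is a rough check and not part of the repo: `find_unapplied` is a hypothetical helper, and it assumes the script references each manifest either by its path or by a directory written with a trailing slash (as in `"${MANIFESTS}/redis/"`). It will not notice a reference that has been commented out.

```bash
#!/usr/bin/env bash
# find_unapplied MANIFESTS_DIR DEPLOY_SCRIPT
# Prints every *.yaml under MANIFESTS_DIR (relative path) that DEPLOY_SCRIPT
# never mentions. Pass MANIFESTS_DIR without a trailing slash.
# Hypothetical helper for illustration only.
find_unapplied() {
  local manifests_dir="$1" script="$2" f rel dir
  find "$manifests_dir" -type f -name '*.yaml' | sort | while read -r f; do
    rel="${f#"$manifests_dir"/}"   # e.g. observability/vmagent.yaml
    dir="$(dirname "$rel")"        # e.g. observability
    # Referenced directly by its path?
    if grep -qF "/${rel}" "$script"; then continue; fi
    # Covered by a directory-level apply that ends the line, e.g. .../redis/" ?
    if [ "$dir" != "." ] && grep -qE "/${dir}/\"?[[:space:]]*\$" "$script"; then
      continue
    fi
    echo "$rel"
  done
}
```

Example: `find_unapplied deploy-k3s/manifests deploy-k3s/scripts/03-deploy.sh`. Empty output suggests every manifest is at least mentioned by the script.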
 ### Pre-deploy checklist
 - [ ] `deploy/prod.env` exists and contains `OBS_INGEST_TOKEN=...`
       (otherwise vmagent gets skipped with a warning)
 - [ ] `KUBECONFIG` points at the right cluster
 - [ ] The Gitea image registry is reachable from k3s nodes
 - [ ] Schema migrations in `migrations/` are tested locally first
       (the deploy aborts if `honeydue-migrate` Job fails)
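
The `prod.env` item can be checked mechanically. The sketch below mirrors the `grep | cut` lookup that `03-deploy.sh` uses for `OBS_INGEST_TOKEN`; `env_value` and `missing_keys` are hypothetical helper names, and the sketch assumes plain `KEY=value` lines with no quoting or `export` prefix.

```bash
#!/usr/bin/env bash
# env_value FILE KEY
# Prints the value of KEY from a KEY=value env file, or nothing if absent.
env_value() {
  local file="$1" key="$2"
  grep -E "^${key}=" "$file" 2>/dev/null | cut -d= -f2- || true
}

# missing_keys FILE KEY...
# Prints each KEY that is absent from FILE or set to an empty value.
missing_keys() {
  local file="$1" key
  shift
  for key in "$@"; do
    [ -n "$(env_value "$file" "$key")" ] || echo "$key"
  done
}
```

Example: `missing_keys deploy/prod.env OBS_INGEST_TOKEN` prints `OBS_INGEST_TOKEN` if the deploy would skip vmagent, and nothing otherwise.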
 ---
 ## Known gotchas
 ### vmagent SD broken on fresh deploy ("0 pods up" in Grafana)
 **Symptoms:**
 - Grafana panels using `kube_*` metrics or `up{job=...}` show 0
 - vmagent logs: `dial tcp 10.43.0.1:443: connect: connection refused`
       repeating every ~30s
 - Direct test from a pod also refused: `kubectl -n honeydue exec deploy/vmagent
       -- wget --no-check-certificate -qO- -T 3 https://10.43.0.1:443/livez`
 **Cause:** k3s's built-in NetworkPolicy controller appears to evaluate egress
 rules **after** kube-proxy's DNAT, not before (contrary to the k8s spec).
 Traffic from a pod to the `kubernetes` Service (ClusterIP `10.43.0.1:443`)
 gets DNAT'd to `<node_public_ip>:6443`, and **then** the policy check
 runs. Without an explicit egress rule for `:6443`, the packet is rejected
 with a TCP RST → "connection refused".
 The `allow-egress-from-vmagent` NetPol in `network-policies.yaml` includes
 both rules:
 ```yaml
 # Pre-DNAT view (correct per spec; harmless if unused)
 - to:
    - ipBlock: { cidr: 10.43.0.0/16 }
  ports:
    - { port: 443, protocol: TCP }
 # Post-DNAT path (what k3s NetPol enforcer actually sees) — REQUIRED
 - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except: [10.42.0.0/16]
  ports:
    - { port: 6443, protocol: TCP }
 ```
 **If this happens on a fresh deploy:** confirm `network-policies.yaml`
 was applied:
 ```bash
 kubectl -n honeydue get netpol allow-egress-from-vmagent -o yaml
 ```
 Look for the port-6443 egress rule. If missing, the apply step in
 `03-deploy.sh` was skipped or the file was edited and the rule got
 dropped.
 **Counter-evidence that confirms diagnosis:** kube-state-metrics in
 `kube-system` works fine, because `kube-system` has no NetPols. So if
 ksm is healthy but workloads in `honeydue` can't reach the apiserver
 ClusterIP, this gotcha is the cause.
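
To check the post-DNAT theory against a live cluster, compare the port the apiserver endpoint really listens on with the ports the egress policy allows. This is a minimal sketch: `port_in_list` is a hypothetical helper, and the commented usage assumes the `kubernetes` Endpoints object in `default` has a single subset and port.

```bash
#!/usr/bin/env bash
# port_in_list PORT "P1 P2 ..."
# Succeeds if PORT appears in the whitespace-separated list.
# Hypothetical helper for illustration only.
port_in_list() {
  local want="$1" p
  for p in $2; do
    [ "$p" = "$want" ] && return 0
  done
  return 1
}

# Usage against a live cluster (not executed here):
#   api_port=$(kubectl -n default get endpoints kubernetes \
#                -o jsonpath='{.subsets[0].ports[0].port}')
#   allowed=$(kubectl -n honeydue get netpol allow-egress-from-vmagent \
#                -o jsonpath='{.spec.egress[*].ports[*].port}')
#   port_in_list "$api_port" "$allowed" \
#     || echo "no egress rule for :${api_port} (post-DNAT path is blocked)"
```

The check only looks at ports, not CIDRs, so a pass means the rule exists, not that its `ipBlock` covers the node IP.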
 ---
 ### vmagent appears healthy but no data in Grafana
 vmagent's `/-/healthy` endpoint returns 200 as long as the process is
 alive and remote-write is functional (TCP-level) — it does **not**
 check whether scrapes are succeeding. We saw this fail once: vmagent
 was "healthy" for 17 days while having zero healthy targets due to a
 broken k8s SD watch.
 The liveness probe in `vmagent.yaml` queries the agent's `/api/v1/targets`
 endpoint and fails the pod if no target is in state `up`. After 3
 consecutive failures (~3 min), kubelet recycles the pod and SD restarts
 clean.
 **Verify it's working:** `kubectl -n honeydue describe pod -l app.kubernetes.io/name=vmagent`
 should show `Liveness: exec [sh -c ...]`. If you ever see vmagent running
 for weeks but no metrics in Grafana, the probe was disabled or the exec
 command broke.
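
The probe's test can be reproduced by hand. Note that the probe uses `grep -c`, which counts matching lines, so on a single-line JSON body it yields 0 or 1; that is enough for a "greater than zero" check. The sketch below counts individual targets instead. `count_up_targets` is a hypothetical helper and, like the probe, assumes the JSON is compact (no space after the colon in `"health":"up"`).

```bash
#!/usr/bin/env bash
# count_up_targets
# Reads an /api/v1/targets response body on stdin and prints how many
# targets report "health":"up". Hypothetical helper for manual checks.
count_up_targets() {
  # grep -o emits one line per match; tr strips the padding some wc builds add.
  grep -o '"health":"up"' | wc -l | tr -d ' '
}

# Usage with a port-forward to vmagent already running (not executed here):
#   curl -s http://localhost:8429/api/v1/targets | count_up_targets
```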
 ---
 ### vmagent's bearer token got blown away after `kubectl apply -f vmagent.yaml`
 The committed `vmagent.yaml` has `bearer_token: TOKEN_PLACEHOLDER`. The
 real token is sed-substituted at deploy time by `03-deploy.sh`. If you
 ever apply `vmagent.yaml` directly:
 ```bash
 kubectl apply -f deploy-k3s/manifests/observability/vmagent.yaml   # WRONG
 ```
 the Secret gets overwritten with the literal string `TOKEN_PLACEHOLDER`
 and all remote-writes start returning 401 from obs.88oakapps.com.
 **To restore without a full redeploy** (the safe inline path):
 ```bash
 export KUBECONFIG=...
 OBS_TOKEN_B64=$(kubectl -n honeydue get secret honeydue-secrets \
                  -o jsonpath='{.data.OBS_INGEST_TOKEN}')
 kubectl -n honeydue patch secret vmagent-remote-write --type=json \
  -p="[{\"op\":\"replace\",\"path\":\"/data/bearer_token\",\"value\":\"${OBS_TOKEN_B64}\"}]"
 kubectl -n honeydue rollout restart deploy/vmagent
 ```
 The OBS token also lives in `honeydue-secrets.OBS_INGEST_TOKEN` because
 the api pods use it for traces — same secret, same value.
 **Or just re-run the deploy:** `./deploy-k3s/scripts/03-deploy.sh`. The
 sed step handles the substitution correctly.
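
To tell whether the Secret was clobbered before reaching for the patch, decode the stored value and compare it with the placeholder. A minimal sketch, assuming the token sits under the `bearer_token` key of `vmagent-remote-write` as in the patch command above; `is_placeholder_b64` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# is_placeholder_b64 BASE64_VALUE
# Succeeds if the value decodes to the literal string TOKEN_PLACEHOLDER.
# Hypothetical helper for illustration only.
is_placeholder_b64() {
  [ "$(printf '%s' "$1" | base64 --decode 2>/dev/null)" = "TOKEN_PLACEHOLDER" ]
}

# Usage against a live cluster (not executed here):
#   current=$(kubectl -n honeydue get secret vmagent-remote-write \
#               -o jsonpath='{.data.bearer_token}')
#   is_placeholder_b64 "$current" && echo "token was overwritten; restore it"
```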
 ---
 ### Node kubeconfig is world-readable
 `/etc/rancher/k3s/k3s.yaml` is mode `0644` per the `--write-kubeconfig-mode=644`
 k3s install flag. Any process on the host (including any container that
 mounts the host filesystem) can read full cluster-admin credentials.
 This is intentional for the deploy user but worth knowing — any container
 escape becomes immediate cluster-admin. Tracked as finding **F4** in
 `k3_audit_5_12.md`.
 To tighten (if you ever turn this knob): change to `--write-kubeconfig-mode=600`
 in the k3s install command, then re-fetch `deploy-k3s/kubeconfig`.
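
After changing the mode, it is worth confirming on the node that the file is no longer readable by group or others. The sketch below parses `ls -ld` output rather than calling `stat`, whose flags differ between GNU and BSD; `group_or_other_readable` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# group_or_other_readable FILE
# Succeeds if FILE's group-read or other-read permission bit is set.
# Hypothetical helper for illustration only.
group_or_other_readable() {
  local perms
  perms="$(ls -ld "$1" | cut -c1-10)"   # e.g. -rw-r--r--
  case "$perms" in
    ????r*|???????r*) return 0 ;;       # 5th char = group read, 8th = other read
    *) return 1 ;;
  esac
}

# Usage on a node (not executed here):
#   group_or_other_readable /etc/rancher/k3s/k3s.yaml \
#     && echo "kubeconfig is still readable beyond its owner"
```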
 ---
 ## Common operations
 ### Fetch a working kubeconfig or kubectl tunnel (if `deploy-k3s/kubeconfig` is missing or stale)
 ```bash
 ssh -i ~/.ssh/hetzner deploy@hetzner1 'sudo cat /etc/rancher/k3s/k3s.yaml' \
  | sed 's|server: https://127.0.0.1:6443|server: https://178.104.247.152:6443|' \
  > deploy-k3s/kubeconfig
 chmod 600 deploy-k3s/kubeconfig
 ```
 If the public :6443 is firewalled from your IP (the default — only
 Cloudflare ranges are allowed for app traffic; admin is locked down):
 ```bash
 # SSH tunnel: -f sends ssh to the background once connected, so this
 # returns immediately; kill that ssh process when you are done
 ssh -fN -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 \
    -i ~/.ssh/hetzner \
    -L 127.0.0.1:6443:127.0.0.1:6443 \
    deploy@hetzner1
 # Then use a kubeconfig pointing at localhost
 cp deploy-k3s/kubeconfig deploy-k3s/kubeconfig.tunnel
 sed -i.bak 's|https://178.104.247.152:6443|https://127.0.0.1:6443|' \
  deploy-k3s/kubeconfig.tunnel
 export KUBECONFIG="$(pwd)/deploy-k3s/kubeconfig.tunnel"
 ```
 ### Restore vmagent after a "0 targets" incident
 ```bash
 export KUBECONFIG="$(pwd)/deploy-k3s/kubeconfig.tunnel"
 # 1. Confirm the diagnosis
 kubectl -n honeydue logs deploy/vmagent --tail=20 | grep -i "connect: connection refused"
 # 2. Check the NetPol has the :6443 rule
 kubectl -n honeydue get netpol allow-egress-from-vmagent -o yaml | grep -A 5 6443
 # 3. If missing, re-apply
 kubectl apply -f deploy-k3s/manifests/network-policies.yaml
 # 4. Restart vmagent
 kubectl -n honeydue rollout restart deploy/vmagent
 # 5. Verify targets after ~60s (give the port-forward a moment to bind, then clean up)
 kubectl -n honeydue port-forward deploy/vmagent 8429:8429 &
 PF_PID=$!
 sleep 2
 curl -s http://localhost:8429/api/v1/targets \
  | python3 -c "import json,sys; d=json.load(sys.stdin); \
                a=d['data']['activeTargets']; \
                print(f'targets={len(a)} up={sum(1 for t in a if t[\"health\"]==\"up\")}')"
 kill "$PF_PID"
 ```
 ### Verify NetPols match the repo
 If you suspect drift between cluster and repo:
 ```bash
 diff <(kubectl -n honeydue get netpol -o name | sort) \
     <(grep -E '^\s*name: ' deploy-k3s/manifests/network-policies.yaml \
       | sed 's/.*name: /networkpolicy.networking.k8s.io\//' | sort)
 ```
 Empty output = match. Any difference needs investigation: either the
 cluster has policies that aren't in the repo (someone ran a manual
 `kubectl apply`) or the repo has policies that were never applied.
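
Raw `diff` output leaves you to work out which side a name came from. The sketch below labels each mismatch instead, using `comm` on two sorted name lists. `classify_drift` is a hypothetical helper, and both input files must already be sorted (as the `sort` calls in the diff command above ensure).

```bash
#!/usr/bin/env bash
# classify_drift CLUSTER_NAMES_FILE REPO_NAMES_FILE
# Both files hold one policy name per line, sorted. Prints one labelled line
# per mismatch and nothing when the lists agree.
# Hypothetical helper for illustration only.
classify_drift() {
  comm -23 "$1" "$2" | sed 's/^/cluster-only: /'   # applied by hand, not in repo
  comm -13 "$1" "$2" | sed 's/^/repo-only: /'      # committed, never applied
}

# Usage against a live cluster (not executed here):
#   classify_drift \
#     <(kubectl -n honeydue get netpol -o name | sort) \
#     <(grep -E '^\s*name: ' deploy-k3s/manifests/network-policies.yaml \
#         | sed 's/.*name: /networkpolicy.networking.k8s.io\//' | sort)
```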
 ---
 ## Disaster recovery notes
 ### "I have to redeploy the whole stack"
 The deploy path is designed to be re-runnable. From a fresh cluster:
 1. Install k3s on all 3 nodes (use existing `deploy-k3s/scripts/01-install-k3s.sh`)
 2. Fetch a kubeconfig (see "Common operations" above)
 3. Confirm `deploy/prod.env` has all required secrets:
   - `POSTGRES_PASSWORD`, `SECRET_KEY`, `EMAIL_HOST_PASSWORD`,
     `FCM_SERVER_KEY`, `B2_KEY_ID`, `B2_APP_KEY`, `OBS_INGEST_TOKEN`,
     `OBS_TRACES_URL`, `REDIS_PASSWORD` (optional), `ADMIN_EMAIL`, `ADMIN_PASSWORD`
 4. Run `./deploy-k3s/scripts/02-setup-secrets.sh` (creates `honeydue-secrets`)
 5. Run `./deploy-k3s/scripts/03-deploy.sh` (deploys everything; sed-injects
   the obs token into vmagent at apply time)
 6. Verify: `kubectl -n honeydue get pods` should show all workloads Running
 ### Post-redeploy verification checklist
 - [ ] `kubectl -n honeydue get netpol` shows **12 NetPols** (default-deny +
       6 egress + 5 ingress)
 - [ ] `kubectl -n honeydue get netpol allow-egress-from-vmagent -o yaml | grep 6443`
       returns the rule (if missing → see "vmagent SD broken" gotcha)
 - [ ] `kubectl -n kube-system get pod -l app.kubernetes.io/name=kube-state-metrics`
       shows 1 Running pod
 - [ ] `kubectl -n honeydue port-forward deploy/vmagent 8429:8429` + curl
       `localhost:8429/api/v1/targets` shows 4+ targets, all `up`
 - [ ] Grafana panel "pods up" in `honeydue` namespace populates within 60s
 If any of those fail, the matching entry under "Known gotchas" above
 explains the likely cause and the fix.
@@ -275,3 +275,100 @@ spec:
      ports:
        - protocol: TCP
          port: 443
 ---
 # vmagent egress.
 #
 # IMPORTANT (gotcha): k3s's built-in NetworkPolicy controller appears to
 # evaluate egress rules AFTER kube-proxy's DNAT, not before (contrary to
 # the k8s spec). So traffic from a pod to the kubernetes Service
 # (ClusterIP 10.43.0.1:443) is policy-checked as dst=<node_public_ip>:6443.
 # That's why we need an explicit rule for :6443 to public IPs, even though
 # we already allow :443 to the cluster service CIDR.
 #
 # Without the :6443 rule, vmagent's k8s service discovery silently fails
 # and zero pods get scraped. See deploy-k3s/RUNBOOK.md ("vmagent SD broken").
 apiVersion: networking.k8s.io/v1
 kind: NetworkPolicy
 metadata:
  name: allow-egress-from-vmagent
  namespace: honeydue
 spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: vmagent
  policyTypes:
    - Egress
  egress:
    # DNS (cluster-internal)
    - to:
        - namespaceSelector: {}
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
    # k8s API server via ClusterIP (pre-DNAT view)
    - to:
        - ipBlock:
            cidr: 10.43.0.0/16
      ports:
        - port: 443
          protocol: TCP
    # k8s API server post-DNAT (real path k3s NetPol enforcer sees) — REQUIRED
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.42.0.0/16
      ports:
        - port: 6443
          protocol: TCP
    # Scrape api Pods on :8000
    - to:
        - ipBlock:
            cidr: 10.42.0.0/16
      ports:
        - port: 8000
          protocol: TCP
    # Scrape kube-state-metrics Pod on :8080 (pod CIDR)
    - to:
        - ipBlock:
            cidr: 10.42.0.0/16
      ports:
        - port: 8080
          protocol: TCP
    # HTTPS to public (remote-write to obs.88oakapps.com via Cloudflare)
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.42.0.0/16
              - 10.43.0.0/16
      ports:
        - port: 443
          protocol: TCP
 ---
 # Allow vmagent → api ingress on :8000 so api pods accept scrapes.
 # api Pods are otherwise locked down by default-deny-all + allow-ingress-to-api
 # (which only allows Traefik). This adds vmagent specifically.
 apiVersion: networking.k8s.io/v1
 kind: NetworkPolicy
 metadata:
  name: allow-vmagent-to-api
  namespace: honeydue
 spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: vmagent
      ports:
        - port: 8000
          protocol: TCP
@@ -0,0 +1,223 @@
 # kube-state-metrics — exposes cluster object state (pods, deployments,
 # services, etc.) as Prometheus metrics. vmagent scrapes it via the
 # kube-state-metrics job defined in vmagent-config; Grafana panels that
 # count pods, replicas, etc. consume the `kube_*` metrics this produces.
 #
 # Lives in kube-system because it watches resources cluster-wide.
 # RBAC is cluster-scoped (ClusterRole + ClusterRoleBinding).
 #
 # Image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.13.0
 # (latest stable as of authoring; bump when a newer minor is released)
 ---
 apiVersion: v1
 kind: ServiceAccount
 metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/part-of: honeydue-observability
 ---
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRole
 metadata:
  name: kube-state-metrics
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/part-of: honeydue-observability
 rules:
  # Core resources
  - apiGroups: [""]
    resources:
      - configmaps
      - secrets
      - nodes
      - pods
      - services
      - serviceaccounts
      - resourcequotas
      - replicationcontrollers
      - limitranges
      - persistentvolumeclaims
      - persistentvolumes
      - namespaces
      - endpoints
    verbs: [list, watch]
  # Apps
  - apiGroups: ["apps"]
    resources:
      - statefulsets
      - daemonsets
      - deployments
      - replicasets
    verbs: [list, watch]
  # Batch
  - apiGroups: ["batch"]
    resources:
      - cronjobs
      - jobs
    verbs: [list, watch]
  # Autoscaling
  - apiGroups: ["autoscaling"]
    resources:
      - horizontalpodautoscalers
    verbs: [list, watch]
  # Authentication / authorization reviews (carried over from the upstream ksm
  # ClusterRole; used for delegated authn/authz on the metrics endpoint)
  - apiGroups: ["authentication.k8s.io"]
    resources: [tokenreviews]
    verbs: [create]
  - apiGroups: ["authorization.k8s.io"]
    resources: [subjectaccessreviews]
    verbs: [create]
  # Policy
  - apiGroups: ["policy"]
    resources: [poddisruptionbudgets]
    verbs: [list, watch]
  # Certificate signing
  - apiGroups: ["certificates.k8s.io"]
    resources: [certificatesigningrequests]
    verbs: [list, watch]
  # Discovery
  - apiGroups: ["discovery.k8s.io"]
    resources: [endpointslices]
    verbs: [list, watch]
  # Storage
  - apiGroups: ["storage.k8s.io"]
    resources:
      - storageclasses
      - volumeattachments
    verbs: [list, watch]
  # Admission webhooks
  - apiGroups: ["admissionregistration.k8s.io"]
    resources:
      - mutatingwebhookconfigurations
      - validatingwebhookconfigurations
    verbs: [list, watch]
  # Networking
  - apiGroups: ["networking.k8s.io"]
    resources:
      - networkpolicies
      - ingressclasses
      - ingresses
    verbs: [list, watch]
  # Coordination (Lease objects, read for metrics)
  - apiGroups: ["coordination.k8s.io"]
    resources: [leases]
    verbs: [list, watch]
  # RBAC
  - apiGroups: ["rbac.authorization.k8s.io"]
    resources:
      - clusterrolebindings
      - clusterroles
      - rolebindings
      - roles
    verbs: [list, watch]
 ---
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRoleBinding
 metadata:
  name: kube-state-metrics
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/part-of: honeydue-observability
 roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
 subjects:
  - kind: ServiceAccount
    name: kube-state-metrics
    namespace: kube-system
 ---
 apiVersion: v1
 kind: Service
 metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/part-of: honeydue-observability
 spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: kube-state-metrics
  ports:
    - name: http-metrics
      port: 8080
      targetPort: http-metrics
      protocol: TCP
    - name: telemetry
      port: 8081
      targetPort: telemetry
      protocol: TCP
 ---
 apiVersion: apps/v1
 kind: Deployment
 metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/part-of: honeydue-observability
 spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/part-of: honeydue-observability
    spec:
      serviceAccountName: kube-state-metrics
      automountServiceAccountToken: true
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
        fsGroup: 65534
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: kube-state-metrics
          image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.13.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
              name: http-metrics
            - containerPort: 8081
              name: telemetry
          args:
            - --port=8080
            - --telemetry-port=8081
          resources:
            requests:
              cpu: 25m
              memory: 64Mi
            limits:
              cpu: 200m
              memory: 256Mi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: [ALL]
            readOnlyRootFilesystem: true
          livenessProbe:
            httpGet:
              path: /livez
              port: http-metrics
            initialDelaySeconds: 5
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /readyz
              port: http-metrics
            initialDelaySeconds: 5
            periodSeconds: 10
@@ -42,6 +42,21 @@ data:
          - target_label: service
            replacement: api
      # kube-state-metrics — cluster object state (kube_pod_*, kube_deployment_*,
      # etc.) needed for Grafana panels that count pods/replicas/etc.
      - job_name: kube-state-metrics
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names: [kube-system]
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
            action: keep
            regex: kube-state-metrics
          - source_labels: [__meta_kubernetes_endpoint_port_name]
            action: keep
            regex: http-metrics
      # honeyDue worker — also exposes /metrics if/when we add it.
      # Keep this stanza commented until the worker has a /metrics endpoint;
      # uncommented form drops scrapes silently.
@@ -104,6 +119,35 @@ roleRef:
  name: vmagent
  apiGroup: rbac.authorization.k8s.io
 ---
 # Allow vmagent to discover the kube-state-metrics Service/Endpoints in
 # kube-system so the kube-state-metrics scrape job can find its target.
 # Cross-namespace SD needs an explicit RoleBinding here.
 apiVersion: rbac.authorization.k8s.io/v1
 kind: Role
 metadata:
  name: vmagent-kube-system
  namespace: kube-system
 rules:
  - apiGroups: [""]
    resources: [services, endpoints, pods]
    verbs: [get, list, watch]
 ---
 apiVersion: rbac.authorization.k8s.io/v1
 kind: RoleBinding
 metadata:
  name: vmagent-kube-system
  namespace: kube-system
 subjects:
  - kind: ServiceAccount
    name: vmagent
    namespace: honeydue
 roleRef:
  kind: Role
  name: vmagent-kube-system
  apiGroup: rbac.authorization.k8s.io
 ---
 apiVersion: apps/v1
 kind: Deployment
@@ -162,12 +206,31 @@ spec:
              readOnly: true
            - name: buffer
              mountPath: /tmp/vmagent
-          livenessProbe:
+          # Process startup gate. /-/healthy returns 200 once vmagent has
          # parsed config — gives the agent up to 2 min to come up before
          # liveness starts evaluating.
          startupProbe:
            httpGet:
              path: /-/healthy
              port: http
-            initialDelaySeconds: 10
+            initialDelaySeconds: 5
-            periodSeconds: 30
+            periodSeconds: 5
            failureThreshold: 24
          # Real liveness check: are scrapes actually succeeding?
          # /-/healthy was the old probe and returned 200 for 17 days even
          # while vmagent had zero healthy targets (stale k8s SD watch).
          # This exec probe queries vmagent's own targets API and fails if
          # NO target is in state "up". Three consecutive failures (3 min)
          # → kubelet kills the pod → fresh SD watch.
          livenessProbe:
            exec:
              command:
                - sh
                - -c
                - 'n=$(wget -qO- http://localhost:8429/api/v1/targets 2>/dev/null | grep -c ''"health":"up"''); [ "$n" -gt 0 ]'
            initialDelaySeconds: 120
            periodSeconds: 60
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /-/healthy
@@ -146,6 +146,14 @@ kubectl create configmap honeydue-config \
 log "Applying manifests..."
 kubectl apply -f "${MANIFESTS}/namespace.yaml"
 # NetworkPolicies first — default-deny-all + per-app allow rules.
 # These MUST be applied; without them the cluster falls back to default-allow
 # (worse posture) AND the vmagent egress rule for :6443 (which fixes a k3s
 # post-DNAT enforcement quirk for k8s API discovery) is missing.
 # See deploy-k3s/RUNBOOK.md ("vmagent SD broken on fresh deploy").
 kubectl apply -f "${MANIFESTS}/network-policies.yaml"
 kubectl apply -f "${MANIFESTS}/redis/"
 kubectl apply -f "${MANIFESTS}/ingress/"
@@ -181,10 +189,16 @@ if [[ -d "${MANIFESTS}/web" ]]; then
  kubectl apply -f "${MANIFESTS}/web/service.yaml"
 fi
-# Observability — vmagent scrapes api Pods :8000/metrics and remote-writes
+# Observability — vmagent scrapes api Pods :8000/metrics + kube-state-metrics
-# to obs.88oakapps.com. The bearer token comes from deploy/prod.env so it
+# :8080/metrics and remote-writes everything to obs.88oakapps.com. The bearer
-# stays out of the repo; the manifest holds TOKEN_PLACEHOLDER.
+# token comes from deploy/prod.env so it stays out of the repo; the manifest
 # holds TOKEN_PLACEHOLDER. kube-state-metrics provides the kube_* metrics
 # Grafana panels need to count pods, deployments, etc.
 if [[ -d "${MANIFESTS}/observability" ]]; then
  # kube-state-metrics — no secrets, plain apply
  kubectl apply -f "${MANIFESTS}/observability/kube-state-metrics.yaml"
  # vmagent — needs the bearer-token substitution
  # prod.env lives at the repo's deploy/ dir (sibling of deploy-k3s/), not
  # under deploy-k3s/. It's gitignored — operator copies values there once.
  OBS_TOKEN="$(grep -E '^OBS_INGEST_TOKEN=' "${REPO_DIR}/deploy/prod.env" 2>/dev/null | cut -d= -f2- || true)"