fix(observability): relax vmagent liveness probe — was crash-looping every ~5m

The previous probe had timeoutSeconds=1 which is too tight for the shell pipeline (sh + wget + grep + comparison). On a busy node the wget call regularly exceeded 1s, the exec timed out, and 3 consecutive timeouts triggered SIGTERM. Result: vmagent restarted ~5x per 30 min, causing brief gaps that made the Grafana "Pods up" panel render 0 whenever a refresh happened to coincide with a restart. The relaxed probe still catches the original failure mode (zero healthy targets) but only kills the pod after 10 full minutes of consecutive failure (5 attempts × 2 min period), not 3 minutes (3 × 1 min). timeoutSeconds: 1 → 5 periodSeconds: 60 → 120 failureThreshold: 3 → 5 initialDelaySeconds: 120 → 180 Also added wget -T 4 inside the command so wget itself bounds its network call to 4s — leaving 1s of slack within the 5s exec budget. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 00:39:23 -05:00
parent 139a990ebc
commit 2004f9c5b2
1 changed files with 5 additions and 4 deletions
@@ -227,10 +227,11 @@ spec:
              command:
                - sh
                - -c
-                - 'n=$(wget -qO- http://localhost:8429/api/v1/targets 2>/dev/null | grep -c ''"health":"up"''); [ "$n" -gt 0 ]'
+                - 'n=$(wget -qO- -T 4 http://localhost:8429/api/v1/targets 2>/dev/null | grep -c ''"health":"up"''); [ "$n" -gt 0 ]'
-            initialDelaySeconds: 120
+            initialDelaySeconds: 180
-            periodSeconds: 60
+            periodSeconds: 120
-            failureThreshold: 3
+            timeoutSeconds: 5
            failureThreshold: 5
          readinessProbe:
            httpGet:
              path: /-/healthy