fix(observability): relax vmagent liveness probe — was crash-looping every ~5m
The previous probe ran with timeoutSeconds=1 (the field was not set in the manifest, so the Kubernetes default of 1s applied), which is too tight for the shell pipeline (sh + wget + grep + comparison). On a busy node the wget call regularly exceeded 1s, the exec timed out, and 3 consecutive timeouts triggered SIGTERM.

Result: vmagent restarted ~5x per 30 min, causing brief gaps that made the Grafana "Pods up" panel render 0 whenever a refresh happened to coincide with a restart.

The relaxed probe still catches the original failure mode (zero healthy targets) but only kills the pod after 10 full minutes of consecutive failure (5 attempts × 2 min period), not 3 minutes (3 × 1 min).

  timeoutSeconds:      1 → 5
  periodSeconds:       60 → 120
  failureThreshold:    3 → 5
  initialDelaySeconds: 120 → 180

Also added wget -T 4 inside the command so wget itself bounds its network call to 4s, leaving 1s of slack within the 5s exec budget.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
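The kill-window figures in the message can be checked with a little arithmetic. The sketch below assumes the container is restarted once failureThreshold consecutive probes have failed, so the window is approximately failureThreshold × periodSeconds. It treats that product as a rough upper bound and ignores the per-probe timeout. The helper name `kill_window` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: approximate seconds of continuous failure before the
# container is restarted (failureThreshold x periodSeconds).
kill_window() {
  failure_threshold=$1
  period_seconds=$2
  echo $((failure_threshold * period_seconds))
}

old=$(kill_window 3 60)    # previous probe settings
new=$(kill_window 5 120)   # relaxed probe settings
echo "old: ${old}s, new: ${new}s"   # prints "old: 180s, new: 600s"
```

180s and 600s correspond to the 3 minutes and 10 minutes quoted in the message.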
@@ -227,10 +227,11 @@ spec:
     command:
     - sh
     - -c
-    - 'n=$(wget -qO- http://localhost:8429/api/v1/targets 2>/dev/null | grep -c ''"health":"up"''); [ "$n" -gt 0 ]'
+    - 'n=$(wget -qO- -T 4 http://localhost:8429/api/v1/targets 2>/dev/null | grep -c ''"health":"up"''); [ "$n" -gt 0 ]'
-  initialDelaySeconds: 120
+  initialDelaySeconds: 180
-  periodSeconds: 60
+  periodSeconds: 120
-  failureThreshold: 3
+  timeoutSeconds: 5
+  failureThreshold: 5
 readinessProbe:
   httpGet:
     path: /-/healthy
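To see what the probe command actually tests, the same grep check can be run against canned input instead of a live vmagent. This is only a sketch: the JSON strings are made-up sample data, not real vmagent responses, and `probe_check` is a hypothetical wrapper around the test used in the probe.

```shell
#!/bin/sh
# Hypothetical wrapper around the probe's test: succeed when stdin contains
# at least one line with a target reporting "health":"up".
probe_check() {
  n=$(grep -c '"health":"up"')
  [ "$n" -gt 0 ]
}

# Made-up sample payloads (assumption: not real vmagent output).
healthy='{"activeTargets":[{"health":"up"},{"health":"down"}]}'
unhealthy='{"activeTargets":[{"health":"down"}]}'

if printf '%s\n' "$healthy" | probe_check; then echo "healthy: pass"; fi
if printf '%s\n' "$unhealthy" | probe_check; then :; else echo "unhealthy: fail"; fi
# Empty input models wget failing or timing out: grep -c prints 0, so the check fails.
if printf '' | probe_check; then :; else echo "empty: fail"; fi
```

Note that grep -c counts matching lines rather than individual matches, so for a single-line JSON response n is either 0 or 1. That is sufficient for the "greater than 0" test the probe performs.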