fix(observability): relax vmagent liveness probe — was crash-looping every ~5m
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
Backend CI / Build (push) Has been cancelled

The previous probe had timeoutSeconds=1 which is too tight for the
shell pipeline (sh + wget + grep + comparison). On a busy node the
wget call regularly exceeded 1s, the exec timed out, and 3 consecutive
timeouts triggered SIGTERM. Result: vmagent restarted ~5x per 30 min,
causing brief gaps that made the Grafana "Pods up" panel render 0
whenever a refresh happened to coincide with a restart.

The relaxed probe still catches the original failure mode (zero healthy
targets) but only kills the pod after 10 full minutes of consecutive
failure (5 attempts × 2 min period), not 3 minutes (3 × 1 min).

  timeoutSeconds: 1 → 5
  periodSeconds: 60 → 120
  failureThreshold: 3 → 5
  initialDelaySeconds: 120 → 180

Also added wget -T 4 inside the command so wget itself bounds its
network call to 4s — leaving 1s of slack within the 5s exec budget.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Trey t
2026-05-13 00:39:23 -05:00
parent 139a990ebc
commit 2004f9c5b2
@@ -227,10 +227,11 @@ spec:
command:
- sh
- -c
- 'n=$(wget -qO- http://localhost:8429/api/v1/targets 2>/dev/null | grep -c ''"health":"up"''); [ "$n" -gt 0 ]'
initialDelaySeconds: 120
periodSeconds: 60
failureThreshold: 3
- 'n=$(wget -qO- -T 4 http://localhost:8429/api/v1/targets 2>/dev/null | grep -c ''"health":"up"''); [ "$n" -gt 0 ]'
initialDelaySeconds: 180
periodSeconds: 120
timeoutSeconds: 5
failureThreshold: 5
readinessProbe:
httpGet:
path: /-/healthy