From 2004f9c5b2eb033cf6eabcba43743acee539bb04 Mon Sep 17 00:00:00 2001
From: Trey t
Date: Wed, 13 May 2026 00:39:23 -0500
Subject: [PATCH] =?UTF-8?q?fix(observability):=20relax=20vmagent=20livenes?=
 =?UTF-8?q?s=20probe=20=E2=80=94=20was=20crash-looping=20every=20~5m?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The previous probe did not set timeoutSeconds, so it ran with the
default of 1, which is too tight for the shell pipeline (sh + wget +
grep + comparison). On a busy node the wget call regularly exceeded 1s,
the exec timed out, and 3 consecutive timeouts triggered SIGTERM.
Result: vmagent restarted ~5x per 30 min, causing brief gaps in the
metrics that made the Grafana "Pods up" panel render 0 whenever a
refresh happened to coincide with a restart.

The relaxed probe still catches the original failure mode (zero healthy
targets) but restarts the container only after 10 full minutes of
consecutive failure (5 attempts × 2 min period), not 3 minutes
(3 × 1 min).

  timeoutSeconds:      1 (default) → 5
  periodSeconds:       60 → 120
  failureThreshold:    3 → 5
  initialDelaySeconds: 120 → 180

Also add wget -T 4 inside the command so that wget itself bounds its
network call to 4s, leaving 1s of slack within the 5s exec budget.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 deploy-k3s/manifests/observability/vmagent.yaml | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/deploy-k3s/manifests/observability/vmagent.yaml b/deploy-k3s/manifests/observability/vmagent.yaml
index e12032d..ffc1217 100644
--- a/deploy-k3s/manifests/observability/vmagent.yaml
+++ b/deploy-k3s/manifests/observability/vmagent.yaml
@@ -227,10 +227,11 @@ spec:
               command:
                 - sh
                 - -c
-                - 'n=$(wget -qO- http://localhost:8429/api/v1/targets 2>/dev/null | grep -c ''"health":"up"''); [ "$n" -gt 0 ]'
-            initialDelaySeconds: 120
-            periodSeconds: 60
-            failureThreshold: 3
+                - 'n=$(wget -qO- -T 4 http://localhost:8429/api/v1/targets 2>/dev/null | grep -c ''"health":"up"''); [ "$n" -gt 0 ]'
+            initialDelaySeconds: 180
+            periodSeconds: 120
+            timeoutSeconds: 5
+            failureThreshold: 5
           readinessProbe:
             httpGet:
               path: /-/healthy
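
Reviewer note: the timing arithmetic in the commit message can be checked with a small shell sketch. The helper names `restart_window` and `exec_slack` are hypothetical and not part of this patch. The window is approximated as periodSeconds × failureThreshold, measured roughly from the last successful probe, and ignores the extra few seconds the final attempt may spend timing out.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not part of vmagent.yaml.

# Approximate seconds of sustained probe failure before the kubelet
# restarts the container: periodSeconds * failureThreshold.
restart_window() {
    period=$1
    threshold=$2
    echo $((period * threshold))
}

# Seconds of slack left in the exec probe budget (timeoutSeconds)
# after wget's own -T timeout has expired.
exec_slack() {
    exec_timeout=$1
    wget_timeout=$2
    echo $((exec_timeout - wget_timeout))
}

echo "old probe window: $(restart_window 60 3)s"
echo "new probe window: $(restart_window 120 5)s"
echo "exec slack:       $(exec_slack 5 4)s"
```

With the values from this patch, the sketch reports 180s for the old probe, 600s for the new one, and 1s of slack, which matches the 3 minute and 10 minute figures quoted in the commit message.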