The previous probe had timeoutSeconds=1 which is too tight for the
shell pipeline (sh + wget + grep + comparison). On a busy node the
wget call regularly exceeded 1s, the exec timed out, and 3 consecutive
timeouts triggered SIGTERM. Result: vmagent restarted ~5x per 30 min,
causing brief gaps that made the Grafana "Pods up" panel render 0
whenever a refresh happened to coincide with a restart.
The relaxed probe still catches the original failure mode (zero healthy
targets) but only kills the pod after 10 full minutes of consecutive
failure (5 attempts × 2 min period), not 3 minutes (3 × 1 min).
timeoutSeconds: 1 → 5
periodSeconds: 60 → 120
failureThreshold: 3 → 5
initialDelaySeconds: 120 → 180
Also added wget -T 4 inside the command so wget itself bounds its
network call to 4s — leaving 1s of slack within the 5s exec budget.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vmagent's k8s service discovery has been silently broken for 17+ days
because k3s's NetworkPolicy controller evaluates egress AFTER kube-proxy's
DNAT (contrary to the k8s spec). Pod → ClusterIP 10.43.0.1:443 was
DNAT'd to <node_public_ip>:6443, and the resulting :6443 destination
matched none of vmagent's egress rules → TCP RST → "connection refused"
on every SD watch attempt. Grafana panels using kube_* or up{} metrics
returned empty as a result.
Changes:
- network-policies.yaml: commit the previously-cluster-only NetPols
(allow-egress-from-vmagent, allow-vmagent-to-api) so a fresh deploy
produces a working cluster. The vmagent egress rule now includes :6443
to public IPs (the post-DNAT path) and :8080 to the pod CIDR (for
scraping kube-state-metrics).
- observability/kube-state-metrics.yaml: new manifest. Provides the
kube_pod_*, kube_deployment_*, kube_service_* metrics that Grafana
panels need to count pods, replicas, etc. Runs in kube-system with
cluster-scoped RBAC.
- observability/vmagent.yaml:
* add kube-state-metrics scrape job to the ConfigMap
* add vmagent-kube-system Role+RoleBinding so cross-namespace SD works
* replace the misleading liveness probe (was /-/healthy, which lies
while SD is broken) with an exec probe that checks /api/v1/targets
for at least one healthy target — automatic recovery from future
stale-SD incidents
- scripts/03-deploy.sh: actually apply network-policies.yaml (was
committed but never applied) and apply kube-state-metrics.yaml.
- RUNBOOK.md (new): documents the post-DNAT gotcha, the liveness probe
trap, bearer-token recovery procedure, drift-detection diff, and a
post-redeploy verification checklist.
- .gitignore: cover kubeconfig.tunnel (created during SSH-tunnelled
kubectl sessions) so admin client cert can't be committed by accident.
Verified via kubectl --dry-run on all three modified manifests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds internal/prom package with histograms for HTTP, GORM, B2, APNs, and
FCM, wired into the Echo router (HTTPMiddleware + /metrics) and GORM via
statement-level callbacks (no ctx plumbing needed). Storage and push
clients call ObserveB2Upload / ObserveAPNsSend / ObserveFCMSend at the
network round-trip points.
Existing internal/monitoring metrics move to /metrics/legacy so the
canonical /metrics emits proper histogram buckets for p50/p95/p99 rollups.
deploy-k3s/manifests/observability/vmagent.yaml deploys a single-replica
vmagent in the honeydue namespace that scrapes api Pods on :8000/metrics
every 15s and remote-writes to https://obs.88oakapps.com/api/v1/write
with a bearer token (substituted at deploy time from OBS_INGEST_TOKEN
in deploy/prod.env). NetworkPolicies allow vmagent egress to api Pods
and to the public obs endpoint over :443; the obs side runs
VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate.
docs/observability-plan.md captures the full plan including resource
budget, instrumentation table, 4-step rollout, and migration triggers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>