feat(observability): drop health/metrics probe noise from shipped logs

The api logs every request, so k8s liveness/readiness probes on /api/health/ and vmagent's /metrics scrape drowned Loki in 2xx access logs. Alloy now drops successful probe/scrape access lines at ingest (loki.process stage.drop). A non-2xx health check, or one logged above info level, does not match the drop expression and is kept.

Also hardens Alloy's read-offset store. The volume backing /tmp/alloy moves from an emptyDir to a hostPath, so a pod restart resumes from the saved offset instead of re-reading log files from the start. loki.source.file also sets tail_from_end=true, so when no offset is stored (fresh node, or positions wiped) Alloy starts at the end of each file. Re-reading from the start made Loki reject the now-too-old entries with a 400 ("entry too far behind") and stalled shipping.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 21:29:15 -05:00
parent 93fddc3769
commit c845771946
1 changed file with 26 additions and 5 deletions
@@ -141,13 +141,30 @@ data:
    loki.source.file "pod_logs" {
      targets       = local.file_match.pod_logs.targets
      forward_to    = [loki.process.pod_logs.receiver]
      // With no stored read offset (fresh node, or positions wiped), start
      // at the END of each file instead of re-shipping history — otherwise
      // Loki rejects the now-too-old entries ("entry too far behind") and
      // shipping stalls. Offsets persist on a hostPath (see volumes), so a
      // normal pod restart resumes exactly where it left off.
      tail_from_end = true
    }
-    // Parse the CRI log format (timestamp / stream / flags / message).
+    // Parse the CRI log format (timestamp / stream / flags / message),
    // then drop probe/scrape noise before shipping.
    loki.process "pod_logs" {
      forward_to = [loki.write.obs.receiver]
      stage.cri {}
      // Drop successful probe/scrape access logs. k8s liveness/readiness
      // hits /api/health/ every few seconds and vmagent scrapes /metrics
      // on a 15s interval — all 2xx, pure noise that drowns real logs.
      // A non-2xx health check, or one logged above info level, does NOT
      // match this regex and is kept.
      stage.drop {
        expression          = "\"level\":\"info\".*\"path\":\"/(api/health/?|metrics)\".*\"status\":2[0-9][0-9]"
        drop_counter_reason = "probe_access_ok"
      }
    }
    loki.write "obs" {
@@ -252,6 +269,10 @@ spec:
          hostPath:
            path: /var/log/pods
            type: Directory
        # Alloy's positions/WAL store. A hostPath (not emptyDir) so file read
        # offsets survive pod restarts — otherwise every restart re-reads log
        # files from the start and Loki rejects the now-too-old entries.
        - name: tmp
-          emptyDir:
+          hostPath:
-            sizeLimit: 256Mi
+            path: /var/lib/honeydue-alloy-logs
            type: DirectoryOrCreate
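
Review note: the stage.drop expression above can be sanity-checked offline. The sketch below is a rough check, not the api's real log output. It assumes compact single-line JSON access logs with the keys level, path, and status, and it assumes stage.drop applies the expression as an unanchored match. The helper names make_line and is_dropped are hypothetical. Python's re module stands in for Alloy's RE2 engine, which should behave the same for this particular pattern.

```python
import json
import re

# The stage.drop expression from the diff, with the Alloy string escaping removed.
DROP_RE = re.compile(
    r'"level":"info".*"path":"/(api/health/?|metrics)".*"status":2[0-9][0-9]'
)


def make_line(fields):
    """Render a hypothetical access-log line as compact JSON, preserving key order."""
    return json.dumps(fields, separators=(",", ":"))


def is_dropped(line):
    """Return True if the drop expression matches anywhere in the line."""
    return DROP_RE.search(line) is not None


# Successful probe and scrape lines match and would be dropped.
ok_health = make_line({"level": "info", "path": "/api/health/", "status": 200})
ok_metrics = make_line({"level": "info", "path": "/metrics", "status": 200})

# A failing health check, a non-info level, or any other path is kept.
failed_health = make_line({"level": "info", "path": "/api/health/", "status": 503})
warn_health = make_line({"level": "warn", "path": "/api/health/", "status": 200})
other_path = make_line({"level": "info", "path": "/api/tasks/", "status": 200})

# The pattern depends on key order: the same fields in another order do not match.
reordered = make_line({"status": 200, "path": "/metrics", "level": "info"})

if __name__ == "__main__":
    for line in (ok_health, ok_metrics, failed_health, warn_health, other_path, reordered):
        print("drop" if is_dropped(line) else "keep", line)
```

The last case is the one to watch. Because the expression requires level, then path, then status, a change to the logger's field order would silently stop the drop from matching. That fails safe, since the noise returns and nothing real is lost.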