feat(observability): drop health/metrics probe noise from shipped logs

The API logs every request, so k8s liveness/readiness probes on
/api/health/ and vmagent's /metrics scrapes drowned Loki in 2xx access
logs. Alloy now drops successful probe/scrape access lines at ingest
(loki.process stage.drop). A non-2xx health check, or one logged above
info level, does not match the drop regex and is still shipped.

Also hardens Alloy's read-offset store: /tmp/alloy moves from an
emptyDir to a hostPath, so a pod restart resumes from the saved offset
instead of re-reading log files from the start. That re-read made Loki
400-reject the now-too-old entries ("entry too far behind") and stalled
shipping. loki.source.file also sets tail_from_end=true, so a node with
no saved offset starts at the end of each file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trey t
2026-05-17 21:29:15 -05:00
parent 93fddc3769
commit c845771946
@@ -141,13 +141,30 @@ data:
 loki.source.file "pod_logs" {
   targets = local.file_match.pod_logs.targets
   forward_to = [loki.process.pod_logs.receiver]
+  // With no stored read offset (fresh node, or positions wiped), start
+  // at the END of each file instead of re-shipping history; otherwise
+  // Loki rejects the now-too-old entries ("entry too far behind") and
+  // shipping stalls. Offsets persist on a hostPath (see volumes), so a
+  // normal pod restart resumes exactly where it left off.
+  tail_from_end = true
 }
-// Parse the CRI log format (timestamp / stream / flags / message).
+// Parse the CRI log format (timestamp / stream / flags / message),
+// then drop probe/scrape noise before shipping.
 loki.process "pod_logs" {
   forward_to = [loki.write.obs.receiver]
   stage.cri {}
+  // Drop successful probe/scrape access logs. k8s liveness/readiness
+  // hits /api/health/ every few seconds and vmagent scrapes /metrics
+  // on a 15s interval: all 2xx, pure noise that drowns real logs.
+  // A non-2xx health check, or one logged above info level, does NOT
+  // match this regex and is kept.
+  stage.drop {
+    expression = "\"level\":\"info\".*\"path\":\"/(api/health/?|metrics)\".*\"status\":2[0-9][0-9]"
+    drop_counter_reason = "probe_access_ok"
+  }
 }
 loki.write "obs" {
@@ -252,6 +269,10 @@ spec:
   hostPath:
     path: /var/log/pods
     type: Directory
+# Alloy's positions/WAL store. A hostPath (not emptyDir) so file read
+# offsets survive pod restarts; otherwise every restart re-reads log
+# files from the start and Loki rejects the now-too-old entries.
 - name: tmp
-  emptyDir:
-    sizeLimit: 256Mi
+  hostPath:
+    path: /var/lib/honeydue-alloy-logs
+    type: DirectoryOrCreate
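
Review note: the stage.drop expression is easy to sanity-check offline. Below is a minimal sketch in Python, not part of the commit. It assumes the expression is applied as an unanchored search against the message extracted by stage.cri, and that the api emits JSON access logs with "level" before "path" before "status". The sample lines and the `would_drop` helper are hypothetical, invented for illustration. Alloy compiles the expression with Go's RE2 engine; this pattern only uses syntax that Python's `re` handles the same way, as far as I can tell.

```python
import re

# The stage.drop expression from the diff, with the Alloy string escapes removed.
DROP_EXPRESSION = (
    r'"level":"info".*"path":"/(api/health/?|metrics)".*"status":2[0-9][0-9]'
)
DROP_RE = re.compile(DROP_EXPRESSION)


def would_drop(line: str) -> bool:
    """Return True if the line matches the drop expression (unanchored search)."""
    return DROP_RE.search(line) is not None


# Hypothetical access-log lines. Field names and key order are assumptions,
# not taken from the real api output.
SAMPLES = {
    "health_ok": '{"level":"info","method":"GET","path":"/api/health/","status":200}',
    "health_no_slash": '{"level":"info","method":"GET","path":"/api/health","status":204}',
    "metrics_ok": '{"level":"info","method":"GET","path":"/metrics","status":200}',
    "health_failing": '{"level":"info","method":"GET","path":"/api/health/","status":503}',
    "health_warn": '{"level":"warn","method":"GET","path":"/api/health/","status":200}',
    "real_traffic": '{"level":"info","method":"GET","path":"/api/users/","status":200}',
    # Same fields as health_ok, but serialized in a different key order.
    "reordered_keys": '{"status":200,"path":"/api/health/","level":"info"}',
}

if __name__ == "__main__":
    for name, line in SAMPLES.items():
        print(f"{name}: {'drop' if would_drop(line) else 'keep'}")
```

The last sample shows the caveat: the expression is order-sensitive, so if the logger ever serializes keys in a different order, probe lines would stop matching and be shipped again. That fails safe (more noise, no lost logs), but it is worth remembering if the log format changes.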