deploy: add node-exporter DaemonSet + vmagent scrape job
Per-node host metrics (node_filesystem_*, node_memory_*, node_load*) were missing — a node running out of disk would silently fail the cluster before any dashboard signal (RUNBOOK §11.1 gap #9). Adds: - node-exporter DaemonSet (pod-networked, :9100; host /proc,/sys,/ ro) so vmagent scrapes it pod-to-pod over the cluster CIDR, independent of node public IPs (the netpol node-IP list is OVH-stale). - two additive NetworkPolicies (default-deny-all is in force): ingress to node-exporter from vmagent, and vmagent egress to the pod CIDR on :9100. - a node-exporter scrape job in the vmagent-config ConfigMap. Feeds the new "Node host health" row (disk/mem/load) on the eli5 dashboard. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -57,6 +57,27 @@ data:
|
||||
action: keep
|
||||
regex: http-metrics
|
||||
|
||||
# node-exporter — per-node host metrics (node_filesystem_*, node_memory_*,
|
||||
# node_load*). Pod-networked DaemonSet scraped on :9100 over the pod CIDR.
|
||||
- job_name: node-exporter
|
||||
kubernetes_sd_configs:
|
||||
- role: pod
|
||||
namespaces:
|
||||
names: [honeydue]
|
||||
relabel_configs:
|
||||
- source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
|
||||
action: keep
|
||||
regex: node-exporter
|
||||
- source_labels: [__meta_kubernetes_pod_container_port_number]
|
||||
action: keep
|
||||
regex: "9100"
|
||||
- source_labels: [__meta_kubernetes_pod_name]
|
||||
target_label: pod
|
||||
- source_labels: [__meta_kubernetes_pod_node_name]
|
||||
target_label: node
|
||||
- target_label: service
|
||||
replacement: node-exporter
|
||||
|
||||
# honeyDue worker — also exposes /metrics if/when we add it.
|
||||
# Keep this stanza commented until the worker has a /metrics endpoint;
|
||||
# uncommented form drops scrapes silently.
|
||||
|
||||
Reference in New Issue
Block a user