docs: rewrite ch15 observability + cross-refs for the live obs stack

ch15 is now an account of what's actually running, not a roadmap for
what we'd add: VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate
fronted by Cloudflare and bearer-gated nginx, vmagent in-cluster, the
internal/prom histogram set, the rollout's NetworkPolicy footprint, the
obs.88oakapps.com endpoint shape, the ~$0/700MB resource budget, and a
token-rotation runbook. The "what we still don't have" section keeps log
aggregation, alerting, and full distributed tracing as the honest gap
list.

Other touched docs:
- 00-overview: "deliberately absent" no longer claims we have no
  metrics; it calls out the cross-cluster shape instead.
- 14-deployment-process: TL;DR now points at
  deploy-k3s/scripts/03-deploy.sh (full build + push + apply + obs
  vmagent), with the manual kubectl-set-image flow kept as the
  single-service path. Notes the IfNotPresent gotcha that bit us during
  the rollout.
- 16-failure-modes: adds vmagent-can't-reach-obs and Grafana-no-data.
- 18-cost: $0 line item for the obs stack on 88oakappsUpdate, with the
  CX32 migration trigger.
- 17/18 README + appendix b: link the new ch15, add the obs cheat sheet
  block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 15:05:06 -05:00
parent d3708e6c72
commit 77cfcc0b27
8 changed files with 414 additions and 187 deletions
@@ -2,15 +2,119 @@

 ## Summary

-We have minimal observability today: `kubectl logs`, `kubectl top`,
-Cloudflare Analytics, and the Neon dashboard. No Prometheus, no Grafana,
-no centralized log aggregator, no APM. This is adequate for the
-current traffic volume (low) but is a known gap. This chapter documents
-what we *have* and what we'd add as traffic grows.
+Production has live metrics and tracing infrastructure as of 2026-04-25.
+A self-hosted **VictoriaMetrics + Jaeger + Grafana** stack runs on
+`88oakappsUpdate` (Linode VPS, also home to the self-hosted PostHog
+deployment). A single-replica `vmagent` Deployment in the honeyDue k3s
+namespace scrapes the api Pods' `/metrics` endpoint every 15 seconds and
+remote-writes to
+`https://obs.88oakapps.com/api/v1/write`. Grafana is at
+`https://grafana.88oakapps.com` with a pre-provisioned RED dashboard.
+
+What we still don't have: log aggregation (Dozzle and `kubectl logs`
+fill the niche for now), alerting (no PagerDuty/Slack on errors), and
+full distributed tracing (the Jaeger collector is running, but the OTel
+exporter wiring in `cmd/api/main.go` hasn't shipped yet).
+
+The whole observability stack costs **$0** incremental and is capped at
+~700 MB RAM on `88oakappsUpdate` (5% of its free RAM; resident usage at
+idle is far lower). It runs as a separate
+docker-compose project from PostHog so neither product's lifecycle
+touches the other.

 ## What we have

-### 1. `kubectl logs`
+### 1. Metrics — VictoriaMetrics + vmagent
+
+```
+honeyDue k3s (Hetzner)                   88oakappsUpdate (Linode)
+┌───────────────────────────┐            ┌──────────────────────────┐
+│ api Pods (3) :8000/metrics│            │ /opt/honeydue-obs/       │
+│   prometheus/client_golang│            │ ┌──────────────────┐     │
+│                           │            │ │ VictoriaMetrics  │     │
+│ vmagent ──── scrape 15s   │            │ │  30d retention   │     │
+│         remote_write ─────┼────────────┼─→ /api/v1/write   │     │
+│         (HTTPS, bearer)   │            │ │  (mem 256 MB)    │     │
+└───────────────────────────┘            │ └──────────────────┘     │
+                                          └──────────────────────────┘
+```
+
+The Go API exposes `/metrics` in Prometheus exposition format. Histograms
+are defined in `internal/prom/metrics.go` and registered globally:
+
+| Metric | Labels | Source |
+|---|---|---|
+| `http_request_duration_seconds` | `route, method, status` | Echo middleware around every handler |
+| `gorm_query_duration_seconds` | `table, operation` | GORM before/after callbacks (no ctx threading needed) |
+| `b2_upload_duration_seconds` | `bucket, result` | Wrapped `s.backend.Write` in `internal/services/storage_service.go` |
+| `b2_upload_bytes_total` | `bucket, result` | Counter alongside the duration histogram |
+| `apns_send_duration_seconds` | `result` (`ok`/`bad_token`/`error`) | Wrapped APNs `PushWithContext` in `internal/push/apns.go` |
+| `fcm_send_duration_seconds` | `result` | Wrapped FCM HTTP v1 send in `internal/push/fcm.go` |
+| `asynq_job_duration_seconds` | `task_type, result` | Histograms registered; middleware not yet attached (Step 3) |
+| `go_*`, `process_*` | (standard) | `prometheus/client_golang/prometheus/collectors` defaults |
+
+The previous custom monitoring at `/metrics` was renamed to
+`/metrics/legacy` so the canonical `/metrics` emits proper histograms
+suitable for `histogram_quantile()` rollups. The legacy endpoint stays
+because the GoAdmin dashboard reads it.
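
To make the `histogram_quantile()` rollup concrete, here is a minimal Go sketch of the linear interpolation that PromQL applies to cumulative `le` buckets. It is a simplified illustration, not the Prometheus or VictoriaMetrics implementation: the names `bucket`, `quantile`, and `exampleBuckets`, and the bucket boundaries and counts, are made up for the example, and edge cases such as non-positive bucket bounds are ignored.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// bucket mirrors one `le` series of a Prometheus histogram: the cumulative
// count of observations less than or equal to UpperBound.
type bucket struct {
	UpperBound float64 // the `le` label; math.Inf(1) for the +Inf bucket
	Count      float64 // cumulative count (or rate) up to UpperBound
}

// exampleBuckets returns hypothetical request-latency buckets (seconds).
func exampleBuckets() []bucket {
	return []bucket{
		{0.1, 50}, {0.25, 80}, {0.5, 95}, {1, 99}, {math.Inf(1), 100},
	}
}

// quantile estimates the q-quantile from cumulative buckets: locate the
// bucket that contains the target rank, then interpolate linearly inside it.
// It returns NaN for empty input, q outside [0, 1], or a zero total.
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) == 0 || q < 0 || q > 1 {
		return math.NaN()
	}
	bs := append([]bucket(nil), buckets...) // copy so the caller's slice is untouched
	sort.Slice(bs, func(i, j int) bool { return bs[i].UpperBound < bs[j].UpperBound })

	total := bs[len(bs)-1].Count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	i := sort.Search(len(bs), func(k int) bool { return bs[k].Count >= rank })
	if i == len(bs) {
		return math.NaN() // malformed input: counts are not cumulative
	}
	if math.IsInf(bs[i].UpperBound, 1) {
		if i == 0 {
			return math.NaN() // only a +Inf bucket: nothing to interpolate
		}
		return bs[i-1].UpperBound // rank falls in +Inf: report the highest finite bound
	}
	lower, below := 0.0, 0.0 // the first bucket is assumed to start at 0
	if i > 0 {
		lower, below = bs[i-1].UpperBound, bs[i-1].Count
	}
	inBucket := bs[i].Count - below
	if inBucket <= 0 {
		return lower
	}
	return lower + (bs[i].UpperBound-lower)*(rank-below)/inBucket
}

func main() {
	for _, q := range []float64{0.5, 0.95, 0.99} {
		fmt.Printf("p%.0f ~ %.3fs\n", q*100, quantile(q, exampleBuckets()))
	}
}
```

The estimate can only be as precise as the bucket boundaries allow, which is why the old counter-style metrics could not support these rollups.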
+
+#### vmagent in k3s
+
+Lives at `deploy-k3s/manifests/observability/vmagent.yaml`. One replica,
+memory limit `256Mi`, scrapes via Kubernetes pod discovery filtered to
+`app.kubernetes.io/name=api` and remote-writes to
+`https://obs.88oakapps.com/api/v1/write` with a bearer token from
+`OBS_INGEST_TOKEN` in `deploy/prod.env` (substituted into a Secret at
+deploy time).
+
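+The agent buffers locally to `/tmp/vmagent` (emptyDir, 512 MB cap), so
+brief obs outages don't drop samples; the persistent queue replays on
+reconnect. Because the queue lives in an emptyDir, it survives
+container restarts but not Pod deletion (for example a
+`rollout restart`).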
+
+NetworkPolicies in the honeydue namespace allow egress from vmagent to:
+- DNS (kube-dns / coredns)
+- Kubernetes API (`10.43.0.0/16:443`) for pod discovery
+- api Pods on `10.42.0.0/16:8000`
+- The public obs endpoint over `0.0.0.0/0:443`
+
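+These are scoped tight: vmagent can't reach Redis or Postgres. Note
+that the `0.0.0.0/0:443` rule still permits HTTPS egress to any host,
+so HTTPS services such as B2 are not blocked at the network layer.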
+
+### 2. Tracing — Jaeger all-in-one
+
+Jaeger 1.62 with badger storage runs alongside VictoriaMetrics. The
+collector accepts:
+- OTLP/HTTP at `https://obs.88oakapps.com/v1/traces` (bearer-token gated)
+- OTLP/gRPC at `:4317` (localhost-only)
+- Native Jaeger protocols at `:14268` etc. (localhost-only)
+
+Retention: ~7 days at current scale before badger rotates. UI at
+`https://grafana.88oakapps.com` via the Jaeger datasource.
+
+**Status of app-side instrumentation**: the Prometheus histograms are
+populating, but the OTel exporter wiring in `cmd/api/main.go` is **not
+yet shipped**. When it does ship, a request such as
+`POST /api/auth/login/` will produce a flame-graph trace with
+HTTP → handler → SQL → B2 → APNs spans.
+Tracking issue: gitea#3.
+
+### 3. Dashboards — Grafana
+
+`https://grafana.88oakapps.com` (Cloudflare-fronted, basic auth via
+Grafana itself, admin credentials in `deploy/prod.env`).
+
+Datasources auto-provisioned at container startup from
+`/opt/honeydue-obs/data/grafana-provisioning/datasources/datasources.yaml`:
+- VictoriaMetrics (Prometheus type, `http://victoriametrics:8428` in-network)
+- Jaeger (`http://jaeger:16686` in-network)
+
+Pre-provisioned dashboard: `honeyDue API — RED` at
+`/d/honeydue-red`. Top row uses the legacy custom metrics
+(`http_endpoint_requests_total`, `http_requests_total`) which started
+flowing the moment vmagent attached. Lower rows use the new histograms
+(`http_request_duration_seconds_bucket` p50/p95/p99 by route, GORM p95
+by table, B2 upload p95, APNs/FCM send p95, Go memory + goroutines).
+Lower rows populated immediately after the api rebuild that shipped
+`internal/prom`.
+
+### 4. `kubectl logs`

 Every container's stdout/stderr is captured by containerd and readable
 via kubectl:
@@ -33,9 +137,10 @@ kubectl get events -n honeydue --sort-by=.lastTimestamp
 Only the last ~20 MB of logs is retained per container, on-disk on the
 node. Once a pod is deleted, its logs are gone.

-For persistent log access we'd need aggregation (see §what we'd add).
+For persistent log access we'd need aggregation (see §What we still
+don't have).

-### 2. `kubectl top`
+### 5. `kubectl top`

 Pod and node resource usage via metrics-server:

@@ -43,43 +148,32 @@ Pod and node resource usage via metrics-server:
 kubectl top nodes
 # NAME                CPU(cores)   CPU(%)   MEMORY(bytes)   MEMORY(%)
 # ubuntu-8gb-nbg1-1   169m         4%       748Mi           9%
-# ubuntu-8gb-nbg1-2   229m         5%       1043Mi          13%
-# ubuntu-8gb-nbg1-3   124m         3%       770Mi           9%

 kubectl top pods -n honeydue
 ```

-**Retention**: In-memory only. Last few minutes of data. No
-historical view.
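+In-memory only; last few minutes of data. For historical trends use
+the Grafana dashboard: its Go memory and goroutine panels (`go_*`,
+`process_*`) cover the api processes. Container-level `container_*`
+(kubelet cAdvisor) series only appear if vmagent also scrapes the
+kubelet.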

-### 3. Cloudflare Analytics
+### 6. Cloudflare Analytics

-CF Dashboard → Analytics & Logs. Per-zone stats:
- Requests per second
- Bandwidth
- Cache hit ratio
- Top HTTP status codes
- Top request paths
- Bot traffic score
+CF Dashboard → Analytics & Logs. Per-zone aggregate stats:
+requests/sec, bandwidth, cache hit ratio, top status codes, top paths,
+bot traffic score. Good for spotting macro trends ("suddenly 10× more
+502s today") that wouldn't show up in a single-pod sample.

-All aggregated, no individual request traces. Good for spotting macro
-trends ("suddenly 10× more 502s today"), poor for debugging specific
-issues.
+Free tier retention: 7 days of aggregate stats.

-Free tier retention: 7 days of aggregate stats. Pro extends this.
+### 7. Neon dashboard

-### 4. Neon dashboard
+Neon console → project → Monitoring: compute utilization (CU-hours),
+slow queries, active connections, storage usage. Useful for "is the
+DB busy?" and free-tier limit watching. The new
+`gorm_query_duration_seconds` histogram covers the application side
+of the same question with much better latency tail visibility.

-Neon console → project → Monitoring:
- Compute utilization (CU-hours consumed)
- Query performance (slow queries)
- Active connections
- Storage usage
-
-Good for "is the DB busy?" and "am I close to my free tier limit?"
-Not real-time.
-
-### 5. Kubernetes events
+### 8. Kubernetes events

 `kubectl get events` shows cluster-level state changes: pod scheduling,
 failures, image pulls, probe failures. Useful for post-mortem on
@@ -87,7 +181,7 @@ deploys.

 Retention: events are stored in etcd but default to 1 hour.

-## What we don't have (the gap)
+## What we still don't have

 ### No log aggregation

@@ -98,64 +192,55 @@ all api pod logs for user X") we have to:
 # Query all at once with stern (if installed)
 stern -n honeydue api

-# Or for specific pod
+# Or per-pod
 kubectl logs -n honeydue <pod> | grep user_id=12345
 ```

-This works but doesn't scale. Grep across 3 pods for a specific
-user_id is OK. Across 30 pods, intractable.
+This works but doesn't scale across many pods.

-**What we'd add**: [Loki](https://grafana.com/oss/loki/) — a lightweight
-log aggregator designed for k8s. ~$0 to self-host; integrates with
-Grafana for queries. Or [Betterstack](https://betterstack.com/logs)
-($10/mo, hosted).
-
-### No metrics/dashboards
-
-`kubectl top` tells us "is this pod hot right now?" but not "has CPU
-been climbing over the past hour?" We'd need:
-
- **Prometheus** — scrapes metrics from kubelet and pods' `/metrics`
-  endpoints, stores time series
- **Grafana** — queries Prometheus, renders dashboards
-
-K3s can install these via Helm in ~10 minutes. Adds ~500MB RAM to the
-cluster. Stability and operational load: moderate.
-
-**Alternative**: [Kubernetes Dashboard](https://github.com/kubernetes/dashboard)
-bundled with k3s (disabled by default). Minimal UI over the existing
-metrics API. Cheaper than Prometheus but less queryable.
-
-### No distributed tracing
-
-"This request took 800ms — which hop was slow?" is currently unanswerable
-beyond "the DB query, probably." A real trace would show:
- TLS handshake time
- Traefik routing time
- Go handler time
- Postgres query time
- Redis call time
- Each B2 request time
-
-We'd add OpenTelemetry to the Go app and export to Jaeger/Tempo. Work
-is moderate; value kicks in when we have complex request flows.
+**What we'd add**: [Loki](https://grafana.com/oss/loki/) on
+`88oakappsUpdate` next to the existing obs stack. Adds ~512 MB RAM
+plus a Promtail (or Vector/Alloy) DaemonSet in k3s. Defer until log
+search becomes a recurring pain point — `stern` + `grep` is fine at
+current pod count.

 ### No alerting

 No PagerDuty, no Slack webhooks, no email on "api is returning 500s."
 The operator finds out when users complain.

-Cheapest fix: [Uptime Kuma](https://github.com/louislam/uptime-kuma)
-(self-hosted) or Better Stack Uptime (free for small teams). Ping
-`https://api.myhoneydue.com/api/health/` every minute; alert if it fails.
+Cheapest fix path:
+1. Grafana alerting (built into Grafana 11) — alert rules over the
+   existing histograms (e.g., `histogram_quantile(0.95, ...) > 1s`).
+   Routes to Slack via webhook. **Zero infra cost.**
+2. [Uptime Kuma](https://github.com/louislam/uptime-kuma) on
+   `88oakappsUpdate` — pings `/api/health/` from outside the cluster
+   every minute; complements the in-cluster view.
+
+We'd want both eventually. Grafana alerting first because the data is
+already there.
+
+### Partial distributed tracing
+
+The OTel SDK is **not yet wired** in `cmd/api/main.go`. When it ships:
+- `otelecho.Middleware` produces a span per HTTP request
+- `otelgorm` plugin produces a span per SQL query (requires threading
+  `ctx` through repositories — the largest diff in the rollout)
+- Manual spans wrap B2 uploads, APNs/FCM sends, asynq jobs
+
+Until then, we have aggregate latency by route from the histograms but
+no per-request flame graph. For "why is *this one* request slow" we
+still rely on logs + the GORM duration histogram.

 ### No APM (Application Performance Monitoring)

-No request-level profiling. We can't see "which endpoint has the highest
-p99 latency?" or "which SQL query is hot this week?"
+No continuous profiling. We can answer "which endpoint has the highest
+p99 latency?" from the histograms, but not "where in the call stack is
+the time going?" without ad-hoc `pprof` runs.

-Options: Datadog, New Relic, Honeycomb, self-hosted Tempo+Grafana.
-All are meaningful work to set up and cost $$$.
+If/when needed: Grafana Pyroscope is the OSS continuous profiler that
+fits our stack. Adds ~512 MB RAM. Defer until a CPU performance
+incident shows up.

 ## The app's logging conventions

@@ -172,28 +257,12 @@ The Go app uses zerolog and emits structured JSON:
 ```

 Log levels: `debug`, `info`, `warn`, `error`, `fatal`. Controlled by
-`DEBUG=true|false` in ConfigMap (true sets level to debug, false sets
-level to info).
+`DEBUG=true|false` in the ConfigMap (true sets level to debug, false
+sets level to info).

-Every request is logged with:
- Method, path, status code
- Request ID (for correlating logs across pods)
- User ID (if authenticated)
- Latency
-
-```json
-{
-  "level": "info",
-  "method": "GET",
-  "path": "/api/tasks/",
-  "status": 200,
-  "latency_ms": 42,
-  "user_id": 123,
-  "request_id": "a6b5db35-..."
-}
-```
-
-This is queryable by grep. Better with log aggregation.
+Every request is logged with method, path, status, request_id, user_id
+(if authenticated), and latency. Queryable by grep today; ready to
+ingest into Loki when we add it.
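
As an illustration of what "queryable by grep" amounts to, the following Go sketch parses JSON log lines and keeps the entries for one user. It is a hedged example only: the `requestLog` type, `filterByUser`, and `sampleLogs` are hypothetical, and the JSON keys (in particular `latency_ms`) are assumed from the example log line this chapter previously showed.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// requestLog holds the per-request fields listed above. The JSON key names
// are assumptions; adjust them to whatever the app actually emits.
type requestLog struct {
	Method    string  `json:"method"`
	Path      string  `json:"path"`
	Status    int     `json:"status"`
	LatencyMS float64 `json:"latency_ms"`
	UserID    int64   `json:"user_id"`
	RequestID string  `json:"request_id"`
}

// sampleLogs is made-up output in the shape `kubectl logs` might return.
const sampleLogs = `
{"level":"info","method":"GET","path":"/api/tasks/","status":200,"latency_ms":42,"user_id":123,"request_id":"req-1"}
{"level":"info","method":"POST","path":"/api/auth/login/","status":200,"latency_ms":310,"user_id":456,"request_id":"req-2"}
panic: not a JSON line
{"level":"error","method":"GET","path":"/api/tasks/","status":500,"latency_ms":12,"user_id":123,"request_id":"req-3"}
`

// filterByUser returns the parsed entries whose user_id matches userID.
// Lines that are not valid JSON (stack traces, blank lines) are skipped.
// Unauthenticated requests have no user_id and decode as 0.
func filterByUser(logs string, userID int64) []requestLog {
	var out []requestLog
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		var e requestLog
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			continue
		}
		if e.UserID == userID {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	for _, e := range filterByUser(sampleLogs, 123) {
		fmt.Printf("%s %s %s -> %d (%.0f ms)\n", e.RequestID, e.Method, e.Path, e.Status, e.LatencyMS)
	}
}
```

Because every line is self-describing JSON, the same fields would map directly onto labels or parsed fields in a log aggregator.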

 ## Health endpoints

@@ -202,71 +271,58 @@ Each service exposes a health endpoint:
 | Service | Endpoint | What it checks |
 |---|---|---|
 | api | `/api/health/` | Process alive (doesn't verify DB) |
+| api | `/api/health/live` | Process alive |
 | admin | `/` | Next.js is up |
 | worker | (none public) | Internal Asynq status |
+| api | `/metrics` | Prometheus exposition (vmagent scrapes here) |
+| api | `/metrics/legacy` | Custom monitoring metrics for GoAdmin |

 Health endpoints are **shallow** — they return 200 if the process is
 running and listening. They don't try to reach Postgres/Redis/etc.
 Rationale: if Postgres is briefly down, we don't want all api pods to
 start failing liveness and cascade-restart.
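
A minimal sketch of what "shallow" means in practice, using only the Go standard library. This is not the app's actual handler (the API uses Echo); `liveHandler`, `newMux`, `probe`, and the response body are hypothetical names and values chosen for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// liveHandler is a shallow liveness check: it answers 200 whenever the
// process can serve HTTP. It deliberately makes no Postgres or Redis call,
// so a brief dependency outage cannot fail the probe.
func liveHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK)
	fmt.Fprint(w, `{"status":"ok"}`)
}

// newMux wires the health paths from the table above to the shallow handler.
func newMux() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/health/", liveHandler)
	mux.HandleFunc("/api/health/live", liveHandler)
	return mux
}

// probe issues one in-memory GET against h and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	for _, p := range []string{"/api/health/", "/api/health/live", "/nope"} {
		fmt.Println(p, probe(newMux(), p))
	}
}
```

A deep check that pinged the database would belong on a separate readiness-style path, if one were ever added, so that liveness stays independent of dependencies.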

-## Dozzle (deprecated)
+## obs.88oakapps.com — the ingest endpoint

-The Swarm era had [Dozzle](https://github.com/amir20/dozzle) — a
-lightweight web UI for Docker logs. Accessible via SSH tunnel to the
-manager node. Not deployed on k3s; `kubectl logs` + `stern` fills the
-niche.
+Public hostname for cross-cluster metric and trace ingest. Cloudflare
+in front, nginx on `88oakappsUpdate` enforces a bearer-token check
+before forwarding to the local VM/Jaeger containers.

-## Kubernetes metrics the k8s API exposes
+| Path | Forwards to | Purpose |
+|---|---|---|
+| `/api/v1/write` | `http://127.0.0.1:8428` | Prometheus remote-write (vmagent → VM) |
+| `/v1/traces` | `http://127.0.0.1:4318/v1/traces` | OTLP/HTTP traces (app → Jaeger) |
+| `/health` | (returns 200) | Reachability probe — also requires auth |
+| anything else | 404 | |

-Even without Prometheus, these are queryable:
+Token lives at `/etc/honeydue-obs/secrets.env` (mode 0600 on the box)
+and at `OBS_INGEST_TOKEN=` in `deploy/prod.env` (gitignored). To rotate:
+generate a new value, update the box (`secrets.env` and the nginx site
+config), `deploy/prod.env`, and the k3s Secret, then restart vmagent.

 ```bash
-# Resource metrics (via metrics-server)
-kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
-kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/honeydue/pods
-
-# Core API (k8s state)
-kubectl get --raw /api/v1/namespaces/honeydue/pods/<name>
-
-# Kubelet metrics (per-node; requires tunneling)
-kubectl get --raw /api/v1/nodes/<node>/proxy/metrics
+# Operator: rotate the bearer token
+# Export KUBECONFIG so it applies to BOTH kubectl calls in the pipe below
+export KUBECONFIG="$HOME/.kube/honeydue.yaml"
+NEW=$(openssl rand -hex 32)
+# Obs box: update the stored token and the nginx bearer check, then reload nginx
+ssh 88oakappsUpdate "sudo sed -i 's|OBS_INGEST_TOKEN=.*|OBS_INGEST_TOKEN=$NEW|' /etc/honeydue-obs/secrets.env"
+ssh 88oakappsUpdate "sudo sed -i 's|Bearer [a-f0-9]\{64\}|Bearer $NEW|' /etc/nginx/sites-available/obs.88oakapps.com && sudo nginx -s reload"
+# Local deploy env (sed leaves the old token in deploy/prod.env.bak; delete it afterwards)
+sed -i.bak "s|^OBS_INGEST_TOKEN=.*|OBS_INGEST_TOKEN=$NEW|" deploy/prod.env
+# k3s: replace the Secret, then restart vmagent so it picks up the new token
+kubectl -n honeydue create secret generic vmagent-remote-write \
+  --from-literal=bearer_token="$NEW" --dry-run=client -o yaml | kubectl apply -f -
+kubectl -n honeydue rollout restart deploy/vmagent
 ```

-If we ever spin up Prometheus, these are the endpoints it would scrape.
+## Resource budget

-## Future: what to add and when
+| Service | mem_limit | Disk | Retention |
+|---|---|---|---|
+| VictoriaMetrics | 256 MB | 10 GB | 30 days |
+| Jaeger all-in-one (badger) | 256 MB | 10 GB | ~7 days |
+| Grafana OSS | 256 MB | 1 GB | — |
+| vmagent (in k3s) | 256 MB | 512 MB emptyDir | — |
+| **Total** | **~1 GB hard cap** | **~21 GB** | |

-| Trigger | Add |
-|---|---|
-| 10k+ daily users | Loki + Grafana for logs |
-| 100+ req/s sustained | Prometheus + Grafana for metrics |
-| Performance incidents | OpenTelemetry tracing |
-| Revenue > $5k/mo | Paid monitoring (Datadog or similar) |
-| First production outage | Alerting to phone/Slack |
-
-The overall philosophy: observability is an investment that compounds.
-Add it before you need it, not after. But also don't over-invest at
-idle.
-
-**Next quarter**: set up Uptime Kuma + Loki at minimum.
-
-## Checking what's installed
-
-```bash
-# In kube-system namespace
-kubectl get pods -n kube-system
-# Should see: coredns, metrics-server, traefik, local-path-provisioner,
-# and some k3s-related helm install jobs
-
-# In honeydue namespace
-kubectl get pods -n honeydue
-# api, admin, worker, redis
-
-# No monitoring namespace (yet)
-kubectl get namespaces
-# default, honeydue, kube-node-lease, kube-public, kube-system
-```
+Resident usage at idle is much lower (~90 MB on the obs side, ~30 MB
+for vmagent). Hard limits exist so a memory leak in any one component
+can't squeeze the cohabiting PostHog stack on `88oakappsUpdate`.

 ## Operator cheat sheet

@@ -274,32 +330,61 @@ kubectl get namespaces
 # Tail all logs in the namespace
 kubectl logs -n honeydue --all-containers=true --tail=50 -l app.kubernetes.io/part-of=honeydue

+# Scrape state from vmagent self-metrics
+kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://127.0.0.1:8429/metrics \
+  | grep -E "scrapes_total|targets|remotewrite"
+
+# Force vmagent to reload scrape config
+kubectl -n honeydue rollout restart deploy/vmagent
+
+# Query VictoriaMetrics directly (PromQL)
+ssh 88oakappsUpdate 'curl -s "http://127.0.0.1:8428/api/v1/query?query=histogram_quantile(0.95,sum%20by%20(route,le)(rate(http_request_duration_seconds_bucket%5B5m%5D)))" | python3 -m json.tool'
+
+# Restart the obs stack on 88oakappsUpdate
+ssh 88oakappsUpdate 'cd /opt/honeydue-obs && sudo docker compose restart'
+
+# Live obs container memory
+ssh 88oakappsUpdate 'sudo docker stats --no-stream | grep honeydue-obs'
+
+# Pod resource usage (k3s side)
+kubectl top pods -n honeydue --sort-by=memory
+
 # With stern (if installed: brew install stern)
 stern -n honeydue .

-# Follow specific pod, including previous runs
-kubectl logs -n honeydue <pod> -f --previous=false
-
-# Pod resource usage
-kubectl top pods -n honeydue --sort-by=memory
-kubectl top pods -n honeydue --sort-by=cpu
-
-# Events (cluster-wide)
-kubectl get events -A --sort-by=.lastTimestamp | tail -20
-
 # Full state dump for a pod (debugging)
 kubectl describe pod -n honeydue <pod> > /tmp/pod-dump.txt
 kubectl logs -n honeydue <pod> > /tmp/pod-logs.txt
 ```

+## Future: what to add and when
+
+| Trigger | Add |
+|---|---|
+| First production incident | Grafana alerting (free, data already there) |
+| 10k+ daily users | Loki + Vector for log aggregation |
+| Performance incident the histograms can't explain | Wire OTel exporter → Jaeger from the Go app |
+| CPU pressure on api pods | Pyroscope continuous profiler |
+| Multi-product obs needs | Migrate obs stack to dedicated CX32 ($8/mo) |
+
+The overall philosophy: observability is an investment that compounds.
+Add it before you need it, not after. But also don't over-invest at
+idle.
+
 ## References

- [Kubernetes metrics-server][ms]
- [K3s metrics][k3s-metrics]
- [Loki][loki]
+- [VictoriaMetrics docs][vm]
+- [vmagent kubernetes_sd_configs][vmagent-k8s]
+- [Jaeger all-in-one with badger][jaeger]
+- [prometheus/client_golang][promclient]
+- [Grafana provisioning datasources][gf-prov]
+- [Loki][loki] (future)
 - [Stern (multi-pod log tail)][stern]

-[ms]: https://github.com/kubernetes-sigs/metrics-server
-[k3s-metrics]: https://docs.k3s.io/advanced#enabling-metrics-server
+[vm]: https://docs.victoriametrics.com/single-server-victoriametrics/
+[vmagent-k8s]: https://docs.victoriametrics.com/vmagent.html#kubernetes-monitoring-with-vmagent
+[jaeger]: https://www.jaegertracing.io/docs/1.62/getting-started/#all-in-one
+[promclient]: https://pkg.go.dev/github.com/prometheus/client_golang
+[gf-prov]: https://grafana.com/docs/grafana/latest/administration/provisioning/#datasources
 [loki]: https://grafana.com/oss/loki/
 [stern]: https://github.com/stern/stern