# 15 — Observability

## Summary

We have minimal observability today: `kubectl logs`, `kubectl top`, Cloudflare Analytics, and the Neon dashboard. No Prometheus, no Grafana, no centralized log aggregator, no APM.

This is adequate for the current (low) traffic volume but is a known gap. This chapter documents what we *have* and what we'd add as traffic grows.

## What we have

### 1. `kubectl logs`

Every container's stdout/stderr is captured by containerd and readable via kubectl:

```bash
# Live tail from all api pods
kubectl logs -n honeydue -l app.kubernetes.io/name=api -f --prefix

# Last 100 lines per pod
kubectl logs -n honeydue -l app.kubernetes.io/name=api --tail=100

# Previous container instance's logs (before the most recent restart)
kubectl logs -n honeydue <pod-name> --previous

# Events (not logs; k8s-level state changes)
kubectl get events -n honeydue --sort-by=.lastTimestamp
```

**Retention**: the kubelet rotates a container's log file when it exceeds 10 MB (the `containerLogMaxSize` default) and keeps only a handful of rotated files, so only the most recent tens of megabytes of logs are retained per container, on disk on the node. Once a pod is deleted, its logs are gone. For persistent log access we'd need aggregation (see "No log aggregation" below).

### 2. `kubectl top`

Pod and node resource usage via metrics-server:

```bash
kubectl top nodes
# NAME                CPU(cores)   CPU(%)   MEMORY(bytes)   MEMORY(%)
# ubuntu-8gb-nbg1-1   169m         4%       748Mi           9%
# ubuntu-8gb-nbg1-2   229m         5%       1043Mi          13%
# ubuntu-8gb-nbg1-3   124m         3%       770Mi           9%

kubectl top pods -n honeydue
```

**Retention**: In-memory only. metrics-server holds just the most recent samples, so there is no historical view.

### 3. Cloudflare Analytics

CF Dashboard → Analytics & Logs. Per-zone stats:

- Requests per second
- Bandwidth
- Cache hit ratio
- Top HTTP status codes
- Top request paths
- Bot traffic score

All aggregated, no individual request traces. Good for spotting macro trends ("suddenly 10× more 502s today"), poor for debugging specific issues.

Free tier retention: 7 days of aggregate stats. Pro extends this.

### 4. Neon dashboard

Neon console → project → Monitoring:

- Compute utilization (CU-hours consumed)
- Query performance (slow queries)
- Active connections
- Storage usage

Good for "is the DB busy?" and "am I close to my free tier limit?" Not real-time.

### 5. Kubernetes events

`kubectl get events` shows cluster-level state changes: pod scheduling, failures, image pulls, probe failures. Useful for post-mortems on deploys.

Retention: events are stored in the cluster datastore and expire after 1 hour by default (the API server's `--event-ttl`).

## What we don't have (the gap)

### No log aggregation

Individual pod logs live on the node that runs the pod. For multi-pod debugging ("show me all api pod logs for user X") we have to:

```bash
# Query all api pods at once with stern (if installed)
stern -n honeydue api

# Or for a specific pod. zerolog emits compact JSON, so match the key/value pair exactly
kubectl logs -n honeydue <pod-name> | grep -E '"user_id":12345[,}]'
```

This works but doesn't scale. Grepping across 3 pods for a specific user_id is fine; across 30 pods it becomes intractable.

**What we'd add**: [Loki](https://grafana.com/oss/loki/), a lightweight log aggregator designed for k8s. ~$0 to self-host; integrates with Grafana for queries. Or [Better Stack](https://betterstack.com/logs) ($10/mo, hosted).

### No metrics/dashboards

`kubectl top` tells us "is this pod hot right now?" but not "has CPU been climbing over the past hour?" We'd need:

- **Prometheus**: scrapes metrics from the kubelet and pods' `/metrics` endpoints, stores time series
- **Grafana**: queries Prometheus, renders dashboards

Both can be installed on k3s via Helm in ~10 minutes. They add ~500MB RAM to the cluster. Stability and operational load: moderate.

**Alternative**: [Kubernetes Dashboard](https://github.com/kubernetes/dashboard), which is not bundled with k3s but can be deployed separately. Minimal UI over the existing metrics API. Cheaper than Prometheus but less queryable.

### No distributed tracing

"This request took 800ms; which hop was slow?" is currently unanswerable beyond "the DB query, probably."
A real trace would show:

- TLS handshake time
- Traefik routing time
- Go handler time
- Postgres query time
- Redis call time
- Each B2 request time

We'd add OpenTelemetry to the Go app and export to Jaeger/Tempo. Work is moderate; value kicks in when we have complex request flows.

### No alerting

No PagerDuty, no Slack webhooks, no email on "api is returning 500s." The operator finds out when users complain.

Cheapest fix: [Uptime Kuma](https://github.com/louislam/uptime-kuma) (self-hosted) or Better Stack Uptime (free for small teams). Ping `https://api.myhoneydue.com/api/health/` every minute; alert if it fails.

### No APM (Application Performance Monitoring)

No request-level profiling. We can't see "which endpoint has the highest p99 latency?" or "which SQL query is hot this week?"

Options: Datadog, New Relic, Honeycomb, self-hosted Tempo+Grafana. All take meaningful work to set up and cost $$$.

## The app's logging conventions

The Go app uses zerolog and emits structured JSON, one compact object per line (pretty-printed here for readability):

```json
{
  "level": "info",
  "time": "2026-04-24T05:29:40Z",
  "caller": "/app/cmd/api/main.go:189",
  "addr": ":8000",
  "message": "HTTP server listening"
}
```

Log levels: `debug`, `info`, `warn`, `error`, `fatal`. Controlled by `DEBUG=true|false` in the ConfigMap (`true` sets the level to debug, `false` sets it to info).

Every request is logged with:

- Method, path, status code
- Request ID (for correlating logs across pods)
- User ID (if authenticated)
- Latency

```json
{
  "level": "info",
  "method": "GET",
  "path": "/api/tasks/",
  "status": 200,
  "latency_ms": 42,
  "user_id": 123,
  "request_id": "a6b5db35-..."
}
```

This is queryable by grep. Better with log aggregation.

## Health endpoints

Each service exposes a health endpoint:

| Service | Endpoint | What it checks |
|---|---|---|
| api | `/api/health/` | Process alive (doesn't verify DB) |
| admin | `/` | Next.js is up |
| worker | (none public) | Internal Asynq status |

Health endpoints are **shallow**: they return 200 if the process is running and listening. They don't try to reach Postgres/Redis/etc. Rationale: if Postgres is briefly down, we don't want all api pods to start failing liveness and cascade-restart.

## Dozzle (deprecated)

The Swarm era had [Dozzle](https://github.com/amir20/dozzle), a lightweight web UI for Docker logs, accessible via SSH tunnel to the manager node. It is not deployed on k3s; `kubectl logs` + `stern` fills the niche.

## Kubernetes metrics the k8s API exposes

Even without Prometheus, these are queryable:

```bash
# Resource metrics (via metrics-server)
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/honeydue/pods

# Core API (k8s state)
kubectl get --raw /api/v1/namespaces/honeydue/pods/

# Kubelet metrics (per-node; proxied through the API server)
kubectl get --raw /api/v1/nodes/<node-name>/proxy/metrics
```

If we ever spin up Prometheus, the kubelet metrics endpoint is the kind of target it would scrape.

## Future: what to add and when

| Trigger | Add |
|---|---|
| 10k+ daily users | Loki + Grafana for logs |
| 100+ req/s sustained | Prometheus + Grafana for metrics |
| Performance incidents | OpenTelemetry tracing |
| Revenue > $5k/mo | Paid monitoring (Datadog or similar) |
| First production outage | Alerting to phone/Slack |

The overall philosophy: observability is an investment that compounds. Add it before you need it, not after, but don't over-invest while the system is idle.

**Next quarter**: set up Uptime Kuma + Loki at minimum.

## Checking what's installed

```bash
# In kube-system namespace
kubectl get pods -n kube-system
# Should see: coredns, metrics-server, traefik, local-path-provisioner,
# and some k3s-related helm install jobs

# In honeydue namespace
kubectl get pods -n honeydue
# api, admin, worker, redis

# No monitoring namespace (yet)
kubectl get namespaces
# default, honeydue, kube-node-lease, kube-public, kube-system
```

## Operator cheat sheet

```bash
# Tail all logs in the namespace
kubectl logs -n honeydue --all-containers=true --tail=50 -l app.kubernetes.io/part-of=honeydue

# With stern (if installed: brew install stern)
stern -n honeydue .

# Follow a specific pod (current container only; use --previous without -f for the prior run)
kubectl logs -n honeydue <pod-name> -f --previous=false

# Pod resource usage
kubectl top pods -n honeydue --sort-by=memory
kubectl top pods -n honeydue --sort-by=cpu

# Events (cluster-wide)
kubectl get events -A --sort-by=.lastTimestamp | tail -20

# Full state dump for a pod (debugging)
kubectl describe pod -n honeydue <pod-name> > /tmp/pod-dump.txt
kubectl logs -n honeydue <pod-name> > /tmp/pod-logs.txt
```

## References

- [Kubernetes metrics-server][ms]
- [K3s metrics][k3s-metrics]
- [Loki][loki]
- [Stern (multi-pod log tail)][stern]

[ms]: https://github.com/kubernetes-sigs/metrics-server
[k3s-metrics]: https://docs.k3s.io/advanced#enabling-metrics-server
[loki]: https://grafana.com/oss/loki/
[stern]: https://github.com/stern/stern