honeyDueAPI/docs/deployment/15-observability.md
Trey t 6f303dbbaa
Migrate prod deploy from Swarm to K3s; add full deployment book
Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
  callback (was causing 'unlock of unlocked mutex' fatal after
  Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
  + allowlist fonts.googleapis.com so the marketing landing page CSS
  actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
  --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
  images runnable on x86_64 Hetzner nodes; fix array expansion under
  set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
  top-level aliases (the '\${X_SECRET}' form never actually resolved);
  dozzle ports: long-form host_ip is rejected by Swarm, switched to
  short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
  worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
  (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
  12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
  real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
  and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
  hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
  without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations,
    consolidated citations
- README.md: Production Deployment section replaced with pointer to
  the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 07:20:54 -05:00

# 15 — Observability
## Summary
We have minimal observability today: `kubectl logs`, `kubectl top`,
Cloudflare Analytics, and the Neon dashboard. No Prometheus, no Grafana,
no centralized log aggregator, no APM. This is adequate for the
current traffic volume (low) but is a known gap. This chapter documents
what we *have* and what we'd add as traffic grows.
## What we have
### 1. `kubectl logs`
Every container's stdout/stderr is captured by containerd and readable
via kubectl:
```bash
# Live tail from all api pods
kubectl logs -n honeydue -l app.kubernetes.io/name=api -f --prefix
# Last 100 lines from each api pod
kubectl logs -n honeydue -l app.kubernetes.io/name=api --tail=100
# Previous pod's logs (before the most recent restart)
kubectl logs -n honeydue <pod-name> --previous
# Events (not logs — k8s-level state changes)
kubectl get events -n honeydue --sort-by=.lastTimestamp
```
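**Retention**: the kubelet rotates a container's log once it exceeds
10 MiB (the `containerLogMaxSize` default) and keeps a bounded number of
rotated files (`containerLogMaxFiles`, default 5), so only the most
recent few tens of megabytes per container survive, on disk on the
node. Once a pod is deleted, its logs are gone.
For persistent log access we'd need aggregation (see "No log
aggregation" below).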
### 2. `kubectl top`
Pod and node resource usage via metrics-server:
```bash
kubectl top nodes
# NAME CPU(cores) CPU(%) MEMORY(bytes) MEMORY(%)
# ubuntu-8gb-nbg1-1 169m 4% 748Mi 9%
# ubuntu-8gb-nbg1-2 229m 5% 1043Mi 13%
# ubuntu-8gb-nbg1-3 124m 3% 770Mi 9%
kubectl top pods -n honeydue
```
**Retention**: In-memory only. metrics-server keeps just the most recent
samples for each node and pod, so there is no historical view.
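Because nothing is stored, one low-effort stopgap is to sample
`kubectl top` on a schedule and append the rows to a file. The sketch
below assumes the three-column `NAME CPU MEMORY` layout printed by
`kubectl top pods --no-headers`; the function name `top_to_csv` and the
output path are hypothetical.

```bash
# Convert `kubectl top pods --no-headers` output (NAME CPU MEMORY) read
# from stdin into CSV rows, each prefixed with the timestamp passed as $1.
top_to_csv() {
  awk -v ts="$1" 'NF >= 3 { print ts "," $1 "," $2 "," $3 }'
}

# Example (e.g. from cron, once a minute):
#   kubectl top pods -n honeydue --no-headers \
#     | top_to_csv "$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$HOME/honeydue-top.csv"
```

This is not a substitute for a real time-series store, but it leaves a
plain-text trail that can be grepped after the fact.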
### 3. Cloudflare Analytics
CF Dashboard → Analytics & Logs. Per-zone stats:
- Requests per second
- Bandwidth
- Cache hit ratio
- Top HTTP status codes
- Top request paths
- Bot traffic score
All aggregated, no individual request traces. Good for spotting macro
trends ("suddenly 10× more 502s today"), poor for debugging specific
issues.
Free tier retention: 7 days of aggregate stats. Pro extends this.
### 4. Neon dashboard
Neon console → project → Monitoring:
- Compute utilization (CU-hours consumed)
- Query performance (slow queries)
- Active connections
- Storage usage
Good for "is the DB busy?" and "am I close to my free tier limit?"
Not real-time.
### 5. Kubernetes events
`kubectl get events` shows cluster-level state changes: pod scheduling,
failures, image pulls, probe failures. Useful for post-mortem on
deploys.
Retention: events are stored in etcd and expire after 1 hour by default
(the API server's `--event-ttl`).
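Since events are short-lived, it can help to summarize them while they
still exist. A minimal sketch, assuming one event reason per line on
stdin; the helper name `count_reasons` is hypothetical.

```bash
# Count how often each event reason occurs, most frequent first.
# Reads one reason per line on stdin and prints "REASON COUNT".
count_reasons() {
  sort | uniq -c | sort -rn | awk '{ print $2, $1 }'
}

# Example:
#   kubectl get events -n honeydue --no-headers \
#     -o custom-columns=REASON:.reason | count_reasons
```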
## What we don't have (the gap)
### No log aggregation
Individual pod logs are on the node. For multi-pod debugging ("show me
all api pod logs for user X") we have to:
```bash
# Query all at once with stern (if installed)
stern -n honeydue api
# Or for a specific pod (logs are JSON, so match the JSON field)
kubectl logs -n honeydue <pod> | grep -E '"user_id": ?12345[,}]'
```
This works but doesn't scale. Grep across 3 pods for a specific
user_id is OK. Across 30 pods, intractable.
**What we'd add**: [Loki](https://grafana.com/oss/loki/) — a lightweight
log aggregator designed for k8s. ~$0 to self-host; integrates with
Grafana for queries. Or [Better Stack](https://betterstack.com/logs)
($10/mo, hosted).
### No metrics/dashboards
`kubectl top` tells us "is this pod hot right now?" but not "has CPU
been climbing over the past hour?" We'd need:
- **Prometheus** — scrapes metrics from kubelet and pods' `/metrics`
endpoints, stores time series
- **Grafana** — queries Prometheus, renders dashboards
K3s can install these via Helm in ~10 minutes. Adds ~500MB RAM to the
cluster. Stability and operational load: moderate.
**Alternative**: [Kubernetes Dashboard](https://github.com/kubernetes/dashboard),
which is not bundled with k3s and has to be installed separately. It is
a minimal UI over the existing metrics API: cheaper than Prometheus but
less queryable.
### No distributed tracing
"This request took 800ms — which hop was slow?" is currently unanswerable
beyond "the DB query, probably." A real trace would show:
- TLS handshake time
- Traefik routing time
- Go handler time
- Postgres query time
- Redis call time
- Each B2 request time
We'd add OpenTelemetry to the Go app and export to Jaeger/Tempo. Work
is moderate; value kicks in when we have complex request flows.
### No alerting
No PagerDuty, no Slack webhooks, no email on "api is returning 500s."
The operator finds out when users complain.
Cheapest fix: [Uptime Kuma](https://github.com/louislam/uptime-kuma)
(self-hosted) or Better Stack Uptime (free for small teams). Ping
`https://api.myhoneydue.com/api/health/` every minute; alert if it fails.
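Until one of those is in place, the same check can be approximated with
a small cron script. This is a sketch under assumptions, not a
replacement for a real monitor: `ALERT_WEBHOOK` is a hypothetical
Slack-compatible incoming-webhook URL, and the function names are
illustrative.

```bash
# Minimal uptime probe, meant to run from cron on a machine outside the cluster.
HEALTH_URL="${HEALTH_URL:-https://api.myhoneydue.com/api/health/}"
ALERT_WEBHOOK="${ALERT_WEBHOOK:-}"   # hypothetical; leave empty to only log

# Print the HTTP status code for the URL in $1
# (curl prints 000 when no response arrives).
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1" || true
}

# Succeed only when $1 is a 2xx status code.
is_healthy() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Probe once; on failure, log to stderr and post to the webhook if one is set.
run_check() {
  code="$(http_status "$HEALTH_URL")"
  if is_healthy "$code"; then
    return 0
  fi
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) health check failed: HTTP $code" >&2
  if [ -n "$ALERT_WEBHOOK" ]; then
    curl -s -o /dev/null --max-time 10 -H 'Content-Type: application/json' \
      -d "{\"text\":\"api health check failed: HTTP $code\"}" "$ALERT_WEBHOOK" || true
  fi
  return 1
}
```

Calling `run_check` at the end of the script and scheduling it every
minute gives a crude external probe, with no retries, history, or
dashboard.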
### No APM (Application Performance Monitoring)
No request-level profiling. We can't see "which endpoint has the highest
p99 latency?" or "which SQL query is hot this week?"
Options: Datadog, New Relic, Honeycomb, self-hosted Tempo+Grafana.
All are meaningful work to set up and cost $$$.
## The app's logging conventions
The Go app uses zerolog and emits structured JSON:
```json
{
"level": "info",
"time": "2026-04-24T05:29:40Z",
"caller": "/app/cmd/api/main.go:189",
"addr": ":8000",
"message": "HTTP server listening"
}
```
Log levels: `debug`, `info`, `warn`, `error`, `fatal`. Controlled by
`DEBUG=true|false` in ConfigMap (true sets level to debug, false sets
level to info).
Every request is logged with:
- Method, path, status code
- Request ID (for correlating logs across pods)
- User ID (if authenticated)
- Latency
```json
{
"level": "info",
"method": "GET",
"path": "/api/tasks/",
"status": 200,
"latency_ms": 42,
"user_id": 123,
"request_id": "a6b5db35-..."
}
```
This is queryable by grep. Better with log aggregation.
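As a concrete instance of grep-based querying, the helpers below filter
log lines read from stdin. They assume one JSON object per line with at
most a single space after the colon, which is an assumption about the
app's actual output (the examples above are pretty-printed for
readability). The function names are hypothetical.

```bash
# Keep only log lines whose "status" field is a 5xx code.
filter_5xx() {
  grep -E '"status": ?5[0-9]{2}([^0-9]|$)'
}

# Keep only log lines carrying the request ID given as $1.
# The ID is interpolated into a regex, so this assumes UUID-like IDs
# (hex digits and hyphens only).
filter_request() {
  grep -E "\"request_id\": ?\"$1\""
}

# Example:
#   kubectl logs -n honeydue -l app.kubernetes.io/name=api --tail=-1 --prefix \
#     | filter_5xx
```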
## Health endpoints
Each service exposes a health endpoint:
| Service | Endpoint | What it checks |
|---|---|---|
| api | `/api/health/` | Process alive (doesn't verify DB) |
| admin | `/` | Next.js is up |
| worker | (none public) | Internal Asynq status |
Health endpoints are **shallow** — they return 200 if the process is
running and listening. They don't try to reach Postgres/Redis/etc.
Rationale: if Postgres is briefly down, we don't want all api pods to
start failing liveness and cascade-restart.
## Dozzle (deprecated)
The Swarm era had [Dozzle](https://github.com/amir20/dozzle) — a
lightweight web UI for Docker logs. Accessible via SSH tunnel to the
manager node. Not deployed on k3s; `kubectl logs` + `stern` fills the
niche.
## Kubernetes metrics the k8s API exposes
Even without Prometheus, these are queryable:
```bash
# Resource metrics (via metrics-server)
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/honeydue/pods
# Core API (k8s state)
kubectl get --raw /api/v1/namespaces/honeydue/pods/<name>
# Kubelet metrics (per-node; proxied through the API server)
kubectl get --raw /api/v1/nodes/<node>/proxy/metrics
```
If we ever spin up Prometheus, it would scrape the kubelet endpoints
directly (along with pods' own `/metrics`) rather than the
metrics-server API.
## Future: what to add and when
| Trigger | Add |
|---|---|
| 10k+ daily users | Loki + Grafana for logs |
| 100+ req/s sustained | Prometheus + Grafana for metrics |
| Performance incidents | OpenTelemetry tracing |
| Revenue > $5k/mo | Paid monitoring (Datadog or similar) |
| First production outage | Alerting to phone/Slack |
The overall philosophy: observability is an investment that compounds.
Add it before you need it, not after. But also don't over-invest at
idle.
**Next quarter**: set up Uptime Kuma + Loki at minimum.
## Checking what's installed
```bash
# In kube-system namespace
kubectl get pods -n kube-system
# Should see: coredns, metrics-server, traefik, local-path-provisioner,
# and some k3s-related helm install jobs
# In honeydue namespace
kubectl get pods -n honeydue
# api, admin, worker, redis
# No monitoring namespace (yet)
kubectl get namespaces
# default, honeydue, kube-node-lease, kube-public, kube-system
```
## Operator cheat sheet
```bash
# Tail all logs in the namespace
kubectl logs -n honeydue --all-containers=true --tail=50 -l app.kubernetes.io/part-of=honeydue
# With stern (if installed: brew install stern)
stern -n honeydue .
# Follow a specific pod (current container; use --previous without -f
# to read the logs from before the last restart)
kubectl logs -n honeydue <pod> -f
# Pod resource usage
kubectl top pods -n honeydue --sort-by=memory
kubectl top pods -n honeydue --sort-by=cpu
# Events (cluster-wide)
kubectl get events -A --sort-by=.lastTimestamp | tail -20
# Full state dump for a pod (debugging)
kubectl describe pod -n honeydue <pod> > /tmp/pod-dump.txt
kubectl logs -n honeydue <pod> > /tmp/pod-logs.txt
```
## References
- [Kubernetes metrics-server][ms]
- [K3s metrics][k3s-metrics]
- [Loki][loki]
- [Stern (multi-pod log tail)][stern]
[ms]: https://github.com/kubernetes-sigs/metrics-server
[k3s-metrics]: https://docs.k3s.io/advanced#enabling-metrics-server
[loki]: https://grafana.com/oss/loki/
[stern]: https://github.com/stern/stern