Migrate prod deploy from Swarm to K3s; add full deployment book

Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes, all K3s servers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
  callback (was causing 'unlock of unlocked mutex' fatal after
  Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
  + allowlist fonts.googleapis.com so the marketing landing page CSS
  actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
  --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
  images runnable on x86_64 Hetzner nodes; fix array expansion under
  set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
  top-level aliases (the '${X_SECRET}' form never actually resolved);
  dozzle ports: long-form host_ip is rejected by Swarm, switched to
  short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
  worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
  (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
  12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
  real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
  and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
  hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
  without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations,
    consolidated citations
- README.md: Production Deployment section replaced with pointer to
  the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trey t
2026-04-24 07:20:21 -05:00
parent 4ec4bbbfe8
commit 6f303dbbaa
46 changed files with 9785 additions and 93 deletions
@@ -0,0 +1,305 @@
# 15 — Observability
## Summary
We have minimal observability today: `kubectl logs`, `kubectl top`,
Cloudflare Analytics, and the Neon dashboard. No Prometheus, no Grafana,
no centralized log aggregator, no APM. This is adequate for the
current traffic volume (low) but is a known gap. This chapter documents
what we *have* and what we'd add as traffic grows.
## What we have
### 1. `kubectl logs`
Every container's stdout/stderr is captured by containerd and readable
via kubectl:
```bash
# Live tail from all api pods
kubectl logs -n honeydue -l app.kubernetes.io/name=api -f --prefix
# Last 100 lines
kubectl logs -n honeydue -l app.kubernetes.io/name=api --tail=100
# Previous pod's logs (before the most recent restart)
kubectl logs -n honeydue <pod-name> --previous
# Events (not logs — k8s-level state changes)
kubectl get events -n honeydue --sort-by=.lastTimestamp
```
**Retention**: the kubelet rotates container logs once a file exceeds
10 MiB (the `containerLogMaxSize` default) and keeps only a few rotated
files per container (`containerLogMaxFiles`, default 5), on disk on the
node. Once a pod is deleted, its logs are gone.

For persistent log access we'd need aggregation (see "No log aggregation"
below).
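
Because logs disappear with the pod, one stopgap until aggregation exists is to copy them off the node before a delete or rollout. The following is a minimal sketch under stated assumptions: the function name `archive_pod_logs` and the output file layout are hypothetical and not part of the repo; the `kubectl logs` flags are standard.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): save a pod's current and previous
# container logs to local files before the pod is deleted.
archive_pod_logs() {
  local namespace="$1" pod="$2" out_dir="${3:-./pod-logs}"
  mkdir -p "$out_dir" || return 1

  # Current container instance(s); fail if even this cannot be read.
  kubectl logs -n "$namespace" "$pod" --all-containers=true --timestamps \
    > "$out_dir/$pod.log" || return 1

  # Logs from before the most recent restart. This fails when the pod has
  # never restarted, so treat a failure as "nothing to save".
  if ! kubectl logs -n "$namespace" "$pod" --all-containers=true --timestamps \
      --previous > "$out_dir/$pod.previous.log" 2>/dev/null; then
    rm -f "$out_dir/$pod.previous.log"
  fi
}
```

Usage would look like `archive_pod_logs honeydue <pod-name> /tmp/honeydue-logs`. It only captures what is still on the node, so it does not replace real aggregation.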
### 2. `kubectl top`
Pod and node resource usage via metrics-server:
```bash
kubectl top nodes
# NAME CPU(cores) CPU(%) MEMORY(bytes) MEMORY(%)
# ubuntu-8gb-nbg1-1 169m 4% 748Mi 9%
# ubuntu-8gb-nbg1-2 229m 5% 1043Mi 13%
# ubuntu-8gb-nbg1-3 124m 3% 770Mi 9%
kubectl top pods -n honeydue
```
**Retention**: In-memory only. metrics-server keeps just the most recent
sample per pod and node, so there is no historical view.
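
Until a real metrics store exists, a crude history can be built by sampling `kubectl top` on a timer. This is only a sketch: the function name `sample_pod_usage` and the CSV layout are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: append one timestamped sample of per-pod usage to a
# CSV file with the columns: timestamp,pod,cpu,memory
sample_pod_usage() {
  local namespace="$1" out_file="$2" now rows
  now="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

  # --no-headers leaves rows of: NAME CPU(cores) MEMORY(bytes)
  rows="$(kubectl top pods -n "$namespace" --no-headers)" || return 1

  printf '%s\n' "$rows" |
    awk -v ts="$now" 'NF >= 3 { print ts "," $1 "," $2 "," $3 }' >> "$out_file"
}
```

Run from cron or a loop such as `while sleep 60; do sample_pod_usage honeydue /tmp/usage.csv; done`. The file answers "has CPU been climbing?" for a short debugging session; it is not a substitute for Prometheus.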
### 3. Cloudflare Analytics
CF Dashboard → Analytics & Logs. Per-zone stats:
- Requests per second
- Bandwidth
- Cache hit ratio
- Top HTTP status codes
- Top request paths
- Bot traffic score

All aggregated, no individual request traces. Good for spotting macro
trends ("suddenly 10× more 502s today"), poor for debugging specific
issues.

Free tier retention: 7 days of aggregate stats. Pro extends this.
### 4. Neon dashboard
Neon console → project → Monitoring:
- Compute utilization (CU-hours consumed)
- Query performance (slow queries)
- Active connections
- Storage usage

Good for "is the DB busy?" and "am I close to my free tier limit?"
Not real-time.
### 5. Kubernetes events
`kubectl get events` shows cluster-level state changes: pod scheduling,
failures, image pulls, probe failures. Useful for post-mortem on
deploys.
Retention: events are stored in etcd with a default TTL of 1 hour (the
API server's `--event-ttl`), so capture them soon after an incident.
## What we don't have (the gap)
### No log aggregation
Individual pod logs are on the node. For multi-pod debugging ("show me
all api pod logs for user X") we have to stream every pod and filter:
```bash
# Query all api pods at once with stern (if installed)
stern -n honeydue api
# Or grep a specific pod; logs are JSON, so match the JSON field
kubectl logs -n honeydue <pod> | grep '"user_id":12345'
```
This works but doesn't scale. Grepping 3 pods for a specific
user_id is OK. Across 30 pods it is intractable.
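
A slightly more careful filter avoids false matches, for example user 123 matching 1234. The sketch below assumes one JSON object per log line with a numeric `user_id` field, as in the request-log example in this chapter; the function name `filter_logs_by_user` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: keep only the JSON log lines for one user_id.
# Reads log lines on stdin and writes the matching lines to stdout.
filter_logs_by_user() {
  local user_id="$1"

  # Accept digits only, so the value is safe to splice into the regex.
  case "$user_id" in
    '' | *[!0-9]*) echo "user_id must be numeric" >&2; return 2 ;;
  esac

  # Match "user_id":123 with optional whitespace after the colon, and
  # require a non-digit (or end of line) after the number.
  grep -E "\"user_id\":[[:space:]]*${user_id}([^0-9]|\$)"
}
```

For example: `kubectl logs -n honeydue -l app.kubernetes.io/name=api --tail=-1 --prefix | filter_logs_by_user 12345`. This is still a linear scan of whatever the nodes have retained, so the scaling concern stands.
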

**What we'd add**: [Loki](https://grafana.com/oss/loki/), a lightweight
log aggregator designed for k8s. ~$0 in licence cost to self-host; integrates
with Grafana for queries. Or [Better Stack](https://betterstack.com/logs)
($10/mo, hosted).
### No metrics/dashboards
`kubectl top` tells us "is this pod hot right now?" but not "has CPU
been climbing over the past hour?" We'd need:
- **Prometheus** — scrapes metrics from kubelet and pods' `/metrics`
endpoints, stores time series
- **Grafana** — queries Prometheus, renders dashboards

Both can be installed on K3s via Helm in ~10 minutes. They add ~500MB of
RAM to the cluster and a moderate operational load.

**Alternative**: [Kubernetes Dashboard](https://github.com/kubernetes/dashboard),
which is not bundled with k3s and has to be installed separately. It is a
minimal UI over the existing metrics API. Cheaper than Prometheus but less
queryable.
### No distributed tracing
"This request took 800ms — which hop was slow?" is currently unanswerable
beyond "the DB query, probably." A real trace would show:
- TLS handshake time
- Traefik routing time
- Go handler time
- Postgres query time
- Redis call time
- Each B2 request time

We'd add OpenTelemetry to the Go app and export to Jaeger/Tempo. Work
is moderate; value kicks in when we have complex request flows.
### No alerting
No PagerDuty, no Slack webhooks, no email on "api is returning 500s."
The operator finds out when users complain.
Cheapest fix: [Uptime Kuma](https://github.com/louislam/uptime-kuma)
(self-hosted) or Better Stack Uptime (free for small teams). Ping
`https://api.myhoneydue.com/api/health/` every minute; alert if it fails.
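
The check an uptime monitor performs is small enough to sketch. The function below is a hypothetical stand-in, not a replacement for Uptime Kuma: it succeeds only on HTTP 200 and treats timeouts and connection errors as failures. The name `check_health` and the 10-second timeout are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for an uptime monitor: succeed only on HTTP 200.
check_health() {
  local url="$1" code

  # -s silences progress output, -o discards the body, -w prints only the
  # status code, --max-time bounds the whole request. If curl itself fails
  # (DNS, connect, timeout), record the status as 000.
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")" || code="000"

  [ "$code" = "200" ]
}
```

For example: `check_health https://api.myhoneydue.com/api/health/ || echo "api health check failed" >&2`. To be useful it has to run somewhere outside the cluster, otherwise it goes down together with what it monitors.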
### No APM (Application Performance Monitoring)
No request-level profiling. We can't see "which endpoint has the highest
p99 latency?" or "which SQL query is hot this week?"
Options: Datadog, New Relic, Honeycomb, self-hosted Tempo+Grafana.
All are meaningful work to set up and cost $$$.
## The app's logging conventions
The Go app uses zerolog and emits structured JSON:
```json
{
"level": "info",
"time": "2026-04-24T05:29:40Z",
"caller": "/app/cmd/api/main.go:189",
"addr": ":8000",
"message": "HTTP server listening"
}
```
Log levels: `debug`, `info`, `warn`, `error`, `fatal`. Controlled by
`DEBUG=true|false` in ConfigMap (true sets level to debug, false sets
level to info).
Every request is logged with:
- Method, path, status code
- Request ID (for correlating logs across pods)
- User ID (if authenticated)
- Latency
```json
{
"level": "info",
"method": "GET",
"path": "/api/tasks/",
"status": 200,
"latency_ms": 42,
"user_id": 123,
"request_id": "a6b5db35-..."
}
```
This is queryable by grep. Better with log aggregation.
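
As an illustration of what grep-level querying can answer, the sketch below tallies requests by status code from log lines on stdin. It assumes one JSON object per line with a numeric `status` field as shown above; the function name `count_by_status` is hypothetical, and a `"status":` substring inside some other string value would be miscounted.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count request log lines per HTTP status code.
# Prints "<status> <count>" pairs, sorted by status.
count_by_status() {
  # First grep isolates the "status":NNN fragment, second keeps the number.
  grep -oE '"status":[[:space:]]*[0-9]+' |
    grep -oE '[0-9]+$' |
    sort |
    uniq -c |
    awk '{ print $2 " " $1 }'
}
```

For example: `kubectl logs -n honeydue -l app.kubernetes.io/name=api --tail=-1 | count_by_status`. Lines without a `status` field, such as startup messages, are ignored.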
## Health endpoints
Each service exposes a health endpoint:

| Service | Endpoint | What it checks |
|---|---|---|
| api | `/api/health/` | Process alive (doesn't verify DB) |
| admin | `/` | Next.js is up |
| worker | (none public) | Internal Asynq status |

Health endpoints are **shallow**: they return 200 if the process is
running and listening. They don't try to reach Postgres/Redis/etc.
Rationale: if Postgres is briefly down, we don't want all api pods to
start failing liveness and cascade-restart.
## Dozzle (deprecated)
The Swarm era had [Dozzle](https://github.com/amir20/dozzle) — a
lightweight web UI for Docker logs. Accessible via SSH tunnel to the
manager node. Not deployed on k3s; `kubectl logs` + `stern` fills the
niche.
## Kubernetes metrics the k8s API exposes
Even without Prometheus, these are queryable:
```bash
# Resource metrics (via metrics-server)
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/honeydue/pods
# Core API (k8s state)
kubectl get --raw /api/v1/namespaces/honeydue/pods/<name>
# Kubelet metrics (per-node; proxied through the API server, no tunnel needed)
kubectl get --raw /api/v1/nodes/<node>/proxy/metrics
```
If we ever spin up Prometheus, the kubelet endpoints are what it would
scrape; the metrics-server API only serves the latest sample.
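
The kubelet endpoint returns Prometheus text format, which is long but easy to filter by hand. The sketch below prints the samples of a single metric from that output; the function name `metric_samples` is hypothetical, and `process_cpu_seconds_total` in the usage line is an assumption based on the standard Go client process metrics.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the samples of one metric from Prometheus
# text-format output on stdin, skipping "# HELP" / "# TYPE" comment lines.
metric_samples() {
  awk -v name="$1" '
    /^#/ || NF == 0 { next }        # comments and blank lines
    {
      metric = $0
      sub(/[{ ].*/, "", metric)     # strip labels and value, keep the name
      if (metric == name) print
    }'
}
```

For example: `kubectl get --raw /api/v1/nodes/<node>/proxy/metrics | metric_samples process_cpu_seconds_total`. The name must match exactly, so a metric that merely shares a prefix is not printed.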
## Future: what to add and when
| Trigger | Add |
|---|---|
| 10k+ daily users | Loki + Grafana for logs |
| 100+ req/s sustained | Prometheus + Grafana for metrics |
| Performance incidents | OpenTelemetry tracing |
| Revenue > $5k/mo | Paid monitoring (Datadog or similar) |
| First production outage | Alerting to phone/Slack |

The overall philosophy: observability is an investment that compounds.
Add it before you need it, not after. But also don't over-invest at
idle.

**Next quarter**: set up Uptime Kuma + Loki at minimum.
## Checking what's installed
```bash
# In kube-system namespace
kubectl get pods -n kube-system
# Should see: coredns, metrics-server, traefik, local-path-provisioner,
# and some k3s-related helm install jobs
# In honeydue namespace
kubectl get pods -n honeydue
# api, admin, worker, redis
# No monitoring namespace (yet)
kubectl get namespaces
# default, honeydue, kube-node-lease, kube-public, kube-system
```
## Operator cheat sheet
```bash
# Tail all logs in the namespace
kubectl logs -n honeydue --all-containers=true --tail=50 -l app.kubernetes.io/part-of=honeydue
# With stern (if installed: brew install stern)
stern -n honeydue .
# Follow a specific pod (current container only; use --previous, without -f,
# for the run before the last restart)
kubectl logs -n honeydue <pod> -f
# Pod resource usage
kubectl top pods -n honeydue --sort-by=memory
kubectl top pods -n honeydue --sort-by=cpu
# Events (cluster-wide)
kubectl get events -A --sort-by=.lastTimestamp | tail -20
# Full state dump for a pod (debugging)
kubectl describe pod -n honeydue <pod> > /tmp/pod-dump.txt
kubectl logs -n honeydue <pod> > /tmp/pod-logs.txt
```
## References
- [Kubernetes metrics-server][ms]
- [K3s metrics][k3s-metrics]
- [Loki][loki]
- [Stern (multi-pod log tail)][stern]

[ms]: https://github.com/kubernetes-sigs/metrics-server
[k3s-metrics]: https://docs.k3s.io/advanced#enabling-metrics-server
[loki]: https://grafana.com/oss/loki/
[stern]: https://github.com/stern/stern