Migrate prod deploy from Swarm to K3s; add full deployment book

Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes, all server nodes)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
  callback (was causing 'unlock of unlocked mutex' fatal after
  Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
  + allowlist fonts.googleapis.com so the marketing landing page CSS
  actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
  --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
  images runnable on x86_64 Hetzner nodes; fix array expansion under
  set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
  top-level aliases (the '\${X_SECRET}' form never actually resolved);
  dozzle ports: long-form host_ip is rejected by Swarm, switched to
  short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
  worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
  (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
  12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
  real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
  and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
  hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
  without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations,
    consolidated citations
- README.md: Production Deployment section replaced with pointer to
  the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trey t
2026-04-24 07:20:21 -05:00
parent 4ec4bbbfe8
commit 6f303dbbaa
46 changed files with 9785 additions and 93 deletions
@@ -0,0 +1,360 @@
# 16 — Failure Modes
## Summary
Every component in the system has a failure mode, a user-visible
symptom, and a recovery story. This chapter enumerates them from the
edge inward. Use this as a reference when debugging or when planning
resilience improvements.
## Failure catalog
### Cloudflare-level
#### CF edge POP outage
**Symptom**: users in one geographic region see errors; other regions
fine.
**Recovery**: automatic — CF routes traffic to next-nearest POP.
**Our action**: none; wait for CF.
**Frequency**: rare, usually resolved in minutes.
#### CF global outage (rare but has happened)
**Symptom**: the whole site unreachable via CF.
**Recovery**: manual — disable CF proxy (grey cloud DNS records), users
hit origins directly.
**Our action**: in Cloudflare dashboard, flip each A record's proxy off.
Users then resolve to our node IPs directly; UFW allows :80/:443 from
anywhere so they reach Traefik. TLS breaks (origin has no cert in SSL
Flexible mode), but HTTP works.
**Frequency**: extremely rare (hours-long event happens ~annually).
#### DNS hijacking
**Symptom**: users' DNS queries return attacker IPs; all traffic
compromised.
**Mitigation**: unlikely at CF. Users on DoH/DoT are protected against
on-path tampering with their queries, but not against a compromised
zone. No mitigation at our level.
**Recovery**: requires CF incident response.
### Node-level
#### One node's NIC fails
**Symptom**: Cloudflare's retry logic routes around it within seconds.
Users see a brief spike in latency as CF learns the IP is unhealthy.
Kubernetes marks the node NotReady after `node-monitor-grace-period`
(40s), then evicts its pods once their not-ready toleration expires
(300s by default) and reschedules them onto surviving nodes.
**Recovery**:
- Automatic pod rescheduling takes ~5 min (grace period + eviction toleration)
- Dead node's Raft vote is missing; cluster stays up (2 of 3 quorum)
- Replace the node via Hetzner console when convenient
**Our action**: verify `kubectl get nodes` shows NotReady; check
Hetzner console to confirm the node's status; recreate if needed.
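A minimal sketch of the `kubectl get nodes` check as a helper that prints only the unhealthy nodes. The function name `not_ready_nodes` is hypothetical; it assumes the default column layout of `kubectl get nodes --no-headers`, where the second column is STATUS.

```bash
# Print the names of nodes whose STATUS column does not start with "Ready".
# A cordoned node ("Ready,SchedulingDisabled") still counts as Ready here.
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage:
#   not_ready_nodes    # one node name per line; no output if all are Ready
```

An empty result means every node is reporting Ready, so the problem is more likely upstream of the cluster.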
#### Two nodes fail simultaneously
**Symptom**: Raft loses quorum. Kubernetes API server rejects writes.
Existing pods keep running but nothing new can be scheduled/updated.
Single surviving node's pods continue serving traffic.
**Recovery**:
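- If a failed node comes back with its etcd data intact, quorum
restores as soon as it rejoins (there is no deadline for this)
- If failed nodes are truly gone, the cluster is broken and needs a
rebuild
**Rebuild procedure**: from the surviving node, `k3s-killall.sh`, then
bootstrap a new 3-node cluster from scratch. If the survivor's etcd
data is intact, `k3s server --cluster-reset` on that node is a less
destructive option to try first: it resets etcd to a single member so
fresh servers can join it. Data in Neon/B2 is safe; Redis state is lost.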
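The quorum arithmetic behind the one-node and two-node cases can be worked through in a few lines. This is only an illustration of the majority rule; the function names `quorum` and `fault_tolerance` are made up for the example.

```bash
# Majority needed for an n-member Raft/etcd cluster to accept writes.
quorum() { echo $(( $1 / 2 + 1 )); }

# Number of members that can fail while a majority still exists.
fault_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 3 5; do
  echo "members=$n quorum=$(quorum "$n") tolerates=$(fault_tolerance "$n")"
done
```

For the 3-node cluster described here the quorum is 2, so one failure is survivable and two are not.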
#### All three nodes fail simultaneously
**Symptom**: full site outage.
**Recovery**: rebuild the cluster from scratch.
**Frequency**: Hetzner-region-wide outage, extremely rare.
#### Node disk fills up
**Symptom**: pods get evicted ("node is disk-pressure"). New pods
can't be scheduled on that node.
**Common cause**: container log buildup (the kubelet rotates logs at
10 MB per container, but across dozens of pod churn cycles the total
adds up), a local-path PVC filling up, or the apt cache.
**Recovery**:
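```bash
# Find out what is using the space
ssh deploy@<node> "sudo df -h; sudo du -sh /var/lib/rancher/* | sort -h"
# Then clean up, for example: prune unused images and clear the apt cache
ssh deploy@<node> "sudo k3s crictl rmi --prune; sudo apt-get clean"
```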
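To catch a filling disk before the kubelet starts evicting, the `df` output can be filtered for mounts above a threshold. This is a sketch only: the helper name `mounts_over` and the 85% threshold are assumptions, and it expects POSIX-format `df -P` output on stdin.

```bash
# Read `df -P` output on stdin and print mount points whose usage is at or
# above the given percentage, followed by the usage figure.
mounts_over() {
  threshold="$1"
  awk -v t="$threshold" '
    NR > 1 {                      # skip the header row
      use = $5
      sub(/%/, "", use)           # "90%" -> "90"
      if (use + 0 >= t) print $6, $5
    }'
}

# Usage (hypothetical 85% threshold):
#   ssh deploy@<node> "df -P" | mounts_over 85
```

Mount points containing spaces would be truncated by the field split, which is acceptable for the plain paths on these nodes.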
### k3s control plane failures
#### etcd corruption on one node
**Symptom**: Raft detects divergence; that node stops serving writes.
**Recovery**: remove the node from the cluster, rejoin. Etcd snapshot
is pulled from surviving peers automatically.
#### CoreDNS down
**Symptom**: pods can't resolve Service names. New TCP connections
fail; existing connections continue (they already resolved).
Typical manifestation: "DB connection failed — no such host" errors.
**Recovery**: Kubernetes automatically restarts the CoreDNS pod. If it
keeps crashing:
```bash
kubectl logs -n kube-system deploy/coredns --previous
kubectl rollout restart deployment/coredns -n kube-system
```
**Frequency**: rare.
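One way to confirm whether in-cluster DNS is actually broken is to resolve a Service name from a throwaway pod. The sketch below assumes the `busybox:1.36` image can be pulled from the node and uses a hypothetical helper name `dns_check`; the default name it resolves is the standard Kubernetes API Service.

```bash
# Run nslookup inside a short-lived pod and delete the pod afterwards.
dns_check() {
  name="${1:-kubernetes.default.svc.cluster.local}"
  kubectl run dns-check --rm -i --restart=Never \
    --image=busybox:1.36 -- nslookup "$name"
}

# Usage:
#   dns_check                                   # resolve the API Service
#   dns_check <service>.<namespace>.svc.cluster.local
```

If this fails while CoreDNS pods show Running, the problem is more likely pod-to-pod networking than CoreDNS itself.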
#### metrics-server down
**Symptom**: `kubectl top` returns an error; HPAs can't scale.
**Recovery**: restart metrics-server pod. Non-critical; service stays up.
```bash
kubectl rollout restart deployment/metrics-server -n kube-system
```
### Networking failures
#### UFW rule accidentally blocks essential traffic
**Symptom**: one specific path stops working (e.g., api can't reach
Postgres, cross-node pod traffic fails, kubectl times out).
**Recovery**: log in via SSH (if that still works), `sudo ufw status
numbered`, `sudo ufw --force delete <N>` to remove offending rule.
**If SSH is blocked too**: Hetzner console → Rescue mode → mount disk
→ edit `/etc/ufw/user.rules`.
#### Flannel broken on one node
**Symptom**: pods on that node can't reach remote pods via overlay.
ClusterIP Services involving cross-node endpoints fail.
**Recovery**: restart k3s on that node (the kubelet and Flannel both run inside the k3s process):
```bash
ssh deploy@<node> "sudo systemctl restart k3s"
```
#### Kube-proxy broken on one node
**Symptom**: pods on that node can't reach ClusterIPs. DNS resolution
succeeds, but connections are refused or time out.
**Recovery**: same as Flannel — restart k3s on the node.
### Application-level
#### api pod OOM
**Symptom**: pod gets killed, kubelet restarts it. User's request
returns 502 briefly; subsequent requests routed to healthy pods.
Readiness probe removes the OOMing pod from Service endpoints.
**Recovery**: automatic (pod restarts). If it keeps OOMing:
- Increase `resources.limits.memory` in the deployment
- Or debug the memory leak
**Check**:
```bash
kubectl describe pod -n honeydue <pod> | grep -i oom
kubectl logs -n honeydue <pod> --previous
```
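The `grep -i oom` check can miss the reason if the describe output is long. A more targeted sketch reads the last termination reason straight from the pod status; the helper name `last_termination_reason` is hypothetical, and the `honeydue` namespace is taken from the commands above.

```bash
# Print the reason the pod's containers last terminated (e.g. OOMKilled).
# Prints nothing if no container has been restarted yet.
last_termination_reason() {
  kubectl get pod "$1" -n honeydue \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}

# Usage:
#   last_termination_reason <pod>
```

A value of `OOMKilled` confirms the memory limit was hit; `Error` points toward a crash or panic instead.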
#### api pod panics
**Symptom**: an unrecovered panic in any goroutine kills the process.
Kubelet restarts it. Similar user impact to OOM.
**Recovery**: automatic restart. But if the panic is deterministic
(same input → panic), the pod crashloops.
**Action**: read the logs, find the panic stack trace, fix the code,
deploy.
**Circuit-breaker scenario**: if all 3 api pods crashloop on startup
because of bad code, roll back to the previous revision with
`kubectl rollout undo deployment/api -n honeydue`.
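Before rolling back, it helps to confirm how many api pods are actually crashlooping. A minimal sketch, assuming the api pods carry the `app.kubernetes.io/name=api` label and that STATUS is the third column of the default `kubectl get pods` output; the helper name is made up.

```bash
# List api pods currently in CrashLoopBackOff.
crashlooping_api_pods() {
  kubectl get pods -n honeydue -l app.kubernetes.io/name=api --no-headers \
    | awk '$3 == "CrashLoopBackOff" { print $1 }'
}

# Usage:
#   crashlooping_api_pods
#   # If every api pod is listed, roll back:
#   #   kubectl rollout undo deployment/api -n honeydue
```

If only one pod is listed, the cause is more likely node-local than a bad build, and a rollback may not help.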
#### api deadlocks
**Symptom**: all 3 pods are up, readiness passes (shallow probe), but
real requests time out or hang.
**Recovery**: liveness probe is the same endpoint as readiness, so it
won't help. You'll see gradually increasing 504s at the edge. Manual
intervention:
```bash
kubectl rollout restart deployment/api -n honeydue
```
#### admin pod crashes
**Symptom**: 502 at Cloudflare when accessing admin.myhoneydue.com.
**Recovery**: k8s auto-restarts. Usually within 10-30s.
**Impact**: only admins lose access; user-facing api is unaffected.
#### worker stops processing jobs
**Symptom**: emails stop being sent, cron jobs stop firing.
**Detection**: no direct alert; need to notice via user feedback or
missing daily-digest emails. Or check Redis for queue backlog.
**Recovery**:
```bash
kubectl rollout restart deployment/worker -n honeydue
```
**If persistent**: check logs for specific error:
```bash
kubectl logs -n honeydue deploy/worker --tail=100
```
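The Redis backlog check mentioned under Detection could look roughly like this. Everything specific here is an assumption to verify against the real deployment: that Redis runs as `deploy/redis` in the `honeydue` namespace without a password, that the queue is named `default`, and that Asynq stores pending tasks in a list keyed `asynq:{<queue>}:pending`.

```bash
# Print the number of pending Asynq tasks for a queue.
asynq_pending() {
  queue="${1:-default}"
  kubectl exec -n honeydue deploy/redis -- \
    redis-cli LLEN "asynq:{${queue}}:pending"
}

# Usage:
#   asynq_pending            # pending count for the "default" queue
#   asynq_pending <queue>
```

A count that keeps growing between two checks a few minutes apart suggests the worker has stopped consuming.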
#### redis pod rescheduled to a different node
**Symptom**: Redis is scheduled to a new node, but the PVC's data lives
on the original node (local-path volumes are per-node). The new pod
either comes up with an empty data directory or stays Pending because
the volume can't be mounted there.
**Recovery**:
- If the original node is still alive but Redis pod died: pod comes
back up on same node with data intact
- If the original node is gone: Redis starts empty. Cache regenerates.
Asynq queue state is lost; pending jobs re-queue on retry, cron
fires re-schedule on next tick.
- Ensure the node label `honeydue/redis=true` is on a healthy node:
```bash
kubectl label node <new-node> honeydue/redis=true --overwrite
kubectl label node <dead-node> honeydue/redis- 2>/dev/null || true
```
### External service failures
#### Neon Postgres outage
**Symptom**: api logs fill with "failed to connect to database." All
mutating API calls fail. Reads from cache continue (via Redis) but
eventually cache expires.
**Recovery**: no action from us; Neon's problem. Users will see 5xx
until Neon is back.
**Mitigation for future**: multi-region Neon read replica, or
Postgres-level failover.
**Frequency**: Neon has had a handful of hours-scale outages since launch.
#### Backblaze B2 outage
**Symptom**: image uploads fail; image downloads fail unless cached by
CF.
**Recovery**: wait. B2 rarely goes down.
**Mitigation**: serve downloads via CF with long cache TTL — most
users won't notice brief B2 outages for read traffic.
#### Fastmail SMTP unreachable
**Symptom**: `worker` can't send transactional emails. Jobs retry per
Asynq's retry policy, eventually giving up and logging an error.
**Recovery**: automatic retry; wait for Fastmail to come back.
**Manual intervention**: re-enqueue jobs from the Asynq UI (we don't
expose it yet — future).
#### Gitea registry unreachable
**Symptom**: `kubectl rollout` stuck at "Pulling image" for new pods.
Existing pods continue running with their already-pulled images.
**Recovery**: wait for Gitea to come back.
**Mitigation**: K8s defaults to `imagePullPolicy: IfNotPresent` for any
image with a tag other than `:latest` (or with a digest), so our
SHA-tagged images aren't re-pulled on every restart if the node
already has them cached.
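The defaulting rule can be sketched as a small function, which is handy for checking an image reference before relying on the node cache. This mirrors the documented Kubernetes behavior when `imagePullPolicy` is omitted; the function name and the example registry host are hypothetical.

```bash
# Print the pull policy Kubernetes applies when imagePullPolicy is omitted.
default_pull_policy() {
  image="$1"
  # A digest reference (name@sha256:...) is immutable: no re-pull needed.
  case "$image" in
    *@*) echo "IfNotPresent"; return ;;
  esac
  # Look only at the last path component so a registry port
  # (host:3000/...) is not mistaken for a tag.
  last="${image##*/}"
  case "$last" in
    *:latest) echo "Always" ;;
    *:*)      echo "IfNotPresent" ;;
    *)        echo "Always" ;;        # no tag implies :latest
  esac
}

# Usage:
#   default_pull_policy registry.example.com:3000/honeydue/api:6f303dbbaa
```

An image that resolves to `Always` here will need the registry to be reachable on every pod start, so it would not survive a Gitea outage.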
#### Cloudflare DNS failure
See §Cloudflare-level above.
## Combined failures
### "Everything is slow"
Most often this means Neon is being hammered, either by our own load
or by a noisy neighbor on shared infrastructure.
- Check `kubectl top pods` (are we CPU-bound?)
- Check Neon console for query performance
- Check CF analytics for traffic spikes
### "Some users see 502, others don't"
Usually one node has an unhealthy Traefik or api. Cloudflare routes
some connections to it, others to healthy nodes.
- `kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik`
- `kubectl get pods -n honeydue -l app.kubernetes.io/name=api`
- Check per-pod logs
### "It worked 5 minutes ago, now it doesn't"
Something recent changed. Check:
- Recent deploys: `kubectl rollout history deployment/api -n honeydue`
- Recent cluster events: `kubectl get events -A --sort-by=.lastTimestamp | tail -30`
- External: Cloudflare Status page, Neon Status page, Backblaze Status page
## Planned outages
### Node upgrades (OS patches)
```bash
# Drain the node (evict pods, block scheduling)
kubectl drain ubuntu-8gb-nbg1-1 --ignore-daemonsets --delete-emptydir-data
# SSH in, upgrade, reboot (make sure this SSH alias points at the node you just drained)
ssh deploy@hetzner2 "sudo apt update && sudo apt upgrade -y && sudo reboot"
# Wait for node to come back
watch kubectl get nodes
# Uncordon
kubectl uncordon ubuntu-8gb-nbg1-1
```
During the drain, pods from that node reschedule to the survivors.
With current workload (api: 3 replicas, everything else: 1), rescheduling
1 api pod is fine. Traffic loss: zero.
Worker pod or Redis pod scheduled on the drained node would be
briefly unavailable during reschedule. Acceptable for planned windows.
### k3s upgrades
Same per-node drain + upgrade pattern, but with k3s-specific install:
```bash
# On the node
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.35.x+k3s1 sh -s - server
# k3s detects existing install and upgrades in place
```
Do one node at a time. Verify cluster health between each.
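The health check between nodes can be made explicit so the next upgrade only starts once the cluster has settled. A minimal sketch, assuming the api Deployment in the `honeydue` namespace is the workload that matters most; the helper name `cluster_healthy` and the 300s timeouts are arbitrary choices.

```bash
# Succeed only when every node is Ready and the api rollout has converged.
cluster_healthy() {
  kubectl wait --for=condition=Ready node --all --timeout=300s \
    && kubectl rollout status deployment/api -n honeydue --timeout=300s
}

# Usage:
#   cluster_healthy && echo "safe to upgrade the next node"
```

Both commands block until the condition is met or the timeout expires, so the helper doubles as the wait step after a reboot.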
## Disaster recovery
### Complete cluster loss
Procedure:
1. Provision 3 new Hetzner CX33 nodes (or use existing if healthy)
2. Follow bootstrap procedure (Chapter 1 §node hardening)
3. Install k3s on each (Chapter 2 §HA architecture)
4. Configure kubeconfig
5. Apply all manifests:
```bash
kubectl apply -f deploy-k3s/manifests/namespace.yaml
kubectl apply -f deploy-k3s/manifests/rbac.yaml
kubectl apply -f deploy-k3s/manifests/traefik-helmchartconfig.yaml
# Wait for Traefik to redeploy
# ... recreate secrets (see Chapter 10) ...
# ... apply rest of manifests ...
```
6. Update DNS if node IPs changed
7. Verify: `curl https://api.myhoneydue.com/api/health/`
Estimated time: **1-2 hours** if you've done it before. A lot of
context-switching between Hetzner console, SSH, kubectl, and CF.
Neon data is untouched by any of this. B2 data is untouched. Only
state that's lost: Redis cache (regenerates) and any in-flight Asynq
jobs that were mid-processing.
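After a rebuild the health endpoint usually takes a while to come up, so the verify step can be wrapped in a retry loop. This is a sketch: the function name `verify_health` and the default of 10 attempts 6 seconds apart are assumptions, not values taken from the deployment.

```bash
# Poll a URL until it returns a successful HTTP status or attempts run out.
verify_health() {
  url="$1"
  attempts="${2:-10}"
  delay="${3:-6}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl fail on HTTP errors; -m caps each request at 10 seconds.
    if curl -fsS -m 10 -o /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempts" >&2
  return 1
}

# Usage:
#   verify_health https://api.myhoneydue.com/api/health/
```

A non-zero exit makes it easy to chain into a script that stops before DNS is updated.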
## References
- [Kubernetes pod lifecycle][lifecycle]
- [K3s HA recovery][k3s-ha-recovery]
- [Hetzner rescue system][hetzner-rescue]
[lifecycle]: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
[k3s-ha-recovery]: https://docs.k3s.io/datastore/ha-embedded#new-cluster-with-embedded-db
[hetzner-rescue]: https://docs.hetzner.com/cloud/servers/getting-started/enabling-rescue-system/