Migrate prod deploy from Swarm to K3s; add full deployment book

Infrastructure: - Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers) - Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh - All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept temporarily for reference Bug fixes surfaced during migration: - Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25) - cache_service.go: remove sync.Once reassignment from inside Do() callback (was causing 'unlock of unlocked mutex' fatal after Redis Ping failure) - router.go: relax CSP from 'default-src none' to 'default-src self' + allowlist fonts.googleapis.com so the marketing landing page CSS actually loads in browsers - deploy/scripts/deploy_prod.sh: use docker buildx with --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce images runnable on x86_64 Hetzner nodes; fix array expansion under set -u - deploy/swarm-stack.prod.yml: fix secret source references to use top-level aliases (the '\${X_SECRET}' form never actually resolved); dozzle ports: long-form host_ip is rejected by Swarm, switched to short-form (bound to 0.0.0.0 with UFW-based loopback restriction); worker replicas 2 -> 1 (Asynq scheduler singleton) - deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/' (Next.js serves at root; /admin/ returned 404 and killed pods); startupProbe failureThreshold 12 -> 24 - deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable 1 -> 0 (singleton) - deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold 12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot; real startup takes up to 240s) - .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/ and admin/src/app/api/*, hiding legitimate files) New files: - deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet + hostNetwork override for k3s-bundled Traefik - deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress without TLS (CF Flexible SSL) and without middleware - deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log Documentation: - docs/deployment/ — full deployment book, 26 files, ~42k words: - Part I Overview, infrastructure, orchestrator choice (Ch 0-2) - Part II Networking, firewall, Cloudflare (Ch 3-4, 13) - Part III Security, Traefik ingress (Ch 5-6) - Part IV Services, DB, storage, secrets, registry (Ch 7-11) - Part V Data flow, deploy process, observability, failures, runbook (Ch 12, 14-17) - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20) - Appendices: glossary, kubectl cheat sheet, file locations, consolidated citations - README.md: Production Deployment section replaced with pointer to the book; Go version bumped to 1.25 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 07:20:21 -05:00
parent 4ec4bbbfe8
commit 6f303dbbaa
46 changed files with 9785 additions and 93 deletions
@@ -0,0 +1,360 @@
+# 16 — Failure Modes
+
+## Summary
+
+Every component in the system has a failure mode, a user-visible
+symptom, and a recovery story. This chapter enumerates them from the
+edge inward. Use this as a reference when debugging or when planning
+resilience improvements.
+
+## Failure catalog
+
+### Cloudflare-level
+
+#### CF edge POP outage
+
+**Symptom**: users in one geographic region see errors; other regions
+fine.
+**Recovery**: automatic — CF routes traffic to next-nearest POP.
+**Our action**: none; wait for CF.
+**Frequency**: rare, usually resolved in minutes.
+
+#### CF global outage (rare but has happened)
+
+**Symptom**: the whole site unreachable via CF.
+**Recovery**: manual — disable CF proxy (grey cloud DNS records), users
+hit origins directly.
+**Our action**: in Cloudflare dashboard, flip each A record's proxy off.
+Users then resolve to our node IPs directly; UFW allows :80/:443 from
+anywhere so they reach Traefik. TLS breaks (origin has no cert in SSL
+Flexible mode), but HTTP works.
+**Frequency**: extremely rare (hours-long event happens ~annually).
+
+#### DNS hijacking
+
+**Symptom**: users' DNS queries return attacker IPs; all traffic
+compromised.
+**Mitigation**: unlikely at CF; users who use DoH/DoT are protected.
+No mitigation at our level.
+**Recovery**: requires CF incident response.
+
+### Node-level
+
+#### One node's NIC fails
+
+**Symptom**: Cloudflare's retry logic routes around it within seconds.
+Users see a brief spike in latency as CF learns the IP is unhealthy.
+Pods on that node get rescheduled to surviving nodes by Kubernetes
+after `node-monitor-grace-period` (40s).
+**Recovery**:
+- Automatic pod rescheduling takes ~5 min (grace period + pod eviction)
+- Dead node's Raft vote is missing; cluster stays up (2 of 3 quorum)
+- Replace the node via Hetzner console when convenient
+**Our action**: verify `kubectl get nodes` shows NotReady; check
+Hetzner console to confirm the node's status; recreate if needed.
+
+#### Two nodes fail simultaneously
+
+**Symptom**: Raft loses quorum. Kubernetes API server rejects writes.
+Existing pods keep running but nothing new can be scheduled/updated.
+Single surviving node's pods continue serving traffic.
+**Recovery**:
+- If a failed node comes back within Raft's leader-election timeout
+  (seconds to minutes), quorum restores
+- If failed nodes are truly gone, the cluster is broken — need to
+  rebuild
+**Rebuild procedure**: from the surviving node, `k3s-killall.sh`, then
+bootstrap a new 3-node cluster from scratch. Data in Neon/B2 is safe;
+Redis state is lost.
+
+#### All three nodes fail simultaneously
+
+**Symptom**: full site outage.
+**Recovery**: rebuild the cluster from scratch.
+**Frequency**: Hetzner-region-wide outage, extremely rare.
+
+#### Node disk fills up
+
+**Symptom**: pods get evicted ("node is disk-pressure"). Containers
+can't be scheduled on that node.
+**Common cause**: container log buildup (containerd rotates at 10 MB
+per container but across dozens of pod churn cycles, total fills up),
+local-path PVC fills up, apt cache.
+**Recovery**:
+```bash
+ssh deploy@<node> "sudo df -h; sudo du -sh /var/lib/rancher/* | sort -h"
+# Then clean up
+```
+
+### k3s control plane failures
+
+#### etcd corruption on one node
+
+**Symptom**: Raft detects divergence; that node stops serving writes.
+**Recovery**: remove the node from the cluster, rejoin. Etcd snapshot
+is pulled from surviving peers automatically.
+
+#### CoreDNS down
+
+**Symptom**: pods can't resolve Service names. New TCP connections
+fail; existing connections continue (they already resolved).
+Typical manifestation: "DB connection failed — no such host" errors.
+**Recovery**: k3s automatically restarts CoreDNS pod. If it
+keeps crashing:
+```bash
+kubectl logs -n kube-system deploy/coredns --previous
+kubectl rollout restart deployment/coredns -n kube-system
+```
+**Frequency**: rare.
+
+#### metrics-server down
+
+**Symptom**: `kubectl top` returns an error; HPAs can't scale.
+**Recovery**: restart metrics-server pod. Non-critical; service stays up.
+```bash
+kubectl rollout restart deployment/metrics-server -n kube-system
+```
+
+### Networking failures
+
+#### UFW rule accidentally blocks essential traffic
+
+**Symptom**: Some specific thing stops working (e.g., api can't reach
+Postgres, cross-node pod traffic fails, kubectl times out).
+**Recovery**: log in via SSH (if that still works), `sudo ufw status
+numbered`, `sudo ufw --force delete <N>` to remove offending rule.
+**If SSH is blocked too**: Hetzner console → Rescue mode → mount disk
+→ edit `/etc/ufw/user.rules`.
+
+#### Flannel broken on one node
+
+**Symptom**: pods on that node can't reach remote pods via overlay.
+ClusterIP Services involving cross-node endpoints fail.
+**Recovery**: restart kubelet on that node:
+```bash
+ssh deploy@<node> "sudo systemctl restart k3s"
+```
+
+#### Kube-proxy broken on one node
+
+**Symptom**: pods on that node can't reach ClusterIPs. Symptoms look
+like DNS resolution succeeded but connection refused or timed out.
+**Recovery**: same as Flannel — restart k3s on the node.
+
+### Application-level
+
+#### api pod OOM
+
+**Symptom**: pod gets killed, kubelet restarts it. User's request
+returns 502 briefly; subsequent requests routed to healthy pods.
+Readiness probe removes the OOMing pod from Service endpoints.
+**Recovery**: automatic (pod restarts). If it keeps OOMing:
+- Increase `resources.limits.memory` in the deployment
+- Or debug the memory leak
+**Check**:
+```bash
+kubectl describe pod -n honeydue <pod> | grep -i oom
+kubectl logs -n honeydue <pod> --previous
+```
+
+#### api pod panics
+
+**Symptom**: goroutine panic kills the process. Kubelet restarts.
+Similar user impact to OOM.
+**Recovery**: automatic restart. But if the panic is deterministic
+(same input → panic), the pod crashloops.
+**Action**: read the logs, find the panic stack trace, fix the code,
+deploy.
+**Circuit-breaker scenario**: if all 3 api pods crashloop on startup
+because of bad code, kubectl rollout undo to previous revision.
+
+#### api deadlocks
+
+**Symptom**: all 3 pods are up, readiness passes (shallow probe), but
+real requests time out or hang.
+**Recovery**: liveness probe is the same endpoint as readiness, so it
+won't help. You'll see gradually increasing 504s at the edge. Manual
+intervention:
+```bash
+kubectl rollout restart deployment/api -n honeydue
+```
+
+#### admin pod crashes
+
+**Symptom**: 502 at Cloudflare when accessing admin.myhoneydue.com.
+**Recovery**: k8s auto-restarts. Usually within 10-30s.
+**Impact**: only admins lose access; user-facing api is unaffected.
+
+#### worker stops processing jobs
+
+**Symptom**: emails stop being sent, cron jobs stop firing.
+**Detection**: no direct alert; need to notice via user feedback or
+missing daily-digest emails. Or check Redis for queue backlog.
+**Recovery**:
+```bash
+kubectl rollout restart deployment/worker -n honeydue
+```
+**If persistent**: check logs for specific error:
+```bash
+kubectl logs -n honeydue deploy/worker --tail=100
+```
+
+#### redis pod dies + node is different
+
+**Symptom**: Redis schedules to a new node, but the PVC is on the
+original node (local-path is per-node). New Redis pod comes up but
+finds an empty data directory (or can't mount at all).
+**Recovery**:
+- If the original node is still alive but Redis pod died: pod comes
+  back up on same node with data intact
+- If the original node is gone: Redis starts empty. Cache regenerates.
+  Asynq queue state is lost; pending jobs re-queue on retry, cron
+  fires re-schedule on next tick.
+- Ensure the node label `honeydue/redis=true` is on a healthy node:
+```bash
+kubectl label node <new-node> honeydue/redis=true --overwrite
+kubectl label node <dead-node> honeydue/redis- 2>/dev/null || true
+```
+
+### External service failures
+
+#### Neon Postgres outage
+
+**Symptom**: api logs fill with "failed to connect to database." All
+mutating API calls fail. Reads from cache continue (via Redis) but
+eventually cache expires.
+**Recovery**: no action from us; Neon's problem. Users will see 5xx
+until Neon is back.
+**Mitigation for future**: multi-region Neon read replica, or
+Postgres-level failover.
+**Frequency**: Neon has had a handful of hours-scale outages since launch.
+
+#### Backblaze B2 outage
+
+**Symptom**: image uploads fail; image downloads fail unless cached by
+CF.
+**Recovery**: wait. B2 rarely goes down.
+**Mitigation**: serve downloads via CF with long cache TTL — most
+users won't notice brief B2 outages for read traffic.
+
+#### Fastmail SMTP unreachable
+
+**Symptom**: `worker` can't send transactional emails. Jobs retry per
+Asynq's retry policy, eventually giving up and logging an error.
+**Recovery**: automatic retry; wait for Fastmail to come back.
+**Manual intervention**: re-enqueue jobs from the Asynq UI (we don't
+expose it yet — future).
+
+#### Gitea registry unreachable
+
+**Symptom**: `kubectl rollout` stuck at "Pulling image" for new pods.
+Existing pods continue running with their already-pulled images.
+**Recovery**: wait for Gitea to come back.
+**Mitigation**: K8s has `imagePullPolicy: IfNotPresent` by default on
+SHA-tagged images, so images aren't re-pulled on every restart if
+the node already has them cached.
+
+#### Cloudflare DNS failure
+
+See §CF failures above.
+
+## Combined failures
+
+### "Everything is slow"
+
+Most often = Neon is being hammered by our load + someone else's noisy
+neighbor.
+- Check `kubectl top pods` (are we CPU-bound?)
+- Check Neon console for query performance
+- Check CF analytics for traffic spikes
+
+### "Some users see 502, others don't"
+
+Usually one node has an unhealthy Traefik or api. Cloudflare routes
+some connections to it, others to healthy nodes.
+- `kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik`
+- `kubectl get pods -n honeydue -l app.kubernetes.io/name=api`
+- Check per-pod logs
+
+### "It worked 5 minutes ago, now it doesn't"
+
+Something recent changed. Check:
+- Recent deploys: `kubectl rollout history deployment/api -n honeydue`
+- Recent manifest changes: `kubectl get events -A --sort-by=.lastTimestamp | tail -30`
+- External: Cloudflare Status page, Neon Status page, Backblaze Status page
+
+## Planned outages
+
+### Node upgrades (OS patches)
+
+```bash
+# Drain the node (evict pods, block scheduling)
+kubectl drain ubuntu-8gb-nbg1-1 --ignore-daemonsets --delete-emptydir-data
+
+# SSH in, upgrade, reboot
+ssh deploy@hetzner2 "sudo apt update && sudo apt upgrade -y && sudo reboot"
+
+# Wait for node to come back
+watch kubectl get nodes
+
+# Uncordon
+kubectl uncordon ubuntu-8gb-nbg1-1
+```
+
+During the drain, pods from that node reschedule to the survivors.
+With current workload (api: 3 replicas, everything else: 1), rescheduling
+1 api pod is fine. Traffic loss: zero.
+
+Worker pod or Redis pod scheduled on the drained node would be
+briefly unavailable during reschedule. Acceptable for planned windows.
+
+### k3s upgrades
+
+Same per-node drain + upgrade pattern, but with k3s-specific install:
+
+```bash
+# On the node
+curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.35.x+k3s1 sh -s - server
+
+# k3s detects existing install and upgrades in place
+```
+
+Do one node at a time. Verify cluster health between each.
+
+## Disaster recovery
+
+### Complete cluster loss
+
+Procedure:
+1. Provision 3 new Hetzner CX33 nodes (or use existing if healthy)
+2. Follow bootstrap procedure (Chapter 1 §node hardening)
+3. Install k3s on each (Chapter 2 §HA architecture)
+4. Configure kubeconfig
+5. Apply all manifests:
+   ```bash
+   kubectl apply -f deploy-k3s/manifests/namespace.yaml
+   kubectl apply -f deploy-k3s/manifests/rbac.yaml
+   kubectl apply -f deploy-k3s/manifests/traefik-helmchartconfig.yaml
+   # Wait for Traefik to redeploy
+   # ... recreate secrets (see Chapter 10) ...
+   # ... apply rest of manifests ...
+   ```
+6. Update DNS if node IPs changed
+7. Verify: curl https://api.myhoneydue.com/api/health/
+
+Estimated time: **1-2 hours** if you've done it before. A lot of
+context-switching between Hetzner console, SSH, kubectl, and CF.
+
+Neon data is untouched by any of this. B2 data is untouched. Only
+state that's lost: Redis cache (regenerates) and any in-flight Asynq
+jobs that were mid-processing.
+
+## References
+
+- [Kubernetes pod lifecycle][lifecycle]
+- [K3s HA recovery][k3s-ha-recovery]
+- [Hetzner rescue system][hetzner-rescue]
+
+[lifecycle]: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
+[k3s-ha-recovery]: https://docs.k3s.io/datastore/ha-embedded#new-cluster-with-embedded-db
+[hetzner-rescue]: https://docs.hetzner.com/cloud/servers/getting-started/enabling-rescue-system/