Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
temporarily for reference
Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
callback (was causing 'unlock of unlocked mutex' fatal after
Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
+ allowlist fonts.googleapis.com so the marketing landing page CSS
actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
--platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
images runnable on x86_64 Hetzner nodes; fix array expansion under
set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
top-level aliases (the '\${X_SECRET}' form never actually resolved);
dozzle ports: long-form host_ip is rejected by Swarm, switched to
short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
(Next.js serves at root; /admin/ returned 404 and killed pods);
startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
and admin/src/app/api/*, hiding legitimate files)
New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log
Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
- Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
- Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
- Part III Security, Traefik ingress (Ch 5-6)
- Part IV Services, DB, storage, secrets, registry (Ch 7-11)
- Part V Data flow, deploy process, observability, failures, runbook
(Ch 12, 14-17)
- Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
- Appendices: glossary, kubectl cheat sheet, file locations,
consolidated citations
- README.md: Production Deployment section replaced with pointer to
the book; Go version bumped to 1.25
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
16 — Failure Modes
Summary
Every component in the system has a failure mode, a user-visible symptom, and a recovery story. This chapter enumerates them from the edge inward. Use this as a reference when debugging or when planning resilience improvements.
Failure catalog
Cloudflare-level
CF edge POP outage
Symptom: users in one geographic region see errors; other regions are fine.
Recovery: automatic; CF routes traffic to the next-nearest POP.
Our action: none; wait for CF.
Frequency: rare, usually resolved in minutes.
CF global outage (rare but has happened)
Symptom: the whole site is unreachable via CF.
Recovery: manual; disable the CF proxy (grey-cloud the DNS records) so users hit the origins directly.
Our action: in the Cloudflare dashboard, switch off the proxy on each A record. Users then resolve to our node IPs directly; UFW allows :80/:443 from anywhere, so they reach Traefik. HTTPS breaks (the origin has no certificate in SSL Flexible mode), but plain HTTP works.
Frequency: rare (an hours-long event happens roughly once a year).
DNS hijacking
Symptom: users' DNS queries return attacker IPs; all traffic is compromised.
Mitigation: none at our level. A hijack at CF itself is unlikely, and users on DoH/DoT are protected against tampering between them and their resolver.
Recovery: requires CF incident response.
Node-level
One node's NIC fails
Symptom: Cloudflare's retry logic routes around it within seconds.
Users see a brief spike in latency as CF learns the IP is unhealthy.
Kubernetes marks the node NotReady after node-monitor-grace-period (40s)
and then reschedules its pods onto the surviving nodes.
Recovery:
- Automatic pod rescheduling takes ~5 min (grace period + pod eviction)
- Dead node's Raft vote is missing; cluster stays up (2 of 3 quorum)
- Replace the node via Hetzner console when convenient
Our action: verify that kubectl get nodes shows the node as NotReady; check the Hetzner console to confirm the node's status; recreate the node if needed.
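The NotReady check can also be scripted rather than read by eye. The following is a minimal sketch, not something from the repository: the function name not_ready_nodes is hypothetical, and it assumes the default column layout of kubectl get nodes --no-headers (name first, status second).

```shell
# Hypothetical helper: print the names of nodes whose STATUS starts with NotReady.
# Reads `kubectl get nodes --no-headers` output on stdin.
not_ready_nodes() {
  awk '$2 ~ /^NotReady/ { print $1 }'
}

# Usage against a live cluster:
#   kubectl get nodes --no-headers | not_ready_nodes
```

A cordoned but healthy node reports Ready,SchedulingDisabled and is deliberately not matched, so the output lists only nodes that the control plane has lost contact with.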
Two nodes fail simultaneously
Symptom: Raft loses quorum. The Kubernetes API server rejects writes. Existing pods keep running, but nothing new can be scheduled or updated. The surviving node's pods continue serving traffic.
Recovery:
- If a failed node comes back, quorum (2 of 3) is restored and Raft elects a leader again (seconds to minutes)
- If the failed nodes are truly gone, the cluster is broken and needs a rebuild
Rebuild procedure: on the surviving node, run k3s-killall.sh, then bootstrap a new 3-node cluster from scratch. Data in Neon/B2 is safe; Redis state is lost.
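The quorum arithmetic behind the one-node and two-node cases can be written out explicitly. This is an illustrative sketch only; the function names quorum_size and has_quorum are hypothetical and the code does not talk to etcd. It applies the standard Raft majority rule of floor(N/2) + 1 votes.

```shell
# quorum_size N: votes needed for a majority in an N-member Raft cluster.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum TOTAL HEALTHY: exit 0 if HEALTHY members still form a majority.
has_quorum() {
  [ "$2" -ge "$(quorum_size "$1")" ]
}

# For the 3-node cluster described here:
#   has_quorum 3 2 && echo "one node down: API still writable"
#   has_quorum 3 1 || echo "two nodes down: quorum lost"
```

With three members the majority is two, which is why losing one node is survivable and losing two is not.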
All three nodes fail simultaneously
Symptom: full site outage.
Recovery: rebuild the cluster from scratch (see Complete cluster loss under Disaster recovery).
Frequency: extremely rare; it would take a Hetzner region-wide outage.
Node disk fills up
Symptom: pods get evicted ("node is disk-pressure"). New pods can't be scheduled on that node.
Common causes: container log buildup (logs rotate at 10 MB per container, but across dozens of pod churn cycles the total adds up), a local-path PVC filling up, the apt cache.
Recovery:
ssh deploy@<node> "sudo df -h; sudo du -sh /var/lib/rancher/* | sort -h"
# Then clean up
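As a rough aid for the "find what is full" step, the sketch below filters df output for mount points above a usage threshold. The helper name full_filesystems and the 85% threshold are assumptions for illustration, and the cleanup commands in the comments are common options rather than steps taken from this runbook.

```shell
# Hypothetical helper: list mount points at or above a usage threshold.
# Reads POSIX-format `df -P` output on stdin; the argument is a percentage.
full_filesystems() {
  awk -v limit="$1" 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 >= limit) print $6 }'
}

# Usage on a node (85 is an arbitrary example threshold):
#   df -P | full_filesystems 85
# Typical cleanups once the culprit is known (assumptions, not from the runbook):
#   sudo k3s crictl rmi --prune   # remove unused container images
#   sudo apt-get clean            # drop the apt package cache
```

The filter assumes mount points without spaces, which holds for the paths discussed in this chapter.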
k3s control plane failures
etcd corruption on one node
Symptom: Raft detects divergence; that node stops serving writes. Recovery: remove the node from the cluster, rejoin. Etcd snapshot is pulled from surviving peers automatically.
CoreDNS down
Symptom: pods can't resolve Service names. New TCP connections fail; existing connections continue (they already resolved). Typical manifestation: "DB connection failed — no such host" errors. Recovery: k3s automatically restarts CoreDNS pod. If it keeps crashing:
kubectl logs -n kube-system deploy/coredns --previous
kubectl rollout restart deployment/coredns -n kube-system
Frequency: rare.
metrics-server down
Symptom: kubectl top returns an error; HPAs can't scale.
Recovery: restart metrics-server pod. Non-critical; service stays up.
kubectl rollout restart deployment/metrics-server -n kube-system
Networking failures
UFW rule accidentally blocks essential traffic
Symptom: Some specific thing stops working (e.g., api can't reach
Postgres, cross-node pod traffic fails, kubectl times out).
Recovery: log in via SSH (if that still works), run sudo ufw status numbered to find the offending rule, then sudo ufw --force delete <N> to remove it.
If SSH is blocked too: Hetzner console → Rescue mode → mount disk
→ edit /etc/ufw/user.rules.
Flannel broken on one node
Symptom: pods on that node can't reach remote pods via the overlay. ClusterIP Services with cross-node endpoints fail.
Recovery: restart k3s on that node (the k3s process embeds both the kubelet and Flannel):
ssh deploy@<node> "sudo systemctl restart k3s"
Kube-proxy broken on one node
Symptom: pods on that node can't reach ClusterIPs. It looks as if DNS resolution succeeded but the connection is then refused or times out.
Recovery: same as Flannel; restart k3s on the node.
Application-level
api pod OOM
Symptom: the pod gets killed and the kubelet restarts it. The user's request returns 502 briefly; subsequent requests are routed to healthy pods. The readiness probe removes the OOMing pod from the Service endpoints.
Recovery: automatic (pod restarts). If it keeps OOMing:
- Increase resources.limits.memory in the deployment
- Or debug the memory leak
Check:
kubectl describe pod -n honeydue <pod> | grep -i oom
kubectl logs -n honeydue <pod> --previous
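To check every pod in the namespace at once instead of describing them one by one, the last termination reason can be pulled with a jsonpath query and filtered. A minimal sketch follows; the helper name oom_killed_pods is hypothetical, and it assumes the tab-separated "pod, reasons" line format produced by the query in the usage comment.

```shell
# Hypothetical helper: print pods whose last container termination was an OOM kill.
# Reads "<pod-name><TAB><termination reasons>" lines on stdin.
oom_killed_pods() {
  awk -F '\t' '$2 ~ /OOMKilled/ { print $1 }'
}

# Usage against a live cluster:
#   kubectl get pods -n honeydue -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | oom_killed_pods
```

Pods that have never restarted produce an empty reason field and are skipped.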
api pod panics
Symptom: a goroutine panic kills the process. The kubelet restarts the pod. User impact is similar to OOM.
Recovery: automatic restart. But if the panic is deterministic (same input → panic), the pod crashloops.
Action: read the logs, find the panic stack trace, fix the code, deploy.
Circuit-breaker scenario: if all 3 api pods crashloop on startup because of bad code, run kubectl rollout undo to return to the previous revision.
api deadlocks
Symptom: all 3 pods are up and readiness passes (shallow probe), but real requests time out or hang. You'll see gradually increasing 504s at the edge.
Recovery: the liveness probe uses the same endpoint as readiness, so it won't catch this. Manual intervention:
kubectl rollout restart deployment/api -n honeydue
admin pod crashes
Symptom: 502 at Cloudflare when accessing admin.myhoneydue.com. Recovery: k8s auto-restarts. Usually within 10-30s. Impact: only admins lose access; user-facing api is unaffected.
worker stops processing jobs
Symptom: emails stop being sent, cron jobs stop firing.
Detection: no direct alert; we have to notice via user feedback or missing daily-digest emails, or check Redis for queue backlog.
Recovery:
kubectl rollout restart deployment/worker -n honeydue
If persistent: check logs for specific error:
kubectl logs -n honeydue deploy/worker --tail=100
redis pod dies + node is different
Symptom: Redis schedules to a new node, but the PVC is on the original node (local-path is per-node). New Redis pod comes up but finds an empty data directory (or can't mount at all). Recovery:
- If the original node is still alive but Redis pod died: pod comes back up on same node with data intact
- If the original node is gone: Redis starts empty. Cache regenerates. Asynq queue state is lost; pending jobs re-queue on retry, cron fires re-schedule on next tick.
- Ensure the node label honeydue/redis=true is on a healthy node:
kubectl label node <new-node> honeydue/redis=true --overwrite
kubectl label node <dead-node> honeydue/redis- 2>/dev/null || true
External service failures
Neon Postgres outage
Symptom: api logs fill with "failed to connect to database." All mutating API calls fail. Reads served from the Redis cache continue until the cache expires.
Recovery: no action from us; this is Neon's to fix. Users will see 5xx until Neon is back.
Mitigation for future: multi-region Neon read replica, or Postgres-level failover.
Frequency: Neon has had a handful of hours-scale outages since launch.
Backblaze B2 outage
Symptom: image uploads fail; image downloads fail unless cached by CF. Recovery: wait. B2 rarely goes down. Mitigation: serve downloads via CF with long cache TTL — most users won't notice brief B2 outages for read traffic.
Fastmail SMTP unreachable
Symptom: worker can't send transactional emails. Jobs retry per
Asynq's retry policy, eventually giving up and logging an error.
Recovery: automatic retry; wait for Fastmail to come back.
Manual intervention: re-enqueue jobs from the Asynq UI (we don't
expose it yet — future).
Gitea registry unreachable
Symptom: kubectl rollout stuck at "Pulling image" for new pods.
Existing pods continue running with their already-pulled images.
Recovery: wait for Gitea to come back.
Mitigation: K8s has imagePullPolicy: IfNotPresent by default on
SHA-tagged images, so images aren't re-pulled on every restart if
the node already has them cached.
Cloudflare DNS failure
See §CF failures above.
Combined failures
"Everything is slow"
Most often this means Neon is being hammered by our own load, by a noisy neighbor on shared infrastructure, or both.
- Check kubectl top pods (are we CPU-bound?)
- Check Neon console for query performance
- Check CF analytics for traffic spikes
"Some users see 502, others don't"
Usually one node has an unhealthy Traefik or api. Cloudflare routes some connections to it, others to healthy nodes.
- kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik
- kubectl get pods -n honeydue -l app.kubernetes.io/name=api
- Check per-pod logs
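Because the 502s follow whichever node Cloudflare happens to pick, it helps to see unready pods grouped by node. The sketch below is one way to do that; the helper name unready_pods_by_node is hypothetical, and it assumes the three-column input produced by the custom-columns query in the usage comment.

```shell
# Hypothetical helper: print "<node> <pod>" for pods with any container not ready.
# Reads "<pod> <node> <ready-flags>" lines on stdin, where ready-flags is a
# comma-separated list such as "true" or "true,false" ("<none>" if not started).
unready_pods_by_node() {
  awk '$3 !~ /^true(,true)*$/ { print $2, $1 }'
}

# Usage against a live cluster:
#   kubectl get pods -n honeydue --no-headers \
#     -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName,READY:.status.containerStatuses[*].ready' \
#     | unready_pods_by_node
```

The same filter works for Traefik by switching the namespace to kube-system and adding the label selector from the list above.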
"It worked 5 minutes ago, now it doesn't"
Something recent changed. Check:
- Recent deploys: kubectl rollout history deployment/api -n honeydue
- Recent manifest changes: kubectl get events -A --sort-by=.lastTimestamp | tail -30
- External: Cloudflare Status page, Neon Status page, Backblaze Status page
Planned outages
Node upgrades (OS patches)
# Drain the node (evict pods, block scheduling)
kubectl drain ubuntu-8gb-nbg1-1 --ignore-daemonsets --delete-emptydir-data
# SSH in, upgrade, reboot
ssh deploy@hetzner2 "sudo apt update && sudo apt upgrade -y && sudo reboot"
# Wait for node to come back
watch kubectl get nodes
# Uncordon
kubectl uncordon ubuntu-8gb-nbg1-1
During the drain, pods from that node reschedule to the survivors. With current workload (api: 3 replicas, everything else: 1), rescheduling 1 api pod is fine. Traffic loss: zero.
Worker pod or Redis pod scheduled on the drained node would be briefly unavailable during reschedule. Acceptable for planned windows.
k3s upgrades
Same per-node drain + upgrade pattern, but with k3s-specific install:
# On the node
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.35.x+k3s1 sh -s - server
# k3s detects existing install and upgrades in place
Do one node at a time. Verify cluster health between each.
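The "verify cluster health between each" step can be turned into a simple gate that must pass before the next node is touched. This is a sketch under assumptions: the function name all_nodes_ready is hypothetical, and node readiness is only one signal of cluster health, so checking pods and Traefik as well is still advisable.

```shell
# Hypothetical gate: succeed only if every node reports exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin. A cordoned node shows
# "Ready,SchedulingDisabled" and therefore fails the gate, as does empty input.
all_nodes_ready() {
  awk '$2 != "Ready" { bad = 1 } END { exit ((bad || NR == 0) ? 1 : 0) }'
}

# Usage between upgrades:
#   kubectl get nodes --no-headers | all_nodes_ready && echo "safe to continue"
```

Treating a cordoned node as a failure is intentional: it catches the case where the previous node was upgraded but never uncordoned.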
Disaster recovery
Complete cluster loss
Procedure:
- Provision 3 new Hetzner CX33 nodes (or use existing if healthy)
- Follow bootstrap procedure (Chapter 1 §node hardening)
- Install k3s on each (Chapter 2 §HA architecture)
- Configure kubeconfig
- Apply all manifests:
kubectl apply -f deploy-k3s/manifests/namespace.yaml
kubectl apply -f deploy-k3s/manifests/rbac.yaml
kubectl apply -f deploy-k3s/manifests/traefik-helmchartconfig.yaml
# Wait for Traefik to redeploy
# ... recreate secrets (see Chapter 10) ...
# ... apply rest of manifests ...
- Update DNS if node IPs changed
- Verify: curl https://api.myhoneydue.com/api/health/
Estimated time: 1-2 hours if you've done it before. A lot of context-switching between Hetzner console, SSH, kubectl, and CF.
Neon data is untouched by any of this. B2 data is untouched. Only state that's lost: Redis cache (regenerates) and any in-flight Asynq jobs that were mid-processing.