honeyDueAPI/docs/deployment/16-failure-modes.md
Trey t 6f303dbbaa
Migrate prod deploy from Swarm to K3s; add full deployment book
Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
  callback (was causing 'unlock of unlocked mutex' fatal after
  Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
  + allowlist fonts.googleapis.com so the marketing landing page CSS
  actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
  --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
  images runnable on x86_64 Hetzner nodes; fix array expansion under
  set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
  top-level aliases (the '\${X_SECRET}' form never actually resolved);
  dozzle ports: long-form host_ip is rejected by Swarm, switched to
  short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
  worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
  (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
  12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
  real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
  and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
  hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
  without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations,
    consolidated citations
- README.md: Production Deployment section replaced with pointer to
  the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 07:20:54 -05:00

16 — Failure Modes

Summary

Every component in the system has a failure mode, a user-visible symptom, and a recovery story. This chapter enumerates them from the edge inward. Use this as a reference when debugging or when planning resilience improvements.

Failure catalog

Cloudflare-level

CF edge POP outage

Symptom: users in one geographic region see errors; other regions are fine.
Recovery: automatic. CF routes traffic to the next-nearest POP.
Our action: none; wait for CF.
Frequency: rare, usually resolved in minutes.

CF global outage (rare but has happened)

Symptom: the whole site is unreachable via CF.
Recovery: manual. Disable the CF proxy (grey-cloud the DNS records) so users hit the origins directly.
Our action: in the Cloudflare dashboard, turn the proxy off on each A record. Users then resolve to our node IPs directly; UFW allows :80/:443 from anywhere, so they reach Traefik. HTTPS breaks (the origin has no certificate in SSL Flexible mode), but plain HTTP works.
Frequency: extremely rare (an hours-long event happens roughly once a year).
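
If the dashboard is slow or unusable during an incident, the same proxy toggle can in principle be made through the Cloudflare v4 API (which may itself be degraded in a global outage). A minimal sketch, assuming an API token with DNS edit permission in `CF_API_TOKEN`, the zone ID in `CF_ZONE_ID`, and a DNS record ID passed as an argument; all three are placeholders, not values from this repo. It only prints the command unless `RUN` is set to empty.

```shell
#!/usr/bin/env bash
# Sketch: turn off the Cloudflare proxy ("grey cloud") for one DNS record.
# CF_API_TOKEN, CF_ZONE_ID and the record ID are placeholders (assumptions).
# Dry-run by default: RUN=echo prints the command; RUN= (empty) executes it.
RUN="${RUN-echo}"

cf_unproxy() {
  local record_id="$1"
  $RUN curl -sS -X PATCH \
    "https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/dns_records/${record_id}" \
    -H "Authorization: Bearer ${CF_API_TOKEN}" \
    -H "Content-Type: application/json" \
    --data '{"proxied":false}'
}
```

Call `cf_unproxy <record-id>` once per A record. Note that the dry-run output includes the token, so avoid pasting it into shared channels.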

DNS hijacking

Symptom: users' DNS queries return attacker IPs; all traffic is compromised.
Mitigation: unlikely at CF; users who use DoH/DoT are protected. There is no mitigation at our level.
Recovery: requires CF incident response.

Node-level

One node's NIC fails

Symptom: Cloudflare's retry logic routes around the node within seconds. Users see a brief latency spike while CF learns the IP is unhealthy. Kubernetes marks the node NotReady after node-monitor-grace-period (40s) and then reschedules its pods to the surviving nodes.
Recovery:

  • Automatic pod rescheduling takes ~5 min (grace period + pod eviction)
  • Dead node's Raft vote is missing; cluster stays up (2 of 3 quorum)
  • Replace the node via Hetzner console when convenient

Our action: verify kubectl get nodes shows NotReady; check Hetzner console to confirm the node's status; recreate if needed.
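
The ~5 min figure is roughly the node-monitor grace period plus the 300 s `tolerationSeconds` that Kubernetes adds by default for the not-ready and unreachable taints. A quick arithmetic sketch, assuming the 40 s grace period quoted in this chapter and no per-pod toleration overrides in our manifests (not verified here):

```shell
#!/usr/bin/env bash
# Rough delay before pods on a dead node start being evicted.
grace_period_s=40    # node-monitor-grace-period, as quoted in this chapter
toleration_s=300     # Kubernetes default tolerationSeconds for not-ready/unreachable
eviction_delay_s=$(( grace_period_s + toleration_s ))
echo "eviction starts after ~${eviction_delay_s}s (~$(( eviction_delay_s / 60 )) min)"
```

Image pulls and startup probes on the new node add to this before the replacement pod is actually serving.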

Two nodes fail simultaneously

Symptom: Raft loses quorum. The Kubernetes API server rejects writes. Existing pods keep running, but nothing new can be scheduled or updated. The surviving node's pods continue serving traffic.
Recovery:

  • If a failed node comes back, quorum restores once it rejoins and a leader is elected (seconds to minutes)
  • If the failed nodes are truly gone, the cluster is broken and must be rebuilt

Rebuild procedure: from the surviving node, run k3s-killall.sh, then bootstrap a new 3-node cluster from scratch. Data in Neon/B2 is safe; Redis state is lost.
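
The difference between the one-node and two-node cases is plain majority arithmetic: a Raft cluster of n members needs floor(n/2) + 1 of them healthy. A small sketch of that rule (the function names are illustrative only):

```shell
#!/usr/bin/env bash
# Majority quorum for an n-member Raft/etcd cluster: floor(n/2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }

# How many members can fail while the cluster still accepts writes.
tolerated_failures() { echo $(( $1 - ($1 / 2 + 1) )); }

# has_quorum <total-members> <healthy-members>: exit 0 if writes are still possible.
has_quorum() { [ "$2" -ge "$(quorum "$1")" ]; }

echo "3 nodes: quorum=$(quorum 3), tolerated failures=$(tolerated_failures 3)"
```

For our 3 nodes this gives a quorum of 2 and one tolerated failure, which is why losing a second node stops the control plane.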

All three nodes fail simultaneously

Symptom: full site outage.
Recovery: rebuild the cluster from scratch.
Frequency: would take a Hetzner-region-wide outage; extremely rare.

Node disk fills up

Symptom: pods get evicted ("node is disk-pressure"). New pods can't be scheduled on that node.
Common causes: container log buildup (containerd rotates at 10 MB per container, but across dozens of pod churn cycles the total adds up), a local-path PVC filling up, or the apt cache.
Recovery:

ssh deploy@<node> "sudo df -h; sudo du -sh /var/lib/rancher/* | sort -h"
# Then clean up
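
What "clean up" means depends on what filled the disk. A hedged sketch of the usual candidates follows; the 85% threshold and the 200M journal cap are arbitrary example values, and `cleanup_node_disk` is only defined here, so nothing is deleted until it is called on the node.

```shell
#!/usr/bin/env bash
# Print the usage percentage (integer, no % sign) of the filesystem holding a path.
disk_used_pct() {
  df -P "${1:-/}" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Candidate cleanups for a k3s node; run on the node itself as a sudo-capable user.
cleanup_node_disk() {
  sudo k3s crictl rmi --prune          # remove images not used by any container
  sudo journalctl --vacuum-size=200M   # cap the systemd journal (example size)
  sudo apt-get clean                   # drop cached .deb packages
}

pct="$(disk_used_pct /)"
if [ "$pct" -ge 85 ]; then
  echo "root filesystem at ${pct}%: consider running cleanup_node_disk"
else
  echo "root filesystem at ${pct}%: ok"
fi
```

If the du output points at a local-path volume rather than images or logs, none of these help and the PVC's contents need attention instead.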

k3s control plane failures

etcd corruption on one node

Symptom: Raft detects divergence; that node stops serving writes.
Recovery: remove the node from the cluster, then rejoin it. The etcd snapshot is pulled from surviving peers automatically.

CoreDNS down

Symptom: pods can't resolve Service names. New TCP connections fail; existing connections continue (they already resolved). Typical manifestation: "DB connection failed — no such host" errors. Recovery: k3s automatically restarts CoreDNS pod. If it keeps crashing:

kubectl logs -n kube-system deploy/coredns --previous
kubectl rollout restart deployment/coredns -n kube-system

Frequency: rare.

metrics-server down

Symptom: kubectl top returns an error; HPAs can't scale.
Recovery: restart the metrics-server pod. Non-critical; the service stays up.

kubectl rollout restart deployment/metrics-server -n kube-system

Networking failures

UFW rule accidentally blocks essential traffic

Symptom: some specific thing stops working (e.g., api can't reach Postgres, cross-node pod traffic fails, kubectl times out).
Recovery: log in via SSH (if that still works), run sudo ufw status numbered, then sudo ufw --force delete <N> to remove the offending rule. If SSH is blocked too: Hetzner console → Rescue mode → mount disk → edit /etc/ufw/user.rules.

Flannel broken on one node

Symptom: pods on that node can't reach remote pods via the overlay. ClusterIP Services involving cross-node endpoints fail.
Recovery: restart k3s on that node (in k3s, Flannel and the kubelet run inside the k3s process):

ssh deploy@<node> "sudo systemctl restart k3s"

Kube-proxy broken on one node

Symptom: pods on that node can't reach ClusterIPs. It looks like DNS resolution succeeded but the connection was refused or timed out.
Recovery: same as Flannel; restart k3s on the node.

Application-level

api pod OOM

Symptom: the pod gets killed and the kubelet restarts it. The user's request returns 502 briefly; subsequent requests are routed to healthy pods. The readiness probe keeps the restarting pod out of the Service endpoints until it is ready again.
Recovery: automatic (pod restarts). If it keeps OOMing:

  • Increase resources.limits.memory in the deployment
  • Or debug the memory leak

Check:
kubectl describe pod -n honeydue <pod> | grep -i oom
kubectl logs -n honeydue <pod> --previous

api pod panics

Symptom: a goroutine panic kills the process. The kubelet restarts it. User impact is similar to OOM.
Recovery: automatic restart. But if the panic is deterministic (same input → panic), the pod crashloops.
Action: read the logs, find the panic stack trace, fix the code, deploy.
Circuit-breaker scenario: if all 3 api pods crashloop on startup because of bad code, kubectl rollout undo to the previous revision.
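
The rollback for the circuit-breaker scenario can be sketched as below. It assumes the previous Deployment revision is still a known-good build; the 5-minute timeout is an arbitrary example. The function prints the commands unless `RUN` is set to empty.

```shell
#!/usr/bin/env bash
# Dry-run by default: RUN=echo prints the commands; RUN= (empty) executes them.
RUN="${RUN-echo}"

rollback_api() {
  # Revert the api Deployment to its previous revision ...
  $RUN kubectl rollout undo deployment/api -n honeydue
  # ... then block until the rollback finishes or the timeout expires.
  $RUN kubectl rollout status deployment/api -n honeydue --timeout=5m
}

rollback_api
```

Use kubectl rollout history deployment/api -n honeydue first if it is unclear which revision the undo will land on.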

api deadlocks

Symptom: all 3 pods are up and readiness passes (shallow probe), but real requests time out or hang. You'll see gradually increasing 504s at the edge.
Recovery: the liveness probe uses the same endpoint as readiness, so it won't catch this. Manual intervention:

kubectl rollout restart deployment/api -n honeydue

admin pod crashes

Symptom: 502 at Cloudflare when accessing admin.myhoneydue.com.
Recovery: k8s auto-restarts the pod, usually within 10-30s.
Impact: only admins lose access; the user-facing api is unaffected.

worker stops processing jobs

Symptom: emails stop being sent, cron jobs stop firing.
Detection: no direct alert; we have to notice via user feedback or missing daily-digest emails, or check Redis for queue backlog.
Recovery:

kubectl rollout restart deployment/worker -n honeydue

If persistent: check logs for specific error:

kubectl logs -n honeydue deploy/worker --tail=100
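
To check the queue backlog directly: recent Asynq versions keep each queue's pending tasks in a Redis list keyed `asynq:{<queue>}:pending`. The sketch below assumes Redis runs as `deploy/redis` in the `honeydue` namespace without a password and that jobs go to Asynq's `default` queue; none of that is confirmed by this chapter, so adjust it to the actual manifests.

```shell
#!/usr/bin/env bash
# Print the number of pending tasks in one Asynq queue.
# deploy/redis and the "default" queue name are assumptions, not taken from the manifests.
asynq_pending() {
  local queue="${1:-default}"
  kubectl exec -n honeydue deploy/redis -- \
    redis-cli LLEN "asynq:{${queue}}:pending"
}
```

Run `asynq_pending` a few times a minute apart. A count that keeps growing while the worker pod is Running suggests the worker is stuck rather than idle.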

redis pod dies and reschedules to a different node

Symptom: Redis schedules to a new node, but the PVC is on the original node (local-path is per-node). The new Redis pod comes up but finds an empty data directory (or can't mount at all).
Recovery:

  • If the original node is still alive but Redis pod died: pod comes back up on same node with data intact
  • If the original node is gone: Redis starts empty. Cache regenerates. Asynq queue state is lost; pending jobs re-queue on retry, cron fires re-schedule on next tick.
  • Ensure the node label honeydue/redis=true is on a healthy node:
kubectl label node <new-node> honeydue/redis=true --overwrite
kubectl label node <dead-node> honeydue/redis- 2>/dev/null || true

External service failures

Neon Postgres outage

Symptom: api logs fill with "failed to connect to database." All mutating API calls fail. Reads served from the Redis cache continue until the cached entries expire.
Recovery: no action from us; it's Neon's problem. Users will see 5xx until Neon is back.
Mitigation for future: multi-region Neon read replica, or Postgres-level failover.
Frequency: Neon has had a handful of hours-scale outages since launch.

Backblaze B2 outage

Symptom: image uploads fail; image downloads fail unless cached by CF.
Recovery: wait. B2 rarely goes down.
Mitigation: serve downloads via CF with a long cache TTL, so most users won't notice brief B2 outages for read traffic.

Fastmail SMTP unreachable

Symptom: worker can't send transactional emails. Jobs retry per Asynq's retry policy, eventually giving up and logging an error.
Recovery: automatic retry; wait for Fastmail to come back.
Manual intervention: re-enqueue jobs from the Asynq UI (we don't expose it yet; future work).

Gitea registry unreachable

Symptom: kubectl rollout is stuck at "Pulling image" for new pods (they eventually show ImagePullBackOff). Existing pods continue running with their already-pulled images.
Recovery: wait for Gitea to come back.
Mitigation: Kubernetes defaults imagePullPolicy to IfNotPresent for any image tag other than :latest, so our SHA-tagged images aren't re-pulled on every restart if the node already has them cached.

Cloudflare DNS failure

See §Cloudflare-level above.

Combined failures

"Everything is slow"

Most often this means Neon is being hammered by our own load, a noisy neighbor on Neon's shared infrastructure, or both.

  • Check kubectl top pods (are we CPU-bound?)
  • Check Neon console for query performance
  • Check CF analytics for traffic spikes

"Some users see 502, others don't"

Usually one node has an unhealthy Traefik or api. Cloudflare routes some connections to it, others to healthy nodes.

  • kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik
  • kubectl get pods -n honeydue -l app.kubernetes.io/name=api
  • Check per-pod logs

"It worked 5 minutes ago, now it doesn't"

Something recent changed. Check:

  • Recent deploys: kubectl rollout history deployment/api -n honeydue
  • Recent cluster events (these often reveal a changed manifest or a failing pod): kubectl get events -A --sort-by=.lastTimestamp | tail -30
  • External: Cloudflare Status page, Neon Status page, Backblaze Status page

Planned outages

Node upgrades (OS patches)

# Drain the node (evict pods, block scheduling)
kubectl drain ubuntu-8gb-nbg1-1 --ignore-daemonsets --delete-emptydir-data

# SSH in to the same node you just drained, upgrade, reboot
ssh deploy@hetzner2 "sudo apt update && sudo apt upgrade -y && sudo reboot"

# Wait for node to come back
watch kubectl get nodes

# Uncordon
kubectl uncordon ubuntu-8gb-nbg1-1

During the drain, pods from that node reschedule to the survivors. With current workload (api: 3 replicas, everything else: 1), rescheduling 1 api pod is fine. Traffic loss: zero.

Worker pod or Redis pod scheduled on the drained node would be briefly unavailable during reschedule. Acceptable for planned windows.
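
The manual steps above can be wrapped in one function so the drained node and the SSH target are always passed together. This is a sketch, not the project's tooling: the function name, the 60 s pause, and the 10-minute timeout are illustrative choices. It prints the commands unless `RUN` is set to empty.

```shell
#!/usr/bin/env bash
# Dry-run by default: RUN=echo prints the commands; RUN= (empty) executes them.
RUN="${RUN-echo}"

# patch_node <k8s-node-name> <ssh-host>
patch_node() {
  local node="$1" host="$2"
  $RUN kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  # The SSH session usually drops when the reboot starts, so tolerate a non-zero exit.
  $RUN ssh "deploy@${host}" "sudo apt update && sudo apt upgrade -y && sudo reboot" || true
  # Give the node time to actually go down; otherwise the wait below can pass
  # against the stale Ready status from before the reboot.
  $RUN sleep 60
  $RUN kubectl wait --for=condition=Ready "node/${node}" --timeout=10m
  $RUN kubectl uncordon "$node"
}
```

Usage: `patch_node ubuntu-8gb-nbg1-1 <ssh-host-for-that-node>`, one node at a time.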

k3s upgrades

Same per-node drain + upgrade pattern, but with k3s-specific install:

# On the node
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.35.x+k3s1 sh -s - server

# k3s detects existing install and upgrades in place

Do one node at a time. Verify cluster health between each.
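
"Verify cluster health" can be as simple as refusing to continue until every node reports Ready. A minimal sketch (the function name is illustrative; it checks only the node STATUS column, not pod health):

```shell
#!/usr/bin/env bash
# Succeeds only if every node reports STATUS exactly "Ready".
# A cordoned node shows "Ready,SchedulingDisabled" and is reported as not ready.
all_nodes_ready() {
  local nodes not_ready
  nodes="$(kubectl get nodes --no-headers)" || return 1
  [ -n "$nodes" ] || return 1
  not_ready="$(printf '%s\n' "$nodes" | awk '$2 != "Ready" { print $1 }')"
  if [ -n "$not_ready" ]; then
    echo "not ready: $not_ready" >&2
    return 1
  fi
}
```

Run `all_nodes_ready && echo "safe to continue"` after uncordoning each node and before starting on the next.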

Disaster recovery

Complete cluster loss

Procedure:

  1. Provision 3 new Hetzner CX33 nodes (or use existing if healthy)
  2. Follow bootstrap procedure (Chapter 1 §node hardening)
  3. Install k3s on each (Chapter 2 §HA architecture)
  4. Configure kubeconfig
  5. Apply all manifests:
    kubectl apply -f deploy-k3s/manifests/namespace.yaml
    kubectl apply -f deploy-k3s/manifests/rbac.yaml
    kubectl apply -f deploy-k3s/manifests/traefik-helmchartconfig.yaml
    # Wait for Traefik to redeploy
    # ... recreate secrets (see Chapter 10) ...
    # ... apply rest of manifests ...
    
  6. Update DNS if node IPs changed
  7. Verify: curl https://api.myhoneydue.com/api/health/

Estimated time: 1-2 hours if you've done it before. Expect a lot of context-switching between the Hetzner console, SSH, kubectl, and CF.

Neon data is untouched by any of this. B2 data is untouched. The only state lost is the Redis cache (which regenerates) and any Asynq jobs that were mid-processing.

References