# 16 — Failure Modes ## Summary Every component in the system has a failure mode, a user-visible symptom, and a recovery story. This chapter enumerates them from the edge inward. Use this as a reference when debugging or when planning resilience improvements. ## Failure catalog ### Cloudflare-level #### CF edge POP outage **Symptom**: users in one geographic region see errors; other regions fine. **Recovery**: automatic — CF routes traffic to next-nearest POP. **Our action**: none; wait for CF. **Frequency**: rare, usually resolved in minutes. #### CF global outage (rare but has happened) **Symptom**: the whole site unreachable via CF. **Recovery**: manual — disable CF proxy (grey cloud DNS records), users hit origins directly. **Our action**: in Cloudflare dashboard, flip each A record's proxy off. Users then resolve to our node IPs directly; UFW allows :80/:443 from anywhere so they reach Traefik. TLS breaks (origin has no cert in SSL Flexible mode), but HTTP works. **Frequency**: extremely rare (hours-long event happens ~annually). #### DNS hijacking **Symptom**: users' DNS queries return attacker IPs; all traffic compromised. **Mitigation**: unlikely at CF; users who use DoH/DoT are protected. No mitigation at our level. **Recovery**: requires CF incident response. ### Node-level #### One node's NIC fails **Symptom**: Cloudflare's retry logic routes around it within seconds. Users see a brief spike in latency as CF learns the IP is unhealthy. Pods on that node get rescheduled to surviving nodes by Kubernetes after `node-monitor-grace-period` (40s). **Recovery**: - Automatic pod rescheduling takes ~5 min (grace period + pod eviction) - Dead node's Raft vote is missing; cluster stays up (2 of 3 quorum) - Replace the node via Hetzner console when convenient **Our action**: verify `kubectl get nodes` shows NotReady; check Hetzner console to confirm the node's status; recreate if needed. #### Two nodes fail simultaneously **Symptom**: Raft loses quorum. Kubernetes API server rejects writes. Existing pods keep running but nothing new can be scheduled/updated. Single surviving node's pods continue serving traffic. **Recovery**: - If a failed node comes back within Raft's leader-election timeout (seconds to minutes), quorum restores - If failed nodes are truly gone, the cluster is broken — need to rebuild **Rebuild procedure**: from the surviving node, `k3s-killall.sh`, then bootstrap a new 3-node cluster from scratch. Data in Neon/B2 is safe; Redis state is lost. #### All three nodes fail simultaneously **Symptom**: full site outage. **Recovery**: rebuild the cluster from scratch. **Frequency**: Hetzner-region-wide outage, extremely rare. #### Node disk fills up **Symptom**: pods get evicted ("node is disk-pressure"). Containers can't be scheduled on that node. **Common cause**: container log buildup (containerd rotates at 10 MB per container but across dozens of pod churn cycles, total fills up), local-path PVC fills up, apt cache. **Recovery**: ```bash ssh deploy@ "sudo df -h; sudo du -sh /var/lib/rancher/* | sort -h" # Then clean up ``` ### k3s control plane failures #### etcd corruption on one node **Symptom**: Raft detects divergence; that node stops serving writes. **Recovery**: remove the node from the cluster, rejoin. Etcd snapshot is pulled from surviving peers automatically. #### CoreDNS down **Symptom**: pods can't resolve Service names. New TCP connections fail; existing connections continue (they already resolved). Typical manifestation: "DB connection failed — no such host" errors. **Recovery**: k3s automatically restarts CoreDNS pod. If it keeps crashing: ```bash kubectl logs -n kube-system deploy/coredns --previous kubectl rollout restart deployment/coredns -n kube-system ``` **Frequency**: rare. #### metrics-server down **Symptom**: `kubectl top` returns an error; HPAs can't scale. **Recovery**: restart metrics-server pod. Non-critical; service stays up. ```bash kubectl rollout restart deployment/metrics-server -n kube-system ``` #### vmagent can't reach obs.88oakapps.com **Symptom**: dashboards stop updating; vmagent logs show 401 / TLS / network errors against `obs.88oakapps.com`. App is unaffected. **Recovery**: vmagent buffers up to 512 MB locally and replays on reconnect, so brief outages self-heal. If sustained: ```bash # Is the obs endpoint up? curl -s -o /dev/null -w "%{http_code}\n" https://obs.88oakapps.com/health \ -H "Authorization: Bearer $(grep ^OBS_INGEST_TOKEN= deploy/prod.env | cut -d= -f2)" # 200 = ingest endpoint healthy. # Inspect vmagent's failure metric kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://127.0.0.1:8429/metrics \ | grep -E "remotewrite_(packets|samples)_dropped|persistentqueue_blocks_dropped" # Restart vmagent (forces config reload + drains queue) kubectl -n honeydue rollout restart deploy/vmagent ``` **If 88oakappsUpdate itself is down** (PostHog runs there too): SSH and check `sudo docker compose -f /opt/honeydue-obs/docker-compose.yml ps`. **Non-critical**: nothing app-facing depends on the obs stack. #### Grafana dashboard shows "no data" **Possible causes, in order of frequency**: 1. New histogram name — query targets a metric the api hasn't emitted yet. Check `kubectl exec deploy/vmagent -- wget -qO- http://api:8000/metrics` for the metric name. 2. vmagent isn't scraping (see above). 3. Time range is before the obs stack came up (2026-04-25). Adjust the dashboard time picker. 4. Cardinality blowup — VM rejected high-label-count series. Check `vm_rows_inserted_total` vs `vm_rows_dropped_total` on the obs box. ### Networking failures #### UFW rule accidentally blocks essential traffic **Symptom**: Some specific thing stops working (e.g., api can't reach Postgres, cross-node pod traffic fails, kubectl times out). **Recovery**: log in via SSH (if that still works), `sudo ufw status numbered`, `sudo ufw --force delete ` to remove offending rule. **If SSH is blocked too**: Hetzner console → Rescue mode → mount disk → edit `/etc/ufw/user.rules`. #### Flannel broken on one node **Symptom**: pods on that node can't reach remote pods via overlay. ClusterIP Services involving cross-node endpoints fail. **Recovery**: restart kubelet on that node: ```bash ssh deploy@ "sudo systemctl restart k3s" ``` #### Kube-proxy broken on one node **Symptom**: pods on that node can't reach ClusterIPs. Symptoms look like DNS resolution succeeded but connection refused or timed out. **Recovery**: same as Flannel — restart k3s on the node. ### Application-level #### api pod OOM **Symptom**: pod gets killed, kubelet restarts it. User's request returns 502 briefly; subsequent requests routed to healthy pods. Readiness probe removes the OOMing pod from Service endpoints. **Recovery**: automatic (pod restarts). If it keeps OOMing: - Increase `resources.limits.memory` in the deployment - Or debug the memory leak **Check**: ```bash kubectl describe pod -n honeydue | grep -i oom kubectl logs -n honeydue --previous ``` #### api pod panics **Symptom**: goroutine panic kills the process. Kubelet restarts. Similar user impact to OOM. **Recovery**: automatic restart. But if the panic is deterministic (same input → panic), the pod crashloops. **Action**: read the logs, find the panic stack trace, fix the code, deploy. **Circuit-breaker scenario**: if all 3 api pods crashloop on startup because of bad code, kubectl rollout undo to previous revision. #### api deadlocks **Symptom**: all 3 pods are up, readiness passes (shallow probe), but real requests time out or hang. **Recovery**: liveness probe is the same endpoint as readiness, so it won't help. You'll see gradually increasing 504s at the edge. Manual intervention: ```bash kubectl rollout restart deployment/api -n honeydue ``` #### admin pod crashes **Symptom**: 502 at Cloudflare when accessing admin.myhoneydue.com. **Recovery**: k8s auto-restarts. Usually within 10-30s. **Impact**: only admins lose access; user-facing api is unaffected. #### worker stops processing jobs **Symptom**: emails stop being sent, cron jobs stop firing. **Detection**: no direct alert; need to notice via user feedback or missing daily-digest emails. Or check Redis for queue backlog. **Recovery**: ```bash kubectl rollout restart deployment/worker -n honeydue ``` **If persistent**: check logs for specific error: ```bash kubectl logs -n honeydue deploy/worker --tail=100 ``` #### redis pod dies + node is different **Symptom**: Redis schedules to a new node, but the PVC is on the original node (local-path is per-node). New Redis pod comes up but finds an empty data directory (or can't mount at all). **Recovery**: - If the original node is still alive but Redis pod died: pod comes back up on same node with data intact - If the original node is gone: Redis starts empty. Cache regenerates. Asynq queue state is lost; pending jobs re-queue on retry, cron fires re-schedule on next tick. - Auth caches (token + residence-IDs) regenerate on first user request — first request per user pays full DB lookup, then warm again. Visible as a brief latency spike in the Grafana RED dashboard, not a functional failure. - Ensure the node label `honeydue/redis=true` is on a healthy node: ```bash kubectl label node honeydue/redis=true --overwrite kubectl label node honeydue/redis- 2>/dev/null || true ``` #### Stale residence-IDs cache (data freshness bug) **Symptom**: a user accepts a share-code or has a residence removed, but `/api/tasks/`, `/api/documents/`, `/api/contractors/`, or `/api/residences/summary/` continues to show the old membership for up to 5 minutes. **Cause**: a residence-membership-mutating code path landed without calling `cache.InvalidateResidenceIDsForUsers(...)`. The cache TTL is 5 min so the issue self-heals, but it's user-visible. **Recovery (immediate)**: flush the affected user's cache key manually. See [Chapter 17 §residence-IDs cache invalidation](./17-runbook.md). **Prevention (permanent)**: every mutation that changes `residence_residence.owner_id`, `residence_residence_users.user_id`, or deletes a residence MUST invalidate. Existing call sites for reference: `CreateResidence` (owner), `DeleteResidence` (all members), `JoinWithCode` (joining user), `RemoveUser` (removed user). The pattern lives in `internal/services/residence_id_cache.go`. #### Redis at maxmemory limit **Symptom**: Redis logs `OOM command not allowed when used memory > 'maxmemory'`. Should be rare — current production usage is ~2.4 MB against a 256 MB limit and the policy is `allkeys-lru` (cache writes evict cold keys instead of erroring). **Recovery**: confirm the policy is still `allkeys-lru`: ```bash kubectl -n honeydue exec deploy/redis -- redis-cli CONFIG GET maxmemory-policy ``` If it's somehow `noeviction`, set it live: ```bash kubectl -n honeydue exec deploy/redis -- redis-cli CONFIG SET maxmemory-policy allkeys-lru ``` And re-apply the manifest at `deploy-k3s/manifests/redis/deployment.yaml` so the change survives a pod restart. If memory usage is genuinely climbing toward the cap, check for runaway keys without TTLs: ```bash kubectl -n honeydue exec deploy/redis -- redis-cli --bigkeys ``` ### External service failures #### Neon Postgres outage **Symptom**: api logs fill with "failed to connect to database." All mutating API calls fail. Reads from cache continue (via Redis) but eventually cache expires. **Recovery**: no action from us; Neon's problem. Users will see 5xx until Neon is back. **Mitigation for future**: multi-region Neon read replica, or Postgres-level failover. **Frequency**: Neon has had a handful of hours-scale outages since launch. #### Neon pooler endpoint unreachable but direct endpoint up **Symptom**: `dial tcp ep-floral-truth-amttbc5a-pooler.c-5...: i/o timeout` in api logs but the direct compute endpoint is reachable. Rare — Neon's pooler runs in their infra alongside compute — but possible during pooler maintenance. **Recovery (emergency)**: switch `DB_HOST` in `config.yaml` from the `-pooler` to the direct hostname (drop the `-pooler` segment), re-apply ConfigMap, rolling-restart api and worker: ```bash # Edit deploy-k3s/config.yaml: database.host: ep-floral-truth-amttbc5a.c-5... # Then: KUBECONFIG=~/.kube/honeydue.yaml bash deploy-k3s/scripts/03-deploy.sh --skip-build ``` Cold-handshake latency goes back up (~440ms first hit) but the API keeps serving. Switch back when the pooler recovers. #### Migrate Job fails during deploy **Symptom**: `03-deploy.sh` aborts at the migrations step: ``` [deploy][error] migrations did not complete cleanly; aborting deploy ``` api/worker pods are NOT updated — they keep running the previous revision. This is the intentional fail-fast. **Recovery**: ```bash # 1. See the failure kubectl -n honeydue logs job/honeydue-migrate --tail=200 # 2. Common cause: a SQL error in the migration file. Fix the file # locally, commit, retry the deploy. The Job is idempotent — # successful prior versions stay applied; only the failed file # re-runs. git add migrations/000NNN_*.sql git commit -m "Fix migration NNN" git push gitea master bash deploy-k3s/scripts/03-deploy.sh # 3. Other cause: Neon down or auth changed. Test direct connection: DB_PASS=$(kubectl -n honeydue get secret honeydue-secrets \ -o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d) docker run --rm -e PGPASSWORD="$DB_PASS" postgres:17-alpine \ psql "host=ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \ user=neondb_owner dbname=honeyDue sslmode=require" -c "SELECT 1;" ``` **Why no automatic retry**: `backoffLimit: 0` on the Job is deliberate. A failing migration almost never gets unstuck by retrying — needs an operator to look. See [Chapter 17 §27](./17-runbook.md) for recovery playbook. #### api refuses to start: "Schema precondition failed" **Symptom**: api pods log `Schema precondition failed` and exit immediately after DB connect. **Cause**: `goose_db_version` table is missing or its latest row has `is_applied=false`. Means the migrate Job either was never run or ran and rolled back. **Recovery**: run the migrate Job manually (see [Chapter 17 §26](./17-runbook.md)). After it completes successfully, delete the failing api pods so they restart with a fresh schema check: ```bash kubectl -n honeydue rollout restart deploy/api ``` #### Backblaze B2 outage **Symptom**: image uploads fail; image downloads fail unless cached by CF. **Recovery**: wait. B2 rarely goes down. **Mitigation**: serve downloads via CF with long cache TTL — most users won't notice brief B2 outages for read traffic. #### Fastmail SMTP unreachable **Symptom**: `worker` can't send transactional emails. Jobs retry per Asynq's retry policy, eventually giving up and logging an error. **Recovery**: automatic retry; wait for Fastmail to come back. **Manual intervention**: re-enqueue jobs from the Asynq UI (we don't expose it yet — future). #### Gitea registry unreachable **Symptom**: `kubectl rollout` stuck at "Pulling image" for new pods. Existing pods continue running with their already-pulled images. **Recovery**: wait for Gitea to come back. **Mitigation**: K8s has `imagePullPolicy: IfNotPresent` by default on SHA-tagged images, so images aren't re-pulled on every restart if the node already has them cached. #### Cloudflare DNS failure See §CF failures above. ## Combined failures ### "Everything is slow" Most often = Neon is being hammered by our load + someone else's noisy neighbor. - Check `kubectl top pods` (are we CPU-bound?) - Check Neon console for query performance - Check CF analytics for traffic spikes ### "Some users see 502, others don't" Usually one node has an unhealthy Traefik or api. Cloudflare routes some connections to it, others to healthy nodes. - `kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik` - `kubectl get pods -n honeydue -l app.kubernetes.io/name=api` - Check per-pod logs ### "It worked 5 minutes ago, now it doesn't" Something recent changed. Check: - Recent deploys: `kubectl rollout history deployment/api -n honeydue` - Recent manifest changes: `kubectl get events -A --sort-by=.lastTimestamp | tail -30` - External: Cloudflare Status page, Neon Status page, Backblaze Status page ## Planned outages ### Node upgrades (OS patches) ```bash # Drain the node (evict pods, block scheduling) kubectl drain ubuntu-8gb-nbg1-1 --ignore-daemonsets --delete-emptydir-data # SSH in, upgrade, reboot ssh deploy@hetzner2 "sudo apt update && sudo apt upgrade -y && sudo reboot" # Wait for node to come back watch kubectl get nodes # Uncordon kubectl uncordon ubuntu-8gb-nbg1-1 ``` During the drain, pods from that node reschedule to the survivors. With current workload (api: 3 replicas, everything else: 1), rescheduling 1 api pod is fine. Traffic loss: zero. Worker pod or Redis pod scheduled on the drained node would be briefly unavailable during reschedule. Acceptable for planned windows. ### k3s upgrades Same per-node drain + upgrade pattern, but with k3s-specific install: ```bash # On the node curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.35.x+k3s1 sh -s - server # k3s detects existing install and upgrades in place ``` Do one node at a time. Verify cluster health between each. ## Disaster recovery ### Complete cluster loss Procedure: 1. Provision 3 new Hetzner CX33 nodes (or use existing if healthy) 2. Follow bootstrap procedure (Chapter 1 §node hardening) 3. Install k3s on each (Chapter 2 §HA architecture) 4. Configure kubeconfig 5. Apply all manifests: ```bash kubectl apply -f deploy-k3s/manifests/namespace.yaml kubectl apply -f deploy-k3s/manifests/rbac.yaml kubectl apply -f deploy-k3s/manifests/traefik-helmchartconfig.yaml # Wait for Traefik to redeploy # ... recreate secrets (see Chapter 10) ... # ... apply rest of manifests ... ``` 6. Update DNS if node IPs changed 7. Verify: curl https://api.myhoneydue.com/api/health/ Estimated time: **1-2 hours** if you've done it before. A lot of context-switching between Hetzner console, SSH, kubectl, and CF. Neon data is untouched by any of this. B2 data is untouched. Only state that's lost: Redis cache (regenerates) and any in-flight Asynq jobs that were mid-processing. ## References - [Kubernetes pod lifecycle][lifecycle] - [K3s HA recovery][k3s-ha-recovery] - [Hetzner rescue system][hetzner-rescue] [lifecycle]: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/ [k3s-ha-recovery]: https://docs.k3s.io/datastore/ha-embedded#new-cluster-with-embedded-db [hetzner-rescue]: https://docs.hetzner.com/cloud/servers/getting-started/enabling-rescue-system/