honeyDueAPI/docs/deployment/16-failure-modes.md
Trey t c9ac273dbd
docs: capture latency optimizations + new caching invariants
Shipping commit 88fb175 changed the trace shape and added a new caching
layer with required invalidation rules. Updating the operator-facing
docs so they match the running system.

ch08 (database):
- DB_HOST is the -pooler Neon endpoint, not direct compute
- Connection pool: MaxIdleConns 20 (was 10), MaxLifetime 30m (was 10m),
  MaxIdleTime 0 (never close idle)
- New "Pool warm-up at boot" section documenting the 20-parallel-ping
  warm-up in database.Connect
- Replaced the "Neon regions" section: explicit RTT numbers, the
  optimization stack that minimizes round-trips, when this still matters

ch15 (observability):
- Replaced the 2,473ms/5-span sample trace with the new 229ms/2-span
  post-optimization trace; kept the old one underneath for diff context

ch16 (failure modes):
- Added: stale residence-IDs cache (data freshness bug + recovery)
- Added: Redis at maxmemory limit (verify allkeys-lru policy)
- Added: Neon pooler unreachable but direct endpoint up — emergency
  switchover procedure

ch17 (runbook):
- §23 Invalidate residence-IDs cache for a user (DEL key + grep for
  missing invalidation in new code)
- §24 Verify DB pool warm-up is working (log pattern + impact test)
- §25 Switch DB host between pooler and direct endpoints

observability-plan.md status flipped from "plan only" to shipped
with the latency-cut summary.

README links to the new ch08 latency section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 17:36:36 -05:00

16 — Failure Modes

Summary

Every component in the system has a failure mode, a user-visible symptom, and a recovery story. This chapter enumerates them from the edge inward. Use this as a reference when debugging or when planning resilience improvements.

Failure catalog

Cloudflare-level

CF edge POP outage

Symptom: users in one geographic region see errors; other regions fine. Recovery: automatic — CF routes traffic to next-nearest POP. Our action: none; wait for CF. Frequency: rare, usually resolved in minutes.

CF global outage (rare but has happened)

Symptom: the whole site unreachable via CF. Recovery: manual — disable CF proxy (grey cloud DNS records), users hit origins directly. Our action: in Cloudflare dashboard, flip each A record's proxy off. Users then resolve to our node IPs directly; UFW allows :80/:443 from anywhere so they reach Traefik. TLS breaks (origin has no cert in SSL Flexible mode), but HTTP works. Frequency: extremely rare (hours-long event happens ~annually).

DNS hijacking

Symptom: users' DNS queries return attacker IPs; all traffic compromised. Mitigation: unlikely at CF; users who use DoH/DoT are protected. No mitigation at our level. Recovery: requires CF incident response.

Node-level

One node's NIC fails

Symptom: Cloudflare's retry logic routes around it within seconds. Users see a brief spike in latency as CF learns the IP is unhealthy. Pods on that node get rescheduled to surviving nodes by Kubernetes after node-monitor-grace-period (40s). Recovery:

  • Automatic pod rescheduling takes ~5 min (grace period + pod eviction)
  • Dead node's Raft vote is missing; cluster stays up (2 of 3 quorum)
  • Replace the node via Hetzner console when convenient

Our action: verify kubectl get nodes shows NotReady; check the Hetzner console to confirm the node's status; recreate the node if needed.

Two nodes fail simultaneously

Symptom: Raft loses quorum. Kubernetes API server rejects writes. Existing pods keep running but nothing new can be scheduled/updated. Single surviving node's pods continue serving traffic. Recovery:

  • If a failed node comes back, quorum is restored and a leader is re-elected (seconds to minutes)
  • If the failed nodes are truly gone, the cluster is broken and needs a rebuild

Rebuild procedure: from the surviving node, run k3s-killall.sh, then bootstrap a new 3-node cluster from scratch. Data in Neon/B2 is safe; Redis state is lost.
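The quorum arithmetic behind the one-node and two-node failure cases is simple enough to sketch. Raft needs a strict majority of voting members, so a cluster of n servers tolerates floor((n-1)/2) failures. The helper names below are illustrative and not part of k3s:

```shell
# Majority needed for a Raft cluster of N voting members.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while quorum is still held.
tolerable_failures() {
  echo $(( ($1 - 1) / 2 ))
}

quorum_size 3          # prints 2
tolerable_failures 3   # prints 1
```

With three servers, one failure leaves 2 of 3 voters (quorum held), while two failures leave 1 of 3 (quorum lost), which is why the two-node case stops accepting writes.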

All three nodes fail simultaneously

Symptom: full site outage. Recovery: rebuild the cluster from scratch. Frequency: Hetzner-region-wide outage, extremely rare.

Node disk fills up

Symptom: pods get evicted ("node is disk-pressure"). New pods can't be scheduled on that node. Common causes: container log buildup (containerd rotates at 10 MB per container, but across dozens of pod churn cycles the total adds up), a local-path PVC filling up, or the apt cache. Recovery:

ssh deploy@<node> "sudo df -h; sudo du -sh /var/lib/rancher/* | sort -h"
# Then clean up the largest offenders, for example the apt cache:
ssh deploy@<node> "sudo apt-get clean"

k3s control plane failures

etcd corruption on one node

Symptom: Raft detects divergence; that node stops serving writes. Recovery: remove the node from the cluster, rejoin. Etcd snapshot is pulled from surviving peers automatically.

CoreDNS down

Symptom: pods can't resolve Service names. New TCP connections fail; existing connections continue (they already resolved). Typical manifestation: "DB connection failed — no such host" errors. Recovery: k3s automatically restarts CoreDNS pod. If it keeps crashing:

kubectl logs -n kube-system deploy/coredns --previous
kubectl rollout restart deployment/coredns -n kube-system

Frequency: rare.

metrics-server down

Symptom: kubectl top returns an error; HPAs can't scale. Recovery: restart metrics-server pod. Non-critical; service stays up.

kubectl rollout restart deployment/metrics-server -n kube-system

vmagent can't reach obs.88oakapps.com

Symptom: dashboards stop updating; vmagent logs show 401 / TLS / network errors against obs.88oakapps.com. App is unaffected. Recovery: vmagent buffers up to 512 MB locally and replays on reconnect, so brief outages self-heal. If sustained:

# Is the obs endpoint up?
curl -s -o /dev/null -w "%{http_code}\n" https://obs.88oakapps.com/health \
  -H "Authorization: Bearer $(grep ^OBS_INGEST_TOKEN= deploy/prod.env | cut -d= -f2)"
# 200 = ingest endpoint healthy.

# Inspect vmagent's failure metric
kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://127.0.0.1:8429/metrics \
  | grep -E "remotewrite_(packets|samples)_dropped|persistentqueue_blocks_dropped"

# Restart vmagent (forces config reload + drains queue)
kubectl -n honeydue rollout restart deploy/vmagent

If 88oakappsUpdate itself is down (PostHog runs there too): SSH and check sudo docker compose -f /opt/honeydue-obs/docker-compose.yml ps. Non-critical: nothing app-facing depends on the obs stack.

Grafana dashboard shows "no data"

Possible causes, in order of frequency:

  1. New histogram name: the query targets a metric the api hasn't emitted yet. Check kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://api:8000/metrics for the metric name.
  2. vmagent isn't scraping (see above).
  3. Time range is before the obs stack came up (2026-04-25). Adjust the dashboard time picker.
  4. Cardinality blowup — VM rejected high-label-count series. Check vm_rows_inserted_total vs vm_rows_dropped_total on the obs box.

Networking failures

UFW rule accidentally blocks essential traffic

Symptom: one specific thing stops working (e.g., api can't reach Postgres, cross-node pod traffic fails, kubectl times out). Recovery: log in via SSH (if that still works), run sudo ufw status numbered, then sudo ufw --force delete <N> to remove the offending rule. If SSH is blocked too: Hetzner console → Rescue mode → mount disk → edit /etc/ufw/user.rules.

Flannel broken on one node

Symptom: pods on that node can't reach remote pods via the overlay. ClusterIP Services involving cross-node endpoints fail. Recovery: restart k3s on that node (Flannel runs inside the k3s process, so there is no separate daemon to restart):

ssh deploy@<node> "sudo systemctl restart k3s"

Kube-proxy broken on one node

Symptom: pods on that node can't reach ClusterIPs. DNS resolution appears to succeed, but connections are refused or time out. Recovery: same as Flannel, restart k3s on the node.

Application-level

api pod OOM

Symptom: pod gets killed, kubelet restarts it. User's request returns 502 briefly; subsequent requests routed to healthy pods. Readiness probe removes the OOMing pod from Service endpoints. Recovery: automatic (pod restarts). If it keeps OOMing:

  • Increase resources.limits.memory in the deployment
  • Or debug the memory leak

Check:
kubectl describe pod -n honeydue <pod> | grep -i oom
kubectl logs -n honeydue <pod> --previous

api pod panics

Symptom: goroutine panic kills the process. Kubelet restarts. Similar user impact to OOM. Recovery: automatic restart. But if the panic is deterministic (same input → panic), the pod crashloops. Action: read the logs, find the panic stack trace, fix the code, deploy. Circuit-breaker scenario: if all 3 api pods crashloop on startup because of bad code, kubectl rollout undo to previous revision.

api deadlocks

Symptom: all 3 pods are up, readiness passes (shallow probe), but real requests time out or hang. Recovery: liveness probe is the same endpoint as readiness, so it won't help. You'll see gradually increasing 504s at the edge. Manual intervention:

kubectl rollout restart deployment/api -n honeydue

admin pod crashes

Symptom: 502 at Cloudflare when accessing admin.myhoneydue.com. Recovery: k8s auto-restarts. Usually within 10-30s. Impact: only admins lose access; user-facing api is unaffected.

worker stops processing jobs

Symptom: emails stop being sent, cron jobs stop firing. Detection: no direct alert; need to notice via user feedback or missing daily-digest emails. Or check Redis for queue backlog. Recovery:

kubectl rollout restart deployment/worker -n honeydue

If persistent: check logs for specific error:

kubectl logs -n honeydue deploy/worker --tail=100
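One way to check for a backlog is to ask Redis for the length of the pending list. This is a sketch under assumptions: the queue name default and the key layout asynq:{<queue>}:pending match recent Asynq releases but are not confirmed anywhere in this chapter, and both helper names are illustrative.

```shell
# Build the Redis key that holds pending tasks for a queue.
# Assumption: Asynq's "asynq:{<queue>}:pending" list layout.
asynq_pending_key() {
  printf 'asynq:{%s}:pending\n' "$1"
}

# Print the number of pending tasks for a queue (default: "default").
asynq_backlog() {
  kubectl -n honeydue exec deploy/redis -- \
    redis-cli LLEN "$(asynq_pending_key "${1:-default}")"
}
```

Running asynq_backlog a few times, a minute apart, gives a rough trend: a count that only grows while the worker pod is Running suggests the worker is stuck rather than idle.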

redis pod dies and reschedules to a different node

Symptom: Redis schedules to a new node, but the PVC is on the original node (local-path is per-node). New Redis pod comes up but finds an empty data directory (or can't mount at all). Recovery:

  • If the original node is still alive but Redis pod died: pod comes back up on same node with data intact
  • If the original node is gone: Redis starts empty. Cache regenerates. Asynq queue state is lost; pending jobs re-queue on retry, cron fires re-schedule on next tick.
  • Auth caches (token + residence-IDs) regenerate on first user request — first request per user pays full DB lookup, then warm again. Visible as a brief latency spike in the Grafana RED dashboard, not a functional failure.
  • Ensure the node label honeydue/redis=true is on a healthy node:
kubectl label node <new-node> honeydue/redis=true --overwrite
kubectl label node <dead-node> honeydue/redis- 2>/dev/null || true

Stale residence-IDs cache (data freshness bug)

Symptom: a user accepts a share-code or has a residence removed, but /api/tasks/, /api/documents/, /api/contractors/, or /api/residences/summary/ continues to show the old membership for up to 5 minutes. Cause: a residence-membership-mutating code path landed without calling cache.InvalidateResidenceIDsForUsers(...). The cache TTL is 5 min so the issue self-heals, but it's user-visible. Recovery (immediate): flush the affected user's cache key manually. See Chapter 17 §residence-IDs cache invalidation. Prevention (permanent): every mutation that changes residence_residence.owner_id, residence_residence_users.user_id, or deletes a residence MUST invalidate. Existing call sites for reference: CreateResidence (owner), DeleteResidence (all members), JoinWithCode (joining user), RemoveUser (removed user). The pattern lives in internal/services/residence_id_cache.go.
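When reviewing a change that touches residence membership, a quick check is to list where the invalidation helper is already called and compare that list against the new mutation paths. A minimal sketch, assuming the Go sources live under internal/ (the default directory and the function name list_invalidation_call_sites are illustrative):

```shell
# List every call site of the invalidation helper under a source directory.
# The default directory ("internal") is an assumption; pass your own path.
list_invalidation_call_sites() {
  grep -rn 'InvalidateResidenceIDsForUsers' "${1:-internal}"
}
```

A new mutation path that does not appear in this output is a candidate for the stale-cache bug described above.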

Redis at maxmemory limit

Symptom: writes to Redis fail with OOM command not allowed when used memory > 'maxmemory' (Redis returns this error to the client, so it shows up in api and worker logs). Should be rare: current production usage is ~2.4 MB against a 256 MB limit, and the policy is allkeys-lru (cache writes evict cold keys instead of erroring). Recovery: confirm the policy is still allkeys-lru:

kubectl -n honeydue exec deploy/redis -- redis-cli CONFIG GET maxmemory-policy

If it's somehow noeviction, set it live:

kubectl -n honeydue exec deploy/redis -- redis-cli CONFIG SET maxmemory-policy allkeys-lru

And re-apply the manifest at deploy-k3s/manifests/redis/deployment.yaml so the change survives a pod restart.

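If memory usage is genuinely climbing toward the cap, look for runaway keys. --bigkeys reports the largest key of each type; it does not show TTLs, so follow up with TTL <key> on any suspects (-1 means the key never expires):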

kubectl -n honeydue exec deploy/redis -- redis-cli --bigkeys
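To see how close Redis is to the cap, compare used_memory with maxmemory from INFO memory. The helper below is a sketch (the function name is illustrative); it reads INFO output on stdin so it can be tried without a cluster.

```shell
# Read `redis-cli INFO memory` output on stdin and print used memory
# as a percentage of maxmemory. Prints nothing if maxmemory is 0 (no limit).
redis_mem_pct() {
  awk -F: '
    /^used_memory:/ { used = $2 + 0 }   # "+ 0" forces a number, dropping the trailing CR
    /^maxmemory:/   { max  = $2 + 0 }
    END { if (max > 0) printf "%.1f\n", 100 * used / max }
  '
}
```

Usage: kubectl -n honeydue exec deploy/redis -- redis-cli INFO memory | redis_mem_pct. At the usage quoted above (~2.4 MB of 256 MB) this prints a value just under 1.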

External service failures

Neon Postgres outage

Symptom: api logs fill with "failed to connect to database." All mutating API calls fail. Reads from cache continue (via Redis) but eventually cache expires. Recovery: no action from us; Neon's problem. Users will see 5xx until Neon is back. Mitigation for future: multi-region Neon read replica, or Postgres-level failover. Frequency: Neon has had a handful of hours-scale outages since launch.

Neon pooler endpoint unreachable but direct endpoint up

Symptom: dial tcp ep-floral-truth-amttbc5a-pooler.c-5...: i/o timeout in api logs but the direct compute endpoint is reachable. Rare — Neon's pooler runs in their infra alongside compute — but possible during pooler maintenance. Recovery (emergency): switch DB_HOST in config.yaml from the -pooler to the direct hostname (drop the -pooler segment), re-apply ConfigMap, rolling-restart api and worker:

# Edit deploy-k3s/config.yaml: database.host: ep-floral-truth-amttbc5a.c-5...
# Then:
KUBECONFIG=~/.kube/honeydue.yaml bash deploy-k3s/scripts/03-deploy.sh --skip-build

Cold-handshake latency goes back up (~440ms first hit) but the API keeps serving. Switch back when the pooler recovers.
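The hostname edit itself is mechanical: remove the -pooler suffix from the first DNS label. A small sketch, using a made-up hostname because the real one is elided above (the function name is illustrative):

```shell
# Derive the direct-compute hostname from a pooler hostname by
# dropping the "-pooler" suffix of the first DNS label.
direct_host() {
  printf '%s\n' "$1" | sed 's/-pooler\././'
}

# Hypothetical hostname, for illustration only.
direct_host ep-example-123456-pooler.region.example.invalid
# prints ep-example-123456.region.example.invalid
```

A hostname without the -pooler segment passes through unchanged, so the helper is safe to run on a value that has already been switched.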

Backblaze B2 outage

Symptom: image uploads fail; image downloads fail unless cached by CF. Recovery: wait. B2 rarely goes down. Mitigation: serve downloads via CF with long cache TTL — most users won't notice brief B2 outages for read traffic.

Fastmail SMTP unreachable

Symptom: worker can't send transactional emails. Jobs retry per Asynq's retry policy, eventually giving up and logging an error. Recovery: automatic retry; wait for Fastmail to come back. Manual intervention: re-enqueue jobs from the Asynq UI (we don't expose it yet — future).

Gitea registry unreachable

Symptom: kubectl rollout stuck at "Pulling image" for new pods. Existing pods continue running with their already-pulled images. Recovery: wait for Gitea to come back. Mitigation: K8s has imagePullPolicy: IfNotPresent by default on SHA-tagged images, so images aren't re-pulled on every restart if the node already has them cached.

Cloudflare DNS failure

See §Cloudflare-level above.

Combined failures

"Everything is slow"

Most often, Neon is being hammered by our own load combined with a noisy neighbor on their side.

  • Check kubectl top pods (are we CPU-bound?)
  • Check Neon console for query performance
  • Check CF analytics for traffic spikes

"Some users see 502, others don't"

Usually one node has an unhealthy Traefik or api. Cloudflare routes some connections to it, others to healthy nodes.

  • kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik
  • kubectl get pods -n honeydue -l app.kubernetes.io/name=api
  • Check per-pod logs

"It worked 5 minutes ago, now it doesn't"

Something recent changed. Check:

  • Recent deploys: kubectl rollout history deployment/api -n honeydue
  • Recent cluster events (which surface the effects of manifest changes): kubectl get events -A --sort-by=.lastTimestamp | tail -30
  • External: Cloudflare Status page, Neon Status page, Backblaze Status page

Planned outages

Node upgrades (OS patches)

# Drain the node (evict pods, block scheduling)
kubectl drain ubuntu-8gb-nbg1-1 --ignore-daemonsets --delete-emptydir-data

# SSH in, upgrade, reboot
ssh deploy@hetzner2 "sudo apt update && sudo apt upgrade -y && sudo reboot"

# Wait for node to come back
watch kubectl get nodes

# Uncordon
kubectl uncordon ubuntu-8gb-nbg1-1

During the drain, pods from that node reschedule to the survivors. With current workload (api: 3 replicas, everything else: 1), rescheduling 1 api pod is fine. Traffic loss: zero.

A worker or Redis pod scheduled on the drained node will be briefly unavailable while it reschedules. Acceptable for planned windows.

k3s upgrades

Same per-node drain + upgrade pattern, but with k3s-specific install:

# On the node
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.35.x+k3s1 sh -s - server

# k3s detects existing install and upgrades in place

Do one node at a time. Verify cluster health between each.

Disaster recovery

Complete cluster loss

Procedure:

  1. Provision 3 new Hetzner CX33 nodes (or use existing if healthy)
  2. Follow bootstrap procedure (Chapter 1 §node hardening)
  3. Install k3s on each (Chapter 2 §HA architecture)
  4. Configure kubeconfig
  5. Apply all manifests:
    kubectl apply -f deploy-k3s/manifests/namespace.yaml
    kubectl apply -f deploy-k3s/manifests/rbac.yaml
    kubectl apply -f deploy-k3s/manifests/traefik-helmchartconfig.yaml
    # Wait for Traefik to redeploy
    # ... recreate secrets (see Chapter 10) ...
    # ... apply rest of manifests ...
    
  6. Update DNS if node IPs changed
  7. Verify: curl https://api.myhoneydue.com/api/health/

Estimated time: 1-2 hours if you've done it before. A lot of context-switching between Hetzner console, SSH, kubectl, and CF.

Neon data is untouched by any of this. B2 data is untouched. Only state that's lost: Redis cache (regenerates) and any in-flight Asynq jobs that were mid-processing.

References