# 16 — Failure Modes

## Summary

Every component in the system has a failure mode, a user-visible
symptom, and a recovery story. This chapter enumerates them from the
edge inward. Use this as a reference when debugging or when planning
resilience improvements.

## Failure catalog

### Cloudflare-level

#### CF edge POP outage

**Symptom**: users in one geographic region see errors; other regions
fine.
**Recovery**: automatic — CF routes traffic to next-nearest POP.
**Our action**: none; wait for CF.
**Frequency**: rare, usually resolved in minutes.

#### CF global outage (rare but has happened)

**Symptom**: the whole site unreachable via CF.
**Recovery**: manual — disable CF proxy (grey cloud DNS records), users
hit origins directly.
**Our action**: in Cloudflare dashboard, flip each A record's proxy off.
Users then resolve to our node IPs directly; UFW allows :80/:443 from
anywhere so they reach Traefik. TLS breaks (origin has no cert in SSL
Flexible mode), but HTTP works.
**Frequency**: rare; an hours-long global event has happened roughly once a year.
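
The dashboard step can also be scripted against the Cloudflare v4 API, which may help when several records need flipping. A minimal sketch, assuming an API token with DNS edit rights in `CF_API_TOKEN` and the zone ID in `CF_ZONE_ID` (both hypothetical variable names, not part of the existing deploy tooling). During a global CF incident the API itself may be degraded, so treat this as a convenience, not a guarantee.

```bash
#!/usr/bin/env bash
# Sketch: turn the Cloudflare proxy off ("grey cloud") for one DNS record.
# CF_API_TOKEN and CF_ZONE_ID are assumed environment variables.

cf_api="https://api.cloudflare.com/client/v4"

# Set proxied=false on a single DNS record, identified by its record ID.
cf_unproxy_record() {
  local record_id="$1"
  curl -fsS -X PATCH "${cf_api}/zones/${CF_ZONE_ID}/dns_records/${record_id}" \
    -H "Authorization: Bearer ${CF_API_TOKEN}" \
    -H "Content-Type: application/json" \
    --data '{"proxied": false}'
}
```

Record IDs come from `GET /zones/<zone-id>/dns_records`; call `cf_unproxy_record <record-id>` once per A record.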

#### DNS hijacking

**Symptom**: users' DNS queries return attacker IPs; all traffic
compromised.
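**Mitigation**: none at our level, and a hijack is unlikely at CF. Users
on DoH/DoT are protected against on-path tampering with their queries,
but not against a compromise of the authoritative records themselves.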
**Recovery**: requires CF incident response.

### Node-level

#### One node's NIC fails

**Symptom**: Cloudflare's retry logic routes around it within seconds.
Users see a brief spike in latency as CF learns the IP is unhealthy.
Kubernetes marks the node NotReady after `node-monitor-grace-period`
(40s), then evicts its pods so they reschedule onto surviving nodes.
**Recovery**:
- Automatic pod rescheduling takes ~5 min (40s grace period + the
  default 300s eviction toleration)
- Dead node's Raft vote is missing; cluster stays up (2 of 3 quorum)
- Replace the node via Hetzner console when convenient
**Our action**: verify `kubectl get nodes` shows NotReady; check
Hetzner console to confirm the node's status; recreate if needed.
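
The `kubectl get nodes` check can be reduced to a one-liner that prints only the nodes needing attention. A small sketch (the function name is illustrative); it also reports cordoned nodes, since their STATUS column reads `Ready,SchedulingDisabled`.

```bash
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Columns of `kubectl get nodes --no-headers`: NAME STATUS ROLES AGE VERSION
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 != "Ready" { print $1 }'
}
```

Empty output means every node is Ready and schedulable.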

#### Two nodes fail simultaneously

**Symptom**: Raft loses quorum. Kubernetes API server rejects writes.
Existing pods keep running but nothing new can be scheduled/updated.
Single surviving node's pods continue serving traffic.
**Recovery**:
- If either failed node comes back, quorum (2 of 3) is restored and
  etcd elects a leader within seconds
- If failed nodes are truly gone, the cluster is broken — need to
  rebuild
**Rebuild procedure**: from the surviving node, `k3s-killall.sh`, then
bootstrap a new 3-node cluster from scratch. Data in Neon/B2 is safe;
Redis state is lost.
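
The quorum rule behind the one-node and two-node cases is plain arithmetic: with n voting members, a majority of floor(n/2) + 1 must be reachable. A small sketch (function names are illustrative):

```bash
# Raft quorum for n voting members is floor(n/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while the cluster still accepts writes.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}
```

For our 3-node cluster, `quorum 3` prints 2 and `tolerated_failures 3` prints 1, which is why one failure is survivable and two are not.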

#### All three nodes fail simultaneously

**Symptom**: full site outage.
**Recovery**: rebuild the cluster from scratch.
**Frequency**: extremely rare; it would take a Hetzner region-wide outage.

#### Node disk fills up

**Symptom**: pods get evicted (the node reports the `DiskPressure`
condition). New pods can't be scheduled on that node.
**Common cause**: container log buildup (logs rotate at 10 MB per
container, but across dozens of pod churn cycles the total adds up),
a local-path PVC filling up, or the apt cache.
**Recovery**:
```bash
ssh deploy@<node> "sudo df -h; sudo du -sh /var/lib/rancher/* | sort -h"
# Then clean up
```
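
The cleanup step is left open above. A possible sketch, to be run on the affected node: one helper that lists filesystems over a usage threshold, and one that runs common cleanup commands. The 85% threshold and the 200M journal cap are arbitrary illustrative values, and the function names are hypothetical. Review each command before running it.

```bash
# Print "<mount> <use%>" for filesystems at or above a usage threshold
# (percent, default 85). Parses POSIX-format `df -P` output.
full_filesystems() {
  local threshold="${1:-85}"
  df -P | awk -v t="$threshold" \
    'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 >= t) print $6, $5 }'
}

# Candidate cleanup steps for a k3s node.
cleanup_node_disk() {
  sudo k3s crictl rmi --prune          # remove images no container uses
  sudo journalctl --vacuum-size=200M   # trim the systemd journal
  sudo apt-get clean                   # drop cached .deb packages
}
```

Run `full_filesystems` before and after `cleanup_node_disk` to confirm that space was actually reclaimed.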

### k3s control plane failures

#### etcd corruption on one node

**Symptom**: Raft detects divergence; that node stops serving writes.
**Recovery**: remove the node from the cluster, rejoin. Etcd snapshot
is pulled from surviving peers automatically.

#### CoreDNS down

**Symptom**: pods can't resolve Service names. New TCP connections
fail; existing connections continue (they already resolved).
Typical manifestation: "DB connection failed — no such host" errors.
**Recovery**: k3s automatically restarts the CoreDNS pod. If it
keeps crashing:
```bash
kubectl logs -n kube-system deploy/coredns --previous
kubectl rollout restart deployment/coredns -n kube-system
```
**Frequency**: rare.
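
To confirm whether in-cluster resolution works, a lookup can be run from a throwaway pod. A minimal sketch, assuming the `busybox:1.36` image is pullable from the nodes (any image that ships `nslookup` would do); the function name is illustrative.

```bash
# Resolve a Service name from a short-lived pod in the honeydue namespace.
# The pod is deleted when the command exits (--rm requires -i).
check_cluster_dns() {
  local name="${1:-kubernetes.default.svc.cluster.local}"
  kubectl run dns-check --rm -i --restart=Never --image=busybox:1.36 \
    -n honeydue -- nslookup "$name"
}
```

For example, `check_cluster_dns api.honeydue.svc.cluster.local` tests the api Service name. A failed lookup here while the CoreDNS pods are Running points at networking rather than CoreDNS itself.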

#### metrics-server down

**Symptom**: `kubectl top` returns an error; HPAs can't scale.
**Recovery**: restart metrics-server pod. Non-critical; service stays up.
```bash
kubectl rollout restart deployment/metrics-server -n kube-system
```

#### vmagent can't reach obs.88oakapps.com

**Symptom**: dashboards stop updating; vmagent logs show 401 / TLS /
network errors against `obs.88oakapps.com`. App is unaffected.
**Recovery**: vmagent buffers up to 512 MB locally and replays on
reconnect, so brief outages self-heal. If sustained:
```bash
# Is the obs endpoint up?
curl -s -o /dev/null -w "%{http_code}\n" https://obs.88oakapps.com/health \
  -H "Authorization: Bearer $(grep ^OBS_INGEST_TOKEN= deploy/prod.env | cut -d= -f2)"
# 200 = ingest endpoint healthy.

# Inspect vmagent's failure metric
kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://127.0.0.1:8429/metrics \
  | grep -E "remotewrite_(packets|samples)_dropped|persistentqueue_blocks_dropped"

# Restart vmagent (forces config reload + drains queue)
kubectl -n honeydue rollout restart deploy/vmagent
```
**If 88oakappsUpdate itself is down** (PostHog runs there too):
SSH and check `sudo docker compose -f /opt/honeydue-obs/docker-compose.yml ps`.
**Non-critical**: nothing app-facing depends on the obs stack.

#### Grafana dashboard shows "no data"

**Possible causes, in order of frequency**:
1. New histogram name: the query targets a metric the api hasn't
   emitted yet. Check
   `kubectl exec -n honeydue deploy/vmagent -- wget -qO- http://api:8000/metrics`
   for the metric name.
2. vmagent isn't scraping (see above).
3. Time range is before the obs stack came up (2026-04-25). Adjust
   the dashboard time picker.
4. Cardinality blowup — VM rejected high-label-count series. Check
   `vm_rows_inserted_total` vs `vm_rows_dropped_total` on the obs box.

### Networking failures

#### UFW rule accidentally blocks essential traffic

**Symptom**: one specific path stops working (e.g., api can't reach
Postgres, cross-node pod traffic fails, kubectl times out).
**Recovery**: log in via SSH (if that still works), `sudo ufw status
numbered`, `sudo ufw --force delete <N>` to remove offending rule.
**If SSH is blocked too**: Hetzner console → Rescue mode → mount disk
→ edit `/etc/ufw/user.rules`.

#### Flannel broken on one node

**Symptom**: pods on that node can't reach remote pods via overlay.
ClusterIP Services involving cross-node endpoints fail.
**Recovery**: restart k3s on that node (Flannel and the kubelet both run inside the k3s process):
```bash
ssh deploy@<node> "sudo systemctl restart k3s"
```

#### Kube-proxy broken on one node

**Symptom**: pods on that node can't reach ClusterIPs. Symptoms look
like DNS resolution succeeded but connection refused or timed out.
**Recovery**: same as Flannel — restart k3s on the node.

### Application-level

#### api pod OOM

**Symptom**: pod gets killed, kubelet restarts it. User's request
returns 502 briefly; subsequent requests routed to healthy pods.
Readiness probe removes the OOMing pod from Service endpoints.
**Recovery**: automatic (pod restarts). If it keeps OOMing:
- Increase `resources.limits.memory` in the deployment
- Or debug the memory leak
**Check**:
```bash
kubectl describe pod -n honeydue <pod> | grep -i oom
kubectl logs -n honeydue <pod> --previous
```
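
Grepping `describe` output works, but the termination reason is also available as structured status. A sketch using JSONPath (function names are illustrative); an OOM kill shows up as reason `OOMKilled`, typically with exit code 137.

```bash
# Print "<container> <reason> <exitCode>" for each container's last termination.
last_termination() {
  local pod="$1"
  kubectl get pod -n honeydue "$pod" -o \
    jsonpath='{range .status.containerStatuses[*]}{.name}{" "}{.lastState.terminated.reason}{" "}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Succeed if any container in the pod was last terminated with OOMKilled.
was_oom_killed() {
  last_termination "$1" | grep -q ' OOMKilled '
}
```

Usage: `was_oom_killed <pod> && echo "raise the memory limit or hunt the leak"`.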

#### api pod panics

**Symptom**: goroutine panic kills the process. Kubelet restarts.
Similar user impact to OOM.
**Recovery**: automatic restart. But if the panic is deterministic
(same input → panic), the pod crashloops.
**Action**: read the logs, find the panic stack trace, fix the code,
deploy.
**Circuit-breaker scenario**: if all 3 api pods crashloop on startup
because of bad code, run `kubectl rollout undo deployment/api -n honeydue`
to return to the previous revision.
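
A rollback is only finished once the old revision is serving again. A small sketch that undoes the rollout and then waits for it to settle; the function name and the 180s timeout are illustrative choices.

```bash
# Roll a deployment back one revision, then wait until the rollout completes.
# Defaults: deployment "api" in namespace "honeydue".
rollback_deployment() {
  local deploy="${1:-api}" ns="${2:-honeydue}"
  kubectl rollout undo "deployment/${deploy}" -n "$ns" &&
    kubectl rollout status "deployment/${deploy}" -n "$ns" --timeout=180s
}
```

A non-zero exit means either the undo was rejected or the previous revision did not become ready within the timeout.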

#### api deadlocks

**Symptom**: all 3 pods are up, readiness passes (shallow probe), but
real requests time out or hang.
**Recovery**: liveness probe is the same endpoint as readiness, so it
won't help. You'll see gradually increasing 504s at the edge. Manual
intervention:
```bash
kubectl rollout restart deployment/api -n honeydue
```

#### admin pod crashes

**Symptom**: 502 at Cloudflare when accessing admin.myhoneydue.com.
**Recovery**: k8s auto-restarts. Usually within 10-30s.
**Impact**: only admins lose access; user-facing api is unaffected.

#### worker stops processing jobs

**Symptom**: emails stop being sent, cron jobs stop firing.
**Detection**: no direct alert; need to notice via user feedback or
missing daily-digest emails. Or check Redis for queue backlog.
**Recovery**:
```bash
kubectl rollout restart deployment/worker -n honeydue
```
**If persistent**: check logs for specific error:
```bash
kubectl logs -n honeydue deploy/worker --tail=100
```
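
The "check Redis for queue backlog" step can be sketched as below. Assumptions to verify before relying on it: Redis runs as `deploy/redis` in the `honeydue` namespace without auth, the queue is named `default`, and pending jobs live in the list key `asynq:{<queue>}:pending` (the layout Asynq uses in recent versions, to my knowledge).

```bash
# Print the number of pending jobs in an Asynq queue (default: "default").
# deploy/redis and the key layout are assumptions; see the note above.
asynq_pending() {
  local queue="${1:-default}"
  kubectl exec -n honeydue deploy/redis -- \
    redis-cli LLEN "asynq:{${queue}}:pending"
}
```

A count that keeps growing while the worker pod is Running suggests the worker is stuck rather than idle.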

#### redis pod dies + node is different

**Symptom**: Redis schedules to a new node, but the PVC is on the
original node (local-path is per-node). New Redis pod comes up but
finds an empty data directory (or can't mount at all).
**Recovery**:
- If the original node is still alive but Redis pod died: pod comes
  back up on same node with data intact
- If the original node is gone: Redis starts empty. Cache regenerates.
  Asynq queue state is lost; pending jobs re-queue on retry, cron
  fires re-schedule on next tick.
- Ensure the node label `honeydue/redis=true` is on a healthy node:
```bash
kubectl label node <new-node> honeydue/redis=true --overwrite
kubectl label node <dead-node> honeydue/redis- 2>/dev/null || true
```

### External service failures

#### Neon Postgres outage

**Symptom**: api logs fill with "failed to connect to database." All
mutating API calls fail. Reads from cache continue (via Redis) but
eventually cache expires.
**Recovery**: no action from us; Neon's problem. Users will see 5xx
until Neon is back.
**Mitigation for future**: multi-region Neon read replica, or
Postgres-level failover.
**Frequency**: Neon has had a handful of hours-scale outages since launch.

#### Backblaze B2 outage

**Symptom**: image uploads fail; image downloads fail unless cached by
CF.
**Recovery**: wait. B2 rarely goes down.
**Mitigation**: serve downloads via CF with long cache TTL — most
users won't notice brief B2 outages for read traffic.

#### Fastmail SMTP unreachable

**Symptom**: `worker` can't send transactional emails. Jobs retry per
Asynq's retry policy, eventually giving up and logging an error.
**Recovery**: automatic retry; wait for Fastmail to come back.
**Manual intervention**: re-enqueue jobs from the Asynq UI (we don't
expose it yet — future).

#### Gitea registry unreachable

**Symptom**: new pods sit in `ErrImagePull` / `ImagePullBackOff` and
`kubectl rollout status` never completes.
Existing pods continue running with their already-pulled images.
**Recovery**: wait for Gitea to come back.
**Mitigation**: Kubernetes defaults `imagePullPolicy` to `IfNotPresent`
for any image tag other than `:latest`, which covers our SHA-tagged
images, so images aren't re-pulled on every restart if the node
already has them cached.

#### Cloudflare DNS failure

See §Cloudflare-level above.

## Combined failures

### "Everything is slow"

Most often, Neon is under pressure from our own load, a noisy neighbor
on Neon's side, or both.
- Check `kubectl top pods` (are we CPU-bound?)
- Check Neon console for query performance
- Check CF analytics for traffic spikes

### "Some users see 502, others don't"

Usually one node has an unhealthy Traefik or api. Cloudflare routes
some connections to it, others to healthy nodes.
- `kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik`
- `kubectl get pods -n honeydue -l app.kubernetes.io/name=api`
- Check per-pod logs

### "It worked 5 minutes ago, now it doesn't"

Something recent changed. Check:
- Recent deploys: `kubectl rollout history deployment/api -n honeydue`
- Recent cluster events: `kubectl get events -A --sort-by=.lastTimestamp | tail -30`
- External: Cloudflare Status page, Neon Status page, Backblaze Status page

## Planned outages

### Node upgrades (OS patches)

```bash
# Drain the node (evict pods, block scheduling)
kubectl drain ubuntu-8gb-nbg1-1 --ignore-daemonsets --delete-emptydir-data

# SSH in, upgrade, reboot (hetzner2 must be the SSH alias of the node drained above)
ssh deploy@hetzner2 "sudo apt update && sudo apt upgrade -y && sudo reboot"

# Wait for node to come back
watch kubectl get nodes

# Uncordon
kubectl uncordon ubuntu-8gb-nbg1-1
```

During the drain, pods from that node reschedule to the survivors.
With current workload (api: 3 replicas, everything else: 1), rescheduling
1 api pod is fine. Traffic loss: zero.

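A worker pod on the drained node is briefly unavailable while it
reschedules. Redis is tied to its node by the `honeydue/redis=true`
label and its local-path PVC, so if it runs on the drained node it
stays down until the node is uncordoned. Acceptable for planned windows.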
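
Instead of watching `kubectl get nodes` by hand, the wait and the uncordon can be chained. A sketch (the function name and the 10m default are illustrative). One caveat: right after the reboot command is issued the node can still report Ready until its grace period lapses, so start this only once the node has actually gone down or come back.

```bash
# Block until the node reports Ready, then re-enable scheduling on it.
wait_and_uncordon() {
  local node="$1" timeout="${2:-10m}"
  kubectl wait --for=condition=Ready "node/${node}" --timeout="$timeout" &&
    kubectl uncordon "$node"
}
```

Usage: `wait_and_uncordon ubuntu-8gb-nbg1-1`. If the wait times out, the node stays cordoned, which is the safe outcome.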

### k3s upgrades

Same per-node drain + upgrade pattern, but with k3s-specific install:

```bash
# On the node
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.35.x+k3s1 sh -s - server

# k3s detects existing install and upgrades in place
```

Do one node at a time. Verify cluster health between each.
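
One way to make "verify cluster health between each" concrete is to check which version each node reports. A sketch with illustrative function names; the version string passed in must match exactly what the kubelet reports (for example the full `v...+k3s1` form).

```bash
# Print "<node> <kubelet-version>" for every node.
node_versions() {
  kubectl get nodes -o \
    jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.nodeInfo.kubeletVersion}{"\n"}{end}'
}

# Succeed only if at least one node is listed and all report the given version.
all_nodes_at_version() {
  local want="$1"
  node_versions | awk -v want="$want" \
    '$2 != want { bad = 1 } END { exit (bad || NR == 0) }'
}
```

Mid-upgrade, `node_versions` shows which nodes are done; `all_nodes_at_version <version>` is the final check after the last node.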

## Disaster recovery

### Complete cluster loss

Procedure:
1. Provision 3 new Hetzner CX33 nodes (or use existing if healthy)
2. Follow bootstrap procedure (Chapter 1 §node hardening)
3. Install k3s on each (Chapter 2 §HA architecture)
4. Configure kubeconfig
5. Apply all manifests:
   ```bash
   kubectl apply -f deploy-k3s/manifests/namespace.yaml
   kubectl apply -f deploy-k3s/manifests/rbac.yaml
   kubectl apply -f deploy-k3s/manifests/traefik-helmchartconfig.yaml
   # Wait for Traefik to redeploy
   # ... recreate secrets (see Chapter 10) ...
   # ... apply rest of manifests ...
   ```
6. Update DNS if node IPs changed
7. Verify: `curl https://api.myhoneydue.com/api/health/`

Estimated time: **1-2 hours** if you've done it before. Expect a lot of
context-switching between Hetzner console, SSH, kubectl, and CF.

Neon data is untouched by any of this. B2 data is untouched. Only
state that's lost: Redis cache (regenerates) and any in-flight Asynq
jobs that were mid-processing.
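
The final verification step can be left polling while DNS and pods settle. A minimal sketch, assuming the health endpoint answers HTTP 200 when healthy (not confirmed in this chapter); the function name, attempt count, and delay are illustrative.

```bash
# Poll a URL until it returns HTTP 200 or the attempts run out.
# Args: url (default: the api health endpoint), attempts (30), delay seconds (10).
wait_for_health() {
  local url="${1:-https://api.myhoneydue.com/api/health/}"
  local attempts="${2:-30}" delay="${3:-10}" code i
  for (( i = 1; i <= attempts; i++ )); do
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")"
    if [ "$code" = "200" ]; then
      echo "healthy after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "still unhealthy after ${attempts} attempts (last status: ${code})" >&2
  return 1
}
```

With the defaults this gives the rebuilt cluster about five minutes to start answering before it reports failure.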

## References

- [Kubernetes pod lifecycle][lifecycle]
- [K3s HA recovery][k3s-ha-recovery]
- [Hetzner rescue system][hetzner-rescue]

[lifecycle]: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
[k3s-ha-recovery]: https://docs.k3s.io/datastore/ha-embedded#new-cluster-with-embedded-db
[hetzner-rescue]: https://docs.hetzner.com/cloud/servers/getting-started/enabling-rescue-system/