Files
honeyDueAPI/docs/deployment/16-failure-modes.md
T
Trey t 8d9ca2e6ed
Backend CI / Test (push) Has been cancelled
Backend CI / Contract Tests (push) Has been cancelled
Backend CI / Build (push) Has been cancelled
Backend CI / Lint (push) Has been cancelled
Backend CI / Secret Scanning (push) Has been cancelled
docs(deployment): rewrite migration prose for goose adoption
Update the deployment book and glossary to reflect the goose-based
schema migration flow shipped in 12b2f9d/0f7450a:

- ch07: clarify startup probe assumes migrations ran out-of-band
- ch08: drop AutoMigrate-with-advisory-lock prose; describe goose Job
- ch12: pod startup checks goose_db_version, no longer runs migrations
- ch14: document the Job→wait→roll deploy gate and how to debug failures
- ch16: add "Migrate Job fails during deploy" + "Schema precondition
  failed" failure modes
- ch17: new runbook entries §26 (run migrations manually), §27 (recover
  from failed/dirty migration), §28 (bootstrap goose on fresh clone)
- ch19: postscript on §13 noting MigrateWithLock approach is superseded
- ch20: mark "Migration Job for schema changes" task done
- glossary: add `goose` and `goose_db_version`; flag AutoMigrate as
  tests-only
- references: add goose links; flag AutoMigrate as tests-only
2026-04-26 23:01:32 -05:00

508 lines
18 KiB
Markdown

# 16 — Failure Modes
## Summary
Every component in the system has a failure mode, a user-visible
symptom, and a recovery story. This chapter enumerates them from the
edge inward. Use this as a reference when debugging or when planning
resilience improvements.
## Failure catalog
### Cloudflare-level
#### CF edge POP outage
**Symptom**: users in one geographic region see errors; other regions
fine.
**Recovery**: automatic — CF routes traffic to next-nearest POP.
**Our action**: none; wait for CF.
**Frequency**: rare, usually resolved in minutes.
#### CF global outage (rare but has happened)
**Symptom**: the whole site unreachable via CF.
**Recovery**: manual — disable CF proxy (grey cloud DNS records), users
hit origins directly.
**Our action**: in Cloudflare dashboard, flip each A record's proxy off.
Users then resolve to our node IPs directly; UFW allows :80/:443 from
anywhere so they reach Traefik. TLS breaks (origin has no cert in SSL
Flexible mode), but HTTP works.
**Frequency**: extremely rare (hours-long event happens ~annually).
#### DNS hijacking
**Symptom**: users' DNS queries return attacker IPs; all traffic
compromised.
**Mitigation**: unlikely at CF; users who use DoH/DoT are protected.
No mitigation at our level.
**Recovery**: requires CF incident response.
### Node-level
#### One node's NIC fails
**Symptom**: Cloudflare's retry logic routes around it within seconds.
Users see a brief spike in latency as CF learns the IP is unhealthy.
Pods on that node get rescheduled to surviving nodes by Kubernetes
after `node-monitor-grace-period` (40s).
**Recovery**:
- Automatic pod rescheduling takes ~5 min (grace period + pod eviction)
- Dead node's Raft vote is missing; cluster stays up (2 of 3 quorum)
- Replace the node via Hetzner console when convenient
**Our action**: verify `kubectl get nodes` shows NotReady; check
Hetzner console to confirm the node's status; recreate if needed.
#### Two nodes fail simultaneously
**Symptom**: Raft loses quorum. Kubernetes API server rejects writes.
Existing pods keep running but nothing new can be scheduled/updated.
Single surviving node's pods continue serving traffic.
**Recovery**:
- If a failed node comes back within Raft's leader-election timeout
(seconds to minutes), quorum restores
- If failed nodes are truly gone, the cluster is broken — need to
rebuild
**Rebuild procedure**: from the surviving node, `k3s-killall.sh`, then
bootstrap a new 3-node cluster from scratch. Data in Neon/B2 is safe;
Redis state is lost.
#### All three nodes fail simultaneously
**Symptom**: full site outage.
**Recovery**: rebuild the cluster from scratch.
**Frequency**: Hetzner-region-wide outage, extremely rare.
#### Node disk fills up
**Symptom**: pods get evicted ("node is disk-pressure"). Containers
can't be scheduled on that node.
**Common cause**: container log buildup (containerd rotates at 10 MB
per container but across dozens of pod churn cycles, total fills up),
local-path PVC fills up, apt cache.
**Recovery**:
```bash
ssh deploy@<node> "sudo df -h; sudo du -sh /var/lib/rancher/* | sort -h"
# Then clean up
```
### k3s control plane failures
#### etcd corruption on one node
**Symptom**: Raft detects divergence; that node stops serving writes.
**Recovery**: remove the node from the cluster, rejoin. Etcd snapshot
is pulled from surviving peers automatically.
#### CoreDNS down
**Symptom**: pods can't resolve Service names. New TCP connections
fail; existing connections continue (they already resolved).
Typical manifestation: "DB connection failed — no such host" errors.
**Recovery**: k3s automatically restarts CoreDNS pod. If it
keeps crashing:
```bash
kubectl logs -n kube-system deploy/coredns --previous
kubectl rollout restart deployment/coredns -n kube-system
```
**Frequency**: rare.
#### metrics-server down
**Symptom**: `kubectl top` returns an error; HPAs can't scale.
**Recovery**: restart metrics-server pod. Non-critical; service stays up.
```bash
kubectl rollout restart deployment/metrics-server -n kube-system
```
#### vmagent can't reach obs.88oakapps.com
**Symptom**: dashboards stop updating; vmagent logs show 401 / TLS /
network errors against `obs.88oakapps.com`. App is unaffected.
**Recovery**: vmagent buffers up to 512 MB locally and replays on
reconnect, so brief outages self-heal. If sustained:
```bash
# Is the obs endpoint up?
curl -s -o /dev/null -w "%{http_code}\n" https://obs.88oakapps.com/health \
-H "Authorization: Bearer $(grep ^OBS_INGEST_TOKEN= deploy/prod.env | cut -d= -f2)"
# 200 = ingest endpoint healthy.
# Inspect vmagent's failure metric
kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://127.0.0.1:8429/metrics \
| grep -E "remotewrite_(packets|samples)_dropped|persistentqueue_blocks_dropped"
# Restart vmagent (forces config reload + drains queue)
kubectl -n honeydue rollout restart deploy/vmagent
```
**If 88oakappsUpdate itself is down** (PostHog runs there too):
SSH and check `sudo docker compose -f /opt/honeydue-obs/docker-compose.yml ps`.
**Non-critical**: nothing app-facing depends on the obs stack.
#### Grafana dashboard shows "no data"
**Possible causes, in order of frequency**:
1. New histogram name — query targets a metric the api hasn't emitted
yet. Check `kubectl exec deploy/vmagent -- wget -qO- http://api:8000/metrics`
for the metric name.
2. vmagent isn't scraping (see above).
3. Time range is before the obs stack came up (2026-04-25). Adjust
the dashboard time picker.
4. Cardinality blowup — VM rejected high-label-count series. Check
`vm_rows_inserted_total` vs `vm_rows_dropped_total` on the obs box.
### Networking failures
#### UFW rule accidentally blocks essential traffic
**Symptom**: Some specific thing stops working (e.g., api can't reach
Postgres, cross-node pod traffic fails, kubectl times out).
**Recovery**: log in via SSH (if that still works), `sudo ufw status
numbered`, `sudo ufw --force delete <N>` to remove offending rule.
**If SSH is blocked too**: Hetzner console → Rescue mode → mount disk
→ edit `/etc/ufw/user.rules`.
#### Flannel broken on one node
**Symptom**: pods on that node can't reach remote pods via overlay.
ClusterIP Services involving cross-node endpoints fail.
**Recovery**: restart kubelet on that node:
```bash
ssh deploy@<node> "sudo systemctl restart k3s"
```
#### Kube-proxy broken on one node
**Symptom**: pods on that node can't reach ClusterIPs. Symptoms look
like DNS resolution succeeded but connection refused or timed out.
**Recovery**: same as Flannel — restart k3s on the node.
### Application-level
#### api pod OOM
**Symptom**: pod gets killed, kubelet restarts it. User's request
returns 502 briefly; subsequent requests routed to healthy pods.
Readiness probe removes the OOMing pod from Service endpoints.
**Recovery**: automatic (pod restarts). If it keeps OOMing:
- Increase `resources.limits.memory` in the deployment
- Or debug the memory leak
**Check**:
```bash
kubectl describe pod -n honeydue <pod> | grep -i oom
kubectl logs -n honeydue <pod> --previous
```
#### api pod panics
**Symptom**: goroutine panic kills the process. Kubelet restarts.
Similar user impact to OOM.
**Recovery**: automatic restart. But if the panic is deterministic
(same input → panic), the pod crashloops.
**Action**: read the logs, find the panic stack trace, fix the code,
deploy.
**Circuit-breaker scenario**: if all 3 api pods crashloop on startup
because of bad code, kubectl rollout undo to previous revision.
#### api deadlocks
**Symptom**: all 3 pods are up, readiness passes (shallow probe), but
real requests time out or hang.
**Recovery**: liveness probe is the same endpoint as readiness, so it
won't help. You'll see gradually increasing 504s at the edge. Manual
intervention:
```bash
kubectl rollout restart deployment/api -n honeydue
```
#### admin pod crashes
**Symptom**: 502 at Cloudflare when accessing admin.myhoneydue.com.
**Recovery**: k8s auto-restarts. Usually within 10-30s.
**Impact**: only admins lose access; user-facing api is unaffected.
#### worker stops processing jobs
**Symptom**: emails stop being sent, cron jobs stop firing.
**Detection**: no direct alert; need to notice via user feedback or
missing daily-digest emails. Or check Redis for queue backlog.
**Recovery**:
```bash
kubectl rollout restart deployment/worker -n honeydue
```
**If persistent**: check logs for specific error:
```bash
kubectl logs -n honeydue deploy/worker --tail=100
```
#### redis pod dies + node is different
**Symptom**: Redis schedules to a new node, but the PVC is on the
original node (local-path is per-node). New Redis pod comes up but
finds an empty data directory (or can't mount at all).
**Recovery**:
- If the original node is still alive but Redis pod died: pod comes
back up on same node with data intact
- If the original node is gone: Redis starts empty. Cache regenerates.
Asynq queue state is lost; pending jobs re-queue on retry, cron
fires re-schedule on next tick.
- Auth caches (token + residence-IDs) regenerate on first user
request — first request per user pays full DB lookup, then warm
again. Visible as a brief latency spike in the Grafana RED
dashboard, not a functional failure.
- Ensure the node label `honeydue/redis=true` is on a healthy node:
```bash
kubectl label node <new-node> honeydue/redis=true --overwrite
kubectl label node <dead-node> honeydue/redis- 2>/dev/null || true
```
#### Stale residence-IDs cache (data freshness bug)
**Symptom**: a user accepts a share-code or has a residence
removed, but `/api/tasks/`, `/api/documents/`, `/api/contractors/`,
or `/api/residences/summary/` continues to show the old
membership for up to 5 minutes.
**Cause**: a residence-membership-mutating code path landed
without calling `cache.InvalidateResidenceIDsForUsers(...)`. The
cache TTL is 5 min so the issue self-heals, but it's user-visible.
**Recovery (immediate)**: flush the affected user's cache key
manually. See [Chapter 17 §residence-IDs cache invalidation](./17-runbook.md).
**Prevention (permanent)**: every mutation that changes
`residence_residence.owner_id`, `residence_residence_users.user_id`,
or deletes a residence MUST invalidate. Existing call sites for
reference: `CreateResidence` (owner), `DeleteResidence`
(all members), `JoinWithCode` (joining user), `RemoveUser`
(removed user). The pattern lives in
`internal/services/residence_id_cache.go`.
#### Redis at maxmemory limit
**Symptom**: Redis logs `OOM command not allowed when used memory > 'maxmemory'`.
Should be rare — current production usage is ~2.4 MB against a 256 MB
limit and the policy is `allkeys-lru` (cache writes evict cold keys
instead of erroring).
**Recovery**: confirm the policy is still `allkeys-lru`:
```bash
kubectl -n honeydue exec deploy/redis -- redis-cli CONFIG GET maxmemory-policy
```
If it's somehow `noeviction`, set it live:
```bash
kubectl -n honeydue exec deploy/redis -- redis-cli CONFIG SET maxmemory-policy allkeys-lru
```
And re-apply the manifest at `deploy-k3s/manifests/redis/deployment.yaml`
so the change survives a pod restart.
If memory usage is genuinely climbing toward the cap, check for
runaway keys without TTLs:
```bash
kubectl -n honeydue exec deploy/redis -- redis-cli --bigkeys
```
### External service failures
#### Neon Postgres outage
**Symptom**: api logs fill with "failed to connect to database." All
mutating API calls fail. Reads from cache continue (via Redis) but
eventually cache expires.
**Recovery**: no action from us; Neon's problem. Users will see 5xx
until Neon is back.
**Mitigation for future**: multi-region Neon read replica, or
Postgres-level failover.
**Frequency**: Neon has had a handful of hours-scale outages since launch.
#### Neon pooler endpoint unreachable but direct endpoint up
**Symptom**: `dial tcp ep-floral-truth-amttbc5a-pooler.c-5...: i/o
timeout` in api logs but the direct compute endpoint is reachable.
Rare — Neon's pooler runs in their infra alongside compute — but
possible during pooler maintenance.
**Recovery (emergency)**: switch `DB_HOST` in `config.yaml` from the
`-pooler` to the direct hostname (drop the `-pooler` segment),
re-apply ConfigMap, rolling-restart api and worker:
```bash
# Edit deploy-k3s/config.yaml: database.host: ep-floral-truth-amttbc5a.c-5...
# Then:
KUBECONFIG=~/.kube/honeydue.yaml bash deploy-k3s/scripts/03-deploy.sh --skip-build
```
Cold-handshake latency goes back up (~440ms first hit) but the API
keeps serving. Switch back when the pooler recovers.
#### Migrate Job fails during deploy
**Symptom**: `03-deploy.sh` aborts at the migrations step:
```
[deploy][error] migrations did not complete cleanly; aborting deploy
```
api/worker pods are NOT updated — they keep running the previous
revision. This is the intentional fail-fast.
**Recovery**:
```bash
# 1. See the failure
kubectl -n honeydue logs job/honeydue-migrate --tail=200
# 2. Common cause: a SQL error in the migration file. Fix the file
# locally, commit, retry the deploy. The Job is idempotent —
# successful prior versions stay applied; only the failed file
# re-runs.
git add migrations/000NNN_*.sql
git commit -m "Fix migration NNN"
git push gitea master
bash deploy-k3s/scripts/03-deploy.sh
# 3. Other cause: Neon down or auth changed. Test direct connection:
DB_PASS=$(kubectl -n honeydue get secret honeydue-secrets \
-o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d)
docker run --rm -e PGPASSWORD="$DB_PASS" postgres:17-alpine \
psql "host=ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
user=neondb_owner dbname=honeyDue sslmode=require" -c "SELECT 1;"
```
**Why no automatic retry**: `backoffLimit: 0` on the Job is deliberate.
A failing migration almost never gets unstuck by retrying — needs an
operator to look. See [Chapter 17 §27](./17-runbook.md) for recovery
playbook.
#### api refuses to start: "Schema precondition failed"
**Symptom**: api pods log `Schema precondition failed` and exit
immediately after DB connect.
**Cause**: `goose_db_version` table is missing or its latest row has
`is_applied=false`. Means the migrate Job either was never run or
ran and rolled back.
**Recovery**: run the migrate Job manually (see
[Chapter 17 §26](./17-runbook.md)). After it completes successfully,
delete the failing api pods so they restart with a fresh schema check:
```bash
kubectl -n honeydue rollout restart deploy/api
```
#### Backblaze B2 outage
**Symptom**: image uploads fail; image downloads fail unless cached by
CF.
**Recovery**: wait. B2 rarely goes down.
**Mitigation**: serve downloads via CF with long cache TTL — most
users won't notice brief B2 outages for read traffic.
#### Fastmail SMTP unreachable
**Symptom**: `worker` can't send transactional emails. Jobs retry per
Asynq's retry policy, eventually giving up and logging an error.
**Recovery**: automatic retry; wait for Fastmail to come back.
**Manual intervention**: re-enqueue jobs from the Asynq UI (we don't
expose it yet — future).
#### Gitea registry unreachable
**Symptom**: `kubectl rollout` stuck at "Pulling image" for new pods.
Existing pods continue running with their already-pulled images.
**Recovery**: wait for Gitea to come back.
**Mitigation**: K8s has `imagePullPolicy: IfNotPresent` by default on
SHA-tagged images, so images aren't re-pulled on every restart if
the node already has them cached.
#### Cloudflare DNS failure
See §CF failures above.
## Combined failures
### "Everything is slow"
Most often = Neon is being hammered by our load + someone else's noisy
neighbor.
- Check `kubectl top pods` (are we CPU-bound?)
- Check Neon console for query performance
- Check CF analytics for traffic spikes
### "Some users see 502, others don't"
Usually one node has an unhealthy Traefik or api. Cloudflare routes
some connections to it, others to healthy nodes.
- `kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik`
- `kubectl get pods -n honeydue -l app.kubernetes.io/name=api`
- Check per-pod logs
### "It worked 5 minutes ago, now it doesn't"
Something recent changed. Check:
- Recent deploys: `kubectl rollout history deployment/api -n honeydue`
- Recent manifest changes: `kubectl get events -A --sort-by=.lastTimestamp | tail -30`
- External: Cloudflare Status page, Neon Status page, Backblaze Status page
## Planned outages
### Node upgrades (OS patches)
```bash
# Drain the node (evict pods, block scheduling)
kubectl drain ubuntu-8gb-nbg1-1 --ignore-daemonsets --delete-emptydir-data
# SSH in, upgrade, reboot
ssh deploy@hetzner2 "sudo apt update && sudo apt upgrade -y && sudo reboot"
# Wait for node to come back
watch kubectl get nodes
# Uncordon
kubectl uncordon ubuntu-8gb-nbg1-1
```
During the drain, pods from that node reschedule to the survivors.
With current workload (api: 3 replicas, everything else: 1), rescheduling
1 api pod is fine. Traffic loss: zero.
Worker pod or Redis pod scheduled on the drained node would be
briefly unavailable during reschedule. Acceptable for planned windows.
### k3s upgrades
Same per-node drain + upgrade pattern, but with k3s-specific install:
```bash
# On the node
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.35.x+k3s1 sh -s - server
# k3s detects existing install and upgrades in place
```
Do one node at a time. Verify cluster health between each.
## Disaster recovery
### Complete cluster loss
Procedure:
1. Provision 3 new Hetzner CX33 nodes (or use existing if healthy)
2. Follow bootstrap procedure (Chapter 1 §node hardening)
3. Install k3s on each (Chapter 2 §HA architecture)
4. Configure kubeconfig
5. Apply all manifests:
```bash
kubectl apply -f deploy-k3s/manifests/namespace.yaml
kubectl apply -f deploy-k3s/manifests/rbac.yaml
kubectl apply -f deploy-k3s/manifests/traefik-helmchartconfig.yaml
# Wait for Traefik to redeploy
# ... recreate secrets (see Chapter 10) ...
# ... apply rest of manifests ...
```
6. Update DNS if node IPs changed
7. Verify: curl https://api.myhoneydue.com/api/health/
Estimated time: **1-2 hours** if you've done it before. A lot of
context-switching between Hetzner console, SSH, kubectl, and CF.
Neon data is untouched by any of this. B2 data is untouched. Only
state that's lost: Redis cache (regenerates) and any in-flight Asynq
jobs that were mid-processing.
## References
- [Kubernetes pod lifecycle][lifecycle]
- [K3s HA recovery][k3s-ha-recovery]
- [Hetzner rescue system][hetzner-rescue]
[lifecycle]: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
[k3s-ha-recovery]: https://docs.k3s.io/datastore/ha-embedded#new-cluster-with-embedded-db
[hetzner-rescue]: https://docs.hetzner.com/cloud/servers/getting-started/enabling-rescue-system/