Update the deployment book and glossary to reflect the goose-based schema migration flow shipped in 12b2f9d/0f7450a: - ch07: clarify startup probe assumes migrations ran out-of-band - ch08: drop AutoMigrate-with-advisory-lock prose; describe goose Job - ch12: pod startup checks goose_db_version, no longer runs migrations - ch14: document the Job→wait→roll deploy gate and how to debug failures - ch16: add "Migrate Job fails during deploy" + "Schema precondition failed" failure modes - ch17: new runbook entries §26 (run migrations manually), §27 (recover from failed/dirty migration), §28 (bootstrap goose on fresh clone) - ch19: postscript on §13 noting MigrateWithLock approach is superseded - ch20: mark "Migration Job for schema changes" task done - glossary: add `goose` and `goose_db_version`; flag AutoMigrate as tests-only - references: add goose links; flag AutoMigrate as tests-only
18 KiB
16 — Failure Modes
Summary
Every component in the system has a failure mode, a user-visible symptom, and a recovery story. This chapter enumerates them from the edge inward. Use this as a reference when debugging or when planning resilience improvements.
Failure catalog
Cloudflare-level
CF edge POP outage
Symptom: users in one geographic region see errors; other regions fine. Recovery: automatic — CF routes traffic to next-nearest POP. Our action: none; wait for CF. Frequency: rare, usually resolved in minutes.
CF global outage (rare but has happened)
Symptom: the whole site unreachable via CF. Recovery: manual — disable CF proxy (grey cloud DNS records), users hit origins directly. Our action: in Cloudflare dashboard, flip each A record's proxy off. Users then resolve to our node IPs directly; UFW allows :80/:443 from anywhere so they reach Traefik. TLS breaks (origin has no cert in SSL Flexible mode), but HTTP works. Frequency: extremely rare (hours-long event happens ~annually).
DNS hijacking
Symptom: users' DNS queries return attacker IPs; all traffic compromised. Mitigation: unlikely at CF; users who use DoH/DoT are protected. No mitigation at our level. Recovery: requires CF incident response.
Node-level
One node's NIC fails
Symptom: Cloudflare's retry logic routes around it within seconds.
Users see a brief spike in latency as CF learns the IP is unhealthy.
Pods on that node get rescheduled to surviving nodes by Kubernetes
after node-monitor-grace-period (40s).
Recovery:
- Automatic pod rescheduling takes ~5 min (grace period + pod eviction)
- Dead node's Raft vote is missing; cluster stays up (2 of 3 quorum)
- Replace the node via Hetzner console when convenient
Our action: verify
kubectl get nodesshows NotReady; check Hetzner console to confirm the node's status; recreate if needed.
Two nodes fail simultaneously
Symptom: Raft loses quorum. Kubernetes API server rejects writes. Existing pods keep running but nothing new can be scheduled/updated. Single surviving node's pods continue serving traffic. Recovery:
- If a failed node comes back within Raft's leader-election timeout (seconds to minutes), quorum restores
- If failed nodes are truly gone, the cluster is broken — need to
rebuild
Rebuild procedure: from the surviving node,
k3s-killall.sh, then bootstrap a new 3-node cluster from scratch. Data in Neon/B2 is safe; Redis state is lost.
All three nodes fail simultaneously
Symptom: full site outage. Recovery: rebuild the cluster from scratch. Frequency: Hetzner-region-wide outage, extremely rare.
Node disk fills up
Symptom: pods get evicted ("node is disk-pressure"). Containers can't be scheduled on that node. Common cause: container log buildup (containerd rotates at 10 MB per container but across dozens of pod churn cycles, total fills up), local-path PVC fills up, apt cache. Recovery:
ssh deploy@<node> "sudo df -h; sudo du -sh /var/lib/rancher/* | sort -h"
# Then clean up
k3s control plane failures
etcd corruption on one node
Symptom: Raft detects divergence; that node stops serving writes. Recovery: remove the node from the cluster, rejoin. Etcd snapshot is pulled from surviving peers automatically.
CoreDNS down
Symptom: pods can't resolve Service names. New TCP connections fail; existing connections continue (they already resolved). Typical manifestation: "DB connection failed — no such host" errors. Recovery: k3s automatically restarts CoreDNS pod. If it keeps crashing:
kubectl logs -n kube-system deploy/coredns --previous
kubectl rollout restart deployment/coredns -n kube-system
Frequency: rare.
metrics-server down
Symptom: kubectl top returns an error; HPAs can't scale.
Recovery: restart metrics-server pod. Non-critical; service stays up.
kubectl rollout restart deployment/metrics-server -n kube-system
vmagent can't reach obs.88oakapps.com
Symptom: dashboards stop updating; vmagent logs show 401 / TLS /
network errors against obs.88oakapps.com. App is unaffected.
Recovery: vmagent buffers up to 512 MB locally and replays on
reconnect, so brief outages self-heal. If sustained:
# Is the obs endpoint up?
curl -s -o /dev/null -w "%{http_code}\n" https://obs.88oakapps.com/health \
-H "Authorization: Bearer $(grep ^OBS_INGEST_TOKEN= deploy/prod.env | cut -d= -f2)"
# 200 = ingest endpoint healthy.
# Inspect vmagent's failure metric
kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://127.0.0.1:8429/metrics \
| grep -E "remotewrite_(packets|samples)_dropped|persistentqueue_blocks_dropped"
# Restart vmagent (forces config reload + drains queue)
kubectl -n honeydue rollout restart deploy/vmagent
If 88oakappsUpdate itself is down (PostHog runs there too):
SSH and check sudo docker compose -f /opt/honeydue-obs/docker-compose.yml ps.
Non-critical: nothing app-facing depends on the obs stack.
Grafana dashboard shows "no data"
Possible causes, in order of frequency:
- New histogram name — query targets a metric the api hasn't emitted
yet. Check
kubectl exec deploy/vmagent -- wget -qO- http://api:8000/metricsfor the metric name. - vmagent isn't scraping (see above).
- Time range is before the obs stack came up (2026-04-25). Adjust the dashboard time picker.
- Cardinality blowup — VM rejected high-label-count series. Check
vm_rows_inserted_totalvsvm_rows_dropped_totalon the obs box.
Networking failures
UFW rule accidentally blocks essential traffic
Symptom: Some specific thing stops working (e.g., api can't reach
Postgres, cross-node pod traffic fails, kubectl times out).
Recovery: log in via SSH (if that still works), sudo ufw status numbered, sudo ufw --force delete <N> to remove offending rule.
If SSH is blocked too: Hetzner console → Rescue mode → mount disk
→ edit /etc/ufw/user.rules.
Flannel broken on one node
Symptom: pods on that node can't reach remote pods via overlay. ClusterIP Services involving cross-node endpoints fail. Recovery: restart kubelet on that node:
ssh deploy@<node> "sudo systemctl restart k3s"
Kube-proxy broken on one node
Symptom: pods on that node can't reach ClusterIPs. Symptoms look like DNS resolution succeeded but connection refused or timed out. Recovery: same as Flannel — restart k3s on the node.
Application-level
api pod OOM
Symptom: pod gets killed, kubelet restarts it. User's request returns 502 briefly; subsequent requests routed to healthy pods. Readiness probe removes the OOMing pod from Service endpoints. Recovery: automatic (pod restarts). If it keeps OOMing:
- Increase
resources.limits.memoryin the deployment - Or debug the memory leak Check:
kubectl describe pod -n honeydue <pod> | grep -i oom
kubectl logs -n honeydue <pod> --previous
api pod panics
Symptom: goroutine panic kills the process. Kubelet restarts. Similar user impact to OOM. Recovery: automatic restart. But if the panic is deterministic (same input → panic), the pod crashloops. Action: read the logs, find the panic stack trace, fix the code, deploy. Circuit-breaker scenario: if all 3 api pods crashloop on startup because of bad code, kubectl rollout undo to previous revision.
api deadlocks
Symptom: all 3 pods are up, readiness passes (shallow probe), but real requests time out or hang. Recovery: liveness probe is the same endpoint as readiness, so it won't help. You'll see gradually increasing 504s at the edge. Manual intervention:
kubectl rollout restart deployment/api -n honeydue
admin pod crashes
Symptom: 502 at Cloudflare when accessing admin.myhoneydue.com. Recovery: k8s auto-restarts. Usually within 10-30s. Impact: only admins lose access; user-facing api is unaffected.
worker stops processing jobs
Symptom: emails stop being sent, cron jobs stop firing. Detection: no direct alert; need to notice via user feedback or missing daily-digest emails. Or check Redis for queue backlog. Recovery:
kubectl rollout restart deployment/worker -n honeydue
If persistent: check logs for specific error:
kubectl logs -n honeydue deploy/worker --tail=100
redis pod dies + node is different
Symptom: Redis schedules to a new node, but the PVC is on the original node (local-path is per-node). New Redis pod comes up but finds an empty data directory (or can't mount at all). Recovery:
- If the original node is still alive but Redis pod died: pod comes back up on same node with data intact
- If the original node is gone: Redis starts empty. Cache regenerates. Asynq queue state is lost; pending jobs re-queue on retry, cron fires re-schedule on next tick.
- Auth caches (token + residence-IDs) regenerate on first user request — first request per user pays full DB lookup, then warm again. Visible as a brief latency spike in the Grafana RED dashboard, not a functional failure.
- Ensure the node label
honeydue/redis=trueis on a healthy node:
kubectl label node <new-node> honeydue/redis=true --overwrite
kubectl label node <dead-node> honeydue/redis- 2>/dev/null || true
Stale residence-IDs cache (data freshness bug)
Symptom: a user accepts a share-code or has a residence
removed, but /api/tasks/, /api/documents/, /api/contractors/,
or /api/residences/summary/ continues to show the old
membership for up to 5 minutes.
Cause: a residence-membership-mutating code path landed
without calling cache.InvalidateResidenceIDsForUsers(...). The
cache TTL is 5 min so the issue self-heals, but it's user-visible.
Recovery (immediate): flush the affected user's cache key
manually. See Chapter 17 §residence-IDs cache invalidation.
Prevention (permanent): every mutation that changes
residence_residence.owner_id, residence_residence_users.user_id,
or deletes a residence MUST invalidate. Existing call sites for
reference: CreateResidence (owner), DeleteResidence
(all members), JoinWithCode (joining user), RemoveUser
(removed user). The pattern lives in
internal/services/residence_id_cache.go.
Redis at maxmemory limit
Symptom: Redis logs OOM command not allowed when used memory > 'maxmemory'.
Should be rare — current production usage is ~2.4 MB against a 256 MB
limit and the policy is allkeys-lru (cache writes evict cold keys
instead of erroring).
Recovery: confirm the policy is still allkeys-lru:
kubectl -n honeydue exec deploy/redis -- redis-cli CONFIG GET maxmemory-policy
If it's somehow noeviction, set it live:
kubectl -n honeydue exec deploy/redis -- redis-cli CONFIG SET maxmemory-policy allkeys-lru
And re-apply the manifest at deploy-k3s/manifests/redis/deployment.yaml
so the change survives a pod restart.
If memory usage is genuinely climbing toward the cap, check for runaway keys without TTLs:
kubectl -n honeydue exec deploy/redis -- redis-cli --bigkeys
External service failures
Neon Postgres outage
Symptom: api logs fill with "failed to connect to database." All mutating API calls fail. Reads from cache continue (via Redis) but eventually cache expires. Recovery: no action from us; Neon's problem. Users will see 5xx until Neon is back. Mitigation for future: multi-region Neon read replica, or Postgres-level failover. Frequency: Neon has had a handful of hours-scale outages since launch.
Neon pooler endpoint unreachable but direct endpoint up
Symptom: dial tcp ep-floral-truth-amttbc5a-pooler.c-5...: i/o timeout in api logs but the direct compute endpoint is reachable.
Rare — Neon's pooler runs in their infra alongside compute — but
possible during pooler maintenance.
Recovery (emergency): switch DB_HOST in config.yaml from the
-pooler to the direct hostname (drop the -pooler segment),
re-apply ConfigMap, rolling-restart api and worker:
# Edit deploy-k3s/config.yaml: database.host: ep-floral-truth-amttbc5a.c-5...
# Then:
KUBECONFIG=~/.kube/honeydue.yaml bash deploy-k3s/scripts/03-deploy.sh --skip-build
Cold-handshake latency goes back up (~440ms first hit) but the API keeps serving. Switch back when the pooler recovers.
Migrate Job fails during deploy
Symptom: 03-deploy.sh aborts at the migrations step:
[deploy][error] migrations did not complete cleanly; aborting deploy
api/worker pods are NOT updated — they keep running the previous revision. This is the intentional fail-fast.
Recovery:
# 1. See the failure
kubectl -n honeydue logs job/honeydue-migrate --tail=200
# 2. Common cause: a SQL error in the migration file. Fix the file
# locally, commit, retry the deploy. The Job is idempotent —
# successful prior versions stay applied; only the failed file
# re-runs.
git add migrations/000NNN_*.sql
git commit -m "Fix migration NNN"
git push gitea master
bash deploy-k3s/scripts/03-deploy.sh
# 3. Other cause: Neon down or auth changed. Test direct connection:
DB_PASS=$(kubectl -n honeydue get secret honeydue-secrets \
-o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d)
docker run --rm -e PGPASSWORD="$DB_PASS" postgres:17-alpine \
psql "host=ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
user=neondb_owner dbname=honeyDue sslmode=require" -c "SELECT 1;"
Why no automatic retry: backoffLimit: 0 on the Job is deliberate.
A failing migration almost never gets unstuck by retrying — needs an
operator to look. See Chapter 17 §27 for recovery
playbook.
api refuses to start: "Schema precondition failed"
Symptom: api pods log Schema precondition failed and exit
immediately after DB connect.
Cause: goose_db_version table is missing or its latest row has
is_applied=false. Means the migrate Job either was never run or
ran and rolled back.
Recovery: run the migrate Job manually (see
Chapter 17 §26). After it completes successfully,
delete the failing api pods so they restart with a fresh schema check:
kubectl -n honeydue rollout restart deploy/api
Backblaze B2 outage
Symptom: image uploads fail; image downloads fail unless cached by CF. Recovery: wait. B2 rarely goes down. Mitigation: serve downloads via CF with long cache TTL — most users won't notice brief B2 outages for read traffic.
Fastmail SMTP unreachable
Symptom: worker can't send transactional emails. Jobs retry per
Asynq's retry policy, eventually giving up and logging an error.
Recovery: automatic retry; wait for Fastmail to come back.
Manual intervention: re-enqueue jobs from the Asynq UI (we don't
expose it yet — future).
Gitea registry unreachable
Symptom: kubectl rollout stuck at "Pulling image" for new pods.
Existing pods continue running with their already-pulled images.
Recovery: wait for Gitea to come back.
Mitigation: K8s has imagePullPolicy: IfNotPresent by default on
SHA-tagged images, so images aren't re-pulled on every restart if
the node already has them cached.
Cloudflare DNS failure
See §CF failures above.
Combined failures
"Everything is slow"
Most often = Neon is being hammered by our load + someone else's noisy neighbor.
- Check
kubectl top pods(are we CPU-bound?) - Check Neon console for query performance
- Check CF analytics for traffic spikes
"Some users see 502, others don't"
Usually one node has an unhealthy Traefik or api. Cloudflare routes some connections to it, others to healthy nodes.
kubectl get pods -n kube-system -l app.kubernetes.io/name=traefikkubectl get pods -n honeydue -l app.kubernetes.io/name=api- Check per-pod logs
"It worked 5 minutes ago, now it doesn't"
Something recent changed. Check:
- Recent deploys:
kubectl rollout history deployment/api -n honeydue - Recent manifest changes:
kubectl get events -A --sort-by=.lastTimestamp | tail -30 - External: Cloudflare Status page, Neon Status page, Backblaze Status page
Planned outages
Node upgrades (OS patches)
# Drain the node (evict pods, block scheduling)
kubectl drain ubuntu-8gb-nbg1-1 --ignore-daemonsets --delete-emptydir-data
# SSH in, upgrade, reboot
ssh deploy@hetzner2 "sudo apt update && sudo apt upgrade -y && sudo reboot"
# Wait for node to come back
watch kubectl get nodes
# Uncordon
kubectl uncordon ubuntu-8gb-nbg1-1
During the drain, pods from that node reschedule to the survivors. With current workload (api: 3 replicas, everything else: 1), rescheduling 1 api pod is fine. Traffic loss: zero.
Worker pod or Redis pod scheduled on the drained node would be briefly unavailable during reschedule. Acceptable for planned windows.
k3s upgrades
Same per-node drain + upgrade pattern, but with k3s-specific install:
# On the node
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.35.x+k3s1 sh -s - server
# k3s detects existing install and upgrades in place
Do one node at a time. Verify cluster health between each.
Disaster recovery
Complete cluster loss
Procedure:
- Provision 3 new Hetzner CX33 nodes (or use existing if healthy)
- Follow bootstrap procedure (Chapter 1 §node hardening)
- Install k3s on each (Chapter 2 §HA architecture)
- Configure kubeconfig
- Apply all manifests:
kubectl apply -f deploy-k3s/manifests/namespace.yaml kubectl apply -f deploy-k3s/manifests/rbac.yaml kubectl apply -f deploy-k3s/manifests/traefik-helmchartconfig.yaml # Wait for Traefik to redeploy # ... recreate secrets (see Chapter 10) ... # ... apply rest of manifests ... - Update DNS if node IPs changed
- Verify: curl https://api.myhoneydue.com/api/health/
Estimated time: 1-2 hours if you've done it before. A lot of context-switching between Hetzner console, SSH, kubectl, and CF.
Neon data is untouched by any of this. B2 data is untouched. Only state that's lost: Redis cache (regenerates) and any in-flight Asynq jobs that were mid-processing.