16 — Failure Modes

Summary

Every component in the system has a failure mode, a user-visible symptom, and a recovery story. This chapter enumerates them from the edge inward. Use this as a reference when debugging or when planning resilience improvements.

Failure catalog

Cloudflare-level

CF edge POP outage

Symptom: users in one geographic region see errors; other regions fine. Recovery: automatic — CF routes traffic to next-nearest POP. Our action: none; wait for CF. Frequency: rare, usually resolved in minutes.

CF global outage (rare but has happened)

Symptom: the whole site is unreachable via CF. Recovery: manual. Disable the CF proxy (grey-cloud the DNS records) so users hit the origins directly. Our action: in the Cloudflare dashboard, turn the proxy off on each A record. Users then resolve to our node IPs directly; UFW allows :80/:443 from anywhere, so they reach Traefik. HTTPS breaks because the origin has no certificate (in SSL Flexible mode CF terminates TLS and talks plain HTTP to the origin), but HTTP works. Frequency: extremely rare (an hours-long event happens roughly once a year).

DNS hijacking

Symptom: users' DNS queries return attacker IPs; all traffic is compromised. Mitigation: none at our level. A hijack at CF itself is unlikely, and users on DoH/DoT are protected from tampering between them and their resolver. Recovery: requires CF incident response.

Node-level

One node's NIC fails

Symptom: Cloudflare's retry logic routes around the node within seconds. Users see a brief latency spike while CF learns the IP is unhealthy. Kubernetes marks the node NotReady after node-monitor-grace-period (40s) and then reschedules its pods onto the surviving nodes. Recovery:

  • Automatic pod rescheduling takes ~5 min (grace period + pod eviction)
  • Dead node's Raft vote is missing; cluster stays up (2 of 3 quorum)
  • Replace the node via Hetzner console when convenient

Our action: verify kubectl get nodes shows NotReady; check Hetzner console to confirm the node's status; recreate if needed.

Two nodes fail simultaneously

Symptom: Raft loses quorum. The Kubernetes API server rejects writes. Existing pods keep running, but nothing new can be scheduled or updated. The surviving node's pods continue serving traffic. Recovery:

  • If a failed node comes back, quorum restores as soon as it rejoins and a leader is elected (seconds to minutes)
  • If the failed nodes are truly gone, the cluster is broken and must be rebuilt

Rebuild procedure: on the surviving node, run k3s-killall.sh, then bootstrap a new 3-node cluster from scratch. Data in Neon/B2 is safe; Redis state is lost.
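
The quorum arithmetic behind the one-node and two-node cases can be sketched as below. This is only an illustration of the majority rule for a Raft cluster; the function names are hypothetical and nothing here talks to the cluster.

```shell
# Majority needed for an N-member Raft/etcd cluster: floor(N/2) + 1.
quorum_needed() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds (exit 0) when ALIVE members out of TOTAL still form a majority.
has_quorum() {
  alive=$1 total=$2
  [ "$alive" -ge "$(quorum_needed "$total")" ]
}

quorum_needed 3                                   # prints 2
has_quorum 2 3 && echo "2 of 3: quorum held"
has_quorum 1 3 || echo "1 of 3: quorum lost"
```

With three members the majority is two, which is why one node can fail without taking the control plane down, and two cannot.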

All three nodes fail simultaneously

Symptom: full site outage. Recovery: rebuild the cluster from scratch. Frequency: Hetzner-region-wide outage, extremely rare.

Node disk fills up

Symptom: pods get evicted ("node is disk-pressure"). New pods can't be scheduled on that node. Common causes: container log buildup (containerd rotates logs at 10 MB per container, but across dozens of pod churn cycles the total still adds up), a local-path PVC filling up, or the apt cache. Recovery:

ssh deploy@<node> "sudo df -h; sudo du -sh /var/lib/rancher/* | sort -h"
# Then clean up
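
To spot which filesystem is actually full before cleaning anything, a small filter over df output can help. The sketch below is a hypothetical helper, not part of the deploy scripts; the 85% default is an arbitrary assumption, and the cleanup commands in the trailing comments are common candidates rather than a prescribed procedure.

```shell
# Print mount points whose usage is at or above a threshold (default 85%).
# Reads `df -P` output on stdin. The 85% default is an arbitrary choice.
full_mounts() {
  threshold=${1:-85}
  awk -v t="$threshold" 'NR > 1 {
    use = $5; sub(/%/, "", use)          # "91%" -> "91"
    if (use + 0 >= t) print $6, $5
  }'
}

# Usage on a node (not run here):
#   ssh deploy@<node> "df -P" | full_mounts 85
#
# Typical cleanup candidates once the culprit is known (assumptions, pick
# what matches the cause):
#   sudo k3s crictl rmi --prune         # remove unused container images
#   sudo journalctl --vacuum-size=200M  # trim the systemd journal
#   sudo apt-get clean                  # drop the apt package cache
```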

k3s control plane failures

etcd corruption on one node

Symptom: Raft detects divergence; that node stops serving writes. Recovery: remove the node from the cluster, rejoin. Etcd snapshot is pulled from surviving peers automatically.

CoreDNS down

Symptom: pods can't resolve Service names. New TCP connections fail; existing connections continue (they already resolved). Typical manifestation: "DB connection failed — no such host" errors. Recovery: k3s automatically restarts CoreDNS pod. If it keeps crashing:

kubectl logs -n kube-system deploy/coredns --previous
kubectl rollout restart deployment/coredns -n kube-system

Frequency: rare.

metrics-server down

Symptom: kubectl top returns an error; HPAs can't scale. Recovery: restart metrics-server pod. Non-critical; service stays up.

kubectl rollout restart deployment/metrics-server -n kube-system

vmagent can't reach obs.88oakapps.com

Symptom: dashboards stop updating; vmagent logs show 401 / TLS / network errors against obs.88oakapps.com. App is unaffected. Recovery: vmagent buffers up to 512 MB locally and replays on reconnect, so brief outages self-heal. If sustained:

# Is the obs endpoint up?
curl -s -o /dev/null -w "%{http_code}\n" https://obs.88oakapps.com/health \
  -H "Authorization: Bearer $(grep ^OBS_INGEST_TOKEN= deploy/prod.env | cut -d= -f2)"
# 200 = ingest endpoint healthy.

# Inspect vmagent's failure metric
kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://127.0.0.1:8429/metrics \
  | grep -E "remotewrite_(packets|samples)_dropped|persistentqueue_blocks_dropped"

# Restart vmagent (forces config reload + drains queue)
kubectl -n honeydue rollout restart deploy/vmagent

If 88oakappsUpdate itself is down (PostHog runs there too): SSH and check sudo docker compose -f /opt/honeydue-obs/docker-compose.yml ps. Non-critical: nothing app-facing depends on the obs stack.

Grafana dashboard shows "no data"

Possible causes, in order of frequency:

  1. New histogram name: the query targets a metric the api hasn't emitted yet. Check kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://api:8000/metrics for the metric name.
  2. vmagent isn't scraping (see above).
  3. Time range is before the obs stack came up (2026-04-25). Adjust the dashboard time picker.
  4. Cardinality blowup — VM rejected high-label-count series. Check vm_rows_inserted_total vs vm_rows_dropped_total on the obs box.

Networking failures

UFW rule accidentally blocks essential traffic

Symptom: Some specific thing stops working (e.g., api can't reach Postgres, cross-node pod traffic fails, kubectl times out). Recovery: log in via SSH (if that still works), sudo ufw status numbered, sudo ufw --force delete <N> to remove offending rule. If SSH is blocked too: Hetzner console → Rescue mode → mount disk → edit /etc/ufw/user.rules.

Flannel broken on one node

Symptom: pods on that node can't reach remote pods via overlay. ClusterIP Services involving cross-node endpoints fail. Recovery: restart k3s on that node (Flannel and the kubelet both run inside the k3s process):

ssh deploy@<node> "sudo systemctl restart k3s"

Kube-proxy broken on one node

Symptom: pods on that node can't reach ClusterIPs. It looks like DNS resolution succeeds but the connection is then refused or times out. Recovery: same as Flannel, restart k3s on the node.

Application-level

api pod OOM

Symptom: pod gets killed, kubelet restarts it. User's request returns 502 briefly; subsequent requests routed to healthy pods. Readiness probe removes the OOMing pod from Service endpoints. Recovery: automatic (pod restarts). If it keeps OOMing:

  • Increase resources.limits.memory in the deployment
  • Or debug the memory leak

Check:
kubectl describe pod -n honeydue <pod> | grep -i oom
kubectl logs -n honeydue <pod> --previous
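
When several pods are restarting, it can be quicker to list only the ones whose last termination was an OOM kill. The filter below is a hypothetical helper; the jsonpath in the usage comment inspects only the first container of each pod, which is an assumption about the pod layout.

```shell
# Filter "pod<TAB>reason" lines down to pods whose last termination
# reason was OOMKilled.
oom_killed_pods() {
  awk -F '\t' '$2 == "OOMKilled" { print $1 }'
}

# Usage against the cluster (not run here):
#   kubectl get pods -n honeydue \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}' \
#     | oom_killed_pods
```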

api pod panics

Symptom: goroutine panic kills the process. Kubelet restarts. Similar user impact to OOM. Recovery: automatic restart. But if the panic is deterministic (same input → panic), the pod crashloops. Action: read the logs, find the panic stack trace, fix the code, deploy. Circuit-breaker scenario: if all 3 api pods crashloop on startup because of bad code, kubectl rollout undo to previous revision.

api deadlocks

Symptom: all 3 pods are up, readiness passes (shallow probe), but real requests time out or hang. Recovery: liveness probe is the same endpoint as readiness, so it won't help. You'll see gradually increasing 504s at the edge. Manual intervention:

kubectl rollout restart deployment/api -n honeydue

admin pod crashes

Symptom: 502 at Cloudflare when accessing admin.myhoneydue.com. Recovery: k8s auto-restarts. Usually within 10-30s. Impact: only admins lose access; user-facing api is unaffected.

worker stops processing jobs

Symptom: emails stop being sent, cron jobs stop firing. Detection: no direct alert; we notice via user feedback or missing daily-digest emails, or by checking Redis for a queue backlog. Recovery:

kubectl rollout restart deployment/worker -n honeydue

If persistent: check logs for specific error:

kubectl logs -n honeydue deploy/worker --tail=100
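
Checking Redis for a backlog could look like the sketch below. It assumes Asynq's "asynq:{<queue>}:pending" key layout and a queue named "default"; neither is confirmed by this chapter, so verify the key name against the running Redis before relying on it.

```shell
# Build the Redis key Asynq uses for a queue's pending list.
# Assumption: key layout "asynq:{<queue>}:pending", queue name "default".
asynq_pending_key() {
  printf 'asynq:{%s}:pending\n' "${1:-default}"
}

# Usage (not run here): a steadily growing length suggests the worker
# has stopped consuming.
#   kubectl -n honeydue exec deploy/redis -- \
#     redis-cli LLEN "$(asynq_pending_key default)"
```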

redis pod dies + node is different

Symptom: Redis schedules to a new node, but the PVC is on the original node (local-path is per-node). New Redis pod comes up but finds an empty data directory (or can't mount at all). Recovery:

  • If the original node is still alive but Redis pod died: pod comes back up on same node with data intact
  • If the original node is gone: Redis starts empty. Cache regenerates. Asynq queue state is lost; pending jobs re-queue on retry, cron fires re-schedule on next tick.
  • Auth caches (token + residence-IDs) regenerate on first user request — first request per user pays full DB lookup, then warm again. Visible as a brief latency spike in the Grafana RED dashboard, not a functional failure.
  • Ensure the node label honeydue/redis=true is on a healthy node:
kubectl label node <new-node> honeydue/redis=true --overwrite
kubectl label node <dead-node> honeydue/redis- 2>/dev/null || true

Stale residence-IDs cache (data freshness bug)

Symptom: a user accepts a share-code or has a residence removed, but /api/tasks/, /api/documents/, /api/contractors/, or /api/residences/summary/ continues to show the old membership for up to 5 minutes.

Cause: a code path that mutates residence membership landed without calling cache.InvalidateResidenceIDsForUsers(...). The cache TTL is 5 min, so the issue self-heals, but it's user-visible.

Recovery (immediate): flush the affected user's cache key manually. See Chapter 17 §residence-IDs cache invalidation.

Prevention (permanent): every mutation that changes residence_residence.owner_id or residence_residence_users.user_id, or that deletes a residence, MUST invalidate. Existing call sites for reference: CreateResidence (owner), DeleteResidence (all members), JoinWithCode (joining user), RemoveUser (removed user). The pattern lives in internal/services/residence_id_cache.go.

Redis at maxmemory limit

Symptom: Redis logs OOM command not allowed when used memory > 'maxmemory'. Should be rare — current production usage is ~2.4 MB against a 256 MB limit and the policy is allkeys-lru (cache writes evict cold keys instead of erroring). Recovery: confirm the policy is still allkeys-lru:

kubectl -n honeydue exec deploy/redis -- redis-cli CONFIG GET maxmemory-policy

If it's somehow noeviction, set it live:

kubectl -n honeydue exec deploy/redis -- redis-cli CONFIG SET maxmemory-policy allkeys-lru

And re-apply the manifest at deploy-k3s/manifests/redis/deployment.yaml so the change survives a pod restart.

If memory usage is genuinely climbing toward the cap, check for runaway keys without TTLs:

kubectl -n honeydue exec deploy/redis -- redis-cli --bigkeys
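
To judge whether usage is "genuinely climbing toward the cap", it helps to turn the raw INFO numbers into a percentage. A minimal sketch, assuming the standard used_memory and maxmemory fields of redis-cli INFO memory; the function name is hypothetical.

```shell
# Read `redis-cli INFO memory` output on stdin and print used memory as a
# whole-number percentage of maxmemory (truncated toward zero).
# Prints "unlimited" when maxmemory is 0 or absent.
redis_mem_percent() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) { print "unlimited"; exit }
      printf "%d\n", used * 100 / max
    }'
}

# Usage (not run here):
#   kubectl -n honeydue exec deploy/redis -- redis-cli INFO memory | redis_mem_percent
```

The tr call strips the carriage returns that Redis puts at the end of each INFO line, which would otherwise end up inside the parsed values.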

External service failures

Neon Postgres outage

Symptom: api logs fill with "failed to connect to database." All mutating API calls fail. Reads from cache continue (via Redis) but eventually cache expires. Recovery: no action from us; Neon's problem. Users will see 5xx until Neon is back. Mitigation for future: multi-region Neon read replica, or Postgres-level failover. Frequency: Neon has had a handful of hours-scale outages since launch.

Neon pooler endpoint unreachable but direct endpoint up

Symptom: dial tcp ep-floral-truth-amttbc5a-pooler.c-5...: i/o timeout in api logs but the direct compute endpoint is reachable. Rare — Neon's pooler runs in their infra alongside compute — but possible during pooler maintenance. Recovery (emergency): switch DB_HOST in config.yaml from the -pooler to the direct hostname (drop the -pooler segment), re-apply ConfigMap, rolling-restart api and worker:

# Edit deploy-k3s/config.yaml: database.host: ep-floral-truth-amttbc5a.c-5...
# Then:
KUBECONFIG=~/.kube/honeydue.yaml bash deploy-k3s/scripts/03-deploy.sh --skip-build

Cold-handshake latency goes back up (~440ms first hit) but the API keeps serving. Switch back when the pooler recovers.

Migrate Job fails during deploy

Symptom: 03-deploy.sh aborts at the migrations step:

[deploy][error] migrations did not complete cleanly; aborting deploy

api/worker pods are NOT updated — they keep running the previous revision. This is the intentional fail-fast.

Recovery:

# 1. See the failure
kubectl -n honeydue logs job/honeydue-migrate --tail=200

# 2. Common cause: a SQL error in the migration file. Fix the file
#    locally, commit, retry the deploy. The Job is idempotent —
#    successful prior versions stay applied; only the failed file
#    re-runs.
git add migrations/000NNN_*.sql
git commit -m "Fix migration NNN"
git push gitea master
bash deploy-k3s/scripts/03-deploy.sh

# 3. Other cause: Neon down or auth changed. Test direct connection:
DB_PASS=$(kubectl -n honeydue get secret honeydue-secrets \
  -o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d)
docker run --rm -e PGPASSWORD="$DB_PASS" postgres:17-alpine \
  psql "host=ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
        user=neondb_owner dbname=honeyDue sslmode=require" -c "SELECT 1;"

Why no automatic retry: backoffLimit: 0 on the Job is deliberate. A failing migration almost never gets unstuck by retrying; it needs an operator to look. See Chapter 17 §27 for the recovery playbook.

api refuses to start: "Schema precondition failed"

Symptom: api pods log Schema precondition failed and exit immediately after DB connect. Cause: the goose_db_version table is missing, or its latest row has is_applied=false. This means the migrate Job either never ran or ran and rolled back. Recovery: run the migrate Job manually (see Chapter 17 §26). After it completes successfully, restart the api deployment so the pods come back up with a fresh schema check:

kubectl -n honeydue rollout restart deploy/api
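
To confirm the precondition by hand before restarting, the latest goose_db_version row can be inspected directly. The sketch below is an approximation: the api's actual startup check is not shown in this chapter, so the query and the schema_applied helper are assumptions based on goose's standard table columns (id, version_id, is_applied).

```shell
# Interpret one "version_id|is_applied" row as printed by `psql -At`.
# Exit 0 only when a row exists and is_applied is true ("t").
schema_applied() {
  IFS='|' read -r version applied
  [ -n "$version" ] && [ "$applied" = "t" ]
}

# Usage (not run here; supply the same connection string as for the
# direct-connection test):
#   psql -At "<connection string>" \
#     -c "SELECT version_id, is_applied FROM goose_db_version ORDER BY id DESC LIMIT 1" \
#     | schema_applied && echo "schema ok"
```

An empty result (missing table or no rows) and an is_applied=f row both fail the check, matching the two causes listed above.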

Backblaze B2 outage

Symptom: image uploads fail; image downloads fail unless cached by CF. Recovery: wait. B2 rarely goes down. Mitigation: serve downloads via CF with long cache TTL — most users won't notice brief B2 outages for read traffic.

Fastmail SMTP unreachable

Symptom: worker can't send transactional emails. Jobs retry per Asynq's retry policy, eventually giving up and logging an error. Recovery: automatic retry; wait for Fastmail to come back. Manual intervention: re-enqueue jobs from the Asynq UI (we don't expose it yet — future).

Gitea registry unreachable

Symptom: kubectl rollout stuck at "Pulling image" for new pods. Existing pods continue running with their already-pulled images. Recovery: wait for Gitea to come back. Mitigation: Kubernetes defaults imagePullPolicy to IfNotPresent for any tag other than :latest, which covers our SHA-tagged images, so images aren't re-pulled on every restart if the node already has them cached.

Cloudflare DNS failure

See §Cloudflare-level above.

Combined failures

"Everything is slow"

Most often this means Neon is being hammered by our load plus a noisy neighbor on their side.

  • Check kubectl top pods (are we CPU-bound?)
  • Check Neon console for query performance
  • Check CF analytics for traffic spikes

"Some users see 502, others don't"

Usually one node has an unhealthy Traefik or api. Cloudflare routes some connections to it, others to healthy nodes.

  • kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik
  • kubectl get pods -n honeydue -l app.kubernetes.io/name=api
  • Check per-pod logs

"It worked 5 minutes ago, now it doesn't"

Something recent changed. Check:

  • Recent deploys: kubectl rollout history deployment/api -n honeydue
  • Recent cluster events: kubectl get events -A --sort-by=.lastTimestamp | tail -30
  • External: Cloudflare Status page, Neon Status page, Backblaze Status page

Planned outages

Node upgrades (OS patches)

# Drain the node (evict pods, block scheduling)
kubectl drain ubuntu-8gb-nbg1-1 --ignore-daemonsets --delete-emptydir-data

# SSH in, upgrade, reboot
ssh deploy@hetzner2 "sudo apt update && sudo apt upgrade -y && sudo reboot"

# Wait for node to come back
watch kubectl get nodes

# Uncordon
kubectl uncordon ubuntu-8gb-nbg1-1

During the drain, pods from that node reschedule to the survivors. With current workload (api: 3 replicas, everything else: 1), rescheduling 1 api pod is fine. Traffic loss: zero.

A worker or Redis pod scheduled on the drained node will be briefly unavailable while it reschedules. Acceptable for planned windows.

k3s upgrades

Same per-node drain + upgrade pattern, but with k3s-specific install:

# On the node
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.35.x+k3s1 sh -s - server

# k3s detects existing install and upgrades in place

Do one node at a time. Verify cluster health between each.
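
One way to make "verify cluster health" concrete is to gate each step on every node reporting Ready. The helper below is a hypothetical sketch that only parses kubectl output; it treats a cordoned node (Ready,SchedulingDisabled) as not yet healthy, which is a deliberate assumption so that a forgotten uncordon is caught.

```shell
# Read `kubectl get nodes --no-headers` output on stdin.
# Exit 0 only when at least one node is listed and every STATUS column is
# exactly "Ready" ("NotReady" and "Ready,SchedulingDisabled" both fail).
all_nodes_ready() {
  awk '$2 != "Ready" { bad = 1 } END { exit (NR == 0 || bad) ? 1 : 0 }'
}

# Usage between node upgrades (not run here):
#   until kubectl get nodes --no-headers | all_nodes_ready; do sleep 5; done
```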

Disaster recovery

Complete cluster loss

Procedure:

  1. Provision 3 new Hetzner CX33 nodes (or use existing if healthy)
  2. Follow bootstrap procedure (Chapter 1 §node hardening)
  3. Install k3s on each (Chapter 2 §HA architecture)
  4. Configure kubeconfig
  5. Apply all manifests:
    kubectl apply -f deploy-k3s/manifests/namespace.yaml
    kubectl apply -f deploy-k3s/manifests/rbac.yaml
    kubectl apply -f deploy-k3s/manifests/traefik-helmchartconfig.yaml
    # Wait for Traefik to redeploy
    # ... recreate secrets (see Chapter 10) ...
    # ... apply rest of manifests ...
    
  6. Update DNS if node IPs changed
  7. Verify: curl https://api.myhoneydue.com/api/health/

Estimated time: 1-2 hours if you've done it before. A lot of context-switching between Hetzner console, SSH, kubectl, and CF.

Neon data is untouched by any of this. B2 data is untouched. Only state that's lost: Redis cache (regenerates) and any in-flight Asynq jobs that were mid-processing.

References