# 16 — Failure Modes

## Summary

Every component in the system has a failure mode, a user-visible
symptom, and a recovery story. This chapter enumerates them from the
edge inward. Use this as a reference when debugging or when planning
resilience improvements.

## Failure catalog

### Cloudflare-level

#### CF edge POP outage

**Symptom**: users in one geographic region see errors; other regions
are fine.
**Recovery**: automatic; CF routes traffic to the next-nearest POP.
**Our action**: none; wait for CF.
**Frequency**: rare, usually resolved in minutes.

#### CF global outage (rare but has happened)

**Symptom**: the whole site is unreachable via CF.
**Recovery**: manual; disable the CF proxy (grey-cloud the DNS records) so
users hit the origins directly.
**Our action**: in the Cloudflare dashboard, flip each A record's proxy off.
Users then resolve to our node IPs directly; UFW allows :80/:443 from
anywhere so they reach Traefik. TLS breaks (the origin has no cert in SSL
Flexible mode), but HTTP works.
**Frequency**: rare (an hours-long event happens roughly once a year).

#### DNS hijacking

**Symptom**: users' DNS queries return attacker IPs; all traffic is
compromised.
**Mitigation**: unlikely at CF. Users on DoH/DoT are protected against
on-path tampering with their queries, but not against a compromise of
the zone itself. There is no mitigation at our level.
**Recovery**: requires CF incident response.

### Node-level

#### One node's NIC fails

**Symptom**: Cloudflare's retry logic routes around the node within seconds.
Users see a brief spike in latency as CF learns the IP is unhealthy.
Kubernetes marks the node NotReady after `node-monitor-grace-period`
(40s), then reschedules its pods onto the surviving nodes once they are
evicted.
**Recovery**:
- Automatic pod rescheduling takes ~5 min (the 40s grace period plus the
  default 300s not-ready toleration before eviction)
- The dead node's Raft vote is missing; the cluster stays up (2 of 3 is still a quorum)
- Replace the node via the Hetzner console when convenient

**Our action**: verify that `kubectl get nodes` shows NotReady; check the
Hetzner console to confirm the node's status; recreate it if needed.

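For the verification step, a minimal sketch of a filter that lists only the nodes that are not Ready. The function name `not_ready_nodes` is hypothetical (it is not part of the repo), and it assumes the usual column layout of `kubectl get nodes --no-headers`, where the second column is the status.

```bash
# Hypothetical helper: print the names of nodes whose STATUS column does not
# start with "Ready" (so "Ready,SchedulingDisabled" still counts as Ready).
# Reads `kubectl get nodes --no-headers` output on stdin.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage (needs cluster access):
#   kubectl get nodes --no-headers | not_ready_nodes
```

An empty result means every node is reporting Ready; a cordoned node is deliberately not flagged.
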
#### Two nodes fail simultaneously

**Symptom**: Raft loses quorum. The Kubernetes API server rejects writes.
Existing pods keep running but nothing new can be scheduled or updated.
The single surviving node's pods continue serving traffic.
**Recovery**:
- If either failed node comes back with its etcd data intact, quorum is
  restored and the cluster recovers on its own
- If the failed nodes are truly gone, the cluster is broken and needs a
  rebuild

**Rebuild procedure**: on the surviving node, run `k3s-killall.sh`, then
bootstrap a new 3-node cluster from scratch. Data in Neon/B2 is safe;
Redis state is lost.

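The quorum arithmetic behind "2 of 3" can be sketched as follows. This is an illustration of the majority rule only; the function names `quorum_size` and `has_quorum` are hypothetical and nothing here talks to etcd.

```bash
# quorum_size N: votes needed for a majority in an N-member Raft cluster,
# i.e. floor(N/2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum TOTAL HEALTHY: succeeds if HEALTHY members still form a majority.
has_quorum() {
  [ "$2" -ge "$(quorum_size "$1")" ]
}

# Usage:
#   has_quorum 3 2 && echo "one node down: writes still accepted"
#   has_quorum 3 1 || echo "two nodes down: quorum lost"
```

The same rule shows why a fourth node would not add failure tolerance: a 4-member cluster needs 3 votes, so it still survives only one failure.
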
#### All three nodes fail simultaneously

**Symptom**: full site outage.
**Recovery**: rebuild the cluster from scratch.
**Frequency**: requires a Hetzner-region-wide outage; extremely rare.

#### Node disk fills up

**Symptom**: pods get evicted (the node reports the `DiskPressure`
condition). New pods can't be scheduled on that node.
**Common cause**: container log buildup (containerd rotates at 10 MB
per container, but across dozens of pod churn cycles the total adds up),
a full local-path PVC, or the apt cache.
**Recovery**:
```bash
# Find what is using the space
ssh deploy@<node> "sudo df -h; sudo du -sh /var/lib/rancher/* | sort -h"
# Then clean up, e.g. prune unused images and clear the apt cache
ssh deploy@<node> "sudo k3s crictl rmi --prune; sudo apt-get clean"
```

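To spot a filling disk before the kubelet starts evicting, a small filter over `df -P` output may help. This is a sketch: the function name `full_filesystems` is hypothetical and the 85% threshold in the usage line is an arbitrary example, not a value taken from the cluster configuration.

```bash
# full_filesystems THRESHOLD: read `df -P` output on stdin and print
# "<mountpoint> <use%>" for every filesystem at or above THRESHOLD percent.
# Assumes mount points without spaces (true for the POSIX -P layout on
# typical Linux hosts).
full_filesystems() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)                 # "91%" -> "91"
    if (use + 0 >= limit) print $6, $5
  }'
}

# Usage:
#   ssh deploy@<node> "df -P" | full_filesystems 85
```
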
### k3s control plane failures

#### etcd corruption on one node

**Symptom**: Raft detects divergence; that node stops serving writes.
**Recovery**: remove the node from the cluster and rejoin it. The etcd
snapshot is pulled from the surviving peers automatically.

#### CoreDNS down

**Symptom**: pods can't resolve Service names. New TCP connections
fail; existing connections continue (they already resolved).
Typical manifestation: "DB connection failed: no such host" errors.
**Recovery**: Kubernetes restarts the CoreDNS pod automatically. If it
keeps crashing:
```bash
kubectl logs -n kube-system deploy/coredns --previous
kubectl rollout restart deployment/coredns -n kube-system
```
**Frequency**: rare.

#### metrics-server down

**Symptom**: `kubectl top` returns an error; HPAs can't scale.
**Recovery**: restart the metrics-server pod. Non-critical; the service stays up.
```bash
kubectl rollout restart deployment/metrics-server -n kube-system
```

#### vmagent can't reach obs.88oakapps.com

**Symptom**: dashboards stop updating; vmagent logs show 401 / TLS /
network errors against `obs.88oakapps.com`. The app is unaffected.
**Recovery**: vmagent buffers up to 512 MB locally and replays on
reconnect, so brief outages self-heal. If the outage is sustained:
```bash
# Is the obs endpoint up? (cut -f2- keeps any "=" inside the token)
curl -s -o /dev/null -w "%{http_code}\n" https://obs.88oakapps.com/health \
  -H "Authorization: Bearer $(grep ^OBS_INGEST_TOKEN= deploy/prod.env | cut -d= -f2-)"
# 200 = ingest endpoint healthy.

# Inspect vmagent's failure metrics
kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://127.0.0.1:8429/metrics \
  | grep -E "remotewrite_(packets|samples)_dropped|persistentqueue_blocks_dropped"

# Restart vmagent (forces config reload + drains queue)
kubectl -n honeydue rollout restart deploy/vmagent
```
**If 88oakappsUpdate itself is down** (PostHog runs there too):
SSH in and check `sudo docker compose -f /opt/honeydue-obs/docker-compose.yml ps`.
**Non-critical**: nothing app-facing depends on the obs stack.

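The status code printed by the health-check `curl` narrows down which layer is failing. A rough sketch of that triage follows; the function name `explain_obs_status` is hypothetical, and the 403 and 5xx interpretations are assumptions about how the proxy in front of the obs stack behaves rather than documented behavior. The `000` case relies on curl printing `000` for `%{http_code}` when it never received an HTTP response.

```bash
# explain_obs_status CODE: map the curl %{http_code} value from the health
# check to the most likely failing layer.
explain_obs_status() {
  case "$1" in
    200)     echo "healthy" ;;
    401|403) echo "auth rejected: check OBS_INGEST_TOKEN" ;;   # 403 is an assumption
    000)     echo "no HTTP response: DNS, TLS, or network" ;;
    5??)     echo "server-side error: obs stack or proxy" ;;   # assumption
    *)       echo "unexpected status $1" ;;
  esac
}

# Usage:
#   explain_obs_status "$(curl -s -o /dev/null -w '%{http_code}' https://obs.88oakapps.com/health)"
```
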
#### Grafana dashboard shows "no data"

**Possible causes, in order of frequency**:
1. New histogram name: the query targets a metric the api hasn't emitted
   yet. Check `kubectl -n honeydue exec deploy/vmagent -- wget -qO- http://api:8000/metrics`
   for the metric name.
2. vmagent isn't scraping (see above).
3. The time range is before the obs stack came up (2026-04-25). Adjust
   the dashboard time picker.
4. Cardinality blowup: VM rejected high-label-count series. Check
   `vm_rows_inserted_total` vs `vm_rows_dropped_total` on the obs box.

### Networking failures

#### UFW rule accidentally blocks essential traffic

**Symptom**: some specific thing stops working (e.g., api can't reach
Postgres, cross-node pod traffic fails, kubectl times out).
**Recovery**: log in via SSH (if that still works), run
`sudo ufw status numbered`, then `sudo ufw --force delete <N>` to remove
the offending rule.
**If SSH is blocked too**: Hetzner console → Rescue mode → mount disk
→ edit `/etc/ufw/user.rules`.

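When more than one rule is suspect, it helps to list the candidates before deleting anything. A minimal sketch, assuming the usual `[ N] <to> <action> <from>` layout of `ufw status numbered`; the function name `deny_rule_numbers` is hypothetical. It prints the highest number first because ufw renumbers the remaining rules after each delete, so deleting from the bottom up keeps the earlier numbers valid.

```bash
# deny_rule_numbers: read `sudo ufw status numbered` output on stdin and print
# the rule numbers of DENY/REJECT rules, highest first.
deny_rule_numbers() {
  sed -nE 's/^\[ *([0-9]+)\].*(DENY|REJECT).*/\1/p' | sort -rn
}

# Usage (review the list before deleting anything):
#   sudo ufw status numbered | deny_rule_numbers
```
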
#### Flannel broken on one node

**Symptom**: pods on that node can't reach remote pods via the overlay.
ClusterIP Services involving cross-node endpoints fail.
**Recovery**: restart k3s on that node (flannel runs inside the k3s
process, so this restarts it along with the kubelet):
```bash
ssh deploy@<node> "sudo systemctl restart k3s"
```

#### Kube-proxy broken on one node

**Symptom**: pods on that node can't reach ClusterIPs. It looks like
DNS resolution succeeded but the connection was refused or timed out.
**Recovery**: same as Flannel; restart k3s on the node.

### Application-level

#### api pod OOM

**Symptom**: the container is OOM-killed and the kubelet restarts it. The
in-flight request returns 502 briefly; subsequent requests are routed to
healthy pods. The readiness probe keeps the restarting pod out of the
Service endpoints until it is ready again.
**Recovery**: automatic (pod restarts). If it keeps OOMing:
- Increase `resources.limits.memory` in the deployment
- Or debug the memory leak

**Check**:
```bash
kubectl describe pod -n honeydue <pod> | grep -i oom
kubectl logs -n honeydue <pod> --previous
```

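As an alternative to grepping the `describe` output, the last termination reason can be read directly from the pod status. A sketch, assuming the pod lives in the `honeydue` namespace as in the commands above; the function names `last_termination_reason` and `was_oom_killed` are hypothetical.

```bash
# last_termination_reason POD: print why the pod's containers last terminated
# (e.g. OOMKilled, Error, Completed). Empty if no container has restarted yet.
last_termination_reason() {
  kubectl get pod "$1" -n honeydue \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}

# was_oom_killed POD: succeeds if any container's last termination was an OOM kill.
was_oom_killed() {
  case "$(last_termination_reason "$1")" in
    *OOMKilled*) return 0 ;;
    *)           return 1 ;;
  esac
}

# Usage:
#   was_oom_killed <pod> && echo "raise the memory limit or look for a leak"
```
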
#### api pod panics

**Symptom**: a goroutine panic kills the process. The kubelet restarts it.
User impact is similar to OOM.
**Recovery**: automatic restart. But if the panic is deterministic
(same input → panic), the pod crashloops.
**Action**: read the logs, find the panic stack trace, fix the code,
deploy.
**Circuit-breaker scenario**: if all 3 api pods crashloop on startup
because of bad code, roll back to the previous revision with
`kubectl rollout undo deployment/api -n honeydue`.

#### api deadlocks

**Symptom**: all 3 pods are up and readiness passes (shallow probe), but
real requests time out or hang.
**Recovery**: the liveness probe hits the same endpoint as readiness, so
it won't help. You'll see gradually increasing 504s at the edge. Manual
intervention:
```bash
kubectl rollout restart deployment/api -n honeydue
```

#### admin pod crashes

**Symptom**: 502 at Cloudflare when accessing admin.myhoneydue.com.
**Recovery**: Kubernetes restarts it automatically, usually within 10-30s.
**Impact**: only admins lose access; the user-facing api is unaffected.

#### worker stops processing jobs

**Symptom**: emails stop being sent, cron jobs stop firing.
**Detection**: no direct alert; we have to notice via user feedback or
missing daily-digest emails, or check Redis for a queue backlog.
**Recovery**:
```bash
kubectl rollout restart deployment/worker -n honeydue
```
**If persistent**: check the logs for the specific error:
```bash
kubectl logs -n honeydue deploy/worker --tail=100
```

#### redis pod dies + node is different

**Symptom**: Redis schedules to a new node, but the PVC is on the
original node (local-path is per-node). The new Redis pod comes up but
finds an empty data directory (or can't mount at all).
**Recovery**:
- If the original node is still alive and only the Redis pod died: the pod
  comes back up on the same node with data intact
- If the original node is gone: Redis starts empty. The cache regenerates.
  Asynq queue state is lost; pending jobs are re-queued on retry, and cron
  jobs are scheduled again on the next tick.
- Ensure the node label `honeydue/redis=true` is on a healthy node:
  ```bash
  kubectl label node <new-node> honeydue/redis=true --overwrite
  kubectl label node <dead-node> honeydue/redis- 2>/dev/null || true
  ```

### External service failures

#### Neon Postgres outage

**Symptom**: api logs fill with "failed to connect to database." All
mutating API calls fail. Cached reads continue (via Redis) until the
cache entries expire.
**Recovery**: no action from us; it's Neon's problem. Users will see 5xx
until Neon is back.
**Mitigation for future**: a multi-region Neon read replica, or
Postgres-level failover.
**Frequency**: Neon has had a handful of hours-scale outages since launch.

#### Backblaze B2 outage

**Symptom**: image uploads fail; image downloads fail unless cached by
CF.
**Recovery**: wait. B2 rarely goes down.
**Mitigation**: serve downloads via CF with a long cache TTL; most
users won't notice brief B2 outages for read traffic.

#### Fastmail SMTP unreachable

**Symptom**: `worker` can't send transactional emails. Jobs retry per
Asynq's retry policy, eventually giving up and logging an error.
**Recovery**: automatic retry; wait for Fastmail to come back.
**Manual intervention**: re-enqueue jobs from the Asynq UI (we don't
expose it yet; future work).

#### Gitea registry unreachable

**Symptom**: `kubectl rollout` is stuck at "Pulling image" for new pods.
Existing pods continue running with their already-pulled images.
**Recovery**: wait for Gitea to come back.
**Mitigation**: Kubernetes defaults `imagePullPolicy` to `IfNotPresent`
for any tag other than `:latest`, which covers our SHA-tagged images, so
images aren't re-pulled on every restart if the node already has them
cached.

#### Cloudflare DNS failure

See §Cloudflare-level above.

## Combined failures

### "Everything is slow"

Most often, Neon is being hammered by our own load plus a noisy
neighbor.
- Check `kubectl top pods` (are we CPU-bound?)
- Check the Neon console for query performance
- Check CF analytics for traffic spikes

### "Some users see 502, others don't"

Usually one node has an unhealthy Traefik or api pod. Cloudflare routes
some connections to it, others to healthy nodes.
- `kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik`
- `kubectl get pods -n honeydue -l app.kubernetes.io/name=api`
- Check per-pod logs

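To tie an unhealthy pod back to its node quickly, the `-o wide` listing can be filtered for pods whose READY count is short. A sketch only: the function name `unready_pods` is hypothetical, and it assumes the standard wide layout in which NODE is the third column from the end (before NOMINATED NODE and READINESS GATES).

```bash
# unready_pods: read `kubectl get pods --no-headers -o wide` on stdin and
# print "<pod> <node>" for every pod whose READY column (e.g. 0/1) shows
# fewer ready containers than total. NODE is taken as the third field from
# the end, so a RESTARTS value like "7 (2m ago)" does not shift it.
unready_pods() {
  awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1, $(NF - 2) }'
}

# Usage:
#   kubectl get pods -n honeydue -l app.kubernetes.io/name=api --no-headers -o wide | unready_pods
```

If every unready pod reports the same node, suspect that node (Traefik, flannel, or kube-proxy) before suspecting the application.
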
### "It worked 5 minutes ago, now it doesn't"

Something recent changed. Check:
- Recent deploys: `kubectl rollout history deployment/api -n honeydue`
- Recent cluster events: `kubectl get events -A --sort-by=.lastTimestamp | tail -30`
- External: Cloudflare Status page, Neon Status page, Backblaze Status page

## Planned outages

### Node upgrades (OS patches)

```bash
# Drain the node (evict pods, block scheduling)
kubectl drain ubuntu-8gb-nbg1-1 --ignore-daemonsets --delete-emptydir-data

# SSH in, upgrade, reboot
ssh deploy@hetzner2 "sudo apt update && sudo apt upgrade -y && sudo reboot"

# Wait for node to come back
watch kubectl get nodes

# Uncordon
kubectl uncordon ubuntu-8gb-nbg1-1
```

During the drain, pods from that node reschedule to the survivors.
With the current workload (api: 3 replicas, everything else: 1), rescheduling
1 api pod is fine. Traffic loss: zero.

A worker or Redis pod scheduled on the drained node will be
briefly unavailable during the reschedule. Acceptable for planned windows.

### k3s upgrades

Same per-node drain + upgrade pattern, but with the k3s-specific install:

```bash
# On the node
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.35.x+k3s1 sh -s - server

# k3s detects existing install and upgrades in place
```

Do one node at a time. Verify cluster health between each.

## Disaster recovery

### Complete cluster loss

Procedure:
1. Provision 3 new Hetzner CX33 nodes (or use existing if healthy)
2. Follow bootstrap procedure (Chapter 1 §node hardening)
3. Install k3s on each (Chapter 2 §HA architecture)
4. Configure kubeconfig
5. Apply all manifests:
   ```bash
   kubectl apply -f deploy-k3s/manifests/namespace.yaml
   kubectl apply -f deploy-k3s/manifests/rbac.yaml
   kubectl apply -f deploy-k3s/manifests/traefik-helmchartconfig.yaml
   # Wait for Traefik to redeploy
   # ... recreate secrets (see Chapter 10) ...
   # ... apply rest of manifests ...
   ```
6. Update DNS if node IPs changed
7. Verify: `curl https://api.myhoneydue.com/api/health/`

Estimated time: **1-2 hours** if you've done it before. Expect a lot of
context-switching between the Hetzner console, SSH, kubectl, and CF.

Neon data is untouched by any of this. B2 data is untouched. The only
state that's lost: the Redis cache (regenerates) and any in-flight Asynq
jobs that were mid-processing.

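The final verification rarely passes on the first try, because DNS and Traefik take a while to settle after a rebuild. A minimal polling sketch follows; the function name `wait_healthy` and the default of 30 attempts at 10-second intervals are assumptions for illustration, not values from the deploy scripts.

```bash
# wait_healthy URL [ATTEMPTS] [DELAY_SECONDS]: poll URL until it returns
# HTTP 200, or give up after ATTEMPTS tries. The body runs in a subshell so
# its variables do not leak into the caller.
wait_healthy() (
  url="$1"; attempts="${2:-30}"; delay="${3:-10}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # curl exits non-zero on connection failure; keep looping in that case.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url") || true
    if [ "$code" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    echo "attempt $i/$attempts: got ${code:-no response}" >&2
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
)

# Usage:
#   wait_healthy https://api.myhoneydue.com/api/health/ || echo "still unhealthy"
```
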
## References

- [Kubernetes pod lifecycle][lifecycle]
- [K3s HA recovery][k3s-ha-recovery]
- [Hetzner rescue system][hetzner-rescue]

[lifecycle]: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
[k3s-ha-recovery]: https://docs.k3s.io/datastore/ha-embedded#new-cluster-with-embedded-db
[hetzner-rescue]: https://docs.hetzner.com/cloud/servers/getting-started/enabling-rescue-system/