Migrate prod deploy from Swarm to K3s; add full deployment book

Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do() callback (was causing 'unlock of unlocked mutex' fatal after Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self' + allowlist fonts.googleapis.com so the marketing landing page CSS actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce images runnable on x86_64 Hetzner nodes; fix array expansion under set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use top-level aliases (the '\${X_SECRET}' form never actually resolved); dozzle ports: long-form host_ip is rejected by Swarm, switched to short-form (bound to 0.0.0.0 with UFW-based loopback restriction); worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/' (Next.js serves at root; /admin/ returned 404 and killed pods); startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable 1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold 12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot; real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/ and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet + hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations, consolidated citations
- README.md: Production Deployment section replaced with pointer to the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 07:20:21 -05:00
parent 4ec4bbbfe8
commit 6f303dbbaa
46 changed files with 9785 additions and 93 deletions
@@ -0,0 +1,369 @@
+# 17 — Operator Runbook
+
+## Summary
+
+Common procedures the operator runs. Each is a numbered sequence of
+exact commands. If a step is unclear, add a comment; if a procedure
+fails in an unexpected way, add the symptom + fix to this document.
+
+## Environment setup
+
+Every command assumes:
+
+```bash
+export KUBECONFIG=~/.kube/honeydue-k3s.yaml
+cd /Users/treyt/Desktop/code/honeyDue/honeyDueAPI-go
+```
+
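+If you see "Unable to connect to the server" or "The connection to the
+server localhost:8080 was refused", `KUBECONFIG` is probably unset in
+this shell, or the API server is unreachable.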
+
+## 1. Check cluster health
+
+```bash
+kubectl get nodes                  # all 3 Ready?
+kubectl get pods -A | grep -vE 'Running|Completed'  # anything not running?
+kubectl top nodes                  # resource usage
+kubectl get events -A --sort-by=.lastTimestamp | tail -20
+```
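
The `grep -vE 'Running|Completed'` filter above does not catch pods that report `Running` but are not fully ready (for example `0/1`). A minimal sketch of a stricter filter follows; `unhealthy_pods` is a hypothetical helper, not something in the repo, and it assumes the default column order of `kubectl get pods -A --no-headers` (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: reads `kubectl get pods -A --no-headers` output on
# stdin and prints every pod that is neither Completed nor fully ready.
# Assumed columns: NAMESPACE NAME READY STATUS RESTARTS AGE
unhealthy_pods() {
  awk '{
    split($3, ready, "/")          # READY looks like "1/1"
    if ($4 == "Completed") next    # finished Jobs are fine
    if ($4 != "Running" || ready[1] != ready[2])
      print $1 "/" $2, $4, $3
  }'
}
```

Usage against the live cluster would be `kubectl get pods -A --no-headers | unhealthy_pods`; empty output means nothing needs attention.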
+
+## 2. Deploy new code
+
+### Full deploy (all three services)
+
+```bash
+SHA=$(git rev-parse --short HEAD)
+
+# Login
+set -a; source deploy/registry.env; set +a
+printf '%s' "$REGISTRY_TOKEN" | \
+  docker login "$REGISTRY" -u "$REGISTRY_USERNAME" --password-stdin
+
+# Build
+docker buildx build --platform linux/amd64 --target api \
+  -t "gitea.treytartt.com/admin/honeydue-api:${SHA}" --push .
+docker buildx build --platform linux/amd64 --target worker \
+  -t "gitea.treytartt.com/admin/honeydue-worker:${SHA}" --push .
+docker buildx build --platform linux/amd64 --target admin \
+  -t "gitea.treytartt.com/admin/honeydue-admin:${SHA}" --push .
+
+# Apply
+for svc in api worker admin; do
+  kubectl set image deployment/$svc -n honeydue \
+    "$svc=gitea.treytartt.com/admin/honeydue-${svc}:${SHA}"
+done
+
+# Watch
+for svc in api worker admin; do
+  kubectl rollout status -n honeydue deployment/$svc
+done
+
+# Logout
+docker logout "$REGISTRY"
+```
+
+### Single service
+
+```bash
+SHA=$(git rev-parse --short HEAD)
+set -a; source deploy/registry.env; set +a
+printf '%s' "$REGISTRY_TOKEN" | docker login "$REGISTRY" -u "$REGISTRY_USERNAME" --password-stdin
+docker buildx build --platform linux/amd64 --target api \
+  -t "gitea.treytartt.com/admin/honeydue-api:${SHA}" --push .
+kubectl set image deployment/api -n honeydue \
+  api="gitea.treytartt.com/admin/honeydue-api:${SHA}"
+kubectl rollout status -n honeydue deployment/api
+docker logout "$REGISTRY"
+```
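
The image reference is typed out by hand in each command above, so a typo in the service name only shows up when the rollout fails to pull. A small sketch that builds the reference in one place follows; `image_ref` and `REGISTRY_HOST` are hypothetical names, and the allowed service list is taken from the three build targets used in this section.

```shell
REGISTRY_HOST="gitea.treytartt.com"   # registry host used in this chapter

# Hypothetical helper: print the image reference for one service and SHA.
# Rejects service names other than the three build targets.
image_ref() {
  svc=${1:-} sha=${2:-}
  case "$svc" in
    api|worker|admin) ;;
    *) echo "unknown service: $svc" >&2; return 1 ;;
  esac
  [ -n "$sha" ] || { echo "missing SHA" >&2; return 1; }
  printf '%s/admin/honeydue-%s:%s\n' "$REGISTRY_HOST" "$svc" "$sha"
}
```

With this in place, the apply step could read `kubectl set image deployment/api -n honeydue "api=$(image_ref api "$SHA")"`.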
+
+## 3. Rollback
+
+### Last good
+
+```bash
+kubectl rollout undo deployment/api -n honeydue
+kubectl rollout status -n honeydue deployment/api
+```
+
+### Specific SHA
+
+```bash
+kubectl set image deployment/api -n honeydue \
+  api="gitea.treytartt.com/admin/honeydue-api:<sha>"
+```
+
+## 4. Read logs
+
+```bash
+# Follow all api pod logs
+kubectl logs -n honeydue -l app.kubernetes.io/name=api -f --prefix
+
+# Errors only
+kubectl logs -n honeydue -l app.kubernetes.io/name=api --tail=1000 | grep -i error
+
+# Previous pod (before crash/restart)
+kubectl logs -n honeydue <pod> --previous
+```
+
+## 5. Exec into a pod
+
+```bash
+kubectl exec -n honeydue -it deploy/api -- /bin/sh
+# inside:
+#   wget -qO- http://127.0.0.1:8000/api/health/
+#   env | grep DB_
+#   exit
+```
+
+## 6. Rotate a secret
+
+```bash
+# For honeydue-secrets keys
+kubectl patch secret honeydue-secrets -n honeydue \
+  --type=merge \
+  -p "{\"data\":{\"SECRET_KEY\":\"$(printf '%s' 'new-value' | base64 | tr -d '\n')\"}}"
+
+# Update local file to match (keep in sync)
+printf '%s' 'new-value' > deploy/secrets/secret_key.txt
+
+# Restart pods so they pick up the new secret
+kubectl rollout restart -n honeydue deploy/api deploy/worker
+```
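
Building the patch JSON inline is easy to get wrong: GNU `base64` wraps long output across lines, and a stray newline inside the encoded value corrupts the patch. The sketch below isolates that step; `secret_patch_json` is a hypothetical helper, and it assumes the key name contains nothing that needs JSON escaping.

```shell
# Hypothetical helper: print a merge-patch body that sets one Secret key.
# The value is base64-encoded with any line wrapping removed.
secret_patch_json() {
  key=$1 value=$2
  encoded=$(printf '%s' "$value" | base64 | tr -d '\n')
  printf '{"data":{"%s":"%s"}}\n' "$key" "$encoded"
}
```

It would slot into the command above as `kubectl patch secret honeydue-secrets -n honeydue --type=merge -p "$(secret_patch_json SECRET_KEY 'new-value')"`.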
+
+## 7. Change a ConfigMap value
+
+```bash
+# Edit deploy/prod.env locally
+# Regenerate the configmap
+kubectl create configmap honeydue-config -n honeydue \
+  --from-env-file=deploy/prod.env \
+  --dry-run=client -o yaml | kubectl apply -f -
+
+# Restart to pick up
+kubectl rollout restart -n honeydue deploy/api deploy/admin deploy/worker
+```
+
+## 8. Scale a service
+
+```bash
+kubectl scale deployment/api -n honeydue --replicas=5
+# Then wait
+kubectl rollout status -n honeydue deployment/api
+```
+
+**DO NOT** scale worker above 1 until Asynq PeriodicTaskManager is wired;
+the worker currently runs the Asynq scheduler as a singleton.
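
The worker limit is easy to forget in the middle of an incident. One way to encode it is a guard that runs before `kubectl scale`; `check_replicas` is a hypothetical helper shown only as a sketch of the rule stated above.

```shell
# Hypothetical guard: succeeds if the requested replica count is allowed.
# Encodes the runbook rule that worker must not run more than 1 replica.
check_replicas() {
  svc=${1:-} count=${2:-}
  case "$count" in
    ''|*[!0-9]*) echo "replica count must be a non-negative integer" >&2; return 2 ;;
  esac
  if [ "$svc" = worker ] && [ "$count" -gt 1 ]; then
    echo "refusing: worker must stay at 1 replica or fewer" >&2
    return 1
  fi
}
```

A guarded scale then looks like `check_replicas api 5 && kubectl scale deployment/api -n honeydue --replicas=5`.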
+
+## 9. Drain a node for maintenance
+
+```bash
+# Prevent new pods, evict existing
+kubectl drain <node-hostname> --ignore-daemonsets --delete-emptydir-data
+
+# Do maintenance (apt upgrade, reboot, etc.)
+ssh deploy@<node> "sudo apt update && sudo apt upgrade -y && sudo reboot"
+
+# Wait for node to come back
+watch kubectl get nodes
+
+# Allow scheduling again
+kubectl uncordon <node-hostname>
+```
+
+Node hostnames (not SSH aliases!):
+- `ubuntu-8gb-nbg1-1` (hetzner2)
+- `ubuntu-8gb-nbg1-2` (hetzner1)
+- `ubuntu-8gb-nbg1-3` (hetzner3)
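
Because `kubectl` wants the hostname while `ssh` wants the alias, and the numbering does not line up, a lookup avoids draining the wrong node. The sketch below assumes the name in parentheses in the list above is the SSH alias for that hostname; `node_hostname` is a hypothetical helper.

```shell
# Hypothetical lookup: SSH alias -> Kubernetes node hostname,
# copied from the list above (note hetzner1 and hetzner2 are swapped
# relative to the hostname numbering).
node_hostname() {
  case "${1:-}" in
    hetzner1) echo "ubuntu-8gb-nbg1-2" ;;
    hetzner2) echo "ubuntu-8gb-nbg1-1" ;;
    hetzner3) echo "ubuntu-8gb-nbg1-3" ;;
    *) echo "unknown SSH alias: ${1:-}" >&2; return 1 ;;
  esac
}
```

For example, `kubectl drain "$(node_hostname hetzner1)" --ignore-daemonsets --delete-emptydir-data` drains the node that `ssh deploy@hetzner1` reaches.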
+
+## 10. Add a new node
+
+```bash
+# 1. Provision CX33 in Hetzner console
+# 2. SSH in as root, create deploy user + key
+# 3. Install k3s (the command below joins the node as a server)
+NODE_TOKEN=$(ssh -i ~/.ssh/hetzner deploy@hetzner1 'sudo cat /var/lib/rancher/k3s/server/node-token')
+ssh -i ~/.ssh/hetzner root@<new-node-ip> "curl -sfL https://get.k3s.io | K3S_TOKEN=\"$NODE_TOKEN\" INSTALL_K3S_EXEC=\"server --server=https://178.104.247.152:6443 --disable=servicelb --write-kubeconfig-mode=644\" sh -"
+
+# 4. Add UFW rules for inter-node traffic
+#    (see deploy-k3s/scripts/ for the script)
+
+# 5. Verify
+kubectl get nodes
+```
+
+## 11. Remove a node
+
+```bash
+# Drain first
+kubectl drain <hostname> --ignore-daemonsets --delete-emptydir-data
+
+# Tell k3s to leave
+ssh -i ~/.ssh/hetzner deploy@<node-alias> "sudo systemctl stop k3s && sudo /usr/local/bin/k3s-uninstall.sh"
+
+# Remove from cluster
+kubectl delete node <hostname>
+```
+
+## 12. Force-restart all pods
+
+```bash
+kubectl rollout restart -n honeydue deploy/api deploy/admin deploy/worker deploy/redis
+```
+
+Use sparingly. Causes brief downtime per pod.
+
+## 13. Migrate to a new Neon DB
+
+```bash
+# 1. Create a new branch or project on Neon
+# 2. Update prod.env with new DB_HOST
+# 3. Apply new ConfigMap
+kubectl create configmap honeydue-config -n honeydue \
+  --from-env-file=deploy/prod.env \
+  --dry-run=client -o yaml | kubectl apply -f -
+
+# 4. Rolling restart
+kubectl rollout restart -n honeydue deploy/api deploy/worker
+```
+
+## 14. Rotate Gitea registry PAT
+
+```bash
+# 1. Create new PAT in Gitea UI
+# 2. Update deploy/registry.env locally
+# 3. Update in-cluster Secret
+kubectl create secret docker-registry gitea-credentials -n honeydue \
+  --docker-server=gitea.treytartt.com \
+  --docker-username=admin \
+  --docker-password=<new-pat> \
+  --dry-run=client -o yaml | kubectl apply -f -
+
+# 4. Delete old PAT from Gitea UI
+
+# 5. Running pods are unaffected (their images are already pulled); only
+# new pulls use the new PAT. Test by rolling a pod (this exercises the PAT
+# only if the node actually has to pull the image):
+kubectl rollout restart -n honeydue deployment/api
+```
+
+## 15. Clean up old images in Gitea
+
+Manual, via Gitea UI:
+https://gitea.treytartt.com/admin/-/packages
+
+Keep roughly the last 30 tags per image; delete older ones.
+
+Or via API:
+```bash
+GITEA_PAT="$(grep '^REGISTRY_TOKEN=' deploy/registry.env | cut -d= -f2-)"
+# List tags
+curl -sS -H "Authorization: token $GITEA_PAT" \
+  "https://gitea.treytartt.com/api/v1/packages/admin/container/honeydue-api/versions" | jq .
+# Delete specific tag
+curl -X DELETE -H "Authorization: token $GITEA_PAT" \
+  "https://gitea.treytartt.com/api/v1/packages/admin/container/honeydue-api/<tag>"
+```
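
The "keep the last 30" rule can be applied mechanically once the tags are in a list. The sketch below assumes one tag per line on stdin, ordered newest first; producing that list from the API response is not shown, since it depends on the response shape. `tags_to_delete` is a hypothetical helper.

```shell
# Hypothetical helper: reads tags on stdin, newest first (assumed ordering),
# and prints every tag after the first N. N defaults to 30.
tags_to_delete() {
  keep=${1:-30}
  tail -n +"$((keep + 1))"
}
```

Each printed tag can then be passed to the `DELETE` call shown above. Review the list before deleting, because a tag that is still deployed must stay.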
+
+## 16. Recreate the cluster from scratch
+
+See [Chapter 16 §Disaster recovery](./16-failure-modes.md#disaster-recovery).
+
+## 17. Connect to Neon directly
+
+```bash
+# Get password
+PW=$(cat deploy/secrets/postgres_password.txt)
+
+# Connect
+PGPASSWORD="$PW" psql \
+  -h ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
+  -U neondb_owner \
+  -d honeyDue
+```
+
+## 18. Check admin user credentials
+
+```bash
+# ADMIN_EMAIL is in the honeydue-secrets Secret
+kubectl get secret honeydue-secrets -n honeydue \
+  -o jsonpath='{.data.ADMIN_EMAIL}' | base64 -d
+
+# ADMIN_PASSWORD (ONLY VALID FOR FIRST DEPLOY; may have been changed in UI)
+kubectl get secret honeydue-secrets -n honeydue \
+  -o jsonpath='{.data.ADMIN_PASSWORD}' | base64 -d
+```
+
+If you need to reset admin password because nobody remembers it:
+
+```bash
+# Generate a new bcrypt hash
+NEW_PASSWORD='newpassword'
+HASH=$(htpasswd -bnBC 10 "" "$NEW_PASSWORD" | tr -d ':\n')
+
+# Update directly in Postgres
+PGPASSWORD="$(cat deploy/secrets/postgres_password.txt)" psql \
+  -h ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
+  -U neondb_owner -d honeyDue \
+  -c "UPDATE admin_users SET password='$HASH' WHERE email='admin@myhoneydue.com'"
+```
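
If `htpasswd` is missing or fails, `HASH` ends up empty and the `UPDATE` above would write an unusable password. A quick format check before touching the database guards against that. This is a sketch: `is_bcrypt_hash` is a hypothetical helper that only checks the shape of the string (prefix, two-digit cost, 53 salt-and-hash characters), not whether it matches any password.

```shell
# Hypothetical check: succeeds if the argument looks like a bcrypt hash,
# i.e. $2a$/$2b$/$2y$, a two-digit cost, then 53 characters of salt + hash.
is_bcrypt_hash() {
  printf '%s\n' "${1:-}" | grep -Eq '^\$2[aby]\$[0-9]{2}\$[./A-Za-z0-9]{53}$'
}
```

Run `is_bcrypt_hash "$HASH" || echo "bad hash, aborting"` before the `psql` command.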
+
+## 19. Trigger a Helm chart re-run (Traefik etc.)
+
+If the Traefik HelmChartConfig was updated but the chart didn't reconcile:
+
+```bash
+kubectl delete job -n kube-system helm-install-traefik
+# Helm operator re-runs automatically within ~30 seconds
+kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik -w
+```
+
+## 20. Smoke test after any change
+
+```bash
+# Through Cloudflare
+for url in "https://api.myhoneydue.com/api/health/" \
+           "https://admin.myhoneydue.com/" \
+           "https://myhoneydue.com/"; do
+  ok=0
+  for i in $(seq 1 20); do
+    [[ "$(curl -sS -o /dev/null -w '%{http_code}' --max-time 10 "$url")" == "200" ]] && ok=$((ok+1))
+  done
+  printf "%-45s %d/20 ok\n" "$url" "$ok"
+done
+```
+
+Expect 20/20 on all three.
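
The loop above prints the ratio but always exits 0, so it cannot gate a script. A sketch of a tally step that also sets the exit status follows; `tally_codes` is a hypothetical helper that reads one HTTP status code per line on stdin.

```shell
# Hypothetical helper: reads HTTP status codes (one per line) on stdin,
# prints "<url> ok/total ok", and fails unless every code was 200.
tally_codes() {
  url=$1
  awk -v url="$url" '
    { total++; if ($1 == "200") ok++ }
    END {
      printf "%-45s %d/%d ok\n", url, ok, total
      exit (total > 0 && ok == total ? 0 : 1)
    }
  '
}
```

To feed it, run the 20 `curl -sS -o /dev/null -w '%{http_code}\n' --max-time 10 "$url"` calls in a loop and pipe the loop's output into `tally_codes "$url"`.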
+
+## 21. Kill everything (emergency rollback)
+
+If the cluster is so broken you need to reset the app layer:
+
+```bash
+# Scale everything to 0
+kubectl scale -n honeydue deploy/api deploy/admin deploy/worker deploy/redis --replicas=0
+
+# When ready, scale back up
+kubectl scale -n honeydue deploy/api --replicas=3
+kubectl scale -n honeydue deploy/admin deploy/worker deploy/redis --replicas=1
+```
+
+During the scale-down, CF returns errors to users because no pod is
+serving. Scaling back up takes ~5 min.
+
+## 22. Find which pod a user's request hit
+
+Not directly supported (we don't log node/pod name in requests). Once
+request logging includes these, a grep through the logs will work.
+
+Workaround: search every api pod's logs for a unique user identifier.
+`stern` prefixes each line with the pod name, which identifies the pod:
+
+```bash
+stern -n honeydue api | grep "user_id=12345"
+```
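
If `stern` is not installed, `kubectl logs --prefix` gives similar information, since it prefixes each line with `[pod/<pod>/<container>]`. The sketch below counts matching lines per pod; `pods_matching` is a hypothetical helper, and the `user_id=12345` log format is assumed from the example above.

```shell
# Hypothetical helper: reads `kubectl logs --prefix` output on stdin and
# prints "<count> <pod>" for each pod with lines containing the pattern.
# The match is a fixed string, so "user_id=123" also matches "user_id=1234".
pods_matching() {
  pattern=$1
  grep -F -- "$pattern" \
    | sed -n 's|^\[pod/\([^/]*\)/[^]]*] .*|\1|p' \
    | sort | uniq -c
}
```

For example: `kubectl logs -n honeydue -l app.kubernetes.io/name=api --prefix --tail=1000 | pods_matching 'user_id=12345'`. With a label selector, `kubectl logs` shows only the last 10 lines per pod unless `--tail` is given.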
+
+## References
+
+- [kubectl cheat sheet][kubectl-cs]
+- [K3s docs][k3s-docs]
+- [Neon connect][neon-connect]
+
+[kubectl-cs]: https://kubernetes.io/docs/reference/kubectl/cheatsheet/
+[k3s-docs]: https://docs.k3s.io/
+[neon-connect]: https://neon.com/docs/connect/connect-from-any-app