# 17 — Operator Runbook

## Summary

Common procedures the operator runs. Each is a numbered sequence of exact commands. If a step is unclear, add a comment; if a procedure fails in an unexpected way, add the symptom + fix to this document.

## Environment setup

Every command assumes:

```bash
export KUBECONFIG=~/.kube/honeydue-k3s.yaml
cd /Users/treyt/Desktop/code/honeyDue/honeyDueAPI-go
```

If you see "Unable to connect to the server," the kubeconfig isn't set.

## 1. Check cluster health

```bash
kubectl get nodes                                     # all 3 Ready?
kubectl get pods -A | grep -vE 'Running|Completed'    # anything not running?
kubectl top nodes                                     # resource usage
kubectl get events -A --sort-by=.lastTimestamp | tail -20
```

## 2. Deploy new code

### Full deploy (all three services)

```bash
SHA=$(git rev-parse --short HEAD)

# Login
set -a; source deploy/registry.env; set +a
printf '%s' "$REGISTRY_TOKEN" | \
  docker login "$REGISTRY" -u "$REGISTRY_USERNAME" --password-stdin

# Build
docker buildx build --platform linux/amd64 --target api \
  -t "gitea.treytartt.com/admin/honeydue-api:${SHA}" --push .
docker buildx build --platform linux/amd64 --target worker \
  -t "gitea.treytartt.com/admin/honeydue-worker:${SHA}" --push .
docker buildx build --platform linux/amd64 --target admin \
  -t "gitea.treytartt.com/admin/honeydue-admin:${SHA}" --push .

# Apply
for svc in api worker admin; do
  kubectl set image deployment/$svc -n honeydue \
    "$svc=gitea.treytartt.com/admin/honeydue-${svc}:${SHA}"
done

# Watch
for svc in api worker admin; do
  kubectl rollout status -n honeydue deployment/$svc
done

# Logout
docker logout gitea.treytartt.com
```

### Single service

```bash
SHA=$(git rev-parse --short HEAD)
set -a; source deploy/registry.env; set +a
printf '%s' "$REGISTRY_TOKEN" | docker login "$REGISTRY" -u "$REGISTRY_USERNAME" --password-stdin

docker buildx build --platform linux/amd64 --target api \
  -t "gitea.treytartt.com/admin/honeydue-api:${SHA}" --push .
kubectl set image deployment/api -n honeydue \
  api="gitea.treytartt.com/admin/honeydue-api:${SHA}"
kubectl rollout status -n honeydue deployment/api

docker logout "$REGISTRY"
```

## 3. Rollback

### Last good

```bash
kubectl rollout undo deployment/api -n honeydue
kubectl rollout status -n honeydue deployment/api
```

### Specific SHA

```bash
kubectl set image deployment/api -n honeydue \
  api="gitea.treytartt.com/admin/honeydue-api:"
```

## 4. Read logs

```bash
# Follow all api pod logs
kubectl logs -n honeydue -l app.kubernetes.io/name=api -f --prefix

# Errors only
kubectl logs -n honeydue -l app.kubernetes.io/name=api --tail=1000 | grep -i error

# Previous pod (before crash/restart)
kubectl logs -n honeydue --previous
```

## 5. Exec into a pod

```bash
kubectl exec -n honeydue -it deploy/api -- /bin/sh
# inside:
#   wget -qO- http://127.0.0.1:8000/api/health/
#   env | grep DB_
#   exit
```

## 6. Rotate a secret

```bash
# For honeydue-secrets keys
# (tr strips the line wraps that GNU base64 adds to long values)
kubectl patch secret honeydue-secrets -n honeydue \
  --type=merge \
  -p "{\"data\":{\"SECRET_KEY\":\"$(printf '%s' 'new-value' | base64 | tr -d '\n')\"}}"

# Update local file to match (keep in sync)
printf '%s' 'new-value' > deploy/secrets/secret_key.txt

# Restart pods so they pick up the new secret
kubectl rollout restart -n honeydue deploy/api deploy/worker
```

## 7. Change a ConfigMap value

```bash
# Edit deploy/prod.env locally

# Regenerate the configmap
kubectl create configmap honeydue-config -n honeydue \
  --from-env-file=deploy/prod.env \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart to pick up
kubectl rollout restart -n honeydue deploy/api deploy/admin deploy/worker
```

## 8. Scale a service

```bash
kubectl scale deployment/api -n honeydue --replicas=5

# Then wait
kubectl rollout status -n honeydue deployment/api
```

**DO NOT** scale worker above 1 until Asynq PeriodicTaskManager is wired.

## 9. Drain a node for maintenance

```bash
# Prevent new pods, evict existing
kubectl drain --ignore-daemonsets --delete-emptydir-data

# Do maintenance (apt upgrade, reboot, etc.)
ssh deploy@ "sudo apt update && sudo apt upgrade -y && sudo reboot"

# Wait for node to come back
watch kubectl get nodes

# Allow scheduling again
kubectl uncordon
```

Node hostnames (not SSH aliases!):

- `ubuntu-8gb-nbg1-1` (hetzner2)
- `ubuntu-8gb-nbg1-2` (hetzner1)
- `ubuntu-8gb-nbg1-3` (hetzner3)

## 10. Add a new node

```bash
# 1. Provision CX33 in Hetzner console
# 2. SSH in as root, create deploy user + key
# 3. Install k3s as agent (or server); the command below joins as a server
NODE_TOKEN=$(ssh -i ~/.ssh/hetzner deploy@hetzner1 'sudo cat /var/lib/rancher/k3s/server/node-token')
ssh -i ~/.ssh/hetzner root@ "curl -sfL https://get.k3s.io | K3S_TOKEN=\"$NODE_TOKEN\" INSTALL_K3S_EXEC=\"server --server=https://178.104.247.152:6443 --disable=servicelb --write-kubeconfig-mode=644\" sh -"

# 4. Add UFW rules for inter-node traffic
#    (see deploy-k3s/scripts/ for the script)

# 5. Verify
kubectl get nodes
```

## 11. Remove a node

```bash
# Drain first
kubectl drain --ignore-daemonsets --delete-emptydir-data

# Tell k3s to leave
ssh -i ~/.ssh/hetzner deploy@ "sudo systemctl stop k3s && sudo /usr/local/bin/k3s-uninstall.sh"

# Remove from cluster
kubectl delete node
```

## 12. Force-restart all pods

```bash
kubectl rollout restart -n honeydue deploy/api deploy/admin deploy/worker deploy/redis
```

Use sparingly. Causes brief downtime per pod.

## 13. Migrate to a new Neon DB

```bash
# 1. Create a new branch or project on Neon
# 2. Update prod.env with new DB_HOST
# 3. Apply new ConfigMap
kubectl create configmap honeydue-config -n honeydue \
  --from-env-file=deploy/prod.env \
  --dry-run=client -o yaml | kubectl apply -f -
# 4. Rolling restart
kubectl rollout restart -n honeydue deploy/api deploy/worker
```

## 14. Rotate Gitea registry PAT

```bash
# 1. Create new PAT in Gitea UI
# 2. Update deploy/registry.env locally
# 3. Update in-cluster Secret
kubectl create secret docker-registry gitea-credentials -n honeydue \
  --docker-server=gitea.treytartt.com \
  --docker-username=admin \
  --docker-password= \
  --dry-run=client -o yaml | kubectl apply -f -
# 4. Delete old PAT from Gitea UI
# 5. Pods don't re-auth for images that are already pulled, but
#    new pulls will use the new PAT. Test by rolling a pod:
kubectl rollout restart -n honeydue deployment/api
```

## 15. Clean up old images in Gitea

Manual, via Gitea UI: https://gitea.treytartt.com/admin/-/packages

Keep roughly the last 30 tags per image; delete older ones.

Or via API:

```bash
# -f2- keeps the whole value even if the token itself contains '='
GITEA_PAT="$(grep REGISTRY_TOKEN deploy/registry.env | cut -d= -f2-)"

# List tags
curl -sS -H "Authorization: token $GITEA_PAT" \
  "https://gitea.treytartt.com/api/v1/packages/admin/container/honeydue-api/versions" | jq .

# Delete specific tag
curl -X DELETE -H "Authorization: token $GITEA_PAT" \
  "https://gitea.treytartt.com/api/v1/packages/admin/container/honeydue-api/"
```

## 16. Recreate the cluster from scratch

See [Chapter 16 §Disaster recovery](./16-failure-modes.md#disaster-recovery).

## 17. Connect to Neon directly

```bash
# Get password
PW=$(cat deploy/secrets/postgres_password.txt)

# Connect
PGPASSWORD="$PW" psql \
  -h ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
  -U neondb_owner \
  -d honeyDue
```

## 18. Check admin user credentials

```bash
# ADMIN_EMAIL is in the honeydue-secrets Secret
kubectl get secret honeydue-secrets -n honeydue \
  -o jsonpath='{.data.ADMIN_EMAIL}' | base64 -d

# ADMIN_PASSWORD (ONLY VALID FOR FIRST DEPLOY; may have been changed in UI)
kubectl get secret honeydue-secrets -n honeydue \
  -o jsonpath='{.data.ADMIN_PASSWORD}' | base64 -d
```

If you need to reset the admin password because nobody remembers it:

```bash
# Generate a new bcrypt hash
NEW_PASSWORD='newpassword'
HASH=$(htpasswd -bnBC 10 "" "$NEW_PASSWORD" | tr -d ':\n')

# Update directly in Postgres
PGPASSWORD="$(cat deploy/secrets/postgres_password.txt)" psql \
  -h ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
  -U neondb_owner -d honeyDue \
  -c "UPDATE admin_users SET password='$HASH' WHERE email='admin@myhoneydue.com'"
```

## 19. Trigger a Helm chart re-run (Traefik etc.)

If the Traefik HelmChartConfig was updated but the chart didn't reconcile:

```bash
kubectl delete job -n kube-system helm-install-traefik
# The k3s Helm controller re-runs the job automatically within ~30 seconds
kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik -w
```

## 20. Smoke test after any change

```bash
# Through Cloudflare
for url in "https://api.myhoneydue.com/api/health/" \
           "https://admin.myhoneydue.com/" \
           "https://myhoneydue.com/"; do
  ok=0
  for i in $(seq 1 20); do
    [[ "$(curl -sS -o /dev/null -w '%{http_code}' --max-time 10 "$url")" == "200" ]] && ok=$((ok+1))
  done
  printf "%-45s %d/20 ok\n" "$url" "$ok"
done
```

Expect 20/20 on all three.

## 21. Kill everything (emergency rollback)

If the cluster is so broken you need to reset the app layer:

```bash
# Scale everything to 0
kubectl scale -n honeydue deploy/api deploy/admin deploy/worker deploy/redis --replicas=0

# When ready, scale back up
kubectl scale -n honeydue deploy/api --replicas=3
kubectl scale -n honeydue deploy/admin deploy/worker deploy/redis --replicas=1
```

During the scale-down, Cloudflare (CF) returns errors to users because no pod is serving.
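
To confirm that the scale-up has actually finished, one option is to compare ready and desired replica counts per deployment. The following is a minimal sketch, not part of the established procedure: the helper names `not_ready` and `check_scale_up` are hypothetical, and it assumes `awk` is available on the operator machine.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads "NAME READY DESIRED" rows on stdin and prints
# every deployment whose ready count differs from its desired count.
not_ready() {
  awk '{
    # kubectl prints <none> when .status.readyReplicas is unset (no ready pods)
    ready = ($2 == "<none>") ? 0 : $2
    if (ready + 0 != $3 + 0) printf "%s %d/%d\n", $1, ready, $3
  }'
}

# Hypothetical wrapper: feeds the live deployment list in the honeydue
# namespace into not_ready. Defined here but not called automatically.
check_scale_up() {
  kubectl get deploy -n honeydue --no-headers \
    -o custom-columns='NAME:.metadata.name,READY:.status.readyReplicas,DESIRED:.spec.replicas' \
    | not_ready
}
```

Running `check_scale_up` prints nothing once every deployment is fully ready; any line it does print names a deployment that is still catching up.
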
The rolling update for scale-up takes ~5 min.

## 22. Find which pod a user's request hit

Not directly supported (we don't log node/pod name in requests). When we add request logging that includes these, a grep through logs works.

Workaround: in each pod's logs, search for a unique user identifier:

```bash
stern -n honeydue api | grep "user_id=12345"
```

## References

- [kubectl cheat sheet][kubectl-cs]
- [K3s docs][k3s-docs]
- [Neon connect][neon-connect]

[kubectl-cs]: https://kubernetes.io/docs/reference/kubectl/cheatsheet/
[k3s-docs]: https://docs.k3s.io/
[neon-connect]: https://neon.com/docs/connect/connect-from-any-app