# 17 — Operator Runbook

## Summary

Common procedures the operator runs. Each is a numbered sequence of
exact commands. If a step is unclear, add a comment; if a procedure
fails in an unexpected way, add the symptom + fix to this document.

## Environment setup

Every command assumes:

```bash
export KUBECONFIG=~/.kube/honeydue-k3s.yaml
cd /Users/treyt/Desktop/code/honeyDue/honeyDueAPI-go
```

If you see "Unable to connect to the server," `KUBECONFIG` is most
likely not set in the current shell.

## 1. Check cluster health

```bash
kubectl get nodes                                    # all 3 Ready?
kubectl get pods -A | grep -vE 'Running|Completed'   # anything not running?
kubectl top nodes                                    # resource usage
kubectl get events -A --sort-by=.lastTimestamp | tail -20
```

## 2. Deploy new code

### Full deploy (all three services)

```bash
SHA=$(git rev-parse --short HEAD)

# Login
set -a; source deploy/registry.env; set +a
printf '%s' "$REGISTRY_TOKEN" | \
  docker login "$REGISTRY" -u "$REGISTRY_USERNAME" --password-stdin

# Build
docker buildx build --platform linux/amd64 --target api \
  -t "gitea.treytartt.com/admin/honeydue-api:${SHA}" --push .
docker buildx build --platform linux/amd64 --target worker \
  -t "gitea.treytartt.com/admin/honeydue-worker:${SHA}" --push .
docker buildx build --platform linux/amd64 --target admin \
  -t "gitea.treytartt.com/admin/honeydue-admin:${SHA}" --push .

# Apply
for svc in api worker admin; do
  kubectl set image deployment/$svc -n honeydue \
    "$svc=gitea.treytartt.com/admin/honeydue-${svc}:${SHA}"
done

# Watch
for svc in api worker admin; do
  kubectl rollout status -n honeydue deployment/$svc
done

# Logout
docker logout gitea.treytartt.com
```

### Single service

```bash
SHA=$(git rev-parse --short HEAD)
set -a; source deploy/registry.env; set +a
printf '%s' "$REGISTRY_TOKEN" | docker login "$REGISTRY" -u "$REGISTRY_USERNAME" --password-stdin
docker buildx build --platform linux/amd64 --target api \
  -t "gitea.treytartt.com/admin/honeydue-api:${SHA}" --push .
kubectl set image deployment/api -n honeydue \
  api="gitea.treytartt.com/admin/honeydue-api:${SHA}"
kubectl rollout status -n honeydue deployment/api
docker logout "$REGISTRY"
```

## 3. Rollback

### Last good

```bash
kubectl rollout undo deployment/api -n honeydue
kubectl rollout status -n honeydue deployment/api
```

### Specific SHA

```bash
kubectl set image deployment/api -n honeydue \
  api="gitea.treytartt.com/admin/honeydue-api:<sha>"
```

## 4. Read logs

```bash
# Follow all api pod logs
kubectl logs -n honeydue -l app.kubernetes.io/name=api -f --prefix

# Errors only
kubectl logs -n honeydue -l app.kubernetes.io/name=api --tail=1000 | grep -i error

# Previous pod (before crash/restart)
kubectl logs -n honeydue <pod> --previous
```

## 5. Exec into a pod

```bash
kubectl exec -n honeydue -it deploy/api -- /bin/sh
# inside:
#   wget -qO- http://127.0.0.1:8000/api/health/
#   env | grep DB_
#   exit
```

## 6. Rotate a secret

```bash
# For honeydue-secrets keys
# (printf avoids a trailing newline; tr strips any line wrapping base64 adds)
kubectl patch secret honeydue-secrets -n honeydue \
  --type=merge \
  -p "{\"data\":{\"SECRET_KEY\":\"$(printf '%s' 'new-value' | base64 | tr -d '\n')\"}}"

# Update local file to match (keep in sync)
printf '%s' 'new-value' > deploy/secrets/secret_key.txt

# Restart pods so they pick up the new secret
kubectl rollout restart -n honeydue deploy/api deploy/worker
```

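The inline JSON in the patch command is easy to mistype. A minimal sketch of a helper that builds the patch body is shown here; `secret_patch_json` is a hypothetical name (nothing in the repo is known to define it), and it assumes the key name itself needs no JSON escaping.

```bash
# Hypothetical helper (not part of the repo): print a merge-patch body that
# sets one key of a Secret to the base64 encoding of the given value.
secret_patch_json() {
  local key="$1" value="$2" encoded
  # printf avoids a trailing newline; tr removes any line wrapping base64 adds.
  encoded="$(printf '%s' "$value" | base64 | tr -d '\n')"
  printf '{"data":{"%s":"%s"}}' "$key" "$encoded"
}

# Usage, equivalent to the patch command in this section:
#   kubectl patch secret honeydue-secrets -n honeydue --type=merge \
#     -p "$(secret_patch_json SECRET_KEY 'new-value')"
```

Base64 output contains only letters, digits, `+`, `/`, and `=`, so the encoded value never needs escaping inside the JSON string.
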
## 7. Change a ConfigMap value

```bash
# Edit deploy/prod.env locally
# Regenerate the configmap
kubectl create configmap honeydue-config -n honeydue \
  --from-env-file=deploy/prod.env \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart to pick up
kubectl rollout restart -n honeydue deploy/api deploy/admin deploy/worker
```

## 8. Scale a service

```bash
kubectl scale deployment/api -n honeydue --replicas=5
# Then wait
kubectl rollout status -n honeydue deployment/api
```

**DO NOT** scale worker above 1 until Asynq PeriodicTaskManager is wired.

## 9. Drain a node for maintenance

```bash
# Prevent new pods, evict existing
kubectl drain <node-hostname> --ignore-daemonsets --delete-emptydir-data

# Do maintenance (apt upgrade, reboot, etc.)
ssh deploy@<node> "sudo apt update && sudo apt upgrade -y && sudo reboot"

# Wait for node to come back
watch kubectl get nodes

# Allow scheduling again
kubectl uncordon <node-hostname>
```

Node hostnames (not SSH aliases!):
- `ubuntu-8gb-nbg1-1` (hetzner2)
- `ubuntu-8gb-nbg1-2` (hetzner1)
- `ubuntu-8gb-nbg1-3` (hetzner3)

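Because the hostname and alias numbering do not line up, a small lookup can prevent draining one node and rebooting another. This is a sketch only: `ssh_alias_for_node` is a hypothetical helper, and it assumes the name in parentheses in the list above is the SSH alias for that node.

```bash
# Hypothetical helper: map a Kubernetes node hostname to its SSH alias,
# using the pairs from the node list in this section.
ssh_alias_for_node() {
  case "$1" in
    ubuntu-8gb-nbg1-1) echo hetzner2 ;;
    ubuntu-8gb-nbg1-2) echo hetzner1 ;;
    ubuntu-8gb-nbg1-3) echo hetzner3 ;;
    *) echo "unknown node hostname: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   node=ubuntu-8gb-nbg1-2
#   kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
#   ssh "deploy@$(ssh_alias_for_node "$node")" "sudo reboot"
```
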
## 10. Add a new node

```bash
# 1. Provision CX33 in Hetzner console
# 2. SSH in as root, create deploy user + key
# 3. Install k3s and join the cluster (the command below joins as a server)
NODE_TOKEN=$(ssh -i ~/.ssh/hetzner deploy@hetzner1 'sudo cat /var/lib/rancher/k3s/server/node-token')
ssh -i ~/.ssh/hetzner root@<new-node-ip> "curl -sfL https://get.k3s.io | K3S_TOKEN=\"$NODE_TOKEN\" INSTALL_K3S_EXEC=\"server --server=https://178.104.247.152:6443 --disable=servicelb --write-kubeconfig-mode=644\" sh -"

# 4. Add UFW rules for inter-node traffic
#    (see deploy-k3s/scripts/ for the script)

# 5. Verify
kubectl get nodes
```

## 11. Remove a node

```bash
# Drain first
kubectl drain <hostname> --ignore-daemonsets --delete-emptydir-data

# Tell k3s to leave
# (on an agent-only node the unit is k3s-agent and the script is k3s-agent-uninstall.sh)
ssh -i ~/.ssh/hetzner deploy@<node-alias> "sudo systemctl stop k3s && sudo /usr/local/bin/k3s-uninstall.sh"

# Remove from cluster
kubectl delete node <hostname>
```

## 12. Force-restart all pods

```bash
kubectl rollout restart -n honeydue deploy/api deploy/admin deploy/worker deploy/redis
```

Use sparingly. Causes brief downtime per pod.

## 13. Migrate to a new Neon DB

```bash
# 1. Create a new branch or project on Neon
# 2. Update prod.env with new DB_HOST
# 3. Apply new ConfigMap
kubectl create configmap honeydue-config -n honeydue \
  --from-env-file=deploy/prod.env \
  --dry-run=client -o yaml | kubectl apply -f -

# 4. Rolling restart
kubectl rollout restart -n honeydue deploy/api deploy/worker
```

## 14. Rotate Gitea registry PAT

```bash
# 1. Create new PAT in Gitea UI
# 2. Update deploy/registry.env locally
# 3. Update in-cluster Secret
kubectl create secret docker-registry gitea-credentials -n honeydue \
  --docker-server=gitea.treytartt.com \
  --docker-username=admin \
  --docker-password=<new-pat> \
  --dry-run=client -o yaml | kubectl apply -f -

# 4. Delete old PAT from Gitea UI

# 5. Pods don't re-auth with existing images (already pulled), but
#    new pulls will use new PAT. Test by rolling a pod:
kubectl rollout restart -n honeydue deployment/api
```

## 15. Clean up old images in Gitea

Manual, via Gitea UI:
https://gitea.treytartt.com/admin/-/packages

Keep roughly the last 30 tags per image; delete older ones.

Or via API:
```bash
# -f2- keeps the whole value even if the token contains '='
GITEA_PAT="$(grep REGISTRY_TOKEN deploy/registry.env | cut -d= -f2-)"
# List tags
curl -sS -H "Authorization: token $GITEA_PAT" \
  "https://gitea.treytartt.com/api/v1/packages/admin/container/honeydue-api/versions" | jq .
# Delete specific tag
curl -X DELETE -H "Authorization: token $GITEA_PAT" \
  "https://gitea.treytartt.com/api/v1/packages/admin/container/honeydue-api/<tag>"
```

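The "keep the last 30" rule can be sketched as a filter over a tag list. `tags_to_delete` is a hypothetical helper, and it assumes you already have the tag names one per line, newest first; how to get that ordering out of the API response is not covered here.

```bash
# Hypothetical helper: read tag names on stdin (newest first, one per line)
# and print the ones that fall outside the newest N (default 30).
# The printed tags are the deletion candidates.
tags_to_delete() {
  local keep="${1:-30}"
  tail -n "+$((keep + 1))"
}

# Usage, with a file of tags sorted newest first:
#   tags_to_delete 30 < tags.txt
```

Printing the candidates instead of deleting them leaves room to review the list before issuing any `DELETE` calls.
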
## 16. Recreate the cluster from scratch

See [Chapter 16 §Disaster recovery](./16-failure-modes.md#disaster-recovery).

## 17. Connect to Neon directly

```bash
# Get password
PW=$(cat deploy/secrets/postgres_password.txt)

# Connect
PGPASSWORD="$PW" psql \
  -h ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
  -U neondb_owner \
  -d honeyDue
```

## 18. Check admin user credentials

```bash
# ADMIN_EMAIL is in the honeydue-secrets Secret
kubectl get secret honeydue-secrets -n honeydue \
  -o jsonpath='{.data.ADMIN_EMAIL}' | base64 -d

# ADMIN_PASSWORD (ONLY VALID FOR FIRST DEPLOY; may have been changed in UI)
kubectl get secret honeydue-secrets -n honeydue \
  -o jsonpath='{.data.ADMIN_PASSWORD}' | base64 -d
```

If you need to reset the admin password because nobody remembers it:

```bash
# Generate a new bcrypt hash
NEW_PASSWORD='newpassword'
HASH=$(htpasswd -bnBC 10 "" "$NEW_PASSWORD" | tr -d ':\n')

# Update directly in Postgres
PGPASSWORD="$(cat deploy/secrets/postgres_password.txt)" psql \
  -h ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
  -U neondb_owner -d honeyDue \
  -c "UPDATE admin_users SET password='$HASH' WHERE email='admin@myhoneydue.com'"
```

## 19. Trigger a Helm chart re-run (Traefik etc.)

If the Traefik HelmChartConfig was updated but the chart didn't reconcile:

```bash
kubectl delete job -n kube-system helm-install-traefik
# Helm operator re-runs automatically within ~30 seconds
kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik -w
```

## 20. Smoke test after any change

```bash
# Through Cloudflare
for url in "https://api.myhoneydue.com/api/health/" \
           "https://admin.myhoneydue.com/" \
           "https://myhoneydue.com/"; do
  ok=0
  for i in $(seq 1 20); do
    [[ "$(curl -sS -o /dev/null -w '%{http_code}' --max-time 10 "$url")" == "200" ]] && ok=$((ok+1))
  done
  printf "%-45s %d/20 ok\n" "$url" "$ok"
done
```

Expect 20/20 on all three.

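If the counting logic needs to be reused or checked offline, it can be separated from the `curl` calls. A sketch, assuming a hypothetical helper named `tally_200` that reads one status code per line:

```bash
# Hypothetical helper: read one HTTP status code per line on stdin and print
# "<ok>/<total> ok", where ok counts the lines that are exactly 200.
tally_200() {
  local code ok=0 total=0
  while IFS= read -r code; do
    total=$((total + 1))
    if [ "$code" = "200" ]; then
      ok=$((ok + 1))
    fi
  done
  printf '%d/%d ok\n' "$ok" "$total"
}

# Usage against one URL:
#   for i in $(seq 1 20); do
#     curl -sS -o /dev/null -w '%{http_code}\n' --max-time 10 "$url"
#   done | tally_200
```
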
## 21. Kill everything (emergency rollback)

If the cluster is so broken you need to reset the app layer:

```bash
# Scale everything to 0
kubectl scale -n honeydue deploy/api deploy/admin deploy/worker deploy/redis --replicas=0

# When ready, scale back up
kubectl scale -n honeydue deploy/api --replicas=3
kubectl scale -n honeydue deploy/admin deploy/worker deploy/redis --replicas=1
```

While everything is scaled down, Cloudflare returns errors to users
because no pod is serving. Scaling back up takes ~5 min.

## 22. Find which pod a user's request hit

Not directly supported (we don't log node/pod name in requests). When
we add request logging that includes these, a grep through logs works.

Workaround: in each pod's logs, search for a unique user identifier:

```bash
stern -n honeydue api | grep "user_id=12345"
```

## 23. Invalidate residence-IDs cache for a user

Used when a user reports stale data ("I joined a residence but my
tasks list still shows the old one"). The cache is keyed on user ID
with a 5-min TTL, so most issues self-heal, but you can flush manually.

```bash
# Single user
kubectl -n honeydue exec deploy/redis -- redis-cli DEL "residence_ids_user:7"

# All users (nuclear; everyone pays one DB lookup on next request)
kubectl -n honeydue exec deploy/redis -- redis-cli --scan --pattern "residence_ids_user:*" \
  | xargs -r -n 100 kubectl -n honeydue exec deploy/redis -- redis-cli DEL
```

Mutation paths that should invalidate this cache automatically (any
new code that changes membership must call
`cache.InvalidateResidenceIDsForUsers(ctx, userIDs...)`):

- `ResidenceService.CreateResidence` → owner
- `ResidenceService.DeleteResidence` → all members
- `ResidenceService.JoinWithCode` → joining user
- `ResidenceService.RemoveUser` → removed user

If a user keeps reporting stale data, grep for missing invalidation:

```bash
grep -rn "residenceRepo.*Add\|RemoveUser\|residence_residence_users" internal/ \
  | grep -v cache | grep -v _test
```

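For the single-user flush, a small guard against typos in the key can help. This is a sketch only: `residence_cache_key` is a hypothetical helper, and it assumes user IDs are plain digit strings, as in the `residence_ids_user:7` example.

```bash
# Hypothetical helper: print the Redis key for one user's residence-ID cache
# entry. Rejects empty input and anything containing a non-digit character.
residence_cache_key() {
  case "$1" in
    ''|*[!0-9]*) echo "user ID must be numeric: '$1'" >&2; return 1 ;;
  esac
  printf 'residence_ids_user:%s\n' "$1"
}

# Usage:
#   kubectl -n honeydue exec deploy/redis -- redis-cli DEL "$(residence_cache_key 7)"
```
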
## 24. Verify DB pool warm-up is working

After a deploy, check the api pod log for the warm-up confirmation:

```bash
kubectl -n honeydue logs -l app.kubernetes.io/name=api --tail=50 \
  | grep "DB pool warm-up complete"
```

Expected output (per pod):

```json
{"level":"info","requested":20,"warmed":20,"message":"DB pool warm-up complete"}
```

If `warmed` < `requested`, the pool partially failed at boot; the pod
still starts and fills the pool from there. If `warmed=0`, something's
wrong with either Neon connectivity or auth; check the next log line
for the specific error.

To test impact: hit the api right after a rollout. With warm-up
working, the first request should be ~250ms (1 RTT). Without warm-up,
the first request is ~700ms (full handshake).

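The `warmed` versus `requested` comparison can be automated instead of read by eye. A sketch, assuming the log line keeps the JSON shape shown in the expected output; `check_warmup` is a hypothetical name.

```bash
# Hypothetical helper: read api log lines on stdin and compare "warmed" with
# "requested" on every warm-up line. Exits non-zero if any line shows fewer
# warmed connections than requested, or if no warm-up line was seen at all.
check_warmup() {
  local line requested warmed seen=0 bad=0
  while IFS= read -r line; do
    case "$line" in
      *'DB pool warm-up complete'*) ;;
      *) continue ;;
    esac
    seen=1
    requested="$(printf '%s\n' "$line" | sed -n 's/.*"requested":\([0-9][0-9]*\).*/\1/p')"
    warmed="$(printf '%s\n' "$line" | sed -n 's/.*"warmed":\([0-9][0-9]*\).*/\1/p')"
    if [ -z "$requested" ] || [ -z "$warmed" ] || [ "$warmed" -lt "$requested" ]; then
      echo "warm-up short: warmed=${warmed:-?} requested=${requested:-?}"
      bad=1
    fi
  done
  [ "$seen" -eq 1 ] && [ "$bad" -eq 0 ]
}

# Usage:
#   kubectl -n honeydue logs -l app.kubernetes.io/name=api --tail=50 | check_warmup
```
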
## 25. Switch DB host between pooler and direct endpoints

The pooler endpoint (`-pooler` suffix) is the default; it cuts
cold-handshake latency by ~3 RTTs. The direct endpoint
(`ep-floral-truth-amttbc5a.c-5...`) is the fallback.

```bash
# Edit deploy-k3s/config.yaml: change database.host
#   To pooler:  ep-floral-truth-amttbc5a-pooler.c-5.us-east-1.aws.neon.tech
#   To direct:  ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech

KUBECONFIG=~/.kube/honeydue-k3s.yaml bash deploy-k3s/scripts/03-deploy.sh --skip-build
```

The pooler runs in transaction mode, so any session-scope feature
(LISTEN/NOTIFY, session advisory locks) won't work over it. Migrations
already handle this: the migrate Job script strips `-pooler` from
`DB_HOST` before invoking goose. If you add new session-level features
in the data path, they'll need the same workaround.

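For new session-level features, the same host rewrite could look like the sketch below. It is one possible approach, not a copy of the migrate Job script (which is not shown in this chapter); `direct_db_host` is a hypothetical name.

```bash
# Hypothetical helper: print the direct-endpoint form of a Neon host by
# dropping a "-pooler" suffix from the first DNS label. Hosts without the
# suffix pass through unchanged.
direct_db_host() {
  local host="$1"
  local endpoint="${host%%.*}"       # first label, e.g. ep-...-pooler
  local rest="${host#"$endpoint"}"   # remainder, starting at the first dot
  printf '%s%s\n' "${endpoint%-pooler}" "$rest"
}

# Usage:
#   DB_HOST="$(direct_db_host "$DB_HOST")"
```

Limiting the rewrite to the first label avoids touching any later part of the hostname that happens to contain the same text.
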
## 26. Run migrations manually (rare)

Day-to-day, migrations run as part of every `03-deploy.sh` run.
Sometimes you want to apply or inspect them outside a deploy:

```bash
# Direct-endpoint DSN (goose's advisory lock won't survive the pooler)
DB_PASS=$(kubectl -n honeydue get secret honeydue-secrets \
  -o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d)
export DATABASE_URL="host=ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
  port=5432 user=neondb_owner password=$DB_PASS \
  dbname=honeyDue sslmode=require"

# What's pending? (read-only; safe to run anytime)
make migrate-status

# Apply pending migrations (or `goose -dir migrations postgres "$DATABASE_URL" up`)
make migrate-up

# Roll back the most recent migration
make migrate-down

# Scaffold a new migration file
make migrate-new name=add_widget_count_to_residences
# → migrations/000002_add_widget_count_to_residences.sql
# Edit, then `make migrate-up` to test, then commit.
```

To run goose from inside the cluster (e.g., to bypass a network policy
that blocks Neon from your laptop), use the migrate Job manifest as a
one-shot:

```bash
# Re-runs the migrate Job using the image the api Deployment currently runs
kubectl -n honeydue delete job honeydue-migrate --ignore-not-found
sed "s|image: IMAGE_PLACEHOLDER|image: $(kubectl -n honeydue get deploy api -o jsonpath='{.spec.template.spec.containers[0].image}')|" \
  deploy-k3s/manifests/migrate/job.yaml | kubectl apply -f -
kubectl -n honeydue wait --for=condition=complete --timeout=5m job/honeydue-migrate
kubectl -n honeydue logs job/honeydue-migrate
```

## 27. Recover from a failed/dirty migration

If `goose up` fails partway through, the migration file's transaction
rolls back (goose wraps each file in a transaction unless the file opts
out with `-- +goose NO TRANSACTION`) and `goose_db_version` reflects
the last *complete* version. Goose marks no row as "dirty"; that's a
golang-migrate concept. So recovery is just: fix the migration file,
re-run.

If you've genuinely corrupted state (dropped tables you shouldn't have,
applied a destructive migration in error):

```bash
# See current goose state
make migrate-status
psql "$DATABASE_URL" -c \
  "SELECT version_id, is_applied, tstamp FROM goose_db_version ORDER BY id DESC LIMIT 10;"

# To force the version table back to a known-good number after
# manually fixing the schema:
psql "$DATABASE_URL" -c \
  "INSERT INTO goose_db_version (version_id, is_applied, tstamp) VALUES (<N>, true, NOW());"
```

## 28. Bootstrap goose on a fresh clone of the schema

If you create a new Neon branch / dev DB and need to bring it under
goose management:

```bash
export DATABASE_URL="...<the new DB>..."

# Option A: fresh DB, no schema → just run up
make migrate-up

# Option B: schema already populated (e.g., restored from a dump) →
# mark v1 as already-applied
goose -dir migrations postgres "$DATABASE_URL" version  # creates table
psql "$DATABASE_URL" -c \
  "INSERT INTO goose_db_version (version_id, is_applied, tstamp) VALUES (1, true, NOW());"
```

Option B is also what was done for the live prod DB at goose-adoption
time (commit `12b2f9d`).

## References

- [kubectl cheat sheet][kubectl-cs]
- [K3s docs][k3s-docs]
- [Neon connect][neon-connect]

[kubectl-cs]: https://kubernetes.io/docs/reference/kubectl/cheatsheet/
[k3s-docs]: https://docs.k3s.io/
[neon-connect]: https://neon.com/docs/connect/connect-from-any-app