admin/honeyDueAPI

Trey t 8d9ca2e6ed

docs(deployment): rewrite migration prose for goose adoption

Update the deployment book and glossary to reflect the goose-based
schema migration flow shipped in 12b2f9d/0f7450a:

- ch07: clarify startup probe assumes migrations ran out-of-band
- ch08: drop AutoMigrate-with-advisory-lock prose; describe goose Job
- ch12: pod startup checks goose_db_version, no longer runs migrations
- ch14: document the Job→wait→roll deploy gate and how to debug failures
- ch16: add "Migrate Job fails during deploy" + "Schema precondition
  failed" failure modes
- ch17: new runbook entries §26 (run migrations manually), §27 (recover
  from failed/dirty migration), §28 (bootstrap goose on fresh clone)
- ch19: postscript on §13 noting MigrateWithLock approach is superseded
- ch20: mark "Migration Job for schema changes" task done
- glossary: add `goose` and `goose_db_version`; flag AutoMigrate as
  tests-only
- references: add goose links; flag AutoMigrate as tests-only

2026-04-26 23:01:32 -05:00

16 KiB

17 — Operator Runbook

Summary

Common procedures the operator runs. Each is a numbered sequence of exact commands. If a step is unclear, add a comment; if a procedure fails in an unexpected way, add the symptom + fix to this document.

Environment setup

Every command assumes:

export KUBECONFIG=~/.kube/honeydue-k3s.yaml
cd /Users/treyt/Desktop/code/honeyDue/honeyDueAPI-go

If you see "Unable to connect to the server," KUBECONFIG is probably unset or points at the wrong file.
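A preflight check can catch the unset-kubeconfig case before any kubectl call. The sketch below is a hypothetical helper, not something in the repo; it assumes KUBECONFIG holds a single path (kubectl also accepts a colon-separated list, which this does not handle).

```shell
# Hypothetical preflight helper (not part of the repo): fail fast when
# KUBECONFIG is unset or points at a file that cannot be read.
# Assumes KUBECONFIG is a single path, not a colon-separated list.
check_kubeconfig() {
  if [ -z "${KUBECONFIG:-}" ]; then
    echo "KUBECONFIG is not set" >&2
    return 1
  fi
  if [ ! -r "$KUBECONFIG" ]; then
    echo "KUBECONFIG points at an unreadable file: $KUBECONFIG" >&2
    return 1
  fi
  echo "using kubeconfig: $KUBECONFIG"
}

# Usage:
#   check_kubeconfig && kubectl get nodes
```

This only checks that the file exists and is readable; it does not prove the cluster is reachable.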

1. Check cluster health

kubectl get nodes                  # all 3 Ready?
kubectl get pods -A | grep -vE 'Running|Completed'  # anything not running?
kubectl top nodes                  # resource usage
kubectl get events -A --sort-by=.lastTimestamp | tail -20

2. Deploy new code

Full deploy (all three services)

SHA=$(git rev-parse --short HEAD)

# Login
set -a; source deploy/registry.env; set +a
printf '%s' "$REGISTRY_TOKEN" | \
  docker login "$REGISTRY" -u "$REGISTRY_USERNAME" --password-stdin

# Build
docker buildx build --platform linux/amd64 --target api \
  -t "gitea.treytartt.com/admin/honeydue-api:${SHA}" --push .
docker buildx build --platform linux/amd64 --target worker \
  -t "gitea.treytartt.com/admin/honeydue-worker:${SHA}" --push .
docker buildx build --platform linux/amd64 --target admin \
  -t "gitea.treytartt.com/admin/honeydue-admin:${SHA}" --push .

# Apply
for svc in api worker admin; do
  kubectl set image deployment/$svc -n honeydue \
    "$svc=gitea.treytartt.com/admin/honeydue-${svc}:${SHA}"
done

# Watch
for svc in api worker admin; do
  kubectl rollout status -n honeydue deployment/$svc
done

# Logout
docker logout "$REGISTRY"

Single service

SHA=$(git rev-parse --short HEAD)
set -a; source deploy/registry.env; set +a
printf '%s' "$REGISTRY_TOKEN" | docker login "$REGISTRY" -u "$REGISTRY_USERNAME" --password-stdin
docker buildx build --platform linux/amd64 --target api \
  -t "gitea.treytartt.com/admin/honeydue-api:${SHA}" --push .
kubectl set image deployment/api -n honeydue \
  api="gitea.treytartt.com/admin/honeydue-api:${SHA}"
kubectl rollout status -n honeydue deployment/api
docker logout "$REGISTRY"

3. Rollback

Last good

kubectl rollout undo deployment/api -n honeydue
kubectl rollout status -n honeydue deployment/api

Specific SHA

kubectl set image deployment/api -n honeydue \
  api="gitea.treytartt.com/admin/honeydue-api:<sha>"
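Sections 2 and 3 build the same image reference by hand from a service name and a short SHA. A small helper can keep the two in step; `image_ref` is a hypothetical name, and the registry host and naming scheme are copied from the commands above rather than read from `deploy/registry.env`.

```shell
# Hypothetical helper (not part of the repo): build the image reference
# used by the deploy and rollback commands. Host and naming scheme are
# copied from the runbook; the service whitelist is an assumption.
image_ref() {
  svc=$1
  sha=$2
  case "$svc" in
    api|worker|admin) ;;
    *) echo "unknown service: $svc" >&2; return 1 ;;
  esac
  if [ -z "$sha" ]; then
    echo "missing SHA" >&2
    return 1
  fi
  printf 'gitea.treytartt.com/admin/honeydue-%s:%s\n' "$svc" "$sha"
}

# Usage:
#   kubectl set image deployment/api -n honeydue "api=$(image_ref api "$SHA")"
#   kubectl rollout status -n honeydue deployment/api
```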

4. Read logs

# Follow all api pod logs
kubectl logs -n honeydue -l app.kubernetes.io/name=api -f --prefix

# Errors only
kubectl logs -n honeydue -l app.kubernetes.io/name=api --tail=1000 | grep -i error

# Previous pod (before crash/restart)
kubectl logs -n honeydue <pod> --previous

5. Exec into a pod

kubectl exec -n honeydue -it deploy/api -- /bin/sh
# inside:
#   wget -qO- http://127.0.0.1:8000/api/health/
#   env | grep DB_
#   exit

6. Rotate a secret

# For honeydue-secrets keys
kubectl patch secret honeydue-secrets -n honeydue \
  --type=merge \
  -p "{\"data\":{\"SECRET_KEY\":\"$(printf '%s' 'new-value' | base64 | tr -d '\n')\"}}"

# Update local file to match (keep in sync)
printf '%s' 'new-value' > deploy/secrets/secret_key.txt

# Restart pods so they pick up the new secret
kubectl rollout restart -n honeydue deploy/api deploy/worker
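The patch body is easy to get wrong by hand: `base64` wraps long output on some platforms, and a stray newline in the input changes the encoded value. A minimal sketch of two helpers that build the body; both names are hypothetical and not part of the repo.

```shell
# Hypothetical helpers (not part of the repo).

# base64-encode a value on a single line; printf avoids adding a newline
# to the input, and tr removes any line wrapping from the output.
b64_oneline() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Build the merge-patch body for one key of a Secret.
# Assumes the key needs no JSON escaping (plain identifier).
secret_patch() {
  key=$1
  value=$2
  printf '{"data":{"%s":"%s"}}\n' "$key" "$(b64_oneline "$value")"
}

# Usage:
#   kubectl patch secret honeydue-secrets -n honeydue --type=merge \
#     -p "$(secret_patch SECRET_KEY 'new-value')"
```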

7. Change a ConfigMap value

# Edit deploy/prod.env locally
# Regenerate the configmap
kubectl create configmap honeydue-config -n honeydue \
  --from-env-file=deploy/prod.env \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart to pick up
kubectl rollout restart -n honeydue deploy/api deploy/admin deploy/worker

8. Scale a service

kubectl scale deployment/api -n honeydue --replicas=5
# Then wait
kubectl rollout status -n honeydue deployment/api

DO NOT scale worker above 1 until Asynq PeriodicTaskManager is wired.
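The worker limit is easy to forget during an incident. One option is to put a guard in front of `kubectl scale`; the sketch below is a hypothetical helper, and the only rule it encodes is the worker limit of 1 stated above.

```shell
# Hypothetical guard (not part of the repo): reject replica counts the
# runbook forbids. Only the worker limit of 1 comes from the text above.
check_replicas() {
  svc=$1
  n=$2
  case "$n" in
    ''|*[!0-9]*)
      echo "replicas must be a non-negative integer: $n" >&2
      return 1
      ;;
  esac
  if [ "$svc" = "worker" ] && [ "$n" -gt 1 ]; then
    echo "refusing to scale worker above 1" >&2
    return 1
  fi
  return 0
}

# Usage:
#   check_replicas api 5 && kubectl scale deployment/api -n honeydue --replicas=5
```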

9. Drain a node for maintenance

# Prevent new pods, evict existing
kubectl drain <node-hostname> --ignore-daemonsets --delete-emptydir-data

# Do maintenance (apt upgrade, reboot, etc.)
ssh deploy@<node> "sudo apt update && sudo apt upgrade -y && sudo reboot"

# Wait for node to come back
watch kubectl get nodes

# Allow scheduling again
kubectl uncordon <node-hostname>

Node hostnames (not SSH aliases!):

ubuntu-8gb-nbg1-1 (hetzner2)
ubuntu-8gb-nbg1-2 (hetzner1)
ubuntu-8gb-nbg1-3 (hetzner3)

10. Add a new node

# 1. Provision CX33 in Hetzner console
# 2. SSH in as root, create deploy user + key
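# 3. Install k3s and join the cluster. The command below joins as a server;
#    for an agent, set K3S_URL instead and drop the server-only exec args.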
NODE_TOKEN=$(ssh -i ~/.ssh/hetzner deploy@hetzner1 'sudo cat /var/lib/rancher/k3s/server/node-token')
ssh -i ~/.ssh/hetzner root@<new-node-ip> "curl -sfL https://get.k3s.io | K3S_TOKEN=\"$NODE_TOKEN\" INSTALL_K3S_EXEC=\"server --server=https://178.104.247.152:6443 --disable=servicelb --write-kubeconfig-mode=644\" sh -"

# 4. Add UFW rules for inter-node traffic
#    (see deploy-k3s/scripts/ for the script)

# 5. Verify
kubectl get nodes

11. Remove a node

# Drain first
kubectl drain <hostname> --ignore-daemonsets --delete-emptydir-data

# Tell k3s to leave
ssh -i ~/.ssh/hetzner deploy@<node-alias> "sudo systemctl stop k3s && sudo /usr/local/bin/k3s-uninstall.sh"

# Remove from cluster
kubectl delete node <hostname>

12. Force-restart all pods

kubectl rollout restart -n honeydue deploy/api deploy/admin deploy/worker deploy/redis

Use sparingly. Causes brief downtime per pod.

13. Migrate to a new Neon DB

# 1. Create a new branch or project on Neon
# 2. Update deploy/prod.env with the new DB_HOST
# 3. Apply new ConfigMap
kubectl create configmap honeydue-config -n honeydue \
  --from-env-file=deploy/prod.env \
  --dry-run=client -o yaml | kubectl apply -f -

# 4. Rolling restart
kubectl rollout restart -n honeydue deploy/api deploy/worker

14. Rotate Gitea registry PAT

# 1. Create new PAT in Gitea UI
# 2. Update deploy/registry.env locally
# 3. Update in-cluster Secret
kubectl create secret docker-registry gitea-credentials -n honeydue \
  --docker-server=gitea.treytartt.com \
  --docker-username=admin \
  --docker-password=<new-pat> \
  --dry-run=client -o yaml | kubectl apply -f -

# 4. Delete old PAT from Gitea UI

# 5. Running pods don't re-authenticate for images that are already pulled,
# but any new pull uses the new PAT. Test by rolling a pod:
kubectl rollout restart -n honeydue deployment/api

15. Clean up old images in Gitea

Manual, via Gitea UI: https://gitea.treytartt.com/admin/-/packages

Keep roughly the last 30 tags per image and delete anything older.
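The retention rule can be sketched as a filter over a tag list, however that list was obtained. `tags_to_delete` is a hypothetical helper; it assumes one tag per line on stdin, already sorted newest first, and it makes no assumption about how Gitea orders its own listings.

```shell
# Hypothetical filter (not part of the repo): reads tags on stdin, one per
# line and newest first, and prints every tag after the first KEEP lines.
# KEEP defaults to 30, matching the retention rule above.
tags_to_delete() {
  keep=${1:-30}
  tail -n "+$((keep + 1))"
}

# Usage (tags.txt is a hypothetical newest-first list):
#   tags_to_delete 30 < tags.txt
```

Review the output before deleting anything; the filter trusts the input order completely.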

Or via API:

GITEA_PAT="$(grep REGISTRY_TOKEN deploy/registry.env | cut -d= -f2-)"
# List tags
curl -sS -H "Authorization: token $GITEA_PAT" \
  "https://gitea.treytartt.com/api/v1/packages/admin/container/honeydue-api/versions" | jq .
# Delete specific tag
curl -X DELETE -H "Authorization: token $GITEA_PAT" \
  "https://gitea.treytartt.com/api/v1/packages/admin/container/honeydue-api/<tag>"

16. Recreate the cluster from scratch

See Chapter 16 §Disaster recovery.

17. Connect to Neon directly

# Get password
PW=$(cat deploy/secrets/postgres_password.txt)

# Connect
PGPASSWORD="$PW" psql \
  -h ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
  -U neondb_owner \
  -d honeyDue

18. Check admin user credentials

# ADMIN_EMAIL is in the honeydue-secrets Secret
kubectl get secret honeydue-secrets -n honeydue \
  -o jsonpath='{.data.ADMIN_EMAIL}' | base64 -d

# ADMIN_PASSWORD (ONLY VALID FOR FIRST DEPLOY; may have been changed in UI)
kubectl get secret honeydue-secrets -n honeydue \
  -o jsonpath='{.data.ADMIN_PASSWORD}' | base64 -d

If you need to reset the admin password because nobody remembers it:

# Generate a new bcrypt hash
NEW_PASSWORD='newpassword'
HASH=$(htpasswd -bnBC 10 "" "$NEW_PASSWORD" | tr -d ':\n')

# Update directly in Postgres
PGPASSWORD="$(cat deploy/secrets/postgres_password.txt)" psql \
  -h ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
  -U neondb_owner -d honeyDue \
  -c "UPDATE admin_users SET password='$HASH' WHERE email='admin@myhoneydue.com'"

19. Trigger a Helm chart re-run (Traefik etc.)

If the Traefik HelmChartConfig was updated but the chart didn't reconcile:

kubectl delete job -n kube-system helm-install-traefik
# Helm operator re-runs automatically within ~30 seconds
kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik -w

20. Smoke test after any change

# Through Cloudflare
for url in "https://api.myhoneydue.com/api/health/" \
           "https://admin.myhoneydue.com/" \
           "https://myhoneydue.com/"; do
  ok=0
  for i in $(seq 1 20); do
    [[ "$(curl -sS -o /dev/null -w '%{http_code}' --max-time 10 "$url")" == "200" ]] && ok=$((ok+1))
  done
  printf "%-45s %d/20 ok\n" "$url" "$ok"
done

Expect 20/20 on all three.

21. Kill everything (emergency rollback)

If the cluster is so broken you need to reset the app layer:

# Scale everything to 0
kubectl scale -n honeydue deploy/api deploy/admin deploy/worker deploy/redis --replicas=0

# When ready, scale back up
kubectl scale -n honeydue deploy/api --replicas=3
kubectl scale -n honeydue deploy/admin deploy/worker deploy/redis --replicas=1

During the scale-down, Cloudflare returns errors to users because no pod is serving. Scaling back up takes ~5 min.

22. Find which pod a user's request hit

Not directly supported: request logs don't include the node or pod name. Once request logging includes them, a grep through the logs will work.

Workaround: search every api pod's logs for a unique user identifier. stern prefixes each line with the pod name:

stern -n honeydue api | grep "user_id=12345"

23. Invalidate residence-IDs cache for a user

Used when a user reports stale data ("I joined a residence but my tasks list still shows the old one"). The cache is keyed on user ID with 5-min TTL — most issues self-heal — but you can flush manually.

# Single user
kubectl -n honeydue exec deploy/redis -- redis-cli DEL "residence_ids_user:7"

# All users (nuclear; everyone pays one DB lookup on next request)
kubectl -n honeydue exec deploy/redis -- redis-cli --scan --pattern "residence_ids_user:*" \
  | xargs -r -n 100 kubectl -n honeydue exec deploy/redis -- redis-cli DEL
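To flush a handful of users in one call, the keys can be generated from a list of IDs. `residence_cache_keys` is a hypothetical helper; the key prefix is taken from the commands above, and the numeric-ID check is an assumption based on the examples in this chapter.

```shell
# Hypothetical helper (not part of the repo): print the cache key for each
# user ID given as an argument. Rejects non-numeric IDs so a stray glob
# character can never reach redis-cli.
residence_cache_keys() {
  for id in "$@"; do
    case "$id" in
      ''|*[!0-9]*)
        echo "not a numeric user ID: $id" >&2
        return 1
        ;;
    esac
    printf 'residence_ids_user:%s\n' "$id"
  done
}

# Usage (unquoted on purpose so each key becomes its own argument):
#   kubectl -n honeydue exec deploy/redis -- \
#     redis-cli DEL $(residence_cache_keys 7 12 31)
```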

Mutation paths that should invalidate this cache automatically (any new code that changes membership must call cache.InvalidateResidenceIDsForUsers(ctx, userIDs...)):

ResidenceService.CreateResidence → owner
ResidenceService.DeleteResidence → all members
ResidenceService.JoinWithCode → joining user
ResidenceService.RemoveUser → removed user

If a user keeps reporting stale data, grep for missing invalidation:

grep -rn "residenceRepo.*Add\|RemoveUser\|residence_residence_users" internal/ \
  | grep -v cache | grep -v _test

24. Verify DB pool warm-up is working

After a deploy, check the api pod log for the warm-up confirmation:

kubectl -n honeydue logs -l app.kubernetes.io/name=api --tail=50 \
  | grep "DB pool warm-up complete"

Expected output (per pod):

{"level":"info","requested":20,"warmed":20,"message":"DB pool warm-up complete"}

If warmed < requested, some connections failed to open at boot; the pod still starts and the pool fills on demand. If warmed=0, something is wrong with Neon connectivity or auth; check the next log line for the specific error.
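The three cases can be told apart mechanically. A minimal sketch, assuming the log line has exactly the JSON shape shown in the expected output, with integer `requested` and `warmed` fields; `warmup_state` is a hypothetical name.

```shell
# Hypothetical helper (not part of the repo): classify one
# "DB pool warm-up complete" log line as ok, partial, failed, or unparsed.
# Assumes integer "requested" and "warmed" fields as in the sample line.
warmup_state() {
  line=$1
  requested=$(printf '%s\n' "$line" | sed -n 's/.*"requested":\([0-9][0-9]*\).*/\1/p')
  warmed=$(printf '%s\n' "$line" | sed -n 's/.*"warmed":\([0-9][0-9]*\).*/\1/p')
  if [ -z "$requested" ] || [ -z "$warmed" ]; then
    echo "unparsed"
  elif [ "$warmed" -eq 0 ]; then
    echo "failed"
  elif [ "$warmed" -lt "$requested" ]; then
    echo "partial"
  else
    echo "ok"
  fi
}

# Usage:
#   kubectl -n honeydue logs -l app.kubernetes.io/name=api --tail=50 \
#     | grep "DB pool warm-up complete" \
#     | while IFS= read -r l; do warmup_state "$l"; done
```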

To test impact: hit the api right after a rollout. With warm-up working, the first request should be ~250ms (1 RTT). Without warm-up, the first request is ~700ms (full handshake).

25. Switch DB host between pooler and direct endpoints

The pooler endpoint (-pooler suffix) is the default — it cuts cold-handshake latency by ~3 RTTs. The direct endpoint (ep-floral-truth-amttbc5a.c-5...) is the fallback.

# Edit deploy-k3s/config.yaml — change database.host
# To pooler:   ep-floral-truth-amttbc5a-pooler.c-5.us-east-1.aws.neon.tech
# To direct:   ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech

KUBECONFIG=~/.kube/honeydue.yaml bash deploy-k3s/scripts/03-deploy.sh --skip-build

The pooler runs in transaction mode so any session-scope feature (LISTEN/NOTIFY, session advisory locks) won't work over it. Migrations already handle this — the migrate Job script strips -pooler from DB_HOST before invoking goose. If you add new session-level features in the data path, they'll need the same workaround.
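For reference, the suffix stripping can be done with a single substitution. This is a sketch of the idea only; the real migrate Job script is not shown in this chapter and may do it differently. `direct_host` is a hypothetical name.

```shell
# Sketch only (the real migrate Job script may differ): turn a Neon pooler
# hostname into the direct hostname by dropping the "-pooler" suffix from
# the endpoint label. A direct hostname passes through unchanged.
direct_host() {
  printf '%s\n' "$1" | sed 's/-pooler\././'
}

# Usage:
#   DB_HOST=$(direct_host "$DB_HOST")
```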

26. Run migrations manually (rare)

Day-to-day, migrations run as part of every 03-deploy.sh. But sometimes you want to apply or inspect them outside a deploy:

# Direct-endpoint DSN (goose's advisory lock won't survive the pooler)
DB_PASS=$(kubectl -n honeydue get secret honeydue-secrets \
  -o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d)
export DATABASE_URL="host=ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
                     port=5432 user=neondb_owner password=$DB_PASS \
                     dbname=honeyDue sslmode=require"

# What's pending? (read-only; safe to run anytime)
make migrate-status

# Apply pending migrations (or `goose -dir migrations postgres "$DATABASE_URL" up`)
make migrate-up

# Roll back the most recent migration
make migrate-down

# Scaffold a new migration file
make migrate-new name=add_widget_count_to_residences
# → migrations/000002_add_widget_count_to_residences.sql
# Edit, then `make migrate-up` to test, then commit.

To run goose from inside the cluster (e.g., to bypass a network policy that blocks Neon from your laptop), use the migrate Job manifest as a one-shot:

# Re-creates the migrate Job using the image the api Deployment currently runs
kubectl -n honeydue delete job honeydue-migrate --ignore-not-found
sed "s|image: IMAGE_PLACEHOLDER|image: $(kubectl -n honeydue get deploy api -o jsonpath='{.spec.template.spec.containers[0].image}')|" \
  deploy-k3s/manifests/migrate/job.yaml | kubectl apply -f -
kubectl -n honeydue wait --for=condition=complete --timeout=5m job/honeydue-migrate
kubectl -n honeydue logs job/honeydue-migrate

27. Recover from a failed/dirty migration

If goose up fails partway through, the migration file's transaction rolls back and goose_db_version reflects the last complete version. Goose marks no row as "dirty" — that's a golang-migrate concept. So recovery is just: fix the migration file, re-run.

If you've genuinely corrupted state (dropped tables you shouldn't have, applied a destructive migration in error):

# See current goose state
make migrate-status
psql "$DATABASE_URL" -c \
  "SELECT version_id, is_applied, tstamp FROM goose_db_version ORDER BY id DESC LIMIT 10;"

# To force the version table back to a known-good number after
# manually fixing the schema:
psql "$DATABASE_URL" -c \
  "INSERT INTO goose_db_version (version_id, is_applied, tstamp) VALUES (<N>, true, NOW());"

28. Bootstrap goose on a fresh clone of the schema

If you create a new Neon branch / dev DB and need to bring it under goose management:

export DATABASE_URL="...<the new DB>..."

# Option A: fresh DB, no schema → just run up
make migrate-up

# Option B: schema already populated (e.g., restored from a dump) →
#          mark v1 as already-applied
goose -dir migrations postgres "$DATABASE_URL" version  # creates table
psql "$DATABASE_URL" -c \
  "INSERT INTO goose_db_version (version_id, is_applied, tstamp) VALUES (1, true, NOW());"

This is also what was done for the live prod DB at goose-adoption time (commit 12b2f9d).
