honeyDueAPI/docs/deployment/17-runbook.md
Trey t c9ac273dbd
docs: capture latency optimizations + new caching invariants
Shipping commit 88fb175 changed the trace shape and added a new caching
layer with required invalidation rules. Updating the operator-facing
docs so they match the running system.

ch08 (database):
- DB_HOST is the -pooler Neon endpoint, not direct compute
- Connection pool: MaxIdleConns 20 (was 10), MaxLifetime 30m (was 10m),
  MaxIdleTime 0 (never close idle)
- New "Pool warm-up at boot" section documenting the 20-parallel-ping
  warm-up in database.Connect
- Replaced the "Neon regions" section: explicit RTT numbers, the
  optimization stack that minimizes round-trips, when this still matters

ch15 (observability):
- Replaced the 2,473ms/5-span sample trace with the new 229ms/2-span
  post-optimization trace; kept the old one underneath for diff context

ch16 (failure modes):
- Added: stale residence-IDs cache (data freshness bug + recovery)
- Added: Redis at maxmemory limit (verify allkeys-lru policy)
- Added: Neon pooler unreachable but direct endpoint up — emergency
  switchover procedure

ch17 (runbook):
- §23 Invalidate residence-IDs cache for a user (DEL key + grep for
  missing invalidation in new code)
- §24 Verify DB pool warm-up is working (log pattern + impact test)
- §25 Switch DB host between pooler and direct endpoints

observability-plan.md status flipped from "plan only" to shipped
with the latency-cut summary.

README links to the new ch08 latency section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 17:36:36 -05:00


17 — Operator Runbook

Summary

Common procedures the operator runs. Each is a numbered sequence of exact commands. If a step is unclear, add a comment; if a procedure fails in an unexpected way, add the symptom + fix to this document.

Environment setup

Every command assumes:

export KUBECONFIG=~/.kube/honeydue-k3s.yaml
cd /Users/treyt/Desktop/code/honeyDue/honeyDueAPI-go

If kubectl reports "Unable to connect to the server", KUBECONFIG is most likely not exported in the current shell.

1. Check cluster health

kubectl get nodes                  # all 3 Ready?
kubectl get pods -A | grep -vE 'Running|Completed'  # anything not running?
kubectl top nodes                  # resource usage
kubectl get events -A --sort-by=.lastTimestamp | tail -20
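The first command leaves reading the STATUS column to the operator. A minimal sketch that filters it, assuming the default `kubectl get nodes` column layout (NAME first, STATUS second); `not_ready_nodes` is a hypothetical helper name, not something in the repo:

```shell
# Print every node whose STATUS is not exactly "Ready".
# A cordoned node shows "Ready,SchedulingDisabled", so it is listed too,
# which catches a forgotten `kubectl uncordon` after maintenance.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 " " $2 }'
}

# Usage (prints nothing when every node is healthy and schedulable):
#   kubectl get nodes --no-headers | not_ready_nodes
```

The helper reads from stdin so it can also be run against saved output when triaging after the fact.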

2. Deploy new code

Full deploy (all three services)

SHA=$(git rev-parse --short HEAD)

# Login
set -a; source deploy/registry.env; set +a
printf '%s' "$REGISTRY_TOKEN" | \
  docker login "$REGISTRY" -u "$REGISTRY_USERNAME" --password-stdin

# Build
docker buildx build --platform linux/amd64 --target api \
  -t "gitea.treytartt.com/admin/honeydue-api:${SHA}" --push .
docker buildx build --platform linux/amd64 --target worker \
  -t "gitea.treytartt.com/admin/honeydue-worker:${SHA}" --push .
docker buildx build --platform linux/amd64 --target admin \
  -t "gitea.treytartt.com/admin/honeydue-admin:${SHA}" --push .

# Apply
for svc in api worker admin; do
  kubectl set image deployment/$svc -n honeydue \
    "$svc=gitea.treytartt.com/admin/honeydue-${svc}:${SHA}"
done

# Watch
for svc in api worker admin; do
  kubectl rollout status -n honeydue deployment/$svc
done

# Logout
docker logout gitea.treytartt.com

Single service

SHA=$(git rev-parse --short HEAD)
set -a; source deploy/registry.env; set +a
printf '%s' "$REGISTRY_TOKEN" | docker login "$REGISTRY" -u "$REGISTRY_USERNAME" --password-stdin
docker buildx build --platform linux/amd64 --target api \
  -t "gitea.treytartt.com/admin/honeydue-api:${SHA}" --push .
kubectl set image deployment/api -n honeydue \
  api="gitea.treytartt.com/admin/honeydue-api:${SHA}"
kubectl rollout status -n honeydue deployment/api
docker logout "$REGISTRY"

3. Rollback

Last good

kubectl rollout undo deployment/api -n honeydue
kubectl rollout status -n honeydue deployment/api

Specific SHA

kubectl set image deployment/api -n honeydue \
  api="gitea.treytartt.com/admin/honeydue-api:<sha>"
kubectl rollout status -n honeydue deployment/api
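After a deploy or a rollback it can help to confirm which tag the deployment spec actually points at. The sketch below assumes the app container is the first container in the pod template; `image_tag` and `deployment_runs_sha` are hypothetical helper names:

```shell
# Extract the tag (everything after the last ':') from an image reference.
image_tag() {
  printf '%s\n' "${1##*:}"
}

# Succeed only if deployment $1 in the honeydue namespace is set to tag $2.
# Assumption: the first container in the pod template is the app container.
deployment_runs_sha() {
  local image
  image=$(kubectl get "deployment/$1" -n honeydue \
    -o jsonpath='{.spec.template.spec.containers[0].image}') || return 1
  [ "$(image_tag "$image")" = "$2" ]
}

# Usage:
#   deployment_runs_sha api "$(git rev-parse --short HEAD)" && echo "api is on HEAD"
```

This checks the deployment spec only; `kubectl rollout status` is still what confirms the pods have finished rolling.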

4. Read logs

# Follow all api pod logs
kubectl logs -n honeydue -l app.kubernetes.io/name=api -f --prefix

# Errors only
kubectl logs -n honeydue -l app.kubernetes.io/name=api --tail=1000 | grep -i error

# Previous pod (before crash/restart)
kubectl logs -n honeydue <pod> --previous

5. Exec into a pod

kubectl exec -n honeydue -it deploy/api -- /bin/sh
# inside:
#   wget -qO- http://127.0.0.1:8000/api/health/
#   env | grep DB_
#   exit

6. Rotate a secret

# For honeydue-secrets keys
kubectl patch secret honeydue-secrets -n honeydue \
  --type=merge \
  -p "{\"data\":{\"SECRET_KEY\":\"$(printf '%s' 'new-value' | base64 | tr -d '\n')\"}}"

# Update local file to match (keep in sync)
printf '%s' 'new-value' > deploy/secrets/secret_key.txt

# Restart pods so they pick up the new secret
kubectl rollout restart -n honeydue deploy/api deploy/worker
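Building the patch payload inline is easy to get wrong for long values, because GNU `base64` wraps its output at 76 characters and a stray newline makes the JSON invalid. A small sketch that builds the payload separately; `secret_patch_json` is a hypothetical helper, and it assumes the key name needs no JSON escaping:

```shell
# Print a merge-patch body for one Secret key, base64-encoding the value
# on a single line regardless of its length.
secret_patch_json() {
  local key="$1" value="$2" encoded
  encoded=$(printf '%s' "$value" | base64 | tr -d '\n')
  printf '{"data":{"%s":"%s"}}\n' "$key" "$encoded"
}

# Usage:
#   kubectl patch secret honeydue-secrets -n honeydue --type=merge \
#     -p "$(secret_patch_json SECRET_KEY 'new-value')"
```

Printing the payload first also gives a chance to eyeball it before it reaches the cluster.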

7. Change a ConfigMap value

# Edit deploy/prod.env locally
# Regenerate the configmap
kubectl create configmap honeydue-config -n honeydue \
  --from-env-file=deploy/prod.env \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart to pick up
kubectl rollout restart -n honeydue deploy/api deploy/admin deploy/worker

8. Scale a service

kubectl scale deployment/api -n honeydue --replicas=5
# Then wait
kubectl rollout status -n honeydue deployment/api

DO NOT scale worker above 1 until Asynq PeriodicTaskManager is wired.
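The worker limit is easy to forget under pressure. One way to make it hard to violate is a thin wrapper around the scale command; `safe_scale` is a hypothetical name and the guard simply encodes the rule stated above:

```shell
# Scale a deployment in the honeydue namespace, refusing to take the
# worker above one replica.
safe_scale() {
  local svc="$1" replicas="$2"
  if [ "$svc" = "worker" ] && [ "$replicas" -gt 1 ]; then
    echo "refusing: worker must stay at 1 replica or fewer" >&2
    return 1
  fi
  kubectl scale "deployment/$svc" -n honeydue --replicas="$replicas" &&
    kubectl rollout status -n honeydue "deployment/$svc"
}

# Usage:
#   safe_scale api 5
```

The guard would need to be removed or relaxed once the PeriodicTaskManager work lands.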

9. Drain a node for maintenance

# Prevent new pods, evict existing
kubectl drain <node-hostname> --ignore-daemonsets --delete-emptydir-data

# Do maintenance (apt upgrade, reboot, etc.)
ssh deploy@<node> "sudo apt update && sudo apt upgrade -y && sudo reboot"

# Wait for node to come back
watch kubectl get nodes

# Allow scheduling again
kubectl uncordon <node-hostname>

Node hostnames (not SSH aliases!):

  • ubuntu-8gb-nbg1-1 (hetzner2)
  • ubuntu-8gb-nbg1-2 (hetzner1)
  • ubuntu-8gb-nbg1-3 (hetzner3)
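Because the SSH aliases and node names do not line up numerically, a lookup helper avoids draining the wrong node. The sketch below hard-codes the mapping from the list above and adds a blocking wait as an alternative to `watch`; both function names are hypothetical:

```shell
# Map an SSH alias to the Kubernetes node name (mapping from the list above).
node_hostname() {
  case "$1" in
    hetzner1) echo "ubuntu-8gb-nbg1-2" ;;
    hetzner2) echo "ubuntu-8gb-nbg1-1" ;;
    hetzner3) echo "ubuntu-8gb-nbg1-3" ;;
    *) echo "unknown SSH alias: $1" >&2; return 1 ;;
  esac
}

# Block until the node reports Ready (default timeout 10 minutes).
wait_node_ready() {
  kubectl wait --for=condition=Ready "node/$1" --timeout="${2:-10m}"
}

# Usage:
#   node=$(node_hostname hetzner1) && wait_node_ready "$node" && kubectl uncordon "$node"
```

A node can keep reporting Ready for a short while after a reboot is issued, so the wait is most reliable when started once the node has actually gone down.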

10. Add a new node

# 1. Provision CX33 in Hetzner console
# 2. SSH in as root, create deploy user + key
# 3. Install k3s as agent (or server)
NODE_TOKEN=$(ssh -i ~/.ssh/hetzner deploy@hetzner1 'sudo cat /var/lib/rancher/k3s/server/node-token')
ssh -i ~/.ssh/hetzner root@<new-node-ip> "curl -sfL https://get.k3s.io | K3S_TOKEN=\"$NODE_TOKEN\" INSTALL_K3S_EXEC=\"server --server=https://178.104.247.152:6443 --disable=servicelb --write-kubeconfig-mode=644\" sh -"

# 4. Add UFW rules for inter-node traffic
#    (see deploy-k3s/scripts/ for the script)

# 5. Verify
kubectl get nodes

11. Remove a node

# Drain first
kubectl drain <hostname> --ignore-daemonsets --delete-emptydir-data

# Tell k3s to leave
ssh -i ~/.ssh/hetzner deploy@<node-alias> "sudo systemctl stop k3s && sudo /usr/local/bin/k3s-uninstall.sh"

# Remove from cluster
kubectl delete node <hostname>

12. Force-restart all pods

kubectl rollout restart -n honeydue deploy/api deploy/admin deploy/worker deploy/redis

Use sparingly. Causes brief downtime per pod.

13. Migrate to a new Neon DB

# 1. Create a new branch or project on Neon
# 2. Update prod.env with new DB_HOST
# 3. Apply new ConfigMap
kubectl create configmap honeydue-config -n honeydue \
  --from-env-file=deploy/prod.env \
  --dry-run=client -o yaml | kubectl apply -f -

# 4. Rolling restart
kubectl rollout restart -n honeydue deploy/api deploy/worker

14. Rotate Gitea registry PAT

# 1. Create new PAT in Gitea UI
# 2. Update deploy/registry.env locally
# 3. Update in-cluster Secret
kubectl create secret docker-registry gitea-credentials -n honeydue \
  --docker-server=gitea.treytartt.com \
  --docker-username=admin \
  --docker-password=<new-pat> \
  --dry-run=client -o yaml | kubectl apply -f -

# 4. Delete old PAT from Gitea UI

# 5. Pods don't re-auth with existing images (already pulled), but
# new pulls will use new PAT. Test by rolling a pod:
kubectl rollout restart -n honeydue deployment/api

15. Clean up old images in Gitea

Manual, via Gitea UI: https://gitea.treytartt.com/admin/-/packages

Keep roughly the last 30 tags per image; delete anything older.

Or via API:

GITEA_PAT="$(grep '^REGISTRY_TOKEN=' deploy/registry.env | cut -d= -f2-)"
# List tags
curl -sS -H "Authorization: token $GITEA_PAT" \
  "https://gitea.treytartt.com/api/v1/packages/admin/container/honeydue-api/versions" | jq .
# Delete specific tag
curl -X DELETE -H "Authorization: token $GITEA_PAT" \
  "https://gitea.treytartt.com/api/v1/packages/admin/container/honeydue-api/<tag>"

16. Recreate the cluster from scratch

See Chapter 16 §Disaster recovery.

17. Connect to Neon directly

# Get password
PW=$(cat deploy/secrets/postgres_password.txt)

# Connect
PGPASSWORD="$PW" psql \
  -h ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
  -U neondb_owner \
  -d honeyDue

18. Check admin user credentials

# ADMIN_EMAIL is in the honeydue-secrets Secret
kubectl get secret honeydue-secrets -n honeydue \
  -o jsonpath='{.data.ADMIN_EMAIL}' | base64 -d

# ADMIN_PASSWORD (ONLY VALID FOR FIRST DEPLOY; may have been changed in UI)
kubectl get secret honeydue-secrets -n honeydue \
  -o jsonpath='{.data.ADMIN_PASSWORD}' | base64 -d

If you need to reset the admin password because nobody remembers it:

# Generate a new bcrypt hash
NEW_PASSWORD='newpassword'
HASH=$(htpasswd -bnBC 10 "" "$NEW_PASSWORD" | tr -d ':\n')

# Update directly in Postgres
PGPASSWORD="$(cat deploy/secrets/postgres_password.txt)" psql \
  -h ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
  -U neondb_owner -d honeyDue \
  -c "UPDATE admin_users SET password='$HASH' WHERE email='admin@myhoneydue.com'"
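If `htpasswd` is missing or fails, `$HASH` ends up empty and the UPDATE would silently store an empty password. A minimal shape check to run before the UPDATE, assuming the standard 60-character bcrypt encoding with a `$2a$`, `$2b$`, or `$2y$` prefix; `is_bcrypt_hash` is a hypothetical helper:

```shell
# Succeed only if the argument looks like a bcrypt hash:
# 60 characters, starting with $2a$, $2b$ or $2y$ and a two-digit cost.
is_bcrypt_hash() {
  local h="$1"
  [ "${#h}" -eq 60 ] || return 1
  case "$h" in
    '$2a$'[0-9][0-9]'$'*|'$2b$'[0-9][0-9]'$'*|'$2y$'[0-9][0-9]'$'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   is_bcrypt_hash "$HASH" || { echo "HASH is not a bcrypt hash, aborting" >&2; exit 1; }
```

This only validates the format; it does not prove the hash matches the intended password.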

19. Trigger a Helm chart re-run (Traefik etc.)

If the Traefik HelmChartConfig was updated but the chart didn't reconcile:

kubectl delete job -n kube-system helm-install-traefik
# Helm operator re-runs automatically within ~30 seconds
kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik -w

20. Smoke test after any change

# Through Cloudflare
for url in "https://api.myhoneydue.com/api/health/" \
           "https://admin.myhoneydue.com/" \
           "https://myhoneydue.com/"; do
  ok=0
  for i in $(seq 1 20); do
    [[ "$(curl -sS -o /dev/null -w '%{http_code}' --max-time 10 "$url")" == "200" ]] && ok=$((ok+1))
  done
  printf "%-45s %d/20 ok\n" "$url" "$ok"
done

Expect 20/20 on all three.

21. Kill everything (emergency rollback)

If the cluster is so broken you need to reset the app layer:

# Scale everything to 0
kubectl scale -n honeydue deploy/api deploy/admin deploy/worker deploy/redis --replicas=0

# When ready, scale back up
kubectl scale -n honeydue deploy/api --replicas=3
kubectl scale -n honeydue deploy/admin deploy/worker deploy/redis --replicas=1

During the scale-down, Cloudflare returns errors to users because no pod is serving. Scaling back up takes ~5 min.

22. Find which pod a user's request hit

Not directly supported (we don't log node/pod name in requests). When we add request logging that includes these, a grep through logs works.

Workaround: in each pod's logs, search for a unique user identifier:

stern -n honeydue api | grep "user_id=12345"

23. Invalidate residence-IDs cache for a user

Used when a user reports stale data ("I joined a residence but my tasks list still shows the old one"). The cache is keyed on user ID with 5-min TTL — most issues self-heal — but you can flush manually.

# Single user
kubectl -n honeydue exec deploy/redis -- redis-cli DEL "residence_ids_user:7"

# All users (nuclear; everyone pays one DB lookup on next request)
kubectl -n honeydue exec deploy/redis -- redis-cli --scan --pattern "residence_ids_user:*" \
  | xargs -r -n 100 kubectl -n honeydue exec deploy/redis -- redis-cli DEL
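When several users report stale data at once, a small wrapper saves retyping the key. The sketch assumes user IDs are plain integers, as in the example key; both function names are hypothetical:

```shell
# Build the cache key for one user, rejecting anything that is not a
# plain integer so a typo cannot turn into an unexpected key.
residence_cache_key() {
  case "$1" in
    ''|*[!0-9]*) echo "user ID must be numeric: '$1'" >&2; return 1 ;;
  esac
  printf 'residence_ids_user:%s\n' "$1"
}

# Delete the cached residence IDs for each user ID given.
invalidate_residence_cache() {
  local id key
  for id in "$@"; do
    key=$(residence_cache_key "$id") || return 1
    kubectl -n honeydue exec deploy/redis -- redis-cli DEL "$key"
  done
}

# Usage:
#   invalidate_residence_cache 7 12 31
```

`redis-cli DEL` prints the number of keys removed, so an output of 0 means the entry had already expired or was never cached.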

Mutation paths that should invalidate this cache automatically (any new code that changes membership must call cache.InvalidateResidenceIDsForUsers(ctx, userIDs...)):

  • ResidenceService.CreateResidence → owner
  • ResidenceService.DeleteResidence → all members
  • ResidenceService.JoinWithCode → joining user
  • ResidenceService.RemoveUser → removed user

If a user keeps reporting stale data, grep for missing invalidation:

grep -rn "residenceRepo.*Add\|RemoveUser\|residence_residence_users" internal/ \
  | grep -v cache | grep -v _test

24. Verify DB pool warm-up is working

After a deploy, check the api pod log for the warm-up confirmation:

kubectl -n honeydue logs -l app.kubernetes.io/name=api --tail=50 \
  | grep "DB pool warm-up complete"

Expected output (per pod):

{"level":"info","requested":20,"warmed":20,"message":"DB pool warm-up complete"}

If warmed < requested, some warm-up connections failed at boot; the pod still starts and the pool fills on demand from there. If warmed=0, something is wrong with either Neon connectivity or auth; check the next log line for the specific error.
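With several api pods, comparing the two numbers by eye gets tedious. The filter below classifies each warm-up line; it assumes the field order shown in the expected output (`requested` before `warmed`), and `warmup_status` is a hypothetical helper:

```shell
# Read api log lines on stdin and print one verdict per warm-up line:
#   ok      - every requested connection was warmed
#   partial - some connections failed; the pool fills on demand
#   failed  - nothing was warmed; check Neon connectivity and auth
warmup_status() {
  grep 'DB pool warm-up complete' \
    | sed -n 's/.*"requested":\([0-9][0-9]*\).*"warmed":\([0-9][0-9]*\).*/\1 \2/p' \
    | while read -r requested warmed; do
        if [ "$warmed" -eq 0 ]; then
          echo "failed $warmed/$requested"
        elif [ "$warmed" -lt "$requested" ]; then
          echo "partial $warmed/$requested"
        else
          echo "ok $warmed/$requested"
        fi
      done
}

# Usage:
#   kubectl -n honeydue logs -l app.kubernetes.io/name=api --tail=50 | warmup_status
```

No output at all means no warm-up line was found in the log window, which is itself worth investigating.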

To test impact: hit the api right after a rollout. With warm-up working, the first request should be ~250ms (1 RTT). Without warm-up, the first request is ~700ms (full handshake).

25. Switch DB host between pooler and direct endpoints

The pooler endpoint (-pooler suffix) is the default — it cuts cold-handshake latency by ~3 RTTs. The direct endpoint (ep-floral-truth-amttbc5a.c-5...) is the fallback.

# Edit deploy-k3s/config.yaml — change database.host
# To pooler:   ep-floral-truth-amttbc5a-pooler.c-5.us-east-1.aws.neon.tech
# To direct:   ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech

KUBECONFIG=~/.kube/honeydue.yaml bash deploy-k3s/scripts/03-deploy.sh --skip-build

The pooler runs in transaction mode, so session-scoped features (LISTEN/NOTIFY, session advisory locks for migrations) do not work through it. Migrations are unaffected because MigrateWithLock opens its own connection to the direct endpoint. If you ever add session-level features in the data path, they will need the direct endpoint too.
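The two hostnames differ only by the `-pooler` suffix on the first label, so the switch can be derived rather than retyped. A sketch based on the two hostnames listed above; `to_pooler_host` and `to_direct_host` are hypothetical helpers:

```shell
# Add the -pooler suffix to the endpoint label (no-op if already present).
to_pooler_host() {
  local endpoint="${1%%.*}" rest="${1#*.}"
  case "$endpoint" in
    *-pooler) printf '%s\n' "$1" ;;
    *) printf '%s-pooler.%s\n' "$endpoint" "$rest" ;;
  esac
}

# Strip the -pooler suffix from the endpoint label (no-op if absent).
to_direct_host() {
  local endpoint="${1%%.*}" rest="${1#*.}"
  printf '%s.%s\n' "${endpoint%-pooler}" "$rest"
}

# After redeploying, confirm which endpoint the running pods see:
#   kubectl exec -n honeydue deploy/api -- env | grep DB_HOST
```

Both helpers are idempotent, so applying one to a host that is already in the target form returns it unchanged.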

References