Trey t c9ac273dbd

docs: capture latency optimizations + new caching invariants

Shipping commit 88fb175 changed the trace shape and added a new caching
layer with required invalidation rules. Updating the operator-facing
docs so they match the running system.

ch08 (database):
- DB_HOST is the -pooler Neon endpoint, not direct compute
- Connection pool: MaxIdleConns 20 (was 10), MaxLifetime 30m (was 10m),
  MaxIdleTime 0 (never close idle)
- New "Pool warm-up at boot" section documenting the 20-parallel-ping
  warm-up in database.Connect
- Replaced the "Neon regions" section: explicit RTT numbers, the
  optimization stack that minimizes round-trips, when this still matters

ch15 (observability):
- Replaced the 2,473ms/5-span sample trace with the new 229ms/2-span
  post-optimization trace; kept the old one underneath for diff context

ch16 (failure modes):
- Added: stale residence-IDs cache (data freshness bug + recovery)
- Added: Redis at maxmemory limit (verify allkeys-lru policy)
- Added: Neon pooler unreachable but direct endpoint up — emergency
  switchover procedure

ch17 (runbook):
- §23 Invalidate residence-IDs cache for a user (DEL key + grep for
  missing invalidation in new code)
- §24 Verify DB pool warm-up is working (log pattern + impact test)
- §25 Switch DB host between pooler and direct endpoints

observability-plan.md status flipped from "plan only" to shipped
with the latency-cut summary.

README links to the new ch08 latency section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-25 17:36:36 -05:00


17 — Operator Runbook

Summary

Common procedures the operator runs. Each is a numbered sequence of exact commands. If a step is unclear, add a comment; if a procedure fails in an unexpected way, add the symptom + fix to this document.

Environment setup

Every command assumes:

export KUBECONFIG=~/.kube/honeydue-k3s.yaml
cd /Users/treyt/Desktop/code/honeyDue/honeyDueAPI-go

If you see "Unable to connect to the server," KUBECONFIG is probably unset or pointing at the wrong file.
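A preflight check can catch this before any kubectl call. The sketch below is a hypothetical helper (check_kubeconfig is not part of the repo) and assumes KUBECONFIG holds a single file path rather than a colon-separated list:

```shell
# Hypothetical preflight helper; not part of the repo.
# Assumes KUBECONFIG is a single file path, not a colon-separated list.
check_kubeconfig() {
  if [ -z "${KUBECONFIG:-}" ]; then
    echo "KUBECONFIG is not set" >&2
    return 1
  fi
  if [ ! -r "$KUBECONFIG" ]; then
    echo "KUBECONFIG is not a readable file: $KUBECONFIG" >&2
    return 1
  fi
}
```

Usage: check_kubeconfig && kubectl get nodes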

1. Check cluster health

kubectl get nodes                  # all 3 Ready?
kubectl get pods -A | grep -vE 'Running|Completed'  # anything not running?
kubectl top nodes                  # resource usage
kubectl get events -A --sort-by=.lastTimestamp | tail -20

2. Deploy new code

Full deploy (all three services)

SHA=$(git rev-parse --short HEAD)

# Login
set -a; source deploy/registry.env; set +a
printf '%s' "$REGISTRY_TOKEN" | \
  docker login "$REGISTRY" -u "$REGISTRY_USERNAME" --password-stdin

# Build
docker buildx build --platform linux/amd64 --target api \
  -t "gitea.treytartt.com/admin/honeydue-api:${SHA}" --push .
docker buildx build --platform linux/amd64 --target worker \
  -t "gitea.treytartt.com/admin/honeydue-worker:${SHA}" --push .
docker buildx build --platform linux/amd64 --target admin \
  -t "gitea.treytartt.com/admin/honeydue-admin:${SHA}" --push .

# Apply
for svc in api worker admin; do
  kubectl set image deployment/$svc -n honeydue \
    "$svc=gitea.treytartt.com/admin/honeydue-${svc}:${SHA}"
done

# Watch
for svc in api worker admin; do
  kubectl rollout status -n honeydue deployment/$svc
done

# Logout
docker logout gitea.treytartt.com

Single service

SHA=$(git rev-parse --short HEAD)
set -a; source deploy/registry.env; set +a
printf '%s' "$REGISTRY_TOKEN" | docker login "$REGISTRY" -u "$REGISTRY_USERNAME" --password-stdin
docker buildx build --platform linux/amd64 --target api \
  -t "gitea.treytartt.com/admin/honeydue-api:${SHA}" --push .
kubectl set image deployment/api -n honeydue \
  api="gitea.treytartt.com/admin/honeydue-api:${SHA}"
kubectl rollout status -n honeydue deployment/api
docker logout "$REGISTRY"
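Every build and set-image command above follows the same image naming scheme. A minimal sketch of that scheme as a function, assuming only the three services and the registry path shown above (image_ref is a hypothetical name, not an existing script):

```shell
# Hypothetical helper; registry path and service names are copied from the
# deploy commands in this section.
image_ref() {
  svc=${1:-}
  sha=${2:-}
  case "$svc" in
    api|worker|admin) ;;
    *) echo "unknown service: $svc" >&2; return 1 ;;
  esac
  if [ -z "$sha" ]; then
    echo "missing SHA" >&2
    return 1
  fi
  printf 'gitea.treytartt.com/admin/honeydue-%s:%s\n' "$svc" "$sha"
}
```

Usage: kubectl set image deployment/api -n honeydue api="$(image_ref api "$SHA")"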

3. Rollback

Last good

kubectl rollout undo deployment/api -n honeydue
kubectl rollout status -n honeydue deployment/api

Specific SHA

kubectl set image deployment/api -n honeydue \
  api="gitea.treytartt.com/admin/honeydue-api:<sha>"
kubectl rollout status -n honeydue deployment/api

4. Read logs

# Follow all api pod logs
kubectl logs -n honeydue -l app.kubernetes.io/name=api -f --prefix

# Errors only
kubectl logs -n honeydue -l app.kubernetes.io/name=api --tail=1000 | grep -i error

# Previous pod (before crash/restart)
kubectl logs -n honeydue <pod> --previous

5. Exec into a pod

kubectl exec -n honeydue -it deploy/api -- /bin/sh
# inside:
#   wget -qO- http://127.0.0.1:8000/api/health/
#   env | grep DB_
#   exit

6. Rotate a secret

# For honeydue-secrets keys
kubectl patch secret honeydue-secrets -n honeydue \
  --type=merge \
  -p "{\"data\":{\"SECRET_KEY\":\"$(printf '%s' 'new-value' | base64 | tr -d '\n')\"}}"

# Update local file to match (keep in sync)
printf '%s' 'new-value' > deploy/secrets/secret_key.txt

# Restart pods so they pick up the new secret
kubectl rollout restart -n honeydue deploy/api deploy/worker
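The patch body is easy to get wrong by hand, because Secret values must be base64 with no embedded line breaks. A minimal sketch of a builder for it; secret_patch is a hypothetical helper, and it assumes the key name is a plain identifier that needs no JSON escaping:

```shell
# Hypothetical helper: build the merge-patch JSON for one Secret key.
# Assumes the key name contains only letters, digits, and underscores.
secret_patch() {
  key=${1:-}
  value=${2:-}
  # tr strips the line wraps that some base64 implementations insert
  # into long output.
  encoded=$(printf '%s' "$value" | base64 | tr -d '\n')
  printf '{"data":{"%s":"%s"}}\n' "$key" "$encoded"
}
```

Usage: kubectl patch secret honeydue-secrets -n honeydue --type=merge -p "$(secret_patch SECRET_KEY 'new-value')"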

7. Change a ConfigMap value

# Edit deploy/prod.env locally
# Regenerate the configmap
kubectl create configmap honeydue-config -n honeydue \
  --from-env-file=deploy/prod.env \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart to pick up
kubectl rollout restart -n honeydue deploy/api deploy/admin deploy/worker

8. Scale a service

kubectl scale deployment/api -n honeydue --replicas=5
# Then wait
kubectl rollout status -n honeydue deployment/api

DO NOT scale worker above 1 until Asynq PeriodicTaskManager is wired.
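The worker limit can be enforced rather than remembered. The guard below is a hypothetical sketch (check_replicas does not exist in the repo); it only encodes the rule stated above and validates that the replica count is a number:

```shell
# Hypothetical guard for the worker limit stated above.
check_replicas() {
  svc=${1:-}
  replicas=${2:-}
  case "$replicas" in
    ''|*[!0-9]*)
      echo "replicas must be a non-negative integer" >&2
      return 2
      ;;
  esac
  if [ "$svc" = "worker" ] && [ "$replicas" -gt 1 ]; then
    echo "refusing to scale worker above 1" >&2
    return 1
  fi
}
```

Usage: check_replicas api 5 && kubectl scale deployment/api -n honeydue --replicas=5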

9. Drain a node for maintenance

# Prevent new pods, evict existing
kubectl drain <node-hostname> --ignore-daemonsets --delete-emptydir-data

# Do maintenance (apt upgrade, reboot, etc.)
ssh -i ~/.ssh/hetzner deploy@<node-alias> "sudo apt update && sudo apt upgrade -y && sudo reboot"

# Wait for node to come back
watch kubectl get nodes

# Allow scheduling again
kubectl uncordon <node-hostname>

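Node hostnames for kubectl, with the matching SSH alias in parentheses. The numbering does not line up, so do not use them interchangeably: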

ubuntu-8gb-nbg1-1 (hetzner2)
ubuntu-8gb-nbg1-2 (hetzner1)
ubuntu-8gb-nbg1-3 (hetzner3)
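Because the alias and hostname numbering differ, a lookup avoids draining the wrong node. A minimal sketch; the mapping is copied from the list above and the function name node_hostname is hypothetical:

```shell
# Mapping copied from the list above; the function name is hypothetical.
node_hostname() {
  case "${1:-}" in
    hetzner1) echo "ubuntu-8gb-nbg1-2" ;;
    hetzner2) echo "ubuntu-8gb-nbg1-1" ;;
    hetzner3) echo "ubuntu-8gb-nbg1-3" ;;
    *) echo "unknown SSH alias: ${1:-}" >&2; return 1 ;;
  esac
}
```

Usage: kubectl drain "$(node_hostname hetzner1)" --ignore-daemonsets --delete-emptydir-data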

10. Add a new node

# 1. Provision CX33 in Hetzner console
# 2. SSH in as root, create deploy user + key
# 3. Install k3s. The command below joins as a server; for an agent,
#    set K3S_URL and drop the server args from INSTALL_K3S_EXEC.
NODE_TOKEN=$(ssh -i ~/.ssh/hetzner deploy@hetzner1 'sudo cat /var/lib/rancher/k3s/server/node-token')
ssh -i ~/.ssh/hetzner root@<new-node-ip> "curl -sfL https://get.k3s.io | K3S_TOKEN=\"$NODE_TOKEN\" INSTALL_K3S_EXEC=\"server --server=https://178.104.247.152:6443 --disable=servicelb --write-kubeconfig-mode=644\" sh -"

# 4. Add UFW rules for inter-node traffic
#    (see deploy-k3s/scripts/ for the script)

# 5. Verify
kubectl get nodes

11. Remove a node

# Drain first
kubectl drain <hostname> --ignore-daemonsets --delete-emptydir-data

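# Stop and uninstall k3s on the node (server nodes; agent nodes use
# the k3s-agent service and k3s-agent-uninstall.sh instead)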
ssh -i ~/.ssh/hetzner deploy@<node-alias> "sudo systemctl stop k3s && sudo /usr/local/bin/k3s-uninstall.sh"

# Remove from cluster
kubectl delete node <hostname>

12. Force-restart all pods

kubectl rollout restart -n honeydue deploy/api deploy/admin deploy/worker deploy/redis

Use sparingly. Causes brief downtime per pod.

13. Migrate to a new Neon DB

# 1. Create a new branch or project on Neon
# 2. Update deploy/prod.env with the new DB_HOST
# 3. Apply new ConfigMap
kubectl create configmap honeydue-config -n honeydue \
  --from-env-file=deploy/prod.env \
  --dry-run=client -o yaml | kubectl apply -f -

# 4. Rolling restart
kubectl rollout restart -n honeydue deploy/api deploy/worker

14. Rotate Gitea registry PAT

# 1. Create new PAT in Gitea UI
# 2. Update deploy/registry.env locally
# 3. Update in-cluster Secret
kubectl create secret docker-registry gitea-credentials -n honeydue \
  --docker-server=gitea.treytartt.com \
  --docker-username=admin \
  --docker-password=<new-pat> \
  --dry-run=client -o yaml | kubectl apply -f -

# 4. Delete old PAT from Gitea UI

# 5. Pods don't re-auth with existing images (already pulled), but
# new pulls will use new PAT. Test by rolling a pod:
kubectl rollout restart -n honeydue deployment/api

15. Clean up old images in Gitea

Manual, via Gitea UI: https://gitea.treytartt.com/admin/-/packages

Keep roughly the last 30 tags per image; delete older ones.

Or via API:

GITEA_PAT="$(grep '^REGISTRY_TOKEN=' deploy/registry.env | cut -d= -f2-)"
# List tags
curl -sS -H "Authorization: token $GITEA_PAT" \
  "https://gitea.treytartt.com/api/v1/packages/admin/container/honeydue-api/versions" | jq .
# Delete specific tag
curl -X DELETE -H "Authorization: token $GITEA_PAT" \
  "https://gitea.treytartt.com/api/v1/packages/admin/container/honeydue-api/<tag>"

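To apply the keep-the-last-30 rule to a tag list, a small filter is enough. The sketch below is a hypothetical helper; it assumes the tags arrive on stdin one per line, already sorted newest first, which this section does not guarantee for the API output:

```shell
# Hypothetical helper: reads tags on stdin, one per line, newest first
# (the ordering is an assumption; sort the list before piping it in).
# Prints every tag after the first N, i.e. the candidates for deletion.
tags_to_prune() {
  keep=${1:-30}
  tail -n "+$((keep + 1))"
}
```

Review the printed tags before feeding them to the DELETE call above.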
16. Recreate the cluster from scratch

See Chapter 16 §Disaster recovery.

17. Connect to Neon directly

# Get password
PW=$(cat deploy/secrets/postgres_password.txt)

# Connect
PGPASSWORD="$PW" psql \
  -h ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
  -U neondb_owner \
  -d honeyDue

18. Check admin user credentials

# ADMIN_EMAIL is in the honeydue-secrets Secret
kubectl get secret honeydue-secrets -n honeydue \
  -o jsonpath='{.data.ADMIN_EMAIL}' | base64 -d

# ADMIN_PASSWORD (ONLY VALID FOR FIRST DEPLOY; may have been changed in UI)
kubectl get secret honeydue-secrets -n honeydue \
  -o jsonpath='{.data.ADMIN_PASSWORD}' | base64 -d

If you need to reset admin password because nobody remembers it:

# Generate a new bcrypt hash
NEW_PASSWORD='newpassword'
HASH=$(htpasswd -bnBC 10 "" "$NEW_PASSWORD" | tr -d ':\n')

# Update directly in Postgres
PGPASSWORD="$(cat deploy/secrets/postgres_password.txt)" psql \
  -h ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech \
  -U neondb_owner -d honeyDue \
  -c "UPDATE admin_users SET password='$HASH' WHERE email='admin@myhoneydue.com'"

19. Trigger a Helm chart re-run (Traefik etc.)

If the Traefik HelmChartConfig was updated but chart didn't reconcile:

kubectl delete job -n kube-system helm-install-traefik
# Helm operator re-runs automatically within ~30 seconds
kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik -w

20. Smoke test after any change

# Through Cloudflare
for url in "https://api.myhoneydue.com/api/health/" \
           "https://admin.myhoneydue.com/" \
           "https://myhoneydue.com/"; do
  ok=0
  for i in $(seq 1 20); do
    [[ "$(curl -sS -o /dev/null -w '%{http_code}' --max-time 10 "$url")" == "200" ]] && ok=$((ok+1))
  done
  printf "%-45s %d/20 ok\n" "$url" "$ok"
done

Expect 20/20 on all three.

21. Kill everything (emergency rollback)

If the cluster is so broken you need to reset the app layer:

# Scale everything to 0
kubectl scale -n honeydue deploy/api deploy/admin deploy/worker deploy/redis --replicas=0

# When ready, scale back up
kubectl scale -n honeydue deploy/api --replicas=3
kubectl scale -n honeydue deploy/admin deploy/worker deploy/redis --replicas=1

During the scale-down, Cloudflare returns errors to users because no pod is serving. Scaling back up takes ~5 min.

22. Find which pod a user's request hit

Not directly supported: requests are not logged with the node or pod name. Once request logging includes them, a grep through the logs will work.

Workaround: stream all api pod logs with stern, which prefixes each line with the pod name, and search for a unique user identifier:

stern -n honeydue api | grep "user_id=12345"
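If stern is not installed, kubectl logs --prefix gives a similar per-pod prefix. The filter below is a sketch under two assumptions: that prefixed lines look like "[pod/NAME/CONTAINER] ...", and that log lines carry the user as user_id=ID, as in the grep above. pods_for_user is a hypothetical name.

```shell
# Hypothetical filter: reads prefixed log lines on stdin and prints the
# distinct pod names that logged the given user ID.
# Assumes lines look like "[pod/NAME/CONTAINER] ... user_id=ID ...".
pods_for_user() {
  # The trailing group stops user_id=123 from matching user_id=1234.
  grep -E "user_id=$1([^0-9]|\$)" \
    | sed -n 's|^\[pod/\([^/]*\)/[^]]*] .*|\1|p' \
    | sort -u
}
```

Usage: kubectl logs -n honeydue -l app.kubernetes.io/name=api --tail=1000 --prefix | pods_for_user 12345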

23. Invalidate residence-IDs cache for a user

Used when a user reports stale data ("I joined a residence but my tasks list still shows the old one"). The cache is keyed on user ID with a 5-minute TTL, so most issues self-heal, but you can flush manually.

# Single user
kubectl -n honeydue exec deploy/redis -- redis-cli DEL "residence_ids_user:7"

# All users (nuclear; everyone pays one DB lookup on next request)
kubectl -n honeydue exec deploy/redis -- redis-cli --scan --pattern "residence_ids_user:*" \
  | xargs -r -n 100 kubectl -n honeydue exec deploy/redis -- redis-cli DEL

Mutation paths that should invalidate this cache automatically, each shown with the users it invalidates. Any new code that changes membership must call cache.InvalidateResidenceIDsForUsers(ctx, userIDs...):

ResidenceService.CreateResidence → owner
ResidenceService.DeleteResidence → all members
ResidenceService.JoinWithCode → joining user
ResidenceService.RemoveUser → removed user

If a user keeps reporting stale data, grep for missing invalidation:

grep -rnE "residenceRepo.*Add|RemoveUser|residence_residence_users" internal/ \
  | grep -v cache | grep -v _test
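For the single-user flush, building the key from the user ID avoids typos in the prefix. A minimal sketch; the key format is taken from the DEL command above, while the function name and the assumption that user IDs are numeric are mine:

```shell
# Hypothetical helper: key format copied from the DEL example above.
# Assumes user IDs are plain non-negative integers.
residence_cache_key() {
  case "${1:-}" in
    ''|*[!0-9]*)
      echo "user ID must be numeric" >&2
      return 1
      ;;
  esac
  printf 'residence_ids_user:%s\n' "$1"
}
```

Usage: kubectl -n honeydue exec deploy/redis -- redis-cli DEL "$(residence_cache_key 7)"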

24. Verify DB pool warm-up is working

After a deploy, check the api pod log for the warm-up confirmation:

kubectl -n honeydue logs -l app.kubernetes.io/name=api --tail=50 \
  | grep "DB pool warm-up complete"

Expected output (per pod):

{"level":"info","requested":20,"warmed":20,"message":"DB pool warm-up complete"}

If warmed < requested, some warm-up pings failed at boot; the pod still starts and the pool fills on demand from there. If warmed=0, something is wrong with Neon connectivity or auth; check the next log line for the specific error.

To test impact: hit the api right after a rollout. With warm-up working, the first request should be ~250ms (1 RTT). Without warm-up, the first request is ~700ms (full handshake).
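The log check can be turned into a pass/fail test. The sketch below is a hypothetical helper that assumes the log line has exactly the JSON shape shown above; it uses sed rather than a JSON parser, so it will miss lines whose field layout differs:

```shell
# Hypothetical check: reads api log lines on stdin. Succeeds only if at
# least one warm-up line is present and every one has warmed == requested.
# Runs in a subshell so its variables do not leak into the caller.
check_warmup() (
  seen=0
  bad=0
  while IFS= read -r line; do
    case "$line" in
      *'"message":"DB pool warm-up complete"'*) ;;
      *) continue ;;
    esac
    requested=$(printf '%s\n' "$line" | sed -n 's/.*"requested":\([0-9][0-9]*\).*/\1/p')
    warmed=$(printf '%s\n' "$line" | sed -n 's/.*"warmed":\([0-9][0-9]*\).*/\1/p')
    seen=$((seen + 1))
    if [ -z "$requested" ] || [ "$requested" != "$warmed" ]; then
      bad=$((bad + 1))
      echo "partial warm-up: warmed=${warmed:-?} requested=${requested:-?}" >&2
    fi
  done
  [ "$seen" -gt 0 ] && [ "$bad" -eq 0 ]
)
```

Usage: kubectl -n honeydue logs -l app.kubernetes.io/name=api --tail=50 | check_warmup && echo "warm-up OK"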

25. Switch DB host between pooler and direct endpoints

The pooler endpoint (-pooler suffix) is the default — it cuts cold-handshake latency by ~3 RTTs. The direct endpoint (ep-floral-truth-amttbc5a.c-5...) is the fallback.

# Edit deploy-k3s/config.yaml — change database.host
# To pooler:   ep-floral-truth-amttbc5a-pooler.c-5.us-east-1.aws.neon.tech
# To direct:   ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech

KUBECONFIG=~/.kube/honeydue-k3s.yaml bash deploy-k3s/scripts/03-deploy.sh --skip-build

The pooler runs in transaction mode, so session-scoped features (LISTEN/NOTIFY, session advisory locks) do not work through it. Migrations are unaffected because MigrateWithLock opens its own connection to the direct endpoint. If you ever add session-level features in the data path, they will need the direct endpoint too.
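The two hostnames differ only by the -pooler suffix on the first DNS label. A minimal sketch of converting one into the other, assuming that naming rule holds for the endpoint in use (pooler_host and direct_host are hypothetical names):

```shell
# Hypothetical helpers: convert between the pooler and direct hostnames by
# adding or removing "-pooler" on the first DNS label.
pooler_host() {
  case "${1:-}" in
    *.*) ;;
    *) echo "expected a dotted hostname" >&2; return 1 ;;
  esac
  first=${1%%.*}
  rest=${1#*.}
  case "$first" in
    *-pooler) printf '%s\n' "$1" ;;            # already the pooler host
    *) printf '%s-pooler.%s\n' "$first" "$rest" ;;
  esac
}

direct_host() {
  case "${1:-}" in
    *.*) ;;
    *) echo "expected a dotted hostname" >&2; return 1 ;;
  esac
  first=${1%%.*}
  rest=${1#*.}
  printf '%s.%s\n' "${first%-pooler}" "$rest"
}
```

Usage: direct_host ep-floral-truth-amttbc5a-pooler.c-5.us-east-1.aws.neon.tech prints the direct hostname to paste into the config.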
