Harden prod deploy: versioned secrets, healthchecks, migration lock, dry-run

Swarm stack - Resource limits on all services, stop_grace_period 60s on api/worker/admin - Dozzle bound to manager loopback only (ssh -L required for access) - Worker health server on :6060, admin /api/health endpoint - Redis 200M LRU cap, B2/S3 env vars wired through to api service Deploy script - DRY_RUN=1 prints plan + exits - Auto-rollback on failed healthcheck, docker logout at end - Versioned-secret pruning keeps last SECRET_KEEP_VERSIONS (default 3) - PUSH_LATEST_TAG default flipped to false - B2 all-or-none validation before deploy Code - cmd/api takes pg_advisory_lock on a dedicated connection before AutoMigrate, serialising boot-time migrations across replicas - cmd/worker exposes an HTTP /health endpoint with graceful shutdown Docs - deploy/DEPLOYING.md: step-by-step walkthrough for a real deploy - deploy/shit_deploy_cant_do.md: manual prerequisites + recurring ops - deploy/README.md updated with storage toggle, worker-replica caveat, multi-arch recipe, connection-pool tuning, renumbered sections Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 15:22:43 -05:00
parent ca818e8478
commit 33eee812b6
11 changed files with 908 additions and 30 deletions
@@ -0,0 +1,126 @@
+# Deploying Right Now
+
+Practical walkthrough for a prod deploy against the current Swarm stack.
+Assumes infrastructure and cloud services already exist — if not, work
+through [`shit_deploy_cant_do.md`](./shit_deploy_cant_do.md) first.
+
+See [`README.md`](./README.md) for the reference docs that back each step.
+
+---
+
+## 0. Pre-flight — check local state
+
+```bash
+cd honeyDueAPI-go
+
+git status                # clean working tree?
+git log -1 --oneline      # deploying this SHA
+
+ls deploy/cluster.env deploy/registry.env deploy/prod.env
+ls deploy/secrets/*.txt deploy/secrets/*.p8
+```
+
+## 1. Reconcile your envs with current defaults
+
+These two values **must** be right — the script does not enforce them:
+
+```bash
+# deploy/cluster.env
+WORKER_REPLICAS=1          # >1 → duplicate cron jobs (Asynq scheduler is a singleton)
+PUSH_LATEST_TAG=false      # keeps prod images SHA-pinned
+SECRET_KEEP_VERSIONS=3     # optional; 3 is the default
+```
+
+Decide storage backend in `deploy/prod.env`:
+
+- **Multi-replica safe (recommended):** set all four of `B2_ENDPOINT`,
+  `B2_KEY_ID`, `B2_APP_KEY`, `B2_BUCKET_NAME`. Uploads go to B2.
+- **Single-node ok:** leave all four empty. Script will warn. In this
+  mode you must also set `API_REPLICAS=1` — otherwise uploads are
+  invisible from 2/3 of requests.
+
+## 2. Dry run
+
+```bash
+DRY_RUN=1 ./.deploy_prod
+```
+
+Confirm in the output:
+- `Storage backend: S3 (...)` OR the `LOCAL VOLUME` warning matches intent
+- `Replicas: api=3, worker=1, admin=1` (or `api=1` if local storage)
+- Image SHA matches `git rev-parse --short HEAD`
+- `Manager:` host is correct
+- `Secret retention: 3 versions`
+
+Fix envs and re-run until the plan looks right. Nothing touches the cluster yet.
+
+## 3. Real deploy
+
+```bash
+./.deploy_prod
+```
+
+Do **not** pass `SKIP_BUILD=1` after code changes — the worker's health
+server and `MigrateWithLock` both require a fresh build.
+
+End-to-end: ~3–8 minutes. The script prints each phase.
+
+## 4. Post-deploy verification
+
+```bash
+# Stack health (replicas X/X = desired)
+ssh <manager> docker stack services honeydue
+
+# API smoke
+curl -fsS https://api.<domain>/api/health/ && echo OK
+
+# Logs via Dozzle (loopback-bound, needs SSH tunnel)
+ssh -p <port> -L 9999:127.0.0.1:9999 <user>@<manager>
+# Then browse http://localhost:9999
+```
+
+What the logs should show on a healthy boot:
+- `api`: exactly one replica logs `Migration advisory lock acquired`,
+  the others log `Migration advisory lock acquired` after waiting, then
+  `released`.
+- `worker`: `Health server listening addr=:6060`, `Starting worker server...`,
+  four `Registered ... job` lines.
+- No `Failed to connect to Redis` / `Failed to connect to database`.
+
+## 5. If it goes wrong
+
+Auto-rollback triggers when `DEPLOY_HEALTHCHECK_URL` fails — every service
+is rolled back to its previous spec, script exits non-zero.
+
+Triage:
+
+```bash
+ssh <manager> docker service logs --tail 200 honeydue_api
+ssh <manager> docker service ps honeydue_api --no-trunc
+```
+
+Manual rollback (if auto didn't catch it):
+
+```bash
+ssh <manager> bash -c '
+  for svc in $(docker stack services honeydue --format "{{.Name}}"); do
+    docker service rollback "$svc"
+  done'
+```
+
+Redeploy a known-good SHA:
+
+```bash
+DEPLOY_TAG=<older-sha> SKIP_BUILD=1 ./.deploy_prod
+# Only valid if that image was previously pushed to the registry.
+```
+
+## 6. Pre-deploy honesty check
+
+Before pulling the trigger:
+
+- [ ] Tested Neon PITR restore (not just "backups exist")?
+- [ ] `WORKER_REPLICAS=1` — otherwise duplicate push notifications next cron tick
+- [ ] Cloudflare-only firewall rule on 80/443 — otherwise origin IP is on the public internet
+- [ ] If storage is LOCAL, `API_REPLICAS=1` too
+- [ ] Last deploy's secrets still valid (rotation hasn't expired any creds)