Harden prod deploy: versioned secrets, healthchecks, migration lock, dry-run

Swarm stack
- Resource limits on all services, stop_grace_period 60s on api/worker/admin
- Dozzle bound to manager loopback only (ssh -L required for access)
- Worker health server on :6060, admin /api/health endpoint
- Redis 200M LRU cap, B2/S3 env vars wired through to api service

Deploy script
- DRY_RUN=1 prints plan + exits
- Auto-rollback on failed healthcheck, docker logout at end
- Versioned-secret pruning keeps last SECRET_KEEP_VERSIONS (default 3)
- PUSH_LATEST_TAG default flipped to false
- B2 all-or-none validation before deploy

Code
- cmd/api takes pg_advisory_lock on a dedicated connection before
  AutoMigrate, serialising boot-time migrations across replicas
- cmd/worker exposes an HTTP /health endpoint with graceful shutdown

Docs
- deploy/DEPLOYING.md: step-by-step walkthrough for a real deploy
- deploy/shit_deploy_cant_do.md: manual prerequisites + recurring ops
- deploy/README.md updated with storage toggle, worker-replica caveat,
  multi-arch recipe, connection-pool tuning, renumbered sections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Trey t
2026-04-14 15:22:43 -05:00
parent ca818e8478
commit 33eee812b6
11 changed files with 908 additions and 30 deletions

126
deploy/DEPLOYING.md Normal file
View File

@@ -0,0 +1,126 @@
# Deploying Right Now
Practical walkthrough for a prod deploy against the current Swarm stack.
Assumes infrastructure and cloud services already exist — if not, work
through [`shit_deploy_cant_do.md`](./shit_deploy_cant_do.md) first.
See [`README.md`](./README.md) for the reference docs that back each step.
---
## 0. Pre-flight — check local state
```bash
cd honeyDueAPI-go
git status # clean working tree?
git log -1 --oneline # deploying this SHA
ls deploy/cluster.env deploy/registry.env deploy/prod.env
ls deploy/secrets/*.txt deploy/secrets/*.p8
```
## 1. Reconcile your envs with current defaults
These two values **must** be right — the script does not enforce them:
```bash
# deploy/cluster.env
WORKER_REPLICAS=1 # >1 → duplicate cron jobs (Asynq scheduler is a singleton)
PUSH_LATEST_TAG=false # keeps prod images SHA-pinned
SECRET_KEEP_VERSIONS=3 # optional; 3 is the default
```
Decide storage backend in `deploy/prod.env`:
- **Multi-replica safe (recommended):** set all four of `B2_ENDPOINT`,
`B2_KEY_ID`, `B2_APP_KEY`, `B2_BUCKET_NAME`. Uploads go to B2.
- **Single-node ok:** leave all four empty. Script will warn. In this
mode you must also set `API_REPLICAS=1` — otherwise uploads are
invisible from 2/3 of requests.
## 2. Dry run
```bash
DRY_RUN=1 ./.deploy_prod
```
Confirm in the output:
- `Storage backend: S3 (...)` OR the `LOCAL VOLUME` warning matches intent
- `Replicas: api=3, worker=1, admin=1` (or `api=1` if local storage)
- Image SHA matches `git rev-parse --short HEAD`
- `Manager:` host is correct
- `Secret retention: 3 versions`
Fix envs and re-run until the plan looks right. Nothing touches the cluster yet.
## 3. Real deploy
```bash
./.deploy_prod
```
Do **not** pass `SKIP_BUILD=1` after code changes — the worker's health
server and `MigrateWithLock` both require a fresh build.
End-to-end: ~38 minutes. The script prints each phase.
## 4. Post-deploy verification
```bash
# Stack health (replicas X/X = desired)
ssh <manager> docker stack services honeydue
# API smoke
curl -fsS https://api.<domain>/api/health/ && echo OK
# Logs via Dozzle (loopback-bound, needs SSH tunnel)
ssh -p <port> -L 9999:127.0.0.1:9999 <user>@<manager>
# Then browse http://localhost:9999
```
What the logs should show on a healthy boot:
- `api`: exactly one replica logs `Migration advisory lock acquired`,
the others log `Migration advisory lock acquired` after waiting, then
`released`.
- `worker`: `Health server listening addr=:6060`, `Starting worker server...`,
four `Registered ... job` lines.
- No `Failed to connect to Redis` / `Failed to connect to database`.
## 5. If it goes wrong
Auto-rollback triggers when `DEPLOY_HEALTHCHECK_URL` fails — every service
is rolled back to its previous spec, script exits non-zero.
Triage:
```bash
ssh <manager> docker service logs --tail 200 honeydue_api
ssh <manager> docker service ps honeydue_api --no-trunc
```
Manual rollback (if auto didn't catch it):
```bash
ssh <manager> bash -c '
for svc in $(docker stack services honeydue --format "{{.Name}}"); do
docker service rollback "$svc"
done'
```
Redeploy a known-good SHA:
```bash
DEPLOY_TAG=<older-sha> SKIP_BUILD=1 ./.deploy_prod
# Only valid if that image was previously pushed to the registry.
```
## 6. Pre-deploy honesty check
Before pulling the trigger:
- [ ] Tested Neon PITR restore (not just "backups exist")?
- [ ] `WORKER_REPLICAS=1` — otherwise duplicate push notifications next cron tick
- [ ] Cloudflare-only firewall rule on 80/443 — otherwise origin IP is on the public internet
- [ ] If storage is LOCAL, `API_REPLICAS=1` too
- [ ] Last deploy's secrets still valid (rotation hasn't expired any creds)