Harden prod deploy: versioned secrets, healthchecks, migration lock, dry-run
Swarm stack - Resource limits on all services, stop_grace_period 60s on api/worker/admin - Dozzle bound to manager loopback only (ssh -L required for access) - Worker health server on :6060, admin /api/health endpoint - Redis 200M LRU cap, B2/S3 env vars wired through to api service Deploy script - DRY_RUN=1 prints plan + exits - Auto-rollback on failed healthcheck, docker logout at end - Versioned-secret pruning keeps last SECRET_KEEP_VERSIONS (default 3) - PUSH_LATEST_TAG default flipped to false - B2 all-or-none validation before deploy Code - cmd/api takes pg_advisory_lock on a dedicated connection before AutoMigrate, serialising boot-time migrations across replicas - cmd/worker exposes an HTTP /health endpoint with graceful shutdown Docs - deploy/DEPLOYING.md: step-by-step walkthrough for a real deploy - deploy/shit_deploy_cant_do.md: manual prerequisites + recurring ops - deploy/README.md updated with storage toggle, worker-replica caveat, multi-arch recipe, connection-pool tuning, renumbered sections Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
126
deploy/DEPLOYING.md
Normal file
126
deploy/DEPLOYING.md
Normal file
@@ -0,0 +1,126 @@
|
||||
# Deploying Right Now
|
||||
|
||||
Practical walkthrough for a prod deploy against the current Swarm stack.
|
||||
Assumes infrastructure and cloud services already exist — if not, work
|
||||
through [`shit_deploy_cant_do.md`](./shit_deploy_cant_do.md) first.
|
||||
|
||||
See [`README.md`](./README.md) for the reference docs that back each step.
|
||||
|
||||
---
|
||||
|
||||
## 0. Pre-flight — check local state
|
||||
|
||||
```bash
|
||||
cd honeyDueAPI-go
|
||||
|
||||
git status # clean working tree?
|
||||
git log -1 --oneline # deploying this SHA
|
||||
|
||||
ls deploy/cluster.env deploy/registry.env deploy/prod.env
|
||||
ls deploy/secrets/*.txt deploy/secrets/*.p8
|
||||
```
|
||||
|
||||
## 1. Reconcile your envs with current defaults
|
||||
|
||||
These two values **must** be right — the script does not enforce them:
|
||||
|
||||
```bash
|
||||
# deploy/cluster.env
|
||||
WORKER_REPLICAS=1 # >1 → duplicate cron jobs (Asynq scheduler is a singleton)
|
||||
PUSH_LATEST_TAG=false # keeps prod images SHA-pinned
|
||||
SECRET_KEEP_VERSIONS=3 # optional; 3 is the default
|
||||
```
|
||||
|
||||
Decide storage backend in `deploy/prod.env`:
|
||||
|
||||
- **Multi-replica safe (recommended):** set all four of `B2_ENDPOINT`,
|
||||
`B2_KEY_ID`, `B2_APP_KEY`, `B2_BUCKET_NAME`. Uploads go to B2.
|
||||
- **Single-node ok:** leave all four empty. Script will warn. In this
|
||||
mode you must also set `API_REPLICAS=1` — otherwise uploads are
|
||||
invisible from 2/3 of requests.
|
||||
|
||||
## 2. Dry run
|
||||
|
||||
```bash
|
||||
DRY_RUN=1 ./.deploy_prod
|
||||
```
|
||||
|
||||
Confirm in the output:
|
||||
- `Storage backend: S3 (...)` OR the `LOCAL VOLUME` warning matches intent
|
||||
- `Replicas: api=3, worker=1, admin=1` (or `api=1` if local storage)
|
||||
- Image SHA matches `git rev-parse --short HEAD`
|
||||
- `Manager:` host is correct
|
||||
- `Secret retention: 3 versions`
|
||||
|
||||
Fix envs and re-run until the plan looks right. Nothing touches the cluster yet.
|
||||
|
||||
## 3. Real deploy
|
||||
|
||||
```bash
|
||||
./.deploy_prod
|
||||
```
|
||||
|
||||
Do **not** pass `SKIP_BUILD=1` after code changes — the worker's health
|
||||
server and `MigrateWithLock` both require a fresh build.
|
||||
|
||||
End-to-end: ~3–8 minutes. The script prints each phase.
|
||||
|
||||
## 4. Post-deploy verification
|
||||
|
||||
```bash
|
||||
# Stack health (replicas X/X = desired)
|
||||
ssh <manager> docker stack services honeydue
|
||||
|
||||
# API smoke
|
||||
curl -fsS https://api.<domain>/api/health/ && echo OK
|
||||
|
||||
# Logs via Dozzle (loopback-bound, needs SSH tunnel)
|
||||
ssh -p <port> -L 9999:127.0.0.1:9999 <user>@<manager>
|
||||
# Then browse http://localhost:9999
|
||||
```
|
||||
|
||||
What the logs should show on a healthy boot:
|
||||
- `api`: exactly one replica logs `Migration advisory lock acquired`,
|
||||
the others log `Migration advisory lock acquired` after waiting, then
|
||||
`released`.
|
||||
- `worker`: `Health server listening addr=:6060`, `Starting worker server...`,
|
||||
four `Registered ... job` lines.
|
||||
- No `Failed to connect to Redis` / `Failed to connect to database`.
|
||||
|
||||
## 5. If it goes wrong
|
||||
|
||||
Auto-rollback triggers when `DEPLOY_HEALTHCHECK_URL` fails — every service
|
||||
is rolled back to its previous spec, script exits non-zero.
|
||||
|
||||
Triage:
|
||||
|
||||
```bash
|
||||
ssh <manager> docker service logs --tail 200 honeydue_api
|
||||
ssh <manager> docker service ps honeydue_api --no-trunc
|
||||
```
|
||||
|
||||
Manual rollback (if auto didn't catch it):
|
||||
|
||||
```bash
|
||||
ssh <manager> bash -c '
|
||||
for svc in $(docker stack services honeydue --format "{{.Name}}"); do
|
||||
docker service rollback "$svc"
|
||||
done'
|
||||
```
|
||||
|
||||
Redeploy a known-good SHA:
|
||||
|
||||
```bash
|
||||
DEPLOY_TAG=<older-sha> SKIP_BUILD=1 ./.deploy_prod
|
||||
# Only valid if that image was previously pushed to the registry.
|
||||
```
|
||||
|
||||
## 6. Pre-deploy honesty check
|
||||
|
||||
Before pulling the trigger:
|
||||
|
||||
- [ ] Tested Neon PITR restore (not just "backups exist")?
|
||||
- [ ] `WORKER_REPLICAS=1` — otherwise duplicate push notifications next cron tick
|
||||
- [ ] Cloudflare-only firewall rule on 80/443 — otherwise origin IP is on the public internet
|
||||
- [ ] If storage is LOCAL, `API_REPLICAS=1` too
|
||||
- [ ] Last deploy's secrets still valid (rotation hasn't expired any creds)
|
||||
Reference in New Issue
Block a user