# Deploying Right Now

Practical walkthrough for a prod deploy against the current Swarm stack.
Assumes infrastructure and cloud services already exist — if not, work
through `shit_deploy_cant_do.md` first.

See `README.md` for the reference docs that back each step.
## 0. Pre-flight — check local state

```sh
cd honeyDueAPI-go
git status               # clean working tree?
git log -1 --oneline     # deploying this SHA
ls deploy/cluster.env deploy/registry.env deploy/prod.env
ls deploy/secrets/*.txt deploy/secrets/*.p8
```
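If you run this often, the file checks above can be wrapped in a small guard. This is a sketch, not part of the deploy script; `require_files` is a hypothetical helper:

```sh
# Hypothetical pre-flight helper: names every missing file and returns
# non-zero if any are absent. Paths come from the checklist above.
require_files() {
  local rc=0 f
  for f in "$@"; do
    if [ ! -f "$f" ]; then
      echo "missing: $f" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Example:
# require_files deploy/cluster.env deploy/registry.env deploy/prod.env \
#               deploy/secrets/*.txt deploy/secrets/*.p8
```

The git cleanliness check stays manual — `git status --porcelain` output needs a human eye anyway.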
## 1. Reconcile your envs with current defaults

These two values must be right — the script does not enforce them:

```sh
# deploy/cluster.env
WORKER_REPLICAS=1       # >1 → duplicate cron jobs (Asynq scheduler is a singleton)
PUSH_LATEST_TAG=false   # keeps prod images SHA-pinned
SECRET_KEEP_VERSIONS=3  # optional; 3 is the default
```
Decide the storage backend in `deploy/prod.env`:

- Multi-replica safe (recommended): set all four of `B2_ENDPOINT`,
  `B2_KEY_ID`, `B2_APP_KEY`, `B2_BUCKET_NAME`. Uploads go to B2.
- Single-node ok: leave all four empty. The script will warn. In this
  mode you must also set `API_REPLICAS=1` — otherwise uploads are
  invisible from 2/3 of requests.
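The all-or-none rule for those four variables can be sketched as a shell check. The variable names are the real ones from `prod.env`; the function itself is illustrative, not the deploy script's actual validation:

```sh
# Illustrative all-or-none check: either all four B2 variables are set
# (B2 mode) or none are (local volume mode). Anything in between is a
# misconfiguration and should abort the deploy.
b2_mode() {
  local n=0 v
  for v in B2_ENDPOINT B2_KEY_ID B2_APP_KEY B2_BUCKET_NAME; do
    if [ -n "${!v:-}" ]; then n=$((n + 1)); fi
  done
  case "$n" in
    0) echo "local" ;;                  # single-node local volume (script warns)
    4) echo "b2" ;;                     # multi-replica safe
    *) echo "invalid" >&2; return 1 ;;  # partial config: fix before deploying
  esac
}
```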
## 2. Dry run

```sh
DRY_RUN=1 ./.deploy_prod
```

Confirm in the output:

- `Storage backend: S3 (...)` OR the `LOCAL VOLUME` warning matches intent
- `Replicas: api=3, worker=1, admin=1` (or `api=1` if local storage)
- Image SHA matches `git rev-parse --short HEAD`
- `Manager:` host is correct
- `Secret retention: 3 versions`
Fix envs and re-run until the plan looks right. Nothing touches the cluster yet.
## 3. Real deploy

```sh
./.deploy_prod
```

Do not pass `SKIP_BUILD=1` after code changes — the worker's health
server and `MigrateWithLock` both require a fresh build.

End-to-end: ~3–8 minutes. The script prints each phase.
## 4. Post-deploy verification

```sh
# Stack health (replicas X/X = desired)
ssh <manager> docker stack services honeydue

# API smoke
curl -fsS https://api.<domain>/api/health/ && echo OK

# Logs via Dozzle (loopback-bound, needs SSH tunnel)
ssh -p <port> -L 9999:127.0.0.1:9999 <user>@<manager>
# Then browse http://localhost:9999
```
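Right after a deploy the API may still be starting, so a one-shot `curl` can fail spuriously. A small polling helper avoids that race; this is a sketch, `wait_healthy` is not part of the stack:

```sh
# Illustrative helper: poll a health URL until it answers or we give
# up. URL and retry count are parameters; retries are 2s apart.
wait_healthy() {
  local url=$1 tries=${2:-30} i
  for ((i = 1; i <= tries; i++)); do
    if curl -fsS "$url" >/dev/null 2>&1; then return 0; fi
    if ((i < tries)); then sleep 2; fi
  done
  return 1
}

# Example:
# wait_healthy "https://api.<domain>/api/health/" 30 && echo OK
```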
What the logs should show on a healthy boot:

- `api`: exactly one replica logs `Migration advisory lock acquired`
  right away; the others log `Migration advisory lock acquired` after
  waiting, then `released`.
- `worker`: `Health server listening addr=:6060`, `Starting worker
  server...`, four `Registered ... job` lines.
- No `Failed to connect to Redis` / `Failed to connect to database`.
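That staggered acquire/release pattern is what serialised boot-time migrations look like: each replica blocks on the advisory lock, runs `MigrateWithLock`, then releases. As a local analogy only (not the app's code, which locks in the database), `flock(1)` reproduces the same shape:

```sh
# Analogy only: flock serialises concurrent runs the way the API
# serialises boot-time migrations across replicas with an advisory
# lock. Acquire -> do the guarded work -> release.
run_serialised() {
  local lockfile=$1; shift
  (
    flock 9          # blocks until no other process holds fd 9's lock
    echo "lock acquired"
    "$@"             # the guarded work (AutoMigrate in the real app)
    echo "released"
  ) 9>"$lockfile"
}
```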
## 5. If it goes wrong

Auto-rollback triggers when `DEPLOY_HEALTHCHECK_URL` fails — every
service is rolled back to its previous spec and the script exits
non-zero.
Triage:

```sh
ssh <manager> docker service logs --tail 200 honeydue_api
ssh <manager> docker service ps honeydue_api --no-trunc
```
Manual rollback (if auto didn't catch it):

```sh
ssh <manager> bash -c '
  for svc in $(docker stack services honeydue --format "{{.Name}}"); do
    docker service rollback "$svc"
  done'
```
Redeploy a known-good SHA:

```sh
DEPLOY_TAG=<older-sha> SKIP_BUILD=1 ./.deploy_prod
# Only valid if that image was previously pushed to the registry.
```
## 6. Pre-deploy honesty check

Before pulling the trigger:

- Tested Neon PITR restore (not just "backups exist")?
- `WORKER_REPLICAS=1` — otherwise duplicate push notifications next cron tick
- Cloudflare-only firewall rule on 80/443 — otherwise the origin IP is on the public internet
- If storage is LOCAL, `API_REPLICAS=1` too
- Last deploy's secrets still valid (rotation hasn't expired any creds)
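The two replica invariants in that list can be machine-checked before a deploy. This is a hypothetical helper, not something the deploy script provides; it assumes the plain `KEY=value` layout shown in section 1:

```sh
# Hypothetical invariant check: WORKER_REPLICAS must be 1, and if no
# B2 endpoint is configured (local storage), API_REPLICAS must be 1 too.
check_replicas() {
  local cluster_env=$1 prod_env=$2
  grep -q '^WORKER_REPLICAS=1$' "$cluster_env" \
    || { echo "WORKER_REPLICAS must be 1 (singleton cron scheduler)"; return 1; }
  # Local storage (no B2 endpoint) additionally needs a single API replica.
  if ! grep -q '^B2_ENDPOINT=..*' "$prod_env"; then
    grep -q '^API_REPLICAS=1$' "$cluster_env" \
      || { echo "local storage requires API_REPLICAS=1"; return 1; }
  fi
}

# Example:
# check_replicas deploy/cluster.env deploy/prod.env || exit 1
```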