# Deploying Right Now

Practical walkthrough for a prod deploy against the current Swarm stack.
Assumes infrastructure and cloud services already exist — if not, work
through `shit_deploy_cant_do.md` first.

See `README.md` for the reference docs that back each step.
## 0. Pre-flight — check local state

```sh
cd honeyDueAPI-go
git status               # clean working tree?
git log -1 --oneline     # deploying this SHA
ls deploy/cluster.env deploy/registry.env deploy/prod.env
ls deploy/secrets/*.txt deploy/secrets/*.p8
```
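If you run this often, the file checks above can be wrapped in a small guard. This is a sketch, not part of the deploy script; `require_files` is a hypothetical helper:

```sh
# Hypothetical pre-flight helper: names every missing file and returns
# non-zero if any are absent. Paths come from the checklist above.
require_files() {
  local rc=0 f
  for f in "$@"; do
    if [ ! -f "$f" ]; then
      echo "missing: $f" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Example:
# require_files deploy/cluster.env deploy/registry.env deploy/prod.env \
#               deploy/secrets/*.txt deploy/secrets/*.p8
```

The git cleanliness check stays manual — `git status --porcelain` output needs a human eye anyway.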
## 1. Reconcile your envs with current defaults

These two values must be right — the script does not enforce them:

```sh
# deploy/cluster.env
WORKER_REPLICAS=1       # >1 → duplicate cron jobs (Asynq scheduler is a singleton)
PUSH_LATEST_TAG=false   # keeps prod images SHA-pinned
SECRET_KEEP_VERSIONS=3  # optional; 3 is the default
```
Decide the storage backend in `deploy/prod.env`:

- Multi-replica safe (recommended): set all four of `B2_ENDPOINT`,
  `B2_KEY_ID`, `B2_APP_KEY`, `B2_BUCKET_NAME`. Uploads go to B2.
- Single-node ok: leave all four empty. The script will warn. In this
  mode you must also set `API_REPLICAS=1` — otherwise uploads are
  invisible from 2/3 of requests.
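The all-or-none rule for those four variables can be sketched as a shell check. The variable names are the real ones from `prod.env`; the function itself is illustrative, not the deploy script's actual validation:

```sh
# Illustrative all-or-none check: either all four B2 variables are set
# (B2 mode) or none are (local volume mode). Anything in between is a
# misconfiguration and should abort the deploy.
b2_mode() {
  local n=0 v
  for v in B2_ENDPOINT B2_KEY_ID B2_APP_KEY B2_BUCKET_NAME; do
    if [ -n "${!v:-}" ]; then n=$((n + 1)); fi
  done
  case "$n" in
    0) echo "local" ;;                  # single-node local volume (script warns)
    4) echo "b2" ;;                     # multi-replica safe
    *) echo "invalid" >&2; return 1 ;;  # partial config: fix before deploying
  esac
}
```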
## 2. Dry run

```sh
DRY_RUN=1 ./.deploy_prod
```

Confirm in the output:

- `Storage backend: S3 (...)` OR the `LOCAL VOLUME` warning matches intent
- `Replicas: api=3, worker=1, admin=1` (or `api=1` if local storage)
- Image SHA matches `git rev-parse --short HEAD`
- `Manager:` host is correct
- `Secret retention: 3 versions`
Fix envs and re-run until the plan looks right. Nothing touches the cluster yet.
## 3. Real deploy

```sh
./.deploy_prod
```

Do not pass `SKIP_BUILD=1` after code changes — the worker's health
server and `MigrateWithLock` both require a fresh build.

End-to-end: ~3–8 minutes. The script prints each phase.
## 4. Post-deploy verification

```sh
# Stack health (replicas X/X = desired)
ssh <manager> docker stack services honeydue

# API smoke
curl -fsS https://api.<domain>/api/health/ && echo OK

# Logs via Dozzle (loopback-bound, needs SSH tunnel)
ssh -p <port> -L 9999:127.0.0.1:9999 <user>@<manager>
# Then browse http://localhost:9999
```
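Right after a deploy the API may still be starting, so a one-shot `curl` can fail spuriously. A small polling helper avoids that race; this is a sketch, `wait_healthy` is not part of the stack:

```sh
# Illustrative helper: poll a health URL until it answers or we give
# up. URL and retry count are parameters; retries are 2s apart.
wait_healthy() {
  local url=$1 tries=${2:-30} i
  for ((i = 1; i <= tries; i++)); do
    if curl -fsS "$url" >/dev/null 2>&1; then return 0; fi
    if ((i < tries)); then sleep 2; fi
  done
  return 1
}

# Example:
# wait_healthy "https://api.<domain>/api/health/" 30 && echo OK
```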
What the logs should show on a healthy boot:

- `api`: exactly one replica logs `Migration advisory lock acquired`
  right away; the others log `Migration advisory lock acquired` after
  waiting, then `released`.
- `worker`: `Health server listening addr=:6060`, `Starting worker
  server...`, four `Registered ... job` lines.
- No `Failed to connect to Redis` / `Failed to connect to database`.
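That staggered acquire/release pattern is what serialised boot-time migrations look like: each replica blocks on the advisory lock, runs `MigrateWithLock`, then releases. As a local analogy only (not the app's code, which locks in the database), `flock(1)` reproduces the same shape:

```sh
# Analogy only: flock serialises concurrent runs the way the API
# serialises boot-time migrations across replicas with an advisory
# lock. Acquire -> do the guarded work -> release.
run_serialised() {
  local lockfile=$1; shift
  (
    flock 9          # blocks until no other process holds fd 9's lock
    echo "lock acquired"
    "$@"             # the guarded work (AutoMigrate in the real app)
    echo "released"
  ) 9>"$lockfile"
}
```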
## 5. If it goes wrong

Auto-rollback triggers when `DEPLOY_HEALTHCHECK_URL` fails — every
service is rolled back to its previous spec and the script exits
non-zero.
Triage:

```sh
ssh <manager> docker service logs --tail 200 honeydue_api
ssh <manager> docker service ps honeydue_api --no-trunc
```
Manual rollback (if auto didn't catch it):

```sh
ssh <manager> bash -c '
  for svc in $(docker stack services honeydue --format "{{.Name}}"); do
    docker service rollback "$svc"
  done'
```
Redeploy a known-good SHA:

```sh
DEPLOY_TAG=<older-sha> SKIP_BUILD=1 ./.deploy_prod
# Only valid if that image was previously pushed to the registry.
```
## 6. Pre-deploy honesty check

Before pulling the trigger:

- Tested Neon PITR restore (not just "backups exist")?
- `WORKER_REPLICAS=1` — otherwise duplicate push notifications next cron tick
- Cloudflare-only firewall rule on 80/443 — otherwise the origin IP is on the public internet
- If storage is LOCAL, `API_REPLICAS=1` too
- Last deploy's secrets still valid (rotation hasn't expired any creds)
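The two replica invariants in that list can be machine-checked before a deploy. This is a hypothetical helper, not something the deploy script provides; it assumes the plain `KEY=value` layout shown in section 1:

```sh
# Hypothetical invariant check: WORKER_REPLICAS must be 1, and if no
# B2 endpoint is configured (local storage), API_REPLICAS must be 1 too.
check_replicas() {
  local cluster_env=$1 prod_env=$2
  grep -q '^WORKER_REPLICAS=1$' "$cluster_env" \
    || { echo "WORKER_REPLICAS must be 1 (singleton cron scheduler)"; return 1; }
  # Local storage (no B2 endpoint) additionally needs a single API replica.
  if ! grep -q '^B2_ENDPOINT=..*' "$prod_env"; then
    grep -q '^API_REPLICAS=1$' "$cluster_env" \
      || { echo "local storage requires API_REPLICAS=1"; return 1; }
  fi
}

# Example:
# check_replicas deploy/cluster.env deploy/prod.env || exit 1
```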