Trey 33eee812b6 Harden prod deploy: versioned secrets, healthchecks, migration lock, dry-run
Swarm stack
- Resource limits on all services, stop_grace_period 60s on api/worker/admin
- Dozzle bound to manager loopback only (ssh -L required for access)
- Worker health server on :6060, admin /api/health endpoint
- Redis 200M LRU cap, B2/S3 env vars wired through to api service

Deploy script
- DRY_RUN=1 prints plan + exits
- Auto-rollback on failed healthcheck, docker logout at end
- Versioned-secret pruning keeps last SECRET_KEEP_VERSIONS (default 3)
- PUSH_LATEST_TAG default flipped to false
- B2 all-or-none validation before deploy

Code
- cmd/api takes pg_advisory_lock on a dedicated connection before
  AutoMigrate, serialising boot-time migrations across replicas
- cmd/worker exposes an HTTP /health endpoint with graceful shutdown

Docs
- deploy/DEPLOYING.md: step-by-step walkthrough for a real deploy
- deploy/shit_deploy_cant_do.md: manual prerequisites + recurring ops
- deploy/README.md updated with storage toggle, worker-replica caveat,
  multi-arch recipe, connection-pool tuning, renumbered sections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 15:22:43 -05:00

Deploy Folder

This folder is the full production deploy toolkit for honeyDueAPI-go.

Recommended flow — always dry-run first:

DRY_RUN=1 ./.deploy_prod   # validates everything, prints the plan, no changes
./.deploy_prod             # then the real deploy

The script refuses to run until all required values are set.

  • Step-by-step walkthrough for a real deploy: DEPLOYING.md
  • Manual prerequisites the script cannot automate (Swarm init, firewall, Cloudflare, Neon, APNS, etc.): shit_deploy_cant_do.md

First-Time Prerequisite: Create The Swarm Cluster

You must do this once before ./.deploy_prod can work.

  1. SSH to manager #1 and initialize Swarm:
docker swarm init --advertise-addr <manager1-private-ip>
  2. On manager #1, get the join commands:
docker swarm join-token manager
docker swarm join-token worker
  3. SSH to each additional node and run the appropriate docker swarm join ... command.
  4. Verify from manager #1:

docker node ls

Security Requirements Before Public Launch

Use this as a mandatory checklist before you route production traffic.

1) Firewall Rules (Node-Level)

Apply firewall rules to all Swarm nodes:

  • SSH port (for example 2222/tcp): your IP only
  • 80/tcp, 443/tcp: Hetzner LB only (or Cloudflare IP ranges only if no LB)
  • 2377/tcp: Swarm nodes only
  • 7946/tcp,udp: Swarm nodes only
  • 4789/udp: Swarm nodes only
  • Everything else: blocked
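
With ufw as the firewall (an assumption; any node-level firewall works), the list above might translate to something like the following. The node subnet (10.0.0.0/24), LB address (10.0.1.5), admin IP (203.0.113.10), and SSH port are all placeholders:

```shell
# Sketch only: ufw equivalents of the rules above. All addresses are
# placeholders -- substitute your node subnet, LB, and admin IP.
ufw default deny incoming
ufw default allow outgoing
ufw allow from 203.0.113.10 to any port 2222 proto tcp   # SSH: your IP only
ufw allow from 10.0.1.5 to any port 80,443 proto tcp     # HTTP/S: LB only
ufw allow from 10.0.0.0/24 to any port 2377 proto tcp    # Swarm management
ufw allow from 10.0.0.0/24 to any port 7946 proto tcp    # node gossip
ufw allow from 10.0.0.0/24 to any port 7946 proto udp
ufw allow from 10.0.0.0/24 to any port 4789 proto udp    # overlay (VXLAN)
ufw enable
```

Caveat: Docker inserts its own iptables rules ahead of ufw for published ports, so verify the rules actually hold with an external port scan rather than trusting ufw status alone.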

2) SSH Hardening

On each node, harden /etc/ssh/sshd_config:

Port 2222
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
AllowUsers deploy
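
A typo in sshd_config can lock you out, so validate and reload from a session you keep open; the service unit name varies by distro (sshd vs ssh):

```shell
sshd -t                        # parse check; silent on success
systemctl reload sshd          # on Debian/Ubuntu the unit may be "ssh"
# From your workstation, confirm a NEW login works before closing this one:
ssh -p 2222 deploy@<node-ip> true
```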

3) Cloudflare Origin Lockdown

  • Keep public DNS records proxied (orange cloud on).
  • Point Cloudflare to LB, not node IPs.
  • Do not publish Swarm node IPs in DNS.
  • Enforce firewall source restrictions so public traffic cannot bypass Cloudflare/LB.

4) Secrets Policy

  • Keep runtime secrets in Docker Swarm secrets only.
  • Do not put production secrets in git or plain .env files.
  • ./.deploy_prod already creates versioned Swarm secrets from files in deploy/secrets/.
  • Rotate secrets after incidents or credential exposure.

5) Data Path Security

  • Neon/Postgres: DB_SSLMODE=require, strong DB password, Neon IP allowlist limited to node IPs.
  • Backblaze B2: HTTPS only, scoped app keys (not master key), least-privilege bucket access.
  • Swarm overlay: encrypted network enabled in stack (driver_opts.encrypted: "true").

6) Dozzle Hardening

Dozzle exposes the full Docker log stream with no built-in auth — logs contain secrets, tokens, and user data. The stack binds Dozzle to 127.0.0.1 on the manager node only (mode: host, host_ip: 127.0.0.1), so it is not reachable from the public internet or from other Swarm nodes.

To view logs, open an SSH tunnel from your workstation:

ssh -p "${DEPLOY_MANAGER_SSH_PORT}" \
    -L "${DOZZLE_PORT}:127.0.0.1:${DOZZLE_PORT}" \
    "${DEPLOY_MANAGER_USER}@${DEPLOY_MANAGER_HOST}"
# Then browse http://localhost:${DOZZLE_PORT}

Additional hardening if you ever need to expose Dozzle over a network:

  • Put auth/SSO in front (Cloudflare Access or equivalent).
  • Replace the raw /var/run/docker.sock mount with a Docker socket proxy limited to read-only log endpoints.
  • Prefer a persistent log aggregator (Loki, Datadog, CloudWatch) for prod — Dozzle is ephemeral and not a substitute for audit trails.

7) Backup + Restore Readiness

Treat this as a pre-launch checklist. Nothing below is automated by ./.deploy_prod.

  • Postgres PITR path tested in staging (restore a real dump, validate app boots).
  • Redis AOF persistence enabled (appendonly yes --appendfsync everysec in stack).
  • Redis restore path tested (verify AOF replays on a fresh node).
  • Written runbook for restore + secret rotation (see §4 and shit_deploy_cant_do.md).
  • Named owner for incident response.
  • Uploads bucket (Backblaze B2) lifecycle / versioning reviewed — deletes are handled by the app, not by retention rules.
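
A minimal sketch of the Postgres restore drill from the first item, assuming pg_dump/pg_restore access; both connection strings are placeholders:

```shell
# Dump prod in custom format, then restore into a throwaway staging DB.
pg_dump "$PROD_DATABASE_URL" -Fc -f honeydue.dump
pg_restore --clean --if-exists --no-owner -d "$STAGING_DATABASE_URL" honeydue.dump
# Then point a staging stack at the restored DB and verify the app boots.
```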

8) Storage Backend (Uploads)

The stack supports two storage backends. The choice is runtime-only — the same image runs in both modes, selected by env vars in prod.env:

  • Local volume (dev / single-node prod): leave all B2_* vars empty. Files land on /app/uploads via the named volume.
  • S3-compatible, e.g. B2 or MinIO (multi-replica prod): set all four of B2_ENDPOINT, B2_KEY_ID, B2_APP_KEY, B2_BUCKET_NAME.

The deploy script enforces all-or-none for the B2 vars — a partial config fails fast rather than silently falling back to the local volume.
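
For the S3 mode, the prod.env fragment might look like this; every value below is a placeholder, and the real app key must stay out of git:

```shell
# prod.env (S3 mode): all four must be set together, or the deploy
# script fails fast. Every value here is a placeholder.
B2_ENDPOINT=https://s3.us-west-004.backblazeb2.com
B2_KEY_ID=0041234567890000000000001
B2_APP_KEY=K004exampleexampleexampleexample
B2_BUCKET_NAME=honeydue-uploads-prod
```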

Why this matters: Docker Swarm named volumes are per-node. With 3 API replicas spread across nodes, an upload written on node A is invisible to replicas on nodes B and C (the client sees a random 404 two-thirds of the time). In multi-replica prod you must use S3-compatible storage.

The uploads: volume is still declared as a harmless fallback: when B2 is configured, nothing writes to it. ./.deploy_prod prints the selected backend at the start of each run.

9) Worker Replicas & Scheduler

Keep WORKER_REPLICAS=1 in cluster.env until Asynq PeriodicTaskManager is wired up. The current asynq.Scheduler in cmd/worker/main.go has no Redis-based leader election, so each replica independently enqueues the same cron task — users see duplicate daily digests / onboarding emails.

Asynq workers (task consumers) are already safe to scale horizontally; it's only the scheduler singleton that is constrained. Future work: migrate to asynq.NewPeriodicTaskManager(...) with PeriodicTaskConfigProvider so multiple scheduler replicas coordinate via Redis.

10) Database Migrations

cmd/api/main.go runs database.MigrateWithLock() on startup, which takes a Postgres session-level pg_advisory_lock on a dedicated connection before calling AutoMigrate. This serialises boot-time migrations across all API replicas — the first replica migrates, the rest wait, then each sees an already-current schema and AutoMigrate is a no-op.

The lock is released on connection close, so a crashed replica can't leave a stale lock behind.
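
One way to observe the lock during a rolling deploy, assuming psql access to the same database ($DATABASE_URL is a placeholder):

```shell
psql "$DATABASE_URL" -c \
  "SELECT pid, locktype, granted FROM pg_locks WHERE locktype = 'advisory';"
# Expect one row with granted = t while a replica migrates; any other
# booting replicas appear with granted = f until the lock is released.
```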

For very large schema changes, run migrations as a separate pre-deploy step (there is no dedicated cmd/migrate binary today — this is a future improvement).

11) Redis Redundancy

Redis runs as a single replica with an AOF-persisted named volume. If the node running Redis dies, Swarm reschedules the container but the named volume is per-node — the new Redis boots empty.

Impact:

  • Cache (ETag lookups, static data): regenerates on first request.
  • Asynq queue: in-flight jobs at the moment of the crash are lost; Asynq retry semantics cover most re-enqueues. Scheduled-but-not-yet-fired cron events are re-triggered on the next cron tick.
  • Sessions / auth tokens: not stored in Redis, so unaffected.

This is an accepted limitation today. Options to harden later: Redis Sentinel, a managed Redis (Upstash, Dragonfly Cloud), or restoring from the AOF on a pinned node.
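
To spot-check that AOF persistence is actually enabled on the running container (the name filter is an assumption; match your stack's service name):

```shell
# On the node currently running Redis:
CID=$(docker ps -q -f name=redis)
docker exec "$CID" redis-cli CONFIG GET appendonly     # expect: appendonly / yes
docker exec "$CID" redis-cli INFO persistence | grep -E 'aof_enabled|aof_last_write_status'
```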

12) Multi-Arch Builds

./.deploy_prod builds images for the host architecture of the machine running the script. If your Swarm nodes are a different arch (e.g. ARM64 Ampere VMs), use docker buildx explicitly:

docker buildx create --use
docker buildx build --platform linux/arm64 --target api -t <image> --push .
# repeat for worker, admin
SKIP_BUILD=1 ./.deploy_prod   # then deploy the already-pushed images

The Go stages cross-compile cleanly (TARGETARCH is already honoured). The Node/admin stages require QEMU emulation (docker run --privileged --rm tonistiigi/binfmt --install all on the build host) since native deps may need to be rebuilt for the target arch.

13) Connection Pool & TLS Tuning

Because Postgres is external (Neon/RDS), each replica opens its own pool. Sizing matters: total open connections across the cluster must stay under the database's configured limit. Defaults in prod.env.example:

  • DB_SSLMODE (default require): never set to disable in prod. For Neon use require.
  • DB_MAX_OPEN_CONNS (default 25): per-replica cap. Worst case: 25 × (API + worker replicas).
  • DB_MAX_IDLE_CONNS (default 10): keeps warm connections ready without exhausting the pool.
  • DB_MAX_LIFETIME (default 600s): recycle before Neon's idle disconnect (typically 5 min).

Worked example with default replicas (3 API + 1 worker — see §9 for why worker is pinned to 1):

3 × 25 + 1 × 25 = 100 peak open connections

That lands exactly on Neon's free-tier ceiling (100 concurrent connections), which is risky with even one transient spike. For Neon free tier drop DB_MAX_OPEN_CONNS=15 (→ 60 peak). Paid tiers (Neon Scale, 1000+ connections) can keep the default or raise it.
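
The sizing rule above as a quick shell calculation, using the defaults from this section; rerun it after changing replica counts or the per-replica cap:

```shell
API_REPLICAS=3
WORKER_REPLICAS=1
DB_MAX_OPEN_CONNS=25
PEAK=$(( (API_REPLICAS + WORKER_REPLICAS) * DB_MAX_OPEN_CONNS ))
echo "peak open connections: $PEAK"    # 100 with the defaults above
```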

Operational checklist:

  • Confirm Neon IP allowlist includes every Swarm node IP.
  • After changing pool sizes, redeploy and watch pg_stat_activity / Neon metrics for saturation.
  • Keep DB_MAX_LIFETIME ≤ Neon idle timeout to avoid "terminating connection due to administrator command" errors in the API logs.
  • For read-heavy workloads, consider a Neon read replica and split query traffic at the application layer.

Files You Fill In

Paste your values into these files:

  • deploy/cluster.env
  • deploy/registry.env
  • deploy/prod.env
  • deploy/secrets/postgres_password.txt
  • deploy/secrets/secret_key.txt
  • deploy/secrets/email_host_password.txt
  • deploy/secrets/fcm_server_key.txt
  • deploy/secrets/apns_auth_key.p8

If one is missing, the deploy script auto-copies it from its .example template and exits so you can fill it in.

What ./.deploy_prod Does

  1. Validates all required config files and credentials.
  2. Validates the storage-backend toggle (all-or-none for B2_*). Prints the selected backend (S3 or local volume) before continuing.
  3. Builds and pushes api, worker, and admin images (skip with SKIP_BUILD=1).
  4. Uploads deploy bundle to your Swarm manager over SSH.
  5. Creates versioned Docker secrets on the manager.
  6. Deploys the stack with docker stack deploy --with-registry-auth.
  7. Waits until service replicas converge.
  8. Prunes old secret versions, keeping the last SECRET_KEEP_VERSIONS (default 3).
  9. Runs an HTTP health check (if DEPLOY_HEALTHCHECK_URL is set). On failure, automatically runs docker service rollback for every service in the stack and exits non-zero.
  10. Logs out of the registry on both the dev host and the manager so the token doesn't linger in ~/.docker/config.json.

Useful Flags

Environment flags:

  • DRY_RUN=1 ./.deploy_prod — validate config and print the deploy plan without building, pushing, or touching the cluster. Use this before every production deploy to review images, replicas, and secret names.
  • SKIP_BUILD=1 ./.deploy_prod — deploy already-pushed images.
  • SKIP_HEALTHCHECK=1 ./.deploy_prod — skip final URL check.
  • DEPLOY_TAG=<tag> ./.deploy_prod — deploy a specific image tag.
  • PUSH_LATEST_TAG=true ./.deploy_prod — also push :latest to the registry (default is false so prod pins to the SHA tag and stays reproducible).
  • SECRET_KEEP_VERSIONS=<n> ./.deploy_prod — how many versions of each Swarm secret to retain after deploy (default: 3). Older unused versions are pruned automatically once the stack converges.

Secret Versioning & Pruning

Each deploy creates a fresh set of Swarm secrets named <stack>_<secret>_<deploy_id> (for example honeydue_secret_key_abc1234_20260413120000). The stack file references the current names via ${POSTGRES_PASSWORD_SECRET} etc., so rolling updates never reuse a secret that a running task still holds open.
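
The scheme can be sketched in shell; treating the deploy ID as short commit SHA plus timestamp is inferred from the example name above, not confirmed from the script:

```shell
STACK=honeydue
BASE=secret_key
GIT_SHA=abc1234                 # short commit SHA (placeholder)
STAMP=$(date +%Y%m%d%H%M%S)
SECRET_NAME="${STACK}_${BASE}_${GIT_SHA}_${STAMP}"
echo "$SECRET_NAME"             # e.g. honeydue_secret_key_abc1234_20260413120000
```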

After the new stack converges, ./.deploy_prod SSHes to the manager and prunes old versions per base name, keeping the most recent SECRET_KEEP_VERSIONS (default 3). Anything still referenced by a running task is left alone (Docker refuses to delete in-use secrets) and will be pruned on the next deploy.

Important

  • deploy/shit_deploy_cant_do.md lists the manual tasks this script cannot automate.
  • Keep real credentials and secret files out of git.