Trey 33eee812b6 Harden prod deploy: versioned secrets, healthchecks, migration lock, dry-run
Swarm stack
- Resource limits on all services, stop_grace_period 60s on api/worker/admin
- Dozzle bound to manager loopback only (ssh -L required for access)
- Worker health server on :6060, admin /api/health endpoint
- Redis 200M LRU cap, B2/S3 env vars wired through to api service

Deploy script
- DRY_RUN=1 prints plan + exits
- Auto-rollback on failed healthcheck, docker logout at end
- Versioned-secret pruning keeps last SECRET_KEEP_VERSIONS (default 3)
- PUSH_LATEST_TAG default flipped to false
- B2 all-or-none validation before deploy

Code
- cmd/api takes pg_advisory_lock on a dedicated connection before
  AutoMigrate, serialising boot-time migrations across replicas
- cmd/worker exposes an HTTP /health endpoint with graceful shutdown

Docs
- deploy/DEPLOYING.md: step-by-step walkthrough for a real deploy
- deploy/shit_deploy_cant_do.md: manual prerequisites + recurring ops
- deploy/README.md updated with storage toggle, worker-replica caveat,
  multi-arch recipe, connection-pool tuning, renumbered sections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 15:22:43 -05:00

Deploy Folder

This folder is the full production deploy toolkit for honeyDueAPI-go.

Recommended flow — always dry-run first:

DRY_RUN=1 ./.deploy_prod   # validates everything, prints the plan, no changes
./.deploy_prod             # then the real deploy

The script refuses to run until all required values are set.

  • Step-by-step walkthrough for a real deploy: DEPLOYING.md
  • Manual prerequisites the script cannot automate (Swarm init, firewall, Cloudflare, Neon, APNS, etc.): shit_deploy_cant_do.md

First-Time Prerequisite: Create The Swarm Cluster

You must do this once before ./.deploy_prod can work.

  1. SSH to manager #1 and initialize Swarm:
docker swarm init --advertise-addr <manager1-private-ip>
  2. On manager #1, get the join commands:
docker swarm join-token manager
docker swarm join-token worker
  3. SSH to each additional node and run the appropriate docker swarm join ... command.
  4. Verify from manager #1:

docker node ls

Security Requirements Before Public Launch

Use this as a mandatory checklist before you route production traffic.

1) Firewall Rules (Node-Level)

Apply firewall rules to all Swarm nodes:

  • SSH port (for example 2222/tcp): your IP only
  • 80/tcp, 443/tcp: Hetzner LB only (or Cloudflare IP ranges only if no LB)
  • 2377/tcp: Swarm nodes only
  • 7946/tcp,udp: Swarm nodes only
  • 4789/udp: Swarm nodes only
  • Everything else: blocked
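
With ufw as the firewall (an assumption; any node-level firewall works), the list above might translate to something like the following. The node subnet (10.0.0.0/24), LB address (10.0.1.5), admin IP (203.0.113.10), and SSH port are all placeholders:

```shell
# Sketch only: ufw equivalents of the rules above. All addresses are
# placeholders -- substitute your node subnet, LB, and admin IP.
ufw default deny incoming
ufw default allow outgoing
ufw allow from 203.0.113.10 to any port 2222 proto tcp   # SSH: your IP only
ufw allow from 10.0.1.5 to any port 80,443 proto tcp     # HTTP/S: LB only
ufw allow from 10.0.0.0/24 to any port 2377 proto tcp    # Swarm management
ufw allow from 10.0.0.0/24 to any port 7946 proto tcp    # node gossip
ufw allow from 10.0.0.0/24 to any port 7946 proto udp
ufw allow from 10.0.0.0/24 to any port 4789 proto udp    # overlay (VXLAN)
ufw enable
```

Caveat: Docker inserts its own iptables rules ahead of ufw for published ports, so verify the rules actually hold with an external port scan rather than trusting ufw status alone.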

2) SSH Hardening

On each node, harden /etc/ssh/sshd_config:

Port 2222
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
AllowUsers deploy
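
A typo in sshd_config can lock you out, so validate and reload from a session you keep open; the service unit name varies by distro (sshd vs ssh):

```shell
sshd -t                        # parse check; silent on success
systemctl reload sshd          # on Debian/Ubuntu the unit may be "ssh"
# From your workstation, confirm a NEW login works before closing this one:
ssh -p 2222 deploy@<node-ip> true
```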

3) Cloudflare Origin Lockdown

  • Keep public DNS records proxied (orange cloud on).
  • Point Cloudflare to LB, not node IPs.
  • Do not publish Swarm node IPs in DNS.
  • Enforce firewall source restrictions so public traffic cannot bypass Cloudflare/LB.

4) Secrets Policy

  • Keep runtime secrets in Docker Swarm secrets only.
  • Do not put production secrets in git or plain .env files.
  • ./.deploy_prod already creates versioned Swarm secrets from files in deploy/secrets/.
  • Rotate secrets after incidents or credential exposure.

5) Data Path Security

  • Neon/Postgres: DB_SSLMODE=require, strong DB password, Neon IP allowlist limited to node IPs.
  • Backblaze B2: HTTPS only, scoped app keys (not master key), least-privilege bucket access.
  • Swarm overlay: encrypted network enabled in stack (driver_opts.encrypted: "true").

6) Dozzle Hardening

Dozzle exposes the full Docker log stream with no built-in auth — logs contain secrets, tokens, and user data. The stack binds Dozzle to 127.0.0.1 on the manager node only (mode: host, host_ip: 127.0.0.1), so it is not reachable from the public internet or from other Swarm nodes.

To view logs, open an SSH tunnel from your workstation:

ssh -p "${DEPLOY_MANAGER_SSH_PORT}" \
    -L "${DOZZLE_PORT}:127.0.0.1:${DOZZLE_PORT}" \
    "${DEPLOY_MANAGER_USER}@${DEPLOY_MANAGER_HOST}"
# Then browse http://localhost:${DOZZLE_PORT}

Additional hardening if you ever need to expose Dozzle over a network:

  • Put auth/SSO in front (Cloudflare Access or equivalent).
  • Replace the raw /var/run/docker.sock mount with a Docker socket proxy limited to read-only log endpoints.
  • Prefer a persistent log aggregator (Loki, Datadog, CloudWatch) for prod — Dozzle is ephemeral and not a substitute for audit trails.

7) Backup + Restore Readiness

Treat this as a pre-launch checklist. Nothing below is automated by ./.deploy_prod.

  • Postgres PITR path tested in staging (restore a real dump, validate app boots).
  • Redis AOF persistence enabled (appendonly yes --appendfsync everysec in stack).
  • Redis restore path tested (verify AOF replays on a fresh node).
  • Written runbook for restore + secret rotation (see §4 and shit_deploy_cant_do.md).
  • Named owner for incident response.
  • Uploads bucket (Backblaze B2) lifecycle / versioning reviewed — deletes are handled by the app, not by retention rules.
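
A minimal sketch of the Postgres restore drill from the first item, assuming pg_dump/pg_restore access; both connection strings are placeholders:

```shell
# Dump prod in custom format, then restore into a throwaway staging DB.
pg_dump "$PROD_DATABASE_URL" -Fc -f honeydue.dump
pg_restore --clean --if-exists --no-owner -d "$STAGING_DATABASE_URL" honeydue.dump
# Then point a staging stack at the restored DB and verify the app boots.
```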

8) Storage Backend (Uploads)

The stack supports two storage backends. The choice is runtime-only — the same image runs in both modes, selected by env vars in prod.env:

  • Local volume (dev / single-node prod): leave all B2_* vars empty. Files land on /app/uploads via the named volume.
  • S3-compatible, e.g. B2 or MinIO (multi-replica prod): set all four of B2_ENDPOINT, B2_KEY_ID, B2_APP_KEY, B2_BUCKET_NAME.

The deploy script enforces all-or-none for the B2 vars — a partial config fails fast rather than silently falling back to the local volume.
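
For the S3 mode, the prod.env fragment might look like this; every value below is a placeholder, and the real app key must stay out of git:

```shell
# prod.env (S3 mode): all four must be set together, or the deploy
# script fails fast. Every value here is a placeholder.
B2_ENDPOINT=https://s3.us-west-004.backblazeb2.com
B2_KEY_ID=0041234567890000000000001
B2_APP_KEY=K004exampleexampleexampleexample
B2_BUCKET_NAME=honeydue-uploads-prod
```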

Why this matters: Docker Swarm named volumes are per-node. With 3 API replicas spread across nodes, an upload written on node A is invisible to replicas on nodes B and C (the client sees a random 404 two-thirds of the time). In multi-replica prod you must use S3-compatible storage.

The uploads: volume is still declared as a harmless fallback: when B2 is configured, nothing writes to it. ./.deploy_prod prints the selected backend at the start of each run.

9) Worker Replicas & Scheduler

Keep WORKER_REPLICAS=1 in cluster.env until Asynq PeriodicTaskManager is wired up. The current asynq.Scheduler in cmd/worker/main.go has no Redis-based leader election, so each replica independently enqueues the same cron task — users see duplicate daily digests / onboarding emails.

Asynq workers (task consumers) are already safe to scale horizontally; it's only the scheduler singleton that is constrained. Future work: migrate to asynq.NewPeriodicTaskManager(...) with PeriodicTaskConfigProvider so multiple scheduler replicas coordinate via Redis.

10) Database Migrations

cmd/api/main.go runs database.MigrateWithLock() on startup, which takes a Postgres session-level pg_advisory_lock on a dedicated connection before calling AutoMigrate. This serialises boot-time migrations across all API replicas — the first replica migrates, the rest wait, then each sees an already-current schema and AutoMigrate is a no-op.

The lock is released on connection close, so a crashed replica can't leave a stale lock behind.
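
One way to observe the lock during a rolling deploy, assuming psql access to the same database ($DATABASE_URL is a placeholder):

```shell
psql "$DATABASE_URL" -c \
  "SELECT pid, locktype, granted FROM pg_locks WHERE locktype = 'advisory';"
# Expect one row with granted = t while a replica migrates; any other
# booting replicas appear with granted = f until the lock is released.
```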

For very large schema changes, run migrations as a separate pre-deploy step (there is no dedicated cmd/migrate binary today — this is a future improvement).

11) Redis Redundancy

Redis runs as a single replica with an AOF-persisted named volume. If the node running Redis dies, Swarm reschedules the container but the named volume is per-node — the new Redis boots empty.

Impact:

  • Cache (ETag lookups, static data): regenerates on first request.
  • Asynq queue: in-flight jobs at the moment of the crash are lost; Asynq retry semantics cover most re-enqueues. Scheduled-but-not-yet-fired cron events are re-triggered on the next cron tick.
  • Sessions / auth tokens: not stored in Redis, so unaffected.

This is an accepted limitation today. Options to harden later: Redis Sentinel, a managed Redis (Upstash, Dragonfly Cloud), or restoring from the AOF on a pinned node.
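
To spot-check that AOF persistence is actually enabled on the running container (the name filter is an assumption; match your stack's service name):

```shell
# On the node currently running Redis:
CID=$(docker ps -q -f name=redis)
docker exec "$CID" redis-cli CONFIG GET appendonly     # expect: appendonly / yes
docker exec "$CID" redis-cli INFO persistence | grep -E 'aof_enabled|aof_last_write_status'
```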

12) Multi-Arch Builds

./.deploy_prod builds images for the host architecture of the machine running the script. If your Swarm nodes are a different arch (e.g. ARM64 Ampere VMs), use docker buildx explicitly:

docker buildx create --use
docker buildx build --platform linux/arm64 --target api -t <image> --push .
# repeat for worker, admin
SKIP_BUILD=1 ./.deploy_prod   # then deploy the already-pushed images

The Go stages cross-compile cleanly (TARGETARCH is already honoured). The Node/admin stages require QEMU emulation (docker run --privileged --rm tonistiigi/binfmt --install all on the build host) since native deps may need to be rebuilt for the target arch.

13) Connection Pool & TLS Tuning

Because Postgres is external (Neon/RDS), each replica opens its own pool. Sizing matters: total open connections across the cluster must stay under the database's configured limit. Defaults in prod.env.example:

  • DB_SSLMODE (default require): never set to disable in prod. For Neon use require.
  • DB_MAX_OPEN_CONNS (default 25): per-replica cap. Worst case: 25 × (API + worker replicas).
  • DB_MAX_IDLE_CONNS (default 10): keeps warm connections ready without exhausting the pool.
  • DB_MAX_LIFETIME (default 600s): recycle before Neon's idle disconnect (typically 5 min).

Worked example with default replicas (3 API + 1 worker — see §9 for why worker is pinned to 1):

3 × 25 + 1 × 25 = 100 peak open connections

That lands exactly on Neon's free-tier ceiling (100 concurrent connections), which is risky with even one transient spike. For Neon free tier drop DB_MAX_OPEN_CONNS=15 (→ 60 peak). Paid tiers (Neon Scale, 1000+ connections) can keep the default or raise it.
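
The sizing rule above as a quick shell calculation, using the defaults from this section; rerun it after changing replica counts or the per-replica cap:

```shell
API_REPLICAS=3
WORKER_REPLICAS=1
DB_MAX_OPEN_CONNS=25
PEAK=$(( (API_REPLICAS + WORKER_REPLICAS) * DB_MAX_OPEN_CONNS ))
echo "peak open connections: $PEAK"    # 100 with the defaults above
```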

Operational checklist:

  • Confirm Neon IP allowlist includes every Swarm node IP.
  • After changing pool sizes, redeploy and watch pg_stat_activity / Neon metrics for saturation.
  • Keep DB_MAX_LIFETIME ≤ Neon idle timeout to avoid "terminating connection due to administrator command" errors in the API logs.
  • For read-heavy workloads, consider a Neon read replica and split query traffic at the application layer.

Files You Fill In

Paste your values into these files:

  • deploy/cluster.env
  • deploy/registry.env
  • deploy/prod.env
  • deploy/secrets/postgres_password.txt
  • deploy/secrets/secret_key.txt
  • deploy/secrets/email_host_password.txt
  • deploy/secrets/fcm_server_key.txt
  • deploy/secrets/apns_auth_key.p8

If one is missing, the deploy script auto-copies it from its .example template and exits so you can fill it in.

What ./.deploy_prod Does

  1. Validates all required config files and credentials.
  2. Validates the storage-backend toggle (all-or-none for B2_*). Prints the selected backend (S3 or local volume) before continuing.
  3. Builds and pushes api, worker, and admin images (skip with SKIP_BUILD=1).
  4. Uploads deploy bundle to your Swarm manager over SSH.
  5. Creates versioned Docker secrets on the manager.
  6. Deploys the stack with docker stack deploy --with-registry-auth.
  7. Waits until service replicas converge.
  8. Prunes old secret versions, keeping the last SECRET_KEEP_VERSIONS (default 3).
  9. Runs an HTTP health check (if DEPLOY_HEALTHCHECK_URL is set). On failure, automatically runs docker service rollback for every service in the stack and exits non-zero.
  10. Logs out of the registry on both the dev host and the manager so the token doesn't linger in ~/.docker/config.json.

Useful Flags

Environment flags:

  • DRY_RUN=1 ./.deploy_prod — validate config and print the deploy plan without building, pushing, or touching the cluster. Use this before every production deploy to review images, replicas, and secret names.
  • SKIP_BUILD=1 ./.deploy_prod — deploy already-pushed images.
  • SKIP_HEALTHCHECK=1 ./.deploy_prod — skip final URL check.
  • DEPLOY_TAG=<tag> ./.deploy_prod — deploy a specific image tag.
  • PUSH_LATEST_TAG=true ./.deploy_prod — also push :latest to the registry (default is false so prod pins to the SHA tag and stays reproducible).
  • SECRET_KEEP_VERSIONS=<n> ./.deploy_prod — how many versions of each Swarm secret to retain after deploy (default: 3). Older unused versions are pruned automatically once the stack converges.

Secret Versioning & Pruning

Each deploy creates a fresh set of Swarm secrets named <stack>_<secret>_<deploy_id> (for example honeydue_secret_key_abc1234_20260413120000). The stack file references the current names via ${POSTGRES_PASSWORD_SECRET} etc., so rolling updates never reuse a secret that a running task still holds open.
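
The scheme can be sketched in shell; treating the deploy ID as short commit SHA plus timestamp is inferred from the example name above, not confirmed from the script:

```shell
STACK=honeydue
BASE=secret_key
GIT_SHA=abc1234                 # short commit SHA (placeholder)
STAMP=$(date +%Y%m%d%H%M%S)
SECRET_NAME="${STACK}_${BASE}_${GIT_SHA}_${STAMP}"
echo "$SECRET_NAME"             # e.g. honeydue_secret_key_abc1234_20260413120000
```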

After the new stack converges, ./.deploy_prod SSHes to the manager and prunes old versions per base name, keeping the most recent SECRET_KEEP_VERSIONS (default 3). Anything still referenced by a running task is left alone (Docker refuses to delete in-use secrets) and will be pruned on the next deploy.

Important

  • deploy/shit_deploy_cant_do.md lists the manual tasks this script cannot automate.
  • Keep real credentials and secret files out of git.