# Deploy Folder

This folder is the full production deploy toolkit for `honeyDueAPI-go`.

**Recommended flow — always dry-run first:**

```bash
DRY_RUN=1 ./.deploy_prod   # validates everything, prints the plan, no changes
./.deploy_prod             # then the real deploy
```

The script refuses to run until all required values are set.

- Step-by-step walkthrough for a real deploy: [`DEPLOYING.md`](./DEPLOYING.md)
- Manual prerequisites the script cannot automate (Swarm init, firewall, Cloudflare, Neon, APNS, etc.): [`shit_deploy_cant_do.md`](./shit_deploy_cant_do.md)

## First-Time Prerequisite: Create The Swarm Cluster

You must do this once before `./.deploy_prod` can work.

1. SSH to manager #1 and initialize Swarm:

   ```bash
   docker swarm init --advertise-addr <MANAGER_PRIVATE_IP>
   ```

2. On manager #1, get the join commands:

   ```bash
   docker swarm join-token manager
   docker swarm join-token worker
   ```

3. SSH to each additional node and run the appropriate `docker swarm join ...` command.

4. Verify from manager #1:

   ```bash
   docker node ls
   ```

## Security Requirements Before Public Launch

Use this as a mandatory checklist before you route production traffic.

### 1) Firewall Rules (Node-Level)

Apply firewall rules to all Swarm nodes:

- SSH port (for example `2222/tcp`): your IP only
- `80/tcp`, `443/tcp`: Hetzner LB only (or Cloudflare IP ranges only if no LB)
- `2377/tcp`: Swarm nodes only
- `7946/tcp,udp`: Swarm nodes only
- `4789/udp`: Swarm nodes only
- Everything else: blocked

### 2) SSH Hardening

On each node, harden `/etc/ssh/sshd_config`:

```text
Port 2222
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
AllowUsers deploy
```

### 3) Cloudflare Origin Lockdown

- Keep public DNS records proxied (orange cloud on).
- Point Cloudflare to the LB, not node IPs.
- Do not publish Swarm node IPs in DNS.
- Enforce firewall source restrictions so public traffic cannot bypass Cloudflare/LB.

### 4) Secrets Policy

- Keep runtime secrets in Docker Swarm secrets only.
- Do not put production secrets in git or plain `.env` files.
- `./.deploy_prod` already creates versioned Swarm secrets from files in `deploy/secrets/`.
- Rotate secrets after incidents or credential exposure.

### 5) Data Path Security

- Neon/Postgres: `DB_SSLMODE=require`, strong DB password, Neon IP allowlist limited to node IPs.
- Backblaze B2: HTTPS only, scoped app keys (not master key), least-privilege bucket access.
- Swarm overlay: encrypted network enabled in stack (`driver_opts.encrypted: "true"`).

### 6) Dozzle Hardening

Dozzle exposes the full Docker log stream with no built-in auth — logs contain secrets, tokens, and user data. The stack binds Dozzle to `127.0.0.1` on the manager node only (`mode: host`, `host_ip: 127.0.0.1`), so it is **not reachable from the public internet or from other Swarm nodes**.

To view logs, open an SSH tunnel from your workstation:

```bash
ssh -p "${DEPLOY_MANAGER_SSH_PORT}" \
  -L "${DOZZLE_PORT}:127.0.0.1:${DOZZLE_PORT}" \
  "${DEPLOY_MANAGER_USER}@${DEPLOY_MANAGER_HOST}"
# Then browse http://localhost:${DOZZLE_PORT}
```

Additional hardening if you ever need to expose Dozzle over a network:

- Put auth/SSO in front (Cloudflare Access or equivalent).
- Replace the raw `/var/run/docker.sock` mount with a Docker socket proxy limited to read-only log endpoints.
- Prefer a persistent log aggregator (Loki, Datadog, CloudWatch) for prod — Dozzle is ephemeral and not a substitute for audit trails.

### 7) Backup + Restore Readiness

Treat this as a pre-launch checklist. Nothing below is automated by `./.deploy_prod`.

- [ ] Postgres PITR path tested in staging (restore a real dump, validate app boots).
- [x] Redis AOF persistence enabled (`appendonly yes --appendfsync everysec` in stack).
- [ ] Redis restore path tested (verify AOF replays on a fresh node).
- [ ] Written runbook for restore + secret rotation (see §4 and `shit_deploy_cant_do.md`).
- [ ] Named owner for incident response.
- [ ] Uploads bucket (Backblaze B2) lifecycle / versioning reviewed — deletes are handled by the app, not by retention rules.

### 8) Storage Backend (Uploads)

The stack supports two storage backends. The choice is **runtime-only** — the same image runs in both modes, selected by env vars in `prod.env`:

| Mode | When to use | Config |
|---|---|---|
| **Local volume** | Dev / single-node prod | Leave all `B2_*` empty. Files land on `/app/uploads` via the named volume. |
| **S3-compatible** (B2, MinIO) | Multi-replica prod | Set all four of `B2_ENDPOINT`, `B2_KEY_ID`, `B2_APP_KEY`, `B2_BUCKET_NAME`. |

The deploy script enforces **all-or-none** for the B2 vars — a partial config fails fast rather than silently falling back to the local volume.

**Why this matters:** Docker Swarm named volumes are **per-node**. With 3 API replicas spread across nodes, an upload written on node A is invisible to replicas on nodes B and C (the client sees a random 404 two-thirds of the time). In multi-replica prod you **must** use S3-compatible storage.

The `uploads:` volume is still declared as a harmless fallback: when B2 is configured, nothing writes to it. `./.deploy_prod` prints the selected backend at the start of each run.

### 9) Worker Replicas & Scheduler

Keep `WORKER_REPLICAS=1` in `cluster.env` until Asynq `PeriodicTaskManager` is wired up. The current `asynq.Scheduler` in `cmd/worker/main.go` has no Redis-based leader election, so each replica independently enqueues the same cron task — users see duplicate daily digests / onboarding emails.

Asynq workers (task consumers) are already safe to scale horizontally; it's only the scheduler singleton that is constrained. Future work: migrate to `asynq.NewPeriodicTaskManager(...)` with `PeriodicTaskConfigProvider` so multiple scheduler replicas coordinate via Redis.
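A pre-deploy guard for this constraint can be sketched in shell. No such guard exists in `./.deploy_prod` today; the snippet below is a hypothetical illustration that reads `WORKER_REPLICAS` from the environment and refuses to proceed when it is scaled past 1:

```bash
# Hypothetical guard: refuse to deploy more than one worker replica
# until the scheduler supports Redis-based leader election.
WORKER_REPLICAS="${WORKER_REPLICAS:-1}"

if [ "${WORKER_REPLICAS}" -gt 1 ]; then
  echo "ERROR: WORKER_REPLICAS=${WORKER_REPLICAS}: the asynq scheduler has no" >&2
  echo "leader election; duplicate cron tasks would be enqueued. Set it to 1." >&2
  exit 1
fi
echo "worker replica count OK (${WORKER_REPLICAS})"
```

If you source `cluster.env` before running it, the guard picks up the real replica count instead of the default.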
### 10) Database Migrations

`cmd/api/main.go` runs `database.MigrateWithLock()` on startup, which takes a Postgres session-level `pg_advisory_lock` on a dedicated connection before calling `AutoMigrate`. This serialises boot-time migrations across all API replicas — the first replica migrates, the rest wait, then each sees an already-current schema and `AutoMigrate` is a no-op. The lock is released on connection close, so a crashed replica can't leave a stale lock behind.

For very large schema changes, run migrations as a separate pre-deploy step (there is no dedicated `cmd/migrate` binary today — this is a future improvement).

### 11) Redis Redundancy

Redis runs as a **single replica** with an AOF-persisted named volume. If the node running Redis dies, Swarm reschedules the container but the named volume is per-node — the new Redis boots **empty**. Impact:

- **Cache** (ETag lookups, static data): regenerates on first request.
- **Asynq queue**: in-flight jobs at the moment of the crash are lost; Asynq retry semantics cover most re-enqueues. Scheduled-but-not-yet-fired cron events are re-triggered on the next cron tick.
- **Sessions / auth tokens**: not stored in Redis, so unaffected.

This is an accepted limitation today. Options to harden later: Redis Sentinel, a managed Redis (Upstash, Dragonfly Cloud), or restoring from the AOF on a pinned node.

### 12) Multi-Arch Builds

`./.deploy_prod` builds images for the **host** architecture of the machine running the script. If your Swarm nodes are a different arch (e.g. ARM64 Ampere VMs), use `docker buildx` explicitly:

```bash
docker buildx create --use
docker buildx build --platform linux/arm64 --target api -t <registry>/<image>:<tag> --push .
# repeat for worker, admin
SKIP_BUILD=1 ./.deploy_prod   # then deploy the already-pushed images
```

The Go stages cross-compile cleanly (`TARGETARCH` is already honoured).
The Node/admin stages require QEMU emulation (`docker run --privileged --rm tonistiigi/binfmt --install all` on the build host) since native deps may need to be rebuilt for the target arch.

### 13) Connection Pool & TLS Tuning

Because Postgres is external (Neon/RDS), each replica opens its own pool. Sizing matters: total open connections across the cluster must stay under the database's configured limit. Defaults in `prod.env.example`:

| Setting | Default | Notes |
|---|---|---|
| `DB_SSLMODE` | `require` | Never set to `disable` in prod. For Neon use `require`. |
| `DB_MAX_OPEN_CONNS` | `25` | Per-replica cap. Worst case: 25 × (API + worker replicas). |
| `DB_MAX_IDLE_CONNS` | `10` | Keep warm connections ready without exhausting the pool. |
| `DB_MAX_LIFETIME` | `600s` | Recycle before Neon's idle disconnect (typically 5 min). |

Worked example with default replicas (3 API + 1 worker — see §9 for why the worker is pinned to 1):

```
3 × 25 + 1 × 25 = 100 peak open connections
```

That lands exactly on Neon's free-tier ceiling (100 concurrent connections), which is risky with even one transient spike. For the Neon free tier, drop `DB_MAX_OPEN_CONNS=15` (→ 60 peak). Paid tiers (Neon Scale, 1000+ connections) can keep the default or raise it.

Operational checklist:

- Confirm the Neon IP allowlist includes every Swarm node IP.
- After changing pool sizes, redeploy and watch `pg_stat_activity` / Neon metrics for saturation.
- Keep `DB_MAX_LIFETIME` ≤ Neon's idle timeout to avoid "terminating connection due to administrator command" errors in the API logs.
- For read-heavy workloads, consider a Neon read replica and split query traffic at the application layer.
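The worked example above can be reproduced as a quick shell calculation. The replica counts and pool size below are the table defaults; adjust them to match your `cluster.env` and `prod.env`:

```bash
# Peak Postgres connections = (API replicas + worker replicas) × per-replica cap.
API_REPLICAS=3
WORKER_REPLICAS=1
DB_MAX_OPEN_CONNS=25

PEAK=$(( (API_REPLICAS + WORKER_REPLICAS) * DB_MAX_OPEN_CONNS ))
echo "peak open connections: ${PEAK}"   # 100 with the defaults above

# Compare against your database plan's connection limit before deploying.
PLAN_LIMIT=100   # Neon free tier
if [ "${PEAK}" -ge "${PLAN_LIMIT}" ]; then
  echo "WARNING: peak (${PEAK}) is at or above the plan limit (${PLAN_LIMIT})" >&2
fi
```

With the free-tier recommendation of `DB_MAX_OPEN_CONNS=15`, the same formula gives 60 and the warning no longer fires.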
## Files You Fill In

Paste your values into these files:

- `deploy/cluster.env`
- `deploy/registry.env`
- `deploy/prod.env`
- `deploy/secrets/postgres_password.txt`
- `deploy/secrets/secret_key.txt`
- `deploy/secrets/email_host_password.txt`
- `deploy/secrets/fcm_server_key.txt`
- `deploy/secrets/apns_auth_key.p8`

If one is missing, the deploy script auto-copies it from its `.example` template and exits so you can fill it in.

## What `./.deploy_prod` Does

1. Validates all required config files and credentials.
2. Validates the storage-backend toggle (all-or-none for `B2_*`). Prints the selected backend (S3 or local volume) before continuing.
3. Builds and pushes `api`, `worker`, and `admin` images (skip with `SKIP_BUILD=1`).
4. Uploads the deploy bundle to your Swarm manager over SSH.
5. Creates versioned Docker secrets on the manager.
6. Deploys the stack with `docker stack deploy --with-registry-auth`.
7. Waits until service replicas converge.
8. Prunes old secret versions, keeping the last `SECRET_KEEP_VERSIONS` (default 3).
9. Runs an HTTP health check (if `DEPLOY_HEALTHCHECK_URL` is set). **On failure, automatically runs `docker service rollback` for every service in the stack and exits non-zero.**
10. Logs out of the registry on both the dev host and the manager so the token doesn't linger in `~/.docker/config.json`.

## Useful Flags

Environment flags:

- `DRY_RUN=1 ./.deploy_prod` — validate config and print the deploy plan without building, pushing, or touching the cluster. Use this before every production deploy to review images, replicas, and secret names.
- `SKIP_BUILD=1 ./.deploy_prod` — deploy already-pushed images.
- `SKIP_HEALTHCHECK=1 ./.deploy_prod` — skip the final URL check.
- `DEPLOY_TAG=<tag> ./.deploy_prod` — deploy a specific image tag.
- `PUSH_LATEST_TAG=true ./.deploy_prod` — also push `:latest` to the registry (default is `false` so prod pins to the SHA tag and stays reproducible).
- `SECRET_KEEP_VERSIONS=<n> ./.deploy_prod` — how many versions of each Swarm secret to retain after deploy (default: 3). Older unused versions are pruned automatically once the stack converges.

## Secret Versioning & Pruning

Each deploy creates a fresh set of Swarm secrets named `<base>_<sha>_<timestamp>` (for example `honeydue_secret_key_abc1234_20260413120000`). The stack file references the current names via `${POSTGRES_PASSWORD_SECRET}` etc., so rolling updates never reuse a secret that a running task still holds open.

After the new stack converges, `./.deploy_prod` SSHes to the manager and prunes old versions per base name, keeping the most recent `SECRET_KEEP_VERSIONS` (default 3). Anything still referenced by a running task is left alone (Docker refuses to delete in-use secrets) and will be pruned on the next deploy.

## Important

- `deploy/shit_deploy_cant_do.md` lists the manual tasks this script cannot automate.
- Keep real credentials and secret files out of git.
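The per-base-name pruning described in the secret-versioning section above can be sketched as follows. This is a simplified local illustration, not the script's actual logic: the secret names are made up, the in-use exclusion is omitted, and it relies on the fixed-width timestamp suffix sorting lexicographically:

```bash
# Keep the newest KEEP versions of one secret base name; list the rest.
KEEP="${SECRET_KEEP_VERSIONS:-3}"

# Stand-in for `docker secret ls --format '{{.Name}}'` output on the manager.
names="honeydue_secret_key_abc1234_20260410090000
honeydue_secret_key_def5678_20260411090000
honeydue_secret_key_aaa9999_20260412090000
honeydue_secret_key_bbb0000_20260413120000"

# Sort newest-first by the trailing timestamp field, then drop the first KEEP.
prune=$(printf '%s\n' "$names" | sort -t_ -k5 -r | tail -n +"$((KEEP + 1))")
echo "would prune:"
echo "$prune"   # only the 20260410 version, with KEEP=3
```

In the real script the prune candidates would be fed to `docker secret rm`, which fails harmlessly for any secret still mounted by a running task.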