Trey t 33eee812b6 Harden prod deploy: versioned secrets, healthchecks, migration lock, dry-run
Swarm stack
- Resource limits on all services, stop_grace_period 60s on api/worker/admin
- Dozzle bound to manager loopback only (ssh -L required for access)
- Worker health server on :6060, admin /api/health endpoint
- Redis 200M LRU cap, B2/S3 env vars wired through to api service

Deploy script
- DRY_RUN=1 prints plan + exits
- Auto-rollback on failed healthcheck, docker logout at end
- Versioned-secret pruning keeps last SECRET_KEEP_VERSIONS (default 3)
- PUSH_LATEST_TAG default flipped to false
- B2 all-or-none validation before deploy

Code
- cmd/api takes pg_advisory_lock on a dedicated connection before
  AutoMigrate, serialising boot-time migrations across replicas
- cmd/worker exposes an HTTP /health endpoint with graceful shutdown

Docs
- deploy/DEPLOYING.md: step-by-step walkthrough for a real deploy
- deploy/shit_deploy_cant_do.md: manual prerequisites + recurring ops
- deploy/README.md updated with storage toggle, worker-replica caveat,
  multi-arch recipe, connection-pool tuning, renumbered sections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 15:22:43 -05:00
# Deploy Folder
This folder is the full production deploy toolkit for `honeyDueAPI-go`.
**Recommended flow — always dry-run first:**
```bash
DRY_RUN=1 ./.deploy_prod # validates everything, prints the plan, no changes
./.deploy_prod # then the real deploy
```
The script refuses to run until all required values are set.
- Step-by-step walkthrough for a real deploy: [`DEPLOYING.md`](./DEPLOYING.md)
- Manual prerequisites the script cannot automate (Swarm init, firewall,
Cloudflare, Neon, APNS, etc.): [`shit_deploy_cant_do.md`](./shit_deploy_cant_do.md)
## First-Time Prerequisite: Create The Swarm Cluster
You must do this once before `./.deploy_prod` can work.
1. SSH to manager #1 and initialize Swarm:
```bash
docker swarm init --advertise-addr <manager1-private-ip>
```
2. On manager #1, get join commands:
```bash
docker swarm join-token manager
docker swarm join-token worker
```
3. SSH to each additional node and run the appropriate `docker swarm join ...` command.
4. Verify from manager #1:
```bash
docker node ls
```
## Security Requirements Before Public Launch
Use this as a mandatory checklist before you route production traffic.
### 1) Firewall Rules (Node-Level)
Apply firewall rules to all Swarm nodes:
- SSH port (for example `2222/tcp`): your IP only
- `80/tcp`, `443/tcp`: Hetzner LB only (or Cloudflare IP ranges only if no LB)
- `2377/tcp`: Swarm nodes only
- `7946/tcp,udp`: Swarm nodes only
- `4789/udp`: Swarm nodes only
- Everything else: blocked
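A minimal sketch of these rules with `ufw` (the IPs, SSH port, and variable names are placeholders; repeat the Swarm-port rules for every peer node). One caveat to verify afterwards: Docker writes iptables rules directly when publishing ports, so host-published ports can bypass ufw. Confirm with an external port scan once the rules are in place.

```bash
# Placeholder addresses: replace with your real ones.
ADMIN_IP=203.0.113.10        # your workstation
LB_IP=203.0.113.20           # Hetzner LB
PEER_IP=10.0.0.2             # repeat the peer rules per Swarm node

ufw default deny incoming
ufw default allow outgoing
ufw allow from "$ADMIN_IP" to any port 2222 proto tcp   # SSH
ufw allow from "$LB_IP"    to any port 80   proto tcp
ufw allow from "$LB_IP"    to any port 443  proto tcp
ufw allow from "$PEER_IP"  to any port 2377 proto tcp   # Swarm management
ufw allow from "$PEER_IP"  to any port 7946 proto tcp   # node gossip
ufw allow from "$PEER_IP"  to any port 7946 proto udp
ufw allow from "$PEER_IP"  to any port 4789 proto udp   # overlay (VXLAN)
ufw --force enable
```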
### 2) SSH Hardening
On each node, harden `/etc/ssh/sshd_config`:
```text
Port 2222
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
AllowUsers deploy
```
### 3) Cloudflare Origin Lockdown
- Keep public DNS records proxied (orange cloud on).
- Point Cloudflare to LB, not node IPs.
- Do not publish Swarm node IPs in DNS.
- Enforce firewall source restrictions so public traffic cannot bypass Cloudflare/LB.
### 4) Secrets Policy
- Keep runtime secrets in Docker Swarm secrets only.
- Do not put production secrets in git or plain `.env` files.
- `./.deploy_prod` already creates versioned Swarm secrets from files in `deploy/secrets/`.
- Rotate secrets after incidents or credential exposure.
### 5) Data Path Security
- Neon/Postgres: `DB_SSLMODE=require`, strong DB password, Neon IP allowlist limited to node IPs.
- Backblaze B2: HTTPS only, scoped app keys (not master key), least-privilege bucket access.
- Swarm overlay: encrypted network enabled in stack (`driver_opts.encrypted: "true"`).
### 6) Dozzle Hardening
Dozzle exposes the full Docker log stream with no built-in auth — logs contain
secrets, tokens, and user data. The stack binds Dozzle to `127.0.0.1` on the
manager node only (`mode: host`, `host_ip: 127.0.0.1`), so it is **not
reachable from the public internet or from other Swarm nodes**.
To view logs, open an SSH tunnel from your workstation:
```bash
ssh -p "${DEPLOY_MANAGER_SSH_PORT}" \
-L "${DOZZLE_PORT}:127.0.0.1:${DOZZLE_PORT}" \
"${DEPLOY_MANAGER_USER}@${DEPLOY_MANAGER_HOST}"
# Then browse http://localhost:${DOZZLE_PORT}
```
Additional hardening if you ever need to expose Dozzle over a network:
- Put auth/SSO in front (Cloudflare Access or equivalent).
- Replace the raw `/var/run/docker.sock` mount with a Docker socket proxy
limited to read-only log endpoints.
- Prefer a persistent log aggregator (Loki, Datadog, CloudWatch) for prod —
Dozzle is ephemeral and not a substitute for audit trails.
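If you do go the socket-proxy route, a sketch of the shape (assumes the `tecnativa/docker-socket-proxy` image and Dozzle's `DOCKER_HOST` setting; the container and network names are illustrative):

```bash
docker network create dozzle-net

# Proxy that only exposes the container-listing and info endpoints the log
# viewer needs; the rest of the Docker API stays denied by default.
docker run -d --name socket-proxy --network dozzle-net \
  -e CONTAINERS=1 -e INFO=1 \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  tecnativa/docker-socket-proxy

# Dozzle talks to the proxy instead of the raw socket, still loopback-only.
docker run -d --name dozzle --network dozzle-net \
  -e DOCKER_HOST=tcp://socket-proxy:2375 \
  -p 127.0.0.1:8080:8080 \
  amir20/dozzle
```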
### 7) Backup + Restore Readiness
Treat this as a pre-launch checklist. Nothing below is automated by
`./.deploy_prod`.
- [ ] Postgres PITR path tested in staging (restore a real dump, validate app boots).
- [x] Redis AOF persistence enabled (`appendonly yes --appendfsync everysec` in stack).
- [ ] Redis restore path tested (verify AOF replays on a fresh node).
- [ ] Written runbook for restore + secret rotation (see §4 and `shit_deploy_cant_do.md`).
- [ ] Named owner for incident response.
- [ ] Uploads bucket (Backblaze B2) lifecycle / versioning reviewed — deletes are
handled by the app, not by retention rules.
### 8) Storage Backend (Uploads)
The stack supports two storage backends. The choice is **runtime-only** — the
same image runs in both modes, selected by env vars in `prod.env`:
| Mode | When to use | Config |
|---|---|---|
| **Local volume** | Dev / single-node prod | Leave all `B2_*` empty. Files land on `/app/uploads` via the named volume. |
| **S3-compatible** (B2, MinIO) | Multi-replica prod | Set all four of `B2_ENDPOINT`, `B2_KEY_ID`, `B2_APP_KEY`, `B2_BUCKET_NAME`. |
The deploy script enforces **all-or-none** for the B2 vars — a partial config
fails fast rather than silently falling back to the local volume.
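The check itself is simple. A sketch of the idea (the function name is illustrative; the variable names match `prod.env`):

```bash
# Count how many of the four B2 vars are set; accept only 0 or 4.
check_b2_config() {
  local n=0 v
  for v in B2_ENDPOINT B2_KEY_ID B2_APP_KEY B2_BUCKET_NAME; do
    if [ -n "${!v:-}" ]; then n=$((n + 1)); fi
  done
  case "$n" in
    4) echo "storage backend: s3" ;;
    0) echo "storage backend: local volume" ;;
    *) echo "error: partial B2 config ($n of 4 vars set)" >&2; return 1 ;;
  esac
}
```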
**Why this matters:** Docker Swarm named volumes are **per-node**. With 3 API
replicas spread across nodes, an upload written on node A is invisible to
replicas on nodes B and C (the client sees a random 404 two-thirds of the
time). In multi-replica prod you **must** use S3-compatible storage.
The `uploads:` volume is still declared as a harmless fallback: when B2 is
configured, nothing writes to it. `./.deploy_prod` prints the selected
backend at the start of each run.
### 9) Worker Replicas & Scheduler
Keep `WORKER_REPLICAS=1` in `cluster.env` until Asynq `PeriodicTaskManager`
is wired up. The current `asynq.Scheduler` in `cmd/worker/main.go` has no
Redis-based leader election, so each replica independently enqueues the
same cron task — users see duplicate daily digests / onboarding emails.
Asynq workers (task consumers) are already safe to scale horizontally; it's
only the scheduler singleton that is constrained. Future work: migrate to
`asynq.NewPeriodicTaskManager(...)` with `PeriodicTaskConfigProvider` so
multiple scheduler replicas coordinate via Redis.
### 10) Database Migrations
`cmd/api/main.go` runs `database.MigrateWithLock()` on startup, which takes a
Postgres session-level `pg_advisory_lock` on a dedicated connection before
calling `AutoMigrate`. This serialises boot-time migrations across all API
replicas — the first replica migrates, the rest wait, then each sees an
already-current schema and `AutoMigrate` is a no-op.
The lock is released on connection close, so a crashed replica can't leave
a stale lock behind.
For very large schema changes, run migrations as a separate pre-deploy
step (there is no dedicated `cmd/migrate` binary today — this is a future
improvement).
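For reference, a manual pre-deploy migration can hold the same kind of lock from `psql`, so it serialises against any replica that boots mid-change. The lock key and file name below are placeholders; use whatever key `MigrateWithLock` actually uses.

```bash
psql "$DATABASE_URL" <<'SQL'
-- Placeholder key: must match the key MigrateWithLock takes.
SELECT pg_advisory_lock(42);
\i migrations/big_schema_change.sql
SELECT pg_advisory_unlock(42);
SQL
```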
### 11) Redis Redundancy
Redis runs as a **single replica** with an AOF-persisted named volume. If
the node running Redis dies, Swarm reschedules the container but the named
volume is per-node — the new Redis boots **empty**.
Impact:
- **Cache** (ETag lookups, static data): regenerates on first request.
- **Asynq queue**: queued and in-flight jobs at the moment of the crash are
lost; a freshly booted, empty Redis has nothing left to retry. Recurring
cron tasks are re-enqueued on their next tick.
- **Sessions / auth tokens**: not stored in Redis, so unaffected.
This is an accepted limitation today. Options to harden later: Redis
Sentinel, a managed Redis (Upstash, Dragonfly Cloud), or restoring from the
AOF on a pinned node.
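A low-effort mitigation in the meantime is to pin Redis to a labelled node so its AOF volume always lands on the same machine (the label name and stack-file snippet are illustrative):

```bash
# Label the node that should always host Redis.
docker node update --label-add redis=primary <node-hostname>

# Then constrain the redis service in the stack file:
#   deploy:
#     placement:
#       constraints:
#         - node.labels.redis == primary
```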
### 12) Multi-Arch Builds
`./.deploy_prod` builds images for the **host** architecture of the machine
running the script. If your Swarm nodes are a different arch (e.g. ARM64
Ampere VMs), use `docker buildx` explicitly:
```bash
docker buildx create --use
docker buildx build --platform linux/arm64 --target api -t <image> --push .
# repeat for worker, admin
SKIP_BUILD=1 ./.deploy_prod # then deploy the already-pushed images
```
The Go stages cross-compile cleanly (`TARGETARCH` is already honoured).
The Node/admin stages require QEMU emulation (`docker run --privileged --rm
tonistiigi/binfmt --install all` on the build host) since native deps may
need to be rebuilt for the target arch.
### 13) Connection Pool & TLS Tuning
Because Postgres is external (Neon/RDS), each replica opens its own pool.
Sizing matters: total open connections across the cluster must stay under
the database's configured limit. Defaults in `prod.env.example`:
| Setting | Default | Notes |
|---|---|---|
| `DB_SSLMODE` | `require` | Never set to `disable` in prod. For Neon use `require`. |
| `DB_MAX_OPEN_CONNS` | `25` | Per-replica cap. Worst case: 25 × (API+worker replicas). |
| `DB_MAX_IDLE_CONNS` | `10` | Keep warm connections ready without exhausting the pool. |
| `DB_MAX_LIFETIME` | `600s` | Recycle connections proactively. Neon's idle disconnect is typically 5 min, so consider lowering this to `300s` or less (see the checklist below). |
Worked example with default replicas (3 API + 1 worker — see §9 for why
worker is pinned to 1):
```
3 × 25 + 1 × 25 = 100 peak open connections
```
That lands exactly on Neon's free-tier ceiling (100 concurrent connections),
which is risky with even one transient spike. For the Neon free tier, drop
`DB_MAX_OPEN_CONNS` to `15` (→ 60 peak). Paid tiers (Neon Scale, 1000+
connections) can keep the default or raise it.
Operational checklist:
- Confirm Neon IP allowlist includes every Swarm node IP.
- After changing pool sizes, redeploy and watch `pg_stat_activity` /
Neon metrics for saturation.
- Keep `DB_MAX_LIFETIME` ≤ Neon idle timeout to avoid "terminating
connection due to administrator command" errors in the API logs.
- For read-heavy workloads, consider a Neon read replica and split
query traffic at the application layer.
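A quick way to eyeball saturation after a deploy (assumes `psql` access to the same database the app uses; `$DATABASE_URL` is a placeholder for your connection string):

```bash
psql "$DATABASE_URL" -c "
  SELECT state, count(*)
  FROM pg_stat_activity
  WHERE datname = current_database()
  GROUP BY state;"
```

If the total approaches your configured ceiling, lower `DB_MAX_OPEN_CONNS` before the next deploy rather than after an outage.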
## Files You Fill In
Paste your values into these files:
- `deploy/cluster.env`
- `deploy/registry.env`
- `deploy/prod.env`
- `deploy/secrets/postgres_password.txt`
- `deploy/secrets/secret_key.txt`
- `deploy/secrets/email_host_password.txt`
- `deploy/secrets/fcm_server_key.txt`
- `deploy/secrets/apns_auth_key.p8`
If one is missing, the deploy script auto-copies it from its `.example` template and exits so you can fill it in.
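The auto-copy behaviour is roughly the following sketch (the function name is illustrative):

```bash
# If a required file is missing, seed it from its .example template and
# stop so the operator can fill in real values before the next run.
ensure_from_example() {
  local f="$1"
  if [ ! -f "$f" ]; then
    cp "${f}.example" "$f"
    echo "created $f from template; fill it in and re-run" >&2
    return 1
  fi
}
```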
## What `./.deploy_prod` Does
1. Validates all required config files and credentials.
2. Validates the storage-backend toggle (all-or-none for `B2_*`). Prints
the selected backend (S3 or local volume) before continuing.
3. Builds and pushes `api`, `worker`, and `admin` images (skip with
`SKIP_BUILD=1`).
4. Uploads deploy bundle to your Swarm manager over SSH.
5. Creates versioned Docker secrets on the manager.
6. Deploys the stack with `docker stack deploy --with-registry-auth`.
7. Waits until service replicas converge.
8. Prunes old secret versions, keeping the last `SECRET_KEEP_VERSIONS`
(default 3).
9. Runs an HTTP health check (if `DEPLOY_HEALTHCHECK_URL` is set). **On
failure, automatically runs `docker service rollback` for every service
in the stack and exits non-zero.**
10. Logs out of the registry on both the dev host and the manager so the
token doesn't linger in `~/.docker/config.json`.
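Steps 9 and 10 can be sketched as follows (the function name and retry counts are illustrative; the real script's policy may differ):

```bash
# Poll the health URL; on persistent failure, roll back every service
# passed as an argument and exit non-zero.
healthcheck_or_rollback() {
  local url="$1"; shift
  local attempt svc
  for attempt in 1 2 3 4 5; do
    if curl -fsS --max-time 5 "$url" > /dev/null; then
      echo "healthcheck ok"
      return 0
    fi
    sleep 3
  done
  echo "healthcheck failed; rolling back" >&2
  for svc in "$@"; do
    docker service rollback "$svc"
  done
  return 1
}
```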
## Useful Flags
Environment flags:
- `DRY_RUN=1 ./.deploy_prod` — validate config and print the deploy plan
without building, pushing, or touching the cluster. Use this before every
production deploy to review images, replicas, and secret names.
- `SKIP_BUILD=1 ./.deploy_prod` — deploy already-pushed images.
- `SKIP_HEALTHCHECK=1 ./.deploy_prod` — skip final URL check.
- `DEPLOY_TAG=<tag> ./.deploy_prod` — deploy a specific image tag.
- `PUSH_LATEST_TAG=true ./.deploy_prod` — also push `:latest` to the registry
(default is `false` so prod pins to the SHA tag and stays reproducible).
- `SECRET_KEEP_VERSIONS=<n> ./.deploy_prod` — how many versions of each
Swarm secret to retain after deploy (default: 3). Older unused versions
are pruned automatically once the stack converges.
## Secret Versioning & Pruning
Each deploy creates a fresh set of Swarm secrets named
`<stack>_<secret>_<deploy_id>` (for example
`honeydue_secret_key_abc1234_20260413120000`). The stack file references the
current names via `${POSTGRES_PASSWORD_SECRET}` etc., so rolling updates never
reuse a secret that a running task still holds open.
After the new stack converges, `./.deploy_prod` SSHes to the manager and
prunes old versions per base name, keeping the most recent
`SECRET_KEEP_VERSIONS` (default 3). Anything still referenced by a running
task is left alone (Docker refuses to delete in-use secrets) and will be
pruned on the next deploy.
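The pruning logic is roughly the following sketch (the function name is illustrative; it relies on the timestamp suffix sorting lexicographically and on GNU `head -n -N`):

```bash
# Delete all but the newest $keep versions of one secret base name.
prune_secret_versions() {
  local base="$1" keep="${2:-3}"
  docker secret ls --format '{{.Name}}' \
    | grep "^${base}_" \
    | sort \
    | head -n -"$keep" \
    | while read -r name; do
        docker secret rm "$name" || true   # in-use secrets are skipped
      done
}
```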
## Important
- `deploy/shit_deploy_cant_do.md` lists the manual tasks this script cannot automate.
- Keep real credentials and secret files out of git.