Shit ./.deploy_prod Can't Do
Everything listed here is manual. The deploy script orchestrates builds, secrets, and the stack — it does not provision infrastructure, touch DNS, configure Cloudflare, or rotate external credentials. Work through this list once before your first prod deploy, then revisit after every cloud-side change.
See README.md for the security checklist that complements this file.
One-Time: Infrastructure
Swarm Cluster
- Provision manager + worker VMs (Hetzner, DO, etc.).
- docker swarm init --advertise-addr <manager-private-ip> on manager #1.
- docker swarm join-token {manager,worker} → join additional nodes.
- docker node ls to verify: all nodes Ready and Active.
- Label nodes if you want placement constraints beyond the defaults.
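The Ready/Active check can be scripted rather than eyeballed. The following is a minimal sketch, not part of ./.deploy_prod: the function name check_nodes is hypothetical, and it assumes the three-column output produced by the --format string shown in the usage comment.

```shell
# Reads "hostname status availability" lines on stdin.
# Returns non-zero if any node is not Ready + Active, and names it on stderr.
check_nodes() {
  bad=0
  while read -r host status avail || [ -n "$host" ]; do
    [ -n "$host" ] || continue
    if [ "$status" != "Ready" ] || [ "$avail" != "Active" ]; then
      echo "not healthy: $host ($status/$avail)" >&2
      bad=1
    fi
  done
  return "$bad"
}

# Usage on a manager:
#   docker node ls --format '{{.Hostname}} {{.Status}} {{.Availability}}' | check_nodes
```

A drained node is reported as unhealthy here, which is what you want right after cluster setup but not during planned maintenance.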
Node Hardening (every node)
- SSH: non-default port, key-only auth, no root login — see README §2.
- Firewall: allow SSH on 22 (or 2222); allow 80 and 443 from CF IPs only; allow 2377/tcp, 7946/tcp+udp, and 4789/udp between Swarm nodes only; block the rest. See README §1.
- Install unattended-upgrades (or equivalent) for security patches.
- Disable password auth in /etc/ssh/sshd_config.
- Create the deploy user (AllowUsers deploy in sshd_config).
DNS + Cloudflare
- Add A records for api.<domain>, admin.<domain> pointing to the LB or manager IPs. Keep them proxied (orange cloud).
- Create a Cloudflare tunnel or enable "Authenticated Origin Pulls" if you want to lock the origin to CF only.
- Firewall rule on the nodes: only accept 80/443 from Cloudflare IP ranges (https://www.cloudflare.com/ips/).
- Configure CF Access (or equivalent SSO) in front of admin panel if exposing it publicly.
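The Cloudflare-only firewall rule can be generated from the published range lists. This is a sketch under assumptions: it assumes ufw is the firewall in use (the checklist does not name one), the function name cf_ufw_rules is hypothetical, and it only prints commands so you can review them before running anything as root.

```shell
# Reads one CIDR per line on stdin and prints one ufw command per range.
# Blank lines and '#' comments are skipped. Nothing is executed here.
cf_ufw_rules() {
  while read -r cidr || [ -n "$cidr" ]; do
    case "$cidr" in
      ''|'#'*) continue ;;
    esac
    printf 'ufw allow proto tcp from %s to any port 80,443\n' "$cidr"
  done
}

# Usage (plain-text range lists published by Cloudflare):
#   curl -fsS https://www.cloudflare.com/ips-v4 | cf_ufw_rules
#   curl -fsS https://www.cloudflare.com/ips-v6 | cf_ufw_rules
```

Cloudflare changes these ranges occasionally, so the generated rules need a periodic refresh.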
One-Time: External Services
Postgres (Neon)
- Create project + database (honeydue).
- Create a dedicated DB user with least privilege, not the project owner.
- Enable IP allowlist, add every Swarm node's egress IP.
- Verify DB_SSLMODE=require works end-to-end.
- Turn on PITR (paid tier) or schedule automated pg_dump backups.
- Do one restore drill: boot a staging stack from a real backup. If you haven't done this, you do not have backups.
Redis
- Redis runs inside the stack on a named volume. No external setup needed today. See README §11 — this is an accepted SPOF.
- If you move Redis external (Upstash, Dragonfly Cloud): update REDIS_URL in prod.env, remove the redis service + volume from the stack.
Backblaze B2 (or MinIO)
Skip this section if you're running a single-node prod and are OK with uploads on a local volume. Required for multi-replica prod — see README §8.
- Create B2 account + bucket (private).
- Create a scoped application key bound to that single bucket — not the master key.
- Set lifecycle rules: keep only the current version of each file, or whatever matches your policy.
- Populate B2_ENDPOINT, B2_KEY_ID, B2_APP_KEY, B2_BUCKET_NAME in deploy/prod.env. Optionally set B2_USE_SSL and B2_REGION.
- Verify uploads round-trip across replicas after the first deploy (upload a file via client A → fetch via client B in a different session).
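A half-filled set of B2 variables is easy to miss, so it can help to check the env file before deploying. The sketch below is an illustration, not the deploy script's own validation: the function name b2_all_or_none is hypothetical, and it assumes plain unquoted NAME=value lines in the env file.

```shell
# Succeeds when the four required B2_* variables are all set or all empty.
# Fails on a partial configuration. Assumes unquoted NAME=value lines.
b2_all_or_none() {
  env_file=$1
  set_count=0
  for name in B2_ENDPOINT B2_KEY_ID B2_APP_KEY B2_BUCKET_NAME; do
    # Use the last assignment of NAME= in the file, then drop the key.
    value=$(grep -E "^${name}=" "$env_file" | tail -n 1 | cut -d= -f2-)
    if [ -n "$value" ]; then
      set_count=$((set_count + 1))
    fi
  done
  case "$set_count" in
    0) echo "B2 disabled: uploads stay on the local volume" ;;
    4) echo "B2 fully configured" ;;
    *) echo "B2 partially configured ($set_count of 4 set)" >&2; return 1 ;;
  esac
}

# Usage:
#   b2_all_or_none deploy/prod.env
```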
APNS (Apple Push)
- Create an APNS auth key (.p8) in the Apple Developer portal.
- Save to deploy/secrets/apns_auth_key.p8. The script enforces that it contains a real -----BEGIN PRIVATE KEY----- block.
- Fill APNS_AUTH_KEY_ID, APNS_TEAM_ID, APNS_TOPIC (bundle ID) in deploy/prod.env.
- Decide APNS_USE_SANDBOX / APNS_PRODUCTION based on build target.
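You can run a comparable key-file check locally before the script does. This is a rough sketch of such a check, not the script's actual code: the function name check_p8 is hypothetical, and it only looks for the PEM markers without parsing the key.

```shell
# Fails if the file is missing, empty, or lacks a PEM private-key block.
check_p8() {
  key_file=$1
  if [ ! -s "$key_file" ]; then
    echo "missing or empty: $key_file" >&2
    return 1
  fi
  # '--' stops option parsing, since the pattern starts with dashes.
  if ! grep -q -- '-----BEGIN PRIVATE KEY-----' "$key_file" ||
     ! grep -q -- '-----END PRIVATE KEY-----' "$key_file"; then
    echo "no PEM private key block in $key_file" >&2
    return 1
  fi
}

# Usage:
#   check_p8 deploy/secrets/apns_auth_key.p8
```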
FCM (Android Push)
- Create Firebase project + legacy server key (or migrate to HTTP v1 — the code currently uses the legacy server key).
- Save to deploy/secrets/fcm_server_key.txt.
SMTP (Email)
- Provision SMTP credentials (Gmail app password, SES, Postmark, etc.).
- Fill EMAIL_HOST, EMAIL_PORT, EMAIL_HOST_USER, DEFAULT_FROM_EMAIL, EMAIL_USE_TLS in deploy/prod.env.
- Save the password to deploy/secrets/email_host_password.txt.
- Verify SPF, DKIM, DMARC on the sending domain if you care about deliverability.
Registry (GHCR / other)
- Create a personal access token with write:packages + read:packages.
- Fill REGISTRY, REGISTRY_NAMESPACE, REGISTRY_USERNAME, REGISTRY_TOKEN in deploy/registry.env.
- Rotate the token on a schedule (quarterly at minimum).
Apple / Google IAP (optional)
- Apple: create App Store Connect API key, fill the APPLE_IAP_* vars.
- Google: create a service account with Play Developer API access, store JSON at a path referenced by GOOGLE_IAP_SERVICE_ACCOUNT_PATH.
Recurring Operations
Secret Rotation
After any compromise, annually at minimum, and when a team member leaves:
- Generate the new value (e.g. openssl rand -base64 32 > deploy/secrets/secret_key.txt).
- Run ./.deploy_prod, which creates a new versioned Swarm secret and redeploys services to pick it up.
- The old secret lingers until SECRET_KEEP_VERSIONS bumps it out (see README "Secret Versioning & Pruning").
- For external creds (Neon, B2, APNS, etc.) rotate at the provider first, update the local secret file, then redeploy.
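A plain redirect creates the secret file with your default umask and truncates the old value even if openssl fails. A slightly more careful variant might look like the sketch below; the function name new_secret_file is hypothetical and this is not something the deploy script provides.

```shell
# Writes a fresh random value to the given secret file with owner-only
# permissions. The value lands in a temp file first, so a failed openssl
# run never truncates the existing secret.
new_secret_file() {
  target=$1
  tmp_file="${target}.tmp.$$"
  if ! ( umask 077 && openssl rand -base64 32 > "$tmp_file" ); then
    rm -f "$tmp_file"
    return 1
  fi
  mv "$tmp_file" "$target"
}

# Usage:
#   new_secret_file deploy/secrets/secret_key.txt && ./.deploy_prod
```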
Backup Drills
- Quarterly: pull a Neon backup, restore to a scratch project, boot a staging stack against it, verify login + basic reads.
- Monthly: spot-check that B2 objects are actually present and the app key still works.
- After any schema change: confirm PITR coverage includes the new columns before relying on it.
Certificate Management
- TLS is terminated by Cloudflare today, so there are no origin certs to renew. If you ever move TLS on-origin (Traefik, Caddy), automate renewal — don't add it to this list and expect it to happen.
Multi-Arch Builds
./.deploy_prod builds for the host arch. If target ≠ host:
- Enable buildx: docker buildx create --use.
- Install QEMU: docker run --privileged --rm tonistiigi/binfmt --install all.
- Build + push images manually per target platform.
- Run SKIP_BUILD=1 ./.deploy_prod so the script just deploys.
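To decide whether the manual path applies, you can compare the build host's architecture with the target platform. The helpers below are a sketch: arch_to_platform and needs_cross_build are hypothetical names, and the mapping covers only the common uname -m values.

```shell
# Maps a `uname -m` value to the Docker platform string.
arch_to_platform() {
  case "$1" in
    x86_64|amd64)  echo linux/amd64 ;;
    aarch64|arm64) echo linux/arm64 ;;
    armv7l)        echo linux/arm/v7 ;;
    *) echo "unknown arch: $1" >&2; return 1 ;;
  esac
}

# Succeeds when the target platform ($1) differs from this host's platform.
needs_cross_build() {
  [ "$(arch_to_platform "$(uname -m)")" != "$1" ]
}

# Usage (IMAGE is a placeholder for your full image reference):
#   if needs_cross_build linux/arm64; then
#     docker buildx build --platform linux/arm64 -t "$IMAGE" --push .
#     SKIP_BUILD=1 ./.deploy_prod
#   fi
```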
Node Maintenance / Rolling Upgrades
- docker node update --availability drain <node> before OS upgrades.
- Reboot, verify, then docker node update --availability active <node>.
- Re-converge with docker stack deploy -c swarm-stack.prod.yml honeydue.
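Draining returns immediately, while tasks may take up to their stop_grace_period to exit, so it is worth waiting before you reboot. The wrapper below is one possible sketch; drain_node, activate_node, and the POLL_SECONDS variable are hypothetical names, and the defaults of 60 polls at 5 seconds are arbitrary.

```shell
# Drains a node, then polls until no task on it reports a Running state.
# $1 = node name, $2 = max polls (default 60); POLL_SECONDS sets the interval.
drain_node() {
  node=$1
  tries=${2:-60}
  docker node update --availability drain "$node" || return 1
  while docker node ps "$node" --format '{{.CurrentState}}' | grep -q '^Running'; do
    tries=$((tries - 1))
    if [ "$tries" -le 0 ]; then
      echo "tasks still running on $node" >&2
      return 1
    fi
    sleep "${POLL_SECONDS:-5}"
  done
}

# Puts the node back into scheduling after the reboot.
activate_node() {
  docker node update --availability active "$1"
}

# Usage on a manager:
#   drain_node worker-2 && echo "safe to reboot worker-2"
#   activate_node worker-2
```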
Incident Response
Redis Node Dies
Named volume is per-node and doesn't follow. Accept the loss:
- Let Swarm reschedule Redis on a new node.
- In-flight Asynq jobs are lost; retry semantics cover most of them.
- Scheduled cron events fire again on the next tick (hourly for smart reminders and daily digest; daily for onboarding + cleanup).
- Cache repopulates on first request.
Deploy Rolled Back Automatically
./.deploy_prod triggers docker service rollback on every service if the check against DEPLOY_HEALTHCHECK_URL fails. Diagnose with:
ssh <manager> docker stack services honeydue
ssh <manager> docker service logs --tail 200 honeydue_api
# Or open an SSH tunnel to Dozzle: ssh -L 9999:127.0.0.1:9999 <manager>
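To see whether the endpoint is still failing after a rollback, you can poll it by hand. This sketch is not the script's own healthcheck logic: wait_healthy is a hypothetical helper, and the retry count, delay, and 5-second timeout are arbitrary defaults.

```shell
# Polls a URL until curl gets a successful (non-4xx/5xx) response.
# $1 = URL, $2 = max attempts (default 10), $3 = delay in seconds (default 3).
wait_healthy() {
  url=$1
  attempts=${2:-10}
  delay=${3:-3}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "unhealthy after $attempts attempts: $url" >&2
  return 1
}

# Usage:
#   wait_healthy "$DEPLOY_HEALTHCHECK_URL"
```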
Lost Ability to Deploy
- Registry token revoked → regenerate, update deploy/registry.env, re-run.
- Manager host key changed → verify legitimacy, update ~/.ssh/known_hosts.
- All secrets accidentally pruned → restore the deploy/secrets/* files locally and redeploy; new Swarm secret versions will be created.
Known Gaps (Future Work)
- No dedicated cmd/migrate binary: migrations run at API boot (see README §10). Large schema changes still need manual coordination.
- asynq.Scheduler has no leader election; WORKER_REPLICAS must stay 1 until we migrate to asynq.PeriodicTaskManager (README §9).
- No Prometheus / Grafana / alerting in the stack. /metrics is exposed on the API but nothing scrapes it.
- No automated TLS renewal on-origin; add if you ever move off Cloudflare.
- No staging environment wired to the deploy script; DEPLOY_TAG=<sha> is the closest thing. A proper staging flow is future work.