Trey t 33eee812b6 Harden prod deploy: versioned secrets, healthchecks, migration lock, dry-run

Swarm stack
- Resource limits on all services, stop_grace_period 60s on api/worker/admin
- Dozzle bound to manager loopback only (ssh -L required for access)
- Worker health server on :6060, admin /api/health endpoint
- Redis 200M LRU cap, B2/S3 env vars wired through to api service

Deploy script
- DRY_RUN=1 prints plan + exits
- Auto-rollback on failed healthcheck, docker logout at end
- Versioned-secret pruning keeps last SECRET_KEEP_VERSIONS (default 3)
- PUSH_LATEST_TAG default flipped to false
- B2 all-or-none validation before deploy

Code
- cmd/api takes pg_advisory_lock on a dedicated connection before
  AutoMigrate, serialising boot-time migrations across replicas
- cmd/worker exposes an HTTP /health endpoint with graceful shutdown

Docs
- deploy/DEPLOYING.md: step-by-step walkthrough for a real deploy
- deploy/shit_deploy_cant_do.md: manual prerequisites + recurring ops
- deploy/README.md updated with storage toggle, worker-replica caveat,
  multi-arch recipe, connection-pool tuning, renumbered sections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-14 15:22:43 -05:00

Shit `./.deploy_prod` Can't Do

Everything listed here is manual. The deploy script orchestrates builds, secrets, and the stack — it does not provision infrastructure, touch DNS, configure Cloudflare, or rotate external credentials. Work through this list once before your first prod deploy, then revisit after every cloud-side change.

See README.md for the security checklist that complements this file.

One-Time: Infrastructure

Swarm Cluster

Provision manager + worker VMs (Hetzner, DO, etc.).
docker swarm init --advertise-addr <manager-private-ip> on manager #1.
docker swarm join-token manager (or worker) prints the join command; run it on each additional node.
docker node ls to verify — all nodes Ready and Active.
Label nodes if you want placement constraints beyond the defaults.
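
The bootstrap steps above could be wrapped roughly as follows. This is a sketch, not part of ./.deploy_prod: bootstrap_swarm and all_nodes_ready are hypothetical helper names, and the manager IP is whatever your provider assigned.

```shell
# Sketch only: both helpers are hypothetical, not part of ./.deploy_prod.
bootstrap_swarm() {
  local manager_ip="$1"   # private IP of manager #1
  docker swarm init --advertise-addr "$manager_ip"
  docker swarm join-token worker    # prints the join command to run on each worker
  docker swarm join-token manager   # prints the join command for additional managers
}

# Succeeds only if at least one node is listed and every node is Ready and Active.
all_nodes_ready() {
  docker node ls --format '{{.Status}} {{.Availability}}' \
    | awk '$0 != "Ready Active" { bad = 1 } END { exit (bad || NR == 0) }'
}
```

Run bootstrap_swarm <manager-private-ip> on manager #1, paste the printed join commands on the other nodes, then call all_nodes_ready before moving on.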

Node Hardening (every node)

SSH: non-default port, key-only auth, no root login — see README §2.
Firewall: allow SSH (22, or your non-default port such as 2222); allow 80 and 443 from Cloudflare IPs only; allow 2377/tcp, 7946/tcp+udp, and 4789/udp between Swarm nodes only; block everything else. See README §1.
Install unattended-upgrades (or equivalent) for security patches.
Disable password auth in /etc/ssh/sshd_config.
Create the deploy user (AllowUsers deploy in sshd_config).

DNS + Cloudflare

Add A records for api.<domain>, admin.<domain> pointing to the LB or manager IPs. Keep them proxied (orange cloud).
Create a Cloudflare tunnel or enable "Authenticated Origin Pulls" if you want to lock the origin to CF only.
Firewall rule on the nodes: only accept 80/443 from Cloudflare IP ranges (https://www.cloudflare.com/ips/).
Configure CF Access (or equivalent SSO) in front of admin panel if exposing it publicly.
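
One way to turn the Cloudflare IP list into firewall rules for 80/443 is sketched below. It assumes ufw as the firewall, which this document does not mandate, and cf_ufw_rules is a hypothetical helper. It only prints commands so they can be reviewed before anything is applied.

```shell
# Sketch only: cf_ufw_rules is hypothetical and ufw is an assumed firewall choice.
# Reads one CIDR per line on stdin and prints (does not run) the matching ufw rules.
cf_ufw_rules() {
  local cidr
  # The "|| [ -n ... ]" keeps the last line even when it has no trailing newline.
  while IFS= read -r cidr || [ -n "$cidr" ]; do
    [ -n "$cidr" ] || continue   # skip blank lines
    printf 'ufw allow proto tcp from %s to any port 80,443\n' "$cidr"
  done
}
```

For example, curl -fsS https://www.cloudflare.com/ips-v4 | cf_ufw_rules prints one rule per range; repeat with ips-v6 if the nodes have IPv6, and re-run whenever Cloudflare updates its ranges.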

One-Time: External Services

Postgres (Neon)

Create project + database (honeydue).
Create a dedicated DB user with least privilege — not the project owner.
Enable IP allowlist, add every Swarm node's egress IP.
Verify DB_SSLMODE=require works end-to-end.
Turn on PITR (paid tier) or schedule automated pg_dump backups.
Do one restore drill — boot a staging stack from a real backup. If you haven't done this, you do not have backups.
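
If you go the pg_dump route instead of PITR, a scheduled job might call something like the sketch below. backup_db is a hypothetical helper, and the connection URL and output directory are placeholders you supply; nothing here is wired into ./.deploy_prod.

```shell
# Sketch only: backup_db is a hypothetical helper.
# $1: Postgres connection URL (include sslmode=require); $2: output directory.
backup_db() {
  local url="$1" dir="${2:-.}" stamp out
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  out="$dir/honeydue-$stamp.dump"
  # Custom format is compressed and is restored with pg_restore.
  pg_dump --format=custom --file="$out" "$url" && printf '%s\n' "$out"
}
```

The function prints the dump path on success, so a cron wrapper can upload or rotate it. The restore drill above is what proves the dump is usable.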

Redis

Redis runs inside the stack on a named volume. No external setup needed today. See README §11 — this is an accepted SPOF.
If you move Redis external (Upstash, Dragonfly Cloud): update REDIS_URL in prod.env, remove the redis service + volume from the stack.

Backblaze B2 (or MinIO)

Skip this section if you're running a single-node prod and are OK with uploads on a local volume. Required for multi-replica prod — see README §8.

Create B2 account + bucket (private).
Create a scoped application key bound to that single bucket — not the master key.
Set lifecycle rules: keep only the current version of each file, or whatever matches your policy.
Populate B2_ENDPOINT, B2_KEY_ID, B2_APP_KEY, B2_BUCKET_NAME in deploy/prod.env. Optionally set B2_USE_SSL and B2_REGION.
Verify uploads round-trip across replicas after the first deploy (upload a file via client A → fetch via client B in a different session).
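
Before deploying, it can help to confirm that the four required B2 variables are either all set or all empty. The sketch below is a hypothetical local pre-check (check_b2_env is not the deploy script's own validation code) and assumes the values from deploy/prod.env are already loaded into the shell.

```shell
# Sketch only: check_b2_env is hypothetical, not the script's actual validation.
# Succeeds when the four required B2 variables are all set or all empty.
check_b2_env() {
  local name value count=0
  for name in B2_ENDPOINT B2_KEY_ID B2_APP_KEY B2_BUCKET_NAME; do
    # Indirect lookup of the variable named by $name; safe because the names are fixed.
    eval "value=\${$name:-}"
    if [ -n "$value" ]; then
      count=$((count + 1))
    fi
  done
  case "$count" in
    0) echo "B2 disabled: uploads stay on the local volume" ;;
    4) echo "B2 enabled" ;;
    *) echo "B2 partially configured ($count of 4 variables set)" >&2; return 1 ;;
  esac
}
```

A partial configuration is the case worth catching: it usually means one value was missed when copying credentials from the B2 console.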

APNS (Apple Push)

Create an APNS auth key (.p8) in the Apple Developer portal.
Save to deploy/secrets/apns_auth_key.p8 — the script enforces it contains a real -----BEGIN PRIVATE KEY----- block.
Fill APNS_AUTH_KEY_ID, APNS_TEAM_ID, APNS_TOPIC (bundle ID) in deploy/prod.env.
Decide APNS_USE_SANDBOX / APNS_PRODUCTION based on build target.
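
To catch a placeholder key file before the deploy script rejects it, a local check along these lines may be enough. check_apns_key is a hypothetical helper; the script's real check may be stricter.

```shell
# Sketch only: check_apns_key is hypothetical; the deploy script's own check may differ.
check_apns_key() {
  local key_file="${1:-deploy/secrets/apns_auth_key.p8}"
  if [ ! -s "$key_file" ]; then
    echo "missing or empty: $key_file" >&2
    return 1
  fi
  # -F: fixed string; "--" stops grep from parsing the leading dashes as options.
  if ! grep -qF -- '-----BEGIN PRIVATE KEY-----' "$key_file"; then
    echo "no PEM private key block in $key_file" >&2
    return 1
  fi
}
```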

FCM (Android Push)

Create Firebase project + legacy server key (or migrate to HTTP v1 — the code currently uses the legacy server key).
Save to deploy/secrets/fcm_server_key.txt.

SMTP (Email)

Provision SMTP credentials (Gmail app password, SES, Postmark, etc.).
Fill EMAIL_HOST, EMAIL_PORT, EMAIL_HOST_USER, DEFAULT_FROM_EMAIL, EMAIL_USE_TLS in deploy/prod.env.
Save the password to deploy/secrets/email_host_password.txt.
Verify SPF, DKIM, DMARC on the sending domain if you care about deliverability.

Registry (GHCR / other)

Create a personal access token with write:packages + read:packages.
Fill REGISTRY, REGISTRY_NAMESPACE, REGISTRY_USERNAME, REGISTRY_TOKEN in deploy/registry.env.
Rotate the token on a schedule (quarterly at minimum).

Apple / Google IAP (optional)

Apple: create App Store Connect API key, fill the APPLE_IAP_* vars.
Google: create a service account with Play Developer API access, store JSON at a path referenced by GOOGLE_IAP_SERVICE_ACCOUNT_PATH.

Recurring Operations

Secret Rotation

Rotate after any compromise, whenever a team member leaves, and at least annually:

Generate the new value (e.g. openssl rand -base64 32 > deploy/secrets/secret_key.txt).
./.deploy_prod — creates a new versioned Swarm secret and redeploys services to pick it up.
Old versions remain in Swarm until pruning removes them; the script keeps the last SECRET_KEEP_VERSIONS (default 3). See README "Secret Versioning & Pruning".
For external creds (Neon, B2, APNS, etc.) rotate at the provider first, update the local secret file, then redeploy.
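
For the app's own secret key, the first two steps can be chained as in the sketch below. rotate_secret_key and the DEPLOY_CMD override are hypothetical conveniences, with DEPLOY_CMD defaulting to ./.deploy_prod.

```shell
# Sketch only: rotate_secret_key and DEPLOY_CMD are hypothetical names.
rotate_secret_key() {
  local secret_file="${1:-deploy/secrets/secret_key.txt}"
  local deploy_cmd="${DEPLOY_CMD:-./.deploy_prod}"
  # The subshell keeps the restrictive umask from leaking into the caller.
  # umask only affects newly created files; an existing file keeps its mode.
  ( umask 077; openssl rand -base64 32 > "$secret_file" ) || return 1
  "$deploy_cmd"
}
```

Run it from the repo root so the default paths resolve. For a rehearsal, DRY_RUN=1 rotate_secret_key should print the plan without deploying, though the secret file on disk has already been replaced by then.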

Backup Drills

Quarterly: pull a Neon backup, restore to a scratch project, boot a staging stack against it, verify login + basic reads.
Monthly: spot-check that B2 objects are actually present and the app key still works.
After any schema change: confirm PITR coverage includes the new columns before relying on it.

Certificate Management

TLS is terminated by Cloudflare today, so there are no origin certs to renew. If you ever move TLS on-origin (Traefik, Caddy), automate renewal — don't add it to this list and expect it to happen.

Multi-Arch Builds

./.deploy_prod builds for the host arch. If target ≠ host:

Enable buildx: docker buildx create --use.
Install QEMU: docker run --privileged --rm tonistiigi/binfmt --install all.
Build + push images manually per target platform.
Run SKIP_BUILD=1 ./.deploy_prod so the script just deploys.
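
The manual build steps might be scripted like this. Both function names are hypothetical, and the image reference and build context are placeholders: this document does not state the image names or tags the deploy script expects, so check those before pushing.

```shell
# Sketch only: function names, image references, and build contexts are hypothetical.
setup_buildx() {
  docker buildx create --use
  docker run --privileged --rm tonistiigi/binfmt --install all
}

# $1: full image reference; $2: target platform, e.g. linux/arm64; $3: build context.
build_and_push() {
  local image="$1" platform="$2" context="${3:-.}"
  docker buildx build --platform "$platform" --tag "$image" --push "$context"
}
```

Run setup_buildx once per build host, call build_and_push for each image, then finish with SKIP_BUILD=1 ./.deploy_prod.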

Node Maintenance / Rolling Upgrades

docker node update --availability drain <node> before OS upgrades.
Reboot, verify, then docker node update --availability active <node>.
Re-converge with docker stack deploy -c swarm-stack.prod.yml honeydue.
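
Draining is not instant, since services with a stop_grace_period can take a while to exit. A small wait loop like the sketch below avoids rebooting too early. drain_and_wait is a hypothetical helper, and the 2-second poll interval and default of 30 tries are arbitrary.

```shell
# Sketch only: drain_and_wait is a hypothetical helper.
# $1: node name; $2: number of polls before giving up (default 30).
drain_and_wait() {
  local node="$1" tries="${2:-30}" running
  docker node update --availability drain "$node" || return 1
  while [ "$tries" -gt 0 ]; do
    # Count tasks whose current state is still Running on this node.
    running="$(docker node ps "$node" --format '{{.CurrentState}}' | grep -c '^Running' || true)"
    if [ "$running" -eq 0 ]; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 2
  done
  echo "$running task(s) still running on $node" >&2
  return 1
}
```

Once it returns 0, reboot the node, then continue with the activate and re-converge steps above.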

Incident Response

Redis Node Dies

The named volume is local to its node and does not follow Redis when the service is rescheduled. Accept the loss:

Let Swarm reschedule Redis on a new node.
In-flight Asynq jobs are lost; retry semantics cover most of them.
Scheduled cron events fire again on the next tick (hourly for smart reminders and daily digest; daily for onboarding + cleanup).
Cache repopulates on first request.

Deploy Rolled Back Automatically

./.deploy_prod runs docker service rollback on every service if the healthcheck against DEPLOY_HEALTHCHECK_URL fails. Diagnose with:

ssh <manager> docker stack services honeydue
ssh <manager> docker service logs --tail 200 honeydue_api
# Or open an SSH tunnel to Dozzle: ssh -L 9999:127.0.0.1:9999 <manager>
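
To reproduce the check by hand while diagnosing, something like the sketch below approximates a healthcheck-then-rollback step. It is an illustration only: check_or_rollback is hypothetical, the curl timeout and retry values are assumptions, and the deploy script's real logic may differ.

```shell
# Sketch only: check_or_rollback is hypothetical; timeouts and retries are assumptions.
# $1: health URL; remaining args: Swarm service names to roll back on failure.
check_or_rollback() {
  local url="$1" svc
  shift
  if curl -fsS --max-time 10 --retry 5 --retry-delay 3 "$url" > /dev/null; then
    echo "healthy"
    return 0
  fi
  for svc in "$@"; do
    docker service rollback "$svc"
  done
  return 1
}
```

Calling it with only the URL and no service names gives a check that never rolls anything back, which is the safer form for diagnosis.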

Lost Ability to Deploy

Registry token revoked → regenerate, update deploy/registry.env, re-run.
Manager host key changed → verify legitimacy, update ~/.ssh/known_hosts.
All secrets accidentally pruned → restore the deploy/secrets/* files locally and redeploy; new Swarm secret versions will be created.

Known Gaps (Future Work)

No dedicated cmd/migrate binary — migrations run at API boot (see README §10). Large schema changes still need manual coordination.
asynq.Scheduler has no leader election; WORKER_REPLICAS must stay 1 until we migrate to asynq.PeriodicTaskManager (README §9).
No Prometheus / Grafana / alerting in the stack. /metrics is exposed on the API but nothing scrapes it.
No automated TLS renewal on-origin — add if you ever move off Cloudflare.
No staging environment wired to the deploy script — DEPLOY_TAG=<sha> is the closest thing. A proper staging flow is future work.
