honeyDueAPI/deploy/shit_deploy_cant_do.md
Trey t 33eee812b6 Harden prod deploy: versioned secrets, healthchecks, migration lock, dry-run
Swarm stack
- Resource limits on all services, stop_grace_period 60s on api/worker/admin
- Dozzle bound to manager loopback only (ssh -L required for access)
- Worker health server on :6060, admin /api/health endpoint
- Redis 200M LRU cap, B2/S3 env vars wired through to api service

Deploy script
- DRY_RUN=1 prints plan + exits
- Auto-rollback on failed healthcheck, docker logout at end
- Versioned-secret pruning keeps last SECRET_KEEP_VERSIONS (default 3)
- PUSH_LATEST_TAG default flipped to false
- B2 all-or-none validation before deploy

Code
- cmd/api takes pg_advisory_lock on a dedicated connection before
  AutoMigrate, serialising boot-time migrations across replicas
- cmd/worker exposes an HTTP /health endpoint with graceful shutdown

Docs
- deploy/DEPLOYING.md: step-by-step walkthrough for a real deploy
- deploy/shit_deploy_cant_do.md: manual prerequisites + recurring ops
- deploy/README.md updated with storage toggle, worker-replica caveat,
  multi-arch recipe, connection-pool tuning, renumbered sections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 15:22:43 -05:00

Shit ./.deploy_prod Can't Do

Everything listed here is manual. The deploy script orchestrates builds, secrets, and the stack — it does not provision infrastructure, touch DNS, configure Cloudflare, or rotate external credentials. Work through this list once before your first prod deploy, then revisit after every cloud-side change.

See README.md for the security checklist that complements this file.


One-Time: Infrastructure

Swarm Cluster

  • Provision manager + worker VMs (Hetzner, DO, etc.).
  • docker swarm init --advertise-addr <manager-private-ip> on manager #1.
  • docker swarm join-token manager (or docker swarm join-token worker) on manager #1, then run the printed docker swarm join command on each additional node.
  • docker node ls to verify — all nodes Ready and Active.
  • Label nodes if you want placement constraints beyond the defaults.
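
The "all nodes Ready and Active" check above can be scripted instead of eyeballed. The following is a minimal sketch, not part of ./.deploy_prod: the function name check_swarm_nodes is hypothetical, and it assumes the node list is produced with the --format template shown in the trailing comment.

```shell
# Reads "hostname status availability" lines on stdin, one per node.
# Prints every node that is not Ready/Active and returns non-zero if any
# such node exists, or if no nodes were listed at all.
check_swarm_nodes() {
  awk '
    NF == 0 { next }
    { seen++ }
    $2 != "Ready" || $3 != "Active" { print "NOT OK: " $0; bad++ }
    END { if (bad > 0 || seen == 0) exit 1 }
  '
}

# On a manager:
#   docker node ls --format '{{.Hostname}} {{.Status}} {{.Availability}}' | check_swarm_nodes
```

A drained or down node shows up as a NOT OK line, so the same helper is usable after node maintenance.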

Node Hardening (every node)

  • SSH: non-default port, key-only auth, no root login — see README §2.
  • Firewall: allow the SSH port (22, or 2222 if you moved it); allow 80 and 443 from CF IPs only; allow 2377/tcp, 7946/tcp+udp, and 4789/udp from other Swarm nodes only; block the rest. See README §1.
  • Install unattended-upgrades (or equivalent) for security patches.
  • Disable password auth in /etc/ssh/sshd_config.
  • Create the deploy user (AllowUsers deploy in sshd_config).
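
As a rough illustration of the firewall bullet, the sketch below generates ufw rules rather than applying them, so the output can be reviewed first. It assumes ufw is the firewall in use (README §1 may prescribe something else), and the function names cf_ufw_rules and swarm_ufw_rules are hypothetical. The plain-text Cloudflare range list is assumed to be available at https://www.cloudflare.com/ips-v4; verify that against the ranges page before relying on it.

```shell
# Reads CIDR ranges on stdin (one per line) and prints ufw commands that
# allow 80/443 from each range. Nothing is executed here.
cf_ufw_rules() {
  while IFS= read -r cidr || [ -n "$cidr" ]; do
    [ -n "$cidr" ] || continue
    printf 'ufw allow proto tcp from %s to any port 80,443\n' "$cidr"
  done
}

# Prints ufw commands that open the Swarm ports to each peer IP given as an
# argument: 2377/tcp, 7946/tcp+udp, 4789/udp.
swarm_ufw_rules() {
  for ip in "$@"; do
    printf 'ufw allow proto tcp from %s to any port 2377,7946\n' "$ip"
    printf 'ufw allow proto udp from %s to any port 7946,4789\n' "$ip"
  done
}

# Review the output, then pipe it to "sudo sh" on each node:
#   curl -fsS https://www.cloudflare.com/ips-v4 | cf_ufw_rules
#   swarm_ufw_rules 10.0.0.2 10.0.0.3
```

Cloudflare's ranges change occasionally, so regenerate the rules after any update to the published list.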

DNS + Cloudflare

  • Add A records for api.<domain>, admin.<domain> pointing to the LB or manager IPs. Keep them proxied (orange cloud).
  • Create a Cloudflare tunnel or enable "Authenticated Origin Pulls" if you want to lock the origin to CF only.
  • Firewall rule on the nodes: only accept 80/443 from Cloudflare IP ranges (https://www.cloudflare.com/ips/).
  • Configure CF Access (or equivalent SSO) in front of admin panel if exposing it publicly.

One-Time: External Services

Postgres (Neon)

  • Create project + database (honeydue).
  • Create a dedicated DB user with least privilege — not the project owner.
  • Enable IP allowlist, add every Swarm node's egress IP.
  • Verify DB_SSLMODE=require works end-to-end.
  • Turn on PITR (paid tier) or schedule automated pg_dump backups.
  • Do one restore drill — boot a staging stack from a real backup. If you haven't done this, you do not have backups.

Redis

  • Redis runs inside the stack on a named volume. No external setup needed today. See README §11 — this is an accepted SPOF.
  • If you move Redis external (Upstash, Dragonfly Cloud): update REDIS_URL in prod.env, remove the redis service + volume from the stack.

Backblaze B2 (or MinIO)

Skip this section if you're running a single-node prod and are OK with uploads on a local volume. Required for multi-replica prod — see README §8.

  • Create B2 account + bucket (private).
  • Create a scoped application key bound to that single bucket — not the master key.
  • Set lifecycle rules: keep only the current version of each file, or whatever matches your policy.
  • Populate B2_ENDPOINT, B2_KEY_ID, B2_APP_KEY, B2_BUCKET_NAME in deploy/prod.env. Optionally set B2_USE_SSL and B2_REGION.
  • Verify uploads round-trip across replicas after the first deploy (upload a file via client A → fetch via client B in a different session).
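
The deploy script validates the B2 settings as all-or-none before deploying. Its actual implementation is not shown here; the sketch below is one way to run the same kind of check locally before a deploy. The function name b2_config_state is hypothetical, and only the four required variables named above are inspected.

```shell
# Prints "none", "all", or "partial" depending on how many of the four
# required B2 variables are set to a non-empty value.
b2_config_state() (
  set_count=0
  for name in B2_ENDPOINT B2_KEY_ID B2_APP_KEY B2_BUCKET_NAME; do
    eval "value=\${$name:-}"
    if [ -n "$value" ]; then
      set_count=$((set_count + 1))
    fi
  done
  case "$set_count" in
    0) echo none ;;
    4) echo all ;;
    *) echo partial ;;
  esac
)

# Usage, assuming deploy/prod.env is plain KEY=value shell syntax:
#   set -a; . deploy/prod.env; set +a
#   [ "$(b2_config_state)" != "partial" ] || echo "B2 config is incomplete" >&2
```

"none" corresponds to the local-volume setup, "all" to B2-backed uploads; "partial" is the state worth failing on.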

APNS (Apple Push)

  • Create an APNS auth key (.p8) in the Apple Developer portal.
  • Save to deploy/secrets/apns_auth_key.p8 — the script enforces it contains a real -----BEGIN PRIVATE KEY----- block.
  • Fill APNS_AUTH_KEY_ID, APNS_TEAM_ID, APNS_TOPIC (bundle ID) in deploy/prod.env.
  • Decide APNS_USE_SANDBOX / APNS_PRODUCTION based on build target.
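
To catch a placeholder or truncated key file before the script does, a quick local check along these lines may help. This is a sketch only: looks_like_p8 is a hypothetical name, and the END-marker check is an extra assumption beyond the BEGIN block that the script is described as enforcing.

```shell
# Returns 0 only if the file exists and contains both PEM markers of a
# PKCS#8 private key. It does not validate the key material itself.
looks_like_p8() {
  [ -f "$1" ] &&
    grep -qF -e '-----BEGIN PRIVATE KEY-----' "$1" &&
    grep -qF -e '-----END PRIVATE KEY-----' "$1"
}

# Usage:
#   looks_like_p8 deploy/secrets/apns_auth_key.p8 || echo "not a .p8 key" >&2
```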

FCM (Android Push)

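  • Create a Firebase project and a legacy server key; the code currently uses the legacy server key. Google has deprecated the legacy FCM HTTP API in favor of HTTP v1, so treat the migration to HTTP v1 as pending work rather than optional.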
  • Save to deploy/secrets/fcm_server_key.txt.

SMTP (Email)

  • Provision SMTP credentials (Gmail app password, SES, Postmark, etc.).
  • Fill EMAIL_HOST, EMAIL_PORT, EMAIL_HOST_USER, DEFAULT_FROM_EMAIL, EMAIL_USE_TLS in deploy/prod.env.
  • Save the password to deploy/secrets/email_host_password.txt.
  • Verify SPF, DKIM, DMARC on the sending domain if you care about deliverability.

Registry (GHCR / other)

  • Create a personal access token with write:packages + read:packages.
  • Fill REGISTRY, REGISTRY_NAMESPACE, REGISTRY_USERNAME, REGISTRY_TOKEN in deploy/registry.env.
  • Rotate the token on a schedule (quarterly at minimum).

Apple / Google IAP (optional)

  • Apple: create App Store Connect API key, fill the APPLE_IAP_* vars.
  • Google: create a service account with Play Developer API access, store JSON at a path referenced by GOOGLE_IAP_SERVICE_ACCOUNT_PATH.

Recurring Operations

Secret Rotation

Rotate after any compromise, whenever a team member leaves, and at least annually:

  1. Generate the new value (e.g. openssl rand -base64 32 > deploy/secrets/secret_key.txt).
  2. ./.deploy_prod — creates a new versioned Swarm secret and redeploys services to pick it up.
  3. Old versions stay in Swarm until pruning removes everything beyond the newest SECRET_KEEP_VERSIONS (default 3); see README "Secret Versioning & Pruning".
  4. For external creds (Neon, B2, APNS, etc.) rotate at the provider first, update the local secret file, then redeploy.
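
The keep-the-newest-N rule in step 3 can be sketched as follows. This only illustrates the selection logic, not the script's actual pruning code: secrets_to_prune is a hypothetical name, and the secret naming in the usage comment (a honeydue_secret_key_ prefix with a sortable version suffix) is an assumption.

```shell
# Reads secret names on stdin, oldest first, and prints the ones that fall
# outside the newest N. N is the first argument, else SECRET_KEEP_VERSIONS,
# else 3.
secrets_to_prune() (
  keep=${1:-${SECRET_KEEP_VERSIONS:-3}}
  awk -v keep="$keep" '
    NF { names[++n] = $0 }
    END { for (i = 1; i <= n - keep; i++) print names[i] }
  '
)

# Listing candidates on a manager (prefix and sort order are assumptions):
#   docker secret ls --format '{{.Name}}' | grep '^honeydue_secret_key_' | sort | secrets_to_prune
```

docker secret rm refuses to remove a secret that a service still references, which acts as a backstop if the selection is wrong.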

Backup Drills

  • Quarterly: pull a Neon backup, restore to a scratch project, boot a staging stack against it, verify login + basic reads.
  • Monthly: spot-check that B2 objects are actually present and the app key still works.
  • After any schema change: confirm PITR coverage includes the new columns before relying on it.

Certificate Management

  • TLS is terminated by Cloudflare today, so there are no origin certs to renew. If you ever move TLS on-origin (Traefik, Caddy), automate renewal — don't add it to this list and expect it to happen.

Multi-Arch Builds

./.deploy_prod builds for the host arch. If target ≠ host:

  • Enable buildx: docker buildx create --use.
  • Install QEMU: docker run --privileged --rm tonistiigi/binfmt --install all.
  • Build + push images manually per target platform.
  • Run SKIP_BUILD=1 ./.deploy_prod so the script just deploys.

Node Maintenance / Rolling Upgrades

  • docker node update --availability drain <node> before OS upgrades.
  • Reboot, verify, then docker node update --availability active <node>.
  • Re-converge with docker stack deploy -c swarm-stack.prod.yml honeydue.
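
The three steps above can be wrapped so they are rehearsed before touching a live node. The sketch below borrows the DRY_RUN=1 convention from the deploy script, but the run, drain_node, activate_node, and reconverge helpers are hypothetical and not part of it.

```shell
# Runs the given command, or only prints it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

drain_node()    { run docker node update --availability drain "$1"; }
activate_node() { run docker node update --availability active "$1"; }
reconverge()    { run docker stack deploy -c swarm-stack.prod.yml honeydue; }

# Typical sequence on a manager:
#   drain_node worker-2
#   ...upgrade and reboot worker-2, confirm it is Ready in docker node ls...
#   activate_node worker-2 && reconverge
```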

Incident Response

Redis Node Dies

The named volume is local to the node and does not follow the Redis task when Swarm reschedules it. Accept the loss:

  1. Let Swarm reschedule Redis on a new node.
  2. In-flight Asynq jobs are lost; retry semantics cover most of them.
  3. Scheduled cron events fire again on the next tick (hourly for smart reminders and daily digest; daily for onboarding + cleanup).
  4. Cache repopulates on first request.

Deploy Rolled Back Automatically

./.deploy_prod triggers docker service rollback on every service if DEPLOY_HEALTHCHECK_URL fails. Diagnose with:

ssh <manager> docker stack services honeydue
ssh <manager> docker service logs --tail 200 honeydue_api
# Or open an SSH tunnel to Dozzle: ssh -L 9999:127.0.0.1:9999 <manager>
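
To reproduce the healthcheck by hand after a rollback, a small retry loop is usually enough. This is a sketch under assumptions: retry is a hypothetical helper, the attempt count and delay are arbitrary, and the script's own probe logic may differ.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND...
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted, sleeping between
# tries. Returns 0 on the first success, 1 otherwise.
retry() {
  retry_attempts=$1
  retry_delay=$2
  shift 2
  retry_i=1
  while [ "$retry_i" -le "$retry_attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$retry_i" -lt "$retry_attempts" ]; then
      sleep "$retry_delay"
    fi
    retry_i=$((retry_i + 1))
  done
  return 1
}

# Probe the same URL the deploy uses, 10 tries, 3 seconds apart:
#   retry 10 3 curl -fsS -o /dev/null "$DEPLOY_HEALTHCHECK_URL" || echo "still unhealthy" >&2
```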

Lost Ability to Deploy

  • Registry token revoked → regenerate, update deploy/registry.env, re-run.
  • Manager host key changed → verify legitimacy, update ~/.ssh/known_hosts.
  • All secrets accidentally pruned → restore the deploy/secrets/* files locally and redeploy; new Swarm secret versions will be created.

Known Gaps (Future Work)

  • No dedicated cmd/migrate binary — migrations run at API boot (see README §10). Large schema changes still need manual coordination.
  • asynq.Scheduler has no leader election; WORKER_REPLICAS must stay 1 until we migrate to asynq.PeriodicTaskManager (README §9).
  • No Prometheus / Grafana / alerting in the stack. /metrics is exposed on the API but nothing scrapes it.
  • No automated TLS renewal on-origin — add if you ever move off Cloudflare.
  • No staging environment wired to the deploy script — DEPLOY_TAG=<sha> is the closest thing. A proper staging flow is future work.