# Shit `./.deploy_prod` Can't Do

Everything listed here is **manual**. The deploy script orchestrates builds, secrets, and the stack — it does not provision infrastructure, touch DNS, configure Cloudflare, or rotate external credentials. Work through this list once before your first prod deploy, then revisit after every cloud-side change.

See [`README.md`](./README.md) for the security checklist that complements this file.

---

## One-Time: Infrastructure

### Swarm Cluster

- [ ] Provision manager + worker VMs (Hetzner, DO, etc.).
- [ ] `docker swarm init --advertise-addr ` on manager #1.
- [ ] `docker swarm join-token {manager,worker}` → join additional nodes.
- [ ] `docker node ls` to verify — all nodes `Ready` and `Active`.
- [ ] Label nodes if you want placement constraints beyond the defaults.

### Node Hardening (every node)

- [ ] SSH: non-default port, key-only auth, no root login — see README §2.
- [ ] Firewall: allow 22 (or 2222), 80, 443 from CF IPs only; 2377/tcp, 7946/tcp+udp, 4789/udp Swarm-nodes only; block the rest — see README §1.
- [ ] Install unattended-upgrades (or equivalent) for security patches.
- [ ] Disable password auth in `/etc/ssh/sshd_config`.
- [ ] Create the `deploy` user (`AllowUsers deploy` in sshd_config).

### DNS + Cloudflare

- [ ] Add A records for `api.`, `admin.` pointing to the LB or manager IPs. Keep them **proxied** (orange cloud).
- [ ] Create a Cloudflare tunnel or enable "Authenticated Origin Pulls" if you want to lock the origin to CF only.
- [ ] Firewall rule on the nodes: only accept 80/443 from Cloudflare IP ranges ().
- [ ] Configure CF Access (or equivalent SSO) in front of admin panel if exposing it publicly.

---

## One-Time: External Services

### Postgres (Neon)

- [ ] Create project + database (`honeydue`).
- [ ] Create a dedicated DB user with least privilege — not the project owner.
- [ ] Enable IP allowlist, add every Swarm node's egress IP.
- [ ] Verify `DB_SSLMODE=require` works end-to-end.
- [ ] Turn on PITR (paid tier) or schedule automated `pg_dump` backups.
- [ ] Do one restore drill — boot a staging stack from a real backup. If you haven't done this, you do not have backups.

### Redis

- Redis runs **inside** the stack on a named volume. No external setup needed today. See README §11 — this is an accepted SPOF.
- [ ] If you move Redis external (Upstash, Dragonfly Cloud): update `REDIS_URL` in `prod.env`, remove the `redis` service + volume from the stack.

### Backblaze B2 (or MinIO)

Skip this section if you're running a single-node prod and are OK with uploads on a local volume. Required for multi-replica prod — see README §8.

- [ ] Create B2 account + bucket (private).
- [ ] Create a **scoped** application key bound to that single bucket — not the master key.
- [ ] Set lifecycle rules: keep only the current version of each file, or whatever matches your policy.
- [ ] Populate `B2_ENDPOINT`, `B2_KEY_ID`, `B2_APP_KEY`, `B2_BUCKET_NAME` in `deploy/prod.env`. Optionally set `B2_USE_SSL` and `B2_REGION`.
- [ ] Verify uploads round-trip across replicas after the first deploy (upload a file via client A → fetch via client B in a different session).

### APNS (Apple Push)

- [ ] Create an APNS auth key (`.p8`) in the Apple Developer portal.
- [ ] Save to `deploy/secrets/apns_auth_key.p8` — the script enforces it contains a real `-----BEGIN PRIVATE KEY-----` block.
- [ ] Fill `APNS_AUTH_KEY_ID`, `APNS_TEAM_ID`, `APNS_TOPIC` (bundle ID) in `deploy/prod.env`.
- [ ] Decide `APNS_USE_SANDBOX` / `APNS_PRODUCTION` based on build target.

### FCM (Android Push)

- [ ] Create Firebase project + legacy server key (or migrate to HTTP v1 — the code currently uses the legacy server key).
- [ ] Save to `deploy/secrets/fcm_server_key.txt`.

### SMTP (Email)

- [ ] Provision SMTP credentials (Gmail app password, SES, Postmark, etc.).
- [ ] Fill `EMAIL_HOST`, `EMAIL_PORT`, `EMAIL_HOST_USER`, `DEFAULT_FROM_EMAIL`, `EMAIL_USE_TLS` in `deploy/prod.env`.
- [ ] Save the password to `deploy/secrets/email_host_password.txt`.
- [ ] Verify SPF, DKIM, DMARC on the sending domain if you care about deliverability.

### Registry (GHCR / other)

- [ ] Create a personal access token with `write:packages` + `read:packages`.
- [ ] Fill `REGISTRY`, `REGISTRY_NAMESPACE`, `REGISTRY_USERNAME`, `REGISTRY_TOKEN` in `deploy/registry.env`.
- [ ] Rotate the token on a schedule (quarterly at minimum).

### Apple / Google IAP (optional)

- [ ] Apple: create App Store Connect API key, fill the `APPLE_IAP_*` vars.
- [ ] Google: create a service account with Play Developer API access, store JSON at a path referenced by `GOOGLE_IAP_SERVICE_ACCOUNT_PATH`.

---

## Recurring Operations

### Secret Rotation

After any compromise, annually at minimum, and when a team member leaves:

1. Generate the new value (e.g. `openssl rand -base64 32 > deploy/secrets/secret_key.txt`).
2. `./.deploy_prod` — creates a new versioned Swarm secret and redeploys services to pick it up.
3. The old secret lingers until `SECRET_KEEP_VERSIONS` bumps it out (see README "Secret Versioning & Pruning").
4. For external creds (Neon, B2, APNS, etc.) rotate at the provider first, update the local secret file, then redeploy.

### Backup Drills

- [ ] Quarterly: pull a Neon backup, restore to a scratch project, boot a staging stack against it, verify login + basic reads.
- [ ] Monthly: spot-check that B2 objects are actually present and the app key still works.
- [ ] After any schema change: confirm PITR coverage includes the new columns before relying on it.

### Certificate Management

- TLS is terminated by Cloudflare today, so there are no origin certs to renew. If you ever move TLS on-origin (Traefik, Caddy), automate renewal — don't add it to this list and expect it to happen.

### Multi-Arch Builds

`./.deploy_prod` builds for the host arch. If target ≠ host:

- [ ] Enable buildx: `docker buildx create --use`.
- [ ] Install QEMU: `docker run --privileged --rm tonistiigi/binfmt --install all`.
- [ ] Build + push images manually per target platform.
- [ ] Run `SKIP_BUILD=1 ./.deploy_prod` so the script just deploys.

### Node Maintenance / Rolling Upgrades

- [ ] `docker node update --availability drain ` before OS upgrades.
- [ ] Reboot, verify, then `docker node update --availability active `.
- [ ] Re-converge with `docker stack deploy -c swarm-stack.prod.yml honeydue`.

---

## Incident Response

### Redis Node Dies

Named volume is per-node and doesn't follow. Accept the loss:

1. Let Swarm reschedule Redis on a new node.
2. In-flight Asynq jobs are lost; retry semantics cover most of them.
3. Scheduled cron events fire again on the next tick (hourly for smart reminders and daily digest; daily for onboarding + cleanup).
4. Cache repopulates on first request.

### Deploy Rolled Back Automatically

`./.deploy_prod` triggers `docker service rollback` on every service if `DEPLOY_HEALTHCHECK_URL` fails. Diagnose with:

```bash
ssh docker stack services honeydue
ssh docker service logs --tail 200 honeydue_api
# Or open an SSH tunnel to Dozzle:
ssh -L 9999:127.0.0.1:9999
```

### Lost Ability to Deploy

- Registry token revoked → regenerate, update `deploy/registry.env`, re-run.
- Manager host key changed → verify legitimacy, update `~/.ssh/known_hosts`.
- All secrets accidentally pruned → restore the `deploy/secrets/*` files locally and redeploy; new Swarm secret versions will be created.

---

## Known Gaps (Future Work)

- No dedicated `cmd/migrate` binary — migrations run at API boot (see README §10). Large schema changes still need manual coordination.
- `asynq.Scheduler` has no leader election; `WORKER_REPLICAS` must stay 1 until we migrate to `asynq.PeriodicTaskManager` (README §9).
- No Prometheus / Grafana / alerting in the stack. `/metrics` is exposed on the API but nothing scrapes it.
- No automated TLS renewal on-origin — add if you ever move off Cloudflare.
- No staging environment wired to the deploy script — `DEPLOY_TAG=` is the closest thing. A proper staging flow is future work.
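
Several of the one-time steps above end with a file saved under `deploy/secrets/`. As a rough preflight before running `./.deploy_prod`, a sketch like the one below could confirm those files exist and are non-empty. The function name `check_secret_files` and the exact file list are assumptions taken from the paths this checklist mentions; the deploy script's own validation may differ, and some of these files may be optional in your setup.

```shell
#!/usr/bin/env bash
# Preflight sketch (hypothetical helper, not part of ./.deploy_prod):
# check that the secret files named in this checklist exist and are non-empty.

check_secret_files() {
  local dir="${1:-deploy/secrets}"
  local missing=0
  local f

  # File list is an assumption based on the paths mentioned in this document.
  for f in secret_key.txt apns_auth_key.p8 fcm_server_key.txt email_host_password.txt; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $dir/$f" >&2
      missing=1
    fi
  done

  # The checklist says the APNS key must contain a real private-key block.
  if [ -s "$dir/apns_auth_key.p8" ] &&
     ! grep -qF -e '-----BEGIN PRIVATE KEY-----' "$dir/apns_auth_key.p8"; then
    echo "no PRIVATE KEY block in $dir/apns_auth_key.p8" >&2
    missing=1
  fi

  return "$missing"
}
```

From the repo root, something like `check_secret_files deploy/secrets && ./.deploy_prod` would stop before the deploy if any file is missing. It only checks for presence and the PEM header, not whether the credentials are still valid at the provider.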