honeyDueAPI/deploy/shit_deploy_cant_do.md
Trey t 33eee812b6 Harden prod deploy: versioned secrets, healthchecks, migration lock, dry-run
Swarm stack
- Resource limits on all services, stop_grace_period 60s on api/worker/admin
- Dozzle bound to manager loopback only (ssh -L required for access)
- Worker health server on :6060, admin /api/health endpoint
- Redis 200M LRU cap, B2/S3 env vars wired through to api service

Deploy script
- DRY_RUN=1 prints plan + exits
- Auto-rollback on failed healthcheck, docker logout at end
- Versioned-secret pruning keeps last SECRET_KEEP_VERSIONS (default 3)
- PUSH_LATEST_TAG default flipped to false
- B2 all-or-none validation before deploy

Code
- cmd/api takes pg_advisory_lock on a dedicated connection before
  AutoMigrate, serialising boot-time migrations across replicas
- cmd/worker exposes an HTTP /health endpoint with graceful shutdown

Docs
- deploy/DEPLOYING.md: step-by-step walkthrough for a real deploy
- deploy/shit_deploy_cant_do.md: manual prerequisites + recurring ops
- deploy/README.md updated with storage toggle, worker-replica caveat,
  multi-arch recipe, connection-pool tuning, renumbered sections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 15:22:43 -05:00

# Shit `./.deploy_prod` Can't Do
Everything listed here is **manual**. The deploy script orchestrates builds,
secrets, and the stack — it does not provision infrastructure, touch DNS,
configure Cloudflare, or rotate external credentials. Work through this list
once before your first prod deploy, then revisit after every cloud-side
change.

See [`README.md`](./README.md) for the security checklist that complements
this file.

---
## One-Time: Infrastructure
### Swarm Cluster
- [ ] Provision manager + worker VMs (Hetzner, DO, etc.).
- [ ] `docker swarm init --advertise-addr <manager-private-ip>` on manager #1.
- [ ] `docker swarm join-token manager` or `docker swarm join-token worker` on the manager, then run the printed `docker swarm join` command on each additional node.
- [ ] `docker node ls` to verify — all nodes `Ready` and `Active`.
- [ ] Label nodes if you want placement constraints beyond the defaults.
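
The `docker node ls` check in the list above can be turned into something that fails loudly. This is a minimal sketch and not part of `./.deploy_prod`; the function name `unhealthy_nodes` is hypothetical, and it relies on the `.Hostname`, `.Status`, and `.Availability` placeholders of `docker node ls --format`.

```bash
# Reads "hostname status availability" lines on stdin and prints every node
# that is not both Ready and Active. Prints nothing when all nodes are healthy.
unhealthy_nodes() {
  awk '$2 != "Ready" || $3 != "Active" { print $1 }'
}

# Intended use on a manager (not executed by this snippet):
#   docker node ls --format '{{.Hostname}} {{.Status}} {{.Availability}}' | unhealthy_nodes
```

An empty result means every node reported `Ready` and `Active`.
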
### Node Hardening (every node)
- [ ] SSH: non-default port, key-only auth, no root login — see README §2.
- [ ] Firewall: allow SSH on 22 (or 2222); allow 80 and 443 from Cloudflare IPs only;
allow 2377/tcp, 7946/tcp+udp, and 4789/udp between Swarm nodes only; block
everything else (see README §1).
- [ ] Install unattended-upgrades (or equivalent) for security patches.
- [ ] Disable password auth in `/etc/ssh/sshd_config`.
- [ ] Create the `deploy` user (`AllowUsers deploy` in sshd_config).
### DNS + Cloudflare
- [ ] Add A records for `api.<domain>`, `admin.<domain>` pointing to the LB
or manager IPs. Keep them **proxied** (orange cloud).
- [ ] Create a Cloudflare tunnel or enable "Authenticated Origin Pulls" if
you want to lock the origin to CF only.
- [ ] Firewall rule on the nodes: only accept 80/443 from Cloudflare IP ranges
(<https://www.cloudflare.com/ips/>).
- [ ] Configure CF Access (or equivalent SSO) in front of admin panel if
exposing it publicly.
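
The Cloudflare-only rule for 80/443 in the list above could be generated rather than typed by hand. This sketch assumes `ufw` is the node firewall, which this file does not specify, and the function name `cf_ufw_rules` is hypothetical. It only prints commands so they can be reviewed before anything is applied.

```bash
# Reads Cloudflare CIDR ranges (one per line) on stdin and prints one ufw
# command per range. Nothing is applied; the output is meant for review.
cf_ufw_rules() {
  # The "|| [ -n ... ]" keeps the last range when the input has no trailing newline.
  while IFS= read -r cidr || [ -n "$cidr" ]; do
    # Skip blank lines so they do not produce a broken rule.
    [ -n "$cidr" ] || continue
    printf 'ufw allow proto tcp from %s to any port 80,443\n' "$cidr"
  done
}

# Intended use (not executed by this snippet):
#   curl -fsS https://www.cloudflare.com/ips-v4 | cf_ufw_rules
#   curl -fsS https://www.cloudflare.com/ips-v6 | cf_ufw_rules
```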
---
## One-Time: External Services
### Postgres (Neon)
- [ ] Create project + database (`honeydue`).
- [ ] Create a dedicated DB user with least privilege — not the project owner.
- [ ] Enable IP allowlist, add every Swarm node's egress IP.
- [ ] Verify `DB_SSLMODE=require` works end-to-end.
- [ ] Turn on PITR (paid tier) or schedule automated `pg_dump` backups.
- [ ] Do one restore drill — boot a staging stack from a real backup. If you
haven't done this, you do not have backups.
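
If you take the `pg_dump` route instead of PITR, a backup job might look like the sketch below. `DATABASE_URL` and both function names are assumptions for illustration; this file does not define how the connection string is stored.

```bash
# Builds a sortable dump filename from a database name and a UTC timestamp.
dump_filename() {
  printf '%s-%s.dump\n' "$1" "$2"
}

# Writes a custom-format dump and prints its filename on success.
# DATABASE_URL is a hypothetical variable holding the Neon connection string.
backup_db() {
  local out
  out="$(dump_filename honeydue "$(date -u +%Y%m%dT%H%M%SZ)")"
  pg_dump --format=custom --file="$out" --dbname="$DATABASE_URL" && printf '%s\n' "$out"
}

# Restore drill against a scratch database (not executed by this snippet):
#   pg_restore --no-owner --dbname="$SCRATCH_DATABASE_URL" honeydue-<timestamp>.dump
```

The custom format is used here because `pg_restore` can read it directly, which keeps the restore drill to a single command.
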
### Redis
- Redis runs **inside** the stack on a named volume. No external setup
needed today. See README §11 — this is an accepted SPOF.
- [ ] If you move Redis external (Upstash, Dragonfly Cloud): update
`REDIS_URL` in `prod.env`, remove the `redis` service + volume from
the stack.
### Backblaze B2 (or MinIO)
Skip this section if you're running a single-node prod and are OK with
uploads on a local volume. Required for multi-replica prod — see README §8.
- [ ] Create B2 account + bucket (private).
- [ ] Create a **scoped** application key bound to that single bucket —
not the master key.
- [ ] Set lifecycle rules: keep only the current version of each file,
or whatever matches your policy.
- [ ] Populate `B2_ENDPOINT`, `B2_KEY_ID`, `B2_APP_KEY`, `B2_BUCKET_NAME`
in `deploy/prod.env`. Optionally set `B2_USE_SSL` and `B2_REGION`.
- [ ] Verify uploads round-trip across replicas after the first deploy
(upload a file via client A → fetch via client B in a different session).
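
Because the four required B2 variables only make sense together, a pre-flight check can catch a half-filled `prod.env` before a deploy. The sketch below illustrates the all-or-none idea; it is not the deploy script's actual implementation, and the function name `b2_config_state` is hypothetical.

```bash
# Prints "none", "all", or "partial" depending on how many of the four
# required B2 variables are set to a non-empty value in the current shell.
b2_config_state() {
  local set_count=0 name value
  for name in B2_ENDPOINT B2_KEY_ID B2_APP_KEY B2_BUCKET_NAME; do
    # eval is limited to the fixed names above; it reads the variable named by $name.
    eval "value=\${$name:-}"
    if [ -n "$value" ]; then
      set_count=$((set_count + 1))
    fi
  done
  case "$set_count" in
    0) echo none ;;
    4) echo all ;;
    *) echo partial ;;
  esac
}
```

A result of `partial` is the case worth refusing: some keys are present, so uploads would be misconfigured rather than cleanly disabled.
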
### APNS (Apple Push)
- [ ] Create an APNS auth key (`.p8`) in the Apple Developer portal.
- [ ] Save to `deploy/secrets/apns_auth_key.p8` — the script enforces it
contains a real `-----BEGIN PRIVATE KEY-----` block.
- [ ] Fill `APNS_AUTH_KEY_ID`, `APNS_TEAM_ID`, `APNS_TOPIC` (bundle ID) in
`deploy/prod.env`.
- [ ] Decide `APNS_USE_SANDBOX` / `APNS_PRODUCTION` based on build target.
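
You can check the key file locally before the deploy script rejects it. This is a rough approximation of that kind of check, not the script's own code; the function name `looks_like_p8` is hypothetical.

```bash
# Succeeds only when the file contains both PEM markers of a private key.
# The "--" stops grep from treating the leading dashes as options.
looks_like_p8() {
  grep -q -- '-----BEGIN PRIVATE KEY-----' "$1" &&
    grep -q -- '-----END PRIVATE KEY-----' "$1"
}

# Intended use (not executed by this snippet):
#   looks_like_p8 deploy/secrets/apns_auth_key.p8 || echo "APNS key looks wrong" >&2
```
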
### FCM (Android Push)
- [ ] Create Firebase project + legacy server key. The code currently uses the
legacy server key; Google has deprecated the legacy FCM API, so plan the
migration to HTTP v1.
- [ ] Save to `deploy/secrets/fcm_server_key.txt`.
### SMTP (Email)
- [ ] Provision SMTP credentials (Gmail app password, SES, Postmark, etc.).
- [ ] Fill `EMAIL_HOST`, `EMAIL_PORT`, `EMAIL_HOST_USER`,
`DEFAULT_FROM_EMAIL`, `EMAIL_USE_TLS` in `deploy/prod.env`.
- [ ] Save the password to `deploy/secrets/email_host_password.txt`.
- [ ] Verify SPF, DKIM, DMARC on the sending domain if you care about
deliverability.
### Registry (GHCR / other)
- [ ] Create a personal access token with `write:packages` + `read:packages`.
- [ ] Fill `REGISTRY`, `REGISTRY_NAMESPACE`, `REGISTRY_USERNAME`,
`REGISTRY_TOKEN` in `deploy/registry.env`.
- [ ] Rotate the token on a schedule (quarterly at minimum).
### Apple / Google IAP (optional)
- [ ] Apple: create App Store Connect API key, fill the `APPLE_IAP_*` vars.
- [ ] Google: create a service account with Play Developer API access,
store JSON at a path referenced by `GOOGLE_IAP_SERVICE_ACCOUNT_PATH`.
---
## Recurring Operations
### Secret Rotation
After any compromise, annually at minimum, and when a team member leaves:
1. Generate the new value (e.g. `openssl rand -base64 32 > deploy/secrets/secret_key.txt`).
2. `./.deploy_prod` — creates a new versioned Swarm secret and redeploys
services to pick it up.
3. Old versions stay in Swarm until pruning removes them; the script keeps
the newest `SECRET_KEEP_VERSIONS` (default 3). See README "Secret
Versioning & Pruning".
4. For external creds (Neon, B2, APNS, etc.) rotate at the provider first,
update the local secret file, then redeploy.
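
The keep-the-newest-N rule behind `SECRET_KEEP_VERSIONS` can be illustrated with a small filter. This is a sketch of the idea only, not the deploy script's code: the function name `secrets_to_prune` is hypothetical, and the naming scheme of versioned secrets is not shown in this file, so the input is assumed to be already sorted newest first.

```bash
# Reads secret names on stdin, newest first, and prints the ones that fall
# outside the keep window. The keep count defaults to 3.
secrets_to_prune() {
  local keep="${1:-3}"
  # tail -n +K starts printing at line K, so this skips the first $keep names.
  tail -n "+$((keep + 1))"
}

# Intended use on a manager (not executed by this snippet), assuming names
# that sort chronologically and share a hypothetical <prefix>:
#   docker secret ls --format '{{.Name}}' | grep '^<prefix>' | sort -r | secrets_to_prune 3
```

Swarm refuses to remove a secret that a running service still references, which gives a manual prune a safety net.
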
### Backup Drills
- [ ] Quarterly: pull a Neon backup, restore to a scratch project, boot a
staging stack against it, verify login + basic reads.
- [ ] Monthly: spot-check that B2 objects are actually present and the
app key still works.
- [ ] After any schema change: confirm that a restore from a point after the
migration contains the new columns before relying on it.
### Certificate Management
- TLS is terminated by Cloudflare today, so there are no origin certs to
renew. If you ever move TLS on-origin (Traefik, Caddy), automate renewal
— don't add it to this list and expect it to happen.
### Multi-Arch Builds
`./.deploy_prod` builds for the host architecture only. If the Swarm nodes run a different architecture:
- [ ] Enable buildx: `docker buildx create --use`.
- [ ] Install QEMU: `docker run --privileged --rm tonistiigi/binfmt --install all`.
- [ ] Build + push images manually per target platform.
- [ ] Run `SKIP_BUILD=1 ./.deploy_prod` so the script just deploys.
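
For the manual build step, the platform string passed to buildx has to match the target nodes. The helper below is a hypothetical convenience that maps `uname -m` output from a node to a Docker platform; the image reference in the usage comment is a placeholder because image names are not listed in this file.

```bash
# Maps a machine architecture (as printed by `uname -m`) to a Docker platform.
# Returns non-zero for anything it does not recognise.
platform_for_arch() {
  case "$1" in
    x86_64|amd64)  echo linux/amd64 ;;
    aarch64|arm64) echo linux/arm64 ;;
    *) echo "unsupported arch: $1" >&2; return 1 ;;
  esac
}

# Intended use (not executed by this snippet):
#   docker buildx build --platform "$(platform_for_arch aarch64)" \
#     --tag <registry>/<namespace>/<image>:<tag> --push .
```
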
### Node Maintenance / Rolling Upgrades
- [ ] `docker node update --availability drain <node>` before OS upgrades.
- [ ] Reboot, verify, then `docker node update --availability active <node>`.
- [ ] Re-converge with `docker stack deploy -c swarm-stack.prod.yml honeydue`.
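
The drain and reactivate steps above are easy to wrap so they can be previewed first. This is a sketch with hypothetical helper names; its `DRY_RUN` switch belongs to the snippet itself and is separate from the deploy script's `DRY_RUN=1`.

```bash
# Runs a command, or only prints it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Stop scheduling tasks on a node before maintenance, and re-enable it after.
drain_node()    { run docker node update --availability drain "$1"; }
activate_node() { run docker node update --availability active "$1"; }

# Intended use on a manager (not executed by this snippet):
#   DRY_RUN=1 drain_node <node>      # preview
#   drain_node <node>                # then reboot, verify, activate_node <node>
```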
---
## Incident Response
### Redis Node Dies
The named volume is local to its node and does not follow the Redis task when Swarm reschedules it, so the data is gone. Accept the loss:
1. Let Swarm reschedule Redis on a new node.
2. In-flight Asynq jobs are lost; retry semantics cover most of them.
3. Scheduled cron events fire again on the next tick (hourly for smart
reminders and daily digest; daily for onboarding + cleanup).
4. Cache repopulates on first request.
### Deploy Rolled Back Automatically
`./.deploy_prod` runs `docker service rollback` on every service when the
check against `DEPLOY_HEALTHCHECK_URL` fails. Diagnose with:
```bash
ssh <manager> docker stack services honeydue
ssh <manager> docker service logs --tail 200 honeydue_api
# Or open an SSH tunnel to Dozzle: ssh -L 9999:127.0.0.1:9999 <manager>
```
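
To re-test the health endpoint by hand after a rollback, a small retry loop is usually enough. The sketch below is a generic helper, not the script's own healthcheck logic; the function name `retry` and the attempt and delay values in the usage comment are assumptions.

```bash
# Runs a command up to <attempts> times, sleeping <delay> seconds between
# tries. Succeeds as soon as the command succeeds, fails if it never does.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # No sleep after the final attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Intended use (not executed by this snippet):
#   retry 10 3 curl -fsS "$DEPLOY_HEALTHCHECK_URL" >/dev/null
```

`curl -f` makes HTTP error statuses count as failures, so the loop retries on a 5xx as well as on a refused connection.
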
### Lost Ability to Deploy
- Registry token revoked → regenerate, update `deploy/registry.env`, re-run.
- Manager host key changed → verify legitimacy, update `~/.ssh/known_hosts`.
- All secrets accidentally pruned → restore the `deploy/secrets/*` files
locally and redeploy; new Swarm secret versions will be created.
---
## Known Gaps (Future Work)
- No dedicated `cmd/migrate` binary — migrations run at API boot (see
README §10). Large schema changes still need manual coordination.
- `asynq.Scheduler` has no leader election; `WORKER_REPLICAS` must stay 1
until we migrate to `asynq.PeriodicTaskManager` (README §9).
- No Prometheus / Grafana / alerting in the stack. `/metrics` is exposed
on the API but nothing scrapes it.
- No automated TLS renewal on-origin — add if you ever move off Cloudflare.
- No staging environment wired to the deploy script — `DEPLOY_TAG=<sha>`
is the closest thing. A proper staging flow is future work.