# Shit `./.deploy_prod` Can't Do

Everything listed here is **manual**. The deploy script orchestrates builds,
secrets, and the stack — it does not provision infrastructure, touch DNS,
configure Cloudflare, or rotate external credentials. Work through this list
once before your first prod deploy, then revisit after every cloud-side
change.

See [`README.md`](./README.md) for the security checklist that complements
this file.

---

## One-Time: Infrastructure

### Swarm Cluster

- [ ] Provision manager + worker VMs (Hetzner, DO, etc.).
- [ ] `docker swarm init --advertise-addr <manager-private-ip>` on manager #1.
- [ ] `docker swarm join-token manager` / `docker swarm join-token worker` → run the printed join command on each additional node.
- [ ] `docker node ls` to verify — all nodes `Ready` and `Active`.
- [ ] Label nodes if you want placement constraints beyond the defaults.
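
The `docker node ls` check can be scripted so it fails loudly instead of relying on a visual scan of the table. A minimal sketch, assuming it runs on a manager; `unready_nodes` is a hypothetical helper, not something the deploy script provides.

```bash
# Hypothetical helper: reads "<hostname> <status> <availability>" lines on
# stdin, prints every node that is not Ready/Active, and returns non-zero
# if at least one such node exists.
unready_nodes() {
  awk '$2 != "Ready" || $3 != "Active" { print $1; bad = 1 } END { exit bad + 0 }'
}

# Intended use on a manager node:
#   docker node ls --format '{{.Hostname}} {{.Status}} {{.Availability}}' | unready_nodes
```

Empty output with exit code 0 means every node reported `Ready` and `Active`.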

### Node Hardening (every node)

- [ ] SSH: non-default port, key-only auth, no root login — see README §2.
- [ ] Firewall: allow SSH (22, or your custom port such as 2222); allow 80 and
      443 from CF IPs only; allow 2377/tcp, 7946/tcp+udp, and 4789/udp between
      Swarm nodes only; block the rest. See README §1.
- [ ] Install unattended-upgrades (or equivalent) for security patches.
- [ ] Disable password auth in `/etc/ssh/sshd_config`.
- [ ] Create the `deploy` user (`AllowUsers deploy` in sshd_config).
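
One way to confirm the SSH items above actually took effect is to inspect the effective configuration that `sshd -T` prints. The sketch below assumes that output format (lowercase `key value` lines); `sshd_violations` is a hypothetical helper name.

```bash
# Hypothetical helper: reads `sshd -T` output on stdin and prints one line
# per setting that deviates from the hardening checklist.
sshd_violations() {
  awk '
    $1 == "passwordauthentication" && $2 != "no" { print "password auth enabled" }
    $1 == "permitrootlogin"        && $2 != "no" { print "root login allowed: " $2 }
    $1 == "port"                   && $2 == "22" { print "default port 22 in use" }
    $1 == "allowusers"                           { seen = 1 }
    END { if (!seen) print "AllowUsers not set" }
  '
}

# Intended use on each node (usually needs root):
#   sshd -T | sshd_violations
```

No output means none of the four checks tripped. It does not prove the rest of the configuration is sound.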

### DNS + Cloudflare

- [ ] Add A records for `api.<domain>`, `admin.<domain>` pointing to the LB
      or manager IPs. Keep them **proxied** (orange cloud).
- [ ] Create a Cloudflare tunnel or enable "Authenticated Origin Pulls" if
      you want to lock the origin to CF only.
- [ ] Firewall rule on the nodes: only accept 80/443 from Cloudflare IP ranges
      (<https://www.cloudflare.com/ips/>).
- [ ] Configure CF Access (or equivalent SSO) in front of admin panel if
      exposing it publicly.
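
The Cloudflare-only firewall rule can be generated rather than typed by hand. This is a sketch under two assumptions: the nodes use `ufw`, and Cloudflare's plain-text lists at `https://www.cloudflare.com/ips-v4` and `https://www.cloudflare.com/ips-v6` are reachable. `cf_ufw_rules` is a hypothetical helper that only prints rules so they can be reviewed first.

```bash
# Hypothetical helper: reads one CIDR per line on stdin and prints (does not
# run) a ufw rule allowing HTTP/HTTPS from that range. The extra test in the
# loop condition keeps a final line that lacks a trailing newline.
cf_ufw_rules() {
  while IFS= read -r cidr || [ -n "$cidr" ]; do
    [ -n "$cidr" ] || continue
    printf 'ufw allow proto tcp from %s to any port 80,443\n' "$cidr"
  done
}

# Intended use: review the output, then run the printed commands with sudo.
#   curl -fsS https://www.cloudflare.com/ips-v4 | cf_ufw_rules
#   curl -fsS https://www.cloudflare.com/ips-v6 | cf_ufw_rules
```

Ports published by Docker can bypass `ufw` rules because Docker manages its own iptables chains, so confirm from an outside host that 80/443 really are closed to non-Cloudflare addresses.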

---

## One-Time: External Services

### Postgres (Neon)

- [ ] Create project + database (`honeydue`).
- [ ] Create a dedicated DB user with least privilege — not the project owner.
- [ ] Enable IP allowlist, add every Swarm node's egress IP.
- [ ] Verify `DB_SSLMODE=require` works end-to-end.
- [ ] Turn on PITR (paid tier) or schedule automated `pg_dump` backups.
- [ ] Do one restore drill — boot a staging stack from a real backup. If you
      haven't done this, you do not have backups.

### Redis

- Redis runs **inside** the stack on a named volume. No external setup
  needed today. See README §11 — this is an accepted SPOF.
- [ ] If you move Redis external (Upstash, Dragonfly Cloud): update
      `REDIS_URL` in `prod.env`, remove the `redis` service + volume from
      the stack.

### Backblaze B2 (or MinIO)

Skip this section if you're running a single-node prod and are OK with
uploads on a local volume. Required for multi-replica prod — see README §8.

- [ ] Create B2 account + bucket (private).
- [ ] Create a **scoped** application key bound to that single bucket —
      not the master key.
- [ ] Set lifecycle rules: keep only the current version of each file,
      or whatever matches your policy.
- [ ] Populate `B2_ENDPOINT`, `B2_KEY_ID`, `B2_APP_KEY`, `B2_BUCKET_NAME`
      in `deploy/prod.env`. Optionally set `B2_USE_SSL` and `B2_REGION`.
- [ ] Verify uploads round-trip across replicas after the first deploy
      (upload a file via client A → fetch via client B in a different session).

### APNS (Apple Push)

- [ ] Create an APNS auth key (`.p8`) in the Apple Developer portal.
- [ ] Save to `deploy/secrets/apns_auth_key.p8` — the script enforces it
      contains a real `-----BEGIN PRIVATE KEY-----` block.
- [ ] Fill `APNS_AUTH_KEY_ID`, `APNS_TEAM_ID`, `APNS_TOPIC` (bundle ID) in
      `deploy/prod.env`.
- [ ] Decide `APNS_USE_SANDBOX` / `APNS_PRODUCTION` based on build target.
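
The script's check on the `.p8` file can be approximated locally before a deploy. The following is a sketch of an equivalent pre-flight test, not the script's actual implementation; `check_p8` is a hypothetical name.

```bash
# Hypothetical pre-flight check: succeeds only if the file exists and contains
# both a PEM private-key header and footer.
check_p8() {
  local key_file=$1
  [ -f "$key_file" ] || { echo "missing: $key_file" >&2; return 1; }
  grep -q -e '-----BEGIN PRIVATE KEY-----' "$key_file" &&
    grep -q -e '-----END PRIVATE KEY-----' "$key_file" ||
    { echo "not a PEM private key: $key_file" >&2; return 1; }
}

# Intended use:
#   check_p8 deploy/secrets/apns_auth_key.p8
```

This catches placeholder or truncated files. It does not validate that Apple will accept the key.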

### FCM (Android Push)

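- [ ] Create Firebase project + legacy server key. The code currently uses the
      legacy server key; Google deprecated the legacy FCM HTTP API in June 2023
      and began shutting it down in July 2024, so check whether your project
      can still use it and plan the migration to HTTP v1.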
- [ ] Save to `deploy/secrets/fcm_server_key.txt`.

### SMTP (Email)

- [ ] Provision SMTP credentials (Gmail app password, SES, Postmark, etc.).
- [ ] Fill `EMAIL_HOST`, `EMAIL_PORT`, `EMAIL_HOST_USER`,
      `DEFAULT_FROM_EMAIL`, `EMAIL_USE_TLS` in `deploy/prod.env`.
- [ ] Save the password to `deploy/secrets/email_host_password.txt`.
- [ ] Verify SPF, DKIM, DMARC on the sending domain if you care about
      deliverability.
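
SPF and DMARC records can be spot-checked from any machine with `dig`. A rough sketch, assuming `dig` is installed; `check_mail_dns` is a hypothetical helper. DKIM is left out because the record name depends on the selector your provider assigns.

```bash
# Hypothetical helper: prints a line for each of SPF / DMARC that is missing
# from the domain's TXT records. Prints nothing when both are present.
check_mail_dns() {
  local domain=$1
  dig +short TXT "$domain" | grep -q 'v=spf1' ||
    echo "no SPF record on $domain"
  dig +short TXT "_dmarc.$domain" | grep -q 'v=DMARC1' ||
    echo "no DMARC record on _dmarc.$domain"
}

# Intended use:
#   check_mail_dns <domain>
```

This only checks that the records exist. Whether the policies in them are correct for your provider is a separate question.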

### Registry (GHCR / other)

- [ ] Create a personal access token with `write:packages` + `read:packages`.
- [ ] Fill `REGISTRY`, `REGISTRY_NAMESPACE`, `REGISTRY_USERNAME`,
      `REGISTRY_TOKEN` in `deploy/registry.env`.
- [ ] Rotate the token on a schedule (quarterly at minimum).
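
To confirm a new or rotated token works before a deploy, a manual login is a quick test. The sketch below assumes `deploy/registry.env` uses plain `KEY=value` lines that a shell can source. `registry_login` is a hypothetical helper.

```bash
# Sketch: log in without putting the token on the command line, where it
# would land in shell history and `ps` output. Uses the variables named above.
registry_login() {
  printf '%s' "$REGISTRY_TOKEN" |
    docker login "$REGISTRY" --username "$REGISTRY_USERNAME" --password-stdin
}

# Intended use (assumes the env file is shell-sourceable):
#   set -a; . deploy/registry.env; set +a
#   registry_login
```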

### Apple / Google IAP (optional)

- [ ] Apple: create App Store Connect API key, fill the `APPLE_IAP_*` vars.
- [ ] Google: create a service account with Play Developer API access,
      store JSON at a path referenced by `GOOGLE_IAP_SERVICE_ACCOUNT_PATH`.

---

## Recurring Operations

### Secret Rotation

After any compromise, annually at minimum, and when a team member leaves:

1. Generate the new value (e.g. `openssl rand -base64 32 > deploy/secrets/secret_key.txt`).
2. `./.deploy_prod` — creates a new versioned Swarm secret and redeploys
   services to pick it up.
3. The old secret lingers until `SECRET_KEEP_VERSIONS` pruning removes it (see
   README "Secret Versioning & Pruning").
4. For external creds (Neon, B2, APNS, etc.) rotate at the provider first,
   update the local secret file, then redeploy.
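
Step 1 can be wrapped so the new secret file never ends up world-readable. A minimal sketch, assuming `openssl` is installed; `new_secret_file` is a hypothetical helper and deliberately does not trigger the deploy itself.

```bash
# Hypothetical helper: write a fresh random value to a secret file with
# owner-only permissions. The umask covers newly created files; the chmod
# covers files that already existed with looser permissions.
new_secret_file() {
  local path=$1
  ( umask 077 && openssl rand -base64 32 > "$path" ) && chmod 600 "$path"
}

# Intended use:
#   new_secret_file deploy/secrets/secret_key.txt
#   ./.deploy_prod
```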

### Backup Drills

- [ ] Quarterly: pull a Neon backup, restore to a scratch project, boot a
      staging stack against it, verify login + basic reads.
- [ ] Monthly: spot-check that B2 objects are actually present and the
      app key still works.
- [ ] After any schema change: confirm the PITR window covers a point after
      the migration, so a restore would include the new columns, before
      relying on it.

### Certificate Management

- TLS is terminated by Cloudflare today, so there are no origin certs to
  renew. If you ever move TLS on-origin (Traefik, Caddy), automate renewal
  — don't add it to this list and expect it to happen.

### Multi-Arch Builds

`./.deploy_prod` builds for the host arch. If target ≠ host:

- [ ] Enable buildx: `docker buildx create --use`.
- [ ] Install QEMU: `docker run --privileged --rm tonistiigi/binfmt --install all`.
- [ ] Build + push images manually per target platform.
- [ ] Run `SKIP_BUILD=1 ./.deploy_prod` so the script just deploys.
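
The manual build step might look like the sketch below. The image name, tag, and build context (`.`) are placeholders, since this file does not say how `./.deploy_prod` names its images. Match whatever the script expects to find in the registry. `build_and_push` is a hypothetical helper.

```bash
# Sketch with hypothetical arguments: builds the Dockerfile in the current
# directory for one target platform and pushes the result.
build_and_push() {
  local platform=$1 image=$2 tag=$3
  docker buildx build --platform "$platform" --tag "$image:$tag" --push .
}

# Intended use for an arm64 target built on an amd64 host:
#   build_and_push linux/arm64 <registry>/<namespace>/<image> <tag>
#   SKIP_BUILD=1 ./.deploy_prod
```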

### Node Maintenance / Rolling Upgrades

- [ ] `docker node update --availability drain <node>` before OS upgrades.
- [ ] Reboot, verify, then `docker node update --availability active <node>`.
- [ ] Re-converge with `docker stack deploy -c swarm-stack.prod.yml honeydue`.
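
The drain and reactivate steps can be wrapped in small functions to reduce typos during maintenance windows. These are hypothetical wrappers around the commands listed above, meant to run on a manager.

```bash
# Stop scheduling on a node and move its tasks elsewhere.
drain_node() {
  docker node update --availability drain "$1"
}

# Return a node to the scheduling pool after maintenance.
activate_node() {
  docker node update --availability active "$1"
}

# IDs of tasks still meant to be running on a node. Empty output should mean
# the drain has finished and the node is safe to reboot.
running_tasks() {
  docker node ps --filter desired-state=running --quiet "$1"
}
```

A typical sequence is `drain_node <node>`, wait until `running_tasks <node>` prints nothing, upgrade and reboot, then `activate_node <node>`.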

---

## Incident Response

### Redis Node Dies

The named volume is local to the node and does not follow the rescheduled
task, so the Redis data on the dead node is gone. Accept the loss:

1. Let Swarm reschedule Redis on a new node.
2. In-flight Asynq jobs are lost; retry semantics cover most of them.
3. Scheduled cron events fire again on the next tick (hourly for smart
   reminders and daily digest; daily for onboarding + cleanup).
4. Cache repopulates on first request.

### Deploy Rolled Back Automatically

`./.deploy_prod` triggers `docker service rollback` on every service if
`DEPLOY_HEALTHCHECK_URL` fails. Diagnose with:

```bash
ssh <manager> docker stack services honeydue
ssh <manager> docker service logs --tail 200 honeydue_api
# Or open an SSH tunnel to Dozzle: ssh -L 9999:127.0.0.1:9999 <manager>
```

### Lost Ability to Deploy

- Registry token revoked → regenerate, update `deploy/registry.env`, re-run.
- Manager host key changed → verify legitimacy, update `~/.ssh/known_hosts`.
- All secrets accidentally pruned → restore the `deploy/secrets/*` files
  locally and redeploy; new Swarm secret versions will be created.

---

## Known Gaps (Future Work)

- No dedicated `cmd/migrate` binary — migrations run at API boot (see
  README §10). Large schema changes still need manual coordination.
- `asynq.Scheduler` has no leader election; `WORKER_REPLICAS` must stay 1
  until we migrate to `asynq.PeriodicTaskManager` (README §9).
- No Prometheus / Grafana / alerting in the stack. `/metrics` is exposed
  on the API but nothing scrapes it.
- No automated TLS renewal on-origin — add if you ever move off Cloudflare.
- No staging environment wired to the deploy script — `DEPLOY_TAG=<sha>`
  is the closest thing. A proper staging flow is future work.