Files
honeyDueAPI/deploy/DEPLOYING.md
Trey t 33eee812b6 Harden prod deploy: versioned secrets, healthchecks, migration lock, dry-run
Swarm stack
- Resource limits on all services, stop_grace_period 60s on api/worker/admin
- Dozzle bound to manager loopback only (ssh -L required for access)
- Worker health server on :6060, admin /api/health endpoint
- Redis 200M LRU cap, B2/S3 env vars wired through to api service

Deploy script
- DRY_RUN=1 prints plan + exits
- Auto-rollback on failed healthcheck, docker logout at end
- Versioned-secret pruning keeps last SECRET_KEEP_VERSIONS (default 3)
- PUSH_LATEST_TAG default flipped to false
- B2 all-or-none validation before deploy

Code
- cmd/api takes pg_advisory_lock on a dedicated connection before
  AutoMigrate, serialising boot-time migrations across replicas
- cmd/worker exposes an HTTP /health endpoint with graceful shutdown

Docs
- deploy/DEPLOYING.md: step-by-step walkthrough for a real deploy
- deploy/shit_deploy_cant_do.md: manual prerequisites + recurring ops
- deploy/README.md updated with storage toggle, worker-replica caveat,
  multi-arch recipe, connection-pool tuning, renumbered sections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 15:22:43 -05:00

127 lines
3.7 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Deploying Right Now
Practical walkthrough for a prod deploy against the current Swarm stack.
Assumes infrastructure and cloud services already exist — if not, work
through [`shit_deploy_cant_do.md`](./shit_deploy_cant_do.md) first.
See [`README.md`](./README.md) for the reference docs that back each step.
---
## 0. Pre-flight — check local state
```bash
cd honeyDueAPI-go
git status # clean working tree?
git log -1 --oneline # deploying this SHA
ls deploy/cluster.env deploy/registry.env deploy/prod.env
ls deploy/secrets/*.txt deploy/secrets/*.p8
```
## 1. Reconcile your envs with current defaults
These two values **must** be right — the script does not enforce them:
```bash
# deploy/cluster.env
WORKER_REPLICAS=1 # >1 → duplicate cron jobs (Asynq scheduler is a singleton)
PUSH_LATEST_TAG=false # keeps prod images SHA-pinned
SECRET_KEEP_VERSIONS=3 # optional; 3 is the default
```
Decide storage backend in `deploy/prod.env`:
- **Multi-replica safe (recommended):** set all four of `B2_ENDPOINT`,
`B2_KEY_ID`, `B2_APP_KEY`, `B2_BUCKET_NAME`. Uploads go to B2.
- **Single-node ok:** leave all four empty. Script will warn. In this
mode you must also set `API_REPLICAS=1` — otherwise uploads are
invisible from 2/3 of requests.
## 2. Dry run
```bash
DRY_RUN=1 ./.deploy_prod
```
Confirm in the output:
- `Storage backend: S3 (...)` OR the `LOCAL VOLUME` warning matches intent
- `Replicas: api=3, worker=1, admin=1` (or `api=1` if local storage)
- Image SHA matches `git rev-parse --short HEAD`
- `Manager:` host is correct
- `Secret retention: 3 versions`
Fix envs and re-run until the plan looks right. Nothing touches the cluster yet.
## 3. Real deploy
```bash
./.deploy_prod
```
Do **not** pass `SKIP_BUILD=1` after code changes — the worker's health
server and `MigrateWithLock` both require a fresh build.
End-to-end: ~38 minutes. The script prints each phase.
## 4. Post-deploy verification
```bash
# Stack health (replicas X/X = desired)
ssh <manager> docker stack services honeydue
# API smoke
curl -fsS https://api.<domain>/api/health/ && echo OK
# Logs via Dozzle (loopback-bound, needs SSH tunnel)
ssh -p <port> -L 9999:127.0.0.1:9999 <user>@<manager>
# Then browse http://localhost:9999
```
What the logs should show on a healthy boot:
- `api`: exactly one replica logs `Migration advisory lock acquired`,
the others log `Migration advisory lock acquired` after waiting, then
`released`.
- `worker`: `Health server listening addr=:6060`, `Starting worker server...`,
four `Registered ... job` lines.
- No `Failed to connect to Redis` / `Failed to connect to database`.
## 5. If it goes wrong
Auto-rollback triggers when `DEPLOY_HEALTHCHECK_URL` fails — every service
is rolled back to its previous spec, script exits non-zero.
Triage:
```bash
ssh <manager> docker service logs --tail 200 honeydue_api
ssh <manager> docker service ps honeydue_api --no-trunc
```
Manual rollback (if auto didn't catch it):
```bash
ssh <manager> bash -c '
for svc in $(docker stack services honeydue --format "{{.Name}}"); do
docker service rollback "$svc"
done'
```
Redeploy a known-good SHA:
```bash
DEPLOY_TAG=<older-sha> SKIP_BUILD=1 ./.deploy_prod
# Only valid if that image was previously pushed to the registry.
```
## 6. Pre-deploy honesty check
Before pulling the trigger:
- [ ] Tested Neon PITR restore (not just "backups exist")?
- [ ] `WORKER_REPLICAS=1` — otherwise duplicate push notifications next cron tick
- [ ] Cloudflare-only firewall rule on 80/443 — otherwise origin IP is on the public internet
- [ ] If storage is LOCAL, `API_REPLICAS=1` too
- [ ] Last deploy's secrets still valid (rotation hasn't expired any creds)