Migrate prod deploy from Swarm to K3s; add full deployment book

Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
  callback (was causing 'unlock of unlocked mutex' fatal after Redis
  Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self' +
  allowlist fonts.googleapis.com so the marketing landing page CSS
  actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with --platform
  linux/amd64 so arm64 (Apple Silicon) dev machines produce images
  runnable on x86_64 Hetzner nodes; fix array expansion under set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
  top-level aliases (the '\${X_SECRET}' form never actually resolved);
  dozzle ports: long-form host_ip is rejected by Swarm, switched to
  short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
  worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' ->
  '/' (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe
  failureThreshold 12 -> 48 (MigrateWithLock serializes across 3
  replicas on first-boot; real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching
  deploy-k3s/manifests/api/ and admin/src/app/api/*, hiding legitimate
  files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
  hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
  without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ (full deployment book, 26 files, ~42k words):
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations,
    consolidated citations
- README.md: Production Deployment section replaced with pointer to the
  book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 07:20:21 -05:00
parent 4ec4bbbfe8
commit 6f303dbbaa
46 changed files with 9785 additions and 93 deletions
@@ -0,0 +1,480 @@
+# 19 — Postmortem: The Swarm Era
+
+## Summary
+
+honeyDue launched on Docker Swarm on 2026-04-23. Over the course of a
+single overnight session we hit **thirteen distinct bugs** across the
+Swarm deploy and the k3s migration that followed once we declared Swarm
+unfit. This chapter is the forensic record: the symptom of each bug,
+the root cause, the specific fix, and citations where relevant. It's
+preserved because these lessons were expensive and future-us should not
+pay for them again.
+
+**TL;DR**: Twelve of the thirteen bugs were recoverable. The exception
+(bug #10 below) was a Docker libnetwork ghost-DNS defect
+([moby/moby#52265][moby-52265]) that makes single-replica services
+unreliable. No amount of clever config fixed it; we had to change
+orchestrators.
+
+## Timeline
+
+**~18:00** — Infrastructure stood up. Docker Swarm initialized. First
+build + push to Gitea.
+
+**~19:30** — First deploy runs. Immediate failures.
+
+**~22:00** — api + admin returning 200 through Cloudflare. Flaky but
+working.
+
+**~23:00** — Admin flapping 50%+ through Cloudflare. Ghost DNS record
+identified. Workarounds begin.
+
+**~00:30 (next day)** — Ghost DNS survives every non-nuclear
+intervention. Research confirms it's a known libnetwork bug. Decision
+to migrate to k3s.
+
+**~04:30** — k3s cluster up, all services healthy, 150/150 requests
+green. Postmortem begins.
+
+The session ran ~10 hours. The migration itself took ~1 hour.
+
+## The thirteen bugs
+
+### 1 — Deploy script array expansion under `set -u`
+
+**File**: `deploy/scripts/deploy_prod.sh`
+
+**Symptom**:
+```
+./deploy/scripts/deploy_prod.sh: line 339: api_extra[@]: unbound variable
+```
+
+**Root cause**: On Bash older than 4.4 (macOS ships 3.2), expanding an
+empty array with `"${arr[@]}"` under `set -u` fails with an unbound
+variable error. Our deploy script initialized empty arrays
+conditionally but expanded them unconditionally.
+
+**Fix**: Use the `${arr[@]+"${arr[@]}"}` safe-expansion idiom, or
+restructure to avoid passing empty arrays:
+
+```bash
+build_and_push api "${API_IMAGE}" ${api_extra[@]+"${api_extra[@]}"}
+```
+
+Inside the function, same treatment — use `shift` instead of array
+slicing.
+
+**Moral**: `set -u` with bash arrays is a known pitfall. A bare
+`"${arr[@]}"` expansion isn't safe under strict mode on older Bash if
+the array can be empty.
+
+### 2 — Dockerfile Go version mismatch
+
+**File**: `Dockerfile`
+
+**Symptom**:
+```
+go: go.mod requires go >= 1.25 (running go 1.24.13; GOTOOLCHAIN=local)
+ERROR: failed to build: failed to solve: process "/bin/sh -c go mod download" did not complete successfully: exit code: 1
+```
+
+**Root cause**: `go.mod` specifies `go 1.25`, but the Dockerfile's
+builder stage used `golang:1.24-alpine`.
+
+**Fix**: Bumped to `golang:1.25-alpine`. One-character change.
+
+**Moral**: Keep the Dockerfile base image in sync with `go.mod`'s
+go directive. CI would catch this; we had none.
+
+### 3 — Dev machine arm64 vs node amd64
+
+**Symptom**: Would have been `exec format error` on the nodes if we'd
+deployed without fixing. Caught at build config stage.
+
+**Root cause**: Operator on Apple Silicon (arm64). Hetzner nodes are
+amd64. Plain `docker build` produces arm64 images.
+
+**Fix**: Switched deploy script to use `docker buildx build --platform
+linux/amd64 --push`. This cross-compiles the Go stages (they honor
+`TARGETARCH`) and uses QEMU emulation for the Node stages.
+
+**Moral**: Cross-platform builds are routine for Apple Silicon
+developers. Document it up front, bake it into the deploy script.
+
+### 4 — Swarm stack `host_ip` rejected
+
+**File**: `deploy/swarm-stack.prod.yml` (dozzle service)
+
+**Symptom**:
+```
+services.dozzle.ports.0 Additional property host_ip is not allowed
+```
+
+**Root cause**: The current Compose Specification allows `host_ip` in
+the long-form port spec. The legacy v3 schema that `docker stack
+deploy` validates against doesn't.
+
+**Fix**: Use the short form:
+```yaml
+ports:
+  - "127.0.0.1:${DOZZLE_PORT}:8080"
+```
+
+But then: Swarm's ingress mesh mode silently ignores the `127.0.0.1`
+binding and listens on `0.0.0.0` anyway. Only way to get true
+loopback-only binding is `mode: host`, which changes port-publishing
+semantics.
+
+**Moral**: Compose-file compatibility between plain Docker and Swarm
+is imperfect. Check the [Swarm-specific compose reference][swarm-compose]
+when in doubt.
+
+### 5 — Stack file secret references
+
+**Symptom**:
+```
+service worker: undefined secret "honeydue_postgres_password_237c6b8-20260423195810"
+```
+
+**Root cause**: The original stack file template used
+`source: ${POSTGRES_PASSWORD_SECRET}` (which expanded to the versioned
+secret name like `honeydue_postgres_password_<ts>`) under each service's
+`secrets:` list.
+
+Swarm expects `source:` to match the **alias** in the top-level
+`secrets:` block (`postgres_password`), not the actual secret `name:`.
+
+**Fix**: Changed every `source:` to the alias form:
+
+```yaml
+# Was:
+- source: ${POSTGRES_PASSWORD_SECRET}
+  target: postgres_password
+
+# Now:
+- source: postgres_password
+  target: postgres_password
+```
+
+**Moral**: The original template was clever but subtly wrong. It had
+never successfully deployed; the earlier Dokku setup used a different
+secret model. Template code that has never run is untested code, and
+its bugs surface the first time you use it.
+
+### 6 — API pod crash: `sync.Once` double-unlock
+
+**File**: `internal/services/cache_service.go:54`
+
+**Symptom**: api pods completed migrations, started the HTTP server,
+then crashed with:
+```
+fatal error: sync: unlock of unlocked mutex
+goroutine 1 [running]:
+internal/sync.fatal(...)
+sync.(*Once).doSlow(...)
+github.com/treytartt/honeydue-api/internal/services.NewCacheService
+  /app/internal/services/cache_service.go:31
+```
+
+**Root cause**: Inside a `sync.Once.Do(func() { ... })` callback, the
+code did:
+
+```go
+cacheOnce.Do(func() {
+    // ...
+    if err := client.Ping(ctx).Err(); err != nil {
+        initErr = fmt.Errorf(...)
+        cacheOnce = sync.Once{}  // ← THIS LINE
+        return
+    }
+})
+```
+
+The intent: "if Redis ping fails, reset the Once so a retry can happen."
+The reality: the Once's internal mutex is held while `Do` runs the
+callback. Assigning `cacheOnce = sync.Once{}` overwrites the variable
+in place, zeroing that mutex while it is still locked. When the
+callback returns and `Do` tries to release the mutex, it finds it
+unlocked and the runtime aborts with a fatal error, which `recover`
+cannot catch.
+
+**Fix**: Removed the reset. `main.go` already handles the error
+gracefully (`cache = nil`, continues without caching). Retries happen
+via pod restart, not in-process.
+
+```go
+if err := client.Ping(ctx).Err(); err != nil {
+    initErr = fmt.Errorf(...)
+    // Don't reassign cacheOnce here — mutating it from inside Do()
+    // is a fatal error. Let main.go handle the error.
+    return
+}
+```
+
+**Moral**: `sync.Once` is subtler than it looks. Never reassign an
+active sync primitive from within its own callback.
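+
+If in-process retry were ever wanted, one option is a small
+mutex-guarded initializer in place of `sync.Once`. The sketch below is
+a hypothetical alternative, not what the codebase does (the actual fix
+simply removed the reset); `retryableInit` and the `ping` stand-in are
+illustrative names.
+
+```go
+package main
+
+import (
+	"errors"
+	"fmt"
+	"sync"
+)
+
+// retryableInit runs an init function until it succeeds once.
+// Unlike sync.Once, a failed attempt leaves it open for a retry.
+type retryableInit struct {
+	mu   sync.Mutex
+	done bool
+}
+
+// Do calls fn if no earlier call has succeeded. It returns fn's error,
+// or nil when initialization had already completed.
+func (r *retryableInit) Do(fn func() error) error {
+	r.mu.Lock()
+	defer r.mu.Unlock()
+	if r.done {
+		return nil
+	}
+	if err := fn(); err != nil {
+		return err // done stays false, so the next call retries
+	}
+	r.done = true
+	return nil
+}
+
+func main() {
+	var cacheInit retryableInit
+	attempts := 0
+	ping := func() error { // stand-in for client.Ping(ctx).Err()
+		attempts++
+		if attempts == 1 {
+			return errors.New("redis unreachable")
+		}
+		return nil
+	}
+	fmt.Println(cacheInit.Do(ping)) // first attempt fails
+	fmt.Println(cacheInit.Do(ping)) // retry succeeds
+	fmt.Println(cacheInit.Do(ping)) // already done, ping is not called
+	fmt.Println("attempts:", attempts)
+}
+```
+
+The state lives in a plain `bool` behind the mutex, so nothing ever
+replaces a lock while it is held.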
+
+### 7 — `WORKER_REPLICAS=2` with a singleton Asynq scheduler
+
+**Symptom**: We noticed `WORKER_REPLICAS=2` in `cluster.env` despite
+the Asynq scheduler being a singleton.
+
+**Root cause**: Asynq's `Scheduler` is not leader-elected by default.
+Running more than one replica causes duplicate cron firings: duplicate
+daily digests, duplicate welcome emails.
+
+**Fix**: `WORKER_REPLICAS=1`. Added a comment in `cluster.env.example`
+explaining why.
+
+**Moral**: Defaults can be dangerous. Even when a default seems
+reasonable ("2 replicas for HA"), check against the app's semantics.
+
+### 8 — `PUSH_LATEST_TAG=true` for prod
+
+**Symptom**: During a test, we saw `honeydue-api:latest` updating,
+which would make rollbacks harder.
+
+**Root cause**: The cluster.env had `PUSH_LATEST_TAG=true` when the
+design intent was SHA-pinned deploys only.
+
+**Fix**: `PUSH_LATEST_TAG=false`. SHA tags only.
+
+**Moral**: Tag-mutable images make rollbacks non-deterministic.
+Prefer immutable SHA tags.
+
+### 9 — Neon DB name case sensitivity
+
+**Symptom**:
+```
+server error: ERROR: database "honeydue" does not exist (SQLSTATE 3D000)
+```
+
+**Root cause**: Neon's UI created the database as `"honeyDue"` (quoted,
+camelCase). Postgres preserves the case of quoted identifiers, and the
+database name in a connection is matched case-sensitively. Our
+`prod.env` had `POSTGRES_DB=honeydue` (lowercase).
+
+**Fix**: `POSTGRES_DB=honeyDue`.
+
+**Moral**: Respect Postgres's identifier quoting rules. If something
+was created with quotes, refer to it with exact case.
+
+### 10 — Admin DNS ghost A-record (the big one)
+
+**Symptom**: Through Cloudflare, `admin.myhoneydue.com` returned 502 on
+~50% of requests. The other 50% succeeded. The pattern was stable over
+hours.
+
+**Investigation**:
+
+The admin service had 1 replica, alive on one of three Swarm nodes.
+Caddy (reverse proxy at the time) resolved `admin` via Swarm's
+embedded DNS at `127.0.0.11`. `nslookup admin` returned:
+
+```
+Name: admin  Address: 10.0.1.36   (current task IP)
+Name: admin  Address: 10.0.1.17   (GHOST — what is this?)
+```
+
+Two A records for a one-replica service, returned in random order.
+
+We checked `10.0.1.17`: that IP now belonged to the **dozzle**
+container on hetzner3. Nothing listens on port 3000 in the dozzle
+container → connection refused → 502.
+
+The old admin task had run on hetzner3 with IP 10.0.1.17. When it
+migrated to hetzner1 with IP 10.0.1.36, libnetwork's DNS registration
+for admin was supposed to update. On hetzner2 and hetzner3, the old
+10.0.1.17 record never got removed.
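+
+The manual check above (resolve the name, then see which address
+refuses connections) can be scripted. This is a minimal sketch, assuming
+it runs inside a container attached to the same overlay network so that
+`admin` resolves through the embedded DNS; `unreachable` and `tcpDial`
+are hypothetical helper names, and port 3000 is the admin port from the
+investigation.
+
+```go
+package main
+
+import (
+	"fmt"
+	"net"
+	"time"
+)
+
+// unreachable returns the addresses from addrs that fail a connect to
+// port. dial is injectable so the logic can be exercised offline.
+func unreachable(addrs []string, port string, dial func(addr string) error) []string {
+	var ghosts []string
+	for _, ip := range addrs {
+		if err := dial(net.JoinHostPort(ip, port)); err != nil {
+			ghosts = append(ghosts, ip)
+		}
+	}
+	return ghosts
+}
+
+// tcpDial attempts a TCP connection with a short timeout.
+func tcpDial(addr string) error {
+	conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
+	if err != nil {
+		return err
+	}
+	return conn.Close()
+}
+
+func main() {
+	addrs, err := net.LookupHost("admin")
+	if err != nil {
+		fmt.Println("lookup failed:", err)
+		return
+	}
+	fmt.Println("resolved:", addrs)
+	fmt.Println("not answering on :3000:", unreachable(addrs, "3000", tcpDial))
+}
+```
+
+More resolved addresses than running replicas, with one of them
+refusing connections, matches the ghost-record pattern described here.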
+
+**Things tried, none worked**:
+
+| Attempt | Result |
+|---|---|
+| `endpoint_mode: dnsrr` on admin | DNS still returns both IPs |
+| Kill + restart Caddy container | DNS still returns both IPs |
+| Scale admin to 0 and back to 1 | Ghost 10.0.1.17 still in DNS with 0 replicas |
+| `docker service rm honeydue_admin` | Ghost 10.0.1.17 still in DNS (orphaned) |
+| Change admin to `mode: global` | Different IPs but ghost remains |
+| `mode: host` on admin ports + `extra_hosts: host.docker.internal:host-gateway` | `host.docker.internal` resolved to docker0 (172.17.0.1), not reachable from overlay |
+| Hardcoded 3 node IPs in Caddy + UFW port 3000 node-to-node | ~90% reliable, NAT hairpin issues when Caddy dials its own node |
+
+**Root cause**: [moby/moby#52265][moby-52265] — Docker libnetwork's
+overlay network state store doesn't reliably deregister service
+endpoints when tasks migrate between nodes. Known bug in the 29.x
+line. Partial fixes in #50236 (29.0) were incomplete; 29.3 still
+leaks; #52289 is the pending follow-up.
+
+**Why single-replica services take the worst of it**: With 3 replicas,
+Caddy's DNS query returns 4 IPs (3 real + 1 ghost), so round-robin
+succeeds 75% of the time. With 1 replica, 1 real + 1 ghost means a 50%
+failure rate. More replicas mask the bug without removing it.
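+
+The arithmetic generalizes to any replica count. A small sketch,
+assuming DNS answers are picked uniformly at random and exactly one
+ghost record exists; `successRate` is an illustrative helper, not code
+from the repo.
+
+```go
+package main
+
+import "fmt"
+
+// successRate returns the fraction of requests expected to reach a
+// live task when one of real+ghosts addresses is chosen at random.
+func successRate(real, ghosts int) float64 {
+	total := real + ghosts
+	if total == 0 {
+		return 0
+	}
+	return float64(real) / float64(total)
+}
+
+func main() {
+	for _, replicas := range []int{1, 3, 9} {
+		fmt.Printf("%d real + 1 ghost: %.0f%% success\n",
+			replicas, 100*successRate(replicas, 1))
+	}
+}
+```
+
+Even at 9 replicas one request in ten still lands on the ghost, so
+scaling up hides the defect rather than fixing it.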
+
+**Final fix**: None at the libnetwork level. The ghost survives every
+non-cluster-recreating operation. The only clean purge is
+`docker stack rm` + `docker network rm` + full redeploy. Even then,
+the bug recurs on the next task migration.
+
+**Decision**: Migrate to k3s. CoreDNS has none of libnetwork's state-
+store semantics and the bug class doesn't exist. 4 hours of fighting
+Swarm → 1-hour k3s migration that just worked.
+
+**Citations**:
+- [moby/moby#52265 — Overlay ARP stale entries on 29.3.0][moby-52265]
+- [moby/moby#51491 — DNS broken after swarm init][moby-51491]
+- [Dokploy#3480 — Traefik stale VIP on Swarm][dokploy-3480]
+
+### 11 — IPSec ESP + UDP 500 blocked
+
+**Symptom**: Earlier in the Swarm setup, api 3/3 was working but
+cross-node overlay traffic was intermittently failing. This turned out
+to be a separate bug masking #10 earlier in the session.
+
+**Root cause**: We had encrypted overlay enabled
+(`driver_opts: encrypted: "true"`). Swarm's encrypted mode uses IPSec
+ESP (IP protocol 50) + UDP 500 (IKE). Our UFW only allowed UDP 4789
+(VXLAN) and 7946 (gossip). ESP was blocked by default-deny. Encrypted
+packets dropped silently on some flows.
+
+**Fix**: Added UFW rules for each peer node IP:
+```bash
+sudo ufw allow from <peer> to any proto esp
+sudo ufw allow from <peer> to any port 500 proto udp
+```
+
+Once applied, cross-node overlay data path became stable.
+
+**Moral**: Encrypted Swarm overlay requires more than VXLAN to be open.
+We also had to allow ESP (protocol 50) and UDP 500 (IKE) for IPSec.
+Official Docker docs mention this but it's easy to miss.
+
+### 12 — Admin startupProbe path
+
+**Symptom**: Admin pod kept restarting with startup probe failures.
+Kubelet reported:
+```
+Startup probe failed: HTTP probe failed with statuscode: 404
+```
+
+**Root cause**: The k3s scaffold's `admin/deployment.yaml` had:
+```yaml
+startupProbe:
+  httpGet:
+    path: /admin/
+    port: 3000
+```
+
+But our admin Next.js app serves at `/`, not `/admin/`. Requests to
+`/admin/` return 404. Kubernetes considered the pod unhealthy and
+restarted it in a loop.
+
+**Fix**: Change probe path to `/`. Also bumped `failureThreshold` from
+12 to 24 (120s grace) for Next.js's slower-than-expected cold boot
+when the node's already busy.
+
+**Moral**: Copy-pasted scaffolds can have assumptions that don't match
+your app. Always verify probes against actual reachable paths.
+
+### 13 — MigrateWithLock startup probe grace
+
+**Symptom**: API pods were getting killed by k8s during migration.
+First replica was OK (fast migration); replicas 2 and 3 waited on
+the advisory lock too long and healthchecks tripped.
+
+**Root cause**: Go app's `MigrateWithLock()` uses
+`pg_advisory_lock()` to serialize migrations across replicas. First
+replica does real AutoMigrate (~90s cold); subsequent replicas wait
+on the lock, then run no-op migrations. Total time for 3rd replica
+can be 3+ minutes.
+
+K3s scaffold's `api/deployment.yaml` had:
+```yaml
+startupProbe:
+  failureThreshold: 12
+  periodSeconds: 5
+```
+
+That is 12 × 5s = 60s of grace. Not enough.
+
+**Fix**: Bumped `failureThreshold` to 48 (= 240s grace). Comment in
+the manifest explains why. This is *not* a band-aid — the real startup
+time genuinely is 90-240s depending on lock queue position. The probe
+should reflect reality, not be optimistic.
+
+**Moral**: Healthchecks should be realistic, not aspirational. Know
+what your app actually does at startup.
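+
+The grace arithmetic behind bugs #12 and #13 is worth keeping as a
+quick check. A minimal sketch, assuming `initialDelaySeconds` is zero
+and the probe fails on every attempt until the app is ready;
+`graceSeconds` and `minFailureThreshold` are illustrative names.
+
+```go
+package main
+
+import "fmt"
+
+// graceSeconds is how long a startupProbe tolerates failures before
+// the kubelet restarts the container.
+func graceSeconds(failureThreshold, periodSeconds int) int {
+	return failureThreshold * periodSeconds
+}
+
+// minFailureThreshold returns the smallest failureThreshold whose
+// grace window covers worstCaseSeconds of startup time.
+func minFailureThreshold(worstCaseSeconds, periodSeconds int) int {
+	return (worstCaseSeconds + periodSeconds - 1) / periodSeconds // ceiling division
+}
+
+func main() {
+	fmt.Println(graceSeconds(12, 5))         // scaffold default for api
+	fmt.Println(graceSeconds(48, 5))         // api value after the fix
+	fmt.Println(minFailureThreshold(240, 5)) // threshold for a 240s worst case
+}
+```
+
+Measuring the worst-case startup first and deriving the threshold from
+it avoids guessing in either direction.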
+
+## What we learned
+
+### Docker Swarm is in a bad place in 2026
+
+Not dead — Mirantis supports it through 2030 — but **nobody is
+modernizing libnetwork**. When you hit a DNS or networking bug, you're
+on your own. The fix churn on #52265 (incomplete 29.0 fix → 29.3
+regression → pending #52289) is a tell: the code has no champion.
+
+For new deployments, **don't pick Swarm** unless you're doing something
+Swarm-shaped (tiny, single-replica, no inter-service traffic). K3s is
+a strictly better choice for anything approximating what we're doing.
+
+### Investigate before you work around
+
+We spent a lot of time on clever workarounds for bug #10 (host-mode
+ports, host.docker.internal, hardcoded node IPs, UFW routing) before
+doing the 20-minute research task that revealed the bug was a known
+libnetwork defect. If we'd searched "Swarm DNS stale record 2026" first,
+we'd have saved ~3 hours.
+
+### Scaffolds are starting points, not finishing points
+
+The k3s scaffold in `deploy-k3s/` was excellent — production-grade
+RBAC, PDBs, security contexts, network policies, Traefik middleware.
+But its image references (GHCR), TLS assumptions (CF Full strict), and
+probe paths (admin's `/admin/`) didn't match our actual setup. Every
+scaffold needs a read-through against your environment before you
+`kubectl apply -f`.
+
+### Keep the old config until the new config is proven
+
+We kept `deploy/` (Swarm) intact during the k3s migration. That meant
+if k3s failed, we could `git stash` the k3s work and do a fast Swarm
+redeploy. `deploy/` stays in the repo until k3s has proven itself; we
+expect to remove it after roughly 4 days of clean operation.
+
+## Files affected by tonight's work
+
+All in `honeyDueAPI-go`:
+
+- `Dockerfile` — Go 1.24 → 1.25 (bug #2)
+- `deploy/scripts/deploy_prod.sh` — buildx refactor, array expansion fixes (bugs #1, #3)
+- `deploy/swarm-stack.prod.yml` — dozzle host_ip, secret source references, multiple iterations trying to fix #10
+- `deploy/prod.env` — admin seed env vars, `POSTGRES_DB` case, B2 values, push-disabled placeholders (bug #9)
+- `deploy/cluster.env` — WORKER_REPLICAS 2 → 1, PUSH_LATEST_TAG (bugs #7, #8)
+- `deploy/Caddyfile` — multiple iterations (ultimately deleted when we moved to k3s)
+- `internal/services/cache_service.go` — removed sync.Once reset (bug #6)
+- `internal/database/database.go` — (no change, MigrateWithLock semantics investigated)
+- `deploy-k3s/manifests/api/deployment.yaml` — startupProbe grace (bug #13)
+- `deploy-k3s/manifests/admin/deployment.yaml` — probe path (bug #12)
+- `deploy-k3s/manifests/worker/deployment.yaml` — replicas 2 → 1
+- `deploy-k3s/manifests/pod-disruption-budgets.yaml` — worker minAvailable 1 → 0
+- `deploy-k3s/manifests/traefik-helmchartconfig.yaml` — NEW (DaemonSet + hostNetwork for Traefik)
+- `deploy-k3s/manifests/ingress/ingress-simple.yaml` — NEW (simple host routing, no TLS)
+- `deploy-k3s/MIGRATION_NOTES.md` — NEW
+
+## What was thrown away
+
+- Swarm stack definitions (still in `deploy/`, planned for removal)
+- Caddy Caddyfile (k3s uses Traefik instead)
+- Several hours of work on Caddy `dynamic a` upstream refresh, host-
+  mode ports, and NAT-hairpin workarounds for bug #10 — all moot
+  once we migrated
+
+## References
+
+- [moby/moby#52265 — Overlay ARP stale entries][moby-52265]
+- [moby/moby#51491 — DNS broken after swarm init][moby-51491]
+- [Dokploy#3480 — Traefik stale VIP][dokploy-3480]
+- [Mirantis Swarm LTS commitment][mirantis-swarm]
+- [Kubernetes probe best practices][k8s-probes]
+- [Asynq scheduler limitations][asynq-sched]
+
+[moby-52265]: https://github.com/moby/moby/issues/52265
+[moby-51491]: https://github.com/moby/moby/issues/51491
+[dokploy-3480]: https://github.com/Dokploy/dokploy/issues/3480
+[mirantis-swarm]: https://www.mirantis.com/blog/mirantis-guarantees-long-term-support-for-swarm/
+[k8s-probes]: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
+[asynq-sched]: https://github.com/hibiken/asynq/wiki/Periodic-Tasks
+[swarm-compose]: https://docs.docker.com/reference/compose-file/legacy-versions/