honeyDueAPI/docs/deployment/19-postmortem-swarm.md
Trey t 6f303dbbaa
Migrate prod deploy from Swarm to K3s; add full deployment book
Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
  callback (was causing 'unlock of unlocked mutex' fatal after
  Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
  + allowlist fonts.googleapis.com so the marketing landing page CSS
  actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
  --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
  images runnable on x86_64 Hetzner nodes; fix array expansion under
  set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
  top-level aliases (the '\${X_SECRET}' form never actually resolved);
  dozzle ports: long-form host_ip is rejected by Swarm, switched to
  short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
  worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
  (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
  12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
  real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
  and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
  hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
  without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations,
    consolidated citations
- README.md: Production Deployment section replaced with pointer to
  the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 07:20:54 -05:00

# 19 — Postmortem: The Swarm Era
## Summary
honeyDue launched on Docker Swarm on 2026-04-23. Over a single overnight
session of roughly ten hours we hit **thirteen distinct bugs**, most of
them on Swarm before we declared it unfit, and the last two (#12 and
#13) in the k3s scaffold during the migration. This chapter is the
forensic record: the symptom of each bug, the root cause, the specific
fix, and citations where relevant. It is preserved because these lessons
were expensive and future-us should not pay for them again.

**TL;DR**: Twelve of the thirteen bugs were recoverable. The exception,
bug #10, was a Docker libnetwork ghost-DNS defect
([moby/moby#52265][moby-52265]) that is fundamentally incompatible with
single-replica services. No amount of clever config fixed it; we had to
change orchestrators.
## Timeline
**~18:00**: Infrastructure stood up. Docker Swarm initialized. First
build + push to Gitea.

**~19:30**: First deploy runs. Immediate failures.

**~22:00**: api + admin returning 200 through Cloudflare. Flaky but
working.

**~23:00**: Admin flapping 50%+ through Cloudflare. Ghost DNS record
identified. Workarounds begin.

**~00:30 (next day)**: Ghost DNS survives every non-nuclear
intervention. Research confirms it's a known libnetwork bug. Decision
to migrate to k3s.

**~04:30**: k3s cluster up, all services healthy, 150/150 requests
green. Postmortem begins.

The session ran ~10 hours. The migration itself took ~1 hour.
## The thirteen bugs
### 1 — Deploy script array expansion under `set -u`
**File**: `deploy/scripts/deploy_prod.sh`
**Symptom**:
```
./deploy/scripts/deploy_prod.sh: line 339: api_extra[@]: unbound variable
```
**Root cause**: In Bash older than 4.4 (including the 3.2 release that
macOS ships), expanding an empty array with `"${arr[@]}"` under `set -u`
fails with `unbound variable`. Our deploy script initialized its arrays
conditionally but expanded them unconditionally.
**Fix**: Use the `${arr[@]+"${arr[@]}"}` safe-expansion idiom, or
restructure to avoid passing empty arrays:
```bash
build_and_push api "${API_IMAGE}" ${api_extra[@]+"${api_extra[@]}"}
```
Inside the function, same treatment — use `shift` instead of array
slicing.
**Moral**: `set -u` with Bash arrays is a known pitfall. Unless every
machine that runs the script has Bash 4.4 or newer, a bare
`"${arr[@]}"` is not safe under strict mode when the array can be empty.
### 2 — Dockerfile Go version mismatch
**File**: `Dockerfile`
**Symptom**:
```
go: go.mod requires go >= 1.25 (running go 1.24.13; GOTOOLCHAIN=local)
ERROR: failed to build: failed to solve: process "/bin/sh -c go mod download" did not complete successfully: exit code: 1
```
**Root cause**: `go.mod` specifies `go 1.25`, but the Dockerfile's
builder stage used `golang:1.24-alpine`.
**Fix**: Bumped to `golang:1.25-alpine`. One-character change.
**Moral**: Keep the Dockerfile base image in sync with `go.mod`'s
go directive. CI would catch this; we had none.
### 3 — dev machine arm64 vs node amd64
**Symptom**: None in production. We caught this while configuring the
build; deploying unfixed would have produced `exec format error` on the
nodes.
**Root cause**: Operator on Apple Silicon (arm64). Hetzner nodes are
amd64. Plain `docker build` produces arm64 images.
**Fix**: Switched deploy script to use `docker buildx build --platform
linux/amd64 --push`. This cross-compiles the Go stages (they honor
`TARGETARCH`) and uses QEMU emulation for the Node stages.
**Moral**: Cross-platform builds are routine for Apple Silicon
developers. Document it up front, bake it into the deploy script.
### 4 — Swarm stack `host_ip` rejected
**File**: `deploy/swarm-stack.prod.yml` (dozzle service)
**Symptom**:
```
services.dozzle.ports.0 Additional property host_ip is not allowed
```
**Root cause**: `docker compose` accepts `host_ip` in the long-form
port syntax, but `docker stack deploy` validates the file against the
legacy v3 schema, which has no such property.
**Fix**: Use the short form:
```yaml
ports:
- "127.0.0.1:${DOZZLE_PORT}:8080"
```
But then Swarm's ingress routing mesh silently ignores the `127.0.0.1`
binding and listens on `0.0.0.0` anyway. The only way to get a true
loopback-only binding is `mode: host`, which changes port-publishing
semantics. We kept the short form and restricted the port with UFW
instead.
**Moral**: Compose-file compatibility between plain Docker and Swarm
is imperfect. Check the [Swarm-specific compose reference][swarm-compose]
when in doubt.
### 5 — Stack file secret references
**Symptom**:
```
service worker: undefined secret "honeydue_postgres_password_237c6b8-20260423195810"
```
**Root cause**: The original stack file template used
`source: ${POSTGRES_PASSWORD_SECRET}` (which expanded to the versioned
secret name like `honeydue_postgres_password_<ts>`) under each service's
`secrets:` list.
Swarm expects `source:` to match the **alias** in the top-level
`secrets:` block (`postgres_password`), not the actual secret `name:`.
**Fix**: Changed every `source:` to the alias form:
```yaml
# Was:
- source: ${POSTGRES_PASSWORD_SECRET}
  target: postgres_password
# Now:
- source: postgres_password
  target: postgres_password
```
**Moral**: The original template was clever but subtly wrong. It had
never successfully deployed, because the earlier Dokku setup used a
different secret model. Template code that has never run is untested
code, and its bugs surface on first use.
### 6 — API pod crash: `sync.Once` double-unlock
**File**: `internal/services/cache_service.go:54`
**Symptom**: api pods completed migrations, started the HTTP server,
then crashed with a fatal error:
```
fatal error: sync: unlock of unlocked mutex
goroutine 1 [running]:
internal/sync.fatal(...)
sync.(*Once).doSlow(...)
github.com/treytartt/honeydue-api/internal/services.NewCacheService
/app/internal/services/cache_service.go:31
```
**Root cause**: Inside a `sync.Once.Do(func() { ... })` callback, the
code did:
```go
cacheOnce.Do(func() {
	// ...
	if err := client.Ping(ctx).Err(); err != nil {
		initErr = fmt.Errorf(...)
		cacheOnce = sync.Once{} // ← THIS LINE
		return
	}
})
```
The intent: "if Redis ping fails, reset the Once so a retry can happen."
The reality: `Do` holds the Once's internal mutex while the callback
runs. Assigning `cacheOnce = sync.Once{}` overwrites the struct in
place, which zeroes the mutex that `Do` still holds. When the callback
returns, `Do` unlocks a mutex that now reads as unlocked, and the
runtime aborts with a fatal error that `recover` cannot catch.
**Fix**: Removed the reset. `main.go` already handles the error
gracefully (`cache = nil`, continues without caching). Retries happen
via pod restart, not in-process.
```go
if err := client.Ping(ctx).Err(); err != nil {
	initErr = fmt.Errorf(...)
	// Don't reassign cacheOnce here: mutating it from inside Do()
	// is a fatal error. Let main.go handle the error.
	return
}
```
**Moral**: `sync.Once` is simpler than it looks. Never reassign an
active sync primitive from within its own callback.
### 7 — Stack file `maxUnavailable: 2` warning for worker
**Symptom**: We noticed `WORKER_REPLICAS=2` in `cluster.env` despite
the Asynq scheduler being a singleton.
**Root cause**: Asynq's `Scheduler` is not leader-elected by default.
Running >1 replica causes duplicate cron firings — duplicate daily
digests, double-welcome emails.
**Fix**: `WORKER_REPLICAS=1`. Added a comment in `cluster.env.example`
explaining why.
**Moral**: Defaults can be dangerous. Even when a default seems
reasonable ("2 replicas for HA"), check against the app's semantics.
### 8 — `PUSH_LATEST_TAG=true` for prod
**Symptom**: During a test, we saw `honeydue-api:latest` updating,
which would make rollbacks harder.
**Root cause**: The cluster.env had `PUSH_LATEST_TAG=true` when the
design intent was SHA-pinned deploys only.
**Fix**: `PUSH_LATEST_TAG=false`. SHA tags only.
**Moral**: Tag-mutable images make rollbacks non-deterministic.
Prefer immutable SHA tags.
### 9 — Neon DB name case sensitivity
**Symptom**:
```
server error: ERROR: database "honeydue" does not exist (SQLSTATE 3D000)
```
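**Root cause**: Neon's UI created the database as `"honeyDue"` (quoted,
camelCase). Postgres folds unquoted identifiers to lowercase but
preserves the case of quoted ones, so the stored name is `honeyDue`,
and a connection must request exactly that name. Our `prod.env` had
`POSTGRES_DB=honeydue` (lowercase).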
**Fix**: `POSTGRES_DB=honeyDue`.
**Moral**: Respect Postgres's identifier quoting rules. If something
was created with quotes, refer to it with exact case.
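The folding rule behind this bug is easy to check in isolation. Below is a minimal Go sketch that mimics how Postgres stores an identifier written in SQL. `storedName` is a hypothetical helper for illustration only; it ignores details such as identifier length limits and Unicode escapes.

```go
package main

import (
	"fmt"
	"strings"
)

// storedName mimics PostgreSQL identifier folding: a double-quoted
// identifier keeps its exact case, an unquoted one is folded to lowercase.
func storedName(sqlIdent string) string {
	if len(sqlIdent) >= 2 && strings.HasPrefix(sqlIdent, `"`) && strings.HasSuffix(sqlIdent, `"`) {
		inner := sqlIdent[1 : len(sqlIdent)-1]
		return strings.ReplaceAll(inner, `""`, `"`) // "" is an escaped quote
	}
	return strings.ToLower(sqlIdent)
}

func main() {
	// The database was created with a quoted, camelCase name.
	created := storedName(`"honeyDue"`)

	for _, requested := range []string{"honeydue", "honeyDue"} {
		// The database name in a connection is compared literally.
		fmt.Printf("POSTGRES_DB=%s matches=%t\n", requested, requested == created)
	}
}
```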
### 10 — Admin DNS ghost A-record (the big one)
**Symptom**: Through Cloudflare, `admin.myhoneydue.com` returned 502 on
~50% of requests. The other 50% succeeded. The pattern was stable over
hours.
**Investigation**:
The admin service had 1 replica, alive on one of three Swarm nodes.
Caddy (reverse proxy at the time) resolved `admin` via Swarm's
embedded DNS at `127.0.0.11`. `nslookup admin` returned:
```
Name: admin Address: 10.0.1.36 (current task IP)
Name: admin Address: 10.0.1.17 (GHOST — what is this?)
```
Two A records for a one-replica service, and the resolver handed out
either one at random. We checked `10.0.1.17`: that IP now belonged to
the **dozzle** container on hetzner3. Nothing in that container listens
on port 3000, so the connection was refused and Caddy returned 502.
The old admin task had run on hetzner3 with IP 10.0.1.17. When it
migrated to hetzner1 with IP 10.0.1.36, libnetwork's DNS registration
for admin was supposed to update. On hetzner2 and hetzner3, the old
10.0.1.17 record never got removed.
**Things tried, none worked**:
| Attempt | Result |
|---|---|
| `endpoint_mode: dnsrr` on admin | DNS still returns both IPs |
| Kill + restart Caddy container | DNS still returns both IPs |
| Scale admin to 0 and back to 1 | Ghost 10.0.1.17 still in DNS with 0 replicas |
| `docker service rm honeydue_admin` | Ghost 10.0.1.17 still in DNS (orphaned) |
| Change admin to `mode: global` | Different IPs but ghost remains |
| `mode: host` on admin ports + `extra_hosts: host.docker.internal:host-gateway` | `host.docker.internal` resolved to docker0 (172.17.0.1), not reachable from overlay |
| Hardcoded 3 node IPs in Caddy + UFW port 3000 node-to-node | ~90% reliable, NAT hairpin issues when Caddy dials its own node |
**Root cause**: [moby/moby#52265][moby-52265] — Docker libnetwork's
overlay network state store doesn't reliably deregister service
endpoints when tasks migrate between nodes. Known bug in the 29.x
line. Partial fixes in #50236 (29.0) were incomplete; 29.3 still
leaks; #52289 is the pending follow-up.
**Why single-replica services suffer most**: With 3 replicas, Caddy's
DNS query returns 4 IPs (3 real + 1 ghost), so round-robin succeeds 75%
of the time. With 1 replica, 1 real + 1 ghost means a 50% failure rate.
More replicas dilute the bug; they do not remove it.
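The same arithmetic works for any replica count. This is a minimal Go sketch, assuming the resolver picks uniformly among all returned A records and that exactly one ghost record exists (both are simplifications). `failureRate` is a hypothetical helper, not part of the codebase.

```go
package main

import "fmt"

// failureRate returns the fraction of requests that land on a ghost
// record, assuming a uniform pick among all returned A records.
func failureRate(realIPs, ghostIPs int) float64 {
	total := realIPs + ghostIPs
	if total == 0 {
		return 0
	}
	return float64(ghostIPs) / float64(total)
}

func main() {
	for _, replicas := range []int{1, 3, 9} {
		rate := failureRate(replicas, 1)
		fmt.Printf("replicas=%d ghost=1 failure=%.0f%%\n", replicas, 100*rate)
	}
}
```

Even at nine replicas, one request in ten would still hit the ghost, which is why scaling up was never a real fix.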
**Final fix**: None at the libnetwork level. The ghost survives every
non-cluster-recreating operation. The only clean purge is
`docker stack rm` + `docker network rm` + full redeploy. Even then,
the bug recurs on the next task migration.
**Decision**: Migrate to k3s. Service discovery through CoreDNS does
not depend on libnetwork's state store, so this class of bug does not
arise there. After roughly 4 hours of fighting Swarm, the k3s migration
took about 1 hour.
**Citations**:
- [moby/moby#52265 — Overlay ARP stale entries on 29.3.0][moby-52265]
- [moby/moby#51491 — DNS broken after swarm init][moby-51491]
- [Dokploy#3480 — Traefik stale VIP on Swarm][dokploy-3480]
### 11 — IPSec ESP + UDP 500 blocked
**Symptom**: Earlier in the Swarm setup, api 3/3 was working but
cross-node overlay traffic was intermittently failing. This turned out
to be a separate bug masking #10 earlier in the session.
**Root cause**: We had encrypted overlay enabled
(`driver_opts: encrypted: "true"`). Swarm's encrypted mode uses IPSec
ESP (IP protocol 50) + UDP 500 (IKE). Our UFW only allowed UDP 4789
(VXLAN) and 7946 (gossip). ESP was blocked by default-deny. Encrypted
packets dropped silently on some flows.
**Fix**: Added UFW rules for each peer node IP:
```bash
sudo ufw allow from <peer> to any proto esp
sudo ufw allow from <peer> to any port 500 proto udp
```
Once applied, cross-node overlay data path became stable.
**Moral**: Encrypted Swarm overlay requires more than VXLAN to be open.
ESP (protocol 50) and UDP 500 (IKE) for IPSec. Official Docker docs
mention this but it's easy to miss.
### 12 — Admin startupProbe path
**Symptom**: Admin pod kept restarting with startup probe failures.
Kubelet reported:
```
Startup probe failed: HTTP probe failed with statuscode: 404
```
**Root cause**: The k3s scaffold's `admin/deployment.yaml` had:
```yaml
startupProbe:
  httpGet:
    path: /admin/
    port: 3000
```
But our admin Next.js app serves at `/`, not `/admin/`. Requests to
`/admin/` return 404, so the startup probe never passed and the kubelet
kept restarting the container.
**Fix**: Change probe path to `/`. Also bumped `failureThreshold` from
12 to 24 (120s grace) for Next.js's slower-than-expected cold boot
when the node's already busy.
**Moral**: Copy-pasted scaffolds can have assumptions that don't match
your app. Always verify probes against actual reachable paths.
### 13 — MigrateWithLock startup probe grace
**Symptom**: API pods were getting killed by k8s during migration.
First replica was OK (fast migration); replicas 2 and 3 waited on
the advisory lock too long and healthchecks tripped.
**Root cause**: Go app's `MigrateWithLock()` uses
`pg_advisory_lock()` to serialize migrations across replicas. First
replica does real AutoMigrate (~90s cold); subsequent replicas wait
on the lock, then run no-op migrations. Total time for 3rd replica
can be 3+ minutes.
K3s scaffold's `api/deployment.yaml` had:
```yaml
startupProbe:
  failureThreshold: 12
  periodSeconds: 5
```
That is 12 × 5 s = 60 s of grace, which is not enough.
**Fix**: Bumped `failureThreshold` to 48 (= 240s grace). Comment in
the manifest explains why. This is *not* a band-aid — the real startup
time genuinely is 90-240s depending on lock queue position. The probe
should reflect reality, not be optimistic.
**Moral**: Healthchecks should be realistic, not aspirational. Know
what your app actually does at startup.
## What we learned
### Docker Swarm is in a bad place in 2026
Not dead — Mirantis supports it through 2030 — but **nobody is
modernizing libnetwork**. When you hit a DNS or networking bug, you're
on your own. The fix churn on #52265 (incomplete 29.0 fix → 29.3
regression → pending #52289) is a tell: the code has no champion.
For new deployments, **don't pick Swarm** unless you're doing something
Swarm-shaped (tiny, single-replica, no inter-service traffic). K3s is
a strictly better choice for anything approximating what we're doing.
### Investigate before you work around
We spent a lot of time on clever workarounds for bug #10 (host-mode
ports, host.docker.internal, hardcoded node IPs, UFW routing) before
doing the 20-minute research task that revealed the bug was a known
libnetwork defect. If we'd searched "Swarm DNS stale record 2026" first,
we'd have saved ~3 hours.
### Scaffolds are starting points, not finishing points
The k3s scaffold in `deploy-k3s/` was excellent — production-grade
RBAC, PDBs, security contexts, network policies, Traefik middleware.
But its image references (GHCR), TLS assumptions (CF Full strict), and
probe paths (admin's `/admin/`) didn't match our actual setup. Every
scaffold needs a read-through against your environment before you
`kubectl apply -f`.
### Keep the old config until the new config is proven
We kept `deploy/` (Swarm) intact during the k3s migration. That meant
if k3s failed, we could `git stash` the k3s work and do a fast Swarm
redeploy. `deploy/` is still in the repo and is only scheduled for
removal once we are confident in k3s.
## Files affected by tonight's work
All in `honeyDueAPI-go`:
- `Dockerfile` — Go 1.24 → 1.25 (bug #2)
- `deploy/scripts/deploy_prod.sh` — buildx refactor, array expansion fixes (bugs #1, #3)
- `deploy/swarm-stack.prod.yml` — dozzle host_ip, secret source references, multiple iterations trying to fix #10
- `deploy/prod.env` — admin seed env vars, DB_POSTGRES_DB case, B2 values, push-disabled placeholders (bug #9)
- `deploy/cluster.env` — WORKER_REPLICAS 2 → 1, PUSH_LATEST_TAG (bugs #7, #8)
- `deploy/Caddyfile` — multiple iterations (ultimately deleted when we moved to k3s)
- `internal/services/cache_service.go` — removed sync.Once reset (bug #6)
- `internal/database/database.go` — (no change, MigrateWithLock semantics investigated)
- `deploy-k3s/manifests/api/deployment.yaml` — startupProbe grace (bug #13)
- `deploy-k3s/manifests/admin/deployment.yaml` — probe path (bug #12)
- `deploy-k3s/manifests/worker/deployment.yaml` — replicas 2 → 1
- `deploy-k3s/manifests/pod-disruption-budgets.yaml` — worker minAvailable 1 → 0
- `deploy-k3s/manifests/traefik-helmchartconfig.yaml` — NEW (DaemonSet + hostNetwork for Traefik)
- `deploy-k3s/manifests/ingress/ingress-simple.yaml` — NEW (simple host routing, no TLS)
- `deploy-k3s/MIGRATION_NOTES.md` — NEW
## What was thrown away
- Swarm stack definitions (still in `deploy/`, planned for removal)
- Caddy Caddyfile (k3s uses Traefik instead)
- Several hours of work on Caddy `dynamic a` upstream refresh, host-
mode ports, and NAT-hairpin workarounds for bug #10 — all moot
once we migrated
## References
- [moby/moby#52265 — Overlay ARP stale entries][moby-52265]
- [moby/moby#51491 — DNS broken after swarm init][moby-51491]
- [Dokploy#3480 — Traefik stale VIP][dokploy-3480]
- [Mirantis Swarm LTS commitment][mirantis-swarm]
- [Kubernetes probe best practices][k8s-probes]
- [Asynq scheduler limitations][asynq-sched]
[moby-52265]: https://github.com/moby/moby/issues/52265
[moby-51491]: https://github.com/moby/moby/issues/51491
[dokploy-3480]: https://github.com/Dokploy/dokploy/issues/3480
[mirantis-swarm]: https://www.mirantis.com/blog/mirantis-guarantees-long-term-support-for-swarm/
[k8s-probes]: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
[asynq-sched]: https://github.com/hibiken/asynq/wiki/Periodic-Tasks
[swarm-compose]: https://docs.docker.com/reference/compose-file/legacy-versions/