# 19 — Postmortem: The Swarm Era

## Summary

honeyDue launched on Docker Swarm on 2026-04-23. Over the course of a single overnight session we hit **thirteen distinct bugs**: eleven on the way to declaring Swarm unfit, and two more while migrating to k3s. This chapter is the forensic record: the symptom of each bug, the root cause, the specific fix, and citations where relevant. It's preserved because these lessons are expensive and future-us should not pay them again.

**TL;DR**: Twelve of the thirteen bugs were recoverable. The exception (bug #10) was a Docker libnetwork ghost-DNS defect ([moby/moby#52265][moby-52265]) that is fundamentally incompatible with single-replica services. No amount of clever config fixed it; we had to change orchestrators.

## Timeline

- **~18:00** — Infrastructure stood up. Docker Swarm initialized. First build + push to Gitea.
- **~19:30** — First deploy runs. Immediate failures.
- **~22:00** — api + admin returning 200 through Cloudflare. Flaky but working.
- **~23:00** — Admin flapping 50%+ through Cloudflare. Ghost DNS record identified. Workarounds begin.
- **~00:30 (next day)** — Ghost DNS survives every non-nuclear intervention. Research confirms it's a known libnetwork bug. Decision to migrate to k3s.
- **~04:30** — k3s cluster up, all services healthy, 150/150 requests green. Postmortem begins.

The session ran ~10 hours. The migration itself took ~1 hour.

## The thirteen bugs

### 1 — Deploy script array expansion under `set -u`

**File**: `deploy/scripts/deploy_prod.sh`

**Symptom**:

```
./deploy/scripts/deploy_prod.sh: line 339: api_extra[@]: unbound variable
```

**Root cause**: Bash arrays expanded with `"${arr[@]}"` under `set -u` fail when the array is empty. This is the behavior of bash releases before 4.4, which includes the bash 3.2 that ships with macOS. Our deploy script initialized empty arrays conditionally but expanded them unconditionally.

**Fix**: Use the `${arr[@]+"${arr[@]}"}` safe-expansion idiom, or restructure to avoid passing empty arrays:

```bash
build_and_push api "${API_IMAGE}" ${api_extra[@]+"${api_extra[@]}"}
```

Inside the function, same treatment — use `shift` instead of array slicing.

**Moral**: `set -u` with bash arrays is a known pitfall. The `"${arr[@]}"` expansion isn't safe under strict mode if arrays can be empty.

### 2 — Dockerfile Go version mismatch

**File**: `Dockerfile`

**Symptom**:

```
go: go.mod requires go >= 1.25 (running go 1.24.13; GOTOOLCHAIN=local)
ERROR: failed to build: failed to solve: process "/bin/sh -c go mod download" did not complete successfully: exit code: 1
```

**Root cause**: `go.mod` specifies `go 1.25`, but the Dockerfile's builder stage used `golang:1.24-alpine`.

**Fix**: Bumped to `golang:1.25-alpine`. One-character change.

**Moral**: Keep the Dockerfile base image in sync with `go.mod`'s go directive. CI would catch this; we had none.

### 3 — Dev machine arm64 vs node amd64

**Symptom**: Would have been `exec format error` on the nodes if we'd deployed without fixing. Caught at build config stage.

**Root cause**: Operator on Apple Silicon (arm64). Hetzner nodes are amd64. Plain `docker build` produces arm64 images.

**Fix**: Switched deploy script to use `docker buildx build --platform linux/amd64 --push`. This cross-compiles the Go stages (they honor `TARGETARCH`) and uses QEMU emulation for the Node stages.

**Moral**: Cross-platform builds are routine for Apple Silicon developers. Document it up front, bake it into the deploy script.

### 4 — Swarm stack `host_ip` rejected

**File**: `deploy/swarm-stack.prod.yml` (dozzle service)

**Symptom**:

```
services.dozzle.ports.0 Additional property host_ip is not allowed
```

**Root cause**: Docker Compose v3.8 schema allows `host_ip` in long-form port spec. Swarm's `docker stack deploy` parser doesn't.

**Fix**: Use the short form:

```yaml
ports:
  - "127.0.0.1:${DOZZLE_PORT}:8080"
```

But then: Swarm's ingress mesh mode silently ignores the `127.0.0.1` binding and listens on `0.0.0.0` anyway. The only way to get true loopback-only binding is `mode: host`, which changes port-publishing semantics.

**Moral**: Compose-file compatibility between plain Docker and Swarm is imperfect. Check the [Swarm-specific compose reference][swarm-compose] when in doubt.

### 5 — Stack file secret references

**Symptom**:

```
service worker: undefined secret "honeydue_postgres_password_237c6b8-20260423195810"
```

**Root cause**: The original stack file template used `source: ${POSTGRES_PASSWORD_SECRET}` under each service's `secrets:` list, which expanded to the versioned secret name (the `honeydue_postgres_password_237c6b8-20260423195810` in the error above). Swarm expects `source:` to match the **alias** in the top-level `secrets:` block (`postgres_password`), not the actual secret `name:`.

**Fix**: Changed every `source:` to the alias form:

```yaml
# Was:
- source: ${POSTGRES_PASSWORD_SECRET}
  target: postgres_password

# Now:
- source: postgres_password
  target: postgres_password
```

**Moral**: The original template was clever but subtly wrong. It had never successfully deployed — the earlier Dokku setup used a different secret model. Bugs in template code stay hidden until the first real deploy exercises them.

### 6 — API pod crash: `sync.Once` double-unlock

**File**: `internal/services/cache_service.go:54`

**Symptom**: api pods completed migrations, started the HTTP server, then died with:

```
fatal error: sync: unlock of unlocked mutex

goroutine 1 [running]:
internal/sync.fatal(...)
sync.(*Once).doSlow(...)
github.com/treytartt/honeydue-api/internal/services.NewCacheService
    /app/internal/services/cache_service.go:31
```

**Root cause**: Inside a `sync.Once.Do(func() { ... })` callback, the code did:

```go
cacheOnce.Do(func() {
    // ...
    if err := client.Ping(ctx).Err(); err != nil {
        initErr = fmt.Errorf(...)
        cacheOnce = sync.Once{} // ← THIS LINE
        return
    }
})
```

The intent: "if Redis ping fails, reset the Once so a retry can happen." The reality: the Once's internal mutex is held while `Do` is running the callback. Reassigning `cacheOnce = sync.Once{}` overwrites the Once in place with a zero value, which also overwrites its locked mutex with an unlocked one. When `Do` tries to release the mutex afterward, it is unlocking a mutex that is no longer locked. The runtime reports this as a fatal error, which, unlike a panic, cannot be recovered.

**Fix**: Removed the reset. `main.go` already handles the error gracefully (`cache = nil`, continues without caching). Retries happen via pod restart, not in-process.

```go
if err := client.Ping(ctx).Err(); err != nil {
    initErr = fmt.Errorf(...)
    // Don't reassign cacheOnce here — mutating it from inside Do()
    // is a fatal error. Let main.go handle the error.
    return
}
```

**Moral**: `sync.Once` is simpler than it looks. Never reassign an active sync primitive from within its own callback.

### 7 — Stack file `maxUnavailable: 2` warning for worker

**Symptom**: We noticed `WORKER_REPLICAS=2` in `cluster.env` despite the Asynq scheduler being a singleton.

**Root cause**: Asynq's `Scheduler` is not leader-elected by default. Running >1 replica causes duplicate cron firings — duplicate daily digests, double-welcome emails.

**Fix**: `WORKER_REPLICAS=1`. Added a comment in `cluster.env.example` explaining why.

**Moral**: Defaults can be dangerous. Even when a default seems reasonable ("2 replicas for HA"), check against the app's semantics.

### 8 — `PUSH_LATEST_TAG=true` for prod

**Symptom**: During a test, we saw `honeydue-api:latest` updating, which would make rollbacks harder.

**Root cause**: The `cluster.env` had `PUSH_LATEST_TAG=true` when the design intent was SHA-pinned deploys only.

**Fix**: `PUSH_LATEST_TAG=false`. SHA tags only.

**Moral**: Tag-mutable images make rollbacks non-deterministic. Prefer immutable SHA tags.

### 9 — Neon DB name case sensitivity

**Symptom**:

```
server error: ERROR: database "honeydue" does not exist (SQLSTATE 3D000)
```

**Root cause**: Neon's UI created the database as `"honeyDue"` (quoted, camelCase). Postgres treats quoted identifiers case-sensitively at create time. Our `prod.env` had `POSTGRES_DB=honeydue` (lowercase).

**Fix**: `POSTGRES_DB=honeyDue`.

**Moral**: Respect Postgres's identifier quoting rules. If something was created with quotes, refer to it with exact case.

### 10 — Admin DNS ghost A-record (the big one)

**Symptom**: Through Cloudflare, `admin.myhoneydue.com` returned 502 on ~50% of requests. The other 50% succeeded. The pattern was stable over hours.

**Investigation**: The admin service had 1 replica, alive on one of three Swarm nodes. Caddy (reverse proxy at the time) resolved `admin` via Swarm's embedded DNS at `127.0.0.11`. `nslookup admin` returned:

```
Name:    admin
Address: 10.0.1.36   (current task IP)
Name:    admin
Address: 10.0.1.17   (GHOST — what is this?)
```

Two A records for a one-replica service, returned in random order. We checked `10.0.1.17`: that IP now belonged to the **dozzle** container on hetzner3. Nothing listens on port 3000 in the dozzle container → connection refused → 502.

The old admin task had run on hetzner3 with IP 10.0.1.17. When it migrated to hetzner1 with IP 10.0.1.36, libnetwork's DNS registration for admin was supposed to update. On hetzner2 and hetzner3, the old 10.0.1.17 record never got removed.

**Things tried, none worked**:

| Attempt | Result |
|---|---|
| `endpoint_mode: dnsrr` on admin | DNS still returns both IPs |
| Kill + restart Caddy container | DNS still returns both IPs |
| Scale admin to 0 and back to 1 | Ghost 10.0.1.17 still in DNS with 0 replicas |
| `docker service rm honeydue_admin` | Ghost 10.0.1.17 still in DNS (orphaned) |
| Change admin to `mode: global` | Different IPs but ghost remains |
| `mode: host` on admin ports + `extra_hosts: host.docker.internal:host-gateway` | `host.docker.internal` resolved to docker0 (172.17.0.1), not reachable from overlay |
| Hardcoded 3 node IPs in Caddy + UFW port 3000 node-to-node | ~90% reliable, NAT hairpin issues when Caddy dials its own node |

**Root cause**: [moby/moby#52265][moby-52265] — Docker libnetwork's overlay network state store doesn't reliably deregister service endpoints when tasks migrate between nodes. Known bug in the 29.x line. Partial fixes in #50236 (29.0) were incomplete; 29.3 still leaks; #52289 is the pending follow-up.

**Why it hurts single-replica services most**: With 3 replicas, Caddy's DNS query returns 4 IPs (3 real + 1 ghost). Round-robin succeeds 75% of the time. With 1 replica, 1 real + 1 ghost = 50% failure. More replicas do not remove the bug; they only mask it.

**Final fix**: None at the libnetwork level. The ghost survives every non-cluster-recreating operation. The only clean purge is `docker stack rm` + `docker network rm` + full redeploy. Even then, the bug recurs on the next task migration.

**Decision**: Migrate to k3s. CoreDNS has none of libnetwork's state-store semantics and the bug class doesn't exist. 4 hours of fighting Swarm → 1-hour k3s migration that just worked.

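The replica arithmetic for bug #10 generalizes to a one-line formula: if the resolver picks uniformly among all returned records, the success rate is live records divided by total records. The sketch below assumes that uniform choice (real proxies may retry or cache, which would change the numbers); `successRate` is a hypothetical helper name.

```go
package main

import "fmt"

// successRate returns the fraction of requests that reach a live task
// when the caller picks uniformly among live and ghost DNS records.
func successRate(live, ghosts int) float64 {
	total := live + ghosts
	if total == 0 {
		return 0
	}
	return float64(live) / float64(total)
}

func main() {
	// One ghost record, varying replica counts.
	for _, replicas := range []int{1, 3, 10} {
		fmt.Printf("replicas=%d ghosts=1 success=%.0f%%\n",
			replicas, 100*successRate(replicas, 1))
	}
}
```

With one ghost, a single replica succeeds half the time and three replicas succeed three quarters of the time, which matches what we observed. The rate approaches but never reaches 100% as replicas grow, which is why scaling up hides the defect rather than curing it.
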
**Citations**:

- [moby/moby#52265 — Overlay ARP stale entries on 29.3.0][moby-52265]
- [moby/moby#51491 — DNS broken after swarm init][moby-51491]
- [Dokploy#3480 — Traefik stale VIP on Swarm][dokploy-3480]

### 11 — IPSec ESP + UDP 500 blocked

**Symptom**: Earlier in the Swarm setup, api 3/3 was working but cross-node overlay traffic was intermittently failing. This turned out to be a separate bug masking #10 earlier in the session.

**Root cause**: We had encrypted overlay enabled (`driver_opts: encrypted: "true"`). Swarm's encrypted mode uses IPSec ESP (IP protocol 50) + UDP 500 (IKE). Our UFW only allowed UDP 4789 (VXLAN) and 7946 (gossip). ESP was blocked by default-deny. Encrypted packets dropped silently on some flows.

**Fix**: Added UFW rules for each peer node IP (substitute each peer's address for the `<peer-node-ip>` placeholder):

```bash
sudo ufw allow from <peer-node-ip> to any proto esp
sudo ufw allow from <peer-node-ip> to any port 500 proto udp
```

Once applied, the cross-node overlay data path became stable.

**Moral**: Encrypted Swarm overlay requires more than VXLAN to be open: ESP (protocol 50) and UDP 500 (IKE) for IPSec. Official Docker docs mention this but it's easy to miss.

### 12 — Admin startupProbe path

**Symptom**: Admin pod kept restarting with startup probe failures. Kubelet reported:

```
Startup probe failed: HTTP probe failed with statuscode: 404
```

**Root cause**: The k3s scaffold's `admin/deployment.yaml` had:

```yaml
startupProbe:
  httpGet:
    path: /admin/
    port: 3000
```

But our admin Next.js app serves at `/`, not `/admin/`. Requests to `/admin/` return 404. K8s considered the pod unhealthy and restart-looped.

**Fix**: Change probe path to `/`. Also bumped `failureThreshold` from 12 to 24 (120s grace) for Next.js's slower-than-expected cold boot when the node's already busy.

**Moral**: Copy-pasted scaffolds can have assumptions that don't match your app. Always verify probes against actual reachable paths.

### 13 — MigrateWithLock startup probe grace

**Symptom**: API pods were getting killed by k8s during migration.
First replica was OK (fast migration); replicas 2 and 3 waited on the advisory lock too long and healthchecks tripped.

**Root cause**: Go app's `MigrateWithLock()` uses `pg_advisory_lock()` to serialize migrations across replicas. First replica does real AutoMigrate (~90s cold); subsequent replicas wait on the lock, then run no-op migrations. Total time for 3rd replica can be 3+ minutes. K3s scaffold's `api/deployment.yaml` had:

```yaml
startupProbe:
  failureThreshold: 12
  periodSeconds: 5
```

That is 12 × 5s = 60s of grace. Not enough.

**Fix**: Bumped `failureThreshold` to 48 (= 240s grace). Comment in the manifest explains why. This is *not* a band-aid — the real startup time genuinely is 90-240s depending on lock queue position. The probe should reflect reality, not be optimistic.

**Moral**: Healthchecks should be realistic, not aspirational. Know what your app actually does at startup.

#### Postscript (2026-04-26): the whole `MigrateWithLock` shape was wrong

Some time after the Swarm migration, switching `DB_HOST` to Neon's `-pooler` endpoint for runtime perf wins broke this code completely: `pg_advisory_lock` is session-scoped, but PgBouncer transaction-mode multiplexes statements across backend Postgres sessions, so the lock appeared to be held but actually wasn't. Pods hung at "Acquiring migration advisory lock..." and the startup probe killed them in turn.

After a brief band-aid (route migrations through the direct endpoint; bump probe to 600s to absorb 5-minute AutoMigrate runs over the slow direct connection — both reverted), we abandoned the runtime-side migration story entirely and adopted [pressly/goose](https://github.com/pressly/goose) in commit `12b2f9d`:

- Migrations run as a one-shot Kubernetes Job before any api/worker pod rolls. No more in-replica migration, no more advisory lock, no more startup probe gymnastics.
- `RequireSchemaApplied` checks `goose_db_version` at startup and refuses to boot on a stale schema — fail-fast for "operator forgot to run migrate," instead of mysterious runtime errors.
- `failureThreshold` reverted to its pre-MigrateWithLock value. Pods boot in seconds again.

See [Chapter 8 §Schema management](./08-database.md) for the goose shape. This entire sub-section is preserved as historical context for why we walked the path we did.

## What we learned

### Docker Swarm is in a bad place in 2026

Not dead — Mirantis supports it through 2030 — but **nobody is modernizing libnetwork**. When you hit a DNS or networking bug, you're on your own. The fix churn on #52265 (incomplete 29.0 fix → 29.3 regression → pending #52289) is a tell: the code has no champion.

For new deployments, **don't pick Swarm** unless you're doing something Swarm-shaped (tiny, single-replica, no inter-service traffic). K3s is a strictly better choice for anything approximating what we're doing.

### Investigate before you work around

We spent a lot of time on clever workarounds for bug #10 (host-mode ports, host.docker.internal, hardcoded node IPs, UFW routing) before doing the 20-minute research task that revealed the bug was a known libnetwork defect. If we'd searched "Swarm DNS stale record 2026" first, we'd have saved ~3 hours.

### Scaffolds are starting points, not finishing points

The k3s scaffold in `deploy-k3s/` was excellent — production-grade RBAC, PDBs, security contexts, network policies, Traefik middleware. But its image references (GHCR), TLS assumptions (CF Full strict), and probe paths (admin's `/admin/`) didn't match our actual setup. Every scaffold needs a read-through against your environment before you `kubectl apply -f`.

### Keep the old config until the new config is proven

We kept `deploy/` (Swarm) intact during the k3s migration. That meant if k3s failed, we could `git stash` the k3s work and do a fast Swarm redeploy.
It took ~4 days before we deleted `deploy/`, by which point we were confident.

## Files affected by tonight's work

All in `honeyDueAPI-go`:

- `Dockerfile` — Go 1.24 → 1.25 (bug #2)
- `deploy/scripts/deploy_prod.sh` — buildx refactor, array expansion fixes (bugs #1, #3)
- `deploy/swarm-stack.prod.yml` — dozzle host_ip, secret source references, multiple iterations trying to fix #10
- `deploy/prod.env` — admin seed env vars, `POSTGRES_DB` case, B2 values, push-disabled placeholders (bug #9)
- `deploy/cluster.env` — WORKER_REPLICAS 2 → 1, PUSH_LATEST_TAG (bugs #7, #8)
- `deploy/Caddyfile` — multiple iterations (ultimately deleted when we moved to k3s)
- `internal/services/cache_service.go` — removed sync.Once reset (bug #6)
- `internal/database/database.go` — (no change, MigrateWithLock semantics investigated)
- `deploy-k3s/manifests/api/deployment.yaml` — startupProbe grace (bug #13)
- `deploy-k3s/manifests/admin/deployment.yaml` — probe path (bug #12)
- `deploy-k3s/manifests/worker/deployment.yaml` — replicas 2 → 1
- `deploy-k3s/manifests/pod-disruption-budgets.yaml` — worker minAvailable 1 → 0
- `deploy-k3s/manifests/traefik-helmchartconfig.yaml` — NEW (DaemonSet + hostNetwork for Traefik)
- `deploy-k3s/manifests/ingress/ingress-simple.yaml` — NEW (simple host routing, no TLS)
- `deploy-k3s/MIGRATION_NOTES.md` — NEW

## What was thrown away

- Swarm stack definitions (still in `deploy/`, planned for removal)
- Caddy Caddyfile (k3s uses Traefik instead)
- Several hours of work on Caddy `dynamic a` upstream refresh, host-mode ports, and NAT-hairpin workarounds for bug #10 — all moot once we migrated

## References

- [moby/moby#52265 — Overlay ARP stale entries][moby-52265]
- [moby/moby#51491 — DNS broken after swarm init][moby-51491]
- [Dokploy#3480 — Traefik stale VIP][dokploy-3480]
- [Mirantis Swarm LTS commitment][mirantis-swarm]
- [Kubernetes probe best practices][k8s-probes]
- [Asynq scheduler limitations][asynq-sched]

[moby-52265]: https://github.com/moby/moby/issues/52265
[moby-51491]: https://github.com/moby/moby/issues/51491
[dokploy-3480]: https://github.com/Dokploy/dokploy/issues/3480
[mirantis-swarm]: https://www.mirantis.com/blog/mirantis-guarantees-long-term-support-for-swarm/
[k8s-probes]: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
[asynq-sched]: https://github.com/hibiken/asynq/wiki/Periodic-Tasks
[swarm-compose]: https://docs.docker.com/reference/compose-file/legacy-versions/