honeyDueAPI/docs/deployment/19-postmortem-swarm.md
Trey t 6f303dbbaa
Migrate prod deploy from Swarm to K3s; add full deployment book
Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
  callback (was causing 'unlock of unlocked mutex' fatal after
  Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
  + allowlist fonts.googleapis.com so the marketing landing page CSS
  actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
  --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
  images runnable on x86_64 Hetzner nodes; fix array expansion under
  set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
  top-level aliases (the '\${X_SECRET}' form never actually resolved);
  dozzle ports: long-form host_ip is rejected by Swarm, switched to
  short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
  worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
  (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
  12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
  real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
  and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
  hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
  without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations,
    consolidated citations
- README.md: Production Deployment section replaced with pointer to
  the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 07:20:54 -05:00

19 — Postmortem: The Swarm Era

Summary

honeyDue launched on Docker Swarm on 2026-04-23. Over a single session of roughly ten hours we hit thirteen distinct bugs: eleven during the Swarm deploy and two more (#12, #13) in the k3s manifests after we declared Swarm unfit and migrated. This chapter is the forensic record: the symptom of each bug, the root cause, the specific fix, and citations where relevant. It is preserved because these lessons were expensive and future-us should not pay for them again.

TL;DR: Twelve of the thirteen bugs were recoverable. The exception (#10 below) was a Docker libnetwork ghost-DNS defect (moby/moby#52265) that is fundamentally incompatible with single-replica services. No amount of clever config fixed it; we had to change orchestrators.

Timeline

~18:00 — Infrastructure stood up. Docker Swarm initialized. First build + push to Gitea.

~19:30 — First deploy runs. Immediate failures.

~22:00 — api + admin returning 200 through Cloudflare. Flaky but working.

~23:00 — Admin flapping 50%+ through Cloudflare. Ghost DNS record identified. Workarounds begin.

~00:30 (next day) — Ghost DNS survives every non-nuclear intervention. Research confirms it's a known libnetwork bug. Decision to migrate to k3s.

~04:30 — k3s cluster up, all services healthy, 150/150 requests green. Postmortem begins.

The session ran ~10 hours. The migration itself took ~1 hour.

The thirteen bugs

1 — Deploy script array expansion under set -u

File: deploy/scripts/deploy_prod.sh

Symptom:

./deploy/scripts/deploy_prod.sh: line 339: api_extra[@]: unbound variable

Root cause: Bash arrays expanded with "${arr[@]}" under set -u fail when the array is empty. This is the behavior in Bash before 4.4, which includes the 3.2 that ships with macOS. Our deploy script initialized empty arrays conditionally but expanded them unconditionally.

Fix: Use the ${arr[@]+"${arr[@]}"} safe-expansion idiom, or restructure to avoid passing empty arrays:

build_and_push api "${API_IMAGE}" ${api_extra[@]+"${api_extra[@]}"}

Inside the function, same treatment — use shift instead of array slicing.

Moral: set -u with bash arrays is a known pitfall. The "${arr[@]}" expansion isn't safe under strict mode if arrays can be empty.

2 — Dockerfile Go version mismatch

File: Dockerfile

Symptom:

go: go.mod requires go >= 1.25 (running go 1.24.13; GOTOOLCHAIN=local)
ERROR: failed to build: failed to solve: process "/bin/sh -c go mod download" did not complete successfully: exit code: 1

Root cause: go.mod specifies go 1.25, but the Dockerfile's builder stage used golang:1.24-alpine.

Fix: Bumped to golang:1.25-alpine. One-character change.

Moral: Keep the Dockerfile base image in sync with go.mod's go directive. CI would catch this; we had none.
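
A check like the one this moral asks for can be small. The sketch below is an illustration, not part of the honeyDue repository: the function name checkGoVersions and both regular expressions are assumptions, and it only compares the major.minor version of the go directive against the first golang: base image it finds. It demands an exact match, which is stricter than Go requires (a newer image would also build).

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Matches the `go 1.25` or `go 1.25.3` directive in go.mod.
	goDirective = regexp.MustCompile(`(?m)^go\s+(\d+\.\d+)`)
	// Matches `FROM golang:1.25-alpine`, with optional flags such as --platform=...
	golangImage = regexp.MustCompile(`(?mi)^FROM\s+(?:--\S+\s+)*golang:(\d+\.\d+)`)
)

// checkGoVersions (hypothetical helper) returns an error when the Dockerfile's
// golang image version differs from the go directive in go.mod.
func checkGoVersions(goMod, dockerfile string) error {
	mod := goDirective.FindStringSubmatch(goMod)
	img := golangImage.FindStringSubmatch(dockerfile)
	if mod == nil || img == nil {
		return fmt.Errorf("could not find a go directive or a golang base image")
	}
	if mod[1] != img[1] {
		return fmt.Errorf("go.mod wants go %s but Dockerfile builds with golang:%s", mod[1], img[1])
	}
	return nil
}

func main() {
	goMod := "module example.com/app\n\ngo 1.25\n"
	// The mismatch from bug #2, then the fixed Dockerfile.
	fmt.Println(checkGoVersions(goMod, "FROM golang:1.24-alpine AS builder\n"))
	fmt.Println(checkGoVersions(goMod, "FROM golang:1.25-alpine AS builder\n"))
}
```

In CI the two strings would come from reading go.mod and Dockerfile; a non-nil error fails the job before any image is built.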

3 — Dev machine arm64 vs node amd64

Symptom: None in production. Deploying the unfixed images would have produced exec format error on the nodes; we caught it at the build-config stage.

Root cause: Operator on Apple Silicon (arm64). Hetzner nodes are amd64. Plain docker build produces arm64 images.

Fix: Switched deploy script to use docker buildx build --platform linux/amd64 --push. This cross-compiles the Go stages (they honor TARGETARCH) and uses QEMU emulation for the Node stages.

Moral: Cross-platform builds are routine for Apple Silicon developers. Document it up front, bake it into the deploy script.

4 — Swarm stack host_ip rejected

File: deploy/swarm-stack.prod.yml (dozzle service)

Symptom:

services.dozzle.ports.0 Additional property host_ip is not allowed

Root cause: Docker Compose v3.8 schema allows host_ip in long-form port spec. Swarm's docker stack deploy parser doesn't.

Fix: Use the short form:

ports:
  - "127.0.0.1:${DOZZLE_PORT}:8080"

But then: Swarm's ingress mesh mode silently ignores the 127.0.0.1 binding and listens on 0.0.0.0 anyway. Only way to get true loopback-only binding is mode: host, which changes port-publishing semantics.

Moral: Compose-file compatibility between plain Docker and Swarm is imperfect. Check the Swarm-specific compose reference when in doubt.

5 — Stack file secret references

Symptom:

service worker: undefined secret "honeydue_postgres_password_237c6b8-20260423195810"

Root cause: The original stack file template used source: ${POSTGRES_PASSWORD_SECRET} (which expanded to the versioned secret name like honeydue_postgres_password_<ts>) under each service's secrets: list.

Swarm expects source: to match the key of an entry in the top-level secrets: block (postgres_password), not the versioned value of that entry's name: field.

Fix: Changed every source: to the alias form:

# Was:
- source: ${POSTGRES_PASSWORD_SECRET}
  target: postgres_password

# Now:
- source: postgres_password
  target: postgres_password

Moral: The original template was clever but subtly wrong. It had never successfully deployed, because the earlier Dokku setup used a different secret model. Bugs in template code stay hidden until the first time that code actually runs.

6 — API pod crash: sync.Once double-unlock

File: internal/services/cache_service.go:54

Symptom: api pods completed migrations, started HTTP server, then fataled with:

fatal error: sync: unlock of unlocked mutex
goroutine 1 [running]:
internal/sync.fatal(...)
sync.(*Once).doSlow(...)
github.com/treytartt/honeydue-api/internal/services.NewCacheService
  /app/internal/services/cache_service.go:31

Root cause: Inside a sync.Once.Do(func() { ... }) callback, the code did:

cacheOnce.Do(func() {
    // ...
    if err := client.Ping(ctx).Err(); err != nil {
        initErr = fmt.Errorf(...)
        cacheOnce = sync.Once{}  // ← THIS LINE
        return
    }
})

The intent: "if Redis ping fails, reset the Once so a retry can happen." The reality: the Once's internal mutex is held while Do is running the callback. Assigning cacheOnce = sync.Once{} overwrites the Once in place with a zero value, which resets that mutex to unlocked. When Do releases the mutex afterward, it is unlocking a mutex that is no longer locked. The runtime treats that as a fatal error, not a panic, so recover() cannot catch it.

Fix: Removed the reset. main.go already handles the error gracefully (cache = nil, continues without caching). Retries happen via pod restart, not in-process.

if err := client.Ping(ctx).Err(); err != nil {
    initErr = fmt.Errorf(...)
    // Don't reassign cacheOnce here — mutating it from inside Do()
    // is a fatal error. Let main.go handle the error.
    return
}

Moral: sync.Once has no reset and is not meant to have one. Never reassign a sync primitive from within its own callback.

7 — Worker replicas: 2 with a singleton Asynq scheduler

Symptom: We noticed WORKER_REPLICAS=2 in cluster.env despite the Asynq scheduler being a singleton.

Root cause: Asynq's Scheduler is not leader-elected by default. Running >1 replica causes duplicate cron firings — duplicate daily digests, double-welcome emails.

Fix: WORKER_REPLICAS=1. Added a comment in cluster.env.example explaining why.

Moral: Defaults can be dangerous. Even when a default seems reasonable ("2 replicas for HA"), check against the app's semantics.

8 — PUSH_LATEST_TAG=true for prod

Symptom: During a test, we saw honeydue-api:latest updating, which would make rollbacks harder.

Root cause: The cluster.env had PUSH_LATEST_TAG=true when the design intent was SHA-pinned deploys only.

Fix: PUSH_LATEST_TAG=false. SHA tags only.

Moral: Tag-mutable images make rollbacks non-deterministic. Prefer immutable SHA tags.

9 — Neon DB name case sensitivity

Symptom:

server error: ERROR: database "honeydue" does not exist (SQLSTATE 3D000)

Root cause: Neon's UI created the database as "honeyDue" (quoted, camelCase). Postgres folds unquoted identifiers to lowercase but preserves the case of quoted ones, and the database name in a connection is matched exactly. Our prod.env had POSTGRES_DB=honeydue (lowercase).

Fix: POSTGRES_DB=honeyDue.

Moral: Respect Postgres's identifier quoting rules. If something was created with quotes, refer to it with exact case.

10 — Admin DNS ghost A-record (the big one)

Symptom: Through Cloudflare, admin.myhoneydue.com returned 502 on ~50% of requests. The other 50% succeeded. The pattern was stable over hours.

Investigation:

The admin service had 1 replica, alive on one of three Swarm nodes. Caddy (reverse proxy at the time) resolved admin via Swarm's embedded DNS at 127.0.0.11. nslookup admin returned:

Name: admin  Address: 10.0.1.36   (current task IP)
Name: admin  Address: 10.0.1.17   (GHOST — what is this?)

Two A records for a one-replica service, returned in random order.

We checked 10.0.1.17: that IP now belonged to the dozzle container on hetzner3. Nothing in that container listens on port 3000 (admin's port) → connection refused → 502.

The old admin task had run on hetzner3 with IP 10.0.1.17. When it migrated to hetzner1 with IP 10.0.1.36, libnetwork's DNS registration for admin was supposed to update. On hetzner2 and hetzner3, the old 10.0.1.17 record never got removed.

Things tried, none worked:

| Attempt | Result |
|---|---|
| endpoint_mode: dnsrr on admin | DNS still returns both IPs |
| Kill + restart Caddy container | DNS still returns both IPs |
| Scale admin to 0 and back to 1 | Ghost 10.0.1.17 still in DNS with 0 replicas |
| docker service rm honeydue_admin | Ghost 10.0.1.17 still in DNS (orphaned) |
| Change admin to mode: global | Different IPs but ghost remains |
| mode: host on admin ports + extra_hosts: host.docker.internal:host-gateway | host.docker.internal resolved to docker0 (172.17.0.1), not reachable from overlay |
| Hardcoded 3 node IPs in Caddy + UFW port 3000 node-to-node | ~90% reliable, NAT hairpin issues when Caddy dials its own node |

Root cause: moby/moby#52265 — Docker libnetwork's overlay network state store doesn't reliably deregister service endpoints when tasks migrate between nodes. Known bug in the 29.x line. Partial fixes in #50236 (29.0) were incomplete; 29.3 still leaks; #52289 is the pending follow-up.

Why it hits single-replica services hardest: With 3 replicas, Caddy's DNS query returns 4 IPs (3 real + 1 ghost), so round-robin succeeds 75% of the time. With 1 replica, 1 real + 1 ghost means 50% of requests fail. More replicas dilute the ghost record; they do not remove it.
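
The arithmetic generalizes. The sketch below assumes the proxy picks uniformly among all returned A records and that every request sent to a ghost fails; failureRate is a hypothetical helper written for this chapter.

```go
package main

import "fmt"

// failureRate is the fraction of requests that land on a ghost record
// when the proxy picks uniformly among all returned A records.
func failureRate(replicas, ghosts int) float64 {
	total := replicas + ghosts
	if total == 0 {
		return 0 // nothing resolves, so nothing is routed to a ghost
	}
	return float64(ghosts) / float64(total)
}

func main() {
	for _, replicas := range []int{1, 3, 9} {
		fmt.Printf("%d replica(s) + 1 ghost: %.0f%% of requests fail\n",
			replicas, 100*failureRate(replicas, 1))
	}
}
```

Even at nine replicas a single ghost still costs one request in ten, so scaling out hides the bug without fixing it.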

Final fix: None at the libnetwork level. The ghost survives every non-cluster-recreating operation. The only clean purge is docker stack rm + docker network rm + full redeploy. Even then, the bug recurs on the next task migration.

Decision: Migrate to k3s. CoreDNS has none of libnetwork's state-store semantics and the bug class doesn't exist. After 4 hours of fighting Swarm, the k3s migration took about an hour and just worked.

11 — IPSec ESP + UDP 500 blocked

Symptom: Early in the Swarm setup, api showed 3/3 replicas running but cross-node overlay traffic failed intermittently. This was a separate bug that masked #10 until it was fixed.

Root cause: We had encrypted overlay enabled (driver_opts: encrypted: "true"). Swarm's encrypted mode uses IPSec ESP (IP protocol 50) + UDP 500 (IKE). Our UFW only allowed UDP 4789 (VXLAN) and 7946 (gossip). ESP was blocked by default-deny. Encrypted packets dropped silently on some flows.

Fix: Added UFW rules for each peer node IP:

sudo ufw allow from <peer> to any proto esp
sudo ufw allow from <peer> to any port 500 proto udp

Once applied, cross-node overlay data path became stable.

Moral: An encrypted Swarm overlay needs more than VXLAN open: IPSec also requires ESP (IP protocol 50) and UDP 500 (IKE). The official Docker docs mention this, but it's easy to miss.

12 — Admin startupProbe path

Symptom: Admin pod kept restarting with startup probe failures. Kubelet reported:

Startup probe failed: HTTP probe failed with statuscode: 404

Root cause: The k3s scaffold's admin/deployment.yaml had:

startupProbe:
  httpGet:
    path: /admin/
    port: 3000

But our admin Next.js app serves at /, not /admin/. Requests to /admin/ return 404, so Kubernetes considered the pod unhealthy and restart-looped it.

Fix: Change probe path to /. Also bumped failureThreshold from 12 to 24 (120s grace) for Next.js's slower-than-expected cold boot when the node's already busy.

Moral: Copy-pasted scaffolds can have assumptions that don't match your app. Always verify probes against actual reachable paths.

13 — MigrateWithLock startup probe grace

Symptom: API pods were getting killed by k8s during migration. First replica was OK (fast migration); replicas 2 and 3 waited on the advisory lock too long and healthchecks tripped.

Root cause: Go app's MigrateWithLock() uses pg_advisory_lock() to serialize migrations across replicas. First replica does real AutoMigrate (~90s cold); subsequent replicas wait on the lock, then run no-op migrations. Total time for 3rd replica can be 3+ minutes.

K3s scaffold's api/deployment.yaml had:

startupProbe:
  failureThreshold: 12
  periodSeconds: 5

With periodSeconds: 5 that is 12 × 5 s = 60 s of grace. Not enough.

Fix: Bumped failureThreshold to 48 (= 240s grace). Comment in the manifest explains why. This is not a band-aid — the real startup time genuinely is 90-240s depending on lock queue position. The probe should reflect reality, not be optimistic.

Moral: Healthchecks should be realistic, not aspirational. Know what your app actually does at startup.
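
The grace figures in bugs #12 and #13 all come from one multiplication: failureThreshold × periodSeconds. A small sketch makes the sizing explicit. The helper names are hypothetical, and it assumes initialDelaySeconds is 0 and that each probe returns promptly.

```go
package main

import "fmt"

// startupGraceSeconds is how long the kubelet keeps probing before it
// gives up on a container (initialDelaySeconds assumed to be 0).
func startupGraceSeconds(failureThreshold, periodSeconds int) int {
	return failureThreshold * periodSeconds
}

// minFailureThreshold returns the smallest threshold whose grace window
// covers worstCaseSeconds at the given probe period.
func minFailureThreshold(worstCaseSeconds, periodSeconds int) int {
	return (worstCaseSeconds + periodSeconds - 1) / periodSeconds // ceiling division
}

func main() {
	fmt.Println("scaffold grace (s):", startupGraceSeconds(12, 5))
	fmt.Println("admin grace (s):", startupGraceSeconds(24, 5))
	fmt.Println("api grace (s):", startupGraceSeconds(48, 5))
	fmt.Println("threshold for a 240 s worst case:", minFailureThreshold(240, 5))
}
```

Sizing from the measured worst case (the third replica waiting behind two lock holders) rather than the typical case is what keeps the probe from killing healthy pods.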

What we learned

Docker Swarm is in a bad place in 2026

Not dead — Mirantis supports it through 2030 — but nobody is modernizing libnetwork. When you hit a DNS or networking bug, you're on your own. The fix churn on #52265 (incomplete 29.0 fix → 29.3 regression → pending #52289) is a tell: the code has no champion.

For new deployments, don't pick Swarm unless you're doing something Swarm-shaped (tiny, single-node, no inter-service traffic). K3s is a strictly better choice for anything approximating what we're doing.

Investigate before you work around

We spent a lot of time on clever workarounds for bug #10 (host-mode ports, host.docker.internal, hardcoded node IPs, UFW routing) before doing the 20-minute research task that revealed the bug was a known libnetwork defect. If we'd searched "Swarm DNS stale record 2026" first, we'd have saved ~3 hours.

Scaffolds are starting points, not finishing points

The k3s scaffold in deploy-k3s/ was excellent — production-grade RBAC, PDBs, security contexts, network policies, Traefik middleware. But its image references (GHCR), TLS assumptions (CF Full strict), and probe paths (admin's /admin/) didn't match our actual setup. Every scaffold needs a read-through against your environment before you kubectl apply -f.

Keep the old config until the new config is proven

We kept deploy/ (Swarm) intact during the k3s migration. That meant if k3s failed, we could git stash the k3s work and do a fast Swarm redeploy. deploy/ stays in the repository until we are confident in k3s; its removal is planned.

Files affected by tonight's work

All in honeyDueAPI-go:

  • Dockerfile — Go 1.24 → 1.25 (bug #2)
  • deploy/scripts/deploy_prod.sh — buildx refactor, array expansion fixes (bugs #1, #3)
  • deploy/swarm-stack.prod.yml — dozzle host_ip, secret source references, multiple iterations trying to fix #10
  • deploy/prod.env — admin seed env vars, DB_POSTGRES_DB case, B2 values, push-disabled placeholders (bug #9)
  • deploy/cluster.env — WORKER_REPLICAS 2 → 1, PUSH_LATEST_TAG (bugs #7, #8)
  • deploy/Caddyfile — multiple iterations (ultimately deleted when we moved to k3s)
  • internal/services/cache_service.go — removed sync.Once reset (bug #6)
  • internal/database/database.go — (no change, MigrateWithLock semantics investigated)
  • deploy-k3s/manifests/api/deployment.yaml — startupProbe grace (bug #13)
  • deploy-k3s/manifests/admin/deployment.yaml — probe path (bug #12)
  • deploy-k3s/manifests/worker/deployment.yaml — replicas 2 → 1
  • deploy-k3s/manifests/pod-disruption-budgets.yaml — worker minAvailable 1 → 0
  • deploy-k3s/manifests/traefik-helmchartconfig.yaml — NEW (DaemonSet + hostNetwork for Traefik)
  • deploy-k3s/manifests/ingress/ingress-simple.yaml — NEW (simple host routing, no TLS)
  • deploy-k3s/MIGRATION_NOTES.md — NEW

What was thrown away

  • Swarm stack definitions (still in deploy/, planned for removal)
  • Caddy Caddyfile (k3s uses Traefik instead)
  • Several hours of work on Caddy dynamic a upstream refresh, host-mode ports, and NAT-hairpin workarounds for bug #10 — all moot once we migrated

References