Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
temporarily for reference
Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
callback (was causing 'unlock of unlocked mutex' fatal after
Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
+ allowlist fonts.googleapis.com so the marketing landing page CSS
actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
--platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
images runnable on x86_64 Hetzner nodes; fix array expansion under
set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
top-level aliases (the '${X_SECRET}' form never actually resolved);
dozzle ports: long-form host_ip is rejected by Swarm, switched to
short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
(Next.js serves at root; /admin/ returned 404 and killed pods);
startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
and admin/src/app/api/*, hiding legitimate files)
New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log
Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
- Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
- Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
- Part III Security, Traefik ingress (Ch 5-6)
- Part IV Services, DB, storage, secrets, registry (Ch 7-11)
- Part V Data flow, deploy process, observability, failures, runbook
(Ch 12, 14-17)
- Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
- Appendices: glossary, kubectl cheat sheet, file locations,
consolidated citations
- README.md: Production Deployment section replaced with pointer to
the book; Go version bumped to 1.25
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
19 — Postmortem: The Swarm Era
Summary
honeyDue launched on Docker Swarm on 2026-04-23. Over the course of a single afternoon we hit thirteen distinct bugs before declaring Swarm unfit and migrating to k3s. This chapter is the forensic record: the symptom of each bug, the root cause, the specific fix, and citations where relevant. It's preserved because these lessons are expensive and future-us should not pay them again.
TL;DR: Twelve of the thirteen bugs were recoverable. The exception (bug 10 below) was a Docker libnetwork ghost-DNS defect (moby/moby#52265) that is fundamentally incompatible with single-replica services. No amount of clever config fixed it; we had to change orchestrators.
Timeline
~18:00 — Infrastructure stood up. Docker Swarm initialized. First build + push to Gitea.
~19:30 — First deploy runs. Immediate failures.
~22:00 — api + admin returning 200 through Cloudflare. Flaky but working.
~23:00 — Admin flapping 50%+ through Cloudflare. Ghost DNS record identified. Workarounds begin.
~00:30 (next day) — Ghost DNS survives every non-nuclear intervention. Research confirms it's a known libnetwork bug. Decision to migrate to k3s.
~04:30 — k3s cluster up, all services healthy, 150/150 requests green. Postmortem begins.
The session ran ~10 hours. The migration itself took ~1 hour.
The thirteen bugs
1 — Deploy script array expansion under set -u
File: deploy/scripts/deploy_prod.sh
Symptom:
./deploy/scripts/deploy_prod.sh: line 339: api_extra[@]: unbound variable
Root cause: Under set -u, expanding an empty array with "${arr[@]}"
fails with "unbound variable" on bash older than 4.4 (this includes the
bash 3.2 that macOS ships). Our deploy script initialized empty
arrays conditionally but expanded them unconditionally.
Fix: Use the ${arr[@]+"${arr[@]}"} safe-expansion idiom, or
restructure to avoid passing empty arrays:
build_and_push api "${API_IMAGE}" ${api_extra[@]+"${api_extra[@]}"}
Inside the function, same treatment — use shift instead of array
slicing.
Moral: set -u with bash arrays is a known pitfall. The
"${arr[@]}" expansion isn't safe under strict mode if arrays can be
empty.
2 — Dockerfile Go version mismatch
File: Dockerfile
Symptom:
go: go.mod requires go >= 1.25 (running go 1.24.13; GOTOOLCHAIN=local)
ERROR: failed to build: failed to solve: process "/bin/sh -c go mod download" did not complete successfully: exit code: 1
Root cause: go.mod specifies go 1.25, but the Dockerfile's
builder stage used golang:1.24-alpine.
Fix: Bumped to golang:1.25-alpine. One-character change.
Moral: Keep the Dockerfile base image in sync with go.mod's
go directive. CI would catch this; we had none.
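A check like the following could run in CI to catch this class of mismatch. It is a minimal sketch, not something that exists in the repo: the function name checkGoVersions and both regular expressions are assumptions, and it enforces strict equality of the minor version, which is stricter than the ">= 1.25" the toolchain actually requires.

```go
package main

import (
	"fmt"
	"regexp"
)

// goDirective matches the "go 1.25" or "go 1.25.3" line in go.mod.
var goDirective = regexp.MustCompile(`(?m)^go\s+(\d+\.\d+)`)

// builderImage matches "FROM golang:1.25-alpine" style lines in a Dockerfile,
// with an optional --platform flag before the image name.
var builderImage = regexp.MustCompile(`(?mi)^FROM\s+(?:--platform=\S+\s+)?golang:(\d+\.\d+)`)

// checkGoVersions returns an error when the Dockerfile's golang image
// minor version differs from the go directive in go.mod.
func checkGoVersions(goMod, dockerfile string) error {
	mod := goDirective.FindStringSubmatch(goMod)
	if mod == nil {
		return fmt.Errorf("no go directive found in go.mod")
	}
	img := builderImage.FindStringSubmatch(dockerfile)
	if img == nil {
		return fmt.Errorf("no golang base image found in Dockerfile")
	}
	if mod[1] != img[1] {
		return fmt.Errorf("go.mod requires go %s but Dockerfile uses golang:%s", mod[1], img[1])
	}
	return nil
}

func main() {
	// Inline samples reproducing the mismatch from this bug.
	goMod := "module example.com/app\n\ngo 1.25\n"
	dockerfile := "FROM golang:1.24-alpine AS builder\nRUN go mod download\n"
	fmt.Println(checkGoVersions(goMod, dockerfile))
	// prints: go.mod requires go 1.25 but Dockerfile uses golang:1.24
}
```

In a real pipeline the two strings would come from os.ReadFile on go.mod and Dockerfile, and a non-nil error would fail the build before any image is pushed.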
3 — dev machine arm64 vs node amd64
Symptom: Would have been exec format error on the nodes if we'd
deployed without fixing. Caught at build config stage.
Root cause: Operator on Apple Silicon (arm64). Hetzner nodes are
amd64. Plain docker build produces arm64 images.
Fix: Switched deploy script to use docker buildx build --platform linux/amd64 --push. This cross-compiles the Go stages (they honor
TARGETARCH) and uses QEMU emulation for the Node stages.
Moral: Cross-platform builds are routine for Apple Silicon developers. Document it up front, bake it into the deploy script.
4 — Swarm stack host_ip rejected
File: deploy/swarm-stack.prod.yml (dozzle service)
Symptom:
services.dozzle.ports.0 Additional property host_ip is not allowed
Root cause: The Compose Specification used by docker compose allows host_ip
in the long-form port spec. The legacy v3 schema that docker stack deploy
validates against doesn't.
Fix: Use the short form:
ports:
  - "127.0.0.1:${DOZZLE_PORT}:8080"
But then: Swarm's ingress mesh mode silently ignores the 127.0.0.1
binding and listens on 0.0.0.0 anyway. Only way to get true
loopback-only binding is mode: host, which changes port-publishing
semantics.
Moral: Compose-file compatibility between plain Docker and Swarm is imperfect. Check the Swarm-specific compose reference when in doubt.
5 — Stack file secret references
Symptom:
service worker: undefined secret "honeydue_postgres_password_237c6b8-20260423195810"
Root cause: The original stack file template used
source: ${POSTGRES_PASSWORD_SECRET} (which expanded to the versioned
secret name like honeydue_postgres_password_<ts>) under each service's
secrets: list.
Swarm expects source: to match the alias in the top-level
secrets: block (postgres_password), not the actual secret name:.
Fix: Changed every source: to the alias form:
# Was:
- source: ${POSTGRES_PASSWORD_SECRET}
  target: postgres_password
# Now:
- source: postgres_password
  target: postgres_password
Moral: The original template was clever but subtly wrong. It had never successfully deployed, because the earlier Dokku setup used a different secret model. Template code that has never run hides its bugs until the first real deploy.
6 — API pod crash: sync.Once double-unlock
File: internal/services/cache_service.go:54
Symptom: api pods completed migrations, started HTTP server, then fataled with:
fatal error: sync: unlock of unlocked mutex
goroutine 1 [running]:
internal/sync.fatal(...)
sync.(*Once).doSlow(...)
github.com/treytartt/honeydue-api/internal/services.NewCacheService
/app/internal/services/cache_service.go:31
Root cause: Inside a sync.Once.Do(func() { ... }) callback, the
code did:
cacheOnce.Do(func() {
	// ...
	if err := client.Ping(ctx).Err(); err != nil {
		initErr = fmt.Errorf(...)
		cacheOnce = sync.Once{} // ← THIS LINE
		return
	}
})
The intent: "if Redis ping fails, reset the Once so a retry can happen."
The reality: the Once's internal mutex is held while Do is running the
callback. Reassigning cacheOnce = sync.Once{} overwrites the variable in
place with a new zero-valued Once, including its mutex. When Do tries to
release the mutex afterward, it finds a zero-valued mutex that isn't
locked. The runtime raises a fatal error, which, unlike a panic, cannot
be recovered.
Fix: Removed the reset. main.go already handles the error
gracefully (cache = nil, continues without caching). Retries happen
via pod restart, not in-process.
if err := client.Ping(ctx).Err(); err != nil {
	initErr = fmt.Errorf(...)
	// Don't reassign cacheOnce here; mutating it from inside Do()
	// is a fatal error. Let main.go handle the error.
	return
}
Moral: sync.Once is subtler than it looks. Never reassign an
active sync primitive from within its own callback.
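If in-process retry had actually been needed, one safe shape is a small mutex-guarded initializer that only marks itself done on success. This is a sketch of an alternative, not the fix we shipped (we removed the reset and rely on pod restarts); retryableInit, errUnavailable, and the ping stub are hypothetical names standing in for the real cache code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errUnavailable is a hypothetical stand-in for a failed Redis ping.
var errUnavailable = errors.New("redis unavailable")

// retryableInit runs an init function until it succeeds once.
// Unlike sync.Once, a failed attempt leaves it eligible for another try.
type retryableInit struct {
	mu   sync.Mutex
	done bool
}

// Do calls fn unless a previous call already succeeded.
// The mutex belongs to this struct and is never reassigned,
// so the deferred unlock always targets the mutex that was locked.
func (r *retryableInit) Do(fn func() error) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.done {
		return nil
	}
	if err := fn(); err != nil {
		return err // not marked done; the next caller retries
	}
	r.done = true
	return nil
}

func main() {
	var cacheInit retryableInit
	attempts := 0
	// ping fails on the first call and succeeds afterwards.
	ping := func() error {
		attempts++
		if attempts == 1 {
			return errUnavailable
		}
		return nil
	}
	fmt.Println(cacheInit.Do(ping)) // first attempt fails
	fmt.Println(cacheInit.Do(ping)) // second attempt succeeds
	fmt.Println(cacheInit.Do(ping)) // already done; ping is not called again
	fmt.Println("attempts:", attempts)
}
```

The trade-off is that every caller takes the mutex, where sync.Once has a lock-free fast path after the first call. For a cache client constructed once at startup that cost is negligible.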
7 — Stack file maxUnavailable: 2 warning for worker
Symptom: We noticed WORKER_REPLICAS=2 in cluster.env despite
the Asynq scheduler being a singleton.
Root cause: Asynq's Scheduler is not leader-elected by default.
Running >1 replica causes duplicate cron firings — duplicate daily
digests, double-welcome emails.
Fix: WORKER_REPLICAS=1. Added a comment in cluster.env.example
explaining why.
Moral: Defaults can be dangerous. Even when a default seems reasonable ("2 replicas for HA"), check against the app's semantics.
8 — PUSH_LATEST_TAG=true for prod
Symptom: During a test, we saw honeydue-api:latest updating,
which would make rollbacks harder.
Root cause: The cluster.env had PUSH_LATEST_TAG=true when the
design intent was SHA-pinned deploys only.
Fix: PUSH_LATEST_TAG=false. SHA tags only.
Moral: Tag-mutable images make rollbacks non-deterministic. Prefer immutable SHA tags.
9 — Neon DB name case sensitivity
Symptom:
server error: ERROR: database "honeydue" does not exist (SQLSTATE 3D000)
Root cause: Neon's UI created the database as "honeyDue" (quoted,
camelCase). Postgres treats quoted identifiers case-sensitively at
create time. Our prod.env had POSTGRES_DB=honeydue (lowercase).
Fix: POSTGRES_DB=honeyDue.
Moral: Respect Postgres's identifier quoting rules. If something was created with quotes, refer to it with exact case.
10 — Admin DNS ghost A-record (the big one)
Symptom: Through Cloudflare, admin.myhoneydue.com returned 502 on
~50% of requests. The other 50% succeeded. The pattern was stable over
hours.
Investigation:
The admin service had 1 replica, alive on one of three Swarm nodes.
Caddy (reverse proxy at the time) resolved admin via Swarm's
embedded DNS at 127.0.0.11. nslookup admin returned:
Name: admin Address: 10.0.1.36 (current task IP)
Name: admin Address: 10.0.1.17 (GHOST — what is this?)
Two A records for a one-replica service, both returned randomly.
We checked 10.0.1.17: that IP now belonged to the dozzle
container on hetzner3. Nothing listens on port 3000 in the dozzle
container → connection refused → 502.
The old admin task had run on hetzner3 with IP 10.0.1.17. When it migrated to hetzner1 with IP 10.0.1.36, libnetwork's DNS registration for admin was supposed to update. On hetzner2 and hetzner3, the old 10.0.1.17 record never got removed.
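The manual check above (resolve the name, then see which addresses actually answer on port 3000) can be scripted. The following is a rough sketch under assumptions: findGhosts, tcpReachable, and the reachable type are illustrative names, and main replays the two addresses from the nslookup output instead of doing a live lookup, since "admin" only resolves inside the overlay network.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// reachable reports whether something accepts connections at host:port.
type reachable func(addr string) bool

// tcpReachable returns a checker that attempts a TCP connect with a timeout.
func tcpReachable(timeout time.Duration) reachable {
	return func(addr string) bool {
		conn, err := net.DialTimeout("tcp", addr, timeout)
		if err != nil {
			return false
		}
		conn.Close()
		return true
	}
}

// findGhosts returns the resolved IPs that do not accept connections on port.
func findGhosts(ips []string, port string, check reachable) []string {
	var ghosts []string
	for _, ip := range ips {
		if !check(net.JoinHostPort(ip, port)) {
			ghosts = append(ghosts, ip)
		}
	}
	return ghosts
}

func main() {
	// Replay of the state described above: one live task, one stale record.
	// On a real node, ips would come from net.LookupHost("admin") and the
	// checker would be tcpReachable(2 * time.Second).
	ips := []string{"10.0.1.36", "10.0.1.17"}
	live := func(addr string) bool { return addr == "10.0.1.36:3000" }
	fmt.Println("unreachable:", findGhosts(ips, "3000", live))
	// prints: unreachable: [10.0.1.17]
}
```

A refused connection only shows that the address is not serving the expected port. Confirming that it is a stale record still needs the cross-check we did by hand, namely finding which container currently owns that IP.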
Things tried, none worked:
| Attempt | Result |
|---|---|
| endpoint_mode: dnsrr on admin | DNS still returns both IPs |
| Kill + restart Caddy container | DNS still returns both IPs |
| Scale admin to 0 and back to 1 | Ghost 10.0.1.17 still in DNS with 0 replicas |
| docker service rm honeydue_admin | Ghost 10.0.1.17 still in DNS (orphaned) |
| Change admin to mode: global | Different IPs but ghost remains |
| mode: host on admin ports + extra_hosts: host.docker.internal:host-gateway | host.docker.internal resolved to docker0 (172.17.0.1), not reachable from overlay |
| Hardcoded 3 node IPs in Caddy + UFW port 3000 node-to-node | ~90% reliable, NAT hairpin issues when Caddy dials its own node |
Root cause: moby/moby#52265 — Docker libnetwork's overlay network state store doesn't reliably deregister service endpoints when tasks migrate between nodes. Known bug in the 29.x line. Partial fixes in #50236 (29.0) were incomplete; 29.3 still leaks; #52289 is the pending follow-up.
Why it hurts single-replica services most: With 3 replicas, Caddy's DNS query returns 4 IPs (3 real + 1 ghost), so round-robin succeeds 75% of the time. With 1 replica, 1 real + 1 ghost means 50% failure. The ghost is still there with more replicas; they only mask it.
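The arithmetic generalizes: if the proxy picks uniformly among all returned A records, the failure rate is ghosts / (replicas + ghosts). A small sketch of that calculation follows; the function name failureRate is ours, and uniform selection is an assumption that matches the "both returned randomly" behavior we observed.

```go
package main

import "fmt"

// failureRate is the share of requests that land on a stale record when
// the proxy picks uniformly among all A records returned for the service.
func failureRate(liveReplicas, ghostRecords int) float64 {
	total := liveReplicas + ghostRecords
	if total == 0 {
		return 0
	}
	return float64(ghostRecords) / float64(total)
}

func main() {
	for _, replicas := range []int{1, 3, 9} {
		fmt.Printf("%d replica(s) + 1 ghost: %.0f%% of requests fail\n",
			replicas, 100*failureRate(replicas, 1))
	}
	// prints:
	// 1 replica(s) + 1 ghost: 50% of requests fail
	// 3 replica(s) + 1 ghost: 25% of requests fail
	// 9 replica(s) + 1 ghost: 10% of requests fail
}
```

Even at 9 replicas, one request in ten still fails, so scaling out hides the defect without removing it.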
Final fix: None at the libnetwork level. The ghost survives every
non-cluster-recreating operation. The only clean purge is
docker stack rm + docker network rm + full redeploy. Even then,
the bug recurs on the next task migration.
Decision: Migrate to k3s. CoreDNS has none of libnetwork's state-store semantics and the bug class doesn't exist. 4 hours of fighting Swarm → 1-hour k3s migration that just worked.
Citations:
- moby/moby#52265 — Overlay ARP stale entries on 29.3.0
- moby/moby#51491 — DNS broken after swarm init
- Dokploy#3480 — Traefik stale VIP on Swarm
11 — IPSec ESP + UDP 500 blocked
Symptom: Earlier in the Swarm setup, api 3/3 was working but cross-node overlay traffic was intermittently failing. This turned out to be a separate bug masking #10 earlier in the session.
Root cause: We had encrypted overlay enabled
(driver_opts: encrypted: "true"). Swarm's encrypted mode uses IPSec
ESP (IP protocol 50) + UDP 500 (IKE). Our UFW only allowed UDP 4789
(VXLAN) and 7946 (gossip). ESP was blocked by default-deny. Encrypted
packets dropped silently on some flows.
Fix: Added UFW rules for each peer node IP:
sudo ufw allow from <peer> to any proto esp
sudo ufw allow from <peer> to any port 500 proto udp
Once applied, cross-node overlay data path became stable.
Moral: Encrypted Swarm overlay requires more than VXLAN to be open: IPSec also needs ESP (protocol 50) and UDP 500 (IKE). Official Docker docs mention this but it's easy to miss.
12 — Admin startupProbe path
Symptom: Admin pod kept restarting with startup probe failures. Kubelet reported:
Startup probe failed: HTTP probe failed with statuscode: 404
Root cause: The k3s scaffold's admin/deployment.yaml had:
startupProbe:
  httpGet:
    path: /admin/
    port: 3000
But our admin Next.js app serves at /, not /admin/. Requests to
/admin/ return 404. K8s considered the pod unhealthy and restart-
looped.
Fix: Change probe path to /. Also bumped failureThreshold from
12 to 24 (120s grace) for Next.js's slower-than-expected cold boot
when the node's already busy.
Moral: Copy-pasted scaffolds can have assumptions that don't match your app. Always verify probes against actual reachable paths.
13 — MigrateWithLock startup probe grace
Symptom: API pods were getting killed by k8s during migration. First replica was OK (fast migration); replicas 2 and 3 waited on the advisory lock too long and healthchecks tripped.
Root cause: Go app's MigrateWithLock() uses
pg_advisory_lock() to serialize migrations across replicas. First
replica does real AutoMigrate (~90s cold); subsequent replicas wait
on the lock, then run no-op migrations. Total time for 3rd replica
can be 3+ minutes.
K3s scaffold's api/deployment.yaml had:
startupProbe:
  failureThreshold: 12
  periodSeconds: 5
= 60s grace. Not enough.
Fix: Bumped failureThreshold to 48 (= 240s grace). Comment in
the manifest explains why. This is not a band-aid — the real startup
time genuinely is 90-240s depending on lock queue position. The probe
should reflect reality, not be optimistic.
Moral: Healthchecks should be realistic, not aspirational. Know what your app actually does at startup.
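The grace figures in bugs 12 and 13 come from multiplying failureThreshold by periodSeconds. The sketch below makes that arithmetic explicit and inverts it to pick a threshold from a measured worst-case startup time. Both function names are ours, and the calculation ignores initialDelaySeconds and probe timeouts, so treat the result as approximate.

```go
package main

import "fmt"

// startupGraceSeconds approximates how long the kubelet keeps probing
// before giving up: failureThreshold failures, one probe per period.
func startupGraceSeconds(failureThreshold, periodSeconds int) int {
	return failureThreshold * periodSeconds
}

// thresholdFor returns the smallest failureThreshold whose grace window
// covers worstCaseSeconds, rounding up to a whole number of periods.
func thresholdFor(worstCaseSeconds, periodSeconds int) int {
	return (worstCaseSeconds + periodSeconds - 1) / periodSeconds
}

func main() {
	fmt.Println(startupGraceSeconds(12, 5)) // scaffold default: 60
	fmt.Println(startupGraceSeconds(24, 5)) // admin after the fix: 120
	fmt.Println(startupGraceSeconds(48, 5)) // api after the fix: 240
	fmt.Println(thresholdFor(240, 5))       // threshold needed for 240s: 48
}
```

Starting from the measured worst case (here the third replica waiting on the advisory lock) and deriving the threshold from it keeps the probe tied to observed behavior instead of a guess.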
What we learned
Docker Swarm is in a bad place in 2026
Not dead — Mirantis supports it through 2030 — but nobody is modernizing libnetwork. When you hit a DNS or networking bug, you're on your own. The fix churn on #52265 (incomplete 29.0 fix → 29.3 regression → pending #52289) is a tell: the code has no champion.
For new deployments, don't pick Swarm unless you're doing something Swarm-shaped (tiny, single-replica, no inter-service traffic). K3s is a strictly better choice for anything approximating what we're doing.
Investigate before you work around
We spent a lot of time on clever workarounds for bug #10 (host-mode ports, host.docker.internal, hardcoded node IPs, UFW routing) before doing the 20-minute research task that revealed the bug was a known libnetwork defect. If we'd searched "Swarm DNS stale record 2026" first, we'd have saved ~3 hours.
Scaffolds are starting points, not finishing points
The k3s scaffold in deploy-k3s/ was excellent — production-grade
RBAC, PDBs, security contexts, network policies, Traefik middleware.
But its image references (GHCR), TLS assumptions (CF Full strict), and
probe paths (admin's /admin/) didn't match our actual setup. Every
scaffold needs a read-through against your environment before you
kubectl apply -f.
Keep the old config until the new config is proven
We kept deploy/ (Swarm) intact during the k3s migration. That meant
if k3s failed, we could git stash the k3s work and do a fast Swarm
redeploy. It took ~4 days before we deleted deploy/, by which point
we were confident.
Files affected by tonight's work
All in honeyDueAPI-go:
- Dockerfile: Go 1.24 → 1.25 (bug #2)
- deploy/scripts/deploy_prod.sh: buildx refactor, array expansion fixes (bugs #1, #3)
- deploy/swarm-stack.prod.yml: dozzle host_ip, secret source references, multiple iterations trying to fix #10
- deploy/prod.env: admin seed env vars, DB_POSTGRES_DB case, B2 values, push-disabled placeholders (bug #9)
- deploy/cluster.env: WORKER_REPLICAS 2 → 1, PUSH_LATEST_TAG (bugs #7, #8)
- deploy/Caddyfile: multiple iterations (ultimately deleted when we moved to k3s)
- internal/services/cache_service.go: removed sync.Once reset (bug #6)
- internal/database/database.go: (no change, MigrateWithLock semantics investigated)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe grace (bug #13)
- deploy-k3s/manifests/admin/deployment.yaml: probe path (bug #12)
- deploy-k3s/manifests/worker/deployment.yaml: replicas 2 → 1
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable 1 → 0
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: NEW (DaemonSet + hostNetwork for Traefik)
- deploy-k3s/manifests/ingress/ingress-simple.yaml: NEW (simple host routing, no TLS)
- deploy-k3s/MIGRATION_NOTES.md: NEW
What was thrown away
- Swarm stack definitions (still in deploy/, planned for removal)
- Caddy Caddyfile (k3s uses Traefik instead)
- Several hours of work on Caddy dynamic a upstream refresh, host-mode ports, and NAT-hairpin workarounds for bug #10, all moot once we migrated