Migrate prod deploy from Swarm to K3s; add full deployment book

Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
  callback (was causing 'unlock of unlocked mutex' fatal after
  Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
  + allowlist fonts.googleapis.com so the marketing landing page CSS
  actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
  --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
  images runnable on x86_64 Hetzner nodes; fix array expansion under
  set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
  top-level aliases (the '${X_SECRET}' form never actually resolved);
  dozzle ports: long-form host_ip is rejected by Swarm, switched to
  short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
  worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
  (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
  12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
  real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
  and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
  hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
  without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
  - Part I   Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II  Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV  Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V   Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI  Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations,
    consolidated citations
- README.md: Production Deployment section replaced with pointer to
  the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,480 @@
# 19 — Postmortem: The Swarm Era

## Summary

honeyDue launched on Docker Swarm on 2026-04-23. Over the course of a
single ~10-hour session we hit **thirteen distinct bugs**: eleven while
building for and running on Swarm, and two more while bringing up k3s
after we declared Swarm unfit. This chapter is the forensic record:
the symptom of each bug, the root cause, the specific fix, and citations
where relevant. It's preserved because these lessons are expensive and
future-us should not pay for them again.

**TL;DR**: Twelve of the thirteen bugs were recoverable. The exception
(bug #10) was a Docker libnetwork ghost-DNS defect
([moby/moby#52265][moby-52265]) that is fundamentally incompatible with
single-replica services. No amount of clever config fixed it; we had to
change orchestrators.

## Timeline

**~18:00** — Infrastructure stood up. Docker Swarm initialized. First
build + push to Gitea.

**~19:30** — First deploy runs. Immediate failures.

**~22:00** — api + admin returning 200 through Cloudflare. Flaky but
working.

**~23:00** — Admin flapping 50%+ through Cloudflare. Ghost DNS record
identified. Workarounds begin.

**~00:30 (next day)** — Ghost DNS survives every non-nuclear
intervention. Research confirms it's a known libnetwork bug. Decision
to migrate to k3s.

**~04:30** — k3s cluster up, all services healthy, 150/150 requests
green. Postmortem begins.

The session ran ~10 hours. The migration itself took ~1 hour.

## The thirteen bugs

### 1 — Deploy script array expansion under `set -u`

**File**: `deploy/scripts/deploy_prod.sh`

**Symptom**:
```
./deploy/scripts/deploy_prod.sh: line 339: api_extra[@]: unbound variable
```

**Root cause**: Under `set -u`, expanding an empty array with
`"${arr[@]}"` is an "unbound variable" error on bash older than 4.4,
including the 3.2 that macOS ships as `/bin/bash`. Our deploy script
initialized empty arrays conditionally but expanded them unconditionally.

**Fix**: Use the `${arr[@]+"${arr[@]}"}` safe-expansion idiom, or
restructure to avoid passing empty arrays:

```bash
build_and_push api "${API_IMAGE}" ${api_extra[@]+"${api_extra[@]}"}
```

Inside the function, the same treatment applies: use `shift` instead of
array slicing.

**Moral**: `set -u` with bash arrays is a known pitfall. The
`"${arr[@]}"` expansion isn't safe under strict mode if arrays can be
empty.

### 2 — Dockerfile Go version mismatch

**File**: `Dockerfile`

**Symptom**:
```
go: go.mod requires go >= 1.25 (running go 1.24.13; GOTOOLCHAIN=local)
ERROR: failed to build: failed to solve: process "/bin/sh -c go mod download" did not complete successfully: exit code: 1
```

**Root cause**: `go.mod` specifies `go 1.25`, but the Dockerfile's
builder stage used `golang:1.24-alpine`.

**Fix**: Bumped to `golang:1.25-alpine`. One-character change.

**Moral**: Keep the Dockerfile base image in sync with `go.mod`'s
go directive. CI would catch this; we had none.
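
The missing CI check can be small. The following is a minimal sketch, not the project's actual tooling: it assumes the builder stage is written as `FROM golang:<major>.<minor>...` and compares that tag with the `go` directive in `go.mod`. The names `versionsMatch`, `goDirective`, and `golangImage` are hypothetical, and strict equality is a deliberately conservative choice (a newer image would also satisfy `go.mod`).

```go
package main

import (
	"fmt"
	"regexp"
)

// Both patterns are assumptions about file layout: a `go X.Y` directive in
// go.mod and a `FROM golang:X.Y...` builder stage in the Dockerfile.
var (
	goDirective = regexp.MustCompile(`(?m)^go\s+(\d+\.\d+)`)
	golangImage = regexp.MustCompile(`(?mi)^FROM\s+golang:(\d+\.\d+)`)
)

// versionsMatch returns the two versions it found and whether they agree.
// If either pattern is missing, it returns empty strings and false.
func versionsMatch(goMod, dockerfile string) (modVer, imgVer string, ok bool) {
	m := goDirective.FindStringSubmatch(goMod)
	d := golangImage.FindStringSubmatch(dockerfile)
	if m == nil || d == nil {
		return "", "", false
	}
	return m[1], d[1], m[1] == d[1]
}

func main() {
	// Inline stand-ins for the real files, mirroring the mismatch in bug #2.
	goMod := "module example.com/app\n\ngo 1.25\n"
	dockerfile := "FROM golang:1.24-alpine AS builder\nRUN go mod download\n"

	modVer, imgVer, ok := versionsMatch(goMod, dockerfile)
	fmt.Printf("go.mod=%s image=%s match=%v\n", modVer, imgVer, ok)
}
```

In CI, the two strings would come from reading `go.mod` and `Dockerfile` from disk, and a `false` result would fail the build.
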

### 3 — Dev machine arm64 vs node amd64

**Symptom**: Would have been `exec format error` on the nodes if we'd
deployed without fixing. Caught at build config stage.

**Root cause**: The operator builds on Apple Silicon (arm64). The
Hetzner nodes are amd64. A plain `docker build` on that machine produces
arm64 images.

**Fix**: Switched the deploy script to
`docker buildx build --platform linux/amd64 --push`. This cross-compiles
the Go stages (they honor `TARGETARCH`) and uses QEMU emulation for the
Node stages.

**Moral**: Cross-platform builds are routine for Apple Silicon
developers. Document it up front, bake it into the deploy script.

### 4 — Swarm stack `host_ip` rejected

**File**: `deploy/swarm-stack.prod.yml` (dozzle service)

**Symptom**:
```
services.dozzle.ports.0 Additional property host_ip is not allowed
```

**Root cause**: The Compose Specification that `docker compose`
implements allows `host_ip` in the long-form port syntax. The legacy v3
schema that `docker stack deploy` validates against does not.

**Fix**: Use the short form:
```yaml
ports:
  - "127.0.0.1:${DOZZLE_PORT}:8080"
```

But then: Swarm's ingress mesh mode silently ignores the `127.0.0.1`
binding and listens on `0.0.0.0` anyway. The only way to get true
loopback-only binding is `mode: host`, which changes port-publishing
semantics. We kept the short form and restricted access to the port
with UFW instead.

**Moral**: Compose-file compatibility between plain Docker and Swarm
is imperfect. Check the [Swarm-specific compose reference][swarm-compose]
when in doubt.

### 5 — Stack file secret references

**Symptom**:
```
service worker: undefined secret "honeydue_postgres_password_237c6b8-20260423195810"
```

**Root cause**: The original stack file template used
`source: ${POSTGRES_PASSWORD_SECRET}` (which expanded to the versioned
secret name like `honeydue_postgres_password_<ts>`) under each service's
`secrets:` list.

Swarm expects `source:` to match the **alias** in the top-level
`secrets:` block (`postgres_password`), not the actual secret `name:`.

**Fix**: Changed every `source:` to the alias form:

```yaml
# Was:
- source: ${POSTGRES_PASSWORD_SECRET}
  target: postgres_password

# Now:
- source: postgres_password
  target: postgres_password
```

**Moral**: The original template was clever but subtly wrong. It had
never successfully deployed, because the earlier Dokku setup used a
different secret model. Bugs in template code that has never run surface
the first time you exercise it.

### 6 — API pod crash: `sync.Once` double-unlock

**File**: `internal/services/cache_service.go:54`

**Symptom**: api pods completed migrations, started the HTTP server,
then crashed with:
```
fatal error: sync: unlock of unlocked mutex
goroutine 1 [running]:
internal/sync.fatal(...)
sync.(*Once).doSlow(...)
github.com/treytartt/honeydue-api/internal/services.NewCacheService
/app/internal/services/cache_service.go:31
```

**Root cause**: Inside a `sync.Once.Do(func() { ... })` callback, the
code did:

```go
cacheOnce.Do(func() {
	// ...
	if err := client.Ping(ctx).Err(); err != nil {
		initErr = fmt.Errorf(...)
		cacheOnce = sync.Once{} // ← THIS LINE
		return
	}
})
```

The intent: "if Redis ping fails, reset the Once so a retry can happen."
The reality: `Do` holds the Once's internal mutex while the callback
runs. Assigning `cacheOnce = sync.Once{}` overwrites the Once in place
with a zero value, which includes a fresh, unlocked mutex. When `Do`
then tries to release the mutex it locked, it finds an unlocked one, and
the runtime aborts with a fatal error. Unlike a panic, this cannot be
recovered.

**Fix**: Removed the reset. `main.go` already handles the error
gracefully (`cache = nil`, continues without caching). Retries happen
via pod restart, not in-process.

```go
if err := client.Ping(ctx).Err(); err != nil {
	initErr = fmt.Errorf(...)
	// Don't reassign cacheOnce here — mutating it from inside Do()
	// is a fatal error. Let main.go handle the error.
	return
}
```

**Moral**: `sync.Once` is simpler than it looks. Never reassign an
active sync primitive from within its own callback.

### 7 — Stack file `maxUnavailable: 2` warning for worker

**Symptom**: We noticed `WORKER_REPLICAS=2` in `cluster.env` despite
the Asynq scheduler being a singleton.

**Root cause**: Asynq's `Scheduler` is not leader-elected by default.
Running >1 replica causes duplicate cron firings: duplicate daily
digests, double-welcome emails.

**Fix**: `WORKER_REPLICAS=1`. Added a comment in `cluster.env.example`
explaining why.

**Moral**: Defaults can be dangerous. Even when a default seems
reasonable ("2 replicas for HA"), check against the app's semantics.

### 8 — `PUSH_LATEST_TAG=true` for prod

**Symptom**: During a test, we saw `honeydue-api:latest` updating,
which would make rollbacks harder.

**Root cause**: `cluster.env` had `PUSH_LATEST_TAG=true` when the
design intent was SHA-pinned deploys only.

**Fix**: `PUSH_LATEST_TAG=false`. SHA tags only.

**Moral**: Mutable image tags make rollbacks non-deterministic.
Prefer immutable SHA tags.

### 9 — Neon DB name case sensitivity

**Symptom**:
```
server error: ERROR: database "honeydue" does not exist (SQLSTATE 3D000)
```

**Root cause**: Neon's UI created the database as `"honeyDue"` (quoted,
camelCase). Postgres preserves the case of quoted identifiers and folds
unquoted ones to lowercase, so the stored name is `honeyDue` and
connection parameters must match it exactly. Our `prod.env` had
`POSTGRES_DB=honeydue` (lowercase).

**Fix**: `POSTGRES_DB=honeyDue`.

**Moral**: Respect Postgres's identifier quoting rules. If something
was created with quotes, refer to it with exact case.
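
The quoting rule is easy to forget, so here is a simplified sketch of how Postgres resolves an identifier written in SQL. `foldIdentifier` is a hypothetical helper for illustration only; it ignores escaped quotes inside names and is not part of any driver. A database name in a connection string is not folded at all: it is compared as-is against the stored name, which is why `honeydue` missed `honeyDue`.

```go
package main

import (
	"fmt"
	"strings"
)

// foldIdentifier mimics Postgres identifier resolution in SQL text:
// double-quoted names keep their case, unquoted names fold to lowercase.
// Simplified: escaped quotes ("") inside a quoted name are not handled.
func foldIdentifier(ident string) string {
	if len(ident) >= 2 && strings.HasPrefix(ident, `"`) && strings.HasSuffix(ident, `"`) {
		return ident[1 : len(ident)-1]
	}
	return strings.ToLower(ident)
}

func main() {
	// The same name written three ways in a CREATE DATABASE statement.
	for _, ident := range []string{`honeyDue`, `"honeyDue"`, `HONEYDUE`} {
		fmt.Printf("%-12s -> %s\n", ident, foldIdentifier(ident))
	}
}
```

Only the quoted spelling yields a stored name with a capital letter, and that stored name is what `POSTGRES_DB` has to match.
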

### 10 — Admin DNS ghost A-record (the big one)

**Symptom**: Through Cloudflare, `admin.myhoneydue.com` returned 502 on
~50% of requests. The other 50% succeeded. The pattern was stable over
hours.

**Investigation**:

The admin service had 1 replica, alive on one of three Swarm nodes.
Caddy (reverse proxy at the time) resolved `admin` via Swarm's
embedded DNS at `127.0.0.11`. `nslookup admin` returned:

```
Name: admin Address: 10.0.1.36 (current task IP)
Name: admin Address: 10.0.1.17 (GHOST — what is this?)
```

Two A records for a one-replica service, returned in random order.

We checked `10.0.1.17`: that IP now belonged to the **dozzle**
container on hetzner3. Nothing listens on port 3000 in the dozzle
container → connection refused → 502.

The old admin task had run on hetzner3 with IP 10.0.1.17. When it
migrated to hetzner1 with IP 10.0.1.36, libnetwork's DNS registration
for admin was supposed to update. On hetzner2 and hetzner3, the old
10.0.1.17 record never got removed.

**Things tried, none worked**:

| Attempt | Result |
|---|---|
| `endpoint_mode: dnsrr` on admin | DNS still returns both IPs |
| Kill + restart Caddy container | DNS still returns both IPs |
| Scale admin to 0 and back to 1 | Ghost 10.0.1.17 still in DNS with 0 replicas |
| `docker service rm honeydue_admin` | Ghost 10.0.1.17 still in DNS (orphaned) |
| Change admin to `mode: global` | Different IPs but ghost remains |
| `mode: host` on admin ports + `extra_hosts: host.docker.internal:host-gateway` | `host.docker.internal` resolved to docker0 (172.17.0.1), not reachable from overlay |
| Hardcoded 3 node IPs in Caddy + UFW port 3000 node-to-node | ~90% reliable, NAT hairpin issues when Caddy dials its own node |

**Root cause**: [moby/moby#52265][moby-52265]. Docker libnetwork's
overlay network state store doesn't reliably deregister service
endpoints when tasks migrate between nodes. Known bug in the 29.x
line. Partial fixes in #50236 (29.0) were incomplete; 29.3 still
leaks; #52289 is the pending follow-up.

**Why it is worst on single-replica services**: With 3 replicas,
Caddy's DNS query returns 4 IPs (3 real + 1 ghost), so round-robin
succeeds 75% of the time. With 1 replica, 1 real + 1 ghost = 50%
failure. The bug is still present with more replicas; it is only
diluted, and therefore easier to miss.
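
The arithmetic generalizes to live / (live + stale). The sketch below assumes the client picks uniformly among all A records that DNS returns, which is a simplification (real proxies may cache, retry, or health-check upstreams); `successRate` is a hypothetical helper, not code from the repository.

```go
package main

import "fmt"

// successRate returns the fraction of requests that reach a live task when
// DNS returns `live` current addresses plus `stale` ghost addresses and the
// client chooses uniformly among them.
func successRate(live, stale int) float64 {
	total := live + stale
	if total == 0 {
		return 0
	}
	return float64(live) / float64(total)
}

func main() {
	// One ghost record, increasing replica counts.
	for _, replicas := range []int{1, 3, 9} {
		fmt.Printf("replicas=%d ghosts=1 success=%.0f%%\n",
			replicas, 100*successRate(replicas, 1))
	}
}
```

Even at 9 replicas a single ghost still fails one request in ten, so scaling out hides the defect rather than removing it.
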

**Final fix**: None at the libnetwork level. The ghost survives every
non-cluster-recreating operation. The only clean purge is
`docker stack rm` + `docker network rm` + full redeploy. Even then,
the bug recurs on the next task migration.

**Decision**: Migrate to k3s. CoreDNS has none of libnetwork's
state-store semantics and the bug class doesn't exist. 4 hours of
fighting Swarm → 1-hour k3s migration that just worked.

**Citations**:
- [moby/moby#52265 — Overlay ARP stale entries on 29.3.0][moby-52265]
- [moby/moby#51491 — DNS broken after swarm init][moby-51491]
- [Dokploy#3480 — Traefik stale VIP on Swarm][dokploy-3480]

### 11 — IPSec ESP + UDP 500 blocked

**Symptom**: Earlier in the Swarm setup, api 3/3 was working but
cross-node overlay traffic was intermittently failing. This turned out
to be a separate bug that masked #10 earlier in the session.

**Root cause**: We had encrypted overlay enabled
(`driver_opts: encrypted: "true"`). Swarm's encrypted mode uses IPSec
ESP (IP protocol 50) + UDP 500 (IKE). Our UFW only allowed UDP 4789
(VXLAN) and 7946 (gossip). ESP was blocked by default-deny, so encrypted
packets were dropped silently on some flows.

**Fix**: Added UFW rules for each peer node IP:
```bash
sudo ufw allow from <peer> to any proto esp
sudo ufw allow from <peer> to any port 500 proto udp
```

Once applied, the cross-node overlay data path became stable.

**Moral**: Encrypted Swarm overlay requires more than VXLAN to be open:
IPSec also needs ESP (IP protocol 50) and UDP 500 (IKE). Official Docker
docs mention this but it's easy to miss.

### 12 — Admin startupProbe path

**Symptom**: Admin pod kept restarting with startup probe failures.
Kubelet reported:
```
Startup probe failed: HTTP probe failed with statuscode: 404
```

**Root cause**: The k3s scaffold's `admin/deployment.yaml` had:
```yaml
startupProbe:
  httpGet:
    path: /admin/
    port: 3000
```

But our admin Next.js app serves at `/`, not `/admin/`. Requests to
`/admin/` return 404. K8s considered the pod unhealthy and
restart-looped it.

**Fix**: Changed the probe path to `/`. Also bumped `failureThreshold`
from 12 to 24 (120s grace) to cover Next.js's slower-than-expected cold
boot when the node is already busy.

**Moral**: Copy-pasted scaffolds can have assumptions that don't match
your app. Always verify probes against actual reachable paths.

### 13 — MigrateWithLock startup probe grace

**Symptom**: API pods were getting killed by k8s during migration.
The first replica was OK (fast migration); replicas 2 and 3 waited on
the advisory lock too long and the startup probe tripped.

**Root cause**: The Go app's `MigrateWithLock()` uses
`pg_advisory_lock()` to serialize migrations across replicas. The first
replica does the real AutoMigrate (~90s cold); subsequent replicas wait
on the lock, then run no-op migrations. Total time for the 3rd replica
can be 3+ minutes.

The k3s scaffold's `api/deployment.yaml` had:
```yaml
startupProbe:
  failureThreshold: 12
  periodSeconds: 5
```

That is 12 × 5s = 60s of grace. Not enough.

**Fix**: Bumped `failureThreshold` to 48 (48 × 5s = 240s grace). A
comment in the manifest explains why. This is *not* a band-aid: the real
startup time genuinely is 90-240s depending on lock queue position. The
probe should reflect reality, not be optimistic.

**Moral**: Healthchecks should be realistic, not aspirational. Know
what your app actually does at startup.

## What we learned

### Docker Swarm is in a bad place in 2026

Not dead (Mirantis supports it through 2030), but **nobody is
modernizing libnetwork**. When you hit a DNS or networking bug, you're
on your own. The fix churn on #52265 (incomplete 29.0 fix → 29.3
regression → pending #52289) is a tell: the code has no champion.

For new deployments, **don't pick Swarm** unless you're doing something
Swarm-shaped (tiny, single-replica, no inter-service traffic). K3s is
a strictly better choice for anything approximating what we're doing.

### Investigate before you work around

We spent a lot of time on clever workarounds for bug #10 (host-mode
ports, host.docker.internal, hardcoded node IPs, UFW routing) before
doing the 20-minute research task that revealed the bug was a known
libnetwork defect. If we'd searched "Swarm DNS stale record 2026" first,
we'd have saved ~3 hours.

### Scaffolds are starting points, not finishing points

The k3s scaffold in `deploy-k3s/` was excellent: production-grade
RBAC, PDBs, security contexts, network policies, Traefik middleware.
But its image references (GHCR), TLS assumptions (CF Full strict), and
probe paths (admin's `/admin/`) didn't match our actual setup. Every
scaffold needs a read-through against your environment before you
`kubectl apply -f`.

### Keep the old config until the new config is proven

We kept `deploy/` (Swarm) intact during the k3s migration. That meant
if k3s failed, we could `git stash` the k3s work and do a fast Swarm
redeploy. It took ~4 days before we deleted `deploy/`, by which point
we were confident.

## Files affected by tonight's work

All in `honeyDueAPI-go`:

- `Dockerfile` — Go 1.24 → 1.25 (bug #2)
- `deploy/scripts/deploy_prod.sh` — buildx refactor, array expansion fixes (bugs #1, #3)
- `deploy/swarm-stack.prod.yml` — dozzle host_ip, secret source references, multiple iterations trying to fix #10
- `deploy/prod.env` — admin seed env vars, DB_POSTGRES_DB case, B2 values, push-disabled placeholders (bug #9)
- `deploy/cluster.env` — WORKER_REPLICAS 2 → 1, PUSH_LATEST_TAG (bugs #7, #8)
- `deploy/Caddyfile` — multiple iterations (ultimately deleted when we moved to k3s)
- `internal/services/cache_service.go` — removed sync.Once reset (bug #6)
- `internal/database/database.go` — (no change, MigrateWithLock semantics investigated)
- `deploy-k3s/manifests/api/deployment.yaml` — startupProbe grace (bug #13)
- `deploy-k3s/manifests/admin/deployment.yaml` — probe path (bug #12)
- `deploy-k3s/manifests/worker/deployment.yaml` — replicas 2 → 1
- `deploy-k3s/manifests/pod-disruption-budgets.yaml` — worker minAvailable 1 → 0
- `deploy-k3s/manifests/traefik-helmchartconfig.yaml` — NEW (DaemonSet + hostNetwork for Traefik)
- `deploy-k3s/manifests/ingress/ingress-simple.yaml` — NEW (simple host routing, no TLS)
- `deploy-k3s/MIGRATION_NOTES.md` — NEW

## What was thrown away

- Swarm stack definitions (still in `deploy/`, planned for removal)
- Caddy Caddyfile (k3s uses Traefik instead)
- Several hours of work on Caddy `dynamic a` upstream refresh,
  host-mode ports, and NAT-hairpin workarounds for bug #10, all moot
  once we migrated

## References

- [moby/moby#52265 — Overlay ARP stale entries][moby-52265]
- [moby/moby#51491 — DNS broken after swarm init][moby-51491]
- [Dokploy#3480 — Traefik stale VIP][dokploy-3480]
- [Mirantis Swarm LTS commitment][mirantis-swarm]
- [Kubernetes probe best practices][k8s-probes]
- [Asynq scheduler limitations][asynq-sched]

[moby-52265]: https://github.com/moby/moby/issues/52265
[moby-51491]: https://github.com/moby/moby/issues/51491
[dokploy-3480]: https://github.com/Dokploy/dokploy/issues/3480
[mirantis-swarm]: https://www.mirantis.com/blog/mirantis-guarantees-long-term-support-for-swarm/
[k8s-probes]: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
[asynq-sched]: https://github.com/hibiken/asynq/wiki/Periodic-Tasks
[swarm-compose]: https://docs.docker.com/reference/compose-file/legacy-versions/