Migrate prod deploy from Swarm to K3s; add full deployment book
Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
temporarily for reference
Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
callback (was causing 'unlock of unlocked mutex' fatal after
Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
+ allowlist fonts.googleapis.com so the marketing landing page CSS
actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
--platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
images runnable on x86_64 Hetzner nodes; fix array expansion under
set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
top-level aliases (the '${X_SECRET}' form never actually resolved);
dozzle ports: long-form host_ip is rejected by Swarm, switched to
short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
(Next.js serves at root; /admin/ returned 404 and killed pods);
startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
and admin/src/app/api/*, hiding legitimate files)
New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log
Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
- Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
- Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
- Part III Security, Traefik ingress (Ch 5-6)
- Part IV Services, DB, storage, secrets, registry (Ch 7-11)
- Part V Data flow, deploy process, observability, failures, runbook
(Ch 12, 14-17)
- Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
- Appendices: glossary, kubectl cheat sheet, file locations,
consolidated citations
- README.md: Production Deployment section replaced with pointer to
the book; Go version bumped to 1.25
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,323 @@
# 02 — Orchestrator Choice

## Summary

We run K3s, a lightweight Kubernetes distribution from SUSE/Rancher Labs.
This wasn't our first choice. We originally deployed on Docker Swarm and
spent a long afternoon hitting a libnetwork bug before migrating. This
chapter compares the three realistic orchestrators (Docker Swarm, full
Kubernetes, and K3s) and a fourth (Nomad) that we considered and rejected.
The story of the Swarm→K3s migration is in
[Chapter 19](./19-postmortem-swarm.md); this chapter is about the decision
framework.

## The decision

**K3s v1.34.6+k3s1**, HA mode, three control-plane nodes with embedded etcd.

## Candidates considered

| | Docker Swarm | K3s | Full Kubernetes (kubeadm) | HashiCorp Nomad |
|---|---|---|---|---|
| Learning curve | Easiest | Medium | Hardest | Easy |
| Install on 3 nodes | `docker swarm init` / `join` | `curl \| sh` per node | Many steps | `nomad agent` per node |
| Memory footprint (control plane, approx.) | ~200 MB per node | ~500 MB per node | ~1 GB per node | ~200 MB per node |
| Service discovery | libnetwork (buggy) | CoreDNS | CoreDNS | Consul |
| HA quorum | Raft (3+ managers) | Raft via embedded etcd (3+ servers) | etcd cluster (3+ nodes) | Raft (3+ servers) |
| Secrets management | Swarm secrets | k8s Secrets | k8s Secrets | Vault or file-backed |
| Rolling updates | Swarm `update_config` | Deployments | Deployments | job `update` stanza |
| Ingress | None (third-party) | Traefik bundled | None (install yourself) | None (install yourself) |
| Active development | Maintenance mode | Active | Active | Active |
| Industry momentum | Declining | Growing | Dominant | Niche |

## Why K3s

### Against Docker Swarm

Swarm was our first pick because it's the simplest "production-like"
option. `docker swarm init` gives you a working cluster in seconds. It's
built into the Docker daemon you already have.

What killed it:

1. **libnetwork state bugs.** Swarm's service discovery relies on
   libnetwork's gossip-backed service registry. When a service's task
   migrates between nodes, the old endpoint record isn't always removed
   cleanly, especially on encrypted overlays or during transient network
   partitions. The result: stale DNS A records that persist indefinitely.
   They survive service removal, survive containerd restarts, survive
   pretty much everything short of recreating the overlay network.
   Multiple open issues track this: [moby/moby#52265][moby-52265],
   [moby/moby#51491][moby-51491], [Dokploy#3480][dokploy-3480].

2. **It's in maintenance mode.** Mirantis [committed to supporting
   Swarm through 2030][mirantis-swarm] as part of Mirantis Kubernetes
   Engine 3, but we see no active feature development. The libnetwork
   code has no champion; bug fixes land slowly and often incompletely
   (the 29.0.0 partial fix for #50236, the 29.3.0 regression, and the
   pending follow-up in #52289, all months apart).

3. **Industry signal.** The 2026 write-ups we found on "should I pick
   Swarm" reach the same conclusion: keep running what works, but don't
   bet new workloads on it. [Better Stack][bstack-swarm] and
   [VirtualizationHowTo][vht-swarm] are representative.

The [Chapter 19 postmortem](./19-postmortem-swarm.md) details the specific
bug we hit, the workarounds we tried, and why each failed.

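To make the stale-record failure in item 1 concrete: detecting it comes down to comparing the addresses DNS returns for a service with the addresses of tasks that are actually running. The sketch below shows only that comparison. The function name `stale_records` and the sample addresses are hypothetical. How the two lists are gathered is an assumption about your setup (for example, `dig +short tasks.<service>` from a container on the overlay, and task IPs from `docker inspect`), not something this chapter prescribes.

```bash
# stale_records RESOLVED LIVE
# Prints each address that DNS still returns but that no running task owns.
# Both arguments are whitespace-separated address lists supplied by the caller.
stale_records() {
  local ip live
  # Normalise the live list to single spaces, padded at both ends,
  # so every address can be matched as " <ip> ".
  live=" $(printf '%s' "$2" | tr -s '[:space:]' ' ') "
  for ip in $1; do
    case "$live" in
      *" $ip "*) ;;                 # still backed by a live task
      *) printf '%s\n' "$ip" ;;     # resolvable, but orphaned
    esac
  done
}

# Hypothetical sample: DNS returns two addresses, only one task is alive.
stale_records "10.0.1.5 10.0.1.9" "10.0.1.9"   # prints 10.0.1.5
```

Any output from a check like this after a task has moved would point at the failure mode described above.
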
### Against full Kubernetes (kubeadm)

Full Kubernetes is the de facto standard. It has the biggest ecosystem, the
most documentation, the most mindshare. Against it:

1. **Operational overhead.** A kubeadm-built cluster runs about six core
   processes: four control-plane components (kube-apiserver, etcd,
   kube-scheduler, kube-controller-manager) plus kubelet and kube-proxy
   on every node. Each needs monitoring, upgrading, and understanding.
   K3s bundles them into a single binary with sensible defaults.

2. **Memory.** A kubeadm control plane wants ~1 GB RAM baseline per
   control-plane node. On an 8 GB node that's about 12.5% gone before any
   workload runs. K3s is ~500 MB per server.

3. **Etcd.** kubeadm runs etcd as its own process for HA, either stacked
   on the control-plane nodes or as an external 3+ node cluster. K3s
   embeds etcd in the server binary; still Raft, still HA, but one less
   thing to install, upgrade, and monitor.

4. **Cluster creation UX.** `kubeadm init` + certificate distribution + CNI
   install + storage class setup is a multi-step dance. With K3s,
   `curl -sfL https://get.k3s.io | sh -s - server --cluster-init` plus two
   joins is a 10-minute cluster.

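The memory point in item 2 is simple arithmetic, sketched here so the figures can be re-run for a different node size. The footprints are the approximate values from the comparison table, not measurements. The function name `control_plane_share` is made up for this example.

```bash
# control_plane_share USED_MB NODE_MB
# Prints the share of node RAM used by the control plane, to one decimal.
control_plane_share() {
  awk -v used="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * used / total }'
}

node_mb=8192   # an 8 GB node, as in the text

# name:approximate control-plane footprint in MB (from the table above)
for entry in "swarm:200" "k3s:500" "kubeadm:1024"; do
  name=${entry%%:*}
  used=${entry##*:}
  printf '%-8s %5s MB  %5s%%\n' "$name" "$used" \
    "$(control_plane_share "$used" "$node_mb")"
done
```

With these inputs kubeadm comes out at 12.5% of the node and K3s at roughly half that.
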
**What we'd lose by not using full Kubernetes:** nothing that matters at
our scale. K3s is a conformant Kubernetes distribution: `kubectl` commands,
Helm charts, and manifests behave the same, except where they depend on
the in-tree cloud-provider integrations K3s strips out. If we ever need to
migrate to full Kubernetes, we re-apply the manifests in
`deploy-k3s/manifests/` on the new cluster. Note that
`kubectl get all -A -o yaml` is not a complete export: `all` skips
ConfigMaps, Secrets, Ingresses, PersistentVolumeClaims, RBAC objects, and
custom resources.

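One caveat when exporting live state for such a migration: `kubectl get all` returns only the kinds in kubectl's `all` category, so ConfigMaps, Secrets, Ingresses, and PVCs have to be requested by name. A minimal sketch of a fuller export follows. The function name `export_cluster`, the `KUBECTL` override, and the list of kinds are assumptions for illustration, and the list is not exhaustive. The exported Secrets are only base64-encoded, so the output directory needs the same care as the secrets themselves.

```bash
# Kinds that `kubectl get all` skips but a migration usually needs.
# Illustrative only; extend for your cluster (CRDs, NetworkPolicies, ...).
EXTRA_KINDS="configmaps secrets ingresses persistentvolumeclaims serviceaccounts roles rolebindings"

# export_cluster OUTDIR
# Writes one YAML file per kind, across all namespaces.
# Set KUBECTL to override the client, e.g. KUBECTL="sudo k3s kubectl".
export_cluster() {
  local outdir="$1" kind
  local kubectl="${KUBECTL:-kubectl}"
  mkdir -p "$outdir"
  # $kubectl is deliberately unquoted so a multi-word override still works.
  $kubectl get all -A -o yaml > "$outdir/all.yaml"
  for kind in $EXTRA_KINDS; do
    $kubectl get "$kind" -A -o yaml > "$outdir/$kind.yaml"
  done
}
```

A call such as `export_cluster ./cluster-export` would complement the manifests in the repository; it is not a substitute for them, since exported objects carry cluster-specific fields.
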
### Against HashiCorp Nomad

Nomad is very good at what it does: simpler than Kubernetes, more robust
than Swarm, with real load balancing (via Consul Connect), and it ships as
a single small binary (~80 MB).

Against it:

1. **Ecosystem is smaller.** Far fewer community Helm charts, operators,
   tutorials. Every new component needs bespoke integration.
2. **Service discovery requires Consul.** Two products to operate, not one.
3. **Ingress requires a separate tool** (Traefik, HAProxy, Fabio). K3s
   bundles Traefik by default.
4. **Secrets management** requires Vault or relies on Nomad's template
   stanza. Not bad, but more moving parts.
5. **The operator hasn't used Nomad in production before.** Climbing a
   learning curve on a new platform during a prod migration is a bad trade.

Nomad would be a defensible choice. K3s won primarily on ecosystem
maturity and the operator's familiarity with Kubernetes primitives.

## What K3s actually is

K3s is a CNCF Sandbox project, created by Rancher Labs (now part of SUSE)
and originally designed for edge and IoT. Its design goals:

- Single static binary
- Works on ARM64 and AMD64
- Bundles everything needed for a working cluster: containerd, Flannel,
  CoreDNS, Traefik, metrics-server, local-path storage provisioner, and
  the servicelb (klipper-lb) load balancer, which can be disabled
- Replaces the kubeadm setup dance with `curl | sh`
- Replaces etcd-in-its-own-cluster with embedded etcd (or SQLite for
  single-node)
- Replaces Docker with containerd (though you can opt back into Docker)

It is **not** a fork of Kubernetes. K3s is Kubernetes, packaged differently.
The Kubernetes Go code it wraps is essentially unmodified (aside from
build-time stripping of cloud provider integrations you don't need).
`kubectl`, the API, CRDs, operators: all identical.

## HA architecture we chose

```mermaid
flowchart TB
    subgraph Cluster[k3s HA cluster]
        subgraph N1[hetzner1]
            K1[k3s server]
            E1[etcd]
            KUB1[kubelet]
            TR1[Traefik pod<br/>hostNet :80/:443]
            P1[app pods]
        end
        subgraph N2[hetzner2]
            K2[k3s server]
            E2[etcd]
            KUB2[kubelet]
            TR2[Traefik pod<br/>hostNet :80/:443]
            P2[app pods]
        end
        subgraph N3[hetzner3]
            K3[k3s server]
            E3[etcd]
            KUB3[kubelet]
            TR3[Traefik pod<br/>hostNet :80/:443]
            P3[app pods]
        end
    end

    E1 <--Raft--> E2 <--Raft--> E3
    E1 <--Raft--> E3

    K1 & K2 & K3 --- API[kube-apiserver<br/>port 6443]
```

### ASCII fallback

```
    hetzner1            hetzner2            hetzner3
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ k3s server   │    │ k3s server   │    │ k3s server   │
│  ├ API :6443 │    │  ├ API :6443 │    │  ├ API :6443 │
│  ├ etcd      │    │  ├ etcd      │    │  ├ etcd      │
│  ├ kubelet   │    │  ├ kubelet   │    │  ├ kubelet   │
│  └ pods      │    │  └ pods      │    │  └ pods      │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
       └────── Raft ───────┴────── Raft ───────┘
```

All three nodes are **server** nodes (in k3s terminology): they all run
`kube-apiserver`, `kube-scheduler`, `kube-controller-manager`, and
participate in etcd Raft consensus. A fourth "agent" node could be added
as worker-only; we don't need that capacity yet.

**Quorum**: 2 out of 3 nodes must agree on writes. The cluster stays
operational if any one node dies. Two dying nodes = cluster loses quorum
(Raft halts) until a majority returns.

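The quorum rule generalises to any number of servers: a majority is `floor(n / 2) + 1` votes, and the cluster tolerates as many failures as are left over. A small sketch of that arithmetic follows; the function names are made up for this example.

```bash
# quorum N: votes needed for a majority among N voting members
quorum() { echo $(( $1 / 2 + 1 )); }

# tolerated_failures N: members that can fail while a majority survives
tolerated_failures() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5; do
  printf 'servers=%d quorum=%d tolerates=%d\n' \
    "$n" "$(quorum "$n")" "$(tolerated_failures "$n")"
done
```

For three servers this gives a quorum of 2 and one tolerated failure, matching the paragraph above. A fourth server would raise the quorum to 3 without tolerating a second failure, which is why etcd clusters are normally sized at odd numbers.
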
## What we disabled

We ran the k3s install with `--disable=servicelb`. `servicelb` (a.k.a.
`klipper-lb`) is K3s's built-in load balancer: a DaemonSet that listens on
each node's host ports and proxies to `LoadBalancer`-typed services. Fine
for dev; we don't need it because we handle ingress with Traefik in
DaemonSet+hostNetwork mode (Chapter 6).

We did **not** disable:
- **traefik**: we reconfigured it via HelmChartConfig rather than
  disable-and-replace. See Chapter 6.
- **local-path-provisioner**: provides the default `StorageClass` we use
  for the Redis PVC (Chapter 7).
- **metrics-server**: required for `kubectl top` and HorizontalPodAutoscaler.
- **coredns**: the cluster DNS. Essential for service discovery.

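Put together, the install flags described above might look like the following for a three-server cluster. This is a dry-run sketch: the functions only print the commands. The IP address, the token value, and the function names are placeholders invented for this example, and passing the shared secret through `K3S_TOKEN` is an assumption about how the nodes were joined. `MIGRATION_NOTES.md` remains the record of what was actually run.

```bash
# Hypothetical values; substitute your own.
FIRST_SERVER_IP="10.0.0.2"               # assumed private IP of the first server
K3S_TOKEN="replace-with-shared-secret"   # assumed shared join secret

# Command for the first server: initialises the embedded etcd cluster.
first_server_cmd() {
  printf '%s\n' "curl -sfL https://get.k3s.io | K3S_TOKEN=$K3S_TOKEN sh -s - server --cluster-init --disable=servicelb"
}

# Command for servers two and three: join the existing cluster as servers.
join_server_cmd() {
  printf '%s\n' "curl -sfL https://get.k3s.io | K3S_TOKEN=$K3S_TOKEN sh -s - server --server https://$FIRST_SERVER_IP:6443 --disable=servicelb"
}

first_server_cmd
join_server_cmd
```

Repeating `--disable=servicelb` on every server keeps the server configuration identical across the three nodes.
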
## Version choices

### K3s v1.34.6+k3s1

This was the latest stable K3s release as of 2026-04-24. K3s follows
upstream Kubernetes' release cadence: `1.34` matches Kubernetes 1.34.x.
The `+k3s1` suffix is the K3s build number within that upstream version.

**Upgrade policy**: upstream Kubernetes ships a new minor version roughly
every four months, and K3s follows. We'd upgrade in place to 1.35, one
minor version at a time, once it has been out ~30 days and has no open
critical bugs in the release notes. See Chapter 17 for the procedure.

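The version string can be taken apart with plain shell parameter expansion, which is handy in upgrade scripts. The sketch below splits it into the upstream version and the K3s build number, and adds a guard for the general Kubernetes rule that upgrades should not skip a minor version. All function names are made up for this example.

```bash
# split_k3s_version VERSION: prints "upstream=<x.y.z> build=<n>"
split_k3s_version() {
  local v="${1#v}"                 # drop the leading "v"
  local upstream="${v%%+*}"        # text before "+": upstream Kubernetes version
  local build="${v##*+k3s}"        # digits after "+k3s": K3s build number
  echo "upstream=$upstream build=$build"
}

# minor_of VERSION: prints the minor number (34 for v1.34.6+k3s1)
minor_of() {
  local v="${1#v}"
  v="${v#*.}"                      # strip the major component
  echo "${v%%.*}"                  # keep everything up to the next dot
}

# is_single_minor_step FROM TO: succeeds when TO is zero or one minor ahead
is_single_minor_step() {
  local d=$(( $(minor_of "$2") - $(minor_of "$1") ))
  [ "$d" -ge 0 ] && [ "$d" -le 1 ]
}

split_k3s_version "v1.34.6+k3s1"   # prints upstream=1.34.6 build=1
```

For example, `is_single_minor_step v1.34.6+k3s1 v1.35.0+k3s1` succeeds, while a jump straight to a 1.36 release fails the check.
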
### containerd v2.2.2

Bundled with K3s. containerd speaks the Kubernetes CRI natively (no
`cri-dockerd` shim needed), and the 2.x line brought cleanup and
performance improvements over 1.x. We don't pin containerd separately; we
take whatever K3s ships.

### Flannel (VXLAN backend)

Bundled with K3s as the default CNI. Flannel's VXLAN backend is
straightforward, performant enough, and has worked reliably in every K3s
install we've seen. Alternatives (Calico, Cilium) offer more features but
add operational complexity.

See [Chapter 3](./03-networking.md) for a deep dive on the networking
layer.

## What we did NOT choose from K3s' ecosystem

- **servicelb / klipper-lb**: off. Reason above.
- **embedded SQLite**: on single-node k3s, SQLite replaces etcd. We're
  multi-node, so this doesn't apply.
- **`--flannel-backend=wireguard-native`**: WireGuard-encrypted overlay.
  We didn't enable it because (a) VXLAN already works, (b) our node-to-node
  traffic stays within Hetzner's internal network anyway, and (c) we haven't
  proven we need it. Encryption is a TODO (Chapter 20).

## Raft and split-brain behavior

If the 3 nodes become network-partitioned such that two nodes can still
see each other but not the third (a "2-1 split"):

- **Majority partition (2 nodes)**: retains quorum, cluster keeps
  accepting writes. Pods on those 2 nodes keep running. The isolated
  node is marked `NotReady` after `node-monitor-grace-period` (default
  40s in older releases; newer releases default to 50s). Its pods are
  then evicted once their toleration for the
  `node.kubernetes.io/unreachable` taint expires (default 5 min; this
  taint-based mechanism replaced the older `pod-eviction-timeout` flag)
  and are rescheduled onto the surviving nodes.
- **Minority partition (1 node)**: loses quorum. API server on that
  node refuses writes; existing pods keep running (kubelet doesn't need
  the API server for already-scheduled pods), but nothing new can deploy,
  scale, or reschedule.

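The two timers in the majority-partition case add up, which gives a rough lower bound on how long a workload on the isolated node stays unscheduled elsewhere. A sketch of that sum, using the defaults quoted above as inputs, follows. Both values are assumptions to verify against your cluster's actual settings, and the function name is made up.

```bash
# reschedule_delay GRACE TOLERATION
# Seconds from losing contact with a node until its pods become
# eligible for rescheduling on other nodes.
reschedule_delay() {
  local grace="$1"        # node-monitor-grace-period, in seconds
  local toleration="$2"   # eviction delay after NotReady, in seconds
  echo $(( grace + toleration ))
}

reschedule_delay 40 300   # prints 340
```

With those inputs, pods on an isolated node are not replaced for well over five minutes. That matters for singleton workloads with no second replica to absorb the gap.
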
When the partition heals, Raft reconciles automatically. The minority
node catches up on etcd state via snapshot+replay.

**Worst case** (all 3 isolated from each other): no quorum, no node is
authoritative. Pods keep running from existing state; nothing can be
updated. This requires all three nodes losing network to each other
simultaneously, which implies Hetzner's entire internal switching is
broken. At that point, the whole region is likely down anyway.

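The partition cases in this section all follow from one test: does a partition hold a majority of the cluster's servers? A small sketch that applies the test to any split follows; the function name and message wording are made up for this example.

```bash
# partition_report TOTAL SIZE...
# Prints one line per partition saying whether it holds a majority of TOTAL.
partition_report() {
  local total="$1"; shift
  local need=$(( total / 2 + 1 ))
  local size
  for size in "$@"; do
    if [ "$size" -ge "$need" ]; then
      echo "partition of $size: has quorum, accepts writes"
    else
      echo "partition of $size: no quorum, existing pods only"
    fi
  done
}

partition_report 3 2 1     # the 2-1 split
partition_report 3 1 1 1   # every node isolated
```

The first call reports quorum only for the two-node side. The second reports no quorum anywhere, which is the worst case described above.
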
## Our decision in one sentence

K3s gave us the Kubernetes API (enormous ecosystem, known primitives, our
existing scaffold in `deploy-k3s/manifests/`) without the operational
overhead of kubeadm; and, unlike Swarm's libnetwork, its CoreDNS-based
service discovery has been reliable for us.

## Operator cheat sheet

```bash
# On any k3s server node, use the kubectl bundled with k3s (needs root):
sudo k3s kubectl get nodes

# From workstation, use the copied kubeconfig:
export KUBECONFIG=~/.kube/honeydue-k3s.yaml
kubectl get nodes

# Check k3s service:
ssh deploy@hetzner1 "sudo systemctl status k3s"

# Watch cluster events live:
kubectl get events -A --watch

# See what's on each node (sorted by node name; piping through `sort -k 8`
# breaks when the RESTARTS column contains spaces):
kubectl get pods -A -o wide --sort-by=.spec.nodeName
```

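For scripted checks, the `kubectl get nodes` output from the cheat sheet can be reduced to a single number. The sketch below counts nodes whose STATUS column is exactly `Ready`; the function name `count_ready` and the sample lines in the comment are hypothetical.

```bash
# count_ready
# Reads `kubectl get nodes --no-headers` output on stdin and prints how many
# nodes report exactly "Ready" in the STATUS (second) column.
# "NotReady" and "Ready,SchedulingDisabled" are deliberately not counted.
count_ready() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Typical use, from a workstation with KUBECONFIG set:
#   kubectl get nodes --no-headers | count_ready
```

On this three-server cluster, a result below 2 would mean etcd quorum is at risk or already lost.
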
## References

- [K3s architecture][k3s-arch]
- [K3s requirements][k3s-reqs]
- [Mirantis Swarm support announcement][mirantis-swarm]
- [moby/moby#52265: libnetwork stale records][moby-52265]
- [moby/moby#51491: DNS broken after swarm init][moby-51491]
- [Dokploy #3480: Traefik stale VIP on Swarm][dokploy-3480]
- [Better Stack: Hetzner Cloud Review 2026][bstack-swarm]
- [VirtualizationHowTo: Is Docker Swarm Still Safe in 2026?][vht-swarm]

[k3s-arch]: https://docs.k3s.io/architecture
[k3s-reqs]: https://docs.k3s.io/installation/requirements
[mirantis-swarm]: https://www.mirantis.com/blog/mirantis-guarantees-long-term-support-for-swarm/
[moby-52265]: https://github.com/moby/moby/issues/52265
[moby-51491]: https://github.com/moby/moby/issues/51491
[dokploy-3480]: https://github.com/Dokploy/dokploy/issues/3480
[bstack-swarm]: https://betterstack.com/community/guides/web-servers/hetzner-cloud-review/
[vht-swarm]: https://www.virtualizationhowto.com/2026/03/is-docker-swarm-still-safe-in-2026/