honeyDueAPI/docs/deployment/02-orchestrator-choice.md
Trey t 6f303dbbaa
Migrate prod deploy from Swarm to K3s; add full deployment book
Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
  callback (was causing 'unlock of unlocked mutex' fatal after
  Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
  + allowlist fonts.googleapis.com so the marketing landing page CSS
  actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
  --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
  images runnable on x86_64 Hetzner nodes; fix array expansion under
  set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
  top-level aliases (the '${X_SECRET}' form never actually resolved);
  dozzle ports: long-form host_ip is rejected by Swarm, switched to
  short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
  worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
  (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
  12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
  real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
  and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
  hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
  without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations,
    consolidated citations
- README.md: Production Deployment section replaced with pointer to
  the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 07:20:54 -05:00


# 02 — Orchestrator Choice
## Summary
We run K3s — a lightweight Kubernetes distribution from SUSE/Rancher Labs.
This wasn't our first choice. We originally deployed on Docker Swarm and
spent a long afternoon hitting a libnetwork bug before migrating. This
chapter walks through the comparison of the three realistic orchestrators
(Docker Swarm, full Kubernetes, and K3s) and a fourth (Nomad) we
considered and rejected. The story of the Swarm→k3s migration is in
[Chapter 19](./19-postmortem-swarm.md); this chapter is about the decision
framework.
## The decision
**K3s v1.34.6+k3s1**, HA mode, three control-plane nodes with embedded etcd.
## Candidates considered
| | Docker Swarm | K3s | Full Kubernetes (kubeadm) | HashiCorp Nomad |
|---|---|---|---|---|
| Learning curve | Easiest | Medium | Hardest | Easy |
| Install on 3 nodes | `docker swarm init/join` | `curl \| sh` per node | Many steps | `nomad server/agent` |
| Memory footprint (control plane) | ~200 MB per node | ~500 MB per node | ~1 GB per node | ~200 MB per node |
| Service discovery | libnetwork (buggy) | CoreDNS | CoreDNS | Consul |
| HA quorum | Raft (3+ managers) | Raft via embedded etcd (3+ servers) | etcd cluster (3+ nodes) | Raft (3+ servers) |
| Secrets management | Swarm secrets | k8s Secrets | k8s Secrets | Vault or file-backed |
| Rolling updates | Swarm update_config | Deployments | Deployments | job update stanza |
| Ingress | None (third-party) | Traefik bundled | None (install yourself) | None (install yourself) |
| Active development | Maintenance mode | Active | Active | Active |
| Industry momentum | Declining | Growing | Dominant | Niche |
## Why K3s
### Against Docker Swarm
Swarm was our first pick because it's the simplest "production-like"
option. `docker swarm init` gives you a working cluster in seconds. It's
built into the Docker daemon you already have.

What killed it:
1. **libnetwork state bugs.** Swarm's service discovery relies on
libnetwork's gossip-backed service registry. When a service's task
migrates between nodes, the old endpoint record isn't always removed
cleanly — especially on encrypted overlays or during transient network
partitions. The result: stale DNS A-records that persist indefinitely,
survive service removal, survive containerd restarts, survive pretty
much everything except recreating the overlay network. Multiple open
issues track this: [moby/moby#52265][moby-52265],
[moby/moby#51491][moby-51491], [Dokploy#3480][dokploy-3480].
2. **It's in maintenance mode.** Mirantis [committed to supporting
Swarm through 2030][mirantis-swarm] as part of Mirantis Kubernetes
Engine 3, but nothing is being actively developed. The libnetwork code
has no champion; bug fixes land slowly and often incompletely (the
29.0.0 partial fix for #50236, the 29.3.0 regression, the pending
follow-up in #52289 — months apart).
3. **Industry signal.** The 2026 write-ups we found on "should I pick Swarm"
reach the same conclusion: keep running what already works, but don't bet
new workloads on it. [Better Stack][bstack-swarm] and
[VirtualizationHowTo][vht-swarm] are representative.

The [Chapter 19 postmortem](./19-postmortem-swarm.md) details the specific
bug we hit, the workarounds we tried, and why each failed.
### Against full Kubernetes (kubeadm)
Full Kubernetes is the de-facto standard. It has the biggest ecosystem, the
most documentation, the most mindshare. Against it:
1. **Operational overhead.** A kubeadm-built control-plane node runs about
six separate components (kube-apiserver, etcd, kube-scheduler,
kube-controller-manager, kube-proxy, kubelet), each of which needs
monitoring, upgrading, and understanding. K3s bundles them into a single
binary with sensible defaults.
2. **Memory.** A kubeadm control plane wants ~1 GB RAM baseline per master
node. On an 8 GB node that's about 12% gone before any workload runs. K3s is
~500 MB per master.
3. **Etcd.** For HA, kubeadm expects a 3+ member etcd cluster that you
operate yourself, typically stacked on the control-plane nodes but still a
separate process. K3s embeds etcd in the server binary; still Raft, still
HA, but one less thing to install/upgrade/monitor.
4. **Cluster creation UX.** `kubeadm init` + certificate distribution + CNI
install + storage class setup is a multi-step dance. With K3s,
`curl -sfL https://get.k3s.io | sh -s - server --cluster-init` plus two
joins is a 10-minute cluster.

**What we'd lose by not using full Kubernetes:** nothing that matters at
our scale. K3s is 100% Kubernetes API-compatible: `kubectl`, Helm charts,
and our manifests work the same way. If we ever need to migrate to full
Kubernetes, we re-apply the manifests in `deploy-k3s/manifests/` on the new
cluster. Note that `kubectl get all -A -o yaml` is not a complete export:
despite the name, it omits ConfigMaps, Secrets, Ingresses,
PersistentVolumeClaims, and custom resources.
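
As a rough sketch of the `--cluster-init` plus two joins bootstrap from item 4 of the list above, the helper below only prints the install commands; it does not run them. The version string and `--disable=servicelb` come from this chapter. The `k3s_install_cmd` name, `hetzner1` as the join target, and passing the join token through `K3S_TOKEN` are illustrative assumptions, not a record of what was actually typed.

```shell
# Print (do not run) the K3s install command for one node.
# Assumption: the operator exports the join token as K3S_TOKEN beforehand.
K3S_VERSION="v1.34.6+k3s1"

k3s_install_cmd() {
  role="$1"
  first_server="${2:-}"
  base="curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=$K3S_VERSION"
  case "$role" in
    init)
      # First server: bootstraps the embedded etcd cluster.
      printf '%s\n' "$base sh -s - server --cluster-init --disable=servicelb"
      ;;
    join)
      # Remaining servers: join the first one on the API port (6443).
      if [ -z "$first_server" ]; then
        echo "join needs the first server's hostname" >&2
        return 1
      fi
      printf '%s\n' "$base K3S_TOKEN=\"\$K3S_TOKEN\" sh -s - server --server https://$first_server:6443 --disable=servicelb"
      ;;
    *)
      echo "usage: k3s_install_cmd init | join <first-server-host>" >&2
      return 1
      ;;
  esac
}

k3s_install_cmd init            # run the printed command on the first node
k3s_install_cmd join hetzner1   # run the printed command on nodes two and three
```

Printing instead of executing keeps the snippet safe to paste on a workstation; the output can be reviewed and then run over SSH on each node.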
### Against HashiCorp Nomad
Nomad is very good at what it does — simpler than Kubernetes, more robust
than Swarm, has real load balancing (via Consul Connect), and the
`nomad agent` binary is ~80 MB vs k3s' ~200 MB.

Against it:
1. **Ecosystem is smaller.** Far fewer community Helm charts, operators,
tutorials. Every new component needs bespoke integration.
2. **Service discovery requires Consul.** Two products to operate, not one.
3. **Ingress requires a separate tool** (Traefik, HAProxy, Fabio). K3s
bundles Traefik by default.
4. **Secrets management** requires Vault or relies on Nomad's template
stanza. Not bad, but more moving parts.
5. **The operator hasn't used Nomad in production before.** Learning curve
on a new platform during a prod migration is a bad trade.

Nomad would be a defensible choice. K3s won primarily on ecosystem
maturity and the operator's familiarity with Kubernetes primitives.
## What K3s actually is
K3s is a CNCF Sandbox project, created by Rancher Labs (now part of SUSE)
and originally designed for edge and IoT. Its design goals:
- Single ~200 MB static binary
- Works on ARM64 and AMD64
- Bundles everything needed for a working cluster: containerd, Flannel,
CoreDNS, Traefik, metrics-server, local-path storage provisioner, and
(optionally) servicelb (klipper-lb) load balancer
- Replaces the kubeadm setup dance with `curl | sh`
- Replaces etcd-in-its-own-cluster with embedded etcd (or SQLite for
single-node)
- Replaces Docker with containerd (though you can opt back into Docker)

It is **not** a fork of Kubernetes. K3s is Kubernetes, packaged differently.
The Kubernetes Go code it wraps is unmodified (aside from build-time
stripping of cloud provider integrations you don't need). `kubectl`,
the API, CRDs, operators — all identical.
## HA architecture we chose
```mermaid
flowchart TB
subgraph Cluster[k3s HA cluster]
subgraph N1[hetzner1]
K1[k3s server]
E1[etcd]
KUB1[kubelet]
TR1[Traefik pod<br/>hostNet :80/:443]
P1[app pods]
end
subgraph N2[hetzner2]
K2[k3s server]
E2[etcd]
KUB2[kubelet]
TR2[Traefik pod<br/>hostNet :80/:443]
P2[app pods]
end
subgraph N3[hetzner3]
K3[k3s server]
E3[etcd]
KUB3[kubelet]
TR3[Traefik pod<br/>hostNet :80/:443]
P3[app pods]
end
end
E1 <--Raft--> E2 <--Raft--> E3
E1 <--Raft--> E3
K1 & K2 & K3 --- API[kube-apiserver<br/>port 6443]
```
### ASCII fallback
```
   hetzner1              hetzner2              hetzner3
┌────────────┐        ┌────────────┐        ┌────────────┐
│ k3s srv    │        │ k3s srv    │        │ k3s srv    │
│  ├ etcd    │<─Raft─>│  ├ etcd    │<─Raft─>│  ├ etcd    │
│  │  :6443  │        │  │  :6443  │        │  │  :6443  │
│  ├ kubelet │        │  ├ kubelet │        │  ├ kubelet │
│  └ pods    │        │  └ pods    │        │  └ pods    │
└─────┬──────┘        └────────────┘        └─────┬──────┘
      └────────────────── Raft ───────────────────┘
```
All three nodes are **server** nodes (in k3s terminology) — they all run
`kube-apiserver`, `kube-scheduler`, `kube-controller-manager`, and
participate in etcd Raft consensus. A fourth "agent" node could be added
as worker-only; we don't need that capacity yet.

**Quorum**: 2 out of 3 nodes must agree on writes. The cluster stays
operational if any one node dies. Two dying nodes = cluster loses quorum
(Raft halts) until a majority returns.
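
The quorum rule above is plain majority arithmetic. A small sketch (the function names are ours, chosen for illustration) that tabulates it for a few cluster sizes:

```shell
# Raft commits a write only when a strict majority of voting members agree.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# How many members can fail while a majority is still reachable.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
  echo "servers=$n quorum=$(quorum_size "$n") tolerates=$(fault_tolerance "$n") failure(s)"
done
```

Four servers tolerate no more failures than three (one in both cases), which is why server counts are kept odd: the next useful step up from our three nodes would be five.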
## What we disabled
We installed K3s with `--disable=servicelb`. `servicelb` (a.k.a.
`klipper-lb`) is K3s' built-in load balancer: a DaemonSet that listens on
each node's host ports and proxies to `LoadBalancer`-typed services. Fine
for dev; we don't need it because we handle ingress with Traefik in
DaemonSet+hostNetwork mode (Chapter 6).

We did **not** disable:
- **traefik** — we reconfigured it via HelmChartConfig rather than
disable-and-replace. See Chapter 6.
- **local-path-provisioner** — provides the default `StorageClass` we use
for Redis PVC (Chapter 7).
- **metrics-server** — required for `kubectl top` and HorizontalPodAutoscaler.
- **coredns** — the cluster DNS. Essential for service discovery.
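
The same choice can live in K3s' config file instead of on the install command line; K3s reads `/etc/rancher/k3s/config.yaml`, whose keys mirror the CLI flag names. The sketch below shows that equivalent form and is not necessarily how our nodes were installed. `K3S_CONFIG_DIR` and `write_k3s_config` are hypothetical names so the snippet can be tried without touching a real node.

```shell
# Render a K3s server config equivalent to --disable=servicelb.
# K3S_CONFIG_DIR is a hypothetical override for safe experimentation;
# on a real node the file belongs at /etc/rancher/k3s/config.yaml.
K3S_CONFIG_DIR="${K3S_CONFIG_DIR:-$(mktemp -d)}"

write_k3s_config() {
  mkdir -p "$K3S_CONFIG_DIR"
  cat > "$K3S_CONFIG_DIR/config.yaml" <<'EOF'
# Only servicelb is turned off. The other packaged components
# (traefik, local-storage, metrics-server, coredns) stay enabled.
disable:
  - servicelb
EOF
}

write_k3s_config
cat "$K3S_CONFIG_DIR/config.yaml"
```

Keeping the list in a file makes the disabled set reviewable in version control, at the cost of one more file to distribute to each server.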
## Version choices
### K3s v1.34.6+k3s1
This was the latest stable K3s release as of 2026-04-24. K3s follows
upstream Kubernetes' release cadence — `1.34` matches Kubernetes 1.34.x.
The `+k3s1` suffix is the K3s build number within that upstream version.

**Upgrade policy**: Kubernetes ships a new minor version roughly every four
months, and K3s follows. We'd upgrade in place to 1.35 once it has been out
~30 days and its release notes list no open critical bugs. See Chapter 17
for the procedure.
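
To make the version format concrete, here is a small sketch that splits a K3s version string into the upstream Kubernetes version and the K3s build suffix, then derives the next minor line. The helper names are ours and purely illustrative.

```shell
# v1.34.6+k3s1 -> 1.34.6 (the upstream Kubernetes version)
k3s_upstream_version() {
  v="${1%%+*}"      # drop the "+k3s1" build suffix
  echo "${v#v}"     # drop the leading "v"
}

# v1.34.6+k3s1 -> k3s1 (the K3s build within that upstream version)
k3s_build() {
  echo "${1#*+}"
}

# v1.34.6+k3s1 -> 34 (the Kubernetes minor version)
k3s_minor() {
  v="$(k3s_upstream_version "$1")"
  v="${v#*.}"       # strip the major component
  echo "${v%%.*}"   # strip the patch component
}

current="v1.34.6+k3s1"
echo "upstream: $(k3s_upstream_version "$current")"
echo "build:    $(k3s_build "$current")"
echo "next minor line to evaluate: 1.$(( $(k3s_minor "$current") + 1 ))"
```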
### containerd v2.2.2
Bundled with K3s. containerd speaks the Kubernetes CRI natively, so no
Docker Engine or `cri-dockerd` shim sits in the path.
We don't pin containerd separately; we take whatever K3s ships.
### Flannel (VXLAN backend)
Bundled with K3s as the default CNI. Flannel's VXLAN backend is
straightforward, performant enough, and has worked reliably in every K3s
install we've seen. Alternatives (Calico, Cilium) are more featureful but
add operational complexity.

See [Chapter 3](./03-networking.md) for a deep dive on the networking
layer.
## What we did NOT choose from K3s' ecosystem
- **servicelb / klipper-lb** — off. Reason above.
- **embedded SQLite** — on single-node k3s, SQLite replaces etcd. We're
multi-node, so this doesn't apply.
- **`--flannel-backend=wireguard-native`** — WireGuard-encrypted overlay.
We didn't enable it because (a) VXLAN already works, (b) our node-to-node
traffic stays within Hetzner's internal network anyway, and (c) we haven't
proven we need it. Encryption is a TODO (Chapter 20).
## Raft and split-brain behavior
If a network partition splits the 3 nodes so that two can still reach each
other but neither can reach the third (a "2-1 split"):
- **Majority partition (2 nodes)**: retains quorum, so the cluster keeps
accepting writes. Pods on those 2 nodes keep running. The isolated node is
marked `NotReady` after `node-monitor-grace-period` (default 40s), and its
pods are evicted and rescheduled onto the surviving nodes once their
default 5-minute `node.kubernetes.io/unreachable` toleration expires (the
taint-based successor to the old `pod-eviction-timeout` flag).
- **Minority partition (1 node)**: loses quorum. The API server on that
node refuses writes; existing pods keep running (kubelet doesn't need
the API server for already-scheduled pods), but nothing new can deploy,
scale, or reschedule.

When the partition heals, Raft reconciles automatically. The minority
node catches up on etcd state via snapshot+replay.

**Worst case** (all 3 isolated from each other): no quorum, no node is
authoritative. Pods keep running from existing state; nothing can be
updated. This requires all three nodes to lose network connectivity to each
other simultaneously, which implies Hetzner's entire internal switching is
broken; at that point, the whole region is likely down anyway.
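
The partition cases above can be checked mechanically. The sketch below (function and variable names are illustrative) classifies each side of a split and adds up the two timers mentioned in the text, using the values quoted in this chapter rather than values read from a live cluster.

```shell
# Report which side(s) of a partition keep quorum.
# Usage: split_report <total-servers> <partition-size>...
split_report() {
  total="$1"
  shift
  for members in "$@"; do
    if [ "$members" -ge $(( total / 2 + 1 )) ]; then
      echo "partition of $members: quorum"
    else
      echo "partition of $members: no quorum"
    fi
  done
}

split_report 3 2 1     # the 2-1 split
split_report 3 1 1 1   # worst case: every node isolated

# Rough time before pods from an isolated node are rescheduled elsewhere.
NODE_MONITOR_GRACE=40     # seconds, default quoted in this chapter
EVICTION_DELAY=300        # seconds (5 min), default quoted in this chapter
echo "reschedule after roughly $(( NODE_MONITOR_GRACE + EVICTION_DELAY ))s"
```

The sum is an approximation: it ignores scheduling and container start time on the surviving nodes.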
## Our decision in one sentence
K3s gave us the Kubernetes API (enormous ecosystem, known primitives, our
existing scaffold in `deploy-k3s/manifests/`) without the operational
overhead of kubeadm; and unlike Swarm, its service-discovery layer is
rock-solid.
## Operator cheat sheet
```bash
# On any k3s server node, root commands use k3s-wrapped kubectl:
sudo k3s kubectl get nodes
# From workstation, use the copied kubeconfig:
export KUBECONFIG=~/.kube/honeydue-k3s.yaml
kubectl get nodes
# Check k3s service:
ssh deploy@hetzner1 "sudo systemctl status k3s"
# Watch cluster events live:
kubectl get events -A --watch
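# See what's on each node (sorted by node name; column-based `sort` breaks
# when the RESTARTS column contains spaces):
kubectl get pods -A -o wide --sort-by=.spec.nodeName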
```
## References
- [K3s architecture][k3s-arch]
- [K3s requirements][k3s-reqs]
- [Mirantis Swarm support announcement][mirantis-swarm]
- [moby/moby#52265 — libnetwork stale records][moby-52265]
- [moby/moby#51491 — DNS broken after swarm init][moby-51491]
- [Dokploy #3480 — Traefik stale VIP on Swarm][dokploy-3480]
- [Better Stack: Hetzner Cloud Review 2026][bstack-swarm]
- [VirtualizationHowTo: Is Docker Swarm Still Safe in 2026?][vht-swarm]

[k3s-arch]: https://docs.k3s.io/architecture
[k3s-reqs]: https://docs.k3s.io/installation/requirements
[mirantis-swarm]: https://www.mirantis.com/blog/mirantis-guarantees-long-term-support-for-swarm/
[moby-52265]: https://github.com/moby/moby/issues/52265
[moby-51491]: https://github.com/moby/moby/issues/51491
[dokploy-3480]: https://github.com/Dokploy/dokploy/issues/3480
[bstack-swarm]: https://betterstack.com/community/guides/web-servers/hetzner-cloud-review/
[vht-swarm]: https://www.virtualizationhowto.com/2026/03/is-docker-swarm-still-safe-in-2026/