# 02 — Orchestrator Choice

## Summary

We run K3s — a lightweight Kubernetes distribution from SUSE/Rancher Labs.
This wasn't our first choice. We originally deployed on Docker Swarm and
spent a long afternoon hitting a libnetwork bug before migrating. This
chapter walks through the comparison of the three realistic orchestrators
(Docker Swarm, full Kubernetes, and K3s) and a fourth (Nomad) we
considered and rejected. The story of the Swarm→k3s migration is in
[Chapter 19](./19-postmortem-swarm.md); this chapter is about the decision
framework.

## The decision

**K3s v1.34.6+k3s1**, HA mode, three control-plane nodes with embedded etcd.

## Candidates considered

| | Docker Swarm | K3s | Full Kubernetes (kubeadm) | HashiCorp Nomad |
|---|---|---|---|---|
| Learning curve | Easiest | Medium | Hardest | Easy |
| Install on 3 nodes | `docker swarm init/join` | `curl \| sh` per node | Many steps | `nomad server/agent` |
| Memory footprint (control plane) | ~200 MB per node | ~500 MB per node | ~1 GB per node | ~200 MB per node |
| Service discovery | libnetwork (buggy) | CoreDNS | CoreDNS | Consul |
| HA quorum | Raft (3+ managers) | Raft via embedded etcd (3+ servers) | etcd cluster (3+ nodes) | Raft (3+ servers) |
| Secrets management | Swarm secrets | k8s Secrets | k8s Secrets | Vault or file-backed |
| Rolling updates | Swarm update_config | Deployments | Deployments | job update stanza |
| Ingress | None (third-party) | Traefik bundled | None (install yourself) | None (install yourself) |
| Active development | Maintenance mode | Active | Active | Active |
| Industry momentum | Declining | Growing | Dominant | Niche |

## Why K3s

### Against Docker Swarm

Swarm was our first pick because it's the simplest "production-like"
option. `docker swarm init` gives you a working cluster in seconds. It's
built into the Docker daemon you already have.

What killed it:

1. **libnetwork state bugs.** Swarm's service discovery relies on
   libnetwork's gossip-backed service registry. When a service's task
   migrates between nodes, the old endpoint record isn't always removed
   cleanly — especially on encrypted overlays or during transient network
   partitions. The result: stale DNS A-records that persist indefinitely,
   survive service removal, survive containerd restarts, survive pretty
   much everything except recreating the overlay network. Multiple open
   issues track this: [moby/moby#52265][moby-52265],
   [moby/moby#51491][moby-51491], [Dokploy#3480][dokploy-3480].

2. **It's in maintenance mode.** Mirantis [committed to supporting
   Swarm through 2030][mirantis-swarm] as part of Mirantis Kubernetes
   Engine 3, but nothing is being actively developed. The libnetwork code
   has no champion; bug fixes land slowly and often incompletely (the
   29.0.0 partial fix for #50236, the 29.3.0 regression, the pending
   follow-up in #52289 — months apart).

3. **Industry signal.** The 2026 write-ups we read on "should I pick
   Swarm" reach the same conclusion: keep running what works, but don't
   bet new workloads on it. [Better Stack][bstack-swarm] and
   [VirtualizationHowTo][vht-swarm] are representative.

The [Chapter 19 postmortem](./19-postmortem-swarm.md) details the specific
bug we hit, the workarounds we tried, and why each failed.

### Against full Kubernetes (kubeadm)

Full Kubernetes is the de-facto standard. It has the biggest ecosystem, the
most documentation, the most mindshare. Against it:

1. **Operational overhead.** A kubeadm-built control-plane node runs about
   six separate components (kube-apiserver, etcd, kube-scheduler,
   kube-controller-manager, plus the node-level kube-proxy and kubelet),
   each of which needs monitoring, upgrading, and understanding. K3s
   bundles them into a single binary with sensible defaults.

2. **Memory.** A kubeadm control plane wants ~1 GB RAM baseline per master
   node. On an 8 GB node that's 12% gone before any workload runs. K3s is
   ~500 MB per master.

3. **Etcd.** Full Kubernetes expects a separate 3+ node etcd cluster for
   HA, typically on the same masters but as an independent process. K3s
   embeds etcd in the server binary; still Raft, still HA, but one less
   thing to install/upgrade/monitor.

4. **Cluster creation UX.** `kubeadm init` + certificate distribution + CNI
   install + storage class setup is a multi-step dance. K3s `curl -sfL
   https://get.k3s.io | sh -s - server --cluster-init` plus two joins is a
   10-minute cluster.
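
As a rough illustration of the "init plus two joins" flow, the sketch below only prints the install commands rather than running them. `TOKEN` and `FIRST_IP` are placeholders, not values from our cluster, and passing a shared `K3S_TOKEN` explicitly is an assumption of this sketch (the one-liner above lets K3s generate one). The `--disable=servicelb` flag is the one described under "What we disabled".

```shell
#!/usr/bin/env bash
# Sketch: build (but do not execute) the K3s install command per server.
# role = init  -> first server, bootstraps embedded etcd
# role = join  -> additional server, joins via the first server's API port

k3s_install_cmd() {
  local role="$1" token="$2" first_ip="${3:-}"
  local base="curl -sfL https://get.k3s.io | K3S_TOKEN=${token} sh -s - server"
  case "$role" in
    init) echo "${base} --cluster-init --disable=servicelb" ;;
    join) echo "${base} --server https://${first_ip}:6443 --disable=servicelb" ;;
    *)    echo "unknown role: ${role}" >&2; return 1 ;;
  esac
}

# Placeholder values; substitute real ones before running the printed commands.
k3s_install_cmd init "TOKEN"
k3s_install_cmd join "TOKEN" "FIRST_IP"
k3s_install_cmd join "TOKEN" "FIRST_IP"
```

Printing the commands first makes it easy to confirm that all three servers get the same `--disable` flags, which K3s expects to match across server nodes.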

**What we'd lose by not using full Kubernetes:** nothing that matters at
our scale. K3s is a conformant Kubernetes distribution: the same API, the
same `kubectl` commands, the same Helm charts and manifests. If we ever
need to migrate to full Kubernetes, we re-apply our manifests
(`deploy-k3s/manifests/`) on the new cluster. Note that
`kubectl get all -A -o yaml` is not a complete export: the `all` alias
skips ConfigMaps, Secrets, Ingresses, PVCs, and custom resources.
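
If a live export is ever wanted as a cross-check before a migration, one option is to iterate over every listable namespaced resource type instead of relying on the `all` alias. This is a minimal sketch, not a tested migration tool: `export_namespaced` and the output directory are hypothetical names, and the dump will include Secrets in base64 form plus server-populated fields (status, UIDs) that need cleaning before re-applying.

```shell
#!/usr/bin/env bash
# Sketch: write one YAML file per namespaced resource type.
# Requires kubectl and a working kubeconfig when actually invoked.

export_namespaced() {
  local out_dir="$1" kind
  mkdir -p "$out_dir"
  # api-resources lists every type the API server can "list", including CRDs.
  kubectl api-resources --verbs=list --namespaced -o name |
    while read -r kind; do
      kubectl get "$kind" -A -o yaml > "${out_dir}/${kind}.yaml"
    done
}

# Example invocation (hypothetical path):
#   export_namespaced ./cluster-export
```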

### Against HashiCorp Nomad

Nomad is very good at what it does — simpler than Kubernetes, more robust
than Swarm, has real load balancing (via Consul Connect), and the
`nomad agent` binary is ~80 MB vs k3s' ~200 MB.

Against it:

1. **Ecosystem is smaller.** Far fewer community Helm charts, operators,
   tutorials. Every new component needs bespoke integration.
2. **Service discovery requires Consul.** Two products to operate, not one.
3. **Ingress requires a separate tool** (Traefik, HAProxy, Fabio). K3s
   bundles Traefik by default.
4. **Secrets management** requires Vault or relies on Nomad's template
   stanza. Not bad, but more moving parts.
5. **The operator hasn't used Nomad in production before.** Learning curve
   on a new platform during a prod migration is a bad trade.

Nomad would be a defensible choice. K3s won primarily on ecosystem
maturity and the operator's familiarity with Kubernetes primitives.

## What K3s actually is

K3s is a CNCF Sandbox project, created by Rancher Labs (now part of SUSE)
and originally designed for edge and IoT. Its design goals:

- Single ~200 MB static binary
- Works on ARM64 and AMD64
- Bundles everything needed for a working cluster: containerd, Flannel,
  CoreDNS, Traefik, metrics-server, local-path storage provisioner, and
  (optionally) servicelb (klipper-lb) load balancer
- Replaces the kubeadm setup dance with `curl | sh`
- Replaces etcd-in-its-own-cluster with embedded etcd (or SQLite for
  single-node)
- Replaces Docker with containerd (though you can opt back into Docker)

It is **not** a fork of Kubernetes. K3s is Kubernetes, packaged differently.
The Kubernetes Go code it wraps is unmodified (aside from build-time
stripping of cloud provider integrations you don't need). `kubectl`,
the API, CRDs, operators — all identical.

## HA architecture we chose

```mermaid
flowchart TB
    subgraph Cluster[k3s HA cluster]
        subgraph N1[hetzner1]
            K1[k3s server]
            E1[etcd]
            KUB1[kubelet]
            TR1[Traefik pod<br/>hostNet :80/:443]
            P1[app pods]
        end
        subgraph N2[hetzner2]
            K2[k3s server]
            E2[etcd]
            KUB2[kubelet]
            TR2[Traefik pod<br/>hostNet :80/:443]
            P2[app pods]
        end
        subgraph N3[hetzner3]
            K3[k3s server]
            E3[etcd]
            KUB3[kubelet]
            TR3[Traefik pod<br/>hostNet :80/:443]
            P3[app pods]
        end
    end

    E1 <--Raft--> E2 <--Raft--> E3
    E1 <--Raft--> E3

    K1 & K2 & K3 --- API[kube-apiserver<br/>port 6443]
```

### ASCII fallback

```
      hetzner1           hetzner2           hetzner3
    ┌───────────┐      ┌───────────┐      ┌───────────┐
    │ k3s srv   │      │ k3s srv   │      │ k3s srv   │
    │  :6443    │      │  :6443    │      │  :6443    │
    │ ├ etcd    │      │ ├ etcd    │      │ ├ etcd    │
    │ ├ kubelet │      │ ├ kubelet │      │ ├ kubelet │
    │ └ pods    │      │ └ pods    │      │ └ pods    │
    └─────┬─────┘      └─────┬─────┘      └─────┬─────┘
          │                  │                  │
          ├────── Raft ──────┴────── Raft ──────┤
          └──────────────── Raft ───────────────┘
```

All three nodes are **server** nodes (in k3s terminology) — they all run
`kube-apiserver`, `kube-scheduler`, `kube-controller-manager`, and
participate in etcd Raft consensus. A fourth "agent" node could be added
as worker-only; we don't need that capacity yet.

**Quorum**: 2 out of 3 nodes must agree on writes. The cluster stays
operational if any one node dies. Two dying nodes = cluster loses quorum
(Raft halts) until a majority returns.
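
The quorum arithmetic generalizes beyond three nodes: a Raft majority is `floor(n/2) + 1`, and the cluster tolerates `n` minus that many failures. A small sketch (the function names are ours, purely illustrative):

```shell
#!/usr/bin/env bash
# Raft majority for n voting members: floor(n/2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }

# Members that can fail while the remainder still forms a majority.
fault_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 3 4 5; do
  echo "servers=$n quorum=$(quorum "$n") tolerates=$(fault_tolerance "$n")"
done
```

For our 3 servers this prints `quorum=2 tolerates=1`. Note that 4 servers still tolerate only 1 failure, which is why server counts are kept odd.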

## What we disabled

We ran k3s install with `--disable=servicelb`. `servicelb` (a.k.a.
`klipper-lb`) is a trick where k3s spawns a daemonset that listens on a
node's host ports and proxies to `LoadBalancer`-typed services. Fine for
dev; we don't need it because we handle ingress with Traefik in
DaemonSet+hostNetwork mode (Chapter 6).

We did **not** disable:
- **traefik** — we reconfigured it via HelmChartConfig rather than
  disable-and-replace. See Chapter 6.
- **local-path-provisioner** — provides the default `StorageClass` we use
  for Redis PVC (Chapter 7).
- **metrics-server** — required for `kubectl top` and HorizontalPodAutoscaler.
- **coredns** — the cluster DNS. Essential for service discovery.

## Version choices

### K3s v1.34.6+k3s1

This was the latest stable K3s release as of 2026-04-24. K3s follows
upstream Kubernetes' release cadence — `1.34` matches Kubernetes 1.34.x.
The `+k3s1` suffix is the K3s build number within that upstream version.
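
To make the version format concrete, here is a small sketch that splits a K3s version string into its upstream part and its K3s build number using plain shell parameter expansion. The helper names are hypothetical, not part of any K3s tooling.

```shell
#!/usr/bin/env bash
# Split a K3s version such as v1.34.6+k3s1 into its parts.

# Upstream Kubernetes version: strip the leading "v" and everything from "+".
k3s_upstream() { local v="${1#v}"; echo "${v%%+*}"; }

# Minor version number: the middle field of the upstream version.
k3s_minor() { local u; u="$(k3s_upstream "$1")"; u="${u#*.}"; echo "${u%%.*}"; }

# K3s build number: whatever follows "+k3s".
k3s_build() { echo "${1##*+k3s}"; }

v="v1.34.6+k3s1"
echo "upstream=$(k3s_upstream "$v") minor=$(k3s_minor "$v") build=$(k3s_build "$v")"
```

For `v1.34.6+k3s1` this prints `upstream=1.34.6 minor=34 build=1`.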

**Upgrade policy**: upstream Kubernetes ships a new minor version roughly
every four months, and K3s tracks each one. We'd upgrade in place to 1.35
once it has been out ~30 days and its release notes list no open critical
bugs. See Chapter 17 for the procedure.
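
The mechanical part of that policy can be sketched as a simple gate. This is an illustration only: `upgrade_allowed` and `MIN_SOAK_DAYS` are hypothetical names, the inputs are supplied by hand, and the "no open critical bugs" check remains a manual read of the release notes. The one-minor-step rule reflects that Kubernetes upgrades should not skip minor versions.

```shell
#!/usr/bin/env bash
# Sketch: allow an upgrade only if the target is exactly one minor version
# ahead and has been released for at least MIN_SOAK_DAYS.
MIN_SOAK_DAYS=30

upgrade_allowed() {
  local cur_minor="$1" tgt_minor="$2" days_out="$3"
  [ $(( tgt_minor - cur_minor )) -eq 1 ] && [ "$days_out" -ge "$MIN_SOAK_DAYS" ]
}

# Example: 1.34 -> 1.35, target released 45 days ago.
if upgrade_allowed 34 35 45; then
  echo "ok to schedule upgrade"
else
  echo "wait"
fi
```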

### containerd v2.2.2

Bundled with K3s. containerd 2.x is the current major line; it removed
long-deprecated 1.x features (such as the CRI v1alpha2 API). It is
unrelated to `cri-dockerd`, the shim used when Kubernetes runs on Docker
Engine.
We don't pin containerd separately — we take whatever K3s ships.

### Flannel (VXLAN backend)

Bundled with K3s as the default CNI. Flannel's VXLAN backend is
straightforward, performant enough, and has worked reliably in every K3s
install we've seen. Alternatives (Calico, Cilium) are more featureful but
add operational complexity.

See [Chapter 3](./03-networking.md) for a deep dive on the networking
layer.

## What we did NOT choose from K3s' ecosystem

- **servicelb / klipper-lb** — off. Reason above.
- **embedded SQLite** — on single-node k3s, SQLite replaces etcd. We're
  multi-node, so this doesn't apply.
- **`--flannel-backend=wireguard-native`** — WireGuard-encrypted overlay.
  We didn't enable it because (a) VXLAN already works, (b) our node-to-node
  traffic stays within Hetzner's internal network anyway, and (c) we haven't
  proven we need it. Encryption is a TODO (Chapter 20).

## Raft and split-brain behavior

If the 3 nodes become network-partitioned such that two nodes can still
see each other but the third is cut off from both (a "2-1 split"):

- **Majority partition (2 nodes)** — retains quorum, cluster keeps
  accepting writes. Pods on those 2 nodes keep running. The isolated
  node is marked `NotReady` after `node-monitor-grace-period` (40s
  historically, 50s in recent Kubernetes releases). Its pods are evicted
  once their default 5-minute toleration for the
  `node.kubernetes.io/unreachable` taint expires (the older
  `pod-eviction-timeout` flag no longer applies), and replacements are
  scheduled onto the surviving nodes.
- **Minority partition (1 node)** — loses quorum. API server on that
  node refuses writes; existing pods keep running (kubelet doesn't need
  the API server for already-scheduled pods), but nothing new can deploy,
  scale, or reschedule.
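
From the majority side, a quick way to see which node has dropped out is to filter the node list by status. A minimal sketch follows; `not_ready_nodes` is a hypothetical helper, and the sample rows are illustrative rather than captured from our cluster.

```shell
#!/usr/bin/env bash
# Reads `kubectl get nodes --no-headers` output on stdin and prints the names
# of nodes whose STATUS column does not start with "Ready".
not_ready_nodes() { awk '$2 !~ /^Ready/ { print $1 }'; }

# Illustrative input standing in for real kubectl output:
printf '%s\n' \
  'hetzner1 Ready    control-plane,etcd 30d v1.34.6+k3s1' \
  'hetzner2 Ready    control-plane,etcd 30d v1.34.6+k3s1' \
  'hetzner3 NotReady control-plane,etcd 30d v1.34.6+k3s1' |
  not_ready_nodes
```

Against a live cluster the equivalent would be `kubectl get nodes --no-headers | not_ready_nodes`; with the sample rows above it prints `hetzner3`.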

When the partition heals, Raft reconciles automatically. The minority
node catches up on etcd state via snapshot+replay.

**Worst case** (all 3 isolated from each other): no quorum, no node is
authoritative. Pods keep running from existing state; nothing can be
updated. This requires all three nodes losing network to each other
simultaneously, which implies Hetzner's entire internal switching is
broken — at that point, the whole region is likely down anyway.

## Our decision in one sentence

K3s gave us the Kubernetes API (enormous ecosystem, known primitives, our
existing scaffold in `deploy-k3s/manifests/`) without the operational
overhead of kubeadm; and unlike Swarm, its service-discovery layer is
rock-solid.

## Operator cheat sheet

```bash
# On any k3s server node, root commands use k3s-wrapped kubectl:
sudo k3s kubectl get nodes

# From workstation, use the copied kubeconfig:
export KUBECONFIG=~/.kube/honeydue-k3s.yaml
kubectl get nodes

# Check k3s service:
ssh deploy@hetzner1 "sudo systemctl status k3s"

# Watch cluster events live:
kubectl get events -A --watch

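# See what's on each node (sorted by node name; avoids column-counting
# with sort, which breaks when RESTARTS contains "(2d ago)"):
kubectl get pods -A -o wide --sort-by=.spec.nodeName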
```

## References

- [K3s architecture][k3s-arch]
- [K3s requirements][k3s-reqs]
- [Mirantis Swarm support announcement][mirantis-swarm]
- [moby/moby#52265 — libnetwork stale records][moby-52265]
- [moby/moby#51491 — DNS broken after swarm init][moby-51491]
- [Dokploy #3480 — Traefik stale VIP on Swarm][dokploy-3480]
- [Better Stack: Hetzner Cloud Review 2026][bstack-swarm]
- [VirtualizationHowTo: Is Docker Swarm Still Safe in 2026?][vht-swarm]

[k3s-arch]: https://docs.k3s.io/architecture
[k3s-reqs]: https://docs.k3s.io/installation/requirements
[mirantis-swarm]: https://www.mirantis.com/blog/mirantis-guarantees-long-term-support-for-swarm/
[moby-52265]: https://github.com/moby/moby/issues/52265
[moby-51491]: https://github.com/moby/moby/issues/51491
[dokploy-3480]: https://github.com/Dokploy/dokploy/issues/3480
[bstack-swarm]: https://betterstack.com/community/guides/web-servers/hetzner-cloud-review/
[vht-swarm]: https://www.virtualizationhowto.com/2026/03/is-docker-swarm-still-safe-in-2026/