Migrate prod deploy from Swarm to K3s; add full deployment book

Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do() callback
  (was causing 'unlock of unlocked mutex' fatal after Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self' +
  allowlist fonts.googleapis.com so the marketing landing page CSS actually
  loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with --platform
  linux/amd64 so arm64 (Apple Silicon) dev machines produce images runnable
  on x86_64 Hetzner nodes; fix array expansion under set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use top-level
  aliases (the '\${X_SECRET}' form never actually resolved); dozzle ports:
  long-form host_ip is rejected by Swarm, switched to short-form (bound to
  0.0.0.0 with UFW-based loopback restriction); worker replicas 2 -> 1
  (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
  (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
  12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot; real
  startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
  and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet + hostNetwork
  override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress without TLS
  (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/: full deployment book, 26 files, ~42k words:
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations, consolidated
    citations
- README.md: Production Deployment section replaced with pointer to the
  book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 07:20:21 -05:00
parent 4ec4bbbfe8
commit 6f303dbbaa
46 changed files with 9785 additions and 93 deletions
@@ -0,0 +1,323 @@
+# 02 — Orchestrator Choice
+
+## Summary
+
+We run K3s — a lightweight Kubernetes distribution from SUSE/Rancher Labs.
+This wasn't our first choice. We originally deployed on Docker Swarm and
+spent a long afternoon hitting a libnetwork bug before migrating. This
+chapter walks through the comparison of the three realistic orchestrators
+(Docker Swarm, full Kubernetes, and K3s) and a fourth (Nomad) we
+considered and rejected. The story of the Swarm→k3s migration is in
+[Chapter 19](./19-postmortem-swarm.md); this chapter is about the decision
+framework.
+
+## The decision
+
+**K3s v1.34.6+k3s1**, HA mode, three control-plane nodes with embedded etcd.
+
+## Candidates considered
+
+| | Docker Swarm | K3s | Full Kubernetes (kubeadm) | HashiCorp Nomad |
+|---|---|---|---|---|
+| Learning curve | Easiest | Medium | Hardest | Easy |
+| Install on 3 nodes | `docker swarm init/join` | `curl \| sh` per node | Many steps | `nomad server/agent` |
+| Memory footprint (control plane) | ~200 MB per node | ~500 MB per node | ~1 GB per node | ~200 MB per node |
+| Service discovery | libnetwork (buggy) | CoreDNS | CoreDNS | Consul |
+| HA quorum | Raft (3+ managers) | Raft via embedded etcd (3+ servers) | etcd cluster (3+ nodes) | Raft (3+ servers) |
+| Secrets management | Swarm secrets | k8s Secrets | k8s Secrets | Vault or file-backed |
+| Rolling updates | Swarm update_config | Deployments | Deployments | job update stanza |
+| Ingress | None (third-party) | Traefik bundled | None (install yourself) | None (install yourself) |
+| Active development | Maintenance mode | Active | Active | Active |
+| Industry momentum | Declining | Growing | Dominant | Niche |
+
+## Why K3s
+
+### Against Docker Swarm
+
+Swarm was our first pick because it's the simplest "production-like"
+option. `docker swarm init` gives you a working cluster in seconds. It's
+built into the Docker daemon you already have.
+
+What killed it:
+
+1. **libnetwork state bugs.** Swarm's service discovery relies on
+   libnetwork's gossip-backed service registry. When a service's task
+   migrates between nodes, the old endpoint record isn't always removed
+   cleanly — especially on encrypted overlays or during transient network
+   partitions. The result: stale DNS A-records that persist indefinitely,
+   survive service removal, survive containerd restarts, survive pretty
+   much everything except recreating the overlay network. Multiple open
+   issues track this: [moby/moby#52265][moby-52265],
+   [moby/moby#51491][moby-51491], [Dokploy#3480][dokploy-3480].
+
+2. **It's in maintenance mode.** Mirantis [committed to supporting
+   Swarm through 2030][mirantis-swarm] as part of Mirantis Kubernetes
+   Engine 3, but nothing is being actively developed. The libnetwork code
+   has no champion; bug fixes land slowly and often incompletely (the
+   29.0.0 partial fix for #50236, the 29.3.0 regression, the pending
+   follow-up in #52289 — months apart).
+
+3. **Industry signal.** The 2026 write-ups we found on "should I pick
+   Swarm" reach the same conclusion: keep running what already works, but
+   don't bet new workloads on it. [Better Stack][bstack-swarm] and
+   [VirtualizationHowTo][vht-swarm] are representative.
+
+The [Chapter 19 postmortem](./19-postmortem-swarm.md) details the specific
+bug we hit, the workarounds we tried, and why each failed.
+
+### Against full Kubernetes (kubeadm)
+
+Full Kubernetes is the de facto standard. It has the biggest ecosystem, the
+most documentation, and the most mindshare. Against it:
+
+1. **Operational overhead.** A kubeadm-built control-plane node runs ~6
+   separate components (kube-apiserver, etcd, kube-scheduler, and
+   kube-controller-manager for the control plane, plus the kubelet and
+   kube-proxy that every node runs), each of which needs monitoring,
+   upgrading, and understanding. K3s bundles them into a single binary
+   with sensible defaults.
+
+2. **Memory.** A kubeadm control plane wants ~1 GB RAM baseline per
+   control-plane node. On an 8 GB node that's about 12.5% gone before any
+   workload runs. K3s needs ~500 MB per server node.
+
+3. **Etcd.** Full Kubernetes expects a separate 3+ node etcd cluster for
+   HA, typically on the same masters but as an independent process. K3s
+   embeds etcd in the server binary; still Raft, still HA, but one less
+   thing to install/upgrade/monitor.
+
+4. **Cluster creation UX.** `kubeadm init` + certificate distribution + CNI
+   install + storage class setup is a multi-step dance. K3s `curl -sfL
+   https://get.k3s.io | sh -s - server --cluster-init` plus two joins is a
+   10-minute cluster.
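+
+A minimal sketch of that bootstrap, following the documented K3s
+embedded-etcd install flow. The `FIRST_SERVER` address is a hypothetical
+placeholder, not our real node IP, and the helper only prints the command
+for each role so it can be reviewed before being pasted onto a node:
+
+```bash
+# Print (do not run) the K3s server install command for a given role.
+# FIRST_SERVER is a placeholder; K3S_TOKEN is expanded on the target node.
+FIRST_SERVER="${FIRST_SERVER:-10.0.0.1}"
+
+k3s_install_cmd() {
+  local role="$1"
+  # Single quotes keep "$K3S_TOKEN" literal so the secret is never echoed.
+  local base='curl -sfL https://get.k3s.io | K3S_TOKEN="$K3S_TOKEN" sh -s - server'
+  case "$role" in
+    init) echo "$base --cluster-init" ;;
+    join) echo "$base --server https://${FIRST_SERVER}:6443" ;;
+    *)    echo "usage: k3s_install_cmd init|join" >&2; return 1 ;;
+  esac
+}
+
+k3s_install_cmd init   # for the first node
+k3s_install_cmd join   # for the second and third nodes
+```
+
+Any extra server flags (for example the `--disable=servicelb` we use)
+should be appended the same way for every server.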
+
+**What we'd lose by not using full Kubernetes:** nothing that matters at
+our scale. K3s is a certified-conformant Kubernetes distribution, so
+`kubectl` commands, Helm charts, and manifests work the same way. If we
+ever need to migrate to full Kubernetes, we re-apply the manifests in
+`deploy-k3s/manifests/` on the new cluster. A `kubectl get all -A -o yaml`
+dump would not be enough by itself: `all` omits ConfigMaps, Secrets,
+Ingresses, PersistentVolumeClaims, and custom resources.
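+
+If a dump of live state is ever wanted alongside the manifests, one
+option is to walk every listable namespaced resource type instead of
+relying on the `all` alias. A hedged sketch; the function name and the
+default output directory are illustrative choices, not part of our
+tooling:
+
+```bash
+# Dump every listable namespaced resource type to one YAML file per type.
+# Requires a working kubeconfig; cluster-scoped types are not covered.
+export_cluster_state() {
+  local out_dir="${1:-cluster-export}"
+  mkdir -p "$out_dir"
+  kubectl api-resources --verbs=list --namespaced -o name |
+    while read -r kind; do
+      kubectl get "$kind" -A -o yaml > "$out_dir/$kind.yaml"
+    done
+}
+```
+
+Calling `export_cluster_state ./backup` writes files such as
+`./backup/configmaps.yaml`. The dump includes Secrets and
+server-generated fields, so it should be treated as sensitive and as a
+reference copy rather than something to re-apply verbatim.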
+
+### Against HashiCorp Nomad
+
+Nomad is very good at what it does — simpler than Kubernetes, more robust
+than Swarm, has real load balancing (via Consul Connect), and the
+`nomad agent` binary is ~80 MB vs k3s' ~200 MB.
+
+Against it:
+
+1. **Ecosystem is smaller.** Far fewer community Helm charts, operators,
+   tutorials. Every new component needs bespoke integration.
+2. **Service discovery requires Consul.** Two products to operate, not one.
+3. **Ingress requires a separate tool** (Traefik, HAProxy, Fabio). K3s
+   bundles Traefik by default.
+4. **Secrets management** requires Vault or relies on Nomad's template
+   stanza. Not bad, but more moving parts.
+5. **The operator hasn't used Nomad in production before.** Learning curve
+   on a new platform during a prod migration is a bad trade.
+
+Nomad would be a defensible choice. K3s won primarily on ecosystem
+maturity and the operator's familiarity with Kubernetes primitives.
+
+## What K3s actually is
+
+K3s is a CNCF Sandbox project, created by Rancher Labs (now part of SUSE)
+and originally designed for edge and IoT. Its design goals:
+
+- Single ~200 MB static binary
+- Works on ARM64 and AMD64
+- Bundles everything needed for a working cluster: containerd, Flannel,
+  CoreDNS, Traefik, metrics-server, local-path storage provisioner, and
+  (optionally) servicelb (klipper-lb) load balancer
+- Replaces the kubeadm setup dance with `curl | sh`
+- Replaces etcd-in-its-own-cluster with embedded etcd (or SQLite for
+  single-node)
+- Replaces Docker with containerd (though you can opt back into Docker)
+
+It is **not** a fork of Kubernetes. K3s is Kubernetes, packaged differently.
+The upstream Kubernetes code it wraps is almost unmodified: the project
+carries a small set of patches and strips the in-tree cloud provider
+integrations you don't need. `kubectl`, the API, CRDs, operators: all
+behave the same.
+
+## HA architecture we chose
+
+```mermaid
+flowchart TB
+    subgraph Cluster[k3s HA cluster]
+        subgraph N1[hetzner1]
+            K1[k3s server]
+            E1[etcd]
+            KUB1[kubelet]
+            TR1[Traefik pod<br/>hostNet :80/:443]
+            P1[app pods]
+        end
+        subgraph N2[hetzner2]
+            K2[k3s server]
+            E2[etcd]
+            KUB2[kubelet]
+            TR2[Traefik pod<br/>hostNet :80/:443]
+            P2[app pods]
+        end
+        subgraph N3[hetzner3]
+            K3[k3s server]
+            E3[etcd]
+            KUB3[kubelet]
+            TR3[Traefik pod<br/>hostNet :80/:443]
+            P3[app pods]
+        end
+    end
+
+    E1 <--Raft--> E2 <--Raft--> E3
+    E1 <--Raft--> E3
+
+    K1 & K2 & K3 --- API[kube-apiserver<br/>port 6443]
+```
+
+### ASCII fallback
+
+```
+       hetzner1            hetzner2            hetzner3
+    ┌────────────┐      ┌────────────┐      ┌────────────┐
+    │ k3s server │      │ k3s server │      │ k3s server │
+    │  ├ :6443   │      │  ├ :6443   │      │  ├ :6443   │
+    │  ├ etcd    │      │  ├ etcd    │      │  ├ etcd    │
+    │  ├ kubelet │      │  ├ kubelet │      │  ├ kubelet │
+    │  └ pods    │      │  └ pods    │      │  └ pods    │
+    └─────┬──────┘      └─────┬──────┘      └─────┬──────┘
+          │                   │                   │
+          ├──── etcd Raft ────┴──── etcd Raft ────┤
+          └────────────── etcd Raft ──────────────┘
+```
+
+All three nodes are **server** nodes (in k3s terminology) — they all run
+`kube-apiserver`, `kube-scheduler`, `kube-controller-manager`, and
+participate in etcd Raft consensus. A fourth "agent" node could be added
+as worker-only; we don't need that capacity yet.
+
+**Quorum**: 2 out of 3 nodes must agree on writes. The cluster stays
+operational if any one node dies. Two dying nodes = cluster loses quorum
+(Raft halts) until a majority returns.
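+
+The quorum arithmetic generalizes: a Raft cluster with n voting members
+needs floor(n/2) + 1 of them to agree, and tolerates the loss of the
+rest. A small sketch (the function names are illustrative):
+
+```bash
+# Raft majority: floor(n/2) + 1 voting members must agree on a write.
+quorum_size() {
+  echo $(( $1 / 2 + 1 ))
+}
+
+# Number of members that can fail while a majority is still reachable.
+fault_tolerance() {
+  echo $(( $1 - ($1 / 2 + 1) ))
+}
+
+for n in 1 3 4 5; do
+  echo "members=$n quorum=$(quorum_size "$n") tolerates=$(fault_tolerance "$n")"
+done
+```
+
+For 3 members this prints `quorum=2 tolerates=1`. A fourth server would
+raise the quorum to 3 without tolerating any additional failure, which
+is why server counts are normally kept odd.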
+
+## What we disabled
+
+We ran the k3s install with `--disable=servicelb`. `servicelb` (a.k.a.
+`klipper-lb`) makes k3s spawn a DaemonSet for each `LoadBalancer`-typed
+Service; its pods listen on the nodes' host ports and forward traffic to
+that Service. Fine for dev; we don't need it because we handle ingress
+with Traefik in DaemonSet+hostNetwork mode (Chapter 6).
+
+We did **not** disable:
+- **traefik** — we reconfigured it via HelmChartConfig rather than
+  disable-and-replace. See Chapter 6.
+- **local-path-provisioner** — provides the default `StorageClass` we use
+  for Redis PVC (Chapter 7).
+- **metrics-server** — required for `kubectl top` and HorizontalPodAutoscaler.
+- **coredns** — the cluster DNS. Essential for service discovery.
+
+## Version choices
+
+### K3s v1.34.6+k3s1
+
+This was the latest stable K3s release as of 2026-04-24. K3s versions
+track upstream Kubernetes: `v1.34.6` is built from Kubernetes v1.34.6,
+and the `+k3s1` suffix is the K3s release number for that upstream patch.
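+
+The version string can be split mechanically, for example when a script
+has to compare the running minor version against a target. A small
+sketch using only shell parameter expansion (the function name is
+illustrative):
+
+```bash
+# Split a K3s version such as "v1.34.6+k3s1" into its upstream Kubernetes
+# version, its minor series, and its K3s release suffix.
+split_k3s_version() {
+  local full="$1"
+  local upstream="${full%%+*}"   # text before the '+'
+  local suffix="${full#*+}"      # text after the '+'
+  local minor="${upstream#v}"    # drop the leading 'v'
+  minor="${minor%.*}"            # drop the patch component
+  echo "upstream=$upstream minor=$minor k3s_release=$suffix"
+}
+
+split_k3s_version "v1.34.6+k3s1"
+# prints: upstream=v1.34.6 minor=1.34 k3s_release=k3s1
+```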
+
+**Upgrade policy**: K3s follows upstream Kubernetes, which ships a new
+minor version roughly every four months. We'd upgrade in place to 1.35
+once it has been out ~30 days and has no open critical bugs in the
+release notes. See Chapter 17 for the procedure.
+
+### containerd v2.2.2
+
+Bundled with K3s. containerd implements the Kubernetes CRI natively via
+its built-in CRI plugin, so no `cri-dockerd` shim is involved; the 2.x
+line stabilized newer features and removed ones deprecated during 1.x.
+We don't pin containerd separately; we take whatever K3s ships.
+
+### Flannel (VXLAN backend)
+
+Bundled with K3s as the default CNI. Flannel's VXLAN backend is
+straightforward, performant enough, and has worked reliably in every K3s
+install we've seen. Alternatives (Calico, Cilium) are more featureful but
+add operational complexity.
+
+See [Chapter 3](./03-networking.md) for a deep dive on the networking
+layer.
+
+## What we did NOT choose from K3s' ecosystem
+
+- **servicelb / klipper-lb** — off. Reason above.
+- **embedded SQLite** — on single-node k3s, SQLite replaces etcd. We're
+  multi-node, so this doesn't apply.
+- **`--flannel-backend=wireguard-native`** — WireGuard-encrypted overlay.
+  We didn't enable it because (a) VXLAN already works, (b) our node-to-node
+  traffic stays within Hetzner's internal network anyway, and (c) we haven't
+  proven we need it. Encryption is a TODO (Chapter 20).
+
+## Raft and split-brain behavior
+
+If the 3 nodes become network-partitioned such that two nodes can still
+reach each other but the third is cut off from both (a "2-1 split"):
+
+- **Majority partition (2 nodes)**: retains quorum, so the cluster keeps
+  accepting writes. Pods on those 2 nodes keep running. The isolated
+  node is marked `NotReady` after `node-monitor-grace-period` (default
+  40 s, raised to 50 s in recent Kubernetes releases). Its pods are then
+  evicted once their toleration for the resulting `NoExecute` taint
+  expires (default 5 min; this mechanism replaced the old
+  `pod-eviction-timeout` flag), and their controllers reschedule them
+  onto the surviving nodes.
+- **Minority partition (1 node)**: loses quorum. The API server on that
+  node refuses writes; existing pods keep running (kubelet doesn't need
+  the API server for already-scheduled pods), but nothing new can deploy,
+  scale, or reschedule.
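+
+For the majority-partition case, the delay before pods from the isolated
+node are replaced is roughly the sum of the two timers. A small sketch
+of that arithmetic, using 40 s and 5 min purely as example inputs
+(actual values depend on the Kubernetes version and on any per-pod
+tolerations; the function name is illustrative):
+
+```bash
+# Approximate seconds between a node going silent and its pods being
+# evicted: detection grace period + eviction toleration.
+reschedule_delay() {
+  local grace_s="$1" evict_s="$2"
+  echo $(( grace_s + evict_s ))
+}
+
+delay="$(reschedule_delay 40 300)"
+echo "eviction starts after about ${delay}s ($(( delay / 60 ))m $(( delay % 60 ))s)"
+# prints: eviction starts after about 340s (5m 40s)
+```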
+
+When the partition heals, Raft reconciles automatically. The minority
+node catches up on etcd state via snapshot+replay.
+
+**Worst case** (all 3 isolated from each other): no quorum, no node is
+authoritative. Pods keep running from existing state; nothing can be
+updated. This requires all three nodes losing network to each other
+simultaneously, which implies Hetzner's entire internal switching is
+broken — at that point, the whole region is likely down anyway.
+
+## Our decision in one sentence
+
+K3s gave us the Kubernetes API (enormous ecosystem, known primitives, our
+existing scaffold in `deploy-k3s/manifests/`) without the operational
+overhead of kubeadm; and unlike Swarm, its service discovery is CoreDNS
+backed by the API server rather than a gossip-based registry.
+
+## Operator cheat sheet
+
+```bash
+# On any k3s server node, root commands use k3s-wrapped kubectl:
+sudo k3s kubectl get nodes
+
+# From workstation, use the copied kubeconfig:
+export KUBECONFIG=~/.kube/honeydue-k3s.yaml
+kubectl get nodes
+
+# Check k3s service:
+ssh deploy@hetzner1 "sudo systemctl status k3s"
+
+# Watch cluster events live:
+kubectl get events -A --watch
+
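+# See what's on each node (sorted by node name; piping through `sort -k 8`
+# breaks when the RESTARTS column contains spaces):
+kubectl get pods -A -o wide --sort-by=.spec.nodeName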
+```
+
+## References
+
+- [K3s architecture][k3s-arch]
+- [K3s requirements][k3s-reqs]
+- [Mirantis Swarm support announcement][mirantis-swarm]
+- [moby/moby#52265 — libnetwork stale records][moby-52265]
+- [moby/moby#51491 — DNS broken after swarm init][moby-51491]
+- [Dokploy #3480 — Traefik stale VIP on Swarm][dokploy-3480]
+- [Better Stack: Hetzner Cloud Review 2026][bstack-swarm]
+- [VirtualizationHowTo: Is Docker Swarm Still Safe in 2026?][vht-swarm]
+
+[k3s-arch]: https://docs.k3s.io/architecture
+[k3s-reqs]: https://docs.k3s.io/installation/requirements
+[mirantis-swarm]: https://www.mirantis.com/blog/mirantis-guarantees-long-term-support-for-swarm/
+[moby-52265]: https://github.com/moby/moby/issues/52265
+[moby-51491]: https://github.com/moby/moby/issues/51491
+[dokploy-3480]: https://github.com/Dokploy/dokploy/issues/3480
+[bstack-swarm]: https://betterstack.com/community/guides/web-servers/hetzner-cloud-review/
+[vht-swarm]: https://www.virtualizationhowto.com/2026/03/is-docker-swarm-still-safe-in-2026/