# K3s Migration Notes - 2026-04-24

honeyDue is running on a 3-node K3s HA cluster on the existing Hetzner nodes (hetzner1/2/3), replacing the previous Docker Swarm deployment.

## Why we migrated

Docker Swarm's libnetwork has a known stale-DNS bug on 29.x ([moby/moby#52265](https://github.com/moby/moby/issues/52265)) that leaves ghost A-records when tasks migrate between nodes. Single-replica services (like the admin panel) landed on a ghost IP ~50% of the time → connection refused → 502. Full stack recreate cleared it, but the bug recurs on every node-to-node task migration.

K3s uses CoreDNS + containerd with no libnetwork history → the bug class doesn't exist there. See `docs/SWARM_POSTMORTEM.md` if it exists, or the research summary in the earlier deploy session.

## Differences from the original `deploy-k3s/` scaffold

The original scaffold assumes a greenfield provision via `hetzner-k3s`, GHCR for images, Cloudflare origin certs, and a Hetzner Load Balancer. We reused existing nodes and kept Cloudflare Flexible SSL:

| Setting | Scaffold default | What we did |
|---|---|---|
| Provisioning | `hetzner-k3s` tool creates boxes | Manual k3s install on existing Hetzner boxes |
| Registry | GHCR (`ghcr-credentials`) | Gitea (`gitea-credentials`) via `kubectl create secret docker-registry` |
| Ingress TLS | `cloudflare-origin-cert` Secret | No TLS at origin (CF Flexible) |
| Load balancer | Hetzner LB → nodes | Cloudflare round-robin across 3 node IPs |
| Admin basic auth | `admin-auth` Traefik middleware | Not applied; in-app auth only |
| CF-only IP allowlist | `cloudflare-only` middleware | Not applied; UFW restricts some ports, but 80/443 are open to anyone who knows the node IPs |
| Traefik | LoadBalancer via servicelb | DaemonSet w/ hostNetwork (servicelb disabled); see `traefik-config.yaml` below |
| Worker replicas | 2 | 1 (Asynq scheduler is singleton) |
| API start_period | 12×5s = 60s | 48×5s = 240s (covers migrate + lock queue on first boot) |
| Admin probe path | `/admin/` | `/` (Next.js serves at root) |

## Manifest fixes applied in-repo (already committed)

- `manifests/api/deployment.yaml`: `startupProbe.failureThreshold: 12 → 48`
- `manifests/admin/deployment.yaml`: probe path `/admin/ → /`, threshold `12 → 24`
- `manifests/worker/deployment.yaml`: `replicas: 2 → 1`
- `manifests/pod-disruption-budgets.yaml`: worker `minAvailable: 1 → 0`

## Traefik override (applied as HelmChartConfig)

K3s ships Traefik as a single-replica Deployment with a LoadBalancer service. With servicelb disabled (to avoid binding a random port), we reconfigure it to a DaemonSet binding directly on each node's public :80/:443 via `hostNetwork: true`. The HelmChartConfig:

```yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: traefik
  namespace: kube-system
spec:
  valuesContent: |-
    deployment:
      kind: DaemonSet
    hostNetwork: true
    service:
      enabled: false
    ports:
      web:
        port: 80
        hostPort: 80
      websecure:
        port: 443
        hostPort: 443
    updateStrategy:
      type: RollingUpdate
      rollingUpdate:
        maxUnavailable: 1
        maxSurge: 0
    securityContext:
      capabilities:
        drop: [ALL]
        add: [NET_BIND_SERVICE]
      readOnlyRootFilesystem: true
      runAsGroup: 65532
      runAsNonRoot: true
      runAsUser: 65532
    additionalArguments:
      - "--entrypoints.web.forwardedHeaders.trustedIPs=173.245.48.0/20,103.21.244.0/22,103.22.200.0/22,103.31.4.0/22,141.101.64.0/18,108.162.192.0/18,190.93.240.0/20,188.114.96.0/20,197.234.240.0/22,198.41.128.0/17,162.158.0.0/15,104.16.0.0/13,104.24.0.0/14,172.64.0.0/13,131.0.72.0/22"
```

Apply with `kubectl apply -f traefik-config.yaml`, then bump the helm job (`kubectl delete job -n kube-system helm-install-traefik`) to trigger reinstall.

## Required node-level sysctl

hostNetwork pods with capabilities don't get CAP_NET_BIND_SERVICE in the host netns on modern containerd.
Set on each node:

```bash
echo 'net.ipv4.ip_unprivileged_port_start=0' | sudo tee /etc/sysctl.d/99-unprivileged-ports.conf
sudo sysctl --system
```

## UFW rules added for k3s (per node)

All between the 3 node IPs (178.104.247.152, 178.105.32.198, 178.104.249.189):

- `6443/tcp`: kube API
- `2379/tcp`, `2380/tcp`: embedded etcd client + peer
- `10250/tcp`: kubelet
- `8472/udp`: flannel VXLAN overlay

Plus from your workstation IP to each node's `6443/tcp` for `kubectl`.

## Ingress

Minimal hostname-only routing (`/tmp/honeydue-ingress.yaml` at deploy time; move it into `deploy-k3s/manifests/ingress/` in a follow-up):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: honeydue-api
  namespace: honeydue
spec:
  ingressClassName: traefik
  rules:
    - host: api.myhoneydue.com
      http:
        paths:
          - {path: /, pathType: Prefix, backend: {service: {name: api, port: {number: 8000}}}}
    - host: myhoneydue.com
      http:
        paths:
          - {path: /, pathType: Prefix, backend: {service: {name: api, port: {number: 8000}}}}
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: honeydue-admin
  namespace: honeydue
spec:
  ingressClassName: traefik
  rules:
    - host: admin.myhoneydue.com
      http:
        paths:
          - {path: /, pathType: Prefix, backend: {service: {name: admin, port: {number: 3000}}}}
```

## Operator access

Kubeconfig lives at `~/.kube/honeydue-k3s.yaml`.

```bash
export KUBECONFIG=~/.kube/honeydue-k3s.yaml
kubectl get pods -n honeydue
```

## Remaining TODOs (not blocking)

- Apply `manifests/ingress/middleware.yaml` for security headers + rate limiting (CF-only allowlist + basic auth deliberately skipped until you want them)
- Apply `manifests/network-policies.yaml` for default-deny + explicit allows
- Apply `manifests/api/hpa.yaml` if you want autoscaling (metrics-server is already running, so just `kubectl apply` it)
- Upgrade to CF Full (strict) SSL: generate origin cert, create `cloudflare-origin-cert` Secret, add `tls:` block back to Ingress
- Set up a proper migration Job so `api` replicas don't each run `MigrateWithLock` on startup; this lets you drop the 240s startupProbe grace
- Remove `deploy/` (the Swarm-era config) once you're confident in k3s
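
For the CF Full (strict) item, a rough sketch of what the `tls:` block might look like on the `honeydue-admin` Ingress from the Ingress section. It assumes the Secret was already created in the `honeydue` namespace with something like `kubectl create secret tls cloudflare-origin-cert --cert=<origin-cert.pem> --key=<origin-key.pem> -n honeydue` (the file names are placeholders, not paths from this repo). Whether Traefik also needs a router TLS annotation or an entrypoint change depends on the Traefik version in use, so treat this as a starting point rather than the scaffold's actual manifest.

```yaml
# Sketch only: the honeydue-admin Ingress with a tls: block added back.
# Assumes the cloudflare-origin-cert Secret already exists in this namespace.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: honeydue-admin
  namespace: honeydue
spec:
  ingressClassName: traefik
  tls:
    - hosts:
        - admin.myhoneydue.com
      # A TLS Secret must live in the same namespace as the Ingress that references it.
      secretName: cloudflare-origin-cert
  rules:
    - host: admin.myhoneydue.com
      http:
        paths:
          - {path: /, pathType: Prefix, backend: {service: {name: admin, port: {number: 3000}}}}
```

The `honeydue-api` Ingress would need the same treatment for `api.myhoneydue.com` and `myhoneydue.com`, provided the origin cert covers those hostnames.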