Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes, all servers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
temporarily for reference
Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
callback (was causing 'unlock of unlocked mutex' fatal after
Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
+ allowlist fonts.googleapis.com so the marketing landing page CSS
actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
--platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
images runnable on x86_64 Hetzner nodes; fix array expansion under
set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
top-level aliases (the '${X_SECRET}' form never actually resolved);
dozzle ports: long-form host_ip is rejected by Swarm, switched to
short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
(Next.js serves at root; /admin/ returned 404 and killed pods);
startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
and admin/src/app/api/*, hiding legitimate files)
New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log
Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
- Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
- Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
- Part III Security, Traefik ingress (Ch 5-6)
- Part IV Services, DB, storage, secrets, registry (Ch 7-11)
- Part V Data flow, deploy process, observability, failures, runbook
(Ch 12, 14-17)
- Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
- Appendices: glossary, kubectl cheat sheet, file locations,
consolidated citations
- README.md: Production Deployment section replaced with pointer to
the book; Go version bumped to 1.25
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
K3s Migration Notes — 2026-04-24
honeyDue is running on a 3-node K3s HA cluster on the existing Hetzner nodes (hetzner1/2/3), replacing the previous Docker Swarm deployment.
Why we migrated
Docker Swarm's libnetwork has a known stale-DNS bug on 29.x (moby/moby#52265) that leaves ghost A-records when tasks migrate between nodes. Requests to single-replica services (like the admin panel) resolved to a ghost IP ~50% of the time → connection refused → 502. A full stack recreate cleared it, but the bug recurs on every node-to-node task migration.
K3s uses CoreDNS + containerd and has no libnetwork in the path, so this
class of bug doesn't exist there. See docs/SWARM_POSTMORTEM.md if it exists,
or the research summary in the earlier deploy session.
Differences from the original deploy-k3s/ scaffold
The original scaffold assumes a greenfield provision via hetzner-k3s,
GHCR for images, Cloudflare origin certs, and a Hetzner Load Balancer.
We reused existing nodes and kept Cloudflare Flexible SSL:
| Setting | Scaffold default | What we did |
|---|---|---|
| Provisioning | hetzner-k3s tool creates boxes | Manual k3s install on existing Hetzner boxes |
| Registry | GHCR (ghcr-credentials) | Gitea (gitea-credentials) via kubectl create secret docker-registry |
| Ingress TLS | cloudflare-origin-cert Secret | No TLS at origin (CF Flexible) |
| Load balancer | Hetzner LB → nodes | Cloudflare round-robin across 3 node IPs |
| Admin basic auth | admin-auth Traefik middleware | Not applied; in-app auth only |
| CF-only IP allowlist | cloudflare-only middleware | Not applied; UFW restricts some ports, 80/443 open to anyone who knows node IPs |
| Traefik | LoadBalancer via servicelb | DaemonSet w/ hostNetwork (servicelb disabled); see traefik-config.yaml below |
| Worker replicas | 2 | 1 (Asynq scheduler is singleton) |
| API start_period | 12×5s = 60s | 48×5s = 240s (covers migrate + lock queue on first boot) |
| Admin probe path | /admin/ | / (Next.js serves at root) |
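The startup windows in the table are failureThreshold × periodSeconds. A minimal sketch of that arithmetic, assuming a 5-second period as the table implies and ignoring any initialDelaySeconds; the function name startup_budget is illustrative only:

```shell
# Hypothetical helper: worst-case number of seconds a startupProbe
# tolerates before the kubelet restarts the container.
# Usage: startup_budget <failureThreshold> <periodSeconds>
startup_budget() {
  echo $(( $1 * $2 ))
}
```

startup_budget 48 5 prints 240 (the new API window) and startup_budget 12 5 prints 60 (the scaffold default).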
Manifest fixes applied in-repo (already committed)
- manifests/api/deployment.yaml: startupProbe.failureThreshold 12 → 48
- manifests/admin/deployment.yaml: probe path /admin/ → /, threshold 12 → 24
- manifests/worker/deployment.yaml: replicas 2 → 1
- manifests/pod-disruption-budgets.yaml: worker minAvailable 1 → 0
Traefik override (applied as HelmChartConfig)
K3s ships Traefik as a single-replica Deployment with a LoadBalancer service.
With servicelb disabled (to avoid binding a random port), we reconfigure it
to a DaemonSet binding directly on each node's public :80/:443 via
hostNetwork: true. The HelmChartConfig:
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: traefik
  namespace: kube-system
spec:
  valuesContent: |-
    deployment:
      kind: DaemonSet
    hostNetwork: true
    service:
      enabled: false
    ports:
      web:
        port: 80
        hostPort: 80
      websecure:
        port: 443
        hostPort: 443
    updateStrategy:
      type: RollingUpdate
      rollingUpdate:
        maxUnavailable: 1
        maxSurge: 0
    securityContext:
      capabilities:
        drop: [ALL]
        add: [NET_BIND_SERVICE]
      readOnlyRootFilesystem: true
      runAsGroup: 65532
      runAsNonRoot: true
      runAsUser: 65532
    additionalArguments:
      - "--entrypoints.web.forwardedHeaders.trustedIPs=173.245.48.0/20,103.21.244.0/22,103.22.200.0/22,103.31.4.0/22,141.101.64.0/18,108.162.192.0/18,190.93.240.0/20,188.114.96.0/20,197.234.240.0/22,198.41.128.0/17,162.158.0.0/15,104.16.0.0/13,104.24.0.0/14,172.64.0.0/13,131.0.72.0/22"
Apply with kubectl apply -f traefik-config.yaml, then bump the helm job
(kubectl delete job -n kube-system helm-install-traefik) to trigger reinstall.
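The trustedIPs value in the HelmChartConfig is the Cloudflare IPv4 range list joined with commas. A minimal sketch for regenerating that flag when the list changes, assuming the ranges arrive one CIDR per line on stdin (Cloudflare publishes them in that form at https://www.cloudflare.com/ips-v4); the function name is illustrative and not part of the repo:

```shell
# Hypothetical helper: read CIDRs (one per line) on stdin and print the
# Traefik flag that trusts forwarded headers from those ranges on the
# "web" entrypoint, matching the flag used in the HelmChartConfig.
build_trusted_ips_arg() {
  # Drop blank lines, then join the remaining lines with commas.
  cidrs=$(grep -v '^[[:space:]]*$' | paste -sd, -)
  printf '%s%s\n' '--entrypoints.web.forwardedHeaders.trustedIPs=' "$cidrs"
}
```

One way to use it is curl -fsS https://www.cloudflare.com/ips-v4 | build_trusted_ips_arg, then paste the printed line into additionalArguments and re-apply.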
Required node-level sysctl
hostNetwork pods with capabilities don't get CAP_NET_BIND_SERVICE in the host netns on modern containerd, so the non-root Traefik process cannot bind :80/:443 by default. Lower the unprivileged port floor on each node:
echo 'net.ipv4.ip_unprivileged_port_start=0' | sudo tee /etc/sysctl.d/99-unprivileged-ports.conf
sudo sysctl --system
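To confirm the setting took effect, compare the port against the kernel's unprivileged floor: any port at or above net.ipv4.ip_unprivileged_port_start can be bound without privileges. A minimal sketch; the function name is illustrative, and the optional second argument exists only so the check can be tried without touching a real node:

```shell
# Hypothetical helper: succeed if a non-root process may bind <port>.
# Usage: can_bind_unprivileged <port> [floor]
# When [floor] is omitted it is read from the running kernel.
can_bind_unprivileged() {
  port="$1"
  floor="${2:-$(sysctl -n net.ipv4.ip_unprivileged_port_start)}"
  [ "$port" -ge "$floor" ]
}
```

On a node where the sysctl above has been applied, can_bind_unprivileged 80 should succeed; with the usual kernel default of 1024 it fails.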
UFW rules added for k3s (per node)
All between the 3 node IPs (178.104.247.152, 178.105.32.198, 178.104.249.189):
- 6443/tcp: kube API
- 2379/tcp, 2380/tcp: embedded etcd client + peer
- 10250/tcp: kubelet
- 8472/udp: flannel VXLAN overlay
Plus from your workstation IP to each node's 6443/tcp for kubectl.
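The inter-node rules can be generated rather than typed by hand. A sketch under assumptions: it only prints ufw commands for review (nothing is executed), it uses the standard "ufw allow from <ip> to any port <port> proto <proto>" form, it emits a rule for every node IP including the local one (harmless, and it keeps the output identical on all nodes), and the function name is illustrative:

```shell
# Node IPs and ports copied from the notes above.
NODES="178.104.247.152 178.105.32.198 178.104.249.189"
PORTS="6443/tcp 2379/tcp 2380/tcp 10250/tcp 8472/udp"

# Hypothetical helper: print one ufw command per (node IP, port) pair.
print_k3s_ufw_rules() {
  for peer in $NODES; do
    for spec in $PORTS; do
      port="${spec%/*}"    # part before the slash, e.g. 6443
      proto="${spec#*/}"   # part after the slash, e.g. tcp
      echo "ufw allow from $peer to any port $port proto $proto"
    done
  done
}
```

Review the printed commands, then run them with sudo on each node. The workstation rule for 6443/tcp is left out because that IP is not recorded in these notes.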
Ingress
Minimal hostname-only routing (/tmp/honeydue-ingress.yaml at deploy time
— move it into deploy-k3s/manifests/ingress/ in a follow-up):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: honeydue-api
  namespace: honeydue
spec:
  ingressClassName: traefik
  rules:
    - host: api.myhoneydue.com
      http:
        paths:
          - {path: /, pathType: Prefix, backend: {service: {name: api, port: {number: 8000}}}}
    - host: myhoneydue.com
      http:
        paths:
          - {path: /, pathType: Prefix, backend: {service: {name: api, port: {number: 8000}}}}
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: honeydue-admin
  namespace: honeydue
spec:
  ingressClassName: traefik
  rules:
    - host: admin.myhoneydue.com
      http:
        paths:
          - {path: /, pathType: Prefix, backend: {service: {name: admin, port: {number: 3000}}}}
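Because Cloudflare round-robins across the node IPs, it can help to check each node's Traefik directly, bypassing DNS. A minimal sketch using curl's --resolve option to pin a hostname to one node over plain HTTP (there is no TLS at origin); the function name and the CURL override variable are illustrative, and no particular status code is assumed:

```shell
# Hypothetical helper: request / for <host> from one specific node and
# print the HTTP status code that node returned.
# Usage: probe_host <host> <node-ip>
probe_host() {
  host="$1"
  ip="$2"
  # CURL can be overridden for dry runs; it defaults to the real curl.
  code=$("${CURL:-curl}" -s -o /dev/null -w '%{http_code}' --max-time 5 \
    --resolve "$host:80:$ip" "http://$host/")
  printf '%s via %s -> %s\n' "$host" "$ip" "$code"
}
```

Calling probe_host admin.myhoneydue.com 178.104.247.152 for each of the three node IPs shows whether every node routes the hostname; curl reports 000 when a node does not answer.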
Operator access
Kubeconfig lives at ~/.kube/honeydue-k3s.yaml.
export KUBECONFIG=~/.kube/honeydue-k3s.yaml
kubectl get pods -n honeydue
Remaining TODOs (not blocking)
- Apply manifests/ingress/middleware.yaml for security headers + rate limiting (CF-only allowlist + basic auth deliberately skipped until you want them)
- Apply manifests/network-policies.yaml for default-deny + explicit allows
- Apply manifests/api/hpa.yaml if you want autoscaling (metrics-server is already running, so just kubectl apply it)
- Upgrade to CF Full (strict) SSL: generate origin cert, create cloudflare-origin-cert Secret, add tls: block back to Ingress
- Set up a proper migration Job so api replicas don't each run MigrateWithLock on startup; that lets you drop the 240s startupProbe grace
- Remove deploy/ (the Swarm-era config) once you're confident in k3s