admin/honeyDueAPI

Trey t 6f303dbbaa

Migrate prod deploy from Swarm to K3s; add full deployment book

Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
  callback (was causing 'unlock of unlocked mutex' fatal after
  Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
  + allowlist fonts.googleapis.com so the marketing landing page CSS
  actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
  --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
  images runnable on x86_64 Hetzner nodes; fix array expansion under
  set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
  top-level aliases (the '\${X_SECRET}' form never actually resolved);
  dozzle ports: long-form host_ip is rejected by Swarm, switched to
  short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
  worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
  (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
  12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
  real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
  and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
  hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
  without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations,
    consolidated citations
- README.md: Production Deployment section replaced with pointer to
  the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-24 07:20:54 -05:00


K3s Migration Notes — 2026-04-24

honeyDue is running on a 3-node K3s HA cluster on the existing Hetzner nodes (hetzner1/2/3), replacing the previous Docker Swarm deployment.

Why we migrated

Docker Swarm's libnetwork has a known stale-DNS bug on Docker 29.x (moby/moby#52265) that leaves ghost A-records behind when tasks migrate between nodes. For single-replica services such as the admin panel, lookups landed on a ghost IP ~50% of the time, so the upstream connection was refused and the client got a 502. Recreating the full stack cleared the stale records, but the bug recurred on every node-to-node task migration.

K3s uses CoreDNS and containerd, with no libnetwork in the path, so this class of bug does not exist there. For the full analysis, see the Swarm postmortem in the deployment book under docs/deployment/ (Part VI).

Differences from the original `deploy-k3s/` scaffold

The original scaffold assumes a greenfield provision via hetzner-k3s, GHCR for images, Cloudflare origin certs, and a Hetzner Load Balancer. We reused existing nodes and kept Cloudflare Flexible SSL:

| Setting | Scaffold default | What we did |
| --- | --- | --- |
| Provisioning | `hetzner-k3s` tool creates boxes | Manual k3s install on existing Hetzner boxes |
| Registry | GHCR (`ghcr-credentials`) | Gitea (`gitea-credentials`) via `kubectl create secret docker-registry` |
| Ingress TLS | `cloudflare-origin-cert` Secret | No TLS at origin (CF Flexible) |
| Load balancer | Hetzner LB → nodes | Cloudflare round-robin across 3 node IPs |
| Admin basic auth | `admin-auth` Traefik middleware | Not applied; in-app auth only |
| CF-only IP allowlist | `cloudflare-only` middleware | Not applied; UFW restricts some ports, 80/443 open to anyone who knows node IPs |
| Traefik | LoadBalancer via servicelb | DaemonSet w/ hostNetwork (servicelb disabled); see `traefik-config.yaml` below |
| Worker replicas | 2 | 1 (Asynq scheduler is singleton) |
| API start_period | 12×5s = 60s | 48×5s = 240s (covers migrate + lock queue on first boot) |
| Admin probe path | `/admin/` | `/` (Next.js serves at root) |
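The Registry row can be made concrete with a small helper. This is a minimal sketch, not the exact command run during the migration: the registry host, username, and token are hypothetical arguments, and the `honeydue` namespace is assumed from the Ingress manifests in these notes.

```shell
# create_gitea_pull_secret REGISTRY_HOST USERNAME TOKEN
# Creates the image pull secret named in the table above.
# All three arguments are placeholders supplied by the operator.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the command.
create_gitea_pull_secret() {
  "${KUBECTL:-kubectl}" create secret docker-registry gitea-credentials \
    --namespace honeydue \
    --docker-server="$1" \
    --docker-username="$2" \
    --docker-password="$3"
}
```

Previewing with `KUBECTL=echo` first is cheap. Note that passing the token as an argument leaves it in shell history, so a short-lived token is the safer choice.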

Manifest fixes applied in-repo (already committed)

- manifests/api/deployment.yaml: startupProbe.failureThreshold 12 → 48
- manifests/admin/deployment.yaml: probe path /admin/ → /, threshold 12 → 24
- manifests/worker/deployment.yaml: replicas 2 → 1
- manifests/pod-disruption-budgets.yaml: worker minAvailable 1 → 0
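The threshold changes are easier to reason about as time budgets. A startupProbe gives a container roughly failureThreshold × periodSeconds to come up before the kubelet restarts it (ignoring any initialDelaySeconds). The sketch below assumes the 5s period shown for the API in the comparison table; the admin probe's period is not stated in these notes, so it is left out.

```shell
# startup_budget FAILURE_THRESHOLD PERIOD_SECONDS
# Prints the approximate startup window in seconds:
# failureThreshold * periodSeconds.
startup_budget() {
  echo $(( $1 * $2 ))
}

startup_budget 12 5   # scaffold default for the API: prints 60
startup_budget 48 5   # API after this migration: prints 240
```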

Traefik override (applied as HelmChartConfig)

K3s ships Traefik as a single-replica Deployment with a LoadBalancer service. With servicelb disabled (to avoid binding a random port), we reconfigure it to a DaemonSet binding directly on each node's public :80/:443 via hostNetwork: true. The HelmChartConfig:

apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: traefik
  namespace: kube-system
spec:
  valuesContent: |-
    deployment:
      kind: DaemonSet
    hostNetwork: true
    service:
      enabled: false
    ports:
      web:
        port: 80
        hostPort: 80
      websecure:
        port: 443
        hostPort: 443
    updateStrategy:
      type: RollingUpdate
      rollingUpdate:
        maxUnavailable: 1
        maxSurge: 0
    securityContext:
      capabilities:
        drop: [ALL]
        add: [NET_BIND_SERVICE]
      readOnlyRootFilesystem: true
      runAsGroup: 65532
      runAsNonRoot: true
      runAsUser: 65532
    additionalArguments:
      - "--entrypoints.web.forwardedHeaders.trustedIPs=173.245.48.0/20,103.21.244.0/22,103.22.200.0/22,103.31.4.0/22,141.101.64.0/18,108.162.192.0/18,190.93.240.0/20,188.114.96.0/20,197.234.240.0/22,198.41.128.0/17,162.158.0.0/15,104.16.0.0/13,104.24.0.0/14,172.64.0.0/13,131.0.72.0/22"

Apply with kubectl apply -f traefik-config.yaml, then bump the helm job (kubectl delete job -n kube-system helm-install-traefik) to trigger reinstall.
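After the reinstall it is worth confirming that Traefik really came back as a DaemonSet with one pod per node. The helper below is a sketch under two assumptions not confirmed by these notes: that the chart keeps the resource name `traefik`, and that its pods carry the `app.kubernetes.io/name=traefik` label.

```shell
# verify_traefik_daemonset
# Waits for the DaemonSet rollout, then lists the Traefik pods with their nodes.
# Assumptions: resource name "traefik" and the app.kubernetes.io/name label.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the commands.
verify_traefik_daemonset() {
  "${KUBECTL:-kubectl}" rollout status daemonset/traefik \
    --namespace kube-system --timeout=120s &&
  "${KUBECTL:-kubectl}" get pods \
    --namespace kube-system \
    --selector app.kubernetes.io/name=traefik \
    --output wide
}
```

With three nodes, the pod listing should show three Traefik pods, each scheduled on a different node.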

Required node-level sysctl

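On modern containerd, a hostNetwork pod running as a non-root user does not effectively get CAP_NET_BIND_SERVICE in the host network namespace, even when the capability is added in its securityContext, so Traefik cannot bind :80/:443. Lower the unprivileged port floor on each node (this lets any unprivileged process on that node bind ports below 1024):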

echo 'net.ipv4.ip_unprivileged_port_start=0' | sudo tee /etc/sysctl.d/99-unprivileged-ports.conf
sudo sysctl --system
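To confirm the setting took effect on a node, the current floor can be read back from procfs. This is a small sketch; the function name is hypothetical, and the optional file argument exists only so the check can be exercised against a test file.

```shell
# check_port_floor [FILE]
# Succeeds when the unprivileged port floor is low enough to bind :80 and :443.
# FILE defaults to the kernel's procfs entry for the sysctl set above.
check_port_floor() {
  floor_file="${1:-/proc/sys/net/ipv4/ip_unprivileged_port_start}"
  floor="$(cat "$floor_file")" || return 2
  if [ "$floor" -le 80 ]; then
    echo "ok: unprivileged ports start at $floor"
  else
    echo "blocked: unprivileged ports start at $floor"
    return 1
  fi
}
```

Running `check_port_floor` with no argument on each node after `sysctl --system` should report `ok`.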

UFW rules added for k3s (per node)

Each rule allows traffic only between the 3 node IPs (178.104.247.152, 178.105.32.198, 178.104.249.189):

- 6443/tcp: kube API
- 2379/tcp, 2380/tcp: embedded etcd client + peer
- 10250/tcp: kubelet
- 8472/udp: flannel VXLAN overlay

In addition, allow your workstation IP to reach 6443/tcp on each node for kubectl.
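The node-to-node rules can be generated rather than typed by hand. The sketch below only prints `ufw allow` commands for review and does not run them; the function name is hypothetical, and the exact commands used during the migration are not recorded in these notes.

```shell
# print_k3s_ufw_rules PEER_IP...
# Prints (does not run) one "ufw allow" command per peer IP and k3s port.
print_k3s_ufw_rules() {
  for peer in "$@"; do
    for rule in 6443/tcp 2379/tcp 2380/tcp 10250/tcp 8472/udp; do
      port="${rule%/*}"    # part before the slash, e.g. 6443
      proto="${rule#*/}"   # part after the slash, e.g. tcp
      echo "ufw allow from $peer to any port $port proto $proto"
    done
  done
}

print_k3s_ufw_rules 178.104.247.152 178.105.32.198 178.104.249.189
```

This prints 15 commands (3 peers × 5 ports). After reviewing the output, it can be run with root privileges on each node; including a node's own IP in the list is redundant but harmless.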

Ingress

Minimal hostname-only routing (/tmp/honeydue-ingress.yaml at deploy time — move it into deploy-k3s/manifests/ingress/ in a follow-up):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: honeydue-api
  namespace: honeydue
spec:
  ingressClassName: traefik
  rules:
    - host: api.myhoneydue.com
      http:
        paths:
          - {path: /, pathType: Prefix, backend: {service: {name: api, port: {number: 8000}}}}
    - host: myhoneydue.com
      http:
        paths:
          - {path: /, pathType: Prefix, backend: {service: {name: api, port: {number: 8000}}}}
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: honeydue-admin
  namespace: honeydue
spec:
  ingressClassName: traefik
  rules:
    - host: admin.myhoneydue.com
      http:
        paths:
          - {path: /, pathType: Prefix, backend: {service: {name: admin, port: {number: 3000}}}}
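Because Traefik binds :80 on every node, the hostname routing above can be checked per node, bypassing Cloudflare, by sending the Host header directly to a node IP. This is a sketch: the function name is hypothetical, it requests `/` only, and the status code to expect from each service is not specified in these notes.

```shell
# smoke_test_ingress NODE_IP...
# For each node, requests "/" over plain HTTP with each Ingress hostname
# and prints "<node> <host> <http status>".
# CURL can be overridden to substitute a stub for testing.
smoke_test_ingress() {
  for node in "$@"; do
    for host in api.myhoneydue.com myhoneydue.com admin.myhoneydue.com; do
      code="$("${CURL:-curl}" -s -o /dev/null -w '%{http_code}' \
        --max-time 10 -H "Host: $host" "http://$node/")"
      echo "$node $host $code"
    done
  done
}
```

A status of `000` means curl could not connect at all. If one node answers differently from the other two for the same hostname, that node's Traefik pod is the first place to look.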

Operator access

Kubeconfig lives at ~/.kube/honeydue-k3s.yaml.

export KUBECONFIG=~/.kube/honeydue-k3s.yaml
kubectl get pods -n honeydue

Remaining TODOs (not blocking)

- Apply manifests/ingress/middleware.yaml for security headers + rate limiting (CF-only allowlist + basic auth deliberately skipped until you want them)
- Apply manifests/network-policies.yaml for default-deny + explicit allows
- Apply manifests/api/hpa.yaml if you want autoscaling (metrics-server is already running, so just kubectl apply it)
- Upgrade to CF Full (strict) SSL: generate origin cert, create cloudflare-origin-cert Secret, add tls: block back to Ingress
- Set up a proper migration Job so api replicas don't each run MigrateWithLock on startup, which lets you drop the 240s startupProbe grace
- Remove deploy/ (the Swarm-era config) once you're confident in k3s
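For the CF Full (strict) item, the Secret step could look like the sketch below. The certificate and key paths are hypothetical arguments, and the `honeydue` namespace is assumed from the Ingress manifests; generating the origin certificate itself happens in the Cloudflare dashboard and is not shown.

```shell
# create_origin_cert_secret CERT_PEM KEY_PEM
# Creates the TLS secret that the Ingress tls: block would reference.
# Both paths are placeholders supplied by the operator.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the command.
create_origin_cert_secret() {
  "${KUBECTL:-kubectl}" create secret tls cloudflare-origin-cert \
    --namespace honeydue \
    --cert="$1" \
    --key="$2"
}
```

Once the Secret exists, each Ingress needs a `tls:` entry naming `cloudflare-origin-cert` before the Cloudflare SSL mode is switched.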
