

03 — Networking

Summary

The network stack has five layers: the physical/internet layer (Hetzner's public network), the node layer (Ubuntu with UFW), the Kubernetes overlay (Flannel VXLAN), the service layer (kube-proxy IPVS + CoreDNS), and the ingress layer (Traefik). This chapter walks through each, explains how they compose, and traces a single HTTP request from browser to Go API response showing every hop.

The five layers

flowchart TB
    subgraph L5[Layer 5 — Ingress]
        Traefik
    end
    subgraph L4[Layer 4 — Service discovery]
        KubeProxy[kube-proxy IPVS]
        CoreDNS
    end
    subgraph L3[Layer 3 — Pod overlay]
        Flannel[Flannel VXLAN<br/>UDP 8472]
    end
    subgraph L2[Layer 2 — Node network]
        UFW
        Kernel[Linux kernel<br/>netfilter/iptables]
    end
    subgraph L1[Layer 1 — Physical]
        Hetzner[Hetzner network<br/>public v4 + v6]
    end

    L5 --> L4 --> L3 --> L2 --> L1

ASCII fallback

  ┌──────────────────────────────────────┐
  │  L5  Traefik (host network, :80/:443)│
  ├──────────────────────────────────────┤
  │  L4  kube-proxy (IPVS) + CoreDNS     │
  ├──────────────────────────────────────┤
  │  L3  Flannel VXLAN overlay           │
  │      10.42.0.0/16 pod CIDR           │
  ├──────────────────────────────────────┤
  │  L2  Ubuntu + UFW + kernel iptables  │
  ├──────────────────────────────────────┤
  │  L1  Hetzner public IPv4/IPv6        │
  └──────────────────────────────────────┘

Layer 1 — Physical network

Each Hetzner CX33 has:

  • A public IPv4 address on the internet
  • A public IPv6 /64 subnet (one address used, the rest unused)
  • 20 TB/mo outbound traffic included; inbound is free
  • ~1 Gbps network bandwidth per node

All inter-node traffic goes over the public network. Hetzner Cloud offers a private-network feature (vswitch), but we didn't attach one; adding it now would require reconfiguring Flannel's advertise-addr. A future improvement: attach a private vSwitch to all three nodes, reconfigure Flannel to use it, and so shrink our public-interface attack surface.

Layer 2 — Node network

Each node runs Ubuntu 24.04.3 LTS with:

  • Default routing via the Hetzner-provided gateway
  • UFW as the iptables frontend (Chapter 4 lists every rule)
  • IP forwarding enabled (net.ipv4.ip_forward=1) — required for Kubernetes pod routing
  • Bridge netfilter enabled (net.bridge.bridge-nf-call-iptables=1) — required so iptables can see bridged traffic

K3s configures the latter two automatically at install time via /etc/sysctl.d/90-kubelet.conf (or similar; exact file varies by distro).

One additional sysctl we set manually:

# /etc/sysctl.d/99-unprivileged-ports.conf
net.ipv4.ip_unprivileged_port_start=0

Why: Traefik runs as UID 65532 (non-root) in host network mode to bind :80 and :443. Without this sysctl, even with CAP_NET_BIND_SERVICE, it can't bind privileged ports in the host namespace. Ubuntu 24.04's default is 1024 (so ports 1–1023 are "privileged"). Setting it to 0 lets any user bind any port.

Security implication: Minimal. The ports Traefik binds are still controlled by the container runtime — other pods on the node can't accidentally grab 80/443 because kubelet won't schedule conflicting host ports. And the UFW rules still gate what's reachable externally.
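A minimal sketch of the rule the sysctl controls: a port counts as privileged when it is below ip_unprivileged_port_start. The helper names are illustrative, and the /proc path is simply the standard Linux location of this sysctl.

```python
from pathlib import Path
from typing import Optional

# Standard procfs location of the sysctl on Linux.
SYSCTL_PATH = Path("/proc/sys/net/ipv4/ip_unprivileged_port_start")


def is_privileged(port: int, unprivileged_start: int) -> bool:
    """A port is privileged when it lies below ip_unprivileged_port_start."""
    return port < unprivileged_start


def read_unprivileged_start(path: Path = SYSCTL_PATH) -> Optional[int]:
    """Return the live sysctl value, or None if it cannot be read (e.g. non-Linux)."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    start = read_unprivileged_start()
    if start is None:
        print("sysctl not readable on this host")
    else:
        for port in (80, 443):
            state = "privileged" if is_privileged(port, start) else "unprivileged"
            print(f"port {port}: {state} (ip_unprivileged_port_start={start})")
```

With the Ubuntu default of 1024, ports 80 and 443 are privileged; with the value set to 0, no port is.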

Layer 3 — Pod overlay (Flannel VXLAN)

What Flannel is

Flannel is a CNI (Container Network Interface) plugin. Its job: give every pod in the cluster a routable IP address, and make those IPs reachable from any other pod regardless of which node they're on.

The pod CIDR

K3s assigns 10.42.0.0/16 as the cluster-wide pod CIDR by default. Each node gets a /24 slice:

| Node | Pod CIDR |
|---|---|
| ubuntu-8gb-nbg1-1 | 10.42.1.0/24 |
| ubuntu-8gb-nbg1-2 | 10.42.0.0/24 |
| ubuntu-8gb-nbg1-3 | 10.42.2.0/24 |

Each pod gets an IP from its node's slice. So a pod on hetzner2 (nbg1-1) might be 10.42.1.6; a pod on hetzner3 (nbg1-3) might be 10.42.2.10.
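The mapping from a pod IP back to its node can be sketched with Python's standard ipaddress module. The node names and slices are copied from the table above; node_for_pod_ip is a hypothetical helper, not part of any K3s tooling.

```python
import ipaddress
from typing import Optional

CLUSTER_POD_CIDR = ipaddress.ip_network("10.42.0.0/16")

# Per-node /24 slices as listed in this chapter.
NODE_POD_CIDRS = {
    "ubuntu-8gb-nbg1-1": ipaddress.ip_network("10.42.1.0/24"),
    "ubuntu-8gb-nbg1-2": ipaddress.ip_network("10.42.0.0/24"),
    "ubuntu-8gb-nbg1-3": ipaddress.ip_network("10.42.2.0/24"),
}


def node_for_pod_ip(ip: str) -> Optional[str]:
    """Return the node whose /24 slice contains the given pod IP, if any."""
    addr = ipaddress.ip_address(ip)
    for node, cidr in NODE_POD_CIDRS.items():
        if addr in cidr:
            return node
    return None


def slice_count(cluster: ipaddress.IPv4Network = CLUSTER_POD_CIDR) -> int:
    """Number of /24 node slices that fit into the cluster-wide pod CIDR."""
    return sum(1 for _ in cluster.subnets(new_prefix=24))


if __name__ == "__main__":
    print(node_for_pod_ip("10.42.1.6"))
    print(node_for_pod_ip("10.42.2.10"))
    print(slice_count())
```

A /16 split into /24 slices leaves room for 256 nodes, far more than the three in use.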

How VXLAN works

VXLAN ("Virtual Extensible LAN") tunnels Layer-2 frames over UDP. Flannel wraps every inter-node packet like so:

 Original pod → pod packet:
 ┌──────────────────────────────────────────────────┐
 │ Ethernet │ IP src=10.42.0.5 → dst=10.42.2.10 │ … │
 └──────────────────────────────────────────────────┘

 Flannel VXLAN-encapsulates it:
 ┌──────────────────────────────────────────────────────────────────┐
 │ Eth │ IP src=178.104.247.152 → dst=178.104.249.189 │ UDP 8472 │  │
 │ VXLAN header │ <original Ethernet+IP+payload>               │  │
 └──────────────────────────────────────────────────────────────────┘

The outer IP/UDP carries the packet between nodes over Hetzner's public network. On arrival, the destination node unwraps the VXLAN header and delivers the inner packet to the target pod.
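To make the wrapping concrete, here is a sketch of the 8-byte VXLAN header from RFC 7348, built and parsed with struct. It covers only the VXLAN header, not the outer Ethernet/IP/UDP headers that the kernel adds. The default VNI of 1 is an assumption about Flannel's configuration, and the function names are illustrative.

```python
import struct
from typing import Tuple

# RFC 7348: the "I" flag marks the VNI field as valid.
VXLAN_FLAG_VNI_VALID = 0x08
VXLAN_HEADER_LEN = 8


def vxlan_header(vni: int) -> bytes:
    """Pack flags (8 bits), reserved (24), VNI (24), reserved (8) in network order."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def encapsulate(inner_frame: bytes, vni: int = 1) -> bytes:
    """Prefix an inner Ethernet frame with a VXLAN header (VNI 1 assumed)."""
    return vxlan_header(vni) + inner_frame


def decapsulate(payload: bytes) -> Tuple[int, bytes]:
    """Return (vni, inner_frame) from a UDP payload carrying VXLAN."""
    if len(payload) < VXLAN_HEADER_LEN:
        raise ValueError("payload shorter than a VXLAN header")
    flags_word, vni_word = struct.unpack("!II", payload[:VXLAN_HEADER_LEN])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag not set")
    return vni_word >> 8, payload[VXLAN_HEADER_LEN:]


if __name__ == "__main__":
    wrapped = encapsulate(b"inner-ethernet-frame")
    print(wrapped[:VXLAN_HEADER_LEN].hex())
    print(decapsulate(wrapped))
```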

UDP port 8472 is the Linux kernel's default VXLAN port, which Flannel uses (the IANA-assigned VXLAN port is 4789). It must be open node-to-node in UFW (see Chapter 4).

MTU note: VXLAN encapsulation adds 50 bytes of overhead (8 VXLAN + 8 UDP + 20 IP + 14 Ethernet). Hetzner's network uses standard 1500-byte MTU, so Flannel's overlay MTU is 1450. Mismatches cause silent packet drops. K3s sets this correctly by default.
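The MTU arithmetic, written out as a small sketch. The byte counts are the ones given in the note above; the Ethernet entry is the inner frame's header, since the outer Ethernet header does not count against the link MTU.

```python
# Per-packet bytes added by VXLAN encapsulation.
VXLAN_OVERHEAD_BYTES = {
    "outer_ip": 20,
    "outer_udp": 8,
    "vxlan_header": 8,
    "inner_ethernet": 14,
}


def overlay_mtu(underlay_mtu: int) -> int:
    """Largest inner IP packet that fits in one underlay packet after encapsulation."""
    return underlay_mtu - sum(VXLAN_OVERHEAD_BYTES.values())


if __name__ == "__main__":
    print(sum(VXLAN_OVERHEAD_BYTES.values()))  # total overhead in bytes
    print(overlay_mtu(1500))                   # overlay MTU on a 1500-byte underlay
```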

Flannel config

/var/lib/rancher/k3s/agent/etc/flannel/net-conf.json on each node:

{
  "Network": "10.42.0.0/16",
  "EnableIPv6": false,
  "EnableIPv4": true,
  "IPv6Network": "::/0",
  "Backend": { "Type": "vxlan" }
}

We did not enable IPv6 in the cluster — an unnecessary complexity for our scale, and CoreDNS + kube-proxy + node controllers all work fine in v4-only mode.

No encryption (yet)

Flannel VXLAN traffic over Hetzner's public network is not encrypted. Pod-to-pod traffic between nodes is therefore visible to anyone able to capture packets on the path. In practice that path stays between our three nodes at Hetzner Nuremberg, but the traffic is still plaintext on the wire.

Mitigation today: Most sensitive traffic already uses TLS at the application layer; Redis is the exception:

  • api ↔ Neon Postgres: TLS 1.3 (DB_SSLMODE=require)
  • api/worker ↔ Backblaze B2: HTTPS
  • api ↔ Fastmail: STARTTLS
  • api ↔ Redis: plaintext but Redis only holds cache + Asynq queue state, no user credentials

TODO (Chapter 20): Switch Flannel to wireguard-native backend. K3s supports this with a flag at install time; enabling on an existing cluster requires a config edit and rolling kubelet restart.

Layer 4 — Service discovery

Pods don't talk to each other by IP — IPs are ephemeral, assigned on pod creation. They use service names resolved by DNS.

CoreDNS

K3s runs CoreDNS as the cluster DNS server. A pod in the honeydue namespace resolves redis to the Redis Service's ClusterIP:

redis                     → 10.43.7.10  (Service ClusterIP)
redis.honeydue            → 10.43.7.10
redis.honeydue.svc.cluster.local → 10.43.7.10

When an app resolves redis:6379:

  1. The pod's /etc/resolv.conf points to 10.43.0.10 (the CoreDNS Service).
  2. CoreDNS receives the query, checks its known Services, returns 10.43.7.10.
  3. The pod sends TCP to 10.43.7.10:6379.
  4. kube-proxy (Layer 4, below) intercepts and routes to the actual pod IP.
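All three name forms above resolve to the same ClusterIP because the resolver expands short names through the search list in the pod's /etc/resolv.conf. The sketch below assumes the usual Kubernetes values for a pod in the honeydue namespace (search list and ndots:5); check a real pod's resolv.conf before relying on them. candidate_names is a hypothetical helper that mirrors the resolver's ordering.

```python
from typing import List

# Assumed resolv.conf contents for a pod in the honeydue namespace.
SEARCH = ["honeydue.svc.cluster.local", "svc.cluster.local", "cluster.local"]
NDOTS = 5


def candidate_names(name: str, search: List[str] = SEARCH, ndots: int = NDOTS) -> List[str]:
    """Return the fully qualified names a resolver tries, in order."""
    if name.endswith("."):
        return [name]  # already absolute: the search list is not applied
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        return [absolute] + expanded  # enough dots: try the name as-is first
    return expanded + [absolute]      # short name: walk the search list first


if __name__ == "__main__":
    for query in ("redis", "redis.honeydue", "redis.honeydue.svc.cluster.local"):
        print(query, "->", candidate_names(query))
```

With these assumptions, redis matches on the first search domain and redis.honeydue on the second; both arrive at redis.honeydue.svc.cluster.local.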

The service CIDR

K3s assigns 10.43.0.0/16 as the service CIDR. ClusterIPs live here. Currently:

| Service | ClusterIP |
|---|---|
| api.honeydue | 10.43.167.83 |
| admin.honeydue | 10.43.136.168 |
| redis.honeydue | 10.43.7.10 |
| kubernetes.default | 10.43.0.1 |
| kube-dns.kube-system | 10.43.0.10 |

ClusterIPs are stable for the life of the Service — they don't change when pods come and go.

kube-proxy (IPVS mode)

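kube-proxy is the dataplane component that makes Services work. On K3s it runs embedded in the k3s process on every node rather than as a separate DaemonSet. It watches the Kubernetes API for Service and Endpoint changes and programs the kernel to route traffic.

This chapter describes kube-proxy in IPVS mode. Upstream kube-proxy defaults to iptables mode on Linux, and K3s uses IPVS only when started with --kube-proxy-arg=proxy-mode=ipvs, so confirm the active mode on a node (for example, sudo ipvsadm -Ln should list the ClusterIPs) before relying on the IPVS-specific details here. IPVS is a Linux kernel feature for in-kernel L4 load balancing: essentially connection-tracking NAT with round-robin or other scheduling.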

When a pod dials 10.43.7.10:6379:

  1. The first packet hits the node's kernel
  2. IPVS sees the destination is a ClusterIP
  3. IPVS picks an endpoint from the Service's endpoint set (e.g., 10.42.0.10:6379, an address in hetzner1's pod slice)
  4. IPVS rewrites the destination and forwards
  5. Flannel tunnels it to the destination node (if remote) or delivers locally (if the endpoint is on the same node)

This happens per-TCP-connection, not per-packet, thanks to conntrack.
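A toy model of that per-connection behaviour: the first packet of a flow picks an endpoint round-robin, and a conntrack-like table pins every later packet of the same flow to it. This is an illustration of the idea only, not how the kernel implements IPVS; the class name and the second endpoint are hypothetical.

```python
from itertools import cycle
from typing import Dict, List, Tuple

Flow = Tuple[str, int]  # (source IP, source port) identifies a client connection


class VirtualService:
    """Round-robin scheduling with a conntrack-like flow table."""

    def __init__(self, endpoints: List[str]) -> None:
        if not endpoints:
            raise ValueError("a Service needs at least one endpoint")
        self._next_endpoint = cycle(endpoints)
        self._conntrack: Dict[Flow, str] = {}

    def route(self, src_ip: str, src_port: int) -> str:
        """Return the endpoint for this flow, choosing one only on first sight."""
        flow = (src_ip, src_port)
        if flow not in self._conntrack:
            self._conntrack[flow] = next(self._next_endpoint)
        return self._conntrack[flow]


if __name__ == "__main__":
    # First endpoint is from the text; the second is a made-up extra replica.
    redis = VirtualService(["10.42.0.10:6379", "10.42.2.11:6379"])
    print(redis.route("10.42.1.6", 40000))  # new flow: scheduler picks
    print(redis.route("10.42.1.6", 40000))  # same flow: same endpoint
    print(redis.route("10.42.1.6", 40001))  # new flow: next endpoint
```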

Why IPVS over iptables

This chapter assumes kube-proxy runs in IPVS mode. The alternative, iptables mode (kube-proxy's upstream default), is older and slower at scale: for every Service it adds a chain of rules, and the rule set grows linearly with Service count. IPVS uses a hash table and scales to thousands of Services without performance degradation. At our scale either works, but IPVS is the better choice.

Headless Services

Some Services don't use a ClusterIP at all: they're "headless" (clusterIP: None). Our setup doesn't currently use any, but it's worth knowing the distinction: headless Services return all endpoint IPs directly via DNS, with no kube-proxy involvement. They are useful for StatefulSets where clients need to talk to a specific replica.

Layer 5 — Ingress (Traefik)

External traffic arrives on the node's public :80 or :443. Traefik handles the first mile of routing. See Chapter 6 for Traefik-specific details; this section just shows how it fits in the networking stack.

Traefik runs as a DaemonSet with hostNetwork: true. That means:

  • One Traefik pod per node
  • Each pod is in the host's network namespace, not a pod netns
  • Each pod can bind directly to 0.0.0.0:80 and 0.0.0.0:443 on the node

When Cloudflare sends a request to 178.104.247.152:80:

  1. Packet arrives at hetzner1's NIC
  2. UFW accepts (80/tcp is open from anywhere)
  3. Linux kernel delivers the packet to the socket listening on 0.0.0.0:80
  4. Traefik (running in host namespace) accepts the connection
  5. Traefik reads the Host: header
  6. Traefik matches an Ingress rule (api.myhoneydue.com → api Service)
  7. Traefik dials 10.43.167.83:8000 (Service ClusterIP)
  8. kube-proxy IPVS rewrites to a live api pod endpoint
  9. Flannel VXLAN tunnels if the endpoint is on a remote node
  10. The api pod receives the request, processes, responds
  11. Response flows back the reverse path

Full trace in the end-to-end section below.
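Steps 5 to 7 amount to a lookup keyed on the Host header. A minimal sketch follows, assuming a plain host-to-Service table; the api rule and both ports come from this chapter, while the admin hostname is a placeholder because the text does not give it. Real Traefik routing also supports paths, priorities, and middleware.

```python
from typing import Dict, Optional, Tuple

Backend = Tuple[str, int]  # (Service name, port)

INGRESS_RULES: Dict[str, Backend] = {
    "api.myhoneydue.com": ("api", 8000),
    "admin.example.test": ("admin", 3000),  # placeholder hostname
}


def route(host_header: str) -> Optional[Backend]:
    """Match a Host header against the rules, ignoring case and any :port suffix."""
    host = host_header.split(":", 1)[0].strip().lower()
    return INGRESS_RULES.get(host)


if __name__ == "__main__":
    print(route("api.myhoneydue.com"))
    print(route("API.myhoneydue.com:80"))
    print(route("unknown.example"))  # no rule: Traefik would answer 404
```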

IPs we care about

| What | CIDR / IP | Used for |
|---|---|---|
| Pod CIDR | 10.42.0.0/16 | All pod IPs cluster-wide |
| Service CIDR | 10.43.0.0/16 | All ClusterIPs |
| Flannel VXLAN | UDP 8472 | Pod-to-pod traffic (inter-node) |
| CoreDNS Service | 10.43.0.10:53 | Cluster DNS |
| Kubernetes Service | 10.43.0.1:443 | Internal kube-apiserver |
| Node IPs | See README | External + flannel source/dst |
| Traefik | host network | Listens on node's :80, :443 |

End-to-end request trace

A user in Texas hits https://api.myhoneydue.com/api/tasks/. Here's every hop:

sequenceDiagram
    autonumber
    participant U as User (Austin, TX)
    participant CF as Cloudflare edge (DFW POP)
    participant H as hetzner2 (picked by CF)<br/>178.105.32.198
    participant TR as Traefik pod<br/>(hostNetwork)
    participant API as api pod on hetzner3<br/>10.42.2.6:8000
    participant DB as Neon Postgres<br/>(AWS us-east-1)

    U->>CF: HTTPS :443 GET /api/tasks/
    Note over CF: TLS handshake terminates here
    CF->>H: HTTP :80 (with original Host header)
    H->>TR: Accepted by kernel, delivered to Traefik
    Note over TR: Matches Ingress rule<br/>host: api.myhoneydue.com
    TR->>TR: Resolve api.honeydue → 10.43.167.83
    TR->>H: dial 10.43.167.83:8000
    H->>H: kube-proxy IPVS rewrites<br/>dst → 10.42.2.6:8000
    H->>API: Flannel VXLAN encapsulate<br/>UDP 8472 → hetzner3
    Note over API: Pod receives packet
    API->>DB: SELECT … FROM tasks WHERE user_id = …<br/>TLS :5432
    DB-->>API: Result rows
    API-->>TR: HTTP 200 JSON
    TR-->>CF: HTTP 200
    CF-->>U: HTTPS 200

Timing budget for a cache-miss read

| Hop | Typical latency |
|---|---|
| User → CF edge (DFW) | 5–15 ms |
| CF edge → hetzner2 (origin HTTP :80) | 90–120 ms (cross-Atlantic) |
| UFW + kernel accept | <1 ms |
| Traefik accept + route | 1–2 ms |
| kube-proxy + Flannel (same node) | <1 ms |
| kube-proxy + Flannel (remote node, VXLAN) | 1–3 ms |
| Go API request handling | 1–5 ms |
| Neon Postgres query (TLS + SQL) | 20–60 ms (AWS us-east-1) |
| Return path (reverse) | similar |

Total typical: ~200–300 ms for a user in North America, dominated by the cross-Atlantic CF→origin hop. Cached responses at Cloudflare skip the origin hop entirely.
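A rough sanity check on the budget, summing only the request-path hops for the remote-node case. It reads each table entry as a low–high range in milliseconds and treats "<1 ms" as 0–1 ms; both readings are assumptions. The return trip across the two WAN hops comes on top, which is how the total reaches the few-hundred-millisecond range quoted above.

```python
from typing import List, Tuple

Hop = Tuple[str, int, int]  # (name, low ms, high ms)

# Request path only, remote-node case; "<1 ms" is read as 0-1 ms.
REQUEST_PATH_MS: List[Hop] = [
    ("User -> CF edge", 5, 15),
    ("CF edge -> origin", 90, 120),
    ("UFW + kernel accept", 0, 1),
    ("Traefik accept + route", 1, 2),
    ("kube-proxy + Flannel (remote node)", 1, 3),
    ("Go API request handling", 1, 5),
    ("Neon Postgres query", 20, 60),
]


def total_range(hops: List[Hop]) -> Tuple[int, int]:
    """Sum the low and high ends of every hop's latency range."""
    return sum(low for _, low, _ in hops), sum(high for _, _, high in hops)


def dominant_hop(hops: List[Hop]) -> str:
    """Name of the hop with the largest worst-case latency."""
    return max(hops, key=lambda hop: hop[2])[0]


if __name__ == "__main__":
    low, high = total_range(REQUEST_PATH_MS)
    print(f"request path: {low}-{high} ms, dominated by {dominant_hop(REQUEST_PATH_MS)}")
```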

Inter-node routing concretely

Here's what ip route shows on hetzner2 (not run live, reconstructed from typical k3s+flannel+vxlan setup):

default via 172.31.1.1 dev eth0              # Hetzner gateway
10.42.0.0/24 via 10.42.0.0 dev flannel.1     # to hetzner1 pods (via VXLAN iface)
10.42.1.0/24 dev cni0                        # local pods on hetzner2
10.42.2.0/24 via 10.42.2.0 dev flannel.1     # to hetzner3 pods (via VXLAN iface)
# no route for 10.43.0.0/16: ClusterIPs are rewritten by kube-proxy, not routed

The flannel.1 interface is the VXLAN tunnel endpoint. Traffic written to it gets encapsulated in UDP 8472 and sent to the peer node's public IP.

Flannel learns about peer nodes via the Kubernetes API (it watches Node resources). When hetzner3 joins, Flannel on hetzner1 and hetzner2 both learn its public IP and pod CIDR, update their routes and ARP tables, and traffic just works.
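The kernel picks among such routes by longest-prefix match. A small sketch over the pod and default routes from the reconstructed table; the interface names come from that table, and lookup is a hypothetical helper, not a real routing API.

```python
import ipaddress
from typing import List, Tuple

# (destination CIDR, outgoing interface) as seen from hetzner2.
ROUTES: List[Tuple[str, str]] = [
    ("0.0.0.0/0", "eth0"),          # default route via the Hetzner gateway
    ("10.42.0.0/24", "flannel.1"),  # hetzner1 pods, over VXLAN
    ("10.42.1.0/24", "cni0"),       # local pods
    ("10.42.2.0/24", "flannel.1"),  # hetzner3 pods, over VXLAN
]


def lookup(dst: str) -> str:
    """Return the interface of the most specific route containing dst."""
    addr = ipaddress.ip_address(dst)
    matches = [
        (ipaddress.ip_network(cidr), dev)
        for cidr, dev in ROUTES
        if addr in ipaddress.ip_network(cidr)
    ]
    # The default route always matches, so matches is never empty.
    _, dev = max(matches, key=lambda match: match[0].prefixlen)
    return dev


if __name__ == "__main__":
    for dst in ("10.42.2.6", "10.42.1.6", "8.8.8.8"):
        print(dst, "->", lookup(dst))
```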

Network performance

Within a node (pod to pod, same host)

Packets go through cni0 bridge, never leave the node. Sub-millisecond latency, bounded by kernel + veth performance. Easily >10 Gbps.

Between nodes (pod to pod, different host)

Packets go through Flannel VXLAN. Added overhead: encap/decap in the kernel (~5–10 μs), plus the actual network hop between Hetzner nodes (~0.5 ms within the same Hetzner datacenter). Throughput is bounded by Hetzner's NIC (≈1 Gbps sustained per node).

In practice this is fine for everything we do. The slowest link in our application is Neon (AWS us-east-1), which is ~100 ms round-trip.

DNS resolution path

A pod resolves redis:

  1. App does getaddrinfo("redis").
  2. The resolver (the C library's or Go's built-in one) reads /etc/resolv.conf and finds nameserver 10.43.0.10.
  3. It sends a UDP query on port 53 to 10.43.0.10.
  4. Destination is CoreDNS Service ClusterIP.
  5. kube-proxy IPVS load-balances across CoreDNS pods (there's usually 1).
  6. The packet arrives at the CoreDNS pod.
  7. CoreDNS checks its Kubernetes plugin cache for redis.<ns>.svc.cluster.local.
  8. Returns 10.43.7.10 (redis Service ClusterIP) with a low TTL.

CoreDNS is stateless — if it restarts, pods re-query on their next lookup.

DNS caching in pods: The Go API uses net.Resolver which does not cache by default. Each new connection triggers a fresh DNS lookup. This is correct behavior for Kubernetes (where Service IPs are stable but Endpoints change), but it means a CoreDNS outage breaks new connections immediately.

Next.js (admin) uses Node's default resolver, which behaves similarly.

What breaks if X fails

| Failure | Symptom |
|---|---|
| Flannel daemon on one node crashes | Pods on that node can't reach other nodes' pods; Services through kube-proxy sometimes still work (kernel conntrack) |
| CoreDNS pod crashes (only 1) | New connection DNS lookups fail; existing connections continue |
| kube-proxy on one node crashes | Service rules on that node stop updating, so traffic to ClusterIPs can hit stale endpoints or fail; direct pod IPs still work |
| UFW misconfigured (port 8472 UDP blocked) | Pods on that node can't reach remote pods over overlay |
| Node's NIC fails | Node unreachable; Raft loses it; its pods get rescheduled elsewhere |
| Hetzner datacenter outage | Entire cluster offline |

Operator cheat sheet

# See all IPs in the cluster
kubectl get pods -A -o wide            # pod IPs + nodes
kubectl get svc -A                     # Service ClusterIPs

# Test pod-to-pod DNS from inside a pod
kubectl exec -n honeydue deploy/api -- nslookup redis
kubectl exec -n honeydue deploy/api -- getent hosts redis

# Test pod-to-pod TCP connectivity
kubectl exec -n honeydue deploy/api -- nc -zv redis 6379
kubectl exec -n honeydue deploy/api -- wget -q -O- http://admin:3000/

# See the node's iptables/IPVS rules (run on a node)
ssh deploy@hetzner1 "sudo ipvsadm -Ln"
ssh deploy@hetzner1 "sudo iptables -L -n -t nat | head -50"

# See the cluster's flannel state
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"  "}{.status.addresses[?(@.type=="InternalIP")].address}{"  "}{.spec.podCIDR}{"\n"}{end}'

References