# 03 — Networking

## Summary

The network stack has five layers: the physical/internet layer (Hetzner's public network), the node layer (Ubuntu with UFW), the Kubernetes overlay (Flannel VXLAN), the service layer (kube-proxy IPVS + CoreDNS), and the ingress layer (Traefik). This chapter walks through each, explains how they compose, and traces a single HTTP request from browser to Go API response, showing every hop.

## The five layers

```mermaid
flowchart TB
    subgraph L5[Layer 5 — Ingress]
        Traefik
    end
    subgraph L4[Layer 4 — Service discovery]
        KubeProxy[kube-proxy IPVS]
        CoreDNS
    end
    subgraph L3[Layer 3 — Pod overlay]
        Flannel[Flannel VXLAN<br/>UDP 8472]
    end
    subgraph L2[Layer 2 — Node network]
        UFW
        Kernel[Linux kernel<br/>netfilter/iptables]
    end
    subgraph L1[Layer 1 — Physical]
        Hetzner[Hetzner network<br/>public v4 + v6]
    end
    L5 --> L4 --> L3 --> L2 --> L1
```

### ASCII fallback

```
┌──────────────────────────────────────┐
│  L5  Traefik (host network, :80/:443)│
├──────────────────────────────────────┤
│  L4  kube-proxy (IPVS) + CoreDNS     │
├──────────────────────────────────────┤
│  L3  Flannel VXLAN overlay           │
│      10.42.0.0/16 pod CIDR           │
├──────────────────────────────────────┤
│  L2  Ubuntu + UFW + kernel iptables  │
├──────────────────────────────────────┤
│  L1  Hetzner public IPv4/IPv6        │
└──────────────────────────────────────┘
```

## Layer 1 — Physical network

Each Hetzner CX33 has:

- A **public IPv4** address on the internet
- A **public IPv6** /64 subnet (one address used, the rest unused)
- **20 TB/mo** outbound traffic included; inbound is free
- **~1 Gbps** network bandwidth per node

All inter-node traffic goes over the **public network**. Hetzner Cloud offers a private-network feature (vSwitch), but we didn't attach one; adding it now would require reconfiguring Flannel's advertise-addr.

A future improvement: attach a private vSwitch to all three nodes, reconfigure Flannel to use it, and shrink our public-interface attack surface.

## Layer 2 — Node network

Each node runs Ubuntu 24.04.3 LTS with:

- **Default routing** via the Hetzner-provided gateway
- **UFW** as the iptables frontend (Chapter 4 lists every rule)
- **IP forwarding** enabled (`net.ipv4.ip_forward=1`), required for Kubernetes pod routing
- **Bridge netfilter** enabled (`net.bridge.bridge-nf-call-iptables=1`), required so iptables can see bridged traffic

K3s configures the latter two automatically at install time via `/etc/sysctl.d/90-kubelet.conf` (or similar; exact file varies by distro).

One additional sysctl we set manually:

```
# /etc/sysctl.d/99-unprivileged-ports.conf
net.ipv4.ip_unprivileged_port_start=0
```

**Why**: Traefik runs as UID 65532 (non-root) in host network mode to bind :80 and :443.
Without this sysctl, even with `CAP_NET_BIND_SERVICE`, it can't bind privileged ports in the host namespace. Ubuntu 24.04's default is 1024 (so ports 1–1023 are "privileged"). Setting it to 0 lets any user bind any port.

**Security implication**: Minimal. Any local process can now bind a low port, but other pods on the node can't accidentally grab 80/443, because the Kubernetes scheduler won't place a second pod that declares the same host ports on that node. The UFW rules still gate what's reachable externally.

## Layer 3 — Pod overlay (Flannel VXLAN)

### What Flannel is

Flannel is a CNI (Container Network Interface) plugin. Its job: give every pod in the cluster a routable IP address, and make those IPs reachable from any other pod regardless of which node they're on.

### The pod CIDR

K3s assigns **10.42.0.0/16** as the cluster-wide pod CIDR by default. Each node gets a /24 slice:

| Node | Pod CIDR |
|---|---|
| ubuntu-8gb-nbg1-1 | 10.42.1.0/24 |
| ubuntu-8gb-nbg1-2 | 10.42.0.0/24 |
| ubuntu-8gb-nbg1-3 | 10.42.2.0/24 |

Each pod gets an IP from its node's slice. So a pod on hetzner2 (`nbg1-1`) might be `10.42.1.6`; a pod on hetzner3 (`nbg1-3`) might be `10.42.2.10`.

### How VXLAN works

VXLAN ("Virtual Extensible LAN") tunnels Layer-2 frames over UDP. Flannel wraps every inter-node packet like so:

```
Original pod → pod packet:
┌──────────────────────────────────────────────────┐
│ Ethernet │ IP src=10.42.0.5 → dst=10.42.2.10 │ … │
└──────────────────────────────────────────────────┘

Flannel VXLAN-encapsulates it:
┌──────────────────────────────────────────────────────────────────┐
│ Eth │ IP src=178.104.247.152 → dst=178.104.249.189 │ UDP 8472    │
│     │ VXLAN header │ original pod packet (above)                 │
└──────────────────────────────────────────────────────────────────┘
```

The outer IP/UDP header carries the packet between nodes over Hetzner's public network. On arrival, the destination node strips the VXLAN header and delivers the inner packet to the target pod.

**UDP port 8472** is the Linux kernel's default VXLAN port, which Flannel uses; the IANA-assigned VXLAN port is 4789.
It must be open node-to-node in UFW (see Chapter 4).

**MTU note**: VXLAN encapsulation adds 50 bytes of overhead (8 VXLAN + 8 UDP + 20 IP + 14 Ethernet). Hetzner's network uses the standard 1500-byte MTU, so Flannel's overlay MTU is 1500 − 50 = 1450. Mismatches cause silent packet drops. K3s sets this correctly by default.

### Flannel config

`/var/lib/rancher/k3s/agent/etc/flannel/net-conf.json` on each node:

```json
{
  "Network": "10.42.0.0/16",
  "EnableIPv6": false,
  "EnableIPv4": true,
  "IPv6Network": "::/0",
  "Backend": {
    "Type": "vxlan"
  }
}
```

We did not enable IPv6 in the cluster. It is unnecessary complexity at our scale, and CoreDNS, kube-proxy, and the node controllers all work fine in v4-only mode.

### No encryption (yet)

Flannel VXLAN traffic over Hetzner's public network is **not encrypted**. Pod-to-pod traffic between nodes is therefore visible to anyone who can capture packets on the path. In practice nobody sits between our three nodes at Hetzner Nuremberg, but it is still plaintext on the wire.

**Mitigation today**: Most sensitive traffic leaving our pods is already encrypted at the application layer; Redis is the exception:

- api ↔ Neon Postgres: TLS 1.3 (`DB_SSLMODE=require`)
- api/worker ↔ Backblaze B2: HTTPS
- api ↔ Fastmail: STARTTLS
- api ↔ Redis: plaintext, but Redis only holds cache + Asynq queue state, no user credentials

**TODO** (Chapter 20): Switch Flannel to the `wireguard-native` backend. K3s supports this with a flag at install time; enabling it on an existing cluster requires a config edit and rolling kubelet restart.

## Layer 4 — Service discovery

Pods don't talk to each other by IP, because IPs are ephemeral and assigned on pod creation. They use **service names** resolved by DNS.

### CoreDNS

K3s runs **CoreDNS** as the cluster DNS server. A pod in the `honeydue` namespace resolves `redis` to the Redis Service's ClusterIP:

```
redis                             → 10.43.7.10 (Service ClusterIP)
redis.honeydue                    → 10.43.7.10
redis.honeydue.svc.cluster.local  → 10.43.7.10
```

When an app resolves `redis:6379`:

1. The pod's `/etc/resolv.conf` points to `10.43.0.10` (the CoreDNS Service).
2. CoreDNS receives the query, checks its known Services, returns `10.43.7.10`.
3. The pod sends TCP to `10.43.7.10:6379`.
4. kube-proxy (Layer 4, below) intercepts and routes to the actual pod IP.

### The service CIDR

K3s assigns **10.43.0.0/16** as the service CIDR. ClusterIPs live here. Currently:

| Service | ClusterIP |
|---|---|
| `api.honeydue` | 10.43.167.83 |
| `admin.honeydue` | 10.43.136.168 |
| `redis.honeydue` | 10.43.7.10 |
| `kubernetes.default` | 10.43.0.1 |
| `kube-dns.kube-system` | 10.43.0.10 |

ClusterIPs are **stable** for the life of the Service; they don't change when pods come and go.

### kube-proxy (IPVS mode)

`kube-proxy` is the dataplane component that makes Services work. In K3s it runs embedded in the `k3s` process on every node rather than as a separate DaemonSet. It watches the Kubernetes API for Service and Endpoint changes and programs the kernel to route traffic.

This cluster runs kube-proxy in **IPVS mode**. Stock K3s, like upstream kube-proxy on Linux, defaults to iptables mode, so IPVS has to be selected explicitly (for example with `--kube-proxy-arg=proxy-mode=ipvs`); `ipvsadm -Ln` on a node shows a populated table when IPVS mode is active. IPVS is a Linux kernel feature for in-kernel L4 load balancing: essentially connection-tracking NAT with round-robin or other scheduling.

When a pod dials `10.43.7.10:6379`:

1. The first packet hits the node's kernel
2. IPVS sees the destination is a ClusterIP
3. IPVS picks an endpoint from the Service's endpoint set (e.g., `10.42.0.10:6379` on hetzner1)
4. IPVS rewrites the destination and forwards
5. Flannel tunnels it to the destination node (if remote) or delivers locally (if the endpoint is on the same node)

This happens per TCP connection, not per packet, thanks to conntrack.

### Why IPVS over iptables

We run kube-proxy in IPVS mode. The alternative, iptables mode, is older and slower at scale: for every Service it adds a chain of rules, and the rule set grows linearly with Service count. IPVS uses a hash table and scales to thousands of Services without performance degradation. At our scale either works, but IPVS is the better choice.
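
The per-connection behavior and the hash-versus-linear lookup cost can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how the kernel implements IPVS: the `ToyIPVS` class, the `iptables_style_lookup` helper, and the second Redis endpoint `10.42.2.11` are hypothetical and exist only for illustration.

```python
from itertools import cycle


class ToyIPVS:
    """Toy model of an IPVS virtual-server table (illustrative only)."""

    def __init__(self):
        # (cluster_ip, port) -> round-robin iterator over endpoints; dict lookup is O(1)
        self.virtual_servers = {}
        # (src_ip, src_port, dst_ip, dst_port) -> endpoint chosen for that connection
        self.conntrack = {}

    def add_service(self, cluster_ip, port, endpoints):
        self.virtual_servers[(cluster_ip, port)] = cycle(endpoints)

    def forward(self, src_ip, src_port, dst_ip, dst_port):
        flow = (src_ip, src_port, dst_ip, dst_port)
        if flow in self.conntrack:
            # Later packets of a known connection keep their endpoint.
            return self.conntrack[flow]
        scheduler = self.virtual_servers.get((dst_ip, dst_port))
        if scheduler is None:
            # Not a ClusterIP: leave the destination untouched.
            return (dst_ip, dst_port)
        endpoint = next(scheduler)  # round-robin pick for a new connection
        self.conntrack[flow] = endpoint
        return endpoint


def iptables_style_lookup(rules, dst):
    """Linear rule walk, the cost model of iptables mode: O(number of rules)."""
    for match, endpoint in rules:
        if match == dst:
            return endpoint
    return None


ipvs = ToyIPVS()
# 10.42.2.11 is a hypothetical second replica, added so round-robin is visible.
ipvs.add_service("10.43.7.10", 6379, [("10.42.0.10", 6379), ("10.42.2.11", 6379)])

first = ipvs.forward("10.42.1.6", 40000, "10.43.7.10", 6379)
again = ipvs.forward("10.42.1.6", 40000, "10.43.7.10", 6379)  # same connection
other = ipvs.forward("10.42.1.6", 40001, "10.43.7.10", 6379)  # new connection
print(first, again, other)
```

Packets from the same source port stay pinned to one endpoint, while a new connection moves on to the next one. The dictionary lookup stands in for the IPVS hash table; `iptables_style_lookup` shows why a rule list gets slower as Services are added.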

### Headless Services

Some Services do not use a ClusterIP at all: they are "headless" (`clusterIP: None`). Our setup doesn't currently use any, but the distinction is worth knowing: a headless Service returns all endpoint IPs directly via DNS, with no kube-proxy involvement. This is useful for StatefulSets where clients need to talk to a specific replica.

## Layer 5 — Ingress (Traefik)

External traffic arrives on the node's public :80 or :443. Traefik handles the first mile of routing. See [Chapter 6](./06-traefik-ingress.md) for Traefik-specific details; this section just shows how it fits in the networking stack.

Traefik runs as a **DaemonSet** with `hostNetwork: true`. That means:

- One Traefik pod per node
- Each pod is in the **host's network namespace**, not a pod netns
- Each pod can bind directly to `0.0.0.0:80` and `0.0.0.0:443` on the node

When Cloudflare sends a request to `178.104.247.152:80`:

1. Packet arrives at hetzner1's NIC
2. UFW accepts (80/tcp is open from anywhere)
3. The Linux kernel delivers the packet to the local socket listening on `0.0.0.0:80`
4. Traefik (running in host namespace) accepts the connection
5. Traefik reads the `Host:` header
6. Traefik matches an Ingress rule (api.myhoneydue.com → api Service)
7. Traefik dials `10.43.167.83:8000` (Service ClusterIP)
8. kube-proxy IPVS rewrites to a live api pod endpoint
9. Flannel VXLAN tunnels if the endpoint is on a remote node
10. The api pod receives the request, processes, responds
11. Response flows back the reverse path

Full trace in the [end-to-end section](#end-to-end-request-trace) below.
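
Steps 5 to 7 amount to a lookup from the `Host:` header to a backend Service. The sketch below illustrates that idea only and is not Traefik's actual matcher; `INGRESS_RULES` and `route_by_host` are hypothetical names, and the only rule included is the one this chapter documents.

```python
# Hypothetical rule table: Host header -> (Service ClusterIP, port).
INGRESS_RULES = {
    "api.myhoneydue.com": ("10.43.167.83", 8000),  # api Service from this chapter
}


def route_by_host(host_header, rules=INGRESS_RULES):
    """Return the backend for a Host header, or None when no rule matches."""
    host = host_header.strip().lower()  # hostnames are case-insensitive
    if ":" in host:
        # Drop an optional ":port" suffix (IPv6 literals are out of scope here).
        host = host.rsplit(":", 1)[0]
    return rules.get(host)


print(route_by_host("api.myhoneydue.com"))   # matches the api rule
print(route_by_host("unknown.example.com"))  # no rule, so None
```

A `None` result corresponds to a request no Ingress rule claims, which a real ingress controller would typically answer with a 404.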

## IPs we care about

| What | CIDR / IP | Used for |
|---|---|---|
| Pod CIDR | 10.42.0.0/16 | All pod IPs cluster-wide |
| Service CIDR | 10.43.0.0/16 | All ClusterIPs |
| Flannel VXLAN | UDP 8472 | Pod-to-pod traffic (inter-node) |
| CoreDNS Service | 10.43.0.10:53 | Cluster DNS |
| Kubernetes Service | 10.43.0.1:443 | Internal kube-apiserver |
| Node IPs | See README | External + flannel source/dst |
| Traefik | host network | Listens on node's :80, :443 |

## End-to-end request trace

A user in Texas hits `https://api.myhoneydue.com/api/tasks/`. Here's every hop:

```mermaid
sequenceDiagram
    autonumber
    participant U as User (Austin, TX)
    participant CF as Cloudflare edge (DFW POP)
    participant H as hetzner2 (picked by CF)<br/>178.105.32.198
    participant TR as Traefik pod<br/>(hostNetwork)
    participant API as api pod on hetzner3<br/>10.42.2.6:8000
    participant DB as Neon Postgres<br/>(AWS us-east-1)

    U->>CF: HTTPS :443 GET /api/tasks/
    Note over CF: TLS handshake terminates here
    CF->>H: HTTP :80 (with original Host header)
    H->>TR: Accepted by kernel, delivered to Traefik
    Note over TR: Matches Ingress rule<br/>host: api.myhoneydue.com
    TR->>TR: Resolve api.honeydue → 10.43.167.83
    TR->>H: dial 10.43.167.83:8000
    H->>H: kube-proxy IPVS rewrites<br/>dst → 10.42.2.6:8000
    H->>API: Flannel VXLAN encapsulate<br/>UDP 8472 → hetzner3
    Note over API: Pod receives packet
    API->>DB: SELECT … FROM tasks WHERE user_id = …<br/>TLS :5432
    DB-->>API: Result rows
    API-->>TR: HTTP 200 JSON
    TR-->>CF: HTTP 200
    CF-->>U: HTTPS 200
```

### Timing budget for a cache-miss read

| Hop | Typical latency |
|---|---|
| User → CF edge (DFW) | 5–15 ms |
| CF edge → hetzner2 (origin HTTP :80) | 90–120 ms (cross-Atlantic) |
| UFW + kernel accept | <1 ms |
| Traefik accept + route | 1–2 ms |
| kube-proxy + Flannel (same node) | <1 ms |
| kube-proxy + Flannel (remote node, VXLAN) | 1–3 ms |
| Go API request handling | 1–5 ms |
| Neon Postgres query (TLS + SQL) | 20–60 ms (AWS us-east-1) |
| Return path (reverse) | similar |

**Total typical**: ~200–300 ms for a user in North America, dominated by the cross-Atlantic CF→origin hop. Responses cached at Cloudflare skip the origin hop entirely.

## Inter-node routing concretely

Here's what `ip route` shows on hetzner2 (not run live; reconstructed from a typical k3s + Flannel VXLAN setup):

```
default via 172.31.1.1 dev eth0               # Hetzner gateway
10.42.0.0/24 via 10.42.0.0 dev flannel.1      # to hetzner1 pods (via VXLAN iface)
10.42.1.0/24 dev cni0                         # local pods on hetzner2
10.42.2.0/24 via 10.42.2.0 dev flannel.1      # to hetzner3 pods (via VXLAN iface)
# No main-table route for 10.43.0.0/16: ClusterIPs are virtual and are
# rewritten to pod IPs by kube-proxy (IPVS/netfilter), not routed
```

The `flannel.1` interface is the VXLAN tunnel endpoint. Traffic written to it gets encapsulated in UDP 8472 and sent to the peer node's public IP.

Flannel learns about peer nodes via the Kubernetes API (it watches Node resources). When hetzner3 joins, Flannel on hetzner1 and hetzner2 both learn its public IP and pod CIDR, update their routes and ARP tables, and traffic just works.

## Network performance

### Within a node (pod to pod, same host)

Packets go through the `cni0` bridge and never leave the node. Sub-millisecond latency, bounded by kernel + veth performance. Easily >10 Gbps.

### Between nodes (pod to pod, different host)

Packets go through Flannel VXLAN.
Added overhead: encap/decap in the kernel (~5–10 μs), plus the actual network hop between Hetzner nodes (~0.5 ms within the same Hetzner datacenter). Throughput is bounded by Hetzner's NIC (≈1 Gbps sustained per node).

In practice this is fine for everything we do. The slowest link in our application is Neon (AWS us-east-1), which is ~100 ms round-trip.

## DNS resolution path

A pod resolves `redis`:

1. App does `getaddrinfo("redis")`.
2. glibc reads `/etc/resolv.conf`, finds nameserver `10.43.0.10`.
3. glibc sends a UDP query to `10.43.0.10:53`.
4. Destination is the CoreDNS Service ClusterIP.
5. kube-proxy IPVS load-balances across CoreDNS pods (there's usually 1).
6. The packet arrives at the CoreDNS pod.
7. CoreDNS checks its Kubernetes plugin cache for `redis.honeydue.svc.cluster.local` (the name after the pod's search domain is appended).
8. Returns `10.43.7.10` (redis Service ClusterIP) with a low TTL (the kubernetes plugin default is 5 seconds).

CoreDNS is stateless: if it restarts, pods re-query on their next lookup.

**DNS caching in pods**: The Go API uses `net.Resolver`, which does not cache by default. Each new connection triggers a fresh DNS lookup. This is correct behavior for Kubernetes (where Service IPs are stable but Endpoints change), but it means a CoreDNS outage breaks new connections immediately. Next.js (admin) uses Node's default resolver, which behaves similarly.
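
The short name `redis` only becomes `redis.honeydue.svc.cluster.local` because the resolver walks the `search` list in the pod's `/etc/resolv.conf`. The sketch below shows that expansion order. It assumes the usual Kubernetes defaults for a pod in the `honeydue` namespace (three search domains and `options ndots:5`); the `candidate_names` function and `SEARCH` list are hypothetical names, and the real resolver stops at the first name that returns an answer.

```python
# Assumed search list for a pod in the honeydue namespace (Kubernetes default layout).
SEARCH = ["honeydue.svc.cluster.local", "svc.cluster.local", "cluster.local"]


def candidate_names(name, search=SEARCH, ndots=5):
    """Return the fully qualified names a stub resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as already fully qualified.
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = f"{name}."
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search list.
        return [absolute] + expanded
    # Fewer dots than ndots: search domains first, the bare name last.
    return expanded + [absolute]


for fqdn in candidate_names("redis"):
    print(fqdn)
```

For `redis` the first candidate is already the right one, so a single query suffices. An external name with fewer than five dots goes through every search domain before being tried as-is, which is why fully qualified names with a trailing dot save lookups for external hosts.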

## What breaks if X fails

| Failure | Symptom |
|---|---|
| Flannel daemon on one node crashes | Pods on that node can't reach other nodes' pods; kube-proxy Services sometimes work (kernel conntrack) |
| CoreDNS pod crashes (only 1) | New connection DNS lookups fail; existing connections continue |
| kube-proxy daemon on one node crashes | Service ClusterIPs on that node stop tracking Endpoint changes and go stale (already-programmed rules persist in the kernel); direct pod IPs still work |
| UFW misconfigured (port 8472 UDP blocked) | Pods on that node can't reach remote pods over overlay |
| Node's NIC fails | Node unreachable; Raft loses it; its pods get rescheduled elsewhere |
| Hetzner datacenter outage | Entire cluster offline |

## Operator cheat sheet

```bash
# See all IPs in the cluster
kubectl get pods -A -o wide     # pod IPs + nodes
kubectl get svc -A              # Service ClusterIPs

# Test DNS resolution from inside a pod
kubectl exec -n honeydue deploy/api -- nslookup redis
kubectl exec -n honeydue deploy/api -- getent hosts redis

# Test pod-to-pod TCP connectivity
kubectl exec -n honeydue deploy/api -- nc -zv redis 6379
kubectl exec -n honeydue deploy/api -- wget -q -O- http://admin:3000/

# See the node's iptables/IPVS rules (run on a node)
ssh deploy@hetzner1 "sudo ipvsadm -Ln"
ssh deploy@hetzner1 "sudo iptables -L -n -t nat | head -50"

# See the cluster's flannel state (node name, internal IP, pod CIDR)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.addresses[?(@.type=="InternalIP")].address}{" "}{.spec.podCIDR}{"\n"}{end}'
```

## References

- [Kubernetes networking concepts][k8s-net]
- [Flannel VXLAN backend][flannel-vxlan]
- [CoreDNS k8s plugin][coredns-k8s]
- [IPVS mode for kube-proxy][ipvs]
- [VXLAN RFC 7348][vxlan-rfc]

[k8s-net]: https://kubernetes.io/docs/concepts/services-networking/
[flannel-vxlan]: https://github.com/flannel-io/flannel/blob/master/Documentation/backends.md#vxlan
[coredns-k8s]: https://coredns.io/plugins/kubernetes/
[ipvs]: https://kubernetes.io/blog/2018/07/09/ipvs-based-in-cluster-load-balancing-deep-dive/
[vxlan-rfc]: https://datatracker.ietf.org/doc/html/rfc7348