Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
temporarily for reference
Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
callback (was causing 'unlock of unlocked mutex' fatal after
Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
+ allowlist fonts.googleapis.com so the marketing landing page CSS
actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
--platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
images runnable on x86_64 Hetzner nodes; fix array expansion under
set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
top-level aliases (the '${X_SECRET}' form never actually resolved);
dozzle ports: long-form host_ip is rejected by Swarm, switched to
short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
(Next.js serves at root; /admin/ returned 404 and killed pods);
startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
and admin/src/app/api/*, hiding legitimate files)
New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log
Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
- Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
- Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
- Part III Security, Traefik ingress (Ch 5-6)
- Part IV Services, DB, storage, secrets, registry (Ch 7-11)
- Part V Data flow, deploy process, observability, failures, runbook
(Ch 12, 14-17)
- Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
- Appendices: glossary, kubectl cheat sheet, file locations,
consolidated citations
- README.md: Production Deployment section replaced with pointer to
the book; Go version bumped to 1.25
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
03 — Networking
Summary
The network stack has five layers: the physical/internet layer (Hetzner's public network), the node layer (Ubuntu with UFW), the Kubernetes overlay (Flannel VXLAN), the service layer (kube-proxy IPVS + CoreDNS), and the ingress layer (Traefik). This chapter walks through each, explains how they compose, and traces a single HTTP request from browser to Go API response showing every hop.
The five layers
flowchart TB
subgraph L5[Layer 5 — Ingress]
Traefik
end
subgraph L4[Layer 4 — Service discovery]
KubeProxy[kube-proxy IPVS]
CoreDNS
end
subgraph L3[Layer 3 — Pod overlay]
Flannel[Flannel VXLAN<br/>UDP 8472]
end
subgraph L2[Layer 2 — Node network]
UFW
Kernel[Linux kernel<br/>netfilter/iptables]
end
subgraph L1[Layer 1 — Physical]
Hetzner[Hetzner network<br/>public v4 + v6]
end
L5 --> L4 --> L3 --> L2 --> L1
ASCII fallback
┌──────────────────────────────────────┐
│ L5 Traefik (host network, :80/:443)│
├──────────────────────────────────────┤
│ L4 kube-proxy (IPVS) + CoreDNS │
├──────────────────────────────────────┤
│ L3 Flannel VXLAN overlay │
│ 10.42.0.0/16 pod CIDR │
├──────────────────────────────────────┤
│ L2 Ubuntu + UFW + kernel iptables │
├──────────────────────────────────────┤
│ L1 Hetzner public IPv4/IPv6 │
└──────────────────────────────────────┘
Layer 1 — Physical network
Each Hetzner CX33 has:
- A public IPv4 address on the internet
- A public IPv6 /64 subnet (one address used, the rest unused)
- 20 TB/mo outbound traffic included; inbound is free
- ~1 Gbps network bandwidth per node
All inter-node traffic goes over the public network. Hetzner Cloud offers private networks, but we didn't attach one; adding one now would require pointing K3s and Flannel at the private interface (the --node-ip and --flannel-iface options) on every node. A future improvement: attach a private network to all three nodes, reconfigure Flannel to use it, and shrink our public-interface attack surface.
Layer 2 — Node network
Each node runs Ubuntu 24.04.3 LTS with:
- Default routing via the Hetzner-provided gateway
- UFW as the iptables frontend (Chapter 4 lists every rule)
- IP forwarding enabled (net.ipv4.ip_forward=1), required for Kubernetes pod routing
- Bridge netfilter enabled (net.bridge.bridge-nf-call-iptables=1), required so iptables can see bridged traffic
K3s configures the latter two automatically at install time via
/etc/sysctl.d/90-kubelet.conf (or similar; exact file varies by distro).
One additional sysctl we set manually:
# /etc/sysctl.d/99-unprivileged-ports.conf
net.ipv4.ip_unprivileged_port_start=0
Why: Traefik runs as UID 65532 (non-root) in host network mode to bind
:80 and :443. Without this sysctl, even with CAP_NET_BIND_SERVICE, it
can't bind privileged ports in the host namespace. Ubuntu 24.04's default
is 1024 (so ports 1–1023 are "privileged"). Setting it to 0 lets any
user bind any port.
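Security implication: Small but not zero. The sysctl is scoped to a network namespace, so this setting applies to the host namespace only: any local process on the node, root or not, can now bind a port below 1024 there. Pods with their own network namespace are unaffected by it, and a second pod asking for host ports 80/443 cannot be scheduled onto a node where Traefik already holds them. The UFW rules still gate what's reachable externally.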
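The effect of the threshold can be sketched in a few lines. This is an illustrative model rather than part of our tooling: `read_unprivileged_port_start` and `needs_privilege` are hypothetical helper names, and the /proc path is the usual location of this sysctl on Linux.

```python
from pathlib import Path

# Standard procfs location of the sysctl on Linux.
SYSCTL_PATH = Path("/proc/sys/net/ipv4/ip_unprivileged_port_start")


def read_unprivileged_port_start(path: Path = SYSCTL_PATH, default: int = 1024) -> int:
    """Return the current threshold, or `default` if the file cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return default


def needs_privilege(port: int, unprivileged_start: int) -> bool:
    """Ports below the threshold need root or CAP_NET_BIND_SERVICE to bind."""
    return port < unprivileged_start


if __name__ == "__main__":
    start = read_unprivileged_port_start()
    for port in (80, 443, 8000):
        label = "privileged" if needs_privilege(port, start) else "unprivileged"
        print(f"port {port}: {label} (threshold {start})")
```

With the threshold at 1024, ports 80 and 443 count as privileged; with it at 0, no port does, which is what lets a non-root Traefik bind them.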
Layer 3 — Pod overlay (Flannel VXLAN)
What Flannel is
Flannel is a CNI (Container Network Interface) plugin. Its job: give every pod in the cluster a routable IP address, and make those IPs reachable from any other pod regardless of which node they're on.
The pod CIDR
K3s assigns 10.42.0.0/16 as the cluster-wide pod CIDR by default. Each node gets a /24 slice:
| Node | Pod CIDR |
|---|---|
| ubuntu-8gb-nbg1-1 | 10.42.1.0/24 |
| ubuntu-8gb-nbg1-2 | 10.42.0.0/24 |
| ubuntu-8gb-nbg1-3 | 10.42.2.0/24 |
Each pod gets an IP from its node's slice. So a pod on hetzner2
(nbg1-1) might be 10.42.1.6; a pod on hetzner3 (nbg1-3) might be
10.42.2.10.
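The slicing above can be checked with Python's standard `ipaddress` module. A minimal sketch: the node-to-slice mapping is copied from the table, and `node_for_pod_ip` is a hypothetical helper name.

```python
import ipaddress
from typing import Optional

CLUSTER_POD_CIDR = ipaddress.ip_network("10.42.0.0/16")

# Node-to-slice assignments copied from the table above.
NODE_POD_CIDRS = {
    "ubuntu-8gb-nbg1-1": ipaddress.ip_network("10.42.1.0/24"),
    "ubuntu-8gb-nbg1-2": ipaddress.ip_network("10.42.0.0/24"),
    "ubuntu-8gb-nbg1-3": ipaddress.ip_network("10.42.2.0/24"),
}

# A /16 split into /24 slices gives 256 possible per-node ranges.
SLICE_COUNT = sum(1 for _ in CLUSTER_POD_CIDR.subnets(new_prefix=24))


def node_for_pod_ip(pod_ip: str) -> Optional[str]:
    """Return the node whose /24 slice contains pod_ip, or None if no slice matches."""
    addr = ipaddress.ip_address(pod_ip)
    for node, cidr in NODE_POD_CIDRS.items():
        if addr in cidr:
            return node
    return None
```

Because the node is recoverable from the pod IP alone, a pod address in a log line or packet capture tells you which host to look at.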
How VXLAN works
VXLAN ("Virtual Extensible LAN") tunnels Layer-2 frames over UDP. Flannel wraps every inter-node packet like so:
Original pod → pod packet:
┌──────────────────────────────────────────────────┐
│ Ethernet │ IP src=10.42.0.5 → dst=10.42.2.10 │ … │
└──────────────────────────────────────────────────┘
Flannel VXLAN-encapsulates it:
┌──────────────────────────────────────────────────────────────────┐
│ Eth │ IP src=178.104.247.152 → dst=178.104.249.189 │ UDP 8472 │ │
│ VXLAN header │ <original Ethernet+IP+payload> │ │
└──────────────────────────────────────────────────────────────────┘
The outer IP/UDP carries the packet between nodes over Hetzner's public network. On arrival, the destination node unwraps the VXLAN header and delivers the inner packet to the target pod.
Flannel's VXLAN backend uses UDP port 8472, the Linux kernel's default VXLAN port; the IANA-assigned VXLAN port is 4789. Port 8472 must be open node-to-node in UFW (see Chapter 4).
MTU note: VXLAN encapsulation adds 50 bytes of overhead (8 VXLAN + 8 UDP + 20 IP + 14 Ethernet). Hetzner's network uses standard 1500-byte MTU, so Flannel's overlay MTU is 1450. Mismatches cause silent packet drops. K3s sets this correctly by default.
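The header layout and the overhead arithmetic can be sketched as follows. The 8-byte header format follows RFC 7348; the VNI value of 1 in the usage note is an assumption (we believe it is Flannel's default, but have not checked it against this cluster), and the function names are illustrative.

```python
import struct

VXLAN_HEADER_LEN = 8
UDP_HEADER_LEN = 8
IPV4_HEADER_LEN = 20      # outer IPv4 header without options
ETHERNET_HEADER_LEN = 14  # inner Ethernet header carried inside the tunnel


def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header: 8 flag bits, 24 reserved, 24-bit VNI, 8 reserved."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    flags = 0x08  # the "I" bit marks the VNI field as valid
    return struct.pack("!II", flags << 24, vni << 8)


def overlay_mtu(underlay_mtu: int) -> int:
    """Largest inner IP packet that fits once the encapsulation headers are added."""
    overhead = (VXLAN_HEADER_LEN + UDP_HEADER_LEN
                + IPV4_HEADER_LEN + ETHERNET_HEADER_LEN)
    return underlay_mtu - overhead
```

For example, `overlay_mtu(1500)` gives 1450, matching the overlay MTU stated above, and `vxlan_header(1)` produces the eight bytes that sit between the outer UDP header and the inner Ethernet frame.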
Flannel config
/var/lib/rancher/k3s/agent/etc/flannel/net-conf.json on each node:
{
"Network": "10.42.0.0/16",
"EnableIPv6": false,
"EnableIPv4": true,
"IPv6Network": "::/0",
"Backend": { "Type": "vxlan" }
}
We did not enable IPv6 in the cluster — an unnecessary complexity for our scale, and CoreDNS + kube-proxy + node controllers all work fine in v4-only mode.
No encryption (yet)
Flannel VXLAN traffic over Hetzner's public network is not encrypted. This means pod-to-pod traffic between nodes is visible to any attacker with packet capture on the path — in practice, nobody between our three nodes at Hetzner Nuremberg, but it's still plaintext on the wire.
Mitigation today: Sensitive traffic is already encrypted at the application layer; Redis is the one plaintext exception:
- api ↔ Neon Postgres: TLS 1.3 (DB_SSLMODE=require)
- api/worker ↔ Backblaze B2: HTTPS
- api ↔ Fastmail: STARTTLS
- api ↔ Redis: plaintext, but Redis only holds cache + Asynq queue state, no user credentials
TODO (Chapter 20): Switch Flannel to the wireguard-native backend. K3s
supports this via the --flannel-backend option; enabling it on an existing
cluster requires a config edit and a restart of K3s on every node.
Layer 4 — Service discovery
Pods don't talk to each other by IP — IPs are ephemeral, assigned on pod creation. They use service names resolved by DNS.
CoreDNS
K3s runs CoreDNS as the cluster DNS server. A pod in the honeydue
namespace resolves redis to the Redis Service's ClusterIP:
redis → 10.43.7.10 (Service ClusterIP)
redis.honeydue → 10.43.7.10
redis.honeydue.svc.cluster.local → 10.43.7.10
When an app resolves redis:6379:
- The pod's /etc/resolv.conf points to 10.43.0.10 (the CoreDNS Service).
- CoreDNS receives the query, checks its known Services, returns 10.43.7.10.
- The pod sends TCP to 10.43.7.10:6379.
- kube-proxy (Layer 4, below) intercepts and routes to the actual pod IP.
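The short name works because the resolver expands it through a search list before querying. The sketch below models that expansion, assuming the usual Kubernetes pod defaults (a three-entry search list built from the namespace and `options ndots:5`); we have not copied these values from a live pod, and `candidate_names` is a hypothetical helper.

```python
from typing import List


def candidate_names(name: str, namespace: str,
                    cluster_domain: str = "cluster.local",
                    ndots: int = 5) -> List[str]:
    """Return the fully qualified names a resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no search expansion.
        return [name]
    search = [
        f"{namespace}.svc.{cluster_domain}",
        f"svc.{cluster_domain}",
        cluster_domain,
    ]
    expanded = [f"{name}.{suffix}." for suffix in search]
    absolute = f"{name}."
    if name.count(".") < ndots:
        # Few dots: try the search list first, the bare name last.
        return expanded + [absolute]
    return [absolute] + expanded
```

For `redis` in the honeydue namespace the first candidate is `redis.honeydue.svc.cluster.local.`, which is why all three spellings shown earlier land on the same ClusterIP.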
The service CIDR
K3s assigns 10.43.0.0/16 as the service CIDR. ClusterIPs live here. Currently:
| Service | ClusterIP |
|---|---|
| api.honeydue | 10.43.167.83 |
| admin.honeydue | 10.43.136.168 |
| redis.honeydue | 10.43.7.10 |
| kubernetes.default | 10.43.0.1 |
| kube-dns.kube-system | 10.43.0.10 |
ClusterIPs are stable for the life of the Service — they don't change when pods come and go.
kube-proxy (IPVS mode)
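kube-proxy is the dataplane component that makes Services work. K3s runs
it embedded in the k3s process on every node (not as a separate DaemonSet);
it watches the Kubernetes API for Service and endpoint changes and programs
the kernel to route traffic.
This chapter describes kube-proxy in IPVS mode. That is not the out-of-the-box setting: kube-proxy uses iptables mode on Linux unless proxy-mode=ipvs is set (in K3s, via --kube-proxy-arg), so confirm the mode on a node with ipvsadm -Ln before relying on the IPVS details below. IPVS is a Linux kernel feature for in-kernel L4 load balancing, essentially connection-tracking NAT with round-robin or other scheduling.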
When a pod dials 10.43.7.10:6379:
- The first packet hits the node's kernel
- IPVS sees the destination is a ClusterIP
- IPVS picks an endpoint from the Service's endpoint set (e.g., 10.42.0.10:6379 on hetzner1)
- IPVS rewrites the destination and forwards
- Flannel tunnels it to the destination node (if remote) or delivers locally (if the endpoint is on the same node)
This happens per-TCP-connection, not per-packet, thanks to conntrack.
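The per-connection behaviour can be modelled with a scheduler plus a flow table. This is a toy sketch of the idea, not how the kernel implements it: `VirtualService` is a hypothetical class, round-robin is just one of the available schedulers, and the first endpoint address in the usage note is made up.

```python
from itertools import cycle
from typing import Dict, Iterator, List, Tuple

Endpoint = Tuple[str, int]
FlowKey = Tuple[str, int, str, int]  # (src_ip, src_port, dst_ip, dst_port)


class VirtualService:
    """Round-robin endpoint selection with a conntrack-style flow table."""

    def __init__(self, endpoints: List[Endpoint]) -> None:
        if not endpoints:
            raise ValueError("a Service needs at least one endpoint")
        self._scheduler: Iterator[Endpoint] = cycle(endpoints)
        self._conntrack: Dict[FlowKey, Endpoint] = {}

    def route(self, flow: FlowKey) -> Endpoint:
        """Pick an endpoint for a new flow; reuse the earlier choice for a known flow."""
        endpoint = self._conntrack.get(flow)
        if endpoint is None:
            endpoint = next(self._scheduler)
            self._conntrack[flow] = endpoint
        return endpoint
```

Given `VirtualService([("10.42.0.7", 8000), ("10.42.2.6", 8000)])`, every packet of one TCP connection maps to the same endpoint, while a second connection from a different source port goes to the next one in the rotation.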
Why IPVS over iptables
We prefer IPVS to kube-proxy's default iptables mode. In iptables mode every Service adds its own chain of rules, so the rule set grows linearly with Service count and is matched sequentially. IPVS uses a hash table and scales to thousands of Services without performance degradation. At our scale either works.
Headless Services
A Service can also be created without a ClusterIP; such a Service is
"headless" (clusterIP: None). Our setup doesn't currently use any, but it's
worth knowing the distinction: headless Services return all endpoint IPs
directly via DNS, with no kube-proxy involvement. Useful for StatefulSets
where clients need to talk to a specific replica.
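The difference shows up in what a DNS lookup returns. A minimal sketch, with `dns_answer` as a hypothetical helper and made-up endpoint IPs in the usage note; real DNS answers come back in no guaranteed order, so the sorting here is only for readability.

```python
from typing import List, Optional


def dns_answer(cluster_ip: Optional[str], endpoint_ips: List[str]) -> List[str]:
    """Return the A records a client would see for a Service name."""
    if cluster_ip is None:
        # Headless (clusterIP: None): one record per ready endpoint.
        return sorted(endpoint_ips)
    # Regular Service: a single stable virtual IP, whatever the endpoints are.
    return [cluster_ip]
```

A regular Service such as redis answers with `["10.43.7.10"]` no matter how many pods back it, whereas a headless Service with endpoints `10.42.0.10` and `10.42.2.4` would answer with both pod IPs and leave the choice to the client.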
Layer 5 — Ingress (Traefik)
External traffic arrives on the node's public :80 or :443. Traefik handles the first mile of routing. See Chapter 6 for Traefik-specific details; this section just shows how it fits in the networking stack.
Traefik runs as a DaemonSet with hostNetwork: true. That means:
- One Traefik pod per node
- Each pod is in the host's network namespace, not a pod netns
- Each pod can bind directly to 0.0.0.0:80 and 0.0.0.0:443 on the node
When Cloudflare sends a request to 178.104.247.152:80:
- Packet arrives at hetzner1's NIC
- UFW accepts (80/tcp is open from anywhere)
- The kernel delivers it to the local socket listening on 0.0.0.0:80
- Traefik (running in the host namespace) accepts the connection
- Traefik reads the Host: header
- Traefik matches an Ingress rule (api.myhoneydue.com → api Service)
- Traefik dials 10.43.167.83:8000 (Service ClusterIP)
- kube-proxy IPVS rewrites to a live api pod endpoint
- Flannel VXLAN tunnels if the endpoint is on a remote node
- The api pod receives the request, processes it, and responds
- The response flows back along the reverse path
Full trace in the end-to-end section below.
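The Host-header step in the walkthrough above amounts to a table lookup. The sketch below is a simplified model, not Traefik's matcher: the api rule and port come from this chapter, while the admin hostname is a placeholder we made up (only the admin Service and its port 3000 appear in our manifests), and `match_backend` is a hypothetical helper.

```python
from typing import Dict, Optional, Tuple

Backend = Tuple[str, int]  # (Service name, port)

INGRESS_RULES: Dict[str, Backend] = {
    "api.myhoneydue.com": ("api.honeydue", 8000),
    # Placeholder hostname, for illustration only.
    "admin.example.invalid": ("admin.honeydue", 3000),
}


def match_backend(host_header: str) -> Optional[Backend]:
    """Map a Host header to a backend Service, or None if no rule matches."""
    host = host_header.strip().lower()
    # Drop an optional ":port" suffix (IPv6 literals are out of scope here).
    if ":" in host:
        host = host.rsplit(":", 1)[0]
    return INGRESS_RULES.get(host.rstrip("."))
```

A request with `Host: api.myhoneydue.com` resolves to the api Service on port 8000; an unknown host returns `None`, which in a real ingress would surface as a 404.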
IPs we care about
| What | CIDR / IP | Used for |
|---|---|---|
| Pod CIDR | 10.42.0.0/16 | All pod IPs cluster-wide |
| Service CIDR | 10.43.0.0/16 | All ClusterIPs |
| Flannel VXLAN | UDP 8472 | Pod-to-pod traffic (inter-node) |
| CoreDNS Service | 10.43.0.10:53 | Cluster DNS |
| Kubernetes Service | 10.43.0.1:443 | Internal kube-apiserver |
| Node IPs | See README | External + flannel source/dst |
| Traefik | host network | Listens on node's :80, :443 |
End-to-end request trace
A user in Texas hits https://api.myhoneydue.com/api/tasks/. Here's every
hop:
sequenceDiagram
autonumber
participant U as User (Austin, TX)
participant CF as Cloudflare edge (DFW POP)
participant H as hetzner2 (picked by CF)<br/>178.105.32.198
participant TR as Traefik pod<br/>(hostNetwork)
participant API as api pod on hetzner3<br/>10.42.2.6:8000
participant DB as Neon Postgres<br/>(AWS us-east-1)
U->>CF: HTTPS :443 GET /api/tasks/
Note over CF: TLS handshake terminates here
CF->>H: HTTP :80 (with original Host header)
H->>TR: Accepted by kernel, delivered to Traefik
Note over TR: Matches Ingress rule<br/>host: api.myhoneydue.com
TR->>TR: Resolve api.honeydue → 10.43.167.83
TR->>H: dial 10.43.167.83:8000
H->>H: kube-proxy IPVS rewrites<br/>dst → 10.42.2.6:8000
H->>API: Flannel VXLAN encapsulate<br/>UDP 8472 → hetzner3
Note over API: Pod receives packet
API->>DB: SELECT … FROM tasks WHERE user_id = …<br/>TLS :5432
DB-->>API: Result rows
API-->>TR: HTTP 200 JSON
TR-->>CF: HTTP 200
CF-->>U: HTTPS 200
Timing budget for a cache-miss read
| Hop | Typical latency |
|---|---|
| User → CF edge (DFW) | 5–15 ms |
| CF edge → hetzner2 (origin HTTP :80) | 90–120 ms (cross-Atlantic) |
| UFW + kernel accept | <1 ms |
| Traefik accept + route | 1–2 ms |
| kube-proxy + Flannel (same node) | <1 ms |
| kube-proxy + Flannel (remote node, VXLAN) | 1–3 ms |
| Go API request handling | 1–5 ms |
| Neon Postgres query (TLS + SQL) | 20–60 ms (AWS us-east-1) |
| Return path (reverse) | similar |
Total typical: ~200–300 ms for a user in North America, dominated by the cross-Atlantic CF→origin hop. Cached responses at Cloudflare skip the origin hop entirely.
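Summing the listed hops once gives a feel for which one dominates. This is a rough sketch under stated assumptions: each row is counted a single time, "<1 ms" is modelled as 0 to 1 ms, the remote-node variant of the kube-proxy + Flannel hop is used, and no attempt is made to model how the return path contributes to the total quoted above.

```python
from typing import Dict, Tuple

# (low_ms, high_ms) per hop, copied from the table above.
HOPS: Dict[str, Tuple[int, int]] = {
    "user_to_cf_edge": (5, 15),
    "cf_edge_to_origin": (90, 120),
    "ufw_kernel_accept": (0, 1),
    "traefik_route": (1, 2),
    "kube_proxy_flannel_remote": (1, 3),
    "go_api_handling": (1, 5),
    "neon_query": (20, 60),
}


def total_range(hops: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum the low and high ends of every hop."""
    low = sum(lo for lo, _ in hops.values())
    high = sum(hi for _, hi in hops.values())
    return low, high


def share_of_total(hop: str, hops: Dict[str, Tuple[int, int]]) -> Tuple[float, float]:
    """Fraction of the summed budget taken by one hop, at the low and high ends."""
    low, high = total_range(hops)
    hop_low, hop_high = hops[hop]
    return hop_low / low, hop_high / high
```

On these figures the Cloudflare-to-origin hop accounts for more than half of the summed budget at both ends of the range, which is why caching at the edge pays off far more than tuning anything inside the cluster.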
Inter-node routing concretely
Here's what ip route shows on hetzner2 (not run live, reconstructed from
typical k3s+flannel+vxlan setup):
default via 172.31.1.1 dev eth0 # Hetzner gateway
10.42.0.0/24 via 10.42.0.0 dev flannel.1 # to hetzner1 pods (via VXLAN iface)
10.42.1.0/24 dev cni0 # local pods on hetzner2
10.42.2.0/24 via 10.42.2.0 dev flannel.1 # to hetzner3 pods (via VXLAN iface)
# ClusterIPs (10.43.0.0/16) have no route here; kube-proxy's NAT rules handle them
The flannel.1 interface is the VXLAN tunnel endpoint. Traffic written
to it gets encapsulated in UDP 8472 and sent to the peer node's public IP.
Flannel learns about peer nodes via the Kubernetes API (it watches Node resources). When hetzner3 joins, Flannel on hetzner1 and hetzner2 both learn its public IP and pod CIDR, update their routes and ARP tables, and traffic just works.
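The kernel reads a table like the one above by longest-prefix match. The sketch below reproduces that lookup for the default route and the three pod routes on hetzner2; it is a simplified model (no metrics, no policy routing), and `egress_interface` is a hypothetical helper.

```python
import ipaddress
from typing import List, Tuple

# (destination CIDR, outgoing interface) as seen from hetzner2.
ROUTES: List[Tuple[str, str]] = [
    ("0.0.0.0/0", "eth0"),          # default route via the Hetzner gateway
    ("10.42.0.0/24", "flannel.1"),  # hetzner1 pods, via VXLAN
    ("10.42.1.0/24", "cni0"),       # local pods
    ("10.42.2.0/24", "flannel.1"),  # hetzner3 pods, via VXLAN
]


def egress_interface(dst: str) -> str:
    """Return the interface of the most specific route that contains dst."""
    addr = ipaddress.ip_address(dst)
    matches = [
        (network, dev)
        for network, dev in ((ipaddress.ip_network(c), d) for c, d in ROUTES)
        if addr in network
    ]
    # The default route always matches, so `matches` is never empty.
    _, dev = max(matches, key=lambda m: m[0].prefixlen)
    return dev
```

A packet for 10.42.2.6 leaves through `flannel.1` and gets encapsulated, one for 10.42.1.9 stays on the local `cni0` bridge, and anything outside the pod CIDR falls through to `eth0`.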
Network performance
Within a node (pod to pod, same host)
Packets go through cni0 bridge, never leave the node. Sub-millisecond
latency, bounded by kernel + veth performance. Easily >10 Gbps.
Between nodes (pod to pod, different host)
Packets go through Flannel VXLAN. Added overhead: encap/decap in the kernel (~5–10 μs), plus the actual network hop between hetzner nodes (~0.5 ms within the same Hetzner datacenter). Throughput is bounded by Hetzner's NIC (≈1 Gbps sustained per node).
In practice this is fine for everything we do. The slowest link in our application is Neon (AWS us-east-1), which is ~100 ms round-trip.
DNS resolution path
A pod resolves redis:
- App does getaddrinfo("redis").
- glibc reads /etc/resolv.conf, finds nameserver 10.43.0.10.
- Sends UDP 53 to 10.43.0.10.
- Destination is CoreDNS Service ClusterIP.
- kube-proxy IPVS load-balances across CoreDNS pods (there's usually 1).
- The packet arrives at the CoreDNS pod.
- CoreDNS checks its Kubernetes plugin cache for redis.<ns>.svc.cluster.local.
- Returns 10.43.7.10 (redis Service ClusterIP) with a low TTL.
CoreDNS is stateless — if it restarts, pods re-query on their next lookup.
DNS caching in pods: The Go API uses net.Resolver which does not
cache by default. Each new connection triggers a fresh DNS lookup. This
is correct behavior for Kubernetes (where Service IPs are stable but
Endpoints change), but it means a CoreDNS outage breaks new connections
immediately.
Next.js (admin) also uses Node's default resolver, similar behavior.
What breaks if X fails
| Failure | Symptom |
|---|---|
| Flannel daemon on one node crashes | Pods on that node can't reach other nodes' pods; kube-proxy Services sometimes work (kernel conntrack) |
| CoreDNS pod crashes (only 1) | New connection DNS lookups fail; existing connections continue |
| kube-proxy on one node stops | Service rules already in the kernel keep working but go stale: new or moved endpoints are not picked up on that node; direct pod IPs still work |
| UFW misconfigured (port 8472 UDP blocked) | Pods on that node can't reach remote pods over overlay |
| Node's NIC fails | Node unreachable; Raft loses it; its pods get rescheduled elsewhere |
| Hetzner datacenter outage | Entire cluster offline |
Operator cheat sheet
# See all IPs in the cluster
kubectl get pods -A -o wide # pod IPs + nodes
kubectl get svc -A # Service ClusterIPs
# Test pod-to-pod DNS from inside a pod
kubectl exec -n honeydue deploy/api -- nslookup redis
kubectl exec -n honeydue deploy/api -- getent hosts redis
# Test pod-to-pod TCP connectivity
kubectl exec -n honeydue deploy/api -- nc -zv redis 6379
kubectl exec -n honeydue deploy/api -- wget -q -O- http://admin:3000/
# See the node's iptables/IPVS rules (run on a node)
ssh deploy@hetzner1 "sudo ipvsadm -Ln"
ssh deploy@hetzner1 "sudo iptables -L -n -t nat | head -50"
# See the cluster's flannel state
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.addresses[?(@.type=="InternalIP")].address}{" "}{.spec.podCIDR}{"\n"}{end}'