honeyDueAPI/docs/deployment/03-networking.md

# 03 — Networking

## Summary

The network stack has five layers: the physical/internet layer (Hetzner's
public network), the node layer (Ubuntu with UFW), the Kubernetes overlay
(Flannel VXLAN), the service layer (kube-proxy IPVS + CoreDNS), and the
ingress layer (Traefik). This chapter walks through each, explains how
they compose, and traces a single HTTP request from browser to Go API
response showing every hop.

## The five layers

```mermaid
flowchart TB
    subgraph L5[Layer 5 — Ingress]
        Traefik
    end
    subgraph L4[Layer 4 — Service discovery]
        KubeProxy[kube-proxy IPVS]
        CoreDNS
    end
    subgraph L3[Layer 3 — Pod overlay]
        Flannel[Flannel VXLAN<br/>UDP 8472]
    end
    subgraph L2[Layer 2 — Node network]
        UFW
        Kernel[Linux kernel<br/>netfilter/iptables]
    end
    subgraph L1[Layer 1 — Physical]
        Hetzner[Hetzner network<br/>public v4 + v6]
    end

    L5 --> L4 --> L3 --> L2 --> L1
```

### ASCII fallback

```
  ┌──────────────────────────────────────┐
  │  L5  Traefik (host network, :80/:443)│
  ├──────────────────────────────────────┤
  │  L4  kube-proxy (IPVS) + CoreDNS     │
  ├──────────────────────────────────────┤
  │  L3  Flannel VXLAN overlay           │
  │      10.42.0.0/16 pod CIDR           │
  ├──────────────────────────────────────┤
  │  L2  Ubuntu + UFW + kernel iptables  │
  ├──────────────────────────────────────┤
  │  L1  Hetzner public IPv4/IPv6        │
  └──────────────────────────────────────┘
```

## Layer 1 — Physical network

Each Hetzner CX33 has:
- A **public IPv4** address on the internet
- A **public IPv6** /64 subnet (one address used, the rest unused)
- **20 TB/mo** outbound traffic included; inbound is free
- **~1 Gbps** network bandwidth per node

All inter-node traffic goes over the **public network**. Hetzner Cloud
offers a private-network feature (vSwitch), but we didn't attach one, and
adding it now would require reconfiguring Flannel's advertise-addr. A
future improvement is to attach a private vSwitch to all three nodes,
reconfigure Flannel to use it, and shrink our public-interface attack
surface.

## Layer 2 — Node network

Each node runs Ubuntu 24.04.3 LTS with:

- **Default routing** via the Hetzner-provided gateway
- **UFW** as the iptables frontend (Chapter 4 lists every rule)
- **IP forwarding** enabled (`net.ipv4.ip_forward=1`) — required for
  Kubernetes pod routing
- **Bridge netfilter** enabled (`net.bridge.bridge-nf-call-iptables=1`)
  — required so iptables can see bridged traffic

K3s configures the latter two automatically at install time via
`/etc/sysctl.d/90-kubelet.conf` (or similar; exact file varies by distro).

One additional sysctl we set manually:

```
# /etc/sysctl.d/99-unprivileged-ports.conf
net.ipv4.ip_unprivileged_port_start=0
```

**Why**: Traefik runs as UID 65532 (non-root) in host network mode to bind
:80 and :443. Without this sysctl, even with `CAP_NET_BIND_SERVICE`, it
can't bind privileged ports in the host namespace. Ubuntu 24.04's default
is 1024 (so ports 1–1023 are "privileged"). Setting it to 0 lets any
user bind any port.

**Security implication**: Minimal. Host ports are still arbitrated by
Kubernetes: other pods on the node can't accidentally grab 80/443,
because the scheduler won't place a second pod that declares a
conflicting host port on the same node. And the UFW rules still gate
what's reachable externally.
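
To confirm the three sysctls from this section on a node, a small check like the one below can help. It is a sketch, not part of our tooling: `check_sysctl` is a hypothetical helper, and the `SYSCTL_ROOT` override exists only so the function can be exercised against a scratch directory.

```bash
# Compare a live sysctl value with the value this chapter expects.
# check_sysctl and SYSCTL_ROOT are hypothetical names used for illustration.
check_sysctl() {
  local key="$1" want="$2" path got
  # net.ipv4.ip_forward -> /proc/sys/net/ipv4/ip_forward
  path="${SYSCTL_ROOT:-/proc/sys}/$(printf '%s' "$key" | tr . /)"
  if [ -r "$path" ]; then got="$(cat "$path")"; else got="unavailable"; fi
  if [ "$got" = "$want" ]; then
    echo "ok   $key=$got"
  else
    echo "FAIL $key=$got (expected $want)"
    return 1
  fi
}

# On a node (not executed here):
#   check_sysctl net.ipv4.ip_forward 1
#   check_sysctl net.bridge.bridge-nf-call-iptables 1
#   check_sysctl net.ipv4.ip_unprivileged_port_start 0
```

The function returns non-zero on a mismatch, so it can be chained in a provisioning script if that turns out to be useful.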

## Layer 3 — Pod overlay (Flannel VXLAN)

### What Flannel is

Flannel is a CNI (Container Network Interface) plugin. Its job: give every
pod in the cluster a routable IP address, and make those IPs reachable
from any other pod regardless of which node they're on.

### The pod CIDR

K3s assigns **10.42.0.0/16** as the cluster-wide pod CIDR by default. Each
node gets a /24 slice:

| Node | Pod CIDR |
|---|---|
| ubuntu-8gb-nbg1-1 | 10.42.1.0/24 |
| ubuntu-8gb-nbg1-2 | 10.42.0.0/24 |
| ubuntu-8gb-nbg1-3 | 10.42.2.0/24 |

Each pod gets an IP from its node's slice. So a pod on hetzner2
(`nbg1-1`) might be `10.42.1.6`; a pod on hetzner3 (`nbg1-3`) might be
`10.42.2.10`.
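
The slice layout above makes it possible to tell which node hosts a pod from its IP alone. A minimal sketch, with the mapping copied from the table and a hypothetical function name:

```bash
# Map a pod IP to its /24 slice and owning node (mapping from the table above).
# node_for_pod_ip is a hypothetical helper name.
node_for_pod_ip() {
  case "$1" in
    10.42.0.*) echo "10.42.0.0/24 ubuntu-8gb-nbg1-2" ;;
    10.42.1.*) echo "10.42.1.0/24 ubuntu-8gb-nbg1-1" ;;
    10.42.2.*) echo "10.42.2.0/24 ubuntu-8gb-nbg1-3" ;;
    10.42.*)   echo "in pod CIDR, slice not assigned to a listed node" ;;
    *)         echo "not a pod IP"; return 1 ;;
  esac
}

node_for_pod_ip 10.42.1.6    # prints: 10.42.1.0/24 ubuntu-8gb-nbg1-1
node_for_pod_ip 10.42.2.10   # prints: 10.42.2.0/24 ubuntu-8gb-nbg1-3
```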

### How VXLAN works

VXLAN ("Virtual Extensible LAN") tunnels Layer-2 frames over UDP. Flannel
wraps every inter-node packet like so:

```
 Original pod → pod packet:
 ┌──────────────────────────────────────────────────┐
 │ Ethernet │ IP src=10.42.0.5 → dst=10.42.2.10 │ … │
 └──────────────────────────────────────────────────┘

 Flannel VXLAN-encapsulates it:
 ┌──────────────────────────────────────────────────────────────────┐
 │ Eth │ IP src=178.104.247.152 → dst=178.104.249.189 │ UDP 8472 │  │
 │ VXLAN header │ <original Ethernet+IP+payload>               │  │
 └──────────────────────────────────────────────────────────────────┘
```

The outer IP/UDP carries the packet between nodes over Hetzner's public
network. On arrival, the destination node unwraps the VXLAN header and
delivers the inner packet to the target pod.

**UDP port 8472** is the Linux kernel's default VXLAN port, which Flannel
uses; the IANA-assigned VXLAN port is 4789. Port 8472 must be open
node-to-node in UFW (see Chapter 4).
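
For illustration only (Chapter 4 holds the real rule set), a node-to-node allow rule for this port could look like the output of the sketch below. The helper name is hypothetical, it prints commands rather than running them, and the peer addresses are the other two node IPs that appear in this chapter.

```bash
# Print (do not run) one UFW rule per peer node for Flannel VXLAN traffic.
# print_vxlan_rules is a hypothetical helper; review the output before applying.
print_vxlan_rules() {
  local peer
  for peer in "$@"; do
    echo "ufw allow from $peer to any port 8472 proto udp"
  done
}

# Peers as seen from hetzner1:
print_vxlan_rules 178.105.32.198 178.104.249.189
```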

**MTU note**: VXLAN encapsulation adds 50 bytes of overhead (8 VXLAN +
8 UDP + 20 IP + 14 Ethernet). Hetzner's network uses standard 1500-byte
MTU, so Flannel's overlay MTU is 1450. Mismatches cause silent packet
drops. K3s sets this correctly by default.
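
The MTU arithmetic is simple enough to check in a shell. A small sketch (the function name is hypothetical):

```bash
# Overlay MTU = underlay MTU minus VXLAN encapsulation overhead.
overlay_mtu() {
  local link_mtu="$1"
  local vxlan=8 udp=8 outer_ip=20 inner_eth=14   # bytes, 50 in total
  echo $(( link_mtu - vxlan - udp - outer_ip - inner_eth ))
}

overlay_mtu 1500   # prints 1450
```

On a node, `ip link show flannel.1` reports the MTU actually in use.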

### Flannel config

`/var/lib/rancher/k3s/agent/etc/flannel/net-conf.json` on each node:

```json
{
  "Network": "10.42.0.0/16",
  "EnableIPv6": false,
  "EnableIPv4": true,
  "IPv6Network": "::/0",
  "Backend": { "Type": "vxlan" }
}
```

We did not enable IPv6 in the cluster — an unnecessary complexity for our
scale, and CoreDNS + kube-proxy + node controllers all work fine in v4-only
mode.

### No encryption (yet)

Flannel VXLAN traffic over Hetzner's public network is **not encrypted**.
This means pod-to-pod traffic between nodes is visible to any attacker
with packet capture on the path — in practice, nobody between our three
nodes at Hetzner Nuremberg, but it's still plaintext on the wire.

**Mitigation today**: Most sensitive traffic is already encrypted at the
application layer, and the one plaintext path carries no credentials:
- api ↔ Neon Postgres: TLS 1.3 (`DB_SSLMODE=require`)
- api/worker ↔ Backblaze B2: HTTPS
- api ↔ Fastmail: STARTTLS
- api ↔ Redis: plaintext but Redis only holds cache + Asynq queue state,
  no user credentials

**TODO** (Chapter 20): Switch Flannel to the `wireguard-native` backend.
K3s supports this with a flag at install time (`--flannel-backend`);
enabling it on an existing cluster requires a config edit and a rolling
restart of the k3s service on each node.

## Layer 4 — Service discovery

Pods don't talk to each other by IP — IPs are ephemeral, assigned on pod
creation. They use **service names** resolved by DNS.

### CoreDNS

K3s runs **CoreDNS** as the cluster DNS server. A pod in the `honeydue`
namespace resolves `redis` to the Redis Service's ClusterIP:

```
redis                     → 10.43.7.10  (Service ClusterIP)
redis.honeydue            → 10.43.7.10
redis.honeydue.svc.cluster.local → 10.43.7.10
```

When an app resolves `redis:6379`:

1. The pod's `/etc/resolv.conf` points to `10.43.0.10` (the CoreDNS
   Service).
2. CoreDNS receives the query, checks its known Services, returns
   `10.43.7.10`.
3. The pod sends TCP to `10.43.7.10:6379`.
4. kube-proxy (Layer 4, below) intercepts and routes to the actual pod IP.
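
Why all three name forms resolve comes down to the resolver's search list. The sketch below lists the candidates a resolver would try, in order. It assumes the usual kubelet-generated `resolv.conf` (a three-entry search list and `ndots:5`), which we have not copied from a live pod; `dns_candidates` is a hypothetical helper.

```bash
# List the names a resolver tries, in order, for a lookup from a pod.
# Assumes the usual kubelet-generated resolv.conf:
#   search <ns>.svc.cluster.local svc.cluster.local cluster.local
#   options ndots:5
# dns_candidates is a hypothetical helper name.
dns_candidates() {
  local name="$1" ns="$2" ndots=5 dots domain
  dots=$(( $(printf '%s' "$name" | tr -cd . | wc -c) ))
  if [ "$dots" -lt "$ndots" ]; then
    # Fewer dots than ndots: search domains are tried before the bare name.
    for domain in "$ns.svc.cluster.local" svc.cluster.local cluster.local; do
      echo "$name.$domain"
    done
  fi
  # Names with ndots or more dots are tried as-is; this sketch stops there.
  echo "$name"
}

dns_candidates redis honeydue
```

Under those assumptions, `redis` matches on its first candidate (`redis.honeydue.svc.cluster.local`), while `redis.honeydue` misses once and matches on its second.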

### The service CIDR

K3s assigns **10.43.0.0/16** as the service CIDR. ClusterIPs live here.
Currently:

| Service | ClusterIP |
|---|---|
| `api.honeydue` | 10.43.167.83 |
| `admin.honeydue` | 10.43.136.168 |
| `redis.honeydue` | 10.43.7.10 |
| `kubernetes.default` | 10.43.0.1 |
| `kube-dns.kube-system` | 10.43.0.10 |

ClusterIPs are **stable** for the life of the Service — they don't change
when pods come and go.

### kube-proxy (IPVS mode)

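`kube-proxy` is the dataplane component that makes Services work. In K3s
it runs embedded in the k3s process on every node (not as a separate
DaemonSet), watches the Kubernetes API for Service and Endpoint changes,
and programs the kernel to route traffic.

This chapter assumes kube-proxy runs in **IPVS mode**. Upstream
kube-proxy defaults to iptables mode on Linux, so IPVS has to be enabled
explicitly (for example with `--kube-proxy-arg=proxy-mode=ipvs`);
`sudo ipvsadm -Ln` on a node shows whether virtual servers are actually
programmed. IPVS is a Linux kernel feature for in-kernel L4 load
balancing: essentially connection-tracking NAT with round-robin or other
scheduling.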

When a pod dials `10.43.7.10:6379`:

1. The first packet hits the node's kernel
2. IPVS sees the destination is a ClusterIP
3. IPVS picks an endpoint from the Service's endpoint set (e.g.,
   `10.42.0.10:6379` on hetzner1)
4. IPVS rewrites the destination and forwards
5. Flannel tunnels it to the destination node (if remote) or delivers
   locally (if the endpoint is on the same node)

This happens per-TCP-connection, not per-packet, thanks to conntrack.
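
The per-connection behavior can be pictured with a toy model: the first packet of a connection picks an endpoint round-robin, and every later packet of that connection reuses the recorded choice. This is a sketch of the idea only, not how the kernel stores state; all names are illustrative and the second endpoint is hypothetical.

```bash
# Toy model of IPVS round-robin plus conntrack. The first packet of a
# connection picks the next endpoint; later packets reuse that choice.
# All names here are illustrative; the second endpoint is hypothetical.
endpoints="10.42.0.10:6379 10.42.2.11:6379"
conntrack=""   # one "connection endpoint" pair per line
next=1         # 1-based index of the endpoint for the next new connection

# Sets the global `picked` (a command substitution would lose the state).
ipvs_forward() {
  local conn="$1" count
  picked="$(printf '%s\n' "$conntrack" | awk -v c="$conn" '$1 == c { print $2 }')"
  if [ -z "$picked" ]; then
    picked="$(echo "$endpoints" | cut -d' ' -f"$next")"
    conntrack="$(printf '%s\n%s %s' "$conntrack" "$conn" "$picked")"
    count=$(( $(echo "$endpoints" | wc -w) ))
    next=$(( next % count + 1 ))
  fi
}

ipvs_forward "10.42.1.6:40001"; echo "$picked"   # new connection
ipvs_forward "10.42.1.6:40001"; echo "$picked"   # same connection, same endpoint
ipvs_forward "10.42.1.6:40002"; echo "$picked"   # new connection, next endpoint
```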

### Why IPVS over iptables

The alternative to IPVS is iptables mode, which is older and slower at
scale: for every Service it adds a chain of rules that are evaluated
sequentially, so cost grows linearly with Service count. IPVS uses a hash
table and scales to thousands of Services without performance
degradation. At our scale either works, but IPVS scales better.

### Headless Services

A Service can also be created *without* a ClusterIP, which makes it
"headless" (`clusterIP: None`). Our setup doesn't currently use any, but
it's worth knowing the distinction: headless Services return all endpoint
IPs directly via DNS, with no kube-proxy involvement. They are useful for
StatefulSets where clients need to talk to a specific replica.

## Layer 5 — Ingress (Traefik)

External traffic arrives on the node's public :80 or :443. Traefik
handles the first mile of routing. See [Chapter 6](./06-traefik-ingress.md)
for Traefik-specific details; this section just shows how it fits in the
networking stack.

Traefik runs as a **DaemonSet** with `hostNetwork: true`. That means:
- One Traefik pod per node
- Each pod is in the **host's network namespace**, not a pod netns
- Each pod can bind directly to `0.0.0.0:80` and `0.0.0.0:443` on the node

When Cloudflare sends a request to `178.104.247.152:80`:

1. Packet arrives at hetzner1's NIC
2. UFW accepts (80/tcp is open from anywhere)
3. The kernel recognizes the destination as a local address and delivers the packet to the socket listening on :80
4. Traefik (running in host namespace) accepts the connection
5. Traefik reads the `Host:` header
6. Traefik matches an Ingress rule (api.myhoneydue.com → api Service)
7. Traefik dials `10.43.167.83:8000` (Service ClusterIP)
8. kube-proxy IPVS rewrites to a live api pod endpoint
9. Flannel VXLAN tunnels if the endpoint is on a remote node
10. The api pod receives the request, processes, responds
11. Response flows back the reverse path
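
Steps 5 to 7 amount to a lookup from the `Host:` header to a backend Service. The toy model below shows only that lookup, using the one rule named in the steps; `route_by_host` is a hypothetical name and this is not how Traefik is configured.

```bash
# Toy model of host-based routing: map a Host header to a backend Service.
# The api rule comes from the steps above; route_by_host is a hypothetical name.
route_by_host() {
  local host="${1%%:*}"          # drop an optional :port suffix
  case "$host" in
    api.myhoneydue.com) echo "api.honeydue:8000" ;;
    *)                  echo "no matching rule"; return 1 ;;
  esac
}

route_by_host api.myhoneydue.com       # prints: api.honeydue:8000
```

When no rule matches, Traefik's default behavior is to answer with a 404 rather than forward the request.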

Full trace in the [end-to-end section](#end-to-end-request-trace) below.

## IPs we care about

| What | CIDR / IP | Used for |
|---|---|---|
| Pod CIDR | 10.42.0.0/16 | All pod IPs cluster-wide |
| Service CIDR | 10.43.0.0/16 | All ClusterIPs |
| Flannel VXLAN | UDP 8472 | Pod-to-pod traffic (inter-node) |
| CoreDNS Service | 10.43.0.10:53 | Cluster DNS |
| Kubernetes Service | 10.43.0.1:443 | Internal kube-apiserver |
| Node IPs | See README | External + flannel source/dst |
| Traefik | host network | Listens on node's :80, :443 |

## End-to-end request trace

A user in Texas hits `https://api.myhoneydue.com/api/tasks/`. Here's every
hop:

```mermaid
sequenceDiagram
    autonumber
    participant U as User (Austin, TX)
    participant CF as Cloudflare edge (DFW POP)
    participant H as hetzner2 (picked by CF)<br/>178.105.32.198
    participant TR as Traefik pod<br/>(hostNetwork)
    participant API as api pod on hetzner3<br/>10.42.2.6:8000
    participant DB as Neon Postgres<br/>(AWS us-east-1)

    U->>CF: HTTPS :443 GET /api/tasks/
    Note over CF: TLS handshake terminates here
    CF->>H: HTTP :80 (with original Host header)
    H->>TR: Accepted by kernel, delivered to Traefik
    Note over TR: Matches Ingress rule<br/>host: api.myhoneydue.com
    TR->>TR: Resolve api.honeydue → 10.43.167.83
    TR->>H: dial 10.43.167.83:8000
    H->>H: kube-proxy IPVS rewrites<br/>dst → 10.42.2.6:8000
    H->>API: Flannel VXLAN encapsulate<br/>UDP 8472 → hetzner3
    Note over API: Pod receives packet
    API->>DB: SELECT … FROM tasks WHERE user_id = …<br/>TLS :5432
    DB-->>API: Result rows
    API-->>TR: HTTP 200 JSON
    TR-->>CF: HTTP 200
    CF-->>U: HTTPS 200
```

### Timing budget for a cache-miss read

| Hop | Typical latency |
|---|---|
| User → CF edge (DFW) | 5–15 ms |
| CF edge → hetzner2 (origin HTTP :80) | 90–120 ms (cross-Atlantic) |
| UFW + kernel accept | <1 ms |
| Traefik accept + route | 1–2 ms |
| kube-proxy + Flannel (same node) | <1 ms |
| kube-proxy + Flannel (remote node, VXLAN) | 1–3 ms |
| Go API request handling | 1–5 ms |
| Neon Postgres query (TLS + SQL) | ~100 ms (cross-Atlantic round trip to AWS us-east-1) |
| Return path (reverse) | similar |

**Total typical**: ~200–300 ms for a user in North America, dominated by
the cross-Atlantic CF→origin hop. Cached responses at Cloudflare skip the
origin hop entirely.

## Inter-node routing concretely

Here's what `ip route` shows on hetzner2 (not run live, reconstructed from
typical k3s+flannel+vxlan setup):

```
default via 172.31.1.1 dev eth0              # Hetzner gateway
10.42.0.0/24 via 10.42.0.0 dev flannel.1     # to hetzner1 pods (via VXLAN iface)
10.42.1.0/24 dev cni0                        # local pods on hetzner2
10.42.2.0/24 via 10.42.2.0 dev flannel.1     # to hetzner3 pods (via VXLAN iface)
# (no 10.43.0.0/16 entry here: kube-proxy rewrites ClusterIPs to pod IPs)
```

The `flannel.1` interface is the VXLAN tunnel endpoint. Traffic written
to it gets encapsulated in UDP 8472 and sent to the peer node's public IP.

Flannel learns about peer nodes via the Kubernetes API (it watches Node
resources). When hetzner3 joins, Flannel on hetzner1 and hetzner2 both
learn its public IP and pod CIDR, update their routes and ARP tables,
and traffic just works.
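
The routing decision on hetzner2 can be summarized as a small lookup. This is a simplified sketch of the pod and default routes only, with shell patterns standing in for real longest-prefix matching; `route_for` is a hypothetical name.

```bash
# Toy route lookup mirroring hetzner2's pod routes listed above.
# Handles pod and external destinations only; route_for is a hypothetical name.
route_for() {
  case "$1" in
    10.42.1.*)           echo "cni0" ;;        # local pods, bridged
    10.42.0.*|10.42.2.*) echo "flannel.1" ;;   # remote pods, VXLAN-encapsulated
    10.43.*)             echo "ClusterIP: kube-proxy rewrites it to a pod IP first"
                         return 1 ;;
    *)                   echo "eth0" ;;        # everything else, default route
  esac
}

route_for 10.42.2.10   # prints: flannel.1
route_for 10.42.1.6    # prints: cni0
```

On a real node, `ip route` and `bridge fdb show dev flannel.1` show the live routes and the peer entries Flannel has programmed.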

## Network performance

### Within a node (pod to pod, same host)

Packets go through `cni0` bridge, never leave the node. Sub-millisecond
latency, bounded by kernel + veth performance. Easily >10 Gbps.

### Between nodes (pod to pod, different host)

Packets go through Flannel VXLAN. Added overhead: encap/decap in the
kernel (~5–10 μs), plus the actual network hop between hetzner nodes
(~0.5 ms within the same Hetzner datacenter). Throughput is bounded by
Hetzner's NIC (≈1 Gbps sustained per node).

In practice this is fine for everything we do. The slowest link in our
application is Neon (AWS us-east-1), which is ~100 ms round-trip.

## DNS resolution path

A pod resolves `redis`:

1. App does `getaddrinfo("redis")`.
2. glibc reads `/etc/resolv.conf`, finds nameserver `10.43.0.10`.
3. glibc sends the query over UDP port 53 to `10.43.0.10`.
4. Destination is CoreDNS Service ClusterIP.
5. kube-proxy IPVS load-balances across CoreDNS pods (there's usually 1).
6. The packet arrives at the CoreDNS pod.
7. CoreDNS checks its Kubernetes plugin cache for `redis.<ns>.svc.cluster.local`.
8. Returns `10.43.7.10` (redis Service ClusterIP) with a low TTL.

CoreDNS is stateless — if it restarts, pods re-query on their next lookup.

**DNS caching in pods**: The Go API uses `net.Resolver` which does not
cache by default. Each new connection triggers a fresh DNS lookup. This
is correct behavior for Kubernetes (where Service IPs are stable but
Endpoints change), but it means a CoreDNS outage breaks new connections
immediately.

The Next.js admin app uses Node's default resolver, which behaves similarly.

## What breaks if X fails

| Failure | Symptom |
|---|---|
| Flannel daemon on one node crashes | Pods on that node can't reach other nodes' pods; kube-proxy Services sometimes work (kernel conntrack) |
| CoreDNS pod crashes (only 1) | New connection DNS lookups fail; existing connections continue |
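| kube-proxy daemon on one node crashes | Rules already programmed in the kernel keep working, but Service and endpoint changes are no longer applied on that node, so ClusterIP traffic can hit stale endpoints; direct pod IPs still work |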
| UFW misconfigured (port 8472 UDP blocked) | Pods on that node can't reach remote pods over overlay |
| Node's NIC fails | Node unreachable; Raft loses it; its pods get rescheduled elsewhere |
| Hetzner datacenter outage | Entire cluster offline |

## Operator cheat sheet

```bash
# See all IPs in the cluster
kubectl get pods -A -o wide            # pod IPs + nodes
kubectl get svc -A                     # Service ClusterIPs

# Test pod-to-pod DNS from inside a pod
kubectl exec -n honeydue deploy/api -- nslookup redis
kubectl exec -n honeydue deploy/api -- getent hosts redis

# Test pod-to-pod TCP connectivity
kubectl exec -n honeydue deploy/api -- nc -zv redis 6379
kubectl exec -n honeydue deploy/api -- wget -q -O- http://admin:3000/

# See the node's iptables/IPVS rules (run on a node)
ssh deploy@hetzner1 "sudo ipvsadm -Ln"
ssh deploy@hetzner1 "sudo iptables -L -n -t nat | head -50"

# See the cluster's flannel state
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"  "}{.status.addresses[?(@.type=="InternalIP")].address}{"  "}{.spec.podCIDR}{"\n"}{end}'
```

## References

- [Kubernetes networking concepts][k8s-net]
- [Flannel VXLAN backend][flannel-vxlan]
- [CoreDNS k8s plugin][coredns-k8s]
- [IPVS mode for kube-proxy][ipvs]
- [VXLAN RFC 7348][vxlan-rfc]

[k8s-net]: https://kubernetes.io/docs/concepts/services-networking/
[flannel-vxlan]: https://github.com/flannel-io/flannel/blob/master/Documentation/backends.md#vxlan
[coredns-k8s]: https://coredns.io/plugins/kubernetes/
[ipvs]: https://kubernetes.io/blog/2018/07/09/ipvs-based-in-cluster-load-balancing-deep-dive/
[vxlan-rfc]: https://datatracker.ietf.org/doc/html/rfc7348