# 04 — Firewall
## Summary
Every node runs UFW (Uncomplicated Firewall, a frontend for iptables) with
a default-deny-incoming policy. Specific ports are allowed from specific
sources only. This chapter lists every rule on every node, why each rule
exists, and what breaks without it. It also traces what happens to an
inbound packet as it goes through iptables, UFW, and the kernel.
## Policy
All three nodes have the same UFW config. The policy:
| Direction | Default |
|---|---|
| **Incoming** | **deny** |
| Outgoing | allow |
| Routed | disabled (we don't NAT) |
Default deny is an allowlist model: unless a rule explicitly allows a
packet, it's dropped. This is more secure than default-allow but requires
that every legitimate port be enumerated in a rule.
## Current ruleset per node
Run `sudo ufw status verbose` on any node to see the live ruleset. The
canonical ruleset is listed below, grouped by purpose.
### Public-facing
| Port | Protocol | From | Purpose |
|---|---|---|---|
| 22 | TCP | Anywhere | SSH (key-only) |
| 443 | TCP | Cloudflare ranges (15 IPv4 + 7 IPv6) | HTTPS (CF → Traefik, TLS-terminated at Traefik) |
**Port :80 is closed** on all three nodes. CF is in Full (strict) mode
and initiates every request on :443 to the origin. Cloudflare's
"Always Use HTTPS" turns any plaintext client request into HTTPS at
the edge, so the origin never needs to accept :80.
**Port :443 is restricted to Cloudflare** via 22 UFW allow rules per
node (one per CF CIDR). Direct-connect from any non-CF IP is dropped
at the kernel. This closes the "node IP leak = bypass CF WAF/DDoS"
hole entirely. See [Chapter 13](./13-cloudflare.md#cloudflare-ip-ranges-used-in-traefik-trustedips)
for the exact ranges and UFW rule format.
**Refresh cadence**: CF updates its IP ranges rarely. A monthly
`curl https://www.cloudflare.com/ips-v4` diff and UFW re-apply is
enough. Automation TODO (Chapter 20).
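
As a rough sketch of what that monthly re-apply could look like, the helper below turns a list of CIDRs into UFW commands. It only prints the commands so they can be reviewed before running. The function name `cf_ufw_rules` and the comment text are hypothetical, and the rule shape is an assumption modelled on the syntax used elsewhere in this chapter; the exact format we applied is in Chapter 13.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read one CIDR per line on stdin and print (not run)
# the matching UFW allow rule for :443. Blank lines are skipped, and a final
# line without a trailing newline is still processed.
cf_ufw_rules() {
    local cidr
    while IFS= read -r cidr || [ -n "$cidr" ]; do
        [ -z "$cidr" ] && continue
        echo "sudo ufw allow from $cidr to any port 443 proto tcp comment 'cloudflare $cidr'"
    done
}

# Example usage (needs network access, so left commented out):
#   curl -fsS https://www.cloudflare.com/ips-v4 | cf_ufw_rules
#   curl -fsS https://www.cloudflare.com/ips-v6 | cf_ufw_rules
```

Printing instead of executing keeps the diff step manual: compare the output against `sudo ufw status` and apply only the lines that changed.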
### SSH (operator access)
| Port | Protocol | From | Purpose |
|---|---|---|---|
| 22 | TCP | Anywhere | SSH login (key-only) |
SSH is open to the internet but hardened: key-only auth, no root login,
`AllowUsers deploy` configured (the stock distribution still allows root;
we hardened in bootstrap). See [Chapter 5](./05-security.md) for the full
SSH config.
**TODO** (Chapter 20): Move SSH off :22 to :2222 or similar, tighten to
the operator's current IP. Current state is acceptable given key-only +
fail2ban defaults.
### Kubernetes API (kubectl from operator)
| Port | Protocol | From | Purpose |
|---|---|---|---|
| 6443 | TCP | 47.185.183.191 (operator IP) | kubectl to kube-apiserver |
When the operator's public IP changes (moves, new ISP), this rule needs
updating on all 3 nodes. Ugly but necessary. A better long-term fix is
**Cloudflare Access** or **Tailscale** to avoid pinning operator IPs.
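
Until then, the IP swap can be scripted. The sketch below prints the two commands to run on each node; `rotate_operator_ip` is a hypothetical name, and `203.0.113.7` in the usage line is a documentation-range placeholder, not a real operator IP. It adds the new rule before deleting the old one so there is no window with kubectl locked out.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the UFW commands that move the kubectl allow
# rule from the old operator IP ($1) to the new one ($2).
rotate_operator_ip() {
    local old="$1" new="$2"
    # Add first, delete second, so 6443 is never closed to the operator.
    echo "sudo ufw allow from $new to any port 6443 proto tcp comment 'kubectl from dev'"
    echo "sudo ufw delete allow from $old to any port 6443 proto tcp"
}

# Example usage:
#   rotate_operator_ip 47.185.183.191 203.0.113.7
```

If the delete-by-specification form does not match the existing rule on your UFW version, fall back to `sudo ufw status numbered` and delete by number.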
### Inter-node cluster traffic
These rules allow the three nodes to talk to each other for cluster state.
Each node has an allow rule for each of the **three node IPs** (including
its own — the "allow from self" rule exists so local flows are explicit).
| Port | Protocol | From | Purpose |
|---|---|---|---|
| 6443 | TCP | other nodes | kube-apiserver (nodes talk to each other's API servers) |
| 2379 | TCP | other nodes | etcd client API (requests from the other server nodes) |
| 2380 | TCP | other nodes | etcd peer (Raft replication between server nodes) |
| 10250 | TCP | other nodes | kubelet (metrics, exec, logs from API server) |
| 8472 | UDP | other nodes | Flannel VXLAN overlay |
### Application-specific (legacy, mostly superfluous on k3s)
These rules were added during the Swarm era and still exist on the nodes.
None of them hurt anything; most are unused on k3s.
| Port | Protocol | From | Purpose (original) | Status on k3s |
|---|---|---|---|---|
| 2377 | TCP | node IPs | Swarm cluster management | unused (Swarm gone) |
| 7946 | TCP + UDP | node IPs | Swarm gossip | unused |
| 4789 | UDP | node IPs | Swarm VXLAN | unused (k3s uses 8472) |
| — | ESP (IP proto 50) | node IPs | IPSec encrypted overlay | unused |
| 500 | UDP | node IPs | IKE key exchange | unused |
| 3000 | TCP | node IPs | admin Next.js, when we tried node-IP hardcoding | unused |
These can be removed in a cleanup pass. They don't affect security because
no process listens on those ports anymore.
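
The "no process listens" claim is easy to re-check before a cleanup. The sketch below filters `ss -Htuln` output for the legacy ports in the table; `legacy_listeners` is a hypothetical name, and the check assumes the usual `ss` column layout where the local address is the fifth field. ESP is an IP protocol, not a port, so it is outside what this check can see.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `ss -Htuln` output on stdin and print
# "<netid> <port>" for every socket bound to a Swarm-era port.
legacy_listeners() {
    awk -v ports="2377 7946 4789 500 3000" '
        BEGIN {
            n = split(ports, p, " ")
            for (i = 1; i <= n; i++) legacy[p[i]] = 1
        }
        {
            # Field 5 is "local-address:port"; the port follows the last colon,
            # which also handles IPv6 forms such as [::]:3000.
            m = split($5, a, ":")
            if (a[m] in legacy) print $1, a[m]
        }'
}

# Example usage on a node (empty output means nothing is listening):
#   ss -Htuln | legacy_listeners
```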
## Why each required rule exists
### Port 22 — SSH (public)
Obviously needed for operator access. Without it we'd have no way to
reach the nodes. Hetzner console's "rescue" mode is an emergency fallback.
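### Port 80 — HTTP (closed)
No rule exists for port 80. Cloudflare runs in Full (strict) mode and
connects to the origin on :443 only, and "Always Use HTTPS" upgrades
plaintext client requests at the edge, so nothing legitimate ever arrives
on :80. Default-deny drops anything that does.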
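### Port 443 — HTTPS (Cloudflare only)
Cloudflare connects to the origin on :443 in Full (strict) mode, and
Traefik terminates TLS there. The rules allow only Cloudflare's published
ranges (one UFW rule per CIDR). Without them, Cloudflare cannot reach the
origin and users get a Cloudflare origin-unreachable error. Without the
source restriction, anyone who learns a node IP could bypass the CF
WAF/DDoS protection.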
### Port 6443 — kube-apiserver (operator + inter-node)
**From operator IP**: so `kubectl` works. Without this, `kubectl get pods`
times out.
**From other nodes**: every node's k3s components (kubelet, controllers)
talk to the apiserver on 6443, including when a node joins the cluster.
Without this, nodes can still run pods but can't participate in cluster
state changes.
### Ports 2379/2380 — embedded etcd (inter-node)
K3s runs etcd as an embedded library inside the server binary. The peer
port (2380) carries Raft replication between the three servers, and the
client port (2379) serves etcd API requests from the other servers.
**Without these rules, Raft cannot replicate state and the cluster loses
quorum.**
This bit us during the k3s install — initially the joins failed because
2379/2380 were blocked.
### Port 10250 — kubelet (inter-node)
The kubelet on each node exposes an authenticated API that the
kube-apiserver calls for `kubectl logs`, `kubectl exec`, and kubelet
metrics scraping. Without this rule, operator commands like
`kubectl logs -n honeydue deploy/api` fail with
"Error from server: unable to upgrade connection".
### Port 8472 UDP — Flannel VXLAN (inter-node)
Pod-to-pod traffic between nodes flows through VXLAN tunnels on UDP 8472.
**Without this rule, cross-node pod communication silently fails** — which
looks like "admin can't reach api" or "worker can't reach Redis" depending
on where pods land.
This rule is load-bearing. It is the single most important inter-node
rule.
## Inbound packet's journey through UFW/iptables
When a packet from Cloudflare arrives at hetzner1's network interface on
port 443:
```mermaid
sequenceDiagram
    participant NIC as hetzner1 NIC
    participant PRE as iptables<br/>raw + mangle + nat PREROUTING
    participant FIL as iptables filter INPUT<br/>(UFW lives here)
    participant SOCK as Traefik pod socket<br/>(host network)
    NIC->>PRE: Packet: SYN :443 from CF
    PRE->>PRE: conntrack state: NEW
    PRE->>FIL: handoff to INPUT chain
    FIL->>FIL: UFW rules evaluated
    Note over FIL: Rule: allow 443/tcp from CF range<br/>→ ACCEPT
    FIL->>SOCK: delivered to listening socket
    SOCK->>SOCK: Traefik accepts connection
```
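UFW is really a set of wrapper chains on top of iptables. `sudo iptables
-L INPUT -n --line-numbers` on any node shows the top-level jumps into
UFW's chains; the rules we add land in `ufw-user-input` (`sudo iptables
-L ufw-user-input -n --line-numbers`). UFW just makes editing them easier.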
## Rule syntax we used
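UFW commands we ran during initial setup (for reference). The blanket
`80/tcp` and `443/tcp` allows shown here have since been replaced by the
Cloudflare-only :443 rules described under Public-facing: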
```bash
# Reset to default
sudo ufw --force reset
# Default deny incoming
sudo ufw default deny incoming
sudo ufw default allow outgoing
# SSH + web (public)
sudo ufw allow 22/tcp comment 'SSH'
sudo ufw allow 80/tcp comment 'HTTP'
sudo ufw allow 443/tcp comment 'HTTPS'
# Kubernetes inter-node (repeat for each peer IP)
for ip in 178.104.247.152 178.105.32.198 178.104.249.189; do
    sudo ufw allow from "$ip" to any port 6443 proto tcp comment "k3s-api $ip"
    sudo ufw allow from "$ip" to any port 2379 proto tcp comment "k3s-etcd-client $ip"
    sudo ufw allow from "$ip" to any port 2380 proto tcp comment "k3s-etcd-peer $ip"
    sudo ufw allow from "$ip" to any port 10250 proto tcp comment "k3s-kubelet $ip"
    sudo ufw allow from "$ip" to any port 8472 proto udp comment "k3s-flannel-vxlan $ip"
done
# Kubectl from operator
sudo ufw allow from 47.185.183.191 to any port 6443 proto tcp comment 'kubectl from dev'
# Enable
sudo ufw --force enable
```
Rules persist across reboots via `/etc/ufw/user.rules`.
## What if we used Hetzner Cloud Firewall instead?
Hetzner Cloud has a provider-level firewall feature — rule-for-rule
equivalent but configured in the Hetzner console (or via API), not on the
nodes. Tradeoffs:
| | Hetzner Cloud Firewall | UFW (current) |
|---|---|---|
| Cost | Free | Free |
| Config location | Hetzner console / API | Per-node `/etc/ufw/` |
| Applies to | All traffic to NIC | All traffic to kernel |
| Failure mode | Provider-side issue = rules gone | Node-side issue = rules gone |
| Inter-node traffic | Same rules for all nodes | Same rules on each node |
| Visible to attacker | Yes (provider fingerprints) | Yes (iptables probe) |
| Rule ordering | UI-based | `iptables -L` |
Either works. A future improvement: move the stable rules to Hetzner
Cloud Firewall (one source of truth) and leave only the dynamic rules
(operator IP, ad-hoc debug) on the nodes.
## Why we don't use iptables directly
UFW is a frontend. `iptables` works, but the rules are harder to read and
edit. `sudo ufw allow from X to any port Y proto Z comment 'Z-rule'` is
clearer than writing the equivalent `-A INPUT ...` rule directly.
Also, UFW's `comment` field lets us explain each rule, which becomes
critical when the ruleset grows past ~10 rules.
## Testing the firewall
From the operator workstation (47.185.183.191):
```bash
# Should work (22/tcp open)
ssh deploy@hetzner1 exit
# Should time out (80/tcp is closed on all nodes)
curl -I --max-time 5 -H "Host: api.myhoneydue.com" http://hetzner1/api/health/
# Should time out (443/tcp is allowed only from Cloudflare ranges)
curl -kI --max-time 5 https://178.104.247.152/
# Should work (6443 allowed from operator IP)
export KUBECONFIG=~/.kube/honeydue-k3s.yaml
kubectl get nodes
# Should time out (default-deny on ports with no allow rule for this source)
curl http://178.104.247.152:3000/ # not open to operator
curl http://178.104.247.152:6379/ # Redis not exposed publicly
```
From another peer node (hetzner2 trying to reach hetzner1):
```bash
# Should work (k3s API allowed from peer node IPs)
curl -k https://178.104.247.152:6443/healthz
# Should work (etcd client from peer)
nc -zv 178.104.247.152 2379
```
## The hidden dependency: kubelet/containerd also need ports
Beyond the UFW rules, the k3s components also listen on:
- **10255/tcp** — kubelet read-only port (no auth, deprecated; disabled by default in k3s)
- **10256/tcp** — kube-proxy health
- **10257/tcp** — kube-controller-manager health
- **10259/tcp** — kube-scheduler health
These are bound to `localhost` only, so they don't need UFW rules. But
they're important to know about when debugging — if one of these health
endpoints isn't responding, the relevant component is broken.
## Legacy rules to clean up
The following rules are on the nodes from the Swarm era and can be
removed in a future cleanup pass:
```bash
# On each node, list Swarm-era rules
sudo ufw status numbered | grep -E "2377|7946|4789|500|3000|esp"
# Remove by number (highest-to-lowest to avoid renumbering)
# Example:
sudo ufw --force delete 15
sudo ufw --force delete 14
# ... etc.
```
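
The highest-to-lowest ordering can be generated instead of worked out by hand. The sketch below reads `sudo ufw status numbered` output and prints the delete commands in descending order; `swarm_delete_plan` is a hypothetical name, and the pattern assumes the usual `[ N] port/proto ...` line layout. It does not cover the ESP rule, which has no port in the first column, and the plan goes stale as soon as any rule is added or removed, so regenerate it right before use.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `ufw status numbered` output on stdin and print
# delete commands for Swarm-era port rules, highest rule number first so
# earlier deletions do not renumber the rules still to be removed.
swarm_delete_plan() {
    grep -E '^\[ *[0-9]+\] +(2377|7946|4789|500|3000)(/(tcp|udp))? ' \
        | sed -E 's/^\[ *([0-9]+)\].*/\1/' \
        | sort -rn \
        | while read -r n; do
              echo "sudo ufw --force delete $n"
          done
}

# Example usage on a node (review the output before running it):
#   sudo ufw status numbered | swarm_delete_plan
```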
We left them in because they don't affect security (no process listens on
those ports), and removing them requires careful testing that nothing in
k3s secretly relies on 4789/udp or similar.
## Operator cheat sheet
```bash
# Show the ruleset, numbered (comments included)
sudo ufw status numbered
# Show the default policies and logging level alongside the rules
sudo ufw status verbose
# Add a new rule
sudo ufw allow from <ip> to any port <port> proto <tcp|udp> comment '<desc>'
# Remove a rule by number
sudo ufw status numbered
sudo ufw --force delete <N>
# Temporarily disable all rules (emergency)
sudo ufw disable
# Re-enable
sudo ufw enable
# Reload after editing /etc/ufw/ files directly
sudo ufw reload
```
## What to do if the firewall locks you out
Worst case: you apply a rule that blocks your own SSH, UFW enables it
immediately, and you can't log back in. Recovery:
1. Hetzner Cloud Console → Server → Rescue mode
2. Boot into rescue, mount the disk
3. Edit `/etc/ufw/user.rules` to remove the bad rule
4. Reboot back into normal mode
This has never happened to us but it's the escape hatch. The Console is
always a TLS login away.
## References
- [UFW man page][ufw-man]
- [K3s networking requirements][k3s-reqs]
- [Kubernetes ports and protocols][k8s-ports]
- [Cloudflare IP ranges][cf-ips]
[ufw-man]: https://manpages.ubuntu.com/manpages/noble/en/man8/ufw.8.html
[k3s-reqs]: https://docs.k3s.io/installation/requirements#networking
[k8s-ports]: https://kubernetes.io/docs/reference/networking/ports-and-protocols/
[cf-ips]: https://www.cloudflare.com/ips/