honeyDueAPI/docs/deployment/01-infrastructure.md

# 01 — Infrastructure

## Summary

Three Hetzner Cloud CX33 virtual machines in the Nuremberg (nbg1) datacenter
form the compute foundation. Each is a 4 vCPU / 8 GB RAM / 80 GB NVMe SSD
instance on Hetzner's shared-CPU "Cloud" line. Total compute cost is
$23.97/mo. This chapter explains each node spec in detail, why we picked
Hetzner and this tier specifically, and the rejected alternatives.

## Node specifications

All three nodes are identical. Specs per node:

| Spec | Value |
|---|---|
| Provider | Hetzner Cloud (`www.hetzner.com/cloud`) |
| Instance type | CX33 (shared-CPU line) |
| vCPU | 4 |
| RAM | 8 GB |
| Disk | 80 GB NVMe SSD |
| Network | 20 TB/mo outbound included |
| IPv4 address | Public dedicated |
| IPv6 address | /64 subnet |
| Region | `nbg1` (Nuremberg, Germany) |
| OS | Ubuntu 24.04.3 LTS (GA kernel 6.8.0-90-generic) |
| Price | **$7.99/mo** (April 2026) ⁽¹⁾ |

⁽¹⁾ Hetzner applied a price adjustment on 2026-04-01 — CX33 went from
~$6.59 to $7.99. See [Hetzner price adjustment announcement][hetzner-prices].

### The three nodes

| SSH alias | Public IPv4 | IPv6 | k3s hostname |
|---|---|---|---|
| `hetzner1` | 178.104.247.152 | `2a01:4f8:1c18:79c7::1` | `ubuntu-8gb-nbg1-2` |
| `hetzner2` | 178.105.32.198 | `2a01:4f8:1c18:5ecf::1` | `ubuntu-8gb-nbg1-1` |
| `hetzner3` | 178.104.249.189 | `2a01:4f8:1c18:241a::1` | `ubuntu-8gb-nbg1-3` |

**Naming quirk.** The SSH-alias numbers and the Hetzner-assigned hostname
numbers do not match (`hetzner1` is `nbg1-2`, `hetzner2` is `nbg1-1`). This
is because the Hetzner hostnames are assigned in server-creation order; the
SSH aliases were set up later in the order we wanted to refer to them. We
chose not to rename the hosts — renaming `hostname` on a Kubernetes node
after it joins the cluster causes problems (node certificates, etcd
identity, etc. tie to the hostname). Living with the quirk is easier than
rebuilding. See the mapping table in [the README](./README.md).
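
Because `kubectl` only knows the k3s hostnames, a small lookup helper can save a trip to the table. The following is a minimal sketch, not part of the repo; the function name `k3s_node_for_alias` is hypothetical, and the mapping is copied from the table above.

```bash
# Map an SSH alias to the k3s node name, per the table above.
# k3s_node_for_alias is a hypothetical helper, not an existing script.
k3s_node_for_alias() {
  case "$1" in
    hetzner1) echo "ubuntu-8gb-nbg1-2" ;;
    hetzner2) echo "ubuntu-8gb-nbg1-1" ;;
    hetzner3) echo "ubuntu-8gb-nbg1-3" ;;
    *) echo "unknown alias: $1" >&2; return 1 ;;
  esac
}

# Example use (left commented so sourcing this file has no side effects):
# kubectl describe node "$(k3s_node_for_alias hetzner1)"
```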

## Why Hetzner

### Decision matrix

Compared at the time of purchase (~2026-04-23):

| Provider | Instance | vCPU / RAM / SSD | Price/mo | Traffic/mo |
|---|---|---|---:|---|
| **Hetzner** | **CX33** | **4 / 8 GB / 80 GB** | **$7.99** | **20 TB** |
| DigitalOcean | General-purpose | 2 / 8 GB / 25 GB | $63 | 4 TB |
| DigitalOcean | Basic | 4 / 8 GB / 160 GB | $48 | 5 TB |
| Vultr | High Perf | 4 / 8 GB / 180 GB | $48 | 5 TB |
| Linode (Akamai) | Shared | 4 / 8 GB / 160 GB | $48 | 5 TB |
| OVHcloud | VPS 2026 4vC | 4 / 8 GB / 75 GB | ~$13 | unlimited |
| Contabo | Cloud VPS 2 | 4 / 8 GB / 200 GB | $8 | 32 TB |
| Netcup | VPS 1000 G11 | 4 / 8 GB / 256 GB | ~$6 | unlimited |
| Oracle Always Free | ARM Ampere | up to 4 / 24 GB / 200 GB | $0 | 10 TB (*availability lottery*) |

**Why Hetzner won:**

1. **Price/performance at this tier is best-in-class among mainstream hosts.**
   Similar specs at DigitalOcean/Vultr/Linode cost 6× as much. You're paying
   the "American managed cloud" premium there for UX polish we don't need.
2. **Dedicated IPv4 + /64 IPv6 + 20 TB traffic included.** No overage anxiety
   at this scale; 20 TB is multiple months of anticipated traffic for a
   bootstrapped app.
3. **European datacenter, GDPR-native.** honeyDue serves users in
   multiple regions; if EU users dominate, Nuremberg is fast. US users pay
   about +100 ms over a US-East host, which is well within Cloudflare-cached
   tolerances for most app traffic.
4. **Mature API + `hcloud` CLI** for automation if we ever need it.
5. **Hetzner Cloud Firewall is free** and rule-for-rule equivalent to AWS
   Security Groups / DO Cloud Firewall. We use UFW on the nodes instead
   (Chapter 4) because our rule set evolved ad-hoc and moving it to the
   provider's firewall is a small cleanup project.
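
The price multiples quoted here can be checked against the decision matrix with a little arithmetic. This is a sketch only: `price_multiple` is a hypothetical helper, and the $7.99 baseline and competitor prices are taken from the table above.

```bash
# Print how many times more expensive a plan is than the CX33 baseline.
# Usage: price_multiple <price_per_month> [baseline_per_month]
price_multiple() {
  awk -v p="$1" -v b="${2:-7.99}" 'BEGIN { printf "%.1f\n", p / b }'
}

price_multiple 48   # DigitalOcean Basic, Vultr, Linode -> 6.0
price_multiple 63   # DigitalOcean General-purpose     -> 7.9
```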

**Why not the cheaper options:**

- **Netcup** is ~$2/mo cheaper per node with more disk, but its API is
  barebones, the account/billing UX is more fiddly, and their network
  routing in the US (where the operator is based) has more hops than
  Hetzner's.
- **Contabo** costs about the same ($8 vs. $7.99) and offers more disk and
  traffic, but the company has a reputation for oversubscribed nodes. For
  a production service, unpredictable CPU steal and disk I/O variance are
  not worth a saving of effectively $0/node. Contabo is fine for
  non-critical workloads; it's a poor fit for prod.
- **Oracle Cloud Always Free** is genuinely free (4 ARM cores + 24 GB RAM)
  but:
  - Requires ARM64 builds (we already build on ARM, so this would avoid
    cross-compiling, but see Chapter 11 for why amd64 matters)
  - Capacity for free accounts is a lottery; instance creation fails
    "out of capacity" more often than it succeeds
  - Oracle has reclaimed idle free-tier instances in the past

### Why not the premium options

DigitalOcean, Vultr, and Linode are excellent products with better UX than
Hetzner. They were rejected because at honeyDue's current scale the 6–8×
price multiplier doesn't buy anything we'd use:

- We don't need managed databases, object storage, or load balancers from
  the same provider — those are Neon, Backblaze, and Cloudflare
- We don't need their monitoring dashboards — Cloudflare Analytics +
  `kubectl top` + future Prometheus cover it
- The UI polish matters mostly for day-1 setup; ongoing operations are
  `kubectl` and `ssh`

When honeyDue has enough revenue that an engineer's time is worth more than
the ~$40/mo per node we save, we'd consider moving for the better tooling.
Not yet.

## Why Nuremberg (`nbg1`)

Hetzner has datacenters in Nuremberg (nbg1), Falkenstein (fsn1), Helsinki
(hel1), Ashburn (ash), and Hillsboro (hil). Nuremberg was picked because:

- The operator's primary user base is expected to be mixed US/EU
- Within the EU, Nuremberg is the most central from a peering perspective
  (well-connected to DE-CIX, Europe's largest internet exchange)
- Falkenstein is Hetzner's main datacenter and tends to have longer
  provisioning queues during capacity crunches; Nuremberg is smaller and
  more available

For a US-only userbase, Ashburn (ash) or Hillsboro (hil) would be better
picks — US users would see ~20 ms instead of ~120 ms.

Cloudflare's edge caches most assets, so the origin location matters mostly
for first-request / uncached / POST traffic.

## Why three nodes

**Raft quorum and fault tolerance.** K3s in HA mode uses Raft consensus
(via embedded etcd) for cluster state. Raft requires a majority of nodes
to agree on every write. For N server nodes, quorum is ⌊N/2⌋ + 1:

| Total servers | Quorum | Max failures tolerated |
|---|---|---|
| 1 | 1 | 0 |
| 2 | 2 | 0 |
| 3 | 2 | 1 |
| 4 | 3 | 1 |
| 5 | 3 | 2 |

Three is the smallest cluster size that tolerates a failure, and three is
where price/resilience is sweetest. Five nodes don't help until you need
to tolerate *two* simultaneous failures, a scale concern that doesn't
apply at our traffic volume.

Two nodes is worse than one: you still have single-failure intolerance
(one down = no quorum), but you've doubled your cost and failure surface.
Avoid even-node clusters for consensus systems.
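
The quorum table can be reproduced with integer arithmetic. The sketch below assumes the standard majority rule (quorum is half the servers, rounded down, plus one); the function names are illustrative only.

```bash
# Majority quorum for N servers: floor(N / 2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }

# Failures tolerated while still holding quorum: floor((N - 1) / 2).
tolerated_failures() { echo $(( ($1 - 1) / 2 )); }

for n in 1 2 3 4 5; do
  printf '%s servers: quorum %s, tolerates %s failure(s)\n' \
    "$n" "$(quorum "$n")" "$(tolerated_failures "$n")"
done
```

Note how going from 3 to 4 servers raises the quorum without raising the number of tolerated failures, which is the arithmetic behind avoiding even-sized clusters.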

## Node hardening

Each node was bootstrapped with:

1. **Docker installed** from `download.docker.com` using the stable repo
   (this was the original Swarm setup; still installed but disabled — k3s
   bundles its own containerd).
2. **`deploy` user created** with:
   - Home directory
   - Bash as login shell
   - Member of `docker` group (historical, when Swarm was the orchestrator)
   - Member of `sudo` group with `NOPASSWD: ALL` in `/etc/sudoers.d/deploy`
3. **SSH key installed** at `/home/deploy/.ssh/authorized_keys`
   - The key is the public half of `~/.ssh/hetzner` on the operator
     workstation (`ssh-ed25519`, 256 bits)
4. **`/opt/honeydue/deploy`** directory created, owned by `deploy`
   (originally the drop zone for Swarm deploy bundles; unused now)
5. **Sysctl** `net.ipv4.ip_unprivileged_port_start=0` persisted to
   `/etc/sysctl.d/99-unprivileged-ports.conf`. Required so Traefik (running
   as UID 65532) can bind `:80` and `:443` in the host network namespace.

The full bootstrap script is at `/tmp/honeydue_bootstrap.sh` on the
operator workstation (used during the initial Swarm setup — see
[Chapter 19](./19-postmortem-swarm.md) for context).
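
For readers without access to that script, the two config drop-ins from the list above could be generated roughly as follows. This is a hedged sketch, not the contents of the actual bootstrap script: the function names are hypothetical, and the exact sudoers line is an assumption (the source only states `NOPASSWD: ALL`).

```bash
# Print the sysctl drop-in that lets non-root processes bind ports below 1024.
render_port_sysctl() {
  printf 'net.ipv4.ip_unprivileged_port_start=0\n'
}

# Print a sudoers drop-in line granting passwordless sudo to the given user.
# The "ALL=(ALL)" form is an assumed, conventional spelling.
render_sudoers_line() {
  printf '%s ALL=(ALL) NOPASSWD: ALL\n' "$1"
}

# On a node, as root, these would be applied roughly like so:
#   render_port_sysctl > /etc/sysctl.d/99-unprivileged-ports.conf
#   sysctl --system
#   render_sudoers_line deploy > /etc/sudoers.d/deploy
#   chmod 0440 /etc/sudoers.d/deploy
#   visudo -cf /etc/sudoers.d/deploy   # syntax-check before logging out
```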

## Cost breakdown

```
3 × Hetzner CX33             $23.97/mo
Hetzner network traffic      $0       (20 TB/mo included per node, nowhere near it)
Neon Postgres (Launch)       $5-15/mo (usage-based, ~$5 min)
Backblaze B2                 <$1/mo   (tiny upload volume currently)
Cloudflare Free              $0
Gitea (self-hosted)          $0       (the operator's existing Gitea)
─────────────────────────────────
Total infra                  ~$30-40/mo
```

See [Chapter 18 — Cost](./18-cost.md) for a full breakdown including
external SaaS (Fastmail, Apple Developer, etc.) and at-scale projections.
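
The total range above follows directly from the line items. A quick sanity check, assuming $1 as an upper bound for Backblaze B2 and the $5 to $15 Neon range from the table (`infra_total` is a hypothetical helper):

```bash
# Sum monthly infra cost.
# Usage: infra_total <node_count> <price_per_node> <neon> <b2>
infra_total() {
  awk -v n="$1" -v node="$2" -v neon="$3" -v b2="$4" \
    'BEGIN { printf "%.2f\n", n * node + neon + b2 }'
}

infra_total 3 7.99 5 1    # low end  -> 29.97
infra_total 3 7.99 15 1   # high end -> 39.97
```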

## Provisioning workflow

Nodes were provisioned manually through the Hetzner Cloud Console. This is
fine for a three-node cluster; for larger clusters we'd switch to the
[`hetzner-k3s`][hetzner-k3s] CLI tool that the `deploy-k3s/` scaffold
expects. The manual steps were:

1. Create project in Hetzner Cloud Console.
2. Upload SSH key (`hetzner.pub`).
3. Create 3× CX33 servers in `nbg1` with Ubuntu 24.04.
4. SSH in as `root`, run bootstrap to create `deploy` user and install
   Docker / later k3s.
5. *(Optional)* Apply Hetzner Cloud Firewall rules at the network edge (we
   use UFW instead; see Chapter 4).

A future greenfield deployment would run `deploy-k3s/scripts/01-provision-cluster.sh`,
which does all of this in one shot via the `hetzner-k3s` CLI.

## Upgrade / replacement plan

**Node failure.** If a node becomes unreachable, the other two retain
Raft quorum and the cluster continues accepting writes. Pods from the
failed node get rescheduled to the survivors (so long as the survivors
have spare capacity — see Chapter 16). To replace the dead node:

1. Delete it from the cluster: `kubectl delete node <name>`
2. Create a replacement CX33 in Hetzner console
3. Install k3s on it as a server with `--server https://<existing-server>:6443`
   and the cluster join token
4. Verify `kubectl get nodes` shows it as Ready

**Scaling up.** To add a fourth node, follow the same procedure without
deleting anything. Decide whether it joins as a server (adds to Raft quorum,
so the server count should stay odd, which means going to five rather than
four) or as an agent (worker-only). K3s agents join with
`INSTALL_K3S_EXEC=agent` instead of `server`.
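
As a rough illustration of the server and agent join paths, the helper below prints the install command for each role rather than running it. It is a sketch under assumptions: `k3s_join_command` is a hypothetical name, the token is expected in a `K3S_TOKEN` environment variable on the new node, and the agent form relies on the installer defaulting to agent mode when `K3S_URL` is set (the same effect as `INSTALL_K3S_EXEC=agent`). Check any extra flags against Chapter 17 before use.

```bash
# Print (do not run) the k3s install command for joining a node.
# Usage: k3s_join_command <server|agent> <existing-server-host>
k3s_join_command() {
  role="$1"
  host="$2"
  case "$role" in
    server)
      # Joins the embedded-etcd cluster as an additional server.
      printf 'curl -sfL https://get.k3s.io | K3S_TOKEN="$K3S_TOKEN" sh -s - server --server https://%s:6443\n' "$host"
      ;;
    agent)
      # Worker-only node; K3S_URL makes the installer choose agent mode.
      printf 'curl -sfL https://get.k3s.io | K3S_URL=https://%s:6443 K3S_TOKEN="$K3S_TOKEN" sh -\n' "$host"
      ;;
    *)
      echo "role must be 'server' or 'agent'" >&2
      return 1
      ;;
  esac
}

k3s_join_command server hetzner1
```

Printing the command keeps the sketch side-effect free; on a real replacement node you would review the printed line and run it by hand.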

**Upgrading K3s.** K3s tracks upstream Kubernetes, which ships a minor
release roughly every four months. Upgrade by running the install script
with the new version on each node, one at a time, verifying cluster health
between nodes. See [Chapter 17](./17-runbook.md) for the detailed procedure.

**Upgrading the OS.** Ubuntu 24.04 LTS is supported until 2029.
`unattended-upgrades` is *not* currently installed, so OS patches require
manual `apt upgrade`. Install `unattended-upgrades` when time permits —
security patches are important and automation reduces the risk of
falling behind.

## Physical location & regulatory

- **Sovereignty**: Hetzner is headquartered in Gunzenhausen, Germany.
  All data at rest in `nbg1` is subject to German law and the GDPR.
- **User data**: Most user data actually lives in
  **Neon Postgres (AWS us-east-1, Virginia)** and **Backblaze B2
  (us-east-005, Virginia)**, both US-hosted. EU users' data
  therefore *exits* the EU in the API path. If strict EU data residency
  is ever a requirement, Neon has an EU region (Frankfurt) and Backblaze
  has EU endpoints; switching is a configuration change, not an
  architectural one.
- **Encryption at rest**: Hetzner encrypts node-local disks at the
  hypervisor layer. Neon encrypts at the AWS EBS layer. B2 encrypts
  objects server-side. None of our application code or config holds
  secrets at rest that aren't already in Kubernetes Secrets (which
  are stored in etcd; etcd on disk is unencrypted by default in k3s
  but see Chapter 5 for hardening).

## Operator cheat sheet

```bash
# SSH to any node
ssh -i ~/.ssh/hetzner deploy@hetzner1

# Check node health
kubectl get nodes -o wide

# Per-node resource usage
kubectl top nodes

# See what's on each node
kubectl get pods -A -o wide --sort-by=.spec.nodeName

# Hetzner console (in browser)
#   https://console.hetzner.cloud/
```

## References

- [Hetzner Cloud product page][hetzner-cloud]
- [Hetzner price adjustment April 2026][hetzner-prices]
- [hetzner-k3s tool][hetzner-k3s]
- [K3s architecture docs][k3s-arch]

[hetzner-cloud]: https://www.hetzner.com/cloud/
[hetzner-prices]: https://docs.hetzner.com/general/infrastructure-and-availability/price-adjustment/
[hetzner-k3s]: https://github.com/vitobotta/hetzner-k3s
[k3s-arch]: https://docs.k3s.io/architecture