6f303dbbaa
Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
  callback (was causing 'unlock of unlocked mutex' fatal after
  Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
  + allowlist fonts.googleapis.com so the marketing landing page CSS
  actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
  --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
  images runnable on x86_64 Hetzner nodes; fix array expansion under
  set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
  top-level aliases (the '\${X_SECRET}' form never actually resolved);
  dozzle ports: long-form host_ip is rejected by Swarm, switched to
  short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
  worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
  (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
  12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
  real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
  and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
  hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
  without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ (full deployment book, 26 files, ~42k words):
  - Part I   Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II  Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV  Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V   Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI  Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations,
    consolidated citations
- README.md: Production Deployment section replaced with pointer to
  the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# 01 — Infrastructure

## Summary

Three Hetzner Cloud CX33 virtual machines in the Nuremberg (nbg1) datacenter
form the compute foundation. Each is a 4 vCPU / 8 GB RAM / 80 GB NVMe SSD
instance on Hetzner's shared-CPU "Cloud" line. Total compute cost is
$23.97/mo. This chapter explains each node spec in detail, why we picked
Hetzner and this tier specifically, and the rejected alternatives.

## Node specifications

All three nodes are identical. Specs per node:

| Spec | Value |
|---|---|
| Provider | Hetzner Cloud (`www.hetzner.com/cloud`) |
| Instance type | CX33 (shared-CPU line) |
| vCPU | 4 |
| RAM | 8 GB |
| Disk | 80 GB NVMe SSD |
| Network | 20 TB/mo outbound included |
| IPv4 address | Public dedicated |
| IPv6 address | /64 subnet |
| Region | `nbg1` (Nuremberg, Germany) |
| OS | Ubuntu 24.04.3 LTS (kernel 6.8.0-90-generic, the 24.04 GA kernel series) |
| Price | **$7.99/mo** (April 2026) ⁽¹⁾ |

⁽¹⁾ Hetzner applied a price adjustment on 2026-04-01: CX33 went from
~$6.59 to $7.99. See [Hetzner price adjustment announcement][hetzner-prices].

### The three nodes

| SSH alias | Public IPv4 | IPv6 | k3s hostname |
|---|---|---|---|
| `hetzner1` | 178.104.247.152 | `2a01:4f8:1c18:79c7::1` | `ubuntu-8gb-nbg1-2` |
| `hetzner2` | 178.105.32.198 | `2a01:4f8:1c18:5ecf::1` | `ubuntu-8gb-nbg1-1` |
| `hetzner3` | 178.104.249.189 | `2a01:4f8:1c18:241a::1` | `ubuntu-8gb-nbg1-3` |

**Naming quirk.** The SSH-alias numbers and the Hetzner-assigned hostname
numbers do not match (`hetzner1` is `nbg1-2`, `hetzner2` is `nbg1-1`).
Hetzner assigned the hostnames in server-creation order; the SSH aliases
were set up later, in the order we wanted to refer to the nodes. We chose
not to rename the hosts: changing the hostname of a Kubernetes node after
it has joined the cluster causes problems, because node certificates, etcd
identity, and similar state are tied to it. Living with the quirk is easier
than rebuilding. See the mapping table in [the README](./README.md).

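
Because `kubectl` only knows the k3s hostnames while day-to-day SSH uses the
aliases, a small lookup helper can save a trip to the table. The following is
a minimal sketch: the function name `k3s_node_for_alias` is hypothetical, and
the mapping is copied from the table above.

```bash
# Hypothetical helper: translate an SSH alias into the k3s node name.
# The pairs below mirror the "three nodes" table; nothing else is assumed.
k3s_node_for_alias() {
  case "$1" in
    hetzner1) echo "ubuntu-8gb-nbg1-2" ;;
    hetzner2) echo "ubuntu-8gb-nbg1-1" ;;
    hetzner3) echo "ubuntu-8gb-nbg1-3" ;;
    *) echo "unknown alias: $1" >&2; return 1 ;;
  esac
}

# Example (needs cluster access, so it is left commented out):
# kubectl describe node "$(k3s_node_for_alias hetzner1)"
```

An unknown alias returns a non-zero status, so the helper is safe to use inside
`$(...)` in scripts that run with `set -e`.
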
## Why Hetzner

### Decision matrix

Compared at the time of purchase (~2026-04-23):

| Provider | Instance | vCPU / RAM / SSD | Price/mo | Traffic/mo |
|---|---|---|---:|---|
| **Hetzner** | **CX33** | **4 / 8 GB / 80 GB** | **$7.99** | **20 TB** |
| DigitalOcean | General-purpose | 2 / 8 GB / 25 GB | $63 | 4 TB |
| DigitalOcean | Basic | 4 / 8 GB / 160 GB | $48 | 5 TB |
| Vultr | High Perf | 4 / 8 GB / 180 GB | $48 | 5 TB |
| Linode (Akamai) | Shared | 4 / 8 GB / 160 GB | $48 | 5 TB |
| OVHcloud | VPS 2026 4vC | 4 / 8 GB / 75 GB | ~$13 | unlimited |
| Contabo | Cloud VPS 2 | 4 / 8 GB / 200 GB | $8 | 32 TB |
| Netcup | VPS 1000 G11 | 4 / 8 GB / 256 GB | ~$6 | unlimited |
| Oracle Always Free | ARM Ampere (*availability lottery*) | up to 4 / 24 GB / 200 GB | $0 | 10 TB |

**Why Hetzner won:**

1. **Price/performance at this tier is best-in-class among mainstream hosts.**
   Similar specs at DigitalOcean/Vultr/Linode cost roughly 6× as much
   ($48 vs. $7.99). You're paying the "American managed cloud" premium
   there for UX polish we don't need.
2. **Dedicated IPv4 + /64 IPv6 + 20 TB traffic included.** No overage anxiety
   at this scale; 20 TB is multiple months of anticipated traffic for a
   bootstrapped app.
3. **European datacenter, GDPR-native.** honeyDue serves users in
   multiple regions; if EU users dominate, Nuremberg is fast. US users pay
   roughly 100 ms extra compared with a US-East host, which is well within
   Cloudflare-cached tolerances for most app traffic.
4. **Mature API + `hcloud` CLI** for automation if we ever need it.
5. **Hetzner Cloud Firewall is free** and rule-for-rule equivalent to AWS
   Security Groups / DO Cloud Firewall. We use UFW on the nodes instead
   (Chapter 4) because our rule set evolved ad hoc, and moving it to the
   provider's firewall is a small cleanup project we have not done yet.

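
The price multiples in point 1 can be checked directly against the decision
matrix. This is an illustrative sketch: the helper name `price_multiple` is
made up, and the only inputs are the monthly prices listed in the table.

```bash
# Price of an alternative as a multiple of the $7.99 CX33, to one decimal.
# $1 = alternative's monthly price, $2 = baseline (defaults to 7.99).
price_multiple() {
  LC_ALL=C awk -v alt="$1" -v base="${2:-7.99}" \
    'BEGIN { printf "%.1f\n", alt / base }'
}

price_multiple 48   # DigitalOcean Basic / Vultr / Linode -> 6.0
price_multiple 63   # DigitalOcean General-purpose        -> 7.9
```
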
**Why not the cheaper options:**

- **Netcup** is ~$2/mo cheaper per node with more disk, but its API is
  barebones, the account/billing UX is more fiddly, and its network
  routing in the US (where the operator is based) has more hops than
  Hetzner's.
- **Contabo** costs about the same ($8) with more disk and traffic, but the
  company has a reputation for oversubscribed nodes. For a production
  service, unpredictable CPU steal and disk I/O variance are not worth a
  saving of roughly $0/node. Contabo is fine for non-critical workloads;
  it's a poor fit for prod.
- **Oracle Cloud Always Free** is genuinely free (4 ARM cores + 24 GB RAM)
  but:
  - It requires ARM64 images. We build on ARM workstations, so no
    cross-compile would be needed, but see Chapter 11 for why amd64 matters.
  - Capacity for free accounts is a lottery; instance creation fails with
    "out of capacity" more often than it succeeds.
  - Oracle has reclaimed idle free-tier instances in the past.

### Why not the premium options

DigitalOcean, Vultr, and Linode are excellent products with better UX than
Hetzner. They were rejected because at honeyDue's current scale the 6–8×
price multiplier (see the decision matrix) doesn't buy anything we'd use:

- We don't need managed databases, object storage, or load balancers from
  the same provider; those are Neon, Backblaze, and Cloudflare.
- We don't need their monitoring dashboards; Cloudflare Analytics,
  `kubectl top`, and a future Prometheus cover it.
- The UI polish matters mostly for day-1 setup; ongoing operations are
  `kubectl` and `ssh`.

When honeyDue has enough revenue that an engineer's time is worth more than
the ~$40/mo per-node difference, we'd consider moving for the better
tooling. Not yet.

## Why Nuremberg (`nbg1`)

Hetzner has datacenters in Nuremberg (nbg1), Falkenstein (fsn1), Helsinki
(hel1), Ashburn (ash), and Hillsboro (hil). Nuremberg was picked because:

- The operator's primary user base is expected to be mixed US/EU
- Within the EU, Nuremberg is the most central from a peering perspective
  (well-connected to DE-CIX, Europe's largest internet exchange)
- Falkenstein is Hetzner's main datacenter and tends to have longer
  provisioning queues during capacity crunches; Nuremberg is smaller and
  more available

For a US-only userbase, Ashburn (ash) or Hillsboro (hil) would be better
picks: US users would see ~20 ms instead of ~120 ms.

Cloudflare's edge caches most assets, so the origin location matters mostly
for first-request / uncached / POST traffic.

## Why three nodes

**Raft quorum and fault tolerance.** K3s in HA mode uses Raft consensus
(via embedded etcd) for cluster state. Raft requires a majority of server
nodes to agree on every write, so for N servers the quorum is ⌊N/2⌋ + 1
and the cluster tolerates ⌊(N − 1)/2⌋ failures:

| Total servers | Quorum | Max failures tolerated |
|---|---|---|
| 1 | 1 | 0 |
| 2 | 2 | 0 |
| 3 | 2 | 1 |
| 4 | 3 | 1 |
| 5 | 3 | 2 |

Three is the smallest cluster size that tolerates a failure, and it is
where price/resilience is sweetest. Five nodes don't help until you need
to tolerate *two* simultaneous failures, a scale concern that doesn't
apply at our traffic volume.

Two nodes is worse than one: you still have single-failure intolerance
(one down = no quorum), but you've doubled your cost and failure surface.
Avoid even-node clusters for consensus systems.

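
The quorum table follows from integer arithmetic on the server count. As a
small sketch (the function names are illustrative, not part of any k3s
tooling), the same numbers can be reproduced in the shell:

```bash
# Majority quorum for N Raft members: floor(N/2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }

# Members that can fail while a majority survives: floor((N-1)/2).
tolerated_failures() { echo $(( ($1 - 1) / 2 )); }

for n in 1 2 3 4 5; do
  printf '%d servers: quorum %d, tolerates %d\n' \
    "$n" "$(quorum "$n")" "$(tolerated_failures "$n")"
done
```

Note that going from 3 to 4 servers raises the quorum from 2 to 3 without
raising the tolerated failures, which is the arithmetic behind the advice to
avoid even-sized clusters.
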
## Node hardening

Each node was bootstrapped with:

1. **Docker installed** from `download.docker.com` using the stable repo
   (this was the original Swarm setup; still installed but disabled, since
   k3s bundles its own containerd).
2. **`deploy` user created** with:
   - Home directory
   - Bash as login shell
   - Member of `docker` group (historical, when Swarm was the orchestrator)
   - Member of `sudo` group with `NOPASSWD: ALL` in `/etc/sudoers.d/deploy`
3. **SSH key installed** at `/home/deploy/.ssh/authorized_keys`
   - The key is the public half of `~/.ssh/hetzner` on the operator
     workstation (`ssh-ed25519`, 256 bits)
4. **`/opt/honeydue/deploy`** directory created, owned by `deploy`
   (originally the drop zone for the Swarm deploy bundle; unused now)
5. **Sysctl** `net.ipv4.ip_unprivileged_port_start=0` persisted to
   `/etc/sysctl.d/99-unprivileged-ports.conf`. Required so Traefik (running
   as UID 65532) can bind `:80` and `:443` in the host network namespace.

The full bootstrap script is at `/tmp/honeydue_bootstrap.sh` on the
operator workstation (used during the initial Swarm setup; see
[Chapter 19](./19-postmortem-swarm.md) for context).

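
Step 5 is the only hardening item k3s still depends on, so it is worth being
able to reapply by hand. The sketch below is not an excerpt from the bootstrap
script: the function name and the optional root-prefix argument (useful for
trying it in a scratch directory) are assumptions. Only the sysctl key and the
file path come from the list above.

```bash
# Sketch of step 5 only. With no argument it writes to the real /etc;
# pass a directory to write under that prefix instead (for a dry run).
persist_unprivileged_ports() {
  conf="${1:-}/etc/sysctl.d/99-unprivileged-ports.conf"
  mkdir -p "$(dirname "$conf")"
  printf 'net.ipv4.ip_unprivileged_port_start=0\n' > "$conf"
  echo "$conf"   # report which file was written
}

# On a node, as root: persist the setting, then load it without rebooting.
#   persist_unprivileged_ports && sysctl --system
# Verify:
#   sysctl net.ipv4.ip_unprivileged_port_start
```
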
## Cost breakdown

```
3 × Hetzner CX33              $23.97/mo
Hetzner network traffic       $0          (20 TB/mo included per node, nowhere near it)
Neon Postgres (Launch)        $5-15/mo    (usage-based, ~$5 min)
Backblaze B2                  <$1/mo      (tiny upload volume currently)
Cloudflare Free               $0
Gitea (self-hosted)           $0          (the operator's existing Gitea)
─────────────────────────────────
Total infra                   ~$30-40/mo
```

See [Chapter 18: Cost](./18-cost.md) for a full breakdown including
external SaaS (Fastmail, Apple Developer, etc.) and at-scale projections.

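
The total is a simple sum of the line items. As a rough check (treating
Backblaze's "<$1" as a range of $0 to $1, which is an assumption made for the
arithmetic, and using a made-up function name), the bounds work out as follows:

```bash
# Lower and upper monthly bounds from the breakdown above, in USD.
infra_cost_bounds() {
  LC_ALL=C awk 'BEGIN {
    nodes = 3 * 7.99          # three CX33 instances
    low   = nodes + 5  + 0    # Neon minimum, negligible B2
    high  = nodes + 15 + 1    # Neon upper estimate, B2 cap
    printf "compute %.2f, total %.2f to %.2f\n", nodes, low, high
  }'
}

infra_cost_bounds   # compute 23.97, total 28.97 to 39.97
```
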
## Provisioning workflow

Nodes were provisioned manually through Hetzner Cloud Console. This is
fine for a three-node cluster; for larger clusters we'd switch to the
[`hetzner-k3s`][hetzner-k3s] CLI tool that the `deploy-k3s/` scaffold
expects. The manual steps were:

1. Create project in Hetzner Cloud Console.
2. Upload SSH key (`hetzner.pub`).
3. Create 3× CX33 servers in `nbg1` with Ubuntu 24.04.
4. SSH in as `root`, run bootstrap to create `deploy` user and install
   Docker / later k3s.
5. *(Optional)* Apply Hetzner Cloud Firewall rules at the network edge (we
   use UFW per Chapter 4 instead).

A future greenfield deployment would run `deploy-k3s/scripts/01-provision-cluster.sh`,
which does all of this in one shot via the `hetzner-k3s` CLI.

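
Step 3 could also be scripted with the `hcloud` CLI mentioned earlier. The
following is a hedged sketch, not what was actually run: the server names
(`honeydue-1` to `honeydue-3`) and the uploaded key name (`hetzner`) are
hypothetical, and `cx33` is assumed to be the API name of the CX33 type. By
default it only prints the commands; set `RUN=1` to execute them.

```bash
# Print a command, or execute it when RUN=1.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi; }

# One `hcloud server create` per node (names and key name are placeholders).
create_nodes() {
  for i in 1 2 3; do
    run hcloud server create --name "honeydue-$i" --type cx33 \
      --image ubuntu-24.04 --location nbg1 --ssh-key hetzner
  done
}

create_nodes
```

The dry-run default keeps the snippet safe to paste, since creating servers
starts billing immediately.
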
## Upgrade / replacement plan

**Node failure.** If a node becomes unreachable, the other two retain
Raft quorum and the cluster continues accepting writes. Pods from the
failed node get rescheduled to the survivors (so long as the survivors
have spare capacity; see Chapter 16). To replace the dead node:

1. Delete it from the cluster: `kubectl delete node <name>`
2. Create a replacement CX33 in Hetzner console
3. Install k3s on it as a server with `--server https://<server>:6443`,
   pointing at one of the surviving servers and using the cluster token
4. Verify `kubectl get nodes` shows it as Ready

**Scaling up.** To add a fourth node, follow the same procedure without
deleting anything. Decide whether it should be a server (joins the Raft
quorum, so keep the total server count odd) or an agent (worker-only).
K3s agents are installed with `INSTALL_K3S_EXEC=agent` instead of `server`.

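
For reference, the join commands for both roles follow the pattern documented
for the k3s install script. The helper below is a sketch that only prints the
command for review; its name is made up, and `<token>` is deliberately left as
a placeholder for the cluster token.

```bash
# Print (not run) the install command for a joining node.
# $1 = role (server|agent), $2 = address of an existing server.
k3s_join_command() {
  case "$1" in
    server) echo "curl -sfL https://get.k3s.io | K3S_TOKEN=<token> sh -s - server --server https://$2:6443" ;;
    agent)  echo "curl -sfL https://get.k3s.io | K3S_URL=https://$2:6443 K3S_TOKEN=<token> sh -" ;;
    *) echo "role must be server or agent" >&2; return 1 ;;
  esac
}

k3s_join_command server "<server-ip>"
k3s_join_command agent  "<server-ip>"
```

When joining an existing cluster it is usually sensible to also set
`INSTALL_K3S_VERSION` so the new node matches the version the other nodes run.
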
**Upgrading K3s.** K3s tracks upstream Kubernetes, which ships a minor
release roughly every four months. Upgrade by running the install script
with the new version on each node, one at a time, verifying cluster health
between each. See [Chapter 17](./17-runbook.md) for the detailed procedure.

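
A one-node-at-a-time upgrade could look roughly like the sketch below. It is
an assumption about the shape of the procedure, not a copy of the Chapter 17
runbook: the function names are made up, the target version is a placeholder,
and the drain/uncordon steps are an optional extra. Commands are printed
unless `RUN=1` is set.

```bash
# Print a command, or execute it when RUN=1.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi; }

# $1 = SSH alias, $2 = k3s node name, $3 = target k3s version tag.
upgrade_node() {
  run kubectl drain "$2" --ignore-daemonsets --delete-emptydir-data
  run ssh "deploy@$1" "curl -sfL https://get.k3s.io | sudo INSTALL_K3S_VERSION=$3 sh -s - server"
  run kubectl wait --for=condition=Ready "node/$2" --timeout=300s
  run kubectl uncordon "$2"
}

# Dry run for the first node; repeat per node only after the previous one is Ready.
upgrade_node hetzner1 ubuntu-8gb-nbg1-2 "<new-version>"
```

One caveat: the install script rewrites the service definition from the
arguments it is given, so any server flags used at the original install need
to be passed again.
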
**Upgrading the OS.** Ubuntu 24.04 LTS is supported until 2029.
`unattended-upgrades` is *not* currently installed, so OS patches require
manual `apt upgrade`. Install `unattended-upgrades` when time permits:
security patches are important and automation reduces the risk of
falling behind.

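
Enabling automatic security updates is a package install plus a two-line APT
config file. The sketch below shows one conventional way to do it; the
function name and the optional root-prefix argument are assumptions added so
the file can be generated in a scratch directory first.

```bash
# Write the APT periodic settings that enable daily unattended upgrades.
# With no argument it targets the real /etc; pass a directory for a dry run.
enable_auto_upgrades() {
  conf="${1:-}/etc/apt/apt.conf.d/20auto-upgrades"
  mkdir -p "$(dirname "$conf")"
  printf '%s\n' \
    'APT::Periodic::Update-Package-Lists "1";' \
    'APT::Periodic::Unattended-Upgrade "1";' > "$conf"
  echo "$conf"   # report which file was written
}

# On each node:
#   sudo apt-get update && sudo apt-get install -y unattended-upgrades
#   then run enable_auto_upgrades as root
```

By default `unattended-upgrades` does not reboot the machine, so kernel
updates still need a planned reboot, one node at a time.
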
## Physical location & regulatory

- **Sovereignty**: Hetzner is headquartered in Gunzenhausen, Germany.
  All data at rest in `nbg1` is subject to German law and the GDPR.
- **User data**: Most user data actually lives in
  **Neon Postgres (AWS us-east-1, Virginia)** and **Backblaze B2
  (`us-east-005`, US East region)**, both US-hosted. EU users' data
  therefore *exits* the EU in the API path. If strict EU data residency
  is ever a requirement, Neon has an EU region (Frankfurt) and Backblaze
  has EU endpoints; switching is a configuration change, not an
  architectural one.
- **Encryption at rest**: Hetzner encrypts node-local disks at the
  hypervisor layer. Neon encrypts at the AWS EBS layer. B2 encrypts
  objects server-side. None of our application code or config holds
  secrets at rest that aren't already in Kubernetes Secrets. Those are
  stored in etcd, which k3s leaves unencrypted on disk by default (see
  Chapter 5 for hardening).

## Operator cheat sheet

```bash
# SSH to any node
ssh -i ~/.ssh/hetzner deploy@hetzner1

# Check node health
kubectl get nodes -o wide

# Per-node resource usage
kubectl top nodes

# See what's on each node (sorted by node name; more robust than sorting
# on a column number, which shifts when RESTARTS contains spaces)
kubectl get pods -A -o wide --sort-by=.spec.nodeName

# Hetzner console (in browser)
# https://console.hetzner.cloud/
```

## References

- [Hetzner Cloud product page][hetzner-cloud]
- [Hetzner price adjustment April 2026][hetzner-prices]
- [hetzner-k3s tool][hetzner-k3s]
- [K3s architecture docs][k3s-arch]

[hetzner-cloud]: https://www.hetzner.com/cloud/
[hetzner-prices]: https://docs.hetzner.com/general/infrastructure-and-availability/price-adjustment/
[hetzner-k3s]: https://github.com/vitobotta/hetzner-k3s
[k3s-arch]: https://docs.k3s.io/architecture