Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
temporarily for reference
Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
callback (was causing 'unlock of unlocked mutex' fatal after
Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
+ allowlist fonts.googleapis.com so the marketing landing page CSS
actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
--platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
images runnable on x86_64 Hetzner nodes; fix array expansion under
set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
top-level aliases (the '\${X_SECRET}' form never actually resolved);
dozzle ports: long-form host_ip is rejected by Swarm, switched to
short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
(Next.js serves at root; /admin/ returned 404 and killed pods);
startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
and admin/src/app/api/*, hiding legitimate files)
New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log
Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
- Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
- Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
- Part III Security, Traefik ingress (Ch 5-6)
- Part IV Services, DB, storage, secrets, registry (Ch 7-11)
- Part V Data flow, deploy process, observability, failures, runbook
(Ch 12, 14-17)
- Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
- Appendices: glossary, kubectl cheat sheet, file locations,
consolidated citations
- README.md: Production Deployment section replaced with pointer to
the book; Go version bumped to 1.25
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
01 — Infrastructure
Summary
Three Hetzner Cloud CX33 virtual machines in the Nuremberg (nbg1) datacenter form the compute foundation. Each is a 4 vCPU / 8 GB RAM / 80 GB NVMe SSD instance on Hetzner's shared-CPU "Cloud" line. Total compute cost is $23.97/mo. This chapter explains each node spec in detail, why we picked Hetzner and this tier specifically, and the rejected alternatives.
Node specifications
All three nodes are identical. Specs per node:
| Spec | Value |
|---|---|
| Provider | Hetzner Cloud (www.hetzner.com/cloud) |
| Instance type | CX33 (shared-CPU line) |
| vCPU | 4 |
| RAM | 8 GB |
| Disk | 80 GB NVMe SSD |
| Network | 20 TB/mo outbound included |
| IPv4 address | Public dedicated |
| IPv6 address | /64 subnet |
| Region | nbg1 (Nuremberg, Germany) |
| OS | Ubuntu 24.04.3 LTS (HWE kernel 6.8.0-90-generic) |
| Price | $7.99/mo (April 2026) ⁽¹⁾ |
⁽¹⁾ Hetzner applied a price adjustment on 2026-04-01 — CX33 went from ~$6.59 to $7.99. See Hetzner price adjustment announcement.
The three nodes
| SSH alias | Public IPv4 | IPv6 | k3s hostname |
|---|---|---|---|
| hetzner1 | 178.104.247.152 | 2a01:4f8:1c18:79c7::1 | ubuntu-8gb-nbg1-2 |
| hetzner2 | 178.105.32.198 | 2a01:4f8:1c18:5ecf::1 | ubuntu-8gb-nbg1-1 |
| hetzner3 | 178.104.249.189 | 2a01:4f8:1c18:241a::1 | ubuntu-8gb-nbg1-3 |
Naming quirk. The SSH-alias numbers and the Hetzner-assigned hostname
numbers do not match (hetzner1 is nbg1-2, hetzner2 is nbg1-1). This
is because the Hetzner hostnames are assigned in server-creation order; the
SSH aliases were set up later in the order we wanted to refer to them. We
chose not to rename the hosts: renaming the hostname of a Kubernetes node
after it joins the cluster causes problems (node certificates, etcd
identity, and so on are tied to the hostname). Living with the quirk is easier than
rebuilding. See the mapping table in the README.
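Because the two numbering schemes diverge, a small lookup helper can prevent mistakes when switching between ssh and kubectl. The following is a minimal sketch: the function name k3s_hostname is hypothetical, and the mapping simply mirrors the node table above.

```shell
# Hypothetical helper: translate an SSH alias into the k3s node name.
# The mapping mirrors the node table in this chapter.
k3s_hostname() {
  case "$1" in
    hetzner1) echo "ubuntu-8gb-nbg1-2" ;;
    hetzner2) echo "ubuntu-8gb-nbg1-1" ;;
    hetzner3) echo "ubuntu-8gb-nbg1-3" ;;
    *) echo "unknown alias: $1" >&2; return 1 ;;
  esac
}

# Example: describe the node behind an SSH alias.
#   kubectl describe node "$(k3s_hostname hetzner1)"
```

Keeping the mapping in one function means a future rebuild only has to update a single place.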
Why Hetzner
Decision matrix
Compared at the time of purchase (~2026-04-23):
| Provider | Instance | vCPU / RAM / SSD | Price/mo | Traffic/mo |
|---|---|---|---|---|
| Hetzner | CX33 | 4 / 8 GB / 80 GB | $7.99 | 20 TB |
| DigitalOcean | General-purpose | 2 / 8 GB / 25 GB | $63 | 4 TB |
| DigitalOcean | Basic | 4 / 8 GB / 160 GB | $48 | 5 TB |
| Vultr | High Perf | 4 / 8 GB / 180 GB | $48 | 5 TB |
| Linode (Akamai) | Shared | 4 / 8 GB / 160 GB | $48 | 5 TB |
| OVHcloud | VPS 2026 4vC | 4 / 8 GB / 75 GB | ~$13 | unlimited |
| Contabo | Cloud VPS 2 | 4 / 8 GB / 200 GB | $8 | 32 TB |
| Netcup | VPS 1000 G11 | 4 / 8 GB / 256 GB | ~$6 | unlimited |
| Oracle Always Free | ARM Ampere | up to 4 / 24 GB / 200 GB | $0 | 10 TB |
Why Hetzner won:
- Price/performance at this tier is best-in-class among mainstream hosts. Similar specs at DigitalOcean/Vultr/Linode cost 6× as much. You're paying the "American managed cloud" premium there for UX polish we don't need.
- Dedicated IPv4 + /64 IPv6 + 20 TB traffic included. No overage anxiety at this scale; 20 TB is multiple months of anticipated traffic for a bootstrapped app.
- European datacenter, GDPR-native. honeyDue serves users in multiple regions; if EU users dominate, Nuremberg is fast. US users pay about +100 ms over a US-East host, which is well within Cloudflare-cached tolerances for most app traffic.
- Mature API + hcloud CLI for automation if we ever need it.
- Hetzner Cloud Firewall is free and rule-for-rule equivalent to AWS Security Groups / DO Cloud Firewall. We use UFW on the nodes instead (Chapter 4) because our rule set evolved ad-hoc and moving it to the provider's firewall is a small cleanup project.
Why not the cheaper options:
- Netcup is ~$2/mo cheaper per node with more disk, but its API is barebones, the account/billing UX is more fiddly, and its network routing in the US (where the operator is based) has more hops than Hetzner's.
- Contabo costs essentially the same ($8 vs. $7.99) with more disk and traffic, but the company has a reputation for oversubscribed nodes. For a production service, unpredictable CPU steal and disk I/O variance are not worth it, especially when the saving is effectively $0/node. Contabo is fine for non-critical workloads; it's a poor fit for prod.
- Oracle Cloud Always Free is genuinely free (4 ARM cores + 24 GB RAM), but:
  - It requires ARM64 images. We already build on ARM (Apple Silicon) workstations, but everything we ship targets amd64 (see Chapter 11 for why amd64 matters)
  - Capacity for free accounts is a lottery; instance creation fails "out of capacity" more often than it succeeds
  - Oracle has reclaimed idle free-tier instances in the past
Why not the premium options
DigitalOcean, Vultr, and Linode are excellent products with better UX than Hetzner. They were rejected because at honeyDue's current scale the 6–8× price multiplier (per the decision matrix above) doesn't buy anything we'd use:
- We don't need managed databases, object storage, or load balancers from the same provider — those are Neon, Backblaze, and Cloudflare
- We don't need their monitoring dashboards; Cloudflare Analytics + kubectl top + future Prometheus cover it
- The UI polish matters mostly for day-1 setup; ongoing operations are kubectl and ssh
When honeyDue has enough revenue that an engineer's time is worth more than the roughly $40/mo per node price gap, we'd consider moving for the better tooling. Not yet.
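The price multiples can be checked directly against the decision matrix. This is a small sketch; price_ratio is a hypothetical helper, and the prices are the ones listed in the table above.

```shell
# Price multiple of a competitor tier relative to the $7.99 CX33.
# awk does the decimal arithmetic; LC_ALL=C keeps the decimal point stable.
price_ratio() {
  LC_ALL=C awk -v p="$1" -v base=7.99 'BEGIN { printf "%.1f\n", p / base }'
}

price_ratio 48   # DigitalOcean Basic, Vultr, Linode
price_ratio 63   # DigitalOcean General-purpose
```

At $48/mo the per-node gap to the CX33 is about $40/mo.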
Why Nuremberg (nbg1)
Hetzner has datacenters in Nuremberg (nbg1), Falkenstein (fsn1), Helsinki (hel1), Ashburn (ash), and Hillsboro (hil). Nuremberg was picked because:
- The operator's primary user base is expected to be mixed US/EU
- Within the EU, Nuremberg is the most central from a peering perspective (well-connected to DE-CIX, Europe's largest internet exchange)
- Falkenstein is Hetzner's main datacenter and tends to have longer provisioning queues during capacity crunches; Nuremberg is smaller and more available
For a US-only userbase, Ashburn (ash) or Hillsboro (hil) would be better picks — US users would see ~20 ms instead of ~120 ms.
Cloudflare's edge caches most assets, so the origin location matters mostly for first-request / uncached / POST traffic.
Why three nodes
Raft quorum and fault tolerance. K3s in HA mode uses Raft consensus (via embedded etcd) for cluster state. Raft requires a majority of nodes to agree on every write, so quorum for N managers is floor(N/2) + 1, and the cluster tolerates N minus quorum failures:
| Total managers | Quorum | Max failures tolerated |
|---|---|---|
| 1 | 1 | 0 |
| 2 | 2 | 0 |
| 3 | 2 | 1 |
| 4 | 3 | 1 |
| 5 | 3 | 2 |
Three is the smallest odd number that tolerates a failure, and three is where price/resilience is sweetest. Five nodes don't help until you need to tolerate two simultaneous failures, a scale concern that doesn't apply at our traffic volume.
Two nodes is worse than one: you still have single-failure intolerance (one down = no quorum), but you've doubled your cost and failure surface. Avoid even-node clusters for consensus systems.
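The quorum table can be reproduced with integer arithmetic. A minimal sketch, with hypothetical function names quorum and tolerated:

```shell
# Raft quorum for N voting members is floor(N/2) + 1.
# Shell integer division already floors for positive numbers.
quorum() { echo $(( $1 / 2 + 1 )); }

# Failures tolerated is whatever is left over once quorum is satisfied.
tolerated() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5; do
  printf '%s managers: quorum %s, tolerates %s\n' \
    "$n" "$(quorum "$n")" "$(tolerated "$n")"
done
```

The loop makes the even-node penalty visible: going from 3 to 4 managers raises quorum but not the number of tolerated failures.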
Node hardening
Each node was bootstrapped with:
- Docker installed from download.docker.com using the stable repo (this was the original Swarm setup; still installed but disabled, since k3s bundles its own containerd).
- deploy user created with:
  - Home directory
  - Bash as login shell
  - Member of docker group (historical, when Swarm was the orchestrator)
  - Member of sudo group with NOPASSWD: ALL in /etc/sudoers.d/deploy
- SSH key installed at /home/deploy/.ssh/authorized_keys
  - The key is the public half of ~/.ssh/hetzner on the operator workstation (ssh-ed25519, 256 bits)
- /opt/honeydue/deploy directory created, owned by deploy (originally for Swarm deploy bundle drop zone; unused now)
- Sysctl net.ipv4.ip_unprivileged_port_start=0 persisted to /etc/sysctl.d/99-unprivileged-ports.conf. Required so Traefik (running as UID 65532) can bind :80 and :443 in the host network namespace.
The full bootstrap script is at /tmp/honeydue_bootstrap.sh on the
operator workstation (used during the initial Swarm setup — see
Chapter 19 for context).
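The sysctl item in the list above could be applied along these lines. This is a sketch, not an excerpt from the bootstrap script: write_port_sysctl is a hypothetical helper, and the directory is a parameter only so the file can be previewed in a scratch location before touching a node.

```shell
# Sketch: persist the unprivileged-port sysctl described above.
# On a real node the target directory is /etc/sysctl.d and needs root.
write_port_sysctl() {
  local dir="${1:-/etc/sysctl.d}"
  printf 'net.ipv4.ip_unprivileged_port_start=0\n' \
    > "$dir/99-unprivileged-ports.conf"
}

# On a node (as root):
#   write_port_sysctl /etc/sysctl.d && sysctl --system
#   sysctl net.ipv4.ip_unprivileged_port_start   # confirm the live value
```

Running sysctl --system reloads every file under /etc/sysctl.d, so the setting takes effect without a reboot.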
Cost breakdown
3 × Hetzner CX33 $23.97/mo
Hetzner network traffic $0 (20 TB/mo included per node, nowhere near it)
Neon Postgres (Launch) $5-15/mo (usage-based, ~$5 min)
Backblaze B2 <$1/mo (tiny upload volume currently)
Cloudflare Free $0
Gitea (self-hosted) $0 (the operator's existing Gitea)
─────────────────────────────────
Total infra ~$30-40/mo
See Chapter 18 — Cost for a full breakdown including external SaaS (Fastmail, Apple Developer, etc.) and at-scale projections.
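The total range follows from the line items above. A small sketch of the arithmetic, assuming Backblaze B2 is rounded to $0 at the low end and $1 at the high end; infra_total is a hypothetical helper.

```shell
# Monthly infra total in USD: 3 nodes at $7.99 plus Neon and B2.
# awk handles the decimals; shell integer arithmetic cannot.
infra_total() {
  LC_ALL=C awk -v nodes=3 -v per_node=7.99 -v neon="$1" -v b2="$2" \
    'BEGIN { printf "%.2f\n", nodes * per_node + neon + b2 }'
}

infra_total 5 0    # low end: Neon minimum, negligible B2
infra_total 15 1   # high end: Neon at $15, B2 rounded up to $1
```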
Provisioning workflow
Nodes were provisioned manually through Hetzner Cloud Console. This is
fine for a three-node cluster; for larger clusters we'd switch to the
hetzner-k3s CLI tool that the deploy-k3s/ scaffold
expects. The manual steps were:
- Create project in Hetzner Cloud Console.
- Upload SSH key (hetzner.pub).
- Create 3× CX33 servers in nbg1 with Ubuntu 24.04.
- SSH in as root, run bootstrap to create the deploy user and install Docker (later, k3s).
- Optional: apply Hetzner Cloud Firewall rules at the network edge (we use UFW per Chapter 4 instead).
A future greenfield deployment would run deploy-k3s/scripts/01-provision-cluster.sh,
which does all of this in one shot via the hetzner-k3s CLI.
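For reference, the key-upload and server-creation steps could also be scripted with the hcloud CLI instead of the console. This is a sketch under assumptions, not what was actually run: it presumes hcloud is installed with an active context, and the key name "operator" and the server names are hypothetical. With DRY_RUN left at its default it only prints the commands.

```shell
# Sketch: console steps "upload SSH key" and "create servers" via hcloud.
# DRY_RUN=1 (default) prints each command instead of executing it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

provision_nodes() {
  # Key name "operator" is a placeholder; the path matches ~/.ssh/hetzner.
  run hcloud ssh-key create --name operator \
    --public-key-from-file "$HOME/.ssh/hetzner.pub"
  for i in 1 2 3; do
    # Server names are placeholders, not the real Hetzner hostnames.
    run hcloud server create --name "node-$i" --type cx33 \
      --image ubuntu-24.04 --location nbg1 --ssh-key operator
  done
}

provision_nodes
```

Setting DRY_RUN=0 would execute the commands for real, which creates billable servers.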
Upgrade / replacement plan
Node failure. If a node becomes unreachable, the other two retain Raft quorum and the cluster continues accepting writes. Pods from the failed node get rescheduled to the survivors (so long as the survivors have spare capacity — see Chapter 16). To replace the dead node:
- Delete it from the cluster: kubectl delete node <name>
- Create a replacement CX33 in Hetzner console
- Install k3s on it with --server=https://<manager>:6443
- Verify kubectl get nodes shows it as Ready
Scaling up. To add a fourth node, follow the same procedure without deleting
anything. Decide whether it joins as a server (adds to Raft quorum, so the
server count should stay odd; a fourth server tolerates no more failures
than three) or as an agent (worker-only). K3s agents
join with INSTALL_K3S_EXEC=agent instead of server.
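Both the replacement and scale-up procedures above come down to the same install command with a different role. The sketch below only prints that command. It assumes the join token is exported as K3S_TOKEN (on a k3s server it is normally found in /var/lib/rancher/k3s/server/token), and k3s_join_cmd is a hypothetical helper.

```shell
# Sketch: print the k3s install command for a joining node.
# The token stays a literal "$K3S_TOKEN" reference in the output so the
# secret never lands in shell history or logs.
k3s_join_cmd() {
  local role="$1" manager="$2"
  printf 'curl -sfL https://get.k3s.io | K3S_TOKEN="$K3S_TOKEN" sh -s - %s --server https://%s:6443\n' \
    "$role" "$manager"
}

k3s_join_cmd server 178.105.32.198   # rejoin as a control-plane member
k3s_join_cmd agent  178.105.32.198   # or join as a worker-only agent
```

Any healthy manager's address works as the second argument; the one shown is hetzner2 from the node table.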
Upgrading K3s. K3s ships a new minor release roughly every 4 months, tracking upstream Kubernetes. Upgrade by running the install script with the new version on each node, one at a time, verifying cluster health between each. See Chapter 17 for the detailed procedure.
Upgrading the OS. Ubuntu 24.04 LTS is supported until 2029.
unattended-upgrades is not currently installed, so OS patches require
manual apt upgrade. Install unattended-upgrades when time permits —
security patches are important and automation reduces the risk of
falling behind.
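Enabling automatic security patches is a small change. The sketch below assumes the stock Ubuntu package and its usual apt periodic settings; write_auto_upgrades is a hypothetical helper, and the directory is a parameter so the file can be previewed outside /etc first.

```shell
# Sketch: turn on daily package-list refresh and unattended upgrades.
# On a real node the target directory is /etc/apt/apt.conf.d (needs root).
write_auto_upgrades() {
  local dir="${1:-/etc/apt/apt.conf.d}"
  cat > "$dir/20auto-upgrades" <<'EOF'
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
EOF
}

# On a node (as root):
#   apt-get update && apt-get install -y unattended-upgrades
#   write_auto_upgrades /etc/apt/apt.conf.d
```

Kernel updates still need a reboot to take effect, so node restarts should be staggered to preserve quorum.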
Physical location & regulatory
- Sovereignty: Hetzner is headquartered in Gunzenhausen, Germany. All data at rest in nbg1 is subject to German law and the GDPR.
- User data: Most user data actually lives in Neon Postgres (AWS us-east-1, Virginia) and Backblaze B2 (us-east-005, South Carolina), both US-hosted. EU users' data therefore exits the EU in the API path. If strict EU data residency is ever a requirement, Neon has an EU region (Frankfurt) and Backblaze has EU endpoints; switching is a configuration change, not an architectural one.
- Encryption at rest: Hetzner encrypts node-local disks at the hypervisor layer. Neon encrypts at the AWS EBS layer. B2 encrypts objects server-side. None of our application code or config holds secrets at rest that aren't already in Kubernetes Secrets (which are stored in etcd; etcd on disk is unencrypted by default in k3s, but see Chapter 5 for hardening).
Operator cheat sheet
# SSH to any node
ssh -i ~/.ssh/hetzner deploy@hetzner1
# Check node health
kubectl get nodes -o wide
# Per-node resource usage
kubectl top nodes
# See what's on each node
kubectl get pods -A -o wide | sort -k 8
# Hetzner console (in browser)
# https://console.hetzner.cloud/