admin/honeyDueAPI

Trey t 6f303dbbaa

Migrate prod deploy from Swarm to K3s; add full deployment book

Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
  callback (was causing 'unlock of unlocked mutex' fatal after
  Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
  + allowlist fonts.googleapis.com so the marketing landing page CSS
  actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
  --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
  images runnable on x86_64 Hetzner nodes; fix array expansion under
  set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
  top-level aliases (the '\${X_SECRET}' form never actually resolved);
  dozzle ports: long-form host_ip is rejected by Swarm, switched to
  short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
  worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
  (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
  12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
  real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
  and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
  hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
  without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations,
    consolidated citations
- README.md: Production Deployment section replaced with pointer to
  the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-24 07:20:54 -05:00


01 — Infrastructure

Summary

Three Hetzner Cloud CX33 virtual machines in the Nuremberg (nbg1) datacenter form the compute foundation. Each is a 4 vCPU / 8 GB RAM / 80 GB NVMe SSD instance on Hetzner's shared-CPU "Cloud" line. Total compute cost is $23.97/mo. This chapter explains each node spec in detail, why we picked Hetzner and this tier specifically, and the rejected alternatives.

Node specifications

All three nodes are identical. Specs per node:

Spec	Value
Provider	Hetzner Cloud (`www.hetzner.com/cloud`)
Instance type	CX33 (shared-CPU line)
vCPU	4
RAM	8 GB
Disk	80 GB NVMe SSD
Network	20 TB/mo outbound included
IPv4 address	Public dedicated
IPv6 address	/64 subnet
Region	`nbg1` (Nuremberg, Germany)
OS	Ubuntu 24.04.3 LTS (kernel 6.8.0-90-generic)
Price	$7.99/mo (April 2026) ⁽¹⁾

⁽¹⁾ Hetzner applied a price adjustment on 2026-04-01 — CX33 went from ~$6.59 to $7.99. See Hetzner price adjustment announcement.

The three nodes

SSH alias	Public IPv4	IPv6	k3s hostname
`hetzner1`	178.104.247.152	`2a01:4f8:1c18:79c7::1`	`ubuntu-8gb-nbg1-2`
`hetzner2`	178.105.32.198	`2a01:4f8:1c18:5ecf::1`	`ubuntu-8gb-nbg1-1`
`hetzner3`	178.104.249.189	`2a01:4f8:1c18:241a::1`	`ubuntu-8gb-nbg1-3`

Naming quirk. The SSH-alias numbers and the Hetzner-assigned hostname numbers do not match (hetzner1 is nbg1-2, hetzner2 is nbg1-1). Hetzner assigns hostnames in server-creation order; the SSH aliases were set up later, in the order we wanted to refer to the nodes. We chose not to rename the hosts: changing a Kubernetes node's hostname after it has joined the cluster causes problems, because node certificates, etcd identity, and similar state are tied to the hostname. Living with the quirk is easier than rebuilding. See the mapping table in the README.
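To keep the alias-to-hostname mapping from tripping anyone up, a small lookup helper can sit in the operator's shell profile. This is only a sketch: the function name `k3s_hostname` is hypothetical, and the mapping is copied from the table above.

```shell
# k3s_hostname ALIAS: print the Kubernetes node name behind an SSH alias.
# The function name is a hypothetical helper; the pairs come from the node table.
k3s_hostname() {
  case "$1" in
    hetzner1) echo "ubuntu-8gb-nbg1-2" ;;
    hetzner2) echo "ubuntu-8gb-nbg1-1" ;;
    hetzner3) echo "ubuntu-8gb-nbg1-3" ;;
    *) echo "unknown alias: $1" >&2; return 1 ;;
  esac
}

# Example use with kubectl:
#   kubectl describe node "$(k3s_hostname hetzner1)"
```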

Why Hetzner

Decision matrix

Compared at the time of purchase (~2026-04-23):

Provider	Instance	vCPU / RAM / SSD	Price/mo	Traffic/mo
Hetzner	CX33	4 / 8 GB / 80 GB	$7.99	20 TB
DigitalOcean	General-purpose	2 / 8 GB / 25 GB	$63	4 TB
DigitalOcean	Basic	4 / 8 GB / 160 GB	$48	5 TB
Vultr	High Perf	4 / 8 GB / 180 GB	$48	5 TB
Linode (Akamai)	Shared	4 / 8 GB / 160 GB	$48	5 TB
OVHcloud	VPS 2026 4vC	4 / 8 GB / 75 GB	~$13	unlimited
Contabo	Cloud VPS 2	4 / 8 GB / 200 GB	$8	32 TB
Netcup	VPS 1000 G11	4 / 8 GB / 256 GB	~$6	unlimited
Oracle Always Free	ARM Ampere	up to 4 / 24 GB / 200 GB	$0	10 TB

Why Hetzner won:

Price/performance at this tier is best-in-class among mainstream hosts. Similar specs at DigitalOcean/Vultr/Linode cost 6× as much. You're paying the "American managed cloud" premium there for UX polish we don't need.
Dedicated IPv4 + /64 IPv6 + 20 TB traffic included. No overage anxiety at this scale; 20 TB is multiple months of anticipated traffic for a bootstrapped app.
European datacenter, GDPR-native. honeyDue serves users in multiple regions; if EU users dominate, Nuremberg is fast. US users pay about +100 ms over a US-East host, which is well within Cloudflare-cached tolerances for most app traffic.
Mature API + hcloud CLI for automation if we ever need it.
Hetzner Cloud Firewall is free and rule-for-rule equivalent to AWS Security Groups / DO Cloud Firewall. We use UFW on the nodes instead (Chapter 4) because our rule set evolved ad-hoc and moving it to the provider's firewall is a small cleanup project.

Why not the cheaper options:

Netcup is ~$2/mo cheaper per node with more disk, but its API is barebones, its account/billing UX is more fiddly, and its network routes to the US (where the operator is based) take more hops than Hetzner's.
Contabo costs the same as Hetzner ($8 vs. $7.99) with more disk and traffic, but the company has a reputation for oversubscribed nodes. For a production service, unpredictable CPU steal and disk I/O variance are not worth accepting when the saving is $0/node. Contabo is fine for non-critical workloads; it's a poor fit for prod.
Oracle Cloud Always Free is genuinely free (4 ARM cores + 24 GB RAM) but:
- Requires ARM64 images. Our dev machines are ARM, so this would remove the cross-compile step, but the production nodes and deploy pipeline target amd64 (see Chapter 11 for why amd64 matters)
- Capacity for free accounts is a lottery; instance creation fails "out of capacity" more often than it succeeds
- Oracle has reclaimed idle free-tier instances in the past

Why not the premium options

DigitalOcean, Vultr, and Linode are excellent products with better UX than Hetzner. They were rejected because at honeyDue's current scale the 3–6× price multiplier doesn't buy anything we'd use:

We don't need managed databases, object storage, or load balancers from the same provider — those are Neon, Backblaze, and Cloudflare
We don't need their monitoring dashboards — Cloudflare Analytics + kubectl top + future Prometheus cover it
The UI polish matters mostly for day-1 setup; ongoing operations are kubectl and ssh

When honeyDue has enough revenue that an engineer's time is worth more than $40/mo, we'd consider moving for the better tooling. Not yet.

Why Nuremberg (`nbg1`)

Hetzner has datacenters in Nuremberg (nbg1), Falkenstein (fsn1), Helsinki (hel1), Ashburn (ash), and Hillsboro (hil). Nuremberg was picked because:

The operator's primary user base is expected to be mixed US/EU
Within the EU, Nuremberg is the most central from a peering perspective (well-connected to DE-CIX, Europe's largest internet exchange)
Falkenstein is Hetzner's main datacenter and tends to have longer provisioning queues during capacity crunches; Nuremberg is smaller and more available

For a US-only userbase, Ashburn (ash) or Hillsboro (hil) would be better picks — US users would see ~20 ms instead of ~120 ms.

Cloudflare's edge caches most assets, so the origin location matters mostly for first-request / uncached / POST traffic.

Why three nodes

Raft quorum and fault tolerance. K3s in HA mode uses Raft consensus (via embedded etcd) for cluster state. Raft requires a majority of voting members to agree on every write, so quorum is floor(N/2) + 1 and the cluster tolerates N − quorum failures:

Total servers	Quorum	Max failures tolerated
1	1	0
2	2	0
3	2	1
4	3	1
5	3	2

Three is the smallest odd number that tolerates a failure, and it is the sweet spot for price versus resilience. Five nodes don't help until you need to tolerate two simultaneous failures, a concern that doesn't apply at our traffic volume.

Two nodes is worse than one: you still have single-failure intolerance (one down = no quorum), but you've doubled your cost and failure surface. Avoid even-node clusters for consensus systems.
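The quorum table above follows from simple integer arithmetic. The sketch below reproduces it in shell; the function names `quorum` and `tolerated` are illustrative and not part of any k3s tooling.

```shell
# quorum N: votes needed for a majority among N Raft voters.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# tolerated N: voters that can fail while a majority still survives.
tolerated() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
  printf '%s servers: quorum %s, tolerates %s\n' "$n" "$(quorum "$n")" "$(tolerated "$n")"
done
```

Because the division is integer division, going from 3 to 4 servers raises the quorum from 2 to 3 while the tolerated failures stay at 1, which is why even totals add cost without adding resilience.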

Node hardening

Each node was bootstrapped with:

Docker installed from download.docker.com using the stable repo (this was the original Swarm setup; still installed but disabled — k3s bundles its own containerd).
deploy user created with:
- Home directory
- Bash as login shell
- Member of docker group (historical, when Swarm was the orchestrator)
- Member of sudo group with NOPASSWD: ALL in /etc/sudoers.d/deploy
SSH key installed at /home/deploy/.ssh/authorized_keys
- The key is the public half of ~/.ssh/hetzner on the operator workstation (ssh-ed25519, 256 bits)
/opt/honeydue/deploy directory created, owned by deploy (originally for Swarm deploy bundle drop zone; unused now)
Sysctl net.ipv4.ip_unprivileged_port_start=0 persisted to /etc/sysctl.d/99-unprivileged-ports.conf. Required so Traefik (running as UID 65532) can bind :80 and :443 in the host network namespace.
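The sysctl step in the list above can be sketched as follows. The function name `write_port_sysctl` is hypothetical, and the original bootstrap script may do this differently; the file name and setting are the ones given above.

```shell
# write_port_sysctl [DIR]: persist the unprivileged-port setting under DIR.
# DIR defaults to /etc/sysctl.d, which needs root; pass another directory to preview.
write_port_sysctl() {
  dir="${1:-/etc/sysctl.d}"
  printf 'net.ipv4.ip_unprivileged_port_start=0\n' > "$dir/99-unprivileged-ports.conf"
}

# On a node, as root:
#   write_port_sysctl && sysctl --system
#   sysctl net.ipv4.ip_unprivileged_port_start   # should now report 0
```

Writing the file only makes the setting survive reboots; `sysctl --system` reloads all sysctl configuration files so it takes effect immediately.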

The full bootstrap script is at /tmp/honeydue_bootstrap.sh on the operator workstation (used during the initial Swarm setup — see Chapter 19 for context).

Cost breakdown

3 × Hetzner CX33             $23.97/mo
Hetzner network traffic      $0       (20 TB/mo included per node, nowhere near it)
Neon Postgres (Launch)       $5-15/mo (usage-based, ~$5 min)
Backblaze B2                 <$1/mo   (tiny upload volume currently)
Cloudflare Free              $0
Gitea (self-hosted)          $0       (the operator's existing Gitea)
─────────────────────────────────
Total infra                  ~$30-40/mo

See Chapter 18 — Cost for a full breakdown including external SaaS (Fastmail, Apple Developer, etc.) and at-scale projections.
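As a sanity check on the figures above, the totals can be recomputed in integer cents. This is a rough sketch: it treats Backblaze B2 as somewhere between $0 and $1, which is an assumption drawn from the "<$1/mo" line.

```shell
# fmt_usd CENTS: format integer cents as dollars.
fmt_usd() {
  printf '$%d.%02d\n' $(( $1 / 100 )) $(( $1 % 100 ))
}

node_count=3
price_cents=799                                # $7.99 per CX33 per month
compute_cents=$(( node_count * price_cents ))

# Neon is $5-15 and B2 is under $1, so bound the total from both sides.
low_cents=$(( compute_cents + 500 ))
high_cents=$(( compute_cents + 1500 + 100 ))

echo "compute: $(fmt_usd "$compute_cents")"
echo "total:   $(fmt_usd "$low_cents") to $(fmt_usd "$high_cents")"
```

This gives $23.97 for compute and a range of $28.97 to $39.97 overall, consistent with the ~$30-40/mo total.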

Provisioning workflow

Nodes were provisioned manually through Hetzner Cloud Console. This is fine for a three-node cluster; for larger clusters we'd switch to the hetzner-k3s CLI tool that the deploy-k3s/ scaffold expects. The manual steps were:

Create project in Hetzner Cloud Console.
Upload SSH key (hetzner.pub).
Create 3× CX33 servers in nbg1 with Ubuntu 24.04.
SSH in as root and run the bootstrap script to create the deploy user and install Docker (and, later, k3s).
Optionally, apply Hetzner Cloud Firewall rules at the network edge (we use UFW per Chapter 4 instead).
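The nodes were created by hand in the console, but the server-creation step could also be scripted with the hcloud CLI. The sketch below only prints the commands. The server names, the SSH key name `hetzner`, and the lowercase type name `cx33` are assumptions for illustration, not values taken from the actual setup.

```shell
# print_create_commands: emit one `hcloud server create` per node (dry run).
# Server names, the SSH key name, and the "cx33" type string are hypothetical.
print_create_commands() {
  for i in 1 2 3; do
    echo "hcloud server create --name honeydue-node-$i --type cx33 --image ubuntu-24.04 --location nbg1 --ssh-key hetzner"
  done
}

print_create_commands

# To provision for real, authenticate first (hcloud context create <name>),
# review the printed commands, then run:
#   print_create_commands | sh
```

Printing before executing keeps the step reviewable, which matters because each command starts a billable server.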

A future greenfield deployment would run deploy-k3s/scripts/01-provision-cluster.sh, which does all of this in one shot via the hetzner-k3s CLI.

Upgrade / replacement plan

Node failure. If a node becomes unreachable, the other two retain Raft quorum and the cluster continues accepting writes. Pods from the failed node get rescheduled to the survivors (so long as the survivors have spare capacity — see Chapter 16). To replace the dead node:

Delete it from the cluster: kubectl delete node <name>
Create a replacement CX33 in Hetzner console
Install k3s on it with --server https://<existing-server>:6443 and the cluster join token
Verify kubectl get nodes shows it as Ready
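The join step above can be sketched with the standard k3s install script. The helper name `print_join_command` is hypothetical, the token is left as a placeholder, and the IP in the example is hetzner2's public address from the node table; whether nodes should join over public or private addresses is not settled here.

```shell
# print_join_command SERVER_IP: print the install command for a replacement server.
# <token> is a placeholder; on an existing server the join token is stored in
# /var/lib/rancher/k3s/server/token.
print_join_command() {
  echo "curl -sfL https://get.k3s.io | K3S_TOKEN=<token> sh -s - server --server https://$1:6443"
}

print_join_command 178.105.32.198

# Consider also setting INSTALL_K3S_VERSION so the new node matches the cluster.
```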

Scaling up. To add a fourth node, follow the same procedure without deleting anything. Decide whether it joins as a server (a Raft voter; a fourth server makes the total even and adds no fault tolerance, so plan for five) or as an agent (worker-only). K3s agents join with INSTALL_K3S_EXEC=agent instead of server.

Upgrading K3s. K3s tracks upstream Kubernetes, which ships a minor release roughly every four months. Upgrade by running the install script with the new version on each node, one at a time, verifying cluster health before moving to the next. See Chapter 17 for the detailed procedure.
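A one-node-at-a-time upgrade could look roughly like the sequence below. This is an illustrative dry run, not the Chapter 17 procedure: the helper name is hypothetical, the drain and uncordon steps are an assumption about how we'd want to do it, and `<original install flags>` stands for whatever options each node was first installed with.

```shell
# print_upgrade_plan VERSION NODE...: print a rolling upgrade sequence (dry run).
print_upgrade_plan() {
  version="$1"; shift
  for node in "$@"; do
    echo "kubectl drain $node --ignore-daemonsets --delete-emptydir-data"
    echo "# on $node: curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=$version sh -s - <original install flags>"
    echo "kubectl uncordon $node"
    echo "kubectl wait --for=condition=Ready node/$node --timeout=300s"
  done
}

print_upgrade_plan '<new-version>' ubuntu-8gb-nbg1-1 ubuntu-8gb-nbg1-2 ubuntu-8gb-nbg1-3
```

Waiting for Ready before touching the next node keeps two of three servers up at all times, so the cluster never loses quorum during the upgrade.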

Upgrading the OS. Ubuntu 24.04 LTS is supported until 2029. unattended-upgrades is not currently installed, so OS patches require manual apt upgrade. Install unattended-upgrades when time permits — security patches are important and automation reduces the risk of falling behind.
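A quick way to check whether a node already has automatic patching is to ask dpkg. The helper name `has_package` is hypothetical; the apt and dpkg commands are the standard Ubuntu ones.

```shell
# has_package NAME: succeed if dpkg reports the package as installed.
has_package() {
  dpkg-query -W -f='${Status}' "$1" 2>/dev/null | grep -q 'install ok installed'
}

if has_package unattended-upgrades; then
  echo "unattended-upgrades is installed"
else
  echo "missing; install with: sudo apt-get install -y unattended-upgrades"
fi

# After installing, enable the periodic runs with:
#   sudo dpkg-reconfigure -plow unattended-upgrades
```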

Physical location & regulatory

Sovereignty: Hetzner is headquartered in Gunzenhausen, Germany. All data at rest in nbg1 is subject to German law and the GDPR.
User data: Most user data actually lives in Neon Postgres (AWS us-east-1, Virginia) and Backblaze B2 (us-east-005, US East), both US-hosted. EU users' data therefore exits the EU in the API path. If strict EU data residency is ever a requirement, Neon has an EU region (Frankfurt) and Backblaze has EU endpoints; switching is a configuration change plus a one-time data migration, not an architectural change.
Encryption at rest: Hetzner encrypts node-local disks at the hypervisor layer. Neon encrypts at the AWS EBS layer. B2 encrypts objects server-side. No application code or config stores secrets at rest outside Kubernetes Secrets. Those live in etcd, which k3s leaves unencrypted on disk by default (see Chapter 5 for hardening).

Operator cheat sheet

# SSH to any node
ssh -i ~/.ssh/hetzner deploy@hetzner1

# Check node health
kubectl get nodes -o wide

# Per-node resource usage
kubectl top nodes

# See what's on each node (sorted by node name)
kubectl get pods -A -o wide --sort-by=.spec.nodeName

# Hetzner console (in browser)
#   https://console.hetzner.cloud/
