Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
temporarily for reference
Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
callback (was causing 'unlock of unlocked mutex' fatal after
Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
+ allowlist fonts.googleapis.com so the marketing landing page CSS
actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
--platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
images runnable on x86_64 Hetzner nodes; fix array expansion under
set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
top-level aliases (the '${X_SECRET}' form never actually resolved);
dozzle ports: long-form host_ip is rejected by Swarm, switched to
short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
(Next.js serves at root; /admin/ returned 404 and killed pods);
startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
and admin/src/app/api/*, hiding legitimate files)
New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log
Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
- Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
- Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
- Part III Security, Traefik ingress (Ch 5-6)
- Part IV Services, DB, storage, secrets, registry (Ch 7-11)
- Part V Data flow, deploy process, observability, failures, runbook
(Ch 12, 14-17)
- Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
- Appendices: glossary, kubectl cheat sheet, file locations,
consolidated citations
- README.md: Production Deployment section replaced with pointer to
the book; Go version bumped to 1.25
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
00 — Overview
Summary
honeyDue runs on a three-node Kubernetes cluster managed by K3s, fronted by Cloudflare, and backed by a managed Postgres (Neon), S3-compatible object storage (Backblaze B2), and a self-hosted container registry (Gitea). The application consists of a Go REST API, a Next.js admin panel, and a background worker process using Redis-backed queues. Traefik handles HTTP ingress and path-based routing. The whole stack fits in about 1 GB of RAM across the three nodes with plenty of headroom.
This chapter is the map. Everything here is expanded in a later chapter.
Architecture at a glance
flowchart TB
subgraph Internet
Browser[End-user browser / mobile client]
end
subgraph CF[Cloudflare]
CFEdge[Edge POP<br/>TLS terminates here]
end
Browser -- HTTPS :443 --> CFEdge
subgraph Hetzner[Hetzner Cloud — Nuremberg nbg1]
direction LR
subgraph H1[hetzner1<br/>178.104.247.152]
T1[Traefik<br/>:80/:443 hostNet]
A1[api pod]
W1[worker pod]
end
subgraph H2[hetzner2<br/>178.105.32.198]
T2[Traefik<br/>:80/:443 hostNet]
A2[api pod]
R1[redis pod<br/>PVC]
end
subgraph H3[hetzner3<br/>178.104.249.189]
T3[Traefik<br/>:80/:443 hostNet]
A3[api pod]
AD1[admin pod]
end
end
CFEdge -- HTTP :80<br/>DNS round-robin --> T1
CFEdge -- HTTP :80 --> T2
CFEdge -- HTTP :80 --> T3
T1 & T2 & T3 -.Ingress routes by<br/>Host header.-> A1
T1 & T2 & T3 -.-> AD1
A1 & A2 & A3 -.-> R1
subgraph External[Managed services]
Neon[(Neon Postgres<br/>AWS us-east-1)]
B2[(Backblaze B2<br/>us-east-005)]
FM[Fastmail SMTP]
Gitea[Gitea Registry<br/>gitea.treytartt.com]
end
A1 & A2 & A3 -- SSL --> Neon
W1 -- SSL --> Neon
A1 & A2 & A3 -- HTTPS --> B2
W1 -- SMTP :587 --> FM
H1 & H2 & H3 -. image pull .-> Gitea
ASCII fallback
                  ┌─────────────────────┐
                  │      End user       │
                  └──────────┬──────────┘
                             │ HTTPS :443
                             ▼
                  ┌─────────────────────┐
                  │  Cloudflare edge    │  TLS terminates here
                  │  (SSL = Flexible)   │
                  └──────────┬──────────┘
                   HTTP :80 round-robin
               ┌─────────────┼─────────────┐
               ▼             ▼             ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ hetzner1        │ │ hetzner2        │ │ hetzner3        │
│ 178.104.247.152 │ │ 178.105.32.198  │ │ 178.104.249.189 │
│ Traefik :80/443 │ │ Traefik :80/443 │ │ Traefik :80/443 │
│ api  worker     │ │ api  redis      │ │ api  admin      │
└─────────┬───────┘ └─────────┬───────┘ └─────────┬───────┘
          │                   │                   │
          └──────── Kubernetes overlay ───────────┘
                              │
┌─────────────────────────────┴──────────────────────────────┐
│                                                            │
▼                   ▼                   ▼                    ▼
┌─────────┐  ┌─────────────┐  ┌──────────┐   ┌───────────────┐
│  Neon   │  │ Backblaze B2│  │ Fastmail │   │ Gitea Registry│
│Postgres │  │  uploads    │  │   SMTP   │   │  image pull   │
└─────────┘  └─────────────┘  └──────────┘   └───────────────┘
The stack, one layer at a time
Layer 0 — Hardware
Three Hetzner Cloud CX33 instances (4 vCPU, 8 GB RAM, 80 GB NVMe SSD) in Hetzner's Nuremberg (nbg1) datacenter. Each node is $7.99/mo (April 2026 pricing), totaling ~$24/mo. See Chapter 1.
Layer 1 — Operating system
Ubuntu 24.04.3 LTS. Each node has:
- SSH on port 22, key-only auth,
  `deploy` user with NOPASSWD sudo
- `ufw` firewall with strict default-deny-incoming; specific ports allowed per Chapter 4
- Sysctl override `net.ipv4.ip_unprivileged_port_start=0` so non-root containers can bind privileged ports (needed for Traefik to serve :80/:443)
Layer 2 — Container runtime
containerd v2.2.2 (bundled with K3s). Docker is still installed from the
Swarm era but is now disabled. containerd is Kubernetes' reference
runtime and has a smaller footprint than Docker's full stack.
Layer 3 — Orchestrator
K3s v1.34.6 in HA mode. All three nodes carry the control-plane and etcd
roles, so etcd runs as a three-member Raft cluster: quorum is two members,
which means the cluster tolerates one node failure. K3s is a minimal
Kubernetes distribution from Rancher Labs (now SUSE): a single binary,
embedded etcd instead of a separate etcd cluster, and sane defaults for
small installations. See Chapter 2 for why K3s over full Kubernetes
or Docker Swarm.
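The quorum arithmetic behind "three nodes, one failure tolerated" can be sketched in a few lines of Go. This is a generic illustration of Raft-style majority math, not code from the honeyDue repository; the function names are made up for the example.

```go
package main

import "fmt"

// quorum returns how many voting members must agree for a Raft-style
// cluster of n members to make progress: a strict majority.
func quorum(n int) int {
	return n/2 + 1
}

// faultTolerance returns how many members can be lost while a quorum
// is still reachable.
func faultTolerance(n int) int {
	return n - quorum(n)
}

func main() {
	for _, n := range []int{1, 2, 3, 4, 5} {
		fmt.Printf("members=%d quorum=%d tolerates=%d\n", n, quorum(n), faultTolerance(n))
	}
}
```

Note that a fourth node would raise the quorum to three without raising the failure tolerance, which is why etcd clusters are normally sized with an odd member count.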
Layer 4 — Cluster networking
- Flannel VXLAN for pod-to-pod overlay (default on K3s). VXLAN tunnels pod traffic over UDP port 8472 between nodes.
- CoreDNS for service discovery (it is what resolves the names `api` or `redis` when pods reach each other).
- kube-proxy in IPVS mode for ClusterIP → pod routing.
Chapter 3 walks through a single request to show every hop.
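As a rough illustration of the service-discovery layer: a short name such as `redis` works inside a pod because the resolver's search path expands it to the Service's fully qualified name. The sketch below only builds that name. `cluster.local` is the Kubernetes default cluster domain, and the namespace `example-ns` is a placeholder, since the namespace used by the manifests is not stated here.

```go
package main

import "fmt"

// serviceFQDN builds the fully qualified in-cluster DNS name of a Service.
// "cluster.local" is the default cluster domain; a cluster configured with
// a different domain would need a different suffix.
func serviceFQDN(service, namespace string) string {
	return fmt.Sprintf("%s.%s.svc.cluster.local", service, namespace)
}

func main() {
	// "example-ns" is a hypothetical namespace used only for illustration.
	for _, svc := range []string{"api", "redis"} {
		fmt.Println(svc, "->", serviceFQDN(svc, "example-ns"))
	}
}
```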
Layer 5 — Ingress
Traefik v3 as a DaemonSet with hostNetwork: true. Each node has a
Traefik pod that binds directly to the node's public :80 and :443. No
servicelb, no Hetzner Load Balancer — Cloudflare round-robins the three
node IPs in DNS and any node can serve any request. See
Chapter 6.
Layer 6 — Edge / CDN
Cloudflare Free plan. Proxied A records for api.myhoneydue.com,
admin.myhoneydue.com, and myhoneydue.com each point at all three node
IPs. Edge handles TLS termination (SSL=Flexible), DDoS protection, caching
for static assets, and traffic failover if a node becomes unreachable.
See Chapter 13.
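The behaviour described above, rotating across three origins and skipping one that is unreachable, can be modelled with a small round-robin picker. This is a toy model of the idea, not Cloudflare's actual selection algorithm; the `origin` and `picker` types are invented for the example and the health flag is set by hand.

```go
package main

import "fmt"

// origin is one backend node with a manually set health flag.
type origin struct {
	name    string
	healthy bool
}

// picker hands out origins in round-robin order, skipping unhealthy ones.
type picker struct {
	origins []origin
	next    int
}

// pick returns the next healthy origin, or false if none is healthy.
func (p *picker) pick() (origin, bool) {
	for i := 0; i < len(p.origins); i++ {
		idx := (p.next + i) % len(p.origins)
		if p.origins[idx].healthy {
			p.next = (idx + 1) % len(p.origins)
			return p.origins[idx], true
		}
	}
	return origin{}, false
}

func main() {
	p := &picker{origins: []origin{
		{"hetzner1", true},
		{"hetzner2", false}, // simulate an unreachable node
		{"hetzner3", true},
	}}
	for i := 0; i < 4; i++ {
		if o, ok := p.pick(); ok {
			fmt.Println(o.name)
		}
	}
}
```

With `hetzner2` marked unhealthy, requests alternate between the two remaining nodes, which is the property the hostNetwork Traefik setup relies on: any node can serve any request.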
Layer 7 — Application services
| Service | Type | Replicas | Image |
|---|---|---|---|
| `api` | Go (Echo, GORM) | 3 | `gitea.treytartt.com/admin/honeydue-api:<sha>` |
| `admin` | Next.js 16 | 1 | `gitea.treytartt.com/admin/honeydue-admin:<sha>` |
| `worker` | Go (Asynq) | 1 | `gitea.treytartt.com/admin/honeydue-worker:<sha>` |
| `redis` | redis:7-alpine | 1 | Docker Hub |
See Chapter 7.
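The `<sha>` tag convention in the table can be illustrated with a small helper that turns a commit hash into image references. This is a sketch only: the helper names are hypothetical, the sample hash is made up, and the 7-character abbreviation is an assumption (it is git's default short length, but the deploy tooling may use another).

```go
package main

import "fmt"

// registry is the image prefix shown in the services table.
const registry = "gitea.treytartt.com/admin"

// shortSHA truncates a full commit hash to n characters.
// n = 7 below is an assumption, not a value taken from the deploy scripts.
func shortSHA(full string, n int) string {
	if len(full) <= n {
		return full
	}
	return full[:n]
}

// imageRef builds "<registry>/honeydue-<service>:<tag>".
func imageRef(service, tag string) string {
	return fmt.Sprintf("%s/honeydue-%s:%s", registry, service, tag)
}

func main() {
	// A made-up 40-character commit hash for illustration.
	tag := shortSHA("0123456789abcdef0123456789abcdef01234567", 7)
	for _, svc := range []string{"api", "admin", "worker"} {
		fmt.Println(imageRef(svc, tag))
	}
}
```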
Layer 8 — External dependencies
- Neon Postgres (Launch plan): `honeyDue` database
- Backblaze B2: `honeyDueProd` bucket for user uploads
- Fastmail SMTP: transactional email
- Gitea (self-hosted at `gitea.treytartt.com`): container registry
- Cloudflare: DNS, TLS, CDN
What's deliberately absent
- TLS at origin. Cloudflare terminates TLS at the edge and talks HTTP on port 80 to the nodes. This is "Flexible SSL" in Cloudflare terminology. It's the simplest setup; we have a TODO to upgrade to "Full (strict)" with Cloudflare Origin CA certs (Chapter 13, §Future).
- Hetzner Load Balancer. We save the $8.49/mo by having Cloudflare round-robin across node IPs directly. If any node is unresponsive, Cloudflare's own origin health checks will route around it within 30s.
- Push notifications. APNs (iOS) and FCM (Android) are switched off until we have Apple Developer / Google Play accounts. The env vars are set to sentinel values that let the Go app boot; `FEATURE_PUSH_ENABLED=false` gates all call sites.
- External metrics/monitoring (Prometheus, Grafana, Betterstack). Right now we rely on `kubectl logs`, `kubectl top`, and Cloudflare's own analytics. See Chapter 15 for what's there and what we'd add.
- Automated backups of Redis state. Redis is configured with AOF (append-only file) persistence, but the PVC is only on one node. Redis holds only cache + Asynq queue state; losing it re-populates on first request / next cron tick. Not critical.
- Admin panel basic auth (Traefik middleware). In-app admin login is enabled; the extra Traefik-layer basic auth the scaffold supports is not currently attached.
The deployment pipeline in one paragraph
Changes to application code are built on your workstation by
docker buildx build --platform linux/amd64 --push, which cross-compiles
from arm64 (Apple Silicon) to amd64 (Hetzner nodes) and pushes directly to
gitea.treytartt.com. Manifests live in deploy-k3s/manifests/; they
reference image tags by git short SHA. kubectl apply -f rolls the new
image in with maxUnavailable: 0, maxSurge: 1 — one new pod at a time,
old one stays up until new is healthy. Service discovery by Kubernetes
DNS means api and admin hostnames always resolve to live backing pods;
traffic shifts the moment a new pod passes its readiness probe.
Chapter 14 walks through a complete deploy.
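To make the `maxUnavailable: 0, maxSurge: 1` behaviour concrete, the following sketch simulates the pod counts during a rollout. It is a simplified model (each new pod is assumed to become ready before an old one is removed), not the Deployment controller's real logic, and `rollout` is a name invented for the example.

```go
package main

import "fmt"

// rollout simulates a rolling update with maxUnavailable=0 and maxSurge=1.
// It returns the (old, new) ready-pod counts after each step.
func rollout(replicas int) [][2]int {
	oldPods, newPods := replicas, 0
	steps := [][2]int{{oldPods, newPods}}
	for oldPods > 0 {
		newPods++ // surge: start one new pod and wait until it is ready
		steps = append(steps, [2]int{oldPods, newPods})
		oldPods-- // only then terminate one old pod
		steps = append(steps, [2]int{oldPods, newPods})
	}
	return steps
}

func main() {
	for _, s := range rollout(3) {
		fmt.Printf("old=%d new=%d ready=%d\n", s[0], s[1], s[0]+s[1])
	}
}
```

For the three `api` replicas, the ready count moves between 3 and 4 and never drops below 3, which is what lets traffic continue uninterrupted during a deploy.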
What we used to have (the short version)
Up until 2026-04-24 this stack ran on Docker Swarm on the same three
Hetzner boxes. It worked, but the Docker libnetwork service-discovery
layer has a bug in the 29.x line (moby/moby#52265) that
leaves stale DNS A-records behind when tasks migrate between nodes. We
hit it: the admin panel returned 502s for ~50% of requests through
Cloudflare because Caddy (our previous reverse proxy) was dialing a ghost
IP that had since been recycled to the Dozzle log viewer. We spent four
hours trying increasingly clever workarounds (dnsrr vs VIP, Caddy's
dynamic A-record upstreams with DNS refresh, global mode, host-mode ports,
host.docker.internal,
hardcoded node IPs) before concluding that libnetwork state corruption
survives every non-nuclear fix.
The full autopsy is in Chapter 19 — Swarm Postmortem. K3s uses CoreDNS and has no libnetwork history; the bug class doesn't exist there.