Trey t 6f303dbbaa

Migrate prod deploy from Swarm to K3s; add full deployment book

Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
  callback (was causing 'unlock of unlocked mutex' fatal after
  Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
  + allowlist fonts.googleapis.com so the marketing landing page CSS
  actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
  --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
  images runnable on x86_64 Hetzner nodes; fix array expansion under
  set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
  top-level aliases (the '\${X_SECRET}' form never actually resolved);
  dozzle ports: long-form host_ip is rejected by Swarm, switched to
  short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
  worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
  (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
  12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
  real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
  and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
  hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
  without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations,
    consolidated citations
- README.md: Production Deployment section replaced with pointer to
  the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-24 07:20:54 -05:00


00 — Overview

Summary

honeyDue runs on a three-node Kubernetes cluster managed by K3s, fronted by Cloudflare, and backed by managed Postgres (Neon), S3-compatible object storage (Backblaze B2), and a self-hosted container registry (Gitea). The application consists of a Go REST API, a Next.js admin panel, and a background worker that uses Redis-backed queues. Traefik handles HTTP ingress, routing requests by Host header and path. The whole stack uses about 1 GB of RAM across the three nodes, leaving plenty of headroom.

This chapter is the map. Everything here is expanded in a later chapter.

Architecture at a glance

flowchart TB
    subgraph Internet
        Browser[End-user browser / mobile client]
    end

    subgraph CF[Cloudflare]
        CFEdge[Edge POP<br/>TLS terminates here]
    end

    Browser -- HTTPS :443 --> CFEdge

    subgraph Hetzner[Hetzner Cloud — Nuremberg nbg1]
        direction LR
        subgraph H1[hetzner1<br/>178.104.247.152]
            T1[Traefik<br/>:80/:443 hostNet]
            A1[api pod]
            W1[worker pod]
        end
        subgraph H2[hetzner2<br/>178.105.32.198]
            T2[Traefik<br/>:80/:443 hostNet]
            A2[api pod]
            R1[redis pod<br/>PVC]
        end
        subgraph H3[hetzner3<br/>178.104.249.189]
            T3[Traefik<br/>:80/:443 hostNet]
            A3[api pod]
            AD1[admin pod]
        end
    end

    CFEdge -- HTTP :80<br/>DNS round-robin --> T1
    CFEdge -- HTTP :80 --> T2
    CFEdge -- HTTP :80 --> T3

    T1 & T2 & T3 -.Ingress routes by<br/>Host header.-> A1
    T1 & T2 & T3 -.-> AD1
    A1 & A2 & A3 -.-> R1

    subgraph External[Managed services]
        Neon[(Neon Postgres<br/>AWS us-east-1)]
        B2[(Backblaze B2<br/>us-east-005)]
        FM[Fastmail SMTP]
        Gitea[Gitea Registry<br/>gitea.treytartt.com]
    end

    A1 & A2 & A3 -- SSL --> Neon
    W1 -- SSL --> Neon
    A1 & A2 & A3 -- HTTPS --> B2
    W1 -- SMTP :587 --> FM
    H1 & H2 & H3 -. image pull .-> Gitea

ASCII fallback

                         ┌─────────────────────┐
                         │     End user        │
                         └──────────┬──────────┘
                                    │ HTTPS :443
                                    ▼
                         ┌─────────────────────┐
                         │  Cloudflare edge    │ TLS terminates here
                         │  (SSL = Flexible)   │
                         └──────────┬──────────┘
                       HTTP :80 round-robin
                  ┌─────────────┼─────────────┐
                  ▼             ▼             ▼
     ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
     │   hetzner1      │ │   hetzner2      │ │   hetzner3      │
     │ 178.104.247.152 │ │ 178.105.32.198  │ │ 178.104.249.189 │
     │ Traefik :80/443 │ │ Traefik :80/443 │ │ Traefik :80/443 │
     │  api   worker   │ │  api   redis    │ │  api   admin    │
     └─────────┬───────┘ └─────────┬───────┘ └─────────┬───────┘
               │                   │                   │
               └──────── Kubernetes overlay ───────────┘
                                   │
     ┌─────────────────────────────┴──────────────────────────────┐
     │                                                             │
     ▼                      ▼                    ▼                 ▼
┌─────────┐          ┌─────────────┐     ┌──────────┐    ┌───────────────┐
│  Neon   │          │ Backblaze B2│     │ Fastmail │    │ Gitea Registry│
│Postgres │          │   uploads   │     │   SMTP   │    │   image pull  │
└─────────┘          └─────────────┘     └──────────┘    └───────────────┘

The stack, one layer at a time

Layer 0 — Hardware

Three Hetzner Cloud CX33 instances (4 vCPU, 8 GB RAM, 80 GB NVMe SSD) in Hetzner's Nuremberg (nbg1) datacenter. Each node is $7.99/mo (April 2026 pricing), totaling ~$24/mo. See Chapter 1.
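The monthly figure is simple multiplication, but it is worth keeping in a form that can be rerun when prices change. The sketch below uses only the prices quoted in this chapter (the per-node price here and the Hetzner Load Balancer price mentioned under "What's deliberately absent"); the function and constant names are illustrative, not part of the repo.

```python
from decimal import Decimal

# Prices quoted in this chapter (April 2026 pricing); update if they change.
NODE_PRICE = Decimal("7.99")
NODE_COUNT = 3
LOAD_BALANCER_PRICE = Decimal("8.49")  # the Hetzner Load Balancer this setup skips


def monthly_compute_cost(node_price: Decimal = NODE_PRICE,
                         node_count: int = NODE_COUNT) -> Decimal:
    """Total monthly cost of the cluster nodes alone."""
    return node_price * node_count


compute = monthly_compute_cost()
print(compute)                        # 23.97
print(compute + LOAD_BALANCER_PRICE)  # 32.46 (what it would cost with the LB)
```

Decimal is used instead of float so the cents come out exact.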

Layer 1 — Operating system

Ubuntu 24.04.3 LTS. Each node has:

- SSH on port 22, key-only auth, deploy user with NOPASSWD sudo
- ufw firewall with strict default-deny-incoming; specific ports allowed per Chapter 4
- Sysctl override net.ipv4.ip_unprivileged_port_start=0 so non-root containers can bind privileged ports (needed for Traefik to serve :80/:443)
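To make the effect of the sysctl override concrete: on Linux, ports below net.ipv4.ip_unprivileged_port_start (1024 by default) can only be bound by a process with elevated privileges. The following is a minimal model of that rule, not code from the repo; the function name is hypothetical.

```python
LINUX_DEFAULT_START = 1024  # kernel default for ip_unprivileged_port_start


def needs_privilege(port: int, unprivileged_port_start: int = LINUX_DEFAULT_START) -> bool:
    """Return True if binding `port` requires elevated privileges.

    Ports strictly below `unprivileged_port_start` are treated as privileged.
    """
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid TCP/UDP port: {port}")
    return port < unprivileged_port_start


# With the kernel default, Traefik's ports are privileged.
print(needs_privilege(80), needs_privilege(443))        # True True
# With the override from this chapter (start = 0), no port is privileged.
print(needs_privilege(80, 0), needs_privilege(443, 0))  # False False
```

Setting the threshold to 0 removes the privileged range entirely, which is why a non-root Traefik container can then bind :80 and :443.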

Layer 2 — Container runtime

containerd v2.2.2 (bundled with K3s). Docker is still installed from the Swarm era but is now disabled. containerd is the container runtime Kubernetes most commonly runs on, and it has a smaller footprint than Docker's full stack.

Layer 3 — Orchestrator

K3s v1.34.6 in HA mode. All three nodes carry the control-plane,etcd roles, so etcd has three members with a Raft quorum of two and can tolerate the loss of one node. K3s is a minimal Kubernetes distribution from Rancher Labs (now part of SUSE): a single binary with embedded etcd instead of a separate etcd cluster, and sane defaults for small installations. See Chapter 2 for why K3s was chosen over a full upstream Kubernetes install or Docker Swarm.
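The one-node failure tolerance follows from Raft's majority rule: a cluster of n voting members needs floor(n/2) + 1 of them to agree. A short sketch of that arithmetic (the function names are illustrative):

```python
def quorum(members: int) -> int:
    """Smallest majority of a Raft cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be lost while a majority can still form."""
    return members - quorum(members)


for n in (1, 2, 3, 5):
    print(f"{n} members: quorum {quorum(n)}, tolerates {failures_tolerated(n)} failure(s)")
# 1 members: quorum 1, tolerates 0 failure(s)
# 2 members: quorum 2, tolerates 0 failure(s)
# 3 members: quorum 2, tolerates 1 failure(s)
# 5 members: quorum 3, tolerates 2 failure(s)
```

Note that two members tolerate no failures at all, which is why three is the smallest useful HA size.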

Layer 4 — Cluster networking

- Flannel VXLAN for the pod-to-pod overlay (the K3s default). VXLAN tunnels pod traffic between nodes over UDP port 8472.
- CoreDNS for service discovery (it resolves the service names, such as api or redis, that pods use to reach each other).
- kube-proxy in IPVS mode for ClusterIP → pod routing.

Chapter 3 walks through a single request to show every hop.
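As an illustration of why a pod can reach another service by a short name like redis: Kubernetes gives each pod a DNS search list derived from its namespace, and the resolver tries the short name against each suffix. The sketch below reproduces that expansion. The namespace name honeydue is an assumption for the example (this chapter does not name it), and cluster.local is the Kubernetes default cluster domain.

```python
def search_candidates(name: str, namespace: str,
                      cluster_domain: str = "cluster.local") -> list[str]:
    """Names a pod's resolver tries, in order, for a short service name."""
    suffixes = [
        f"{namespace}.svc.{cluster_domain}",  # same-namespace Service
        f"svc.{cluster_domain}",
        cluster_domain,
    ]
    # Search-list expansions first, then the name as given.
    return [f"{name}.{suffix}" for suffix in suffixes] + [name]


# "honeydue" is a hypothetical namespace used only for illustration.
for candidate in search_candidates("redis", "honeydue"):
    print(candidate)
# redis.honeydue.svc.cluster.local
# redis.svc.cluster.local
# redis.cluster.local
# redis
```

The first candidate is the one CoreDNS answers for a Service in the pod's own namespace.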

Layer 5 — Ingress

Traefik v3 as a DaemonSet with hostNetwork: true. Each node has a Traefik pod that binds directly to the node's public :80 and :443. No servicelb, no Hetzner Load Balancer — Cloudflare round-robins the three node IPs in DNS and any node can serve any request. See Chapter 6.

Layer 6 — Edge / CDN

Cloudflare Free plan. Proxied A records for api.myhoneydue.com, admin.myhoneydue.com, and myhoneydue.com each point at all three node IPs. Edge handles TLS termination (SSL=Flexible), DDoS protection, caching for static assets, and traffic failover if a node becomes unreachable. See Chapter 13.
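A toy model of the behaviour described here: requests rotate across the three node IPs, and an unreachable node is skipped. This is only a sketch of the idea, not Cloudflare's actual selection algorithm, and the function name is hypothetical; the IPs are the node addresses from the architecture diagram.

```python
from itertools import cycle

NODE_IPS = ["178.104.247.152", "178.105.32.198", "178.104.249.189"]


def pick_origins(requests: int, reachable: set[str]) -> list[str]:
    """Choose an origin per request, rotating and skipping unreachable nodes."""
    rotation = cycle(NODE_IPS)
    chosen = []
    for _ in range(requests):
        # Try each node at most once per request before giving up.
        for _ in range(len(NODE_IPS)):
            ip = next(rotation)
            if ip in reachable:
                chosen.append(ip)
                break
        else:
            raise RuntimeError("no reachable origin")
    return chosen


# All nodes up: plain rotation.
print(pick_origins(3, set(NODE_IPS)))
# hetzner2 down: traffic alternates between the remaining two nodes.
print(pick_origins(4, {"178.104.247.152", "178.104.249.189"}))
```

Because every node runs Traefik and can route to any pod, losing one node reduces capacity but does not make any hostname unreachable in this model.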

Layer 7 — Application services

| Service | Type | Replicas | Image |
|---|---|---|---|
| `api` | Go (Echo, GORM) | 3 | `gitea.treytartt.com/admin/honeydue-api:<sha>` |
| `admin` | Next.js 16 | 1 | `gitea.treytartt.com/admin/honeydue-admin:<sha>` |
| `worker` | Go (Asynq) | 1 | `gitea.treytartt.com/admin/honeydue-worker:<sha>` |
| `redis` | Redis 7 | 1 | `redis:7-alpine` (Docker Hub) |

See Chapter 7.
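The image references in the table follow one pattern: registry path, a honeydue-<service> repository, and a git short SHA as the tag. A small helper along these lines can build and sanity-check them; it is a sketch, not a script from the repo, and the SHA used below is only an example value.

```python
import re

REGISTRY = "gitea.treytartt.com/admin"
SERVICES = ("api", "admin", "worker")  # redis uses the public redis:7-alpine image


def image_ref(service: str, sha: str) -> str:
    """Build the image reference for one of the self-built services."""
    if service not in SERVICES:
        raise ValueError(f"unknown service: {service}")
    # Accept abbreviated or full lowercase hex git SHAs.
    if not re.fullmatch(r"[0-9a-f]{7,40}", sha):
        raise ValueError(f"not a git SHA: {sha!r}")
    return f"{REGISTRY}/honeydue-{service}:{sha}"


print(image_ref("api", "6f303dbbaa"))
# gitea.treytartt.com/admin/honeydue-api:6f303dbbaa
```

Tagging by SHA rather than by a moving tag such as latest means each manifest pins exactly one build.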

Layer 8 — External dependencies

- Neon Postgres (Launch plan): honeyDue database
- Backblaze B2: honeyDueProd bucket for user uploads
- Fastmail SMTP: transactional email
- Gitea (self-hosted at gitea.treytartt.com): container registry
- Cloudflare: DNS, TLS, CDN

See Chapters 8, 9, and 11.

What's deliberately absent

- TLS at origin. Cloudflare terminates TLS at the edge and talks HTTP on port 80 to the nodes. This is "Flexible SSL" in Cloudflare terminology. It's the simplest setup; we have a TODO to upgrade to "Full (strict)" with Cloudflare Origin CA certs (Chapter 13, §Future).
- Hetzner Load Balancer. We save the $8.49/mo by having Cloudflare round-robin across node IPs directly. If any node is unresponsive, Cloudflare's own origin health checks will route around it within 30s.
- Push notifications. APNs (iOS) and FCM (Android) are configured off until we have Apple Developer / Google Play accounts. The env vars are set to sentinel values that let the Go app boot; FEATURE_PUSH_ENABLED=false gates all call sites.
- External metrics/monitoring (Prometheus, Grafana, Betterstack). Right now we rely on kubectl logs, kubectl top, and Cloudflare's own analytics. See Chapter 15 for what's there and what we'd add.
- Automated backups of Redis state. Redis is configured with AOF (append-only file) persistence, but its PVC lives on a single node. Redis holds only cache entries and Asynq queue state; if it is lost, the cache repopulates on the first request and the queue refills on the next cron tick. Not critical.
- Admin panel basic auth (Traefik middleware). In-app admin login is enabled; the extra Traefik-layer basic auth the scaffold supports is not currently attached.

The deployment pipeline in one paragraph

Application code changes are built on your workstation with docker buildx build --platform linux/amd64 --push, which builds amd64 images (for the Hetzner nodes) on an arm64 (Apple Silicon) machine and pushes them directly to gitea.treytartt.com. Manifests live in deploy-k3s/manifests/ and reference image tags by git short SHA. kubectl apply -f rolls the new image out with maxUnavailable: 0 and maxSurge: 1, meaning one new pod at a time, with the old pod kept running until the new one is healthy. Because Kubernetes Services route only to pods that pass their readiness probe, the api and admin service names always reach live pods, and traffic shifts the moment a new pod becomes ready. Chapter 14 walks through a complete deploy.
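To see what maxUnavailable: 0 with maxSurge: 1 means in practice, the following simplified simulation tracks ready pod counts during a rollout. It assumes every new pod eventually passes its readiness probe and ignores timing; it is a model of the strategy, not of the Kubernetes controller's actual implementation.

```python
def rolling_update(replicas: int, max_surge: int = 1) -> list[tuple[int, int]]:
    """Return (old_ready, new_ready) after each step, with maxUnavailable = 0."""
    if replicas < 1 or max_surge < 1:
        raise ValueError("replicas and max_surge must be at least 1")
    old, new = replicas, 0
    states = [(old, new)]
    while new < replicas:
        started = min(max_surge, replicas - new)
        new += started            # surge pods start and pass readiness
        states.append((old, new))
        old -= started            # only then are old pods terminated
        states.append((old, new))
    return states


for old, new in rolling_update(3):
    print(f"old={old} new={new} total_ready={old + new}")
# old=3 new=0 total_ready=3
# old=3 new=1 total_ready=4
# old=2 new=1 total_ready=3
# old=2 new=2 total_ready=4
# old=1 new=2 total_ready=3
# old=1 new=3 total_ready=4
# old=0 new=3 total_ready=3
```

For the three api replicas, ready capacity never drops below 3 and never exceeds 4, which is the zero-downtime property the paragraph describes.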

What we used to have (the short version)

Up until 2026-04-24 this stack ran on Docker Swarm on the same three Hetzner boxes. It worked, but the Docker libnetwork service-discovery layer has a bug in the 29.x line (moby/moby#52265) that leaves stale DNS A-records behind when tasks migrate between nodes. We hit it: the admin panel returned 502s for ~50% of requests through Cloudflare because Caddy (our previous reverse proxy) was dialing a ghost IP that had since been reassigned to the Dozzle log viewer. We spent four hours trying increasingly clever workarounds (`dnsrr` vs VIP, `dynamic a` DNS refresh, global mode, host-mode ports, `host.docker.internal`, hardcoded node IPs) before concluding that libnetwork state corruption survives every non-nuclear fix.

The full autopsy is in Chapter 19 — Swarm Postmortem. K3s uses CoreDNS and has no libnetwork history; the bug class doesn't exist there.
