honeyDueAPI/docs/deployment/README.md
Trey t 6f303dbbaa
Migrate prod deploy from Swarm to K3s; add full deployment book
Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
  callback (was causing 'unlock of unlocked mutex' fatal after
  Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
  + allowlist fonts.googleapis.com so the marketing landing page CSS
  actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
  --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
  images runnable on x86_64 Hetzner nodes; fix array expansion under
  set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
  top-level aliases (the '\${X_SECRET}' form never actually resolved);
  dozzle ports: long-form host_ip is rejected by Swarm, switched to
  short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
  worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
  (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
  12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
  real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
  and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
  hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
  without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations,
    consolidated citations
- README.md: Production Deployment section replaced with pointer to
  the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 07:20:54 -05:00

# honeyDue Production Deployment — The Book
This is the complete reference for the honeyDue production deployment as it
exists on **2026-04-24**. It serves two audiences:
1. **A new engineer** learning the system for the first time. Start at
   Chapter 0 (Overview) and read in order. Concepts build on each other;
   nothing is assumed beyond "you've deployed web apps before."
2. **The operator** (future-you) needing a specific fact fast. Every chapter
   opens with a one-paragraph summary and ends with an operator runbook.
   The appendices are a cheat sheet.

The deployment is non-trivial. It's a 3-node HA Kubernetes cluster running
a Go API, a Next.js admin panel, a background worker, Redis, and Traefik,
all fronted by Cloudflare and integrated with Neon Postgres, Backblaze B2, and
a self-hosted Gitea registry. This book explains **why each of those pieces
was chosen** (often over two or three alternatives we tried first), what
they do, and how to operate them.
## Table of Contents
### Part I — The System
- [00 — Overview](./00-overview.md) — what's running, at a glance
- [01 — Infrastructure](./01-infrastructure.md) — Hetzner nodes, specs, cost, region
- [02 — Orchestrator Choice](./02-orchestrator-choice.md) — why k3s (and not Swarm, full k8s, or Nomad)
### Part II — Networking
- [03 — Networking](./03-networking.md) — flannel, CoreDNS, kube-proxy, the overlay story
- [04 — Firewall](./04-firewall.md) — every UFW rule on every node, rationale
- [13 — Cloudflare](./13-cloudflare.md) — DNS, SSL modes, round-robin origin pool
### Part III — Security
- [05 — Security](./05-security.md) — RBAC, Pod Security, secrets, TLS chain
- [06 — Traefik Ingress](./06-traefik-ingress.md) — host-network DaemonSet, cert plan
### Part IV — Workloads
- [07 — Services](./07-services.md) — api, admin, worker, redis per-service deep dive
- [08 — Database](./08-database.md) — Neon Postgres, advisory-lock migrations
- [09 — Storage](./09-storage.md) — Backblaze B2, minio-go client details
- [10 — Secrets & Config](./10-secrets-config.md) — ConfigMap, Secret, env mapping
- [11 — Registry](./11-registry.md) — Gitea container registry, multi-arch builds
### Part V — Operation
- [12 — Data Flow](./12-data-flow.md) — end-to-end request lifecycle
- [14 — Deployment Process](./14-deployment-process.md) — how to roll new code
- [15 — Observability](./15-observability.md) — logs, metrics, tracing
- [16 — Failure Modes](./16-failure-modes.md) — what happens when X dies
- [17 — Runbook](./17-runbook.md) — common ops tasks
### Part VI — Context
- [18 — Cost](./18-cost.md) — what this costs to run, per service
- [19 — Swarm Postmortem](./19-postmortem-swarm.md) — the story of why we migrated from Docker Swarm
- [20 — Roadmap](./20-roadmap.md) — known TODOs and scaling triggers
### Appendices
- [A — Glossary](./appendices/a-glossary.md)
- [B — kubectl Cheat Sheet](./appendices/b-commands.md)
- [C — File Locations](./appendices/c-file-locations.md)
- [D — References & Citations](./appendices/d-references.md)
## Quick Facts
| Field | Value |
|---|---|
| Orchestrator | K3s v1.34.6+k3s1 (3 nodes, HA control plane) |
| Ingress | Traefik v3 (DaemonSet, hostNetwork) |
| Nodes | 3× Hetzner Cloud CX33 (4 vCPU, 8 GB RAM, 80 GB SSD) in `nbg1` (Nuremberg) |
| DNS & Edge | Cloudflare (Free plan), SSL=Flexible, round-robin 3 node A records |
| Database | Neon Postgres, `ep-floral-truth-amttbc5a.c-5.us-east-1.aws.neon.tech` |
| Cache + Queue | Redis 7-alpine, in-cluster, 1 replica, PVC-backed, pinned to `nbg1-2` |
| Object Storage | Backblaze B2, `honeyDueProd` bucket, `us-east-005` region |
| Image Registry | Self-hosted Gitea v1.25.5 at `gitea.treytartt.com` |
| Transactional Email | Fastmail SMTP (`smtp.fastmail.com:587`) |
| Domains | `api.myhoneydue.com`, `admin.myhoneydue.com`, `myhoneydue.com` |
| Monthly Cost (current) | ~$30-40 (3× Hetzner + Neon Launch + B2 + Cloudflare Free + Gitea free) |
| kubeconfig | `~/.kube/honeydue-k3s.yaml` on operator workstation |
| Repo | `honeyDueAPI-go/deploy-k3s/` holds the manifests; `deploy/` is the legacy Swarm config |
## How to Read This Book
- **"Why did we…?"** answers are in the chapter covering that component. Every
major design choice has an explicit rejection of 1-3 alternatives.
- **Historical bugs** are in Chapter 19. The rest of the book describes the
current (fixed) state; 19 is the forensic record of what was broken and
how we figured it out.
- **Operator commands** you'll run regularly are in Appendix B. Chapter 17
has longer procedures (cert rotation, DB migration, etc.).
- **Citations** throughout use footnote-style links to the canonical source
(k3s docs, moby issues, Cloudflare docs, etc.). Appendix D collects them.
## Conventions
- Kubernetes namespace for the app is `honeydue`.
- SSH aliases are `hetzner1`, `hetzner2`, `hetzner3` in your `~/.ssh/config`.
- Node hostnames in the cluster are `ubuntu-8gb-nbg1-{1,2,3}` (Hetzner-assigned).

The mapping is non-obvious because the Hetzner hostname suffix order does
not match SSH alias order:

| SSH alias | Public IP | Hostname in k3s |
|---|---|---|
| hetzner1 | 178.104.247.152 | `ubuntu-8gb-nbg1-2` |
| hetzner2 | 178.105.32.198 | `ubuntu-8gb-nbg1-1` |
| hetzner3 | 178.104.249.189 | `ubuntu-8gb-nbg1-3` |

When a chapter refers to "hetzner1", it means the box at 178.104.247.152 / `nbg1-2`.