Trey t 77cfcc0b27
docs: rewrite ch15 observability + cross-refs for the live obs stack
ch15 is now an account of what's actually running, not a roadmap for
what we'd add: VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate
fronted by Cloudflare and bearer-gated nginx, vmagent in-cluster, the
internal/prom histogram set, the rollout's NetworkPolicy footprint,
the obs.88oakapps.com endpoint shape, the ~$0/700MB resource budget,
and a token-rotation runbook. The "what we still don't have" section
keeps log aggregation, alerting, and full distributed tracing as the
honest gap list.

Other touched docs:
- 00-overview: "deliberately absent" no longer claims we have no
  metrics — calls out the cross-cluster shape instead.
- 14-deployment-process: TL;DR now points at deploy-k3s/scripts/03-deploy.sh
  (full build + push + apply + obs vmagent), with the manual
  kubectl-set-image flow kept as the single-service path. Notes the
  IfNotPresent gotcha that bit us during the rollout.
- 16-failure-modes: adds vmagent-can't-reach-obs and Grafana-no-data.
- 18-cost: $0 line item for the obs stack on 88oakappsUpdate, with the
  CX32 migration trigger.
- 17/18 README + appendix b: link the new ch15, add the obs cheat
  sheet block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 15:05:06 -05:00


00 — Overview

Summary

honeyDue runs on a three-node Kubernetes cluster managed by K3s, fronted by Cloudflare, and backed by a managed Postgres (Neon), S3-compatible object storage (Backblaze B2), and a self-hosted container registry (Gitea). The application consists of a Go REST API, a Next.js admin panel, and a background worker process using Redis-backed queues. Traefik handles HTTP ingress and path-based routing. The whole stack fits in about 1 GB of RAM across the three nodes with plenty of headroom.

This chapter is the map. Everything here is expanded in a later chapter.

Architecture at a glance

flowchart TB
    subgraph Internet
        Browser[End-user browser / mobile client]
    end

    subgraph CF[Cloudflare]
        CFEdge[Edge POP<br/>TLS terminates here]
    end

    Browser -- HTTPS :443 --> CFEdge

    subgraph Hetzner[Hetzner Cloud — Nuremberg nbg1]
        direction LR
        subgraph H1[hetzner1<br/>178.104.247.152]
            T1[Traefik<br/>:80/:443 hostNet]
            A1[api pod]
            W1[worker pod]
        end
        subgraph H2[hetzner2<br/>178.105.32.198]
            T2[Traefik<br/>:80/:443 hostNet]
            A2[api pod]
            R1[redis pod<br/>PVC]
        end
        subgraph H3[hetzner3<br/>178.104.249.189]
            T3[Traefik<br/>:80/:443 hostNet]
            A3[api pod]
            AD1[admin pod]
        end
    end

    CFEdge -- HTTP :80<br/>DNS round-robin --> T1
    CFEdge -- HTTP :80 --> T2
    CFEdge -- HTTP :80 --> T3

    T1 & T2 & T3 -.Ingress routes by<br/>Host header.-> A1
    T1 & T2 & T3 -.-> AD1
    A1 & A2 & A3 -.-> R1

    subgraph External[Managed services]
        Neon[(Neon Postgres<br/>AWS us-east-1)]
        B2[(Backblaze B2<br/>us-east-005)]
        FM[Fastmail SMTP]
        Gitea[Gitea Registry<br/>gitea.treytartt.com]
    end

    A1 & A2 & A3 -- SSL --> Neon
    W1 -- SSL --> Neon
    A1 & A2 & A3 -- HTTPS --> B2
    W1 -- SMTP :587 --> FM
    H1 & H2 & H3 -. image pull .-> Gitea

ASCII fallback

                         ┌─────────────────────┐
                         │     End user        │
                         └──────────┬──────────┘
                                    │ HTTPS :443
                                    ▼
                         ┌─────────────────────┐
                         │  Cloudflare edge    │ TLS terminates here
                         │  (SSL = Flexible)   │
                         └──────────┬──────────┘
                       HTTP :80 round-robin
                  ┌─────────────┼─────────────┐
                  ▼             ▼             ▼
     ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
     │   hetzner1      │ │   hetzner2      │ │   hetzner3      │
     │ 178.104.247.152 │ │ 178.105.32.198  │ │ 178.104.249.189 │
     │ Traefik :80/443 │ │ Traefik :80/443 │ │ Traefik :80/443 │
     │  api   worker   │ │  api   redis    │ │  api   admin    │
     └─────────┬───────┘ └─────────┬───────┘ └─────────┬───────┘
               │                   │                   │
               └──────── Kubernetes overlay ───────────┘
                                   │
     ┌─────────────────────────────┴──────────────────────────────┐
     │                                                             │
     ▼                      ▼                    ▼                 ▼
┌─────────┐          ┌─────────────┐     ┌──────────┐    ┌───────────────┐
│  Neon   │          │ Backblaze B2│     │ Fastmail │    │ Gitea Registry│
│Postgres │          │   uploads   │     │   SMTP   │    │   image pull  │
└─────────┘          └─────────────┘     └──────────┘    └───────────────┘

The stack, one layer at a time

Layer 0 — Hardware

Three Hetzner Cloud CX33 instances (4 vCPU, 8 GB RAM, 80 GB NVMe SSD) in Hetzner's Nuremberg (nbg1) datacenter. Each node is $7.99/mo (April 2026 pricing), totaling ~$24/mo. See Chapter 1.

Layer 1 — Operating system

Ubuntu 24.04.3 LTS. Each node has:

  • SSH on port 22, key-only auth, deploy user with NOPASSWD sudo
  • ufw firewall with strict default-deny-incoming; specific ports allowed per Chapter 4
  • Sysctl override net.ipv4.ip_unprivileged_port_start=0 so non-root containers can bind privileged ports (needed for Traefik to serve :80/:443)
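The effect of the sysctl override above can be illustrated with a small sketch. It assumes the kernel exposes the value at /proc/sys/net/ipv4/ip_unprivileged_port_start (the Linux default is 1024); the helper names parsePortStart and needsPrivilege are hypothetical and not part of the honeyDue codebase.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePortStart parses the text found in
// /proc/sys/net/ipv4/ip_unprivileged_port_start, e.g. "0\n" or "1024\n".
func parsePortStart(raw string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(raw))
}

// needsPrivilege reports whether binding port requires root or
// CAP_NET_BIND_SERVICE, given the first unprivileged port.
func needsPrivilege(port, unprivilegedStart int) bool {
	return port < unprivilegedStart
}

func main() {
	// Compare the Linux default (1024) with the override used on the nodes (0).
	for _, raw := range []string{"1024\n", "0\n"} {
		start, err := parsePortStart(raw)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("start=%d  :80 privileged=%v  :443 privileged=%v\n",
			start, needsPrivilege(80, start), needsPrivilege(443, start))
	}
}
```

With the value set to 0, no port counts as privileged, which is why a non-root Traefik container can bind :80 and :443.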

Layer 2 — Container runtime

containerd v2.2.2 (bundled with K3s). Docker is still installed from the Swarm era but is now disabled. containerd is Kubernetes' reference runtime and has a smaller footprint than Docker's full stack.

Layer 3 — Orchestrator

K3s v1.34.6 in HA mode. All 3 nodes carry the control-plane,etcd roles, so etcd has three members: the Raft quorum is 2 and the cluster can tolerate one node failure. K3s is a minimal Kubernetes distribution from Rancher Labs (now SUSE): single binary, embedded etcd instead of a separate etcd cluster, sane defaults for small installations. See Chapter 2 for why K3s over full Kubernetes or Docker Swarm.
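The fault-tolerance arithmetic behind a three-member etcd cluster is the standard Raft majority rule. A minimal sketch, with hypothetical function names:

```go
package main

import "fmt"

// quorum returns how many voting members must agree before a Raft
// cluster of n members can commit a write: a strict majority.
func quorum(n int) int { return n/2 + 1 }

// faultTolerance returns how many members can fail while a majority remains.
func faultTolerance(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{1, 2, 3, 4, 5} {
		fmt.Printf("members=%d quorum=%d tolerates=%d\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

For three members the quorum is 2 and one failure is tolerated. A fourth member would raise the quorum to 3 without tolerating any additional failure, which is why odd member counts are the usual choice.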

Layer 4 — Cluster networking

  • Flannel VXLAN for pod-to-pod overlay (default on K3s). VXLAN tunnels pod traffic over UDP port 8472 between nodes.
  • CoreDNS for service discovery (it resolves the names api and redis that pods use to reach each other).
  • kube-proxy in IPVS mode for ClusterIP → pod routing.

Chapter 3 walks through a single request to show every hop.
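As a rough illustration of the service-discovery bullet, the short names pods use expand to fully qualified Service names through the search domains in each pod's resolv.conf. The sketch below assumes the Services live in the honeydue namespace and that the cluster domain is the default cluster.local; neither is confirmed by this chapter, and serviceFQDN is a hypothetical helper.

```go
package main

import "fmt"

// serviceFQDN builds the fully qualified name that cluster DNS serves
// for a Service. "cluster.local" is the default cluster domain and is
// an assumption about this cluster.
func serviceFQDN(service, namespace, clusterDomain string) string {
	return fmt.Sprintf("%s.%s.svc.%s", service, namespace, clusterDomain)
}

func main() {
	// Namespace "honeydue" is assumed for illustration.
	for _, svc := range []string{"api", "redis"} {
		fmt.Println(svc, "->", serviceFQDN(svc, "honeydue", "cluster.local"))
	}
}
```

A pod in the same namespace can dial the short name; the full form is only needed when calling across namespaces.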

Layer 5 — Ingress

Traefik v3 as a DaemonSet with hostNetwork: true. Each node has a Traefik pod that binds directly to the node's public :80 and :443. No servicelb, no Hetzner Load Balancer — Cloudflare round-robins the three node IPs in DNS and any node can serve any request. See Chapter 6.

Layer 6 — Edge / CDN

Cloudflare Free plan. Proxied A records for api.myhoneydue.com, admin.myhoneydue.com, and myhoneydue.com each point at all three node IPs. Edge handles TLS termination (SSL=Flexible), DDoS protection, caching for static assets, and traffic failover if a node becomes unreachable. See Chapter 13.
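The "any node can serve any request" property can be modelled in a few lines. This is only a sketch of round-robin over healthy origins; it is not Cloudflare's actual selection or health-check logic, and the probe function is injected so the example needs no network access.

```go
package main

import "fmt"

// healthyOrigins returns the origins for which probe reports true,
// preserving input order.
func healthyOrigins(origins []string, probe func(string) bool) []string {
	var up []string
	for _, o := range origins {
		if probe(o) {
			up = append(up, o)
		}
	}
	return up
}

// pick chooses an origin for the i-th request by simple round-robin.
// i must be non-negative. It returns "" when no origin is available.
func pick(origins []string, i int) string {
	if len(origins) == 0 {
		return ""
	}
	return origins[i%len(origins)]
}

func main() {
	nodes := []string{"178.104.247.152", "178.105.32.198", "178.104.249.189"}
	// Pretend hetzner2 has stopped answering.
	probe := func(ip string) bool { return ip != "178.105.32.198" }
	up := healthyOrigins(nodes, probe)
	for i := 0; i < 4; i++ {
		fmt.Printf("request %d -> %s\n", i, pick(up, i))
	}
}
```

Because every node runs Traefik and can route to any pod, dropping one IP from the rotation reduces capacity but does not make any hostname unreachable.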

Layer 7 — Application services

| Service | Type | Replicas | Image |
|---------|------|----------|-------|
| api | Go (Echo, GORM) | 3 | gitea.treytartt.com/admin/honeydue-api:<sha> |
| admin | Next.js 16 | 1 | gitea.treytartt.com/admin/honeydue-admin:<sha> |
| worker | Go (Asynq) | 1 | gitea.treytartt.com/admin/honeydue-worker:<sha> |
| redis | redis:7-alpine | 1 | Docker Hub |

See Chapter 7.

Layer 8 — External dependencies

  • Neon Postgres (Launch plan) — honeyDue database
  • Backblaze B2: honeyDueProd bucket for user uploads
  • Fastmail SMTP — transactional email
  • Gitea (self-hosted at gitea.treytartt.com) — container registry
  • Cloudflare — DNS, TLS, CDN

See Chapters 8, 9, and 11.

What's deliberately absent

  • TLS at origin. Cloudflare terminates TLS at the edge and talks HTTP on port 80 to the nodes. This is "Flexible SSL" in Cloudflare terminology. It's the simplest setup; we have a TODO to upgrade to "Full (strict)" with Cloudflare Origin CA certs (Chapter 13, §Future).
  • Hetzner Load Balancer. We save the $8.49/mo by having Cloudflare round-robin across node IPs directly. If any node is unresponsive, Cloudflare's own origin health checks will route around it within 30s.
  • Push notifications. APNs (iOS) and FCM (Android) are configured off until we have Apple Developer / Google Play accounts. The env vars are set to sentinel values that let the Go app boot; FEATURE_PUSH_ENABLED=false gates all call sites.
  • In-cluster Prometheus / Grafana. Self-hosted Prometheus-compatible metrics + tracing + dashboards live outside the k3s cluster on 88oakappsUpdate (the same Linode VPS that hosts PostHog), reached via https://obs.88oakapps.com (Cloudflare-fronted, bearer-gated). A vmagent sidecar in the honeydue namespace scrapes the api Pods and remote-writes out. This frees ~700 MB of cluster RAM and means observability survives a k3s control-plane incident. See Chapter 15.
  • Alerting. No PagerDuty, Slack hooks, or pages-on-error wired up yet. Histograms are flowing into Grafana; alert rules on top of them are the next addition. See Chapter 15, §Future.
  • Automated backups of Redis state. Redis is configured with AOF (append-only file) persistence, but the PVC lives on a single node. Redis holds only cache and Asynq queue state; if it is lost, the data repopulates on the first request or the next cron tick. Not critical.
  • Admin panel basic auth (Traefik middleware). In-app admin login is enabled; the extra Traefik-layer basic auth the scaffold supports is not currently attached.
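The push-notification gate in the list above (FEATURE_PUSH_ENABLED=false guarding every call site) might look roughly like the following. The function names, the fail-closed parsing, and the shape of the call site are assumptions for illustration; only the environment variable name comes from this chapter.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// parseFlag interprets a raw flag value. Empty or unparsable values are
// treated as false so that a misconfiguration fails closed.
func parseFlag(raw string) bool {
	on, err := strconv.ParseBool(raw)
	return err == nil && on
}

// pushEnabled reads the flag from the environment.
func pushEnabled() bool {
	return parseFlag(os.Getenv("FEATURE_PUSH_ENABLED"))
}

// notify is a hypothetical call site: it only invokes the provider
// (send) when the feature is enabled, and reports whether it did.
func notify(enabled bool, userID string, send func(string) error) (bool, error) {
	if !enabled {
		return false, nil
	}
	if err := send(userID); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	sent, err := notify(pushEnabled(), "user-1", func(id string) error {
		fmt.Println("would push to", id)
		return nil
	})
	fmt.Println("sent:", sent, "err:", err)
}
```

Checking the flag before touching the provider is what allows sentinel credentials to sit in the environment without ever being used.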

The deployment pipeline in one paragraph

Changes to application code are built on your workstation by docker buildx build --platform linux/amd64 --push, which cross-compiles from arm64 (Apple Silicon) to amd64 (Hetzner nodes) and pushes directly to gitea.treytartt.com. Manifests live in deploy-k3s/manifests/; they reference image tags by git short SHA. kubectl apply -f rolls the new image in with maxUnavailable: 0, maxSurge: 1, so pods are replaced one at a time and each old pod stays up until its replacement is healthy. Because a Service only routes to pods that pass their readiness probe, the api and admin hostnames always reach live backing pods, and traffic shifts the moment a new pod becomes ready. Chapter 14 walks through a complete deploy.
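To see why maxUnavailable: 0, maxSurge: 1 never reduces serving capacity, the rollout can be simulated. This is a deliberately simplified model, not the Deployment controller's real algorithm: it assumes every new pod passes its readiness probe before the next step.

```go
package main

import "fmt"

// rollout simulates a rolling update and returns the lowest number of
// ready pods and the highest total number of pods seen along the way.
func rollout(replicas, maxSurge, maxUnavailable int) (minReady, maxTotal int) {
	oldReady, newReady := replicas, 0
	minReady, maxTotal = replicas, replicas
	for newReady < replicas || oldReady > 0 {
		progressed := false
		// Scale down old pods only while enough ready pods would remain.
		for oldReady > 0 && oldReady+newReady-1 >= replicas-maxUnavailable {
			oldReady--
			progressed = true
		}
		if ready := oldReady + newReady; ready < minReady {
			minReady = ready
		}
		// Start new pods without exceeding replicas+maxSurge in total.
		starting := 0
		for oldReady+newReady+starting < replicas+maxSurge && newReady+starting < replicas {
			starting++
			progressed = true
		}
		if total := oldReady + newReady + starting; total > maxTotal {
			maxTotal = total
		}
		if !progressed {
			break // maxSurge=0 with maxUnavailable=0 can never make progress
		}
		newReady += starting // assumption: each new pod becomes ready
	}
	return minReady, maxTotal
}

func main() {
	minReady, maxTotal := rollout(3, 1, 0)
	fmt.Printf("minReady=%d maxTotal=%d\n", minReady, maxTotal)
}
```

For the three api replicas the model never drops below three ready pods and never runs more than four in total, which matches the "one new pod at a time" behaviour described above.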

What we used to have (the short version)

Up until 2026-04-24 this stack ran on Docker Swarm on the same three Hetzner boxes. It worked, but the Docker libnetwork service-discovery layer has a bug in the 29.x line (moby/moby#52265) that leaves stale DNS A-records behind when tasks migrate between nodes. We hit it: the admin panel returned 502s for ~50% of requests through Cloudflare because Caddy (our previous reverse proxy) was dialing a ghost IP that had since been recycled to the Dozzle log viewer. We spent four hours trying increasingly clever workarounds (dnsrr vs VIP, dynamic a DNS refresh, global mode, host-mode ports, host.docker.internal, hardcoded node IPs) before concluding that libnetwork state corruption survives every non-nuclear fix.

The full autopsy is in Chapter 19 — Swarm Postmortem. K3s uses CoreDNS and has no libnetwork history; the bug class doesn't exist there.