admin/honeyDueAPI

Trey t 77cfcc0b27

docs: rewrite ch15 observability + cross-refs for the live obs stack

ch15 is now an account of what's actually running, not a roadmap for
what we'd add: VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate
fronted by Cloudflare and bearer-gated nginx, vmagent in-cluster, the
internal/prom histogram set, the rollout's NetworkPolicy footprint,
the obs.88oakapps.com endpoint shape, the ~$0/700MB resource budget,
and a token-rotation runbook. The "what we still don't have" section
keeps log aggregation, alerting, and full distributed tracing as the
honest gap list.

Other touched docs:
- 00-overview: "deliberately absent" no longer claims we have no
  metrics — calls out the cross-cluster shape instead.
- 14-deployment-process: TL;DR now points at deploy-k3s/scripts/03-deploy.sh
  (full build + push + apply + obs vmagent), with the manual
  kubectl-set-image flow kept as the single-service path. Notes the
  IfNotPresent gotcha that bit us during the rollout.
- 16-failure-modes: adds vmagent-can't-reach-obs and Grafana-no-data.
- 18-cost: $0 line item for the obs stack on 88oakappsUpdate, with the
  CX32 migration trigger.
- 17/18 README + appendix b: link the new ch15, add the obs cheat
  sheet block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-25 15:05:06 -05:00

00 — Overview

Summary

honeyDue runs on a three-node Kubernetes cluster managed by K3s, fronted by Cloudflare, and backed by a managed Postgres (Neon), S3-compatible object storage (Backblaze B2), and a self-hosted container registry (Gitea). The application consists of a Go REST API, a Next.js admin panel, and a background worker process using Redis-backed queues. Traefik handles HTTP ingress and path-based routing. The whole stack fits in about 1 GB of RAM across the three nodes with plenty of headroom.

This chapter is the map. Everything here is expanded in a later chapter.

Architecture at a glance

flowchart TB
    subgraph Internet
        Browser[End-user browser / mobile client]
    end

    subgraph CF[Cloudflare]
        CFEdge[Edge POP<br/>TLS terminates here]
    end

    Browser -- HTTPS :443 --> CFEdge

    subgraph Hetzner[Hetzner Cloud — Nuremberg nbg1]
        direction LR
        subgraph H1[hetzner1<br/>178.104.247.152]
            T1[Traefik<br/>:80/:443 hostNet]
            A1[api pod]
            W1[worker pod]
        end
        subgraph H2[hetzner2<br/>178.105.32.198]
            T2[Traefik<br/>:80/:443 hostNet]
            A2[api pod]
            R1[redis pod<br/>PVC]
        end
        subgraph H3[hetzner3<br/>178.104.249.189]
            T3[Traefik<br/>:80/:443 hostNet]
            A3[api pod]
            AD1[admin pod]
        end
    end

    CFEdge -- HTTP :80<br/>DNS round-robin --> T1
    CFEdge -- HTTP :80 --> T2
    CFEdge -- HTTP :80 --> T3

    T1 & T2 & T3 -.Ingress routes by<br/>Host header.-> A1
    T1 & T2 & T3 -.-> AD1
    A1 & A2 & A3 -.-> R1

    subgraph External[Managed services]
        Neon[(Neon Postgres<br/>AWS us-east-1)]
        B2[(Backblaze B2<br/>us-east-005)]
        FM[Fastmail SMTP]
        Gitea[Gitea Registry<br/>gitea.treytartt.com]
    end

    A1 & A2 & A3 -- SSL --> Neon
    W1 -- SSL --> Neon
    A1 & A2 & A3 -- HTTPS --> B2
    W1 -- SMTP :587 --> FM
    H1 & H2 & H3 -. image pull .-> Gitea

ASCII fallback

                         ┌─────────────────────┐
                         │     End user        │
                         └──────────┬──────────┘
                                    │ HTTPS :443
                                    ▼
                         ┌─────────────────────┐
                         │  Cloudflare edge    │ TLS terminates here
                         │  (SSL = Flexible)   │
                         └──────────┬──────────┘
                       HTTP :80 round-robin
                  ┌─────────────┼─────────────┐
                  ▼             ▼             ▼
     ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
     │   hetzner1      │ │   hetzner2      │ │   hetzner3      │
     │ 178.104.247.152 │ │ 178.105.32.198  │ │ 178.104.249.189 │
     │ Traefik :80/443 │ │ Traefik :80/443 │ │ Traefik :80/443 │
     │  api   worker   │ │  api   redis    │ │  api   admin    │
     └─────────┬───────┘ └─────────┬───────┘ └─────────┬───────┘
               │                   │                   │
               └──────── Kubernetes overlay ───────────┘
                                   │
     ┌─────────────────────────────┴──────────────────────────────┐
     │                                                             │
     ▼                      ▼                    ▼                 ▼
┌─────────┐          ┌─────────────┐     ┌──────────┐    ┌───────────────┐
│  Neon   │          │ Backblaze B2│     │ Fastmail │    │ Gitea Registry│
│Postgres │          │   uploads   │     │   SMTP   │    │   image pull  │
└─────────┘          └─────────────┘     └──────────┘    └───────────────┘

The stack, one layer at a time

Layer 0 — Hardware

Three Hetzner Cloud CX33 instances (4 vCPU, 8 GB RAM, 80 GB NVMe SSD) in Hetzner's Nuremberg (nbg1) datacenter. Each node is $7.99/mo (April 2026 pricing), totaling ~$24/mo. See Chapter 1.

Layer 1 — Operating system

Ubuntu 24.04.3 LTS. Each node has:

- SSH on port 22, key-only auth, deploy user with NOPASSWD sudo
- ufw firewall with strict default-deny-incoming; specific ports allowed per Chapter 4
- Sysctl override net.ipv4.ip_unprivileged_port_start=0 so non-root containers can bind privileged ports (needed for Traefik to serve :80/:443)
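
The effect of the sysctl override can be illustrated with a small sketch. It models the kernel rule that ports below net.ipv4.ip_unprivileged_port_start need root or CAP_NET_BIND_SERVICE; the function name is hypothetical, and 1024 is the stock Linux default rather than a value taken from these nodes.

```python
# Sketch of the rule behind net.ipv4.ip_unprivileged_port_start.
# needs_privilege is a hypothetical helper, not part of any honeyDue tooling.

DEFAULT_UNPRIVILEGED_PORT_START = 1024  # stock Linux default (assumption for comparison)


def needs_privilege(port: int, unprivileged_port_start: int) -> bool:
    """Return True if binding `port` requires root or CAP_NET_BIND_SERVICE."""
    if not 0 <= port <= 65535:
        raise ValueError(f"invalid port: {port}")
    # Ports at or above the threshold can be bound by any user.
    return port < unprivileged_port_start


for port in (80, 443, 8080):
    by_default = needs_privilege(port, DEFAULT_UNPRIVILEGED_PORT_START)
    with_override = needs_privilege(port, 0)
    print(f"port {port}: privileged by default={by_default}, with override={with_override}")
```

With the threshold at 0, no port is privileged, which is why a non-root Traefik container can bind :80 and :443.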

Layer 2 — Container runtime

containerd v2.2.2 (bundled with K3s). Docker is still installed from the Swarm era but is now disabled. containerd is Kubernetes' reference runtime and has a smaller footprint than Docker's full stack.

Layer 3 — Orchestrator

K3s v1.34.6 in HA mode. All 3 nodes carry the control-plane,etcd roles (a Raft quorum of 3, which tolerates one node failure). K3s is a minimal Kubernetes distribution from Rancher Labs (now SUSE): single-binary, embedded etcd instead of a separate etcd cluster, sane defaults for small installations. See Chapter 2 for why K3s over full Kubernetes or Docker Swarm.
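
The quorum arithmetic behind "can tolerate one node failure" is the standard majority rule for Raft-style consensus. The sketch below is a worked calculation only; the function names are illustrative and nothing here queries etcd.

```python
# Majority-quorum arithmetic for a Raft-style cluster such as embedded etcd.
# Function names are illustrative, not an etcd or K3s API.


def quorum(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority still exists."""
    return members - quorum(members)


for n in (1, 2, 3, 4, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Three members need two for quorum and so survive one failure. A fourth member raises the quorum to three without raising the failure tolerance, which is why odd cluster sizes are the usual choice.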

Layer 4 — Cluster networking

- Flannel VXLAN for the pod-to-pod overlay (the K3s default). VXLAN tunnels pod traffic between nodes over UDP port 8472.
- CoreDNS for service discovery: it resolves the Service names (api, redis) that pods use to reach each other.
- kube-proxy in IPVS mode for ClusterIP → pod routing.

Chapter 3 walks through a single request to show every hop.
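
As a rough illustration of how a short name like redis becomes a resolvable record, the sketch below builds the names a pod's resolver would try. It assumes the pods run in the honeydue namespace and that the cluster uses the default cluster.local domain; both helper functions are hypothetical.

```python
# Sketch of Kubernetes Service DNS naming. The namespace ("honeydue") and the
# cluster domain ("cluster.local", the Kubernetes default) are assumptions.


def service_fqdn(service: str, namespace: str, cluster_domain: str = "cluster.local") -> str:
    """Fully qualified name of a Service: <service>.<namespace>.svc.<domain>."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def search_candidates(name: str, pod_namespace: str, cluster_domain: str = "cluster.local") -> list[str]:
    """Names tried, in order, when a pod looks up a short name via its search list."""
    search_list = [
        f"{pod_namespace}.svc.{cluster_domain}",
        f"svc.{cluster_domain}",
        cluster_domain,
    ]
    return [f"{name}.{suffix}" for suffix in search_list]


print(service_fqdn("api", "honeydue"))
for candidate in search_candidates("redis", "honeydue"):
    print(candidate)
```

The first candidate matches the Service's fully qualified name, so a pod in the same namespace can reach Redis by the bare name redis.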

Layer 5 — Ingress

Traefik v3 as a DaemonSet with hostNetwork: true. Each node has a Traefik pod that binds directly to the node's public :80 and :443. No servicelb, no Hetzner Load Balancer — Cloudflare round-robins the three node IPs in DNS and any node can serve any request. See Chapter 6.

Layer 6 — Edge / CDN

Cloudflare Free plan. Proxied A records for api.myhoneydue.com, admin.myhoneydue.com, and myhoneydue.com each point at all three node IPs. Edge handles TLS termination (SSL=Flexible), DDoS protection, caching for static assets, and traffic failover if a node becomes unreachable. See Chapter 13.
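
The round-robin and failover behaviour can be pictured with a small simulation. This is a simplified model, not Cloudflare's actual origin-selection algorithm: it rotates through the three node IPs listed in this chapter and skips any node marked unhealthy. The function name and the health set are hypothetical.

```python
# Simplified model of round-robin across origins with failover.
# This is not Cloudflare's algorithm; pick_origins is a hypothetical helper.
from itertools import cycle

NODE_IPS = ["178.104.247.152", "178.105.32.198", "178.104.249.189"]


def pick_origins(node_ips: list[str], healthy: set[str], requests: int) -> list[str]:
    """Rotate through node_ips, skipping nodes that are not in `healthy`."""
    if not any(ip in healthy for ip in node_ips):
        raise RuntimeError("no healthy origin available")
    rotation = cycle(node_ips)
    chosen: list[str] = []
    while len(chosen) < requests:
        ip = next(rotation)
        if ip in healthy:
            chosen.append(ip)
    return chosen


# All three nodes healthy: requests spread evenly.
print(pick_origins(NODE_IPS, set(NODE_IPS), 6))

# hetzner2 unreachable: traffic alternates between the remaining two nodes.
print(pick_origins(NODE_IPS, set(NODE_IPS) - {"178.105.32.198"}, 4))
```

Since each node runs Traefik and can serve any hostname, losing one node only reduces capacity; it does not make any service unreachable.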

Layer 7 — Application services

| Service | Type | Replicas | Image |
|---|---|---|---|
| `api` | Go (Echo, GORM) | 3 | `gitea.treytartt.com/admin/honeydue-api:<sha>` |
| `admin` | Next.js 16 | 1 | `gitea.treytartt.com/admin/honeydue-admin:<sha>` |
| `worker` | Go (Asynq) | 1 | `gitea.treytartt.com/admin/honeydue-worker:<sha>` |
| `redis` | redis:7-alpine | 1 | Docker Hub |

See Chapter 7.

Layer 8 — External dependencies

- Neon Postgres (Launch plan): honeyDue database
- Backblaze B2: honeyDueProd bucket for user uploads
- Fastmail SMTP: transactional email
- Gitea (self-hosted at gitea.treytartt.com): container registry
- Cloudflare: DNS, TLS, CDN

See Chapters 8, 9, and 11.

What's deliberately absent

- TLS at origin. Cloudflare terminates TLS at the edge and talks HTTP on port 80 to the nodes. This is "Flexible SSL" in Cloudflare terminology. It's the simplest setup; we have a TODO to upgrade to "Full (strict)" with Cloudflare Origin CA certs (Chapter 13, §Future).
- Hetzner Load Balancer. We save the $8.49/mo by having Cloudflare round-robin across node IPs directly. If any node is unresponsive, Cloudflare's own origin health checks will route around it within 30s.
- Push notifications. APNs (iOS) and FCM (Android) are switched off until we have Apple Developer / Google Play accounts. The env vars are set to sentinel values that let the Go app boot; FEATURE_PUSH_ENABLED=false gates all call sites.
- In-cluster Prometheus / Grafana. Self-hosted Prometheus-compatible metrics + tracing + dashboards live outside the k3s cluster on 88oakappsUpdate (the same Linode VPS that hosts PostHog), reached via https://obs.88oakapps.com (Cloudflare-fronted, bearer-gated). A vmagent sidecar in the honeydue namespace scrapes the api Pods and remote-writes out. This frees ~700 MB of cluster RAM and means observability survives a k3s control-plane incident. See Chapter 15.
- Alerting. No PagerDuty, Slack hooks, or pages-on-error wired up yet. Histograms are flowing into Grafana; alert rules on top of them are the next addition. See Chapter 15, §Future.
- Automated backups of Redis state. Redis is configured with AOF (append-only file) persistence, but the PVC lives on only one node. Redis holds only cache and Asynq queue state; if it is lost, the cache repopulates on the first request and the queue on the next cron tick. Not critical.
- Admin panel basic auth (Traefik middleware). In-app admin login is enabled; the extra Traefik-layer basic auth the scaffold supports is not currently attached.

The deployment pipeline in one paragraph

Changes to application code are built on your workstation by docker buildx build --platform linux/amd64 --push, which cross-compiles from arm64 (Apple Silicon) to amd64 (Hetzner nodes) and pushes directly to gitea.treytartt.com. Manifests live in deploy-k3s/manifests/; they reference image tags by git short SHA. kubectl apply -f rolls the new image in with maxUnavailable: 0, maxSurge: 1, so pods are replaced one at a time and each old pod stays up until its replacement is healthy. Because a Service only sends traffic to pods that pass their readiness probe, the api and admin service names always reach live pods, and traffic shifts the moment a new pod becomes ready. Chapter 14 walks through a complete deploy.
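
To see what maxUnavailable: 0 and maxSurge: 1 imply for the three api replicas, the sketch below steps through a simplified rollout. It is a model of the constraint arithmetic, not the Deployment controller's actual code, and it assumes every new pod passes its readiness probe immediately.

```python
# Simplified model of a rolling update. Not the Kubernetes controller's
# implementation; every new pod is assumed to become ready at once.


def rolling_update(replicas: int, max_surge: int = 1, max_unavailable: int = 0) -> list[tuple[int, int]]:
    """Return (old_ready, new_ready) pod counts after each step of the rollout."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    old, new = replicas, 0
    steps = [(old, new)]
    while new < replicas:
        # Surge: total pods may not exceed replicas + max_surge.
        can_start = min(replicas + max_surge - (old + new), replicas - new)
        new += can_start
        steps.append((old, new))
        # Scale down: ready pods may not drop below replicas - max_unavailable.
        can_stop = min(old, (old + new) - (replicas - max_unavailable))
        old -= can_stop
        steps.append((old, new))
    return steps


for old_ready, new_ready in rolling_update(3):
    print(f"old={old_ready} new={new_ready} total={old_ready + new_ready}")
```

With these settings the total never falls below three ready pods and never exceeds four, so capacity is maintained throughout the deploy at the cost of briefly running one extra pod.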

What we used to have (the short version)

Up until 2026-04-24 this stack ran on Docker Swarm on the same three Hetzner boxes. It worked, but the Docker libnetwork service-discovery layer has a bug in the 29.x line (moby/moby#52265) that leaves stale DNS A-records behind when tasks migrate between nodes. We hit it: the admin panel returned 502s for ~50% of requests through Cloudflare because Caddy (our previous reverse proxy) was dialing a ghost IP that had since been recycled to the Dozzle log viewer. We spent four hours trying increasingly clever workarounds (dnsrr vs VIP, Caddy's "dynamic a" upstreams with DNS refresh, global mode, host-mode ports, host.docker.internal, hardcoded node IPs) before concluding that libnetwork state corruption survives every non-nuclear fix.

The full autopsy is in Chapter 19 — Swarm Postmortem. K3s uses CoreDNS and has no libnetwork history; the bug class doesn't exist there.
