Trey t 6f303dbbaa

Migrate prod deploy from Swarm to K3s; add full deployment book

Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
  callback (was causing 'unlock of unlocked mutex' fatal after
  Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
  + allowlist fonts.googleapis.com so the marketing landing page CSS
  actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
  --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
  images runnable on x86_64 Hetzner nodes; fix array expansion under
  set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
  top-level aliases (the '\${X_SECRET}' form never actually resolved);
  dozzle ports: long-form host_ip is rejected by Swarm, switched to
  short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
  worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
  (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
  12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
  real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
  and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
  hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
  without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations,
    consolidated citations
- README.md: Production Deployment section replaced with pointer to
  the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-24 07:20:54 -05:00


12 — Data Flow

Summary

This chapter follows a user's request end to end, hop by hop. It consolidates Chapters 3, 6, 7, 8, and 9 into a single picture. Use it to answer "when X doesn't work, which layer failed?"

Scenario: User creates a task

A user in Austin opens the mobile app and adds a new task for their property. The client sends POST https://api.myhoneydue.com/api/tasks/ with a JSON body and an auth token. We trace every hop.

Hop 1 — Mobile client → Cloudflare edge

sequenceDiagram
    participant App as iOS client
    participant DNS as Local DNS
    participant CFE as Cloudflare edge (DFW)

    App->>DNS: Resolve api.myhoneydue.com
    DNS->>App: 104.21.13.7 (Cloudflare edge IP)
    App->>CFE: TCP SYN :443
    CFE-->>App: TCP SYN+ACK
    App->>CFE: TLS ClientHello
    CFE->>App: TLS ServerHello + cert
    Note over App,CFE: TLS 1.3 handshake<br/>~1 RTT
    App->>CFE: HTTP/2 stream<br/>POST /api/tasks/<br/>Authorization: Token <xxx>

- The client resolves api.myhoneydue.com via the OS resolver and gets a Cloudflare edge IP (not our origin IP).
- The client establishes TLS 1.3 to CF's nearest POP (Dallas for Austin).
- The cert presented by CF is sni.cloudflaressl.com or a CF-issued *.myhoneydue.com; our origin cert is never seen by the client.
- Latency: ~5–15 ms Austin → DFW.

Hop 2 — Cloudflare edge → Origin (hetzner)

sequenceDiagram
    participant CFE as Cloudflare DFW POP
    participant DNS as CF internal DNS
    participant HN as hetzner node (random of 3)
    participant Traefik as Traefik pod<br/>(host network)

    CFE->>DNS: Which origin for api.myhoneydue.com?
    DNS->>CFE: One of 178.104.247.152, 178.105.32.198, 178.104.249.189
    CFE->>HN: TCP SYN :80
    HN-->>CFE: SYN+ACK
    CFE->>HN: HTTP/1.1 POST /api/tasks/<br/>Host: api.myhoneydue.com<br/>X-Forwarded-For: <user IP><br/>X-Forwarded-Proto: https<br/>CF-Connecting-IP: <user IP>
    Note over HN: UFW: allow 80/tcp from<br/>anywhere (for now)
    HN->>Traefik: delivered to listener

- CF picks one of the 3 node IPs via DNS round-robin. The choice is made per connection, not per request.
- Protocol between CF and origin: plaintext HTTP/1.1 (SSL mode = Flexible). A future upgrade to Full (strict) would make this hop HTTPS.
- Latency: ~90–120 ms DFW → Nuremberg.
- CF adds the headers CF-Connecting-IP, X-Forwarded-For, and X-Forwarded-Proto.

Hop 3 — Traefik → api Service

sequenceDiagram
    participant Traefik as Traefik pod
    participant CoreDNS as CoreDNS (10.43.0.10)
    participant KP as kube-proxy IPVS<br/>(kernel)
    participant APIPod as api pod<br/>(some node)

    Note over Traefik: Match Host: api.myhoneydue.com<br/>→ honeydue-api Ingress<br/>→ backend: api Service :8000
    Traefik->>CoreDNS: Resolve "api"
    CoreDNS->>Traefik: 10.43.167.83 (Service ClusterIP)
    Traefik->>KP: TCP SYN to 10.43.167.83:8000
    KP->>KP: IPVS: pick endpoint<br/>from Service endpoint set
    KP->>APIPod: Rewrite destination<br/>to 10.42.2.6:8000<br/>(Flannel VXLAN if remote node)

- Traefik resolves api via CoreDNS and gets the Service ClusterIP.
- Traefik sends to 10.43.167.83:8000.
- kube-proxy IPVS (running in-kernel on the node where Traefik lives) intercepts the connection, picks a live endpoint, and rewrites the destination.
- The destination might be local (same node) or remote (VXLAN tunnel to another node).
- Latency: <3 ms even cross-node.

Note that Traefik's Kubernetes Ingress provider, by default, load-balances across the Service's pod IPs directly (it learns them from the Kubernetes API). The ClusterIP and kube-proxy path described here applies when Traefik's native load-balancing option (nativeLB) is enabled for the Service.

Hop 4 — api → Postgres (Neon)

sequenceDiagram
    participant API as api pod (Go)
    participant Resolv as Pod resolv.conf
    participant Neon as Neon pooler<br/>AWS us-east-1

    API->>Resolv: Resolve ep-floral-truth-...-pooler.us-east-1.aws.neon.tech
    Note over Resolv: Goes to CoreDNS<br/>which forwards to upstream<br/>(Hetzner's DNS, then public root)
    Resolv->>API: Neon pooler IP (e.g., 34.206.177.121)
    API->>Neon: TCP :5432
    API->>Neon: TLS 1.3 handshake (DB_SSLMODE=require)
    API->>Neon: Postgres startup (user, database)
    API->>Neon: BEGIN<br/>SELECT ... FROM task_task WHERE residence_id = ?<br/>INSERT INTO task_task (...) VALUES (...)<br/>COMMIT
    Neon-->>API: Query results

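- Go's database/sql pool may already hold an idle connection. If so, the TCP connect, TLS handshake, and Postgres startup are skipped.
- New connection: ~50 ms for the TLS handshake plus Postgres startup.
- Query itself: typically ~5–20 ms (single-row read/write on indexed columns).
- Total for this hop: roughly the query time (~5–20 ms) on a warm connection, ~80 ms cold once TCP connect, the TLS handshake, and Postgres startup are added.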

Hop 5 — api → Redis (cache invalidation)

sequenceDiagram
    participant API as api pod
    participant CoreDNS
    participant KP as kube-proxy
    participant Redis as redis pod

    API->>CoreDNS: Resolve "redis"
    CoreDNS->>API: 10.43.7.10
    API->>KP: TCP :6379
    KP->>Redis: rewritten to 10.42.x.y:6379
    API->>Redis: DEL tasks:user:<user_id>  (invalidate cached list)
    Redis-->>API: number of keys deleted

- The Redis connection is usually kept alive in the api's pool.
- Latency: <1 ms (Redis runs on hetzner2, usually a short hop).

Hop 6 — api → worker (enqueue side effect)

For some task creation events, api enqueues a background job (send-notification, update-lookup-table, etc.):

sequenceDiagram
    participant API as api pod
    participant Redis as redis pod (acting as Asynq queue)
    participant Worker as worker pod

    API->>Redis: RPUSH asynq:queue:default <job JSON>
    Redis-->>API: OK
    Note over API,Worker: (Async, no response blocking)
    Worker->>Redis: BLPOP asynq:queue:default
    Redis-->>Worker: <job JSON>
    Worker->>Worker: Process job<br/>(send email, push, etc.)

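api returns to the caller without waiting for the job. The diagram is schematic: Asynq manages its own Redis key layout and dequeue commands, so RPUSH and BLPOP stand for "enqueue" and "dequeue" rather than the literal wire traffic.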
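
The enqueue-and-return pattern can be sketched without Redis. In the example below a buffered channel stands in for the Redis-backed queue, and the `job` fields are hypothetical because the source does not show the real payload. It is not the Asynq API, only the shape of the hand-off.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// job is a hypothetical payload; the real job schema is not shown in the source.
type job struct {
	Kind   string
	TaskID int
}

// queue is an in-memory stand-in for the Redis-backed queue.
type queue struct{ ch chan job }

func newQueue(size int) *queue { return &queue{ch: make(chan job, size)} }

// enqueue hands the job over and returns at once; when the buffer is full
// it reports an error instead of blocking the request handler.
func (q *queue) enqueue(j job) error {
	select {
	case q.ch <- j:
		return nil
	default:
		return errors.New("queue full")
	}
}

// work drains the queue until it is closed, calling handle for each job.
func (q *queue) work(handle func(job)) {
	for j := range q.ch {
		handle(j)
	}
}

func main() {
	q := newQueue(16)
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		q.work(func(j job) { fmt.Printf("worker: %s for task %d\n", j.Kind, j.TaskID) })
	}()

	// Request handler side: enqueue, then respond without waiting.
	if err := q.enqueue(job{Kind: "send-notification", TaskID: 42}); err != nil {
		fmt.Println("enqueue failed:", err)
	}
	fmt.Println("api: response returned to caller")

	close(q.ch) // shut down the demo worker
	wg.Wait()
}
```

The two output lines can appear in either order, which is the point: the handler's response does not depend on when the worker runs.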

Hop 7 — Response back to user

Reverse the path:

1. api returns the JSON response to Traefik.
2. Traefik returns it to Cloudflare.
3. Cloudflare sends it to the user over the existing TLS connection.
4. The user receives the response.

End-to-end latency budget

For a typical "create task" operation:

| Hop | Latency |
|---|---|
| User → CF (Austin → DFW) | 5–15 ms |
| CF → hetzner (cross-Atlantic) | 90–120 ms |
| UFW + kernel + Traefik accept | <1 ms |
| Traefik → api (same or cross-node) | 1–3 ms |
| api request parsing, auth validation | 1–3 ms |
| api → Postgres (query) | 20–60 ms |
| api → Redis (invalidate) | <1 ms |
| api response generation | 1–5 ms |
| Return path | same as forward, reversed |

Total: ~220–310 ms typical, dominated by the cross-Atlantic CF → origin hop and the Postgres query round trip.

Read path (GET /api/tasks/)

Similar but simpler:

sequenceDiagram
    participant App as iOS client
    participant CF as Cloudflare
    participant Traefik
    participant API as api pod
    participant Redis
    participant Neon

    App->>CF: GET /api/tasks/
    CF->>Traefik: (no cache hit)
    Traefik->>API: Route via Service
    API->>Redis: GET tasks:user:<user_id>
    alt Cache hit
        Redis-->>API: cached JSON
    else Cache miss
        API->>Neon: SELECT ...
        Neon-->>API: rows
        API->>Redis: SET tasks:user:<user_id> <json> EX 300
    end
    API-->>Traefik: 200 JSON
    Traefik-->>CF: 200
    CF-->>App: 200 (may cache per response headers)
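
The read path above is the cache-aside pattern, and the write path's DEL is its invalidation half. The sketch below shows both with an in-memory map standing in for Redis. The names `ttlCache`, `listTasks`, and `tasksKey` are hypothetical; only the key format and the 300-second TTL come from the diagrams.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// entry is a cached value with its expiry time.
type entry struct {
	value   string
	expires time.Time
}

// ttlCache is an in-memory stand-in for Redis with per-key expiry.
type ttlCache struct {
	mu    sync.Mutex
	items map[string]entry
}

func newTTLCache() *ttlCache { return &ttlCache{items: map[string]entry{}} }

// get returns the value if present and not expired.
func (c *ttlCache) get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.items[key]
	if !ok || time.Now().After(e.expires) {
		delete(c.items, key)
		return "", false
	}
	return e.value, true
}

// set stores the value with a time to live (like SET ... EX).
func (c *ttlCache) set(key, value string, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key] = entry{value: value, expires: time.Now().Add(ttl)}
}

// del removes the key (like DEL).
func (c *ttlCache) del(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.items, key)
}

func tasksKey(userID int) string { return fmt.Sprintf("tasks:user:%d", userID) }

// listTasks tries the cache first and calls load (the database) on a miss,
// then stores the result for 300 seconds.
func listTasks(c *ttlCache, userID int, load func(int) (string, error)) (string, error) {
	key := tasksKey(userID)
	if v, ok := c.get(key); ok {
		return v, nil
	}
	v, err := load(userID)
	if err != nil {
		return "", err
	}
	c.set(key, v, 300*time.Second)
	return v, nil
}

func main() {
	c := newTTLCache()
	dbReads := 0
	load := func(int) (string, error) { dbReads++; return `[{"id":1}]`, nil }

	listTasks(c, 7, load) // miss: reads the database
	listTasks(c, 7, load) // hit: served from cache
	c.del(tasksKey(7))    // write path invalidates the list
	listTasks(c, 7, load) // miss again after invalidation
	fmt.Println("database reads:", dbReads) // prints "database reads: 2"
}
```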

Admin panel data flow

The admin panel follows a different path because it is a Next.js app that renders pages on the server:

sequenceDiagram
    participant Browser
    participant CF
    participant Traefik
    participant Admin as admin pod (Next.js)
    participant AdminAPI as api pod<br/>(via public URL)
    participant Neon

    Browser->>CF: GET admin.myhoneydue.com/users
    CF->>Traefik: HTTP :80
    Traefik->>Admin: Service /users
    Note over Admin: Next.js SSR:<br/>fetch from NEXT_PUBLIC_API_URL
    Admin->>CF: GET api.myhoneydue.com/api/admin/users/
    CF->>Traefik: (api ingress)
    Traefik->>AdminAPI: Service
    AdminAPI->>Neon: SELECT ... FROM auth_user
    Neon-->>AdminAPI: rows
    AdminAPI-->>Admin: JSON
    Admin->>Admin: Render HTML
    Admin-->>Traefik: HTML
    Traefik-->>CF: HTML
    CF-->>Browser: HTML

Notably, the admin pod's calls to api go back out to Cloudflare and in through the public URL rather than the in-cluster Service IP. This is because NEXT_PUBLIC_API_URL=https://api.myhoneydue.com, and the Next.js build uses the same URL for browser-side and server-side fetches.

This is suboptimal: server-side (SSR) calls could use the internal api.honeydue.svc:8000 URL and skip the CF round trip. A future optimization is to separate NEXT_PUBLIC_API_URL (browser) from API_URL (server-side).

Static asset flow

For the marketing landing page at https://myhoneydue.com/:

- CF caches HTML per Cache-Control (the Go app sets short TTLs).
- CF caches CSS / JS / images aggressively (via default CF rules).
- The first request hits origin; subsequent requests are served from the CF edge.

The static assets live inside the api container at /app/static/ and are served by Echo's static file handler at the routes /css, /js, and /images.

Request flow during a rolling update

When a new api image is deployed, some requests will hit old pods and some will hit new pods for a few minutes:

sequenceDiagram
    participant CF
    participant Traefik
    participant OldPod as api pod v1
    participant NewPod as api pod v2 (starting)

    Note over NewPod: kubelet starts new pod
    Note over NewPod: pod connects to Postgres<br/>MigrateWithLock runs (no-op)<br/>HTTP server starts<br/>readinessProbe passes
    Note over NewPod: Service endpoints updated<br/>kube-proxy adds NewPod to Service pool
    CF->>Traefik: request 1
    Traefik->>OldPod: routed (old pod still in pool)
    CF->>Traefik: request 2
    Traefik->>NewPod: routed (new pod now in pool)
    Note over OldPod: Kubelet terminates old pod<br/>(graceful SIGTERM, then SIGKILL after grace)
    CF->>Traefik: request 3
    Traefik->>NewPod: routed (OldPod gone from pool)

Both old and new handle traffic simultaneously until the rolling update completes. As long as the new code is API-compatible, users don't notice.

Failure modes in the data path

See Chapter 16 — Failure Modes for a full catalog.

Quick summary:

| Layer fails | User sees | Recovery |
|---|---|---|
| Cloudflare DNS down | Can't resolve api.myhoneydue.com | Manual DNS fallback; extremely rare |
| Cloudflare edge down (single POP) | Slow, CF routes to another POP | Automatic |
| Node NIC fails | Some requests time out (CF routes away) | Cluster reschedules pods |
| UFW misconfig blocks :80 | 521 errors at CF | Re-add rule |
| Traefik pod down on one node | CF routes to other nodes | Automatic |
| kube-proxy broken on one node | Pods on that node can't reach Services | Restart kubelet |
| CoreDNS down | New connections fail DNS | Restart CoreDNS |
| Flannel broken between nodes | Cross-node pod communication fails | Restart flannel or node |
| api pod OOM | 502 to user briefly | kubelet restarts pod |
| Postgres down | 500 errors from api | Neon-side issue; outage |
| Redis down | api serves without cache (degraded) | Restart Redis pod |
| B2 down | Uploads fail, existing content served if cached | Backblaze-side outage |
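
For triage, the rows that name an HTTP status can be turned into a first-guess lookup. The helper below is a hypothetical sketch covering only those three rows; the other failures in the table have no distinctive status code, and the same codes can have causes not listed here.

```go
package main

import "fmt"

// suspectLayer maps an HTTP status seen by the client to the layer the
// summary table points at. Only rows that name a status code are covered.
func suspectLayer(status int) string {
	switch status {
	case 521:
		return "origin unreachable from Cloudflare: check the UFW rule for :80"
	case 502:
		return "api pod restarting (for example OOM): kubelet should restart it"
	case 500:
		return "api reached but failing: check Postgres (Neon)"
	default:
		return "not covered by the summary table: see Chapter 16"
	}
}

func main() {
	for _, code := range []int{521, 502, 500, 404} {
		fmt.Printf("%d -> %s\n", code, suspectLayer(code))
	}
}
```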

References

Chapter 3 — Networking for the overlay mechanics
Chapter 6 — Traefik for routing details
Chapter 7 — Services for per-service specifics
Chapter 16 — Failure Modes for what-if scenarios
