12 — Data Flow

Summary

This chapter follows one user request end to end, hop by hop. It consolidates Chapters 3, 6, 7, 8, and 9 into a single picture. Use it to answer "when X doesn't work, which layer failed?"

Scenario: User creates a task

A user in Austin opens the mobile app and adds a new task for their property. The client sends POST https://api.myhoneydue.com/api/tasks/ with a JSON body and an auth token. We trace every hop.

Hop 1 — Mobile client → Cloudflare edge

sequenceDiagram
    participant App as iOS client
    participant DNS as Local DNS
    participant CFE as Cloudflare edge (DFW)

    App->>DNS: Resolve api.myhoneydue.com
    DNS->>App: 104.21.13.7 (Cloudflare edge IP)
    App->>CFE: TCP SYN :443
    CFE-->>App: TCP SYN+ACK
    App->>CFE: TLS ClientHello
    CFE->>App: TLS ServerHello + cert
    Note over App,CFE: TLS 1.3 handshake<br/>~1 RTT
    App->>CFE: HTTP/2 stream<br/>POST /api/tasks/<br/>Authorization: Token <xxx>
  • Client resolves api.myhoneydue.com via OS resolver, gets Cloudflare edge IP (not our origin IP)
  • Client establishes TLS 1.3 to CF's nearest POP (Dallas for Austin)
  • Cert presented by CF is sni.cloudflaressl.com or a CF-issued *.myhoneydue.com — our origin cert is never seen by the client
  • Latency: ~5–15 ms Austin → DFW

Hop 2 — Cloudflare edge → Origin (hetzner)

sequenceDiagram
    participant CFE as Cloudflare DFW POP
    participant DNS as CF internal DNS
    participant HN as hetzner node (random of 3)
    participant Traefik as Traefik pod<br/>(host network)

    CFE->>DNS: Which origin for api.myhoneydue.com?
    DNS->>CFE: One of 178.104.247.152, 178.105.32.198, 178.104.249.189
    CFE->>HN: TCP SYN :80
    HN-->>CFE: SYN+ACK
    CFE->>HN: HTTP/1.1 POST /api/tasks/<br/>Host: api.myhoneydue.com<br/>X-Forwarded-For: <user IP><br/>X-Forwarded-Proto: https<br/>CF-Connecting-IP: <user IP>
    Note over HN: UFW: allow 80/tcp from<br/>anywhere (for now)
    HN->>Traefik: delivered to listener
  • CF picks one of the 3 node IPs via DNS round-robin. This is per-connection, not per-request.
  • Protocol between CF and origin: HTTP/1.1 plaintext (SSL=Flexible). A future upgrade to Full (strict) would make this hop HTTPS.
  • Latency: ~90–120 ms DFW → Nuremberg
  • CF adds headers: CF-Connecting-IP, X-Forwarded-For, X-Forwarded-Proto
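
The headers listed above are how the api learns the real client address, since the TCP peer is always a Cloudflare edge. Below is a minimal sketch of that lookup, not the project's actual middleware: the function names and the sample addresses are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"strings"
)

// pickClientIP prefers CF-Connecting-IP, then the left-most
// X-Forwarded-For entry, then the TCP peer address.
func pickClientIP(cfConnectingIP, forwardedFor, remoteAddr string) string {
	if ip := strings.TrimSpace(cfConnectingIP); net.ParseIP(ip) != nil {
		return ip
	}
	if forwardedFor != "" {
		first := strings.TrimSpace(strings.Split(forwardedFor, ",")[0])
		if net.ParseIP(first) != nil {
			return first
		}
	}
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return remoteAddr
	}
	return host
}

// clientIP is the same lookup applied to an incoming request.
func clientIP(r *http.Request) string {
	return pickClientIP(
		r.Header.Get("CF-Connecting-IP"),
		r.Header.Get("X-Forwarded-For"),
		r.RemoteAddr,
	)
}

func main() {
	r, _ := http.NewRequest(http.MethodPost, "http://api.myhoneydue.com/api/tasks/", nil)
	r.RemoteAddr = "172.70.0.1:51234"               // hypothetical Cloudflare edge address
	r.Header.Set("CF-Connecting-IP", "203.0.113.9") // hypothetical end-user address
	fmt.Println(clientIP(r))                        // prints 203.0.113.9
}
```

While port 80 is open to anywhere, a client that reaches a node IP directly can forge these headers, so a value obtained this way is best treated as advisory until the firewall only admits Cloudflare ranges.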

Hop 3 — Traefik → api Service

sequenceDiagram
    participant Traefik as Traefik pod
    participant CoreDNS as CoreDNS (10.43.0.10)
    participant KP as kube-proxy IPVS<br/>(kernel)
    participant APIPod as api pod<br/>(some node)

    Note over Traefik: Match Host: api.myhoneydue.com<br/>→ honeydue-api Ingress<br/>→ backend: api Service :8000
    Traefik->>CoreDNS: Resolve "api"
    CoreDNS->>Traefik: 10.43.167.83 (Service ClusterIP)
    Traefik->>KP: TCP SYN to 10.43.167.83:8000
    KP->>KP: IPVS: pick endpoint<br/>from Service endpoint set
    KP->>APIPod: Rewrite destination<br/>to 10.42.2.6:8000<br/>(Flannel VXLAN if remote node)
  • Traefik resolves api via CoreDNS → gets the Service ClusterIP
  • Traefik sends to 10.43.167.83:8000
  • kube-proxy IPVS (running in-kernel on the node where Traefik lives) intercepts, picks a live endpoint, rewrites
  • Destination might be local (same node) or remote (VXLAN tunnel to another node)
  • Latency: <3 ms even cross-node

Hop 4 — api → Postgres (Neon)

sequenceDiagram
    participant API as api pod (Go)
    participant Resolv as Pod resolv.conf
    participant Neon as Neon pooler<br/>AWS us-east-1

    API->>Resolv: Resolve ep-floral-truth-...-pooler.us-east-1.aws.neon.tech
    Note over Resolv: Goes to CoreDNS<br/>which forwards to upstream<br/>(Hetzner's DNS, then public root)
    Resolv->>API: Neon pooler IP (e.g., 34.206.177.121)
    API->>Neon: TCP :5432
    API->>Neon: TLS 1.3 handshake (DB_SSLMODE=require)
    API->>Neon: Postgres startup (user, database)
    API->>Neon: BEGIN<br/>SELECT ... FROM task_task WHERE residence_id = ?<br/>INSERT INTO task_task (...) VALUES (...)<br/>COMMIT
    Neon-->>API: Query results
  • Go's database/sql pool may already have an idle connection. If so, skip handshake.
  • If new connection: ~50 ms TLS handshake + Postgres startup
  • Query itself: typically ~5–20 ms (single-row read/write on indexed columns)
  • Total for this hop: often <10 ms on a warm connection, ~80 ms cold
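
The warm-versus-cold difference above comes down to how the connection string and pool are configured. The sketch below shows one plausible shape, assuming a URL-style DSN; the host, credentials, database name, and pool sizes are placeholders, not the values used in production.

```go
package main

import (
	"database/sql"
	"fmt"
	"net/url"
	"time"
)

// postgresDSN builds a URL-style connection string. sslmode mirrors the
// DB_SSLMODE=require setting shown in the diagram.
func postgresDSN(user, password, host, database, sslmode string) string {
	u := url.URL{
		Scheme:   "postgres",
		User:     url.UserPassword(user, password),
		Host:     host + ":5432",
		Path:     "/" + database,
		RawQuery: url.Values{"sslmode": {sslmode}}.Encode(),
	}
	return u.String()
}

// tunePool keeps a few idle connections around so that most requests
// skip the TLS and Postgres startup handshake. The numbers are assumptions.
func tunePool(db *sql.DB) {
	db.SetMaxOpenConns(10)
	db.SetMaxIdleConns(5)
	db.SetConnMaxIdleTime(5 * time.Minute)
}

func main() {
	// All arguments are hypothetical placeholders.
	fmt.Println(postgresDSN("app", "s3cret", "pooler.example.invalid", "honeydue", "require"))
}
```

`tunePool` would be called once on the handle returned by `sql.Open`, which needs a registered Postgres driver and is therefore left out of this sketch.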

Hop 5 — api → Redis (cache miss invalidation)

sequenceDiagram
    participant API as api pod
    participant CoreDNS
    participant KP as kube-proxy
    participant Redis as redis pod

    API->>CoreDNS: Resolve "redis"
    CoreDNS->>API: 10.43.7.10
    API->>KP: TCP :6379
    KP->>Redis: rewritten to 10.42.x.y:6379
    API->>Redis: DEL tasks:user:<user_id>  (invalidate cached list)
    Redis-->>API: (integer) number of keys removed
  • Redis connection is usually kept alive in the api's pool
  • Latency: <1 ms (Redis is on hetzner2, usually a short hop)

Hop 6 — api → worker (enqueue side effect)

For some task creation events, api enqueues a background job (send-notification, update-lookup-table, etc.):

sequenceDiagram
    participant API as api pod
    participant Redis as redis pod (acting as Asynq queue)
    participant Worker as worker pod

    API->>Redis: RPUSH asynq:queue:default <job JSON>
    Redis-->>API: OK
    Note over API,Worker: (Async, no response blocking)
    Worker->>Redis: BLPOP asynq:queue:default
    Redis-->>Worker: <job JSON>
    Worker->>Worker: Process job<br/>(send email, push, etc.)

api returns to the caller without waiting for the job.
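
The fire-and-forget shape can be sketched without Redis. The example below uses a buffered channel as an in-memory stand-in for the Asynq queue, so every type and function name in it is hypothetical; it only illustrates that the handler's response does not depend on the worker finishing.

```go
package main

import (
	"fmt"
	"sync"
)

// job is a hypothetical payload; the real system stores jobs in Redis.
type job struct {
	Kind   string
	TaskID int
}

// queue is an in-memory stand-in for the Redis-backed queue.
type queue struct {
	jobs chan job
}

func newQueue(size int) *queue { return &queue{jobs: make(chan job, size)} }

// enqueue returns immediately and reports false if the buffer is full.
func (q *queue) enqueue(j job) bool {
	select {
	case q.jobs <- j:
		return true
	default:
		return false
	}
}

// runWorker processes jobs until the queue is closed.
func runWorker(q *queue, wg *sync.WaitGroup, handled *[]string) {
	defer wg.Done()
	for j := range q.jobs {
		*handled = append(*handled, fmt.Sprintf("%s:%d", j.Kind, j.TaskID))
	}
}

// createTask mimics the handler: enqueue the side effect, then respond.
func createTask(q *queue, taskID int) string {
	if !q.enqueue(job{Kind: "send-notification", TaskID: taskID}) {
		return "created (notification dropped: queue full)"
	}
	return "created"
}

// runDemo wires one worker to the queue and returns the handler's
// response plus the jobs the worker eventually processed.
func runDemo() (string, []string) {
	q := newQueue(8)
	var wg sync.WaitGroup
	var handled []string
	wg.Add(1)
	go runWorker(q, &wg, &handled)

	resp := createTask(q, 42) // returns without waiting for the worker
	close(q.jobs)
	wg.Wait() // only the demo waits; the handler never does
	return resp, handled
}

func main() {
	resp, handled := runDemo()
	fmt.Println(resp)
	fmt.Println(handled)
}
```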

Hop 7 — Response back to user

Reverse the path:

  1. api returns JSON response to Traefik
  2. Traefik returns to Cloudflare
  3. Cloudflare encrypts the response over the client's existing TLS session
  4. User receives response

End-to-end latency budget

For a typical "create task" operation:

| Hop | Latency |
|---|---|
| User → CF (Austin → DFW) | 5–15 ms |
| CF → hetzner (cross-Atlantic) | 90–120 ms |
| UFW + kernel + Traefik accept | <1 ms |
| Traefik → api (same or cross-node) | 1–3 ms |
| api request parsing, auth validation | 1–3 ms |
| api → Postgres (query) | 20–60 ms |
| api → Redis (invalidate) | <1 ms |
| api response generation | 1–5 ms |
| Return path | same as forward, reversed |

Total: ~220–310 ms typical. Dominated by the cross-Atlantic CF→origin hop and the Postgres query round trip.
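
As a sanity check on that total, the per-hop ranges can be summed directly. The sketch below assumes the ranges are 5–15, 90–120, 1–3, 1–3, 20–60, and 1–5 ms, that "<1 ms" means 0–1 ms, and that the return path repeats only the two wide-area hops; all three are reading assumptions rather than measurements.

```go
package main

import "fmt"

// hop is one row of the latency budget, in milliseconds.
type hop struct {
	name     string
	min, max int
}

// budget lists the forward hops plus the two wide-area hops on the way back.
func budget() []hop {
	return []hop{
		{"user -> CF", 5, 15},
		{"CF -> hetzner", 90, 120},
		{"UFW + kernel + Traefik accept", 0, 1},
		{"Traefik -> api", 1, 3},
		{"api parsing and auth", 1, 3},
		{"api -> Postgres", 20, 60},
		{"api -> Redis", 0, 1},
		{"api response generation", 1, 5},
		{"hetzner -> CF (return)", 90, 120},
		{"CF -> user (return)", 5, 15},
	}
}

// total sums the lower and upper bounds across all hops.
func total(hops []hop) (min, max int) {
	for _, h := range hops {
		min += h.min
		max += h.max
	}
	return min, max
}

func main() {
	lo, hi := total(budget())
	fmt.Printf("%d-%d ms\n", lo, hi) // prints 213-343 ms
}
```

Under those assumptions the sum brackets the typical figure quoted above, and the two cross-Atlantic rows account for 180–240 ms of it.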

Read path (GET /api/tasks/)

Similar but simpler:

sequenceDiagram
    participant App as iOS client
    participant CF as Cloudflare
    participant Traefik
    participant API as api pod
    participant Redis
    participant Neon

    App->>CF: GET /api/tasks/
    CF->>Traefik: (no cache hit)
    Traefik->>API: Route via Service
    API->>Redis: GET tasks:user:<user_id>
    alt Cache hit
        Redis-->>API: cached JSON
    else Cache miss
        API->>Neon: SELECT ...
        Neon-->>API: rows
        API->>Redis: SET tasks:user:<user_id> <json> EX 300
    end
    API-->>Traefik: 200 JSON
    Traefik-->>CF: 200
    CF-->>App: 200 (may cache per response headers)
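
The alt branch in this diagram is the cache-aside pattern. Here is a minimal sketch of it, assuming a small `cache` interface with an in-memory implementation standing in for Redis; the interface, the `memCache` type, and the loader callback are hypothetical, and only the key format and the 300-second TTL come from the diagram.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errMiss = errors.New("cache miss")

// cache is the minimal surface needed here; a Redis client would sit behind it.
type cache interface {
	Get(key string) (string, error)
	Set(key, value string, ttl time.Duration)
	Del(key string)
}

// memCache is an in-memory stand-in that ignores the TTL.
type memCache struct{ data map[string]string }

func (m *memCache) Get(key string) (string, error) {
	v, ok := m.data[key]
	if !ok {
		return "", errMiss
	}
	return v, nil
}
func (m *memCache) Set(key, value string, _ time.Duration) { m.data[key] = value }
func (m *memCache) Del(key string)                         { delete(m.data, key) }

// tasksKey matches the key shape used in the diagrams.
func tasksKey(userID int) string { return fmt.Sprintf("tasks:user:%d", userID) }

// listTasks returns the JSON body and whether it came from the cache.
// Any cache error is treated as a miss, so reads still work if the cache is down.
func listTasks(c cache, userID int, load func(userID int) (string, error)) (string, bool, error) {
	key := tasksKey(userID)
	if v, err := c.Get(key); err == nil {
		return v, true, nil
	}
	body, err := load(userID)
	if err != nil {
		return "", false, err
	}
	c.Set(key, body, 300*time.Second)
	return body, false, nil
}

func main() {
	c := &memCache{data: map[string]string{}}
	load := func(userID int) (string, error) { return `[{"id":1}]`, nil } // stands in for the SELECT
	for i := 0; i < 2; i++ {
		_, hit, _ := listTasks(c, 7, load)
		fmt.Println("cache hit:", hit) // false on the first pass, true on the second
	}
}
```

Calling `c.Del(tasksKey(7))` after a write corresponds to the invalidation in Hop 5: the next read misses and repopulates the key.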

Admin panel data flow

A different dance because the admin is Next.js:

sequenceDiagram
    participant Browser
    participant CF
    participant Traefik
    participant Admin as admin pod (Next.js)
    participant AdminAPI as api pod<br/>(via public URL)
    participant Neon

    Browser->>CF: GET admin.myhoneydue.com/users
    CF->>Traefik: HTTP :80
    Traefik->>Admin: Service /users
    Note over Admin: Next.js SSR:<br/>fetch from NEXT_PUBLIC_API_URL
    Admin->>CF: GET api.myhoneydue.com/api/admin/users/
    CF->>Traefik: (api ingress)
    Traefik->>AdminAPI: Service
    AdminAPI->>Neon: SELECT ... FROM auth_user
    Neon-->>AdminAPI: rows
    AdminAPI-->>Admin: JSON
    Admin->>Admin: Render HTML
    Admin-->>Traefik: HTML
    Traefik-->>CF: HTML
    CF-->>Browser: HTML

Notably, the admin pod's calls to api go back out through Cloudflare and in via the public URL rather than the in-cluster Service IP. This is because NEXT_PUBLIC_API_URL=https://api.myhoneydue.com, and the Next.js build uses that same URL for both browser-side and server-side fetches.

This is suboptimal: server-side (SSR) calls could use the internal api.honeydue.svc:8000 address and skip the CF round trip. Future optimization: separate NEXT_PUBLIC_API_URL (browser) from API_URL (server-side).

Static asset flow

For the marketing landing page at https://myhoneydue.com/:

  1. CF caches HTML per Cache-Control (the Go app sets short TTLs)
  2. CF caches CSS / JS / images aggressively (via default CF rules)
  3. First request hits origin, subsequent requests served from CF edge

The static assets live inside the api container at /app/static/ and are served by Echo's static file handler at the routes /css, /js, and /images.

Request flow during a rolling update

When a new api image is deployed, some requests will hit old pods and some will hit new pods for a few minutes:

sequenceDiagram
    participant CF
    participant Traefik
    participant OldPod as api pod v1
    participant NewPod as api pod v2 (starting)

    Note over NewPod: kubelet starts new pod
    Note over NewPod: pod connects to Postgres<br/>MigrateWithLock runs (no-op)<br/>HTTP server starts<br/>readinessProbe passes
    Note over NewPod: kube-proxy updates endpoints<br/>NewPod added to Service pool
    CF->>Traefik: request 1
    Traefik->>OldPod: routed (old pod still in pool)
    CF->>Traefik: request 2
    Traefik->>NewPod: routed (new pod now in pool)
    Note over OldPod: Kubelet terminates old pod<br/>(graceful SIGTERM, then SIGKILL after grace)
    CF->>Traefik: request 3
    Traefik->>NewPod: routed (OldPod gone from pool)

Both old and new handle traffic simultaneously until the rolling update completes. As long as the new code is API-compatible, users don't notice.
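
Whether a pod is in the Service pool hinges on its readiness endpoint. The following is a sketch of a readiness gate, not the api's actual probe handler: the `readiness` type and the `/readyz` path are assumptions. It reports 503 until startup work is done, and can be flipped back to 503 on SIGTERM so the pod leaves the pool before the process exits.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// readiness tracks whether this pod should receive traffic.
type readiness struct{ ready atomic.Bool }

// ServeHTTP answers the probe: 200 when ready, 503 otherwise.
func (r *readiness) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	if r.ready.Load() {
		w.WriteHeader(http.StatusOK)
		return
	}
	w.WriteHeader(http.StatusServiceUnavailable)
}

// probe calls the handler the way a readinessProbe would and returns the status.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/readyz", nil))
	return rec.Code
}

func main() {
	var r readiness
	fmt.Println(probe(&r)) // 503: migrations and server startup still pending

	r.ready.Store(true)
	fmt.Println(probe(&r)) // 200: pod can join the Service pool

	r.ready.Store(false)   // what a SIGTERM handler would do before draining
	fmt.Println(probe(&r)) // 503: pod drops out of the pool
}
```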

Failure modes in the data path

See Chapter 16 — Failure Modes for a full catalog.

Quick summary:

| Layer fails | User sees | Recovery |
|---|---|---|
| Cloudflare DNS down | Can't resolve api.myhoneydue.com | Manual DNS fallback; extremely rare |
| Cloudflare edge down (single POP) | Slow, CF routes to another POP | Automatic |
| Node NIC fails | Some requests time out (CF routes away) | Cluster reschedules pods |
| UFW misconfig blocks :80 | 521 errors at CF | Re-add rule |
| Traefik pod down on one node | CF routes to other nodes | Automatic |
| kube-proxy broken on one node | Pods on that node can't reach Services | Restart kubelet |
| CoreDNS down | New connections fail DNS | Restart CoreDNS |
| Flannel broken between nodes | Cross-node pod communication fails | Restart flannel or node |
| api pod OOM | 502 to user briefly | kubelet restarts pod |
| Postgres down | 500 errors from api | Neon-side issue; outage |
| Redis down | api serves without cache (degraded) | Restart Redis pod |
| B2 down | Uploads fail, existing content served if cached | Backblaze-side outage |
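
Three of these rows surface as distinct HTTP status codes, which makes a first-pass triage mechanical. The helper below is a hypothetical sketch that encodes only those three rows; the function name and the short layer labels are made up for illustration, and a real incident can have other causes for the same status.

```go
package main

import "fmt"

// firstSuspect maps a status code seen by the user to the layer the
// summary table suggests checking first.
func firstSuspect(status int) string {
	switch status {
	case 521:
		return "ufw" // Cloudflare could not reach the origin on :80
	case 502:
		return "api-pod" // for example an OOM-killed pod being restarted
	case 500:
		return "postgres" // api is up but its database calls are failing
	default:
		return "unknown" // not covered by the summary table
	}
}

func main() {
	for _, code := range []int{521, 502, 500, 404} {
		fmt.Println(code, firstSuspect(code))
	}
}
```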

References