honeyDueAPI/docs/deployment/12-data-flow.md
Trey t 6f303dbbaa
Migrate prod deploy from Swarm to K3s; add full deployment book
Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
  callback (was causing 'unlock of unlocked mutex' fatal after
  Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
  + allowlist fonts.googleapis.com so the marketing landing page CSS
  actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
  --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
  images runnable on x86_64 Hetzner nodes; fix array expansion under
  set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
  top-level aliases (the '\${X_SECRET}' form never actually resolved);
  dozzle ports: long-form host_ip is rejected by Swarm, switched to
  short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
  worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
  (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
  12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
  real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
  and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
  hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
  without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations,
    consolidated citations
- README.md: Production Deployment section replaced with pointer to
  the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 07:20:54 -05:00

# 12 — Data Flow
## Summary
This chapter follows a user's request end to end, hop by hop. It's the
consolidated picture of Chapters 3, 6, 7, 8, 9 working together. Use
this chapter to answer "when X doesn't work, which layer failed?"
## Scenario: User creates a task
A user in Austin opens the mobile app and adds a new task for their
property. The client sends `POST https://api.myhoneydue.com/api/tasks/`
with a JSON body and an auth token. We trace every hop.
## Hop 1 — Mobile client → Cloudflare edge
```mermaid
sequenceDiagram
participant App as iOS client
participant DNS as Local DNS
participant CFE as Cloudflare edge (DFW)
App->>DNS: Resolve api.myhoneydue.com
DNS->>App: 104.21.13.7 (Cloudflare edge IP)
App->>CFE: TCP SYN :443
CFE-->>App: TCP SYN+ACK
App->>CFE: TLS ClientHello
CFE->>App: TLS ServerHello + cert
Note over App,CFE: TLS 1.3 handshake<br/>~1 RTT
App->>CFE: HTTP/2 stream<br/>POST /api/tasks/<br/>Authorization: Token <xxx>
```
- Client resolves `api.myhoneydue.com` via OS resolver, gets Cloudflare
edge IP (not our origin IP)
- Client establishes TLS 1.3 to CF's nearest POP (Dallas for Austin)
- Cert presented by CF is `sni.cloudflaressl.com` or a CF-issued
`*.myhoneydue.com` — our origin cert is never seen by the client
- Latency: ~5–15 ms Austin → DFW
## Hop 2 — Cloudflare edge → Origin (hetzner)
```mermaid
sequenceDiagram
participant CFE as Cloudflare DFW POP
participant DNS as CF internal DNS
participant HN as hetzner node (random of 3)
participant Traefik as Traefik pod<br/>(host network)
CFE->>DNS: Which origin for api.myhoneydue.com?
DNS->>CFE: One of 178.104.247.152, 178.105.32.198, 178.104.249.189
CFE->>HN: TCP SYN :80
HN-->>CFE: SYN+ACK
CFE->>HN: HTTP/1.1 POST /api/tasks/<br/>Host: api.myhoneydue.com<br/>X-Forwarded-For: <user IP><br/>X-Forwarded-Proto: https<br/>CF-Connecting-IP: <user IP>
Note over HN: UFW: allow 80/tcp from<br/>anywhere (for now)
HN->>Traefik: delivered to listener
```
- CF picks one of the 3 node IPs via DNS round-robin. This is per-connection, not per-request.
- Protocol between CF and origin: **HTTP/1.1 plaintext** (SSL=Flexible).
A future Full-strict upgrade would make this HTTPS.
- Latency: ~90–120 ms DFW → Nuremberg
- CF adds headers: `CF-Connecting-IP`, `X-Forwarded-For`, `X-Forwarded-Proto`
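
The headers listed above are what the origin has to rely on to learn the real client address, since the TCP peer is always a Cloudflare edge. A minimal sketch of that lookup follows, written in Python for brevity (the api itself is Go); the function name `client_ip` and its arguments are hypothetical, not taken from the codebase.

```python
def client_ip(headers, remote_addr):
    """Best guess at the end user's IP for a request that arrived via Cloudflare."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    h = {k.lower(): v.strip() for k, v in headers.items()}

    # Cloudflare sets CF-Connecting-IP to the single address of the visitor.
    if h.get("cf-connecting-ip"):
        return h["cf-connecting-ip"]

    # X-Forwarded-For is a comma-separated chain; the left-most entry
    # is the original client, later entries are intermediate proxies.
    first = h.get("x-forwarded-for", "").split(",")[0].strip()

    # With neither header present, fall back to the TCP peer address.
    return first or remote_addr
```

Because UFW currently accepts port 80 from anywhere, a caller that reaches a node IP directly can set these headers itself. Treat the result as untrusted unless the peer address is also checked against Cloudflare's published ranges.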
## Hop 3 — Traefik → api Service
```mermaid
sequenceDiagram
participant Traefik as Traefik pod
participant CoreDNS as CoreDNS (10.43.0.10)
participant KP as kube-proxy IPVS<br/>(kernel)
participant APIPod as api pod<br/>(some node)
Note over Traefik: Match Host: api.myhoneydue.com<br/>→ honeydue-api Ingress<br/>→ backend: api Service :8000
Traefik->>CoreDNS: Resolve "api"
CoreDNS->>Traefik: 10.43.167.83 (Service ClusterIP)
Traefik->>KP: TCP SYN to 10.43.167.83:8000
KP->>KP: IPVS: pick endpoint<br/>from Service endpoint set
KP->>APIPod: Rewrite destination<br/>to 10.42.2.6:8000<br/>(Flannel VXLAN if remote node)
```
- Traefik resolves `api` via CoreDNS → gets the Service ClusterIP
- Traefik sends to `10.43.167.83:8000`
- kube-proxy IPVS (running in-kernel on the node where Traefik lives)
intercepts, picks a live endpoint, rewrites
- Destination might be local (same node) or remote (VXLAN tunnel to
another node)
- Latency: <3 ms even cross-node
## Hop 4 — api → Postgres (Neon)
```mermaid
sequenceDiagram
participant API as api pod (Go)
participant Resolv as Pod resolv.conf
participant Neon as Neon pooler<br/>AWS us-east-1
API->>Resolv: Resolve ep-floral-truth-...-pooler.us-east-1.aws.neon.tech
Note over Resolv: Goes to CoreDNS<br/>which forwards to upstream<br/>(Hetzner's DNS, then public root)
Resolv->>API: Neon pooler IP (e.g., 34.206.177.121)
API->>Neon: TCP :5432
API->>Neon: TLS 1.3 handshake (DB_SSLMODE=require)
API->>Neon: Postgres startup (user, database)
API->>Neon: BEGIN<br/>SELECT ... FROM task_task WHERE residence_id = ?<br/>INSERT INTO task_task (...) VALUES (...)<br/>COMMIT
Neon-->>API: Query results
```
- Go's `database/sql` pool may already hold an idle connection. If so,
the TCP, TLS, and Postgres startup steps are skipped.
- If a new connection is needed: ~50 ms for the TLS handshake plus
Postgres startup
- Query itself: typically ~5–20 ms (single-row read/write on indexed
columns)
- Total for this hop: often <10 ms on a warm connection, ~80 ms cold
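
The warm versus cold difference comes down to whether an idle connection is available for reuse. The toy pool below illustrates that mechanism only; it is a Python sketch, not Go's `database/sql`, and the class name and the default timings (50 ms handshake, 10 ms query) are illustrative assumptions loosely based on the figures above.

```python
class ConnectionPool:
    """Toy pool: idle connections are reused, so only cold requests pay the handshake."""

    def __init__(self, handshake_ms=50, query_ms=10):
        self.handshake_ms = handshake_ms  # TLS + Postgres startup (cold path only)
        self.query_ms = query_ms          # the query itself
        self._idle = []                   # connections ready for reuse
        self.handshakes = 0               # how many new connections were opened

    def run_query(self):
        """Return the simulated latency in ms for one query."""
        if self._idle:
            conn = self._idle.pop()       # warm: skip the handshake entirely
            cost = self.query_ms
        else:
            self.handshakes += 1          # cold: open a new connection first
            conn = object()
            cost = self.handshake_ms + self.query_ms
        self._idle.append(conn)           # hand the connection back to the pool
        return cost


pool = ConnectionPool()
latencies = [pool.run_query() for _ in range(3)]
print(latencies, pool.handshakes)
```

Only the first of the three sequential queries opens a connection; the other two reuse it. Concurrent requests would each need their own connection, which is why a burst after an idle period can see several cold starts at once.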
## Hop 5 — api → Redis (cache miss invalidation)
```mermaid
sequenceDiagram
participant API as api pod
participant CoreDNS
participant KP as kube-proxy
participant Redis as redis pod
API->>CoreDNS: Resolve "redis"
CoreDNS->>API: 10.43.7.10
API->>KP: TCP :6379
KP->>Redis: rewritten to 10.42.x.y:6379
API->>Redis: DEL tasks:user:<user_id> (invalidate cached list)
Redis-->>API: (integer) keys removed
```
- Redis connection is usually kept alive in the api's pool
- Latency: <1 ms (Redis is on hetzner2, usually a short hop)
## Hop 6 — api → worker (enqueue side effect)
For some task creation events, api enqueues a background job
(send-notification, update-lookup-table, etc.):
```mermaid
sequenceDiagram
participant API as api pod
participant Redis as redis pod (acting as Asynq queue)
participant Worker as worker pod
API->>Redis: RPUSH asynq:queue:default <job JSON>
Redis-->>API: (integer) new list length
Note over API,Worker: (Async, no response blocking)
Worker->>Redis: BLPOP asynq:queue:default
Redis-->>Worker: <job JSON>
Worker->>Worker: Process job<br/>(send email, push, etc.)
```
api returns to the caller without waiting for the job.
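
The enqueue-and-return pattern in the diagram can be sketched with an in-memory list standing in for Redis. This is Python for illustration only: the real worker uses Asynq from Go, and the function names and job payload shape here are hypothetical.

```python
import json
from collections import deque

queue = deque()  # stand-in for the Redis list shown in the diagram


def handle_create_task(task):
    """Request path: enqueue the side effect and return immediately."""
    job = {"type": "send-notification", "task_id": task["id"]}
    queue.append(json.dumps(job))          # corresponds to RPUSH (append at the tail)
    return 201                             # respond without waiting for the worker


def worker_drain(process):
    """Worker path: take jobs from the head of the list until it is empty."""
    done = 0
    while queue:
        job = json.loads(queue.popleft())  # corresponds to BLPOP, minus the blocking
        process(job)
        done += 1
    return done
```

The handler's return value does not depend on whether the job succeeds, so a slow or failing notification never delays the HTTP response. The trade-off is that failures surface only in the worker's logs, not to the caller.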
## Hop 7 — Response back to user
Reverse the path:
1. api returns JSON response to Traefik
2. Traefik returns to Cloudflare
3. Cloudflare sends the response to the user over the existing TLS connection
4. User receives response
## End-to-end latency budget
For a typical "create task" operation:
| Hop | Latency |
|---|---|
| User → CF (Austin → DFW) | 5–15 ms |
| CF → hetzner (cross-Atlantic) | 90–120 ms |
| UFW + kernel + Traefik accept | <1 ms |
| Traefik → api (same or cross-node) | 1–3 ms |
| api request parsing, auth validation | 1–3 ms |
| api → Postgres (query) | 20–60 ms |
| api → Redis (invalidate) | <1 ms |
| api response generation | 1–5 ms |
| Return path | same as forward, reversed |

**Total**: ~220–310 ms typical. Dominated by the cross-Atlantic CF→origin
hop and the Postgres query round trip.
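
As a rough cross-check of that total, the per-hop ranges can be summed. The sketch below makes two assumptions that the table does not state: "<1 ms" is counted as 0 to 1 ms, and the return path repeats only the network hops, not the processing steps.

```python
# (low_ms, high_ms) per hop, read from the latency table.
NETWORK = {  # crossed once on the way in and once on the way out
    "user -> CF": (5, 15),
    "CF -> hetzner": (90, 120),
    "UFW + kernel + Traefik accept": (0, 1),
    "Traefik -> api": (1, 3),
}
PROCESSING = {  # happens once per request
    "request parsing + auth": (1, 3),
    "Postgres query": (20, 60),
    "Redis invalidate": (0, 1),
    "response generation": (1, 5),
}


def budget(network, processing):
    """Return (best_case_ms, worst_case_ms) for one full round trip."""
    low = 2 * sum(lo for lo, _ in network.values()) + sum(lo for lo, _ in processing.values())
    high = 2 * sum(hi for _, hi in network.values()) + sum(hi for _, hi in processing.values())
    return low, high


low, high = budget(NETWORK, PROCESSING)
print(f"best case {low} ms, worst case {high} ms")
```

Under those assumptions the bounds come out at 214 ms and 347 ms, which brackets the typical figure quoted above. The CF to origin hop alone contributes 180 to 240 ms of the round trip.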
## Read path (GET /api/tasks/)
Similar but simpler:
```mermaid
sequenceDiagram
participant App as iOS client
participant CF as Cloudflare
participant Traefik
participant API as api pod
participant Redis
participant Neon
App->>CF: GET /api/tasks/
CF->>Traefik: (no cache hit)
Traefik->>API: Route via Service
API->>Redis: GET tasks:user:<user_id>
alt Cache hit
Redis-->>API: cached JSON
else Cache miss
API->>Neon: SELECT ...
Neon-->>API: rows
API->>Redis: SET tasks:user:<user_id> <json> EX 300
end
API-->>Traefik: 200 JSON
Traefik-->>CF: 200
CF-->>App: 200 (may cache per response headers)
```
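
Taken together, the read path above and the invalidation in Hop 5 form a cache-aside pattern. The sketch below shows the shape of it in Python with in-memory stand-ins; `FakeRedis`, `CacheDown`, and `TaskStore` are hypothetical names, and only the key format `tasks:user:<user_id>` and the 300-second TTL come from the diagrams.

```python
import json


class CacheDown(Exception):
    """Raised by a cache client when Redis is unreachable (hypothetical)."""


class FakeRedis:
    """In-memory stand-in for Redis; accepts but ignores the TTL."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value, ex=None):
        self.data[key] = value

    def delete(self, key):
        return 1 if self.data.pop(key, None) is not None else 0


class TaskStore:
    def __init__(self, cache, db):
        self.cache = cache  # Redis stand-in
        self.db = db        # {user_id: [task, ...]}, stand-in for Postgres

    @staticmethod
    def key(user_id):
        return f"tasks:user:{user_id}"

    def list_tasks(self, user_id):
        try:
            cached = self.cache.get(self.key(user_id))
        except CacheDown:
            return list(self.db.get(user_id, []))     # degraded: skip the cache
        if cached is not None:
            return json.loads(cached)                 # cache hit
        rows = list(self.db.get(user_id, []))         # cache miss: go to the database
        self.cache.set(self.key(user_id), json.dumps(rows), ex=300)
        return rows

    def create_task(self, user_id, task):
        self.db.setdefault(user_id, []).append(task)  # write to the database first
        self.cache.delete(self.key(user_id))          # then drop the stale cached list
```

Deleting the key on write, rather than updating it, keeps the write path simple: the next read repopulates the cache from the database. The `CacheDown` branch mirrors the degraded mode in which the api keeps serving reads without Redis.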
## Admin panel data flow
A different dance because the admin is Next.js:
```mermaid
sequenceDiagram
participant Browser
participant CF
participant Traefik
participant Admin as admin pod (Next.js)
participant AdminAPI as api pod<br/>(via public URL)
participant Neon
Browser->>CF: GET admin.myhoneydue.com/users
CF->>Traefik: HTTP :80
Traefik->>Admin: Service /users
Note over Admin: Next.js SSR:<br/>fetch from NEXT_PUBLIC_API_URL
Admin->>CF: GET api.myhoneydue.com/api/admin/users/
CF->>Traefik: (api ingress)
Traefik->>AdminAPI: Service
AdminAPI->>Neon: SELECT ... FROM auth_user
Neon-->>AdminAPI: rows
AdminAPI-->>Admin: JSON
Admin->>Admin: Render HTML
Admin-->>Traefik: HTML
Traefik-->>CF: HTML
CF-->>Browser: HTML
```
Notably, the admin pod's calls to api go **back out to Cloudflare** and
in through the public URL rather than the in-cluster Service IP. This is
because `NEXT_PUBLIC_API_URL=https://api.myhoneydue.com`: Next.js builds
use the same URL for browser-side and server-side fetches.
This is **suboptimal**: server-side (SSR) calls could use the internal
`api.honeydue.svc:8000` URL and skip the CF round-trip. Future
optimization: separate `NEXT_PUBLIC_API_URL` (browser) from `API_URL`
(server-side).
## Static asset flow
For the marketing landing page at `https://myhoneydue.com/`:
1. CF caches HTML per `Cache-Control` (the Go app sets short TTLs)
2. CF caches CSS / JS / images aggressively (via default CF rules)
3. First request hits origin, subsequent requests served from CF edge
The static assets live inside the api container at `/app/static/`.
Served by Echo's static file handler at routes `/css`, `/js`, `/images`.
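
How long a shared cache such as the CF edge may keep a response is driven by the `Cache-Control` header the origin sends. The sketch below implements only the generic HTTP rules for shared caches (`s-maxage` overrides `max-age`; `private` and `no-store` forbid storing); it is written in Python for illustration, the function name is hypothetical, and Cloudflare's actual behavior also depends on zone settings that are not modeled here.

```python
def shared_cache_ttl(cache_control):
    """Explicit lifetime in seconds for a shared cache.

    Returns 0 when storing is forbidden or no explicit lifetime is given.
    """
    directives = {}
    for part in cache_control.split(","):
        name, _, value = part.strip().partition("=")
        if name:
            directives[name.lower()] = value.strip('"')

    # A shared cache must not store private or no-store responses.
    if "no-store" in directives or "private" in directives:
        return 0

    # s-maxage is specific to shared caches and takes precedence over max-age.
    for name in ("s-maxage", "max-age"):
        if directives.get(name, "").isdigit():
            return int(directives[name])
    return 0
```

For example, `shared_cache_ttl("public, max-age=60")` yields a 60-second lifetime, which is the kind of short TTL described for the landing-page HTML.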
## Request flow during a rolling update
When a new api image is deployed, some requests will hit old pods and
some will hit new pods for a few minutes:
```mermaid
sequenceDiagram
participant CF
participant Traefik
participant OldPod as api pod v1
participant NewPod as api pod v2 (starting)
Note over NewPod: kubelet starts new pod
Note over NewPod: pod connects to Postgres<br/>MigrateWithLock runs (no-op)<br/>HTTP server starts<br/>readinessProbe passes
Note over NewPod: kube-proxy updates endpoints<br/>NewPod added to Service pool
CF->>Traefik: request 1
Traefik->>OldPod: routed (old pod still in pool)
CF->>Traefik: request 2
Traefik->>NewPod: routed (new pod now in pool)
Note over OldPod: Kubelet terminates old pod<br/>(graceful SIGTERM, then SIGKILL after grace)
CF->>Traefik: request 3
Traefik->>NewPod: routed (OldPod gone from pool)
```
Both old and new handle traffic simultaneously until the rolling update
completes. As long as the new code is API-compatible, users don't
notice.
## Failure modes in the data path
See [Chapter 16 — Failure Modes](./16-failure-modes.md) for a full
catalog.
Quick summary:
| Layer fails | User sees | Recovery |
|---|---|---|
| Cloudflare DNS down | Can't resolve api.myhoneydue.com | Manual DNS fallback; extremely rare |
| Cloudflare edge down (single POP) | Slow, CF routes to another POP | Automatic |
| Node NIC fails | Some requests time out (CF routes away) | Cluster reschedules pods |
| UFW misconfig blocks :80 | 521 errors at CF | Re-add rule |
| Traefik pod down on one node | CF routes to other nodes | Automatic |
| kube-proxy broken on one node | Pods on that node can't reach Services | Restart kubelet |
| CoreDNS down | New connections fail DNS | Restart CoreDNS |
| Flannel broken between nodes | Cross-node pod communication fails | Restart flannel or node |
| api pod OOM | 502 to user briefly | kubelet restarts pod |
| Postgres down | 500 errors from api | Neon-side issue; outage |
| Redis down | api serves without cache (degraded) | Restart Redis pod |
| B2 down | Uploads fail, existing content served if cached | Backblaze-side outage |
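
For quick triage, the status-code rows of the table can be turned into a lookup. This is a Python sketch with hypothetical names; it covers only the symptoms the table ties to a specific code and gives a first guess, since other layers can produce the same status.

```python
# Symptom -> (likely layer, recovery), transcribed from the table above.
TRIAGE = {
    "dns-failure": ("Cloudflare DNS", "Manual DNS fallback"),
    521: ("UFW blocking :80 on the origin", "Re-add rule"),
    502: ("api pod (for example an OOM restart)", "kubelet restarts pod"),
    500: ("Postgres (Neon)", "Neon-side issue"),
}


def triage(symptom):
    """Return the first layer to check for a given symptom."""
    return TRIAGE.get(symptom, ("unknown", "see Chapter 16"))


layer, recovery = triage(521)
print(f"check first: {layer}; recovery: {recovery}")
```

Anything not in the map falls through to the full catalog in Chapter 16.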
## References
- [Chapter 3 — Networking](./03-networking.md) for the overlay mechanics
- [Chapter 6 — Traefik](./06-traefik-ingress.md) for routing details
- [Chapter 7 — Services](./07-services.md) for per-service specifics
- [Chapter 16 — Failure Modes](./16-failure-modes.md) for what-if scenarios