honeyDueAPI/docs/deployment/12-data-flow.md
Trey t 8d9ca2e6ed
docs(deployment): rewrite migration prose for goose adoption
Update the deployment book and glossary to reflect the goose-based
schema migration flow shipped in 12b2f9d/0f7450a:

- ch07: clarify startup probe assumes migrations ran out-of-band
- ch08: drop AutoMigrate-with-advisory-lock prose; describe goose Job
- ch12: pod startup checks goose_db_version, no longer runs migrations
- ch14: document the Job→wait→roll deploy gate and how to debug failures
- ch16: add "Migrate Job fails during deploy" + "Schema precondition
  failed" failure modes
- ch17: new runbook entries §26 (run migrations manually), §27 (recover
  from failed/dirty migration), §28 (bootstrap goose on fresh clone)
- ch19: postscript on §13 noting MigrateWithLock approach is superseded
- ch20: mark "Migration Job for schema changes" task done
- glossary: add `goose` and `goose_db_version`; flag AutoMigrate as
  tests-only
- references: add goose links; flag AutoMigrate as tests-only
2026-04-26 23:01:32 -05:00

# 12 — Data Flow
## Summary
This chapter follows a user's request end to end, hop by hop. It's the
consolidated picture of Chapters 3, 6, 7, 8, and 9 working together. Use
this chapter to answer "when X doesn't work, which layer failed?"
## Scenario: User creates a task
A user in Austin opens the mobile app and adds a new task for their
property. The client sends `POST https://api.myhoneydue.com/api/tasks/`
with a JSON body and an auth token. We trace every hop.
## Hop 1 — Mobile client → Cloudflare edge
```mermaid
sequenceDiagram
participant App as iOS client
participant DNS as Local DNS
participant CFE as Cloudflare edge (DFW)
App->>DNS: Resolve api.myhoneydue.com
DNS->>App: 104.21.13.7 (Cloudflare edge IP)
App->>CFE: TCP SYN :443
CFE-->>App: TCP SYN+ACK
App->>CFE: TLS ClientHello
CFE->>App: TLS ServerHello + cert
Note over App,CFE: TLS 1.3 handshake<br/>~1 RTT
App->>CFE: HTTP/2 stream<br/>POST /api/tasks/<br/>Authorization: Token <xxx>
```
- Client resolves `api.myhoneydue.com` via OS resolver, gets Cloudflare
edge IP (not our origin IP)
- Client establishes TLS 1.3 to CF's nearest POP (Dallas for Austin)
- Cert presented by CF is `sni.cloudflaressl.com` or a CF-issued
`*.myhoneydue.com` — our origin cert is never seen by the client
- Latency: ~5–15 ms Austin → DFW
## Hop 2 — Cloudflare edge → Origin (hetzner)
```mermaid
sequenceDiagram
participant CFE as Cloudflare DFW POP
participant DNS as CF internal DNS
participant HN as hetzner node (random of 3)
participant Traefik as Traefik pod<br/>(host network)
CFE->>DNS: Which origin for api.myhoneydue.com?
DNS->>CFE: One of 178.104.247.152, 178.105.32.198, 178.104.249.189
CFE->>HN: TCP SYN :80
HN-->>CFE: SYN+ACK
CFE->>HN: HTTP/1.1 POST /api/tasks/<br/>Host: api.myhoneydue.com<br/>X-Forwarded-For: <user IP><br/>X-Forwarded-Proto: https<br/>CF-Connecting-IP: <user IP>
Note over HN: UFW: allow 80/tcp from<br/>anywhere (for now)
HN->>Traefik: delivered to listener
```
- CF picks one of the 3 node IPs via DNS round-robin. This is per-connection, not per-request.
- Protocol between CF and origin: **HTTP/1.1 plaintext** (Cloudflare SSL mode is Flexible).
A future upgrade to Full (strict) would make this hop HTTPS.
- Latency: ~90–120 ms DFW → Nuremberg
- CF adds headers: `CF-Connecting-IP`, `X-Forwarded-For`, `X-Forwarded-Proto`
## Hop 3 — Traefik → api Service
```mermaid
sequenceDiagram
participant Traefik as Traefik pod
participant CoreDNS as CoreDNS (10.43.0.10)
participant KP as kube-proxy IPVS<br/>(kernel)
participant APIPod as api pod<br/>(some node)
Note over Traefik: Match Host: api.myhoneydue.com<br/>→ honeydue-api Ingress<br/>→ backend: api Service :8000
Traefik->>CoreDNS: Resolve "api"
CoreDNS->>Traefik: 10.43.167.83 (Service ClusterIP)
Traefik->>KP: TCP SYN to 10.43.167.83:8000
KP->>KP: IPVS: pick endpoint<br/>from Service endpoint set
KP->>APIPod: Rewrite destination<br/>to 10.42.2.6:8000<br/>(Flannel VXLAN if remote node)
```
- Traefik resolves `api` via CoreDNS → gets the Service ClusterIP
- Traefik sends to `10.43.167.83:8000`
- kube-proxy IPVS (running in-kernel on the node where Traefik lives)
intercepts, picks a live endpoint, rewrites
- Destination might be local (same node) or remote (VXLAN tunnel to
another node)
- Latency: <3 ms even cross-node
## Hop 4 — api → Postgres (Neon)
```mermaid
sequenceDiagram
participant API as api pod (Go)
participant Resolv as Pod resolv.conf
participant Neon as Neon pooler<br/>AWS us-east-1
API->>Resolv: Resolve ep-floral-truth-...-pooler.us-east-1.aws.neon.tech
Note over Resolv: Goes to CoreDNS<br/>which forwards to upstream<br/>(Hetzner's DNS, then public root)
Resolv->>API: Neon pooler IP (e.g., 34.206.177.121)
API->>Neon: TCP :5432
API->>Neon: TLS 1.3 handshake (DB_SSLMODE=require)
API->>Neon: Postgres startup (user, database)
API->>Neon: BEGIN<br/>SELECT ... FROM task_task WHERE residence_id = ?<br/>INSERT INTO task_task (...) VALUES (...)<br/>COMMIT
Neon-->>API: Query results
```
- Go's database/sql pool may already have an idle connection. If so,
skip handshake.
- If new connection: ~50 ms TLS handshake + Postgres startup
- Query itself: typically ~5–20 ms (single-row read/write on indexed
columns)
- Total for this hop: often <10 ms on a warm connection, ~80 ms cold
## Hop 5 — api → Redis (cache miss invalidation)
```mermaid
sequenceDiagram
participant API as api pod
participant CoreDNS
participant KP as kube-proxy
participant Redis as redis pod
API->>CoreDNS: Resolve "redis"
CoreDNS->>API: 10.43.7.10
API->>KP: TCP :6379
KP->>Redis: rewritten to 10.42.x.y:6379
API->>Redis: DEL tasks:user:<user_id> (invalidate cached list)
Redis-->>API: 1 (keys deleted)
```
- Redis connection is usually kept alive in the api's pool
- Latency: <1 ms (Redis is on hetzner2, usually a short hop)
## Hop 6 — api → worker (enqueue side effect)
For some task creation events, api enqueues a background job
(send-notification, update-lookup-table, etc.):
```mermaid
sequenceDiagram
participant API as api pod
participant Redis as redis pod (acting as Asynq queue)
participant Worker as worker pod
API->>Redis: RPUSH asynq:queue:default <job JSON>
Redis-->>API: OK
Note over API,Worker: (Async, no response blocking)
Worker->>Redis: BLPOP asynq:queue:default
Redis-->>Worker: <job JSON>
Worker->>Worker: Process job<br/>(send email, push, etc.)
```
api returns to the caller without waiting for the job.
## Hop 7 — Response back to user
Reverse the path:
1. api returns JSON response to Traefik
2. Traefik returns to Cloudflare
3. Cloudflare sends the response to the user over the existing TLS session
4. User receives response
## End-to-end latency budget
For a typical "create task" operation:
| Hop | Latency |
|---|---|
| User → CF (Austin → DFW) | 5–15 ms |
| CF → hetzner (cross-Atlantic) | 90–120 ms |
| UFW + kernel + Traefik accept | <1 ms |
| Traefik → api (same or cross-node) | 1–3 ms |
| api request parsing, auth validation | 1–3 ms |
| api → Postgres (query) | 20–60 ms |
| api → Redis (invalidate) | <1 ms |
| api response generation | 1–5 ms |
| Return path | same as forward, reversed |
**Total**: ~220–310 ms typical. Dominated by the cross-Atlantic CF→origin
hop and the Postgres query round trip.
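
As a rough sanity check on the budget, the rows can be added up mechanically. The sketch below reads the table's ranges as 5–15 ms, 90–120 ms, and so on, treats each "<1 ms" row as 0–1 ms, counts the network rows in both directions, and counts the processing rows once. That split is an assumption about how the return path is accounted for, so the result lands in the same neighborhood as the stated total rather than reproducing it exactly.

```go
package main

import "fmt"

// span is a latency range in milliseconds.
type span struct{ lo, hi int }

// sum adds ranges end to end.
func sum(spans ...span) span {
	var total span
	for _, s := range spans {
		total.lo += s.lo
		total.hi += s.hi
	}
	return total
}

// roundTrip estimates the full request/response time under the assumption
// that network hops are paid twice and server-side work is paid once.
func roundTrip() span {
	network := sum(
		span{5, 15},   // user -> CF (Austin -> DFW)
		span{90, 120}, // CF -> hetzner (cross-Atlantic)
		span{0, 1},    // UFW + kernel + Traefik accept ("<1 ms")
		span{1, 3},    // Traefik -> api
	)
	processing := sum(
		span{1, 3},   // request parsing, auth validation
		span{20, 60}, // Postgres query
		span{0, 1},   // Redis invalidate ("<1 ms")
		span{1, 5},   // response generation
	)
	return sum(network, network, processing)
}

func main() {
	est := roundTrip()
	fmt.Printf("estimated round trip: %d-%d ms\n", est.lo, est.hi)
}
```

Whichever way the return path is counted, the two cross-Atlantic traversals account for most of the total, which matches the conclusion above.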
## Read path (GET /api/tasks/)
Similar but simpler:
```mermaid
sequenceDiagram
participant App as iOS client
participant CF as Cloudflare
participant Traefik
participant API as api pod
participant Redis
participant Neon
App->>CF: GET /api/tasks/
CF->>Traefik: (no cache hit)
Traefik->>API: Route via Service
API->>Redis: GET tasks:user:<user_id>
alt Cache hit
Redis-->>API: cached JSON
else Cache miss
API->>Neon: SELECT ...
Neon-->>API: rows
API->>Redis: SET tasks:user:<user_id> <json> EX 300
end
API-->>Traefik: 200 JSON
Traefik-->>CF: 200
CF-->>App: 200 (may cache per response headers)
```
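
The read path above is a cache-aside pattern, and the write path in Hop 5 is its invalidation half. A minimal sketch of how the two fit together, using an in-memory map as a stand-in for Redis so it runs without a server. The type and function names (`memCache`, `taskStore`, `listTasks`, `invalidate`) and the integer user ID are illustrative assumptions, not the api's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errCacheMiss plays the role of the "key not found" result of a Redis GET.
var errCacheMiss = errors.New("cache miss")

type entry struct {
	value   string
	expires time.Time
}

// memCache is an in-memory stand-in for Redis. It is not safe for concurrent use.
type memCache struct{ data map[string]entry }

func newMemCache() *memCache { return &memCache{data: map[string]entry{}} }

func (c *memCache) get(key string) (string, error) {
	e, ok := c.data[key]
	if !ok || time.Now().After(e.expires) {
		return "", errCacheMiss
	}
	return e.value, nil
}

func (c *memCache) set(key, value string, ttl time.Duration) {
	c.data[key] = entry{value: value, expires: time.Now().Add(ttl)}
}

func (c *memCache) del(key string) { delete(c.data, key) }

// taskStore wires the cache to a loader that stands in for the Postgres query.
type taskStore struct {
	cache   *memCache
	load    func(userID int) string // hypothetical: runs the SELECT, returns JSON
	dbReads int                     // counts loader calls, for demonstration only
}

func cacheKey(userID int) string { return fmt.Sprintf("tasks:user:%d", userID) }

// listTasks is the GET path: try the cache, fall back to the database on any
// cache error, then populate the cache with a 300-second TTL (EX 300).
func (s *taskStore) listTasks(userID int) string {
	if v, err := s.cache.get(cacheKey(userID)); err == nil {
		return v
	}
	s.dbReads++
	v := s.load(userID)
	s.cache.set(cacheKey(userID), v, 300*time.Second)
	return v
}

// invalidate is what the POST path does after COMMIT: DEL the cached list.
func (s *taskStore) invalidate(userID int) { s.cache.del(cacheKey(userID)) }

func main() {
	s := &taskStore{cache: newMemCache(), load: func(int) string { return `[{"id":1}]` }}
	s.listTasks(42)  // miss: reads the database and fills the cache
	s.listTasks(42)  // hit: served from the cache
	s.invalidate(42) // a task was created
	s.listTasks(42)  // miss again: the list is rebuilt
	fmt.Println("db reads:", s.dbReads)
}
```

Because `listTasks` treats every cache error as a miss, the same shape also covers the degraded case where Redis is unreachable: requests still succeed, each one simply pays the Postgres round trip.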
## Admin panel data flow
A different dance because the admin is Next.js:
```mermaid
sequenceDiagram
participant Browser
participant CF
participant Traefik
participant Admin as admin pod (Next.js)
participant AdminAPI as api pod<br/>(via public URL)
participant Neon
Browser->>CF: GET admin.myhoneydue.com/users
CF->>Traefik: HTTP :80
Traefik->>Admin: Service /users
Note over Admin: Next.js SSR:<br/>fetch from NEXT_PUBLIC_API_URL
Admin->>CF: GET api.myhoneydue.com/api/admin/users/
CF->>Traefik: (api ingress)
Traefik->>AdminAPI: Service
AdminAPI->>Neon: SELECT ... FROM auth_user
Neon-->>AdminAPI: rows
AdminAPI-->>Admin: JSON
Admin->>Admin: Render HTML
Admin-->>Traefik: HTML
Traefik-->>CF: HTML
CF-->>Browser: HTML
```
Notably, the admin pod's calls to api go **back out to Cloudflare** and
in through the public URL rather than to the in-cluster Service IP. This
is because `NEXT_PUBLIC_API_URL=https://api.myhoneydue.com`: the Next.js
build uses the same URL for browser-side and server-side fetches.
This is **suboptimal**: server-side (SSR) calls could use the internal
`api.honeydue.svc:8000` URL and skip the CF round-trip. Future
optimization: separate `NEXT_PUBLIC_API_URL` (browser) from `API_URL`
(server-side).
## Static asset flow
For the marketing landing page at `https://myhoneydue.com/`:
1. CF caches HTML per `Cache-Control` (the Go app sets short TTLs)
2. CF caches CSS / JS / images aggressively (via default CF rules)
3. First request hits origin, subsequent requests served from CF edge
The static assets live inside the api container at `/app/static/`.
Served by Echo's static file handler at routes `/css`, `/js`, `/images`.
## Request flow during a rolling update
When a new api image is deployed, some requests will hit old pods and
some will hit new pods for a few minutes:
```mermaid
sequenceDiagram
participant CF
participant Traefik
participant OldPod as api pod v1
participant NewPod as api pod v2 (starting)
Note over NewPod: kubelet starts new pod
Note over NewPod: pod connects to Postgres<br/>RequireSchemaApplied checks goose_db_version<br/>HTTP server starts<br/>readinessProbe passes
Note over NewPod: kube-proxy updates endpoints<br/>NewPod added to Service pool
CF->>Traefik: request 1
Traefik->>OldPod: routed (old pod still in pool)
CF->>Traefik: request 2
Traefik->>NewPod: routed (new pod now in pool)
Note over OldPod: Kubelet terminates old pod<br/>(graceful SIGTERM, then SIGKILL after grace)
CF->>Traefik: request 3
Traefik->>NewPod: routed (OldPod gone from pool)
```
Both old and new handle traffic simultaneously until the rolling update
completes. As long as the new code is API-compatible, users don't
notice.
## Failure modes in the data path
See [Chapter 16 — Failure Modes](./16-failure-modes.md) for a full
catalog.
Quick summary:
| Layer fails | User sees | Recovery |
|---|---|---|
| Cloudflare DNS down | Can't resolve api.myhoneydue.com | Manual DNS fallback; extremely rare |
| Cloudflare edge down (single POP) | Slow, CF routes to another POP | Automatic |
| Node NIC fails | Some requests time out (CF routes away) | Cluster reschedules pods |
| UFW misconfig blocks :80 | 521 errors at CF | Re-add rule |
| Traefik pod down on one node | CF routes to other nodes | Automatic |
| kube-proxy broken on one node | Pods on that node can't reach Services | Restart kubelet |
| CoreDNS down | New connections fail DNS | Restart CoreDNS |
| Flannel broken between nodes | Cross-node pod communication fails | Restart flannel or node |
| api pod OOM | 502 to user briefly | kubelet restarts pod |
| Postgres down | 500 errors from api | Neon-side issue; outage |
| Redis down | api serves without cache (degraded) | Restart Redis pod |
| B2 down | Uploads fail, existing content served if cached | Backblaze-side outage |
## References
- [Chapter 3 — Networking](./03-networking.md) for the overlay mechanics
- [Chapter 6 — Traefik](./06-traefik-ingress.md) for routing details
- [Chapter 7 — Services](./07-services.md) for per-service specifics
- [Chapter 16 — Failure Modes](./16-failure-modes.md) for what-if scenarios