# 07 — Services

## Summary

Five workloads run in the `honeydue` namespace: **api** (Go REST API, 3
replicas), **admin** (Next.js admin panel, 1 replica), **web** (Next.js
customer-facing app, 3 replicas), **worker** (Go background jobs, 1
replica), and **redis** (cache + job queue, 1 replica, PVC-backed).
This chapter deep-dives each: container image, resource limits, probes,
volumes, and why each knob is set the way it is.

## Overview

| Service | Image | Replicas | Ports | Role |
|---|---|---|---|---|
| `api` | `gitea.treytartt.com/admin/honeydue-api:<sha>` | 3 | 8000 | HTTP REST API |
| `admin` | `gitea.treytartt.com/admin/honeydue-admin:<sha>` | 1 | 3000 | Next.js admin panel |
| `web` | `gitea.treytartt.com/admin/honeydue-web:<sha>` | 3 | 3000 | Next.js customer-facing web client at `app.myhoneydue.com` |
| `worker` | `gitea.treytartt.com/admin/honeydue-worker:<sha>` | 1 | — | Background job processor |
| `redis` | `redis:7-alpine` | 1 | 6379 | Cache + Asynq queue |

All five are Kubernetes `Deployment` workloads (not StatefulSets, not
DaemonSets). They share:
- ServiceAccount with `automountServiceAccountToken: false` (Chapter 5)
- `imagePullSecrets: [gitea-credentials]` (Chapter 11)
- `envFrom: configMapRef: honeydue-config` (Chapter 10)
- Individual env vars wired to `honeydue-secrets` keys
- Read-only root filesystem with `tmp` emptyDir mounted at `/tmp`

## Service — web (Next.js customer app)

### What it does

Lives at `https://app.myhoneydue.com`. Next.js 16 standalone build,
served by `node server.js` inside the container. Sibling repo:
`/Users/treyt/Desktop/code/honeyDue/honeyDueAPI-Web/`.

### Architecture: server-side proxy pattern

Unlike the admin panel (which makes CORS requests directly to
`api.myhoneydue.com`), the web app uses a proxy pattern:

```
Browser → https://app.myhoneydue.com/api/proxy/tasks/123/
  → Next.js route handler (src/app/api/proxy/[...path]/route.ts)
  → reads honeydue-token httpOnly cookie
  → attaches Authorization: Token <value>
  → https://api.myhoneydue.com/api/tasks/123/ (server-side fetch)
  → response flows back
```

**Consequences:**
- The browser never makes cross-origin requests, so the Go API needs no
  CORS entry for `app.myhoneydue.com`.
- Auth tokens live in httpOnly cookies, not localStorage, so XSS can't
  exfiltrate them.
- The web pod needs outbound HTTPS to `api.myhoneydue.com`, which is
  covered in the `allow-egress-from-web` NetworkPolicy (Chapter 5).

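As a rough sketch of what the egress rule in the last bullet might look like, the fragment below allows DNS lookups plus outbound TCP 443. The real `allow-egress-from-web` policy is documented in Chapter 5 and may differ; the pod label, the DNS rule, and the open `0.0.0.0/0` destination are assumptions made for illustration.

```yaml
# Hypothetical sketch; see Chapter 5 for the policy that is actually applied.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-from-web
  namespace: honeydue
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: web   # assumed label, mirroring api and admin
  policyTypes: [Egress]
  egress:
    # DNS, so the pod can resolve api.myhoneydue.com (assumed rule)
    - ports:
        - {protocol: UDP, port: 53}
        - {protocol: TCP, port: 53}
    # Outbound HTTPS for the server-side fetch to the API
    - to:
        - ipBlock: {cidr: 0.0.0.0/0}   # assumed; a narrower CIDR list may be preferable
      ports:
        - {protocol: TCP, port: 443}
```

Because the proxy calls the public hostname rather than the in-cluster `api` Service, a rule limited to the `honeydue` namespace would not be enough on its own.
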
### Env vars

Build-time (baked into the client bundle by the Dockerfile `ARG`):
- `NEXT_PUBLIC_API_URL` — only used as a fallback; baked for safety
- `NEXT_PUBLIC_POSTHOG_KEY` — PostHog project API key
- `NEXT_PUBLIC_POSTHOG_HOST` — `https://analytics.88oakapps.com`

Runtime (ConfigMap):
- `API_URL=https://api.myhoneydue.com/api` — consumed by the
  server-side proxy handlers
- `PORT=3000`, `HOSTNAME=0.0.0.0`

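The runtime half of that split could be wired into the container roughly as follows. This is a sketch modeled on the admin manifest later in this chapter, not a copy of the web manifest: the container name and the assumption that the ConfigMap key is also called `API_URL` are illustrative.

```yaml
# Sketch only; container name and ConfigMap key name are assumptions.
containers:
  - name: web
    image: gitea.treytartt.com/admin/honeydue-web:<sha>
    ports: [containerPort: 3000]
    env:
      - name: PORT
        value: "3000"
      - name: HOSTNAME
        value: "0.0.0.0"
      - name: API_URL   # read by the server-side proxy handlers at request time
        valueFrom: {configMapKeyRef: {name: honeydue-config, key: API_URL}}
```

The `NEXT_PUBLIC_*` values do not appear here because Next.js inlines them into the client bundle at build time; changing them means rebuilding the image, not editing the ConfigMap.
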
### Deployment spec highlights

- **3 replicas**, same as api — this is a production customer surface
- `topologySpreadConstraints` across `kubernetes.io/hostname`, so
  losing one node takes down at most one pod
- `readOnlyRootFilesystem: true`; `emptyDir`s at `/app/.next/cache`
  (Next.js build cache) and `/tmp`
- PDB `web-pdb` with `minAvailable: 2`
- runAsUser/runAsGroup `1001` (matches the `nextjs` user created in
  the Dockerfile)

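Put together, the bullets above correspond to a pod template along these lines. It is a sketch assembled from the list, so any field the list does not mention (`maxSkew`, `whenUnsatisfiable`, the label, the volume names) is an assumption rather than a quote from the manifest.

```yaml
# Sketch assembled from the bullets; unlisted values are assumptions.
spec:
  replicas: 3
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                         # assumed
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule   # assumed; ScheduleAnyway makes the spread best-effort
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: web    # assumed label
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001    # the nextjs user from the Dockerfile
        runAsGroup: 1001
      containers:
        - name: web
          securityContext:
            readOnlyRootFilesystem: true
          volumeMounts:
            - {name: nextjs-cache, mountPath: /app/.next/cache}
            - {name: tmp, mountPath: /tmp}
      volumes:
        - name: nextjs-cache   # volume names borrowed from the admin manifest
          emptyDir: {}
        - name: tmp
          emptyDir: {}
```

With three nodes and three replicas, a hard constraint of this kind places one pod per node, which is what makes the "at most one pod per lost node" property hold.
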
### Why same availability as api

The web client is now the primary user-facing surface. Users hitting
`app.myhoneydue.com/login` should never see a 502 because a single
node went down. The topology spread keeps a single node failure from
taking out more than one pod, and 3 replicas with `minAvailable: 2`
guarantee that at least two pods stay up through any voluntary
disruption such as a `kubectl drain`.

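For reference, a PodDisruptionBudget with that shape would look roughly like the sketch below. It follows the same layout as `worker-pdb` later in this chapter; the selector label is an assumption based on the labels used by the other services.

```yaml
# Sketch of web-pdb; the selector label is assumed.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
  namespace: honeydue
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: web
```

With 3 replicas this lets a drain evict one web pod at a time and blocks the next eviction until the replacement is Ready.
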
## Service 1 — api (Go REST API)

### What it does

The Go HTTP API — the heart of the app. Handlers for user auth,
residences, tasks, contractors, documents, subscriptions, notifications,
etc. Reads/writes to Neon Postgres, reads/writes to Redis cache, reads
from Backblaze B2.

Also serves a marketing landing page at `/` (static HTML + CSS from
`/app/static/`). This is why the `myhoneydue.com` apex domain routes to
the api service (Chapter 6).

### Deployment spec highlights

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    spec:
      serviceAccountName: api
      imagePullSecrets: [name: gitea-credentials]
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        seccompProfile: { type: RuntimeDefault }
      containers:
        - name: api
          image: gitea.treytartt.com/admin/honeydue-api:237c6b8
          ports: [containerPort: 8000]
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities: { drop: [ALL] }
          envFrom: [configMapRef: {name: honeydue-config}]
          env:
            - name: POSTGRES_PASSWORD
              valueFrom: { secretKeyRef: {name: honeydue-secrets, key: POSTGRES_PASSWORD} }
            - name: SECRET_KEY
              valueFrom: { secretKeyRef: {name: honeydue-secrets, key: SECRET_KEY} }
            # ... all other secrets
          volumeMounts:
            - { name: apns-key, mountPath: /secrets/apns, readOnly: true }
            - { name: tmp, mountPath: /tmp }
          resources:
            requests: { cpu: 100m, memory: 128Mi }
            limits: { cpu: 1000m, memory: 512Mi }
          startupProbe: { httpGet: {path: /api/health/, port: 8000}, failureThreshold: 48, periodSeconds: 5 }
          readinessProbe: { httpGet: {path: /api/health/, port: 8000}, initialDelaySeconds: 5, periodSeconds: 10, timeoutSeconds: 5 }
          livenessProbe: { httpGet: {path: /api/health/, port: 8000}, initialDelaySeconds: 30, periodSeconds: 30, timeoutSeconds: 10 }
      volumes:
        - name: apns-key
          secret:
            secretName: honeydue-apns-key
            # one item with both key and path (braces keep them in a single mapping)
            items: [{key: apns_auth_key.p8, path: apns_auth_key.p8}]
        - name: tmp
          emptyDir: {sizeLimit: 64Mi}
```

### Why each setting

**`replicas: 3`** — one per node via anti-affinity rules (not strictly
required but helpful). Three gives us HA (one pod down = two still
serve traffic) and headroom for rolling updates.

**`maxUnavailable: 0, maxSurge: 1`** — during a rollout, start a 4th
pod before killing any old one. This keeps the service at 3 live
pods throughout. `maxUnavailable: 0` means zero-downtime updates, but
only if the readinessProbe is accurate.

**`runAsUser: 1000`** — the `app` user created in the Dockerfile. The
image doesn't run as root.

**`readOnlyRootFilesystem: true`** — prevents any attacker-introduced
file writes to the image layer. The Go binary doesn't need to write to
`/`; only `/tmp` is mutable.

**`startupProbe.failureThreshold: 48`** (= 48 × 5s = 240s grace) — this
was bumped up from the scaffold default of 12. Reason: on first boot,
the Go app runs `MigrateWithLock()` which acquires a Postgres advisory
lock and runs AutoMigrate. First replica takes ~90s; subsequent
replicas wait on the lock. With 3 replicas all starting simultaneously
and the lock serializing them, 240s is the right grace. See
[Chapter 19](./19-postmortem-swarm.md) for the detailed story.

**`readinessProbe.initialDelaySeconds: 5`** — after the startupProbe
passes, wait 5s before starting readiness checks. Prevents a racy
initial failure.

**`livenessProbe.initialDelaySeconds: 30`** — wait 30s after the
startupProbe passes before running liveness checks, so a pod that has
only just started is not restarted over a false-negative check. This
avoids cascading restarts.

**`resources.requests/limits`** — Kubernetes uses `requests` for
scheduling (how much a pod "reserves") and `limits` for enforcement
(the most it can use before CPU is throttled or the container is
OOM-killed). Our api is CPU-bursty for complex query handling, so we
give it 100m baseline with a 1000m ceiling. The 512Mi memory ceiling is
comfortable: in practice api uses ~100-200Mi.

**`volumes.apns-key`** — mounts the `honeydue-apns-key` Secret as a file
at `/secrets/apns/apns_auth_key.p8`. The `APNS_AUTH_KEY_PATH` env var
points to this path. Even though push is currently disabled, the file
must exist because the Go app may try to stat it on startup.

**`volumes.tmp`** — `emptyDir` with `sizeLimit: 64Mi`. Bounded so a
runaway process can't fill the node's disk.

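The anti-affinity mentioned under `replicas: 3` is not part of the highlights excerpt. Assuming it is the soft ("preferred") kind, which would match "not strictly required but helpful", a pod-template fragment could look like this sketch; the weight is illustrative and the label is taken from the api Service selector.

```yaml
# Hypothetical fragment of the api pod template (spec.template.spec).
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100   # illustrative value
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: api
```

A preferred rule lets the scheduler co-locate two api pods when a node is unavailable, which is what allows the surge pod to schedule during a rollout on a three-node cluster.
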
### The Service

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: honeydue
spec:
  type: ClusterIP
  selector: {app.kubernetes.io/name: api}
  ports:
    - port: 8000
      targetPort: 8000
      protocol: TCP
```

ClusterIP `10.43.167.83`. Reachable as `api.honeydue.svc.cluster.local` or
just `api` from inside the namespace.

### HorizontalPodAutoscaler (not yet enabled)

`deploy-k3s/manifests/api/hpa.yaml` defines an HPA that would scale api
between 3 and 6 replicas based on CPU (70% util) and memory (80% util).

**Not currently applied.** `metrics-server` runs but we haven't run
`kubectl apply -f api/hpa.yaml`. TODO in Chapter 20.

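An HPA matching that description could be written as below. This is a sketch built from the numbers in the paragraph above, not the contents of `hpa.yaml`; the object name and the use of the `autoscaling/v2` API are assumptions.

```yaml
# Sketch only; the real api/hpa.yaml may differ.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api          # assumed name
  namespace: honeydue
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: {type: Utilization, averageUtilization: 70}
    - type: Resource
      resource:
        name: memory
        target: {type: Utilization, averageUtilization: 80}
```

Utilization targets are measured against `requests`, not `limits`. With a 128Mi memory request, 80% is roughly 102Mi, and the api is described above as using ~100-200Mi in practice, so the memory target is worth re-checking before this is applied.
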
## Service 2 — admin (Next.js panel)

### What it does

Server-rendered admin UI. Authenticates admin users against a
separate `admin_users` table in Postgres (seeded with `ADMIN_EMAIL` +
`ADMIN_PASSWORD` on first migration). Lets operators view/manage
users, residences, tasks, subscriptions, etc.

Built as a Next.js 16 standalone server.

### Why 1 replica

Low traffic. It's an internal tool. One pod suffices. If it crashes,
Kubernetes restarts it in ~10s. If the hosting node dies, Kubernetes
reschedules to another node.

Running 3 replicas would cost little (Next.js is ~128MB per pod) but
would bring no operational benefit. If the admin panel ever becomes
user-facing, revisit.

### Deployment highlights

```yaml
replicas: 1
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 1

securityContext:
  runAsNonRoot: true
  runAsUser: 1001      # different from api (1000) for isolation
  runAsGroup: 1001
  fsGroup: 1001

containers:
  - image: gitea.treytartt.com/admin/honeydue-admin:<sha>
    ports: [containerPort: 3000]
    env:
      - name: PORT
        value: "3000"
      - name: HOSTNAME
        value: "0.0.0.0"
      - name: NEXT_PUBLIC_API_URL
        valueFrom: {configMapKeyRef: {name: honeydue-config, key: NEXT_PUBLIC_API_URL}}
    volumeMounts:
      - {name: nextjs-cache, mountPath: /app/.next/cache}
      - {name: tmp, mountPath: /tmp}
    resources:
      requests: {cpu: 50m, memory: 64Mi}
      limits: {cpu: 500m, memory: 256Mi}
    startupProbe:
      httpGet: {path: /, port: 3000}   # was /admin/ — wrong for this app (Chapter 19)
      failureThreshold: 24
      periodSeconds: 5
    readinessProbe:
      httpGet: {path: /, port: 3000}
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 5
```

**Probe path `/`** — Next.js serves at root. `/admin/` (scaffold default)
returns 404 and killed the pod repeatedly during initial bring-up.
See Chapter 19 §Admin probe path for the story.

**`runAsUser: 1001`** — different from api's 1000 so that if one
service were compromised, the stolen UID would at least be distinct
from other services' (minor defense-in-depth).

**`nextjs-cache`** — emptyDir mount for Next.js's server-side cache.
Without it, the read-only rootfs would prevent Next from caching
server-rendered pages. Not a persistent volume because cache is
regenerable on restart.

### The Service

```yaml
apiVersion: v1
kind: Service
metadata:
  name: admin
spec:
  type: ClusterIP
  selector: {app.kubernetes.io/name: admin}
  ports: [{port: 3000, targetPort: 3000}]
```

ClusterIP `10.43.136.168`.

## Service 3 — worker (Go + Asynq)

### What it does

Runs scheduled background jobs via [Asynq](https://github.com/hibiken/asynq)
(a Redis-backed job queue for Go):

- **Task reminders** (14:00 UTC daily) — notify users of upcoming tasks
- **Overdue reminders** (15:00 UTC daily) — notify users of overdue tasks
- **Daily digest** (03:00 UTC daily) — summary email per user
- **Onboarding emails** — multi-step drip campaign for new users
- **Cleanup jobs** — expired tokens, stale data

### Why 1 replica (hard requirement)

Asynq uses a `Scheduler` component that does cron-like scheduling. The
Scheduler is **not leader-elected** by default: if you run two, both
fire every cron task, and users get duplicate emails.

The Asynq docs cover this: to scale scheduling, migrate to
`PeriodicTaskManager` + `PeriodicTaskConfigProvider` which coordinate
via Redis. Not yet done in our codebase.

Until then, `replicas: 1` is a hard constraint. See the comment in the
deployment manifest:

```yaml
spec:
  # Asynq's Scheduler is a singleton — running >1 replica fires every cron
  # task once per replica (duplicate daily digests, onboarding emails, etc.).
  # Keep at 1 until asynq.PeriodicTaskManager with Redis leader election is
  # wired in cmd/worker/main.go.
  replicas: 1
```

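`replicas: 1` alone does not stop two schedulers from overlapping for a short time during a rollout: with the default `RollingUpdate` strategy, the new pod is started before the old one is removed. This chapter does not show the worker's `strategy`, so the fragment below is a hypothetical way to close that gap, not a description of the current manifest.

```yaml
# Hypothetical addition to the worker Deployment; the current strategy is not shown in this chapter.
spec:
  replicas: 1
  strategy:
    type: Recreate   # terminate the old pod before creating the new one
```

The trade-off is a short gap with no worker during each deploy, which lines up with the delayed-jobs behavior described in the next section.
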
### What happens if the worker pod dies?

- Asynq schedule state is in Redis (which has AOF persistence)
- When a new worker pod starts, it re-registers the scheduler and picks up
  where it left off
- Any job that was in-flight (dequeued but not acknowledged) gets retried
  by Asynq's automatic retry logic (see the `worker.RetryOptions` in the
  Go code)
- Cron jobs that were due during the downtime fire on the next tick

A 5-minute worker outage = 5 minutes of delayed jobs. Not great but
acceptable.

### PodDisruptionBudget

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: worker-pdb
spec:
  minAvailable: 0
  selector: {matchLabels: {app.kubernetes.io/name: worker}}
```

`minAvailable: 0` means voluntary disruptions (`kubectl drain`) can take
the worker down. This matches the singleton constraint: there's only one,
it's OK to drain.

### No Service

The worker doesn't listen on any HTTP port for application traffic: it's
a queue consumer, not a web server. So there's **no Kubernetes Service**
for it.

(On Swarm we had the worker expose a health endpoint at `:6060/health`;
the k3s scaffold doesn't replicate this. Future work.)

## Service 4 — redis

### What it does

- Caching layer (ETag-based lookups, user session cache)
- Asynq queue backend (job state, scheduled tasks, retry state)

### Why 1 replica

Single-instance Redis with AOF persistence. Not replicated, not
clustered. Downsides:
- Node outage = Redis outage. Once the node is back, the cache
  regenerates and queue state is restored from the AOF on the PVC.
- No failover. The PVC is local-path (per-node) and the `nodeSelector`
  described under Node affinity pins Redis to that node, so the pod
  cannot simply move elsewhere. If the node is lost for good, Redis has
  to start on another node with a fresh volume and the data is gone.

For our scale this is acceptable. Redis holds no authoritative state
(everything that matters is in Postgres). Cache regenerates on first
request; Asynq retries enqueue on failure.

### PVC

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: redis-data
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: local-path
  resources: {requests: {storage: 5Gi}}
```

Uses k3s' built-in `local-path-provisioner`. The PVC binds to a local
directory on the node where the Redis pod lands (`/var/lib/rancher/k3s/storage/`).
`ReadWriteOnce` means the volume can be mounted read-write by a single
node at a time.

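On the Deployment side, the claim has to be mounted where Redis writes its AOF. The sketch below shows one plausible wiring; the container and volume names are assumptions, while `/data` is the working directory of the official `redis` image and therefore where the AOF lands by default.

```yaml
# Sketch of the pod-template side; names other than redis-data are assumed.
containers:
  - name: redis
    image: redis:7-alpine
    ports: [containerPort: 6379]
    volumeMounts:
      - {name: data, mountPath: /data}   # default working directory of the redis image
volumes:
  - name: data
    persistentVolumeClaim:
      claimName: redis-data
```
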
### Node affinity

```yaml
nodeSelector:
  honeydue/redis: "true"
```

We labeled `ubuntu-8gb-nbg1-2` (hetzner1) with `honeydue/redis=true` so
Redis always lands there. This ensures the PVC finds its backing
storage (since PVCs with `local-path` are per-node).

```bash
kubectl label node ubuntu-8gb-nbg1-2 honeydue/redis=true --overwrite
```

### Why not Redis Sentinel / Cluster

Complexity. At our scale (~a few req/s, kilobytes of cache), a single
Redis does fine. If Redis becomes critical-path for availability, we'd:
- Use a managed Redis (Upstash, Dragonfly Cloud) — $5-15/mo, their problem
- Or run Redis Sentinel with 3 replicas — manageable but operational work

Neither is needed yet.

### Redis config

From the deployment:

```yaml
command:
  - sh
  - -c
  - |
    ARGS="--appendonly yes --appendfsync everysec --maxmemory 256mb --maxmemory-policy noeviction"
    if [ -n "$REDIS_PASSWORD" ]; then
      ARGS="$ARGS --requirepass $REDIS_PASSWORD"
    fi
    exec redis-server $ARGS
```

Settings:
- **`--appendonly yes --appendfsync everysec`** — AOF persistence,
  fsync every second. Survives restarts with up to 1 second of data
  loss.
- **`--maxmemory 256mb`** — a safety cap. Combined with `noeviction`,
  Redis rejects writes that would grow the dataset past 256 MB.
- **`--maxmemory-policy noeviction`** — we'd rather get errors than
  silently drop data. This is the right choice when Redis holds queue
  state (losing a queue item silently = missed job).

The `REDIS_PASSWORD` env var is optional. Currently empty (no auth). The
Redis pod is only reachable from inside the overlay network, and our
NetworkPolicies (once enabled) would restrict egress further.

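If authentication is turned on later, the password could be injected from the existing Secret without making it mandatory. This is a hypothetical sketch: the chapter does not say where `REDIS_PASSWORD` would come from, so the Secret key name is an assumption.

```yaml
# Hypothetical wiring; the key name in honeydue-secrets is an assumption.
env:
  - name: REDIS_PASSWORD
    valueFrom:
      secretKeyRef:
        name: honeydue-secrets
        key: REDIS_PASSWORD
        optional: true   # the pod still starts when the key is absent
```

With `optional: true` and the key missing, the variable is unset, the `[ -n "$REDIS_PASSWORD" ]` test in the startup script is false, and Redis starts without `--requirepass`, which is the current behavior.
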
## Resource summary

Per-pod requests and limits for each service; the last row totals
requests across all replicas:

| Service | CPU requests | CPU limits | Memory requests | Memory limits | Replicas |
|---|---|---|---|---|---|
| api | 100m | 1000m | 128Mi | 512Mi | 3 |
| admin | 50m | 500m | 64Mi | 256Mi | 1 |
| worker | 50m | 500m | 64Mi | 256Mi | 1 |
| redis | 100m | 500m | 128Mi | 512Mi | 1 |
| traefik (kube-system) | ~100m | unlimited | ~50Mi | unlimited | 3 |
| **Total requests** | **~800m** | | **~790Mi** | | |

Each node has 4000m CPU + 8192Mi memory. Total cluster capacity is
12000m + 24576Mi. Multiplying each row's requests by its replica count
gives roughly 800m CPU and 790Mi memory, about 7% of CPU and 3% of
memory, so there is plenty of headroom. These totals exclude `web`,
whose requests and limits are not listed in this chapter.

## Health check semantics

Kubernetes distinguishes three probe types:

- **startupProbe** — is the container done starting? Runs until it passes
  once, then stops. While it is running, the other probes are disabled.
  If it exhausts its `failureThreshold`, the container is killed and
  restarted.
- **readinessProbe** — is the container ready to serve traffic? A failing
  pod is removed from Service endpoints (traffic stops flowing to it)
  but the pod keeps running.
- **livenessProbe** — is the container healthy? A failing container is
  killed and restarted.

### Why we tuned startupProbe separately

The api's first-boot migration takes 90–240s. If we relied on a
livenessProbe alone, with a typical short initial delay and a
`failureThreshold` of 3, the container would be killed before the
migration finishes; a readinessProbe alone would only keep the pod out
of the Service endpoints. startupProbe lets us give generous first-boot
grace (240s) without affecting the sharper ongoing readiness/liveness
checks.

### Probe path design

Each service's `/health` endpoint should be:
- Cheap (no DB query, no external call)
- Fast (< 100ms)
- Honest (returns 200 iff the process can serve)

Our api's `/api/health/` does a trivial check. It does NOT verify Postgres
connectivity (to avoid cascading DB failures tearing down all api pods).
If Postgres is down, api pods stay "ready" and return 5xx for actual
endpoints — that's the right behavior.

## Log routing

All container logs go to stdout/stderr. containerd captures them to
`/var/log/containers/` on the node. `kubectl logs` fetches them through
the API server's `/api/v1/namespaces/<namespace>/pods/<pod>/log`
endpoint, which proxies the request to the kubelet on that node.

We have **no log aggregation** in the cluster (no Loki, no ELK, no
Datadog). For debugging we use:

```bash
kubectl logs -n honeydue deploy/api -f --prefix
kubectl logs -n honeydue deploy/api --previous   # previous container instance (after a restart)
```

See [Chapter 15](./15-observability.md).

## Rolling update semantics

When you push a new image and run `kubectl set image`, or `kubectl apply`
a manifest with a new image tag:

1. Kubernetes creates a new ReplicaSet with the new image
2. Starts 1 new pod (per `maxSurge: 1`)
3. Waits for it to pass readinessProbe
4. Removes 1 pod from the old ReplicaSet
5. Repeats until all N pods are on the new ReplicaSet
6. Old ReplicaSet stays around (for rollback) with 0 replicas

For api (3 replicas): total rollout time is roughly
`3 × (pod_startup_time + small_buffer)` = ~15 minutes in the cold-boot
case, seconds for warm updates where migrations are no-op.

During the rollout:
- Service endpoint set updates as pods become ready
- kube-proxy IPVS is reprogrammed on each node
- Traefik's connection pool to the Service invalidates gradually

Users see no downtime if the new image is compatible. If it's broken:

```bash
kubectl rollout undo deployment/api -n honeydue
```

This reverts to the previous ReplicaSet and typically takes about 30
seconds to stabilize.

## Why no StatefulSet

For Redis (the only stateful thing we run), we use a Deployment + PVC.
StatefulSet is designed for:
- Ordered startup (pod-0 before pod-1)
- Stable hostnames (pod-0 gets DNS name `redis-0.redis`)
- Per-replica PVCs

We have one Redis replica. None of those features matter for a
singleton. Deployment + PVC + nodeSelector is simpler and equivalent.

If we ever run Redis Sentinel or Cluster, we'd migrate to StatefulSet.

## Operator cheat sheet

```bash
# See all pods in honeydue namespace
kubectl get pods -n honeydue -o wide

# Per-service rollout status
kubectl rollout status deployment/api -n honeydue

# Scale a service
kubectl scale deployment/api -n honeydue --replicas=5

# Restart all pods (e.g., to re-read a configmap)
kubectl rollout restart deployment/api -n honeydue

# Exec into a pod
kubectl exec -it -n honeydue deploy/admin -- /bin/sh

# Describe a pod (shows events, probe state, restarts)
kubectl describe pod -n honeydue <pod-name>

# Resource usage
kubectl top pods -n honeydue
```

## References

- [Kubernetes Deployments][deploy]
- [Pod lifecycle + probes][probes]
- [Asynq scheduler limitations][asynq-sched]
- [K3s local-path provisioner][k3s-lp]

[deploy]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
[probes]: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-lifecycle
[asynq-sched]: https://github.com/hibiken/asynq/wiki/Periodic-Tasks
[k3s-lp]: https://docs.k3s.io/storage#setting-up-the-local-storage-provider