Migrate prod deploy from Swarm to K3s; add full deployment book

Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
  callback (was causing 'unlock of unlocked mutex' fatal after
  Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
  + allowlist fonts.googleapis.com so the marketing landing page CSS
  actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
  --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
  images runnable on x86_64 Hetzner nodes; fix array expansion under
  set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
  top-level aliases (the '\${X_SECRET}' form never actually resolved);
  dozzle ports: long-form host_ip is rejected by Swarm, switched to
  short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
  worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
  (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
  12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
  real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
  and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
  hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
  without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations,
    consolidated citations
- README.md: Production Deployment section replaced with pointer to
  the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,575 @@
# 07 — Services

## Summary

Four workloads run in the `honeydue` namespace: **api** (Go REST API, 3
replicas), **admin** (Next.js panel, 1 replica), **worker** (Go background
jobs, 1 replica), and **redis** (cache + job queue, 1 replica, PVC-backed).
This chapter deep-dives each: container image, resource limits, probes,
volumes, and why each knob is set the way it is.

## Overview

| Service | Image | Replicas | Ports | Role |
|---|---|---|---|---|
| `api` | `gitea.treytartt.com/admin/honeydue-api:<sha>` | 3 | 8000 | HTTP REST API |
| `admin` | `gitea.treytartt.com/admin/honeydue-admin:<sha>` | 1 | 3000 | Next.js admin panel |
| `worker` | `gitea.treytartt.com/admin/honeydue-worker:<sha>` | 1 | — | Background job processor |
| `redis` | `redis:7-alpine` | 1 | 6379 | Cache + Asynq queue |

All four are Kubernetes `Deployment` workloads (not StatefulSets, not
DaemonSets). They share:
- ServiceAccount with `automountServiceAccountToken: false` (Chapter 5)
- `imagePullSecrets: [gitea-credentials]` (Chapter 11)
- `envFrom: configMapRef: honeydue-config` (Chapter 10)
- Individual env vars wired to `honeydue-secrets` keys
- Read-only root filesystem with `tmp` emptyDir mounted at `/tmp`

## Service 1 — api (Go REST API)

### What it does

The Go HTTP API — the heart of the app. Handlers for user auth,
residences, tasks, contractors, documents, subscriptions, notifications,
etc. Reads/writes to Neon Postgres, reads/writes to Redis cache, reads
from Backblaze B2.

Also serves a marketing landing page at `/` (static HTML + CSS from
`/app/static/`). This is why the `myhoneydue.com` apex domain routes to
the api service (Chapter 6).

### Deployment spec highlights

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    spec:
      serviceAccountName: api
      imagePullSecrets: [{name: gitea-credentials}]
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        seccompProfile: { type: RuntimeDefault }
      containers:
        - name: api
          image: gitea.treytartt.com/admin/honeydue-api:237c6b8
          ports: [{containerPort: 8000}]
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities: { drop: [ALL] }
          envFrom: [{configMapRef: {name: honeydue-config}}]
          env:
            - name: POSTGRES_PASSWORD
              valueFrom: { secretKeyRef: {name: honeydue-secrets, key: POSTGRES_PASSWORD} }
            - name: SECRET_KEY
              valueFrom: { secretKeyRef: {name: honeydue-secrets, key: SECRET_KEY} }
            # ... all other secrets
          volumeMounts:
            - { name: apns-key, mountPath: /secrets/apns, readOnly: true }
            - { name: tmp, mountPath: /tmp }
          resources:
            requests: { cpu: 100m, memory: 128Mi }
            limits: { cpu: 1000m, memory: 512Mi }
          startupProbe: { httpGet: {path: /api/health/, port: 8000}, failureThreshold: 48, periodSeconds: 5 }
          readinessProbe: { httpGet: {path: /api/health/, port: 8000}, initialDelaySeconds: 5, periodSeconds: 10, timeoutSeconds: 5 }
          livenessProbe: { httpGet: {path: /api/health/, port: 8000}, initialDelaySeconds: 30, periodSeconds: 30, timeoutSeconds: 10 }
      volumes:
        - name: apns-key
          secret:
            secretName: honeydue-apns-key
            # One item with both key and path (braces keep them in a single mapping)
            items: [{key: apns_auth_key.p8, path: apns_auth_key.p8}]
        - name: tmp
          emptyDir: {sizeLimit: 64Mi}
```

### Why each setting

**`replicas: 3`** — one per node via anti-affinity rules (not strictly
required but helpful). Three gives us HA (one pod down = two still
serve traffic) and headroom for rolling updates.
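
The anti-affinity rules themselves are not part of the highlights above. The following is a minimal sketch of a soft rule that would spread api pods one per node, assuming the pod template carries the same `app.kubernetes.io/name: api` label that the Service selects on. The weight and the choice of the "preferred" form are assumptions, not taken from the actual manifest. A soft rule fits this setup because, with `maxSurge: 1` on three nodes, the surge pod has to share a node temporarily, which a "required" rule would forbid.

```yaml
# Sketch only: would sit under spec.template.spec of the api Deployment.
affinity:
  podAntiAffinity:
    # "preferred" is a soft rule: the scheduler tries to honor it but can
    # still co-locate two api pods when no other node is available.
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100                            # assumed weight (valid range 1-100)
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname  # spread across nodes
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: api      # assumed pod label
```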

**`maxUnavailable: 0, maxSurge: 1`** — during a rollout, start a 4th
pod before killing any old one, so the service stays at 3 live pods
throughout. `maxUnavailable: 0` gives zero-downtime updates, but only if
the readinessProbe is accurate: a pod that reports ready too early
receives traffic it cannot serve.

**`runAsUser: 1000`** — the `app` user created in the Dockerfile. The
image doesn't run as root.

**`readOnlyRootFilesystem: true`** — prevents an attacker from writing
files to the container's root filesystem. The Go binary doesn't need to
write to `/`; only `/tmp` is mutable.

**`startupProbe.failureThreshold: 48`** (= 48 × 5s = 240s grace) — this
was bumped up from the scaffold default of 12 (60s). Reason: on first
boot, the Go app runs `MigrateWithLock()`, which acquires a Postgres
advisory lock and runs AutoMigrate. The first replica takes ~90s;
subsequent replicas wait on the lock. With 3 replicas all starting
simultaneously and the lock serializing them, 240s is the right grace.
See [Chapter 19](./19-postmortem-swarm.md) for the detailed story.

**`readinessProbe.initialDelaySeconds: 5`** — no readiness checks during
the first 5s after the container starts. Kubernetes counts
`initialDelaySeconds` from container start, and readiness checks stay
disabled until the startupProbe passes, so this delay only matters when
startup finishes in under 5s. It prevents a racy initial failure.

**`livenessProbe.initialDelaySeconds: 30`** — no liveness checks during
the first 30s after the container starts (the delay is not tied to
readiness). Together with the startupProbe gate, this keeps early
false-negative liveness checks from triggering restarts and cascading
failures.

**`resources.requests/limits`** — Kubernetes uses `requests` for
scheduling (how much a pod "reserves") and `limits` for enforcement: a
container that exceeds its CPU limit is throttled, and one that exceeds
its memory limit is OOM-killed. Our api is CPU-bursty for complex query
handling, so we give it a 100m baseline with a 1000m ceiling. The 512Mi
memory ceiling is comfortable; in practice api uses ~100-200Mi.

**`volumes.apns-key`** — mounts the `honeydue-apns-key` Secret as a file
at `/secrets/apns/apns_auth_key.p8`. The `APNS_AUTH_KEY_PATH` env var
points to this path. Even though push is currently disabled, the file
must exist because the Go app may try to stat it on startup.

**`volumes.tmp`** — `emptyDir` with `sizeLimit: 64Mi`. Bounded so a
runaway process can't fill the node's disk; a pod that exceeds the limit
is evicted by the kubelet.

### The Service

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: honeydue
spec:
  type: ClusterIP
  selector: {app.kubernetes.io/name: api}
  ports:
    - port: 8000
      targetPort: 8000
      protocol: TCP
```

ClusterIP `10.43.167.83`. Reachable as `api.honeydue.svc.cluster.local` or
just `api` from inside the namespace.

### HorizontalPodAutoscaler (not yet enabled)

`deploy-k3s/manifests/api/hpa.yaml` defines an HPA that would scale api
between 3 and 6 replicas based on CPU (70% util) and memory (80% util).

**Not currently applied.** `metrics-server` runs but we haven't run
`kubectl apply -f api/hpa.yaml`. TODO in Chapter 20.
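
For orientation, an HPA with those bounds and targets could look roughly like the sketch below. It is reconstructed from the description above, not copied from `hpa.yaml`, so the object name and exact layout are assumptions. Utilization targets are computed against the pods' resource requests, so with a 128Mi memory request and the ~100-200Mi usage noted earlier, the 80% memory target may already be exceeded as soon as the HPA is applied; that is worth checking before enabling it.

```yaml
# Sketch only: the real deploy-k3s/manifests/api/hpa.yaml may differ.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api                # assumed name
  namespace: honeydue
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the CPU request (100m)
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80   # percent of the memory request (128Mi)
```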

## Service 2 — admin (Next.js panel)

### What it does

Server-rendered admin UI. Authenticates admin users against a
separate `admin_users` table in Postgres (seeded with `ADMIN_EMAIL` +
`ADMIN_PASSWORD` on first migration). Lets operators view/manage
users, residences, tasks, subscriptions, etc.

Built as a Next.js 16 standalone server.

### Why 1 replica

Low traffic. It's an internal tool. One pod suffices. If it crashes,
Kubernetes restarts it in ~10s. If the hosting node dies, Kubernetes
reschedules it to another node once the dead node's pods are evicted
(by default about five minutes after the node becomes unreachable).

Running 3 replicas would cost little (Next.js is ~128MB per pod) but
would bring no operational benefit. Revisit if the admin panel becomes
user-facing.

### Deployment highlights

```yaml
replicas: 1
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 1

securityContext:
  runAsNonRoot: true
  runAsUser: 1001   # different from api (1000) for isolation
  runAsGroup: 1001
  fsGroup: 1001

containers:
  - image: gitea.treytartt.com/admin/honeydue-admin:<sha>
    ports: [{containerPort: 3000}]
    env:
      - name: PORT
        value: "3000"
      - name: HOSTNAME
        value: "0.0.0.0"
      - name: NEXT_PUBLIC_API_URL
        valueFrom: {configMapKeyRef: {name: honeydue-config, key: NEXT_PUBLIC_API_URL}}
    volumeMounts:
      - {name: nextjs-cache, mountPath: /app/.next/cache}
      - {name: tmp, mountPath: /tmp}
    resources:
      requests: {cpu: 50m, memory: 64Mi}
      limits: {cpu: 500m, memory: 256Mi}
    startupProbe:
      httpGet: {path: /, port: 3000}   # was /admin/ — wrong for this app (Chapter 19)
      failureThreshold: 24
      periodSeconds: 5
    readinessProbe:
      httpGet: {path: /, port: 3000}
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 5
```

**Probe path `/`** — Next.js serves at root. `/admin/` (scaffold default)
returns 404 and killed the pod repeatedly during initial bring-up.
See Chapter 19 §Admin probe path for the story.

**`runAsUser: 1001`** — different from api's 1000 so that if one
service were compromised, the stolen UID would at least be distinct
from other services' (minor defense-in-depth).

**`nextjs-cache`** — emptyDir mount for Next.js's server-side cache.
Without it, the read-only rootfs would prevent Next from caching
server-rendered pages. Not a persistent volume because cache is
regenerable on restart.

### The Service

```yaml
apiVersion: v1
kind: Service
metadata:
  name: admin
spec:
  type: ClusterIP
  selector: {app.kubernetes.io/name: admin}
  # One port entry with both fields (braces keep them in a single mapping)
  ports: [{port: 3000, targetPort: 3000}]
```

ClusterIP `10.43.136.168`.

## Service 3 — worker (Go + Asynq)

### What it does

Runs scheduled background jobs via [Asynq](https://github.com/hibiken/asynq)
(a Redis-backed job queue for Go):

- **Task reminders** (14:00 UTC daily) — notify users of upcoming tasks
- **Overdue reminders** (15:00 UTC daily) — notify users of overdue tasks
- **Daily digest** (03:00 UTC daily) — summary email per user
- **Onboarding emails** — multi-step drip campaign for new users
- **Cleanup jobs** — expired tokens, stale data

### Why 1 replica (hard requirement)

Asynq uses a `Scheduler` component that does cron-like scheduling. The
Scheduler is **not leader-elected** by default — if you run two, both
fire every cron task. Users get duplicate emails.

The asynq docs cover this: to scale scheduling, migrate to
`PeriodicTaskManager` + `PeriodicTaskConfigProvider` which coordinate
via Redis. Not yet done in our codebase.

Until then: `replicas: 1` is a hard constraint. See the comment in the
deployment manifest:

```yaml
spec:
  # Asynq's Scheduler is a singleton — running >1 replica fires every cron
  # task once per replica (duplicate daily digests, onboarding emails, etc.).
  # Keep at 1 until asynq.PeriodicTaskManager with Redis leader election is
  # wired in cmd/worker/main.go.
  replicas: 1
```

### What happens if the worker pod dies?

- Asynq schedule state is in Redis (which has AOF persistence)
- When a new worker pod starts, it re-registers the scheduler and picks up
  where it left off
- Any job that was in-flight (dequeued but not acknowledged) gets retried
  by Asynq's automatic retry logic (see the `worker.RetryOptions` in the
  Go code)
- Cron jobs that were supposed to fire during the downtime: fire on the
  next tick

A 5-minute worker outage = 5 minutes of delayed jobs. Not great but
acceptable.

### PodDisruptionBudget

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: worker-pdb
spec:
  minAvailable: 0
  selector: {matchLabels: {app.kubernetes.io/name: worker}}
```

`minAvailable: 0` means voluntary disruptions (`kubectl drain`) can take
the worker down. This matches the singleton constraint: there's only one,
it's OK to drain.

### No Service

worker doesn't listen on any HTTP port for application traffic — it's a
queue consumer, not a web server. So there's **no Kubernetes Service**
for it.

(On Swarm we had the worker expose a health endpoint at `:6060/health`;
the k3s scaffold doesn't replicate this. Future work.)
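
If the worker regains a health endpoint, probing it would not require a Service, because the kubelet probes the pod IP directly. The fragment below is a sketch of such a probe. It assumes the worker serves `GET /health` on port 6060 as it did on Swarm, which the current k3s image may not do, and all timing values are illustrative rather than tuned.

```yaml
# Sketch only: would sit under the worker container spec.
livenessProbe:
  httpGet:
    path: /health            # endpoint the Swarm deployment exposed
    port: 6060
  initialDelaySeconds: 10    # assumed value
  periodSeconds: 30          # assumed value
  timeoutSeconds: 5          # assumed value
  failureThreshold: 3        # Kubernetes default
```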

## Service 4 — redis

### What it does

- Caching layer (ETag-based lookups, user session cache)
- Asynq queue backend (job state, scheduled tasks, retry state)

### Why 1 replica

Single-instance Redis with AOF persistence. Not replicated, not
clustered. Downsides:
- Node outage = Redis outage. Once the node is back, the cache
  regenerates and queue state is restored from the AOF file on the PVC.
- No failover. The nodeSelector and the `local-path` PVC pin Redis to one
  node (see Node affinity below), so if that node is lost for good, the
  pod stays Pending instead of restarting elsewhere, and the data on that
  node's disk is gone.

For our scale this is acceptable. Redis holds no authoritative state
(everything that matters is in Postgres). Cache regenerates on first
request; Asynq retries enqueue on failure.

### PVC

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: redis-data
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: local-path
  resources: {requests: {storage: 5Gi}}
```

Uses k3s' built-in `local-path-provisioner`. The PVC binds to a local
directory on the node where the Redis pod lands (`/var/lib/rancher/k3s/storage/`).
`ReadWriteOnce` means the volume can be mounted read-write by a single
node at a time.

### Node affinity

```yaml
nodeSelector:
  honeydue/redis: "true"
```

We labeled `ubuntu-8gb-nbg1-2` (hetzner1) with `honeydue/redis=true` so
Redis always lands there. This ensures the PVC finds its backing
storage (since PVCs with `local-path` are per-node).

```bash
kubectl label node ubuntu-8gb-nbg1-2 honeydue/redis=true --overwrite
```

### Why not Redis Sentinel / Cluster

Complexity. At our scale (~a few req/s, kilobytes of cache), a single
Redis does fine. If Redis becomes critical-path for availability, we'd:
- Use a managed Redis (Upstash, Dragonfly Cloud) — $5-15/mo, their problem
- Or run Redis Sentinel with 3 replicas — manageable but operational work

Neither is needed yet.

### Redis config

From the deployment:

```yaml
command:
  - sh
  - -c
  - |
    ARGS="--appendonly yes --appendfsync everysec --maxmemory 256mb --maxmemory-policy noeviction"
    if [ -n "$REDIS_PASSWORD" ]; then
      ARGS="$ARGS --requirepass $REDIS_PASSWORD"
    fi
    exec redis-server $ARGS
```

Settings:
- **`--appendonly yes --appendfsync everysec`** — AOF persistence,
  fsync every second. Survives restarts with up to 1 second of data
  loss.
- **`--maxmemory 256mb`** — Redis will refuse new data if it grows past
  256 MB. Gives us a safety cap.
- **`--maxmemory-policy noeviction`** — we'd rather get errors than
  silently drop data. This is the right choice when Redis holds queue
  state (losing a queue item silently = missed job).

The `REDIS_PASSWORD` env var is optional. Currently empty (no auth). The
Redis pod is only reachable from inside the cluster network, and our
NetworkPolicies (once enabled) would further restrict which pods can
connect to it.
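
As an illustration of what such a policy could look like, the sketch below admits traffic to Redis only from the api and worker pods on port 6379. It assumes the Redis pod is labeled `app.kubernetes.io/name: redis`, in line with the other services, and the policy name is hypothetical. The cluster's actual NetworkPolicies may be written differently.

```yaml
# Sketch only: not taken from the repository's manifests.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: redis-ingress                   # hypothetical name
  namespace: honeydue
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: redis     # assumed label on the Redis pod
  policyTypes: [Ingress]
  ingress:
    - from:
        # Only the two Redis clients described in this chapter
        - podSelector:
            matchLabels: {app.kubernetes.io/name: api}
        - podSelector:
            matchLabels: {app.kubernetes.io/name: worker}
      ports:
        - protocol: TCP
          port: 6379
```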

## Resource summary

Per-pod requests and limits for each service, with total requests summed
across replicas:

| Service | CPU requests | CPU limits | Memory requests | Memory limits | Replicas |
|---|---|---|---|---|---|
| api | 100m | 1000m | 128Mi | 512Mi | 3 |
| admin | 50m | 500m | 64Mi | 256Mi | 1 |
| worker | 50m | 500m | 64Mi | 256Mi | 1 |
| redis | 100m | 500m | 128Mi | 512Mi | 1 |
| traefik (kube-system) | ~100m | unlimited | ~50Mi | unlimited | 3 |
| **Total requests** | **~800m** | | **~790Mi** | | |

Each node has 4000m CPU + 8192Mi memory. Total cluster capacity is
12000m + 24576Mi. Requests come to roughly 7% of CPU and 3% of memory,
which leaves plenty of headroom.

## Health check semantics

Kubernetes distinguishes three probe types:

- **startupProbe** — is the container done starting? Runs until it passes
  once, then stops. While running, the other probes are disabled.
  Failing startupProbe = container killed and restarted.
- **readinessProbe** — is the container ready to serve traffic? A failing
  pod is removed from Service endpoints (traffic stops flowing to it)
  but the pod keeps running.
- **livenessProbe** — is the container healthy? A failing pod is killed
  and restarted.

### Why we tuned startupProbe separately

The api's first-boot migration takes 90–240s. If we relied on the
livenessProbe alone (30s initial delay, 30s period, default
failureThreshold of 3), the container would be killed within roughly two
minutes, before a slow migration finishes. startupProbe lets us
give generous first-boot grace (240s) without affecting the sharper
ongoing readiness/liveness checks.

### Probe path design

Each service's `/health` endpoint should be:
- Cheap (no DB query, no external call)
- Fast (< 100ms)
- Honest (returns 200 iff the process can serve)

Our api's `/api/health/` does a trivial check. It does NOT verify Postgres
connectivity (to avoid cascading DB failures tearing down all api pods).
If Postgres is down, api pods stay "ready" and return 5xx for actual
endpoints — that's the right behavior.

## Log routing

All container logs go to stdout/stderr. containerd captures them to
`/var/log/containers/` on the node. `kubectl logs` fetches them through
the API server's `/api/v1/namespaces/<namespace>/pods/<pod>/log`
endpoint, which proxies the request to the kubelet on that node.

We have **no log aggregation** in the cluster (no Loki, no ELK, no
Datadog). For debugging we use:

```bash
kubectl logs -n honeydue deploy/api -f --prefix
kubectl logs -n honeydue deploy/api --previous   # previous container instance (after a restart)
```

See [Chapter 15](./15-observability.md).

## Rolling update semantics

When you push a new image and run `kubectl set image` or `kubectl apply`
with a new image tag:

1. Kubernetes creates a new ReplicaSet with the new image
2. Starts 1 new pod (per `maxSurge: 1`)
3. Waits for it to pass readinessProbe
4. Removes 1 pod from the old ReplicaSet
5. Repeats until all N pods are on the new ReplicaSet
6. Old ReplicaSet stays around (for rollback) with 0 replicas

For api (3 replicas): total rollout time is roughly
`3 × (pod_startup_time + small_buffer)` = ~15 minutes in the cold-boot
case, seconds for warm updates where migrations are no-op.

During the rollout:
- Service endpoint set updates as pods become ready
- kube-proxy IPVS is reprogrammed on each node
- Traefik's connection pool to the Service invalidates gradually

Users see no downtime if the new image is compatible. If it's broken:

```bash
kubectl rollout undo deployment/api -n honeydue
```

Reverts to the previous ReplicaSet. Typically takes 30 seconds to
stabilize.

## Why no StatefulSet

For Redis (the only stateful thing we run), we use a Deployment + PVC.
StatefulSet is designed for:
- Ordered startup (pod-0 before pod-1)
- Stable hostnames (pod-0 gets DNS name `redis-0.redis`)
- Per-replica PVCs

We have one Redis replica. None of those features matter for a
singleton. Deployment + PVC + nodeSelector is simpler and equivalent.

If we ever run Redis Sentinel or Cluster, we'd migrate to StatefulSet.

## Operator cheat sheet

```bash
# See all pods in honeydue namespace
kubectl get pods -n honeydue -o wide

# Per-service rollout status
kubectl rollout status deployment/api -n honeydue

# Scale a service
kubectl scale deployment/api -n honeydue --replicas=5

# Restart all pods (e.g., to re-read a configmap)
kubectl rollout restart deployment/api -n honeydue

# Exec into a pod
kubectl exec -it -n honeydue deploy/admin -- /bin/sh

# Describe a pod (shows events, probe state, restarts)
kubectl describe pod -n honeydue <pod-name>

# Resource usage
kubectl top pods -n honeydue
```

## References

- [Kubernetes Deployments][deploy]
- [Pod lifecycle + probes][probes]
- [Asynq scheduler limitations][asynq-sched]
- [K3s local-path provisioner][k3s-lp]

[deploy]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
[probes]: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-lifecycle
[asynq-sched]: https://github.com/hibiken/asynq/wiki/Periodic-Tasks
[k3s-lp]: https://docs.k3s.io/storage#setting-up-the-local-storage-provider