# 07 — Services
## Summary
Four workloads run in the `honeydue` namespace: **api** (Go REST API, 3
replicas), **admin** (Next.js panel, 1 replica), **worker** (Go background
jobs, 1 replica), and **redis** (cache + job queue, 1 replica, PVC-backed).
This chapter deep-dives each: container image, resource limits, probes,
volumes, and why each knob is set the way it is.
## Overview
| Service | Image | Replicas | Ports | Role |
|---|---|---|---|---|
| `api` | `gitea.treytartt.com/admin/honeydue-api:<sha>` | 3 | 8000 | HTTP REST API |
| `admin` | `gitea.treytartt.com/admin/honeydue-admin:<sha>` | 1 | 3000 | Next.js admin panel |
| `worker` | `gitea.treytartt.com/admin/honeydue-worker:<sha>` | 1 | — | Background job processor |
| `redis` | `redis:7-alpine` | 1 | 6379 | Cache + Asynq queue |

All four are Kubernetes `Deployment` workloads (not StatefulSets, not
DaemonSets). They share:
- ServiceAccount with `automountServiceAccountToken: false` (Chapter 5)
- `imagePullSecrets: [gitea-credentials]` (Chapter 11)
- `envFrom: configMapRef: honeydue-config` (Chapter 10)
- Individual env vars wired to `honeydue-secrets` keys
- Read-only root filesystem with `tmp` emptyDir mounted at `/tmp`
## Service 1 — api (Go REST API)
### What it does
The Go HTTP API is the heart of the app. It has handlers for user auth,
residences, tasks, contractors, documents, subscriptions, notifications,
etc. It reads and writes Neon Postgres and the Redis cache, and reads
from Backblaze B2.

It also serves a marketing landing page at `/` (static HTML + CSS from
`/app/static/`). This is why the `myhoneydue.com` apex domain routes to
the api service (Chapter 6).
### Deployment spec highlights
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    spec:
      serviceAccountName: api
      imagePullSecrets: [{name: gitea-credentials}]
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        seccompProfile: { type: RuntimeDefault }
      containers:
        - name: api
          image: gitea.treytartt.com/admin/honeydue-api:237c6b8
          ports: [{containerPort: 8000}]
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities: { drop: [ALL] }
          envFrom: [{configMapRef: {name: honeydue-config}}]
          env:
            - name: POSTGRES_PASSWORD
              valueFrom: { secretKeyRef: {name: honeydue-secrets, key: POSTGRES_PASSWORD} }
            - name: SECRET_KEY
              valueFrom: { secretKeyRef: {name: honeydue-secrets, key: SECRET_KEY} }
            # ... all other secrets
          volumeMounts:
            - { name: apns-key, mountPath: /secrets/apns, readOnly: true }
            - { name: tmp, mountPath: /tmp }
          resources:
            requests: { cpu: 100m, memory: 128Mi }
            limits: { cpu: 1000m, memory: 512Mi }
          startupProbe: { httpGet: {path: /api/health/, port: 8000}, failureThreshold: 48, periodSeconds: 5 }
          readinessProbe: { httpGet: {path: /api/health/, port: 8000}, initialDelaySeconds: 5, periodSeconds: 10, timeoutSeconds: 5 }
          livenessProbe: { httpGet: {path: /api/health/, port: 8000}, initialDelaySeconds: 30, periodSeconds: 30, timeoutSeconds: 10 }
      volumes:
        - name: apns-key
          secret:
            secretName: honeydue-apns-key
            items: [{key: apns_auth_key.p8, path: apns_auth_key.p8}]
        - name: tmp
          emptyDir: {sizeLimit: 64Mi}
```
### Why each setting
**`replicas: 3`**: one per node via anti-affinity rules (not strictly
required but helpful). Three gives us HA (one pod down = two still
serve traffic) and headroom for rolling updates.

**`maxUnavailable: 0, maxSurge: 1`**: during a rollout, start a 4th
pod before killing any old one. This keeps the service at 3 live
pods throughout. `maxUnavailable: 0` gives zero-downtime updates, but
only if the readinessProbe is accurate.

**`runAsUser: 1000`**: the `app` user created in the Dockerfile. The
image doesn't run as root.

**`readOnlyRootFilesystem: true`**: prevents any attacker-introduced
file writes to the image layer. The Go binary doesn't need to write to
`/`; only `/tmp` is mutable.

**`startupProbe.failureThreshold: 48`** (= 48 × 5s = 240s grace): this
was bumped up from the scaffold default of 12. Reason: on first boot,
the Go app runs `MigrateWithLock()`, which acquires a Postgres advisory
lock and runs AutoMigrate. The first replica takes ~90s; subsequent
replicas wait on the lock. With 3 replicas all starting simultaneously
and the lock serializing them, 240s is the right grace. See
[Chapter 19](./19-postmortem-swarm.md) for the detailed story.

**`readinessProbe.initialDelaySeconds: 5`**: after the startupProbe
passes, wait 5s before starting readiness checks. Prevents a racy
initial failure.

**`livenessProbe.initialDelaySeconds: 30`**: liveness checks don't begin
until a 30s initial delay has elapsed. The delay does not depend on
readiness; the two probes run independently once the startupProbe has
succeeded. It avoids restarts triggered by false-negative liveness
checks right after startup.

**`resources.requests/limits`**: Kubernetes uses `requests` for
scheduling (how much a pod "reserves") and `limits` for enforcement
(the maximum it can use before CPU throttling or an OOM kill). Our api
is CPU-bursty for complex query handling, so we give it a 100m baseline
with a 1000m ceiling. The 512Mi memory ceiling is comfortable; in
practice api uses ~100-200Mi.

**`volumes.apns-key`**: mounts the `honeydue-apns-key` Secret as a file
at `/secrets/apns/apns_auth_key.p8`. The `APNS_AUTH_KEY_PATH` env var
points to this path. Even though push is currently disabled, the file
must exist because the Go app may try to stat it on startup.

**`volumes.tmp`**: `emptyDir` with `sizeLimit: 64Mi`. Bounded so a
runaway process can't fill the node's disk.
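
The startup grace windows quoted in this chapter follow from `failureThreshold × periodSeconds`. A small sketch to recompute them when tuning a probe (the helper name `probe_grace` is illustrative and not part of the repo):

```bash
# probe_grace FAILURE_THRESHOLD PERIOD_SECONDS
# Prints the startupProbe grace window in seconds.
probe_grace() {
  echo $(( $1 * $2 ))
}

probe_grace 12 5   # scaffold default: 60
probe_grace 24 5   # admin:            120
probe_grace 48 5   # api:              240
```
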
### The Service
```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: honeydue
spec:
  type: ClusterIP
  selector: {app.kubernetes.io/name: api}
  ports:
    - port: 8000
      targetPort: 8000
      protocol: TCP
```
ClusterIP `10.43.167.83`. Reachable as `api.honeydue.svc.cluster.local` or
just `api` from inside the namespace.
### HorizontalPodAutoscaler (not yet enabled)
`deploy-k3s/manifests/api/hpa.yaml` defines an HPA that would scale api
between 3 and 6 replicas based on CPU (70% util) and memory (80% util).
**Not currently applied.** `metrics-server` runs but we haven't run
`kubectl apply -f api/hpa.yaml`. TODO in Chapter 20.
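
For intuition about what the HPA would do once applied: the Kubernetes documentation gives the core rule as `desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)`, clamped to the configured minimum and maximum. A minimal sketch in shell arithmetic, using the 3-6 replica bounds and 70% CPU target described above. The function name `hpa_desired` is hypothetical, and the real controller also applies a tolerance band and stabilization windows that this sketch ignores.

```bash
# hpa_desired CURRENT_REPLICAS CURRENT_UTIL_PCT TARGET_UTIL_PCT MIN MAX
hpa_desired() {
  current=$1 util=$2 target=$3 min=$4 max=$5
  # Integer ceiling of current * util / target.
  desired=$(( (current * util + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}

hpa_desired 3 90 70 3 6    # 3 pods at 90% CPU: 4
hpa_desired 3 200 70 3 6   # clamped to the max: 6
hpa_desired 3 20 70 3 6    # clamped to the min: 3
```
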
## Service 2 — admin (Next.js panel)
### What it does
Server-rendered admin UI. Authenticates admin users against a
separate `admin_users` table in Postgres (seeded with `ADMIN_EMAIL` +
`ADMIN_PASSWORD` on first migration). Lets operators view/manage
users, residences, tasks, subscriptions, etc.
Built as a Next.js 16 standalone server.
### Why 1 replica
Low traffic. It's an internal tool. One pod suffices. If it crashes,
Kubernetes restarts it in ~10s. If the hosting node dies, Kubernetes
reschedules to another node.

The cost of running 3 replicas is tiny (Next.js is ~128MB per pod) but
has no operational benefit. When the admin panel becomes user-facing,
revisit.
### Deployment highlights
```yaml
replicas: 1
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 1
securityContext:
  runAsNonRoot: true
  runAsUser: 1001     # different from api (1000) for isolation
  runAsGroup: 1001
  fsGroup: 1001
containers:
  - image: gitea.treytartt.com/admin/honeydue-admin:<sha>
    ports: [{containerPort: 3000}]
    env:
      - name: PORT
        value: "3000"
      - name: HOSTNAME
        value: "0.0.0.0"
      - name: NEXT_PUBLIC_API_URL
        valueFrom: {configMapKeyRef: {name: honeydue-config, key: NEXT_PUBLIC_API_URL}}
    volumeMounts:
      - {name: nextjs-cache, mountPath: /app/.next/cache}
      - {name: tmp, mountPath: /tmp}
    resources:
      requests: {cpu: 50m, memory: 64Mi}
      limits: {cpu: 500m, memory: 256Mi}
    startupProbe:
      httpGet: {path: /, port: 3000}   # was /admin/; wrong for this app (Chapter 19)
      failureThreshold: 24
      periodSeconds: 5
    readinessProbe:
      httpGet: {path: /, port: 3000}
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 5
```
**Probe path `/`**: Next.js serves at root. `/admin/` (scaffold default)
returned 404 and killed the pod repeatedly during initial bring-up.
See Chapter 19 §Admin probe path for the story.

**`runAsUser: 1001`**: different from api's 1000 so that if one
service were compromised, the stolen UID would at least be distinct
from other services' (minor defense-in-depth).

**`nextjs-cache`**: emptyDir mount for Next.js's server-side cache.
Without it, the read-only rootfs would prevent Next from caching
server-rendered pages. Not a persistent volume because the cache is
regenerable on restart.
### The Service
```yaml
apiVersion: v1
kind: Service
metadata:
  name: admin
spec:
  type: ClusterIP
  selector: {app.kubernetes.io/name: admin}
  ports: [{port: 3000, targetPort: 3000}]
```
ClusterIP `10.43.136.168`.
## Service 3 — worker (Go + Asynq)
### What it does
Runs scheduled background jobs via [Asynq](https://github.com/hibiken/asynq)
(a Redis-backed job queue for Go):
- **Task reminders** (14:00 UTC daily) — notify users of upcoming tasks
- **Overdue reminders** (15:00 UTC daily) — notify users of overdue tasks
- **Daily digest** (03:00 UTC daily) — summary email per user
- **Onboarding emails** — multi-step drip campaign for new users
- **Cleanup jobs** — expired tokens, stale data
### Why 1 replica (hard requirement)
Asynq uses a `Scheduler` component that does cron-like scheduling. The
Scheduler is **not leader-elected** by default: if you run two, both
fire every cron task. Users get duplicate emails.

The asynq docs cover this: to scale scheduling, migrate to
`PeriodicTaskManager` + `PeriodicTaskConfigProvider` which coordinate
via Redis. Not yet done in our codebase.

Until then: `replicas: 1` is a hard constraint. See the comment in the
deployment manifest:
```yaml
spec:
  # Asynq's Scheduler is a singleton; running >1 replica fires every cron
  # task once per replica (duplicate daily digests, onboarding emails, etc.).
  # Keep at 1 until asynq.PeriodicTaskManager with Redis leader election is
  # wired in cmd/worker/main.go.
  replicas: 1
```
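
Nothing in Kubernetes itself enforces the singleton rule, so a deploy script could check it explicitly. A minimal sketch, assuming a hypothetical helper `assert_singleton` (not part of the repo) that is given the replica count read from the cluster:

```bash
# assert_singleton NAME REPLICAS
# Succeeds only when the replica count is exactly 1.
assert_singleton() {
  name=$1 replicas=$2
  if [ "$replicas" -ne 1 ]; then
    echo "ERROR: $name has $replicas replicas; the Asynq Scheduler must be a singleton" >&2
    return 1
  fi
  echo "OK: $name is a singleton"
}

# Usage against the live cluster (not executed here):
#   assert_singleton worker \
#     "$(kubectl get deploy worker -n honeydue -o jsonpath='{.spec.replicas}')"
```

Failing the deploy early is cheaper than finding out from duplicate digest emails.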
### What happens if the worker pod dies?
- Asynq schedule state is in Redis (which has AOF persistence)
- When a new worker pod starts, it re-registers the scheduler and picks up
  where it left off
- Any job that was in-flight (dequeued but not acknowledged) gets retried
  by Asynq's automatic retry logic (see the `worker.RetryOptions` in the
  Go code)
- Cron jobs that were supposed to fire during the downtime fire on their
  next scheduled tick

A 5-minute worker outage means queued jobs run up to 5 minutes late. Not
great but acceptable.
### PodDisruptionBudget
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: worker-pdb
spec:
  minAvailable: 0
  selector: {matchLabels: {app.kubernetes.io/name: worker}}
```
`minAvailable: 0` means voluntary disruptions (`kubectl drain`) can take
the worker down. This matches the singleton constraint: there's only one,
it's OK to drain.
### No Service
worker doesn't listen on any HTTP port for application traffic — it's a
queue consumer, not a web server. So there's **no Kubernetes Service**
for it.

(On Swarm we had the worker expose a health endpoint at `:6060/health`;
the k3s scaffold doesn't replicate this. Future work.)
## Service 4 — redis
### What it does
- Caching layer (ETag-based lookups, user session cache)
- Asynq queue backend (job state, scheduled tasks, retry state)
### Why 1 replica
Single-instance Redis with AOF persistence. Not replicated, not
clustered. Downsides:
- Node outage = Redis outage. When the node comes back, the cache
  regenerates and queue state is preserved by AOF on the PVC.
- No failover. The PVC is local-path (per-node) and the pod is pinned by
  a nodeSelector (see Node affinity below), so if the node hosting Redis
  is lost for good, Redis cannot simply restart elsewhere. It has to be
  re-pinned to another node with a fresh PVC, and the data on the old
  node is gone.

For our scale this is acceptable. Redis holds no authoritative state
(everything that matters is in Postgres). Cache regenerates on first
request; Asynq retries enqueue on failure.
### PVC
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: redis-data
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: local-path
  resources: {requests: {storage: 5Gi}}
```
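Uses k3s' built-in `local-path-provisioner`. The PVC binds to a local
directory on the node where the Redis pod lands (`/var/lib/rancher/k3s/storage/`).
`ReadWriteOnce` means the volume can be mounted read-write by a single
node at a time (pods on that same node could share it; `ReadWriteOncePod`
is the mode that restricts a volume to one pod).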
### Node affinity
```yaml
nodeSelector:
  honeydue/redis: "true"
```
We labeled `ubuntu-8gb-nbg1-2` (hetzner1) with `honeydue/redis=true` so
Redis always lands there. This ensures the PVC finds its backing
storage (since PVCs with `local-path` are per-node).
```bash
kubectl label node ubuntu-8gb-nbg1-2 honeydue/redis=true --overwrite
```
### Why not Redis Sentinel / Cluster
Complexity. At our scale (~a few req/s, kilobytes of cache), a single
Redis does fine. If Redis becomes critical-path for availability, we'd:
- Use a managed Redis (Upstash, Dragonfly Cloud): $5-15/mo, their problem
- Or run Redis Sentinel with 3 replicas: manageable but operational work

Neither is needed yet.
### Redis config
From the deployment:
```yaml
command:
  - sh
  - -c
  - |
    ARGS="--appendonly yes --appendfsync everysec --maxmemory 256mb --maxmemory-policy noeviction"
    if [ -n "$REDIS_PASSWORD" ]; then
      ARGS="$ARGS --requirepass $REDIS_PASSWORD"
    fi
    exec redis-server $ARGS
```
Settings:
- **`--appendonly yes --appendfsync everysec`**: AOF persistence,
  fsync every second. Survives restarts with up to 1 second of data
  loss.
- **`--maxmemory 256mb`**: a safety cap. Combined with `noeviction`,
  Redis rejects writes with an error once the dataset reaches 256 MB.
- **`--maxmemory-policy noeviction`**: we'd rather get errors than
  silently drop data. This is the right choice when Redis holds queue
  state (losing a queue item silently = missed job).

The `REDIS_PASSWORD` env var is optional. Currently empty (no auth). The
Redis pod is only reachable from inside the cluster's overlay network,
and our NetworkPolicies (once enabled) would restrict access further.
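
One caveat about the startup script above: it collects its arguments in a single string, so `$ARGS` is word-split before `redis-server` sees it. That is harmless with an empty or simple password, but a password containing spaces would be split into several arguments. A minimal alternative sketch, not what the manifest currently runs, builds the arguments as positional parameters instead (the function name `redis_args` is hypothetical):

```bash
# Build redis-server arguments as positional parameters so each one
# stays a single word, even if the password contains spaces.
redis_args() {
  set -- --appendonly yes --appendfsync everysec \
         --maxmemory 256mb --maxmemory-policy noeviction
  if [ -n "${REDIS_PASSWORD:-}" ]; then
    set -- "$@" --requirepass "$REDIS_PASSWORD"
  fi
  # Print one argument per line for inspection; in the container this
  # line would instead be: exec redis-server "$@"
  printf '%s\n' "$@"
}

redis_args
```
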
## Resource summary
Per-pod requests and limits for each service, with replica counts:

| Service | CPU requests | CPU limits | Memory requests | Memory limits | Replicas |
|---|---|---|---|---|---|
| api | 100m | 1000m | 128Mi | 512Mi | 3 |
| admin | 50m | 500m | 64Mi | 256Mi | 1 |
| worker | 50m | 500m | 64Mi | 256Mi | 1 |
| redis | 100m | 500m | 128Mi | 512Mi | 1 |
| traefik (kube-system) | ~100m | unlimited | ~50Mi | unlimited | 3 |
| **Total requests** | **~800m** | | **~790Mi** | | |

The totals multiply each row by its replica count. Each node has 4000m
CPU + 8192Mi memory, so total cluster capacity is 12000m + 24576Mi.
Requests take roughly 7% of CPU and 3% of memory, which leaves plenty of
headroom.
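
The totals can be rechecked by multiplying each row's per-pod request by its replica count. A minimal sketch with `awk`, using the figures from the table (the traefik numbers are the table's approximations; `rows` and `resource_totals` are illustrative names):

```bash
# Columns: service, CPU request (m), memory request (Mi), replicas.
rows="api 100 128 3
admin 50 64 1
worker 50 64 1
redis 100 128 1
traefik 100 50 3"

# Sum request * replicas over all rows.
resource_totals() {
  echo "$rows" | awk '{ cpu += $2 * $4; mem += $3 * $4 }
    END { printf "%dm %dMi\n", cpu, mem }'
}

# Same sums as a share of cluster capacity (3 nodes x 4000m / 8192Mi).
resource_share() {
  echo "$rows" | awk '{ cpu += $2 * $4; mem += $3 * $4 }
    END { printf "cpu %.1f%% mem %.1f%%\n", 100 * cpu / 12000, 100 * mem / 24576 }'
}

resource_totals
resource_share
```
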
## Health check semantics
Kubernetes distinguishes three probe types:
- **startupProbe** — is the container done starting? Runs until it passes
once, then stops. While running, the other probes are disabled.
Failing startupProbe = container killed and restarted.
- **readinessProbe** — is the container ready to serve traffic? A failing
pod is removed from Service endpoints (traffic stops flowing to it)
but the pod keeps running.
- **livenessProbe** — is the container healthy? A failing pod is killed
and restarted.
### Why we tuned startupProbe separately
The api's first-boot migration takes 90-240s. If we relied only on a
livenessProbe with a short initial delay and the default failureThreshold
of 3, the pod would be killed before migration finishes. (A failing
readinessProbe never kills a pod; it only removes it from Service
endpoints.) startupProbe lets us give generous first-boot grace (240s)
without affecting the sharper ongoing readiness/liveness checks.
### Probe path design
Each service's `/health` endpoint should be:
- Cheap (no DB query, no external call)
- Fast (< 100ms)
- Honest (returns 200 iff the process can serve)

Our api's `/api/health/` does a trivial check. It does NOT verify Postgres
connectivity (to avoid cascading DB failures tearing down all api pods).
If Postgres is down, api pods stay "ready" and return 5xx for actual
endpoints — that's the right behavior.
## Log routing
All container logs go to stdout/stderr. The container runtime writes them
under `/var/log/pods/` on the node, with symlinks in
`/var/log/containers/`. `kubectl logs` fetches them through the API
server's `/api/v1/namespaces/<namespace>/pods/<pod>/log` endpoint, which
proxies the request to the kubelet on that node.

We have **no log aggregation** in the cluster (no Loki, no ELK, no
Datadog). For debugging we use:
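```bash
# Follow logs; with deploy/api, kubectl picks one pod of the Deployment
kubectl logs -n honeydue deploy/api -f --prefix
# Logs from the previous container instance (e.g. after a crash/restart)
kubectl logs -n honeydue deploy/api --previous
```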
See [Chapter 15](./15-observability.md).
## Rolling update semantics
When you push a new image and run `kubectl set image` or `kubectl apply`
with a new image tag:
1. Kubernetes creates a new ReplicaSet with the new image
2. Starts 1 new pod (per `maxSurge: 1`)
3. Waits for it to pass readinessProbe
4. Removes 1 pod from the old ReplicaSet
5. Repeats until all N pods are on the new ReplicaSet
6. Old ReplicaSet stays around (for rollback) with 0 replicas

For api (3 replicas): total rollout time is roughly
`3 × (pod_startup_time + small_buffer)` = ~15 minutes in the cold-boot
case, seconds for warm updates where migrations are a no-op.
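
A small sketch of that estimate, for plugging in other values. It assumes pods are replaced strictly one at a time (which `maxSurge: 1` with `maxUnavailable: 0` implies); the helper name `rollout_seconds` and the buffer values are illustrative assumptions, not measurements:

```bash
# rollout_seconds REPLICAS STARTUP_SECONDS BUFFER_SECONDS
# Upper-bound estimate for a rollout that replaces one pod at a time.
rollout_seconds() {
  echo $(( $1 * ($2 + $3) ))
}

rollout_seconds 3 240 30   # cold boot, full 240s startup grace: 810
rollout_seconds 3 10 5     # warm update, assuming ~10s startup: 45
```
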
During the rollout:
- Service endpoint set updates as pods become ready
- kube-proxy IPVS is reprogrammed on each node
- Traefik's connection pool to the Service invalidates gradually

Users see no downtime if the new image is compatible. If it's broken:
```bash
kubectl rollout undo deployment/api -n honeydue
```
Reverts to the previous ReplicaSet. Typically takes 30 seconds to
stabilize.
## Why no StatefulSet
For Redis (the only stateful thing we run), we use a Deployment + PVC.
StatefulSet is designed for:
- Ordered startup (pod-0 before pod-1)
- Stable hostnames (pod-0 gets DNS name `redis-0.redis`)
- Per-replica PVCs

We have one Redis replica. None of those features matter for a
singleton. Deployment + PVC + nodeSelector is simpler and equivalent.
If we ever run Redis Sentinel or Cluster, we'd migrate to StatefulSet.
## Operator cheat sheet
```bash
# See all pods in honeydue namespace
kubectl get pods -n honeydue -o wide
# Per-service rollout status
kubectl rollout status deployment/api -n honeydue
# Scale a service
kubectl scale deployment/api -n honeydue --replicas=5
# Restart all pods (e.g., to re-read a configmap)
kubectl rollout restart deployment/api -n honeydue
# Exec into a pod
kubectl exec -it -n honeydue deploy/admin -- /bin/sh
# Describe a pod (shows events, probe state, restarts)
kubectl describe pod -n honeydue <pod-name>
# Resource usage
kubectl top pods -n honeydue
```
## References
- [Kubernetes Deployments][deploy]
- [Pod lifecycle + probes][probes]
- [Asynq scheduler limitations][asynq-sched]
- [K3s local-path provisioner][k3s-lp]

[deploy]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
[probes]: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-lifecycle
[asynq-sched]: https://github.com/hibiken/asynq/wiki/Periodic-Tasks
[k3s-lp]: https://docs.k3s.io/storage#setting-up-the-local-storage-provider