honeyDueAPI/docs/deployment/07-services.md
Trey t 6f303dbbaa
Migrate prod deploy from Swarm to K3s; add full deployment book
Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
  callback (was causing 'unlock of unlocked mutex' fatal after
  Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
  + allowlist fonts.googleapis.com so the marketing landing page CSS
  actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
  --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
  images runnable on x86_64 Hetzner nodes; fix array expansion under
  set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
  top-level aliases (the '\${X_SECRET}' form never actually resolved);
  dozzle ports: long-form host_ip is rejected by Swarm, switched to
  short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
  worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
  (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
  12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
  real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
  and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
  hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
  without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations,
    consolidated citations
- README.md: Production Deployment section replaced with pointer to
  the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 07:20:54 -05:00


07 — Services

Summary

Four workloads run in the honeydue namespace: api (Go REST API, 3 replicas), admin (Next.js panel, 1 replica), worker (Go background jobs, 1 replica), and redis (cache + job queue, 1 replica, PVC-backed). This chapter deep-dives each: container image, resource limits, probes, volumes, and why each knob is set the way it is.

Overview

| Service | Image | Replicas | Ports | Role |
|---|---|---|---|---|
| api | gitea.treytartt.com/admin/honeydue-api:<sha> | 3 | 8000 | HTTP REST API |
| admin | gitea.treytartt.com/admin/honeydue-admin:<sha> | 1 | 3000 | Next.js admin panel |
| worker | gitea.treytartt.com/admin/honeydue-worker:<sha> | 1 | none | Background job processor |
| redis | redis:7-alpine | 1 | 6379 | Cache + Asynq queue |

All four are Kubernetes Deployment workloads (not StatefulSets, not DaemonSets). They share:

  • ServiceAccount with automountServiceAccountToken: false (Chapter 5)
  • imagePullSecrets: [gitea-credentials] (Chapter 11)
  • envFrom: configMapRef: honeydue-config (Chapter 10)
  • Individual env vars wired to honeydue-secrets keys
  • Read-only root filesystem with tmp emptyDir mounted at /tmp

Service 1 — api (Go REST API)

What it does

The Go HTTP API — the heart of the app. Handlers for user auth, residences, tasks, contractors, documents, subscriptions, notifications, etc. Reads/writes to Neon Postgres, reads/writes to Redis cache, reads from Backblaze B2.

Also serves a marketing landing page at / (static HTML + CSS from /app/static/). This is why the myhoneydue.com apex domain routes to the api service (Chapter 6).

Deployment spec highlights

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    spec:
      serviceAccountName: api
      imagePullSecrets: [name: gitea-credentials]
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        seccompProfile: { type: RuntimeDefault }
      containers:
        - name: api
          image: gitea.treytartt.com/admin/honeydue-api:237c6b8
          ports: [containerPort: 8000]
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities: { drop: [ALL] }
          envFrom: [configMapRef: {name: honeydue-config}]
          env:
            - name: POSTGRES_PASSWORD
              valueFrom: { secretKeyRef: {name: honeydue-secrets, key: POSTGRES_PASSWORD} }
            - name: SECRET_KEY
              valueFrom: { secretKeyRef: {name: honeydue-secrets, key: SECRET_KEY} }
            # ... all other secrets
          volumeMounts:
            - { name: apns-key, mountPath: /secrets/apns, readOnly: true }
            - { name: tmp, mountPath: /tmp }
          resources:
            requests: { cpu: 100m, memory: 128Mi }
            limits:   { cpu: 1000m, memory: 512Mi }
          startupProbe: { httpGet: {path: /api/health/, port: 8000}, failureThreshold: 48, periodSeconds: 5 }
          readinessProbe: { httpGet: {path: /api/health/, port: 8000}, initialDelaySeconds: 5, periodSeconds: 10, timeoutSeconds: 5 }
          livenessProbe: { httpGet: {path: /api/health/, port: 8000}, initialDelaySeconds: 30, periodSeconds: 30, timeoutSeconds: 10 }
      volumes:
        - name: apns-key
          secret:
            secretName: honeydue-apns-key
            items: [{key: apns_auth_key.p8, path: apns_auth_key.p8}]
        - name: tmp
          emptyDir: {sizeLimit: 64Mi}

Why each setting

replicas: 3 — one per node via anti-affinity rules (not strictly required but helpful). Three gives us HA (one pod down = two still serve traffic) and headroom for rolling updates.

maxUnavailable: 0, maxSurge: 1 — during a rollout, start a 4th pod before killing any old one. Ensures the service stays at 3 live pods throughout. maxUnavailable: 0 means zero downtime updates — but depends on readinessProbe being accurate.

runAsUser: 1000 — the app user created in the Dockerfile. Image doesn't run as root.

readOnlyRootFilesystem: true — prevents any attacker-introduced file writes to the image layer. Go binary doesn't need to write to /; only /tmp is mutable.

startupProbe.failureThreshold: 48 (= 48 × 5s = 240s grace) — this was bumped up from the scaffold default of 12. Reason: on first boot, the Go app runs MigrateWithLock() which acquires a Postgres advisory lock and runs AutoMigrate. First replica takes ~90s; subsequent replicas wait on the lock. With 3 replicas all starting simultaneously and the lock serializing them, 240s is the right grace. See Chapter 19 for the detailed story.
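The lock-based serialization described above can be sketched roughly as follows. This is an illustration only, not the project's MigrateWithLock(): the execer interface, withAdvisoryLock, recordingConn, and the lock key are hypothetical names. The only outside behavior assumed is Postgres's pg_advisory_lock / pg_advisory_unlock pair, which is session-scoped, so both statements must run on the same connection (a *sql.Conn satisfies execer).

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// execer is the subset of *sql.Conn this sketch needs (hypothetical name).
type execer interface {
	ExecContext(ctx context.Context, query string, args ...any) (sql.Result, error)
}

// migrationLockKey is an arbitrary, made-up advisory lock key.
const migrationLockKey int64 = 4242

// withAdvisoryLock blocks until the session-level lock is free, runs migrate,
// then releases the lock. Replicas that start together run migrate in turn.
func withAdvisoryLock(ctx context.Context, conn execer, migrate func(context.Context) error) error {
	if _, err := conn.ExecContext(ctx, "SELECT pg_advisory_lock($1)", migrationLockKey); err != nil {
		return fmt.Errorf("acquire lock: %w", err)
	}
	migrateErr := migrate(ctx)
	// Release even when the migration failed so other replicas are not stuck.
	_, unlockErr := conn.ExecContext(context.WithoutCancel(ctx), "SELECT pg_advisory_unlock($1)", migrationLockKey)
	if migrateErr == nil && unlockErr != nil {
		return fmt.Errorf("release lock: %w", unlockErr)
	}
	return migrateErr
}

// recordingConn is a fake connection that records statements instead of
// talking to a database, so the call order can be shown without Postgres.
type recordingConn struct{ log []string }

func (r *recordingConn) ExecContext(_ context.Context, query string, _ ...any) (sql.Result, error) {
	r.log = append(r.log, query)
	return nil, nil
}

// callOrder runs the sketch against the fake and returns the recorded steps.
func callOrder() []string {
	conn := &recordingConn{}
	_ = withAdvisoryLock(context.Background(), conn, func(context.Context) error {
		conn.log = append(conn.log, "migrate")
		return nil
	})
	return conn.log
}

func main() {
	for _, step := range callOrder() {
		fmt.Println(step)
	}
}
```

Because a second replica blocks inside the lock statement until the first one unlocks, its startup time is its own migration time plus however long it waited, which is why the probe grace has to cover all three replicas in sequence.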

readinessProbe.initialDelaySeconds: 5 — after the startupProbe passes, wait 5s before starting readiness checks. Prevents a racy initial failure.

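livenessProbe.initialDelaySeconds: 30 holds off liveness checks for the first 30s, so a slow first response does not trigger a restart. Liveness runs independently of readiness, so this delay is not tied to the readiness probe passing. It avoids cascading restarts from false-negative liveness checks.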

resources.requests/limits — Kubernetes uses requests for scheduling (how much a pod "reserves") and limits for enforcement (max it can use before throttling/OOM). Our api is CPU-bursty for complex query handling, so we give it 100m baseline with a 1000m ceiling. 512Mi memory ceiling is comfortable — in practice api uses ~100-200Mi.

volumes.apns-key — mounts the honeydue-apns-key Secret as a file at /secrets/apns/apns_auth_key.p8. The APNS_AUTH_KEY_PATH env var points to this path. Even though push is currently disabled, the file must exist because the Go app may try to stat it on startup.

volumes.tmp is an emptyDir with sizeLimit: 64Mi. Bounded so a runaway process can't fill the node's disk.

The Service

apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: honeydue
spec:
  type: ClusterIP
  selector: {app.kubernetes.io/name: api}
  ports:
    - port: 8000
      targetPort: 8000
      protocol: TCP

ClusterIP 10.43.167.83. Reachable as api.honeydue.svc.cluster.local or just api from inside the namespace.

HorizontalPodAutoscaler (not yet enabled)

deploy-k3s/manifests/api/hpa.yaml defines an HPA that would scale api between 3 and 6 replicas based on CPU (70% util) and memory (80% util).

Not currently applied. metrics-server runs but we haven't run kubectl apply -f api/hpa.yaml. TODO in Chapter 20.
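For a feel of what the HPA would do once applied, here is a small sketch of the basic scaling rule from the Kubernetes documentation: desired = ceil(current × currentUtilization / targetUtilization), clamped to the min/max bounds. The function name desiredReplicas is made up for illustration, and the real controller also applies a tolerance band and stabilization windows that this sketch leaves out. With two metrics (CPU and memory), the controller takes the larger of the two results.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the basic HPA ratio rule and clamps the result
// to the [minReplicas, maxReplicas] range.
func desiredReplicas(current int, currentUtil, targetUtil float64, minReplicas, maxReplicas int) int {
	want := int(math.Ceil(float64(current) * currentUtil / targetUtil))
	if want < minReplicas {
		return minReplicas
	}
	if want > maxReplicas {
		return maxReplicas
	}
	return want
}

func main() {
	// api bounds from the chapter: 3 to 6 replicas, 70% CPU target.
	for _, util := range []float64{35, 70, 105, 140, 300} {
		fmt.Printf("cpu=%.0f%% -> %d replicas\n", util, desiredReplicas(3, util, 70, 3, 6))
	}
}
```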

Service 2 — admin (Next.js panel)

What it does

Server-rendered admin UI. Authenticates admin users against a separate admin_users table in Postgres (seeded with ADMIN_EMAIL + ADMIN_PASSWORD on first migration). Lets operators view/manage users, residences, tasks, subscriptions, etc.

Built as a Next.js 16 standalone server.

Why 1 replica

Low traffic. It's an internal tool. One pod suffices. If it crashes, Kubernetes restarts it in ~10s. If the hosting node dies, Kubernetes reschedules to another node.

The cost of running 3 replicas is tiny (Next.js is ~128MB per pod) but has no operational benefit. When the admin panel becomes user-facing, revisit.

Deployment highlights

replicas: 1
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 1

securityContext:
  runAsNonRoot: true
  runAsUser: 1001     # different from api (1000) for isolation
  runAsGroup: 1001
  fsGroup: 1001

containers:
  - image: gitea.treytartt.com/admin/honeydue-admin:<sha>
    ports: [containerPort: 3000]
    env:
      - name: PORT
        value: "3000"
      - name: HOSTNAME
        value: "0.0.0.0"
      - name: NEXT_PUBLIC_API_URL
        valueFrom: {configMapKeyRef: {name: honeydue-config, key: NEXT_PUBLIC_API_URL}}
    volumeMounts:
      - {name: nextjs-cache, mountPath: /app/.next/cache}
      - {name: tmp, mountPath: /tmp}
    resources:
      requests: {cpu: 50m, memory: 64Mi}
      limits:   {cpu: 500m, memory: 256Mi}
    startupProbe:
      httpGet: {path: /, port: 3000}     # was /admin/ — wrong for this app (Chapter 19)
      failureThreshold: 24
      periodSeconds: 5
    readinessProbe:
      httpGet: {path: /, port: 3000}
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 5

Probe path / — Next.js serves at root. /admin/ (scaffold default) returns 404 and killed the pod repeatedly during initial bring-up. See Chapter 19 §Admin probe path for the story.

runAsUser: 1001 — different from api's 1000 so that if one service were compromised, the stolen UID would at least be distinct from other services' (minor defense-in-depth).

nextjs-cache — emptyDir mount for Next.js's server-side cache. Without it, the read-only rootfs would prevent Next from caching server-rendered pages. Not a persistent volume because cache is regenerable on restart.

The Service

apiVersion: v1
kind: Service
metadata:
  name: admin
spec:
  type: ClusterIP
  selector: {app.kubernetes.io/name: admin}
  ports: [{port: 3000, targetPort: 3000}]

ClusterIP 10.43.136.168.

Service 3 — worker (Go + Asynq)

What it does

Runs scheduled background jobs via Asynq (a Redis-backed job queue for Go):

  • Task reminders (14:00 UTC daily) — notify users of upcoming tasks
  • Overdue reminders (15:00 UTC daily) — notify users of overdue tasks
  • Daily digest (03:00 UTC daily) — summary email per user
  • Onboarding emails — multi-step drip campaign for new users
  • Cleanup jobs — expired tokens, stale data

Why 1 replica (hard requirement)

Asynq uses a Scheduler component that does cron-like scheduling. The Scheduler is not leader-elected by default — if you run two, both fire every cron task. Users get duplicate emails.

The asynq docs cover this: to scale scheduling, migrate to PeriodicTaskManager + PeriodicTaskConfigProvider which coordinate via Redis. Not yet done in our codebase.

Until then: replicas: 1 is a hard constraint. See the comment in the deployment manifest:

spec:
  # Asynq's Scheduler is a singleton — running >1 replica fires every cron
  # task once per replica (duplicate daily digests, onboarding emails, etc.).
  # Keep at 1 until asynq.PeriodicTaskManager with Redis leader election is
  # wired in cmd/worker/main.go.
  replicas: 1

What happens if the worker pod dies?

  • Asynq schedule state is in Redis (which has AOF persistence)
  • When a new worker pod starts, it re-registers the scheduler and picks up where it left off
  • Any job that was in-flight (dequeued but not acknowledged) gets retried by Asynq's automatic retry logic (see the worker.RetryOptions in the Go code)
  • Cron jobs that were supposed to fire during the downtime: fire on the next tick

A 5-minute worker outage = 5 minutes of delayed jobs. Not great but acceptable.

PodDisruptionBudget

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: worker-pdb
spec:
  minAvailable: 0
  selector: {matchLabels: {app.kubernetes.io/name: worker}}

minAvailable: 0 means voluntary disruptions (kubectl drain) can take the worker down. This matches the singleton constraint: there's only one, it's OK to drain.
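The arithmetic behind that choice, as a small sketch: a PodDisruptionBudget with minAvailable permits an eviction only while the number of healthy pods exceeds the minimum. With a single replica and minAvailable: 1 the budget never permits one, so a drain would wait indefinitely. The function name disruptionsAllowed is illustrative.

```go
package main

import "fmt"

// disruptionsAllowed mirrors the minAvailable rule: evictions are permitted
// only for healthy pods above the required minimum.
func disruptionsAllowed(healthyPods, minAvailable int) int {
	if allowed := healthyPods - minAvailable; allowed > 0 {
		return allowed
	}
	return 0
}

func main() {
	fmt.Println("worker, minAvailable=1:", disruptionsAllowed(1, 1))
	fmt.Println("worker, minAvailable=0:", disruptionsAllowed(1, 0))
}
```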

No Service

worker doesn't listen on any HTTP port for application traffic — it's a queue consumer, not a web server. So there's no Kubernetes Service for it.

(On Swarm we had the worker expose a health endpoint at :6060/health; the k3s scaffold doesn't replicate this. Future work.)

Service 4 — redis

What it does

  • Caching layer (ETag-based lookups, user session cache)
  • Asynq queue backend (job state, scheduled tasks, retry state)

Why 1 replica

Single-instance Redis with AOF persistence. Not replicated, not clustered. Downsides:

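  • Node outage = Redis outage. If the node comes back, the cache regenerates and queue state is restored from the AOF on the PVC
  • No failover. The pod is pinned to one node by nodeSelector and the local-path PVC lives on that node's disk, so Redis cannot simply restart elsewhere. If the node is lost for good, another node has to be labeled and the PVC recreated, and the data is gone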

For our scale this is acceptable. Redis holds no authoritative state (everything that matters is in Postgres). Cache regenerates on first request; Asynq retries enqueue on failure.

PVC

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: redis-data
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: local-path
  resources: {requests: {storage: 5Gi}}

Uses k3s' built-in local-path-provisioner. The PVC binds to a local directory on the node where the Redis pod lands (/var/lib/rancher/k3s/storage/). ReadWriteOnce means the volume can be mounted read-write by one node at a time, which is all a single Redis pod needs.

Node affinity

nodeSelector:
  honeydue/redis: "true"

We labeled ubuntu-8gb-nbg1-2 (hetzner1) with honeydue/redis=true so Redis always lands there. This ensures the PVC finds its backing storage (since PVCs with local-path are per-node).

kubectl label node ubuntu-8gb-nbg1-2 honeydue/redis=true --overwrite

Why not Redis Sentinel / Cluster

Complexity. At our scale (~a few req/s, kilobytes of cache), a single Redis does fine. If Redis becomes critical-path for availability, we'd:

  • Use a managed Redis (Upstash, Dragonfly Cloud) — $5-15/mo, their problem
  • Or run Redis Sentinel with 3 replicas — manageable but operational work

Neither is needed yet.

Redis config

From the deployment:

command:
  - sh
  - -c
  - |
    ARGS="--appendonly yes --appendfsync everysec --maxmemory 256mb --maxmemory-policy noeviction"
    if [ -n "$REDIS_PASSWORD" ]; then
      ARGS="$ARGS --requirepass $REDIS_PASSWORD"
    fi
    exec redis-server $ARGS

Settings:

  • --appendonly yes --appendfsync everysec — AOF persistence, fsync every second. Survives restarts with up to 1 second of data loss.
  • --maxmemory 256mb — Redis will refuse new data if it grows past 256 MB. Gives us a safety cap.
  • --maxmemory-policy noeviction — we'd rather get errors than silently drop data. This is the right choice when Redis holds queue state (losing a queue item silently = missed job).

The REDIS_PASSWORD env var is optional. Currently empty (no auth). The Redis pod is only reachable from inside the overlay network, and our NetworkPolicies (once enabled) would restrict egress further.

Resource summary

Per-pod requests and limits for each service, with cluster-wide request totals (per-pod value × replicas):

| Service | CPU requests | CPU limits | Memory requests | Memory limits | Replicas |
|---|---|---|---|---|---|
| api | 100m | 1000m | 128Mi | 512Mi | 3 |
| admin | 50m | 500m | 64Mi | 256Mi | 1 |
| worker | 50m | 500m | 64Mi | 256Mi | 1 |
| redis | 100m | 500m | 128Mi | 512Mi | 1 |
| traefik (kube-system) | ~100m | unlimited | ~50Mi | unlimited | 3 |
| Total requests | ~800m | | ~790Mi | | |

Each node has 4000m CPU + 8192Mi memory. Total cluster capacity is 12000m + 24576Mi. Requests use roughly 7% of CPU and 3% of memory, which leaves plenty of headroom.
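A quick way to recompute the request totals from the per-pod figures in the table is to multiply each row by its replica count and sum. The sketch below does that; the workload type and function names are illustrative, and the traefik numbers are the approximate values quoted in the table.

```go
package main

import "fmt"

// workload holds per-pod requests and the replica count for one service.
type workload struct {
	name     string
	cpuMilli int // CPU request per pod, in millicores
	memMi    int // memory request per pod, in Mi
	replicas int
}

// cluster returns the rows of the resource summary table.
func cluster() []workload {
	return []workload{
		{"api", 100, 128, 3},
		{"admin", 50, 64, 1},
		{"worker", 50, 64, 1},
		{"redis", 100, 128, 1},
		{"traefik", 100, 50, 3}, // approximate, per the table
	}
}

// totals sums per-pod requests multiplied by replica count.
func totals(ws []workload) (cpuMilli, memMi int) {
	for _, w := range ws {
		cpuMilli += w.cpuMilli * w.replicas
		memMi += w.memMi * w.replicas
	}
	return cpuMilli, memMi
}

func main() {
	cpu, mem := totals(cluster())
	// Capacity: 3 nodes with 4000m CPU and 8192Mi memory each.
	fmt.Printf("CPU requests: %dm (%.1f%% of 12000m)\n", cpu, 100*float64(cpu)/12000)
	fmt.Printf("Memory requests: %dMi (%.1f%% of 24576Mi)\n", mem, 100*float64(mem)/24576)
}
```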

Health check semantics

Kubernetes distinguishes three probe types:

  • startupProbe — is the container done starting? Runs until it passes once, then stops. While running, the other probes are disabled. Failing startupProbe = container killed and restarted.
  • readinessProbe — is the container ready to serve traffic? A failing pod is removed from Service endpoints (traffic stops flowing to it) but the pod keeps running.
  • livenessProbe — is the container healthy? A failing pod is killed and restarted.

Why we tuned startupProbe separately

The api's first-boot migration takes 90 to 240s. If we relied only on readiness and liveness probes with typical settings (an initialDelaySeconds of 5s and a failureThreshold of 3), the liveness probe would kill the pod before migration finishes. startupProbe lets us give generous first-boot grace (240s) without affecting the sharper ongoing readiness/liveness checks.
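The grace periods can be estimated as initialDelaySeconds + failureThreshold × periodSeconds. The sketch below applies that to the probe settings in this chapter; graceSeconds is an illustrative name, and the estimate ignores per-probe timeouts, so treat the results as approximate.

```go
package main

import "fmt"

// graceSeconds approximates how long a container may keep failing a probe
// before the kubelet acts on it.
func graceSeconds(initialDelay, period, failureThreshold int) int {
	return initialDelay + failureThreshold*period
}

func main() {
	fmt.Println("api startupProbe (48 x 5s):", graceSeconds(0, 5, 48))
	fmt.Println("admin startupProbe (24 x 5s):", graceSeconds(0, 5, 24))
	// A typical probe without a startupProbe: 5s delay, 10s period, threshold 3.
	fmt.Println("typical probe (5s + 3 x 10s):", graceSeconds(5, 10, 3))
}
```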

Probe path design

Each service's /health endpoint should be:

  • Cheap (no DB query, no external call)
  • Fast (< 100ms)
  • Honest (returns 200 iff the process can serve)

Our api's /api/health/ does a trivial check. It does NOT verify Postgres connectivity (to avoid cascading DB failures tearing down all api pods). If Postgres is down, api pods stay "ready" and return 5xx for actual endpoints — that's the right behavior.
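A health endpoint with those properties could look roughly like the sketch below. It uses only the standard library's net/http and is not the project's actual handler; healthHandler, newMux, and probeStatus are illustrative names, and the response body is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// healthHandler answers without touching Postgres, Redis, or any other
// dependency, so a database outage cannot fail the probe.
func healthHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK)
	fmt.Fprint(w, `{"status":"ok"}`)
}

// newMux registers the handler at the path the probes use.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/health/", healthHandler)
	return mux
}

// probeStatus issues an in-memory GET against the mux and returns the
// status code, standing in for a kubelet httpGet probe.
func probeStatus(path string) int {
	rec := httptest.NewRecorder()
	newMux().ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	fmt.Println("GET /api/health/ ->", probeStatus("/api/health/"))
}
```

Keeping dependency checks out of the handler is the design choice that matters here: the probe reports whether the process can accept requests, not whether every downstream system is up.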

Log routing

All container logs go to stdout/stderr. containerd captures them under /var/log/containers/ on the node. kubectl logs fetches them through the API server's /api/v1/namespaces/<namespace>/pods/<pod-name>/log endpoint, which is served by the kubelet on that node.

We have no log aggregation in the cluster (no Loki, no ELK, no Datadog). For debugging we use:

kubectl logs -n honeydue deploy/api -f --prefix
kubectl logs -n honeydue deploy/api --previous  # logs from the previous container instance (after a restart)

See Chapter 15.

Rolling update semantics

When you push a new image and run kubectl set image or kubectl apply with a new image tag:

  1. Kubernetes creates a new ReplicaSet with the new image
  2. Starts 1 new pod (per maxSurge: 1)
  3. Waits for it to pass readinessProbe
  4. Removes 1 pod from the old ReplicaSet
  5. Repeats until all N pods are on the new ReplicaSet
  6. Old ReplicaSet stays around (for rollback) with 0 replicas

For api (3 replicas): total rollout time is roughly 3 × (pod_startup_time + small_buffer) = ~15 minutes in the cold-boot case, seconds for warm updates where migrations are no-op.
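The rollout sequence and the time estimate can be checked with a small simulation of maxSurge: 1 and maxUnavailable: 0. This is a simplified model, not what the Deployment controller literally executes: simulateRollout and rolloutStats are illustrative names, and the 60s buffer per pod is an assumed value chosen to match the ~15 minute cold-boot figure.

```go
package main

import "fmt"

// rolloutStats summarises a simulated rolling update.
type rolloutStats struct {
	minReady     int // lowest number of ready pods observed
	peakPods     int // highest number of pods (old + new) observed
	totalSeconds int // elapsed time when new pods start one after another
}

// simulateRollout models maxSurge=1, maxUnavailable=0: start one new pod,
// wait for it to become ready, then remove one old pod, and repeat.
func simulateRollout(replicas, startupSeconds, bufferSeconds int) rolloutStats {
	oldPods, newPods := replicas, 0
	stats := rolloutStats{minReady: replicas, peakPods: replicas}
	for newPods < replicas {
		// Surge: one extra pod exists but is not ready yet.
		if pods := oldPods + newPods + 1; pods > stats.peakPods {
			stats.peakPods = pods
		}
		if ready := oldPods + newPods; ready < stats.minReady {
			stats.minReady = ready
		}
		stats.totalSeconds += startupSeconds + bufferSeconds
		newPods++ // the new pod passed its readinessProbe
		oldPods-- // only now is one old pod removed
	}
	return stats
}

func main() {
	// Cold boot: up to 240s startup per pod plus an assumed 60s buffer.
	s := simulateRollout(3, 240, 60)
	fmt.Printf("min ready=%d, peak pods=%d, duration=%ds\n", s.minReady, s.peakPods, s.totalSeconds)
}
```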

During the rollout:

  • Service endpoint set updates as pods become ready
  • kube-proxy IPVS is reprogrammed on each node
  • Traefik's connection pool to the Service invalidates gradually

Users see no downtime if the new image is compatible. If it's broken:

kubectl rollout undo deployment/api -n honeydue

Reverts to the previous ReplicaSet. Typically takes 30 seconds to stabilize.

Why no StatefulSet

For Redis (the only stateful thing we run), we use a Deployment + PVC. StatefulSet is designed for:

  • Ordered startup (pod-0 before pod-1)
  • Stable hostnames (pod-0 gets DNS name redis-0.redis)
  • Per-replica PVCs

We have one Redis replica. None of those features matter for a singleton. Deployment + PVC + nodeSelector is simpler and equivalent.

If we ever run Redis Sentinel or Cluster, we'd migrate to StatefulSet.

Operator cheat sheet

# See all pods in honeydue namespace
kubectl get pods -n honeydue -o wide

# Per-service rollout status
kubectl rollout status deployment/api -n honeydue

# Scale a service
kubectl scale deployment/api -n honeydue --replicas=5

# Restart all pods (e.g., to re-read a configmap)
kubectl rollout restart deployment/api -n honeydue

# Exec into a pod
kubectl exec -it -n honeydue deploy/admin -- /bin/sh

# Describe a pod (shows events, probe state, restarts)
kubectl describe pod -n honeydue <pod-name>

# Resource usage
kubectl top pods -n honeydue

References