
# 07 — Services

## Summary

Four workloads run in the `honeydue` namespace: **api** (Go REST API, 3
replicas), **admin** (Next.js panel, 1 replica), **worker** (Go background
jobs, 1 replica), and **redis** (cache + job queue, 1 replica, PVC-backed).
This chapter deep-dives each: container image, resource limits, probes,
volumes, and why each knob is set the way it is.

## Overview

| Service | Image | Replicas | Ports | Role |
|---|---|---|---|---|
| `api` | `gitea.treytartt.com/admin/honeydue-api:<sha>` | 3 | 8000 | HTTP REST API |
| `admin` | `gitea.treytartt.com/admin/honeydue-admin:<sha>` | 1 | 3000 | Next.js admin panel |
| `worker` | `gitea.treytartt.com/admin/honeydue-worker:<sha>` | 1 | — | Background job processor |
| `redis` | `redis:7-alpine` | 1 | 6379 | Cache + Asynq queue |

All four are Kubernetes `Deployment` workloads (not StatefulSets, not
DaemonSets). They share:
- ServiceAccount with `automountServiceAccountToken: false` (Chapter 5)
- `imagePullSecrets: [gitea-credentials]` (Chapter 11)
- `envFrom: configMapRef: honeydue-config` (Chapter 10)
- Individual env vars wired to `honeydue-secrets` keys
- Read-only root filesystem with `tmp` emptyDir mounted at `/tmp`

## Service 1 — api (Go REST API)

### What it does

The Go HTTP API — the heart of the app. Handlers for user auth,
residences, tasks, contractors, documents, subscriptions, notifications,
etc. Reads/writes to Neon Postgres, reads/writes to Redis cache, reads
from Backblaze B2.

Also serves a marketing landing page at `/` (static HTML + CSS from
`/app/static/`). This is why the `myhoneydue.com` apex domain routes to
the api service (Chapter 6).

### Deployment spec highlights

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    spec:
      serviceAccountName: api
      imagePullSecrets: [name: gitea-credentials]
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        seccompProfile: { type: RuntimeDefault }
      containers:
        - name: api
          image: gitea.treytartt.com/admin/honeydue-api:237c6b8
          ports: [containerPort: 8000]
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities: { drop: [ALL] }
          envFrom: [configMapRef: {name: honeydue-config}]
          env:
            - name: POSTGRES_PASSWORD
              valueFrom: { secretKeyRef: {name: honeydue-secrets, key: POSTGRES_PASSWORD} }
            - name: SECRET_KEY
              valueFrom: { secretKeyRef: {name: honeydue-secrets, key: SECRET_KEY} }
            # ... all other secrets
          volumeMounts:
            - { name: apns-key, mountPath: /secrets/apns, readOnly: true }
            - { name: tmp, mountPath: /tmp }
          resources:
            requests: { cpu: 100m, memory: 128Mi }
            limits:   { cpu: 1000m, memory: 512Mi }
          startupProbe: { httpGet: {path: /api/health/, port: 8000}, failureThreshold: 48, periodSeconds: 5 }
          readinessProbe: { httpGet: {path: /api/health/, port: 8000}, initialDelaySeconds: 5, periodSeconds: 10, timeoutSeconds: 5 }
          livenessProbe: { httpGet: {path: /api/health/, port: 8000}, initialDelaySeconds: 30, periodSeconds: 30, timeoutSeconds: 10 }
      volumes:
        - name: apns-key
          secret:
            secretName: honeydue-apns-key
            items: [{key: apns_auth_key.p8, path: apns_auth_key.p8}]
        - name: tmp
          emptyDir: {sizeLimit: 64Mi}
```

### Why each setting

**`replicas: 3`** — one per node via anti-affinity rules (not strictly
required but helpful). Three gives us HA (one pod down = two still
serve traffic) and headroom for rolling updates.

**`maxUnavailable: 0, maxSurge: 1`** — during a rollout, start a 4th
pod before killing any old one. Ensures the service stays at 3 live
pods throughout. `maxUnavailable: 0` means zero downtime updates — but
depends on readinessProbe being accurate.
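
As a quick sanity check on those two knobs: during a rolling update the total pod count never exceeds `replicas + maxSurge`, and the available count never drops below `replicas - maxUnavailable`. A minimal sketch of that arithmetic (the function name `rollout_bounds` is ours, and it only handles absolute counts, not percentages):

```bash
# Pod-count bounds during a RollingUpdate.
# $1 = replicas, $2 = maxSurge, $3 = maxUnavailable (absolute counts only)
rollout_bounds() {
  echo "max_total=$(( $1 + $2 )) min_available=$(( $1 - $3 ))"
}

rollout_bounds 3 1 0   # api
rollout_bounds 1 1 0   # admin
```

For api this gives at most 4 pods in flight and never fewer than 3 available.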

**`runAsUser: 1000`** — the `app` user created in the Dockerfile. Image
doesn't run as root.

**`readOnlyRootFilesystem: true`** — prevents any attacker-introduced
file writes to the image layer. Go binary doesn't need to write to `/`;
only `/tmp` is mutable.

**`startupProbe.failureThreshold: 48`** (= 48 × 5s = 240s grace) — this
was bumped up from the scaffold default of 12. Reason: on first boot,
the Go app runs `MigrateWithLock()` which acquires a Postgres advisory
lock and runs AutoMigrate. First replica takes ~90s; subsequent
replicas wait on the lock. With 3 replicas all starting simultaneously
and the lock serializing them, 240s is the right grace. See
[Chapter 19](./19-postmortem-swarm.md) for the detailed story.
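
The grace window is simply `failureThreshold × periodSeconds`. A small sketch for checking a candidate value against an expected worst-case startup time (the helper names are ours, and the 200s figure passed below is illustrative, not a measurement):

```bash
# Startup grace window in seconds.
# $1 = failureThreshold, $2 = periodSeconds
startup_grace_seconds() {
  echo $(( $1 * $2 ))
}

# Succeeds if the grace window covers the expected worst-case startup.
# $1 = failureThreshold, $2 = periodSeconds, $3 = worst-case startup seconds
covers_startup() {
  [ "$(startup_grace_seconds "$1" "$2")" -ge "$3" ]
}

echo "scaffold default: $(startup_grace_seconds 12 5)s"
echo "current setting:  $(startup_grace_seconds 48 5)s"
if covers_startup 12 5 200; then
  echo "threshold 12 covers a 200s startup"
else
  echo "threshold 12 is too small for a 200s startup"
fi
```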

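**`readinessProbe.initialDelaySeconds: 5`** — counted from container
start, not from the moment the startupProbe passes. Readiness checks are
held back until the startupProbe succeeds anyway, so this delay only
matters when startup finishes in under 5s. It mainly prevents a racy
initial failure.

**`livenessProbe.initialDelaySeconds: 30`** — likewise counted from
container start, and independent of readiness. Liveness checks also wait
for the startupProbe, so the 30s delay is a floor for fast starts.
Together with `periodSeconds: 30` it avoids cascading restarts from
false-negative liveness checks.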

**`resources.requests/limits`** — Kubernetes uses `requests` for
scheduling (how much a pod "reserves") and `limits` for enforcement
(max it can use before throttling/OOM). Our api is CPU-bursty for
complex query handling, so we give it 100m baseline with a 1000m ceiling.
512Mi memory ceiling is comfortable — in practice api uses ~100-200Mi.

**`volumes.apns-key`** — mounts the `honeydue-apns-key` Secret as a file
at `/secrets/apns/apns_auth_key.p8`. The `APNS_AUTH_KEY_PATH` env var
points to this path. Even though push is currently disabled, the file
must exist because the Go app may try to stat it on startup.

**`volumes.tmp`** — `emptyDir` with `sizeLimit: 64Mi`. Bounded so a
runaway process can't fill the node's disk.

### The Service

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: honeydue
spec:
  type: ClusterIP
  selector: {app.kubernetes.io/name: api}
  ports:
    - port: 8000
      targetPort: 8000
      protocol: TCP
```

ClusterIP `10.43.167.83`. Reachable as `api.honeydue.svc.cluster.local` or
just `api` from inside the namespace.

### HorizontalPodAutoscaler (not yet enabled)

`deploy-k3s/manifests/api/hpa.yaml` defines an HPA that would scale api
between 3 and 6 replicas based on CPU (70% util) and memory (80% util).

**Not currently applied.** `metrics-server` runs but we haven't run
`kubectl apply -f api/hpa.yaml`. TODO in Chapter 20.
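
For a feel of what that HPA would do once applied: the standard HPA rule is `desired = ceil(current × currentUtilization / targetUtilization)`, taking the larger result when several metrics are configured and clamping to the min/max bounds. The sketch below hard-codes the 70%/80% targets and the 3 to 6 range described above. The function names are ours, and it ignores the controller's tolerance band and stabilization window, so treat it as an approximation:

```bash
# ceil(current * utilization / target) using integer arithmetic.
# $1 = current replicas, $2 = current utilization %, $3 = target utilization %
hpa_desired() {
  echo $(( ($1 * $2 + $3 - 1) / $3 ))
}

# Replica count the api HPA would aim for.
# $1 = current replicas, $2 = CPU utilization %, $3 = memory utilization %
hpa_replicas() {
  by_cpu=$(hpa_desired "$1" "$2" 70)
  by_mem=$(hpa_desired "$1" "$3" 80)
  want=$by_cpu
  if [ "$by_mem" -gt "$want" ]; then want=$by_mem; fi
  # Clamp to the HPA's minReplicas / maxReplicas.
  if [ "$want" -lt 3 ]; then want=3; fi
  if [ "$want" -gt 6 ]; then want=6; fi
  echo "$want"
}

echo "cpu 105%, mem 40%  -> $(hpa_replicas 3 105 40) replicas"
echo "cpu 30%,  mem 30%  -> $(hpa_replicas 3 30 30) replicas"
echo "cpu 200%, mem 50%  -> $(hpa_replicas 3 200 50) replicas"
```

Utilization here is measured against each pod's `requests` (100m CPU, 128Mi memory), not its limits.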

## Service 2 — admin (Next.js panel)

### What it does

Server-rendered admin UI. Authenticates admin users against a
separate `admin_users` table in Postgres (seeded with `ADMIN_EMAIL` +
`ADMIN_PASSWORD` on first migration). Lets operators view/manage
users, residences, tasks, subscriptions, etc.

Built as a Next.js 16 standalone server.

### Why 1 replica

Low traffic. It's an internal tool. One pod suffices. If it crashes,
Kubernetes restarts it in ~10s. If the hosting node dies, Kubernetes
reschedules to another node.

The cost of running 3 replicas is tiny (Next.js is ~128MB per pod) but
has no operational benefit. When the admin panel becomes user-facing,
revisit.

### Deployment highlights

```yaml
replicas: 1
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 1

securityContext:
  runAsNonRoot: true
  runAsUser: 1001     # different from api (1000) for isolation
  runAsGroup: 1001
  fsGroup: 1001

containers:
  - image: gitea.treytartt.com/admin/honeydue-admin:<sha>
    ports: [containerPort: 3000]
    env:
      - name: PORT
        value: "3000"
      - name: HOSTNAME
        value: "0.0.0.0"
      - name: NEXT_PUBLIC_API_URL
        valueFrom: {configMapKeyRef: {name: honeydue-config, key: NEXT_PUBLIC_API_URL}}
    volumeMounts:
      - {name: nextjs-cache, mountPath: /app/.next/cache}
      - {name: tmp, mountPath: /tmp}
    resources:
      requests: {cpu: 50m, memory: 64Mi}
      limits:   {cpu: 500m, memory: 256Mi}
    startupProbe:
      httpGet: {path: /, port: 3000}     # was /admin/ — wrong for this app (Chapter 19)
      failureThreshold: 24
      periodSeconds: 5
    readinessProbe:
      httpGet: {path: /, port: 3000}
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 5
```

**Probe path `/`** — Next.js serves at root. `/admin/` (scaffold default)
returns 404, so the probe kept failing and the container was restarted
repeatedly during initial bring-up. See Chapter 19 §Admin probe path for
the story.

**`runAsUser: 1001`** — different from api's 1000 so that if one
service were compromised, the stolen UID would at least be distinct
from other services' (minor defense-in-depth).

**`nextjs-cache`** — emptyDir mount for Next.js's server-side cache.
Without it, the read-only rootfs would prevent Next from caching
server-rendered pages. Not a persistent volume because cache is
regenerable on restart.

### The Service

```yaml
apiVersion: v1
kind: Service
metadata:
  name: admin
spec:
  type: ClusterIP
  selector: {app.kubernetes.io/name: admin}
  ports: [{port: 3000, targetPort: 3000}]
```

ClusterIP `10.43.136.168`.

## Service 3 — worker (Go + Asynq)

### What it does

Runs scheduled background jobs via [Asynq](https://github.com/hibiken/asynq)
(a Redis-backed job queue for Go):

- **Task reminders** (14:00 UTC daily) — notify users of upcoming tasks
- **Overdue reminders** (15:00 UTC daily) — notify users of overdue tasks
- **Daily digest** (03:00 UTC daily) — summary email per user
- **Onboarding emails** — multi-step drip campaign for new users
- **Cleanup jobs** — expired tokens, stale data

### Why 1 replica (hard requirement)

Asynq uses a `Scheduler` component that does cron-like scheduling. The
Scheduler is **not leader-elected** by default — if you run two, both
fire every cron task. Users get duplicate emails.

The asynq docs cover this: to scale scheduling, migrate to
`PeriodicTaskManager` + `PeriodicTaskConfigProvider` which coordinate
via Redis. Not yet done in our codebase.

Until then: `replicas: 1` is a hard constraint. See the comment in the
deployment manifest:

```yaml
spec:
  # Asynq's Scheduler is a singleton — running >1 replica fires every cron
  # task once per replica (duplicate daily digests, onboarding emails, etc.).
  # Keep at 1 until asynq.PeriodicTaskManager with Redis leader election is
  # wired in cmd/worker/main.go.
  replicas: 1
```

### What happens if the worker pod dies?

- Asynq schedule state is in Redis (which has AOF persistence)
- When a new worker pod starts, it re-registers the scheduler and picks up
  where it left off
- Any job that was in-flight (dequeued but not acknowledged) gets retried
  by Asynq's automatic retry logic (see the `worker.RetryOptions` in the
  Go code)
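- Cron jobs that were supposed to fire during the downtime: fire on the
  next scheduled tick (for the daily jobs, that is the following day)

A 5-minute worker outage = 5 minutes of delayed queued jobs, unless it
overlaps one of the daily cron times. Not great but acceptable.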

### PodDisruptionBudget

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: worker-pdb
spec:
  minAvailable: 0
  selector: {matchLabels: {app.kubernetes.io/name: worker}}
```

`minAvailable: 0` means voluntary disruptions (`kubectl drain`) can take
the worker down. This matches the singleton constraint: there's only one,
it's OK to drain.
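
The reason `minAvailable` has to be 0 here: a PDB permits `healthy pods - minAvailable` voluntary evictions at any moment. With a single replica, any higher value leaves zero allowed disruptions, and `kubectl drain` would block on the worker. A minimal sketch of that arithmetic (the helper name is ours):

```bash
# Voluntary disruptions a PDB permits right now.
# $1 = currently healthy pods, $2 = minAvailable (absolute count)
allowed_disruptions() {
  d=$(( $1 - $2 ))
  if [ "$d" -lt 0 ]; then d=0; fi
  echo "$d"
}

echo "minAvailable=0: $(allowed_disruptions 1 0) eviction allowed"
echo "minAvailable=1: $(allowed_disruptions 1 1) evictions allowed (drain would block)"
```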

### No Service

worker doesn't listen on any HTTP port for application traffic — it's a
queue consumer, not a web server. So there's **no Kubernetes Service**
for it.

(On Swarm we had the worker expose a health endpoint at `:6060/health`;
the k3s scaffold doesn't replicate this. Future work.)

## Service 4 — redis

### What it does

- Caching layer (ETag-based lookups, user session cache)
- Asynq queue backend (job state, scheduled tasks, retry state)

### Why 1 replica

Single-instance Redis with AOF persistence. Not replicated, not
clustered. Downsides:
- Node outage = Redis outage. Once the node is back, Redis restarts and
  queue state is restored from the AOF on the PVC.
- No failover: the `nodeSelector` and the `local-path` PVC both pin Redis
  to one node. If that node is lost for good, the pod stays Pending until
  Redis is moved by hand, and the data on that node's disk is gone.
For our scale this is acceptable. Redis holds no authoritative state
(everything that matters is in Postgres). Cache regenerates on first
request; Asynq retries enqueue on failure.

### PVC

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: redis-data
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: local-path
  resources: {requests: {storage: 5Gi}}
```

Uses k3s' built-in `local-path-provisioner`. The PVC binds to a local
directory on the node where the Redis pod lands (`/var/lib/rancher/k3s/storage/`).
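`ReadWriteOnce` means the volume can be mounted read-write by one node at
a time (pods on that same node could share it; `ReadWriteOncePod` is the
single-pod mode).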

### Node affinity

```yaml
nodeSelector:
  honeydue/redis: "true"
```

We labeled `ubuntu-8gb-nbg1-2` (hetzner1) with `honeydue/redis=true` so
Redis always lands there. This ensures the PVC finds its backing
storage (since PVCs with `local-path` are per-node).

```bash
kubectl label node ubuntu-8gb-nbg1-2 honeydue/redis=true --overwrite
```

### Why not Redis Sentinel / Cluster

Complexity. At our scale (~a few req/s, kilobytes of cache), a single
Redis does fine. If Redis becomes critical-path for availability, we'd:
- Use a managed Redis (Upstash, Dragonfly Cloud) — $5-15/mo, their problem
- Or run Redis Sentinel with 3 replicas — manageable but operational work

Neither is needed yet.

### Redis config

From the deployment:

```yaml
command:
  - sh
  - -c
  - |
    ARGS="--appendonly yes --appendfsync everysec --maxmemory 256mb --maxmemory-policy noeviction"
    if [ -n "$REDIS_PASSWORD" ]; then
      ARGS="$ARGS --requirepass $REDIS_PASSWORD"
    fi
    exec redis-server $ARGS
```

Settings:
- **`--appendonly yes --appendfsync everysec`** — AOF persistence,
  fsync every second. Survives restarts with up to 1 second of data
  loss.
- **`--maxmemory 256mb`** — Redis will refuse new data if it grows past
  256 MB. Gives us a safety cap.
- **`--maxmemory-policy noeviction`** — we'd rather get errors than
  silently drop data. This is the right choice when Redis holds queue
  state (losing a queue item silently = missed job).

The `REDIS_PASSWORD` env var is optional. Currently empty (no auth). The
Redis pod is only reachable from inside the overlay network, and our
NetworkPolicies (once enabled) would further restrict which pods can
connect to it.
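
The startup script's branching can be exercised outside the cluster. This sketch wraps the same logic in a function (`build_redis_args` is a name introduced here, and the password is a placeholder) and prints the resulting arguments instead of launching `redis-server`:

```bash
# Build the redis-server argument string the same way the container command does.
# $1 = password; empty means no --requirepass
build_redis_args() {
  args="--appendonly yes --appendfsync everysec --maxmemory 256mb --maxmemory-policy noeviction"
  if [ -n "$1" ]; then
    args="$args --requirepass $1"
  fi
  echo "$args"
}

build_redis_args ""                  # current setup: no auth
build_redis_args "example-password"  # with auth enabled
```

Because the container command expands `$ARGS` unquoted, a password containing whitespace would be split into several arguments, so any password set here should avoid spaces.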

## Resource summary

Per-pod requests and limits for each service (the totals row multiplies by replica count):

| Service | CPU requests | CPU limits | Memory requests | Memory limits | Replicas |
|---|---|---|---|---|---|
| api | 100m | 1000m | 128Mi | 512Mi | 3 |
| admin | 50m | 500m | 64Mi | 256Mi | 1 |
| worker | 50m | 500m | 64Mi | 256Mi | 1 |
| redis | 100m | 500m | 128Mi | 512Mi | 1 |
| traefik (kube-system) | ~100m | unlimited | ~50Mi | unlimited | 3 |
| **Total requests (per-pod × replicas)** | **~800m** | | **~790Mi** | | |

Each node has 4000m CPU + 8192Mi memory. Total cluster capacity is
12000m + 24576Mi. Requests add up to roughly 7% of CPU and 3% of memory,
which leaves tons of headroom.
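
The request totals can be recomputed from the per-pod figures in the table by multiplying each row by its replica count. A sketch, assuming the approximate traefik values (100m, 50Mi) are close enough; the function names are ours:

```bash
# Sum requests across services.
# Input lines on stdin: name replicas cpu_request_m mem_request_Mi
total_requests() {
  cpu=0
  mem=0
  while read -r name replicas cpu_m mem_mi; do
    [ -n "$name" ] || continue
    cpu=$(( cpu + replicas * cpu_m ))
    mem=$(( mem + replicas * mem_mi ))
  done
  echo "$cpu $mem"
}

# Percentage with one decimal place, rounded. $1 = used, $2 = capacity
percent_of() {
  tenths=$(( ($1 * 1000 + $2 / 2) / $2 ))
  echo "$(( tenths / 10 )).$(( tenths % 10 ))"
}

totals=$(total_requests <<'EOF'
api     3 100 128
admin   1  50  64
worker  1  50  64
redis   1 100 128
traefik 3 100  50
EOF
)
cpu_total=${totals% *}
mem_total=${totals#* }

echo "CPU requests:    ${cpu_total}m  ($(percent_of "$cpu_total" 12000)% of 12000m)"
echo "Memory requests: ${mem_total}Mi ($(percent_of "$mem_total" 24576)% of 24576Mi)"
```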

## Health check semantics

Kubernetes distinguishes three probe types:

- **startupProbe** — is the container done starting? Runs until it passes
  once, then stops. While running, the other probes are disabled.
  Failing startupProbe = container killed and restarted.
- **readinessProbe** — is the container ready to serve traffic? A failing
  pod is removed from Service endpoints (traffic stops flowing to it)
  but the pod keeps running.
- **livenessProbe** — is the container healthy? A failing pod is killed
  and restarted.

### Why we tuned startupProbe separately

The api's first-boot migration takes 90–240s. A readinessProbe never
restarts a container, but without a startupProbe the livenessProbe
(`initialDelaySeconds: 30`, `periodSeconds: 30`, default
`failureThreshold` of 3) would kill the container roughly 90 to 120s
after start, before a slow migration finishes. startupProbe lets us
give generous first-boot grace (240s) without affecting the sharper
ongoing readiness/liveness checks.
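
To see why the livenessProbe on its own would be too aggressive for a first boot, estimate how long a never-passing probe takes to use up its failures. The sketch assumes the first check runs at `initialDelaySeconds` and each later check one `periodSeconds` apart; kubelet timing alignment can add up to about one more period, and `timeoutSeconds` is ignored. A `failureThreshold` of 3 is the Kubernetes default, and the helper name is ours:

```bash
# Approximate seconds from container start until a probe that never passes
# has failed failureThreshold times in a row.
# $1 = initialDelaySeconds, $2 = periodSeconds, $3 = failureThreshold
seconds_until_exhausted() {
  echo $(( $1 + ($3 - 1) * $2 ))
}

earliest=$(seconds_until_exhausted 30 30 3)
echo "livenessProbe alone: restart after ~${earliest}s to ~$(( earliest + 30 ))s"
echo "startupProbe grace:  $(( 48 * 5 ))s"
```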

### Probe path design

Each service's `/health` endpoint should be:
- Cheap (no DB query, no external call)
- Fast (< 100ms)
- Honest (returns 200 iff the process can serve)

Our api's `/api/health/` does a trivial check. It does NOT verify Postgres
connectivity (to avoid cascading DB failures tearing down all api pods).
If Postgres is down, api pods stay "ready" and return 5xx for actual
endpoints — that's the right behavior.

## Log routing

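All container logs go to stdout/stderr. containerd writes them under
`/var/log/pods/` on the node (with symlinks in `/var/log/containers/`).
`kubectl logs` fetches them through the API server's
`/api/v1/namespaces/<ns>/pods/<pod>/log` endpoint, which proxies to the
kubelet on that node.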

We have **no log aggregation** in the cluster (no Loki, no ELK, no
Datadog). For debugging we use:

```bash
kubectl logs -n honeydue deploy/api -f --prefix
kubectl logs -n honeydue deploy/api --previous  # previous container instance (after a restart)
```

See [Chapter 15](./15-observability.md).

## Rolling update semantics

When you push a new image and run `kubectl set image`, or `kubectl apply`
a manifest with a new image tag:

1. Kubernetes creates a new ReplicaSet with the new image
2. Starts 1 new pod (per `maxSurge: 1`)
3. Waits for it to pass readinessProbe
4. Removes 1 pod from the old ReplicaSet
5. Repeats until all N pods are on the new ReplicaSet
6. Old ReplicaSet stays around (for rollback) with 0 replicas

For api (3 replicas): total rollout time is roughly
`3 × (pod_startup_time + small_buffer)` = ~15 minutes in the cold-boot
case, seconds for warm updates where migrations are no-op.
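
Plugging numbers into that formula: the sketch below takes the 240s startup grace as the worst-case `pod_startup_time` and assumes a 60s buffer per pod. The buffer and the 5s warm-start figures are our guesses, chosen only for illustration; the function name is ours too.

```bash
# Rough duration of a one-pod-at-a-time rollout, in seconds.
# $1 = replicas, $2 = per-pod startup seconds, $3 = per-pod buffer seconds
rollout_seconds() {
  echo $(( $1 * ($2 + $3) ))
}

cold=$(rollout_seconds 3 240 60)
warm=$(rollout_seconds 3 5 5)
echo "cold boot: ${cold}s (~$(( cold / 60 )) min)"
echo "warm:      ${warm}s"
```

Under those assumptions the cold-boot case comes out at the ~15 minutes quoted above.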

During the rollout:
- Service endpoint set updates as pods become ready
- kube-proxy IPVS is reprogrammed on each node
- Traefik's connection pool to the Service invalidates gradually

Users see no downtime if the new image is compatible. If it's broken:

```bash
kubectl rollout undo deployment/api -n honeydue
```

Reverts to the previous ReplicaSet. Typically takes 30 seconds to
stabilize.

## Why no StatefulSet

For Redis (the only stateful thing we run), we use a Deployment + PVC.
StatefulSet is designed for:
- Ordered startup (pod-0 before pod-1)
- Stable hostnames (pod-0 gets DNS name `redis-0.redis`)
- Per-replica PVCs

We have one Redis replica. None of those features matter for a
singleton. Deployment + PVC + nodeSelector is simpler and equivalent.

If we ever run Redis Sentinel or Cluster, we'd migrate to StatefulSet.

## Operator cheat sheet

```bash
# See all pods in honeydue namespace
kubectl get pods -n honeydue -o wide

# Per-service rollout status
kubectl rollout status deployment/api -n honeydue

# Scale a service
kubectl scale deployment/api -n honeydue --replicas=5

# Restart all pods (e.g., to re-read a configmap)
kubectl rollout restart deployment/api -n honeydue

# Exec into a pod
kubectl exec -it -n honeydue deploy/admin -- /bin/sh

# Describe a pod (shows events, probe state, restarts)
kubectl describe pod -n honeydue <pod-name>

# Resource usage
kubectl top pods -n honeydue
```

## References

- [Kubernetes Deployments][deploy]
- [Pod lifecycle + probes][probes]
- [Asynq scheduler limitations][asynq-sched]
- [K3s local-path provisioner][k3s-lp]

[deploy]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
[probes]: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-lifecycle
[asynq-sched]: https://github.com/hibiken/asynq/wiki/Periodic-Tasks
[k3s-lp]: https://docs.k3s.io/storage#setting-up-the-local-storage-provider