honeyDueAPI/docs/deployment/14-deployment-process.md
Trey t 6f303dbbaa
Migrate prod deploy from Swarm to K3s; add full deployment book
Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
  callback (was causing 'unlock of unlocked mutex' fatal after
  Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
  + allowlist fonts.googleapis.com so the marketing landing page CSS
  actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
  --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
  images runnable on x86_64 Hetzner nodes; fix array expansion under
  set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
  top-level aliases (the '\${X_SECRET}' form never actually resolved);
  dozzle ports: long-form host_ip is rejected by Swarm, switched to
  short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
  worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
  (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
  12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
  real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
  and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
  hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
  without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations,
    consolidated citations
- README.md: Production Deployment section replaced with pointer to
  the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 07:20:54 -05:00

# 14 — Deployment Process
## Summary
A production deploy has four steps: build a new image, push it to the
Gitea registry, point the Deployment's image field at the new
SHA-tagged image, and let Kubernetes roll the new pods in. There is no
downtime as long as the change is backward-compatible. Rollback is
`kubectl rollout undo`. This chapter walks through the full process,
plus alternate paths (config-only changes, manifest changes, hotfixes).
## TL;DR for a code change
```bash
# 1. Commit + get SHA
cd /Users/treyt/Desktop/code/honeyDue/honeyDueAPI-go
git add . && git commit -m "..." && SHA=$(git rev-parse --short HEAD)
# 2. Login to Gitea registry
set -a; source deploy/registry.env; set +a
printf '%s' "$REGISTRY_TOKEN" | docker login "$REGISTRY" -u "$REGISTRY_USERNAME" --password-stdin
# 3. Build + push amd64 image
docker buildx build --platform linux/amd64 --target api \
  -t "gitea.treytartt.com/admin/honeydue-api:${SHA}" --push .
# 4. Roll it in
export KUBECONFIG=~/.kube/honeydue-k3s.yaml
kubectl set image deployment/api -n honeydue \
  api="gitea.treytartt.com/admin/honeydue-api:${SHA}"
# 5. Watch
kubectl rollout status -n honeydue deployment/api
# 6. Log out
docker logout "$REGISTRY"
```
~3-5 minutes end to end for api.
## The build
### Step 1 — Prepare
```bash
cd /Users/treyt/Desktop/code/honeyDue/honeyDueAPI-go
git status # clean working tree?
git log -1 --oneline # this is the SHA that'll ship
```
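The clean-tree check can be made mechanical, so that an image is never tagged with a SHA that does not contain the code that was built. A minimal sketch: the helper names `tree_is_clean` and `require_clean_tree` are hypothetical, and it relies only on `git status --porcelain` printing nothing when there are no changes.

```bash
# Hypothetical helper: succeeds only when the porcelain output passed in is empty.
tree_is_clean() {
  [ -z "$1" ]
}

# Hypothetical guard: refuse to continue when uncommitted or untracked
# changes exist, because they would not be part of the tagged SHA.
require_clean_tree() {
  if ! tree_is_clean "$(git status --porcelain)"; then
    echo "working tree is dirty; commit or stash before building" >&2
    return 1
  fi
}
```

Calling `require_clean_tree || exit 1` at the top of a build script would stop the build before any image is tagged.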
### Step 2 — Login to Gitea
```bash
set -a; source deploy/registry.env; set +a
printf '%s' "$REGISTRY_TOKEN" | \
docker login "$REGISTRY" -u "$REGISTRY_USERNAME" --password-stdin
```
**Note**: passing the token on the command line (`docker login -p ...`)
leaves it in shell history and visible in the process list. Keep the
`printf ... | docker login --password-stdin` form.
### Step 3 — Build + push
```bash
SHA=$(git rev-parse --short HEAD)
# For API
docker buildx build \
  --platform linux/amd64 \
  --target api \
  -t "gitea.treytartt.com/admin/honeydue-api:${SHA}" \
  --push .
# For Worker
docker buildx build \
  --platform linux/amd64 \
  --target worker \
  -t "gitea.treytartt.com/admin/honeydue-worker:${SHA}" \
  --push .
# For Admin (Next.js)
docker buildx build \
  --platform linux/amd64 \
  --target admin \
  -t "gitea.treytartt.com/admin/honeydue-admin:${SHA}" \
  --push .
```
- `--platform linux/amd64` — cross-compile from operator's arm64 to
Hetzner nodes' amd64
- `--target X` — select a stage from the multi-stage Dockerfile
- `--push` — push to registry in one step; don't leave image in local
Docker
First build is slow (~3-5 min cold). Subsequent builds hit the BuildKit
layer cache and complete in ~30-60s if only app code changed.
### Build platform note
If `docker buildx` isn't configured:
```bash
docker buildx create --name honeydue-builder --use
docker buildx inspect --bootstrap
```
This creates a BuildKit container that supports cross-platform builds.
The `--bootstrap` line spins it up immediately so errors surface now
instead of on first build.
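Running `docker buildx create` a second time with the same name fails, so a script that sets up the builder benefits from checking first. A minimal sketch, assuming the builder name used above; the function name `ensure_builder` is hypothetical.

```bash
# Create the named builder only if it does not exist yet, then select it
# and start it so that configuration errors surface immediately.
ensure_builder() {
  local name="${1:-honeydue-builder}"
  if ! docker buildx inspect "$name" >/dev/null 2>&1; then
    docker buildx create --name "$name" >/dev/null
  fi
  docker buildx use "$name"
  docker buildx inspect --bootstrap "$name"
}
```

`ensure_builder` is safe to call before every build; on a machine that already has the builder it only selects and bootstraps it.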
## The deploy
### For a single service
```bash
export KUBECONFIG=~/.kube/honeydue-k3s.yaml
kubectl set image deployment/api -n honeydue \
api="gitea.treytartt.com/admin/honeydue-api:${SHA}"
```
This updates the Deployment's image field. Kubernetes:
1. Creates a new ReplicaSet with the new image (the
`deployment.kubernetes.io/revision` annotation records the revision
number)
2. Starts a new pod (per `maxSurge: 1`)
3. Waits for readinessProbe to pass on the new pod (up to 240s for
cold api boot)
4. Once ready, removes a pod from the old ReplicaSet
5. Repeats until all pods are on the new ReplicaSet
6. Marks rollout complete
### Watching the rollout
```bash
kubectl rollout status -n honeydue deployment/api
```
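Prints progress and returns when the rollout completes or fails. By
default `kubectl rollout status` has no timeout of its own
(`--timeout=0s`); it exits with an error once the Deployment exceeds
`spec.progressDeadlineSeconds`, which defaults to 600s (10 minutes).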
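The Deployment itself records whether a rollout is still moving in its `Progressing` condition, which is useful when checking from a second terminal or a script. A minimal sketch: the function names are hypothetical, while the two reason strings are the ones Kubernetes sets when a rollout finishes or exceeds its progress deadline.

```bash
# Print the reason of the Deployment's Progressing condition.
progressing_reason() {
  kubectl get deployment "$1" -n honeydue \
    -o jsonpath='{.status.conditions[?(@.type=="Progressing")].reason}'
}

# Collapse the reason into three coarse states.
rollout_state() {
  case "$(progressing_reason "$1")" in
    NewReplicaSetAvailable)   echo "complete" ;;
    ProgressDeadlineExceeded) echo "stalled" ;;
    *)                        echo "in-progress" ;;
  esac
}
```

`rollout_state api` prints one word, which makes it easy to branch on in a wrapper script.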
More detailed:
```bash
# Watch pods transition
kubectl get pods -n honeydue -l app.kubernetes.io/name=api -w
# Watch events
kubectl get events -n honeydue --sort-by=.lastTimestamp -w
```
### For all three services
```bash
for svc in api worker admin; do
  kubectl set image "deployment/${svc}" -n honeydue \
    "${svc}=gitea.treytartt.com/admin/honeydue-${svc}:${SHA}"
done
# Watch all rollouts
for svc in api worker admin; do
  kubectl rollout status -n honeydue "deployment/${svc}"
done
```
## Config-only changes (no new image)
When you change `prod.env` but code is unchanged:
```bash
# 1. Update prod.env locally
# 2. Regenerate ConfigMap
kubectl create configmap honeydue-config -n honeydue \
--from-env-file=deploy/prod.env \
--dry-run=client -o yaml | kubectl apply -f -
# 3. Pods do NOT auto-reload env vars. Restart them.
kubectl rollout restart -n honeydue deployment/api deployment/admin deployment/worker
```
`rollout restart` triggers a rolling update with the *same* image but
forces pod recreation. New pods pick up the updated ConfigMap.
### Why not auto-reload?
Kubernetes has no built-in mechanism to restart pods on ConfigMap change.
There's no `envFromWatch` equivalent. Third-party operators like
Reloader can do it, but we don't run one.
For sensitive config (like the `SECRET_KEY`), this is actually good —
pods don't cycle unexpectedly when someone tweaks the ConfigMap.
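If restarts should happen only when `prod.env` actually changed, one option is to stamp a checksum of the file onto the pod template, so that an unchanged file produces a no-op patch. This is a sketch of an alternative, not something the repo does today: the annotation key `honeydue/config-hash` and both function names are hypothetical, and `cksum` serves only as a portable change detector, not as a security measure.

```bash
# Print a checksum of the env file (CRC from POSIX cksum; change detection only).
config_hash() {
  [ -r "$1" ] || return 1
  cksum < "$1" | cut -d' ' -f1
}

# Patch the checksum into the pod template annotations of one Deployment.
# A changed value alters spec.template, which starts a rolling update;
# an unchanged value leaves the Deployment as it is.
stamp_config_hash() {
  local svc="$1" env_file="${2:-deploy/prod.env}" digest
  digest="$(config_hash "$env_file")" || return 1
  kubectl patch deployment "$svc" -n honeydue --type=merge \
    -p "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"honeydue/config-hash\":\"${digest}\"}}}}}"
}
```

The operator still triggers the patch by hand, so pods continue to cycle only when someone asks them to.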
## Secret changes
Same flow as config:
```bash
# Rotate a value
kubectl patch secret honeydue-secrets -n honeydue \
  --type=merge -p "{\"data\":{\"SECRET_KEY\":\"$(printf '%s' 'newvalue' | base64 | tr -d '\n')\"}}"
# Restart pods
kubectl rollout restart -n honeydue deployment/api deployment/worker
```
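Values in a Secret's `data` field must be base64 on a single line, and some `base64` implementations wrap long output. The sketch below shows one way to encode safely and to confirm that the cluster holds the intended value without printing it. Both function names are hypothetical.

```bash
# Encode a value for the Secret's `data` field: no trailing newline on the
# input, and no line wrapping in the output.
encode_secret_value() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Compare the stored value of a key with the intended plaintext.
# Reports match or mismatch through the exit status only.
secret_matches() {
  local key="$1" want="$2" have
  have="$(kubectl get secret honeydue-secrets -n honeydue \
    -o "jsonpath={.data.${key}}")" || return 1
  [ "$have" = "$(encode_secret_value "$want")" ]
}
```

After a rotation, `secret_matches SECRET_KEY "$NEW_VALUE" && echo rotated` gives a quick confirmation before restarting pods (`NEW_VALUE` is an assumed variable holding the new secret).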
## Manifest changes
When you add/modify a deployment YAML:
```bash
kubectl apply -f deploy-k3s/manifests/api/deployment.yaml
```
If the change touches anything under `spec.template` (the pod template:
image, resource limits, env, volumes), Kubernetes creates a new
ReplicaSet and pods roll. If the change is outside the template, such as
`spec.replicas`, existing pods are left alone; pods are only added or
removed to match the new count.
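To see whether a manifest edit will roll pods before applying it, `kubectl diff` compares the file with the live object. It exits 0 when there are no differences, 1 when there are differences, and with a higher code on error. A minimal sketch built on those exit codes; the function name `preview_and_apply` is hypothetical.

```bash
# Show the diff for a manifest and apply it only when something changed.
preview_and_apply() {
  local manifest="$1" rc=0
  kubectl diff -f "$manifest" || rc=$?
  case "$rc" in
    0) echo "no changes in ${manifest}" ;;
    1) kubectl apply -f "$manifest" ;;
    *) echo "kubectl diff failed (exit ${rc})" >&2; return "$rc" ;;
  esac
}
```

For example, `preview_and_apply deploy-k3s/manifests/api/deployment.yaml` prints the diff first; a change under `spec.template` in that output means pods will roll.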
## Rollback
### Last-known-good rollback
```bash
kubectl rollout undo deployment/api -n honeydue
```
Reverts to the previous ReplicaSet (the one with the previous image).
Takes ~30s to stabilize.
### Rollback to a specific revision
```bash
# See revision history
kubectl rollout history deployment/api -n honeydue
# Revert to specific revision number
kubectl rollout undo deployment/api -n honeydue --to-revision=3
```
Kubernetes keeps up to 10 ReplicaSet revisions by default
(`spec.revisionHistoryLimit`).
### Hard rollback (deploy an older image)
```bash
kubectl set image deployment/api -n honeydue \
api="gitea.treytartt.com/admin/honeydue-api:<older-sha>"
```
Useful when you want to go back further than the revision history, or
to a specific known-good SHA.
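Before pinning an older SHA it helps to record what is running now, so the hard rollback itself can be reversed. A minimal sketch, assuming each Deployment has its application container first in the pod spec; the function names are hypothetical.

```bash
# Print the image reference of the first container in a Deployment.
current_image() {
  kubectl get deployment "$1" -n honeydue \
    -o jsonpath='{.spec.template.spec.containers[0].image}'
}

# Print only the tag (the short SHA in this setup) of that image.
current_sha() {
  local img
  img="$(current_image "$1")" || return 1
  printf '%s\n' "${img##*:}"
}
```

`current_sha api` can be saved to a variable before running `kubectl set image` with the older tag.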
## Rolling update semantics
```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 1
```
For api (3 replicas):
- `maxUnavailable: 0` — no pod is removed until replacement is ready
- `maxSurge: 1` — up to 4 pods exist simultaneously during rollout
Timeline (approximate, warm state):
- t=0: kubectl set image
- t=0: k8s creates new RS with 1 pod
- t=30s (or so): new pod readiness probe passes
- t=30s: k8s terminates 1 old pod
- t=60s: next new pod ready
- t=60s: another old pod terminates
- ...continues until all on new RS
For cold-boot (e.g., first deploy on a rebuilt cluster), the
MigrateWithLock advisory lock extends this to several minutes. But the
rollout is serialized — only one pod starts per iteration, so the lock
queue is small.
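The pod counts during a rollout follow directly from the two strategy fields. The sketch below is a simplified model, not the controller's actual algorithm: it assumes `maxUnavailable: 0`, assumes every new pod becomes ready, and ignores timing. The function name is hypothetical.

```bash
# Simulate pod counts for a RollingUpdate with maxUnavailable=0.
# Args: replicas, maxSurge. Prints one line per step.
simulate_rollout() {
  local replicas="$1" surge="$2" old="$1" new=0 step=0 room want excess
  # With maxUnavailable=0, a surge of at least 1 is needed to make progress.
  [ "$surge" -ge 1 ] || return 1
  echo "step=${step} old=${old} new=${new} total=$((old + new))"
  while [ "$old" -gt 0 ]; do
    # Surge: start new pods up to the replicas+maxSurge ceiling.
    room=$((replicas + surge - old - new))
    want=$((replicas - new))
    if [ "$room" -gt "$want" ]; then room="$want"; fi
    new=$((new + room)); step=$((step + 1))
    echo "step=${step} old=${old} new=${new} total=$((old + new))"
    # Once the new pods are ready, remove old pods back down to `replicas`.
    excess=$((old + new - replicas))
    if [ "$excess" -gt "$old" ]; then excess="$old"; fi
    old=$((old - excess)); step=$((step + 1))
    echo "step=${step} old=${old} new=${new} total=$((old + new))"
  done
}
```

For api, `simulate_rollout 3 1` alternates between 4 and 3 total pods and ends at `step=6 old=0 new=3 total=3`, matching the one-pod-at-a-time timeline above.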
## Hotfix workflow
When we need to ship a fix fast and skip the usual steps:
1. Fix in code
2. Build + push
3. `kubectl set image` on the affected service only
4. Monitor with `kubectl logs -f`
Don't skip CI/tests in a real org; for a solo operator, skipping them is the tradeoff for speed.
## Integration with Gitea
Currently no CI/CD. The operator builds from the workstation and pushes
manually. Future:
- Gitea Actions (Drone-like CI) could trigger on push to `main`
- Build + push step could run in a GitHub Actions-compatible workflow
- Auto-deploy on tag push, manual promote to prod
**TODO** (Chapter 20).
## What the old Swarm deploy script did
Contrast: `deploy/scripts/deploy_prod.sh` (Swarm-era) did:
1. Validate every config file (placeholder detection, APNS key format,
B2 all-or-none)
2. Buildx to amd64
3. Push to Gitea (we retrofitted this from GHCR)
4. SCP bundle to manager node
5. `docker secret create` + `docker config create` with versioned names
6. `docker stack deploy --with-registry-auth`
7. Poll stack services until convergence (420s timeout)
8. Prune old secret/config versions
9. Healthcheck the final URL; auto-rollback on failure
10. Log out of registries
Our current k3s deploy is more manual but simpler. We'd write a similar
script for k3s if deploys become frequent:
```bash
# deploy-k3s/scripts/04-deploy.sh (not yet updated for Gitea)
```
See the scaffold in `deploy-k3s/scripts/`.
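As a rough idea of what a k3s counterpart could look like, the sketch below covers the build, push, convergence wait, and auto-rollback steps from the Swarm list. It is not the contents of `04-deploy.sh`: all function names are hypothetical, the 10-minute timeout is an assumed value, and config validation and the final URL healthcheck are left out.

```bash
REGISTRY_HOST="gitea.treytartt.com"
NAMESPACE="honeydue"

# Build the image reference for a service and a short SHA.
image_ref() {
  printf '%s/admin/honeydue-%s:%s\n' "$REGISTRY_HOST" "$1" "$2"
}

# Build one Dockerfile target for amd64 and push it.
build_and_push() {
  docker buildx build --platform linux/amd64 --target "$1" \
    -t "$(image_ref "$1" "$2")" --push .
}

# Roll one service to the new image; undo the rollout if it does not converge.
deploy_service() {
  local svc="$1" sha="$2"
  kubectl set image "deployment/${svc}" -n "$NAMESPACE" \
    "${svc}=$(image_ref "$svc" "$sha")" || return 1
  if ! kubectl rollout status "deployment/${svc}" -n "$NAMESPACE" --timeout=10m; then
    echo "rollout of ${svc} failed; reverting" >&2
    kubectl rollout undo "deployment/${svc}" -n "$NAMESPACE"
    return 1
  fi
}

# Build everything first, so a failed build never leaves a partial deploy.
deploy_all() {
  local sha="$1" svc
  for svc in api worker admin; do build_and_push "$svc" "$sha" || return 1; done
  for svc in api worker admin; do deploy_service "$svc" "$sha" || return 1; done
}
```

Usage would be `deploy_all "$(git rev-parse --short HEAD)"` after logging in to the registry. Building all three images before touching the cluster is a deliberate choice here: a build failure stops the run before any Deployment changes.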
## Common deploy failures
| Symptom | Likely cause |
|---|---|
| `ImagePullBackOff` | Image not in registry, or pull secret expired |
| Stuck at "Progressing" | Readiness probe not passing; check pod logs |
| `CrashLoopBackOff` immediately | App won't start; check pod logs for panic/exit reason |
| `CrashLoopBackOff` after migration | Cache service, Redis connection, or post-init code issue |
| Old pods never terminate | New pods not ready; rollout doesn't progress |
| Rollout succeeds but app is broken | Readiness probe is too lenient; passes on broken app |
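The first two columns of that table can be turned into a quick triage helper. A minimal sketch: the function names are hypothetical, and it assumes the application container is the first container in each api pod.

```bash
# List each api pod with the waiting reason of its first container
# (the reason is empty when the container is running).
pod_waiting_reasons() {
  kubectl get pods -n honeydue -l app.kubernetes.io/name=api \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].state.waiting.reason}{"\n"}{end}'
}

# Map a waiting reason to the first thing to check, following the table above.
triage_hint() {
  case "$1" in
    ImagePullBackOff|ErrImagePull)
      echo "check that the image tag exists in the registry and the pull secret is valid" ;;
    CrashLoopBackOff)
      echo "check pod logs, including --previous, for the panic or exit reason" ;;
    "")
      echo "container is not waiting; if the rollout is stuck, check the readiness probe" ;;
    *)
      echo "no hint for '$1'; run kubectl describe pod" ;;
  esac
}
```

`pod_waiting_reasons` gives the reason per pod, and `triage_hint CrashLoopBackOff` prints the matching next step.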
### Debugging commands
```bash
# Describe the deployment (shows events, conditions)
kubectl describe deployment api -n honeydue
# Describe every api pod (the selector matches all replicas)
kubectl describe pod -n honeydue -l app.kubernetes.io/name=api
# Logs from currently-running pods
kubectl logs -n honeydue -l app.kubernetes.io/name=api --tail=100 --prefix
# Logs from the previous (crashed or restarted) container in a pod
kubectl logs -n honeydue <pod> --previous
# Events in the namespace (sorted oldest first; newest at the bottom)
kubectl get events -n honeydue --sort-by=.lastTimestamp
# Pause a rollout (stops new pods from being created)
kubectl rollout pause deployment/api -n honeydue
# Resume
kubectl rollout resume deployment/api -n honeydue
```
## Zero-downtime considerations
For zero-downtime deploys, the new image must be:
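1. **Backward-compatible** at the database level: migrations run when
the first new pod starts (MigrateWithLock), while old pods are still
serving, so the migrated schema must keep working with the old code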
2. **Backward-compatible** with in-flight API requests (don't remove
endpoints mid-deploy; deprecate first)
3. **Backward-compatible** with Redis data structures (don't change
cache key formats abruptly)
For breaking changes:
1. Deploy intermediate version that handles both old and new
2. Once rolled out everywhere, deploy breaking-change version
3. Two deploys, same day or different days
We don't have this discipline yet; our API has too few clients to
worry about. As mobile clients proliferate, this becomes more important.
## Blue-green / canary (not yet)
Kubernetes Deployments only implement `RollingUpdate` and `Recreate`
natively. More advanced strategies can be layered on top:
- **Canary**: route 5% of traffic to new version, scale up gradually
- **Blue-green**: run new version alongside old, flip traffic all at
once
These require Traefik's TraefikService CRD with weighted routing, or
a service mesh. **TODO** if traffic scale justifies.
## Cleanup: the old Swarm config
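The `deploy/` directory contains the Swarm-era config. It is no longer
used for deploys, except that the commands in this chapter still read
`deploy/registry.env` and `deploy/prod.env`. Once we're confident in k3s
(a few weeks to a month), move those two files under `deploy-k3s/`,
update the commands that reference them, and then remove the directory:
```bash
rm -rf deploy/
```
After that, only `deploy-k3s/` remains.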
## Operator cheat sheet
```bash
# Full build + deploy
cd /Users/treyt/Desktop/code/honeyDue/honeyDueAPI-go
SHA=$(git rev-parse --short HEAD)
set -a; source deploy/registry.env; set +a
printf '%s' "$REGISTRY_TOKEN" | docker login "$REGISTRY" -u "$REGISTRY_USERNAME" --password-stdin
docker buildx build --platform linux/amd64 --target api -t "gitea.treytartt.com/admin/honeydue-api:${SHA}" --push .
docker buildx build --platform linux/amd64 --target worker -t "gitea.treytartt.com/admin/honeydue-worker:${SHA}" --push .
docker buildx build --platform linux/amd64 --target admin -t "gitea.treytartt.com/admin/honeydue-admin:${SHA}" --push .
docker logout gitea.treytartt.com
export KUBECONFIG=~/.kube/honeydue-k3s.yaml
for svc in api worker admin; do
  kubectl set image "deployment/${svc}" -n honeydue "${svc}=gitea.treytartt.com/admin/honeydue-${svc}:${SHA}"
done
for svc in api worker admin; do
  kubectl rollout status -n honeydue "deployment/${svc}"
done
```
## References
- [Kubernetes Deployment rolling update][rolling]
- [kubectl rollout][rollout]
- [Docker buildx][buildx]
[rolling]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-update-deployment
[rollout]: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#rollout
[buildx]: https://docs.docker.com/build/buildx/