Migrate prod deploy from Swarm to K3s; add full deployment book
Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
temporarily for reference
Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
callback (was causing 'unlock of unlocked mutex' fatal after
Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
+ allowlist fonts.googleapis.com so the marketing landing page CSS
actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
--platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
images runnable on x86_64 Hetzner nodes; fix array expansion under
set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
top-level aliases (the '\${X_SECRET}' form never actually resolved);
dozzle ports: long-form host_ip is rejected by Swarm, switched to
short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
(Next.js serves at root; /admin/ returned 404 and killed pods);
startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
and admin/src/app/api/*, hiding legitimate files)
New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log
Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
- Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
- Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
- Part III Security, Traefik ingress (Ch 5-6)
- Part IV Services, DB, storage, secrets, registry (Ch 7-11)
- Part V Data flow, deploy process, observability, failures, runbook
(Ch 12, 14-17)
- Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
- Appendices: glossary, kubectl cheat sheet, file locations,
consolidated citations
- README.md: Production Deployment section replaced with pointer to
the book; Go version bumped to 1.25
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,433 @@
# 14 — Deployment Process

## Summary

A production deploy has four steps: build a new image, push it to Gitea, update the
Deployment's image field with the new SHA, and let Kubernetes roll the new pods in.
There is no downtime as long as the change is backward-compatible. Rollback is
`kubectl rollout undo`. This chapter walks through the full process,
plus the alternate paths (config-only changes, manifest changes, hotfixes).

## TL;DR for a code change

```bash
# 1. Commit + get SHA
cd /Users/treyt/Desktop/code/honeyDue/honeyDueAPI-go
git add . && git commit -m "..." && SHA=$(git rev-parse --short HEAD)

# 2. Login to Gitea registry
set -a; source deploy/registry.env; set +a
printf '%s' "$REGISTRY_TOKEN" | docker login "$REGISTRY" -u "$REGISTRY_USERNAME" --password-stdin

# 3. Build + push amd64 image
docker buildx build --platform linux/amd64 --target api \
  -t "gitea.treytartt.com/admin/honeydue-api:${SHA}" --push .

# 4. Roll it in
export KUBECONFIG=~/.kube/honeydue-k3s.yaml
kubectl set image deployment/api -n honeydue \
  api="gitea.treytartt.com/admin/honeydue-api:${SHA}"

# 5. Watch
kubectl rollout status -n honeydue deployment/api

# 6. Log out
docker logout "$REGISTRY"
```

~3–5 minutes end to end for api.

## The build

### Step 1 — Prepare

```bash
cd /Users/treyt/Desktop/code/honeyDue/honeyDueAPI-go
git status            # clean working tree?
git log -1 --oneline  # this is the SHA that'll ship
```

### Step 2 — Login to Gitea

```bash
set -a; source deploy/registry.env; set +a
printf '%s' "$REGISTRY_TOKEN" | \
  docker login "$REGISTRY" -u "$REGISTRY_USERNAME" --password-stdin
```

**Note**: passing the token as an argument (`docker login -p <token>`) leaves it in
shell history and in the process list, and Docker warns about it. Piping it through
`printf` into `--password-stdin` avoids both, so don't skip that step.

### Step 3 — Build + push

```bash
SHA=$(git rev-parse --short HEAD)

# For API
docker buildx build \
  --platform linux/amd64 \
  --target api \
  -t "gitea.treytartt.com/admin/honeydue-api:${SHA}" \
  --push .

# For Worker
docker buildx build \
  --platform linux/amd64 \
  --target worker \
  -t "gitea.treytartt.com/admin/honeydue-worker:${SHA}" \
  --push .

# For Admin (Next.js)
docker buildx build \
  --platform linux/amd64 \
  --target admin \
  -t "gitea.treytartt.com/admin/honeydue-admin:${SHA}" \
  --push .
```

- `--platform linux/amd64`: build amd64 images on the operator's arm64
  machine so they run on the amd64 Hetzner nodes
- `--target X`: select a stage from the multi-stage Dockerfile
- `--push`: push to the registry in the same step instead of leaving the
  image in the local Docker store

The first build is slow (~3–5 min cold). Subsequent builds hit the BuildKit
layer cache and complete in ~30–60s if only app code changed.
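
The three builds above differ only in the target name, so they can be folded into a loop. The following is a minimal sketch, not the project's actual script: the helper names `image_ref`, `run`, and `build_all`, and the `DRY_RUN` switch, are hypothetical. By default it only prints the commands it would run.

```bash
# Hypothetical helpers; only the registry host and image naming come from this chapter.
REGISTRY_HOST="gitea.treytartt.com"

# image_ref SERVICE SHA -> full image reference
image_ref() {
  printf '%s/admin/honeydue-%s:%s\n' "$REGISTRY_HOST" "$1" "$2"
}

# run CMD...: always print the command; execute it only when DRY_RUN=0
run() {
  printf '+ %s\n' "$*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# build_all SHA: build and push the api, worker, and admin targets
build_all() {
  for svc in api worker admin; do
    run docker buildx build --platform linux/amd64 --target "$svc" \
      -t "$(image_ref "$svc" "$1")" --push .
  done
}

# Example: DRY_RUN=0 build_all "$(git rev-parse --short HEAD)"
```

Keeping dry-run as the default makes it cheap to check the tags before anything is pushed.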

### Build platform note

If `docker buildx` isn't configured:

```bash
docker buildx create --name honeydue-builder --use
docker buildx inspect --bootstrap
```

This creates a BuildKit container that supports cross-platform builds.
The `--bootstrap` line spins it up immediately so errors surface now
instead of on first build.

## The deploy

### For a single service

```bash
export KUBECONFIG=~/.kube/honeydue-k3s.yaml

kubectl set image deployment/api -n honeydue \
  api="gitea.treytartt.com/admin/honeydue-api:${SHA}"
```

This updates the Deployment's image field. Kubernetes then:

1. Creates a new ReplicaSet with the new image (the
   `deployment.kubernetes.io/revision` annotation records the revision number)
2. Starts a new pod (per `maxSurge: 1`)
3. Waits for the readinessProbe to pass on the new pod (the startupProbe
   allows up to 240s for a cold api boot)
4. Once it is ready, removes a pod from the old ReplicaSet
5. Repeats until all pods are on the new ReplicaSet
6. Marks the rollout complete

### Watching the rollout

```bash
kubectl rollout status -n honeydue deployment/api
```

Prints progress and returns when the rollout completes or fails.
`kubectl rollout status` itself waits indefinitely unless you pass `--timeout`;
what ends a stuck rollout is the Deployment's `progressDeadlineSeconds`
(default 600s, i.e. 10 minutes), after which the rollout is marked failed and
the command exits non-zero.

For more detail:

```bash
# Watch pods transition
kubectl get pods -n honeydue -l app.kubernetes.io/name=api -w

# Watch events
kubectl get events -n honeydue --sort-by=.lastTimestamp -w
```
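
The set-image and watch steps can be combined so that a failed rollout is reverted automatically. This is a sketch under assumptions: the function name `deploy_and_wait`, the `run` helper, the `DRY_RUN` switch, and the `300s` default are all hypothetical, while `--timeout` and `kubectl rollout undo` are standard kubectl.

```bash
# run CMD...: always print the command; execute it only when DRY_RUN=0 (hypothetical helper)
run() {
  printf '+ %s\n' "$*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# deploy_and_wait SERVICE IMAGE [TIMEOUT]
# Sets the new image, waits for the rollout, and reverts if the wait fails.
deploy_and_wait() {
  dw_svc="$1"; dw_image="$2"; dw_timeout="${3:-300s}"   # 300s is an assumed default
  run kubectl set image "deployment/$dw_svc" -n honeydue "$dw_svc=$dw_image"
  if ! run kubectl rollout status -n honeydue "deployment/$dw_svc" --timeout="$dw_timeout"; then
    echo "rollout of $dw_svc did not finish within $dw_timeout; reverting" >&2
    run kubectl rollout undo "deployment/$dw_svc" -n honeydue
    return 1
  fi
}

# Example:
#   DRY_RUN=0 deploy_and_wait api "gitea.treytartt.com/admin/honeydue-api:${SHA}" 600s
```

For api, the timeout should stay above the 240s cold-boot allowance, otherwise a slow but healthy start would be reverted.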

### For all three services

```bash
for svc in api worker admin; do
  kubectl set image "deployment/$svc" -n honeydue \
    "$svc=gitea.treytartt.com/admin/honeydue-${svc}:${SHA}"
done

# Watch all rollouts
for svc in api worker admin; do
  kubectl rollout status -n honeydue "deployment/$svc"
done
```

## Config-only changes (no new image)

When you change `prod.env` but the code is unchanged:

```bash
# 1. Update prod.env locally
# 2. Regenerate ConfigMap
kubectl create configmap honeydue-config -n honeydue \
  --from-env-file=deploy/prod.env \
  --dry-run=client -o yaml | kubectl apply -f -

# 3. Pods do NOT auto-reload env vars. Restart them.
kubectl rollout restart -n honeydue deployment/api deployment/admin deployment/worker
```

`rollout restart` triggers a rolling update with the *same* image but
forces pod recreation. New pods pick up the updated ConfigMap.

### Why not auto-reload?

Kubernetes has no built-in mechanism to restart pods when a ConfigMap changes:
environment variables are read once at container start, and `envFrom` has no
watch option. Third-party operators like Reloader can do it, but we don't run one.

This is arguably a feature: pods don't cycle unexpectedly when someone tweaks the
ConfigMap, and a sensitive value such as `SECRET_KEY` only takes effect when we
restart deliberately.
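
One common alternative to an unconditional `rollout restart` is to stamp a hash of the config file onto the pod template, so pods roll only when the file actually changed. The sketch below illustrates that idea and is not something this repo does today: the annotation key `checksum/config` and the helper names are assumptions, and by default it only prints the commands.

```bash
# run CMD...: always print the command; execute it only when DRY_RUN=0 (hypothetical helper)
run() {
  printf '+ %s\n' "$*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# file_sha256 FILE -> hex digest, using whichever tool is installed (Linux or macOS)
file_sha256() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d' ' -f1
  else
    shasum -a 256 "$1" | cut -d' ' -f1
  fi
}

# checksum_patch DIGEST -> JSON merge patch that sets a pod-template annotation
checksum_patch() {
  printf '{"spec":{"template":{"metadata":{"annotations":{"checksum/config":"%s"}}}}}\n' "$1"
}

# roll_if_config_changed FILE: patching with an unchanged digest is a no-op,
# so pods are recreated only when the file content differs.
roll_if_config_changed() {
  rc_digest="$(file_sha256 "$1")"
  for svc in api admin worker; do
    run kubectl patch "deployment/$svc" -n honeydue --type=merge \
      -p "$(checksum_patch "$rc_digest")"
  done
}

# Example: DRY_RUN=0 roll_if_config_changed deploy/prod.env
```

The roll still happens only when the operator runs the command, so the "no surprise restarts" property described above is preserved.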

## Secret changes

Same flow as config: update the object, then restart the pods that consume it.

```bash
# Rotate a value (Secret data must be base64-encoded on a single line;
# tr strips the line wrapping that GNU base64 adds to long values)
kubectl patch secret honeydue-secrets -n honeydue \
  --type=merge -p "{\"data\":{\"SECRET_KEY\":\"$(printf '%s' 'newvalue' | base64 | tr -d '\n')\"}}"

# Restart pods
kubectl rollout restart -n honeydue deployment/api deployment/worker
```
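
Typing the new value inline puts it in shell history, which is the same problem the `docker login` note warns about. A hedged sketch that reads the value from stdin instead follows; the function names and the `DRY_RUN` switch are hypothetical. Note that the base64 value is still passed to `kubectl` as an argument, so this keeps the plaintext out of history but not out of the process list.

```bash
# b64_oneline: base64-encode stdin as a single line
b64_oneline() { base64 | tr -d '\n'; }

# secret_patch KEY B64VALUE -> JSON merge patch for one key of a Secret
secret_patch() { printf '{"data":{"%s":"%s"}}\n' "$1" "$2"; }

# rotate_secret KEY: reads the new value from stdin, patches the Secret,
# then restarts the consumers. Prints redacted commands unless DRY_RUN=0.
rotate_secret() {
  rs_encoded="$(b64_oneline)"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    kubectl patch secret honeydue-secrets -n honeydue --type=merge \
      -p "$(secret_patch "$1" "$rs_encoded")"
    kubectl rollout restart -n honeydue deployment/api deployment/worker
  else
    echo "+ kubectl patch secret honeydue-secrets -n honeydue (key $1, value redacted)"
    echo "+ kubectl rollout restart -n honeydue deployment/api deployment/worker"
  fi
}

# Example (the file path is hypothetical; a trailing newline in the file
# would become part of the secret):
#   DRY_RUN=0 rotate_secret SECRET_KEY < /path/to/new-secret-key
```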

## Manifest changes

When you add or modify a Deployment YAML:

```bash
kubectl apply -f deploy-k3s/manifests/api/deployment.yaml
```

If the change touches the pod template (`spec.template`: resource limits, env,
volumes, and so on), Kubernetes treats it as a new revision and the pods roll.
If it only touches fields outside the template, such as `replicas`, existing
pods are left alone and pods are simply added or removed.

## Rollback

### Last-known-good rollback

```bash
kubectl rollout undo deployment/api -n honeydue
```

Reverts to the previous ReplicaSet (the one with the previous image).
Takes ~30s to stabilize.

### Rollback to a specific revision

```bash
# See revision history
kubectl rollout history deployment/api -n honeydue

# Revert to specific revision number
kubectl rollout undo deployment/api -n honeydue --to-revision=3
```

Kubernetes keeps up to 10 ReplicaSet revisions by default
(`spec.revisionHistoryLimit`).

### Hard rollback (deploy an older image)

```bash
kubectl set image deployment/api -n honeydue \
  api="gitea.treytartt.com/admin/honeydue-api:<older-sha>"
```

Useful when you want to go back further than the revision history, or
to a specific known-good SHA.
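
A hard rollback needs a known-good SHA, which is easiest to have on hand if the currently deployed image is recorded before each deploy. The sketch below shows one way to do that; the helper names and the `deploy-history.log` file are hypothetical, and `image_tag` assumes the reference always carries a tag, as the references in this chapter do.

```bash
# image_tag IMAGE_REF -> the part after the last ':' (assumes a tag is present)
image_tag() { printf '%s\n' "${1##*:}"; }

# current_image SERVICE -> image of the first container in the Deployment's pod template
current_image() {
  kubectl get "deployment/$1" -n honeydue \
    -o jsonpath='{.spec.template.spec.containers[0].image}'
}

# record_current SERVICE: append a timestamp, service, and deployed tag to a local log
record_current() {
  rc_image="$(current_image "$1")"
  printf '%s %s %s\n' "$(date -u +%FT%TZ)" "$1" "$(image_tag "$rc_image")" >> deploy-history.log
}

# Example (needs cluster access): record_current api
```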

## Rolling update semantics

```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 1
```

For api (3 replicas):

- `maxUnavailable: 0`: no old pod is removed until its replacement is ready
- `maxSurge: 1`: up to 4 pods exist simultaneously during the rollout

Timeline (approximate, warm state):

- t=0: `kubectl set image`
- t=0: k8s creates the new RS with 1 pod
- t≈30s: the new pod's readiness probe passes
- t≈30s: k8s terminates 1 old pod
- t≈60s: next new pod ready
- t≈60s: another old pod terminates
- ...continues until all pods are on the new RS

On a cold boot (e.g., the first deploy on a rebuilt cluster), replicas queue on the
MigrateWithLock advisory lock and startup stretches to several minutes. During a
rolling update, however, only one new pod starts per iteration, so the lock
queue stays small.
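
The numbers in this section follow from two small calculations, sketched here in shell arithmetic. The `periodSeconds` value of 5 is an assumption inferred from the stated 240s budget at `failureThreshold: 48`; it is not quoted from the manifest.

```bash
# startup_budget FAILURE_THRESHOLD PERIOD_SECONDS -> seconds the startupProbe tolerates
startup_budget() { echo $(( $1 * $2 )); }

# max_pods REPLICAS MAX_SURGE -> peak pod count during a rolling update
max_pods() { echo $(( $1 + $2 )); }

# min_ready REPLICAS MAX_UNAVAILABLE -> pods guaranteed to stay available
min_ready() { echo $(( $1 - $2 )); }

echo "api startup budget: $(startup_budget 48 5)s"   # prints: api startup budget: 240s
echo "api peak pods: $(max_pods 3 1)"                # prints: api peak pods: 4
echo "api minimum ready: $(min_ready 3 0)"           # prints: api minimum ready: 3
```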

## Hotfix workflow

When we need to ship a fix fast and skip the usual steps:

1. Fix in code
2. Build + push
3. `kubectl set image` on the affected service only
4. Monitor with `kubectl logs -f`

In a real org, don't skip CI and tests; for a solo operator, skipping them is the
accepted tradeoff.

## Integration with Gitea

There is currently no CI/CD: the operator builds on the workstation and pushes
manually. A possible future setup:

- Gitea Actions (Gitea's built-in CI) could trigger on push to `main`
- The build + push step could run in a GitHub Actions-compatible workflow
- Auto-deploy on tag push, with manual promotion to prod

**TODO** (Chapter 20).

## What the old Swarm deploy script did

For contrast, the Swarm-era `deploy/scripts/deploy_prod.sh` did the following:

1. Validate every config file (placeholder detection, APNS key format,
   B2 all-or-none)
2. Buildx to amd64
3. Push to Gitea (we retrofitted this from GHCR)
4. SCP bundle to manager node
5. `docker secret create` + `docker config create` with versioned names
6. `docker stack deploy --with-registry-auth`
7. Poll stack services until convergence (420s timeout)
8. Prune old secret/config versions
9. Healthcheck the final URL; auto-rollback on failure
10. Log out of registries

Our current k3s deploy is more manual but simpler. If deploys become frequent,
we'd write a similar script for k3s:

```bash
# deploy-k3s/scripts/04-deploy.sh (not yet updated for Gitea)
```

See the scaffold in `deploy-k3s/scripts/`.
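
As an illustration of what step 1 of the old script (placeholder detection) might look like if ported, here is a minimal sketch. It is not taken from `deploy_prod.sh`: the function names and the placeholder patterns (`CHANGEME`, `<...>`, empty values) are assumptions chosen for the example.

```bash
# find_placeholders FILE: print (with line numbers) KEY=VALUE lines whose value
# is empty, starts with CHANGEME, or is an angle-bracket placeholder.
find_placeholders() {
  grep -nE '^[A-Za-z_][A-Za-z0-9_]*=($|CHANGEME|<[^>]*>$)' "$1" || true
}

# validate_env FILE: return non-zero and list the offending lines if any are found
validate_env() {
  ve_bad="$(find_placeholders "$1")"
  if [ -n "$ve_bad" ]; then
    echo "refusing to deploy; unresolved values in $1:" >&2
    echo "$ve_bad" >&2
    return 1
  fi
}

# Example: validate_env deploy/prod.env && echo "config looks filled in"
```

Running a check like this before regenerating the ConfigMap catches an unfilled template before it reaches the pods.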

## Common deploy failures

| Symptom | Likely cause |
|---|---|
| `ImagePullBackOff` | Image not in registry, or pull secret expired |
| Stuck at "Progressing" | Readiness probe not passing; check pod logs |
| `CrashLoopBackOff` immediately | App won't start; check pod logs for panic/exit reason |
| `CrashLoopBackOff` after migration | Cache service, Redis connection, or post-init code issue |
| Old pods never terminate | New pods not ready; rollout doesn't progress |
| Rollout succeeds but app is broken | Readiness probe is too lenient; passes on broken app |

### Debugging commands

```bash
# Describe the deployment (shows events, conditions)
kubectl describe deployment api -n honeydue

# Describe the api pods (every pod matching the label)
kubectl describe pod -n honeydue -l app.kubernetes.io/name=api

# Logs from currently-running pods
kubectl logs -n honeydue -l app.kubernetes.io/name=api --tail=100 --prefix

# Logs from the previous (crashed or restarted) container in a pod
kubectl logs -n honeydue <pod> --previous

# Events in the namespace (sorted oldest to newest; the latest are at the bottom)
kubectl get events -n honeydue --sort-by=.lastTimestamp

# Pause a rollout (no further pods are replaced until it is resumed)
kubectl rollout pause deployment/api -n honeydue

# Resume
kubectl rollout resume deployment/api -n honeydue
```

## Zero-downtime considerations

For zero-downtime deploys, the new image must be:

1. **Backward-compatible** with the current database schema (schema
   migrations run before the new code serves traffic, while old pods are
   still running, so a migration must not break the old code either)
2. **Backward-compatible** with in-flight API requests (don't remove
   endpoints mid-deploy; deprecate first)
3. **Backward-compatible** with Redis data structures (don't change
   cache key formats abruptly)

For breaking changes:

1. Deploy an intermediate version that handles both old and new
2. Once it is rolled out everywhere, deploy the breaking-change version
3. That means two deploys, on the same day or on different days

We don't have this discipline yet; our API has too few clients for it to
matter. As mobile clients proliferate, it becomes more important.

## Blue-green / canary (not yet)

Kubernetes Deployments only offer the `RollingUpdate` and `Recreate`
strategies, but more advanced rollout patterns can be layered on top:

- **Canary**: route 5% of traffic to the new version, scale up gradually
- **Blue-green**: run the new version alongside the old, flip traffic all at
  once

These require Traefik's TraefikService CRD with weighted routing, or
a service mesh. **TODO** if traffic scale justifies it.

## Cleanup: the old Swarm config

The `deploy/` directory contains the Swarm-era config. It's still there, but
the cluster no longer uses it. After we're confident in k3s (a few weeks? a
month?), remove it:

```bash
rm -rf deploy/
```

Before deleting, move anything still in use into `deploy-k3s/`: the commands in
this chapter still read `deploy/registry.env` and `deploy/prod.env`.

## Operator cheat sheet

```bash
# Full build + deploy
cd /Users/treyt/Desktop/code/honeyDue/honeyDueAPI-go
SHA=$(git rev-parse --short HEAD)
set -a; source deploy/registry.env; set +a
printf '%s' "$REGISTRY_TOKEN" | docker login "$REGISTRY" -u admin --password-stdin
docker buildx build --platform linux/amd64 --target api -t "gitea.treytartt.com/admin/honeydue-api:${SHA}" --push .
docker buildx build --platform linux/amd64 --target worker -t "gitea.treytartt.com/admin/honeydue-worker:${SHA}" --push .
docker buildx build --platform linux/amd64 --target admin -t "gitea.treytartt.com/admin/honeydue-admin:${SHA}" --push .
docker logout gitea.treytartt.com

export KUBECONFIG=~/.kube/honeydue-k3s.yaml
for svc in api worker admin; do
  kubectl set image deployment/$svc -n honeydue "$svc=gitea.treytartt.com/admin/honeydue-${svc}:${SHA}"
done

for svc in api worker admin; do
  kubectl rollout status -n honeydue deployment/$svc
done
```

## References

- [Kubernetes Deployment rolling update][rolling]
- [kubectl rollout][rollout]
- [Docker buildx][buildx]

[rolling]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-update-deployment
[rollout]: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#rollout
[buildx]: https://docs.docker.com/build/buildx/