Migrate prod deploy from Swarm to K3s; add full deployment book

Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do() callback (was causing 'unlock of unlocked mutex' fatal after Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self' + allowlist fonts.googleapis.com so the marketing landing page CSS actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce images runnable on x86_64 Hetzner nodes; fix array expansion under set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use top-level aliases (the '\${X_SECRET}' form never actually resolved); dozzle ports: long-form host_ip is rejected by Swarm, switched to short-form (bound to 0.0.0.0 with UFW-based loopback restriction); worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/' (Next.js serves at root; /admin/ returned 404 and killed pods); startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable 1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold 12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot; real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/ and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet + hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/: full deployment book, 26 files, ~42k words:
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations, consolidated citations
- README.md: Production Deployment section replaced with pointer to the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 07:20:21 -05:00
parent 4ec4bbbfe8
commit 6f303dbbaa
46 changed files with 9785 additions and 93 deletions
@@ -0,0 +1,207 @@
+# Appendix A — Glossary
+
+Grouped by topic, roughly alphabetical within each group. Cross-referenced
+to the chapters where each term is used in detail.
+
+## Kubernetes / k3s
+
+**ClusterIP**: Internal IP of a Kubernetes Service. Stable; load-
+balances to backing pods. (Chapter 3)
+
+**containerd**: Container runtime bundled with k3s. Replaces Docker for
+the runtime layer. (Chapter 2)
+
+**ConfigMap**: Kubernetes resource holding non-sensitive config (env
+vars). Mounted into pods via `envFrom`. (Chapter 10)
+
+**CoreDNS**: Cluster-internal DNS resolver. Every pod's
+`/etc/resolv.conf` points to the CoreDNS Service. (Chapter 3)
+
+**CRD (Custom Resource Definition)**: Kubernetes extension mechanism
+for third-party resource types. Traefik's `IngressRoute` and
+`Middleware` are CRDs. (Chapter 6)
+
+**DaemonSet**: Workload that runs exactly one pod per node. We use it
+for Traefik so each node has its own ingress pod. (Chapter 6)
+
+**Deployment**: Kubernetes workload for stateless pods. Supports rolling
+updates. Most of our services are Deployments. (Chapter 7)
+
+**Endpoints**: The actual pod IPs backing a Service's ClusterIP.
+Dynamically updated as pods come and go. (Chapter 3)
+
+**etcd**: Distributed key-value store holding cluster state. K3s
+embeds it. Raft-replicated across server nodes. (Chapter 2)
+
+**Flannel**: Kubernetes CNI (Container Network Interface) plugin for
+pod-to-pod networking. Uses VXLAN tunneling. (Chapter 3)
+
+**HPA (HorizontalPodAutoscaler)**: K8s resource that scales Deployment
+replicas based on CPU/memory usage. Not currently enabled for us.
+(Chapter 7)
+
+**Ingress**: K8s resource describing external-to-internal routing rules.
+Traefik watches Ingresses and programs itself accordingly. (Chapter 6)
+
+**IPVS**: Linux kernel feature for in-kernel L4 load balancing. Our
+kube-proxy uses it. (Chapter 3)
+
+**k3s**: Lightweight Kubernetes distribution by Rancher/SUSE. What we
+run. (Chapter 2)
+
+**kubectl**: Kubernetes CLI tool. Runs on operator workstation.
+(Chapter 17)
+
+**kubelet**: Agent running on each node, responsible for pod lifecycle.
+(Chapter 2)
+
+**kube-proxy**: Service-to-pod routing component. Runs on each node in
+IPVS mode. (Chapter 3)
+
+**Namespace**: Kubernetes logical grouping. Our app lives in `honeydue`.
+System services in `kube-system`. (Chapter 7)
+
+**NetworkPolicy**: K8s resource defining allowed traffic between pods.
+Not currently applied. (Chapter 5)
+
+**Node**: A physical or virtual machine running Kubernetes. We have 3.
+(Chapter 1)
+
+**PDB (PodDisruptionBudget)**: Constraint on voluntary pod disruptions
+(drain, upgrade). Keeps N replicas available. (Chapter 7)
+
+**Pod**: Smallest Kubernetes unit — one or more containers sharing
+network and storage. Our pods are usually one-container. (Chapter 7)
+
+**PVC (PersistentVolumeClaim)**: Request for persistent storage. Redis
+uses one. (Chapter 7)
+
+**RBAC**: Role-Based Access Control. Governs who/what can do what via
+the Kubernetes API. (Chapter 5)
+
+**ReplicaSet**: Managed by a Deployment; ensures N pods of a template
+are running. Each deploy creates a new ReplicaSet. (Chapter 14)
+
+**Secret**: K8s resource holding sensitive values. Base64-encoded;
+stored in etcd (unencrypted by default). (Chapter 10)
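+
+Base64 is an encoding, not encryption, so reading a Secret's value takes
+no key. A minimal sketch of the round trip, using a made-up value
+(`example-password` is illustrative, not a real credential):
+
```bash
# Hypothetical value for illustration only; not a real credential.
plain='example-password'

# Encode the way a Secret's .data field stores values.
encoded=$(printf '%s' "$plain" | base64)

# Decoding needs no key, which is why base64 gives no confidentiality.
decoded=$(printf '%s' "$encoded" | base64 -d)

echo "encoded: $encoded"
echo "decoded: $decoded"
```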
+
+**Service**: K8s resource providing a stable endpoint (ClusterIP) for
+a set of pods. (Chapter 3)
+
+**ServiceAccount**: Identity used by pods to authenticate to the
+Kubernetes API. We disable token mounting for our app pods.
+(Chapter 5)
+
+**Taint / Toleration**: Mechanism to prevent pods from being scheduled
+on certain nodes. Not used in our setup. (Chapter 7)
+
+## Docker / Swarm
+
+**libnetwork**: Docker's networking library. Provides overlay
+networking for Swarm. Source of the DNS ghost bug (Chapter 19).
+
+**mode: global**: Swarm deploy mode that runs one task (container) per
+node. (Chapter 19)
+
+**mode: host**: Port publishing mode that binds to node's real
+interface, bypassing the ingress mesh. (Chapter 4)
+
+**Overlay network**: Encrypted or unencrypted virtual network spanning
+Swarm nodes. (Chapter 19)
+
+**Swarm**: Docker's built-in orchestrator. What we used to run.
+(Chapter 19)
+
+**VXLAN**: Virtual Extensible LAN. Layer-2 over Layer-3 tunneling.
+Used by both Swarm overlay and Kubernetes Flannel. (Chapter 3)
+
+## Cloudflare
+
+**Flexible SSL**: CF SSL mode where CF↔origin is HTTP. Our current
+setup. (Chapter 13)
+
+**Full (strict) SSL**: CF SSL mode where CF↔origin is HTTPS with cert
+verification. Our target. (Chapter 13)
+
+**Origin CA**: CF-internal certificate authority that issues certs CF's
+edge trusts. Used for Full strict mode. (Chapter 13)
+
+**POP (Point of Presence)**: A CF edge location. ~300 globally.
+(Chapter 13)
+
+**Proxied (orange cloud)**: DNS record with CF proxying on. Traffic
+goes through CF. (Chapter 13)
+
+**Workers**: CF's serverless compute at the edge. We don't use them yet.
+(Chapter 20)
+
+## Hetzner
+
+**CX33**: Hetzner Cloud instance type. 4 vCPU, 8 GB RAM, 80 GB SSD.
+(Chapter 1)
+
+**Cloud Firewall**: Hetzner's provider-level firewall feature. We use
+UFW on nodes instead. (Chapter 4)
+
+**nbg1**: Nuremberg datacenter code. Our region. (Chapter 1)
+
+## Neon
+
+**Branch**: Neon's isolation primitive. Each project can have multiple
+branches (prod, staging, dev). (Chapter 8)
+
+**CU (Compute Unit)**: Neon's pricing unit for compute.
+(Chapter 8)
+
+**Launch plan**: Neon's entry-level paid plan. $5 min + usage.
+(Chapter 8)
+
+**Pooler**: Neon's built-in PgBouncer instance at the `-pooler` hostname
+suffix. (Chapter 8)
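+
+As a rough illustration of the suffix, the pooled hostname can be derived
+from the direct one by appending `-pooler` to the first DNS label (the
+endpoint ID). The hostname below is a made-up placeholder, not our real
+endpoint:
+
```bash
# Hypothetical direct hostname; substitute the one from the Neon console.
direct_host='ep-example-123456.us-east-2.aws.neon.tech'

# Split at the first dot: endpoint ID on the left, domain on the right.
endpoint_id=${direct_host%%.*}
domain=${direct_host#*.}

# The pooled endpoint is the same host with "-pooler" after the endpoint ID.
pooled_host="${endpoint_id}-pooler.${domain}"

echo "$pooled_host"
```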
+
+## Backblaze B2
+
+**B2**: Backblaze's object storage. What we use for uploads.
+(Chapter 9)
+
+**App key**: B2's bucket-scoped credential. Not an IAM-flavored role.
+(Chapter 9)
+
+**S3-compatible**: API that speaks AWS S3 protocol. B2 supports it.
+(Chapter 9)
+
+## Go + Asynq
+
+**AutoMigrate**: GORM function that syncs DB schema to Go structs.
+(Chapter 8)
+
+**Asynq**: Go library for background job queues. Redis-backed.
+(Chapter 7)
+
+**GORM**: Go ORM we use. (Chapter 8)
+
+**pgx**: Go Postgres driver used by GORM. (Chapter 8)
+
+**sync.Once**: Go stdlib primitive for "run this exactly once." Source
+of bug #6 (Chapter 19).
+
+## Other
+
+**advisory lock**: A Postgres lock that doesn't block rows but lets
+apps coordinate voluntarily. We use one for migration serialization.
+(Chapter 8)
+
+**AOF (Append-Only File)**: Redis persistence mode that logs every
+write. (Chapter 7)
+
+**MTU**: Maximum Transmission Unit. Packet size limit. VXLAN reduces
+effective MTU by 50 bytes. (Chapter 3)
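+
+The 50-byte figure is the sum of the encapsulation headers. A small
+sketch of the arithmetic, assuming a 1500-byte link MTU and an IPv4
+outer header (both are assumptions; check the actual interface MTU on
+the nodes):
+
```bash
# Assumption: the underlying interface uses the common Ethernet MTU of 1500.
link_mtu=1500

# Bytes VXLAN adds inside the outer IP packet (IPv4 outer header assumed).
outer_ipv4=20       # outer IPv4 header
outer_udp=8         # outer UDP header
vxlan_header=8      # VXLAN header
inner_ethernet=14   # Ethernet header of the encapsulated frame

overhead=$((outer_ipv4 + outer_udp + vxlan_header + inner_ethernet))
pod_mtu=$((link_mtu - overhead))

echo "overhead=${overhead} pod_mtu=${pod_mtu}"
```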
+
+**Raft**: Consensus algorithm. Used by etcd. (Chapter 2)
+
+**STARTTLS**: SMTP upgrade from plain to TLS. Used for Fastmail.
+(Chapter 5)
+
+**UFW**: Uncomplicated Firewall. Frontend for iptables. (Chapter 4)
+
+**VXLAN**: See Docker/Swarm section.
@@ -0,0 +1,305 @@
+# Appendix B — kubectl Cheat Sheet
+
+Specific to this deployment. Assumes:
+
+```bash
+export KUBECONFIG=~/.kube/honeydue-k3s.yaml
+```
+
+## Viewing state
+
+```bash
+# All pods in our namespace
+kubectl get pods -n honeydue
+
+# With node placement + IPs
+kubectl get pods -n honeydue -o wide
+
+# All resources in our namespace
+kubectl get all -n honeydue
+
+# Cluster-wide pod overview
+kubectl get pods -A
+
+# Node health
+kubectl get nodes
+kubectl top nodes
+
+# What's using RAM
+kubectl top pods -n honeydue --sort-by=memory
+
+# What's using CPU
+kubectl top pods -n honeydue --sort-by=cpu
+```
+
+## Logs
+
+```bash
+# Follow all api pod logs
+kubectl logs -n honeydue -l app.kubernetes.io/name=api -f --prefix
+
+# One specific pod
+kubectl logs -n honeydue <pod-name>
+
+# Logs from the previous container instance of this pod (after a crash/restart)
+kubectl logs -n honeydue <pod-name> --previous
+
+# Filtered (deploy/api picks a single pod, not all replicas)
+kubectl logs -n honeydue deploy/api | grep -i error
+kubectl logs -n honeydue deploy/api --since=1h
+
+# stern is nicer for multi-pod (if installed)
+stern -n honeydue api
+```
+
+## Deploying new code
+
+```bash
+SHA=$(git rev-parse --short HEAD)
+
+# Build + push (requires docker login to Gitea first)
+docker buildx build --platform linux/amd64 --target api \
+  -t "gitea.treytartt.com/admin/honeydue-api:${SHA}" --push .
+
+# Roll it in
+kubectl set image deployment/api -n honeydue \
+  api="gitea.treytartt.com/admin/honeydue-api:${SHA}"
+
+# Watch
+kubectl rollout status -n honeydue deployment/api
+```
+
+## Rolling update controls
+
+```bash
+# Pause a rollout in progress (new pods stop being created)
+kubectl rollout pause deployment/api -n honeydue
+
+# Resume
+kubectl rollout resume deployment/api -n honeydue
+
+# Rollback to previous version
+kubectl rollout undo deployment/api -n honeydue
+
+# Rollback to specific revision
+kubectl rollout history deployment/api -n honeydue
+kubectl rollout undo deployment/api -n honeydue --to-revision=3
+
+# Force restart (new pods re-read the ConfigMap; the image is re-pulled only under imagePullPolicy: Always)
+kubectl rollout restart deployment/api -n honeydue
+```
+
+## Scaling
+
+```bash
+# Scale up
+kubectl scale deployment/api -n honeydue --replicas=5
+
+# Scale down
+kubectl scale deployment/api -n honeydue --replicas=3
+
+# Kill everything (emergency)
+kubectl scale deployment -n honeydue --all --replicas=0
+
+# Bring back
+kubectl scale deployment/api -n honeydue --replicas=3
+kubectl scale deployment/admin deployment/worker deployment/redis -n honeydue --replicas=1
+```
+
+## Debugging a pod
+
+```bash
+# Describe = events + state + restart history
+kubectl describe pod -n honeydue <pod-name>
+
+# Shell in
+kubectl exec -it -n honeydue deploy/api -- /bin/sh
+
+# Inside the pod shell:
+# Test HTTP locally (bypasses Traefik, Service, overlay)
+wget -qO- http://127.0.0.1:8000/api/health/
+
+# Test cross-Service DNS
+getent hosts redis
+getent hosts admin
+getent hosts postgres
+
+# Run arbitrary command (one-shot)
+kubectl exec -n honeydue deploy/api -- env | grep POSTGRES
+```
+
+## Networking checks
+
+```bash
+# Resolve a Service from a pod
+kubectl exec -n honeydue deploy/api -- nslookup redis
+
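+# Check Service endpoints (the actual IPs behind a ClusterIP)
+# The v1 Endpoints API is deprecated since Kubernetes 1.33; EndpointSlices are the replacement
+kubectl get endpoints -n honeydue api
+kubectl get endpointslices -n honeydue -l kubernetes.io/service-name=api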
+
+# Traffic test via Service
+kubectl run test --rm -it --image=alpine/curl -- sh
+# curl http://api.honeydue.svc:8000/api/health/
+
+# List all Ingresses
+kubectl get ingress -A
+```
+
+## Secret / Config
+
+```bash
+# List
+kubectl get secrets -n honeydue
+kubectl get configmap -n honeydue
+
+# Describe (shows keys, not values)
+kubectl describe secret honeydue-secrets -n honeydue
+
+# Read a value (DANGER: plaintext to stdout)
+kubectl get secret honeydue-secrets -n honeydue \
+  -o jsonpath='{.data.POSTGRES_PASSWORD}' | base64 -d; echo
+
+# Update a single secret key (tr strips the line wrapping GNU base64 adds to long values)
+kubectl patch secret honeydue-secrets -n honeydue \
+  --type=merge -p "{\"data\":{\"SECRET_KEY\":\"$(printf '%s' 'new-val' | base64 | tr -d '\n')\"}}"
+
+# Regenerate ConfigMap from prod.env
+kubectl create configmap honeydue-config -n honeydue \
+  --from-env-file=deploy/prod.env \
+  --dry-run=client -o yaml | kubectl apply -f -
+
+# Edit a ConfigMap interactively (does NOT restart pods)
+kubectl edit configmap honeydue-config -n honeydue
+```
+
+## Node management
+
+```bash
+# Prevent scheduling on a node
+kubectl cordon <node-hostname>
+
+# Prevent scheduling + evict existing pods
+kubectl drain <node-hostname> --ignore-daemonsets --delete-emptydir-data
+
+# Allow scheduling again
+kubectl uncordon <node-hostname>
+
+# Label a node
+kubectl label node <node-hostname> honeydue/redis=true --overwrite
+
+# Remove a label
+kubectl label node <node-hostname> honeydue/redis-
+```
+
+## Events (the timeline)
+
+```bash
+# All events, newest last
+kubectl get events -A --sort-by=.lastTimestamp
+
+# Watch live (kubectl ignores --sort-by when watching)
+kubectl get events -A -w
+
+# Only warnings
+kubectl get events -A --field-selector type=Warning
+
+# Events for a specific pod
+kubectl describe pod -n honeydue <pod> | awk '/Events:/,0'
+```
+
+## Traefik-specific
+
+```bash
+# All Traefik pods (DaemonSet, so one per node)
+kubectl get pods -n kube-system -l app.kubernetes.io/name=traefik -o wide
+
+# Restart Traefik across all nodes
+kubectl rollout restart daemonset/traefik -n kube-system
+
+# View Traefik config (via ConfigMap)
+kubectl get cm -n kube-system traefik -o yaml | less
+
+# See the HelmChartConfig we applied
+kubectl get helmchartconfig -n kube-system traefik -o yaml
+
+# Force Helm re-reconcile
+kubectl delete job -n kube-system helm-install-traefik
+```
+
+## Cluster-wide operations
+
+```bash
+# API server health
+kubectl cluster-info
+
+# All namespaces
+kubectl get namespaces
+
+# All k3s-system pods
+kubectl get pods -n kube-system
+
+# All ServiceAccounts in our namespace
+kubectl get sa -n honeydue
+
+# Check what an SA can do in our namespace
+kubectl auth can-i --list --as=system:serviceaccount:honeydue:api -n honeydue
+```
+
+## Hetzner SSH (not kubectl, but often needed)
+
+```bash
+# SSH in
+ssh -i ~/.ssh/hetzner deploy@hetzner1
+
+# Check k3s service
+ssh -i ~/.ssh/hetzner deploy@hetzner1 'sudo systemctl status k3s'
+
+# Per-node commands, one node at a time (e.g., apt upgrade)
+for h in hetzner1 hetzner2 hetzner3; do
+  ssh -i ~/.ssh/hetzner "deploy@$h" 'sudo apt update && sudo apt upgrade -y'
+done
+```
+
+## Emergency: cluster is wedged
+
+```bash
+# Check all nodes Ready
+kubectl get nodes
+
+# If one is NotReady
+ssh -i ~/.ssh/hetzner deploy@<node> 'sudo systemctl restart k3s'
+
+# If still bad, kill k3s on that node and check
+ssh -i ~/.ssh/hetzner deploy@<node> 'sudo /usr/local/bin/k3s-killall.sh'
+ssh -i ~/.ssh/hetzner deploy@<node> 'sudo systemctl start k3s'
+
+# Last resort: uninstall + rejoin
+# ssh -i ~/.ssh/hetzner deploy@<node> 'sudo /usr/local/bin/k3s-uninstall.sh'
+# then re-join via the k3s install command
+```
+
+## One-liners worth memorizing
+
+```bash
+# Heavy smoke test through CF
+for url in https://api.myhoneydue.com/api/health/ https://admin.myhoneydue.com/ https://myhoneydue.com/; do
+  ok=0
+  for i in $(seq 1 20); do
+    [[ "$(curl -sS -o /dev/null -w '%{http_code}' --max-time 10 "$url")" == "200" ]] && ok=$((ok+1))
+  done
+  printf "%-45s %d/20\n" "$url" "$ok"
+done
+
+# Pods not Running or Completed (with -A, STATUS is column 4)
+kubectl get pods -A | awk '$4!="Running" && $4!="Completed" && $4!="STATUS"'
+
+# Restart everything in our namespace
+for d in api admin worker redis; do
+  kubectl rollout restart deploy/$d -n honeydue
+done
+
+# Watch all rollouts simultaneously
+for d in api admin worker redis; do
+  kubectl rollout status deploy/$d -n honeydue &
+done; wait
+```
@@ -0,0 +1,216 @@
+# Appendix C — File Locations
+
+Complete map of where every significant file lives — on the operator
+workstation, in the git repo, and on the Hetzner nodes.
+
+## Operator workstation
+
+### Kubernetes
+
+| Path | Purpose |
+|---|---|
+| `~/.kube/honeydue-k3s.yaml` | kubeconfig for the k3s cluster. Contains an admin bearer token. Mode 0600. |
+| `~/.kube/config` | Default kubeconfig (points elsewhere, not our cluster). |
+
+Set `KUBECONFIG=~/.kube/honeydue-k3s.yaml` before any `kubectl` command.
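+
+One way to enforce this is a small wrapper that refuses to call `kubectl`
+with any other kubeconfig. This is only a sketch; the function name
+`hd_kubectl` is hypothetical and not part of the repo:
+
```bash
# Hypothetical helper: run kubectl only when KUBECONFIG points at the
# honeydue kubeconfig, so commands never hit the default cluster by accident.
hd_kubectl() {
  local expected="$HOME/.kube/honeydue-k3s.yaml"
  if [ "${KUBECONFIG:-}" != "$expected" ]; then
    echo "KUBECONFIG is '${KUBECONFIG:-unset}', expected '$expected'" >&2
    return 1
  fi
  kubectl "$@"
}
```
+
+With the variable exported correctly, `hd_kubectl get pods -n honeydue`
+behaves like plain `kubectl`; otherwise it prints the mismatch and
+returns non-zero.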
+
+### SSH
+
+| Path | Purpose |
+|---|---|
+| `~/.ssh/hetzner` | Private key for node SSH (ed25519). Mode 0600. |
+| `~/.ssh/hetzner.pub` | Public key corresponding to above. |
+| `~/.ssh/config` | Host aliases for hetzner1/hetzner2/hetzner3 → node IPs. |
+
+Public key content:
+```
+ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBU9xTTBD78tYUqHijgyU9PDqtmS4NuM/6uy8XgDzva+ hetzner2@myhoneydue.com
+```
+
+### Docker
+
+| Path | Purpose |
+|---|---|
+| `~/.docker/config.json` | Docker CLI config. After `docker login` to Gitea, contains creds. **Log out after each deploy** so PATs are not left on disk. |
+| `~/Library/Containers/com.docker.docker/` | Docker Desktop state (macOS). |
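+
+The log-out advice above can be sketched as a wrapper that always runs
+`docker logout`, even when the wrapped command fails.
+`with_registry_login` is a hypothetical helper, not something in the repo:
+
```bash
# Hypothetical wrapper: log in, run the given command, then always log out
# so registry credentials do not stay in ~/.docker/config.json.
# Usage (illustrative): with_registry_login gitea.treytartt.com <build-and-push command>
with_registry_login() {
  local registry="$1"
  shift
  docker login "$registry" || return 1

  # Run the wrapped command and remember its exit code.
  "$@"
  local rc=$?

  # Log out regardless of whether the command succeeded.
  docker logout "$registry"
  return "$rc"
}
```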
+
+## Git repo (`/Users/treyt/Desktop/code/honeyDue/honeyDueAPI-go/`)
+
+### Top-level
+
+| Path | Purpose |
+|---|---|
+| `CLAUDE.md` | Project-wide instructions for Claude assistant. Never commit secrets here. |
+| `Dockerfile` | Multi-stage Docker build: api, worker, admin targets. |
+| `go.mod`, `go.sum` | Go module definition. |
+| `package.json` (admin-ui/) | Next.js dependencies. |
+
+### Application code
+
+| Path | Purpose |
+|---|---|
+| `cmd/api/main.go` | API server entry point. |
+| `cmd/worker/main.go` | Background worker entry point. |
+| `cmd/admin/main.go` | (may or may not exist for Go admin variant) |
+| `internal/config/` | Viper configuration loading. |
+| `internal/database/` | Postgres connection, migrations. |
+| `internal/handlers/` | HTTP handlers (one file per domain). |
+| `internal/services/` | Business logic. `cache_service.go` is where the sync.Once bug was (Chapter 19). |
+| `internal/repositories/` | GORM repositories. |
+| `internal/router/router.go` | Echo routes, including static file serving. CSP is set here. |
+| `internal/middleware/` | Echo middleware (auth, logging, etc.). |
+| `internal/task/` | Task predicates/scopes/categorization. See `docs/TASK_LOGIC_ARCHITECTURE.md`. |
+
+### Deploy config (Swarm era; still present, mostly unused)
+
+| Path | Purpose |
+|---|---|
+| `deploy/` | Legacy Swarm deploy root. |
+| `deploy/prod.env` | Non-secret config (ConfigMap source). **Gitignored.** |
+| `deploy/registry.env` | Gitea PAT + registry URL. **Gitignored.** |
+| `deploy/cluster.env` | Swarm cluster settings. Partly used for k3s too (manager host). **Gitignored.** |
+| `deploy/secrets/postgres_password.txt` | Neon password. **Gitignored.** |
+| `deploy/secrets/secret_key.txt` | App signing key (≥32 chars). **Gitignored.** |
+| `deploy/secrets/email_host_password.txt` | Fastmail password. **Gitignored.** |
+| `deploy/secrets/fcm_server_key.txt` | FCM key (placeholder, push off). **Gitignored.** |
+| `deploy/secrets/apns_auth_key.p8` | APNs key (placeholder, push off). **Gitignored.** |
+| `deploy/swarm-stack.prod.yml` | Swarm stack definition. Unused after migration. |
+| `deploy/Caddyfile` | Caddy config. Unused after migration. |
+| `deploy/scripts/deploy_prod.sh` | Swarm deploy script. Unused. |
+| `deploy/DEPLOYING.md`, `deploy/README.md`, `deploy/shit_deploy_cant_do.md` | Swarm-era docs. Historical reference. |
+
+### Deploy config (k3s)
+
+| Path | Purpose |
+|---|---|
+| `deploy-k3s/README.md` | k3s deployment README (scaffold version). |
+| `deploy-k3s/MIGRATION_NOTES.md` | Notes from Swarm → k3s migration. |
+| `deploy-k3s/SECURITY.md` | Security posture doc (scaffold). |
+| `deploy-k3s/config.yaml.example` | Template for a unified config.yaml (unused — we kept Swarm's file layout). |
+| `deploy-k3s/manifests/namespace.yaml` | Creates `honeydue` namespace. |
+| `deploy-k3s/manifests/rbac.yaml` | ServiceAccounts + `automountServiceAccountToken: false`. |
+| `deploy-k3s/manifests/pod-disruption-budgets.yaml` | PDBs for api (2/3) and worker (0/1). |
+| `deploy-k3s/manifests/network-policies.yaml` | Default-deny + allows. NOT currently applied. |
+| `deploy-k3s/manifests/api/deployment.yaml` | api Deployment. |
+| `deploy-k3s/manifests/api/service.yaml` | api ClusterIP Service. |
+| `deploy-k3s/manifests/api/hpa.yaml` | api HorizontalPodAutoscaler. NOT currently applied. |
+| `deploy-k3s/manifests/admin/deployment.yaml` | admin Deployment. |
+| `deploy-k3s/manifests/admin/service.yaml` | admin Service. |
+| `deploy-k3s/manifests/worker/deployment.yaml` | worker Deployment. |
+| `deploy-k3s/manifests/redis/deployment.yaml` | Redis Deployment. |
+| `deploy-k3s/manifests/redis/service.yaml` | Redis Service. |
+| `deploy-k3s/manifests/redis/pvc.yaml` | Redis PersistentVolumeClaim. |
+| `deploy-k3s/manifests/ingress/ingress.yaml` | Full Ingress with TLS + middleware (scaffold; needs CF origin cert). |
+| `deploy-k3s/manifests/ingress/ingress-simple.yaml` | Simple Ingress without TLS (what we actually apply). |
+| `deploy-k3s/manifests/ingress/middleware.yaml` | Traefik middleware CRDs. Not currently applied. |
+| `deploy-k3s/manifests/traefik-helmchartconfig.yaml` | Our DaemonSet + hostNetwork override for Traefik. |
+| `deploy-k3s/manifests/secrets.yaml.example` | Template (never deployed). |
+| `deploy-k3s/scripts/01-provision-cluster.sh` | hetzner-k3s provisioning (we didn't use it; existing nodes). |
+| `deploy-k3s/scripts/02-setup-secrets.sh` | Creates Secrets + ConfigMap (scaffold version; we ran commands manually). |
+| `deploy-k3s/scripts/03-deploy.sh` | Applies manifests (unused; we ran kubectl manually). |
+| `deploy-k3s/scripts/04-verify.sh` | Post-deploy verification. |
+| `deploy-k3s/scripts/rollback.sh` | Rollback helper. |
+
+### Documentation
+
+| Path | Purpose |
+|---|---|
+| `docs/deployment/` | **This book.** |
+| `docs/TASK_LOGIC_ARCHITECTURE.md` | Task logic internals. |
+| `docs/PUSH_NOTIFICATIONS.md` | Push notifications setup (for future). |
+| `docs/SUBSCRIPTION_WEBHOOKS.md` | Apple/Google subscription webhooks. |
+| `docs/Dokku_notes` | Pre-Swarm era deployment notes. Historical. |
+| `docs/server_2026_2_24.md` | Earlier architecture doc (predates k3s migration). |
+
+## On the Hetzner nodes
+
+### System
+
+| Path | Purpose |
+|---|---|
+| `/etc/ssh/sshd_config` | SSH config — `PermitRootLogin no`, `PasswordAuthentication no`, `AllowUsers deploy`. |
+| `/etc/sudoers.d/deploy` | `deploy ALL=(ALL) NOPASSWD: ALL`. |
+| `/etc/ufw/` | UFW configuration. See Chapter 4 for rule inventory. |
+| `/etc/sysctl.d/99-unprivileged-ports.conf` | `net.ipv4.ip_unprivileged_port_start=0` for Traefik. |
+| `/home/deploy/.ssh/authorized_keys` | Our hetzner.pub. |
+
+### K3s
+
+| Path | Purpose |
+|---|---|
+| `/etc/rancher/k3s/k3s.yaml` | Kubeconfig (localhost-scoped; we copied to workstation). |
+| `/etc/systemd/system/k3s.service` | systemd service file. |
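+| `/etc/systemd/system/k3s.service.env` | Environment file for the k3s unit (`K3S_*` variables captured at install time; `INSTALL_K3S_EXEC` flags end up in the unit's `ExecStart`). |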
+| `/var/lib/rancher/k3s/` | K3s state root (etcd, containerd, PVC storage). |
+| `/var/lib/rancher/k3s/server/node-token` | Token for joining additional nodes. |
+| `/var/lib/rancher/k3s/storage/` | local-path PVC storage. Redis data lives here. |
+| `/var/lib/rancher/k3s/agent/containerd/` | containerd state. |
+| `/var/log/containers/` | Container log files. |
+
+### Commands installed
+
+| Path | Purpose |
+|---|---|
+| `/usr/local/bin/k3s` | The k3s binary. |
+| `/usr/local/bin/kubectl` | Symlink to k3s (CLI for this cluster). |
+| `/usr/local/bin/crictl` | containerd CLI. |
+| `/usr/local/bin/k3s-killall.sh` | Emergency kill-all-k3s script. |
+| `/usr/local/bin/k3s-uninstall.sh` | Clean uninstall script. |
+
+### Docker (legacy; disabled)
+
+| Path | Purpose |
+|---|---|
+| `/etc/systemd/system/docker.service` | systemd unit (stopped + disabled). |
+| `/var/lib/docker/` | Docker state (unused on current cluster). |
+
+## On Cloudflare
+
+Not a filesystem, but worth noting the dashboard hierarchy:
+
+```
+Websites → myhoneydue.com
+├── DNS → Records              (A records for api, admin, @)
+├── SSL/TLS → Overview          (SSL mode: Flexible)
+├── SSL/TLS → Edge Certificates (Always Use HTTPS: On)
+├── SSL/TLS → Origin Server     (where the Origin CA cert would live if we enabled it)
+├── Rules → Overview            (where Origin Rules live if we had them)
+├── Rules → Page Rules          (none)
+├── Security → WAF              (managed rules only)
+├── Speed → Optimization        (default)
+└── Analytics & Logs            (read-only stats)
+```
+
+## On Gitea (`gitea.treytartt.com`)
+
+The image registry lives at:
+
+```
+gitea.treytartt.com/admin/-/packages         # UI listing of all packages
+gitea.treytartt.com/admin/-/packages/container/honeydue-api      # API image
+gitea.treytartt.com/admin/-/packages/container/honeydue-worker   # Worker image
+gitea.treytartt.com/admin/-/packages/container/honeydue-admin    # Admin image
+```
+
+Per-version tags are visible in the UI, each with its `docker pull` command.
+
+PATs at `gitea.treytartt.com/-/user/settings/applications`.
+
+## On Neon
+
+```
+console.neon.tech → project → Branches       (production branch default)
+console.neon.tech → project → Monitoring     (CU-hour usage, slow queries)
+console.neon.tech → project → Operations     (history of schema changes)
+```
+
+Connection strings at `console.neon.tech → project → Connection Details`.
+
+## On Backblaze B2
+
+```
+secure.backblaze.com/b2_buckets.htm          # Buckets list
+secure.backblaze.com/b2_app_keys.htm         # App keys
+```
+
+`honeyDueProd` bucket → Files tab for browsing contents.
@@ -0,0 +1,202 @@
+# Appendix D — References & Citations
+
+Every external link cited anywhere in this book, grouped by topic.
+
+## Docker / Moby
+
+- [moby/moby#52265 — Overlay ARP stale entries on 29.3.0 regression][moby-52265] (Chapter 19, primary root-cause citation)
+- [moby/moby#51491 — DNS broken after `docker swarm init` on 29.0.0][moby-51491]
+- [Dokploy#3480 — Traefik routes intermittently timeout due to stale VIP][dokploy-3480]
+- [Mirantis: Commits to Long-Term Support for Swarm Through 2030][mirantis-swarm]
+- [Better Stack: Hetzner Cloud Review 2026][bstack-swarm]
+- [VirtualizationHowTo: Is Docker Swarm Still Safe in 2026?][vht-swarm]
+- [bleevht: Where Docker Swarm Still Fits in 2026][bleevht-swarm]
+- [Docker buildx multi-platform builds][buildx]
+- [Compose specification][compose-spec]
+
+## Kubernetes / k3s
+
+- [K3s documentation home][k3s-docs]
+- [K3s architecture][k3s-arch]
+- [K3s requirements (networking ports)][k3s-reqs]
+- [K3s advanced config — metrics server][k3s-metrics]
+- [K3s HA datastore recovery][k3s-ha-recovery]
+- [K3s storage — local-path provisioner][k3s-lp]
+- [K3s Helm integration — HelmChartConfig][k3s-helm]
+- [K3s Traefik customization][k3s-traefik]
+- [K3s secrets encryption][k3s-secrets]
+- [Kubernetes concepts — Services & Networking][k8s-net]
+- [Kubernetes Ingress][k8s-ingress]
+- [Kubernetes Deployments — rolling updates][rolling]
+- [kubectl rollout][rollout]
+- [kubectl cheat sheet][kubectl-cs]
+- [Pod lifecycle + probes][probes]
+- [Pod Security Standards][psa]
+- [Kubernetes RBAC][rbac]
+- [NetworkPolicy][netpol]
+- [Ports and Protocols reference][k8s-ports]
+- [metrics-server][ms]
+
+## Traefik
+
+- [Traefik v3 documentation][traefik]
+- [Traefik Swarm provider][traefik-swarm]
+- [Traefik migrate v2 → v3][traefik-v3]
+
+## Cloudflare
+
+- [IP ranges][cf-ips]
+- [SSL modes explained][cf-ssl]
+- [Origin CA certificates][cf-origin-ca]
+- [DNS best practices][cf-dns]
+- [Free plan][cf-free]
+
+## Hetzner
+
+- [Hetzner Cloud][hetzner-cloud]
+- [Hetzner price adjustment 2026-04-01][hetzner-prices]
+- [Hetzner rescue system][hetzner-rescue]
+- [hetzner-k3s tool][hetzner-k3s]
+
+## Neon / Postgres
+
+- [Neon docs][neon-docs]
+- [Neon pricing][neon-pricing]
+- [Neon usage-based pricing announcement][neon-blog]
+- [Neon connect from any app][neon-connect]
+- [Postgres advisory locks][pg-locks]
+- [GORM AutoMigrate][gorm-automigrate]
+
+## Backblaze B2
+
+- [B2 documentation][b2-docs]
+- [B2 S3-compatible API][b2-s3]
+- [B2 pricing][b2-pricing]
+- [minio-go SDK][minio-go]
+- [S3 path-style vs virtual-hosted addressing][s3-style]
+
+## Gitea
+
+- [Gitea container registry docs][gitea-cr]
+
+## CNI / Networking
+
+- [Flannel VXLAN backend][flannel-vxlan]
+- [CoreDNS Kubernetes plugin][coredns-k8s]
+- [IPVS mode for kube-proxy deep dive][ipvs]
+- [VXLAN RFC 7348][vxlan-rfc]
+- [Kubernetes NetworkPolicy][netpol]
+
+## Security tools
+
+- [cosign (image signing)][cosign]
+- [Loki (logs)][loki]
+- [Stern (multi-pod log tailing)][stern]
+- [fail2ban][fail2ban]
+
+## Asynq
+
+- [Asynq documentation][asynq]
+- [Asynq periodic tasks (scheduler limitations)][asynq-sched]
+
+## Miscellaneous
+
+- [Let's Encrypt][le]
+- [UFW man page][ufw-man]
+- [SSH hardening guide][ssh-guide]
+- [pg_dump][pg-dump]
+
+---
+
+## Link definitions
+
+<!-- Docker / Moby -->
+[moby-52265]: https://github.com/moby/moby/issues/52265
+[moby-51491]: https://github.com/moby/moby/issues/51491
+[dokploy-3480]: https://github.com/Dokploy/dokploy/issues/3480
+[mirantis-swarm]: https://www.mirantis.com/blog/mirantis-guarantees-long-term-support-for-swarm/
+[bstack-swarm]: https://betterstack.com/community/guides/web-servers/hetzner-cloud-review/
+[vht-swarm]: https://www.virtualizationhowto.com/2026/03/is-docker-swarm-still-safe-in-2026/
+[bleevht-swarm]: https://bleevht.substack.com/p/where-docker-swarm-still-fits-in
+[buildx]: https://docs.docker.com/build/buildx/
+[compose-spec]: https://docs.docker.com/reference/compose-file/
+
+<!-- Kubernetes / k3s -->
+[k3s-docs]: https://docs.k3s.io/
+[k3s-arch]: https://docs.k3s.io/architecture
+[k3s-reqs]: https://docs.k3s.io/installation/requirements#networking
+[k3s-metrics]: https://docs.k3s.io/advanced#enabling-metrics-server
+[k3s-ha-recovery]: https://docs.k3s.io/datastore/ha-embedded#new-cluster-with-embedded-db
+[k3s-lp]: https://docs.k3s.io/storage#setting-up-the-local-storage-provider
+[k3s-helm]: https://docs.k3s.io/helm#customizing-packaged-components-with-helmchartconfig
+[k3s-traefik]: https://docs.k3s.io/networking/networking-services#traefik-ingress-controller
+[k3s-secrets]: https://docs.k3s.io/security/secrets-encryption
+[k8s-net]: https://kubernetes.io/docs/concepts/services-networking/
+[k8s-ingress]: https://kubernetes.io/docs/concepts/services-networking/ingress/
+[rolling]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-update-deployment
+[rollout]: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#rollout
+[kubectl-cs]: https://kubernetes.io/docs/reference/kubectl/cheatsheet/
+[probes]: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-lifecycle
+[psa]: https://kubernetes.io/docs/concepts/security/pod-security-standards/
+[rbac]: https://kubernetes.io/docs/reference/access-authn-authz/rbac/
+[netpol]: https://kubernetes.io/docs/concepts/services-networking/network-policies/
+[k8s-ports]: https://kubernetes.io/docs/reference/networking/ports-and-protocols/
+[ms]: https://github.com/kubernetes-sigs/metrics-server
+
+<!-- Traefik -->
+[traefik]: https://doc.traefik.io/traefik/v3.6/
+[traefik-swarm]: https://doc.traefik.io/traefik/providers/swarm/
+[traefik-v3]: https://doc.traefik.io/traefik/migrate/v2-to-v3-details/
+
+<!-- Cloudflare -->
+[cf-ips]: https://www.cloudflare.com/ips/
+[cf-ssl]: https://developers.cloudflare.com/ssl/origin-configuration/ssl-modes/
+[cf-origin-ca]: https://developers.cloudflare.com/ssl/origin-configuration/origin-ca/
+[cf-dns]: https://developers.cloudflare.com/dns/
+[cf-free]: https://www.cloudflare.com/plans/free/
+
+<!-- Hetzner -->
+[hetzner-cloud]: https://www.hetzner.com/cloud/
+[hetzner-prices]: https://docs.hetzner.com/general/infrastructure-and-availability/price-adjustment/
+[hetzner-rescue]: https://docs.hetzner.com/cloud/servers/getting-started/enabling-rescue-system/
+[hetzner-k3s]: https://github.com/vitobotta/hetzner-k3s
+
+<!-- Neon / Postgres -->
+[neon-docs]: https://neon.com/docs/introduction
+[neon-pricing]: https://neon.com/pricing
+[neon-blog]: https://neon.com/blog/new-usage-based-pricing
+[neon-connect]: https://neon.com/docs/connect/connect-from-any-app
+[pg-locks]: https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS
+[gorm-automigrate]: https://gorm.io/docs/migration.html
+
+<!-- B2 -->
+[b2-docs]: https://www.backblaze.com/docs/
+[b2-s3]: https://www.backblaze.com/docs/cloud-storage-s3-compatible-api
+[b2-pricing]: https://www.backblaze.com/cloud-storage/pricing
+[minio-go]: https://github.com/minio/minio-go
+[s3-style]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html
+
+<!-- Gitea -->
+[gitea-cr]: https://docs.gitea.com/usage/packages/container
+
+<!-- CNI -->
+[flannel-vxlan]: https://github.com/flannel-io/flannel/blob/master/Documentation/backends.md#vxlan
+[coredns-k8s]: https://coredns.io/plugins/kubernetes/
+[ipvs]: https://kubernetes.io/blog/2018/07/09/ipvs-based-in-cluster-load-balancing-deep-dive/
+[vxlan-rfc]: https://datatracker.ietf.org/doc/html/rfc7348
+
+<!-- Security tools -->
+[cosign]: https://github.com/sigstore/cosign
+[loki]: https://grafana.com/oss/loki/
+[stern]: https://github.com/stern/stern
+[fail2ban]: https://www.fail2ban.org/
+
+<!-- Asynq -->
+[asynq]: https://github.com/hibiken/asynq
+[asynq-sched]: https://github.com/hibiken/asynq/wiki/Periodic-Tasks
+
+<!-- Misc -->
+[le]: https://letsencrypt.org/
+[ufw-man]: https://manpages.ubuntu.com/manpages/noble/en/man8/ufw.8.html
+[ssh-guide]: https://linux-audit.com/audit-and-harden-your-ssh-configuration/
+[pg-dump]: https://www.postgresql.org/docs/current/app-pgdump.html