Migrate prod deploy from Swarm to K3s; add full deployment book

Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes, all as servers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
  callback (was causing 'unlock of unlocked mutex' fatal after
  Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
  + allowlist fonts.googleapis.com so the marketing landing page CSS
  actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
  --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
  images runnable on x86_64 Hetzner nodes; fix array expansion under
  set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
  top-level aliases (the '${X_SECRET}' form never actually resolved);
  dozzle ports: long-form host_ip is rejected by Swarm, switched to
  short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
  worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
  (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
  12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
  real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
  and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
  hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
  without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations,
    consolidated citations
- README.md: Production Deployment section replaced with pointer to
  the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trey t
2026-04-24 07:20:21 -05:00
parent 4ec4bbbfe8
commit 6f303dbbaa
46 changed files with 9785 additions and 93 deletions
@@ -0,0 +1,318 @@
# 20 — Roadmap
## Summary
A consolidated list of known gaps, improvements, and scaling triggers.
Items are grouped by category and roughly ordered by priority. This is
the "if we had more time" list referenced throughout the book.
## High priority (do soon)
### Uptime monitoring
**Why**: Right now we find out the site is down when users complain.
**How**: Set up Uptime Kuma (self-hosted) or Better Stack Uptime
(free tier) to ping `https://api.myhoneydue.com/api/health/` every
minute, with Slack/email alerts on failure.
**Effort**: ~30 min for Uptime Kuma deploy, ~10 min for Better Stack
signup.
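
For reference, the check those services perform can be sketched in a few lines of shell. This is only an illustration, not a replacement for either tool: `probe` is a hypothetical name, and the alert assumes a Slack incoming-webhook URL. It has to run from outside the cluster (a cron job on another host), otherwise it goes down with the thing it watches.

```bash
# Sketch only: hits the health endpoint and posts to Slack when it fails.
# Arguments: <health URL> <Slack incoming-webhook URL> (both supplied by the caller).
probe() {
  local url="$1" webhook="$2"
  # -f turns HTTP errors (4xx/5xx) into a non-zero exit; 10 s total timeout.
  if ! curl -fsS --max-time 10 -o /dev/null "$url"; then
    curl -fsS -X POST -H 'Content-Type: application/json' \
      --data "{\"text\":\"health check failed: $url\"}" "$webhook"
  fi
}
```

Called as `probe https://api.myhoneydue.com/api/health/ "$SLACK_WEBHOOK_URL"` from a once-a-minute cron entry.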
### Cloudflare origin IP restriction
**Why**: UFW allows :80 from anywhere. If node IPs leak, direct-connect
attackers bypass CF's WAF/DDoS protection.
**How**: Replace the allow-from-anywhere UFW rule on port 80 with one allow
rule per Cloudflare range (15 IPv4 + 7 IPv6). See [Chapter 13 §CF IP ranges](./13-cloudflare.md#cloudflare-ip-ranges-used-in-traefik-trustedips).
Automation: a small script that refreshes the CF IP list monthly and
re-applies UFW rules.
**Effort**: 1 hour.
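
The refresh script could look roughly like this. It is a sketch under assumptions, not the deployed script: the function names are made up for illustration, and it prints UFW commands instead of running them so the list can be reviewed first. The two URLs are Cloudflare's published plain-text range lists.

```bash
# Reads one CIDR per line on stdin and prints a UFW allow rule for each.
# Blank lines and '#' comments are skipped; a missing final newline is tolerated.
cf_ufw_rules() {
  local port="$1" cidr
  while IFS= read -r cidr || [ -n "$cidr" ]; do
    case "$cidr" in
      ''|'#'*) continue ;;
    esac
    printf 'ufw allow proto tcp from %s to any port %s\n' "$cidr" "$port"
  done
}

# Fetches the IPv4 and IPv6 lists; the extra echo guards against a list
# that does not end with a newline.
cf_fetch_ranges() {
  curl -fsS https://www.cloudflare.com/ips-v4
  echo
  curl -fsS https://www.cloudflare.com/ips-v6
  echo
}
```

`cf_fetch_ranges | cf_ufw_rules 80` prints one rule per range; pipe it to `sudo sh` once it looks right. The existing allow-from-anywhere rule still has to be deleted separately, and a range that Cloudflare retires has to be pruned by hand.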
### Enable network policies in k3s
**Why**: Currently pods can freely egress anywhere. A compromised pod
could exfiltrate data or attack lateral services.
**How**: `kubectl apply -f deploy-k3s/manifests/network-policies.yaml`.
The scaffold defines default-deny + explicit allows for:
- DNS egress for all pods
- Traefik → api (port 8000)
- Traefik → admin (port 3000)
- api/worker → Redis
- api/worker → external services (Postgres, B2, Fastmail)
Then test that nothing breaks (might need to adjust allow rules).
**Effort**: 1-2 hours including testing.
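
One way to test the default-deny half is a negative probe: a throwaway pod with no matching labels should not be able to reach the api. The sketch below assumes the namespace is `honeydue` and the Service is named `api` (neither is confirmed here), uses a public `busybox` image, and relies on `kubectl run` passing the container's exit code through. `np_expect_blocked` is a hypothetical helper name.

```bash
# Succeeds when the URL is NOT reachable from an unlabelled pod,
# which is what a default-deny policy should guarantee.
np_expect_blocked() {
  local ns="$1" url="$2"
  # busybox wget: -T 5 is a 5 s timeout; the pod is removed afterwards (--rm).
  if kubectl run np-probe --rm -i --restart=Never --image=busybox:1.36 \
       -n "$ns" -- wget -q -O /dev/null -T 5 "$url"; then
    echo "FAIL: $url is reachable from an unlabelled pod" >&2
    return 1
  fi
  echo "ok: $url is blocked"
}
```

For example `np_expect_blocked honeydue http://api:8000/api/health/`. The allowed paths (Traefik to api and admin) are covered by loading the public site as usual.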
### Apply Traefik security middleware
**Why**: Our current Ingress has no rate limiting or security headers
beyond what Traefik adds by default.
**How**: Apply `deploy-k3s/manifests/ingress/middleware.yaml`, annotate
Ingresses to use them:
```yaml
metadata:
  annotations:
    traefik.ingress.kubernetes.io/router.middlewares: honeydue-security-headers@kubernetescrd,honeydue-rate-limit@kubernetescrd
```
**Effort**: 15 min.
## Medium priority
### Upgrade to CF Full (strict) SSL
**Why**: Currently CF↔origin is plain HTTP. An attacker between CF and
Hetzner could read traffic. Full (strict) mode encrypts this leg with
a CF-issued origin cert.
**How**:
1. Generate Origin CA cert in CF dashboard → SSL/TLS → Origin Server
2. Create `cloudflare-origin-cert` Secret in k8s
3. Add `tls:` block to Ingresses
4. Switch CF SSL mode to Full (strict)
**Effort**: 30 min.
**Citations**: [Cloudflare Origin CA docs][cf-origin-ca]
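
Step 2 can be done with `kubectl create secret tls`. A minimal sketch, assuming the cert and key from step 1 were saved locally and that the Ingresses live in a `honeydue` namespace; `origin_cert_secret` is a hypothetical helper name.

```bash
# Creates the TLS Secret, or updates it in place when the cert is renewed.
# Arguments: <namespace> <cert file> <key file>
origin_cert_secret() {
  local ns="$1" cert="$2" key="$3"
  # Render the Secret client-side, then apply it, so a second run does not
  # fail with "already exists".
  kubectl create secret tls cloudflare-origin-cert \
    --cert="$cert" --key="$key" -n "$ns" \
    --dry-run=client -o yaml | kubectl apply -f -
}
```

For example `origin_cert_secret honeydue origin.pem origin.key`. Switch the CF SSL mode only after the Ingress `tls:` block is live, otherwise CF gets no TLS listener to connect to.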
### Migration Job for schema changes
**Why**: Currently every api pod runs `MigrateWithLock()` on startup,
serializing on a Postgres advisory lock. Adds 90-240s to cold startup
and caused bug #13 in Chapter 19.
**How**: Create a Kubernetes `Job` resource that runs the api image
with a `--migrate-only` flag. Job runs once per deploy, completes when
schema is current. api pods get an initContainer that waits for the
Job to complete.
Requires Go code change to support `--migrate-only` flag.
**Effort**: 3-4 hours (code + job manifest + testing).
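
A simpler variant of the same idea is to gate the rollout from the deploy script rather than from an initContainer. The sketch below shows that variant, not the initContainer wait described above. The Job name, manifest path and namespace are placeholders, and the `--migrate-only` flag still has to exist in the api binary.

```bash
# Runs the migration Job to completion before the api rollout starts.
# Arguments: <namespace> <job manifest> <job name>
run_migration_job() {
  local ns="$1" manifest="$2" job="$3"
  # A Job's pod template is immutable, so remove the previous run first.
  kubectl delete job "$job" -n "$ns" --ignore-not-found || return 1
  kubectl apply -f "$manifest" -n "$ns" || return 1
  # Blocks until the Job reports Complete; exits non-zero on timeout.
  kubectl wait --for=condition=complete --timeout=300s "job/$job" -n "$ns"
}
```

For example `run_migration_job honeydue migrate-job.yaml honeydue-migrate && kubectl set image ...`. A Job that fails never reaches `Complete`, so a failure shows up here as the 300 s timeout.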
### Redis password
**Why**: Redis runs in the cluster with no auth. Any compromised pod
could read cache or queue state.
**How**: Set `REDIS_PASSWORD` in `honeydue-secrets`, update api/worker
env, update Redis command to include `--requirepass`. Already partially
wired up in the manifests.
**Effort**: 20 min.
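
Generating the password and adding it to the Secret might look like this. It is a sketch: the helper names are hypothetical and the `honeydue` namespace is an assumption. A merge patch is used so the other keys in `honeydue-secrets` are left alone.

```bash
# 64 hex characters from the kernel CSPRNG; hex avoids quoting problems
# in JSON and in the Redis command line.
gen_redis_password() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Merge-patches one key into the existing Secret without touching the rest.
# Arguments: <namespace> <password>
set_redis_password() {
  local ns="$1" pw="$2"
  kubectl patch secret honeydue-secrets -n "$ns" --type merge \
    -p "{\"stringData\":{\"REDIS_PASSWORD\":\"$pw\"}}"
}
```

For example `set_redis_password honeydue "$(gen_redis_password)"`. Redis and its clients need to pick up the new value in the same rollout, since a client without the password is rejected as soon as `--requirepass` is active.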
### Image signing with cosign
**Why**: No guarantee that an image pulled from Gitea is the one we
built. Gitea compromise = arbitrary code execution in cluster.
**How**:
1. Install cosign on build machine
2. Sign images as part of deploy: `cosign sign gitea.treytartt.com/admin/honeydue-api:<sha>`
3. Deploy Kyverno (or Connaisseur) to cluster
4. Apply cluster policy requiring all images have valid cosign signatures
**Effort**: 4-6 hours.
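
Steps 1-2 could be wired into the deploy script along these lines. This is a sketch assuming a key pair created once with `cosign generate-key-pair` (which writes `cosign.key` and `cosign.pub`); the function names are hypothetical.

```bash
# Builds the image reference used in step 2 from a commit SHA.
image_ref() {
  printf 'gitea.treytartt.com/admin/honeydue-api:%s\n' "$1"
}

# Signs the pushed image, then verifies it with the public key so a broken
# signature is caught before the rollout.
sign_and_verify() {
  local ref
  ref="$(image_ref "$1")"
  cosign sign --key cosign.key "$ref" || return 1
  cosign verify --key cosign.pub "$ref" >/dev/null
}
```

For example `sign_and_verify "$(git rev-parse --short HEAD)"`, provided that is how the image tag is derived. cosign 2.x may ask for confirmation before writing to the transparency log; `--yes` skips the prompt in non-interactive runs.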
### etcd encryption at rest
**Why**: Kubernetes Secrets are stored in etcd unencrypted by default.
Node disk compromise = plaintext secrets.
**How**: K3s supports `--secrets-encryption` flag at server install.
Need to recreate cluster or re-install k3s server on each node.
**Effort**: 1 hour.
### Automated unattended-upgrades
**Why**: Currently OS patches require manual `apt upgrade`. Security
patches can be delayed.
**How**:
```bash
sudo apt install unattended-upgrades
# Configure /etc/apt/apt.conf.d/50unattended-upgrades for security-only
sudo dpkg-reconfigure -plow unattended-upgrades
```
**Effort**: 30 min per node.
### fail2ban
**Why**: SSH is open to the world. No rate limiting on failed attempts.
Bot noise is constant.
**How**: `sudo apt install fail2ban && sudo systemctl enable --now fail2ban`.
The default config bans an IP for 10 min after 5 failed attempts.
**Effort**: 15 min per node.
### Move SSH off port 22
**Why**: Port 22 attracts constant scanner noise. Moving to a
non-default port cuts >90% of attempts.
**How**:
1. Edit `/etc/ssh/sshd_config` on each node: `Port 2222`
2. UFW rule: `sudo ufw allow 2222/tcp`
3. Update `~/.ssh/config` on operator: `Port 2222`
4. Restart sshd: `sudo systemctl restart ssh`
5. Remove UFW rule for port 22 after verifying
**Effort**: 30 min (and pray).
## Lower priority
### Prometheus + Grafana
**Why**: Historical metrics, dashboards, alerting.
**How**: `kube-prometheus-stack` Helm chart. Adds ~500 MB RAM across
cluster.
**Effort**: 4-6 hours including dashboard setup.
### Loki log aggregation
**Why**: Cross-pod log queries, longer retention.
**How**: `grafana/loki` + `promtail` DaemonSet. Integrates with the Grafana
instance from the Prometheus + Grafana item above.
**Effort**: 2-3 hours.
### OpenTelemetry tracing
**Why**: Request-level profiling. Show which hop dominates p99 latency.
**How**: Add OpenTelemetry SDK to Go app; export to Jaeger/Tempo.
**Effort**: 8-12 hours including tuning.
### Hetzner private network
**Why**: Currently all inter-node traffic (including Flannel overlay)
goes over public network. Private network = less attack surface, no
bandwidth costs (if metered in future).
**How**: Attach a Hetzner Cloud Network (private network) to the 3 nodes, reconfigure Flannel to
advertise private IPs, update UFW rules to allow from private IP range
instead of specific public IPs.
**Effort**: 2-3 hours including testing Flannel reconfig.
### Move secrets to Vault
**Why**: Kubernetes Secrets are base64-encoded etcd values. Vault is
purpose-built for secret management with audit logs, dynamic secrets,
rotation policies.
**How**: Deploy Vault in the cluster (or external), migrate secret
values, use Vault Agent Injector or External Secrets Operator.
**Effort**: 6-8 hours.
Not high priority until we have multiple engineers who shouldn't see
every secret, or compliance requirements.
### Automated backups to B2
**Why**: Neon's backup is Neon's problem. If Neon-as-a-company
disappeared, we'd lose everything.
**How**: Nightly `pg_dump | gzip` streamed to B2 through its S3-compatible
API (`aws s3 cp` with `--endpoint-url`, or `s3cmd`), run as a CronJob in the cluster.
**Effort**: 2 hours.
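
The pipeline inside that CronJob could be sketched as below. Assumptions: the `aws` CLI is in the job image, B2 application keys are exported as `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY`, and the bucket name, endpoint and `pg/` key prefix are placeholders. The function names are hypothetical.

```bash
# Object key for one night's dump, e.g. pg/2026-04-24.sql.gz
backup_key() {
  printf 'pg/%s.sql.gz\n' "$1"
}

# Streams the dump straight to the bucket; nothing is written to local disk.
# Arguments: <postgres connection URL> <bucket> <S3 endpoint URL>
# The body runs in a subshell so pipefail does not leak into the caller.
backup_to_b2() (
  set -o pipefail   # a failing pg_dump must fail the whole pipeline
  db_url="$1"; bucket="$2"; endpoint="$3"
  day="$(date -u +%F)"
  pg_dump "$db_url" | gzip \
    | aws s3 cp - "s3://$bucket/$(backup_key "$day")" --endpoint-url "$endpoint"
)
```

Without `pipefail` a failed `pg_dump` would still upload an empty archive and report success. A restore test against a scratch database is the only real proof the dumps are usable.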
### Multi-region
**Why**: ~100 ms CF→origin hop could be reduced by having origins in
multiple regions. Not needed at current scale.
**How**: Add 2 more Hetzner nodes in ash (Ashburn, US). Separate k3s
cluster (or one stretched cluster — painful). Cloudflare Load Balancing
for geo-based routing.
**Effort**: Days of work, doubling cost. Not worth doing until traffic justifies it.
### CF Workers for static + caching
**Why**: Certain endpoints (the marketing landing page, public API
lookups) could serve from CF Workers with near-zero origin load.
**How**: Move static pages to Cloudflare Pages; cache API responses
with `Cache-Control: public, max-age=300`.
**Effort**: 4-6 hours.
### WireGuard-encrypted overlay
**Why**: Current Flannel VXLAN is plaintext between nodes. An attacker
with Hetzner-internal network access could read pod-to-pod traffic.
**How**: K3s supports `--flannel-backend=wireguard-native`. Reinstall
k3s server on each node with the new backend.
**Effort**: 2-3 hours (requires brief downtime).
## Scaling triggers
| Trigger | Action |
|---|---|
| p99 latency > 500ms sustained | Investigate with tracing; consider CF Workers for cached paths |
| API CPU > 70% sustained | HPA already configured; may need more nodes |
| DB connections at Neon limit | Upgrade to Neon's Scale plan or reduce `DB_MAX_OPEN_CONNS` |
| Redis memory > 80% | Scale Redis memory; consider cache sharding |
| B2 storage > 500 GB | Evaluate if R2 (free egress) is cheaper overall |
| Active users > 100k | Evaluate multi-region, CF Pro, paid monitoring |
| Revenue > $5k/mo | Hire ops help; this document assumes solo operator |
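
As a worked example of checking one trigger, the Redis row can be computed from `redis-cli INFO memory`, which reports `used_memory` and `maxmemory` in bytes. A sketch, with `redis_mem_pct` as a hypothetical helper name:

```bash
# Reads `redis-cli INFO memory` output on stdin and prints used/max as a
# whole percent, or "unlimited" when maxmemory is 0 (no cap configured).
redis_mem_pct() {
  # INFO lines end in CRLF, so strip the carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) print "unlimited"
      else printf "%d\n", used * 100 / max
    }'
}
```

Fed from something like `kubectl exec deploy/redis -- redis-cli INFO memory | redis_mem_pct` (the Deployment name is an assumption). With 900 bytes used of a 1000-byte cap it prints `90`, which is over the 80% trigger.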
## Known gaps we accept
- **No canary deploys**: every rollout goes to all pods via `kubectl set image`, with no partial-traffic stage
- **No feature flags** (app-level): code is deployed as-is. Can't toggle
features without re-deploying
- **No A/B testing infra**: out of scope for current product stage
- **No Windows/tablet-specific CDN rules**: CF serves everyone the same
responses
- **No explicit blue-green**: rolling updates only
## Stuff to delete when brave
- `deploy/` (the Swarm era) — once we've been on k3s 30 days
- Legacy UFW rules from the Swarm era (2377, 7946, 4789, ESP, 500, 3000)
— they don't hurt but they're confusing
- `deploy-k3s/manifests/secrets.yaml.example` — we don't use this
pattern, we create secrets imperatively
## Stuff that could go wrong and we should plan for
- **Hetzner price hike**: one already took effect on 2026-04-01. If another
comes, we could migrate to Netcup or OVH for savings.
- **Neon ends its free tier**: Neon could change pricing policy. Fallback:
self-host Postgres on a Hetzner box or migrate to Supabase.
- **Cloudflare Free plan changes**: CF could restrict Free features.
Fallback: BunnyCDN, or raw nodes without CDN.
- **Gitea host outage**: If Gitea is down, deploys can't pull new
images. Existing pods continue. For long outages, we'd cache images
locally or temporarily push to Docker Hub.
## Progress tracker
As items are done, mark them here. Think of this as a running changelog.
- [x] k3s migration from Swarm (2026-04-24)
- [x] Traefik DaemonSet + hostNetwork
- [x] Admin seed via ADMIN_EMAIL + ADMIN_PASSWORD
- [x] Documentation book (this doc set)
- [ ] All other items above