Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
temporarily for reference
Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
callback (was causing 'unlock of unlocked mutex' fatal after
Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
+ allowlist fonts.googleapis.com so the marketing landing page CSS
actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
--platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
images runnable on x86_64 Hetzner nodes; fix array expansion under
set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
top-level aliases (the '\${X_SECRET}' form never actually resolved);
dozzle ports: long-form host_ip is rejected by Swarm, switched to
short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
(Next.js serves at root; /admin/ returned 404 and killed pods);
startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
and admin/src/app/api/*, hiding legitimate files)
New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log
Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
- Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
- Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
- Part III Security, Traefik ingress (Ch 5-6)
- Part IV Services, DB, storage, secrets, registry (Ch 7-11)
- Part V Data flow, deploy process, observability, failures, runbook
(Ch 12, 14-17)
- Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
- Appendices: glossary, kubectl cheat sheet, file locations,
consolidated citations
- README.md: Production Deployment section replaced with pointer to
the book; Go version bumped to 1.25
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
05 — Security
Summary
Security on this deployment is layered: Cloudflare at the edge, UFW at the node, k3s RBAC + Pod Security at the orchestrator, TLS between long-haul components, and dedicated service accounts with dropped capabilities inside containers. This chapter documents each layer, the rationale, and what's currently missing (and why).
Threat model
Who we're defending against, in rough order of likelihood:
- Opportunistic scanners — bots scanning random IPv4 ranges for known vulnerabilities. Mitigated by the firewall.
- Credential stuffing / brute-force — especially against SSH and admin login. Mitigated by key-only SSH, strong passwords, rate limits.
- Compromised external service — if Neon, Backblaze, or Cloudflare were breached, attacker would have access to whatever we store there. Mitigated by scoped credentials, least-privilege API keys.
- Compromised container image — if Gitea or our build pipeline were compromised, malicious code could reach prod. Mitigated by (a) Gitea is behind authentication, (b) image pull secrets scoped, (c) containers run non-root with minimal capabilities.
- Insider threat — not really a threat for a solo operator.
- State actor — not in threat model. At our scale this is effectively unaddressable without becoming a security company.
Explicitly not in threat model:
- DDoS at a scale that saturates Cloudflare. We pay $0 for CF; their DDoS mitigation is included but not unlimited. If we got hit with a large attack, we'd move to a paid plan.
- Physical access to Hetzner datacenters. That's their problem.
Layer 1 — Cloudflare edge
Cloudflare sits in front of every public request.
What Cloudflare does for us
| Protection | How it works |
|---|---|
| TLS termination | CF presents a cert for *.myhoneydue.com; clients encrypt to CF |
| DDoS mitigation | Automatic on all plans including Free |
| Bot filtering | "Under Attack" mode + bot score based blocking |
| IP concealment | Origin IPs not in DNS; attackers can't directly scan |
| WAF rules | CF Free includes managed ruleset for common exploits |
| Rate limiting | Free tier: 10k requests/10min; more on paid plans |
What Cloudflare does not do
- Authenticate users — that's the app's job
- Authorize requests — that's the app's job
- Protect origin if origin IP leaks — once someone knows a node IP they can bypass CF. Mitigation: keep origin firewall strict (Chapter 4).
- Encrypt between CF and origin — we're on SSL=Flexible, so CF↔origin is HTTP. This is in our TODO (Chapter 20, upgrade to Full-strict).
The proxy-IP problem
Cloudflare publishes its IP ranges (cloudflare.com/ips), so an origin
server can tell whether a request arrived through CF by comparing the
remote address against those ranges. Our Traefik is configured to trust
X-Forwarded-Proto (so the Go API sees https even though origin received
HTTP) only from CF IP ranges:
# deploy-k3s/manifests/traefik-helmchartconfig.yaml
additionalArguments:
- "--entrypoints.web.forwardedHeaders.trustedIPs=173.245.48.0/20,..."
This means a malicious request that bypasses CF (by hitting the node IP
directly) can't spoof headers — Traefik ignores X-Forwarded-* unless
the source IP is in CF's ranges.
TODO (Chapter 20): Enforce at UFW level — allow 80/tcp only from CF IP ranges. Today any IP can reach the origin on port 80.
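The UFW restriction described in this TODO could be sketched as follows. This is a minimal, hypothetical helper (the name emit_cf_ufw_rules is not from the repo) that only prints candidate rules so they can be reviewed before anything is applied. It assumes the CIDR list is fetched from Cloudflare's published plain-text list at https://www.cloudflare.com/ips-v4.

```shell
# Hypothetical helper: print (do not run) one ufw rule per CIDR read from stdin.
# Assumed usage, run from the operator workstation or a node:
#   curl -fsS https://www.cloudflare.com/ips-v4 | emit_cf_ufw_rules
emit_cf_ufw_rules() {
  # The "|| [ -n ... ]" keeps the last line even if it has no trailing newline.
  while IFS= read -r cidr || [ -n "$cidr" ]; do
    # Skip blank lines and comment lines.
    case "$cidr" in
      ''|'#'*) continue ;;
    esac
    printf 'ufw allow proto tcp from %s to any port 80\n' "$cidr"
  done
}
```

The existing blanket allow for 80/tcp would also have to be removed for the restriction to take effect; Chapter 4 has the current ruleset.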
Layer 2 — Node (OS, SSH, firewall)
Each node runs Ubuntu 24.04.3 LTS with:
SSH hardening
/etc/ssh/sshd_config on each node:
Port 22
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
AllowUsers deploy
Result:
- Only the deploy user can log in
- Only with a public key (no password)
- Root cannot log in remotely
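One way to confirm that a node's effective sshd configuration matches the settings above is to check the output of sshd -T, which prints the resolved configuration with lowercase keywords. The function below is a hedged sketch; its name is hypothetical, and it assumes no Match blocks change these values for particular users.

```shell
# Hypothetical check: reads `sshd -T` output on stdin and succeeds only if
# every expected hardening line is present.
# Assumed usage on a node:
#   sudo sshd -T | check_sshd_hardening
check_sshd_hardening() {
  cfg="$(cat)"
  for want in 'permitrootlogin no' 'passwordauthentication no' \
              'pubkeyauthentication yes' 'allowusers deploy'; do
    # grep -x requires the whole line to match, -q suppresses output.
    if ! printf '%s\n' "$cfg" | grep -qx "$want"; then
      echo "missing: $want"
      return 1
    fi
  done
}
```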
The public key authorized for deploy:
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBU9xTTBD78tYUqHijgyU9PDqtmS4NuM/6uy8XgDzva+ hetzner2@myhoneydue.com
(Note: the comment field says "hetzner2" but it's the key for all three nodes — the comment is the key's identifier, not a restriction.)
Private key is at ~/.ssh/hetzner on the operator workstation.
Sudo
The deploy user has unrestricted sudo with no password
(/etc/sudoers.d/deploy):
deploy ALL=(ALL) NOPASSWD: ALL
This is convenient but broad. A compromise of the deploy SSH key =
root on the node. Mitigations:
- Key is stored only on the operator workstation, not checked into git
- Operator workstation has disk encryption (macOS FileVault)
- The key itself is protected by a passphrase (cached in ssh-agent)
Future hardening: scope sudo to specific commands that deploy workflows
need (e.g., /usr/sbin/ufw, /usr/bin/systemctl), but this requires
enumerating every command we might run, which breaks ad-hoc debugging.
fail2ban
Not installed. fail2ban would ban IPs that repeatedly fail SSH auth. Because password auth is disabled entirely, the attack surface is tiny: only an attacker holding the private key gets in, and repeated failed public-key attempts amount to nuisance traffic rather than credential stuffing. Installing fail2ban is on the TODO list anyway because it would rate-limit SSH bot noise.
unattended-upgrades
Not installed. Security patches require manual apt upgrade. This is
a gap. Install and configure for security-only updates as soon as time
permits.
UFW firewall
See Chapter 4 for the complete ruleset. Summary: default-deny incoming, specific allows for SSH (22), HTTP (80), HTTPS (443), k3s API from operator IP (6443), and inter-node cluster ports.
Layer 3 — Kubernetes RBAC
K3s inherits full Kubernetes RBAC. Every component that talks to the API server has a ServiceAccount with only the permissions it needs.
System accounts
K3s creates these by default:
- kube-system:admin (cluster admin, used by kubectl)
- kube-system:coredns (for CoreDNS)
- kube-system:traefik (for the Traefik ingress controller)
- kube-system:helm-install-traefik (for the Helm chart installer)
We don't touch these.
Application service accounts
Our rbac.yaml creates four ServiceAccounts in the honeydue namespace:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api
  namespace: honeydue
automountServiceAccountToken: false # ← important
Same for admin, worker, redis.
automountServiceAccountToken: false means pods don't get a k8s API
token mounted at /var/run/secrets/kubernetes.io/serviceaccount/. With
no token in the container, a compromised pod has no credential to
present to the Kubernetes API, even if the default service account has
broad permissions.
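To check that the setting is actually live on each ServiceAccount, something like the following could be used. It is a sketch only: the function name is hypothetical, and the KUBECTL variable is an assumption added so the command can be pointed at a wrapper or a stub.

```shell
# KUBECTL can be overridden; it defaults to the real binary.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical check: succeeds only if the named ServiceAccount in the
# honeydue namespace has automountServiceAccountToken set to false.
# Assumed usage:
#   for sa in api admin worker redis; do
#     sa_automount_disabled "$sa" || echo "$sa still mounts a token"
#   done
sa_automount_disabled() {
  value="$("$KUBECTL" get sa "$1" -n honeydue \
    -o jsonpath='{.automountServiceAccountToken}')"
  # An unset field prints nothing, which is treated as "not disabled".
  [ "$value" = "false" ]
}
```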
What the app pods CAN'T do
Our app service accounts have no RoleBindings or ClusterRoleBindings. They cannot:
- List, get, create, update, delete any Kubernetes resource
- Read other namespaces' secrets
- Schedule workloads
- View cluster state
If the api container were fully compromised (RCE), the attacker would have:
- Network access to other pods in the honeydue namespace (Chapter 16)
- Read access to our ConfigMap + Secrets (mounted into the container)
- No ability to pivot to other parts of the cluster via the k8s API
Layer 4 — Pod Security
Every pod runs with restrictive security context:
securityContext:
  runAsNonRoot: true
  runAsUser: 1000 # api; different per service
  runAsGroup: 1000
  fsGroup: 1000
  seccompProfile:
    type: RuntimeDefault
containers:
  - securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
What each setting does
| Setting | Effect |
|---|---|
| runAsNonRoot: true | Pod refuses to start if the image's default user is root |
| runAsUser: 1000 | Override to UID 1000 (app user) |
| allowPrivilegeEscalation: false | Process cannot become root via setuid, ptrace, etc. |
| readOnlyRootFilesystem: true | / is read-only; writes require explicit volumes |
| capabilities: drop: [ALL] | No Linux capabilities (NET_ADMIN, SYS_TIME, etc.) |
| seccompProfile: RuntimeDefault | Restrict syscalls to containerd's default seccomp allowlist |
Read-only root means our app images must declare writable volumes for anything mutable:
volumeMounts:
  - name: tmp
    mountPath: /tmp
volumes:
  - name: tmp
    emptyDir:
      sizeLimit: 64Mi
If the app needs to write somewhere else (e.g., Next.js cache), we mount an emptyDir there explicitly.
Traefik exception
Traefik needs CAP_NET_BIND_SERVICE to bind ports 80/443 on the host
network. Its security context adds just that one capability back:
securityContext:
  capabilities:
    drop: [ALL]
    add: [NET_BIND_SERVICE]
  readOnlyRootFilesystem: true
  runAsGroup: 65532
  runAsNonRoot: true
  runAsUser: 65532
The net.ipv4.ip_unprivileged_port_start=0 sysctl on the nodes
complements this — on older kernels NET_BIND_SERVICE alone isn't enough
in the host netns.
Pod Security Admission (PSA)
Kubernetes has a built-in admission controller for enforcing Pod Security Standards at the namespace level:
apiVersion: v1
kind: Namespace
metadata:
  name: honeydue
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
We don't currently set this. We get the equivalent effect from the explicit securityContext on each pod, but namespace-level enforcement would catch new workloads that forget to set it. TODO (Chapter 20).
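Before turning enforcement on, the effect of the label can be previewed with a server-side dry run, which reports existing pods that would violate the chosen level without changing the namespace. The wrapper below is a sketch; the function name and the KUBECTL override are assumptions, not something in the repo.

```shell
# KUBECTL can be overridden; it defaults to the real binary.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical preview: $1 = namespace (default honeydue),
# $2 = Pod Security Standard level (default restricted).
# --dry-run=server asks the API server to evaluate the change without saving it.
psa_preview() {
  "$KUBECTL" label --dry-run=server --overwrite ns "${1:-honeydue}" \
    "pod-security.kubernetes.io/enforce=${2:-restricted}"
}
```

If the dry run prints no warnings for the honeydue namespace, applying the label for real should not reject any of the pods that are currently running.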
Layer 5 — Network Policies
The deploy-k3s/manifests/network-policies.yaml scaffold defines:
- default-deny-all: deny all ingress and egress by default in the honeydue namespace
- allow-dns: allow egress UDP/TCP 53 to CoreDNS
- allow-ingress-to-api: allow Traefik (kube-system namespace) to reach api pods on port 8000
- allow-ingress-to-admin: same, for admin:3000
These are not currently applied. Without them, our pods can freely talk to anything — including, theoretically, malicious destinations if an attacker gets RCE inside a pod.
TODO (Chapter 20): Apply network policies. The scaffold is there; we
just need to kubectl apply -f deploy-k3s/manifests/network-policies.yaml
and test that nothing breaks.
What network policies would prevent
| Attack scenario | NetworkPolicy blocks |
|---|---|
| Pod A compromised, attacker moves laterally to pod B | Yes (pod-to-pod traffic needs an explicit allow) |
| Pod RCE → scan internal networks | Yes (default deny egress) |
| Pod RCE → exfil to attacker's C2 | Yes (outbound to internet needs egress rule) |
Without policies, all of these work.
TLS and encryption
CF ↔ user
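TLS 1.2+ as long as the zone's Minimum TLS Version setting is 1.2 or
higher; Cloudflare can still negotiate older versions when that setting
allows it. CF presents an automatically renewed certificate (Let's
Encrypt or another CF-managed CA) for *.myhoneydue.com.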
CF ↔ origin
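Plaintext HTTP (SSL = Flexible). An attacker with access to the Cloudflare-to-Hetzner path could read traffic. That path consists of Cloudflare's network, Hetzner's network, and any transit or peering networks between them, so an opportunistic attacker is unlikely to sit on it, but nothing guarantees it.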
TODO (Chapter 20): Upgrade to SSL = Full (strict) with a Cloudflare Origin CA certificate. This encrypts CF ↔ origin and verifies that origin's cert is the CF-issued one (prevents MitM if DNS is compromised).
API ↔ Neon Postgres
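TLS 1.3, enabled via DB_SSLMODE=require. The Go app's postgres driver
(pgx) negotiates TLS with Neon, and the connection fails if TLS can't
be established. Note that sslmode=require follows libpq semantics: it
encrypts the connection but does not verify the server certificate.
Verifying Neon's cert and hostname against a CA bundle requires
sslmode=verify-full.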
API ↔ Backblaze B2
HTTPS (B2 doesn't support HTTP). B2_USE_SSL=true in our ConfigMap
(though actually the app reads STORAGE_USE_SSL — see Chapter 9 for this
vestigial variable's story).
Worker ↔ Fastmail SMTP
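STARTTLS on port 587. The Go wneessen/go-mail library uses
TLSOpportunistic mode, which means it connects in plaintext and then
upgrades via STARTTLS, falling back to plaintext only if the server
doesn't offer STARTTLS. Fastmail always supports STARTTLS, so in
practice every connection is encrypted; switching to TLSMandatory would
turn that into a hard requirement.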
API/worker ↔ Redis
Plaintext inside the cluster. Redis 7 supports TLS (redis-tls.conf,
redis-server --tls-port), but we haven't enabled it because Redis is
on the overlay network, not exposed externally, and only holds cache +
queue state.
Pod-to-pod (Flannel overlay)
Plaintext VXLAN over Hetzner's public network. See Chapter 3 §Layer 3. TODO to switch to WireGuard backend.
Secrets management
Kubernetes Secrets
Our k8s Secrets are stored in etcd. etcd-at-rest encryption is not currently enabled — a compromise of the etcd data directory would expose Secret values. Given:
- Nodes have disk encryption at the Hetzner hypervisor layer
- Attacker needs root on the node to read etcd
- Our operator access is already root-via-sudo
This is an accepted risk. TODO (Chapter 20): enable encryption at rest
for etcd. K3s supports it via --secrets-encryption flag on the server.
What Secrets we have
$ kubectl get secrets -n honeydue
NAME TYPE DATA AGE
gitea-credentials kubernetes.io/dockerconfigjson 1 ...
honeydue-apns-key Opaque 1 ...
honeydue-secrets Opaque 9 ...
Contents:
| Secret | Key | Source |
|---|---|---|
| gitea-credentials | .dockerconfigjson | PAT for Gitea registry (image pulls) |
| honeydue-apns-key | apns_auth_key.p8 | Placeholder p8 file (push off) |
| honeydue-secrets | POSTGRES_PASSWORD | Neon DB password |
| honeydue-secrets | SECRET_KEY | 64-char random, app signing key |
| honeydue-secrets | EMAIL_HOST_PASSWORD | Fastmail app password |
| honeydue-secrets | FCM_SERVER_KEY | "disabled-no-push-accounts-yet" placeholder |
| honeydue-secrets | REDIS_PASSWORD | Empty (no auth on internal Redis) |
| honeydue-secrets | B2_KEY_ID | B2 app key ID |
| honeydue-secrets | B2_APP_KEY | B2 app key secret |
| honeydue-secrets | ADMIN_EMAIL | admin@myhoneydue.com |
| honeydue-secrets | ADMIN_PASSWORD | Generated 24-char initial admin password |
Source of truth
The Secret values came from:
- deploy/secrets/*.txt files on the operator workstation (gitignored)
- deploy/prod.env (gitignored)
- deploy/registry.env (gitignored)
These Swarm-era files are still the canonical source. If you need to recreate Secrets in a new cluster:
cd honeyDueAPI-go
kubectl create secret generic honeydue-secrets -n honeydue \
--from-literal=POSTGRES_PASSWORD="$(cat deploy/secrets/postgres_password.txt)" \
--from-literal=SECRET_KEY="$(cat deploy/secrets/secret_key.txt)" \
--from-literal=EMAIL_HOST_PASSWORD="$(cat deploy/secrets/email_host_password.txt)" \
...
The full recreation script is in Chapter 17 (Runbook).
Secret rotation
Not automated. To rotate (e.g., after a compromise):
- Generate new value:
  openssl rand -base64 32
- Update the secret:
  kubectl create secret generic honeydue-secrets -n honeydue \
    --from-literal=SECRET_KEY='new-value' \
    --dry-run=client -o yaml | kubectl apply -f -
- Restart dependent pods:
  kubectl rollout restart -n honeydue deploy/api deploy/worker
- Update deploy/secrets/secret_key.txt to match
- Revoke the old credential at the source (Neon, Fastmail, etc.)
Container image provenance
Images come from gitea.treytartt.com/admin/*. We have no image
signing or verification (cosign/sigstore) in place. A compromise of
the Gitea registry = the ability to push malicious images that would be
pulled into prod on the next rollout.
Mitigations:
- Gitea itself is behind login; PAT is scoped to read:packages + write:packages only
- Gitea runs on the operator's infrastructure (same operator account)
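- Image tags are pinned to a commit SHA (:237c6b8), not :latest, so a replaced image would show up as a digest change on an existing tag. Tags are still mutable, so this detects a swap only if we compare digests; it does not prevent one.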
TODO (Chapter 20): Add cosign signing at build time, verify at pull time.
Operator workstation security
The operator workstation has:
- macOS with FileVault (full disk encryption)
- Login password required
- Private keys in ~/.ssh/ (mode 0600)
- Kubeconfig at ~/.kube/honeydue-k3s.yaml (mode 0600), which contains admin credentials for the cluster
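The file modes listed above are easy to verify periodically. The helper below is a sketch with a hypothetical name; it relies only on POSIX find, so it should behave the same on the macOS workstation and on the Linux nodes.

```shell
# Hypothetical check: succeeds only if $1 is a regular file whose permission
# bits are exactly 0600.
# Assumed usage:
#   for f in ~/.ssh/hetzner ~/.kube/honeydue-k3s.yaml; do
#     is_mode_600 "$f" || echo "check permissions: $f"
#   done
is_mode_600() {
  # A plain octal mode makes -perm match the bits exactly; -prune stops find
  # from descending if a directory is passed by mistake.
  [ -n "$(find "$1" -prune -type f -perm 600 2>/dev/null)" ]
}
```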
Losing the laptop would require immediate credential rotation:
- New SSH key, redeploy public part on all 3 nodes
- New kubeconfig: run sudo cat /etc/rancher/k3s/k3s.yaml on hetzner1, copy to workstation, update KUBECONFIG env
- Rotate operator-access PATs on Gitea, Neon, Cloudflare, Backblaze
Compliance notes
This stack is not currently certified for:
- HIPAA — we transit and store health-related data but haven't contractually bound any BAA
- SOC 2 — no auditing, no documented controls beyond this document
- PCI-DSS — we don't handle card data; Apple/Google IAP handles payments
- GDPR — we follow GDPR best practices (data minimization, user deletion) but haven't had a formal assessment
If honeyDue ever needs any of these, the infrastructure is compatible but the operational processes around it would need formal work.
Operator cheat sheet
# See all RBAC-related resources in a namespace
kubectl get sa,role,rolebinding -n honeydue
# Check what a ServiceAccount can do
kubectl auth can-i --list --as=system:serviceaccount:honeydue:api -n honeydue
# Verify pod is running with expected security context
kubectl get pod <pod> -n honeydue -o jsonpath='{.spec.securityContext}'
kubectl get pod <pod> -n honeydue -o jsonpath='{.spec.containers[0].securityContext}'
# List all Secrets (without revealing content)
kubectl get secret -n honeydue
kubectl describe secret honeydue-secrets -n honeydue # shows keys, not values
# Decode a secret (CAREFUL: prints plaintext)
kubectl get secret honeydue-secrets -n honeydue -o jsonpath='{.data.SECRET_KEY}' | base64 -d