admin/honeyDueAPI

Trey t 6f303dbbaa

Migrate prod deploy from Swarm to K3s; add full deployment book

Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
  temporarily for reference

Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
  callback (was causing 'unlock of unlocked mutex' fatal after
  Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
  + allowlist fonts.googleapis.com so the marketing landing page CSS
  actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
  --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
  images runnable on x86_64 Hetzner nodes; fix array expansion under
  set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
  top-level aliases (the '\${X_SECRET}' form never actually resolved);
  dozzle ports: long-form host_ip is rejected by Swarm, switched to
  short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
  worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
  (Next.js serves at root; /admin/ returned 404 and killed pods);
  startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
  1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
  12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
  real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
  and admin/src/app/api/*, hiding legitimate files)

New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
  hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
  without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log

Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
  - Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
  - Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
  - Part III Security, Traefik ingress (Ch 5-6)
  - Part IV Services, DB, storage, secrets, registry (Ch 7-11)
  - Part V Data flow, deploy process, observability, failures, runbook
    (Ch 12, 14-17)
  - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
  - Appendices: glossary, kubectl cheat sheet, file locations,
    consolidated citations
- README.md: Production Deployment section replaced with pointer to
  the book; Go version bumped to 1.25

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-24 07:20:54 -05:00

05 — Security

Summary

Security on this deployment is layered: Cloudflare at the edge, UFW at the node, k3s RBAC + Pod Security at the orchestrator, TLS between long-haul components, and dedicated service accounts with dropped capabilities inside containers. This chapter documents each layer, the rationale, and what's currently missing (and why).

Threat model

Who we're defending against, in rough order of likelihood:

Opportunistic scanners — bots scanning random IPv4 ranges for known vulnerabilities. Mitigated by the firewall.
Credential stuffing / brute-force — especially against SSH and admin login. Mitigated by key-only SSH, strong passwords, rate limits.
Compromised external service — if Neon, Backblaze, or Cloudflare were breached, an attacker would have access to whatever we store there. Mitigated by scoped credentials and least-privilege API keys.
Compromised container image — if Gitea or our build pipeline were compromised, malicious code could reach prod. Mitigated by (a) Gitea is behind authentication, (b) image pull secrets scoped, (c) containers run non-root with minimal capabilities.
Insider threat — not really a threat for a solo operator.
State actor — not in threat model. At our scale this is effectively unaddressable without becoming a security company.

Explicitly not in threat model:

DDoS at a scale that saturates Cloudflare. We pay $0 for CF; their DDoS mitigation is included but not unlimited. If we got hit with a large attack, we'd move to a paid plan.
Physical access to Hetzner datacenters. That's their problem.

Layer 1 — Cloudflare edge

Cloudflare sits in front of every public request.

What Cloudflare does for us

| Protection | How it works |
| --- | --- |
| TLS termination | CF presents a cert for `*.myhoneydue.com`; clients encrypt to CF |
| DDoS mitigation | Automatic on all plans including Free |
| Bot filtering | "Under Attack" mode + bot score based blocking |
| IP concealment | Origin IPs not in DNS; attackers can't directly scan |
| WAF rules | CF Free includes managed ruleset for common exploits |
| Rate limiting | Free tier: 10k requests/10min; more on paid plans |

What Cloudflare does not do

Authenticate users — that's the app's job
Authorize requests — that's the app's job
Protect origin if origin IP leaks — once someone knows a node IP they can bypass CF. Mitigation: keep origin firewall strict (Chapter 4).
Encrypt between CF and origin — we're on SSL=Flexible, so CF↔origin is HTTP. This is in our TODO (Chapter 20, upgrade to Full-strict).

The proxy-IP problem

Cloudflare publishes its IP ranges (cloudflare.com/ips), so an origin can verify that a request came through Cloudflare by checking the remote address against them. Our Traefik is configured to trust X-Forwarded-Proto (so the Go API sees https even though the origin received HTTP) only from CF IP ranges:

# deploy-k3s/manifests/traefik-helmchartconfig.yaml
additionalArguments:
  - "--entrypoints.web.forwardedHeaders.trustedIPs=173.245.48.0/20,..."

This means a malicious request that bypasses CF (by hitting the node IP directly) can't spoof headers — Traefik ignores X-Forwarded-* unless the source IP is in CF's ranges.

TODO (Chapter 20): Enforce at UFW level — allow 80/tcp only from CF IP ranges. Today any IP can reach the origin on port 80.
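
The UFW restriction described in this TODO could be sketched roughly as follows. This is a minimal sketch, not the deployed ruleset: `cf_ufw_rules` and `CF_RANGES` are hypothetical names, the two CIDRs are only an illustrative subset of the list published at cloudflare.com/ips, and the function prints the `ufw` commands instead of running them so they can be reviewed first.

```shell
# Illustrative subset only (assumption); the authoritative, changing list is
# published at cloudflare.com/ips.
CF_RANGES="173.245.48.0/20 103.21.244.0/22"

# Print one "ufw allow" command per CIDR for the given TCP port.
# Nothing is executed here; the caller decides what to do with the output.
cf_ufw_rules() {
  local port="$1"
  shift
  local cidr
  for cidr in "$@"; do
    printf 'ufw allow proto tcp from %s to any port %s\n' "$cidr" "$port"
  done
}

# Usage on a node (review first, then run as root):
#   cf_ufw_rules 80 $CF_RANGES
#   cf_ufw_rules 80 $CF_RANGES | sudo sh
```

The existing broad allow for port 80 would also have to be removed for the restriction to take effect; which rule to delete depends on how it was added (see Chapter 4).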

Layer 2 — Node (OS, SSH, firewall)

Each node runs Ubuntu 24.04.3 LTS with:

SSH hardening

/etc/ssh/sshd_config on each node:

Port 22
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
AllowUsers deploy

Result:

Only the deploy user can log in
Only with a public key (no password)
Root cannot log in remotely
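
One way to confirm that the running daemon actually enforces these settings (rather than trusting the file) is to compare them against the effective configuration. The sketch below assumes the output of `sshd -T`, which prints the effective settings with lowercase keywords and needs root; `check_sshd_hardening` is a hypothetical helper name.

```shell
# Reads `sshd -T` output on stdin and checks the four hardening settings
# listed in this chapter. Returns 0 if all match, 1 otherwise.
check_sshd_hardening() {
  local effective want rc=0
  effective="$(cat)"
  for want in "permitrootlogin no" "passwordauthentication no" \
              "pubkeyauthentication yes" "allowusers deploy"; do
    # -i: case-insensitive, -x: the whole line must match
    if printf '%s\n' "$effective" | grep -qix "$want"; then
      echo "ok: $want"
    else
      echo "MISMATCH: expected '$want'"
      rc=1
    fi
  done
  return "$rc"
}

# Usage on a node:
#   sudo sshd -T | check_sshd_hardening
```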

The public key authorized for deploy:

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBU9xTTBD78tYUqHijgyU9PDqtmS4NuM/6uy8XgDzva+ hetzner2@myhoneydue.com

(Note: the comment field says "hetzner2" but it's the key for all three nodes — the comment is the key's identifier, not a restriction.)

Private key is at ~/.ssh/hetzner on the operator workstation.

Sudo

The deploy user has unrestricted sudo with no password (/etc/sudoers.d/deploy):

deploy ALL=(ALL) NOPASSWD: ALL

This is convenient but broad. A compromise of the deploy SSH key = root on the node. Mitigations:

Key is stored only on the operator workstation, not checked into git
Operator workstation has disk encryption (macOS FileVault)
The key itself is passphrase-protected (the passphrase is cached by ssh-agent)

Future hardening: scope sudo to specific commands that deploy workflows need (e.g., /usr/sbin/ufw, /usr/bin/systemctl), but this requires enumerating every command we might run, which breaks ad-hoc debugging.

fail2ban

Not installed. fail2ban would ban IPs that fail SSH auth repeatedly. Because we disable password auth entirely, the attack surface is tiny: an attacker with the private key wins, and failed public-key attempts amount to denial-of-service noise rather than credential stuffing. Installing fail2ban is on the TODO list anyway because it buys us rate-limiting on SSH bot noise.

unattended-upgrades

Not installed. Security patches require manual apt upgrade. This is a gap. Install and configure for security-only updates as soon as time permits.
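
Closing this gap could look roughly like the sketch below. It assumes the standard Ubuntu `unattended-upgrades` package and the usual `/etc/apt/apt.conf.d/20auto-upgrades` file; `write_auto_upgrades` is a hypothetical helper that writes the two periodic settings to whatever path it is given. Which origins get upgraded is controlled separately by `Allowed-Origins` in the packaged `50unattended-upgrades` file, which should be checked before relying on security-only behaviour.

```shell
# Write the two APT periodic settings that turn on daily package-list
# refreshes and unattended upgrades. The destination path is an argument so
# the file can be staged and reviewed before it is installed.
write_auto_upgrades() {
  local dest="$1"
  cat > "$dest" <<'EOF'
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
EOF
}

# Usage on a node:
#   sudo apt-get install -y unattended-upgrades
#   write_auto_upgrades /tmp/20auto-upgrades
#   sudo install -m 0644 /tmp/20auto-upgrades /etc/apt/apt.conf.d/20auto-upgrades
#   sudo unattended-upgrade --dry-run --debug   # shows what would be upgraded
```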

UFW firewall

See Chapter 4 for the complete ruleset. Summary: default-deny incoming, specific allows for SSH (22), HTTP (80), HTTPS (443), k3s API from operator IP (6443), and inter-node cluster ports.

Layer 3 — Kubernetes RBAC

K3s inherits full Kubernetes RBAC. Every component that talks to the API server has a ServiceAccount with only the permissions it needs.

System accounts

K3s creates these by default:

kube-system:admin — cluster admin, used by kubectl
kube-system:coredns — for CoreDNS
kube-system:traefik — for Traefik ingress controller
kube-system:helm-install-traefik — for the Helm chart installer

We don't touch these.

Application service accounts

Our rbac.yaml creates four ServiceAccounts in the honeydue namespace:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: api
  namespace: honeydue
automountServiceAccountToken: false   # ← important

Same for admin, worker, redis.

automountServiceAccountToken: false means pods don't get a k8s API token mounted in /var/run/secrets/kubernetes.io/serviceaccount/. Without that token, a compromised pod has no credential with which to query the Kubernetes API, even if the default service account has broad permissions.
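
A quick way to check that the setting took effect on a running pod is to look for the projected token volume, which Kubernetes names `kube-api-access-<suffix>` when it injects one. The helper below is a hypothetical sketch that reads the pod's volume names on stdin.

```shell
# Reads space-separated volume names on stdin. Succeeds when none of them is
# a projected service-account token volume (kube-api-access-<suffix>).
no_sa_token_volume() {
  local volumes
  volumes="$(cat)"
  case " $volumes" in
    *" kube-api-access-"*)
      echo "token volume present"
      return 1
      ;;
    *)
      echo "no token volume"
      return 0
      ;;
  esac
}

# Usage:
#   kubectl get pod <pod> -n honeydue -o jsonpath='{.spec.volumes[*].name}' \
#     | no_sa_token_volume
```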

What the app pods CAN'T do

Our app service accounts have no RoleBindings or ClusterRoleBindings. They cannot:

List, get, create, update, delete any Kubernetes resource
Read other namespaces' secrets
Schedule workloads
View cluster state

If the api container were fully compromised (RCE), the attacker would have:

Network access to other pods in the honeydue namespace (Chapter 16)
Read access to our ConfigMap + Secrets (mounted into the container)
No ability to pivot to other parts of the cluster via the k8s API

Layer 4 — Pod Security

Every pod runs with restrictive security context:

securityContext:
  runAsNonRoot: true
  runAsUser: 1000        # api; different per service
  runAsGroup: 1000
  fsGroup: 1000
  seccompProfile:
    type: RuntimeDefault

containers:
  - securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]

What each setting does

| Setting | Effect |
| --- | --- |
| `runAsNonRoot: true` | Pod refuses to start if the image's default user is root |
| `runAsUser: 1000` | Override to UID 1000 (app user) |
| `allowPrivilegeEscalation: false` | Sets `no_new_privs`: the process cannot gain more privileges than its parent (e.g. via setuid binaries) |
| `readOnlyRootFilesystem: true` | `/` is read-only; writes require explicit volumes |
| `capabilities: drop: [ALL]` | No Linux capabilities (NET_ADMIN, SYS_TIME, etc.) |
| `seccompProfile: RuntimeDefault` | Restrict syscalls to containerd's default seccomp allowlist |

Read-only root means our app images must declare writable volumes for anything mutable:

volumeMounts:
  - name: tmp
    mountPath: /tmp
volumes:
  - name: tmp
    emptyDir:
      sizeLimit: 64Mi

If the app needs to write somewhere else (e.g., Next.js cache), we mount an emptyDir there explicitly.

Traefik exception

Traefik needs CAP_NET_BIND_SERVICE to bind ports 80/443 on the host network. Its security context adds just that one capability back:

securityContext:
  capabilities:
    drop: [ALL]
    add: [NET_BIND_SERVICE]
  readOnlyRootFilesystem: true
  runAsGroup: 65532
  runAsNonRoot: true
  runAsUser: 65532

The net.ipv4.ip_unprivileged_port_start=0 sysctl on the nodes complements this — on older kernels NET_BIND_SERVICE alone isn't enough in the host netns.

Pod Security Admission (PSA)

Kubernetes has a built-in admission controller for enforcing Pod Security Standards at the namespace level:

apiVersion: v1
kind: Namespace
metadata:
  name: honeydue
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest

We don't currently set this. We get the equivalent effect from the explicit securityContext on each pod, but namespace-level enforcement would catch new workloads that forget to set it. TODO (Chapter 20).
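
Before enforcing the label, a server-side dry run shows whether any existing pod in the namespace would violate the chosen level: the API server evaluates the running pods and prints warnings without changing anything. A minimal sketch, with `psa_dry_run` as a hypothetical wrapper name:

```shell
# Ask the API server which existing pods would violate the given Pod Security
# level if it were enforced on the namespace. Nothing is persisted because of
# --dry-run=server.
psa_dry_run() {
  local ns="$1" level="${2:-restricted}"
  kubectl label --dry-run=server --overwrite ns "$ns" \
    "pod-security.kubernetes.io/enforce=$level"
}

# Usage:
#   psa_dry_run honeydue             # checks against "restricted"
#   psa_dry_run honeydue baseline    # checks against "baseline"
```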

Layer 5 — Network Policies

The deploy-k3s/manifests/network-policies.yaml scaffold defines:

default-deny-all — deny all ingress and egress by default in the honeydue namespace
allow-dns — allow egress UDP/TCP 53 to CoreDNS
allow-ingress-to-api — allow Traefik (kube-system namespace) to reach api pods on port 8000
allow-ingress-to-admin — same, for admin:3000

These are not currently applied. Without them, our pods can freely talk to anything — including, theoretically, malicious destinations if an attacker gets RCE inside a pod.

TODO (Chapter 20): Apply network policies. The scaffold is there; we just need to kubectl apply -f deploy-k3s/manifests/network-policies.yaml and test that nothing breaks.

What network policies would prevent

| Attack scenario | NetworkPolicy blocks |
| --- | --- |
| Pod A compromised, attacker SSHs sideways to pod B | Yes (explicit allow needed) |
| Pod RCE → scan internal networks | Yes (default deny egress) |
| Pod RCE → exfil to attacker's C2 | Yes (outbound to internet needs egress rule) |

Without policies, all of these work.
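
The egress row can be checked empirically with a throwaway pod, both before and after the policies are applied. The sketch below is an assumption-laden probe, not part of the repo: `egress_probe` and the pod name `np-probe` are hypothetical, the `busybox:1.36` image and the `http://example.com` target are placeholders, and it relies on `kubectl run -i --restart=Never` passing the container's exit status back to the caller.

```shell
# Start a short-lived pod in the honeydue namespace and try one outbound HTTP
# request with a 5 second timeout. Returns 0 if the request succeeded and 1 if
# it was blocked or the probe itself failed.
egress_probe() {
  local url="${1:-http://example.com}"
  if kubectl run np-probe --rm -i --restart=Never -n honeydue \
       --image=busybox:1.36 -- wget -q -T 5 -O /dev/null "$url"; then
    echo "egress allowed: $url"
  else
    echo "egress blocked (or probe failed): $url"
    return 1
  fi
}

# Usage:
#   egress_probe                       # default target
#   egress_probe http://10.0.0.1:8080  # any other destination
```

Today the probe is expected to report that egress is allowed; with default-deny-all in place it should report blocked unless an egress rule covers the destination.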

TLS and encryption

CF ↔ user

TLS 1.2+ in practice. The actual floor is the zone's Minimum TLS Version setting in Cloudflare, so confirm it is set to 1.2 or higher. CF presents an automatically renewed Let's Encrypt or CF-managed cert for *.myhoneydue.com.

CF ↔ origin

Plaintext HTTP (SSL = Flexible). An attacker with access to the Cloudflare-to-Hetzner path could read traffic. In practice nobody who isn't Cloudflare or Hetzner sits on that path.

TODO (Chapter 20): Upgrade to SSL = Full (strict) with a Cloudflare Origin CA certificate. This encrypts CF ↔ origin and verifies that origin's cert is the CF-issued one (prevents MitM if DNS is compromised).

API ↔ Neon Postgres

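TLS 1.3 via DB_SSLMODE=require. The Go app's postgres driver (pgx) negotiates TLS, and the connection fails if TLS can't be established. Note that with sslmode=require, pgx follows libpq semantics: the connection is encrypted, but Neon's certificate is not verified unless a root CA is configured. Verifying the certificate chain and hostname requires sslmode=verify-full.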

API ↔ Backblaze B2

HTTPS (B2 doesn't support HTTP). B2_USE_SSL=true in our ConfigMap (though actually the app reads STORAGE_USE_SSL — see Chapter 9 for this vestigial variable's story).

Worker ↔ Fastmail SMTP

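STARTTLS on port 587. The Go wneessen/go-mail library uses TLSOpportunistic mode: it connects in plaintext, then upgrades via STARTTLS when the server offers it, and continues unencrypted when it does not. Fastmail always supports STARTTLS, so in practice every connection is encrypted. Switching to TLSMandatory would turn a missing or stripped STARTTLS offer into a hard failure instead of a silent downgrade.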

API/worker ↔ Redis

Plaintext inside the cluster. Redis 7 supports TLS (redis-tls.conf, redis-server --tls-port), but we haven't enabled it because Redis is on the overlay network, not exposed externally, and only holds cache + queue state.

Pod-to-pod (Flannel overlay)

Plaintext VXLAN over Hetzner's public network. See Chapter 3 §Layer 3. TODO to switch to WireGuard backend.

Secrets management

Kubernetes Secrets

Our k8s Secrets are stored in etcd. etcd-at-rest encryption is not currently enabled — a compromise of the etcd data directory would expose Secret values. Given:

Nodes have disk encryption at the Hetzner hypervisor layer
Attacker needs root on the node to read etcd
Our operator access is already root-via-sudo

This is an accepted risk. TODO (Chapter 20): enable encryption at rest for etcd. K3s supports it via --secrets-encryption flag on the server.
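
Whether encryption at rest is active can be checked on a server node with the `k3s secrets-encrypt status` subcommand. The sketch below assumes the status output begins with a line of the form `Encryption Status: Enabled`, as printed by recent K3s releases; treat that format as an assumption and adjust the pattern if your version differs. `encryption_enabled` is a hypothetical helper name.

```shell
# Reads `k3s secrets-encrypt status` output on stdin and succeeds only when
# the status line reports that encryption is enabled.
encryption_enabled() {
  grep -q '^Encryption Status: Enabled'
}

# Usage on a server node:
#   sudo k3s secrets-encrypt status | encryption_enabled \
#     && echo "secrets encrypted at rest" \
#     || echo "secrets NOT encrypted at rest"
#
# Enabling it means starting every server with --secrets-encryption (or the
# equivalent "secrets-encryption: true" key in /etc/rancher/k3s/config.yaml).
```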

What Secrets we have

$ kubectl get secrets -n honeydue
NAME                TYPE                             DATA   AGE
gitea-credentials   kubernetes.io/dockerconfigjson   1      ...
honeydue-apns-key   Opaque                           1      ...
honeydue-secrets    Opaque                           9      ...

Contents:

| Secret | Key | Source |
| --- | --- | --- |
| `gitea-credentials` | `.dockerconfigjson` | PAT for Gitea registry (image pulls) |
| `honeydue-apns-key` | `apns_auth_key.p8` | Placeholder p8 file (push off) |
| `honeydue-secrets` | `POSTGRES_PASSWORD` | Neon DB password |
| `honeydue-secrets` | `SECRET_KEY` | 64-char random, app signing key |
| `honeydue-secrets` | `EMAIL_HOST_PASSWORD` | Fastmail app password |
| `honeydue-secrets` | `FCM_SERVER_KEY` | "disabled-no-push-accounts-yet" placeholder |
| `honeydue-secrets` | `REDIS_PASSWORD` | Empty (no auth on internal Redis) |
| `honeydue-secrets` | `B2_KEY_ID` | B2 app key ID |
| `honeydue-secrets` | `B2_APP_KEY` | B2 app key secret |
| `honeydue-secrets` | `ADMIN_EMAIL` | `admin@myhoneydue.com` |
| `honeydue-secrets` | `ADMIN_PASSWORD` | Generated 24-char initial admin password |

Source of truth

The Secret values came from:

deploy/secrets/*.txt files on the operator workstation (gitignored)
deploy/prod.env (gitignored)
deploy/registry.env (gitignored)

These Swarm-era files are still the canonical source. If you need to recreate Secrets in a new cluster:

cd honeyDueAPI-go
kubectl create secret generic honeydue-secrets -n honeydue \
  --from-literal=POSTGRES_PASSWORD="$(cat deploy/secrets/postgres_password.txt)" \
  --from-literal=SECRET_KEY="$(cat deploy/secrets/secret_key.txt)" \
  --from-literal=EMAIL_HOST_PASSWORD="$(cat deploy/secrets/email_host_password.txt)" \
  ...

The full recreation script is in Chapter 17 (Runbook).

Secret rotation

Not automated. To rotate (e.g., after a compromise):

Generate new value: openssl rand -base64 32

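Update the secret, patching only the rotated key so the other keys in honeydue-secrets are left untouched (piping kubectl create secret with a single --from-literal into kubectl apply can drop the keys that are not listed):

kubectl patch secret honeydue-secrets -n honeydue \
  --type merge \
  -p '{"stringData":{"SECRET_KEY":"new-value"}}'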

Restart dependent pods:

kubectl rollout restart -n honeydue deploy/api deploy/worker

Update deploy/secrets/secret_key.txt to match
Revoke the old credential at the source (Neon, Fastmail, etc.)
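
After any rotation it is worth confirming that honeydue-secrets still carries every key listed in the table earlier in this section, since an update that names only one key can silently drop the rest. A minimal sketch, assuming the nine key names from that table; `missing_secret_keys` and `EXPECTED_KEYS` are hypothetical names.

```shell
# Key names taken from the "What Secrets we have" table.
EXPECTED_KEYS="POSTGRES_PASSWORD SECRET_KEY EMAIL_HOST_PASSWORD FCM_SERVER_KEY REDIS_PASSWORD B2_KEY_ID B2_APP_KEY ADMIN_EMAIL ADMIN_PASSWORD"

# Reads the Secret's key names on stdin (one per line) and prints every
# expected key that is absent. Prints nothing when the Secret is complete.
missing_secret_keys() {
  local present key
  present="$(cat)"
  for key in $EXPECTED_KEYS; do
    printf '%s\n' "$present" | grep -qx "$key" || echo "$key"
  done
}

# Usage:
#   kubectl get secret honeydue-secrets -n honeydue \
#     -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}' \
#     | missing_secret_keys
```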

Container image provenance

Images come from gitea.treytartt.com/admin/*. We have no image signing or verification (cosign/sigstore) in place. A compromise of the Gitea registry = the ability to push malicious images that would be pulled into prod on the next rollout.

Mitigations:

Gitea itself is behind login; PAT is scoped to read:packages + write:packages only
Gitea runs on the operator's infrastructure (same operator account)
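Image tags are commit-SHA tags (:237c6b8), not :latest, so each rollout names an exact build. A tag is still mutable in the registry, though; only a digest reference (image@sha256:<digest>) is immutable, so an attacker replacing an existing tag's image is caught only if we compare digests.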

TODO (Chapter 20): Add cosign signing at build time, verify at pull time.
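
Until signing is in place, one inexpensive check is to compare the digest a pod is actually running against a digest recorded at build time. The sketch below assumes the pod's `imageID` has the usual `<repo>@sha256:<digest>` shape; `digest_matches` is a hypothetical helper, and where the expected digest is recorded is left to the build process.

```shell
# Reads a container's imageID on stdin and compares its digest part (the text
# after the last "@") with the expected digest given as the first argument.
digest_matches() {
  local expected="$1" image_id actual
  image_id="$(cat)"
  actual="${image_id##*@}"
  if [ "$actual" = "$expected" ]; then
    echo "ok: $actual"
  else
    echo "MISMATCH: running $actual, expected $expected"
    return 1
  fi
}

# Usage:
#   kubectl get pod <pod> -n honeydue \
#     -o jsonpath='{.status.containerStatuses[0].imageID}' \
#     | digest_matches sha256:<digest recorded at build time>
```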

Operator workstation security

The operator workstation has:

macOS with FileVault (full disk encryption)
Login password required
Private keys in ~/.ssh/ (mode 0600)
Kubeconfig at ~/.kube/honeydue-k3s.yaml (mode 0600), which contains cluster-admin credentials (K3s writes a client certificate and key into k3s.yaml rather than a bearer token)

Losing the laptop would require immediate credential rotation:

New SSH key, redeploy public part on all 3 nodes
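New kubeconfig: run sudo cat /etc/rancher/k3s/k3s.yaml on hetzner1, copy to workstation, update KUBECONFIG env. Re-copying the file alone hands back the same admin credential, so the cluster's certificates also need rotating before the lost copy stops working.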
Rotate operator-access PATs on Gitea, Neon, Cloudflare, Backblaze

Compliance notes

This stack is not currently certified for:

HIPAA — we transit and store health-related data but haven't contractually bound any BAA
SOC 2 — no auditing, no documented controls beyond this document
PCI-DSS — we don't handle card data; Apple/Google IAP handles payments
GDPR — we follow GDPR best practices (data minimization, user deletion) but haven't had a formal assessment

If honeyDue ever needs any of these, the infrastructure is compatible but the operational processes around it would need formal work.

Operator cheat sheet

# See all RBAC-related resources in a namespace
kubectl get sa,role,rolebinding -n honeydue

# Check what a ServiceAccount can do
kubectl auth can-i --list --as=system:serviceaccount:honeydue:api -n honeydue

# Verify pod is running with expected security context
kubectl get pod <pod> -n honeydue -o jsonpath='{.spec.securityContext}'
kubectl get pod <pod> -n honeydue -o jsonpath='{.spec.containers[0].securityContext}'

# List all Secrets (without revealing content)
kubectl get secret -n honeydue
kubectl describe secret honeydue-secrets -n honeydue  # shows keys, not values

# Decode a secret (CAREFUL: prints plaintext)
kubectl get secret honeydue-secrets -n honeydue -o jsonpath='{.data.SECRET_KEY}' | base64 -d
