admin/honeyDueAPI

Trey t c77ff07ce9

fix(security): remediate 2026-05-12 audit findings (Stages 2–5)

Remediation of the 2026-05-12/13 audits (78 findings + cluster gaps),
tracked in deploy-k3s/SECURITY.md, plus fixes from two independent
post-remediation reviews.

Auth & sessions:
- SHA-256 hashed auth-token storage (C1); prior-token cache eviction on
  re-login (MEDIUM-1)
- local Google JWKS verification, iss/aud/exp checks (C2/C3)
- constant-time login + generic errors (L1/LIVE-L11/LIVE-L13)
- per-account login lockout keyed on distinct source IPs (M5/MEDIUM-3)
- verified-email gating, login rate limiting (LIVE-L19, H1-H3)

IAP & webhooks:
- Apple/Google cross-account replay protection (C5/C6/C10/C13, H5/H6)
- migrations 000003-000006 (token hashing, IAP replay, audit_log +
  webhook_event_log table creation, append-only audit log)

Authorization & races:
- file-ownership owner-OR-member fix (C7), atomic share-code join
  (C9/H9), device-token reassignment (C8/LOW-3)

Secrets & deploy:
- secrets file-mounted at /etc/honeydue/secrets, not env (F8); Redis
  password out of the ConfigMap (HIGH-1); B2 keys reconciled
- digest-pinned images, admin ingress hardening, CSP/HSTS, /metrics
  lockdown; kubeconfig 0600, etcd secrets-encryption, fail2ban +
  unattended-upgrades at provision; secret-rotation runbook

Build, vet, and the full test suite (incl. -race) pass; the goose
migration chain is verified against PostgreSQL 16.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-16 22:28:33 -05:00


05 — Security

Summary

Security on this deployment is layered: Cloudflare at the edge, UFW at the node, k3s RBAC + Pod Security at the orchestrator, TLS between long-haul components, and dedicated service accounts with dropped capabilities inside containers. This chapter documents each layer, the rationale, and what's currently missing (and why).

Updated 2026-05-15 — security remediation. The 2026-05 audits (live_scan_5_12.md, k3_audit_5_12.md, security_scan_5_12.md) drove a full remediation pass. deploy-k3s/SECURITY.md is the authoritative, per-finding current-state record. This chapter is corrected for the major items below; where any other detail conflicts with SECURITY.md, SECURITY.md wins.

Threat model

Who we're defending against, in rough order of likelihood:

Opportunistic scanners — bots scanning random IPv4 ranges for known vulnerabilities. Mitigated by the firewall.
Credential stuffing / brute-force — especially against SSH and admin login. Mitigated by key-only SSH, strong passwords, rate limits.
Compromised external service — if Neon, Backblaze, or Cloudflare were breached, attacker would have access to whatever we store there. Mitigated by scoped credentials, least-privilege API keys.
Compromised container image — if Gitea or our build pipeline were compromised, malicious code could reach prod. Mitigated by (a) authentication in front of Gitea, (b) scoped image pull secrets, and (c) containers that run non-root with minimal capabilities.
Insider threat — not really a threat for a solo operator.
State actor — not in threat model. At our scale this is effectively unaddressable without becoming a security company.

Explicitly not in threat model:

DDoS at a scale that saturates Cloudflare. We pay $0 for CF; their DDoS mitigation is included but not unlimited. If we got hit with a large attack, we'd move to a paid plan.
Physical access to Hetzner datacenters. That's their problem.

Layer 1 — Cloudflare edge

Cloudflare sits in front of every public request.

What Cloudflare does for us

Protection	How it works
TLS termination	CF presents a cert for `*.myhoneydue.com`; clients encrypt to CF
DDoS mitigation	Automatic on all plans including Free
Bot filtering	"Under Attack" mode + bot score based blocking
IP concealment	Origin IPs not in DNS; attackers can't directly scan
WAF rules	CF Free includes managed ruleset for common exploits
Rate limiting	Free tier: 10k requests/10min; more on paid plans

What Cloudflare does not do

Authenticate users — that's the app's job
Authorize requests — that's the app's job
Protect origin if origin IP leaks — once someone knows a node IP they can bypass CF. Mitigation: keep origin firewall strict (Chapter 4).
~~Encrypt between CF and origin~~ — done (2026-04-24): SSL mode is Full (strict); CF↔origin is TLS with a Cloudflare Origin CA cert.

The proxy-IP problem

Cloudflare publishes its IP ranges (cloudflare.com/ips), so an origin can verify that a request came through CF by checking the remote address. Our Traefik is configured to trust X-Forwarded-* headers such as X-Forwarded-Proto (which tells the Go API the scheme the client originally used) only from CF IP ranges:

# deploy-k3s/manifests/traefik-helmchartconfig.yaml
additionalArguments:
  - "--entrypoints.web.forwardedHeaders.trustedIPs=173.245.48.0/20,..."

This means a malicious request that bypasses CF (by hitting the node IP directly) can't spoof headers — Traefik ignores X-Forwarded-* unless the source IP is in CF's ranges.
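The same trust rule can be sketched in application code. This is a minimal illustration, not Traefik's implementation: the helper names `isTrustedProxy` and `effectiveScheme` are hypothetical, and the allowlist holds only the single range visible in the manifest excerpt above (a real list would carry every range Cloudflare publishes).

```go
package main

import (
	"fmt"
	"net/netip"
)

// trustedProxies holds the CIDR ranges whose X-Forwarded-* headers are honored.
// Assumption: only the one range shown in the manifest excerpt is listed here.
var trustedProxies = []netip.Prefix{
	netip.MustParsePrefix("173.245.48.0/20"),
}

// isTrustedProxy reports whether the TCP peer ("ip:port") is inside a trusted range.
func isTrustedProxy(remoteAddr string) bool {
	ap, err := netip.ParseAddrPort(remoteAddr)
	if err != nil {
		return false // unparseable peers are never trusted
	}
	addr := ap.Addr().Unmap() // treat IPv4-mapped IPv6 peers as IPv4
	for _, p := range trustedProxies {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

// effectiveScheme returns the forwarded scheme only when the peer is trusted;
// otherwise it falls back to the scheme of the actual connection.
func effectiveScheme(remoteAddr, forwardedProto, connScheme string) string {
	if forwardedProto != "" && isTrustedProxy(remoteAddr) {
		return forwardedProto
	}
	return connScheme
}

func main() {
	// Peer inside the trusted range: the header is honored.
	fmt.Println(effectiveScheme("173.245.48.10:443", "https", "http"))
	// Peer outside the range (documentation address): the header is ignored.
	fmt.Println(effectiveScheme("203.0.113.7:51000", "https", "http"))
}
```

The key design point is that the decision is made on the TCP peer address, which a client cannot forge, and never on anything carried inside the request.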

Done (2026-04-24): the node UFW allowlist permits :443 only from Cloudflare's IP ranges; the Anywhere rules on :80/:443 were removed.

Layer 2 — Node (OS, SSH, firewall)

Each node runs Ubuntu 24.04.3 LTS, hardened as described in the subsections below.

SSH hardening

/etc/ssh/sshd_config on each node:

Port 22
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
AllowUsers deploy

Result:

Only the deploy user can log in
Only with a public key (no password)
Root cannot log in remotely
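A drift check for these directives could look like the following sketch. It is an illustration under stated assumptions: the function names are hypothetical, the config text is passed in as a string rather than read from the node, and the parser ignores Match blocks and Include files. It relies on the sshd_config rule that keywords are case-insensitive and the first value seen for a keyword wins.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// expected mirrors the hardening directives listed above (keys lower-cased).
var expected = map[string]string{
	"permitrootlogin":        "no",
	"passwordauthentication": "no",
	"pubkeyauthentication":   "yes",
	"allowusers":             "deploy",
}

// parseSSHDConfig returns the first value seen for each keyword.
// Simplification: Match blocks and Include directives are not handled.
func parseSSHDConfig(text string) map[string]string {
	got := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		key := strings.ToLower(fields[0])
		if _, seen := got[key]; !seen {
			got[key] = strings.Join(fields[1:], " ")
		}
	}
	return got
}

// mismatches lists every expected directive that is absent or set differently.
func mismatches(text string) []string {
	got := parseSSHDConfig(text)
	var out []string
	for key, want := range expected {
		if got[key] != want {
			out = append(out, fmt.Sprintf("%s: want %q, got %q", key, want, got[key]))
		}
	}
	sort.Strings(out) // stable output regardless of map iteration order
	return out
}

func main() {
	sample := "Port 22\nPermitRootLogin no\nPasswordAuthentication yes\n" +
		"PubkeyAuthentication yes\nAllowUsers deploy\n"
	for _, m := range mismatches(sample) {
		fmt.Println("MISMATCH", m)
	}
}
```

On a live node, `sshd -T` prints the effective configuration and is the more authoritative source; the sketch only shows the comparison logic.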

The public key authorized for deploy:

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBU9xTTBD78tYUqHijgyU9PDqtmS4NuM/6uy8XgDzva+ hetzner2@myhoneydue.com

(Note: the comment field says "hetzner2" but it's the key for all three nodes — the comment is the key's identifier, not a restriction.)

Private key is at ~/.ssh/hetzner on the operator workstation.

Sudo

The deploy user has unrestricted sudo with no password (/etc/sudoers.d/deploy):

deploy ALL=(ALL) NOPASSWD: ALL

This is convenient but broad. A compromise of the deploy SSH key = root on the node. Mitigations:

Key is stored only on the operator workstation, not checked into git
Operator workstation has disk encryption (macOS FileVault)
The key itself is passphrase-protected (unlocked through the ssh-agent cache)

Future hardening: scope sudo to specific commands that deploy workflows need (e.g., /usr/sbin/ufw, /usr/bin/systemctl), but this requires enumerating every command we might run, which breaks ad-hoc debugging.

fail2ban

Updated (2026-05 remediation): the remediation adds fail2ban at node provisioning; deploy-k3s/SECURITY.md records the current state. fail2ban bans IPs that fail SSH auth repeatedly. Because we disable password auth entirely, the attack surface is tiny (an attacker with the private key wins; failed public-key attempts are functionally noise, not credential stuffing), so its main benefit here is rate-limiting SSH bot traffic.

unattended-upgrades

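Updated (2026-05 remediation): the remediation adds unattended-upgrades at node provisioning; deploy-k3s/SECURITY.md records the current state. On a node without it, security patches require a manual apt upgrade, which is a gap. It should be configured for security-only updates.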

UFW firewall

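See Chapter 4 for the complete ruleset. Summary: default-deny incoming, with specific allows for SSH (22), HTTPS (443) from Cloudflare's IP ranges only, the k3s API from the operator IP (6443), and inter-node cluster ports. The former Anywhere rules on :80/:443 were removed on 2026-04-24.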

Layer 3 — Kubernetes RBAC

K3s inherits full Kubernetes RBAC. Every component that talks to the API server has a ServiceAccount with only the permissions it needs.

System accounts

K3s creates these by default:

kube-system:admin — cluster admin, used by kubectl
kube-system:coredns — for CoreDNS
kube-system:traefik — for Traefik ingress controller
kube-system:helm-install-traefik — for the Helm chart installer

We don't touch these.

Application service accounts

Our rbac.yaml creates four ServiceAccounts in the honeydue namespace:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: api
  namespace: honeydue
automountServiceAccountToken: false   # ← important

Same for admin, worker, redis.

automountServiceAccountToken: false means pods don't get a k8s API token mounted in /var/run/secrets/kubernetes.io/serviceaccount/. Without it, a compromised pod cannot query the Kubernetes API even if the default service account has broad permissions.

What the app pods CAN'T do

Our app service accounts have no RoleBindings or ClusterRoleBindings. They cannot:

List, get, create, update, delete any Kubernetes resource
Read other namespaces' secrets
Schedule workloads
View cluster state

If the api container were fully compromised (RCE), the attacker would have:

Network access to other pods in the honeydue namespace (Chapter 16)
Read access to our ConfigMap + Secrets (mounted into the container)
No ability to pivot to other parts of the cluster via the k8s API

Layer 4 — Pod Security

Every pod runs with restrictive security context:

securityContext:
  runAsNonRoot: true
  runAsUser: 1000        # api; different per service
  runAsGroup: 1000
  fsGroup: 1000
  seccompProfile:
    type: RuntimeDefault

containers:
  - securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]

What each setting does

Setting	Effect
`runAsNonRoot: true`	Pod refuses to start if the image's default user is root
`runAsUser: 1000`	Override to UID 1000 (app user)
`allowPrivilegeEscalation: false`	Sets no_new_privs, so a process cannot gain privileges through setuid/setgid binaries or file capabilities
`readOnlyRootFilesystem: true`	`/` is read-only; writes require explicit volumes
`capabilities: drop: [ALL]`	No Linux capabilities (NET_ADMIN, SYS_TIME, etc.)
`seccompProfile: RuntimeDefault`	Restrict syscalls to containerd's default seccomp allowlist

Read-only root means our app images must declare writable volumes for anything mutable:

volumeMounts:
  - name: tmp
    mountPath: /tmp
volumes:
  - name: tmp
    emptyDir:
      sizeLimit: 64Mi

If the app needs to write somewhere else (e.g., Next.js cache), we mount an emptyDir there explicitly.
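With a read-only root filesystem, a forgotten mount only shows up when the first write fails. A startup probe along these lines can surface that early. This is a sketch, not code from the honeyDue services: `checkWritable` and `defaultScratchDir` are hypothetical names, and `/var/cache/example` stands in for any additional path a service might need.

```go
package main

import (
	"fmt"
	"os"
)

// defaultScratchDir matches the emptyDir mountPath shown above.
const defaultScratchDir = "/tmp"

// checkWritable creates and removes a probe file to confirm dir accepts writes.
func checkWritable(dir string) error {
	f, err := os.CreateTemp(dir, "probe-*")
	if err != nil {
		return fmt.Errorf("%s is not writable: %w", dir, err)
	}
	name := f.Name()
	if err := f.Close(); err != nil {
		return err
	}
	return os.Remove(name)
}

func main() {
	// The second path is a hypothetical example of a directory that would
	// need its own emptyDir mount before the app could write to it.
	for _, dir := range []string{defaultScratchDir, "/var/cache/example"} {
		if err := checkWritable(dir); err != nil {
			fmt.Println("FAIL:", err)
			continue
		}
		fmt.Println("ok:", dir)
	}
}
```

Failing fast at startup turns a missing volume into an obvious crash loop at deploy time instead of an intermittent runtime error.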

Traefik exception

Traefik needs CAP_NET_BIND_SERVICE to bind ports 80/443 on the host network. Its security context adds just that one capability back:

securityContext:
  capabilities:
    drop: [ALL]
    add: [NET_BIND_SERVICE]
  readOnlyRootFilesystem: true
  runAsGroup: 65532
  runAsNonRoot: true
  runAsUser: 65532

The net.ipv4.ip_unprivileged_port_start=0 sysctl on the nodes complements this — on older kernels NET_BIND_SERVICE alone isn't enough in the host netns.

Pod Security Admission (PSA)

Kubernetes has a built-in admission controller for enforcing Pod Security Standards at the namespace level:

apiVersion: v1
kind: Namespace
metadata:
  name: honeydue
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest

We don't currently set this. We get the equivalent effect from the explicit securityContext on each pod, but namespace-level enforcement would catch new workloads that forget to set it. TODO (Chapter 20).

Layer 5 — Network Policies

deploy-k3s/manifests/network-policies.yaml defines:

default-deny-all — deny all ingress and egress by default in the honeydue namespace
allow-dns — allow egress UDP/TCP 53 to CoreDNS
allow-ingress-to-api — allow Traefik to reach api pods on port 8000
allow-ingress-to-admin — same, for admin:3000

Applied. 03-deploy.sh applies deploy-k3s/manifests/network-policies.yaml on every deploy: default-deny plus the explicit per-app allows listed above. Traefik runs hostNetwork, so its traffic is matched by node-IP ipBlocks plus the pod CIDR 10.42.0.0/16, not by a namespaceSelector.

What network policies prevent

Attack scenario	NetworkPolicy blocks
Pod A compromised, attacker moves laterally to pod B	Yes (traffic needs an explicit allow)
Pod RCE → scan internal networks	Yes (default deny egress)
Pod RCE → exfil to attacker's C2	Yes (outbound to internet needs egress rule)

Without policies, all of these attacks succeed.

TLS and encryption

CF ↔ user

Always TLS 1.2+ (CF doesn't support older). CF presents an automatically renewed Let's Encrypt or CF-managed cert for *.myhoneydue.com.

CF ↔ origin

TLS — SSL = Full (strict) (since 2026-04-24). A Cloudflare Origin CA certificate (cloudflare-origin-cert secret) is installed on all three ingresses; Cloudflare validates it. Both user↔CF and CF↔origin are encrypted, and a DNS-hijack MitM is defeated by the origin-cert check.

API ↔ Neon Postgres

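TLS via DB_SSLMODE=require. The Go app's postgres driver (pgx) negotiates TLS, and the connection fails if TLS can't be established. Under the libpq-style sslmode semantics that pgx follows, require encrypts the connection but does not verify Neon's certificate unless a root CA is configured; verify-full is the mode that validates the certificate chain and the hostname. The negotiated protocol version (TLS 1.3 with Neon) depends on what both ends support, not on the sslmode value.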
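What each sslmode value guarantees is easy to misremember, so a small classifier can make it explicit. This is a sketch under two assumptions: that DB_SSLMODE is passed through unchanged as the libpq-style sslmode parameter, and that no root CA file is configured (with one, require behaves like verify-ca). The function names are hypothetical.

```go
package main

import "fmt"

// verifiesServerCert reports whether a libpq-style sslmode value makes the
// client validate the server certificate chain.
// Caveat: "require" combined with a configured root CA behaves like
// "verify-ca"; that case is not modeled here.
func verifiesServerCert(sslmode string) bool {
	switch sslmode {
	case "verify-ca", "verify-full":
		return true
	default:
		return false
	}
}

// verifiesHostname reports whether the mode also checks that the certificate
// matches the host being connected to.
func verifiesHostname(sslmode string) bool {
	return sslmode == "verify-full"
}

func main() {
	for _, mode := range []string{"disable", "prefer", "require", "verify-ca", "verify-full"} {
		fmt.Printf("%-12s cert=%-5t hostname=%t\n",
			mode, verifiesServerCert(mode), verifiesHostname(mode))
	}
}
```

A startup guard that refuses anything weaker than verify-full would be one way to use it, at the cost of requiring a working CA bundle in the container image.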

API ↔ Backblaze B2

HTTPS (B2 doesn't support HTTP). B2_USE_SSL=true in our ConfigMap (though actually the app reads STORAGE_USE_SSL — see Chapter 9 for this vestigial variable's story).

Worker ↔ Fastmail SMTP

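STARTTLS on port 587. The Go wneessen/go-mail library is used in TLSOpportunistic mode: it connects in plaintext, upgrades via STARTTLS when the server advertises it, and otherwise continues unencrypted. Fastmail always advertises STARTTLS, so in practice every connection is encrypted; TLSMandatory would turn a missing or stripped STARTTLS into a hard failure instead of a silent downgrade.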

API/worker ↔ Redis

Plaintext inside the cluster. Redis 7 supports TLS (redis-tls.conf, redis-server --tls-port), but we haven't enabled it because Redis is on the overlay network, not exposed externally, and only holds cache + queue state.

Pod-to-pod (Flannel overlay)

Plaintext VXLAN over Hetzner's public network. See Chapter 3 §Layer 3. TODO to switch to WireGuard backend.

Secrets management

Kubernetes Secrets

Our k8s Secrets are stored in etcd. Updated (2026-05 remediation): the remediation lists etcd secrets-encryption as enabled; deploy-k3s/SECURITY.md records the current state. Without encryption at rest, a compromise of the etcd data directory would expose Secret values. That was previously an accepted risk, given:

Nodes have disk encryption at the Hetzner hypervisor layer
Attacker needs root on the node to read etcd
Our operator access is already root-via-sudo

K3s supports encryption at rest via the --secrets-encryption flag on the server (Chapter 20).

What Secrets we have

$ kubectl get secrets -n honeydue
NAME                TYPE                             DATA   AGE
gitea-credentials   kubernetes.io/dockerconfigjson   1      ...
honeydue-apns-key   Opaque                           1      ...
honeydue-secrets    Opaque                           9      ...

Contents:

Secret	Key	Source
`gitea-credentials`	`.dockerconfigjson`	PAT for Gitea registry (image pulls)
`honeydue-apns-key`	`apns_auth_key.p8`	Placeholder p8 file (push off)
`honeydue-secrets`	`POSTGRES_PASSWORD`	Neon DB password
`honeydue-secrets`	`SECRET_KEY`	64-char random, app signing key
`honeydue-secrets`	`EMAIL_HOST_PASSWORD`	Fastmail app password
`honeydue-secrets`	`FCM_SERVER_KEY`	"disabled-no-push-accounts-yet" placeholder
`honeydue-secrets`	`REDIS_PASSWORD`	Empty (no auth on internal Redis)
`honeydue-secrets`	`B2_KEY_ID`	B2 app key ID
`honeydue-secrets`	`B2_APP_KEY`	B2 app key secret
`honeydue-secrets`	`ADMIN_EMAIL`	`admin@myhoneydue.com`
`honeydue-secrets`	`ADMIN_PASSWORD`	Generated 24-char initial admin password

Source of truth

The Secret values came from:

deploy/secrets/*.txt files on the operator workstation (gitignored)
deploy/prod.env (gitignored)
deploy/registry.env (gitignored)

These Swarm-era files are still the canonical source. If you need to recreate Secrets in a new cluster:

cd honeyDueAPI-go
kubectl create secret generic honeydue-secrets -n honeydue \
  --from-literal=POSTGRES_PASSWORD="$(cat deploy/secrets/postgres_password.txt)" \
  --from-literal=SECRET_KEY="$(cat deploy/secrets/secret_key.txt)" \
  --from-literal=EMAIL_HOST_PASSWORD="$(cat deploy/secrets/email_host_password.txt)" \
  ...

The full recreation script is in Chapter 17 (Runbook).

Secret rotation

Not automated. To rotate (e.g., after a compromise):

Generate new value: openssl rand -base64 32

Update the secret:

kubectl create secret generic honeydue-secrets -n honeydue \
  --from-literal=SECRET_KEY='new-value' \
  --dry-run=client -o yaml | kubectl apply -f -

Restart dependent pods:

kubectl rollout restart -n honeydue deploy/api deploy/worker

Update deploy/secrets/secret_key.txt to match
Revoke the old credential at the source (Neon, Fastmail, etc.)
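The first step can also be done without shelling out. The sketch below is an assumed equivalent of `openssl rand -base64 32`, not part of the project's tooling; `newSecret` is a hypothetical helper name.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
)

// newSecret returns nBytes of cryptographically secure randomness,
// encoded with standard (padded) base64.
func newSecret(nBytes int) (string, error) {
	buf := make([]byte, nBytes)
	if _, err := rand.Read(buf); err != nil {
		return "", fmt.Errorf("reading random bytes: %w", err)
	}
	return base64.StdEncoding.EncodeToString(buf), nil
}

func main() {
	secret, err := newSecret(32)
	if err != nil {
		panic(err)
	}
	// 32 random bytes encode to a 44-character base64 string.
	fmt.Println(secret)
}
```

Whichever generator is used, the value should go straight into the Secret and the gitignored source file, never into shell history or logs.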

Container image provenance

Images come from gitea.treytartt.com/admin/*. We have no cluster-side image signature verification (cosign/sigstore) in place. A compromise of the Gitea registry = the ability to push malicious images, which the mitigations below are meant to keep out of prod.

Mitigations:

Gitea itself is behind login; PAT is scoped to read:packages + write:packages only
Gitea runs on the operator's infrastructure (same operator account)
Workloads deploy by immutable @sha256: digest, not by mutable tag (03-deploy.sh resolves the digest after push; the redis/vmagent/node base images are digest-pinned too) — a swapped tag cannot reach the cluster.
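A pre-deploy guard for the digest rule might look like this sketch. It is illustrative only and is not how 03-deploy.sh does it: `isDigestPinned` is a hypothetical helper, the image name `example` is made up, and the regular expression is a simplification of the full image-reference grammar.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// digestRef matches a reference that ends in an immutable sha256 digest:
// anything without '@' or whitespace, then "@sha256:" and 64 lowercase hex chars.
var digestRef = regexp.MustCompile(`^[^@\s]+@sha256:[a-f0-9]{64}$`)

// isDigestPinned reports whether ref is pinned by digest rather than by tag.
func isDigestPinned(ref string) bool {
	return digestRef.MatchString(ref)
}

func main() {
	digest := strings.Repeat("ab", 32) // placeholder 64-char hex digest
	refs := []string{
		"gitea.treytartt.com/admin/example@sha256:" + digest, // pinned
		"gitea.treytartt.com/admin/example:latest",           // mutable tag
	}
	for _, ref := range refs {
		fmt.Printf("%t  %s\n", isDigestPinned(ref), ref)
	}
}
```

Rejecting tag-only references before they reach a manifest keeps the guarantee from depending on every script remembering to resolve the digest.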

TODO: cosign signing is wired into 03-deploy.sh (guarded — runs when cosign + COSIGN_KEY are present); cluster-side admission verification (Kyverno/Connaisseur) is still pending. See deploy-k3s/SECURITY.md → L5.

Operator workstation security

The operator workstation has:

macOS with FileVault (full disk encryption)
Login password required
Private keys in ~/.ssh/ (mode 0600)
Kubeconfig at ~/.kube/honeydue-k3s.yaml (mode 0600) — contains cluster-admin credentials for the cluster
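The 0600 expectation on keys and kubeconfig can be checked mechanically. The following is a minimal sketch with hypothetical helper names; it takes the paths to check as command-line arguments rather than hard-coding them.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
)

// isOwnerOnly reports whether no group or other permission bits are set.
func isOwnerOnly(mode fs.FileMode) bool {
	return mode.Perm()&0o077 == 0
}

// checkPrivate stats path and returns an error if group or other can access it.
func checkPrivate(path string) error {
	info, err := os.Stat(path)
	if err != nil {
		return err
	}
	if !isOwnerOnly(info.Mode()) {
		return fmt.Errorf("%s has mode %04o, want 0600 or stricter",
			path, uint32(info.Mode().Perm()))
	}
	return nil
}

func main() {
	// Usage: pass the files to check, for example the SSH key and kubeconfig.
	for _, p := range os.Args[1:] {
		if err := checkPrivate(p); err != nil {
			fmt.Println("FAIL:", err)
			continue
		}
		fmt.Println("ok:", p)
	}
}
```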

Losing the laptop would require immediate credential rotation:

New SSH key, redeploy public part on all 3 nodes
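New kubeconfig: run sudo cat /etc/rancher/k3s/k3s.yaml on hetzner1, copy to workstation, update KUBECONFIG env. Copying the file again does not by itself invalidate the credentials in the lost copy, so rotate the cluster credentials on the server first.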
Rotate operator-access PATs on Gitea, Neon, Cloudflare, Backblaze

Compliance notes

This stack is not currently certified for:

HIPAA — we transit and store health-related data but haven't contractually bound any BAA
SOC 2 — no auditing, no documented controls beyond this document
PCI-DSS — we don't handle card data; Apple/Google IAP handles payments
GDPR — we follow GDPR best practices (data minimization, user deletion) but haven't had a formal assessment

If honeyDue ever needs any of these, the infrastructure is compatible but the operational processes around it would need formal work.

Operator cheat sheet

# See all RBAC-related resources in a namespace
kubectl get sa,role,rolebinding -n honeydue

# Check what a ServiceAccount can do
kubectl auth can-i --list --as=system:serviceaccount:honeydue:api -n honeydue

# Verify pod is running with expected security context
kubectl get pod <pod> -n honeydue -o jsonpath='{.spec.securityContext}'
kubectl get pod <pod> -n honeydue -o jsonpath='{.spec.containers[0].securityContext}'

# List all Secrets (without revealing content)
kubectl get secret -n honeydue
kubectl describe secret honeydue-secrets -n honeydue  # shows keys, not values

# Decode a secret (CAREFUL: prints plaintext)
kubectl get secret honeydue-secrets -n honeydue -o jsonpath='{.data.SECRET_KEY}' | base64 -d
