fix(security): remediate 2026-05-12 audit findings (Stages 2–5)

Remediation of the 2026-05-12/13 audits (78 findings + cluster gaps), tracked
in deploy-k3s/SECURITY.md, plus fixes from two independent post-remediation
reviews.

Auth & sessions:
- SHA-256 hashed auth-token storage (C1); prior-token cache eviction on
  re-login (MEDIUM-1)
- local Google JWKS verification, iss/aud/exp checks (C2/C3)
- constant-time login + generic errors (L1/LIVE-L11/LIVE-L13)
- per-account login lockout keyed on distinct source IPs (M5/MEDIUM-3)
- verified-email gating, login rate limiting (LIVE-L19, H1-H3)

IAP & webhooks:
- Apple/Google cross-account replay protection (C5/C6/C10/C13, H5/H6)
- migrations 000003-000006 (token hashing, IAP replay, audit_log +
  webhook_event_log table creation, append-only audit log)

Authorization & races:
- file-ownership owner-OR-member fix (C7), atomic share-code join (C9/H9),
  device-token reassignment (C8/LOW-3)

Secrets & deploy:
- secrets file-mounted at /etc/honeydue/secrets, not env (F8); Redis password
  out of the ConfigMap (HIGH-1); B2 keys reconciled
- digest-pinned images, admin ingress hardening, CSP/HSTS, /metrics lockdown;
  kubeconfig 0600, etcd secrets-encryption, fail2ban + unattended-upgrades at
  provision; secret-rotation runbook

Build, vet, and the full test suite (incl. -race) pass; the goose migration
chain is verified against PostgreSQL 16.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 22:28:33 -05:00
parent 2004f9c5b2
commit c77ff07ce9
59 changed files with 2819 additions and 1245 deletions
@@ -8,6 +8,13 @@ long-haul components, and dedicated service accounts with dropped
 capabilities inside containers. This chapter documents each layer, the
 rationale, and what's currently missing (and why).

+> **Updated 2026-05-15 — security remediation.** The 2026-05 audits
+> (`live_scan_5_12.md`, `k3_audit_5_12.md`, `security_scan_5_12.md`) drove a
+> full remediation pass. **`deploy-k3s/SECURITY.md` is the authoritative,
+> per-finding current-state record.** This chapter is corrected for the
+> major items below; where any other detail conflicts with `SECURITY.md`,
+> `SECURITY.md` wins.
+
 ## Threat model

 Who we're defending against, in rough order of likelihood:
@@ -54,8 +61,8 @@ Cloudflare sits in front of every public request.
 - **Authorize requests** — that's the app's job
 - **Protect origin if origin IP leaks** — once someone knows a node IP
  they can bypass CF. Mitigation: keep origin firewall strict (Chapter 4).
-- **Encrypt between CF and origin** — we're on SSL=Flexible, so CF↔origin
-  is HTTP. This is in our TODO (Chapter 20, upgrade to Full-strict).
+- **~~Encrypt between CF and origin~~** — done (2026-04-24): SSL mode is
+  Full (strict); CF↔origin is TLS with a Cloudflare Origin CA cert.

 ### The proxy-IP problem

@@ -75,8 +82,8 @@ This means a malicious request that bypasses CF (by hitting the node IP
 directly) can't spoof headers — Traefik ignores `X-Forwarded-*` unless
 the source IP is in CF's ranges.

-**TODO** (Chapter 20): Enforce at UFW level — allow 80/tcp only from
-CF IP ranges. Today any IP can reach the origin on port 80.
+**Done (2026-04-24):** the node UFW allowlist permits `:443` only from
+Cloudflare's IP ranges; the `Anywhere` rules on `:80`/`:443` were removed.

 ## Layer 2 — Node (OS, SSH, firewall)

@@ -297,15 +304,13 @@ The `deploy-k3s/manifests/network-policies.yaml` scaffold defines:
  reach api pods on port 8000
 - **allow-ingress-to-admin** — same, for admin:3000

-**These are not currently applied.** Without them, our pods can freely
-talk to anything — including, theoretically, malicious destinations if
-an attacker gets RCE inside a pod.
+**Applied.** `03-deploy.sh` applies
+`deploy-k3s/manifests/network-policies.yaml` on every deploy — default-deny
+plus the explicit per-app allows below. Traefik runs `hostNetwork`, so its
+traffic is matched by node-IP `ipBlock`s plus the pod CIDR `10.42.0.0/16`,
+not a `namespaceSelector`.

-**TODO** (Chapter 20): Apply network policies. The scaffold is there; we
-just need to `kubectl apply -f deploy-k3s/manifests/network-policies.yaml`
-and test that nothing breaks.
-
-### What network policies would prevent
+### What network policies prevent

 | Attack scenario | NetworkPolicy blocks |
 |---|---|
@@ -324,13 +329,10 @@ renewed Let's Encrypt or CF-managed cert for `*.myhoneydue.com`.

 ### CF ↔ origin

-**Plaintext HTTP** (SSL = Flexible). An attacker with access to the
-Cloudflare-to-Hetzner path could read traffic. In practice nobody who
-isn't Cloudflare or Hetzner sits on that path.
-
-**TODO** (Chapter 20): Upgrade to SSL = Full (strict) with a Cloudflare
-Origin CA certificate. This encrypts CF ↔ origin and verifies that
-origin's cert is the CF-issued one (prevents MitM if DNS is compromised).
+**TLS — SSL = Full (strict)** (since 2026-04-24). A Cloudflare Origin CA
+certificate (`cloudflare-origin-cert` secret) is installed on all three
+ingresses; Cloudflare validates it. Both user↔CF and CF↔origin are
+encrypted, and a DNS-hijack MitM is defeated by the origin-cert check.

 ### API ↔ Neon Postgres

@@ -454,11 +456,14 @@ Mitigations:
 - Gitea itself is behind login; PAT is scoped to read:packages +
  write:packages only
 - Gitea runs on the operator's infrastructure (same operator account)
-- Image tags are SHA-pinned (`:237c6b8`) not `:latest` → attacker can't
-  replace an existing tag's image without us noticing the digest change
+- Workloads deploy by immutable `@sha256:` digest, not by mutable tag
+  (`03-deploy.sh` resolves the digest after push; the redis/vmagent/node
+  base images are digest-pinned too) — a swapped tag cannot reach the
+  cluster.

-**TODO** (Chapter 20): Add cosign signing at build time, verify at pull
-time.
+**TODO**: cosign signing is wired into `03-deploy.sh` (guarded — runs when
+`cosign` + `COSIGN_KEY` are present); cluster-side admission verification
+(Kyverno/Connaisseur) is still pending. See `deploy-k3s/SECURITY.md` → L5.

 ## Operator workstation security