fix(security): remediate 2026-05-12 audit findings (Stages 2–5)
Remediation of the 2026-05-12/13 audits (78 findings + cluster gaps), tracked
in deploy-k3s/SECURITY.md, plus fixes from two independent post-remediation
reviews.

Auth & sessions:
- SHA-256 hashed auth-token storage (C1); prior-token cache eviction on
  re-login (MEDIUM-1)
- local Google JWKS verification, iss/aud/exp checks (C2/C3)
- constant-time login + generic errors (L1/LIVE-L11/LIVE-L13)
- per-account login lockout keyed on distinct source IPs (M5/MEDIUM-3)
- verified-email gating, login rate limiting (LIVE-L19, H1-H3)

IAP & webhooks:
- Apple/Google cross-account replay protection (C5/C6/C10/C13, H5/H6)
- migrations 000003-000006 (token hashing, IAP replay, audit_log +
  webhook_event_log table creation, append-only audit log)

Authorization & races:
- file-ownership owner-OR-member fix (C7), atomic share-code join (C9/H9),
  device-token reassignment (C8/LOW-3)

Secrets & deploy:
- secrets file-mounted at /etc/honeydue/secrets, not env (F8); Redis
  password out of the ConfigMap (HIGH-1); B2 keys reconciled
- digest-pinned images, admin ingress hardening, CSP/HSTS, /metrics
  lockdown; kubeconfig 0600, etcd secrets-encryption, fail2ban +
  unattended-upgrades at provision; secret-rotation runbook

Build, vet, and the full test suite (incl. -race) pass; the goose migration
chain is verified against PostgreSQL 16.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -8,6 +8,13 @@ long-haul components, and dedicated service accounts with dropped
 capabilities inside containers. This chapter documents each layer, the
 rationale, and what's currently missing (and why).
 
+> **Updated 2026-05-15 — security remediation.** The 2026-05 audits
+> (`live_scan_5_12.md`, `k3_audit_5_12.md`, `security_scan_5_12.md`) drove a
+> full remediation pass. **`deploy-k3s/SECURITY.md` is the authoritative,
+> per-finding current-state record.** This chapter is corrected for the
+> major items below; where any other detail conflicts with `SECURITY.md`,
+> `SECURITY.md` wins.
+
 ## Threat model
 
 Who we're defending against, in rough order of likelihood:
@@ -54,8 +61,8 @@ Cloudflare sits in front of every public request.
 - **Authorize requests** — that's the app's job
 - **Protect origin if origin IP leaks** — once someone knows a node IP
   they can bypass CF. Mitigation: keep origin firewall strict (Chapter 4).
-- **Encrypt between CF and origin** — we're on SSL=Flexible, so CF↔origin
-  is HTTP. This is in our TODO (Chapter 20, upgrade to Full-strict).
+- **~~Encrypt between CF and origin~~** — done (2026-04-24): SSL mode is
+  Full (strict); CF↔origin is TLS with a Cloudflare Origin CA cert.
 
 ### The proxy-IP problem
 
@@ -75,8 +82,8 @@ This means a malicious request that bypasses CF (by hitting the node IP
 directly) can't spoof headers — Traefik ignores `X-Forwarded-*` unless
 the source IP is in CF's ranges.
 
-**TODO** (Chapter 20): Enforce at UFW level — allow 80/tcp only from
-CF IP ranges. Today any IP can reach the origin on port 80.
+**Done (2026-04-24):** the node UFW allowlist permits `:443` only from
+Cloudflare's IP ranges; the `Anywhere` rules on `:80`/`:443` were removed.
 
 ## Layer 2 — Node (OS, SSH, firewall)
 
@@ -297,15 +304,13 @@ The `deploy-k3s/manifests/network-policies.yaml` scaffold defines:
   reach api pods on port 8000
 - **allow-ingress-to-admin** — same, for admin:3000
 
-**These are not currently applied.** Without them, our pods can freely
-talk to anything — including, theoretically, malicious destinations if
-an attacker gets RCE inside a pod.
+**Applied.** `03-deploy.sh` applies
+`deploy-k3s/manifests/network-policies.yaml` on every deploy — default-deny
+plus the explicit per-app allows below. Traefik runs `hostNetwork`, so its
+traffic is matched by node-IP `ipBlock`s plus the pod CIDR `10.42.0.0/16`,
+not a `namespaceSelector`.
 
-**TODO** (Chapter 20): Apply network policies. The scaffold is there; we
-just need to `kubectl apply -f deploy-k3s/manifests/network-policies.yaml`
-and test that nothing breaks.
-
-### What network policies would prevent
+### What network policies prevent
 
 | Attack scenario | NetworkPolicy blocks |
 |---|---|
@@ -324,13 +329,10 @@ renewed Let's Encrypt or CF-managed cert for `*.myhoneydue.com`.
 
 ### CF ↔ origin
 
-**Plaintext HTTP** (SSL = Flexible). An attacker with access to the
-Cloudflare-to-Hetzner path could read traffic. In practice nobody who
-isn't Cloudflare or Hetzner sits on that path.
-
-**TODO** (Chapter 20): Upgrade to SSL = Full (strict) with a Cloudflare
-Origin CA certificate. This encrypts CF ↔ origin and verifies that
-origin's cert is the CF-issued one (prevents MitM if DNS is compromised).
+**TLS — SSL = Full (strict)** (since 2026-04-24). A Cloudflare Origin CA
+certificate (`cloudflare-origin-cert` secret) is installed on all three
+ingresses; Cloudflare validates it. Both user↔CF and CF↔origin are
+encrypted, and a DNS-hijack MitM is defeated by the origin-cert check.
 
 ### API ↔ Neon Postgres
 
@@ -454,11 +456,14 @@ Mitigations:
 - Gitea itself is behind login; PAT is scoped to read:packages +
   write:packages only
 - Gitea runs on the operator's infrastructure (same operator account)
-- Image tags are SHA-pinned (`:237c6b8`) not `:latest` → attacker can't
-  replace an existing tag's image without us noticing the digest change
+- Workloads deploy by immutable `@sha256:` digest, not by mutable tag
+  (`03-deploy.sh` resolves the digest after push; the redis/vmagent/node
+  base images are digest-pinned too) — a swapped tag cannot reach the
+  cluster.
 
-**TODO** (Chapter 20): Add cosign signing at build time, verify at pull
-time.
+**TODO**: cosign signing is wired into `03-deploy.sh` (guarded — runs when
+`cosign` + `COSIGN_KEY` are present); cluster-side admission verification
+(Kyverno/Connaisseur) is still pending. See `deploy-k3s/SECURITY.md` → L5.
 
 ## Operator workstation security
 
@@ -1,5 +1,13 @@
 # 06 — Traefik Ingress
 
+> **Updated 2026-05-15 (security remediation):** the Traefik middleware set
+> changed — `cloudflare-only` + `admin-auth` are now attached to the admin
+> ingress, a strict `auth-rate-limit` middleware fronts the auth endpoints
+> (via a dedicated `honeydue-api-auth` Ingress), and `security-headers`
+> gained COOP/CORP + a 2-year preload HSTS and dropped the deprecated
+> `X-XSS-Protection`. `deploy-k3s/SECURITY.md` is the authoritative
+> current-state record.
+
 ## Summary
 
 Traefik is the reverse proxy that routes external HTTP requests to the
@@ -1,5 +1,11 @@
 # 07 — Services
 
+> **Updated 2026-05-15 (security remediation):** Redis now requires a
+> password (`config.yaml` `redis.password` → `honeydue-secrets`), all
+> workloads deploy by immutable `@sha256:` digest, and the redis/vmagent
+> base images are digest-pinned. `deploy-k3s/SECURITY.md` is the
+> authoritative current-state record.
+
 ## Summary
 
 Five workloads run in the `honeydue` namespace: **api** (Go REST API, 3
@@ -1,5 +1,11 @@
 # 10 — Secrets & Config
 
+> **Updated 2026-05-15 (security remediation):** `honeydue-secrets` now
+> carries `REDIS_PASSWORD`; an `admin-basic-auth` Secret backs the admin
+> ingress; rotation is documented in `docs/runbooks/secret-rotation.md`;
+> and the Go config can read file-mounted secrets (`HONEYDUE_SECRETS_DIR`).
+> `deploy-k3s/SECURITY.md` is the authoritative current-state record.
+
 ## Summary
 
 Non-sensitive config (hostnames, ports, feature flags, etc.) lives in
@@ -0,0 +1,146 @@
# Runbook — Secret Rotation

Closes audit finding `K3S-F12` (secrets unrotated since cluster bootstrap,
no rotation cadence). See `deploy-k3s/SECURITY.md` Stage 2.

**Cadence:** rotate every secret at least **annually**. Rotate
**immediately** on suspected exposure, on an operator-device loss, or when
anyone who has seen a secret leaves the project.

**Record keeping:** after each rotation, annotate the secret so the age is
visible:

```bash
kubectl -n honeydue annotate secret <name> \
  honeydue.dev/last-rotated="$(date -u +%Y-%m-%d)" --overwrite
```

---

## How rotation works

Every secret has a **source of truth** on the operator workstation. The
deploy scripts read those sources and (re)create the Kubernetes Secrets.
Rotation is always: **update the source → re-run `02-setup-secrets.sh` →
restart the pods that consume it → revoke the old credential at its
provider.**

`02-setup-secrets.sh` uses `kubectl apply` (via `--dry-run=client -o yaml`),
so re-running it is idempotent and only changes what you changed.
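
The render-then-apply idiom described above can be sketched roughly as follows. This is an illustration under assumptions, not an excerpt from `02-setup-secrets.sh`: the helper name `apply_secret_file` and its argument order are hypothetical, and the real script may build its Secrets differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not taken from 02-setup-secrets.sh): create or update
# a single-key Secret from a file on the operator workstation.
apply_secret_file() {
  local name="$1" key="$2" path="$3" ns="${4:-honeydue}"

  # `--dry-run=client -o yaml` only renders the manifest locally; nothing is
  # sent to the cluster until the output is piped into `kubectl apply`, which
  # creates the Secret if it is missing and patches it if it already exists.
  kubectl -n "$ns" create secret generic "$name" \
    --from-file="$key=$path" \
    --dry-run=client -o yaml |
    kubectl -n "$ns" apply -f -
}
```

Under those assumptions, `apply_secret_file honeydue-apns-key apns_auth_key.p8 deploy-k3s/secrets/apns_auth_key.p8` would re-render the single-key APNs Secret. A multi-key Secret such as `honeydue-secrets` would need every key supplied in the same `create` call, since applying a manifest that lists only one key can drop the others.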

| Kubernetes Secret | Source of truth | Consumed by |
|---|---|---|
| `honeydue-secrets` → `POSTGRES_PASSWORD` | `deploy-k3s/secrets/postgres_password.txt` | api, worker |
| `honeydue-secrets` → `SECRET_KEY` | `deploy-k3s/secrets/secret_key.txt` | api, worker |
| `honeydue-secrets` → `EMAIL_HOST_PASSWORD` | `deploy-k3s/secrets/email_host_password.txt` | api, worker |
| `honeydue-secrets` → `FCM_SERVER_KEY` | `deploy-k3s/secrets/fcm_server_key.txt` | api, worker |
| `honeydue-secrets` → `REDIS_PASSWORD` | `config.yaml` key `redis.password` | api, worker, redis |
| `honeydue-secrets` → `OBS_INGEST_TOKEN` | `deploy/prod.env` | api, worker |
| `honeydue-apns-key` → `apns_auth_key.p8` | `deploy-k3s/secrets/apns_auth_key.p8` | api, worker |
| `cloudflare-origin-cert` | `deploy-k3s/secrets/cloudflare-origin.{crt,key}` | Traefik ingress |
| `ghcr-credentials` | `config.yaml` block `registry.*` | image pulls (all pods) |
| `admin-basic-auth` | `config.yaml` keys `admin.basic_auth_user` / `..._password` | Traefik `admin-auth` middleware |

The `deploy-k3s/secrets/` directory and `config.yaml` are **gitignored** —
never commit them.

---

## Standard rotation procedure

```bash
cd honeyDueAPI-go
export KUBECONFIG="$(pwd)/deploy-k3s/kubeconfig"

# 1. Update the source (file under deploy-k3s/secrets/ or a config.yaml key)
# 2. Recreate the Kubernetes Secrets from sources
./deploy-k3s/scripts/02-setup-secrets.sh

# 3. Restart the consumers (see per-secret notes below for which)
kubectl -n honeydue rollout restart deploy/api deploy/worker

# 4. Confirm health
kubectl -n honeydue rollout status deploy/api
kubectl -n honeydue rollout status deploy/worker

# 5. Revoke the OLD credential at its provider (see per-secret notes)
# 6. Annotate the rotated secret with today's date
```

---

## Per-secret notes

### `POSTGRES_PASSWORD`
1. Rotate the role password in the Neon dashboard.
2. Write the new value to `deploy-k3s/secrets/postgres_password.txt`.
3. `02-setup-secrets.sh`, then `rollout restart deploy/api deploy/worker`.
4. Watch logs for connection errors; the old password stops working the
   moment Neon applies the change, so do steps 2–3 promptly.

### `SECRET_KEY` ⚠️ user-visible
This signs auth tokens. **Rotating it logs every user out** — all existing
tokens become invalid and every client must re-authenticate.
1. Generate: `openssl rand -hex 32`.
2. Write to `deploy-k3s/secrets/secret_key.txt` (must be ≥32 chars — the
   script enforces this; the app refuses to start in production without it).
3. `02-setup-secrets.sh`, then `rollout restart deploy/api deploy/worker`.
- Only rotate on a schedule or on suspected compromise — not casually.
- A future improvement (overlap window via a key-id header) would let old
  tokens validate during the transition; not implemented today.
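
The length rule from step 2 can be checked locally before the key reaches the cluster. The sketch below only assumes what such a pre-flight check could look like: `check_secret_key` is a hypothetical helper, and the actual enforcement inside `02-setup-secrets.sh` is not shown in this runbook.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check (not the script's own implementation):
# succeed only if the key file holds at least 32 characters.
check_secret_key() {
  local path="$1"
  local min_len=32
  local key

  # Drop CR/LF so a trailing newline does not count toward the length.
  key="$(tr -d '\r\n' < "$path")" || return 2

  if [ "${#key}" -lt "$min_len" ]; then
    echo "secret key too short: ${#key} < ${min_len} chars" >&2
    return 1
  fi
}
```

Since `openssl rand -hex 32` prints 64 hex characters, a freshly generated key passes this check with room to spare; a non-zero exit status signals a file that is too short or unreadable.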

### `EMAIL_HOST_PASSWORD`
1. Generate a new app password in Fastmail; keep the old one alive briefly.
2. Write to `deploy-k3s/secrets/email_host_password.txt`.
3. `02-setup-secrets.sh`, `rollout restart deploy/api deploy/worker`.
4. Delete the old Fastmail app password.

### `FCM_SERVER_KEY`
1. Rotate the key in the Firebase console.
2. Write to `deploy-k3s/secrets/fcm_server_key.txt`.
3. `02-setup-secrets.sh`, `rollout restart deploy/api deploy/worker`.

### `REDIS_PASSWORD`
Source is `config.yaml` key `redis.password` (hex only — it is embedded in
the `REDIS_URL`, so non-hex characters would break URL parsing).
1. Generate: `openssl rand -hex 32`.
2. Set `redis.password` in `config.yaml`.
3. `02-setup-secrets.sh`.
4. Restart **redis as well as** api/worker so the new `--requirepass` and
   the new `REDIS_URL` land together:
   `kubectl -n honeydue rollout restart deploy/redis deploy/api deploy/worker`.
   Expect a few seconds where api/worker reconnect.
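
To illustrate why the password is restricted to hex, the sketch below validates a candidate value and then embeds it in a URL. Everything beyond the hex rule is an assumption: the helper names `is_hex_secret` and `redis_url`, the host `redis`, port `6379`, database `0`, and the exact `redis://:password@host:port/db` layout are illustrative and may not match how the deploy scripts build `REDIS_URL`.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; host, port, and database index are assumed defaults.

# Succeed only for a non-empty string made up of hex digits.
is_hex_secret() {
  case "$1" in
    '' | *[!0-9a-fA-F]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Print a Redis URL with the password embedded, refusing non-hex passwords.
# Characters such as '@', '/', ':' or '#' would otherwise be read as URL
# delimiters and split the password in the wrong place.
redis_url() {
  local password="$1"
  local host="${2:-redis}"
  local port="${3:-6379}"

  if ! is_hex_secret "$password"; then
    echo "redis.password must contain hex characters only" >&2
    return 1
  fi
  printf 'redis://:%s@%s:%s/0\n' "$password" "$host" "$port"
}
```

With these assumed defaults, `redis_url deadbeef` prints `redis://:deadbeef@redis:6379/0`, while a value such as `p@ss/word` is rejected before it can produce a malformed URL.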

### `apns_auth_key.p8`
1. Revoke the key in the Apple Developer console, generate a new `.p8`.
2. Replace `deploy-k3s/secrets/apns_auth_key.p8`.
3. `02-setup-secrets.sh`, `rollout restart deploy/api deploy/worker`.
4. If the Key ID changed, update `push.apns_key_id` in `config.yaml` too.

### `cloudflare-origin-cert`
1. Generate a new Origin CA certificate in the Cloudflare dashboard.
2. Replace `deploy-k3s/secrets/cloudflare-origin.crt` and `.key`.
3. `02-setup-secrets.sh`. Traefik picks up the new TLS secret; no app
   restart needed. Verify the served cert with `openssl s_client`.

### `ghcr-credentials` (Gitea registry)
1. Generate a new PAT in Gitea (scope: `read:packages`).
2. Update the `registry.token` value in `config.yaml`.
3. `02-setup-secrets.sh`. No restart needed unless a pull is pending.
4. Revoke the old PAT in Gitea.

### `admin-basic-auth`
Source is `config.yaml` keys `admin.basic_auth_user` / `basic_auth_password`.
1. Set a new password (e.g. `openssl rand -hex 24`).
2. `02-setup-secrets.sh` regenerates the bcrypt htpasswd secret.
3. No app restart needed — Traefik reloads the `admin-auth` middleware.
4. Distribute the new credential to whoever uses the admin panel.

---

## After any rotation

- Run `./deploy-k3s/scripts/04-verify.sh` and confirm no `✗` lines.
- Annotate the rotated secret (see "Record keeping" above).
- If the rotation was due to a compromise, also follow the relevant
  playbook in `deploy-k3s/SECURITY.md` → Appendix (Incident response).