# honeyDue — Production Security Remediation Plan

This document is the **single source of truth for fixing every security
finding from the 2026-05-12/13 audits, and for keeping those fixes baked
into the stack so a full redeploy never reproduces them.**

It replaces the previous aspirational `SECURITY.md` (which described a
desired state that, per the audits, was never fully true). The accurate
*current* architecture lives in `docs/deployment/05-security.md`; this file
is the **work list**.

**Last updated:** 2026-05-16
**Audit sources (kept at repo root):**

| Tag | File | Scope | Findings |
|---|---|---|---|
| `LIVE` | `live_scan_5_12.md` | External black-box scan of api/admin/app | L1–L20 (20) |
| `K3S` | `k3_audit_5_12.md` | k3s cluster + `honeydue` namespace audit | F1–F17 (17) + 8 coverage gaps |
| `CODE` | `security_scan_5_12.md` | Static audit of `honeyDueAPI-go` | C1–C13, H1–H9, M1–M13, L1–L6 (41) |

**Total: 78 findings + 8 cluster coverage gaps + 13 runtime verification items.**

---

## How to use this document

The plan is organised by **redeploy stage**, not by severity, because the
operator's goal is: *redeploy the entire stack and come up clean.* Each
finding is tagged with where its fix lives:

| Marker | Meaning |
|---|---|
| **In-repo: Y** | Fix lives in a committed file (`config.yaml`, a manifest, a script, Go code, a Dockerfile). Once committed, **every redeploy re-applies it automatically.** |
| **In-repo: N** | Fix is external state (DNS records, Cloudflare dashboard, Hetzner firewall, hstspreload.org). A redeploy does **not** touch it — it survives on its own but must be done once and tracked here. |

**Status legend:** ☐ open · ◐ in progress · ☑ done · ⊘ accepted risk / deferred

**Redeploy stage order** (matches `deploy-k3s/scripts/` run order):

```
Stage 0  DNS & Cloudflare edge          (external; no cluster needed)
Stage 1  Cluster provisioning & node OS (01-provision-cluster.sh / hetzner-k3s / SSH)
Stage 2  Secrets & config bootstrap     (02-setup-secrets.sh / config.yaml)
Stage 3  Kubernetes manifests           (deploy-k3s/manifests/, applied by 03-deploy.sh)
Stage 4  Application code & images      (honeyDueAPI-go source → rebuilt image)
Stage 5  CI / build pipeline            (image digest pinning, signing, scanning)
Stage 6  Post-deploy verification       (04-verify.sh + runtime investigations)
```

**Golden rule for "redeploy clean":** a fix only counts as done when it is
committed to the file that the redeploy reads. A `kubectl patch` on the live
cluster that is not mirrored into `deploy-k3s/manifests/` **will be wiped on
the next `03-deploy.sh`.** Every entry below names the committed file.

---

## Execution status (2026-05-16)

Stages 2–5 were executed in-repo, then put through an independent code
review (see *Post-remediation independent review* below). The Go module
**builds clean and the full `go test ./...` suite passes.** Four new goose
migrations were added — `000003` (auth-token hashing), `000004` (IAP replay
protection), `000005` (audit-log append-only + `audit_log` table create),
`000006` (`webhook_event_log` table create) — and run automatically via the
migrate Job before the api/worker rollout.

- **~63 findings fixed (☑) and verified** — all of Stage 2 (secrets/config)
  and Stage 3 (Kubernetes manifests), every exploitable Stage 4 application
  finding (all 11 actioned Criticals + the auth / webhook / race / handler
  High & Medium fixes), Stage-5 image digest pinning **and `K3S-F8`
  (secrets are now file-mounted, not env vars)**, plus the in-repo half of
  Stage 1 cluster provisioning — `K3S-F4` (kubeconfig written `0600`),
  `K3S-CG1` (etcd `secrets-encryption`), `K3S-CG2` (fail2ban +
  unattended-upgrades installed at provision). Includes token hashing,
  Google JWKS verification, IAP replay protection, the authorization
  fixes, atomic share-code join, the metrics-endpoint lockdown,
  per-account login lockout, verified-email gating, CSP/HSTS hardening,
  and digest-pinned images.
- **1 partial (◐)** — `CODE-L5`: cosign signing + a Trivy `HIGH,CRITICAL`
  scan are wired (guarded) into `03-deploy.sh`, and a ready-to-use Kyverno
  `ClusterPolicy` ships at `deploy-k3s/manifests/kyverno-verify-images.yaml`.
  Closing it needs two operator actions that cannot be committed: install
  Kyverno in the cluster, and supply a cosign key pair (`COSIGN_KEY` for
  signing + the public key pasted into the policy).
- **Accepted / blocked / moot (⊘)** — `M3` (Apple nonce — blocked on an
  iOS-client change), `C12` (moot — accounts are hard-deleted),
  `LIVE-L14`/`L15` (UUID migration — planned quarter), `LIVE-L17`/`L18`/
  `L20` (no security impact — see entries), `F15`/`F16` (architectural),
  and `LIVE-L2`/`L3`/`L4` (DMARC / SPF / CAA — operator-declined, below).
- **Operator-declined — Stage 0 DNS (`LIVE-L2`/`L3`/`L4`).** The operator
  has opted not to add the DMARC, SPF-hardening, and CAA DNS records this
  cycle. For the record: these are **not** a paid-Cloudflare feature —
  DMARC and SPF are ordinary TXT records and CAA is an ordinary CAA
  record, all addable on any Cloudflare plan including Free. They remain
  genuine email-spoofing / certificate-issuance gaps and are marked ⊘;
  revisit when DNS is next touched.
- **Remaining operator runtime steps (no code to commit)** — on the
  *existing* cluster: `k3s secrets-encrypt` enable/reencrypt (`K3S-CG1` /
  `V12`) and `chmod 600` the live kubeconfig (`K3S-F4`); the SSH/sysctl
  half of `K3S-CG2`; and the `K3S-CG3`–`CG8` verification items. A full
  *fresh* provision already comes up with `K3S-F4`/`CG1`/`CG2` (fail2ban +
  unattended-upgrades) applied straight from `_config.sh`.

**Operator note:** `C1` (token hashing) invalidates every existing login
session once at deploy and makes login single-session per user — see the
`CODE-C1` entry. The status boxes in the master index below are authoritative.

## Post-remediation independent review (2026-05-16)

The change set went through **two** independent review passes; the deploy-time
verification below (build, `go test -race`, full `goose up` against real
PostgreSQL 16) was executed and passed.

**First pass.** A separate review agent audited the full change set against the
three audit files. It surfaced three **deploy-breaking** defects that a green
`go test` could not catch — the test harness builds two tables via GORM
`AutoMigrate`, which production never runs — all since fixed:

- **`audit_log` table was never created by a migration.** `000005` added
  append-only triggers to a table that exists only in the test DB, so a
  from-scratch `goose up` would fail on `000005`. `000005` now does
  `CREATE TABLE IF NOT EXISTS audit_log` before the triggers.
- **`webhook_event_log` table was never created by a migration.** The H6
  fail-closed webhook dedup turns a missing table into a 500 on every
  subscription webhook. New migration `000006` creates it.
- **`000004`'s `google_purchase_token` unique index could fail to build** on
  a production table already holding duplicate tokens — exactly the C6
  replay the migration fixes. `000004` now de-duplicates (keep-earliest,
  NULL-the-rest) before creating the index.

It also tightened the C13 Apple-webhook lookup (`subscription_webhook_handler.go`)
so the legacy substring scan runs only on a genuine `ErrRecordNotFound`,
never masking a real DB error as "not found".

**Second pass (master review).** A second, independent security-audit agent
re-verified all four first-pass fixes (correct), ran `go test -race` (0 data
races) and the full `goose up`/`down` chain against real PostgreSQL (clean,
idempotent), and returned **GO** with one HIGH finding, since fixed:

- **HIGH-1 — Redis password leaked via the `honeydue-config` ConfigMap.**
  `_config.sh` built `REDIS_URL` with the password embedded inline, and that
  URL is emitted into the `honeydue-config` ConfigMap (delivered to pods via
  `envFrom`). ConfigMaps are *not* covered by `secrets-encryption` and are
  readable by any principal with `get configmap` — so `K3S-F1`/`K3S-F8` were
  not actually fully closed. **Fixed (2026-05-16):** `_config.sh` now emits
  `REDIS_URL=redis://redis:6379/0` with no credentials; the password travels
  only as the file-mounted `REDIS_PASSWORD` secret. The API applies it in
  `cache_service.go`; `cmd/worker/main.go` now applies it onto the parsed
  Asynq `RedisClientOpt` so the server/inspector/monitoring client all
  authenticate against the `requirepass` Redis.

The master review's other seven findings (4 Medium, 3 Low — none
deploy-blocking) were then **all fixed (2026-05-16)**:

- **MEDIUM-1 — re-login left the prior token usable for ≤5 min.**
  `CreateFreshToken` deleted the old token row but not its Redis cache entry.
  It now also returns the deleted tokens' hashes; `AuthService.freshToken`
  evicts them via the new `CacheService.InvalidateAuthTokenHashes` on every
  login / Apple / Google sign-in, so a prior (e.g. stolen) token stops
  authenticating immediately.
- **MEDIUM-2 — IAP `.p8` mode check incompatible with k8s.** The Apple IAP
  key check (`iap_validation.go`) required `0600`-or-stricter, unattainable
  on a k8s Secret volume (`0440` under `fsGroup`). It now rejects only
  world-accessible keys (`perm & 0o007`).
- **MEDIUM-3 — single-IP account-lockout DoS.** The `M5` per-account lockout
  is now keyed on the *set of distinct source IPs* that have failed
  (`RegisterLoginFailure` takes the IP, tracks a Redis set; lock at 5
  distinct IPs). One attacker IP can no longer lock a victim out by spamming
  failures; genuinely distributed stuffing still trips it. `Login` now takes
  the client IP (`c.RealIP()`).
- **MEDIUM-4 — Redis no-auth deployable.** `02-setup-secrets.sh` now `die`s
  (was `warn`) when `redis.password` is empty, so a deploy can no longer
  bring up an unauthenticated Redis (`K3S-F1`).
- **LOW-1 / LOW-2 — missing regression tests.** Added: `config_test.go`
  asserts `validate()` refuses `DEBUG_FIXED_CODES` with `DEBUG=false` (`C4`);
  `subscription_repo_test.go` asserts a second account cannot bind an Apple
  transaction / Google purchase token already bound to another (`C5`/`C6`).
- **LOW-3 — device-token 409.** A recycled APNs/FCM token re-registering
  under a new account is now reassigned to that account (and logged) instead
  of returning a 409 that locked the legitimate new device owner out of push.
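
The `MEDIUM-3` rule (lock on distinct failing IPs, not on failure count) can be sketched as follows. This is a minimal illustration, not the project's code: the in-memory map stands in for the Redis set, `failureTracker` and its method are hypothetical names, and expiry of the set is left out.

```go
package main

import (
	"fmt"
	"sync"
)

// lockoutThreshold mirrors the "lock at 5 distinct IPs" rule described above.
const lockoutThreshold = 5

// failureTracker is a hypothetical in-memory stand-in for the Redis set that
// the real RegisterLoginFailure is described as using.
type failureTracker struct {
	mu  sync.Mutex
	ips map[string]map[string]struct{} // account -> set of failing source IPs
}

func newFailureTracker() *failureTracker {
	return &failureTracker{ips: make(map[string]map[string]struct{})}
}

// registerLoginFailure records one failed login and reports whether the
// account is now locked. Repeated failures from one IP never grow the set.
func (t *failureTracker) registerLoginFailure(account, ip string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	set, ok := t.ips[account]
	if !ok {
		set = make(map[string]struct{})
		t.ips[account] = set
	}
	set[ip] = struct{}{}
	return len(set) >= lockoutThreshold
}

func main() {
	t := newFailureTracker()
	locked := false
	for i := 0; i < 100; i++ {
		locked = t.registerLoginFailure("victim@example.com", "203.0.113.7")
	}
	fmt.Println("locked by a single spamming IP:", locked)

	for i := 1; i <= 4; i++ {
		locked = t.registerLoginFailure("victim@example.com", fmt.Sprintf("198.51.100.%d", i))
	}
	fmt.Println("locked after failures from five distinct IPs:", locked)
}
```

Because the key is set cardinality, one noisy address contributes a single member however often it fails, which is what removes the single-IP denial-of-service.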

One earlier (first-pass) hardening item remains a **tracked follow-up**, not
re-raised by the master review and not deploy-blocking: `/metrics` is gated
by an `X-Forwarded-For` check rather than network-isolated. True isolation
needs `/metrics` on a separate port plus a NetworkPolicy restricting the
scrape to vmagent — an architectural change deferred to a later cycle.
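
For reference, the kind of header gate described above might look like the sketch below. It uses plain `net/http` rather than the project's router, the names `metricsGate` and `probe` are hypothetical, and the 404 response is an assumption (the source only says proxied requests are rejected).

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// metricsGate rejects any request carrying X-Forwarded-For. The assumption is
// that the ingress proxy always adds the header, while the in-cluster scrape
// connects directly and therefore has none.
func metricsGate(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if len(r.Header.Values("X-Forwarded-For")) > 0 {
			http.NotFound(w, r) // status choice is an assumption
			return
		}
		next.ServeHTTP(w, r)
	})
}

// probe sends one request through the gate and returns the status code.
// An empty forwardedFor simulates a direct in-cluster request.
func probe(forwardedFor string) int {
	h := metricsGate(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprintln(w, "# metrics placeholder")
	}))
	req := httptest.NewRequest(http.MethodGet, "/metrics", nil)
	if forwardedFor != "" {
		req.Header.Set("X-Forwarded-For", forwardedFor)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("direct scrape:", probe(""))
	fmt.Println("proxied request:", probe("198.51.100.4"))
}
```

The gate trusts the proxy to set the header on every public request, which is why the text above treats a separate port plus a NetworkPolicy as the stronger fix.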

## Consolidated work items (fix once, closes many)

Several findings are the same defect seen from three angles. Do the work
once at the listed anchor; the rest close with it.

| Theme | Anchor | Also closes |
|---|---|---|
| Auth-endpoint rate limiting | Stage 3 `auth-rate-limit` middleware + Stage 4 app limiter | `K3S-F10`, `LIVE-L12`, `CODE-H1`, `CODE-H2`, `CODE-H3`, `CODE-M5` |
| CSP / cross-origin headers | Stage 3 `security-headers` + Stage 4 app CSP | `K3S-F9`, `LIVE-L8` |
| HSTS `preload` | Stage 3 middleware + Stage 0 list submission | `LIVE-L5`, `CODE-L3` |
| Admin ingress hardening | Stage 2 secret + Stage 3 middleware wiring | `K3S-F2`, `K3S-F3`, `CODE-L6` |
| etcd encryption at rest | Stage 1 `--secrets-encryption` | `K3S-CG1`, `CODE-M9` |
| Image digest pinning + signing | Stage 5 CI | `K3S-F5`, `K3S-F14`, `CODE-L4`, `CODE-L5` |
| Pagination hard caps | Stage 4 app | `LIVE-L16`, `CODE-M6` |
| imagePullSecret name consistency | Stage 3 manifests + Stage 2 script | `K3S-F6` |

**Known contradiction (resolved 2026-05-15):** `LIVE-L18` says
*no account-deletion endpoint exists* (every `DELETE` path 404/400), but
`CODE-M13` points at a delete handler at `auth_handler.go:488-539`. The
route check in `internal/router/router.go` settled it: the endpoint exists
at `DELETE /api/auth/account/` (`router.go:546`), a path the external scan
never probed. `LIVE-L18` is therefore closed as ⊘, and `CODE-M13` is fixed
by rate-limiting the existing endpoint rather than adding a new one.
Tracked as verification item `V11`.

---

## Master finding index

Every finding, ordered by redeploy stage. Use this as the live tracker —
flip the Status box as work lands.

### Stage 0 — DNS & Cloudflare edge

| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `LIVE-L2` | HIGH | No DMARC record — email spoofing open | N | ⊘ |
| `LIVE-L3` | MED | SPF ends `?all` (neutral — fails open) | N | ⊘ |
| `LIVE-L4` | MED | No CAA records — any CA may issue certs | N | ⊘ |
| `LIVE-L6` | LOW | No `/.well-known/security.txt` | Y | ☐ |
| `LIVE-L9` | INFO | Aggressive Cloudflare caching on admin SSR shell | N | ☐ |
| `LIVE-L10` | INFO | `x-powered-by: Next.js` framework leak | Y | ☐ |

### Stage 1 — Cluster provisioning & node OS

| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `K3S-F4` | HIGH | Node kubeconfig world-readable (mode 644) | Y | ☑ |
| `K3S-F15` | INFO | Nodes on public IPs, no private VPC | Y | ⊘ |
| `K3S-F16` | INFO | All 3 nodes are control-plane + etcd + worker | Y | ⊘ |
| `K3S-F17` | INFO | Single-replica SPOFs (redis/worker/admin/vmagent) | Y | ☐ |
| `K3S-CG1` | — | etcd encryption at rest not verified (`--secrets-encryption`) | Y | ☑ |
| `K3S-CG2` | — | Node OS hardening: SSH, fail2ban, unattended-upgrades, sysctl | Y/N | ◐ |
| `K3S-CG3` | — | Hetzner Cloud Firewall rules not verified | N | ☐ |
| `K3S-CG4` | — | etcd snapshot backup destination/encryption not verified | Y | ☐ |
| `K3S-CG5` | — | kubelet flags (`--anonymous-auth=false`, webhook authz) not verified | Y | ☐ |
| `K3S-CG6` | — | Container-runtime CIS controls (`kube-bench`) not run | N | ☐ |
| `K3S-CG7` | — | `deploy` user sudoers least-privilege not verified | N | ☐ |
| `K3S-CG8` | — | `/etc/rancher/k3s/` dir + server-token perms not verified | N | ☐ |

### Stage 2 — Secrets & config bootstrap

| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `K3S-F1` | **CRIT** | Redis runs with no authentication | Y | ☑ |
| `K3S-F3` | HIGH | `admin-basic-auth` secret never created | Y | ☑ |
| `K3S-F12` | MED | Secrets unrotated since cluster bootstrap; no runbook | Y | ☑ |
| `CODE-C4` | **CRIT** | `DEBUG_FIXED_CODES` "123456" auth bypass if it reaches prod | Y | ☑ |
| `CODE-M8` | MED | `SECRET_KEY` hardcoded debug fallback | Y | ☑ |

> **Stage 2 status (2026-05-15):** `config.yaml` now carries a Redis
> password and admin basic-auth user/password; `02-setup-secrets.sh` uses
> bcrypt (`htpasswd -nbB`); `internal/config/config.go` generates an
> ephemeral random `SECRET_KEY` in debug instead of a static fallback and
> refuses to boot if `DEBUG_FIXED_CODES` is set with `DEBUG=false`; the
> rotation runbook is at `docs/runbooks/secret-rotation.md`. All take
> effect on the next `02-setup-secrets.sh` + `03-deploy.sh`.
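
The two `config.go` behaviours in the note above can be illustrated with a small sketch. The struct and field names are hypothetical, and the rule that `SECRET_KEY` is mandatory outside debug is an assumption added for completeness, not something the source states.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
)

// config holds only the fields relevant to this sketch (names hypothetical).
type config struct {
	Debug           bool
	DebugFixedCodes bool
	SecretKey       string
}

// validate refuses the unsafe combination and fills in an ephemeral key when
// running in debug without one.
func (c *config) validate() error {
	if c.DebugFixedCodes && !c.Debug {
		return errors.New("DEBUG_FIXED_CODES must not be set when DEBUG=false")
	}
	if c.SecretKey == "" {
		if !c.Debug {
			// Assumption: production requires an explicit key.
			return errors.New("SECRET_KEY is required when DEBUG=false")
		}
		buf := make([]byte, 32)
		if _, err := rand.Read(buf); err != nil {
			return fmt.Errorf("generate ephemeral SECRET_KEY: %w", err)
		}
		c.SecretKey = hex.EncodeToString(buf) // differs on every start
	}
	return nil
}

func main() {
	prod := &config{DebugFixedCodes: true}
	fmt.Println("prod with fixed codes:", prod.validate())

	dev := &config{Debug: true}
	fmt.Println("debug without key:", dev.validate(), "key length:", len(dev.SecretKey))
}
```

Failing at startup, rather than logging a warning, is what keeps a misconfigured redeploy from ever serving traffic with the bypass active.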

### Stage 3 — Kubernetes manifests

| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `K3S-F2` | HIGH | Admin ingress missing `cloudflare-only` + `admin-auth` | Y | ☑ |
| `K3S-F6` | HIGH | `imagePullSecrets` name mismatch (`ghcr-credentials`) | Y | ☑ |
| `K3S-F7` | MED | `vmagent` container missing `securityContext` | Y | ☑ |
| `K3S-F9` | MED | `security-headers` missing COOP/COEP/CORP | Y | ☑ |
| `K3S-F10` | MED | Uniform rate limit — no auth-endpoint tightening | Y | ☑ |
| `K3S-F11` | MED | `automountServiceAccountToken` not disabled | Y | ☑ |
| `K3S-F13` | LOW | `CORS_ALLOWED_ORIGINS` missing `app.myhoneydue.com` | Y | ☑ |
| `K3S-F14` | LOW | Public images (`redis`, `vmagent`) pinned by tag | Y | ☑ |
| `LIVE-L5` | LOW | HSTS not preload-eligible | Y | ☑ |
| `LIVE-L7` | LOW | Deprecated `X-XSS-Protection` header | Y | ☑ |
| `LIVE-L8` | LOW | CSP missing `object-src`/`base-uri`; COOP/COEP/CORP absent | Y | ☑ |
| `CODE-L3` | LOW | HSTS missing `preload` (duplicate of `LIVE-L5`) | Y | ☑ |
| `CODE-L4` | LOW | `imagePullPolicy` not set on Deployments | Y | ☑ |
| `CODE-L6` | LOW | Admin `admin-auth` middleware defined, not attached | Y | ☑ |

> **Stage 3 status (2026-05-15):** admin ingress now chains
> `cloudflare-only` + `admin-auth` + `security-headers` + `rate-limit`; a
> dedicated `honeydue-api-auth` Ingress applies a new `auth-rate-limit`
> middleware (5/min, burst 10) to login / register / forgot-password /
> reset-password / join-with-code; `security-headers` gained COOP + CORP,
> HSTS is now `max-age=63072000; …; preload`, and the deprecated
> `X-XSS-Protection` (`browserXssFilter`) is removed; `vmagent` has a
> container `securityContext`; all workload pods + the migrate Job set
> `automountServiceAccountToken: false` explicitly (on top of the
> rbac.yaml ServiceAccount-level setting that already existed); the
> registry secret is `gitea-credentials` everywhere; `imagePullPolicy:
> IfNotPresent` is explicit on every container; CORS includes
> `app.myhoneydue.com`. **Open at this date, since closed per the master
> index:** `K3S-F14` (public-image digest pins) was folded into Stage 5
> with `K3S-F5`; `LIVE-L8` was partial, with the COOP/CORP half shipped
> here and the CSP `object-src`/`base-uri` half tracked as an app change
> in Stage 4.

### Stage 4 — Application code & container images

| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `CODE-C1` | **CRIT** | Auth tokens stored plaintext in DB | Y | ☑ |
| `CODE-C2` | **CRIT** | Google ID token not verified locally | Y | ☑ |
| `CODE-C3` | **CRIT** | Google `iss` claim never validated | Y | ☑ |
| `CODE-C5` | **CRIT** | Apple IAP receipt replay across accounts | Y | ☑ |
| `CODE-C6` | **CRIT** | Google purchase-token replay across accounts | Y | ☑ |
| `CODE-C7` | **CRIT** | File-ownership check excludes residence owners | Y | ☑ |
| `CODE-C8` | **CRIT** | Device-token cross-account hijack on re-register | Y | ☑ |
| `CODE-C9` | **CRIT** | Share-code join not atomic (Add+Deactivate race) | Y | ☑ |
| `CODE-C10` | **CRIT** | Subscription upgrade race — validation outside txn | Y | ☑ |
| `CODE-C11` | **CRIT** | Task-completion duplicate-row race | Y | ☑ |
| `CODE-C12` | **CRIT** | Soft-deleted email reusable; `is_active` not filtered | Y | ⊘ |
| `CODE-C13` | **CRIT** | Apple webhook user lookup may LIKE-match | Y | ☑ |
| `CODE-H1` | HIGH | Rate limit doesn't cover all auth surfaces | Y | ☑ |
| `CODE-H2` | HIGH | No rate limit on `join-with-code` | Y | ☑ |
| `CODE-H3` | HIGH | No rate limit on `register` | Y | ☑ |
| `CODE-H4` | HIGH | Modulo bias in 6-digit code generation | Y | ☑ |
| `CODE-H5` | HIGH | Apple IAP `.p8` loaded with no file-mode check | Y | ☑ |
| `CODE-H6` | HIGH | Webhook dedup fail-open | Y | ☑ |
| `CODE-H7` | HIGH | Auth-failure log lacks IP/User-Agent | Y | ☑ |
| `CODE-H8` | HIGH | `X-Timezone` header trusted for trial-start calc | Y | ☑ |
| `CODE-H9` | HIGH | Share-code `Deactivate` error swallowed | Y | ☑ |
| `CODE-M1` | MED | HTTP header injection via `Content-Disposition` filename | Y | ☑ |
| `CODE-M2` | MED | bcrypt cost = 10 (recommend 12) | Y | ☑ |
| `CODE-M3` | MED | Apple Sign In nonce not validated | Y | ⊘ |
| `CODE-M4` | MED | Email verification not atomic | Y | ☑ |
| `CODE-M5` | MED | Per-user rate limiting absent | Y | ☑ |
| `CODE-M6` | MED | List endpoints uncapped (Documents/Contractors/Residences) | Y | ☑ |
| `CODE-M7` | MED | Audit log not append-only | Y | ☑ |
| `CODE-M10` | MED | `node:20-alpine` floating tag in Dockerfile | Y | ☑ |
| `CODE-M11` | MED | `golang.org/x/crypto v0.49.0` outdated | Y | ☑ |
| `CODE-M12` | MED | Contractor toggle refetch race | Y | ☑ |
| `CODE-M13` | MED | Account-deletion endpoint unrate-limited | Y | ☑ |
| `CODE-L1` | LOW | Login inactive-account error enables enumeration | Y | ☑ |
| `CODE-L2` | LOW | Auth responses lack `Cache-Control: no-store` | Y | ☑ |
| `LIVE-L1` | HIGH | `/metrics` publicly exposed on `api.myhoneydue.com` | Y | ☑ |
| `LIVE-L11` | HIGH | Login user-enumeration via timing | Y | ☑ |
| `LIVE-L12` | HIGH | No rate-limit on `/api/auth/login/` | Y | ☑ |
| `LIVE-L13` | HIGH | Password-reset user-enumeration via timing | Y | ☑ |
| `LIVE-L14` | MED | Sequential integer user IDs leak userbase size | Y | ⊘ |
| `LIVE-L15` | MED | Sequential integer resource IDs (same risk) | Y | ⊘ |
| `LIVE-L16` | MED | Pagination `limit` accepted at any size | Y | ☑ |
| `LIVE-L17` | LOW | Garbage pagination params silently accepted | Y | ⊘ |
| `LIVE-L18` | LOW | No account-deletion endpoint (GDPR gap) | Y | ⊘ |
| `LIVE-L19` | LOW | Email verification not enforced | Y | ☑ |
| `LIVE-L20` | INFO | Profile-update silently drops unknown fields | Y | ⊘ |

> **Stage 4 handler/misc batch status (2026-05-15):** `M1` —
> `Content-Disposition` filenames are sanitized (control chars / quote /
> backslash stripped) so an upload filename cannot inject response
> headers. `M7` — migration `000005` creates the `audit_log` table (no
> prior migration did — `CREATE TABLE IF NOT EXISTS`) and makes it
> append-only via BEFORE UPDATE/DELETE triggers. `M11` —
> `golang.org/x/crypto` bumped `v0.49.0 → v0.51.0`. `M13` —
> `DELETE /api/auth/account` now carries the
> Traefik `auth-rate-limit` edge limiter. `LIVE-L18` ⊘ — not a real gap:
> the endpoint **exists** at `DELETE /api/auth/account/`
> (`router.go:546`); the live scan probed `/api/auth/me/`, `/auth/delete/`,
> `/users/me/` and missed it. **Update (2026-05-15):** items shown as
> deferred in an earlier draft were then completed — `LIVE-L1` (`/metrics`
> rejects proxied/public requests via an `X-Forwarded-For` check, so only
> the in-cluster vmagent scrape reaches it), `M6`/`LIVE-L16` (the
> document/contractor list repos already hard-cap at 500 rows), and
> `LIVE-L19` (verified-email gating on share-code generation via the new
> `RequireVerified` middleware). `LIVE-L17` (inert pagination params,
> results capped) and `LIVE-L20` (whitelist profile update is the correct
> pattern) are closed as no-security-impact (⊘). The master index above is
> authoritative.
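
A sketch of the `M1` filename sanitization described above, assuming a standalone helper (`sanitizeFilename` is a hypothetical name, and the `download` fallback for an empty result is an illustrative choice rather than the project's behaviour):

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// sanitizeFilename drops control characters (including CR and LF), double
// quotes, and backslashes so the value cannot end the quoted string or start
// a new header line.
func sanitizeFilename(name string) string {
	cleaned := strings.Map(func(r rune) rune {
		if unicode.IsControl(r) || r == '"' || r == '\\' {
			return -1 // a negative rune tells strings.Map to drop the character
		}
		return r
	}, name)
	if cleaned == "" {
		return "download" // hypothetical fallback name
	}
	return cleaned
}

// contentDisposition builds the header value from an untrusted filename.
func contentDisposition(name string) string {
	return fmt.Sprintf(`attachment; filename="%s"`, sanitizeFilename(name))
}

func main() {
	fmt.Println(contentDisposition("report.pdf"))
	fmt.Println(contentDisposition("evil\r\nSet-Cookie: x=1\".pdf"))
}
```

With CR and LF removed, the injected `Set-Cookie` text stays inside the quoted filename instead of becoming a header of its own.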

> **Stage 4 races batch status (2026-05-15):** `C9`/`H9` — share-code
> redemption is now one locked transaction in `ResidenceRepository.
> JoinWithShareCode` (lock the code row, re-check validity, add member,
> deactivate — a deactivation failure aborts the join). `C11` — the
> task-completion duplicate-row race was *already* closed: the completion
> insert and the optimistically-version-locked task update share one
> transaction, so a concurrent completion fails `ErrVersionConflict` and
> rolls back its inserted row; no `UNIQUE(task_id, completed_date)` was
> added (it would reject legitimate same-day re-completions and risk a
> migration failure on existing data). `M4` — email verification's
> find/consume/flag writes are now one transaction. `M12` — a concurrent
> contractor delete now yields a clean 404. `C12` ⊘ — premise moot: the
> app **hard-deletes** accounts (`DeleteUserCascade`), so there is no
> soft-deleted user whose email lingers, and `ExistsByEmail` already
> blocks re-registering a *deactivated* user's email.
>
> **Stage 4 auth batch status (2026-05-15):** C1, C2, C3 done (see entries
> below). Rate limiting — every sensitive auth path now carries the shared
> Traefik `auth-rate-limit` edge limiter (login/register/forgot/reset/
> verify-reset/apple/google/refresh/join-with-code); login/register/forgot/
> reset/apple/google additionally keep the per-IP app limiter
> (`H1`/`H2`/`H3`/`LIVE-L12`). `H4` rejection-sampled codes, `M2` bcrypt
> cost 12, `L1`+`LIVE-L11` constant-time generic-error login, `L2`
> `no-store` on auth responses, `H7` IP/UA in auth logs, `LIVE-L13`
> fully-async forgot-password — all done; `go build ./...` and the
> `models`/`repositories`/`middleware`/`handlers`/`services` test packages
> pass. **Deferred:** `M3` (Apple nonce) — needs the iOS client to
> generate and send a nonce; server-only validation would reject every
> Apple login, so this is blocked on a coordinated mobile change. `H8` —
> the `parseTimezone` ±14h cap shipped; the "use server UTC for
> trial-start" half is folded into Stage 4's subscription work. `M5`
> per-account lockout (Redis) was deferred at this point, because the edge
> and per-IP app limiters plus the existing per-account password-reset
> counter covered the practical risk. It has since shipped (see `MEDIUM-3`
> in the post-remediation review above); the master index is authoritative.
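
The `H4` fix (rejection-sampled codes) removes the modulo bias that arises when a random integer range is not a multiple of 1,000,000. One way to do it is sketched below; `sixDigitCode` is a hypothetical name and the 3-byte draw is an illustrative choice, not necessarily what the repository does.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// sixDigitCode returns a uniformly distributed code in 000000..999999.
// Three random bytes give 2^24 = 16,777,216 values. Only draws below
// 16,000,000 (an exact multiple of 1,000,000) are accepted, so every code
// has the same number of preimages. Anything higher is discarded and redrawn.
func sixDigitCode() (string, error) {
	const limit = 16_000_000
	var b [3]byte
	for {
		if _, err := rand.Read(b[:]); err != nil {
			return "", err
		}
		n := uint32(b[0])<<16 | uint32(b[1])<<8 | uint32(b[2])
		if n < limit {
			return fmt.Sprintf("%06d", n%1_000_000), nil
		}
	}
}

func main() {
	code, err := sixDigitCode()
	if err != nil {
		panic(err)
	}
	fmt.Println("one-time code:", code)
}
```

About 4.6% of draws are rejected, so the loop almost always finishes on the first or second pass. The standard library's `crypto/rand.Int` gives the same uniformity guarantee with a `big.Int` interface.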

### Stage 5 — CI / build pipeline

| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `K3S-F5` | HIGH | Images pinned by mutable short SHA tag, not digest | Y | ☑ |
| `K3S-F8` | MED | Secrets injected as env vars, not file mounts | Y | ☑ |
| `CODE-L5` | LOW | No image signing (cosign) in CI | Y | ◐ |

> **Stage 5 status (2026-05-15):** `CODE-M11` done — `golang.org/x/crypto`
> bumped `v0.49.0 → v0.51.0` (with the `x/sys`/`x/term`/`x/text` bumps
> `go get -u` pulled in), `go mod tidy` clean, full build + test green.
> **Update (2026-05-15):** `K3S-F5`/`K3S-F14`/`CODE-M10` are done —
> `03-deploy.sh` resolves the image digest after each push and deploys
> api/worker/admin/web by `@sha256:`, and redis/vmagent/`node:20-alpine`
> are pinned to their resolved index digests.
> **Update (2026-05-16):** `K3S-F8` is **done** — the `api`/`worker`
> Deployments mount `honeydue-secrets` as files (`defaultMode: 0400`) at
> `/etc/honeydue/secrets` and inject no secret as an env var;
> `config.loadFileSecrets` reads them; `02-setup-secrets.sh` now writes
> `B2_KEY_ID`/`B2_APP_KEY` into the secret, reconciling the earlier
> script-vs-manifest drift. `CODE-L5` stays **◐** — cosign signing and a
> Trivy `HIGH,CRITICAL` scan are wired (guarded) into `03-deploy.sh` and a
> ready-to-use Kyverno `ClusterPolicy` ships at
> `deploy-k3s/manifests/kyverno-verify-images.yaml`; closing it needs the
> operator to install Kyverno and supply a cosign key. See both entries.
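
To make the `K3S-F8` change concrete, here is a minimal sketch of reading file-mounted secrets. It is not the repository's `config.loadFileSecrets`; the function body, the trailing-newline trimming, and the `demo` helper are assumptions for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// loadFileSecrets reads every regular entry in dir as one secret, keyed by
// file name. Dot-prefixed entries are skipped because Kubernetes Secret
// volumes contain bookkeeping entries such as "..data".
func loadFileSecrets(dir string) (map[string]string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, fmt.Errorf("read secrets dir: %w", err)
	}
	secrets := make(map[string]string, len(entries))
	for _, e := range entries {
		if e.IsDir() || strings.HasPrefix(e.Name(), ".") {
			continue
		}
		raw, err := os.ReadFile(filepath.Join(dir, e.Name()))
		if err != nil {
			return nil, fmt.Errorf("read secret %s: %w", e.Name(), err)
		}
		secrets[e.Name()] = strings.TrimRight(string(raw), "\r\n")
	}
	return secrets, nil
}

// demo writes a throwaway secret with mode 0400 and loads it back.
func demo() (map[string]string, error) {
	dir, err := os.MkdirTemp("", "honeydue-secrets")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "REDIS_PASSWORD")
	if err := os.WriteFile(path, []byte("s3cret\n"), 0o400); err != nil {
		return nil, err
	}
	return loadFileSecrets(dir)
}

func main() {
	secrets, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("loaded keys:", len(secrets))
}
```

In the cluster the directory would be the mount path named above, `/etc/honeydue/secrets`.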

### Stage 6 — Post-deploy verification & runtime investigations

`V1`–`V13` — see [Stage 6](#stage-6--post-deploy-verification--runtime-investigations).

---

## Stage 0 — DNS & Cloudflare edge

External state at Cloudflare. Not touched by `03-deploy.sh`, so a redeploy
neither breaks nor re-applies these — do them once and leave them. Tracked
here so they are never forgotten on a domain move or DNS migration.

### `LIVE-L2` — Add DMARC record · HIGH · ⊘
- **Operator decision (2026-05-16):** declined for this cycle. A DMARC record is an ordinary DNS TXT record — it is **not** gated behind a paid Cloudflare plan and can be added on Free. This remains a real email-spoofing gap; revisit when DNS is next touched.
- **Where:** Cloudflare DNS, TXT record at `_dmarc.myhoneydue.com`.
- **Fix:** Publish `v=DMARC1; p=quarantine; rua=mailto:dmarc@myhoneydue.com; ruf=mailto:dmarc@myhoneydue.com; fo=1; aspf=s; adkim=s`. Start at `pct=10` for 30 days, watch the `rua` aggregate reports, then ramp to `pct=100` and finally `p=reject`.
- **Verify:** `dig +short TXT _dmarc.myhoneydue.com` returns the record.
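
During the `pct` ramp it can help to check the published policy programmatically instead of reading `dig` output by eye. The sketch below parses a DMARC tag list; `parseDMARC` is a hypothetical helper and validates only the `v` and `p` tags, not the full grammar.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// parseDMARC splits a DMARC TXT record into its tag=value pairs and checks
// that it is a DMARC1 record with a policy.
func parseDMARC(record string) (map[string]string, error) {
	tags := make(map[string]string)
	for _, part := range strings.Split(record, ";") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		key, value, ok := strings.Cut(part, "=")
		if !ok {
			return nil, fmt.Errorf("malformed tag %q", part)
		}
		tags[strings.TrimSpace(key)] = strings.TrimSpace(value)
	}
	if tags["v"] != "DMARC1" {
		return nil, errors.New("not a DMARC1 record")
	}
	if _, ok := tags["p"]; !ok {
		return nil, errors.New("missing required p= tag")
	}
	return tags, nil
}

func main() {
	// The record from the Fix line above, with the initial pct=10 ramp value.
	record := "v=DMARC1; p=quarantine; pct=10; rua=mailto:dmarc@myhoneydue.com; " +
		"ruf=mailto:dmarc@myhoneydue.com; fo=1; aspf=s; adkim=s"
	tags, err := parseDMARC(record)
	if err != nil {
		panic(err)
	}
	fmt.Println("policy:", tags["p"], "pct:", tags["pct"])
}
```

In practice the record string would come from a DNS lookup such as `net.LookupTXT("_dmarc.myhoneydue.com")`; it is hard-coded here so the sketch runs offline.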

### `LIVE-L3` — Tighten SPF from `?all` to `-all` · MEDIUM · ⊘
- **Operator decision (2026-05-16):** declined for this cycle. SPF is an ordinary DNS TXT record, editable on any Cloudflare plan including Free. The `?all` (neutral) qualifier leaves spoofed mail un-penalised; revisit alongside `LIVE-L2`.
- **Where:** Cloudflare DNS, TXT record at `myhoneydue.com`.
- **Fix:** Change `v=spf1 include:spf.messagingengine.com ?all` → `~all` for ~7 days, confirm no legitimate mail (CI, transactional) is missed, then `-all`. Do this **after** `LIVE-L2`'s DMARC ramp begins.
- **Verify:** `dig +short TXT myhoneydue.com | grep spf` shows `-all`.

### `LIVE-L4` — Add CAA records · MEDIUM · ⊘
- **Operator decision (2026-05-16):** declined for this cycle. CAA is an ordinary DNS record type, addable on any Cloudflare plan including Free. Without it, any public CA may issue a cert for the domain; revisit when DNS is next touched.
- **Where:** Cloudflare DNS, apex `myhoneydue.com`.
- **Fix:** Add `0 issue "letsencrypt.org"`, `0 issuewild "letsencrypt.org"`, `0 iodef "mailto:security@myhoneydue.com"`. Add `0 issue "pki.goog"` only if Google Trust Services is used anywhere. Confirm against the CAs Cloudflare Universal SSL actually uses before locking down.
- **Verify:** `dig +short CAA myhoneydue.com` returns the records.

### `LIVE-L6` — Publish `security.txt` · LOW · ☐ · In-repo: Y
- **Where:** served by the Go API and/or Next.js apps at `/.well-known/security.txt` (RFC 9116) — committed route, so it survives redeploys.
- **Fix:** Serve `Contact:`, `Expires:`, `Preferred-Languages:`, `Canonical:` on both `api.myhoneydue.com` and the apex.
- **Verify:** `curl https://api.myhoneydue.com/.well-known/security.txt` → 200.
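
A committed route for this could look like the sketch below. It uses plain `net/http` rather than the project's router; the contact address reuses the `security@myhoneydue.com` mailbox from `LIVE-L4` as an assumption, and the six-month expiry is illustrative.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// securityTxt renders the four fields named in the Fix line. RFC 9116
// requires Contact and Expires; the other two are optional.
func securityTxt(expires time.Time) string {
	return strings.Join([]string{
		"Contact: mailto:security@myhoneydue.com", // assumed mailbox
		"Expires: " + expires.UTC().Format(time.RFC3339),
		"Preferred-Languages: en",
		"Canonical: https://api.myhoneydue.com/.well-known/security.txt",
		"",
	}, "\n")
}

// newMux registers the well-known path as an exact-match route.
func newMux(expires time.Time) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/.well-known/security.txt", func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "text/plain; charset=utf-8")
		fmt.Fprint(w, securityTxt(expires))
	})
	return mux
}

// fetch exercises the route in-process and returns status code and body.
func fetch(path string) (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, path, nil)
	newMux(time.Now().AddDate(0, 6, 0)).ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := fetch("/.well-known/security.txt")
	fmt.Println(code)
	fmt.Print(body)
}
```

Because `Expires` is computed at request time in this sketch, the file never goes stale. A static file would need a reminder to refresh the date before it lapses.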

### `LIVE-L9` — Review Cloudflare caching of the admin SSR shell · INFO · ☐
- **Where:** Cloudflare cache rules for `admin.myhoneydue.com`.
- **Fix:** `cache-control: s-maxage=31536000` on admin SSR pages means Cloudflare caches the admin shell for a year. Confirm this is intentional; if the admin shell ever contains per-session content, add a bypass-cache rule for `admin.myhoneydue.com`.
- **Verify:** `curl -sI https://admin.myhoneydue.com/ | grep -i cache` reflects the intended policy.

### `LIVE-L10` — Suppress `x-powered-by` · INFO · ☐ · In-repo: Y
- **Where:** Next.js config in the admin and web repos (`next.config.js` → `poweredByHeader: false`). Committed, survives redeploys.
- **Fix:** Disable the `x-powered-by: Next.js` header.
- **Verify:** `curl -sI https://admin.myhoneydue.com/ | grep -i x-powered-by` returns nothing.

---

## Stage 1 — Cluster provisioning & node OS

Run by `01-provision-cluster.sh` (which drives the `hetzner-k3s` CLI from
`config.yaml` via `generate_cluster_config` in `_config.sh`) plus one-time
SSH hardening on each node. **Any k3s server flag must be set in the
`hetzner-k3s` cluster config so a cluster rebuild applies it.**

### `K3S-F4` — kubeconfig world-readable (mode 644 → 600) · HIGH · ☑ · In-repo: Y
- **Where:** `_config.sh` → `generate_cluster_config` → `k3s_config_file`. Node file `/etc/rancher/k3s/k3s.yaml`.
- **Done (2026-05-16):** `generate_cluster_config` now emits `write-kubeconfig-mode: "0600"` in the k3s config file, so any fresh provision writes the node kubeconfig as `0600`.
- **Operator step on the existing cluster:** a running node keeps the mode it was installed with — `ssh deploy@<node> 'sudo chmod 600 /etc/rancher/k3s/k3s.yaml'` on each. Deploy scripts still read it via `sudo`.
- **Verify:** `ssh deploy@<node> 'sudo stat -c %a /etc/rancher/k3s/k3s.yaml'` → `600`.

### `K3S-CG1` / `CODE-M9` — etcd / Secret encryption at rest · ☑ · In-repo: Y
- **Where:** `_config.sh` → `generate_cluster_config` → `k3s_config_file`.
- **Done:** the k3s config file carries `secrets-encryption: true`, so a fresh provision boots with AES Secret encryption enabled. (The `write-kubeconfig-mode` line for `K3S-F4` was added next to it on 2026-05-16.)
- **Operator step on the existing cluster:** a cluster provisioned *without* the flag does not retro-encrypt — run `k3s secrets-encrypt enable` then `k3s secrets-encrypt reencrypt` once. Tracked as `V12`.
- **Verify:** `k3s secrets-encrypt status` reports `Encryption Status: Enabled` on every server node.
- **Note:** the old `SECURITY.md` *claimed* this was already on — `04-verify.sh` greps for the string but cannot truly confirm; see `V12`.

### `K3S-CG2` — Node OS hardening · ◐ · In-repo: partial
- **Where:** `_config.sh` → `generate_cluster_config` → `post_create_commands` (runs on every node at provision).
- **Done (2026-05-16):** `post_create_commands` now installs and enables `fail2ban` (SSH brute-force bans) and `unattended-upgrades` (automatic security patching) on every node at provision time — a fresh cluster comes up hardened on both.
- **Still operator (runtime; not yet in-repo):**
  - SSH — confirm `PermitRootLogin no`, `PasswordAuthentication no`, `AllowUsers deploy`, modern ciphers/MACs/KEX. (hetzner-k3s provisions key-only SSH; verify and tighten.)
  - sysctl — confirm `net.ipv4.ip_unprivileged_port_start=0` (Traefik) and standard network-hardening sysctls.
- **Verify:** `ssh deploy@<node> 'fail2ban-client status sshd; systemctl is-enabled unattended-upgrades'`.

### `K3S-CG3` — Hetzner Cloud Firewall rules · ☐ · In-repo: N
- **Fix:** Confirm only: `:443` from Cloudflare CIDRs, `:22` from operator IP(s), `:6443` from operator IP(s). Nothing else. This is the *only* network defense for the public-IP nodes (`K3S-F15`).
- **Verify:** `hcloud firewall describe honeydue-fw` matches the intended ruleset; a direct `curl` to a node IP on `:80`/`:443` from a non-CF host times out.

### `K3S-CG4` — etcd snapshot backup · ☐ · In-repo: Y
- **Fix:** Confirm k3s etcd snapshots are enabled (default hourly) and shipped off-node — set `--etcd-s3` (to Backblaze B2) with encryption. Without offsite snapshots, a 3-node loss is unrecoverable.
- **Verify:** `ls /var/lib/rancher/k3s/server/db/snapshots/` on a node + an object in the B2 backup bucket.

### `K3S-CG5` — kubelet authn/authz flags · ☐ · In-repo: Y
- **Fix:** Confirm `--anonymous-auth=false` and `--authorization-mode=Webhook` on the kubelet (k3s defaults are usually safe — verify, don't assume). Set via k3s `kubelet-arg` in the cluster config if missing.
- **Verify:** `kubectl get --raw /api/v1/nodes/<node>/proxy/configz` shows the expected kubelet config.

### `K3S-CG6` — Container-runtime CIS baseline · ☐ · In-repo: N
- **Fix:** Run `kube-bench` once; remediate any FAIL lines that aren't k3s-by-design.
- **Verify:** `kube-bench` run archived with FAILs triaged.

### `K3S-CG7` — `deploy` user sudoers least-privilege · ☐ · In-repo: N
- **Fix:** Current `deploy ALL=(ALL) NOPASSWD: ALL` means an SSH-key compromise = node root. Scope to the commands deploys actually need (`ufw`, `systemctl`, `chmod` on k3s.yaml, `cat` of k3s.yaml). Accept the convenience trade-off only with eyes open.
- **Verify:** `ssh deploy@<node> 'sudo -l'` shows the scoped list.

### `K3S-CG8` — `/etc/rancher/k3s/` perms · ☐ · In-repo: N
- **Fix:** `/var/lib/rancher/k3s/server/token` and `/var/lib/rancher/k3s/server/node-token` must be `0600 root:root`; `/etc/rancher/k3s/` not world-traversable.
- **Verify:** `ssh deploy@<node> 'sudo stat -c "%a %n" /var/lib/rancher/k3s/server/token'` → `600`.

### `K3S-F15` — Nodes on public IPs, no private VPC · INFO · ⊘ · In-repo: Y
- **Decision:** Accepted for now. Defense is `K3S-CG3` (Hetzner firewall) only. To remediate later: attach a Hetzner private network, re-IP the cluster, move etcd/kubelet/Flannel onto it. Substantial re-provision — track on the roadmap, not this cycle.

### `K3S-F16` — All nodes are control-plane + etcd + worker · INFO · ⊘
- **Decision:** Accepted — standard small-cluster k3s. Revisit (dedicated workers + `NoSchedule` taint on control-plane) when workload pressure grows. No redeploy action.

### `K3S-F17` — Single-replica SPOFs · INFO · ☐ · In-repo: Y
- **Where:** `deploy-k3s/manifests/worker/deployment.yaml`, `redis/`, `admin/`, `observability/vmagent.yaml`.
- **Fix:** `worker` → `replicas: 2` (stateless, Asynq at-least-once — safe now). `admin`/`vmagent` → 2 if zero-downtime restart is wanted. `redis` is stateful — true HA needs Sentinel or managed Redis; track separately, do not naively scale.
- **Verify:** `kubectl -n honeydue get deploy` shows `worker 2/2`.

---

## Stage 2 — Secrets & config bootstrap

Run by `02-setup-secrets.sh`, which reads `deploy-k3s/config.yaml` and the
`secrets/` directory. **Both `K3S-F1` and `K3S-F3` are open purely because
`config.yaml` lacks the values — the script already supports them.**

### `K3S-F1` — Redis runs with no authentication · CRITICAL · ☐ · In-repo: Y
- **Where:** `deploy-k3s/config.yaml` key `redis.password`. `02-setup-secrets.sh:53,68-71` includes `REDIS_PASSWORD` in `honeydue-secrets` only when that key is non-empty; `redis/deployment.yaml` adds `--requirepass` only when the env var is non-empty.
- **Fix:** Set `redis.password` in `config.yaml` to a strong value (`openssl rand -base64 32`). Re-run `02-setup-secrets.sh`. `api`/`worker` already consume `REDIS_PASSWORD`.
- **Verify:** `kubectl -n honeydue exec deploy/redis -- redis-cli ping` → `NOAUTH`; with `-a "$REDIS_PASSWORD"` → `PONG`.
- **Redeploy-clean:** committing the value to `config.yaml` means every future `02-setup-secrets.sh` re-creates the authenticated secret. (If `config.yaml` is gitignored, store the value in the operator's secret store and document it here.)

### `K3S-F3` — `admin-basic-auth` secret never created · HIGH · ☐ · In-repo: Y
- **Where:** `config.yaml` keys `admin.basic_auth_user` / `admin.basic_auth_password`. `02-setup-secrets.sh:54-55,132-143` creates the `admin-basic-auth` secret (bcrypt htpasswd) only when both are set, else it warns and skips.
- **Fix:** Set both keys. Re-run `02-setup-secrets.sh`. **Must be done before `K3S-F2`** — attaching `admin-auth` to the ingress with the secret missing makes Traefik 503 the admin route.
- **Verify:** `kubectl -n honeydue get secret admin-basic-auth`.

### `K3S-F8` (Stage 2 half) — `B2_KEY_ID` / `B2_APP_KEY` in `honeydue-secrets` · ☑ · In-repo: Y
- **Where:** `02-setup-secrets.sh`.
- **Done (2026-05-16):** the script now reads `storage.b2_key_id` / `storage.b2_app_key` from `config.yaml` and adds `B2_KEY_ID` / `B2_APP_KEY` to `honeydue-secrets`. Previously the `api`/`worker` manifests referenced these keys but the script never created them — a latent deploy break. See the full `K3S-F8` entry in Stage 5.
- **Verify:** `kubectl -n honeydue get secret honeydue-secrets -o jsonpath='{.data.B2_KEY_ID}'` is non-empty.

### `K3S-F12` — Secret rotation runbook · MEDIUM · ☐ · In-repo: Y
- **Where:** new doc `docs/runbooks/secret-rotation.md`.
- **Fix:** Document per-secret rotation (Postgres, `SECRET_KEY`, APNs `.p8`, FCM, B2, observability token, Redis, admin basic-auth). Annual minimum; immediate on suspected exposure or operator-device loss. For `SECRET_KEY` (JWT signing) plan an overlap window so live tokens validate across the change. Add a `last-rotated` annotation to each secret.
- **Verify:** runbook exists and the first rotation is logged.

### `CODE-C4` — `DEBUG_FIXED_CODES` "123456" auth bypass · CRITICAL · ☐ · In-repo: Y
- **Where:** `internal/services/auth_service.go:141-145,385-390,432-435,470-473,503-504`; config in `internal/config/config.go`. ConfigMap generated from `config.yaml` by `03-deploy.sh`.
- **Fix (two layers):** (1) Code — refuse to start if `ENV=production && DebugFixedCodes` (Stage 4 code change). (2) Config — ensure `config.yaml` never sets `DEBUG_FIXED_CODES=true` for prod, and the generated ConfigMap omits it.
- **Verify:** prod ConfigMap has no `DEBUG_FIXED_CODES`; a prod boot with the flag set fails fast.

### `CODE-M8` — `SECRET_KEY` hardcoded debug fallback · MEDIUM · ☐ · In-repo: Y
- **Where:** `internal/config/config.go:437-442` falls back to `"change-me-in-production-secret-key-12345"`.
- **Fix:** Remove the static fallback — generate a per-boot random key in debug, and **refuse to start** in production if `SECRET_KEY` is unset. (`02-setup-secrets.sh:46-49` already enforces ≥32 chars for the real secret — keep that.)
- **Verify:** prod boot with no `SECRET_KEY` exits non-zero; the fallback string is gone from the binary.
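
The fail-fast behaviour described for `CODE-C4` and `CODE-M8` can be sketched as a single boot-time check. This is a minimal illustration, not the actual code in `internal/config/config.go`: the `Config` struct, its field names, and `validateForBoot` are hypothetical stand-ins, and the `"production"` value for `ENV` is assumed.

```go
package main

import (
	"errors"
	"fmt"
)

// Config is a hypothetical, trimmed-down stand-in for the real config struct.
type Config struct {
	Env             string // assumed values: "production", "development", ...
	SecretKey       string
	DebugFixedCodes bool
}

// validateForBoot returns an error when a production config is unsafe.
// The caller is expected to log the error and exit non-zero.
func validateForBoot(c Config) error {
	if c.Env != "production" {
		return nil // debug builds may generate a per-boot random key instead
	}
	if c.DebugFixedCodes {
		return errors.New("DEBUG_FIXED_CODES must not be enabled in production")
	}
	if c.SecretKey == "" {
		return errors.New("SECRET_KEY must be set in production")
	}
	return nil
}

func main() {
	err := validateForBoot(Config{Env: "production", SecretKey: "k", DebugFixedCodes: true})
	fmt.Println("boot check:", err)
}
```

Returning an error rather than calling `os.Exit` inside the check keeps it unit-testable; `main` decides how to exit.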

---

## Stage 3 — Kubernetes manifests

Committed under `deploy-k3s/manifests/` and applied by `03-deploy.sh`. **Any
fix here is automatically re-applied on every redeploy** — the highest-value
stage for "redeploy clean."

### `K3S-F2` / `CODE-L6` — Wire defense-in-depth onto the admin ingress · HIGH · ☐
- **Where:** `deploy-k3s/manifests/ingress/ingress-simple.yaml` — admin route annotation.
- **Fix:** Add `cloudflare-only` and `admin-auth` to the `traefik.ingress.kubernetes.io/router.middlewares` annotation alongside the existing `security-headers` + `rate-limit`. **Do `K3S-F3` first** or Traefik 503s the route.
- **Verify:** `04-verify.sh` "Cloudflare-Only Middleware" check passes; `admin.myhoneydue.com` prompts for basic auth.

### `K3S-F6` — `imagePullSecrets` name consistency · HIGH · ☐
- **Where:** all `deploy-k3s/manifests/*/deployment.yaml`, `migrate/job.yaml`; secret created by `02-setup-secrets.sh:111` as `ghcr-credentials`.
- **Fix:** The registry is Gitea — `ghcr-credentials` is a misleading name and the live cluster currently also has a hand-made `gitea-credentials`. Pick one name (`gitea-credentials` is clearer), use it in **both** the script and **every** manifest, and delete the orphan. The defect is a name *mismatch*, not a missing fix — make script + manifests agree so a pull never fails on a fresh node.
- **Verify:** `grep -rA1 imagePullSecrets deploy-k3s/manifests/` shows the same secret name in every manifest, matching the one the script creates (`-l` alone lists only filenames, not the names used); cordon a node, delete a pod, confirm the replacement pulls.

### `K3S-F7` — `vmagent` container `securityContext` · MEDIUM · ☐
- **Where:** `deploy-k3s/manifests/observability/vmagent.yaml`.
- **Fix:** Add the container-level block the other 5 deployments already have: `allowPrivilegeEscalation: false`, `capabilities.drop: [ALL]`, `readOnlyRootFilesystem: true`. Its volumes (`/etc/vmagent`, `/etc/vmagent-secrets`, `/tmp/vmagent` emptyDir) already support read-only root.
- **Verify:** `04-verify.sh` "Pod Security Contexts" reports OK for `vmagent`.

### `K3S-F9` / `LIVE-L8` — CSP + cross-origin headers · MEDIUM / LOW · ☐
- **Where:** Cross-origin trio → `deploy-k3s/manifests/ingress/middleware.yaml` (`security-headers`). CSP `object-src`/`base-uri` → Go app CSP middleware (Stage 4, `LIVE-L8` code half).
- **Important correction:** `K3S-F9` originally said CSP was missing. The live scan **disproved** that — the Go app sets a strong CSP via app middleware. So `K3S-F9` reduces to: add `Cross-Origin-Opener-Policy: same-origin` and `Cross-Origin-Resource-Policy: same-origin` (and `Cross-Origin-Embedder-Policy: require-corp` only if it doesn't break embeds) to `security-headers`. The CSP `object-src 'none'; base-uri 'self'` additions belong in the app and are tracked under `LIVE-L8` in Stage 4.
- **Verify:** `curl -sI https://api.myhoneydue.com/api/health/ | grep -i cross-origin` shows COOP/CORP.

### `K3S-F10` / `LIVE-L12` — Auth-endpoint rate-limit middleware · MEDIUM / HIGH · ☐
- **Where:** `deploy-k3s/manifests/ingress/middleware.yaml` (new `auth-rate-limit` Middleware) + `ingress/ingress-simple.yaml`. Requires migrating the auth paths from vanilla `Ingress` to a Traefik `IngressRoute` to apply a per-path middleware.
- **Fix:** New Middleware `average: 5, burst: 10, period: 1m, sourceCriterion.ipStrategy.depth: 2` (depth 2 for the Cloudflare hop). Apply to `/api/auth/login`, `/api/auth/register`, `/api/auth/forgot-password`, `/api/auth/reset-password`, `/api/residences/join-with-code`. This is the **edge** half; the **app** half is `CODE-H1/H2/H3/M5` in Stage 4 (per-account lockout in Redis). Do both — edge limit alone resets on IP rotation.
- **Verify:** more than 10 rapid logins from one IP (exceeding the `burst: 10` allowance) → `429`.

### `K3S-F11` — Disable `automountServiceAccountToken` · MEDIUM · ☐
- **Where:** `deploy-k3s/manifests/rbac.yaml` (ServiceAccounts) and/or each `*/deployment.yaml` pod spec.
- **Fix:** Set `automountServiceAccountToken: false` on `api`, `admin`, `worker`, `web`, `redis`. Leave `true` only for `vmagent` (it uses the k8s API for service discovery). **Note:** `05-security.md` claims this is already set — the audit (`F11`) says it is not. Treat the audit as ground truth; this fix makes the doc true.
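- **Verify:** `kubectl -n honeydue get pod <api-pod> -o jsonpath='{.spec.automountServiceAccountToken}'` → `false` (the field is empty when the flag is set only on the ServiceAccount; in that case check `kubectl -n honeydue get sa <sa-name> -o jsonpath='{.automountServiceAccountToken}'`); no token file in the container.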

### `K3S-F13` — Add `app.myhoneydue.com` to CORS · LOW · ☐
- **Where:** `CORS_ALLOWED_ORIGINS` in `config.yaml` → generated into `honeydue-config` ConfigMap by `03-deploy.sh`.
- **Fix:** Confirm whether the web app calls `api.myhoneydue.com` directly from the browser. If yes, add `https://app.myhoneydue.com` to `CORS_ALLOWED_ORIGINS`. If it proxies through Next.js server-side, CORS is moot — record that decision here instead.
- **Verify:** browser fetch from `app.myhoneydue.com` to the API succeeds (or the proxy decision is documented).

### `K3S-F14` — Pin public images by digest · LOW · ☐
- **Where:** `redis/deployment.yaml` (`redis:7-alpine`), `observability/vmagent.yaml` (`victoriametrics/vmagent:v1.106.1`).
- **Fix:** Replace tags with `@sha256:` digests. Folded into the `K3S-F5` CI work (Stage 5).
- **Verify:** manifests contain no public-image tag without a digest.

### `LIVE-L5` / `CODE-L3` — HSTS `preload` · LOW · ☐
- **Where:** `deploy-k3s/manifests/ingress/middleware.yaml` `security-headers` HSTS value.
- **Fix:** Change to `max-age=63072000; includeSubDomains; preload`. Confirm api/admin/app all work fully over HTTPS, then submit to `hstspreload.org` (the submission is the Stage 0 external half — once preloaded you cannot easily downgrade for ~6 months).
- **Verify:** response header shows `preload`; domain accepted at hstspreload.org.

### `LIVE-L7` — Drop deprecated `X-XSS-Protection` · LOW · ☐
- **Where:** `deploy-k3s/manifests/ingress/middleware.yaml` `security-headers` (`browserXssFilter: true` / `customResponseHeaders`).
- **Fix:** Remove the header or set `X-XSS-Protection: "0"`. Modern browsers ignore it, and the legacy XSS filter it controls has itself been abused to introduce XSS.
- **Verify:** header absent or `0` on all three hosts.

### `CODE-L4` — Set `imagePullPolicy` · LOW · ☐
- **Where:** all `deploy-k3s/manifests/*/deployment.yaml`.
- **Fix:** Set `imagePullPolicy` explicitly. Once images are digest-pinned (`K3S-F5`), `IfNotPresent` is correct and avoids needless re-pulls; until then `Always` avoids stale tags. Pick the policy that matches the `K3S-F5` rollout state.
- **Verify:** every container has an explicit `imagePullPolicy`.

---

## Stage 4 — Application code & container images

Fixes in `honeyDueAPI-go` source (and the admin/web Dockerfiles). They reach
production by **rebuilding the image** in `03-deploy.sh`; schema-changing
fixes (`CODE-C1`, `CODE-C5/6`, `CODE-C11`, `CODE-C12`) also need a **goose
migration**, which the migrate `Job` runs automatically before the
api/worker roll. Per repo rule: do not auto-commit — these are code changes;
this section is the plan, not the patch.

### Critical (C1–C13)

#### `CODE-C1` — Plaintext auth tokens in DB · ☑ (2026-05-15)
- **Where:** `internal/models/user.go`, `internal/repositories/user_repo.go`, `internal/middleware/auth.go`, `internal/services/cache_service.go`, `internal/services/auth_service.go`, migration `000003_hash_auth_tokens.sql`.
- **Done:** `user_authtoken.key` now stores `models.HashToken()` — the hex SHA-256 of the token — never the raw value. The raw token reaches the client once (the non-persisted `AuthToken.Plaintext` field) and is re-hashed on every request before the DB and Redis lookup, so the single indexed JOIN query in the auth middleware is preserved. A fast hash (not bcrypt) is correct here — tokens are 160-bit random values, nothing to brute-force. Migration `000003` widens the column 40→64 and clears existing rows.
- **Behaviour change:** the server can no longer re-issue a stored token's plaintext, so every login mints a fresh token via `CreateFreshToken` (delete + create). With the existing one-token-per-user schema this means **one active session per user** — logging in on a new device invalidates the previous device's token. The migration also invalidates all sessions once, at deploy.
- **Verify:** `SELECT key FROM user_authtoken LIMIT 1` → 64-char hash; `go build ./...` and `go test ./internal/{models,repositories,middleware,handlers}/...` pass.
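
For reference, the hashing step described above amounts to a few lines of standard-library Go. This is a sketch of the technique only; `hashToken` is an illustrative name, and the real `models.HashToken()` may differ in signature or detail.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashToken returns the lowercase hex SHA-256 digest of a raw token.
// The digest, not the raw token, is what gets stored and looked up.
func hashToken(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

func main() {
	h := hashToken("example-raw-token")
	fmt.Println(len(h)) // 64 hex characters, matching the widened column
}
```

Because the hash is deterministic, the middleware can hash the presented token and still use it as an indexed lookup key.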

#### `CODE-C2` / `CODE-C3` — Google ID token not verified locally · ☑ (2026-05-15)
- **Where:** `internal/services/google_auth.go` (full rewrite).
- **Done:** `VerifyIDToken` no longer calls the deprecated `tokeninfo` URL (which leaked the token in the query string and made verification depend on a third party). It now parses the JWT, fetches Google's JWKS from `googleapis.com/oauth2/v3/certs` (Redis-cached 24h, re-fetched on a `kid` miss), verifies the `RS256` signature locally, and asserts `iss ∈ {accounts.google.com, https://accounts.google.com}` (C3), `aud`/`azp` against the configured client IDs, and `exp` (validated by jwt v5). Mirrors the existing Apple JWKS verifier. `GoogleSignIn` is unchanged — the returned `GoogleTokenInfo` shape is preserved.
- **Verify:** `go build ./...` clean; `internal/services` tests pass.

#### `CODE-C5` / `CODE-C6` — IAP receipt / purchase-token replay · ☐
- **Where:** `internal/services/subscription_service.go` (`ProcessApplePurchase`, `ProcessGooglePurchase`).
- **Fix:** Goose migration adding `UNIQUE(provider, original_transaction_id)`. On purchase, if the transaction ID is already bound to a different `user_id` → `403`.
- **Verify:** re-submitting a valid receipt against a second account → `403`; DB has no duplicate.

#### `CODE-C7` — File-ownership check excludes residence owners · ☐
- **Where:** `internal/services/file_ownership_service.go:20-66`.
- **Fix:** Replace the three `residence_residence_users`-only JOINs with the canonical owner-OR-member UNION from `residence_repo.HasAccess` (owners live in `residence_residence.owner_id`).
- **Verify:** a residence owner can delete a file in their own property; a non-member still gets `403`.

#### `CODE-C8` — Device-token cross-account hijack · ☐
- **Where:** `internal/services/notification_service.go:307-319` (APNS), `:336-349` (GCM).
- **Fix:** On re-register of an existing token, if `existing.UserID != nil && *existing.UserID != userID` → `409 Conflict`. Only same-user updates allowed.
- **Verify:** registering another user's known token → `409`; that user's push traffic is unaffected.

#### `CODE-C9` / `CODE-H9` — Share-code join not atomic · ☐
- **Where:** `internal/services/residence_service.go:562-615` (`:594-599` swallows the deactivate error).
- **Fix:** Wrap `JoinWithCode` in one transaction with `SELECT … FOR UPDATE` on the share-code row; **fail the join if deactivation fails** (do not log-and-continue).
- **Verify:** concurrent redemptions of a single-use code → exactly one succeeds; a forced deactivate error rolls the whole join back.

#### `CODE-C10` — Subscription upgrade race · ☐
- **Where:** `internal/services/subscription_service.go:404-459`; webhook handler `:136-213`.
- **Fix:** Move Apple validation inside the row-locked transaction, or add an idempotency-key table so the validate→write window can't be raced.
- **Verify:** two concurrent upgrades for one user → one tier change, not two.

#### `CODE-C11` — Task-completion duplicate-row race · ☐
- **Where:** `internal/services/task_service.go:631-750`.
- **Fix:** `SELECT … FOR UPDATE` on the task in `CreateCompletion`; goose migration adding `UNIQUE(task_id, completed_date)`.
- **Verify:** double-tap "complete" → one completion row.

#### `CODE-C12` — Soft-deleted email reusable · ☐
- **Where:** `internal/services/auth_service.go:274-324`; `internal/repositories/user_repo.go` (`FindByEmail`, `ExistsByEmail`).
- **Fix:** On delete, mangle the email (`deleted_<id>_<email>`); add `is_active = true` filtering consistently to `FindByEmail`/`ExistsByEmail`.
- **Verify:** registering with a soft-deleted account's email is rejected; no cross-account takeover.

#### `CODE-C13` — Apple webhook user lookup may LIKE-match · ☐
- **Where:** `internal/handlers/subscription_webhook_handler.go:354-366` (`FindByAppleReceiptContains`).
- **Fix:** Confirm the SQL is an equality match, not `LIKE`. If `LIKE`, this is a confirmed Critical — change to equality and rename the function. See `V8`.
- **Verify:** the query is parameterized equality; rename merged.

### High (H1–H9)

#### `CODE-H1` / `CODE-H2` / `CODE-H3` / `CODE-M5` — Rate limiting gaps · ☐
- **Where:** `internal/router/router.go` (`:520` login limiter, `:593` `join-with-code` unprotected), `internal/middleware/rate_limit.go`, `internal/handlers/auth_handler.go`.
- **Fix:** Extend rate limiting to `register`, `join-with-code`, Apple/Google sign-in, and token refresh. Add a per-account login-attempt counter in Redis (lock after 5–10 fails for 15–60 min). This is the **app** half of the consolidated auth-rate-limit item; the **edge** half is `K3S-F10`.
- **Verify:** rapid attempts on every auth route throttle; per-account lockout fires regardless of source IP.

#### `CODE-H4` — Modulo bias in 6-digit codes · ☐
- **Where:** `internal/services/auth_service.go:884-892`.
- **Fix:** Replace `int32 % 1000000` with rejection sampling on `crypto/rand` for a uniform `000000–999999`.
- **Verify:** distribution test over many samples is uniform.
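
A minimal sketch of the rejection-sampling approach, assuming a 32-bit draw from `crypto/rand`. The function name `sixDigitCode` is illustrative and not taken from `auth_service.go`.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

const codeSpace = 1_000_000 // number of distinct six-digit codes

// limit is the largest multiple of codeSpace that fits in 32 bits.
// Draws at or above it are rejected so every code is equally likely.
const limit = (1 << 32) / codeSpace * codeSpace

// sixDigitCode returns a uniformly distributed code in 000000-999999.
func sixDigitCode() (string, error) {
	var buf [4]byte
	for {
		if _, err := rand.Read(buf[:]); err != nil {
			return "", err
		}
		v := uint64(binary.BigEndian.Uint32(buf[:]))
		if v < limit {
			return fmt.Sprintf("%06d", v%codeSpace), nil
		}
		// v fell in the biased tail; draw again.
	}
}

func main() {
	code, err := sixDigitCode()
	if err != nil {
		panic(err)
	}
	fmt.Println(code)
}
```

The rejected tail is under 0.03% of the 32-bit range, so the loop almost always finishes on the first draw. `rand.Int(rand.Reader, big.NewInt(1000000))` from `crypto/rand` is an equally valid alternative that handles the rejection internally.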

#### `CODE-H5` — Apple IAP `.p8` file-mode unchecked · ☐
- **Where:** `internal/services/iap_validation.go:93-128`, `internal/config/config.go:325`.
- **Fix:** Prefer a base64 env-injected PEM. If a file path is kept, refuse to start when the file mode is more permissive than `0600`.
- **Verify:** boot fails on a `0644` key file; succeeds on `0600`.
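
If the file-path option is kept, the mode check could look roughly like this on a Unix host. `checkKeyFileMode` and `demo` are hypothetical names used for illustration; the real loader lives in `iap_validation.go`.

```go
package main

import (
	"fmt"
	"os"
)

// checkKeyFileMode returns an error if the file grants any permission
// beyond owner read/write (0600).
func checkKeyFileMode(path string) error {
	info, err := os.Stat(path)
	if err != nil {
		return fmt.Errorf("stat key file: %w", err)
	}
	if perm := info.Mode().Perm(); perm&^0o600 != 0 {
		return fmt.Errorf("key file %s has mode %04o; want 0600 or stricter", path, uint32(perm))
	}
	return nil
}

// demo creates a temp file and checks it at 0644, then at 0600.
func demo() (looseErr, strictErr error) {
	f, err := os.CreateTemp("", "authkey-*.p8")
	if err != nil {
		return err, err
	}
	defer os.Remove(f.Name())
	f.Close()

	if err := os.Chmod(f.Name(), 0o644); err != nil {
		return err, err
	}
	looseErr = checkKeyFileMode(f.Name())

	if err := os.Chmod(f.Name(), 0o600); err != nil {
		return err, err
	}
	strictErr = checkKeyFileMode(f.Name())
	return looseErr, strictErr
}

func main() {
	loose, strict := demo()
	fmt.Println("0644:", loose)
	fmt.Println("0600:", strict)
}
```

Masking with `&^0o600` rejects any group or other bit while still accepting stricter modes such as `0400`.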

#### `CODE-H6` — Webhook dedup fail-open · ☐
- **Where:** `internal/handlers/subscription_webhook_handler.go:165-173` (Apple), `:564-574` (Google).
- **Fix:** Fail **closed** — if `webhookEventRepo.HasProcessed` errors, return `500` so Apple/Google retry, rather than processing (which risks duplicate refunds).
- **Verify:** simulated dedup-check DB error → `500`, no double-processing.

#### `CODE-H7` — Auth-failure log lacks IP/UA · ☐
- **Where:** `internal/handlers/auth_handler.go:70`.
- **Fix:** Add `c.RealIP()` + `User-Agent` to the structured failure log line (the audit log captures them; the request-line log does not). Depends on `V10` (RealIP trust).
- **Verify:** a failed login log line carries IP + UA.

#### `CODE-H8` — `X-Timezone` header trusted for trial start · ☐
- **Where:** `internal/middleware/timezone.go:40-71` → `internal/services/subscription_service.go:145-150`.
- **Fix:** Validate `X-Timezone` against IANA `LoadLocation`, cap to ±14h; use server UTC for trial-start / billing-window math regardless.
- **Verify:** a bogus/extreme `X-Timezone` cannot shift trial start.
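
A sketch of the validation, under the assumption that the header carries an IANA zone name. `parseClientTimezone` and `trialStart` are illustrative names; the blank import of `time/tzdata` embeds the zone database so the lookup does not depend on the container image.

```go
package main

import (
	"errors"
	"fmt"
	"time"
	_ "time/tzdata" // embed the IANA database in the binary
)

const maxOffset = 14 * time.Hour

// parseClientTimezone validates an X-Timezone header value.
func parseClientTimezone(name string) (*time.Location, error) {
	// LoadLocation treats "" as UTC and "Local" as the host zone;
	// neither is an explicit client choice, so reject both.
	if name == "" || name == "Local" {
		return nil, errors.New("timezone must be an explicit IANA name")
	}
	loc, err := time.LoadLocation(name)
	if err != nil {
		return nil, fmt.Errorf("unknown timezone %q: %w", name, err)
	}
	_, offsetSec := time.Now().In(loc).Zone()
	offset := time.Duration(offsetSec) * time.Second
	if offset > maxOffset || offset < -maxOffset {
		return nil, fmt.Errorf("timezone %q has out-of-range offset %v", name, offset)
	}
	return loc, nil
}

// trialStart ignores the client timezone and always uses server UTC.
func trialStart() time.Time { return time.Now().UTC() }

func main() {
	_, err := parseClientTimezone("Mars/Olympus_Mons")
	fmt.Println("bogus zone rejected:", err != nil)
	loc, _ := parseClientTimezone("America/Chicago")
	fmt.Println("display zone:", loc)
	fmt.Println("trial start zone:", trialStart().Location())
}
```

The validated location is only suitable for display; billing-window math stays on `trialStart()` so the header cannot influence it.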

### Medium (M1–M13)

#### `CODE-M1` — Header injection via `Content-Disposition` filename · ☐
- **Where:** `internal/handlers/media_handler.go:74,117,165`.
- **Fix:** Sanitize `doc.FileName` — strip CR/LF/quote/null, or emit RFC 5987 `filename*=UTF-8''…`.
- **Verify:** an upload with CRLF in the filename does not split the response.
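
One way to combine both suggestions (strip dangerous characters, then emit the extended `filename*` form) is sketched below. The helper names and the `download` fallback are assumptions for illustration, not code from `media_handler.go`.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// sanitizeFilename drops control characters (including CR, LF and NUL),
// double quotes and backslashes from a user-supplied filename.
func sanitizeFilename(name string) string {
	clean := strings.Map(func(r rune) rune {
		if r < 0x20 || r == 0x7f || r == '"' || r == '\\' {
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, name)
	if clean == "" {
		return "download" // hypothetical fallback name
	}
	return clean
}

// contentDisposition builds a header value using the percent-encoded
// filename*=UTF-8'' form, so no raw user bytes reach the header.
func contentDisposition(name string) string {
	// QueryEscape encodes a space as "+"; the extended form wants %20.
	encoded := strings.ReplaceAll(url.QueryEscape(sanitizeFilename(name)), "+", "%20")
	return "attachment; filename*=UTF-8''" + encoded
}

func main() {
	fmt.Println(contentDisposition("report\r\nSet-Cookie: x=1.pdf"))
}
```

Percent-encoding alone would already neutralise CR/LF; stripping them first keeps the downloaded filename clean as well.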

#### `CODE-M2` — bcrypt cost 10 → 12 · ☐
- **Where:** `internal/models/user.go:47`, `internal/services/auth_service.go:479`.
- **Fix:** Make the cost config-driven, default 12.
- **Verify:** new hashes are `$2a$12$`.

#### `CODE-M3` — Apple Sign In nonce not validated · ☐
- **Where:** `internal/services/apple_auth.go`.
- **Fix:** Generate, store, and verify the nonce round-trip on Apple sign-in.
- **Verify:** a replayed/mismatched nonce is rejected.

#### `CODE-M4` — Email verification not atomic · ☐
- **Where:** `internal/services/auth_service.go:373-415`.
- **Fix:** Wrap verify in a transaction so a concurrent request can't double-apply.
- **Verify:** concurrent verify calls → one state transition.

#### `CODE-M6` / `LIVE-L16` — Uncapped list / pagination · ☐
- **Where:** `ListDocuments`, `ListContractors`, `ListResidences` handlers; pagination parsing.
- **Fix:** Clamp `limit` server-side to ≤100 (`< 1` → default 25). The notifications endpoint already caps at 200; follow the same clamping pattern.
- **Verify:** `?limit=999999` returns ≤100 rows.
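
The clamp itself is small enough to share across all list handlers. A minimal sketch, with `clampLimit` and the constant names chosen for illustration:

```go
package main

import "fmt"

const (
	defaultLimit = 25
	maxLimit     = 100
)

// clampLimit maps any requested page size into the range [1, maxLimit],
// substituting defaultLimit for zero or negative values.
func clampLimit(requested int) int {
	switch {
	case requested < 1:
		return defaultLimit
	case requested > maxLimit:
		return maxLimit
	default:
		return requested
	}
}

func main() {
	fmt.Println(clampLimit(999999), clampLimit(0), clampLimit(40)) // 100 25 40
}
```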

#### `CODE-M7` — Audit log not append-only · ☐
- **Where:** audit-log model / repository.
- **Fix:** Make it append-only — a DB trigger forbidding `UPDATE`/`DELETE`, or move to an event store. Remove the soft-delete column.
- **Verify:** an `UPDATE`/`DELETE` on the audit table is rejected.

#### `CODE-M11` — `golang.org/x/crypto` outdated · ☐
- **Where:** `go.mod:30` (`v0.49.0`).
- **Fix:** `go get -u golang.org/x/crypto`, re-run `govulncheck`, retest. Pairs with Stage 5 dependency automation.
- **Verify:** `govulncheck ./...` clean.

#### `CODE-M12` — Contractor toggle refetch race · ☐
- **Where:** `internal/services/contractor_service.go:279-307`.
- **Fix:** Do the toggle + read in one transaction so a concurrent soft-delete can't make it return `nil`.
- **Verify:** concurrent toggle + delete → defined result, no nil panic.

#### `CODE-M13` — Account-deletion endpoint unrate-limited · ☐
- **Where:** `internal/handlers/auth_handler.go:488-539`.
- **Fix:** Add a throttle to `DELETE /account`. **First resolve `V11`** — `LIVE-L18` claims no delete endpoint exists; reconcile before deciding whether this is "rate-limit it" or "expose it."
- **Verify:** repeated delete calls throttle.

#### `CODE-M10` — `node:20-alpine` floating tag · ☐
- **Where:** admin/web `Dockerfile` (`:2,112,134`).
- **Fix:** Pin to a specific patch version or digest.
- **Verify:** Dockerfile has no bare `node:20-alpine`.

### Low / Info (CODE-L1, L2)

#### `CODE-L1` — Inactive-account login enumeration · ☐
- **Where:** `internal/services/auth_service.go:76-77`.
- **Fix:** Return the same generic error for inactive accounts as for invalid credentials.
- **Verify:** inactive vs. wrong-password responses are byte-identical.

#### `CODE-L2` — Auth responses lack `Cache-Control: no-store` · ☐
- **Where:** `internal/handlers/auth_handler.go` (Login / CurrentUser / Refresh).
- **Fix:** Set `Cache-Control: no-store` on auth responses.
- **Verify:** the header is present.
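
As an illustration, the header can be applied once as middleware rather than in each handler. The sketch below uses only `net/http`; the project's actual router and middleware signature may differ, and `noStore` and `cacheControlFor` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// noStore wraps a handler and marks every response as uncacheable.
func noStore(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Cache-Control", "no-store")
		next.ServeHTTP(w, r)
	})
}

// cacheControlFor sends a request through noStore around a stub handler
// and returns the Cache-Control header that came back.
func cacheControlFor(path string) string {
	stub := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	rec := httptest.NewRecorder()
	noStore(stub).ServeHTTP(rec, httptest.NewRequest(http.MethodPost, path, nil))
	return rec.Result().Header.Get("Cache-Control")
}

func main() {
	fmt.Println(cacheControlFor("/api/auth/login/"))
}
```

Setting the header before calling `next` matters: once the inner handler writes the status line, later header changes are ignored.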

### Live-scan code-level findings (LIVE-L1, L11–L20)

#### `LIVE-L1` — `/metrics` publicly exposed · HIGH · ☐
- **Where:** `cmd/api/main.go` route registration; vmagent scrapes it cluster-internally already.
- **Fix (recommended — Option B):** bind Prometheus metrics to a separate cluster-internal port (e.g. `:9090`), expose only via a ClusterIP Service the vmagent NetworkPolicy allows; the public Ingress never registers `/metrics`. Update `observability/vmagent.yaml` scrape target. (Alternative: block `/metrics` at Traefik via an `IngressRoute` — Stage 3.)
- **Verify:** `curl https://api.myhoneydue.com/metrics` → `404`; vmagent still scrapes successfully.

#### `LIVE-L11` — Login user-enumeration via timing · HIGH · ☐
- **Where:** login handler / `auth_service.go`.
- **Fix:** Always run a bcrypt compare against a fixed dummy hash when the user is not found, so both paths pay the same hashing cost and their response times are indistinguishable.
- **Verify:** real vs. fake email login timing delta < network noise.

#### `LIVE-L12` — No rate-limit on login · HIGH · ☐
- See the consolidated auth-rate-limit item: `K3S-F10` (edge) + `CODE-H1/H2/H3/M5` (app). Closed when both land.

#### `LIVE-L13` — Password-reset timing enumeration · HIGH · ☐
- **Where:** `forgot-password` handler.
- **Fix:** Enqueue the reset email on the Asynq queue and return the generic response immediately, so real vs. fake emails have identical latency.
- **Verify:** real vs. fake email reset timing delta < network noise.

#### `LIVE-L14` / `LIVE-L15` — Sequential integer IDs · MEDIUM · ⊘ (deferred)
- **Where:** all user-facing IDs.
- **Decision:** Real enumeration/intel leak, but migrating to UUID/ULID touches API, web, mobile, and webhook payloads. **Deferred to a planned quarter** — not a redeploy-stage fix. Track on the roadmap; revisit before the userbase size becomes commercially sensitive.

#### `LIVE-L16` — Pagination `limit` uncapped · MEDIUM · ☐
- Duplicate of `CODE-M6` — closed with it.

#### `LIVE-L17` — Garbage pagination params silently accepted · LOW · ☐
- **Where:** query-param parsing in list handlers.
- **Fix:** Return `400` naming the bad parameter instead of silently using defaults.
- **Verify:** `?limit=abc` → `400`.
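
A strict parser for integer query parameters might look like the following sketch. `parseIntParam` is an illustrative helper; the handler would map its error to a `400` response, and range handling stays with the clamp from `CODE-M6`.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseIntParam returns def when the parameter is absent, and an error
// naming the parameter when the value is not an integer.
func parseIntParam(name, raw string, def int) (int, error) {
	if raw == "" {
		return def, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid query parameter %q: %q is not an integer", name, raw)
	}
	return n, nil
}

func main() {
	_, err := parseIntParam("limit", "abc", 25)
	fmt.Println(err)
	n, _ := parseIntParam("limit", "", 25)
	fmt.Println(n) // 25
}
```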

#### `LIVE-L18` — No account-deletion endpoint (GDPR) · LOW · ☐
- **Where:** `internal/router/router.go`, `internal/handlers/auth_handler.go`.
- **Fix:** Reconcile with `CODE-M13` first (`V11`). Provide `DELETE /api/auth/me/` that anonymizes PII, cascades/transfers residences, revokes tokens, and writes an audit-trail row. Also closes the throwaway-account cleanup gap the live scan left behind.
- **Verify:** an authenticated user can delete their own account; PII is anonymized.

#### `LIVE-L19` — Email verification not enforced · LOW · ☐
- **Where:** router middleware.
- **Fix:** Add a `RequireVerified()` middleware on sensitive routes (share-code generation/redemption, anything that emails other users), or cap unverified accounts (1 residence, no share codes) until verified.
- **Verify:** an unverified account is blocked from the chosen gated routes.

#### `LIVE-L20` — Profile-update silently drops unknown fields · INFO · ☐
- **Where:** `PATCH /api/auth/profile/` handler.
- **Fix:** Either accept the fields (if intended) or return `400` listing unsupported keys — don't silently `200`.
- **Verify:** an unknown field yields a clear response.
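
If the rejection route is chosen, `encoding/json` can enforce it directly. The struct and field names below are hypothetical; only `DisallowUnknownFields` is the point of the sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// profileUpdate lists the fields a client may change (hypothetical set).
type profileUpdate struct {
	FirstName *string `json:"first_name"`
	LastName  *string `json:"last_name"`
}

// decodeProfileUpdate rejects any JSON key that profileUpdate does not declare.
func decodeProfileUpdate(body string) (profileUpdate, error) {
	var p profileUpdate
	dec := json.NewDecoder(strings.NewReader(body))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&p); err != nil {
		return profileUpdate{}, fmt.Errorf("invalid profile update: %w", err)
	}
	return p, nil
}

func main() {
	_, err := decodeProfileUpdate(`{"first_name":"Ada","is_staff":true}`)
	fmt.Println(err)
}
```

The decoder stops at the first unknown key, so the error names one offending field rather than all of them; listing every unsupported key would need a manual pass over a `map[string]json.RawMessage`.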

#### `LIVE-L10` — `x-powered-by` — see Stage 0 (Next.js config).

---

## Stage 5 — CI / build pipeline

Build-time controls. Where there is no CI pipeline file yet, the fix is to
add one (or a `03-deploy.sh` step) so the control runs on every build.

### `K3S-F5` / `K3S-F14` / `CODE-L4` — Pin images by digest · HIGH · ☐
- **Where:** `03-deploy.sh` (currently tags by git short SHA, lines 47/57-61, and also pushes `:latest`), all `deploy-k3s/manifests/*/deployment.yaml`.
- **Fix:** After `docker push`, capture the digest (`crane digest …` or parse `docker push` output) and substitute `@sha256:…` into the manifests instead of `IMAGE_PLACEHOLDER` tags. Pin `redis` and `vmagent` by digest too. Reconsider pushing `:latest` — a mutable `:latest` undercuts digest pinning.
- **Verify:** `kubectl -n honeydue get deploy -o jsonpath` shows every image as `@sha256:`.

### `K3S-F8` — Secrets as file mounts, not env vars · MEDIUM · ☑ · In-repo: Y
- **Where:** `api`/`worker` `deployment.yaml`, `internal/config/config.go`, `cmd/api/main.go`, `cmd/worker/main.go`, `02-setup-secrets.sh`.
- **Done (2026-05-16):**
  - `config.loadFileSecrets()` reads each of the 9 secret keys (`POSTGRES_PASSWORD`, `SECRET_KEY`, `EMAIL_HOST_PASSWORD`, `FCM_SERVER_KEY`, `REDIS_PASSWORD`, `B2_KEY_ID`, `B2_APP_KEY`, `OBS_INGEST_TOKEN`, `OBS_TRACES_URL`) from `/etc/honeydue/secrets/<KEY>` and `viper.Set`s it (highest precedence). A missing file is a silent skip, so the same binary still works from env vars in local/dev.
  - `api`/`worker` `deployment.yaml` no longer inject **any** secret as an `env: secretKeyRef`. `honeydue-secrets` is mounted as a volume (`defaultMode: 0400`), read-only, at `/etc/honeydue/secrets`. Non-secret config still arrives via `envFrom: configMapRef`.
  - `cmd/api`/`cmd/worker` read the observability endpoints through the new `config.SecretValue()` (Viper-backed) instead of `os.Getenv`, so file-mounted `OBS_*` values resolve now that they are gone from the environment.
  - `02-setup-secrets.sh` now also writes `B2_KEY_ID`/`B2_APP_KEY` into `honeydue-secrets` — reconciling the script-vs-manifest drift (the manifests referenced these keys but the script never created them).
- **Scoped exception:** the one-shot `honeydue-migrate` Job still takes `POSTGRES_PASSWORD` as an env var. goose is invoked as a CLI with the password inside the DSN argument, so the value is exposed in that process regardless of env-vs-file; the Job is transient (one run, seconds, pod GC'd) so this is accepted.
- **Verify:** `kubectl -n honeydue exec deploy/api -- env` shows no `POSTGRES_PASSWORD`/`SECRET_KEY`; `kubectl -n honeydue exec deploy/api -- ls /etc/honeydue/secrets` lists the key files.
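
The file-first lookup with a silent skip for missing files can be sketched as follows. This is an approximation of the behaviour described for `config.loadFileSecrets()`, not its source: it returns a map instead of calling `viper.Set`, and trimming a trailing newline is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// loadFileSecrets reads <dir>/<KEY> for each key. A missing file is
// skipped silently so env vars can still supply the value in local/dev.
func loadFileSecrets(dir string, keys []string) (map[string]string, error) {
	out := make(map[string]string)
	for _, key := range keys {
		data, err := os.ReadFile(filepath.Join(dir, key))
		if errors.Is(err, fs.ErrNotExist) {
			continue
		}
		if err != nil {
			return nil, fmt.Errorf("read secret %s: %w", key, err)
		}
		// Assumption: tolerate a trailing newline in the mounted file.
		out[key] = strings.TrimRight(string(data), "\r\n")
	}
	return out, nil
}

// demoLoad writes one secret into a temp dir and asks for two keys.
func demoLoad() (map[string]string, error) {
	dir, err := os.MkdirTemp("", "secrets-")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	if err := os.WriteFile(filepath.Join(dir, "SECRET_KEY"), []byte("s3cret\n"), 0o400); err != nil {
		return nil, err
	}
	return loadFileSecrets(dir, []string{"SECRET_KEY", "REDIS_PASSWORD"})
}

func main() {
	secrets, err := demoLoad()
	fmt.Println(len(secrets), err)
}
```

Any read error other than "file does not exist" is returned rather than skipped, so a permissions problem on the mounted volume surfaces at boot instead of as a missing credential later.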

### `CODE-L5` — Image signing + scanning · LOW · ◐ · In-repo: Y
- **Where:** `03-deploy.sh`, `deploy-k3s/manifests/kyverno-verify-images.yaml`.
- **Done (in-repo, 2026-05-16):**
  - `03-deploy.sh` runs `cosign sign` after each push and a `trivy image --severity HIGH,CRITICAL` scan before push — both **guarded**: they no-op when the tool is absent, so they never break a deploy on a host without them.
  - A ready-to-use Kyverno `ClusterPolicy` ships at `deploy-k3s/manifests/kyverno-verify-images.yaml`. It matches only the four `gitea.treytartt.com/admin/honeydue-*` images, starts in `Audit` mode, and is **intentionally not applied by `03-deploy.sh`** — applying a verify-images policy with no key would block every Pod from scheduling.
- **Remaining (operator — cannot be committed):**
  1. Install Kyverno in the cluster (admission controller).
  2. `cosign generate-key-pair`; set `COSIGN_KEY` in the deploy env so signing activates; paste `cosign.pub` into the policy's `publicKeys` block.
  3. `kubectl apply -f deploy-k3s/manifests/kyverno-verify-images.yaml`, confirm Pods still schedule, then flip `validationFailureAction: Audit → Enforce`.
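- **Verify:** an unsigned image is rejected by admission; `03-deploy.sh` fails on a HIGH/CRITICAL CVE (Trivy exits `0` on findings by default, so the scan step must pass `--exit-code 1` for this to hold).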
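
The "guarded" behaviour described above could look roughly like the sketch below. It is not the code in `03-deploy.sh`: the helper names `run_if_present` and `scan_and_sign` are hypothetical, and the assumption that signing is switched on by a non-empty `COSIGN_KEY` is taken from the operator steps in this section.

```bash
# Hypothetical helper, not the code in 03-deploy.sh: run a command only
# when its binary is on PATH, otherwise report a skip and succeed.
run_if_present() {
  tool=$1
  if ! command -v "$tool" >/dev/null 2>&1; then
    echo "skip: $tool not installed" >&2
    return 0
  fi
  "$@"
}

# Illustrative flow for one image reference passed as $1.
scan_and_sign() {
  image=$1
  # --exit-code 1 makes trivy return non-zero on findings (its default is 0).
  run_if_present trivy image --severity HIGH,CRITICAL --exit-code 1 "$image" || return 1
  # ... the real script would push "$image" here ...
  # Sign only when a key is configured (assumed activation switch).
  if [ -n "${COSIGN_KEY:-}" ]; then
    run_if_present cosign sign --key "$COSIGN_KEY" "$image" || return 1
  fi
}
```

The trade-off of a no-op guard is that a host without `trivy` deploys unscanned images silently, so the skip message is worth surfacing in deploy logs.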

### `CODE-M11` (CI half) — Dependency hygiene · ☐
- **Fix:** The audit confirms `govulncheck` and `gitleaks` already run in CI. Extend that with a scheduled dependency-update job (`go get -u` followed by `govulncheck`) so updates happen on a fixed cadence rather than ad hoc.
- **Verify:** stale-dependency alerts surface automatically.
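
One possible body for such a scheduled job is sketched below. The CI wiring (cron trigger, opening a PR with the result) is not shown, and the `run`/`update_and_scan` helper names and the `DRY_RUN` switch are assumptions made for illustration.

```bash
# Hypothetical scheduled-job body. With DRY_RUN=1 the commands are
# printed instead of executed, which is handy when testing the wiring.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

update_and_scan() {
  run go get -u ./... || return 1      # bump dependencies to newer minor/patch releases
  run go mod tidy || return 1          # drop unused requirements, refresh go.sum
  run govulncheck ./... || return 1    # non-zero exit when known vulnerabilities are found
}

# Usage (from the module root):
#   update_and_scan && git diff --stat go.mod go.sum
```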

---

## Stage 6 — Post-deploy verification & runtime investigations

`04-verify.sh` already runs a security block (secret encryption, NetworkPolicy
count, ServiceAccounts, pod security contexts, PDBs, `cloudflare-only`
middleware, `admin-basic-auth`). **Extend it so each fix above stays fixed,
and work the open investigations the audits could not resolve.**

### Extend `04-verify.sh` with assertions for · ☐
- Redis rejects unauthenticated `PING` (`K3S-F1`).
- Admin ingress annotation contains `admin-auth` (`K3S-F2`).
- `/metrics` returns `404` on the public host (`LIVE-L1`).
- Every container (incl. `vmagent`) has a full `securityContext` (`K3S-F7`).
- `automountServiceAccountToken: false` on app pods (`K3S-F11`).
- Every workload image is digest-pinned (`K3S-F5`).
- No `DEBUG_FIXED_CODES` key in the prod ConfigMap (`CODE-C4`).
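
As an illustration of the shape such assertions might take, here is a sketch of the digest-pinning check (`K3S-F5`). The function names are hypothetical, and the `✓`/`✗` output prefixes are assumed from the "zero `✗` lines" criterion used elsewhere in this plan; the real `04-verify.sh` may format its output differently.

```bash
# Hypothetical helpers for 04-verify.sh; names and output format are assumptions.

# Succeeds when the image reference ends in @sha256:<64 hex chars>.
is_digest_pinned() {
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}

# Reads one image reference per line on stdin, prints a ✓/✗ line for each,
# and returns non-zero if any image is not digest-pinned.
check_images_pinned() {
  failed=0
  while IFS= read -r image; do
    [ -n "$image" ] || continue
    if is_digest_pinned "$image"; then
      echo "✓ pinned: $image"
    else
      echo "✗ not digest-pinned: $image (K3S-F5)"
      failed=1
    fi
  done
  return "$failed"
}

# Usage (needs cluster access):
#   kubectl -n honeydue get pods -o jsonpath='{.items[*].spec.containers[*].image}' \
#     | tr -s '[[:space:]]' '\n' | sort -u | check_images_pinned
```

Keeping the predicate separate from the cluster query lets the check be exercised locally with sample image strings before it is wired into the verifier.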

### Runtime investigations (cannot be closed by code review alone)

| ID | Item | Source | Action |
|---|---|---|---|
| `V1` | Apple/Google Sign-In token validation depth | LIVE | Test with a self-signed Apple identity token; confirm signature/aud/nonce checks |
| `V2` | Webhook signature verification — confirm webhook routes are **outside** the auth middleware in `router.go` (live scan saw `401`s, signature middleware may never run) | LIVE | Code-review `internal/router/router.go` |
| `V3` | File-upload security — locate upload paths, test polyglots / MIME bypass / path traversal in filename / oversized files | LIVE | Focused upload security test |
| `V4` | Long-term token validity / revocation behaviour | LIVE | Test token expiry + revocation over time |
| `V5` | Apple IAP receipt validation with a real sandbox StoreKit receipt | LIVE | Sandbox test |
| `V6` | Share-code system — find the endpoint path; test brute-force, single-use, expiration | LIVE | Locate + test |
| `V7` | Trial-expiration enforcement — age a test account past 14 days, confirm `limitations_enabled` flips and creation gates fire | LIVE | Aged-account test |
| `V8` | `FindByAppleReceiptContains` — confirm equality, not `LIKE`. If `LIKE`, escalate `CODE-C13` to confirmed Critical | CODE | SQL review |
| `V9` | Rate-limiter storage: confirm `rate_limit.go` is Redis-backed (shared across the 3 api replicas); an in-memory store is per-replica, so the effective limit becomes up to 3× the intended one | CODE | Code review |
| `V10` | `X-Forwarded-For` / Echo `RealIP` trust behind Traefik: if the forwarded header is not trusted, every request appears to come from the ingress IP and per-IP limits collapse into a single shared bucket | CODE | Code + Traefik config review |
| `V11` | Account-deletion contradiction — `LIVE-L18` (no endpoint) vs `CODE-M13` (endpoint at `auth_handler.go:488-539`). Resolve before Stage 4 planning | LIVE/CODE | Route review |
| `V12` | etcd encryption — `04-verify.sh` only greps a string; truly confirm with `k3s secrets-encrypt status` on each server node | K3S | SSH check |
| `V13` | `user_authtoken` index — confirm a `user_id` lookup index exists before hashing tokens at rest (`CODE-C1`) | CODE | Schema check |
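
For `V12`, the per-node check could be scripted along these lines. This is a sketch under stated assumptions: the `Encryption Status: Enabled` line format is what current k3s releases are believed to print and should be confirmed against the installed version, and `encryption_enabled` and `SERVER_NODES` are hypothetical names.

```bash
# Reads `k3s secrets-encrypt status` output on stdin; succeeds only when
# the status line reports Enabled. The exact line format is an assumption;
# adjust the pattern if your k3s version prints something different.
encryption_enabled() {
  grep -q '^Encryption Status: Enabled'
}

# Usage sketch (SERVER_NODES is a hypothetical space-separated host list):
#   for node in $SERVER_NODES; do
#     if ssh "root@$node" k3s secrets-encrypt status | encryption_enabled; then
#       echo "✓ $node: secrets encryption enabled"
#     else
#       echo "✗ $node: secrets encryption NOT confirmed (V12)"
#     fi
#   done
```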

---

## Accepted risks / deferred (this cycle)

| ID | Item | Rationale |
|---|---|---|
| `K3S-F15` | Public-IP nodes, no VPC | Fixing this means re-provisioning the cluster; the Hetzner firewall (`K3S-CG3`) is the compensating control. Tracked on the roadmap. |
| `K3S-F16` | Combined control-plane/worker nodes | Standard small-cluster k3s; revisit on workload growth. |
| `LIVE-L14`/`L15` | Sequential integer IDs | UUID migration spans API + web + mobile + webhooks; planned quarter, not this cycle. |

Mirror these in `docs/deployment/20-roadmap.md` so they are not silently lost.

---

## Documentation drift corrected alongside this plan

The audits contradicted the existing deployment book. These corrections ship
with this plan so the docs match audited reality:

| Doc | Claimed | Reality (audit) | Action |
|---|---|---|---|
| `05-security.md` | `automountServiceAccountToken: false` set | `K3S-F11`: not set on any workload | Corrected to "TODO" + linked here |
| `05-security.md` | NetworkPolicies "not currently applied" (TODO) | Applied 2026-04-24; `03-deploy.sh:155` applies them | Corrected to "applied" |
| `05-security.md` | CF↔origin is plaintext (SSL=Flexible) | Upgraded to Full (strict) 2026-04-24 | Corrected |
| `05-security.md` | SHA tags immutable / "we'd notice a digest change" | `K3S-F5`: short SHA tags are mutable | Corrected; points to `K3S-F5` |
| `SECURITY.md` (old) | Redis "requires a password" | `K3S-F1`: no auth | This rewrite |
| `SECURITY.md` (old) | etcd `secrets-encryption: true` | `K3S-CG1`: not verified / not on | This rewrite |
| `SECURITY.md` (old) | fail2ban active | `05-security.md` + `K3S-CG2`: not installed | This rewrite |
| `20-roadmap.md` | — | Audit findings not represented | Audit items folded in |

---

## Hardened-redeploy checklist (run order)

A clean rebuild of the whole stack, with every fix above applied:

```
□ Stage 0  DNS once-off:    DMARC, SPF, CAA at Cloudflare; security.txt route live
□ Stage 1  Provision:       hetzner-k3s config carries --write-kubeconfig-mode=600
                            and --secrets-encryption; run 01-provision-cluster.sh
□ Stage 1  Node OS:         fail2ban + unattended-upgrades + SSH/sysctl on each node
□ Stage 1  Verify cluster:  K3S-CG3..CG8 (firewall, snapshots, kubelet, perms)
□ Stage 2  Config:          config.yaml has redis.password + admin.basic_auth_*;
                            no DEBUG_FIXED_CODES; SECRET_KEY ≥32 chars
□ Stage 2  Secrets:         run 02-setup-secrets.sh — confirm redis + admin-basic-auth
□ Stage 3  Manifests:       admin ingress middlewares wired; imagePullSecret name
                            consistent; vmagent securityContext; COOP/CORP headers;
                            auth-rate-limit; automountServiceAccountToken:false;
                            HSTS preload; X-XSS-Protection dropped; imagePullPolicy set
□ Stage 4  Code+image:      all C/H/M/L code fixes committed; image rebuilt;
                            goose migrations for C1/C5/C6/C11/C12 present
□ Stage 5  CI:              images digest-pinned + signed + scanned; secrets file-mounted
□ Stage 6  Verify:          run 04-verify.sh (extended); work V1–V13
□ Post:    Submit myhoneydue.com to hstspreload.org
```

A redeploy is "clean" only when `04-verify.sh` (extended per Stage 6) passes
with zero `✗` lines and every checkbox in the master index is ☑ or ⊘.
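
The "zero `✗` lines" criterion can be enforced mechanically. A minimal sketch, assuming the verifier marks each failed assertion with a `✗` character somewhere on the line; `count_failures` is a hypothetical helper name.

```bash
# Counts lines containing ✗ on stdin and prints the number.
# grep -c exits 1 when the count is 0, so "|| true" keeps the helper's
# own exit status at 0 and leaves the decision to the caller.
count_failures() {
  grep -c '✗' || true
}

# Usage sketch (run from deploy-k3s/):
#   failures=$(./scripts/04-verify.sh 2>&1 | tee verify.log | count_failures)
#   [ "$failures" -eq 0 ] || { echo "redeploy is not clean: $failures failed checks"; exit 1; }
```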

---

## Appendix — Incident response playbooks

Preserved from the previous `SECURITY.md`; still current.

### Compromised API token
Rotate `SECRET_KEY` to invalidate all tokens, then restart api/worker:
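```bash
# Generate a fresh 64-hex-char key (32 random bytes); this invalidates every issued token.
openssl rand -hex 32 > secrets/secret_key.txt
./scripts/02-setup-secrets.sh
kubectl rollout restart deployment/api deployment/worker -n honeydue
# Wait for the new pods to become ready before declaring the rotation complete.
kubectl rollout status deployment/api -n honeydue
```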
(After `CODE-C1` lands, tokens are hashed at rest — a DB read no longer yields
usable tokens, but `SECRET_KEY` rotation remains the kill-switch.)

### Compromised database credentials
Rotate in the Neon dashboard, update `secrets/postgres_password.txt`, re-run
`02-setup-secrets.sh`, restart api/worker, watch logs for connection errors.

### Compromised push keys
APNs: revoke in Apple Developer, drop the new `.p8` into `secrets/`, re-run
`02-setup-secrets.sh`, restart api/worker. FCM: rotate the key in Firebase,
update `secrets/fcm_server_key.txt`, re-run, restart.

### Suspicious pod
```bash
# Capture evidence first: deleting the pod discards its logs and state.
kubectl logs <pod> -n honeydue --all-containers > /tmp/pod-logs.txt
kubectl describe pod <pod> -n honeydue > /tmp/pod-describe.txt
kubectl delete pod <pod> -n honeydue   # deployment recreates it
```

### Communication
Document the timeline privately; on a data breach notify affected users
within 72 hours; rotate every potentially-exposed credential; write a
post-mortem (root cause, timeline, remediation, prevention).

---

## References

- Audit reports: `live_scan_5_12.md`, `k3_audit_5_12.md`, `security_scan_5_12.md` (repo root)
- Current architecture: `docs/deployment/05-security.md`
- Roadmap: `docs/deployment/20-roadmap.md`
- Deploy process: `docs/deployment/14-deployment-process.md`
- Scripts: `deploy-k3s/scripts/{01-provision-cluster,02-setup-secrets,03-deploy,04-verify}.sh`
- Manifests: `deploy-k3s/manifests/`