# honeyDue — Production Security Remediation Plan

This document is the **single source of truth for fixing every security finding from the 2026-05-12/13 audits, and for keeping those fixes baked into the stack so a full redeploy never reproduces them.**

It replaces the previous aspirational `SECURITY.md` (which described a desired state that, per the audits, was never fully true). The accurate *current* architecture lives in `docs/deployment/05-security.md`; this file is the **work list**.

**Last updated:** 2026-05-16

**Audit sources (kept at repo root):**

| Tag | File | Scope | Findings |
|---|---|---|---|
| `LIVE` | `live_scan_5_12.md` | External black-box scan of api/admin/app | L1–L20 (20) |
| `K3S` | `k3_audit_5_12.md` | k3s cluster + `honeydue` namespace audit | F1–F17 (17) + 8 coverage gaps |
| `CODE` | `security_scan_5_12.md` | Static audit of `honeyDueAPI-go` | C1–C13, H1–H9, M1–M13, L1–L6 (41) |

**Total: 78 findings + 8 cluster coverage gaps + 13 runtime verification items.**

---

## How to use this document

The plan is organised by **redeploy stage**, not by severity, because the operator's goal is: *redeploy the entire stack and come up clean.* Each finding is tagged with where its fix lives:

| Marker | Meaning |
|---|---|
| **In-repo: Y** | Fix lives in a committed file (`config.yaml`, a manifest, a script, Go code, a Dockerfile). Once committed, **every redeploy re-applies it automatically.** |
| **In-repo: N** | Fix is external state (DNS records, Cloudflare dashboard, Hetzner firewall, hstspreload.org). A redeploy does **not** touch it — it survives on its own but must be done once and tracked here. |

**Status legend:** ☐ open · ◐ in progress · ☑ done · ⊘ accepted risk / deferred

**Redeploy stage order** (matches `deploy-k3s/scripts/` run order):

```
Stage 0  DNS & Cloudflare edge          (external; no cluster needed)
Stage 1  Cluster provisioning & node OS (01-provision-cluster.sh / hetzner-k3s / SSH)
Stage 2  Secrets & config bootstrap     (02-setup-secrets.sh / config.yaml)
Stage 3  Kubernetes manifests           (deploy-k3s/manifests/, applied by 03-deploy.sh)
Stage 4  Application code & images      (honeyDueAPI-go source → rebuilt image)
Stage 5  CI / build pipeline            (image digest pinning, signing, scanning)
Stage 6  Post-deploy verification       (04-verify.sh + runtime investigations)
```

**Golden rule for "redeploy clean":** a fix only counts as done when it is committed to the file that the redeploy reads. A `kubectl patch` on the live cluster that is not mirrored into `deploy-k3s/manifests/` **will be wiped on the next `03-deploy.sh`.** Every entry below names the committed file.

---

## Execution status (2026-05-16)

Stages 2–5 were executed in-repo, then put through an independent code review (see *Post-remediation independent review* below). The Go module **builds clean and the full `go test ./...` suite passes.** Four new goose migrations were added — `000003` (auth-token hashing), `000004` (IAP replay protection), `000005` (audit-log append-only + `audit_log` table create), `000006` (`webhook_event_log` table create) — and run automatically via the migrate Job before the api/worker rollout.
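The auth-token hashing behind migration `000003` (`CODE-C1`) can be pictured with a minimal sketch. It assumes SHA-256 and hex encoding, which this plan does not specify. The function names are hypothetical and are not taken from `honeyDueAPI-go`.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// newToken returns 32 random bytes, hex-encoded (hypothetical helper).
func newToken() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// hashToken returns the hex digest that would be stored instead of the
// plaintext token. SHA-256 is an assumption made for this sketch.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// tokenMatches hashes the presented token and compares it with the
// stored digest in constant time.
func tokenMatches(presented, storedHash string) bool {
	got := hashToken(presented)
	return subtle.ConstantTimeCompare([]byte(got), []byte(storedHash)) == 1
}

func main() {
	token, err := newToken()
	if err != nil {
		panic(err)
	}
	stored := hashToken(token) // only this value would be persisted
	fmt.Println("valid token accepted:", tokenMatches(token, stored))
	fmt.Println("wrong token accepted:", tokenMatches("not-the-token", stored))
}
```

Because only the digest is persisted, a database leak does not hand out usable tokens. It also means tokens issued before the change cannot be matched, which is consistent with the operator note that `C1` invalidates existing sessions once at deploy.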

- **~63 findings fixed (☑) and verified** — all of Stage 2 (secrets/config) and Stage 3 (Kubernetes manifests), every exploitable Stage 4 application finding (all 11 actioned Criticals + the auth / webhook / race / handler High & Medium fixes), Stage-5 image digest pinning **and `K3S-F8` (secrets are now file-mounted, not env vars)**, plus the in-repo half of Stage 1 cluster provisioning — `K3S-F4` (kubeconfig written `0600`), `K3S-CG1` (etcd `secrets-encryption`), `K3S-CG2` (fail2ban + unattended-upgrades installed at provision). Includes token hashing, Google JWKS verification, IAP replay protection, the authorization fixes, atomic share-code join, the metrics-endpoint lockdown, per-account login lockout, verified-email gating, CSP/HSTS hardening, and digest-pinned images.
- **1 partial (◐)** — `CODE-L5`: cosign signing + a Trivy `HIGH,CRITICAL` scan are wired (guarded) into `03-deploy.sh`, and a ready-to-use Kyverno `ClusterPolicy` ships at `deploy-k3s/manifests/kyverno-verify-images.yaml`. Closing it needs two operator actions that cannot be committed: install Kyverno in the cluster, and supply a cosign key pair (`COSIGN_KEY` for signing + the public key pasted into the policy).
- **Accepted / blocked / moot (⊘)** — `M3` (Apple nonce — blocked on an iOS-client change), `C12` (moot — accounts are hard-deleted), `LIVE-L14`/`L15` (UUID migration — planned quarter), `LIVE-L17`/`L18`/`L20` (no security impact — see entries), `F15`/`F16` (architectural), and `LIVE-L2`/`L3`/`L4` (DMARC / SPF / CAA — operator-declined, below).
- **Operator-declined — Stage 0 DNS (`LIVE-L2`/`L3`/`L4`).** The operator has opted not to add the DMARC, SPF-hardening, and CAA DNS records this cycle. For the record: these are **not** a paid-Cloudflare feature — DMARC and SPF are ordinary TXT records and CAA is an ordinary CAA record, all addable on any Cloudflare plan including Free.
  They remain genuine email-spoofing / certificate-issuance gaps and are marked ⊘; revisit when DNS is next touched.
- **Remaining operator runtime steps (no code to commit)** — on the *existing* cluster: `k3s secrets-encrypt` enable/reencrypt (`K3S-CG1` / `V12`) and `chmod 600` the live kubeconfig (`K3S-F4`); the SSH/sysctl half of `K3S-CG2`; and the `K3S-CG3`–`CG8` verification items. A full *fresh* provision already comes up with `K3S-F4`/`CG1`/`CG2` (fail2ban + unattended-upgrades) applied straight from `_config.sh`.

**Operator note:** `C1` (token hashing) invalidates every existing login session once at deploy and makes login single-session per user — see the `CODE-C1` entry. The status boxes in the master index below are authoritative.

## Post-remediation independent review (2026-05-16)

The change set went through **two** independent review passes; the deploy-time verification below (build, `go test -race`, full `goose up` against real PostgreSQL 16) was executed and passed.

**First pass.** A separate review agent audited the full change set against the three audit files. It surfaced three **deploy-breaking** defects that a green `go test` could not catch — the test harness builds two tables via GORM `AutoMigrate`, which production never runs — all since fixed:

- **`audit_log` table was never created by a migration.** `000005` added append-only triggers to a table that exists only in the test DB, so a from-scratch `goose up` would fail on `000005`. `000005` now does `CREATE TABLE IF NOT EXISTS audit_log` before the triggers.
- **`webhook_event_log` table was never created by a migration.** The H6 fail-closed webhook dedup turns a missing table into a 500 on every subscription webhook. New migration `000006` creates it.
- **`000004`'s `google_purchase_token` unique index could fail to build** on a production table already holding duplicate tokens — exactly the C6 replay the migration fixes.
  `000004` now de-duplicates (keep-earliest, NULL-the-rest) before creating the index.

The first pass also tightened the C13 Apple-webhook lookup (`subscription_webhook_handler.go`) so the legacy substring scan runs only on a genuine `ErrRecordNotFound`, never masking a real DB error as "not found".

**Second pass (master review).** A second, independent security-audit agent re-verified all four first-pass fixes (correct), ran `go test -race` (0 data races) and the full `goose up`/`down` chain against real PostgreSQL (clean, idempotent), and returned **GO** with one HIGH finding, since fixed:

- **HIGH-1 — Redis password leaked via the `honeydue-config` ConfigMap.** `_config.sh` built `REDIS_URL` with the password embedded inline, and that URL is emitted into the `honeydue-config` ConfigMap (delivered to pods via `envFrom`). ConfigMaps are *not* covered by `secrets-encryption` and are readable by any principal with `get configmap` — so `K3S-F1`/`K3S-F8` were not actually fully closed. **Fixed (2026-05-16):** `_config.sh` now emits `REDIS_URL=redis://redis:6379/0` with no credentials; the password travels only as the file-mounted `REDIS_PASSWORD` secret. The API applies it in `cache_service.go`; `cmd/worker/main.go` now applies it onto the parsed Asynq `RedisClientOpt` so the server/inspector/monitoring client all authenticate against the `requirepass` Redis.

The master review's other seven findings (4 Medium, 3 Low — none deploy-blocking) were then **all fixed (2026-05-16)**:

- **MEDIUM-1 — re-login left the prior token usable for ≤5 min.** `CreateFreshToken` deleted the old token row but not its Redis cache entry. It now also returns the deleted tokens' hashes; `AuthService.freshToken` evicts them via the new `CacheService.InvalidateAuthTokenHashes` on every login / Apple / Google sign-in, so a prior (e.g. stolen) token stops authenticating immediately.
- **MEDIUM-2 — IAP `.p8` mode check incompatible with k8s.** The Apple IAP key check (`iap_validation.go`) required `0600`-or-stricter, unattainable on a k8s Secret volume (`0440` under `fsGroup`). It now rejects only world-accessible keys (`perm & 0o007`).
- **MEDIUM-3 — single-IP account-lockout DoS.** The `M5` per-account lockout is now keyed on the *set of distinct source IPs* that have failed (`RegisterLoginFailure` takes the IP, tracks a Redis set; lock at 5 distinct IPs). One attacker IP can no longer lock a victim out by spamming failures; genuinely distributed stuffing still trips it. `Login` now takes the client IP (`c.RealIP()`).
- **MEDIUM-4 — Redis no-auth deployable.** `02-setup-secrets.sh` now `die`s (was `warn`) when `redis.password` is empty, so a deploy can no longer bring up an unauthenticated Redis (`K3S-F1`).
- **LOW-1 / LOW-2 — missing regression tests.** Added: `config_test.go` asserts `validate()` refuses `DEBUG_FIXED_CODES` with `DEBUG=false` (`C4`); `subscription_repo_test.go` asserts a second account cannot bind an Apple transaction / Google purchase token already bound to another (`C5`/`C6`).
- **LOW-3 — device-token 409.** A recycled APNs/FCM token re-registering under a new account is now reassigned to that account (and logged) instead of returning a 409 that locked the legitimate new device owner out of push.

One earlier (first-pass) hardening item remains a **tracked follow-up**, not re-raised by the master review and not deploy-blocking: `/metrics` is gated by an `X-Forwarded-For` check rather than network-isolated. True isolation needs `/metrics` on a separate port plus a NetworkPolicy restricting the scrape to vmagent — an architectural change deferred to a later cycle.

## Consolidated work items (fix once, closes many)

Several findings are the same defect seen from three angles. Do the work once at the listed anchor; the rest close with it.

| Theme | Anchor | Also closes |
|---|---|---|
| Auth-endpoint rate limiting | Stage 3 `auth-rate-limit` middleware + Stage 4 app limiter | `K3S-F10`, `LIVE-L12`, `CODE-H1`, `CODE-H2`, `CODE-H3`, `CODE-M5` |
| CSP / cross-origin headers | Stage 3 `security-headers` + Stage 4 app CSP | `K3S-F9`, `LIVE-L8` |
| HSTS `preload` | Stage 3 middleware + Stage 0 list submission | `LIVE-L5`, `CODE-L3` |
| Admin ingress hardening | Stage 2 secret + Stage 3 middleware wiring | `K3S-F2`, `K3S-F3`, `CODE-L6` |
| etcd encryption at rest | Stage 1 `--secrets-encryption` | `K3S-CG1`, `CODE-M9` |
| Image digest pinning + signing | Stage 5 CI | `K3S-F5`, `K3S-F14`, `CODE-L4`, `CODE-L5` |
| Pagination hard caps | Stage 4 app | `LIVE-L16`, `CODE-M6` |
| imagePullSecret name consistency | Stage 3 manifests + Stage 2 script | `K3S-F6` |

**Known contradiction to resolve before planning Stage 4:** `LIVE-L18` says *no account-deletion endpoint exists* (every `DELETE` path 404/400), but `CODE-M13` points at a delete handler at `auth_handler.go:488-539`. Either the endpoint exists at a path the external scan never probed, or it is mounted but unreachable. **Confirm the route in `internal/router/router.go` first** — the fix differs (add an endpoint vs. expose/rate-limit an existing one). Tracked as verification item `V11`. **Resolved (2026-05-15):** the endpoint exists at `DELETE /api/auth/account/` (`router.go:546`) and the live scan probed other paths; see the Stage 4 handler/misc status note.

---

## Master finding index

Every finding, ordered by redeploy stage. Use this as the live tracker — flip the Status box as work lands.

### Stage 0 — DNS & Cloudflare edge

| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `LIVE-L2` | HIGH | No DMARC record — email spoofing open | N | ⊘ |
| `LIVE-L3` | MED | SPF ends `?all` (neutral — fails open) | N | ⊘ |
| `LIVE-L4` | MED | No CAA records — any CA may issue certs | N | ⊘ |
| `LIVE-L6` | LOW | No `/.well-known/security.txt` | Y | ☐ |
| `LIVE-L9` | INFO | Aggressive Cloudflare caching on admin SSR shell | N | ☐ |
| `LIVE-L10` | INFO | `x-powered-by: Next.js` framework leak | Y | ☐ |

### Stage 1 — Cluster provisioning & node OS

| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `K3S-F4` | HIGH | Node kubeconfig world-readable (mode 644) | Y | ☑ |
| `K3S-F15` | INFO | Nodes on public IPs, no private VPC | Y | ⊘ |
| `K3S-F16` | INFO | All 3 nodes are control-plane + etcd + worker | Y | ⊘ |
| `K3S-F17` | INFO | Single-replica SPOFs (redis/worker/admin/vmagent) | Y | ☐ |
| `K3S-CG1` | — | etcd encryption at rest not verified (`--secrets-encryption`) | Y | ☑ |
| `K3S-CG2` | — | Node OS hardening: SSH, fail2ban, unattended-upgrades, sysctl | Y/N | ◐ |
| `K3S-CG3` | — | Hetzner Cloud Firewall rules not verified | N | ☐ |
| `K3S-CG4` | — | etcd snapshot backup destination/encryption not verified | Y | ☐ |
| `K3S-CG5` | — | kubelet flags (`--anonymous-auth=false`, webhook authz) not verified | Y | ☐ |
| `K3S-CG6` | — | Container-runtime CIS controls (`kube-bench`) not run | N | ☐ |
| `K3S-CG7` | — | `deploy` user sudoers least-privilege not verified | N | ☐ |
| `K3S-CG8` | — | `/etc/rancher/k3s/` dir + server-token perms not verified | N | ☐ |

### Stage 2 — Secrets & config bootstrap

| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `K3S-F1` | **CRIT** | Redis runs with no authentication | Y | ☑ |
| `K3S-F3` | HIGH | `admin-basic-auth` secret never created | Y | ☑ |
| `K3S-F12` | MED | Secrets unrotated since cluster bootstrap; no runbook | Y | ☑ |
| `CODE-C4` | **CRIT** | `DEBUG_FIXED_CODES` "123456" auth bypass if it reaches prod | Y | ☑ |
| `CODE-M8` | MED | `SECRET_KEY` hardcoded debug fallback | Y | ☑ |

> **Stage 2 status (2026-05-15):** `config.yaml` now carries a Redis password and admin basic-auth user/password; `02-setup-secrets.sh` uses bcrypt (`htpasswd -nbB`); `internal/config/config.go` generates an ephemeral random `SECRET_KEY` in debug instead of a static fallback and refuses to boot if `DEBUG_FIXED_CODES` is set with `DEBUG=false`; the rotation runbook is at `docs/runbooks/secret-rotation.md`. All take effect on the next `02-setup-secrets.sh` + `03-deploy.sh`.

### Stage 3 — Kubernetes manifests

| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `K3S-F2` | HIGH | Admin ingress missing `cloudflare-only` + `admin-auth` | Y | ☑ |
| `K3S-F6` | HIGH | `imagePullSecrets` name mismatch (`ghcr-credentials`) | Y | ☑ |
| `K3S-F7` | MED | `vmagent` container missing `securityContext` | Y | ☑ |
| `K3S-F9` | MED | `security-headers` missing COOP/COEP/CORP | Y | ☑ |
| `K3S-F10` | MED | Uniform rate limit — no auth-endpoint tightening | Y | ☑ |
| `K3S-F11` | MED | `automountServiceAccountToken` not disabled | Y | ☑ |
| `K3S-F13` | LOW | `CORS_ALLOWED_ORIGINS` missing `app.myhoneydue.com` | Y | ☑ |
| `K3S-F14` | LOW | Public images (`redis`, `vmagent`) pinned by tag | Y | ☑ |
| `LIVE-L5` | LOW | HSTS not preload-eligible | Y | ☑ |
| `LIVE-L7` | LOW | Deprecated `X-XSS-Protection` header | Y | ☑ |
| `LIVE-L8` | LOW | CSP missing `object-src`/`base-uri`; COOP/COEP/CORP absent | Y | ☑ |
| `CODE-L3` | LOW | HSTS missing `preload` (duplicate of `LIVE-L5`) | Y | ☑ |
| `CODE-L4` | LOW | `imagePullPolicy` not set on Deployments | Y | ☑ |
| `CODE-L6` | LOW | Admin `admin-auth` middleware defined, not attached | Y | ☑ |

> **Stage 3 status (2026-05-15):** admin ingress now chains `cloudflare-only` + `admin-auth` + `security-headers` + `rate-limit`; a dedicated `honeydue-api-auth` Ingress applies a new `auth-rate-limit` middleware (5/min, burst 10) to
> login / register / forgot-password / reset-password / join-with-code; `security-headers` gained COOP + CORP, HSTS is now `max-age=63072000; …; preload`, and the deprecated `X-XSS-Protection` (`browserXssFilter`) is removed; `vmagent` has a container `securityContext`; all workload pods + the migrate Job set `automountServiceAccountToken: false` explicitly (on top of the rbac.yaml ServiceAccount-level setting that already existed); the registry secret is `gitea-credentials` everywhere; `imagePullPolicy: IfNotPresent` is explicit on every container; CORS includes `app.myhoneydue.com`. **Still open:** `K3S-F14` (public-image digest pins) is folded into Stage 5 with `K3S-F5`; `LIVE-L8` is partial — the COOP/CORP half shipped here, the CSP `object-src`/`base-uri` half is an app change tracked in Stage 4.

### Stage 4 — Application code & container images

| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `CODE-C1` | **CRIT** | Auth tokens stored plaintext in DB | Y | ☑ |
| `CODE-C2` | **CRIT** | Google ID token not verified locally | Y | ☑ |
| `CODE-C3` | **CRIT** | Google `iss` claim never validated | Y | ☑ |
| `CODE-C5` | **CRIT** | Apple IAP receipt replay across accounts | Y | ☑ |
| `CODE-C6` | **CRIT** | Google purchase-token replay across accounts | Y | ☑ |
| `CODE-C7` | **CRIT** | File-ownership check excludes residence owners | Y | ☑ |
| `CODE-C8` | **CRIT** | Device-token cross-account hijack on re-register | Y | ☑ |
| `CODE-C9` | **CRIT** | Share-code join not atomic (Add+Deactivate race) | Y | ☑ |
| `CODE-C10` | **CRIT** | Subscription upgrade race — validation outside txn | Y | ☑ |
| `CODE-C11` | **CRIT** | Task-completion duplicate-row race | Y | ☑ |
| `CODE-C12` | **CRIT** | Soft-deleted email reusable; `is_active` not filtered | Y | ⊘ |
| `CODE-C13` | **CRIT** | Apple webhook user lookup may LIKE-match | Y | ☑ |
| `CODE-H1` | HIGH | Rate limit doesn't cover all auth surfaces | Y | ☑ |
| `CODE-H2` | HIGH | No rate limit on `join-with-code` | Y | ☑ |
| `CODE-H3` | HIGH | No rate limit on `register` | Y | ☑ |
| `CODE-H4` | HIGH | Modulo bias in 6-digit code generation | Y | ☑ |
| `CODE-H5` | HIGH | Apple IAP `.p8` loaded with no file-mode check | Y | ☑ |
| `CODE-H6` | HIGH | Webhook dedup fail-open | Y | ☑ |
| `CODE-H7` | HIGH | Auth-failure log lacks IP/User-Agent | Y | ☑ |
| `CODE-H8` | HIGH | `X-Timezone` header trusted for trial-start calc | Y | ☑ |
| `CODE-H9` | HIGH | Share-code `Deactivate` error swallowed | Y | ☑ |
| `CODE-M1` | MED | HTTP header injection via `Content-Disposition` filename | Y | ☑ |
| `CODE-M2` | MED | bcrypt cost = 10 (recommend 12) | Y | ☑ |
| `CODE-M3` | MED | Apple Sign In nonce not validated | Y | ⊘ |
| `CODE-M4` | MED | Email verification not atomic | Y | ☑ |
| `CODE-M5` | MED | Per-user rate limiting absent | Y | ☑ |
| `CODE-M6` | MED | List endpoints uncapped (Documents/Contractors/Residences) | Y | ☑ |
| `CODE-M7` | MED | Audit log not append-only | Y | ☑ |
| `CODE-M11` | MED | `golang.org/x/crypto v0.49.0` outdated | Y | ☑ |
| `CODE-M12` | MED | Contractor toggle refetch race | Y | ☑ |
| `CODE-M13` | MED | Account-deletion endpoint unrate-limited | Y | ☑ |
| `CODE-M10` | MED | `node:20-alpine` floating tag in Dockerfile | Y | ☑ |
| `CODE-L1` | LOW | Login inactive-account error enables enumeration | Y | ☑ |
| `CODE-L2` | LOW | Auth responses lack `Cache-Control: no-store` | Y | ☑ |
| `LIVE-L1` | HIGH | `/metrics` publicly exposed on `api.myhoneydue.com` | Y | ☑ |
| `LIVE-L11` | HIGH | Login user-enumeration via timing | Y | ☑ |
| `LIVE-L12` | HIGH | No rate-limit on `/api/auth/login/` | Y | ☑ |
| `LIVE-L13` | HIGH | Password-reset user-enumeration via timing | Y | ☑ |
| `LIVE-L14` | MED | Sequential integer user IDs leak userbase size | Y | ⊘ |
| `LIVE-L15` | MED | Sequential integer resource IDs (same risk) | Y | ⊘ |
| `LIVE-L16` | MED | Pagination `limit` accepted at any size | Y | ☑ |
| `LIVE-L17` | LOW | Garbage pagination params silently accepted | Y | ⊘ |
| `LIVE-L18` | LOW | No account-deletion endpoint (GDPR gap) | Y | ⊘ |
| `LIVE-L19` | LOW | Email verification not enforced | Y | ☑ |
| `LIVE-L20` | INFO | Profile-update silently drops unknown fields | Y | ⊘ |

> **Stage 4 handler/misc batch status (2026-05-15):** `M1` — `Content-Disposition` filenames are sanitized (control chars / quote / backslash stripped) so an upload filename cannot inject response headers. `M7` — migration `000005` creates the `audit_log` table (no prior migration did — `CREATE TABLE IF NOT EXISTS`) and makes it append-only via BEFORE UPDATE/DELETE triggers. `M11` — `golang.org/x/crypto` bumped `v0.49.0 → v0.51.0`. `M13` — `DELETE /api/auth/account` now carries the Traefik `auth-rate-limit` edge limiter. `LIVE-L18` ⊘ — not a real gap: the endpoint **exists** at `DELETE /api/auth/account/` (`router.go:546`); the live scan probed `/api/auth/me/`, `/auth/delete/`, `/users/me/` and missed it. **Update (2026-05-15):** items shown as deferred in an earlier draft were then completed — `LIVE-L1` (`/metrics` rejects proxied/public requests via an `X-Forwarded-For` check, so only the in-cluster vmagent scrape reaches it), `M6`/`LIVE-L16` (the document/contractor list repos already hard-cap at 500 rows), and `LIVE-L19` (verified-email gating on share-code generation via the new `RequireVerified` middleware). `LIVE-L17` (inert pagination params, results capped) and `LIVE-L20` (whitelist profile update is the correct pattern) are closed as no-security-impact (⊘). The master index above is authoritative.

> **Stage 4 races batch status (2026-05-15):** `C9`/`H9` — share-code redemption is now one locked transaction in `ResidenceRepository.JoinWithShareCode` (lock the code row, re-check validity, add member, deactivate — a deactivation failure aborts the join).
> `C11` — the task-completion duplicate-row race was *already* closed: the completion insert and the optimistically-version-locked task update share one transaction, so a concurrent completion fails `ErrVersionConflict` and rolls back its inserted row; no `UNIQUE(task_id, completed_date)` was added (it would reject legitimate same-day re-completions and risk a migration failure on existing data). `M4` — email verification's find/consume/flag writes are now one transaction. `M12` — a concurrent contractor delete now yields a clean 404. `C12` ⊘ — premise moot: the app **hard-deletes** accounts (`DeleteUserCascade`), so there is no soft-deleted user whose email lingers, and `ExistsByEmail` already blocks re-registering a *deactivated* user's email.
>
> **Stage 4 auth batch status (2026-05-15):** C1, C2, C3 done (see entries below). Rate limiting — every sensitive auth path now carries the shared Traefik `auth-rate-limit` edge limiter (login/register/forgot/reset/verify-reset/apple/google/refresh/join-with-code); login/register/forgot/reset/apple/google additionally keep the per-IP app limiter (`H1`/`H2`/`H3`/`LIVE-L12`). `H4` rejection-sampled codes, `M2` bcrypt cost 12, `L1`+`LIVE-L11` constant-time generic-error login, `L2` `no-store` on auth responses, `H7` IP/UA in auth logs, `LIVE-L13` fully-async forgot-password — all done; `go build ./...` and the `models`/`repositories`/`middleware`/`handlers`/`services` test packages pass. **Deferred:** `M3` (Apple nonce) — needs the iOS client to generate and send a nonce; server-only validation would reject every Apple login, so this is blocked on a coordinated mobile change. `H8` — the `parseTimezone` ±14h cap shipped; the "use server UTC for trial-start" half is folded into Stage 4's subscription work.
> `M5` per-account lockout (Redis) deferred — the edge + per-IP app limiters + the existing per-account password-reset counter cover the practical risk; a true per-account login lockout remains a tracked enhancement. (Superseded 2026-05-16: the per-account lockout has since shipped; see `MEDIUM-3` in the post-remediation independent review.)

### Stage 5 — CI / build pipeline

| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `K3S-F5` | HIGH | Images pinned by mutable short SHA tag, not digest | Y | ☑ |
| `K3S-F8` | MED | Secrets injected as env vars, not file mounts | Y | ☑ |
| `CODE-L5` | LOW | No image signing (cosign) in CI | Y | ◐ |

> **Stage 5 status (2026-05-15):** `CODE-M11` done — `golang.org/x/crypto` bumped `v0.49.0 → v0.51.0` (with the `x/sys`/`x/term`/`x/text` bumps `go get -u` pulled in), `go mod tidy` clean, full build + test green.
>
> **Update (2026-05-15):** `K3S-F5`/`K3S-F14`/`CODE-M10` are done — `03-deploy.sh` resolves the image digest after each push and deploys api/worker/admin/web by `@sha256:`, and redis/vmagent/`node:20-alpine` are pinned to their resolved index digests.
>
> **Update (2026-05-16):** `K3S-F8` is **done** — the `api`/`worker` Deployments mount `honeydue-secrets` as files (`defaultMode: 0400`) at `/etc/honeydue/secrets` and inject no secret as an env var; `config.loadFileSecrets` reads them; `02-setup-secrets.sh` now writes `B2_KEY_ID`/`B2_APP_KEY` into the secret, reconciling the earlier script-vs-manifest drift. `CODE-L5` stays **◐** — cosign signing and a Trivy `HIGH,CRITICAL` scan are wired (guarded) into `03-deploy.sh` and a ready-to-use Kyverno `ClusterPolicy` ships at `deploy-k3s/manifests/kyverno-verify-images.yaml`; closing it needs the operator to install Kyverno and supply a cosign key. See both entries.

### Stage 6 — Post-deploy verification & runtime investigations

`V1`–`V13` — see [Stage 6](#stage-6--post-deploy-verification--runtime-investigations).

---

## Stage 0 — DNS & Cloudflare edge

External state at Cloudflare.
Not touched by `03-deploy.sh`, so a redeploy neither breaks nor re-applies these — do them once and leave them. Tracked here so they are never forgotten on a domain move or DNS migration.

### `LIVE-L2` — Add DMARC record · HIGH · ⊘

- **Operator decision (2026-05-16):** declined for this cycle. A DMARC record is an ordinary DNS TXT record — it is **not** gated behind a paid Cloudflare plan and can be added on Free. This remains a real email-spoofing gap; revisit when DNS is next touched.
- **Where:** Cloudflare DNS, TXT record at `_dmarc.myhoneydue.com`.
- **Fix:** Publish `v=DMARC1; p=quarantine; rua=mailto:dmarc@myhoneydue.com; ruf=mailto:dmarc@myhoneydue.com; fo=1; aspf=s; adkim=s`. Start at `pct=10` for 30 days, watch the `rua` aggregate reports, then ramp to `pct=100` and finally `p=reject`.
- **Verify:** `dig +short TXT _dmarc.myhoneydue.com` returns the record.

### `LIVE-L3` — Tighten SPF from `?all` to `-all` · MEDIUM · ⊘

- **Operator decision (2026-05-16):** declined for this cycle. SPF is an ordinary DNS TXT record, editable on any Cloudflare plan including Free. The `?all` (neutral) qualifier leaves spoofed mail un-penalised; revisit alongside `LIVE-L2`.
- **Where:** Cloudflare DNS, TXT record at `myhoneydue.com`.
- **Fix:** Change `v=spf1 include:spf.messagingengine.com ?all` → `~all` for ~7 days, confirm no legitimate mail (CI, transactional) is missed, then `-all`. Do this **after** `LIVE-L2`'s DMARC ramp begins.
- **Verify:** `dig +short TXT myhoneydue.com | grep spf` shows `-all`.

### `LIVE-L4` — Add CAA records · MEDIUM · ⊘

- **Operator decision (2026-05-16):** declined for this cycle. CAA is an ordinary DNS record type, addable on any Cloudflare plan including Free. Without it, any public CA may issue a cert for the domain; revisit when DNS is next touched.
- **Where:** Cloudflare DNS, apex `myhoneydue.com`.
- **Fix:** Add `0 issue "letsencrypt.org"`, `0 issuewild "letsencrypt.org"`, `0 iodef "mailto:security@myhoneydue.com"`.
  Add `0 issue "pki.goog"` only if Google Trust Services is used anywhere. Confirm against the CAs Cloudflare Universal SSL actually uses before locking down.
- **Verify:** `dig +short CAA myhoneydue.com` returns the records.

### `LIVE-L6` — Publish `security.txt` · LOW · ☐ · In-repo: Y

- **Where:** served by the Go API and/or Next.js apps at `/.well-known/security.txt` (RFC 9116) — committed route, so it survives redeploys.
- **Fix:** Serve `Contact:`, `Expires:`, `Preferred-Languages:`, `Canonical:` on both `api.myhoneydue.com` and the apex.
- **Verify:** `curl https://api.myhoneydue.com/.well-known/security.txt` → 200.

### `LIVE-L9` — Review Cloudflare caching of the admin SSR shell · INFO · ☐

- **Where:** Cloudflare cache rules for `admin.myhoneydue.com`.
- **Fix:** `cache-control: s-maxage=31536000` on admin SSR pages means Cloudflare caches the admin shell for a year. Confirm this is intentional; if the admin shell ever contains per-session content, add a bypass-cache rule for `admin.myhoneydue.com`.
- **Verify:** `curl -sI https://admin.myhoneydue.com/ | grep -i cache` reflects the intended policy.

### `LIVE-L10` — Suppress `x-powered-by` · INFO · ☐ · In-repo: Y

- **Where:** Next.js config in the admin and web repos (`next.config.js` → `poweredByHeader: false`). Committed, survives redeploys.
- **Fix:** Disable the `x-powered-by: Next.js` header.
- **Verify:** `curl -sI https://admin.myhoneydue.com/ | grep -i x-powered-by` returns nothing.

---

## Stage 1 — Cluster provisioning & node OS

Run by `01-provision-cluster.sh` (which drives the `hetzner-k3s` CLI from `config.yaml` via `generate_cluster_config` in `_config.sh`) plus one-time SSH hardening on each node. **Any k3s server flag must be set in the `hetzner-k3s` cluster config so a cluster rebuild applies it.**

### `K3S-F4` — kubeconfig world-readable (mode 644 → 600) · HIGH · ☑ · In-repo: Y

- **Where:** `_config.sh` → `generate_cluster_config` → `k3s_config_file`. Node file `/etc/rancher/k3s/k3s.yaml`.
- **Done (2026-05-16):** `generate_cluster_config` now emits `write-kubeconfig-mode: "0600"` in the k3s config file, so any fresh provision writes the node kubeconfig as `0600`.
- **Operator step on the existing cluster:** a running node keeps the mode it was installed with — `ssh deploy@<node> 'sudo chmod 600 /etc/rancher/k3s/k3s.yaml'` on each. Deploy scripts still read it via `sudo`.
- **Verify:** `ssh deploy@<node> 'sudo stat -c %a /etc/rancher/k3s/k3s.yaml'` → `600`.

### `K3S-CG1` / `CODE-M9` — etcd / Secret encryption at rest · ☑ · In-repo: Y

- **Where:** `_config.sh` → `generate_cluster_config` → `k3s_config_file`.
- **Done:** the k3s config file carries `secrets-encryption: true`, so a fresh provision boots with AES Secret encryption enabled. (The `write-kubeconfig-mode` line for `K3S-F4` was added next to it on 2026-05-16.)
- **Operator step on the existing cluster:** a cluster provisioned *without* the flag does not retro-encrypt — run `k3s secrets-encrypt enable` then `k3s secrets-encrypt reencrypt` once. Tracked as `V12`.
- **Verify:** `k3s secrets-encrypt status` reports `Encryption Status: Enabled` on every server node.
- **Note:** the old `SECURITY.md` *claimed* this was already on — `04-verify.sh` greps for the string but cannot truly confirm; see `V12`.

### `K3S-CG2` — Node OS hardening · ◐ · In-repo: partial

- **Where:** `_config.sh` → `generate_cluster_config` → `post_create_commands` (runs on every node at provision).
- **Done (2026-05-16):** `post_create_commands` now installs and enables `fail2ban` (SSH brute-force bans) and `unattended-upgrades` (automatic security patching) on every node at provision time — a fresh cluster comes up hardened on both.
- **Still operator (runtime; not yet in-repo):**
  - SSH — confirm `PermitRootLogin no`, `PasswordAuthentication no`, `AllowUsers deploy`, modern ciphers/MACs/KEX. (hetzner-k3s provisions key-only SSH; verify and tighten.)
  - sysctl — confirm `net.ipv4.ip_unprivileged_port_start=0` (Traefik) and standard network-hardening sysctls.
- **Verify:** `ssh deploy@<node> 'fail2ban-client status sshd; systemctl is-enabled unattended-upgrades'`.

### `K3S-CG3` — Hetzner Cloud Firewall rules · ☐ · In-repo: N

- **Fix:** Confirm only: `:443` from Cloudflare CIDRs, `:22` from operator IP(s), `:6443` from operator IP(s). Nothing else. This is the *only* network defense for the public-IP nodes (`K3S-F15`).
- **Verify:** `hcloud firewall describe honeydue-fw` matches the intended ruleset; a direct `curl` to a node IP on `:80`/`:443` from a non-CF host times out.

### `K3S-CG4` — etcd snapshot backup · ☐ · In-repo: Y

- **Fix:** Confirm k3s etcd snapshots are enabled (default hourly) and shipped off-node — set `--etcd-s3` (to Backblaze B2) with encryption. Without offsite snapshots, a 3-node loss is unrecoverable.
- **Verify:** `ls /var/lib/rancher/k3s/server/db/snapshots/` on a node + an object in the B2 backup bucket.

### `K3S-CG5` — kubelet authn/authz flags · ☐ · In-repo: Y

- **Fix:** Confirm `--anonymous-auth=false` and `--authorization-mode=Webhook` on the kubelet (k3s defaults are usually safe — verify, don't assume). Set via k3s `kubelet-arg` in the cluster config if missing.
- **Verify:** `kubectl get --raw /api/v1/nodes/<node>/proxy/configz` shows the expected kubelet config.

### `K3S-CG6` — Container-runtime CIS baseline · ☐ · In-repo: N

- **Fix:** Run `kube-bench` once; remediate any FAIL lines that aren't k3s-by-design.
- **Verify:** `kube-bench` run archived with FAILs triaged.

### `K3S-CG7` — `deploy` user sudoers least-privilege · ☐ · In-repo: N

- **Fix:** Current `deploy ALL=(ALL) NOPASSWD: ALL` means an SSH-key compromise = node root. Scope to the commands deploys actually need (`ufw`, `systemctl`, `chmod` on k3s.yaml, `cat` of k3s.yaml). Accept the convenience trade-off only with eyes open.
- **Verify:** `ssh deploy@<node> 'sudo -l'` shows the scoped list.
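
Two different file-mode rules appear in this plan: the kubeconfig and server token must be `0600` (`K3S-F4`, `K3S-CG8`), while the Apple IAP key check only rejects world-accessible files (`perm & 0o007`, `MEDIUM-2`). The sketch below shows how the two bit masks differ. The function names are hypothetical and it is not the code in `iap_validation.go`.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
)

// ownerOnly reports whether the permission bits grant nothing to
// group or other (true for 0600 and 0400, false for 0644 and 0440).
func ownerOnly(mode fs.FileMode) bool {
	return mode.Perm()&0o077 == 0
}

// worldAccessible reports whether any "other" permission bit is set,
// which is the condition the relaxed check rejects.
func worldAccessible(mode fs.FileMode) bool {
	return mode.Perm()&0o007 != 0
}

func main() {
	// Usage: pass one or more file paths as arguments.
	for _, path := range os.Args[1:] {
		info, err := os.Stat(path)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		perm := info.Mode().Perm()
		fmt.Printf("%s mode=%04o ownerOnly=%t worldAccessible=%t\n",
			path, uint32(perm), ownerOnly(perm), worldAccessible(perm))
	}
}
```

A Secret volume file at `0440` fails `ownerOnly` but passes the world-accessible check, which is why the stricter rule could not be met inside a pod. A kubeconfig at `0644` fails both.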
### `K3S-CG8` — `/etc/rancher/k3s/` perms · ☐ · In-repo: N

- **Fix:** `/var/lib/rancher/k3s/server/token` and `/var/lib/rancher/k3s/server/node-token` must be `0600 root:root`; `/etc/rancher/k3s/` not world-traversable.
- **Verify:** `ssh deploy@ 'sudo stat -c "%a %n" /var/lib/rancher/k3s/server/token'` → `600`.

### `K3S-F15` — Nodes on public IPs, no private VPC · INFO · ⊘ · In-repo: Y

- **Decision:** Accepted for now. Defense is `K3S-CG3` (Hetzner firewall) only. To remediate later: attach a Hetzner private network, re-IP the cluster, move etcd/kubelet/Flannel onto it. Substantial re-provision — track on the roadmap, not this cycle.

### `K3S-F16` — All nodes are control-plane + etcd + worker · INFO · ⊘

- **Decision:** Accepted — standard small-cluster k3s. Revisit (dedicated workers + `NoSchedule` taint on control-plane) when workload pressure grows. No redeploy action.

### `K3S-F17` — Single-replica SPOFs · INFO · ☐ · In-repo: Y

- **Where:** `deploy-k3s/manifests/worker/deployment.yaml`, `redis/`, `admin/`, `observability/vmagent.yaml`.
- **Fix:** `worker` → `replicas: 2` (stateless, Asynq at-least-once — safe now). `admin`/`vmagent` → 2 if zero-downtime restart is wanted. `redis` is stateful — true HA needs Sentinel or managed Redis; track separately, do not naively scale.
- **Verify:** `kubectl -n honeydue get deploy` shows `worker 2/2`.

---

## Stage 2 — Secrets & config bootstrap

Run by `02-setup-secrets.sh`, which reads `deploy-k3s/config.yaml` and the `secrets/` directory. **Both `K3S-F1` and `K3S-F3` are open purely because `config.yaml` lacks the values — the script already supports them.**

### `K3S-F1` — Redis runs with no authentication · CRITICAL · ☐ · In-repo: Y

- **Where:** `deploy-k3s/config.yaml` key `redis.password`. `02-setup-secrets.sh:53,68-71` includes `REDIS_PASSWORD` in `honeydue-secrets` only when that key is non-empty; `redis/deployment.yaml` adds `--requirepass` only when the env var is non-empty.
- **Fix:** Set `redis.password` in `config.yaml` to a strong value (`openssl rand -base64 32`). Re-run `02-setup-secrets.sh`. `api`/`worker` already consume `REDIS_PASSWORD`.
- **Verify:** `kubectl -n honeydue exec deploy/redis -- redis-cli ping` → `NOAUTH`; with `-a "$REDIS_PASSWORD"` → `PONG`.
- **Redeploy-clean:** committing the value to `config.yaml` means every future `02-setup-secrets.sh` re-creates the authenticated secret. (If `config.yaml` is gitignored, store the value in the operator's secret store and document it here.)

### `K3S-F3` — `admin-basic-auth` secret never created · HIGH · ☐ · In-repo: Y

- **Where:** `config.yaml` keys `admin.basic_auth_user` / `admin.basic_auth_password`. `02-setup-secrets.sh:54-55,132-143` creates the `admin-basic-auth` secret (bcrypt htpasswd) only when both are set, else it warns and skips.
- **Fix:** Set both keys. Re-run `02-setup-secrets.sh`. **Must be done before `K3S-F2`** — attaching `admin-auth` to the ingress with the secret missing makes Traefik 503 the admin route.
- **Verify:** `kubectl -n honeydue get secret admin-basic-auth`.

### `K3S-F8` (Stage 2 half) — `B2_KEY_ID` / `B2_APP_KEY` in `honeydue-secrets` · ☑ · In-repo: Y

- **Where:** `02-setup-secrets.sh`.
- **Done (2026-05-16):** the script now reads `storage.b2_key_id` / `storage.b2_app_key` from `config.yaml` and adds `B2_KEY_ID` / `B2_APP_KEY` to `honeydue-secrets`. Previously the `api`/`worker` manifests referenced these keys but the script never created them — a latent deploy break. See the full `K3S-F8` entry in Stage 5.
- **Verify:** `kubectl -n honeydue get secret honeydue-secrets -o jsonpath='{.data.B2_KEY_ID}'` is non-empty.

### `K3S-F12` — Secret rotation runbook · MEDIUM · ☐ · In-repo: Y

- **Where:** new doc `docs/runbooks/secret-rotation.md`.
- **Fix:** Document per-secret rotation (Postgres, `SECRET_KEY`, APNs `.p8`, FCM, B2, observability token, Redis, admin basic-auth). Annual minimum; immediate on suspected exposure or operator-device loss.
  For `SECRET_KEY` (JWT signing) plan an overlap window so live tokens validate across the change. Add a `last-rotated` annotation to each secret.
- **Verify:** runbook exists and the first rotation is logged.

### `CODE-C4` — `DEBUG_FIXED_CODES` "123456" auth bypass · CRITICAL · ☐ · In-repo: Y

- **Where:** `internal/services/auth_service.go:141-145,385-390,432-435,470-473,503-504`; config in `internal/config/config.go`. ConfigMap generated from `config.yaml` by `03-deploy.sh`.
- **Fix (two layers):** (1) Code — refuse to start if `ENV=production && DebugFixedCodes` (Stage 4 code change). (2) Config — ensure `config.yaml` never sets `DEBUG_FIXED_CODES=true` for prod, and the generated ConfigMap omits it.
- **Verify:** prod ConfigMap has no `DEBUG_FIXED_CODES`; a prod boot with the flag set fails fast.

### `CODE-M8` — `SECRET_KEY` hardcoded debug fallback · MEDIUM · ☐ · In-repo: Y

- **Where:** `internal/config/config.go:437-442` falls back to `"change-me-in-production-secret-key-12345"`.
- **Fix:** Remove the static fallback — generate a per-boot random key in debug, and **refuse to start** in production if `SECRET_KEY` is unset. (`02-setup-secrets.sh:46-49` already enforces ≥32 chars for the real secret — keep that.)
- **Verify:** prod boot with no `SECRET_KEY` exits non-zero; the fallback string is gone from the binary.

---

## Stage 3 — Kubernetes manifests

Committed under `deploy-k3s/manifests/` and applied by `03-deploy.sh`. **Any fix here is automatically re-applied on every redeploy** — the highest-value stage for "redeploy clean."

### `K3S-F2` / `CODE-L6` — Wire defense-in-depth onto the admin ingress · HIGH · ☐

- **Where:** `deploy-k3s/manifests/ingress/ingress-simple.yaml` — admin route annotation.
- **Fix:** Add `cloudflare-only` and `admin-auth` to the `traefik.ingress.kubernetes.io/router.middlewares` annotation alongside the existing `security-headers` + `rate-limit`. **Do `K3S-F3` first** or Traefik 503s the route.
- **Verify:** `04-verify.sh` "Cloudflare-Only Middleware" check passes; `admin.myhoneydue.com` prompts for basic auth.

### `K3S-F6` — `imagePullSecrets` name consistency · HIGH · ☐

- **Where:** all `deploy-k3s/manifests/*/deployment.yaml`, `migrate/job.yaml`; secret created by `02-setup-secrets.sh:111` as `ghcr-credentials`.
- **Fix:** The registry is Gitea — `ghcr-credentials` is a misleading name and the live cluster currently also has a hand-made `gitea-credentials`. Pick one name (`gitea-credentials` is clearer), use it in **both** the script and **every** manifest, and delete the orphan. The defect is a name *mismatch*, not a missing fix — make script + manifests agree so a pull never fails on a fresh node.
- **Verify:** `grep -rl imagePullSecrets deploy-k3s/manifests/` all reference one name == the script's; cordon a node, delete a pod, confirm the replacement pulls.

### `K3S-F7` — `vmagent` container `securityContext` · MEDIUM · ☐

- **Where:** `deploy-k3s/manifests/observability/vmagent.yaml`.
- **Fix:** Add the container-level block the other 5 deployments already have: `allowPrivilegeEscalation: false`, `capabilities.drop: [ALL]`, `readOnlyRootFilesystem: true`. Its volumes (`/etc/vmagent`, `/etc/vmagent-secrets`, `/tmp/vmagent` emptyDir) already support read-only root.
- **Verify:** `04-verify.sh` "Pod Security Contexts" reports OK for `vmagent`.

### `K3S-F9` / `LIVE-L8` — CSP + cross-origin headers · MEDIUM / LOW · ☐

- **Where:** Cross-origin trio → `deploy-k3s/manifests/ingress/middleware.yaml` (`security-headers`). CSP `object-src`/`base-uri` → Go app CSP middleware (Stage 4, `LIVE-L8` code half).
- **Important correction:** `K3S-F9` originally said CSP was missing. The live scan **disproved** that — the Go app sets a strong CSP via app middleware.
  So `K3S-F9` reduces to: add `Cross-Origin-Opener-Policy: same-origin` and `Cross-Origin-Resource-Policy: same-origin` (and `Cross-Origin-Embedder-Policy: require-corp` only if it doesn't break embeds) to `security-headers`. The CSP `object-src 'none'; base-uri 'self'` additions belong in the app and are tracked under `LIVE-L8` in Stage 4.
- **Verify:** `curl -sI https://api.myhoneydue.com/api/health/ | grep -i cross-origin` shows COOP/CORP.

### `K3S-F10` / `LIVE-L12` — Auth-endpoint rate-limit middleware · MEDIUM / HIGH · ☐

- **Where:** `deploy-k3s/manifests/ingress/middleware.yaml` (new `auth-rate-limit` Middleware) + `ingress/ingress-simple.yaml`. Requires migrating the auth paths from vanilla `Ingress` to a Traefik `IngressRoute` to apply a per-path middleware.
- **Fix:** New Middleware `average: 5, burst: 10, period: 1m, sourceCriterion.ipStrategy.depth: 2` (depth 2 for the Cloudflare hop). Apply to `/api/auth/login`, `/api/auth/register`, `/api/auth/forgot-password`, `/api/auth/reset-password`, `/api/residences/join-with-code`. This is the **edge** half; the **app** half is `CODE-H1/H2/H3/M5` in Stage 4 (per-account lockout in Redis). Do both — edge limit alone resets on IP rotation.
- **Verify:** a burst of more than 10 rapid logins from one IP (exceeding `burst: 10`) → `429`.

### `K3S-F11` — Disable `automountServiceAccountToken` · MEDIUM · ☐

- **Where:** `deploy-k3s/manifests/rbac.yaml` (ServiceAccounts) and/or each `*/deployment.yaml` pod spec.
- **Fix:** Set `automountServiceAccountToken: false` on `api`, `admin`, `worker`, `web`, `redis`. Leave `true` only for `vmagent` (it uses the k8s API for service discovery). **Note:** `05-security.md` claims this is already set — the audit (`F11`) says it is not. Treat the audit as ground truth; this fix makes the doc true.
- **Verify:** `kubectl -n honeydue get pod -o jsonpath='{.spec.automountServiceAccountToken}'` → `false`; no token file in the container.
### `K3S-F13` — Add `app.myhoneydue.com` to CORS · LOW · ☐

- **Where:** `CORS_ALLOWED_ORIGINS` in `config.yaml` → generated into `honeydue-config` ConfigMap by `03-deploy.sh`.
- **Fix:** Confirm whether the web app calls `api.myhoneydue.com` directly from the browser. If yes, add `https://app.myhoneydue.com` to `CORS_ALLOWED_ORIGINS`. If it proxies through Next.js server-side, CORS is moot — record that decision here instead.
- **Verify:** browser fetch from `app.myhoneydue.com` to the API succeeds (or the proxy decision is documented).

### `K3S-F14` — Pin public images by digest · LOW · ☐

- **Where:** `redis/deployment.yaml` (`redis:7-alpine`), `observability/vmagent.yaml` (`victoriametrics/vmagent:v1.106.1`).
- **Fix:** Replace tags with `@sha256:` digests. Folded into the `K3S-F5` CI work (Stage 5).
- **Verify:** manifests contain no public-image tag without a digest.

### `LIVE-L5` / `CODE-L3` — HSTS `preload` · LOW · ☐

- **Where:** `deploy-k3s/manifests/ingress/middleware.yaml` `security-headers` HSTS value.
- **Fix:** Change to `max-age=63072000; includeSubDomains; preload`. Confirm api/admin/app all work fully over HTTPS, then submit to `hstspreload.org` (the submission is the Stage 0 external half — once preloaded you cannot easily downgrade for ~6 months).
- **Verify:** response header shows `preload`; domain accepted at hstspreload.org.

### `LIVE-L7` — Drop deprecated `X-XSS-Protection` · LOW · ☐

- **Where:** `deploy-k3s/manifests/ingress/middleware.yaml` `security-headers` (`browserXssFilter: true` / `customResponseHeaders`).
- **Fix:** Remove the header or set `X-XSS-Protection: "0"`. Modern browsers ignore it, and bypasses of the legacy browser filter have themselves caused XSS.
- **Verify:** header absent or `0` on all three hosts.

### `CODE-L4` — Set `imagePullPolicy` · LOW · ☐

- **Where:** all `deploy-k3s/manifests/*/deployment.yaml`.
- **Fix:** Set `imagePullPolicy` explicitly.
  Once images are digest-pinned (`K3S-F5`), `IfNotPresent` is correct and avoids needless re-pulls; until then `Always` avoids stale tags. Pick the policy that matches the `K3S-F5` rollout state.
- **Verify:** every container has an explicit `imagePullPolicy`.

---

## Stage 4 — Application code & container images

Fixes in `honeyDueAPI-go` source (and the admin/web Dockerfiles). They reach production by **rebuilding the image** in `03-deploy.sh`; schema-changing fixes (`CODE-C1`, `CODE-C5/6`, `CODE-C11`, `CODE-C12`) also need a **goose migration**, which the migrate `Job` runs automatically before the api/worker rollout. Per repo rule: do not auto-commit — these are code changes; this section is the plan, not the patch.

### Critical (C1–C13)

#### `CODE-C1` — Plaintext auth tokens in DB · ☑ (2026-05-15)

- **Where:** `internal/models/user.go`, `internal/repositories/user_repo.go`, `internal/middleware/auth.go`, `internal/services/cache_service.go`, `internal/services/auth_service.go`, migration `000003_hash_auth_tokens.sql`.
- **Done:** `user_authtoken.key` now stores `models.HashToken()` — the hex SHA-256 of the token — never the raw value. The raw token reaches the client once (the non-persisted `AuthToken.Plaintext` field) and is re-hashed on every request before the DB and Redis lookup, so the single indexed JOIN query in the auth middleware is preserved. A fast hash (not bcrypt) is correct here — tokens are 160-bit random values, nothing to brute-force. Migration `000003` widens the column 40→64 and clears existing rows.
- **Behaviour change:** the server can no longer re-issue a stored token's plaintext, so every login mints a fresh token via `CreateFreshToken` (delete + create). With the existing one-token-per-user schema this means **one active session per user** — logging in on a new device invalidates the previous device's token. The migration also invalidates all sessions once, at deploy.
- **Verify:** `SELECT key FROM user_authtoken LIMIT 1` → 64-char hash; `go build ./...` and `go test ./internal/{models,repositories,middleware,handlers}/...` pass.

#### `CODE-C2` / `CODE-C3` — Google ID token not verified locally · ☑ (2026-05-15)

- **Where:** `internal/services/google_auth.go` (full rewrite).
- **Done:** `VerifyIDToken` no longer calls the deprecated `tokeninfo` URL (which leaked the token in the query string and made verification depend on a third party). It now parses the JWT, fetches Google's JWKS from `googleapis.com/oauth2/v3/certs` (Redis-cached 24h, re-fetched on a `kid` miss), verifies the `RS256` signature locally, and asserts `iss ∈ {accounts.google.com, https://accounts.google.com}` (C3), `aud`/`azp` against the configured client IDs, and `exp` (validated by jwt v5). Mirrors the existing Apple JWKS verifier. `GoogleSignIn` is unchanged — the returned `GoogleTokenInfo` shape is preserved.
- **Verify:** `go build ./...` clean; `internal/services` tests pass.

#### `CODE-C5` / `CODE-C6` — IAP receipt / purchase-token replay · ☐

- **Where:** `internal/services/subscription_service.go` (`ProcessApplePurchase`, `ProcessGooglePurchase`).
- **Fix:** Goose migration adding `UNIQUE(provider, original_transaction_id)`. On purchase, if the transaction ID is already bound to a different `user_id` → `403`.
- **Verify:** re-submitting a valid receipt against a second account → `403`; DB has no duplicate.

#### `CODE-C7` — File-ownership check excludes residence owners · ☐

- **Where:** `internal/services/file_ownership_service.go:20-66`.
- **Fix:** Replace the three `residence_residence_users`-only JOINs with the canonical owner-OR-member UNION from `residence_repo.HasAccess` (owners live in `residence_residence.owner_id`).
- **Verify:** a residence owner can delete a file in their own property; a non-member still gets `403`.
#### `CODE-C8` — Device-token cross-account hijack · ☐

- **Where:** `internal/services/notification_service.go:307-319` (APNS), `:336-349` (GCM).
- **Fix:** On re-register of an existing token, if `existing.UserID != nil && *existing.UserID != userID` → `409 Conflict`. Only same-user updates allowed.
- **Verify:** registering another user's known token → `409`; that user's push traffic is unaffected.

#### `CODE-C9` / `CODE-H9` — Share-code join not atomic · ☐

- **Where:** `internal/services/residence_service.go:562-615` (`:594-599` swallows the deactivate error).
- **Fix:** Wrap `JoinWithCode` in one transaction with `SELECT … FOR UPDATE` on the share-code row; **fail the join if deactivation fails** (do not log-and-continue).
- **Verify:** concurrent redemptions of a single-use code → exactly one succeeds; a forced deactivate error rolls the whole join back.

#### `CODE-C10` — Subscription upgrade race · ☐

- **Where:** `internal/services/subscription_service.go:404-459`; webhook handler `:136-213`.
- **Fix:** Move Apple validation inside the row-locked transaction, or add an idempotency-key table so the validate→write window can't be raced.
- **Verify:** two concurrent upgrades for one user → one tier change, not two.

#### `CODE-C11` — Task-completion duplicate-row race · ☐

- **Where:** `internal/services/task_service.go:631-750`.
- **Fix:** `SELECT … FOR UPDATE` on the task in `CreateCompletion`; goose migration adding `UNIQUE(task_id, completed_date)`.
- **Verify:** double-tap "complete" → one completion row.

#### `CODE-C12` — Soft-deleted email reusable · ☐

- **Where:** `internal/services/auth_service.go:274-324`; `internal/repositories/user_repo.go` (`FindByEmail`, `ExistsByEmail`).
- **Fix:** On delete, mangle the email (`deleted__`); add `is_active = true` filtering consistently to `FindByEmail`/`ExistsByEmail`.
- **Verify:** registering with a soft-deleted account's email is rejected; no cross-account takeover.
#### `CODE-C13` — Apple webhook user lookup may LIKE-match · ☐

- **Where:** `internal/handlers/subscription_webhook_handler.go:354-366` (`FindByAppleReceiptContains`).
- **Fix:** Confirm the SQL is an equality match, not `LIKE`. If `LIKE`, this is a confirmed Critical — change to equality and rename the function. See `V8`.
- **Verify:** the query is parameterized equality; rename merged.

### High (H1–H9)

#### `CODE-H1` / `CODE-H2` / `CODE-H3` / `CODE-M5` — Rate limiting gaps · ☐

- **Where:** `internal/router/router.go` (`:520` login limiter, `:593` `join-with-code` unprotected), `internal/middleware/rate_limit.go`, `internal/handlers/auth_handler.go`.
- **Fix:** Extend rate limiting to `register`, `join-with-code`, Apple/Google sign-in, and token refresh. Add a per-account login-attempt counter in Redis (lock after 5–10 fails for 15–60 min). This is the **app** half of the consolidated auth-rate-limit item; the **edge** half is `K3S-F10`.
- **Verify:** rapid attempts on every auth route throttle; per-account lockout fires regardless of source IP.

#### `CODE-H4` — Modulo bias in 6-digit codes · ☐

- **Where:** `internal/services/auth_service.go:884-892`.
- **Fix:** Replace `int32 % 1000000` with rejection sampling on `crypto/rand` for a uniform `000000–999999`.
- **Verify:** distribution test over many samples is uniform.

#### `CODE-H5` — Apple IAP `.p8` file-mode unchecked · ☐

- **Where:** `internal/services/iap_validation.go:93-128`, `internal/config/config.go:325`.
- **Fix:** Prefer a base64 env-injected PEM. If a file path is kept, refuse to start when the file mode is more permissive than `0600`.
- **Verify:** boot fails on a `0644` key file; succeeds on `0600`.

#### `CODE-H6` — Webhook dedup fail-open · ☐

- **Where:** `internal/handlers/subscription_webhook_handler.go:165-173` (Apple), `:564-574` (Google).
- **Fix:** Fail **closed** — if `webhookEventRepo.HasProcessed` errors, return `500` so Apple/Google retry, rather than processing (which risks duplicate refunds).
- **Verify:** simulated dedup-check DB error → `500`, no double-processing.

#### `CODE-H7` — Auth-failure log lacks IP/UA · ☐

- **Where:** `internal/handlers/auth_handler.go:70`.
- **Fix:** Add `c.RealIP()` + `User-Agent` to the structured failure log line (the audit log captures them; the request-line log does not). Depends on `V10` (RealIP trust).
- **Verify:** a failed login log line carries IP + UA.

#### `CODE-H8` — `X-Timezone` header trusted for trial start · ☐

- **Where:** `internal/middleware/timezone.go:40-71` → `internal/services/subscription_service.go:145-150`.
- **Fix:** Validate `X-Timezone` against IANA `LoadLocation`, cap to ±14h; use server UTC for trial-start / billing-window math regardless.
- **Verify:** a bogus/extreme `X-Timezone` cannot shift trial start.

### Medium (M1–M13)

#### `CODE-M1` — Header injection via `Content-Disposition` filename · ☐

- **Where:** `internal/handlers/media_handler.go:74,117,165`.
- **Fix:** Sanitize `doc.FileName` — strip CR/LF/quote/null, or emit RFC 5987 `filename*=UTF-8''…`.
- **Verify:** an upload with CRLF in the filename does not split the response.

#### `CODE-M2` — bcrypt cost 10 → 12 · ☐

- **Where:** `internal/models/user.go:47`, `internal/services/auth_service.go:479`.
- **Fix:** Make the cost config-driven, default 12.
- **Verify:** new hashes are `$2a$12$`.

#### `CODE-M3` — Apple Sign In nonce not validated · ☐

- **Where:** `internal/services/apple_auth.go`.
- **Fix:** Generate, store, and verify the nonce round-trip on Apple sign-in.
- **Verify:** a replayed/mismatched nonce is rejected.

#### `CODE-M4` — Email verification not atomic · ☐

- **Where:** `internal/services/auth_service.go:373-415`.
- **Fix:** Wrap verify in a transaction so a concurrent request can't double-apply.
- **Verify:** concurrent verify calls → one state transition.
#### `CODE-M6` / `LIVE-L16` — Uncapped list / pagination · ☐

- **Where:** `ListDocuments`, `ListContractors`, `ListResidences` handlers; pagination parsing.
- **Fix:** Clamp `limit` server-side to ≤100 (`< 1` → default 25). The notifications list already caps at 200 — match the pattern.
- **Verify:** `?limit=999999` returns ≤100 rows.

#### `CODE-M7` — Audit log not append-only · ☐

- **Where:** audit-log model / repository.
- **Fix:** Make it append-only — a DB trigger forbidding `UPDATE`/`DELETE`, or move to an event store. Remove the soft-delete column.
- **Verify:** an `UPDATE`/`DELETE` on the audit table is rejected.

#### `CODE-M11` — `golang.org/x/crypto` outdated · ☐

- **Where:** `go.mod:30` (`v0.49.0`).
- **Fix:** `go get -u golang.org/x/crypto`, re-run `govulncheck`, retest. Pairs with Stage 5 dependency automation.
- **Verify:** `govulncheck ./...` clean.

#### `CODE-M12` — Contractor toggle refetch race · ☐

- **Where:** `internal/services/contractor_service.go:279-307`.
- **Fix:** Do the toggle + read in one transaction so a concurrent soft-delete can't make it return `nil`.
- **Verify:** concurrent toggle + delete → defined result, no nil panic.

#### `CODE-M13` — Account-deletion endpoint not rate-limited · ☐

- **Where:** `internal/handlers/auth_handler.go:488-539`.
- **Fix:** Add a throttle to `DELETE /account`. **First resolve `V11`** — `LIVE-L18` claims no delete endpoint exists; reconcile before deciding whether this is "rate-limit it" or "expose it."
- **Verify:** repeated delete calls throttle.

#### `CODE-M10` — `node:20-alpine` floating tag · ☐

- **Where:** admin/web `Dockerfile` (`:2,112,134`).
- **Fix:** Pin to a specific patch version or digest.
- **Verify:** Dockerfile has no bare `node:20-alpine`.

### Low / Info (CODE-L1, L2)

#### `CODE-L1` — Inactive-account login enumeration · ☐

- **Where:** `internal/services/auth_service.go:76-77`.
- **Fix:** Return the same generic error for inactive accounts as for invalid credentials.
- **Verify:** inactive vs.
  wrong-password responses are byte-identical.

#### `CODE-L2` — Auth responses lack `Cache-Control: no-store` · ☐

- **Where:** `internal/handlers/auth_handler.go` (Login / CurrentUser / Refresh).
- **Fix:** Set `Cache-Control: no-store` on auth responses.
- **Verify:** the header is present.

### Live-scan code-level findings (LIVE-L1, L11–L20)

#### `LIVE-L1` — `/metrics` publicly exposed · HIGH · ☐

- **Where:** `cmd/api/main.go` route registration; vmagent scrapes it cluster-internally already.
- **Fix (recommended — Option B):** bind Prometheus metrics to a separate cluster-internal port (e.g. `:9090`), expose only via a ClusterIP Service the vmagent NetworkPolicy allows; the public Ingress never registers `/metrics`. Update `observability/vmagent.yaml` scrape target. (Alternative: block `/metrics` at Traefik via an `IngressRoute` — Stage 3.)
- **Verify:** `curl https://api.myhoneydue.com/metrics` → `404`; vmagent still scrapes successfully.

#### `LIVE-L11` — Login user-enumeration via timing · HIGH · ☐

- **Where:** login handler / `auth_service.go`.
- **Fix:** Always run a bcrypt compare against a fixed dummy hash when the user is not found, so the response time is constant.
- **Verify:** real vs. fake email login timing delta < network noise.

#### `LIVE-L12` — No rate-limit on login · HIGH · ☐

- See the consolidated auth-rate-limit item: `K3S-F10` (edge) + `CODE-H1/H2/H3/M5` (app). Closed when both land.

#### `LIVE-L13` — Password-reset timing enumeration · HIGH · ☐

- **Where:** `forgot-password` handler.
- **Fix:** Enqueue the reset email on the Asynq queue and return the generic response immediately, so real vs. fake emails have identical latency.
- **Verify:** real vs. fake email reset timing delta < network noise.

#### `LIVE-L14` / `LIVE-L15` — Sequential integer IDs · MEDIUM · ⊘ (deferred)

- **Where:** all user-facing IDs.
- **Decision:** Real enumeration/intel leak, but migrating to UUID/ULID touches API, web, mobile, and webhook payloads.
  **Deferred to a planned quarter** — not a redeploy-stage fix. Track on the roadmap; revisit before the userbase size becomes commercially sensitive.

#### `LIVE-L16` — Pagination `limit` uncapped · MEDIUM · ☐

- Duplicate of `CODE-M6` — closed with it.

#### `LIVE-L17` — Garbage pagination params silently accepted · LOW · ☐

- **Where:** query-param parsing in list handlers.
- **Fix:** Return `400` naming the bad parameter instead of silently using defaults.
- **Verify:** `?limit=abc` → `400`.

#### `LIVE-L18` — No account-deletion endpoint (GDPR) · LOW · ☐

- **Where:** `internal/router/router.go`, `internal/handlers/auth_handler.go`.
- **Fix:** Reconcile with `CODE-M13` first (`V11`). Provide `DELETE /api/auth/me/` that anonymizes PII, cascades/transfers residences, revokes tokens, and writes an audit-trail row. Also closes the throwaway-account cleanup gap the live scan left behind.
- **Verify:** an authenticated user can delete their own account; PII is anonymized.

#### `LIVE-L19` — Email verification not enforced · LOW · ☐

- **Where:** router middleware.
- **Fix:** Add a `RequireVerified()` middleware on sensitive routes (share-code generation/redemption, anything that emails other users), or cap unverified accounts (1 residence, no share codes) until verified.
- **Verify:** an unverified account is blocked from the chosen gated routes.

#### `LIVE-L20` — Profile-update silently drops unknown fields · INFO · ☐

- **Where:** `PATCH /api/auth/profile/` handler.
- **Fix:** Either accept the fields (if intended) or return `400` listing unsupported keys — don't silently `200`.
- **Verify:** an unknown field yields a clear response.

#### `LIVE-L10` — `x-powered-by` — see Stage 0 (Next.js config).

---

## Stage 5 — CI / build pipeline

Build-time controls. Where there is no CI pipeline file yet, the fix is to add one (or a `03-deploy.sh` step) so the control runs on every build.
### `K3S-F5` / `K3S-F14` / `CODE-L4` — Pin images by digest · HIGH · ☐

- **Where:** `03-deploy.sh` (currently tags by git short SHA, lines 47/57-61, and also pushes `:latest`), all `deploy-k3s/manifests/*/deployment.yaml`.
- **Fix:** After `docker push`, capture the digest (`crane digest …` or parse `docker push` output) and substitute `@sha256:…` into the manifests instead of `IMAGE_PLACEHOLDER` tags. Pin `redis` and `vmagent` by digest too. Reconsider pushing `:latest` — a mutable `:latest` undercuts digest pinning.
- **Verify:** `kubectl -n honeydue get deploy -o jsonpath` shows every image as `@sha256:`.

### `K3S-F8` — Secrets as file mounts, not env vars · MEDIUM · ☑ · In-repo: Y

- **Where:** `api`/`worker` `deployment.yaml`, `internal/config/config.go`, `cmd/api/main.go`, `cmd/worker/main.go`, `02-setup-secrets.sh`.
- **Done (2026-05-16):**
  - `config.loadFileSecrets()` reads each of the 9 secret keys (`POSTGRES_PASSWORD`, `SECRET_KEY`, `EMAIL_HOST_PASSWORD`, `FCM_SERVER_KEY`, `REDIS_PASSWORD`, `B2_KEY_ID`, `B2_APP_KEY`, `OBS_INGEST_TOKEN`, `OBS_TRACES_URL`) from `/etc/honeydue/secrets/` and `viper.Set`s it (highest precedence). A missing file is a silent skip, so the same binary still works from env vars in local/dev.
  - `api`/`worker` `deployment.yaml` no longer inject **any** secret as an `env: secretKeyRef`. `honeydue-secrets` is mounted as a volume (`defaultMode: 0400`), read-only, at `/etc/honeydue/secrets`. Non-secret config still arrives via `envFrom: configMapRef`.
  - `cmd/api`/`cmd/worker` read the observability endpoints through the new `config.SecretValue()` (Viper-backed) instead of `os.Getenv`, so file-mounted `OBS_*` values resolve now that they are gone from the environment.
  - `02-setup-secrets.sh` now also writes `B2_KEY_ID`/`B2_APP_KEY` into `honeydue-secrets` — reconciling the script-vs-manifest drift (the manifests referenced these keys but the script never created them).
- **Scoped exception:** the one-shot `honeydue-migrate` Job still takes `POSTGRES_PASSWORD` as an env var. goose is invoked as a CLI with the password inside the DSN argument, so the value is exposed in that process regardless of env-vs-file; the Job is transient (one run, seconds, pod GC'd) so this is accepted.
- **Verify:** `kubectl -n honeydue exec deploy/api -- env` shows no `POSTGRES_PASSWORD`/`SECRET_KEY`; `kubectl -n honeydue exec deploy/api -- ls /etc/honeydue/secrets` lists the key files.

### `CODE-L5` — Image signing + scanning · LOW · ◐ · In-repo: Y

- **Where:** `03-deploy.sh`, `deploy-k3s/manifests/kyverno-verify-images.yaml`.
- **Done (in-repo, 2026-05-16):**
  - `03-deploy.sh` runs `cosign sign` after each push and a `trivy image --severity HIGH,CRITICAL` scan before push — both **guarded**: they no-op when the tool is absent, so they never break a deploy on a host without them.
  - A ready-to-use Kyverno `ClusterPolicy` ships at `deploy-k3s/manifests/kyverno-verify-images.yaml`. It matches only the four `gitea.treytartt.com/admin/honeydue-*` images, starts in `Audit` mode, and is **intentionally not applied by `03-deploy.sh`** — applying a verify-images policy with no key would block every Pod from scheduling.
- **Remaining (operator — cannot be committed):**
  1. Install Kyverno in the cluster (admission controller).
  2. `cosign generate-key-pair`; set `COSIGN_KEY` in the deploy env so signing activates; paste `cosign.pub` into the policy's `publicKeys` block.
  3. `kubectl apply -f deploy-k3s/manifests/kyverno-verify-images.yaml`, confirm Pods still schedule, then flip `validationFailureAction: Audit → Enforce`.
- **Verify:** an unsigned image is rejected by admission; `03-deploy.sh` fails on a HIGH/CRITICAL CVE.

### `CODE-M11` (CI half) — Dependency hygiene · ☐

- **Fix:** Add scheduled `go get -u` + `govulncheck` (the audit confirms `govulncheck` + `gitleaks` already run in CI — extend with a dependency-update cadence).
- **Verify:** stale-dependency alerts surface automatically.

---

## Stage 6 — Post-deploy verification & runtime investigations

`04-verify.sh` already runs a security block (secret encryption, NetworkPolicy count, ServiceAccounts, pod security contexts, PDBs, `cloudflare-only` middleware, `admin-basic-auth`). **Extend it so each fix above stays fixed, and work the open investigations the audits could not resolve.**

### Extend `04-verify.sh` with assertions for · ☐

- Redis rejects unauthenticated `PING` (`K3S-F1`).
- Admin ingress annotation contains `admin-auth` (`K3S-F2`).
- `/metrics` returns `404` on the public host (`LIVE-L1`).
- Every container (incl. `vmagent`) has a full `securityContext` (`K3S-F7`).
- `automountServiceAccountToken: false` on app pods (`K3S-F11`).
- Every workload image is digest-pinned (`K3S-F5`).
- No `DEBUG_FIXED_CODES` key in the prod ConfigMap (`CODE-C4`).

### Runtime investigations (cannot be closed by code review alone)

| ID | Item | Source | Action |
|---|---|---|---|
| `V1` | Apple/Google Sign-In token validation depth | LIVE | Test with a self-signed Apple identity token; confirm signature/aud/nonce checks |
| `V2` | Webhook signature verification — confirm webhook routes are **outside** the auth middleware in `router.go` (live scan saw `401`s, signature middleware may never run) | LIVE | Code-review `internal/router/router.go` |
| `V3` | File-upload security — locate upload paths, test polyglots / MIME bypass / path traversal in filename / oversized files | LIVE | Focused upload security test |
| `V4` | Long-term token validity / revocation behaviour | LIVE | Test token expiry + revocation over time |
| `V5` | Apple IAP receipt validation with a real sandbox StoreKit receipt | LIVE | Sandbox test |
| `V6` | Share-code system — find the endpoint path; test brute-force, single-use, expiration | LIVE | Locate + test |
| `V7` | Trial-expiration enforcement — age a test account past 14 days, confirm `limitations_enabled` flips and creation gates fire | LIVE | Aged-account test |
| `V8` | `FindByAppleReceiptContains` — confirm equality, not `LIKE`. If `LIKE`, escalate `CODE-C13` to confirmed Critical | CODE | SQL review |
| `V9` | Rate-limiter storage — confirm `rate_limit.go` is Redis-backed (shared across 3 api replicas); in-memory = 3× the intended limit | CODE | Code review |
| `V10` | `X-Forwarded-For` / Echo `RealIP` trust behind Traefik — without it per-IP limits collapse to the ingress IP | CODE | Code + Traefik config review |
| `V11` | Account-deletion contradiction — `LIVE-L18` (no endpoint) vs `CODE-M13` (endpoint at `auth_handler.go:488-539`). Resolve before Stage 4 planning | LIVE/CODE | Route review |
| `V12` | etcd encryption — `04-verify.sh` only greps a string; truly confirm with `k3s secrets-encrypt status` on each server node | K3S | SSH check |
| `V13` | `user_authtoken` index — confirm a `user_id` lookup index exists before hashing tokens at rest (`CODE-C1`) | CODE | Schema check |

---

## Accepted risks / deferred (this cycle)

| ID | Item | Rationale |
|---|---|---|
| `K3S-F15` | Public-IP nodes, no VPC | Re-provision-scale change; Hetzner firewall (`K3S-CG3`) is the compensating control. Roadmap. |
| `K3S-F16` | Combined control-plane/worker nodes | Standard small-cluster k3s; revisit on workload growth. |
| `LIVE-L14`/`L15` | Sequential integer IDs | UUID migration spans API + web + mobile + webhooks; planned quarter, not this cycle. |

Mirror these in `docs/deployment/20-roadmap.md` so they are not silently lost.

---

## Documentation drift corrected alongside this plan

The audits contradicted the existing deployment book.
These corrections ship with this plan so the docs match audited reality:

| Doc | Claimed | Reality (audit) | Action |
|---|---|---|---|
| `05-security.md` | `automountServiceAccountToken: false` set | `K3S-F11`: not set on any workload | Corrected to "TODO" + linked here |
| `05-security.md` | NetworkPolicies "not currently applied" (TODO) | Applied 2026-04-24; `03-deploy.sh:155` applies them | Corrected to "applied" |
| `05-security.md` | CF↔origin is plaintext (SSL=Flexible) | Upgraded to Full (strict) 2026-04-24 | Corrected |
| `05-security.md` | SHA tags immutable / "we'd notice a digest change" | `K3S-F5`: short SHA tags are mutable | Corrected; points to `K3S-F5` |
| `SECURITY.md` (old) | Redis "requires a password" | `K3S-F1`: no auth | This rewrite |
| `SECURITY.md` (old) | etcd `secrets-encryption: true` | `K3S-CG1`: not verified / not on | This rewrite |
| `SECURITY.md` (old) | fail2ban active | `05-security.md` + `K3S-CG2`: not installed | This rewrite |
| `20-roadmap.md` | — | Audit findings not represented | Audit items folded in |

---

## Hardened-redeploy checklist (run order)

A clean rebuild of the whole stack, with every fix above applied:

```
□ Stage 0  DNS once-off: DMARC, SPF, CAA at Cloudflare; security.txt route live
□ Stage 1  Provision: hetzner-k3s config carries --write-kubeconfig-mode=600
           and --secrets-encryption; run 01-provision-cluster.sh
□ Stage 1  Node OS: fail2ban + unattended-upgrades + SSH/sysctl on each node
□ Stage 1  Verify cluster: K3S-CG3..CG8 (firewall, snapshots, kubelet, perms)
□ Stage 2  Config: config.yaml has redis.password + admin.basic_auth_*;
           no DEBUG_FIXED_CODES; SECRET_KEY ≥32 chars
□ Stage 2  Secrets: run 02-setup-secrets.sh — confirm redis + admin-basic-auth
□ Stage 3  Manifests: admin ingress middlewares wired; imagePullSecret name
           consistent; vmagent securityContext; COOP/CORP headers;
           auth-rate-limit; automountServiceAccountToken:false; HSTS preload;
           X-XSS-Protection dropped; imagePullPolicy set
□ Stage 4  Code+image: all C/H/M/L code fixes committed; image rebuilt;
           goose migrations for C1/C5/C6/C11/C12 present
□ Stage 5  CI: images digest-pinned + signed + scanned; secrets file-mounted
□ Stage 6  Verify: run 04-verify.sh (extended); work V1–V13
□ Post:    Submit myhoneydue.com to hstspreload.org
```

A redeploy is "clean" only when `04-verify.sh` (extended per Stage 6) passes with zero `✗` lines and every checkbox in the master index is ☑ or ⊘.

---

## Appendix — Incident response playbooks

Preserved from the previous `SECURITY.md`; still current.

### Compromised API token

Rotate `SECRET_KEY` to invalidate all tokens, then restart api/worker:

```bash
# Generate a fresh 64-hex-character key and overwrite the secret file
openssl rand -hex 32 > secrets/secret_key.txt
# Push the new value into the cluster secret
./scripts/02-setup-secrets.sh
# Restart so api/worker pick up the new key
kubectl rollout restart deployment/api deployment/worker -n honeydue
```

(After `CODE-C1` lands, tokens are hashed at rest — a DB read no longer yields usable tokens, but `SECRET_KEY` rotation remains the kill-switch.)

### Compromised database credentials

Rotate in the Neon dashboard, update `secrets/postgres_password.txt`, re-run `02-setup-secrets.sh`, restart api/worker, watch logs for connection errors.

### Compromised push keys

APNs: revoke in Apple Developer, drop the new `.p8` into `secrets/`, re-run `02-setup-secrets.sh`, restart api/worker. FCM: rotate the key in Firebase, update `secrets/fcm_server_key.txt`, re-run, restart.

### Suspicious pod

```bash
# Placeholder: set POD to the suspect pod's name (see `kubectl get pods -n honeydue`)
POD=REPLACE_WITH_POD_NAME
# Capture evidence before the pod is removed
kubectl logs "$POD" -n honeydue > /tmp/pod-logs.txt
kubectl describe pod "$POD" -n honeydue > /tmp/pod-describe.txt
kubectl delete pod "$POD" -n honeydue   # deployment recreates it
```

### Communication

Document the timeline privately; on a data breach notify affected users within 72 hours; rotate every potentially-exposed credential; write a post-mortem (root cause, timeline, remediation, prevention).
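
### Rotation helper (sketch)

The `SECRET_KEY` rotation in *Compromised API token* writes the new key straight over the old file. A minimal sketch of a more defensive variant follows, assuming bash and `openssl` on the operator host: it checks the Stage 2 checklist rule (`SECRET_KEY` at least 32 characters) before replacing the file. `validate_secret_key` and `rotate_secret_file` are hypothetical helper names for illustration, not functions that exist in `deploy-k3s/scripts/`.

```bash
#!/usr/bin/env bash
# Sketch only: both helpers below are hypothetical, not part of the repo scripts.

# Succeeds only if the candidate key is at least 32 characters long and
# contains nothing but lowercase hex digits (the shape `openssl rand -hex 32` emits).
validate_secret_key() {
  local key="${1-}" LC_ALL=C
  [ "${#key}" -ge 32 ] || return 1
  case "$key" in
    *[!0-9a-f]*) return 1 ;;
  esac
  return 0
}

# Generates a new key, validates it, then atomically replaces the target file.
# The key is written to a temp file first, so a failed generation or validation
# never truncates the existing secret.
rotate_secret_file() {
  local target="$1" tmp key
  key="$(openssl rand -hex 32)" || return 1
  validate_secret_key "$key" || return 1
  tmp="$(mktemp "${target}.XXXXXX")" || return 1
  chmod 600 "$tmp"                 # owner-only, in case the umask is permissive
  printf '%s\n' "$key" > "$tmp"
  mv "$tmp" "$target"
}
```

For example, `rotate_secret_file secrets/secret_key.txt` would stand in for the first line of the *Compromised API token* playbook, followed by the same `02-setup-secrets.sh` run and rollout restart. Writing to a temporary file and renaming it means a failed `openssl` call leaves the previous key in place.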

---

## References

- Audit reports: `live_scan_5_12.md`, `k3_audit_5_12.md`, `security_scan_5_12.md` (repo root)
- Current architecture: `docs/deployment/05-security.md`
- Roadmap: `docs/deployment/20-roadmap.md`
- Deploy process: `docs/deployment/14-deployment-process.md`
- Scripts: `deploy-k3s/scripts/{01-provision-cluster,02-setup-secrets,03-deploy,04-verify}.sh`
- Manifests: `deploy-k3s/manifests/`