honeyDueAPI/deploy-k3s/SECURITY.md
Trey t c77ff07ce9
fix(security): remediate 2026-05-12 audit findings (Stages 2–5)
Remediation of the 2026-05-12/13 audits (78 findings + cluster gaps),
tracked in deploy-k3s/SECURITY.md, plus fixes from two independent
post-remediation reviews.

Auth & sessions:
- SHA-256 hashed auth-token storage (C1); prior-token cache eviction on
  re-login (MEDIUM-1)
- local Google JWKS verification, iss/aud/exp checks (C2/C3)
- constant-time login + generic errors (L1/LIVE-L11/LIVE-L13)
- per-account login lockout keyed on distinct source IPs (M5/MEDIUM-3)
- verified-email gating, login rate limiting (LIVE-L19, H1-H3)

IAP & webhooks:
- Apple/Google cross-account replay protection (C5/C6/C10/C13, H5/H6)
- migrations 000003-000006 (token hashing, IAP replay, audit_log +
  webhook_event_log table creation, append-only audit log)

Authorization & races:
- file-ownership owner-OR-member fix (C7), atomic share-code join
  (C9/H9), device-token reassignment (C8/LOW-3)

Secrets & deploy:
- secrets file-mounted at /etc/honeydue/secrets, not env (F8); Redis
  password out of the ConfigMap (HIGH-1); B2 keys reconciled
- digest-pinned images, admin ingress hardening, CSP/HSTS, /metrics
  lockdown; kubeconfig 0600, etcd secrets-encryption, fail2ban +
  unattended-upgrades at provision; secret-rotation runbook

Build, vet, and the full test suite (incl. -race) pass; the goose
migration chain is verified against PostgreSQL 16.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 22:28:33 -05:00

# honeyDue — Production Security Remediation Plan
This document is the **single source of truth for fixing every security
finding from the 2026-05-12/13 audits, and for keeping those fixes baked
into the stack so a full redeploy never reproduces them.**
It replaces the previous aspirational `SECURITY.md` (which described a
desired state that, per the audits, was never fully true). The accurate
*current* architecture lives in `docs/deployment/05-security.md`; this file
is the **work list**.
**Last updated:** 2026-05-16
**Audit sources (kept at repo root):**
| Tag | File | Scope | Findings |
|---|---|---|---|
| `LIVE` | `live_scan_5_12.md` | External black-box scan of api/admin/app | L1–L20 (20) |
| `K3S` | `k3_audit_5_12.md` | k3s cluster + `honeydue` namespace audit | F1–F17 (17) + 8 coverage gaps |
| `CODE` | `security_scan_5_12.md` | Static audit of `honeyDueAPI-go` | C1–C13, H1–H9, M1–M13, L1–L6 (41) |
**Total: 78 findings + 8 cluster coverage gaps + 13 runtime verification items.**
---
## How to use this document
The plan is organised by **redeploy stage**, not by severity, because the
operator's goal is: *redeploy the entire stack and come up clean.* Each
finding is tagged with where its fix lives:
| Marker | Meaning |
|---|---|
| **In-repo: Y** | Fix lives in a committed file (`config.yaml`, a manifest, a script, Go code, a Dockerfile). Once committed, **every redeploy re-applies it automatically.** |
| **In-repo: N** | Fix is external state (DNS records, Cloudflare dashboard, Hetzner firewall, hstspreload.org). A redeploy does **not** touch it — it survives on its own but must be done once and tracked here. |
**Status legend:** ☐ open · ◐ in progress · ☑ done · ⊘ accepted risk / deferred
**Redeploy stage order** (matches `deploy-k3s/scripts/` run order):
```
Stage 0 DNS & Cloudflare edge (external; no cluster needed)
Stage 1 Cluster provisioning & node OS (01-provision-cluster.sh / hetzner-k3s / SSH)
Stage 2 Secrets & config bootstrap (02-setup-secrets.sh / config.yaml)
Stage 3 Kubernetes manifests (deploy-k3s/manifests/, applied by 03-deploy.sh)
Stage 4 Application code & images (honeyDueAPI-go source → rebuilt image)
Stage 5 CI / build pipeline (image digest pinning, signing, scanning)
Stage 6 Post-deploy verification (04-verify.sh + runtime investigations)
```
**Golden rule for "redeploy clean":** a fix only counts as done when it is
committed to the file that the redeploy reads. A `kubectl patch` on the live
cluster that is not mirrored into `deploy-k3s/manifests/` **will be wiped on
the next `03-deploy.sh`.** Every entry below names the committed file.
---
## Execution status (2026-05-16)
Stages 2–5 were executed in-repo, then put through an independent code
review (see *Post-remediation independent review* below). The Go module
**builds clean and the full `go test ./...` suite passes.** Four new goose
migrations were added — `000003` (auth-token hashing), `000004` (IAP replay
protection), `000005` (audit-log append-only + `audit_log` table create),
`000006` (`webhook_event_log` table create) — and run automatically via the
migrate Job before the api/worker rollout.
- **~63 findings fixed (☑) and verified** — all of Stage 2 (secrets/config)
and Stage 3 (Kubernetes manifests), every exploitable Stage 4 application
finding (all 11 actioned Criticals + the auth / webhook / race / handler
High & Medium fixes), Stage-5 image digest pinning **and `K3S-F8`
(secrets are now file-mounted, not env vars)**, plus the in-repo half of
Stage 1 cluster provisioning — `K3S-F4` (kubeconfig written `0600`),
`K3S-CG1` (etcd `secrets-encryption`), `K3S-CG2` (fail2ban +
unattended-upgrades installed at provision). Includes token hashing,
Google JWKS verification, IAP replay protection, the authorization
fixes, atomic share-code join, the metrics-endpoint lockdown,
per-account login lockout, verified-email gating, CSP/HSTS hardening,
and digest-pinned images.
- **1 partial (◐)** — `CODE-L5`: cosign signing + a Trivy `HIGH,CRITICAL`
scan are wired (guarded) into `03-deploy.sh`, and a ready-to-use Kyverno
`ClusterPolicy` ships at `deploy-k3s/manifests/kyverno-verify-images.yaml`.
Closing it needs two operator actions that cannot be committed: install
Kyverno in the cluster, and supply a cosign key pair (`COSIGN_KEY` for
signing + the public key pasted into the policy).
- **Accepted / blocked / moot (⊘)** — `M3` (Apple nonce — blocked on an
iOS-client change), `C12` (moot — accounts are hard-deleted),
`LIVE-L14`/`L15` (UUID migration — planned quarter), `LIVE-L17`/`L18`/
`L20` (no security impact — see entries), `F15`/`F16` (architectural),
and `LIVE-L2`/`L3`/`L4` (DMARC / SPF / CAA — operator-declined, below).
- **Operator-declined — Stage 0 DNS (`LIVE-L2`/`L3`/`L4`).** The operator
has opted not to add the DMARC, SPF-hardening, and CAA DNS records this
cycle. For the record: these are **not** a paid-Cloudflare feature —
DMARC and SPF are ordinary TXT records and CAA is an ordinary CAA
record, all addable on any Cloudflare plan including Free. They remain
genuine email-spoofing / certificate-issuance gaps and are marked ⊘;
revisit when DNS is next touched.
- **Remaining operator runtime steps (no code to commit)** — on the
*existing* cluster: `k3s secrets-encrypt` enable/reencrypt (`K3S-CG1` /
`V12`) and `chmod 600` the live kubeconfig (`K3S-F4`); the SSH/sysctl
half of `K3S-CG2`; and the `K3S-CG3`–`CG8` verification items. A full
*fresh* provision already comes up with `K3S-F4`/`CG1`/`CG2` (fail2ban +
unattended-upgrades) applied straight from `_config.sh`.
**Operator note:** `C1` (token hashing) invalidates every existing login
session once at deploy and makes login single-session per user — see the
`CODE-C1` entry. The status boxes in the master index below are authoritative.
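
To make the `C1` operator note concrete, here is a minimal sketch of hashed token storage. It is not the project's implementation: `newToken` and `hashToken` are hypothetical names, and the 32-byte token length is an assumption. The point it illustrates is that only the SHA-256 digest would be persisted, so a database leak does not expose usable tokens.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// newToken returns a random 32-byte token, hex-encoded. The plaintext value
// is handed to the client once and is never stored. (Length is an assumption.)
func newToken() (string, error) {
	b := make([]byte, 32)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// hashToken returns the hex-encoded SHA-256 digest of a token. Only this
// digest would be written to the database and used as the lookup key.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

func main() {
	tok, err := newToken()
	if err != nil {
		panic(err)
	}
	fmt.Println("send to client:", tok)
	fmt.Println("store in DB:   ", hashToken(tok))
}
```

On each request the server would hash the presented token and look the digest up, so the plaintext never needs to be compared or stored.
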
## Post-remediation independent review (2026-05-16)
The change set went through **two** independent review passes; the deploy-time
verification below (build, `go test -race`, full `goose up` against real
PostgreSQL 16) was executed and passed.
**First pass.** A separate review agent audited the full change set against the
three audit files. It surfaced three **deploy-breaking** defects that a green
`go test` could not catch — the test harness builds two tables via GORM
`AutoMigrate`, which production never runs — all since fixed:
- **`audit_log` table was never created by a migration.** `000005` added
append-only triggers to a table that exists only in the test DB, so a
from-scratch `goose up` would fail on `000005`. `000005` now does
`CREATE TABLE IF NOT EXISTS audit_log` before the triggers.
- **`webhook_event_log` table was never created by a migration.** The H6
fail-closed webhook dedup turns a missing table into a 500 on every
subscription webhook. New migration `000006` creates it.
- **`000004`'s `google_purchase_token` unique index could fail to build** on
a production table already holding duplicate tokens — exactly the C6
replay the migration fixes. `000004` now de-duplicates (keep-earliest,
NULL-the-rest) before creating the index.
It also tightened the C13 Apple-webhook lookup (`subscription_webhook_handler.go`)
so the legacy substring scan runs only on a genuine `ErrRecordNotFound`,
never masking a real DB error as "not found".
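
The `C13` tightening above follows a common Go pattern: fall back only on a not-found sentinel and propagate every other error. The sketch below is illustrative only. `errRecordNotFound` stands in for the ORM's sentinel, and `findUser` and `lookupFn` are hypothetical names, not the handler's real API.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	// errRecordNotFound stands in for the ORM's not-found sentinel.
	errRecordNotFound = errors.New("record not found")
	// errDBDown simulates a genuine database failure.
	errDBDown = errors.New("connection refused")
)

// lookupFn is a hypothetical lookup from a transaction ID to a user ID.
type lookupFn func(txID string) (uint, error)

// findUser tries the exact lookup first and falls back to the legacy scan
// only when the exact lookup reports a genuine not-found. Any other error
// is wrapped and returned, never treated as "no such user".
func findUser(txID string, exact, legacy lookupFn) (uint, error) {
	id, err := exact(txID)
	if err == nil {
		return id, nil
	}
	if errors.Is(err, errRecordNotFound) {
		return legacy(txID)
	}
	return 0, fmt.Errorf("exact lookup failed: %w", err)
}

func main() {
	miss := func(string) (uint, error) { return 0, errRecordNotFound }
	down := func(string) (uint, error) { return 0, errDBDown }
	legacy := func(string) (uint, error) { return 42, nil }

	fmt.Println(findUser("tx-1", miss, legacy)) // fallback runs
	fmt.Println(findUser("tx-1", down, legacy)) // error propagates
}
```
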
**Second pass (master review).** A second, independent security-audit agent
re-verified all four first-pass fixes (correct), ran `go test -race` (0 data
races) and the full `goose up`/`down` chain against real PostgreSQL (clean,
idempotent), and returned **GO** with one HIGH finding, since fixed:
- **HIGH-1 — Redis password leaked via the `honeydue-config` ConfigMap.**
`_config.sh` built `REDIS_URL` with the password embedded inline, and that
URL is emitted into the `honeydue-config` ConfigMap (delivered to pods via
`envFrom`). ConfigMaps are *not* covered by `secrets-encryption` and are
readable by any principal with `get configmap` — so `K3S-F1`/`K3S-F8` were
not actually fully closed. **Fixed (2026-05-16):** `_config.sh` now emits
`REDIS_URL=redis://redis:6379/0` with no credentials; the password travels
only as the file-mounted `REDIS_PASSWORD` secret. The API applies it in
`cache_service.go`; `cmd/worker/main.go` now applies it onto the parsed
Asynq `RedisClientOpt` so the server/inspector/monitoring client all
authenticate against the `requirepass` Redis.
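
A rough sketch of the `HIGH-1` split between a credential-free `REDIS_URL` and a file-mounted password follows. The `redisOptions` struct is a hypothetical stand-in for a client option type, and the file path in `main` assumes the secret is mounted under its key name; neither is confirmed by the source.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
	"strconv"
	"strings"
)

// redisOptions is a hypothetical stand-in for a Redis client's option struct.
type redisOptions struct {
	Addr     string
	Password string
	DB       int
}

// parseRedisURL reads host and DB index from a credential-free URL such as
// redis://redis:6379/0 and rejects URLs that embed a password.
func parseRedisURL(raw string) (redisOptions, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return redisOptions{}, err
	}
	if _, hasPassword := u.User.Password(); hasPassword {
		return redisOptions{}, fmt.Errorf("REDIS_URL must not embed a password")
	}
	db := 0
	if p := strings.TrimPrefix(u.Path, "/"); p != "" {
		if db, err = strconv.Atoi(p); err != nil {
			return redisOptions{}, fmt.Errorf("invalid DB index %q: %w", p, err)
		}
	}
	return redisOptions{Addr: u.Host, DB: db}, nil
}

// withPasswordFile fills in the password from a mounted secret file.
func withPasswordFile(opts redisOptions, path string) (redisOptions, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return opts, err
	}
	opts.Password = strings.TrimSpace(string(b))
	return opts, nil
}

func main() {
	opts, err := parseRedisURL("redis://redis:6379/0")
	if err != nil {
		panic(err)
	}
	// Assumed path: mount directory from this document plus the secret key name.
	opts, err = withPasswordFile(opts, "/etc/honeydue/secrets/REDIS_PASSWORD")
	if err != nil {
		fmt.Println("no password file:", err)
		return
	}
	fmt.Printf("addr=%s db=%d password set=%t\n", opts.Addr, opts.DB, opts.Password != "")
}
```

Rejecting a URL that carries a password is a design choice in this sketch: it turns a regression of the ConfigMap leak into a startup failure rather than a silent exposure.
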
The master review's other seven findings (4 Medium, 3 Low — none
deploy-blocking) were then **all fixed (2026-05-16)**:
- **MEDIUM-1 — re-login left the prior token usable for ≤5 min.**
`CreateFreshToken` deleted the old token row but not its Redis cache entry.
It now also returns the deleted tokens' hashes; `AuthService.freshToken`
evicts them via the new `CacheService.InvalidateAuthTokenHashes` on every
login / Apple / Google sign-in, so a prior (e.g. stolen) token stops
authenticating immediately.
- **MEDIUM-2 — IAP `.p8` mode check incompatible with k8s.** The Apple IAP
key check (`iap_validation.go`) required `0600`-or-stricter, unattainable
on a k8s Secret volume (`0440` under `fsGroup`). It now rejects only
world-accessible keys (`perm & 0o007`).
- **MEDIUM-3 — single-IP account-lockout DoS.** The `M5` per-account lockout
is now keyed on the *set of distinct source IPs* that have failed
(`RegisterLoginFailure` takes the IP, tracks a Redis set; lock at 5
distinct IPs). One attacker IP can no longer lock a victim out by spamming
failures; genuinely distributed stuffing still trips it. `Login` now takes
the client IP (`c.RealIP()`).
- **MEDIUM-4 — Redis no-auth deployable.** `02-setup-secrets.sh` now `die`s
(was `warn`) when `redis.password` is empty, so a deploy can no longer
bring up an unauthenticated Redis (`K3S-F1`).
- **LOW-1 / LOW-2 — missing regression tests.** Added: `config_test.go`
asserts `validate()` refuses `DEBUG_FIXED_CODES` with `DEBUG=false` (`C4`);
`subscription_repo_test.go` asserts a second account cannot bind an Apple
transaction / Google purchase token already bound to another (`C5`/`C6`).
- **LOW-3 — device-token 409.** A recycled APNs/FCM token re-registering
under a new account is now reassigned to that account (and logged) instead
of returning a 409 that locked the legitimate new device owner out of push.
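
The `MEDIUM-3` change is easiest to see in miniature. The sketch below models the per-account Redis set with an in-memory map; `lockoutTracker` and `registerFailure` are hypothetical names, and expiry of the set (which a real Redis-backed version would need) is deliberately left out.

```go
package main

import (
	"fmt"
	"sync"
)

// lockThreshold is the number of distinct failing source IPs that locks an
// account, matching the "5 distinct IPs" figure above.
const lockThreshold = 5

// lockoutTracker is an in-memory stand-in for one Redis set per account.
type lockoutTracker struct {
	mu       sync.Mutex
	failures map[string]map[string]struct{} // account -> set of source IPs
}

func newLockoutTracker() *lockoutTracker {
	return &lockoutTracker{failures: make(map[string]map[string]struct{})}
}

// registerFailure records a failed login from ip and reports whether the
// account is now locked. Repeated failures from one IP add nothing new.
func (t *lockoutTracker) registerFailure(account, ip string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	ips, ok := t.failures[account]
	if !ok {
		ips = make(map[string]struct{})
		t.failures[account] = ips
	}
	ips[ip] = struct{}{}
	return len(ips) >= lockThreshold
}

func main() {
	tracker := newLockoutTracker()
	locked := false
	for i := 0; i < 100; i++ {
		locked = tracker.registerFailure("victim@example.com", "203.0.113.7")
	}
	fmt.Println("after 100 failures from one IP, locked:", locked)
	for i := 1; i <= 4; i++ {
		locked = tracker.registerFailure("victim@example.com", fmt.Sprintf("198.51.100.%d", i))
	}
	fmt.Println("after 5 distinct IPs, locked:", locked)
}
```

Because the counter is the size of a set rather than a running total, a single noisy source contributes at most one entry, which is what removes the lockout denial-of-service.
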
One earlier (first-pass) hardening item remains a **tracked follow-up**, not
re-raised by the master review and not deploy-blocking: `/metrics` is gated
by an `X-Forwarded-For` check rather than network-isolated. True isolation
needs `/metrics` on a separate port plus a NetworkPolicy restricting the
scrape to vmagent — an architectural change deferred to a later cycle.
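
For reference, a header-based gate of the kind described above might look like the following sketch. `metricsGuard` and `allowMetrics` are hypothetical names, and answering with 404 is an assumption (the source only says such requests are rejected). As the paragraph notes, this is a check on a header, not network isolation.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// allowMetrics reports whether a scrape may proceed given the request's
// X-Forwarded-For value. A direct in-cluster scrape carries no such header;
// a request that came through the ingress proxy does.
func allowMetrics(forwardedFor string) bool {
	return forwardedFor == ""
}

// metricsGuard wraps a metrics handler and answers 404 to proxied requests.
func metricsGuard(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !allowMetrics(r.Header.Get("X-Forwarded-For")) {
			http.NotFound(w, r)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	h := metricsGuard(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprintln(w, "up 1")
	}))

	direct := httptest.NewRequest(http.MethodGet, "/metrics", nil)
	proxied := httptest.NewRequest(http.MethodGet, "/metrics", nil)
	proxied.Header.Set("X-Forwarded-For", "203.0.113.7")

	for _, req := range []*http.Request{direct, proxied} {
		rec := httptest.NewRecorder()
		h.ServeHTTP(rec, req)
		fmt.Println(rec.Code)
	}
}
```
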
## Consolidated work items (fix once, closes many)
Several findings are the same defect seen from three angles. Do the work
once at the listed anchor; the rest close with it.
| Theme | Anchor | Also closes |
|---|---|---|
| Auth-endpoint rate limiting | Stage 3 `auth-rate-limit` middleware + Stage 4 app limiter | `K3S-F10`, `LIVE-L12`, `CODE-H1`, `CODE-H2`, `CODE-H3`, `CODE-M5` |
| CSP / cross-origin headers | Stage 3 `security-headers` + Stage 4 app CSP | `K3S-F9`, `LIVE-L8` |
| HSTS `preload` | Stage 3 middleware + Stage 0 list submission | `LIVE-L5`, `CODE-L3` |
| Admin ingress hardening | Stage 2 secret + Stage 3 middleware wiring | `K3S-F2`, `K3S-F3`, `CODE-L6` |
| etcd encryption at rest | Stage 1 `--secrets-encryption` | `K3S-CG1`, `CODE-M9` |
| Image digest pinning + signing | Stage 5 CI | `K3S-F5`, `K3S-F14`, `CODE-L4`, `CODE-L5` |
| Pagination hard caps | Stage 4 app | `LIVE-L16`, `CODE-M6` |
| imagePullSecret name consistency | Stage 3 manifests + Stage 2 script | `K3S-F6` |
**Known contradiction to resolve before planning Stage 4:** `LIVE-L18` says
*no account-deletion endpoint exists* (every `DELETE` path 404/400), but
`CODE-M13` points at a delete handler at `auth_handler.go:488-539`. Either
the endpoint exists at a path the external scan never probed, or it is
mounted but unreachable. **Confirm the route in `internal/router/router.go`
first** — the fix differs (add an endpoint vs. expose/rate-limit an existing
one). Tracked as verification item `V11`.
---
## Master finding index
Every finding, ordered by redeploy stage. Use this as the live tracker —
flip the Status box as work lands.
### Stage 0 — DNS & Cloudflare edge
| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `LIVE-L2` | HIGH | No DMARC record — email spoofing open | N | ⊘ |
| `LIVE-L3` | MED | SPF ends `?all` (neutral — fails open) | N | ⊘ |
| `LIVE-L4` | MED | No CAA records — any CA may issue certs | N | ⊘ |
| `LIVE-L6` | LOW | No `/.well-known/security.txt` | Y | ☐ |
| `LIVE-L9` | INFO | Aggressive Cloudflare caching on admin SSR shell | N | ☐ |
| `LIVE-L10` | INFO | `x-powered-by: Next.js` framework leak | Y | ☐ |
### Stage 1 — Cluster provisioning & node OS
| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `K3S-F4` | HIGH | Node kubeconfig world-readable (mode 644) | Y | ☑ |
| `K3S-F15` | INFO | Nodes on public IPs, no private VPC | Y | ⊘ |
| `K3S-F16` | INFO | All 3 nodes are control-plane + etcd + worker | Y | ⊘ |
| `K3S-F17` | INFO | Single-replica SPOFs (redis/worker/admin/vmagent) | Y | ☐ |
| `K3S-CG1` | — | etcd encryption at rest not verified (`--secrets-encryption`) | Y | ☑ |
| `K3S-CG2` | — | Node OS hardening: SSH, fail2ban, unattended-upgrades, sysctl | Y/N | ◐ |
| `K3S-CG3` | — | Hetzner Cloud Firewall rules not verified | N | ☐ |
| `K3S-CG4` | — | etcd snapshot backup destination/encryption not verified | Y | ☐ |
| `K3S-CG5` | — | kubelet flags (`--anonymous-auth=false`, webhook authz) not verified | Y | ☐ |
| `K3S-CG6` | — | Container-runtime CIS controls (`kube-bench`) not run | N | ☐ |
| `K3S-CG7` | — | `deploy` user sudoers least-privilege not verified | N | ☐ |
| `K3S-CG8` | — | `/etc/rancher/k3s/` dir + server-token perms not verified | N | ☐ |
### Stage 2 — Secrets & config bootstrap
| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `K3S-F1` | **CRIT** | Redis runs with no authentication | Y | ☑ |
| `K3S-F3` | HIGH | `admin-basic-auth` secret never created | Y | ☑ |
| `K3S-F12` | MED | Secrets unrotated since cluster bootstrap; no runbook | Y | ☑ |
| `CODE-C4` | **CRIT** | `DEBUG_FIXED_CODES` "123456" auth bypass if it reaches prod | Y | ☑ |
| `CODE-M8` | MED | `SECRET_KEY` hardcoded debug fallback | Y | ☑ |
> **Stage 2 status (2026-05-15):** `config.yaml` now carries a Redis
> password and admin basic-auth user/password; `02-setup-secrets.sh` uses
> bcrypt (`htpasswd -nbB`); `internal/config/config.go` generates an
> ephemeral random `SECRET_KEY` in debug instead of a static fallback and
> refuses to boot if `DEBUG_FIXED_CODES` is set with `DEBUG=false`; the
> rotation runbook is at `docs/runbooks/secret-rotation.md`. All take
> effect on the next `02-setup-secrets.sh` + `03-deploy.sh`.
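
The two `config.go` behaviours in the status note can be sketched as a single validation step. This is an illustration under stated assumptions, not the project's code: the `config` struct is a hypothetical subset, and refusing an empty `SECRET_KEY` when `DEBUG=false` is an assumption the source does not spell out.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
)

// config is a hypothetical subset of the settings discussed above.
type config struct {
	Debug           bool
	DebugFixedCodes bool
	SecretKey       string
}

// validate refuses fixed codes outside debug mode. With no key set it
// generates a throwaway key in debug and (assumption) errors otherwise.
func (c *config) validate() error {
	if c.DebugFixedCodes && !c.Debug {
		return errors.New("DEBUG_FIXED_CODES requires DEBUG=true")
	}
	if c.SecretKey == "" {
		if !c.Debug {
			return errors.New("SECRET_KEY is required when DEBUG=false")
		}
		b := make([]byte, 32)
		if _, err := rand.Read(b); err != nil {
			return fmt.Errorf("generate ephemeral SECRET_KEY: %w", err)
		}
		c.SecretKey = hex.EncodeToString(b)
	}
	return nil
}

func main() {
	prod := config{DebugFixedCodes: true, SecretKey: "from-secret-file"}
	fmt.Println("prod with fixed codes:", prod.validate())

	dev := config{Debug: true}
	if err := dev.validate(); err != nil {
		panic(err)
	}
	fmt.Println("ephemeral key length:", len(dev.SecretKey))
}
```

An ephemeral key means debug sessions do not survive a restart, which is the trade-off for never shipping a known static fallback.
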
### Stage 3 — Kubernetes manifests
| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `K3S-F2` | HIGH | Admin ingress missing `cloudflare-only` + `admin-auth` | Y | ☑ |
| `K3S-F6` | HIGH | `imagePullSecrets` name mismatch (`ghcr-credentials`) | Y | ☑ |
| `K3S-F7` | MED | `vmagent` container missing `securityContext` | Y | ☑ |
| `K3S-F9` | MED | `security-headers` missing COOP/COEP/CORP | Y | ☑ |
| `K3S-F10` | MED | Uniform rate limit — no auth-endpoint tightening | Y | ☑ |
| `K3S-F11` | MED | `automountServiceAccountToken` not disabled | Y | ☑ |
| `K3S-F13` | LOW | `CORS_ALLOWED_ORIGINS` missing `app.myhoneydue.com` | Y | ☑ |
| `K3S-F14` | LOW | Public images (`redis`, `vmagent`) pinned by tag | Y | ☑ |
| `LIVE-L5` | LOW | HSTS not preload-eligible | Y | ☑ |
| `LIVE-L7` | LOW | Deprecated `X-XSS-Protection` header | Y | ☑ |
| `LIVE-L8` | LOW | CSP missing `object-src`/`base-uri`; COOP/COEP/CORP absent | Y | ☑ |
| `CODE-L3` | LOW | HSTS missing `preload` (duplicate of `LIVE-L5`) | Y | ☑ |
| `CODE-L4` | LOW | `imagePullPolicy` not set on Deployments | Y | ☑ |
| `CODE-L6` | LOW | Admin `admin-auth` middleware defined, not attached | Y | ☑ |
> **Stage 3 status (2026-05-15):** admin ingress now chains
> `cloudflare-only` + `admin-auth` + `security-headers` + `rate-limit`; a
> dedicated `honeydue-api-auth` Ingress applies a new `auth-rate-limit`
> middleware (5/min, burst 10) to login / register / forgot-password /
> reset-password / join-with-code; `security-headers` gained COOP + CORP,
> HSTS is now `max-age=63072000; …; preload`, and the deprecated
> `X-XSS-Protection` (`browserXssFilter`) is removed; `vmagent` has a
> container `securityContext`; all workload pods + the migrate Job set
> `automountServiceAccountToken: false` explicitly (on top of the
> rbac.yaml ServiceAccount-level setting that already existed); the
> registry secret is `gitea-credentials` everywhere; `imagePullPolicy:
> IfNotPresent` is explicit on every container; CORS includes
> `app.myhoneydue.com`. **Still open:** `K3S-F14` (public-image digest
> pins) is folded into Stage 5 with `K3S-F5`; `LIVE-L8` is partial — the
> COOP/CORP half shipped here, the CSP `object-src`/`base-uri` half is an
> app change tracked in Stage 4.
### Stage 4 — Application code & container images
| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `CODE-C1` | **CRIT** | Auth tokens stored plaintext in DB | Y | ☑ |
| `CODE-C2` | **CRIT** | Google ID token not verified locally | Y | ☑ |
| `CODE-C3` | **CRIT** | Google `iss` claim never validated | Y | ☑ |
| `CODE-C5` | **CRIT** | Apple IAP receipt replay across accounts | Y | ☑ |
| `CODE-C6` | **CRIT** | Google purchase-token replay across accounts | Y | ☑ |
| `CODE-C7` | **CRIT** | File-ownership check excludes residence owners | Y | ☑ |
| `CODE-C8` | **CRIT** | Device-token cross-account hijack on re-register | Y | ☑ |
| `CODE-C9` | **CRIT** | Share-code join not atomic (Add+Deactivate race) | Y | ☑ |
| `CODE-C10` | **CRIT** | Subscription upgrade race — validation outside txn | Y | ☑ |
| `CODE-C11` | **CRIT** | Task-completion duplicate-row race | Y | ☑ |
| `CODE-C12` | **CRIT** | Soft-deleted email reusable; `is_active` not filtered | Y | ⊘ |
| `CODE-C13` | **CRIT** | Apple webhook user lookup may LIKE-match | Y | ☑ |
| `CODE-H1` | HIGH | Rate limit doesn't cover all auth surfaces | Y | ☑ |
| `CODE-H2` | HIGH | No rate limit on `join-with-code` | Y | ☑ |
| `CODE-H3` | HIGH | No rate limit on `register` | Y | ☑ |
| `CODE-H4` | HIGH | Modulo bias in 6-digit code generation | Y | ☑ |
| `CODE-H5` | HIGH | Apple IAP `.p8` loaded with no file-mode check | Y | ☑ |
| `CODE-H6` | HIGH | Webhook dedup fail-open | Y | ☑ |
| `CODE-H7` | HIGH | Auth-failure log lacks IP/User-Agent | Y | ☑ |
| `CODE-H8` | HIGH | `X-Timezone` header trusted for trial-start calc | Y | ☑ |
| `CODE-H9` | HIGH | Share-code `Deactivate` error swallowed | Y | ☑ |
| `CODE-M1` | MED | HTTP header injection via `Content-Disposition` filename | Y | ☑ |
| `CODE-M2` | MED | bcrypt cost = 10 (recommend 12) | Y | ☑ |
| `CODE-M3` | MED | Apple Sign In nonce not validated | Y | ⊘ |
| `CODE-M4` | MED | Email verification not atomic | Y | ☑ |
| `CODE-M5` | MED | Per-user rate limiting absent | Y | ☑ |
| `CODE-M6` | MED | List endpoints uncapped (Documents/Contractors/Residences) | Y | ☑ |
| `CODE-M7` | MED | Audit log not append-only | Y | ☑ |
| `CODE-M10` | MED | `node:20-alpine` floating tag in Dockerfile | Y | ☑ |
| `CODE-M11` | MED | `golang.org/x/crypto v0.49.0` outdated | Y | ☑ |
| `CODE-M12` | MED | Contractor toggle refetch race | Y | ☑ |
| `CODE-M13` | MED | Account-deletion endpoint unrate-limited | Y | ☑ |
| `CODE-L1` | LOW | Login inactive-account error enables enumeration | Y | ☑ |
| `CODE-L2` | LOW | Auth responses lack `Cache-Control: no-store` | Y | ☑ |
| `LIVE-L1` | HIGH | `/metrics` publicly exposed on `api.myhoneydue.com` | Y | ☑ |
| `LIVE-L11` | HIGH | Login user-enumeration via timing | Y | ☑ |
| `LIVE-L12` | HIGH | No rate-limit on `/api/auth/login/` | Y | ☑ |
| `LIVE-L13` | HIGH | Password-reset user-enumeration via timing | Y | ☑ |
| `LIVE-L14` | MED | Sequential integer user IDs leak userbase size | Y | ⊘ |
| `LIVE-L15` | MED | Sequential integer resource IDs (same risk) | Y | ⊘ |
| `LIVE-L16` | MED | Pagination `limit` accepted at any size | Y | ☑ |
| `LIVE-L17` | LOW | Garbage pagination params silently accepted | Y | ⊘ |
| `LIVE-L18` | LOW | No account-deletion endpoint (GDPR gap) | Y | ⊘ |
| `LIVE-L19` | LOW | Email verification not enforced | Y | ☑ |
| `LIVE-L20` | INFO | Profile-update silently drops unknown fields | Y | ⊘ |
> **Stage 4 handler/misc batch status (2026-05-15):** `M1` —
> `Content-Disposition` filenames are sanitized (control chars / quote /
> backslash stripped) so an upload filename cannot inject response
> headers. `M7` — migration `000005` creates the `audit_log` table (no
> prior migration did — `CREATE TABLE IF NOT EXISTS`) and makes it
> append-only via BEFORE UPDATE/DELETE triggers. `M11` —
> `golang.org/x/crypto` bumped
> `v0.49.0 → v0.51.0`. `M13` — `DELETE /api/auth/account` now carries the
> Traefik `auth-rate-limit` edge limiter. `LIVE-L18` ⊘ — not a real gap:
> the endpoint **exists** at `DELETE /api/auth/account/`
> (`router.go:546`); the live scan probed `/api/auth/me/`, `/auth/delete/`,
> `/users/me/` and missed it. **Update (2026-05-15):** items shown as
> deferred in an earlier draft were then completed — `LIVE-L1` (`/metrics`
> rejects proxied/public requests via an `X-Forwarded-For` check, so only
> the in-cluster vmagent scrape reaches it), `M6`/`LIVE-L16` (the
> document/contractor list repos already hard-cap at 500 rows), and
> `LIVE-L19` (verified-email gating on share-code generation via the new
> `RequireVerified` middleware). `LIVE-L17` (inert pagination params,
> results capped) and `LIVE-L20` (whitelist profile update is the correct
> pattern) are closed as no-security-impact (⊘). The master index above is
> authoritative.
> **Stage 4 races batch status (2026-05-15):** `C9`/`H9` — share-code
> redemption is now one locked transaction in `ResidenceRepository.
> JoinWithShareCode` (lock the code row, re-check validity, add member,
> deactivate — a deactivation failure aborts the join). `C11` — the
> task-completion duplicate-row race was *already* closed: the completion
> insert and the optimistically-version-locked task update share one
> transaction, so a concurrent completion fails `ErrVersionConflict` and
> rolls back its inserted row; no `UNIQUE(task_id, completed_date)` was
> added (it would reject legitimate same-day re-completions and risk a
> migration failure on existing data). `M4` — email verification's
> find/consume/flag writes are now one transaction. `M12` — a concurrent
> contractor delete now yields a clean 404. `C12` ⊘ — premise moot: the
> app **hard-deletes** accounts (`DeleteUserCascade`), so there is no
> soft-deleted user whose email lingers, and `ExistsByEmail` already
> blocks re-registering a *deactivated* user's email.
>
> **Stage 4 auth batch status (2026-05-15):** C1, C2, C3 done (see entries
> below). Rate limiting — every sensitive auth path now carries the shared
> Traefik `auth-rate-limit` edge limiter (login/register/forgot/reset/
> verify-reset/apple/google/refresh/join-with-code); login/register/forgot/
> reset/apple/google additionally keep the per-IP app limiter
> (`H1`/`H2`/`H3`/`LIVE-L12`). `H4` rejection-sampled codes, `M2` bcrypt
> cost 12, `L1`+`LIVE-L11` constant-time generic-error login, `L2`
> `no-store` on auth responses, `H7` IP/UA in auth logs, `LIVE-L13`
> fully-async forgot-password — all done; `go build ./...` and the
> `models`/`repositories`/`middleware`/`handlers`/`services` test packages
> pass. **Deferred:** `M3` (Apple nonce) — needs the iOS client to
> generate and send a nonce; server-only validation would reject every
> Apple login, so this is blocked on a coordinated mobile change. `H8` —
> the `parseTimezone` ±14h cap shipped; the "use server UTC for
> trial-start" half is folded into Stage 4's subscription work. `M5`
> per-account lockout (Redis) was deferred at this date, on the grounds
> that the edge + per-IP app limiters + the existing per-account
> password-reset counter cover the practical risk; it has since shipped
> (2026-05-16), keyed on distinct source IPs (see `MEDIUM-3` in the
> post-remediation review).
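
For `H4`, the modulo-bias fix can be illustrated with a short rejection-sampling sketch. The function name `sixDigitCode` and the choice of a 32-bit draw are assumptions made for the example; the project's generator may differ.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

// codeSpace is the number of six-digit codes: 000000 through 999999.
const codeSpace = 1_000_000

// limit is the largest multiple of codeSpace that fits in the 2^32 values a
// uint32 can take. Draws at or above it are discarded so that every code
// remains equally likely.
const limit = (1 << 32) / codeSpace * codeSpace

// sixDigitCode returns a uniformly distributed, zero-padded six-digit code.
func sixDigitCode() (string, error) {
	var buf [4]byte
	for {
		if _, err := rand.Read(buf[:]); err != nil {
			return "", err
		}
		v := binary.BigEndian.Uint32(buf[:])
		if v >= limit {
			continue // falls in the biased tail; draw again
		}
		return fmt.Sprintf("%06d", v%codeSpace), nil
	}
}

func main() {
	code, err := sixDigitCode()
	if err != nil {
		panic(err)
	}
	fmt.Println(code)
}
```

With these constants only 967,296 of the 2^32 possible draws are rejected (about 0.02%), so the retry loop almost never iterates twice.
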
### Stage 5 — CI / build pipeline
| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `K3S-F5` | HIGH | Images pinned by mutable short SHA tag, not digest | Y | ☑ |
| `K3S-F8` | MED | Secrets injected as env vars, not file mounts | Y | ☑ |
| `CODE-L5` | LOW | No image signing (cosign) in CI | Y | ◐ |
> **Stage 5 status (2026-05-15):** `CODE-M11` done — `golang.org/x/crypto`
> bumped `v0.49.0 → v0.51.0` (with the `x/sys`/`x/term`/`x/text` bumps
> `go get -u` pulled in), `go mod tidy` clean, full build + test green.
> **Update (2026-05-15):** `K3S-F5`/`K3S-F14`/`CODE-M10` are done —
> `03-deploy.sh` resolves the image digest after each push and deploys
> api/worker/admin/web by `@sha256:`, and redis/vmagent/`node:20-alpine`
> are pinned to their resolved index digests.
> **Update (2026-05-16):** `K3S-F8` is **done** — the `api`/`worker`
> Deployments mount `honeydue-secrets` as files (`defaultMode: 0400`) at
> `/etc/honeydue/secrets` and inject no secret as an env var;
> `config.loadFileSecrets` reads them; `02-setup-secrets.sh` now writes
> `B2_KEY_ID`/`B2_APP_KEY` into the secret, reconciling the earlier
> script-vs-manifest drift. `CODE-L5` stays **◐** — cosign signing and a
> Trivy `HIGH,CRITICAL` scan are wired (guarded) into `03-deploy.sh` and a
> ready-to-use Kyverno `ClusterPolicy` ships at
> `deploy-k3s/manifests/kyverno-verify-images.yaml`; closing it needs the
> operator to install Kyverno and supply a cosign key. See both entries.
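
To show what reading file-mounted secrets involves, here is a minimal sketch. It is not the project's `config.loadFileSecrets`; `readSecretDir` is a hypothetical analogue. It assumes the usual Kubernetes Secret volume layout, where each key appears as a file and bookkeeping entries start with `..`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// readSecretDir reads every file in dir into a map keyed by file name,
// trimming a trailing newline from each value.
func readSecretDir(dir string) (map[string]string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, fmt.Errorf("read secrets dir: %w", err)
	}
	secrets := make(map[string]string, len(entries))
	for _, e := range entries {
		// Skip volume bookkeeping entries such as "..data" (assumed layout).
		if strings.HasPrefix(e.Name(), "..") {
			continue
		}
		path := filepath.Join(dir, e.Name())
		info, err := os.Stat(path) // follows symlinks
		if err != nil {
			return nil, err
		}
		if info.IsDir() {
			continue
		}
		b, err := os.ReadFile(path)
		if err != nil {
			return nil, err
		}
		secrets[e.Name()] = strings.TrimRight(string(b), "\r\n")
	}
	return secrets, nil
}

func main() {
	// Demonstrate against a temporary directory rather than the real mount.
	dir, err := os.MkdirTemp("", "secrets")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	if err := os.WriteFile(filepath.Join(dir, "REDIS_PASSWORD"), []byte("example\n"), 0o400); err != nil {
		panic(err)
	}
	secrets, err := readSecretDir(dir)
	if err != nil {
		panic(err)
	}
	fmt.Println("keys loaded:", len(secrets))
}
```

In a deployment the directory argument would be `/etc/honeydue/secrets`; failing fast when it is unreadable keeps a misconfigured mount from booting with empty credentials.
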
### Stage 6 — Post-deploy verification & runtime investigations
`V1`–`V13` — see [Stage 6](#stage-6--post-deploy-verification--runtime-investigations).
---
## Stage 0 — DNS & Cloudflare edge
External state at Cloudflare. Not touched by `03-deploy.sh`, so a redeploy
neither breaks nor re-applies these — do them once and leave them. Tracked
here so they are never forgotten on a domain move or DNS migration.
### `LIVE-L2` — Add DMARC record · HIGH · ⊘
- **Operator decision (2026-05-16):** declined for this cycle. A DMARC record is an ordinary DNS TXT record — it is **not** gated behind a paid Cloudflare plan and can be added on Free. This remains a real email-spoofing gap; revisit when DNS is next touched.
- **Where:** Cloudflare DNS, TXT record at `_dmarc.myhoneydue.com`.
- **Fix:** Publish `v=DMARC1; p=quarantine; rua=mailto:dmarc@myhoneydue.com; ruf=mailto:dmarc@myhoneydue.com; fo=1; aspf=s; adkim=s`. Start at `pct=10` for 30 days, watch the `rua` aggregate reports, then ramp to `pct=100` and finally `p=reject`.
- **Verify:** `dig +short TXT _dmarc.myhoneydue.com` returns the record.
### `LIVE-L3` — Tighten SPF from `?all` to `-all` · MEDIUM · ⊘
- **Operator decision (2026-05-16):** declined for this cycle. SPF is an ordinary DNS TXT record, editable on any Cloudflare plan including Free. The `?all` (neutral) qualifier leaves spoofed mail un-penalised; revisit alongside `LIVE-L2`.
- **Where:** Cloudflare DNS, TXT record at `myhoneydue.com`.
- **Fix:** Change `v=spf1 include:spf.messagingengine.com ?all` → `~all` for ~7 days, confirm no legitimate mail (CI, transactional) is missed, then `-all`. Do this **after** `LIVE-L2`'s DMARC ramp begins.
- **Verify:** `dig +short TXT myhoneydue.com | grep spf` shows `-all`.
### `LIVE-L4` — Add CAA records · MEDIUM · ⊘
- **Operator decision (2026-05-16):** declined for this cycle. CAA is an ordinary DNS record type, addable on any Cloudflare plan including Free. Without it, any public CA may issue a cert for the domain; revisit when DNS is next touched.
- **Where:** Cloudflare DNS, apex `myhoneydue.com`.
- **Fix:** Add `0 issue "letsencrypt.org"`, `0 issuewild "letsencrypt.org"`, `0 iodef "mailto:security@myhoneydue.com"`. Add `0 issue "pki.goog"` only if Google Trust Services is used anywhere. Confirm against the CAs Cloudflare Universal SSL actually uses before locking down.
- **Verify:** `dig +short CAA myhoneydue.com` returns the records.
### `LIVE-L6` — Publish `security.txt` · LOW · ☐ · In-repo: Y
- **Where:** served by the Go API and/or Next.js apps at `/.well-known/security.txt` (RFC 9116) — committed route, so it survives redeploys.
- **Fix:** Serve `Contact:`, `Expires:`, `Preferred-Languages:`, `Canonical:` on both `api.myhoneydue.com` and the apex.
- **Verify:** `curl https://api.myhoneydue.com/.well-known/security.txt` → 200.
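
Since `LIVE-L6` is still open, a possible shape for the Go side is sketched below. Everything concrete in it is a placeholder: the contact address, the expiry timestamp, and the helper names `securityTxt` and `securityTxtHandler` are assumptions for illustration, not values taken from the project.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// securityTxt renders the four RFC 9116 fields named in the fix above.
func securityTxt(contact, expires, canonical string) string {
	return "Contact: " + contact + "\n" +
		"Expires: " + expires + "\n" +
		"Preferred-Languages: en\n" +
		"Canonical: " + canonical + "\n"
}

// securityTxtHandler serves a pre-rendered security.txt body as plain text.
func securityTxtHandler(body string) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "text/plain; charset=utf-8")
		fmt.Fprint(w, body)
	}
}

func main() {
	// Placeholder values; replace with the real contact and expiry date.
	body := securityTxt(
		"mailto:security@myhoneydue.com",
		"2027-01-01T00:00:00Z",
		"https://api.myhoneydue.com/.well-known/security.txt",
	)
	mux := http.NewServeMux()
	mux.Handle("/.well-known/security.txt", securityTxtHandler(body))

	req := httptest.NewRequest(http.MethodGet, "/.well-known/security.txt", nil)
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, req)
	fmt.Println(rec.Code)
	fmt.Print(rec.Body.String())
}
```
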
### `LIVE-L9` — Review Cloudflare caching of the admin SSR shell · INFO · ☐
- **Where:** Cloudflare cache rules for `admin.myhoneydue.com`.
- **Fix:** `cache-control: s-maxage=31536000` on admin SSR pages means Cloudflare caches the admin shell for a year. Confirm this is intentional; if the admin shell ever contains per-session content, add a bypass-cache rule for `admin.myhoneydue.com`.
- **Verify:** `curl -sI https://admin.myhoneydue.com/ | grep -i cache` reflects the intended policy.
### `LIVE-L10` — Suppress `x-powered-by` · INFO · ☐ · In-repo: Y
- **Where:** Next.js config in the admin and web repos (`next.config.js` → `poweredByHeader: false`). Committed, survives redeploys.
- **Fix:** Disable the `x-powered-by: Next.js` header.
- **Verify:** `curl -sI https://admin.myhoneydue.com/ | grep -i x-powered-by` returns nothing.
---
## Stage 1 — Cluster provisioning & node OS
Run by `01-provision-cluster.sh` (which drives the `hetzner-k3s` CLI from
`config.yaml` via `generate_cluster_config` in `_config.sh`) plus one-time
SSH hardening on each node. **Any k3s server flag must be set in the
`hetzner-k3s` cluster config so a cluster rebuild applies it.**
### `K3S-F4` — kubeconfig world-readable (mode 644 → 600) · HIGH · ☑ · In-repo: Y
- **Where:** `_config.sh` → `generate_cluster_config` → `k3s_config_file`. Node file `/etc/rancher/k3s/k3s.yaml`.
- **Done (2026-05-16):** `generate_cluster_config` now emits `write-kubeconfig-mode: "0600"` in the k3s config file, so any fresh provision writes the node kubeconfig as `0600`.
- **Operator step on the existing cluster:** a running node keeps the mode it was installed with — `ssh deploy@<node> 'sudo chmod 600 /etc/rancher/k3s/k3s.yaml'` on each. Deploy scripts still read it via `sudo`.
- **Verify:** `ssh deploy@<node> 'sudo stat -c %a /etc/rancher/k3s/k3s.yaml'` → `600`.
### `K3S-CG1` / `CODE-M9` — etcd / Secret encryption at rest · ☑ · In-repo: Y
- **Where:** `_config.sh` → `generate_cluster_config` → `k3s_config_file`.
- **Done:** the k3s config file carries `secrets-encryption: true`, so a fresh provision boots with AES Secret encryption enabled. (The `write-kubeconfig-mode` line for `K3S-F4` was added next to it on 2026-05-16.)
- **Operator step on the existing cluster:** a cluster provisioned *without* the flag does not retro-encrypt — run `k3s secrets-encrypt enable` then `k3s secrets-encrypt reencrypt` once. Tracked as `V12`.
- **Verify:** `k3s secrets-encrypt status` reports `Encryption Status: Enabled` on every server node.
- **Note:** the old `SECURITY.md` *claimed* this was already on — `04-verify.sh` greps for the string but cannot truly confirm; see `V12`.
### `K3S-CG2` — Node OS hardening · ◐ · In-repo: partial
- **Where:** `_config.sh` → `generate_cluster_config` → `post_create_commands` (runs on every node at provision).
- **Done (2026-05-16):** `post_create_commands` now installs and enables `fail2ban` (SSH brute-force bans) and `unattended-upgrades` (automatic security patching) on every node at provision time — a fresh cluster comes up hardened on both.
- **Still operator (runtime; not yet in-repo):**
- SSH — confirm `PermitRootLogin no`, `PasswordAuthentication no`, `AllowUsers deploy`, modern ciphers/MACs/KEX. (hetzner-k3s provisions key-only SSH; verify and tighten.)
- sysctl — confirm `net.ipv4.ip_unprivileged_port_start=0` (Traefik) and standard network-hardening sysctls.
- **Verify:** `ssh deploy@<node> 'fail2ban-client status sshd; systemctl is-enabled unattended-upgrades'`.
### `K3S-CG3` — Hetzner Cloud Firewall rules · ☐ · In-repo: N
- **Fix:** Confirm only: `:443` from Cloudflare CIDRs, `:22` from operator IP(s), `:6443` from operator IP(s). Nothing else. This is the *only* network defense for the public-IP nodes (`K3S-F15`).
- **Verify:** `hcloud firewall describe honeydue-fw` matches the intended ruleset; a direct `curl` to a node IP on `:80`/`:443` from a non-CF host times out.
### `K3S-CG4` — etcd snapshot backup · ☐ · In-repo: Y
- **Fix:** Confirm k3s etcd snapshots are enabled (default hourly) and shipped off-node — set `--etcd-s3` (to Backblaze B2) with encryption. Without offsite snapshots, a 3-node loss is unrecoverable.
- **Verify:** `ls /var/lib/rancher/k3s/server/db/snapshots/` on a node + an object in the B2 backup bucket.
### `K3S-CG5` — kubelet authn/authz flags · ☐ · In-repo: Y
- **Fix:** Confirm `--anonymous-auth=false` and `--authorization-mode=Webhook` on the kubelet (k3s defaults are usually safe — verify, don't assume). Set via k3s `kubelet-arg` in the cluster config if missing.
- **Verify:** `kubectl get --raw /api/v1/nodes/<node>/proxy/configz` shows the expected kubelet config.
### `K3S-CG6` — Container-runtime CIS baseline · ☐ · In-repo: N
- **Fix:** Run `kube-bench` once; remediate any FAIL lines that aren't k3s-by-design.
- **Verify:** `kube-bench` run archived with FAILs triaged.
### `K3S-CG7` — `deploy` user sudoers least-privilege · ☐ · In-repo: N
- **Fix:** Current `deploy ALL=(ALL) NOPASSWD: ALL` means an SSH-key compromise = node root. Scope to the commands deploys actually need (`ufw`, `systemctl`, `chmod` on k3s.yaml, `cat` of k3s.yaml). Accept the convenience trade-off only with eyes open.
- **Verify:** `ssh deploy@<node> 'sudo -l'` shows the scoped list.
### `K3S-CG8` — `/etc/rancher/k3s/` perms · ☐ · In-repo: N
- **Fix:** `/var/lib/rancher/k3s/server/token` and `/var/lib/rancher/k3s/server/node-token` must be `0600 root:root`; `/etc/rancher/k3s/` not world-traversable.
- **Verify:** `ssh deploy@<node> 'sudo stat -c "%a %n" /var/lib/rancher/k3s/server/token'` → `600`.
### `K3S-F15` — Nodes on public IPs, no private VPC · INFO · ⊘ · In-repo: Y
- **Decision:** Accepted for now. Defense is `K3S-CG3` (Hetzner firewall) only. To remediate later: attach a Hetzner private network, re-IP the cluster, move etcd/kubelet/Flannel onto it. Substantial re-provision — track on the roadmap, not this cycle.
### `K3S-F16` — All nodes are control-plane + etcd + worker · INFO · ⊘
- **Decision:** Accepted — standard small-cluster k3s. Revisit (dedicated workers + `NoSchedule` taint on control-plane) when workload pressure grows. No redeploy action.
### `K3S-F17` — Single-replica SPOFs · INFO · ☐ · In-repo: Y
- **Where:** `deploy-k3s/manifests/worker/deployment.yaml`, `redis/`, `admin/`, `observability/vmagent.yaml`.
- **Fix:** `worker` → `replicas: 2` (stateless, Asynq at-least-once, so safe now). `admin`/`vmagent` → 2 if zero-downtime restart is wanted. `redis` is stateful: true HA needs Sentinel or managed Redis; track separately, do not naively scale.
- **Verify:** `kubectl -n honeydue get deploy` shows `worker 2/2`.
---
## Stage 2 — Secrets & config bootstrap
Run by `02-setup-secrets.sh`, which reads `deploy-k3s/config.yaml` and the
`secrets/` directory. **Both `K3S-F1` and `K3S-F3` are open purely because
`config.yaml` lacks the values — the script already supports them.**
### `K3S-F1` — Redis runs with no authentication · CRITICAL · ☐ · In-repo: Y
- **Where:** `deploy-k3s/config.yaml` key `redis.password`. `02-setup-secrets.sh:53,68-71` includes `REDIS_PASSWORD` in `honeydue-secrets` only when that key is non-empty; `redis/deployment.yaml` adds `--requirepass` only when the env var is non-empty.
- **Fix:** Set `redis.password` in `config.yaml` to a strong value (`openssl rand -base64 32`). Re-run `02-setup-secrets.sh`. `api`/`worker` already consume `REDIS_PASSWORD`.
- **Verify:** `kubectl -n honeydue exec deploy/redis -- redis-cli ping` → `NOAUTH`; with `-a "$REDIS_PASSWORD"` → `PONG`.
- **Redeploy-clean:** committing the value to `config.yaml` means every future `02-setup-secrets.sh` re-creates the authenticated secret. (If `config.yaml` is gitignored, store the value in the operator's secret store and document it here.)
### `K3S-F3` — `admin-basic-auth` secret never created · HIGH · ☐ · In-repo: Y
- **Where:** `config.yaml` keys `admin.basic_auth_user` / `admin.basic_auth_password`. `02-setup-secrets.sh:54-55,132-143` creates the `admin-basic-auth` secret (bcrypt htpasswd) only when both are set, else it warns and skips.
- **Fix:** Set both keys. Re-run `02-setup-secrets.sh`. **Must be done before `K3S-F2`** — attaching `admin-auth` to the ingress with the secret missing makes Traefik 503 the admin route.
- **Verify:** `kubectl -n honeydue get secret admin-basic-auth`.
### `K3S-F8` (Stage 2 half) — `B2_KEY_ID` / `B2_APP_KEY` in `honeydue-secrets` · ☑ · In-repo: Y
- **Where:** `02-setup-secrets.sh`.
- **Done (2026-05-16):** the script now reads `storage.b2_key_id` / `storage.b2_app_key` from `config.yaml` and adds `B2_KEY_ID` / `B2_APP_KEY` to `honeydue-secrets`. Previously the `api`/`worker` manifests referenced these keys but the script never created them — a latent deploy break. See the full `K3S-F8` entry in Stage 5.
- **Verify:** `kubectl -n honeydue get secret honeydue-secrets -o jsonpath='{.data.B2_KEY_ID}'` is non-empty.
### `K3S-F12` — Secret rotation runbook · MEDIUM · ☐ · In-repo: Y
- **Where:** new doc `docs/runbooks/secret-rotation.md`.
- **Fix:** Document per-secret rotation (Postgres, `SECRET_KEY`, APNs `.p8`, FCM, B2, observability token, Redis, admin basic-auth). Annual minimum; immediate on suspected exposure or operator-device loss. For `SECRET_KEY` (JWT signing) plan an overlap window so live tokens validate across the change. Add a `last-rotated` annotation to each secret.
- **Verify:** runbook exists and the first rotation is logged.
### `CODE-C4` — `DEBUG_FIXED_CODES` "123456" auth bypass · CRITICAL · ☐ · In-repo: Y
- **Where:** `internal/services/auth_service.go:141-145,385-390,432-435,470-473,503-504`; config in `internal/config/config.go`. ConfigMap generated from `config.yaml` by `03-deploy.sh`.
- **Fix (two layers):** (1) Code — refuse to start if `ENV=production && DebugFixedCodes` (Stage 4 code change). (2) Config — ensure `config.yaml` never sets `DEBUG_FIXED_CODES=true` for prod, and the generated ConfigMap omits it.
- **Verify:** prod ConfigMap has no `DEBUG_FIXED_CODES`; a prod boot with the flag set fails fast.
### `CODE-M8` — `SECRET_KEY` hardcoded debug fallback · MEDIUM · ☐ · In-repo: Y
- **Where:** `internal/config/config.go:437-442` falls back to `"change-me-in-production-secret-key-12345"`.
- **Fix:** Remove the static fallback — generate a per-boot random key in debug, and **refuse to start** in production if `SECRET_KEY` is unset. (`02-setup-secrets.sh:46-49` already enforces ≥32 chars for the real secret — keep that.)
- **Verify:** prod boot with no `SECRET_KEY` exits non-zero; the fallback string is gone from the binary.
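
The `CODE-M8` fix could be sketched roughly as below. This is a minimal illustration, not the project's `config.go`: the function name `resolveSecretKey`, the literal `"production"` environment name, and the 32-byte key size are all assumptions made for the example.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
)

// errMissingSecretKey is returned when production starts without SECRET_KEY.
var errMissingSecretKey = errors.New("SECRET_KEY must be set in production")

// resolveSecretKey returns the key the process should use.
// A configured key always wins. With no key configured, production is a
// hard error; any other environment gets a random per-boot key, so no
// static fallback string exists in the binary.
// The "production" name and the 32-byte size are assumptions.
func resolveSecretKey(env, configured string) (string, error) {
	if configured != "" {
		return configured, nil
	}
	if env == "production" {
		return "", errMissingSecretKey
	}
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}
```

A caller would typically treat a non-nil error as fatal at startup, which gives the non-zero exit the verify step expects. A per-boot key also means debug sessions do not survive a restart, which is usually acceptable outside production.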
---
## Stage 3 — Kubernetes manifests
Committed under `deploy-k3s/manifests/` and applied by `03-deploy.sh`. **Any
fix here is automatically re-applied on every redeploy** — the highest-value
stage for "redeploy clean."
### `K3S-F2` / `CODE-L6` — Wire defense-in-depth onto the admin ingress · HIGH · ☐
- **Where:** `deploy-k3s/manifests/ingress/ingress-simple.yaml` — admin route annotation.
- **Fix:** Add `cloudflare-only` and `admin-auth` to the `traefik.ingress.kubernetes.io/router.middlewares` annotation alongside the existing `security-headers` + `rate-limit`. **Do `K3S-F3` first** or Traefik 503s the route.
- **Verify:** `04-verify.sh` "Cloudflare-Only Middleware" check passes; `admin.myhoneydue.com` prompts for basic auth.
### `K3S-F6` — `imagePullSecrets` name consistency · HIGH · ☐
- **Where:** all `deploy-k3s/manifests/*/deployment.yaml`, `migrate/job.yaml`; secret created by `02-setup-secrets.sh:111` as `ghcr-credentials`.
- **Fix:** The registry is Gitea — `ghcr-credentials` is a misleading name and the live cluster currently also has a hand-made `gitea-credentials`. Pick one name (`gitea-credentials` is clearer), use it in **both** the script and **every** manifest, and delete the orphan. The defect is a name *mismatch*, not a missing fix — make script + manifests agree so a pull never fails on a fresh node.
- **Verify:** `grep -rl imagePullSecrets deploy-k3s/manifests/` all reference one name == the script's; cordon a node, delete a pod, confirm the replacement pulls.
### `K3S-F7` — `vmagent` container `securityContext` · MEDIUM · ☐
- **Where:** `deploy-k3s/manifests/observability/vmagent.yaml`.
- **Fix:** Add the container-level block the other 5 deployments already have: `allowPrivilegeEscalation: false`, `capabilities.drop: [ALL]`, `readOnlyRootFilesystem: true`. Its volumes (`/etc/vmagent`, `/etc/vmagent-secrets`, `/tmp/vmagent` emptyDir) already support read-only root.
- **Verify:** `04-verify.sh` "Pod Security Contexts" reports OK for `vmagent`.
### `K3S-F9` / `LIVE-L8` — CSP + cross-origin headers · MEDIUM / LOW · ☐
- **Where:** Cross-origin trio → `deploy-k3s/manifests/ingress/middleware.yaml` (`security-headers`). CSP `object-src`/`base-uri` → Go app CSP middleware (Stage 4, `LIVE-L8` code half).
- **Important correction:** `K3S-F9` originally said CSP was missing. The live scan **disproved** that — the Go app sets a strong CSP via app middleware. So `K3S-F9` reduces to: add `Cross-Origin-Opener-Policy: same-origin` and `Cross-Origin-Resource-Policy: same-origin` (and `Cross-Origin-Embedder-Policy: require-corp` only if it doesn't break embeds) to `security-headers`. The CSP `object-src 'none'; base-uri 'self'` additions belong in the app and are tracked under `LIVE-L8` in Stage 4.
- **Verify:** `curl -sI https://api.myhoneydue.com/api/health/ | grep -i cross-origin` shows COOP/CORP.
### `K3S-F10` / `LIVE-L12` — Auth-endpoint rate-limit middleware · MEDIUM / HIGH · ☐
- **Where:** `deploy-k3s/manifests/ingress/middleware.yaml` (new `auth-rate-limit` Middleware) + `ingress/ingress-simple.yaml`. Requires migrating the auth paths from vanilla `Ingress` to a Traefik `IngressRoute` to apply a per-path middleware.
- **Fix:** New Middleware `average: 5, burst: 10, period: 1m, sourceCriterion.ipStrategy.depth: 2` (depth 2 for the Cloudflare hop). Apply to `/api/auth/login`, `/api/auth/register`, `/api/auth/forgot-password`, `/api/auth/reset-password`, `/api/residences/join-with-code`. This is the **edge** half; the **app** half is `CODE-H1/H2/H3/M5` in Stage 4 (per-account lockout in Redis). Do both — edge limit alone resets on IP rotation.
- **Verify:** more than 10 rapid logins from one IP → `429` (the first 10 fit within `burst`).
### `K3S-F11` — Disable `automountServiceAccountToken` · MEDIUM · ☐
- **Where:** `deploy-k3s/manifests/rbac.yaml` (ServiceAccounts) and/or each `*/deployment.yaml` pod spec.
- **Fix:** Set `automountServiceAccountToken: false` on `api`, `admin`, `worker`, `web`, `redis`. Leave `true` only for `vmagent` (it uses the k8s API for service discovery). **Note:** `05-security.md` claims this is already set — the audit (`F11`) says it is not. Treat the audit as ground truth; this fix makes the doc true.
- **Verify:** `kubectl -n honeydue get pod <api-pod> -o jsonpath='{.spec.automountServiceAccountToken}'` → `false`; no token file in the container.
### `K3S-F13` — Add `app.myhoneydue.com` to CORS · LOW · ☐
- **Where:** `CORS_ALLOWED_ORIGINS` in `config.yaml` → generated into `honeydue-config` ConfigMap by `03-deploy.sh`.
- **Fix:** Confirm whether the web app calls `api.myhoneydue.com` directly from the browser. If yes, add `https://app.myhoneydue.com` to `CORS_ALLOWED_ORIGINS`. If it proxies through Next.js server-side, CORS is moot — record that decision here instead.
- **Verify:** browser fetch from `app.myhoneydue.com` to the API succeeds (or the proxy decision is documented).
### `K3S-F14` — Pin public images by digest · LOW · ☐
- **Where:** `redis/deployment.yaml` (`redis:7-alpine`), `observability/vmagent.yaml` (`victoriametrics/vmagent:v1.106.1`).
- **Fix:** Replace tags with `@sha256:` digests. Folded into the `K3S-F5` CI work (Stage 5).
- **Verify:** manifests contain no public-image tag without a digest.
### `LIVE-L5` / `CODE-L3` — HSTS `preload` · LOW · ☐
- **Where:** `deploy-k3s/manifests/ingress/middleware.yaml` `security-headers` HSTS value.
- **Fix:** Change to `max-age=63072000; includeSubDomains; preload`. Confirm api/admin/app all work fully over HTTPS, then submit to `hstspreload.org` (the submission is the Stage 0 external half — once preloaded you cannot easily downgrade for ~6 months).
- **Verify:** response header shows `preload`; domain accepted at hstspreload.org.
### `LIVE-L7` — Drop deprecated `X-XSS-Protection` · LOW · ☐
- **Where:** `deploy-k3s/manifests/ingress/middleware.yaml` `security-headers` (`browserXssFilter: true` / `customResponseHeaders`).
- **Fix:** Remove the header or set `X-XSS-Protection: "0"`. Modern browsers ignore it, and the legacy XSS filter it controlled has itself been abused to introduce XSS.
- **Verify:** header absent or `0` on all three hosts.
### `CODE-L4` — Set `imagePullPolicy` · LOW · ☐
- **Where:** all `deploy-k3s/manifests/*/deployment.yaml`.
- **Fix:** Set `imagePullPolicy` explicitly. Once images are digest-pinned (`K3S-F5`), `IfNotPresent` is correct and avoids needless re-pulls; until then `Always` avoids stale tags. Pick the policy that matches the `K3S-F5` rollout state.
- **Verify:** every container has an explicit `imagePullPolicy`.
---
## Stage 4 — Application code & container images
Fixes in `honeyDueAPI-go` source (and the admin/web Dockerfiles). They reach
production by **rebuilding the image** in `03-deploy.sh`; schema-changing
fixes (`CODE-C1`, `CODE-C5/6`, `CODE-C11`, `CODE-C12`) also need a **goose
migration**, which the migrate `Job` runs automatically before the
api/worker roll. Per repo rule: do not auto-commit — these are code changes;
this section is the plan, not the patch.
### Critical (C1–C13)
#### `CODE-C1` — Plaintext auth tokens in DB · ☑ (2026-05-15)
- **Where:** `internal/models/user.go`, `internal/repositories/user_repo.go`, `internal/middleware/auth.go`, `internal/services/cache_service.go`, `internal/services/auth_service.go`, migration `000003_hash_auth_tokens.sql`.
- **Done:** `user_authtoken.key` now stores `models.HashToken()` — the hex SHA-256 of the token — never the raw value. The raw token reaches the client once (the non-persisted `AuthToken.Plaintext` field) and is re-hashed on every request before the DB and Redis lookup, so the single indexed JOIN query in the auth middleware is preserved. A fast hash (not bcrypt) is correct here — tokens are 160-bit random values, nothing to brute-force. Migration `000003` widens the column 40→64 and clears existing rows.
- **Behaviour change:** the server can no longer re-issue a stored token's plaintext, so every login mints a fresh token via `CreateFreshToken` (delete + create). With the existing one-token-per-user schema this means **one active session per user** — logging in on a new device invalidates the previous device's token. The migration also invalidates all sessions once, at deploy.
- **Verify:** `SELECT key FROM user_authtoken LIMIT 1` → 64-char hash; `go build ./...` and `go test ./internal/{models,repositories,middleware,handlers}/...` pass.
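
For illustration, the hashing step described above amounts to something like the following. This is a sketch of the technique (hex SHA-256 of the raw token), not a copy of `models.HashToken()`; the lowercase name `hashToken` is used here to make clear it is a stand-in.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// hashToken returns the lowercase hex SHA-256 digest of a raw token.
// The result is always 64 characters, which matches the widened column.
func hashToken(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}
```

Because the hash is deterministic and unsalted, the middleware can hash the presented token once and use the digest directly as the lookup key in both Redis and the database.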
#### `CODE-C2` / `CODE-C3` — Google ID token not verified locally · ☑ (2026-05-15)
- **Where:** `internal/services/google_auth.go` (full rewrite).
- **Done:** `VerifyIDToken` no longer calls the deprecated `tokeninfo` URL (which leaked the token in the query string and made verification depend on a third party). It now parses the JWT, fetches Google's JWKS from `googleapis.com/oauth2/v3/certs` (Redis-cached 24h, re-fetched on a `kid` miss), verifies the `RS256` signature locally, and asserts `iss ∈ {accounts.google.com, https://accounts.google.com}` (C3), `aud`/`azp` against the configured client IDs, and `exp` (validated by jwt v5). Mirrors the existing Apple JWKS verifier. `GoogleSignIn` is unchanged — the returned `GoogleTokenInfo` shape is preserved.
- **Verify:** `go build ./...` clean; `internal/services` tests pass.
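
The claim checks that follow signature verification might look roughly like this. It is a simplified sketch under stated assumptions: the type `googleClaims` and function `checkGoogleClaims` are hypothetical names, the `azp` check is left out, and signature verification against the JWKS is assumed to have happened already.

```go
package main

import "errors"

// googleClaims holds only the claims checked here; a real ID token carries more.
type googleClaims struct {
	Issuer    string // "iss"
	Audience  string // "aud"
	ExpiresAt int64  // "exp", Unix seconds
}

var (
	errBadIssuer   = errors.New("unexpected issuer")
	errBadAudience = errors.New("audience is not a configured client ID")
	errExpired     = errors.New("token expired")
)

// checkGoogleClaims validates iss, aud and exp. It must only be called
// after the RS256 signature has been verified. now is Unix seconds.
func checkGoogleClaims(c googleClaims, clientIDs []string, now int64) error {
	if c.Issuer != "accounts.google.com" && c.Issuer != "https://accounts.google.com" {
		return errBadIssuer
	}
	audOK := false
	for _, id := range clientIDs {
		// Skip empty configured IDs so an empty aud can never match.
		if id != "" && id == c.Audience {
			audOK = true
			break
		}
	}
	if !audOK {
		return errBadAudience
	}
	if now >= c.ExpiresAt {
		return errExpired
	}
	return nil
}
```

Passing `now` in as a parameter keeps the check deterministic in tests; the document notes that the real implementation leaves `exp` validation to jwt v5.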
#### `CODE-C5` / `CODE-C6` — IAP receipt / purchase-token replay · ☐
- **Where:** `internal/services/subscription_service.go` (`ProcessApplePurchase`, `ProcessGooglePurchase`).
- **Fix:** Goose migration adding `UNIQUE(provider, original_transaction_id)`. On purchase, if the transaction ID is already bound to a different `user_id` → `403`.
- **Verify:** re-submitting a valid receipt against a second account → `403`; DB has no duplicate.
#### `CODE-C7` — File-ownership check excludes residence owners · ☐
- **Where:** `internal/services/file_ownership_service.go:20-66`.
- **Fix:** Replace the three `residence_residence_users`-only JOINs with the canonical owner-OR-member UNION from `residence_repo.HasAccess` (owners live in `residence_residence.owner_id`).
- **Verify:** a residence owner can delete a file in their own property; a non-member still gets `403`.
#### `CODE-C8` — Device-token cross-account hijack · ☐
- **Where:** `internal/services/notification_service.go:307-319` (APNS), `:336-349` (GCM).
- **Fix:** On re-register of an existing token, if `existing.UserID != nil && *existing.UserID != userID` → `409 Conflict`. Only same-user updates allowed.
- **Verify:** registering another user's known token → `409`; that user's push traffic is unaffected.
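
The ownership guard for `CODE-C8` can be expressed as a small pure function. The names below (`checkDeviceTokenOwner`, `errTokenOwnedByOtherUser`) are hypothetical and only mirror the condition stated in the fix; the real service works on its own model types.

```go
package main

import "errors"

// errTokenOwnedByOtherUser is the condition a handler would map to 409 Conflict.
var errTokenOwnedByOtherUser = errors.New("device token is registered to another user")

// checkDeviceTokenOwner reports whether userID may register or update a
// token whose current owner is existingOwner (nil when unowned).
func checkDeviceTokenOwner(existingOwner *uint, userID uint) error {
	if existingOwner != nil && *existingOwner != userID {
		return errTokenOwnedByOtherUser
	}
	return nil
}
```

Keeping the decision separate from the database write makes the same rule easy to apply to both the APNS and GCM registration paths.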
#### `CODE-C9` / `CODE-H9` — Share-code join not atomic · ☐
- **Where:** `internal/services/residence_service.go:562-615` (`:594-599` swallows the deactivate error).
- **Fix:** Wrap `JoinWithCode` in one transaction with `SELECT … FOR UPDATE` on the share-code row; **fail the join if deactivation fails** (do not log-and-continue).
- **Verify:** concurrent redemptions of a single-use code → exactly one succeeds; a forced deactivate error rolls the whole join back.
#### `CODE-C10` — Subscription upgrade race · ☐
- **Where:** `internal/services/subscription_service.go:404-459`; webhook handler `:136-213`.
- **Fix:** Move Apple validation inside the row-locked transaction, or add an idempotency-key table so the validate→write window can't be raced.
- **Verify:** two concurrent upgrades for one user → one tier change, not two.
#### `CODE-C11` — Task-completion duplicate-row race · ☐
- **Where:** `internal/services/task_service.go:631-750`.
- **Fix:** `SELECT … FOR UPDATE` on the task in `CreateCompletion`; goose migration adding `UNIQUE(task_id, completed_date)`.
- **Verify:** double-tap "complete" → one completion row.
#### `CODE-C12` — Soft-deleted email reusable · ☐
- **Where:** `internal/services/auth_service.go:274-324`; `internal/repositories/user_repo.go` (`FindByEmail`, `ExistsByEmail`).
- **Fix:** On delete, mangle the email (`deleted_<id>_<email>`); add `is_active = true` filtering consistently to `FindByEmail`/`ExistsByEmail`.
- **Verify:** registering with a soft-deleted account's email is rejected; no cross-account takeover.
#### `CODE-C13` — Apple webhook user lookup may LIKE-match · ☐
- **Where:** `internal/handlers/subscription_webhook_handler.go:354-366` (`FindByAppleReceiptContains`).
- **Fix:** Confirm the SQL is an equality match, not `LIKE`. If `LIKE`, this is a confirmed Critical — change to equality and rename the function. See `V8`.
- **Verify:** the query is parameterized equality; rename merged.
### High (H1–H9)
#### `CODE-H1` / `CODE-H2` / `CODE-H3` / `CODE-M5` — Rate limiting gaps · ☐
- **Where:** `internal/router/router.go` (`:520` login limiter, `:593` `join-with-code` unprotected), `internal/middleware/rate_limit.go`, `internal/handlers/auth_handler.go`.
- **Fix:** Extend rate limiting to `register`, `join-with-code`, Apple/Google sign-in, and token refresh. Add a per-account login-attempt counter in Redis (lock after 5–10 fails for 15–60 min). This is the **app** half of the consolidated auth-rate-limit item; the **edge** half is `K3S-F10`.
- **Verify:** rapid attempts on every auth route throttle; per-account lockout fires regardless of source IP.
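
To make the per-account lockout concrete, here is a minimal sketch that uses an in-memory map as a stand-in for the Redis counter. Everything in it is an assumption for illustration: the type `loginLockout`, its method names, and the choice of 5 failures and 900 seconds in the usage below. A real implementation would keep the counter in Redis so it is shared across replicas and survives restarts.

```go
package main

import "sync"

// loginLockout is an in-memory stand-in for a Redis-backed counter.
type loginLockout struct {
	mu          sync.Mutex
	maxFails    int
	lockSeconds int64
	fails       map[string]int
	lockedUntil map[string]int64
}

func newLoginLockout(maxFails int, lockSeconds int64) *loginLockout {
	return &loginLockout{
		maxFails:    maxFails,
		lockSeconds: lockSeconds,
		fails:       map[string]int{},
		lockedUntil: map[string]int64{},
	}
}

// Locked reports whether the account is locked at time now (Unix seconds).
func (l *loginLockout) Locked(account string, now int64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	return now < l.lockedUntil[account]
}

// RecordFailure counts a failed login and starts a lock once the
// threshold is reached. The key is the account only, never the source IP.
func (l *loginLockout) RecordFailure(account string, now int64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.fails[account]++
	if l.fails[account] >= l.maxFails {
		l.lockedUntil[account] = now + l.lockSeconds
		l.fails[account] = 0
	}
}

// RecordSuccess clears the failure counter after a successful login.
func (l *loginLockout) RecordSuccess(account string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.fails, account)
}
```

With `newLoginLockout(5, 900)`, the fifth failure locks the account for 15 minutes. The login handler would call `Locked` before checking the password and return the same generic error either way.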
#### `CODE-H4` — Modulo bias in 6-digit codes · ☐
- **Where:** `internal/services/auth_service.go:884-892`.
- **Fix:** Replace `int32 % 1000000` with rejection sampling on `crypto/rand` for a uniform `000000–999999`.
- **Verify:** distribution test over many samples is uniform.
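
Rejection sampling for a 6-digit code could be sketched as follows. The function name `sixDigitCode` is hypothetical; the approach is to draw 32 random bits and discard any value at or above the largest multiple of 1,000,000 that fits, so every residue is equally likely.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

// sixDigitCode returns a uniformly distributed code in 000000..999999.
func sixDigitCode() (string, error) {
	const n = 1_000_000
	// Largest multiple of n not exceeding 2^32. Values at or above it
	// are rejected, which removes the modulo bias.
	const limit = (1 << 32) / n * n
	var buf [4]byte
	for {
		if _, err := rand.Read(buf[:]); err != nil {
			return "", err
		}
		v := binary.BigEndian.Uint32(buf[:])
		if uint64(v) < limit {
			return fmt.Sprintf("%06d", v%n), nil
		}
	}
}
```

Only about 0.02% of draws are rejected, so the loop almost always finishes on the first iteration.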
#### `CODE-H5` — Apple IAP `.p8` file-mode unchecked · ☐
- **Where:** `internal/services/iap_validation.go:93-128`, `internal/config/config.go:325`.
- **Fix:** Prefer a base64 env-injected PEM. If a file path is kept, refuse to start when the file mode is more permissive than `0600`.
- **Verify:** boot fails on a `0644` key file; succeeds on `0600`.
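
If the file-path option is kept, the mode check could look like the sketch below. The helper names `modeTooPermissive` and `checkKeyFile` are assumptions; "more permissive than `0600`" is interpreted here as "any permission bit set outside owner read/write".

```go
package main

import (
	"fmt"
	"os"
)

// modeTooPermissive reports whether any permission bit outside owner
// read/write (0600) is set.
func modeTooPermissive(mode os.FileMode) bool {
	return mode.Perm()&^0o600 != 0
}

// checkKeyFile refuses a private-key file whose mode is wider than 0600.
func checkKeyFile(path string) error {
	info, err := os.Stat(path)
	if err != nil {
		return fmt.Errorf("stat key file: %w", err)
	}
	if modeTooPermissive(info.Mode()) {
		return fmt.Errorf("key file %s has mode %04o, want 0600 or stricter",
			path, uint32(info.Mode().Perm()))
	}
	return nil
}
```

Calling `checkKeyFile` during startup and exiting on error gives the fail-fast behaviour in the verify step. Note that a `0400` file passes, which matches the `defaultMode: 0400` used for mounted secrets elsewhere in this document.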
#### `CODE-H6` — Webhook dedup fail-open · ☐
- **Where:** `internal/handlers/subscription_webhook_handler.go:165-173` (Apple), `:564-574` (Google).
- **Fix:** Fail **closed** — if `webhookEventRepo.HasProcessed` errors, return `500` so Apple/Google retry, rather than processing (which risks duplicate refunds).
- **Verify:** simulated dedup-check DB error → `500`, no double-processing.
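
The fail-closed rule can be written as a small decision function. This is a sketch only: `webhookDecision` and `errDedupStore` are hypothetical names, and the real handler works with `webhookEventRepo.HasProcessed` and its own response helpers.

```go
package main

import (
	"errors"
	"net/http"
)

// errDedupStore stands in for a failing dedup lookup in examples and tests.
var errDedupStore = errors.New("dedup store unavailable")

// webhookDecision maps the dedup lookup result to an HTTP status and
// whether the event should be processed.
func webhookDecision(alreadyProcessed bool, lookupErr error) (status int, process bool) {
	switch {
	case lookupErr != nil:
		// Fail closed: return 500 so the provider retries; process nothing.
		return http.StatusInternalServerError, false
	case alreadyProcessed:
		// Duplicate delivery: acknowledge without reprocessing.
		return http.StatusOK, false
	default:
		return http.StatusOK, true
	}
}
```

The important property is that the error branch is checked first, so an unknown dedup state can never fall through to processing.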
#### `CODE-H7` — Auth-failure log lacks IP/UA · ☐
- **Where:** `internal/handlers/auth_handler.go:70`.
- **Fix:** Add `c.RealIP()` + `User-Agent` to the structured failure log line (the audit log captures them; the request-line log does not). Depends on `V10` (RealIP trust).
- **Verify:** a failed login log line carries IP + UA.
#### `CODE-H8` — `X-Timezone` header trusted for trial start · ☐
- **Where:** `internal/middleware/timezone.go:40-71` → `internal/services/subscription_service.go:145-150`.
- **Fix:** Validate `X-Timezone` against IANA `LoadLocation`, cap to ±14h; use server UTC for trial-start / billing-window math regardless.
- **Verify:** a bogus/extreme `X-Timezone` cannot shift trial start.
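
A possible shape for the `X-Timezone` validation is sketched below. The names `parseClientTimezone` and `trialStart` are hypothetical. The sketch embeds the IANA database via `time/tzdata` so the lookup does not depend on the container image shipping zoneinfo files, which is an assumption about the deployment.

```go
package main

import (
	"errors"
	"time"
	_ "time/tzdata" // embed the IANA database so lookups do not depend on the host
)

var errBadTimezone = errors.New("invalid X-Timezone")

// parseClientTimezone accepts only real IANA zone names whose current
// offset is within +/-14h. "" and "Local" are rejected explicitly because
// time.LoadLocation would map them to UTC and the server zone.
func parseClientTimezone(name string) (*time.Location, error) {
	if name == "" || name == "Local" {
		return nil, errBadTimezone
	}
	loc, err := time.LoadLocation(name)
	if err != nil {
		return nil, errBadTimezone
	}
	_, offset := time.Now().In(loc).Zone()
	const maxOffset = 14 * 60 * 60 // seconds
	if offset > maxOffset || offset < -maxOffset {
		return nil, errBadTimezone
	}
	return loc, nil
}

// trialStart ignores the client zone entirely and uses server UTC.
func trialStart() time.Time {
	return time.Now().UTC()
}
```

The validated location is then only used for display-level concerns; trial-start and billing-window math call something like `trialStart` and never see the header.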
### Medium (M1–M13)
#### `CODE-M1` — Header injection via `Content-Disposition` filename · ☐
- **Where:** `internal/handlers/media_handler.go:74,117,165`.
- **Fix:** Sanitize `doc.FileName` — strip CR/LF/quote/null, or emit RFC 5987 `filename*=UTF-8''…`.
- **Verify:** an upload with CRLF in the filename does not split the response.
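
Both options in the fix can be combined, as in this sketch. The function names `sanitizeFilename` and `contentDisposition` are hypothetical, and backslash is stripped in addition to the characters the fix lists, which is an extra precaution assumed here because backslash is the escape character inside quoted header values.

```go
package main

import (
	"net/url"
	"strings"
)

// sanitizeFilename drops characters that could break out of a quoted
// header value: CR, LF, NUL, double quote and backslash.
func sanitizeFilename(name string) string {
	return strings.Map(func(r rune) rune {
		switch r {
		case '\r', '\n', 0, '"', '\\':
			return -1 // a negative return drops the rune
		}
		return r
	}, name)
}

// contentDisposition builds an attachment header value carrying both the
// quoted filename and the RFC 5987 filename* form.
func contentDisposition(name string) string {
	clean := sanitizeFilename(name)
	// QueryEscape percent-encodes everything outside [A-Za-z0-9-_.~] and
	// writes spaces as "+", so "+" is rewritten to "%20" afterwards.
	encoded := strings.ReplaceAll(url.QueryEscape(clean), "+", "%20")
	return `attachment; filename="` + clean + `"; filename*=UTF-8''` + encoded
}
```

A handler would set the result with its framework's header setter instead of formatting `doc.FileName` into the header directly.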
#### `CODE-M2` — bcrypt cost 10 → 12 · ☐
- **Where:** `internal/models/user.go:47`, `internal/services/auth_service.go:479`.
- **Fix:** Make the cost config-driven, default 12.
- **Verify:** new hashes are `$2a$12$`.
#### `CODE-M3` — Apple Sign In nonce not validated · ☐
- **Where:** `internal/services/apple_auth.go`.
- **Fix:** Generate, store, and verify the nonce round-trip on Apple sign-in.
- **Verify:** a replayed/mismatched nonce is rejected.
#### `CODE-M4` — Email verification not atomic · ☐
- **Where:** `internal/services/auth_service.go:373-415`.
- **Fix:** Wrap verify in a transaction so a concurrent request can't double-apply.
- **Verify:** concurrent verify calls → one state transition.
#### `CODE-M6` / `LIVE-L16` — Uncapped list / pagination · ☐
- **Where:** `ListDocuments`, `ListContractors`, `ListResidences` handlers; pagination parsing.
- **Fix:** Clamp `limit` server-side to ≤100 (`< 1` → default 25). Notifications already caps at 200 — match the pattern.
- **Verify:** `?limit=999999` returns ≤100 rows.
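
The clamp itself is only a few lines. The constant and function names below are hypothetical; the bounds (default 25, maximum 100) come from the fix above.

```go
package main

const (
	defaultLimit = 25
	maxLimit     = 100
)

// clampLimit applies the server-side bounds to a requested page size.
func clampLimit(requested int) int {
	if requested < 1 {
		return defaultLimit
	}
	if requested > maxLimit {
		return maxLimit
	}
	return requested
}
```

Putting this in one shared helper, instead of in each list handler, keeps `ListDocuments`, `ListContractors`, and `ListResidences` from drifting apart.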
#### `CODE-M7` — Audit log not append-only · ☐
- **Where:** audit-log model / repository.
- **Fix:** Make it append-only — a DB trigger forbidding `UPDATE`/`DELETE`, or move to an event store. Remove the soft-delete column.
- **Verify:** an `UPDATE`/`DELETE` on the audit table is rejected.
#### `CODE-M11` — `golang.org/x/crypto` outdated · ☐
- **Where:** `go.mod:30` (`v0.49.0`).
- **Fix:** `go get -u golang.org/x/crypto`, re-run `govulncheck`, retest. Pairs with Stage 5 dependency automation.
- **Verify:** `govulncheck ./...` clean.
#### `CODE-M12` — Contractor toggle refetch race · ☐
- **Where:** `internal/services/contractor_service.go:279-307`.
- **Fix:** Do the toggle + read in one transaction so a concurrent soft-delete can't make it return `nil`.
- **Verify:** concurrent toggle + delete → defined result, no nil panic.
#### `CODE-M13` — Account-deletion endpoint unrate-limited · ☐
- **Where:** `internal/handlers/auth_handler.go:488-539`.
- **Fix:** Add a throttle to `DELETE /account`. **First resolve `V11`**: `LIVE-L18` claims no delete endpoint exists; reconcile before deciding whether this is "rate-limit it" or "expose it."
- **Verify:** repeated delete calls throttle.
#### `CODE-M10` — `node:20-alpine` floating tag · ☐
- **Where:** admin/web `Dockerfile` (`:2,112,134`).
- **Fix:** Pin to a specific patch version or digest.
- **Verify:** Dockerfile has no bare `node:20-alpine`.
### Low / Info (CODE-L1, L2)
#### `CODE-L1` — Inactive-account login enumeration · ☐
- **Where:** `internal/services/auth_service.go:76-77`.
- **Fix:** Return the same generic error for inactive accounts as for invalid credentials.
- **Verify:** inactive vs. wrong-password responses are byte-identical.
#### `CODE-L2` — Auth responses lack `Cache-Control: no-store` · ☐
- **Where:** `internal/handlers/auth_handler.go` (Login / CurrentUser / Refresh).
- **Fix:** Set `Cache-Control: no-store` on auth responses.
- **Verify:** the header is present.
### Live-scan code-level findings (LIVE-L1, L11–L20)
#### `LIVE-L1` — `/metrics` publicly exposed · HIGH · ☐
- **Where:** `cmd/api/main.go` route registration; vmagent scrapes it cluster-internally already.
- **Fix (recommended — Option B):** bind Prometheus metrics to a separate cluster-internal port (e.g. `:9090`), expose only via a ClusterIP Service the vmagent NetworkPolicy allows; the public Ingress never registers `/metrics`. Update `observability/vmagent.yaml` scrape target. (Alternative: block `/metrics` at Traefik via an `IngressRoute` — Stage 3.)
- **Verify:** `curl https://api.myhoneydue.com/metrics` → `404`; vmagent still scrapes successfully.
#### `LIVE-L11` — Login user-enumeration via timing · HIGH · ☐
- **Where:** login handler / `auth_service.go`.
- **Fix:** Always run a bcrypt compare against a fixed dummy hash when the user is not found, so the response time is constant.
- **Verify:** real vs. fake email login timing delta < network noise.
#### `LIVE-L12` — No rate-limit on login · HIGH · ☐
- See the consolidated auth-rate-limit item: `K3S-F10` (edge) + `CODE-H1/H2/H3/M5` (app). Closed when both land.
#### `LIVE-L13` — Password-reset timing enumeration · HIGH · ☐
- **Where:** `forgot-password` handler.
- **Fix:** Enqueue the reset email on the Asynq queue and return the generic response immediately, so real vs. fake emails have identical latency.
- **Verify:** real vs. fake email reset timing delta < network noise.
#### `LIVE-L14` / `LIVE-L15` — Sequential integer IDs · MEDIUM · ⊘ (deferred)
- **Where:** all user-facing IDs.
- **Decision:** Real enumeration/intel leak, but migrating to UUID/ULID touches API, web, mobile, and webhook payloads. **Deferred to a planned quarter** — not a redeploy-stage fix. Track on the roadmap; revisit before the userbase size becomes commercially sensitive.
#### `LIVE-L16` — Pagination `limit` uncapped · MEDIUM · ☐
- Duplicate of `CODE-M6` — closed with it.
#### `LIVE-L17` — Garbage pagination params silently accepted · LOW · ☐
- **Where:** query-param parsing in list handlers.
- **Fix:** Return `400` naming the bad parameter instead of silently using defaults.
- **Verify:** `?limit=abc` → `400`.
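
Strict parsing of a pagination parameter might be sketched as follows. The helper name `parseIntParam` is hypothetical, and treating an empty value such as `?limit=` the same as a missing parameter is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// parseIntParam reads an optional integer query parameter. A missing or
// empty parameter yields def. A present but non-numeric one yields an
// error naming the parameter, which a handler can return as a 400.
func parseIntParam(q url.Values, name string, def int) (int, error) {
	raw := q.Get(name)
	if raw == "" {
		return def, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid %q parameter: %q is not an integer", name, raw)
	}
	return n, nil
}
```

The returned error text can go straight into the `400` response body, since it contains only the parameter name and the value the client sent.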
#### `LIVE-L18` — No account-deletion endpoint (GDPR) · LOW · ☐
- **Where:** `internal/router/router.go`, `internal/handlers/auth_handler.go`.
- **Fix:** Reconcile with `CODE-M13` first (`V11`). Provide `DELETE /api/auth/me/` that anonymizes PII, cascades/transfers residences, revokes tokens, and writes an audit-trail row. Also closes the throwaway-account cleanup gap the live scan left behind.
- **Verify:** an authenticated user can delete their own account; PII is anonymized.
#### `LIVE-L19` — Email verification not enforced · LOW · ☐
- **Where:** router middleware.
- **Fix:** Add a `RequireVerified()` middleware on sensitive routes (share-code generation/redemption, anything that emails other users), or cap unverified accounts (1 residence, no share codes) until verified.
- **Verify:** an unverified account is blocked from the chosen gated routes.
#### `LIVE-L20` — Profile-update silently drops unknown fields · INFO · ☐
- **Where:** `PATCH /api/auth/profile/` handler.
- **Fix:** Either accept the fields (if intended) or return `400` listing unsupported keys — don't silently `200`.
- **Verify:** an unknown field yields a clear response.
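
One way to get the "return `400` on unknown keys" behaviour with the standard library is `json.Decoder.DisallowUnknownFields`. In this sketch the struct `profileUpdate` and its two fields are hypothetical placeholders; the real handler's request type may differ.

```go
package main

import (
	"encoding/json"
	"strings"
)

// profileUpdate lists the fields the endpoint accepts. These field
// names are hypothetical placeholders.
type profileUpdate struct {
	FirstName *string `json:"first_name"`
	LastName  *string `json:"last_name"`
}

// decodeProfileUpdate rejects any JSON key that profileUpdate does not declare.
func decodeProfileUpdate(body string) (profileUpdate, error) {
	var p profileUpdate
	dec := json.NewDecoder(strings.NewReader(body))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&p); err != nil {
		return profileUpdate{}, err
	}
	return p, nil
}
```

The decoder stops at the first unknown key, so the error names one field and not the full list. Listing every unsupported key would need a second pass over the raw JSON object.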
#### `LIVE-L10` — `x-powered-by` — see Stage 0 (Next.js config).
---
## Stage 5 — CI / build pipeline
Build-time controls. Where there is no CI pipeline file yet, the fix is to
add one (or a `03-deploy.sh` step) so the control runs on every build.
### `K3S-F5` / `K3S-F14` / `CODE-L4` — Pin images by digest · HIGH · ☐
- **Where:** `03-deploy.sh` (currently tags by git short SHA, lines 47/57-61, and also pushes `:latest`), all `deploy-k3s/manifests/*/deployment.yaml`.
- **Fix:** After `docker push`, capture the digest (`crane digest …` or parse `docker push` output) and substitute `@sha256:…` into the manifests instead of `IMAGE_PLACEHOLDER` tags. Pin `redis` and `vmagent` by digest too. Reconsider pushing `:latest` — a mutable `:latest` undercuts digest pinning.
- **Verify:** `kubectl -n honeydue get deploy -o jsonpath` shows every image as `@sha256:`.
### `K3S-F8` — Secrets as file mounts, not env vars · MEDIUM · ☑ · In-repo: Y
- **Where:** `api`/`worker` `deployment.yaml`, `internal/config/config.go`, `cmd/api/main.go`, `cmd/worker/main.go`, `02-setup-secrets.sh`.
- **Done (2026-05-16):**
- `config.loadFileSecrets()` reads each of the 9 secret keys (`POSTGRES_PASSWORD`, `SECRET_KEY`, `EMAIL_HOST_PASSWORD`, `FCM_SERVER_KEY`, `REDIS_PASSWORD`, `B2_KEY_ID`, `B2_APP_KEY`, `OBS_INGEST_TOKEN`, `OBS_TRACES_URL`) from `/etc/honeydue/secrets/<KEY>` and `viper.Set`s it (highest precedence). A missing file is a silent skip, so the same binary still works from env vars in local/dev.
- `api`/`worker` `deployment.yaml` no longer inject **any** secret as an `env: secretKeyRef`. `honeydue-secrets` is mounted as a volume (`defaultMode: 0400`), read-only, at `/etc/honeydue/secrets`. Non-secret config still arrives via `envFrom: configMapRef`.
- `cmd/api`/`cmd/worker` read the observability endpoints through the new `config.SecretValue()` (Viper-backed) instead of `os.Getenv`, so file-mounted `OBS_*` values resolve now that they are gone from the environment.
- `02-setup-secrets.sh` now also writes `B2_KEY_ID`/`B2_APP_KEY` into `honeydue-secrets` — reconciling the script-vs-manifest drift (the manifests referenced these keys but the script never created them).
- **Scoped exception:** the one-shot `honeydue-migrate` Job still takes `POSTGRES_PASSWORD` as an env var. goose is invoked as a CLI with the password inside the DSN argument, so the value is exposed in that process regardless of env-vs-file; the Job is transient (one run, seconds, pod GC'd) so this is accepted.
- **Verify:** `kubectl -n honeydue exec deploy/api -- env` shows no `POSTGRES_PASSWORD`/`SECRET_KEY`; `kubectl -n honeydue exec deploy/api -- ls /etc/honeydue/secrets` lists the key files.
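The verify step above names only two keys. A sketch that checks all nine keys from the "Done" list could look like this; `leaked_secret_keys` is a hypothetical helper, and the key list is copied from this section rather than read from the cluster.

```bash
#!/usr/bin/env bash
# Key names copied from the "Done" list above; the helper itself is hypothetical.
SECRET_KEYS=(POSTGRES_PASSWORD SECRET_KEY EMAIL_HOST_PASSWORD FCM_SERVER_KEY
             REDIS_PASSWORD B2_KEY_ID B2_APP_KEY OBS_INGEST_TOKEN OBS_TRACES_URL)

# Reads `env` output (KEY=value lines) on stdin and prints each secret key
# that is present. Returns 0 when nothing leaked, 1 otherwise.
leaked_secret_keys() {
  local line name key found=0
  while IFS= read -r line || [[ -n "$line" ]]; do
    name=${line%%=*}            # text before the first "="
    for key in "${SECRET_KEYS[@]}"; do
      if [[ "$name" == "$key" ]]; then
        printf '%s\n' "$key"
        found=1
      fi
    done
  done
  return "$found"
}

# Illustrative use against the live deployment:
#   kubectl -n honeydue exec deploy/api -- env | leaked_secret_keys
```

The match is on the exact variable name, so an unrelated variable that merely contains `SECRET_KEY` in its name is not flagged.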
### `CODE-L5` — Image signing + scanning · LOW · ◐ · In-repo: Y
- **Where:** `03-deploy.sh`, `deploy-k3s/manifests/kyverno-verify-images.yaml`.
- **Done (in-repo, 2026-05-16):**
- `03-deploy.sh` runs `cosign sign` after each push and a `trivy image --severity HIGH,CRITICAL` scan before push — both **guarded**: they no-op when the tool is absent, so they never break a deploy on a host without them.
- A ready-to-use Kyverno `ClusterPolicy` ships at `deploy-k3s/manifests/kyverno-verify-images.yaml`. It matches only the four `gitea.treytartt.com/admin/honeydue-*` images, starts in `Audit` mode, and is **intentionally not applied by `03-deploy.sh`** — applying a verify-images policy with no key would block every Pod from scheduling.
- **Remaining (operator — cannot be committed):**
1. Install Kyverno in the cluster (admission controller).
2. `cosign generate-key-pair`; set `COSIGN_KEY` in the deploy env so signing activates; paste `cosign.pub` into the policy's `publicKeys` block.
3. `kubectl apply -f deploy-k3s/manifests/kyverno-verify-images.yaml`, confirm Pods still schedule, then flip `validationFailureAction: Audit → Enforce`.
- **Verify:** an unsigned image is rejected by admission; `03-deploy.sh` fails on a HIGH/CRITICAL CVE.
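The "guarded" behaviour described above can be sketched as a small wrapper. This is an illustration of the pattern, not the code in `03-deploy.sh`; `run_if_present` is a hypothetical name, and `IMAGE` is assumed to be set by the caller.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper illustrating the guard; not taken from 03-deploy.sh.

# Runs "$@" only when its first word is an installed command. Otherwise it
# logs a skip to stderr and succeeds, so a host without the tool still deploys.
run_if_present() {
  local tool=$1
  if ! command -v "$tool" >/dev/null 2>&1; then
    printf 'skip: %s not installed\n' "$tool" >&2
    return 0
  fi
  "$@"
}

# Illustrative calls (IMAGE and COSIGN_KEY are assumed to be set):
#   run_if_present trivy image --severity HIGH,CRITICAL --exit-code 1 "$IMAGE"
#   run_if_present cosign sign --key "$COSIGN_KEY" "$IMAGE"
```

Trivy exits `0` by default even when it reports findings, so the scan only fails the deploy when `--exit-code 1` is passed. Whether `03-deploy.sh` already does this is not stated in this section and is worth confirming against the verify step.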
### `CODE-M11` (CI half) — Dependency hygiene · ☐
- **Fix:** Add a scheduled job that runs `go get -u` and then `govulncheck`. The audit confirms `govulncheck` and `gitleaks` already run in CI, so the missing piece is a regular dependency-update cadence.
- **Verify:** stale-dependency alerts surface automatically.
---
## Stage 6 — Post-deploy verification & runtime investigations
`04-verify.sh` already runs a security block (secret encryption, NetworkPolicy
count, ServiceAccounts, pod security contexts, PDBs, `cloudflare-only`
middleware, `admin-basic-auth`). **Extend it so each fix above stays fixed,
and work the open investigations the audits could not resolve.**
### Extend `04-verify.sh` with assertions for · ☐
- Redis rejects unauthenticated `PING` (`K3S-F1`).
- Admin ingress annotation contains `admin-auth` (`K3S-F2`).
- `/metrics` returns `404` on the public host (`LIVE-L1`).
- Every container (incl. `vmagent`) has a full `securityContext` (`K3S-F7`).
- `automountServiceAccountToken: false` on app pods (`K3S-F11`).
- Every workload image is digest-pinned (`K3S-F5`).
- No `DEBUG_FIXED_CODES` key in the prod ConfigMap (`CODE-C4`).
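One way to add these assertions is a small pass/fail helper that prints a `✓` or `✗` line per check. The sketch below assumes nothing about how `04-verify.sh` is currently structured; `check`, `FAILURES`, `no_debug_codes`, `metrics_hidden`, and `PUBLIC_HOST` are all hypothetical names.

```bash
#!/usr/bin/env bash
# Hypothetical helper; the real 04-verify.sh may structure its checks differently.
FAILURES=0

# check DESCRIPTION COMMAND [ARGS...]
# Runs the command silently, prints a ✓ or ✗ line, and counts failures.
check() {
  local desc=$1
  shift
  if "$@" >/dev/null 2>&1; then
    printf '✓ %s\n' "$desc"
  else
    printf '✗ %s\n' "$desc"
    FAILURES=$((FAILURES + 1))
  fi
}

# Illustrative assertions; resource names and PUBLIC_HOST are assumptions:
#   no_debug_codes() { ! kubectl -n honeydue get configmap -o yaml | grep -q DEBUG_FIXED_CODES; }
#   metrics_hidden() { [ "$(curl -s -o /dev/null -w '%{http_code}' "https://$PUBLIC_HOST/metrics")" = 404 ]; }
#   check "no DEBUG_FIXED_CODES in ConfigMaps (CODE-C4)" no_debug_codes
#   check "/metrics returns 404 on the public host (LIVE-L1)" metrics_hidden
#   [ "$FAILURES" -eq 0 ] || exit 1
```

Counting failures instead of exiting on the first one lets a single run report every regression at once.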
### Runtime investigations (cannot be closed by code review alone)
| ID | Item | Source | Action |
|---|---|---|---|
| `V1` | Apple/Google Sign-In token validation depth | LIVE | Test with a self-signed Apple identity token; confirm signature/aud/nonce checks |
| `V2` | Webhook signature verification — confirm webhook routes are **outside** the auth middleware in `router.go` (live scan saw `401`s, signature middleware may never run) | LIVE | Code-review `internal/router/router.go` |
| `V3` | File-upload security — locate upload paths, test polyglots / MIME bypass / path traversal in filename / oversized files | LIVE | Focused upload security test |
| `V4` | Long-term token validity / revocation behaviour | LIVE | Test token expiry + revocation over time |
| `V5` | Apple IAP receipt validation with a real sandbox StoreKit receipt | LIVE | Sandbox test |
| `V6` | Share-code system — find the endpoint path; test brute-force, single-use, expiration | LIVE | Locate + test |
| `V7` | Trial-expiration enforcement — age a test account past 14 days, confirm `limitations_enabled` flips and creation gates fire | LIVE | Aged-account test |
| `V8` | `FindByAppleReceiptContains` — confirm equality, not `LIKE`. If `LIKE`, escalate `CODE-C13` to confirmed Critical | CODE | SQL review |
| `V9` | Rate-limiter storage — confirm `rate_limit.go` is Redis-backed (shared across 3 api replicas); in-memory = 3× the intended limit | CODE | Code review |
| `V10` | `X-Forwarded-For` / Echo `RealIP` trust behind Traefik — without it per-IP limits collapse to the ingress IP | CODE | Code + Traefik config review |
| `V11` | Account-deletion contradiction — `LIVE-L18` (no endpoint) vs `CODE-M13` (endpoint at `auth_handler.go:488-539`). Resolve before Stage 4 planning | LIVE/CODE | Route review |
| `V12` | etcd encryption — `04-verify.sh` only greps a string; truly confirm with `k3s secrets-encrypt status` on each server node | K3S | SSH check |
| `V13` | `user_authtoken` index — confirm a `user_id` lookup index exists before hashing tokens at rest (`CODE-C1`) | CODE | Schema check |
---
## Accepted risks / deferred (this cycle)
| ID | Item | Rationale |
|---|---|---|
| `K3S-F15` | Public-IP nodes, no VPC | Re-provision-scale change; Hetzner firewall (`K3S-CG3`) is the compensating control. Roadmap. |
| `K3S-F16` | Combined control-plane/worker nodes | Standard small-cluster k3s; revisit on workload growth. |
| `LIVE-L14`/`L15` | Sequential integer IDs | UUID migration spans API + web + mobile + webhooks; planned quarter, not this cycle. |
Mirror these in `docs/deployment/20-roadmap.md` so they are not silently lost.
---
## Documentation drift corrected alongside this plan
The audits contradicted the existing deployment book. These corrections ship
with this plan so the docs match audited reality:
| Doc | Claimed | Reality (audit) | Action |
|---|---|---|---|
| `05-security.md` | `automountServiceAccountToken: false` set | `K3S-F11`: not set on any workload | Corrected to "TODO" + linked here |
| `05-security.md` | NetworkPolicies "not currently applied" (TODO) | Applied 2026-04-24; `03-deploy.sh:155` applies them | Corrected to "applied" |
| `05-security.md` | CF↔origin is plaintext (SSL=Flexible) | Upgraded to Full (strict) 2026-04-24 | Corrected |
| `05-security.md` | SHA tags immutable / "we'd notice a digest change" | `K3S-F5`: short SHA tags are mutable | Corrected; points to `K3S-F5` |
| `SECURITY.md` (old) | Redis "requires a password" | `K3S-F1`: no auth | This rewrite |
| `SECURITY.md` (old) | etcd `secrets-encryption: true` | `K3S-CG1`: not verified / not on | This rewrite |
| `SECURITY.md` (old) | fail2ban active | `05-security.md` + `K3S-CG2`: not installed | This rewrite |
| `20-roadmap.md` | — | Audit findings not represented | Audit items folded in |
---
## Hardened-redeploy checklist (run order)
A clean rebuild of the whole stack, with every fix above applied:
```
□ Stage 0 DNS once-off: DMARC, SPF, CAA at Cloudflare; security.txt route live
□ Stage 1 Provision: hetzner-k3s config carries --write-kubeconfig-mode=600
and --secrets-encryption; run 01-provision-cluster.sh
□ Stage 1 Node OS: fail2ban + unattended-upgrades + SSH/sysctl on each node
□ Stage 1 Verify cluster: K3S-CG3..CG8 (firewall, snapshots, kubelet, perms)
□ Stage 2 Config: config.yaml has redis.password + admin.basic_auth_*;
no DEBUG_FIXED_CODES; SECRET_KEY ≥32 chars
□ Stage 2 Secrets: run 02-setup-secrets.sh — confirm redis + admin-basic-auth
□ Stage 3 Manifests: admin ingress middlewares wired; imagePullSecret name
consistent; vmagent securityContext; COOP/CORP headers;
auth-rate-limit; automountServiceAccountToken:false;
HSTS preload; X-XSS-Protection dropped; imagePullPolicy set
□ Stage 4 Code+image: all C/H/M/L code fixes committed; image rebuilt;
goose migrations for C1/C5/C6/C11/C12 present
□ Stage 5 CI: images digest-pinned + signed + scanned; secrets file-mounted
□ Stage 6 Verify: run 04-verify.sh (extended); work V1–V13
□ Post: Submit myhoneydue.com to hstspreload.org
```
A redeploy is "clean" only when `04-verify.sh` (extended per Stage 6) passes
with zero `✗` lines and every checkbox in the master index is ☑ or ⊘.
---
## Appendix — Incident response playbooks
Preserved from the previous `SECURITY.md`; still current.
### Compromised API token
Rotate `SECRET_KEY` to invalidate all tokens, then restart api/worker:
```bash
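# Generate a fresh 256-bit key (64 hex characters).
openssl rand -hex 32 > secrets/secret_key.txt
# Push the new value into the cluster Secret.
./scripts/02-setup-secrets.sh
# Restart both workloads so they load the rotated key.
kubectl rollout restart deployment/api deployment/worker -n honeydue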
```
(After `CODE-C1` lands, tokens are hashed at rest — a DB read no longer yields
usable tokens, but `SECRET_KEY` rotation remains the kill-switch.)
### Compromised database credentials
Rotate in the Neon dashboard, update `secrets/postgres_password.txt`, re-run
`02-setup-secrets.sh`, restart api/worker, watch logs for connection errors.
### Compromised push keys
APNs: revoke in Apple Developer, drop the new `.p8` into `secrets/`, re-run
`02-setup-secrets.sh`, restart api/worker. FCM: rotate the key in Firebase,
update `secrets/fcm_server_key.txt`, re-run, restart.
### Suspicious pod
```bash
kubectl logs <pod> -n honeydue > /tmp/pod-logs.txt
kubectl describe pod <pod> -n honeydue > /tmp/pod-describe.txt
kubectl delete pod <pod> -n honeydue # deployment recreates it
```
### Communication
Document the timeline privately; on a data breach notify affected users
within 72 hours; rotate every potentially-exposed credential; write a
post-mortem (root cause, timeline, remediation, prevention).
---
## References
- Audit reports: `live_scan_5_12.md`, `k3_audit_5_12.md`, `security_scan_5_12.md` (repo root)
- Current architecture: `docs/deployment/05-security.md`
- Roadmap: `docs/deployment/20-roadmap.md`
- Deploy process: `docs/deployment/14-deployment-process.md`
- Scripts: `deploy-k3s/scripts/{01-provision-cluster,02-setup-secrets,03-deploy,04-verify}.sh`
- Manifests: `deploy-k3s/manifests/`