Trey t c77ff07ce9

fix(security): remediate 2026-05-12 audit findings (Stages 2–5)

Remediation of the 2026-05-12/13 audits (78 findings + cluster gaps),
tracked in deploy-k3s/SECURITY.md, plus fixes from two independent
post-remediation reviews.

Auth & sessions:
- SHA-256 hashed auth-token storage (C1); prior-token cache eviction on
  re-login (MEDIUM-1)
- local Google JWKS verification, iss/aud/exp checks (C2/C3)
- constant-time login + generic errors (L1/LIVE-L11/LIVE-L13)
- per-account login lockout keyed on distinct source IPs (M5/MEDIUM-3)
- verified-email gating, login rate limiting (LIVE-L19, H1-H3)

IAP & webhooks:
- Apple/Google cross-account replay protection (C5/C6/C10/C13, H5/H6)
- migrations 000003-000006 (token hashing, IAP replay, audit_log +
  webhook_event_log table creation, append-only audit log)

Authorization & races:
- file-ownership owner-OR-member fix (C7), atomic share-code join
  (C9/H9), device-token reassignment (C8/LOW-3)

Secrets & deploy:
- secrets file-mounted at /etc/honeydue/secrets, not env (F8); Redis
  password out of the ConfigMap (HIGH-1); B2 keys reconciled
- digest-pinned images, admin ingress hardening, CSP/HSTS, /metrics
  lockdown; kubeconfig 0600, etcd secrets-encryption, fail2ban +
  unattended-upgrades at provision; secret-rotation runbook

Build, vet, and the full test suite (incl. -race) pass; the goose
migration chain is verified against PostgreSQL 16.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-16 22:28:33 -05:00


honeyDue — Production Security Remediation Plan

This document is the single source of truth for fixing every security finding from the 2026-05-12/13 audits, and for keeping those fixes baked into the stack so a full redeploy never reproduces them.

It replaces the previous aspirational SECURITY.md (which described a desired state that, per the audits, was never fully true). The accurate current architecture lives in docs/deployment/05-security.md; this file is the work list.

Last updated: 2026-05-16

Audit sources (kept at repo root):

Tag	File	Scope	Findings
`LIVE`	`live_scan_5_12.md`	External black-box scan of api/admin/app	L1–L20 (20)
`K3S`	`k3_audit_5_12.md`	k3s cluster + `honeydue` namespace audit	F1–F17 (17) + 8 coverage gaps
`CODE`	`security_scan_5_12.md`	Static audit of `honeyDueAPI-go`	C1–C13, H1–H9, M1–M13, L1–L6 (41)

Total: 78 findings + 8 cluster coverage gaps + 13 runtime verification items.

How to use this document

The plan is organised by redeploy stage, not by severity, because the operator's goal is: redeploy the entire stack and come up clean. Each finding is tagged with where its fix lives:

Marker	Meaning
In-repo: Y	Fix lives in a committed file (`config.yaml`, a manifest, a script, Go code, a Dockerfile). Once committed, every redeploy re-applies it automatically.
In-repo: N	Fix is external state (DNS records, Cloudflare dashboard, Hetzner firewall, hstspreload.org). A redeploy does not touch it — it survives on its own but must be done once and tracked here.

Status legend: ☐ open · ◐ in progress · ☑ done · ⊘ accepted risk / deferred

Redeploy stage order (matches deploy-k3s/scripts/ run order):

Stage 0  DNS & Cloudflare edge          (external; no cluster needed)
Stage 1  Cluster provisioning & node OS (01-provision-cluster.sh / hetzner-k3s / SSH)
Stage 2  Secrets & config bootstrap     (02-setup-secrets.sh / config.yaml)
Stage 3  Kubernetes manifests           (deploy-k3s/manifests/, applied by 03-deploy.sh)
Stage 4  Application code & images      (honeyDueAPI-go source → rebuilt image)
Stage 5  CI / build pipeline            (image digest pinning, signing, scanning)
Stage 6  Post-deploy verification       (04-verify.sh + runtime investigations)

Golden rule for "redeploy clean": a fix only counts as done when it is committed to the file that the redeploy reads. A kubectl patch on the live cluster that is not mirrored into deploy-k3s/manifests/ will be wiped on the next 03-deploy.sh. Every entry below names the committed file.

Execution status (2026-05-16)

Stages 2–5 were executed in-repo, then put through an independent code review (see Post-remediation independent review below). The Go module builds clean and the full go test ./... suite passes. Four new goose migrations were added — 000003 (auth-token hashing), 000004 (IAP replay protection), 000005 (audit-log append-only + audit_log table create), 000006 (webhook_event_log table create) — and run automatically via the migrate Job before the api/worker rollout.

~63 findings fixed (☑) and verified — all of Stage 2 (secrets/config) and Stage 3 (Kubernetes manifests), every exploitable Stage 4 application finding (all 11 actioned Criticals + the auth / webhook / race / handler High & Medium fixes), Stage-5 image digest pinning and K3S-F8 (secrets are now file-mounted, not env vars), plus the in-repo half of Stage 1 cluster provisioning — K3S-F4 (kubeconfig written 0600), K3S-CG1 (etcd secrets-encryption), K3S-CG2 (fail2ban + unattended-upgrades installed at provision). Includes token hashing, Google JWKS verification, IAP replay protection, the authorization fixes, atomic share-code join, the metrics-endpoint lockdown, per-account login lockout, verified-email gating, CSP/HSTS hardening, and digest-pinned images.
1 partial (◐) — CODE-L5: cosign signing + a Trivy HIGH,CRITICAL scan are wired (guarded) into 03-deploy.sh, and a ready-to-use Kyverno ClusterPolicy ships at deploy-k3s/manifests/kyverno-verify-images.yaml. Closing it needs two operator actions that cannot be committed: install Kyverno in the cluster, and supply a cosign key pair (COSIGN_KEY for signing + the public key pasted into the policy).
Accepted / blocked / moot (⊘) — M3 (Apple nonce — blocked on an iOS-client change), C12 (moot — accounts are hard-deleted), LIVE-L14/L15 (UUID migration — planned quarter), LIVE-L17/L18/L20 (no security impact — see entries), F15/F16 (architectural), and LIVE-L2/L3/L4 (DMARC / SPF / CAA — operator-declined, below).
Operator-declined — Stage 0 DNS (LIVE-L2/L3/L4). The operator has opted not to add the DMARC, SPF-hardening, and CAA DNS records this cycle. For the record: these are not a paid-Cloudflare feature — DMARC and SPF are ordinary TXT records and CAA is an ordinary CAA record, all addable on any Cloudflare plan including Free. They remain genuine email-spoofing / certificate-issuance gaps and are marked ⊘; revisit when DNS is next touched.
Remaining operator runtime steps (no code to commit) — on the existing cluster: k3s secrets-encrypt enable/reencrypt (K3S-CG1 / V12) and chmod 600 the live kubeconfig (K3S-F4); the SSH/sysctl half of K3S-CG2; and the K3S-CG3–CG8 verification items. A full fresh provision already comes up with K3S-F4, K3S-CG1, and K3S-CG2 (fail2ban + unattended-upgrades) applied straight from _config.sh.

Operator note: C1 (token hashing) invalidates every existing login session once at deploy and makes login single-session per user — see the CODE-C1 entry. The status boxes in the master index below are authoritative.
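
To make the C1 note concrete, here is a minimal sketch of hashed token storage. It assumes tokens are opaque random strings and that the database keeps only the hex SHA-256 digest; the function names (newToken, hashToken, tokenMatches) are hypothetical and are not taken from the honeyDue code.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// newToken returns a random 32-byte token, hex-encoded. Only the client
// ever holds this plaintext value.
func newToken() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// hashToken returns the hex SHA-256 digest that is stored instead of the
// plaintext token.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// tokenMatches hashes the presented token and compares it with the stored
// digest in constant time.
func tokenMatches(presented, storedHash string) bool {
	got := hashToken(presented)
	return subtle.ConstantTimeCompare([]byte(got), []byte(storedHash)) == 1
}

func main() {
	tok, err := newToken()
	if err != nil {
		panic(err)
	}
	stored := hashToken(tok) // this digest is what the token table would hold
	fmt.Println(tokenMatches(tok, stored))
	fmt.Println(tokenMatches("guessed-token", stored))
}
```

With a scheme like this, a leaked table row yields only a digest, which cannot be replayed as a bearer token.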

Post-remediation independent review (2026-05-16)

The change set went through two independent review passes; the deploy-time verification below (build, go test -race, full goose up against real PostgreSQL 16) was executed and passed.

First pass. A separate review agent audited the full change set against the three audit files. It surfaced three deploy-breaking defects that a green go test could not catch — the test harness builds two tables via GORM AutoMigrate, which production never runs — all since fixed:

audit_log table was never created by a migration. 000005 added append-only triggers to a table that exists only in the test DB, so a from-scratch goose up would fail on 000005. 000005 now does CREATE TABLE IF NOT EXISTS audit_log before the triggers.
webhook_event_log table was never created by a migration. The H6 fail-closed webhook dedup turns a missing table into a 500 on every subscription webhook. New migration 000006 creates it.
000004's google_purchase_token unique index could fail to build on a production table already holding duplicate tokens — exactly the C6 replay the migration fixes. 000004 now de-duplicates (keep-earliest, NULL-the-rest) before creating the index.

It also tightened the C13 Apple-webhook lookup (subscription_webhook_handler.go) so the legacy substring scan runs only on a genuine ErrRecordNotFound, never masking a real DB error as "not found".
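
The C13 tightening follows a general pattern: fall back to a legacy lookup only when the primary lookup reports a genuine not-found. The sketch below illustrates that pattern under stated assumptions; errNotFound stands in for the ORM's ErrRecordNotFound sentinel, and findUser and lookupFunc are hypothetical names, not the handler's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the ORM's record-not-found sentinel (assumption).
var errNotFound = errors.New("record not found")

// errUnavailable simulates a real database failure.
var errUnavailable = errors.New("database unavailable")

// lookupFunc resolves a transaction identifier to a user ID.
type lookupFunc func(txID string) (int, error)

// findUser tries the exact lookup first. The legacy lookup runs only when
// the exact lookup reports not-found; any other error is returned wrapped,
// so a database outage is never reported as "no such user".
func findUser(txID string, exact, legacy lookupFunc) (int, error) {
	userID, err := exact(txID)
	switch {
	case err == nil:
		return userID, nil
	case errors.Is(err, errNotFound):
		return legacy(txID)
	default:
		return 0, fmt.Errorf("exact lookup failed: %w", err)
	}
}

func main() {
	legacy := func(string) (int, error) { return 42, nil }
	missing := func(string) (int, error) { return 0, errNotFound }
	broken := func(string) (int, error) { return 0, errUnavailable }

	fmt.Println(findUser("tx-1", missing, legacy))
	fmt.Println(findUser("tx-1", broken, legacy))
}
```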

Second pass (master review). A second, independent security-audit agent re-verified all four first-pass fixes (correct), ran go test -race (0 data races) and the full goose up/down chain against real PostgreSQL (clean, idempotent), and returned GO with one HIGH finding, since fixed:

HIGH-1 — Redis password leaked via the honeydue-config ConfigMap. _config.sh built REDIS_URL with the password embedded inline, and that URL is emitted into the honeydue-config ConfigMap (delivered to pods via envFrom). ConfigMaps are not covered by secrets-encryption and are readable by any principal with get configmap — so K3S-F1/K3S-F8 were not actually fully closed. Fixed (2026-05-16): _config.sh now emits REDIS_URL=redis://redis:6379/0 with no credentials; the password travels only as the file-mounted REDIS_PASSWORD secret. The API applies it in cache_service.go; cmd/worker/main.go now applies it onto the parsed Asynq RedisClientOpt so the server/inspector/monitoring client all authenticate against the requirepass Redis.
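
The HIGH-1 fix separates the connection URL from the credential. A minimal sketch of that split follows, using only the standard library. The redisOptions struct and optionsFromURL function are hypothetical stand-ins (the real worker applies the password onto Asynq's RedisClientOpt), and the secret file path is an assumption derived from the mount point named in this document.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"os"
	"strings"
)

// redisOptions is a hypothetical stand-in for a Redis client's option struct.
type redisOptions struct {
	Addr     string
	DB       string
	Password string
}

// optionsFromURL parses a credential-free Redis URL and applies a password
// supplied separately (for example, the contents of a mounted secret file).
// A URL that embeds credentials is rejected outright.
func optionsFromURL(rawURL string, secret []byte) (redisOptions, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return redisOptions{}, fmt.Errorf("parse redis url: %w", err)
	}
	if u.User != nil {
		return redisOptions{}, errors.New("redis url must not embed credentials")
	}
	password := strings.TrimSpace(string(secret))
	if password == "" {
		return redisOptions{}, errors.New("redis password is empty")
	}
	return redisOptions{
		Addr:     u.Host,
		DB:       strings.TrimPrefix(u.Path, "/"),
		Password: password,
	}, nil
}

func main() {
	// Assumed file name under the documented mount point.
	secret, err := os.ReadFile("/etc/honeydue/secrets/REDIS_PASSWORD")
	if err != nil {
		fmt.Println("secret not mounted:", err)
		return
	}
	opts, err := optionsFromURL("redis://redis:6379/0", secret)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("connecting to", opts.Addr, "db", opts.DB)
}
```

Rejecting a URL with embedded userinfo is a design choice in this sketch: it turns a regression of HIGH-1 into a startup failure rather than a silent leak.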

The master review's other seven findings (4 Medium, 3 Low — none deploy-blocking) were then all fixed (2026-05-16):

MEDIUM-1 — re-login left the prior token usable for ≤5 min. CreateFreshToken deleted the old token row but not its Redis cache entry. It now also returns the deleted tokens' hashes; AuthService.freshToken evicts them via the new CacheService.InvalidateAuthTokenHashes on every login / Apple / Google sign-in, so a prior (e.g. stolen) token stops authenticating immediately.
MEDIUM-2 — IAP .p8 mode check incompatible with k8s. The Apple IAP key check (iap_validation.go) required 0600-or-stricter, unattainable on a k8s Secret volume (0440 under fsGroup). It now rejects only world-accessible keys (perm & 0o007).
MEDIUM-3 — single-IP account-lockout DoS. The M5 per-account lockout is now keyed on the set of distinct source IPs that have failed (RegisterLoginFailure takes the IP, tracks a Redis set; lock at 5 distinct IPs). One attacker IP can no longer lock a victim out by spamming failures; genuinely distributed stuffing still trips it. Login now takes the client IP (c.RealIP()).
MEDIUM-4 — Redis no-auth deployable. 02-setup-secrets.sh now dies (was warn) when redis.password is empty, so a deploy can no longer bring up an unauthenticated Redis (K3S-F1).
LOW-1 / LOW-2 — missing regression tests. Added: config_test.go asserts validate() refuses DEBUG_FIXED_CODES with DEBUG=false (C4); subscription_repo_test.go asserts a second account cannot bind an Apple transaction / Google purchase token already bound to another (C5/C6).
LOW-3 — device-token 409. A recycled APNs/FCM token re-registering under a new account is now reassigned to that account (and logged) instead of returning a 409 that locked the legitimate new device owner out of push.
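
The MEDIUM-3 change is easiest to see in code. The sketch below models the distinct-IP lockout with an in-memory map instead of a Redis set, so it is an illustration of the counting rule only. The lockoutTracker type and registerFailure method are hypothetical names, and expiry of old failures and reset on successful login are deliberately left out.

```go
package main

import (
	"fmt"
	"sync"
)

// lockThreshold is the number of distinct failing source IPs that locks an
// account (5, per the MEDIUM-3 description).
const lockThreshold = 5

// lockoutTracker keeps, per account, the set of source IPs that have failed
// a login. It is an in-memory stand-in for a Redis set.
type lockoutTracker struct {
	mu       sync.Mutex
	failures map[string]map[string]struct{}
}

func newLockoutTracker() *lockoutTracker {
	return &lockoutTracker{failures: make(map[string]map[string]struct{})}
}

// registerFailure records a failed login from ip and reports whether the
// account is now locked. Repeated failures from one IP add nothing new to
// the set, so a single attacker cannot reach the threshold alone.
func (t *lockoutTracker) registerFailure(account, ip string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	ips, ok := t.failures[account]
	if !ok {
		ips = make(map[string]struct{})
		t.failures[account] = ips
	}
	ips[ip] = struct{}{}
	return len(ips) >= lockThreshold
}

func main() {
	tracker := newLockoutTracker()
	for i := 0; i < 1000; i++ {
		tracker.registerFailure("victim@example.com", "198.51.100.1")
	}
	fmt.Println("locked after one noisy IP:",
		tracker.registerFailure("victim@example.com", "198.51.100.1"))

	locked := false
	for i := 2; i <= 5; i++ {
		locked = tracker.registerFailure("victim@example.com", fmt.Sprintf("198.51.100.%d", i))
	}
	fmt.Println("locked after five distinct IPs:", locked)
}
```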

One earlier (first-pass) hardening item remains a tracked follow-up, not re-raised by the master review and not deploy-blocking: /metrics is gated by an X-Forwarded-For check rather than network-isolated. True isolation needs /metrics on a separate port plus a NetworkPolicy restricting the scrape to vmagent — an architectural change deferred to a later cycle.
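
For reference, a header-based gate of the kind described above can be sketched with net/http as follows. This is an illustration under assumptions, not the API's actual middleware: metricsGuard and probe are hypothetical names, the real service may use a different router, and the 404 response code is a choice made for this sketch.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// metricsGuard rejects any request that carries an X-Forwarded-For header,
// i.e. one that arrived through a proxy rather than directly in-cluster.
func metricsGuard(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if len(r.Header.Values("X-Forwarded-For")) > 0 {
			http.NotFound(w, r)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// probe sends one in-memory request through the guard and returns the
// resulting HTTP status code.
func probe(forwardedFor string) int {
	h := metricsGuard(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprintln(w, "# metrics")
	}))
	req := httptest.NewRequest(http.MethodGet, "/metrics", nil)
	if forwardedFor != "" {
		req.Header.Set("X-Forwarded-For", forwardedFor)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("direct scrape:", probe(""))
	fmt.Println("proxied request:", probe("203.0.113.7"))
}
```

The sketch also shows why this is a follow-up item: the gate depends on the proxy always adding the header, which is weaker than isolating the endpoint at the network layer.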

Consolidated work items (fix once, closes many)

Several findings are the same defect seen from three angles. Do the work once at the listed anchor; the rest close with it.

Theme	Anchor	Also closes
Auth-endpoint rate limiting	Stage 3 `auth-rate-limit` middleware + Stage 4 app limiter	`K3S-F10`, `LIVE-L12`, `CODE-H1`, `CODE-H2`, `CODE-H3`, `CODE-M5`
CSP / cross-origin headers	Stage 3 `security-headers` + Stage 4 app CSP	`K3S-F9`, `LIVE-L8`
HSTS `preload`	Stage 3 middleware + Stage 0 list submission	`LIVE-L5`, `CODE-L3`
Admin ingress hardening	Stage 2 secret + Stage 3 middleware wiring	`K3S-F2`, `K3S-F3`, `CODE-L6`
etcd encryption at rest	Stage 1 `--secrets-encryption`	`K3S-CG1`, `CODE-M9`
Image digest pinning + signing	Stage 5 CI	`K3S-F5`, `K3S-F14`, `CODE-L4`, `CODE-L5`
Pagination hard caps	Stage 4 app	`LIVE-L16`, `CODE-M6`
imagePullSecret name consistency	Stage 3 manifests + Stage 2 script	`K3S-F6`

Known contradiction (resolved 2026-05-15): LIVE-L18 says no account-deletion endpoint exists (every DELETE path 404/400), but CODE-M13 points at a delete handler at auth_handler.go:488-539. Checking internal/router/router.go settled it: the endpoint exists at DELETE /api/auth/account/ (router.go:546), a path the external scan never probed, so the fix is to rate-limit the existing endpoint rather than add a new one. Tracked as verification item V11.

Master finding index

Every finding, ordered by redeploy stage. Use this as the live tracker — flip the Status box as work lands.

Stage 0 — DNS & Cloudflare edge

ID	Sev	Finding	In-repo	Status
`LIVE-L2`	HIGH	No DMARC record — email spoofing open	N	⊘
`LIVE-L3`	MED	SPF ends `?all` (neutral — fails open)	N	⊘
`LIVE-L4`	MED	No CAA records — any CA may issue certs	N	⊘
`LIVE-L6`	LOW	No `/.well-known/security.txt`	Y	☐
`LIVE-L9`	INFO	Aggressive Cloudflare caching on admin SSR shell	N	☐
`LIVE-L10`	INFO	`x-powered-by: Next.js` framework leak	Y	☐

Stage 1 — Cluster provisioning & node OS

ID	Sev	Finding	In-repo	Status
`K3S-F4`	HIGH	Node kubeconfig world-readable (mode 644)	Y	☑
`K3S-F15`	INFO	Nodes on public IPs, no private VPC	Y	⊘
`K3S-F16`	INFO	All 3 nodes are control-plane + etcd + worker	Y	⊘
`K3S-F17`	INFO	Single-replica SPOFs (redis/worker/admin/vmagent)	Y	☐
`K3S-CG1`	—	etcd encryption at rest not verified (`--secrets-encryption`)	Y	☑
`K3S-CG2`	—	Node OS hardening: SSH, fail2ban, unattended-upgrades, sysctl	Y/N	◐
`K3S-CG3`	—	Hetzner Cloud Firewall rules not verified	N	☐
`K3S-CG4`	—	etcd snapshot backup destination/encryption not verified	Y	☐
`K3S-CG5`	—	kubelet flags (`--anonymous-auth=false`, webhook authz) not verified	Y	☐
`K3S-CG6`	—	Container-runtime CIS controls (`kube-bench`) not run	N	☐
`K3S-CG7`	—	`deploy` user sudoers least-privilege not verified	N	☐
`K3S-CG8`	—	`/etc/rancher/k3s/` dir + server-token perms not verified	N	☐

Stage 2 — Secrets & config bootstrap

ID	Sev	Finding	In-repo	Status
`K3S-F1`	CRIT	Redis runs with no authentication	Y	☑
`K3S-F3`	HIGH	`admin-basic-auth` secret never created	Y	☑
`K3S-F12`	MED	Secrets unrotated since cluster bootstrap; no runbook	Y	☑
`CODE-C4`	CRIT	`DEBUG_FIXED_CODES` "123456" auth bypass if it reaches prod	Y	☑
`CODE-M8`	MED	`SECRET_KEY` hardcoded debug fallback	Y	☑

Stage 2 status (2026-05-15): config.yaml now carries a Redis password and admin basic-auth user/password; 02-setup-secrets.sh uses bcrypt (htpasswd -nbB); internal/config/config.go generates an ephemeral random SECRET_KEY in debug instead of a static fallback and refuses to boot if DEBUG_FIXED_CODES is set with DEBUG=false; the rotation runbook is at docs/runbooks/secret-rotation.md. All take effect on the next 02-setup-secrets.sh + 03-deploy.sh.
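
The two config.go behaviours above can be sketched as a single validation step. This is a minimal illustration, not the real internal/config code: the config struct and its field names are hypothetical, and refusing an empty SECRET_KEY when DEBUG=false is an assumption added for the sketch.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
)

// config is a hypothetical subset of the application's settings.
type config struct {
	Debug           bool
	DebugFixedCodes bool
	SecretKey       string
}

// validate refuses fixed debug codes outside debug mode and, in debug mode
// only, replaces a missing secret key with a fresh random one so that no
// static fallback value exists in the binary.
func (c *config) validate() error {
	if c.DebugFixedCodes && !c.Debug {
		return errors.New("DEBUG_FIXED_CODES requires DEBUG=true")
	}
	if c.SecretKey == "" {
		if !c.Debug {
			// Assumption for this sketch: production must supply a key.
			return errors.New("SECRET_KEY is required when DEBUG=false")
		}
		buf := make([]byte, 32)
		if _, err := rand.Read(buf); err != nil {
			return fmt.Errorf("generate ephemeral secret key: %w", err)
		}
		c.SecretKey = hex.EncodeToString(buf)
	}
	return nil
}

func main() {
	prod := config{Debug: false, DebugFixedCodes: true, SecretKey: "set-from-secret"}
	fmt.Println(prod.validate())

	dev := config{Debug: true}
	fmt.Println(dev.validate(), len(dev.SecretKey))
}
```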

Stage 3 — Kubernetes manifests

ID	Sev	Finding	In-repo	Status
`K3S-F2`	HIGH	Admin ingress missing `cloudflare-only` + `admin-auth`	Y	☑
`K3S-F6`	HIGH	`imagePullSecrets` name mismatch (`ghcr-credentials`)	Y	☑
`K3S-F7`	MED	`vmagent` container missing `securityContext`	Y	☑
`K3S-F9`	MED	`security-headers` missing COOP/COEP/CORP	Y	☑
`K3S-F10`	MED	Uniform rate limit — no auth-endpoint tightening	Y	☑
`K3S-F11`	MED	`automountServiceAccountToken` not disabled	Y	☑
`K3S-F13`	LOW	`CORS_ALLOWED_ORIGINS` missing `app.myhoneydue.com`	Y	☑
`K3S-F14`	LOW	Public images (`redis`, `vmagent`) pinned by tag	Y	☑
`LIVE-L5`	LOW	HSTS not preload-eligible	Y	☑
`LIVE-L7`	LOW	Deprecated `X-XSS-Protection` header	Y	☑
`LIVE-L8`	LOW	CSP missing `object-src`/`base-uri`; COOP/COEP/CORP absent	Y	☑
`CODE-L3`	LOW	HSTS missing `preload` (duplicate of `LIVE-L5`)	Y	☑
`CODE-L4`	LOW	`imagePullPolicy` not set on Deployments	Y	☑
`CODE-L6`	LOW	Admin `admin-auth` middleware defined, not attached	Y	☑

Stage 3 status (2026-05-15): admin ingress now chains cloudflare-only + admin-auth + security-headers + rate-limit; a dedicated honeydue-api-auth Ingress applies a new auth-rate-limit middleware (5/min, burst 10) to login / register / forgot-password / reset-password / join-with-code; security-headers gained COOP + CORP, HSTS is now max-age=63072000; …; preload, and the deprecated X-XSS-Protection (browserXssFilter) is removed; vmagent has a container securityContext; all workload pods + the migrate Job set automountServiceAccountToken: false explicitly (on top of the rbac.yaml ServiceAccount-level setting that already existed); the registry secret is gitea-credentials everywhere; imagePullPolicy: IfNotPresent is explicit on every container; CORS includes app.myhoneydue.com. Still open: K3S-F14 (public-image digest pins) is folded into Stage 5 with K3S-F5; LIVE-L8 is partial — the COOP/CORP half shipped here, the CSP object-src/base-uri half is an app change tracked in Stage 4.

Stage 4 — Application code & container images

ID	Sev	Finding	In-repo	Status
`CODE-C1`	CRIT	Auth tokens stored plaintext in DB	Y	☑
`CODE-C2`	CRIT	Google ID token not verified locally	Y	☑
`CODE-C3`	CRIT	Google `iss` claim never validated	Y	☑
`CODE-C5`	CRIT	Apple IAP receipt replay across accounts	Y	☑
`CODE-C6`	CRIT	Google purchase-token replay across accounts	Y	☑
`CODE-C7`	CRIT	File-ownership check excludes residence owners	Y	☑
`CODE-C8`	CRIT	Device-token cross-account hijack on re-register	Y	☑
`CODE-C9`	CRIT	Share-code join not atomic (Add+Deactivate race)	Y	☑
`CODE-C10`	CRIT	Subscription upgrade race — validation outside txn	Y	☑
`CODE-C11`	CRIT	Task-completion duplicate-row race	Y	☑
`CODE-C12`	CRIT	Soft-deleted email reusable; `is_active` not filtered	Y	⊘
`CODE-C13`	CRIT	Apple webhook user lookup may LIKE-match	Y	☑
`CODE-H1`	HIGH	Rate limit doesn't cover all auth surfaces	Y	☑
`CODE-H2`	HIGH	No rate limit on `join-with-code`	Y	☑
`CODE-H3`	HIGH	No rate limit on `register`	Y	☑
`CODE-H4`	HIGH	Modulo bias in 6-digit code generation	Y	☑
`CODE-H5`	HIGH	Apple IAP `.p8` loaded with no file-mode check	Y	☑
`CODE-H6`	HIGH	Webhook dedup fail-open	Y	☑
`CODE-H7`	HIGH	Auth-failure log lacks IP/User-Agent	Y	☑
`CODE-H8`	HIGH	`X-Timezone` header trusted for trial-start calc	Y	☑
`CODE-H9`	HIGH	Share-code `Deactivate` error swallowed	Y	☑
`CODE-M1`	MED	HTTP header injection via `Content-Disposition` filename	Y	☑
`CODE-M2`	MED	bcrypt cost = 10 (recommend 12)	Y	☑
`CODE-M3`	MED	Apple Sign In nonce not validated	Y	⊘
`CODE-M4`	MED	Email verification not atomic	Y	☑
`CODE-M5`	MED	Per-user rate limiting absent	Y	☑
`CODE-M6`	MED	List endpoints uncapped (Documents/Contractors/Residences)	Y	☑
`CODE-M7`	MED	Audit log not append-only	Y	☑
`CODE-M10`	MED	`node:20-alpine` floating tag in Dockerfile	Y	☑
`CODE-M11`	MED	`golang.org/x/crypto v0.49.0` outdated	Y	☑
`CODE-M12`	MED	Contractor toggle refetch race	Y	☑
`CODE-M13`	MED	Account-deletion endpoint unrate-limited	Y	☑
`CODE-L1`	LOW	Login inactive-account error enables enumeration	Y	☑
`CODE-L2`	LOW	Auth responses lack `Cache-Control: no-store`	Y	☑
`LIVE-L1`	HIGH	`/metrics` publicly exposed on `api.myhoneydue.com`	Y	☑
`LIVE-L11`	HIGH	Login user-enumeration via timing	Y	☑
`LIVE-L12`	HIGH	No rate-limit on `/api/auth/login/`	Y	☑
`LIVE-L13`	HIGH	Password-reset user-enumeration via timing	Y	☑
`LIVE-L14`	MED	Sequential integer user IDs leak userbase size	Y	⊘
`LIVE-L15`	MED	Sequential integer resource IDs (same risk)	Y	⊘
`LIVE-L16`	MED	Pagination `limit` accepted at any size	Y	☑
`LIVE-L17`	LOW	Garbage pagination params silently accepted	Y	⊘
`LIVE-L18`	LOW	No account-deletion endpoint (GDPR gap)	Y	⊘
`LIVE-L19`	LOW	Email verification not enforced	Y	☑
`LIVE-L20`	INFO	Profile-update silently drops unknown fields	Y	⊘

Stage 4 handler/misc batch status (2026-05-15): M1 — Content-Disposition filenames are sanitized (control chars / quote / backslash stripped) so an upload filename cannot inject response headers. M7 — migration 000005 creates the audit_log table (no prior migration did — CREATE TABLE IF NOT EXISTS) and makes it append-only via BEFORE UPDATE/DELETE triggers. M11 — golang.org/x/crypto bumped v0.49.0 → v0.51.0. M13 — DELETE /api/auth/account now carries the Traefik auth-rate-limit edge limiter. LIVE-L18 ⊘ — not a real gap: the endpoint exists at DELETE /api/auth/account/ (router.go:546); the live scan probed /api/auth/me/, /auth/delete/, /users/me/ and missed it. Update (2026-05-15): items shown as deferred in an earlier draft were then completed — LIVE-L1 (/metrics rejects proxied/public requests via an X-Forwarded-For check, so only the in-cluster vmagent scrape reaches it), M6/LIVE-L16 (the document/contractor list repos already hard-cap at 500 rows), and LIVE-L19 (verified-email gating on share-code generation via the new RequireVerified middleware). LIVE-L17 (inert pagination params, results capped) and LIVE-L20 (whitelist profile update is the correct pattern) are closed as no-security-impact (⊘). The master index above is authoritative.
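
As an illustration of the M1 fix, the sketch below strips control characters, double quotes, and backslashes from a filename before it is placed in a Content-Disposition header. The function names and the "download" fallback for an empty result are assumptions made for this example, not the handler's actual code.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// sanitizeFilename drops control characters (including CR and LF), double
// quotes, and backslashes, so the name cannot terminate the quoted value or
// start a new header line.
func sanitizeFilename(name string) string {
	cleaned := strings.Map(func(r rune) rune {
		if unicode.IsControl(r) || r == '"' || r == '\\' {
			return -1 // a negative rune tells strings.Map to drop the character
		}
		return r
	}, name)
	if cleaned == "" {
		return "download" // fallback name, an assumption for this sketch
	}
	return cleaned
}

// contentDisposition builds the header value from a sanitized filename.
func contentDisposition(name string) string {
	return fmt.Sprintf(`attachment; filename="%s"`, sanitizeFilename(name))
}

func main() {
	// An upload name that tries to smuggle a second header.
	fmt.Println(contentDisposition("report\r\nSet-Cookie: x=1\".pdf"))
	// prints: attachment; filename="reportSet-Cookie: x=1.pdf"
}
```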

Stage 4 races batch status (2026-05-15): C9/H9 — share-code redemption is now one locked transaction in ResidenceRepository.JoinWithShareCode (lock the code row, re-check validity, add member, deactivate — a deactivation failure aborts the join). C11 — the task-completion duplicate-row race was already closed: the completion insert and the optimistically-version-locked task update share one transaction, so a concurrent completion fails ErrVersionConflict and rolls back its inserted row; no UNIQUE(task_id, completed_date) was added (it would reject legitimate same-day re-completions and risk a migration failure on existing data). M4 — email verification's find/consume/flag writes are now one transaction. M12 — a concurrent contractor delete now yields a clean 404. C12 ⊘ — premise moot: the app hard-deletes accounts (DeleteUserCascade), so there is no soft-deleted user whose email lingers, and ExistsByEmail already blocks re-registering a deactivated user's email.

Stage 4 auth batch status (2026-05-15): C1, C2, C3 done (see entries below). Rate limiting — every sensitive auth path now carries the shared Traefik auth-rate-limit edge limiter (login/register/forgot/reset/verify-reset/apple/google/refresh/join-with-code); login/register/forgot/reset/apple/google additionally keep the per-IP app limiter (H1/H2/H3/LIVE-L12). H4 rejection-sampled codes, M2 bcrypt cost 12, L1+LIVE-L11 constant-time generic-error login, L2 no-store on auth responses, H7 IP/UA in auth logs, LIVE-L13 fully-async forgot-password — all done; go build ./... and the models/repositories/middleware/handlers/services test packages pass. Deferred: M3 (Apple nonce) — needs the iOS client to generate and send a nonce; server-only validation would reject every Apple login, so this is blocked on a coordinated mobile change. H8 — the parseTimezone ±14h cap shipped; the "use server UTC for trial-start" half is folded into Stage 4's subscription work. M5 per-account lockout (Redis) was deferred at this date, on the grounds that the edge + per-IP app limiters + the existing per-account password-reset counter cover the practical risk; it has since shipped (2026-05-16, see MEDIUM-3 in the post-remediation review).
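
The H4 item (rejection-sampled codes) addresses modulo bias: 2^32 is not a multiple of 1,000,000, so reducing a random 32-bit value with % makes some codes slightly more likely than others. A minimal sketch of rejection sampling follows; sixDigitCode is a hypothetical name and the 32-bit draw size is a choice made for this example.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

// codeSpace is the number of distinct 6-digit codes (000000 to 999999).
const codeSpace = 1_000_000

// limit is the largest multiple of codeSpace that fits in 2^32. Values at or
// above it are discarded so every residue is equally likely.
const limit = (1 << 32) / codeSpace * codeSpace

// sixDigitCode draws 32 random bits, retries while the value falls in the
// biased tail, and returns the remainder as a zero-padded 6-digit string.
func sixDigitCode() (string, error) {
	var buf [4]byte
	for {
		if _, err := rand.Read(buf[:]); err != nil {
			return "", err
		}
		v := uint64(binary.BigEndian.Uint32(buf[:]))
		if v < limit {
			return fmt.Sprintf("%06d", v%codeSpace), nil
		}
	}
}

func main() {
	code, err := sixDigitCode()
	if err != nil {
		panic(err)
	}
	fmt.Println(code)
}
```

Only about 0.02% of draws land in the rejected tail, so the loop almost always finishes on the first iteration.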

Stage 5 — CI / build pipeline

ID	Sev	Finding	In-repo	Status
`K3S-F5`	HIGH	Images pinned by mutable short SHA tag, not digest	Y	☑
`K3S-F8`	MED	Secrets injected as env vars, not file mounts	Y	☑
`CODE-L5`	LOW	No image signing (cosign) in CI	Y	◐

Stage 5 status (2026-05-15): CODE-M11 done — golang.org/x/crypto bumped v0.49.0 → v0.51.0 (with the x/sys/x/term/x/text bumps go get -u pulled in), go mod tidy clean, full build + test green. Update (2026-05-15): K3S-F5/K3S-F14/CODE-M10 are done — 03-deploy.sh resolves the image digest after each push and deploys api/worker/admin/web by @sha256:, and redis/vmagent/node:20-alpine are pinned to their resolved index digests. Update (2026-05-16): K3S-F8 is done — the api/worker Deployments mount honeydue-secrets as files (defaultMode: 0400) at /etc/honeydue/secrets and inject no secret as an env var; config.loadFileSecrets reads them; 02-setup-secrets.sh now writes B2_KEY_ID/B2_APP_KEY into the secret, reconciling the earlier script-vs-manifest drift. CODE-L5 stays ◐ — cosign signing and a Trivy HIGH,CRITICAL scan are wired (guarded) into 03-deploy.sh and a ready-to-use Kyverno ClusterPolicy ships at deploy-k3s/manifests/kyverno-verify-images.yaml; closing it needs the operator to install Kyverno and supply a cosign key. See both entries.
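
A file-mounted secret loader of the kind K3S-F8 describes can be sketched as below. This is an illustration under assumptions rather than the real config.loadFileSecrets: it treats each file name as the secret's key, trims trailing whitespace, and skips dot-prefixed entries, which is how Kubernetes names its bookkeeping entries inside a Secret volume. loadFromTempDir is a hypothetical helper that exists only to demonstrate the loader without a cluster.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// loadFileSecrets reads every non-hidden file in dir into a map keyed by
// file name. Dot-prefixed entries and directories are skipped.
func loadFileSecrets(dir string) (map[string]string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, fmt.Errorf("read secrets dir: %w", err)
	}
	secrets := make(map[string]string, len(entries))
	for _, entry := range entries {
		if entry.IsDir() || strings.HasPrefix(entry.Name(), ".") {
			continue
		}
		raw, err := os.ReadFile(filepath.Join(dir, entry.Name()))
		if err != nil {
			return nil, fmt.Errorf("read secret %s: %w", entry.Name(), err)
		}
		secrets[entry.Name()] = strings.TrimSpace(string(raw))
	}
	return secrets, nil
}

// loadFromTempDir writes the given files (mode 0400) into a throwaway
// directory and loads them back with loadFileSecrets.
func loadFromTempDir(files map[string]string) (map[string]string, error) {
	dir, err := os.MkdirTemp("", "secrets-demo")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	for name, value := range files {
		if err := os.WriteFile(filepath.Join(dir, name), []byte(value), 0o400); err != nil {
			return nil, err
		}
	}
	return loadFileSecrets(dir)
}

func main() {
	secrets, err := loadFromTempDir(map[string]string{"REDIS_PASSWORD": "example-only\n"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("loaded %d secret(s); REDIS_PASSWORD has %d characters\n",
		len(secrets), len(secrets["REDIS_PASSWORD"]))
}
```

In the deployed pods the directory argument would be /etc/honeydue/secrets, the mount point named above.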

Stage 6 — Post-deploy verification & runtime investigations

V1–V13 — see Stage 6.

Stage 0 — DNS & Cloudflare edge

External state at Cloudflare. Not touched by 03-deploy.sh, so a redeploy neither breaks nor re-applies these — do them once and leave them. Tracked here so they are never forgotten on a domain move or DNS migration.

`LIVE-L2` — Add DMARC record · HIGH · ⊘

Operator decision (2026-05-16): declined for this cycle. A DMARC record is an ordinary DNS TXT record — it is not gated behind a paid Cloudflare plan and can be added on Free. This remains a real email-spoofing gap; revisit when DNS is next touched.
Where: Cloudflare DNS, TXT record at _dmarc.myhoneydue.com.
Fix: Publish v=DMARC1; p=quarantine; rua=mailto:dmarc@myhoneydue.com; ruf=mailto:dmarc@myhoneydue.com; fo=1; aspf=s; adkim=s. Start at pct=10 for 30 days, watch the rua aggregate reports, then ramp to pct=100 and finally p=reject.
Verify: dig +short TXT _dmarc.myhoneydue.com returns the record.

`LIVE-L3` — Tighten SPF from `?all` to `-all` · MEDIUM · ⊘

Operator decision (2026-05-16): declined for this cycle. SPF is an ordinary DNS TXT record, editable on any Cloudflare plan including Free. The ?all (neutral) qualifier leaves spoofed mail un-penalised; revisit alongside LIVE-L2.
Where: Cloudflare DNS, TXT record at myhoneydue.com.
Fix: Change v=spf1 include:spf.messagingengine.com ?all → ~all for ~7 days, confirm no legitimate mail (CI, transactional) is missed, then -all. Do this after LIVE-L2's DMARC ramp begins.
Verify: dig +short TXT myhoneydue.com | grep spf shows -all.

`LIVE-L4` — Add CAA records · MEDIUM · ⊘

Operator decision (2026-05-16): declined for this cycle. CAA is an ordinary DNS record type, addable on any Cloudflare plan including Free. Without it, any public CA may issue a cert for the domain; revisit when DNS is next touched.
Where: Cloudflare DNS, apex myhoneydue.com.
Fix: Add 0 issue "letsencrypt.org", 0 issuewild "letsencrypt.org", 0 iodef "mailto:security@myhoneydue.com". Add 0 issue "pki.goog" only if Google Trust Services is used anywhere. Confirm against the CAs Cloudflare Universal SSL actually uses before locking down.
Verify: dig +short CAA myhoneydue.com returns the records.

`LIVE-L6` — Publish `security.txt` · LOW · ☐ · In-repo: Y

Where: served by the Go API and/or Next.js apps at /.well-known/security.txt (RFC 9116) — committed route, so it survives redeploys.
Fix: Serve Contact:, Expires:, Preferred-Languages:, Canonical: on both api.myhoneydue.com and the apex.
Verify: curl https://api.myhoneydue.com/.well-known/security.txt → 200.
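
A committed route for this could look like the following net/http sketch. The field values are placeholders: the contact address reuses the security@ mailbox mentioned under LIVE-L4, the Expires date is an arbitrary example that must be kept current, and newMux and fetch are hypothetical names. The real API may register the route through a different router.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// securityTxt holds the RFC 9116 fields listed in the fix. All values are
// placeholders for this sketch.
const securityTxt = "Contact: mailto:security@myhoneydue.com\n" +
	"Expires: 2027-05-01T00:00:00Z\n" +
	"Preferred-Languages: en\n" +
	"Canonical: https://api.myhoneydue.com/.well-known/security.txt\n"

// newMux returns a mux that serves the file at the well-known path as
// plain text.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/.well-known/security.txt", func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "text/plain; charset=utf-8")
		fmt.Fprint(w, securityTxt)
	})
	return mux
}

// fetch issues an in-memory GET against the mux and returns the status code
// and response body.
func fetch(path string) (int, string) {
	rec := httptest.NewRecorder()
	newMux().ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := fetch("/.well-known/security.txt")
	fmt.Println(code)
	fmt.Print(body)
}
```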

`LIVE-L9` — Review Cloudflare caching of the admin SSR shell · INFO · ☐

Where: Cloudflare cache rules for admin.myhoneydue.com.
Fix: cache-control: s-maxage=31536000 on admin SSR pages means Cloudflare caches the admin shell for a year. Confirm this is intentional; if the admin shell ever contains per-session content, add a bypass-cache rule for admin.myhoneydue.com.
Verify: curl -sI https://admin.myhoneydue.com/ | grep -i cache reflects the intended policy.

`LIVE-L10` — Suppress `x-powered-by` · INFO · ☐ · In-repo: Y

Where: Next.js config in the admin and web repos (next.config.js → poweredByHeader: false). Committed, survives redeploys.
Fix: Disable the x-powered-by: Next.js header.
Verify: curl -sI https://admin.myhoneydue.com/ | grep -i x-powered-by returns nothing.

Stage 1 — Cluster provisioning & node OS

Run by 01-provision-cluster.sh (which drives the hetzner-k3s CLI from config.yaml via generate_cluster_config in _config.sh) plus one-time SSH hardening on each node. Any k3s server flag must be set in the hetzner-k3s cluster config so a cluster rebuild applies it.

`K3S-F4` — kubeconfig world-readable (mode 644 → 600) · HIGH · ☑ · In-repo: Y

Where: _config.sh → generate_cluster_config → k3s_config_file. Node file /etc/rancher/k3s/k3s.yaml.
Done (2026-05-16): generate_cluster_config now emits write-kubeconfig-mode: "0600" in the k3s config file, so any fresh provision writes the node kubeconfig as 0600.
Operator step on the existing cluster: a running node keeps the mode it was installed with — ssh deploy@<node> 'sudo chmod 600 /etc/rancher/k3s/k3s.yaml' on each. Deploy scripts still read it via sudo.
Verify: ssh deploy@<node> 'sudo stat -c %a /etc/rancher/k3s/k3s.yaml' → 600.

`K3S-CG1` / `CODE-M9` — etcd / Secret encryption at rest · ☑ · In-repo: Y

Where: _config.sh → generate_cluster_config → k3s_config_file.
Done: the k3s config file carries secrets-encryption: true, so a fresh provision boots with AES Secret encryption enabled. (The write-kubeconfig-mode line for K3S-F4 was added next to it on 2026-05-16.)
Operator step on the existing cluster: a cluster provisioned without the flag does not retro-encrypt — run k3s secrets-encrypt enable then k3s secrets-encrypt reencrypt once. Tracked as V12.
Verify: k3s secrets-encrypt status reports Encryption Status: Enabled on every server node.
Note: the old SECURITY.md claimed this was already on — 04-verify.sh greps for the string but cannot truly confirm; see V12.

`K3S-CG2` — Node OS hardening · ◐ · In-repo: partial

Where: _config.sh → generate_cluster_config → post_create_commands (runs on every node at provision).
Done (2026-05-16): post_create_commands now installs and enables fail2ban (SSH brute-force bans) and unattended-upgrades (automatic security patching) on every node at provision time — a fresh cluster comes up hardened on both.
Still operator (runtime; not yet in-repo):
- SSH — confirm PermitRootLogin no, PasswordAuthentication no, AllowUsers deploy, modern ciphers/MACs/KEX. (hetzner-k3s provisions key-only SSH; verify and tighten.)
- sysctl — confirm net.ipv4.ip_unprivileged_port_start=0 (Traefik) and standard network-hardening sysctls.
Verify: ssh deploy@<node> 'fail2ban-client status sshd; systemctl is-enabled unattended-upgrades'.

`K3S-CG3` — Hetzner Cloud Firewall rules · ☐ · In-repo: N

Fix: Confirm the firewall allows only these inbound rules: :443 from Cloudflare CIDRs, :22 from operator IP(s), and :6443 from operator IP(s). Nothing else should be open. This is the only network defense for the public-IP nodes (K3S-F15).
Verify: hcloud firewall describe honeydue-fw matches the intended ruleset; a direct curl to a node IP on :80/:443 from a non-CF host times out.

`K3S-CG4` — etcd snapshot backup · ☐ · In-repo: Y

Fix: Confirm k3s etcd snapshots are enabled (the k3s default schedule is every 12 hours) and shipped off-node — set --etcd-s3 (to Backblaze B2) with encryption. Without offsite snapshots, a 3-node loss is unrecoverable.
Verify: ls /var/lib/rancher/k3s/server/db/snapshots/ on a node + an object in the B2 backup bucket.

`K3S-CG5` — kubelet authn/authz flags · ☐ · In-repo: Y

Fix: Confirm --anonymous-auth=false and --authorization-mode=Webhook on the kubelet (k3s defaults are usually safe — verify, don't assume). Set via k3s kubelet-arg in the cluster config if missing.
Verify: kubectl get --raw /api/v1/nodes/<node>/proxy/configz shows the expected kubelet config.

`K3S-CG6` — Container-runtime CIS baseline · ☐ · In-repo: N

Fix: Run kube-bench once; remediate any FAIL lines that aren't k3s-by-design.
Verify: kube-bench run archived with FAILs triaged.

`K3S-CG7` — `deploy` user sudoers least-privilege · ☐ · In-repo: N

Fix: The current rule, deploy ALL=(ALL) NOPASSWD: ALL, means a compromised SSH key yields root on the node. Scope it to the commands deploys actually need (ufw, systemctl, chmod on k3s.yaml, cat of k3s.yaml). If the broad rule is kept for convenience, record it as an explicitly accepted risk.
Verify: ssh deploy@<node> 'sudo -l' shows the scoped list.

`K3S-CG8` — `/etc/rancher/k3s/` perms · ☐ · In-repo: N

Fix: /var/lib/rancher/k3s/server/token and /var/lib/rancher/k3s/server/node-token must be 0600 root:root; /etc/rancher/k3s/ not world-traversable.
Verify: ssh deploy@<node> 'sudo stat -c "%a %n" /var/lib/rancher/k3s/server/token' → 600.

`K3S-F15` — Nodes on public IPs, no private VPC · INFO · ⊘ · In-repo: Y

Decision: Accepted for now. Defense is K3S-CG3 (Hetzner firewall) only. To remediate later: attach a Hetzner private network, re-IP the cluster, move etcd/kubelet/Flannel onto it. Substantial re-provision — track on the roadmap, not this cycle.

`K3S-F16` — All nodes are control-plane + etcd + worker · INFO · ⊘

Decision: Accepted — standard small-cluster k3s. Revisit (dedicated workers + NoSchedule taint on control-plane) when workload pressure grows. No redeploy action.

`K3S-F17` — Single-replica SPOFs · INFO · ☐ · In-repo: Y

Where: deploy-k3s/manifests/worker/deployment.yaml, redis/, admin/, observability/vmagent.yaml.
Fix: worker → replicas: 2 (stateless, Asynq at-least-once — safe now). admin/vmagent → 2 if zero-downtime restart is wanted. redis is stateful — true HA needs Sentinel or managed Redis; track separately, do not naively scale.
Verify: kubectl -n honeydue get deploy shows worker 2/2.

Stage 2 — Secrets & config bootstrap

Run by 02-setup-secrets.sh, which reads deploy-k3s/config.yaml and the secrets/ directory. Both K3S-F1 and K3S-F3 are open purely because config.yaml lacks the values — the script already supports them.

`K3S-F1` — Redis runs with no authentication · CRITICAL · ☐ · In-repo: Y

Where: deploy-k3s/config.yaml key redis.password. 02-setup-secrets.sh:53,68-71 includes REDIS_PASSWORD in honeydue-secrets only when that key is non-empty; redis/deployment.yaml adds --requirepass only when the env var is non-empty.
Fix: Set redis.password in config.yaml to a strong value (openssl rand -base64 32). Re-run 02-setup-secrets.sh. api/worker already consume REDIS_PASSWORD.
Verify: kubectl -n honeydue exec deploy/redis -- redis-cli ping → NOAUTH; with -a "$REDIS_PASSWORD" → PONG.
Redeploy-clean: committing the value to config.yaml means every future 02-setup-secrets.sh re-creates the authenticated secret. (If config.yaml is gitignored, store the value in the operator's secret store and document it here.)

`K3S-F3` — `admin-basic-auth` secret never created · HIGH · ☐ · In-repo: Y

Where: config.yaml keys admin.basic_auth_user / admin.basic_auth_password. 02-setup-secrets.sh:54-55,132-143 creates the admin-basic-auth secret (bcrypt htpasswd) only when both are set, else it warns and skips.
Fix: Set both keys. Re-run 02-setup-secrets.sh. Must be done before K3S-F2 — attaching admin-auth to the ingress with the secret missing makes Traefik 503 the admin route.
Verify: kubectl -n honeydue get secret admin-basic-auth.

`K3S-F8` (Stage 2 half) — `B2_KEY_ID` / `B2_APP_KEY` in `honeydue-secrets` · ☑ · In-repo: Y

Where: 02-setup-secrets.sh.
Done (2026-05-16): the script now reads storage.b2_key_id / storage.b2_app_key from config.yaml and adds B2_KEY_ID / B2_APP_KEY to honeydue-secrets. Previously the api/worker manifests referenced these keys but the script never created them — a latent deploy break. See the full K3S-F8 entry in Stage 5.
Verify: kubectl -n honeydue get secret honeydue-secrets -o jsonpath='{.data.B2_KEY_ID}' is non-empty.

`K3S-F12` — Secret rotation runbook · MEDIUM · ☐ · In-repo: Y

Where: new doc docs/runbooks/secret-rotation.md.
Fix: Document per-secret rotation (Postgres, SECRET_KEY, APNs .p8, FCM, B2, observability token, Redis, admin basic-auth). Rotate at least annually, and immediately on suspected exposure or operator-device loss. For SECRET_KEY (JWT signing) plan an overlap window so live tokens validate across the change. Add a last-rotated annotation to each secret.
Verify: runbook exists and the first rotation is logged.

`CODE-C4` — `DEBUG_FIXED_CODES` "123456" auth bypass · CRITICAL · ☐ · In-repo: Y

Where: internal/services/auth_service.go:141-145,385-390,432-435,470-473,503-504; config in internal/config/config.go. ConfigMap generated from config.yaml by 03-deploy.sh.
Fix (two layers): (1) Code — refuse to start if ENV=production && DebugFixedCodes (Stage 4 code change). (2) Config — ensure config.yaml never sets DEBUG_FIXED_CODES=true for prod, and the generated ConfigMap omits it.
Verify: prod ConfigMap has no DEBUG_FIXED_CODES; a prod boot with the flag set fails fast.
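The code layer of this fix can be sketched as a startup check. This is a minimal illustration, not the project's actual code: the `Config` struct, its field names, and `ValidateStartup` are hypothetical stand-ins for whatever internal/config/config.go really exposes.

```go
package main

import (
	"errors"
	"fmt"
)

// Config is a hypothetical, trimmed-down stand-in for the real config struct.
type Config struct {
	Env             string // e.g. "production" or "development"
	DebugFixedCodes bool   // enables the fixed "123456" verification code
}

// ErrDebugCodesInProd signals that the debug bypass is enabled in production.
var ErrDebugCodesInProd = errors.New("DEBUG_FIXED_CODES must not be enabled when ENV=production")

// ValidateStartup returns an error the caller should treat as fatal
// (log it and exit non-zero) before any listener is opened.
func ValidateStartup(c Config) error {
	if c.Env == "production" && c.DebugFixedCodes {
		return ErrDebugCodesInProd
	}
	return nil
}

func main() {
	fmt.Println(ValidateStartup(Config{Env: "production", DebugFixedCodes: true}))
	fmt.Println(ValidateStartup(Config{Env: "development", DebugFixedCodes: true}))
}
```

Running the check before the server binds its port is what makes the failure "fast": a misconfigured pod never becomes Ready.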

`CODE-M8` — `SECRET_KEY` hardcoded debug fallback · MEDIUM · ☐ · In-repo: Y

Where: internal/config/config.go:437-442 falls back to "change-me-in-production-secret-key-12345".
Fix: Remove the static fallback — generate a per-boot random key in debug, and refuse to start in production if SECRET_KEY is unset. (02-setup-secrets.sh:46-49 already enforces ≥32 chars for the real secret — keep that.)
Verify: prod boot with no SECRET_KEY exits non-zero; the fallback string is gone from the binary.
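One possible shape for the fallback removal is shown below. It is a sketch under assumptions: `ResolveSecretKey` and its parameters are hypothetical names, and the 32-byte ephemeral key length is an illustrative choice rather than something the audit specifies.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
)

// ResolveSecretKey returns the configured key when present. With no key set,
// it fails in production and generates a random per-boot key otherwise.
func ResolveSecretKey(env, configured string) (string, error) {
	if configured != "" {
		return configured, nil
	}
	if env == "production" {
		return "", errors.New("SECRET_KEY is required when ENV=production")
	}
	buf := make([]byte, 32) // assumed length for the ephemeral debug key
	if _, err := rand.Read(buf); err != nil {
		return "", fmt.Errorf("generating ephemeral secret key: %w", err)
	}
	return hex.EncodeToString(buf), nil
}

func main() {
	_, err := ResolveSecretKey("production", "")
	fmt.Println("production, unset:", err)

	key, err := ResolveSecretKey("development", "")
	fmt.Println("development, unset:", len(key), err)
}
```

A per-boot key means debug sessions do not survive a restart, which is the intended trade-off: no static string exists to leak into production.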

Stage 3 — Kubernetes manifests

Committed under deploy-k3s/manifests/ and applied by 03-deploy.sh. Any fix here is automatically re-applied on every redeploy — the highest-value stage for "redeploy clean."

`K3S-F2` / `CODE-L6` — Wire defense-in-depth onto the admin ingress · HIGH · ☐

Where: deploy-k3s/manifests/ingress/ingress-simple.yaml — admin route annotation.
Fix: Add cloudflare-only and admin-auth to the traefik.ingress.kubernetes.io/router.middlewares annotation alongside the existing security-headers + rate-limit. Do K3S-F3 first or Traefik 503s the route.
Verify: 04-verify.sh "Cloudflare-Only Middleware" check passes; admin.myhoneydue.com prompts for basic auth.

`K3S-F6` — `imagePullSecrets` name consistency · HIGH · ☐

Where: all deploy-k3s/manifests/*/deployment.yaml, migrate/job.yaml; secret created by 02-setup-secrets.sh:111 as ghcr-credentials.
Fix: The registry is Gitea — ghcr-credentials is a misleading name and the live cluster currently also has a hand-made gitea-credentials. Pick one name (gitea-credentials is clearer), use it in both the script and every manifest, and delete the orphan. The defect is a name mismatch, not a missing fix — make script + manifests agree so a pull never fails on a fresh node.
Verify: every file listed by grep -rl imagePullSecrets deploy-k3s/manifests/ references one name, identical to the script's; cordon a node, delete a pod, confirm the replacement pulls.

`K3S-F7` — `vmagent` container `securityContext` · MEDIUM · ☐

Where: deploy-k3s/manifests/observability/vmagent.yaml.
Fix: Add the container-level block the other 5 deployments already have: allowPrivilegeEscalation: false, capabilities.drop: [ALL], readOnlyRootFilesystem: true. Its volumes (/etc/vmagent, /etc/vmagent-secrets, /tmp/vmagent emptyDir) already support read-only root.
Verify: 04-verify.sh "Pod Security Contexts" reports OK for vmagent.

`K3S-F9` / `LIVE-L8` — CSP + cross-origin headers · MEDIUM / LOW · ☐

Where: Cross-origin trio → deploy-k3s/manifests/ingress/middleware.yaml (security-headers). CSP object-src/base-uri → Go app CSP middleware (Stage 4, LIVE-L8 code half).
Important correction: K3S-F9 originally said CSP was missing. The live scan disproved that — the Go app sets a strong CSP via app middleware. So K3S-F9 reduces to: add Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Resource-Policy: same-origin (and Cross-Origin-Embedder-Policy: require-corp only if it doesn't break embeds) to security-headers. The CSP object-src 'none'; base-uri 'self' additions belong in the app and are tracked under LIVE-L8 in Stage 4.
Verify: curl -sI https://api.myhoneydue.com/api/health/ | grep -i cross-origin shows COOP/CORP.

`K3S-F10` / `LIVE-L12` — Auth-endpoint rate-limit middleware · MEDIUM / HIGH · ☐

Where: deploy-k3s/manifests/ingress/middleware.yaml (new auth-rate-limit Middleware) + ingress/ingress-simple.yaml. Requires migrating the auth paths from vanilla Ingress to a Traefik IngressRoute to apply a per-path middleware.
Fix: New Middleware average: 5, burst: 10, period: 1m, sourceCriterion.ipStrategy.depth: 2 (depth 2 for the Cloudflare hop). Apply to /api/auth/login, /api/auth/register, /api/auth/forgot-password, /api/auth/reset-password, /api/residences/join-with-code. This is the edge half; the app half is CODE-H1/H2/H3/M5 in Stage 4 (per-account lockout in Redis). Do both — edge limit alone resets on IP rotation.
Verify: a burst of more than 10 rapid logins from one IP → 429 (the first 10 fall within burst: 10).

`K3S-F11` — Disable `automountServiceAccountToken` · MEDIUM · ☐

Where: deploy-k3s/manifests/rbac.yaml (ServiceAccounts) and/or each */deployment.yaml pod spec.
Fix: Set automountServiceAccountToken: false on api, admin, worker, web, redis. Leave true only for vmagent (it uses the k8s API for service discovery). Note: 05-security.md claims this is already set — the audit (F11) says it is not. Treat the audit as ground truth; this fix makes the doc true.
Verify: kubectl -n honeydue get pod <api-pod> -o jsonpath='{.spec.automountServiceAccountToken}' → false; no token file in the container.

`K3S-F13` — Add `app.myhoneydue.com` to CORS · LOW · ☐

Where: CORS_ALLOWED_ORIGINS in config.yaml → generated into honeydue-config ConfigMap by 03-deploy.sh.
Fix: Confirm whether the web app calls api.myhoneydue.com directly from the browser. If yes, add https://app.myhoneydue.com to CORS_ALLOWED_ORIGINS. If it proxies through Next.js server-side, CORS is moot — record that decision here instead.
Verify: browser fetch from app.myhoneydue.com to the API succeeds (or the proxy decision is documented).

`K3S-F14` — Pin public images by digest · LOW · ☐

Where: redis/deployment.yaml (redis:7-alpine), observability/vmagent.yaml (victoriametrics/vmagent:v1.106.1).
Fix: Replace tags with @sha256: digests. Folded into the K3S-F5 CI work (Stage 5).
Verify: manifests contain no public-image tag without a digest.

`LIVE-L5` / `CODE-L3` — HSTS `preload` · LOW · ☐

Where: deploy-k3s/manifests/ingress/middleware.yaml security-headers HSTS value.
Fix: Change to max-age=63072000; includeSubDomains; preload. Confirm api/admin/app all work fully over HTTPS, then submit to hstspreload.org (the submission is the Stage 0 external half — once preloaded you cannot easily downgrade for ~6 months).
Verify: response header shows preload; domain accepted at hstspreload.org.

`LIVE-L7` — Drop deprecated `X-XSS-Protection` · LOW · ☐

Where: deploy-k3s/manifests/ingress/middleware.yaml security-headers (browserXssFilter: true / customResponseHeaders).
Fix: Remove the header or set X-XSS-Protection: "0". Modern browsers ignore it, and the legacy XSS filter it enabled has itself introduced XSS vulnerabilities in older browsers.
Verify: header absent or 0 on all three hosts.

`CODE-L4` — Set `imagePullPolicy` · LOW · ☐

Where: all deploy-k3s/manifests/*/deployment.yaml.
Fix: Set imagePullPolicy explicitly. Once images are digest-pinned (K3S-F5), IfNotPresent is correct and avoids needless re-pulls; until then Always avoids stale tags. Pick the policy that matches the K3S-F5 rollout state.
Verify: every container has an explicit imagePullPolicy.

Stage 4 — Application code & container images

Fixes in honeyDueAPI-go source (and the admin/web Dockerfiles). They reach production by rebuilding the image in 03-deploy.sh; schema-changing fixes (CODE-C1, CODE-C5/6, CODE-C11, CODE-C12) also need a goose migration, which the migrate Job runs automatically before the api/worker roll. Per repo rule: do not auto-commit — these are code changes; this section is the plan, not the patch.

Critical (C1–C13)

`CODE-C1` — Plaintext auth tokens in DB · ☑ (2026-05-15)

Where: internal/models/user.go, internal/repositories/user_repo.go, internal/middleware/auth.go, internal/services/cache_service.go, internal/services/auth_service.go, migration 000003_hash_auth_tokens.sql.
Done: user_authtoken.key now stores models.HashToken() — the hex SHA-256 of the token — never the raw value. The raw token reaches the client once (the non-persisted AuthToken.Plaintext field) and is re-hashed on every request before the DB and Redis lookup, so the single indexed JOIN query in the auth middleware is preserved. A fast hash (not bcrypt) is correct here — tokens are 160-bit random values, nothing to brute-force. Migration 000003 widens the column 40→64 and clears existing rows.
Behaviour change: the server can no longer re-issue a stored token's plaintext, so every login mints a fresh token via CreateFreshToken (delete + create). With the existing one-token-per-user schema this means one active session per user — logging in on a new device invalidates the previous device's token. The migration also invalidates all sessions once, at deploy.
Verify: SELECT key FROM user_authtoken LIMIT 1 → 64-char hash; go build ./... and go test ./internal/{models,repositories,middleware,handlers}/... pass.
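The hashing step described above amounts to a hex-encoded SHA-256 digest. The following is a minimal sketch of what a function like `models.HashToken` might look like; the real signature and package layout are not shown in this plan, so treat the details as assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// HashToken returns the lowercase hex SHA-256 digest of a raw auth token.
// The output is always 64 characters, matching the widened column.
func HashToken(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

func main() {
	// Illustrative 40-character token; real tokens are random.
	raw := "0123456789abcdef0123456789abcdef01234567"
	stored := HashToken(raw)
	fmt.Println(len(stored), stored)

	// On each request the presented token is hashed again and the
	// digest is used as the lookup key, so the raw value is never stored.
	fmt.Println(HashToken(raw) == stored)
}
```

Because the digest is deterministic, it can serve directly as the indexed lookup key, which is why the single-JOIN query in the auth middleware survives the change.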

`CODE-C2` / `CODE-C3` — Google ID token not verified locally · ☑ (2026-05-15)

Where: internal/services/google_auth.go (full rewrite).
Done: VerifyIDToken no longer calls the deprecated tokeninfo URL (which leaked the token in the query string and made verification depend on a third party). It now parses the JWT, fetches Google's JWKS from googleapis.com/oauth2/v3/certs (Redis-cached 24h, re-fetched on a kid miss), verifies the RS256 signature locally, and asserts iss ∈ {accounts.google.com, https://accounts.google.com} (C3), aud/azp against the configured client IDs, and exp (validated by jwt v5). Mirrors the existing Apple JWKS verifier. GoogleSignIn is unchanged — the returned GoogleTokenInfo shape is preserved.
Verify: go build ./... clean; internal/services tests pass.
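The claim assertions listed above (iss, aud/azp, exp) can be illustrated separately from signature verification. This sketch covers only the claim checks and uses hypothetical names (`GoogleClaims`, `ValidateGoogleClaims`); the exact aud/azp policy in google_auth.go is not described here, so the rule below (aud must be allowed, azp must be allowed when present) is an assumption. In the real verifier exp is validated by the JWT library.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// GoogleClaims holds the subset of ID-token claims checked here.
type GoogleClaims struct {
	Issuer          string // "iss"
	Audience        string // "aud"
	AuthorizedParty string // "azp"; may be empty
	ExpiresAt       int64  // "exp", seconds since the Unix epoch
}

func contains(list []string, v string) bool {
	for _, item := range list {
		if item == v {
			return true
		}
	}
	return false
}

// ValidateGoogleClaims checks claims only. The RS256 signature must already
// have been verified against Google's JWKS before this is called.
func ValidateGoogleClaims(c GoogleClaims, clientIDs []string, nowUnix int64) error {
	if c.Issuer != "accounts.google.com" && c.Issuer != "https://accounts.google.com" {
		return errors.New("unexpected issuer")
	}
	if !contains(clientIDs, c.Audience) {
		return errors.New("audience is not a configured client ID")
	}
	if c.AuthorizedParty != "" && !contains(clientIDs, c.AuthorizedParty) {
		return errors.New("authorized party is not a configured client ID")
	}
	if nowUnix >= c.ExpiresAt {
		return errors.New("token expired")
	}
	return nil
}

func main() {
	now := time.Now().Unix()
	claims := GoogleClaims{
		Issuer:    "https://accounts.google.com",
		Audience:  "example-client-id", // placeholder, not a real client ID
		ExpiresAt: now + 300,
	}
	fmt.Println(ValidateGoogleClaims(claims, []string{"example-client-id"}, now))
}
```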

`CODE-C5` / `CODE-C6` — IAP receipt / purchase-token replay · ☐

Where: internal/services/subscription_service.go (ProcessApplePurchase, ProcessGooglePurchase).
Fix: Goose migration adding UNIQUE(provider, original_transaction_id). On purchase, if the transaction ID is already bound to a different user_id → 403.
Verify: re-submitting a valid receipt against a second account → 403; DB has no duplicate.

`CODE-C7` — File-ownership check excludes residence owners · ☐

Where: internal/services/file_ownership_service.go:20-66.
Fix: Replace the three residence_residence_users-only JOINs with the canonical owner-OR-member UNION from residence_repo.HasAccess (owners live in residence_residence.owner_id).
Verify: a residence owner can delete a file in their own property; a non-member still gets 403.

`CODE-C8` — Device-token cross-account hijack · ☐

Where: internal/services/notification_service.go:307-319 (APNS), :336-349 (GCM).
Fix: On re-register of an existing token, if existing.UserID != nil && *existing.UserID != userID → 409 Conflict. Only same-user updates allowed.
Verify: registering another user's known token → 409; that user's push traffic is unaffected.

`CODE-C9` / `CODE-H9` — Share-code join not atomic · ☐

Where: internal/services/residence_service.go:562-615 (:594-599 swallows the deactivate error).
Fix: Wrap JoinWithCode in one transaction with SELECT … FOR UPDATE on the share-code row; fail the join if deactivation fails (do not log-and-continue).
Verify: concurrent redemptions of a single-use code → exactly one succeeds; a forced deactivate error rolls the whole join back.

`CODE-C10` — Subscription upgrade race · ☐

Where: internal/services/subscription_service.go:404-459; webhook handler :136-213.
Fix: Move Apple validation inside the row-locked transaction, or add an idempotency-key table so the validate→write window can't be raced.
Verify: two concurrent upgrades for one user → one tier change, not two.

`CODE-C11` — Task-completion duplicate-row race · ☐

Where: internal/services/task_service.go:631-750.
Fix: SELECT … FOR UPDATE on the task in CreateCompletion; goose migration adding UNIQUE(task_id, completed_date).
Verify: double-tap "complete" → one completion row.

`CODE-C12` — Soft-deleted email reusable · ☐

Where: internal/services/auth_service.go:274-324; internal/repositories/user_repo.go (FindByEmail, ExistsByEmail).
Fix: On delete, mangle the email (deleted_<id>_<email>); add is_active = true filtering consistently to FindByEmail/ExistsByEmail.
Verify: registering with a soft-deleted account's email is rejected; no cross-account takeover.

`CODE-C13` — Apple webhook user lookup may LIKE-match · ☐

Where: internal/handlers/subscription_webhook_handler.go:354-366 (FindByAppleReceiptContains).
Fix: Confirm the SQL is an equality match, not LIKE. If LIKE, this is a confirmed Critical — change to equality and rename the function. See V8.
Verify: the query is parameterized equality; rename merged.

High (H1–H9)

`CODE-H1` / `CODE-H2` / `CODE-H3` / `CODE-M5` — Rate limiting gaps · ☐

Where: internal/router/router.go (:520 login limiter, :593 join-with-code unprotected), internal/middleware/rate_limit.go, internal/handlers/auth_handler.go.
Fix: Extend rate limiting to register, join-with-code, Apple/Google sign-in, and token refresh. Add a per-account login-attempt counter in Redis (lock after 5–10 fails for 15–60 min). This is the app half of the consolidated auth-rate-limit item; the edge half is K3S-F10.
Verify: rapid attempts on every auth route throttle; per-account lockout fires regardless of source IP.

`CODE-H4` — Modulo bias in 6-digit codes · ☐

Where: internal/services/auth_service.go:884-892.
Fix: Replace int32 % 1000000 with rejection sampling on crypto/rand for a uniform 000000–999999.
Verify: distribution test over many samples is uniform.
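Rejection sampling here means discarding random draws that fall in the uneven tail of the 32-bit range, so every code is equally likely. The sketch below assumes a hypothetical `SixDigitCode` helper; the real function name at auth_service.go:884-892 is not given in this plan.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

const codeSpace = 1_000_000

// limit is the largest multiple of codeSpace that fits in 32 bits.
// Draws at or above it are rejected so v % codeSpace stays uniform.
const limit = (1 << 32) - ((1 << 32) % codeSpace)

// SixDigitCode returns a uniformly distributed code in 000000..999999.
func SixDigitCode() (string, error) {
	var buf [4]byte
	for {
		if _, err := rand.Read(buf[:]); err != nil {
			return "", fmt.Errorf("reading random bytes: %w", err)
		}
		v := uint64(binary.BigEndian.Uint32(buf[:]))
		if v < limit {
			return fmt.Sprintf("%06d", v%codeSpace), nil
		}
		// Otherwise the draw landed in the biased tail; try again.
	}
}

func main() {
	code, err := SixDigitCode()
	fmt.Println(code, err)
}
```

The rejection probability is small (967,296 out of 2^32 draws), so the loop almost always finishes on the first pass. The standard library's `rand.Int(rand.Reader, big.NewInt(1000000))` is an alternative that also avoids modulo bias.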

`CODE-H5` — Apple IAP `.p8` file-mode unchecked · ☐

Where: internal/services/iap_validation.go:93-128, internal/config/config.go:325.
Fix: Prefer a base64 env-injected PEM. If a file path is kept, refuse to start when the file mode is more permissive than 0600.
Verify: boot fails on a 0644 key file; succeeds on 0600.
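If the file-path option is kept, the mode check could look roughly like this. `ModeTooPermissive` and `CheckKeyFile` are hypothetical names, and the sketch assumes a Unix-style permission model.

```go
package main

import (
	"fmt"
	"os"
)

// ModeTooPermissive reports whether any permission bit outside 0600 is set,
// i.e. whether group or others have any access, or the owner can execute.
func ModeTooPermissive(perm os.FileMode) bool {
	return perm.Perm()&^0o600 != 0
}

// CheckKeyFile returns an error the caller should treat as fatal at startup.
func CheckKeyFile(path string) error {
	info, err := os.Stat(path)
	if err != nil {
		return fmt.Errorf("stat key file: %w", err)
	}
	if ModeTooPermissive(info.Mode()) {
		return fmt.Errorf("%s has mode %04o; want 0600 or stricter", path, info.Mode().Perm())
	}
	return nil
}

func main() {
	f, err := os.CreateTemp("", "example-key-*.p8")
	if err != nil {
		fmt.Println("create temp file:", err)
		return
	}
	defer os.Remove(f.Name())
	f.Close()

	_ = os.Chmod(f.Name(), 0o644)
	fmt.Println("0644:", CheckKeyFile(f.Name()))
	_ = os.Chmod(f.Name(), 0o600)
	fmt.Println("0600:", CheckKeyFile(f.Name()))
}
```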

`CODE-H6` — Webhook dedup fail-open · ☐

Where: internal/handlers/subscription_webhook_handler.go:165-173 (Apple), :564-574 (Google).
Fix: Fail closed — if webhookEventRepo.HasProcessed errors, return 500 so Apple/Google retry, rather than processing (which risks duplicate refunds).
Verify: simulated dedup-check DB error → 500, no double-processing.
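The fail-closed behaviour can be sketched independently of the web framework. Everything below is illustrative: `EventStore`, `fakeStore`, and `HandleWebhook` are hypothetical names standing in for webhookEventRepo and the real handlers.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// EventStore is a hypothetical view of the dedup repository.
type EventStore interface {
	HasProcessed(eventID string) (bool, error)
}

// fakeStore is a test double used only for this illustration.
type fakeStore struct {
	seen bool
	err  error
}

func (s fakeStore) HasProcessed(string) (bool, error) { return s.seen, s.err }

var errStoreDown = errors.New("dedup store unavailable")

// HandleWebhook returns the HTTP status to send back to the provider.
// A dedup-check error yields 500 so the provider retries later; the event
// is never processed when its dedup state is unknown.
func HandleWebhook(store EventStore, eventID string, process func() error) int {
	seen, err := store.HasProcessed(eventID)
	if err != nil {
		return http.StatusInternalServerError // fail closed
	}
	if seen {
		return http.StatusOK // duplicate: acknowledge without reprocessing
	}
	if err := process(); err != nil {
		return http.StatusInternalServerError
	}
	return http.StatusOK
}

func main() {
	process := func() error { fmt.Println("processing event"); return nil }
	fmt.Println(HandleWebhook(fakeStore{err: errStoreDown}, "evt-1", process))
	fmt.Println(HandleWebhook(fakeStore{}, "evt-2", process))
}
```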

`CODE-H7` — Auth-failure log lacks IP/UA · ☐

Where: internal/handlers/auth_handler.go:70.
Fix: Add c.RealIP() + User-Agent to the structured failure log line (the audit log captures them; the request-line log does not). Depends on V10 (RealIP trust).
Verify: a failed login log line carries IP + UA.

`CODE-H8` — `X-Timezone` header trusted for trial start · ☐

Where: internal/middleware/timezone.go:40-71 → internal/services/subscription_service.go:145-150.
Fix: Validate X-Timezone against IANA LoadLocation, cap to ±14h; use server UTC for trial-start / billing-window math regardless.
Verify: a bogus/extreme X-Timezone cannot shift trial start.
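A possible shape for the header validation, using the standard library's IANA database lookup. `ParseClientLocation` and `TrialStartUTC` are hypothetical names; falling back to UTC on bad input (rather than rejecting the request) is an assumption of this sketch. The blank import embeds the timezone database so the lookup does not depend on the container image shipping tzdata.

```go
package main

import (
	"fmt"
	"time"
	_ "time/tzdata" // embed the IANA database in the binary
)

const maxOffsetSeconds = 14 * 3600

// ParseClientLocation validates an X-Timezone header value and falls back
// to UTC for anything that is empty, unknown, or outside +/-14h.
func ParseClientLocation(header string) *time.Location {
	// "Local" would resolve to the server's own zone, so it is refused.
	if header == "" || header == "Local" {
		return time.UTC
	}
	loc, err := time.LoadLocation(header)
	if err != nil {
		return time.UTC
	}
	_, offset := time.Now().In(loc).Zone()
	if offset < -maxOffsetSeconds || offset > maxOffsetSeconds {
		return time.UTC
	}
	return loc
}

// TrialStartUTC ignores the client zone entirely: trial and billing
// windows are computed from server UTC time.
func TrialStartUTC(now time.Time) time.Time {
	return now.UTC()
}

func main() {
	fmt.Println(ParseClientLocation("America/Chicago"))
	fmt.Println(ParseClientLocation("Mars/Olympus_Mons"))
	fmt.Println(TrialStartUTC(time.Now()).Location())
}
```

The validated location remains useful for display purposes; the point of the fix is that it never feeds into entitlement math.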

Medium (M1–M13)

`CODE-M1` — Header injection via `Content-Disposition` filename · ☐

Where: internal/handlers/media_handler.go:74,117,165.
Fix: Sanitize doc.FileName — strip CR/LF/quote/null, or emit RFC 5987 filename*=UTF-8''….
Verify: an upload with CRLF in the filename does not split the response.
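Both options from the fix can be combined: strip the dangerous characters for a quoted ASCII fallback and also emit an RFC 5987 `filename*` parameter. The sketch below uses hypothetical helper names and a placeholder default of "download" for names that sanitize to empty.

```go
package main

import (
	"fmt"
	"strings"
)

// encodeRFC5987 percent-encodes every byte that is not an RFC 5987 attr-char.
func encodeRFC5987(s string) string {
	const extra = "!#$&+-.^_`|~"
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		c := s[i]
		isAlnum := (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
		if isAlnum || strings.IndexByte(extra, c) >= 0 {
			b.WriteByte(c)
		} else {
			fmt.Fprintf(&b, "%%%02X", c)
		}
	}
	return b.String()
}

// SafeContentDisposition builds a header value from an untrusted filename.
func SafeContentDisposition(name string) string {
	// Drop control characters (including CR, LF, NUL), quotes, backslashes.
	clean := strings.Map(func(r rune) rune {
		if r < 0x20 || r == 0x7f || r == '"' || r == '\\' {
			return -1
		}
		return r
	}, name)
	if clean == "" {
		clean = "download" // placeholder default, an assumption
	}
	// ASCII-only fallback for clients that ignore filename*.
	fallback := strings.Map(func(r rune) rune {
		if r > 0x7e {
			return '_'
		}
		return r
	}, clean)
	return fmt.Sprintf(`attachment; filename="%s"; filename*=UTF-8''%s`,
		fallback, encodeRFC5987(clean))
}

func main() {
	fmt.Println(SafeContentDisposition("report.pdf"))
	fmt.Println(SafeContentDisposition("evil\r\nSet-Cookie: x=1.pdf"))
}
```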

`CODE-M2` — bcrypt cost 10 → 12 · ☐

Where: internal/models/user.go:47, internal/services/auth_service.go:479.
Fix: Make the cost config-driven, default 12.
Verify: new hashes are $2a$12$.

`CODE-M3` — Apple Sign In nonce not validated · ☐

Where: internal/services/apple_auth.go.
Fix: Generate, store, and verify the nonce round-trip on Apple sign-in.
Verify: a replayed/mismatched nonce is rejected.

`CODE-M4` — Email verification not atomic · ☐

Where: internal/services/auth_service.go:373-415.
Fix: Wrap verify in a transaction so a concurrent request can't double-apply.
Verify: concurrent verify calls → one state transition.

`CODE-M6` / `LIVE-L16` — Uncapped list / pagination · ☐

Where: ListDocuments, ListContractors, ListResidences handlers; pagination parsing.
Fix: Clamp limit server-side to ≤100 (< 1 → default 25). The notifications endpoint already caps at 200; follow the same pattern.
Verify: ?limit=999999 returns ≤100 rows.
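The clamp itself is a few lines. `ClampLimit` and the constant names are hypothetical; the values 25 and 100 come from the fix above.

```go
package main

import "fmt"

const (
	DefaultLimit = 25  // used when the client sends no limit or a value below 1
	MaxLimit     = 100 // hard server-side ceiling
)

// ClampLimit normalizes a client-supplied page size.
func ClampLimit(n int) int {
	if n < 1 {
		return DefaultLimit
	}
	if n > MaxLimit {
		return MaxLimit
	}
	return n
}

func main() {
	for _, n := range []int{-5, 0, 10, 100, 999999} {
		fmt.Println(n, "->", ClampLimit(n))
	}
}
```

Applying the clamp in one shared helper, rather than per handler, keeps ListDocuments, ListContractors, and ListResidences from drifting apart.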

`CODE-M7` — Audit log not append-only · ☐

Where: audit-log model / repository.
Fix: Make it append-only — a DB trigger forbidding UPDATE/DELETE, or move to an event store. Remove the soft-delete column.
Verify: an UPDATE/DELETE on the audit table is rejected.

`CODE-M11` — `golang.org/x/crypto` outdated · ☐

Where: go.mod:30 (v0.49.0).
Fix: go get -u golang.org/x/crypto, re-run govulncheck, retest. Pairs with Stage 5 dependency automation.
Verify: govulncheck ./... clean.

`CODE-M12` — Contractor toggle refetch race · ☐

Where: internal/services/contractor_service.go:279-307.
Fix: Do the toggle + read in one transaction so a concurrent soft-delete can't make it return nil.
Verify: concurrent toggle + delete → defined result, no nil panic.

`CODE-M13` — Account-deletion endpoint unrate-limited · ☐

Where: internal/handlers/auth_handler.go:488-539.
Fix: Add a throttle to DELETE /account. First resolve V11 — LIVE-L18 claims no delete endpoint exists; reconcile before deciding whether this is "rate-limit it" or "expose it."
Verify: repeated delete calls throttle.

`CODE-M10` — `node:20-alpine` floating tag · ☐

Where: admin/web Dockerfile (:2,112,134).
Fix: Pin to a specific patch version or digest.
Verify: Dockerfile has no bare node:20-alpine.

Low / Info (CODE-L1, L2)

`CODE-L1` — Inactive-account login enumeration · ☐

Where: internal/services/auth_service.go:76-77.
Fix: Return the same generic error for inactive accounts as for invalid credentials.
Verify: inactive vs. wrong-password responses are byte-identical.

`CODE-L2` — Auth responses lack `Cache-Control: no-store` · ☐

Where: internal/handlers/auth_handler.go (Login / CurrentUser / Refresh).
Fix: Set Cache-Control: no-store on auth responses.
Verify: the header is present.

Live-scan code-level findings (LIVE-L1, L11–L20)

`LIVE-L1` — `/metrics` publicly exposed · HIGH · ☐

Where: cmd/api/main.go route registration; vmagent scrapes it cluster-internally already.
Fix (recommended — Option B): bind Prometheus metrics to a separate cluster-internal port (e.g. :9090), expose only via a ClusterIP Service the vmagent NetworkPolicy allows; the public Ingress never registers /metrics. Update observability/vmagent.yaml scrape target. (Alternative: block /metrics at Traefik via an IngressRoute — Stage 3.)
Verify: curl https://api.myhoneydue.com/metrics → 404; vmagent still scrapes successfully.

`LIVE-L11` — Login user-enumeration via timing · HIGH · ☐

Where: login handler / auth_service.go.
Fix: Always run a bcrypt compare against a fixed dummy hash when the user is not found, so the response time is constant.
Verify: real vs. fake email login timing delta < network noise.

`LIVE-L12` — No rate-limit on login · HIGH · ☐

See the consolidated auth-rate-limit item: K3S-F10 (edge) + CODE-H1/H2/H3/M5 (app). Closed when both land.

`LIVE-L13` — Password-reset timing enumeration · HIGH · ☐

Where: forgot-password handler.
Fix: Enqueue the reset email on the Asynq queue and return the generic response immediately, so real vs. fake emails have identical latency.
Verify: real vs. fake email reset timing delta < network noise.

`LIVE-L14` / `LIVE-L15` — Sequential integer IDs · MEDIUM · ⊘ (deferred)

Where: all user-facing IDs.
Decision: Real enumeration/intel leak, but migrating to UUID/ULID touches API, web, mobile, and webhook payloads. Deferred to a planned quarter — not a redeploy-stage fix. Track on the roadmap; revisit before the userbase size becomes commercially sensitive.

`LIVE-L16` — Pagination `limit` uncapped · MEDIUM · ☐

Duplicate of CODE-M6 — closed with it.

`LIVE-L17` — Garbage pagination params silently accepted · LOW · ☐

Where: query-param parsing in list handlers.
Fix: Return 400 naming the bad parameter instead of silently using defaults.
Verify: ?limit=abc → 400.
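Strict parsing that names the offending parameter might look like this. `ParseIntParam` is a hypothetical helper; the handler would map a non-nil error to a 400 response.

```go
package main

import (
	"fmt"
	"strconv"
)

// ParseIntParam parses an optional integer query parameter. An absent
// parameter yields the default; a present but malformed one yields an
// error that names the parameter, suitable for a 400 response body.
func ParseIntParam(name, raw string, def int) (int, error) {
	if raw == "" {
		return def, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid query parameter %q: %q is not an integer", name, raw)
	}
	return n, nil
}

func main() {
	fmt.Println(ParseIntParam("limit", "abc", 25))
	fmt.Println(ParseIntParam("limit", "", 25))
	fmt.Println(ParseIntParam("limit", "50", 25))
}
```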

`LIVE-L18` — No account-deletion endpoint (GDPR) · LOW · ☐

Where: internal/router/router.go, internal/handlers/auth_handler.go.
Fix: Reconcile with CODE-M13 first (V11). Provide DELETE /api/auth/me/ that anonymizes PII, cascades/transfers residences, revokes tokens, and writes an audit-trail row. Also closes the throwaway-account cleanup gap the live scan left behind.
Verify: an authenticated user can delete their own account; PII is anonymized.

`LIVE-L19` — Email verification not enforced · LOW · ☐

Where: router middleware.
Fix: Add a RequireVerified() middleware on sensitive routes (share-code generation/redemption, anything that emails other users), or cap unverified accounts (1 residence, no share codes) until verified.
Verify: an unverified account is blocked from the chosen gated routes.

`LIVE-L20` — Profile-update silently drops unknown fields · INFO · ☐

Where: PATCH /api/auth/profile/ handler.
Fix: Either accept the fields (if intended) or return 400 listing unsupported keys — don't silently 200.
Verify: an unknown field yields a clear response.
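For the "return 400" option, Go's JSON decoder can reject unknown keys directly. The `ProfileUpdate` fields below are hypothetical placeholders, not the real request schema. Note that the decoder reports only the first unknown field it meets, so listing every unsupported key would need extra work.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// ProfileUpdate lists the fields the endpoint accepts (illustrative only).
type ProfileUpdate struct {
	FirstName *string `json:"first_name"`
	LastName  *string `json:"last_name"`
}

// DecodeProfileUpdate fails on any key that ProfileUpdate does not declare.
// The handler would translate the error into a 400 response.
func DecodeProfileUpdate(body []byte) (ProfileUpdate, error) {
	dec := json.NewDecoder(bytes.NewReader(body))
	dec.DisallowUnknownFields()
	var p ProfileUpdate
	if err := dec.Decode(&p); err != nil {
		return ProfileUpdate{}, err
	}
	return p, nil
}

func main() {
	_, err := DecodeProfileUpdate([]byte(`{"first_name":"Ada","is_admin":true}`))
	fmt.Println(err)

	p, err := DecodeProfileUpdate([]byte(`{"first_name":"Ada"}`))
	fmt.Println(*p.FirstName, err)
}
```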

`LIVE-L10` — `x-powered-by` — see Stage 0 (Next.js config).

Stage 5 — CI / build pipeline

Build-time controls. Where there is no CI pipeline file yet, the fix is to add one (or a 03-deploy.sh step) so the control runs on every build.

`K3S-F5` / `K3S-F14` / `CODE-L4` — Pin images by digest · HIGH · ☐

Where: 03-deploy.sh (currently tags by git short SHA, lines 47/57-61, and also pushes :latest), all deploy-k3s/manifests/*/deployment.yaml.
Fix: After docker push, capture the digest (crane digest … or parse docker push output) and substitute @sha256:… into the manifests instead of IMAGE_PLACEHOLDER tags. Pin redis and vmagent by digest too. Reconsider pushing :latest — a mutable :latest undercuts digest pinning.
Verify: kubectl -n honeydue get deploy -o jsonpath shows every image as @sha256:.

`K3S-F8` — Secrets as file mounts, not env vars · MEDIUM · ☑ · In-repo: Y

Where: api/worker deployment.yaml, internal/config/config.go, cmd/api/main.go, cmd/worker/main.go, 02-setup-secrets.sh.
Done (2026-05-16):
- config.loadFileSecrets() reads each of the 9 secret keys (POSTGRES_PASSWORD, SECRET_KEY, EMAIL_HOST_PASSWORD, FCM_SERVER_KEY, REDIS_PASSWORD, B2_KEY_ID, B2_APP_KEY, OBS_INGEST_TOKEN, OBS_TRACES_URL) from /etc/honeydue/secrets/<KEY> and applies it with viper.Set, which takes highest precedence. A missing file is silently skipped, so the same binary still works from env vars in local/dev.
- api/worker deployment.yaml no longer inject any secret as an env: secretKeyRef. honeydue-secrets is mounted as a volume (defaultMode: 0400), read-only, at /etc/honeydue/secrets. Non-secret config still arrives via envFrom: configMapRef.
- cmd/api and cmd/worker read the observability endpoints through the new config.SecretValue() (Viper-backed) instead of os.Getenv, so file-mounted OBS_* values still resolve now that they are gone from the environment.
- 02-setup-secrets.sh now also writes B2_KEY_ID/B2_APP_KEY into honeydue-secrets — reconciling the script-vs-manifest drift (the manifests referenced these keys but the script never created them).
Scoped exception: the one-shot honeydue-migrate Job still takes POSTGRES_PASSWORD as an env var. goose is invoked as a CLI with the password inside the DSN argument, so the value is exposed in that process regardless of env-vs-file; the Job is transient (one run, seconds, pod GC'd) so this is accepted.
Verify: kubectl -n honeydue exec deploy/api -- env shows no POSTGRES_PASSWORD/SECRET_KEY; kubectl -n honeydue exec deploy/api -- ls /etc/honeydue/secrets lists the key files.
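The file-mount loading pattern described in this entry can be sketched without Viper. `LoadFileSecrets` below is a hypothetical stand-in for config.loadFileSecrets(): it returns a map instead of calling viper.Set, and trimming surrounding whitespace from file contents is an assumption of this sketch, not confirmed behaviour.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// LoadFileSecrets reads one file per key from dir. A missing file is
// skipped silently so env-var configuration keeps working in local/dev.
func LoadFileSecrets(dir string, keys []string) (map[string]string, error) {
	out := make(map[string]string)
	for _, key := range keys {
		data, err := os.ReadFile(filepath.Join(dir, key))
		if errors.Is(err, fs.ErrNotExist) {
			continue
		}
		if err != nil {
			return nil, fmt.Errorf("reading secret %s: %w", key, err)
		}
		out[key] = strings.TrimSpace(string(data)) // trimming is an assumption
	}
	return out, nil
}

func main() {
	dir, err := os.MkdirTemp("", "example-secrets-")
	if err != nil {
		fmt.Println("create temp dir:", err)
		return
	}
	defer os.RemoveAll(dir)

	// Simulate a mounted secret volume with a single key present.
	if err := os.WriteFile(filepath.Join(dir, "SECRET_KEY"), []byte("example\n"), 0o400); err != nil {
		fmt.Println("write secret:", err)
		return
	}

	secrets, err := LoadFileSecrets(dir, []string{"SECRET_KEY", "REDIS_PASSWORD"})
	fmt.Println(len(secrets), err)
}
```

In the real loader, each value found this way would override the environment, which is why the manifests can drop the secretKeyRef entries entirely.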

`CODE-L5` — Image signing + scanning · LOW · ◐ · In-repo: Y

Where: 03-deploy.sh, deploy-k3s/manifests/kyverno-verify-images.yaml.
Done (in-repo, 2026-05-16):
- 03-deploy.sh runs cosign sign after each push and a trivy image --severity HIGH,CRITICAL scan before push — both guarded: they no-op when the tool is absent, so they never break a deploy on a host without them.
- A ready-to-use Kyverno ClusterPolicy ships at deploy-k3s/manifests/kyverno-verify-images.yaml. It matches only the four gitea.treytartt.com/admin/honeydue-* images, starts in Audit mode, and is intentionally not applied by 03-deploy.sh — applying a verify-images policy with no key would block every Pod from scheduling.
Remaining (operator — cannot be committed):
1. Install Kyverno in the cluster (admission controller).
2. cosign generate-key-pair; set COSIGN_KEY in the deploy env so signing activates; paste cosign.pub into the policy's publicKeys block.
3. kubectl apply -f deploy-k3s/manifests/kyverno-verify-images.yaml, confirm Pods still schedule, then flip validationFailureAction: Audit → Enforce.
Verify: an unsigned image is rejected by admission; with trivy installed, 03-deploy.sh fails on a HIGH/CRITICAL CVE.

`CODE-M11` (CI half) — Dependency hygiene · ☐

Fix: Add a scheduled go get -u plus govulncheck run. The audit confirms govulncheck and gitleaks already run in CI, so this extends them with a dependency-update cadence.
Verify: stale-dependency alerts surface automatically.

Stage 6 — Post-deploy verification & runtime investigations

04-verify.sh already runs a security block (secret encryption, NetworkPolicy count, ServiceAccounts, pod security contexts, PDBs, cloudflare-only middleware, admin-basic-auth). Extend it so each fix above stays fixed, and work the open investigations the audits could not resolve.

Extend `04-verify.sh` with assertions for · ☐

- Redis rejects unauthenticated PING (K3S-F1).
- Admin ingress annotation contains admin-auth (K3S-F2).
- /metrics returns 404 on the public host (LIVE-L1).
- Every container (incl. vmagent) has a full securityContext (K3S-F7).
- automountServiceAccountToken: false on app pods (K3S-F11).
- Every workload image is digest-pinned (K3S-F5).
- No DEBUG_FIXED_CODES key in the prod ConfigMap (CODE-C4).

Runtime investigations (cannot be closed by code review alone)

ID	Item	Source	Action
`V1`	Apple/Google Sign-In token validation depth	LIVE	Test with a self-signed Apple identity token; confirm signature/aud/nonce checks
`V2`	Webhook signature verification — confirm webhook routes are outside the auth middleware in `router.go` (live scan saw `401`s, signature middleware may never run)	LIVE	Code-review `internal/router/router.go`
`V3`	File-upload security — locate upload paths, test polyglots / MIME bypass / path traversal in filename / oversized files	LIVE	Focused upload security test
`V4`	Long-term token validity / revocation behaviour	LIVE	Test token expiry + revocation over time
`V5`	Apple IAP receipt validation with a real sandbox StoreKit receipt	LIVE	Sandbox test
`V6`	Share-code system — find the endpoint path; test brute-force, single-use, expiration	LIVE	Locate + test
`V7`	Trial-expiration enforcement — age a test account past 14 days, confirm `limitations_enabled` flips and creation gates fire	LIVE	Aged-account test
`V8`	`FindByAppleReceiptContains` — confirm equality, not `LIKE`. If `LIKE`, escalate `CODE-C13` to confirmed Critical	CODE	SQL review
`V9`	Rate-limiter storage — confirm `rate_limit.go` is Redis-backed (shared across 3 api replicas); in-memory = 3× the intended limit	CODE	Code review
`V10`	`X-Forwarded-For` / Echo `RealIP` trust behind Traefik — without it per-IP limits collapse to the ingress IP	CODE	Code + Traefik config review
`V11`	Account-deletion contradiction — `LIVE-L18` (no endpoint) vs `CODE-M13` (endpoint at `auth_handler.go:488-539`). Resolve before Stage 4 planning	LIVE/CODE	Route review
`V12`	etcd encryption — `04-verify.sh` only greps a string; truly confirm with `k3s secrets-encrypt status` on each server node	K3S	SSH check
`V13`	`user_authtoken` index — confirm a `user_id` lookup index exists before hashing tokens at rest (`CODE-C1`)	CODE	Schema check

Accepted risks / deferred (this cycle)

| ID | Item | Rationale |
|---|---|---|
| `K3S-F15` | Public-IP nodes, no VPC | Re-provision-scale change; Hetzner firewall (`K3S-CG3`) is the compensating control. Roadmap. |
| `K3S-F16` | Combined control-plane/worker nodes | Standard small-cluster k3s; revisit on workload growth. |
| `LIVE-L14`/`L15` | Sequential integer IDs | UUID migration spans API + web + mobile + webhooks; planned quarter, not this cycle. |

Mirror these in docs/deployment/20-roadmap.md so they are not silently lost.

Documentation drift corrected alongside this plan

The audits contradicted the existing deployment book. These corrections ship with this plan so the docs match audited reality:

| Doc | Claimed | Reality (audit) | Action |
|---|---|---|---|
| `05-security.md` | `automountServiceAccountToken: false` set | `K3S-F11`: not set on any workload | Corrected to "TODO" + linked here |
| `05-security.md` | NetworkPolicies "not currently applied" (TODO) | Applied 2026-04-24; `03-deploy.sh:155` applies them | Corrected to "applied" |
| `05-security.md` | CF↔origin is plaintext (SSL=Flexible) | Upgraded to Full (strict) 2026-04-24 | Corrected |
| `05-security.md` | SHA tags immutable / "we'd notice a digest change" | `K3S-F5`: short SHA tags are mutable | Corrected; points to `K3S-F5` |
| `SECURITY.md` (old) | Redis "requires a password" | `K3S-F1`: no auth | This rewrite |
| `SECURITY.md` (old) | etcd `secrets-encryption: true` | `K3S-CG1`: not verified / not on | This rewrite |
| `SECURITY.md` (old) | fail2ban active | `05-security.md` + `K3S-CG2`: not installed | This rewrite |
| `20-roadmap.md` | — | Audit findings not represented | Audit items folded in |
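The `K3S-F5` correction (tags are mutable, digests are not) lends itself to a mechanical check. The sketch below treats an image reference as pinned only when it ends in `@sha256:` followed by 64 lowercase hex characters. The function name `is_digest_pinned` and the registry host in the examples are hypothetical.

```shell
# Hypothetical helper: succeeds only when the image reference is pinned
# by digest, i.e. ends in @sha256:<64 lowercase hex characters>.
is_digest_pinned() {
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}

# A short-SHA tag can be re-pushed to point at different content, so it fails:
is_digest_pinned "registry.example.com/api:c77ff07" || echo "tag only: not pinned"

# Possible usage against the live namespace:
#   for img in $(kubectl get pods -n honeydue \
#       -o jsonpath='{.items[*].spec.containers[*].image}'); do
#     is_digest_pinned "$img" || echo "✗ $img"
#   done
```

A reference that carries both a tag and a digest still passes, since the digest is what the runtime resolves.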

Hardened-redeploy checklist (run order)

A clean rebuild of the whole stack, with every fix above applied:

□ Stage 0  DNS once-off:    DMARC, SPF, CAA at Cloudflare; security.txt route live
□ Stage 1  Provision:       hetzner-k3s config carries --write-kubeconfig-mode=600
                            and --secrets-encryption; run 01-provision-cluster.sh
□ Stage 1  Node OS:         fail2ban + unattended-upgrades + SSH/sysctl on each node
□ Stage 1  Verify cluster:  K3S-CG3..CG8 (firewall, snapshots, kubelet, perms)
□ Stage 2  Config:          config.yaml has redis.password + admin.basic_auth_*;
                            no DEBUG_FIXED_CODES; SECRET_KEY ≥32 chars
□ Stage 2  Secrets:         run 02-setup-secrets.sh — confirm redis + admin-basic-auth
□ Stage 3  Manifests:       admin ingress middlewares wired; imagePullSecret name
                            consistent; vmagent securityContext; COOP/CORP headers;
                            auth-rate-limit; automountServiceAccountToken:false;
                            HSTS preload; X-XSS-Protection dropped; imagePullPolicy set
□ Stage 4  Code+image:      all C/H/M/L code fixes committed; image rebuilt;
                            goose migrations for C1/C5/C6/C11/C12 present
□ Stage 5  CI:              images digest-pinned + signed + scanned; secrets file-mounted
□ Stage 6  Verify:          run 04-verify.sh (extended); work V1–V13
□ Post     HSTS preload:    submit myhoneydue.com to hstspreload.org

A redeploy is "clean" only when 04-verify.sh (extended per Stage 6) passes with zero ✗ lines and every checkbox in the master index is ☑ or ⊘.
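The "zero ✗ lines" half of that rule can be gated mechanically. The following is a sketch under the assumption that `04-verify.sh` prints one result per line and marks failures with the ✗ character; the function name `verify_is_clean` is hypothetical.

```shell
# Hypothetical helper: reads verifier output on stdin and succeeds only
# when no line contains the ✗ failure marker.
verify_is_clean() {
  ! grep -q '✗'
}

# Possible usage (script path taken from the References section):
#   deploy-k3s/scripts/04-verify.sh | verify_is_clean \
#     && echo "verify output is clean"
```

This covers only the script output. The master-index checkboxes still need a manual pass, and the script's own exit status is not visible through the pipe, so check it separately if the script reports failures that way.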

Appendix — Incident response playbooks

Preserved from the previous SECURITY.md; still current.

Compromised API token

Rotate SECRET_KEY to invalidate all tokens, then restart api/worker:

# Generate a fresh 256-bit key (64 hex chars), sync it to the cluster, restart consumers
openssl rand -hex 32 > secrets/secret_key.txt
./scripts/02-setup-secrets.sh
kubectl rollout restart deployment/api deployment/worker -n honeydue

(CODE-C1 landed on 2026-05-15, so tokens are now hashed at rest: a DB read no longer yields usable tokens, but SECRET_KEY rotation remains the kill-switch.)
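To illustrate why a database read does not yield usable tokens once they are hashed at rest, here is a minimal shell sketch. It assumes a plain SHA-256 digest of the raw token string. The helper names `hash_token` and `token_matches` and the sample token are hypothetical, and the real service does this in application code rather than shell.

```shell
# Hypothetical helper: print the hex SHA-256 digest of a token.
# Only this digest would be stored; the raw token is never persisted.
hash_token() {
  printf '%s' "$1" | sha256sum | cut -d' ' -f1
}

# On each request the presented token is hashed and compared to the stored digest.
token_matches() {
  [ "$(hash_token "$1")" = "$2" ]
}

stored=$(hash_token "example-token")   # what the DB row would hold
token_matches "example-token" "$stored" && echo "valid"
# An attacker who reads the row has only the digest; presenting it fails,
# because the server hashes whatever it is given before comparing.
token_matches "$stored" "$stored" || echo "digest replayed as token: rejected"
```

The second call shows the point of the parenthetical above: the stored value is not itself a credential.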

Compromised database credentials

Rotate the password in the Neon dashboard, update secrets/postgres_password.txt, re-run 02-setup-secrets.sh, restart api/worker, then watch the logs for connection errors.

Compromised push keys

APNs: revoke the key in Apple Developer, drop the new .p8 into secrets/, re-run 02-setup-secrets.sh, and restart api/worker. FCM: rotate the key in Firebase, update secrets/fcm_server_key.txt, re-run 02-setup-secrets.sh, and restart api/worker.

Suspicious pod

kubectl logs <pod> -n honeydue > /tmp/pod-logs.txt
kubectl describe pod <pod> -n honeydue > /tmp/pod-describe.txt
kubectl delete pod <pod> -n honeydue   # deployment recreates it

Communication

Document the timeline privately. In the event of a data breach, notify affected users within 72 hours, rotate every potentially exposed credential, and write a post-mortem covering root cause, timeline, remediation, and prevention.

References

Audit reports: live_scan_5_12.md, k3_audit_5_12.md, security_scan_5_12.md (repo root)
Current architecture: docs/deployment/05-security.md
Roadmap: docs/deployment/20-roadmap.md
Deploy process: docs/deployment/14-deployment-process.md
Scripts: deploy-k3s/scripts/{01-provision-cluster,02-setup-secrets,03-deploy,04-verify}.sh
Manifests: deploy-k3s/manifests/
