honeyDueAPI/deploy-k3s/SECURITY.md
Trey t c77ff07ce9
fix(security): remediate 2026-05-12 audit findings (Stages 2–5)
Remediation of the 2026-05-12/13 audits (78 findings + cluster gaps),
tracked in deploy-k3s/SECURITY.md, plus fixes from two independent
post-remediation reviews.

Auth & sessions:
- SHA-256 hashed auth-token storage (C1); prior-token cache eviction on
  re-login (MEDIUM-1)
- local Google JWKS verification, iss/aud/exp checks (C2/C3)
- constant-time login + generic errors (L1/LIVE-L11/LIVE-L13)
- per-account login lockout keyed on distinct source IPs (M5/MEDIUM-3)
- verified-email gating, login rate limiting (LIVE-L19, H1-H3)

IAP & webhooks:
- Apple/Google cross-account replay protection (C5/C6/C10/C13, H5/H6)
- migrations 000003-000006 (token hashing, IAP replay, audit_log +
  webhook_event_log table creation, append-only audit log)

Authorization & races:
- file-ownership owner-OR-member fix (C7), atomic share-code join
  (C9/H9), device-token reassignment (C8/LOW-3)

Secrets & deploy:
- secrets file-mounted at /etc/honeydue/secrets, not env (F8); Redis
  password out of the ConfigMap (HIGH-1); B2 keys reconciled
- digest-pinned images, admin ingress hardening, CSP/HSTS, /metrics
  lockdown; kubeconfig 0600, etcd secrets-encryption, fail2ban +
  unattended-upgrades at provision; secret-rotation runbook

Build, vet, and the full test suite (incl. -race) pass; the goose
migration chain is verified against PostgreSQL 16.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 22:28:33 -05:00


honeyDue — Production Security Remediation Plan

This document is the single source of truth for fixing every security finding from the 2026-05-12/13 audits, and for keeping those fixes baked into the stack so a full redeploy never reproduces them.

It replaces the previous aspirational SECURITY.md (which described a desired state that, per the audits, was never fully true). The accurate current architecture lives in docs/deployment/05-security.md; this file is the work list.

Last updated: 2026-05-16

Audit sources (kept at repo root):

| Tag | File | Scope | Findings |
|---|---|---|---|
| LIVE | live_scan_5_12.md | External black-box scan of api/admin/app | L1–L20 (20) |
| K3S | k3_audit_5_12.md | k3s cluster + honeydue namespace audit | F1–F17 (17) + 8 coverage gaps |
| CODE | security_scan_5_12.md | Static audit of honeyDueAPI-go | C1–C13, H1–H9, M1–M13, L1–L6 (41) |

Total: 78 findings + 8 cluster coverage gaps + 13 runtime verification items.


How to use this document

The plan is organised by redeploy stage, not by severity, because the operator's goal is: redeploy the entire stack and come up clean. Each finding is tagged with where its fix lives:

| Marker | Meaning |
|---|---|
| In-repo: Y | Fix lives in a committed file (config.yaml, a manifest, a script, Go code, a Dockerfile). Once committed, every redeploy re-applies it automatically. |
| In-repo: N | Fix is external state (DNS records, Cloudflare dashboard, Hetzner firewall, hstspreload.org). A redeploy does not touch it; it survives on its own but must be done once and tracked here. |

Status legend: ☐ open · ◐ in progress · ☑ done · ⊘ accepted risk / deferred

Redeploy stage order (matches deploy-k3s/scripts/ run order):

Stage 0  DNS & Cloudflare edge          (external; no cluster needed)
Stage 1  Cluster provisioning & node OS (01-provision-cluster.sh / hetzner-k3s / SSH)
Stage 2  Secrets & config bootstrap     (02-setup-secrets.sh / config.yaml)
Stage 3  Kubernetes manifests           (deploy-k3s/manifests/, applied by 03-deploy.sh)
Stage 4  Application code & images      (honeyDueAPI-go source → rebuilt image)
Stage 5  CI / build pipeline            (image digest pinning, signing, scanning)
Stage 6  Post-deploy verification       (04-verify.sh + runtime investigations)

Golden rule for "redeploy clean": a fix only counts as done when it is committed to the file that the redeploy reads. A kubectl patch on the live cluster that is not mirrored into deploy-k3s/manifests/ will be wiped on the next 03-deploy.sh. Every entry below names the committed file.


Execution status (2026-05-16)

Stages 2–5 were executed in-repo, then put through an independent code review (see Post-remediation independent review below). The Go module builds clean and the full go test ./... suite passes. Four new goose migrations were added: 000003 (auth-token hashing), 000004 (IAP replay protection), 000005 (audit-log append-only + audit_log table create), and 000006 (webhook_event_log table create). They run automatically via the migrate Job before the api/worker rollout.

  • ~63 findings fixed (☑) and verified — all of Stage 2 (secrets/config) and Stage 3 (Kubernetes manifests), every exploitable Stage 4 application finding (all 11 actioned Criticals + the auth / webhook / race / handler High & Medium fixes), Stage-5 image digest pinning and K3S-F8 (secrets are now file-mounted, not env vars), plus the in-repo half of Stage 1 cluster provisioning — K3S-F4 (kubeconfig written 0600), K3S-CG1 (etcd secrets-encryption), K3S-CG2 (fail2ban + unattended-upgrades installed at provision). Includes token hashing, Google JWKS verification, IAP replay protection, the authorization fixes, atomic share-code join, the metrics-endpoint lockdown, per-account login lockout, verified-email gating, CSP/HSTS hardening, and digest-pinned images.
  • 1 partial (◐): CODE-L5. cosign signing + a Trivy HIGH,CRITICAL scan are wired (guarded) into 03-deploy.sh, and a ready-to-use Kyverno ClusterPolicy ships at deploy-k3s/manifests/kyverno-verify-images.yaml. Closing it needs two operator actions that cannot be committed: install Kyverno in the cluster, and supply a cosign key pair (COSIGN_KEY for signing + the public key pasted into the policy).
  • Accepted / blocked / moot (⊘): M3 (Apple nonce, blocked on an iOS-client change), C12 (moot, accounts are hard-deleted), LIVE-L14/L15 (UUID migration, planned quarter), LIVE-L17/L18/L20 (no security impact, see entries), F15/F16 (architectural), and LIVE-L2/L3/L4 (DMARC / SPF / CAA, operator-declined, below).
  • Operator-declined — Stage 0 DNS (LIVE-L2/L3/L4). The operator has opted not to add the DMARC, SPF-hardening, and CAA DNS records this cycle. For the record: these are not a paid-Cloudflare feature — DMARC and SPF are ordinary TXT records and CAA is an ordinary CAA record, all addable on any Cloudflare plan including Free. They remain genuine email-spoofing / certificate-issuance gaps and are marked ⊘; revisit when DNS is next touched.
  • Remaining operator runtime steps (no code to commit), on the existing cluster: k3s secrets-encrypt enable/reencrypt (K3S-CG1 / V12) and chmod 600 the live kubeconfig (K3S-F4); the SSH/sysctl half of K3S-CG2; and the K3S-CG3–CG8 verification items. A full fresh provision already comes up with K3S-F4/CG1/CG2 (fail2ban + unattended-upgrades) applied straight from _config.sh.

Operator note: C1 (token hashing) invalidates every existing login session once at deploy and makes login single-session per user — see the CODE-C1 entry. The status boxes in the master index below are authoritative.
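
The one-time session invalidation follows from how hashed token storage works: once only a digest is stored, a row still holding a plaintext token can no longer match any lookup. A minimal sketch of the idea, assuming hex-encoded SHA-256 digests and 32-byte random tokens; the function names are hypothetical and are not taken from the honeyDueAPI-go source:

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
)

// newToken returns a random bearer token (handed to the client once)
// and the SHA-256 hex digest that would be stored in its place.
// The 32-byte token length is an assumption for this sketch.
func newToken() (token, storedHash string, err error) {
	buf := make([]byte, 32)
	if _, err = rand.Read(buf); err != nil {
		return "", "", err
	}
	token = hex.EncodeToString(buf)
	return token, hashToken(token), nil
}

// hashToken is the one-way transform applied before any lookup,
// so the database never sees the raw token.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// tokenMatches compares a presented token with a stored digest in
// constant time, so the comparison does not reveal where they differ.
func tokenMatches(presented, storedHash string) bool {
	got := hashToken(presented)
	return subtle.ConstantTimeCompare([]byte(got), []byte(storedHash)) == 1
}
```

In this sketch a database dump yields only digests, which cannot be replayed as bearer tokens.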

Post-remediation independent review (2026-05-16)

The change set went through two independent review passes; the deploy-time verification below (build, go test -race, full goose up against real PostgreSQL 16) was executed and passed.

First pass. A separate review agent audited the full change set against the three audit files. It surfaced three deploy-breaking defects that a green go test could not catch — the test harness builds two tables via GORM AutoMigrate, which production never runs — all since fixed:

  • audit_log table was never created by a migration. 000005 added append-only triggers to a table that exists only in the test DB, so a from-scratch goose up would fail on 000005. 000005 now does CREATE TABLE IF NOT EXISTS audit_log before the triggers.
  • webhook_event_log table was never created by a migration. The H6 fail-closed webhook dedup turns a missing table into a 500 on every subscription webhook. New migration 000006 creates it.
  • 000004's google_purchase_token unique index could fail to build on a production table already holding duplicate tokens — exactly the C6 replay the migration fixes. 000004 now de-duplicates (keep-earliest, NULL-the-rest) before creating the index.

It also tightened the C13 Apple-webhook lookup (subscription_webhook_handler.go) so the legacy substring scan runs only on a genuine ErrRecordNotFound, never masking a real DB error as "not found".
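
The tightened lookup can be pictured as a fallback that fires on exactly one sentinel error. This is a sketch under assumptions: errRecordNotFound stands in for the ORM's ErrRecordNotFound, errDatabase is a hypothetical placeholder for any other failure, and the two lookups are passed in as functions rather than reproduced from subscription_webhook_handler.go.

```go
package main

import "errors"

// errRecordNotFound stands in for the ORM's not-found sentinel.
var errRecordNotFound = errors.New("record not found")

// errDatabase is a hypothetical placeholder for any other DB failure.
var errDatabase = errors.New("database unavailable")

// findUser runs the exact lookup first. The legacy scan runs only when
// the exact lookup reports a genuine not-found; every other error is
// returned unchanged so an outage is never reported as "no such user".
func findUser(exact, legacy func() (int64, error)) (int64, error) {
	id, err := exact()
	if err == nil {
		return id, nil
	}
	if !errors.Is(err, errRecordNotFound) {
		return 0, err
	}
	return legacy()
}
```

Using errors.Is rather than a direct comparison keeps the check correct if the sentinel is wrapped further up the call chain.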

Second pass (master review). A second, independent security-audit agent re-verified all four first-pass fixes (correct), ran go test -race (0 data races) and the full goose up/down chain against real PostgreSQL (clean, idempotent), and returned GO with one HIGH finding, since fixed:

  • HIGH-1 — Redis password leaked via the honeydue-config ConfigMap. _config.sh built REDIS_URL with the password embedded inline, and that URL is emitted into the honeydue-config ConfigMap (delivered to pods via envFrom). ConfigMaps are not covered by secrets-encryption and are readable by any principal with get configmap — so K3S-F1/K3S-F8 were not actually fully closed. Fixed (2026-05-16): _config.sh now emits REDIS_URL=redis://redis:6379/0 with no credentials; the password travels only as the file-mounted REDIS_PASSWORD secret. The API applies it in cache_service.go; cmd/worker/main.go now applies it onto the parsed Asynq RedisClientOpt so the server/inspector/monitoring client all authenticate against the requirepass Redis.
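
A rough sketch of the pattern behind the HIGH-1 fix: parse a credential-free REDIS_URL and apply the password from a separate source. The redisOptions struct and buildRedisOptions are hypothetical stand-ins, not the real Asynq or cache client types, and rejecting a URL that embeds credentials is an extra guard assumed here rather than something the source states.

```go
package main

import (
	"errors"
	"net/url"
	"strings"
)

// redisOptions is a hypothetical stand-in for a Redis client's options.
type redisOptions struct {
	Addr     string
	DB       string
	Password string
}

// buildRedisOptions parses a URL such as redis://redis:6379/0 and applies
// the password separately (for example, the contents of a mounted secret).
// A URL that carries userinfo is refused so a password cannot drift back
// into a ConfigMap unnoticed (assumed guard, not confirmed by the source).
func buildRedisOptions(rawURL, password string) (redisOptions, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return redisOptions{}, err
	}
	if u.Scheme != "redis" {
		return redisOptions{}, errors.New("unsupported scheme: " + u.Scheme)
	}
	if u.User != nil {
		return redisOptions{}, errors.New("REDIS_URL must not embed credentials")
	}
	return redisOptions{
		Addr: u.Host,
		DB:   strings.TrimPrefix(u.Path, "/"),
		// Trimming whitespace tolerates a trailing newline in the secret file.
		Password: strings.TrimSpace(password),
	}, nil
}
```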

The master review's other seven findings (4 Medium, 3 Low — none deploy-blocking) were then all fixed (2026-05-16):

  • MEDIUM-1 — re-login left the prior token usable for ≤5 min. CreateFreshToken deleted the old token row but not its Redis cache entry. It now also returns the deleted tokens' hashes; AuthService.freshToken evicts them via the new CacheService.InvalidateAuthTokenHashes on every login / Apple / Google sign-in, so a prior (e.g. stolen) token stops authenticating immediately.
  • MEDIUM-2 — IAP .p8 mode check incompatible with k8s. The Apple IAP key check (iap_validation.go) required 0600-or-stricter, unattainable on a k8s Secret volume (0440 under fsGroup). It now rejects only world-accessible keys (perm & 0o007).
  • MEDIUM-3 — single-IP account-lockout DoS. The M5 per-account lockout is now keyed on the set of distinct source IPs that have failed (RegisterLoginFailure takes the IP, tracks a Redis set; lock at 5 distinct IPs). One attacker IP can no longer lock a victim out by spamming failures; genuinely distributed stuffing still trips it. Login now takes the client IP (c.RealIP()).
  • MEDIUM-4 — Redis no-auth deployable. 02-setup-secrets.sh now dies (was warn) when redis.password is empty, so a deploy can no longer bring up an unauthenticated Redis (K3S-F1).
  • LOW-1 / LOW-2 — missing regression tests. Added: config_test.go asserts validate() refuses DEBUG_FIXED_CODES with DEBUG=false (C4); subscription_repo_test.go asserts a second account cannot bind an Apple transaction / Google purchase token already bound to another (C5/C6).
  • LOW-3 — device-token 409. A recycled APNs/FCM token re-registering under a new account is now reassigned to that account (and logged) instead of returning a 409 that locked the legitimate new device owner out of push.
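
Two of the fixes above (MEDIUM-2 and MEDIUM-3) reduce to small predicates. The sketch below is illustrative only: it keeps the per-account IP set in memory instead of Redis, omits any expiry window, and uses hypothetical names (loginFailures, registerFailure, keyModeAcceptable). The threshold of 5 distinct IPs and the 0o007 mask come from the text.

```go
package main

import (
	"io/fs"
	"sync"
)

// lockoutDistinctIPs is the threshold stated in MEDIUM-3.
const lockoutDistinctIPs = 5

// loginFailures tracks, per account, the set of source IPs that have
// failed a login. It is an in-memory stand-in for the Redis set.
type loginFailures struct {
	mu  sync.Mutex
	ips map[string]map[string]struct{}
}

func newLoginFailures() *loginFailures {
	return &loginFailures{ips: make(map[string]map[string]struct{})}
}

// registerFailure records one failed login and reports whether the
// account is now locked. Repeated failures from one IP add nothing
// to the set, so a single attacker cannot lock the victim out.
func (l *loginFailures) registerFailure(account, ip string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	set, ok := l.ips[account]
	if !ok {
		set = make(map[string]struct{})
		l.ips[account] = set
	}
	set[ip] = struct{}{}
	return len(set) >= lockoutDistinctIPs
}

// keyModeAcceptable mirrors the relaxed .p8 check: only the
// world-accessible bits cause a rejection, so 0440 and 0600 both pass.
func keyModeAcceptable(perm fs.FileMode) bool {
	return perm.Perm()&0o007 == 0
}
```

A production version would also need the set to expire, otherwise failures accumulated over months would eventually lock an account.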

One earlier (first-pass) hardening item remains a tracked follow-up, not re-raised by the master review and not deploy-blocking: /metrics is gated by an X-Forwarded-For check rather than network-isolated. True isolation needs /metrics on a separate port plus a NetworkPolicy restricting the scrape to vmagent — an architectural change deferred to a later cycle.
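
For orientation, a header-based gate of this kind might look like the sketch below. It assumes plain net/http, that the ingress always sets X-Forwarded-For, and that the in-cluster scrape never does; the names and the 404 response are illustrative choices, not the API's actual middleware.

```go
package main

import "net/http"

// proxied reports whether a request carried a forwarding header,
// which in this sketch is taken to mean it arrived via the ingress.
func proxied(xForwardedFor string) bool {
	return xForwardedFor != ""
}

// metricsGate serves the wrapped handler only to requests without
// X-Forwarded-For. The 404 status is an assumption for this sketch.
func metricsGate(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if proxied(r.Header.Get("X-Forwarded-For")) {
			http.NotFound(w, r)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```

The sketch also shows why this is weaker than network isolation: any client that can reach the pod directly and omits the header passes the check.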

Consolidated work items (fix once, closes many)

Several findings are the same defect seen from three angles. Do the work once at the listed anchor; the rest close with it.

| Theme | Anchor | Also closes |
|---|---|---|
| Auth-endpoint rate limiting | Stage 3 auth-rate-limit middleware + Stage 4 app limiter | K3S-F10, LIVE-L12, CODE-H1, CODE-H2, CODE-H3, CODE-M5 |
| CSP / cross-origin headers | Stage 3 security-headers + Stage 4 app CSP | K3S-F9, LIVE-L8 |
| HSTS preload | Stage 3 middleware + Stage 0 list submission | LIVE-L5, CODE-L3 |
| Admin ingress hardening | Stage 2 secret + Stage 3 middleware wiring | K3S-F2, K3S-F3, CODE-L6 |
| etcd encryption at rest | Stage 1 --secrets-encryption | K3S-CG1, CODE-M9 |
| Image digest pinning + signing | Stage 5 CI | K3S-F5, K3S-F14, CODE-L4, CODE-L5 |
| Pagination hard caps | Stage 4 app | LIVE-L16, CODE-M6 |
| imagePullSecret name consistency | Stage 3 manifests + Stage 2 script | K3S-F6 |

Known contradiction to resolve before planning Stage 4: LIVE-L18 says no account-deletion endpoint exists (every DELETE path 404/400), but CODE-M13 points at a delete handler at auth_handler.go:488-539. Either the endpoint exists at a path the external scan never probed, or it is mounted but unreachable. Confirm the route in internal/router/router.go first — the fix differs (add an endpoint vs. expose/rate-limit an existing one). Tracked as verification item V11.


Master finding index

Every finding, ordered by redeploy stage. Use this as the live tracker — flip the Status box as work lands.

Stage 0 — DNS & Cloudflare edge

| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| LIVE-L2 | HIGH | No DMARC record; email spoofing open | N | |
| LIVE-L3 | MED | SPF ends ?all (neutral, fails open) | N | |
| LIVE-L4 | MED | No CAA records; any CA may issue certs | N | |
| LIVE-L6 | LOW | No /.well-known/security.txt | Y | |
| LIVE-L9 | INFO | Aggressive Cloudflare caching on admin SSR shell | N | |
| LIVE-L10 | INFO | x-powered-by: Next.js framework leak | Y | |

Stage 1 — Cluster provisioning & node OS

| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| K3S-F4 | HIGH | Node kubeconfig world-readable (mode 644) | Y | |
| K3S-F15 | INFO | Nodes on public IPs, no private VPC | Y | |
| K3S-F16 | INFO | All 3 nodes are control-plane + etcd + worker | Y | |
| K3S-F17 | INFO | Single-replica SPOFs (redis/worker/admin/vmagent) | Y | |
| K3S-CG1 | | etcd encryption at rest not verified (--secrets-encryption) | Y | |
| K3S-CG2 | | Node OS hardening: SSH, fail2ban, unattended-upgrades, sysctl | Y/N | |
| K3S-CG3 | | Hetzner Cloud Firewall rules not verified | N | |
| K3S-CG4 | | etcd snapshot backup destination/encryption not verified | Y | |
| K3S-CG5 | | kubelet flags (--anonymous-auth=false, webhook authz) not verified | Y | |
| K3S-CG6 | | Container-runtime CIS controls (kube-bench) not run | N | |
| K3S-CG7 | | deploy user sudoers least-privilege not verified | N | |
| K3S-CG8 | | /etc/rancher/k3s/ dir + server-token perms not verified | N | |

Stage 2 — Secrets & config bootstrap

| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| K3S-F1 | CRIT | Redis runs with no authentication | Y | |
| K3S-F3 | HIGH | admin-basic-auth secret never created | Y | |
| K3S-F12 | MED | Secrets unrotated since cluster bootstrap; no runbook | Y | |
| CODE-C4 | CRIT | DEBUG_FIXED_CODES "123456" auth bypass if it reaches prod | Y | |
| CODE-M8 | MED | SECRET_KEY hardcoded debug fallback | Y | |

Stage 2 status (2026-05-15): config.yaml now carries a Redis password and admin basic-auth user/password; 02-setup-secrets.sh uses bcrypt (htpasswd -nbB); internal/config/config.go generates an ephemeral random SECRET_KEY in debug instead of a static fallback and refuses to boot if DEBUG_FIXED_CODES is set with DEBUG=false; the rotation runbook is at docs/runbooks/secret-rotation.md. All take effect on the next 02-setup-secrets.sh + 03-deploy.sh.
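
The two config.go behaviours described above (refuse DEBUG_FIXED_CODES outside debug, and generate an ephemeral SECRET_KEY in debug) could be sketched as follows. The struct fields and error messages are hypothetical, and the rule that a missing SECRET_KEY is fatal when DEBUG=false is an assumption of this sketch rather than something the source states.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
)

// config holds only the fields relevant to this sketch.
type config struct {
	Debug           bool
	DebugFixedCodes bool
	SecretKey       string
}

// validate refuses a production boot with fixed debug codes enabled and
// fills in a random, per-process secret key when running in debug mode.
func (c *config) validate() error {
	if c.DebugFixedCodes && !c.Debug {
		return errors.New("DEBUG_FIXED_CODES is only allowed with DEBUG=true")
	}
	if c.SecretKey == "" {
		if !c.Debug {
			// Assumption: production must always supply a key.
			return errors.New("SECRET_KEY is required when DEBUG=false")
		}
		buf := make([]byte, 32)
		if _, err := rand.Read(buf); err != nil {
			return err
		}
		// Ephemeral: differs on every start, so nothing signed with it
		// stays valid across restarts.
		c.SecretKey = hex.EncodeToString(buf)
	}
	return nil
}
```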

Stage 3 — Kubernetes manifests

| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| K3S-F2 | HIGH | Admin ingress missing cloudflare-only + admin-auth | Y | |
| K3S-F6 | HIGH | imagePullSecrets name mismatch (ghcr-credentials) | Y | |
| K3S-F7 | MED | vmagent container missing securityContext | Y | |
| K3S-F9 | MED | security-headers missing COOP/COEP/CORP | Y | |
| K3S-F10 | MED | Uniform rate limit; no auth-endpoint tightening | Y | |
| K3S-F11 | MED | automountServiceAccountToken not disabled | Y | |
| K3S-F13 | LOW | CORS_ALLOWED_ORIGINS missing app.myhoneydue.com | Y | |
| K3S-F14 | LOW | Public images (redis, vmagent) pinned by tag | Y | |
| LIVE-L5 | LOW | HSTS not preload-eligible | Y | |
| LIVE-L7 | LOW | Deprecated X-XSS-Protection header | Y | |
| LIVE-L8 | LOW | CSP missing object-src/base-uri; COOP/COEP/CORP absent | Y | |
| CODE-L3 | LOW | HSTS missing preload (duplicate of LIVE-L5) | Y | |
| CODE-L4 | LOW | imagePullPolicy not set on Deployments | Y | |
| CODE-L6 | LOW | Admin admin-auth middleware defined, not attached | Y | |

Stage 3 status (2026-05-15): admin ingress now chains cloudflare-only + admin-auth + security-headers + rate-limit; a dedicated honeydue-api-auth Ingress applies a new auth-rate-limit middleware (5/min, burst 10) to login / register / forgot-password / reset-password / join-with-code; security-headers gained COOP + CORP, HSTS is now max-age=63072000; …; preload, and the deprecated X-XSS-Protection (browserXssFilter) is removed; vmagent has a container securityContext; all workload pods + the migrate Job set automountServiceAccountToken: false explicitly (on top of the rbac.yaml ServiceAccount-level setting that already existed); the registry secret is gitea-credentials everywhere; imagePullPolicy: IfNotPresent is explicit on every container; CORS includes app.myhoneydue.com. Still open: K3S-F14 (public-image digest pins) is folded into Stage 5 with K3S-F5; LIVE-L8 is partial — the COOP/CORP half shipped here, the CSP object-src/base-uri half is an app change tracked in Stage 4.

Stage 4 — Application code & container images

| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| CODE-C1 | CRIT | Auth tokens stored plaintext in DB | Y | |
| CODE-C2 | CRIT | Google ID token not verified locally | Y | |
| CODE-C3 | CRIT | Google iss claim never validated | Y | |
| CODE-C5 | CRIT | Apple IAP receipt replay across accounts | Y | |
| CODE-C6 | CRIT | Google purchase-token replay across accounts | Y | |
| CODE-C7 | CRIT | File-ownership check excludes residence owners | Y | |
| CODE-C8 | CRIT | Device-token cross-account hijack on re-register | Y | |
| CODE-C9 | CRIT | Share-code join not atomic (Add+Deactivate race) | Y | |
| CODE-C10 | CRIT | Subscription upgrade race: validation outside txn | Y | |
| CODE-C11 | CRIT | Task-completion duplicate-row race | Y | |
| CODE-C12 | CRIT | Soft-deleted email reusable; is_active not filtered | Y | |
| CODE-C13 | CRIT | Apple webhook user lookup may LIKE-match | Y | |
| CODE-H1 | HIGH | Rate limit doesn't cover all auth surfaces | Y | |
| CODE-H2 | HIGH | No rate limit on join-with-code | Y | |
| CODE-H3 | HIGH | No rate limit on register | Y | |
| CODE-H4 | HIGH | Modulo bias in 6-digit code generation | Y | |
| CODE-H5 | HIGH | Apple IAP .p8 loaded with no file-mode check | Y | |
| CODE-H6 | HIGH | Webhook dedup fail-open | Y | |
| CODE-H7 | HIGH | Auth-failure log lacks IP/User-Agent | Y | |
| CODE-H8 | HIGH | X-Timezone header trusted for trial-start calc | Y | |
| CODE-H9 | HIGH | Share-code Deactivate error swallowed | Y | |
| CODE-M1 | MED | HTTP header injection via Content-Disposition filename | Y | |
| CODE-M2 | MED | bcrypt cost = 10 (recommend 12) | Y | |
| CODE-M3 | MED | Apple Sign In nonce not validated | Y | |
| CODE-M4 | MED | Email verification not atomic | Y | |
| CODE-M5 | MED | Per-user rate limiting absent | Y | |
| CODE-M6 | MED | List endpoints uncapped (Documents/Contractors/Residences) | Y | |
| CODE-M7 | MED | Audit log not append-only | Y | |
| CODE-M11 | MED | golang.org/x/crypto v0.49.0 outdated | Y | |
| CODE-M12 | MED | Contractor toggle refetch race | Y | |
| CODE-M13 | MED | Account-deletion endpoint unrate-limited | Y | |
| CODE-M10 | MED | node:20-alpine floating tag in Dockerfile | Y | |
| CODE-L1 | LOW | Login inactive-account error enables enumeration | Y | |
| CODE-L2 | LOW | Auth responses lack Cache-Control: no-store | Y | |
| LIVE-L1 | HIGH | /metrics publicly exposed on api.myhoneydue.com | Y | |
| LIVE-L11 | HIGH | Login user-enumeration via timing | Y | |
| LIVE-L12 | HIGH | No rate-limit on /api/auth/login/ | Y | |
| LIVE-L13 | HIGH | Password-reset user-enumeration via timing | Y | |
| LIVE-L14 | MED | Sequential integer user IDs leak userbase size | Y | |
| LIVE-L15 | MED | Sequential integer resource IDs (same risk) | Y | |
| LIVE-L16 | MED | Pagination limit accepted at any size | Y | |
| LIVE-L17 | LOW | Garbage pagination params silently accepted | Y | |
| LIVE-L18 | LOW | No account-deletion endpoint (GDPR gap) | Y | |
| LIVE-L19 | LOW | Email verification not enforced | Y | |
| LIVE-L20 | INFO | Profile-update silently drops unknown fields | Y | |

Stage 4 handler/misc batch status (2026-05-15):

  • M1: Content-Disposition filenames are sanitized (control chars / quote / backslash stripped) so an upload filename cannot inject response headers.
  • M7: migration 000005 creates the audit_log table (no prior migration did; it uses CREATE TABLE IF NOT EXISTS) and makes it append-only via BEFORE UPDATE/DELETE triggers.
  • M11: golang.org/x/crypto bumped v0.49.0 → v0.51.0.
  • M13: DELETE /api/auth/account now carries the Traefik auth-rate-limit edge limiter.
  • LIVE-L18 ⊘: not a real gap. The endpoint exists at DELETE /api/auth/account/ (router.go:546); the live scan probed /api/auth/me/, /auth/delete/, /users/me/ and missed it.

Update (2026-05-15): items shown as deferred in an earlier draft were then completed:

  • LIVE-L1: /metrics rejects proxied/public requests via an X-Forwarded-For check, so only the in-cluster vmagent scrape reaches it.
  • M6/LIVE-L16: the document/contractor list repos already hard-cap at 500 rows.
  • LIVE-L19: verified-email gating on share-code generation via the new RequireVerified middleware.

LIVE-L17 (inert pagination params, results capped) and LIVE-L20 (whitelist profile update is the correct pattern) are closed as no-security-impact (⊘). The master index above is authoritative.
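
The M1 filename sanitization can be illustrated with a short sketch. It strips the three character classes named above (control characters, double quote, backslash); the function names and the "download" fallback for an empty result are hypothetical additions for this example.

```go
package main

import (
	"strings"
	"unicode"
)

// sanitizeFilename removes control characters (including CR and LF),
// double quotes and backslashes, so the name cannot terminate the
// quoted filename parameter or start a new response header.
func sanitizeFilename(name string) string {
	cleaned := strings.Map(func(r rune) rune {
		if unicode.IsControl(r) || r == '"' || r == '\\' {
			return -1 // a negative rune drops the character
		}
		return r
	}, name)
	if cleaned == "" {
		return "download" // hypothetical fallback name
	}
	return cleaned
}

// contentDisposition builds the header value from a sanitized name.
func contentDisposition(name string) string {
	return `attachment; filename="` + sanitizeFilename(name) + `"`
}
```

With CR and LF removed, an upload named with an embedded "\r\nSet-Cookie: ..." sequence ends up as harmless text inside the quoted filename.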

Stage 4 races batch status (2026-05-15):

  • C9/H9: share-code redemption is now one locked transaction in ResidenceRepository.JoinWithShareCode (lock the code row, re-check validity, add member, deactivate; a deactivation failure aborts the join).
  • C11: the task-completion duplicate-row race was already closed. The completion insert and the optimistically-version-locked task update share one transaction, so a concurrent completion fails with ErrVersionConflict and rolls back its inserted row. No UNIQUE(task_id, completed_date) was added (it would reject legitimate same-day re-completions and risk a migration failure on existing data).
  • M4: email verification's find/consume/flag writes are now one transaction.
  • M12: a concurrent contractor delete now yields a clean 404.
  • C12 ⊘: premise moot. The app hard-deletes accounts (DeleteUserCascade), so there is no soft-deleted user whose email lingers, and ExistsByEmail already blocks re-registering a deactivated user's email.
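
The C9/H9 ordering (lock, re-check, add member, deactivate, abort on failure) can be shown with an in-memory analogue. This is not the repository code: a mutex plays the role of the row lock, a map plays the role of the membership table, and every name is hypothetical.

```go
package main

import (
	"errors"
	"sync"
)

var errCodeInvalid = errors.New("share code invalid or already used")

// shareCode and residence are in-memory stand-ins for the DB rows.
type shareCode struct {
	active bool
}

type residence struct {
	mu      sync.Mutex
	code    shareCode
	members map[int64]bool
}

// joinWithShareCode performs the whole redemption under one lock.
// If deactivation fails, the membership added in this call is undone,
// mirroring a transaction rollback.
func (r *residence) joinWithShareCode(userID int64, deactivate func(*shareCode) error) error {
	r.mu.Lock() // plays the role of locking the code row
	defer r.mu.Unlock()

	if !r.code.active { // re-check validity while holding the lock
		return errCodeInvalid
	}
	alreadyMember := r.members[userID]
	r.members[userID] = true
	if err := deactivate(&r.code); err != nil {
		if !alreadyMember {
			delete(r.members, userID) // roll back only what this call added
		}
		return err
	}
	return nil
}
```

Because the validity check and the deactivation happen under the same lock, two concurrent callers cannot both observe an active code.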

Stage 4 auth batch status (2026-05-15):

  • C1, C2, C3: done (see entries below).
  • Rate limiting: every sensitive auth path now carries the shared Traefik auth-rate-limit edge limiter (login/register/forgot/reset/verify-reset/apple/google/refresh/join-with-code); login/register/forgot/reset/apple/google additionally keep the per-IP app limiter (H1/H2/H3/LIVE-L12).
  • Done: H4 rejection-sampled codes, M2 bcrypt cost 12, L1 + LIVE-L11 constant-time generic-error login, L2 no-store on auth responses, H7 IP/UA in auth logs, LIVE-L13 fully-async forgot-password. go build ./... and the models/repositories/middleware/handlers/services test packages pass.
  • Deferred, M3 (Apple nonce): needs the iOS client to generate and send a nonce; server-only validation would reject every Apple login, so this is blocked on a coordinated mobile change.
  • H8: the parseTimezone ±14h cap shipped; the "use server UTC for trial-start" half is folded into Stage 4's subscription work.
  • M5: per-account lockout (Redis) was deferred at this date, on the basis that the edge + per-IP app limiters + the existing per-account password-reset counter cover the practical risk. The 2026-05-16 review above records the per-account login lockout as implemented and keyed on distinct source IPs (MEDIUM-3).
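
The H4 fix replaces a biased modulo reduction with rejection sampling. A minimal sketch of the technique, assuming 32-bit random draws from crypto/rand; the function and constant names are hypothetical and the real generator may differ in detail:

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

// codeSpace is the number of distinct 6-digit codes (000000..999999).
const codeSpace = 1_000_000

// rejectionLimit is the largest multiple of codeSpace that fits in the
// 32-bit range. Draws at or above it are discarded, because keeping them
// would make the low remainders slightly more likely than the rest.
const rejectionLimit = (1 << 32) / codeSpace * codeSpace

// sixDigitCode returns a uniformly distributed, zero-padded 6-digit code.
func sixDigitCode() (string, error) {
	var buf [4]byte
	for {
		if _, err := rand.Read(buf[:]); err != nil {
			return "", err
		}
		n := binary.BigEndian.Uint32(buf[:])
		if n < rejectionLimit {
			return fmt.Sprintf("%06d", n%codeSpace), nil
		}
		// Otherwise draw again; fewer than 0.03% of draws are rejected.
	}
}
```

Since 2^32 is not a multiple of 1,000,000, a plain n % 1000000 over-represents the codes 000000 to 967295; rejecting the top 967,296 values removes that skew.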

Stage 5 — CI / build pipeline

| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| K3S-F5 | HIGH | Images pinned by mutable short SHA tag, not digest | Y | |
| K3S-F8 | MED | Secrets injected as env vars, not file mounts | Y | |
| CODE-L5 | LOW | No image signing (cosign) in CI | Y | |

Stage 5 status (2026-05-15): CODE-M11 is done. golang.org/x/crypto was bumped v0.49.0 → v0.51.0 (with the x/sys, x/term and x/text bumps that go get -u pulled in), go mod tidy is clean, and the full build + test run is green.

  • Update (2026-05-15): K3S-F5/K3S-F14/CODE-M10 are done. 03-deploy.sh resolves the image digest after each push and deploys api/worker/admin/web by @sha256:, and redis/vmagent/node:20-alpine are pinned to their resolved index digests.
  • Update (2026-05-16): K3S-F8 is done. The api/worker Deployments mount honeydue-secrets as files (defaultMode: 0400) at /etc/honeydue/secrets and inject no secret as an env var; config.loadFileSecrets reads them; 02-setup-secrets.sh now writes B2_KEY_ID/B2_APP_KEY into the secret, reconciling the earlier script-vs-manifest drift.
  • CODE-L5 stays partial (◐): cosign signing and a Trivy HIGH,CRITICAL scan are wired (guarded) into 03-deploy.sh and a ready-to-use Kyverno ClusterPolicy ships at deploy-k3s/manifests/kyverno-verify-images.yaml; closing it needs the operator to install Kyverno and supply a cosign key. See both entries.
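
For readers unfamiliar with file-mounted secrets, a loader in the spirit of config.loadFileSecrets might look like the sketch below. Only the function name and the mount path come from the text; the signature, the one-file-per-key layout, and the newline trimming are assumptions.

```go
package main

import (
	"os"
	"path/filepath"
	"strings"
)

// loadFileSecrets reads every visible file in dir into a map keyed by
// file name. Directories and dot-prefixed entries are skipped, because
// Kubernetes keeps bookkeeping entries such as ..data in Secret volumes.
func loadFileSecrets(dir string) (map[string]string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	secrets := make(map[string]string, len(entries))
	for _, e := range entries {
		if e.IsDir() || strings.HasPrefix(e.Name(), ".") {
			continue
		}
		raw, err := os.ReadFile(filepath.Join(dir, e.Name()))
		if err != nil {
			return nil, err
		}
		// Assumption: a trailing newline is not part of the secret value.
		secrets[e.Name()] = strings.TrimRight(string(raw), "\r\n")
	}
	return secrets, nil
}
```

Under these assumptions, loadFileSecrets("/etc/honeydue/secrets") would return a map with one entry per key in honeydue-secrets, for example REDIS_PASSWORD. Returning an error for a missing directory, rather than an empty map, lets the process fail at startup instead of running without secrets.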

Stage 6 — Post-deploy verification & runtime investigations

V1–V13: see Stage 6.


Stage 0 — DNS & Cloudflare edge

External state at Cloudflare. Not touched by 03-deploy.sh, so a redeploy neither breaks nor re-applies these — do them once and leave them. Tracked here so they are never forgotten on a domain move or DNS migration.

LIVE-L2 — Add DMARC record · HIGH · ⊘

  • Operator decision (2026-05-16): declined for this cycle. A DMARC record is an ordinary DNS TXT record — it is not gated behind a paid Cloudflare plan and can be added on Free. This remains a real email-spoofing gap; revisit when DNS is next touched.
  • Where: Cloudflare DNS, TXT record at _dmarc.myhoneydue.com.
  • Fix: Publish v=DMARC1; p=quarantine; rua=mailto:dmarc@myhoneydue.com; ruf=mailto:dmarc@myhoneydue.com; fo=1; aspf=s; adkim=s. Start at pct=10 for 30 days, watch the rua aggregate reports, then ramp to pct=100 and finally p=reject.
  • Verify: dig +short TXT _dmarc.myhoneydue.com returns the record.

LIVE-L3 — Tighten SPF from ?all to -all · MEDIUM · ⊘

  • Operator decision (2026-05-16): declined for this cycle. SPF is an ordinary DNS TXT record, editable on any Cloudflare plan including Free. The ?all (neutral) qualifier leaves spoofed mail un-penalised; revisit alongside LIVE-L2.
  • Where: Cloudflare DNS, TXT record at myhoneydue.com.
  • Fix: In v=spf1 include:spf.messagingengine.com ?all, change the terminal qualifier from ?all to ~all for ~7 days, confirm no legitimate mail (CI, transactional) is missed, then tighten to -all. Do this after LIVE-L2's DMARC ramp begins.
  • Verify: dig +short TXT myhoneydue.com | grep spf shows -all.

LIVE-L4 — Add CAA records · MEDIUM · ⊘

  • Operator decision (2026-05-16): declined for this cycle. CAA is an ordinary DNS record type, addable on any Cloudflare plan including Free. Without it, any public CA may issue a cert for the domain; revisit when DNS is next touched.
  • Where: Cloudflare DNS, apex myhoneydue.com.
  • Fix: Add 0 issue "letsencrypt.org", 0 issuewild "letsencrypt.org", 0 iodef "mailto:security@myhoneydue.com". Add 0 issue "pki.goog" only if Google Trust Services is used anywhere. Confirm against the CAs Cloudflare Universal SSL actually uses before locking down.
  • Verify: dig +short CAA myhoneydue.com returns the records.

LIVE-L6 — Publish security.txt · LOW · ☐ · In-repo: Y

  • Where: served by the Go API and/or Next.js apps at /.well-known/security.txt (RFC 9116) — committed route, so it survives redeploys.
  • Fix: Serve Contact:, Expires:, Preferred-Languages:, Canonical: on both api.myhoneydue.com and the apex.
  • Verify: curl https://api.myhoneydue.com/.well-known/security.txt → 200.
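
One way the Go API could serve this file is sketched below. The four field names are the ones listed in the Fix above; the securityTxt type, the use of plain net/http, and every field value shown in the usage note (contact address, expiry, language, canonical URL) are assumptions to be replaced by the operator.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
	"time"
)

// securityTxt holds the fields to publish. All values are supplied by
// the caller; nothing here is taken from the real deployment.
type securityTxt struct {
	Contact   string
	Expires   time.Time
	Languages []string
	Canonical string
}

// render produces the file body, one "Field: value" line per field,
// with Expires written as an RFC 3339 timestamp in UTC.
func (s securityTxt) render() string {
	var b strings.Builder
	fmt.Fprintf(&b, "Contact: %s\n", s.Contact)
	fmt.Fprintf(&b, "Expires: %s\n", s.Expires.UTC().Format(time.RFC3339))
	fmt.Fprintf(&b, "Preferred-Languages: %s\n", strings.Join(s.Languages, ", "))
	fmt.Fprintf(&b, "Canonical: %s\n", s.Canonical)
	return b.String()
}

// ServeHTTP lets the value be mounted directly as an http.Handler.
func (s securityTxt) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; charset=utf-8")
	fmt.Fprint(w, s.render())
}
```

Mounting it would be a single line such as mux.Handle("/.well-known/security.txt", securityTxt{...}) on a hypothetical mux. Because Expires is a hard date, the value needs a reminder to refresh it before it lapses.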

LIVE-L9 — Review Cloudflare caching of the admin SSR shell · INFO · ☐

  • Where: Cloudflare cache rules for admin.myhoneydue.com.
  • Fix: cache-control: s-maxage=31536000 on admin SSR pages means Cloudflare caches the admin shell for a year. Confirm this is intentional; if the admin shell ever contains per-session content, add a bypass-cache rule for admin.myhoneydue.com.
  • Verify: curl -sI https://admin.myhoneydue.com/ | grep -i cache reflects the intended policy.

LIVE-L10 — Suppress x-powered-by · INFO · ☐ · In-repo: Y

  • Where: Next.js config in the admin and web repos (next.config.js → poweredByHeader: false). Committed, survives redeploys.
  • Fix: Disable the x-powered-by: Next.js header.
  • Verify: curl -sI https://admin.myhoneydue.com/ | grep -i x-powered-by returns nothing.

Stage 1 — Cluster provisioning & node OS

Run by 01-provision-cluster.sh (which drives the hetzner-k3s CLI from config.yaml via generate_cluster_config in _config.sh) plus one-time SSH hardening on each node. Any k3s server flag must be set in the hetzner-k3s cluster config so a cluster rebuild applies it.

K3S-F4 — kubeconfig world-readable (mode 644 → 600) · HIGH · ☑ · In-repo: Y

  • Where: _config.sh → generate_cluster_config → k3s_config_file. Node file /etc/rancher/k3s/k3s.yaml.
  • Done (2026-05-16): generate_cluster_config now emits write-kubeconfig-mode: "0600" in the k3s config file, so any fresh provision writes the node kubeconfig as 0600.
  • Operator step on the existing cluster: a running node keeps the mode it was installed with — ssh deploy@<node> 'sudo chmod 600 /etc/rancher/k3s/k3s.yaml' on each. Deploy scripts still read it via sudo.
  • Verify: ssh deploy@<node> 'sudo stat -c %a /etc/rancher/k3s/k3s.yaml' → 600.

K3S-CG1 / CODE-M9 — etcd / Secret encryption at rest · ☑ · In-repo: Y

  • Where: _config.sh → generate_cluster_config → k3s_config_file.
  • Done: the k3s config file carries secrets-encryption: true, so a fresh provision boots with AES Secret encryption enabled. (The write-kubeconfig-mode line for K3S-F4 was added next to it on 2026-05-16.)
  • Operator step on the existing cluster: a cluster provisioned without the flag does not retro-encrypt — run k3s secrets-encrypt enable then k3s secrets-encrypt reencrypt once. Tracked as V12.
  • Verify: k3s secrets-encrypt status reports Encryption Status: Enabled on every server node.
  • Note: the old SECURITY.md claimed this was already on — 04-verify.sh greps for the string but cannot truly confirm; see V12.

K3S-CG2 — Node OS hardening · ◐ · In-repo: partial

  • Where: _config.sh → generate_cluster_config → post_create_commands (runs on every node at provision).
  • Done (2026-05-16): post_create_commands now installs and enables fail2ban (SSH brute-force bans) and unattended-upgrades (automatic security patching) on every node at provision time — a fresh cluster comes up hardened on both.
  • Still operator (runtime; not yet in-repo):
    • SSH — confirm PermitRootLogin no, PasswordAuthentication no, AllowUsers deploy, modern ciphers/MACs/KEX. (hetzner-k3s provisions key-only SSH; verify and tighten.)
    • sysctl — confirm net.ipv4.ip_unprivileged_port_start=0 (Traefik) and standard network-hardening sysctls.
  • Verify: ssh deploy@<node> 'fail2ban-client status sshd; systemctl is-enabled unattended-upgrades'.

K3S-CG3 — Hetzner Cloud Firewall rules · ☐ · In-repo: N

  • Fix: Confirm only: :443 from Cloudflare CIDRs, :22 from operator IP(s), :6443 from operator IP(s). Nothing else. This is the only network defense for the public-IP nodes (K3S-F15).
  • Verify: hcloud firewall describe honeydue-fw matches the intended ruleset; a direct curl to a node IP on :80/:443 from a non-CF host times out.

K3S-CG4 — etcd snapshot backup · ☐ · In-repo: Y

  • Fix: Confirm k3s etcd snapshots are enabled (default hourly) and shipped off-node — set --etcd-s3 (to Backblaze B2) with encryption. Without offsite snapshots, a 3-node loss is unrecoverable.
  • Verify: ls /var/lib/rancher/k3s/server/db/snapshots/ on a node + an object in the B2 backup bucket.

K3S-CG5 — kubelet authn/authz flags · ☐ · In-repo: Y

  • Fix: Confirm --anonymous-auth=false and --authorization-mode=Webhook on the kubelet (k3s defaults are usually safe — verify, don't assume). Set via k3s kubelet-arg in the cluster config if missing.
  • Verify: kubectl get --raw /api/v1/nodes/<node>/proxy/configz shows the expected kubelet config.

K3S-CG6 — Container-runtime CIS baseline · ☐ · In-repo: N

  • Fix: Run kube-bench once; remediate any FAIL lines that aren't k3s-by-design.
  • Verify: kube-bench run archived with FAILs triaged.

K3S-CG7 — deploy user sudoers least-privilege · ☐ · In-repo: N

  • Fix: Current deploy ALL=(ALL) NOPASSWD: ALL means an SSH-key compromise = node root. Scope to the commands deploys actually need (ufw, systemctl, chmod on k3s.yaml, cat of k3s.yaml). Accept the convenience trade-off only with eyes open.
  • Verify: ssh deploy@<node> 'sudo -l' shows the scoped list.

K3S-CG8 — /etc/rancher/k3s/ perms · ☐ · In-repo: N

  • Fix: /var/lib/rancher/k3s/server/token and /var/lib/rancher/k3s/server/node-token must be 0600 root:root; /etc/rancher/k3s/ not world-traversable.
  • Verify: ssh deploy@<node> 'sudo stat -c "%a %n" /var/lib/rancher/k3s/server/token' → 600.

K3S-F15 — Nodes on public IPs, no private VPC · INFO · ⊘ · In-repo: Y

  • Decision: Accepted for now. Defense is K3S-CG3 (Hetzner firewall) only. To remediate later: attach a Hetzner private network, re-IP the cluster, move etcd/kubelet/Flannel onto it. Substantial re-provision — track on the roadmap, not this cycle.

K3S-F16 — All nodes are control-plane + etcd + worker · INFO · ⊘

  • Decision: Accepted — standard small-cluster k3s. Revisit (dedicated workers + NoSchedule taint on control-plane) when workload pressure grows. No redeploy action.

K3S-F17 — Single-replica SPOFs · INFO · ☐ · In-repo: Y

  • Where: deploy-k3s/manifests/worker/deployment.yaml, redis/, admin/, observability/vmagent.yaml.
  • Fix: worker → replicas: 2 (stateless, Asynq at-least-once — safe now). admin/vmagent → 2 if zero-downtime restart is wanted. redis is stateful — true HA needs Sentinel or managed Redis; track separately, do not naively scale.
  • Verify: kubectl -n honeydue get deploy shows worker 2/2.

Stage 2 — Secrets & config bootstrap

Run by 02-setup-secrets.sh, which reads deploy-k3s/config.yaml and the secrets/ directory. Both K3S-F1 and K3S-F3 are open purely because config.yaml lacks the values — the script already supports them.

K3S-F1 — Redis runs with no authentication · CRITICAL · ☐ · In-repo: Y

  • Where: deploy-k3s/config.yaml key redis.password. 02-setup-secrets.sh:53,68-71 includes REDIS_PASSWORD in honeydue-secrets only when that key is non-empty; redis/deployment.yaml adds --requirepass only when the env var is non-empty.
  • Fix: Set redis.password in config.yaml to a strong value (openssl rand -base64 32). Re-run 02-setup-secrets.sh. api/worker already consume REDIS_PASSWORD.
  • Verify: kubectl -n honeydue exec deploy/redis -- redis-cli ping → NOAUTH; with -a "$REDIS_PASSWORD" → PONG.
  • Redeploy-clean: committing the value to config.yaml means every future 02-setup-secrets.sh re-creates the authenticated secret. (If config.yaml is gitignored, store the value in the operator's secret store and document it here.)

K3S-F3 — admin-basic-auth secret never created · HIGH · ☐ · In-repo: Y

  • Where: config.yaml keys admin.basic_auth_user / admin.basic_auth_password. 02-setup-secrets.sh:54-55,132-143 creates the admin-basic-auth secret (bcrypt htpasswd) only when both are set, else it warns and skips.
  • Fix: Set both keys. Re-run 02-setup-secrets.sh. Must be done before K3S-F2 — attaching admin-auth to the ingress with the secret missing makes Traefik 503 the admin route.
  • Verify: kubectl -n honeydue get secret admin-basic-auth.

K3S-F8 (Stage 2 half) — B2_KEY_ID / B2_APP_KEY in honeydue-secrets · ☑ · In-repo: Y

  • Where: 02-setup-secrets.sh.
  • Done (2026-05-16): the script now reads storage.b2_key_id / storage.b2_app_key from config.yaml and adds B2_KEY_ID / B2_APP_KEY to honeydue-secrets. Previously the api/worker manifests referenced these keys but the script never created them — a latent deploy break. See the full K3S-F8 entry in Stage 5.
  • Verify: kubectl -n honeydue get secret honeydue-secrets -o jsonpath='{.data.B2_KEY_ID}' is non-empty.

K3S-F12 — Secret rotation runbook · MEDIUM · ☐ · In-repo: Y

  • Where: new doc docs/runbooks/secret-rotation.md.
  • Fix: Document per-secret rotation (Postgres, SECRET_KEY, APNs .p8, FCM, B2, observability token, Redis, admin basic-auth). Annual minimum; immediate on suspected exposure or operator-device loss. For SECRET_KEY (JWT signing) plan an overlap window so live tokens validate across the change. Add a last-rotated annotation to each secret.
  • Verify: runbook exists and the first rotation is logged.

CODE-C4 — DEBUG_FIXED_CODES "123456" auth bypass · CRITICAL · ☐ · In-repo: Y

  • Where: internal/services/auth_service.go:141-145,385-390,432-435,470-473,503-504; config in internal/config/config.go. ConfigMap generated from config.yaml by 03-deploy.sh.
  • Fix (two layers): (1) Code — refuse to start if ENV=production && DebugFixedCodes (Stage 4 code change). (2) Config — ensure config.yaml never sets DEBUG_FIXED_CODES=true for prod, and the generated ConfigMap omits it.
  • Verify: prod ConfigMap has no DEBUG_FIXED_CODES; a prod boot with the flag set fails fast.
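The code layer of this fix can be sketched as a startup check. This is a minimal illustration, not the project's actual code: the Config struct, its field names, and ValidateForStartup are hypothetical stand-ins for whatever internal/config/config.go really exposes.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// Config is a hypothetical, trimmed-down stand-in for the real config struct;
// the field names are assumptions made for this sketch.
type Config struct {
	Env             string // e.g. "production" or "development"
	DebugFixedCodes bool   // when true, verification codes are fixed at "123456"
}

// ErrDebugCodesInProd signals that the debug bypass is enabled in production.
var ErrDebugCodesInProd = errors.New("DEBUG_FIXED_CODES must not be enabled when ENV=production")

// ValidateForStartup returns an error when the configuration is unsafe to boot with.
func ValidateForStartup(c Config) error {
	if c.Env == "production" && c.DebugFixedCodes {
		return ErrDebugCodesInProd
	}
	return nil
}

func main() {
	cfg := Config{
		Env:             os.Getenv("ENV"),
		DebugFixedCodes: os.Getenv("DEBUG_FIXED_CODES") == "true",
	}
	if err := ValidateForStartup(cfg); err != nil {
		// Fail fast with a non-zero exit so the pod never becomes Ready.
		fmt.Fprintln(os.Stderr, "refusing to start:", err)
		os.Exit(1)
	}
	fmt.Println("config ok")
}
```

Keeping the check in a pure function makes the "prod boot with the flag set fails fast" verification testable without starting the server.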

CODE-M8 — SECRET_KEY hardcoded debug fallback · MEDIUM · ☐ · In-repo: Y

  • Where: internal/config/config.go:437-442 falls back to "change-me-in-production-secret-key-12345".
  • Fix: Remove the static fallback — generate a per-boot random key in debug, and refuse to start in production if SECRET_KEY is unset. (02-setup-secrets.sh:46-49 already enforces ≥32 chars for the real secret — keep that.)
  • Verify: prod boot with no SECRET_KEY exits non-zero; the fallback string is gone from the binary.
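One possible shape for this fix, assuming the environment name and the configured key are already available as strings. ResolveSecretKey is a hypothetical helper, not an existing function in the repo.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
	"os"
)

// ResolveSecretKey returns the configured key if present. In production a
// missing key is an error; elsewhere a random per-boot key is generated so
// no static fallback string ever ships in the binary.
func ResolveSecretKey(env, configured string) (string, error) {
	if configured != "" {
		return configured, nil
	}
	if env == "production" {
		return "", errors.New("SECRET_KEY is required when ENV=production")
	}
	buf := make([]byte, 32) // 32 random bytes -> 64 hex characters
	if _, err := rand.Read(buf); err != nil {
		return "", fmt.Errorf("generate debug secret key: %w", err)
	}
	return hex.EncodeToString(buf), nil
}

func main() {
	key, err := ResolveSecretKey(os.Getenv("ENV"), os.Getenv("SECRET_KEY"))
	if err != nil {
		fmt.Fprintln(os.Stderr, "refusing to start:", err)
		os.Exit(1)
	}
	fmt.Println("secret key length:", len(key))
}
```

A per-boot random key means debug tokens stop validating after a restart, which is usually acceptable outside production.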

Stage 3 — Kubernetes manifests

Committed under deploy-k3s/manifests/ and applied by 03-deploy.sh. Any fix here is automatically re-applied on every redeploy — the highest-value stage for "redeploy clean."

K3S-F2 / CODE-L6 — Wire defense-in-depth onto the admin ingress · HIGH · ☐

  • Where: deploy-k3s/manifests/ingress/ingress-simple.yaml — admin route annotation.
  • Fix: Add cloudflare-only and admin-auth to the traefik.ingress.kubernetes.io/router.middlewares annotation alongside the existing security-headers + rate-limit. Do K3S-F3 first or Traefik 503s the route.
  • Verify: 04-verify.sh "Cloudflare-Only Middleware" check passes; admin.myhoneydue.com prompts for basic auth.

K3S-F6 — imagePullSecrets name consistency · HIGH · ☐

  • Where: all deploy-k3s/manifests/*/deployment.yaml, migrate/job.yaml; secret created by 02-setup-secrets.sh:111 as ghcr-credentials.
  • Fix: The registry is Gitea — ghcr-credentials is a misleading name and the live cluster currently also has a hand-made gitea-credentials. Pick one name (gitea-credentials is clearer), use it in both the script and every manifest, and delete the orphan. The defect is a name mismatch, not a missing fix — make script + manifests agree so a pull never fails on a fresh node.
  • Verify: grep -rl imagePullSecrets deploy-k3s/manifests/ all reference one name == the script's; cordon a node, delete a pod, confirm the replacement pulls.

K3S-F7 — vmagent container securityContext · MEDIUM · ☐

  • Where: deploy-k3s/manifests/observability/vmagent.yaml.
  • Fix: Add the container-level block the other 5 deployments already have: allowPrivilegeEscalation: false, capabilities.drop: [ALL], readOnlyRootFilesystem: true. Its volumes (/etc/vmagent, /etc/vmagent-secrets, /tmp/vmagent emptyDir) already support read-only root.
  • Verify: 04-verify.sh "Pod Security Contexts" reports OK for vmagent.

K3S-F9 / LIVE-L8 — CSP + cross-origin headers · MEDIUM / LOW · ☐

  • Where: Cross-origin trio → deploy-k3s/manifests/ingress/middleware.yaml (security-headers). CSP object-src/base-uri → Go app CSP middleware (Stage 4, LIVE-L8 code half).
  • Important correction: K3S-F9 originally said CSP was missing. The live scan disproved that — the Go app sets a strong CSP via app middleware. So K3S-F9 reduces to: add Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Resource-Policy: same-origin (and Cross-Origin-Embedder-Policy: require-corp only if it doesn't break embeds) to security-headers. The CSP object-src 'none'; base-uri 'self' additions belong in the app and are tracked under LIVE-L8 in Stage 4.
  • Verify: curl -sI https://api.myhoneydue.com/api/health/ | grep -i cross-origin shows COOP/CORP.

K3S-F10 / LIVE-L12 — Auth-endpoint rate-limit middleware · MEDIUM / HIGH · ☐

  • Where: deploy-k3s/manifests/ingress/middleware.yaml (new auth-rate-limit Middleware) + ingress/ingress-simple.yaml. Requires migrating the auth paths from vanilla Ingress to a Traefik IngressRoute to apply a per-path middleware.
  • Fix: New Middleware average: 5, burst: 10, period: 1m, sourceCriterion.ipStrategy.depth: 2 (depth 2 for the Cloudflare hop). Apply to /api/auth/login, /api/auth/register, /api/auth/forgot-password, /api/auth/reset-password, /api/residences/join-with-code. This is the edge half; the app half is CODE-H1/H2/H3/M5 in Stage 4 (per-account lockout in Redis). Do both — edge limit alone resets on IP rotation.
  • Verify: 10 rapid logins from one IP → 429.

K3S-F11 — Disable automountServiceAccountToken · MEDIUM · ☐

  • Where: deploy-k3s/manifests/rbac.yaml (ServiceAccounts) and/or each */deployment.yaml pod spec.
  • Fix: Set automountServiceAccountToken: false on api, admin, worker, web, redis. Leave true only for vmagent (it uses the k8s API for service discovery). Note: 05-security.md claims this is already set — the audit (F11) says it is not. Treat the audit as ground truth; this fix makes the doc true.
  • Verify: kubectl -n honeydue get pod <api-pod> -o jsonpath='{.spec.automountServiceAccountToken}' → false; no token file in the container.

K3S-F13 — Add app.myhoneydue.com to CORS · LOW · ☐

  • Where: CORS_ALLOWED_ORIGINS in config.yaml → generated into honeydue-config ConfigMap by 03-deploy.sh.
  • Fix: Confirm whether the web app calls api.myhoneydue.com directly from the browser. If yes, add https://app.myhoneydue.com to CORS_ALLOWED_ORIGINS. If it proxies through Next.js server-side, CORS is moot — record that decision here instead.
  • Verify: browser fetch from app.myhoneydue.com to the API succeeds (or the proxy decision is documented).

K3S-F14 — Pin public images by digest · LOW · ☐

  • Where: redis/deployment.yaml (redis:7-alpine), observability/vmagent.yaml (victoriametrics/vmagent:v1.106.1).
  • Fix: Replace tags with @sha256: digests. Folded into the K3S-F5 CI work (Stage 5).
  • Verify: manifests contain no public-image tag without a digest.

LIVE-L5 / CODE-L3 — HSTS preload · LOW · ☐

  • Where: deploy-k3s/manifests/ingress/middleware.yaml security-headers HSTS value.
  • Fix: Change to max-age=63072000; includeSubDomains; preload. Confirm api/admin/app all work fully over HTTPS, then submit to hstspreload.org (the submission is the Stage 0 external half — once preloaded you cannot easily downgrade for ~6 months).
  • Verify: response header shows preload; domain accepted at hstspreload.org.

LIVE-L7 — Drop deprecated X-XSS-Protection · LOW · ☐

  • Where: deploy-k3s/manifests/ingress/middleware.yaml security-headers (browserXssFilter: true / customResponseHeaders).
  • Fix: Remove the header or set X-XSS-Protection: "0". Modern browsers ignore it, and the legacy XSS filter it controlled has itself been abused to introduce XSS.
  • Verify: header absent or 0 on all three hosts.

CODE-L4 — Set imagePullPolicy · LOW · ☐

  • Where: all deploy-k3s/manifests/*/deployment.yaml.
  • Fix: Set imagePullPolicy explicitly. Once images are digest-pinned (K3S-F5), IfNotPresent is correct and avoids needless re-pulls; until then Always avoids stale tags. Pick the policy that matches the K3S-F5 rollout state.
  • Verify: every container has an explicit imagePullPolicy.

Stage 4 — Application code & container images

Fixes in honeyDueAPI-go source (and the admin/web Dockerfiles). They reach production by rebuilding the image in 03-deploy.sh; schema-changing fixes (CODE-C1, CODE-C5/6, CODE-C11, CODE-C12) also need a goose migration, which the migrate Job runs automatically before the api/worker roll. Per repo rule: do not auto-commit — these are code changes; this section is the plan, not the patch.

Critical (C1–C13)

CODE-C1 — Plaintext auth tokens in DB · ☑ (2026-05-15)

  • Where: internal/models/user.go, internal/repositories/user_repo.go, internal/middleware/auth.go, internal/services/cache_service.go, internal/services/auth_service.go, migration 000003_hash_auth_tokens.sql.
  • Done: user_authtoken.key now stores models.HashToken() — the hex SHA-256 of the token — never the raw value. The raw token reaches the client once (the non-persisted AuthToken.Plaintext field) and is re-hashed on every request before the DB and Redis lookup, so the single indexed JOIN query in the auth middleware is preserved. A fast hash (not bcrypt) is correct here — tokens are 160-bit random values, nothing to brute-force. Migration 000003 widens the column 40→64 and clears existing rows.
  • Behaviour change: the server can no longer re-issue a stored token's plaintext, so every login mints a fresh token via CreateFreshToken (delete + create). With the existing one-token-per-user schema this means one active session per user — logging in on a new device invalidates the previous device's token. The migration also invalidates all sessions once, at deploy.
  • Verify: SELECT key FROM user_authtoken LIMIT 1 → 64-char hash; go build ./... and go test ./internal/{models,repositories,middleware,handlers}/... pass.
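For illustration, a hex SHA-256 helper of the kind described above could look like the following. The body is an assumption about how models.HashToken() is written, not a copy of it.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// HashToken returns the lowercase hex SHA-256 digest of a raw auth token.
// The result is always 64 characters, matching the widened column.
func HashToken(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

func main() {
	// The raw token is shown to the client once; only the digest is stored
	// and looked up on later requests.
	fmt.Println(len(HashToken("example-raw-token")))
}
```

Because the digest is deterministic, the same function serves both the write path and the per-request lookup.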

CODE-C2 / CODE-C3 — Google ID token not verified locally · ☑ (2026-05-15)

  • Where: internal/services/google_auth.go (full rewrite).
  • Done: VerifyIDToken no longer calls the deprecated tokeninfo URL (which leaked the token in the query string and made verification depend on a third party). It now parses the JWT, fetches Google's JWKS from googleapis.com/oauth2/v3/certs (Redis-cached 24h, re-fetched on a kid miss), verifies the RS256 signature locally, and asserts iss ∈ {accounts.google.com, https://accounts.google.com} (C3), aud/azp against the configured client IDs, and exp (validated by jwt v5). Mirrors the existing Apple JWKS verifier. GoogleSignIn is unchanged — the returned GoogleTokenInfo shape is preserved.
  • Verify: go build ./... clean; internal/services tests pass.

CODE-C5 / CODE-C6 — IAP receipt / purchase-token replay · ☐

  • Where: internal/services/subscription_service.go (ProcessApplePurchase, ProcessGooglePurchase).
  • Fix: Goose migration adding UNIQUE(provider, original_transaction_id). On purchase, if the transaction ID is already bound to a different user_id → 403.
  • Verify: re-submitting a valid receipt against a second account → 403; DB has no duplicate.

CODE-C7 — File-ownership check excludes residence owners · ☐

  • Where: internal/services/file_ownership_service.go:20-66.
  • Fix: Replace the three residence_residence_users-only JOINs with the canonical owner-OR-member UNION from residence_repo.HasAccess (owners live in residence_residence.owner_id).
  • Verify: a residence owner can delete a file in their own property; a non-member still gets 403.

CODE-C8 — Device-token cross-account hijack · ☐

  • Where: internal/services/notification_service.go:307-319 (APNS), :336-349 (GCM).
  • Fix: On re-register of an existing token, if existing.UserID != nil && *existing.UserID != userID → 409 Conflict. Only same-user updates allowed.
  • Verify: registering another user's known token → 409; that user's push traffic is unaffected.
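The ownership rule can be expressed as a small pure function. Device, CheckReRegister, and the error value are hypothetical names for this sketch; the handler would map the error to 409 Conflict.

```go
package main

import (
	"errors"
	"fmt"
)

// Device is a hypothetical stand-in for the APNS/GCM device models.
type Device struct {
	Token  string
	UserID *uint // nil when the token is not bound to any user
}

// ErrTokenOwnedByOtherUser should be translated to HTTP 409 by the handler.
var ErrTokenOwnedByOtherUser = errors.New("device token is registered to another user")

// CheckReRegister allows a registration when the token is new, unbound, or
// already owned by the same user, and rejects cross-account reassignment.
func CheckReRegister(existing *Device, userID uint) error {
	if existing != nil && existing.UserID != nil && *existing.UserID != userID {
		return ErrTokenOwnedByOtherUser
	}
	return nil
}

func main() {
	owner := uint(7)
	dev := &Device{Token: "example-token", UserID: &owner}
	fmt.Println(CheckReRegister(dev, 7)) // same user
	fmt.Println(CheckReRegister(dev, 8)) // different user
}
```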

CODE-C9 / CODE-H9 — Share-code join not atomic · ☐

  • Where: internal/services/residence_service.go:562-615 (:594-599 swallows the deactivate error).
  • Fix: Wrap JoinWithCode in one transaction with SELECT … FOR UPDATE on the share-code row; fail the join if deactivation fails (do not log-and-continue).
  • Verify: concurrent redemptions of a single-use code → exactly one succeeds; a forced deactivate error rolls the whole join back.

CODE-C10 — Subscription upgrade race · ☐

  • Where: internal/services/subscription_service.go:404-459; webhook handler :136-213.
  • Fix: Move Apple validation inside the row-locked transaction, or add an idempotency-key table so the validate→write window can't be raced.
  • Verify: two concurrent upgrades for one user → one tier change, not two.

CODE-C11 — Task-completion duplicate-row race · ☐

  • Where: internal/services/task_service.go:631-750.
  • Fix: SELECT … FOR UPDATE on the task in CreateCompletion; goose migration adding UNIQUE(task_id, completed_date).
  • Verify: double-tap "complete" → one completion row.

CODE-C12 — Soft-deleted email reusable · ☐

  • Where: internal/services/auth_service.go:274-324; internal/repositories/user_repo.go (FindByEmail, ExistsByEmail).
  • Fix: On delete, mangle the email (deleted_<id>_<email>); add is_active = true filtering consistently to FindByEmail/ExistsByEmail.
  • Verify: registering with a soft-deleted account's email is rejected; no cross-account takeover.

CODE-C13 — Apple webhook user lookup may LIKE-match · ☐

  • Where: internal/handlers/subscription_webhook_handler.go:354-366 (FindByAppleReceiptContains).
  • Fix: Confirm the SQL is an equality match, not LIKE. If LIKE, this is a confirmed Critical — change to equality and rename the function. See V8.
  • Verify: the query is parameterized equality; rename merged.

High (H1–H9)

CODE-H1 / CODE-H2 / CODE-H3 / CODE-M5 — Rate limiting gaps · ☐

  • Where: internal/router/router.go (:520 login limiter, :593 join-with-code unprotected), internal/middleware/rate_limit.go, internal/handlers/auth_handler.go.
  • Fix: Extend rate limiting to register, join-with-code, Apple/Google sign-in, and token refresh. Add a per-account login-attempt counter in Redis (lock after 5–10 fails for 15–60 min). This is the app half of the consolidated auth-rate-limit item; the edge half is K3S-F10.
  • Verify: rapid attempts on every auth route throttle; per-account lockout fires regardless of source IP.

CODE-H4 — Modulo bias in 6-digit codes · ☐

  • Where: internal/services/auth_service.go:884-892.
  • Fix: Replace int32 % 1000000 with rejection sampling on crypto/rand for a uniform 000000–999999.
  • Verify: distribution test over many samples is uniform.

CODE-H5 — Apple IAP .p8 file-mode unchecked · ☐

  • Where: internal/services/iap_validation.go:93-128, internal/config/config.go:325.
  • Fix: Prefer a base64 env-injected PEM. If a file path is kept, refuse to start when the file mode is more permissive than 0600.
  • Verify: boot fails on a 0644 key file; succeeds on 0600.

CODE-H6 — Webhook dedup fail-open · ☐

  • Where: internal/handlers/subscription_webhook_handler.go:165-173 (Apple), :564-574 (Google).
  • Fix: Fail closed — if webhookEventRepo.HasProcessed errors, return 500 so Apple/Google retry, rather than processing (which risks duplicate refunds).
  • Verify: simulated dedup-check DB error → 500, no double-processing.

CODE-H7 — Auth-failure log lacks IP/UA · ☐

  • Where: internal/handlers/auth_handler.go:70.
  • Fix: Add c.RealIP() + User-Agent to the structured failure log line (the audit log captures them; the request-line log does not). Depends on V10 (RealIP trust).
  • Verify: a failed login log line carries IP + UA.

CODE-H8 — X-Timezone header trusted for trial start · ☐

  • Where: internal/middleware/timezone.go:40-71 → internal/services/subscription_service.go:145-150.
  • Fix: Validate X-Timezone against IANA LoadLocation, cap to ±14h; use server UTC for trial-start / billing-window math regardless.
  • Verify: a bogus/extreme X-Timezone cannot shift trial start.

Medium (M1–M13)

CODE-M1 — Header injection via Content-Disposition filename · ☐

  • Where: internal/handlers/media_handler.go:74,117,165.
  • Fix: Sanitize doc.FileName — strip CR/LF/quote/null, or emit RFC 5987 filename*=UTF-8''….
  • Verify: an upload with CRLF in the filename does not split the response.

CODE-M2 — bcrypt cost 10 → 12 · ☐

  • Where: internal/models/user.go:47, internal/services/auth_service.go:479.
  • Fix: Make the cost config-driven, default 12.
  • Verify: new hashes are $2a$12$.

CODE-M3 — Apple Sign In nonce not validated · ☐

  • Where: internal/services/apple_auth.go.
  • Fix: Generate, store, and verify the nonce round-trip on Apple sign-in.
  • Verify: a replayed/mismatched nonce is rejected.

CODE-M4 — Email verification not atomic · ☐

  • Where: internal/services/auth_service.go:373-415.
  • Fix: Wrap verify in a transaction so a concurrent request can't double-apply.
  • Verify: concurrent verify calls → one state transition.

CODE-M6 / LIVE-L16 — Uncapped list / pagination · ☐

  • Where: ListDocuments, ListContractors, ListResidences handlers; pagination parsing.
  • Fix: Clamp limit server-side to ≤100 (< 1 → default 25). Notifications already caps at 200 — match the pattern.
  • Verify: ?limit=999999 returns ≤100 rows.
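The clamp itself is small enough to share across all list handlers. ClampLimit and the constant names are hypothetical; the values 25 and 100 come from the fix above.

```go
package main

import "fmt"

const (
	DefaultLimit = 25  // used when the client sends a limit below 1
	MaxLimit     = 100 // hard server-side ceiling
)

// ClampLimit normalizes a client-supplied page size.
func ClampLimit(n int) int {
	if n < 1 {
		return DefaultLimit
	}
	if n > MaxLimit {
		return MaxLimit
	}
	return n
}

func main() {
	fmt.Println(ClampLimit(999999), ClampLimit(0), ClampLimit(50))
}
```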

CODE-M7 — Audit log not append-only · ☐

  • Where: audit-log model / repository.
  • Fix: Make it append-only — a DB trigger forbidding UPDATE/DELETE, or move to an event store. Remove the soft-delete column.
  • Verify: an UPDATE/DELETE on the audit table is rejected.

CODE-M11 — golang.org/x/crypto outdated · ☐

  • Where: go.mod:30 (v0.49.0).
  • Fix: go get -u golang.org/x/crypto, re-run govulncheck, retest. Pairs with Stage 5 dependency automation.
  • Verify: govulncheck ./... clean.

CODE-M12 — Contractor toggle refetch race · ☐

  • Where: internal/services/contractor_service.go:279-307.
  • Fix: Do the toggle + read in one transaction so a concurrent soft-delete can't make it return nil.
  • Verify: concurrent toggle + delete → defined result, no nil panic.

CODE-M13 — Account-deletion endpoint unrate-limited · ☐

  • Where: internal/handlers/auth_handler.go:488-539.
  • Fix: Add a throttle to DELETE /account. First resolve V11: LIVE-L18 claims no delete endpoint exists; reconcile before deciding whether this is "rate-limit it" or "expose it."
  • Verify: repeated delete calls throttle.

CODE-M10 — node:20-alpine floating tag · ☐

  • Where: admin/web Dockerfile (:2,112,134).
  • Fix: Pin to a specific patch version or digest.
  • Verify: Dockerfile has no bare node:20-alpine.

Low / Info (CODE-L1, L2)

CODE-L1 — Inactive-account login enumeration · ☐

  • Where: internal/services/auth_service.go:76-77.
  • Fix: Return the same generic error for inactive accounts as for invalid credentials.
  • Verify: inactive vs. wrong-password responses are byte-identical.

CODE-L2 — Auth responses lack Cache-Control: no-store · ☐

  • Where: internal/handlers/auth_handler.go (Login / CurrentUser / Refresh).
  • Fix: Set Cache-Control: no-store on auth responses.
  • Verify: the header is present.

Live-scan code-level findings (LIVE-L1, L11–L20)

LIVE-L1 — /metrics publicly exposed · HIGH · ☐

  • Where: cmd/api/main.go route registration; vmagent scrapes it cluster-internally already.
  • Fix (recommended — Option B): bind Prometheus metrics to a separate cluster-internal port (e.g. :9090), expose only via a ClusterIP Service the vmagent NetworkPolicy allows; the public Ingress never registers /metrics. Update observability/vmagent.yaml scrape target. (Alternative: block /metrics at Traefik via an IngressRoute — Stage 3.)
  • Verify: curl https://api.myhoneydue.com/metrics → 404; vmagent still scrapes successfully.

LIVE-L11 — Login user-enumeration via timing · HIGH · ☐

  • Where: login handler / auth_service.go.
  • Fix: Always run a bcrypt compare against a fixed dummy hash when the user is not found, so the response time is constant.
  • Verify: real vs. fake email login timing delta < network noise.

LIVE-L12 — No rate-limit on login · HIGH · ☐

  • See the consolidated auth-rate-limit item: K3S-F10 (edge) + CODE-H1/H2/H3/M5 (app). Closed when both land.

LIVE-L13 — Password-reset timing enumeration · HIGH · ☐

  • Where: forgot-password handler.
  • Fix: Enqueue the reset email on the Asynq queue and return the generic response immediately, so real vs. fake emails have identical latency.
  • Verify: real vs. fake email reset timing delta < network noise.

LIVE-L14 / LIVE-L15 — Sequential integer IDs · MEDIUM · ⊘ (deferred)

  • Where: all user-facing IDs.
  • Decision: Real enumeration/intel leak, but migrating to UUID/ULID touches API, web, mobile, and webhook payloads. Deferred to a planned quarter — not a redeploy-stage fix. Track on the roadmap; revisit before the userbase size becomes commercially sensitive.

LIVE-L16 — Pagination limit uncapped · MEDIUM · ☐

  • Duplicate of CODE-M6 — closed with it.

LIVE-L17 — Garbage pagination params silently accepted · LOW · ☐

  • Where: query-param parsing in list handlers.
  • Fix: Return 400 naming the bad parameter instead of silently using defaults.
  • Verify: ?limit=abc → 400.
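Strict parsing could be sketched as below. ParseLimitParam is a hypothetical helper; it assumes an absent parameter still means "use the default", that only non-integer input is an error (which the handler would map to 400), and that out-of-range integers are clamped to the 25/100 bounds described under CODE-M6.

```go
package main

import (
	"fmt"
	"strconv"
)

// ParseLimitParam parses the raw "limit" query value. An empty value uses the
// default; a non-integer value is an error that names the parameter, so the
// handler can return 400 instead of silently falling back.
func ParseLimitParam(raw string) (int, error) {
	if raw == "" {
		return 25, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid query parameter limit=%q: must be an integer", raw)
	}
	if n < 1 {
		return 25, nil
	}
	if n > 100 {
		return 100, nil
	}
	return n, nil
}

func main() {
	for _, raw := range []string{"", "10", "abc"} {
		n, err := ParseLimitParam(raw)
		fmt.Println(n, err)
	}
}
```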

LIVE-L18 — No account-deletion endpoint (GDPR) · LOW · ☐

  • Where: internal/router/router.go, internal/handlers/auth_handler.go.
  • Fix: Reconcile with CODE-M13 first (V11). Provide DELETE /api/auth/me/ that anonymizes PII, cascades/transfers residences, revokes tokens, and writes an audit-trail row. Also closes the throwaway-account cleanup gap the live scan left behind.
  • Verify: an authenticated user can delete their own account; PII is anonymized.

LIVE-L19 — Email verification not enforced · LOW · ☐

  • Where: router middleware.
  • Fix: Add a RequireVerified() middleware on sensitive routes (share-code generation/redemption, anything that emails other users), or cap unverified accounts (1 residence, no share codes) until verified.
  • Verify: an unverified account is blocked from the chosen gated routes.

LIVE-L20 — Profile-update silently drops unknown fields · INFO · ☐

  • Where: PATCH /api/auth/profile/ handler.
  • Fix: Either accept the fields (if intended) or return 400 listing unsupported keys — don't silently 200.
  • Verify: an unknown field yields a clear response.

LIVE-L10 — x-powered-by — see Stage 0 (Next.js config).


Stage 5 — CI / build pipeline

Build-time controls. Where there is no CI pipeline file yet, the fix is to add one (or a 03-deploy.sh step) so the control runs on every build.

K3S-F5 / K3S-F14 / CODE-L4 — Pin images by digest · HIGH · ☐

  • Where: 03-deploy.sh (currently tags by git short SHA, lines 47/57-61, and also pushes :latest), all deploy-k3s/manifests/*/deployment.yaml.
  • Fix: After docker push, capture the digest (crane digest … or parse docker push output) and substitute @sha256:… into the manifests instead of IMAGE_PLACEHOLDER tags. Pin redis and vmagent by digest too. Reconsider pushing :latest — a mutable :latest undercuts digest pinning.
  • Verify: kubectl -n honeydue get deploy -o jsonpath shows every image as @sha256:.

K3S-F8 — Secrets as file mounts, not env vars · MEDIUM · ☑ · In-repo: Y

  • Where: api/worker deployment.yaml, internal/config/config.go, cmd/api/main.go, cmd/worker/main.go, 02-setup-secrets.sh.
  • Done (2026-05-16):
    • config.loadFileSecrets() reads each of the 9 secret keys (POSTGRES_PASSWORD, SECRET_KEY, EMAIL_HOST_PASSWORD, FCM_SERVER_KEY, REDIS_PASSWORD, B2_KEY_ID, B2_APP_KEY, OBS_INGEST_TOKEN, OBS_TRACES_URL) from /etc/honeydue/secrets/<KEY> and viper.Sets it (highest precedence). A missing file is a silent skip, so the same binary still works from env vars in local/dev.
    • api/worker deployment.yaml no longer inject any secret as an env: secretKeyRef. honeydue-secrets is mounted as a volume (defaultMode: 0400), read-only, at /etc/honeydue/secrets. Non-secret config still arrives via envFrom: configMapRef.
    • cmd/api/cmd/worker read the observability endpoints through the new config.SecretValue() (Viper-backed) instead of os.Getenv, so file-mounted OBS_* values resolve now that they are gone from the environment.
    • 02-setup-secrets.sh now also writes B2_KEY_ID/B2_APP_KEY into honeydue-secrets — reconciling the script-vs-manifest drift (the manifests referenced these keys but the script never created them).
  • Scoped exception: the one-shot honeydue-migrate Job still takes POSTGRES_PASSWORD as an env var. goose is invoked as a CLI with the password inside the DSN argument, so the value is exposed in that process regardless of env-vs-file; the Job is transient (one run, seconds, pod GC'd) so this is accepted.
  • Verify: kubectl -n honeydue exec deploy/api -- env shows no POSTGRES_PASSWORD/SECRET_KEY; kubectl -n honeydue exec deploy/api -- ls /etc/honeydue/secrets lists the key files.

CODE-L5 — Image signing + scanning · LOW · ◐ · In-repo: Y

  • Where: 03-deploy.sh, deploy-k3s/manifests/kyverno-verify-images.yaml.
  • Done (in-repo, 2026-05-16):
    • 03-deploy.sh runs cosign sign after each push and a trivy image --severity HIGH,CRITICAL scan before push — both guarded: they no-op when the tool is absent, so they never break a deploy on a host without them.
    • A ready-to-use Kyverno ClusterPolicy ships at deploy-k3s/manifests/kyverno-verify-images.yaml. It matches only the four gitea.treytartt.com/admin/honeydue-* images, starts in Audit mode, and is intentionally not applied by 03-deploy.sh — applying a verify-images policy with no key would block every Pod from scheduling.
  • Remaining (operator — cannot be committed):
    1. Install Kyverno in the cluster (admission controller).
    2. cosign generate-key-pair; set COSIGN_KEY in the deploy env so signing activates; paste cosign.pub into the policy's publicKeys block.
    3. kubectl apply -f deploy-k3s/manifests/kyverno-verify-images.yaml, confirm Pods still schedule, then flip validationFailureAction: Audit → Enforce.
  • Verify: an unsigned image is rejected by admission; 03-deploy.sh fails on a HIGH/CRITICAL CVE.

CODE-M11 (CI half) — Dependency hygiene · ☐

  • Fix: Add scheduled go get -u + govulncheck (the audit confirms govulncheck + gitleaks already run in CI — extend with a dependency-update cadence).
  • Verify: stale-dependency alerts surface automatically.

Stage 6 — Post-deploy verification & runtime investigations

04-verify.sh already runs a security block (secret encryption, NetworkPolicy count, ServiceAccounts, pod security contexts, PDBs, cloudflare-only middleware, admin-basic-auth). Extend it so each fix above stays fixed, and work the open investigations the audits could not resolve.

Extend 04-verify.sh with assertions for · ☐

  • Redis rejects unauthenticated PING (K3S-F1).
  • Admin ingress annotation contains admin-auth (K3S-F2).
  • /metrics returns 404 on the public host (LIVE-L1).
  • Every container (incl. vmagent) has a full securityContext (K3S-F7).
  • automountServiceAccountToken: false on app pods (K3S-F11).
  • Every workload image is digest-pinned (K3S-F5).
  • No DEBUG_FIXED_CODES key in the prod ConfigMap (CODE-C4).

Runtime investigations (cannot be closed by code review alone)

| ID | Item | Source | Action |
|----|------|--------|--------|
| V1 | Apple/Google Sign-In token validation depth | LIVE | Test with a self-signed Apple identity token; confirm signature/aud/nonce checks |
| V2 | Webhook signature verification — confirm webhook routes are outside the auth middleware in router.go (live scan saw 401s, signature middleware may never run) | LIVE | Code-review internal/router/router.go |
| V3 | File-upload security — locate upload paths, test polyglots / MIME bypass / path traversal in filename / oversized files | LIVE | Focused upload security test |
| V4 | Long-term token validity / revocation behaviour | LIVE | Test token expiry + revocation over time |
| V5 | Apple IAP receipt validation with a real sandbox StoreKit receipt | LIVE | Sandbox test |
| V6 | Share-code system — find the endpoint path; test brute-force, single-use, expiration | LIVE | Locate + test |
| V7 | Trial-expiration enforcement — age a test account past 14 days, confirm limitations_enabled flips and creation gates fire | LIVE | Aged-account test |
| V8 | FindByAppleReceiptContains — confirm equality, not LIKE. If LIKE, escalate CODE-C13 to confirmed Critical | CODE | SQL review |
| V9 | Rate-limiter storage — confirm rate_limit.go is Redis-backed (shared across 3 api replicas); in-memory = 3× the intended limit | CODE | Code review |
| V10 | X-Forwarded-For / Echo RealIP trust behind Traefik — without it per-IP limits collapse to the ingress IP | CODE | Code + Traefik config review |
| V11 | Account-deletion contradiction — LIVE-L18 (no endpoint) vs CODE-M13 (endpoint at auth_handler.go:488-539). Resolve before Stage 4 planning | LIVE/CODE | Route review |
| V12 | etcd encryption — 04-verify.sh only greps a string; truly confirm with k3s secrets-encrypt status on each server node | K3S | SSH check |
| V13 | user_authtoken index — confirm a user_id lookup index exists before hashing tokens at rest (CODE-C1) | CODE | Schema check |

Accepted risks / deferred (this cycle)

| ID | Item | Rationale |
|----|------|-----------|
| K3S-F15 | Public-IP nodes, no VPC | Re-provision-scale change; Hetzner firewall (K3S-CG3) is the compensating control. Roadmap. |
| K3S-F16 | Combined control-plane/worker nodes | Standard small-cluster k3s; revisit on workload growth. |
| LIVE-L14/L15 | Sequential integer IDs | UUID migration spans API + web + mobile + webhooks; planned quarter, not this cycle. |

Mirror these in docs/deployment/20-roadmap.md so they are not silently lost.


Documentation drift corrected alongside this plan

The audits contradicted the existing deployment book. These corrections ship with this plan so the docs match audited reality:

| Doc | Claimed | Reality (audit) | Action |
| --- | --- | --- | --- |
| 05-security.md | automountServiceAccountToken: false set | K3S-F11: not set on any workload | Corrected to "TODO" + linked here |
| 05-security.md | NetworkPolicies "not currently applied" (TODO) | Applied 2026-04-24; 03-deploy.sh:155 applies them | Corrected to "applied" |
| 05-security.md | CF↔origin is plaintext (SSL=Flexible) | Upgraded to Full (strict) 2026-04-24 | Corrected |
| 05-security.md | SHA tags immutable / "we'd notice a digest change" | K3S-F5: short SHA tags are mutable | Corrected; points to K3S-F5 |
| SECURITY.md (old) | Redis "requires a password" | K3S-F1: no auth | This rewrite |
| SECURITY.md (old) | etcd secrets-encryption: true | K3S-CG1: not verified / not on | This rewrite |
| SECURITY.md (old) | fail2ban active | 05-security.md + K3S-CG2: not installed | This rewrite |
| 20-roadmap.md | | Audit findings not represented | Audit items folded in |
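
The K3S-F5 row turns on the difference between a mutable tag and an immutable digest. A minimal sketch of a manifest check, assuming image references appear on plain `image:` lines in YAML files under a manifest directory; the function names are hypothetical, and the text match is a heuristic rather than a YAML parser.

```shell
#!/usr/bin/env bash
# Sketch: flag image references that are not pinned by digest (K3S-F5).

# Succeeds only for references ending in @sha256:<64 lowercase hex chars>.
is_digest_pinned() {
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}

# Prints every unpinned "image:" value found under the given directory.
# Returns non-zero if at least one unpinned reference is found.
find_unpinned_images() {
  local unpinned
  unpinned=$(grep -rhE '^[[:space:]-]*image:[[:space:]]' "$1" \
    | sed -E 's/^[[:space:]-]*image:[[:space:]]*//; s/[[:space:]]+#.*$//; s/[[:space:]]+$//' \
    | tr -d "\"'" \
    | grep -Ev '@sha256:[0-9a-f]{64}$' \
    | sed 's/^/unpinned: /')
  [ -z "$unpinned" ] && return 0
  printf '%s\n' "$unpinned"
  return 1
}
```

Running `find_unpinned_images deploy-k3s/manifests` before a deploy would list anything still referenced by tag only.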

Hardened-redeploy checklist (run order)

A clean rebuild of the whole stack, with every fix above applied:

□ Stage 0  DNS once-off:    DMARC, SPF, CAA at Cloudflare; security.txt route live
□ Stage 1  Provision:       hetzner-k3s config carries --write-kubeconfig-mode=600
                            and --secrets-encryption; run 01-provision-cluster.sh
□ Stage 1  Node OS:         fail2ban + unattended-upgrades + SSH/sysctl on each node
□ Stage 1  Verify cluster:  K3S-CG3..CG8 (firewall, snapshots, kubelet, perms)
□ Stage 2  Config:          config.yaml has redis.password + admin.basic_auth_*;
                            no DEBUG_FIXED_CODES; SECRET_KEY ≥32 chars
□ Stage 2  Secrets:         run 02-setup-secrets.sh — confirm redis + admin-basic-auth
□ Stage 3  Manifests:       admin ingress middlewares wired; imagePullSecret name
                            consistent; vmagent securityContext; COOP/CORP headers;
                            auth-rate-limit; automountServiceAccountToken:false;
                            HSTS preload; X-XSS-Protection dropped; imagePullPolicy set
□ Stage 4  Code+image:      all C/H/M/L code fixes committed; image rebuilt;
                            goose migrations for C1/C5/C6/C11/C12 present
□ Stage 5  CI:              images digest-pinned + signed + scanned; secrets file-mounted
□ Stage 6  Verify:          run 04-verify.sh (extended); work V1–V13
□ Post:    Submit myhoneydue.com to hstspreload.org

A redeploy is "clean" only when 04-verify.sh (extended per Stage 6) passes with zero failing lines and every checkbox in the master index is ☑ or ⊘.
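
Two of the Stage 2 conditions (SECRET_KEY length and the absence of DEBUG_FIXED_CODES) are easy to check mechanically before running 02-setup-secrets.sh. A minimal sketch, assuming the key lives in secrets/secret_key.txt as in the incident playbook and that config.yaml is in the working directory; both paths and all function names are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: Stage 2 pre-flight checks. Paths are assumptions, see lead-in.

# Succeeds if the key stored in the given file is at least 32 characters.
secret_key_long_enough() {
  local key
  [ -r "$1" ] || return 1
  key=$(cat "$1") || return 1      # $(...) strips the trailing newline
  [ "${#key}" -ge 32 ]
}

# Succeeds if the config file is readable and never mentions
# DEBUG_FIXED_CODES. Deliberately strict: a commented-out line also fails.
no_debug_fixed_codes() {
  [ -r "$1" ] || return 1
  ! grep -q 'DEBUG_FIXED_CODES' "$1"
}

# Runs both checks, reports each failure, returns non-zero if any failed.
preflight() {
  local key_file="${1:-secrets/secret_key.txt}" config="${2:-config.yaml}" rc=0
  secret_key_long_enough "$key_file" \
    || { echo "FAIL SECRET_KEY missing or shorter than 32 chars"; rc=1; }
  no_debug_fixed_codes "$config" \
    || { echo "FAIL DEBUG_FIXED_CODES present in (or cannot read) $config"; rc=1; }
  return "$rc"
}
```

`preflight && ./scripts/02-setup-secrets.sh` would then refuse to push secrets built from a weak key or a debug-enabled config.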


Appendix — Incident response playbooks

Preserved from the previous SECURITY.md; still current.

Compromised API token

Rotate SECRET_KEY to invalidate all tokens, then restart api/worker:

openssl rand -hex 32 > secrets/secret_key.txt
./scripts/02-setup-secrets.sh
kubectl rollout restart deployment/api deployment/worker -n honeydue

(After CODE-C1 lands, tokens are hashed at rest — a DB read no longer yields usable tokens, but SECRET_KEY rotation remains the kill-switch.)
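
To make the "hashed at rest" point concrete: the database holds only a digest, and a presented token is accepted only if hashing it reproduces that digest. The sketch below is an illustration, assuming an unsalted, hex-encoded SHA-256 digest; the actual storage format in the API may differ, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Illustration only: what "hashed at rest" means for a bearer token.
# Assumption: the stored value is an unsalted, hex-encoded SHA-256 digest.

# Prints the hex SHA-256 digest of the token passed as the first argument.
token_digest() {
  printf '%s' "$1" | sha256sum | cut -d' ' -f1
}

# Succeeds only if hashing the presented token ($1) reproduces the stored
# digest ($2). Presenting the digest itself as a token does not match.
token_matches() {
  [ "$(token_digest "$1")" = "$2" ]
}
```

Someone who reads the table sees only digests, and `token_matches "<digest>" "<digest>"` fails, so a leaked row cannot be replayed as a credential.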

Compromised database credentials

Rotate in the Neon dashboard, update secrets/postgres_password.txt, re-run 02-setup-secrets.sh, restart api/worker, watch logs for connection errors.

Compromised push keys

APNs: revoke in Apple Developer, drop the new .p8 into secrets/, re-run 02-setup-secrets.sh, restart api/worker. FCM: rotate the key in Firebase, update secrets/fcm_server_key.txt, re-run, restart.

Suspicious pod

kubectl logs <pod> -n honeydue > /tmp/pod-logs.txt
kubectl describe pod <pod> -n honeydue > /tmp/pod-describe.txt
kubectl delete pod <pod> -n honeydue   # deployment recreates it
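
The commands above keep the current logs and the describe output. A slightly fuller capture is sketched below, on the assumption that logs from a previous container instance and the full pod spec are also worth preserving before deletion. The function name, the output directory, and the KUBECTL override are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: capture evidence from a suspicious pod before deleting it.
# KUBECTL can be overridden (e.g. to point at a wrapper); defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# Usage: capture_pod_evidence <pod> [namespace] [output-dir]
# Writes logs, previous-instance logs, describe output and the pod YAML
# into the output directory, then prints that directory's path.
capture_pod_evidence() {
  local pod="$1" ns="${2:-honeydue}" out="${3:-/tmp/pod-evidence-$1}"
  mkdir -p "$out" || return 1
  "$KUBECTL" logs "$pod" -n "$ns" --all-containers > "$out/logs.txt" || return 1
  # Previous-instance logs exist only if a container has restarted,
  # so a failure here is tolerated.
  "$KUBECTL" logs "$pod" -n "$ns" --all-containers --previous \
    > "$out/logs-previous.txt" 2>/dev/null || true
  "$KUBECTL" describe pod "$pod" -n "$ns" > "$out/describe.txt" || return 1
  "$KUBECTL" get pod "$pod" -n "$ns" -o yaml > "$out/pod.yaml" || return 1
  echo "$out"
}
```

Deletion is left as a separate manual step (`kubectl delete pod <pod> -n honeydue`) so that nothing is removed until the capture has been checked.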

Communication

Document the timeline privately; on a data breach notify affected users within 72 hours; rotate every potentially-exposed credential; write a post-mortem (root cause, timeline, remediation, prevention).


References

  • Audit reports: live_scan_5_12.md, k3_audit_5_12.md, security_scan_5_12.md (repo root)
  • Current architecture: docs/deployment/05-security.md
  • Roadmap: docs/deployment/20-roadmap.md
  • Deploy process: docs/deployment/14-deployment-process.md
  • Scripts: deploy-k3s/scripts/{01-provision-cluster,02-setup-secrets,03-deploy,04-verify}.sh
  • Manifests: deploy-k3s/manifests/