Remediation of the 2026-05-12/13 audits (78 findings + cluster gaps), tracked in deploy-k3s/SECURITY.md, plus fixes from two independent post-remediation reviews.

Auth & sessions:
- SHA-256 hashed auth-token storage (C1); prior-token cache eviction on re-login (MEDIUM-1)
- local Google JWKS verification, iss/aud/exp checks (C2/C3)
- constant-time login + generic errors (L1/LIVE-L11/LIVE-L13)
- per-account login lockout keyed on distinct source IPs (M5/MEDIUM-3)
- verified-email gating, login rate limiting (LIVE-L19, H1-H3)

IAP & webhooks:
- Apple/Google cross-account replay protection (C5/C6/C10/C13, H5/H6)
- migrations 000003-000006 (token hashing, IAP replay, audit_log + webhook_event_log table creation, append-only audit log)

Authorization & races:
- file-ownership owner-OR-member fix (C7), atomic share-code join (C9/H9), device-token reassignment (C8/LOW-3)

Secrets & deploy:
- secrets file-mounted at /etc/honeydue/secrets, not env (F8); Redis password out of the ConfigMap (HIGH-1); B2 keys reconciled
- digest-pinned images, admin ingress hardening, CSP/HSTS, /metrics lockdown; kubeconfig 0600, etcd secrets-encryption, fail2ban + unattended-upgrades at provision; secret-rotation runbook

Build, vet, and the full test suite (incl. -race) pass; the goose migration chain is verified against PostgreSQL 16.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
honeyDue — Production Security Remediation Plan
This document is the single source of truth for fixing every security finding from the 2026-05-12/13 audits, and for keeping those fixes baked into the stack so a full redeploy never reproduces them.
It replaces the previous aspirational SECURITY.md (which described a
desired state that, per the audits, was never fully true). The accurate
current architecture lives in docs/deployment/05-security.md; this file
is the work list.
Last updated: 2026-05-16

Audit sources (kept at repo root):

| Tag | File | Scope | Findings |
|---|---|---|---|
| `LIVE` | `live_scan_5_12.md` | External black-box scan of api/admin/app | L1–L20 (20) |
| `K3S` | `k3_audit_5_12.md` | k3s cluster + `honeydue` namespace audit | F1–F17 (17) + 8 coverage gaps |
| `CODE` | `security_scan_5_12.md` | Static audit of `honeyDueAPI-go` | C1–C13, H1–H9, M1–M13, L1–L6 (41) |
Total: 78 findings + 8 cluster coverage gaps + 13 runtime verification items.
How to use this document
The plan is organised by redeploy stage, not by severity, because the operator's goal is: redeploy the entire stack and come up clean. Each finding is tagged with where its fix lives:
| Marker | Meaning |
|---|---|
| In-repo: Y | Fix lives in a committed file (config.yaml, a manifest, a script, Go code, a Dockerfile). Once committed, every redeploy re-applies it automatically. |
| In-repo: N | Fix is external state (DNS records, Cloudflare dashboard, Hetzner firewall, hstspreload.org). A redeploy does not touch it — it survives on its own but must be done once and tracked here. |
Status legend: ☐ open · ◐ in progress · ☑ done · ⊘ accepted risk / deferred
Redeploy stage order (matches deploy-k3s/scripts/ run order):
Stage 0 DNS & Cloudflare edge (external; no cluster needed)
Stage 1 Cluster provisioning & node OS (01-provision-cluster.sh / hetzner-k3s / SSH)
Stage 2 Secrets & config bootstrap (02-setup-secrets.sh / config.yaml)
Stage 3 Kubernetes manifests (deploy-k3s/manifests/, applied by 03-deploy.sh)
Stage 4 Application code & images (honeyDueAPI-go source → rebuilt image)
Stage 5 CI / build pipeline (image digest pinning, signing, scanning)
Stage 6 Post-deploy verification (04-verify.sh + runtime investigations)
Golden rule for "redeploy clean": a fix only counts as done when it is
committed to the file that the redeploy reads. A kubectl patch on the live
cluster that is not mirrored into deploy-k3s/manifests/ will be wiped on
the next 03-deploy.sh. Every entry below names the committed file.
Execution status (2026-05-16)
Stages 2–5 were executed in-repo, then put through an independent code
review (see Post-remediation independent review below). The Go module
builds clean and the full go test ./... suite passes. Four new goose
migrations were added — 000003 (auth-token hashing), 000004 (IAP replay
protection), 000005 (audit-log append-only + audit_log table create),
000006 (webhook_event_log table create) — and run automatically via the
migrate Job before the api/worker rollout.
- ~63 findings fixed (☑) and verified: all of Stage 2 (secrets/config) and Stage 3 (Kubernetes manifests), every exploitable Stage 4 application finding (all 11 actioned Criticals + the auth / webhook / race / handler High & Medium fixes), Stage-5 image digest pinning and `K3S-F8` (secrets are now file-mounted, not env vars), plus the in-repo half of Stage 1 cluster provisioning: `K3S-F4` (kubeconfig written `0600`), `K3S-CG1` (etcd `secrets-encryption`), `K3S-CG2` (fail2ban + unattended-upgrades installed at provision). Includes token hashing, Google JWKS verification, IAP replay protection, the authorization fixes, atomic share-code join, the metrics-endpoint lockdown, per-account login lockout, verified-email gating, CSP/HSTS hardening, and digest-pinned images.
- 1 partial (◐): `CODE-L5`. cosign signing + a Trivy `HIGH,CRITICAL` scan are wired (guarded) into `03-deploy.sh`, and a ready-to-use Kyverno `ClusterPolicy` ships at `deploy-k3s/manifests/kyverno-verify-images.yaml`. Closing it needs two operator actions that cannot be committed: install Kyverno in the cluster, and supply a cosign key pair (`COSIGN_KEY` for signing + the public key pasted into the policy).
- Accepted / blocked / moot (⊘): `M3` (Apple nonce, blocked on an iOS-client change), `C12` (moot: accounts are hard-deleted), `LIVE-L14/L15` (UUID migration, planned quarter), `LIVE-L17/L18/L20` (no security impact, see entries), `F15/F16` (architectural), and `LIVE-L2/L3/L4` (DMARC / SPF / CAA, operator-declined, below).
- Operator-declined: Stage 0 DNS (`LIVE-L2/L3/L4`). The operator has opted not to add the DMARC, SPF-hardening, and CAA DNS records this cycle. For the record: these are not a paid-Cloudflare feature. DMARC and SPF are ordinary TXT records and CAA is an ordinary CAA record, all addable on any Cloudflare plan including Free. They remain genuine email-spoofing / certificate-issuance gaps and are marked ⊘; revisit when DNS is next touched.
- Remaining operator runtime steps (no code to commit), on the existing cluster: `k3s secrets-encrypt` enable/reencrypt (`K3S-CG1`/`V12`) and `chmod 600` the live kubeconfig (`K3S-F4`); the SSH/sysctl half of `K3S-CG2`; and the `K3S-CG3`–`CG8` verification items. A full fresh provision already comes up with `K3S-F4`/`CG1`/`CG2` (fail2ban + unattended-upgrades) applied straight from `_config.sh`.
Operator note: C1 (token hashing) invalidates every existing login
session once at deploy and makes login single-session per user — see the
CODE-C1 entry. The status boxes in the master index below are authoritative.
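As a rough illustration of the C1 idea (store only a SHA-256 digest of each auth token, never the token itself), the sketch below issues a random token, derives the digest that would be persisted, and compares in constant time. The function names are hypothetical and not taken from the honeyDue code; the real storage layer and token format may differ.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// newToken returns a random 32-byte token, hex-encoded. This is the value
// handed to the client; in this sketch it is never stored server-side.
func newToken() (string, error) {
	b := make([]byte, 32)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// hashToken returns the hex SHA-256 digest that would be stored in place of
// the raw token (hypothetical helper name).
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// tokenMatches hashes the presented token and compares it with the stored
// digest in constant time.
func tokenMatches(presented, storedHash string) bool {
	got := hashToken(presented)
	return subtle.ConstantTimeCompare([]byte(got), []byte(storedHash)) == 1
}

func main() {
	tok, err := newToken()
	if err != nil {
		panic(err)
	}
	stored := hashToken(tok)
	fmt.Println(len(stored), tokenMatches(tok, stored), tokenMatches("wrong", stored))
	// prints: 64 true false
}
```

A database leak under this scheme exposes only digests, which cannot be replayed as bearer tokens.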
Post-remediation independent review (2026-05-16)
The change set went through two independent review passes; the deploy-time
verification below (build, go test -race, full goose up against real
PostgreSQL 16) was executed and passed.
First pass. A separate review agent audited the full change set against the
three audit files. It surfaced three deploy-breaking defects that a green
go test could not catch — the test harness builds two tables via GORM
AutoMigrate, which production never runs — all since fixed:
1. `audit_log` table was never created by a migration. `000005` added append-only triggers to a table that exists only in the test DB, so a from-scratch `goose up` would fail on `000005`. `000005` now does `CREATE TABLE IF NOT EXISTS audit_log` before the triggers.
2. `webhook_event_log` table was never created by a migration. The H6 fail-closed webhook dedup turns a missing table into a 500 on every subscription webhook. New migration `000006` creates it.
3. `000004`'s `google_purchase_token` unique index could fail to build on a production table already holding duplicate tokens, which is exactly the C6 replay the migration fixes. `000004` now de-duplicates (keep-earliest, NULL-the-rest) before creating the index.
It also tightened the C13 Apple-webhook lookup (subscription_webhook_handler.go)
so the legacy substring scan runs only on a genuine ErrRecordNotFound,
never masking a real DB error as "not found".
Second pass (master review). A second, independent security-audit agent
re-verified all four first-pass fixes (correct), ran go test -race (0 data
races) and the full goose up/down chain against real PostgreSQL (clean,
idempotent), and returned GO with one HIGH finding, since fixed:
- HIGH-1: Redis password leaked via the `honeydue-config` ConfigMap. `_config.sh` built `REDIS_URL` with the password embedded inline, and that URL is emitted into the `honeydue-config` ConfigMap (delivered to pods via `envFrom`). ConfigMaps are not covered by `secrets-encryption` and are readable by any principal with `get configmap`, so `K3S-F1`/`K3S-F8` were not actually fully closed. Fixed (2026-05-16): `_config.sh` now emits `REDIS_URL=redis://redis:6379/0` with no credentials; the password travels only as the file-mounted `REDIS_PASSWORD` secret. The API applies it in `cache_service.go`; `cmd/worker/main.go` now applies it onto the parsed Asynq `RedisClientOpt` so the server/inspector/monitoring client all authenticate against the `requirepass` Redis.
The master review's other seven findings (4 Medium, 3 Low — none deploy-blocking) were then all fixed (2026-05-16):
- MEDIUM-1: re-login left the prior token usable for ≤5 min. `CreateFreshToken` deleted the old token row but not its Redis cache entry. It now also returns the deleted tokens' hashes; `AuthService.freshToken` evicts them via the new `CacheService.InvalidateAuthTokenHashes` on every login / Apple / Google sign-in, so a prior (e.g. stolen) token stops authenticating immediately.
- MEDIUM-2: IAP `.p8` mode check incompatible with k8s. The Apple IAP key check (`iap_validation.go`) required `0600`-or-stricter, unattainable on a k8s Secret volume (`0440` under `fsGroup`). It now rejects only world-accessible keys (`perm & 0o007`).
- MEDIUM-3: single-IP account-lockout DoS. The `M5` per-account lockout is now keyed on the set of distinct source IPs that have failed (`RegisterLoginFailure` takes the IP, tracks a Redis set; lock at 5 distinct IPs). One attacker IP can no longer lock a victim out by spamming failures; genuinely distributed stuffing still trips it. `Login` now takes the client IP (`c.RealIP()`).
- MEDIUM-4: Redis no-auth deployable. `02-setup-secrets.sh` now `die`s (was `warn`) when `redis.password` is empty, so a deploy can no longer bring up an unauthenticated Redis (`K3S-F1`).
- LOW-1 / LOW-2: missing regression tests. Added: `config_test.go` asserts `validate()` refuses `DEBUG_FIXED_CODES` with `DEBUG=false` (C4); `subscription_repo_test.go` asserts a second account cannot bind an Apple transaction / Google purchase token already bound to another (C5/C6).
- LOW-3: device-token 409. A recycled APNs/FCM token re-registering under a new account is now reassigned to that account (and logged) instead of returning a 409 that locked the legitimate new device owner out of push.
One earlier (first-pass) hardening item remains a tracked follow-up, not
re-raised by the master review and not deploy-blocking: /metrics is gated
by an X-Forwarded-For check rather than network-isolated. True isolation
needs /metrics on a separate port plus a NetworkPolicy restricting the
scrape to vmagent — an architectural change deferred to a later cycle.
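For context, a header-based gate of the kind described for `/metrics` might look like the sketch below. It uses plain `net/http` for self-containment (the actual API appears to use a framework exposing `c.RealIP()`), and the middleware name is hypothetical. It also shows why this is weaker than network isolation: it relies on every external path through the ingress setting `X-Forwarded-For`.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// rejectProxied returns 403 for any request carrying an X-Forwarded-For
// header, on the assumption that only traffic that passed through the
// ingress has one and that in-cluster scrapes do not.
func rejectProxied(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("X-Forwarded-For") != "" {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor runs one request through the gate and returns the status code.
// An empty xff simulates a direct in-cluster scrape.
func statusFor(xff string) int {
	h := rejectProxied(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "# metrics placeholder")
	}))
	req := httptest.NewRequest(http.MethodGet, "/metrics", nil)
	if xff != "" {
		req.Header.Set("X-Forwarded-For", xff)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(""), statusFor("203.0.113.7")) // prints: 200 403
}
```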
Consolidated work items (fix once, closes many)
Several findings are the same defect seen from three angles. Do the work once at the listed anchor; the rest close with it.
| Theme | Anchor | Also closes |
|---|---|---|
| Auth-endpoint rate limiting | Stage 3 `auth-rate-limit` middleware + Stage 4 app limiter | K3S-F10, LIVE-L12, CODE-H1, CODE-H2, CODE-H3, CODE-M5 |
| CSP / cross-origin headers | Stage 3 `security-headers` + Stage 4 app CSP | K3S-F9, LIVE-L8 |
| HSTS `preload` | Stage 3 middleware + Stage 0 list submission | LIVE-L5, CODE-L3 |
| Admin ingress hardening | Stage 2 secret + Stage 3 middleware wiring | K3S-F2, K3S-F3, CODE-L6 |
| etcd encryption at rest | Stage 1 `--secrets-encryption` | K3S-CG1, CODE-M9 |
| Image digest pinning + signing | Stage 5 CI | K3S-F5, K3S-F14, CODE-L4, CODE-L5 |
| Pagination hard caps | Stage 4 app | LIVE-L16, CODE-M6 |
| imagePullSecret name consistency | Stage 3 manifests + Stage 2 script | K3S-F6 |
Known contradiction to resolve before planning Stage 4: LIVE-L18 says
no account-deletion endpoint exists (every DELETE path 404/400), but
CODE-M13 points at a delete handler at auth_handler.go:488-539. Either
the endpoint exists at a path the external scan never probed, or it is
mounted but unreachable. Confirm the route in internal/router/router.go
first — the fix differs (add an endpoint vs. expose/rate-limit an existing
one). Tracked as verification item V11.
Resolved (2026-05-15): the endpoint exists at `DELETE /api/auth/account/` (`router.go:546`), a path the live scan never probed; see the Stage 4 handler/misc batch status.
Master finding index
Every finding, ordered by redeploy stage. Use this as the live tracker — flip the Status box as work lands.
Stage 0 — DNS & Cloudflare edge
| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `LIVE-L2` | HIGH | No DMARC record: email spoofing open | N | ⊘ |
| `LIVE-L3` | MED | SPF ends `?all` (neutral, fails open) | N | ⊘ |
| `LIVE-L4` | MED | No CAA records: any CA may issue certs | N | ⊘ |
| `LIVE-L6` | LOW | No `/.well-known/security.txt` | Y | ☐ |
| `LIVE-L9` | INFO | Aggressive Cloudflare caching on admin SSR shell | N | ☐ |
| `LIVE-L10` | INFO | `x-powered-by: Next.js` framework leak | Y | ☐ |
Stage 1 — Cluster provisioning & node OS
| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `K3S-F4` | HIGH | Node kubeconfig world-readable (mode 644) | Y | ☑ |
| `K3S-F15` | INFO | Nodes on public IPs, no private VPC | Y | ⊘ |
| `K3S-F16` | INFO | All 3 nodes are control-plane + etcd + worker | Y | ⊘ |
| `K3S-F17` | INFO | Single-replica SPOFs (redis/worker/admin/vmagent) | Y | ☐ |
| `K3S-CG1` | - | etcd encryption at rest not verified (`--secrets-encryption`) | Y | ☑ |
| `K3S-CG2` | - | Node OS hardening: SSH, fail2ban, unattended-upgrades, sysctl | Y/N | ◐ |
| `K3S-CG3` | - | Hetzner Cloud Firewall rules not verified | N | ☐ |
| `K3S-CG4` | - | etcd snapshot backup destination/encryption not verified | Y | ☐ |
| `K3S-CG5` | - | kubelet flags (`--anonymous-auth=false`, webhook authz) not verified | Y | ☐ |
| `K3S-CG6` | - | Container-runtime CIS controls (`kube-bench`) not run | N | ☐ |
| `K3S-CG7` | - | `deploy` user sudoers least-privilege not verified | N | ☐ |
| `K3S-CG8` | - | `/etc/rancher/k3s/` dir + server-token perms not verified | N | ☐ |
Stage 2 — Secrets & config bootstrap
| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `K3S-F1` | CRIT | Redis runs with no authentication | Y | ☑ |
| `K3S-F3` | HIGH | `admin-basic-auth` secret never created | Y | ☑ |
| `K3S-F12` | MED | Secrets unrotated since cluster bootstrap; no runbook | Y | ☑ |
| `CODE-C4` | CRIT | `DEBUG_FIXED_CODES` "123456" auth bypass if it reaches prod | Y | ☑ |
| `CODE-M8` | MED | `SECRET_KEY` hardcoded debug fallback | Y | ☑ |

Stage 2 status (2026-05-15): `config.yaml` now carries a Redis password and admin basic-auth user/password; `02-setup-secrets.sh` uses bcrypt (`htpasswd -nbB`); `internal/config/config.go` generates an ephemeral random `SECRET_KEY` in debug instead of a static fallback and refuses to boot if `DEBUG_FIXED_CODES` is set with `DEBUG=false`; the rotation runbook is at `docs/runbooks/secret-rotation.md`. All take effect on the next `02-setup-secrets.sh` + `03-deploy.sh`.
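The two config guards in the Stage 2 status (refuse `DEBUG_FIXED_CODES` outside debug, and generate an ephemeral `SECRET_KEY` in debug rather than a static fallback) could be expressed roughly as follows. The struct and field names are assumptions, and the "missing key is fatal outside debug" branch is an assumption of this sketch rather than something the text states.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
)

// Config holds only the fields relevant to this sketch (hypothetical shape).
type Config struct {
	Debug           bool
	DebugFixedCodes bool
	SecretKey       string
}

// validate refuses fixed debug codes outside debug mode and fills in a random
// per-process SecretKey when running in debug without one.
func (c *Config) validate() error {
	if c.DebugFixedCodes && !c.Debug {
		return errors.New("DEBUG_FIXED_CODES must not be set when DEBUG=false")
	}
	if c.SecretKey == "" {
		if !c.Debug {
			// Assumption: a missing key is fatal outside debug.
			return errors.New("SECRET_KEY is required when DEBUG=false")
		}
		b := make([]byte, 32)
		if _, err := rand.Read(b); err != nil {
			return fmt.Errorf("generate ephemeral SECRET_KEY: %w", err)
		}
		c.SecretKey = hex.EncodeToString(b)
	}
	return nil
}

func main() {
	prod := Config{Debug: false, DebugFixedCodes: true, SecretKey: "set"}
	fmt.Println(prod.validate()) // non-nil: refuses to boot

	dev := Config{Debug: true}
	fmt.Println(dev.validate(), len(dev.SecretKey)) // <nil> 64
}
```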
Stage 3 — Kubernetes manifests
| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `K3S-F2` | HIGH | Admin ingress missing `cloudflare-only` + `admin-auth` | Y | ☑ |
| `K3S-F6` | HIGH | `imagePullSecrets` name mismatch (`ghcr-credentials`) | Y | ☑ |
| `K3S-F7` | MED | `vmagent` container missing `securityContext` | Y | ☑ |
| `K3S-F9` | MED | `security-headers` missing COOP/COEP/CORP | Y | ☑ |
| `K3S-F10` | MED | Uniform rate limit: no auth-endpoint tightening | Y | ☑ |
| `K3S-F11` | MED | `automountServiceAccountToken` not disabled | Y | ☑ |
| `K3S-F13` | LOW | `CORS_ALLOWED_ORIGINS` missing `app.myhoneydue.com` | Y | ☑ |
| `K3S-F14` | LOW | Public images (`redis`, `vmagent`) pinned by tag | Y | ☑ |
| `LIVE-L5` | LOW | HSTS not preload-eligible | Y | ☑ |
| `LIVE-L7` | LOW | Deprecated `X-XSS-Protection` header | Y | ☑ |
| `LIVE-L8` | LOW | CSP missing `object-src`/`base-uri`; COOP/COEP/CORP absent | Y | ☑ |
| `CODE-L3` | LOW | HSTS missing `preload` (duplicate of LIVE-L5) | Y | ☑ |
| `CODE-L4` | LOW | `imagePullPolicy` not set on Deployments | Y | ☑ |
| `CODE-L6` | LOW | Admin `admin-auth` middleware defined, not attached | Y | ☑ |

Stage 3 status (2026-05-15): admin ingress now chains `cloudflare-only` + `admin-auth` + `security-headers` + `rate-limit`; a dedicated `honeydue-api-auth` Ingress applies a new `auth-rate-limit` middleware (5/min, burst 10) to login / register / forgot-password / reset-password / join-with-code; `security-headers` gained COOP + CORP, HSTS is now `max-age=63072000; …; preload`, and the deprecated `X-XSS-Protection` (`browserXssFilter`) is removed; `vmagent` has a container `securityContext`; all workload pods + the migrate Job set `automountServiceAccountToken: false` explicitly (on top of the rbac.yaml ServiceAccount-level setting that already existed); the registry secret is `gitea-credentials` everywhere; `imagePullPolicy: IfNotPresent` is explicit on every container; CORS includes `app.myhoneydue.com`. Still open: `K3S-F14` (public-image digest pins) is folded into Stage 5 with `K3S-F5`; `LIVE-L8` is partial: the COOP/CORP half shipped here, the CSP `object-src`/`base-uri` half is an app change tracked in Stage 4.
Stage 4 — Application code & container images
| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `CODE-C1` | CRIT | Auth tokens stored plaintext in DB | Y | ☑ |
| `CODE-C2` | CRIT | Google ID token not verified locally | Y | ☑ |
| `CODE-C3` | CRIT | Google `iss` claim never validated | Y | ☑ |
| `CODE-C5` | CRIT | Apple IAP receipt replay across accounts | Y | ☑ |
| `CODE-C6` | CRIT | Google purchase-token replay across accounts | Y | ☑ |
| `CODE-C7` | CRIT | File-ownership check excludes residence owners | Y | ☑ |
| `CODE-C8` | CRIT | Device-token cross-account hijack on re-register | Y | ☑ |
| `CODE-C9` | CRIT | Share-code join not atomic (Add+Deactivate race) | Y | ☑ |
| `CODE-C10` | CRIT | Subscription upgrade race: validation outside txn | Y | ☑ |
| `CODE-C11` | CRIT | Task-completion duplicate-row race | Y | ☑ |
| `CODE-C12` | CRIT | Soft-deleted email reusable; `is_active` not filtered | Y | ⊘ |
| `CODE-C13` | CRIT | Apple webhook user lookup may LIKE-match | Y | ☑ |
| `CODE-H1` | HIGH | Rate limit doesn't cover all auth surfaces | Y | ☑ |
| `CODE-H2` | HIGH | No rate limit on `join-with-code` | Y | ☑ |
| `CODE-H3` | HIGH | No rate limit on `register` | Y | ☑ |
| `CODE-H4` | HIGH | Modulo bias in 6-digit code generation | Y | ☑ |
| `CODE-H5` | HIGH | Apple IAP `.p8` loaded with no file-mode check | Y | ☑ |
| `CODE-H6` | HIGH | Webhook dedup fail-open | Y | ☑ |
| `CODE-H7` | HIGH | Auth-failure log lacks IP/User-Agent | Y | ☑ |
| `CODE-H8` | HIGH | `X-Timezone` header trusted for trial-start calc | Y | ☑ |
| `CODE-H9` | HIGH | Share-code `Deactivate` error swallowed | Y | ☑ |
| `CODE-M1` | MED | HTTP header injection via `Content-Disposition` filename | Y | ☑ |
| `CODE-M2` | MED | bcrypt cost = 10 (recommend 12) | Y | ☑ |
| `CODE-M3` | MED | Apple Sign In nonce not validated | Y | ⊘ |
| `CODE-M4` | MED | Email verification not atomic | Y | ☑ |
| `CODE-M5` | MED | Per-user rate limiting absent | Y | ☑ |
| `CODE-M6` | MED | List endpoints uncapped (Documents/Contractors/Residences) | Y | ☑ |
| `CODE-M7` | MED | Audit log not append-only | Y | ☑ |
| `CODE-M11` | MED | `golang.org/x/crypto` v0.49.0 outdated | Y | ☑ |
| `CODE-M12` | MED | Contractor toggle refetch race | Y | ☑ |
| `CODE-M13` | MED | Account-deletion endpoint unrate-limited | Y | ☑ |
| `CODE-M10` | MED | `node:20-alpine` floating tag in Dockerfile | Y | ☑ |
| `CODE-L1` | LOW | Login inactive-account error enables enumeration | Y | ☑ |
| `CODE-L2` | LOW | Auth responses lack `Cache-Control: no-store` | Y | ☑ |
| `LIVE-L1` | HIGH | `/metrics` publicly exposed on `api.myhoneydue.com` | Y | ☑ |
| `LIVE-L11` | HIGH | Login user-enumeration via timing | Y | ☑ |
| `LIVE-L12` | HIGH | No rate-limit on `/api/auth/login/` | Y | ☑ |
| `LIVE-L13` | HIGH | Password-reset user-enumeration via timing | Y | ☑ |
| `LIVE-L14` | MED | Sequential integer user IDs leak userbase size | Y | ⊘ |
| `LIVE-L15` | MED | Sequential integer resource IDs (same risk) | Y | ⊘ |
| `LIVE-L16` | MED | Pagination `limit` accepted at any size | Y | ☑ |
| `LIVE-L17` | LOW | Garbage pagination params silently accepted | Y | ⊘ |
| `LIVE-L18` | LOW | No account-deletion endpoint (GDPR gap) | Y | ⊘ |
| `LIVE-L19` | LOW | Email verification not enforced | Y | ☑ |
| `LIVE-L20` | INFO | Profile-update silently drops unknown fields | Y | ⊘ |
Stage 4 handler/misc batch status (2026-05-15):
- `M1`: `Content-Disposition` filenames are sanitized (control chars / quote / backslash stripped) so an upload filename cannot inject response headers.
- `M7`: migration `000005` creates the `audit_log` table (no prior migration did, hence `CREATE TABLE IF NOT EXISTS`) and makes it append-only via BEFORE UPDATE/DELETE triggers.
- `M11`: `golang.org/x/crypto` bumped `v0.49.0 → v0.51.0`.
- `M13`: `DELETE /api/auth/account` now carries the Traefik `auth-rate-limit` edge limiter.
- `LIVE-L18` ⊘: not a real gap. The endpoint exists at `DELETE /api/auth/account/` (`router.go:546`); the live scan probed `/api/auth/me/`, `/auth/delete/`, `/users/me/` and missed it.

Update (2026-05-15): items shown as deferred in an earlier draft were then completed: `LIVE-L1` (`/metrics` rejects proxied/public requests via an `X-Forwarded-For` check, so only the in-cluster vmagent scrape reaches it), `M6`/`LIVE-L16` (the document/contractor list repos already hard-cap at 500 rows), and `LIVE-L19` (verified-email gating on share-code generation via the new `RequireVerified` middleware). `LIVE-L17` (inert pagination params, results capped) and `LIVE-L20` (whitelist profile update is the correct pattern) are closed as no-security-impact (⊘). The master index above is authoritative.
Stage 4 races batch status (2026-05-15):
- `C9`/`H9`: share-code redemption is now one locked transaction in `ResidenceRepository.JoinWithShareCode` (lock the code row, re-check validity, add member, deactivate; a deactivation failure aborts the join).
- `C11`: the task-completion duplicate-row race was already closed. The completion insert and the optimistically-version-locked task update share one transaction, so a concurrent completion fails with `ErrVersionConflict` and rolls back its inserted row; no `UNIQUE(task_id, completed_date)` was added (it would reject legitimate same-day re-completions and risk a migration failure on existing data).
- `M4`: email verification's find/consume/flag writes are now one transaction.
- `M12`: a concurrent contractor delete now yields a clean 404.
- `C12` ⊘: premise moot. The app hard-deletes accounts (`DeleteUserCascade`), so there is no soft-deleted user whose email lingers, and `ExistsByEmail` already blocks re-registering a deactivated user's email.

Stage 4 auth batch status (2026-05-15): C1, C2, C3 done (see entries below). Rate limiting: every sensitive auth path now carries the shared Traefik `auth-rate-limit` edge limiter (login/register/forgot/reset/verify-reset/apple/google/refresh/join-with-code); login/register/forgot/reset/apple/google additionally keep the per-IP app limiter (`H1`/`H2`/`H3`/`LIVE-L12`). `H4` rejection-sampled codes, `M2` bcrypt cost 12, `L1` + `LIVE-L11` constant-time generic-error login, `L2` `no-store` on auth responses, `H7` IP/UA in auth logs, `LIVE-L13` fully-async forgot-password: all done; `go build ./...` and the `models`/`repositories`/`middleware`/`handlers`/`services` test packages pass. Deferred: `M3` (Apple nonce) needs the iOS client to generate and send a nonce; server-only validation would reject every Apple login, so this is blocked on a coordinated mobile change. `H8`: the `parseTimezone` ±14h cap shipped; the "use server UTC for trial-start" half is folded into Stage 4's subscription work. `M5` per-account lockout (Redis) was deferred as of this date, on the grounds that the edge + per-IP app limiters + the existing per-account password-reset counter cover the practical risk; it has since shipped, keyed on distinct source IPs (see MEDIUM-3 in the post-remediation review).
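The `H4` fix (rejection-sampled codes) addresses modulo bias: reducing a random 32-bit value modulo 1,000,000 makes low codes slightly more likely, because 2^32 is not a multiple of 1,000,000. A minimal sketch of rejection sampling follows; the function name is hypothetical and the production code may be structured differently (for example via `crypto/rand.Int`).

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

// codeSpace is the number of possible 6-digit codes.
const codeSpace = 1_000_000

// limit is the largest multiple of codeSpace not exceeding 2^32
// (4,294,000,000). Draws at or above it are discarded, so every residue
// modulo codeSpace is equally likely among accepted draws.
const limit = (1 << 32) / codeSpace * codeSpace

// sixDigitCode returns a uniformly distributed, zero-padded 6-digit code.
func sixDigitCode() (string, error) {
	var buf [4]byte
	for {
		if _, err := rand.Read(buf[:]); err != nil {
			return "", err
		}
		n := binary.BigEndian.Uint32(buf[:])
		if uint64(n) < limit {
			return fmt.Sprintf("%06d", n%codeSpace), nil
		}
		// Rejected: n fell in the biased tail; draw again.
	}
}

func main() {
	code, err := sixDigitCode()
	if err != nil {
		panic(err)
	}
	fmt.Println(code)
}
```

Only about 0.02% of draws land in the rejected tail, so the loop almost always finishes on its first iteration.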
Stage 5 — CI / build pipeline
| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `K3S-F5` | HIGH | Images pinned by mutable short SHA tag, not digest | Y | ☑ |
| `K3S-F8` | MED | Secrets injected as env vars, not file mounts | Y | ☑ |
| `CODE-L5` | LOW | No image signing (cosign) in CI | Y | ◐ |

Stage 5 status (2026-05-15): `CODE-M11` done: `golang.org/x/crypto` bumped `v0.49.0 → v0.51.0` (with the `x/sys`/`x/term`/`x/text` bumps `go get -u` pulled in), `go mod tidy` clean, full build + test green.

Update (2026-05-15): `K3S-F5`/`K3S-F14`/`CODE-M10` are done: `03-deploy.sh` resolves the image digest after each push and deploys api/worker/admin/web by `@sha256:`, and redis/vmagent/`node:20-alpine` are pinned to their resolved index digests.

Update (2026-05-16): `K3S-F8` is done: the `api`/`worker` Deployments mount `honeydue-secrets` as files (`defaultMode: 0400`) at `/etc/honeydue/secrets` and inject no secret as an env var; `config.loadFileSecrets` reads them; `02-setup-secrets.sh` now writes `B2_KEY_ID`/`B2_APP_KEY` into the secret, reconciling the earlier script-vs-manifest drift. `CODE-L5` stays ◐: cosign signing and a Trivy `HIGH,CRITICAL` scan are wired (guarded) into `03-deploy.sh` and a ready-to-use Kyverno `ClusterPolicy` ships at `deploy-k3s/manifests/kyverno-verify-images.yaml`; closing it needs the operator to install Kyverno and supply a cosign key. See both entries.
Stage 6 — Post-deploy verification & runtime investigations
V1–V13 — see Stage 6.
Stage 0 — DNS & Cloudflare edge
External state at Cloudflare. Not touched by 03-deploy.sh, so a redeploy
neither breaks nor re-applies these — do them once and leave them. Tracked
here so they are never forgotten on a domain move or DNS migration.
LIVE-L2 — Add DMARC record · HIGH · ⊘
- Operator decision (2026-05-16): declined for this cycle. A DMARC record is an ordinary DNS TXT record — it is not gated behind a paid Cloudflare plan and can be added on Free. This remains a real email-spoofing gap; revisit when DNS is next touched.
- Where: Cloudflare DNS, TXT record at `_dmarc.myhoneydue.com`.
- Fix: Publish `v=DMARC1; p=quarantine; rua=mailto:dmarc@myhoneydue.com; ruf=mailto:dmarc@myhoneydue.com; fo=1; aspf=s; adkim=s`. Start at `pct=10` for 30 days, watch the `rua` aggregate reports, then ramp to `pct=100` and finally `p=reject`.
- Verify: `dig +short TXT _dmarc.myhoneydue.com` returns the record.
LIVE-L3 — Tighten SPF from ?all to -all · MEDIUM · ⊘
- Operator decision (2026-05-16): declined for this cycle. SPF is an ordinary DNS TXT record, editable on any Cloudflare plan including Free. The `?all` (neutral) qualifier leaves spoofed mail un-penalised; revisit alongside `LIVE-L2`.
- Where: Cloudflare DNS, TXT record at `myhoneydue.com`.
- Fix: Change `v=spf1 include:spf.messagingengine.com ?all` → `~all` for ~7 days, confirm no legitimate mail (CI, transactional) is missed, then `-all`. Do this after `LIVE-L2`'s DMARC ramp begins.
- Verify: `dig +short TXT myhoneydue.com | grep spf` shows `-all`.
LIVE-L4 — Add CAA records · MEDIUM · ⊘
- Operator decision (2026-05-16): declined for this cycle. CAA is an ordinary DNS record type, addable on any Cloudflare plan including Free. Without it, any public CA may issue a cert for the domain; revisit when DNS is next touched.
- Where: Cloudflare DNS, apex `myhoneydue.com`.
- Fix: Add `0 issue "letsencrypt.org"`, `0 issuewild "letsencrypt.org"`, `0 iodef "mailto:security@myhoneydue.com"`. Add `0 issue "pki.goog"` only if Google Trust Services is used anywhere. Confirm against the CAs Cloudflare Universal SSL actually uses before locking down.
- Verify: `dig +short CAA myhoneydue.com` returns the records.
LIVE-L6 — Publish security.txt · LOW · ☐ · In-repo: Y
- Where: served by the Go API and/or Next.js apps at `/.well-known/security.txt` (RFC 9116); a committed route, so it survives redeploys.
- Fix: Serve `Contact:`, `Expires:`, `Preferred-Languages:`, `Canonical:` on both `api.myhoneydue.com` and the apex.
- Verify: `curl https://api.myhoneydue.com/.well-known/security.txt` → 200.
LIVE-L9 — Review Cloudflare caching of the admin SSR shell · INFO · ☐
- Where: Cloudflare cache rules for `admin.myhoneydue.com`.
- Fix: `cache-control: s-maxage=31536000` on admin SSR pages means Cloudflare caches the admin shell for a year. Confirm this is intentional; if the admin shell ever contains per-session content, add a bypass-cache rule for `admin.myhoneydue.com`.
- Verify: `curl -sI https://admin.myhoneydue.com/ | grep -i cache` reflects the intended policy.
LIVE-L10 — Suppress x-powered-by · INFO · ☐ · In-repo: Y
- Where: Next.js config in the admin and web repos (`next.config.js` → `poweredByHeader: false`). Committed, survives redeploys.
- Fix: Disable the `x-powered-by: Next.js` header.
- Verify: `curl -sI https://admin.myhoneydue.com/ | grep -i x-powered-by` returns nothing.
Stage 1 — Cluster provisioning & node OS
Run by 01-provision-cluster.sh (which drives the hetzner-k3s CLI from
config.yaml via generate_cluster_config in _config.sh) plus one-time
SSH hardening on each node. Any k3s server flag must be set in the
hetzner-k3s cluster config so a cluster rebuild applies it.
K3S-F4 — kubeconfig world-readable (mode 644 → 600) · HIGH · ☑ · In-repo: Y
- Where: `_config.sh` → `generate_cluster_config` → `k3s_config_file`. Node file `/etc/rancher/k3s/k3s.yaml`.
- Done (2026-05-16): `generate_cluster_config` now emits `write-kubeconfig-mode: "0600"` in the k3s config file, so any fresh provision writes the node kubeconfig as `0600`.
- Operator step on the existing cluster: a running node keeps the mode it was installed with, so run `ssh deploy@<node> 'sudo chmod 600 /etc/rancher/k3s/k3s.yaml'` on each. Deploy scripts still read it via `sudo`.
- Verify: `ssh deploy@<node> 'sudo stat -c %a /etc/rancher/k3s/k3s.yaml'` → `600`.
K3S-CG1 / CODE-M9 — etcd / Secret encryption at rest · ☑ · In-repo: Y
- Where: `_config.sh` → `generate_cluster_config` → `k3s_config_file`.
- Done: the k3s config file carries `secrets-encryption: true`, so a fresh provision boots with AES Secret encryption enabled. (The `write-kubeconfig-mode` line for `K3S-F4` was added next to it on 2026-05-16.)
- Operator step on the existing cluster: a cluster provisioned without the flag does not retro-encrypt; run `k3s secrets-encrypt enable` then `k3s secrets-encrypt reencrypt` once. Tracked as `V12`.
- Verify: `k3s secrets-encrypt status` reports `Encryption Status: Enabled` on every server node.
- Note: the old `SECURITY.md` claimed this was already on. `04-verify.sh` greps for the string but cannot truly confirm; see `V12`.
K3S-CG2 — Node OS hardening · ◐ · In-repo: partial
- Where: `_config.sh` → `generate_cluster_config` → `post_create_commands` (runs on every node at provision).
- Done (2026-05-16): `post_create_commands` now installs and enables `fail2ban` (SSH brute-force bans) and `unattended-upgrades` (automatic security patching) on every node at provision time, so a fresh cluster comes up hardened on both.
- Still operator (runtime; not yet in-repo):
  - SSH: confirm `PermitRootLogin no`, `PasswordAuthentication no`, `AllowUsers deploy`, modern ciphers/MACs/KEX. (hetzner-k3s provisions key-only SSH; verify and tighten.)
  - sysctl: confirm `net.ipv4.ip_unprivileged_port_start=0` (Traefik) and standard network-hardening sysctls.
- Verify: `ssh deploy@<node> 'fail2ban-client status sshd; systemctl is-enabled unattended-upgrades'`.
K3S-CG3 — Hetzner Cloud Firewall rules · ☐ · In-repo: N
- Fix: Confirm only: `:443` from Cloudflare CIDRs, `:22` from operator IP(s), `:6443` from operator IP(s). Nothing else. This is the only network defense for the public-IP nodes (`K3S-F15`).
- Verify: `hcloud firewall describe honeydue-fw` matches the intended ruleset; a direct `curl` to a node IP on `:80`/`:443` from a non-CF host times out.
K3S-CG4 — etcd snapshot backup · ☐ · In-repo: Y
- Fix: Confirm k3s etcd snapshots are enabled (the k3s default schedule is every 12 hours) and shipped off-node: set `--etcd-s3` (to Backblaze B2) with encryption. Without offsite snapshots, a 3-node loss is unrecoverable.
- Verify: `ls /var/lib/rancher/k3s/server/db/snapshots/` on a node + an object in the B2 backup bucket.
K3S-CG5 — kubelet authn/authz flags · ☐ · In-repo: Y
- Fix: Confirm `--anonymous-auth=false` and `--authorization-mode=Webhook` on the kubelet (k3s defaults are usually safe, but verify rather than assume). Set via k3s `kubelet-arg` in the cluster config if missing.
- Verify: `kubectl get --raw /api/v1/nodes/<node>/proxy/configz` shows the expected kubelet config.
K3S-CG6 — Container-runtime CIS baseline · ☐ · In-repo: N
- Fix: Run `kube-bench` once; remediate any FAIL lines that aren't k3s-by-design.
- Verify: `kube-bench` run archived with FAILs triaged.
K3S-CG7 — deploy user sudoers least-privilege · ☐ · In-repo: N
- Fix: Current `deploy ALL=(ALL) NOPASSWD: ALL` means an SSH-key compromise = node root. Scope to the commands deploys actually need (`ufw`, `systemctl`, `chmod` on k3s.yaml, `cat` of k3s.yaml). Accept the convenience trade-off only with eyes open.
- Verify: `ssh deploy@<node> 'sudo -l'` shows the scoped list.
K3S-CG8 — /etc/rancher/k3s/ perms · ☐ · In-repo: N
- Fix: `/var/lib/rancher/k3s/server/token` and `/var/lib/rancher/k3s/server/node-token` must be `0600 root:root`; `/etc/rancher/k3s/` must not be world-traversable.
- Verify: `ssh deploy@<node> 'sudo stat -c "%a %n" /var/lib/rancher/k3s/server/token'` → `600`.
K3S-F15 — Nodes on public IPs, no private VPC · INFO · ⊘ · In-repo: Y
- Decision: Accepted for now. Defense is `K3S-CG3` (Hetzner firewall) only. To remediate later: attach a Hetzner private network, re-IP the cluster, move etcd/kubelet/Flannel onto it. This is a substantial re-provision; track it on the roadmap, not this cycle.
K3S-F16 — All nodes are control-plane + etcd + worker · INFO · ⊘
- Decision: Accepted; this is standard small-cluster k3s. Revisit (dedicated workers + `NoSchedule` taint on control-plane) when workload pressure grows. No redeploy action.
K3S-F17 — Single-replica SPOFs · INFO · ☐ · In-repo: Y
- Where: `deploy-k3s/manifests/worker/deployment.yaml`, `redis/`, `admin/`, `observability/vmagent.yaml`.
- Fix: `worker` → `replicas: 2` (stateless, Asynq at-least-once, so safe now). `admin`/`vmagent` → 2 if zero-downtime restart is wanted. `redis` is stateful; true HA needs Sentinel or managed Redis. Track that separately and do not naively scale.
- Verify: `kubectl -n honeydue get deploy` shows `worker 2/2`.
Stage 2 — Secrets & config bootstrap
Run by 02-setup-secrets.sh, which reads deploy-k3s/config.yaml and the
secrets/ directory. Both K3S-F1 and K3S-F3 are open purely because
config.yaml lacks the values — the script already supports them.
K3S-F1 — Redis runs with no authentication · CRITICAL · ☐ · In-repo: Y
- Where: `deploy-k3s/config.yaml` key `redis.password`. `02-setup-secrets.sh:53,68-71` includes `REDIS_PASSWORD` in `honeydue-secrets` only when that key is non-empty; `redis/deployment.yaml` adds `--requirepass` only when the env var is non-empty.
- Fix: Set `redis.password` in `config.yaml` to a strong value (`openssl rand -base64 32`). Re-run `02-setup-secrets.sh`. `api`/`worker` already consume `REDIS_PASSWORD`.
- Verify: `kubectl -n honeydue exec deploy/redis -- redis-cli ping` → `NOAUTH`; with `-a "$REDIS_PASSWORD"` → `PONG`.
- Redeploy-clean: committing the value to `config.yaml` means every future `02-setup-secrets.sh` re-creates the authenticated secret. (If `config.yaml` is gitignored, store the value in the operator's secret store and document it here.)
K3S-F3 — admin-basic-auth secret never created · HIGH · ☐ · In-repo: Y
- Where: `config.yaml` keys `admin.basic_auth_user` / `admin.basic_auth_password`. `02-setup-secrets.sh:54-55,132-143` creates the `admin-basic-auth` secret (bcrypt htpasswd) only when both are set, else it warns and skips.
- Fix: Set both keys. Re-run `02-setup-secrets.sh`. Must be done before `K3S-F2`: attaching `admin-auth` to the ingress with the secret missing makes Traefik 503 the admin route.
- Verify: `kubectl -n honeydue get secret admin-basic-auth`.
K3S-F8 (Stage 2 half) — B2_KEY_ID / B2_APP_KEY in honeydue-secrets · ☑ · In-repo: Y
- Where: `02-setup-secrets.sh`.
- Done (2026-05-16): the script now reads `storage.b2_key_id` / `storage.b2_app_key` from `config.yaml` and adds `B2_KEY_ID` / `B2_APP_KEY` to `honeydue-secrets`. Previously the `api`/`worker` manifests referenced these keys but the script never created them, which was a latent deploy break. See the full `K3S-F8` entry in Stage 5.
- Verify: `kubectl -n honeydue get secret honeydue-secrets -o jsonpath='{.data.B2_KEY_ID}'` is non-empty.
K3S-F12 — Secret rotation runbook · MEDIUM · ☐ · In-repo: Y
- Where: new doc `docs/runbooks/secret-rotation.md`.
- Fix: Document per-secret rotation (Postgres, `SECRET_KEY`, APNs `.p8`, FCM, B2, observability token, Redis, admin basic-auth). Annual minimum; immediate on suspected exposure or operator-device loss. For `SECRET_KEY` (JWT signing) plan an overlap window so live tokens validate across the change. Add a `last-rotated` annotation to each secret.
- Verify: runbook exists and the first rotation is logged.
CODE-C4 — DEBUG_FIXED_CODES "123456" auth bypass · CRITICAL · ☐ · In-repo: Y
- Where: `internal/services/auth_service.go:141-145,385-390,432-435,470-473,503-504`; config in `internal/config/config.go`. ConfigMap generated from `config.yaml` by `03-deploy.sh`.
- Fix (two layers): (1) Code: refuse to start if `ENV=production && DebugFixedCodes` (Stage 4 code change). (2) Config: ensure `config.yaml` never sets `DEBUG_FIXED_CODES=true` for prod, and the generated ConfigMap omits it.
- Verify: prod ConfigMap has no `DEBUG_FIXED_CODES`; a prod boot with the flag set fails fast.
CODE-M8 — SECRET_KEY hardcoded debug fallback · MEDIUM · ☐ · In-repo: Y
- Where: `internal/config/config.go:437-442` falls back to `"change-me-in-production-secret-key-12345"`.
- Fix: Remove the static fallback: generate a per-boot random key in debug, and refuse to start in production if `SECRET_KEY` is unset. (`02-setup-secrets.sh:46-49` already enforces ≥32 chars for the real secret; keep that.)
- Verify: prod boot with no `SECRET_KEY` exits non-zero; the fallback string is gone from the binary.
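
The boot-time guards described in `CODE-C4` and `CODE-M8` could be combined into a single check that runs before the server starts. The sketch below is illustrative only: the `Config` struct, its field names, and `ValidateForBoot` are hypothetical stand-ins, not the real `internal/config` types.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
)

// Config is a hypothetical, trimmed-down stand-in for the real config struct.
type Config struct {
	Env             string // "production" or anything else (treated as debug)
	SecretKey       string
	DebugFixedCodes bool
}

// ValidateForBoot returns an error when a production config is unsafe.
// Outside production it fills in a random per-boot SecretKey if none is set.
func ValidateForBoot(c *Config) error {
	if c.Env == "production" {
		if c.DebugFixedCodes {
			return errors.New("DEBUG_FIXED_CODES must not be enabled in production")
		}
		if c.SecretKey == "" {
			return errors.New("SECRET_KEY must be set in production")
		}
		return nil
	}
	if c.SecretKey == "" {
		buf := make([]byte, 32)
		if _, err := rand.Read(buf); err != nil {
			return fmt.Errorf("generate debug secret key: %w", err)
		}
		c.SecretKey = hex.EncodeToString(buf) // 64 hex chars, new on every boot
	}
	return nil
}

func main() {
	cfg := &Config{Env: "production", DebugFixedCodes: true, SecretKey: "set"}
	if err := ValidateForBoot(cfg); err != nil {
		fmt.Println("refusing to start:", err)
	}
}
```

Calling a check like this first thing in `main` and exiting non-zero on error would satisfy both Verify lines: the fail-fast prod boot and the absence of any static fallback string.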
Stage 3 — Kubernetes manifests
Committed under deploy-k3s/manifests/ and applied by 03-deploy.sh. Any
fix here is automatically re-applied on every redeploy — the highest-value
stage for "redeploy clean."
K3S-F2 / CODE-L6 — Wire defense-in-depth onto the admin ingress · HIGH · ☐
- Where: `deploy-k3s/manifests/ingress/ingress-simple.yaml`, admin route annotation.
- Fix: Add `cloudflare-only` and `admin-auth` to the `traefik.ingress.kubernetes.io/router.middlewares` annotation alongside the existing `security-headers` + `rate-limit`. Do `K3S-F3` first or Traefik 503s the route.
- Verify: `04-verify.sh` "Cloudflare-Only Middleware" check passes; `admin.myhoneydue.com` prompts for basic auth.
K3S-F6 — imagePullSecrets name consistency · HIGH · ☐
- Where: all `deploy-k3s/manifests/*/deployment.yaml`, `migrate/job.yaml`; secret created by `02-setup-secrets.sh:111` as `ghcr-credentials`.
- Fix: The registry is Gitea, so `ghcr-credentials` is a misleading name, and the live cluster currently also has a hand-made `gitea-credentials`. Pick one name (`gitea-credentials` is clearer), use it in both the script and every manifest, and delete the orphan. The defect is a name mismatch, not a missing secret; make script + manifests agree so a pull never fails on a fresh node.
- Verify: every file listed by `grep -rl imagePullSecrets deploy-k3s/manifests/` references one name, the same one the script creates; cordon a node, delete a pod, confirm the replacement pulls.
K3S-F7 — vmagent container securityContext · MEDIUM · ☐
- Where: `deploy-k3s/manifests/observability/vmagent.yaml`.
- Fix: Add the container-level block the other 5 deployments already have: `allowPrivilegeEscalation: false`, `capabilities.drop: [ALL]`, `readOnlyRootFilesystem: true`. Its volumes (`/etc/vmagent`, `/etc/vmagent-secrets`, `/tmp/vmagent` emptyDir) already support read-only root.
- Verify: `04-verify.sh` "Pod Security Contexts" reports OK for `vmagent`.
K3S-F9 / LIVE-L8 — CSP + cross-origin headers · MEDIUM / LOW · ☐
- Where: Cross-origin trio → `deploy-k3s/manifests/ingress/middleware.yaml` (`security-headers`). CSP `object-src`/`base-uri` → Go app CSP middleware (Stage 4, `LIVE-L8` code half).
- Important correction: `K3S-F9` originally said CSP was missing. The live scan disproved that: the Go app sets a strong CSP via app middleware. So `K3S-F9` reduces to: add `Cross-Origin-Opener-Policy: same-origin` and `Cross-Origin-Resource-Policy: same-origin` (and `Cross-Origin-Embedder-Policy: require-corp` only if it doesn't break embeds) to `security-headers`. The CSP `object-src 'none'; base-uri 'self'` additions belong in the app and are tracked under `LIVE-L8` in Stage 4.
- Verify: `curl -sI https://api.myhoneydue.com/api/health/ | grep -i cross-origin` shows COOP/CORP.
K3S-F10 / LIVE-L12 — Auth-endpoint rate-limit middleware · MEDIUM / HIGH · ☐
- Where: `deploy-k3s/manifests/ingress/middleware.yaml` (new `auth-rate-limit` Middleware) + `ingress/ingress-simple.yaml`. Requires migrating the auth paths from vanilla `Ingress` to a Traefik `IngressRoute` to apply a per-path middleware.
- Fix: New Middleware `average: 5, burst: 10, period: 1m, sourceCriterion.ipStrategy.depth: 2` (depth 2 for the Cloudflare hop). Apply to `/api/auth/login`, `/api/auth/register`, `/api/auth/forgot-password`, `/api/auth/reset-password`, `/api/residences/join-with-code`. This is the edge half; the app half is `CODE-H1/H2/H3/M5` in Stage 4 (per-account lockout in Redis). Do both, because an edge limit alone resets on IP rotation.
- Verify: rapid logins from one IP beyond the burst of 10 → `429`.
K3S-F11 — Disable automountServiceAccountToken · MEDIUM · ☐
- Where: `deploy-k3s/manifests/rbac.yaml` (ServiceAccounts) and/or each `*/deployment.yaml` pod spec.
- Fix: Set `automountServiceAccountToken: false` on `api`, `admin`, `worker`, `web`, `redis`. Leave `true` only for `vmagent` (it uses the k8s API for service discovery). Note: `05-security.md` claims this is already set, while the audit (F11) says it is not. Treat the audit as ground truth; this fix makes the doc true.
- Verify: `kubectl -n honeydue get pod <api-pod> -o jsonpath='{.spec.automountServiceAccountToken}'` → `false`; no token file in the container.
K3S-F13 — Add app.myhoneydue.com to CORS · LOW · ☐
- Where: `CORS_ALLOWED_ORIGINS` in `config.yaml` → generated into `honeydue-config` ConfigMap by `03-deploy.sh`.
- Fix: Confirm whether the web app calls `api.myhoneydue.com` directly from the browser. If yes, add `https://app.myhoneydue.com` to `CORS_ALLOWED_ORIGINS`. If it proxies through Next.js server-side, CORS is moot; record that decision here instead.
- Verify: browser fetch from `app.myhoneydue.com` to the API succeeds (or the proxy decision is documented).
K3S-F14 — Pin public images by digest · LOW · ☐
- Where: `redis/deployment.yaml` (`redis:7-alpine`), `observability/vmagent.yaml` (`victoriametrics/vmagent:v1.106.1`).
- Fix: Replace tags with `@sha256:` digests. Folded into the `K3S-F5` CI work (Stage 5).
- Verify: manifests contain no public-image tag without a digest.
LIVE-L5 / CODE-L3 — HSTS preload · LOW · ☐
- Where: `deploy-k3s/manifests/ingress/middleware.yaml` `security-headers` HSTS value.
- Fix: Change to `max-age=63072000; includeSubDomains; preload`. Confirm api/admin/app all work fully over HTTPS, then submit to `hstspreload.org` (the submission is the Stage 0 external half; once preloaded you cannot easily downgrade for ~6 months).
- Verify: response header shows `preload`; domain accepted at hstspreload.org.
LIVE-L7 — Drop deprecated X-XSS-Protection · LOW · ☐
- Where: `deploy-k3s/manifests/ingress/middleware.yaml` `security-headers` (`browserXssFilter: true` / `customResponseHeaders`).
- Fix: Remove the header or set `X-XSS-Protection: "0"`. Modern browsers ignore it, and bypasses of the legacy filter have themselves caused XSS.
- Verify: header absent or `0` on all three hosts.
CODE-L4 — Set imagePullPolicy · LOW · ☐
- Where: all `deploy-k3s/manifests/*/deployment.yaml`.
- Fix: Set `imagePullPolicy` explicitly. Once images are digest-pinned (`K3S-F5`), `IfNotPresent` is correct and avoids needless re-pulls; until then `Always` avoids stale tags. Pick the policy that matches the `K3S-F5` rollout state.
- Verify: every container has an explicit `imagePullPolicy`.
Stage 4 — Application code & container images
Fixes in honeyDueAPI-go source (and the admin/web Dockerfiles). They reach
production by rebuilding the image in 03-deploy.sh; schema-changing
fixes (CODE-C1, CODE-C5/6, CODE-C11, CODE-C12) also need a goose
migration, which the migrate Job runs automatically before the
api/worker roll. Per repo rule: do not auto-commit — these are code changes;
this section is the plan, not the patch.
Critical (C1–C13)
CODE-C1 — Plaintext auth tokens in DB · ☑ (2026-05-15)
- Where: `internal/models/user.go`, `internal/repositories/user_repo.go`, `internal/middleware/auth.go`, `internal/services/cache_service.go`, `internal/services/auth_service.go`, migration `000003_hash_auth_tokens.sql`.
- Done: `user_authtoken.key` now stores `models.HashToken()` (the hex SHA-256 of the token), never the raw value. The raw token reaches the client once (the non-persisted `AuthToken.Plaintext` field) and is re-hashed on every request before the DB and Redis lookup, so the single indexed JOIN query in the auth middleware is preserved. A fast hash (not bcrypt) is correct here: tokens are 160-bit random values, so there is nothing to brute-force. Migration `000003` widens the column 40→64 and clears existing rows.
- Behaviour change: the server can no longer re-issue a stored token's plaintext, so every login mints a fresh token via `CreateFreshToken` (delete + create). With the existing one-token-per-user schema this means one active session per user: logging in on a new device invalidates the previous device's token. The migration also invalidates all sessions once, at deploy.
- Verify: `SELECT key FROM user_authtoken LIMIT 1` → 64-char hash; `go build ./...` and `go test ./internal/{models,repositories,middleware,handlers}/...` pass.
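
As a rough illustration of the scheme above, a 160-bit token and its stored digest might be produced as follows. This is a minimal sketch, not the code in `internal/models`: `NewRawToken` is a hypothetical helper, and `HashToken` here only mirrors the behaviour the entry describes (hex SHA-256, 64 characters).

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// NewRawToken returns a 160-bit random token as 40 hex characters.
// Hypothetical helper; the real token generator may differ.
func NewRawToken() (string, error) {
	buf := make([]byte, 20)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// HashToken returns the 64-character hex SHA-256 digest that is stored
// in place of the raw token and recomputed on every request.
func HashToken(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

func main() {
	raw, err := NewRawToken()
	if err != nil {
		panic(err)
	}
	// The raw token goes to the client once; only the digest is persisted.
	fmt.Println(len(raw), len(HashToken(raw)))
}
```

The 40 and 64 character lengths line up with the column widening that migration `000003` performs.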
CODE-C2 / CODE-C3 — Google ID token not verified locally · ☑ (2026-05-15)
- Where: `internal/services/google_auth.go` (full rewrite).
- Done: `VerifyIDToken` no longer calls the deprecated `tokeninfo` URL (which leaked the token in the query string and made verification depend on a third party). It now parses the JWT, fetches Google's JWKS from `googleapis.com/oauth2/v3/certs` (Redis-cached 24h, re-fetched on a `kid` miss), verifies the `RS256` signature locally, and asserts `iss ∈ {accounts.google.com, https://accounts.google.com}` (C3), `aud`/`azp` against the configured client IDs, and `exp` (validated by jwt v5). Mirrors the existing Apple JWKS verifier. `GoogleSignIn` is unchanged; the returned `GoogleTokenInfo` shape is preserved.
- Verify: `go build ./...` clean; `internal/services` tests pass.
CODE-C5 / CODE-C6 — IAP receipt / purchase-token replay · ☐
- Where: `internal/services/subscription_service.go` (`ProcessApplePurchase`, `ProcessGooglePurchase`).
- Fix: Goose migration adding `UNIQUE(provider, original_transaction_id)`. On purchase, if the transaction ID is already bound to a different `user_id` → `403`.
- Verify: re-submitting a valid receipt against a second account → `403`; DB has no duplicate.
CODE-C7 — File-ownership check excludes residence owners · ☐
- Where: `internal/services/file_ownership_service.go:20-66`.
- Fix: Replace the three `residence_residence_users`-only JOINs with the canonical owner-OR-member UNION from `residence_repo.HasAccess` (owners live in `residence_residence.owner_id`).
- Verify: a residence owner can delete a file in their own property; a non-member still gets `403`.
CODE-C8 — Device-token cross-account hijack · ☐
- Where: `internal/services/notification_service.go:307-319` (APNS), `:336-349` (GCM).
- Fix: On re-register of an existing token, if `existing.UserID != nil && *existing.UserID != userID` → `409 Conflict`. Only same-user updates allowed.
- Verify: registering another user's known token → `409`; that user's push traffic is unaffected.
CODE-C9 / CODE-H9 — Share-code join not atomic · ☐
- Where: `internal/services/residence_service.go:562-615` (`:594-599` swallows the deactivate error).
- Fix: Wrap `JoinWithCode` in one transaction with `SELECT … FOR UPDATE` on the share-code row; fail the join if deactivation fails (do not log-and-continue).
- Verify: concurrent redemptions of a single-use code → exactly one succeeds; a forced deactivate error rolls the whole join back.
CODE-C10 — Subscription upgrade race · ☐
- Where: `internal/services/subscription_service.go:404-459`; webhook handler `:136-213`.
- Fix: Move Apple validation inside the row-locked transaction, or add an idempotency-key table so the validate→write window can't be raced.
- Verify: two concurrent upgrades for one user → one tier change, not two.
CODE-C11 — Task-completion duplicate-row race · ☐
- Where: `internal/services/task_service.go:631-750`.
- Fix: `SELECT … FOR UPDATE` on the task in `CreateCompletion`; goose migration adding `UNIQUE(task_id, completed_date)`.
- Verify: double-tap "complete" → one completion row.
CODE-C12 — Soft-deleted email reusable · ☐
- Where: `internal/services/auth_service.go:274-324`; `internal/repositories/user_repo.go` (`FindByEmail`, `ExistsByEmail`).
- Fix: On delete, mangle the email (`deleted_<id>_<email>`); add `is_active = true` filtering consistently to `FindByEmail` / `ExistsByEmail`.
- Verify: registering with a soft-deleted account's email is rejected; no cross-account takeover.
CODE-C13 — Apple webhook user lookup may LIKE-match · ☐
- Where: `internal/handlers/subscription_webhook_handler.go:354-366` (`FindByAppleReceiptContains`).
- Fix: Confirm the SQL is an equality match, not `LIKE`. If `LIKE`, this is a confirmed Critical: change to equality and rename the function. See `V8`.
- Verify: the query is parameterized equality; rename merged.
High (H1–H9)
CODE-H1 / CODE-H2 / CODE-H3 / CODE-M5 — Rate limiting gaps · ☐
- Where: `internal/router/router.go` (`:520` login limiter, `:593` `join-with-code` unprotected), `internal/middleware/rate_limit.go`, `internal/handlers/auth_handler.go`.
- Fix: Extend rate limiting to `register`, `join-with-code`, Apple/Google sign-in, and token refresh. Add a per-account login-attempt counter in Redis (lock after 5–10 fails for 15–60 min). This is the app half of the consolidated auth-rate-limit item; the edge half is `K3S-F10`.
- Verify: rapid attempts on every auth route throttle; per-account lockout fires regardless of source IP.
CODE-H4 — Modulo bias in 6-digit codes · ☐
- Where: `internal/services/auth_service.go:884-892`.
- Fix: Replace `int32 % 1000000` with rejection sampling on `crypto/rand` for a uniform `000000`–`999999`.
- Verify: distribution test over many samples is uniform.
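
One way to do the rejection sampling named in the fix is sketched below. It assumes a 32-bit draw from `crypto/rand`; the function name `sixDigitCode` is hypothetical and the real call site in `auth_service.go` may be shaped differently.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

// sixDigitCode returns a uniformly distributed code in 000000..999999.
// It draws 32 random bits and rejects draws at or above the largest
// multiple of 1,000,000 that fits in 32 bits, which removes modulo bias.
func sixDigitCode() (string, error) {
	const n = 1000000
	const limit = (1 << 32 / n) * n // 4,294,000,000
	var buf [4]byte
	for {
		if _, err := rand.Read(buf[:]); err != nil {
			return "", err
		}
		v := binary.BigEndian.Uint32(buf[:])
		if v < limit { // accepted with probability ~0.99977, so the loop is short
			return fmt.Sprintf("%06d", v%n), nil
		}
	}
}

func main() {
	code, err := sixDigitCode()
	if err != nil {
		panic(err)
	}
	fmt.Println(code)
}
```

`rand.Int(rand.Reader, big.NewInt(1000000))` from `crypto/rand` would be an equally valid, shorter alternative, since it is also uniform over the requested range.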
CODE-H5 — Apple IAP .p8 file-mode unchecked · ☐
- Where: `internal/services/iap_validation.go:93-128`, `internal/config/config.go:325`.
- Fix: Prefer a base64 env-injected PEM. If a file path is kept, refuse to start when the file mode is more permissive than `0600`.
- Verify: boot fails on a `0644` key file; succeeds on `0600`.
CODE-H6 — Webhook dedup fail-open · ☐
- Where: `internal/handlers/subscription_webhook_handler.go:165-173` (Apple), `:564-574` (Google).
- Fix: Fail closed: if `webhookEventRepo.HasProcessed` errors, return `500` so Apple/Google retry, rather than processing (which risks duplicate refunds).
- Verify: simulated dedup-check DB error → `500`, no double-processing.
CODE-H7 — Auth-failure log lacks IP/UA · ☐
- Where: `internal/handlers/auth_handler.go:70`.
- Fix: Add `c.RealIP()` + `User-Agent` to the structured failure log line (the audit log captures them; the request-line log does not). Depends on `V10` (RealIP trust).
- Verify: a failed login log line carries IP + UA.
CODE-H8 — X-Timezone header trusted for trial start · ☐
- Where: `internal/middleware/timezone.go:40-71` → `internal/services/subscription_service.go:145-150`.
- Fix: Validate `X-Timezone` against IANA `LoadLocation`, cap to ±14h; use server UTC for trial-start / billing-window math regardless.
- Verify: a bogus/extreme `X-Timezone` cannot shift trial start.
Medium (M1–M13)
CODE-M1 — Header injection via Content-Disposition filename · ☐
- Where: `internal/handlers/media_handler.go:74,117,165`.
- Fix: Sanitize `doc.FileName`: strip CR/LF/quote/null, or emit RFC 5987 `filename*=UTF-8''…`.
- Verify: an upload with CRLF in the filename does not split the response.
CODE-M2 — bcrypt cost 10 → 12 · ☐
- Where: `internal/models/user.go:47`, `internal/services/auth_service.go:479`.
- Fix: Make the cost config-driven, default 12.
- Verify: new hashes are `$2a$12$`.
CODE-M3 — Apple Sign In nonce not validated · ☐
- Where: `internal/services/apple_auth.go`.
- Fix: Generate, store, and verify the nonce round-trip on Apple sign-in.
- Verify: a replayed/mismatched nonce is rejected.
CODE-M4 — Email verification not atomic · ☐
- Where: `internal/services/auth_service.go:373-415`.
- Fix: Wrap verify in a transaction so a concurrent request can't double-apply.
- Verify: concurrent verify calls → one state transition.
CODE-M6 / LIVE-L16 — Uncapped list / pagination · ☐
- Where: `ListDocuments`, `ListContractors`, `ListResidences` handlers; pagination parsing.
- Fix: Clamp `limit` server-side to ≤100 (`< 1` → default 25). Notifications already caps at 200; match the pattern.
- Verify: `?limit=999999` returns ≤100 rows.
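
The clamp itself is small enough to share across all three list handlers. A minimal sketch, assuming the limit has already been parsed to an `int`; `clampLimit` and the constant names are hypothetical.

```go
package main

import "fmt"

const (
	defaultLimit = 25  // used when the client sends nothing useful
	maxLimit     = 100 // hard server-side ceiling
)

// clampLimit applies the rule from the fix: values below 1 fall back to
// the default, values above the ceiling are capped.
func clampLimit(limit int) int {
	switch {
	case limit < 1:
		return defaultLimit
	case limit > maxLimit:
		return maxLimit
	default:
		return limit
	}
}

func main() {
	for _, in := range []int{-5, 0, 10, 100, 999999} {
		fmt.Println(in, "->", clampLimit(in))
	}
}
```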
CODE-M7 — Audit log not append-only · ☐
- Where: audit-log model / repository.
- Fix: Make it append-only: a DB trigger forbidding `UPDATE`/`DELETE`, or move to an event store. Remove the soft-delete column.
- Verify: an `UPDATE`/`DELETE` on the audit table is rejected.
CODE-M11 — golang.org/x/crypto outdated · ☐
- Where: `go.mod:30` (`v0.49.0`).
- Fix: `go get -u golang.org/x/crypto`, re-run `govulncheck`, retest. Pairs with Stage 5 dependency automation.
- Verify: `govulncheck ./...` clean.
CODE-M12 — Contractor toggle refetch race · ☐
- Where: `internal/services/contractor_service.go:279-307`.
- Fix: Do the toggle + read in one transaction so a concurrent soft-delete can't make it return `nil`.
- Verify: concurrent toggle + delete → defined result, no nil panic.
CODE-M13 — Account-deletion endpoint unrate-limited · ☐
- Where: `internal/handlers/auth_handler.go:488-539`.
- Fix: Add a throttle to `DELETE /account`. First resolve `V11`: `LIVE-L18` claims no delete endpoint exists; reconcile before deciding whether this is "rate-limit it" or "expose it."
- Verify: repeated delete calls throttle.
CODE-M10 — node:20-alpine floating tag · ☐
- Where: admin/web `Dockerfile` (`:2,112,134`).
- Fix: Pin to a specific patch version or digest.
- Verify: Dockerfile has no bare `node:20-alpine`.
Low / Info (CODE-L1, L2)
CODE-L1 — Inactive-account login enumeration · ☐
- Where: `internal/services/auth_service.go:76-77`.
- Fix: Return the same generic error for inactive accounts as for invalid credentials.
- Verify: inactive vs. wrong-password responses are byte-identical.
CODE-L2 — Auth responses lack Cache-Control: no-store · ☐
- Where: `internal/handlers/auth_handler.go` (Login / CurrentUser / Refresh).
- Fix: Set `Cache-Control: no-store` on auth responses.
- Verify: the header is present.
Live-scan code-level findings (LIVE-L1, L11–L20)
LIVE-L1 — /metrics publicly exposed · HIGH · ☐
- Where: `cmd/api/main.go` route registration; vmagent scrapes it cluster-internally already.
- Fix (recommended, Option B): bind Prometheus metrics to a separate cluster-internal port (e.g. `:9090`), expose only via a ClusterIP Service the vmagent NetworkPolicy allows; the public Ingress never registers `/metrics`. Update `observability/vmagent.yaml` scrape target. (Alternative: block `/metrics` at Traefik via an `IngressRoute`, Stage 3.)
- Verify: `curl https://api.myhoneydue.com/metrics` → `404`; vmagent still scrapes successfully.
LIVE-L11 — Login user-enumeration via timing · HIGH · ☐
- Where: login handler / `auth_service.go`.
- Fix: Always run a bcrypt compare against a fixed dummy hash when the user is not found, so the response time does not reveal whether the account exists.
- Verify: real vs. fake email login timing delta < network noise.
LIVE-L12 — No rate-limit on login · HIGH · ☐
- See the consolidated auth-rate-limit item: `K3S-F10` (edge) + `CODE-H1/H2/H3/M5` (app). Closed when both land.
LIVE-L13 — Password-reset timing enumeration · HIGH · ☐
- Where: `forgot-password` handler.
- Fix: Enqueue the reset email on the Asynq queue and return the generic response immediately, so real vs. fake emails have identical latency.
- Verify: real vs. fake email reset timing delta < network noise.
LIVE-L14 / LIVE-L15 — Sequential integer IDs · MEDIUM · ⊘ (deferred)
- Where: all user-facing IDs.
- Decision: Real enumeration/intel leak, but migrating to UUID/ULID touches API, web, mobile, and webhook payloads. Deferred to a planned quarter — not a redeploy-stage fix. Track on the roadmap; revisit before the userbase size becomes commercially sensitive.
LIVE-L16 — Pagination limit uncapped · MEDIUM · ☐
- Duplicate of `CODE-M6`; closed with it.
LIVE-L17 — Garbage pagination params silently accepted · LOW · ☐
- Where: query-param parsing in list handlers.
- Fix: Return `400` naming the bad parameter instead of silently using defaults.
- Verify: `?limit=abc` → `400`.
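
A strict parser for integer query parameters could return an error that names the offending parameter, which the handler then maps to `400`. This is a sketch; `parseIntParam` is a hypothetical helper and the error wording is illustrative.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseIntParam parses an optional integer query parameter. An empty
// value yields the default; anything non-numeric yields an error that
// names the parameter so the handler can answer 400 with a clear message.
func parseIntParam(name, raw string, def int) (int, error) {
	if raw == "" {
		return def, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid query parameter %q: %q is not an integer", name, raw)
	}
	return n, nil
}

func main() {
	if _, err := parseIntParam("limit", "abc", 25); err != nil {
		fmt.Println("400:", err)
	}
	n, _ := parseIntParam("limit", "", 25)
	fmt.Println("default:", n)
}
```

Range clamping (as in `CODE-M6`) would still apply to the parsed value; this helper only rejects values that are not integers at all.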
LIVE-L18 — No account-deletion endpoint (GDPR) · LOW · ☐
- Where: `internal/router/router.go`, `internal/handlers/auth_handler.go`.
- Fix: Reconcile with `CODE-M13` first (`V11`). Provide `DELETE /api/auth/me/` that anonymizes PII, cascades/transfers residences, revokes tokens, and writes an audit-trail row. Also closes the throwaway-account cleanup gap the live scan left behind.
- Verify: an authenticated user can delete their own account; PII is anonymized.
LIVE-L19 — Email verification not enforced · LOW · ☐
- Where: router middleware.
- Fix: Add a `RequireVerified()` middleware on sensitive routes (share-code generation/redemption, anything that emails other users), or cap unverified accounts (1 residence, no share codes) until verified.
- Verify: an unverified account is blocked from the chosen gated routes.
LIVE-L20 — Profile-update silently drops unknown fields · INFO · ☐
- Where: `PATCH /api/auth/profile/` handler.
- Fix: Either accept the fields (if intended) or return `400` listing unsupported keys; don't silently `200`.
- Verify: an unknown field yields a clear response.
LIVE-L10 — x-powered-by — see Stage 0 (Next.js config).
Stage 5 — CI / build pipeline
Build-time controls. Where there is no CI pipeline file yet, the fix is to
add one (or a 03-deploy.sh step) so the control runs on every build.
K3S-F5 / K3S-F14 / CODE-L4 — Pin images by digest · HIGH · ☐
- Where: `03-deploy.sh` (currently tags by git short SHA, lines 47/57-61, and also pushes `:latest`), all `deploy-k3s/manifests/*/deployment.yaml`.
- Fix: After `docker push`, capture the digest (`crane digest …` or parse `docker push` output) and substitute `@sha256:…` into the manifests instead of `IMAGE_PLACEHOLDER` tags. Pin `redis` and `vmagent` by digest too. Reconsider pushing `:latest`: a mutable `:latest` undercuts digest pinning.
- Verify: `kubectl -n honeydue get deploy -o jsonpath` shows every image as `@sha256:`.
K3S-F8 — Secrets as file mounts, not env vars · MEDIUM · ☑ · In-repo: Y
- Where: `api`/`worker` `deployment.yaml`, `internal/config/config.go`, `cmd/api/main.go`, `cmd/worker/main.go`, `02-setup-secrets.sh`.
- Done (2026-05-16):
  - `config.loadFileSecrets()` reads each of the 9 secret keys (`POSTGRES_PASSWORD`, `SECRET_KEY`, `EMAIL_HOST_PASSWORD`, `FCM_SERVER_KEY`, `REDIS_PASSWORD`, `B2_KEY_ID`, `B2_APP_KEY`, `OBS_INGEST_TOKEN`, `OBS_TRACES_URL`) from `/etc/honeydue/secrets/<KEY>` and `viper.Set`s it (highest precedence). A missing file is a silent skip, so the same binary still works from env vars in local/dev.
  - `api`/`worker` `deployment.yaml` no longer inject any secret as an `env: secretKeyRef`. `honeydue-secrets` is mounted as a volume (`defaultMode: 0400`), read-only, at `/etc/honeydue/secrets`. Non-secret config still arrives via `envFrom: configMapRef`.
  - `cmd/api` / `cmd/worker` read the observability endpoints through the new `config.SecretValue()` (Viper-backed) instead of `os.Getenv`, so file-mounted `OBS_*` values resolve now that they are gone from the environment.
  - `02-setup-secrets.sh` now also writes `B2_KEY_ID` / `B2_APP_KEY` into `honeydue-secrets`, reconciling the script-vs-manifest drift (the manifests referenced these keys but the script never created them).
- Scoped exception: the one-shot `honeydue-migrate` Job still takes `POSTGRES_PASSWORD` as an env var. goose is invoked as a CLI with the password inside the DSN argument, so the value is exposed in that process regardless of env-vs-file; the Job is transient (one run, seconds, pod GC'd) so this is accepted.
- Verify: `kubectl -n honeydue exec deploy/api -- env` shows no `POSTGRES_PASSWORD` / `SECRET_KEY`; `kubectl -n honeydue exec deploy/api -- ls /etc/honeydue/secrets` lists the key files.
CODE-L5 — Image signing + scanning · LOW · ◐ · In-repo: Y
- Where: `03-deploy.sh`, `deploy-k3s/manifests/kyverno-verify-images.yaml`.
- Done (in-repo, 2026-05-16):
  - `03-deploy.sh` runs `cosign sign` after each push and a `trivy image --severity HIGH,CRITICAL` scan before push. Both are guarded: they no-op when the tool is absent, so they never break a deploy on a host without them.
  - A ready-to-use Kyverno `ClusterPolicy` ships at `deploy-k3s/manifests/kyverno-verify-images.yaml`. It matches only the four `gitea.treytartt.com/admin/honeydue-*` images, starts in `Audit` mode, and is intentionally not applied by `03-deploy.sh`, because applying a verify-images policy with no key would block every Pod from scheduling.
- Remaining (operator; cannot be committed):
  - Install Kyverno in the cluster (admission controller).
  - `cosign generate-key-pair`; set `COSIGN_KEY` in the deploy env so signing activates; paste `cosign.pub` into the policy's `publicKeys` block.
  - `kubectl apply -f deploy-k3s/manifests/kyverno-verify-images.yaml`, confirm Pods still schedule, then flip `validationFailureAction: Audit → Enforce`.
- Verify: an unsigned image is rejected by admission; `03-deploy.sh` fails on a HIGH/CRITICAL CVE.
CODE-M11 (CI half) — Dependency hygiene · ☐
- Fix: Add scheduled `go get -u` + `govulncheck` (the audit confirms `govulncheck` + `gitleaks` already run in CI; extend with a dependency-update cadence).
- Verify: stale-dependency alerts surface automatically.
Stage 6 — Post-deploy verification & runtime investigations
04-verify.sh already runs a security block (secret encryption, NetworkPolicy
count, ServiceAccounts, pod security contexts, PDBs, cloudflare-only
middleware, admin-basic-auth). Extend it so each fix above stays fixed,
and work the open investigations the audits could not resolve.
Extend 04-verify.sh with assertions for · ☐
- Redis rejects unauthenticated `PING` (`K3S-F1`).
- Admin ingress annotation contains `admin-auth` (`K3S-F2`).
- `/metrics` returns `404` on the public host (`LIVE-L1`).
- Every container (incl. `vmagent`) has a full `securityContext` (`K3S-F7`).
- `automountServiceAccountToken: false` on app pods (`K3S-F11`).
- Every workload image is digest-pinned (`K3S-F5`).
- No `DEBUG_FIXED_CODES` key in the prod ConfigMap (`CODE-C4`).
Runtime investigations (cannot be closed by code review alone)
| ID | Item | Source | Action |
|---|---|---|---|
| `V1` | Apple/Google Sign-In token validation depth | LIVE | Test with a self-signed Apple identity token; confirm signature/aud/nonce checks |
| `V2` | Webhook signature verification: confirm webhook routes are outside the auth middleware in `router.go` (live scan saw 401s, signature middleware may never run) | LIVE | Code-review `internal/router/router.go` |
| `V3` | File-upload security: locate upload paths, test polyglots / MIME bypass / path traversal in filename / oversized files | LIVE | Focused upload security test |
| `V4` | Long-term token validity / revocation behaviour | LIVE | Test token expiry + revocation over time |
| `V5` | Apple IAP receipt validation with a real sandbox StoreKit receipt | LIVE | Sandbox test |
| `V6` | Share-code system: find the endpoint path; test brute-force, single-use, expiration | LIVE | Locate + test |
| `V7` | Trial-expiration enforcement: age a test account past 14 days, confirm `limitations_enabled` flips and creation gates fire | LIVE | Aged-account test |
| `V8` | `FindByAppleReceiptContains`: confirm equality, not `LIKE`. If `LIKE`, escalate `CODE-C13` to confirmed Critical | CODE | SQL review |
| `V9` | Rate-limiter storage: confirm `rate_limit.go` is Redis-backed (shared across 3 api replicas); in-memory = 3× the intended limit | CODE | Code review |
| `V10` | `X-Forwarded-For` / Echo `RealIP` trust behind Traefik: without it per-IP limits collapse to the ingress IP | CODE | Code + Traefik config review |
| `V11` | Account-deletion contradiction: `LIVE-L18` (no endpoint) vs `CODE-M13` (endpoint at `auth_handler.go:488-539`). Resolve before Stage 4 planning | LIVE/CODE | Route review |
| `V12` | etcd encryption: `04-verify.sh` only greps a string; truly confirm with `k3s secrets-encrypt status` on each server node | K3S | SSH check |
| `V13` | `user_authtoken` index: confirm a `user_id` lookup index exists before hashing tokens at rest (`CODE-C1`) | CODE | Schema check |
Accepted risks / deferred (this cycle)
| ID | Item | Rationale |
|---|---|---|
| `K3S-F15` | Public-IP nodes, no VPC | Re-provision-scale change; Hetzner firewall (`K3S-CG3`) is the compensating control. Roadmap. |
| `K3S-F16` | Combined control-plane/worker nodes | Standard small-cluster k3s; revisit on workload growth. |
| `LIVE-L14/L15` | Sequential integer IDs | UUID migration spans API + web + mobile + webhooks; planned quarter, not this cycle. |
Mirror these in docs/deployment/20-roadmap.md so they are not silently lost.
Documentation drift corrected alongside this plan
The audits contradicted the existing deployment book. These corrections ship with this plan so the docs match audited reality:
| Doc | Claimed | Reality (audit) | Action |
|---|---|---|---|
| 05-security.md | automountServiceAccountToken: false set | K3S-F11: not set on any workload | Corrected to "TODO" + linked here |
| 05-security.md | NetworkPolicies "not currently applied" (TODO) | Applied 2026-04-24; 03-deploy.sh:155 applies them | Corrected to "applied" |
| 05-security.md | CF↔origin is plaintext (SSL=Flexible) | Upgraded to Full (strict) 2026-04-24 | Corrected |
| 05-security.md | SHA tags immutable / "we'd notice a digest change" | K3S-F5: short SHA tags are mutable | Corrected; points to K3S-F5 |
| SECURITY.md (old) | Redis "requires a password" | K3S-F1: no auth | This rewrite |
| SECURITY.md (old) | etcd secrets-encryption: true | K3S-CG1: not verified / not on | This rewrite |
| SECURITY.md (old) | fail2ban active | 05-security.md + K3S-CG2: not installed | This rewrite |
| 20-roadmap.md | — | Audit findings not represented | Audit items folded in |
Hardened-redeploy checklist (run order)
A clean rebuild of the whole stack, with every fix above applied:
□ Stage 0 DNS once-off: DMARC, SPF, CAA at Cloudflare; security.txt route live
□ Stage 1 Provision: hetzner-k3s config carries --write-kubeconfig-mode=600
and --secrets-encryption; run 01-provision-cluster.sh
□ Stage 1 Node OS: fail2ban + unattended-upgrades + SSH/sysctl on each node
□ Stage 1 Verify cluster: K3S-CG3..CG8 (firewall, snapshots, kubelet, perms)
□ Stage 2 Config: config.yaml has redis.password + admin.basic_auth_*;
no DEBUG_FIXED_CODES; SECRET_KEY ≥32 chars
□ Stage 2 Secrets: run 02-setup-secrets.sh — confirm redis + admin-basic-auth
□ Stage 3 Manifests: admin ingress middlewares wired; imagePullSecret name
consistent; vmagent securityContext; COOP/CORP headers;
auth-rate-limit; automountServiceAccountToken:false;
HSTS preload; X-XSS-Protection dropped; imagePullPolicy set
□ Stage 4 Code+image: all C/H/M/L code fixes committed; image rebuilt;
goose migrations for C1/C5/C6/C11/C12 present
□ Stage 5 CI: images digest-pinned + signed + scanned; secrets file-mounted
□ Stage 6 Verify: run 04-verify.sh (extended); work V1–V13
□ Post: Submit myhoneydue.com to hstspreload.org
A redeploy is "clean" only when 04-verify.sh (extended per Stage 6) passes
with zero ✗ lines and every checkbox in the master index is ☑ or ⊘.
Appendix — Incident response playbooks
Preserved from the previous SECURITY.md; still current.
Compromised API token
Rotate SECRET_KEY to invalidate all tokens, then restart api/worker:

    echo "$(openssl rand -hex 32)" > secrets/secret_key.txt
    ./scripts/02-setup-secrets.sh
    kubectl rollout restart deployment/api deployment/worker -n honeydue

(After CODE-C1 lands, tokens are hashed at rest — a DB read no longer yields
usable tokens, but SECRET_KEY rotation remains the kill-switch.)
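To make the "hashed at rest" property concrete, here is a minimal sketch of SHA-256 token storage, assuming high-entropy random tokens. The function names `newToken`, `hashToken` and `matches` are hypothetical and this is not the project's actual CODE-C1 implementation. The point it shows is that the database column holds only a digest, so a leaked row cannot be replayed as a bearer token.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// newToken returns a random bearer token (32 random bytes, hex-encoded) for
// the client, plus the SHA-256 digest that would be stored in the database.
func newToken() (token, storedHash string, err error) {
	raw := make([]byte, 32)
	if _, err = rand.Read(raw); err != nil {
		return "", "", err
	}
	token = hex.EncodeToString(raw)
	return token, hashToken(token), nil
}

// hashToken computes the hex-encoded SHA-256 digest of a presented token.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// matches reports whether a presented token corresponds to a stored digest,
// comparing the two digests in constant time.
func matches(presented, storedHash string) bool {
	got := hashToken(presented)
	return subtle.ConstantTimeCompare([]byte(got), []byte(storedHash)) == 1
}

func main() {
	token, stored, err := newToken()
	if err != nil {
		panic(err)
	}
	fmt.Println("client token accepted:       ", matches(token, stored))
	// Presenting the stored digest itself, as an attacker with a DB dump
	// would, fails because it is hashed again before comparison.
	fmt.Println("stolen digest accepted:      ", matches(stored, stored))
}
```

An unsalted fast hash is a reasonable choice here only because the tokens are random and long. The same scheme would be inappropriate for user-chosen passwords.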
Compromised database credentials
Rotate in the Neon dashboard, update secrets/postgres_password.txt, re-run
02-setup-secrets.sh, restart api/worker, watch logs for connection errors.
Compromised push keys
APNs: revoke in Apple Developer, drop the new .p8 into secrets/, re-run
02-setup-secrets.sh, restart api/worker. FCM: rotate the key in Firebase,
update secrets/fcm_server_key.txt, re-run, restart.
Suspicious pod

    kubectl logs <pod> -n honeydue > /tmp/pod-logs.txt
    kubectl describe pod <pod> -n honeydue > /tmp/pod-describe.txt
    kubectl delete pod <pod> -n honeydue   # deployment recreates it
Communication
Document the timeline privately; on a data breach notify affected users within 72 hours; rotate every potentially-exposed credential; write a post-mortem (root cause, timeline, remediation, prevention).
References
- Audit reports: live_scan_5_12.md, k3_audit_5_12.md, security_scan_5_12.md (repo root)
- Current architecture: docs/deployment/05-security.md
- Roadmap: docs/deployment/20-roadmap.md
- Deploy process: docs/deployment/14-deployment-process.md
- Scripts: deploy-k3s/scripts/{01-provision-cluster,02-setup-secrets,03-deploy,04-verify}.sh
- Manifests: deploy-k3s/manifests/