# honeyDue — Production Security Remediation Plan

This document is the **single source of truth for fixing every security
finding from the 2026-05-12/13 audits, and for keeping those fixes baked
into the stack so a full redeploy never reproduces them.**

It replaces the previous aspirational `SECURITY.md` (which described a
desired state that, per the audits, was never fully true). The accurate
*current* architecture lives in `docs/deployment/05-security.md`; this file
is the **work list**.

**Last updated:** 2026-05-16

**Audit sources (kept at repo root):**

| Tag | File | Scope | Findings |
|---|---|---|---|
| `LIVE` | `live_scan_5_12.md` | External black-box scan of api/admin/app | L1–L20 (20) |
| `K3S` | `k3_audit_5_12.md` | k3s cluster + `honeydue` namespace audit | F1–F17 (17) + 8 coverage gaps |
| `CODE` | `security_scan_5_12.md` | Static audit of `honeyDueAPI-go` | C1–C13, H1–H9, M1–M13, L1–L6 (41) |

**Total: 78 findings + 8 cluster coverage gaps + 13 runtime verification items.**

---

## How to use this document

The plan is organised by **redeploy stage**, not by severity, because the
operator's goal is: *redeploy the entire stack and come up clean.* Each
finding is tagged with where its fix lives:

| Marker | Meaning |
|---|---|
| **In-repo: Y** | Fix lives in a committed file (`config.yaml`, a manifest, a script, Go code, a Dockerfile). Once committed, **every redeploy re-applies it automatically.** |
| **In-repo: N** | Fix is external state (DNS records, Cloudflare dashboard, Hetzner firewall, hstspreload.org). A redeploy does **not** touch it — it survives on its own but must be done once and tracked here. |

**Status legend:** ☐ open · ◐ in progress · ☑ done · ⊘ accepted risk / deferred

**Redeploy stage order** (matches `deploy-k3s/scripts/` run order):

```
Stage 0  DNS & Cloudflare edge            (external; no cluster needed)
Stage 1  Cluster provisioning & node OS   (01-provision-cluster.sh / hetzner-k3s / SSH)
Stage 2  Secrets & config bootstrap       (02-setup-secrets.sh / config.yaml)
Stage 3  Kubernetes manifests             (deploy-k3s/manifests/, applied by 03-deploy.sh)
Stage 4  Application code & images        (honeyDueAPI-go source → rebuilt image)
Stage 5  CI / build pipeline              (image digest pinning, signing, scanning)
Stage 6  Post-deploy verification         (04-verify.sh + runtime investigations)
```

**Golden rule for "redeploy clean":** a fix only counts as done when it is
committed to the file that the redeploy reads. A `kubectl patch` on the live
cluster that is not mirrored into `deploy-k3s/manifests/` **will be wiped on
the next `03-deploy.sh`.** Every entry below names the committed file.

---

## Execution status (2026-05-16)

Stages 2–5 were executed in-repo, then put through an independent code
review (see *Post-remediation independent review* below). The Go module
**builds clean and the full `go test ./...` suite passes.** Four new goose
migrations were added — `000003` (auth-token hashing), `000004` (IAP replay
protection), `000005` (audit-log append-only + `audit_log` table create),
`000006` (`webhook_event_log` table create) — and run automatically via the
migrate Job before the api/worker rollout.

- **~63 findings fixed (☑) and verified** — all of Stage 2 (secrets/config)
  and Stage 3 (Kubernetes manifests), every exploitable Stage 4 application
  finding (all 11 actioned Criticals + the auth / webhook / race / handler
  High & Medium fixes), Stage 5 image digest pinning **and `K3S-F8`
  (secrets are now file-mounted, not env vars)**, plus the in-repo half of
  Stage 1 cluster provisioning — `K3S-F4` (kubeconfig written `0600`),
  `K3S-CG1` (etcd `secrets-encryption`), `K3S-CG2` (fail2ban +
  unattended-upgrades installed at provision). Includes token hashing,
  Google JWKS verification, IAP replay protection, the authorization
  fixes, atomic share-code join, the metrics-endpoint lockdown,
  per-account login lockout, verified-email gating, CSP/HSTS hardening,
  and digest-pinned images.
- **1 partial (◐)** — `CODE-L5`: cosign signing + a Trivy `HIGH,CRITICAL`
  scan are wired (guarded) into `03-deploy.sh`, and a ready-to-use Kyverno
  `ClusterPolicy` ships at `deploy-k3s/manifests/kyverno-verify-images.yaml`.
  Closing it needs two operator actions that cannot be committed: install
  Kyverno in the cluster, and supply a cosign key pair (`COSIGN_KEY` for
  signing + the public key pasted into the policy).
- **Accepted / blocked / moot (⊘)** — `M3` (Apple nonce — blocked on an
  iOS-client change), `C12` (moot — accounts are hard-deleted),
  `LIVE-L14`/`L15` (UUID migration — planned quarter), `LIVE-L17`/`L18`/
  `L20` (no security impact — see entries), `F15`/`F16` (architectural),
  and `LIVE-L2`/`L3`/`L4` (DMARC / SPF / CAA — operator-declined, below).
- **Operator-declined — Stage 0 DNS (`LIVE-L2`/`L3`/`L4`).** The operator
  has opted not to add the DMARC, SPF-hardening, and CAA DNS records this
  cycle. For the record: these are **not** a paid-Cloudflare feature —
  DMARC and SPF are ordinary TXT records and CAA is an ordinary CAA
  record, all addable on any Cloudflare plan including Free. They remain
  genuine email-spoofing / certificate-issuance gaps and are marked ⊘;
  revisit when DNS is next touched.
- **Remaining operator runtime steps (no code to commit)** — on the
  *existing* cluster: `k3s secrets-encrypt` enable/reencrypt (`K3S-CG1` /
  `V12`) and `chmod 600` the live kubeconfig (`K3S-F4`); the SSH/sysctl
  half of `K3S-CG2`; and the `K3S-CG3`–`CG8` verification items. A full
  *fresh* provision already comes up with `K3S-F4`/`CG1`/`CG2` (fail2ban +
  unattended-upgrades) applied straight from `_config.sh`.

**Operator note:** `C1` (token hashing) invalidates every existing login
session once at deploy and makes login single-session per user — see the
`CODE-C1` entry. The status boxes in the master index below are authoritative.

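As a minimal sketch of what hashed token storage (`C1`) involves, the Go below stores only the SHA-256 digest of a token and hashes the presented token before lookup. The names `hashToken`, `tokenStore`, and `lookup` are hypothetical and not taken from `honeyDueAPI-go`. The sketch also shows that a row still holding a plaintext token stops matching once lookups go through the hash, which would be consistent with the one-time session invalidation noted above; the authoritative description is in the `CODE-C1` entry.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// hashToken returns the hex-encoded SHA-256 digest of a raw auth token.
// Only this digest would be persisted; the raw token stays with the client.
func hashToken(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

// tokenStore is a hypothetical in-memory stand-in for the token table,
// keyed by digest rather than by raw token.
type tokenStore map[string]int // digest -> user ID

// lookup resolves a presented raw token by hashing it first.
func (s tokenStore) lookup(raw string) (userID int, ok bool) {
	userID, ok = s[hashToken(raw)]
	return
}
```

With this shape, a database leak exposes digests only, and a digest cannot be replayed as a bearer token.
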
## Post-remediation independent review (2026-05-16)

The change set went through **two** independent review passes; the deploy-time
verification below (build, `go test -race`, full `goose up` against real
PostgreSQL 16) was executed and passed.

**First pass.** A separate review agent audited the full change set against the
three audit files. It surfaced three **deploy-breaking** defects that a green
`go test` could not catch — the test harness builds two tables via GORM
`AutoMigrate`, which production never runs — all since fixed:

- **`audit_log` table was never created by a migration.** `000005` added
  append-only triggers to a table that exists only in the test DB, so a
  from-scratch `goose up` would fail on `000005`. `000005` now does
  `CREATE TABLE IF NOT EXISTS audit_log` before the triggers.
- **`webhook_event_log` table was never created by a migration.** The H6
  fail-closed webhook dedup turns a missing table into a 500 on every
  subscription webhook. New migration `000006` creates it.
- **`000004`'s `google_purchase_token` unique index could fail to build** on
  a production table already holding duplicate tokens — exactly the C6
  replay the migration fixes. `000004` now de-duplicates (keep-earliest,
  NULL-the-rest) before creating the index.

It also tightened the C13 Apple-webhook lookup (`subscription_webhook_handler.go`)
so the legacy substring scan runs only on a genuine `ErrRecordNotFound`,
never masking a real DB error as "not found".

**Second pass (master review).** A second, independent security-audit agent
re-verified all four first-pass fixes (correct), ran `go test -race` (0 data
races) and the full `goose up`/`down` chain against real PostgreSQL (clean,
idempotent), and returned **GO** with one HIGH finding, since fixed:

- **HIGH-1 — Redis password leaked via the `honeydue-config` ConfigMap.**
  `_config.sh` built `REDIS_URL` with the password embedded inline, and that
  URL is emitted into the `honeydue-config` ConfigMap (delivered to pods via
  `envFrom`). ConfigMaps are *not* covered by `secrets-encryption` and are
  readable by any principal with `get configmap` — so `K3S-F1`/`K3S-F8` were
  not actually fully closed. **Fixed (2026-05-16):** `_config.sh` now emits
  `REDIS_URL=redis://redis:6379/0` with no credentials; the password travels
  only as the file-mounted `REDIS_PASSWORD` secret. The API applies it in
  `cache_service.go`; `cmd/worker/main.go` now applies it onto the parsed
  Asynq `RedisClientOpt` so the server/inspector/monitoring client all
  authenticate against the `requirepass` Redis.

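The HIGH-1 fix separates the Redis address (non-secret, in the ConfigMap) from the password (file-mounted secret). A rough sketch of combining the two at startup follows. `redisOpts` is a hypothetical stand-in for a client option struct such as Asynq's `RedisClientOpt`, `buildRedisOpts` is an illustrative name, and rejecting a URL that still embeds credentials is an extra guard assumed here rather than something the source states.

```go
package main

import (
	"errors"
	"net/url"
	"strconv"
	"strings"
)

// redisOpts is a hypothetical stand-in for a client option struct such as
// Asynq's RedisClientOpt.
type redisOpts struct {
	Addr     string
	DB       int
	Password string
}

// buildRedisOpts parses a credential-free URL and attaches the password that
// was read separately from the file-mounted secret. A URL that still embeds
// credentials is rejected so a secret cannot creep back into the ConfigMap.
func buildRedisOpts(rawURL, password string) (redisOpts, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return redisOpts{}, err
	}
	if u.User != nil {
		return redisOpts{}, errors.New("REDIS_URL must not embed credentials")
	}
	db := 0
	if p := strings.TrimPrefix(u.Path, "/"); p != "" {
		if db, err = strconv.Atoi(p); err != nil {
			return redisOpts{}, err
		}
	}
	return redisOpts{Addr: u.Host, DB: db, Password: password}, nil
}
```
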
The master review's other seven findings (4 Medium, 3 Low — none
deploy-blocking) were then **all fixed (2026-05-16)**:

- **MEDIUM-1 — re-login left the prior token usable for ≤5 min.**
  `CreateFreshToken` deleted the old token row but not its Redis cache entry.
  It now also returns the deleted tokens' hashes; `AuthService.freshToken`
  evicts them via the new `CacheService.InvalidateAuthTokenHashes` on every
  login / Apple / Google sign-in, so a prior (e.g. stolen) token stops
  authenticating immediately.
- **MEDIUM-2 — IAP `.p8` mode check incompatible with k8s.** The Apple IAP
  key check (`iap_validation.go`) required `0600`-or-stricter, unattainable
  on a k8s Secret volume (`0440` under `fsGroup`). It now rejects only
  world-accessible keys (`perm & 0o007`).
- **MEDIUM-3 — single-IP account-lockout DoS.** The `M5` per-account lockout
  is now keyed on the *set of distinct source IPs* that have failed
  (`RegisterLoginFailure` takes the IP, tracks a Redis set; lock at 5
  distinct IPs). One attacker IP can no longer lock a victim out by spamming
  failures; genuinely distributed stuffing still trips it. `Login` now takes
  the client IP (`c.RealIP()`).
- **MEDIUM-4 — Redis no-auth deployable.** `02-setup-secrets.sh` now `die`s
  (was `warn`) when `redis.password` is empty, so a deploy can no longer
  bring up an unauthenticated Redis (`K3S-F1`).
- **LOW-1 / LOW-2 — missing regression tests.** Added: `config_test.go`
  asserts `validate()` refuses `DEBUG_FIXED_CODES` with `DEBUG=false` (`C4`);
  `subscription_repo_test.go` asserts a second account cannot bind an Apple
  transaction / Google purchase token already bound to another (`C5`/`C6`).
- **LOW-3 — device-token 409.** A recycled APNs/FCM token re-registering
  under a new account is now reassigned to that account (and logged) instead
  of returning a 409 that locked the legitimate new device owner out of push.

One earlier (first-pass) hardening item remains a **tracked follow-up**, not
re-raised by the master review and not deploy-blocking: `/metrics` is gated
by an `X-Forwarded-For` check rather than network-isolated. True isolation
needs `/metrics` on a separate port plus a NetworkPolicy restricting the
scrape to vmagent — an architectural change deferred to a later cycle.

## Consolidated work items (fix once, closes many)

Several findings are the same defect seen from three angles. Do the work
once at the listed anchor; the rest close with it.

| Theme | Anchor | Also closes |
|---|---|---|
| Auth-endpoint rate limiting | Stage 3 `auth-rate-limit` middleware + Stage 4 app limiter | `K3S-F10`, `LIVE-L12`, `CODE-H1`, `CODE-H2`, `CODE-H3`, `CODE-M5` |
| CSP / cross-origin headers | Stage 3 `security-headers` + Stage 4 app CSP | `K3S-F9`, `LIVE-L8` |
| HSTS `preload` | Stage 3 middleware + Stage 0 list submission | `LIVE-L5`, `CODE-L3` |
| Admin ingress hardening | Stage 2 secret + Stage 3 middleware wiring | `K3S-F2`, `K3S-F3`, `CODE-L6` |
| etcd encryption at rest | Stage 1 `--secrets-encryption` | `K3S-CG1`, `CODE-M9` |
| Image digest pinning + signing | Stage 5 CI | `K3S-F5`, `K3S-F14`, `CODE-L4`, `CODE-L5` |
| Pagination hard caps | Stage 4 app | `LIVE-L16`, `CODE-M6` |
| imagePullSecret name consistency | Stage 3 manifests + Stage 2 script | `K3S-F6` |

**Known contradiction to resolve before planning Stage 4:** `LIVE-L18` says
*no account-deletion endpoint exists* (every `DELETE` path 404/400), but
`CODE-M13` points at a delete handler at `auth_handler.go:488-539`. Either
the endpoint exists at a path the external scan never probed, or it is
mounted but unreachable. **Confirm the route in `internal/router/router.go`
first** — the fix differs (add an endpoint vs. expose/rate-limit an existing
one). Tracked as verification item `V11`.

---

## Master finding index

Every finding, ordered by redeploy stage. Use this as the live tracker —
flip the Status box as work lands.

### Stage 0 — DNS & Cloudflare edge

| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `LIVE-L2` | HIGH | No DMARC record — email spoofing open | N | ⊘ |
| `LIVE-L3` | MED | SPF ends `?all` (neutral — fails open) | N | ⊘ |
| `LIVE-L4` | MED | No CAA records — any CA may issue certs | N | ⊘ |
| `LIVE-L6` | LOW | No `/.well-known/security.txt` | Y | ☐ |
| `LIVE-L9` | INFO | Aggressive Cloudflare caching on admin SSR shell | N | ☐ |
| `LIVE-L10` | INFO | `x-powered-by: Next.js` framework leak | Y | ☐ |

### Stage 1 — Cluster provisioning & node OS

| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `K3S-F4` | HIGH | Node kubeconfig world-readable (mode 644) | Y | ☑ |
| `K3S-F15` | INFO | Nodes on public IPs, no private VPC | Y | ⊘ |
| `K3S-F16` | INFO | All 3 nodes are control-plane + etcd + worker | Y | ⊘ |
| `K3S-F17` | INFO | Single-replica SPOFs (redis/worker/admin/vmagent) | Y | ☐ |
| `K3S-CG1` | — | etcd encryption at rest not verified (`--secrets-encryption`) | Y | ☑ |
| `K3S-CG2` | — | Node OS hardening: SSH, fail2ban, unattended-upgrades, sysctl | Y/N | ◐ |
| `K3S-CG3` | — | Hetzner Cloud Firewall rules not verified | N | ☐ |
| `K3S-CG4` | — | etcd snapshot backup destination/encryption not verified | Y | ☐ |
| `K3S-CG5` | — | kubelet flags (`--anonymous-auth=false`, webhook authz) not verified | Y | ☐ |
| `K3S-CG6` | — | Container-runtime CIS controls (`kube-bench`) not run | N | ☐ |
| `K3S-CG7` | — | `deploy` user sudoers least-privilege not verified | N | ☐ |
| `K3S-CG8` | — | `/etc/rancher/k3s/` dir + server-token perms not verified | N | ☐ |

### Stage 2 — Secrets & config bootstrap

| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `K3S-F1` | **CRIT** | Redis runs with no authentication | Y | ☑ |
| `K3S-F3` | HIGH | `admin-basic-auth` secret never created | Y | ☑ |
| `K3S-F12` | MED | Secrets unrotated since cluster bootstrap; no runbook | Y | ☑ |
| `CODE-C4` | **CRIT** | `DEBUG_FIXED_CODES` "123456" auth bypass if it reaches prod | Y | ☑ |
| `CODE-M8` | MED | `SECRET_KEY` hardcoded debug fallback | Y | ☑ |

> **Stage 2 status (2026-05-15):** `config.yaml` now carries a Redis
> password and admin basic-auth user/password; `02-setup-secrets.sh` uses
> bcrypt (`htpasswd -nbB`); `internal/config/config.go` generates an
> ephemeral random `SECRET_KEY` in debug instead of a static fallback and
> refuses to boot if `DEBUG_FIXED_CODES` is set with `DEBUG=false`; the
> rotation runbook is at `docs/runbooks/secret-rotation.md`. All take
> effect on the next `02-setup-secrets.sh` + `03-deploy.sh`.

### Stage 3 — Kubernetes manifests

| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `K3S-F2` | HIGH | Admin ingress missing `cloudflare-only` + `admin-auth` | Y | ☑ |
| `K3S-F6` | HIGH | `imagePullSecrets` name mismatch (`ghcr-credentials`) | Y | ☑ |
| `K3S-F7` | MED | `vmagent` container missing `securityContext` | Y | ☑ |
| `K3S-F9` | MED | `security-headers` missing COOP/COEP/CORP | Y | ☑ |
| `K3S-F10` | MED | Uniform rate limit — no auth-endpoint tightening | Y | ☑ |
| `K3S-F11` | MED | `automountServiceAccountToken` not disabled | Y | ☑ |
| `K3S-F13` | LOW | `CORS_ALLOWED_ORIGINS` missing `app.myhoneydue.com` | Y | ☑ |
| `K3S-F14` | LOW | Public images (`redis`, `vmagent`) pinned by tag | Y | ☑ |
| `LIVE-L5` | LOW | HSTS not preload-eligible | Y | ☑ |
| `LIVE-L7` | LOW | Deprecated `X-XSS-Protection` header | Y | ☑ |
| `LIVE-L8` | LOW | CSP missing `object-src`/`base-uri`; COOP/COEP/CORP absent | Y | ☑ |
| `CODE-L3` | LOW | HSTS missing `preload` (duplicate of `LIVE-L5`) | Y | ☑ |
| `CODE-L4` | LOW | `imagePullPolicy` not set on Deployments | Y | ☑ |
| `CODE-L6` | LOW | Admin `admin-auth` middleware defined, not attached | Y | ☑ |

> **Stage 3 status (2026-05-15):** admin ingress now chains
> `cloudflare-only` + `admin-auth` + `security-headers` + `rate-limit`; a
> dedicated `honeydue-api-auth` Ingress applies a new `auth-rate-limit`
> middleware (5/min, burst 10) to login / register / forgot-password /
> reset-password / join-with-code; `security-headers` gained COOP + CORP,
> HSTS is now `max-age=63072000; …; preload`, and the deprecated
> `X-XSS-Protection` (`browserXssFilter`) is removed; `vmagent` has a
> container `securityContext`; all workload pods + the migrate Job set
> `automountServiceAccountToken: false` explicitly (on top of the
> rbac.yaml ServiceAccount-level setting that already existed); the
> registry secret is `gitea-credentials` everywhere;
> `imagePullPolicy: IfNotPresent` is explicit on every container; CORS
> includes `app.myhoneydue.com`. **Still open:** `K3S-F14` (public-image
> digest pins) is folded into Stage 5 with `K3S-F5` (since completed; see
> the Stage 5 status); `LIVE-L8` is partial — the COOP/CORP half shipped
> here, the CSP `object-src`/`base-uri` half is an app change tracked in
> Stage 4.

### Stage 4 — Application code & container images

| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `CODE-C1` | **CRIT** | Auth tokens stored plaintext in DB | Y | ☑ |
| `CODE-C2` | **CRIT** | Google ID token not verified locally | Y | ☑ |
| `CODE-C3` | **CRIT** | Google `iss` claim never validated | Y | ☑ |
| `CODE-C5` | **CRIT** | Apple IAP receipt replay across accounts | Y | ☑ |
| `CODE-C6` | **CRIT** | Google purchase-token replay across accounts | Y | ☑ |
| `CODE-C7` | **CRIT** | File-ownership check excludes residence owners | Y | ☑ |
| `CODE-C8` | **CRIT** | Device-token cross-account hijack on re-register | Y | ☑ |
| `CODE-C9` | **CRIT** | Share-code join not atomic (Add+Deactivate race) | Y | ☑ |
| `CODE-C10` | **CRIT** | Subscription upgrade race — validation outside txn | Y | ☑ |
| `CODE-C11` | **CRIT** | Task-completion duplicate-row race | Y | ☑ |
| `CODE-C12` | **CRIT** | Soft-deleted email reusable; `is_active` not filtered | Y | ⊘ |
| `CODE-C13` | **CRIT** | Apple webhook user lookup may LIKE-match | Y | ☑ |
| `CODE-H1` | HIGH | Rate limit doesn't cover all auth surfaces | Y | ☑ |
| `CODE-H2` | HIGH | No rate limit on `join-with-code` | Y | ☑ |
| `CODE-H3` | HIGH | No rate limit on `register` | Y | ☑ |
| `CODE-H4` | HIGH | Modulo bias in 6-digit code generation | Y | ☑ |
| `CODE-H5` | HIGH | Apple IAP `.p8` loaded with no file-mode check | Y | ☑ |
| `CODE-H6` | HIGH | Webhook dedup fail-open | Y | ☑ |
| `CODE-H7` | HIGH | Auth-failure log lacks IP/User-Agent | Y | ☑ |
| `CODE-H8` | HIGH | `X-Timezone` header trusted for trial-start calc | Y | ☑ |
| `CODE-H9` | HIGH | Share-code `Deactivate` error swallowed | Y | ☑ |
| `CODE-M1` | MED | HTTP header injection via `Content-Disposition` filename | Y | ☑ |
| `CODE-M2` | MED | bcrypt cost = 10 (recommend 12) | Y | ☑ |
| `CODE-M3` | MED | Apple Sign In nonce not validated | Y | ⊘ |
| `CODE-M4` | MED | Email verification not atomic | Y | ☑ |
| `CODE-M5` | MED | Per-user rate limiting absent | Y | ☑ |
| `CODE-M6` | MED | List endpoints uncapped (Documents/Contractors/Residences) | Y | ☑ |
| `CODE-M7` | MED | Audit log not append-only | Y | ☑ |
| `CODE-M10` | MED | `node:20-alpine` floating tag in Dockerfile | Y | ☑ |
| `CODE-M11` | MED | `golang.org/x/crypto v0.49.0` outdated | Y | ☑ |
| `CODE-M12` | MED | Contractor toggle refetch race | Y | ☑ |
| `CODE-M13` | MED | Account-deletion endpoint unrate-limited | Y | ☑ |
| `CODE-L1` | LOW | Login inactive-account error enables enumeration | Y | ☑ |
| `CODE-L2` | LOW | Auth responses lack `Cache-Control: no-store` | Y | ☑ |
| `LIVE-L1` | HIGH | `/metrics` publicly exposed on `api.myhoneydue.com` | Y | ☑ |
| `LIVE-L11` | HIGH | Login user-enumeration via timing | Y | ☑ |
| `LIVE-L12` | HIGH | No rate-limit on `/api/auth/login/` | Y | ☑ |
| `LIVE-L13` | HIGH | Password-reset user-enumeration via timing | Y | ☑ |
| `LIVE-L14` | MED | Sequential integer user IDs leak userbase size | Y | ⊘ |
| `LIVE-L15` | MED | Sequential integer resource IDs (same risk) | Y | ⊘ |
| `LIVE-L16` | MED | Pagination `limit` accepted at any size | Y | ☑ |
| `LIVE-L17` | LOW | Garbage pagination params silently accepted | Y | ⊘ |
| `LIVE-L18` | LOW | No account-deletion endpoint (GDPR gap) | Y | ⊘ |
| `LIVE-L19` | LOW | Email verification not enforced | Y | ☑ |
| `LIVE-L20` | INFO | Profile-update silently drops unknown fields | Y | ⊘ |

> **Stage 4 handler/misc batch status (2026-05-15):** `M1` —
> `Content-Disposition` filenames are sanitized (control chars / quote /
> backslash stripped) so an upload filename cannot inject response
> headers. `M7` — migration `000005` creates the `audit_log` table (no
> prior migration did — `CREATE TABLE IF NOT EXISTS`) and makes it
> append-only via BEFORE UPDATE/DELETE triggers. `M11` —
> `golang.org/x/crypto` bumped
> `v0.49.0 → v0.51.0`. `M13` — `DELETE /api/auth/account` now carries the
> Traefik `auth-rate-limit` edge limiter. `LIVE-L18` ⊘ — not a real gap:
> the endpoint **exists** at `DELETE /api/auth/account/`
> (`router.go:546`); the live scan probed `/api/auth/me/`, `/auth/delete/`,
> `/users/me/` and missed it. **Update (2026-05-15):** items shown as
> deferred in an earlier draft were then completed — `LIVE-L1` (`/metrics`
> rejects proxied/public requests via an `X-Forwarded-For` check, so only
> the in-cluster vmagent scrape reaches it), `M6`/`LIVE-L16` (the
> document/contractor list repos already hard-cap at 500 rows), and
> `LIVE-L19` (verified-email gating on share-code generation via the new
> `RequireVerified` middleware). `LIVE-L17` (inert pagination params,
> results capped) and `LIVE-L20` (whitelist profile update is the correct
> pattern) are closed as no-security-impact (⊘). The master index above is
> authoritative.

> **Stage 4 races batch status (2026-05-15):** `C9`/`H9` — share-code
> redemption is now one locked transaction in
> `ResidenceRepository.JoinWithShareCode` (lock the code row, re-check
> validity, add member, deactivate — a deactivation failure aborts the
> join). `C11` — the task-completion duplicate-row race was *already*
> closed: the completion insert and the optimistically-version-locked task
> update share one transaction, so a concurrent completion fails
> `ErrVersionConflict` and rolls back its inserted row; no
> `UNIQUE(task_id, completed_date)` was added (it would reject legitimate
> same-day re-completions and risk a migration failure on existing data).
> `M4` — email verification's find/consume/flag writes are now one
> transaction. `M12` — a concurrent contractor delete now yields a clean
> 404. `C12` ⊘ — premise moot: the app **hard-deletes** accounts
> (`DeleteUserCascade`), so there is no soft-deleted user whose email
> lingers, and `ExistsByEmail` already blocks re-registering a
> *deactivated* user's email.
>
> **Stage 4 auth batch status (2026-05-15):** C1, C2, C3 done (see entries
> below). Rate limiting — every sensitive auth path now carries the shared
> Traefik `auth-rate-limit` edge limiter (login/register/forgot/reset/
> verify-reset/apple/google/refresh/join-with-code); login/register/forgot/
> reset/apple/google additionally keep the per-IP app limiter
> (`H1`/`H2`/`H3`/`LIVE-L12`). `H4` rejection-sampled codes, `M2` bcrypt
> cost 12, `L1`+`LIVE-L11` constant-time generic-error login, `L2`
> `no-store` on auth responses, `H7` IP/UA in auth logs, `LIVE-L13`
> fully-async forgot-password — all done; `go build ./...` and the
> `models`/`repositories`/`middleware`/`handlers`/`services` test packages
> pass. **Deferred:** `M3` (Apple nonce) — needs the iOS client to
> generate and send a nonce; server-only validation would reject every
> Apple login, so this is blocked on a coordinated mobile change. `H8` —
> the `parseTimezone` ±14h cap shipped; the "use server UTC for
> trial-start" half is folded into Stage 4's subscription work. `M5`
> per-account lockout (Redis) deferred — the edge + per-IP app limiters +
> the existing per-account password-reset counter cover the practical
> risk; a true per-account login lockout remains a tracked enhancement.
> **Update (2026-05-16):** the per-account login lockout has since
> shipped, keyed on distinct source IPs (see `MEDIUM-3` in the
> post-remediation review).

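The `H4` fix is described as rejection-sampled codes. A minimal sketch of that technique is shown below; `newSixDigitCode` is a hypothetical name and the 32-bit draw is one possible choice, not necessarily what the repository does. The idea is to discard any random value that falls in the incomplete final bucket, so every code in the range is equally likely.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

const codeSpace = 1_000_000 // number of distinct 6-digit codes

// newSixDigitCode draws a uniformly distributed code in [000000, 999999].
// A 32-bit draw is discarded when it falls in the incomplete final bucket,
// which is what removes the modulo bias.
func newSixDigitCode() (string, error) {
	// limit is the largest multiple of codeSpace that fits in 32 bits.
	const limit = (1 << 32) - (1<<32)%codeSpace
	var buf [4]byte
	for {
		if _, err := rand.Read(buf[:]); err != nil {
			return "", err
		}
		v := uint64(binary.BigEndian.Uint32(buf[:]))
		if v < limit {
			return fmt.Sprintf("%06d", v%codeSpace), nil
		}
	}
}
```

With these constants fewer than 0.03% of draws are rejected, so the loop almost always finishes on the first pass.
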
### Stage 5 — CI / build pipeline

| ID | Sev | Finding | In-repo | Status |
|---|---|---|---|---|
| `K3S-F5` | HIGH | Images pinned by mutable short SHA tag, not digest | Y | ☑ |
| `K3S-F8` | MED | Secrets injected as env vars, not file mounts | Y | ☑ |
| `CODE-L5` | LOW | No image signing (cosign) in CI | Y | ◐ |

> **Stage 5 status (2026-05-15):** `CODE-M11` done — `golang.org/x/crypto`
> bumped `v0.49.0 → v0.51.0` (with the `x/sys`/`x/term`/`x/text` bumps
> `go get -u` pulled in), `go mod tidy` clean, full build + test green.
> **Update (2026-05-15):** `K3S-F5`/`K3S-F14`/`CODE-M10` are done —
> `03-deploy.sh` resolves the image digest after each push and deploys
> api/worker/admin/web by `@sha256:`, and redis/vmagent/`node:20-alpine`
> are pinned to their resolved index digests.
> **Update (2026-05-16):** `K3S-F8` is **done** — the `api`/`worker`
> Deployments mount `honeydue-secrets` as files (`defaultMode: 0400`) at
> `/etc/honeydue/secrets` and inject no secret as an env var;
> `config.loadFileSecrets` reads them; `02-setup-secrets.sh` now writes
> `B2_KEY_ID`/`B2_APP_KEY` into the secret, reconciling the earlier
> script-vs-manifest drift. `CODE-L5` stays **◐** — cosign signing and a
> Trivy `HIGH,CRITICAL` scan are wired (guarded) into `03-deploy.sh` and a
> ready-to-use Kyverno `ClusterPolicy` ships at
> `deploy-k3s/manifests/kyverno-verify-images.yaml`; closing it needs the
> operator to install Kyverno and supply a cosign key. See both entries.

### Stage 6 — Post-deploy verification & runtime investigations

`V1`–`V13` — see [Stage 6](#stage-6--post-deploy-verification--runtime-investigations).

---

## Stage 0 — DNS & Cloudflare edge

External state at Cloudflare. Not touched by `03-deploy.sh`, so a redeploy
neither breaks nor re-applies these — do them once and leave them. Tracked
here so they are never forgotten on a domain move or DNS migration.

### `LIVE-L2` — Add DMARC record · HIGH · ⊘

- **Operator decision (2026-05-16):** declined for this cycle. A DMARC record is an ordinary DNS TXT record — it is **not** gated behind a paid Cloudflare plan and can be added on Free. This remains a real email-spoofing gap; revisit when DNS is next touched.
- **Where:** Cloudflare DNS, TXT record at `_dmarc.myhoneydue.com`.
- **Fix:** Publish `v=DMARC1; p=quarantine; rua=mailto:dmarc@myhoneydue.com; ruf=mailto:dmarc@myhoneydue.com; fo=1; aspf=s; adkim=s`. Start at `pct=10` for 30 days, watch the `rua` aggregate reports, then ramp to `pct=100` and finally `p=reject`.
- **Verify:** `dig +short TXT _dmarc.myhoneydue.com` returns the record.

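Beyond confirming that the record exists, the verification step could also check the published policy tags. The sketch below is one way to do that in Go; `parseDMARC` is a hypothetical helper that expects the TXT value with the surrounding quotes from the `dig` output already removed.

```go
package main

import "strings"

// parseDMARC splits a DMARC TXT record into its tag=value pairs.
// It returns nil when the record does not start with the v=DMARC1 tag.
func parseDMARC(record string) map[string]string {
	parts := strings.Split(record, ";")
	if strings.TrimSpace(parts[0]) != "v=DMARC1" {
		return nil
	}
	tags := make(map[string]string)
	for _, part := range parts {
		key, value, ok := strings.Cut(strings.TrimSpace(part), "=")
		if ok {
			tags[strings.TrimSpace(key)] = strings.TrimSpace(value)
		}
	}
	return tags
}
```

During the ramp described above, a check on `tags["p"]` and `tags["pct"]` would show which stage of the rollout is live.
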
### `LIVE-L3` — Tighten SPF from `?all` to `-all` · MEDIUM · ⊘
|
||
- **Operator decision (2026-05-16):** declined for this cycle. SPF is an ordinary DNS TXT record, editable on any Cloudflare plan including Free. The `?all` (neutral) qualifier leaves spoofed mail un-penalised; revisit alongside `LIVE-L2`.
|
||
- **Where:** Cloudflare DNS, TXT record at `myhoneydue.com`.
|
||
- **Fix:** Change `v=spf1 include:spf.messagingengine.com ?all` → `~all` for ~7 days, confirm no legitimate mail (CI, transactional) is missed, then `-all`. Do this **after** `LIVE-L2`'s DMARC ramp begins.
|
||
- **Verify:** `dig +short TXT myhoneydue.com | grep spf` shows `-all`.
|
||
|
||
### `LIVE-L4` — Add CAA records · MEDIUM · ⊘
|
||
- **Operator decision (2026-05-16):** declined for this cycle. CAA is an ordinary DNS record type, addable on any Cloudflare plan including Free. Without it, any public CA may issue a cert for the domain; revisit when DNS is next touched.
|
||
- **Where:** Cloudflare DNS, apex `myhoneydue.com`.
|
||
- **Fix:** Add `0 issue "letsencrypt.org"`, `0 issuewild "letsencrypt.org"`, `0 iodef "mailto:security@myhoneydue.com"`. Add `0 issue "pki.goog"` only if Google Trust Services is used anywhere. Confirm against the CAs Cloudflare Universal SSL actually uses before locking down.
|
||
- **Verify:** `dig +short CAA myhoneydue.com` returns the records.
|
||
|
||
### `LIVE-L6` — Publish `security.txt` · LOW · ☐ · In-repo: Y
|
||
- **Where:** served by the Go API and/or Next.js apps at `/.well-known/security.txt` (RFC 9116) — committed route, so it survives redeploys.
|
||
- **Fix:** Serve `Contact:`, `Expires:`, `Preferred-Languages:`, `Canonical:` on both `api.myhoneydue.com` and the apex.
- **Verify:** `curl https://api.myhoneydue.com/.well-known/security.txt` → 200.
### `LIVE-L9` — Review Cloudflare caching of the admin SSR shell · INFO · ☐
- **Where:** Cloudflare cache rules for `admin.myhoneydue.com`.
- **Fix:** `cache-control: s-maxage=31536000` on admin SSR pages means Cloudflare caches the admin shell for a year. Confirm this is intentional; if the admin shell ever contains per-session content, add a bypass-cache rule for `admin.myhoneydue.com`.
- **Verify:** `curl -sI https://admin.myhoneydue.com/ | grep -i cache` reflects the intended policy.
### `LIVE-L10` — Suppress `x-powered-by` · INFO · ☐ · In-repo: Y
- **Where:** Next.js config in the admin and web repos (`next.config.js` → `poweredByHeader: false`). Committed, survives redeploys.
- **Fix:** Disable the `x-powered-by: Next.js` header.
- **Verify:** `curl -sI https://admin.myhoneydue.com/ | grep -i x-powered-by` returns nothing.
---
## Stage 1 — Cluster provisioning & node OS
Run by `01-provision-cluster.sh` (which drives the `hetzner-k3s` CLI from
`config.yaml` via `generate_cluster_config` in `_config.sh`) plus one-time
SSH hardening on each node. **Any k3s server flag must be set in the
`hetzner-k3s` cluster config so a cluster rebuild applies it.**
### `K3S-F4` — kubeconfig world-readable (mode 644 → 600) · HIGH · ☑ · In-repo: Y
- **Where:** `_config.sh` → `generate_cluster_config` → `k3s_config_file`. Node file `/etc/rancher/k3s/k3s.yaml`.
- **Done (2026-05-16):** `generate_cluster_config` now emits `write-kubeconfig-mode: "0600"` in the k3s config file, so any fresh provision writes the node kubeconfig as `0600`.
- **Operator step on the existing cluster:** a running node keeps the mode it was installed with — `ssh deploy@<node> 'sudo chmod 600 /etc/rancher/k3s/k3s.yaml'` on each. Deploy scripts still read it via `sudo`.
- **Verify:** `ssh deploy@<node> 'sudo stat -c %a /etc/rancher/k3s/k3s.yaml'` → `600`.
### `K3S-CG1` / `CODE-M9` — etcd / Secret encryption at rest · ☑ · In-repo: Y
- **Where:** `_config.sh` → `generate_cluster_config` → `k3s_config_file`.
- **Done:** the k3s config file carries `secrets-encryption: true`, so a fresh provision boots with AES Secret encryption enabled. (The `write-kubeconfig-mode` line for `K3S-F4` was added next to it on 2026-05-16.)
- **Operator step on the existing cluster:** a cluster provisioned *without* the flag does not retro-encrypt — run `k3s secrets-encrypt enable` then `k3s secrets-encrypt reencrypt` once. Tracked as `V12`.
- **Verify:** `k3s secrets-encrypt status` reports `Encryption Status: Enabled` on every server node.
- **Note:** the old `SECURITY.md` *claimed* this was already on — `04-verify.sh` greps for the string but cannot truly confirm; see `V12`.
### `K3S-CG2` — Node OS hardening · ◐ · In-repo: partial
- **Where:** `_config.sh` → `generate_cluster_config` → `post_create_commands` (runs on every node at provision).
- **Done (2026-05-16):** `post_create_commands` now installs and enables `fail2ban` (SSH brute-force bans) and `unattended-upgrades` (automatic security patching) on every node at provision time — a fresh cluster comes up hardened on both.
- **Still operator (runtime; not yet in-repo):**
  - SSH — confirm `PermitRootLogin no`, `PasswordAuthentication no`, `AllowUsers deploy`, modern ciphers/MACs/KEX. (hetzner-k3s provisions key-only SSH; verify and tighten.)
  - sysctl — confirm `net.ipv4.ip_unprivileged_port_start=0` (Traefik) and standard network-hardening sysctls.
- **Verify:** `ssh deploy@<node> 'fail2ban-client status sshd; systemctl is-enabled unattended-upgrades'`.
### `K3S-CG3` — Hetzner Cloud Firewall rules · ☐ · In-repo: N
- **Fix:** Confirm only: `:443` from Cloudflare CIDRs, `:22` from operator IP(s), `:6443` from operator IP(s). Nothing else. This is the *only* network defense for the public-IP nodes (`K3S-F15`).
- **Verify:** `hcloud firewall describe honeydue-fw` matches the intended ruleset; a direct `curl` to a node IP on `:80`/`:443` from a non-CF host times out.
### `K3S-CG4` — etcd snapshot backup · ☐ · In-repo: Y
- **Fix:** Confirm k3s etcd snapshots are enabled (k3s defaults to one every 12 hours, keeping the last 5) and shipped off-node: set `--etcd-s3` (to Backblaze B2) with encryption. Without offsite snapshots, a 3-node loss is unrecoverable.
- **Verify:** `ls /var/lib/rancher/k3s/server/db/snapshots/` on a node + an object in the B2 backup bucket.
### `K3S-CG5` — kubelet authn/authz flags · ☐ · In-repo: Y
- **Fix:** Confirm `--anonymous-auth=false` and `--authorization-mode=Webhook` on the kubelet (k3s defaults are usually safe — verify, don't assume). Set via k3s `kubelet-arg` in the cluster config if missing.
- **Verify:** `kubectl get --raw /api/v1/nodes/<node>/proxy/configz` shows the expected kubelet config.
### `K3S-CG6` — Container-runtime CIS baseline · ☐ · In-repo: N
- **Fix:** Run `kube-bench` once; remediate any FAIL lines that aren't k3s-by-design.
- **Verify:** `kube-bench` run archived with FAILs triaged.
### `K3S-CG7` — `deploy` user sudoers least-privilege · ☐ · In-repo: N
- **Fix:** Current `deploy ALL=(ALL) NOPASSWD: ALL` means an SSH-key compromise = node root. Scope to the commands deploys actually need (`ufw`, `systemctl`, `chmod` on k3s.yaml, `cat` of k3s.yaml). Accept the convenience trade-off only with eyes open.
- **Verify:** `ssh deploy@<node> 'sudo -l'` shows the scoped list.
### `K3S-CG8` — `/etc/rancher/k3s/` perms · ☐ · In-repo: N
- **Fix:** `/var/lib/rancher/k3s/server/token` and `/var/lib/rancher/k3s/server/node-token` must be `0600 root:root`; `/etc/rancher/k3s/` not world-traversable.
- **Verify:** `ssh deploy@<node> 'sudo stat -c "%a %n" /var/lib/rancher/k3s/server/token'` → `600`.
### `K3S-F15` — Nodes on public IPs, no private VPC · INFO · ⊘ · In-repo: Y
- **Decision:** Accepted for now. Defense is `K3S-CG3` (Hetzner firewall) only. To remediate later: attach a Hetzner private network, re-IP the cluster, move etcd/kubelet/Flannel onto it. Substantial re-provision — track on the roadmap, not this cycle.
### `K3S-F16` — All nodes are control-plane + etcd + worker · INFO · ⊘
- **Decision:** Accepted — standard small-cluster k3s. Revisit (dedicated workers + `NoSchedule` taint on control-plane) when workload pressure grows. No redeploy action.
### `K3S-F17` — Single-replica SPOFs · INFO · ☐ · In-repo: Y
- **Where:** `deploy-k3s/manifests/worker/deployment.yaml`, `redis/`, `admin/`, `observability/vmagent.yaml`.
- **Fix:** `worker` → `replicas: 2` (stateless, Asynq at-least-once — safe now). `admin`/`vmagent` → 2 if zero-downtime restart is wanted. `redis` is stateful — true HA needs Sentinel or managed Redis; track separately, do not naively scale.
- **Verify:** `kubectl -n honeydue get deploy` shows `worker 2/2`.
---
## Stage 2 — Secrets & config bootstrap
Run by `02-setup-secrets.sh`, which reads `deploy-k3s/config.yaml` and the
`secrets/` directory. **Both `K3S-F1` and `K3S-F3` are open purely because
`config.yaml` lacks the values — the script already supports them.**
### `K3S-F1` — Redis runs with no authentication · CRITICAL · ☐ · In-repo: Y
- **Where:** `deploy-k3s/config.yaml` key `redis.password`. `02-setup-secrets.sh:53,68-71` includes `REDIS_PASSWORD` in `honeydue-secrets` only when that key is non-empty; `redis/deployment.yaml` adds `--requirepass` only when the env var is non-empty.
- **Fix:** Set `redis.password` in `config.yaml` to a strong value (`openssl rand -base64 32`). Re-run `02-setup-secrets.sh`. `api`/`worker` already consume `REDIS_PASSWORD`.
- **Verify:** `kubectl -n honeydue exec deploy/redis -- redis-cli ping` → `NOAUTH`; with `-a "$REDIS_PASSWORD"` → `PONG`.
- **Redeploy-clean:** committing the value to `config.yaml` means every future `02-setup-secrets.sh` re-creates the authenticated secret. (If `config.yaml` is gitignored, store the value in the operator's secret store and document it here.)
### `K3S-F3` — `admin-basic-auth` secret never created · HIGH · ☐ · In-repo: Y
- **Where:** `config.yaml` keys `admin.basic_auth_user` / `admin.basic_auth_password`. `02-setup-secrets.sh:54-55,132-143` creates the `admin-basic-auth` secret (bcrypt htpasswd) only when both are set, else it warns and skips.
- **Fix:** Set both keys. Re-run `02-setup-secrets.sh`. **Must be done before `K3S-F2`** — attaching `admin-auth` to the ingress with the secret missing makes Traefik 503 the admin route.
- **Verify:** `kubectl -n honeydue get secret admin-basic-auth`.
### `K3S-F8` (Stage 2 half) — `B2_KEY_ID` / `B2_APP_KEY` in `honeydue-secrets` · ☑ · In-repo: Y
- **Where:** `02-setup-secrets.sh`.
- **Done (2026-05-16):** the script now reads `storage.b2_key_id` / `storage.b2_app_key` from `config.yaml` and adds `B2_KEY_ID` / `B2_APP_KEY` to `honeydue-secrets`. Previously the `api`/`worker` manifests referenced these keys but the script never created them — a latent deploy break. See the full `K3S-F8` entry in Stage 5.
- **Verify:** `kubectl -n honeydue get secret honeydue-secrets -o jsonpath='{.data.B2_KEY_ID}'` is non-empty.
### `K3S-F12` — Secret rotation runbook · MEDIUM · ☐ · In-repo: Y
- **Where:** new doc `docs/runbooks/secret-rotation.md`.
- **Fix:** Document per-secret rotation (Postgres, `SECRET_KEY`, APNs `.p8`, FCM, B2, observability token, Redis, admin basic-auth). Annual minimum; immediate on suspected exposure or operator-device loss. For `SECRET_KEY` (JWT signing) plan an overlap window so live tokens validate across the change. Add a `last-rotated` annotation to each secret.
- **Verify:** runbook exists and the first rotation is logged.
### `CODE-C4` — `DEBUG_FIXED_CODES` "123456" auth bypass · CRITICAL · ☐ · In-repo: Y
- **Where:** `internal/services/auth_service.go:141-145,385-390,432-435,470-473,503-504`; config in `internal/config/config.go`. ConfigMap generated from `config.yaml` by `03-deploy.sh`.
- **Fix (two layers):** (1) Code — refuse to start if `ENV=production && DebugFixedCodes` (Stage 4 code change). (2) Config — ensure `config.yaml` never sets `DEBUG_FIXED_CODES=true` for prod, and the generated ConfigMap omits it.
- **Verify:** prod ConfigMap has no `DEBUG_FIXED_CODES`; a prod boot with the flag set fails fast.
### `CODE-M8` — `SECRET_KEY` hardcoded debug fallback · MEDIUM · ☐ · In-repo: Y
- **Where:** `internal/config/config.go:437-442` falls back to `"change-me-in-production-secret-key-12345"`.
- **Fix:** Remove the static fallback — generate a per-boot random key in debug, and **refuse to start** in production if `SECRET_KEY` is unset. (`02-setup-secrets.sh:46-49` already enforces ≥32 chars for the real secret — keep that.)
- **Verify:** prod boot with no `SECRET_KEY` exits non-zero; the fallback string is gone from the binary.
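
The start-up rule in `CODE-M8` can be sketched as follows. This is a minimal illustration, not the code in `internal/config/config.go`: `resolveSecretKey`, its `env`/`configured` parameters, and the `"production"` environment value are all hypothetical names chosen for the example.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
)

// resolveSecretKey (hypothetical helper) returns the configured key when one
// is set. With no key it refuses in production and otherwise generates a
// random per-boot key, so no static fallback string ships in the binary.
func resolveSecretKey(env, configured string) (string, error) {
	if configured != "" {
		return configured, nil
	}
	if env == "production" {
		return "", errors.New("SECRET_KEY must be set in production")
	}
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", fmt.Errorf("generate debug key: %w", err)
	}
	return hex.EncodeToString(buf), nil
}

func main() {
	if _, err := resolveSecretKey("production", ""); err != nil {
		fmt.Println("refusing to start:", err)
	}
	key, _ := resolveSecretKey("debug", "")
	fmt.Println("debug key length:", len(key))
}
```

A per-boot debug key invalidates anything signed before the restart, which is usually acceptable outside production.
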
---
## Stage 3 — Kubernetes manifests
Committed under `deploy-k3s/manifests/` and applied by `03-deploy.sh`. **Any
fix here is automatically re-applied on every redeploy** — the highest-value
stage for "redeploy clean."
### `K3S-F2` / `CODE-L6` — Wire defense-in-depth onto the admin ingress · HIGH · ☐
- **Where:** `deploy-k3s/manifests/ingress/ingress-simple.yaml` — admin route annotation.
- **Fix:** Add `cloudflare-only` and `admin-auth` to the `traefik.ingress.kubernetes.io/router.middlewares` annotation alongside the existing `security-headers` + `rate-limit`. **Do `K3S-F3` first** or Traefik 503s the route.
- **Verify:** `04-verify.sh` "Cloudflare-Only Middleware" check passes; `admin.myhoneydue.com` prompts for basic auth.
### `K3S-F6` — `imagePullSecrets` name consistency · HIGH · ☐
- **Where:** all `deploy-k3s/manifests/*/deployment.yaml`, `migrate/job.yaml`; secret created by `02-setup-secrets.sh:111` as `ghcr-credentials`.
- **Fix:** The registry is Gitea — `ghcr-credentials` is a misleading name and the live cluster currently also has a hand-made `gitea-credentials`. Pick one name (`gitea-credentials` is clearer), use it in **both** the script and **every** manifest, and delete the orphan. The defect is a name *mismatch*, not a missing fix — make script + manifests agree so a pull never fails on a fresh node.
- **Verify:** `grep -rl imagePullSecrets deploy-k3s/manifests/` all reference one name == the script's; cordon a node, delete a pod, confirm the replacement pulls.
### `K3S-F7` — `vmagent` container `securityContext` · MEDIUM · ☐
- **Where:** `deploy-k3s/manifests/observability/vmagent.yaml`.
- **Fix:** Add the container-level block the other 5 deployments already have: `allowPrivilegeEscalation: false`, `capabilities.drop: [ALL]`, `readOnlyRootFilesystem: true`. Its volumes (`/etc/vmagent`, `/etc/vmagent-secrets`, `/tmp/vmagent` emptyDir) already support read-only root.
- **Verify:** `04-verify.sh` "Pod Security Contexts" reports OK for `vmagent`.
### `K3S-F9` / `LIVE-L8` — CSP + cross-origin headers · MEDIUM / LOW · ☐
- **Where:** Cross-origin trio → `deploy-k3s/manifests/ingress/middleware.yaml` (`security-headers`). CSP `object-src`/`base-uri` → Go app CSP middleware (Stage 4, `LIVE-L8` code half).
- **Important correction:** `K3S-F9` originally said CSP was missing. The live scan **disproved** that — the Go app sets a strong CSP via app middleware. So `K3S-F9` reduces to: add `Cross-Origin-Opener-Policy: same-origin` and `Cross-Origin-Resource-Policy: same-origin` (and `Cross-Origin-Embedder-Policy: require-corp` only if it doesn't break embeds) to `security-headers`. The CSP `object-src 'none'; base-uri 'self'` additions belong in the app and are tracked under `LIVE-L8` in Stage 4.
- **Verify:** `curl -sI https://api.myhoneydue.com/api/health/ | grep -i cross-origin` shows COOP/CORP.
### `K3S-F10` / `LIVE-L12` — Auth-endpoint rate-limit middleware · MEDIUM / HIGH · ☐
- **Where:** `deploy-k3s/manifests/ingress/middleware.yaml` (new `auth-rate-limit` Middleware) + `ingress/ingress-simple.yaml`. Requires migrating the auth paths from vanilla `Ingress` to a Traefik `IngressRoute` to apply a per-path middleware.
- **Fix:** New Middleware `average: 5, burst: 10, period: 1m, sourceCriterion.ipStrategy.depth: 2` (depth 2 for the Cloudflare hop). Apply to `/api/auth/login`, `/api/auth/register`, `/api/auth/forgot-password`, `/api/auth/reset-password`, `/api/residences/join-with-code`. This is the **edge** half; the **app** half is `CODE-H1/H2/H3/M5` in Stage 4 (per-account lockout in Redis). Do both — edge limit alone resets on IP rotation.
- **Verify:** more than 10 rapid logins from one IP (for example 15 within a few seconds) → `429` once the burst of 10 is spent.
### `K3S-F11` — Disable `automountServiceAccountToken` · MEDIUM · ☐
- **Where:** `deploy-k3s/manifests/rbac.yaml` (ServiceAccounts) and/or each `*/deployment.yaml` pod spec.
- **Fix:** Set `automountServiceAccountToken: false` on `api`, `admin`, `worker`, `web`, `redis`. Leave `true` only for `vmagent` (it uses the k8s API for service discovery). **Note:** `05-security.md` claims this is already set — the audit (`F11`) says it is not. Treat the audit as ground truth; this fix makes the doc true.
- **Verify:** `kubectl -n honeydue get pod <api-pod> -o jsonpath='{.spec.automountServiceAccountToken}'` → `false`; no token file in the container.
### `K3S-F13` — Add `app.myhoneydue.com` to CORS · LOW · ☐
- **Where:** `CORS_ALLOWED_ORIGINS` in `config.yaml` → generated into `honeydue-config` ConfigMap by `03-deploy.sh`.
- **Fix:** Confirm whether the web app calls `api.myhoneydue.com` directly from the browser. If yes, add `https://app.myhoneydue.com` to `CORS_ALLOWED_ORIGINS`. If it proxies through Next.js server-side, CORS is moot — record that decision here instead.
- **Verify:** browser fetch from `app.myhoneydue.com` to the API succeeds (or the proxy decision is documented).
### `K3S-F14` — Pin public images by digest · LOW · ☐
- **Where:** `redis/deployment.yaml` (`redis:7-alpine`), `observability/vmagent.yaml` (`victoriametrics/vmagent:v1.106.1`).
- **Fix:** Replace tags with `@sha256:` digests. Folded into the `K3S-F5` CI work (Stage 5).
- **Verify:** manifests contain no public-image tag without a digest.
### `LIVE-L5` / `CODE-L3` — HSTS `preload` · LOW · ☐
- **Where:** `deploy-k3s/manifests/ingress/middleware.yaml` `security-headers` HSTS value.
- **Fix:** Change to `max-age=63072000; includeSubDomains; preload`. Confirm api/admin/app all work fully over HTTPS, then submit to `hstspreload.org` (the submission is the Stage 0 external half — once preloaded you cannot easily downgrade for ~6 months).
- **Verify:** response header shows `preload`; domain accepted at hstspreload.org.
### `LIVE-L7` — Drop deprecated `X-XSS-Protection` · LOW · ☐
- **Where:** `deploy-k3s/manifests/ingress/middleware.yaml` `security-headers` (`browserXssFilter: true` / `customResponseHeaders`).
- **Fix:** Remove the header or set `X-XSS-Protection: "0"`. Modern browsers ignore it; legacy filter bypass has caused XSS.
- **Verify:** header absent or `0` on all three hosts.
### `CODE-L4` — Set `imagePullPolicy` · LOW · ☐
- **Where:** all `deploy-k3s/manifests/*/deployment.yaml`.
- **Fix:** Set `imagePullPolicy` explicitly. Once images are digest-pinned (`K3S-F5`), `IfNotPresent` is correct and avoids needless re-pulls; until then `Always` avoids stale tags. Pick the policy that matches the `K3S-F5` rollout state.
- **Verify:** every container has an explicit `imagePullPolicy`.
---
## Stage 4 — Application code & container images
Fixes in `honeyDueAPI-go` source (and the admin/web Dockerfiles). They reach
production by **rebuilding the image** in `03-deploy.sh`; schema-changing
fixes (`CODE-C1`, `CODE-C5/6`, `CODE-C11`, `CODE-C12`) also need a **goose
migration**, which the migrate `Job` runs automatically before the
api/worker roll. Per repo rule: do not auto-commit — these are code changes;
this section is the plan, not the patch.
### Critical (C1–C13)
#### `CODE-C1` — Plaintext auth tokens in DB · ☑ (2026-05-15)
- **Where:** `internal/models/user.go`, `internal/repositories/user_repo.go`, `internal/middleware/auth.go`, `internal/services/cache_service.go`, `internal/services/auth_service.go`, migration `000003_hash_auth_tokens.sql`.
- **Done:** `user_authtoken.key` now stores `models.HashToken()` — the hex SHA-256 of the token — never the raw value. The raw token reaches the client once (the non-persisted `AuthToken.Plaintext` field) and is re-hashed on every request before the DB and Redis lookup, so the single indexed JOIN query in the auth middleware is preserved. A fast hash (not bcrypt) is correct here — tokens are 160-bit random values, nothing to brute-force. Migration `000003` widens the column 40→64 and clears existing rows.
- **Behaviour change:** the server can no longer re-issue a stored token's plaintext, so every login mints a fresh token via `CreateFreshToken` (delete + create). With the existing one-token-per-user schema this means **one active session per user** — logging in on a new device invalidates the previous device's token. The migration also invalidates all sessions once, at deploy.
- **Verify:** `SELECT key FROM user_authtoken LIMIT 1` → 64-char hash; `go build ./...` and `go test ./internal/{models,repositories,middleware,handlers}/...` pass.
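
The hashing step described for `CODE-C1` amounts to a hex-encoded SHA-256 digest. The sketch below shows why the column had to widen to 64 characters; `hashToken` is a hypothetical stand-in, and the real `models.HashToken()` signature may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashToken returns the lowercase hex SHA-256 digest of a raw auth token.
// Hypothetical stand-in for models.HashToken; only the digest is stored.
func hashToken(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

func main() {
	digest := hashToken("example-raw-token")
	// A SHA-256 digest is 32 bytes, so the hex form is always 64 characters.
	fmt.Println(len(digest), digest)
}
```

Because the hash is deterministic and unsalted, the middleware can hash the presented token and look it up with a single indexed equality match.
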
#### `CODE-C2` / `CODE-C3` — Google ID token not verified locally · ☑ (2026-05-15)
- **Where:** `internal/services/google_auth.go` (full rewrite).
- **Done:** `VerifyIDToken` no longer calls the deprecated `tokeninfo` URL (which leaked the token in the query string and made verification depend on a third party). It now parses the JWT, fetches Google's JWKS from `googleapis.com/oauth2/v3/certs` (Redis-cached 24h, re-fetched on a `kid` miss), verifies the `RS256` signature locally, and asserts `iss ∈ {accounts.google.com, https://accounts.google.com}` (C3), `aud`/`azp` against the configured client IDs, and `exp` (validated by jwt v5). Mirrors the existing Apple JWKS verifier. `GoogleSignIn` is unchanged — the returned `GoogleTokenInfo` shape is preserved.
- **Verify:** `go build ./...` clean; `internal/services` tests pass.
#### `CODE-C5` / `CODE-C6` — IAP receipt / purchase-token replay · ☐
- **Where:** `internal/services/subscription_service.go` (`ProcessApplePurchase`, `ProcessGooglePurchase`).
- **Fix:** Goose migration adding `UNIQUE(provider, original_transaction_id)`. On purchase, if the transaction ID is already bound to a different `user_id` → `403`.
- **Verify:** re-submitting a valid receipt against a second account → `403`; DB has no duplicate.
#### `CODE-C7` — File-ownership check excludes residence owners · ☐
- **Where:** `internal/services/file_ownership_service.go:20-66`.
- **Fix:** Replace the three `residence_residence_users`-only JOINs with the canonical owner-OR-member UNION from `residence_repo.HasAccess` (owners live in `residence_residence.owner_id`).
- **Verify:** a residence owner can delete a file in their own property; a non-member still gets `403`.
#### `CODE-C8` — Device-token cross-account hijack · ☐
- **Where:** `internal/services/notification_service.go:307-319` (APNS), `:336-349` (GCM).
- **Fix:** On re-register of an existing token, if `existing.UserID != nil && *existing.UserID != userID` → `409 Conflict`. Only same-user updates allowed.
- **Verify:** registering another user's known token → `409`; that user's push traffic is unaffected.
#### `CODE-C9` / `CODE-H9` — Share-code join not atomic · ☐
- **Where:** `internal/services/residence_service.go:562-615` (`:594-599` swallows the deactivate error).
- **Fix:** Wrap `JoinWithCode` in one transaction with `SELECT … FOR UPDATE` on the share-code row; **fail the join if deactivation fails** (do not log-and-continue).
- **Verify:** concurrent redemptions of a single-use code → exactly one succeeds; a forced deactivate error rolls the whole join back.
#### `CODE-C10` — Subscription upgrade race · ☐
- **Where:** `internal/services/subscription_service.go:404-459`; webhook handler `:136-213`.
- **Fix:** Move Apple validation inside the row-locked transaction, or add an idempotency-key table so the validate→write window can't be raced.
- **Verify:** two concurrent upgrades for one user → one tier change, not two.
#### `CODE-C11` — Task-completion duplicate-row race · ☐
- **Where:** `internal/services/task_service.go:631-750`.
- **Fix:** `SELECT … FOR UPDATE` on the task in `CreateCompletion`; goose migration adding `UNIQUE(task_id, completed_date)`.
- **Verify:** double-tap "complete" → one completion row.
#### `CODE-C12` — Soft-deleted email reusable · ☐
- **Where:** `internal/services/auth_service.go:274-324`; `internal/repositories/user_repo.go` (`FindByEmail`, `ExistsByEmail`).
- **Fix:** On delete, mangle the email (`deleted_<id>_<email>`); add `is_active = true` filtering consistently to `FindByEmail`/`ExistsByEmail`.
- **Verify:** registering with a soft-deleted account's email is rejected; no cross-account takeover.
#### `CODE-C13` — Apple webhook user lookup may LIKE-match · ☐
- **Where:** `internal/handlers/subscription_webhook_handler.go:354-366` (`FindByAppleReceiptContains`).
- **Fix:** Confirm the SQL is an equality match, not `LIKE`. If `LIKE`, this is a confirmed Critical — change to equality and rename the function. See `V8`.
- **Verify:** the query is parameterized equality; rename merged.
### High (H1–H9)
#### `CODE-H1` / `CODE-H2` / `CODE-H3` / `CODE-M5` — Rate limiting gaps · ☐
- **Where:** `internal/router/router.go` (`:520` login limiter, `:593` `join-with-code` unprotected), `internal/middleware/rate_limit.go`, `internal/handlers/auth_handler.go`.
- **Fix:** Extend rate limiting to `register`, `join-with-code`, Apple/Google sign-in, and token refresh. Add a per-account login-attempt counter in Redis (lock after 5–10 fails for 15–60 min). This is the **app** half of the consolidated auth-rate-limit item; the **edge** half is `K3S-F10`.
- **Verify:** rapid attempts on every auth route throttle; per-account lockout fires regardless of source IP.
#### `CODE-H4` — Modulo bias in 6-digit codes · ☐
- **Where:** `internal/services/auth_service.go:884-892`.
- **Fix:** Replace `int32 % 1000000` with rejection sampling on `crypto/rand` for a uniform `000000–999999`.
- **Verify:** distribution test over many samples is uniform.
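
Rejection sampling for `CODE-H4` could look roughly like this. It is a sketch, not the code in `auth_service.go`: `sixDigitCode` and the constants are hypothetical names. The bias comes from 2^32 not being a multiple of 1,000,000, so the top 967,296 values are discarded and redrawn.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

const (
	codeSpace = 1_000_000
	// rejectFrom is the largest multiple of codeSpace that fits in 32 bits
	// (4,294,000,000). Draws at or above it would skew the low residues.
	rejectFrom = (1 << 32) / codeSpace * codeSpace
)

// sixDigitCode (hypothetical helper) returns a uniform code in 000000-999999.
func sixDigitCode() (string, error) {
	var buf [4]byte
	for {
		if _, err := rand.Read(buf[:]); err != nil {
			return "", err
		}
		n := binary.BigEndian.Uint32(buf[:])
		if n < rejectFrom {
			return fmt.Sprintf("%06d", n%codeSpace), nil
		}
		// Rejected: draw again. This happens for about 0.02% of draws.
	}
}

func main() {
	code, err := sixDigitCode()
	if err != nil {
		panic(err)
	}
	fmt.Println(code)
}
```

`crypto/rand.Int(rand.Reader, big.NewInt(1000000))` from the standard library gives the same uniformity with less code, at the cost of a `math/big` allocation.
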
#### `CODE-H5` — Apple IAP `.p8` file-mode unchecked · ☐
- **Where:** `internal/services/iap_validation.go:93-128`, `internal/config/config.go:325`.
- **Fix:** Prefer a base64 env-injected PEM. If a file path is kept, refuse to start when the file mode is more permissive than `0600`.
- **Verify:** boot fails on a `0644` key file; succeeds on `0600`.
#### `CODE-H6` — Webhook dedup fail-open · ☐
- **Where:** `internal/handlers/subscription_webhook_handler.go:165-173` (Apple), `:564-574` (Google).
- **Fix:** Fail **closed** — if `webhookEventRepo.HasProcessed` errors, return `500` so Apple/Google retry, rather than processing (which risks duplicate refunds).
- **Verify:** simulated dedup-check DB error → `500`, no double-processing.
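
The fail-closed decision in `CODE-H6` reduces to a three-way branch. In this sketch `eventStore`, `dedupStatus`, and `brokenStore` are hypothetical; only the `HasProcessed` method name comes from the finding, and its real signature may differ.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// eventStore is a hypothetical stand-in for webhookEventRepo.
type eventStore interface {
	HasProcessed(eventID string) (bool, error)
}

// dedupStatus decides the response status and whether to process the event.
// A failed dedup lookup returns 500 and skips processing (fail closed), so
// the provider retries later instead of the event being handled twice.
func dedupStatus(store eventStore, eventID string) (status int, process bool) {
	seen, err := store.HasProcessed(eventID)
	switch {
	case err != nil:
		return http.StatusInternalServerError, false
	case seen:
		return http.StatusOK, false // duplicate: acknowledge, do nothing
	default:
		return http.StatusOK, true
	}
}

// brokenStore simulates a database outage during the dedup check.
type brokenStore struct{}

func (brokenStore) HasProcessed(string) (bool, error) {
	return false, errors.New("db unavailable")
}

func main() {
	status, process := dedupStatus(brokenStore{}, "evt-1")
	fmt.Println(status, process)
}
```
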
#### `CODE-H7` — Auth-failure log lacks IP/UA · ☐
- **Where:** `internal/handlers/auth_handler.go:70`.
- **Fix:** Add `c.RealIP()` + `User-Agent` to the structured failure log line (the audit log captures them; the request-line log does not). Depends on `V10` (RealIP trust).
- **Verify:** a failed login log line carries IP + UA.
#### `CODE-H8` — `X-Timezone` header trusted for trial start · ☐
- **Where:** `internal/middleware/timezone.go:40-71` → `internal/services/subscription_service.go:145-150`.
- **Fix:** Validate `X-Timezone` against IANA `LoadLocation`, cap to ±14h; use server UTC for trial-start / billing-window math regardless.
- **Verify:** a bogus/extreme `X-Timezone` cannot shift trial start.
### Medium (M1–M13)
#### `CODE-M1` — Header injection via `Content-Disposition` filename · ☐
- **Where:** `internal/handlers/media_handler.go:74,117,165`.
- **Fix:** Sanitize `doc.FileName` — strip CR/LF/quote/null, or emit RFC 5987 `filename*=UTF-8''…`.
- **Verify:** an upload with CRLF in the filename does not split the response.
#### `CODE-M2` — bcrypt cost 10 → 12 · ☐
- **Where:** `internal/models/user.go:47`, `internal/services/auth_service.go:479`.
- **Fix:** Make the cost config-driven, default 12.
- **Verify:** new hashes are `$2a$12$`.
#### `CODE-M3` — Apple Sign In nonce not validated · ☐
- **Where:** `internal/services/apple_auth.go`.
- **Fix:** Generate, store, and verify the nonce round-trip on Apple sign-in.
- **Verify:** a replayed/mismatched nonce is rejected.
#### `CODE-M4` — Email verification not atomic · ☐
- **Where:** `internal/services/auth_service.go:373-415`.
- **Fix:** Wrap verify in a transaction so a concurrent request can't double-apply.
- **Verify:** concurrent verify calls → one state transition.
#### `CODE-M6` / `LIVE-L16` — Uncapped list / pagination · ☐
- **Where:** `ListDocuments`, `ListContractors`, `ListResidences` handlers; pagination parsing.
- **Fix:** Clamp `limit` server-side to ≤100 (`< 1` → default 25). The notifications endpoint already caps at 200; follow the same pattern.
- **Verify:** `?limit=999999` returns ≤100 rows.
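
The clamp in `CODE-M6` is small enough to show in full. `clampLimit` and the constant names are hypothetical; the values 25 and 100 are the ones given in the fix.

```go
package main

import "fmt"

const (
	defaultLimit = 25  // used when the client sends a value below 1
	maxLimit     = 100 // hard server-side ceiling
)

// clampLimit (hypothetical helper) maps any requested page size into the
// allowed range before it reaches the query.
func clampLimit(requested int) int {
	switch {
	case requested < 1:
		return defaultLimit
	case requested > maxLimit:
		return maxLimit
	default:
		return requested
	}
}

func main() {
	for _, n := range []int{0, 50, 999999} {
		fmt.Println(n, "->", clampLimit(n))
	}
}
```
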
#### `CODE-M7` — Audit log not append-only · ☐
- **Where:** audit-log model / repository.
- **Fix:** Make it append-only — a DB trigger forbidding `UPDATE`/`DELETE`, or move to an event store. Remove the soft-delete column.
- **Verify:** an `UPDATE`/`DELETE` on the audit table is rejected.
#### `CODE-M11` — `golang.org/x/crypto` outdated · ☐
- **Where:** `go.mod:30` (`v0.49.0`).
- **Fix:** `go get -u golang.org/x/crypto`, re-run `govulncheck`, retest. Pairs with Stage 5 dependency automation.
- **Verify:** `govulncheck ./...` clean.
#### `CODE-M12` — Contractor toggle refetch race · ☐
- **Where:** `internal/services/contractor_service.go:279-307`.
- **Fix:** Do the toggle + read in one transaction so a concurrent soft-delete can't make it return `nil`.
- **Verify:** concurrent toggle + delete → defined result, no nil panic.
#### `CODE-M13` — Account-deletion endpoint unrate-limited · ☐
- **Where:** `internal/handlers/auth_handler.go:488-539`.
- **Fix:** Add a throttle to `DELETE /account`. **First resolve `V11`** — `LIVE-L18` claims no delete endpoint exists; reconcile before deciding whether this is "rate-limit it" or "expose it."
- **Verify:** repeated delete calls throttle.
#### `CODE-M10` — `node:20-alpine` floating tag · ☐
- **Where:** admin/web `Dockerfile` (`:2,112,134`).
- **Fix:** Pin to a specific patch version or digest.
- **Verify:** Dockerfile has no bare `node:20-alpine`.
### Low / Info (CODE-L1, L2)
#### `CODE-L1` — Inactive-account login enumeration · ☐
- **Where:** `internal/services/auth_service.go:76-77`.
- **Fix:** Return the same generic error for inactive accounts as for invalid credentials.
- **Verify:** inactive vs. wrong-password responses are byte-identical.
#### `CODE-L2` — Auth responses lack `Cache-Control: no-store` · ☐
- **Where:** `internal/handlers/auth_handler.go` (Login / CurrentUser / Refresh).
- **Fix:** Set `Cache-Control: no-store` on auth responses.
- **Verify:** the header is present.
### Live-scan code-level findings (LIVE-L1, L11–L20)
#### `LIVE-L1` — `/metrics` publicly exposed · HIGH · ☐
- **Where:** `cmd/api/main.go` route registration; vmagent scrapes it cluster-internally already.
- **Fix (recommended — Option B):** bind Prometheus metrics to a separate cluster-internal port (e.g. `:9090`), expose only via a ClusterIP Service the vmagent NetworkPolicy allows; the public Ingress never registers `/metrics`. Update `observability/vmagent.yaml` scrape target. (Alternative: block `/metrics` at Traefik via an `IngressRoute` — Stage 3.)
- **Verify:** `curl https://api.myhoneydue.com/metrics` → `404`; vmagent still scrapes successfully.
#### `LIVE-L11` — Login user-enumeration via timing · HIGH · ☐
- **Where:** login handler / `auth_service.go`.
- **Fix:** Always run a bcrypt compare against a fixed dummy hash when the user is not found, so the response time is constant.
- **Verify:** real vs. fake email login timing delta < network noise.
#### `LIVE-L12` — No rate-limit on login · HIGH · ☐
- See the consolidated auth-rate-limit item: `K3S-F10` (edge) + `CODE-H1/H2/H3/M5` (app). Closed when both land.
#### `LIVE-L13` — Password-reset timing enumeration · HIGH · ☐
- **Where:** `forgot-password` handler.
- **Fix:** Enqueue the reset email on the Asynq queue and return the generic response immediately, so real vs. fake emails have identical latency.
- **Verify:** real vs. fake email reset timing delta < network noise.
#### `LIVE-L14` / `LIVE-L15` — Sequential integer IDs · MEDIUM · ⊘ (deferred)
|
||
- **Where:** all user-facing IDs.
|
||
- **Decision:** Real enumeration/intel leak, but migrating to UUID/ULID touches API, web, mobile, and webhook payloads. **Deferred to a planned quarter** — not a redeploy-stage fix. Track on the roadmap; revisit before the userbase size becomes commercially sensitive.
|
||
|
||
#### `LIVE-L16` — Pagination `limit` uncapped · MEDIUM · ☐
|
||
- Duplicate of `CODE-M6` — closed with it.
|
||
|
||
#### `LIVE-L17` — Garbage pagination params silently accepted · LOW · ☐
|
||
- **Where:** query-param parsing in list handlers.
|
||
- **Fix:** Return `400` naming the bad parameter instead of silently using defaults.
|
||
- **Verify:** `?limit=abc` → `400`.
|
||
|
||
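`LIVE-L16` and `LIVE-L17` can be handled by one small parser that rejects garbage and caps large values. The sketch below is an illustration under assumptions: the default of 20, the cap of 100, and the choice to clamp oversized values rather than reject them are not taken from the codebase, and `parsePageParam` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strconv"
)

const (
	defaultLimit = 20  // assumed default page size
	maxLimit     = 100 // assumed upper bound (CODE-M6)
)

// parsePageParam validates one pagination query parameter. An empty value
// falls back to def; a non-numeric or non-positive value yields an error
// naming the parameter, which the handler should turn into a 400.
func parsePageParam(name, raw string, def, max int) (int, error) {
	if raw == "" {
		return def, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 {
		return 0, fmt.Errorf("invalid query parameter %q: %q is not a positive integer", name, raw)
	}
	if n > max {
		n = max // clamp instead of rejecting (design assumption)
	}
	return n, nil
}

// parseLimit applies the assumed defaults to the `limit` parameter.
func parseLimit(raw string) (int, error) {
	return parsePageParam("limit", raw, defaultLimit, maxLimit)
}
```

A handler would call `parseLimit` with the raw query value and respond with `400` plus the error text when it fails, which satisfies the `?limit=abc` check.
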
#### `LIVE-L18` — No account-deletion endpoint (GDPR) · LOW · ☐
- **Where:** `internal/router/router.go`, `internal/handlers/auth_handler.go`.
- **Fix:** Reconcile with `CODE-M13` first (`V11`). Provide `DELETE /api/auth/me/` that anonymizes PII, cascades/transfers residences, revokes tokens, and writes an audit-trail row. Also closes the throwaway-account cleanup gap the live scan left behind.
- **Verify:** an authenticated user can delete their own account; PII is anonymized.

#### `LIVE-L19` — Email verification not enforced · LOW · ☐
- **Where:** router middleware.
- **Fix:** Add a `RequireVerified()` middleware on sensitive routes (share-code generation/redemption, anything that emails other users), or cap unverified accounts (1 residence, no share codes) until verified.
- **Verify:** an unverified account is blocked from the chosen gated routes.

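A `RequireVerified()` gate for `LIVE-L19` could look roughly like the sketch below. It uses plain `net/http` and a request-context value to carry the authenticated account; the real router middleware, the account model, the `403` status, and the route path are all assumptions made for illustration.

```go
package main

import (
	"context"
	"net/http"
	"net/http/httptest"
)

// ctxKey is the private context key under which the auth middleware is
// assumed to store the current account.
type ctxKey struct{}

// account is a minimal stand-in for the real user model (assumption).
type account struct {
	EmailVerified bool
}

// withAccount attaches an account to the request, as an upstream auth
// middleware would.
func withAccount(r *http.Request, a account) *http.Request {
	return r.WithContext(context.WithValue(r.Context(), ctxKey{}, a))
}

// requireVerified rejects requests from missing or unverified accounts
// before the gated handler runs.
func requireVerified(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		a, ok := r.Context().Value(ctxKey{}).(account)
		if !ok || !a.EmailVerified {
			http.Error(w, "email verification required", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor exercises a gated placeholder route for one account state.
func statusFor(verified bool) int {
	gated := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNoContent)
	})
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/share-codes/", nil)
	req = withAccount(req, account{EmailVerified: verified})
	requireVerified(gated).ServeHTTP(rec, req)
	return rec.Code
}
```

Failing closed when no account is present in the context keeps a mis-ordered middleware chain from silently opening the route.
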
#### `LIVE-L20` — Profile-update silently drops unknown fields · INFO · ☐
- **Where:** `PATCH /api/auth/profile/` handler.
- **Fix:** Either accept the fields (if intended) or return `400` listing unsupported keys — don't silently `200`.
- **Verify:** an unknown field yields a clear response.

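One way to reject unknown profile fields is the standard library's strict JSON decoding. The sketch below assumes a request struct with two illustrative fields; the real `PATCH /api/auth/profile/` payload shape is not shown in this plan, so `profileUpdate` and its field names are hypothetical.

```go
package main

import (
	"encoding/json"
	"strings"
)

// profileUpdate lists the fields the endpoint is willing to accept.
// Field names are placeholders (assumption).
type profileUpdate struct {
	FirstName *string `json:"first_name"`
	LastName  *string `json:"last_name"`
}

// decodeProfileUpdate parses a request body and fails if it contains any
// key that profileUpdate does not declare. The handler should map the
// error to a 400 response.
func decodeProfileUpdate(body string) (profileUpdate, error) {
	var p profileUpdate
	dec := json.NewDecoder(strings.NewReader(body))
	dec.DisallowUnknownFields()
	err := dec.Decode(&p)
	return p, err
}
```

Note that the decoder stops at the first unknown key, so its error names one field rather than all of them. Listing every unsupported key, as the fix suggests, would need a first pass that decodes into a `map[string]json.RawMessage` and compares the keys against an allow-list.
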
#### `LIVE-L10` — `x-powered-by` — see Stage 0 (Next.js config).

---

## Stage 5 — CI / build pipeline

Build-time controls. Where there is no CI pipeline file yet, the fix is to
add one (or a `03-deploy.sh` step) so the control runs on every build.

### `K3S-F5` / `K3S-F14` / `CODE-L4` — Pin images by digest · HIGH · ☐
- **Where:** `03-deploy.sh` (currently tags by git short SHA, lines 47/57-61, and also pushes `:latest`), all `deploy-k3s/manifests/*/deployment.yaml`.
- **Fix:** After `docker push`, capture the digest (`crane digest …` or parse `docker push` output) and substitute `@sha256:…` into the manifests instead of `IMAGE_PLACEHOLDER` tags. Pin `redis` and `vmagent` by digest too. Reconsider pushing `:latest` — a mutable `:latest` undercuts digest pinning.
- **Verify:** `kubectl -n honeydue get deploy -o jsonpath` shows every image as `@sha256:`.

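The verify step for digest pinning reduces to a string check on each image reference. The sketch below shows one possible check; the regular expression is a simplification of the OCI reference grammar, and the function names and example references are illustrative only.

```go
package main

import "regexp"

// digestRef matches references of the form name@sha256:<64 hex chars>.
// Simplified pattern (assumption): it does not validate the name part.
var digestRef = regexp.MustCompile(`^[^@\s]+@sha256:[0-9a-f]{64}$`)

// isDigestPinned reports whether an image reference is pinned by digest
// rather than by a mutable tag.
func isDigestPinned(image string) bool {
	return digestRef.MatchString(image)
}

// unpinned returns every reference that is not digest-pinned, which a
// verification step can print before failing.
func unpinned(images []string) []string {
	var bad []string
	for _, img := range images {
		if !isDigestPinned(img) {
			bad = append(bad, img)
		}
	}
	return bad
}
```

Fed with the image list from the `kubectl ... -o jsonpath` query, a non-empty result from `unpinned` means at least one workload still runs a tag-addressed image.
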
### `K3S-F8` — Secrets as file mounts, not env vars · MEDIUM · ☑ · In-repo: Y
- **Where:** `api`/`worker` `deployment.yaml`, `internal/config/config.go`, `cmd/api/main.go`, `cmd/worker/main.go`, `02-setup-secrets.sh`.
- **Done (2026-05-16):**
  - `config.loadFileSecrets()` reads each of the 9 secret keys (`POSTGRES_PASSWORD`, `SECRET_KEY`, `EMAIL_HOST_PASSWORD`, `FCM_SERVER_KEY`, `REDIS_PASSWORD`, `B2_KEY_ID`, `B2_APP_KEY`, `OBS_INGEST_TOKEN`, `OBS_TRACES_URL`) from `/etc/honeydue/secrets/<KEY>` and `viper.Set`s it (highest precedence). A missing file is a silent skip, so the same binary still works from env vars in local/dev.
  - `api`/`worker` `deployment.yaml` no longer inject **any** secret as an `env: secretKeyRef`. `honeydue-secrets` is mounted as a volume (`defaultMode: 0400`), read-only, at `/etc/honeydue/secrets`. Non-secret config still arrives via `envFrom: configMapRef`.
  - `cmd/api`/`cmd/worker` read the observability endpoints through the new `config.SecretValue()` (Viper-backed) instead of `os.Getenv`, so file-mounted `OBS_*` values resolve now that they are gone from the environment.
  - `02-setup-secrets.sh` now also writes `B2_KEY_ID`/`B2_APP_KEY` into `honeydue-secrets` — reconciling the script-vs-manifest drift (the manifests referenced these keys but the script never created them).
- **Scoped exception:** the one-shot `honeydue-migrate` Job still takes `POSTGRES_PASSWORD` as an env var. goose is invoked as a CLI with the password inside the DSN argument, so the value is exposed in that process regardless of env-vs-file; the Job is transient (one run, seconds, pod GC'd) so this is accepted.
- **Verify:** `kubectl -n honeydue exec deploy/api -- env` shows no `POSTGRES_PASSWORD`/`SECRET_KEY`; `kubectl -n honeydue exec deploy/api -- ls /etc/honeydue/secrets` lists the key files.

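The file-mount loading behaviour described above (read `<dir>/<KEY>`, skip silently when the file is missing) can be illustrated with the standard library alone. This is not the actual `config.loadFileSecrets()`: it returns a map instead of calling `viper.Set`, and trimming trailing whitespace is an assumption about how mounted secret files should be read.

```go
package main

import (
	"errors"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// loadFileSecrets reads one file per key from dir and returns the values
// found. A missing file is skipped without error so the same binary can
// fall back to environment variables in local/dev. Any other read error
// (for example a permissions problem) is returned to the caller.
func loadFileSecrets(dir string, keys []string) (map[string]string, error) {
	out := make(map[string]string)
	for _, key := range keys {
		data, err := os.ReadFile(filepath.Join(dir, key))
		if errors.Is(err, fs.ErrNotExist) {
			continue // not mounted: leave the key to other config sources
		}
		if err != nil {
			return nil, err
		}
		// Trim the trailing newline many secret files carry (assumption).
		out[key] = strings.TrimSpace(string(data))
	}
	return out, nil
}
```

A caller would pass `/etc/honeydue/secrets` and the list of secret keys, then push each returned value into the config layer with the highest precedence.
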
### `CODE-L5` — Image signing + scanning · LOW · ◐ · In-repo: Y
- **Where:** `03-deploy.sh`, `deploy-k3s/manifests/kyverno-verify-images.yaml`.
- **Done (in-repo, 2026-05-16):**
  - `03-deploy.sh` runs `cosign sign` after each push and a `trivy image --severity HIGH,CRITICAL` scan before push — both **guarded**: they no-op when the tool is absent, so they never break a deploy on a host without them.
  - A ready-to-use Kyverno `ClusterPolicy` ships at `deploy-k3s/manifests/kyverno-verify-images.yaml`. It matches only the four `gitea.treytartt.com/admin/honeydue-*` images, starts in `Audit` mode, and is **intentionally not applied by `03-deploy.sh`** — applying a verify-images policy with no key would block every Pod from scheduling.
- **Remaining (operator — cannot be committed):**
  1. Install Kyverno in the cluster (admission controller).
  2. `cosign generate-key-pair`; set `COSIGN_KEY` in the deploy env so signing activates; paste `cosign.pub` into the policy's `publicKeys` block.
  3. `kubectl apply -f deploy-k3s/manifests/kyverno-verify-images.yaml`, confirm Pods still schedule, then flip `validationFailureAction: Audit → Enforce`.
- **Verify:** an unsigned image is rejected by admission; `03-deploy.sh` fails on a HIGH/CRITICAL CVE.

### `CODE-M11` (CI half) — Dependency hygiene · ☐
- **Fix:** Add scheduled `go get -u` + `govulncheck` (the audit confirms `govulncheck` + `gitleaks` already run in CI — extend with a dependency-update cadence).
- **Verify:** stale-dependency alerts surface automatically.

---

## Stage 6 — Post-deploy verification & runtime investigations

`04-verify.sh` already runs a security block (secret encryption, NetworkPolicy
count, ServiceAccounts, pod security contexts, PDBs, `cloudflare-only`
middleware, `admin-basic-auth`). **Extend it so each fix above stays fixed,
and work the open investigations the audits could not resolve.**

### Extend `04-verify.sh` with assertions for · ☐
- Redis rejects unauthenticated `PING` (`K3S-F1`).
- Admin ingress annotation contains `admin-auth` (`K3S-F2`).
- `/metrics` returns `404` on the public host (`LIVE-L1`).
- Every container (incl. `vmagent`) has a full `securityContext` (`K3S-F7`).
- `automountServiceAccountToken: false` on app pods (`K3S-F11`).
- Every workload image is digest-pinned (`K3S-F5`).
- No `DEBUG_FIXED_CODES` key in the prod ConfigMap (`CODE-C4`).

### Runtime investigations (cannot be closed by code review alone)

| ID | Item | Source | Action |
|---|---|---|---|
| `V1` | Apple/Google Sign-In token validation depth | LIVE | Test with a self-signed Apple identity token; confirm signature/aud/nonce checks |
| `V2` | Webhook signature verification — confirm webhook routes are **outside** the auth middleware in `router.go` (live scan saw `401`s, signature middleware may never run) | LIVE | Code-review `internal/router/router.go` |
| `V3` | File-upload security — locate upload paths, test polyglots / MIME bypass / path traversal in filename / oversized files | LIVE | Focused upload security test |
| `V4` | Long-term token validity / revocation behaviour | LIVE | Test token expiry + revocation over time |
| `V5` | Apple IAP receipt validation with a real sandbox StoreKit receipt | LIVE | Sandbox test |
| `V6` | Share-code system — find the endpoint path; test brute-force, single-use, expiration | LIVE | Locate + test |
| `V7` | Trial-expiration enforcement — age a test account past 14 days, confirm `limitations_enabled` flips and creation gates fire | LIVE | Aged-account test |
| `V8` | `FindByAppleReceiptContains` — confirm equality, not `LIKE`. If `LIKE`, escalate `CODE-C13` to confirmed Critical | CODE | SQL review |
| `V9` | Rate-limiter storage — confirm `rate_limit.go` is Redis-backed (shared across 3 api replicas); in-memory = 3× the intended limit | CODE | Code review |
| `V10` | `X-Forwarded-For` / Echo `RealIP` trust behind Traefik — without it per-IP limits collapse to the ingress IP | CODE | Code + Traefik config review |
| `V11` | Account-deletion contradiction — `LIVE-L18` (no endpoint) vs `CODE-M13` (endpoint at `auth_handler.go:488-539`). Resolve before Stage 4 planning | LIVE/CODE | Route review |
| `V12` | etcd encryption — `04-verify.sh` only greps a string; truly confirm with `k3s secrets-encrypt status` on each server node | K3S | SSH check |
| `V13` | `user_authtoken` index — confirm a `user_id` lookup index exists before hashing tokens at rest (`CODE-C1`) | CODE | Schema check |

---

## Accepted risks / deferred (this cycle)

| ID | Item | Rationale |
|---|---|---|
| `K3S-F15` | Public-IP nodes, no VPC | Re-provision-scale change; Hetzner firewall (`K3S-CG3`) is the compensating control. Roadmap. |
| `K3S-F16` | Combined control-plane/worker nodes | Standard small-cluster k3s; revisit on workload growth. |
| `LIVE-L14`/`L15` | Sequential integer IDs | UUID migration spans API + web + mobile + webhooks; planned quarter, not this cycle. |

Mirror these in `docs/deployment/20-roadmap.md` so they are not silently lost.

---

## Documentation drift corrected alongside this plan

The audits contradicted the existing deployment book. These corrections ship
with this plan so the docs match audited reality:

| Doc | Claimed | Reality (audit) | Action |
|---|---|---|---|
| `05-security.md` | `automountServiceAccountToken: false` set | `K3S-F11`: not set on any workload | Corrected to "TODO" + linked here |
| `05-security.md` | NetworkPolicies "not currently applied" (TODO) | Applied 2026-04-24; `03-deploy.sh:155` applies them | Corrected to "applied" |
| `05-security.md` | CF↔origin is plaintext (SSL=Flexible) | Upgraded to Full (strict) 2026-04-24 | Corrected |
| `05-security.md` | SHA tags immutable / "we'd notice a digest change" | `K3S-F5`: short SHA tags are mutable | Corrected; points to `K3S-F5` |
| `SECURITY.md` (old) | Redis "requires a password" | `K3S-F1`: no auth | This rewrite |
| `SECURITY.md` (old) | etcd `secrets-encryption: true` | `K3S-CG1`: not verified / not on | This rewrite |
| `SECURITY.md` (old) | fail2ban active | `05-security.md` + `K3S-CG2`: not installed | This rewrite |
| `20-roadmap.md` | — | Audit findings not represented | Audit items folded in |

---

## Hardened-redeploy checklist (run order)

A clean rebuild of the whole stack, with every fix above applied:

```
□ Stage 0 DNS once-off: DMARC, SPF, CAA at Cloudflare; security.txt route live
□ Stage 1 Provision: hetzner-k3s config carries --write-kubeconfig-mode=600
          and --secrets-encryption; run 01-provision-cluster.sh
□ Stage 1 Node OS: fail2ban + unattended-upgrades + SSH/sysctl on each node
□ Stage 1 Verify cluster: K3S-CG3..CG8 (firewall, snapshots, kubelet, perms)
□ Stage 2 Config: config.yaml has redis.password + admin.basic_auth_*;
          no DEBUG_FIXED_CODES; SECRET_KEY ≥32 chars
□ Stage 2 Secrets: run 02-setup-secrets.sh — confirm redis + admin-basic-auth
□ Stage 3 Manifests: admin ingress middlewares wired; imagePullSecret name
          consistent; vmagent securityContext; COOP/CORP headers;
          auth-rate-limit; automountServiceAccountToken:false;
          HSTS preload; X-XSS-Protection dropped; imagePullPolicy set
□ Stage 4 Code+image: all C/H/M/L code fixes committed; image rebuilt;
          goose migrations for C1/C5/C6/C11/C12 present
□ Stage 5 CI: images digest-pinned + signed + scanned; secrets file-mounted
□ Stage 6 Verify: run 04-verify.sh (extended); work V1–V13
□ Post: Submit myhoneydue.com to hstspreload.org
```

A redeploy is "clean" only when `04-verify.sh` (extended per Stage 6) passes
with zero `✗` lines and every checkbox in the master index is ☑ or ⊘.

---

## Appendix — Incident response playbooks

Preserved from the previous `SECURITY.md`; still current.

### Compromised API token
Rotate `SECRET_KEY` to invalidate all tokens, then restart api/worker:
```bash
echo "$(openssl rand -hex 32)" > secrets/secret_key.txt
./scripts/02-setup-secrets.sh
kubectl rollout restart deployment/api deployment/worker -n honeydue
```
(After `CODE-C1` lands, tokens are hashed at rest — a DB read no longer yields
usable tokens, but `SECRET_KEY` rotation remains the kill-switch.)

### Compromised database credentials
Rotate in the Neon dashboard, update `secrets/postgres_password.txt`, re-run
`02-setup-secrets.sh`, restart api/worker, watch logs for connection errors.

### Compromised push keys
APNs: revoke in Apple Developer, drop the new `.p8` into `secrets/`, re-run
`02-setup-secrets.sh`, restart api/worker. FCM: rotate the key in Firebase,
update `secrets/fcm_server_key.txt`, re-run, restart.

### Suspicious pod
```bash
kubectl logs <pod> -n honeydue > /tmp/pod-logs.txt
kubectl describe pod <pod> -n honeydue > /tmp/pod-describe.txt
kubectl delete pod <pod> -n honeydue # deployment recreates it
```

### Communication
Document the timeline privately; on a data breach notify affected users
within 72 hours; rotate every potentially-exposed credential; write a
post-mortem (root cause, timeline, remediation, prevention).

---

## References

- Audit reports: `live_scan_5_12.md`, `k3_audit_5_12.md`, `security_scan_5_12.md` (repo root)
- Current architecture: `docs/deployment/05-security.md`
- Roadmap: `docs/deployment/20-roadmap.md`
- Deploy process: `docs/deployment/14-deployment-process.md`
- Scripts: `deploy-k3s/scripts/{01-provision-cluster,02-setup-secrets,03-deploy,04-verify}.sh`
- Manifests: `deploy-k3s/manifests/`