honeyDueAPI

Author	SHA1	Message	Date
Trey t	c77ff07ce9	fix(security): remediate 2026-05-12 audit findings (Stages 2–5)	2026-05-16 22:28:33 -05:00

Remediation of the 2026-05-12/13 audits (78 findings + cluster gaps), tracked in deploy-k3s/SECURITY.md, plus fixes from two independent post-remediation reviews.

Auth & sessions:
- SHA-256 hashed auth-token storage (C1); prior-token cache eviction on re-login (MEDIUM-1)
- local Google JWKS verification, iss/aud/exp checks (C2/C3)
- constant-time login + generic errors (L1/LIVE-L11/LIVE-L13)
- per-account login lockout keyed on distinct source IPs (M5/MEDIUM-3)
- verified-email gating, login rate limiting (LIVE-L19, H1-H3)

IAP & webhooks:
- Apple/Google cross-account replay protection (C5/C6/C10/C13, H5/H6)
- migrations 000003-000006 (token hashing, IAP replay, audit_log + webhook_event_log table creation, append-only audit log)

Authorization & races:
- file-ownership owner-OR-member fix (C7), atomic share-code join (C9/H9), device-token reassignment (C8/LOW-3)

Secrets & deploy:
- secrets file-mounted at /etc/honeydue/secrets, not env (F8); Redis password out of the ConfigMap (HIGH-1); B2 keys reconciled
- digest-pinned images, admin ingress hardening, CSP/HSTS, /metrics lockdown; kubeconfig 0600, etcd secrets-encryption, fail2ban + unattended-upgrades at provision; secret-rotation runbook

Build, vet, and the full test suite (incl. -race) pass; the goose migration chain is verified against PostgreSQL 16.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
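
As an illustration of the "SHA-256 hashed auth-token storage" item (C1), the following is a minimal sketch of how a raw token might be reduced to a digest before it is stored and checked. The function names `hashToken` and `tokenMatches` are hypothetical and do not come from the repository; the constant-time comparison is an illustrative choice, and the actual code may instead look the digest up directly in the database.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// hashToken returns the hex-encoded SHA-256 digest of a raw auth token.
// In this sketch only the digest would be persisted; the raw token stays
// with the client. (Hypothetical helper, not the repository's API.)
func hashToken(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

// tokenMatches hashes the presented token and compares the result with the
// stored digest using a constant-time comparison.
func tokenMatches(presented, storedHash string) bool {
	got := hashToken(presented)
	return subtle.ConstantTimeCompare([]byte(got), []byte(storedHash)) == 1
}

func main() {
	stored := hashToken("example-token")
	fmt.Println(len(stored))                           // digest length in hex characters
	fmt.Println(tokenMatches("example-token", stored)) // matching token
	fmt.Println(tokenMatches("wrong-token", stored))   // non-matching token
}
```

Because the stored value is a digest, a leaked table row cannot be replayed as a bearer token without first inverting SHA-256.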
Trey t	2004f9c5b2	fix(observability): relax vmagent liveness probe — was crash-looping every ~5m	2026-05-13 00:39:23 -05:00

The previous probe had timeoutSeconds=1 which is too tight for the shell pipeline (sh + wget + grep + comparison). On a busy node the wget call regularly exceeded 1s, the exec timed out, and 3 consecutive timeouts triggered SIGTERM. Result: vmagent restarted ~5x per 30 min, causing brief gaps that made the Grafana "Pods up" panel render 0 whenever a refresh happened to coincide with a restart.

The relaxed probe still catches the original failure mode (zero healthy targets) but only kills the pod after 10 full minutes of consecutive failure (5 attempts × 2 min period), not 3 minutes (3 × 1 min).

timeoutSeconds: 1 → 5
periodSeconds: 60 → 120
failureThreshold: 3 → 5
initialDelaySeconds: 120 → 180

Also added wget -T 4 inside the command so wget itself bounds its network call to 4s — leaving 1s of slack within the 5s exec budget.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trey t	139a990ebc	fix(observability): unbreak vmagent SD on fresh deploy + ship kube-state-metrics	2026-05-13 00:30:11 -05:00

vmagent's k8s service discovery has been silently broken for 17+ days because k3s's NetworkPolicy controller evaluates egress AFTER kube-proxy's DNAT (contrary to the k8s spec). Pod → ClusterIP 10.43.0.1:443 was DNAT'd to <node_public_ip>:6443, and the resulting :6443 destination matched none of vmagent's egress rules → TCP RST → "connection refused" on every SD watch attempt. Grafana panels using kube_* or up{} metrics returned empty as a result.

Changes:
- network-policies.yaml: commit the previously-cluster-only NetPols (allow-egress-from-vmagent, allow-vmagent-to-api) so a fresh deploy produces a working cluster. The vmagent egress rule now includes :6443 to public IPs (the post-DNAT path) and :8080 to the pod CIDR (for scraping kube-state-metrics).
- observability/kube-state-metrics.yaml: new manifest. Provides the kube_pod_*, kube_deployment_*, kube_service_* metrics that Grafana panels need to count pods, replicas, etc. Runs in kube-system with cluster-scoped RBAC.
- observability/vmagent.yaml:
  * add kube-state-metrics scrape job to the ConfigMap
  * add vmagent-kube-system Role+RoleBinding so cross-namespace SD works
  * replace the misleading liveness probe (was /-/healthy, which lies while SD is broken) with an exec probe that checks /api/v1/targets for at least one healthy target — automatic recovery from future stale-SD incidents
- scripts/03-deploy.sh: actually apply network-policies.yaml (was committed but never applied) and apply kube-state-metrics.yaml.
- RUNBOOK.md (new): documents the post-DNAT gotcha, the liveness probe trap, bearer-token recovery procedure, drift-detection diff, and a post-redeploy verification checklist.
- .gitignore: cover kubeconfig.tunnel (created during SSH-tunnelled kubectl sessions) so admin client cert can't be committed by accident.

Verified via kubectl --dry-run on all three modified manifests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
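
The "at least one healthy target" check behind the new liveness probe can be sketched as follows. The commit implements it as a shell pipeline (sh + wget + grep); this Go version is a stand-in for illustration only. It assumes the /api/v1/targets response follows the Prometheus-style shape with `data.activeTargets[].health`, and the names `targetsResponse` and `hasHealthyTarget` are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// targetsResponse models only the fields this check needs. The shape is an
// assumption based on the Prometheus-compatible targets API.
type targetsResponse struct {
	Data struct {
		ActiveTargets []struct {
			Health string `json:"health"`
		} `json:"activeTargets"`
	} `json:"data"`
}

// hasHealthyTarget reports whether the response body lists at least one
// active target whose health is "up".
func hasHealthyTarget(body []byte) (bool, error) {
	var resp targetsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return false, fmt.Errorf("decode targets response: %w", err)
	}
	for _, t := range resp.Data.ActiveTargets {
		if t.Health == "up" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Sample bodies made up for this sketch, not captured from a cluster.
	healthy := []byte(`{"status":"success","data":{"activeTargets":[{"health":"down"},{"health":"up"}]}}`)
	stale := []byte(`{"status":"success","data":{"activeTargets":[]}}`)

	ok, err := hasHealthyTarget(healthy)
	fmt.Println(ok, err)
	ok, err = hasHealthyTarget(stale)
	fmt.Println(ok, err)
}
```

A probe built on this check fails when service discovery yields no scrapeable targets, which is the condition the old /-/healthy probe could not detect.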
Trey t	df78d9ccd8	Add Prometheus metrics + vmagent push to obs.88oakapps.com	2026-04-25 14:16:17 -05:00

Adds internal/prom package with histograms for HTTP, GORM, B2, APNs, and FCM, wired into the Echo router (HTTPMiddleware + /metrics) and GORM via statement-level callbacks (no ctx plumbing needed). Storage and push clients call ObserveB2Upload / ObserveAPNsSend / ObserveFCMSend at the network round-trip points. Existing internal/monitoring metrics move to /metrics/legacy so the canonical /metrics emits proper histogram buckets for p50/p95/p99 rollups.

deploy-k3s/manifests/observability/vmagent.yaml deploys a single-replica vmagent in the honeydue namespace that scrapes api Pods on :8000/metrics every 15s and remote-writes to https://obs.88oakapps.com/api/v1/write with a bearer token (substituted at deploy time from OBS_INGEST_TOKEN in deploy/prod.env). NetworkPolicies allow vmagent egress to api Pods and to the public obs endpoint over :443; the obs side runs VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate.

docs/observability-plan.md captures the full plan including resource budget, instrumentation table, 4-step rollout, and migration triggers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

4 Commits