honeyDueAPI

Author	SHA1	Message	Date
Trey t	b66151ddd9	feat(auth): scaffold Ory Kratos identity service, phase 1 (infrastructure)
First phase of replacing the hand-rolled auth (internal/services/auth_service.go et al.) with Ory Kratos. This commit is infrastructure only: Kratos will run but nothing consumes it yet; the Go API still does its own auth until phase 2.
Adds deploy-k3s/manifests/kratos/:
- configmap.yaml: kratos.yml, identity schema, Google/Apple OIDC claim mappers (no secrets in the ConfigMap)
- migrate-job.yaml: `kratos migrate sql`, run before the Deployment
- kratos.yaml: Deployment (x2), Service, NetworkPolicies
- ingress.yaml: auth.myhoneydue.com -> Kratos public API :4433
- README.md: operator prerequisites + deploy runbook
Wiring:
- 02-setup-secrets.sh creates kratos-secrets, gated on a config.yaml `kratos:` block (DSN, cookie/cipher, SMTP URI, OIDC client secret, Apple key).
- 03-deploy.sh applies the Kratos manifests + runs the migrate Job, gated on the kratos-secrets Secret existing.
Both gates mean the existing stack deploys completely unaffected until the operator completes the prerequisites (Neon `kratos` DB, auth.myhoneydue.com DNS, Apple/Google OAuth apps, Kratos image version). Pre-production, so no user-data migration; see manifests/kratos/README.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 16:24:38 -05:00
Trey t	93fddc3769	feat(observability): ship pod logs to Loki via Grafana Alloy
Adds a Grafana Alloy DaemonSet that tails honeydue-namespace pod logs from /var/log/pods and pushes them to Loki at obs.88oakapps.com, reusing the existing OBS_INGEST_TOKEN (14-day retention).
- deploy-k3s/manifests/observability/alloy-logs.yaml: DaemonSet + RBAC + token Secret + Alloy config. Runs as root (/var/log/pods is 0750 root:root) but otherwise locked down: all caps dropped, read-only root filesystem, seccomp RuntimeDefault, read-only hostPath mount.
- network-policies.yaml: allow-egress-from-alloy-logs (DNS + k8s API + obs HTTPS), mirroring the vmagent egress policy.
- 03-deploy.sh: applies alloy-logs with the OBS_INGEST_TOKEN substitution and waits for the DaemonSet rollout.
The Loki container, nginx /loki/api/v1/push route, and Grafana Loki datasource live on the obs server and are not repo-managed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 20:04:09 -05:00
Trey t	c77ff07ce9	fix(security): remediate 2026-05-12 audit findings (Stages 2–5)
Remediation of the 2026-05-12/13 audits (78 findings + cluster gaps), tracked in deploy-k3s/SECURITY.md, plus fixes from two independent post-remediation reviews.
Auth & sessions:
- SHA-256 hashed auth-token storage (C1); prior-token cache eviction on re-login (MEDIUM-1)
- local Google JWKS verification, iss/aud/exp checks (C2/C3)
- constant-time login + generic errors (L1/LIVE-L11/LIVE-L13)
- per-account login lockout keyed on distinct source IPs (M5/MEDIUM-3)
- verified-email gating, login rate limiting (LIVE-L19, H1-H3)
IAP & webhooks:
- Apple/Google cross-account replay protection (C5/C6/C10/C13, H5/H6)
- migrations 000003-000006 (token hashing, IAP replay, audit_log + webhook_event_log table creation, append-only audit log)
Authorization & races:
- file-ownership owner-OR-member fix (C7), atomic share-code join (C9/H9), device-token reassignment (C8/LOW-3)
Secrets & deploy:
- secrets file-mounted at /etc/honeydue/secrets, not env (F8); Redis password out of the ConfigMap (HIGH-1); B2 keys reconciled
- digest-pinned images, admin ingress hardening, CSP/HSTS, /metrics lockdown; kubeconfig 0600, etcd secrets-encryption, fail2ban + unattended-upgrades at provision; secret-rotation runbook
Build, vet, and the full test suite (incl. -race) pass; the goose migration chain is verified against PostgreSQL 16.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 22:28:33 -05:00
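The SHA-256 token-storage item (C1) in the commit above can be illustrated with a minimal Go sketch. It assumes the raw token is hashed before storage and that a presented token is hashed and compared against the stored digest. `hashToken` and `tokenMatches` are hypothetical names, and the constant-time comparison is an illustrative choice, not something the commit message confirms about the repository's code.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// hashToken returns the hex-encoded SHA-256 digest of a raw auth token.
// Hypothetical helper: only this digest would be persisted, never the raw token.
func hashToken(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

// tokenMatches hashes the presented token and compares it with the stored
// digest in constant time (an assumption made for this sketch).
func tokenMatches(presented, storedHex string) bool {
	got := hashToken(presented)
	return subtle.ConstantTimeCompare([]byte(got), []byte(storedHex)) == 1
}

func main() {
	stored := hashToken("example-token") // what the database row would hold
	fmt.Println(tokenMatches("example-token", stored)) // true
	fmt.Println(tokenMatches("wrong-token", stored))   // false
}
```

With this shape, a leaked database row exposes only a digest, so the row alone cannot be replayed as a bearer token.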
Trey t	139a990ebc	fix(observability): unbreak vmagent SD on fresh deploy + ship kube-state-metrics
vmagent's k8s service discovery has been silently broken for 17+ days because k3s's NetworkPolicy controller evaluates egress AFTER kube-proxy's DNAT (contrary to the k8s spec). Pod → ClusterIP 10.43.0.1:443 was DNAT'd to <node_public_ip>:6443, and the resulting :6443 destination matched none of vmagent's egress rules → TCP RST → "connection refused" on every SD watch attempt. Grafana panels using kube_* or up{} metrics returned empty as a result.
Changes:
- network-policies.yaml: commit the previously-cluster-only NetPols (allow-egress-from-vmagent, allow-vmagent-to-api) so a fresh deploy produces a working cluster. The vmagent egress rule now includes :6443 to public IPs (the post-DNAT path) and :8080 to the pod CIDR (for scraping kube-state-metrics).
- observability/kube-state-metrics.yaml: new manifest. Provides the kube_pod_*, kube_deployment_*, kube_service_* metrics that Grafana panels need to count pods, replicas, etc. Runs in kube-system with cluster-scoped RBAC.
- observability/vmagent.yaml:
  * add kube-state-metrics scrape job to the ConfigMap
  * add vmagent-kube-system Role+RoleBinding so cross-namespace SD works
  * replace the misleading liveness probe (was /-/healthy, which lies while SD is broken) with an exec probe that checks /api/v1/targets for at least one healthy target (automatic recovery from future stale-SD incidents)
- scripts/03-deploy.sh: actually apply network-policies.yaml (was committed but never applied) and apply kube-state-metrics.yaml.
- RUNBOOK.md (new): documents the post-DNAT gotcha, the liveness probe trap, bearer-token recovery procedure, drift-detection diff, and a post-redeploy verification checklist.
- .gitignore: cover kubeconfig.tunnel (created during SSH-tunnelled kubectl sessions) so admin client cert can't be committed by accident.
Verified via kubectl --dry-run on all three modified manifests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 00:30:11 -05:00
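The liveness-probe change in the commit above ("at least one healthy target" from /api/v1/targets) can be sketched in Go as follows. This is only an illustration: the actual probe is an exec probe whose command the commit does not show, `hasHealthyTarget` is a hypothetical name, and the JSON shape is assumed to follow the Prometheus-style targets response (`data.activeTargets[].health`).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// targetsResponse models only the fields this check needs from an
// assumed Prometheus-style /api/v1/targets payload.
type targetsResponse struct {
	Data struct {
		ActiveTargets []struct {
			Health string `json:"health"`
		} `json:"activeTargets"`
	} `json:"data"`
}

// hasHealthyTarget reports whether at least one active target is "up".
// A probe built on this would fail when service discovery yields no
// healthy targets, even though the process itself still answers HTTP.
func hasHealthyTarget(body []byte) (bool, error) {
	var resp targetsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return false, err
	}
	for _, t := range resp.Data.ActiveTargets {
		if t.Health == "up" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	sample := []byte(`{"status":"success","data":{"activeTargets":[{"health":"down"},{"health":"up"}]}}`)
	ok, err := hasHealthyTarget(sample)
	fmt.Println(ok, err)
}
```

The point of checking targets and not a plain health endpoint is the one the commit makes: a process can be alive while its service discovery is stale.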
Trey t	12b2f9d43b	Adopt pressly/goose for schema migrations
Replaces the previous hand-rolled MigrateWithLock + GORM AutoMigrate path, which had two compounding problems:
- AutoMigrate ran on every pod startup (~5 min over the transatlantic link) even when no schema changes had landed
- pg_advisory_lock is session-scoped, which silently fails through Neon's pgbouncer transaction-mode pooler; turns out this is a known and documented limitation that bites golang-migrate too
Goose was chosen over golang-migrate (the other heavyweight) because:
- Goose wraps each migration file in a transaction by default, so a failure rolls back cleanly instead of leaving a "dirty" version state requiring manual force-reset (golang-migrate's known weakness, per its own issue tracker; see #1001 + Atlas's writeup)
- Goose's locking is opt-in. We don't opt in: migrations run as a single Kubernetes Job, which IS the singleton process. No advisory lock needed at all.
Layout:
- migrations/000001_init.sql: schema-only pg_dump of the live Neon DB at adoption, stripped of psql-only directives that block goose's bookkeeping insert. Pre-goose hand-numbered migrations 002-022 had their effects folded into this baseline; deleted from the live tree but preserved in git history at `58e6997`.
- Dockerfile installs `goose v3.22.1` at build time and copies the binary into the api image. The migrate Job reuses the api image with command=goose, so no separate image to build/push/version.
- deploy-k3s/manifests/migrate/job.yaml: a one-shot Job that strips the -pooler segment from DB_HOST (advisory lock won't survive pgbouncer transaction-mode), runs `goose up`, exits.
- deploy-k3s/scripts/03-deploy.sh: deletes any prior Job, applies the fresh one, `kubectl wait --for=condition=complete --timeout=10m`, then proceeds with api/worker rollout. Job failure aborts the deploy before any new app pod sees a stale schema.
- internal/database/database.go::RequireSchemaApplied checks goose_db_version on startup. api/worker refuse to boot if the table is missing or its latest row has is_applied=false: the fail-fast for "operator forgot to run migrate."
- Makefile: migrate-up / migrate-down / migrate-status / migrate-new for local workflow.
Production DB was bootstrapped manually:
    $ goose -dir migrations postgres "$DSN" version # creates table
    $ psql ... -c "INSERT INTO goose_db_version (version_id, is_applied, tstamp) VALUES (1, true, NOW());"
Smoke test against fresh Postgres locally: 50 user tables created in 284ms via `goose up`, version_id=1 + is_applied=t recorded.
Verified the local goose CLI talks to prod successfully:
    $ goose ... status
    Applied At                  Migration
    =======================================
    Mon Apr 27 03:43:55 2026 -- 000001_init.sql
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 22:46:36 -05:00
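The "-pooler" stripping that the migrate Job performs on DB_HOST can be sketched in Go as below. The Job's real implementation is not shown in the commit and may well be a shell one-liner; `directHost` is a hypothetical name, and the sketch assumes the pooled hostname carries `-pooler` as a suffix of its first DNS label. The hostnames in `main` are made-up placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

// directHost turns a pooled hostname into its direct (non-pooled) form by
// removing a trailing "-pooler" from the first DNS label. Hosts without
// that suffix are returned unchanged.
func directHost(host string) string {
	label, rest, found := strings.Cut(host, ".")
	label = strings.TrimSuffix(label, "-pooler")
	if !found {
		return label
	}
	return label + "." + rest
}

func main() {
	// Placeholder hostnames, not real endpoints.
	fmt.Println(directHost("ep-example-123456-pooler.region.example.tech"))
	fmt.Println(directHost("ep-example-123456.region.example.tech"))
}
```

Limiting the rewrite to the first label avoids touching a domain that merely contains the string "-pooler" somewhere later in the name.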
Trey t	d3708e6c72	Fix /metrics double-gzip + deploy script for amd64 build
The Echo gzip middleware was wrapping promhttp's pre-gzipped output, so vmagent received double-compressed bytes that failed the Prometheus parser with binary garbage. Skipping /metrics in the gzip Skipper.
Three deploy-script fixes uncovered while shipping this:
- _config.sh had backticks around "kubectl get cm" inside the python heredoc, which bash treated as command substitution when KUBECONFIG was set. Quoted the literal instead.
- 03-deploy.sh now passes --platform linux/amd64 to all docker builds so arm64 Macs don't push images that fail with "exec format error" on the Hetzner CX nodes.
- OBS_INGEST_TOKEN lookup was reading deploy-k3s/prod.env instead of the actual deploy/prod.env at the repo root.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 14:42:15 -05:00
Trey t	372d4d2d37	deploy-k3s: apply observability manifests during 03-deploy
vmagent.yaml lives under manifests/observability/; the deploy script now substitutes the OBS_INGEST_TOKEN from deploy/prod.env into the manifest before apply, and waits on the vmagent rollout. Manual kubectl apply is no longer needed after the next deploy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 14:16:59 -05:00
Trey t	15359401fa	Deploy honeyDueAPI-Web to k3s at app.myhoneydue.com
The Next.js 16 webapp in sibling repo honeyDueAPI-Web now runs alongside api/worker/admin on the cluster. Uses a server-side proxy pattern: browser hits app.myhoneydue.com, Next.js route handlers forward to the Go API with an httpOnly cookie, so no CORS entry or Allowed-Hosts change is needed on the API side. Availability mirrors api (3 replicas, PDB minAvailable:2, topologySpreadConstraints across nodes).
Changes:
- deploy-k3s/manifests/web/deployment.yaml: 3 replicas, readOnly root FS, drops all caps, mounts emptyDir for /app/.next/cache and /tmp, reads API_URL from honeydue-config.
- deploy-k3s/manifests/web/service.yaml: ClusterIP :3000.
- deploy-k3s/manifests/rbac.yaml: ServiceAccount web with automountServiceAccountToken: false.
- deploy-k3s/manifests/pod-disruption-budgets.yaml: web-pdb minAvailable: 2.
- deploy-k3s/manifests/ingress/ingress-simple.yaml: route app.myhoneydue.com → web:3000.
- deploy-k3s/scripts/_config.sh: emit API_URL into the ConfigMap.
- deploy-k3s/scripts/03-deploy.sh: build + push + apply the web image alongside api/worker/admin. Reads NEXT_PUBLIC_POSTHOG_KEY and NEXT_PUBLIC_POSTHOG_HOST from the operator shell env (not committed). Also adds the --build-arg NEXT_PUBLIC_API_URL wiring for the admin image that was previously only done manually.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 10:11:17 -05:00
Trey t	34553f3bec	Add K3s dev deployment setup for single-node VPS
Mirrors the prod deploy-k3s/ setup but runs all services in-cluster on a single node: PostgreSQL (replaces Neon), MinIO S3-compatible storage (replaces B2), Redis, API, worker, and admin. Includes fully automated setup scripts (00-init through 04-verify), server hardening (SSH, fail2ban, ufw), Let's Encrypt TLS via Traefik, network policies, RBAC, and security contexts matching prod.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 21:30:39 -05:00

9 Commits