honeyDueAPI

Author	SHA1	Message	Date
Trey T	3b2ea9959a	deploy: add node-exporter DaemonSet + vmagent scrape job Per-node host metrics (node_filesystem_*, node_memory_*, node_load*) were missing — a node running out of disk would silently fail the cluster before any dashboard signal (RUNBOOK §11.1 gap #9). Adds: - node-exporter DaemonSet (pod-networked, :9100; host /proc,/sys,/ ro) so vmagent scrapes it pod-to-pod over the cluster CIDR, independent of node public IPs (the netpol node-IP list is OVH-stale). - two additive NetworkPolicies (default-deny-all is in force): ingress to node-exporter from vmagent, and vmagent egress to the pod CIDR on :9100. - a node-exporter scrape job in the vmagent-config ConfigMap. Feeds the new "Node host health" row (disk/mem/load) on the eli5 dashboard. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 21:41:40 -05:00
Trey t	25897e913e	Auto-verify Sign in with Apple emails Apple OIDC mapper now marks the email verified unconditionally via verified_addresses. SIWA cryptographically proves control of the Apple ID and Apple owns/verifies the (relay) email, so a code is redundant. Gating on Apple's `email_verified` claim was unreliable — Apple omits it on many authorizations, which made verification random (sometimes a surprise code prompt). Password sign-ups still verify via the honeyDue API flow. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 22:30:33 -05:00
Trey t	81e454d86d	Add admin-create registration + live email-verified flag Registration now goes through POST /api/auth/register, which admin-creates the Kratos identity (unverified email, NO auto-sent code). Kratos self-service registration never returns the verification flow id, so the client could never submit the user's code to the right flow; admin creation lets the client own a single verification flow instead. Also surface the live Kratos verified flag and fix Apple audience + team IDs. - kratos.Client.CreateIdentity via admin API; ErrIdentityExists / ErrInvalidCredentials - AuthService.Register + AuthHandler.Register + public POST /api/auth/register/ - CurrentUser overrides stale user_profile.verified with the live Kratos flag; UserRepository.MarkVerified mirrors it back - configmap: additional_id_token_audiences allows the .dev bundle id_token - fix Apple/APNs team id V3PF3M6B6U -> X86BR9WTLD in .env.example + dev init Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 17:46:30 -05:00
Trey t	7b87f2e392	fix(kratos): drop cloudflare-only middleware on auth ingress iOS Sign In with Apple failed silently — the KMP client never reached Kratos. Traced to the cloudflare-only Traefik middleware rejecting every request at the auth ingress. Root cause: on this cluster klipper-lb sits in front of Traefik and SNATs the source IP. Traefik's ipAllowList sees the klipper-lb pod IP, not Cloudflare's real source IP — so even legitimate iOS requests proxied through Cloudflare get 403'd. The api ingress doesn't have this middleware (and works correctly), so removing it from auth matches the working pattern. Kratos is the user-facing OIDC endpoint — every iOS/web user device needs to reach it. Cloudflare's edge still does DDoS protection; Kratos applies its own per-flow rate limits. The IP allowlist was buying nothing here and breaking everything. Verified after this change: - GET /health/alive → 200 - GET /health/ready → 200 - GET /self-service/login/api → 200 + valid flow body listing apple as an OIDC provider option Related but not fixed by this commit: the same klipper-lb SNAT issue affects admin.myhoneydue.com (which retains cloudflare-only). Admin basic auth still gates real access there, but the IP check is dead weight. Proper fix is configuring Traefik ipStrategy to read the client IP from X-Forwarded-For (set by Cloudflare). Tracked as a follow-up. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-03 11:14:35 -05:00
Trey t	6de90acef7	feat(kratos): deploy Ory Kratos to production (Apple-only OIDC) Auth was structurally broken — the api's Kratos middleware was pointing at http://kratos:4433 but Kratos wasn't deployed. The only thing keeping users logged in was a 5-min Redis cache; once it expired the middleware called Whoami → no DNS → 401 → forced relogin with no path back. This commit deploys Kratos for real: Manifests: - kratos.yaml + migrate-job.yaml: pin oryd/kratos:v26.2.0@sha256:92eedc... (CalVer current stable as of 2026-06-03) - configmap.yaml: drop Google OIDC provider (not in scope); fill the Apple provider with real Services ID / Team ID / Key ID — Apple now sits at providers[0] - kratos.yaml: drop the Google-secret env binding; rebind APPLE_PRIVATE_KEY to PROVIDERS_0_APPLE_PRIVATE_KEY (shifted from index 1) - network-policies.yaml: add a kratos egress rule to allow-egress-from-api. Without this, even with kratos running, the api gets "connection refused" on http://kratos:4433 (post-DNAT NetworkPolicy enforcement — runbook §9.2). 
Operator prerequisites that were completed alongside this commit: - Neon kratos database created (separate from honeyDue, owner neondb_owner) - Cloudflare DNS for auth.myhoneydue.com (3 A records, proxied) - kratos: block added to config.yaml (gitignored): DSN to the Neon DIRECT endpoint, cookie + cipher secrets generated, Fastmail SMTPS URI, .p8 contents inline Out of scope intentionally: - Google sign-in (additive; can append providers[] later) - Migrating existing auth_user rows onto Kratos identities — pre-prod; existing users will need to sign in fresh, which creates a new Kratos identity and a new local user row (per migration plan in manifests/kratos/README.md). Verified end-to-end: - 338 schema migrations applied successfully - 2/2 kratos pods Ready - api → kratos:4433/sessions/whoami returns 401 for invalid token (was "connection refused" before this commit's NetworkPolicy patch) - auth.myhoneydue.com resolves through CF; cloudflare-only middleware keeps the origin protected exactly like the other hostnames Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-03 11:08:09 -05:00
Trey t	e448ec66dc	docs(runbook): rewrite for OVH BHS cluster + Tier-3 observability TODOs Brings the runbook in line with the 2026-06-03 Hetzner → OVH cutover: - Section 1-5: topology, machines (3x OVH VPS-1 BHS), software versions, network/firewall, DNS, filesystem layout — all reflect the live OVH install instead of the historical Hetzner setup. - Section 6: canonical install-from-clean-boxes procedure (the literal commands run on 2026-06-03), so anyone can stand up a backup cluster by following along. - Section 9: keeps existing gotchas (vmagent NetPol, token-blown-away, healthy-but-empty) and adds four new ones discovered during the OVH build: rbac.yaml not in 03-deploy.sh, namespace label missing from api metrics (use service="api"), cluster-label collision when two clusters push concurrently, worker double-firing on cutover. - Section 11.1: enumerates Tier-3 observability gaps surfaced while building the honeydue-eli5-overview dashboard (node-exporter not deployed, Traefik metrics off, push success counters absent, worker /metrics endpoint absent, cache hit rate uninstrumented, APNs latency uninstrumented). - Section 12: dated audit trail of cluster changes. Pure documentation; no code or manifest changes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-03 09:34:35 -05:00
Trey t	3d3ba84df0	fix(auth): delete the Kratos identity on account deletion Account deletion removed all local data but left the Ory Kratos identity intact — an orphaned identity that can still authenticate. Close the gap: - kratos.Client gains the admin API: NewClient(publicURL, adminURL) and DeleteIdentity (DELETE /admin/identities/{id}; a 404 is treated as success so a retry after a partial failure is idempotent). - AuthService.DeleteAccount deletes the Kratos identity FIRST; if that call fails it aborts before touching local data, so the operation is retryable rather than partially applied. - KRATOS_ADMIN_URL config (default http://kratos:4434) + router wiring. - kratos NetworkPolicy split: the api pods may now reach the admin API :4434 (Traefik still reaches only the public API :4433). - kratos CORS: allow_credentials + OPTIONS so the web browser flows (ory_kratos_session cookie) work; origins stay an explicit allowlist. - Regression tests: identity teardown happens, and a Kratos failure aborts the deletion instead of orphaning local data. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 21:55:33 -05:00
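The ordering described in 3d3ba84df0 (identity first, 404 counted as success, abort before local data on failure) can be sketched in Go roughly as follows. This is a minimal illustration, not the repository's code: `deleteIdentity`, `deleteAccount`, `stubTransport`, and `runDelete` are hypothetical names, and treating HTTP 204 as the normal success status is an assumption. Only the `DELETE /admin/identities/{id}` path and the default admin URL come from the commit message.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
)

// deleteIdentity issues DELETE {adminURL}/admin/identities/{id}.
// A 404 is treated as success so that a retry after a partial failure is a no-op.
// Assumption: 204 No Content is the regular success status.
func deleteIdentity(ctx context.Context, hc *http.Client, adminURL, id string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodDelete, adminURL+"/admin/identities/"+id, nil)
	if err != nil {
		return err
	}
	resp, err := hc.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	switch resp.StatusCode {
	case http.StatusNoContent, http.StatusNotFound:
		return nil
	default:
		return fmt.Errorf("delete identity %s: unexpected status %d", id, resp.StatusCode)
	}
}

// deleteAccount removes the identity first and only then runs deleteLocal.
// If the identity call fails, local data is left untouched and the caller can retry.
func deleteAccount(ctx context.Context, hc *http.Client, adminURL, id string, deleteLocal func() error) error {
	if err := deleteIdentity(ctx, hc, adminURL, id); err != nil {
		return fmt.Errorf("aborting before local deletion: %w", err)
	}
	return deleteLocal()
}

// stubTransport answers every request with a fixed status code (no network involved).
type stubTransport int

func (s stubTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	return &http.Response{StatusCode: int(s), Body: http.NoBody, Request: req}, nil
}

// runDelete exercises deleteAccount against a stubbed admin API and reports
// whether the local deletion step ran.
func runDelete(status int) (localDeleted bool, err error) {
	hc := &http.Client{Transport: stubTransport(status)}
	err = deleteAccount(context.Background(), hc, "http://kratos:4434", "example-id", func() error {
		localDeleted = true
		return nil
	})
	return localDeleted, err
}

func main() {
	for _, status := range []int{204, 404, 500} {
		deleted, err := runDelete(status)
		fmt.Printf("status=%d localDeleted=%v err=%v\n", status, deleted, err)
	}
}
```

With the stub returning 500, the local deletion callback never runs, which is the retryable behaviour the commit describes.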
Trey t	b66151ddd9	feat(auth): scaffold Ory Kratos identity service — phase 1 (infrastructure) First phase of replacing the hand-rolled auth (internal/services/auth_service.go et al.) with Ory Kratos. This commit is infrastructure only — Kratos will run but nothing consumes it yet; the Go API still does its own auth until phase 2. Adds deploy-k3s/manifests/kratos/: - configmap.yaml — kratos.yml, identity schema, Google/Apple OIDC claim mappers (no secrets in the ConfigMap) - migrate-job.yaml — `kratos migrate sql`, run before the Deployment - kratos.yaml — Deployment (x2), Service, NetworkPolicies - ingress.yaml — auth.myhoneydue.com -> Kratos public API :4433 - README.md — operator prerequisites + deploy runbook Wiring: - 02-setup-secrets.sh creates kratos-secrets, gated on a config.yaml `kratos:` block (DSN, cookie/cipher, SMTP URI, OIDC client secret, Apple key). - 03-deploy.sh applies the Kratos manifests + runs the migrate Job, gated on the kratos-secrets Secret existing. Both gates mean the existing stack deploys completely unaffected until the operator completes the prerequisites (Neon `kratos` DB, auth.myhoneydue.com DNS, Apple/Google OAuth apps, Kratos image version). Pre-production, so no user-data migration — see manifests/kratos/README.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 16:24:38 -05:00
Trey t	c845771946	feat(observability): drop health/metrics probe noise from shipped logs The api logs every request, so k8s liveness/readiness probes on /api/health/ and vmagent's /metrics scrape drowned Loki in 2xx access logs. Alloy now drops successful probe/scrape access lines at ingest (loki.process stage.drop) — a non-2xx health check, or one logged above info level, still matches nothing and is kept. Also hardens Alloy's read-offset store: moved /tmp/alloy from an emptyDir to a hostPath and set loki.source.file tail_from_end=true, so a pod restart resumes from the saved offset instead of re-reading log files from the start — which made Loki 400-reject the now-too-old entries ("entry too far behind") and stalled shipping. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 21:29:15 -05:00
Trey t	93fddc3769	feat(observability): ship pod logs to Loki via Grafana Alloy Adds a Grafana Alloy DaemonSet that tails honeydue-namespace pod logs from /var/log/pods and pushes them to Loki at obs.88oakapps.com, reusing the existing OBS_INGEST_TOKEN (14-day retention). - deploy-k3s/manifests/observability/alloy-logs.yaml — DaemonSet + RBAC + token Secret + Alloy config. Runs as root (/var/log/pods is 0750 root:root) but otherwise locked down: all caps dropped, read-only root filesystem, seccomp RuntimeDefault, read-only hostPath mount. - network-policies.yaml — allow-egress-from-alloy-logs (DNS + k8s API + obs HTTPS), mirroring the vmagent egress policy. - 03-deploy.sh — applies alloy-logs with the OBS_INGEST_TOKEN substitution and waits for the DaemonSet rollout. The Loki container, nginx /loki/api/v1/push route, and Grafana Loki datasource live on the obs server and are not repo-managed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 20:04:09 -05:00
Trey t	c77ff07ce9	fix(security): remediate 2026-05-12 audit findings (Stages 2–5) Remediation of the 2026-05-12/13 audits (78 findings + cluster gaps), tracked in deploy-k3s/SECURITY.md, plus fixes from two independent post-remediation reviews. Auth & sessions: - SHA-256 hashed auth-token storage (C1); prior-token cache eviction on re-login (MEDIUM-1) - local Google JWKS verification, iss/aud/exp checks (C2/C3) - constant-time login + generic errors (L1/LIVE-L11/LIVE-L13) - per-account login lockout keyed on distinct source IPs (M5/MEDIUM-3) - verified-email gating, login rate limiting (LIVE-L19, H1-H3) IAP & webhooks: - Apple/Google cross-account replay protection (C5/C6/C10/C13, H5/H6) - migrations 000003-000006 (token hashing, IAP replay, audit_log + webhook_event_log table creation, append-only audit log) Authorization & races: - file-ownership owner-OR-member fix (C7), atomic share-code join (C9/H9), device-token reassignment (C8/LOW-3) Secrets & deploy: - secrets file-mounted at /etc/honeydue/secrets, not env (F8); Redis password out of the ConfigMap (HIGH-1); B2 keys reconciled - digest-pinned images, admin ingress hardening, CSP/HSTS, /metrics lockdown; kubeconfig 0600, etcd secrets-encryption, fail2ban + unattended-upgrades at provision; secret-rotation runbook Build, vet, and the full test suite (incl. -race) pass; the goose migration chain is verified against PostgreSQL 16. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 22:28:33 -05:00
Trey t	2004f9c5b2	fix(observability): relax vmagent liveness probe — was crash-looping every ~5m The previous probe had timeoutSeconds=1 which is too tight for the shell pipeline (sh + wget + grep + comparison). On a busy node the wget call regularly exceeded 1s, the exec timed out, and 3 consecutive timeouts triggered SIGTERM. Result: vmagent restarted ~5x per 30 min, causing brief gaps that made the Grafana "Pods up" panel render 0 whenever a refresh happened to coincide with a restart. The relaxed probe still catches the original failure mode (zero healthy targets) but only kills the pod after 10 full minutes of consecutive failure (5 attempts × 2 min period), not 3 minutes (3 × 1 min). timeoutSeconds: 1 → 5 periodSeconds: 60 → 120 failureThreshold: 3 → 5 initialDelaySeconds: 120 → 180 Also added wget -T 4 inside the command so wget itself bounds its network call to 4s — leaving 1s of slack within the 5s exec budget. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 00:39:23 -05:00
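The probe arithmetic in 2004f9c5b2 can be checked with a few lines of Go. This is only a sketch of the approximation the commit itself uses (failure window ≈ failureThreshold × periodSeconds); the `probe` type and the helper names are hypothetical, and the exact moment the kubelet acts can differ slightly from this estimate.

```go
package main

import (
	"fmt"
	"time"
)

// probe holds the liveness settings that matter for the restart budget.
type probe struct {
	periodSeconds    int
	failureThreshold int
	timeoutSeconds   int
}

// secondsToKill approximates how long a continuously failing container
// survives: failureThreshold consecutive failures, one per period.
func secondsToKill(periodSeconds, failureThreshold int) int {
	return periodSeconds * failureThreshold
}

// slackSeconds is the headroom left in the exec timeout after the
// command's own network timeout (wget -T) has elapsed.
func slackSeconds(timeoutSeconds, wgetTimeoutSeconds int) int {
	return timeoutSeconds - wgetTimeoutSeconds
}

func main() {
	before := probe{periodSeconds: 60, failureThreshold: 3, timeoutSeconds: 1}
	after := probe{periodSeconds: 120, failureThreshold: 5, timeoutSeconds: 5}

	for name, p := range map[string]probe{"before": before, "after": after} {
		window := time.Duration(secondsToKill(p.periodSeconds, p.failureThreshold)) * time.Second
		fmt.Printf("%s: restart after about %v of consecutive failures\n", name, window)
	}
	fmt.Printf("exec slack with wget -T 4: %ds\n", slackSeconds(after.timeoutSeconds, 4))
}
```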
Trey t	139a990ebc	fix(observability): unbreak vmagent SD on fresh deploy + ship kube-state-metrics vmagent's k8s service discovery has been silently broken for 17+ days because k3s's NetworkPolicy controller evaluates egress AFTER kube-proxy's DNAT (contrary to the k8s spec). Pod → ClusterIP 10.43.0.1:443 was DNAT'd to <node_public_ip>:6443, and the resulting :6443 destination matched none of vmagent's egress rules → TCP RST → "connection refused" on every SD watch attempt. Grafana panels using kube_* or up{} metrics returned empty as a result. Changes: - network-policies.yaml: commit the previously-cluster-only NetPols (allow-egress-from-vmagent, allow-vmagent-to-api) so a fresh deploy produces a working cluster. The vmagent egress rule now includes :6443 to public IPs (the post-DNAT path) and :8080 to the pod CIDR (for scraping kube-state-metrics). - observability/kube-state-metrics.yaml: new manifest. Provides the kube_pod_*, kube_deployment_*, kube_service_* metrics that Grafana panels need to count pods, replicas, etc. Runs in kube-system with cluster-scoped RBAC. - observability/vmagent.yaml: * add kube-state-metrics scrape job to the ConfigMap * add vmagent-kube-system Role+RoleBinding so cross-namespace SD works * replace the misleading liveness probe (was /-/healthy, which lies while SD is broken) with an exec probe that checks /api/v1/targets for at least one healthy target — automatic recovery from future stale-SD incidents - scripts/03-deploy.sh: actually apply network-policies.yaml (was committed but never applied) and apply kube-state-metrics.yaml. 
- RUNBOOK.md (new): documents the post-DNAT gotcha, the liveness probe trap, bearer-token recovery procedure, drift-detection diff, and a post-redeploy verification checklist. - .gitignore: cover kubeconfig.tunnel (created during SSH-tunnelled kubectl sessions) so admin client cert can't be committed by accident. Verified via kubectl --dry-run on all three modified manifests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 00:30:11 -05:00
Trey t	14026251b7	fix(worker): wire B2 credentials so pending_uploads cleanup cron can run The new TypeUploadCleanup cron (30 * * * *) constructs a StorageService at worker startup so it can call b.client.RemoveObject on B2 when reaping expired pending_uploads rows. Without B2_KEY_ID + B2_APP_KEY the storage service falls back to local disk and crashes on this pod's read-only root filesystem, leaving the cleanup as a no-op. Mirrors the api deployment which already wires these. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 15:25:53 -07:00
Trey t	29c9014a33	feat(uploads): direct-to-B2 presigned uploads with content-length-range policy Replaces the multipart-via-API path for image uploads with a three-step direct-to-storage flow: 1. Client POSTs /api/uploads/presign with content_length + content_type; server validates size (10 MB cap), mime allow-list per category, rate limit (50/hour/user via Redis sliding window), and concurrent unclaimed cap (10 in-flight per user). On success it persists a pending_uploads row, signs an S3 POST policy with content-length-range bound to the claimed length ±256 bytes, and returns the URL+fields. 2. Client POSTs the bytes directly to B2 using the signed policy. B2 enforces size, content-type, and key match before accepting. 3. Client passes upload_ids[] to /api/task-completions/ or /api/documents/. Service HEADs each B2 object, verifies size matches expected_bytes within slack, marks pending_uploads claimed_at, and creates the associated TaskCompletionImage / DocumentImage rows. Bytes never traverse our API server. The 1 MB Echo BodyLimit middleware that was rejecting all task-completion image uploads becomes irrelevant for this path. Existing multipart endpoints stay functional alongside, soak-testing the new path before legacy removal. Cleanup: - cmd/worker registers a new hourly cron (TypeUploadCleanup, "30 * * * *") that reaps pending_uploads where claimed_at IS NULL AND expires_at < NOW(). Reaps both the B2 object and the row. - B2 bucket lifecycle rule on `uploads/` prefix (7 days hide → 1 day delete) documented in deploy-k3s/manifests/b2-lifecycle.md as a backstop. 
Schema: - migrations/000002_pending_uploads.sql adds the table + partial index for cleanup + nullable pending_upload_id FKs on task_taskcompletionimage and task_documentimage. Policy (single tier, no free/pro split): - 10 MB cap per upload - 50 presigns/hour/user - 10 concurrent unclaimed uploads/user - allow-list: jpeg/png/heic/heif/webp for image categories; + pdf for document_file Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 14:36:42 -07:00
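The size and MIME checks from step 1 of 29c9014a33, together with the ±256-byte content-length-range, can be sketched in Go as below. This is an illustrative sketch under stated assumptions: the function and constant names are hypothetical, "10 MB" is interpreted as 10 × 2^20 bytes, the MIME strings are the conventional ones for the listed formats, and the category name `completion_image` is a placeholder. Only `document_file`, the 10 MB cap, the ±256 slack, and the format list come from the commit message. Rate limiting and the concurrent-upload cap are left out.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	maxUploadBytes int64 = 10 << 20 // assumption: 10 MB means 10 * 2^20 bytes
	lengthSlack    int64 = 256      // policy range is the claimed length ±256 bytes
)

var (
	errSize = errors.New("content_length outside the allowed size")
	errType = errors.New("content_type not allowed for this category")
)

// imageTypes is the allow-list for image categories.
var imageTypes = map[string]bool{
	"image/jpeg": true, "image/png": true, "image/heic": true,
	"image/heif": true, "image/webp": true,
}

// typeAllowed applies the per-category allow-list: images everywhere,
// plus PDF for the document_file category.
func typeAllowed(category, contentType string) bool {
	if imageTypes[contentType] {
		return true
	}
	return category == "document_file" && contentType == "application/pdf"
}

// lengthRange validates a presign request and returns the lower and upper
// bounds to place in the POST policy's content-length-range condition.
func lengthRange(category, contentType string, claimed int64) (lo, hi int64, err error) {
	if claimed <= 0 || claimed > maxUploadBytes {
		return 0, 0, errSize
	}
	if !typeAllowed(category, contentType) {
		return 0, 0, errType
	}
	lo = claimed - lengthSlack
	if lo < 0 {
		lo = 0 // never emit a negative lower bound
	}
	return lo, claimed + lengthSlack, nil
}

func main() {
	lo, hi, err := lengthRange("document_file", "application/pdf", 1000)
	fmt.Println(lo, hi, err)

	_, _, err = lengthRange("completion_image", "application/pdf", 1000)
	fmt.Println(err)
}
```

Binding the range to the claimed length rather than to the global cap means the storage side rejects a body that differs materially from what the client declared, which is what lets step 3 compare the stored size against `expected_bytes`.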
Trey t	289a23f7e6	deploy(ingress): drop obsolete scaffold ingress.yaml The directory had two ingress manifests that both define honeydue-api and honeydue-admin: - ingress.yaml (Mar 28, scaffold from `deploy-k3s/` greenfield template) - ingress-simple.yaml (Apr 24, corrected for our actual cluster shape per MIGRATION_NOTES.md) `kubectl apply -f manifests/ingress/` applies both, and ingress.yaml happens to apply last alphabetically (`-` < `.` so `ingress-simple` sorts before `ingress.yaml`), clobbering the corrected manifest. That left the live cluster with two regressions: 1. honeydue-admin had `admin-auth` Traefik middleware in its chain, referencing the `admin-basic-auth` secret. Per MIGRATION_NOTES basic auth is intentionally not applied on this cluster (admin uses in-app auth), so the secret was never created. Traefik logs `secret 'honeydue/admin-basic-auth' not found` on every reconcile and refuses to materialize the admin router → 404. 2. honeydue-api lost the apex `myhoneydue.com` rule that ingress-simple.yaml adds for the marketing landing page → apex 404. `kubectl apply -f ingress-simple.yaml` against the live cluster restored both routes (admin/apex back to 200). Removing the stale file from the repo prevents the next deploy from regressing. Refs: deploy-k3s/MIGRATION_NOTES.md ("Admin basic auth \| Not applied — in-app auth only").	2026-04-26 23:44:21 -05:00
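The ordering claim in 289a23f7e6 (`-` sorts before `.`, so `ingress.yaml` is applied last) is easy to confirm with a byte-wise sort. The sketch below assumes, as the commit message does, that files in the directory are applied in plain lexical order; `applyOrder` and `lastApplied` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"sort"
)

// applyOrder returns the file names in byte-wise lexical order
// without modifying the caller's slice.
func applyOrder(names []string) []string {
	out := append([]string(nil), names...)
	sort.Strings(out)
	return out
}

// lastApplied is the file whose definitions win when several
// manifests define the same object.
func lastApplied(names []string) string {
	ordered := applyOrder(names)
	return ordered[len(ordered)-1]
}

func main() {
	files := []string{"ingress.yaml", "ingress-simple.yaml"}
	fmt.Println(applyOrder(files))
	fmt.Println("last applied:", lastApplied(files))
	// '-' is byte 0x2D and '.' is byte 0x2E, which decides the comparison
	// at the first position where the two names differ.
	fmt.Printf("'-' = %#x, '.' = %#x\n", '-', '.')
}
```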
Trey t	12b2f9d43b	Adopt pressly/goose for schema migrations Replaces the previous hand-rolled MigrateWithLock + GORM AutoMigrate path, which had two compounding problems: - AutoMigrate ran on every pod startup (~5 min over the transatlantic link) even when no schema changes had landed - pg_advisory_lock is session-scoped, which silently fails through Neon's pgbouncer transaction-mode pooler — turns out this is a known and documented limitation that bites golang-migrate too Goose was chosen over golang-migrate (the other heavyweight) because: - Goose wraps each migration file in a transaction by default, so a failure rolls back cleanly instead of leaving a "dirty" version state requiring manual force-reset (golang-migrate's known weakness, per its own issue tracker — see #1001 + Atlas's writeup) - Goose's locking is opt-in. We don't opt in: migrations run as a single Kubernetes Job, which IS the singleton process. No advisory lock needed at all. Layout: - migrations/000001_init.sql — schema-only pg_dump of the live Neon DB at adoption, stripped of psql-only directives that block goose's bookkeeping insert. Pre-goose hand-numbered migrations 002-022 had their effects folded into this baseline; deleted from the live tree but preserved in git history at `58e6997`. - Dockerfile installs `goose v3.22.1` at build time and copies the binary into the api image. The migrate Job reuses the api image with command=goose, so no separate image to build/push/version. - deploy-k3s/manifests/migrate/job.yaml: a one-shot Job that strips the -pooler segment from DB_HOST (advisory lock won't survive pgbouncer transaction-mode), runs `goose up`, exits. 
- deploy-k3s/scripts/03-deploy.sh: deletes any prior Job, applies the fresh one, `kubectl wait --for=condition=complete --timeout=10m`, then proceeds with api/worker rollout. Job failure aborts the deploy before any new app pod sees a stale schema. - internal/database/database.go::RequireSchemaApplied checks goose_db_version on startup. api/worker refuse to boot if the table is missing or its latest row has is_applied=false — the fail-fast for "operator forgot to run migrate." - Makefile: migrate-up / migrate-down / migrate-status / migrate-new for local workflow. Production DB was bootstrapped manually: $ goose -dir migrations postgres "$DSN" version # creates table $ psql ... -c "INSERT INTO goose_db_version (version_id, is_applied, tstamp) VALUES (1, true, NOW());" Smoke test against fresh Postgres locally: 50 user tables created in 284ms via `goose up`, version_id=1 + is_applied=t recorded. Verified the local goose CLI talks to prod successfully: $ goose ... status Applied At Migration ======================================= Mon Apr 27 03:43:55 2026 -- 000001_init.sql Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 22:46:36 -05:00
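The `-pooler` stripping that 12b2f9d43b attributes to the migrate Job can be illustrated in Go. The Job itself may well do this in shell; the sketch below only shows the string transformation, assuming the `-pooler` suffix sits on the first DNS label of the host. `directHost` and the example hostname are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// directHost turns a pooled endpoint host into the direct endpoint host by
// dropping a "-pooler" suffix from the first DNS label. Hosts without the
// suffix are returned unchanged.
func directHost(host string) string {
	label, rest, hasDot := strings.Cut(host, ".")
	label = strings.TrimSuffix(label, "-pooler")
	if !hasDot {
		return label
	}
	return label + "." + rest
}

func main() {
	// Hypothetical endpoint name, for illustration only.
	pooled := "ep-example-123456-pooler.us-east-2.aws.neon.tech"
	fmt.Println(directHost(pooled))
	fmt.Println(directHost("localhost"))
}
```

Restricting the trim to the first label avoids touching a host that merely contains the text `-pooler` elsewhere in its name.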
Trey t	4049b704c3	Revert "deployment: extend api startup probe budget for direct-endpoint migrations" This reverts commit `a94744061e`.	2026-04-26 22:22:07 -05:00
Trey t	a94744061e	deployment: extend api startup probe budget for direct-endpoint migrations The migration-pooler fix (commit `30966c6`) routes AutoMigrate through Neon's direct compute endpoint to keep the session-scoped advisory lock alive. That swap means each DDL pays a fresh transatlantic RTT instead of riding warm pooler connections, so AutoMigrate's runtime climbs from ~90s to 4-6 min on the first pod of a cold boot. With the previous 240s grace the startup probe was killing pods mid-migration. Bumping to 120 × 5s = 600s grace. Subsequent pods inherit the schema and finish their migrate-no-op in seconds, so this only matters for the single first-pod migration window after a deploy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 22:05:58 -05:00
Trey t	88fb1751c7	Cut /api/tasks/ p99 from ~2500ms toward ~150-300ms Stack of optimizations against the same Hetzner→Neon transatlantic link. The trace revealed every visible ms was network/proxy overhead — DB execution itself is sub-millisecond per query (verified via EXPLAIN ANALYZE: index scans on every hot path). Connection layer: - DB_HOST → Neon pooler endpoint (-pooler suffix). PgBouncer transaction-mode keeps backend Postgres connections warm so we no longer pay the ~110ms Postgres-startup RTT on cold queries. - GORM pool tuned: MaxIdleConns 10→20, MaxLifetime 600s→1800s, MaxIdleTime added (default 0 = never close idle). - Eager pool warm-up at boot via parallel pings — first user request no longer pays the ~440ms TCP+TLS+startup handshake. - Redis maxmemory-policy noeviction → allkeys-lru. Cache writes will evict cold keys instead of erroring at the 256MB limit. Auth layer: - TokenCacheTTL 5min → 1 hour (Redis token cache). - UserCacheTTL 30s → 5min (in-memory User cache, per pod). - UserCache gains a 5,000-entry LRU cap so a flood of unique users can't blow up pod RSS. ~5MB worst-case per pod. - Token + user lookup collapsed from 2 GORM Preload queries into a single INNER JOIN. Saves 1 RTT per cold-cache request. - Auth middleware's m.db.* now use db.WithContext(ctx) so the SQL spans nest under the parent HTTP request in Jaeger. Service layer: - TaskService.ListTasks: replaced two-step FindResidenceIDsByUser → GetKanbanDataForMultipleResidences with a single GetKanbanDataForUser that uses a Postgres subquery for residence-access. One round-trip instead of two. - New CacheService residence-IDs cache: "residence_ids_user:<id>" with 5-min TTL. 
Wired into Task/Residence/Contractor/Document services for the four hot read paths that need this list. - Cache invalidation on every relevant mutation: CreateResidence, DeleteResidence, JoinWithCode, RemoveUser. DeleteResidence invalidates every member of the residence, not just the owner. What this stacks up to (Hetzner→Neon, before US migration), as Path: Before → After (target): - Cache-warm authed read: ~800ms → ~100-200ms - Cache-cold authed read (1st in 1hr): ~2500ms → ~500-700ms - First request after deploy: ~2500ms → ~700-900ms The endgame US-region migration on top of this gets us to ~30-50ms warm-cache, but we're shippable at ~150ms warm right now. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 17:13:50 -05:00
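The capped in-memory user cache from 88fb1751c7 (an LRU bound plus a TTL) might look roughly like the Go sketch below. It is not the repository's UserCache: the type, its string values, and the demo helpers are hypothetical, and a tiny capacity stands in for the 5,000-entry cap so that eviction is visible.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
	"time"
)

// entry is one cached value plus its expiry time.
type entry struct {
	key, value string
	expires    time.Time
}

// userCache is a size-capped LRU with a per-entry TTL.
type userCache struct {
	mu       sync.Mutex
	capacity int
	ttl      time.Duration
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // key -> element holding *entry
	now      func() time.Time         // injectable clock
}

func newUserCache(capacity int, ttl time.Duration) *userCache {
	return &userCache{capacity: capacity, ttl: ttl, order: list.New(),
		items: make(map[string]*list.Element), now: time.Now}
}

// Set stores a value and evicts the least recently used entry when over capacity.
func (c *userCache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[key]; ok {
		e := el.Value.(*entry)
		e.value, e.expires = value, c.now().Add(c.ttl)
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, value: value, expires: c.now().Add(c.ttl)})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

// Get returns a live value and marks it most recently used; expired entries are dropped.
func (c *userCache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	e := el.Value.(*entry)
	if c.now().After(e.expires) {
		c.order.Remove(el)
		delete(c.items, key)
		return "", false
	}
	c.order.MoveToFront(el)
	return e.value, true
}

// evictionDemo stores a and b in a 2-entry cache, touches a, then adds c,
// and reports which keys are still present.
func evictionDemo() (a, b, c bool) {
	cache := newUserCache(2, 5*time.Minute)
	cache.Set("a", "1")
	cache.Set("b", "2")
	cache.Get("a")      // a becomes most recently used
	cache.Set("c", "3") // over capacity: b, the least recently used, is evicted
	_, a = cache.Get("a")
	_, b = cache.Get("b")
	_, c = cache.Get("c")
	return a, b, c
}

// expiryDemo advances a fake clock past the TTL and reports whether the entry survived.
func expiryDemo() bool {
	clock := time.Unix(0, 0)
	cache := newUserCache(2, 5*time.Minute)
	cache.now = func() time.Time { return clock }
	cache.Set("a", "1")
	clock = clock.Add(6 * time.Minute)
	_, ok := cache.Get("a")
	return ok
}

func main() {
	fmt.Println(evictionDemo())
	fmt.Println(expiryDemo())
}
```

The cap bounds memory regardless of how many distinct users arrive, while the TTL bounds staleness; the two limits are independent, which is why the commit lists them as separate changes.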
Trey t	bc3da007db	Wire OpenTelemetry tracing — HTTP, B2, APNs, FCM, asynq, GORM (partial) Step 1 — OTel SDK: cmd/api and cmd/worker initialize a tracer provider that exports OTLP/HTTP to obs.88oakapps.com (Jaeger all-in-one). Sampling is AlwaysSample in dev (DEBUG=true) and TraceIDRatioBased(0.1) in prod, overridable via OTEL_TRACES_SAMPLER_ARG. Service names are honeydue-api and honeydue-worker. otelecho.Middleware opens a span per HTTP request. Step 2 — Manual spans: storage_service.Upload now takes ctx and emits storage.upload + b2.PutObject spans (size_bytes, key, mime_type, bucket, result attrs). APNs Send/SendWithCategory and FCM sendOne emit per-token spans with topic, status_code, reason. Asynq middleware emits asynq.handle:<task_type> per job with retry/payload attrs and records asynq_job_duration_seconds. Step 3 — Database: otelgorm plugin registered in database.Connect, so any SQL emitted via db.WithContext(ctx) attaches to the request span. Every repository now exposes WithContext(ctx) *XRepository as the migration helper. TaskService.ListTasks and GetTasksByResidence are migrated end-to-end (ctx threaded through handler → service → repo); remaining services adopt the same pattern incrementally — pre-migration methods still emit untraced SQL via the unchanged db field. OBS_TRACES_URL and OBS_INGEST_TOKEN flow from deploy/prod.env → honeydue-secrets → api+worker Deployments via secretKeyRef (optional). 02-setup-secrets.sh sources them from prod.env on next run; manifests mark both env vars optional so the deployment rolls without traces if the secret is absent. 
ch15 observability doc now lists what produces spans today vs the remaining migration work, with the explicit per-method pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 15:28:05 -05:00
Trey t	d3708e6c72	Fix /metrics double-gzip + deploy script for amd64 build The Echo gzip middleware was wrapping promhttp's pre-gzipped output, so vmagent received double-compressed bytes that failed the Prometheus parser with binary garbage. Skipping /metrics in the gzip Skipper. Three deploy-script fixes uncovered while shipping this: - _config.sh had backticks around "kubectl get cm" inside the python heredoc, which bash treated as command substitution when KUBECONFIG was set. Quoted the literal instead. - 03-deploy.sh now passes --platform linux/amd64 to all docker builds so arm64 Macs don't push images that fail with "exec format error" on the Hetzner CX nodes. - OBS_INGEST_TOKEN lookup was reading deploy-k3s/prod.env instead of the actual deploy/prod.env at the repo root. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 14:42:15 -05:00
Trey t	372d4d2d37	deploy-k3s: apply observability manifests during 03-deploy vmagent.yaml lives under manifests/observability/; the deploy script now substitutes the OBS_INGEST_TOKEN from deploy/prod.env into the manifest before apply, and waits on the vmagent rollout. Manual kubectl apply is no longer needed after the next deploy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 14:16:59 -05:00
Trey t	df78d9ccd8	Add Prometheus metrics + vmagent push to obs.88oakapps.com Adds internal/prom package with histograms for HTTP, GORM, B2, APNs, and FCM, wired into the Echo router (HTTPMiddleware + /metrics) and GORM via statement-level callbacks (no ctx plumbing needed). Storage and push clients call ObserveB2Upload / ObserveAPNsSend / ObserveFCMSend at the network round-trip points. Existing internal/monitoring metrics move to /metrics/legacy so the canonical /metrics emits proper histogram buckets for p50/p95/p99 rollups. deploy-k3s/manifests/observability/vmagent.yaml deploys a single-replica vmagent in the honeydue namespace that scrapes api Pods on :8000/metrics every 15s and remote-writes to https://obs.88oakapps.com/api/v1/write with a bearer token (substituted at deploy time from OBS_INGEST_TOKEN in deploy/prod.env). NetworkPolicies allow vmagent egress to api Pods and to the public obs endpoint over :443; the obs side runs VictoriaMetrics + Jaeger + Grafana on 88oakappsUpdate. docs/observability-plan.md captures the full plan including resource budget, instrumentation table, 4-step rollout, and migration triggers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 14:16:17 -05:00
Trey t	1cd6cafa9d	deploy-k3s: wire B2_KEY_ID/B2_APP_KEY into api Deployment The B2 credentials existed in honeydue-secrets (created by 02-setup-secrets.sh) but were never referenced from the api Deployment, so StorageConfig.IsS3() returned false at runtime → StorageService fell back to local filesystem. With readOnlyRootFilesystem=true on the api container, that local fallback would silently fail on every upload — meaning every photo, document, and task-completion upload was broken in prod since the k3s migration on 2026-04-24. Adding both as secretKeyRef on the api container only (the worker doesn't perform uploads). Verified end-to-end with a registered test user: source PDF (sha256=3af3a645...) → POST /api/uploads/document/ → POST /api/documents/ → GET /api/media/document/:id → byte-identical download. Storage init log now reports "Storage service initialized (S3)". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 00:53:25 -05:00
Trey t	57cef36379	deploy-k3s: align _config.sh::generate_env with live ConfigMap generate_env was missing 5 keys that exist in the live honeydue-config ConfigMap (drift introduced over time by manual kubectl patches): STATIC_DIR, STORAGE_UPLOAD_DIR, STORAGE_BASE_URL, B2_REGION, B2_USE_SSL. Without these, running 03-deploy.sh would silently drop them and break static asset serving + B2 region/TLS. Also: - Move B2_KEY_ID/B2_APP_KEY out of generate_env: they're credentials and belong in honeydue-secrets, not cleartext in the ConfigMap. The api/worker deployments still need to be wired to read them via envFrom: secretRef before B2 uploads will work — pre-existing gap, not caused by this commit. - Use the in-namespace short DNS form for REDIS_URL ('redis:6379') to match what the live cluster has — pods' resolv.conf search path already covers honeydue.svc.cluster.local. - config.yaml.example: add b2_region, b2_use_ssl, upload_dir, base_url, static_dir under storage so a fresh bootstrap sets them correctly. Verified by sourcing _config.sh and diffing generate_env output against `kubectl get cm honeydue-config -o jsonpath='{.data}'`: clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 00:38:37 -05:00
Trey t	9ea058347f	Fix Apple Sign In: update bundle IDs from old com.tt.honeyDue.* to com.myhoneydue.* The iOS app was renamed (MyCrib → Casera → honeyDue) and the bundle ID was updated to com.myhoneydue.honeyDue (release) / .dev (debug), but APPLE_CLIENT_ID and APNS_TOPIC across env templates and k3s configs still pointed at the old com.tt.honeyDue.honeyDueDev value. This made verifyAudience reject every Apple identity token (aud claim mismatch). Updated: - deploy/prod.env.example: bundle ID + comment that empty client_id rejects all tokens with DEBUG=false - .env.example: add Sign in with Apple block (was missing entirely) - deploy-k3s{,-dev}/config.yaml.example: apple_auth.client_id default - deploy-k3s-dev/scripts/00-init.sh: same - docker-compose.dev.yml: APNS_TOPIC fallback - docs/deployment/10-secrets-config.md: doc reference The live deploy/prod.env and local .env are .gitignored — they were edited in place and need to ship via deploy_prod.sh to take effect. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 23:58:44 -05:00
Trey t	ace03d2340	Security hardening: TLS at origin, security headers, network policies, admin probe fix Four related hardening changes made on the live cluster during this session. Each manifest captures the final working state so a fresh `kubectl apply` of the repo reproduces it. 1. Cloudflare Full (strict) TLS — ingresses now carry `tls:` blocks pointing at `cloudflare-origin-cert` secret (installed imperatively from the CF Origin CA PEM). CF SSL mode flipped from Flexible to Full (strict). CF↔origin is now HTTPS; origin serves a CF-issued cert that only CF can validate. 2. Traefik middleware attached to all three ingresses — `rate-limit` (100/min avg, 200 burst) and `security-headers` (frame-deny, nosniff, HSTS, referrer policy, permissions policy). `admin-auth` middleware was also defined in middleware.yaml but is not attached (needs an unset basic-auth secret) and was deleted at runtime. 3. `security-headers` middleware: stripped the Content-Security-Policy entry. The Go API sets its own CSP in internal/router/router.go that permits Google Fonts for the landing page. Two CSP headers combine via intersection (most restrictive wins), which would break the landing page. Next.js apps set their own CSP via middleware. Header kept documentation comments explain this. 4. NetworkPolicies — default-deny + explicit allows, applied. Added missing policies for `web`. Corrected the Traefik ingress rule: the scaffold used `namespaceSelector: kube-system`, but our Traefik runs as a DaemonSet with `hostNetwork: true`, so traffic arrives with the NODE IP as source. Fixed to an `ipBlock` list of the three node IPs plus the cluster pod CIDR (10.42.0.0/16). 5. admin livenessProbe path fix: was hitting /admin/ (404) which caused a 6-hour crashloop cycle (87 restarts) before the bug was caught. Fixed to / — matches the startupProbe and readinessProbe paths that were corrected earlier. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 15:50:47 -05:00
Trey t	15359401fa	Deploy honeyDueAPI-Web to k3s at app.myhoneydue.com The Next.js 16 webapp in sibling repo honeyDueAPI-Web now runs alongside api/worker/admin on the cluster. Uses a server-side proxy pattern: browser hits app.myhoneydue.com, Next.js route handlers forward to the Go API with an httpOnly cookie, so no CORS entry or Allowed-Hosts change is needed on the API side. Availability mirrors api (3 replicas, PDB minAvailable:2, topologySpreadConstraints across nodes). Changes: - deploy-k3s/manifests/web/deployment.yaml: 3 replicas, readOnly root FS, drops all caps, mounts emptyDir for /app/.next/cache and /tmp, reads API_URL from honeydue-config. - deploy-k3s/manifests/web/service.yaml: ClusterIP :3000. - deploy-k3s/manifests/rbac.yaml: ServiceAccount web with automountServiceAccountToken: false. - deploy-k3s/manifests/pod-disruption-budgets.yaml: web-pdb minAvailable: 2. - deploy-k3s/manifests/ingress/ingress-simple.yaml: route app.myhoneydue.com → web:3000. - deploy-k3s/scripts/_config.sh: emit API_URL into the ConfigMap. - deploy-k3s/scripts/03-deploy.sh: build + push + apply the web image alongside api/worker/admin. Reads NEXT_PUBLIC_POSTHOG_KEY and NEXT_PUBLIC_POSTHOG_HOST from the operator shell env (not committed). Also adds the --build-arg NEXT_PUBLIC_API_URL wiring for the admin image that was previously only done manually. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 10:11:17 -05:00
Trey t	6f303dbbaa	Migrate prod deploy from Swarm to K3s; add full deployment book Infrastructure: - Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers) - Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh - All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept temporarily for reference Bug fixes surfaced during migration: - Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25) - cache_service.go: remove sync.Once reassignment from inside Do() callback (was causing 'unlock of unlocked mutex' fatal after Redis Ping failure) - router.go: relax CSP from 'default-src none' to 'default-src self' + allowlist fonts.googleapis.com so the marketing landing page CSS actually loads in browsers - deploy/scripts/deploy_prod.sh: use docker buildx with --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce images runnable on x86_64 Hetzner nodes; fix array expansion under set -u - deploy/swarm-stack.prod.yml: fix secret source references to use top-level aliases (the '${X_SECRET}' form never actually resolved); dozzle ports: long-form host_ip is rejected by Swarm, switched to short-form (bound to 0.0.0.0 with UFW-based loopback restriction); worker replicas 2 -> 1 (Asynq scheduler singleton) - deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/' (Next.js serves at root; /admin/ returned 404 and killed pods); startupProbe failureThreshold 12 -> 24 - deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable 1 -> 0 (singleton) - deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold 12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot; real startup takes up to 240s) -
.gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/ and admin/src/app/api/*, hiding legitimate files) New files: - deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet + hostNetwork override for k3s-bundled Traefik - deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress without TLS (CF Flexible SSL) and without middleware - deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log Documentation: - docs/deployment/ — full deployment book, 26 files, ~42k words: - Part I Overview, infrastructure, orchestrator choice (Ch 0-2) - Part II Networking, firewall, Cloudflare (Ch 3-4, 13) - Part III Security, Traefik ingress (Ch 5-6) - Part IV Services, DB, storage, secrets, registry (Ch 7-11) - Part V Data flow, deploy process, observability, failures, runbook (Ch 12, 14-17) - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20) - Appendices: glossary, kubectl cheat sheet, file locations, consolidated citations - README.md: Production Deployment section replaced with pointer to the book; Go version bumped to 1.25 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 07:20:54 -05:00
Trey t	34553f3bec	Add K3s dev deployment setup for single-node VPS Mirrors the prod deploy-k3s/ setup but runs all services in-cluster on a single node: PostgreSQL (replaces Neon), MinIO S3-compatible storage (replaces B2), Redis, API, worker, and admin. Includes fully automated setup scripts (00-init through 04-verify), server hardening (SSH, fail2ban, ufw), Let's Encrypt TLS via Traefik, network policies, RBAC, and security contexts matching prod. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 21:30:39 -05:00

31 Commits
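
The /metrics double-gzip failure described in commit d3708e6c72 can be illustrated with a minimal, standard-library-only sketch. It is not the repository's code: `gzipBytes`, `gunzipOnce`, `isGzip`, and `skipGzip` are hypothetical helpers, and `skipGzip` only approximates the kind of path check a gzip middleware Skipper might perform. The sketch shows that a payload compressed twice still looks like gzip after a client removes one layer, which is consistent with the "binary garbage" the parser reported.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// gzipBytes compresses data with gzip and returns the compressed bytes.
func gzipBytes(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// gunzipOnce removes exactly one gzip layer, as a scraper honoring a
// single Content-Encoding: gzip header would.
func gunzipOnce(data []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// isGzip reports whether data starts with the gzip magic bytes 0x1f 0x8b.
func isGzip(data []byte) bool {
	return len(data) >= 2 && data[0] == 0x1f && data[1] == 0x8b
}

// skipGzip is a hypothetical skipper predicate: it returns true for paths
// whose handler already compresses its own output (assumed here to be
// only /metrics).
func skipGzip(path string) bool {
	return path == "/metrics"
}

func main() {
	metrics := []byte("# TYPE http_requests_total counter\nhttp_requests_total 42\n")

	once, err := gzipBytes(metrics) // what the metrics handler emits
	if err != nil {
		panic(err)
	}
	twice, err := gzipBytes(once) // what an outer gzip middleware adds
	if err != nil {
		panic(err)
	}
	decoded, err := gunzipOnce(twice) // what the scraper ends up parsing
	if err != nil {
		panic(err)
	}

	fmt.Println("still gzip after one decode:", isGzip(decoded))
	fmt.Println("skip gzip for /metrics:", skipGzip("/metrics"))
}
```

In a real router the predicate would be handed to the gzip middleware's skipper hook, so the outer compression layer is never applied to the already-compressed endpoint.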