honeyDueAPI

Author	SHA1	Message	Date
Trey T	225fb1306b	dev: add Kratos + Mailpit local-dev stack docker-compose.dev.yml gains a Kratos identity service (public :4433 / admin :4434) and a Mailpit SMTP catcher for local onboarding email codes, plus a postgres-init mount. deploy/local/kratos/ holds the local Kratos config + identity schema (placeholder dev cookie secret only). Supports the local backend the XCUITest suite seeds against. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-09 00:11:06 -05:00
Trey T	b54493f785	backend: GDPR export + retention cleanups + worker metrics (BE-1/2/3) BE-3 observability: expose the worker's Prometheus metrics on :6060/metrics (apns/fcm/asynq histograms + a new cache_ops_total counter were recorded all along but never scraped, which is why those dashboard panels read empty); add the worker containerPort, the vmagent worker scrape job, and two additive NetworkPolicies. Instrument cache Get/Set hit/miss. BE-2 retention: three periodic Asynq cleanup crons mirroring the reminder-log cleanup: notifications (90d), webhook dedup log (180d), audit_log (365d). BE-1 GDPR data export: POST /api/auth/export/ enqueues a low-priority Asynq job that gathers all of the user's data (owned residences + their tasks/contractors/documents/share-codes, plus profile/notifications/prefs/push-tokens/subscription/audit log), zips one JSON file per category, and emails it as an attachment. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 22:15:26 -05:00
Trey T	3b2ea9959a	deploy: add node-exporter DaemonSet + vmagent scrape job Per-node host metrics (node_filesystem_*, node_memory_*, node_load*) were missing; a node running out of disk would silently fail the cluster before any dashboard signal (RUNBOOK §11.1 gap #9). Adds: - node-exporter DaemonSet (pod-networked, :9100; host /proc,/sys,/ ro) so vmagent scrapes it pod-to-pod over the cluster CIDR, independent of node public IPs (the netpol node-IP list is OVH-stale). - two additive NetworkPolicies (default-deny-all is in force): ingress to node-exporter from vmagent, and vmagent egress to the pod CIDR on :9100. - a node-exporter scrape job in the vmagent-config ConfigMap. Feeds the new "Node host health" row (disk/mem/load) on the eli5 dashboard. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 21:41:40 -05:00
Trey T	cf054959bd	Auth: require email-verified by default for all app-data routes Previously only 2 share-code routes required a verified email; every other authenticated route (residences, tasks, contractors, documents, notifications, subscription, users, uploads, media; ~70 routes) accepted an authenticated but UNVERIFIED user. This inverts the default to verified-by-default. - router.go: add a `verified` sub-group that applies RequireVerified() ONCE at the group level, and move all app-data route setups under it. Verification is now the default; new routes are gated automatically. The authenticated-only allow-list is just the sign-up surface (/auth/me, /auth/profile, /auth/account). Public stays: register, health, webhooks, lookups. - kratos_auth.go: fix a latent bug the gating exposed: the Redis session cache stored the verified flag for 24h, so a user who verified their email mid-session was still seen as unverified until the TTL expired (sign up -> verify -> create residence would 403). Now only a cached verified=true is trusted (verification is sticky); a cached verified=false re-resolves the live status from Kratos. - auth_safety_test.go: add RequireVerified unit tests (verified passes, unverified -> 403, no-user -> 401). Validated: API gating test (unverified->403, verified->200) + full iOS XCUITest suite green (211 passed) including the onboarding verify->use-immediately flow. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 10:49:37 -05:00
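The sticky-verification rule described in cf054959bd can be sketched as follows. This is a minimal illustration, not the code in kratos_auth.go: `cachedSession`, `resolveVerified`, and the `live` callback are hypothetical names standing in for the Redis cache entry and the Kratos lookup.

```go
package main

import "fmt"

// cachedSession is a hypothetical stand-in for the Redis session cache entry.
type cachedSession struct {
	UserID   string
	Verified bool
}

// resolveVerified trusts a cached verified=true (verification is sticky) and
// re-resolves a cached verified=false through the live lookup, so a user who
// verified mid-session is not blocked until the cache TTL expires.
func resolveVerified(c cachedSession, live func(userID string) (bool, error)) (bool, error) {
	if c.Verified {
		return true, nil
	}
	return live(c.UserID)
}

func main() {
	// Assumed live lookup: pretend the identity service now reports verified.
	live := func(userID string) (bool, error) { return true, nil }

	verified, err := resolveVerified(cachedSession{UserID: "u1", Verified: false}, live)
	fmt.Println(verified, err)
}
```

The asymmetry is deliberate: a stale `true` is harmless under the assumption that verification is never revoked, while a stale `false` produces the 403 described in the commit.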
Trey T	12de5a230a	i18n: backend-localized lookups, suggestions, and static data (10 languages) - suggestion_service: fix scorer (stringList unmarshal accepts scalar|array; anchor scoring on base universal score so bool matches no longer tie); add localizeReasons for human-readable, Accept-Language-localized match reasons - lookup_i18n: localize lookup display names, home-profile options, document types/categories via internal/i18n - static_data_handler: per-locale seeded-data response (display_name, home profile options, document types/categories) with per-locale cache + ETag - settings_handler: invalidate per-locale seeded-data cache on lookup change instead of pre-warming a single non-localized blob - cache_service: per-locale seeded-data keys + ETag - DTOs: add DisplayName fields (task/residence/contractor) - translations: add suggestion.reason.* and lookup.* keys across all 10 langs - cmd/api: extract startup helpers + tests Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 20:54:54 -05:00
Trey t	25897e913e	Auto-verify Sign in with Apple emails Apple OIDC mapper now marks the email verified unconditionally via verified_addresses. SIWA cryptographically proves control of the Apple ID and Apple owns/verifies the (relay) email, so a code is redundant. Gating on Apple's `email_verified` claim was unreliable: Apple omits it on many authorizations, which made verification random (sometimes a surprise code prompt). Password sign-ups still verify via the honeyDue API flow. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 22:30:33 -05:00
Trey t	81e454d86d	Add admin-create registration + live email-verified flag Registration now goes through POST /api/auth/register, which admin-creates the Kratos identity (unverified email, NO auto-sent code). Kratos self-service registration never returns the verification flow id, so the client could never submit the user's code to the right flow; admin creation lets the client own a single verification flow instead. Also surface the live Kratos verified flag and fix Apple audience + team IDs. - kratos.Client.CreateIdentity via admin API; ErrIdentityExists / ErrInvalidCredentials - AuthService.Register + AuthHandler.Register + public POST /api/auth/register/ - CurrentUser overrides stale user_profile.verified with the live Kratos flag; UserRepository.MarkVerified mirrors it back - configmap: additional_id_token_audiences allows the .dev bundle id_token - fix Apple/APNs team id V3PF3M6B6U -> X86BR9WTLD in .env.example + dev init Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 17:46:30 -05:00
Trey t	7b87f2e392	fix(kratos): drop cloudflare-only middleware on auth ingress iOS Sign In with Apple failed silently: the KMP client never reached Kratos. Traced to the cloudflare-only Traefik middleware rejecting every request at the auth ingress. Root cause: on this cluster klipper-lb sits in front of Traefik and SNATs the source IP. Traefik's ipAllowList sees the klipper-lb pod IP, not Cloudflare's real source IP, so even legitimate iOS requests proxied through Cloudflare get 403'd. The api ingress doesn't have this middleware (and works correctly), so removing it from auth matches the working pattern. Kratos is the user-facing OIDC endpoint; every iOS/web user device needs to reach it. Cloudflare's edge still does DDoS protection; Kratos applies its own per-flow rate limits. The IP allowlist was buying nothing here and breaking everything. Verified after this change: - GET /health/alive → 200 - GET /health/ready → 200 - GET /self-service/login/api → 200 + valid flow body listing apple as an OIDC provider option Related but not fixed by this commit: the same klipper-lb SNAT issue affects admin.myhoneydue.com (which retains cloudflare-only). Admin basic auth still gates real access there, but the IP check is dead weight. Proper fix is configuring Traefik ipStrategy to read the client IP from X-Forwarded-For (set by Cloudflare). Tracked as a follow-up. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-03 11:14:35 -05:00
Trey t	6de90acef7	feat(kratos): deploy Ory Kratos to production (Apple-only OIDC) Auth was structurally broken: the api's Kratos middleware was pointing at http://kratos:4433 but Kratos wasn't deployed. The only thing keeping users logged in was a 5-min Redis cache; once it expired the middleware called Whoami → no DNS → 401 → forced relogin with no path back. This commit deploys Kratos for real: Manifests: - kratos.yaml + migrate-job.yaml: pin oryd/kratos:v26.2.0@sha256:92eedc... (CalVer current stable as of 2026-06-03) - configmap.yaml: drop Google OIDC provider (not in scope); fill the Apple provider with real Services ID / Team ID / Key ID, so Apple now sits at providers[0] - kratos.yaml: drop the Google-secret env binding; rebind APPLE_PRIVATE_KEY to PROVIDERS_0_APPLE_PRIVATE_KEY (shifted from index 1) - network-policies.yaml: add a kratos egress rule to allow-egress-from-api. Without this, even with kratos running, the api gets "connection refused" on http://kratos:4433 (post-DNAT NetworkPolicy enforcement, see runbook §9.2). Operator prerequisites that were completed alongside this commit: - Neon kratos database created (separate from honeyDue, owner neondb_owner) - Cloudflare DNS for auth.myhoneydue.com (3 A records, proxied) - kratos: block added to config.yaml (gitignored): DSN to the Neon DIRECT endpoint, cookie + cipher secrets generated, Fastmail SMTPS URI, .p8 contents inline Out of scope intentionally: - Google sign-in (additive; can append providers[] later) - Migrating existing auth_user rows onto Kratos identities (pre-prod); existing users will need to sign in fresh, which creates a new Kratos identity and a new local user row (per migration plan in manifests/kratos/README.md). Verified end-to-end: - 338 schema migrations applied successfully - 2/2 kratos pods Ready - api → kratos:4433/sessions/whoami returns 401 for invalid token (was "connection refused" before this commit's NetworkPolicy patch) - auth.myhoneydue.com resolves through CF; cloudflare-only middleware keeps the origin protected exactly like the other hostnames Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-03 11:08:09 -05:00
Trey t	64c656bde1	fix(auth): keep users logged in while Kratos is down Production is running with no Kratos deployed in-cluster (the deploy script's kratos-secrets prerequisite isn't satisfied yet; see runbook §11 #7). That means Whoami calls ALWAYS fail, so any time a user's Redis session cache expires they get a 401, which the iOS app treats as session invalid → forced re-login → can't re-authenticate because the same Whoami is the only way back in. Two-part mitigation: 1. Bump kratosSessionCacheTTL from 5 minutes to 24 hours. Active users stay logged in indefinitely; idle users get bounced after a day. 2. Refresh the cache TTL on every successful cache hit (sliding window) so usage-driven expiry is no longer a cliff at the original TTL. When Kratos actually comes up: - revert the TTL constant to a sensible value (1-15 min) - the sliding-window refresh is fine to keep; it's good UX regardless Caveat: this papers over the missing Kratos. New sign-ins still cannot complete because the api needs Kratos to populate the cache the first time. Real fix is to deploy Kratos. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-03 10:48:12 -05:00
Trey t	d74cfeee62	feat(subscription): temporarily disable subscription gating Subscriptions aren't a shipping feature for now. Make GET /api/subscription/status/ return a "limitations disabled" / pro-tier stub at the top of the function with no DB or Redis work: - tier="pro" - is_active=true - limitations_enabled=false (master kill switch in SubscriptionHelper.kt; every canCreate* check short-circuits true) - usage=0 across the board - limits map present with empty entries (all-nil = unlimited per the KMM model convention) so client tier-lookups don't NPE The original implementation is preserved verbatim as the unexported getSubscriptionStatusFromDB method. Re-enabling is a one-line change: swap GetSubscriptionStatus's body to call s.getSubscriptionStatusFromDB. Two integration tests in subscription_is_free_test.go assert the original "limitations actually apply based on settings/IsFree" behavior. They now t.Skip with the same TEMPORARILY DISABLED marker pointing back to the service comment. CheckLimit-based tests in the same file still pass because that codepath is unchanged. Perf side effect: POST/GET on this route drops to ~1ms (just JSON marshal), removing 4-5 serial Neon RTTs from every cold call. Was the slowest endpoint in the live dashboard (~213ms p95 / ~480ms after the pod roll). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-03 10:07:06 -05:00
Trey t	52bf1ff3c7	perf(task): offload completion notification fan-out to Asynq worker POST /api/task-completions/ was spending ~1.5-1.75s synchronously on APNs push + SMTP email + B2 image fetches inside sendTaskCompletedNotification. Per-user loop made it scale linearly with residence membership; one image attached + one residence user is the 1.75s baseline observed in the live honeydue-eli5-overview Grafana panel. Replace the inline call (and the fire-and-forget goroutine in QuickComplete, which violated the project's "no goroutines in handlers" rule) with an Asynq job: - new task type notification:task_completed (worker/scheduler.go) - new payload {task_id, completion_id}: IDs only, worker re-reads canonical state from Postgres so concurrent edits between enqueue and dequeue are reflected - new HandleTaskCompletedNotification on jobs.Handler delegates to TaskService.SendTaskCompletedNotificationByID - new dispatchTaskCompletedNotification in task_service.go picks between enqueue (preferred) and inline (fallback) when Redis is unreachable or the enqueuer isn't wired (tests / local dev) Other changes required to wire it up: - widen worker.NewTaskClient signature to accept asynq.RedisClientOpt so the file-mounted Redis password (audit HIGH-1) can be supplied; no prior callers, no breakage - extend worker.Enqueuer interface with EnqueueTaskCompletedNotification - add TaskEnqueuer field to router.Dependencies; wire from cmd/api/main.go with the standard typed-nil interface guard - wire a worker-side TaskService in cmd/worker/main.go so the handler can use the shared SendTaskCompletedNotificationByID implementation (storage service shared with the existing upload-cleanup wiring) Expected impact on POST /api/task-completions/ p50: ~1.75s -> ~120-170ms (DB + tx + Asynq enqueue only) Notifications still deliver; they just go via the worker instead of in the request path. MaxRetry=3; "row not found" returns nil so a deleted task/completion doesn't churn the retry loop. All 31 test packages pass. No DB migrations. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-03 09:34:52 -05:00
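Two ideas from 52bf1ff3c7 lend themselves to a short sketch: the typed-nil interface guard and the enqueue-or-inline dispatch. The code below is an assumption-laden illustration, not the project's implementation; `asynqEnqueuer`, `asEnqueuer`, and `dispatch` are hypothetical names, and only the `EnqueueTaskCompletedNotification` method name comes from the commit message (its signature here is assumed).

```go
package main

import (
	"errors"
	"fmt"
)

// Enqueuer mirrors the idea of worker.Enqueuer; the signature is assumed.
type Enqueuer interface {
	EnqueueTaskCompletedNotification(taskID, completionID uint) error
}

// asynqEnqueuer is a hypothetical client; fail simulates an unreachable Redis.
type asynqEnqueuer struct{ fail bool }

func (e *asynqEnqueuer) EnqueueTaskCompletedNotification(taskID, completionID uint) error {
	if e.fail {
		return errors.New("redis unreachable")
	}
	return nil
}

// asEnqueuer is the typed-nil guard: assigning a nil *asynqEnqueuer directly
// to an Enqueuer yields a non-nil interface, so the pointer is checked first.
func asEnqueuer(c *asynqEnqueuer) Enqueuer {
	if c == nil {
		return nil
	}
	return c
}

// dispatch prefers the queue and falls back to the inline path when no
// enqueuer is wired or the enqueue call fails. It reports which path ran.
func dispatch(enq Enqueuer, taskID, completionID uint, inline func(taskID, completionID uint) error) (string, error) {
	if enq != nil {
		if err := enq.EnqueueTaskCompletedNotification(taskID, completionID); err == nil {
			return "enqueued", nil
		}
	}
	return "inline", inline(taskID, completionID)
}

func main() {
	var missing *asynqEnqueuer // e.g. tests or local dev without Redis
	var naive Enqueuer = missing

	// Without the guard the interface is non-nil even though the pointer is nil.
	fmt.Println(naive != nil, asEnqueuer(missing) != nil)

	inline := func(taskID, completionID uint) error { return nil }
	path, _ := dispatch(asEnqueuer(missing), 1, 2, inline)
	fmt.Println(path)
}
```

Passing only IDs to the enqueue call matches the payload described in the commit: the worker re-reads current state rather than trusting a snapshot taken at enqueue time.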
Trey t	e448ec66dc	docs(runbook): rewrite for OVH BHS cluster + Tier-3 observability TODOs Brings the runbook in line with the 2026-06-03 Hetzner → OVH cutover: - Section 1-5: topology, machines (3x OVH VPS-1 BHS), software versions, network/firewall, DNS, filesystem layout — all reflect the live OVH install instead of the historical Hetzner setup. - Section 6: canonical install-from-clean-boxes procedure (the literal commands run on 2026-06-03), so anyone can stand up a backup cluster by following along. - Section 9: keeps existing gotchas (vmagent NetPol, token-blown-away, healthy-but-empty) and adds four new ones discovered during the OVH build: rbac.yaml not in 03-deploy.sh, namespace label missing from api metrics (use service="api"), cluster-label collision when two clusters push concurrently, worker double-firing on cutover. - Section 11.1: enumerates Tier-3 observability gaps surfaced while building the honeydue-eli5-overview dashboard (node-exporter not deployed, Traefik metrics off, push success counters absent, worker /metrics endpoint absent, cache hit rate uninstrumented, APNs latency uninstrumented). - Section 12: dated audit trail of cluster changes. Pure documentation; no code or manifest changes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-03 09:34:35 -05:00
Trey t	3d3ba84df0	fix(auth): delete the Kratos identity on account deletion Account deletion removed all local data but left the Ory Kratos identity intact: an orphaned identity that can still authenticate. Close the gap: - kratos.Client gains the admin API: NewClient(publicURL, adminURL) and DeleteIdentity (DELETE /admin/identities/{id}; a 404 is treated as success so a retry after a partial failure is idempotent). - AuthService.DeleteAccount deletes the Kratos identity FIRST; if that call fails it aborts before touching local data, so the operation is retryable rather than partially applied. - KRATOS_ADMIN_URL config (default http://kratos:4434) + router wiring. - kratos NetworkPolicy split: the api pods may now reach the admin API :4434 (Traefik still reaches only the public API :4433). - kratos CORS: allow_credentials + OPTIONS so the web browser flows (ory_kratos_session cookie) work; origins stay an explicit allowlist. - Regression tests: identity teardown happens, and a Kratos failure aborts the deletion instead of orphaning local data. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 21:55:33 -05:00
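The idempotent delete described in 3d3ba84df0 might look roughly like this using only the standard library. It is a sketch, not the project's kratos.Client: `deleteSucceeded` and `deleteIdentity` are hypothetical helpers, treating any 2xx as success is an assumption, and only the `DELETE /admin/identities/{id}` path and the 404-as-success rule come from the commit message.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// deleteSucceeded reports whether a DELETE response means "the identity is
// gone". A 404 counts as success so a retry after a partial failure is a no-op.
func deleteSucceeded(status int) bool {
	return (status >= 200 && status < 300) || status == http.StatusNotFound
}

// deleteIdentity issues DELETE {adminURL}/admin/identities/{id}.
func deleteIdentity(ctx context.Context, hc *http.Client, adminURL, id string) error {
	endpoint := adminURL + "/admin/identities/" + url.PathEscape(id)
	req, err := http.NewRequestWithContext(ctx, http.MethodDelete, endpoint, nil)
	if err != nil {
		return err
	}
	resp, err := hc.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if !deleteSucceeded(resp.StatusCode) {
		return fmt.Errorf("delete identity %s: unexpected status %d", id, resp.StatusCode)
	}
	return nil
}

func main() {
	// Stand-in admin API that always answers 404, as if the identity was
	// already removed by an earlier attempt.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.NotFound(w, r)
	}))
	defer srv.Close()

	err := deleteIdentity(context.Background(), srv.Client(), srv.URL, "example-id")
	fmt.Println(err)
}
```

A caller that follows the ordering in the commit would invoke this first and return early on a non-nil error, leaving local data untouched so the whole deletion can be retried.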
Trey t	81578f6e27	feat(auth): replace hand-rolled auth with Ory Kratos, phase 2 backend Delegates all credential management (login, register, password reset, email verification, social sign-in) to Ory Kratos. The Go API now acts as a resource server: the new KratosAuth middleware validates sessions against the Kratos whoami endpoint, writes the local User mirror into Echo context, and all existing domain handlers continue working unchanged. Hand-rolled token auth, AuthToken model, apple_auth/google_auth services, and the auth refresh flow are removed. Tests are updated to use the fake-token middleware pattern so existing integration assertions require no rewrite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:55:56 -05:00
Trey t	b66151ddd9	feat(auth): scaffold Ory Kratos identity service, phase 1 (infrastructure) First phase of replacing the hand-rolled auth (internal/services/auth_service.go et al.) with Ory Kratos. This commit is infrastructure only: Kratos will run but nothing consumes it yet; the Go API still does its own auth until phase 2. Adds deploy-k3s/manifests/kratos/: - configmap.yaml: kratos.yml, identity schema, Google/Apple OIDC claim mappers (no secrets in the ConfigMap) - migrate-job.yaml: `kratos migrate sql`, run before the Deployment - kratos.yaml: Deployment (x2), Service, NetworkPolicies - ingress.yaml: auth.myhoneydue.com -> Kratos public API :4433 - README.md: operator prerequisites + deploy runbook Wiring: - 02-setup-secrets.sh creates kratos-secrets, gated on a config.yaml `kratos:` block (DSN, cookie/cipher, SMTP URI, OIDC client secret, Apple key). - 03-deploy.sh applies the Kratos manifests + runs the migrate Job, gated on the kratos-secrets Secret existing. Both gates mean the existing stack deploys completely unaffected until the operator completes the prerequisites (Neon `kratos` DB, auth.myhoneydue.com DNS, Apple/Google OAuth apps, Kratos image version). Pre-production, so no user-data migration; see manifests/kratos/README.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 16:24:38 -05:00
Trey t	c845771946	feat(observability): drop health/metrics probe noise from shipped logs The api logs every request, so k8s liveness/readiness probes on /api/health/ and vmagent's /metrics scrape drowned Loki in 2xx access logs. Alloy now drops successful probe/scrape access lines at ingest (loki.process stage.drop); a non-2xx health check, or one logged above info level, still matches nothing and is kept. Also hardens Alloy's read-offset store: moved /tmp/alloy from an emptyDir to a hostPath and set loki.source.file tail_from_end=true, so a pod restart resumes from the saved offset instead of re-reading log files from the start, which made Loki 400-reject the now-too-old entries ("entry too far behind") and stalled shipping. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 21:29:15 -05:00
Trey t	93fddc3769	feat(observability): ship pod logs to Loki via Grafana Alloy Adds a Grafana Alloy DaemonSet that tails honeydue-namespace pod logs from /var/log/pods and pushes them to Loki at obs.88oakapps.com, reusing the existing OBS_INGEST_TOKEN (14-day retention). - deploy-k3s/manifests/observability/alloy-logs.yaml: DaemonSet + RBAC + token Secret + Alloy config. Runs as root (/var/log/pods is 0750 root:root) but otherwise locked down: all caps dropped, read-only root filesystem, seccomp RuntimeDefault, read-only hostPath mount. - network-policies.yaml: allow-egress-from-alloy-logs (DNS + k8s API + obs HTTPS), mirroring the vmagent egress policy. - 03-deploy.sh: applies alloy-logs with the OBS_INGEST_TOKEN substitution and waits for the DaemonSet rollout. The Loki container, nginx /loki/api/v1/push route, and Grafana Loki datasource live on the obs server and are not repo-managed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 20:04:09 -05:00
Trey t	c77ff07ce9	fix(security): remediate 2026-05-12 audit findings (Stages 2–5) Remediation of the 2026-05-12/13 audits (78 findings + cluster gaps), tracked in deploy-k3s/SECURITY.md, plus fixes from two independent post-remediation reviews. Auth & sessions: - SHA-256 hashed auth-token storage (C1); prior-token cache eviction on re-login (MEDIUM-1) - local Google JWKS verification, iss/aud/exp checks (C2/C3) - constant-time login + generic errors (L1/LIVE-L11/LIVE-L13) - per-account login lockout keyed on distinct source IPs (M5/MEDIUM-3) - verified-email gating, login rate limiting (LIVE-L19, H1-H3) IAP & webhooks: - Apple/Google cross-account replay protection (C5/C6/C10/C13, H5/H6) - migrations 000003-000006 (token hashing, IAP replay, audit_log + webhook_event_log table creation, append-only audit log) Authorization & races: - file-ownership owner-OR-member fix (C7), atomic share-code join (C9/H9), device-token reassignment (C8/LOW-3) Secrets & deploy: - secrets file-mounted at /etc/honeydue/secrets, not env (F8); Redis password out of the ConfigMap (HIGH-1); B2 keys reconciled - digest-pinned images, admin ingress hardening, CSP/HSTS, /metrics lockdown; kubeconfig 0600, etcd secrets-encryption, fail2ban + unattended-upgrades at provision; secret-rotation runbook Build, vet, and the full test suite (incl. -race) pass; the goose migration chain is verified against PostgreSQL 16. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 22:28:33 -05:00
Trey t	2004f9c5b2	fix(observability): relax vmagent liveness probe (was crash-looping every ~5m) The previous probe had timeoutSeconds=1 which is too tight for the shell pipeline (sh + wget + grep + comparison). On a busy node the wget call regularly exceeded 1s, the exec timed out, and 3 consecutive timeouts triggered SIGTERM. Result: vmagent restarted ~5x per 30 min, causing brief gaps that made the Grafana "Pods up" panel render 0 whenever a refresh happened to coincide with a restart. The relaxed probe still catches the original failure mode (zero healthy targets) but only kills the pod after 10 full minutes of consecutive failure (5 attempts × 2 min period), not 3 minutes (3 × 1 min). timeoutSeconds: 1 → 5 periodSeconds: 60 → 120 failureThreshold: 3 → 5 initialDelaySeconds: 120 → 180 Also added wget -T 4 inside the command so wget itself bounds its network call to 4s, leaving 1s of slack within the 5s exec budget. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 00:39:23 -05:00
Trey t	139a990ebc	fix(observability): unbreak vmagent SD on fresh deploy + ship kube-state-metrics vmagent's k8s service discovery has been silently broken for 17+ days because k3s's NetworkPolicy controller evaluates egress AFTER kube-proxy's DNAT (contrary to the k8s spec). Pod → ClusterIP 10.43.0.1:443 was DNAT'd to <node_public_ip>:6443, and the resulting :6443 destination matched none of vmagent's egress rules → TCP RST → "connection refused" on every SD watch attempt. Grafana panels using kube_* or up{} metrics returned empty as a result. Changes: - network-policies.yaml: commit the previously-cluster-only NetPols (allow-egress-from-vmagent, allow-vmagent-to-api) so a fresh deploy produces a working cluster. The vmagent egress rule now includes :6443 to public IPs (the post-DNAT path) and :8080 to the pod CIDR (for scraping kube-state-metrics). - observability/kube-state-metrics.yaml: new manifest. Provides the kube_pod_*, kube_deployment_*, kube_service_* metrics that Grafana panels need to count pods, replicas, etc. Runs in kube-system with cluster-scoped RBAC. - observability/vmagent.yaml: * add kube-state-metrics scrape job to the ConfigMap * add vmagent-kube-system Role+RoleBinding so cross-namespace SD works * replace the misleading liveness probe (was /-/healthy, which lies while SD is broken) with an exec probe that checks /api/v1/targets for at least one healthy target, giving automatic recovery from future stale-SD incidents - scripts/03-deploy.sh: actually apply network-policies.yaml (was committed but never applied) and apply kube-state-metrics.yaml. - RUNBOOK.md (new): documents the post-DNAT gotcha, the liveness probe trap, bearer-token recovery procedure, drift-detection diff, and a post-redeploy verification checklist. - .gitignore: cover kubeconfig.tunnel (created during SSH-tunnelled kubectl sessions) so admin client cert can't be committed by accident. Verified via kubectl --dry-run on all three modified manifests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 00:30:11 -05:00
Trey t	7cc5448a7c	fix(uploads): switch from S3 POST policy to presigned PUT Backblaze B2's S3-compatible endpoint does not implement the S3 POST Object operation. It returns HTTP 501 to every POST regardless of URL style: both path-style (https://s3.<region>.backblazeb2.com/<bucket>/) and virtual-hosted-style (https://<bucket>.s3.<region>.backblazeb2.com/). Yesterday's BucketLookupDNS fix produced virtual-hosted URLs, which is correct for AWS but doesn't help here, since B2 rejects POST on either form. Verified with `curl -X POST https://...backblazeb2.com/honeyDueProd/` returning 501 directly, with no signature involved. Replace minio-go's PresignedPostPolicy with PresignHeader + http.MethodPut. The signed URL now points at a single PUT endpoint, with Content-Type and Content-Length signed via headers; B2/S3/MinIO all accept it. Drop the min/max content-length range (we sign exactly one length now); post-upload size verification still happens in VerifyAndClaim via HEAD. Response shape: - URL (was: signed POST endpoint) → now: signed PUT URL - Fields → renamed to Headers; client sends them as request headers, not multipart form parts - Method (new): always "PUT", emitted explicitly so clients don't have to hardcode Companion KMP/iOS commits switch the client paths from multipart POST to single PUT. Existing builds in the field will need to be rebuilt. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 15:41:48 -05:00
Trey t	5d8559b495	chore(deploy): mark deploy_prod.sh as deprecated; point at k3s flow Production migrated from Docker Swarm to k3s on 2026-04-24, but deploy_prod.sh continued to target the old hetzner1 Swarm manager. Without dockerd running there it spent 30+ seconds doing SSH probes before dying on a confusing "Got: false" Swarm-state error. Add an early guard that fails immediately with a pointer to deploy-k3s/scripts/03-deploy.sh and the kubeconfig-fetch one-liner. ALLOW_LEGACY_SWARM_DEPLOY=1 still bypasses if anyone needs the old path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 13:46:13 -05:00
Trey t	191c9b08e0	feat(static): rebuild landing page on amber-on-midnight brand system Backend CI / Test (push) Has been cancelled Details Backend CI / Contract Tests (push) Has been cancelled Details Backend CI / Build (push) Has been cancelled Details Backend CI / Lint (push) Has been cancelled Details Backend CI / Secret Scanning (push) Has been cancelled Details Replace the off-brand "VIBRANT EDITION" CSS (generic SaaS blue/purple/ teal/pink) with a strict palette derived from icon.png and style_guide.md: --gold #FCCE38, --amber #F5A623, --pollen #FFE082, --sun-bloom #F9BB2F --midnight #181E37, --deep #162140, --comb-line #232230 --cream #FFF1D0 Spacing/radius scale mirrors iOS DesignSystem.swift (AppSpacing 4/8/12/ 16/24/32/48/64; AppRadius 4/8/12/16/20/24) so the web feels native to the same brand system. 56px button height, 16px card radius, identical elevation language. Page architecture: - Sticky translucent nav with hex brand mark (1+6 cluster) - Hero with iPhone frame mock showing real kanban view (overdue/due soon/in progress/done with priority dots and meta chips) - Cream "What's due, what's done, what's yours" pillars - Four feature deep-dives (residences, tasks/kanban, contractors, documents/warranties) with product UI mocks built from real app concepts - "Each cell, a task" comb section with JS-generated 8x10 honeycomb completion grid that fills more densely toward the top - iOS polish section: Home Screen widget mock with quick-complete, push notification with inline actions, Face ID lock, 11 themes - Sharing section with share-card mock (HIVE-7K2D-Q9 code + 3 keepers) - Free vs Pro pricing with "Most chosen" tag - Final CTA with brand mark + golden glow Honeycomb motif: - Brand mark uses gold-on-navy with a radial halo (no currentColor dependency — renders identically everywhere) - Hex grid background uses a properly tessellating flat-top tile (3 hexes per 126x73 unit, sharing full edges, no seams) - Hex bullets, hex pills, completion grid all 
flat-top per style guide Copy follows style_guide.md voice — calm, specific, no banned words (chore, simplified, seamless), sentence case throughout. Canonical tagline "A home is a hive. We'll help you keep it." used verbatim in the hero and footer. JS: mobile nav toggle, scroll-state nav, IntersectionObserver reveal, deterministic comb-grid generator. Respects prefers-reduced-motion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 13:34:32 -05:00
Trey t	4efc87559a	fix(uploads): force virtual-hosted-style URLs for B2 presigned POST Backblaze B2's S3-compatible endpoint only implements POST Object on virtual-hosted-style URLs (https://<bucket>.s3.<region>.backblazeb2.com/). Path-style POST returns HTTP 501 Not Implemented. minio-go's BucketLookupAuto only flips to virtual-hosted for AWS, Google, and Aliyun endpoints — for B2 it falls through to path-style, which is why every PresignedPostPolicy() call has been handing the mobile clients a URL that B2 then refuses with 501. Force BucketLookupDNS only when the endpoint is backblazeb2.com so MinIO dev (no DNS for arbitrary buckets at minio:9000) keeps its path-style default. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 13:34:05 -05:00
Trey t	1347ffadf5	docs: presigned-URL upload flow + B2 lifecycle setup Backend CI / Test (push) Has been cancelled Details Backend CI / Contract Tests (push) Has been cancelled Details Backend CI / Lint (push) Has been cancelled Details Backend CI / Secret Scanning (push) Has been cancelled Details Backend CI / Build (push) Has been cancelled Details 09-storage.md: - Replaced the "Upload flow" section. The previous text described the multipart-via-API path that was removed in `b7f8329`. Now documents the three-step direct-to-B2 flow (presign → POST to B2 → attach via upload_ids[]) with an ASCII diagram and a server-side enforcement-points table. - Replaced the "Future: signed URLs" placeholder (since presigned URLs are now the present, not the future). - Added "Lifecycle and retention" subsections covering the pending_uploads cleanup cron (worker, 30 * * * *), the B2 bucket lifecycle as backstop (uploads/ prefix, 7-day hide + 1-day delete), and the still-open user-deletion cascade gap. 14-deployment-process.md: - Added a "One-time B2 bucket lifecycle (manual)" section explaining why the rule can't live in the deploy script (B2's S3 lifecycle API is partial), the exact rule to apply via the Backblaze console, and a verification command. docs/deployment/README.md: - Updated the chapter 9 description to mention presigned-URL uploads. README.md (root): - Added a paragraph under "Object storage" pointing to the new upload architecture and the relevant deployment-book chapters. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 17:44:08 -07:00
Trey t	14026251b7	fix(worker): wire B2 credentials so pending_uploads cleanup cron can run The new TypeUploadCleanup cron (30 * * * *) constructs a StorageService at worker startup so it can call b.client.RemoveObject on B2 when reaping expired pending_uploads rows. Without B2_KEY_ID + B2_APP_KEY the storage service falls back to local disk and crashes on this pod's read-only root filesystem, leaving the cleanup as a no-op. Mirrors the api deployment which already wires these. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 15:25:53 -07:00
Trey t	b7f83293b8	refactor(uploads): drop legacy multipart code paths Backend CI / Test (push) Has been cancelled Details Backend CI / Contract Tests (push) Has been cancelled Details Backend CI / Build (push) Has been cancelled Details Backend CI / Lint (push) Has been cancelled Details Backend CI / Secret Scanning (push) Has been cancelled Details The presigned-URL upload flow (POST /api/uploads/presign + direct B2 POST + upload_ids[] in entity creation) is now the only image upload path. The legacy multipart routes and DTO fields used by older clients are removed: Removed: - POST /api/uploads/image/ (legacy multipart upload → URL) - POST /api/uploads/document/ (legacy multipart upload → URL) - POST /api/uploads/completion/ (legacy multipart upload → URL) - Multipart branch in POST /api/task-completions/ (now JSON-only) - CreateTaskCompletionRequest.ImageURLs DTO field - UpdateTaskCompletionRequest.ImageURLs DTO field - CreateDocumentRequest.ImageURLs DTO field - Service-layer ImageURLs loops in task_service.CreateCompletion, task_service.UpdateCompletion, document_service.CreateDocument - Tests exercising the removed paths - Now-unused imports (strings/time/decimal) in task_handler.go Kept: - DELETE /api/uploads/ (orphan-cleanup endpoint, still useful) - POST /api/uploads/presign/ (the new path) - POST /api/documents/:id/images/ (uses storage_service.Upload directly, same multipart pattern but separate code path; deferred for now) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 15:19:21 -07:00
Trey t	29c9014a33	feat(uploads): direct-to-B2 presigned uploads with content-length-range policy Backend CI / Test (push) Has been cancelled Details Backend CI / Contract Tests (push) Has been cancelled Details Backend CI / Build (push) Has been cancelled Details Backend CI / Lint (push) Has been cancelled Details Backend CI / Secret Scanning (push) Has been cancelled Details Replaces the multipart-via-API path for image uploads with a three-step direct-to-storage flow: 1. Client POSTs /api/uploads/presign with content_length + content_type; server validates size (10 MB cap), mime allow-list per category, rate limit (50/hour/user via Redis sliding window), and concurrent unclaimed cap (10 in-flight per user). On success it persists a pending_uploads row, signs an S3 POST policy with content-length-range bound to the claimed length ±256 bytes, and returns the URL+fields. 2. Client POSTs the bytes directly to B2 using the signed policy. B2 enforces size, content-type, and key match before accepting. 3. Client passes upload_ids[] to /api/task-completions/ or /api/documents/. Service HEADs each B2 object, verifies size matches expected_bytes within slack, marks pending_uploads claimed_at, and creates the associated TaskCompletionImage / DocumentImage rows. Bytes never traverse our API server. The 1 MB Echo BodyLimit middleware that was rejecting all task-completion image uploads becomes irrelevant for this path. Existing multipart endpoints stay functional alongside, soak-testing the new path before legacy removal. Cleanup: - cmd/worker registers a new hourly cron (TypeUploadCleanup, "30 * * * *") that reaps pending_uploads where claimed_at IS NULL AND expires_at < NOW(). Reaps both the B2 object and the row. - B2 bucket lifecycle rule on `uploads/` prefix (7 days hide → 1 day delete) documented in deploy-k3s/manifests/b2-lifecycle.md as a backstop. 
Schema: - migrations/000002_pending_uploads.sql adds the table + partial index for cleanup + nullable pending_upload_id FKs on task_taskcompletionimage and task_documentimage. Policy (single tier, no free/pro split): - 10 MB cap per upload - 50 presigns/hour/user - 10 concurrent unclaimed uploads/user - allow-list: jpeg/png/heic/heif/webp for image categories; + pdf for document_file Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 14:36:42 -07:00
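The presign policy listed in 29c9014a33 can be sketched as a pure validation function. This is a minimal illustration, not the repository's code: the function names are hypothetical, "10 MB" is assumed to mean 10 MiB, and every category other than `document_file` is treated as an image category.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	maxUploadBytes = 10 << 20 // assumption: the 10 MB cap is 10 MiB
	lengthSlack    = 256      // ±256 bytes around the claimed length
)

var imageTypes = map[string]bool{
	"image/jpeg": true, "image/png": true, "image/heic": true,
	"image/heif": true, "image/webp": true,
}

// ValidatePresign applies the size cap and the per-category MIME allow-list.
func ValidatePresign(category, contentType string, contentLength int64) error {
	if contentLength <= 0 || contentLength > maxUploadBytes {
		return errors.New("content_length out of range")
	}
	if imageTypes[contentType] {
		return nil
	}
	// document_file additionally accepts PDF.
	if category == "document_file" && contentType == "application/pdf" {
		return nil
	}
	return fmt.Errorf("content_type %q not allowed for category %q", contentType, category)
}

// LengthRange returns the content-length-range bounds for a claimed length.
func LengthRange(claimed int64) (lo, hi int64) {
	lo = claimed - lengthSlack
	if lo < 0 {
		lo = 0
	}
	return lo, claimed + lengthSlack
}

func main() {
	fmt.Println(ValidatePresign("document_file", "application/pdf", 4096))
	fmt.Println(ValidatePresign("completion", "application/pdf", 4096))
	fmt.Println(LengthRange(1000))
}
```

The rate limit (50/hour/user) and the in-flight cap (10 per user) need Redis and database state, so they are left out of this sketch.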
Trey t	9bee436e86	perf(subscription-status): cache + parallelize + invalidate on mutations Backend CI / Test (push) Has been cancelled Details Backend CI / Contract Tests (push) Has been cancelled Details Backend CI / Build (push) Has been cancelled Details Backend CI / Lint (push) Has been cancelled Details Backend CI / Secret Scanning (push) Has been cancelled Details GET /api/subscription/status/ was the slowest endpoint in the API at p50≈1750ms / p95≈2425ms — about 12× the floor for our cluster→Neon geography. Jaeger traces showed seven sequential SQL queries each costing roughly one transatlantic RTT (~110ms), with the actual queries running in 0.073ms at the database. Pure network serialization, not slow SQL. Three changes, in order of leverage: 1. Cache the assembled SubscriptionStatusResponse per-user in Redis with a 5-minute TTL. Hot path collapses to a single Redis GET (~5ms) on warm reads; the TTL is a safety net against missed invalidations. 2. Parallelize the three independent COUNT queries in getUserUsage (task_task / task_contractor / task_document) via golang.org/x/sync errgroup. Three RTTs collapse to one. Also dropped the redundant residence_residence COUNT — len(residenceIDs) from FindResidenceIDsByOwner is the same number, no need to re-query. 3. Wire explicit invalidation into every mutation that could change a user's response — residence/task/contractor/document CRUD, residence membership changes (JoinWithCode, RemoveUser, DeleteResidence), and every subscription tier flip across the IAP/Stripe/webhook surface. Residence-scoped invalidations fan out to every user with access via a new ResidenceRepository.FindUserIDsByResidence helper, so members of a shared residence don't see stale `usage` numbers when another member adds a task. Net effect: warm path goes from ~1350ms to ~5ms (Redis hit). Cold path goes from ~1350ms to ~250-450ms (5 sequential queries → 2 phases: residence IDs lookup, then parallel task/contractor/document counts). 
Also fixed a pre-existing CheckLimit signature drift in internal/integration/subscription_is_free_test.go that was blocking the package build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 11:00:23 -07:00
Trey t	0798ae8d74	fix(testutil): use shared-cache SQLite so concurrent reads see same DB SetupTestDB used `sqlite.Open(":memory:")`, which creates a separate in-memory database for every connection in GORM's pool. Sequential tests never noticed because the pool keeps reusing one connection — but the moment any code path issued concurrent reads (e.g. errgroup-driven parallel COUNT queries), a goroutine could pull a fresh connection, see no migrated tables, and explode with "no such table". Switched to `file:testdb_<n>?mode=memory&cache=shared&_journal=memory` with a per-test atomic counter so every connection in the pool sees the same in-memory DB and tests stay isolated from each other through the unique cache namespace. As a bonus, this also resolves the pre-existing TestTaskHandler_QuickComplete flake — same root cause, just intermittent because the pool occasionally handed out a second connection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 11:00:03 -07:00
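The per-test DSN scheme from 0798ae8d74 can be sketched as follows. `nextTestDSN` and the counter variable are hypothetical names; only the DSN format comes from the commit message. The sketch builds the string and leaves opening the database to the caller.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// testDBCounter gives every test its own shared-cache namespace.
var testDBCounter uint64

// nextTestDSN returns a DSN naming a unique in-memory database. Because the
// name is shared-cache, every pooled connection that opens this DSN sees the
// same tables, while different tests get different names and stay isolated.
func nextTestDSN() string {
	n := atomic.AddUint64(&testDBCounter, 1)
	return fmt.Sprintf("file:testdb_%d?mode=memory&cache=shared&_journal=memory", n)
}

func main() {
	fmt.Println(nextTestDSN())
	fmt.Println(nextTestDSN())
}
```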
Trey t	ce4d49caef	tools: add send-test-push for one-shot Asynq push verification Backend CI / Test (push) Has been cancelled Details Backend CI / Contract Tests (push) Has been cancelled Details Backend CI / Build (push) Has been cancelled Details Backend CI / Lint (push) Has been cancelled Details Backend CI / Secret Scanning (push) Has been cancelled Details Tiny CLI that enqueues a notification:send_push task into Redis. The worker picks it up and routes through internal/push/Client.SendToAll — which is exactly the path used by HandleSmartReminder, HandleDailyDigest, and any other in-process push, so a successful round-trip here proves the production push pipeline end-to-end without waiting for the next cron tick. Requires Redis to be reachable. Easiest path: kubectl -n honeydue port-forward svc/redis 6379:6379 go run ./cmd/send-test-push --user-id 6 --title "..." --message "..." The worker logs `Sending push notification...` followed by the APNs batch result; failure modes (BadDeviceToken, circuit breaker, etc.) surface as the same error_message rows the existing notif-diag tool already reports on. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 08:59:51 -07:00
Trey t	cb1dc383b4	tools: add admin-reset and notif-diag operational CLIs Backend CI / Test (push) Has been cancelled Details Backend CI / Contract Tests (push) Has been cancelled Details Backend CI / Build (push) Has been cancelled Details Backend CI / Lint (push) Has been cancelled Details Backend CI / Secret Scanning (push) Has been cancelled Details Two small Go CLIs for production ops that previously required ad-hoc psql or kubectl gymnastics. Both load DB credentials from prod.env-style env vars and read POSTGRES_PASSWORD from deploy/secrets/postgres_password.txt by default, so the workflow is `set -a && source deploy/prod.env && set +a` followed by go run. cmd/admin-reset/main.go: --list print all admin_users rows --verify --email X bcrypt-check a password against the stored hash using the same case-insensitive lookup the live /api/admin/auth/login endpoint uses --new-email Y rename an admin's email (with unique-index check) default (--email X) prompt for a new password twice (no echo, min 12 chars), bcrypt at DefaultCost, update the row cmd/notif-diag/main.go: default print pending/sent counts, breakdown by type and age, the 5 most recent pending rows with their error_message, and registered APNs/FCM device counts --mark-failed-as-sent cosmetic cleanup — UPDATE pending rows that have a recorded error to sent=true, sent_at=COALESCE(updated_at, NOW()) --yes skip the interactive confirmation prompt Both bypass internal/config.Load() entirely so they don't need SECRET_KEY or other unrelated env vars to run. .gitignore excludes the build artifacts at /admin-reset and /notif-diag. go.mod adds golang.org/x/term v0.41.0 (promoted from indirect to direct) for no-echo password input in admin-reset. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 08:36:13 -07:00
Trey t	8fce568532	fix(config): replace sync.Once reset-from-Do with mutex Load()'s validation-failure path reassigned cfgOnce = sync.Once{} from inside Do(). When Do() returned and tried to unlock the original mutex, the Once struct had already been replaced with a fresh one whose mutex was unlocked, panicking with "sync: unlock of unlocked mutex" on every boot where any required env var was missing or invalid. Replaced the Once with a plain sync.Mutex around a nil-check on the package-level cfg, building the candidate into a local first and only assigning to cfg after validate() succeeds. Same caching semantics, no race, and a failed Load() leaves cfg nil so the next caller retries cleanly. Also documented AppleAuthConfig.TeamID as currently dead — it's loaded from APPLE_TEAM_ID but no service reads it. Wire-up point noted for when Sign in with Apple revocation/refresh is added. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 08:35:54 -07:00
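The mutex-based replacement described in 8fce568532 follows a common pattern: build a candidate, validate it, and publish it only on success. The sketch below is an illustration under assumptions. The commit keeps `cfg` at package level; here the state lives in a hypothetical `Loader` struct with an injected `getenv` so the example is easy to exercise, and `SECRET_KEY` stands in for the full set of required variables.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Config is a stand-in with a single required field.
type Config struct {
	SecretKey string
}

func (c *Config) validate() error {
	if c.SecretKey == "" {
		return errors.New("SECRET_KEY is required")
	}
	return nil
}

// Loader caches the first successfully validated Config.
type Loader struct {
	mu  sync.Mutex
	cfg *Config
}

// Load returns the cached Config, or builds and validates a new one.
// A failed validation leaves cfg nil, so the next caller retries cleanly.
func (l *Loader) Load(getenv func(string) string) (*Config, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.cfg != nil {
		return l.cfg, nil
	}
	candidate := &Config{SecretKey: getenv("SECRET_KEY")}
	if err := candidate.validate(); err != nil {
		return nil, err
	}
	l.cfg = candidate // publish only after validation succeeds
	return l.cfg, nil
}

func main() {
	var l Loader
	_, err := l.Load(func(string) string { return "" })
	fmt.Println("first attempt:", err)
	c, err := l.Load(func(string) string { return "s3cret" })
	fmt.Println("second attempt:", c != nil, err)
}
```

Nothing is reassigned while the lock is held other than the cached pointer, which avoids the unlock-of-unlocked-mutex panic the commit describes.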
Trey t	289a23f7e6	deploy(ingress): drop obsolete scaffold ingress.yaml Backend CI / Test (push) Has been cancelled Details Backend CI / Contract Tests (push) Has been cancelled Details Backend CI / Lint (push) Has been cancelled Details Backend CI / Secret Scanning (push) Has been cancelled Details Backend CI / Build (push) Has been cancelled Details The directory had two ingress manifests that both define honeydue-api and honeydue-admin: - ingress.yaml (Mar 28, scaffold from `deploy-k3s/` greenfield template) - ingress-simple.yaml (Apr 24, corrected for our actual cluster shape per MIGRATION_NOTES.md) `kubectl apply -f manifests/ingress/` applies both, and ingress.yaml happens to apply last alphabetically (`-` < `.` so `ingress-simple` sorts before `ingress.yaml`), clobbering the corrected manifest. That left the live cluster with two regressions: 1. honeydue-admin had `admin-auth` Traefik middleware in its chain, referencing the `admin-basic-auth` secret. Per MIGRATION_NOTES basic auth is intentionally not applied on this cluster (admin uses in-app auth), so the secret was never created. Traefik logs `secret 'honeydue/admin-basic-auth' not found` on every reconcile and refuses to materialize the admin router → 404. 2. honeydue-api lost the apex `myhoneydue.com` rule that ingress-simple.yaml adds for the marketing landing page → apex 404. `kubectl apply -f ingress-simple.yaml` against the live cluster restored both routes (admin/apex back to 200). Removing the stale file from the repo prevents the next deploy from regressing. Refs: deploy-k3s/MIGRATION_NOTES.md ("Admin basic auth \| Not applied — in-app auth only").	2026-04-26 23:44:21 -05:00
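The ordering claim in 289a23f7e6 is easy to check: in byte order `-` (0x2D) sorts before `.` (0x2E), so `ingress-simple.yaml` comes first and `ingress.yaml` is applied last. A small sketch, with `applyOrder` as a hypothetical helper that models the lexical ordering the commit describes:

```go
package main

import (
	"fmt"
	"sort"
)

// applyOrder returns the file names in byte-wise lexical order.
func applyOrder(names []string) []string {
	sorted := append([]string(nil), names...)
	sort.Strings(sorted)
	return sorted
}

func main() {
	order := applyOrder([]string{"ingress.yaml", "ingress-simple.yaml"})
	fmt.Println(order) // the last entry is the manifest that wins
}
```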
Trey t	8d9ca2e6ed	docs(deployment): rewrite migration prose for goose adoption Backend CI / Test (push) Has been cancelled Details Backend CI / Contract Tests (push) Has been cancelled Details Backend CI / Build (push) Has been cancelled Details Backend CI / Lint (push) Has been cancelled Details Backend CI / Secret Scanning (push) Has been cancelled Details Update the deployment book and glossary to reflect the goose-based schema migration flow shipped in 12b2f9d/0f7450a: - ch07: clarify startup probe assumes migrations ran out-of-band - ch08: drop AutoMigrate-with-advisory-lock prose; describe goose Job - ch12: pod startup checks goose_db_version, no longer runs migrations - ch14: document the Job→wait→roll deploy gate and how to debug failures - ch16: add "Migrate Job fails during deploy" + "Schema precondition failed" failure modes - ch17: new runbook entries §26 (run migrations manually), §27 (recover from failed/dirty migration), §28 (bootstrap goose on fresh clone) - ch19: postscript on §13 noting MigrateWithLock approach is superseded - ch20: mark "Migration Job for schema changes" task done - glossary: add `goose` and `goose_db_version`; flag AutoMigrate as tests-only - references: add goose links; flag AutoMigrate as tests-only	2026-04-26 23:01:32 -05:00
Trey t	0f7450ada9	build: fix goose binary copy path for cross-compile go install with GOOS/GOARCH set drops binaries in /go/bin/<goos>_<goarch>/, which broke the COPY in go-base. Switching to git clone + go build with explicit -o /app/goose so the output path is stable regardless of host platform. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 22:48:08 -05:00
Trey t	12b2f9d43b	Adopt pressly/goose for schema migrations Backend CI / Test (push) Has been cancelled Details Backend CI / Contract Tests (push) Has been cancelled Details Backend CI / Build (push) Has been cancelled Details Backend CI / Lint (push) Has been cancelled Details Backend CI / Secret Scanning (push) Has been cancelled Details Replaces the previous hand-rolled MigrateWithLock + GORM AutoMigrate path, which had two compounding problems: - AutoMigrate ran on every pod startup (~5 min over the transatlantic link) even when no schema changes had landed - pg_advisory_lock is session-scoped, which silently fails through Neon's pgbouncer transaction-mode pooler — turns out this is a known and documented limitation that bites golang-migrate too Goose was chosen over golang-migrate (the other heavyweight) because: - Goose wraps each migration file in a transaction by default, so a failure rolls back cleanly instead of leaving a "dirty" version state requiring manual force-reset (golang-migrate's known weakness, per its own issue tracker — see #1001 + Atlas's writeup) - Goose's locking is opt-in. We don't opt in: migrations run as a single Kubernetes Job, which IS the singleton process. No advisory lock needed at all. Layout: - migrations/000001_init.sql — schema-only pg_dump of the live Neon DB at adoption, stripped of psql-only directives that block goose's bookkeeping insert. Pre-goose hand-numbered migrations 002-022 had their effects folded into this baseline; deleted from the live tree but preserved in git history at `58e6997`. - Dockerfile installs `goose v3.22.1` at build time and copies the binary into the api image. The migrate Job reuses the api image with command=goose, so no separate image to build/push/version. - deploy-k3s/manifests/migrate/job.yaml: a one-shot Job that strips the -pooler segment from DB_HOST (advisory lock won't survive pgbouncer transaction-mode), runs `goose up`, exits. 
- deploy-k3s/scripts/03-deploy.sh: deletes any prior Job, applies the fresh one, `kubectl wait --for=condition=complete --timeout=10m`, then proceeds with api/worker rollout. Job failure aborts the deploy before any new app pod sees a stale schema. - internal/database/database.go::RequireSchemaApplied checks goose_db_version on startup. api/worker refuse to boot if the table is missing or its latest row has is_applied=false — the fail-fast for "operator forgot to run migrate." - Makefile: migrate-up / migrate-down / migrate-status / migrate-new for local workflow. Production DB was bootstrapped manually: $ goose -dir migrations postgres "$DSN" version # creates table $ psql ... -c "INSERT INTO goose_db_version (version_id, is_applied, tstamp) VALUES (1, true, NOW());" Smoke test against fresh Postgres locally: 50 user tables created in 284ms via `goose up`, version_id=1 + is_applied=t recorded. Verified the local goose CLI talks to prod successfully: $ goose ... status Applied At Migration ======================================= Mon Apr 27 03:43:55 2026 -- 000001_init.sql Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 22:46:36 -05:00
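The "strip the -pooler segment from DB_HOST" step that the migrate Job performs can be sketched in Go. This is an illustration only: the Job's actual implementation is not shown in the commit, `directHost` is a hypothetical name, the hostname is made up, and the sketch assumes the `-pooler` suffix sits at the end of the first DNS label.

```go
package main

import (
	"fmt"
	"strings"
)

// directHost converts a pooler hostname to the direct endpoint by removing
// a trailing "-pooler" from the first DNS label. Hosts without that suffix
// are returned unchanged.
func directHost(host string) string {
	label, rest, found := strings.Cut(host, ".")
	label = strings.TrimSuffix(label, "-pooler")
	if !found {
		return label
	}
	return label + "." + rest
}

func main() {
	// Made-up hostname for illustration.
	fmt.Println(directHost("ep-example-123456-pooler.eu-central-1.aws.neon.tech"))
}
```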
Trey t	d96f317d20	Revert "Fix migration deadlock under Neon pooler" This reverts commit `30966c6f5e`.	2026-04-26 22:22:07 -05:00
Trey t	4049b704c3	Revert "deployment: extend api startup probe budget for direct-endpoint migrations" This reverts commit `a94744061e`.	2026-04-26 22:22:07 -05:00
Trey t	a94744061e	deployment: extend api startup probe budget for direct-endpoint migrations The migration-pooler fix (commit `30966c6`) routes AutoMigrate through Neon's direct compute endpoint to keep the session-scoped advisory lock alive. That swap means each DDL pays a fresh transatlantic RTT instead of riding warm pooler connections, so AutoMigrate's runtime climbs from ~90s to 4-6 min on the first pod of a cold boot. With the previous 240s grace the startup probe was killing pods mid-migration. Bumping to 120 × 5s = 600s grace. Subsequent pods inherit the schema and finish their migrate-no-op in seconds, so this only matters for the single first-pod migration window after a deploy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 22:05:58 -05:00
Trey t	30966c6f5e	Fix migration deadlock under Neon pooler Today's pooler-endpoint switch broke MigrateWithLock: pg_advisory_lock is session-scoped, but PgBouncer transaction-mode releases the underlying Postgres session after every transaction. The lock was being released the moment we acquired it, and on the next pod's startup the migration either deadlocked or proceeded without serialization. Visible as "Acquiring migration advisory lock..." hanging until the startup probe killed the pod (as just happened on the `b67f7f9` deploy). Fix: open a parallel gorm.DB pointed at the direct Neon endpoint (DB_HOST with the -pooler segment stripped) for migrations only. That keeps a real persistent session, so pg_advisory_lock works correctly. The migration runs against this direct connection; the runtime pool keeps using the pooler for everything else. Side effects: - Migrate() now delegates to migrate(target *gorm.DB) which lets MigrateWithLock pass the direct DB. Tests on SQLite still call Migrate() through the global pool unchanged. - Migration DB uses MaxOpenConns=1, MaxIdleConns=1; we just hold the lock on it, no connection pressure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 21:53:52 -05:00
Trey t	b67f7f9e6b	Cache SubscriptionSettings + cut monitoring poll noise Backend CI / Test (push) Has been cancelled Details Backend CI / Contract Tests (push) Has been cancelled Details Backend CI / Build (push) Has been cancelled Details Backend CI / Lint (push) Has been cancelled Details Backend CI / Secret Scanning (push) Has been cancelled Details Trace data revealed subscription_subscriptionsettings was consuming 1,983s of cumulative DB time per day (180× more than the next-largest table) for a 32-byte singleton row of admin-toggleable global flags. Root cause was a 30-second poll loop in monitoring.Service per pod plus uncached reads on every authed status check / CreateResidence / Stripe webhook. Fix is layered: 1. Redis cache for SubscriptionSettings — same shape as the residence-IDs cache. 30-min TTL, explicit invalidation on admin write. New CacheService.{Cache,GetCached,Invalidate}SubscriptionSettings plus a cachedSubscriptionSettings helper in services/. 2. SubscriptionService, StripeService, and both admin handlers (settings + limitations) now read through the cache. Admin write handlers invalidate so toggles propagate cluster-wide within ms instead of waiting for the TTL. 3. monitoring.Service.syncSettingsFromDB also reads from Redis first (raw redis.Client to avoid a services→monitoring import cycle). Polling interval bumped 30s → 5min. Combined with Redis-shared cache, cluster-wide DB hits from this poll go from ~480/hour to ~2/hour — a 240× reduction. 4. StripeService.CreateCheckoutSession now takes ctx so the cached settings span (and the Stripe webhook trace) stay attached to the request. Handler call site updated. 5. Admin handlers' direct h.db.First calls switched to db.WithContext(ctx) so the resulting orphan SQL spans nest under the admin request span in Jaeger. 
Net DB query rate for subscription_subscriptionsettings should drop from 0.101/sec to ~0/sec with occasional invalidation-driven refills, and the table's cumulative DB time from 1,983s/day to ~10s/day. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 21:29:30 -05:00
Trey t	c9ac273dbd	docs: capture latency optimizations + new caching invariants Backend CI / Test (push) Has been cancelled Details Backend CI / Contract Tests (push) Has been cancelled Details Backend CI / Build (push) Has been cancelled Details Backend CI / Lint (push) Has been cancelled Details Backend CI / Secret Scanning (push) Has been cancelled Details Shipping commit `88fb175` changed the trace shape and added a new caching layer with required invalidation rules. Updating the operator-facing docs so they match the running system. ch08 (database): - DB_HOST is the -pooler Neon endpoint, not direct compute - Connection pool: MaxIdleConns 20 (was 10), MaxLifetime 30m (was 10m), MaxIdleTime 0 (never close idle) - New \"Pool warm-up at boot\" section documenting the 20-parallel-ping warm-up in database.Connect - Replaced the \"Neon regions\" section: explicit RTT numbers, the optimization stack that minimizes round-trips, when this still matters ch15 (observability): - Replaced the 2,473ms/5-span sample trace with the new 229ms/2-span post-optimization trace; kept the old one underneath for diff context ch16 (failure modes): - Added: stale residence-IDs cache (data freshness bug + recovery) - Added: Redis at maxmemory limit (verify allkeys-lru policy) - Added: Neon pooler unreachable but direct endpoint up — emergency switchover procedure ch17 (runbook): - §23 Invalidate residence-IDs cache for a user (DEL key + grep for missing invalidation in new code) - §24 Verify DB pool warm-up is working (log pattern + impact test) - §25 Switch DB host between pooler and direct endpoints observability-plan.md status flipped from \"plan only\" to shipped with the latency-cut summary. README links to the new ch08 latency section. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 17:36:36 -05:00
Trey t	88fb1751c7	Cut /api/tasks/ p99 from ~2500ms toward ~150-300ms

Stack of optimizations against the same Hetzner→Neon transatlantic link. The trace revealed every visible ms was network/proxy overhead; DB execution itself is sub-millisecond per query (verified via EXPLAIN ANALYZE: index scans on every hot path).

Connection layer:
- DB_HOST → Neon pooler endpoint (-pooler suffix). PgBouncer transaction-mode keeps backend Postgres connections warm so we no longer pay the ~110ms Postgres-startup RTT on cold queries.
- GORM pool tuned: MaxIdleConns 10→20, MaxLifetime 600s→1800s, MaxIdleTime added (default 0 = never close idle).
- Eager pool warm-up at boot via parallel pings; the first user request no longer pays the ~440ms TCP+TLS+startup handshake.
- Redis maxmemory-policy noeviction → allkeys-lru. Cache writes will evict cold keys instead of erroring at the 256MB limit.

Auth layer:
- TokenCacheTTL 5min → 1 hour (Redis token cache).
- UserCacheTTL 30s → 5min (in-memory User cache, per pod).
- UserCache gains a 5,000-entry LRU cap so a flood of unique users can't blow up pod RSS. ~5MB worst-case per pod.
- Token + user lookup collapsed from 2 GORM Preload queries into a single INNER JOIN. Saves 1 RTT per cold-cache request.
- Auth middleware's m.db.* now use db.WithContext(ctx) so the SQL spans nest under the parent HTTP request in Jaeger.

Service layer:
- TaskService.ListTasks: replaced two-step FindResidenceIDsByUser → GetKanbanDataForMultipleResidences with a single GetKanbanDataForUser that uses a Postgres subquery for residence-access. One round-trip instead of two.
- New CacheService residence-IDs cache: "residence_ids_user:<id>" with 5-min TTL. Wired into Task/Residence/Contractor/Document services for the four hot read paths that need this list.
- Cache invalidation on every relevant mutation: CreateResidence, DeleteResidence, JoinWithCode, RemoveUser. DeleteResidence invalidates every member of the residence, not just the owner.

What this stacks up to (Hetzner→Neon, before US migration):

Path                                  Before    After (target)
Cache-warm authed read                ~800ms    ~100-200ms
Cache-cold authed read (1st in 1hr)   ~2500ms   ~500-700ms
First request after deploy            ~2500ms   ~700-900ms

The endgame US-region migration on top of this gets us to ~30-50ms warm-cache, but we're shippable at ~150ms warm right now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 17:13:50 -05:00
Trey t	9410da7497	docs/ch15: mark distributed tracing fully integrated

Every authed API endpoint now produces a nested flame graph (HTTP → auth → service → SQL). Replaces the in-flight section with the final span-source matrix and a sample 5-span /api/tasks/ trace. Notes the visible Hetzner→Neon transatlantic RTT as the perf bottleneck the flame graph surfaced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 16:44:31 -05:00
Trey t	d9b5f85c3d	Thread ctx through auth middleware DB lookups

The auth middleware's m.db.Preload + m.db.First calls were running without ctx, so on cache miss the resulting SQL queries appeared as orphan gorm.Query / gorm.Row spans in Jaeger. Now they nest under the parent HTTP request span like every other repo call.

This was the last orphaned-SQL source on the request hot path. Combined with the seven service migrations, every authenticated API call now produces a fully-nested flame graph: HTTP → auth-token-lookup (cache hit) or HTTP → auth-token-SQL (cache miss) → service → service-SQL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 16:36:47 -05:00
Trey t	e881d37de0	Migrate Auth/Contractor/Document/Notification/Subscription services to ctx

Every public method on these five services now takes ctx context.Context as the first arg and routes its repo calls through .WithContext(ctx). With TaskService and ResidenceService already migrated, this means every in-process service that touches Postgres now produces a flame graph in Jaeger where the SQL spans nest under the parent HTTP request span.

Endpoints now fully traced (HTTP → service → SQL):
- /api/auth/login, /register, /logout, /me, /verify-email, /resend-verification
- /api/auth/forgot-password, /verify-reset, /reset-password, /update-profile
- /api/contractors/* (CRUD + favorite + by-residence + tasks)
- /api/documents/* (CRUD + activate/deactivate + image upload/delete)
- /api/notifications/* (list, count, mark-read, prefs, devices)
- /api/subscription/* (status, purchase, cancel, triggers, promotions)
- All previously-migrated /api/tasks/* and /api/residences/* paths

Internal helpers also threaded:
- TaskService.sendTaskCompletedNotification → forwards ctx
- TaskService.UpdateUserTimezone → forwards ctx to NotificationService
- ResidenceService.CreateResidence → forwards ctx to SubscriptionService.CheckLimit
- NotificationService.registerAPNSDevice / registerGCMDevice → both take ctx

~75 method signatures, ~120 handler/test call sites updated. Tests pass green; the only failure is the pre-existing flaky TaskHandler_QuickComplete SQLite race that fails ~60% of runs on master.

Step 3 of the observability plan is now genuinely complete: every API endpoint backed by a Go service emits a per-request flame graph with HTTP → service → SQL spans, plus B2/APNs/FCM/asynq spans where applicable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 16:26:21 -05:00
Trey t	65a9aae4e5	Migrate TaskService + ResidenceService to ctx-aware repos

Every public method on TaskService and ResidenceService now takes ctx context.Context as the first arg and routes its repo calls through .WithContext(ctx). With otelgorm registered, this means every API endpoint backed by these two services produces a flame graph in Jaeger where the SQL spans nest under the parent HTTP request span, instead of appearing as orphaned queries.

Endpoints now fully traced (HTTP → service → SQL):
- GET /api/tasks/ (already shipped)
- GET /api/tasks/by-residence/:id/ (already shipped)
- GET /api/tasks/:id/
- POST /api/tasks/
- POST /api/tasks/bulk/
- PUT /api/tasks/:id/
- DELETE /api/tasks/:id/
- POST /api/tasks/:id/in-progress/
- POST /api/tasks/:id/cancel/
- POST /api/tasks/:id/uncancel/
- POST /api/tasks/:id/archive/
- POST /api/tasks/:id/unarchive/
- POST /api/tasks/:id/complete/
- POST /api/tasks/:id/quick-complete/
- GET /api/tasks/completions/* (CRUD)
- GET /api/static_data/ (categories, priorities, frequencies)
- GET /api/residences/
- GET /api/residences/my/
- GET /api/residences/summary/
- GET /api/residences/:id/
- POST /api/residences/
- PUT /api/residences/:id/
- DELETE /api/residences/:id/
- Share-code + member management endpoints
- GET /api/residences/:id/report/

Mechanical work: ~50 method signatures, ~80 handler call sites, ~25 test call sites updated. Internal sendTaskCompletedNotification helper also takes ctx so background notification SQL nests correctly.

The remaining services (ContractorService, DocumentService, AuthService, NotificationService, SubscriptionService) follow the same pattern; they continue to emit untraced SQL until migrated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 16:04:01 -05:00
Trey t	3f5bf21e09	tracing: bump semconv to v1.40.0 to match runtime resource schema

Pods crashed at startup with "build resource: conflicting Schema URL: https://opentelemetry.io/schemas/1.40.0 and https://opentelemetry.io/schemas/1.27.0" because resource.Default() in the SDK targets v1.40.0. Aligning here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 15:35:46 -05:00


233 Commits