honeyDueAPI

Author	SHA1	Message	Date
Trey t	81578f6e27	feat(auth): replace hand-rolled auth with Ory Kratos — phase 2 backend Delegates all credential management (login, register, password reset, email verification, social sign-in) to Ory Kratos. The Go API now acts as a resource server: the new KratosAuth middleware validates sessions against the Kratos whoami endpoint, writes the local User mirror into Echo context, and all existing domain handlers continue working unchanged. Hand-rolled token auth, AuthToken model, apple_auth/ google_auth services, and the auth refresh flow are removed. Tests are updated to use the fake-token middleware pattern so existing integration assertions require no rewrite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 17:55:56 -05:00
Trey t	c77ff07ce9	fix(security): remediate 2026-05-12 audit findings (Stages 2–5) Remediation of the 2026-05-12/13 audits (78 findings + cluster gaps), tracked in deploy-k3s/SECURITY.md, plus fixes from two independent post-remediation reviews. Auth & sessions: - SHA-256 hashed auth-token storage (C1); prior-token cache eviction on re-login (MEDIUM-1) - local Google JWKS verification, iss/aud/exp checks (C2/C3) - constant-time login + generic errors (L1/LIVE-L11/LIVE-L13) - per-account login lockout keyed on distinct source IPs (M5/MEDIUM-3) - verified-email gating, login rate limiting (LIVE-L19, H1-H3) IAP & webhooks: - Apple/Google cross-account replay protection (C5/C6/C10/C13, H5/H6) - migrations 000003-000006 (token hashing, IAP replay, audit_log + webhook_event_log table creation, append-only audit log) Authorization & races: - file-ownership owner-OR-member fix (C7), atomic share-code join (C9/H9), device-token reassignment (C8/LOW-3) Secrets & deploy: - secrets file-mounted at /etc/honeydue/secrets, not env (F8); Redis password out of the ConfigMap (HIGH-1); B2 keys reconciled - digest-pinned images, admin ingress hardening, CSP/HSTS, /metrics lockdown; kubeconfig 0600, etcd secrets-encryption, fail2ban + unattended-upgrades at provision; secret-rotation runbook Build, vet, and the full test suite (incl. -race) pass; the goose migration chain is verified against PostgreSQL 16. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 22:28:33 -05:00
Trey t	9bee436e86	perf(subscription-status): cache + parallelize + invalidate on mutations GET /api/subscription/status/ was the slowest endpoint in the API at p50≈1750ms / p95≈2425ms — about 12× the floor for our cluster→Neon geography. Jaeger traces showed seven sequential SQL queries each costing roughly one transatlantic RTT (~110ms), with the actual queries running in 0.073ms at the database. Pure network serialization, not slow SQL. Three changes, in order of leverage: 1. Cache the assembled SubscriptionStatusResponse per-user in Redis with a 5-minute TTL. Hot path collapses to a single Redis GET (~5ms) on warm reads; the TTL is a safety net against missed invalidations. 2. Parallelize the three independent COUNT queries in getUserUsage (task_task / task_contractor / task_document) via golang.org/x/sync errgroup. Three RTTs collapse to one. Also dropped the redundant residence_residence COUNT — len(residenceIDs) from FindResidenceIDsByOwner is the same number, no need to re-query. 3. Wire explicit invalidation into every mutation that could change a user's response — residence/task/contractor/document CRUD, residence membership changes (JoinWithCode, RemoveUser, DeleteResidence), and every subscription tier flip across the IAP/Stripe/webhook surface. Residence-scoped invalidations fan out to every user with access via a new ResidenceRepository.FindUserIDsByResidence helper, so members of a shared residence don't see stale `usage` numbers when another member adds a task. Net effect: warm path goes from ~1350ms to ~5ms (Redis hit). Cold path goes from ~1350ms to ~250-450ms (5 sequential queries → 2 phases: residence IDs lookup, then parallel task/contractor/document counts). 
Also fixed a pre-existing CheckLimit signature drift in internal/integration/subscription_is_free_test.go that was blocking the package build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 11:00:23 -07:00
Trey t	b67f7f9e6b	Cache SubscriptionSettings + cut monitoring poll noise Trace data revealed subscription_subscriptionsettings was consuming 1,983s of cumulative DB time per day (180× more than the next-largest table) for a 32-byte singleton row of admin-toggleable global flags. Root cause was a 30-second poll loop in monitoring.Service per pod plus uncached reads on every authed status check / CreateResidence / Stripe webhook. Fix is layered: 1. Redis cache for SubscriptionSettings — same shape as the residence-IDs cache. 30-min TTL, explicit invalidation on admin write. New CacheService.{Cache,GetCached,Invalidate}SubscriptionSettings plus a cachedSubscriptionSettings helper in services/. 2. SubscriptionService, StripeService, and both admin handlers (settings + limitations) now read through the cache. Admin write handlers invalidate so toggles propagate cluster-wide within ms instead of waiting for the TTL. 3. monitoring.Service.syncSettingsFromDB also reads from Redis first (raw redis.Client to avoid a services→monitoring import cycle). Polling interval bumped 30s → 5min. Combined with Redis-shared cache, cluster-wide DB hits from this poll go from ~480/hour to ~2/hour — a 240× reduction. 4. StripeService.CreateCheckoutSession now takes ctx so the cached settings span (and the Stripe webhook trace) stay attached to the request. Handler call site updated. 5. Admin handlers' direct h.db.First calls switched to db.WithContext(ctx) so the resulting orphan SQL spans nest under the admin request span in Jaeger. 
Net DB query rate for subscription_subscriptionsettings should drop from 0.101/sec to ~0/sec with occasional invalidation-driven refills, and the table's cumulative DB time from 1,983s/day to ~10s/day. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 21:29:30 -05:00
Trey t	88fb1751c7	Cut /api/tasks/ p99 from ~2500ms toward ~150-300ms Stack of optimizations against the same Hetzner→Neon transatlantic link. The trace revealed every visible ms was network/proxy overhead — DB execution itself is sub-millisecond per query (verified via EXPLAIN ANALYZE: index scans on every hot path). Connection layer: - DB_HOST → Neon pooler endpoint (-pooler suffix). PgBouncer transaction-mode keeps backend Postgres connections warm so we no longer pay the ~110ms Postgres-startup RTT on cold queries. - GORM pool tuned: MaxIdleConns 10→20, MaxLifetime 600s→1800s, MaxIdleTime added (default 0 = never close idle). - Eager pool warm-up at boot via parallel pings — first user request no longer pays the ~440ms TCP+TLS+startup handshake. - Redis maxmemory-policy noeviction → allkeys-lru. Cache writes will evict cold keys instead of erroring at the 256MB limit. Auth layer: - TokenCacheTTL 5min → 1 hour (Redis token cache). - UserCacheTTL 30s → 5min (in-memory User cache, per pod). - UserCache gains a 5,000-entry LRU cap so a flood of unique users can't blow up pod RSS. ~5MB worst-case per pod. - Token + user lookup collapsed from 2 GORM Preload queries into a single INNER JOIN. Saves 1 RTT per cold-cache request. - Auth middleware's m.db.* now use db.WithContext(ctx) so the SQL spans nest under the parent HTTP request in Jaeger. Service layer: - TaskService.ListTasks: replaced two-step FindResidenceIDsByUser → GetKanbanDataForMultipleResidences with a single GetKanbanDataForUser that uses a Postgres subquery for residence-access. One round-trip instead of two. - New CacheService residence-IDs cache: "residence_ids_user:<id>" with 5-min TTL. 
Wired into Task/Residence/Contractor/Document services for the four hot read paths that need this list. - Cache invalidation on every relevant mutation: CreateResidence, DeleteResidence, JoinWithCode, RemoveUser. DeleteResidence invalidates every member of the residence, not just the owner. What this stacks up to (Hetzner→Neon, before US migration):
Path                                  Before    After (target)
Cache-warm authed read                ~800ms    ~100-200ms
Cache-cold authed read (1st in 1hr)   ~2500ms   ~500-700ms
First request after deploy            ~2500ms   ~700-900ms
The endgame US-region migration on top of this gets us to ~30-50ms warm-cache, but we're shippable at ~150ms warm right now. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 17:13:50 -05:00
Trey t	6f303dbbaa	Migrate prod deploy from Swarm to K3s; add full deployment book Infrastructure: - Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers) - Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh - All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept temporarily for reference Bug fixes surfaced during migration: - Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25) - cache_service.go: remove sync.Once reassignment from inside Do() callback (was causing 'unlock of unlocked mutex' fatal after Redis Ping failure) - router.go: relax CSP from 'default-src none' to 'default-src self' + allowlist fonts.googleapis.com so the marketing landing page CSS actually loads in browsers - deploy/scripts/deploy_prod.sh: use docker buildx with --platform linux/amd64 so arm64 (Apple Silicon) dev machines produce images runnable on x86_64 Hetzner nodes; fix array expansion under set -u - deploy/swarm-stack.prod.yml: fix secret source references to use top-level aliases (the '${X_SECRET}' form never actually resolved); dozzle ports: long-form host_ip is rejected by Swarm, switched to short-form (bound to 0.0.0.0 with UFW-based loopback restriction); worker replicas 2 -> 1 (Asynq scheduler singleton) - deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/' (Next.js serves at root; /admin/ returned 404 and killed pods); startupProbe failureThreshold 12 -> 24 - deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable 1 -> 0 (singleton) - deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold 12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot; real startup takes up to 240s) - 
.gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/ and admin/src/app/api/*, hiding legitimate files) New files: - deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet + hostNetwork override for k3s-bundled Traefik - deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress without TLS (CF Flexible SSL) and without middleware - deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log Documentation: - docs/deployment/ — full deployment book, 26 files, ~42k words: - Part I Overview, infrastructure, orchestrator choice (Ch 0-2) - Part II Networking, firewall, Cloudflare (Ch 3-4, 13) - Part III Security, Traefik ingress (Ch 5-6) - Part IV Services, DB, storage, secrets, registry (Ch 7-11) - Part V Data flow, deploy process, observability, failures, runbook (Ch 12, 14-17) - Part VI Cost, Swarm postmortem, roadmap (Ch 18-20) - Appendices: glossary, kubectl cheat sheet, file locations, consolidated citations - README.md: Production Deployment section replaced with pointer to the book; Go version bumped to 1.25 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 07:20:54 -05:00
Trey T	b679f28e55	Production hardening: security, resilience, observability, and compliance Password complexity: custom validator requiring uppercase, lowercase, digit (min 8 chars) Token expiry: 90-day token lifetime with refresh endpoint (60-90 day renewal window) Health check: /api/health/ now pings Postgres + Redis, returns 503 on failure Audit logging: async audit_log table for auth events (login, register, delete, etc.) Circuit breaker: APNs/FCM push sends wrapped with 5-failure threshold, 30s recovery FK indexes: 27 missing foreign key indexes across all tables (migration 017) CSP header: default-src 'none'; frame-ancestors 'none' Gzip compression: level 5 with media endpoint skipper Prometheus metrics: /metrics endpoint using existing monitoring service External timeouts: 15s push, 30s SMTP, context timeouts on all external calls Migrations: 016 (token created_at), 017 (FK indexes), 018 (audit_log) Tests: circuit breaker (15), audit service (8), token refresh (7), health (4), middleware expiry (5), validator (new)	2026-03-26 14:05:28 -05:00
Trey t	42a5533a56	Fix 113 hardening issues across entire Go backend Security: - Replace all binding: tags with validate: + c.Validate() in admin handlers - Add rate limiting to auth endpoints (login, register, password reset) - Add security headers (HSTS, XSS protection, nosniff, frame options) - Wire Google Pub/Sub token verification into webhook handler - Replace ParseUnverified with proper OIDC/JWKS key verification - Verify inner Apple JWS signatures in webhook handler - Add io.LimitReader (1MB) to all webhook body reads - Add ownership verification to file deletion - Move hardcoded admin credentials to env vars - Add uniqueIndex to User.Email - Hide ConfirmationCode from JSON serialization - Mask confirmation codes in admin responses - Use http.DetectContentType for upload validation - Fix path traversal in storage service - Replace os.Getenv with Viper in stripe service - Sanitize Redis URLs before logging - Separate DEBUG_FIXED_CODES from DEBUG flag - Reject weak SECRET_KEY in production - Add host check on /_next/* proxy routes - Use explicit localhost CORS origins in debug mode - Replace err.Error() with generic messages in all admin error responses Critical fixes: - Rewrite FCM to HTTP v1 API with OAuth 2.0 service account auth - Fix user_customuser -> auth_user table names in raw SQL - Fix dashboard verified query to use UserProfile model - Add escapeLikeWildcards() to prevent SQL wildcard injection Bug fixes: - Add bounds checks for days/expiring_soon query params (1-3650) - Add receipt_data/transaction_id empty-check to RestoreSubscription - Change Active bool -> *bool in device handler - Check all unchecked GORM/FindByIDWithProfile errors - Add validation for notification hour fields (0-23) - Add max=10000 validation on task description updates Transactions & data integrity: - Wrap registration flow in transaction - Wrap QuickComplete in transaction - Move image creation inside completion transaction - Wrap SetSpecialties in transaction - Wrap 
GetOrCreateToken in transaction - Wrap completion+image deletion in transaction Performance: - Batch completion summaries (2 queries vs 2N) - Reuse single http.Client in IAP validation - Cache dashboard counts (30s TTL) - Batch COUNT queries in admin user list - Add Limit(500) to document queries - Add reminder_stage+due_date filters to reminder queries - Parse AllowedTypes once at init - In-memory user cache in auth middleware (30s TTL) - Timezone change detection cache - Optimize P95 with per-endpoint sorted buffers - Replace crypto/md5 with hash/fnv for ETags Code quality: - Add sync.Once to all monitoring Stop()/Close() methods - Replace 8 fmt.Printf with zerolog in auth service - Log previously discarded errors - Standardize delete response shapes - Route hardcoded English through i18n - Remove FileURL from DocumentResponse (keep MediaURL only) - Thread user timezone through kanban board responses - Initialize empty slices to prevent null JSON - Extract shared field map for task Update/UpdateTx - Delete unused SoftDeleteModel, min(), formatCron, legacy handlers Worker & jobs: - Wire Asynq email infrastructure into worker - Register HandleReminderLogCleanup with daily 3AM cron - Use per-user timezone in HandleSmartReminder - Replace direct DB queries with repository calls - Delete legacy reminder handlers (~200 lines) - Delete unused task type constants Dependencies: - Replace archived jung-kurt/gofpdf with go-pdf/fpdf - Replace unmaintained gomail.v2 with wneessen/go-mail - Add TODO for Echo jwt v3 transitive dep removal Test infrastructure: - Fix MakeRequest/SeedLookupData error handling - Replace os.Exit(0) with t.Skip() in scope/consistency tests - Add 11 new FCM v1 tests Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 23:14:13 -05:00
Trey t	4976eafc6c	Rebrand from Casera/MyCrib to honeyDue Total rebrand across all Go API source files: - Go module path: casera-api -> honeydue-api - All imports updated (130+ files) - Docker: containers, images, networks renamed - Email templates: support email, noreply, icon URL - Domains: casera.app/mycrib.treytartt.com -> honeyDue.treytartt.com - Bundle IDs: com.tt.casera -> com.tt.honeyDue - IAP product IDs updated - Landing page, admin panel, config defaults - Seeds, CI workflows, Makefile, docs - Database table names preserved (no migration needed) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 06:33:38 -06:00
Trey t	c5b0225422	Replace status_id with in_progress boolean field - Remove task_statuses lookup table and StatusID foreign key - Add InProgress boolean field to Task model - Add database migration (005_replace_status_with_in_progress) - Update all handlers, services, and repositories - Update admin frontend to display in_progress as checkbox/boolean - Remove Task Statuses tab from admin lookups page - Update tests to use InProgress instead of StatusID - Task categorization now uses InProgress for kanban column assignment 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-12-08 20:48:16 -06:00
Trey t	91a1f7ebed	Add Redis caching for lookup data and admin cache management - Add lookup-specific cache keys and methods to CacheService - Add cache refresh on lookup CRUD operations in AdminLookupHandler - Add Redis caching after seed-lookups in AdminSettingsHandler - Add ETag generation for seeded data to support client-side caching - Update task template handler with cache invalidation - Fix route for clear-cache endpoint 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-12-05 22:35:09 -06:00
Trey t	c7dc56e2d2	Rebrand from MyCrib to Casera - Update Go module from mycrib-api to casera-api - Update all import statements across 69 Go files - Update admin panel branding (title, sidebar, login form) - Update email templates (subjects, bodies, signatures) - Update PDF report generation branding - Update Docker container names and network - Update config defaults (database name, email sender, APNS topic) - Update README and documentation 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-28 21:10:48 -06:00
Trey t	1f12f3f62a	Initial commit: MyCrib API in Go Complete rewrite of Django REST API to Go with: - Gin web framework for HTTP routing - GORM for database operations - GoAdmin for admin panel - Gorush integration for push notifications - Redis for caching and job queues Features implemented: - User authentication (login, register, logout, password reset) - Residence management (CRUD, sharing, share codes) - Task management (CRUD, kanban board, completions) - Contractor management (CRUD, specialties) - Document management (CRUD, warranties) - Notifications (preferences, push notifications) - Subscription management (tiers, limits) Infrastructure: - Docker Compose for local development - Database migrations and seed data - Admin panel for data management 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-26 20:07:16 -06:00

13 Commits