139a990ebc258bad4ba9a88014224997eddcd21c
11 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
9bee436e86 |
perf(subscription-status): cache + parallelize + invalidate on mutations
GET /api/subscription/status/ was the slowest endpoint in the API at p50≈1750ms / p95≈2425ms — about 12× the floor for our cluster→Neon geography. Jaeger traces showed seven sequential SQL queries each costing roughly one transatlantic RTT (~110ms), with the actual queries running in 0.073ms at the database. Pure network serialization, not slow SQL. Three changes, in order of leverage: 1. Cache the assembled SubscriptionStatusResponse per-user in Redis with a 5-minute TTL. Hot path collapses to a single Redis GET (~5ms) on warm reads; the TTL is a safety net against missed invalidations. 2. Parallelize the three independent COUNT queries in getUserUsage (task_task / task_contractor / task_document) via golang.org/x/sync errgroup. Three RTTs collapse to one. Also dropped the redundant residence_residence COUNT — len(residenceIDs) from FindResidenceIDsByOwner is the same number, no need to re-query. 3. Wire explicit invalidation into every mutation that could change a user's response — residence/task/contractor/document CRUD, residence membership changes (JoinWithCode, RemoveUser, DeleteResidence), and every subscription tier flip across the IAP/Stripe/webhook surface. Residence-scoped invalidations fan out to every user with access via a new ResidenceRepository.FindUserIDsByResidence helper, so members of a shared residence don't see stale `usage` numbers when another member adds a task. Net effect: warm path goes from ~1350ms to ~5ms (Redis hit). Cold path goes from ~1350ms to ~250-450ms (5 sequential queries → 2 phases: residence IDs lookup, then parallel task/contractor/document counts). Also fixed a pre-existing CheckLimit signature drift in internal/integration/subscription_is_free_test.go that was blocking the package build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
b67f7f9e6b |
Cache SubscriptionSettings + cut monitoring poll noise
Trace data revealed subscription_subscriptionsettings was consuming
1,983s of cumulative DB time per day (180× more than the next-largest
table) for a 32-byte singleton row of admin-toggleable global flags.
Root cause was a 30-second poll loop in monitoring.Service per pod
plus uncached reads on every authed status check / CreateResidence /
Stripe webhook. Fix is layered:
1. Redis cache for SubscriptionSettings — same shape as the
residence-IDs cache. 30-min TTL, explicit invalidation on admin
write. New CacheService.{Cache,GetCached,Invalidate}SubscriptionSettings
plus a cachedSubscriptionSettings helper in services/.
2. SubscriptionService, StripeService, and both admin handlers
(settings + limitations) now read through the cache. Admin write
handlers invalidate so toggles propagate cluster-wide within ms
instead of waiting for the TTL.
3. monitoring.Service.syncSettingsFromDB also reads from Redis first
(raw redis.Client to avoid a services→monitoring import cycle).
Polling interval bumped 30s → 5min. Combined with Redis-shared
cache, cluster-wide DB hits from this poll go from ~480/hour to
~2/hour — a 240× reduction.
4. StripeService.CreateCheckoutSession now takes ctx so the cached
settings span (and the Stripe webhook trace) stay attached to the
request. Handler call site updated.
5. Admin handlers' direct h.db.First calls switched to
db.WithContext(ctx) so the resulting orphan SQL spans nest under
the admin request span in Jaeger.
Net DB query rate for subscription_subscriptionsettings should drop
from 0.101/sec to ~0/sec with occasional invalidation-driven refills,
and the table's cumulative DB time from 1,983s/day to ~10s/day.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
88fb1751c7 |
Cut /api/tasks/ p99 from ~2500ms toward ~150-300ms
Stack of optimizations against the same Hetzner→Neon transatlantic link. The trace revealed every visible ms was network/proxy overhead — DB execution itself is sub-millisecond per query (verified via EXPLAIN ANALYZE: index scans on every hot path). Connection layer: - DB_HOST → Neon pooler endpoint (-pooler suffix). PgBouncer transaction-mode keeps backend Postgres connections warm so we no longer pay the ~110ms Postgres-startup RTT on cold queries. - GORM pool tuned: MaxIdleConns 10→20, MaxLifetime 600s→1800s, MaxIdleTime added (default 0 = never close idle). - Eager pool warm-up at boot via parallel pings — first user request no longer pays the ~440ms TCP+TLS+startup handshake. - Redis maxmemory-policy noeviction → allkeys-lru. Cache writes will evict cold keys instead of erroring at the 256MB limit. Auth layer: - TokenCacheTTL 5min → 1 hour (Redis token cache). - UserCacheTTL 30s → 5min (in-memory User cache, per pod). - UserCache gains a 5,000-entry LRU cap so a flood of unique users can't blow up pod RSS. ~5MB worst-case per pod. - Token + user lookup collapsed from 2 GORM Preload queries into a single INNER JOIN. Saves 1 RTT per cold-cache request. - Auth middleware's m.db.* now use db.WithContext(ctx) so the SQL spans nest under the parent HTTP request in Jaeger. Service layer: - TaskService.ListTasks: replaced two-step FindResidenceIDsByUser → GetKanbanDataForMultipleResidences with a single GetKanbanDataForUser that uses a Postgres subquery for residence-access. One round-trip instead of two. - New CacheService residence-IDs cache: \"residence_ids_user:<id>\" with 5-min TTL. Wired into Task/Residence/Contractor/Document services for the four hot read paths that need this list. - Cache invalidation on every relevant mutation: CreateResidence, DeleteResidence, JoinWithCode, RemoveUser. DeleteResidence invalidates every member of the residence, not just the owner. What this stacks up to (Hetzner→Neon, before US migration): Path Before After (target) Cache-warm authed read ~800ms ~100-200ms Cache-cold authed read (1st in 1hr) ~2500ms ~500-700ms First request after deploy ~2500ms ~700-900ms The endgame US-region migration on top of this gets us to ~30-50ms warm-cache, but we're shippable at ~150ms warm right now. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
6f303dbbaa |
Migrate prod deploy from Swarm to K3s; add full deployment book
Infrastructure:
- Stack now runs on K3s v1.34.6 HA (3 Hetzner CX33 nodes as managers)
- Traefik DaemonSet + hostNetwork replaces Caddy + ingress mesh
- All manifests in deploy-k3s/manifests/; Swarm config (deploy/) kept
temporarily for reference
Bug fixes surfaced during migration:
- Dockerfile: golang:1.24-alpine -> 1.25-alpine (go.mod requires 1.25)
- cache_service.go: remove sync.Once reassignment from inside Do()
callback (was causing 'unlock of unlocked mutex' fatal after
Redis Ping failure)
- router.go: relax CSP from 'default-src none' to 'default-src self'
+ allowlist fonts.googleapis.com so the marketing landing page CSS
actually loads in browsers
- deploy/scripts/deploy_prod.sh: use docker buildx with
--platform linux/amd64 so arm64 (Apple Silicon) dev machines produce
images runnable on x86_64 Hetzner nodes; fix array expansion under
set -u
- deploy/swarm-stack.prod.yml: fix secret source references to use
top-level aliases (the '\${X_SECRET}' form never actually resolved);
dozzle ports: long-form host_ip is rejected by Swarm, switched to
short-form (bound to 0.0.0.0 with UFW-based loopback restriction);
worker replicas 2 -> 1 (Asynq scheduler singleton)
- deploy-k3s/manifests/admin/deployment.yaml: probe path '/admin/' -> '/'
(Next.js serves at root; /admin/ returned 404 and killed pods);
startupProbe failureThreshold 12 -> 24
- deploy-k3s/manifests/pod-disruption-budgets.yaml: worker minAvailable
1 -> 0 (singleton)
- deploy-k3s/manifests/api/deployment.yaml: startupProbe failureThreshold
12 -> 48 (MigrateWithLock serializes across 3 replicas on first-boot;
real startup takes up to 240s)
- .gitignore: tighten 'api' -> '/api' (was matching deploy-k3s/manifests/api/
and admin/src/app/api/*, hiding legitimate files)
New files:
- deploy-k3s/manifests/traefik-helmchartconfig.yaml: DaemonSet +
hostNetwork override for k3s-bundled Traefik
- deploy-k3s/manifests/ingress/ingress-simple.yaml: plain Ingress
without TLS (CF Flexible SSL) and without middleware
- deploy-k3s/MIGRATION_NOTES.md: operator-facing migration log
Documentation:
- docs/deployment/ — full deployment book, 26 files, ~42k words:
- Part I Overview, infrastructure, orchestrator choice (Ch 0-2)
- Part II Networking, firewall, Cloudflare (Ch 3-4, 13)
- Part III Security, Traefik ingress (Ch 5-6)
- Part IV Services, DB, storage, secrets, registry (Ch 7-11)
- Part V Data flow, deploy process, observability, failures, runbook
(Ch 12, 14-17)
- Part VI Cost, Swarm postmortem, roadmap (Ch 18-20)
- Appendices: glossary, kubectl cheat sheet, file locations,
consolidated citations
- README.md: Production Deployment section replaced with pointer to
the book; Go version bumped to 1.25
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
b679f28e55 |
Production hardening: security, resilience, observability, and compliance
Password complexity: custom validator requiring uppercase, lowercase, digit (min 8 chars)
Token expiry: 90-day token lifetime with refresh endpoint (60-90 day renewal window)
Health check: /api/health/ now pings Postgres + Redis, returns 503 on failure
Audit logging: async audit_log table for auth events (login, register, delete, etc.)
Circuit breaker: APNs/FCM push sends wrapped with 5-failure threshold, 30s recovery
FK indexes: 27 missing foreign key indexes across all tables (migration 017)
CSP header: default-src 'none'; frame-ancestors 'none'
Gzip compression: level 5 with media endpoint skipper
Prometheus metrics: /metrics endpoint using existing monitoring service
External timeouts: 15s push, 30s SMTP, context timeouts on all external calls
Migrations: 016 (token created_at), 017 (FK indexes), 018 (audit_log)
Tests: circuit breaker (15), audit service (8), token refresh (7), health (4),
middleware expiry (5), validator (new)
|
||
|
|
42a5533a56 |
Fix 113 hardening issues across entire Go backend
Security: - Replace all binding: tags with validate: + c.Validate() in admin handlers - Add rate limiting to auth endpoints (login, register, password reset) - Add security headers (HSTS, XSS protection, nosniff, frame options) - Wire Google Pub/Sub token verification into webhook handler - Replace ParseUnverified with proper OIDC/JWKS key verification - Verify inner Apple JWS signatures in webhook handler - Add io.LimitReader (1MB) to all webhook body reads - Add ownership verification to file deletion - Move hardcoded admin credentials to env vars - Add uniqueIndex to User.Email - Hide ConfirmationCode from JSON serialization - Mask confirmation codes in admin responses - Use http.DetectContentType for upload validation - Fix path traversal in storage service - Replace os.Getenv with Viper in stripe service - Sanitize Redis URLs before logging - Separate DEBUG_FIXED_CODES from DEBUG flag - Reject weak SECRET_KEY in production - Add host check on /_next/* proxy routes - Use explicit localhost CORS origins in debug mode - Replace err.Error() with generic messages in all admin error responses Critical fixes: - Rewrite FCM to HTTP v1 API with OAuth 2.0 service account auth - Fix user_customuser -> auth_user table names in raw SQL - Fix dashboard verified query to use UserProfile model - Add escapeLikeWildcards() to prevent SQL wildcard injection Bug fixes: - Add bounds checks for days/expiring_soon query params (1-3650) - Add receipt_data/transaction_id empty-check to RestoreSubscription - Change Active bool -> *bool in device handler - Check all unchecked GORM/FindByIDWithProfile errors - Add validation for notification hour fields (0-23) - Add max=10000 validation on task description updates Transactions & data integrity: - Wrap registration flow in transaction - Wrap QuickComplete in transaction - Move image creation inside completion transaction - Wrap SetSpecialties in transaction - Wrap GetOrCreateToken in transaction - Wrap completion+image deletion in transaction Performance: - Batch completion summaries (2 queries vs 2N) - Reuse single http.Client in IAP validation - Cache dashboard counts (30s TTL) - Batch COUNT queries in admin user list - Add Limit(500) to document queries - Add reminder_stage+due_date filters to reminder queries - Parse AllowedTypes once at init - In-memory user cache in auth middleware (30s TTL) - Timezone change detection cache - Optimize P95 with per-endpoint sorted buffers - Replace crypto/md5 with hash/fnv for ETags Code quality: - Add sync.Once to all monitoring Stop()/Close() methods - Replace 8 fmt.Printf with zerolog in auth service - Log previously discarded errors - Standardize delete response shapes - Route hardcoded English through i18n - Remove FileURL from DocumentResponse (keep MediaURL only) - Thread user timezone through kanban board responses - Initialize empty slices to prevent null JSON - Extract shared field map for task Update/UpdateTx - Delete unused SoftDeleteModel, min(), formatCron, legacy handlers Worker & jobs: - Wire Asynq email infrastructure into worker - Register HandleReminderLogCleanup with daily 3AM cron - Use per-user timezone in HandleSmartReminder - Replace direct DB queries with repository calls - Delete legacy reminder handlers (~200 lines) - Delete unused task type constants Dependencies: - Replace archived jung-kurt/gofpdf with go-pdf/fpdf - Replace unmaintained gomail.v2 with wneessen/go-mail - Add TODO for Echo jwt v3 transitive dep removal Test infrastructure: - Fix MakeRequest/SeedLookupData error handling - Replace os.Exit(0) with t.Skip() in scope/consistency tests - Add 11 new FCM v1 tests Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
||
|
|
4976eafc6c |
Rebrand from Casera/MyCrib to honeyDue
Total rebrand across all Go API source files: - Go module path: casera-api -> honeydue-api - All imports updated (130+ files) - Docker: containers, images, networks renamed - Email templates: support email, noreply, icon URL - Domains: casera.app/mycrib.treytartt.com -> honeyDue.treytartt.com - Bundle IDs: com.tt.casera -> com.tt.honeyDue - IAP product IDs updated - Landing page, admin panel, config defaults - Seeds, CI workflows, Makefile, docs - Database table names preserved (no migration needed) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
c5b0225422 |
Replace status_id with in_progress boolean field
- Remove task_statuses lookup table and StatusID foreign key - Add InProgress boolean field to Task model - Add database migration (005_replace_status_with_in_progress) - Update all handlers, services, and repositories - Update admin frontend to display in_progress as checkbox/boolean - Remove Task Statuses tab from admin lookups page - Update tests to use InProgress instead of StatusID - Task categorization now uses InProgress for kanban column assignment 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> |
||
|
|
91a1f7ebed |
Add Redis caching for lookup data and admin cache management
- Add lookup-specific cache keys and methods to CacheService - Add cache refresh on lookup CRUD operations in AdminLookupHandler - Add Redis caching after seed-lookups in AdminSettingsHandler - Add ETag generation for seeded data to support client-side caching - Update task template handler with cache invalidation - Fix route for clear-cache endpoint 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> |
||
|
|
c7dc56e2d2 |
Rebrand from MyCrib to Casera
- Update Go module from mycrib-api to casera-api - Update all import statements across 69 Go files - Update admin panel branding (title, sidebar, login form) - Update email templates (subjects, bodies, signatures) - Update PDF report generation branding - Update Docker container names and network - Update config defaults (database name, email sender, APNS topic) - Update README and documentation 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> |
||
|
|
1f12f3f62a |
Initial commit: MyCrib API in Go
Complete rewrite of Django REST API to Go with: - Gin web framework for HTTP routing - GORM for database operations - GoAdmin for admin panel - Gorush integration for push notifications - Redis for caching and job queues Features implemented: - User authentication (login, register, logout, password reset) - Residence management (CRUD, sharing, share codes) - Task management (CRUD, kanban board, completions) - Contractor management (CRUD, specialties) - Document management (CRUD, warranties) - Notifications (preferences, push notifications) - Subscription management (tiers, limits) Infrastructure: - Docker Compose for local development - Database migrations and seed data - Admin panel for data management 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> |