Cut /api/tasks/ p99 from ~2500ms toward ~150-300ms

Stack of optimizations against the same Hetzner→Neon transatlantic link.
The trace revealed every visible ms was network/proxy overhead; DB
execution itself is sub-millisecond per query (verified via EXPLAIN
ANALYZE: index scans on every hot path).

Connection layer:
- DB_HOST → Neon pooler endpoint (-pooler suffix). PgBouncer
  transaction-mode keeps backend Postgres connections warm so we no
  longer pay the ~110ms Postgres-startup RTT on cold queries.
- GORM pool tuned: MaxIdleConns 10→20, MaxLifetime 600s→1800s,
  MaxIdleTime added (default 0 = never close idle).
- Eager pool warm-up at boot via parallel pings: the first user request
  no longer pays the ~440ms TCP+TLS+startup handshake.
- Redis maxmemory-policy noeviction → allkeys-lru. Cache writes will
  evict cold keys instead of erroring at the 256MB limit.

Auth layer:
- TokenCacheTTL 5min → 1 hour (Redis token cache).
- UserCacheTTL 30s → 5min (in-memory User cache, per pod).
- UserCache gains a 5,000-entry LRU cap so a flood of unique users
  can't blow up pod RSS. ~5MB worst-case per pod.
- Token + user lookup collapsed from 2 GORM Preload queries into a
  single INNER JOIN. Saves 1 RTT per cold-cache request.
- Auth middleware's m.db.* calls now use db.WithContext(ctx) so the SQL
  spans nest under the parent HTTP request in Jaeger.

Service layer:
- TaskService.ListTasks: replaced two-step FindResidenceIDsByUser →
  GetKanbanDataForMultipleResidences with a single GetKanbanDataForUser
  that uses a Postgres subquery for residence-access. One round-trip
  instead of two.
- New CacheService residence-IDs cache: "residence_ids_user:<id>" with
  5-min TTL. Wired into Task/Residence/Contractor/Document services for
  the four hot read paths that need this list.
- Cache invalidation on every relevant mutation: CreateResidence,
  DeleteResidence, JoinWithCode, RemoveUser. DeleteResidence invalidates
  every member of the residence, not just the owner.

What this stacks up to (Hetzner→Neon, before US migration):

  Path                                  Before    After (target)
  Cache-warm authed read                ~800ms    ~100-200ms
  Cache-cold authed read (1st in 1hr)   ~2500ms   ~500-700ms
  First request after deploy            ~2500ms   ~700-900ms

The endgame US-region migration on top of this gets us to ~30-50ms
warm-cache, but we're shippable at ~150ms warm right now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 17:13:50 -05:00
parent 9410da7497
commit 88fb1751c7
15 changed files with 443 additions and 59 deletions
@@ -388,3 +388,61 @@ func (c *CacheService) GetSeededDataETag(ctx context.Context) (string, error) {
 func (c *CacheService) InvalidateSeededData(ctx context.Context) error {
 	return c.Delete(ctx, SeededDataKey, SeededDataETagKey)
 }
+
+// === User → Residence-IDs cache ===
+//
+// Caches the set of residence IDs each user has access to. Hot read on
+// every authenticated API call (auth + tasks + residences + contractors +
+// documents all need it). Mutations on residences/share-codes invalidate
+// only the affected user(s); see InvalidateResidenceIDsForUsers.
+
+const (
+	residenceIDsKeyPrefix = "residence_ids_user:"
+	residenceIDsTTL       = 5 * time.Minute
+)
+
+// CacheResidenceIDsForUser stores the residence-ID list for a user with a
+// 5-minute TTL. Membership rarely changes (only on residence create,
+// share-code accept, remove-user, delete-residence), so a 5-minute window
+// catches the vast majority of repeat reads while keeping staleness bounded.
+func (c *CacheService) CacheResidenceIDsForUser(ctx context.Context, userID uint, ids []uint) error {
+	if c == nil {
+		return nil
+	}
+	key := fmt.Sprintf("%s%d", residenceIDsKeyPrefix, userID)
+	data, err := json.Marshal(ids)
+	if err != nil {
+		return err
+	}
+	return c.client.Set(ctx, key, data, residenceIDsTTL).Err()
+}
+
+// GetCachedResidenceIDsForUser fetches the cached residence-ID list. Returns
+// (nil, redis.Nil) when not cached so callers can distinguish from "user has
+// zero residences" (empty slice) — though for practical purposes both result
+// in an empty kanban response, so most callers can ignore the distinction.
+func (c *CacheService) GetCachedResidenceIDsForUser(ctx context.Context, userID uint) ([]uint, error) {
+	if c == nil {
+		return nil, fmt.Errorf("cache not available")
+	}
+	key := fmt.Sprintf("%s%d", residenceIDsKeyPrefix, userID)
+	var ids []uint
+	if err := c.Get(ctx, key, &ids); err != nil {
+		return nil, err
+	}
+	return ids, nil
+}
+
+// InvalidateResidenceIDsForUsers drops the cache for one or more users.
+// Called from CreateResidence / JoinWithCode (the acting user) and
+// RemoveUser / DeleteResidence (every affected user). Cheap: one Delete call.
+func (c *CacheService) InvalidateResidenceIDsForUsers(ctx context.Context, userIDs ...uint) error {
+	if c == nil || len(userIDs) == 0 {
+		return nil
+	}
+	keys := make([]string, len(userIDs))
+	for i, id := range userIDs {
+		keys[i] = fmt.Sprintf("%s%d", residenceIDsKeyPrefix, id)
+	}
+	return c.Delete(ctx, keys...)
+}
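
The GetCachedResidenceIDsForUser comment says a miss ((nil, redis.Nil)) can be told apart from "user has zero residences" (empty slice). Assuming c.Get JSON-decodes what CacheResidenceIDsForUser stored (its implementation isn't in this diff), the slice only carries that distinction if the writer passed a non-nil empty slice: encoding/json marshals a nil []uint as null, which decodes back to nil. The reliable signal is the error, not whether the slice is nil. This sketch, using a hypothetical roundTrip helper, shows both cases.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// roundTrip (hypothetical) mimics CacheResidenceIDsForUser followed by a
// cached read, assuming the cache stores and decodes the value as JSON.
func roundTrip(ids []uint) (string, []uint, error) {
	data, err := json.Marshal(ids)
	if err != nil {
		return "", nil, err
	}
	var out []uint
	if err := json.Unmarshal(data, &out); err != nil {
		return "", nil, err
	}
	return string(data), out, nil
}

func main() {
	var none []uint // nil slice, e.g. a lookup that matched no residences
	s1, out1, _ := roundTrip(none)
	s2, out2, _ := roundTrip([]uint{})
	fmt.Println(s1, out1 == nil) // null true
	fmt.Println(s2, out2 == nil) // [] false
}
```

If the slice itself ever needs to carry the distinction, normalizing nil to []uint{} before marshaling would do it; that is a suggestion, not something this commit does.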