Cut /api/tasks/ p99 from ~2500ms toward ~150-300ms

Stack of optimizations against the same Hetzner→Neon transatlantic link. The trace revealed every visible ms was network/proxy overhead: DB execution itself is sub-millisecond per query (verified via EXPLAIN ANALYZE: index scans on every hot path).

Connection layer:
- DB_HOST → Neon pooler endpoint (-pooler suffix). PgBouncer transaction-mode keeps backend Postgres connections warm so we no longer pay the ~110ms Postgres-startup RTT on cold queries.
- GORM pool tuned: MaxIdleConns 10→20, MaxLifetime 600s→1800s, MaxIdleTime added (default 0 = never close idle).
- Eager pool warm-up at boot via parallel pings, so the first user request no longer pays the ~440ms TCP+TLS+startup handshake.
- Redis maxmemory-policy noeviction → allkeys-lru. Cache writes will evict cold keys instead of erroring at the 256MB limit.

Auth layer:
- TokenCacheTTL 5min → 1 hour (Redis token cache).
- UserCacheTTL 30s → 5min (in-memory User cache, per pod).
- UserCache gains a 5,000-entry LRU cap so a flood of unique users can't blow up pod RSS. ~5MB worst-case per pod.
- Token + user lookup collapsed from 2 GORM Preload queries into a single INNER JOIN. Saves 1 RTT per cold-cache request.
- Auth middleware's m.db.* now use db.WithContext(ctx) so the SQL spans nest under the parent HTTP request in Jaeger.

Service layer:
- TaskService.ListTasks: replaced two-step FindResidenceIDsByUser → GetKanbanDataForMultipleResidences with a single GetKanbanDataForUser that uses a Postgres subquery for residence-access. One round-trip instead of two.
- New CacheService residence-IDs cache: "residence_ids_user:<id>" with 5-min TTL. Wired into Task/Residence/Contractor/Document services for the four hot read paths that need this list.
- Cache invalidation on every relevant mutation: CreateResidence, DeleteResidence, JoinWithCode, RemoveUser. DeleteResidence invalidates every member of the residence, not just the owner.

What this stacks up to (Hetzner→Neon, before US migration):

  Path                                  Before    After (target)
  Cache-warm authed read                ~800ms    ~100-200ms
  Cache-cold authed read (1st in 1hr)   ~2500ms   ~500-700ms
  First request after deploy            ~2500ms   ~700-900ms

The endgame US-region migration on top of this gets us to ~30-50ms warm-cache, but we're shippable at ~150ms warm right now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -71,16 +71,30 @@ func Connect(cfg *config.DatabaseConfig, debug bool) (*gorm.DB, error) {
 		return nil, fmt.Errorf("failed to get underlying sql.DB: %w", err)
 	}
 
-	// Configure connection pool
+	// Configure connection pool. The Neon pooler endpoint keeps backend
+	// connections warm, so we keep our client-side pool warm too — that
+	// eliminates the ~440ms TCP+TLS+startup handshake on the first query
+	// after a cold pod / idle period.
 	sqlDB.SetMaxOpenConns(cfg.MaxOpenConns)
 	sqlDB.SetMaxIdleConns(cfg.MaxIdleConns)
 	sqlDB.SetConnMaxLifetime(cfg.MaxLifetime)
+	if cfg.MaxIdleTime > 0 {
+		sqlDB.SetConnMaxIdleTime(cfg.MaxIdleTime)
+	}
+	// MaxIdleTime=0 means "never close idle" — the pool fills up to
+	// MaxIdleConns and they stay alive until MaxLifetime expires.
 
 	// Test connection
 	if err := sqlDB.Ping(); err != nil {
 		return nil, fmt.Errorf("failed to ping database: %w", err)
 	}
 
+	// Eagerly warm the connection pool to MaxIdleConns. Without this, the
+	// first N user requests each pay the full handshake (~440ms over a
+	// transatlantic link). Pings are issued in parallel so warm-up is
+	// bounded by handshake time, not handshake-time × N.
+	warmUpPool(sqlDB, cfg.MaxIdleConns)
+
 	log.Info().
 		Str("host", cfg.Host).
 		Int("port", cfg.Port).
@@ -106,6 +120,35 @@ func Connect(cfg *config.DatabaseConfig, debug bool) (*gorm.DB, error) {
 	return db, nil
 }
 
+// warmUpPool issues N parallel pings so the pool fills with established
+// connections before the first user request lands. Failures are logged but
+// not fatal — the pool will fill on demand under traffic if pre-warm fails.
+//
+// On a transatlantic link to Neon (~110ms RTT, ~440ms cold handshake), this
+// turns "first request pays the cold handshake" into "first request finds a
+// warm pool" — at the cost of ~440ms during pod startup.
+func warmUpPool(sqlDB interface {
+	PingContext(context.Context) error
+}, n int) {
+	if n <= 0 {
+		return
+	}
+	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+	defer cancel()
+
+	done := make(chan error, n)
+	for i := 0; i < n; i++ {
+		go func() { done <- sqlDB.PingContext(ctx) }()
+	}
+	successes := 0
+	for i := 0; i < n; i++ {
+		if err := <-done; err == nil {
+			successes++
+		}
+	}
+	log.Info().Int("requested", n).Int("warmed", successes).Msg("DB pool warm-up complete")
+}
+
 // Get returns the database instance
 func Get() *gorm.DB {
 	return db