deployment: extend api startup probe budget for direct-endpoint migrations

The migration-pooler fix (commit 30966c6) routes AutoMigrate through Neon's direct compute endpoint to keep the session-scoped advisory lock alive. As a result, each DDL statement pays a fresh transatlantic round trip instead of riding warm pooler connections, so AutoMigrate's runtime climbs from ~90s to 4-6 min on the first pod of a cold boot.

With the previous 240s grace, the startup probe was killing pods mid-migration. This commit raises failureThreshold from 48 to 120, for 120 × 5s = 600s of grace.

Subsequent pods inherit the schema and finish their no-op migrate in seconds, so the larger budget only matters for the first pod's migration window after a deploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 22:05:58 -05:00
parent 30966c6f5e
commit a94744061e
1 changed file with 8 additions and 5 deletions
@@ -122,11 +122,14 @@ spec:
              path: /api/health/
              port: 8000
            # MigrateWithLock in cmd/api/main.go runs pg_advisory_lock on
-            # every startup. On a cold boot with 3 replicas, the first does
-            # AutoMigrate (~90s) and the others wait on the lock, so real
-            # startup runs 90–240s. 48 × 5s = 240s grace absorbs it without
-            # healthcheck killing a still-starting replica.
-            failureThreshold: 48
+            # every startup against Neon's *direct* (non-pooler) endpoint,
+            # because session-scoped locks don't survive PgBouncer
+            # transaction-mode. AutoMigrate over a transatlantic direct
+            # link runs many DDLs serially × ~110ms RTT each ≈ 4–6 min on
+            # the first pod; subsequent pods see no-op migrate after
+            # acquiring the same lock. 120 × 5s = 600s grace absorbs it
+            # without the healthcheck killing a still-migrating replica.
+            failureThreshold: 120
            periodSeconds: 5
          readinessProbe:
            httpGet: