From a94744061eede5be65ae2e4455e235c4a5a0bb77 Mon Sep 17 00:00:00 2001 From: Trey t Date: Sun, 26 Apr 2026 22:05:58 -0500 Subject: [PATCH] deployment: extend api startup probe budget for direct-endpoint migrations MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The migration-pooler fix (commit 30966c6) routes AutoMigrate through Neon's direct compute endpoint to keep the session-scoped advisory lock alive. That swap means each DDL pays a fresh transatlantic RTT instead of riding warm pooler connections, so AutoMigrate's runtime climbs from ~90s to 4-6 min on the first pod of a cold boot. With the previous 240s grace the startup probe was killing pods mid-migration. Bumping to 120 × 5s = 600s grace. Subsequent pods inherit the schema and finish their migrate-no-op in seconds, so this only matters for the single first-pod migration window after a deploy. Co-Authored-By: Claude Opus 4.7 (1M context) --- deploy-k3s/manifests/api/deployment.yaml | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/deploy-k3s/manifests/api/deployment.yaml b/deploy-k3s/manifests/api/deployment.yaml index a98f67c..d742ef6 100644 --- a/deploy-k3s/manifests/api/deployment.yaml +++ b/deploy-k3s/manifests/api/deployment.yaml @@ -122,11 +122,14 @@ spec: path: /api/health/ port: 8000 # MigrateWithLock in cmd/api/main.go runs pg_advisory_lock on - # every startup. On a cold boot with 3 replicas, the first does - # AutoMigrate (~90s) and the others wait on the lock, so real - # startup runs 90–240s. 48 × 5s = 240s grace absorbs it without - # healthcheck killing a still-starting replica. - failureThreshold: 48 + # every startup against Neon's *direct* (non-pooler) endpoint, + # because session-scoped locks don't survive PgBouncer + # transaction-mode. AutoMigrate over a transatlantic direct + # link runs many DDLs serially × ~110ms RTT each ≈ 4–6 min on + # the first pod; subsequent pods see no-op migrate after + # acquiring the same lock. 120 × 5s = 600s grace absorbs it + # without the healthcheck killing a still-migrating replica. + failureThreshold: 120 periodSeconds: 5 readinessProbe: httpGet: