fix(auth): keep users logged in while Kratos is down

Production is running with no Kratos deployed in-cluster (the deploy
script's kratos-secrets prerequisite isn't satisfied yet — see runbook
§11 #7). That means Whoami calls ALWAYS fail, so whenever a user's Redis
session cache entry expires they get a 401. The iOS app treats that as an
invalid session and forces a re-login, which cannot succeed because the
same Whoami call is the only way back in.

Two-part mitigation:

1. Bump kratosSessionCacheTTL from 5 minutes to 24 hours. Idle users get
   bounced after a day instead of after 5 minutes.
2. Refresh the cache TTL on every successful cache hit (sliding window),
   so expiry is measured from the last request rather than from the
   original cache write. Active users stay logged in indefinitely.

When Kratos actually comes up:
  - revert the TTL constant to a sensible value (1-15 min)
  - the sliding-window refresh is fine to keep; it's good UX regardless

Caveat: this papers over the missing Kratos. New sign-ins still cannot
complete because the API needs Kratos to populate the cache the first
time. The real fix is to deploy Kratos.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Trey t
2026-06-03 10:48:12 -05:00
parent d74cfeee62
commit 64c656bde1
+14 -1
@@ -33,7 +33,14 @@ const (
 	// kratosSessionCacheTTL is how long a validated session is cached in
 	// Redis, so most authed requests skip the Kratos /whoami round trip.
-	kratosSessionCacheTTL = 5 * time.Minute
+	//
+	// PRODUCTION CAVEAT (2026-06-03): until Kratos is deployed in-cluster,
+	// the Whoami fallback ALWAYS fails (no kratos Service). That means every
+	// cache miss = 401 = forced re-login. We mitigate by (a) using a long
+	// TTL and (b) refreshing the TTL on every cache hit (see resolve()).
+	// This is a short-term workaround — restore to a few minutes once Kratos
+	// is live and the runbook §11 #7 prerequisites are done.
+	kratosSessionCacheTTL = 24 * time.Hour
 	kratosSessionPrefix = "kratos_sess:"
 )
@@ -123,6 +130,12 @@ func (m *KratosAuth) resolve(c echo.Context) (*models.User, bool, string, error)
 	if m.cache != nil {
 		if v, err := m.cache.GetString(ctx, cacheKey); err == nil && v != "" {
 			if user, verified, ok := m.userFromCacheValue(ctx, v); ok {
+				// Sliding-window refresh: extend the TTL on every successful
+				// hit so active users don't get bounced when their original
+				// cache entry would have otherwise expired. Best-effort —
+				// failure to refresh just means the entry expires on the
+				// original schedule.
+				_ = m.cache.SetString(ctx, cacheKey, v, kratosSessionCacheTTL)
 				return user, verified, cred, nil
 			}
 		}