diff --git a/docs/plans/2026-01-13-sync-reliability-design.md b/docs/plans/2026-01-13-sync-reliability-design.md new file mode 100644 index 0000000..fc00c73 --- /dev/null +++ b/docs/plans/2026-01-13-sync-reliability-design.md @@ -0,0 +1,304 @@ +# Sync Reliability Design + +**Date:** 2026-01-13 +**Status:** Approved +**Author:** Claude Code + User collaboration + +## Summary + +Improve CloudKit sync reliability with three enhancements: +1. **Pagination** - Fetch all records, not just first ~400 +2. **Cancellation** - Graceful background task expiration with progress saving +3. **Network monitoring** - Auto-sync when connection restored + +## Problem Statement + +The current sync system has a critical gap: CloudKit queries return max 400 records per request, but the code discards the pagination cursor. With ~5,000 games in the dataset, most records may never sync from CloudKit. + +Additionally, background tasks can be terminated by iOS with no graceful handling, and there's no automatic sync when network connectivity is restored. + +## Architecture + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Sync Reliability System │ +├─────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌──────────────────┐ ┌──────────────────────────────────┐ │ +│ │ NetworkMonitor │────►│ BackgroundSyncManager │ │ +│ │ (NWPathMonitor) │ │ - Cancellation token support │ │ +│ │ - Debounced │ │ - Progress persistence │ │ +│ └──────────────────┘ └──────────────────────────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌──────────────────────────────────┐ │ +│ │ CloudKitService │ │ +│ │ - Paginated fetches │ │ +│ │ - CKQueryOperation + cursors │ │ +│ │ - Cancellation-aware │ │ +│ └──────────────────────────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌──────────────────────────────────┐ │ +│ │ SyncState (enhanced) │ │ +│ │ - Per-entity lastSync dates │ │ +│ │ - Partial sync progress │ │ +│ └──────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ +``` + +## Design Details + +### 1. Paginated CloudKit Fetches + +**New Protocol:** + +```swift +/// Token checked between pages to support graceful cancellation +protocol SyncCancellationToken: Sendable { + var isCancelled: Bool { get } +} +``` + +**Paginated Fetch Pattern:** + +Each fetch method (`fetchGamesForSync`, `fetchTeamsForSync`, etc.) refactored to: + +```swift +func fetchGamesForSync( + since lastSync: Date?, + cancellationToken: SyncCancellationToken? = nil +) async throws -> [SyncGame] { + var allResults: [SyncGame] = [] + var cursor: CKQueryOperation.Cursor? = nil + + repeat { + // Check cancellation before each page + if cancellationToken?.isCancelled == true { + break // Return what we have so far + } + + let (pageResults, nextCursor) = try await fetchGamesPage( + since: lastSync, + cursor: cursor + ) + allResults.append(contentsOf: pageResults) + cursor = nextCursor + + } while cursor != nil + + return allResults +} + +private func fetchGamesPage( + since lastSync: Date?, + cursor: CKQueryOperation.Cursor? +) async throws -> ([SyncGame], CKQueryOperation.Cursor?) { + // Uses CKQueryOperation for explicit cursor control + // Returns up to 400 records per page +} +``` + +**Rationale:** +- Cancellation checked between pages (not mid-fetch) +- Returns partial results on cancellation (not thrown away) +- Page size of 400 balances throughput vs. memory + +### 2. Cancellation & Progress Saving + +**Cancellation Token Implementation:** + +```swift +final class BackgroundTaskCancellationToken: SyncCancellationToken, @unchecked Sendable { + private let _isCancelled = OSAllocatedUnfairLock(initialState: false) + + var isCancelled: Bool { _isCancelled.withLock { $0 } } + + func cancel() { _isCancelled.withLock { $0 = true } } +} +``` + +**Enhanced SyncState:** + +Add per-entity sync timestamps for partial progress: + +```swift +// Add to existing SyncState model +var lastStadiumSync: Date? +var lastTeamSync: Date? +var lastGameSync: Date? +var lastLeagueStructureSync: Date? +var lastTeamAliasSync: Date? +var lastStadiumAliasSync: Date? +var lastSportSync: Date? +``` + +**Modified Sync Flow:** + +```swift +func syncAll(context: ModelContext, cancellationToken: SyncCancellationToken? = nil) async throws -> SyncResult { + // Sync each entity type, saving progress after each completes + + let stadiums = try await syncStadiums(context: context, cancellationToken: cancellationToken) + syncState.lastStadiumSync = Date() // Save progress immediately + try context.save() + + if cancellationToken?.isCancelled == true { + return partialResult(stadiums: stadiums) + } + + let teams = try await syncTeams(context: context, cancellationToken: cancellationToken) + syncState.lastTeamSync = Date() + try context.save() + + // ... continue for each entity type +} +``` + +**Key behaviors:** +- Progress saved after each entity type completes +- Cancellation returns partial results instead of throwing +- Next sync uses per-entity timestamps to resume + +### 3. Network Monitor + +**NetworkMonitor Singleton:** + +```swift +import Network + +@MainActor +final class NetworkMonitor: ObservableObject { + static let shared = NetworkMonitor() + + @Published private(set) var isConnected: Bool = true + @Published private(set) var connectionType: ConnectionType = .unknown + + enum ConnectionType { case wifi, cellular, wired, unknown } + + private let monitor = NWPathMonitor() + private let queue = DispatchQueue(label: "NetworkMonitor") + private var debounceTask: Task? + + private init() { + monitor.pathUpdateHandler = { [weak self] path in + Task { @MainActor in + self?.handlePathUpdate(path) + } + } + monitor.start(queue: queue) + } + + private func handlePathUpdate(_ path: NWPath) { + let wasConnected = isConnected + isConnected = (path.status == .satisfied) + connectionType = detectConnectionType(path) + + // Debounce: only trigger sync if we GAINED connection + if !wasConnected && isConnected { + debounceTask?.cancel() + debounceTask = Task { + try? await Task.sleep(for: .seconds(2.5)) + guard !Task.isCancelled else { return } + NotificationCenter.default.post(name: .networkRestored, object: nil) + } + } + } +} + +extension Notification.Name { + static let networkRestored = Notification.Name("NetworkRestored") +} +``` + +**BackgroundSyncManager Integration:** + +```swift +// In BackgroundSyncManager.configure(with:) +NotificationCenter.default.addObserver( + forName: .networkRestored, + object: nil, + queue: .main +) { [weak self] _ in + Task { @MainActor in + await self?.performSyncIfNeeded() + } +} +``` + +**Rationale:** +- 2.5 second debounce handles WiFi<->cellular transitions +- Only triggers on connection *gained* (not every path update) +- Uses NotificationCenter for loose coupling + +## Error Handling + +| Scenario | Behavior | +|----------|----------| +| Cancellation mid-page | Current page completes, then stops. Partial data saved. | +| Network drops during sync | CloudKit throws error, caught by existing handling. Per-entity progress preserved. | +| Multiple sync triggers overlap | Existing `syncInProgress` flag prevents duplicate syncs. | +| App killed during background task | Per-entity saves ensure most progress kept. | +| Network flapping | 2.5s debounce collapses rapid transitions into single sync. | +| First launch offline | Bootstrap from bundled JSON. NetworkMonitor triggers sync when connected. | + +**Cancellation vs Error distinction:** + +```swift +struct SyncResult { + // ... existing fields + let wasCancelled: Bool // NEW: distinguish clean cancel from error +} +``` + +When `wasCancelled == true`: +- Don't increment `consecutiveFailures` +- Don't set `lastSyncError` +- Do update per-entity timestamps + +## Testing Strategy + +### Unit Tests (CloudKitService) + +| Test | Validates | +|------|-----------| +| `test_pagination_fetchesAllPages` | Mock returns 3 pages of 400, verifies all 1200 returned | +| `test_pagination_stopsOnCancellation` | Cancel after page 1, verify only ~400 returned | +| `test_pagination_handlesEmptyResult` | No records returns empty array, not error | +| `test_pagination_handlesSinglePage` | <400 records returns without extra fetch | + +### Unit Tests (Cancellation) + +| Test | Validates | +|------|-----------| +| `test_cancellationToken_threadSafe` | Concurrent read/write doesn't crash | +| `test_syncAll_savesProgressPerEntity` | Mock cancel after teams, verify stadiums saved but games not | +| `test_syncAll_doesNotIncrementFailuresOnCancel` | `consecutiveFailures` unchanged after clean cancel | + +### Unit Tests (NetworkMonitor) + +| Test | Validates | +|------|-----------| +| `test_debounce_collapsesRapidChanges` | 5 changes in 1s = 1 notification | +| `test_onlyNotifiesOnGainedConnection` | connected->disconnected doesn't trigger sync | +| `test_cancelsDebounceIfDisconnects` | connect, disconnect before 2.5s = no notification | + +### Integration Test + +| Test | Validates | +|------|-----------| +| `test_fullSyncFlow_withMockCloudKit` | Bootstrap -> sync -> cancel -> resume picks up correctly | + +## Files to Modify + +| File | Changes | +|------|---------| +| `CloudKitService.swift` | Add pagination to all fetch methods, accept cancellation token | +| `CanonicalSyncService.swift` | Per-entity progress saving, pass cancellation token, handle partial results | +| `BackgroundSyncManager.swift` | Create/pass cancellation token, subscribe to network notifications | +| `CanonicalModels.swift` | Add per-entity sync timestamps to SyncState | +| `NetworkMonitor.swift` | **NEW FILE** - NWPathMonitor wrapper with debouncing | + +## Migration + +No data migration required. New `SyncState` fields will be `nil` initially, causing a full sync on first run after update (existing behavior).