# SportsTime Data Pipeline

## What This Is

A Python data pipeline that scrapes, canonicalizes, and syncs sports schedule data to CloudKit for the SportsTime iOS app. The pipeline ensures every game correctly links to its home/away teams and stadium with complete, accurate data across MLB, NBA, NHL, and NFL.

## Core Value

Every game must correctly link to its teams and stadium: a game at the wrong venue or with broken team links ruins trip planning.

## Requirements

### Validated

- ✓ Basic schedule scraping for MLB, NBA, NHL, NFL (existing)
- ✓ Canonical data models (stadiums, teams, games) (existing)
- ✓ CloudKit import capability (existing)
- ✓ Bundled JSON generation for offline-first (existing)

### Active

- [ ] Split scripts by sport (MLB, NBA, NHL, NFL as separate modules)
- [ ] Complete stadium database with correct coordinates and names
- [ ] Stadium alias system for name variations across sources
- [ ] Correct game→team→stadium canonical linking for all sports
- [ ] Full CRUD CloudKit management (create, read, update, delete)
- [ ] Validation reports showing counts, gaps, and orphan records
- [ ] Team alias system for name variations across sources

### Out of Scope

- Real-time scores: this is schedule data, not live game tracking
- Adding new sports (MLS, WNBA, etc.): stabilize current 4 first
- iOS app changes: this is purely backend/script work

## Context

**Current State:**

- Data quality issues exist across all sports (wrong stadiums, missing games, broken team links)
- Stadium problems include: missing venues, wrong coordinates, name mismatches between sources
- Single large script files that are hard to debug and maintain
- Existing CloudKit import works but lacks verification and CRUD operations

**Existing Infrastructure:**

- Python 3 with requests, beautifulsoup4, pandas, lxml
- CloudKit server-to-server auth via cryptography package
- Bundled JSON in `SportsTime/Resources/` for offline bootstrap
- Data sources: Basketball-Reference, Baseball-Reference, Hockey-Reference, official APIs

**iOS App Dependency:**

- `AppDataProvider.shared` is single source of truth
- SwiftData models: `CanonicalStadium`, `CanonicalTeam`, `CanonicalGame`
- Domain models expect correct relationships via canonical IDs

## Constraints

- **Tech Stack**: Must remain Python (existing tooling, team familiarity)
- **Data Sources**: Free/public APIs and sites only (no paid subscriptions)
- **CloudKit**: Must use existing container (`iCloud.com.sportstime.app`)
- **Compatibility**: Output must match existing Swift model expectations

## Key Decisions

| Decision | Rationale | Outcome |
|----------|-----------|---------|
| Split by sport, not function | User preference for organization | Pending |
| Validation reports over automated tests | Faster feedback, easier debugging | Pending |
| Full CRUD over upload-only | Enable data corrections without full rebuild | Pending |

---

*Last updated: 2026-01-09 after initialization*
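
The stadium alias system and the validation reports listed under the active requirements could be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: `normalize_name`, `AliasIndex`, `build_validation_report`, the record field names, and every ID shown are hypothetical, and the real canonical models may use a different schema.

```python
import re
from dataclasses import dataclass, field


def normalize_name(name: str) -> str:
    """Lowercase and collapse punctuation/whitespace so name variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


@dataclass
class AliasIndex:
    """Hypothetical helper: maps normalized name variants to one canonical ID."""
    _lookup: dict = field(default_factory=dict)

    def add(self, canonical_id: str, *names: str) -> None:
        for name in names:
            self._lookup[normalize_name(name)] = canonical_id

    def resolve(self, name: str):
        # Returns None when no alias matches, so gaps surface in the report.
        return self._lookup.get(normalize_name(name))


def build_validation_report(games, team_ids, stadium_ids):
    """Count games and list those whose team or stadium links do not resolve."""
    checks = (
        ("home_team_id", team_ids),
        ("away_team_id", team_ids),
        ("stadium_id", stadium_ids),
    )
    orphans = []
    for game in games:
        missing = [key for key, valid in checks if game.get(key) not in valid]
        if missing:
            orphans.append({"game_id": game["id"], "missing": missing})
    return {
        "game_count": len(games),
        "orphan_count": len(orphans),
        "orphans": orphans,
    }


# Illustrative records only; the IDs below are made up for this sketch.
stadiums = AliasIndex()
stadiums.add("stadium_mlb_oracle_park", "Oracle Park", "AT&T Park")

team_ids = {"team_mlb_sf", "team_mlb_lad"}
games = [
    {"id": "g1", "home_team_id": "team_mlb_sf", "away_team_id": "team_mlb_lad",
     "stadium_id": stadiums.resolve("AT&T Park")},
    {"id": "g2", "home_team_id": "team_mlb_sf", "away_team_id": "team_mlb_lad",
     "stadium_id": stadiums.resolve("Unknown Field")},
]

report = build_validation_report(games, team_ids, {"stadium_mlb_oracle_park"})
print(report)
```

Resolving aliases before linking means an unmatched venue name becomes an explicit `None` rather than a silently wrong stadium, and the report then lists that game as an orphan. The same `AliasIndex` shape would presumably work for team name variations as well.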