# SportsTime Data Pipeline

Python scripts that scrape, canonicalize, and sync sports schedule data to CloudKit for the SportsTime iOS app.

## Overview

This pipeline ensures every game correctly links to its home/away teams and stadium with complete, accurate data across MLB, NBA, NHL, NFL, MLS, WNBA, and NWSL.

## Quick Start

```bash
# Install dependencies
pip install -r requirements.txt

# Scrape all sports for current season
python scrape_schedules.py --sport all --season 2026

# Run full pipeline (scrape + canonicalize)
python run_pipeline.py --sport all

# Validate data integrity
python cloudkit_import.py --validate

# Sync to CloudKit
python cloudkit_import.py --upload
```

## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                            SPORT MODULES                            │
│   mlb.py   nba.py   nhl.py   nfl.py   mls.py   wnba.py   nwsl.py    │
└────────────────────────────┬────────────────────────────────────────┘
                             │ scrape
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                              RAW DATA                               │
│       data/games.csv    data/stadiums.csv    data/games.json        │
└────────────────────────────┬────────────────────────────────────────┘
                             │ canonicalize
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                           CANONICAL JSON                            │
│      data/stadiums_canonical.json    data/teams_canonical.json      │
│                data/games/*.json (per-sport/season)                 │
└────────────────────────────┬────────────────────────────────────────┘
                             │ sync
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                CloudKit (iCloud.com.sportstime.app)                 │
│                Bundled JSON (SportsTime/Resources/)                 │
└─────────────────────────────────────────────────────────────────────┘
```

## Module Reference

| Script | Purpose |
|--------|---------|
| `core.py` | Shared utilities: data classes, rate limiting, fallback system |
| `scrape_schedules.py` | Main orchestrator for scraping schedules from multiple sources |
| `run_pipeline.py` | Full pipeline runner (scrape + canonicalize in one command) |
| `canonicalize_stadiums.py` | Stadium name resolution with alias support |
| `canonicalize_teams.py` | Team name resolution with alias support |
| `canonicalize_games.py` | Game linking (game → team → stadium relationships) |
| `cloudkit_import.py` | CloudKit sync with full CRUD, validation, and diff reporting |
| `validate_canonical.py` | Data validation with completeness metrics |
| `generate_canonical_data.py` | Generate bundled JSON for iOS app bootstrap |

## Sport Modules

Each sport has its own module with hardcoded stadium data and sport-specific scraping logic:

| Module | Sport | Stadiums | Notes |
|--------|-------|----------|-------|
| `mlb.py` | MLB | 30 ballparks | Baseball-Reference scraper |
| `nba.py` | NBA | 30 arenas | Basketball-Reference scraper |
| `nhl.py` | NHL | 32 arenas | Hockey-Reference scraper |
| `nfl.py` | NFL | 30 stadiums | Cross-calendar season (2025-26) |
| `mls.py` | MLS | 30 stadiums | Soccer-specific capacities |
| `wnba.py` | WNBA | 13 arenas | Shares venues with NBA |
| `nwsl.py` | NWSL | 13 stadiums | Shares some MLS venues |

## Data Files

### Output Directory: `data/`

| File | Contents |
|------|----------|
| `games.csv` | Raw scraped game data (all sports) |
| `games.json` | Raw scraped games as JSON |
| `stadiums.json` | Raw stadium data |
| `stadiums_canonical.json` | Canonical stadiums with resolved aliases |
| `teams_canonical.json` | Canonical teams with resolved aliases |
| `stadium_aliases.json` | Stadium name → canonical ID mapping |
| `games/{sport}_{season}.json` | Per-sport canonical games |

### Alias Files

- `data/canonical/stadiums.json` - Master stadium database
- `data/canonical/teams.json` - Master team database

## Pipeline Commands

### Scraping

```bash
# Single sport
python scrape_schedules.py --sport nba --season 2025-26

# All sports
python scrape_schedules.py --sport all --season 2026

# With specific output directory
python scrape_schedules.py --sport mlb --season 2025 --output ./data
```

### Canonicalization

```bash
# Run canonicalization pipeline
python run_canonicalization_pipeline.py --sport all
```

### CloudKit Operations

```bash
# Validate data without uploading
python cloudkit_import.py --validate

# Show what would be uploaded (dry run)
python cloudkit_import.py --upload --dry-run

# Upload to CloudKit
python cloudkit_import.py --upload

# List orphan records (requires CloudKit connection)
python cloudkit_import.py --validate --list-orphans

# Delete orphan records
python cloudkit_import.py --delete-orphans
```

## Related Documentation

- [DATA_SOURCES.md](DATA_SOURCES.md) - Data source URLs, rate limits, validation strategy
- [CLOUDKIT_SETUP.md](CLOUDKIT_SETUP.md) - CloudKit container setup, record types, security roles
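
## Example: Alias Resolution and Game Linking (Illustrative)

The canonicalization scripts are described above as resolving stadium and team names through aliases and then linking each game to its teams and stadium. The sketch below shows one way that idea could work; it is not taken from the actual scripts. The function names (`normalize`, `link_game`), the `LinkedGame` class, the in-memory tables, and the ID strings are all hypothetical, and the real alias data is stored in the JSON files listed under Data Files.

```python
from dataclasses import dataclass
from typing import Optional


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so lookups ignore formatting noise."""
    return " ".join(name.lower().split())


# Hypothetical alias table: normalized stadium name -> canonical stadium ID.
# The ID format is invented for this example.
STADIUM_ALIASES = {
    normalize("Fenway Park"): "stadium_mlb_fenway_park",
    normalize("Fenway"): "stadium_mlb_fenway_park",
}

# Hypothetical team table: normalized team name -> canonical team ID.
TEAM_IDS = {
    normalize("Boston Red Sox"): "team_mlb_bos",
    normalize("New York Yankees"): "team_mlb_nyy",
}


@dataclass
class LinkedGame:
    """A game whose names have been resolved to canonical IDs (None if unresolved)."""
    home_team_id: Optional[str]
    away_team_id: Optional[str]
    stadium_id: Optional[str]

    @property
    def is_complete(self) -> bool:
        # A game counts as fully linked only when all three references resolved.
        return None not in (self.home_team_id, self.away_team_id, self.stadium_id)


def link_game(raw: dict) -> LinkedGame:
    """Resolve the raw home, away, and stadium names of one scraped game."""
    return LinkedGame(
        home_team_id=TEAM_IDS.get(normalize(raw["home"])),
        away_team_id=TEAM_IDS.get(normalize(raw["away"])),
        stadium_id=STADIUM_ALIASES.get(normalize(raw["stadium"])),
    )


# Sample rows are made up; the second has an away team missing from the table.
linked = link_game(
    {"home": "Boston Red Sox", "away": "New York Yankees", "stadium": "  FENWAY  park "}
)
unlinked = link_game(
    {"home": "Boston Red Sox", "away": "Unknown FC", "stadium": "Fenway"}
)
print(linked.is_complete, unlinked.is_complete)
```

Returning `None` for an unresolved name, rather than raising, lets a validation pass collect every incomplete game in one run, which is roughly the kind of check a `--validate` step could report on.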