SportsTime Data Pipeline
Python scripts that scrape, canonicalize, and sync sports schedule data to CloudKit for the SportsTime iOS app.
Overview
This pipeline ensures every game correctly links to its home/away teams and stadium with complete, accurate data across MLB, NBA, NHL, NFL, MLS, WNBA, and NWSL.
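One way to picture what "correctly links" could mean is a referential check over canonical records. The sketch below is illustrative only: the field names (`home_team_id`, `away_team_id`, `stadium_id`), the sample IDs, and the helper `find_broken_links` are assumptions, not the pipeline's actual schema or code.

```python
def find_broken_links(games, team_ids, stadium_ids):
    """Return (game_id, field, value) for every reference that does not resolve.

    Field names are hypothetical; adjust them to the real canonical schema.
    """
    problems = []
    for game in games:
        for field, valid_ids in (
            ("home_team_id", team_ids),
            ("away_team_id", team_ids),
            ("stadium_id", stadium_ids),
        ):
            value = game.get(field)
            if value not in valid_ids:
                problems.append((game.get("id"), field, value))
    return problems


# Hypothetical sample data: the second game has an unknown away team and no stadium.
games = [
    {"id": "g1", "home_team_id": "nba_bos", "away_team_id": "nba_lal", "stadium_id": "td_garden"},
    {"id": "g2", "home_team_id": "nba_bos", "away_team_id": "nba_xyz", "stadium_id": None},
]
broken = find_broken_links(games, {"nba_bos", "nba_lal"}, {"td_garden"})
for game_id, field, value in broken:
    print(f"{game_id}: {field} -> {value!r} does not resolve")
```

An empty result would mean every game resolves to known teams and a known stadium.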
Quick Start
# Install dependencies
pip install -r requirements.txt
# Scrape all sports for current season
python scrape_schedules.py --sport all --season 2026
# Run full pipeline (scrape + canonicalize)
python run_pipeline.py --sport all
# Validate data integrity
python cloudkit_import.py --validate
# Sync to CloudKit
python cloudkit_import.py --upload
Architecture
┌─────────────────────────────────────────────────────────────────────┐
│                            SPORT MODULES                            │
│      mlb.py  nba.py  nhl.py  nfl.py  mls.py  wnba.py  nwsl.py       │
└────────────────────────────┬────────────────────────────────────────┘
                             │ scrape
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                              RAW DATA                               │
│         data/games.csv  data/stadiums.csv  data/games.json          │
└────────────────────────────┬────────────────────────────────────────┘
                             │ canonicalize
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                           CANONICAL JSON                            │
│       data/stadiums_canonical.json  data/teams_canonical.json       │
│                data/games/*.json (per-sport/season)                 │
└────────────────────────────┬────────────────────────────────────────┘
                             │ sync
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                CloudKit (iCloud.com.sportstime.app)                 │
│                Bundled JSON (SportsTime/Resources/)                 │
└─────────────────────────────────────────────────────────────────────┘
Module Reference
| Script | Purpose |
|---|---|
| core.py | Shared utilities: data classes, rate limiting, fallback system |
| scrape_schedules.py | Main orchestrator for scraping schedules from multiple sources |
| run_pipeline.py | Full pipeline runner (scrape + canonicalize in one command) |
| canonicalize_stadiums.py | Stadium name resolution with alias support |
| canonicalize_teams.py | Team name resolution with alias support |
| canonicalize_games.py | Game linking (game → team → stadium relationships) |
| cloudkit_import.py | CloudKit sync with full CRUD, validation, and diff reporting |
| validate_canonical.py | Data validation with completeness metrics |
| generate_canonical_data.py | Generate bundled JSON for iOS app bootstrap |
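The table mentions a fallback system in core.py without describing it. A fallback across multiple sources is often a loop that tries each source in order and keeps the first usable result; the sketch below shows that general idea only. The function `scrape_with_fallback` and the two stand-in sources are hypothetical and are not taken from core.py.

```python
def scrape_with_fallback(sources):
    """Try each (name, fetch) pair in order and return the first non-empty result.

    Hypothetical helper: returns (source_name, games, errors_so_far).
    """
    errors = []
    for name, fetch in sources:
        try:
            games = fetch()
        except Exception as exc:  # a real scraper would catch narrower errors
            errors.append((name, str(exc)))
            continue
        if games:
            return name, games, errors
        errors.append((name, "no games returned"))
    raise RuntimeError(f"all sources failed: {errors}")


# Stand-in sources for illustration: the first fails, the second succeeds.
def primary_source():
    raise ConnectionError("timed out")


def backup_source():
    return [{"id": "g1"}]


used, games, errors = scrape_with_fallback(
    [("primary", primary_source), ("backup", backup_source)]
)
print(used, len(games), errors)
```

Returning the collected errors alongside the result keeps a record of which sources were skipped, which can be useful when reporting on a run.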
Sport Modules
Each sport has its own module with hardcoded stadium data and sport-specific scraping logic:
| Module | Sport | Stadiums | Notes |
|---|---|---|---|
| mlb.py | MLB | 30 ballparks | Baseball-Reference scraper |
| nba.py | NBA | 30 arenas | Basketball-Reference scraper |
| nhl.py | NHL | 32 arenas | Hockey-Reference scraper |
| nfl.py | NFL | 30 stadiums | Cross-calendar season (2025-26) |
| mls.py | MLS | 30 stadiums | Soccer-specific capacities |
| wnba.py | WNBA | 13 arenas | Shares venues with NBA |
| nwsl.py | NWSL | 13 stadiums | Shares some MLS venues |
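Season labels appear in two shapes in this README: a single year (`2026`) and a cross-calendar label (`2025-26`). As a rough sketch of how such labels could be normalized, the hypothetical helper below maps either shape to a `(start_year, end_year)` pair. How the real scripts parse `--season` is not documented here, so treat this as an assumption.

```python
import re


def parse_season(label):
    """Return (start_year, end_year) for '2026' or '2025-26' style labels."""
    match = re.fullmatch(r"(\d{4})(?:-(\d{2}))?", label)
    if not match:
        raise ValueError(f"unrecognized season label: {label!r}")
    start = int(match.group(1))
    if match.group(2) is None:
        # Single-year season, e.g. MLB 2026
        return start, start
    # Rebuild the end year from the start year's century, e.g. 2025-26 -> 2026
    end = start // 100 * 100 + int(match.group(2))
    if end < start:
        end += 100  # century rollover, e.g. 1999-00 -> 2000
    if end != start + 1:
        raise ValueError(f"season must span consecutive years: {label!r}")
    return start, end


print(parse_season("2026"))
print(parse_season("2025-26"))
```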
Data Files
Output Directory: data/
| File | Contents |
|---|---|
| games.csv | Raw scraped game data (all sports) |
| games.json | Raw scraped games as JSON |
| stadiums.json | Raw stadium data |
| stadiums_canonical.json | Canonical stadiums with resolved aliases |
| teams_canonical.json | Canonical teams with resolved aliases |
| stadium_aliases.json | Stadium name → canonical ID mapping |
| games/{sport}_{season}.json | Per-sport canonical games |
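The `games/{sport}_{season}.json` naming pattern can be illustrated with a short sketch that groups games and writes one file per sport and season. The record fields (`sport`, `season`) and the helper `write_per_sport_files` are assumptions for illustration; the actual canonicalization scripts may organize this differently.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def write_per_sport_files(games, data_dir):
    """Group games by (sport, season) and write games/{sport}_{season}.json files."""
    grouped = defaultdict(list)
    for game in games:
        grouped[(game["sport"], game["season"])].append(game)

    games_dir = Path(data_dir) / "games"
    games_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for (sport, season), items in sorted(grouped.items()):
        path = games_dir / f"{sport}_{season}.json"
        path.write_text(json.dumps(items, indent=2), encoding="utf-8")
        written.append(path.name)
    return written


# Hypothetical records written to a throwaway directory
sample = [
    {"id": "g1", "sport": "nba", "season": "2025-26"},
    {"id": "g2", "sport": "mlb", "season": "2026"},
    {"id": "g3", "sport": "nba", "season": "2025-26"},
]
with tempfile.TemporaryDirectory() as tmp:
    names = write_per_sport_files(sample, tmp)
print(names)
```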
Alias Files
- data/canonical/stadiums.json - Master stadium database
- data/canonical/teams.json - Master team database
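Alias resolution can be thought of as a normalized lookup from any known name to a canonical ID. The sketch below assumes a flat `{alias: canonical_id}` mapping and made-up IDs; the real layout of `stadium_aliases.json` and the canonical databases is not shown in this README, so both the structure and the helper names are hypothetical.

```python
def normalize(name):
    """Lowercase and collapse whitespace so minor formatting differences still match."""
    return " ".join(name.lower().split())


def build_alias_index(alias_map):
    """Turn {alias: canonical_id} into a lookup keyed by the normalized alias."""
    return {normalize(alias): canonical_id for alias, canonical_id in alias_map.items()}


def resolve(name, index):
    """Return the canonical ID for a scraped name, or None if it is unknown."""
    return index.get(normalize(name))


# Hypothetical alias data: an old and a current name for the same arena
alias_map = {
    "Staples Center": "stadium_crypto_com_arena",
    "Crypto.com Arena": "stadium_crypto_com_arena",
}
index = build_alias_index(alias_map)
print(resolve("  staples   CENTER ", index))
print(resolve("Unknown Park", index))
```

Returning `None` for unknown names, rather than guessing, leaves it to a validation step to report anything that failed to resolve.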
Pipeline Commands
Scraping
# Single sport
python scrape_schedules.py --sport nba --season 2025-26
# All sports
python scrape_schedules.py --sport all --season 2026
# With specific output directory
python scrape_schedules.py --sport mlb --season 2025 --output ./data
Canonicalization
# Run canonicalization pipeline
python run_canonicalization_pipeline.py --sport all
CloudKit Operations
# Validate data without uploading
python cloudkit_import.py --validate
# Show what would be uploaded (dry run)
python cloudkit_import.py --upload --dry-run
# Upload to CloudKit
python cloudkit_import.py --upload
# List orphan records (requires CloudKit connection)
python cloudkit_import.py --validate --list-orphans
# Delete orphan records
python cloudkit_import.py --delete-orphans
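Conceptually, a dry run or orphan listing comes down to comparing local canonical records with what the remote store holds. The sketch below assumes an "orphan" means a record present remotely but absent from local data, and it uses plain dictionaries rather than any CloudKit API. The helper `diff_records` and the sample IDs are hypothetical and do not reflect how `cloudkit_import.py` is implemented.

```python
def diff_records(local, remote):
    """Compare {record_id: fields} mappings and classify the differences."""
    to_create = sorted(local.keys() - remote.keys())
    orphans = sorted(remote.keys() - local.keys())
    to_update = sorted(
        record_id
        for record_id in local.keys() & remote.keys()
        if local[record_id] != remote[record_id]
    )
    return {"create": to_create, "update": to_update, "orphans": orphans}


# Hypothetical records: one new, one changed, one only present remotely
local = {"game_1": {"home": "nba_bos"}, "game_2": {"home": "nba_lal"}}
remote = {"game_2": {"home": "nba_lac"}, "game_9": {"home": "nba_nyk"}}
plan = diff_records(local, remote)
print(plan)
```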
Related Documentation
- DATA_SOURCES.md - Data source URLs, rate limits, validation strategy
- CLOUDKIT_SETUP.md - CloudKit container setup, record types, security roles