From d9f446bccb2bd66126eaf21a7aff31fce3706dcc Mon Sep 17 00:00:00 2001
From: Trey t
Date: Sat, 10 Jan 2026 10:42:47 -0600
Subject: [PATCH] docs(07-01): create Scripts/README.md with pipeline documentation

- Overview and quick start commands
- ASCII architecture diagram showing data flow
- Module reference table for all Python scripts
- Sport modules table with stadium counts
- Data files and alias file documentation
- Pipeline commands for scraping, canonicalization, CloudKit

Co-Authored-By: Claude Opus 4.5
---
 Scripts/README.md | 147 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 147 insertions(+)
 create mode 100644 Scripts/README.md

diff --git a/Scripts/README.md b/Scripts/README.md
new file mode 100644
index 0000000..dc40108
--- /dev/null
+++ b/Scripts/README.md
@@ -0,0 +1,147 @@
+# SportsTime Data Pipeline
+
+Python scripts that scrape, canonicalize, and sync sports schedule data to CloudKit for the SportsTime iOS app.
+
+## Overview
+
+This pipeline ensures every game correctly links to its home/away teams and stadium with complete, accurate data across MLB, NBA, NHL, NFL, MLS, WNBA, and NWSL.
+
+## Quick Start
+
+```bash
+# Install dependencies
+pip install -r requirements.txt
+
+# Scrape all sports for current season
+python scrape_schedules.py --sport all --season 2026
+
+# Run full pipeline (scrape + canonicalize)
+python run_pipeline.py --sport all
+
+# Validate data integrity
+python cloudkit_import.py --validate
+
+# Sync to CloudKit
+python cloudkit_import.py --upload
+```
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│                            SPORT MODULES                            │
+│  mlb.py  nba.py  nhl.py  nfl.py  mls.py  wnba.py  nwsl.py           │
+└────────────────────────────┬────────────────────────────────────────┘
+                             │ scrape
+                             ▼
+┌─────────────────────────────────────────────────────────────────────┐
+│                              RAW DATA                               │
+│  data/games.csv   data/stadiums.csv   data/games.json               │
+└────────────────────────────┬────────────────────────────────────────┘
+                             │ canonicalize
+                             ▼
+┌─────────────────────────────────────────────────────────────────────┐
+│                           CANONICAL JSON                            │
+│  data/stadiums_canonical.json   data/teams_canonical.json           │
+│  data/games/*.json (per-sport/season)                               │
+└────────────────────────────┬────────────────────────────────────────┘
+                             │ sync
+                             ▼
+┌─────────────────────────────────────────────────────────────────────┐
+│  CloudKit (iCloud.com.sportstime.app)                               │
+│  Bundled JSON (SportsTime/Resources/)                               │
+└─────────────────────────────────────────────────────────────────────┘
+```
+
+## Module Reference
+
+| Script | Purpose |
+|--------|---------|
+| `core.py` | Shared utilities: data classes, rate limiting, fallback system |
+| `scrape_schedules.py` | Main orchestrator for scraping schedules from multiple sources |
+| `run_pipeline.py` | Full pipeline runner (scrape + canonicalize in one command) |
+| `canonicalize_stadiums.py` | Stadium name resolution with alias support |
+| `canonicalize_teams.py` | Team name resolution with alias support |
+| `canonicalize_games.py` | Game linking (game → team → stadium relationships) |
+| `cloudkit_import.py` | CloudKit sync with full CRUD, validation, and diff reporting |
+| `validate_canonical.py` | Data validation with completeness metrics |
+| `generate_canonical_data.py` | Generate bundled JSON for iOS app bootstrap |
+
+## Sport Modules
+
+Each sport has its own module with hardcoded stadium data and sport-specific scraping logic:
+
+| Module | Sport | Stadiums | Notes |
+|--------|-------|----------|-------|
+| `mlb.py` | MLB | 30 ballparks | Baseball-Reference scraper |
+| `nba.py` | NBA | 30 arenas | Basketball-Reference scraper |
+| `nhl.py` | NHL | 32 arenas | Hockey-Reference scraper |
+| `nfl.py` | NFL | 30 stadiums | Cross-calendar season (2025-26) |
+| `mls.py` | MLS | 30 stadiums | Soccer-specific capacities |
+| `wnba.py` | WNBA | 13 arenas | Shares venues with NBA |
+| `nwsl.py` | NWSL | 13 stadiums | Shares some MLS venues |
+
+## Data Files
+
+### Output Directory: `data/`
+
+| File | Contents |
+|------|----------|
+| `games.csv` | Raw scraped game data (all sports) |
+| `games.json` | Raw scraped games as JSON |
+| `stadiums.json` | Raw stadium data |
+| `stadiums_canonical.json` | Canonical stadiums with resolved aliases |
+| `teams_canonical.json` | Canonical teams with resolved aliases |
+| `stadium_aliases.json` | Stadium name → canonical ID mapping |
+| `games/{sport}_{season}.json` | Per-sport canonical games |
+
+### Alias Files
+
+- `data/canonical/stadiums.json` - Master stadium database
+- `data/canonical/teams.json` - Master team database
+
+## Pipeline Commands
+
+### Scraping
+
+```bash
+# Single sport
+python scrape_schedules.py --sport nba --season 2025-26
+
+# All sports
+python scrape_schedules.py --sport all --season 2026
+
+# With specific output directory
+python scrape_schedules.py --sport mlb --season 2025 --output ./data
+```
+
+### Canonicalization
+
+```bash
+# Run canonicalization pipeline
+python run_canonicalization_pipeline.py --sport all
+```
+
+### CloudKit Operations
+
+```bash
+# Validate data without uploading
+python cloudkit_import.py --validate
+
+# Show what would be uploaded (dry run)
+python cloudkit_import.py --upload --dry-run
+
+# Upload to CloudKit
+python cloudkit_import.py --upload
+
+# List orphan records (requires CloudKit connection)
+python cloudkit_import.py --validate --list-orphans
+
+# Delete orphan records
+python cloudkit_import.py --delete-orphans
+```
+
+## Related Documentation
+
+- [DATA_SOURCES.md](DATA_SOURCES.md) - Data source URLs, rate limits, validation strategy
+- [CLOUDKIT_SETUP.md](CLOUDKIT_SETUP.md) - CloudKit container setup, record types, security roles