feat(scripts): rewrite parser as modular Python CLI

Replace monolithic scraping scripts with sportstime_parser package:

- Multi-source scrapers with automatic fallback for 7 sports
- Canonical ID generation for games, teams, and stadiums
- Fuzzy matching with configurable thresholds for name resolution
- CloudKit Web Services uploader with JWT auth, diff-based updates
- Resumable uploads with checkpoint state persistence
- Validation reports with manual review items and suggested matches
- Comprehensive test suite (249 tests)

CLI: sportstime-parser scrape|validate|upload|status|retry|clear

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 21:06:12 -06:00
parent 284a10d9e1
commit eeaf900e5a
109 changed files with 18415 additions and 266211 deletions
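
Note: the commit message mentions fuzzy matching with configurable thresholds and validation reports that carry manual review items with suggested matches. The sketch below is one way such name resolution could work; it is not the code in `sportstime_parser`. The function names, the two threshold parameters, their default values, and the example IDs are all hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for whatever matcher the package actually uses.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, drop periods, and collapse runs of whitespace."""
    return " ".join(name.lower().replace(".", "").split())


def resolve_name(raw, canonical, auto_threshold=0.90, review_threshold=0.70):
    """Resolve a scraped name against {canonical_id: canonical_name}.

    Returns (status, matched_id, suggestions). Thresholds are hypothetical
    defaults: a score at or above auto_threshold is accepted automatically;
    anything lower is flagged for manual review, with up to three candidates
    scoring at least review_threshold offered as suggestions.
    """
    scored = sorted(
        (
            (SequenceMatcher(None, normalize(raw), normalize(name)).ratio(), cid)
            for cid, name in canonical.items()
        ),
        reverse=True,
    )
    if not scored:
        # Nothing to compare against: always a manual review item.
        return "manual_review", None, []
    best_score, best_id = scored[0]
    if best_score >= auto_threshold:
        return "matched", best_id, []
    suggestions = [cid for score, cid in scored if score >= review_threshold][:3]
    return "manual_review", None, suggestions


if __name__ == "__main__":
    # Hypothetical canonical IDs, for illustration only.
    stadiums = {
        "stadium_mlb_fenway_park": "Fenway Park",
        "stadium_mlb_wrigley_field": "Wrigley Field",
    }
    print(resolve_name("FENWAY  PARK.", stadiums))
    print(resolve_name("Completely Different", stadiums))
```

Using two thresholds rather than one keeps borderline names out of the automatic path while still giving a reviewer a short list of candidates to choose from.
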
--- a/Scripts/README.md
+++ b/Scripts/README.md
@@ -1,147 +0,0 @@
-# SportsTime Data Pipeline
-
-Python scripts that scrape, canonicalize, and sync sports schedule data to CloudKit for the SportsTime iOS app.
-
-## Overview
-
-This pipeline ensures every game correctly links to its home/away teams and stadium with complete, accurate data across MLB, NBA, NHL, NFL, MLS, WNBA, and NWSL.
-
-## Quick Start
-
-```bash
-# Install dependencies
-pip install -r requirements.txt
-
-# Scrape all sports for current season
-python scrape_schedules.py --sport all --season 2026
-
-# Run full pipeline (scrape + canonicalize)
-python run_pipeline.py --sport all
-
-# Validate data integrity
-python cloudkit_import.py --validate
-
-# Sync to CloudKit
-python cloudkit_import.py --upload
-```
-
-## Architecture
-
-```
-┌─────────────────────────────────────────────────────────────────────┐
-│                        SPORT MODULES                                │
-│  mlb.py  nba.py  nhl.py  nfl.py  mls.py  wnba.py  nwsl.py         │
-└────────────────────────────┬────────────────────────────────────────┘
-                             │ scrape
-                             ▼
-┌─────────────────────────────────────────────────────────────────────┐
-│                        RAW DATA                                     │
-│  data/games.csv    data/stadiums.csv    data/games.json            │
-└────────────────────────────┬────────────────────────────────────────┘
-                             │ canonicalize
-                             ▼
-┌─────────────────────────────────────────────────────────────────────┐
-│                     CANONICAL JSON                                  │
-│  data/stadiums_canonical.json    data/teams_canonical.json         │
-│  data/games/*.json (per-sport/season)                              │
-└────────────────────────────┬────────────────────────────────────────┘
-                             │ sync
-                             ▼
-┌─────────────────────────────────────────────────────────────────────┐
-│               CloudKit (iCloud.com.sportstime.app)                 │
-│               Bundled JSON (SportsTime/Resources/)                  │
-└─────────────────────────────────────────────────────────────────────┘
-```
-
-## Module Reference
-
-| Script | Purpose |
-|--------|---------|
-| `core.py` | Shared utilities: data classes, rate limiting, fallback system |
-| `scrape_schedules.py` | Main orchestrator for scraping schedules from multiple sources |
-| `run_pipeline.py` | Full pipeline runner (scrape + canonicalize in one command) |
-| `canonicalize_stadiums.py` | Stadium name resolution with alias support |
-| `canonicalize_teams.py` | Team name resolution with alias support |
-| `canonicalize_games.py` | Game linking (game → team → stadium relationships) |
-| `cloudkit_import.py` | CloudKit sync with full CRUD, validation, and diff reporting |
-| `validate_canonical.py` | Data validation with completeness metrics |
-| `generate_canonical_data.py` | Generate bundled JSON for iOS app bootstrap |
-
-## Sport Modules
-
-Each sport has its own module with hardcoded stadium data and sport-specific scraping logic:
-
-| Module | Sport | Stadiums | Notes |
-|--------|-------|----------|-------|
-| `mlb.py` | MLB | 30 ballparks | Baseball-Reference scraper |
-| `nba.py` | NBA | 30 arenas | Basketball-Reference scraper |
-| `nhl.py` | NHL | 32 arenas | Hockey-Reference scraper |
-| `nfl.py` | NFL | 30 stadiums | Cross-calendar season (2025-26) |
-| `mls.py` | MLS | 30 stadiums | Soccer-specific capacities |
-| `wnba.py` | WNBA | 13 arenas | Shares venues with NBA |
-| `nwsl.py` | NWSL | 13 stadiums | Shares some MLS venues |
-
-## Data Files
-
-### Output Directory: `data/`
-
-| File | Contents |
-|------|----------|
-| `games.csv` | Raw scraped game data (all sports) |
-| `games.json` | Raw scraped games as JSON |
-| `stadiums.json` | Raw stadium data |
-| `stadiums_canonical.json` | Canonical stadiums with resolved aliases |
-| `teams_canonical.json` | Canonical teams with resolved aliases |
-| `stadium_aliases.json` | Stadium name → canonical ID mapping |
-| `games/{sport}_{season}.json` | Per-sport canonical games |
-
-### Alias Files
-
- `data/canonical/stadiums.json` - Master stadium database
- `data/canonical/teams.json` - Master team database
-
-## Pipeline Commands
-
-### Scraping
-
-```bash
-# Single sport
-python scrape_schedules.py --sport nba --season 2025-26
-
-# All sports
-python scrape_schedules.py --sport all --season 2026
-
-# With specific output directory
-python scrape_schedules.py --sport mlb --season 2025 --output ./data
-```
-
-### Canonicalization
-
-```bash
-# Run canonicalization pipeline
-python run_canonicalization_pipeline.py --sport all
-```
-
-### CloudKit Operations
-
-```bash
-# Validate data without uploading
-python cloudkit_import.py --validate
-
-# Show what would be uploaded (dry run)
-python cloudkit_import.py --upload --dry-run
-
-# Upload to CloudKit
-python cloudkit_import.py --upload
-
-# List orphan records (requires CloudKit connection)
-python cloudkit_import.py --validate --list-orphans
-
-# Delete orphan records
-python cloudkit_import.py --delete-orphans
-```
-
-## Related Documentation
-
- [DATA_SOURCES.md](DATA_SOURCES.md) - Data source URLs, rate limits, validation strategy
- [CLOUDKIT_SETUP.md](CLOUDKIT_SETUP.md) - CloudKit container setup, record types, security roles