# SportsTime Parser

A Python CLI tool for scraping sports schedules, normalizing data with canonical IDs, and uploading to CloudKit.

## Features

- Scrapes game schedules from multiple sources with automatic fallback
- Supports 7 major sports leagues: NBA, MLB, NFL, NHL, MLS, WNBA, NWSL
- Generates deterministic canonical IDs for games, teams, and stadiums
- Produces validation reports with manual review lists
- Uploads to CloudKit with resumable, diff-based updates

## Requirements

- Python 3.11+
- CloudKit credentials (for upload functionality)

## Installation

```bash
# From the Scripts directory
cd Scripts

# Install in development mode
pip install -e ".[dev]"

# Or install dependencies only
pip install -r requirements.txt
```

## Quick Start

```bash
# Scrape NBA 2025-26 season
sportstime-parser scrape nba --season 2025

# Scrape all sports
sportstime-parser scrape all --season 2025

# Validate existing scraped data
sportstime-parser validate nba --season 2025

# Check status
sportstime-parser status

# Upload to CloudKit (development)
sportstime-parser upload nba --season 2025

# Upload to CloudKit (production)
sportstime-parser upload nba --season 2025 --environment production
```

## CLI Reference

### scrape

Scrape game schedules, teams, and stadiums from web sources.

```bash
sportstime-parser scrape <sport> [options]

Arguments:
  sport              Sport to scrape: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT   Season start year (default: 2025)
  --dry-run          Parse and validate only, don't write output files
  --verbose, -v      Enable verbose output
```

**Examples:**

```bash
# Scrape NBA 2025-26 season
sportstime-parser scrape nba --season 2025

# Scrape all sports with verbose output
sportstime-parser scrape all --season 2025 --verbose

# Dry run to test without writing files
sportstime-parser scrape mlb --season 2026 --dry-run
```

### validate

Run validation on existing scraped data and regenerate reports. Validation performs these checks:

1. **Game Coverage**: Compares scraped game count against expected totals per league (e.g., ~1,230 for NBA, ~2,430 for MLB)
2. **Team Resolution**: Identifies team names that couldn't be matched to canonical IDs using fuzzy matching
3. **Stadium Resolution**: Identifies venue names that couldn't be matched to canonical stadium IDs
4. **Duplicate Detection**: Finds games with the same home/away teams on the same date (potential doubleheader issues or data errors)
5. **Missing Data**: Flags games missing required fields (stadium_id, team IDs, valid dates)

The output is a Markdown report with:

- Summary statistics (total games, valid games, coverage percentage)
- Manual review items grouped by type (unresolved teams, unresolved stadiums, duplicates)
- Fuzzy match suggestions with confidence scores to help resolve unmatched names

```bash
sportstime-parser validate <sport> [options]

Arguments:
  sport              Sport to validate: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT   Season start year (default: 2025)
```

**Examples:**

```bash
# Validate NBA data
sportstime-parser validate nba --season 2025

# Validate all sports
sportstime-parser validate all
```

### upload

Upload scraped data to CloudKit with diff-based updates.

```bash
sportstime-parser upload <sport> [options]

Arguments:
  sport              Sport to upload: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT   Season start year (default: 2025)
  --environment, -e  CloudKit environment: development or production (default: development)
  --resume           Resume interrupted upload from last checkpoint
```

**Examples:**

```bash
# Upload NBA to development
sportstime-parser upload nba --season 2025

# Upload to production
sportstime-parser upload nba --season 2025 --environment production

# Resume interrupted upload
sportstime-parser upload mlb --season 2026 --resume
```

### status

Show current scrape and upload status.

```bash
sportstime-parser status
```

### retry

Retry failed uploads from previous attempts.
```bash
sportstime-parser retry <sport> [options]

Arguments:
  sport              Sport to retry: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT   Season start year (default: 2025)
  --environment, -e  CloudKit environment (default: development)
  --max-retries INT  Maximum retry attempts per record (default: 3)
```

### clear

Clear upload session state to start fresh.

```bash
sportstime-parser clear <sport> [options]

Arguments:
  sport              Sport to clear: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT   Season start year (default: 2025)
  --environment, -e  CloudKit environment (default: development)
```

## CloudKit Configuration

To upload data to CloudKit, you need to configure authentication credentials.

### 1. Get Credentials from Apple Developer Portal

1. Go to the [Apple Developer Portal](https://developer.apple.com)
2. Navigate to **Certificates, Identifiers & Profiles** > **Keys**
3. Create a new key with the **CloudKit** capability
4. Download the private key file (.p8)
5. Note the Key ID

### 2. Set Environment Variables

```bash
# Key ID from Apple Developer Portal
export CLOUDKIT_KEY_ID="your_key_id_here"

# Path to private key file
export CLOUDKIT_PRIVATE_KEY_PATH="/path/to/AuthKey_XXXXXX.p8"

# Or provide key content directly (useful for CI/CD)
export CLOUDKIT_PRIVATE_KEY="-----BEGIN EC PRIVATE KEY-----
...key content...
-----END EC PRIVATE KEY-----"
```

### 3. Verify Configuration

```bash
sportstime-parser status
```

The status output will show whether CloudKit is configured correctly.

## Output Files

Scraped data is saved to the `output/` directory:

```
output/
  games_nba_2025.json       # Game schedules
  teams_nba.json            # Team data
  stadiums_nba.json         # Stadium data
  validation_nba_2025.md    # Validation report
```

## Validation Reports

Validation reports are generated in Markdown format at `output/validation_{sport}_{season}.md`.
### Report Sections

**Summary Table**

| Metric | Description |
|--------|-------------|
| Total Games | Number of games scraped |
| Valid Games | Games with all required fields resolved |
| Coverage | Percentage of expected games found (based on league schedule) |
| Unresolved Teams | Team names that couldn't be matched |
| Unresolved Stadiums | Venue names that couldn't be matched |
| Duplicates | Potential duplicate game entries |

**Manual Review Items**

Items are grouped by type and include the raw value, source URL, and suggested fixes:

- **Unresolved Teams**: Team names not in the alias mapping. Add to `team_aliases.json` to resolve.
- **Unresolved Stadiums**: Venue names not recognized. Common for renamed arenas (naming rights changes). Add to `stadium_aliases.json`.
- **Duplicate Games**: Same matchup on the same date. May indicate doubleheader parsing issues or duplicate entries from different sources.
- **Missing Data**: Games missing stadium coordinates or other required fields.

**Fuzzy Match Suggestions**

For each unresolved name, the validator provides the top fuzzy matches with confidence scores (0-100). High-confidence matches (>80) are likely correct; lower scores need manual verification.

## Canonical IDs

Canonical IDs are stable, deterministic identifiers that enable cross-referencing between games, teams, and stadiums across different data sources.
### ID Formats

**Games**

```
{sport}_{season}_{away}_{home}_{MMDD}[_{game_number}]
```

Examples:

- `nba_2025_hou_okc_1021` - NBA 2025-26, Houston @ OKC, Oct 21
- `mlb_2026_nyy_bos_0401_1` - MLB 2026, Yankees @ Red Sox, Apr 1, Game 1 (doubleheader)

**Teams**

```
{sport}_{city}_{name}
```

Examples:

- `nba_la_lakers`
- `mlb_new_york_yankees`
- `nfl_new_york_giants`

**Stadiums**

```
{sport}_{normalized_name}
```

Examples:

- `mlb_yankee_stadium`
- `nba_crypto_com_arena`
- `nfl_sofi_stadium`

### Generated vs Matched IDs

| Entity | Generated | Matched |
|--------|-----------|---------|
| **Teams** | Pre-defined in `team_resolver.py` mappings | Resolved from raw scraped names via aliases + fuzzy matching |
| **Stadiums** | Pre-defined in `stadium_resolver.py` mappings | Resolved from raw venue names via aliases + fuzzy matching |
| **Games** | Generated at scrape time from resolved team IDs + date | N/A (always generated, never matched) |

**Resolution Flow:**

```
Raw Name (from scraper)
  ↓
Exact Match (alias lookup in team_aliases.json / stadium_aliases.json)
  ↓ (if no match)
Fuzzy Match (Levenshtein distance against known names)
  ↓ (if confidence > threshold)
Canonical ID assigned
  ↓ (if no match)
Manual Review Item created
```

### Cross-References

Entities reference each other via canonical IDs:

```
┌─────────────────────────────────────────────────────────────┐
│                            Game                             │
│   id: nba_2025_hou_okc_1021                                 │
│   home_team_id: nba_oklahoma_city_thunder ──────────────┐   │
│   away_team_id: nba_houston_rockets ────────────────┐   │   │
│   stadium_id: nba_paycom_center ────────────────┐   │   │   │
└─────────────────────────────────────────────────│───│───│───┘
                                                  │   │   │
┌─────────────────────────────────────────────────│───│───│───┐
│                     Stadium                     │   │   │   │
│   id: nba_paycom_center ◄───────────────────────┘   │   │   │
│   name: "Paycom Center"                             │   │   │
│   city: "Oklahoma City"                             │   │   │
│   latitude: 35.4634                                 │   │   │
│   longitude: -97.5151                               │   │   │
└─────────────────────────────────────────────────────│───│───┘
                                                      │   │
┌─────────────────────────────────────────────────────│───│───┐
│                        Team                         │   │   │
│   id: nba_houston_rockets ◄─────────────────────────┘   │   │
│   name: "Rockets"                                       │   │
│   city: "Houston"                                       │   │
│   stadium_id: nba_toyota_center                         │   │
└─────────────────────────────────────────────────────────│───┘
                                                          │
┌─────────────────────────────────────────────────────────│───┐
│                          Team                           │   │
│   id: nba_oklahoma_city_thunder ◄───────────────────────┘   │
│   name: "Thunder"                                           │
│   city: "Oklahoma City"                                     │
│   stadium_id: nba_paycom_center                             │
└─────────────────────────────────────────────────────────────┘
```

### Alias Files

Aliases map variant names to canonical IDs:

**`team_aliases.json`**

```json
{
  "nba": {
    "LA Lakers": "nba_la_lakers",
    "Los Angeles Lakers": "nba_la_lakers",
    "LAL": "nba_la_lakers"
  }
}
```

**`stadium_aliases.json`**

```json
{
  "nba": {
    "Crypto.com Arena": "nba_crypto_com_arena",
    "Staples Center": "nba_crypto_com_arena",
    "STAPLES Center": "nba_crypto_com_arena"
  }
}
```

When a scraper returns a raw name like "LA Lakers", the resolver:

1. Checks `team_aliases.json` for an exact match → finds `nba_la_lakers`
2. If no exact match, runs fuzzy matching against all known team names
3. If fuzzy match confidence > 80%, uses that canonical ID
4. Otherwise, creates a manual review item for human resolution

## Adding a New Sport

To add support for a new sport (e.g., `cfb` for college football), update these files:

### 1. Configuration (`config.py`)

Add the sport to `SUPPORTED_SPORTS` and `EXPECTED_GAME_COUNTS`:

```python
SUPPORTED_SPORTS: list[str] = [
    "nba", "mlb", "nfl", "nhl", "mls", "wnba", "nwsl",
    "cfb",  # ← Add new sport
]

EXPECTED_GAME_COUNTS: dict[str, int] = {
    # ... existing sports ...
    "cfb": 900,  # ← Add expected game count for validation
}
```

### 2. Team Mappings (`normalizers/team_resolver.py`)

Add team definitions to `TEAM_MAPPINGS`. Each entry maps an abbreviation to `(canonical_id, full_name, city)`:

```python
TEAM_MAPPINGS: dict[str, dict[str, tuple[str, str, str]]] = {
    # ... existing sports ...
    "cfb": {
        "ALA": ("team_cfb_ala", "Alabama Crimson Tide", "Tuscaloosa"),
        "OSU": ("team_cfb_osu", "Ohio State Buckeyes", "Columbus"),
        # ... all teams ...
    },
}
```

### 3. Stadium Mappings (`normalizers/stadium_resolver.py`)

Add stadium definitions to `STADIUM_MAPPINGS`. Each entry is a `StadiumInfo` with coordinates:

```python
STADIUM_MAPPINGS: dict[str, dict[str, StadiumInfo]] = {
    # ... existing sports ...
    "cfb": {
        "stadium_cfb_bryant_denny": StadiumInfo(
            id="stadium_cfb_bryant_denny",
            name="Bryant-Denny Stadium",
            city="Tuscaloosa",
            state="AL",
            country="USA",
            sport="cfb",
            latitude=33.2083,
            longitude=-87.5503,
        ),
        # ... all stadiums ...
    },
}
```

### 4. Scraper Implementation (`scrapers/cfb.py`)

Create a new scraper class extending `BaseScraper`:

```python
# Import paths assume the package layout shown under Project Structure
from ..models.game import Game
from ..models.stadium import Stadium
from ..models.team import Team
from ..normalizers.stadium_resolver import get_stadium_resolver
from ..normalizers.team_resolver import get_team_resolver
from .base import BaseScraper, RawGameData, ScrapeResult


class CFBScraper(BaseScraper):
    def __init__(self, season: int, **kwargs):
        super().__init__("cfb", season, **kwargs)
        self._team_resolver = get_team_resolver("cfb")
        self._stadium_resolver = get_stadium_resolver("cfb")

    def _get_sources(self) -> list[str]:
        return ["espn", "sports_reference"]  # Priority order

    def _get_source_url(self, source: str, **kwargs) -> str:
        # Return URL for each source
        ...

    def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
        # Implement scraping logic
        ...

    def _normalize_games(self, raw_games: list[RawGameData]) -> tuple[list[Game], list[ManualReviewItem]]:
        # Convert raw data to Game objects using resolvers
        ...

    def scrape_teams(self) -> list[Team]:
        # Return Team objects from TEAM_MAPPINGS
        ...

    def scrape_stadiums(self) -> list[Stadium]:
        # Return Stadium objects from STADIUM_MAPPINGS
        ...


def create_cfb_scraper(season: int) -> CFBScraper:
    return CFBScraper(season=season)
```

### 5. Register Scraper (`scrapers/__init__.py`)

Export the new scraper:

```python
from .cfb import CFBScraper, create_cfb_scraper

__all__ = [
    # ... existing exports ...
    "CFBScraper",
    "create_cfb_scraper",
]
```

### 6. CLI Registration (`cli.py`)

Add the sport to `get_scraper()`:

```python
def get_scraper(sport: str, season: int):
    # ... existing sports ...
    elif sport == "cfb":
        from .scrapers.cfb import create_cfb_scraper
        return create_cfb_scraper(season)
```

### 7. Alias Files (`team_aliases.json`, `stadium_aliases.json`)

Add initial aliases for common name variants:

**`team_aliases.json`**

```json
{
  "cfb": {
    "Alabama": "team_cfb_ala",
    "Bama": "team_cfb_ala",
    "Roll Tide": "team_cfb_ala"
  }
}
```

**`stadium_aliases.json`**

```json
{
  "cfb": {
    "Bryant Denny Stadium": "stadium_cfb_bryant_denny",
    "Bryant-Denny": "stadium_cfb_bryant_denny"
  }
}
```

### 8. Documentation (`SOURCES.md`)

Document data sources with URLs, rate limits, and notes:

```markdown
## CFB (College Football)

**Teams**: 134 (FBS)
**Expected Games**: ~900 per season
**Season**: August - January

### Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/football/college-football/scoreboard` | JSON |
| 2 | Sports-Reference | `sports-reference.com/cfb/years/{YEAR}-schedule.html` | HTML |
```

### 9. Tests (`tests/test_scrapers/test_cfb.py`)

Create tests for the new scraper:

```python
import pytest

from sportstime_parser.scrapers.cfb import CFBScraper, create_cfb_scraper


class TestCFBScraper:
    def test_factory_creates_scraper(self):
        scraper = create_cfb_scraper(season=2025)
        assert scraper.sport == "cfb"
        assert scraper.season == 2025

    def test_get_sources_returns_priority_list(self):
        scraper = CFBScraper(season=2025)
        sources = scraper._get_sources()
        assert "espn" in sources

    # ... more tests ...
```

### Checklist

- [ ] Add to `SUPPORTED_SPORTS` in `config.py`
- [ ] Add to `EXPECTED_GAME_COUNTS` in `config.py`
- [ ] Add team mappings to `team_resolver.py`
- [ ] Add stadium mappings to `stadium_resolver.py`
- [ ] Create `scrapers/{sport}.py` with scraper class
- [ ] Export in `scrapers/__init__.py`
- [ ] Register in `cli.py` `get_scraper()`
- [ ] Add aliases to `team_aliases.json`
- [ ] Add aliases to `stadium_aliases.json`
- [ ] Document sources in `SOURCES.md`
- [ ] Create tests in `tests/test_scrapers/`
- [ ] Run `pytest` to verify all tests pass
- [ ] Run a dry-run scrape: `sportstime-parser scrape {sport} --season 2025 --dry-run`

## Development

### Running Tests

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=sportstime_parser --cov-report=html

# Run specific test file
pytest tests/test_scrapers/test_nba.py

# Run with verbose output
pytest -v
```

### Project Structure

```
sportstime_parser/
  __init__.py
  __main__.py              # CLI entry point
  cli.py                   # Subcommand definitions
  config.py                # Constants, defaults
  models/
    game.py                # Game dataclass
    team.py                # Team dataclass
    stadium.py             # Stadium dataclass
    aliases.py             # Alias dataclasses
  scrapers/
    base.py                # BaseScraper abstract class
    nba.py                 # NBA scrapers
    mlb.py                 # MLB scrapers
    nfl.py                 # NFL scrapers
    nhl.py                 # NHL scrapers
    mls.py                 # MLS scrapers
    wnba.py                # WNBA scrapers
    nwsl.py                # NWSL scrapers
  normalizers/
    canonical_id.py        # ID generation
    team_resolver.py       # Team name resolution
    stadium_resolver.py    # Stadium name resolution
    timezone.py            # Timezone conversion
    fuzzy.py               # Fuzzy matching
  validators/
    report.py              # Validation report generator
  uploaders/
    cloudkit.py            # CloudKit Web Services client
    state.py               # Resumable upload state
    diff.py                # Record comparison
  utils/
    http.py                # Rate-limited HTTP client
    logging.py             # Verbose logger
    progress.py            # Progress bars
```

## Troubleshooting

### "No games file found"

Run the scrape command first:

```bash
sportstime-parser scrape nba --season 2025
```

### "CloudKit not configured"

Set the required
environment variables:

```bash
export CLOUDKIT_KEY_ID="your_key_id"
export CLOUDKIT_PRIVATE_KEY_PATH="/path/to/key.p8"
```

### Rate limit errors

The scraper includes automatic rate limiting and exponential backoff. If you encounter persistent rate limit errors:

1. Wait a few minutes before retrying
2. Try scraping one sport at a time instead of "all"
3. Check that you're not running multiple instances

### Scrape fails with no data

1. Check your internet connection
2. Run with `--verbose` to see detailed error messages
3. The scraper tries multiple sources in priority order; if all fail, the source websites may be temporarily unavailable

## License

MIT