# SportsTime Parser

A Python package for scraping, normalizing, and uploading sports schedule data to CloudKit for the SportsTime iOS app.

## Table of Contents

- [Overview](#overview)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Architecture](#architecture)
- [Directory Structure](#directory-structure)
- [Configuration](#configuration)
- [Data Models](#data-models)
- [Normalizers](#normalizers)
- [Scrapers](#scrapers)
- [Uploaders](#uploaders)
- [Utilities](#utilities)
- [Manual Review Workflow](#manual-review-workflow)
- [Adding a New Sport](#adding-a-new-sport)
- [Troubleshooting](#troubleshooting)

## Overview

The `sportstime_parser` package provides a complete pipeline for:

1. **Scraping** game schedules from multiple sources (Basketball-Reference, ESPN, MLB API, etc.)
2. **Normalizing** raw data to canonical identifiers (teams, stadiums, games)
3. **Resolving** team/stadium names using exact matching, historical aliases, and fuzzy matching
4. **Uploading** data to CloudKit with diff-based sync and resumable uploads

### Supported Sports

| Sport | Code | Sources | Season Format |
|-------|------|---------|---------------|
| NBA | `nba` | Basketball-Reference, ESPN, CBS | Oct-Jun (split year) |
| MLB | `mlb` | Baseball-Reference, MLB API, ESPN | Mar-Nov (single year) |
| NFL | `nfl` | ESPN, Pro-Football-Reference, CBS | Sep-Feb (split year) |
| NHL | `nhl` | Hockey-Reference, NHL API, ESPN | Oct-Jun (split year) |
| MLS | `mls` | ESPN, FBref | Feb-Nov (single year) |
| WNBA | `wnba` | ESPN | May-Oct (single year) |
| NWSL | `nwsl` | ESPN | Mar-Nov (single year) |

## Installation

```bash
cd Scripts
pip install -r requirements.txt
```

### Dependencies

- `requests` - HTTP requests with session management
- `beautifulsoup4` + `lxml` - HTML parsing
- `rapidfuzz` - Fuzzy string matching
- `pyjwt` + `cryptography` - CloudKit JWT authentication
- `rich` - Terminal UI (progress bars, logging)
- `pytz` / `timezonefinder` - Timezone detection

## Quick Start

### Scrape a Single Sport

```python
from sportstime_parser.scrapers import create_nba_scraper

scraper = create_nba_scraper(season=2025)
result = scraper.scrape_all()

print(f"Games: {result.game_count}")
print(f"Teams: {result.team_count}")
print(f"Stadiums: {result.stadium_count}")
print(f"Needs review: {result.review_count}")
```

### Upload to CloudKit

```python
from sportstime_parser.uploaders import CloudKitClient, RecordDiffer

client = CloudKitClient(environment="development")
differ = RecordDiffer()

# Compare local vs remote
diff = differ.diff_games(local_games, remote_records)

# Upload changes
records = diff.get_records_to_upload()
result = await client.save_records(records)
```

## Architecture

```
┌─────────────────────────────────────────────────────────────────────────┐
│                              DATA SOURCES                               │
│   Basketball-Reference │ ESPN API │ MLB API │ Hockey-Reference │ etc.   │
└────────────────────────────────────┬────────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                                SCRAPERS                                 │
│  NBAScraper │ MLBScraper │ NFLScraper │ NHLScraper │ MLSScraper │ etc.  │
│                                                                         │
│  Features:                                                              │
│  • Multi-source fallback (try sources in priority order)                │
│  • Automatic rate limiting with exponential backoff                     │
│  • Doubleheader detection                                               │
│  • International game filtering (NFL London, NHL Global Series)         │
└────────────────────────────────────┬────────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                               NORMALIZERS                               │
│   TeamResolver │ StadiumResolver │ CanonicalIdGenerator │ AliasLoader   │
│                                                                         │
│  Resolution Strategy (in order):                                        │
│  1. Exact match against canonical mappings                              │
│  2. Date-aware alias lookup (handles renames/relocations)               │
│  3. Fuzzy matching with confidence threshold (85%)                      │
│  4. Flag for manual review if unresolved or low confidence              │
└────────────────────────────────────┬────────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                               DATA MODELS                               │
│                Game │ Team │ Stadium │ ManualReviewItem                 │
│                                                                         │
│  All models use canonical IDs:                                          │
│  • team_nba_lal (Los Angeles Lakers)                                    │
│  • stadium_nba_los_angeles_lakers (Crypto.com Arena)                    │
│  • game_nba_2025_20251022_bos_lal (specific game)                       │
└────────────────────────────────────┬────────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                                UPLOADERS                                │
│              CloudKitClient │ RecordDiffer │ StateManager               │
│                                                                         │
│  Features:                                                              │
│  • JWT authentication with Apple's CloudKit Web Services                │
│  • Batch operations (up to 200 records per request)                     │
│  • Diff-based sync (only upload changes)                                │
│  • Resumable uploads with persistent state                              │
└────────────────────────────────────┬────────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                                CLOUDKIT                                 │
│            Public Database: Games, Teams, Stadiums, Aliases             │
└─────────────────────────────────────────────────────────────────────────┘
```
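The scraper layer's multi-source fallback can be sketched as a standalone helper. `scrape_with_fallback` below is hypothetical, not the package's actual API; it shows the pattern of trying each source in priority order and failing only when every source fails:

```python
def scrape_with_fallback(sources, scrape_fn):
    """Try each source in priority order; return the first successful result."""
    errors = {}
    for source in sources:
        try:
            return scrape_fn(source)
        except Exception as exc:  # any source failure triggers fallback
            errors[source] = exc
    raise RuntimeError(f"All sources failed: {errors}")
```

Collecting per-source errors (rather than swallowing them) makes an all-sources failure diagnosable from the final exception.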
## Directory Structure

```
Scripts/
├── README.md                    # This file
├── requirements.txt             # Python dependencies
├── pyproject.toml               # Package configuration
├── league_structure.json        # League hierarchy (conferences, divisions)
├── team_aliases.json            # Historical team name mappings
├── stadium_aliases.json         # Historical stadium name mappings
├── logs/                        # Runtime logs (auto-created)
├── output/                      # Scrape output files (auto-created)
└── sportstime_parser/           # Main package
    ├── __init__.py
    ├── config.py                # Configuration constants
    ├── SOURCES.md               # Data source documentation
    ├── models/                  # Data classes
    │   ├── game.py              # Game model
    │   ├── team.py              # Team model
    │   ├── stadium.py           # Stadium model
    │   └── aliases.py           # Alias and ManualReviewItem models
    ├── normalizers/             # Name resolution
    │   ├── canonical_id.py      # ID generation
    │   ├── alias_loader.py      # Alias loading and resolution
    │   ├── fuzzy.py             # Fuzzy string matching
    │   ├── timezone.py          # Timezone detection
    │   ├── team_resolver.py     # Team name resolution
    │   └── stadium_resolver.py  # Stadium name resolution
    ├── scrapers/                # Sport-specific scrapers
    │   ├── base.py              # Abstract base scraper
    │   ├── nba.py               # NBA scraper
    │   ├── mlb.py               # MLB scraper
    │   ├── nfl.py               # NFL scraper
    │   ├── nhl.py               # NHL scraper
    │   ├── mls.py               # MLS scraper
    │   ├── wnba.py              # WNBA scraper
    │   └── nwsl.py              # NWSL scraper
    ├── uploaders/               # CloudKit integration
    │   ├── cloudkit.py          # CloudKit Web Services client
    │   ├── diff.py              # Record diffing
    │   └── state.py             # Resumable upload state
    └── utils/                   # Shared utilities
        ├── logging.py           # Rich-based logging
        ├── http.py              # Rate-limited HTTP client
        └── progress.py          # Progress tracking
```

## Configuration

### config.py

Key configuration constants:

```python
from pathlib import Path

# Directories
SCRIPTS_DIR = Path(__file__).parent.parent  # Scripts/
OUTPUT_DIR = SCRIPTS_DIR / "output"         # JSON output
STATE_DIR = SCRIPTS_DIR / ".parser_state"   # Upload state

# CloudKit
CLOUDKIT_CONTAINER = "iCloud.com.sportstime.app"
CLOUDKIT_ENVIRONMENT = "development"  # or "production"

# Rate Limiting
DEFAULT_REQUEST_DELAY = 3.0  # seconds between requests
MAX_RETRIES = 3              # retry attempts
BACKOFF_FACTOR = 2.0         # exponential backoff multiplier
INITIAL_BACKOFF = 5.0        # initial backoff duration

# Fuzzy Matching
FUZZY_THRESHOLD = 85  # minimum match confidence (0-100)

# Expected game counts (for validation)
EXPECTED_GAME_COUNTS = {
    "nba": 1230,   # 30 teams × 82 games ÷ 2
    "mlb": 2430,   # 30 teams × 162 games ÷ 2
    "nfl": 272,    # Regular season only
    "nhl": 1312,   # 32 teams × 82 games ÷ 2
    "mls": 544,    # 29 teams × ~34 games ÷ 2
    "wnba": 228,   # 12 teams × 38 games ÷ 2
    "nwsl": 182,   # 14 teams × 26 games ÷ 2
}

# Geography (for filtering international games)
ALLOWED_COUNTRIES = {"USA", "Canada"}
```

### league_structure.json

Defines the hierarchical structure of each league:

```json
{
  "nba": {
    "name": "National Basketball Association",
    "conferences": {
      "Eastern": {
        "divisions": {
          "Atlantic": ["BOS", "BKN", "NYK", "PHI", "TOR"],
          "Central": ["CHI", "CLE", "DET", "IND", "MIL"],
          "Southeast": ["ATL", "CHA", "MIA", "ORL", "WAS"]
        }
      },
      "Western": { ... }
    }
  },
  "mlb": { ... },
  ...
}
```
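The nested conference/division layout can be flattened with a small traversal. `all_teams` below is an illustrative helper, not part of the package, and assumes the `conferences`/`divisions` keys shown above:

```python
def all_teams(league: dict) -> list[str]:
    """Flatten a league_structure.json entry into a list of team abbreviations."""
    return [
        abbrev
        for conference in league["conferences"].values()
        for division in conference["divisions"].values()
        for abbrev in division
    ]
```

A traversal like this is also a cheap structural sanity check: an NBA conference, for example, should flatten to exactly 15 abbreviations.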
### team_aliases.json / stadium_aliases.json

Historical name mappings with validity dates:

```json
{
  "team_mlb_athletics": [
    {
      "alias": "Oakland Athletics",
      "alias_type": "full_name",
      "valid_from": "1968-01-01",
      "valid_until": "2024-12-31"
    },
    {
      "alias": "Las Vegas Athletics",
      "alias_type": "full_name",
      "valid_from": "2028-01-01",
      "valid_until": null
    }
  ]
}
```

## Data Models

### Game

```python
@dataclass
class Game:
    id: str                       # Canonical ID: game_{sport}_{season}_{date}_{away}_{home}
    sport: str                    # Sport code (nba, mlb, etc.)
    season: int                   # Season start year
    home_team_id: str             # Canonical team ID
    away_team_id: str             # Canonical team ID
    stadium_id: str               # Canonical stadium ID
    game_date: datetime           # UTC datetime
    game_number: Optional[int]    # 1 or 2 for doubleheaders
    home_score: Optional[int]     # None if not played
    away_score: Optional[int]
    status: str                   # scheduled, final, postponed, cancelled
    source_url: Optional[str]     # For manual review
    raw_home_team: Optional[str]  # Original scraped value
    raw_away_team: Optional[str]
    raw_stadium: Optional[str]
```

### Team

```python
@dataclass
class Team:
    id: str                    # Canonical ID: team_{sport}_{abbrev}
    sport: str
    city: str                  # e.g., "Los Angeles"
    name: str                  # e.g., "Lakers"
    full_name: str             # e.g., "Los Angeles Lakers"
    abbreviation: str          # e.g., "LAL"
    conference: Optional[str]  # e.g., "Western"
    division: Optional[str]    # e.g., "Pacific"
    stadium_id: Optional[str]  # Home stadium
    primary_color: Optional[str]
    secondary_color: Optional[str]
    logo_url: Optional[str]
```

### Stadium

```python
@dataclass
class Stadium:
    id: str                    # Canonical ID: stadium_{sport}_{city_team}
    sport: str
    name: str                  # Current name (e.g., "Crypto.com Arena")
    city: str
    state: Optional[str]
    country: str
    latitude: Optional[float]
    longitude: Optional[float]
    capacity: Optional[int]
    surface: Optional[str]     # grass, turf, ice, hardwood
    roof_type: Optional[str]   # dome, retractable, open
    opened_year: Optional[int]
    image_url: Optional[str]
    timezone: Optional[str]
```
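One plausible way to derive the deterministic stadium ID embedded in the model above is a simple slug of city and team name. This is a sketch consistent with the documented examples; the package's actual slug rules (punctuation, accents, multi-word edge cases) may differ:

```python
def generate_stadium_id(sport: str, city: str, team_name: str) -> str:
    """Derive a deterministic stadium ID from sport, city, and team name."""
    # Lowercase, then join whitespace-separated words with underscores.
    slug = "_".join(f"{city} {team_name}".lower().split())
    return f"stadium_{sport}_{slug}"
```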
### ManualReviewItem

```python
@dataclass
class ManualReviewItem:
    item_type: str               # "team" or "stadium"
    raw_value: str               # Original scraped value
    suggested_id: Optional[str]  # Best fuzzy match (if any)
    confidence: float            # 0.0 - 1.0
    reason: str                  # Why review is needed
    source_url: Optional[str]    # Where it came from
    sport: str
    check_date: Optional[date]   # For date-aware alias lookup
```

## Normalizers

### Canonical ID Generation

IDs are deterministic and immutable:

```python
# Team ID
generate_team_id("nba", "LAL")
# → "team_nba_lal"

# Stadium ID
generate_stadium_id("nba", "Los Angeles", "Lakers")
# → "stadium_nba_los_angeles_lakers"

# Game ID
generate_game_id(
    sport="nba",
    season=2025,
    away_abbrev="BOS",
    home_abbrev="LAL",
    game_date=datetime(2025, 10, 22),
    game_number=None
)
# → "game_nba_2025_20251022_bos_lal"

# Doubleheader Game ID
generate_game_id(..., game_number=2)
# → "game_nba_2025_20251022_bos_lal_2"
```

### Team Resolution

The `TeamResolver` uses a three-stage strategy:

```python
resolver = get_team_resolver("nba")
result = resolver.resolve(
    "Los Angeles Lakers",
    check_date=date(2025, 10, 22),
    source_url="https://..."
)
# Result:
# - canonical_id: "team_nba_lal"
# - confidence: 1.0 (exact match)
# - review_item: None
```

**Resolution stages:**

1. **Exact Match**: Check against canonical team mappings
   - Full name: "Los Angeles Lakers"
   - City + Name: "Los Angeles" + "Lakers"
   - Abbreviation: "LAL"

2. **Alias Lookup**: Check historical aliases with date awareness
   - "Oakland Athletics" → "team_mlb_athletics" (valid until 2024-12-31)
   - Handles relocations: "Oakland" → "Las Vegas" transition

3. **Fuzzy Match**: Use rapidfuzz with 85% threshold
   - "LA Lakers" → "Los Angeles Lakers" (92% match)
   - Low-confidence matches flagged for review
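The three stages can be sketched end to end. This dependency-free version substitutes stdlib `difflib` for `rapidfuzz` and uses tiny hypothetical lookup tables; the package's real tables, scoring, and return type differ:

```python
import difflib
from datetime import date
from typing import Optional

CANONICAL = {
    "los angeles lakers": "team_nba_lal",
    "boston celtics": "team_nba_bos",
}
ALIASES = {
    # alias -> (canonical_id, valid_from, valid_until)
    "oakland athletics": ("team_mlb_athletics", date(1968, 1, 1), date(2024, 12, 31)),
}

def resolve(raw: str, check_date: Optional[date] = None, threshold: float = 0.85):
    key = " ".join(raw.lower().split())
    # Stage 1: exact match against canonical mappings
    if key in CANONICAL:
        return CANONICAL[key], 1.0
    # Stage 2: date-aware alias lookup
    if key in ALIASES:
        cid, start, end = ALIASES[key]
        if check_date is None or (start <= check_date and (end is None or check_date <= end)):
            return cid, 1.0
    # Stage 3: fuzzy match; below threshold means manual review (None)
    best = max(CANONICAL, key=lambda cand: difflib.SequenceMatcher(None, key, cand).ratio())
    score = difflib.SequenceMatcher(None, key, best).ratio()
    return (CANONICAL[best], score) if score >= threshold else (None, score)
```

Note how the alias stage only fires when `check_date` falls inside the validity window, which is what makes "Oakland Athletics" resolve for a 2020 game but fall through for a 2025 one.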
### Stadium Resolution

Similar three-stage strategy with additional location awareness:

```python
resolver = get_stadium_resolver("nba")
result = resolver.resolve(
    "Crypto.com Arena",
    check_date=date(2025, 10, 22)
)
```

**Key features:**

- Handles naming rights changes (Staples Center → Crypto.com Arena)
- Date-aware: "Staples Center" resolves correctly for historical games
- Location-based fallback using latitude/longitude

## Scrapers

### Base Scraper

All scrapers extend `BaseScraper` with these features:

```python
class BaseScraper(ABC):
    def __init__(self, sport: str, season: int): ...

    # Required implementations
    def _get_sources(self) -> list[str]: ...
    def _scrape_games_from_source(self, source: str) -> list[RawGameData]: ...
    def _normalize_games(self, raw_games) -> tuple[list[Game], list[ManualReviewItem]]: ...
    def scrape_teams(self) -> list[Team]: ...
    def scrape_stadiums(self) -> list[Stadium]: ...

    # Built-in features
    def scrape_games(self) -> ScrapeResult:
        """Multi-source fallback - tries each source in order."""
        ...

    def scrape_all(self) -> ScrapeResult:
        """Scrapes games, teams, and stadiums with progress tracking."""
        ...
```

### NBA Scraper

```python
class NBAScraper(BaseScraper):
    """
    Sources (in priority order):
    1. Basketball-Reference - HTML tables, monthly pages
    2. ESPN API - JSON, per-date queries
    3. CBS Sports - Backup (not implemented)

    Season: October to June (split year, e.g., 2025-26)
    """
```

**Basketball-Reference parsing:**

- URL: `https://www.basketball-reference.com/leagues/NBA_{year}_games-{month}.html`
- Table columns: date_game, visitor_team_name, home_team_name, visitor_pts, home_pts, arena_name

### MLB Scraper

```python
class MLBScraper(BaseScraper):
    """
    Sources:
    1. Baseball-Reference - Single page per season
    2. MLB Stats API - Official API with date range queries
    3. ESPN API - Backup

    Season: March to November (single year)
    Handles: Doubleheaders with game_number
    """
```
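The doubleheader handling mentioned above pairs with the canonical game ID scheme from the Normalizers section. A minimal sketch of how the numeric suffix could keep both games of a doubleheader unique (parameter names here are assumptions, not the package's exact signature):

```python
from datetime import datetime

def generate_game_id(sport, season, game_date, away, home, game_number=None):
    """Deterministic game ID; doubleheaders get a numeric suffix."""
    base = f"game_{sport}_{season}_{game_date:%Y%m%d}_{away.lower()}_{home.lower()}"
    # Without a game_number the base ID already identifies the game.
    return f"{base}_{game_number}" if game_number else base
```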
### NFL Scraper

```python
class NFLScraper(BaseScraper):
    """
    Sources:
    1. ESPN API - Week-based queries
    2. Pro-Football-Reference - Single page per season

    Season: September to February (split year)
    Filters: International games (London, Mexico City, Frankfurt)
    Scrapes: Preseason (4 weeks), Regular (18 weeks), Postseason (4 rounds)
    """
```

### NHL Scraper

```python
class NHLScraper(BaseScraper):
    """
    Sources:
    1. Hockey-Reference - Single page per season
    2. NHL API - New API (api-web.nhle.com)
    3. ESPN API - Backup

    Season: October to June (split year)
    Filters: International games (Prague, Stockholm, Helsinki)
    """
```

### MLS / WNBA / NWSL Scrapers

All use the ESPN API as the primary source with similar structure:

- Single calendar-year seasons
- Conference-based organization (MLS) or a single table (WNBA, NWSL)

## Uploaders

### CloudKit Client

```python
class CloudKitClient:
    """CloudKit Web Services API client with JWT authentication."""

    def __init__(
        self,
        container_id: str = CLOUDKIT_CONTAINER,
        environment: str = "development",  # or "production"
        key_id: str = None,                # From CloudKit Dashboard
        private_key: str = None,           # EC P-256 private key
    ): ...

    async def fetch_records(
        self,
        record_type: RecordType,
        filter_by: Optional[dict] = None,
        sort_by: Optional[str] = None,
    ) -> list[dict]: ...

    async def save_records(
        self,
        records: list[CloudKitRecord],
        batch_size: int = 200,
    ) -> BatchResult: ...

    async def delete_records(
        self,
        record_names: list[str],
        record_type: RecordType,
    ) -> BatchResult: ...
```

**Authentication:**

- Uses an EC P-256 key pair
- JWT tokens signed with the private key
- Tokens valid for 30 minutes

### Record Differ

```python
class RecordDiffer:
    """Compares local records with CloudKit records."""

    def diff_games(self, local: list[Game], remote: list[dict]) -> DiffResult: ...
    def diff_teams(self, local: list[Team], remote: list[dict]) -> DiffResult: ...
    def diff_stadiums(self, local: list[Stadium], remote: list[dict]) -> DiffResult: ...
```
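The core diffing idea can be sketched independently of CloudKit: partition record names by presence on each side and by field equality. `diff_records` below is an illustrative helper, with records modeled as plain dicts keyed by record name:

```python
def diff_records(local: dict, remote: dict):
    """Partition record names into creates/updates/deletes/unchanged.

    local and remote map record name -> dict of fields.
    """
    creates = sorted(k for k in local if k not in remote)
    deletes = sorted(k for k in remote if k not in local)
    updates = sorted(k for k in local if k in remote and local[k] != remote[k])
    unchanged = sorted(k for k in local if k in remote and local[k] == remote[k])
    return creates, updates, deletes, unchanged
```

Uploading only `creates + updates` is what keeps re-runs cheap when most of a season's schedule is already in CloudKit.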
**DiffResult:**

```python
@dataclass
class DiffResult:
    creates: list[RecordDiff]    # New records to create
    updates: list[RecordDiff]    # Changed records to update
    deletes: list[RecordDiff]    # Remote records to delete
    unchanged: list[RecordDiff]  # Records with no changes

    def get_records_to_upload(self) -> list[CloudKitRecord]:
        """Returns creates + updates ready for upload."""
```

### State Manager

```python
class StateManager:
    """Manages resumable upload state."""

    def load_session(self, sport, season, environment) -> Optional[UploadSession]: ...
    def save_session(self, session: UploadSession) -> None: ...
    def get_session_or_create(
        self,
        sport, season, environment,
        record_names: list[tuple[str, str]],
        resume: bool = False,
    ) -> UploadSession: ...
```

**State persistence:**

- Stored in `.parser_state/upload_state_{sport}_{season}_{env}.json`
- Tracks: pending, uploaded, failed records
- Supports retry with backoff

## Utilities

### HTTP Client

```python
class RateLimitedSession:
    """HTTP session with rate limiting and exponential backoff."""

    def __init__(
        self,
        delay: float = 3.0,  # Seconds between requests
        max_retries: int = 3,
        backoff_factor: float = 2.0,
    ): ...

    def get(self, url, **kwargs) -> Response: ...
    def get_json(self, url, **kwargs) -> dict: ...
    def get_html(self, url, **kwargs) -> str: ...
```

**Features:**

- User-agent rotation (5 different Chrome/Firefox/Safari agents)
- Per-domain rate limiting
- Automatic 429 handling with exponential backoff + jitter
- Connection pooling
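The retry schedule implied by `INITIAL_BACKOFF`, `BACKOFF_FACTOR`, and `MAX_RETRIES` can be sketched as a generator. This is a simplified sketch; the real client layers per-domain state and 429 handling on top:

```python
import random

def backoff_delays(retries=3, initial=5.0, factor=2.0, jitter=0.1):
    """Yield exponential backoff delays with +/- proportional jitter."""
    for attempt in range(retries):
        delay = initial * factor ** attempt
        # Jitter spreads retries out so clients don't hammer a source in lockstep.
        yield delay * (1 + random.uniform(-jitter, jitter))
```

With the documented defaults (initial 5.0s, factor 2.0, 3 retries) the undithered schedule is 5s, 10s, 20s.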
### Logging

```python
from sportstime_parser.utils import get_logger, log_success, log_error

logger = get_logger()  # Rich-formatted logger
logger.info("Starting scrape")
log_success("Scraped 1230 games")  # Green checkmark
log_error("Failed to parse")       # Red X
```

**Log output:**

- Console: Rich-formatted with colors
- File: `logs/parser_{timestamp}.log`

### Progress Tracking

```python
from sportstime_parser.utils import ScrapeProgress, track_progress

# Specialized scrape tracking
progress = ScrapeProgress("nba", 2025)
progress.start()

with progress.scraping_schedule(total_months=9) as advance:
    for month in months:
        fetch(month)
        advance()

progress.finish()  # Prints summary

# Generic progress bar
for game in track_progress(games, "Processing games"):
    process(game)
```

## Manual Review Workflow

When the system can't confidently resolve a team or stadium:

1. **Low-confidence fuzzy match** (< 85%):

   ```
   ManualReviewItem(
       item_type="team",
       raw_value="LA Lakers",
       suggested_id="team_nba_lal",
       confidence=0.82,
       reason="Fuzzy match below threshold"
   )
   ```

2. **No match found**:

   ```
   ManualReviewItem(
       raw_value="Unknown Team FC",
       suggested_id=None,
       confidence=0.0,
       reason="No match found in canonical mappings"
   )
   ```

3. **Ambiguous match** (multiple candidates):

   ```
   ManualReviewItem(
       raw_value="LA",
       suggested_id="team_nba_lac",
       confidence=0.5,
       reason="Ambiguous: could be Lakers or Clippers"
   )
   ```

**Resolution:**

- Review items are exported to JSON
- Manually verify each item and add it to `team_aliases.json` or `stadium_aliases.json`
- Re-run the scrape; the new aliases will be used for resolution
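The export step can be sketched as follows. `export_review_items` is a hypothetical helper (real review items carry the full `ManualReviewItem` fields); sorting by confidence surfaces the closest misses, which are usually the fastest to triage:

```python
import json

def export_review_items(items, path=None, threshold=0.85):
    """Dump sub-threshold review items to JSON, closest misses first."""
    flagged = sorted(
        (item for item in items if item["confidence"] < threshold),
        key=lambda item: item["confidence"],
        reverse=True,
    )
    payload = json.dumps(flagged, indent=2)
    if path is not None:
        path.write_text(payload)  # path is an optional pathlib.Path
    return payload
```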
## Adding a New Sport

1. **Create scraper** in `scrapers/{sport}.py`:

   ```python
   class NewSportScraper(BaseScraper):
       def __init__(self, season: int, **kwargs):
           super().__init__("newsport", season, **kwargs)
           self._team_resolver = get_team_resolver("newsport")
           self._stadium_resolver = get_stadium_resolver("newsport")

       def _get_sources(self) -> list[str]:
           return ["primary_source", "backup_source"]

       def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
           # Implement source-specific scraping
           ...

       def _normalize_games(self, raw_games) -> tuple[list[Game], list[ManualReviewItem]]:
           # Use resolvers to normalize
           ...

       def scrape_teams(self) -> list[Team]:
           # Return canonical team list
           ...

       def scrape_stadiums(self) -> list[Stadium]:
           # Return canonical stadium list
           ...
   ```

2. **Add team mappings** in `normalizers/team_resolver.py`:

   ```python
   TEAM_MAPPINGS["newsport"] = {
       "ABC": ("team_newsport_abc", "Full Team Name", "City"),
       ...
   }
   ```

3. **Add stadium mappings** in `normalizers/stadium_resolver.py`:

   ```python
   STADIUM_MAPPINGS["newsport"] = {
       "stadium_newsport_venue": StadiumInfo(
           name="Venue Name",
           city="City",
           state="State",
           country="USA",
           latitude=40.0,
           longitude=-74.0,
       ),
       ...
   }
   ```

4. **Add to `league_structure.json`** (if hierarchical)

5. **Update `config.py`**:

   ```python
   EXPECTED_GAME_COUNTS["newsport"] = 500
   ```

6. **Export from `__init__.py`**

## Troubleshooting

### Rate Limiting (429 errors)

The system handles these automatically with exponential backoff. If the problem persists:

- Increase `DEFAULT_REQUEST_DELAY` in `config.py`
- Check whether the source has changed its rate limits

### Missing Teams/Stadiums

1. Check the scraper logs for the raw values
2. Add them to `team_aliases.json` or `stadium_aliases.json`
3. Or add them to the canonical mappings if it's a new team or stadium

### CloudKit Authentication Errors

1. Verify that `key_id` matches the CloudKit Dashboard
2. Check the private key format (EC P-256, PEM)
3. Ensure the container identifier is correct

### Incomplete Scrapes

The system discards partial data on errors.
Check:

- `logs/` for error details
- Network connectivity
- Source website availability

### International Games Appearing

The NFL and NHL scrapers filter these automatically. If new locations emerge:

- Add them to `INTERNATIONAL_LOCATIONS` in the scraper
- Or add filtering logic for neutral-site games

## Contributing

1. Follow existing patterns for new scrapers
2. Always use canonical IDs
3. Add aliases for historical names
4. Include source URLs for traceability
5. Test with multiple seasons
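The "test with multiple seasons" guideline pairs naturally with the `EXPECTED_GAME_COUNTS` table from `config.py`. `within_expected` below is an illustrative check, not part of the package; the tolerance parameter allows for postponements and cancellations in historical seasons:

```python
def within_expected(sport, actual, expected_counts, tolerance=0.0):
    """Check a scraped game count against the expected total for a sport."""
    expected = expected_counts[sport]
    return abs(actual - expected) <= expected * tolerance
```

Running a check like this against two or three past seasons per sport is a cheap way to catch a scraper silently dropping a month of games.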