diff --git a/Scripts/README.md b/Scripts/README.md new file mode 100644 index 0000000..c8a476e --- /dev/null +++ b/Scripts/README.md @@ -0,0 +1,833 @@ +# SportsTime Parser + +A Python package for scraping, normalizing, and uploading sports schedule data to CloudKit for the SportsTime iOS app. + +## Table of Contents + +- [Overview](#overview) +- [Installation](#installation) +- [Quick Start](#quick-start) +- [Architecture](#architecture) +- [Directory Structure](#directory-structure) +- [Configuration](#configuration) +- [Data Models](#data-models) +- [Normalizers](#normalizers) +- [Scrapers](#scrapers) +- [Uploaders](#uploaders) +- [Utilities](#utilities) +- [Manual Review Workflow](#manual-review-workflow) +- [Adding a New Sport](#adding-a-new-sport) +- [Troubleshooting](#troubleshooting) + +## Overview + +The `sportstime_parser` package provides a complete pipeline for: + +1. **Scraping** game schedules from multiple sources (Basketball-Reference, ESPN, MLB API, etc.) +2. **Normalizing** raw data to canonical identifiers (teams, stadiums, games) +3. **Resolving** team/stadium names using exact matching, historical aliases, and fuzzy matching +4. 
**Uploading** data to CloudKit with diff-based sync and resumable uploads + +### Supported Sports + +| Sport | Code | Sources | Season Format | +|-------|------|---------|---------------| +| NBA | `nba` | Basketball-Reference, ESPN, CBS | Oct-Jun (split year) | +| MLB | `mlb` | Baseball-Reference, MLB API, ESPN | Mar-Nov (single year) | +| NFL | `nfl` | ESPN, Pro-Football-Reference, CBS | Sep-Feb (split year) | +| NHL | `nhl` | Hockey-Reference, NHL API, ESPN | Oct-Jun (split year) | +| MLS | `mls` | ESPN, FBref | Feb-Nov (single year) | +| WNBA | `wnba` | ESPN | May-Oct (single year) | +| NWSL | `nwsl` | ESPN | Mar-Nov (single year) | + +## Installation + +```bash +cd Scripts +pip install -r requirements.txt +``` + +### Dependencies + +- `requests` - HTTP requests with session management +- `beautifulsoup4` + `lxml` - HTML parsing +- `rapidfuzz` - Fuzzy string matching +- `pyjwt` + `cryptography` - CloudKit JWT authentication +- `rich` - Terminal UI (progress bars, logging) +- `pytz` / `timezonefinder` - Timezone detection + +## Quick Start + +### Scrape a Single Sport + +```python +from sportstime_parser.scrapers import create_nba_scraper + +scraper = create_nba_scraper(season=2025) +result = scraper.scrape_all() + +print(f"Games: {result.game_count}") +print(f"Teams: {result.team_count}") +print(f"Stadiums: {result.stadium_count}") +print(f"Needs review: {result.review_count}") +``` + +### Upload to CloudKit + +```python +from sportstime_parser.uploaders import CloudKitClient, RecordDiffer + +client = CloudKitClient(environment="development") +differ = RecordDiffer() + +# Compare local vs remote +diff = differ.diff_games(local_games, remote_records) + +# Upload changes +records = diff.get_records_to_upload() +result = await client.save_records(records) +``` + +## Architecture + +``` +┌─────────────────────────────────────────────────────────────────────────┐ +│ DATA SOURCES │ +│ Basketball-Reference │ ESPN API │ MLB API │ Hockey-Reference │ etc. 
│ +└────────────────────────────────┬────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────┐ +│ SCRAPERS │ +│ NBAScraper │ MLBScraper │ NFLScraper │ NHLScraper │ MLSScraper │ etc. │ +│ │ +│ Features: │ +│ • Multi-source fallback (try sources in priority order) │ +│ • Automatic rate limiting with exponential backoff │ +│ • Doubleheader detection │ +│ • International game filtering (NFL London, NHL Global Series) │ +└────────────────────────────────┬────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────┐ +│ NORMALIZERS │ +│ TeamResolver │ StadiumResolver │ CanonicalIdGenerator │ AliasLoader │ +│ │ +│ Resolution Strategy (in order): │ +│ 1. Exact match against canonical mappings │ +│ 2. Date-aware alias lookup (handles renames/relocations) │ +│ 3. Fuzzy matching with confidence threshold (85%) │ +│ 4. Flag for manual review if unresolved or low confidence │ +└────────────────────────────────┬────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────┐ +│ DATA MODELS │ +│ Game │ Team │ Stadium │ ManualReviewItem │ +│ │ +│ All models use canonical IDs: │ +│ • team_nba_lal (Los Angeles Lakers) │ +│ • stadium_nba_los_angeles_lakers (Crypto.com Arena) │ +│ • game_nba_2025_20251022_bos_lal (specific game) │ +└────────────────────────────────┬────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────────┐ +│ UPLOADERS │ +│ CloudKitClient │ RecordDiffer │ StateManager │ +│ │ +│ Features: │ +│ • JWT authentication with Apple's CloudKit Web Services │ +│ • Batch operations (up to 200 records per request) │ +│ • Diff-based sync (only upload changes) │ +│ • Resumable uploads with persistent state │ +└────────────────────────────────┬────────────────────────────────────────┘ + │ + ▼ 
+┌─────────────────────────────────────────────────────────────────────────┐ +│ CLOUDKIT │ +│ Public Database: Games, Teams, Stadiums, Aliases │ +└─────────────────────────────────────────────────────────────────────────┘ +``` + +## Directory Structure + +``` +Scripts/ +├── README.md # This file +├── requirements.txt # Python dependencies +├── pyproject.toml # Package configuration +├── league_structure.json # League hierarchy (conferences, divisions) +├── team_aliases.json # Historical team name mappings +├── stadium_aliases.json # Historical stadium name mappings +├── logs/ # Runtime logs (auto-created) +├── output/ # Scrape output files (auto-created) +└── sportstime_parser/ # Main package + ├── __init__.py + ├── config.py # Configuration constants + ├── SOURCES.md # Data source documentation + ├── models/ # Data classes + │ ├── game.py # Game model + │ ├── team.py # Team model + │ ├── stadium.py # Stadium model + │ └── aliases.py # Alias and ManualReviewItem models + ├── normalizers/ # Name resolution + │ ├── canonical_id.py # ID generation + │ ├── alias_loader.py # Alias loading and resolution + │ ├── fuzzy.py # Fuzzy string matching + │ ├── timezone.py # Timezone detection + │ ├── team_resolver.py # Team name resolution + │ └── stadium_resolver.py # Stadium name resolution + ├── scrapers/ # Sport-specific scrapers + │ ├── base.py # Abstract base scraper + │ ├── nba.py # NBA scraper + │ ├── mlb.py # MLB scraper + │ ├── nfl.py # NFL scraper + │ ├── nhl.py # NHL scraper + │ ├── mls.py # MLS scraper + │ ├── wnba.py # WNBA scraper + │ └── nwsl.py # NWSL scraper + ├── uploaders/ # CloudKit integration + │ ├── cloudkit.py # CloudKit Web Services client + │ ├── diff.py # Record diffing + │ └── state.py # Resumable upload state + └── utils/ # Shared utilities + ├── logging.py # Rich-based logging + ├── http.py # Rate-limited HTTP client + └── progress.py # Progress tracking +``` + +## Configuration + +### config.py + +Key configuration constants: + +```python +# 
Directories +SCRIPTS_DIR = Path(__file__).parent.parent # Scripts/ +OUTPUT_DIR = SCRIPTS_DIR / "output" # JSON output +STATE_DIR = SCRIPTS_DIR / ".parser_state" # Upload state + +# CloudKit +CLOUDKIT_CONTAINER = "iCloud.com.sportstime.app" +CLOUDKIT_ENVIRONMENT = "development" # or "production" + +# Rate Limiting +DEFAULT_REQUEST_DELAY = 3.0 # seconds between requests +MAX_RETRIES = 3 # retry attempts +BACKOFF_FACTOR = 2.0 # exponential backoff multiplier +INITIAL_BACKOFF = 5.0 # initial backoff duration + +# Fuzzy Matching +FUZZY_THRESHOLD = 85 # minimum match confidence (0-100) + +# Expected game counts (for validation) +EXPECTED_GAME_COUNTS = { + "nba": 1230, # 30 teams × 82 games ÷ 2 + "mlb": 2430, # 30 teams × 162 games ÷ 2 + "nfl": 272, # Regular season only + "nhl": 1312, # 32 teams × 82 games ÷ 2 + "mls": 544, # 29 teams × ~34 games ÷ 2 + "wnba": 228, # 12 teams × 40 games ÷ 2 + "nwsl": 182, # 14 teams × 26 games ÷ 2 +} + +# Geography (for filtering international games) +ALLOWED_COUNTRIES = {"USA", "Canada"} +``` + +### league_structure.json + +Defines the hierarchical structure of each league: + +```json +{ + "nba": { + "name": "National Basketball Association", + "conferences": { + "Eastern": { + "divisions": { + "Atlantic": ["BOS", "BKN", "NYK", "PHI", "TOR"], + "Central": ["CHI", "CLE", "DET", "IND", "MIL"], + "Southeast": ["ATL", "CHA", "MIA", "ORL", "WAS"] + } + }, + "Western": { ... } + } + }, + "mlb": { ... }, + ... 
+} +``` + +### team_aliases.json / stadium_aliases.json + +Historical name mappings with validity dates: + +```json +{ + "team_mlb_athletics": [ + { + "alias": "Oakland Athletics", + "alias_type": "full_name", + "valid_from": "1968-01-01", + "valid_until": "2024-12-31" + }, + { + "alias": "Las Vegas Athletics", + "alias_type": "full_name", + "valid_from": "2028-01-01", + "valid_until": null + } + ] +} +``` + +## Data Models + +### Game + +```python +@dataclass +class Game: + id: str # Canonical ID: game_{sport}_{season}_{date}_{away}_{home} + sport: str # Sport code (nba, mlb, etc.) + season: int # Season start year + home_team_id: str # Canonical team ID + away_team_id: str # Canonical team ID + stadium_id: str # Canonical stadium ID + game_date: datetime # UTC datetime + game_number: Optional[int] # 1 or 2 for doubleheaders + home_score: Optional[int] # None if not played + away_score: Optional[int] + status: str # scheduled, final, postponed, cancelled + source_url: Optional[str] # For manual review + raw_home_team: Optional[str] # Original scraped value + raw_away_team: Optional[str] + raw_stadium: Optional[str] +``` + +### Team + +```python +@dataclass +class Team: + id: str # Canonical ID: team_{sport}_{abbrev} + sport: str + city: str # e.g., "Los Angeles" + name: str # e.g., "Lakers" + full_name: str # e.g., "Los Angeles Lakers" + abbreviation: str # e.g., "LAL" + conference: Optional[str] # e.g., "Western" + division: Optional[str] # e.g., "Pacific" + stadium_id: Optional[str] # Home stadium + primary_color: Optional[str] + secondary_color: Optional[str] + logo_url: Optional[str] +``` + +### Stadium + +```python +@dataclass +class Stadium: + id: str # Canonical ID: stadium_{sport}_{city_team} + sport: str + name: str # Current name (e.g., "Crypto.com Arena") + city: str + state: Optional[str] + country: str + latitude: Optional[float] + longitude: Optional[float] + capacity: Optional[int] + surface: Optional[str] # grass, turf, ice, hardwood + roof_type: 
Optional[str] # dome, retractable, open + opened_year: Optional[int] + image_url: Optional[str] + timezone: Optional[str] +``` + +### ManualReviewItem + +```python +@dataclass +class ManualReviewItem: + item_type: str # "team" or "stadium" + raw_value: str # Original scraped value + suggested_id: Optional[str] # Best fuzzy match (if any) + confidence: float # 0.0 - 1.0 + reason: str # Why review is needed + source_url: Optional[str] # Where it came from + sport: str + check_date: Optional[date] # For date-aware alias lookup +``` + +## Normalizers + +### Canonical ID Generation + +IDs are deterministic and immutable: + +```python +# Team ID +generate_team_id("nba", "LAL") +# → "team_nba_lal" + +# Stadium ID +generate_stadium_id("nba", "Los Angeles", "Lakers") +# → "stadium_nba_los_angeles_lakers" + +# Game ID +generate_game_id( + sport="nba", + season=2025, + away_abbrev="BOS", + home_abbrev="LAL", + game_date=datetime(2025, 10, 22), + game_number=None +) +# → "game_nba_2025_20251022_bos_lal" + +# Doubleheader Game ID +generate_game_id(..., game_number=2) +# → "game_nba_2025_20251022_bos_lal_2" +``` + +### Team Resolution + +The `TeamResolver` uses a three-stage strategy: + +```python +resolver = get_team_resolver("nba") +result = resolver.resolve( + "Los Angeles Lakers", + check_date=date(2025, 10, 22), + source_url="https://..." +) + +# Result: +# - canonical_id: "team_nba_lal" +# - confidence: 1.0 (exact match) +# - review_item: None +``` + +**Resolution stages:** + +1. **Exact Match**: Check against canonical team mappings + - Full name: "Los Angeles Lakers" + - City + Name: "Los Angeles" + "Lakers" + - Abbreviation: "LAL" + +2. **Alias Lookup**: Check historical aliases with date awareness + - "Oakland Athletics" → "team_mlb_athletics" (valid until 2024-12-31) + - Handles relocations: "Oakland" → "Las Vegas" transition + +3. 
**Fuzzy Match**: Use rapidfuzz with 85% threshold + - "LA Lakers" → "Los Angeles Lakers" (92% match) + - Low-confidence matches flagged for review + +### Stadium Resolution + +Similar three-stage strategy with additional location awareness: + +```python +resolver = get_stadium_resolver("nba") +result = resolver.resolve( + "Crypto.com Arena", + check_date=date(2025, 10, 22) +) +``` + +**Key features:** +- Handles naming rights changes (Staples Center → Crypto.com Arena) +- Date-aware: "Staples Center" resolves correctly for historical games +- Location-based fallback using latitude/longitude + +## Scrapers + +### Base Scraper + +All scrapers extend `BaseScraper` with these features: + +```python +class BaseScraper(ABC): + def __init__(self, sport: str, season: int): ... + + # Required implementations + def _get_sources(self) -> list[str]: ... + def _scrape_games_from_source(self, source: str) -> list[RawGameData]: ... + def _normalize_games(self, raw_games) -> tuple[list[Game], list[ManualReviewItem]]: ... + def scrape_teams(self) -> list[Team]: ... + def scrape_stadiums(self) -> list[Stadium]: ... + + # Built-in features + def scrape_games(self) -> ScrapeResult: + """Multi-source fallback - tries each source in order.""" + ... + + def scrape_all(self) -> ScrapeResult: + """Scrapes games, teams, and stadiums with progress tracking.""" + ... +``` + +### NBA Scraper + +```python +class NBAScraper(BaseScraper): + """ + Sources (in priority order): + 1. Basketball-Reference - HTML tables, monthly pages + 2. ESPN API - JSON, per-date queries + 3. CBS Sports - Backup (not implemented) + + Season: October to June (split year, e.g., 2025-26) + """ +``` + +**Basketball-Reference parsing:** +- URL: `https://www.basketball-reference.com/leagues/NBA_{year}_games-{month}.html` +- Table columns: date_game, visitor_team_name, home_team_name, visitor_pts, home_pts, arena_name + +### MLB Scraper + +```python +class MLBScraper(BaseScraper): + """ + Sources: + 1. 
Baseball-Reference - Single page per season + 2. MLB Stats API - Official API with date range queries + 3. ESPN API - Backup + + Season: March to November (single year) + Handles: Doubleheaders with game_number + """ +``` + +### NFL Scraper + +```python +class NFLScraper(BaseScraper): + """ + Sources: + 1. ESPN API - Week-based queries + 2. Pro-Football-Reference - Single page per season + + Season: September to February (split year) + Filters: International games (London, Mexico City, Frankfurt) + Scrapes: Preseason (4 weeks), Regular (18 weeks), Postseason (4 rounds) + """ +``` + +### NHL Scraper + +```python +class NHLScraper(BaseScraper): + """ + Sources: + 1. Hockey-Reference - Single page per season + 2. NHL API - New API (api-web.nhle.com) + 3. ESPN API - Backup + + Season: October to June (split year) + Filters: International games (Prague, Stockholm, Helsinki) + """ +``` + +### MLS / WNBA / NWSL Scrapers + +All use ESPN API as primary source with similar structure: +- Single calendar year seasons +- Conference-based organization (MLS) or single table (WNBA, NWSL) + +## Uploaders + +### CloudKit Client + +```python +class CloudKitClient: + """CloudKit Web Services API client with JWT authentication.""" + + def __init__( + self, + container_id: str = CLOUDKIT_CONTAINER, + environment: str = "development", # or "production" + key_id: str = None, # From CloudKit Dashboard + private_key: str = None, # EC P-256 private key + ): ... + + async def fetch_records( + self, + record_type: RecordType, + filter_by: Optional[dict] = None, + sort_by: Optional[str] = None, + ) -> list[dict]: ... + + async def save_records( + self, + records: list[CloudKitRecord], + batch_size: int = 200, + ) -> BatchResult: ... + + async def delete_records( + self, + record_names: list[str], + record_type: RecordType, + ) -> BatchResult: ... 
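+
+    # Usage sketch (illustrative only, not part of the class). `RecordType.GAME`
+    # and the key-file path below are assumptions, not confirmed names:
+    #
+    #   client = CloudKitClient(environment="development",
+    #                           key_id="...",  # from CloudKit Dashboard
+    #                           private_key=Path("...").read_text())
+    #   remote = await client.fetch_records(RecordType.GAME)
+    #   result = await client.save_records(records, batch_size=200)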
+``` + +**Authentication:** +- Uses EC P-256 key pair +- JWT tokens signed with private key +- Tokens valid for 30 minutes + +### Record Differ + +```python +class RecordDiffer: + """Compares local records with CloudKit records.""" + + def diff_games(self, local: list[Game], remote: list[dict]) -> DiffResult: ... + def diff_teams(self, local: list[Team], remote: list[dict]) -> DiffResult: ... + def diff_stadiums(self, local: list[Stadium], remote: list[dict]) -> DiffResult: ... +``` + +**DiffResult:** +```python +@dataclass +class DiffResult: + creates: list[RecordDiff] # New records to create + updates: list[RecordDiff] # Changed records to update + deletes: list[RecordDiff] # Remote records to delete + unchanged: list[RecordDiff] # Records with no changes + + def get_records_to_upload(self) -> list[CloudKitRecord]: + """Returns creates + updates ready for upload.""" +``` + +### State Manager + +```python +class StateManager: + """Manages resumable upload state.""" + + def load_session(self, sport, season, environment) -> Optional[UploadSession]: ... + def save_session(self, session: UploadSession) -> None: ... + def get_session_or_create( + self, + sport, season, environment, + record_names: list[tuple[str, str]], + resume: bool = False, + ) -> UploadSession: ... +``` + +**State persistence:** +- Stored in `.parser_state/upload_state_{sport}_{season}_{env}.json` +- Tracks: pending, uploaded, failed records +- Supports retry with backoff + +## Utilities + +### HTTP Client + +```python +class RateLimitedSession: + """HTTP session with rate limiting and exponential backoff.""" + + def __init__( + self, + delay: float = 3.0, # Seconds between requests + max_retries: int = 3, + backoff_factor: float = 2.0, + ): ... + + def get(self, url, **kwargs) -> Response: ... + def get_json(self, url, **kwargs) -> dict: ... + def get_html(self, url, **kwargs) -> str: ... 
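+
+    # Usage sketch (illustrative; URLs elided, defaults mirror config.py):
+    #
+    #   session = RateLimitedSession(delay=DEFAULT_REQUEST_DELAY)
+    #   html = session.get_html("https://www.basketball-reference.com/...")
+    #   data = session.get_json("https://statsapi.mlb.com/...")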
+``` + +**Features:** +- User-agent rotation (5 different Chrome/Firefox/Safari agents) +- Per-domain rate limiting +- Automatic 429 handling with exponential backoff + jitter +- Connection pooling + +### Logging + +```python +from sportstime_parser.utils import get_logger, log_success, log_error + +logger = get_logger() # Rich-formatted logger +logger.info("Starting scrape") + +log_success("Scraped 1230 games") # Green checkmark +log_error("Failed to parse") # Red X +``` + +**Log output:** +- Console: Rich-formatted with colors +- File: `logs/parser_{timestamp}.log` + +### Progress Tracking + +```python +from sportstime_parser.utils import ScrapeProgress, track_progress + +# Specialized scrape tracking +progress = ScrapeProgress("nba", 2025) +progress.start() + +with progress.scraping_schedule(total_months=9) as advance: + for month in months: + fetch(month) + advance() + +progress.finish() # Prints summary + +# Generic progress bar +for game in track_progress(games, "Processing games"): + process(game) +``` + +## Manual Review Workflow + +When the system can't confidently resolve a team or stadium: + +1. **Low confidence fuzzy match** (< 85%): + ``` + ManualReviewItem( + item_type="team", + raw_value="LA Lakers", + suggested_id="team_nba_lal", + confidence=0.82, + reason="Fuzzy match below threshold" + ) + ``` + +2. **No match found**: + ``` + ManualReviewItem( + raw_value="Unknown Team FC", + suggested_id=None, + confidence=0.0, + reason="No match found in canonical mappings" + ) + ``` + +3. **Ambiguous match** (multiple candidates): + ``` + ManualReviewItem( + raw_value="LA", + suggested_id="team_nba_lac", + confidence=0.5, + reason="Ambiguous: could be Lakers or Clippers" + ) + ``` + +**Resolution:** +- Review items are exported to JSON +- Manually verify and add to `team_aliases.json` or `stadium_aliases.json` +- Re-run scrape - aliases will be used for resolution + +## Adding a New Sport + +1. 
**Create scraper** in `scrapers/{sport}.py`: + ```python + class NewSportScraper(BaseScraper): + def __init__(self, season: int, **kwargs): + super().__init__("newsport", season, **kwargs) + self._team_resolver = get_team_resolver("newsport") + self._stadium_resolver = get_stadium_resolver("newsport") + + def _get_sources(self) -> list[str]: + return ["primary_source", "backup_source"] + + def _scrape_games_from_source(self, source: str) -> list[RawGameData]: + # Implement source-specific scraping + ... + + def _normalize_games(self, raw_games) -> tuple[list[Game], list[ManualReviewItem]]: + # Use resolvers to normalize + ... + + def scrape_teams(self) -> list[Team]: + # Return canonical team list + ... + + def scrape_stadiums(self) -> list[Stadium]: + # Return canonical stadium list + ... + ``` + +2. **Add team mappings** in `normalizers/team_resolver.py`: + ```python + TEAM_MAPPINGS["newsport"] = { + "ABC": ("team_newsport_abc", "Full Team Name", "City"), + ... + } + ``` + +3. **Add stadium mappings** in `normalizers/stadium_resolver.py`: + ```python + STADIUM_MAPPINGS["newsport"] = { + "stadium_newsport_venue": StadiumInfo( + name="Venue Name", + city="City", + state="State", + country="USA", + latitude=40.0, + longitude=-74.0, + ), + ... + } + ``` + +4. **Add to league_structure.json** (if hierarchical) + +5. **Update config.py**: + ```python + EXPECTED_GAME_COUNTS["newsport"] = 500 + ``` + +6. **Export from `__init__.py`** + +## Troubleshooting + +### Rate Limiting (429 errors) + +The system handles these automatically with exponential backoff. If persistent: +- Increase `DEFAULT_REQUEST_DELAY` in config.py +- Check if source has changed their rate limits + +### Missing Teams/Stadiums + +1. Check scraper logs for raw values +2. Add to `team_aliases.json` or `stadium_aliases.json` +3. Or add to canonical mappings if it's a new team/stadium + +### CloudKit Authentication Errors + +1. Verify key_id matches CloudKit Dashboard +2. 
Check private key format (EC P-256, PEM) +3. Ensure container identifier is correct + +### Incomplete Scrapes + +The system discards partial data on errors. Check: +- `logs/` for error details +- Network connectivity +- Source website availability + +### International Games Appearing + +NFL and NHL scrapers filter these automatically. If new locations emerge: +- Add to `INTERNATIONAL_LOCATIONS` in the scraper +- Or add filtering logic for neutral site games + +## Contributing + +1. Follow existing patterns for new scrapers +2. Always use canonical IDs +3. Add aliases for historical names +4. Include source URLs for traceability +5. Test with multiple seasons
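+
+## Appendix: Canonical ID Sketch
+
+For quick reference, the canonical ID formats documented above can be
+reproduced in a few lines. This is an illustrative sketch only; the real
+generators live in `normalizers/canonical_id.py` and may handle additional
+edge cases (punctuation, accents, city slugs):
+
+```python
+from datetime import datetime
+from typing import Optional
+
+
+def generate_team_id(sport: str, abbrev: str) -> str:
+    """e.g. ("nba", "LAL") -> "team_nba_lal"."""
+    return f"team_{sport}_{abbrev.lower()}"
+
+
+def generate_stadium_id(sport: str, city: str, name: str) -> str:
+    """e.g. ("nba", "Los Angeles", "Lakers") -> "stadium_nba_los_angeles_lakers"."""
+    return f"stadium_{sport}_{city}_{name}".lower().replace(" ", "_")
+
+
+def generate_game_id(
+    sport: str,
+    season: int,
+    away_abbrev: str,
+    home_abbrev: str,
+    game_date: datetime,
+    game_number: Optional[int] = None,
+) -> str:
+    """Doubleheaders get a trailing _{game_number} suffix."""
+    base = (
+        f"game_{sport}_{season}_{game_date:%Y%m%d}"
+        f"_{away_abbrev.lower()}_{home_abbrev.lower()}"
+    )
+    return f"{base}_{game_number}" if game_number is not None else base
+```
+
+Because the IDs are pure functions of already-normalized inputs, re-running a
+scrape always regenerates the same IDs, which is what makes the diff-based
+CloudKit sync possible.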