# SportsTime Parser
A Python package for scraping, normalizing, and uploading sports schedule data to CloudKit for the SportsTime iOS app.
## Table of Contents
- [Overview](#overview)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Architecture](#architecture)
- [Directory Structure](#directory-structure)
- [Configuration](#configuration)
- [Data Models](#data-models)
- [Normalizers](#normalizers)
- [Scrapers](#scrapers)
- [Uploaders](#uploaders)
- [Utilities](#utilities)
- [Manual Review Workflow](#manual-review-workflow)
- [Adding a New Sport](#adding-a-new-sport)
- [Troubleshooting](#troubleshooting)
## Overview
The `sportstime_parser` package provides a complete pipeline for:
1. **Scraping** game schedules from multiple sources (Basketball-Reference, ESPN, MLB API, etc.)
2. **Normalizing** raw data to canonical identifiers (teams, stadiums, games)
3. **Resolving** team/stadium names using exact matching, historical aliases, and fuzzy matching
4. **Uploading** data to CloudKit with diff-based sync and resumable uploads
### Supported Sports
| Sport | Code | Sources | Season Format |
|-------|------|---------|---------------|
| NBA | `nba` | Basketball-Reference, ESPN, CBS | Oct-Jun (split year) |
| MLB | `mlb` | Baseball-Reference, MLB API, ESPN | Mar-Nov (single year) |
| NFL | `nfl` | ESPN, Pro-Football-Reference, CBS | Sep-Feb (split year) |
| NHL | `nhl` | Hockey-Reference, NHL API, ESPN | Oct-Jun (split year) |
| MLS | `mls` | ESPN, FBref | Feb-Nov (single year) |
| WNBA | `wnba` | ESPN | May-Oct (single year) |
| NWSL | `nwsl` | ESPN | Mar-Nov (single year) |
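
The split-year vs. single-year formats in the table above can be illustrated with a small helper. This is a hypothetical sketch, not part of the package; `SPLIT_YEAR_SPORTS` and `season_label` are illustrative names:

```python
# Hypothetical helper: renders a human-readable season label from the
# season start year, using the formats in the table above.
SPLIT_YEAR_SPORTS = {"nba", "nfl", "nhl"}  # seasons span two calendar years

def season_label(sport: str, season: int) -> str:
    """'2025-26' for split-year sports, '2025' for single-year sports."""
    if sport in SPLIT_YEAR_SPORTS:
        return f"{season}-{(season + 1) % 100:02d}"
    return str(season)
```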
## Installation
```bash
cd Scripts
pip install -r requirements.txt
```
### Dependencies
- `requests` - HTTP requests with session management
- `beautifulsoup4` + `lxml` - HTML parsing
- `rapidfuzz` - Fuzzy string matching
- `pyjwt` + `cryptography` - CloudKit JWT authentication
- `rich` - Terminal UI (progress bars, logging)
- `pytz` / `timezonefinder` - Timezone detection
## Quick Start
### Scrape a Single Sport
```python
from sportstime_parser.scrapers import create_nba_scraper
scraper = create_nba_scraper(season=2025)
result = scraper.scrape_all()
print(f"Games: {result.game_count}")
print(f"Teams: {result.team_count}")
print(f"Stadiums: {result.stadium_count}")
print(f"Needs review: {result.review_count}")
```
### Upload to CloudKit
```python
import asyncio

from sportstime_parser.uploaders import CloudKitClient, RecordDiffer

async def main():
    client = CloudKitClient(environment="development")
    differ = RecordDiffer()

    # Compare local vs remote
    diff = differ.diff_games(local_games, remote_records)

    # Upload changes
    records = diff.get_records_to_upload()
    result = await client.save_records(records)

asyncio.run(main())
```
## Architecture
```
DATA SOURCES
  Basketball-Reference │ ESPN API │ MLB API │ Hockey-Reference │ etc.
                │
                ▼
SCRAPERS
  NBAScraper │ MLBScraper │ NFLScraper │ NHLScraper │ MLSScraper │ etc.
  • Multi-source fallback (try sources in priority order)
  • Automatic rate limiting with exponential backoff
  • Doubleheader detection
  • International game filtering (NFL London, NHL Global Series)
                │
                ▼
NORMALIZERS
  TeamResolver │ StadiumResolver │ CanonicalIdGenerator │ AliasLoader
  Resolution strategy (in order):
    1. Exact match against canonical mappings
    2. Date-aware alias lookup (handles renames/relocations)
    3. Fuzzy matching with confidence threshold (85%)
    4. Flag for manual review if unresolved or low confidence
                │
                ▼
DATA MODELS
  Game │ Team │ Stadium │ ManualReviewItem
  All models use canonical IDs:
  • team_nba_lal (Los Angeles Lakers)
  • stadium_nba_los_angeles_lakers (Crypto.com Arena)
  • game_nba_2025_20251022_bos_lal (specific game)
                │
                ▼
UPLOADERS
  CloudKitClient │ RecordDiffer │ StateManager
  • JWT authentication with Apple's CloudKit Web Services
  • Batch operations (up to 200 records per request)
  • Diff-based sync (only upload changes)
  • Resumable uploads with persistent state
                │
                ▼
CLOUDKIT
  Public database: Games, Teams, Stadiums, Aliases
```
## Directory Structure
```
Scripts/
├── README.md                  # This file
├── requirements.txt           # Python dependencies
├── pyproject.toml             # Package configuration
├── league_structure.json      # League hierarchy (conferences, divisions)
├── team_aliases.json          # Historical team name mappings
├── stadium_aliases.json       # Historical stadium name mappings
├── logs/                      # Runtime logs (auto-created)
├── output/                    # Scrape output files (auto-created)
└── sportstime_parser/         # Main package
    ├── __init__.py
    ├── config.py              # Configuration constants
    ├── SOURCES.md             # Data source documentation
    ├── models/                # Data classes
    │   ├── game.py            # Game model
    │   ├── team.py            # Team model
    │   ├── stadium.py         # Stadium model
    │   └── aliases.py         # Alias and ManualReviewItem models
    ├── normalizers/           # Name resolution
    │   ├── canonical_id.py    # ID generation
    │   ├── alias_loader.py    # Alias loading and resolution
    │   ├── fuzzy.py           # Fuzzy string matching
    │   ├── timezone.py        # Timezone detection
    │   ├── team_resolver.py   # Team name resolution
    │   └── stadium_resolver.py # Stadium name resolution
    ├── scrapers/              # Sport-specific scrapers
    │   ├── base.py            # Abstract base scraper
    │   ├── nba.py             # NBA scraper
    │   ├── mlb.py             # MLB scraper
    │   ├── nfl.py             # NFL scraper
    │   ├── nhl.py             # NHL scraper
    │   ├── mls.py             # MLS scraper
    │   ├── wnba.py            # WNBA scraper
    │   └── nwsl.py            # NWSL scraper
    ├── uploaders/             # CloudKit integration
    │   ├── cloudkit.py        # CloudKit Web Services client
    │   ├── diff.py            # Record diffing
    │   └── state.py           # Resumable upload state
    └── utils/                 # Shared utilities
        ├── logging.py         # Rich-based logging
        ├── http.py            # Rate-limited HTTP client
        └── progress.py        # Progress tracking
```
## Configuration
### config.py
Key configuration constants:
```python
# Directories
SCRIPTS_DIR = Path(__file__).parent.parent # Scripts/
OUTPUT_DIR = SCRIPTS_DIR / "output" # JSON output
STATE_DIR = SCRIPTS_DIR / ".parser_state" # Upload state
# CloudKit
CLOUDKIT_CONTAINER = "iCloud.com.sportstime.app"
CLOUDKIT_ENVIRONMENT = "development" # or "production"
# Rate Limiting
DEFAULT_REQUEST_DELAY = 3.0 # seconds between requests
MAX_RETRIES = 3 # retry attempts
BACKOFF_FACTOR = 2.0 # exponential backoff multiplier
INITIAL_BACKOFF = 5.0 # initial backoff duration
# Fuzzy Matching
FUZZY_THRESHOLD = 85 # minimum match confidence (0-100)
# Expected game counts (for validation)
EXPECTED_GAME_COUNTS = {
    "nba": 1230,   # 30 teams × 82 games ÷ 2
    "mlb": 2430,   # 30 teams × 162 games ÷ 2
    "nfl": 272,    # Regular season only
    "nhl": 1312,   # 32 teams × 82 games ÷ 2
    "mls": 544,    # Regular-season total (varies by season)
    "wnba": 228,   # 12 teams × 38 games ÷ 2
    "nwsl": 182,   # 14 teams × 26 games ÷ 2
}
# Geography (for filtering international games)
ALLOWED_COUNTRIES = {"USA", "Canada"}
```
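
The expected counts above can back a quick sanity check after a scrape. This is an illustrative sketch (the function name and ±2% tolerance are assumptions, not part of config.py):

```python
EXPECTED_GAME_COUNTS = {"nba": 1230, "nfl": 272}  # subset of the config above

def check_game_count(sport: str, actual: int, tolerance: float = 0.02) -> bool:
    """True if the scraped count is within ±2% of the expected total."""
    expected = EXPECTED_GAME_COUNTS.get(sport)
    if expected is None:
        return True  # no expectation configured for this sport; skip the check
    return abs(actual - expected) <= expected * tolerance
```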
### league_structure.json
Defines the hierarchical structure of each league:
```json
{
  "nba": {
    "name": "National Basketball Association",
    "conferences": {
      "Eastern": {
        "divisions": {
          "Atlantic": ["BOS", "BKN", "NYK", "PHI", "TOR"],
          "Central": ["CHI", "CLE", "DET", "IND", "MIL"],
          "Southeast": ["ATL", "CHA", "MIA", "ORL", "WAS"]
        }
      },
      "Western": { ... }
    }
  },
  "mlb": { ... },
  ...
}
```
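
Consumers of this file typically walk the conference/division tree. A minimal sketch, assuming the JSON has already been loaded into a dict (`teams_in_league` is an illustrative name, not a package function):

```python
def teams_in_league(structure: dict, sport: str) -> set[str]:
    """Collect every team abbreviation under a sport's conference/division tree."""
    teams: set[str] = set()
    for conference in structure[sport]["conferences"].values():
        for division_teams in conference["divisions"].values():
            teams.update(division_teams)
    return teams
```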
### team_aliases.json / stadium_aliases.json
Historical name mappings with validity dates:
```json
{
  "team_mlb_athletics": [
    {
      "alias": "Oakland Athletics",
      "alias_type": "full_name",
      "valid_from": "1968-01-01",
      "valid_until": "2024-12-31"
    },
    {
      "alias": "Las Vegas Athletics",
      "alias_type": "full_name",
      "valid_from": "2028-01-01",
      "valid_until": null
    }
  ]
}
```
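
The lookup these entries drive is date-aware: an alias resolves only if the game date falls inside its validity window. A self-contained sketch of that rule (not the actual AliasLoader code; the in-memory shape is an assumption):

```python
from datetime import date

# Mirrors the JSON above with dates parsed; None means open-ended.
ALIASES = {
    "oakland athletics": [
        {"id": "team_mlb_athletics",
         "valid_from": date(1968, 1, 1),
         "valid_until": date(2024, 12, 31)},
    ],
}

def resolve_alias(name: str, check_date: date):
    """Return the canonical ID whose validity window contains check_date."""
    for entry in ALIASES.get(name.lower(), []):
        after_start = entry["valid_from"] is None or check_date >= entry["valid_from"]
        before_end = entry["valid_until"] is None or check_date <= entry["valid_until"]
        if after_start and before_end:
            return entry["id"]
    return None  # no alias valid on that date; fall through to fuzzy matching
```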
## Data Models
### Game
```python
@dataclass
class Game:
    id: str                       # Canonical ID: game_{sport}_{season}_{date}_{away}_{home}
    sport: str                    # Sport code (nba, mlb, etc.)
    season: int                   # Season start year
    home_team_id: str             # Canonical team ID
    away_team_id: str             # Canonical team ID
    stadium_id: str               # Canonical stadium ID
    game_date: datetime           # UTC datetime
    game_number: Optional[int]    # 1 or 2 for doubleheaders
    home_score: Optional[int]     # None if not played
    away_score: Optional[int]
    status: str                   # scheduled, final, postponed, cancelled
    source_url: Optional[str]     # For manual review
    raw_home_team: Optional[str]  # Original scraped value
    raw_away_team: Optional[str]
    raw_stadium: Optional[str]
```
### Team
```python
@dataclass
class Team:
    id: str                     # Canonical ID: team_{sport}_{abbrev}
    sport: str
    city: str                   # e.g., "Los Angeles"
    name: str                   # e.g., "Lakers"
    full_name: str              # e.g., "Los Angeles Lakers"
    abbreviation: str           # e.g., "LAL"
    conference: Optional[str]   # e.g., "Western"
    division: Optional[str]     # e.g., "Pacific"
    stadium_id: Optional[str]   # Home stadium
    primary_color: Optional[str]
    secondary_color: Optional[str]
    logo_url: Optional[str]
```
### Stadium
```python
@dataclass
class Stadium:
    id: str                     # Canonical ID: stadium_{sport}_{city_team}
    sport: str
    name: str                   # Current name (e.g., "Crypto.com Arena")
    city: str
    state: Optional[str]
    country: str
    latitude: Optional[float]
    longitude: Optional[float]
    capacity: Optional[int]
    surface: Optional[str]      # grass, turf, ice, hardwood
    roof_type: Optional[str]    # dome, retractable, open
    opened_year: Optional[int]
    image_url: Optional[str]
    timezone: Optional[str]
```
### ManualReviewItem
```python
@dataclass
class ManualReviewItem:
    item_type: str               # "team" or "stadium"
    raw_value: str               # Original scraped value
    suggested_id: Optional[str]  # Best fuzzy match (if any)
    confidence: float            # 0.0 - 1.0
    reason: str                  # Why review is needed
    source_url: Optional[str]    # Where it came from
    sport: str
    check_date: Optional[date]   # For date-aware alias lookup
```
## Normalizers
### Canonical ID Generation
IDs are deterministic and immutable:
```python
# Team ID
generate_team_id("nba", "LAL")
# → "team_nba_lal"

# Stadium ID
generate_stadium_id("nba", "Los Angeles", "Lakers")
# → "stadium_nba_los_angeles_lakers"

# Game ID
generate_game_id(
    sport="nba",
    season=2025,
    away_abbrev="BOS",
    home_abbrev="LAL",
    game_date=datetime(2025, 10, 22),
    game_number=None,
)
# → "game_nba_2025_20251022_bos_lal"

# Doubleheader game ID
generate_game_id(..., game_number=2)
# → "game_nba_2025_20251022_bos_lal_2"
```
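
The ID scheme above can be sketched in a few lines. This is illustrative, not the package's actual implementation (e.g. how a `game_number` of 1 is handled is an assumption inferred from the examples):

```python
from datetime import datetime
from typing import Optional

def generate_game_id(sport: str, season: int, away_abbrev: str, home_abbrev: str,
                     game_date: datetime, game_number: Optional[int] = None) -> str:
    """Deterministic ID: game_{sport}_{season}_{yyyymmdd}_{away}_{home}[_{n}]."""
    game_id = (f"game_{sport}_{season}_{game_date:%Y%m%d}"
               f"_{away_abbrev.lower()}_{home_abbrev.lower()}")
    if game_number and game_number > 1:  # doubleheader suffix, per the example above
        game_id += f"_{game_number}"
    return game_id
```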
### Team Resolution
The `TeamResolver` uses a three-stage strategy:
```python
resolver = get_team_resolver("nba")
result = resolver.resolve(
    "Los Angeles Lakers",
    check_date=date(2025, 10, 22),
    source_url="https://...",
)
# Result:
# - canonical_id: "team_nba_lal"
# - confidence: 1.0 (exact match)
# - review_item: None
```
**Resolution stages:**

1. **Exact Match**: Check against canonical team mappings
   - Full name: "Los Angeles Lakers"
   - City + Name: "Los Angeles" + "Lakers"
   - Abbreviation: "LAL"
2. **Alias Lookup**: Check historical aliases with date awareness
   - "Oakland Athletics" → "team_mlb_athletics" (valid until 2024-12-31)
   - Handles relocations: the Oakland → Las Vegas transition
3. **Fuzzy Match**: Use rapidfuzz with an 85% threshold
   - "LA Lakers" → "Los Angeles Lakers" (92% match)
   - Low-confidence matches are flagged for review
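
Stages 1 and 3 can be illustrated with the standard library. In this sketch, `difflib` stands in for rapidfuzz so the example is self-contained, and the mapping and return shape are simplified assumptions:

```python
import difflib

FUZZY_THRESHOLD = 0.85  # config.py uses 85 on a 0-100 scale
CANONICAL = {
    "Los Angeles Lakers": "team_nba_lal",
    "Boston Celtics": "team_nba_bos",
}

def resolve_team(raw: str):
    """Return (canonical_id, confidence, review_item_or_None)."""
    # Stage 1: exact match (stage 2, alias lookup, omitted for brevity)
    if raw in CANONICAL:
        return CANONICAL[raw], 1.0, None
    # Stage 3: fuzzy match against known names
    matches = difflib.get_close_matches(raw, CANONICAL, n=1, cutoff=0.0)
    if not matches:
        return None, 0.0, {"raw_value": raw, "suggested_id": None}
    score = difflib.SequenceMatcher(None, raw, matches[0]).ratio()
    if score >= FUZZY_THRESHOLD:
        return CANONICAL[matches[0]], score, None
    # Stage 4: below threshold -- flag for manual review
    return None, score, {"raw_value": raw, "suggested_id": CANONICAL[matches[0]]}
```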
### Stadium Resolution
Similar three-stage strategy with additional location awareness:
```python
resolver = get_stadium_resolver("nba")
result = resolver.resolve(
    "Crypto.com Arena",
    check_date=date(2025, 10, 22),
)
```
**Key features:**
- Handles naming rights changes (Staples Center → Crypto.com Arena)
- Date-aware: "Staples Center" resolves correctly for historical games
- Location-based fallback using latitude/longitude
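
The location fallback reduces to a nearest-neighbour search over known stadium coordinates. A sketch with a hypothetical distance threshold (the package's actual radius and data shape may differ):

```python
import math

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two coordinates, in kilometres."""
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(dlon / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def nearest_stadium_id(lat: float, lon: float, stadiums: list, max_km: float = 1.0):
    """Match scraped coordinates to the closest known stadium, if close enough."""
    best = min(stadiums, key=lambda s: haversine_km(lat, lon, s["lat"], s["lon"]))
    if haversine_km(lat, lon, best["lat"], best["lon"]) <= max_km:
        return best["id"]
    return None  # too far from any known venue; flag for manual review instead
```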
## Scrapers
### Base Scraper
All scrapers extend `BaseScraper` with these features:
```python
class BaseScraper(ABC):
    def __init__(self, sport: str, season: int): ...

    # Required implementations
    def _get_sources(self) -> list[str]: ...
    def _scrape_games_from_source(self, source: str) -> list[RawGameData]: ...
    def _normalize_games(self, raw_games) -> tuple[list[Game], list[ManualReviewItem]]: ...
    def scrape_teams(self) -> list[Team]: ...
    def scrape_stadiums(self) -> list[Stadium]: ...

    # Built-in features
    def scrape_games(self) -> ScrapeResult:
        """Multi-source fallback - tries each source in order."""
        ...

    def scrape_all(self) -> ScrapeResult:
        """Scrapes games, teams, and stadiums with progress tracking."""
        ...
```
### NBA Scraper
```python
class NBAScraper(BaseScraper):
    """
    Sources (in priority order):
    1. Basketball-Reference - HTML tables, monthly pages
    2. ESPN API - JSON, per-date queries
    3. CBS Sports - Backup (not implemented)

    Season: October to June (split year, e.g., 2025-26)
    """
```
**Basketball-Reference parsing:**
- URL: `https://www.basketball-reference.com/leagues/NBA_{year}_games-{month}.html`
- Table columns: date_game, visitor_team_name, home_team_name, visitor_pts, home_pts, arena_name
### MLB Scraper
```python
class MLBScraper(BaseScraper):
    """
    Sources:
    1. Baseball-Reference - Single page per season
    2. MLB Stats API - Official API with date range queries
    3. ESPN API - Backup

    Season: March to November (single year)
    Handles: Doubleheaders with game_number
    """
```
### NFL Scraper
```python
class NFLScraper(BaseScraper):
    """
    Sources:
    1. ESPN API - Week-based queries
    2. Pro-Football-Reference - Single page per season

    Season: September to February (split year)
    Filters: International games (London, Mexico City, Frankfurt)
    Scrapes: Preseason (4 weeks), Regular (18 weeks), Postseason (4 rounds)
    """
```
### NHL Scraper
```python
class NHLScraper(BaseScraper):
    """
    Sources:
    1. Hockey-Reference - Single page per season
    2. NHL API - New API (api-web.nhle.com)
    3. ESPN API - Backup

    Season: October to June (split year)
    Filters: International games (Prague, Stockholm, Helsinki)
    """
```
### MLS / WNBA / NWSL Scrapers
All use ESPN API as primary source with similar structure:
- Single calendar year seasons
- Conference-based organization (MLS) or single table (WNBA, NWSL)
## Uploaders
### CloudKit Client
```python
class CloudKitClient:
    """CloudKit Web Services API client with JWT authentication."""

    def __init__(
        self,
        container_id: str = CLOUDKIT_CONTAINER,
        environment: str = "development",  # or "production"
        key_id: str = None,                # From CloudKit Dashboard
        private_key: str = None,           # EC P-256 private key
    ): ...

    async def fetch_records(
        self,
        record_type: RecordType,
        filter_by: Optional[dict] = None,
        sort_by: Optional[str] = None,
    ) -> list[dict]: ...

    async def save_records(
        self,
        records: list[CloudKitRecord],
        batch_size: int = 200,
    ) -> BatchResult: ...

    async def delete_records(
        self,
        record_names: list[str],
        record_type: RecordType,
    ) -> BatchResult: ...
```
**Authentication:**
- Uses EC P-256 key pair
- JWT tokens signed with private key
- Tokens valid for 30 minutes
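
Because tokens expire after 30 minutes, long upload runs need to re-sign before requests start failing. A stdlib sketch of that renewal logic (`issue_token` stands in for the actual pyjwt signing call; the 60-second renewal margin is an assumption):

```python
import time

TOKEN_TTL = 30 * 60  # seconds; matches the 30-minute validity noted above

class TokenCache:
    """Caches a signed token and reissues it shortly before expiry."""

    def __init__(self, issue_token, margin: float = 60.0):
        self._issue_token = issue_token  # callable returning a fresh signed token
        self._margin = margin            # renew this many seconds early
        self._token = None
        self._issued_at = float("-inf")

    def get(self):
        if time.time() - self._issued_at > TOKEN_TTL - self._margin:
            self._token = self._issue_token()
            self._issued_at = time.time()
        return self._token
```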
### Record Differ
```python
class RecordDiffer:
    """Compares local records with CloudKit records."""

    def diff_games(self, local: list[Game], remote: list[dict]) -> DiffResult: ...
    def diff_teams(self, local: list[Team], remote: list[dict]) -> DiffResult: ...
    def diff_stadiums(self, local: list[Stadium], remote: list[dict]) -> DiffResult: ...
```
**DiffResult:**
```python
@dataclass
class DiffResult:
    creates: list[RecordDiff]    # New records to create
    updates: list[RecordDiff]    # Changed records to update
    deletes: list[RecordDiff]    # Remote records to delete
    unchanged: list[RecordDiff]  # Records with no changes

    def get_records_to_upload(self) -> list[CloudKitRecord]:
        """Returns creates + updates ready for upload."""
```
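
The diff is keyed on record name (the canonical ID) with a field-level comparison against the remote copy. A simplified sketch of how the four buckets fall out (the dict-of-dicts shape is an assumption):

```python
def diff_records(local: dict, remote: dict):
    """local/remote map record name -> field dict; returns the four buckets."""
    creates = [name for name in local if name not in remote]
    deletes = [name for name in remote if name not in local]
    updates = [name for name in local
               if name in remote and local[name] != remote[name]]
    unchanged = [name for name in local
                 if name in remote and local[name] == remote[name]]
    return creates, updates, deletes, unchanged
```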
### State Manager
```python
class StateManager:
    """Manages resumable upload state."""

    def load_session(self, sport, season, environment) -> Optional[UploadSession]: ...
    def save_session(self, session: UploadSession) -> None: ...

    def get_session_or_create(
        self,
        sport, season, environment,
        record_names: list[tuple[str, str]],
        resume: bool = False,
    ) -> UploadSession: ...
```
**State persistence:**
- Stored in `.parser_state/upload_state_{sport}_{season}_{env}.json`
- Tracks: pending, uploaded, failed records
- Supports retry with backoff
## Utilities
### HTTP Client
```python
class RateLimitedSession:
    """HTTP session with rate limiting and exponential backoff."""

    def __init__(
        self,
        delay: float = 3.0,  # Seconds between requests
        max_retries: int = 3,
        backoff_factor: float = 2.0,
    ): ...

    def get(self, url, **kwargs) -> Response: ...
    def get_json(self, url, **kwargs) -> dict: ...
    def get_html(self, url, **kwargs) -> str: ...
```
**Features:**
- User-agent rotation (5 different Chrome/Firefox/Safari agents)
- Per-domain rate limiting
- Automatic 429 handling with exponential backoff + jitter
- Connection pooling
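
The retry schedule follows the exponential-backoff-plus-jitter pattern described above; this sketch uses the defaults from config.py, while the jitter fraction is an assumption:

```python
import random

def backoff_delays(max_retries: int = 3, initial: float = 5.0,
                   factor: float = 2.0, jitter: float = 0.5):
    """Yield the delay before each retry: initial * factor**attempt, plus jitter."""
    for attempt in range(max_retries):
        base = initial * factor ** attempt
        yield base + random.uniform(0, jitter * base)
```

On a 429 response the session would sleep through these delays between attempts before giving up.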
### Logging
```python
from sportstime_parser.utils import get_logger, log_success, log_error
logger = get_logger() # Rich-formatted logger
logger.info("Starting scrape")
log_success("Scraped 1230 games") # Green checkmark
log_error("Failed to parse") # Red X
```
**Log output:**
- Console: Rich-formatted with colors
- File: `logs/parser_{timestamp}.log`
### Progress Tracking
```python
from sportstime_parser.utils import ScrapeProgress, track_progress

# Specialized scrape tracking
progress = ScrapeProgress("nba", 2025)
progress.start()
with progress.scraping_schedule(total_months=9) as advance:
    for month in months:
        fetch(month)
        advance()
progress.finish()  # Prints summary

# Generic progress bar
for game in track_progress(games, "Processing games"):
    process(game)
```
## Manual Review Workflow
When the system can't confidently resolve a team or stadium:
1. **Low confidence fuzzy match** (< 85%):

   ```
   ManualReviewItem(
       item_type="team",
       raw_value="LA Lakers",
       suggested_id="team_nba_lal",
       confidence=0.82,
       reason="Fuzzy match below threshold"
   )
   ```

2. **No match found**:

   ```
   ManualReviewItem(
       raw_value="Unknown Team FC",
       suggested_id=None,
       confidence=0.0,
       reason="No match found in canonical mappings"
   )
   ```

3. **Ambiguous match** (multiple candidates):

   ```
   ManualReviewItem(
       raw_value="LA",
       suggested_id="team_nba_lac",
       confidence=0.5,
       reason="Ambiguous: could be Lakers or Clippers"
   )
   ```
**Resolution:**
- Review items are exported to JSON
- Manually verify and add to `team_aliases.json` or `stadium_aliases.json`
- Re-run the scrape; the new aliases will be used for resolution
## Adding a New Sport
1. **Create scraper** in `scrapers/{sport}.py`:

   ```python
   class NewSportScraper(BaseScraper):
       def __init__(self, season: int, **kwargs):
           super().__init__("newsport", season, **kwargs)
           self._team_resolver = get_team_resolver("newsport")
           self._stadium_resolver = get_stadium_resolver("newsport")

       def _get_sources(self) -> list[str]:
           return ["primary_source", "backup_source"]

       def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
           # Implement source-specific scraping
           ...

       def _normalize_games(self, raw_games) -> tuple[list[Game], list[ManualReviewItem]]:
           # Use resolvers to normalize
           ...

       def scrape_teams(self) -> list[Team]:
           # Return canonical team list
           ...

       def scrape_stadiums(self) -> list[Stadium]:
           # Return canonical stadium list
           ...
   ```

2. **Add team mappings** in `normalizers/team_resolver.py`:

   ```python
   TEAM_MAPPINGS["newsport"] = {
       "ABC": ("team_newsport_abc", "Full Team Name", "City"),
       ...
   }
   ```

3. **Add stadium mappings** in `normalizers/stadium_resolver.py`:

   ```python
   STADIUM_MAPPINGS["newsport"] = {
       "stadium_newsport_venue": StadiumInfo(
           name="Venue Name",
           city="City",
           state="State",
           country="USA",
           latitude=40.0,
           longitude=-74.0,
       ),
       ...
   }
   ```

4. **Add to league_structure.json** (if hierarchical)

5. **Update config.py**:

   ```python
   EXPECTED_GAME_COUNTS["newsport"] = 500
   ```

6. **Export from `__init__.py`**
## Troubleshooting
### Rate Limiting (429 errors)
The system handles these automatically with exponential backoff. If errors persist:
- Increase `DEFAULT_REQUEST_DELAY` in config.py
- Check if source has changed their rate limits
### Missing Teams/Stadiums
1. Check scraper logs for raw values
2. Add to `team_aliases.json` or `stadium_aliases.json`
3. Or add to canonical mappings if it's a new team/stadium
### CloudKit Authentication Errors
1. Verify key_id matches CloudKit Dashboard
2. Check private key format (EC P-256, PEM)
3. Ensure container identifier is correct
### Incomplete Scrapes
The system discards partial data on errors. Check:
- `logs/` for error details
- Network connectivity
- Source website availability
### International Games Appearing
NFL and NHL scrapers filter these automatically. If new locations emerge:
- Add to `INTERNATIONAL_LOCATIONS` in the scraper
- Or add filtering logic for neutral site games
## Contributing
1. Follow existing patterns for new scrapers
2. Always use canonical IDs
3. Add aliases for historical names
4. Include source URLs for traceability
5. Test with multiple seasons