# SportsTime Parser
A Python CLI tool for scraping sports schedules, normalizing data with canonical IDs, and uploading to CloudKit.
## Features
- Scrapes game schedules from multiple sources with automatic fallback
- Supports 7 major sports leagues: NBA, MLB, NFL, NHL, MLS, WNBA, NWSL
- Generates deterministic canonical IDs for games, teams, and stadiums
- Produces validation reports with manual review lists
- Uploads to CloudKit with resumable, diff-based updates
## Requirements
- Python 3.11+
- CloudKit credentials (for upload functionality)
## Installation
```bash
# From the Scripts directory
cd Scripts
# Install in development mode
pip install -e ".[dev]"
# Or install dependencies only
pip install -r requirements.txt
```
## Quick Start
```bash
# Scrape NBA 2025-26 season
sportstime-parser scrape nba --season 2025
# Scrape all sports
sportstime-parser scrape all --season 2025
# Validate existing scraped data
sportstime-parser validate nba --season 2025
# Check status
sportstime-parser status
# Upload to CloudKit (development)
sportstime-parser upload nba --season 2025
# Upload to CloudKit (production)
sportstime-parser upload nba --season 2025 --environment production
```
## CLI Reference
### scrape
Scrape game schedules, teams, and stadiums from web sources.
```bash
sportstime-parser scrape <sport> [options]

Arguments:
  sport              Sport to scrape: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT   Season start year (default: 2025)
  --dry-run          Parse and validate only, don't write output files
  --verbose, -v      Enable verbose output
```
**Examples:**
```bash
# Scrape NBA 2025-26 season
sportstime-parser scrape nba --season 2025
# Scrape all sports with verbose output
sportstime-parser scrape all --season 2025 --verbose
# Dry run to test without writing files
sportstime-parser scrape mlb --season 2026 --dry-run
```
### validate
Run validation on existing scraped data and regenerate reports. Validation performs these checks:
1. **Game Coverage**: Compares scraped game count against expected totals per league (e.g., ~1,230 for NBA, ~2,430 for MLB)
2. **Team Resolution**: Identifies team names that couldn't be matched to canonical IDs using fuzzy matching
3. **Stadium Resolution**: Identifies venue names that couldn't be matched to canonical stadium IDs
4. **Duplicate Detection**: Finds games with the same home/away teams on the same date (potential doubleheader issues or data errors)
5. **Missing Data**: Flags games missing required fields (stadium_id, team IDs, valid dates)
The output is a Markdown report with:
- Summary statistics (total games, valid games, coverage percentage)
- Manual review items grouped by type (unresolved teams, unresolved stadiums, duplicates)
- Fuzzy match suggestions with confidence scores to help resolve unmatched names
```bash
sportstime-parser validate <sport> [options]

Arguments:
  sport              Sport to validate: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT   Season start year (default: 2025)
```
**Examples:**
```bash
# Validate NBA data
sportstime-parser validate nba --season 2025
# Validate all sports
sportstime-parser validate all
```
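The duplicate check (item 4 above) amounts to grouping games by matchup and date. A minimal sketch, assuming a dict-shaped game record (the real `Game` model may differ):

```python
from collections import Counter

def find_duplicate_games(games: list[dict]) -> list[tuple]:
    """Return (home_team_id, away_team_id, date) keys appearing more than once."""
    counts = Counter(
        (g["home_team_id"], g["away_team_id"], g["date"]) for g in games
    )
    return [key for key, n in counts.items() if n > 1]

games = [
    {"home_team_id": "nba_la_lakers", "away_team_id": "nba_boston_celtics", "date": "2025-10-21"},
    {"home_team_id": "nba_la_lakers", "away_team_id": "nba_boston_celtics", "date": "2025-10-21"},
    {"home_team_id": "nba_la_lakers", "away_team_id": "nba_chicago_bulls", "date": "2025-10-23"},
]
# find_duplicate_games(games) flags the repeated Celtics @ Lakers game on 2025-10-21
```

A flagged key may still be legitimate (an MLB doubleheader), which is why duplicates land in the manual review list rather than being dropped automatically.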
### upload
Upload scraped data to CloudKit with diff-based updates.
```bash
sportstime-parser upload <sport> [options]

Arguments:
  sport                Sport to upload: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT     Season start year (default: 2025)
  --environment, -e    CloudKit environment: development or production (default: development)
  --resume             Resume interrupted upload from last checkpoint
```
**Examples:**
```bash
# Upload NBA to development
sportstime-parser upload nba --season 2025
# Upload to production
sportstime-parser upload nba --season 2025 --environment production
# Resume interrupted upload
sportstime-parser upload mlb --season 2026 --resume
```
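A diff-based update avoids re-uploading records that haven't changed. A minimal sketch of the comparison step, with `diff_records` as an illustrative name rather than the actual API of `uploaders/diff.py`:

```python
def diff_records(
    local: dict[str, dict], remote: dict[str, dict]
) -> tuple[dict[str, dict], dict[str, dict], list[str]]:
    """Partition local records by record ID into creates, updates, and unchanged."""
    to_create = {rid: rec for rid, rec in local.items() if rid not in remote}
    to_update = {
        rid: rec for rid, rec in local.items()
        if rid in remote and remote[rid] != rec
    }
    unchanged = [rid for rid in local if rid in remote and remote[rid] == local[rid]]
    return to_create, to_update, unchanged
```

Only `to_create` and `to_update` need to be sent, which keeps re-runs after a partial failure cheap.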
### status
Show current scrape and upload status.
```bash
sportstime-parser status
```
### retry
Retry failed uploads from previous attempts.
```bash
sportstime-parser retry <sport> [options]

Arguments:
  sport              Sport to retry: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT   Season start year (default: 2025)
  --environment, -e  CloudKit environment (default: development)
  --max-retries INT  Maximum retry attempts per record (default: 3)
```
### clear
Clear upload session state to start fresh.
```bash
sportstime-parser clear <sport> [options]

Arguments:
  sport              Sport to clear: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT   Season start year (default: 2025)
  --environment, -e  CloudKit environment (default: development)
```
## CloudKit Configuration
To upload data to CloudKit, you need to configure authentication credentials.
### 1. Get Credentials from Apple Developer Portal
1. Go to [Apple Developer Portal](https://developer.apple.com)
2. Navigate to **Certificates, Identifiers & Profiles** > **Keys**
3. Create a new key with **CloudKit** capability
4. Download the private key file (.p8)
5. Note the Key ID
### 2. Set Environment Variables
```bash
# Key ID from Apple Developer Portal
export CLOUDKIT_KEY_ID="your_key_id_here"
# Path to private key file
export CLOUDKIT_PRIVATE_KEY_PATH="/path/to/AuthKey_XXXXXX.p8"
# Or provide key content directly (useful for CI/CD)
export CLOUDKIT_PRIVATE_KEY="-----BEGIN EC PRIVATE KEY-----
...key content...
-----END EC PRIVATE KEY-----"
```
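An uploader might resolve the key from these variables along the following lines; `load_private_key` is an illustrative helper, not the tool's actual API:

```python
import os
from pathlib import Path

def load_private_key() -> str:
    """Return the PEM key text, preferring inline content over the file path."""
    inline = os.environ.get("CLOUDKIT_PRIVATE_KEY")
    if inline:
        return inline
    path = os.environ.get("CLOUDKIT_PRIVATE_KEY_PATH")
    if path:
        return Path(path).read_text()
    raise RuntimeError(
        "CloudKit not configured: set CLOUDKIT_PRIVATE_KEY or CLOUDKIT_PRIVATE_KEY_PATH"
    )
```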
### 3. Verify Configuration
```bash
sportstime-parser status
```
The status output will show whether CloudKit is configured correctly.
## Output Files
Scraped data is saved to the `output/` directory:
```
output/
  games_nba_2025.json       # Game schedules
  teams_nba.json            # Team data
  stadiums_nba.json         # Stadium data
  validation_nba_2025.md    # Validation report
```
## Validation Reports
Validation reports are generated in Markdown format at `output/validation_{sport}_{season}.md`.
### Report Sections
**Summary Table**
| Metric | Description |
|--------|-------------|
| Total Games | Number of games scraped |
| Valid Games | Games with all required fields resolved |
| Coverage | Percentage of expected games found (based on league schedule) |
| Unresolved Teams | Team names that couldn't be matched |
| Unresolved Stadiums | Venue names that couldn't be matched |
| Duplicates | Potential duplicate game entries |
**Manual Review Items**
Items are grouped by type and include the raw value, source URL, and suggested fixes:
- **Unresolved Teams**: Team names not in the alias mapping. Add to `team_aliases.json` to resolve.
- **Unresolved Stadiums**: Venue names not recognized. Common for renamed arenas (naming rights changes). Add to `stadium_aliases.json`.
- **Duplicate Games**: Same matchup on same date. May indicate doubleheader parsing issues or duplicate entries from different sources.
- **Missing Data**: Games missing stadium coordinates or other required fields.
**Fuzzy Match Suggestions**
For each unresolved name, the validator provides the top fuzzy matches with confidence scores (0-100). High-confidence matches (>80) are likely correct; lower scores need manual verification.
## Canonical IDs
Canonical IDs are stable, deterministic identifiers that enable cross-referencing between games, teams, and stadiums across different data sources.
### ID Formats
**Games**
```
{sport}_{season}_{away}_{home}_{MMDD}[_{game_number}]
```
Examples:
- `nba_2025_hou_okc_1021` - NBA 2025-26, Houston @ OKC, Oct 21
- `mlb_2026_nyy_bos_0401_1` - MLB 2026, Yankees @ Red Sox, Apr 1, Game 1 (doubleheader)
**Teams**
```
{sport}_{city}_{name}
```
Examples:
- `nba_la_lakers`
- `mlb_new_york_yankees`
- `nfl_new_york_giants`
**Stadiums**
```
{sport}_{normalized_name}
```
Examples:
- `mlb_yankee_stadium`
- `nba_crypto_com_arena`
- `nfl_sofi_stadium`
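The `{normalized_name}` segment suggests lowercasing and collapsing punctuation and spaces to underscores. An illustrative normalizer consistent with the examples above:

```python
import re

def normalize_name(name: str) -> str:
    """Lowercase and collapse runs of non-alphanumeric characters to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")

# normalize_name("Crypto.com Arena") -> "crypto_com_arena"
# normalize_name("SoFi Stadium")     -> "sofi_stadium"
```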
### Generated vs Matched IDs
| Entity | Generated | Matched |
|--------|-----------|---------|
| **Teams** | Pre-defined in `team_resolver.py` mappings | Resolved from raw scraped names via aliases + fuzzy matching |
| **Stadiums** | Pre-defined in `stadium_resolver.py` mappings | Resolved from raw venue names via aliases + fuzzy matching |
| **Games** | Generated at scrape time from resolved team IDs + date | N/A (always generated, never matched) |
**Resolution Flow:**
```
Raw Name (from scraper)
  ↓
Exact match: alias lookup in team_aliases.json / stadium_aliases.json
  ↓ (no exact match)
Fuzzy match: Levenshtein distance against known names
  ├─ confidence > threshold → Canonical ID assigned
  └─ otherwise              → Manual Review Item created
```
### Cross-References
Entities reference each other via canonical IDs:
```
Game
  id:           nba_2025_hou_okc_1021
  home_team_id: nba_oklahoma_city_thunder ──► Team
  away_team_id: nba_houston_rockets ────────► Team
  stadium_id:   nba_paycom_center ──────────► Stadium

Stadium (nba_paycom_center)
  name:      "Paycom Center"
  city:      "Oklahoma City"
  latitude:  35.4634
  longitude: -97.5151

Team (nba_houston_rockets)
  name:       "Rockets"
  city:       "Houston"
  stadium_id: nba_toyota_center ────────────► Stadium

Team (nba_oklahoma_city_thunder)
  name:       "Thunder"
  city:       "Oklahoma City"
  stadium_id: nba_paycom_center ────────────► Stadium
```
### Alias Files
Aliases map variant names to canonical IDs:
**`team_aliases.json`**
```json
{
  "nba": {
    "LA Lakers": "nba_la_lakers",
    "Los Angeles Lakers": "nba_la_lakers",
    "LAL": "nba_la_lakers"
  }
}
```
**`stadium_aliases.json`**
```json
{
  "nba": {
    "Crypto.com Arena": "nba_crypto_com_arena",
    "Staples Center": "nba_crypto_com_arena",
    "STAPLES Center": "nba_crypto_com_arena"
  }
}
```
When a scraper returns a raw name like "LA Lakers", the resolver:
1. Checks `team_aliases.json` for an exact match → finds `nba_la_lakers`
2. If no exact match, runs fuzzy matching against all known team names
3. If fuzzy match confidence > 80%, uses that canonical ID
4. Otherwise, creates a manual review item for human resolution
## Adding a New Sport
To add support for a new sport (e.g., `cfb` for college football), update these files:
### 1. Configuration (`config.py`)
Add the sport to `SUPPORTED_SPORTS` and `EXPECTED_GAME_COUNTS`:
```python
SUPPORTED_SPORTS: list[str] = [
    "nba", "mlb", "nfl", "nhl", "mls", "wnba", "nwsl",
    "cfb",  # ← Add new sport
]

EXPECTED_GAME_COUNTS: dict[str, int] = {
    # ... existing sports ...
    "cfb": 900,  # ← Add expected game count for validation
}
```
### 2. Team Mappings (`normalizers/team_resolver.py`)
Add team definitions to `TEAM_MAPPINGS`. Each entry maps an abbreviation to `(canonical_id, full_name, city)`:
```python
TEAM_MAPPINGS: dict[str, dict[str, tuple[str, str, str]]] = {
    # ... existing sports ...
    "cfb": {
        "ALA": ("team_cfb_ala", "Alabama Crimson Tide", "Tuscaloosa"),
        "OSU": ("team_cfb_osu", "Ohio State Buckeyes", "Columbus"),
        # ... all teams ...
    },
}
```
### 3. Stadium Mappings (`normalizers/stadium_resolver.py`)
Add stadium definitions to `STADIUM_MAPPINGS`. Each entry is a `StadiumInfo` with coordinates:
```python
STADIUM_MAPPINGS: dict[str, dict[str, StadiumInfo]] = {
    # ... existing sports ...
    "cfb": {
        "stadium_cfb_bryant_denny": StadiumInfo(
            id="stadium_cfb_bryant_denny",
            name="Bryant-Denny Stadium",
            city="Tuscaloosa",
            state="AL",
            country="USA",
            sport="cfb",
            latitude=33.2083,
            longitude=-87.5503,
        ),
        # ... all stadiums ...
    },
}
}
```
### 4. Scraper Implementation (`scrapers/cfb.py`)
Create a new scraper class extending `BaseScraper`:
```python
from .base import BaseScraper, RawGameData, ScrapeResult


class CFBScraper(BaseScraper):
    def __init__(self, season: int, **kwargs):
        super().__init__("cfb", season, **kwargs)
        self._team_resolver = get_team_resolver("cfb")
        self._stadium_resolver = get_stadium_resolver("cfb")

    def _get_sources(self) -> list[str]:
        return ["espn", "sports_reference"]  # Priority order

    def _get_source_url(self, source: str, **kwargs) -> str:
        # Return URL for each source
        ...

    def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
        # Implement scraping logic
        ...

    def _normalize_games(
        self, raw_games: list[RawGameData]
    ) -> tuple[list[Game], list[ManualReviewItem]]:
        # Convert raw data to Game objects using resolvers
        ...

    def scrape_teams(self) -> list[Team]:
        # Return Team objects from TEAM_MAPPINGS
        ...

    def scrape_stadiums(self) -> list[Stadium]:
        # Return Stadium objects from STADIUM_MAPPINGS
        ...


def create_cfb_scraper(season: int) -> CFBScraper:
    return CFBScraper(season=season)
```
### 5. Register Scraper (`scrapers/__init__.py`)
Export the new scraper:
```python
from .cfb import CFBScraper, create_cfb_scraper

__all__ = [
    # ... existing exports ...
    "CFBScraper",
    "create_cfb_scraper",
]
```
### 6. CLI Registration (`cli.py`)
Add the sport to `get_scraper()`:
```python
def get_scraper(sport: str, season: int):
    # ... existing sports ...
    elif sport == "cfb":
        from .scrapers.cfb import create_cfb_scraper
        return create_cfb_scraper(season)
```
### 7. Alias Files (`team_aliases.json`, `stadium_aliases.json`)
Add initial aliases for common name variants:
```json
// team_aliases.json
{
  "cfb": {
    "Alabama": "team_cfb_ala",
    "Bama": "team_cfb_ala",
    "Roll Tide": "team_cfb_ala"
  }
}

// stadium_aliases.json
{
  "cfb": {
    "Bryant Denny Stadium": "stadium_cfb_bryant_denny",
    "Bryant-Denny": "stadium_cfb_bryant_denny"
  }
}
```
### 8. Documentation (`SOURCES.md`)
Document data sources with URLs, rate limits, and notes:
```markdown
## CFB (College Football)
**Teams**: 134 (FBS)
**Expected Games**: ~900 per season
**Season**: August - January
### Sources
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/football/college-football/scoreboard` | JSON |
| 2 | Sports-Reference | `sports-reference.com/cfb/years/{YEAR}-schedule.html` | HTML |
```
### 9. Tests (`tests/test_scrapers/test_cfb.py`)
Create tests for the new scraper:
```python
import pytest
from sportstime_parser.scrapers.cfb import CFBScraper, create_cfb_scraper


class TestCFBScraper:
    def test_factory_creates_scraper(self):
        scraper = create_cfb_scraper(season=2025)
        assert scraper.sport == "cfb"
        assert scraper.season == 2025

    def test_get_sources_returns_priority_list(self):
        scraper = CFBScraper(season=2025)
        sources = scraper._get_sources()
        assert "espn" in sources

    # ... more tests ...
```
### Checklist
- [ ] Add to `SUPPORTED_SPORTS` in `config.py`
- [ ] Add to `EXPECTED_GAME_COUNTS` in `config.py`
- [ ] Add team mappings to `team_resolver.py`
- [ ] Add stadium mappings to `stadium_resolver.py`
- [ ] Create `scrapers/{sport}.py` with scraper class
- [ ] Export in `scrapers/__init__.py`
- [ ] Register in `cli.py` `get_scraper()`
- [ ] Add aliases to `team_aliases.json`
- [ ] Add aliases to `stadium_aliases.json`
- [ ] Document sources in `SOURCES.md`
- [ ] Create tests in `tests/test_scrapers/`
- [ ] Run `pytest` to verify all tests pass
- [ ] Run dry-run scrape: `sportstime-parser scrape {sport} --season 2025 --dry-run`
## Development
### Running Tests
```bash
# Run all tests
pytest
# Run with coverage
pytest --cov=sportstime_parser --cov-report=html
# Run specific test file
pytest tests/test_scrapers/test_nba.py
# Run with verbose output
pytest -v
```
### Project Structure
```
sportstime_parser/
  __init__.py
  __main__.py              # CLI entry point
  cli.py                   # Subcommand definitions
  config.py                # Constants, defaults
  models/
    game.py                # Game dataclass
    team.py                # Team dataclass
    stadium.py             # Stadium dataclass
    aliases.py             # Alias dataclasses
  scrapers/
    base.py                # BaseScraper abstract class
    nba.py                 # NBA scrapers
    mlb.py                 # MLB scrapers
    nfl.py                 # NFL scrapers
    nhl.py                 # NHL scrapers
    mls.py                 # MLS scrapers
    wnba.py                # WNBA scrapers
    nwsl.py                # NWSL scrapers
  normalizers/
    canonical_id.py        # ID generation
    team_resolver.py       # Team name resolution
    stadium_resolver.py    # Stadium name resolution
    timezone.py            # Timezone conversion
    fuzzy.py               # Fuzzy matching
  validators/
    report.py              # Validation report generator
  uploaders/
    cloudkit.py            # CloudKit Web Services client
    state.py               # Resumable upload state
    diff.py                # Record comparison
  utils/
    http.py                # Rate-limited HTTP client
    logging.py             # Verbose logger
    progress.py            # Progress bars
```
## Troubleshooting
### "No games file found"
Run the scrape command first:
```bash
sportstime-parser scrape nba --season 2025
```
### "CloudKit not configured"
Set the required environment variables:
```bash
export CLOUDKIT_KEY_ID="your_key_id"
export CLOUDKIT_PRIVATE_KEY_PATH="/path/to/key.p8"
```
### Rate limit errors
The scraper includes automatic rate limiting and exponential backoff. If you encounter persistent rate limit errors:
1. Wait a few minutes before retrying
2. Try scraping one sport at a time instead of "all"
3. Check that you're not running multiple instances
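Exponential backoff of this kind is commonly implemented along the following lines; `RateLimitError` and the delay constants here are illustrative, not the tool's actual internals:

```python
import random
import time

class RateLimitError(Exception):
    """Raised when a source responds with HTTP 429 (illustrative)."""

def with_backoff(fn, max_attempts: int = 5, base_delay: float = 1.0):
    """Call fn(), sleeping base_delay * 2**attempt plus jitter between rate-limited tries."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```

The random jitter keeps multiple clients from retrying in lockstep, which is why running several instances at once (point 3 above) tends to make rate limiting worse.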
### Scrape fails with no data
1. Check your internet connection
2. Run with `--verbose` to see detailed error messages
3. The scraper tries multiple sources automatically; if all fail, the source websites may be temporarily unavailable
## License
MIT