SportsTime Parser
A Python CLI tool for scraping sports schedules, normalizing data with canonical IDs, and uploading to CloudKit.
Features
- Scrapes game schedules from multiple sources with automatic fallback
- Supports 7 major sports leagues: NBA, MLB, NFL, NHL, MLS, WNBA, NWSL
- Generates deterministic canonical IDs for games, teams, and stadiums
- Produces validation reports with manual review lists
- Uploads to CloudKit with resumable, diff-based updates
Requirements
- Python 3.11+
- CloudKit credentials (for upload functionality)
Installation
# From the Scripts directory
cd Scripts
# Install in development mode
pip install -e ".[dev]"
# Or install dependencies only
pip install -r requirements.txt
Quick Start
# Scrape NBA 2025-26 season
sportstime-parser scrape nba --season 2025
# Scrape all sports
sportstime-parser scrape all --season 2025
# Validate existing scraped data
sportstime-parser validate nba --season 2025
# Check status
sportstime-parser status
# Upload to CloudKit (development)
sportstime-parser upload nba --season 2025
# Upload to CloudKit (production)
sportstime-parser upload nba --season 2025 --environment production
CLI Reference
scrape
Scrape game schedules, teams, and stadiums from web sources.
sportstime-parser scrape <sport> [options]
Arguments:
sport Sport to scrape: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"
Options:
--season, -s INT Season start year (default: 2025)
--dry-run Parse and validate only, don't write output files
--verbose, -v Enable verbose output
Examples:
# Scrape NBA 2025-26 season
sportstime-parser scrape nba --season 2025
# Scrape all sports with verbose output
sportstime-parser scrape all --season 2025 --verbose
# Dry run to test without writing files
sportstime-parser scrape mlb --season 2026 --dry-run
validate
Run validation on existing scraped data and regenerate reports. Validation performs these checks:
- Game Coverage: Compares scraped game count against expected totals per league (e.g., ~1,230 for NBA, ~2,430 for MLB)
- Team Resolution: Identifies team names that couldn't be matched to canonical IDs using fuzzy matching
- Stadium Resolution: Identifies venue names that couldn't be matched to canonical stadium IDs
- Duplicate Detection: Finds games with the same home/away teams on the same date (potential doubleheader issues or data errors)
- Missing Data: Flags games missing required fields (stadium_id, team IDs, valid dates)
The output is a Markdown report with:
- Summary statistics (total games, valid games, coverage percentage)
- Manual review items grouped by type (unresolved teams, unresolved stadiums, duplicates)
- Fuzzy match suggestions with confidence scores to help resolve unmatched names
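The duplicate-detection check described above can be sketched in a few lines; the field names (`home_team_id`, `away_team_id`, `date`) are assumptions for illustration and may differ from the actual game schema:

```python
from collections import Counter

def find_duplicates(games: list[dict]) -> list[tuple]:
    """Flag matchups where the same home/away pair appears twice on one date."""
    keys = Counter(
        (g["home_team_id"], g["away_team_id"], g["date"]) for g in games
    )
    return [key for key, count in keys.items() if count > 1]

games = [
    {"home_team_id": "nba_okc", "away_team_id": "nba_hou", "date": "2025-10-21"},
    {"home_team_id": "nba_okc", "away_team_id": "nba_hou", "date": "2025-10-21"},
    {"home_team_id": "nba_bos", "away_team_id": "nba_nyk", "date": "2025-10-22"},
]
print(find_duplicates(games))
# → [('nba_okc', 'nba_hou', '2025-10-21')]
```

A flagged key may be a legitimate doubleheader (disambiguated by game number in the canonical ID) or a true duplicate from merging sources, which is why these land in the manual review list rather than being dropped automatically.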
sportstime-parser validate <sport> [options]
Arguments:
sport Sport to validate: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"
Options:
--season, -s INT Season start year (default: 2025)
Examples:
# Validate NBA data
sportstime-parser validate nba --season 2025
# Validate all sports
sportstime-parser validate all
upload
Upload scraped data to CloudKit with diff-based updates.
sportstime-parser upload <sport> [options]
Arguments:
sport Sport to upload: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"
Options:
--season, -s INT Season start year (default: 2025)
--environment, -e CloudKit environment: development or production (default: development)
--resume Resume interrupted upload from last checkpoint
Examples:
# Upload NBA to development
sportstime-parser upload nba --season 2025
# Upload to production
sportstime-parser upload nba --season 2025 --environment production
# Resume interrupted upload
sportstime-parser upload mlb --season 2026 --resume
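"Diff-based" means only new or changed records are sent to CloudKit. A minimal sketch of that comparison, assuming records are keyed by canonical ID (the actual logic lives in `uploaders/diff.py` and may differ):

```python
def diff_records(
    local: dict[str, dict], remote: dict[str, dict]
) -> tuple[list[dict], list[dict]]:
    """Split local records into creates (absent on server) and updates (changed)."""
    creates = [rec for rid, rec in local.items() if rid not in remote]
    updates = [
        rec for rid, rec in local.items() if rid in remote and rec != remote[rid]
    ]
    return creates, updates

creates, updates = diff_records(
    {"g1": {"time": "19:00"}, "g2": {"time": "20:00"}},
    {"g1": {"time": "19:30"}},
)
print(creates)  # → [{'time': '20:00'}]
print(updates)  # → [{'time': '19:00'}]
```

Unchanged records produce no network traffic, which is what makes re-running an upload after a scrape refresh cheap.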
status
Show current scrape and upload status.
sportstime-parser status
retry
Retry failed uploads from previous attempts.
sportstime-parser retry <sport> [options]
Arguments:
sport Sport to retry: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"
Options:
--season, -s INT Season start year (default: 2025)
--environment, -e CloudKit environment (default: development)
--max-retries INT Maximum retry attempts per record (default: 3)
clear
Clear upload session state to start fresh.
sportstime-parser clear <sport> [options]
Arguments:
sport Sport to clear: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"
Options:
--season, -s INT Season start year (default: 2025)
--environment, -e CloudKit environment (default: development)
CloudKit Configuration
To upload data to CloudKit, you need to configure authentication credentials.
1. Get Credentials from Apple Developer Portal
- Go to Apple Developer Portal
- Navigate to Certificates, Identifiers & Profiles > Keys
- Create a new key with CloudKit capability
- Download the private key file (.p8)
- Note the Key ID
2. Set Environment Variables
# Key ID from Apple Developer Portal
export CLOUDKIT_KEY_ID="your_key_id_here"
# Path to private key file
export CLOUDKIT_PRIVATE_KEY_PATH="/path/to/AuthKey_XXXXXX.p8"
# Or provide key content directly (useful for CI/CD)
export CLOUDKIT_PRIVATE_KEY="-----BEGIN EC PRIVATE KEY-----
...key content...
-----END EC PRIVATE KEY-----"
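The configuration check amounts to: a key ID, plus either a key file path or inline key content. A sketch of that logic (the real `status` command may check more):

```python
import os

def cloudkit_configured() -> bool:
    """True when CLOUDKIT_KEY_ID is set along with key material
    from either CLOUDKIT_PRIVATE_KEY_PATH or CLOUDKIT_PRIVATE_KEY."""
    has_key_id = bool(os.environ.get("CLOUDKIT_KEY_ID"))
    has_key_material = bool(
        os.environ.get("CLOUDKIT_PRIVATE_KEY_PATH")
        or os.environ.get("CLOUDKIT_PRIVATE_KEY")
    )
    return has_key_id and has_key_material
```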
3. Verify Configuration
sportstime-parser status
The status output will show whether CloudKit is configured correctly.
Output Files
Scraped data is saved to the output/ directory:
output/
games_nba_2025.json # Game schedules
teams_nba.json # Team data
stadiums_nba.json # Stadium data
validation_nba_2025.md # Validation report
Validation Reports
Validation reports are generated in Markdown format at output/validation_{sport}_{season}.md.
Report Sections
Summary Table
| Metric | Description |
|---|---|
| Total Games | Number of games scraped |
| Valid Games | Games with all required fields resolved |
| Coverage | Percentage of expected games found (based on league schedule) |
| Unresolved Teams | Team names that couldn't be matched |
| Unresolved Stadiums | Venue names that couldn't be matched |
| Duplicates | Potential duplicate game entries |
Manual Review Items
Items are grouped by type and include the raw value, source URL, and suggested fixes:
- Unresolved Teams: Team names not in the alias mapping. Add to `team_aliases.json` to resolve.
- Unresolved Stadiums: Venue names not recognized. Common for renamed arenas (naming rights changes). Add to `stadium_aliases.json`.
- Duplicate Games: Same matchup on same date. May indicate doubleheader parsing issues or duplicate entries from different sources.
- Missing Data: Games missing stadium coordinates or other required fields.
Fuzzy Match Suggestions
For each unresolved name, the validator provides the top fuzzy matches with confidence scores (0-100). High-confidence matches (>80) are likely correct; lower scores need manual verification.
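The 0-100 scoring can be illustrated with the standard library's `difflib` (a stand-in here; the project may use a dedicated Levenshtein library, and exact scores would differ):

```python
from difflib import SequenceMatcher

def fuzzy_suggestions(
    raw: str, known: list[str], top: int = 3
) -> list[tuple[str, int]]:
    """Score each known name 0-100 against the raw name, best matches first."""
    scored = [
        (name, round(SequenceMatcher(None, raw.lower(), name.lower()).ratio() * 100))
        for name in known
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]

print(fuzzy_suggestions(
    "Staples Centre",
    ["Crypto.com Arena", "Staples Center", "Paycom Center"],
))
# Top suggestion is "Staples Center", with a score well above the 80 cutoff
```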
Canonical IDs
Canonical IDs are stable, deterministic identifiers that enable cross-referencing between games, teams, and stadiums across different data sources.
ID Formats
Games
{sport}_{season}_{away}_{home}_{MMDD}[_{game_number}]
Examples:
- `nba_2025_hou_okc_1021` - NBA 2025-26, Houston @ OKC, Oct 21
- `mlb_2026_nyy_bos_0401_1` - MLB 2026, Yankees @ Red Sox, Apr 1, Game 1 (doubleheader)
Teams
{sport}_{city}_{name}
Examples:
- `nba_la_lakers`
- `mlb_new_york_yankees`
- `nfl_new_york_giants`
Stadiums
{sport}_{normalized_name}
Examples:
- `mlb_yankee_stadium`
- `nba_crypto_com_arena`
- `nfl_sofi_stadium`
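The normalization step lowercases the venue name and collapses punctuation and spaces into underscores, so "Crypto.com Arena" becomes `crypto_com_arena`. A sketch of one plausible rule (the exact normalization in `stadium_resolver.py` may differ):

```python
import re

def make_stadium_id(sport: str, name: str) -> str:
    """Normalize a venue name into {sport}_{normalized_name}."""
    normalized = re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")
    return f"{sport}_{normalized}"

print(make_stadium_id("nba", "Crypto.com Arena"))
# → nba_crypto_com_arena
print(make_stadium_id("nfl", "SoFi Stadium"))
# → nfl_sofi_stadium
```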
Generated vs Matched IDs
| Entity | Generated | Matched |
|---|---|---|
| Teams | Pre-defined in team_resolver.py mappings | Resolved from raw scraped names via aliases + fuzzy matching |
| Stadiums | Pre-defined in stadium_resolver.py mappings | Resolved from raw venue names via aliases + fuzzy matching |
| Games | Generated at scrape time from resolved team IDs + date | N/A (always generated, never matched) |
Resolution Flow:
Raw Name (from scraper)
↓
Exact Match (alias lookup in team_aliases.json / stadium_aliases.json)
↓ (if no match)
Fuzzy Match (Levenshtein distance against known names)
↓ (if confidence > threshold)
Canonical ID assigned
↓ (if no match)
Manual Review Item created
Cross-References
Entities reference each other via canonical IDs:
┌─────────────────────────────────────────────────────────────┐
│ Game │
│ id: nba_2025_hou_okc_1021 │
│ home_team_id: nba_oklahoma_city_thunder ──────────────┐ │
│ away_team_id: nba_houston_rockets ────────────────┐ │ │
│ stadium_id: nba_paycom_center ────────────────┐ │ │ │
└─────────────────────────────────────────────────│───│───│───┘
│ │ │
┌─────────────────────────────────────────────────│───│───│───┐
│ Stadium │ │ │ │
│ id: nba_paycom_center ◄───────────────────────┘ │ │ │
│ name: "Paycom Center" │ │ │
│ city: "Oklahoma City" │ │ │
│ latitude: 35.4634 │ │ │
│ longitude: -97.5151 │ │ │
└─────────────────────────────────────────────────────│───│───┘
│ │
┌─────────────────────────────────────────────────────│───│───┐
│ Team │ │ │
│ id: nba_houston_rockets ◄─────────────────────────┘ │ │
│ name: "Rockets" │ │
│ city: "Houston" │ │
│ stadium_id: nba_toyota_center │ │
└─────────────────────────────────────────────────────────│───┘
│
┌─────────────────────────────────────────────────────────│───┐
│ Team │ │
│ id: nba_oklahoma_city_thunder ◄───────────────────────┘ │
│ name: "Thunder" │
│ city: "Oklahoma City" │
│ stadium_id: nba_paycom_center │
└─────────────────────────────────────────────────────────────┘
Alias Files
Aliases map variant names to canonical IDs:
team_aliases.json
{
"nba": {
"LA Lakers": "nba_la_lakers",
"Los Angeles Lakers": "nba_la_lakers",
"LAL": "nba_la_lakers"
}
}
stadium_aliases.json
{
"nba": {
"Crypto.com Arena": "nba_crypto_com_arena",
"Staples Center": "nba_crypto_com_arena",
"STAPLES Center": "nba_crypto_com_arena"
}
}
When a scraper returns a raw name like "LA Lakers", the resolver:
1. Checks `team_aliases.json` for an exact match → finds `nba_la_lakers`
2. If no exact match, runs fuzzy matching against all known team names
3. If fuzzy match confidence > 80%, uses that canonical ID
4. Otherwise, creates a manual review item for human resolution
Adding a New Sport
To add support for a new sport (e.g., cfb for college football), update these files:
1. Configuration (config.py)
Add the sport to SUPPORTED_SPORTS and EXPECTED_GAME_COUNTS:
SUPPORTED_SPORTS: list[str] = [
"nba", "mlb", "nfl", "nhl", "mls", "wnba", "nwsl",
"cfb", # ← Add new sport
]
EXPECTED_GAME_COUNTS: dict[str, int] = {
# ... existing sports ...
"cfb": 900, # ← Add expected game count for validation
}
2. Team Mappings (normalizers/team_resolver.py)
Add team definitions to TEAM_MAPPINGS. Each entry maps an abbreviation to (canonical_id, full_name, city):
TEAM_MAPPINGS: dict[str, dict[str, tuple[str, str, str]]] = {
# ... existing sports ...
"cfb": {
"ALA": ("team_cfb_ala", "Alabama Crimson Tide", "Tuscaloosa"),
"OSU": ("team_cfb_osu", "Ohio State Buckeyes", "Columbus"),
# ... all teams ...
},
}
3. Stadium Mappings (normalizers/stadium_resolver.py)
Add stadium definitions to STADIUM_MAPPINGS. Each entry is a StadiumInfo with coordinates:
STADIUM_MAPPINGS: dict[str, dict[str, StadiumInfo]] = {
# ... existing sports ...
"cfb": {
"stadium_cfb_bryant_denny": StadiumInfo(
id="stadium_cfb_bryant_denny",
name="Bryant-Denny Stadium",
city="Tuscaloosa",
state="AL",
country="USA",
sport="cfb",
latitude=33.2083,
longitude=-87.5503,
),
# ... all stadiums ...
},
}
4. Scraper Implementation (scrapers/cfb.py)
Create a new scraper class extending BaseScraper:
from .base import BaseScraper, RawGameData, ScrapeResult
from ..models import Game, ManualReviewItem, Stadium, Team  # import paths illustrative
from ..normalizers.team_resolver import get_team_resolver
from ..normalizers.stadium_resolver import get_stadium_resolver
class CFBScraper(BaseScraper):
def __init__(self, season: int, **kwargs):
super().__init__("cfb", season, **kwargs)
self._team_resolver = get_team_resolver("cfb")
self._stadium_resolver = get_stadium_resolver("cfb")
def _get_sources(self) -> list[str]:
return ["espn", "sports_reference"] # Priority order
def _get_source_url(self, source: str, **kwargs) -> str:
# Return URL for each source
...
def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
# Implement scraping logic
...
def _normalize_games(self, raw_games: list[RawGameData]) -> tuple[list[Game], list[ManualReviewItem]]:
# Convert raw data to Game objects using resolvers
...
def scrape_teams(self) -> list[Team]:
# Return Team objects from TEAM_MAPPINGS
...
def scrape_stadiums(self) -> list[Stadium]:
# Return Stadium objects from STADIUM_MAPPINGS
...
def create_cfb_scraper(season: int) -> CFBScraper:
return CFBScraper(season=season)
5. Register Scraper (scrapers/__init__.py)
Export the new scraper:
from .cfb import CFBScraper, create_cfb_scraper
__all__ = [
# ... existing exports ...
"CFBScraper",
"create_cfb_scraper",
]
6. CLI Registration (cli.py)
Add the sport to get_scraper():
def get_scraper(sport: str, season: int):
# ... existing sports ...
elif sport == "cfb":
from .scrapers.cfb import create_cfb_scraper
return create_cfb_scraper(season)
7. Alias Files (team_aliases.json, stadium_aliases.json)
Add initial aliases for common name variants:
// team_aliases.json
{
"cfb": {
"Alabama": "team_cfb_ala",
"Bama": "team_cfb_ala",
"Roll Tide": "team_cfb_ala"
}
}
// stadium_aliases.json
{
"cfb": {
"Bryant Denny Stadium": "stadium_cfb_bryant_denny",
"Bryant-Denny": "stadium_cfb_bryant_denny"
}
}
8. Documentation (SOURCES.md)
Document data sources with URLs, rate limits, and notes:
## CFB (College Football)
**Teams**: 134 (FBS)
**Expected Games**: ~900 per season
**Season**: August - January
### Sources
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/football/college-football/scoreboard` | JSON |
| 2 | Sports-Reference | `sports-reference.com/cfb/years/{YEAR}-schedule.html` | HTML |
9. Tests (tests/test_scrapers/test_cfb.py)
Create tests for the new scraper:
import pytest
from sportstime_parser.scrapers.cfb import CFBScraper, create_cfb_scraper
class TestCFBScraper:
def test_factory_creates_scraper(self):
scraper = create_cfb_scraper(season=2025)
assert scraper.sport == "cfb"
assert scraper.season == 2025
def test_get_sources_returns_priority_list(self):
scraper = CFBScraper(season=2025)
sources = scraper._get_sources()
assert "espn" in sources
# ... more tests ...
Checklist
- Add to `SUPPORTED_SPORTS` in `config.py`
- Add to `EXPECTED_GAME_COUNTS` in `config.py`
- Add team mappings to `team_resolver.py`
- Add stadium mappings to `stadium_resolver.py`
- Create `scrapers/{sport}.py` with scraper class
- Export in `scrapers/__init__.py`
- Register in `cli.py` `get_scraper()`
- Add aliases to `team_aliases.json`
- Add aliases to `stadium_aliases.json`
- Document sources in `SOURCES.md`
- Create tests in `tests/test_scrapers/`
- Run `pytest` to verify all tests pass
- Run a dry-run scrape: `sportstime-parser scrape {sport} --season 2025 --dry-run`
Development
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov=sportstime_parser --cov-report=html
# Run specific test file
pytest tests/test_scrapers/test_nba.py
# Run with verbose output
pytest -v
Project Structure
sportstime_parser/
__init__.py
__main__.py # CLI entry point
cli.py # Subcommand definitions
config.py # Constants, defaults
models/
game.py # Game dataclass
team.py # Team dataclass
stadium.py # Stadium dataclass
aliases.py # Alias dataclasses
scrapers/
base.py # BaseScraper abstract class
nba.py # NBA scrapers
mlb.py # MLB scrapers
nfl.py # NFL scrapers
nhl.py # NHL scrapers
mls.py # MLS scrapers
wnba.py # WNBA scrapers
nwsl.py # NWSL scrapers
normalizers/
canonical_id.py # ID generation
team_resolver.py # Team name resolution
stadium_resolver.py # Stadium name resolution
timezone.py # Timezone conversion
fuzzy.py # Fuzzy matching
validators/
report.py # Validation report generator
uploaders/
cloudkit.py # CloudKit Web Services client
state.py # Resumable upload state
diff.py # Record comparison
utils/
http.py # Rate-limited HTTP client
logging.py # Verbose logger
progress.py # Progress bars
Troubleshooting
"No games file found"
Run the scrape command first:
sportstime-parser scrape nba --season 2025
"CloudKit not configured"
Set the required environment variables:
export CLOUDKIT_KEY_ID="your_key_id"
export CLOUDKIT_PRIVATE_KEY_PATH="/path/to/key.p8"
Rate limit errors
The scraper includes automatic rate limiting and exponential backoff. If you encounter persistent rate limit errors:
- Wait a few minutes before retrying
- Try scraping one sport at a time instead of "all"
- Check that you're not running multiple instances
Scrape fails with no data
- Check your internet connection
- Run with `--verbose` to see detailed error messages
- The scraper tries multiple sources; if all fail, the source websites may be temporarily unavailable
License
MIT