SportsTime Parser

A Python CLI tool for scraping sports schedules, normalizing data with canonical IDs, and uploading to CloudKit.

Features

  • Scrapes game schedules from multiple sources with automatic fallback
  • Supports 7 major sports leagues: NBA, MLB, NFL, NHL, MLS, WNBA, NWSL
  • Generates deterministic canonical IDs for games, teams, and stadiums
  • Produces validation reports with manual review lists
  • Uploads to CloudKit with resumable, diff-based updates

Requirements

  • Python 3.11+
  • CloudKit credentials (for upload functionality)

Installation

# From the Scripts directory
cd Scripts

# Install in development mode
pip install -e ".[dev]"

# Or install dependencies only
pip install -r requirements.txt

Quick Start

# Scrape NBA 2025-26 season
sportstime-parser scrape nba --season 2025

# Scrape all sports
sportstime-parser scrape all --season 2025

# Validate existing scraped data
sportstime-parser validate nba --season 2025

# Check status
sportstime-parser status

# Upload to CloudKit (development)
sportstime-parser upload nba --season 2025

# Upload to CloudKit (production)
sportstime-parser upload nba --season 2025 --environment production

CLI Reference

scrape

Scrape game schedules, teams, and stadiums from web sources.

sportstime-parser scrape <sport> [options]

Arguments:
  sport                Sport to scrape: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT     Season start year (default: 2025)
  --dry-run            Parse and validate only, don't write output files
  --verbose, -v        Enable verbose output

Examples:

# Scrape NBA 2025-26 season
sportstime-parser scrape nba --season 2025

# Scrape all sports with verbose output
sportstime-parser scrape all --season 2025 --verbose

# Dry run to test without writing files
sportstime-parser scrape mlb --season 2026 --dry-run

validate

Run validation on existing scraped data and regenerate reports. Validation performs these checks:

  1. Game Coverage: Compares scraped game count against expected totals per league (e.g., ~1,230 for NBA, ~2,430 for MLB)
  2. Team Resolution: Identifies team names that couldn't be matched to canonical IDs using fuzzy matching
  3. Stadium Resolution: Identifies venue names that couldn't be matched to canonical stadium IDs
  4. Duplicate Detection: Finds games with the same home/away teams on the same date (potential doubleheader issues or data errors)
  5. Missing Data: Flags games missing required fields (stadium_id, team IDs, valid dates)
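The duplicate check (item 4) amounts to grouping games by matchup and date. A minimal sketch, assuming a hypothetical dict-shaped game record (the real validator's record type may differ):

```python
from collections import Counter

def find_duplicate_games(games: list[dict]) -> list[tuple]:
    """Return (home_team_id, away_team_id, date) keys seen more than once.

    `games` is assumed to be a list of dicts with these three keys; the
    validator's actual record shape may differ.
    """
    counts = Counter(
        (g["home_team_id"], g["away_team_id"], g["date"]) for g in games
    )
    return [key for key, n in counts.items() if n > 1]

games = [
    {"home_team_id": "mlb_boston_red_sox", "away_team_id": "mlb_new_york_yankees", "date": "2026-04-01"},
    {"home_team_id": "mlb_boston_red_sox", "away_team_id": "mlb_new_york_yankees", "date": "2026-04-01"},
    {"home_team_id": "mlb_boston_red_sox", "away_team_id": "mlb_new_york_yankees", "date": "2026-04-02"},
]
print(find_duplicate_games(games))
```

Note that legitimate doubleheaders also trigger this check, which is why duplicates go to the manual review list rather than being dropped automatically.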

The output is a Markdown report with:

  • Summary statistics (total games, valid games, coverage percentage)
  • Manual review items grouped by type (unresolved teams, unresolved stadiums, duplicates)
  • Fuzzy match suggestions with confidence scores to help resolve unmatched names

sportstime-parser validate <sport> [options]

Arguments:
  sport                Sport to validate: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT     Season start year (default: 2025)

Examples:

# Validate NBA data
sportstime-parser validate nba --season 2025

# Validate all sports
sportstime-parser validate all

upload

Upload scraped data to CloudKit with diff-based updates.

sportstime-parser upload <sport> [options]

Arguments:
  sport                Sport to upload: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT     Season start year (default: 2025)
  --environment, -e    CloudKit environment: development or production (default: development)
  --resume             Resume interrupted upload from last checkpoint

Examples:

# Upload NBA to development
sportstime-parser upload nba --season 2025

# Upload to production
sportstime-parser upload nba --season 2025 --environment production

# Resume interrupted upload
sportstime-parser upload mlb --season 2026 --resume
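The diff step can be sketched as comparing a checksum of each local record against the checksums checkpointed at the last successful upload, so only new or changed records are sent. The record shape and state format below are hypothetical, not the uploader's actual wire format:

```python
import hashlib
import json

def record_checksum(record: dict) -> str:
    """Stable hash of a record's fields (sorted keys for determinism)."""
    payload = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def diff_records(local: dict[str, dict], uploaded: dict[str, str]) -> list[str]:
    """Return IDs of records that are new or changed since the last upload.

    `local` maps record ID -> record dict; `uploaded` maps record ID -> the
    checksum stored at the last checkpoint.
    """
    return [
        rec_id for rec_id, rec in local.items()
        if uploaded.get(rec_id) != record_checksum(rec)
    ]

local = {
    "nba_2025_hou_okc_1021": {"home": "okc", "away": "hou"},
    "nba_2025_lal_gsw_1022": {"home": "gsw", "away": "lal"},
}
uploaded = {"nba_2025_hou_okc_1021": record_checksum(local["nba_2025_hou_okc_1021"])}
print(diff_records(local, uploaded))  # only the record not yet uploaded
```

The same checkpoint state is what makes --resume possible: records whose checksums already match are skipped on the next run.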

status

Show current scrape and upload status.

sportstime-parser status

retry

Retry failed uploads from previous attempts.

sportstime-parser retry <sport> [options]

Arguments:
  sport                Sport to retry: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT     Season start year (default: 2025)
  --environment, -e    CloudKit environment (default: development)
  --max-retries INT    Maximum retry attempts per record (default: 3)

clear

Clear upload session state to start fresh.

sportstime-parser clear <sport> [options]

Arguments:
  sport                Sport to clear: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT     Season start year (default: 2025)
  --environment, -e    CloudKit environment (default: development)

CloudKit Configuration

To upload data to CloudKit, you need to configure authentication credentials.

1. Get Credentials from Apple Developer Portal

  1. Go to Apple Developer Portal
  2. Navigate to Certificates, Identifiers & Profiles > Keys
  3. Create a new key with CloudKit capability
  4. Download the private key file (.p8)
  5. Note the Key ID

2. Set Environment Variables

# Key ID from Apple Developer Portal
export CLOUDKIT_KEY_ID="your_key_id_here"

# Path to private key file
export CLOUDKIT_PRIVATE_KEY_PATH="/path/to/AuthKey_XXXXXX.p8"

# Or provide key content directly (useful for CI/CD)
export CLOUDKIT_PRIVATE_KEY="-----BEGIN EC PRIVATE KEY-----
...key content...
-----END EC PRIVATE KEY-----"
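A client would typically resolve these two forms with the inline key content taking precedence over the file path. This is an illustrative sketch only; the actual loading logic lives in uploaders/cloudkit.py and may differ:

```python
import os

def load_cloudkit_private_key() -> str:
    """Return PEM key content, preferring CLOUDKIT_PRIVATE_KEY over the path.

    Illustrative precedence only; raises if neither variable is set.
    """
    key = os.environ.get("CLOUDKIT_PRIVATE_KEY")
    if key:
        return key
    path = os.environ.get("CLOUDKIT_PRIVATE_KEY_PATH")
    if path:
        with open(path) as f:
            return f.read()
    raise RuntimeError(
        "CloudKit not configured: set CLOUDKIT_PRIVATE_KEY or CLOUDKIT_PRIVATE_KEY_PATH"
    )
```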

3. Verify Configuration

sportstime-parser status

The status output will show whether CloudKit is configured correctly.

Output Files

Scraped data is saved to the output/ directory:

output/
  games_nba_2025.json      # Game schedules
  teams_nba.json           # Team data
  stadiums_nba.json        # Stadium data
  validation_nba_2025.md   # Validation report

Validation Reports

Validation reports are generated in Markdown format at output/validation_{sport}_{season}.md.

Report Sections

Summary Table

Metric                Description
Total Games           Number of games scraped
Valid Games           Games with all required fields resolved
Coverage              Percentage of expected games found (based on league schedule)
Unresolved Teams      Team names that couldn't be matched
Unresolved Stadiums   Venue names that couldn't be matched
Duplicates            Potential duplicate game entries
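The Coverage metric is simply the scraped game count divided by the league's expected total. A minimal sketch, with illustrative counts mirroring EXPECTED_GAME_COUNTS in config.py:

```python
# Illustrative expected totals; the real values live in config.py.
EXPECTED_GAME_COUNTS = {"nba": 1230, "mlb": 2430}

def coverage_percent(sport: str, scraped_games: int) -> float:
    """Coverage as a percentage of the league's expected regular-season total."""
    expected = EXPECTED_GAME_COUNTS[sport]
    return round(100 * scraped_games / expected, 1)

print(coverage_percent("nba", 1230))  # 100.0
print(coverage_percent("mlb", 2187))  # 90.0
```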

Manual Review Items

Items are grouped by type and include the raw value, source URL, and suggested fixes:

  • Unresolved Teams: Team names not in the alias mapping. Add to team_aliases.json to resolve.
  • Unresolved Stadiums: Venue names not recognized. Common for renamed arenas (naming rights changes). Add to stadium_aliases.json.
  • Duplicate Games: Same matchup on same date. May indicate doubleheader parsing issues or duplicate entries from different sources.
  • Missing Data: Games missing stadium coordinates or other required fields.

Fuzzy Match Suggestions

For each unresolved name, the validator provides the top fuzzy matches with confidence scores (0-100). High-confidence matches (>80) are likely correct; lower scores need manual verification.
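The scoring idea can be sketched with the standard library's difflib (the real validator uses Levenshtein distance per the Resolution Flow section below, so exact scores will differ; the 0-100 scaling and ranking are what matter here):

```python
from difflib import SequenceMatcher

def fuzzy_score(raw: str, candidate: str) -> int:
    """Case-insensitive similarity on a 0-100 scale (difflib stand-in)."""
    return round(100 * SequenceMatcher(None, raw.lower(), candidate.lower()).ratio())

def top_matches(raw: str, known: list[str], limit: int = 3) -> list[tuple[str, int]]:
    """Best candidates for an unresolved name, highest score first."""
    scored = [(name, fuzzy_score(raw, name)) for name in known]
    return sorted(scored, key=lambda pair: -pair[1])[:limit]

known_teams = ["Los Angeles Lakers", "Golden State Warriors", "Boston Celtics"]
# A typo'd scrape value should surface the right team with high confidence.
print(top_matches("Los Angles Lakers", known_teams))
```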

Canonical IDs

Canonical IDs are stable, deterministic identifiers that enable cross-referencing between games, teams, and stadiums across different data sources.

ID Formats

Games

{sport}_{season}_{away}_{home}_{MMDD}[_{game_number}]

Examples:

  • nba_2025_hou_okc_1021 - NBA 2025-26, Houston @ OKC, Oct 21
  • mlb_2026_nyy_bos_0401_1 - MLB 2026, Yankees @ Red Sox, Apr 1, Game 1 (doubleheader)

Teams

{sport}_{city}_{name}

Examples:

  • nba_la_lakers
  • mlb_new_york_yankees
  • nfl_new_york_giants

Stadiums

{sport}_{normalized_name}

Examples:

  • mlb_yankee_stadium
  • nba_crypto_com_arena
  • nfl_sofi_stadium

Generated vs Matched IDs

Entity     Generated                                                Matched
Teams      Pre-defined in team_resolver.py mappings                 Resolved from raw scraped names via aliases + fuzzy matching
Stadiums   Pre-defined in stadium_resolver.py mappings              Resolved from raw venue names via aliases + fuzzy matching
Games      Generated at scrape time from resolved team IDs + date   N/A (always generated, never matched)

Resolution Flow:

Raw Name (from scraper)
    ↓
Exact Match (alias lookup in team_aliases.json / stadium_aliases.json)
    ↓ (if no exact match)
Fuzzy Match (Levenshtein distance against known names)
    ├─ confidence > threshold → Canonical ID assigned
    └─ confidence ≤ threshold → Manual Review Item created

Cross-References

Entities reference each other via canonical IDs:

┌─────────────────────────────────────────────────────────────┐
│                          Game                                │
│  id: nba_2025_hou_okc_1021                                  │
│  home_team_id: nba_oklahoma_city_thunder  ──────────────┐   │
│  away_team_id: nba_houston_rockets  ────────────────┐   │   │
│  stadium_id: nba_paycom_center  ────────────────┐   │   │   │
└─────────────────────────────────────────────────│───│───│───┘
                                                  │   │   │
┌─────────────────────────────────────────────────│───│───│───┐
│                        Stadium                  │   │   │   │
│  id: nba_paycom_center  ◄───────────────────────┘   │   │   │
│  name: "Paycom Center"                              │   │   │
│  city: "Oklahoma City"                              │   │   │
│  latitude: 35.4634                                  │   │   │
│  longitude: -97.5151                                │   │   │
└─────────────────────────────────────────────────────│───│───┘
                                                      │   │
┌─────────────────────────────────────────────────────│───│───┐
│                         Team                        │   │   │
│  id: nba_houston_rockets  ◄─────────────────────────┘   │   │
│  name: "Rockets"                                        │   │
│  city: "Houston"                                        │   │
│  stadium_id: nba_toyota_center                          │   │
└─────────────────────────────────────────────────────────│───┘
                                                          │
┌─────────────────────────────────────────────────────────│───┐
│                         Team                            │   │
│  id: nba_oklahoma_city_thunder  ◄───────────────────────┘   │
│  name: "Thunder"                                            │
│  city: "Oklahoma City"                                      │
│  stadium_id: nba_paycom_center                              │
└─────────────────────────────────────────────────────────────┘

Alias Files

Aliases map variant names to canonical IDs:

team_aliases.json

{
  "nba": {
    "LA Lakers": "nba_la_lakers",
    "Los Angeles Lakers": "nba_la_lakers",
    "LAL": "nba_la_lakers"
  }
}

stadium_aliases.json

{
  "nba": {
    "Crypto.com Arena": "nba_crypto_com_arena",
    "Staples Center": "nba_crypto_com_arena",
    "STAPLES Center": "nba_crypto_com_arena"
  }
}

When a scraper returns a raw name like "LA Lakers", the resolver:

  1. Checks team_aliases.json for an exact match → finds nba_la_lakers
  2. If no exact match, runs fuzzy matching against all known team names
  3. If fuzzy match confidence > 80%, uses that canonical ID
  4. Otherwise, creates a manual review item for human resolution

Adding a New Sport

To add support for a new sport (e.g., cfb for college football), update these files:

1. Configuration (config.py)

Add the sport to SUPPORTED_SPORTS and EXPECTED_GAME_COUNTS:

SUPPORTED_SPORTS: list[str] = [
    "nba", "mlb", "nfl", "nhl", "mls", "wnba", "nwsl",
    "cfb",  # ← Add new sport
]

EXPECTED_GAME_COUNTS: dict[str, int] = {
    # ... existing sports ...
    "cfb": 900,  # ← Add expected game count for validation
}

2. Team Mappings (normalizers/team_resolver.py)

Add team definitions to TEAM_MAPPINGS. Each entry maps an abbreviation to (canonical_id, full_name, city):

TEAM_MAPPINGS: dict[str, dict[str, tuple[str, str, str]]] = {
    # ... existing sports ...
    "cfb": {
        "ALA": ("team_cfb_ala", "Alabama Crimson Tide", "Tuscaloosa"),
        "OSU": ("team_cfb_osu", "Ohio State Buckeyes", "Columbus"),
        # ... all teams ...
    },
}

3. Stadium Mappings (normalizers/stadium_resolver.py)

Add stadium definitions to STADIUM_MAPPINGS. Each entry is a StadiumInfo with coordinates:

STADIUM_MAPPINGS: dict[str, dict[str, StadiumInfo]] = {
    # ... existing sports ...
    "cfb": {
        "stadium_cfb_bryant_denny": StadiumInfo(
            id="stadium_cfb_bryant_denny",
            name="Bryant-Denny Stadium",
            city="Tuscaloosa",
            state="AL",
            country="USA",
            sport="cfb",
            latitude=33.2083,
            longitude=-87.5503,
        ),
        # ... all stadiums ...
    },
}

4. Scraper Implementation (scrapers/cfb.py)

Create a new scraper class extending BaseScraper:

from .base import BaseScraper, RawGameData, ScrapeResult

class CFBScraper(BaseScraper):
    def __init__(self, season: int, **kwargs):
        super().__init__("cfb", season, **kwargs)
        self._team_resolver = get_team_resolver("cfb")
        self._stadium_resolver = get_stadium_resolver("cfb")

    def _get_sources(self) -> list[str]:
        return ["espn", "sports_reference"]  # Priority order

    def _get_source_url(self, source: str, **kwargs) -> str:
        # Return URL for each source
        ...

    def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
        # Implement scraping logic
        ...

    def _normalize_games(self, raw_games: list[RawGameData]) -> tuple[list[Game], list[ManualReviewItem]]:
        # Convert raw data to Game objects using resolvers
        ...

    def scrape_teams(self) -> list[Team]:
        # Return Team objects from TEAM_MAPPINGS
        ...

    def scrape_stadiums(self) -> list[Stadium]:
        # Return Stadium objects from STADIUM_MAPPINGS
        ...

def create_cfb_scraper(season: int) -> CFBScraper:
    return CFBScraper(season=season)

5. Register Scraper (scrapers/__init__.py)

Export the new scraper:

from .cfb import CFBScraper, create_cfb_scraper

__all__ = [
    # ... existing exports ...
    "CFBScraper",
    "create_cfb_scraper",
]

6. CLI Registration (cli.py)

Add the sport to get_scraper():

def get_scraper(sport: str, season: int):
    # ... existing sports ...
    elif sport == "cfb":
        from .scrapers.cfb import create_cfb_scraper
        return create_cfb_scraper(season)

7. Alias Files (team_aliases.json, stadium_aliases.json)

Add initial aliases for common name variants:

// team_aliases.json
{
  "cfb": {
    "Alabama": "team_cfb_ala",
    "Bama": "team_cfb_ala",
    "Roll Tide": "team_cfb_ala"
  }
}

// stadium_aliases.json
{
  "cfb": {
    "Bryant Denny Stadium": "stadium_cfb_bryant_denny",
    "Bryant-Denny": "stadium_cfb_bryant_denny"
  }
}

8. Documentation (SOURCES.md)

Document data sources with URLs, rate limits, and notes:

## CFB (College Football)

**Teams**: 134 (FBS)
**Expected Games**: ~900 per season
**Season**: August - January

### Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/football/college-football/scoreboard` | JSON |
| 2 | Sports-Reference | `sports-reference.com/cfb/years/{YEAR}-schedule.html` | HTML |

9. Tests (tests/test_scrapers/test_cfb.py)

Create tests for the new scraper:

import pytest
from sportstime_parser.scrapers.cfb import CFBScraper, create_cfb_scraper

class TestCFBScraper:
    def test_factory_creates_scraper(self):
        scraper = create_cfb_scraper(season=2025)
        assert scraper.sport == "cfb"
        assert scraper.season == 2025

    def test_get_sources_returns_priority_list(self):
        scraper = CFBScraper(season=2025)
        sources = scraper._get_sources()
        assert "espn" in sources

    # ... more tests ...

Checklist

  • Add to SUPPORTED_SPORTS in config.py
  • Add to EXPECTED_GAME_COUNTS in config.py
  • Add team mappings to team_resolver.py
  • Add stadium mappings to stadium_resolver.py
  • Create scrapers/{sport}.py with scraper class
  • Export in scrapers/__init__.py
  • Register in cli.py get_scraper()
  • Add aliases to team_aliases.json
  • Add aliases to stadium_aliases.json
  • Document sources in SOURCES.md
  • Create tests in tests/test_scrapers/
  • Run pytest to verify all tests pass
  • Run dry-run scrape: sportstime-parser scrape {sport} --season 2025 --dry-run

Development

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=sportstime_parser --cov-report=html

# Run specific test file
pytest tests/test_scrapers/test_nba.py

# Run with verbose output
pytest -v

Project Structure

sportstime_parser/
  __init__.py
  __main__.py              # CLI entry point
  cli.py                   # Subcommand definitions
  config.py                # Constants, defaults

  models/
    game.py                # Game dataclass
    team.py                # Team dataclass
    stadium.py             # Stadium dataclass
    aliases.py             # Alias dataclasses

  scrapers/
    base.py                # BaseScraper abstract class
    nba.py                 # NBA scrapers
    mlb.py                 # MLB scrapers
    nfl.py                 # NFL scrapers
    nhl.py                 # NHL scrapers
    mls.py                 # MLS scrapers
    wnba.py                # WNBA scrapers
    nwsl.py                # NWSL scrapers

  normalizers/
    canonical_id.py        # ID generation
    team_resolver.py       # Team name resolution
    stadium_resolver.py    # Stadium name resolution
    timezone.py            # Timezone conversion
    fuzzy.py               # Fuzzy matching

  validators/
    report.py              # Validation report generator

  uploaders/
    cloudkit.py            # CloudKit Web Services client
    state.py               # Resumable upload state
    diff.py                # Record comparison

  utils/
    http.py                # Rate-limited HTTP client
    logging.py             # Verbose logger
    progress.py            # Progress bars

Troubleshooting

"No games file found"

Run the scrape command first:

sportstime-parser scrape nba --season 2025

"CloudKit not configured"

Set the required environment variables:

export CLOUDKIT_KEY_ID="your_key_id"
export CLOUDKIT_PRIVATE_KEY_PATH="/path/to/key.p8"

Rate limit errors

The scraper includes automatic rate limiting and exponential backoff. If you encounter persistent rate limit errors:

  1. Wait a few minutes before retrying
  2. Try scraping one sport at a time instead of "all"
  3. Check that you're not running multiple instances
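The backoff behavior can be sketched roughly as below. The base delay, cap, and RuntimeError stand-in for a rate-limit response are all illustrative; the real client's parameters live in utils/http.py:

```python
import time

def fetch_with_backoff(fetch, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry `fetch` with exponential backoff on rate-limit errors.

    `fetch` is any zero-argument callable that raises RuntimeError when the
    source rate-limits; delays are base_delay * 2^attempt, capped at 30s.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            time.sleep(min(base_delay * 2 ** attempt, 30.0))
```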

Scrape fails with no data

  1. Check your internet connection
  2. Run with --verbose to see detailed error messages
  3. The scraper tries multiple sources in priority order; if all fail, the source websites may be temporarily unavailable

License

MIT