SportsTime Parser
A Python package for scraping, normalizing, and uploading sports schedule data to CloudKit for the SportsTime iOS app.
Table of Contents
- Overview
- Installation
- Quick Start
- Architecture
- Directory Structure
- Configuration
- Data Models
- Normalizers
- Scrapers
- Uploaders
- Utilities
- Manual Review Workflow
- Adding a New Sport
- Troubleshooting
Overview
The sportstime_parser package provides a complete pipeline for:
- Scraping game schedules from multiple sources (Basketball-Reference, ESPN, MLB API, etc.)
- Normalizing raw data to canonical identifiers (teams, stadiums, games)
- Resolving team/stadium names using exact matching, historical aliases, and fuzzy matching
- Uploading data to CloudKit with diff-based sync and resumable uploads
Supported Sports
| Sport | Code | Sources | Season Format |
|---|---|---|---|
| NBA | nba | Basketball-Reference, ESPN, CBS | Oct-Jun (split year) |
| MLB | mlb | Baseball-Reference, MLB API, ESPN | Mar-Nov (single year) |
| NFL | nfl | ESPN, Pro-Football-Reference, CBS | Sep-Feb (split year) |
| NHL | nhl | Hockey-Reference, NHL API, ESPN | Oct-Jun (split year) |
| MLS | mls | ESPN, FBref | Feb-Nov (single year) |
| WNBA | wnba | ESPN | May-Oct (single year) |
| NWSL | nwsl | ESPN | Mar-Nov (single year) |
Installation
cd Scripts
pip install -r requirements.txt
Dependencies
- requests - HTTP requests with session management
- beautifulsoup4 + lxml - HTML parsing
- rapidfuzz - Fuzzy string matching
- pyjwt + cryptography - CloudKit JWT authentication
- rich - Terminal UI (progress bars, logging)
- pytz / timezonefinder - Timezone detection
Quick Start
Scrape a Single Sport
from sportstime_parser.scrapers import create_nba_scraper
scraper = create_nba_scraper(season=2025)
result = scraper.scrape_all()
print(f"Games: {result.game_count}")
print(f"Teams: {result.team_count}")
print(f"Stadiums: {result.stadium_count}")
print(f"Needs review: {result.review_count}")
Upload to CloudKit
from sportstime_parser.uploaders import CloudKitClient, RecordDiffer
client = CloudKitClient(environment="development")
differ = RecordDiffer()
# Compare local vs remote
diff = differ.diff_games(local_games, remote_records)
# Upload changes
records = diff.get_records_to_upload()
result = await client.save_records(records)
Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ DATA SOURCES │
│ Basketball-Reference │ ESPN API │ MLB API │ Hockey-Reference │ etc. │
└────────────────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ SCRAPERS │
│ NBAScraper │ MLBScraper │ NFLScraper │ NHLScraper │ MLSScraper │ etc. │
│ │
│ Features: │
│ • Multi-source fallback (try sources in priority order) │
│ • Automatic rate limiting with exponential backoff │
│ • Doubleheader detection │
│ • International game filtering (NFL London, NHL Global Series) │
└────────────────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ NORMALIZERS │
│ TeamResolver │ StadiumResolver │ CanonicalIdGenerator │ AliasLoader │
│ │
│ Resolution Strategy (in order): │
│ 1. Exact match against canonical mappings │
│ 2. Date-aware alias lookup (handles renames/relocations) │
│ 3. Fuzzy matching with confidence threshold (85%) │
│ 4. Flag for manual review if unresolved or low confidence │
└────────────────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ DATA MODELS │
│ Game │ Team │ Stadium │ ManualReviewItem │
│ │
│ All models use canonical IDs: │
│ • team_nba_lal (Los Angeles Lakers) │
│ • stadium_nba_los_angeles_lakers (Crypto.com Arena) │
│ • game_nba_2025_20251022_bos_lal (specific game) │
└────────────────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ UPLOADERS │
│ CloudKitClient │ RecordDiffer │ StateManager │
│ │
│ Features: │
│ • JWT authentication with Apple's CloudKit Web Services │
│ • Batch operations (up to 200 records per request) │
│ • Diff-based sync (only upload changes) │
│ • Resumable uploads with persistent state │
└────────────────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ CLOUDKIT │
│ Public Database: Games, Teams, Stadiums, Aliases │
└─────────────────────────────────────────────────────────────────────────┘
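The resolver fall-through in the NORMALIZERS stage can be sketched as follows. This is a minimal illustration with hypothetical names (`resolve_name`, `ResolveResult`); the real logic lives in the resolver classes under `normalizers/`:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ResolveResult:
    canonical_id: Optional[str]
    confidence: float
    needs_review: bool

def resolve_name(
    raw: str,
    exact: dict[str, str],
    aliases: dict[str, str],
    fuzzy_match: Callable[[str], tuple[Optional[str], float]],
    threshold: float = 0.85,
) -> ResolveResult:
    """Try exact match, then alias lookup, then fuzzy match; else flag for review."""
    if raw in exact:                        # 1. exact match against canonical mappings
        return ResolveResult(exact[raw], 1.0, False)
    if raw in aliases:                      # 2. alias lookup (date-awareness omitted here)
        return ResolveResult(aliases[raw], 1.0, False)
    match, score = fuzzy_match(raw)         # 3. fuzzy match with confidence threshold
    if match is not None and score >= threshold:
        return ResolveResult(match, score, False)
    return ResolveResult(match, score, True)  # 4. manual review
```

The real resolvers also thread a `check_date` through the alias stage so renamed or relocated teams resolve correctly for historical games.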
Directory Structure
Scripts/
├── README.md # This file
├── requirements.txt # Python dependencies
├── pyproject.toml # Package configuration
├── league_structure.json # League hierarchy (conferences, divisions)
├── team_aliases.json # Historical team name mappings
├── stadium_aliases.json # Historical stadium name mappings
├── logs/ # Runtime logs (auto-created)
├── output/ # Scrape output files (auto-created)
└── sportstime_parser/ # Main package
├── __init__.py
├── config.py # Configuration constants
├── SOURCES.md # Data source documentation
├── models/ # Data classes
│ ├── game.py # Game model
│ ├── team.py # Team model
│ ├── stadium.py # Stadium model
│ └── aliases.py # Alias and ManualReviewItem models
├── normalizers/ # Name resolution
│ ├── canonical_id.py # ID generation
│ ├── alias_loader.py # Alias loading and resolution
│ ├── fuzzy.py # Fuzzy string matching
│ ├── timezone.py # Timezone detection
│ ├── team_resolver.py # Team name resolution
│ └── stadium_resolver.py # Stadium name resolution
├── scrapers/ # Sport-specific scrapers
│ ├── base.py # Abstract base scraper
│ ├── nba.py # NBA scraper
│ ├── mlb.py # MLB scraper
│ ├── nfl.py # NFL scraper
│ ├── nhl.py # NHL scraper
│ ├── mls.py # MLS scraper
│ ├── wnba.py # WNBA scraper
│ └── nwsl.py # NWSL scraper
├── uploaders/ # CloudKit integration
│ ├── cloudkit.py # CloudKit Web Services client
│ ├── diff.py # Record diffing
│ └── state.py # Resumable upload state
└── utils/ # Shared utilities
├── logging.py # Rich-based logging
├── http.py # Rate-limited HTTP client
└── progress.py # Progress tracking
Configuration
config.py
Key configuration constants:
# Directories
SCRIPTS_DIR = Path(__file__).parent.parent # Scripts/
OUTPUT_DIR = SCRIPTS_DIR / "output" # JSON output
STATE_DIR = SCRIPTS_DIR / ".parser_state" # Upload state
# CloudKit
CLOUDKIT_CONTAINER = "iCloud.com.sportstime.app"
CLOUDKIT_ENVIRONMENT = "development" # or "production"
# Rate Limiting
DEFAULT_REQUEST_DELAY = 3.0 # seconds between requests
MAX_RETRIES = 3 # retry attempts
BACKOFF_FACTOR = 2.0 # exponential backoff multiplier
INITIAL_BACKOFF = 5.0 # initial backoff duration
# Fuzzy Matching
FUZZY_THRESHOLD = 85 # minimum match confidence (0-100)
# Expected game counts (for validation)
EXPECTED_GAME_COUNTS = {
"nba": 1230, # 30 teams × 82 games ÷ 2
"mlb": 2430, # 30 teams × 162 games ÷ 2
"nfl": 272, # Regular season only
"nhl": 1312, # 32 teams × 82 games ÷ 2
"mls": 544, # 29 teams × ~34 games ÷ 2
"wnba": 228, # 12 teams × 40 games ÷ 2
"nwsl": 182, # 14 teams × 26 games ÷ 2
}
# Geography (for filtering international games)
ALLOWED_COUNTRIES = {"USA", "Canada"}
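These expected counts lend themselves to a post-scrape sanity check. A sketch of such a helper (hypothetical name and tolerance; not part of the package API):

```python
EXPECTED_GAME_COUNTS = {"nba": 1230, "nfl": 272}  # excerpt of the config table above

def validate_game_count(sport: str, scraped: int, tolerance: float = 0.02) -> bool:
    """Return True if the scraped count is within tolerance of the expected count.

    A small tolerance allows for postponed or cancelled games.
    """
    expected = EXPECTED_GAME_COUNTS.get(sport)
    if expected is None:
        return True  # no expectation registered for this sport
    return abs(scraped - expected) <= expected * tolerance
```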
league_structure.json
Defines the hierarchical structure of each league:
{
"nba": {
"name": "National Basketball Association",
"conferences": {
"Eastern": {
"divisions": {
"Atlantic": ["BOS", "BKN", "NYK", "PHI", "TOR"],
"Central": ["CHI", "CLE", "DET", "IND", "MIL"],
"Southeast": ["ATL", "CHA", "MIA", "ORL", "WAS"]
}
},
"Western": { ... }
}
},
"mlb": { ... },
...
}
team_aliases.json / stadium_aliases.json
Historical name mappings with validity dates:
{
"team_mlb_athletics": [
{
"alias": "Oakland Athletics",
"alias_type": "full_name",
"valid_from": "1968-01-01",
"valid_until": "2024-12-31"
},
{
"alias": "Las Vegas Athletics",
"alias_type": "full_name",
"valid_from": "2028-01-01",
"valid_until": null
}
]
}
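A date-aware lookup over entries of this shape might work as sketched below (hypothetical helper; the real implementation lives in `alias_loader.py`):

```python
from datetime import date
from typing import Optional

def resolve_alias(aliases: dict, raw_name: str, check_date: date) -> Optional[str]:
    """Return the canonical ID whose alias matches raw_name and is valid on check_date."""
    for canonical_id, entries in aliases.items():
        for entry in entries:
            if entry["alias"] != raw_name:
                continue
            valid_from = date.fromisoformat(entry["valid_from"])
            # A null valid_until means the alias is still current.
            valid_until = (date.fromisoformat(entry["valid_until"])
                           if entry["valid_until"] else date.max)
            if valid_from <= check_date <= valid_until:
                return canonical_id
    return None
```

With the Athletics example above, "Oakland Athletics" resolves for a 2024 game date but not for one after 2024-12-31.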
Data Models
Game
@dataclass
class Game:
id: str # Canonical ID: game_{sport}_{season}_{date}_{away}_{home}
sport: str # Sport code (nba, mlb, etc.)
season: int # Season start year
home_team_id: str # Canonical team ID
away_team_id: str # Canonical team ID
stadium_id: str # Canonical stadium ID
game_date: datetime # UTC datetime
game_number: Optional[int] # 1 or 2 for doubleheaders
home_score: Optional[int] # None if not played
away_score: Optional[int]
status: str # scheduled, final, postponed, cancelled
source_url: Optional[str] # For manual review
raw_home_team: Optional[str] # Original scraped value
raw_away_team: Optional[str]
raw_stadium: Optional[str]
Team
@dataclass
class Team:
id: str # Canonical ID: team_{sport}_{abbrev}
sport: str
city: str # e.g., "Los Angeles"
name: str # e.g., "Lakers"
full_name: str # e.g., "Los Angeles Lakers"
abbreviation: str # e.g., "LAL"
conference: Optional[str] # e.g., "Western"
division: Optional[str] # e.g., "Pacific"
stadium_id: Optional[str] # Home stadium
primary_color: Optional[str]
secondary_color: Optional[str]
logo_url: Optional[str]
Stadium
@dataclass
class Stadium:
id: str # Canonical ID: stadium_{sport}_{city_team}
sport: str
name: str # Current name (e.g., "Crypto.com Arena")
city: str
state: Optional[str]
country: str
latitude: Optional[float]
longitude: Optional[float]
capacity: Optional[int]
surface: Optional[str] # grass, turf, ice, hardwood
roof_type: Optional[str] # dome, retractable, open
opened_year: Optional[int]
image_url: Optional[str]
timezone: Optional[str]
ManualReviewItem
@dataclass
class ManualReviewItem:
item_type: str # "team" or "stadium"
raw_value: str # Original scraped value
suggested_id: Optional[str] # Best fuzzy match (if any)
confidence: float # 0.0 - 1.0
reason: str # Why review is needed
source_url: Optional[str] # Where it came from
sport: str
check_date: Optional[date] # For date-aware alias lookup
Normalizers
Canonical ID Generation
IDs are deterministic and immutable:
# Team ID
generate_team_id("nba", "LAL")
# → "team_nba_lal"
# Stadium ID
generate_stadium_id("nba", "Los Angeles", "Lakers")
# → "stadium_nba_los_angeles_lakers"
# Game ID
generate_game_id(
sport="nba",
season=2025,
away_abbrev="BOS",
home_abbrev="LAL",
game_date=datetime(2025, 10, 22),
game_number=None
)
# → "game_nba_2025_20251022_bos_lal"
# Doubleheader Game ID
generate_game_id(..., game_number=2)
# → "game_nba_2025_20251022_bos_lal_2"
Team Resolution
The TeamResolver uses a three-stage strategy:
resolver = get_team_resolver("nba")
result = resolver.resolve(
"Los Angeles Lakers",
check_date=date(2025, 10, 22),
source_url="https://..."
)
# Result:
# - canonical_id: "team_nba_lal"
# - confidence: 1.0 (exact match)
# - review_item: None
Resolution stages:
1. Exact Match: Check against canonical team mappings
   - Full name: "Los Angeles Lakers"
   - City + Name: "Los Angeles" + "Lakers"
   - Abbreviation: "LAL"
2. Alias Lookup: Check historical aliases with date awareness
   - "Oakland Athletics" → "team_mlb_athletics" (valid until 2024-12-31)
   - Handles relocations: "Oakland" → "Las Vegas" transition
3. Fuzzy Match: Use rapidfuzz with an 85% threshold
   - "LA Lakers" → "Los Angeles Lakers" (92% match)
   - Low-confidence matches are flagged for manual review
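The package uses rapidfuzz for stage 3; purely for illustration, the same threshold check can be written with the standard library's difflib (note that difflib's scorer differs from rapidfuzz's, so individual scores will not match the 92% example above):

```python
from difflib import SequenceMatcher
from typing import Optional

def fuzzy_best_match(raw: str, candidates: list[str],
                     threshold: float = 85.0) -> tuple[Optional[str], float]:
    """Return (best_candidate, score) if the score clears the threshold, else (None, score).

    Scores are scaled to 0-100 to mirror the FUZZY_THRESHOLD config constant.
    """
    best, best_score = None, 0.0
    for cand in candidates:
        score = SequenceMatcher(None, raw.lower(), cand.lower()).ratio() * 100
        if score > best_score:
            best, best_score = cand, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score
```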
Stadium Resolution
Similar three-stage strategy with additional location awareness:
resolver = get_stadium_resolver("nba")
result = resolver.resolve(
"Crypto.com Arena",
check_date=date(2025, 10, 22)
)
Key features:
- Handles naming rights changes (Staples Center → Crypto.com Arena)
- Date-aware: "Staples Center" resolves correctly for historical games
- Location-based fallback using latitude/longitude
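The location-based fallback could look like this minimal sketch (hypothetical helper; the 1 km threshold is illustrative):

```python
from math import asin, cos, radians, sin, sqrt
from typing import Optional

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def nearest_stadium(lat: float, lon: float,
                    stadiums: dict[str, tuple[float, float]],
                    max_km: float = 1.0) -> Optional[str]:
    """Return the closest stadium ID within max_km of the given point, else None.

    stadiums maps canonical ID -> (latitude, longitude).
    """
    best_id, best_dist = None, max_km
    for stadium_id, (s_lat, s_lon) in stadiums.items():
        d = haversine_km(lat, lon, s_lat, s_lon)
        if d <= best_dist:
            best_id, best_dist = stadium_id, d
    return best_id
```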
Scrapers
Base Scraper
All scrapers extend BaseScraper with these features:
class BaseScraper(ABC):
def __init__(self, sport: str, season: int): ...
# Required implementations
def _get_sources(self) -> list[str]: ...
def _scrape_games_from_source(self, source: str) -> list[RawGameData]: ...
def _normalize_games(self, raw_games) -> tuple[list[Game], list[ManualReviewItem]]: ...
def scrape_teams(self) -> list[Team]: ...
def scrape_stadiums(self) -> list[Stadium]: ...
# Built-in features
def scrape_games(self) -> ScrapeResult:
"""Multi-source fallback - tries each source in order."""
...
def scrape_all(self) -> ScrapeResult:
"""Scrapes games, teams, and stadiums with progress tracking."""
...
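The multi-source fallback in `scrape_games` can be sketched as below (simplified; the real code uses narrower exception types, logging, and per-source validation):

```python
def scrape_with_fallback(sources: list[str], scrape_fn, min_games: int = 1):
    """Try each source in priority order; return results from the first that succeeds."""
    errors: dict[str, Exception] = {}
    for source in sources:
        try:
            games = scrape_fn(source)
        except Exception as exc:  # illustration only; catch specific errors in practice
            errors[source] = exc
            continue
        if len(games) >= min_games:
            return source, games  # first source with a plausible result wins
        errors[source] = ValueError(f"only {len(games)} games returned")
    raise RuntimeError(f"all sources failed: {errors}")
```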
NBA Scraper
class NBAScraper(BaseScraper):
"""
Sources (in priority order):
1. Basketball-Reference - HTML tables, monthly pages
2. ESPN API - JSON, per-date queries
3. CBS Sports - Backup (not implemented)
Season: October to June (split year, e.g., 2025-26)
"""
Basketball-Reference parsing:
- URL: https://www.basketball-reference.com/leagues/NBA_{year}_games-{month}.html
- Table columns: date_game, visitor_team_name, home_team_name, visitor_pts, home_pts, arena_name
MLB Scraper
class MLBScraper(BaseScraper):
"""
Sources:
1. Baseball-Reference - Single page per season
2. MLB Stats API - Official API with date range queries
3. ESPN API - Backup
Season: March to November (single year)
Handles: Doubleheaders with game_number
"""
NFL Scraper
class NFLScraper(BaseScraper):
"""
Sources:
1. ESPN API - Week-based queries
2. Pro-Football-Reference - Single page per season
Season: September to February (split year)
Filters: International games (London, Mexico City, Frankfurt)
Scrapes: Preseason (4 weeks), Regular (18 weeks), Postseason (4 rounds)
"""
NHL Scraper
class NHLScraper(BaseScraper):
"""
Sources:
1. Hockey-Reference - Single page per season
2. NHL API - New API (api-web.nhle.com)
3. ESPN API - Backup
Season: October to June (split year)
Filters: International games (Prague, Stockholm, Helsinki)
"""
MLS / WNBA / NWSL Scrapers
All use ESPN API as primary source with similar structure:
- Single calendar year seasons
- Conference-based organization (MLS) or single table (WNBA, NWSL)
Uploaders
CloudKit Client
class CloudKitClient:
"""CloudKit Web Services API client with JWT authentication."""
def __init__(
self,
container_id: str = CLOUDKIT_CONTAINER,
environment: str = "development", # or "production"
key_id: str = None, # From CloudKit Dashboard
private_key: str = None, # EC P-256 private key
): ...
async def fetch_records(
self,
record_type: RecordType,
filter_by: Optional[dict] = None,
sort_by: Optional[str] = None,
) -> list[dict]: ...
async def save_records(
self,
records: list[CloudKitRecord],
batch_size: int = 200,
) -> BatchResult: ...
async def delete_records(
self,
record_names: list[str],
record_type: RecordType,
) -> BatchResult: ...
Authentication:
- Uses EC P-256 key pair
- JWT tokens signed with private key
- Tokens valid for 30 minutes
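Splitting records into batches of at most 200, as `save_records` does, keeps each request under CloudKit's per-operation record limit. A minimal chunking sketch (hypothetical helper name):

```python
from typing import Iterator

def chunk_records(records: list, batch_size: int = 200) -> Iterator[list]:
    """Yield successive batches of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]
```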
Record Differ
class RecordDiffer:
"""Compares local records with CloudKit records."""
def diff_games(self, local: list[Game], remote: list[dict]) -> DiffResult: ...
def diff_teams(self, local: list[Team], remote: list[dict]) -> DiffResult: ...
def diff_stadiums(self, local: list[Stadium], remote: list[dict]) -> DiffResult: ...
DiffResult:
@dataclass
class DiffResult:
creates: list[RecordDiff] # New records to create
updates: list[RecordDiff] # Changed records to update
deletes: list[RecordDiff] # Remote records to delete
unchanged: list[RecordDiff] # Records with no changes
def get_records_to_upload(self) -> list[CloudKitRecord]:
"""Returns creates + updates ready for upload."""
State Manager
class StateManager:
"""Manages resumable upload state."""
def load_session(self, sport, season, environment) -> Optional[UploadSession]: ...
def save_session(self, session: UploadSession) -> None: ...
def get_session_or_create(
self,
sport, season, environment,
record_names: list[tuple[str, str]],
resume: bool = False,
) -> UploadSession: ...
State persistence:
- Stored in .parser_state/upload_state_{sport}_{season}_{env}.json
- Tracks: pending, uploaded, and failed records
- Supports retry with backoff
Utilities
HTTP Client
class RateLimitedSession:
"""HTTP session with rate limiting and exponential backoff."""
def __init__(
self,
delay: float = 3.0, # Seconds between requests
max_retries: int = 3,
backoff_factor: float = 2.0,
): ...
def get(self, url, **kwargs) -> Response: ...
def get_json(self, url, **kwargs) -> dict: ...
def get_html(self, url, **kwargs) -> str: ...
Features:
- User-agent rotation (5 different Chrome/Firefox/Safari agents)
- Per-domain rate limiting
- Automatic 429 handling with exponential backoff + jitter
- Connection pooling
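The backoff schedule implied by the config constants (initial 5.0 s, factor 2.0, plus jitter) can be sketched as:

```python
import random
from typing import Iterator

def backoff_delays(max_retries: int = 3, initial: float = 5.0,
                   factor: float = 2.0, jitter: float = 0.5) -> Iterator[float]:
    """Yield a sleep duration per retry: initial * factor**attempt, plus random jitter.

    Jitter (up to jitter * base seconds) spreads retries out so concurrent
    clients don't hammer a recovering server in lockstep.
    """
    for attempt in range(max_retries):
        base = initial * factor ** attempt
        yield base + random.uniform(0, jitter * base)
```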
Logging
from sportstime_parser.utils import get_logger, log_success, log_error
logger = get_logger() # Rich-formatted logger
logger.info("Starting scrape")
log_success("Scraped 1230 games") # Green checkmark
log_error("Failed to parse") # Red X
Log output:
- Console: Rich-formatted with colors
- File: logs/parser_{timestamp}.log
Progress Tracking
from sportstime_parser.utils import ScrapeProgress, track_progress
# Specialized scrape tracking
progress = ScrapeProgress("nba", 2025)
progress.start()
with progress.scraping_schedule(total_months=9) as advance:
for month in months:
fetch(month)
advance()
progress.finish() # Prints summary
# Generic progress bar
for game in track_progress(games, "Processing games"):
process(game)
Manual Review Workflow
When the system can't confidently resolve a team or stadium:
1. Low-confidence fuzzy match (< 85%):

ManualReviewItem(
    item_type="team",
    raw_value="LA Lakers",
    suggested_id="team_nba_lal",
    confidence=0.82,
    reason="Fuzzy match below threshold",
)

2. No match found:

ManualReviewItem(
    raw_value="Unknown Team FC",
    suggested_id=None,
    confidence=0.0,
    reason="No match found in canonical mappings",
)

3. Ambiguous match (multiple candidates):

ManualReviewItem(
    raw_value="LA",
    suggested_id="team_nba_lac",
    confidence=0.5,
    reason="Ambiguous: could be Lakers or Clippers",
)
Resolution:
- Review items are exported to JSON
- Manually verify each item and add entries to team_aliases.json or stadium_aliases.json
- Re-run the scrape - the new aliases will be used for resolution
Adding a New Sport
1. Create a scraper in scrapers/{sport}.py:

class NewSportScraper(BaseScraper):
    def __init__(self, season: int, **kwargs):
        super().__init__("newsport", season, **kwargs)
        self._team_resolver = get_team_resolver("newsport")
        self._stadium_resolver = get_stadium_resolver("newsport")

    def _get_sources(self) -> list[str]:
        return ["primary_source", "backup_source"]

    def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
        # Implement source-specific scraping
        ...

    def _normalize_games(self, raw_games) -> tuple[list[Game], list[ManualReviewItem]]:
        # Use resolvers to normalize
        ...

    def scrape_teams(self) -> list[Team]:
        # Return canonical team list
        ...

    def scrape_stadiums(self) -> list[Stadium]:
        # Return canonical stadium list
        ...

2. Add team mappings in normalizers/team_resolver.py:

TEAM_MAPPINGS["newsport"] = {
    "ABC": ("team_newsport_abc", "Full Team Name", "City"),
    ...
}

3. Add stadium mappings in normalizers/stadium_resolver.py:

STADIUM_MAPPINGS["newsport"] = {
    "stadium_newsport_venue": StadiumInfo(
        name="Venue Name",
        city="City",
        state="State",
        country="USA",
        latitude=40.0,
        longitude=-74.0,
    ),
    ...
}

4. Add the league to league_structure.json (if hierarchical)

5. Update config.py:

EXPECTED_GAME_COUNTS["newsport"] = 500

6. Export the new scraper from __init__.py
Troubleshooting
Rate Limiting (429 errors)
The system handles these automatically with exponential backoff. If persistent:
- Increase DEFAULT_REQUEST_DELAY in config.py
- Check whether the source has changed its rate limits
Missing Teams/Stadiums
- Check scraper logs for raw values
- Add entries to team_aliases.json or stadium_aliases.json
- Or add to the canonical mappings if it's a new team or stadium
CloudKit Authentication Errors
- Verify key_id matches CloudKit Dashboard
- Check private key format (EC P-256, PEM)
- Ensure container identifier is correct
Incomplete Scrapes
The system discards partial data on errors. Check:
- logs/ for error details
- Network connectivity
- Source website availability
International Games Appearing
NFL and NHL scrapers filter these automatically. If new locations emerge:
- Add the location to INTERNATIONAL_LOCATIONS in the scraper
- Or add filtering logic for neutral-site games
Contributing
- Follow existing patterns for new scrapers
- Always use canonical IDs
- Add aliases for historical names
- Include source URLs for traceability
- Test with multiple seasons