# SportsTime Parser
A Python package for scraping, normalizing, and uploading sports schedule data to CloudKit for the SportsTime iOS app.
## Table of Contents
- [Overview](#overview)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Architecture](#architecture)
- [Directory Structure](#directory-structure)
- [Configuration](#configuration)
- [Data Models](#data-models)
- [Normalizers](#normalizers)
- [Scrapers](#scrapers)
- [Uploaders](#uploaders)
- [Utilities](#utilities)
- [Manual Review Workflow](#manual-review-workflow)
- [Adding a New Sport](#adding-a-new-sport)
- [Troubleshooting](#troubleshooting)
## Overview
The `sportstime_parser` package provides a complete pipeline for:
1. **Scraping** game schedules from multiple sources (Basketball-Reference, ESPN, MLB API, etc.)
2. **Normalizing** raw data to canonical identifiers (teams, stadiums, games)
3. **Resolving** team/stadium names using exact matching, historical aliases, and fuzzy matching
4. **Uploading** data to CloudKit with diff-based sync and resumable uploads
### Supported Sports
| Sport | Code | Sources | Season Format |
|-------|------|---------|---------------|
| NBA | `nba` | Basketball-Reference, ESPN, CBS | Oct-Jun (split year) |
| MLB | `mlb` | Baseball-Reference, MLB API, ESPN | Mar-Nov (single year) |
| NFL | `nfl` | ESPN, Pro-Football-Reference, CBS | Sep-Feb (split year) |
| NHL | `nhl` | Hockey-Reference, NHL API, ESPN | Oct-Jun (split year) |
| MLS | `mls` | ESPN, FBref | Feb-Nov (single year) |
| WNBA | `wnba` | ESPN | May-Oct (single year) |
| NWSL | `nwsl` | ESPN | Mar-Nov (single year) |
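
The split-year vs. single-year formats in the table above can be illustrated with a small helper. This is a hypothetical sketch, not part of the package; `SPLIT_YEAR_SPORTS` and `season_label` are illustrative names:

```python
# Hypothetical helper: renders a human-readable season label from the
# season start year, using the formats in the table above.
SPLIT_YEAR_SPORTS = {"nba", "nfl", "nhl"}  # seasons span two calendar years

def season_label(sport: str, season: int) -> str:
    """'2025-26' for split-year sports, '2025' for single-year sports."""
    if sport in SPLIT_YEAR_SPORTS:
        return f"{season}-{(season + 1) % 100:02d}"
    return str(season)
```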
## Installation
```bash
cd Scripts
pip install -r requirements.txt
```
### Dependencies
- `requests` - HTTP requests with session management
- `beautifulsoup4` + `lxml` - HTML parsing
- `rapidfuzz` - Fuzzy string matching
- `pyjwt` + `cryptography` - CloudKit JWT authentication
- `rich` - Terminal UI (progress bars, logging)
- `pytz` / `timezonefinder` - Timezone detection
## Quick Start
### Scrape a Single Sport
```python
from sportstime_parser.scrapers import create_nba_scraper
scraper = create_nba_scraper(season=2025)
result = scraper.scrape_all()
print(f"Games: {result.game_count}")
print(f"Teams: {result.team_count}")
print(f"Stadiums: {result.stadium_count}")
print(f"Needs review: {result.review_count}")
```
### Upload to CloudKit
```python
import asyncio

from sportstime_parser.uploaders import CloudKitClient, RecordDiffer

async def main():
    client = CloudKitClient(environment="development")
    differ = RecordDiffer()

    # Compare local vs remote
    diff = differ.diff_games(local_games, remote_records)

    # Upload changes
    records = diff.get_records_to_upload()
    result = await client.save_records(records)

asyncio.run(main())
```
## Architecture
```
DATA SOURCES
  Basketball-Reference │ ESPN API │ MLB API │ Hockey-Reference │ etc.
                │
                ▼
SCRAPERS
  NBAScraper │ MLBScraper │ NFLScraper │ NHLScraper │ MLSScraper │ etc.
  • Multi-source fallback (try sources in priority order)
  • Automatic rate limiting with exponential backoff
  • Doubleheader detection
  • International game filtering (NFL London, NHL Global Series)
                │
                ▼
NORMALIZERS
  TeamResolver │ StadiumResolver │ CanonicalIdGenerator │ AliasLoader
  Resolution strategy (in order):
    1. Exact match against canonical mappings
    2. Date-aware alias lookup (handles renames/relocations)
    3. Fuzzy matching with confidence threshold (85%)
    4. Flag for manual review if unresolved or low confidence
                │
                ▼
DATA MODELS
  Game │ Team │ Stadium │ ManualReviewItem
  All models use canonical IDs:
  • team_nba_lal (Los Angeles Lakers)
  • stadium_nba_los_angeles_lakers (Crypto.com Arena)
  • game_nba_2025_20251022_bos_lal (specific game)
                │
                ▼
UPLOADERS
  CloudKitClient │ RecordDiffer │ StateManager
  • JWT authentication with Apple's CloudKit Web Services
  • Batch operations (up to 200 records per request)
  • Diff-based sync (only upload changes)
  • Resumable uploads with persistent state
                │
                ▼
CLOUDKIT
  Public database: Games, Teams, Stadiums, Aliases
```
## Directory Structure
```
Scripts/
├── README.md                  # This file
├── requirements.txt           # Python dependencies
├── pyproject.toml             # Package configuration
├── league_structure.json      # League hierarchy (conferences, divisions)
├── team_aliases.json          # Historical team name mappings
├── stadium_aliases.json       # Historical stadium name mappings
├── logs/                      # Runtime logs (auto-created)
├── output/                    # Scrape output files (auto-created)
└── sportstime_parser/         # Main package
    ├── __init__.py
    ├── config.py              # Configuration constants
    ├── SOURCES.md             # Data source documentation
    ├── models/                # Data classes
    │   ├── game.py            # Game model
    │   ├── team.py            # Team model
    │   ├── stadium.py         # Stadium model
    │   └── aliases.py         # Alias and ManualReviewItem models
    ├── normalizers/           # Name resolution
    │   ├── canonical_id.py    # ID generation
    │   ├── alias_loader.py    # Alias loading and resolution
    │   ├── fuzzy.py           # Fuzzy string matching
    │   ├── timezone.py        # Timezone detection
    │   ├── team_resolver.py   # Team name resolution
    │   └── stadium_resolver.py # Stadium name resolution
    ├── scrapers/              # Sport-specific scrapers
    │   ├── base.py            # Abstract base scraper
    │   ├── nba.py             # NBA scraper
    │   ├── mlb.py             # MLB scraper
    │   ├── nfl.py             # NFL scraper
    │   ├── nhl.py             # NHL scraper
    │   ├── mls.py             # MLS scraper
    │   ├── wnba.py            # WNBA scraper
    │   └── nwsl.py            # NWSL scraper
    ├── uploaders/             # CloudKit integration
    │   ├── cloudkit.py        # CloudKit Web Services client
    │   ├── diff.py            # Record diffing
    │   └── state.py           # Resumable upload state
    └── utils/                 # Shared utilities
        ├── logging.py         # Rich-based logging
        ├── http.py            # Rate-limited HTTP client
        └── progress.py        # Progress tracking
```
## Configuration
### config.py
Key configuration constants:
```python
# Directories
SCRIPTS_DIR = Path(__file__).parent.parent # Scripts/
OUTPUT_DIR = SCRIPTS_DIR / "output" # JSON output
STATE_DIR = SCRIPTS_DIR / ".parser_state" # Upload state
# CloudKit
CLOUDKIT_CONTAINER = "iCloud.com.sportstime.app"
CLOUDKIT_ENVIRONMENT = "development" # or "production"
# Rate Limiting
DEFAULT_REQUEST_DELAY = 3.0 # seconds between requests
MAX_RETRIES = 3 # retry attempts
BACKOFF_FACTOR = 2.0 # exponential backoff multiplier
INITIAL_BACKOFF = 5.0 # initial backoff duration
# Fuzzy Matching
FUZZY_THRESHOLD = 85 # minimum match confidence (0-100)
# Expected game counts (for validation)
EXPECTED_GAME_COUNTS = {
    "nba": 1230,   # 30 teams × 82 games ÷ 2
    "mlb": 2430,   # 30 teams × 162 games ÷ 2
    "nfl": 272,    # Regular season only
    "nhl": 1312,   # 32 teams × 82 games ÷ 2
    "mls": 544,    # Regular-season total (varies by season)
    "wnba": 228,   # 12 teams × 38 games ÷ 2
    "nwsl": 182,   # 14 teams × 26 games ÷ 2
}
# Geography (for filtering international games)
ALLOWED_COUNTRIES = {"USA", "Canada"}
```
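
The expected counts above can back a quick sanity check after a scrape. This is an illustrative sketch (the function name and ±2% tolerance are assumptions, not part of config.py):

```python
EXPECTED_GAME_COUNTS = {"nba": 1230, "nfl": 272}  # subset of the config above

def check_game_count(sport: str, actual: int, tolerance: float = 0.02) -> bool:
    """True if the scraped count is within ±2% of the expected total."""
    expected = EXPECTED_GAME_COUNTS.get(sport)
    if expected is None:
        return True  # no expectation configured for this sport; skip the check
    return abs(actual - expected) <= expected * tolerance
```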
### league_structure.json
Defines the hierarchical structure of each league:
```json
{
  "nba": {
    "name": "National Basketball Association",
    "conferences": {
      "Eastern": {
        "divisions": {
          "Atlantic": ["BOS", "BKN", "NYK", "PHI", "TOR"],
          "Central": ["CHI", "CLE", "DET", "IND", "MIL"],
          "Southeast": ["ATL", "CHA", "MIA", "ORL", "WAS"]
        }
      },
      "Western": { ... }
    }
  },
  "mlb": { ... },
  ...
}
```
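
Consumers of this file typically walk the conference/division tree. A minimal sketch, assuming the JSON has already been loaded into a dict (`teams_in_league` is an illustrative name, not a package function):

```python
def teams_in_league(structure: dict, sport: str) -> set[str]:
    """Collect every team abbreviation under a sport's conference/division tree."""
    teams: set[str] = set()
    for conference in structure[sport]["conferences"].values():
        for division_teams in conference["divisions"].values():
            teams.update(division_teams)
    return teams
```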
### team_aliases.json / stadium_aliases.json
Historical name mappings with validity dates:
```json
{
  "team_mlb_athletics": [
    {
      "alias": "Oakland Athletics",
      "alias_type": "full_name",
      "valid_from": "1968-01-01",
      "valid_until": "2024-12-31"
    },
    {
      "alias": "Las Vegas Athletics",
      "alias_type": "full_name",
      "valid_from": "2028-01-01",
      "valid_until": null
    }
  ]
}
```
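
The lookup these entries drive is date-aware: an alias resolves only if the game date falls inside its validity window. A self-contained sketch of that rule (not the actual AliasLoader code; the in-memory shape is an assumption):

```python
from datetime import date

# Mirrors the JSON above with dates parsed; None means open-ended.
ALIASES = {
    "oakland athletics": [
        {"id": "team_mlb_athletics",
         "valid_from": date(1968, 1, 1),
         "valid_until": date(2024, 12, 31)},
    ],
}

def resolve_alias(name: str, check_date: date):
    """Return the canonical ID whose validity window contains check_date."""
    for entry in ALIASES.get(name.lower(), []):
        after_start = entry["valid_from"] is None or check_date >= entry["valid_from"]
        before_end = entry["valid_until"] is None or check_date <= entry["valid_until"]
        if after_start and before_end:
            return entry["id"]
    return None  # no alias valid on that date; fall through to fuzzy matching
```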
## Data Models
### Game
```python
@dataclass
class Game:
    id: str                       # Canonical ID: game_{sport}_{season}_{date}_{away}_{home}
    sport: str                    # Sport code (nba, mlb, etc.)
    season: int                   # Season start year
    home_team_id: str             # Canonical team ID
    away_team_id: str             # Canonical team ID
    stadium_id: str               # Canonical stadium ID
    game_date: datetime           # UTC datetime
    game_number: Optional[int]    # 1 or 2 for doubleheaders
    home_score: Optional[int]     # None if not played
    away_score: Optional[int]
    status: str                   # scheduled, final, postponed, cancelled
    source_url: Optional[str]     # For manual review
    raw_home_team: Optional[str]  # Original scraped value
    raw_away_team: Optional[str]
    raw_stadium: Optional[str]
```
### Team
```python
@dataclass
class Team:
    id: str                     # Canonical ID: team_{sport}_{abbrev}
    sport: str
    city: str                   # e.g., "Los Angeles"
    name: str                   # e.g., "Lakers"
    full_name: str              # e.g., "Los Angeles Lakers"
    abbreviation: str           # e.g., "LAL"
    conference: Optional[str]   # e.g., "Western"
    division: Optional[str]     # e.g., "Pacific"
    stadium_id: Optional[str]   # Home stadium
    primary_color: Optional[str]
    secondary_color: Optional[str]
    logo_url: Optional[str]
```
### Stadium
```python
@dataclass
class Stadium:
    id: str                     # Canonical ID: stadium_{sport}_{city_team}
    sport: str
    name: str                   # Current name (e.g., "Crypto.com Arena")
    city: str
    state: Optional[str]
    country: str
    latitude: Optional[float]
    longitude: Optional[float]
    capacity: Optional[int]
    surface: Optional[str]      # grass, turf, ice, hardwood
    roof_type: Optional[str]    # dome, retractable, open
    opened_year: Optional[int]
    image_url: Optional[str]
    timezone: Optional[str]
```
### ManualReviewItem
```python
@dataclass
class ManualReviewItem:
    item_type: str               # "team" or "stadium"
    raw_value: str               # Original scraped value
    suggested_id: Optional[str]  # Best fuzzy match (if any)
    confidence: float            # 0.0 - 1.0
    reason: str                  # Why review is needed
    source_url: Optional[str]    # Where it came from
    sport: str
    check_date: Optional[date]   # For date-aware alias lookup
```
## Normalizers
### Canonical ID Generation
IDs are deterministic and immutable:
```python
# Team ID
generate_team_id("nba", "LAL")
# → "team_nba_lal"

# Stadium ID
generate_stadium_id("nba", "Los Angeles", "Lakers")
# → "stadium_nba_los_angeles_lakers"

# Game ID
generate_game_id(
    sport="nba",
    season=2025,
    away_abbrev="BOS",
    home_abbrev="LAL",
    game_date=datetime(2025, 10, 22),
    game_number=None,
)
# → "game_nba_2025_20251022_bos_lal"

# Doubleheader game ID
generate_game_id(..., game_number=2)
# → "game_nba_2025_20251022_bos_lal_2"
```
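
The ID scheme above can be sketched in a few lines. This is illustrative, not the package's actual implementation (e.g. how a `game_number` of 1 is handled is an assumption inferred from the examples):

```python
from datetime import datetime
from typing import Optional

def generate_game_id(sport: str, season: int, away_abbrev: str, home_abbrev: str,
                     game_date: datetime, game_number: Optional[int] = None) -> str:
    """Deterministic ID: game_{sport}_{season}_{yyyymmdd}_{away}_{home}[_{n}]."""
    game_id = (f"game_{sport}_{season}_{game_date:%Y%m%d}"
               f"_{away_abbrev.lower()}_{home_abbrev.lower()}")
    if game_number and game_number > 1:  # doubleheader suffix, per the example above
        game_id += f"_{game_number}"
    return game_id
```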
### Team Resolution
The `TeamResolver` uses a three-stage strategy:
```python
resolver = get_team_resolver("nba")
result = resolver.resolve(
    "Los Angeles Lakers",
    check_date=date(2025, 10, 22),
    source_url="https://...",
)
# Result:
# - canonical_id: "team_nba_lal"
# - confidence: 1.0 (exact match)
# - review_item: None
```
**Resolution stages:**

1. **Exact Match**: Check against canonical team mappings
   - Full name: "Los Angeles Lakers"
   - City + Name: "Los Angeles" + "Lakers"
   - Abbreviation: "LAL"
2. **Alias Lookup**: Check historical aliases with date awareness
   - "Oakland Athletics" → "team_mlb_athletics" (valid until 2024-12-31)
   - Handles relocations: the Oakland → Las Vegas transition
3. **Fuzzy Match**: Use rapidfuzz with an 85% threshold
   - "LA Lakers" → "Los Angeles Lakers" (92% match)
   - Low-confidence matches are flagged for review
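
Stages 1 and 3 can be illustrated with the standard library. In this sketch, `difflib` stands in for rapidfuzz so the example is self-contained, and the mapping and return shape are simplified assumptions:

```python
import difflib

FUZZY_THRESHOLD = 0.85  # config.py uses 85 on a 0-100 scale
CANONICAL = {
    "Los Angeles Lakers": "team_nba_lal",
    "Boston Celtics": "team_nba_bos",
}

def resolve_team(raw: str):
    """Return (canonical_id, confidence, review_item_or_None)."""
    # Stage 1: exact match (stage 2, alias lookup, omitted for brevity)
    if raw in CANONICAL:
        return CANONICAL[raw], 1.0, None
    # Stage 3: fuzzy match against known names
    matches = difflib.get_close_matches(raw, CANONICAL, n=1, cutoff=0.0)
    if not matches:
        return None, 0.0, {"raw_value": raw, "suggested_id": None}
    score = difflib.SequenceMatcher(None, raw, matches[0]).ratio()
    if score >= FUZZY_THRESHOLD:
        return CANONICAL[matches[0]], score, None
    # Stage 4: below threshold -- flag for manual review
    return None, score, {"raw_value": raw, "suggested_id": CANONICAL[matches[0]]}
```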
### Stadium Resolution
Similar three-stage strategy with additional location awareness:
```python
resolver = get_stadium_resolver("nba")
result = resolver.resolve(
    "Crypto.com Arena",
    check_date=date(2025, 10, 22),
)
```
**Key features:**
- Handles naming rights changes (Staples Center → Crypto.com Arena)
- Date-aware: "Staples Center" resolves correctly for historical games
- Location-based fallback using latitude/longitude
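
The location fallback reduces to a nearest-neighbour search over known stadium coordinates. A sketch with a hypothetical distance threshold (the package's actual radius and data shape may differ):

```python
import math

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two coordinates, in kilometres."""
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(dlon / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def nearest_stadium_id(lat: float, lon: float, stadiums: list, max_km: float = 1.0):
    """Match scraped coordinates to the closest known stadium, if close enough."""
    best = min(stadiums, key=lambda s: haversine_km(lat, lon, s["lat"], s["lon"]))
    if haversine_km(lat, lon, best["lat"], best["lon"]) <= max_km:
        return best["id"]
    return None  # too far from any known venue; flag for manual review instead
```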
## Scrapers
### Base Scraper
All scrapers extend `BaseScraper` with these features:
```python
class BaseScraper(ABC):
    def __init__(self, sport: str, season: int): ...

    # Required implementations
    def _get_sources(self) -> list[str]: ...
    def _scrape_games_from_source(self, source: str) -> list[RawGameData]: ...
    def _normalize_games(self, raw_games) -> tuple[list[Game], list[ManualReviewItem]]: ...
    def scrape_teams(self) -> list[Team]: ...
    def scrape_stadiums(self) -> list[Stadium]: ...

    # Built-in features
    def scrape_games(self) -> ScrapeResult:
        """Multi-source fallback - tries each source in order."""
        ...

    def scrape_all(self) -> ScrapeResult:
        """Scrapes games, teams, and stadiums with progress tracking."""
        ...
```
### NBA Scraper
```python
class NBAScraper(BaseScraper):
    """
    Sources (in priority order):
    1. Basketball-Reference - HTML tables, monthly pages
    2. ESPN API - JSON, per-date queries
    3. CBS Sports - Backup (not implemented)

    Season: October to June (split year, e.g., 2025-26)
    """
```
**Basketball-Reference parsing:**
- URL: `https://www.basketball-reference.com/leagues/NBA_{year}_games-{month}.html`
- Table columns: date_game, visitor_team_name, home_team_name, visitor_pts, home_pts, arena_name
### MLB Scraper
```python
class MLBScraper(BaseScraper):
    """
    Sources:
    1. Baseball-Reference - Single page per season
    2. MLB Stats API - Official API with date range queries
    3. ESPN API - Backup

    Season: March to November (single year)
    Handles: Doubleheaders with game_number
    """
```
### NFL Scraper
```python
class NFLScraper(BaseScraper):
    """
    Sources:
    1. ESPN API - Week-based queries
    2. Pro-Football-Reference - Single page per season

    Season: September to February (split year)
    Filters: International games (London, Mexico City, Frankfurt)
    Scrapes: Preseason (4 weeks), Regular (18 weeks), Postseason (4 rounds)
    """
```
### NHL Scraper
```python
class NHLScraper(BaseScraper):
    """
    Sources:
    1. Hockey-Reference - Single page per season
    2. NHL API - New API (api-web.nhle.com)
    3. ESPN API - Backup

    Season: October to June (split year)
    Filters: International games (Prague, Stockholm, Helsinki)
    """
```
### MLS / WNBA / NWSL Scrapers
All use ESPN API as primary source with similar structure:
- Single calendar year seasons
- Conference-based organization (MLS) or single table (WNBA, NWSL)
## Uploaders
### CloudKit Client
```python
class CloudKitClient:
    """CloudKit Web Services API client with JWT authentication."""

    def __init__(
        self,
        container_id: str = CLOUDKIT_CONTAINER,
        environment: str = "development",  # or "production"
        key_id: str = None,                # From CloudKit Dashboard
        private_key: str = None,           # EC P-256 private key
    ): ...

    async def fetch_records(
        self,
        record_type: RecordType,
        filter_by: Optional[dict] = None,
        sort_by: Optional[str] = None,
    ) -> list[dict]: ...

    async def save_records(
        self,
        records: list[CloudKitRecord],
        batch_size: int = 200,
    ) -> BatchResult: ...

    async def delete_records(
        self,
        record_names: list[str],
        record_type: RecordType,
    ) -> BatchResult: ...
```
**Authentication:**
- Uses EC P-256 key pair
- JWT tokens signed with private key
- Tokens valid for 30 minutes
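
Because tokens expire after 30 minutes, long upload runs need to re-sign before requests start failing. A stdlib sketch of that renewal logic (`issue_token` stands in for the actual pyjwt signing call; the 60-second renewal margin is an assumption):

```python
import time

TOKEN_TTL = 30 * 60  # seconds; matches the 30-minute validity noted above

class TokenCache:
    """Caches a signed token and reissues it shortly before expiry."""

    def __init__(self, issue_token, margin: float = 60.0):
        self._issue_token = issue_token  # callable returning a fresh signed token
        self._margin = margin            # renew this many seconds early
        self._token = None
        self._issued_at = float("-inf")

    def get(self):
        if time.time() - self._issued_at > TOKEN_TTL - self._margin:
            self._token = self._issue_token()
            self._issued_at = time.time()
        return self._token
```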
### Record Differ
```python
class RecordDiffer:
    """Compares local records with CloudKit records."""

    def diff_games(self, local: list[Game], remote: list[dict]) -> DiffResult: ...
    def diff_teams(self, local: list[Team], remote: list[dict]) -> DiffResult: ...
    def diff_stadiums(self, local: list[Stadium], remote: list[dict]) -> DiffResult: ...
```
**DiffResult:**
```python
@dataclass
class DiffResult:
    creates: list[RecordDiff]    # New records to create
    updates: list[RecordDiff]    # Changed records to update
    deletes: list[RecordDiff]    # Remote records to delete
    unchanged: list[RecordDiff]  # Records with no changes

    def get_records_to_upload(self) -> list[CloudKitRecord]:
        """Returns creates + updates ready for upload."""
```
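
The diff is keyed on record name (the canonical ID) with a field-level comparison against the remote copy. A simplified sketch of how the four buckets fall out (the dict-of-dicts shape is an assumption):

```python
def diff_records(local: dict, remote: dict):
    """local/remote map record name -> field dict; returns the four buckets."""
    creates = [name for name in local if name not in remote]
    deletes = [name for name in remote if name not in local]
    updates = [name for name in local
               if name in remote and local[name] != remote[name]]
    unchanged = [name for name in local
                 if name in remote and local[name] == remote[name]]
    return creates, updates, deletes, unchanged
```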
### State Manager
```python
class StateManager:
    """Manages resumable upload state."""

    def load_session(self, sport, season, environment) -> Optional[UploadSession]: ...
    def save_session(self, session: UploadSession) -> None: ...

    def get_session_or_create(
        self,
        sport, season, environment,
        record_names: list[tuple[str, str]],
        resume: bool = False,
    ) -> UploadSession: ...
```
**State persistence:**
- Stored in `.parser_state/upload_state_{sport}_{season}_{env}.json`
- Tracks: pending, uploaded, failed records
- Supports retry with backoff
## Utilities
### HTTP Client
```python
class RateLimitedSession:
    """HTTP session with rate limiting and exponential backoff."""

    def __init__(
        self,
        delay: float = 3.0,  # Seconds between requests
        max_retries: int = 3,
        backoff_factor: float = 2.0,
    ): ...

    def get(self, url, **kwargs) -> Response: ...
    def get_json(self, url, **kwargs) -> dict: ...
    def get_html(self, url, **kwargs) -> str: ...
```
**Features:**
- User-agent rotation (5 different Chrome/Firefox/Safari agents)
- Per-domain rate limiting
- Automatic 429 handling with exponential backoff + jitter
- Connection pooling
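
The retry schedule follows the exponential-backoff-plus-jitter pattern described above; this sketch uses the defaults from config.py, while the jitter fraction is an assumption:

```python
import random

def backoff_delays(max_retries: int = 3, initial: float = 5.0,
                   factor: float = 2.0, jitter: float = 0.5):
    """Yield the delay before each retry: initial * factor**attempt, plus jitter."""
    for attempt in range(max_retries):
        base = initial * factor ** attempt
        yield base + random.uniform(0, jitter * base)
```

On a 429 response the session would sleep through these delays between attempts before giving up.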
### Logging
```python
from sportstime_parser.utils import get_logger, log_success, log_error
logger = get_logger() # Rich-formatted logger
logger.info("Starting scrape")
log_success("Scraped 1230 games") # Green checkmark
log_error("Failed to parse") # Red X
```
**Log output:**
- Console: Rich-formatted with colors
- File: `logs/parser_{timestamp}.log`
### Progress Tracking
```python
from sportstime_parser.utils import ScrapeProgress, track_progress

# Specialized scrape tracking
progress = ScrapeProgress("nba", 2025)
progress.start()
with progress.scraping_schedule(total_months=9) as advance:
    for month in months:
        fetch(month)
        advance()
progress.finish()  # Prints summary

# Generic progress bar
for game in track_progress(games, "Processing games"):
    process(game)
```
## Manual Review Workflow
When the system can't confidently resolve a team or stadium:
1. **Low confidence fuzzy match** (< 85%):

   ```
   ManualReviewItem(
       item_type="team",
       raw_value="LA Lakers",
       suggested_id="team_nba_lal",
       confidence=0.82,
       reason="Fuzzy match below threshold"
   )
   ```

2. **No match found**:

   ```
   ManualReviewItem(
       raw_value="Unknown Team FC",
       suggested_id=None,
       confidence=0.0,
       reason="No match found in canonical mappings"
   )
   ```

3. **Ambiguous match** (multiple candidates):

   ```
   ManualReviewItem(
       raw_value="LA",
       suggested_id="team_nba_lac",
       confidence=0.5,
       reason="Ambiguous: could be Lakers or Clippers"
   )
   ```
**Resolution:**
- Review items are exported to JSON
- Manually verify and add to `team_aliases.json` or `stadium_aliases.json`
- Re-run the scrape; the new aliases will be used for resolution
## Adding a New Sport
1. **Create scraper** in `scrapers/{sport}.py`:

   ```python
   class NewSportScraper(BaseScraper):
       def __init__(self, season: int, **kwargs):
           super().__init__("newsport", season, **kwargs)
           self._team_resolver = get_team_resolver("newsport")
           self._stadium_resolver = get_stadium_resolver("newsport")

       def _get_sources(self) -> list[str]:
           return ["primary_source", "backup_source"]

       def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
           # Implement source-specific scraping
           ...

       def _normalize_games(self, raw_games) -> tuple[list[Game], list[ManualReviewItem]]:
           # Use resolvers to normalize
           ...

       def scrape_teams(self) -> list[Team]:
           # Return canonical team list
           ...

       def scrape_stadiums(self) -> list[Stadium]:
           # Return canonical stadium list
           ...
   ```

2. **Add team mappings** in `normalizers/team_resolver.py`:

   ```python
   TEAM_MAPPINGS["newsport"] = {
       "ABC": ("team_newsport_abc", "Full Team Name", "City"),
       ...
   }
   ```

3. **Add stadium mappings** in `normalizers/stadium_resolver.py`:

   ```python
   STADIUM_MAPPINGS["newsport"] = {
       "stadium_newsport_venue": StadiumInfo(
           name="Venue Name",
           city="City",
           state="State",
           country="USA",
           latitude=40.0,
           longitude=-74.0,
       ),
       ...
   }
   ```

4. **Add to league_structure.json** (if hierarchical)

5. **Update config.py**:

   ```python
   EXPECTED_GAME_COUNTS["newsport"] = 500
   ```

6. **Export from `__init__.py`**
## Troubleshooting
### Rate Limiting (429 errors)
The system handles these automatically with exponential backoff. If errors persist:
- Increase `DEFAULT_REQUEST_DELAY` in config.py
- Check if source has changed their rate limits
### Missing Teams/Stadiums
1. Check scraper logs for raw values
2. Add to `team_aliases.json` or `stadium_aliases.json`
3. Or add to canonical mappings if it's a new team/stadium
### CloudKit Authentication Errors
1. Verify key_id matches CloudKit Dashboard
2. Check private key format (EC P-256, PEM)
3. Ensure container identifier is correct
### Incomplete Scrapes
The system discards partial data on errors. Check:
- `logs/` for error details
- Network connectivity
- Source website availability
### International Games Appearing
NFL and NHL scrapers filter these automatically. If new locations emerge:
- Add to `INTERNATIONAL_LOCATIONS` in the scraper
- Or add filtering logic for neutral site games
## Contributing
1. Follow existing patterns for new scrapers
2. Always use canonical IDs
3. Add aliases for historical names
4. Include source URLs for traceability
5. Test with multiple seasons