SportsTime Parser
A Python package for scraping, normalizing, and uploading sports schedule data to CloudKit for the SportsTime iOS app.
Table of Contents
- Overview
- Installation
- Quick Start
- Architecture
- Directory Structure
- Configuration
- Data Models
- Normalizers
- Scrapers
- Uploaders
- Utilities
- Manual Review Workflow
- Adding a New Sport
- Troubleshooting
Overview
The sportstime_parser package provides a complete pipeline for:
- Scraping game schedules from multiple sources (Basketball-Reference, ESPN, MLB API, etc.)
- Normalizing raw data to canonical identifiers (teams, stadiums, games)
- Resolving team/stadium names using exact matching, historical aliases, and fuzzy matching
- Uploading data to CloudKit with diff-based sync and resumable uploads
Supported Sports
| Sport | Code | Sources | Season Format |
|---|---|---|---|
| NBA | nba | Basketball-Reference, ESPN, CBS | Oct-Jun (split year) |
| MLB | mlb | Baseball-Reference, MLB API, ESPN | Mar-Nov (single year) |
| NFL | nfl | ESPN, Pro-Football-Reference, CBS | Sep-Feb (split year) |
| NHL | nhl | Hockey-Reference, NHL API, ESPN | Oct-Jun (split year) |
| MLS | mls | ESPN, FBref | Feb-Nov (single year) |
| WNBA | wnba | ESPN | May-Oct (single year) |
| NWSL | nwsl | ESPN | Mar-Nov (single year) |
Installation
cd Scripts
pip install -r requirements.txt
Dependencies
- requests - HTTP requests with session management
- beautifulsoup4 + lxml - HTML parsing
- rapidfuzz - Fuzzy string matching
- pyjwt + cryptography - CloudKit JWT authentication
- rich - Terminal UI (progress bars, logging)
- pytz / timezonefinder - Timezone detection
Quick Start
Scrape a Single Sport
from sportstime_parser.scrapers import create_nba_scraper
scraper = create_nba_scraper(season=2025)
result = scraper.scrape_all()
print(f"Games: {result.game_count}")
print(f"Teams: {result.team_count}")
print(f"Stadiums: {result.stadium_count}")
print(f"Needs review: {result.review_count}")
Upload to CloudKit
from sportstime_parser.uploaders import CloudKitClient, RecordDiffer
client = CloudKitClient(environment="development")
differ = RecordDiffer()
# Compare local vs remote
diff = differ.diff_games(local_games, remote_records)
# Upload changes
records = diff.get_records_to_upload()
result = await client.save_records(records)
Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ DATA SOURCES │
│ Basketball-Reference │ ESPN API │ MLB API │ Hockey-Reference │ etc. │
└────────────────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ SCRAPERS │
│ NBAScraper │ MLBScraper │ NFLScraper │ NHLScraper │ MLSScraper │ etc. │
│ │
│ Features: │
│ • Multi-source fallback (try sources in priority order) │
│ • Automatic rate limiting with exponential backoff │
│ • Doubleheader detection │
│ • International game filtering (NFL London, NHL Global Series) │
└────────────────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ NORMALIZERS │
│ TeamResolver │ StadiumResolver │ CanonicalIdGenerator │ AliasLoader │
│ │
│ Resolution Strategy (in order): │
│ 1. Exact match against canonical mappings │
│ 2. Date-aware alias lookup (handles renames/relocations) │
│ 3. Fuzzy matching with confidence threshold (85%) │
│ 4. Flag for manual review if unresolved or low confidence │
└────────────────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ DATA MODELS │
│ Game │ Team │ Stadium │ ManualReviewItem │
│ │
│ All models use canonical IDs: │
│ • team_nba_lal (Los Angeles Lakers) │
│ • stadium_nba_los_angeles_lakers (Crypto.com Arena) │
│ • game_nba_2025_20251022_bos_lal (specific game) │
└────────────────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ UPLOADERS │
│ CloudKitClient │ RecordDiffer │ StateManager │
│ │
│ Features: │
│ • JWT authentication with Apple's CloudKit Web Services │
│ • Batch operations (up to 200 records per request) │
│ • Diff-based sync (only upload changes) │
│ • Resumable uploads with persistent state │
└────────────────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ CLOUDKIT │
│ Public Database: Games, Teams, Stadiums, Aliases │
└─────────────────────────────────────────────────────────────────────────┘
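The resolver fall-through in the NORMALIZERS stage can be sketched as follows. This is a minimal illustration with hypothetical names (`resolve_name`, `ResolveResult`); the real logic lives in the resolver classes under `normalizers/`:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ResolveResult:
    canonical_id: Optional[str]
    confidence: float
    needs_review: bool

def resolve_name(
    raw: str,
    exact: dict[str, str],
    aliases: dict[str, str],
    fuzzy_match: Callable[[str], tuple[Optional[str], float]],
    threshold: float = 0.85,
) -> ResolveResult:
    """Try exact match, then alias lookup, then fuzzy match; else flag for review."""
    if raw in exact:                        # 1. exact match against canonical mappings
        return ResolveResult(exact[raw], 1.0, False)
    if raw in aliases:                      # 2. alias lookup (date-awareness omitted here)
        return ResolveResult(aliases[raw], 1.0, False)
    match, score = fuzzy_match(raw)         # 3. fuzzy match with confidence threshold
    if match is not None and score >= threshold:
        return ResolveResult(match, score, False)
    return ResolveResult(match, score, True)  # 4. manual review
```

The real resolvers also thread a `check_date` through the alias stage so renamed or relocated teams resolve correctly for historical games.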
Directory Structure
Scripts/
├── README.md # This file
├── requirements.txt # Python dependencies
├── pyproject.toml # Package configuration
├── league_structure.json # League hierarchy (conferences, divisions)
├── team_aliases.json # Historical team name mappings
├── stadium_aliases.json # Historical stadium name mappings
├── logs/ # Runtime logs (auto-created)
├── output/ # Scrape output files (auto-created)
└── sportstime_parser/ # Main package
├── __init__.py
├── config.py # Configuration constants
├── SOURCES.md # Data source documentation
├── models/ # Data classes
│ ├── game.py # Game model
│ ├── team.py # Team model
│ ├── stadium.py # Stadium model
│ └── aliases.py # Alias and ManualReviewItem models
├── normalizers/ # Name resolution
│ ├── canonical_id.py # ID generation
│ ├── alias_loader.py # Alias loading and resolution
│ ├── fuzzy.py # Fuzzy string matching
│ ├── timezone.py # Timezone detection
│ ├── team_resolver.py # Team name resolution
│ └── stadium_resolver.py # Stadium name resolution
├── scrapers/ # Sport-specific scrapers
│ ├── base.py # Abstract base scraper
│ ├── nba.py # NBA scraper
│ ├── mlb.py # MLB scraper
│ ├── nfl.py # NFL scraper
│ ├── nhl.py # NHL scraper
│ ├── mls.py # MLS scraper
│ ├── wnba.py # WNBA scraper
│ └── nwsl.py # NWSL scraper
├── uploaders/ # CloudKit integration
│ ├── cloudkit.py # CloudKit Web Services client
│ ├── diff.py # Record diffing
│ └── state.py # Resumable upload state
└── utils/ # Shared utilities
├── logging.py # Rich-based logging
├── http.py # Rate-limited HTTP client
└── progress.py # Progress tracking
Configuration
config.py
Key configuration constants:
# Directories
SCRIPTS_DIR = Path(__file__).parent.parent # Scripts/
OUTPUT_DIR = SCRIPTS_DIR / "output" # JSON output
STATE_DIR = SCRIPTS_DIR / ".parser_state" # Upload state
# CloudKit
CLOUDKIT_CONTAINER = "iCloud.com.sportstime.app"
CLOUDKIT_ENVIRONMENT = "development" # or "production"
# Rate Limiting
DEFAULT_REQUEST_DELAY = 3.0 # seconds between requests
MAX_RETRIES = 3 # retry attempts
BACKOFF_FACTOR = 2.0 # exponential backoff multiplier
INITIAL_BACKOFF = 5.0 # initial backoff duration
# Fuzzy Matching
FUZZY_THRESHOLD = 85 # minimum match confidence (0-100)
# Expected game counts (for validation)
EXPECTED_GAME_COUNTS = {
"nba": 1230, # 30 teams × 82 games ÷ 2
"mlb": 2430, # 30 teams × 162 games ÷ 2
"nfl": 272, # Regular season only
"nhl": 1312, # 32 teams × 82 games ÷ 2
"mls": 544, # 29 teams × ~34 games ÷ 2
"wnba": 228, # 12 teams × 40 games ÷ 2
"nwsl": 182, # 14 teams × 26 games ÷ 2
}
# Geography (for filtering international games)
ALLOWED_COUNTRIES = {"USA", "Canada"}
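These expected counts lend themselves to a post-scrape sanity check. A sketch of such a helper (hypothetical name and tolerance; not part of the package API):

```python
EXPECTED_GAME_COUNTS = {"nba": 1230, "nfl": 272}  # excerpt of the config table above

def validate_game_count(sport: str, scraped: int, tolerance: float = 0.02) -> bool:
    """Return True if the scraped count is within tolerance of the expected count.

    A small tolerance allows for postponed or cancelled games.
    """
    expected = EXPECTED_GAME_COUNTS.get(sport)
    if expected is None:
        return True  # no expectation registered for this sport
    return abs(scraped - expected) <= expected * tolerance
```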
league_structure.json
Defines the hierarchical structure of each league:
{
"nba": {
"name": "National Basketball Association",
"conferences": {
"Eastern": {
"divisions": {
"Atlantic": ["BOS", "BKN", "NYK", "PHI", "TOR"],
"Central": ["CHI", "CLE", "DET", "IND", "MIL"],
"Southeast": ["ATL", "CHA", "MIA", "ORL", "WAS"]
}
},
"Western": { ... }
}
},
"mlb": { ... },
...
}
team_aliases.json / stadium_aliases.json
Historical name mappings with validity dates:
{
"team_mlb_athletics": [
{
"alias": "Oakland Athletics",
"alias_type": "full_name",
"valid_from": "1968-01-01",
"valid_until": "2024-12-31"
},
{
"alias": "Las Vegas Athletics",
"alias_type": "full_name",
"valid_from": "2028-01-01",
"valid_until": null
}
]
}
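A date-aware lookup over entries of this shape might work as sketched below (hypothetical helper; the real implementation lives in `alias_loader.py`):

```python
from datetime import date
from typing import Optional

def resolve_alias(aliases: dict, raw_name: str, check_date: date) -> Optional[str]:
    """Return the canonical ID whose alias matches raw_name and is valid on check_date."""
    for canonical_id, entries in aliases.items():
        for entry in entries:
            if entry["alias"] != raw_name:
                continue
            valid_from = date.fromisoformat(entry["valid_from"])
            # A null valid_until means the alias is still current.
            valid_until = (date.fromisoformat(entry["valid_until"])
                           if entry["valid_until"] else date.max)
            if valid_from <= check_date <= valid_until:
                return canonical_id
    return None
```

With the Athletics example above, "Oakland Athletics" resolves for a 2024 game date but not for one after 2024-12-31.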
Data Models
Game
@dataclass
class Game:
id: str # Canonical ID: game_{sport}_{season}_{date}_{away}_{home}
sport: str # Sport code (nba, mlb, etc.)
season: int # Season start year
home_team_id: str # Canonical team ID
away_team_id: str # Canonical team ID
stadium_id: str # Canonical stadium ID
game_date: datetime # UTC datetime
game_number: Optional[int] # 1 or 2 for doubleheaders
home_score: Optional[int] # None if not played
away_score: Optional[int]
status: str # scheduled, final, postponed, cancelled
source_url: Optional[str] # For manual review
raw_home_team: Optional[str] # Original scraped value
raw_away_team: Optional[str]
raw_stadium: Optional[str]
Team
@dataclass
class Team:
id: str # Canonical ID: team_{sport}_{abbrev}
sport: str
city: str # e.g., "Los Angeles"
name: str # e.g., "Lakers"
full_name: str # e.g., "Los Angeles Lakers"
abbreviation: str # e.g., "LAL"
conference: Optional[str] # e.g., "Western"
division: Optional[str] # e.g., "Pacific"
stadium_id: Optional[str] # Home stadium
primary_color: Optional[str]
secondary_color: Optional[str]
logo_url: Optional[str]
Stadium
@dataclass
class Stadium:
id: str # Canonical ID: stadium_{sport}_{city_team}
sport: str
name: str # Current name (e.g., "Crypto.com Arena")
city: str
state: Optional[str]
country: str
latitude: Optional[float]
longitude: Optional[float]
capacity: Optional[int]
surface: Optional[str] # grass, turf, ice, hardwood
roof_type: Optional[str] # dome, retractable, open
opened_year: Optional[int]
image_url: Optional[str]
timezone: Optional[str]
ManualReviewItem
@dataclass
class ManualReviewItem:
item_type: str # "team" or "stadium"
raw_value: str # Original scraped value
suggested_id: Optional[str] # Best fuzzy match (if any)
confidence: float # 0.0 - 1.0
reason: str # Why review is needed
source_url: Optional[str] # Where it came from
sport: str
check_date: Optional[date] # For date-aware alias lookup
Normalizers
Canonical ID Generation
IDs are deterministic and immutable:
# Team ID
generate_team_id("nba", "LAL")
# → "team_nba_lal"
# Stadium ID
generate_stadium_id("nba", "Los Angeles", "Lakers")
# → "stadium_nba_los_angeles_lakers"
# Game ID
generate_game_id(
sport="nba",
season=2025,
away_abbrev="BOS",
home_abbrev="LAL",
game_date=datetime(2025, 10, 22),
game_number=None
)
# → "game_nba_2025_20251022_bos_lal"
# Doubleheader Game ID
generate_game_id(..., game_number=2)
# → "game_nba_2025_20251022_bos_lal_2"
Team Resolution
The TeamResolver uses a three-stage strategy:
resolver = get_team_resolver("nba")
result = resolver.resolve(
"Los Angeles Lakers",
check_date=date(2025, 10, 22),
source_url="https://..."
)
# Result:
# - canonical_id: "team_nba_lal"
# - confidence: 1.0 (exact match)
# - review_item: None
Resolution stages:
1. Exact Match: Check against canonical team mappings
   - Full name: "Los Angeles Lakers"
   - City + Name: "Los Angeles" + "Lakers"
   - Abbreviation: "LAL"
2. Alias Lookup: Check historical aliases with date awareness
   - "Oakland Athletics" → "team_mlb_athletics" (valid until 2024-12-31)
   - Handles relocations: "Oakland" → "Las Vegas" transition
3. Fuzzy Match: Use rapidfuzz with an 85% threshold
   - "LA Lakers" → "Los Angeles Lakers" (92% match)
   - Low-confidence matches are flagged for manual review
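The package uses rapidfuzz for stage 3; purely for illustration, the same threshold check can be written with the standard library's difflib (note that difflib's scorer differs from rapidfuzz's, so individual scores will not match the 92% example above):

```python
from difflib import SequenceMatcher
from typing import Optional

def fuzzy_best_match(raw: str, candidates: list[str],
                     threshold: float = 85.0) -> tuple[Optional[str], float]:
    """Return (best_candidate, score) if the score clears the threshold, else (None, score).

    Scores are scaled to 0-100 to mirror the FUZZY_THRESHOLD config constant.
    """
    best, best_score = None, 0.0
    for cand in candidates:
        score = SequenceMatcher(None, raw.lower(), cand.lower()).ratio() * 100
        if score > best_score:
            best, best_score = cand, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score
```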
Stadium Resolution
Similar three-stage strategy with additional location awareness:
resolver = get_stadium_resolver("nba")
result = resolver.resolve(
"Crypto.com Arena",
check_date=date(2025, 10, 22)
)
Key features:
- Handles naming rights changes (Staples Center → Crypto.com Arena)
- Date-aware: "Staples Center" resolves correctly for historical games
- Location-based fallback using latitude/longitude
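The location-based fallback could look like this minimal sketch (hypothetical helper; the 1 km threshold is illustrative):

```python
from math import asin, cos, radians, sin, sqrt
from typing import Optional

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def nearest_stadium(lat: float, lon: float,
                    stadiums: dict[str, tuple[float, float]],
                    max_km: float = 1.0) -> Optional[str]:
    """Return the closest stadium ID within max_km of the given point, else None.

    stadiums maps canonical ID -> (latitude, longitude).
    """
    best_id, best_dist = None, max_km
    for stadium_id, (s_lat, s_lon) in stadiums.items():
        d = haversine_km(lat, lon, s_lat, s_lon)
        if d <= best_dist:
            best_id, best_dist = stadium_id, d
    return best_id
```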
Scrapers
Base Scraper
All scrapers extend BaseScraper with these features:
class BaseScraper(ABC):
def __init__(self, sport: str, season: int): ...
# Required implementations
def _get_sources(self) -> list[str]: ...
def _scrape_games_from_source(self, source: str) -> list[RawGameData]: ...
def _normalize_games(self, raw_games) -> tuple[list[Game], list[ManualReviewItem]]: ...
def scrape_teams(self) -> list[Team]: ...
def scrape_stadiums(self) -> list[Stadium]: ...
# Built-in features
def scrape_games(self) -> ScrapeResult:
"""Multi-source fallback - tries each source in order."""
...
def scrape_all(self) -> ScrapeResult:
"""Scrapes games, teams, and stadiums with progress tracking."""
...
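The multi-source fallback in `scrape_games` can be sketched as below (simplified; the real code uses narrower exception types, logging, and per-source validation):

```python
def scrape_with_fallback(sources: list[str], scrape_fn, min_games: int = 1):
    """Try each source in priority order; return results from the first that succeeds."""
    errors: dict[str, Exception] = {}
    for source in sources:
        try:
            games = scrape_fn(source)
        except Exception as exc:  # illustration only; catch specific errors in practice
            errors[source] = exc
            continue
        if len(games) >= min_games:
            return source, games  # first source with a plausible result wins
        errors[source] = ValueError(f"only {len(games)} games returned")
    raise RuntimeError(f"all sources failed: {errors}")
```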
NBA Scraper
class NBAScraper(BaseScraper):
"""
Sources (in priority order):
1. Basketball-Reference - HTML tables, monthly pages
2. ESPN API - JSON, per-date queries
3. CBS Sports - Backup (not implemented)
Season: October to June (split year, e.g., 2025-26)
"""
Basketball-Reference parsing:
- URL: https://www.basketball-reference.com/leagues/NBA_{year}_games-{month}.html
- Table columns: date_game, visitor_team_name, home_team_name, visitor_pts, home_pts, arena_name
MLB Scraper
class MLBScraper(BaseScraper):
"""
Sources:
1. Baseball-Reference - Single page per season
2. MLB Stats API - Official API with date range queries
3. ESPN API - Backup
Season: March to November (single year)
Handles: Doubleheaders with game_number
"""
NFL Scraper
class NFLScraper(BaseScraper):
"""
Sources:
1. ESPN API - Week-based queries
2. Pro-Football-Reference - Single page per season
Season: September to February (split year)
Filters: International games (London, Mexico City, Frankfurt)
Scrapes: Preseason (4 weeks), Regular (18 weeks), Postseason (4 rounds)
"""
NHL Scraper
class NHLScraper(BaseScraper):
"""
Sources:
1. Hockey-Reference - Single page per season
2. NHL API - New API (api-web.nhle.com)
3. ESPN API - Backup
Season: October to June (split year)
Filters: International games (Prague, Stockholm, Helsinki)
"""
MLS / WNBA / NWSL Scrapers
All use ESPN API as primary source with similar structure:
- Single calendar year seasons
- Conference-based organization (MLS) or single table (WNBA, NWSL)
Uploaders
CloudKit Client
class CloudKitClient:
"""CloudKit Web Services API client with JWT authentication."""
def __init__(
self,
container_id: str = CLOUDKIT_CONTAINER,
environment: str = "development", # or "production"
key_id: str = None, # From CloudKit Dashboard
private_key: str = None, # EC P-256 private key
): ...
async def fetch_records(
self,
record_type: RecordType,
filter_by: Optional[dict] = None,
sort_by: Optional[str] = None,
) -> list[dict]: ...
async def save_records(
self,
records: list[CloudKitRecord],
batch_size: int = 200,
) -> BatchResult: ...
async def delete_records(
self,
record_names: list[str],
record_type: RecordType,
) -> BatchResult: ...
Authentication:
- Uses EC P-256 key pair
- JWT tokens signed with private key
- Tokens valid for 30 minutes
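Splitting records into batches of at most 200, as `save_records` does, keeps each request under CloudKit's per-operation record limit. A minimal chunking sketch (hypothetical helper name):

```python
from typing import Iterator

def chunk_records(records: list, batch_size: int = 200) -> Iterator[list]:
    """Yield successive batches of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]
```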
Record Differ
class RecordDiffer:
"""Compares local records with CloudKit records."""
def diff_games(self, local: list[Game], remote: list[dict]) -> DiffResult: ...
def diff_teams(self, local: list[Team], remote: list[dict]) -> DiffResult: ...
def diff_stadiums(self, local: list[Stadium], remote: list[dict]) -> DiffResult: ...
DiffResult:
@dataclass
class DiffResult:
creates: list[RecordDiff] # New records to create
updates: list[RecordDiff] # Changed records to update
deletes: list[RecordDiff] # Remote records to delete
unchanged: list[RecordDiff] # Records with no changes
def get_records_to_upload(self) -> list[CloudKitRecord]:
"""Returns creates + updates ready for upload."""
State Manager
class StateManager:
"""Manages resumable upload state."""
def load_session(self, sport, season, environment) -> Optional[UploadSession]: ...
def save_session(self, session: UploadSession) -> None: ...
def get_session_or_create(
self,
sport, season, environment,
record_names: list[tuple[str, str]],
resume: bool = False,
) -> UploadSession: ...
State persistence:
- Stored in .parser_state/upload_state_{sport}_{season}_{env}.json
- Tracks: pending, uploaded, and failed records
- Supports retry with backoff
Utilities
HTTP Client
class RateLimitedSession:
"""HTTP session with rate limiting and exponential backoff."""
def __init__(
self,
delay: float = 3.0, # Seconds between requests
max_retries: int = 3,
backoff_factor: float = 2.0,
): ...
def get(self, url, **kwargs) -> Response: ...
def get_json(self, url, **kwargs) -> dict: ...
def get_html(self, url, **kwargs) -> str: ...
Features:
- User-agent rotation (5 different Chrome/Firefox/Safari agents)
- Per-domain rate limiting
- Automatic 429 handling with exponential backoff + jitter
- Connection pooling
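The backoff schedule implied by the config constants (initial 5.0 s, factor 2.0, plus jitter) can be sketched as:

```python
import random
from typing import Iterator

def backoff_delays(max_retries: int = 3, initial: float = 5.0,
                   factor: float = 2.0, jitter: float = 0.5) -> Iterator[float]:
    """Yield a sleep duration per retry: initial * factor**attempt, plus random jitter.

    Jitter (up to jitter * base seconds) spreads retries out so concurrent
    clients don't hammer a recovering server in lockstep.
    """
    for attempt in range(max_retries):
        base = initial * factor ** attempt
        yield base + random.uniform(0, jitter * base)
```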
Logging
from sportstime_parser.utils import get_logger, log_success, log_error
logger = get_logger() # Rich-formatted logger
logger.info("Starting scrape")
log_success("Scraped 1230 games") # Green checkmark
log_error("Failed to parse") # Red X
Log output:
- Console: Rich-formatted with colors
- File: logs/parser_{timestamp}.log
Progress Tracking
from sportstime_parser.utils import ScrapeProgress, track_progress
# Specialized scrape tracking
progress = ScrapeProgress("nba", 2025)
progress.start()
with progress.scraping_schedule(total_months=9) as advance:
for month in months:
fetch(month)
advance()
progress.finish() # Prints summary
# Generic progress bar
for game in track_progress(games, "Processing games"):
process(game)
Manual Review Workflow
When the system can't confidently resolve a team or stadium:
1. Low-confidence fuzzy match (< 85%):

ManualReviewItem(
    item_type="team",
    raw_value="LA Lakers",
    suggested_id="team_nba_lal",
    confidence=0.82,
    reason="Fuzzy match below threshold",
)

2. No match found:

ManualReviewItem(
    raw_value="Unknown Team FC",
    suggested_id=None,
    confidence=0.0,
    reason="No match found in canonical mappings",
)

3. Ambiguous match (multiple candidates):

ManualReviewItem(
    raw_value="LA",
    suggested_id="team_nba_lac",
    confidence=0.5,
    reason="Ambiguous: could be Lakers or Clippers",
)
Resolution:
- Review items are exported to JSON
- Manually verify each item and add entries to team_aliases.json or stadium_aliases.json
- Re-run the scrape - the new aliases will be used for resolution
Adding a New Sport
1. Create a scraper in scrapers/{sport}.py:

class NewSportScraper(BaseScraper):
    def __init__(self, season: int, **kwargs):
        super().__init__("newsport", season, **kwargs)
        self._team_resolver = get_team_resolver("newsport")
        self._stadium_resolver = get_stadium_resolver("newsport")

    def _get_sources(self) -> list[str]:
        return ["primary_source", "backup_source"]

    def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
        # Implement source-specific scraping
        ...

    def _normalize_games(self, raw_games) -> tuple[list[Game], list[ManualReviewItem]]:
        # Use resolvers to normalize
        ...

    def scrape_teams(self) -> list[Team]:
        # Return canonical team list
        ...

    def scrape_stadiums(self) -> list[Stadium]:
        # Return canonical stadium list
        ...

2. Add team mappings in normalizers/team_resolver.py:

TEAM_MAPPINGS["newsport"] = {
    "ABC": ("team_newsport_abc", "Full Team Name", "City"),
    ...
}

3. Add stadium mappings in normalizers/stadium_resolver.py:

STADIUM_MAPPINGS["newsport"] = {
    "stadium_newsport_venue": StadiumInfo(
        name="Venue Name",
        city="City",
        state="State",
        country="USA",
        latitude=40.0,
        longitude=-74.0,
    ),
    ...
}

4. Add the league to league_structure.json (if hierarchical)

5. Update config.py:

EXPECTED_GAME_COUNTS["newsport"] = 500

6. Export the new scraper from __init__.py
Troubleshooting
Rate Limiting (429 errors)
The system handles these automatically with exponential backoff. If persistent:
- Increase DEFAULT_REQUEST_DELAY in config.py
- Check whether the source has changed its rate limits
Missing Teams/Stadiums
- Check scraper logs for raw values
- Add entries to team_aliases.json or stadium_aliases.json
- Or add to the canonical mappings if it's a new team or stadium
CloudKit Authentication Errors
- Verify key_id matches CloudKit Dashboard
- Check private key format (EC P-256, PEM)
- Ensure container identifier is correct
Incomplete Scrapes
The system discards partial data on errors. Check:
- logs/ for error details
- Network connectivity
- Source website availability
International Games Appearing
NFL and NHL scrapers filter these automatically. If new locations emerge:
- Add the location to INTERNATIONAL_LOCATIONS in the scraper
- Or add filtering logic for neutral-site games
Contributing
- Follow existing patterns for new scrapers
- Always use canonical IDs
- Add aliases for historical names
- Include source URLs for traceability
- Test with multiple seasons