SportsTime Parser
A Python CLI tool for scraping sports schedules, normalizing data with canonical IDs, and uploading to CloudKit.
Features
- Scrapes game schedules from multiple sources with automatic fallback
- Supports 7 major sports leagues: NBA, MLB, NFL, NHL, MLS, WNBA, NWSL
- Generates deterministic canonical IDs for games, teams, and stadiums
- Produces validation reports with manual review lists
- Uploads to CloudKit with resumable, diff-based updates
Requirements
- Python 3.11+
- CloudKit credentials (for upload functionality)
Installation
# From the Scripts directory
cd Scripts
# Install in development mode
pip install -e ".[dev]"
# Or install dependencies only
pip install -r requirements.txt
Quick Start
# Scrape NBA 2025-26 season
sportstime-parser scrape nba --season 2025
# Scrape all sports
sportstime-parser scrape all --season 2025
# Validate existing scraped data
sportstime-parser validate nba --season 2025
# Check status
sportstime-parser status
# Upload to CloudKit (development)
sportstime-parser upload nba --season 2025
# Upload to CloudKit (production)
sportstime-parser upload nba --season 2025 --environment production
CLI Reference
scrape
Scrape game schedules, teams, and stadiums from web sources.
sportstime-parser scrape <sport> [options]
Arguments:
sport Sport to scrape: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"
Options:
--season, -s INT Season start year (default: 2025)
--dry-run Parse and validate only, don't write output files
--verbose, -v Enable verbose output
Examples:
# Scrape NBA 2025-26 season
sportstime-parser scrape nba --season 2025
# Scrape all sports with verbose output
sportstime-parser scrape all --season 2025 --verbose
# Dry run to test without writing files
sportstime-parser scrape mlb --season 2026 --dry-run
validate
Run validation on existing scraped data and regenerate reports. Validation performs these checks:
- Game Coverage: Compares scraped game count against expected totals per league (e.g., ~1,230 for NBA, ~2,430 for MLB)
- Team Resolution: Identifies team names that couldn't be matched to canonical IDs using fuzzy matching
- Stadium Resolution: Identifies venue names that couldn't be matched to canonical stadium IDs
- Duplicate Detection: Finds games with the same home/away teams on the same date (potential doubleheader issues or data errors)
- Missing Data: Flags games missing required fields (stadium_id, team IDs, valid dates)
The output is a Markdown report with:
- Summary statistics (total games, valid games, coverage percentage)
- Manual review items grouped by type (unresolved teams, unresolved stadiums, duplicates)
- Fuzzy match suggestions with confidence scores to help resolve unmatched names
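The duplicate-detection check described above can be sketched in a few lines; the field names (`home_team_id`, `away_team_id`, `date`) are assumptions for illustration and may differ from the actual game schema:

```python
from collections import Counter

def find_duplicates(games: list[dict]) -> list[tuple]:
    """Flag matchups where the same home/away pair appears twice on one date."""
    keys = Counter(
        (g["home_team_id"], g["away_team_id"], g["date"]) for g in games
    )
    return [key for key, count in keys.items() if count > 1]

games = [
    {"home_team_id": "nba_okc", "away_team_id": "nba_hou", "date": "2025-10-21"},
    {"home_team_id": "nba_okc", "away_team_id": "nba_hou", "date": "2025-10-21"},
    {"home_team_id": "nba_bos", "away_team_id": "nba_nyk", "date": "2025-10-22"},
]
print(find_duplicates(games))
# → [('nba_okc', 'nba_hou', '2025-10-21')]
```

A flagged key may be a legitimate doubleheader (disambiguated by game number in the canonical ID) or a true duplicate from merging sources, which is why these land in the manual review list rather than being dropped automatically.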
sportstime-parser validate <sport> [options]
Arguments:
sport Sport to validate: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"
Options:
--season, -s INT Season start year (default: 2025)
Examples:
# Validate NBA data
sportstime-parser validate nba --season 2025
# Validate all sports
sportstime-parser validate all
upload
Upload scraped data to CloudKit with diff-based updates.
sportstime-parser upload <sport> [options]
Arguments:
sport Sport to upload: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"
Options:
--season, -s INT Season start year (default: 2025)
--environment, -e CloudKit environment: development or production (default: development)
--resume Resume interrupted upload from last checkpoint
Examples:
# Upload NBA to development
sportstime-parser upload nba --season 2025
# Upload to production
sportstime-parser upload nba --season 2025 --environment production
# Resume interrupted upload
sportstime-parser upload mlb --season 2026 --resume
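"Diff-based" means only new or changed records are sent to CloudKit. A minimal sketch of that comparison, assuming records are keyed by canonical ID (the actual logic lives in `uploaders/diff.py` and may differ):

```python
def diff_records(
    local: dict[str, dict], remote: dict[str, dict]
) -> tuple[list[dict], list[dict]]:
    """Split local records into creates (absent on server) and updates (changed)."""
    creates = [rec for rid, rec in local.items() if rid not in remote]
    updates = [
        rec for rid, rec in local.items() if rid in remote and rec != remote[rid]
    ]
    return creates, updates

creates, updates = diff_records(
    {"g1": {"time": "19:00"}, "g2": {"time": "20:00"}},
    {"g1": {"time": "19:30"}},
)
print(creates)  # → [{'time': '20:00'}]
print(updates)  # → [{'time': '19:00'}]
```

Unchanged records produce no network traffic, which is what makes re-running an upload after a scrape refresh cheap.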
status
Show current scrape and upload status.
sportstime-parser status
retry
Retry failed uploads from previous attempts.
sportstime-parser retry <sport> [options]
Arguments:
sport Sport to retry: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"
Options:
--season, -s INT Season start year (default: 2025)
--environment, -e CloudKit environment (default: development)
--max-retries INT Maximum retry attempts per record (default: 3)
clear
Clear upload session state to start fresh.
sportstime-parser clear <sport> [options]
Arguments:
sport Sport to clear: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"
Options:
--season, -s INT Season start year (default: 2025)
--environment, -e CloudKit environment (default: development)
CloudKit Configuration
To upload data to CloudKit, you need to configure authentication credentials.
1. Get Credentials from Apple Developer Portal
- Go to Apple Developer Portal
- Navigate to Certificates, Identifiers & Profiles > Keys
- Create a new key with CloudKit capability
- Download the private key file (.p8)
- Note the Key ID
2. Set Environment Variables
# Key ID from Apple Developer Portal
export CLOUDKIT_KEY_ID="your_key_id_here"
# Path to private key file
export CLOUDKIT_PRIVATE_KEY_PATH="/path/to/AuthKey_XXXXXX.p8"
# Or provide key content directly (useful for CI/CD)
export CLOUDKIT_PRIVATE_KEY="-----BEGIN EC PRIVATE KEY-----
...key content...
-----END EC PRIVATE KEY-----"
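The configuration check amounts to: a key ID, plus either a key file path or inline key content. A sketch of that logic (the real `status` command may check more):

```python
import os

def cloudkit_configured() -> bool:
    """True when CLOUDKIT_KEY_ID is set along with key material
    from either CLOUDKIT_PRIVATE_KEY_PATH or CLOUDKIT_PRIVATE_KEY."""
    has_key_id = bool(os.environ.get("CLOUDKIT_KEY_ID"))
    has_key_material = bool(
        os.environ.get("CLOUDKIT_PRIVATE_KEY_PATH")
        or os.environ.get("CLOUDKIT_PRIVATE_KEY")
    )
    return has_key_id and has_key_material
```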
3. Verify Configuration
sportstime-parser status
The status output will show whether CloudKit is configured correctly.
Output Files
Scraped data is saved to the output/ directory:
output/
games_nba_2025.json # Game schedules
teams_nba.json # Team data
stadiums_nba.json # Stadium data
validation_nba_2025.md # Validation report
Validation Reports
Validation reports are generated in Markdown format at output/validation_{sport}_{season}.md.
Report Sections
Summary Table
| Metric | Description |
|---|---|
| Total Games | Number of games scraped |
| Valid Games | Games with all required fields resolved |
| Coverage | Percentage of expected games found (based on league schedule) |
| Unresolved Teams | Team names that couldn't be matched |
| Unresolved Stadiums | Venue names that couldn't be matched |
| Duplicates | Potential duplicate game entries |
Manual Review Items
Items are grouped by type and include the raw value, source URL, and suggested fixes:
- Unresolved Teams: Team names not in the alias mapping. Add to `team_aliases.json` to resolve.
- Unresolved Stadiums: Venue names not recognized. Common for renamed arenas (naming rights changes). Add to `stadium_aliases.json`.
- Duplicate Games: Same matchup on same date. May indicate doubleheader parsing issues or duplicate entries from different sources.
- Missing Data: Games missing stadium coordinates or other required fields.
Fuzzy Match Suggestions
For each unresolved name, the validator provides the top fuzzy matches with confidence scores (0-100). High-confidence matches (>80) are likely correct; lower scores need manual verification.
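The 0-100 scoring can be illustrated with the standard library's `difflib` (a stand-in here; the project may use a dedicated Levenshtein library, and exact scores would differ):

```python
from difflib import SequenceMatcher

def fuzzy_suggestions(
    raw: str, known: list[str], top: int = 3
) -> list[tuple[str, int]]:
    """Score each known name 0-100 against the raw name, best matches first."""
    scored = [
        (name, round(SequenceMatcher(None, raw.lower(), name.lower()).ratio() * 100))
        for name in known
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]

print(fuzzy_suggestions(
    "Staples Centre",
    ["Crypto.com Arena", "Staples Center", "Paycom Center"],
))
# Top suggestion is "Staples Center", with a score well above the 80 cutoff
```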
Canonical IDs
Canonical IDs are stable, deterministic identifiers that enable cross-referencing between games, teams, and stadiums across different data sources.
ID Formats
Games
{sport}_{season}_{away}_{home}_{MMDD}[_{game_number}]
Examples:
- `nba_2025_hou_okc_1021` - NBA 2025-26, Houston @ OKC, Oct 21
- `mlb_2026_nyy_bos_0401_1` - MLB 2026, Yankees @ Red Sox, Apr 1, Game 1 (doubleheader)
Teams
{sport}_{city}_{name}
Examples:
- `nba_la_lakers`
- `mlb_new_york_yankees`
- `nfl_new_york_giants`
Stadiums
{sport}_{normalized_name}
Examples:
- `mlb_yankee_stadium`
- `nba_crypto_com_arena`
- `nfl_sofi_stadium`
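The normalization step lowercases the venue name and collapses punctuation and spaces into underscores, so "Crypto.com Arena" becomes `crypto_com_arena`. A sketch of one plausible rule (the exact normalization in `stadium_resolver.py` may differ):

```python
import re

def make_stadium_id(sport: str, name: str) -> str:
    """Normalize a venue name into {sport}_{normalized_name}."""
    normalized = re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")
    return f"{sport}_{normalized}"

print(make_stadium_id("nba", "Crypto.com Arena"))
# → nba_crypto_com_arena
print(make_stadium_id("nfl", "SoFi Stadium"))
# → nfl_sofi_stadium
```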
Generated vs Matched IDs
| Entity | Generated | Matched |
|---|---|---|
| Teams | Pre-defined in team_resolver.py mappings | Resolved from raw scraped names via aliases + fuzzy matching |
| Stadiums | Pre-defined in stadium_resolver.py mappings | Resolved from raw venue names via aliases + fuzzy matching |
| Games | Generated at scrape time from resolved team IDs + date | N/A (always generated, never matched) |
Resolution Flow:
Raw Name (from scraper)
↓
Exact Match (alias lookup in team_aliases.json / stadium_aliases.json)
↓ (if no match)
Fuzzy Match (Levenshtein distance against known names)
↓ (if confidence > threshold)
Canonical ID assigned
↓ (if no match)
Manual Review Item created
Cross-References
Entities reference each other via canonical IDs:
┌─────────────────────────────────────────────────────────────┐
│ Game │
│ id: nba_2025_hou_okc_1021 │
│ home_team_id: nba_oklahoma_city_thunder ──────────────┐ │
│ away_team_id: nba_houston_rockets ────────────────┐ │ │
│ stadium_id: nba_paycom_center ────────────────┐ │ │ │
└─────────────────────────────────────────────────│───│───│───┘
│ │ │
┌─────────────────────────────────────────────────│───│───│───┐
│ Stadium │ │ │ │
│ id: nba_paycom_center ◄───────────────────────┘ │ │ │
│ name: "Paycom Center" │ │ │
│ city: "Oklahoma City" │ │ │
│ latitude: 35.4634 │ │ │
│ longitude: -97.5151 │ │ │
└─────────────────────────────────────────────────────│───│───┘
│ │
┌─────────────────────────────────────────────────────│───│───┐
│ Team │ │ │
│ id: nba_houston_rockets ◄─────────────────────────┘ │ │
│ name: "Rockets" │ │
│ city: "Houston" │ │
│ stadium_id: nba_toyota_center │ │
└─────────────────────────────────────────────────────────│───┘
│
┌─────────────────────────────────────────────────────────│───┐
│ Team │ │
│ id: nba_oklahoma_city_thunder ◄───────────────────────┘ │
│ name: "Thunder" │
│ city: "Oklahoma City" │
│ stadium_id: nba_paycom_center │
└─────────────────────────────────────────────────────────────┘
Alias Files
Aliases map variant names to canonical IDs:
team_aliases.json
{
"nba": {
"LA Lakers": "nba_la_lakers",
"Los Angeles Lakers": "nba_la_lakers",
"LAL": "nba_la_lakers"
}
}
stadium_aliases.json
{
"nba": {
"Crypto.com Arena": "nba_crypto_com_arena",
"Staples Center": "nba_crypto_com_arena",
"STAPLES Center": "nba_crypto_com_arena"
}
}
When a scraper returns a raw name like "LA Lakers", the resolver:
1. Checks `team_aliases.json` for an exact match → finds `nba_la_lakers`
2. If no exact match, runs fuzzy matching against all known team names
3. If fuzzy match confidence > 80%, uses that canonical ID
4. Otherwise, creates a manual review item for human resolution
Adding a New Sport
To add support for a new sport (e.g., cfb for college football), update these files:
1. Configuration (config.py)
Add the sport to SUPPORTED_SPORTS and EXPECTED_GAME_COUNTS:
SUPPORTED_SPORTS: list[str] = [
"nba", "mlb", "nfl", "nhl", "mls", "wnba", "nwsl",
"cfb", # ← Add new sport
]
EXPECTED_GAME_COUNTS: dict[str, int] = {
# ... existing sports ...
"cfb": 900, # ← Add expected game count for validation
}
2. Team Mappings (normalizers/team_resolver.py)
Add team definitions to TEAM_MAPPINGS. Each entry maps an abbreviation to (canonical_id, full_name, city):
TEAM_MAPPINGS: dict[str, dict[str, tuple[str, str, str]]] = {
# ... existing sports ...
"cfb": {
"ALA": ("team_cfb_ala", "Alabama Crimson Tide", "Tuscaloosa"),
"OSU": ("team_cfb_osu", "Ohio State Buckeyes", "Columbus"),
# ... all teams ...
},
}
3. Stadium Mappings (normalizers/stadium_resolver.py)
Add stadium definitions to STADIUM_MAPPINGS. Each entry is a StadiumInfo with coordinates:
STADIUM_MAPPINGS: dict[str, dict[str, StadiumInfo]] = {
# ... existing sports ...
"cfb": {
"stadium_cfb_bryant_denny": StadiumInfo(
id="stadium_cfb_bryant_denny",
name="Bryant-Denny Stadium",
city="Tuscaloosa",
state="AL",
country="USA",
sport="cfb",
latitude=33.2083,
longitude=-87.5503,
),
# ... all stadiums ...
},
}
4. Scraper Implementation (scrapers/cfb.py)
Create a new scraper class extending BaseScraper:
from .base import BaseScraper, RawGameData, ScrapeResult
from ..models import Game, ManualReviewItem, Stadium, Team  # import paths illustrative
from ..normalizers.team_resolver import get_team_resolver
from ..normalizers.stadium_resolver import get_stadium_resolver
class CFBScraper(BaseScraper):
def __init__(self, season: int, **kwargs):
super().__init__("cfb", season, **kwargs)
self._team_resolver = get_team_resolver("cfb")
self._stadium_resolver = get_stadium_resolver("cfb")
def _get_sources(self) -> list[str]:
return ["espn", "sports_reference"] # Priority order
def _get_source_url(self, source: str, **kwargs) -> str:
# Return URL for each source
...
def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
# Implement scraping logic
...
def _normalize_games(self, raw_games: list[RawGameData]) -> tuple[list[Game], list[ManualReviewItem]]:
# Convert raw data to Game objects using resolvers
...
def scrape_teams(self) -> list[Team]:
# Return Team objects from TEAM_MAPPINGS
...
def scrape_stadiums(self) -> list[Stadium]:
# Return Stadium objects from STADIUM_MAPPINGS
...
def create_cfb_scraper(season: int) -> CFBScraper:
return CFBScraper(season=season)
5. Register Scraper (scrapers/__init__.py)
Export the new scraper:
from .cfb import CFBScraper, create_cfb_scraper
__all__ = [
# ... existing exports ...
"CFBScraper",
"create_cfb_scraper",
]
6. CLI Registration (cli.py)
Add the sport to get_scraper():
def get_scraper(sport: str, season: int):
# ... existing sports ...
elif sport == "cfb":
from .scrapers.cfb import create_cfb_scraper
return create_cfb_scraper(season)
7. Alias Files (team_aliases.json, stadium_aliases.json)
Add initial aliases for common name variants:
// team_aliases.json
{
"cfb": {
"Alabama": "team_cfb_ala",
"Bama": "team_cfb_ala",
"Roll Tide": "team_cfb_ala"
}
}
// stadium_aliases.json
{
"cfb": {
"Bryant Denny Stadium": "stadium_cfb_bryant_denny",
"Bryant-Denny": "stadium_cfb_bryant_denny"
}
}
8. Documentation (SOURCES.md)
Document data sources with URLs, rate limits, and notes:
## CFB (College Football)
**Teams**: 134 (FBS)
**Expected Games**: ~900 per season
**Season**: August - January
### Sources
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/football/college-football/scoreboard` | JSON |
| 2 | Sports-Reference | `sports-reference.com/cfb/years/{YEAR}-schedule.html` | HTML |
9. Tests (tests/test_scrapers/test_cfb.py)
Create tests for the new scraper:
import pytest
from sportstime_parser.scrapers.cfb import CFBScraper, create_cfb_scraper
class TestCFBScraper:
def test_factory_creates_scraper(self):
scraper = create_cfb_scraper(season=2025)
assert scraper.sport == "cfb"
assert scraper.season == 2025
def test_get_sources_returns_priority_list(self):
scraper = CFBScraper(season=2025)
sources = scraper._get_sources()
assert "espn" in sources
# ... more tests ...
Checklist
- Add to `SUPPORTED_SPORTS` in `config.py`
- Add to `EXPECTED_GAME_COUNTS` in `config.py`
- Add team mappings to `team_resolver.py`
- Add stadium mappings to `stadium_resolver.py`
- Create `scrapers/{sport}.py` with scraper class
- Export in `scrapers/__init__.py`
- Register in `cli.py` `get_scraper()`
- Add aliases to `team_aliases.json`
- Add aliases to `stadium_aliases.json`
- Document sources in `SOURCES.md`
- Create tests in `tests/test_scrapers/`
- Run `pytest` to verify all tests pass
- Run a dry-run scrape: `sportstime-parser scrape {sport} --season 2025 --dry-run`
Development
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov=sportstime_parser --cov-report=html
# Run specific test file
pytest tests/test_scrapers/test_nba.py
# Run with verbose output
pytest -v
Project Structure
sportstime_parser/
__init__.py
__main__.py # CLI entry point
cli.py # Subcommand definitions
config.py # Constants, defaults
models/
game.py # Game dataclass
team.py # Team dataclass
stadium.py # Stadium dataclass
aliases.py # Alias dataclasses
scrapers/
base.py # BaseScraper abstract class
nba.py # NBA scrapers
mlb.py # MLB scrapers
nfl.py # NFL scrapers
nhl.py # NHL scrapers
mls.py # MLS scrapers
wnba.py # WNBA scrapers
nwsl.py # NWSL scrapers
normalizers/
canonical_id.py # ID generation
team_resolver.py # Team name resolution
stadium_resolver.py # Stadium name resolution
timezone.py # Timezone conversion
fuzzy.py # Fuzzy matching
validators/
report.py # Validation report generator
uploaders/
cloudkit.py # CloudKit Web Services client
state.py # Resumable upload state
diff.py # Record comparison
utils/
http.py # Rate-limited HTTP client
logging.py # Verbose logger
progress.py # Progress bars
Troubleshooting
"No games file found"
Run the scrape command first:
sportstime-parser scrape nba --season 2025
"CloudKit not configured"
Set the required environment variables:
export CLOUDKIT_KEY_ID="your_key_id"
export CLOUDKIT_PRIVATE_KEY_PATH="/path/to/key.p8"
Rate limit errors
The scraper includes automatic rate limiting and exponential backoff. If you encounter persistent rate limit errors:
- Wait a few minutes before retrying
- Try scraping one sport at a time instead of "all"
- Check that you're not running multiple instances
Scrape fails with no data
- Check your internet connection
- Run with `--verbose` to see detailed error messages
- The scraper tries multiple sources; if all fail, the source websites may be temporarily unavailable
License
MIT