SportsTime Parser
A Python CLI tool for scraping sports schedules, normalizing data with canonical IDs, and uploading to CloudKit.
Features
- Scrapes game schedules from multiple sources with automatic fallback
- Supports 7 major sports leagues: NBA, MLB, NFL, NHL, MLS, WNBA, NWSL
- Generates deterministic canonical IDs for games, teams, and stadiums
- Produces validation reports with manual review lists
- Uploads to CloudKit with resumable, diff-based updates
Requirements
- Python 3.11+
- CloudKit credentials (for upload functionality)
Installation
# From the Scripts directory
cd Scripts
# Install in development mode
pip install -e ".[dev]"
# Or install dependencies only
pip install -r requirements.txt
Quick Start
# Scrape NBA 2025-26 season
sportstime-parser scrape nba --season 2025
# Scrape all sports
sportstime-parser scrape all --season 2025
# Validate existing scraped data
sportstime-parser validate nba --season 2025
# Check status
sportstime-parser status
# Upload to CloudKit (development)
sportstime-parser upload nba --season 2025
# Upload to CloudKit (production)
sportstime-parser upload nba --season 2025 --environment production
CLI Reference
scrape
Scrape game schedules, teams, and stadiums from web sources.
sportstime-parser scrape <sport> [options]
Arguments:
sport Sport to scrape: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"
Options:
--season, -s INT Season start year (default: 2025)
--dry-run Parse and validate only, don't write output files
--verbose, -v Enable verbose output
Examples:
# Scrape NBA 2025-26 season
sportstime-parser scrape nba --season 2025
# Scrape all sports with verbose output
sportstime-parser scrape all --season 2025 --verbose
# Dry run to test without writing files
sportstime-parser scrape mlb --season 2026 --dry-run
validate
Run validation on existing scraped data and regenerate reports. Validation performs these checks:
- Game Coverage: Compares scraped game count against expected totals per league (e.g., ~1,230 for NBA, ~2,430 for MLB)
- Team Resolution: Identifies team names that couldn't be matched to canonical IDs using fuzzy matching
- Stadium Resolution: Identifies venue names that couldn't be matched to canonical stadium IDs
- Duplicate Detection: Finds games with the same home/away teams on the same date (potential doubleheader issues or data errors)
- Missing Data: Flags games missing required fields (stadium_id, team IDs, valid dates)
The output is a Markdown report with:
- Summary statistics (total games, valid games, coverage percentage)
- Manual review items grouped by type (unresolved teams, unresolved stadiums, duplicates)
- Fuzzy match suggestions with confidence scores to help resolve unmatched names
sportstime-parser validate <sport> [options]
Arguments:
sport Sport to validate: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"
Options:
--season, -s INT Season start year (default: 2025)
Examples:
# Validate NBA data
sportstime-parser validate nba --season 2025
# Validate all sports
sportstime-parser validate all
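The duplicate-detection check described above boils down to grouping games by (home, away, date). A minimal sketch, assuming each scraped game is a dict with hypothetical `home_team_id`, `away_team_id`, and `date` keys:

```python
from collections import Counter

def find_duplicates(games: list[dict]) -> list[tuple]:
    """Flag matchups that share home team, away team, and date.

    Field names are illustrative; the real Game model may differ.
    """
    keys = [(g["home_team_id"], g["away_team_id"], g["date"]) for g in games]
    counts = Counter(keys)
    return [key for key, n in counts.items() if n > 1]

games = [
    {"home_team_id": "nba_la_lakers", "away_team_id": "nba_boston_celtics", "date": "2025-10-21"},
    {"home_team_id": "nba_la_lakers", "away_team_id": "nba_boston_celtics", "date": "2025-10-21"},
]
print(find_duplicates(games))  # → [('nba_la_lakers', 'nba_boston_celtics', '2025-10-21')]
```

Note that for MLB, a legitimate doubleheader produces the same key twice, which is why duplicates are reported for manual review rather than dropped automatically.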
upload
Upload scraped data to CloudKit with diff-based updates.
sportstime-parser upload <sport> [options]
Arguments:
sport Sport to upload: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"
Options:
--season, -s INT Season start year (default: 2025)
--environment, -e CloudKit environment: development or production (default: development)
--resume Resume interrupted upload from last checkpoint
Examples:
# Upload NBA to development
sportstime-parser upload nba --season 2025
# Upload to production
sportstime-parser upload nba --season 2025 --environment production
# Resume interrupted upload
sportstime-parser upload mlb --season 2026 --resume
status
Show current scrape and upload status.
sportstime-parser status
retry
Retry failed uploads from previous attempts.
sportstime-parser retry <sport> [options]
Arguments:
sport Sport to retry: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"
Options:
--season, -s INT Season start year (default: 2025)
--environment, -e CloudKit environment (default: development)
--max-retries INT Maximum retry attempts per record (default: 3)
clear
Clear upload session state to start fresh.
sportstime-parser clear <sport> [options]
Arguments:
sport Sport to clear: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"
Options:
--season, -s INT Season start year (default: 2025)
--environment, -e CloudKit environment (default: development)
CloudKit Configuration
To upload data to CloudKit, you need to configure authentication credentials.
1. Get Credentials from Apple Developer Portal
- Go to Apple Developer Portal
- Navigate to Certificates, Identifiers & Profiles > Keys
- Create a new key with CloudKit capability
- Download the private key file (.p8)
- Note the Key ID
2. Set Environment Variables
# Key ID from Apple Developer Portal
export CLOUDKIT_KEY_ID="your_key_id_here"
# Path to private key file
export CLOUDKIT_PRIVATE_KEY_PATH="/path/to/AuthKey_XXXXXX.p8"
# Or provide key content directly (useful for CI/CD)
export CLOUDKIT_PRIVATE_KEY="-----BEGIN EC PRIVATE KEY-----
...key content...
-----END EC PRIVATE KEY-----"
3. Verify Configuration
sportstime-parser status
The status output will show whether CloudKit is configured correctly.
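For reference, here is one way a tool like this might resolve those variables, preferring the inline key over the file path. The precedence shown is an assumption for illustration, not documented behavior:

```python
import os

def load_cloudkit_credentials() -> tuple[str, str]:
    """Resolve CloudKit credentials from the environment.

    Checks CLOUDKIT_PRIVATE_KEY (inline PEM) first, then falls back to
    CLOUDKIT_PRIVATE_KEY_PATH. Raises if neither is configured.
    """
    key_id = os.environ.get("CLOUDKIT_KEY_ID")
    if not key_id:
        raise RuntimeError("CLOUDKIT_KEY_ID is not set")
    pem = os.environ.get("CLOUDKIT_PRIVATE_KEY")
    if not pem:
        path = os.environ.get("CLOUDKIT_PRIVATE_KEY_PATH")
        if not path:
            raise RuntimeError("No CloudKit private key configured")
        with open(path) as f:
            pem = f.read()
    return key_id, pem
```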
Output Files
Scraped data is saved to the output/ directory:
output/
games_nba_2025.json # Game schedules
teams_nba.json # Team data
stadiums_nba.json # Stadium data
validation_nba_2025.md # Validation report
Validation Reports
Validation reports are generated in Markdown format at output/validation_{sport}_{season}.md.
Report Sections
Summary Table
| Metric | Description |
|---|---|
| Total Games | Number of games scraped |
| Valid Games | Games with all required fields resolved |
| Coverage | Percentage of expected games found (based on league schedule) |
| Unresolved Teams | Team names that couldn't be matched |
| Unresolved Stadiums | Venue names that couldn't be matched |
| Duplicates | Potential duplicate game entries |
Manual Review Items
Items are grouped by type and include the raw value, source URL, and suggested fixes:
- Unresolved Teams: Team names not in the alias mapping. Add to `team_aliases.json` to resolve.
- Unresolved Stadiums: Venue names not recognized. Common for renamed arenas (naming rights changes). Add to `stadium_aliases.json`.
- Duplicate Games: Same matchup on same date. May indicate doubleheader parsing issues or duplicate entries from different sources.
- Missing Data: Games missing stadium coordinates or other required fields.
Fuzzy Match Suggestions
For each unresolved name, the validator provides the top fuzzy matches with confidence scores (0-100). High-confidence matches (>80) are likely correct; lower scores need manual verification.
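A minimal sketch of how such suggestions can be scored, using stdlib `difflib` as a stand-in for whatever fuzzy-matching library the tool actually uses:

```python
import difflib

def suggest_matches(raw: str, known: list[str], limit: int = 3) -> list[tuple[str, int]]:
    """Return the top fuzzy matches with 0-100 confidence scores."""
    scored = [
        (name, round(difflib.SequenceMatcher(None, raw.lower(), name.lower()).ratio() * 100))
        for name in known
    ]
    # Highest confidence first, truncated to the requested number of suggestions
    return sorted(scored, key=lambda pair: -pair[1])[:limit]

print(suggest_matches("LA Lakerz", ["Los Angeles Lakers", "LA Clippers", "LA Lakers"]))
```

Here the typo "LA Lakerz" scores well above 80 against "LA Lakers", so it would surface as a high-confidence suggestion in the report.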
Canonical IDs
Canonical IDs are stable, deterministic identifiers that enable cross-referencing between games, teams, and stadiums across different data sources.
ID Formats
Games
{sport}_{season}_{away}_{home}_{MMDD}[_{game_number}]
Examples:
- `nba_2025_hou_okc_1021` - NBA 2025-26, Houston @ OKC, Oct 21
- `mlb_2026_nyy_bos_0401_1` - MLB 2026, Yankees @ Red Sox, Apr 1, Game 1 (doubleheader)
Teams
{sport}_{city}_{name}
Examples:
- `nba_la_lakers`
- `mlb_new_york_yankees`
- `nfl_new_york_giants`
Stadiums
{sport}_{normalized_name}
Examples:
- `mlb_yankee_stadium`
- `nba_crypto_com_arena`
- `nfl_sofi_stadium`
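The examples above are consistent with a simple normalization rule: lowercase the venue name and collapse runs of non-alphanumeric characters into underscores. A sketch (the exact rule in `stadium_resolver.py` may differ):

```python
import re

def normalize_stadium_name(sport: str, name: str) -> str:
    """Derive a canonical stadium ID from a venue name."""
    # Lowercase, then collapse every run of non-alphanumeric characters
    # (spaces, dots, hyphens) into a single underscore.
    normalized = re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")
    return f"{sport}_{normalized}"

print(normalize_stadium_name("nba", "Crypto.com Arena"))  # → nba_crypto_com_arena
print(normalize_stadium_name("mlb", "Yankee Stadium"))    # → mlb_yankee_stadium
```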
Generated vs Matched IDs
| Entity | Generated | Matched |
|---|---|---|
| Teams | Pre-defined in `team_resolver.py` mappings | Resolved from raw scraped names via aliases + fuzzy matching |
| Stadiums | Pre-defined in `stadium_resolver.py` mappings | Resolved from raw venue names via aliases + fuzzy matching |
| Games | Generated at scrape time from resolved team IDs + date | N/A (always generated, never matched) |
Resolution Flow:
Raw Name (from scraper)
↓
Exact Match (alias lookup in team_aliases.json / stadium_aliases.json)
↓ (if no match)
Fuzzy Match (Levenshtein distance against known names)
↓ (if confidence > threshold)
Canonical ID assigned
↓ (if no match)
Manual Review Item created
Cross-References
Entities reference each other via canonical IDs:
┌─────────────────────────────────────────────────────────────┐
│ Game │
│ id: nba_2025_hou_okc_1021 │
│ home_team_id: nba_oklahoma_city_thunder ──────────────┐ │
│ away_team_id: nba_houston_rockets ────────────────┐ │ │
│ stadium_id: nba_paycom_center ────────────────┐ │ │ │
└─────────────────────────────────────────────────│───│───│───┘
│ │ │
┌─────────────────────────────────────────────────│───│───│───┐
│ Stadium │ │ │ │
│ id: nba_paycom_center ◄───────────────────────┘ │ │ │
│ name: "Paycom Center" │ │ │
│ city: "Oklahoma City" │ │ │
│ latitude: 35.4634 │ │ │
│ longitude: -97.5151 │ │ │
└─────────────────────────────────────────────────────│───│───┘
│ │
┌─────────────────────────────────────────────────────│───│───┐
│ Team │ │ │
│ id: nba_houston_rockets ◄─────────────────────────┘ │ │
│ name: "Rockets" │ │
│ city: "Houston" │ │
│ stadium_id: nba_toyota_center │ │
└─────────────────────────────────────────────────────────│───┘
│
┌─────────────────────────────────────────────────────────│───┐
│ Team │ │
│ id: nba_oklahoma_city_thunder ◄───────────────────────┘ │
│ name: "Thunder" │
│ city: "Oklahoma City" │
│ stadium_id: nba_paycom_center │
└─────────────────────────────────────────────────────────────┘
Alias Files
Aliases map variant names to canonical IDs:
team_aliases.json
{
"nba": {
"LA Lakers": "nba_la_lakers",
"Los Angeles Lakers": "nba_la_lakers",
"LAL": "nba_la_lakers"
}
}
stadium_aliases.json
{
"nba": {
"Crypto.com Arena": "nba_crypto_com_arena",
"Staples Center": "nba_crypto_com_arena",
"STAPLES Center": "nba_crypto_com_arena"
}
}
When a scraper returns a raw name like "LA Lakers", the resolver:
- Checks `team_aliases.json` for an exact match → finds `nba_la_lakers`
- If no exact match, runs fuzzy matching against all known team names
- If fuzzy match confidence > 80%, uses that canonical ID
- Otherwise, creates a manual review item for human resolution
Adding a New Sport
To add support for a new sport (e.g., cfb for college football), update these files:
1. Configuration (config.py)
Add the sport to SUPPORTED_SPORTS and EXPECTED_GAME_COUNTS:
SUPPORTED_SPORTS: list[str] = [
"nba", "mlb", "nfl", "nhl", "mls", "wnba", "nwsl",
"cfb", # ← Add new sport
]
EXPECTED_GAME_COUNTS: dict[str, int] = {
# ... existing sports ...
"cfb": 900, # ← Add expected game count for validation
}
2. Team Mappings (normalizers/team_resolver.py)
Add team definitions to TEAM_MAPPINGS. Each entry maps an abbreviation to (canonical_id, full_name, city):
TEAM_MAPPINGS: dict[str, dict[str, tuple[str, str, str]]] = {
# ... existing sports ...
"cfb": {
"ALA": ("team_cfb_ala", "Alabama Crimson Tide", "Tuscaloosa"),
"OSU": ("team_cfb_osu", "Ohio State Buckeyes", "Columbus"),
# ... all teams ...
},
}
3. Stadium Mappings (normalizers/stadium_resolver.py)
Add stadium definitions to STADIUM_MAPPINGS. Each entry is a StadiumInfo with coordinates:
STADIUM_MAPPINGS: dict[str, dict[str, StadiumInfo]] = {
# ... existing sports ...
"cfb": {
"stadium_cfb_bryant_denny": StadiumInfo(
id="stadium_cfb_bryant_denny",
name="Bryant-Denny Stadium",
city="Tuscaloosa",
state="AL",
country="USA",
sport="cfb",
latitude=33.2083,
longitude=-87.5503,
),
# ... all stadiums ...
},
}
4. Scraper Implementation (scrapers/cfb.py)
Create a new scraper class extending BaseScraper:
from .base import BaseScraper, RawGameData, ScrapeResult
class CFBScraper(BaseScraper):
def __init__(self, season: int, **kwargs):
super().__init__("cfb", season, **kwargs)
self._team_resolver = get_team_resolver("cfb")
self._stadium_resolver = get_stadium_resolver("cfb")
def _get_sources(self) -> list[str]:
return ["espn", "sports_reference"] # Priority order
def _get_source_url(self, source: str, **kwargs) -> str:
# Return URL for each source
...
def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
# Implement scraping logic
...
def _normalize_games(self, raw_games: list[RawGameData]) -> tuple[list[Game], list[ManualReviewItem]]:
# Convert raw data to Game objects using resolvers
...
def scrape_teams(self) -> list[Team]:
# Return Team objects from TEAM_MAPPINGS
...
def scrape_stadiums(self) -> list[Stadium]:
# Return Stadium objects from STADIUM_MAPPINGS
...
def create_cfb_scraper(season: int) -> CFBScraper:
return CFBScraper(season=season)
5. Register Scraper (scrapers/__init__.py)
Export the new scraper:
from .cfb import CFBScraper, create_cfb_scraper
__all__ = [
# ... existing exports ...
"CFBScraper",
"create_cfb_scraper",
]
6. CLI Registration (cli.py)
Add the sport to get_scraper():
def get_scraper(sport: str, season: int):
# ... existing sports ...
elif sport == "cfb":
from .scrapers.cfb import create_cfb_scraper
return create_cfb_scraper(season)
7. Alias Files (team_aliases.json, stadium_aliases.json)
Add initial aliases for common name variants:
// team_aliases.json
{
"cfb": {
"Alabama": "team_cfb_ala",
"Bama": "team_cfb_ala",
"Roll Tide": "team_cfb_ala"
}
}
// stadium_aliases.json
{
"cfb": {
"Bryant Denny Stadium": "stadium_cfb_bryant_denny",
"Bryant-Denny": "stadium_cfb_bryant_denny"
}
}
8. Documentation (SOURCES.md)
Document data sources with URLs, rate limits, and notes:
## CFB (College Football)
**Teams**: 134 (FBS)
**Expected Games**: ~900 per season
**Season**: August - January
### Sources
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/football/college-football/scoreboard` | JSON |
| 2 | Sports-Reference | `sports-reference.com/cfb/years/{YEAR}-schedule.html` | HTML |
9. Tests (tests/test_scrapers/test_cfb.py)
Create tests for the new scraper:
import pytest
from sportstime_parser.scrapers.cfb import CFBScraper, create_cfb_scraper
class TestCFBScraper:
def test_factory_creates_scraper(self):
scraper = create_cfb_scraper(season=2025)
assert scraper.sport == "cfb"
assert scraper.season == 2025
def test_get_sources_returns_priority_list(self):
scraper = CFBScraper(season=2025)
sources = scraper._get_sources()
assert "espn" in sources
# ... more tests ...
Checklist
- Add to `SUPPORTED_SPORTS` in `config.py`
- Add to `EXPECTED_GAME_COUNTS` in `config.py`
- Add team mappings to `team_resolver.py`
- Add stadium mappings to `stadium_resolver.py`
- Create `scrapers/{sport}.py` with scraper class
- Export in `scrapers/__init__.py`
- Register in `cli.py` `get_scraper()`
- Add aliases to `team_aliases.json`
- Add aliases to `stadium_aliases.json`
- Document sources in `SOURCES.md`
- Create tests in `tests/test_scrapers/`
- Run `pytest` to verify all tests pass
- Run a dry-run scrape: `sportstime-parser scrape {sport} --season 2025 --dry-run`
Development
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov=sportstime_parser --cov-report=html
# Run specific test file
pytest tests/test_scrapers/test_nba.py
# Run with verbose output
pytest -v
Project Structure
sportstime_parser/
__init__.py
__main__.py # CLI entry point
cli.py # Subcommand definitions
config.py # Constants, defaults
models/
game.py # Game dataclass
team.py # Team dataclass
stadium.py # Stadium dataclass
aliases.py # Alias dataclasses
scrapers/
base.py # BaseScraper abstract class
nba.py # NBA scrapers
mlb.py # MLB scrapers
nfl.py # NFL scrapers
nhl.py # NHL scrapers
mls.py # MLS scrapers
wnba.py # WNBA scrapers
nwsl.py # NWSL scrapers
normalizers/
canonical_id.py # ID generation
team_resolver.py # Team name resolution
stadium_resolver.py # Stadium name resolution
timezone.py # Timezone conversion
fuzzy.py # Fuzzy matching
validators/
report.py # Validation report generator
uploaders/
cloudkit.py # CloudKit Web Services client
state.py # Resumable upload state
diff.py # Record comparison
utils/
http.py # Rate-limited HTTP client
logging.py # Verbose logger
progress.py # Progress bars
Troubleshooting
"No games file found"
Run the scrape command first:
sportstime-parser scrape nba --season 2025
"CloudKit not configured"
Set the required environment variables:
export CLOUDKIT_KEY_ID="your_key_id"
export CLOUDKIT_PRIVATE_KEY_PATH="/path/to/key.p8"
Rate limit errors
The scraper includes automatic rate limiting and exponential backoff. If you encounter persistent rate limit errors:
- Wait a few minutes before retrying
- Try scraping one sport at a time instead of "all"
- Check that you're not running multiple instances
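The backoff behavior described here typically looks something like the following sketch; the 1 s base, doubling delay, and 60 s cap are illustrative defaults, not the tool's actual settings:

```python
import random
import time

def fetch_with_backoff(fetch, max_attempts: int = 5, base: float = 1.0):
    """Retry a rate-limited call with exponential backoff plus jitter.

    fetch is any zero-argument callable that raises on failure
    (e.g. an HTTP 429 response).
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles each attempt, capped at 60x the base,
            # with random jitter to avoid synchronized retries.
            delay = min(60 * base, base * 2 ** attempt) + random.uniform(0, base)
            time.sleep(delay)
```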
Scrape fails with no data
- Check your internet connection
- Run with `--verbose` to see detailed error messages
- The scraper tries multiple sources; if all fail, the source websites may be temporarily unavailable
License
MIT