# SportsTime Parser
A Python CLI tool for scraping sports schedules, normalizing data with canonical IDs, and uploading to CloudKit.
## Features
- Scrapes game schedules from multiple sources with automatic fallback
- Supports 7 major sports leagues: NBA, MLB, NFL, NHL, MLS, WNBA, NWSL
- Generates deterministic canonical IDs for games, teams, and stadiums
- Produces validation reports with manual review lists
- Uploads to CloudKit with resumable, diff-based updates
## Requirements
- Python 3.11+
- CloudKit credentials (for upload functionality)
## Installation
```bash
# From the Scripts directory
cd Scripts
# Install in development mode
pip install -e ".[dev]"
# Or install dependencies only
pip install -r requirements.txt
```
## Quick Start
```bash
# Scrape NBA 2025-26 season
sportstime-parser scrape nba --season 2025
# Scrape all sports
sportstime-parser scrape all --season 2025
# Validate existing scraped data
sportstime-parser validate nba --season 2025
# Check status
sportstime-parser status
# Upload to CloudKit (development)
sportstime-parser upload nba --season 2025
# Upload to CloudKit (production)
sportstime-parser upload nba --season 2025 --environment production
```
## CLI Reference
### scrape
Scrape game schedules, teams, and stadiums from web sources.
```bash
sportstime-parser scrape <sport> [options]

Arguments:
  sport              Sport to scrape: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT   Season start year (default: 2025)
  --dry-run          Parse and validate only, don't write output files
  --verbose, -v      Enable verbose output
```
**Examples:**
```bash
# Scrape NBA 2025-26 season
sportstime-parser scrape nba --season 2025
# Scrape all sports with verbose output
sportstime-parser scrape all --season 2025 --verbose
# Dry run to test without writing files
sportstime-parser scrape mlb --season 2026 --dry-run
```
### validate
Run validation on existing scraped data and regenerate reports. Validation performs these checks:
1. **Game Coverage**: Compares scraped game count against expected totals per league (e.g., ~1,230 for NBA, ~2,430 for MLB)
2. **Team Resolution**: Identifies team names that couldn't be matched to canonical IDs using fuzzy matching
3. **Stadium Resolution**: Identifies venue names that couldn't be matched to canonical stadium IDs
4. **Duplicate Detection**: Finds games with the same home/away teams on the same date (potential doubleheader issues or data errors)
5. **Missing Data**: Flags games missing required fields (stadium_id, team IDs, valid dates)
The output is a Markdown report with:
- Summary statistics (total games, valid games, coverage percentage)
- Manual review items grouped by type (unresolved teams, unresolved stadiums, duplicates)
- Fuzzy match suggestions with confidence scores to help resolve unmatched names
```bash
sportstime-parser validate <sport> [options]

Arguments:
  sport              Sport to validate: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT   Season start year (default: 2025)
```
**Examples:**
```bash
# Validate NBA data
sportstime-parser validate nba --season 2025
# Validate all sports
sportstime-parser validate all
```
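The duplicate check (item 4 above) amounts to grouping games by matchup and date. A minimal sketch, assuming a dict-shaped game record (the real `Game` model may differ):

```python
from collections import Counter

def find_duplicate_games(games: list[dict]) -> list[tuple]:
    """Return (home_team_id, away_team_id, date) keys appearing more than once."""
    counts = Counter(
        (g["home_team_id"], g["away_team_id"], g["date"]) for g in games
    )
    return [key for key, n in counts.items() if n > 1]

games = [
    {"home_team_id": "nba_la_lakers", "away_team_id": "nba_boston_celtics", "date": "2025-10-21"},
    {"home_team_id": "nba_la_lakers", "away_team_id": "nba_boston_celtics", "date": "2025-10-21"},
    {"home_team_id": "nba_la_lakers", "away_team_id": "nba_chicago_bulls", "date": "2025-10-23"},
]
# find_duplicate_games(games) flags the repeated Celtics @ Lakers game on 2025-10-21
```

A flagged key may still be legitimate (an MLB doubleheader), which is why duplicates land in the manual review list rather than being dropped automatically.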
### upload
Upload scraped data to CloudKit with diff-based updates.
```bash
sportstime-parser upload <sport> [options]

Arguments:
  sport                Sport to upload: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT     Season start year (default: 2025)
  --environment, -e    CloudKit environment: development or production (default: development)
  --resume             Resume interrupted upload from last checkpoint
```
**Examples:**
```bash
# Upload NBA to development
sportstime-parser upload nba --season 2025
# Upload to production
sportstime-parser upload nba --season 2025 --environment production
# Resume interrupted upload
sportstime-parser upload mlb --season 2026 --resume
```
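A diff-based update avoids re-uploading records that haven't changed. A minimal sketch of the comparison step, with `diff_records` as an illustrative name rather than the actual API of `uploaders/diff.py`:

```python
def diff_records(
    local: dict[str, dict], remote: dict[str, dict]
) -> tuple[dict[str, dict], dict[str, dict], list[str]]:
    """Partition local records by record ID into creates, updates, and unchanged."""
    to_create = {rid: rec for rid, rec in local.items() if rid not in remote}
    to_update = {
        rid: rec for rid, rec in local.items()
        if rid in remote and remote[rid] != rec
    }
    unchanged = [rid for rid in local if rid in remote and remote[rid] == local[rid]]
    return to_create, to_update, unchanged
```

Only `to_create` and `to_update` need to be sent, which keeps re-runs after a partial failure cheap.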
### status
Show current scrape and upload status.
```bash
sportstime-parser status
```
### retry
Retry failed uploads from previous attempts.
```bash
sportstime-parser retry <sport> [options]

Arguments:
  sport              Sport to retry: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT   Season start year (default: 2025)
  --environment, -e  CloudKit environment (default: development)
  --max-retries INT  Maximum retry attempts per record (default: 3)
```
### clear
Clear upload session state to start fresh.
```bash
sportstime-parser clear <sport> [options]

Arguments:
  sport              Sport to clear: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT   Season start year (default: 2025)
  --environment, -e  CloudKit environment (default: development)
```
## CloudKit Configuration
To upload data to CloudKit, you need to configure authentication credentials.
### 1. Get Credentials from Apple Developer Portal
1. Go to [Apple Developer Portal](https://developer.apple.com)
2. Navigate to **Certificates, Identifiers & Profiles** > **Keys**
3. Create a new key with **CloudKit** capability
4. Download the private key file (.p8)
5. Note the Key ID
### 2. Set Environment Variables
```bash
# Key ID from Apple Developer Portal
export CLOUDKIT_KEY_ID="your_key_id_here"
# Path to private key file
export CLOUDKIT_PRIVATE_KEY_PATH="/path/to/AuthKey_XXXXXX.p8"
# Or provide key content directly (useful for CI/CD)
export CLOUDKIT_PRIVATE_KEY="-----BEGIN EC PRIVATE KEY-----
...key content...
-----END EC PRIVATE KEY-----"
```
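An uploader might resolve the key from these variables along the following lines; `load_private_key` is an illustrative helper, not the tool's actual API:

```python
import os
from pathlib import Path

def load_private_key() -> str:
    """Return the PEM key text, preferring inline content over the file path."""
    inline = os.environ.get("CLOUDKIT_PRIVATE_KEY")
    if inline:
        return inline
    path = os.environ.get("CLOUDKIT_PRIVATE_KEY_PATH")
    if path:
        return Path(path).read_text()
    raise RuntimeError(
        "CloudKit not configured: set CLOUDKIT_PRIVATE_KEY or CLOUDKIT_PRIVATE_KEY_PATH"
    )
```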
### 3. Verify Configuration
```bash
sportstime-parser status
```
The status output will show whether CloudKit is configured correctly.
## Output Files
Scraped data is saved to the `output/` directory:
```
output/
  games_nba_2025.json       # Game schedules
  teams_nba.json            # Team data
  stadiums_nba.json         # Stadium data
  validation_nba_2025.md    # Validation report
```
## Validation Reports
Validation reports are generated in Markdown format at `output/validation_{sport}_{season}.md`.
### Report Sections
**Summary Table**
| Metric | Description |
|--------|-------------|
| Total Games | Number of games scraped |
| Valid Games | Games with all required fields resolved |
| Coverage | Percentage of expected games found (based on league schedule) |
| Unresolved Teams | Team names that couldn't be matched |
| Unresolved Stadiums | Venue names that couldn't be matched |
| Duplicates | Potential duplicate game entries |
**Manual Review Items**
Items are grouped by type and include the raw value, source URL, and suggested fixes:
- **Unresolved Teams**: Team names not in the alias mapping. Add to `team_aliases.json` to resolve.
- **Unresolved Stadiums**: Venue names not recognized. Common for renamed arenas (naming rights changes). Add to `stadium_aliases.json`.
- **Duplicate Games**: Same matchup on same date. May indicate doubleheader parsing issues or duplicate entries from different sources.
- **Missing Data**: Games missing stadium coordinates or other required fields.
**Fuzzy Match Suggestions**
For each unresolved name, the validator provides the top fuzzy matches with confidence scores (0-100). High-confidence matches (>80) are likely correct; lower scores need manual verification.
## Canonical IDs
Canonical IDs are stable, deterministic identifiers that enable cross-referencing between games, teams, and stadiums across different data sources.
### ID Formats
**Games**
```
{sport}_{season}_{away}_{home}_{MMDD}[_{game_number}]
```
Examples:
- `nba_2025_hou_okc_1021` - NBA 2025-26, Houston @ OKC, Oct 21
- `mlb_2026_nyy_bos_0401_1` - MLB 2026, Yankees @ Red Sox, Apr 1, Game 1 (doubleheader)
**Teams**
```
{sport}_{city}_{name}
```
Examples:
- `nba_la_lakers`
- `mlb_new_york_yankees`
- `nfl_new_york_giants`
**Stadiums**
```
{sport}_{normalized_name}
```
Examples:
- `mlb_yankee_stadium`
- `nba_crypto_com_arena`
- `nfl_sofi_stadium`
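The `{normalized_name}` segment suggests lowercasing and collapsing punctuation and spaces to underscores. An illustrative normalizer consistent with the examples above:

```python
import re

def normalize_name(name: str) -> str:
    """Lowercase and collapse runs of non-alphanumeric characters to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")

# normalize_name("Crypto.com Arena") -> "crypto_com_arena"
# normalize_name("SoFi Stadium")     -> "sofi_stadium"
```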
### Generated vs Matched IDs
| Entity | Generated | Matched |
|--------|-----------|---------|
| **Teams** | Pre-defined in `team_resolver.py` mappings | Resolved from raw scraped names via aliases + fuzzy matching |
| **Stadiums** | Pre-defined in `stadium_resolver.py` mappings | Resolved from raw venue names via aliases + fuzzy matching |
| **Games** | Generated at scrape time from resolved team IDs + date | N/A (always generated, never matched) |
**Resolution Flow:**
```
Raw Name (from scraper)
  ↓
Exact match: alias lookup in team_aliases.json / stadium_aliases.json
  ↓ (no exact match)
Fuzzy match: Levenshtein distance against known names
  ├─ confidence > threshold → Canonical ID assigned
  └─ otherwise              → Manual Review Item created
```
### Cross-References
Entities reference each other via canonical IDs:
```
Game
  id:           nba_2025_hou_okc_1021
  home_team_id: nba_oklahoma_city_thunder ──► Team
  away_team_id: nba_houston_rockets ────────► Team
  stadium_id:   nba_paycom_center ──────────► Stadium

Stadium (nba_paycom_center)
  name:      "Paycom Center"
  city:      "Oklahoma City"
  latitude:  35.4634
  longitude: -97.5151

Team (nba_houston_rockets)
  name:       "Rockets"
  city:       "Houston"
  stadium_id: nba_toyota_center ────────────► Stadium

Team (nba_oklahoma_city_thunder)
  name:       "Thunder"
  city:       "Oklahoma City"
  stadium_id: nba_paycom_center ────────────► Stadium
```
### Alias Files
Aliases map variant names to canonical IDs:
**`team_aliases.json`**
```json
{
  "nba": {
    "LA Lakers": "nba_la_lakers",
    "Los Angeles Lakers": "nba_la_lakers",
    "LAL": "nba_la_lakers"
  }
}
```
**`stadium_aliases.json`**
```json
{
  "nba": {
    "Crypto.com Arena": "nba_crypto_com_arena",
    "Staples Center": "nba_crypto_com_arena",
    "STAPLES Center": "nba_crypto_com_arena"
  }
}
```
When a scraper returns a raw name like "LA Lakers", the resolver:
1. Checks `team_aliases.json` for an exact match → finds `nba_la_lakers`
2. If no exact match, runs fuzzy matching against all known team names
3. If fuzzy match confidence > 80%, uses that canonical ID
4. Otherwise, creates a manual review item for human resolution
## Adding a New Sport
To add support for a new sport (e.g., `cfb` for college football), update these files:
### 1. Configuration (`config.py`)
Add the sport to `SUPPORTED_SPORTS` and `EXPECTED_GAME_COUNTS`:
```python
SUPPORTED_SPORTS: list[str] = [
    "nba", "mlb", "nfl", "nhl", "mls", "wnba", "nwsl",
    "cfb",  # ← Add new sport
]

EXPECTED_GAME_COUNTS: dict[str, int] = {
    # ... existing sports ...
    "cfb": 900,  # ← Add expected game count for validation
}
```
### 2. Team Mappings (`normalizers/team_resolver.py`)
Add team definitions to `TEAM_MAPPINGS`. Each entry maps an abbreviation to `(canonical_id, full_name, city)`:
```python
TEAM_MAPPINGS: dict[str, dict[str, tuple[str, str, str]]] = {
    # ... existing sports ...
    "cfb": {
        "ALA": ("team_cfb_ala", "Alabama Crimson Tide", "Tuscaloosa"),
        "OSU": ("team_cfb_osu", "Ohio State Buckeyes", "Columbus"),
        # ... all teams ...
    },
}
```
### 3. Stadium Mappings (`normalizers/stadium_resolver.py`)
Add stadium definitions to `STADIUM_MAPPINGS`. Each entry is a `StadiumInfo` with coordinates:
```python
STADIUM_MAPPINGS: dict[str, dict[str, StadiumInfo]] = {
    # ... existing sports ...
    "cfb": {
        "stadium_cfb_bryant_denny": StadiumInfo(
            id="stadium_cfb_bryant_denny",
            name="Bryant-Denny Stadium",
            city="Tuscaloosa",
            state="AL",
            country="USA",
            sport="cfb",
            latitude=33.2083,
            longitude=-87.5503,
        ),
        # ... all stadiums ...
    },
}
}
```
### 4. Scraper Implementation (`scrapers/cfb.py`)
Create a new scraper class extending `BaseScraper`:
```python
from .base import BaseScraper, RawGameData, ScrapeResult


class CFBScraper(BaseScraper):
    def __init__(self, season: int, **kwargs):
        super().__init__("cfb", season, **kwargs)
        self._team_resolver = get_team_resolver("cfb")
        self._stadium_resolver = get_stadium_resolver("cfb")

    def _get_sources(self) -> list[str]:
        return ["espn", "sports_reference"]  # Priority order

    def _get_source_url(self, source: str, **kwargs) -> str:
        # Return URL for each source
        ...

    def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
        # Implement scraping logic
        ...

    def _normalize_games(
        self, raw_games: list[RawGameData]
    ) -> tuple[list[Game], list[ManualReviewItem]]:
        # Convert raw data to Game objects using resolvers
        ...

    def scrape_teams(self) -> list[Team]:
        # Return Team objects from TEAM_MAPPINGS
        ...

    def scrape_stadiums(self) -> list[Stadium]:
        # Return Stadium objects from STADIUM_MAPPINGS
        ...


def create_cfb_scraper(season: int) -> CFBScraper:
    return CFBScraper(season=season)
```
### 5. Register Scraper (`scrapers/__init__.py`)
Export the new scraper:
```python
from .cfb import CFBScraper, create_cfb_scraper

__all__ = [
    # ... existing exports ...
    "CFBScraper",
    "create_cfb_scraper",
]
```
### 6. CLI Registration (`cli.py`)
Add the sport to `get_scraper()`:
```python
def get_scraper(sport: str, season: int):
    # ... existing sports ...
    elif sport == "cfb":
        from .scrapers.cfb import create_cfb_scraper
        return create_cfb_scraper(season)
```
### 7. Alias Files (`team_aliases.json`, `stadium_aliases.json`)
Add initial aliases for common name variants:
```json
// team_aliases.json
{
  "cfb": {
    "Alabama": "team_cfb_ala",
    "Bama": "team_cfb_ala",
    "Roll Tide": "team_cfb_ala"
  }
}

// stadium_aliases.json
{
  "cfb": {
    "Bryant Denny Stadium": "stadium_cfb_bryant_denny",
    "Bryant-Denny": "stadium_cfb_bryant_denny"
  }
}
```
### 8. Documentation (`SOURCES.md`)
Document data sources with URLs, rate limits, and notes:
```markdown
## CFB (College Football)
**Teams**: 134 (FBS)
**Expected Games**: ~900 per season
**Season**: August - January
### Sources
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/football/college-football/scoreboard` | JSON |
| 2 | Sports-Reference | `sports-reference.com/cfb/years/{YEAR}-schedule.html` | HTML |
```
### 9. Tests (`tests/test_scrapers/test_cfb.py`)
Create tests for the new scraper:
```python
import pytest
from sportstime_parser.scrapers.cfb import CFBScraper, create_cfb_scraper


class TestCFBScraper:
    def test_factory_creates_scraper(self):
        scraper = create_cfb_scraper(season=2025)
        assert scraper.sport == "cfb"
        assert scraper.season == 2025

    def test_get_sources_returns_priority_list(self):
        scraper = CFBScraper(season=2025)
        sources = scraper._get_sources()
        assert "espn" in sources

    # ... more tests ...
```
### Checklist
- [ ] Add to `SUPPORTED_SPORTS` in `config.py`
- [ ] Add to `EXPECTED_GAME_COUNTS` in `config.py`
- [ ] Add team mappings to `team_resolver.py`
- [ ] Add stadium mappings to `stadium_resolver.py`
- [ ] Create `scrapers/{sport}.py` with scraper class
- [ ] Export in `scrapers/__init__.py`
- [ ] Register in `cli.py` `get_scraper()`
- [ ] Add aliases to `team_aliases.json`
- [ ] Add aliases to `stadium_aliases.json`
- [ ] Document sources in `SOURCES.md`
- [ ] Create tests in `tests/test_scrapers/`
- [ ] Run `pytest` to verify all tests pass
- [ ] Run dry-run scrape: `sportstime-parser scrape {sport} --season 2025 --dry-run`
## Development
### Running Tests
```bash
# Run all tests
pytest
# Run with coverage
pytest --cov=sportstime_parser --cov-report=html
# Run specific test file
pytest tests/test_scrapers/test_nba.py
# Run with verbose output
pytest -v
```
### Project Structure
```
sportstime_parser/
  __init__.py
  __main__.py              # CLI entry point
  cli.py                   # Subcommand definitions
  config.py                # Constants, defaults
  models/
    game.py                # Game dataclass
    team.py                # Team dataclass
    stadium.py             # Stadium dataclass
    aliases.py             # Alias dataclasses
  scrapers/
    base.py                # BaseScraper abstract class
    nba.py                 # NBA scrapers
    mlb.py                 # MLB scrapers
    nfl.py                 # NFL scrapers
    nhl.py                 # NHL scrapers
    mls.py                 # MLS scrapers
    wnba.py                # WNBA scrapers
    nwsl.py                # NWSL scrapers
  normalizers/
    canonical_id.py        # ID generation
    team_resolver.py       # Team name resolution
    stadium_resolver.py    # Stadium name resolution
    timezone.py            # Timezone conversion
    fuzzy.py               # Fuzzy matching
  validators/
    report.py              # Validation report generator
  uploaders/
    cloudkit.py            # CloudKit Web Services client
    state.py               # Resumable upload state
    diff.py                # Record comparison
  utils/
    http.py                # Rate-limited HTTP client
    logging.py             # Verbose logger
    progress.py            # Progress bars
```
## Troubleshooting
### "No games file found"
Run the scrape command first:
```bash
sportstime-parser scrape nba --season 2025
```
### "CloudKit not configured"
Set the required environment variables:
```bash
export CLOUDKIT_KEY_ID="your_key_id"
export CLOUDKIT_PRIVATE_KEY_PATH="/path/to/key.p8"
```
### Rate limit errors
The scraper includes automatic rate limiting and exponential backoff. If you encounter persistent rate limit errors:
1. Wait a few minutes before retrying
2. Try scraping one sport at a time instead of "all"
3. Check that you're not running multiple instances
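Exponential backoff of this kind is commonly implemented along the following lines; `RateLimitError` and the delay constants here are illustrative, not the tool's actual internals:

```python
import random
import time

class RateLimitError(Exception):
    """Raised when a source responds with HTTP 429 (illustrative)."""

def with_backoff(fn, max_attempts: int = 5, base_delay: float = 1.0):
    """Call fn(), sleeping base_delay * 2**attempt plus jitter between rate-limited tries."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```

The random jitter keeps multiple clients from retrying in lockstep, which is why running several instances at once (point 3 above) tends to make rate limiting worse.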
### Scrape fails with no data
1. Check your internet connection
2. Run with `--verbose` to see detailed error messages
3. The scraper tries multiple sources automatically; if all fail, the source websites may be temporarily unavailable
## License
MIT