feat(scripts): add sportstime-parser data pipeline
Complete Python package for scraping, normalizing, and uploading sports schedule data to CloudKit. Includes:

- Multi-source scrapers for NBA, MLB, NFL, NHL, MLS, WNBA, NWSL
- Canonical ID system for teams, stadiums, and games
- Fuzzy matching with manual alias support
- CloudKit uploader with batch operations and deduplication
- Comprehensive test suite with fixtures
- WNBA abbreviation aliases for improved team resolution
- Alias validation script to detect orphan references

All 5 phases of the data remediation plan completed:

- Phase 1: Alias fixes (team/stadium alias additions)
- Phase 2: NHL stadium coordinate fixes
- Phase 3: Re-scrape validation
- Phase 4: iOS bundle update
- Phase 5: Code quality improvements (WNBA aliases)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
.gitignore (new file, 49 lines, vendored)
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual environments
venv/
ENV/
env/
.venv/

# IDE
.idea/
.vscode/
*.swp
*.swo

# Output and logs
output/
logs/
*.log

# Secrets
*.pem
.env
.env.*

# Parser state
.parser_state/

# Claude Code
.claude/
README.md (new file, 833 lines)
# SportsTime Parser

A Python package for scraping, normalizing, and uploading sports schedule data to CloudKit for the SportsTime iOS app.

## Table of Contents

- [Overview](#overview)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Architecture](#architecture)
- [Directory Structure](#directory-structure)
- [Configuration](#configuration)
- [Data Models](#data-models)
- [Normalizers](#normalizers)
- [Scrapers](#scrapers)
- [Uploaders](#uploaders)
- [Utilities](#utilities)
- [Manual Review Workflow](#manual-review-workflow)
- [Adding a New Sport](#adding-a-new-sport)
- [Troubleshooting](#troubleshooting)
- [Contributing](#contributing)

## Overview

The `sportstime_parser` package provides a complete pipeline for:

1. **Scraping** game schedules from multiple sources (Basketball-Reference, ESPN, MLB API, etc.)
2. **Normalizing** raw data to canonical identifiers (teams, stadiums, games)
3. **Resolving** team/stadium names using exact matching, historical aliases, and fuzzy matching
4. **Uploading** data to CloudKit with diff-based sync and resumable uploads

### Supported Sports

| Sport | Code | Sources | Season Format |
|-------|------|---------|---------------|
| NBA | `nba` | Basketball-Reference, ESPN, CBS | Oct-Jun (split year) |
| MLB | `mlb` | Baseball-Reference, MLB API, ESPN | Mar-Nov (single year) |
| NFL | `nfl` | ESPN, Pro-Football-Reference, CBS | Sep-Feb (split year) |
| NHL | `nhl` | Hockey-Reference, NHL API, ESPN | Oct-Jun (split year) |
| MLS | `mls` | ESPN, FBref | Feb-Nov (single year) |
| WNBA | `wnba` | ESPN | May-Oct (single year) |
| NWSL | `nwsl` | ESPN | Mar-Nov (single year) |

## Installation

```bash
cd Scripts
pip install -r requirements.txt
```

### Dependencies

- `requests` - HTTP requests with session management
- `beautifulsoup4` + `lxml` - HTML parsing
- `rapidfuzz` - Fuzzy string matching
- `pyjwt` + `cryptography` - CloudKit JWT authentication
- `rich` - Terminal UI (progress bars, logging)
- `pytz` / `timezonefinder` - Timezone detection

## Quick Start

### Scrape a Single Sport

```python
from sportstime_parser.scrapers import create_nba_scraper

scraper = create_nba_scraper(season=2025)
result = scraper.scrape_all()

print(f"Games: {result.game_count}")
print(f"Teams: {result.team_count}")
print(f"Stadiums: {result.stadium_count}")
print(f"Needs review: {result.review_count}")
```

### Upload to CloudKit

```python
from sportstime_parser.uploaders import CloudKitClient, RecordDiffer

client = CloudKitClient(environment="development")
differ = RecordDiffer()

# Compare local vs remote
diff = differ.diff_games(local_games, remote_records)

# Upload changes
records = diff.get_records_to_upload()
result = await client.save_records(records)
```

## Architecture

```
┌─────────────────────────────────────────────────────────────────────────┐
│                              DATA SOURCES                               │
│   Basketball-Reference │ ESPN API │ MLB API │ Hockey-Reference │ etc.   │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                                SCRAPERS                                 │
│  NBAScraper │ MLBScraper │ NFLScraper │ NHLScraper │ MLSScraper │ etc.  │
│                                                                         │
│  Features:                                                              │
│  • Multi-source fallback (try sources in priority order)                │
│  • Automatic rate limiting with exponential backoff                     │
│  • Doubleheader detection                                               │
│  • International game filtering (NFL London, NHL Global Series)         │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                               NORMALIZERS                               │
│   TeamResolver │ StadiumResolver │ CanonicalIdGenerator │ AliasLoader   │
│                                                                         │
│  Resolution Strategy (in order):                                        │
│  1. Exact match against canonical mappings                              │
│  2. Date-aware alias lookup (handles renames/relocations)               │
│  3. Fuzzy matching with confidence threshold (85%)                      │
│  4. Flag for manual review if unresolved or low confidence              │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                               DATA MODELS                               │
│                Game │ Team │ Stadium │ ManualReviewItem                 │
│                                                                         │
│  All models use canonical IDs:                                          │
│  • team_nba_lal (Los Angeles Lakers)                                    │
│  • stadium_nba_los_angeles_lakers (Crypto.com Arena)                    │
│  • game_nba_2025_20251022_bos_lal (specific game)                       │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                                UPLOADERS                                │
│              CloudKitClient │ RecordDiffer │ StateManager               │
│                                                                         │
│  Features:                                                              │
│  • JWT authentication with Apple's CloudKit Web Services                │
│  • Batch operations (up to 200 records per request)                     │
│  • Diff-based sync (only upload changes)                                │
│  • Resumable uploads with persistent state                              │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                                CLOUDKIT                                 │
│            Public Database: Games, Teams, Stadiums, Aliases             │
└─────────────────────────────────────────────────────────────────────────┘
```

## Directory Structure

```
Scripts/
├── README.md                # This file
├── requirements.txt         # Python dependencies
├── pyproject.toml           # Package configuration
├── league_structure.json    # League hierarchy (conferences, divisions)
├── team_aliases.json        # Historical team name mappings
├── stadium_aliases.json     # Historical stadium name mappings
├── logs/                    # Runtime logs (auto-created)
├── output/                  # Scrape output files (auto-created)
└── sportstime_parser/       # Main package
    ├── __init__.py
    ├── config.py                  # Configuration constants
    ├── SOURCES.md                 # Data source documentation
    ├── models/                    # Data classes
    │   ├── game.py                # Game model
    │   ├── team.py                # Team model
    │   ├── stadium.py             # Stadium model
    │   └── aliases.py             # Alias and ManualReviewItem models
    ├── normalizers/               # Name resolution
    │   ├── canonical_id.py        # ID generation
    │   ├── alias_loader.py        # Alias loading and resolution
    │   ├── fuzzy.py               # Fuzzy string matching
    │   ├── timezone.py            # Timezone detection
    │   ├── team_resolver.py       # Team name resolution
    │   └── stadium_resolver.py    # Stadium name resolution
    ├── scrapers/                  # Sport-specific scrapers
    │   ├── base.py                # Abstract base scraper
    │   ├── nba.py                 # NBA scraper
    │   ├── mlb.py                 # MLB scraper
    │   ├── nfl.py                 # NFL scraper
    │   ├── nhl.py                 # NHL scraper
    │   ├── mls.py                 # MLS scraper
    │   ├── wnba.py                # WNBA scraper
    │   └── nwsl.py                # NWSL scraper
    ├── uploaders/                 # CloudKit integration
    │   ├── cloudkit.py            # CloudKit Web Services client
    │   ├── diff.py                # Record diffing
    │   └── state.py               # Resumable upload state
    └── utils/                     # Shared utilities
        ├── logging.py             # Rich-based logging
        ├── http.py                # Rate-limited HTTP client
        └── progress.py            # Progress tracking
```
## Configuration

### config.py

Key configuration constants:

```python
# Directories
SCRIPTS_DIR = Path(__file__).parent.parent   # Scripts/
OUTPUT_DIR = SCRIPTS_DIR / "output"          # JSON output
STATE_DIR = SCRIPTS_DIR / ".parser_state"    # Upload state

# CloudKit
CLOUDKIT_CONTAINER = "iCloud.com.sportstime.app"
CLOUDKIT_ENVIRONMENT = "development"  # or "production"

# Rate Limiting
DEFAULT_REQUEST_DELAY = 3.0   # seconds between requests
MAX_RETRIES = 3               # retry attempts
BACKOFF_FACTOR = 2.0          # exponential backoff multiplier
INITIAL_BACKOFF = 5.0         # initial backoff duration (seconds)

# Fuzzy Matching
FUZZY_THRESHOLD = 85  # minimum match confidence (0-100)

# Expected game counts (for validation)
EXPECTED_GAME_COUNTS = {
    "nba": 1230,   # 30 teams × 82 games ÷ 2
    "mlb": 2430,   # 30 teams × 162 games ÷ 2
    "nfl": 272,    # Regular season only
    "nhl": 1312,   # 32 teams × 82 games ÷ 2
    "mls": 544,    # 29 teams × ~34 games ÷ 2
    "wnba": 228,   # 12 teams × 38 games ÷ 2
    "nwsl": 182,   # 14 teams × 26 games ÷ 2
}

# Geography (for filtering international games)
ALLOWED_COUNTRIES = {"USA", "Canada"}
```
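Under these constants, a request that keeps failing waits `INITIAL_BACKOFF` seconds before the first retry and multiplies the wait by `BACKOFF_FACTOR` each time, up to `MAX_RETRIES` attempts. A minimal sketch of the implied schedule (illustrative only; the real client in `utils/http.py` also adds jitter):

```python
# Illustrative sketch of the retry schedule implied by the constants above;
# the actual implementation in utils/http.py may differ (e.g. added jitter).
MAX_RETRIES = 3
BACKOFF_FACTOR = 2.0
INITIAL_BACKOFF = 5.0

def backoff_schedule(max_retries: int = MAX_RETRIES,
                     initial: float = INITIAL_BACKOFF,
                     factor: float = BACKOFF_FACTOR) -> list[float]:
    """Seconds to sleep before each retry attempt."""
    return [initial * factor ** attempt for attempt in range(max_retries)]

print(backoff_schedule())  # → [5.0, 10.0, 20.0]
```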
### league_structure.json

Defines the hierarchical structure of each league:

```json
{
  "nba": {
    "name": "National Basketball Association",
    "conferences": {
      "Eastern": {
        "divisions": {
          "Atlantic": ["BOS", "BKN", "NYK", "PHI", "TOR"],
          "Central": ["CHI", "CLE", "DET", "IND", "MIL"],
          "Southeast": ["ATL", "CHA", "MIA", "ORL", "WAS"]
        }
      },
      "Western": { ... }
    }
  },
  "mlb": { ... },
  ...
}
```

### team_aliases.json / stadium_aliases.json

Historical name mappings with validity dates:

```json
{
  "team_mlb_athletics": [
    {
      "alias": "Oakland Athletics",
      "alias_type": "full_name",
      "valid_from": "1968-01-01",
      "valid_until": "2024-12-31"
    },
    {
      "alias": "Las Vegas Athletics",
      "alias_type": "full_name",
      "valid_from": "2028-01-01",
      "valid_until": null
    }
  ]
}
```
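The validity dates drive date-aware resolution: an alias maps to its canonical ID only when the game date falls inside `[valid_from, valid_until]` (a `null` `valid_until` means open-ended). A minimal sketch of that lookup (hypothetical helper; the real logic lives in `normalizers/alias_loader.py`):

```python
from datetime import date
from typing import Optional

# Hypothetical helper illustrating date-aware alias lookup; the real
# implementation lives in normalizers/alias_loader.py.
def resolve_alias(aliases: dict, name: str, check_date: date) -> Optional[str]:
    """Return the canonical ID whose alias entry covers check_date, else None."""
    for canonical_id, entries in aliases.items():
        for entry in entries:
            if entry["alias"] != name:
                continue
            valid_from = date.fromisoformat(entry["valid_from"])
            valid_until = (date.fromisoformat(entry["valid_until"])
                           if entry["valid_until"] else date.max)  # null → open-ended
            if valid_from <= check_date <= valid_until:
                return canonical_id
    return None

ALIASES = {
    "team_mlb_athletics": [
        {"alias": "Oakland Athletics", "alias_type": "full_name",
         "valid_from": "1968-01-01", "valid_until": "2024-12-31"},
        {"alias": "Las Vegas Athletics", "alias_type": "full_name",
         "valid_from": "2028-01-01", "valid_until": None},
    ]
}

resolve_alias(ALIASES, "Oakland Athletics", date(2024, 6, 1))  # → "team_mlb_athletics"
resolve_alias(ALIASES, "Oakland Athletics", date(2025, 6, 1))  # → None (alias expired)
```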
## Data Models

### Game

```python
@dataclass
class Game:
    id: str                       # Canonical ID: game_{sport}_{season}_{date}_{away}_{home}
    sport: str                    # Sport code (nba, mlb, etc.)
    season: int                   # Season start year
    home_team_id: str             # Canonical team ID
    away_team_id: str             # Canonical team ID
    stadium_id: str               # Canonical stadium ID
    game_date: datetime           # UTC datetime
    game_number: Optional[int]    # 1 or 2 for doubleheaders
    home_score: Optional[int]     # None if not played
    away_score: Optional[int]
    status: str                   # scheduled, final, postponed, cancelled
    source_url: Optional[str]     # For manual review
    raw_home_team: Optional[str]  # Original scraped value
    raw_away_team: Optional[str]
    raw_stadium: Optional[str]
```

### Team

```python
@dataclass
class Team:
    id: str                    # Canonical ID: team_{sport}_{abbrev}
    sport: str
    city: str                  # e.g., "Los Angeles"
    name: str                  # e.g., "Lakers"
    full_name: str             # e.g., "Los Angeles Lakers"
    abbreviation: str          # e.g., "LAL"
    conference: Optional[str]  # e.g., "Western"
    division: Optional[str]    # e.g., "Pacific"
    stadium_id: Optional[str]  # Home stadium
    primary_color: Optional[str]
    secondary_color: Optional[str]
    logo_url: Optional[str]
```

### Stadium

```python
@dataclass
class Stadium:
    id: str                   # Canonical ID: stadium_{sport}_{city_team}
    sport: str
    name: str                 # Current name (e.g., "Crypto.com Arena")
    city: str
    state: Optional[str]
    country: str
    latitude: Optional[float]
    longitude: Optional[float]
    capacity: Optional[int]
    surface: Optional[str]    # grass, turf, ice, hardwood
    roof_type: Optional[str]  # dome, retractable, open
    opened_year: Optional[int]
    image_url: Optional[str]
    timezone: Optional[str]
```

### ManualReviewItem

```python
@dataclass
class ManualReviewItem:
    item_type: str               # "team" or "stadium"
    raw_value: str               # Original scraped value
    suggested_id: Optional[str]  # Best fuzzy match (if any)
    confidence: float            # 0.0 - 1.0
    reason: str                  # Why review is needed
    source_url: Optional[str]    # Where it came from
    sport: str
    check_date: Optional[date]   # For date-aware alias lookup
```

## Normalizers

### Canonical ID Generation

IDs are deterministic and immutable:

```python
# Team ID
generate_team_id("nba", "LAL")
# → "team_nba_lal"

# Stadium ID
generate_stadium_id("nba", "Los Angeles", "Lakers")
# → "stadium_nba_los_angeles_lakers"

# Game ID
generate_game_id(
    sport="nba",
    season=2025,
    away_abbrev="BOS",
    home_abbrev="LAL",
    game_date=datetime(2025, 10, 22),
    game_number=None
)
# → "game_nba_2025_20251022_bos_lal"

# Doubleheader Game ID
generate_game_id(..., game_number=2)
# → "game_nba_2025_20251022_bos_lal_2"
```
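The game ID scheme above amounts to a join of lowercased, fixed-format components. An illustrative re-implementation (the real generator lives in `normalizers/canonical_id.py` and may differ in details):

```python
from datetime import datetime
from typing import Optional

# Illustrative re-implementation of the game ID scheme shown above;
# the real generator lives in normalizers/canonical_id.py.
def make_game_id(sport: str, season: int, away_abbrev: str, home_abbrev: str,
                 game_date: datetime, game_number: Optional[int] = None) -> str:
    parts = [
        "game", sport, str(season),
        game_date.strftime("%Y%m%d"),
        away_abbrev.lower(), home_abbrev.lower(),
    ]
    if game_number:  # doubleheaders get a trailing game number
        parts.append(str(game_number))
    return "_".join(parts)

make_game_id("nba", 2025, "BOS", "LAL", datetime(2025, 10, 22))
# → "game_nba_2025_20251022_bos_lal"
```

Because the inputs are canonical, re-running a scrape reproduces the same IDs, which is what makes the diff-based CloudKit sync possible.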
### Team Resolution

The `TeamResolver` uses a three-stage strategy:

```python
resolver = get_team_resolver("nba")
result = resolver.resolve(
    "Los Angeles Lakers",
    check_date=date(2025, 10, 22),
    source_url="https://..."
)

# Result:
# - canonical_id: "team_nba_lal"
# - confidence: 1.0 (exact match)
# - review_item: None
```

**Resolution stages:**

1. **Exact Match**: Check against canonical team mappings
   - Full name: "Los Angeles Lakers"
   - City + Name: "Los Angeles" + "Lakers"
   - Abbreviation: "LAL"

2. **Alias Lookup**: Check historical aliases with date awareness
   - "Oakland Athletics" → "team_mlb_athletics" (valid until 2024-12-31)
   - Handles relocations: "Oakland" → "Las Vegas" transition

3. **Fuzzy Match**: Use rapidfuzz with 85% threshold
   - "LA Lakers" → "Los Angeles Lakers" (92% match)
   - Low-confidence matches flagged for review
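The pipeline uses `rapidfuzz` for stage 3; purely as an illustration of thresholded matching, the standard library's `difflib` shows the same idea (note that its similarity scores are not the same as rapidfuzz's, so the cutoffs below are for demonstration only):

```python
import difflib

CANDIDATES = ["Los Angeles Lakers", "Los Angeles Clippers"]

# A variant name matches once the score clears a lenient cutoff...
difflib.get_close_matches("LA Lakers", CANDIDATES, n=1, cutoff=0.6)
# → ["Los Angeles Lakers"]

# ...but a strict cutoff rejects it, which is the situation where the
# pipeline emits a ManualReviewItem instead of auto-resolving.
difflib.get_close_matches("LA Lakers", CANDIDATES, n=1, cutoff=0.85)
# → []
```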
### Stadium Resolution

Similar three-stage strategy with additional location awareness:

```python
resolver = get_stadium_resolver("nba")
result = resolver.resolve(
    "Crypto.com Arena",
    check_date=date(2025, 10, 22)
)
```

**Key features:**
- Handles naming rights changes (Staples Center → Crypto.com Arena)
- Date-aware: "Staples Center" resolves correctly for historical games
- Location-based fallback using latitude/longitude

## Scrapers

### Base Scraper

All scrapers extend `BaseScraper` with these features:

```python
class BaseScraper(ABC):
    def __init__(self, sport: str, season: int): ...

    # Required implementations
    def _get_sources(self) -> list[str]: ...
    def _scrape_games_from_source(self, source: str) -> list[RawGameData]: ...
    def _normalize_games(self, raw_games) -> tuple[list[Game], list[ManualReviewItem]]: ...
    def scrape_teams(self) -> list[Team]: ...
    def scrape_stadiums(self) -> list[Stadium]: ...

    # Built-in features
    def scrape_games(self) -> ScrapeResult:
        """Multi-source fallback - tries each source in order."""
        ...

    def scrape_all(self) -> ScrapeResult:
        """Scrapes games, teams, and stadiums with progress tracking."""
        ...
```

### NBA Scraper

```python
class NBAScraper(BaseScraper):
    """
    Sources (in priority order):
    1. Basketball-Reference - HTML tables, monthly pages
    2. ESPN API - JSON, per-date queries
    3. CBS Sports - Backup (not implemented)

    Season: October to June (split year, e.g., 2025-26)
    """
```

**Basketball-Reference parsing:**
- URL: `https://www.basketball-reference.com/leagues/NBA_{year}_games-{month}.html`
- Table columns: date_game, visitor_team_name, home_team_name, visitor_pts, home_pts, arena_name

### MLB Scraper

```python
class MLBScraper(BaseScraper):
    """
    Sources:
    1. Baseball-Reference - Single page per season
    2. MLB Stats API - Official API with date range queries
    3. ESPN API - Backup

    Season: March to November (single year)
    Handles: Doubleheaders with game_number
    """
```

### NFL Scraper

```python
class NFLScraper(BaseScraper):
    """
    Sources:
    1. ESPN API - Week-based queries
    2. Pro-Football-Reference - Single page per season

    Season: September to February (split year)
    Filters: International games (London, Mexico City, Frankfurt)
    Scrapes: Preseason (4 weeks), Regular (18 weeks), Postseason (4 rounds)
    """
```

### NHL Scraper

```python
class NHLScraper(BaseScraper):
    """
    Sources:
    1. Hockey-Reference - Single page per season
    2. NHL API - New API (api-web.nhle.com)
    3. ESPN API - Backup

    Season: October to June (split year)
    Filters: International games (Prague, Stockholm, Helsinki)
    """
```

### MLS / WNBA / NWSL Scrapers

All use the ESPN API as the primary source with a similar structure:
- Single calendar-year seasons
- Conference-based organization (MLS) or single table (WNBA, NWSL)

## Uploaders

### CloudKit Client

```python
class CloudKitClient:
    """CloudKit Web Services API client with JWT authentication."""

    def __init__(
        self,
        container_id: str = CLOUDKIT_CONTAINER,
        environment: str = "development",  # or "production"
        key_id: str = None,                # From CloudKit Dashboard
        private_key: str = None,           # EC P-256 private key
    ): ...

    async def fetch_records(
        self,
        record_type: RecordType,
        filter_by: Optional[dict] = None,
        sort_by: Optional[str] = None,
    ) -> list[dict]: ...

    async def save_records(
        self,
        records: list[CloudKitRecord],
        batch_size: int = 200,
    ) -> BatchResult: ...

    async def delete_records(
        self,
        record_names: list[str],
        record_type: RecordType,
    ) -> BatchResult: ...
```
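CloudKit Web Services caps how many records one request may carry, which is why `save_records` takes a `batch_size` of up to 200. The chunking itself is simple; a sketch (illustrative helper, not the package's actual code):

```python
from typing import Iterator

# Illustrative chunking helper: uploads are split into batches of at most
# batch_size records before being sent to CloudKit.
def chunked(items: list, batch_size: int = 200) -> Iterator[list]:
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

batches = list(chunked(list(range(450))))
[len(b) for b in batches]  # → [200, 200, 50]
```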
**Authentication:**
- Uses an EC P-256 key pair
- JWT tokens signed with the private key
- Tokens valid for 30 minutes

### Record Differ

```python
class RecordDiffer:
    """Compares local records with CloudKit records."""

    def diff_games(self, local: list[Game], remote: list[dict]) -> DiffResult: ...
    def diff_teams(self, local: list[Team], remote: list[dict]) -> DiffResult: ...
    def diff_stadiums(self, local: list[Stadium], remote: list[dict]) -> DiffResult: ...
```

**DiffResult:**

```python
@dataclass
class DiffResult:
    creates: list[RecordDiff]    # New records to create
    updates: list[RecordDiff]    # Changed records to update
    deletes: list[RecordDiff]    # Remote records to delete
    unchanged: list[RecordDiff]  # Records with no changes

    def get_records_to_upload(self) -> list[CloudKitRecord]:
        """Returns creates + updates ready for upload."""
```

### State Manager

```python
class StateManager:
    """Manages resumable upload state."""

    def load_session(self, sport, season, environment) -> Optional[UploadSession]: ...
    def save_session(self, session: UploadSession) -> None: ...
    def get_session_or_create(
        self,
        sport, season, environment,
        record_names: list[tuple[str, str]],
        resume: bool = False,
    ) -> UploadSession: ...
```

**State persistence:**
- Stored in `.parser_state/upload_state_{sport}_{season}_{env}.json`
- Tracks: pending, uploaded, failed records
- Supports retry with backoff
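The state-file location follows the pattern above. A sketch of the path construction (hypothetical helper name, mirroring the documented naming scheme):

```python
from pathlib import Path

STATE_DIR = Path(".parser_state")

# Hypothetical helper mirroring the documented state-file naming scheme.
def state_path(sport: str, season: int, environment: str) -> Path:
    return STATE_DIR / f"upload_state_{sport}_{season}_{environment}.json"

state_path("nba", 2025, "development")
# → .parser_state/upload_state_nba_2025_development.json
```

Keying the file on sport, season, and environment means a development upload can be resumed without touching production state.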
## Utilities

### HTTP Client

```python
class RateLimitedSession:
    """HTTP session with rate limiting and exponential backoff."""

    def __init__(
        self,
        delay: float = 3.0,  # Seconds between requests
        max_retries: int = 3,
        backoff_factor: float = 2.0,
    ): ...

    def get(self, url, **kwargs) -> Response: ...
    def get_json(self, url, **kwargs) -> dict: ...
    def get_html(self, url, **kwargs) -> str: ...
```

**Features:**
- User-agent rotation (5 different Chrome/Firefox/Safari agents)
- Per-domain rate limiting
- Automatic 429 handling with exponential backoff + jitter
- Connection pooling
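Per-domain rate limiting boils down to remembering the last request time for each host and sleeping off the remainder of the delay. A sketch of that bookkeeping (illustrative; the real `RateLimitedSession` adds retries, backoff, and user-agent rotation on top):

```python
from urllib.parse import urlparse

# Illustrative per-domain throttle; the real RateLimitedSession in
# utils/http.py layers retries and backoff on top of this idea.
class DomainThrottle:
    def __init__(self, delay: float = 3.0):
        self.delay = delay
        self._last: dict[str, float] = {}

    def seconds_to_wait(self, url: str, now: float) -> float:
        """How long to sleep before requesting url at time `now` (epoch seconds)."""
        host = urlparse(url).netloc
        last = self._last.get(host)
        self._last[host] = now
        if last is None:
            return 0.0
        return max(0.0, self.delay - (now - last))

throttle = DomainThrottle(delay=3.0)
throttle.seconds_to_wait("https://www.basketball-reference.com/a", now=100.0)  # → 0.0
throttle.seconds_to_wait("https://www.basketball-reference.com/b", now=101.0)  # → 2.0
throttle.seconds_to_wait("https://site.api.espn.com/x", now=101.0)             # → 0.0
```

Tracking each host separately lets a scraper hit Basketball-Reference and the ESPN API in the same run without one source's delay stalling the other.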
### Logging

```python
from sportstime_parser.utils import get_logger, log_success, log_error

logger = get_logger()  # Rich-formatted logger
logger.info("Starting scrape")

log_success("Scraped 1230 games")  # Green checkmark
log_error("Failed to parse")       # Red X
```

**Log output:**
- Console: Rich-formatted with colors
- File: `logs/parser_{timestamp}.log`

### Progress Tracking

```python
from sportstime_parser.utils import ScrapeProgress, track_progress

# Specialized scrape tracking
progress = ScrapeProgress("nba", 2025)
progress.start()

with progress.scraping_schedule(total_months=9) as advance:
    for month in months:
        fetch(month)
        advance()

progress.finish()  # Prints summary

# Generic progress bar
for game in track_progress(games, "Processing games"):
    process(game)
```

## Manual Review Workflow

When the system can't confidently resolve a team or stadium:

1. **Low-confidence fuzzy match** (< 85%):

   ```
   ManualReviewItem(
       item_type="team",
       raw_value="LA Lakers",
       suggested_id="team_nba_lal",
       confidence=0.82,
       reason="Fuzzy match below threshold"
   )
   ```

2. **No match found**:

   ```
   ManualReviewItem(
       raw_value="Unknown Team FC",
       suggested_id=None,
       confidence=0.0,
       reason="No match found in canonical mappings"
   )
   ```

3. **Ambiguous match** (multiple candidates):

   ```
   ManualReviewItem(
       raw_value="LA",
       suggested_id="team_nba_lac",
       confidence=0.5,
       reason="Ambiguous: could be Lakers or Clippers"
   )
   ```

**Resolution:**
- Review items are exported to JSON
- Manually verify each item and add it to `team_aliases.json` or `stadium_aliases.json`
- Re-run the scrape; the aliases will now be used for resolution
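Since `ManualReviewItem` is a dataclass, the JSON export step is essentially `dataclasses.asdict` plus `json.dumps`. A sketch using a trimmed-down stand-in for the model:

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

# Trimmed-down stand-in for ManualReviewItem, used only to illustrate
# the JSON export step of the review workflow.
@dataclass
class ManualReviewItem:
    item_type: str
    raw_value: str
    suggested_id: Optional[str]
    confidence: float
    reason: str

items = [ManualReviewItem("team", "LA Lakers", "team_nba_lal", 0.82,
                          "Fuzzy match below threshold")]
payload = json.dumps([asdict(i) for i in items], indent=2)
print(payload)  # each pending item, ready for manual verification
```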
## Adding a New Sport

1. **Create scraper** in `scrapers/{sport}.py`:

   ```python
   class NewSportScraper(BaseScraper):
       def __init__(self, season: int, **kwargs):
           super().__init__("newsport", season, **kwargs)
           self._team_resolver = get_team_resolver("newsport")
           self._stadium_resolver = get_stadium_resolver("newsport")

       def _get_sources(self) -> list[str]:
           return ["primary_source", "backup_source"]

       def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
           # Implement source-specific scraping
           ...

       def _normalize_games(self, raw_games) -> tuple[list[Game], list[ManualReviewItem]]:
           # Use resolvers to normalize
           ...

       def scrape_teams(self) -> list[Team]:
           # Return canonical team list
           ...

       def scrape_stadiums(self) -> list[Stadium]:
           # Return canonical stadium list
           ...
   ```

2. **Add team mappings** in `normalizers/team_resolver.py`:

   ```python
   TEAM_MAPPINGS["newsport"] = {
       "ABC": ("team_newsport_abc", "Full Team Name", "City"),
       ...
   }
   ```

3. **Add stadium mappings** in `normalizers/stadium_resolver.py`:

   ```python
   STADIUM_MAPPINGS["newsport"] = {
       "stadium_newsport_venue": StadiumInfo(
           name="Venue Name",
           city="City",
           state="State",
           country="USA",
           latitude=40.0,
           longitude=-74.0,
       ),
       ...
   }
   ```

4. **Add to `league_structure.json`** (if hierarchical)

5. **Update `config.py`**:

   ```python
   EXPECTED_GAME_COUNTS["newsport"] = 500
   ```

6. **Export from `__init__.py`**

## Troubleshooting

### Rate Limiting (429 errors)

The system handles these automatically with exponential backoff. If the errors persist:
- Increase `DEFAULT_REQUEST_DELAY` in `config.py`
- Check whether the source has changed its rate limits

### Missing Teams/Stadiums

1. Check the scraper logs for the raw values
2. Add them to `team_aliases.json` or `stadium_aliases.json`
3. Or add them to the canonical mappings if it's a new team/stadium

### CloudKit Authentication Errors

1. Verify `key_id` matches the CloudKit Dashboard
2. Check the private key format (EC P-256, PEM)
3. Ensure the container identifier is correct

### Incomplete Scrapes

The system discards partial data on errors. Check:
- `logs/` for error details
- Network connectivity
- Source website availability

### International Games Appearing

NFL and NHL scrapers filter these automatically. If new locations emerge:
- Add them to `INTERNATIONAL_LOCATIONS` in the scraper
- Or add filtering logic for neutral-site games

## Contributing

1. Follow existing patterns for new scrapers
2. Always use canonical IDs
3. Add aliases for historical names
4. Include source URLs for traceability
5. Test with multiple seasons
docs/DATA_AUDIT.md (new file, 805 lines)
# SportsTime Data Audit Report

**Generated:** 2026-01-20
**Scope:** NBA, MLB, NFL, NHL, MLS, WNBA, NWSL
**Data Pipeline:** Scripts → CloudKit → iOS App

---

## Executive Summary

The data audit identified **15 issues** across the SportsTime data pipeline, with significant gaps in source reliability, stadium resolution, and iOS data freshness.

| Severity | Count | Description |
|----------|-------|-------------|
| **Critical** | 1 | iOS bundled data severely outdated |
| **High** | 4 | Single-source sports, NHL stadium data, NBA naming rights |
| **Medium** | 6 | Alias gaps, outdated config, silent game exclusion |
| **Low** | 4 | Minor configuration and coverage issues |

### Key Findings

**Data Pipeline Health:**
- ✅ **Canonical ID system**: 100% format compliance across 7,186 IDs
- ✅ **Team mappings**: All 183 teams correctly mapped with current abbreviations
- ✅ **Referential integrity**: Zero orphan references (0 games pointing to non-existent teams/stadiums)
- ⚠️ **Stadium resolution**: 1,466 games (21.6%) have unresolved stadiums
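A format-compliance pass like this can be reproduced with a regex per ID family. The patterns below are illustrative, derived from the examples in this repository rather than from the actual validation script:

```python
import re

# Illustrative patterns for the three canonical ID families; derived from
# the examples in this audit, not taken from the validation script itself.
ID_PATTERNS = [
    re.compile(r"^team_[a-z]+_[a-z0-9]+$"),
    re.compile(r"^stadium_[a-z]+_[a-z0-9_]+$"),
    re.compile(r"^game_[a-z]+_\d{4}_\d{8}_[a-z0-9]+_[a-z0-9]+(_\d+)?$"),
]

def is_canonical(record_id: str) -> bool:
    return any(p.match(record_id) for p in ID_PATTERNS)

is_canonical("team_nba_lal")                      # → True
is_canonical("game_nba_2025_20251022_bos_lal_2")  # → True
is_canonical("Los Angeles Lakers")                # → False
```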
**Critical Risks:**
|
||||
1. **ESPN single-point-of-failure** for WNBA, NWSL, MLS - if ESPN changes, 3 sports lose all data
|
||||
2. **NHL has 100% missing stadiums** - Hockey Reference provides no venue data
|
||||
3. **iOS bundled data 27% behind** - 1,820 games missing from first-launch experience
|
||||
|
||||
**Root Causes:**
|
||||
- Stadium naming rights changed faster than alias updates (2024-2025)
|
||||
- Fallback source limit (`max_sources_to_try = 2`) prevents third source from being tried
|
||||
- Hockey Reference source limitation (no venue info) combined with fallback limit
|
||||
- iOS bundled JSON not updated with latest pipeline output
|
||||
|
||||
---
|
||||
|
||||
## Phase Status Tracking
|
||||
|
||||
| Phase | Status | Issues Found |
|
||||
|-------|--------|--------------|
|
||||
| 1. Hardcoded Mapping Audit | ✅ COMPLETE | 1 Low |
|
||||
| 2. Alias File Completeness | ✅ COMPLETE | 1 Medium, 1 Low |
|
||||
| 3. Scraper Source Reliability | ✅ COMPLETE | 2 High, 1 Medium |
|
||||
| 4. Game Count & Coverage | ✅ COMPLETE | 2 High, 2 Medium, 1 Low |
|
||||
| 5. Canonical ID Consistency | ✅ COMPLETE | 0 issues |
|
||||
| 6. Referential Integrity | ✅ COMPLETE | 1 Medium (NHL source) |
|
||||
| 7. iOS Data Reception | ✅ COMPLETE | 1 Critical, 1 Medium, 1 Low |
|
||||
|
||||
---
|
||||
|
||||
## Phase 1 Results: Hardcoded Mapping Audit

**Files Audited:**
- `sportstime_parser/normalizers/team_resolver.py` (TEAM_MAPPINGS)
- `sportstime_parser/normalizers/stadium_resolver.py` (STADIUM_MAPPINGS)

### Team Counts

| Sport | Hardcoded | Expected | Abbreviations | Status |
|-------|-----------|----------|---------------|--------|
| NBA | 30 | 30 | 38 | ✅ |
| MLB | 30 | 30 | 38 | ✅ |
| NFL | 32 | 32 | 40 | ✅ |
| NHL | 32 | 32 | 41 | ✅ |
| MLS | 30 | 30* | 32 | ✅ |
| WNBA | 13 | 13 | 13 | ✅ |
| NWSL | 16 | 16 | 24 | ✅ |

*MLS: 29 original teams + San Diego FC (2025 expansion) = 30

### Stadium Counts

| Sport | Hardcoded | Notes | Status |
|-------|-----------|-------|--------|
| NBA | 30 | 1 per team | ✅ |
| MLB | 57 | 30 regular + 18 spring training + 9 special venues | ✅ |
| NFL | 30 | Includes shared venues (SoFi Stadium: LAR+LAC, MetLife: NYG+NYJ) | ✅ |
| NHL | 32 | 1 per team | ✅ |
| MLS | 30 | 1 per team | ✅ |
| WNBA | 13 | 1 per team | ✅ |
| NWSL | 19 | 14 current + 5 expansion team venues (Boston/Denver) | ✅ |

### Recent Updates Verification

| Update | Type | Status | Notes |
|--------|------|--------|-------|
| Utah Hockey Club (NHL) | Relocation | ✅ Present | ARI + UTA abbreviations both map to `team_nhl_ari` |
| Golden State Valkyries (WNBA) | Expansion 2025 | ✅ Present | `team_wnba_gsv` with Chase Center venue |
| Boston Legacy FC (NWSL) | Expansion 2026 | ✅ Present | `team_nwsl_bos` with Gillette Stadium |
| Denver Summit FC (NWSL) | Expansion 2026 | ✅ Present | `team_nwsl_den` with Dick's Sporting Goods Park |
| Oakland A's → Sacramento | Temporary relocation | ✅ Present | `stadium_mlb_sutter_health_park` |
| San Diego FC (MLS) | Expansion 2025 | ✅ Present | `team_mls_sd` with Snapdragon Stadium |
| FedExField → Northwest Stadium | Naming rights | ✅ Present | `stadium_nfl_northwest_stadium` |
### NFL Stadium Sharing

| Stadium | Teams | Status |
|---------|-------|--------|
| SoFi Stadium | LAR, LAC | ✅ Correct |
| MetLife Stadium | NYG, NYJ | ✅ Correct |

### Issues Found

| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 1 | WNBA single abbreviations | Low | All 13 WNBA teams have only 1 abbreviation each. May need additional abbreviations for source compatibility. |

### Phase 1 Summary

**Result: PASS** - All team and stadium mappings are complete and up to date with 2025-2026 changes.

- ✅ All 7 sports have correct team counts
- ✅ All stadium counts are appropriate (including spring training and special venues)
- ✅ Recent franchise moves and expansions are reflected
- ✅ Stadium sharing is correctly handled
- ✅ Naming rights updates are current

---
## Phase 2 Results: Alias File Completeness

**Files Audited:**
- `Scripts/team_aliases.json`
- `Scripts/stadium_aliases.json`

### Team Aliases Summary

| Sport | Entries | Coverage | Status |
|-------|---------|----------|--------|
| MLB | 23 | Historical relocations/renames | ✅ |
| NBA | 29 | Historical relocations/renames | ✅ |
| NHL | 24 | Historical relocations/renames | ✅ |
| NFL | 0 | **No aliases** | ⚠️ |
| MLS | 0 | No aliases (newer league) | ✅ |
| WNBA | 0 | No aliases (newer league) | ✅ |
| NWSL | 0 | No aliases (newer league) | ✅ |
| **Total** | **76** | | |

- All 76 entries have valid date ranges
- No orphan references (all canonical IDs exist in mappings)

### Stadium Aliases Summary

| Sport | Entries | Coverage | Status |
|-------|---------|----------|--------|
| MLB | 109 | Regular + spring training + special venues | ✅ |
| NFL | 65 | Naming rights history | ✅ |
| NBA | 44 | Naming rights history | ✅ |
| NHL | 39 | Naming rights history | ✅ |
| MLS | 35 | Current + naming variants | ✅ |
| WNBA | 15 | Current + naming variants | ✅ |
| NWSL | 14 | Current + naming variants | ✅ |
| **Total** | **321** | | |

- 65 entries have date ranges (historical naming rights)
- 256 entries are permanent aliases (no date restrictions)

### Orphan Reference Check

| Type | Count | Status |
|------|-------|--------|
| Team aliases with invalid references | 0 | ✅ |
| Stadium aliases with invalid references | **5** | ❌ |

**Orphan Stadium References Found:**

| Alias Name | References (Invalid) | Correct ID |
|------------|---------------------|------------|
| Broncos Stadium at Mile High | `stadium_nfl_empower_field_at_mile_high` | `stadium_nfl_empower_field` |
| Sports Authority Field at Mile High | `stadium_nfl_empower_field_at_mile_high` | `stadium_nfl_empower_field` |
| Invesco Field at Mile High | `stadium_nfl_empower_field_at_mile_high` | `stadium_nfl_empower_field` |
| Mile High Stadium | `stadium_nfl_empower_field_at_mile_high` | `stadium_nfl_empower_field` |
| Arrowhead Stadium | `stadium_nfl_geha_field_at_arrowhead_stadium` | `stadium_nfl_arrowhead_stadium` |
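The orphan check above takes only a few lines of Python. The flat `{alias_name: canonical_id}` shape used here is a simplification of the real alias-file schema (which also carries date ranges), so treat this as a sketch rather than the actual validation script:

```python
def find_orphan_aliases(aliases, valid_ids):
    """Return {alias_name: canonical_id} for aliases whose target does not exist."""
    return {name: cid for name, cid in aliases.items() if cid not in valid_ids}

# Simplified data mirroring the Denver/KC findings above
stadium_aliases = {
    "Mile High Stadium": "stadium_nfl_empower_field_at_mile_high",
    "Arrowhead Stadium": "stadium_nfl_geha_field_at_arrowhead_stadium",
    "Northwest Stadium": "stadium_nfl_northwest_stadium",
}
known_ids = {
    "stadium_nfl_empower_field",
    "stadium_nfl_arrowhead_stadium",
    "stadium_nfl_northwest_stadium",
}

orphans = find_orphan_aliases(stadium_aliases, known_ids)
assert sorted(orphans) == ["Arrowhead Stadium", "Mile High Stadium"]
```

Running this kind of check in CI would have caught the 5 orphan entries before they caused resolution failures.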
### Historical Changes Coverage

| Historical Name | Current Team | In Aliases? |
|-----------------|--------------|-------------|
| Montreal Expos | Washington Nationals | ✅ |
| Seattle SuperSonics | Oklahoma City Thunder | ✅ |
| Arizona Coyotes | Utah Hockey Club | ✅ |
| Cleveland Indians | Cleveland Guardians | ✅ |
| Hartford Whalers | Carolina Hurricanes | ✅ |
| Quebec Nordiques | Colorado Avalanche | ✅ |
| Vancouver Grizzlies | Memphis Grizzlies | ✅ |
| Washington Redskins | Washington Commanders | ❌ Missing |
| Washington Football Team | Washington Commanders | ❌ Missing |
| Brooklyn Dodgers | Los Angeles Dodgers | ❌ Missing |

### Issues Found

| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 2 | Orphan stadium alias references | Medium | 5 stadium aliases point to non-existent canonical IDs (`stadium_nfl_empower_field_at_mile_high`, `stadium_nfl_geha_field_at_arrowhead_stadium`). Causes resolution failures for historical Denver/KC stadium names. |
| 3 | No NFL team aliases | Low | Missing Washington Redskins/Football Team historical names. Limits historical game matching for NFL. |

### Phase 2 Summary

**Result: PASS with issues** - Alias files cover most historical changes but have referential integrity bugs.

- ✅ Team aliases cover MLB/NBA/NHL historical changes
- ✅ Stadium aliases cover naming rights changes across all sports
- ✅ No date range validation errors
- ❌ 5 orphan stadium references need fixing
- ⚠️ No NFL team aliases (Washington Redskins/Football Team missing)

---
## Phase 3 Results: Scraper Source Reliability

**Files Audited:**
- `sportstime_parser/scrapers/base.py` (fallback logic)
- `sportstime_parser/scrapers/nba.py`, `mlb.py`, `nfl.py`, `nhl.py`, `mls.py`, `wnba.py`, `nwsl.py`

### Source Dependency Matrix

| Sport | Primary | Status | Fallback 1 | Status | Fallback 2 | Status | Risk |
|-------|---------|--------|------------|--------|------------|--------|------|
| NBA | basketball_reference | ✅ | espn | ✅ | cbs | ❌ NOT IMPL | Medium |
| MLB | mlb_api | ✅ | espn | ✅ | baseball_reference | ✅ | Low |
| NFL | espn | ✅ | pro_football_reference | ✅ | cbs | ❌ NOT IMPL | Medium |
| NHL | hockey_reference | ✅ | nhl_api | ✅ | espn | ✅ | Low |
| MLS | espn | ✅ | fbref | ❌ NOT IMPL | - | - | **HIGH** |
| WNBA | espn | ✅ | - | - | - | - | **HIGH** |
| NWSL | espn | ✅ | - | - | - | - | **HIGH** |

### Unimplemented Sources

| Sport | Source | Line | Status |
|-------|--------|------|--------|
| NBA | cbs | `nba.py:421` | `raise NotImplementedError("CBS scraper not implemented")` |
| NFL | cbs | `nfl.py:386` | `raise NotImplementedError("CBS scraper not implemented")` |
| MLS | fbref | `mls.py:214` | `raise NotImplementedError("FBref scraper not implemented")` |

### Fallback Logic Analysis

**File:** `base.py:189`

```python
max_sources_to_try = 2  # Don't try all sources if first few return nothing
```

**Impact:**
- Even if 3 sources are declared, only 2 are tried
- If sources 1 and 2 fail, source 3 is never attempted
- This limits resilience for NBA, MLB, NFL, and NHL, which each declare 3 sources
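How the cap truncates the declared source list can be sketched as follows; the scraper interface here is a simplified assumption, not the actual `base.py` API:

```python
def scrape_with_fallback(sources, fetchers, max_sources_to_try=2):
    """Try declared sources in order; a low cap silently hides later fallbacks."""
    for name in sources[:max_sources_to_try]:  # the cap: source 3 is sliced away
        games = fetchers[name]()
        if games:
            return name, games
    return None, []

# Hypothetical fetchers: primary and first fallback return nothing, third would succeed
fetchers = {
    "hockey_reference": lambda: [],
    "nhl_api": lambda: [],
    "espn": lambda: ["game"],
}
sources = ["hockey_reference", "nhl_api", "espn"]

assert scrape_with_fallback(sources, fetchers) == (None, [])  # espn never tried
assert scrape_with_fallback(sources, fetchers, max_sources_to_try=3) == ("espn", ["game"])
```

With the default cap of 2, the working third source contributes nothing, which is exactly the resilience gap described above.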
### International Game Filtering

| Sport | Hardcoded Locations | Notes |
|-------|---------------------|-------|
| NFL | London, Mexico City, Frankfurt, Munich, São Paulo | ✅ Complete for 2025 |
| NHL | Prague, Stockholm, Helsinki, Tampere, Gothenburg | ✅ Complete for 2025 |
| NBA | None | ⚠️ No international filtering (Abu Dhabi games?) |
| MLB | None | ⚠️ No international filtering (Mexico City games?) |
| MLS | None | N/A (domestic only) |
| WNBA | None | N/A (domestic only) |
| NWSL | None | N/A (domestic only) |

### Single Point of Failure Risk

| Sport | Primary Source | If ESPN Fails... | Risk Level |
|-------|----------------|------------------|------------|
| WNBA | ESPN only | **Complete data loss** | Critical |
| NWSL | ESPN only | **Complete data loss** | Critical |
| MLS | ESPN only (fbref not impl) | **Complete data loss** | Critical |
| NBA | Basketball-Ref → ESPN | ESPN fallback available | Low |
| NFL | ESPN → Pro-Football-Ref | Fallback available | Low |
| NHL | Hockey-Ref → NHL API → ESPN | Two fallbacks | Very Low |
| MLB | MLB API → ESPN → B-Ref | Two fallbacks | Very Low |

### Issues Found

| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 4 | WNBA/NWSL/MLS single source | High | ESPN is the only working source for 3 sports. If ESPN changes or fails, data collection completely stops. |
| 5 | max_sources_to_try = 2 | High | Third fallback source is never tried even when available. Reduces resilience for NBA/MLB/NFL/NHL. |
| 6 | CBS/FBref not implemented | Medium | Declared fallback sources raise NotImplementedError. They appear functional in config but fail at runtime. |

### Phase 3 Summary

**Result: FAIL** - Critical single point of failure for 3 sports.

- ❌ WNBA, NWSL, and MLS have only ESPN (no resilience)
- ❌ Fallback limit of 2 prevents a third source from being tried
- ⚠️ CBS and FBref are declared but not implemented
- ✅ MLB and NHL have full fallback chains
- ✅ International game filtering is present for NFL/NHL

---
## Phase 4 Results: Game Count & Coverage

**Files Audited:**
- `Scripts/output/games_*.json` (all 2025 season files)
- `Scripts/output/validation_*.md` (all validation reports)
- `sportstime_parser/config.py` (EXPECTED_GAME_COUNTS)

### Coverage Summary

| Sport | Scraped | Expected | Coverage | Status |
|-------|---------|----------|----------|--------|
| NBA | 1,231 | 1,230 | 100.1% | ✅ |
| MLB | 2,866 | 2,430 | 117.9% | ⚠️ Includes spring training |
| NFL | 330 | 272 | 121.3% | ⚠️ Includes preseason/playoffs |
| NHL | 1,312 | 1,312 | 100.0% | ✅ |
| MLS | 542 | 493 | 109.9% | ✅ Includes playoffs |
| WNBA | 322 | 220 | **146.4%** | ⚠️ Expected count outdated |
| NWSL | 189 | 182 | 103.8% | ✅ |

### Date Range Analysis

| Sport | Start Date | End Date | Notes |
|-------|------------|----------|-------|
| NBA | 2025-10-21 | 2026-04-12 | Regular season only |
| MLB | 2025-03-01 | 2025-11-02 | Includes spring training (417 games in March) |
| NFL | 2025-08-01 | 2026-01-25 | Includes preseason (49 in Aug) + playoffs (28 in Jan) |
| NHL | 2025-10-07 | 2026-04-16 | Regular season only |
| MLS | 2025-02-22 | 2025-11-30 | Regular season + playoffs |
| WNBA | 2025-05-02 | 2025-10-11 | Regular season + playoffs |
| NWSL | 2025-03-15 | 2025-11-23 | Regular season + playoffs |

### Game Status Distribution

All games across all sports have status `unknown`; game status is not being parsed from the sources.

### Duplicate Game Detection

| Sport | Duplicates Found | Details |
|-------|-----------------|---------|
| NBA | 0 | ✅ |
| MLB | 1 | `game_mlb_2025_20250508_det_col_1` appears twice (doubleheader handling issue) |
| NFL | 0 | ✅ |
| NHL | 0 | ✅ |
| MLS | 0 | ✅ |
| WNBA | 0 | ✅ |
| NWSL | 0 | ✅ |
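The doubleheader duplicate above is the kind of defect a simple ID-count pass catches. A minimal sketch (the ID values come from the table; everything else is illustrative):

```python
from collections import Counter

def find_duplicate_ids(game_ids):
    """Return canonical game IDs that appear more than once, sorted."""
    return sorted(cid for cid, n in Counter(game_ids).items() if n > 1)

# Hypothetical doubleheader slate where the same game ID was emitted twice
ids = [
    "game_mlb_2025_20250508_det_col_1",
    "game_mlb_2025_20250508_det_col_1",
    "game_mlb_2025_20250508_det_col_2",
]
assert find_duplicate_ids(ids) == ["game_mlb_2025_20250508_det_col_1"]
```

A correct doubleheader handler would have emitted `_1` and `_2` suffixes once each, so any repeated ID signals a suffix-assignment bug.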
### Validation Report Analysis

| Sport | Total Games | Unresolved Teams | Unresolved Stadiums | Manual Review Items |
|-------|-------------|------------------|---------------------|---------------------|
| NBA | 1,231 | 0 | **131** | 131 |
| MLB | 2,866 | 12 | 4 | 20 |
| NFL | 330 | 1 | 5 | 11 |
| NHL | 1,312 | 0 | 0 | **1,312** (all missing stadiums) |
| MLS | 542 | 1 | **64** | 129 |
| WNBA | 322 | 5 | **65** | 135 |
| NWSL | 189 | 0 | **16** | 32 |

### Top Unresolved Stadium Names (Recent Naming Rights)

| Stadium Name | Occurrences | Actual Venue | Issue |
|--------------|-------------|--------------|-------|
| Sports Illustrated Stadium | 11 | Red Bull Arena (Harrison, NJ) | 2025 renaming, missing alias |
| Mortgage Matchup Center | 8 | Phoenix Suns/Mercury arena (formerly Footprint Center) | 2025 naming rights change |
| ScottsMiracle-Gro Field | 4 | MLS Columbus Crew | Missing alias |
| Energizer Park | 3 | St. Louis CITY SC (MLS) | Missing alias |
| Xfinity Mobile Arena | 3 | Wells Fargo Center (PHI) | 2025 naming rights change |
| Rocket Arena | 3 | Rocket Mortgage FieldHouse (CLE) | 2025 naming rights change |
| CareFirst Arena | 2 | Washington Mystics venue | New WNBA venue name |

### Unresolved Teams (Exhibition/International)

| Team Name | Sport | Type | Games |
|-----------|-------|------|-------|
| BRAZIL | WNBA | International exhibition | 2 |
| Toyota Antelopes | WNBA | Japanese team | 2 |
| TEAM CLARK | WNBA | All-Star Game | 1 |
| (Various MLB) | MLB | International teams | 12 |
| (MLS international) | MLS | CCL/exhibition | 1 |
| (NFL preseason) | NFL | Preseason exhibition | 1 |

### NHL Stadium Data Issue

**Critical:** Hockey Reference does not provide stadium data. All 1,312 NHL games have `raw_stadium: None`, leaving 100% of NHL games without stadium IDs. The NHL fallback sources (NHL API, ESPN) could supply this data, but because Hockey Reference "succeeds" (it returns games), the fallbacks are never attempted - and the `max_sources_to_try = 2` cap would block the third source anyway.
### Expected Count Updates Needed

| Sport | Current Expected | Recommended | Reason |
|-------|------------------|-------------|--------|
| WNBA | 220 | **286** | 13 teams × 44 games / 2 (expanded with Golden State Valkyries) |
| NFL | 272 | 272 (filter preseason) | Or document that 330 includes preseason |
| MLB | 2,430 | 2,430 (filter spring training) | Or document that 2,866 includes spring training |

### Issues Found

| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 7 | NHL has no stadium data | High | Hockey Reference provides no venue info. All 1,312 games are missing stadium_id. Fallback sources are not tried. |
| 8 | 131 NBA stadium resolution failures | High | Recent naming rights changes ("Mortgage Matchup Center", "Xfinity Mobile Arena") not in aliases. |
| 9 | Outdated WNBA expected count | Medium | Config says 220, but the WNBA expanded to 13 teams in 2025; actual is 322 (286 regular season + playoffs). |
| 10 | MLS/WNBA stadium alias gaps | Medium | 64 MLS + 65 WNBA unresolved stadiums from new/renamed venues. |
| 11 | Game status not parsed | Low | All games have status `unknown` instead of final/scheduled/postponed. |

### Phase 4 Summary

**Result: FAIL** - Significant stadium resolution failures across multiple sports.

- ❌ 131 NBA games missing stadium (naming rights changes)
- ❌ 1,312 NHL games missing stadium (source doesn't provide the data)
- ❌ 64 MLS + 65 WNBA stadiums unresolved (new/renamed venues)
- ⚠️ WNBA expected count severely outdated (220 vs 322 actual)
- ⚠️ MLB/NFL totals include preseason/spring training games
- ✅ No significant duplicate games (1 MLB doubleheader edge case)
- ✅ All teams resolved except exhibition/international games

---
## Phase 5 Results: Canonical ID Consistency

**Files Audited:**
- `sportstime_parser/normalizers/canonical_id.py` (Python ID generation)
- `SportsTime/Core/Models/Local/CanonicalModels.swift` (iOS models)
- `SportsTime/Core/Services/BootstrapService.swift` (iOS JSON parsing)
- All `Scripts/output/*.json` files (generated IDs)

### Format Validation

| Type | Total IDs | Valid | Invalid | Pass Rate |
|------|-----------|-------|---------|-----------|
| Team | 183 | 183 | 0 | 100.0% ✅ |
| Stadium | 211 | 211 | 0 | 100.0% ✅ |
| Game | 6,792 | 6,792 | 0 | 100.0% ✅ |

### ID Format Patterns (all validated)

```
Teams:    team_{sport}_{abbrev}             → team_nba_lal
Stadiums: stadium_{sport}_{normalized_name} → stadium_nba_cryptocom_arena
Games:    game_{sport}_{season}_{YYYYMMDD}_{away}_{home}[_{#}]
          → game_nba_2025_20251021_hou_okc
```
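The normalization these checks imply (lowercase, drop punctuation, single underscores) can be approximated as below. The authoritative rules live in `canonical_id.py`; edge cases such as hyphen handling may differ:

```python
import re

def normalize_name(name):
    """Approximate stadium-name normalization: lowercase, drop punctuation,
    then collapse whitespace runs into single underscores."""
    s = name.lower()
    s = re.sub(r"[^a-z0-9\s]", "", s)   # "crypto.com" → "cryptocom"
    return re.sub(r"\s+", "_", s.strip())

assert normalize_name("Crypto.com Arena") == "cryptocom_arena"
assert "stadium_nba_" + normalize_name("Crypto.com Arena") == "stadium_nba_cryptocom_arena"
```

The validated properties above (no `__`, no leading/trailing underscores, no uppercase, no special characters) all hold for any output of this function.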
### Normalization Quality

| Check | Result |
|-------|--------|
| Double underscores (`__`) | 0 found ✅ |
| Leading/trailing underscores | 0 found ✅ |
| Uppercase letters | 0 found ✅ |
| Special characters | 0 found ✅ |

### Abbreviation Lengths (Teams)

| Length | Count |
|--------|-------|
| 2 chars | 21 |
| 3 chars | 161 |
| 4 chars | 1 |

### Stadium ID Lengths

- Minimum: 8 characters
- Maximum: 29 characters
- Average: 16.2 characters

### iOS Cross-Compatibility

| Aspect | Status | Notes |
|--------|--------|-------|
| Field naming convention | ✅ Compatible | Python uses snake_case; iOS `BootstrapService` uses matching Codable structs |
| Deterministic UUID generation | ✅ Compatible | iOS uses a SHA256 hash of canonical_id, which works for any valid string |
| Schema version | ✅ Compatible | Both use version 1 |
| Required fields | ✅ Present | All iOS-required fields are present in the JSON output |

### Field Mapping (Python → iOS)

| Python Field | iOS Field | Notes |
|--------------|-----------|-------|
| `canonical_id` | `canonicalId` | Mapped via `JSONCanonicalStadium.canonical_id` → `CanonicalStadium.canonicalId` |
| `home_team_canonical_id` | `homeTeamCanonicalId` | Explicit mapping in BootstrapService |
| `away_team_canonical_id` | `awayTeamCanonicalId` | Explicit mapping in BootstrapService |
| `stadium_canonical_id` | `stadiumCanonicalId` | Explicit mapping in BootstrapService |
| `game_datetime_utc` | `dateTime` | ISO 8601 parsing with fallback to the legacy format |

### Issues Found

**No issues found.** All canonical IDs are:
- Correctly formatted according to the defined patterns
- Properly normalized (lowercase, no special characters)
- Deterministic (same input produces same output)
- Compatible with iOS parsing

### Phase 5 Summary

**Result: PASS** - All canonical IDs are consistent and iOS-compatible.

- ✅ 100% format validation pass rate across 7,186 IDs
- ✅ No normalization issues found
- ✅ iOS BootstrapService explicitly handles the snake_case → camelCase mapping
- ✅ Deterministic UUID generation using a SHA256 hash
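One way to derive a deterministic UUID from a canonical ID, sketched in Python. The audit confirms iOS hashes the canonical_id with SHA256, but the exact byte layout it uses is an assumption here, so treat this as illustrative only:

```python
import hashlib
import uuid

def deterministic_uuid(canonical_id):
    """Map a canonical ID to a stable UUID via SHA256 (illustrative byte layout)."""
    digest = hashlib.sha256(canonical_id.encode("utf-8")).digest()
    return uuid.UUID(bytes=digest[:16])  # first 16 of the 32 digest bytes

# Same input always yields the same UUID; different IDs diverge
a = deterministic_uuid("game_nba_2025_20251021_hou_okc")
b = deterministic_uuid("game_nba_2025_20251021_hou_okc")
c = deterministic_uuid("team_nba_lal")
assert a == b and a != c
```

Because the UUID is a pure function of the canonical ID, Python and Swift never need to coordinate ID assignment, and re-running the pipeline cannot change a record's identity.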
---
## Phase 6 Results: Referential Integrity

**Files Audited:**
- `Scripts/output/games_*_2025.json`
- `Scripts/output/teams_*.json`
- `Scripts/output/stadiums_*.json`

### Game → Team References

| Sport | Total Games | Valid Home | Valid Away | Orphan Home | Orphan Away | Status |
|-------|-------------|------------|------------|-------------|-------------|--------|
| NBA | 1,231 | 1,231 | 1,231 | 0 | 0 | ✅ |
| MLB | 2,866 | 2,866 | 2,866 | 0 | 0 | ✅ |
| NFL | 330 | 330 | 330 | 0 | 0 | ✅ |
| NHL | 1,312 | 1,312 | 1,312 | 0 | 0 | ✅ |
| MLS | 542 | 542 | 542 | 0 | 0 | ✅ |
| WNBA | 322 | 322 | 322 | 0 | 0 | ✅ |
| NWSL | 189 | 189 | 189 | 0 | 0 | ✅ |

**Result:** 100% valid team references across all 6,792 games.

### Game → Stadium References

| Sport | Total Games | Valid | Missing | Percentage Missing |
|-------|-------------|-------|---------|-------------------|
| NBA | 1,231 | 1,231 | 0 | 0.0% ✅ |
| MLB | 2,866 | 2,862 | 4 | 0.1% ✅ |
| NFL | 330 | 325 | 5 | 1.5% ✅ |
| NHL | 1,312 | 0 | **1,312** | **100%** ❌ |
| MLS | 542 | 478 | 64 | 11.8% ⚠️ |
| WNBA | 322 | 257 | 65 | 20.2% ⚠️ |
| NWSL | 189 | 173 | 16 | 8.5% ⚠️ |

**Note:** "Missing" means `stadium_canonical_id` is empty (resolution failed at scrape time). These are resolution failures, NOT orphan references to non-existent stadiums.
### Team → Stadium References

| Sport | Teams | Valid Stadium | Invalid | Status |
|-------|-------|---------------|---------|--------|
| NBA | 30 | 30 | 0 | ✅ |
| MLB | 30 | 30 | 0 | ✅ |
| NFL | 32 | 32 | 0 | ✅ |
| NHL | 32 | 32 | 0 | ✅ |
| MLS | 30 | 30 | 0 | ✅ |
| WNBA | 13 | 13 | 0 | ✅ |
| NWSL | 16 | 16 | 0 | ✅ |

**Result:** 100% valid team → stadium references.

### Cross-Sport Stadium Check

✅ No stadiums are duplicated across sports. Each `stadium_{sport}_*` ID is unique to its sport.

### Missing Stadium Root Causes

| Sport | Missing | Root Cause |
|-------|---------|------------|
| NHL | 1,312 | **Hockey Reference provides no venue data** - source limitation |
| MLS | 64 | New/renamed stadiums not in aliases (see Phase 4) |
| WNBA | 65 | New venue names not in aliases (see Phase 4) |
| NWSL | 16 | Expansion team venues + alternate venues |
| NFL | 5 | International games not in stadium mappings |
| MLB | 4 | Exhibition/international games |

### Orphan Reference Summary

| Reference Type | Total Checked | Orphans Found |
|----------------|---------------|---------------|
| Game → Home Team | 6,792 | 0 ✅ |
| Game → Away Team | 6,792 | 0 ✅ |
| Game → Stadium | 6,792 | 0 ✅ |
| Team → Stadium | 183 | 0 ✅ |

**Note:** Zero orphan references. All "missing" stadiums are resolution failures (empty string), not references to non-existent canonical IDs.
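The distinction the note draws, an empty ID (resolution failure) versus a non-empty ID that matches no stadium (orphan), can be checked mechanically. A sketch with assumed field names and a hypothetical stadium catalog:

```python
def classify_stadium_refs(games, stadium_ids):
    """Split games into valid / unresolved (empty ID) / orphan (ID not in catalog)."""
    report = {"valid": 0, "unresolved": 0, "orphan": 0}
    for game in games:
        sid = game.get("stadium_canonical_id") or ""
        if not sid:
            report["unresolved"] += 1     # resolution failed at scrape time
        elif sid in stadium_ids:
            report["valid"] += 1
        else:
            report["orphan"] += 1         # points at a non-existent stadium
    return report

games = [
    {"stadium_canonical_id": "stadium_nhl_ball_arena"},   # resolves (hypothetical ID)
    {"stadium_canonical_id": ""},                          # NHL-style resolution failure
    {"stadium_canonical_id": "stadium_nhl_typo"},          # would be an orphan
]
assert classify_stadium_refs(games, {"stadium_nhl_ball_arena"}) == {
    "valid": 1, "unresolved": 1, "orphan": 1,
}
```

Under this classification, the audit's 1,466 "missing" games all land in the `unresolved` bucket and the `orphan` bucket stays at zero.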
### Issues Found

| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 12 | NHL games have no stadium data | Medium | The Hockey Reference source doesn't provide venue information, so all 1,312 NHL games have an empty stadium_canonical_id. Fallback sources could provide this data but are limited by `max_sources_to_try = 2`. |

### Phase 6 Summary

**Result: PASS with known limitations** - No orphan references exist; missing stadiums are resolution failures.

- ✅ 100% valid team references (home and away)
- ✅ 100% valid team → stadium references
- ✅ No orphan references to non-existent canonical IDs
- ⚠️ 1,466 games (21.6%) have an empty stadium_canonical_id (resolution failures, not orphans)
- ⚠️ NHL accounts for 90% of the missing stadium data (source limitation)

---
## Phase 7 Results: iOS Data Reception

**Files Audited:**
- `SportsTime/Core/Services/BootstrapService.swift` (JSON parsing)
- `SportsTime/Core/Services/CanonicalSyncService.swift` (CloudKit sync)
- `SportsTime/Core/Services/DataProvider.swift` (data access)
- `SportsTime/Core/Models/Local/CanonicalModels.swift` (SwiftData models)
- `SportsTime/Resources/*_canonical.json` (bundled data files)

### Bundled Data Comparison

| Data Type | iOS Bundled | Scripts Output | Difference | Status |
|-----------|-------------|----------------|------------|--------|
| Teams | 148 | 183 | **-35** (19%) | ❌ STALE |
| Stadiums | 122 | 211 | **-89** (42%) | ❌ STALE |
| Games | 4,972 | 6,792 | **-1,820** (27%) | ❌ STALE |

**iOS bundled data is significantly outdated compared to the Scripts output.**
### Field Mapping Verification

| Python Field | iOS JSON Struct | iOS Model | Type Match | Status |
|--------------|-----------------|-----------|------------|--------|
| `canonical_id` | `canonical_id` | `canonicalId` | String ✅ | ✅ |
| `name` | `name` | `name` | String ✅ | ✅ |
| `game_datetime_utc` | `game_datetime_utc` | `dateTime` | ISO 8601 → Date ✅ | ✅ |
| `date` + `time` (legacy) | `date`, `time` | `dateTime` | Fallback parsing ✅ | ✅ |
| `home_team_canonical_id` | `home_team_canonical_id` | `homeTeamCanonicalId` | String ✅ | ✅ |
| `away_team_canonical_id` | `away_team_canonical_id` | `awayTeamCanonicalId` | String ✅ | ✅ |
| `stadium_canonical_id` | `stadium_canonical_id` | `stadiumCanonicalId` | String ✅ | ✅ |
| `sport` | `sport` | `sport` | String ✅ | ✅ |
| `season` | `season` | `season` | String ✅ | ✅ |
| `is_playoff` | `is_playoff` | `isPlayoff` | Bool ✅ | ✅ |
| `broadcast_info` | `broadcast_info` | `broadcastInfo` | String? ✅ | ✅ |

**Result:** All field mappings are correct and compatible.

### Date Parsing Compatibility

iOS `BootstrapService` supports both formats:

```swift
// New canonical format (preferred)
let game_datetime_utc: String?  // ISO 8601

// Legacy format (fallback)
let date: String?  // "YYYY-MM-DD"
let time: String?  // "HH:mm" or "TBD"
```

**The currently bundled iOS games use the legacy format.** Once the bundled data is refreshed, the new `game_datetime_utc` format will be used.
### Missing Reference Handling

**`DataProvider.filterRichGames()` behavior:**

```swift
return games.compactMap { game in
    guard let homeTeam = teamsById[game.homeTeamId],
          let awayTeam = teamsById[game.awayTeamId],
          let stadium = stadiumsById[game.stadiumId] else {
        return nil  // ⚠️ Silently drops the game
    }
    return RichGame(...)
}
```

**Impact:**
- Games with missing stadium IDs are **silently excluded** from RichGame queries
- No error logging or fallback behavior
- Users see fewer games than expected, with no explanation

### Deduplication Logic

**Bootstrap:** No explicit deduplication. If the bundled JSON contained duplicate canonical IDs, both records would be inserted into SwiftData (leading to potential query issues).

**CloudKit Sync:** Uses an upsert pattern with the canonical ID as the unique key, so duplicates would simply overwrite.
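A first-wins dedupe pass before insert would close the bootstrap gap. Sketched in Python for brevity; the actual fix would live in the Swift `bootstrapGames()` path:

```python
def dedupe_by_canonical_id(records):
    """Keep the first record for each canonical_id, preserving input order."""
    seen = set()
    unique = []
    for rec in records:
        cid = rec["canonical_id"]
        if cid not in seen:
            seen.add(cid)
            unique.append(rec)
    return unique

games = [{"canonical_id": "g1"}, {"canonical_id": "g1"}, {"canonical_id": "g2"}]
assert [g["canonical_id"] for g in dedupe_by_canonical_id(games)] == ["g1", "g2"]
```

First-wins matches the CloudKit upsert semantics closely enough for bootstrap data, where duplicate IDs should carry identical payloads anyway.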
### Schema Version Compatibility

| Component | Schema Version | Status |
|-----------|----------------|--------|
| Scripts output | 1 | ✅ |
| iOS CanonicalModels | 1 | ✅ |
| iOS BootstrapService | Expects 1 | ✅ |

**Compatible.** Schema version mismatch protection exists in `CanonicalSyncService`:

```swift
case .schemaVersionTooNew(let version):
    return "Data requires app version supporting schema \(version). Please update the app."
```

### Bootstrap Order Validation

iOS bootstraps in the correct dependency order:

1. Stadiums (no dependencies)
2. Stadium aliases (depends on stadiums)
3. League structure (no dependencies)
4. Teams (depends on stadiums)
5. Team aliases (depends on teams)
6. Games (depends on teams + stadiums)

**Correct - prevents orphan references during bootstrap.**

### CloudKit Sync Validation

`CanonicalSyncService` syncs in the same dependency order and tracks:

- Per-entity sync timestamps
- Skipped records (incompatible schema version)
- Skipped records (older than local)
- Sync duration and cancellation

**Well-designed sync infrastructure.**

### Issues Found

| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 13 | iOS bundled data severely outdated | **Critical** | Missing 35 teams (19%), 89 stadiums (42%), and 1,820 games (27%). The first-launch experience shows incomplete data until CloudKit sync completes. |
| 14 | Silent game exclusion in RichGame queries | Medium | `filterRichGames()` silently drops games with missing team/stadium references. Users see fewer games without explanation. |
| 15 | No bootstrap deduplication | Low | Duplicate game IDs in bundled JSON would create duplicate SwiftData records. Low risk since the JSON is generated correctly. |

### Phase 7 Summary

**Result: FAIL** - iOS bundled data is critically outdated.

- ❌ iOS bundled data is missing 35 teams, 89 stadiums, and 1,820 games
- ⚠️ Games with unresolved references are silently dropped from RichGame queries
- ✅ Field mapping between Python and iOS is correct
- ✅ Date parsing supports both legacy and new formats
- ✅ Schema versions are compatible
- ✅ Bootstrap/sync order handles dependencies correctly

---
## Prioritized Issue List
|
||||
|
||||
| # | Issue | Severity | Phase | Root Cause | Remediation |
|
||||
|---|-------|----------|-------|------------|-------------|
|
||||
| 13 | iOS bundled data severely outdated | **Critical** | 7 | Bundled JSON not updated after pipeline runs | Copy Scripts/output/*_canonical.json to iOS Resources/ and rebuild |
|
||||
| 4 | WNBA/NWSL/MLS ESPN-only source | **High** | 3 | No implemented fallback sources | Implement alternative scrapers (FBref for MLS, WNBA League Pass) |
|
||||
| 5 | max_sources_to_try = 2 limits fallback | **High** | 3 | Hardcoded limit in base.py:189 | Increase to 3 or remove limit for sports with 3+ sources |
|
||||
| 7 | NHL has no stadium data from primary source | **High** | 4 | Hockey Reference doesn't provide venue info | Force NHL to use NHL API or ESPN as primary (they provide venues) |
|
||||
| 8 | 131 NBA stadium resolution failures | **High** | 4 | 2024-2025 naming rights not in aliases | Add aliases: "Mortgage Matchup Center" → Rocket Mortgage FieldHouse, "Xfinity Mobile Arena" → Intuit Dome |
|
||||
| 2 | Orphan stadium alias references | **Medium** | 2 | Wrong canonical IDs in stadium_aliases.json | Fix 5 Denver/KC stadium aliases pointing to non-existent IDs |
|
||||
| 6 | CBS/FBref scrapers declared but not implemented | **Medium** | 3 | NotImplementedError at runtime | Either implement or remove from source lists to avoid confusion |
|
||||
| 9 | Outdated WNBA expected count | **Medium** | 4 | WNBA expanded to 13 teams in 2025 | Update config.py EXPECTED_GAME_COUNTS["wnba"] from 220 to 286 |
|
||||
| 10 | MLS/WNBA stadium alias gaps | **Medium** | 4 | New/renamed venues missing from aliases | Add 129 missing stadium aliases (64 MLS + 65 WNBA) |
|
||||
| 12 | NHL games have no stadium data | **Medium** | 6 | Same as Issue #7 | See Issue #7 remediation |
|
||||
| 14 | Silent game exclusion in RichGame queries | **Medium** | 7 | compactMap silently drops games | Log dropped games or return partial RichGame with placeholder stadium |
|
||||
| 1 | WNBA single abbreviations | **Low** | 1 | Only 1 abbreviation per team | Add alternative abbreviations for source compatibility |
|
||||
| 3 | No NFL team aliases | **Low** | 2 | Missing Washington Redskins/Football Team | Add historical Washington team name aliases |
|
||||
| 11 | Game status not parsed | **Low** | 4 | Status field always "unknown" | Parse game status from source data (final, scheduled, postponed) |
|
||||
| 15 | No bootstrap deduplication | **Low** | 7 | No explicit duplicate check during bootstrap | Add deduplication check in bootstrapGames() |
|
||||
|
||||
---

## Recommended Next Steps

### Immediate (Before Next Release)

1. **Update iOS bundled data** (Issue #13)

   ```bash
   cp Scripts/output/stadiums_*.json SportsTime/Resources/stadiums_canonical.json
   cp Scripts/output/teams_*.json SportsTime/Resources/teams_canonical.json
   cp Scripts/output/games_*.json SportsTime/Resources/games_canonical.json
   ```

2. **Fix NHL stadium data** (Issues #7, #12)
   - Change NHL primary source from Hockey Reference to NHL API
   - Or: Increase `max_sources_to_try` to 3 so fallbacks are attempted

3. **Add critical stadium aliases** (Issues #8, #10)
   - "Mortgage Matchup Center" → `stadium_nba_rocket_mortgage_fieldhouse`
   - "Xfinity Mobile Arena" → `stadium_nba_intuit_dome`
   - Run validation report to identify all unresolved venue names

### Short-term (This Quarter)

4. **Implement MLS fallback source** (Issue #4)
   - FBref has MLS data with venue information
   - Reduces ESPN single-point-of-failure risk

5. **Fix orphan alias references** (Issue #2)
   - Correct 5 NFL stadium aliases pointing to wrong canonical IDs
   - Add validation check to prevent future orphan references

6. **Update expected game counts** (Issue #9)
   - WNBA: 220 → 286 (13 teams × 44 games / 2)
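
The orphan-reference check in item 5 amounts to cross-checking every alias target against the set of canonical IDs. A minimal sketch, assuming the repo's `*_aliases.json` and `*_canonical.json` file shapes; the helper names are illustrative, and the shipped alias-validation script may differ:

```python
# Hypothetical orphan-alias check: flag alias entries whose target
# canonical ID does not exist in the canonical data set.
import json
from pathlib import Path


def load_canonical_ids(path: Path) -> set[str]:
    """Collect the `id` field from a canonical JSON array."""
    records = json.loads(path.read_text())
    return {rec["id"] for rec in records}


def find_orphan_aliases(
    aliases: dict[str, dict[str, str]], canonical_ids: set[str]
) -> list[tuple[str, str, str]]:
    """Return (sport, alias, target_id) triples with no matching canonical ID."""
    orphans = []
    for sport, mapping in aliases.items():
        for alias, target_id in mapping.items():
            if target_id not in canonical_ids:
                orphans.append((sport, alias, target_id))
    return orphans
```

Run against `stadium_aliases.json` plus the stadium canonical output, a non-empty result is exactly the Issue #2 failure mode.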

### Long-term (Next Quarter)

7. **Implement WNBA/NWSL fallback sources** (Issue #4)
   - Consider WNBA League Pass API or other sources
   - NWSL has limited data availability - may need to accept ESPN-only

8. **Add RichGame partial loading** (Issue #14)
   - Log games dropped due to missing references
   - Consider returning games with placeholder stadiums for NHL

9. **Parse game status** (Issue #11)
   - Extract final/scheduled/postponed from source data
   - Enables filtering by game state

---

## Verification Checklist

After implementing fixes, verify:

- [ ] Run `python -m sportstime_parser scrape --sport all --season 2025`
- [ ] Check validation reports show <5% unresolved stadiums per sport
- [ ] Copy output JSON to iOS Resources/
- [ ] Build iOS app and verify data loads at startup
- [ ] Query RichGames and verify game count matches expectations
- [ ] Run CloudKit sync and verify no errors

1046
docs/REMEDIATION_PLAN.md
Normal file
File diff suppressed because it is too large
371
league_structure.json
Normal file
@@ -0,0 +1,371 @@
[
  {
    "id": "mlb_league",
    "sport": "MLB",
    "type": "league",
    "name": "Major League Baseball",
    "abbreviation": "MLB",
    "parent_id": null,
    "display_order": 0
  },
  {
    "id": "mlb_al",
    "sport": "MLB",
    "type": "conference",
    "name": "American League",
    "abbreviation": "AL",
    "parent_id": "mlb_league",
    "display_order": 1
  },
  {
    "id": "mlb_nl",
    "sport": "MLB",
    "type": "conference",
    "name": "National League",
    "abbreviation": "NL",
    "parent_id": "mlb_league",
    "display_order": 2
  },
  {
    "id": "mlb_al_east",
    "sport": "MLB",
    "type": "division",
    "name": "AL East",
    "abbreviation": null,
    "parent_id": "mlb_al",
    "display_order": 3
  },
  {
    "id": "mlb_al_central",
    "sport": "MLB",
    "type": "division",
    "name": "AL Central",
    "abbreviation": null,
    "parent_id": "mlb_al",
    "display_order": 4
  },
  {
    "id": "mlb_al_west",
    "sport": "MLB",
    "type": "division",
    "name": "AL West",
    "abbreviation": null,
    "parent_id": "mlb_al",
    "display_order": 5
  },
  {
    "id": "mlb_nl_east",
    "sport": "MLB",
    "type": "division",
    "name": "NL East",
    "abbreviation": null,
    "parent_id": "mlb_nl",
    "display_order": 6
  },
  {
    "id": "mlb_nl_central",
    "sport": "MLB",
    "type": "division",
    "name": "NL Central",
    "abbreviation": null,
    "parent_id": "mlb_nl",
    "display_order": 7
  },
  {
    "id": "mlb_nl_west",
    "sport": "MLB",
    "type": "division",
    "name": "NL West",
    "abbreviation": null,
    "parent_id": "mlb_nl",
    "display_order": 8
  },
  {
    "id": "nba_league",
    "sport": "NBA",
    "type": "league",
    "name": "National Basketball Association",
    "abbreviation": "NBA",
    "parent_id": null,
    "display_order": 9
  },
  {
    "id": "nba_eastern",
    "sport": "NBA",
    "type": "conference",
    "name": "Eastern Conference",
    "abbreviation": "East",
    "parent_id": "nba_league",
    "display_order": 10
  },
  {
    "id": "nba_western",
    "sport": "NBA",
    "type": "conference",
    "name": "Western Conference",
    "abbreviation": "West",
    "parent_id": "nba_league",
    "display_order": 11
  },
  {
    "id": "nba_atlantic",
    "sport": "NBA",
    "type": "division",
    "name": "Atlantic",
    "abbreviation": null,
    "parent_id": "nba_eastern",
    "display_order": 12
  },
  {
    "id": "nba_central",
    "sport": "NBA",
    "type": "division",
    "name": "Central",
    "abbreviation": null,
    "parent_id": "nba_eastern",
    "display_order": 13
  },
  {
    "id": "nba_southeast",
    "sport": "NBA",
    "type": "division",
    "name": "Southeast",
    "abbreviation": null,
    "parent_id": "nba_eastern",
    "display_order": 14
  },
  {
    "id": "nba_northwest",
    "sport": "NBA",
    "type": "division",
    "name": "Northwest",
    "abbreviation": null,
    "parent_id": "nba_western",
    "display_order": 15
  },
  {
    "id": "nba_pacific",
    "sport": "NBA",
    "type": "division",
    "name": "Pacific",
    "abbreviation": null,
    "parent_id": "nba_western",
    "display_order": 16
  },
  {
    "id": "nba_southwest",
    "sport": "NBA",
    "type": "division",
    "name": "Southwest",
    "abbreviation": null,
    "parent_id": "nba_western",
    "display_order": 17
  },
  {
    "id": "nfl_league",
    "sport": "NFL",
    "type": "league",
    "name": "National Football League",
    "abbreviation": "NFL",
    "parent_id": null,
    "display_order": 18
  },
  {
    "id": "nfl_afc",
    "sport": "NFL",
    "type": "conference",
    "name": "American Football Conference",
    "abbreviation": "AFC",
    "parent_id": "nfl_league",
    "display_order": 19
  },
  {
    "id": "nfl_nfc",
    "sport": "NFL",
    "type": "conference",
    "name": "National Football Conference",
    "abbreviation": "NFC",
    "parent_id": "nfl_league",
    "display_order": 20
  },
  {
    "id": "nfl_afc_east",
    "sport": "NFL",
    "type": "division",
    "name": "AFC East",
    "abbreviation": null,
    "parent_id": "nfl_afc",
    "display_order": 21
  },
  {
    "id": "nfl_afc_north",
    "sport": "NFL",
    "type": "division",
    "name": "AFC North",
    "abbreviation": null,
    "parent_id": "nfl_afc",
    "display_order": 22
  },
  {
    "id": "nfl_afc_south",
    "sport": "NFL",
    "type": "division",
    "name": "AFC South",
    "abbreviation": null,
    "parent_id": "nfl_afc",
    "display_order": 23
  },
  {
    "id": "nfl_afc_west",
    "sport": "NFL",
    "type": "division",
    "name": "AFC West",
    "abbreviation": null,
    "parent_id": "nfl_afc",
    "display_order": 24
  },
  {
    "id": "nfl_nfc_east",
    "sport": "NFL",
    "type": "division",
    "name": "NFC East",
    "abbreviation": null,
    "parent_id": "nfl_nfc",
    "display_order": 25
  },
  {
    "id": "nfl_nfc_north",
    "sport": "NFL",
    "type": "division",
    "name": "NFC North",
    "abbreviation": null,
    "parent_id": "nfl_nfc",
    "display_order": 26
  },
  {
    "id": "nfl_nfc_south",
    "sport": "NFL",
    "type": "division",
    "name": "NFC South",
    "abbreviation": null,
    "parent_id": "nfl_nfc",
    "display_order": 27
  },
  {
    "id": "nfl_nfc_west",
    "sport": "NFL",
    "type": "division",
    "name": "NFC West",
    "abbreviation": null,
    "parent_id": "nfl_nfc",
    "display_order": 28
  },
  {
    "id": "nhl_league",
    "sport": "NHL",
    "type": "league",
    "name": "National Hockey League",
    "abbreviation": "NHL",
    "parent_id": null,
    "display_order": 29
  },
  {
    "id": "nhl_eastern",
    "sport": "NHL",
    "type": "conference",
    "name": "Eastern Conference",
    "abbreviation": "East",
    "parent_id": "nhl_league",
    "display_order": 30
  },
  {
    "id": "nhl_western",
    "sport": "NHL",
    "type": "conference",
    "name": "Western Conference",
    "abbreviation": "West",
    "parent_id": "nhl_league",
    "display_order": 31
  },
  {
    "id": "nhl_atlantic",
    "sport": "NHL",
    "type": "division",
    "name": "Atlantic",
    "abbreviation": null,
    "parent_id": "nhl_eastern",
    "display_order": 32
  },
  {
    "id": "nhl_metropolitan",
    "sport": "NHL",
    "type": "division",
    "name": "Metropolitan",
    "abbreviation": null,
    "parent_id": "nhl_eastern",
    "display_order": 33
  },
  {
    "id": "nhl_central",
    "sport": "NHL",
    "type": "division",
    "name": "Central",
    "abbreviation": null,
    "parent_id": "nhl_western",
    "display_order": 34
  },
  {
    "id": "nhl_pacific",
    "sport": "NHL",
    "type": "division",
    "name": "Pacific",
    "abbreviation": null,
    "parent_id": "nhl_western",
    "display_order": 35
  },
  {
    "id": "wnba_league",
    "sport": "WNBA",
    "type": "league",
    "name": "Women's National Basketball Association",
    "abbreviation": "WNBA",
    "parent_id": null,
    "display_order": 36
  },
  {
    "id": "mls_league",
    "sport": "MLS",
    "type": "league",
    "name": "Major League Soccer",
    "abbreviation": "MLS",
    "parent_id": null,
    "display_order": 37
  },
  {
    "id": "mls_eastern",
    "sport": "MLS",
    "type": "conference",
    "name": "Eastern Conference",
    "abbreviation": "East",
    "parent_id": "mls_league",
    "display_order": 38
  },
  {
    "id": "mls_western",
    "sport": "MLS",
    "type": "conference",
    "name": "Western Conference",
    "abbreviation": "West",
    "parent_id": "mls_league",
    "display_order": 39
  },
  {
    "id": "nwsl_league",
    "sport": "NWSL",
    "type": "league",
    "name": "National Women's Soccer League",
    "abbreviation": "NWSL",
    "parent_id": null,
    "display_order": 40
  }
]

66
pyproject.toml
Normal file
@@ -0,0 +1,66 @@
[build-system]
requires = ["setuptools>=61.0", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "sportstime-parser"
version = "0.1.0"
description = "Sports data scraper and CloudKit uploader for SportsTime app"
readme = "README.md"
requires-python = ">=3.11"
license = {text = "MIT"}
authors = [
    {name = "SportsTime Team"}
]
keywords = ["sports", "scraper", "cloudkit", "nba", "mlb", "nfl", "nhl", "mls"]
classifiers = [
    "Development Status :: 3 - Alpha",
    "Intended Audience :: Developers",
    "License :: OSI Approved :: MIT License",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
    "Programming Language :: Python :: 3.13",
]
dependencies = [
    "requests>=2.31.0",
    "beautifulsoup4>=4.12.0",
    "lxml>=5.0.0",
    "rapidfuzz>=3.5.0",
    "python-dateutil>=2.8.0",
    "pytz>=2024.1",
    "rich>=13.7.0",
    "pyjwt>=2.8.0",
    "cryptography>=42.0.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0.0",
    "pytest-cov>=4.1.0",
    "responses>=0.25.0",
]

[project.scripts]
sportstime-parser = "sportstime_parser.__main__:main"

[tool.setuptools.packages.find]
where = ["."]
include = ["sportstime_parser*"]

[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py"]
python_functions = ["test_*"]
addopts = "-v --tb=short"

[tool.coverage.run]
source = ["sportstime_parser"]
omit = ["tests/*"]

[tool.coverage.report]
exclude_lines = [
    "pragma: no cover",
    "if __name__ == .__main__.:",
    "raise NotImplementedError",
]

15
requirements.txt
Normal file
@@ -0,0 +1,15 @@
# Core dependencies
requests>=2.31.0
beautifulsoup4>=4.12.0
lxml>=5.0.0
rapidfuzz>=3.5.0
python-dateutil>=2.8.0
pytz>=2024.1
rich>=13.7.0
pyjwt>=2.8.0
cryptography>=42.0.0

# Development dependencies
pytest>=8.0.0
pytest-cov>=4.1.0
responses>=0.25.0

688
sportstime_parser/README.md
Normal file
@@ -0,0 +1,688 @@
# SportsTime Parser

A Python CLI tool for scraping sports schedules, normalizing data with canonical IDs, and uploading to CloudKit.

## Features

- Scrapes game schedules from multiple sources with automatic fallback
- Supports 7 major sports leagues: NBA, MLB, NFL, NHL, MLS, WNBA, NWSL
- Generates deterministic canonical IDs for games, teams, and stadiums
- Produces validation reports with manual review lists
- Uploads to CloudKit with resumable, diff-based updates

## Requirements

- Python 3.11+
- CloudKit credentials (for upload functionality)

## Installation

```bash
# From the Scripts directory
cd Scripts

# Install in development mode
pip install -e ".[dev]"

# Or install dependencies only
pip install -r requirements.txt
```

## Quick Start

```bash
# Scrape NBA 2025-26 season
sportstime-parser scrape nba --season 2025

# Scrape all sports
sportstime-parser scrape all --season 2025

# Validate existing scraped data
sportstime-parser validate nba --season 2025

# Check status
sportstime-parser status

# Upload to CloudKit (development)
sportstime-parser upload nba --season 2025

# Upload to CloudKit (production)
sportstime-parser upload nba --season 2025 --environment production
```

## CLI Reference

### scrape

Scrape game schedules, teams, and stadiums from web sources.

```bash
sportstime-parser scrape <sport> [options]

Arguments:
  sport              Sport to scrape: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT   Season start year (default: 2025)
  --dry-run          Parse and validate only, don't write output files
  --verbose, -v      Enable verbose output
```

**Examples:**

```bash
# Scrape NBA 2025-26 season
sportstime-parser scrape nba --season 2025

# Scrape all sports with verbose output
sportstime-parser scrape all --season 2025 --verbose

# Dry run to test without writing files
sportstime-parser scrape mlb --season 2026 --dry-run
```

### validate

Run validation on existing scraped data and regenerate reports. Validation performs these checks:

1. **Game Coverage**: Compares scraped game count against expected totals per league (e.g., ~1,230 for NBA, ~2,430 for MLB)
2. **Team Resolution**: Identifies team names that couldn't be matched to canonical IDs using fuzzy matching
3. **Stadium Resolution**: Identifies venue names that couldn't be matched to canonical stadium IDs
4. **Duplicate Detection**: Finds games with the same home/away teams on the same date (potential doubleheader issues or data errors)
5. **Missing Data**: Flags games missing required fields (stadium_id, team IDs, valid dates)

The output is a Markdown report with:

- Summary statistics (total games, valid games, coverage percentage)
- Manual review items grouped by type (unresolved teams, unresolved stadiums, duplicates)
- Fuzzy match suggestions with confidence scores to help resolve unmatched names
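
The duplicate-detection check (item 4 above) amounts to counting `(date, away, home)` keys. A minimal sketch, assuming plain dicts with the `date`, `away_team_id`, and `home_team_id` field names; the helper name is illustrative, not the validator's actual API:

```python
# Illustrative duplicate-game check: two games are flagged as potential
# duplicates when the date, away team, and home team all coincide.
from collections import Counter


def find_duplicate_games(games: list[dict]) -> list[tuple[str, str, str]]:
    """Return (date, away_team_id, home_team_id) keys seen more than once."""
    keys = Counter(
        (g["date"], g["away_team_id"], g["home_team_id"]) for g in games
    )
    return [key for key, count in keys.items() if count > 1]
```

Note that legitimate MLB doubleheaders would also trigger this key collision, which is why canonical game IDs carry a `_{game_number}` suffix and flagged pairs go to manual review rather than being dropped.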

```bash
sportstime-parser validate <sport> [options]

Arguments:
  sport              Sport to validate: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT   Season start year (default: 2025)
```

**Examples:**

```bash
# Validate NBA data
sportstime-parser validate nba --season 2025

# Validate all sports
sportstime-parser validate all
```

### upload

Upload scraped data to CloudKit with diff-based updates.

```bash
sportstime-parser upload <sport> [options]

Arguments:
  sport              Sport to upload: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT     Season start year (default: 2025)
  --environment, -e    CloudKit environment: development or production (default: development)
  --resume             Resume interrupted upload from last checkpoint
```

**Examples:**

```bash
# Upload NBA to development
sportstime-parser upload nba --season 2025

# Upload to production
sportstime-parser upload nba --season 2025 --environment production

# Resume interrupted upload
sportstime-parser upload mlb --season 2026 --resume
```

### status

Show current scrape and upload status.

```bash
sportstime-parser status
```

### retry

Retry failed uploads from previous attempts.

```bash
sportstime-parser retry <sport> [options]

Arguments:
  sport              Sport to retry: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT     Season start year (default: 2025)
  --environment, -e    CloudKit environment (default: development)
  --max-retries INT    Maximum retry attempts per record (default: 3)
```

### clear

Clear upload session state to start fresh.

```bash
sportstime-parser clear <sport> [options]

Arguments:
  sport              Sport to clear: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT     Season start year (default: 2025)
  --environment, -e    CloudKit environment (default: development)
```

## CloudKit Configuration

To upload data to CloudKit, you need to configure authentication credentials.

### 1. Get Credentials from Apple Developer Portal

1. Go to the [Apple Developer Portal](https://developer.apple.com)
2. Navigate to **Certificates, Identifiers & Profiles** > **Keys**
3. Create a new key with the **CloudKit** capability
4. Download the private key file (.p8)
5. Note the Key ID

### 2. Set Environment Variables

```bash
# Key ID from Apple Developer Portal
export CLOUDKIT_KEY_ID="your_key_id_here"

# Path to private key file
export CLOUDKIT_PRIVATE_KEY_PATH="/path/to/AuthKey_XXXXXX.p8"

# Or provide key content directly (useful for CI/CD)
export CLOUDKIT_PRIVATE_KEY="-----BEGIN EC PRIVATE KEY-----
...key content...
-----END EC PRIVATE KEY-----"
```
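
Under the hood, each CloudKit Web Services request must be signed with this key. The header names and message format below follow Apple's documented server-to-server auth (ECDSA P-256 over `date:bodyHash:subpath`); the helper itself is an illustrative sketch, not the package's actual uploader code:

```python
# Sketch of CloudKit server-to-server request signing (per Apple's
# CloudKit Web Services docs). The signature covers the ISO-8601 date,
# the base64 SHA-256 of the request body, and the URL subpath.
import base64
import hashlib
from datetime import datetime, timezone

from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import ec


def sign_request(private_key_pem: bytes, key_id: str,
                 subpath: str, body: bytes) -> dict[str, str]:
    """Build the three CloudKit auth headers for one request."""
    iso_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    body_hash = base64.b64encode(hashlib.sha256(body).digest()).decode()
    message = f"{iso_date}:{body_hash}:{subpath}".encode()
    key = serialization.load_pem_private_key(private_key_pem, password=None)
    signature = key.sign(message, ec.ECDSA(hashes.SHA256()))  # DER-encoded
    return {
        "X-Apple-CloudKit-Request-KeyID": key_id,
        "X-Apple-CloudKit-Request-ISO8601Date": iso_date,
        "X-Apple-CloudKit-Request-SignatureV1":
            base64.b64encode(signature).decode(),
    }
```

The `.p8` file downloaded from the portal is the `private_key_pem` input; the subpath is the portion of the request URL starting at `/database/1/...`.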

### 3. Verify Configuration

```bash
sportstime-parser status
```

The status output will show whether CloudKit is configured correctly.

## Output Files

Scraped data is saved to the `output/` directory:

```
output/
  games_nba_2025.json       # Game schedules
  teams_nba.json            # Team data
  stadiums_nba.json         # Stadium data
  validation_nba_2025.md    # Validation report
```

## Validation Reports

Validation reports are generated in Markdown format at `output/validation_{sport}_{season}.md`.

### Report Sections

**Summary Table**

| Metric | Description |
|--------|-------------|
| Total Games | Number of games scraped |
| Valid Games | Games with all required fields resolved |
| Coverage | Percentage of expected games found (based on league schedule) |
| Unresolved Teams | Team names that couldn't be matched |
| Unresolved Stadiums | Venue names that couldn't be matched |
| Duplicates | Potential duplicate game entries |

**Manual Review Items**

Items are grouped by type and include the raw value, source URL, and suggested fixes:

- **Unresolved Teams**: Team names not in the alias mapping. Add to `team_aliases.json` to resolve.
- **Unresolved Stadiums**: Venue names not recognized. Common for renamed arenas (naming rights changes). Add to `stadium_aliases.json`.
- **Duplicate Games**: Same matchup on the same date. May indicate doubleheader parsing issues or duplicate entries from different sources.
- **Missing Data**: Games missing stadium coordinates or other required fields.

**Fuzzy Match Suggestions**

For each unresolved name, the validator provides the top fuzzy matches with confidence scores (0-100). High-confidence matches (>80) are likely correct; lower scores need manual verification.

## Canonical IDs

Canonical IDs are stable, deterministic identifiers that enable cross-referencing between games, teams, and stadiums across different data sources.

### ID Formats

**Games**

```
{sport}_{season}_{away}_{home}_{MMDD}[_{game_number}]
```

Examples:
- `nba_2025_hou_okc_1021` - NBA 2025-26, Houston @ OKC, Oct 21
- `mlb_2026_nyy_bos_0401_1` - MLB 2026, Yankees @ Red Sox, Apr 1, Game 1 (doubleheader)

**Teams**

```
{sport}_{city}_{name}
```

Examples:
- `nba_la_lakers`
- `mlb_new_york_yankees`
- `nfl_new_york_giants`

**Stadiums**

```
{sport}_{normalized_name}
```

Examples:
- `mlb_yankee_stadium`
- `nba_crypto_com_arena`
- `nfl_sofi_stadium`

### Generated vs Matched IDs

| Entity | Generated | Matched |
|--------|-----------|---------|
| **Teams** | Pre-defined in `team_resolver.py` mappings | Resolved from raw scraped names via aliases + fuzzy matching |
| **Stadiums** | Pre-defined in `stadium_resolver.py` mappings | Resolved from raw venue names via aliases + fuzzy matching |
| **Games** | Generated at scrape time from resolved team IDs + date | N/A (always generated, never matched) |

**Resolution Flow:**

```
Raw Name (from scraper)
    ↓
Exact Match (alias lookup in team_aliases.json / stadium_aliases.json)
    ↓ (if no match)
Fuzzy Match (Levenshtein distance against known names)
    ↓ (if confidence > threshold)
Canonical ID assigned
    ↓ (if no match)
Manual Review Item created
```

### Cross-References

Entities reference each other via canonical IDs:

```
┌─────────────────────────────────────────────────────────────┐
│ Game                                                        │
│   id: nba_2025_hou_okc_1021                                 │
│   home_team_id: nba_oklahoma_city_thunder ──────────────┐   │
│   away_team_id: nba_houston_rockets ────────────────┐   │   │
│   stadium_id: nba_paycom_center ────────────────┐   │   │   │
└─────────────────────────────────────────────────│───│───│───┘
                                                  │   │   │
┌─────────────────────────────────────────────────│───│───│───┐
│ Stadium                                         │   │   │   │
│   id: nba_paycom_center ◄───────────────────────┘   │   │   │
│   name: "Paycom Center"                             │   │   │
│   city: "Oklahoma City"                             │   │   │
│   latitude: 35.4634                                 │   │   │
│   longitude: -97.5151                               │   │   │
└─────────────────────────────────────────────────────│───│───┘
                                                      │   │
┌─────────────────────────────────────────────────────│───│───┐
│ Team                                                │   │   │
│   id: nba_houston_rockets ◄─────────────────────────┘   │   │
│   name: "Rockets"                                       │   │
│   city: "Houston"                                       │   │
│   stadium_id: nba_toyota_center                         │   │
└─────────────────────────────────────────────────────────│───┘
                                                          │
┌─────────────────────────────────────────────────────────│───┐
│ Team                                                    │   │
│   id: nba_oklahoma_city_thunder ◄───────────────────────┘   │
│   name: "Thunder"                                           │
│   city: "Oklahoma City"                                     │
│   stadium_id: nba_paycom_center                             │
└─────────────────────────────────────────────────────────────┘
```

### Alias Files

Aliases map variant names to canonical IDs:

**`team_aliases.json`**

```json
{
  "nba": {
    "LA Lakers": "nba_la_lakers",
    "Los Angeles Lakers": "nba_la_lakers",
    "LAL": "nba_la_lakers"
  }
}
```

**`stadium_aliases.json`**

```json
{
  "nba": {
    "Crypto.com Arena": "nba_crypto_com_arena",
    "Staples Center": "nba_crypto_com_arena",
    "STAPLES Center": "nba_crypto_com_arena"
  }
}
```

When a scraper returns a raw name like "LA Lakers", the resolver:

1. Checks `team_aliases.json` for an exact match → finds `nba_la_lakers`
2. If no exact match, runs fuzzy matching against all known team names
3. If fuzzy match confidence > 80%, uses that canonical ID
4. Otherwise, creates a manual review item for human resolution
|
||||
|
||||
## Adding a New Sport
|
||||
|
||||
To add support for a new sport (e.g., `cfb` for college football), update these files:
|
||||
|
||||
### 1. Configuration (`config.py`)
|
||||
|
||||
Add the sport to `SUPPORTED_SPORTS` and `EXPECTED_GAME_COUNTS`:
|
||||
|
||||
```python
|
||||
SUPPORTED_SPORTS: list[str] = [
|
||||
"nba", "mlb", "nfl", "nhl", "mls", "wnba", "nwsl",
|
||||
"cfb", # ← Add new sport
|
||||
]
|
||||
|
||||
EXPECTED_GAME_COUNTS: dict[str, int] = {
|
||||
# ... existing sports ...
|
||||
"cfb": 900, # ← Add expected game count for validation
|
||||
}
|
||||
```
|
||||
|
||||
### 2. Team Mappings (`normalizers/team_resolver.py`)
|
||||
|
||||
Add team definitions to `TEAM_MAPPINGS`. Each entry maps an abbreviation to `(canonical_id, full_name, city)`:
|
||||
|
||||
```python
|
||||
TEAM_MAPPINGS: dict[str, dict[str, tuple[str, str, str]]] = {
|
||||
# ... existing sports ...
|
||||
"cfb": {
|
||||
"ALA": ("team_cfb_ala", "Alabama Crimson Tide", "Tuscaloosa"),
|
||||
"OSU": ("team_cfb_osu", "Ohio State Buckeyes", "Columbus"),
|
||||
# ... all teams ...
|
||||
},
|
||||
}
|
||||
```
|
||||
|
### 3. Stadium Mappings (`normalizers/stadium_resolver.py`)

Add stadium definitions to `STADIUM_MAPPINGS`. Each entry is a `StadiumInfo` with coordinates:

```python
STADIUM_MAPPINGS: dict[str, dict[str, StadiumInfo]] = {
    # ... existing sports ...
    "cfb": {
        "stadium_cfb_bryant_denny": StadiumInfo(
            id="stadium_cfb_bryant_denny",
            name="Bryant-Denny Stadium",
            city="Tuscaloosa",
            state="AL",
            country="USA",
            sport="cfb",
            latitude=33.2083,
            longitude=-87.5503,
        ),
        # ... all stadiums ...
    },
}
```

### 4. Scraper Implementation (`scrapers/cfb.py`)

Create a new scraper class extending `BaseScraper`:

```python
from .base import BaseScraper, RawGameData, ScrapeResult


class CFBScraper(BaseScraper):
    def __init__(self, season: int, **kwargs):
        super().__init__("cfb", season, **kwargs)
        self._team_resolver = get_team_resolver("cfb")
        self._stadium_resolver = get_stadium_resolver("cfb")

    def _get_sources(self) -> list[str]:
        return ["espn", "sports_reference"]  # Priority order

    def _get_source_url(self, source: str, **kwargs) -> str:
        # Return URL for each source
        ...

    def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
        # Implement scraping logic
        ...

    def _normalize_games(self, raw_games: list[RawGameData]) -> tuple[list[Game], list[ManualReviewItem]]:
        # Convert raw data to Game objects using resolvers
        ...

    def scrape_teams(self) -> list[Team]:
        # Return Team objects from TEAM_MAPPINGS
        ...

    def scrape_stadiums(self) -> list[Stadium]:
        # Return Stadium objects from STADIUM_MAPPINGS
        ...


def create_cfb_scraper(season: int) -> CFBScraper:
    return CFBScraper(season=season)
```

### 5. Register Scraper (`scrapers/__init__.py`)

Export the new scraper:

```python
from .cfb import CFBScraper, create_cfb_scraper

__all__ = [
    # ... existing exports ...
    "CFBScraper",
    "create_cfb_scraper",
]
```

### 6. CLI Registration (`cli.py`)

Add the sport to `get_scraper()`:

```python
def get_scraper(sport: str, season: int):
    # ... existing sports ...
    elif sport == "cfb":
        from .scrapers.cfb import create_cfb_scraper
        return create_cfb_scraper(season)
```

### 7. Alias Files (`team_aliases.json`, `stadium_aliases.json`)

Add initial aliases for common name variants:

```json
// team_aliases.json
{
  "cfb": {
    "Alabama": "team_cfb_ala",
    "Bama": "team_cfb_ala",
    "Roll Tide": "team_cfb_ala"
  }
}

// stadium_aliases.json
{
  "cfb": {
    "Bryant Denny Stadium": "stadium_cfb_bryant_denny",
    "Bryant-Denny": "stadium_cfb_bryant_denny"
  }
}
```

### 8. Documentation (`SOURCES.md`)

Document data sources with URLs, rate limits, and notes:

```markdown
## CFB (College Football)

**Teams**: 134 (FBS)
**Expected Games**: ~900 per season
**Season**: August - January

### Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/football/college-football/scoreboard` | JSON |
| 2 | Sports-Reference | `sports-reference.com/cfb/years/{YEAR}-schedule.html` | HTML |
```

### 9. Tests (`tests/test_scrapers/test_cfb.py`)

Create tests for the new scraper:

```python
import pytest
from sportstime_parser.scrapers.cfb import CFBScraper, create_cfb_scraper


class TestCFBScraper:
    def test_factory_creates_scraper(self):
        scraper = create_cfb_scraper(season=2025)
        assert scraper.sport == "cfb"
        assert scraper.season == 2025

    def test_get_sources_returns_priority_list(self):
        scraper = CFBScraper(season=2025)
        sources = scraper._get_sources()
        assert "espn" in sources

    # ... more tests ...
```

### Checklist

- [ ] Add to `SUPPORTED_SPORTS` in `config.py`
- [ ] Add to `EXPECTED_GAME_COUNTS` in `config.py`
- [ ] Add team mappings to `team_resolver.py`
- [ ] Add stadium mappings to `stadium_resolver.py`
- [ ] Create `scrapers/{sport}.py` with scraper class
- [ ] Export in `scrapers/__init__.py`
- [ ] Register in `cli.py` `get_scraper()`
- [ ] Add aliases to `team_aliases.json`
- [ ] Add aliases to `stadium_aliases.json`
- [ ] Document sources in `SOURCES.md`
- [ ] Create tests in `tests/test_scrapers/`
- [ ] Run `pytest` to verify all tests pass
- [ ] Run dry-run scrape: `sportstime-parser scrape {sport} --season 2025 --dry-run`

## Development

### Running Tests

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=sportstime_parser --cov-report=html

# Run specific test file
pytest tests/test_scrapers/test_nba.py

# Run with verbose output
pytest -v
```

### Project Structure

```
sportstime_parser/
    __init__.py
    __main__.py              # CLI entry point
    cli.py                   # Subcommand definitions
    config.py                # Constants, defaults

    models/
        game.py              # Game dataclass
        team.py              # Team dataclass
        stadium.py           # Stadium dataclass
        aliases.py           # Alias dataclasses

    scrapers/
        base.py              # BaseScraper abstract class
        nba.py               # NBA scrapers
        mlb.py               # MLB scrapers
        nfl.py               # NFL scrapers
        nhl.py               # NHL scrapers
        mls.py               # MLS scrapers
        wnba.py              # WNBA scrapers
        nwsl.py              # NWSL scrapers

    normalizers/
        canonical_id.py      # ID generation
        team_resolver.py     # Team name resolution
        stadium_resolver.py  # Stadium name resolution
        timezone.py          # Timezone conversion
        fuzzy.py             # Fuzzy matching

    validators/
        report.py            # Validation report generator

    uploaders/
        cloudkit.py          # CloudKit Web Services client
        state.py             # Resumable upload state
        diff.py              # Record comparison

    utils/
        http.py              # Rate-limited HTTP client
        logging.py           # Verbose logger
        progress.py          # Progress bars
```

## Troubleshooting

### "No games file found"

Run the scrape command first:

```bash
sportstime-parser scrape nba --season 2025
```

### "CloudKit not configured"

Set the required environment variables:

```bash
export CLOUDKIT_KEY_ID="your_key_id"
export CLOUDKIT_PRIVATE_KEY_PATH="/path/to/key.p8"
```

### Rate limit errors

The scraper includes automatic rate limiting and exponential backoff. If you encounter persistent rate limit errors:

1. Wait a few minutes before retrying
2. Try scraping one sport at a time instead of "all"
3. Check that you're not running multiple instances

### Scrape fails with no data

1. Check your internet connection
2. Run with `--verbose` to see detailed error messages
3. The scraper tries multiple sources; if all fail, the source websites may be temporarily unavailable

## License

MIT

254
sportstime_parser/SOURCES.md
Normal file
@@ -0,0 +1,254 @@
# Data Sources

This document lists all data sources used by the SportsTime parser, including URLs, rate limits, and data freshness expectations.

## Source Priority

Each sport has multiple sources configured in priority order. The scraper tries each source in order and uses the first one that succeeds. If a source fails (network error, parsing error, etc.), it falls back to the next source.
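The try-in-order fallback described above can be sketched roughly as follows; `scrape_with_fallback` and the per-source `scrape` call are illustrative names for this document, not the actual `BaseScraper` API:

```python
def scrape_with_fallback(scraper, sources):
    """Try each source in priority order; return games from the first that succeeds."""
    errors = {}
    for source in sources:
        try:
            games = scraper.scrape(source)  # hypothetical per-source scrape call
            if games:
                return games
        except Exception as exc:  # network error, parsing error, etc.
            errors[source] = exc
    raise RuntimeError(f"All sources failed: {errors}")
```

The key design point is that a failure in one source is recorded but never fatal until every configured source has been tried.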

---

## NBA (National Basketball Association)

**Teams**: 30
**Expected Games**: ~1,230 per season
**Season**: October - June (spans two calendar years)

### Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | Basketball-Reference | `basketball-reference.com/leagues/NBA_{YEAR}_games-{month}.html` | HTML |
| 2 | ESPN API | `site.api.espn.com/apis/site/v2/sports/basketball/nba/scoreboard` | JSON |
| 3 | CBS Sports | `cbssports.com/nba/schedule/` | HTML |

### Rate Limits

- **Basketball-Reference**: ~1 request/second recommended
- **ESPN API**: No published limit, use 1 request/second to be safe
- **CBS Sports**: ~1 request/second recommended

### Notes

- Basketball-Reference is the most reliable source with complete historical data
- ESPN API is good for current/future seasons
- Games organized by month on Basketball-Reference

---

## MLB (Major League Baseball)

**Teams**: 30
**Expected Games**: ~2,430 per season
**Season**: March/April - October/November (single calendar year)

### Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | Baseball-Reference | `baseball-reference.com/leagues/majors/{YEAR}-schedule.shtml` | HTML |
| 2 | MLB Stats API | `statsapi.mlb.com/api/v1/schedule` | JSON |
| 3 | ESPN API | `site.api.espn.com/apis/site/v2/sports/baseball/mlb/scoreboard` | JSON |

### Rate Limits

- **Baseball-Reference**: ~1 request/second recommended
- **MLB Stats API**: No published limit, use 0.5 request/second
- **ESPN API**: ~1 request/second

### Notes

- MLB has doubleheaders; games are suffixed with `_1`, `_2`
- Single schedule page per season on Baseball-Reference
- MLB Stats API allows date range queries for efficiency

---

## NFL (National Football League)

**Teams**: 32
**Expected Games**: ~272 per season (regular season only)
**Season**: September - February (spans two calendar years)

### Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/football/nfl/scoreboard` | JSON |
| 2 | Pro-Football-Reference | `pro-football-reference.com/years/{YEAR}/games.htm` | HTML |
| 3 | CBS Sports | `cbssports.com/nfl/schedule/` | HTML |

### Rate Limits

- **ESPN API**: ~1 request/second
- **Pro-Football-Reference**: ~1 request/second
- **CBS Sports**: ~1 request/second

### Notes

- ESPN API uses week numbers instead of dates
- International games (London, Mexico City, Frankfurt, etc.) are filtered out
- Includes preseason, regular season, and playoffs

---

## NHL (National Hockey League)

**Teams**: 32 (including Utah Hockey Club)
**Expected Games**: ~1,312 per season
**Season**: October - June (spans two calendar years)

### Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | Hockey-Reference | `hockey-reference.com/leagues/NHL_{YEAR}_games.html` | HTML |
| 2 | NHL API | `api-web.nhle.com/v1/schedule/{date}` | JSON |
| 3 | ESPN API | `site.api.espn.com/apis/site/v2/sports/hockey/nhl/scoreboard` | JSON |

### Rate Limits

- **Hockey-Reference**: ~1 request/second
- **NHL API**: No published limit, use 0.5 request/second
- **ESPN API**: ~1 request/second

### Notes

- International games (Prague, Stockholm, Helsinki, etc.) are filtered out
- Single schedule page per season on Hockey-Reference

---

## MLS (Major League Soccer)

**Teams**: 30 (including San Diego FC)
**Expected Games**: ~493 per season
**Season**: February/March - October/November (single calendar year)

### Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/soccer/usa.1/scoreboard` | JSON |
| 2 | FBref | `fbref.com/en/comps/22/{YEAR}/schedule/` | HTML |

### Rate Limits

- **ESPN API**: ~1 request/second
- **FBref**: ~1 request/second

### Notes

- MLS runs within a single calendar year
- Some teams share stadiums with NFL teams

---

## WNBA (Women's National Basketball Association)

**Teams**: 13 (including Golden State Valkyries)
**Expected Games**: ~220 per season
**Season**: May - October (single calendar year)

### Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/basketball/wnba/scoreboard` | JSON |

### Rate Limits

- **ESPN API**: ~1 request/second

### Notes

- Many WNBA teams share arenas with NBA teams
- Teams and stadiums are hardcoded (smaller league)

---

## NWSL (National Women's Soccer League)

**Teams**: 14
**Expected Games**: ~182 per season
**Season**: March - November (single calendar year)

### Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/soccer/usa.nwsl/scoreboard` | JSON |

### Rate Limits

- **ESPN API**: ~1 request/second

### Notes

- Many NWSL teams share stadiums with MLS teams
- Teams and stadiums are hardcoded (smaller league)

---

## Stadium Data Sources

Stadium coordinates and metadata come from multiple sources:

| Sport | Sources |
|-------|---------|
| MLB | MLBScoreBot GitHub, cageyjames GeoJSON, hardcoded |
| NFL | NFLScoreBot GitHub, brianhatchl GeoJSON, hardcoded |
| NBA | Hardcoded |
| NHL | Hardcoded |
| MLS | gavinr GeoJSON, hardcoded |
| WNBA | Hardcoded (shared with NBA) |
| NWSL | Hardcoded (shared with MLS) |

---

## General Guidelines

### Rate Limiting

All scrapers implement:

1. **Default delay**: 1 second between requests
2. **Auto-detection**: Detects HTTP 429 (Too Many Requests) responses
3. **Exponential backoff**: Starts at 1 second, doubles up to 3 retries
4. **Connection pooling**: Reuses HTTP connections for efficiency
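The retry policy above (429 detection, 1-second initial backoff, doubling up to 3 retries) can be sketched as follows. This is illustrative; per the project structure, the package's real rate-limited HTTP client lives in `utils/http.py`, and `fetch_with_backoff` and its `fetch` callable are names invented for this sketch:

```python
import time


def fetch_with_backoff(fetch, url, max_retries=3, initial_backoff=1.0, factor=2.0):
    """Call fetch(url), retrying on HTTP 429 with exponential backoff."""
    delay = initial_backoff
    for attempt in range(max_retries + 1):
        response = fetch(url)
        if getattr(response, "status_code", None) != 429:
            return response  # success, or a non-rate-limit status the caller handles
        if attempt < max_retries:
            time.sleep(delay)  # 1s, then 2s, then 4s with the defaults above
            delay *= factor
    raise RuntimeError(f"Still rate limited after {max_retries} retries: {url}")
```

The default arguments mirror `DEFAULT_REQUEST_DELAY`, `MAX_RETRIES`, `BACKOFF_FACTOR`, and `INITIAL_BACKOFF` in `config.py`.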

### Error Handling

- **Partial data**: If a source fails mid-scrape, partial data is discarded
- **Source fallback**: Automatically tries the next source on failure
- **Logging**: All errors are logged for debugging

### Data Freshness

| Data Type | Freshness |
|-----------|-----------|
| Games (future) | Check weekly during season |
| Games (past) | Final scores available within hours |
| Teams | Update at start of each season |
| Stadiums | Update when venues change |

### Geographic Filter

Games at venues outside USA, Canada, and Mexico are automatically filtered out:

- **NFL**: London, Frankfurt, Munich, Mexico City, São Paulo
- **NHL**: Prague, Stockholm, Helsinki, Tampere, Gothenburg
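A minimal sketch of how such a filter can work, keyed on the stadium's country and using the same whitelist as `ALLOWED_COUNTRIES` in `config.py` (the `is_allowed_venue` helper and the sample game dicts are illustrative, not the parser's actual data shape):

```python
# Mirrors ALLOWED_COUNTRIES in config.py
ALLOWED_COUNTRIES = {"USA", "US", "United States", "Canada", "Mexico"}


def is_allowed_venue(stadium_country: str) -> bool:
    """Keep only games played at North American venues."""
    return stadium_country in ALLOWED_COUNTRIES


games = [
    {"home": "NYJ", "country": "USA"},
    {"home": "JAX", "country": "England"},  # London game, filtered out
]
domestic = [g for g in games if is_allowed_venue(g["country"])]
```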

---

## Legal Considerations

This tool is designed for personal/educational use. When using these sources:

1. Respect robots.txt files
2. Don't make excessive requests
3. Cache responses when possible
4. Check each source's Terms of Service
5. Consider that schedule data may be copyrighted

The ESPN API is undocumented but publicly accessible. Sports-Reference sites allow scraping but request reasonable rate limiting.
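Point 1 can be automated with the standard library's `urllib.robotparser`; this is a sketch of checking a robots.txt body before fetching a schedule page (the `can_scrape` helper is a name invented here, not part of the parser):

```python
from urllib.robotparser import RobotFileParser


def can_scrape(robots_txt: str, user_agent: str, page_url: str) -> bool:
    """Check a site's robots.txt rules before fetching a schedule page."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse an already-fetched robots.txt body
    return parser.can_fetch(user_agent, page_url)
```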
8
sportstime_parser/__init__.py
Normal file
@@ -0,0 +1,8 @@
"""SportsTime Parser - Sports data scraper and CloudKit uploader."""

__version__ = "0.1.0"
__author__ = "SportsTime Team"

from .cli import run_cli

__all__ = ["run_cli", "__version__"]
14
sportstime_parser/__main__.py
Normal file
@@ -0,0 +1,14 @@
"""Entry point for sportstime-parser CLI."""

import sys

from .cli import run_cli


def main() -> int:
    """Main entry point."""
    return run_cli()


if __name__ == "__main__":
    sys.exit(main())
1294
sportstime_parser/cli.py
Normal file
File diff suppressed because it is too large
59
sportstime_parser/config.py
Normal file
@@ -0,0 +1,59 @@
"""Configuration constants for sportstime-parser."""

from pathlib import Path

# Package paths
PACKAGE_DIR = Path(__file__).parent
SCRIPTS_DIR = PACKAGE_DIR.parent
OUTPUT_DIR = SCRIPTS_DIR / "output"
STATE_DIR = SCRIPTS_DIR / ".parser_state"

# Alias files (existing in Scripts/)
TEAM_ALIASES_FILE = SCRIPTS_DIR / "team_aliases.json"
STADIUM_ALIASES_FILE = SCRIPTS_DIR / "stadium_aliases.json"
LEAGUE_STRUCTURE_FILE = SCRIPTS_DIR / "league_structure.json"

# Supported sports
SUPPORTED_SPORTS: list[str] = [
    "nba",
    "mlb",
    "nfl",
    "nhl",
    "mls",
    "wnba",
    "nwsl",
]

# Default season (start year of the season, e.g., 2025 for 2025-26)
DEFAULT_SEASON: int = 2025

# CloudKit configuration
CLOUDKIT_CONTAINER_ID: str = "iCloud.com.sportstime.app"
CLOUDKIT_ENVIRONMENT: str = "development"
CLOUDKIT_BATCH_SIZE: int = 200
CLOUDKIT_KEY_ID: str = "152be0715e0276e31aaea5cbfe79dc872f298861a55c70fae14e5fe3e026cff9"
CLOUDKIT_PRIVATE_KEY_PATH: Path = SCRIPTS_DIR / "eckey.pem"

# Rate limiting
DEFAULT_REQUEST_DELAY: float = 1.0  # seconds between requests
MAX_RETRIES: int = 3
BACKOFF_FACTOR: float = 2.0  # exponential backoff multiplier
INITIAL_BACKOFF: float = 1.0  # initial backoff in seconds

# Expected game counts per sport (approximate, for validation)
# Updated 2026-01-20 based on 2025-26 season data
EXPECTED_GAME_COUNTS: dict[str, int] = {
    "nba": 1230,  # 30 teams × 82 games / 2
    "mlb": 2430,  # 30 teams × 162 games / 2 (regular season only)
    "nfl": 272,   # 32 teams × 17 games / 2 (regular season only)
    "nhl": 1312,  # 32 teams × 82 games / 2
    "mls": 540,   # 30 teams × varies (updated for 2025 expansion)
    "wnba": 286,  # 13 teams × 44 games / 2 (updated for 2025 expansion)
    "nwsl": 188,  # 14→16 teams × varies (updated for 2025 expansion)
}

# Minimum match score for fuzzy matching (0-100)
FUZZY_MATCH_THRESHOLD: int = 80

# Geographic filter (only include games in these countries)
ALLOWED_COUNTRIES: set[str] = {"USA", "US", "United States", "Canada", "Mexico"}
52
sportstime_parser/models/__init__.py
Normal file
@@ -0,0 +1,52 @@
"""Data models for sportstime-parser."""

from .game import Game, save_games, load_games
from .team import Team, save_teams, load_teams
from .stadium import Stadium, save_stadiums, load_stadiums
from .aliases import (
    AliasType,
    ReviewReason,
    TeamAlias,
    StadiumAlias,
    FuzzyMatch,
    ManualReviewItem,
)
from .sport import (
    Sport,
    LeagueStructure,
    LeagueStructureType,
    save_sports,
    load_sports,
    save_league_structures,
    load_league_structures,
)

__all__ = [
    # Game
    "Game",
    "save_games",
    "load_games",
    # Team
    "Team",
    "save_teams",
    "load_teams",
    # Stadium
    "Stadium",
    "save_stadiums",
    "load_stadiums",
    # Aliases
    "AliasType",
    "ReviewReason",
    "TeamAlias",
    "StadiumAlias",
    "FuzzyMatch",
    "ManualReviewItem",
    # Sport and League Structure
    "Sport",
    "LeagueStructure",
    "LeagueStructureType",
    "save_sports",
    "load_sports",
    "save_league_structures",
    "load_league_structures",
]
262
sportstime_parser/models/aliases.py
Normal file
@@ -0,0 +1,262 @@
"""Alias and manual review data models for sportstime-parser."""

from dataclasses import dataclass, field
from datetime import date, datetime
from enum import Enum
from typing import Optional
import json


class AliasType(Enum):
    """Type of team alias."""
    NAME = "name"
    ABBREVIATION = "abbreviation"
    CITY = "city"


class ReviewReason(Enum):
    """Reason an item requires manual review."""
    UNRESOLVED_TEAM = "unresolved_team"
    UNRESOLVED_STADIUM = "unresolved_stadium"
    LOW_CONFIDENCE_MATCH = "low_confidence_match"
    MISSING_DATA = "missing_data"
    DUPLICATE_GAME = "duplicate_game"
    TIMEZONE_UNKNOWN = "timezone_unknown"
    GEOGRAPHIC_FILTER = "geographic_filter"


@dataclass
class TeamAlias:
    """Represents a team alias with optional date validity.

    Attributes:
        id: Unique alias ID
        team_canonical_id: The canonical team ID this alias resolves to
        alias_type: Type of alias (name, abbreviation, city)
        alias_value: The alias value to match against
        valid_from: Start date of alias validity (None = always valid)
        valid_until: End date of alias validity (None = still valid)
    """

    id: str
    team_canonical_id: str
    alias_type: AliasType
    alias_value: str
    valid_from: Optional[date] = None
    valid_until: Optional[date] = None

    def is_valid_on(self, check_date: date) -> bool:
        """Check if this alias is valid on the given date."""
        if self.valid_from and check_date < self.valid_from:
            return False
        if self.valid_until and check_date > self.valid_until:
            return False
        return True

    def to_dict(self) -> dict:
        """Convert to dictionary for JSON serialization."""
        return {
            "id": self.id,
            "team_canonical_id": self.team_canonical_id,
            "alias_type": self.alias_type.value,
            "alias_value": self.alias_value,
            "valid_from": self.valid_from.isoformat() if self.valid_from else None,
            "valid_until": self.valid_until.isoformat() if self.valid_until else None,
        }

    @classmethod
    def from_dict(cls, data: dict) -> "TeamAlias":
        """Create a TeamAlias from a dictionary."""
        valid_from = None
        if data.get("valid_from"):
            valid_from = date.fromisoformat(data["valid_from"])

        valid_until = None
        if data.get("valid_until"):
            valid_until = date.fromisoformat(data["valid_until"])

        return cls(
            id=data["id"],
            team_canonical_id=data["team_canonical_id"],
            alias_type=AliasType(data["alias_type"]),
            alias_value=data["alias_value"],
            valid_from=valid_from,
            valid_until=valid_until,
        )


@dataclass
class StadiumAlias:
    """Represents a stadium alias with optional date validity.

    Attributes:
        alias_name: The alias name to match against (lowercase)
        stadium_canonical_id: The canonical stadium ID this alias resolves to
        valid_from: Start date of alias validity (None = always valid)
        valid_until: End date of alias validity (None = still valid)
    """

    alias_name: str
    stadium_canonical_id: str
    valid_from: Optional[date] = None
    valid_until: Optional[date] = None

    def is_valid_on(self, check_date: date) -> bool:
        """Check if this alias is valid on the given date."""
        if self.valid_from and check_date < self.valid_from:
            return False
        if self.valid_until and check_date > self.valid_until:
            return False
        return True

    def to_dict(self) -> dict:
        """Convert to dictionary for JSON serialization."""
        return {
            "alias_name": self.alias_name,
            "stadium_canonical_id": self.stadium_canonical_id,
            "valid_from": self.valid_from.isoformat() if self.valid_from else None,
            "valid_until": self.valid_until.isoformat() if self.valid_until else None,
        }

    @classmethod
    def from_dict(cls, data: dict) -> "StadiumAlias":
        """Create a StadiumAlias from a dictionary."""
        valid_from = None
        if data.get("valid_from"):
            valid_from = date.fromisoformat(data["valid_from"])

        valid_until = None
        if data.get("valid_until"):
            valid_until = date.fromisoformat(data["valid_until"])

        return cls(
            alias_name=data["alias_name"],
            stadium_canonical_id=data["stadium_canonical_id"],
            valid_from=valid_from,
            valid_until=valid_until,
        )


@dataclass
class FuzzyMatch:
    """Represents a fuzzy match suggestion with confidence score."""

    canonical_id: str
    canonical_name: str
    confidence: int  # 0-100

    def to_dict(self) -> dict:
        """Convert to dictionary for JSON serialization."""
        return {
            "canonical_id": self.canonical_id,
            "canonical_name": self.canonical_name,
            "confidence": self.confidence,
        }


@dataclass
class ManualReviewItem:
    """Represents an item requiring manual review.

    Attributes:
        id: Unique review item ID
        reason: Why this item needs review
        sport: Sport code
        raw_value: The original unresolved value
        context: Additional context about the issue
        source_url: URL of the source page
        suggested_matches: List of potential matches with confidence scores
        game_date: Date of the game (if applicable)
        created_at: When this review item was created
    """

    id: str
    reason: ReviewReason
    sport: str
    raw_value: str
    context: dict = field(default_factory=dict)
    source_url: Optional[str] = None
    suggested_matches: list[FuzzyMatch] = field(default_factory=list)
    game_date: Optional[date] = None
    created_at: datetime = field(default_factory=datetime.now)

    def to_dict(self) -> dict:
        """Convert to dictionary for JSON serialization."""
        return {
            "id": self.id,
            "reason": self.reason.value,
            "sport": self.sport,
            "raw_value": self.raw_value,
            "context": self.context,
            "source_url": self.source_url,
            "suggested_matches": [m.to_dict() for m in self.suggested_matches],
            "game_date": self.game_date.isoformat() if self.game_date else None,
            "created_at": self.created_at.isoformat(),
        }

    @classmethod
    def from_dict(cls, data: dict) -> "ManualReviewItem":
        """Create a ManualReviewItem from a dictionary."""
        game_date = None
        if data.get("game_date"):
            game_date = date.fromisoformat(data["game_date"])

        created_at = datetime.now()
        if data.get("created_at"):
            created_at = datetime.fromisoformat(data["created_at"])

        suggested_matches = []
        for match_data in data.get("suggested_matches", []):
            suggested_matches.append(FuzzyMatch(
                canonical_id=match_data["canonical_id"],
                canonical_name=match_data["canonical_name"],
                confidence=match_data["confidence"],
            ))

        return cls(
            id=data["id"],
            reason=ReviewReason(data["reason"]),
            sport=data["sport"],
            raw_value=data["raw_value"],
            context=data.get("context", {}),
            source_url=data.get("source_url"),
            suggested_matches=suggested_matches,
            game_date=game_date,
            created_at=created_at,
        )

    def to_markdown(self) -> str:
        """Generate markdown representation for validation report."""
        lines = [
            f"### {self.reason.value.replace('_', ' ').title()}: {self.raw_value}",
            "",
            f"**Sport**: {self.sport.upper()}",
        ]

        if self.game_date:
            lines.append(f"**Game Date**: {self.game_date.isoformat()}")

        if self.context:
            lines.append("")
            lines.append("**Context**:")
            for key, value in self.context.items():
                lines.append(f"- {key}: {value}")

        if self.suggested_matches:
            lines.append("")
            lines.append("**Suggested Matches**:")
            for i, match in enumerate(self.suggested_matches, 1):
                marker = " <- likely correct" if match.confidence >= 90 else ""
                lines.append(
                    f"{i}. `{match.canonical_id}` ({match.confidence}%){marker}"
                )

        if self.source_url:
            lines.append("")
            lines.append(f"**Source**: [{self.source_url}]({self.source_url})")

        lines.append("")
        lines.append("---")
        lines.append("")

        return "\n".join(lines)
183
sportstime_parser/models/game.py
Normal file
@@ -0,0 +1,183 @@
"""Game data model for sportstime-parser."""

from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional
from zoneinfo import ZoneInfo
import json


@dataclass
class Game:
    """Represents a game with all CloudKit fields.

    Attributes:
        id: Canonical game ID (e.g., 'nba_2025_hou_okc_1021')
        sport: Sport code (e.g., 'nba', 'mlb')
        season: Season start year (e.g., 2025 for 2025-26)
        home_team_id: Canonical home team ID
        away_team_id: Canonical away team ID
        stadium_id: Canonical stadium ID
        game_date: Game date/time in UTC
        game_number: Game number for doubleheaders (1 or 2), None for single games
        home_score: Final home team score (None if not played)
        away_score: Final away team score (None if not played)
        status: Game status ('scheduled', 'final', 'postponed', 'cancelled')
        source_url: URL of the source page for manual review
        raw_home_team: Original home team name from source (for debugging)
        raw_away_team: Original away team name from source (for debugging)
        raw_stadium: Original stadium name from source (for debugging)
    """

    id: str
    sport: str
    season: int
    home_team_id: str
    away_team_id: str
    stadium_id: str
    game_date: datetime
    game_number: Optional[int] = None
    home_score: Optional[int] = None
    away_score: Optional[int] = None
    status: str = "scheduled"
    source_url: Optional[str] = None
    raw_home_team: Optional[str] = None
    raw_away_team: Optional[str] = None
    raw_stadium: Optional[str] = None

    def to_dict(self) -> dict:
        """Convert to dictionary for JSON serialization."""
        return {
            "id": self.id,
            "sport": self.sport,
            "season": self.season,
            "home_team_id": self.home_team_id,
            "away_team_id": self.away_team_id,
            "stadium_id": self.stadium_id,
            "game_date": self.game_date.isoformat(),
            "game_number": self.game_number,
            "home_score": self.home_score,
            "away_score": self.away_score,
            "status": self.status,
            "source_url": self.source_url,
            "raw_home_team": self.raw_home_team,
            "raw_away_team": self.raw_away_team,
            "raw_stadium": self.raw_stadium,
        }

    def to_canonical_dict(
        self,
        stadium_timezone: str,
        is_playoff: bool = False,
        broadcast: Optional[str] = None,
    ) -> dict:
        """Convert to canonical dictionary format matching iOS app schema.

        Args:
            stadium_timezone: IANA timezone of the stadium (e.g., 'America/Chicago')
            is_playoff: Whether this is a playoff game
            broadcast: Broadcast network info (e.g., 'ESPN')

        Returns:
            Dictionary with field names matching JSONCanonicalGame in BootstrapService.swift
        """
        # Convert game_date to UTC
        if self.game_date.tzinfo is None:
            # Localize naive datetime to stadium timezone first
            local_tz = ZoneInfo(stadium_timezone)
            local_dt = self.game_date.replace(tzinfo=local_tz)
        else:
            local_dt = self.game_date

        utc_dt = local_dt.astimezone(ZoneInfo("UTC"))

        # Format season as string (e.g., 2025 -> "2025-26" for NBA/NHL, "2025" for MLB)
        sport_lower = self.sport.lower()
        if sport_lower in ("nba", "nhl"):
            season_str = f"{self.season}-{str(self.season + 1)[-2:]}"
        else:
            season_str = str(self.season)

        return {
            "canonical_id": self.id,
            "sport": self.sport.upper(),  # iOS Sport enum expects uppercase (e.g., "NFL")
            "season": season_str,
            "game_datetime_utc": utc_dt.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "home_team_canonical_id": self.home_team_id,
            "away_team_canonical_id": self.away_team_id,
            "stadium_canonical_id": self.stadium_id,
            "is_playoff": is_playoff,
            "broadcast_info": broadcast,
        }

    @classmethod
    def from_dict(cls, data: dict) -> "Game":
        """Create a Game from a dictionary (internal format)."""
        game_date = data["game_date"]
        if isinstance(game_date, str):
            game_date = datetime.fromisoformat(game_date)

        return cls(
            id=data["id"],
            sport=data["sport"],
            season=data["season"],
            home_team_id=data["home_team_id"],
            away_team_id=data["away_team_id"],
            stadium_id=data["stadium_id"],
            game_date=game_date,
            game_number=data.get("game_number"),
            home_score=data.get("home_score"),
            away_score=data.get("away_score"),
            status=data.get("status", "scheduled"),
            source_url=data.get("source_url"),
            raw_home_team=data.get("raw_home_team"),
            raw_away_team=data.get("raw_away_team"),
            raw_stadium=data.get("raw_stadium"),
        )

    @classmethod
|
||||
def from_canonical_dict(cls, data: dict) -> "Game":
|
||||
"""Create a Game from a canonical dictionary (iOS app format)."""
|
||||
# Handle 'Z' suffix (fromisoformat doesn't support it before Python 3.11)
|
||||
date_str = data["game_datetime_utc"].replace("Z", "+00:00")
|
||||
game_date = datetime.fromisoformat(date_str)
|
||||
|
||||
# Parse season string (e.g., "2025-26" -> 2025, or "2025" -> 2025)
|
||||
season_str = data["season"]
|
||||
season = int(season_str.split("-")[0])
|
||||
|
||||
return cls(
|
||||
id=data["canonical_id"],
|
||||
sport=data["sport"],
|
||||
season=season,
|
||||
home_team_id=data["home_team_canonical_id"],
|
||||
away_team_id=data["away_team_canonical_id"],
|
||||
stadium_id=data["stadium_canonical_id"],
|
||||
game_date=game_date,
|
||||
status="scheduled",
|
||||
)
|
||||
|
||||
def to_json(self) -> str:
|
||||
"""Serialize to JSON string."""
|
||||
return json.dumps(self.to_dict(), indent=2)
|
||||
|
||||
@classmethod
|
||||
def from_json(cls, json_str: str) -> "Game":
|
||||
"""Deserialize from JSON string."""
|
||||
return cls.from_dict(json.loads(json_str))
|
||||
|
||||
|
||||
def save_games(games: list[Game], filepath: str) -> None:
|
||||
"""Save a list of games to a JSON file."""
|
||||
with open(filepath, "w", encoding="utf-8") as f:
|
||||
json.dump([g.to_dict() for g in games], f, indent=2)
|
||||
|
||||
|
||||
def load_games(filepath: str) -> list[Game]:
|
||||
"""Load a list of games from a JSON file (auto-detects format)."""
|
||||
with open(filepath, "r", encoding="utf-8") as f:
|
||||
data = json.load(f)
|
||||
# Detect format: canonical has "canonical_id" and "game_datetime_utc", internal has "id"
|
||||
if data and "canonical_id" in data[0] and "game_datetime_utc" in data[0]:
|
||||
return [Game.from_canonical_dict(d) for d in data]
|
||||
return [Game.from_dict(d) for d in data]
|
||||
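The season formatting and naive-datetime localization done inside `Game.to_canonical_dict` can be exercised in isolation. The helper names below (`format_season`, `to_utc_string`) are illustrative stand-ins, not part of the package:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def format_season(sport: str, season: int) -> str:
    # NBA/NHL seasons span two calendar years, e.g. 2025 -> "2025-26".
    if sport.lower() in ("nba", "nhl"):
        return f"{season}-{str(season + 1)[-2:]}"
    return str(season)

def to_utc_string(local_naive: datetime, stadium_tz: str) -> str:
    # Attach the stadium timezone to the naive datetime, then convert to UTC.
    local_dt = local_naive.replace(tzinfo=ZoneInfo(stadium_tz))
    return local_dt.astimezone(ZoneInfo("UTC")).strftime("%Y-%m-%dT%H:%M:%SZ")

print(format_season("nba", 2025))  # 2025-26
print(format_season("mlb", 2025))  # 2025
print(to_utc_string(datetime(2025, 10, 21, 19, 0), "America/Chicago"))
```

Note that `datetime.replace(tzinfo=...)` asserts the wall-clock time is already in that zone; it does not shift the time the way `astimezone` does.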
157
sportstime_parser/models/sport.py
Normal file
@@ -0,0 +1,157 @@
"""Sport and LeagueStructure data models for sportstime-parser."""

from dataclasses import dataclass
from enum import Enum
from typing import Optional
import json


class LeagueStructureType(str, Enum):
    """Type of league structure element."""
    CONFERENCE = "conference"
    DIVISION = "division"
    LEAGUE = "league"


@dataclass
class Sport:
    """Represents a sport with all CloudKit fields.

    Attributes:
        id: Canonical sport ID (e.g., 'MLB', 'NBA')
        abbreviation: Sport abbreviation (e.g., 'MLB', 'NBA')
        display_name: Full display name (e.g., 'Major League Baseball')
        icon_name: SF Symbol name for the sport icon
        color_hex: Primary color as hex string (e.g., '#FF0000')
        season_start_month: Month number when season typically starts (1-12)
        season_end_month: Month number when season typically ends (1-12)
        is_active: Whether the sport is currently active/supported
    """

    id: str
    abbreviation: str
    display_name: str
    icon_name: str
    color_hex: str
    season_start_month: int
    season_end_month: int
    is_active: bool = True

    def to_dict(self) -> dict:
        """Convert to dictionary for JSON serialization."""
        return {
            "id": self.id,
            "abbreviation": self.abbreviation,
            "display_name": self.display_name,
            "icon_name": self.icon_name,
            "color_hex": self.color_hex,
            "season_start_month": self.season_start_month,
            "season_end_month": self.season_end_month,
            "is_active": self.is_active,
        }

    @classmethod
    def from_dict(cls, data: dict) -> "Sport":
        """Create a Sport from a dictionary."""
        return cls(
            id=data["id"],
            abbreviation=data["abbreviation"],
            display_name=data["display_name"],
            icon_name=data["icon_name"],
            color_hex=data["color_hex"],
            season_start_month=data["season_start_month"],
            season_end_month=data["season_end_month"],
            is_active=data.get("is_active", True),
        )

    def to_json(self) -> str:
        """Serialize to JSON string."""
        return json.dumps(self.to_dict(), indent=2)

    @classmethod
    def from_json(cls, json_str: str) -> "Sport":
        """Deserialize from JSON string."""
        return cls.from_dict(json.loads(json_str))


@dataclass
class LeagueStructure:
    """Represents a league structure element (conference, division, etc.).

    Attributes:
        id: Unique ID (e.g., 'nba_eastern', 'mlb_al_east')
        sport: Sport code (e.g., 'NBA', 'MLB')
        structure_type: Type of structure (conference, division, league)
        name: Full name (e.g., 'Eastern Conference', 'AL East')
        abbreviation: Optional abbreviation (e.g., 'East', 'ALE')
        parent_id: Parent structure ID (e.g., division's parent is conference)
        display_order: Order for display (0-indexed)
    """

    id: str
    sport: str
    structure_type: LeagueStructureType
    name: str
    abbreviation: Optional[str] = None
    parent_id: Optional[str] = None
    display_order: int = 0

    def to_dict(self) -> dict:
        """Convert to dictionary for JSON serialization."""
        return {
            "id": self.id,
            "sport": self.sport,
            "structure_type": self.structure_type.value,
            "name": self.name,
            "abbreviation": self.abbreviation,
            "parent_id": self.parent_id,
            "display_order": self.display_order,
        }

    @classmethod
    def from_dict(cls, data: dict) -> "LeagueStructure":
        """Create a LeagueStructure from a dictionary."""
        return cls(
            id=data["id"],
            sport=data["sport"],
            structure_type=LeagueStructureType(data["structure_type"]),
            name=data["name"],
            abbreviation=data.get("abbreviation"),
            parent_id=data.get("parent_id"),
            display_order=data.get("display_order", 0),
        )

    def to_json(self) -> str:
        """Serialize to JSON string."""
        return json.dumps(self.to_dict(), indent=2)

    @classmethod
    def from_json(cls, json_str: str) -> "LeagueStructure":
        """Deserialize from JSON string."""
        return cls.from_dict(json.loads(json_str))


def save_sports(sports: list[Sport], filepath: str) -> None:
    """Save a list of sports to a JSON file."""
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump([s.to_dict() for s in sports], f, indent=2)


def load_sports(filepath: str) -> list[Sport]:
    """Load a list of sports from a JSON file."""
    with open(filepath, "r", encoding="utf-8") as f:
        data = json.load(f)
    return [Sport.from_dict(d) for d in data]


def save_league_structures(structures: list[LeagueStructure], filepath: str) -> None:
    """Save a list of league structures to a JSON file."""
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump([s.to_dict() for s in structures], f, indent=2)


def load_league_structures(filepath: str) -> list[LeagueStructure]:
    """Load a list of league structures from a JSON file."""
    with open(filepath, "r", encoding="utf-8") as f:
        data = json.load(f)
    return [LeagueStructure.from_dict(d) for d in data]
154
sportstime_parser/models/stadium.py
Normal file
@@ -0,0 +1,154 @@
"""Stadium data model for sportstime-parser."""

from dataclasses import dataclass
from typing import Optional
import json


@dataclass
class Stadium:
    """Represents a stadium with all CloudKit fields.

    Attributes:
        id: Canonical stadium ID (e.g., 'stadium_nba_paycom_center')
        sport: Primary sport code (e.g., 'nba', 'mlb')
        name: Current stadium name (e.g., 'Paycom Center')
        city: City name (e.g., 'Oklahoma City')
        state: State/province code (e.g., 'OK', 'ON')
        country: Country code (e.g., 'USA', 'Canada')
        latitude: Latitude coordinate
        longitude: Longitude coordinate
        capacity: Seating capacity
        surface: Playing surface (e.g., 'grass', 'turf', 'hardwood')
        roof_type: Roof type (e.g., 'dome', 'retractable', 'open')
        opened_year: Year stadium opened
        image_url: URL to stadium image
        timezone: IANA timezone (e.g., 'America/Chicago')
    """

    id: str
    sport: str
    name: str
    city: str
    state: str
    country: str
    latitude: float
    longitude: float
    capacity: Optional[int] = None
    surface: Optional[str] = None
    roof_type: Optional[str] = None
    opened_year: Optional[int] = None
    image_url: Optional[str] = None
    timezone: Optional[str] = None

    def to_dict(self) -> dict:
        """Convert to dictionary for JSON serialization."""
        return {
            "id": self.id,
            "sport": self.sport,
            "name": self.name,
            "city": self.city,
            "state": self.state,
            "country": self.country,
            "latitude": self.latitude,
            "longitude": self.longitude,
            "capacity": self.capacity,
            "surface": self.surface,
            "roof_type": self.roof_type,
            "opened_year": self.opened_year,
            "image_url": self.image_url,
            "timezone": self.timezone,
        }

    def to_canonical_dict(self, primary_team_abbrevs: list[str] | None = None) -> dict:
        """Convert to canonical dictionary format matching iOS app schema.

        Args:
            primary_team_abbrevs: List of team abbreviations that play at this stadium.
                If None, defaults to empty list.

        Returns:
            Dictionary with field names matching JSONCanonicalStadium in BootstrapService.swift
        """
        return {
            "canonical_id": self.id,
            "name": self.name,
            "city": self.city,
            "state": self.state,
            "latitude": self.latitude,
            "longitude": self.longitude,
            "capacity": self.capacity if self.capacity is not None else 0,
            "sport": self.sport.upper(),  # iOS Sport enum expects uppercase (e.g., "NFL")
            "primary_team_abbrevs": primary_team_abbrevs or [],
            "year_opened": self.opened_year,
            "timezone_identifier": self.timezone,
            "image_url": self.image_url,
        }

    @classmethod
    def from_dict(cls, data: dict) -> "Stadium":
        """Create a Stadium from a dictionary (internal format)."""
        return cls(
            id=data["id"],
            sport=data["sport"],
            name=data["name"],
            city=data["city"],
            state=data["state"],
            country=data["country"],
            latitude=data["latitude"],
            longitude=data["longitude"],
            capacity=data.get("capacity"),
            surface=data.get("surface"),
            roof_type=data.get("roof_type"),
            opened_year=data.get("opened_year"),
            image_url=data.get("image_url"),
            timezone=data.get("timezone"),
        )

    @classmethod
    def from_canonical_dict(cls, data: dict) -> "Stadium":
        """Create a Stadium from a canonical dictionary (iOS app format)."""
        return cls(
            id=data["canonical_id"],
            sport=data["sport"],
            name=data["name"],
            city=data["city"],
            state=data["state"],
            country="USA",  # Canonical format doesn't include country
            latitude=data["latitude"],
            longitude=data["longitude"],
            capacity=data.get("capacity"),
            opened_year=data.get("year_opened"),
            image_url=data.get("image_url"),
            timezone=data.get("timezone_identifier"),
        )

    def to_json(self) -> str:
        """Serialize to JSON string."""
        return json.dumps(self.to_dict(), indent=2)

    @classmethod
    def from_json(cls, json_str: str) -> "Stadium":
        """Deserialize from JSON string."""
        return cls.from_dict(json.loads(json_str))

    def is_in_allowed_region(self) -> bool:
        """Check if stadium is in USA, Canada, or Mexico."""
        allowed = {"USA", "US", "United States", "Canada", "CA", "Mexico", "MX"}
        return self.country in allowed


def save_stadiums(stadiums: list[Stadium], filepath: str) -> None:
    """Save a list of stadiums to a JSON file."""
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump([s.to_dict() for s in stadiums], f, indent=2)


def load_stadiums(filepath: str) -> list[Stadium]:
    """Load a list of stadiums from a JSON file (auto-detects format)."""
    with open(filepath, "r", encoding="utf-8") as f:
        data = json.load(f)
    # Detect format: canonical has "canonical_id", internal has "id"
    if data and "canonical_id" in data[0]:
        return [Stadium.from_canonical_dict(d) for d in data]
    return [Stadium.from_dict(d) for d in data]
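The format auto-detection used by `load_stadiums` (and the analogous `load_*` helpers) hinges on a single key check on the first record. A standalone sketch, with a hypothetical `detect_format` helper that is not part of the package:

```python
import json

def detect_format(records: list[dict]) -> str:
    # Canonical (iOS bundle) records carry "canonical_id"; internal records carry "id".
    if records and "canonical_id" in records[0]:
        return "canonical"
    return "internal"

internal = json.loads('[{"id": "stadium_nba_paycom_center", "name": "Paycom Center"}]')
canonical = json.loads('[{"canonical_id": "stadium_nba_paycom_center", "name": "Paycom Center"}]')
print(detect_format(internal))   # internal
print(detect_format(canonical))  # canonical
```

Because only the first record is inspected, the scheme assumes a file never mixes the two formats; an empty list falls through to the internal parser, which is harmless since there is nothing to parse.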
177
sportstime_parser/models/team.py
Normal file
@@ -0,0 +1,177 @@
"""Team data model for sportstime-parser."""

from dataclasses import dataclass
from typing import Optional
import json


@dataclass
class Team:
    """Represents a team with all CloudKit fields.

    Attributes:
        id: Canonical team ID (e.g., 'team_nba_okc')
        sport: Sport code (e.g., 'nba', 'mlb')
        city: Team city (e.g., 'Oklahoma City')
        name: Team name (e.g., 'Thunder')
        full_name: Full team name (e.g., 'Oklahoma City Thunder')
        abbreviation: Official abbreviation (e.g., 'OKC')
        conference: Conference name (e.g., 'Western', 'American')
        division: Division name (e.g., 'Northwest', 'AL West')
        primary_color: Primary team color as hex (e.g., '#007AC1')
        secondary_color: Secondary team color as hex (e.g., '#EF3B24')
        logo_url: URL to team logo image
        stadium_id: Canonical ID of home stadium
    """

    id: str
    sport: str
    city: str
    name: str
    full_name: str
    abbreviation: str
    conference: Optional[str] = None
    division: Optional[str] = None
    primary_color: Optional[str] = None
    secondary_color: Optional[str] = None
    logo_url: Optional[str] = None
    stadium_id: Optional[str] = None

    def to_dict(self) -> dict:
        """Convert to dictionary for JSON serialization."""
        return {
            "id": self.id,
            "sport": self.sport,
            "city": self.city,
            "name": self.name,
            "full_name": self.full_name,
            "abbreviation": self.abbreviation,
            "conference": self.conference,
            "division": self.division,
            "primary_color": self.primary_color,
            "secondary_color": self.secondary_color,
            "logo_url": self.logo_url,
            "stadium_id": self.stadium_id,
        }

    def _make_qualified_id(self, name: Optional[str]) -> Optional[str]:
        """Convert a conference/division name to a qualified ID.

        Examples:
            "Eastern" → "nba_eastern"
            "AL West" → "mlb_al_west"
            "Southeast" → "nba_southeast"
        """
        if not name:
            return None
        # Lowercase, replace spaces with underscores
        normalized = name.lower().replace(" ", "_")
        return f"{self.sport.lower()}_{normalized}"

    def to_canonical_dict(self) -> dict:
        """Convert to canonical dictionary format matching iOS app schema.

        Returns:
            Dictionary with field names matching JSONCanonicalTeam in BootstrapService.swift
        """
        return {
            "canonical_id": self.id,
            "name": self.name,
            "abbreviation": self.abbreviation,
            "sport": self.sport.upper(),  # iOS Sport enum expects uppercase (e.g., "NFL")
            "city": self.city,
            "stadium_canonical_id": self.stadium_id or "",
            "conference_id": self._make_qualified_id(self.conference),
            "division_id": self._make_qualified_id(self.division),
            "primary_color": self.primary_color,
            "secondary_color": self.secondary_color,
        }

    @classmethod
    def from_dict(cls, data: dict) -> "Team":
        """Create a Team from a dictionary (internal format)."""
        return cls(
            id=data["id"],
            sport=data["sport"],
            city=data["city"],
            name=data["name"],
            full_name=data["full_name"],
            abbreviation=data["abbreviation"],
            conference=data.get("conference"),
            division=data.get("division"),
            primary_color=data.get("primary_color"),
            secondary_color=data.get("secondary_color"),
            logo_url=data.get("logo_url"),
            stadium_id=data.get("stadium_id"),
        )

    @staticmethod
    def _extract_name_from_qualified_id(qualified_id: Optional[str], sport: str) -> Optional[str]:
        """Extract the name portion from a qualified ID.

        Examples:
            "nba_eastern" → "Eastern"
            "mlb_al_west" → "AL West"
            "nba_southeast" → "Southeast"
        """
        if not qualified_id:
            return None
        # Remove sport prefix (e.g., "nba_" or "mlb_")
        prefix = f"{sport.lower()}_"
        if qualified_id.startswith(prefix):
            name = qualified_id[len(prefix):]
        else:
            name = qualified_id
        # Convert underscores to spaces and title case
        # Special handling for league abbreviations (AL, NL, etc.)
        parts = name.split("_")
        result_parts = []
        for part in parts:
            if part.upper() in ("AL", "NL", "AFC", "NFC"):
                result_parts.append(part.upper())
            else:
                result_parts.append(part.capitalize())
        return " ".join(result_parts)

    @classmethod
    def from_canonical_dict(cls, data: dict) -> "Team":
        """Create a Team from a canonical dictionary (iOS app format)."""
        sport = data["sport"].lower()
        return cls(
            id=data["canonical_id"],
            sport=data["sport"],
            city=data["city"],
            name=data["name"],
            full_name=f"{data['city']} {data['name']}",  # Reconstruct full_name
            abbreviation=data["abbreviation"],
            conference=cls._extract_name_from_qualified_id(data.get("conference_id"), sport),
            division=cls._extract_name_from_qualified_id(data.get("division_id"), sport),
            primary_color=data.get("primary_color"),
            secondary_color=data.get("secondary_color"),
            stadium_id=data.get("stadium_canonical_id"),
        )

    def to_json(self) -> str:
        """Serialize to JSON string."""
        return json.dumps(self.to_dict(), indent=2)

    @classmethod
    def from_json(cls, json_str: str) -> "Team":
        """Deserialize from JSON string."""
        return cls.from_dict(json.loads(json_str))


def save_teams(teams: list[Team], filepath: str) -> None:
    """Save a list of teams to a JSON file."""
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump([t.to_dict() for t in teams], f, indent=2)


def load_teams(filepath: str) -> list[Team]:
    """Load a list of teams from a JSON file (auto-detects format)."""
    with open(filepath, "r", encoding="utf-8") as f:
        data = json.load(f)
    # Detect format: canonical has "canonical_id", internal has "id"
    if data and "canonical_id" in data[0]:
        return [Team.from_canonical_dict(d) for d in data]
    return [Team.from_dict(d) for d in data]
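The qualified-ID scheme in `Team._make_qualified_id` / `Team._extract_name_from_qualified_id` is a lossy-looking but reversible round trip for the names it handles. A standalone sketch (free functions standing in for the methods, not part of the package):

```python
def make_qualified_id(sport: str, name: str) -> str:
    # "AL West" + "MLB" -> "mlb_al_west": lowercase, spaces to underscores, sport prefix.
    return f"{sport.lower()}_{name.lower().replace(' ', '_')}"

def extract_name(qualified_id: str, sport: str) -> str:
    # Strip the sport prefix, then restore casing; known league
    # abbreviations (AL, NL, AFC, NFC) stay fully uppercase.
    prefix = f"{sport.lower()}_"
    name = qualified_id[len(prefix):] if qualified_id.startswith(prefix) else qualified_id
    return " ".join(
        p.upper() if p.upper() in ("AL", "NL", "AFC", "NFC") else p.capitalize()
        for p in name.split("_")
    )

print(make_qualified_id("MLB", "AL West"))  # mlb_al_west
print(extract_name("mlb_al_west", "MLB"))   # AL West
print(extract_name("nba_eastern", "NBA"))   # Eastern
```

The round trip only holds for names whose casing is title case or a listed abbreviation; a name like "MLS is Back" would not survive the trip, which is why the abbreviation whitelist exists.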
91
sportstime_parser/normalizers/__init__.py
Normal file
@@ -0,0 +1,91 @@
"""Normalizers for team, stadium, and game data."""

from .canonical_id import (
    generate_game_id,
    generate_team_id,
    generate_team_id_from_abbrev,
    generate_stadium_id,
    parse_game_id,
    normalize_string,
)
from .timezone import (
    TimezoneResult,
    parse_datetime,
    convert_to_utc,
    detect_timezone_from_string,
    detect_timezone_from_location,
    get_stadium_timezone,
    create_timezone_warning,
)
from .fuzzy import (
    MatchCandidate,
    fuzzy_match_team,
    fuzzy_match_stadium,
    exact_match,
    best_match,
    calculate_similarity,
    normalize_for_matching,
)
from .alias_loader import (
    TeamAliasLoader,
    StadiumAliasLoader,
    get_team_alias_loader,
    get_stadium_alias_loader,
    resolve_team_alias,
    resolve_stadium_alias,
)
from .team_resolver import (
    TeamResolver,
    TeamResolveResult,
    get_team_resolver,
    resolve_team,
)
from .stadium_resolver import (
    StadiumResolver,
    StadiumResolveResult,
    get_stadium_resolver,
    resolve_stadium,
)

__all__ = [
    # Canonical ID
    "generate_game_id",
    "generate_team_id",
    "generate_team_id_from_abbrev",
    "generate_stadium_id",
    "parse_game_id",
    "normalize_string",
    # Timezone
    "TimezoneResult",
    "parse_datetime",
    "convert_to_utc",
    "detect_timezone_from_string",
    "detect_timezone_from_location",
    "get_stadium_timezone",
    "create_timezone_warning",
    # Fuzzy matching
    "MatchCandidate",
    "fuzzy_match_team",
    "fuzzy_match_stadium",
    "exact_match",
    "best_match",
    "calculate_similarity",
    "normalize_for_matching",
    # Alias loaders
    "TeamAliasLoader",
    "StadiumAliasLoader",
    "get_team_alias_loader",
    "get_stadium_alias_loader",
    "resolve_team_alias",
    "resolve_stadium_alias",
    # Team resolver
    "TeamResolver",
    "TeamResolveResult",
    "get_team_resolver",
    "resolve_team",
    # Stadium resolver
    "StadiumResolver",
    "StadiumResolveResult",
    "get_stadium_resolver",
    "resolve_stadium",
]
312
sportstime_parser/normalizers/alias_loader.py
Normal file
@@ -0,0 +1,312 @@
|
||||
"""Alias file loaders for team and stadium name resolution."""
|
||||
|
||||
import json
|
||||
from datetime import date
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
from ..config import TEAM_ALIASES_FILE, STADIUM_ALIASES_FILE
|
||||
from ..models.aliases import TeamAlias, StadiumAlias, AliasType
|
||||
|
||||
|
||||
class TeamAliasLoader:
|
||||
"""Loader for team aliases with date-aware resolution.
|
||||
|
||||
Loads team aliases from JSON and provides lookup methods
|
||||
with support for historical name changes.
|
||||
"""
|
||||
|
||||
def __init__(self, filepath: Optional[Path] = None):
|
||||
"""Initialize the loader.
|
||||
|
||||
Args:
|
||||
filepath: Path to team_aliases.json, defaults to config value
|
||||
"""
|
||||
self.filepath = filepath or TEAM_ALIASES_FILE
|
||||
self._aliases: list[TeamAlias] = []
|
||||
self._by_value: dict[str, list[TeamAlias]] = {}
|
||||
self._by_team: dict[str, list[TeamAlias]] = {}
|
||||
self._loaded = False
|
||||
|
||||
def load(self) -> None:
|
||||
"""Load aliases from the JSON file."""
|
||||
if not self.filepath.exists():
|
||||
self._loaded = True
|
||||
return
|
||||
|
||||
with open(self.filepath, "r", encoding="utf-8") as f:
|
||||
data = json.load(f)
|
||||
|
||||
self._aliases = []
|
||||
self._by_value = {}
|
||||
self._by_team = {}
|
||||
|
||||
for item in data:
|
||||
alias = TeamAlias.from_dict(item)
|
||||
self._aliases.append(alias)
|
||||
|
||||
# Index by lowercase value
|
||||
value_key = alias.alias_value.lower()
|
||||
if value_key not in self._by_value:
|
||||
self._by_value[value_key] = []
|
||||
self._by_value[value_key].append(alias)
|
||||
|
||||
# Index by team ID
|
||||
if alias.team_canonical_id not in self._by_team:
|
||||
self._by_team[alias.team_canonical_id] = []
|
||||
self._by_team[alias.team_canonical_id].append(alias)
|
||||
|
||||
self._loaded = True
|
||||
|
||||
def _ensure_loaded(self) -> None:
|
||||
"""Ensure aliases are loaded."""
|
||||
if not self._loaded:
|
||||
self.load()
|
||||
|
||||
def resolve(
|
||||
self,
|
||||
value: str,
|
||||
check_date: Optional[date] = None,
|
||||
alias_types: Optional[list[AliasType]] = None,
|
||||
) -> Optional[str]:
|
||||
"""Resolve an alias value to a canonical team ID.
|
||||
|
||||
Args:
|
||||
value: Alias value to look up (case-insensitive)
|
||||
check_date: Date to check validity (None = current date)
|
||||
alias_types: Types of aliases to check (None = all types)
|
||||
|
||||
Returns:
|
||||
Canonical team ID if found, None otherwise
|
||||
"""
|
||||
self._ensure_loaded()
|
||||
|
||||
if check_date is None:
|
||||
check_date = date.today()
|
||||
|
||||
value_key = value.lower().strip()
|
||||
aliases = self._by_value.get(value_key, [])
|
||||
|
||||
for alias in aliases:
|
||||
# Check type filter
|
||||
if alias_types and alias.alias_type not in alias_types:
|
||||
continue
|
||||
|
||||
# Check date validity
|
||||
if alias.is_valid_on(check_date):
|
||||
return alias.team_canonical_id
|
||||
|
||||
return None
|
||||
|
||||
def get_aliases_for_team(
|
||||
self,
|
||||
team_id: str,
|
||||
check_date: Optional[date] = None,
|
||||
) -> list[TeamAlias]:
|
||||
"""Get all aliases for a team.
|
||||
|
||||
Args:
|
||||
team_id: Canonical team ID
|
||||
check_date: Date to filter by (None = all aliases)
|
||||
|
||||
Returns:
|
||||
List of TeamAlias objects
|
||||
"""
|
||||
self._ensure_loaded()
|
||||
|
||||
aliases = self._by_team.get(team_id, [])
|
||||
|
||||
if check_date:
|
||||
aliases = [a for a in aliases if a.is_valid_on(check_date)]
|
||||
|
||||
return aliases
|
||||
|
||||
def get_all_values(
|
||||
self,
|
||||
alias_type: Optional[AliasType] = None,
|
||||
) -> list[str]:
|
||||
"""Get all alias values.
|
||||
|
||||
Args:
|
||||
alias_type: Filter by alias type (None = all types)
|
||||
|
||||
Returns:
|
||||
List of alias values
|
||||
"""
|
||||
self._ensure_loaded()
|
||||
|
||||
values = []
|
||||
for alias in self._aliases:
|
||||
if alias_type is None or alias.alias_type == alias_type:
|
||||
values.append(alias.alias_value)
|
||||
|
||||
return values
|
||||
|
||||
|
||||
class StadiumAliasLoader:
|
||||
"""Loader for stadium aliases with date-aware resolution.
|
||||
|
||||
Loads stadium aliases from JSON and provides lookup methods
|
||||
with support for historical name changes (e.g., naming rights).
|
||||
"""
|
||||
|
||||
def __init__(self, filepath: Optional[Path] = None):
|
||||
"""Initialize the loader.
|
||||
|
||||
Args:
|
||||
filepath: Path to stadium_aliases.json, defaults to config value
|
||||
"""
|
||||
self.filepath = filepath or STADIUM_ALIASES_FILE
|
||||
self._aliases: list[StadiumAlias] = []
|
||||
self._by_name: dict[str, list[StadiumAlias]] = {}
|
||||
self._by_stadium: dict[str, list[StadiumAlias]] = {}
|
||||
self._loaded = False
|
||||
|
||||
def load(self) -> None:
|
||||
"""Load aliases from the JSON file."""
|
||||
if not self.filepath.exists():
|
||||
self._loaded = True
|
||||
return
|
||||
|
||||
with open(self.filepath, "r", encoding="utf-8") as f:
|
||||
data = json.load(f)
|
||||
|
||||
self._aliases = []
|
||||
self._by_name = {}
|
||||
self._by_stadium = {}
|
||||
|
||||
for item in data:
|
||||
alias = StadiumAlias.from_dict(item)
|
||||
self._aliases.append(alias)
|
||||
|
||||
# Index by lowercase name
|
||||
name_key = alias.alias_name.lower()
|
||||
if name_key not in self._by_name:
|
||||
self._by_name[name_key] = []
|
||||
self._by_name[name_key].append(alias)
|
||||
|
||||
# Index by stadium ID
|
||||
            if alias.stadium_canonical_id not in self._by_stadium:
                self._by_stadium[alias.stadium_canonical_id] = []
            self._by_stadium[alias.stadium_canonical_id].append(alias)

        self._loaded = True

    def _ensure_loaded(self) -> None:
        """Ensure aliases are loaded."""
        if not self._loaded:
            self.load()

    def resolve(
        self,
        name: str,
        check_date: Optional[date] = None,
    ) -> Optional[str]:
        """Resolve a stadium name to a canonical stadium ID.

        Args:
            name: Stadium name to look up (case-insensitive)
            check_date: Date to check validity (None = current date)

        Returns:
            Canonical stadium ID if found, None otherwise
        """
        self._ensure_loaded()

        if check_date is None:
            check_date = date.today()

        name_key = name.lower().strip()
        aliases = self._by_name.get(name_key, [])

        for alias in aliases:
            if alias.is_valid_on(check_date):
                return alias.stadium_canonical_id

        return None

    def get_aliases_for_stadium(
        self,
        stadium_id: str,
        check_date: Optional[date] = None,
    ) -> list[StadiumAlias]:
        """Get all aliases for a stadium.

        Args:
            stadium_id: Canonical stadium ID
            check_date: Date to filter by (None = all aliases)

        Returns:
            List of StadiumAlias objects
        """
        self._ensure_loaded()

        aliases = self._by_stadium.get(stadium_id, [])

        if check_date:
            aliases = [a for a in aliases if a.is_valid_on(check_date)]

        return aliases

    def get_all_names(self) -> list[str]:
        """Get all stadium alias names.

        Returns:
            List of stadium names
        """
        self._ensure_loaded()

        return [alias.alias_name for alias in self._aliases]


# Global loader instances (lazy initialized)
_team_alias_loader: Optional[TeamAliasLoader] = None
_stadium_alias_loader: Optional[StadiumAliasLoader] = None


def get_team_alias_loader() -> TeamAliasLoader:
    """Get the global team alias loader instance."""
    global _team_alias_loader
    if _team_alias_loader is None:
        _team_alias_loader = TeamAliasLoader()
    return _team_alias_loader


def get_stadium_alias_loader() -> StadiumAliasLoader:
    """Get the global stadium alias loader instance."""
    global _stadium_alias_loader
    if _stadium_alias_loader is None:
        _stadium_alias_loader = StadiumAliasLoader()
    return _stadium_alias_loader


def resolve_team_alias(
    value: str,
    check_date: Optional[date] = None,
) -> Optional[str]:
    """Convenience function to resolve a team alias.

    Args:
        value: Alias value (name, abbreviation, or city)
        check_date: Date to check validity

    Returns:
        Canonical team ID if found
    """
    return get_team_alias_loader().resolve(value, check_date)


def resolve_stadium_alias(
    name: str,
    check_date: Optional[date] = None,
) -> Optional[str]:
    """Convenience function to resolve a stadium alias.

    Args:
        name: Stadium name
        check_date: Date to check validity

    Returns:
        Canonical stadium ID if found
    """
    return get_stadium_alias_loader().resolve(name, check_date)
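The dict-of-lists index that `load()` builds above can be sketched in isolation. `StadiumAlias` here is a stripped-down stand-in for the real model in `models/aliases.py`, keeping only the two fields the index needs:

```python
from dataclasses import dataclass


@dataclass
class StadiumAlias:
    # Stand-in for the real model; only the fields the index needs.
    alias_name: str
    stadium_canonical_id: str


aliases = [
    StadiumAlias("The Trop", "stadium_mlb_tropicana_field"),
    StadiumAlias("Tropicana Field", "stadium_mlb_tropicana_field"),
]

# Same shape as StadiumAliasLoader._by_stadium: one list of aliases per stadium ID.
by_stadium: dict[str, list[StadiumAlias]] = {}
for alias in aliases:
    by_stadium.setdefault(alias.stadium_canonical_id, []).append(alias)

print(sorted(a.alias_name for a in by_stadium["stadium_mlb_tropicana_field"]))
# ['The Trop', 'Tropicana Field']
```

`dict.setdefault` collapses the check-then-create pattern into one call; `collections.defaultdict(list)` would work equally well.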
284
sportstime_parser/normalizers/canonical_id.py
Normal file
@@ -0,0 +1,284 @@
"""Canonical ID generation for games, teams, and stadiums."""

import re
import unicodedata
from datetime import date, datetime
from typing import Optional


def normalize_string(s: str) -> str:
    """Normalize a string for use in canonical IDs.

    - Convert to lowercase
    - Replace spaces and hyphens with underscores
    - Remove special characters (except underscores)
    - Collapse multiple underscores
    - Strip leading/trailing underscores

    Args:
        s: String to normalize

    Returns:
        Normalized string suitable for IDs
    """
    # Convert to lowercase
    result = s.lower()

    # Normalize unicode (e.g., é -> e)
    result = unicodedata.normalize("NFKD", result)
    result = result.encode("ascii", "ignore").decode("ascii")

    # Replace spaces and hyphens with underscores
    result = re.sub(r"[\s\-]+", "_", result)

    # Remove special characters except underscores
    result = re.sub(r"[^a-z0-9_]", "", result)

    # Collapse multiple underscores
    result = re.sub(r"_+", "_", result)

    # Strip leading/trailing underscores
    result = result.strip("_")

    return result


def generate_game_id(
    sport: str,
    season: int,
    away_abbrev: str,
    home_abbrev: str,
    game_date: date | datetime,
    game_number: Optional[int] = None,
) -> str:
    """Generate a canonical game ID.

    Format: game_{sport}_{season}_{YYYYMMDD}_{away}_{home}[_{game_number}]

    Args:
        sport: Sport code (e.g., 'nba', 'mlb')
        season: Season start year (e.g., 2025 for 2025-26)
        away_abbrev: Away team abbreviation (e.g., 'HOU')
        home_abbrev: Home team abbreviation (e.g., 'OKC')
        game_date: Date of the game
        game_number: Game number for doubleheaders (1 or 2), None for single games

    Returns:
        Canonical game ID (e.g., 'game_nba_2025_20251021_hou_okc')

    Examples:
        >>> generate_game_id('nba', 2025, 'HOU', 'OKC', date(2025, 10, 21))
        'game_nba_2025_20251021_hou_okc'

        >>> generate_game_id('mlb', 2026, 'NYY', 'BOS', date(2026, 4, 1), game_number=1)
        'game_mlb_2026_20260401_nyy_bos_1'
    """
    # Normalize sport and abbreviations
    sport_norm = sport.lower()
    away_norm = away_abbrev.lower()
    home_norm = home_abbrev.lower()

    # Format date as YYYYMMDD
    if isinstance(game_date, datetime):
        game_date = game_date.date()
    date_str = game_date.strftime("%Y%m%d")

    # Build ID with game_ prefix
    parts = ["game", sport_norm, str(season), date_str, away_norm, home_norm]

    # Add game number for doubleheaders
    if game_number is not None:
        parts.append(str(game_number))

    return "_".join(parts)


def generate_team_id(sport: str, city: str, name: str) -> str:
    """Generate a canonical team ID.

    Format: team_{sport}_{city}_{name}

    For most teams, we use the standard abbreviation (see
    generate_team_id_from_abbrev). This function generates a fallback ID based
    on city and name for teams without a known abbreviation.

    Args:
        sport: Sport code (e.g., 'nba', 'mlb')
        city: Team city (e.g., 'Los Angeles')
        name: Team name (e.g., 'Lakers')

    Returns:
        Canonical team ID (e.g., 'team_nba_los_angeles_lakers')

    Examples:
        >>> generate_team_id('nba', 'Los Angeles', 'Lakers')
        'team_nba_los_angeles_lakers'

        >>> generate_team_id('mlb', 'New York', 'Yankees')
        'team_mlb_new_york_yankees'
    """
    sport_norm = sport.lower()
    city_norm = normalize_string(city)
    name_norm = normalize_string(name)

    return f"team_{sport_norm}_{city_norm}_{name_norm}"


def generate_team_id_from_abbrev(sport: str, abbreviation: str) -> str:
    """Generate a canonical team ID from abbreviation.

    Format: team_{sport}_{abbreviation}

    Args:
        sport: Sport code (e.g., 'nba', 'mlb')
        abbreviation: Team abbreviation (e.g., 'LAL', 'NYY')

    Returns:
        Canonical team ID (e.g., 'team_nba_lal')

    Examples:
        >>> generate_team_id_from_abbrev('nba', 'LAL')
        'team_nba_lal'

        >>> generate_team_id_from_abbrev('mlb', 'NYY')
        'team_mlb_nyy'
    """
    sport_norm = sport.lower()
    abbrev_norm = abbreviation.lower()

    return f"team_{sport_norm}_{abbrev_norm}"


def generate_stadium_id(sport: str, name: str) -> str:
    """Generate a canonical stadium ID.

    Format: stadium_{sport}_{normalized_name}

    Args:
        sport: Sport code (e.g., 'nba', 'mlb')
        name: Stadium name (e.g., 'Yankee Stadium')

    Returns:
        Canonical stadium ID (e.g., 'stadium_mlb_yankee_stadium')

    Examples:
        >>> generate_stadium_id('nba', 'Crypto.com Arena')
        'stadium_nba_cryptocom_arena'

        >>> generate_stadium_id('mlb', 'Yankee Stadium')
        'stadium_mlb_yankee_stadium'
    """
    sport_norm = sport.lower()
    name_norm = normalize_string(name)

    return f"stadium_{sport_norm}_{name_norm}"


def parse_game_id(game_id: str) -> dict:
    """Parse a canonical game ID into its components.

    Args:
        game_id: Canonical game ID (e.g., 'game_nba_2025_20251021_hou_okc')

    Returns:
        Dictionary with keys: sport, season, away_abbrev, home_abbrev,
        year, month, day, game_number (optional)

    Raises:
        ValueError: If game_id format is invalid

    Examples:
        >>> parse_game_id('game_nba_2025_20251021_hou_okc')
        {'sport': 'nba', 'season': 2025, 'away_abbrev': 'hou',
         'home_abbrev': 'okc', 'year': 2025, 'month': 10, 'day': 21, 'game_number': None}

        >>> parse_game_id('game_mlb_2026_20260401_nyy_bos_1')
        {'sport': 'mlb', 'season': 2026, 'away_abbrev': 'nyy',
         'home_abbrev': 'bos', 'year': 2026, 'month': 4, 'day': 1, 'game_number': 1}
    """
    parts = game_id.split("_")

    if len(parts) < 6 or len(parts) > 7:
        raise ValueError(f"Invalid game ID format: {game_id}")

    if parts[0] != "game":
        raise ValueError(f"Game ID must start with 'game_': {game_id}")

    sport = parts[1]
    season = int(parts[2])
    date_str = parts[3]
    away_abbrev = parts[4]
    home_abbrev = parts[5]

    if len(date_str) != 8:
        raise ValueError(f"Invalid date format in game ID: {game_id}")

    year = int(date_str[:4])
    month = int(date_str[4:6])
    day = int(date_str[6:])

    game_number = None
    if len(parts) == 7:
        game_number = int(parts[6])

    return {
        "sport": sport,
        "season": season,
        "away_abbrev": away_abbrev,
        "home_abbrev": home_abbrev,
        "year": year,
        "month": month,
        "day": day,
        "game_number": game_number,
    }


def parse_team_id(team_id: str) -> dict:
    """Parse a canonical team ID into its components.

    Args:
        team_id: Canonical team ID (e.g., 'team_nba_lal')

    Returns:
        Dictionary with keys: sport, identifier (abbreviation or city_name)

    Raises:
        ValueError: If team_id format is invalid
    """
    if not team_id.startswith("team_"):
        raise ValueError(f"Invalid team ID format: {team_id}")

    parts = team_id.split("_", 2)

    if len(parts) < 3:
        raise ValueError(f"Invalid team ID format: {team_id}")

    return {
        "sport": parts[1],
        "identifier": parts[2],
    }


def parse_stadium_id(stadium_id: str) -> dict:
    """Parse a canonical stadium ID into its components.

    Args:
        stadium_id: Canonical stadium ID (e.g., 'stadium_nba_paycom_center')

    Returns:
        Dictionary with keys: sport, name

    Raises:
        ValueError: If stadium_id format is invalid
    """
    if not stadium_id.startswith("stadium_"):
        raise ValueError(f"Invalid stadium ID format: {stadium_id}")

    parts = stadium_id.split("_", 2)

    if len(parts) < 3:
        raise ValueError(f"Invalid stadium ID format: {stadium_id}")

    return {
        "sport": parts[1],
        "name": parts[2],
    }
272
sportstime_parser/normalizers/fuzzy.py
Normal file
@@ -0,0 +1,272 @@
"""Fuzzy string matching utilities for team and stadium name resolution."""

from dataclasses import dataclass
from typing import Optional

from rapidfuzz import fuzz, process
from rapidfuzz.utils import default_process

from ..config import FUZZY_MATCH_THRESHOLD
from ..models.aliases import FuzzyMatch


@dataclass
class MatchCandidate:
    """A candidate for fuzzy matching.

    Attributes:
        canonical_id: The canonical ID of this candidate
        name: The display name for this candidate
        aliases: List of alternative names to match against
    """

    canonical_id: str
    name: str
    aliases: list[str]


def normalize_for_matching(s: str) -> str:
    """Normalize a string for fuzzy matching.

    - Convert to lowercase
    - Remove common prefixes/suffixes
    - Collapse whitespace

    Args:
        s: String to normalize

    Returns:
        Normalized string
    """
    result = s.lower().strip()

    # Remove common prefixes
    prefixes = ["the ", "team ", "stadium "]
    for prefix in prefixes:
        if result.startswith(prefix):
            result = result[len(prefix) :]

    # Remove common suffixes
    suffixes = [" stadium", " arena", " center", " field", " park"]
    for suffix in suffixes:
        if result.endswith(suffix):
            result = result[: -len(suffix)]

    return result.strip()


def fuzzy_match_team(
    query: str,
    candidates: list[MatchCandidate],
    threshold: int = FUZZY_MATCH_THRESHOLD,
    top_n: int = 3,
) -> list[FuzzyMatch]:
    """Find fuzzy matches for a team name.

    Uses multiple matching strategies:
    1. Token set ratio (handles word order differences)
    2. Partial ratio (handles substring matches)
    3. Standard ratio (overall similarity)

    Args:
        query: Team name to match
        candidates: List of candidate teams to match against
        threshold: Minimum score to consider a match (0-100)
        top_n: Maximum number of matches to return

    Returns:
        List of FuzzyMatch objects sorted by confidence (descending)
    """
    query_norm = normalize_for_matching(query)

    # Build list of all matchable strings with their canonical IDs
    match_strings: list[tuple[str, str, str]] = []  # (string, canonical_id, name)

    for candidate in candidates:
        # Add primary name
        match_strings.append(
            (normalize_for_matching(candidate.name), candidate.canonical_id, candidate.name)
        )
        # Add aliases
        for alias in candidate.aliases:
            match_strings.append(
                (normalize_for_matching(alias), candidate.canonical_id, candidate.name)
            )

    # Score all candidates
    scored: dict[str, tuple[int, str]] = {}  # canonical_id -> (best_score, name)

    for match_str, canonical_id, name in match_strings:
        # Use multiple scoring methods
        token_score = fuzz.token_set_ratio(query_norm, match_str)
        partial_score = fuzz.partial_ratio(query_norm, match_str)
        ratio_score = fuzz.ratio(query_norm, match_str)

        # Weighted average favoring token_set_ratio for team names
        score = int(0.5 * token_score + 0.3 * partial_score + 0.2 * ratio_score)

        # Keep best score for each canonical ID
        if canonical_id not in scored or score > scored[canonical_id][0]:
            scored[canonical_id] = (score, name)

    # Filter by threshold and sort
    matches = [
        FuzzyMatch(canonical_id=cid, canonical_name=name, confidence=score)
        for cid, (score, name) in scored.items()
        if score >= threshold
    ]

    # Sort by confidence descending
    matches.sort(key=lambda m: m.confidence, reverse=True)

    return matches[:top_n]


def fuzzy_match_stadium(
    query: str,
    candidates: list[MatchCandidate],
    threshold: int = FUZZY_MATCH_THRESHOLD,
    top_n: int = 3,
) -> list[FuzzyMatch]:
    """Find fuzzy matches for a stadium name.

    Uses matching strategies optimized for stadium names:
    1. Token sort ratio (handles "X Stadium" vs "Stadium X")
    2. Partial ratio (handles naming rights changes)
    3. Standard ratio

    Args:
        query: Stadium name to match
        candidates: List of candidate stadiums to match against
        threshold: Minimum score to consider a match (0-100)
        top_n: Maximum number of matches to return

    Returns:
        List of FuzzyMatch objects sorted by confidence (descending)
    """
    query_norm = normalize_for_matching(query)

    # Build list of all matchable strings
    match_strings: list[tuple[str, str, str]] = []

    for candidate in candidates:
        match_strings.append(
            (normalize_for_matching(candidate.name), candidate.canonical_id, candidate.name)
        )
        for alias in candidate.aliases:
            match_strings.append(
                (normalize_for_matching(alias), candidate.canonical_id, candidate.name)
            )

    # Score all candidates
    scored: dict[str, tuple[int, str]] = {}

    for match_str, canonical_id, name in match_strings:
        # Use scoring methods suited for stadium names
        token_sort_score = fuzz.token_sort_ratio(query_norm, match_str)
        partial_score = fuzz.partial_ratio(query_norm, match_str)
        ratio_score = fuzz.ratio(query_norm, match_str)

        # Weighted average
        score = int(0.4 * token_sort_score + 0.4 * partial_score + 0.2 * ratio_score)

        if canonical_id not in scored or score > scored[canonical_id][0]:
            scored[canonical_id] = (score, name)

    # Filter and sort
    matches = [
        FuzzyMatch(canonical_id=cid, canonical_name=name, confidence=score)
        for cid, (score, name) in scored.items()
        if score >= threshold
    ]

    matches.sort(key=lambda m: m.confidence, reverse=True)

    return matches[:top_n]


def exact_match(
    query: str,
    candidates: list[MatchCandidate],
    case_sensitive: bool = False,
) -> Optional[str]:
    """Find an exact match for a string.

    Args:
        query: String to match
        candidates: List of candidates to match against
        case_sensitive: Whether to use case-sensitive matching

    Returns:
        Canonical ID if exact match found, None otherwise
    """
    if case_sensitive:
        query_norm = query.strip()
    else:
        query_norm = query.lower().strip()

    for candidate in candidates:
        # Check primary name
        name = candidate.name if case_sensitive else candidate.name.lower()
        if query_norm == name.strip():
            return candidate.canonical_id

        # Check aliases
        for alias in candidate.aliases:
            alias_norm = alias if case_sensitive else alias.lower()
            if query_norm == alias_norm.strip():
                return candidate.canonical_id

    return None


def best_match(
    query: str,
    candidates: list[MatchCandidate],
    threshold: int = FUZZY_MATCH_THRESHOLD,
) -> Optional[FuzzyMatch]:
    """Find the best match for a query string.

    First tries exact match, then falls back to fuzzy matching.

    Args:
        query: String to match
        candidates: List of candidates
        threshold: Minimum fuzzy match score

    Returns:
        Best FuzzyMatch or None if no match above threshold
    """
    # Try exact match first
    exact = exact_match(query, candidates)
    if exact:
        # Find the name for this ID
        for c in candidates:
            if c.canonical_id == exact:
                return FuzzyMatch(
                    canonical_id=exact,
                    canonical_name=c.name,
                    confidence=100,
                )

    # Fall back to fuzzy matching
    # Use team matching by default (works for both)
    matches = fuzzy_match_team(query, candidates, threshold=threshold, top_n=1)

    return matches[0] if matches else None


def calculate_similarity(s1: str, s2: str) -> int:
    """Calculate similarity between two strings.

    Args:
        s1: First string
        s2: Second string

    Returns:
        Similarity score 0-100
    """
    s1_norm = normalize_for_matching(s1)
    s2_norm = normalize_for_matching(s2)

    return fuzz.token_set_ratio(s1_norm, s2_norm)
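The prefix/suffix stripping in `normalize_for_matching` is worth seeing on its own. This is the same logic as the listing above, restated without the rapidfuzz scoring so it runs standalone:

```python
def normalize_for_matching(s: str) -> str:
    # Same prefix/suffix stripping as in fuzzy.py, minus the rapidfuzz scoring.
    result = s.lower().strip()
    for prefix in ("the ", "team ", "stadium "):
        if result.startswith(prefix):
            result = result[len(prefix):]
    for suffix in (" stadium", " arena", " center", " field", " park"):
        if result.endswith(suffix):
            result = result[:-len(suffix)]
    return result.strip()


print(normalize_for_matching("The Crypto.com Arena"))  # crypto.com
print(normalize_for_matching("Yankee Stadium"))        # yankee
```

Stripping venue-type suffixes before scoring is what lets "Crypto.com Arena" and "Crypto.com" score as near-identical, so naming-rights variants collapse onto the same candidate.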
521
sportstime_parser/normalizers/stadium_resolver.py
Normal file
@@ -0,0 +1,521 @@
"""Stadium name resolver with exact, alias, and fuzzy matching."""

from dataclasses import dataclass
from datetime import date
from typing import Optional
from uuid import uuid4

from ..config import FUZZY_MATCH_THRESHOLD, ALLOWED_COUNTRIES
from ..models.aliases import FuzzyMatch, ManualReviewItem, ReviewReason
from .alias_loader import get_stadium_alias_loader, StadiumAliasLoader
from .fuzzy import MatchCandidate, fuzzy_match_stadium


@dataclass
class StadiumResolveResult:
    """Result of stadium resolution.

    Attributes:
        canonical_id: Resolved canonical stadium ID (None if unresolved)
        confidence: Confidence in the match (100 for exact, lower for fuzzy)
        match_type: How the match was made ('exact', 'alias', 'fuzzy', 'unresolved')
        filtered_reason: Reason if stadium was filtered out (e.g., 'geographic')
        review_item: ManualReviewItem if resolution failed or low confidence
    """

    canonical_id: Optional[str]
    confidence: int
    match_type: str
    filtered_reason: Optional[str] = None
    review_item: Optional[ManualReviewItem] = None


@dataclass
class StadiumInfo:
    """Stadium information for matching."""

    canonical_id: str
    name: str
    city: str
    state: str
    country: str
    sport: str
    latitude: float
    longitude: float
    timezone: str = "America/New_York"  # IANA timezone identifier


# Hardcoded stadium mappings
# Format: {sport: {canonical_id: StadiumInfo}}
STADIUM_MAPPINGS: dict[str, dict[str, StadiumInfo]] = {
    "nba": {
        "stadium_nba_state_farm_arena": StadiumInfo("stadium_nba_state_farm_arena", "State Farm Arena", "Atlanta", "GA", "USA", "nba", 33.7573, -84.3963),
        "stadium_nba_td_garden": StadiumInfo("stadium_nba_td_garden", "TD Garden", "Boston", "MA", "USA", "nba", 42.3662, -71.0621),
        "stadium_nba_barclays_center": StadiumInfo("stadium_nba_barclays_center", "Barclays Center", "Brooklyn", "NY", "USA", "nba", 40.6826, -73.9754),
        "stadium_nba_spectrum_center": StadiumInfo("stadium_nba_spectrum_center", "Spectrum Center", "Charlotte", "NC", "USA", "nba", 35.2251, -80.8392),
        "stadium_nba_united_center": StadiumInfo("stadium_nba_united_center", "United Center", "Chicago", "IL", "USA", "nba", 41.8807, -87.6742, "America/Chicago"),
        "stadium_nba_rocket_mortgage_fieldhouse": StadiumInfo("stadium_nba_rocket_mortgage_fieldhouse", "Rocket Mortgage FieldHouse", "Cleveland", "OH", "USA", "nba", 41.4965, -81.6882),
        "stadium_nba_american_airlines_center": StadiumInfo("stadium_nba_american_airlines_center", "American Airlines Center", "Dallas", "TX", "USA", "nba", 32.7905, -96.8103, "America/Chicago"),
        "stadium_nba_ball_arena": StadiumInfo("stadium_nba_ball_arena", "Ball Arena", "Denver", "CO", "USA", "nba", 39.7487, -105.0077, "America/Denver"),
        "stadium_nba_little_caesars_arena": StadiumInfo("stadium_nba_little_caesars_arena", "Little Caesars Arena", "Detroit", "MI", "USA", "nba", 42.3411, -83.0553),
        "stadium_nba_chase_center": StadiumInfo("stadium_nba_chase_center", "Chase Center", "San Francisco", "CA", "USA", "nba", 37.7680, -122.3877, "America/Los_Angeles"),
        "stadium_nba_toyota_center": StadiumInfo("stadium_nba_toyota_center", "Toyota Center", "Houston", "TX", "USA", "nba", 29.7508, -95.3621, "America/Chicago"),
        "stadium_nba_gainbridge_fieldhouse": StadiumInfo("stadium_nba_gainbridge_fieldhouse", "Gainbridge Fieldhouse", "Indianapolis", "IN", "USA", "nba", 39.7640, -86.1555),
        "stadium_nba_intuit_dome": StadiumInfo("stadium_nba_intuit_dome", "Intuit Dome", "Inglewood", "CA", "USA", "nba", 33.9425, -118.3417, "America/Los_Angeles"),
        "stadium_nba_cryptocom_arena": StadiumInfo("stadium_nba_cryptocom_arena", "Crypto.com Arena", "Los Angeles", "CA", "USA", "nba", 34.0430, -118.2673, "America/Los_Angeles"),
        "stadium_nba_fedexforum": StadiumInfo("stadium_nba_fedexforum", "FedExForum", "Memphis", "TN", "USA", "nba", 35.1383, -90.0505, "America/Chicago"),
        "stadium_nba_kaseya_center": StadiumInfo("stadium_nba_kaseya_center", "Kaseya Center", "Miami", "FL", "USA", "nba", 25.7814, -80.1870),
        "stadium_nba_fiserv_forum": StadiumInfo("stadium_nba_fiserv_forum", "Fiserv Forum", "Milwaukee", "WI", "USA", "nba", 43.0451, -87.9172, "America/Chicago"),
        "stadium_nba_target_center": StadiumInfo("stadium_nba_target_center", "Target Center", "Minneapolis", "MN", "USA", "nba", 44.9795, -93.2761, "America/Chicago"),
        "stadium_nba_smoothie_king_center": StadiumInfo("stadium_nba_smoothie_king_center", "Smoothie King Center", "New Orleans", "LA", "USA", "nba", 29.9490, -90.0821, "America/Chicago"),
        "stadium_nba_madison_square_garden": StadiumInfo("stadium_nba_madison_square_garden", "Madison Square Garden", "New York", "NY", "USA", "nba", 40.7505, -73.9934),
        "stadium_nba_paycom_center": StadiumInfo("stadium_nba_paycom_center", "Paycom Center", "Oklahoma City", "OK", "USA", "nba", 35.4634, -97.5151, "America/Chicago"),
        "stadium_nba_kia_center": StadiumInfo("stadium_nba_kia_center", "Kia Center", "Orlando", "FL", "USA", "nba", 28.5392, -81.3839),
        "stadium_nba_wells_fargo_center": StadiumInfo("stadium_nba_wells_fargo_center", "Wells Fargo Center", "Philadelphia", "PA", "USA", "nba", 39.9012, -75.1720),
        "stadium_nba_footprint_center": StadiumInfo("stadium_nba_footprint_center", "Footprint Center", "Phoenix", "AZ", "USA", "nba", 33.4457, -112.0712, "America/Phoenix"),
        "stadium_nba_moda_center": StadiumInfo("stadium_nba_moda_center", "Moda Center", "Portland", "OR", "USA", "nba", 45.5316, -122.6668, "America/Los_Angeles"),
        "stadium_nba_golden_1_center": StadiumInfo("stadium_nba_golden_1_center", "Golden 1 Center", "Sacramento", "CA", "USA", "nba", 38.5802, -121.4997, "America/Los_Angeles"),
        "stadium_nba_frost_bank_center": StadiumInfo("stadium_nba_frost_bank_center", "Frost Bank Center", "San Antonio", "TX", "USA", "nba", 29.4270, -98.4375, "America/Chicago"),
        "stadium_nba_scotiabank_arena": StadiumInfo("stadium_nba_scotiabank_arena", "Scotiabank Arena", "Toronto", "ON", "Canada", "nba", 43.6435, -79.3791, "America/Toronto"),
        "stadium_nba_delta_center": StadiumInfo("stadium_nba_delta_center", "Delta Center", "Salt Lake City", "UT", "USA", "nba", 40.7683, -111.9011, "America/Denver"),
        "stadium_nba_capital_one_arena": StadiumInfo("stadium_nba_capital_one_arena", "Capital One Arena", "Washington", "DC", "USA", "nba", 38.8981, -77.0209),
        # International venues
        "stadium_nba_mexico_city_arena": StadiumInfo("stadium_nba_mexico_city_arena", "Mexico City Arena", "Mexico City", "CDMX", "Mexico", "nba", 19.4042, -99.0970, "America/Mexico_City"),
    },
    "mlb": {
        "stadium_mlb_chase_field": StadiumInfo("stadium_mlb_chase_field", "Chase Field", "Phoenix", "AZ", "USA", "mlb", 33.4455, -112.0667, "America/Phoenix"),
        "stadium_mlb_truist_park": StadiumInfo("stadium_mlb_truist_park", "Truist Park", "Atlanta", "GA", "USA", "mlb", 33.8908, -84.4678),
        "stadium_mlb_oriole_park_at_camden_yards": StadiumInfo("stadium_mlb_oriole_park_at_camden_yards", "Oriole Park at Camden Yards", "Baltimore", "MD", "USA", "mlb", 39.2839, -76.6217),
        "stadium_mlb_fenway_park": StadiumInfo("stadium_mlb_fenway_park", "Fenway Park", "Boston", "MA", "USA", "mlb", 42.3467, -71.0972),
        "stadium_mlb_wrigley_field": StadiumInfo("stadium_mlb_wrigley_field", "Wrigley Field", "Chicago", "IL", "USA", "mlb", 41.9484, -87.6553, "America/Chicago"),
        "stadium_mlb_guaranteed_rate_field": StadiumInfo("stadium_mlb_guaranteed_rate_field", "Guaranteed Rate Field", "Chicago", "IL", "USA", "mlb", 41.8299, -87.6338, "America/Chicago"),
        "stadium_mlb_great_american_ball_park": StadiumInfo("stadium_mlb_great_american_ball_park", "Great American Ball Park", "Cincinnati", "OH", "USA", "mlb", 39.0974, -84.5082),
        "stadium_mlb_progressive_field": StadiumInfo("stadium_mlb_progressive_field", "Progressive Field", "Cleveland", "OH", "USA", "mlb", 41.4962, -81.6852),
        "stadium_mlb_coors_field": StadiumInfo("stadium_mlb_coors_field", "Coors Field", "Denver", "CO", "USA", "mlb", 39.7559, -104.9942, "America/Denver"),
        "stadium_mlb_comerica_park": StadiumInfo("stadium_mlb_comerica_park", "Comerica Park", "Detroit", "MI", "USA", "mlb", 42.3390, -83.0485),
        "stadium_mlb_minute_maid_park": StadiumInfo("stadium_mlb_minute_maid_park", "Minute Maid Park", "Houston", "TX", "USA", "mlb", 29.7573, -95.3555, "America/Chicago"),
        "stadium_mlb_kauffman_stadium": StadiumInfo("stadium_mlb_kauffman_stadium", "Kauffman Stadium", "Kansas City", "MO", "USA", "mlb", 39.0517, -94.4803, "America/Chicago"),
        "stadium_mlb_angel_stadium": StadiumInfo("stadium_mlb_angel_stadium", "Angel Stadium", "Anaheim", "CA", "USA", "mlb", 33.8003, -117.8827, "America/Los_Angeles"),
        "stadium_mlb_dodger_stadium": StadiumInfo("stadium_mlb_dodger_stadium", "Dodger Stadium", "Los Angeles", "CA", "USA", "mlb", 34.0739, -118.2400, "America/Los_Angeles"),
        "stadium_mlb_loandepot_park": StadiumInfo("stadium_mlb_loandepot_park", "loanDepot park", "Miami", "FL", "USA", "mlb", 25.7781, -80.2195),
        "stadium_mlb_american_family_field": StadiumInfo("stadium_mlb_american_family_field", "American Family Field", "Milwaukee", "WI", "USA", "mlb", 43.0280, -87.9712, "America/Chicago"),
        "stadium_mlb_target_field": StadiumInfo("stadium_mlb_target_field", "Target Field", "Minneapolis", "MN", "USA", "mlb", 44.9818, -93.2775, "America/Chicago"),
        "stadium_mlb_citi_field": StadiumInfo("stadium_mlb_citi_field", "Citi Field", "New York", "NY", "USA", "mlb", 40.7571, -73.8458),
        "stadium_mlb_yankee_stadium": StadiumInfo("stadium_mlb_yankee_stadium", "Yankee Stadium", "Bronx", "NY", "USA", "mlb", 40.8296, -73.9262),
        "stadium_mlb_sutter_health_park": StadiumInfo("stadium_mlb_sutter_health_park", "Sutter Health Park", "Sacramento", "CA", "USA", "mlb", 38.5803, -121.5005, "America/Los_Angeles"),
        "stadium_mlb_citizens_bank_park": StadiumInfo("stadium_mlb_citizens_bank_park", "Citizens Bank Park", "Philadelphia", "PA", "USA", "mlb", 39.9061, -75.1665),
        "stadium_mlb_pnc_park": StadiumInfo("stadium_mlb_pnc_park", "PNC Park", "Pittsburgh", "PA", "USA", "mlb", 40.4469, -80.0057),
        "stadium_mlb_petco_park": StadiumInfo("stadium_mlb_petco_park", "Petco Park", "San Diego", "CA", "USA", "mlb", 32.7076, -117.1570, "America/Los_Angeles"),
        "stadium_mlb_oracle_park": StadiumInfo("stadium_mlb_oracle_park", "Oracle Park", "San Francisco", "CA", "USA", "mlb", 37.7786, -122.3893, "America/Los_Angeles"),
        "stadium_mlb_tmobile_park": StadiumInfo("stadium_mlb_tmobile_park", "T-Mobile Park", "Seattle", "WA", "USA", "mlb", 47.5914, -122.3325, "America/Los_Angeles"),
        "stadium_mlb_busch_stadium": StadiumInfo("stadium_mlb_busch_stadium", "Busch Stadium", "St. Louis", "MO", "USA", "mlb", 38.6226, -90.1928, "America/Chicago"),
        "stadium_mlb_tropicana_field": StadiumInfo("stadium_mlb_tropicana_field", "Tropicana Field", "St. Petersburg", "FL", "USA", "mlb", 27.7682, -82.6534),
        "stadium_mlb_globe_life_field": StadiumInfo("stadium_mlb_globe_life_field", "Globe Life Field", "Arlington", "TX", "USA", "mlb", 32.7473, -97.0845, "America/Chicago"),
        "stadium_mlb_rogers_centre": StadiumInfo("stadium_mlb_rogers_centre", "Rogers Centre", "Toronto", "ON", "Canada", "mlb", 43.6414, -79.3894, "America/Toronto"),
        "stadium_mlb_nationals_park": StadiumInfo("stadium_mlb_nationals_park", "Nationals Park", "Washington", "DC", "USA", "mlb", 38.8730, -77.0074),
        # Spring Training - Cactus League (Arizona)
        "stadium_mlb_spring_salt_river_fields": StadiumInfo("stadium_mlb_spring_salt_river_fields", "Salt River Fields at Talking Stick", "Scottsdale", "AZ", "USA", "mlb", 33.5412, -111.8847, "America/Phoenix"),
        "stadium_mlb_spring_sloan_park": StadiumInfo("stadium_mlb_spring_sloan_park", "Sloan Park", "Mesa", "AZ", "USA", "mlb", 33.4312, -111.8821, "America/Phoenix"),
        "stadium_mlb_spring_hohokam_stadium": StadiumInfo("stadium_mlb_spring_hohokam_stadium", "Hohokam Stadium", "Mesa", "AZ", "USA", "mlb", 33.4385, -111.8295, "America/Phoenix"),
        "stadium_mlb_spring_camelback_ranch": StadiumInfo("stadium_mlb_spring_camelback_ranch", "Camelback Ranch", "Glendale", "AZ", "USA", "mlb", 33.509, -112.272, "America/Phoenix"),
        "stadium_mlb_spring_goodyear_ballpark": StadiumInfo("stadium_mlb_spring_goodyear_ballpark", "Goodyear Ballpark", "Goodyear", "AZ", "USA", "mlb", 33.4286, -112.3908, "America/Phoenix"),
        "stadium_mlb_spring_tempe_diablo_stadium": StadiumInfo("stadium_mlb_spring_tempe_diablo_stadium", "Tempe Diablo Stadium", "Tempe", "AZ", "USA", "mlb", 33.4003, -111.9685, "America/Phoenix"),
        "stadium_mlb_spring_scottsdale_stadium": StadiumInfo("stadium_mlb_spring_scottsdale_stadium", "Scottsdale Stadium", "Scottsdale", "AZ", "USA", "mlb", 33.4881, -111.9210, "America/Phoenix"),
        "stadium_mlb_spring_american_family_fields": StadiumInfo("stadium_mlb_spring_american_family_fields", "American Family Fields of Phoenix", "Phoenix", "AZ", "USA", "mlb", 33.4916, -112.1733, "America/Phoenix"),
        "stadium_mlb_spring_peoria_sports_complex": StadiumInfo("stadium_mlb_spring_peoria_sports_complex", "Peoria Sports Complex", "Peoria", "AZ", "USA", "mlb", 33.6224, -112.2274, "America/Phoenix"),
        "stadium_mlb_spring_surprise_stadium": StadiumInfo("stadium_mlb_spring_surprise_stadium", "Surprise Stadium", "Surprise", "AZ", "USA", "mlb", 33.6306, -112.3332, "America/Phoenix"),
        # Spring Training - Grapefruit League (Florida)
        "stadium_mlb_spring_jetblue_park": StadiumInfo("stadium_mlb_spring_jetblue_park", "JetBlue Park", "Fort Myers", "FL", "USA", "mlb", 26.5511, -81.7620),
        "stadium_mlb_spring_roger_dean_stadium": StadiumInfo("stadium_mlb_spring_roger_dean_stadium", "Roger Dean Chevrolet Stadium", "Jupiter", "FL", "USA", "mlb", 26.8910, -80.1166),
        "stadium_mlb_spring_ed_smith_stadium": StadiumInfo("stadium_mlb_spring_ed_smith_stadium", "Ed Smith Stadium", "Sarasota", "FL", "USA", "mlb", 27.3482, -82.5176),
        "stadium_mlb_spring_steinbrenner_field": StadiumInfo("stadium_mlb_spring_steinbrenner_field", "George M. Steinbrenner Field", "Tampa", "FL", "USA", "mlb", 27.9748, -82.5040),
        "stadium_mlb_spring_td_ballpark": StadiumInfo("stadium_mlb_spring_td_ballpark", "TD Ballpark", "Dunedin", "FL", "USA", "mlb", 28.0039, -82.7867),
        "stadium_mlb_spring_cooltoday_park": StadiumInfo("stadium_mlb_spring_cooltoday_park", "CoolToday Park", "North Port", "FL", "USA", "mlb", 27.0219, -82.2358),
        "stadium_mlb_spring_hammond_stadium": StadiumInfo("stadium_mlb_spring_hammond_stadium", "Hammond Stadium", "Fort Myers", "FL", "USA", "mlb", 26.5363, -81.8385),
        "stadium_mlb_spring_clover_park": StadiumInfo("stadium_mlb_spring_clover_park", "Clover Park", "Port St. Lucie", "FL", "USA", "mlb", 27.2900, -80.4100),
|
||||
"stadium_mlb_spring_baycare_ballpark": StadiumInfo("stadium_mlb_spring_baycare_ballpark", "BayCare Ballpark", "Clearwater", "FL", "USA", "mlb", 27.9697, -82.7257),
|
||||
"stadium_mlb_spring_lecom_park": StadiumInfo("stadium_mlb_spring_lecom_park", "LECOM Park", "Bradenton", "FL", "USA", "mlb", 27.4939, -82.5753),
|
||||
"stadium_mlb_spring_charlotte_sports_park": StadiumInfo("stadium_mlb_spring_charlotte_sports_park", "Charlotte Sports Park", "Port Charlotte", "FL", "USA", "mlb", 26.9992, -82.1817),
|
||||
"stadium_mlb_spring_cacti_park": StadiumInfo("stadium_mlb_spring_cacti_park", "CACTI Park of the Palm Beaches", "West Palm Beach", "FL", "USA", "mlb", 26.7697, -80.1014),
|
||||
"stadium_mlb_spring_joker_marchant": StadiumInfo("stadium_mlb_spring_joker_marchant", "Publix Field at Joker Marchant Stadium", "Lakeland", "FL", "USA", "mlb", 28.0655, -81.9545),
|
||||
# Special venues
|
||||
"stadium_mlb_las_vegas_ballpark": StadiumInfo("stadium_mlb_las_vegas_ballpark", "Las Vegas Ballpark", "Las Vegas", "NV", "USA", "mlb", 36.0925, -115.1775, "America/Los_Angeles"),
|
||||
"stadium_mlb_mexico_alfredo_harp_helu": StadiumInfo("stadium_mlb_mexico_alfredo_harp_helu", "Estadio Alfredo Harp Helu", "Mexico City", "CDMX", "Mexico", "mlb", 19.3825, -99.0928, "America/Mexico_City"),
|
||||
"stadium_mlb_field_of_dreams": StadiumInfo("stadium_mlb_field_of_dreams", "Field of Dreams", "Dyersville", "IA", "USA", "mlb", 42.4671, -91.1095, "America/Chicago"),
|
||||
"stadium_mlb_journey_bank_ballpark": StadiumInfo("stadium_mlb_journey_bank_ballpark", "Journey Bank Ballpark", "Williamsport", "PA", "USA", "mlb", 41.2415, -77.0011),
|
||||
},
    "nfl": {
        "stadium_nfl_state_farm_stadium": StadiumInfo("stadium_nfl_state_farm_stadium", "State Farm Stadium", "Glendale", "AZ", "USA", "nfl", 33.5276, -112.2626),
        "stadium_nfl_mercedes_benz_stadium": StadiumInfo("stadium_nfl_mercedes_benz_stadium", "Mercedes-Benz Stadium", "Atlanta", "GA", "USA", "nfl", 33.7553, -84.4006),
        "stadium_nfl_mandt_bank_stadium": StadiumInfo("stadium_nfl_mandt_bank_stadium", "M&T Bank Stadium", "Baltimore", "MD", "USA", "nfl", 39.2780, -76.6227),
        "stadium_nfl_highmark_stadium": StadiumInfo("stadium_nfl_highmark_stadium", "Highmark Stadium", "Orchard Park", "NY", "USA", "nfl", 42.7738, -78.7870),
        "stadium_nfl_bank_of_america_stadium": StadiumInfo("stadium_nfl_bank_of_america_stadium", "Bank of America Stadium", "Charlotte", "NC", "USA", "nfl", 35.2258, -80.8528),
        "stadium_nfl_soldier_field": StadiumInfo("stadium_nfl_soldier_field", "Soldier Field", "Chicago", "IL", "USA", "nfl", 41.8623, -87.6167),
        "stadium_nfl_paycor_stadium": StadiumInfo("stadium_nfl_paycor_stadium", "Paycor Stadium", "Cincinnati", "OH", "USA", "nfl", 39.0955, -84.5161),
        "stadium_nfl_huntington_bank_field": StadiumInfo("stadium_nfl_huntington_bank_field", "Huntington Bank Field", "Cleveland", "OH", "USA", "nfl", 41.5061, -81.6995),
        "stadium_nfl_att_stadium": StadiumInfo("stadium_nfl_att_stadium", "AT&T Stadium", "Arlington", "TX", "USA", "nfl", 32.7473, -97.0945),
        "stadium_nfl_empower_field": StadiumInfo("stadium_nfl_empower_field", "Empower Field at Mile High", "Denver", "CO", "USA", "nfl", 39.7439, -105.0201),
        "stadium_nfl_ford_field": StadiumInfo("stadium_nfl_ford_field", "Ford Field", "Detroit", "MI", "USA", "nfl", 42.3400, -83.0456),
        "stadium_nfl_lambeau_field": StadiumInfo("stadium_nfl_lambeau_field", "Lambeau Field", "Green Bay", "WI", "USA", "nfl", 44.5013, -88.0622),
        "stadium_nfl_nrg_stadium": StadiumInfo("stadium_nfl_nrg_stadium", "NRG Stadium", "Houston", "TX", "USA", "nfl", 29.6847, -95.4107),
        "stadium_nfl_lucas_oil_stadium": StadiumInfo("stadium_nfl_lucas_oil_stadium", "Lucas Oil Stadium", "Indianapolis", "IN", "USA", "nfl", 39.7601, -86.1639),
        "stadium_nfl_everbank_stadium": StadiumInfo("stadium_nfl_everbank_stadium", "EverBank Stadium", "Jacksonville", "FL", "USA", "nfl", 30.3239, -81.6373),
        "stadium_nfl_arrowhead_stadium": StadiumInfo("stadium_nfl_arrowhead_stadium", "Arrowhead Stadium", "Kansas City", "MO", "USA", "nfl", 39.0489, -94.4839),
        "stadium_nfl_allegiant_stadium": StadiumInfo("stadium_nfl_allegiant_stadium", "Allegiant Stadium", "Las Vegas", "NV", "USA", "nfl", 36.0909, -115.1833),
        "stadium_nfl_sofi_stadium": StadiumInfo("stadium_nfl_sofi_stadium", "SoFi Stadium", "Inglewood", "CA", "USA", "nfl", 33.9534, -118.3386),
        "stadium_nfl_hard_rock_stadium": StadiumInfo("stadium_nfl_hard_rock_stadium", "Hard Rock Stadium", "Miami Gardens", "FL", "USA", "nfl", 25.9580, -80.2389),
        "stadium_nfl_us_bank_stadium": StadiumInfo("stadium_nfl_us_bank_stadium", "U.S. Bank Stadium", "Minneapolis", "MN", "USA", "nfl", 44.9737, -93.2575),
        "stadium_nfl_gillette_stadium": StadiumInfo("stadium_nfl_gillette_stadium", "Gillette Stadium", "Foxborough", "MA", "USA", "nfl", 42.0909, -71.2643),
        "stadium_nfl_caesars_superdome": StadiumInfo("stadium_nfl_caesars_superdome", "Caesars Superdome", "New Orleans", "LA", "USA", "nfl", 29.9511, -90.0812),
        "stadium_nfl_metlife_stadium": StadiumInfo("stadium_nfl_metlife_stadium", "MetLife Stadium", "East Rutherford", "NJ", "USA", "nfl", 40.8128, -74.0742),
        "stadium_nfl_lincoln_financial_field": StadiumInfo("stadium_nfl_lincoln_financial_field", "Lincoln Financial Field", "Philadelphia", "PA", "USA", "nfl", 39.9008, -75.1675),
        "stadium_nfl_acrisure_stadium": StadiumInfo("stadium_nfl_acrisure_stadium", "Acrisure Stadium", "Pittsburgh", "PA", "USA", "nfl", 40.4468, -80.0158),
        "stadium_nfl_levis_stadium": StadiumInfo("stadium_nfl_levis_stadium", "Levi's Stadium", "Santa Clara", "CA", "USA", "nfl", 37.4033, -121.9695),
        "stadium_nfl_lumen_field": StadiumInfo("stadium_nfl_lumen_field", "Lumen Field", "Seattle", "WA", "USA", "nfl", 47.5952, -122.3316),
        "stadium_nfl_raymond_james_stadium": StadiumInfo("stadium_nfl_raymond_james_stadium", "Raymond James Stadium", "Tampa", "FL", "USA", "nfl", 27.9759, -82.5033),
        "stadium_nfl_nissan_stadium": StadiumInfo("stadium_nfl_nissan_stadium", "Nissan Stadium", "Nashville", "TN", "USA", "nfl", 36.1665, -86.7713),
        "stadium_nfl_northwest_stadium": StadiumInfo("stadium_nfl_northwest_stadium", "Northwest Stadium", "Landover", "MD", "USA", "nfl", 38.9076, -76.8645),
    },
    "nhl": {
        "stadium_nhl_honda_center": StadiumInfo("stadium_nhl_honda_center", "Honda Center", "Anaheim", "CA", "USA", "nhl", 33.8078, -117.8765),
        "stadium_nhl_delta_center": StadiumInfo("stadium_nhl_delta_center", "Delta Center", "Salt Lake City", "UT", "USA", "nhl", 40.7683, -111.9011),
        "stadium_nhl_td_garden": StadiumInfo("stadium_nhl_td_garden", "TD Garden", "Boston", "MA", "USA", "nhl", 42.3662, -71.0621),
        "stadium_nhl_keybank_center": StadiumInfo("stadium_nhl_keybank_center", "KeyBank Center", "Buffalo", "NY", "USA", "nhl", 42.8750, -78.8764),
        "stadium_nhl_scotiabank_saddledome": StadiumInfo("stadium_nhl_scotiabank_saddledome", "Scotiabank Saddledome", "Calgary", "AB", "Canada", "nhl", 51.0374, -114.0519),
        "stadium_nhl_pnc_arena": StadiumInfo("stadium_nhl_pnc_arena", "PNC Arena", "Raleigh", "NC", "USA", "nhl", 35.8033, -78.7220),
        "stadium_nhl_united_center": StadiumInfo("stadium_nhl_united_center", "United Center", "Chicago", "IL", "USA", "nhl", 41.8807, -87.6742),
        "stadium_nhl_ball_arena": StadiumInfo("stadium_nhl_ball_arena", "Ball Arena", "Denver", "CO", "USA", "nhl", 39.7487, -105.0077),
        "stadium_nhl_nationwide_arena": StadiumInfo("stadium_nhl_nationwide_arena", "Nationwide Arena", "Columbus", "OH", "USA", "nhl", 39.9692, -83.0061),
        "stadium_nhl_american_airlines_center": StadiumInfo("stadium_nhl_american_airlines_center", "American Airlines Center", "Dallas", "TX", "USA", "nhl", 32.7905, -96.8103),
        "stadium_nhl_little_caesars_arena": StadiumInfo("stadium_nhl_little_caesars_arena", "Little Caesars Arena", "Detroit", "MI", "USA", "nhl", 42.3411, -83.0553),
        "stadium_nhl_rogers_place": StadiumInfo("stadium_nhl_rogers_place", "Rogers Place", "Edmonton", "AB", "Canada", "nhl", 53.5469, -113.4979),
        "stadium_nhl_amerant_bank_arena": StadiumInfo("stadium_nhl_amerant_bank_arena", "Amerant Bank Arena", "Sunrise", "FL", "USA", "nhl", 26.1584, -80.3256),
        "stadium_nhl_cryptocom_arena": StadiumInfo("stadium_nhl_cryptocom_arena", "Crypto.com Arena", "Los Angeles", "CA", "USA", "nhl", 34.0430, -118.2673),
        "stadium_nhl_xcel_energy_center": StadiumInfo("stadium_nhl_xcel_energy_center", "Xcel Energy Center", "St. Paul", "MN", "USA", "nhl", 44.9448, -93.1010),
        "stadium_nhl_bell_centre": StadiumInfo("stadium_nhl_bell_centre", "Bell Centre", "Montreal", "QC", "Canada", "nhl", 45.4961, -73.5693),
        "stadium_nhl_bridgestone_arena": StadiumInfo("stadium_nhl_bridgestone_arena", "Bridgestone Arena", "Nashville", "TN", "USA", "nhl", 36.1592, -86.7785),
        "stadium_nhl_prudential_center": StadiumInfo("stadium_nhl_prudential_center", "Prudential Center", "Newark", "NJ", "USA", "nhl", 40.7334, -74.1712),
        "stadium_nhl_ubs_arena": StadiumInfo("stadium_nhl_ubs_arena", "UBS Arena", "Elmont", "NY", "USA", "nhl", 40.7170, -73.7255),
        "stadium_nhl_madison_square_garden": StadiumInfo("stadium_nhl_madison_square_garden", "Madison Square Garden", "New York", "NY", "USA", "nhl", 40.7505, -73.9934),
        "stadium_nhl_canadian_tire_centre": StadiumInfo("stadium_nhl_canadian_tire_centre", "Canadian Tire Centre", "Ottawa", "ON", "Canada", "nhl", 45.2969, -75.9272),
        "stadium_nhl_wells_fargo_center": StadiumInfo("stadium_nhl_wells_fargo_center", "Wells Fargo Center", "Philadelphia", "PA", "USA", "nhl", 39.9012, -75.1720),
        "stadium_nhl_ppg_paints_arena": StadiumInfo("stadium_nhl_ppg_paints_arena", "PPG Paints Arena", "Pittsburgh", "PA", "USA", "nhl", 40.4395, -79.9890),
        "stadium_nhl_sap_center": StadiumInfo("stadium_nhl_sap_center", "SAP Center", "San Jose", "CA", "USA", "nhl", 37.3327, -121.9011),
        "stadium_nhl_climate_pledge_arena": StadiumInfo("stadium_nhl_climate_pledge_arena", "Climate Pledge Arena", "Seattle", "WA", "USA", "nhl", 47.6221, -122.3540),
        "stadium_nhl_enterprise_center": StadiumInfo("stadium_nhl_enterprise_center", "Enterprise Center", "St. Louis", "MO", "USA", "nhl", 38.6268, -90.2025),
        "stadium_nhl_amalie_arena": StadiumInfo("stadium_nhl_amalie_arena", "Amalie Arena", "Tampa", "FL", "USA", "nhl", 27.9428, -82.4519),
        "stadium_nhl_scotiabank_arena": StadiumInfo("stadium_nhl_scotiabank_arena", "Scotiabank Arena", "Toronto", "ON", "Canada", "nhl", 43.6435, -79.3791),
        "stadium_nhl_rogers_arena": StadiumInfo("stadium_nhl_rogers_arena", "Rogers Arena", "Vancouver", "BC", "Canada", "nhl", 49.2778, -123.1088),
        "stadium_nhl_tmobile_arena": StadiumInfo("stadium_nhl_tmobile_arena", "T-Mobile Arena", "Las Vegas", "NV", "USA", "nhl", 36.1028, -115.1783),
        "stadium_nhl_capital_one_arena": StadiumInfo("stadium_nhl_capital_one_arena", "Capital One Arena", "Washington", "DC", "USA", "nhl", 38.8981, -77.0209),
        "stadium_nhl_canada_life_centre": StadiumInfo("stadium_nhl_canada_life_centre", "Canada Life Centre", "Winnipeg", "MB", "Canada", "nhl", 49.8928, -97.1433),
    },
    "mls": {
        "stadium_mls_mercedes_benz_stadium": StadiumInfo("stadium_mls_mercedes_benz_stadium", "Mercedes-Benz Stadium", "Atlanta", "GA", "USA", "mls", 33.7553, -84.4006),
        "stadium_mls_q2_stadium": StadiumInfo("stadium_mls_q2_stadium", "Q2 Stadium", "Austin", "TX", "USA", "mls", 30.3875, -97.7186),
        "stadium_mls_bank_of_america_stadium": StadiumInfo("stadium_mls_bank_of_america_stadium", "Bank of America Stadium", "Charlotte", "NC", "USA", "mls", 35.2258, -80.8528),
        "stadium_mls_soldier_field": StadiumInfo("stadium_mls_soldier_field", "Soldier Field", "Chicago", "IL", "USA", "mls", 41.8623, -87.6167),
        "stadium_mls_tql_stadium": StadiumInfo("stadium_mls_tql_stadium", "TQL Stadium", "Cincinnati", "OH", "USA", "mls", 39.1112, -84.5225),
        "stadium_mls_dicks_sporting_goods_park": StadiumInfo("stadium_mls_dicks_sporting_goods_park", "Dick's Sporting Goods Park", "Commerce City", "CO", "USA", "mls", 39.8056, -104.8922),
        "stadium_mls_lowercom_field": StadiumInfo("stadium_mls_lowercom_field", "Lower.com Field", "Columbus", "OH", "USA", "mls", 39.9689, -83.0173),
        "stadium_mls_toyota_stadium": StadiumInfo("stadium_mls_toyota_stadium", "Toyota Stadium", "Frisco", "TX", "USA", "mls", 33.1545, -96.8353),
        "stadium_mls_audi_field": StadiumInfo("stadium_mls_audi_field", "Audi Field", "Washington", "DC", "USA", "mls", 38.8687, -77.0128),
        "stadium_mls_shell_energy_stadium": StadiumInfo("stadium_mls_shell_energy_stadium", "Shell Energy Stadium", "Houston", "TX", "USA", "mls", 29.7522, -95.3527),
        "stadium_mls_dignity_health_sports_park": StadiumInfo("stadium_mls_dignity_health_sports_park", "Dignity Health Sports Park", "Carson", "CA", "USA", "mls", 33.8644, -118.2611),
        "stadium_mls_bmo_stadium": StadiumInfo("stadium_mls_bmo_stadium", "BMO Stadium", "Los Angeles", "CA", "USA", "mls", 34.0128, -118.2841),
        "stadium_mls_chase_stadium": StadiumInfo("stadium_mls_chase_stadium", "Chase Stadium", "Fort Lauderdale", "FL", "USA", "mls", 26.1930, -80.1611),
        "stadium_mls_allianz_field": StadiumInfo("stadium_mls_allianz_field", "Allianz Field", "St. Paul", "MN", "USA", "mls", 44.9528, -93.1650),
        "stadium_mls_stade_saputo": StadiumInfo("stadium_mls_stade_saputo", "Stade Saputo", "Montreal", "QC", "Canada", "mls", 45.5622, -73.5528),
        "stadium_mls_geodis_park": StadiumInfo("stadium_mls_geodis_park", "GEODIS Park", "Nashville", "TN", "USA", "mls", 36.1304, -86.7651),
        "stadium_mls_gillette_stadium": StadiumInfo("stadium_mls_gillette_stadium", "Gillette Stadium", "Foxborough", "MA", "USA", "mls", 42.0909, -71.2643),
        "stadium_mls_yankee_stadium": StadiumInfo("stadium_mls_yankee_stadium", "Yankee Stadium", "Bronx", "NY", "USA", "mls", 40.8296, -73.9262),
        "stadium_mls_red_bull_arena": StadiumInfo("stadium_mls_red_bull_arena", "Red Bull Arena", "Harrison", "NJ", "USA", "mls", 40.7369, -74.1503),
        "stadium_mls_interco_stadium": StadiumInfo("stadium_mls_interco_stadium", "Inter&Co Stadium", "Orlando", "FL", "USA", "mls", 28.5411, -81.3895),
        "stadium_mls_subaru_park": StadiumInfo("stadium_mls_subaru_park", "Subaru Park", "Chester", "PA", "USA", "mls", 39.8328, -75.3789),
        "stadium_mls_providence_park": StadiumInfo("stadium_mls_providence_park", "Providence Park", "Portland", "OR", "USA", "mls", 45.5216, -122.6917),
        "stadium_mls_america_first_field": StadiumInfo("stadium_mls_america_first_field", "America First Field", "Sandy", "UT", "USA", "mls", 40.5830, -111.8933),
        "stadium_mls_paypal_park": StadiumInfo("stadium_mls_paypal_park", "PayPal Park", "San Jose", "CA", "USA", "mls", 37.3511, -121.9250),
        "stadium_mls_snapdragon_stadium": StadiumInfo("stadium_mls_snapdragon_stadium", "Snapdragon Stadium", "San Diego", "CA", "USA", "mls", 32.7837, -117.1225),
        "stadium_mls_lumen_field": StadiumInfo("stadium_mls_lumen_field", "Lumen Field", "Seattle", "WA", "USA", "mls", 47.5952, -122.3316),
        "stadium_mls_childrens_mercy_park": StadiumInfo("stadium_mls_childrens_mercy_park", "Children's Mercy Park", "Kansas City", "KS", "USA", "mls", 39.1217, -94.8231),
        "stadium_mls_citypark": StadiumInfo("stadium_mls_citypark", "CITYPARK", "St. Louis", "MO", "USA", "mls", 38.6316, -90.2106),
        "stadium_mls_bmo_field": StadiumInfo("stadium_mls_bmo_field", "BMO Field", "Toronto", "ON", "Canada", "mls", 43.6332, -79.4186),
        "stadium_mls_bc_place": StadiumInfo("stadium_mls_bc_place", "BC Place", "Vancouver", "BC", "Canada", "mls", 49.2768, -123.1118),
    },
    "wnba": {
        "stadium_wnba_gateway_center_arena": StadiumInfo("stadium_wnba_gateway_center_arena", "Gateway Center Arena", "College Park", "GA", "USA", "wnba", 33.6510, -84.4474),
        "stadium_wnba_wintrust_arena": StadiumInfo("stadium_wnba_wintrust_arena", "Wintrust Arena", "Chicago", "IL", "USA", "wnba", 41.8658, -87.6169),
        "stadium_wnba_mohegan_sun_arena": StadiumInfo("stadium_wnba_mohegan_sun_arena", "Mohegan Sun Arena", "Uncasville", "CT", "USA", "wnba", 41.4931, -72.0912),
        "stadium_wnba_college_park_center": StadiumInfo("stadium_wnba_college_park_center", "College Park Center", "Arlington", "TX", "USA", "wnba", 32.7304, -97.1077),
        "stadium_wnba_chase_center": StadiumInfo("stadium_wnba_chase_center", "Chase Center", "San Francisco", "CA", "USA", "wnba", 37.7680, -122.3877),
        "stadium_wnba_gainbridge_fieldhouse": StadiumInfo("stadium_wnba_gainbridge_fieldhouse", "Gainbridge Fieldhouse", "Indianapolis", "IN", "USA", "wnba", 39.7640, -86.1555),
        "stadium_wnba_michelob_ultra_arena": StadiumInfo("stadium_wnba_michelob_ultra_arena", "Michelob Ultra Arena", "Las Vegas", "NV", "USA", "wnba", 36.0902, -115.1756),
        "stadium_wnba_cryptocom_arena": StadiumInfo("stadium_wnba_cryptocom_arena", "Crypto.com Arena", "Los Angeles", "CA", "USA", "wnba", 34.0430, -118.2673),
        "stadium_wnba_target_center": StadiumInfo("stadium_wnba_target_center", "Target Center", "Minneapolis", "MN", "USA", "wnba", 44.9795, -93.2761),
        "stadium_wnba_barclays_center": StadiumInfo("stadium_wnba_barclays_center", "Barclays Center", "Brooklyn", "NY", "USA", "wnba", 40.6826, -73.9754),
        "stadium_wnba_footprint_center": StadiumInfo("stadium_wnba_footprint_center", "Footprint Center", "Phoenix", "AZ", "USA", "wnba", 33.4457, -112.0712),
        "stadium_wnba_climate_pledge_arena": StadiumInfo("stadium_wnba_climate_pledge_arena", "Climate Pledge Arena", "Seattle", "WA", "USA", "wnba", 47.6221, -122.3540),
        "stadium_wnba_entertainment_sports_arena": StadiumInfo("stadium_wnba_entertainment_sports_arena", "Entertainment & Sports Arena", "Washington", "DC", "USA", "wnba", 38.8690, -76.9745),
        "stadium_wnba_state_farm_arena": StadiumInfo("stadium_wnba_state_farm_arena", "State Farm Arena", "Atlanta", "GA", "USA", "wnba", 33.7573, -84.3963),
        "stadium_wnba_rocket_mortgage_fieldhouse": StadiumInfo("stadium_wnba_rocket_mortgage_fieldhouse", "Rocket Mortgage FieldHouse", "Cleveland", "OH", "USA", "wnba", 41.4965, -81.6882),
        "stadium_wnba_cfg_bank_arena": StadiumInfo("stadium_wnba_cfg_bank_arena", "CFG Bank Arena", "Baltimore", "MD", "USA", "wnba", 39.2825, -76.6220),
        "stadium_wnba_purcell_pavilion": StadiumInfo("stadium_wnba_purcell_pavilion", "Purcell Pavilion", "Notre Dame", "IN", "USA", "wnba", 41.6987, -86.2340),
    },
    "nwsl": {
        "stadium_nwsl_bmo_stadium": StadiumInfo("stadium_nwsl_bmo_stadium", "BMO Stadium", "Los Angeles", "CA", "USA", "nwsl", 34.0128, -118.2841),
        "stadium_nwsl_seatgeek_stadium": StadiumInfo("stadium_nwsl_seatgeek_stadium", "SeatGeek Stadium", "Bridgeview", "IL", "USA", "nwsl", 41.7500, -87.8028),
        "stadium_nwsl_shell_energy_stadium": StadiumInfo("stadium_nwsl_shell_energy_stadium", "Shell Energy Stadium", "Houston", "TX", "USA", "nwsl", 29.7522, -95.3527),
        "stadium_nwsl_cpkc_stadium": StadiumInfo("stadium_nwsl_cpkc_stadium", "CPKC Stadium", "Kansas City", "MO", "USA", "nwsl", 39.1050, -94.5580),
        "stadium_nwsl_red_bull_arena": StadiumInfo("stadium_nwsl_red_bull_arena", "Red Bull Arena", "Harrison", "NJ", "USA", "nwsl", 40.7369, -74.1503),
        "stadium_nwsl_wakemed_soccer_park": StadiumInfo("stadium_nwsl_wakemed_soccer_park", "WakeMed Soccer Park", "Cary", "NC", "USA", "nwsl", 35.7879, -78.7806),
        "stadium_nwsl_interco_stadium": StadiumInfo("stadium_nwsl_interco_stadium", "Inter&Co Stadium", "Orlando", "FL", "USA", "nwsl", 28.5411, -81.3895),
        "stadium_nwsl_providence_park": StadiumInfo("stadium_nwsl_providence_park", "Providence Park", "Portland", "OR", "USA", "nwsl", 45.5216, -122.6917),
        "stadium_nwsl_lynn_family_stadium": StadiumInfo("stadium_nwsl_lynn_family_stadium", "Lynn Family Stadium", "Louisville", "KY", "USA", "nwsl", 38.2219, -85.7381),
        "stadium_nwsl_snapdragon_stadium": StadiumInfo("stadium_nwsl_snapdragon_stadium", "Snapdragon Stadium", "San Diego", "CA", "USA", "nwsl", 32.7837, -117.1225),
        "stadium_nwsl_lumen_field": StadiumInfo("stadium_nwsl_lumen_field", "Lumen Field", "Seattle", "WA", "USA", "nwsl", 47.5952, -122.3316),
        "stadium_nwsl_america_first_field": StadiumInfo("stadium_nwsl_america_first_field", "America First Field", "Sandy", "UT", "USA", "nwsl", 40.5830, -111.8933),
        "stadium_nwsl_audi_field": StadiumInfo("stadium_nwsl_audi_field", "Audi Field", "Washington", "DC", "USA", "nwsl", 38.8687, -77.0128),
        "stadium_nwsl_paypal_park": StadiumInfo("stadium_nwsl_paypal_park", "PayPal Park", "San Jose", "CA", "USA", "nwsl", 37.3511, -121.9250),
        # Boston Legacy FC venues
        "stadium_nwsl_gillette_stadium": StadiumInfo("stadium_nwsl_gillette_stadium", "Gillette Stadium", "Foxborough", "MA", "USA", "nwsl", 42.0909, -71.2643),
        "stadium_nwsl_centreville_bank_stadium": StadiumInfo("stadium_nwsl_centreville_bank_stadium", "Centreville Bank Stadium", "Pawtucket", "RI", "USA", "nwsl", 41.8770, -71.3910),
        # Denver Summit FC venues
        "stadium_nwsl_empower_field": StadiumInfo("stadium_nwsl_empower_field", "Empower Field at Mile High", "Denver", "CO", "USA", "nwsl", 39.7439, -105.0201, "America/Denver"),
        "stadium_nwsl_dicks_sporting_goods_park": StadiumInfo("stadium_nwsl_dicks_sporting_goods_park", "Dick's Sporting Goods Park", "Commerce City", "CO", "USA", "nwsl", 39.8056, -104.8922, "America/Denver"),
        "stadium_nwsl_centennial_stadium": StadiumInfo("stadium_nwsl_centennial_stadium", "Centennial Stadium", "Centennial", "CO", "USA", "nwsl", 39.6000, -104.8800, "America/Denver"),
        # Shared NFL/MLB venues
        "stadium_nwsl_soldier_field": StadiumInfo("stadium_nwsl_soldier_field", "Soldier Field", "Chicago", "IL", "USA", "nwsl", 41.8623, -87.6167),
        "stadium_nwsl_oracle_park": StadiumInfo("stadium_nwsl_oracle_park", "Oracle Park", "San Francisco", "CA", "USA", "nwsl", 37.7786, -122.3893, "America/Los_Angeles"),
    },
}
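The nested mapping above is keyed sport code → canonical stadium ID → `StadiumInfo`, with an optional trailing timezone argument for venues whose zone cannot be derived from state alone. A minimal stand-alone sketch of that shape (the `StadiumInfo` dataclass here is a stand-in mirroring the positional arguments used above; the real class is defined elsewhere in the package):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StadiumInfo:
    # Stand-in: positional fields mirror the entries above
    # (id, name, city, state, country, sport, lat, lon, optional tz).
    canonical_id: str
    name: str
    city: str
    state: str
    country: str
    sport: str
    latitude: float
    longitude: float
    timezone: Optional[str] = None


MAPPINGS = {
    "mlb": {
        "stadium_mlb_busch_stadium": StadiumInfo(
            "stadium_mlb_busch_stadium", "Busch Stadium", "St. Louis",
            "MO", "USA", "mlb", 38.6226, -90.1928,
        ),
    },
}

# Lookup is a plain two-level dict access: sport first, then canonical ID.
info = MAPPINGS["mlb"]["stadium_mlb_busch_stadium"]
```

Entries without an explicit timezone leave the field `None`, which is why the Arizona and international venues above pass one explicitly.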


class StadiumResolver:
    """Resolves stadium names to canonical IDs.

    Resolution order:
    1. Exact match against stadium names
    2. Alias lookup (with date awareness)
    3. Fuzzy match against all known names
    4. Geographic filter check
    5. Unresolved (returns ManualReviewItem)
    """

    def __init__(
        self,
        sport: str,
        alias_loader: Optional[StadiumAliasLoader] = None,
        fuzzy_threshold: int = FUZZY_MATCH_THRESHOLD,
    ):
        """Initialize the resolver.

        Args:
            sport: Sport code (e.g., 'nba', 'mlb')
            alias_loader: Stadium alias loader (default: global loader)
            fuzzy_threshold: Minimum fuzzy match score
        """
        self.sport = sport.lower()
        self.alias_loader = alias_loader or get_stadium_alias_loader()
        self.fuzzy_threshold = fuzzy_threshold
        self._stadiums = STADIUM_MAPPINGS.get(self.sport, {})

        # Build match candidates
        self._candidates = self._build_candidates()

    def _build_candidates(self) -> list[MatchCandidate]:
        """Build match candidates from stadium mappings."""
        candidates = []

        for stadium_id, info in self._stadiums.items():
            # Get aliases for this stadium
            aliases = [a.alias_name for a in self.alias_loader.get_aliases_for_stadium(stadium_id)]

            # Add city as alias
            aliases.append(info.city)

            candidates.append(MatchCandidate(
                canonical_id=stadium_id,
                name=info.name,
                aliases=aliases,
            ))

        return candidates

    def resolve(
        self,
        name: str,
        check_date: Optional[date] = None,
        country: Optional[str] = None,
        source_url: Optional[str] = None,
    ) -> StadiumResolveResult:
        """Resolve a stadium name to a canonical ID.

        Args:
            name: Stadium name to resolve
            check_date: Date for alias validity (None = today)
            country: Country for geographic filtering (None = no filter)
            source_url: Source URL for manual review items

        Returns:
            StadiumResolveResult with resolution details
        """
        name_lower = name.lower().strip()

        # 1. Exact match against stadium names
        for stadium_id, info in self._stadiums.items():
            if name_lower == info.name.lower():
                return StadiumResolveResult(
                    canonical_id=stadium_id,
                    confidence=100,
                    match_type="exact",
                )

        # 2. Alias lookup
        alias_result = self.alias_loader.resolve(name, check_date)
        if alias_result:
            # Verify it's for the right sport (alias file has all sports)
            if alias_result.startswith(f"stadium_{self.sport}_"):
                return StadiumResolveResult(
                    canonical_id=alias_result,
                    confidence=95,
                    match_type="alias",
                )

        # 3. Fuzzy match
        matches = fuzzy_match_stadium(
            name,
            self._candidates,
            threshold=self.fuzzy_threshold,
        )

        if matches:
            best = matches[0]
            review_item = None

            # Create review item for low confidence matches
            if best.confidence < 90:
                review_item = ManualReviewItem(
                    id=f"stadium_{uuid4().hex[:8]}",
                    reason=ReviewReason.LOW_CONFIDENCE_MATCH,
                    sport=self.sport,
                    raw_value=name,
                    context={"match_type": "fuzzy"},
                    source_url=source_url,
                    suggested_matches=matches,
                    game_date=check_date,
                )

            return StadiumResolveResult(
                canonical_id=best.canonical_id,
                confidence=best.confidence,
                match_type="fuzzy",
                review_item=review_item,
            )

        # 4. Geographic filter check
        if country and country not in ALLOWED_COUNTRIES:
            review_item = ManualReviewItem(
                id=f"stadium_{uuid4().hex[:8]}",
                reason=ReviewReason.GEOGRAPHIC_FILTER,
                sport=self.sport,
                raw_value=name,
                context={"country": country, "reason": "Stadium outside USA/Canada/Mexico"},
                source_url=source_url,
                game_date=check_date,
            )

            return StadiumResolveResult(
                canonical_id=None,
                confidence=0,
                match_type="filtered",
                filtered_reason="geographic",
                review_item=review_item,
            )

        # 5. Unresolved
        review_item = ManualReviewItem(
            id=f"stadium_{uuid4().hex[:8]}",
            reason=ReviewReason.UNRESOLVED_STADIUM,
            sport=self.sport,
            raw_value=name,
            context={},
            source_url=source_url,
            suggested_matches=fuzzy_match_stadium(
                name,
                self._candidates,
                threshold=50,  # Lower threshold for suggestions
                top_n=5,
            ),
            game_date=check_date,
        )

        return StadiumResolveResult(
            canonical_id=None,
            confidence=0,
            match_type="unresolved",
            review_item=review_item,
        )

    def get_stadium_info(self, stadium_id: str) -> Optional[StadiumInfo]:
        """Get stadium info by ID.

        Args:
            stadium_id: Canonical stadium ID

        Returns:
            StadiumInfo or None
        """
        return self._stadiums.get(stadium_id)

    def get_all_stadiums(self) -> list[StadiumInfo]:
        """Get all stadiums for this sport.

        Returns:
            List of StadiumInfo objects
        """
        return list(self._stadiums.values())

    def is_in_allowed_region(self, stadium_id: str) -> bool:
        """Check if a stadium is in an allowed region.

        Args:
            stadium_id: Canonical stadium ID

        Returns:
            True if the stadium is in the USA, Canada, or Mexico
        """
        info = self._stadiums.get(stadium_id)
        if not info:
            return False

        return info.country in ALLOWED_COUNTRIES


# Cached resolvers
_resolvers: dict[str, StadiumResolver] = {}


def get_stadium_resolver(sport: str) -> StadiumResolver:
    """Get or create a stadium resolver for a sport."""
    sport_lower = sport.lower()
    if sport_lower not in _resolvers:
        _resolvers[sport_lower] = StadiumResolver(sport_lower)
    return _resolvers[sport_lower]


def resolve_stadium(
    sport: str,
    name: str,
    check_date: Optional[date] = None,
) -> StadiumResolveResult:
    """Convenience function to resolve a stadium name.

    Args:
        sport: Sport code
        name: Stadium name to resolve
        check_date: Date for alias validity

    Returns:
        StadiumResolveResult
    """
    return get_stadium_resolver(sport).resolve(name, check_date)
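The exact → alias → fuzzy → unresolved cascade that `StadiumResolver.resolve` implements can be sketched stand-alone. This toy version substitutes `difflib` for the package's `fuzzy_match_stadium` helper and uses illustrative data and thresholds, so it is a shape sketch, not the module's actual matcher:

```python
import difflib
from typing import Optional


def resolve(name: str, stadiums: dict[str, str], aliases: dict[str, str],
            cutoff: float = 0.85) -> tuple[Optional[str], int, str]:
    """Toy cascade: exact -> alias -> fuzzy -> unresolved.

    stadiums maps canonical_id -> display name; aliases maps a
    lowercase alias -> canonical_id. Returns (id, confidence, match_type).
    """
    key = name.lower().strip()

    # 1. Exact match against known display names
    for sid, display in stadiums.items():
        if key == display.lower():
            return sid, 100, "exact"

    # 2. Alias lookup (the real code also checks date validity and sport prefix)
    if key in aliases:
        return aliases[key], 95, "alias"

    # 3. Fuzzy match (difflib stands in for fuzzy_match_stadium)
    best = difflib.get_close_matches(
        key, [d.lower() for d in stadiums.values()], n=1, cutoff=cutoff
    )
    if best:
        sid = next(s for s, d in stadiums.items() if d.lower() == best[0])
        return sid, 90, "fuzzy"

    # 4/5. No geographic data here, so fall straight through to unresolved
    return None, 0, "unresolved"


stadiums = {"stadium_mlb_busch_stadium": "Busch Stadium"}
aliases = {"busch": "stadium_mlb_busch_stadium"}
```

Because each step returns as soon as it fires, a cheap exact hit never pays for a fuzzy scan, which is the same ordering rationale as in the class above.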
514
sportstime_parser/normalizers/team_resolver.py
Normal file
@@ -0,0 +1,514 @@
"""Team name resolver with exact, alias, and fuzzy matching."""

from dataclasses import dataclass
from datetime import date
from typing import Optional
from uuid import uuid4

from ..config import FUZZY_MATCH_THRESHOLD
from ..models.aliases import (
    AliasType,
    FuzzyMatch,
    ManualReviewItem,
    ReviewReason,
)
from .alias_loader import get_team_alias_loader, TeamAliasLoader
from .fuzzy import MatchCandidate, fuzzy_match_team, exact_match


@dataclass
class TeamResolveResult:
    """Result of team resolution.

    Attributes:
        canonical_id: Resolved canonical team ID (None if unresolved)
        confidence: Confidence in the match (100 for exact, lower for fuzzy)
        match_type: How the match was made ('exact', 'alias', 'fuzzy', 'unresolved')
        review_item: ManualReviewItem if resolution failed or low confidence
    """

    canonical_id: Optional[str]
    confidence: int
    match_type: str
    review_item: Optional[ManualReviewItem] = None


# Hardcoded team mappings for each sport
# Format: {sport: {abbreviation: (canonical_id, full_name, city, stadium_id)}}
TEAM_MAPPINGS: dict[str, dict[str, tuple[str, str, str, str]]] = {
    "nba": {
        "ATL": ("team_nba_atl", "Atlanta Hawks", "Atlanta", "stadium_nba_state_farm_arena"),
        "BOS": ("team_nba_bos", "Boston Celtics", "Boston", "stadium_nba_td_garden"),
        "BKN": ("team_nba_brk", "Brooklyn Nets", "Brooklyn", "stadium_nba_barclays_center"),
        "BRK": ("team_nba_brk", "Brooklyn Nets", "Brooklyn", "stadium_nba_barclays_center"),
        "CHA": ("team_nba_cho", "Charlotte Hornets", "Charlotte", "stadium_nba_spectrum_center"),
        "CHO": ("team_nba_cho", "Charlotte Hornets", "Charlotte", "stadium_nba_spectrum_center"),
        "CHI": ("team_nba_chi", "Chicago Bulls", "Chicago", "stadium_nba_united_center"),
        "CLE": ("team_nba_cle", "Cleveland Cavaliers", "Cleveland", "stadium_nba_rocket_mortgage_fieldhouse"),
        "DAL": ("team_nba_dal", "Dallas Mavericks", "Dallas", "stadium_nba_american_airlines_center"),
        "DEN": ("team_nba_den", "Denver Nuggets", "Denver", "stadium_nba_ball_arena"),
        "DET": ("team_nba_det", "Detroit Pistons", "Detroit", "stadium_nba_little_caesars_arena"),
        "GSW": ("team_nba_gsw", "Golden State Warriors", "Golden State", "stadium_nba_chase_center"),
        "GS": ("team_nba_gsw", "Golden State Warriors", "Golden State", "stadium_nba_chase_center"),
        "HOU": ("team_nba_hou", "Houston Rockets", "Houston", "stadium_nba_toyota_center"),
        "IND": ("team_nba_ind", "Indiana Pacers", "Indiana", "stadium_nba_gainbridge_fieldhouse"),
        "LAC": ("team_nba_lac", "Los Angeles Clippers", "Los Angeles", "stadium_nba_intuit_dome"),
        "LAL": ("team_nba_lal", "Los Angeles Lakers", "Los Angeles", "stadium_nba_cryptocom_arena"),
        "MEM": ("team_nba_mem", "Memphis Grizzlies", "Memphis", "stadium_nba_fedexforum"),
        "MIA": ("team_nba_mia", "Miami Heat", "Miami", "stadium_nba_kaseya_center"),
        "MIL": ("team_nba_mil", "Milwaukee Bucks", "Milwaukee", "stadium_nba_fiserv_forum"),
        "MIN": ("team_nba_min", "Minnesota Timberwolves", "Minnesota", "stadium_nba_target_center"),
        "NOP": ("team_nba_nop", "New Orleans Pelicans", "New Orleans", "stadium_nba_smoothie_king_center"),
        "NO": ("team_nba_nop", "New Orleans Pelicans", "New Orleans", "stadium_nba_smoothie_king_center"),
        "NYK": ("team_nba_nyk", "New York Knicks", "New York", "stadium_nba_madison_square_garden"),
        "NY": ("team_nba_nyk", "New York Knicks", "New York", "stadium_nba_madison_square_garden"),
        "OKC": ("team_nba_okc", "Oklahoma City Thunder", "Oklahoma City", "stadium_nba_paycom_center"),
        "ORL": ("team_nba_orl", "Orlando Magic", "Orlando", "stadium_nba_kia_center"),
        "PHI": ("team_nba_phi", "Philadelphia 76ers", "Philadelphia", "stadium_nba_wells_fargo_center"),
        "PHX": ("team_nba_phx", "Phoenix Suns", "Phoenix", "stadium_nba_footprint_center"),
        "PHO": ("team_nba_phx", "Phoenix Suns", "Phoenix", "stadium_nba_footprint_center"),
        "POR": ("team_nba_por", "Portland Trail Blazers", "Portland", "stadium_nba_moda_center"),
        "SAC": ("team_nba_sac", "Sacramento Kings", "Sacramento", "stadium_nba_golden_1_center"),
        "SAS": ("team_nba_sas", "San Antonio Spurs", "San Antonio", "stadium_nba_frost_bank_center"),
        "SA": ("team_nba_sas", "San Antonio Spurs", "San Antonio", "stadium_nba_frost_bank_center"),
        "TOR": ("team_nba_tor", "Toronto Raptors", "Toronto", "stadium_nba_scotiabank_arena"),
        "UTA": ("team_nba_uta", "Utah Jazz", "Utah", "stadium_nba_delta_center"),
        "WAS": ("team_nba_was", "Washington Wizards", "Washington", "stadium_nba_capital_one_arena"),
        "WSH": ("team_nba_was", "Washington Wizards", "Washington", "stadium_nba_capital_one_arena"),
    },
    "mlb": {
        "ARI": ("team_mlb_ari", "Arizona Diamondbacks", "Arizona", "stadium_mlb_chase_field"),
        "ATL": ("team_mlb_atl", "Atlanta Braves", "Atlanta", "stadium_mlb_truist_park"),
        "BAL": ("team_mlb_bal", "Baltimore Orioles", "Baltimore", "stadium_mlb_oriole_park_at_camden_yards"),
        "BOS": ("team_mlb_bos", "Boston Red Sox", "Boston", "stadium_mlb_fenway_park"),
        "CHC": ("team_mlb_chc", "Chicago Cubs", "Chicago", "stadium_mlb_wrigley_field"),
        "CHW": ("team_mlb_chw", "Chicago White Sox", "Chicago", "stadium_mlb_guaranteed_rate_field"),
        "CWS": ("team_mlb_chw", "Chicago White Sox", "Chicago", "stadium_mlb_guaranteed_rate_field"),
        "CIN": ("team_mlb_cin", "Cincinnati Reds", "Cincinnati", "stadium_mlb_great_american_ball_park"),
        "CLE": ("team_mlb_cle", "Cleveland Guardians", "Cleveland", "stadium_mlb_progressive_field"),
        "COL": ("team_mlb_col", "Colorado Rockies", "Colorado", "stadium_mlb_coors_field"),
        "DET": ("team_mlb_det", "Detroit Tigers", "Detroit", "stadium_mlb_comerica_park"),
        "HOU": ("team_mlb_hou", "Houston Astros", "Houston", "stadium_mlb_minute_maid_park"),
        "KC": ("team_mlb_kc", "Kansas City Royals", "Kansas City", "stadium_mlb_kauffman_stadium"),
        "KCR": ("team_mlb_kc", "Kansas City Royals", "Kansas City", "stadium_mlb_kauffman_stadium"),
        "LAA": ("team_mlb_laa", "Los Angeles Angels", "Los Angeles", "stadium_mlb_angel_stadium"),
        "ANA": ("team_mlb_laa", "Los Angeles Angels", "Anaheim", "stadium_mlb_angel_stadium"),
        "LAD": ("team_mlb_lad", "Los Angeles Dodgers", "Los Angeles", "stadium_mlb_dodger_stadium"),
        "MIA": ("team_mlb_mia", "Miami Marlins", "Miami", "stadium_mlb_loandepot_park"),
        "FLA": ("team_mlb_mia", "Miami Marlins", "Florida", "stadium_mlb_loandepot_park"),
        "MIL": ("team_mlb_mil", "Milwaukee Brewers", "Milwaukee", "stadium_mlb_american_family_field"),
        "MIN": ("team_mlb_min", "Minnesota Twins", "Minnesota", "stadium_mlb_target_field"),
        "NYM": ("team_mlb_nym", "New York Mets", "New York", "stadium_mlb_citi_field"),
        "NYY": ("team_mlb_nyy", "New York Yankees", "New York", "stadium_mlb_yankee_stadium"),
        "OAK": ("team_mlb_oak", "Oakland Athletics", "Oakland", "stadium_mlb_sutter_health_park"),
        "PHI": ("team_mlb_phi", "Philadelphia Phillies", "Philadelphia", "stadium_mlb_citizens_bank_park"),
        "PIT": ("team_mlb_pit", "Pittsburgh Pirates", "Pittsburgh", "stadium_mlb_pnc_park"),
        "SD": ("team_mlb_sd", "San Diego Padres", "San Diego", "stadium_mlb_petco_park"),
        "SDP": ("team_mlb_sd", "San Diego Padres", "San Diego", "stadium_mlb_petco_park"),
        "SF": ("team_mlb_sf", "San Francisco Giants", "San Francisco", "stadium_mlb_oracle_park"),
        "SFG": ("team_mlb_sf", "San Francisco Giants", "San Francisco", "stadium_mlb_oracle_park"),
        "SEA": ("team_mlb_sea", "Seattle Mariners", "Seattle", "stadium_mlb_tmobile_park"),
        "STL": ("team_mlb_stl", "St. Louis Cardinals", "St. Louis", "stadium_mlb_busch_stadium"),
        "TB": ("team_mlb_tbr", "Tampa Bay Rays", "Tampa Bay", "stadium_mlb_tropicana_field"),
        "TBR": ("team_mlb_tbr", "Tampa Bay Rays", "Tampa Bay", "stadium_mlb_tropicana_field"),
        "TEX": ("team_mlb_tex", "Texas Rangers", "Texas", "stadium_mlb_globe_life_field"),
        "TOR": ("team_mlb_tor", "Toronto Blue Jays", "Toronto", "stadium_mlb_rogers_centre"),
        "WSN": ("team_mlb_wsn", "Washington Nationals", "Washington", "stadium_mlb_nationals_park"),
        "WAS": ("team_mlb_wsn", "Washington Nationals", "Washington", "stadium_mlb_nationals_park"),
    },
    "nfl": {
        "ARI": ("team_nfl_ari", "Arizona Cardinals", "Arizona", "stadium_nfl_state_farm_stadium"),
        "ATL": ("team_nfl_atl", "Atlanta Falcons", "Atlanta", "stadium_nfl_mercedes_benz_stadium"),
        "BAL": ("team_nfl_bal", "Baltimore Ravens", "Baltimore", "stadium_nfl_mandt_bank_stadium"),
        "BUF": ("team_nfl_buf", "Buffalo Bills", "Buffalo", "stadium_nfl_highmark_stadium"),
        "CAR": ("team_nfl_car", "Carolina Panthers", "Carolina", "stadium_nfl_bank_of_america_stadium"),
        "CHI": ("team_nfl_chi", "Chicago Bears", "Chicago", "stadium_nfl_soldier_field"),
        "CIN": ("team_nfl_cin", "Cincinnati Bengals", "Cincinnati", "stadium_nfl_paycor_stadium"),
        "CLE": ("team_nfl_cle", "Cleveland Browns", "Cleveland", "stadium_nfl_huntington_bank_field"),
        "DAL": ("team_nfl_dal", "Dallas Cowboys", "Dallas", "stadium_nfl_att_stadium"),
        "DEN": ("team_nfl_den", "Denver Broncos", "Denver", "stadium_nfl_empower_field"),
        "DET": ("team_nfl_det", "Detroit Lions", "Detroit", "stadium_nfl_ford_field"),
        "GB": ("team_nfl_gb", "Green Bay Packers", "Green Bay", "stadium_nfl_lambeau_field"),
        "GNB": ("team_nfl_gb", "Green Bay Packers", "Green Bay", "stadium_nfl_lambeau_field"),
        "HOU": ("team_nfl_hou", "Houston Texans", "Houston", "stadium_nfl_nrg_stadium"),
        "IND": ("team_nfl_ind", "Indianapolis Colts", "Indianapolis", "stadium_nfl_lucas_oil_stadium"),
        "JAX": ("team_nfl_jax", "Jacksonville Jaguars", "Jacksonville", "stadium_nfl_everbank_stadium"),
        "JAC": ("team_nfl_jax", "Jacksonville Jaguars", "Jacksonville", "stadium_nfl_everbank_stadium"),
        "KC": ("team_nfl_kc", "Kansas City Chiefs", "Kansas City", "stadium_nfl_arrowhead_stadium"),
        "KAN": ("team_nfl_kc", "Kansas City Chiefs", "Kansas City", "stadium_nfl_arrowhead_stadium"),
        "LV": ("team_nfl_lv", "Las Vegas Raiders", "Las Vegas", "stadium_nfl_allegiant_stadium"),
        "LAC": ("team_nfl_lac", "Los Angeles Chargers", "Los Angeles", "stadium_nfl_sofi_stadium"),
        "LAR": ("team_nfl_lar", "Los Angeles Rams", "Los Angeles", "stadium_nfl_sofi_stadium"),
        "MIA": ("team_nfl_mia", "Miami Dolphins", "Miami", "stadium_nfl_hard_rock_stadium"),
        "MIN": ("team_nfl_min", "Minnesota Vikings", "Minnesota", "stadium_nfl_us_bank_stadium"),
        "NE": ("team_nfl_ne", "New England Patriots", "New England", "stadium_nfl_gillette_stadium"),
        "NWE": ("team_nfl_ne", "New England Patriots", "New England", "stadium_nfl_gillette_stadium"),
        "NO": ("team_nfl_no", "New Orleans Saints", "New Orleans", "stadium_nfl_caesars_superdome"),
        "NOR": ("team_nfl_no", "New Orleans Saints", "New Orleans", "stadium_nfl_caesars_superdome"),
        "NYG": ("team_nfl_nyg", "New York Giants", "New York", "stadium_nfl_metlife_stadium"),
        "NYJ": ("team_nfl_nyj", "New York Jets", "New York", "stadium_nfl_metlife_stadium"),
        "PHI": ("team_nfl_phi", "Philadelphia Eagles", "Philadelphia", "stadium_nfl_lincoln_financial_field"),
        "PIT": ("team_nfl_pit", "Pittsburgh Steelers", "Pittsburgh", "stadium_nfl_acrisure_stadium"),
        "SF": ("team_nfl_sf", "San Francisco 49ers", "San Francisco", "stadium_nfl_levis_stadium"),
        "SFO": ("team_nfl_sf", "San Francisco 49ers", "San Francisco", "stadium_nfl_levis_stadium"),
        "SEA": ("team_nfl_sea", "Seattle Seahawks", "Seattle", "stadium_nfl_lumen_field"),
        "TB": ("team_nfl_tb", "Tampa Bay Buccaneers", "Tampa Bay", "stadium_nfl_raymond_james_stadium"),
        "TAM": ("team_nfl_tb", "Tampa Bay Buccaneers", "Tampa Bay", "stadium_nfl_raymond_james_stadium"),
        "TEN": ("team_nfl_ten", "Tennessee Titans", "Tennessee", "stadium_nfl_nissan_stadium"),
        "WAS": ("team_nfl_was", "Washington Commanders", "Washington", "stadium_nfl_northwest_stadium"),
        "WSH": ("team_nfl_was", "Washington Commanders", "Washington", "stadium_nfl_northwest_stadium"),
    },
    "nhl": {
        "ANA": ("team_nhl_ana", "Anaheim Ducks", "Anaheim", "stadium_nhl_honda_center"),
        "ARI": ("team_nhl_ari", "Utah Hockey Club", "Utah", "stadium_nhl_delta_center"),  # Moved 2024
        "UTA": ("team_nhl_ari", "Utah Hockey Club", "Utah", "stadium_nhl_delta_center"),
        "BOS": ("team_nhl_bos", "Boston Bruins", "Boston", "stadium_nhl_td_garden"),
        "BUF": ("team_nhl_buf", "Buffalo Sabres", "Buffalo", "stadium_nhl_keybank_center"),
        "CGY": ("team_nhl_cgy", "Calgary Flames", "Calgary", "stadium_nhl_scotiabank_saddledome"),
        "CAR": ("team_nhl_car", "Carolina Hurricanes", "Carolina", "stadium_nhl_pnc_arena"),
        "CHI": ("team_nhl_chi", "Chicago Blackhawks", "Chicago", "stadium_nhl_united_center"),
        "COL": ("team_nhl_col", "Colorado Avalanche", "Colorado", "stadium_nhl_ball_arena"),
        "CBJ": ("team_nhl_cbj", "Columbus Blue Jackets", "Columbus", "stadium_nhl_nationwide_arena"),
        "DAL": ("team_nhl_dal", "Dallas Stars", "Dallas", "stadium_nhl_american_airlines_center"),
        "DET": ("team_nhl_det", "Detroit Red Wings", "Detroit", "stadium_nhl_little_caesars_arena"),
        "EDM": ("team_nhl_edm", "Edmonton Oilers", "Edmonton", "stadium_nhl_rogers_place"),
        "FLA": ("team_nhl_fla", "Florida Panthers", "Florida", "stadium_nhl_amerant_bank_arena"),
        "LA": ("team_nhl_la", "Los Angeles Kings", "Los Angeles", "stadium_nhl_cryptocom_arena"),
        "LAK": ("team_nhl_la", "Los Angeles Kings", "Los Angeles", "stadium_nhl_cryptocom_arena"),
        "MIN": ("team_nhl_min", "Minnesota Wild", "Minnesota", "stadium_nhl_xcel_energy_center"),
        "MTL": ("team_nhl_mtl", "Montreal Canadiens", "Montreal", "stadium_nhl_bell_centre"),
        "MON": ("team_nhl_mtl", "Montreal Canadiens", "Montreal", "stadium_nhl_bell_centre"),
        "NSH": ("team_nhl_nsh", "Nashville Predators", "Nashville", "stadium_nhl_bridgestone_arena"),
        "NAS": ("team_nhl_nsh", "Nashville Predators", "Nashville", "stadium_nhl_bridgestone_arena"),
        "NJ": ("team_nhl_njd", "New Jersey Devils", "New Jersey", "stadium_nhl_prudential_center"),
        "NJD": ("team_nhl_njd", "New Jersey Devils", "New Jersey", "stadium_nhl_prudential_center"),
        "NYI": ("team_nhl_nyi", "New York Islanders", "New York", "stadium_nhl_ubs_arena"),
        "NYR": ("team_nhl_nyr", "New York Rangers", "New York", "stadium_nhl_madison_square_garden"),
        "OTT": ("team_nhl_ott", "Ottawa Senators", "Ottawa", "stadium_nhl_canadian_tire_centre"),
        "PHI": ("team_nhl_phi", "Philadelphia Flyers", "Philadelphia", "stadium_nhl_wells_fargo_center"),
        "PIT": ("team_nhl_pit", "Pittsburgh Penguins", "Pittsburgh", "stadium_nhl_ppg_paints_arena"),
        "SJ": ("team_nhl_sj", "San Jose Sharks", "San Jose", "stadium_nhl_sap_center"),
        "SJS": ("team_nhl_sj", "San Jose Sharks", "San Jose", "stadium_nhl_sap_center"),
        "SEA": ("team_nhl_sea", "Seattle Kraken", "Seattle", "stadium_nhl_climate_pledge_arena"),
        "STL": ("team_nhl_stl", "St. Louis Blues", "St. Louis", "stadium_nhl_enterprise_center"),
        "TB": ("team_nhl_tb", "Tampa Bay Lightning", "Tampa Bay", "stadium_nhl_amalie_arena"),
        "TBL": ("team_nhl_tb", "Tampa Bay Lightning", "Tampa Bay", "stadium_nhl_amalie_arena"),
        "TOR": ("team_nhl_tor", "Toronto Maple Leafs", "Toronto", "stadium_nhl_scotiabank_arena"),
        "VAN": ("team_nhl_van", "Vancouver Canucks", "Vancouver", "stadium_nhl_rogers_arena"),
        "VGK": ("team_nhl_vgk", "Vegas Golden Knights", "Vegas", "stadium_nhl_tmobile_arena"),
        "VEG": ("team_nhl_vgk", "Vegas Golden Knights", "Vegas", "stadium_nhl_tmobile_arena"),
        "WAS": ("team_nhl_was", "Washington Capitals", "Washington", "stadium_nhl_capital_one_arena"),
        "WSH": ("team_nhl_was", "Washington Capitals", "Washington", "stadium_nhl_capital_one_arena"),
        "WPG": ("team_nhl_wpg", "Winnipeg Jets", "Winnipeg", "stadium_nhl_canada_life_centre"),
    },
    "mls": {
        "ATL": ("team_mls_atl", "Atlanta United", "Atlanta", "stadium_mls_mercedes_benz_stadium"),
        "AUS": ("team_mls_aus", "Austin FC", "Austin", "stadium_mls_q2_stadium"),
        "CLT": ("team_mls_clt", "Charlotte FC", "Charlotte", "stadium_mls_bank_of_america_stadium"),
        "CHI": ("team_mls_chi", "Chicago Fire", "Chicago", "stadium_mls_soldier_field"),
        "CIN": ("team_mls_cin", "FC Cincinnati", "Cincinnati", "stadium_mls_tql_stadium"),
        "COL": ("team_mls_col", "Colorado Rapids", "Colorado", "stadium_mls_dicks_sporting_goods_park"),
        "CLB": ("team_mls_clb", "Columbus Crew", "Columbus", "stadium_mls_lowercom_field"),
        "DAL": ("team_mls_dal", "FC Dallas", "Dallas", "stadium_mls_toyota_stadium"),
        "DC": ("team_mls_dc", "D.C. United", "Washington", "stadium_mls_audi_field"),
        "HOU": ("team_mls_hou", "Houston Dynamo", "Houston", "stadium_mls_shell_energy_stadium"),
        "LAG": ("team_mls_lag", "LA Galaxy", "Los Angeles", "stadium_mls_dignity_health_sports_park"),
        "LAFC": ("team_mls_lafc", "Los Angeles FC", "Los Angeles", "stadium_mls_bmo_stadium"),
        "MIA": ("team_mls_mia", "Inter Miami", "Miami", "stadium_mls_chase_stadium"),
        "MIN": ("team_mls_min", "Minnesota United", "Minnesota", "stadium_mls_allianz_field"),
        "MTL": ("team_mls_mtl", "CF Montreal", "Montreal", "stadium_mls_stade_saputo"),
        "NSH": ("team_mls_nsh", "Nashville SC", "Nashville", "stadium_mls_geodis_park"),
        "NE": ("team_mls_ne", "New England Revolution", "New England", "stadium_mls_gillette_stadium"),
        "NYC": ("team_mls_nyc", "New York City FC", "New York", "stadium_mls_yankee_stadium"),
        "RB": ("team_mls_ny", "New York Red Bulls", "New York", "stadium_mls_red_bull_arena"),
        "RBNY": ("team_mls_ny", "New York Red Bulls", "New York", "stadium_mls_red_bull_arena"),
        "ORL": ("team_mls_orl", "Orlando City", "Orlando", "stadium_mls_interco_stadium"),
        "PHI": ("team_mls_phi", "Philadelphia Union", "Philadelphia", "stadium_mls_subaru_park"),
        "POR": ("team_mls_por", "Portland Timbers", "Portland", "stadium_mls_providence_park"),
        "SLC": ("team_mls_slc", "Real Salt Lake", "Salt Lake", "stadium_mls_america_first_field"),
        "RSL": ("team_mls_slc", "Real Salt Lake", "Salt Lake", "stadium_mls_america_first_field"),
        "SJ": ("team_mls_sj", "San Jose Earthquakes", "San Jose", "stadium_mls_paypal_park"),
        "SD": ("team_mls_sd", "San Diego FC", "San Diego", "stadium_mls_snapdragon_stadium"),
        "SEA": ("team_mls_sea", "Seattle Sounders", "Seattle", "stadium_mls_lumen_field"),
        "SKC": ("team_mls_skc", "Sporting Kansas City", "Kansas City", "stadium_mls_childrens_mercy_park"),
        "STL": ("team_mls_stl", "St. Louis City SC", "St. Louis", "stadium_mls_citypark"),
        "TOR": ("team_mls_tor", "Toronto FC", "Toronto", "stadium_mls_bmo_field"),
        "VAN": ("team_mls_van", "Vancouver Whitecaps", "Vancouver", "stadium_mls_bc_place"),
    },
    "wnba": {
        "ATL": ("team_wnba_atl", "Atlanta Dream", "Atlanta", "stadium_wnba_gateway_center_arena"),
        "DREAM": ("team_wnba_atl", "Atlanta Dream", "Atlanta", "stadium_wnba_gateway_center_arena"),  # alias
        "CHI": ("team_wnba_chi", "Chicago Sky", "Chicago", "stadium_wnba_wintrust_arena"),
        "SKY": ("team_wnba_chi", "Chicago Sky", "Chicago", "stadium_wnba_wintrust_arena"),  # alias
        "CON": ("team_wnba_con", "Connecticut Sun", "Connecticut", "stadium_wnba_mohegan_sun_arena"),
        "CONN": ("team_wnba_con", "Connecticut Sun", "Connecticut", "stadium_wnba_mohegan_sun_arena"),  # alias
        "SUN": ("team_wnba_con", "Connecticut Sun", "Connecticut", "stadium_wnba_mohegan_sun_arena"),  # alias
        "DAL": ("team_wnba_dal", "Dallas Wings", "Dallas", "stadium_wnba_college_park_center"),
        "WINGS": ("team_wnba_dal", "Dallas Wings", "Dallas", "stadium_wnba_college_park_center"),  # alias
        "GSV": ("team_wnba_gsv", "Golden State Valkyries", "Golden State", "stadium_wnba_chase_center"),
        "GS": ("team_wnba_gsv", "Golden State Valkyries", "Golden State", "stadium_wnba_chase_center"),  # alias
        "VAL": ("team_wnba_gsv", "Golden State Valkyries", "Golden State", "stadium_wnba_chase_center"),  # alias
        "IND": ("team_wnba_ind", "Indiana Fever", "Indiana", "stadium_wnba_gainbridge_fieldhouse"),
        "FEVER": ("team_wnba_ind", "Indiana Fever", "Indiana", "stadium_wnba_gainbridge_fieldhouse"),  # alias
        "LV": ("team_wnba_lv", "Las Vegas Aces", "Las Vegas", "stadium_wnba_michelob_ultra_arena"),
        "LVA": ("team_wnba_lv", "Las Vegas Aces", "Las Vegas", "stadium_wnba_michelob_ultra_arena"),  # alias
        "ACES": ("team_wnba_lv", "Las Vegas Aces", "Las Vegas", "stadium_wnba_michelob_ultra_arena"),  # alias
        "LA": ("team_wnba_la", "Los Angeles Sparks", "Los Angeles", "stadium_wnba_cryptocom_arena"),
        "LAS": ("team_wnba_la", "Los Angeles Sparks", "Los Angeles", "stadium_wnba_cryptocom_arena"),  # alias
        "SPARKS": ("team_wnba_la", "Los Angeles Sparks", "Los Angeles", "stadium_wnba_cryptocom_arena"),  # alias
        "MIN": ("team_wnba_min", "Minnesota Lynx", "Minnesota", "stadium_wnba_target_center"),
        "LYNX": ("team_wnba_min", "Minnesota Lynx", "Minnesota", "stadium_wnba_target_center"),  # alias
        "NY": ("team_wnba_ny", "New York Liberty", "New York", "stadium_wnba_barclays_center"),
        "NYL": ("team_wnba_ny", "New York Liberty", "New York", "stadium_wnba_barclays_center"),  # alias
        "LIB": ("team_wnba_ny", "New York Liberty", "New York", "stadium_wnba_barclays_center"),  # alias
        "PHX": ("team_wnba_phx", "Phoenix Mercury", "Phoenix", "stadium_wnba_footprint_center"),
        "PHO": ("team_wnba_phx", "Phoenix Mercury", "Phoenix", "stadium_wnba_footprint_center"),  # alias
        "MERCURY": ("team_wnba_phx", "Phoenix Mercury", "Phoenix", "stadium_wnba_footprint_center"),  # alias
        "SEA": ("team_wnba_sea", "Seattle Storm", "Seattle", "stadium_wnba_climate_pledge_arena"),
        "STORM": ("team_wnba_sea", "Seattle Storm", "Seattle", "stadium_wnba_climate_pledge_arena"),  # alias
        "WAS": ("team_wnba_was", "Washington Mystics", "Washington", "stadium_wnba_entertainment_sports_arena"),
        "WSH": ("team_wnba_was", "Washington Mystics", "Washington", "stadium_wnba_entertainment_sports_arena"),  # alias
        "MYSTICS": ("team_wnba_was", "Washington Mystics", "Washington", "stadium_wnba_entertainment_sports_arena"),  # alias
    },
    "nwsl": {
        # Canonical IDs aligned with teams_canonical.json
        "ANG": ("team_nwsl_ang", "Angel City FC", "Los Angeles", "stadium_nwsl_bmo_stadium"),
        "ANF": ("team_nwsl_ang", "Angel City FC", "Los Angeles", "stadium_nwsl_bmo_stadium"),  # alias
        "CHI": ("team_nwsl_chi", "Chicago Red Stars", "Chicago", "stadium_nwsl_seatgeek_stadium"),
        "HOU": ("team_nwsl_hou", "Houston Dash", "Houston", "stadium_nwsl_shell_energy_stadium"),
        "KCC": ("team_nwsl_kcc", "Kansas City Current", "Kansas City", "stadium_nwsl_cpkc_stadium"),
        "KC": ("team_nwsl_kcc", "Kansas City Current", "Kansas City", "stadium_nwsl_cpkc_stadium"),  # alias
        "NJY": ("team_nwsl_njy", "NJ/NY Gotham FC", "New Jersey", "stadium_nwsl_red_bull_arena"),
        "NJ": ("team_nwsl_njy", "NJ/NY Gotham FC", "New Jersey", "stadium_nwsl_red_bull_arena"),  # alias
        "NCC": ("team_nwsl_ncc", "North Carolina Courage", "North Carolina", "stadium_nwsl_wakemed_soccer_park"),
        "NC": ("team_nwsl_ncc", "North Carolina Courage", "North Carolina", "stadium_nwsl_wakemed_soccer_park"),  # alias
        "ORL": ("team_nwsl_orl", "Orlando Pride", "Orlando", "stadium_nwsl_interco_stadium"),
        "POR": ("team_nwsl_por", "Portland Thorns", "Portland", "stadium_nwsl_providence_park"),
        "RGN": ("team_nwsl_rgn", "Racing Louisville", "Louisville", "stadium_nwsl_lynn_family_stadium"),
        "SDW": ("team_nwsl_sdw", "San Diego Wave", "San Diego", "stadium_nwsl_snapdragon_stadium"),
        "SD": ("team_nwsl_sdw", "San Diego Wave", "San Diego", "stadium_nwsl_snapdragon_stadium"),  # alias
        "SEA": ("team_nwsl_sea", "Seattle Reign", "Seattle", "stadium_nwsl_lumen_field"),
        "UTA": ("team_nwsl_uta", "Utah Royals", "Utah", "stadium_nwsl_america_first_field"),
        "SLC": ("team_nwsl_uta", "Utah Royals", "Utah", "stadium_nwsl_america_first_field"),  # alias
        "WSH": ("team_nwsl_wsh", "Washington Spirit", "Washington", "stadium_nwsl_audi_field"),
        "WAS": ("team_nwsl_wsh", "Washington Spirit", "Washington", "stadium_nwsl_audi_field"),  # alias
        "BAY": ("team_nwsl_bay", "Bay FC", "San Francisco", "stadium_nwsl_paypal_park"),
        "BFC": ("team_nwsl_bay", "Bay FC", "San Francisco", "stadium_nwsl_paypal_park"),  # alias
        # Expansion teams (2026) - need to be added to teams_canonical.json
        "BOS": ("team_nwsl_bos", "Boston Legacy FC", "Boston", "stadium_nwsl_gillette_stadium"),
        "DEN": ("team_nwsl_den", "Denver Summit FC", "Denver", "stadium_nwsl_dicks_sporting_goods_park"),
    },
}


class TeamResolver:
    """Resolves team names to canonical IDs.

    Resolution order:
    1. Exact match against abbreviation mappings
    2. Exact match against full team names
    3. Alias lookup (with date awareness)
    4. Fuzzy match against all known names
    5. Unresolved (returns ManualReviewItem)
    """

    def __init__(
        self,
        sport: str,
        alias_loader: Optional[TeamAliasLoader] = None,
        fuzzy_threshold: int = FUZZY_MATCH_THRESHOLD,
    ):
        """Initialize the resolver.

        Args:
            sport: Sport code (e.g., 'nba', 'mlb')
            alias_loader: Team alias loader (default: global loader)
            fuzzy_threshold: Minimum fuzzy match score
        """
        self.sport = sport.lower()
        self.alias_loader = alias_loader or get_team_alias_loader()
        self.fuzzy_threshold = fuzzy_threshold
        self._mappings = TEAM_MAPPINGS.get(self.sport, {})

        # Build match candidates for fuzzy matching
        self._candidates = self._build_candidates()

    def _build_candidates(self) -> list[MatchCandidate]:
        """Build match candidates from team mappings."""
        # Group by canonical ID to avoid duplicates
        by_id: dict[str, tuple[str, list[str]]] = {}

        for abbrev, (canonical_id, full_name, city, stadium_id) in self._mappings.items():
            if canonical_id not in by_id:
                by_id[canonical_id] = (full_name, [])

            # Add abbreviation as alias
            by_id[canonical_id][1].append(abbrev)
            by_id[canonical_id][1].append(city)

        return [
            MatchCandidate(
                canonical_id=cid,
                name=name,
                aliases=list(set(aliases)),  # Dedupe
            )
            for cid, (name, aliases) in by_id.items()
        ]

    def resolve(
        self,
        value: str,
        check_date: Optional[date] = None,
        source_url: Optional[str] = None,
    ) -> TeamResolveResult:
        """Resolve a team name to a canonical ID.

        Args:
            value: Team name, abbreviation, or city to resolve
            check_date: Date for alias validity (None = today)
            source_url: Source URL for manual review items

        Returns:
            TeamResolveResult with resolution details
        """
        value_upper = value.upper().strip()
        value_lower = value.lower().strip()

        # 1. Exact match against abbreviation
        if value_upper in self._mappings:
            canonical_id, full_name, _, _ = self._mappings[value_upper]
            return TeamResolveResult(
                canonical_id=canonical_id,
                confidence=100,
                match_type="exact",
            )

        # 2. Exact match against full names
        for abbrev, (canonical_id, full_name, city, stadium_id) in self._mappings.items():
            if value_lower == full_name.lower() or value_lower == city.lower():
                return TeamResolveResult(
                    canonical_id=canonical_id,
                    confidence=100,
                    match_type="exact",
                )

        # 3. Alias lookup
        alias_result = self.alias_loader.resolve(value, check_date)
        if alias_result:
            return TeamResolveResult(
                canonical_id=alias_result,
                confidence=95,
                match_type="alias",
            )

        # 4. Fuzzy match
        matches = fuzzy_match_team(
            value,
            self._candidates,
            threshold=self.fuzzy_threshold,
        )

        if matches:
            best = matches[0]
            review_item = None

            # Create review item for low confidence matches
            if best.confidence < 90:
                review_item = ManualReviewItem(
                    id=f"team_{uuid4().hex[:8]}",
                    reason=ReviewReason.LOW_CONFIDENCE_MATCH,
                    sport=self.sport,
                    raw_value=value,
                    context={"match_type": "fuzzy"},
                    source_url=source_url,
                    suggested_matches=matches,
                    game_date=check_date,
                )

            return TeamResolveResult(
                canonical_id=best.canonical_id,
                confidence=best.confidence,
                match_type="fuzzy",
                review_item=review_item,
            )

        # 5. Unresolved
        review_item = ManualReviewItem(
            id=f"team_{uuid4().hex[:8]}",
            reason=ReviewReason.UNRESOLVED_TEAM,
            sport=self.sport,
            raw_value=value,
            context={},
            source_url=source_url,
            suggested_matches=fuzzy_match_team(
                value,
                self._candidates,
                threshold=50,  # Lower threshold for suggestions
                top_n=5,
            ),
            game_date=check_date,
        )

        return TeamResolveResult(
            canonical_id=None,
            confidence=0,
            match_type="unresolved",
            review_item=review_item,
        )

    def get_team_info(self, abbreviation: str) -> Optional[tuple[str, str, str, str]]:
        """Get team info by abbreviation.

        Args:
            abbreviation: Team abbreviation

        Returns:
            Tuple of (canonical_id, full_name, city, stadium_id) or None
        """
        return self._mappings.get(abbreviation.upper())

    def get_all_teams(self) -> list[tuple[str, str, str, str]]:
        """Get all teams for this sport.

        Returns:
            List of (canonical_id, full_name, city, stadium_id) tuples
        """
        seen = set()
        result = []

        for abbrev, (canonical_id, full_name, city, stadium_id) in self._mappings.items():
            if canonical_id not in seen:
                seen.add(canonical_id)
                result.append((canonical_id, full_name, city, stadium_id))

        return result


# Cached resolvers
_resolvers: dict[str, TeamResolver] = {}


def get_team_resolver(sport: str) -> TeamResolver:
    """Get or create a team resolver for a sport."""
    sport_lower = sport.lower()
    if sport_lower not in _resolvers:
        _resolvers[sport_lower] = TeamResolver(sport_lower)
    return _resolvers[sport_lower]


def resolve_team(
    sport: str,
    value: str,
    check_date: Optional[date] = None,
) -> TeamResolveResult:
    """Convenience function to resolve a team name.

    Args:
        sport: Sport code
        value: Team name to resolve
        check_date: Date for alias validity

    Returns:
        TeamResolveResult
    """
    return get_team_resolver(sport).resolve(value, check_date)
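The first two steps of the resolution order (exact abbreviation, then exact full-name or city match) can be exercised in isolation. This sketch uses a two-team excerpt of the NBA mapping and a hypothetical `resolve_exact` helper; it mirrors the normalization in `TeamResolver.resolve` but is not the package's API:

```python
from typing import Optional

# Two-team excerpt of TEAM_MAPPINGS["nba"]:
# {abbreviation: (canonical_id, full_name, city, stadium_id)}
MINI_MAPPINGS = {
    "BOS": ("team_nba_bos", "Boston Celtics", "Boston", "stadium_nba_td_garden"),
    "GSW": ("team_nba_gsw", "Golden State Warriors", "Golden State", "stadium_nba_chase_center"),
}

def resolve_exact(value: str) -> tuple[Optional[str], str]:
    """Steps 1-2 of the resolver: abbreviation, then full name or city."""
    value_upper = value.upper().strip()
    value_lower = value.lower().strip()
    if value_upper in MINI_MAPPINGS:  # 1. exact abbreviation match
        return MINI_MAPPINGS[value_upper][0], "exact"
    for canonical_id, full_name, city, _ in MINI_MAPPINGS.values():
        if value_lower in (full_name.lower(), city.lower()):  # 2. name or city
            return canonical_id, "exact"
    return None, "unresolved"  # real resolver falls through to alias/fuzzy steps
```

Normalizing case and whitespace up front is what lets scraped values like `" bos "` and `"Golden State Warriors"` land on the same canonical ID without alias entries.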

sportstime_parser/normalizers/timezone.py (new file, 344 lines)
"""Timezone conversion utilities for normalizing game times to UTC."""

import re
from dataclasses import dataclass
from datetime import datetime, date, time
from typing import Optional
from zoneinfo import ZoneInfo

from dateutil import parser as dateutil_parser
from dateutil.tz import gettz, tzutc

from ..models.aliases import ReviewReason, ManualReviewItem


# Common timezone abbreviations to IANA timezones
TIMEZONE_ABBREV_MAP: dict[str, str] = {
    # US timezones
    "ET": "America/New_York",
    "EST": "America/New_York",
    "EDT": "America/New_York",
    "CT": "America/Chicago",
    "CST": "America/Chicago",
    "CDT": "America/Chicago",
    "MT": "America/Denver",
    "MST": "America/Denver",
    "MDT": "America/Denver",
    "PT": "America/Los_Angeles",
    "PST": "America/Los_Angeles",
    "PDT": "America/Los_Angeles",
    "AT": "America/Anchorage",
    "AKST": "America/Anchorage",
    "AKDT": "America/Anchorage",
    "HT": "Pacific/Honolulu",
    "HST": "Pacific/Honolulu",
    # Canada
    "AST": "America/Halifax",
    "ADT": "America/Halifax",
    "NST": "America/St_Johns",
    "NDT": "America/St_Johns",
    # Mexico
    "CDST": "America/Mexico_City",
    # UTC
    "UTC": "UTC",
    "GMT": "UTC",
    "Z": "UTC",
}
|
||||
|
||||
# State/region to timezone mapping for inferring timezone from location
|
||||
STATE_TIMEZONE_MAP: dict[str, str] = {
|
||||
# Eastern
|
||||
"CT": "America/New_York",
|
||||
"DE": "America/New_York",
|
||||
"FL": "America/New_York", # Most of Florida
|
||||
"GA": "America/New_York",
|
||||
"MA": "America/New_York",
|
||||
"MD": "America/New_York",
|
||||
"ME": "America/New_York",
|
||||
"MI": "America/Detroit",
|
||||
"NC": "America/New_York",
|
||||
"NH": "America/New_York",
|
||||
"NJ": "America/New_York",
|
||||
"NY": "America/New_York",
|
||||
"OH": "America/New_York",
|
||||
"PA": "America/New_York",
|
||||
"RI": "America/New_York",
|
||||
"SC": "America/New_York",
|
||||
"VA": "America/New_York",
|
||||
"VT": "America/New_York",
|
||||
"WV": "America/New_York",
|
||||
"DC": "America/New_York",
|
||||
# Central
|
||||
"AL": "America/Chicago",
|
||||
"AR": "America/Chicago",
|
||||
"IA": "America/Chicago",
|
||||
"IL": "America/Chicago",
|
||||
"IN": "America/Indiana/Indianapolis",
|
||||
"KS": "America/Chicago",
|
||||
"KY": "America/Kentucky/Louisville",
|
||||
"LA": "America/Chicago",
|
||||
"MN": "America/Chicago",
|
||||
"MO": "America/Chicago",
|
||||
"MS": "America/Chicago",
|
||||
"ND": "America/Chicago",
|
||||
"NE": "America/Chicago",
|
||||
"OK": "America/Chicago",
|
||||
"SD": "America/Chicago",
|
||||
"TN": "America/Chicago",
|
||||
"TX": "America/Chicago",
|
||||
"WI": "America/Chicago",
|
||||
# Mountain
|
||||
"AZ": "America/Phoenix", # No DST
|
||||
"CO": "America/Denver",
|
||||
"ID": "America/Boise",
|
||||
"MT": "America/Denver",
|
||||
"NM": "America/Denver",
|
||||
"UT": "America/Denver",
|
||||
"WY": "America/Denver",
|
||||
# Pacific
|
||||
"CA": "America/Los_Angeles",
|
||||
"NV": "America/Los_Angeles",
|
||||
"OR": "America/Los_Angeles",
|
||||
"WA": "America/Los_Angeles",
|
||||
# Alaska/Hawaii
|
||||
"AK": "America/Anchorage",
|
||||
"HI": "Pacific/Honolulu",
|
||||
# Canada provinces
|
||||
"ON": "America/Toronto",
|
||||
"QC": "America/Montreal",
|
||||
"BC": "America/Vancouver",
|
||||
"AB": "America/Edmonton",
|
||||
"MB": "America/Winnipeg",
|
||||
"SK": "America/Regina",
|
||||
"NS": "America/Halifax",
|
||||
"NB": "America/Moncton",
|
||||
"NL": "America/St_Johns",
|
||||
"PE": "America/Halifax",
|
||||
}
|
||||
|
||||
|
||||
@dataclass
|
||||
class TimezoneResult:
|
||||
"""Result of timezone conversion.
|
||||
|
||||
Attributes:
|
||||
datetime_utc: The datetime converted to UTC
|
||||
source_timezone: The timezone that was detected/used
|
||||
confidence: Confidence in the timezone detection ('high', 'medium', 'low')
|
||||
warning: Warning message if timezone was uncertain
|
||||
"""
|
||||
|
||||
datetime_utc: datetime
|
||||
source_timezone: str
|
||||
confidence: str
|
||||
warning: Optional[str] = None
|
||||
|
||||
|
||||
def detect_timezone_from_string(time_str: str) -> Optional[str]:
|
||||
"""Detect timezone from a time string containing a timezone abbreviation.
|
||||
|
||||
Args:
|
||||
time_str: Time string that may contain timezone info (e.g., '7:00 PM ET')
|
||||
|
||||
Returns:
|
||||
IANA timezone string if detected, None otherwise
|
||||
"""
|
||||
# Look for timezone abbreviation at end of string
|
||||
for abbrev, tz in TIMEZONE_ABBREV_MAP.items():
|
||||
pattern = rf"\b{abbrev}\b"
|
||||
if re.search(pattern, time_str, re.IGNORECASE):
|
||||
return tz
|
||||
|
||||
return None
|
||||
|
||||
|
||||
def detect_timezone_from_location(
|
||||
state: Optional[str] = None,
|
||||
city: Optional[str] = None,
|
||||
) -> Optional[str]:
|
||||
"""Detect timezone from location information.
|
||||
|
||||
Args:
|
||||
state: State/province code (e.g., 'NY', 'ON')
|
||||
city: City name (optional, for special cases)
|
||||
|
||||
Returns:
|
||||
IANA timezone string if detected, None otherwise
|
||||
"""
|
||||
if state and state.upper() in STATE_TIMEZONE_MAP:
|
||||
return STATE_TIMEZONE_MAP[state.upper()]
|
||||
|
||||
return None
|
||||
|
||||
|
||||
def parse_datetime(
|
||||
date_str: str,
|
||||
time_str: Optional[str] = None,
|
||||
timezone_hint: Optional[str] = None,
|
||||
location_state: Optional[str] = None,
|
||||
) -> TimezoneResult:
|
||||
"""Parse a date/time string and convert to UTC.
|
||||
|
||||
Attempts to detect timezone from:
|
||||
1. Explicit timezone in the string
|
||||
2. Provided timezone hint
|
||||
3. Location-based inference
|
||||
4. Default to Eastern Time with warning
|
||||
|
||||
Args:
|
||||
date_str: Date string (e.g., '2025-10-21', 'October 21, 2025')
|
||||
time_str: Optional time string (e.g., '7:00 PM ET', '19:00')
|
||||
timezone_hint: Optional IANA timezone to use if not detected
|
||||
location_state: Optional state code for timezone inference
|
||||
|
||||
Returns:
|
||||
TimezoneResult with UTC datetime and metadata
|
||||
"""
|
||||
# Parse the date
|
||||
try:
|
||||
if time_str:
|
||||
# Combine date and time
|
||||
full_str = f"{date_str} {time_str}"
|
||||
else:
|
||||
full_str = date_str
|
||||
|
||||
parsed = dateutil_parser.parse(full_str, fuzzy=True)
|
||||
except (ValueError, OverflowError) as e:
|
||||
# If parsing fails, return a placeholder with low confidence
|
||||
return TimezoneResult(
|
||||
datetime_utc=datetime.now(tz=ZoneInfo("UTC")),
|
||||
source_timezone="unknown",
|
||||
confidence="low",
|
||||
warning=f"Failed to parse datetime: {e}",
|
||||
)
|
||||
|
||||
# Determine timezone
|
||||
detected_tz = None
|
||||
confidence = "high"
|
||||
warning = None
|
||||
|
||||
# Check if datetime already has timezone
|
||||
if parsed.tzinfo is not None:
|
||||
detected_tz = str(parsed.tzinfo)
|
||||
else:
|
||||
# Try to detect from time string
|
||||
if time_str:
|
||||
detected_tz = detect_timezone_from_string(time_str)
|
||||
|
||||
# Try timezone hint
|
||||
if not detected_tz and timezone_hint:
|
||||
detected_tz = timezone_hint
|
||||
confidence = "medium"
|
||||
|
||||
# Try location inference
|
||||
if not detected_tz and location_state:
|
||||
detected_tz = detect_timezone_from_location(state=location_state)
|
||||
confidence = "medium"
|
||||
|
||||
# Default to Eastern Time
|
||||
if not detected_tz:
|
||||
detected_tz = "America/New_York"
|
||||
confidence = "low"
|
||||
warning = "Timezone not detected, defaulting to Eastern Time"
|
||||
|
||||
# Apply timezone and convert to UTC
|
||||
try:
|
||||
tz = ZoneInfo(detected_tz)
|
||||
except KeyError:
|
||||
# Invalid timezone, try to resolve abbreviation
|
||||
if detected_tz in TIMEZONE_ABBREV_MAP:
|
||||
tz = ZoneInfo(TIMEZONE_ABBREV_MAP[detected_tz])
|
||||
detected_tz = TIMEZONE_ABBREV_MAP[detected_tz]
|
||||
else:
|
||||
tz = ZoneInfo("America/New_York")
|
||||
confidence = "low"
|
||||
warning = f"Unknown timezone '{detected_tz}', defaulting to Eastern Time"
|
||||
detected_tz = "America/New_York"
|
||||
|
||||
# Apply timezone if not already set
|
||||
if parsed.tzinfo is None:
|
||||
parsed = parsed.replace(tzinfo=tz)
|
||||
|
||||
# Convert to UTC
|
||||
utc_dt = parsed.astimezone(ZoneInfo("UTC"))
|
||||
|
||||
return TimezoneResult(
|
||||
datetime_utc=utc_dt,
|
||||
source_timezone=detected_tz,
|
||||
confidence=confidence,
|
||||
warning=warning,
|
||||
)
|
||||
|
||||
|
||||
def convert_to_utc(
|
||||
dt: datetime,
|
||||
source_timezone: str,
|
||||
) -> datetime:
|
||||
"""Convert a datetime from a known timezone to UTC.
|
||||
|
||||
Args:
|
||||
dt: Datetime to convert (timezone-naive or timezone-aware)
|
||||
source_timezone: IANA timezone of the datetime
|
||||
|
||||
Returns:
|
||||
Datetime in UTC
|
||||
"""
|
||||
tz = ZoneInfo(source_timezone)
|
||||
|
||||
if dt.tzinfo is None:
|
||||
# Localize naive datetime
|
||||
dt = dt.replace(tzinfo=tz)
|
||||
|
||||
return dt.astimezone(ZoneInfo("UTC"))
|
||||
|
||||
|
||||
def create_timezone_warning(
|
||||
raw_value: str,
|
||||
sport: str,
|
||||
game_date: Optional[date] = None,
|
||||
source_url: Optional[str] = None,
|
||||
) -> ManualReviewItem:
|
||||
"""Create a manual review item for an undetermined timezone.
|
||||
|
||||
Args:
|
||||
raw_value: The original time string that couldn't be resolved
|
||||
sport: Sport code
|
||||
game_date: Date of the game
|
||||
source_url: URL of the source page
|
||||
|
||||
Returns:
|
||||
ManualReviewItem for timezone review
|
||||
"""
|
||||
return ManualReviewItem(
|
||||
id=f"tz_{sport}_{raw_value[:20].replace(' ', '_')}",
|
||||
reason=ReviewReason.TIMEZONE_UNKNOWN,
|
||||
sport=sport,
|
||||
raw_value=raw_value,
|
||||
context={"issue": "Could not determine timezone for game time"},
|
||||
source_url=source_url,
|
||||
game_date=game_date,
|
||||
)
|
||||
|
||||
|
||||
def get_stadium_timezone(
|
||||
stadium_state: str,
|
||||
stadium_timezone: Optional[str] = None,
|
||||
) -> str:
|
||||
"""Get the timezone for a stadium based on its location.
|
||||
|
||||
Args:
|
||||
stadium_state: State/province code
|
||||
stadium_timezone: Explicit timezone override from stadium data
|
||||
|
||||
Returns:
|
||||
IANA timezone string
|
||||
"""
|
||||
if stadium_timezone:
|
||||
return stadium_timezone
|
||||
|
||||
tz = detect_timezone_from_location(state=stadium_state)
|
||||
if tz:
|
||||
return tz
|
||||
|
||||
# Default to Eastern
|
||||
return "America/New_York"
|
||||
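The detect-then-convert flow in `parse_datetime` can be exercised standalone. This sketch replicates the abbreviation lookup and UTC conversion with a trimmed-down map (it assumes the host has the IANA tz database available to `zoneinfo`):

```python
import re
from datetime import datetime
from zoneinfo import ZoneInfo

# Trimmed stand-in for TIMEZONE_ABBREV_MAP
ABBREVS = {"ET": "America/New_York", "PT": "America/Los_Angeles"}


def detect(time_str: str):
    # First abbreviation that matches as a whole word wins
    for abbrev, tz in ABBREVS.items():
        if re.search(rf"\b{abbrev}\b", time_str):
            return tz
    return None


tz_name = detect("7:00 PM ET")  # "America/New_York"
# Localize the naive game time, then convert; Oct 21 is still EDT (UTC-4)
local = datetime(2025, 10, 21, 19, 0, tzinfo=ZoneInfo(tz_name))
utc = local.astimezone(ZoneInfo("UTC"))  # 2025-10-21 23:00 UTC
```

Note that DST handling comes entirely from `ZoneInfo`: the same "7:00 PM ET" in January would land at 00:00 UTC the next day, which is why the module stores IANA zone names rather than fixed offsets.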
46
sportstime_parser/scrapers/__init__.py
Normal file
@@ -0,0 +1,46 @@
"""Scrapers for fetching sports data from various sources."""

from .base import (
    BaseScraper,
    RawGameData,
    ScrapeResult,
    ScraperError,
    PartialDataError,
)
from .nba import NBAScraper, create_nba_scraper
from .mlb import MLBScraper, create_mlb_scraper
from .nfl import NFLScraper, create_nfl_scraper
from .nhl import NHLScraper, create_nhl_scraper
from .mls import MLSScraper, create_mls_scraper
from .wnba import WNBAScraper, create_wnba_scraper
from .nwsl import NWSLScraper, create_nwsl_scraper

__all__ = [
    # Base
    "BaseScraper",
    "RawGameData",
    "ScrapeResult",
    "ScraperError",
    "PartialDataError",
    # NBA
    "NBAScraper",
    "create_nba_scraper",
    # MLB
    "MLBScraper",
    "create_mlb_scraper",
    # NFL
    "NFLScraper",
    "create_nfl_scraper",
    # NHL
    "NHLScraper",
    "create_nhl_scraper",
    # MLS
    "MLSScraper",
    "create_mls_scraper",
    # WNBA
    "WNBAScraper",
    "create_wnba_scraper",
    # NWSL
    "NWSLScraper",
    "create_nwsl_scraper",
]
335
sportstime_parser/scrapers/base.py
Normal file
@@ -0,0 +1,335 @@
"""Base scraper class for all sport scrapers."""

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from datetime import date, datetime
from typing import Optional

from ..config import EXPECTED_GAME_COUNTS
from ..models.game import Game
from ..models.team import Team
from ..models.stadium import Stadium
from ..models.aliases import ManualReviewItem
from ..utils.http import RateLimitedSession, get_session
from ..utils.logging import get_logger, log_error, log_warning
from ..utils.progress import ScrapeProgress


@dataclass
class RawGameData:
    """Raw game data before normalization.

    This intermediate format holds data as scraped from sources,
    before team/stadium resolution and canonical ID generation.
    """

    game_date: datetime
    home_team_raw: str
    away_team_raw: str
    stadium_raw: Optional[str] = None
    home_score: Optional[int] = None
    away_score: Optional[int] = None
    status: str = "scheduled"
    source_url: Optional[str] = None
    game_number: Optional[int] = None  # For doubleheaders


@dataclass
class ScrapeResult:
    """Result of a scraping operation.

    Attributes:
        games: List of normalized Game objects
        teams: List of Team objects
        stadiums: List of Stadium objects
        review_items: Items requiring manual review
        source: Name of the source used
        success: Whether scraping succeeded
        error_message: Error message if failed
    """

    games: list[Game] = field(default_factory=list)
    teams: list[Team] = field(default_factory=list)
    stadiums: list[Stadium] = field(default_factory=list)
    review_items: list[ManualReviewItem] = field(default_factory=list)
    source: str = ""
    success: bool = True
    error_message: Optional[str] = None

    @property
    def game_count(self) -> int:
        return len(self.games)

    @property
    def team_count(self) -> int:
        return len(self.teams)

    @property
    def stadium_count(self) -> int:
        return len(self.stadiums)

    @property
    def review_count(self) -> int:
        return len(self.review_items)


class BaseScraper(ABC):
    """Abstract base class for sport scrapers.

    Subclasses must implement:
    - _get_sources(): Return list of source names in priority order
    - _scrape_games_from_source(): Fetch raw games from one source
    - _normalize_games(): Convert raw data to Game objects
    - scrape_teams(): Fetch team information
    - scrape_stadiums(): Fetch stadium information

    Features:
    - Multi-source fallback (try sources in order)
    - Built-in rate limiting
    - Error handling with partial data discard
    - Progress tracking
    - Source URL tracking for manual review
    """

    def __init__(
        self,
        sport: str,
        season: int,
        session: Optional[RateLimitedSession] = None,
    ):
        """Initialize the scraper.

        Args:
            sport: Sport code (e.g., 'nba', 'mlb')
            season: Season start year (e.g., 2025 for 2025-26)
            session: Optional HTTP session (default: global session)
        """
        self.sport = sport.lower()
        self.season = season
        self.session = session or get_session()
        self._logger = get_logger()
        self._progress: Optional[ScrapeProgress] = None

    @property
    def expected_game_count(self) -> int:
        """Get expected number of games for this sport."""
        return EXPECTED_GAME_COUNTS.get(self.sport, 0)

    @abstractmethod
    def _get_sources(self) -> list[str]:
        """Return list of source names in priority order.

        Returns:
            List of source identifiers (e.g., ['basketball_reference', 'espn', 'cbs'])
        """

    @abstractmethod
    def _scrape_games_from_source(
        self,
        source: str,
    ) -> list[RawGameData]:
        """Scrape games from a specific source.

        Args:
            source: Source identifier

        Returns:
            List of raw game data

        Raises:
            Exception: If scraping fails
        """

    @abstractmethod
    def _normalize_games(
        self,
        raw_games: list[RawGameData],
    ) -> tuple[list[Game], list[ManualReviewItem]]:
        """Normalize raw game data to Game objects.

        Args:
            raw_games: Raw scraped data

        Returns:
            Tuple of (normalized games, review items)
        """

    @abstractmethod
    def scrape_teams(self) -> list[Team]:
        """Fetch team information.

        Returns:
            List of Team objects
        """

    @abstractmethod
    def scrape_stadiums(self) -> list[Stadium]:
        """Fetch stadium information.

        Returns:
            List of Stadium objects
        """

    def scrape_games(self) -> ScrapeResult:
        """Scrape games with multi-source fallback.

        Tries each source in priority order. On failure, discards
        partial data and tries the next source.

        Returns:
            ScrapeResult with games, review items, and status
        """
        sources = self._get_sources()
        last_error: Optional[str] = None
        sources_tried = 0
        # Allow up to 3 sources to be tried. This lets NHL fall back to the
        # NHL API for venue data, since Hockey Reference doesn't provide it.
        max_sources_to_try = 3

        for source in sources:
            self._logger.info(f"Trying source: {source}")
            sources_tried += 1

            try:
                # Scrape raw data
                raw_games = self._scrape_games_from_source(source)

                if not raw_games:
                    log_warning(f"No games found from {source}")
                    # If multiple sources return nothing, the schedule likely doesn't exist
                    if sources_tried >= max_sources_to_try:
                        return ScrapeResult(
                            success=False,
                            error_message=f"No schedule data available (tried {sources_tried} sources)",
                        )
                    continue

                self._logger.info(f"Found {len(raw_games)} raw games from {source}")

                # Normalize data
                games, review_items = self._normalize_games(raw_games)

                self._logger.info(
                    f"Normalized {len(games)} games, {len(review_items)} need review"
                )

                return ScrapeResult(
                    games=games,
                    review_items=review_items,
                    source=source,
                    success=True,
                )

            except Exception as e:
                last_error = str(e)
                log_error(f"Failed to scrape from {source}: {e}", exc_info=True)
                # If we've tried enough sources, bail out
                if sources_tried >= max_sources_to_try:
                    break
                continue

        # All sources failed
        return ScrapeResult(
            success=False,
            error_message=f"All sources failed. Last error: {last_error}",
        )

    def scrape_all(self) -> ScrapeResult:
        """Scrape games, teams, and stadiums.

        Returns:
            Complete ScrapeResult with all data
        """
        self._progress = ScrapeProgress(self.sport, self.season)
        self._progress.start()

        try:
            # Scrape games
            result = self.scrape_games()

            if not result.success:
                self._progress.log_error(result.error_message or "Unknown error")
                self._progress.finish()
                return result

            # Scrape teams
            teams = self.scrape_teams()
            result.teams = teams

            # Scrape stadiums
            stadiums = self.scrape_stadiums()
            result.stadiums = stadiums

            # Update progress
            self._progress.games_count = result.game_count
            self._progress.teams_count = result.team_count
            self._progress.stadiums_count = result.stadium_count
            self._progress.errors_count = result.review_count

            self._progress.finish()

            return result

        except Exception as e:
            log_error(f"Scraping failed: {e}", exc_info=True)
            self._progress.finish()

            return ScrapeResult(
                success=False,
                error_message=str(e),
            )

    def _get_season_months(self) -> list[tuple[int, int]]:
        """Get the months to scrape for this sport's season.

        Returns:
            List of (year, month) tuples
        """
        # Default implementation for sports with fall-spring seasons
        # (NBA, NHL, etc.)
        months = []

        # Fall months of season start year
        for month in range(10, 13):  # Oct-Dec
            months.append((self.season, month))

        # Winter-spring months of following year
        for month in range(1, 7):  # Jan-Jun
            months.append((self.season + 1, month))

        return months

    def _get_source_url(self, source: str, **kwargs) -> str:
        """Build a source URL with parameters.

        Subclasses should override this to build URLs for their sources.

        Args:
            source: Source identifier
            **kwargs: URL parameters

        Returns:
            Complete URL string
        """
        raise NotImplementedError(f"URL builder not implemented for {source}")


class ScraperError(Exception):
    """Exception raised when scraping fails."""

    def __init__(self, source: str, message: str):
        self.source = source
        self.message = message
        super().__init__(f"[{source}] {message}")


class PartialDataError(ScraperError):
    """Exception raised when only partial data was retrieved."""

    def __init__(self, source: str, message: str, partial_count: int):
        self.partial_count = partial_count
        super().__init__(source, f"{message} (got {partial_count} items)")
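`scrape_games` above tries each source in priority order and discards partial data whenever a source fails. A self-contained sketch of that fallback loop, with plain callables standing in for the per-source scrape methods (all names here are hypothetical, not the package's API):

```python
from typing import Callable


def fetch_with_fallback(sources: dict[str, Callable[[], list]]) -> tuple[str, list]:
    """Try each source in order; return (source_name, data) from the first success."""
    last_error = None
    for name, fetch in sources.items():
        try:
            data = fetch()
            if not data:  # empty result: treat like a failure, try the next source
                continue
            return name, data
        except Exception as e:  # partial data is discarded along with the exception
            last_error = e
            continue
    raise RuntimeError(f"All sources failed. Last error: {last_error}")


def flaky():
    raise TimeoutError("upstream 504")


result = fetch_with_fallback({"primary": flaky, "backup": lambda: [1, 2, 3]})
```

Because each failed source's output is thrown away entirely, a successful result always comes from a single source, which keeps the downstream deduplication and canonical-ID logic simple.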
685
sportstime_parser/scrapers/mlb.py
Normal file
@@ -0,0 +1,685 @@
|
||||
"""MLB scraper implementation with multi-source fallback."""
|
||||
|
||||
from datetime import datetime, date, timedelta
|
||||
from typing import Optional
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
from .base import BaseScraper, RawGameData, ScrapeResult
|
||||
from ..models.game import Game
|
||||
from ..models.team import Team
|
||||
from ..models.stadium import Stadium
|
||||
from ..models.aliases import ManualReviewItem
|
||||
from ..normalizers.canonical_id import generate_game_id
|
||||
from ..normalizers.team_resolver import (
|
||||
TeamResolver,
|
||||
TEAM_MAPPINGS,
|
||||
get_team_resolver,
|
||||
)
|
||||
from ..normalizers.stadium_resolver import (
|
||||
StadiumResolver,
|
||||
STADIUM_MAPPINGS,
|
||||
get_stadium_resolver,
|
||||
)
|
||||
from ..normalizers.timezone import parse_datetime
|
||||
from ..utils.logging import get_logger, log_game, log_warning
|
||||
|
||||
|
||||
class MLBScraper(BaseScraper):
|
||||
"""MLB schedule scraper with multi-source fallback.
|
||||
|
||||
Sources (in priority order):
|
||||
1. Baseball-Reference - Most reliable, complete historical data
|
||||
2. MLB Stats API - Official MLB data
|
||||
3. ESPN API - Backup option
|
||||
"""
|
||||
|
||||
def __init__(self, season: int, **kwargs):
|
||||
"""Initialize MLB scraper.
|
||||
|
||||
Args:
|
||||
season: Season year (e.g., 2026 for 2026 season)
|
||||
"""
|
||||
super().__init__("mlb", season, **kwargs)
|
||||
self._team_resolver = get_team_resolver("mlb")
|
||||
self._stadium_resolver = get_stadium_resolver("mlb")
|
||||
|
||||
def _get_sources(self) -> list[str]:
|
||||
"""Return source list in priority order."""
|
||||
# MLB API is best - returns full schedule in one request
|
||||
# ESPN caps at ~25 results for baseball
|
||||
# Baseball-Reference requires HTML parsing
|
||||
return ["mlb_api", "espn", "baseball_reference"]
|
||||
|
||||
def _get_source_url(self, source: str, **kwargs) -> str:
|
||||
"""Build URL for a source."""
|
||||
if source == "baseball_reference":
|
||||
month = kwargs.get("month", "april")
|
||||
# Baseball-Reference uses season year in URL
|
||||
return f"https://www.baseball-reference.com/leagues/majors/{self.season}-schedule.shtml"
|
||||
|
||||
elif source == "mlb_api":
|
||||
start_date = kwargs.get("start_date", "")
|
||||
end_date = kwargs.get("end_date", "")
|
||||
return f"https://statsapi.mlb.com/api/v1/schedule?sportId=1&startDate={start_date}&endDate={end_date}"
|
||||
|
||||
elif source == "espn":
|
||||
date_str = kwargs.get("date", "")
|
||||
return f"https://site.api.espn.com/apis/site/v2/sports/baseball/mlb/scoreboard?dates={date_str}"
|
||||
|
||||
raise ValueError(f"Unknown source: {source}")
|
||||
|
||||
def _get_season_months(self) -> list[tuple[int, int]]:
|
||||
"""Get the months to scrape for MLB season.
|
||||
|
||||
MLB season runs March/April through October/November.
|
||||
"""
|
||||
months = []
|
||||
|
||||
# Spring training / early season
|
||||
for month in range(3, 12): # March-November
|
||||
months.append((self.season, month))
|
||||
|
||||
return months
|
||||
|
||||
def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
|
||||
"""Scrape games from a specific source."""
|
||||
if source == "baseball_reference":
|
||||
return self._scrape_baseball_reference()
|
||||
elif source == "mlb_api":
|
||||
return self._scrape_mlb_api()
|
||||
elif source == "espn":
|
||||
return self._scrape_espn()
|
||||
else:
|
||||
raise ValueError(f"Unknown source: {source}")
|
||||
|
||||
def _scrape_baseball_reference(self) -> list[RawGameData]:
|
||||
"""Scrape games from Baseball-Reference.
|
||||
|
||||
BR has a single schedule page per season.
|
||||
Format: https://www.baseball-reference.com/leagues/majors/YYYY-schedule.shtml
|
||||
"""
|
||||
url = self._get_source_url("baseball_reference")
|
||||
|
||||
try:
|
||||
html = self.session.get_html(url)
|
||||
games = self._parse_baseball_reference(html, url)
|
||||
return games
|
||||
|
||||
except Exception as e:
|
||||
self._logger.error(f"Failed to scrape Baseball-Reference: {e}")
|
||||
raise
|
||||
|
||||
def _parse_baseball_reference(
|
||||
self,
|
||||
html: str,
|
||||
source_url: str,
|
||||
) -> list[RawGameData]:
|
||||
"""Parse Baseball-Reference schedule HTML.
|
||||
|
||||
Structure: Games are organized by date in div elements.
|
||||
Each game row has: date, away team, away score, home team, home score, venue.
|
||||
"""
|
||||
soup = BeautifulSoup(html, "lxml")
|
||||
games: list[RawGameData] = []
|
||||
|
||||
# Find all game divs - they use class "game" or similar
|
||||
# Baseball-Reference uses <p class="game"> for each game
|
||||
game_paragraphs = soup.find_all("p", class_="game")
|
||||
|
||||
current_date = None
|
||||
|
||||
for elem in soup.find_all(["h3", "p"]):
|
||||
# H3 contains date headers
|
||||
if elem.name == "h3":
|
||||
date_text = elem.get_text(strip=True)
|
||||
try:
|
||||
# Format: "Thursday, April 1, 2026"
|
||||
current_date = datetime.strptime(date_text, "%A, %B %d, %Y")
|
||||
except ValueError:
|
||||
continue
|
||||
|
||||
elif elem.name == "p" and "game" in elem.get("class", []):
|
||||
if current_date is None:
|
||||
continue
|
||||
|
||||
try:
|
||||
game = self._parse_br_game(elem, current_date, source_url)
|
||||
if game:
|
||||
games.append(game)
|
||||
except Exception as e:
|
||||
self._logger.debug(f"Failed to parse game: {e}")
|
||||
continue
|
||||
|
||||
return games
|
||||
|
||||
def _parse_br_game(
|
||||
self,
|
||||
elem,
|
||||
game_date: datetime,
|
||||
source_url: str,
|
||||
) -> Optional[RawGameData]:
|
||||
"""Parse a single Baseball-Reference game element."""
|
||||
text = elem.get_text(" ", strip=True)
|
||||
|
||||
# Parse game text - formats vary:
|
||||
# "Team A (5) @ Team B (3)" or "Team A @ Team B"
|
||||
# Also handles doubleheader notation
|
||||
|
||||
# Find all links - usually team names
|
||||
links = elem.find_all("a")
|
||||
if len(links) < 2:
|
||||
return None
|
||||
|
||||
# First link is away team, second is home team
|
||||
away_team = links[0].get_text(strip=True)
|
||||
home_team = links[1].get_text(strip=True)
|
||||
|
||||
# Try to extract scores from text
|
||||
away_score = None
|
||||
home_score = None
|
||||
|
||||
# Look for score pattern "(N)"
|
||||
import re
|
||||
score_pattern = r"\((\d+)\)"
|
||||
scores = re.findall(score_pattern, text)
|
||||
|
||||
if len(scores) >= 2:
|
||||
try:
|
||||
away_score = int(scores[0])
|
||||
home_score = int(scores[1])
|
||||
except (ValueError, IndexError):
|
||||
pass
|
||||
|
||||
# Determine status
|
||||
status = "final" if home_score is not None else "scheduled"
|
||||
|
||||
# Check for postponed/cancelled
|
||||
text_lower = text.lower()
|
||||
if "postponed" in text_lower:
|
||||
status = "postponed"
|
||||
elif "cancelled" in text_lower or "canceled" in text_lower:
|
||||
status = "cancelled"
|
||||
|
||||
# Extract venue if present (usually after @ symbol)
|
||||
stadium = None
|
||||
if len(links) > 2:
|
||||
# Third link might be stadium
|
||||
stadium = links[2].get_text(strip=True)
|
||||
|
||||
return RawGameData(
|
||||
game_date=game_date,
|
||||
home_team_raw=home_team,
|
||||
away_team_raw=away_team,
|
||||
stadium_raw=stadium,
|
||||
home_score=home_score,
|
||||
away_score=away_score,
|
||||
status=status,
|
||||
source_url=source_url,
|
||||
)
|
||||
|
||||
def _scrape_mlb_api(self) -> list[RawGameData]:
|
||||
"""Scrape games from MLB Stats API using full season query."""
|
||||
# Build date range for entire season (March-November)
|
||||
season_months = self._get_season_months()
|
||||
start_year, start_month = season_months[0]
|
||||
end_year, end_month = season_months[-1]
|
||||
|
||||
# Get last day of end month
|
||||
if end_month == 12:
|
||||
end_date = date(end_year + 1, 1, 1) - timedelta(days=1)
|
||||
else:
|
||||
end_date = date(end_year, end_month + 1, 1) - timedelta(days=1)
|
||||
|
||||
start_date = date(start_year, start_month, 1)
|
||||
|
||||
url = f"https://statsapi.mlb.com/api/v1/schedule?sportId=1&startDate={start_date.strftime('%Y-%m-%d')}&endDate={end_date.strftime('%Y-%m-%d')}"
|
||||
self._logger.info(f"Fetching MLB schedule: {start_date} to {end_date}")
|
||||
|
||||
try:
|
||||
data = self.session.get_json(url)
|
||||
            return self._parse_mlb_api_response(data, url)
        except Exception as e:
            self._logger.error(f"MLB API error: {e}")
            return []

    def _parse_mlb_api_response(
        self,
        data: dict,
        source_url: str,
    ) -> list[RawGameData]:
        """Parse MLB Stats API response."""
        games: list[RawGameData] = []

        dates = data.get("dates", [])

        for date_entry in dates:
            for game in date_entry.get("games", []):
                try:
                    raw_game = self._parse_mlb_api_game(game, source_url)
                    if raw_game:
                        games.append(raw_game)
                except Exception as e:
                    self._logger.debug(f"Failed to parse MLB API game: {e}")
                    continue

        return games

    def _parse_mlb_api_game(
        self,
        game: dict,
        source_url: str,
    ) -> Optional[RawGameData]:
        """Parse a single MLB API game."""
        # Get game date/time
        game_date_str = game.get("gameDate", "")
        if not game_date_str:
            return None

        try:
            game_date = datetime.fromisoformat(game_date_str.replace("Z", "+00:00"))
        except ValueError:
            return None

        # Get teams
        teams = game.get("teams", {})
        away_data = teams.get("away", {})
        home_data = teams.get("home", {})

        away_team_info = away_data.get("team", {})
        home_team_info = home_data.get("team", {})

        away_team = away_team_info.get("name", "")
        home_team = home_team_info.get("name", "")

        if not away_team or not home_team:
            return None

        # Get scores
        away_score = away_data.get("score")
        home_score = home_data.get("score")

        # Get venue
        venue = game.get("venue", {})
        stadium = venue.get("name")

        # Get status
        status_data = game.get("status", {})
        abstract_game_state = status_data.get("abstractGameState", "").lower()
        detailed_state = status_data.get("detailedState", "").lower()

        if abstract_game_state == "final":
            status = "final"
        elif "postponed" in detailed_state:
            status = "postponed"
        elif "cancelled" in detailed_state or "canceled" in detailed_state:
            status = "cancelled"
        else:
            status = "scheduled"

        # Check for doubleheader
        game_number = game.get("gameNumber")
        if game.get("doubleHeader") == "Y":
            game_number = game.get("gameNumber", 1)

        return RawGameData(
            game_date=game_date,
            home_team_raw=home_team,
            away_team_raw=away_team,
            stadium_raw=stadium,
            home_score=home_score,
            away_score=away_score,
            status=status,
            source_url=source_url,
            game_number=game_number if game.get("doubleHeader") == "Y" else None,
        )

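The `gameDate` handling above relies on one subtlety: `datetime.fromisoformat` only accepts a trailing `Z` starting with Python 3.11, so the code rewrites it as an explicit `+00:00` offset first. A minimal standalone sketch (the `parse_api_timestamp` helper name is illustrative, not part of the package):

```python
from datetime import datetime

def parse_api_timestamp(ts: str) -> datetime:
    # Older Pythons reject the "Z" (UTC) suffix in fromisoformat,
    # so normalize it to an explicit "+00:00" offset before parsing.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

dt = parse_api_timestamp("2026-04-01T23:10:00Z")
```

The result is a timezone-aware datetime pinned to UTC, which keeps later local-time conversion unambiguous.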
    def _scrape_espn(self) -> list[RawGameData]:
        """Scrape games from ESPN API using date range query."""
        # Build date range for entire season (March-November)
        season_months = self._get_season_months()
        start_year, start_month = season_months[0]
        end_year, end_month = season_months[-1]

        # Get last day of end month
        if end_month == 12:
            end_date = date(end_year + 1, 1, 1) - timedelta(days=1)
        else:
            end_date = date(end_year, end_month + 1, 1) - timedelta(days=1)

        start_date = date(start_year, start_month, 1)
        date_range = f"{start_date.strftime('%Y%m%d')}-{end_date.strftime('%Y%m%d')}"

        url = f"https://site.api.espn.com/apis/site/v2/sports/baseball/mlb/scoreboard?limit=3000&dates={date_range}"
        self._logger.info(f"Fetching MLB schedule: {date_range}")

        try:
            data = self.session.get_json(url)
            return self._parse_espn_response(data, url)
        except Exception as e:
            self._logger.error(f"ESPN error: {e}")
            return []

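The "last day of end month" computation in `_scrape_espn` uses a standard trick: the first day of the *next* month, minus one day, is always the last day of the current month, so no per-month day table (or leap-year special case) is needed. A self-contained sketch of the same arithmetic (function name is illustrative):

```python
from datetime import date, timedelta

def last_day_of_month(year: int, month: int) -> date:
    # First day of the following month, minus one day.
    if month == 12:
        return date(year + 1, 1, 1) - timedelta(days=1)
    return date(year, month + 1, 1) - timedelta(days=1)

# Same YYYYMMDD-YYYYMMDD range format the ESPN scoreboard query uses.
date_range = f"{date(2026, 3, 1):%Y%m%d}-{last_day_of_month(2026, 11):%Y%m%d}"
```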
    def _parse_espn_response(
        self,
        data: dict,
        source_url: str,
    ) -> list[RawGameData]:
        """Parse ESPN API response."""
        games: list[RawGameData] = []

        events = data.get("events", [])

        for event in events:
            try:
                game = self._parse_espn_event(event, source_url)
                if game:
                    games.append(game)
            except Exception as e:
                self._logger.debug(f"Failed to parse ESPN event: {e}")
                continue

        return games

    def _parse_espn_event(
        self,
        event: dict,
        source_url: str,
    ) -> Optional[RawGameData]:
        """Parse a single ESPN event."""
        # Get date
        date_str = event.get("date", "")
        if not date_str:
            return None

        try:
            game_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
        except ValueError:
            return None

        # Get competitions
        competitions = event.get("competitions", [])
        if not competitions:
            return None

        competition = competitions[0]

        # Get teams
        competitors = competition.get("competitors", [])
        if len(competitors) != 2:
            return None

        home_team = None
        away_team = None
        home_score = None
        away_score = None

        for competitor in competitors:
            team_info = competitor.get("team", {})
            team_name = team_info.get("displayName", "")
            is_home = competitor.get("homeAway") == "home"
            score = competitor.get("score")

            if score:
                try:
                    score = int(score)
                except (ValueError, TypeError):
                    score = None

            if is_home:
                home_team = team_name
                home_score = score
            else:
                away_team = team_name
                away_score = score

        if not home_team or not away_team:
            return None

        # Get venue
        venue = competition.get("venue", {})
        stadium = venue.get("fullName")

        # Get status
        status_info = competition.get("status", {})
        status_type = status_info.get("type", {})
        status_name = status_type.get("name", "").lower()

        if status_name == "status_final":
            status = "final"
        elif status_name == "status_postponed":
            status = "postponed"
        elif status_name == "status_canceled":
            status = "cancelled"
        else:
            status = "scheduled"

        return RawGameData(
            game_date=game_date,
            home_team_raw=home_team,
            away_team_raw=away_team,
            stadium_raw=stadium,
            home_score=home_score,
            away_score=away_score,
            status=status,
            source_url=source_url,
        )

    def _normalize_games(
        self,
        raw_games: list[RawGameData],
    ) -> tuple[list[Game], list[ManualReviewItem]]:
        """Normalize raw games to Game objects with canonical IDs."""
        games: list[Game] = []
        review_items: list[ManualReviewItem] = []

        # Track games by date/matchup for doubleheader detection
        games_by_matchup: dict[str, list[RawGameData]] = {}

        for raw in raw_games:
            date_key = raw.game_date.strftime("%Y%m%d")
            matchup_key = f"{date_key}_{raw.away_team_raw}_{raw.home_team_raw}"

            if matchup_key not in games_by_matchup:
                games_by_matchup[matchup_key] = []
            games_by_matchup[matchup_key].append(raw)

        # Process games with doubleheader detection
        for matchup_key, matchup_games in games_by_matchup.items():
            is_doubleheader = len(matchup_games) > 1

            # Sort by time if doubleheader
            if is_doubleheader:
                matchup_games.sort(key=lambda g: g.game_date)

            for i, raw in enumerate(matchup_games):
                # Use provided game_number or calculate from order
                game_number = raw.game_number or ((i + 1) if is_doubleheader else None)

                game, item_reviews = self._normalize_single_game(raw, game_number)

                if game:
                    games.append(game)
                    log_game(
                        self.sport,
                        game.id,
                        game.home_team_id,
                        game.away_team_id,
                        game.game_date.strftime("%Y-%m-%d"),
                        game.status,
                    )

                review_items.extend(item_reviews)

        return games, review_items

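The doubleheader detection in `_normalize_games` boils down to: bucket games by (day, away team, home team), and when a bucket holds more than one game, sort by start time and number them 1, 2, ... A runnable sketch of that grouping with hypothetical minimal dicts standing in for `RawGameData`:

```python
from collections import defaultdict
from datetime import datetime

# Two games on July 4 (a doubleheader) plus a single game on July 5.
games = [
    {"date": datetime(2026, 7, 4, 18, 5), "away": "Mets", "home": "Yankees"},
    {"date": datetime(2026, 7, 4, 13, 5), "away": "Mets", "home": "Yankees"},
    {"date": datetime(2026, 7, 5, 13, 5), "away": "Mets", "home": "Yankees"},
]

# Group by (day, away, home); >1 entry on the same day = doubleheader.
by_matchup = defaultdict(list)
for g in games:
    key = (g["date"].strftime("%Y%m%d"), g["away"], g["home"])
    by_matchup[key].append(g)

numbered = []
for matchup in by_matchup.values():
    matchup.sort(key=lambda g: g["date"])  # game 1 is the earlier start
    dh = len(matchup) > 1
    for i, g in enumerate(matchup):
        numbered.append((g["date"], (i + 1) if dh else None))
```

Single games get `None` for the game number, so the canonical game ID only carries a suffix when one is actually needed.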
    def _normalize_single_game(
        self,
        raw: RawGameData,
        game_number: Optional[int],
    ) -> tuple[Optional[Game], list[ManualReviewItem]]:
        """Normalize a single raw game."""
        review_items: list[ManualReviewItem] = []

        # Resolve home team
        home_result = self._team_resolver.resolve(
            raw.home_team_raw,
            check_date=raw.game_date.date(),
            source_url=raw.source_url,
        )

        if home_result.review_item:
            review_items.append(home_result.review_item)

        if not home_result.canonical_id:
            log_warning(f"Could not resolve home team: {raw.home_team_raw}")
            return None, review_items

        # Resolve away team
        away_result = self._team_resolver.resolve(
            raw.away_team_raw,
            check_date=raw.game_date.date(),
            source_url=raw.source_url,
        )

        if away_result.review_item:
            review_items.append(away_result.review_item)

        if not away_result.canonical_id:
            log_warning(f"Could not resolve away team: {raw.away_team_raw}")
            return None, review_items

        # Resolve stadium
        stadium_id = None

        if raw.stadium_raw:
            stadium_result = self._stadium_resolver.resolve(
                raw.stadium_raw,
                check_date=raw.game_date.date(),
                source_url=raw.source_url,
            )

            if stadium_result.review_item:
                review_items.append(stadium_result.review_item)

            stadium_id = stadium_result.canonical_id

        # Get abbreviations for game ID
        home_abbrev = self._get_abbreviation(home_result.canonical_id)
        away_abbrev = self._get_abbreviation(away_result.canonical_id)

        # Generate canonical game ID
        game_id = generate_game_id(
            sport=self.sport,
            season=self.season,
            away_abbrev=away_abbrev,
            home_abbrev=home_abbrev,
            game_date=raw.game_date,
            game_number=game_number,
        )

        game = Game(
            id=game_id,
            sport=self.sport,
            season=self.season,
            home_team_id=home_result.canonical_id,
            away_team_id=away_result.canonical_id,
            stadium_id=stadium_id or "",
            game_date=raw.game_date,
            game_number=game_number,
            home_score=raw.home_score,
            away_score=raw.away_score,
            status=raw.status,
            source_url=raw.source_url,
            raw_home_team=raw.home_team_raw,
            raw_away_team=raw.away_team_raw,
            raw_stadium=raw.stadium_raw,
        )

        return game, review_items

    def _get_abbreviation(self, team_id: str) -> str:
        """Extract abbreviation from team ID."""
        # team_mlb_nyy -> nyy
        parts = team_id.split("_")
        return parts[-1] if parts else ""

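`_get_abbreviation` depends on the canonical ID convention `team_<sport>_<abbrev>`: the abbreviation is always the final underscore-separated segment. A standalone sketch of the same extraction (helper name is illustrative):

```python
def get_abbreviation(team_id: str) -> str:
    # Canonical team IDs look like "team_<sport>_<abbrev>";
    # the last "_"-separated segment is the abbreviation.
    parts = team_id.split("_")
    return parts[-1] if parts else ""

abbrev = get_abbreviation("team_mlb_nyy")
```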
    def scrape_teams(self) -> list[Team]:
        """Get all MLB teams from hardcoded mappings."""
        teams: list[Team] = []
        seen: set[str] = set()

        # MLB league/division structure
        divisions = {
            "AL East": ("American", ["BAL", "BOS", "NYY", "TB", "TOR"]),
            "AL Central": ("American", ["CHW", "CLE", "DET", "KC", "MIN"]),
            "AL West": ("American", ["HOU", "LAA", "OAK", "SEA", "TEX"]),
            "NL East": ("National", ["ATL", "MIA", "NYM", "PHI", "WSN"]),
            "NL Central": ("National", ["CHC", "CIN", "MIL", "PIT", "STL"]),
            "NL West": ("National", ["ARI", "COL", "LAD", "SD", "SF"]),
        }

        # Build reverse lookup
        team_divisions: dict[str, tuple[str, str]] = {}
        for div, (league, abbrevs) in divisions.items():
            for abbrev in abbrevs:
                team_divisions[abbrev] = (league, div)

        for abbrev, (team_id, full_name, city, stadium_id) in TEAM_MAPPINGS.get("mlb", {}).items():
            if team_id in seen:
                continue
            seen.add(team_id)

            # Parse team name from full name
            parts = full_name.split()
            if len(parts) >= 2:
                team_name = parts[-1]
                # Handle multi-word team names
                if team_name in ["Sox", "Jays"]:
                    team_name = " ".join(parts[-2:])
            else:
                team_name = full_name

            # Get league and division
            league, div = team_divisions.get(abbrev, (None, None))

            team = Team(
                id=team_id,
                sport="mlb",
                city=city,
                name=team_name,
                full_name=full_name,
                abbreviation=abbrev,
                conference=league,  # MLB uses "league" but we map to conference field
                division=div,
                stadium_id=stadium_id,
            )
            teams.append(team)

        return teams

    def scrape_stadiums(self) -> list[Stadium]:
        """Get all MLB stadiums from hardcoded mappings."""
        stadiums: list[Stadium] = []

        mlb_stadiums = STADIUM_MAPPINGS.get("mlb", {})
        for stadium_id, info in mlb_stadiums.items():
            stadium = Stadium(
                id=stadium_id,
                sport="mlb",
                name=info.name,
                city=info.city,
                state=info.state,
                country=info.country,
                latitude=info.latitude,
                longitude=info.longitude,
                surface="grass",  # Most MLB stadiums
                roof_type="open",  # Most MLB stadiums
            )
            stadiums.append(stadium)

        return stadiums


def create_mlb_scraper(season: int) -> MLBScraper:
    """Factory function to create an MLB scraper."""
    return MLBScraper(season=season)
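`scrape_teams` inverts its division table (division name mapped to a league and a list of abbreviations) into a per-team lookup so each team can be annotated in O(1). The nested loops can equivalently be written as a dict comprehension; a sketch with a trimmed-down table:

```python
# Two divisions from the full six-division MLB table, for illustration.
divisions = {
    "AL East": ("American", ["BAL", "BOS", "NYY", "TB", "TOR"]),
    "NL West": ("National", ["ARI", "COL", "LAD", "SD", "SF"]),
}

# Invert division -> (league, [abbrevs]) into abbrev -> (league, division).
team_divisions = {
    abbrev: (league, div)
    for div, (league, abbrevs) in divisions.items()
    for abbrev in abbrevs
}
```

Teams missing from the table fall back to `(None, None)` via `dict.get`, which is why `conference`/`division` are nullable on `Team`.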
400
sportstime_parser/scrapers/mls.py
Normal file
@@ -0,0 +1,400 @@
"""MLS scraper implementation with multi-source fallback."""
|
||||
|
||||
from datetime import datetime, date, timedelta
|
||||
from typing import Optional
|
||||
|
||||
from .base import BaseScraper, RawGameData, ScrapeResult
|
||||
from ..models.game import Game
|
||||
from ..models.team import Team
|
||||
from ..models.stadium import Stadium
|
||||
from ..models.aliases import ManualReviewItem
|
||||
from ..normalizers.canonical_id import generate_game_id
|
||||
from ..normalizers.team_resolver import (
|
||||
TeamResolver,
|
||||
TEAM_MAPPINGS,
|
||||
get_team_resolver,
|
||||
)
|
||||
from ..normalizers.stadium_resolver import (
|
||||
StadiumResolver,
|
||||
STADIUM_MAPPINGS,
|
||||
get_stadium_resolver,
|
||||
)
|
||||
from ..utils.logging import get_logger, log_game, log_warning
|
||||
|
||||
|
||||
class MLSScraper(BaseScraper):
|
||||
"""MLS schedule scraper with multi-source fallback.
|
||||
|
||||
Sources (in priority order):
|
||||
1. ESPN API - Most reliable for MLS
|
||||
2. FBref - Backup option
|
||||
"""
|
||||
|
||||
def __init__(self, season: int, **kwargs):
|
||||
"""Initialize MLS scraper.
|
||||
|
||||
Args:
|
||||
season: Season year (e.g., 2026 for 2026 season)
|
||||
"""
|
||||
super().__init__("mls", season, **kwargs)
|
||||
self._team_resolver = get_team_resolver("mls")
|
||||
self._stadium_resolver = get_stadium_resolver("mls")
|
||||
|
||||
def _get_sources(self) -> list[str]:
|
||||
"""Return source list in priority order."""
|
||||
# FBref scraper not yet implemented - TODO for future
|
||||
return ["espn"]
|
||||
|
||||
def _get_source_url(self, source: str, **kwargs) -> str:
|
||||
"""Build URL for a source."""
|
||||
if source == "espn":
|
||||
date_str = kwargs.get("date", "")
|
||||
return f"https://site.api.espn.com/apis/site/v2/sports/soccer/usa.1/scoreboard?dates={date_str}"
|
||||
|
||||
elif source == "fbref":
|
||||
return f"https://fbref.com/en/comps/22/{self.season}/schedule/{self.season}-Major-League-Soccer-Scores-and-Fixtures"
|
||||
|
||||
raise ValueError(f"Unknown source: {source}")
|
||||
|
||||
def _get_season_months(self) -> list[tuple[int, int]]:
|
||||
"""Get the months to scrape for MLS season.
|
||||
|
||||
MLS season runs February/March through October/November.
|
||||
"""
|
||||
months = []
|
||||
|
||||
# MLS runs within a calendar year
|
||||
for month in range(2, 12): # Feb-Nov
|
||||
months.append((self.season, month))
|
||||
|
||||
return months
|
||||
|
||||
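Unlike the NBA, whose season straddles a calendar-year boundary, `_get_season_months` for MLS can simply enumerate months within one year. A standalone sketch of the same shape (function name is illustrative):

```python
def mls_season_months(season: int) -> list[tuple[int, int]]:
    # MLS plays within a single calendar year, roughly Feb through Nov,
    # so every (year, month) pair shares the same year.
    return [(season, month) for month in range(2, 12)]

months = mls_season_months(2026)
```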
    def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
        """Scrape games from a specific source."""
        if source == "espn":
            return self._scrape_espn()
        elif source == "fbref":
            return self._scrape_fbref()
        else:
            raise ValueError(f"Unknown source: {source}")

    def _scrape_espn(self) -> list[RawGameData]:
        """Scrape games from ESPN API using date range query."""
        # Build date range for entire season (Feb-November)
        season_months = self._get_season_months()
        start_year, start_month = season_months[0]
        end_year, end_month = season_months[-1]

        # Get last day of end month
        if end_month == 12:
            end_date = date(end_year + 1, 1, 1) - timedelta(days=1)
        else:
            end_date = date(end_year, end_month + 1, 1) - timedelta(days=1)

        start_date = date(start_year, start_month, 1)
        date_range = f"{start_date.strftime('%Y%m%d')}-{end_date.strftime('%Y%m%d')}"

        url = f"https://site.api.espn.com/apis/site/v2/sports/soccer/usa.1/scoreboard?limit=1000&dates={date_range}"
        self._logger.info(f"Fetching MLS schedule: {date_range}")

        try:
            data = self.session.get_json(url)
            return self._parse_espn_response(data, url)
        except Exception as e:
            self._logger.error(f"ESPN error: {e}")
            return []

    def _parse_espn_response(
        self,
        data: dict,
        source_url: str,
    ) -> list[RawGameData]:
        """Parse ESPN API response."""
        games: list[RawGameData] = []

        events = data.get("events", [])

        for event in events:
            try:
                game = self._parse_espn_event(event, source_url)
                if game:
                    games.append(game)
            except Exception as e:
                self._logger.debug(f"Failed to parse ESPN event: {e}")
                continue

        return games

    def _parse_espn_event(
        self,
        event: dict,
        source_url: str,
    ) -> Optional[RawGameData]:
        """Parse a single ESPN event."""
        # Get date
        date_str = event.get("date", "")
        if not date_str:
            return None

        try:
            game_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
        except ValueError:
            return None

        # Get competitions
        competitions = event.get("competitions", [])
        if not competitions:
            return None

        competition = competitions[0]

        # Get teams
        competitors = competition.get("competitors", [])
        if len(competitors) != 2:
            return None

        home_team = None
        away_team = None
        home_score = None
        away_score = None

        for competitor in competitors:
            team_info = competitor.get("team", {})
            team_name = team_info.get("displayName", "")
            is_home = competitor.get("homeAway") == "home"
            score = competitor.get("score")

            if score:
                try:
                    score = int(score)
                except (ValueError, TypeError):
                    score = None

            if is_home:
                home_team = team_name
                home_score = score
            else:
                away_team = team_name
                away_score = score

        if not home_team or not away_team:
            return None

        # Get venue
        venue = competition.get("venue", {})
        stadium = venue.get("fullName")

        # Get status
        status_info = competition.get("status", {})
        status_type = status_info.get("type", {})
        status_name = status_type.get("name", "").lower()

        if status_name == "status_final":
            status = "final"
        elif status_name == "status_postponed":
            status = "postponed"
        elif status_name == "status_canceled":
            status = "cancelled"
        else:
            status = "scheduled"

        return RawGameData(
            game_date=game_date,
            home_team_raw=home_team,
            away_team_raw=away_team,
            stadium_raw=stadium,
            home_score=home_score,
            away_score=away_score,
            status=status,
            source_url=source_url,
        )

    def _scrape_fbref(self) -> list[RawGameData]:
        """Scrape games from FBref."""
        # FBref scraping would go here
        raise NotImplementedError("FBref scraper not implemented")

    def _normalize_games(
        self,
        raw_games: list[RawGameData],
    ) -> tuple[list[Game], list[ManualReviewItem]]:
        """Normalize raw games to Game objects with canonical IDs."""
        games: list[Game] = []
        review_items: list[ManualReviewItem] = []

        for raw in raw_games:
            game, item_reviews = self._normalize_single_game(raw)

            if game:
                games.append(game)
                log_game(
                    self.sport,
                    game.id,
                    game.home_team_id,
                    game.away_team_id,
                    game.game_date.strftime("%Y-%m-%d"),
                    game.status,
                )

            review_items.extend(item_reviews)

        return games, review_items

    def _normalize_single_game(
        self,
        raw: RawGameData,
    ) -> tuple[Optional[Game], list[ManualReviewItem]]:
        """Normalize a single raw game."""
        review_items: list[ManualReviewItem] = []

        # Resolve home team
        home_result = self._team_resolver.resolve(
            raw.home_team_raw,
            check_date=raw.game_date.date(),
            source_url=raw.source_url,
        )

        if home_result.review_item:
            review_items.append(home_result.review_item)

        if not home_result.canonical_id:
            log_warning(f"Could not resolve home team: {raw.home_team_raw}")
            return None, review_items

        # Resolve away team
        away_result = self._team_resolver.resolve(
            raw.away_team_raw,
            check_date=raw.game_date.date(),
            source_url=raw.source_url,
        )

        if away_result.review_item:
            review_items.append(away_result.review_item)

        if not away_result.canonical_id:
            log_warning(f"Could not resolve away team: {raw.away_team_raw}")
            return None, review_items

        # Resolve stadium
        stadium_id = None

        if raw.stadium_raw:
            stadium_result = self._stadium_resolver.resolve(
                raw.stadium_raw,
                check_date=raw.game_date.date(),
                source_url=raw.source_url,
            )

            if stadium_result.review_item:
                review_items.append(stadium_result.review_item)

            stadium_id = stadium_result.canonical_id

        # Get abbreviations for game ID
        home_abbrev = self._get_abbreviation(home_result.canonical_id)
        away_abbrev = self._get_abbreviation(away_result.canonical_id)

        # Generate canonical game ID
        game_id = generate_game_id(
            sport=self.sport,
            season=self.season,
            away_abbrev=away_abbrev,
            home_abbrev=home_abbrev,
            game_date=raw.game_date,
            game_number=None,
        )

        game = Game(
            id=game_id,
            sport=self.sport,
            season=self.season,
            home_team_id=home_result.canonical_id,
            away_team_id=away_result.canonical_id,
            stadium_id=stadium_id or "",
            game_date=raw.game_date,
            game_number=None,
            home_score=raw.home_score,
            away_score=raw.away_score,
            status=raw.status,
            source_url=raw.source_url,
            raw_home_team=raw.home_team_raw,
            raw_away_team=raw.away_team_raw,
            raw_stadium=raw.stadium_raw,
        )

        return game, review_items

    def _get_abbreviation(self, team_id: str) -> str:
        """Extract abbreviation from team ID."""
        parts = team_id.split("_")
        return parts[-1] if parts else ""

    def scrape_teams(self) -> list[Team]:
        """Get all MLS teams from hardcoded mappings."""
        teams: list[Team] = []
        seen: set[str] = set()

        # MLS conference structure
        conferences = {
            "Eastern": ["ATL", "CLT", "CHI", "CIN", "CLB", "DC", "MIA", "MTL", "NE", "NYC", "RB", "ORL", "PHI", "TOR"],
            "Western": ["AUS", "COL", "DAL", "HOU", "LAG", "LAFC", "MIN", "NSH", "POR", "SLC", "SD", "SJ", "SEA", "SKC", "STL", "VAN"],
        }

        # Build reverse lookup
        team_conferences: dict[str, str] = {}
        for conf, abbrevs in conferences.items():
            for abbrev in abbrevs:
                team_conferences[abbrev] = conf

        for abbrev, (team_id, full_name, city, stadium_id) in TEAM_MAPPINGS.get("mls", {}).items():
            if team_id in seen:
                continue
            seen.add(team_id)

            # Parse team name
            team_name = full_name

            # Get conference
            conf = team_conferences.get(abbrev)

            team = Team(
                id=team_id,
                sport="mls",
                city=city,
                name=team_name,
                full_name=full_name,
                abbreviation=abbrev,
                conference=conf,
                division=None,  # MLS doesn't have divisions
                stadium_id=stadium_id,
            )
            teams.append(team)

        return teams

    def scrape_stadiums(self) -> list[Stadium]:
        """Get all MLS stadiums from hardcoded mappings."""
        stadiums: list[Stadium] = []

        mls_stadiums = STADIUM_MAPPINGS.get("mls", {})
        for stadium_id, info in mls_stadiums.items():
            stadium = Stadium(
                id=stadium_id,
                sport="mls",
                name=info.name,
                city=info.city,
                state=info.state,
                country=info.country,
                latitude=info.latitude,
                longitude=info.longitude,
                surface="grass",
                roof_type="open",
            )
            stadiums.append(stadium)

        return stadiums


def create_mls_scraper(season: int) -> MLSScraper:
    """Factory function to create an MLS scraper."""
    return MLSScraper(season=season)
661
sportstime_parser/scrapers/nba.py
Normal file
@@ -0,0 +1,661 @@
"""NBA scraper implementation with multi-source fallback."""
|
||||
|
||||
from datetime import datetime, date, timezone
|
||||
from typing import Optional
|
||||
from bs4 import BeautifulSoup
|
||||
import re
|
||||
|
||||
from .base import BaseScraper, RawGameData, ScrapeResult
|
||||
from ..models.game import Game
|
||||
from ..models.team import Team
|
||||
from ..models.stadium import Stadium
|
||||
from ..models.aliases import ManualReviewItem
|
||||
from ..normalizers.canonical_id import generate_game_id
|
||||
from ..normalizers.team_resolver import (
|
||||
TeamResolver,
|
||||
TEAM_MAPPINGS,
|
||||
get_team_resolver,
|
||||
)
|
||||
from ..normalizers.stadium_resolver import (
|
||||
StadiumResolver,
|
||||
STADIUM_MAPPINGS,
|
||||
get_stadium_resolver,
|
||||
)
|
||||
from ..normalizers.timezone import parse_datetime
|
||||
from ..utils.logging import get_logger, log_game, log_warning
|
||||
|
||||
|
||||
# Month name to number mapping
|
||||
MONTH_MAP = {
|
||||
"january": 1, "february": 2, "march": 3, "april": 4,
|
||||
"may": 5, "june": 6, "july": 7, "august": 8,
|
||||
"september": 9, "october": 10, "november": 11, "december": 12,
|
||||
}
|
||||
|
||||
# Basketball Reference month URLs
|
||||
BR_MONTHS = [
|
||||
"october", "november", "december",
|
||||
"january", "february", "march", "april", "may", "june",
|
||||
]
|
||||
|
||||
|
||||
class NBAScraper(BaseScraper):
|
||||
"""NBA schedule scraper with multi-source fallback.
|
||||
|
||||
Sources (in priority order):
|
||||
1. Basketball-Reference - Most reliable, complete historical data
|
||||
2. ESPN API - Good for current/future seasons
|
||||
3. CBS Sports - Backup option
|
||||
"""
|
||||
|
||||
def __init__(self, season: int, **kwargs):
|
||||
"""Initialize NBA scraper.
|
||||
|
||||
Args:
|
||||
season: Season start year (e.g., 2025 for 2025-26)
|
||||
"""
|
||||
super().__init__("nba", season, **kwargs)
|
||||
self._team_resolver = get_team_resolver("nba")
|
||||
self._stadium_resolver = get_stadium_resolver("nba")
|
||||
|
||||
def _get_sources(self) -> list[str]:
|
||||
"""Return source list in priority order."""
|
||||
# CBS scraper not yet implemented - TODO for future
|
||||
return ["basketball_reference", "espn"]
|
||||
|
||||
def _get_source_url(self, source: str, **kwargs) -> str:
|
||||
"""Build URL for a source."""
|
||||
if source == "basketball_reference":
|
||||
month = kwargs.get("month", "october")
|
||||
year = kwargs.get("year", self.season + 1)
|
||||
return f"https://www.basketball-reference.com/leagues/NBA_{year}_games-{month}.html"
|
||||
|
||||
elif source == "espn":
|
||||
date_str = kwargs.get("date", "")
|
||||
return f"https://site.api.espn.com/apis/site/v2/sports/basketball/nba/scoreboard?dates={date_str}"
|
||||
|
||||
elif source == "cbs":
|
||||
return "https://www.cbssports.com/nba/schedule/"
|
||||
|
||||
raise ValueError(f"Unknown source: {source}")
|
||||
|
||||
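One detail of the Basketball-Reference URLs above is easy to get wrong: the site names a season by its *ending* year, while this scraper's `season` parameter is the *starting* year, hence the `self.season + 1`. A standalone sketch of the URL construction (function name is illustrative):

```python
def br_schedule_url(season: int, month: str) -> str:
    # Basketball-Reference keys a season by its ending year: the 2025-26
    # season (season=2025 here) lives under NBA_2026_games-<month>.html.
    return f"https://www.basketball-reference.com/leagues/NBA_{season + 1}_games-{month}.html"

url = br_schedule_url(2025, "october")
```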
def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
|
||||
"""Scrape games from a specific source."""
|
||||
if source == "basketball_reference":
|
||||
return self._scrape_basketball_reference()
|
||||
elif source == "espn":
|
||||
return self._scrape_espn()
|
||||
elif source == "cbs":
|
||||
return self._scrape_cbs()
|
||||
else:
|
||||
raise ValueError(f"Unknown source: {source}")
|
||||
|
||||
def _scrape_basketball_reference(self) -> list[RawGameData]:
|
||||
"""Scrape games from Basketball-Reference.
|
||||
|
||||
BR organizes games by month with separate pages.
|
||||
Format: https://www.basketball-reference.com/leagues/NBA_YYYY_games-month.html
|
||||
where YYYY is the ending year of the season.
|
||||
Bails early if first few months have no data (season doesn't exist).
|
||||
"""
|
||||
all_games: list[RawGameData] = []
|
||||
end_year = self.season + 1
|
||||
consecutive_empty_months = 0
|
||||
|
||||
for month in BR_MONTHS:
|
||||
url = self._get_source_url("basketball_reference", month=month, year=end_year)
|
||||
|
||||
try:
|
||||
html = self.session.get_html(url)
|
||||
games = self._parse_basketball_reference(html, url)
|
||||
|
||||
if games:
|
||||
all_games.extend(games)
|
||||
consecutive_empty_months = 0
|
||||
self._logger.debug(f"Found {len(games)} games in {month}")
|
||||
else:
|
||||
consecutive_empty_months += 1
|
||||
|
||||
except Exception as e:
|
||||
# Some months may not exist (e.g., no games in August)
|
||||
self._logger.debug(f"No data for {month}: {e}")
|
||||
consecutive_empty_months += 1
|
||||
|
||||
# If first 3 months (Oct, Nov, Dec) all have no data, season doesn't exist
|
||||
if consecutive_empty_months >= 3 and not all_games:
|
||||
self._logger.info(f"No games found in first {consecutive_empty_months} months, season likely doesn't exist")
|
||||
break
|
||||
|
||||
return all_games
|
||||
|
||||
def _parse_basketball_reference(
|
||||
self,
|
||||
html: str,
|
||||
source_url: str,
|
||||
) -> list[RawGameData]:
|
||||
"""Parse Basketball-Reference schedule HTML.
|
||||
|
||||
Table structure:
|
||||
- th[data-stat="date_game"]: Date (e.g., "Tue, Oct 22, 2024")
|
||||
- td[data-stat="visitor_team_name"]: Away team
|
||||
- td[data-stat="home_team_name"]: Home team
|
||||
- td[data-stat="visitor_pts"]: Away score
|
||||
- td[data-stat="home_pts"]: Home score
|
||||
- td[data-stat="arena_name"]: Arena/stadium name
|
||||
"""
|
||||
soup = BeautifulSoup(html, "lxml")
|
||||
games: list[RawGameData] = []
|
||||
|
||||
# Find the schedule table
|
||||
table = soup.find("table", id="schedule")
|
||||
if not table:
|
||||
return games
|
||||
|
||||
tbody = table.find("tbody")
|
||||
if not tbody:
|
||||
return games
|
||||
|
||||
        for row in tbody.find_all("tr"):
            # Skip header rows
            if row.get("class") and "thead" in row.get("class", []):
                continue

            try:
                game = self._parse_br_row(row, source_url)
                if game:
                    games.append(game)
            except Exception as e:
                self._logger.debug(f"Failed to parse row: {e}")
                continue

        return games

    def _parse_br_row(
        self,
        row,
        source_url: str,
    ) -> Optional[RawGameData]:
        """Parse a single Basketball-Reference table row."""
        # Get date
        date_cell = row.find("th", {"data-stat": "date_game"})
        if not date_cell:
            return None

        date_text = date_cell.get_text(strip=True)
        if not date_text:
            return None

        # Parse date (format: "Tue, Oct 22, 2024")
        try:
            game_date = datetime.strptime(date_text, "%a, %b %d, %Y")
        except ValueError:
            # Try alternative format
            try:
                game_date = datetime.strptime(date_text, "%B %d, %Y")
            except ValueError:
                self._logger.debug(f"Could not parse date: {date_text}")
                return None

        # Get teams
        away_cell = row.find("td", {"data-stat": "visitor_team_name"})
        home_cell = row.find("td", {"data-stat": "home_team_name"})

        if not away_cell or not home_cell:
            return None

        away_team = away_cell.get_text(strip=True)
        home_team = home_cell.get_text(strip=True)

        if not away_team or not home_team:
            return None

        # Get scores (may be empty for future games)
        away_score_cell = row.find("td", {"data-stat": "visitor_pts"})
        home_score_cell = row.find("td", {"data-stat": "home_pts"})

        away_score = None
        home_score = None

        if away_score_cell and away_score_cell.get_text(strip=True):
            try:
                away_score = int(away_score_cell.get_text(strip=True))
            except ValueError:
                pass

        if home_score_cell and home_score_cell.get_text(strip=True):
            try:
                home_score = int(home_score_cell.get_text(strip=True))
            except ValueError:
                pass

        # Get arena
        arena_cell = row.find("td", {"data-stat": "arena_name"})
        arena = arena_cell.get_text(strip=True) if arena_cell else None

        # Determine status
        status = "final" if home_score is not None else "scheduled"

        # Check for postponed/cancelled
        notes_cell = row.find("td", {"data-stat": "game_remarks"})
        if notes_cell:
            notes = notes_cell.get_text(strip=True).lower()
            if "postponed" in notes:
                status = "postponed"
            elif "cancelled" in notes or "canceled" in notes:
                status = "cancelled"

        return RawGameData(
            game_date=game_date,
            home_team_raw=home_team,
            away_team_raw=away_team,
            stadium_raw=arena,
            home_score=home_score,
            away_score=away_score,
            status=status,
            source_url=source_url,
        )

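The two-format date fallback above can be exercised in isolation. This is a minimal sketch, assuming the same two format strings; the standalone function name `parse_br_date` is illustrative and not part of the package:

```python
from datetime import datetime
from typing import Optional


def parse_br_date(text: str) -> Optional[datetime]:
    """Try the Basketball-Reference long format, then the fallback."""
    for fmt in ("%a, %b %d, %Y", "%B %d, %Y"):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


print(parse_br_date("Tue, Oct 22, 2024"))  # 2024-10-22 00:00:00
print(parse_br_date("October 22, 2024"))   # 2024-10-22 00:00:00
print(parse_br_date("22/10/2024"))         # None
```

Looping over a tuple of formats keeps the fallback chain flat instead of nesting `try`/`except` blocks as the method does.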
    def _scrape_espn(self) -> list[RawGameData]:
        """Scrape games from ESPN API.

        The ESPN API returns games for a specific date, so we iterate
        through each day of the season, bailing out early after a long
        stretch (~1.5 months) with no games.
        """
        all_games: list[RawGameData] = []
        consecutive_empty_days = 0
        max_empty_days = 45  # Bail after ~1.5 months of no games

        for year, month in self._get_season_months():
            # Get number of days in month
            if month == 12:
                next_month = date(year + 1, 1, 1)
            else:
                next_month = date(year, month + 1, 1)

            days_in_month = (next_month - date(year, month, 1)).days

            for day in range(1, days_in_month + 1):
                try:
                    game_date = date(year, month, day)
                    date_str = game_date.strftime("%Y%m%d")
                    url = self._get_source_url("espn", date=date_str)

                    data = self.session.get_json(url)
                    games = self._parse_espn_response(data, url)

                    if games:
                        all_games.extend(games)
                        consecutive_empty_days = 0
                    else:
                        consecutive_empty_days += 1

                    # Bail early if no games found for a long stretch
                    if consecutive_empty_days >= max_empty_days:
                        self._logger.info(f"No games found for {max_empty_days} consecutive days, stopping ESPN scrape")
                        return all_games

                except Exception as e:
                    self._logger.debug(f"ESPN error for {year}-{month}-{day}: {e}")
                    consecutive_empty_days += 1

                    if consecutive_empty_days >= max_empty_days:
                        self._logger.info("Too many consecutive failures, stopping ESPN scrape")
                        return all_games
                    continue

        return all_games

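The month-length computation above relies only on `datetime.date` arithmetic: the first day of the next month minus the first day of the current month. A minimal standalone sketch (the function name `days_in_month` is illustrative; the stdlib `calendar.monthrange` would give the same answer):

```python
from datetime import date


def days_in_month(year: int, month: int) -> int:
    """First day of the next month minus first day of this month."""
    if month == 12:
        nxt = date(year + 1, 1, 1)
    else:
        nxt = date(year, month + 1, 1)
    return (nxt - date(year, month, 1)).days


print(days_in_month(2024, 2))   # 29 (leap year)
print(days_in_month(2025, 2))   # 28
print(days_in_month(2025, 12))  # 31
```

The date subtraction handles leap years for free, which is why the scraper needs no special-casing for February.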
    def _parse_espn_response(
        self,
        data: dict,
        source_url: str,
    ) -> list[RawGameData]:
        """Parse ESPN API response."""
        games: list[RawGameData] = []

        events = data.get("events", [])

        for event in events:
            try:
                game = self._parse_espn_event(event, source_url)
                if game:
                    games.append(game)
            except Exception as e:
                self._logger.debug(f"Failed to parse ESPN event: {e}")
                continue

        return games

    def _parse_espn_event(
        self,
        event: dict,
        source_url: str,
    ) -> Optional[RawGameData]:
        """Parse a single ESPN event."""
        # Get date
        date_str = event.get("date", "")
        if not date_str:
            return None

        try:
            # ESPN uses ISO format
            game_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
        except ValueError:
            return None

        # Get competitions (usually just one)
        competitions = event.get("competitions", [])
        if not competitions:
            return None

        competition = competitions[0]

        # Get teams
        competitors = competition.get("competitors", [])
        if len(competitors) != 2:
            return None

        home_team = None
        away_team = None
        home_score = None
        away_score = None

        for competitor in competitors:
            team_info = competitor.get("team", {})
            team_name = team_info.get("displayName", "")
            is_home = competitor.get("homeAway") == "home"
            score = competitor.get("score")

            if score:
                try:
                    score = int(score)
                except (ValueError, TypeError):
                    score = None

            if is_home:
                home_team = team_name
                home_score = score
            else:
                away_team = team_name
                away_score = score

        if not home_team or not away_team:
            return None

        # Get venue
        venue = competition.get("venue", {})
        arena = venue.get("fullName")

        # Get status
        status_info = competition.get("status", {})
        status_type = status_info.get("type", {})
        status_name = status_type.get("name", "").lower()

        if status_name == "status_final":
            status = "final"
        elif status_name == "status_postponed":
            status = "postponed"
        elif status_name == "status_canceled":
            status = "cancelled"
        else:
            status = "scheduled"

        return RawGameData(
            game_date=game_date,
            home_team_raw=home_team,
            away_team_raw=away_team,
            stadium_raw=arena,
            home_score=home_score,
            away_score=away_score,
            status=status,
            source_url=source_url,
        )

    def _scrape_cbs(self) -> list[RawGameData]:
        """Scrape games from CBS Sports.

        CBS Sports is a backup source with less structured data.
        """
        # CBS Sports scraping would go here.
        # For now, raise so the caller falls back to other sources.
        raise NotImplementedError("CBS scraper not implemented")

    def _normalize_games(
        self,
        raw_games: list[RawGameData],
    ) -> tuple[list[Game], list[ManualReviewItem]]:
        """Normalize raw games to Game objects with canonical IDs."""
        games: list[Game] = []
        review_items: list[ManualReviewItem] = []

        # Track games by (date, away, home) matchup key for doubleheader detection
        games_by_date: dict[str, list[RawGameData]] = {}

        for raw in raw_games:
            date_key = raw.game_date.strftime("%Y%m%d")
            matchup_key = f"{date_key}_{raw.away_team_raw}_{raw.home_team_raw}"

            if matchup_key not in games_by_date:
                games_by_date[matchup_key] = []
            games_by_date[matchup_key].append(raw)

        # Process games with doubleheader detection
        for matchup_key, matchup_games in games_by_date.items():
            is_doubleheader = len(matchup_games) > 1

            for i, raw in enumerate(matchup_games):
                game_number = (i + 1) if is_doubleheader else None

                game, item_reviews = self._normalize_single_game(raw, game_number)

                if game:
                    games.append(game)
                    log_game(
                        self.sport,
                        game.id,
                        game.home_team_id,
                        game.away_team_id,
                        game.game_date.strftime("%Y-%m-%d"),
                        game.status,
                    )

                review_items.extend(item_reviews)

        return games, review_items

    def _normalize_single_game(
        self,
        raw: RawGameData,
        game_number: Optional[int],
    ) -> tuple[Optional[Game], list[ManualReviewItem]]:
        """Normalize a single raw game."""
        review_items: list[ManualReviewItem] = []

        # Resolve home team
        home_result = self._team_resolver.resolve(
            raw.home_team_raw,
            check_date=raw.game_date.date(),
            source_url=raw.source_url,
        )

        if home_result.review_item:
            review_items.append(home_result.review_item)

        if not home_result.canonical_id:
            log_warning(f"Could not resolve home team: {raw.home_team_raw}")
            return None, review_items

        # Resolve away team
        away_result = self._team_resolver.resolve(
            raw.away_team_raw,
            check_date=raw.game_date.date(),
            source_url=raw.source_url,
        )

        if away_result.review_item:
            review_items.append(away_result.review_item)

        if not away_result.canonical_id:
            log_warning(f"Could not resolve away team: {raw.away_team_raw}")
            return None, review_items

        # Resolve stadium (optional - use home team's stadium if not found)
        stadium_id = None

        if raw.stadium_raw:
            stadium_result = self._stadium_resolver.resolve(
                raw.stadium_raw,
                check_date=raw.game_date.date(),
                source_url=raw.source_url,
            )

            if stadium_result.review_item:
                review_items.append(stadium_result.review_item)

            stadium_id = stadium_result.canonical_id

        # If no stadium found, use home team's default stadium
        if not stadium_id:
            # Look up home team's stadium from mappings
            home_abbrev = home_result.canonical_id.split("_")[-1].upper()
            team_info = self._team_resolver.get_team_info(home_abbrev)

            if team_info:
                # Try to find stadium by team's home arena
                for sid, sinfo in STADIUM_MAPPINGS.get("nba", {}).items():
                    # Match by city
                    if sinfo.city.lower() in team_info[2].lower():
                        stadium_id = sid
                        break

        # Get abbreviations for game ID
        home_abbrev = self._get_abbreviation(home_result.canonical_id)
        away_abbrev = self._get_abbreviation(away_result.canonical_id)

        # Generate canonical game ID
        game_id = generate_game_id(
            sport=self.sport,
            season=self.season,
            away_abbrev=away_abbrev,
            home_abbrev=home_abbrev,
            game_date=raw.game_date,
            game_number=game_number,
        )

        game = Game(
            id=game_id,
            sport=self.sport,
            season=self.season,
            home_team_id=home_result.canonical_id,
            away_team_id=away_result.canonical_id,
            stadium_id=stadium_id or "",
            game_date=raw.game_date,
            game_number=game_number,
            home_score=raw.home_score,
            away_score=raw.away_score,
            status=raw.status,
            source_url=raw.source_url,
        )

        return game, review_items

    def _get_abbreviation(self, team_id: str) -> str:
        """Extract abbreviation from team ID."""
        # team_nba_okc -> okc
        parts = team_id.split("_")
        return parts[-1] if parts else ""

    def scrape_teams(self) -> list[Team]:
        """Get all NBA teams from hardcoded mappings."""
        teams: list[Team] = []
        seen: set[str] = set()

        # NBA conference/division structure
        divisions = {
            "Atlantic": ("Eastern", ["BOS", "BKN", "NYK", "PHI", "TOR"]),
            "Central": ("Eastern", ["CHI", "CLE", "DET", "IND", "MIL"]),
            "Southeast": ("Eastern", ["ATL", "CHA", "MIA", "ORL", "WAS"]),
            "Northwest": ("Western", ["DEN", "MIN", "OKC", "POR", "UTA"]),
            "Pacific": ("Western", ["GSW", "LAC", "LAL", "PHX", "SAC"]),
            "Southwest": ("Western", ["DAL", "HOU", "MEM", "NOP", "SAS"]),
        }

        # Build reverse lookup
        team_divisions: dict[str, tuple[str, str]] = {}
        for div, (conf, abbrevs) in divisions.items():
            for abbrev in abbrevs:
                team_divisions[abbrev] = (conf, div)

        for abbrev, (team_id, full_name, city, stadium_id) in TEAM_MAPPINGS.get("nba", {}).items():
            if team_id in seen:
                continue
            seen.add(team_id)

            # Parse full name into city and name parts
            parts = full_name.split()
            if len(parts) >= 2:
                # Handle special cases like "Oklahoma City Thunder"
                if city == "Oklahoma City":
                    team_name = "Thunder"
                elif city == "Golden State":
                    team_name = "Warriors"
                elif city == "San Antonio":
                    team_name = "Spurs"
                elif city == "New York":
                    team_name = parts[-1]  # Knicks
                elif city == "New Orleans":
                    team_name = "Pelicans"
                elif city == "Los Angeles":
                    team_name = parts[-1]  # Lakers or Clippers
                else:
                    team_name = parts[-1]
            else:
                team_name = full_name

            # Get conference and division
            conf, div = team_divisions.get(abbrev, (None, None))

            team = Team(
                id=team_id,
                sport="nba",
                city=city,
                name=team_name,
                full_name=full_name,
                abbreviation=abbrev,
                conference=conf,
                division=div,
                stadium_id=stadium_id,
            )
            teams.append(team)

        return teams

    def scrape_stadiums(self) -> list[Stadium]:
        """Get all NBA stadiums from hardcoded mappings."""
        stadiums: list[Stadium] = []

        for stadium_id, info in STADIUM_MAPPINGS.get("nba", {}).items():
            stadium = Stadium(
                id=stadium_id,
                sport="nba",
                name=info.name,
                city=info.city,
                state=info.state,
                country=info.country,
                latitude=info.latitude,
                longitude=info.longitude,
                surface="hardwood",
                roof_type="dome",
            )
            stadiums.append(stadium)

        return stadiums


def create_nba_scraper(season: int) -> NBAScraper:
    """Factory function to create an NBA scraper."""
    return NBAScraper(season=season)
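The matchup-key grouping used for doubleheader detection in `_normalize_games` can be sketched standalone. In this illustration `RawGameData` is reduced to a plain dict and the team codes are synthetic; the function name `assign_game_numbers` is not from the source:

```python
from collections import defaultdict


def assign_game_numbers(raw_games):
    """Group games by (date, away, home); number them only when the
    same matchup appears more than once on one day (a doubleheader)."""
    by_matchup = defaultdict(list)
    for g in raw_games:
        by_matchup[(g["date"], g["away"], g["home"])].append(g)

    numbered = []
    for matchup in by_matchup.values():
        dh = len(matchup) > 1
        for i, g in enumerate(matchup):
            numbered.append((g, i + 1 if dh else None))
    return numbered


games = [
    {"date": "20250712", "away": "LAL", "home": "BOS"},
    {"date": "20250712", "away": "LAL", "home": "BOS"},  # same day: doubleheader
    {"date": "20250713", "away": "LAL", "home": "BOS"},
]
print([n for _, n in assign_game_numbers(games)])  # [1, 2, None]
```

Single games keep `game_number=None`, so the canonical game ID only carries a suffix when disambiguation is actually needed.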
579
sportstime_parser/scrapers/nfl.py
Normal file
@@ -0,0 +1,579 @@
"""NFL scraper implementation with multi-source fallback."""

from datetime import datetime, date
from typing import Optional
from bs4 import BeautifulSoup

from .base import BaseScraper, RawGameData, ScrapeResult
from ..models.game import Game
from ..models.team import Team
from ..models.stadium import Stadium
from ..models.aliases import ManualReviewItem
from ..normalizers.canonical_id import generate_game_id
from ..normalizers.team_resolver import (
    TeamResolver,
    TEAM_MAPPINGS,
    get_team_resolver,
)
from ..normalizers.stadium_resolver import (
    StadiumResolver,
    STADIUM_MAPPINGS,
    get_stadium_resolver,
)
from ..utils.logging import get_logger, log_game, log_warning


# International game locations to filter out
INTERNATIONAL_LOCATIONS = {"London", "Mexico City", "Frankfurt", "Munich", "São Paulo"}

class NFLScraper(BaseScraper):
    """NFL schedule scraper with multi-source fallback.

    Sources (in priority order):
    1. ESPN API - Most reliable for NFL
    2. Pro-Football-Reference - Complete historical data
    3. CBS Sports - Backup option
    """

    def __init__(self, season: int, **kwargs):
        """Initialize NFL scraper.

        Args:
            season: Season year (e.g., 2025 for the 2025 season)
        """
        super().__init__("nfl", season, **kwargs)
        self._team_resolver = get_team_resolver("nfl")
        self._stadium_resolver = get_stadium_resolver("nfl")

    def _get_sources(self) -> list[str]:
        """Return source list in priority order."""
        # CBS scraper not yet implemented - TODO for future
        return ["espn", "pro_football_reference"]

    def _get_source_url(self, source: str, **kwargs) -> str:
        """Build URL for a source."""
        if source == "espn":
            week = kwargs.get("week", 1)
            season_type = kwargs.get("season_type", 2)  # 1=preseason, 2=regular, 3=postseason
            return f"https://site.api.espn.com/apis/site/v2/sports/football/nfl/scoreboard?seasontype={season_type}&week={week}"

        elif source == "pro_football_reference":
            return f"https://www.pro-football-reference.com/years/{self.season}/games.htm"

        elif source == "cbs":
            return "https://www.cbssports.com/nfl/schedule/"

        raise ValueError(f"Unknown source: {source}")

    def _get_season_months(self) -> list[tuple[int, int]]:
        """Get the months to scrape for the NFL season.

        The NFL season runs September through February.
        """
        months = []

        # Regular season months
        for month in range(9, 13):  # Sept-Dec
            months.append((self.season, month))

        # Playoff months
        for month in range(1, 3):  # Jan-Feb
            months.append((self.season + 1, month))

        return months

    def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
        """Scrape games from a specific source."""
        if source == "espn":
            return self._scrape_espn()
        elif source == "pro_football_reference":
            return self._scrape_pro_football_reference()
        elif source == "cbs":
            return self._scrape_cbs()
        else:
            raise ValueError(f"Unknown source: {source}")

    def _scrape_espn(self) -> list[RawGameData]:
        """Scrape games from ESPN API.

        ESPN NFL API uses week numbers.
        """
        all_games: list[RawGameData] = []

        # Scrape preseason (4 weeks)
        for week in range(1, 5):
            try:
                url = self._get_source_url("espn", week=week, season_type=1)
                data = self.session.get_json(url)
                games = self._parse_espn_response(data, url)
                all_games.extend(games)
            except Exception as e:
                self._logger.debug(f"ESPN preseason week {week} error: {e}")
                continue

        # Scrape regular season (18 weeks)
        for week in range(1, 19):
            try:
                url = self._get_source_url("espn", week=week, season_type=2)
                data = self.session.get_json(url)
                games = self._parse_espn_response(data, url)
                all_games.extend(games)
                self._logger.debug(f"Found {len(games)} games in week {week}")
            except Exception as e:
                self._logger.debug(f"ESPN regular season week {week} error: {e}")
                continue

        # Scrape postseason (4 rounds)
        for week in range(1, 5):
            try:
                url = self._get_source_url("espn", week=week, season_type=3)
                data = self.session.get_json(url)
                games = self._parse_espn_response(data, url)
                all_games.extend(games)
            except Exception as e:
                self._logger.debug(f"ESPN postseason week {week} error: {e}")
                continue

        return all_games

    def _parse_espn_response(
        self,
        data: dict,
        source_url: str,
    ) -> list[RawGameData]:
        """Parse ESPN API response."""
        games: list[RawGameData] = []

        events = data.get("events", [])

        for event in events:
            try:
                game = self._parse_espn_event(event, source_url)
                if game:
                    # Filter international games
                    if game.stadium_raw and any(loc in game.stadium_raw for loc in INTERNATIONAL_LOCATIONS):
                        self._logger.debug(f"Skipping international game: {game.stadium_raw}")
                        continue
                    games.append(game)
            except Exception as e:
                self._logger.debug(f"Failed to parse ESPN event: {e}")
                continue

        return games

    def _parse_espn_event(
        self,
        event: dict,
        source_url: str,
    ) -> Optional[RawGameData]:
        """Parse a single ESPN event."""
        # Get date
        date_str = event.get("date", "")
        if not date_str:
            return None

        try:
            game_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
        except ValueError:
            return None

        # Get competitions
        competitions = event.get("competitions", [])
        if not competitions:
            return None

        competition = competitions[0]

        # Check for neutral site (international games)
        if competition.get("neutralSite"):
            venue = competition.get("venue", {})
            venue_city = venue.get("address", {}).get("city", "")
            if venue_city in INTERNATIONAL_LOCATIONS:
                return None

        # Get teams
        competitors = competition.get("competitors", [])
        if len(competitors) != 2:
            return None

        home_team = None
        away_team = None
        home_score = None
        away_score = None

        for competitor in competitors:
            team_info = competitor.get("team", {})
            team_name = team_info.get("displayName", "")
            is_home = competitor.get("homeAway") == "home"
            score = competitor.get("score")

            if score:
                try:
                    score = int(score)
                except (ValueError, TypeError):
                    score = None

            if is_home:
                home_team = team_name
                home_score = score
            else:
                away_team = team_name
                away_score = score

        if not home_team or not away_team:
            return None

        # Get venue
        venue = competition.get("venue", {})
        stadium = venue.get("fullName")

        # Get status
        status_info = competition.get("status", {})
        status_type = status_info.get("type", {})
        status_name = status_type.get("name", "").lower()

        if status_name == "status_final":
            status = "final"
        elif status_name == "status_postponed":
            status = "postponed"
        elif status_name == "status_canceled":
            status = "cancelled"
        else:
            status = "scheduled"

        return RawGameData(
            game_date=game_date,
            home_team_raw=home_team,
            away_team_raw=away_team,
            stadium_raw=stadium,
            home_score=home_score,
            away_score=away_score,
            status=status,
            source_url=source_url,
        )

    def _scrape_pro_football_reference(self) -> list[RawGameData]:
        """Scrape games from Pro-Football-Reference.

        PFR has a single schedule page per season.
        """
        url = self._get_source_url("pro_football_reference")

        try:
            html = self.session.get_html(url)
            games = self._parse_pfr(html, url)
            return games
        except Exception as e:
            self._logger.error(f"Failed to scrape Pro-Football-Reference: {e}")
            raise

    def _parse_pfr(
        self,
        html: str,
        source_url: str,
    ) -> list[RawGameData]:
        """Parse Pro-Football-Reference schedule HTML."""
        soup = BeautifulSoup(html, "lxml")
        games: list[RawGameData] = []

        # Find the schedule table
        table = soup.find("table", id="games")
        if not table:
            return games

        tbody = table.find("tbody")
        if not tbody:
            return games

        for row in tbody.find_all("tr"):
            # Skip header rows
            if row.get("class") and "thead" in row.get("class", []):
                continue

            try:
                game = self._parse_pfr_row(row, source_url)
                if game:
                    games.append(game)
            except Exception as e:
                self._logger.debug(f"Failed to parse PFR row: {e}")
                continue

        return games

    def _parse_pfr_row(
        self,
        row,
        source_url: str,
    ) -> Optional[RawGameData]:
        """Parse a single Pro-Football-Reference table row."""
        # Get date
        date_cell = row.find("td", {"data-stat": "game_date"})
        if not date_cell:
            return None

        date_text = date_cell.get_text(strip=True)
        if not date_text:
            return None

        # Parse date
        try:
            # PFR uses YYYY-MM-DD format
            game_date = datetime.strptime(date_text, "%Y-%m-%d")
        except ValueError:
            return None

        # Get teams
        winner_cell = row.find("td", {"data-stat": "winner"})
        loser_cell = row.find("td", {"data-stat": "loser"})

        if not winner_cell or not loser_cell:
            return None

        winner = winner_cell.get_text(strip=True)
        loser = loser_cell.get_text(strip=True)

        if not winner or not loser:
            return None

        # Determine home/away: an "@" in the location column means the
        # winner played on the road (i.e., the loser was the home team)
        game_location = row.find("td", {"data-stat": "game_location"})
        at_home = game_location and "@" in game_location.get_text()

        if at_home:
            home_team = loser
            away_team = winner
        else:
            home_team = winner
            away_team = loser

        # Get scores
        pts_win_cell = row.find("td", {"data-stat": "pts_win"})
        pts_lose_cell = row.find("td", {"data-stat": "pts_lose"})

        home_score = None
        away_score = None

        if pts_win_cell and pts_lose_cell:
            try:
                winner_pts = int(pts_win_cell.get_text(strip=True))
                loser_pts = int(pts_lose_cell.get_text(strip=True))

                if at_home:
                    home_score = loser_pts
                    away_score = winner_pts
                else:
                    home_score = winner_pts
                    away_score = loser_pts
            except ValueError:
                pass

        # Determine status
        status = "final" if home_score is not None else "scheduled"

        return RawGameData(
            game_date=game_date,
            home_team_raw=home_team,
            away_team_raw=away_team,
            stadium_raw=None,  # PFR doesn't always have stadium
            home_score=home_score,
            away_score=away_score,
            status=status,
            source_url=source_url,
        )
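Pro-Football-Reference lists each row as winner/loser with an `@` marker when the winner played on the road, so mapping back to home/away has to flip both teams and scores together. A standalone sketch of that mapping (the function name `to_home_away` is illustrative, not from the source):

```python
def to_home_away(winner, loser, location_text, pts_win=None, pts_lose=None):
    """An "@" in the game_location column means the winner was the road team."""
    winner_on_road = "@" in (location_text or "")
    if winner_on_road:
        home, away = loser, winner
        home_pts, away_pts = pts_lose, pts_win
    else:
        home, away = winner, loser
        home_pts, away_pts = pts_win, pts_lose
    return home, away, home_pts, away_pts


print(to_home_away("Kansas City Chiefs", "Baltimore Ravens", "", 27, 20))
# ('Kansas City Chiefs', 'Baltimore Ravens', 27, 20)
print(to_home_away("Kansas City Chiefs", "Baltimore Ravens", "@", 27, 20))
# ('Baltimore Ravens', 'Kansas City Chiefs', 20, 27)
```

Flipping teams and points in the same branch is what keeps `home_score` attached to `home_team` in both cases.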
    def _scrape_cbs(self) -> list[RawGameData]:
        """Scrape games from CBS Sports."""
        raise NotImplementedError("CBS scraper not implemented")

    def _normalize_games(
        self,
        raw_games: list[RawGameData],
    ) -> tuple[list[Game], list[ManualReviewItem]]:
        """Normalize raw games to Game objects with canonical IDs."""
        games: list[Game] = []
        review_items: list[ManualReviewItem] = []

        for raw in raw_games:
            game, item_reviews = self._normalize_single_game(raw)

            if game:
                games.append(game)
                log_game(
                    self.sport,
                    game.id,
                    game.home_team_id,
                    game.away_team_id,
                    game.game_date.strftime("%Y-%m-%d"),
                    game.status,
                )

            review_items.extend(item_reviews)

        return games, review_items

    def _normalize_single_game(
        self,
        raw: RawGameData,
    ) -> tuple[Optional[Game], list[ManualReviewItem]]:
        """Normalize a single raw game."""
        review_items: list[ManualReviewItem] = []

        # Resolve home team
        home_result = self._team_resolver.resolve(
            raw.home_team_raw,
            check_date=raw.game_date.date(),
            source_url=raw.source_url,
        )

        if home_result.review_item:
            review_items.append(home_result.review_item)

        if not home_result.canonical_id:
            log_warning(f"Could not resolve home team: {raw.home_team_raw}")
            return None, review_items

        # Resolve away team
        away_result = self._team_resolver.resolve(
            raw.away_team_raw,
            check_date=raw.game_date.date(),
            source_url=raw.source_url,
        )

        if away_result.review_item:
            review_items.append(away_result.review_item)

        if not away_result.canonical_id:
            log_warning(f"Could not resolve away team: {raw.away_team_raw}")
            return None, review_items

        # Resolve stadium
        stadium_id = None

        if raw.stadium_raw:
            stadium_result = self._stadium_resolver.resolve(
                raw.stadium_raw,
                check_date=raw.game_date.date(),
                source_url=raw.source_url,
            )

            if stadium_result.review_item:
                review_items.append(stadium_result.review_item)

            stadium_id = stadium_result.canonical_id

        # Get abbreviations for game ID
        home_abbrev = self._get_abbreviation(home_result.canonical_id)
        away_abbrev = self._get_abbreviation(away_result.canonical_id)

        # Generate canonical game ID
        game_id = generate_game_id(
            sport=self.sport,
            season=self.season,
            away_abbrev=away_abbrev,
            home_abbrev=home_abbrev,
            game_date=raw.game_date,
            game_number=None,  # NFL doesn't have doubleheaders
        )

        game = Game(
            id=game_id,
            sport=self.sport,
            season=self.season,
            home_team_id=home_result.canonical_id,
            away_team_id=away_result.canonical_id,
            stadium_id=stadium_id or "",
            game_date=raw.game_date,
            game_number=None,
            home_score=raw.home_score,
            away_score=raw.away_score,
            status=raw.status,
            source_url=raw.source_url,
        )

        return game, review_items

    def _get_abbreviation(self, team_id: str) -> str:
        """Extract abbreviation from team ID."""
        parts = team_id.split("_")
        return parts[-1] if parts else ""

    def scrape_teams(self) -> list[Team]:
        """Get all NFL teams from hardcoded mappings."""
        teams: list[Team] = []
        seen: set[str] = set()

        # NFL conference/division structure
        divisions = {
            "AFC East": ("AFC", ["BUF", "MIA", "NE", "NYJ"]),
            "AFC North": ("AFC", ["BAL", "CIN", "CLE", "PIT"]),
            "AFC South": ("AFC", ["HOU", "IND", "JAX", "TEN"]),
            "AFC West": ("AFC", ["DEN", "KC", "LV", "LAC"]),
            "NFC East": ("NFC", ["DAL", "NYG", "PHI", "WAS"]),
            "NFC North": ("NFC", ["CHI", "DET", "GB", "MIN"]),
            "NFC South": ("NFC", ["ATL", "CAR", "NO", "TB"]),
            "NFC West": ("NFC", ["ARI", "LAR", "SF", "SEA"]),
        }

        # Build reverse lookup
        team_divisions: dict[str, tuple[str, str]] = {}
        for div, (conf, abbrevs) in divisions.items():
            for abbrev in abbrevs:
                team_divisions[abbrev] = (conf, div)

        for abbrev, (team_id, full_name, city, stadium_id) in TEAM_MAPPINGS.get("nfl", {}).items():
            if team_id in seen:
                continue
            seen.add(team_id)

            # Parse team name
            parts = full_name.split()
            team_name = parts[-1] if parts else full_name

            # Get conference and division
            conf, div = team_divisions.get(abbrev, (None, None))

            team = Team(
                id=team_id,
                sport="nfl",
                city=city,
                name=team_name,
                full_name=full_name,
                abbreviation=abbrev,
                conference=conf,
                division=div,
                stadium_id=stadium_id,
            )
            teams.append(team)

        return teams

    def scrape_stadiums(self) -> list[Stadium]:
        """Get all NFL stadiums from hardcoded mappings."""
        stadiums: list[Stadium] = []

        nfl_stadiums = STADIUM_MAPPINGS.get("nfl", {})
        for stadium_id, info in nfl_stadiums.items():
            stadium = Stadium(
                id=stadium_id,
                sport="nfl",
                name=info.name,
                city=info.city,
                state=info.state,
                country=info.country,
                latitude=info.latitude,
                longitude=info.longitude,
                surface="turf",  # Default; per-stadium surface is not tracked
                roof_type="open",  # Default; per-stadium roof type is not tracked
            )
            stadiums.append(stadium)

        return stadiums


def create_nfl_scraper(season: int) -> NFLScraper:
    """Factory function to create an NFL scraper."""
    return NFLScraper(season=season)
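Both the NFL and NBA scrapers map ESPN's `status.type.name` onto the package's four status strings with the same if/elif chain. The shared logic could be factored as a lookup table; this is a sketch of that refactor, not code from the source:

```python
ESPN_STATUS_MAP = {
    "status_final": "final",
    "status_postponed": "postponed",
    "status_canceled": "cancelled",
}


def map_espn_status(status_name: str) -> str:
    """Anything unrecognized (in progress, delayed, ...) stays 'scheduled'."""
    return ESPN_STATUS_MAP.get(status_name.lower(), "scheduled")


print(map_espn_status("STATUS_FINAL"))      # final
print(map_espn_status("STATUS_SCHEDULED"))  # scheduled
```

A dict lookup with a `"scheduled"` default reproduces the chain's fall-through behavior and keeps the American/British spelling translation ("canceled" in, "cancelled" out) in one place.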
657
sportstime_parser/scrapers/nhl.py
Normal file
@@ -0,0 +1,657 @@
"""NHL scraper implementation with multi-source fallback."""

from datetime import datetime, date
from typing import Optional
from bs4 import BeautifulSoup

from .base import BaseScraper, RawGameData, ScrapeResult
from ..models.game import Game
from ..models.team import Team
from ..models.stadium import Stadium
from ..models.aliases import ManualReviewItem
from ..normalizers.canonical_id import generate_game_id
from ..normalizers.team_resolver import (
    TeamResolver,
    TEAM_MAPPINGS,
    get_team_resolver,
)
from ..normalizers.stadium_resolver import (
    StadiumResolver,
    STADIUM_MAPPINGS,
    get_stadium_resolver,
)
from ..utils.logging import get_logger, log_game, log_warning


# International game locations to filter out
INTERNATIONAL_LOCATIONS = {"Prague", "Stockholm", "Helsinki", "Tampere", "Gothenburg"}

# Hockey Reference month URLs
HR_MONTHS = [
    "october", "november", "december",
    "january", "february", "march", "april", "may", "june",
]


class NHLScraper(BaseScraper):
    """NHL schedule scraper with multi-source fallback.

    Sources (in priority order):
    1. Hockey-Reference - Most reliable for NHL
    2. NHL API - Official NHL data
    3. ESPN API - Backup option
    """

    def __init__(self, season: int, **kwargs):
        """Initialize NHL scraper.

        Args:
            season: Season start year (e.g., 2025 for 2025-26)
        """
        super().__init__("nhl", season, **kwargs)
        self._team_resolver = get_team_resolver("nhl")
        self._stadium_resolver = get_stadium_resolver("nhl")

    def _get_sources(self) -> list[str]:
        """Return source list in priority order."""
        return ["hockey_reference", "nhl_api", "espn"]

    def _get_source_url(self, source: str, **kwargs) -> str:
        """Build URL for a source."""
        if source == "hockey_reference":
            month = kwargs.get("month", "october")
            year = kwargs.get("year", self.season + 1)
            return f"https://www.hockey-reference.com/leagues/NHL_{year}_games.html"
|
||||
|
||||
elif source == "nhl_api":
|
||||
start_date = kwargs.get("start_date", "")
|
||||
end_date = kwargs.get("end_date", "")
|
||||
return f"https://api-web.nhle.com/v1/schedule/{start_date}"
|
||||
|
||||
elif source == "espn":
|
||||
date_str = kwargs.get("date", "")
|
||||
return f"https://site.api.espn.com/apis/site/v2/sports/hockey/nhl/scoreboard?dates={date_str}"
|
||||
|
||||
raise ValueError(f"Unknown source: {source}")
|
||||
|
||||
def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
|
||||
"""Scrape games from a specific source."""
|
||||
if source == "hockey_reference":
|
||||
return self._scrape_hockey_reference()
|
||||
elif source == "nhl_api":
|
||||
return self._scrape_nhl_api()
|
||||
elif source == "espn":
|
||||
return self._scrape_espn()
|
||||
else:
|
||||
raise ValueError(f"Unknown source: {source}")
|
||||
|
||||
def _scrape_hockey_reference(self) -> list[RawGameData]:
|
||||
"""Scrape games from Hockey-Reference.
|
||||
|
||||
HR has a single schedule page per season.
|
||||
"""
|
||||
end_year = self.season + 1
|
||||
url = self._get_source_url("hockey_reference", year=end_year)
|
||||
|
||||
try:
|
||||
html = self.session.get_html(url)
|
||||
games = self._parse_hockey_reference(html, url)
|
||||
return games
|
||||
except Exception as e:
|
||||
self._logger.error(f"Failed to scrape Hockey-Reference: {e}")
|
||||
raise
|
||||
|
||||
def _parse_hockey_reference(
|
||||
self,
|
||||
html: str,
|
||||
source_url: str,
|
||||
) -> list[RawGameData]:
|
||||
"""Parse Hockey-Reference schedule HTML."""
|
||||
soup = BeautifulSoup(html, "lxml")
|
||||
games: list[RawGameData] = []
|
||||
|
||||
# Find the schedule table
|
||||
table = soup.find("table", id="games")
|
||||
if not table:
|
||||
return games
|
||||
|
||||
tbody = table.find("tbody")
|
||||
if not tbody:
|
||||
return games
|
||||
|
||||
for row in tbody.find_all("tr"):
|
||||
# Skip header rows
|
||||
if row.get("class") and "thead" in row.get("class", []):
|
||||
continue
|
||||
|
||||
try:
|
||||
game = self._parse_hr_row(row, source_url)
|
||||
if game:
|
||||
# Filter international games
|
||||
if game.stadium_raw and any(loc in game.stadium_raw for loc in INTERNATIONAL_LOCATIONS):
|
||||
continue
|
||||
games.append(game)
|
||||
except Exception as e:
|
||||
self._logger.debug(f"Failed to parse HR row: {e}")
|
||||
continue
|
||||
|
||||
return games
|
||||
|
||||
def _parse_hr_row(
|
||||
self,
|
||||
row,
|
||||
source_url: str,
|
||||
) -> Optional[RawGameData]:
|
||||
"""Parse a single Hockey-Reference table row."""
|
||||
# Get date
|
||||
date_cell = row.find("th", {"data-stat": "date_game"})
|
||||
if not date_cell:
|
||||
return None
|
||||
|
||||
date_text = date_cell.get_text(strip=True)
|
||||
if not date_text:
|
||||
return None
|
||||
|
||||
# Parse date (format: "2025-10-15")
|
||||
try:
|
||||
game_date = datetime.strptime(date_text, "%Y-%m-%d")
|
||||
except ValueError:
|
||||
return None
|
||||
|
||||
# Get teams
|
||||
visitor_cell = row.find("td", {"data-stat": "visitor_team_name"})
|
||||
home_cell = row.find("td", {"data-stat": "home_team_name"})
|
||||
|
||||
if not visitor_cell or not home_cell:
|
||||
return None
|
||||
|
||||
away_team = visitor_cell.get_text(strip=True)
|
||||
home_team = home_cell.get_text(strip=True)
|
||||
|
||||
if not away_team or not home_team:
|
||||
return None
|
||||
|
||||
# Get scores
|
||||
visitor_goals_cell = row.find("td", {"data-stat": "visitor_goals"})
|
||||
home_goals_cell = row.find("td", {"data-stat": "home_goals"})
|
||||
|
||||
away_score = None
|
||||
home_score = None
|
||||
|
||||
if visitor_goals_cell and visitor_goals_cell.get_text(strip=True):
|
||||
try:
|
||||
away_score = int(visitor_goals_cell.get_text(strip=True))
|
||||
except ValueError:
|
||||
pass
|
||||
|
||||
if home_goals_cell and home_goals_cell.get_text(strip=True):
|
||||
try:
|
||||
home_score = int(home_goals_cell.get_text(strip=True))
|
||||
except ValueError:
|
||||
pass
|
||||
|
||||
# Determine status
|
||||
status = "final" if home_score is not None else "scheduled"
|
||||
|
||||
# Check for OT/SO
|
||||
overtimes_cell = row.find("td", {"data-stat": "overtimes"})
|
||||
if overtimes_cell:
|
||||
ot_text = overtimes_cell.get_text(strip=True)
|
||||
if ot_text:
|
||||
status = "final" # OT games are still final
|
||||
|
||||
return RawGameData(
|
||||
game_date=game_date,
|
||||
home_team_raw=home_team,
|
||||
away_team_raw=away_team,
|
||||
stadium_raw=None, # HR doesn't have stadium
|
||||
home_score=home_score,
|
||||
away_score=away_score,
|
||||
status=status,
|
||||
source_url=source_url,
|
||||
)
|
||||
|
||||
def _scrape_nhl_api(self) -> list[RawGameData]:
|
||||
"""Scrape games from NHL API."""
|
||||
all_games: list[RawGameData] = []
|
||||
|
||||
for year, month in self._get_season_months():
|
||||
start_date = date(year, month, 1)
|
||||
|
||||
url = self._get_source_url("nhl_api", start_date=start_date.strftime("%Y-%m-%d"))
|
||||
|
||||
try:
|
||||
data = self.session.get_json(url)
|
||||
games = self._parse_nhl_api_response(data, url)
|
||||
all_games.extend(games)
|
||||
except Exception as e:
|
||||
self._logger.debug(f"NHL API error for {year}-{month}: {e}")
|
||||
continue
|
||||
|
||||
return all_games
|
||||
|
||||
def _parse_nhl_api_response(
|
||||
self,
|
||||
data: dict,
|
||||
source_url: str,
|
||||
) -> list[RawGameData]:
|
||||
"""Parse NHL API response."""
|
||||
games: list[RawGameData] = []
|
||||
|
||||
game_weeks = data.get("gameWeek", [])
|
||||
|
||||
for week in game_weeks:
|
||||
for game_day in week.get("games", []):
|
||||
try:
|
||||
game = self._parse_nhl_api_game(game_day, source_url)
|
||||
if game:
|
||||
games.append(game)
|
||||
except Exception as e:
|
||||
self._logger.debug(f"Failed to parse NHL API game: {e}")
|
||||
continue
|
||||
|
||||
return games
|
||||
|
||||
def _parse_nhl_api_game(
|
||||
self,
|
||||
game: dict,
|
||||
source_url: str,
|
||||
) -> Optional[RawGameData]:
|
||||
"""Parse a single NHL API game."""
|
||||
# Get date
|
||||
start_time = game.get("startTimeUTC", "")
|
||||
if not start_time:
|
||||
return None
|
||||
|
||||
try:
|
||||
game_date = datetime.fromisoformat(start_time.replace("Z", "+00:00"))
|
||||
except ValueError:
|
||||
return None
|
||||
|
||||
# Get teams
|
||||
away_team_data = game.get("awayTeam", {})
|
||||
home_team_data = game.get("homeTeam", {})
|
||||
|
||||
away_team = away_team_data.get("placeName", {}).get("default", "")
|
||||
home_team = home_team_data.get("placeName", {}).get("default", "")
|
||||
|
||||
if not away_team or not home_team:
|
||||
# Try full name
|
||||
away_team = away_team_data.get("name", {}).get("default", "")
|
||||
home_team = home_team_data.get("name", {}).get("default", "")
|
||||
|
||||
if not away_team or not home_team:
|
||||
return None
|
||||
|
||||
# Get scores
|
||||
away_score = away_team_data.get("score")
|
||||
home_score = home_team_data.get("score")
|
||||
|
||||
# Get venue
|
||||
venue = game.get("venue", {})
|
||||
stadium = venue.get("default")
|
||||
|
||||
# Get status
|
||||
game_state = game.get("gameState", "").lower()
|
||||
|
||||
if game_state in ["final", "off"]:
|
||||
status = "final"
|
||||
elif game_state == "postponed":
|
||||
status = "postponed"
|
||||
elif game_state in ["cancelled", "canceled"]:
|
||||
status = "cancelled"
|
||||
else:
|
||||
status = "scheduled"
|
||||
|
||||
return RawGameData(
|
||||
game_date=game_date,
|
||||
home_team_raw=home_team,
|
||||
away_team_raw=away_team,
|
||||
stadium_raw=stadium,
|
||||
home_score=home_score,
|
||||
away_score=away_score,
|
||||
status=status,
|
||||
source_url=source_url,
|
||||
)
|
||||
|
||||
def _scrape_espn(self) -> list[RawGameData]:
|
||||
"""Scrape games from ESPN API."""
|
||||
all_games: list[RawGameData] = []
|
||||
|
||||
for year, month in self._get_season_months():
|
||||
# Get number of days in month
|
||||
if month == 12:
|
||||
next_month = date(year + 1, 1, 1)
|
||||
else:
|
||||
next_month = date(year, month + 1, 1)
|
||||
|
||||
days_in_month = (next_month - date(year, month, 1)).days
|
||||
|
||||
for day in range(1, days_in_month + 1):
|
||||
try:
|
||||
game_date = date(year, month, day)
|
||||
date_str = game_date.strftime("%Y%m%d")
|
||||
url = self._get_source_url("espn", date=date_str)
|
||||
|
||||
data = self.session.get_json(url)
|
||||
games = self._parse_espn_response(data, url)
|
||||
all_games.extend(games)
|
||||
|
||||
except Exception as e:
|
||||
self._logger.debug(f"ESPN error for {year}-{month}-{day}: {e}")
|
||||
continue
|
||||
|
||||
return all_games
|
||||
|
||||
def _parse_espn_response(
|
||||
self,
|
||||
data: dict,
|
||||
source_url: str,
|
||||
) -> list[RawGameData]:
|
||||
"""Parse ESPN API response."""
|
||||
games: list[RawGameData] = []
|
||||
|
||||
events = data.get("events", [])
|
||||
|
||||
for event in events:
|
||||
try:
|
||||
game = self._parse_espn_event(event, source_url)
|
||||
if game:
|
||||
games.append(game)
|
||||
except Exception as e:
|
||||
self._logger.debug(f"Failed to parse ESPN event: {e}")
|
||||
continue
|
||||
|
||||
return games
|
||||
|
||||
def _parse_espn_event(
|
||||
self,
|
||||
event: dict,
|
||||
source_url: str,
|
||||
) -> Optional[RawGameData]:
|
||||
"""Parse a single ESPN event."""
|
||||
# Get date
|
||||
date_str = event.get("date", "")
|
||||
if not date_str:
|
||||
return None
|
||||
|
||||
try:
|
||||
game_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
|
||||
except ValueError:
|
||||
return None
|
||||
|
||||
# Get competitions
|
||||
competitions = event.get("competitions", [])
|
||||
if not competitions:
|
||||
return None
|
||||
|
||||
competition = competitions[0]
|
||||
|
||||
# Check for neutral site (international games like Global Series)
|
||||
if competition.get("neutralSite"):
|
||||
venue = competition.get("venue", {})
|
||||
venue_city = venue.get("address", {}).get("city", "")
|
||||
if venue_city in INTERNATIONAL_LOCATIONS:
|
||||
return None
|
||||
|
||||
# Get teams
|
||||
competitors = competition.get("competitors", [])
|
||||
if len(competitors) != 2:
|
||||
return None
|
||||
|
||||
home_team = None
|
||||
away_team = None
|
||||
home_score = None
|
||||
away_score = None
|
||||
|
||||
for competitor in competitors:
|
||||
team_info = competitor.get("team", {})
|
||||
team_name = team_info.get("displayName", "")
|
||||
is_home = competitor.get("homeAway") == "home"
|
||||
score = competitor.get("score")
|
||||
|
||||
if score:
|
||||
try:
|
||||
score = int(score)
|
||||
except (ValueError, TypeError):
|
||||
score = None
|
||||
|
||||
if is_home:
|
||||
home_team = team_name
|
||||
home_score = score
|
||||
else:
|
||||
away_team = team_name
|
||||
away_score = score
|
||||
|
||||
if not home_team or not away_team:
|
||||
return None
|
||||
|
||||
# Get venue
|
||||
venue = competition.get("venue", {})
|
||||
stadium = venue.get("fullName")
|
||||
|
||||
# Get status
|
||||
status_info = competition.get("status", {})
|
||||
status_type = status_info.get("type", {})
|
||||
status_name = status_type.get("name", "").lower()
|
||||
|
||||
if status_name == "status_final":
|
||||
status = "final"
|
||||
elif status_name == "status_postponed":
|
||||
status = "postponed"
|
||||
elif status_name == "status_canceled":
|
||||
status = "cancelled"
|
||||
else:
|
||||
status = "scheduled"
|
||||
|
||||
return RawGameData(
|
||||
game_date=game_date,
|
||||
home_team_raw=home_team,
|
||||
away_team_raw=away_team,
|
||||
stadium_raw=stadium,
|
||||
home_score=home_score,
|
||||
away_score=away_score,
|
||||
status=status,
|
||||
source_url=source_url,
|
||||
)
|
||||
|
||||
def _normalize_games(
|
||||
self,
|
||||
raw_games: list[RawGameData],
|
||||
) -> tuple[list[Game], list[ManualReviewItem]]:
|
||||
"""Normalize raw games to Game objects with canonical IDs."""
|
||||
games: list[Game] = []
|
||||
review_items: list[ManualReviewItem] = []
|
||||
|
||||
for raw in raw_games:
|
||||
game, item_reviews = self._normalize_single_game(raw)
|
||||
|
||||
if game:
|
||||
games.append(game)
|
||||
log_game(
|
||||
self.sport,
|
||||
game.id,
|
||||
game.home_team_id,
|
||||
game.away_team_id,
|
||||
game.game_date.strftime("%Y-%m-%d"),
|
||||
game.status,
|
||||
)
|
||||
|
||||
review_items.extend(item_reviews)
|
||||
|
||||
return games, review_items
|
||||
|
||||
def _normalize_single_game(
|
||||
self,
|
||||
raw: RawGameData,
|
||||
) -> tuple[Optional[Game], list[ManualReviewItem]]:
|
||||
"""Normalize a single raw game."""
|
||||
review_items: list[ManualReviewItem] = []
|
||||
|
||||
# Resolve home team
|
||||
home_result = self._team_resolver.resolve(
|
||||
raw.home_team_raw,
|
||||
check_date=raw.game_date.date(),
|
||||
source_url=raw.source_url,
|
||||
)
|
||||
|
||||
if home_result.review_item:
|
||||
review_items.append(home_result.review_item)
|
||||
|
||||
if not home_result.canonical_id:
|
||||
log_warning(f"Could not resolve home team: {raw.home_team_raw}")
|
||||
return None, review_items
|
||||
|
||||
# Resolve away team
|
||||
away_result = self._team_resolver.resolve(
|
||||
raw.away_team_raw,
|
||||
check_date=raw.game_date.date(),
|
||||
source_url=raw.source_url,
|
||||
)
|
||||
|
||||
if away_result.review_item:
|
||||
review_items.append(away_result.review_item)
|
||||
|
||||
if not away_result.canonical_id:
|
||||
log_warning(f"Could not resolve away team: {raw.away_team_raw}")
|
||||
return None, review_items
|
||||
|
||||
# Resolve stadium
|
||||
stadium_id = None
|
||||
|
||||
if raw.stadium_raw:
|
||||
stadium_result = self._stadium_resolver.resolve(
|
||||
raw.stadium_raw,
|
||||
check_date=raw.game_date.date(),
|
||||
source_url=raw.source_url,
|
||||
)
|
||||
|
||||
if stadium_result.review_item:
|
||||
review_items.append(stadium_result.review_item)
|
||||
|
||||
stadium_id = stadium_result.canonical_id
|
||||
|
||||
# Fallback: Use home team's default stadium if no venue provided
|
||||
# This is common for Hockey-Reference which doesn't have venue data
|
||||
if not stadium_id:
|
||||
home_team_data = TEAM_MAPPINGS.get("nhl", {})
|
||||
home_abbrev = self._get_abbreviation(home_result.canonical_id)
|
||||
for abbrev, (team_id, _, _, default_stadium) in home_team_data.items():
|
||||
if team_id == home_result.canonical_id:
|
||||
stadium_id = default_stadium
|
||||
break
|
||||
|
||||
# Get abbreviations for game ID
|
||||
home_abbrev = self._get_abbreviation(home_result.canonical_id)
|
||||
away_abbrev = self._get_abbreviation(away_result.canonical_id)
|
||||
|
||||
# Generate canonical game ID
|
||||
game_id = generate_game_id(
|
||||
sport=self.sport,
|
||||
season=self.season,
|
||||
away_abbrev=away_abbrev,
|
||||
home_abbrev=home_abbrev,
|
||||
game_date=raw.game_date,
|
||||
game_number=None, # NHL doesn't have doubleheaders
|
||||
)
|
||||
|
||||
game = Game(
|
||||
id=game_id,
|
||||
sport=self.sport,
|
||||
season=self.season,
|
||||
home_team_id=home_result.canonical_id,
|
||||
away_team_id=away_result.canonical_id,
|
||||
stadium_id=stadium_id or "",
|
||||
game_date=raw.game_date,
|
||||
game_number=None,
|
||||
home_score=raw.home_score,
|
||||
away_score=raw.away_score,
|
||||
status=raw.status,
|
||||
source_url=raw.source_url,
|
||||
raw_home_team=raw.home_team_raw,
|
||||
raw_away_team=raw.away_team_raw,
|
||||
raw_stadium=raw.stadium_raw,
|
||||
)
|
||||
|
||||
return game, review_items
|
||||
|
||||
def _get_abbreviation(self, team_id: str) -> str:
|
||||
"""Extract abbreviation from team ID."""
|
||||
parts = team_id.split("_")
|
||||
return parts[-1] if parts else ""
|
||||
|
||||
def scrape_teams(self) -> list[Team]:
|
||||
"""Get all NHL teams from hardcoded mappings."""
|
||||
teams: list[Team] = []
|
||||
seen: set[str] = set()
|
||||
|
||||
# NHL conference/division structure
|
||||
divisions = {
|
||||
"Atlantic": ("Eastern", ["BOS", "BUF", "DET", "FLA", "MTL", "OTT", "TB", "TOR"]),
|
||||
"Metropolitan": ("Eastern", ["CAR", "CBJ", "NJ", "NYI", "NYR", "PHI", "PIT", "WAS"]),
|
||||
"Central": ("Western", ["ARI", "CHI", "COL", "DAL", "MIN", "NSH", "STL", "WPG"]),
|
||||
"Pacific": ("Western", ["ANA", "CGY", "EDM", "LA", "SJ", "SEA", "VAN", "VGK"]),
|
||||
}
|
||||
|
||||
# Build reverse lookup
|
||||
team_divisions: dict[str, tuple[str, str]] = {}
|
||||
for div, (conf, abbrevs) in divisions.items():
|
||||
for abbrev in abbrevs:
|
||||
team_divisions[abbrev] = (conf, div)
|
||||
|
||||
for abbrev, (team_id, full_name, city, stadium_id) in TEAM_MAPPINGS.get("nhl", {}).items():
|
||||
if team_id in seen:
|
||||
continue
|
||||
seen.add(team_id)
|
||||
|
||||
# Parse team name
|
||||
parts = full_name.split()
|
||||
team_name = parts[-1] if parts else full_name
|
||||
# Handle multi-word names
|
||||
if team_name in ["Wings", "Jackets", "Knights", "Leafs"]:
|
||||
team_name = " ".join(parts[-2:])
|
||||
|
||||
# Get conference and division
|
||||
conf, div = team_divisions.get(abbrev, (None, None))
|
||||
|
||||
team = Team(
|
||||
id=team_id,
|
||||
sport="nhl",
|
||||
city=city,
|
||||
name=team_name,
|
||||
full_name=full_name,
|
||||
abbreviation=abbrev,
|
||||
conference=conf,
|
||||
division=div,
|
||||
stadium_id=stadium_id,
|
||||
)
|
||||
teams.append(team)
|
||||
|
||||
return teams
|
||||
|
||||
def scrape_stadiums(self) -> list[Stadium]:
|
||||
"""Get all NHL stadiums from hardcoded mappings."""
|
||||
stadiums: list[Stadium] = []
|
||||
|
||||
nhl_stadiums = STADIUM_MAPPINGS.get("nhl", {})
|
||||
for stadium_id, info in nhl_stadiums.items():
|
||||
stadium = Stadium(
|
||||
id=stadium_id,
|
||||
sport="nhl",
|
||||
name=info.name,
|
||||
city=info.city,
|
||||
state=info.state,
|
||||
country=info.country,
|
||||
latitude=info.latitude,
|
||||
longitude=info.longitude,
|
||||
surface="ice",
|
||||
roof_type="dome",
|
||||
)
|
||||
stadiums.append(stadium)
|
||||
|
||||
return stadiums
|
||||
|
||||
|
||||
def create_nhl_scraper(season: int) -> NHLScraper:
|
||||
"""Factory function to create an NHL scraper."""
|
||||
return NHLScraper(season=season)
|
||||
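The NHLScraper docstring describes a multi-source fallback: try each source in priority order and move on when one fails or returns nothing. A minimal standalone sketch of that pattern, with stand-in fetcher callables rather than the real scraper methods:

```python
from typing import Callable


def scrape_with_fallback(sources: dict[str, Callable[[], list]]) -> list:
    """Try each source in priority order; return the first non-empty result.

    Source names mirror NHLScraper._get_sources(); the fetchers are
    illustrative stand-ins, not the package's real scrape methods.
    """
    for name in ["hockey_reference", "nhl_api", "espn"]:
        try:
            games = sources[name]()
            if games:
                return games
        except Exception:
            continue  # fall through to the next source
    return []


def failing() -> list:
    raise ConnectionError("site down")


result = scrape_with_fallback({
    "hockey_reference": failing,                           # errors out
    "nhl_api": lambda: [{"home": "TOR", "away": "MTL"}],   # first success wins
    "espn": lambda: [],                                    # never reached
})
print(result)  # [{'home': 'TOR', 'away': 'MTL'}]
```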
374
sportstime_parser/scrapers/nwsl.py
Normal file
@@ -0,0 +1,374 @@
"""NWSL scraper implementation with multi-source fallback."""

from datetime import datetime, date, timedelta
from typing import Optional

from .base import BaseScraper, RawGameData, ScrapeResult
from ..models.game import Game
from ..models.team import Team
from ..models.stadium import Stadium
from ..models.aliases import ManualReviewItem
from ..normalizers.canonical_id import generate_game_id
from ..normalizers.team_resolver import (
    TeamResolver,
    TEAM_MAPPINGS,
    get_team_resolver,
)
from ..normalizers.stadium_resolver import (
    StadiumResolver,
    STADIUM_MAPPINGS,
    get_stadium_resolver,
)
from ..utils.logging import get_logger, log_game, log_warning


class NWSLScraper(BaseScraper):
    """NWSL schedule scraper with multi-source fallback.

    Sources (in priority order):
    1. ESPN API - Most reliable for NWSL
    2. NWSL official (via ESPN) - Backup option
    """

    def __init__(self, season: int, **kwargs):
        """Initialize NWSL scraper.

        Args:
            season: Season year (e.g., 2026 for the 2026 season)
        """
        super().__init__("nwsl", season, **kwargs)
        self._team_resolver = get_team_resolver("nwsl")
        self._stadium_resolver = get_stadium_resolver("nwsl")

    def _get_sources(self) -> list[str]:
        """Return source list in priority order."""
        return ["espn"]

    def _get_source_url(self, source: str, **kwargs) -> str:
        """Build URL for a source."""
        if source == "espn":
            date_str = kwargs.get("date", "")
            return f"https://site.api.espn.com/apis/site/v2/sports/soccer/usa.nwsl/scoreboard?dates={date_str}"

        raise ValueError(f"Unknown source: {source}")

    def _get_season_months(self) -> list[tuple[int, int]]:
        """Get the months to scrape for the NWSL season.

        The NWSL season runs March through November.
        """
        months = []

        # NWSL regular season + playoffs
        for month in range(3, 12):  # March-November
            months.append((self.season, month))

        return months

    def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
        """Scrape games from a specific source."""
        if source == "espn":
            return self._scrape_espn()
        else:
            raise ValueError(f"Unknown source: {source}")

    def _scrape_espn(self) -> list[RawGameData]:
        """Scrape games from ESPN API using a date-range query."""
        # Build date range for the entire season (March-November)
        season_months = self._get_season_months()
        start_year, start_month = season_months[0]
        end_year, end_month = season_months[-1]

        # Get last day of the end month
        if end_month == 12:
            end_date = date(end_year + 1, 1, 1) - timedelta(days=1)
        else:
            end_date = date(end_year, end_month + 1, 1) - timedelta(days=1)

        start_date = date(start_year, start_month, 1)
        date_range = f"{start_date.strftime('%Y%m%d')}-{end_date.strftime('%Y%m%d')}"

        url = f"https://site.api.espn.com/apis/site/v2/sports/soccer/usa.nwsl/scoreboard?limit=1000&dates={date_range}"
        self._logger.info(f"Fetching NWSL schedule: {date_range}")

        try:
            data = self.session.get_json(url)
            return self._parse_espn_response(data, url)
        except Exception as e:
            self._logger.error(f"ESPN error: {e}")
            return []

    def _parse_espn_response(
        self,
        data: dict,
        source_url: str,
    ) -> list[RawGameData]:
        """Parse ESPN API response."""
        games: list[RawGameData] = []

        events = data.get("events", [])

        for event in events:
            try:
                game = self._parse_espn_event(event, source_url)
                if game:
                    games.append(game)
            except Exception as e:
                self._logger.debug(f"Failed to parse ESPN event: {e}")
                continue

        return games

    def _parse_espn_event(
        self,
        event: dict,
        source_url: str,
    ) -> Optional[RawGameData]:
        """Parse a single ESPN event."""
        # Get date
        date_str = event.get("date", "")
        if not date_str:
            return None

        try:
            game_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
        except ValueError:
            return None

        # Get competitions
        competitions = event.get("competitions", [])
        if not competitions:
            return None

        competition = competitions[0]

        # Get teams
        competitors = competition.get("competitors", [])
        if len(competitors) != 2:
            return None

        home_team = None
        away_team = None
        home_score = None
        away_score = None

        for competitor in competitors:
            team_info = competitor.get("team", {})
            team_name = team_info.get("displayName", "")
            is_home = competitor.get("homeAway") == "home"
            score = competitor.get("score")

            if score:
                try:
                    score = int(score)
                except (ValueError, TypeError):
                    score = None

            if is_home:
                home_team = team_name
                home_score = score
            else:
                away_team = team_name
                away_score = score

        if not home_team or not away_team:
            return None

        # Get venue
        venue = competition.get("venue", {})
        stadium = venue.get("fullName")

        # Get status
        status_info = competition.get("status", {})
        status_type = status_info.get("type", {})
        status_name = status_type.get("name", "").lower()

        if status_name == "status_final":
            status = "final"
        elif status_name == "status_postponed":
            status = "postponed"
        elif status_name == "status_canceled":
            status = "cancelled"
        else:
            status = "scheduled"

        return RawGameData(
            game_date=game_date,
            home_team_raw=home_team,
            away_team_raw=away_team,
            stadium_raw=stadium,
            home_score=home_score,
            away_score=away_score,
            status=status,
            source_url=source_url,
        )

    def _normalize_games(
        self,
        raw_games: list[RawGameData],
    ) -> tuple[list[Game], list[ManualReviewItem]]:
        """Normalize raw games to Game objects with canonical IDs."""
        games: list[Game] = []
        review_items: list[ManualReviewItem] = []

        for raw in raw_games:
            game, item_reviews = self._normalize_single_game(raw)

            if game:
                games.append(game)
                log_game(
                    self.sport,
                    game.id,
                    game.home_team_id,
                    game.away_team_id,
                    game.game_date.strftime("%Y-%m-%d"),
                    game.status,
                )

            review_items.extend(item_reviews)

        return games, review_items

    def _normalize_single_game(
        self,
        raw: RawGameData,
    ) -> tuple[Optional[Game], list[ManualReviewItem]]:
        """Normalize a single raw game."""
        review_items: list[ManualReviewItem] = []

        # Resolve home team
        home_result = self._team_resolver.resolve(
            raw.home_team_raw,
            check_date=raw.game_date.date(),
            source_url=raw.source_url,
        )

        if home_result.review_item:
            review_items.append(home_result.review_item)

        if not home_result.canonical_id:
            log_warning(f"Could not resolve home team: {raw.home_team_raw}")
            return None, review_items

        # Resolve away team
        away_result = self._team_resolver.resolve(
            raw.away_team_raw,
            check_date=raw.game_date.date(),
            source_url=raw.source_url,
        )

        if away_result.review_item:
            review_items.append(away_result.review_item)

        if not away_result.canonical_id:
            log_warning(f"Could not resolve away team: {raw.away_team_raw}")
            return None, review_items

        # Resolve stadium
        stadium_id = None

        if raw.stadium_raw:
            stadium_result = self._stadium_resolver.resolve(
                raw.stadium_raw,
                check_date=raw.game_date.date(),
                source_url=raw.source_url,
            )

            if stadium_result.review_item:
                review_items.append(stadium_result.review_item)

            stadium_id = stadium_result.canonical_id

        # Get abbreviations for game ID
        home_abbrev = self._get_abbreviation(home_result.canonical_id)
        away_abbrev = self._get_abbreviation(away_result.canonical_id)

        # Generate canonical game ID
        game_id = generate_game_id(
            sport=self.sport,
            season=self.season,
            away_abbrev=away_abbrev,
            home_abbrev=home_abbrev,
            game_date=raw.game_date,
            game_number=None,
        )

        game = Game(
            id=game_id,
            sport=self.sport,
            season=self.season,
            home_team_id=home_result.canonical_id,
            away_team_id=away_result.canonical_id,
            stadium_id=stadium_id or "",
            game_date=raw.game_date,
            game_number=None,
            home_score=raw.home_score,
            away_score=raw.away_score,
            status=raw.status,
            source_url=raw.source_url,
            raw_home_team=raw.home_team_raw,
            raw_away_team=raw.away_team_raw,
            raw_stadium=raw.stadium_raw,
        )

        return game, review_items

    def _get_abbreviation(self, team_id: str) -> str:
        """Extract abbreviation from team ID."""
        parts = team_id.split("_")
        return parts[-1] if parts else ""

    def scrape_teams(self) -> list[Team]:
        """Get all NWSL teams from hardcoded mappings."""
        teams: list[Team] = []
        seen: set[str] = set()

        for abbrev, (team_id, full_name, city, stadium_id) in TEAM_MAPPINGS.get("nwsl", {}).items():
            if team_id in seen:
                continue
            seen.add(team_id)

            # NWSL team names are used as-is
            team_name = full_name

            team = Team(
                id=team_id,
                sport="nwsl",
                city=city,
                name=team_name,
                full_name=full_name,
                abbreviation=abbrev,
                conference=None,  # NWSL uses a single table
                division=None,
                stadium_id=stadium_id,
            )
            teams.append(team)

        return teams

    def scrape_stadiums(self) -> list[Stadium]:
        """Get all NWSL stadiums from hardcoded mappings."""
        stadiums: list[Stadium] = []

        nwsl_stadiums = STADIUM_MAPPINGS.get("nwsl", {})
        for stadium_id, info in nwsl_stadiums.items():
            stadium = Stadium(
                id=stadium_id,
                sport="nwsl",
                name=info.name,
                city=info.city,
                state=info.state,
                country=info.country,
                latitude=info.latitude,
                longitude=info.longitude,
                surface="grass",
                roof_type="open",
            )
            stadiums.append(stadium)

        return stadiums


def create_nwsl_scraper(season: int) -> NWSLScraper:
    """Factory function to create an NWSL scraper."""
    return NWSLScraper(season=season)
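The NWSL scraper's `_scrape_espn` builds a single `YYYYMMDD-YYYYMMDD` range covering the whole March-November season for ESPN's `dates` query parameter. That construction can be sketched standalone (the helper name is illustrative; only the date arithmetic mirrors the method above):

```python
from datetime import date, timedelta


def season_date_range(season: int, start_month: int = 3, end_month: int = 11) -> str:
    """Format the first day of start_month through the last day of
    end_month as "YYYYMMDD-YYYYMMDD", as used in the ESPN scoreboard URL."""
    start = date(season, start_month, 1)
    # Last day of end_month: first day of the following month, minus one day
    if end_month == 12:
        end = date(season + 1, 1, 1) - timedelta(days=1)
    else:
        end = date(season, end_month + 1, 1) - timedelta(days=1)
    return f"{start.strftime('%Y%m%d')}-{end.strftime('%Y%m%d')}"


print(season_date_range(2026))  # 20260301-20261130
```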
375
sportstime_parser/scrapers/wnba.py
Normal file
@@ -0,0 +1,375 @@
"""WNBA scraper implementation with multi-source fallback."""

from datetime import datetime, date, timedelta
from typing import Optional

from .base import BaseScraper, RawGameData, ScrapeResult
from ..models.game import Game
from ..models.team import Team
from ..models.stadium import Stadium
from ..models.aliases import ManualReviewItem
from ..normalizers.canonical_id import generate_game_id
from ..normalizers.team_resolver import (
    TeamResolver,
    TEAM_MAPPINGS,
    get_team_resolver,
)
from ..normalizers.stadium_resolver import (
    StadiumResolver,
    STADIUM_MAPPINGS,
    get_stadium_resolver,
)
from ..utils.logging import get_logger, log_game, log_warning


class WNBAScraper(BaseScraper):
    """WNBA schedule scraper with multi-source fallback.

    Sources (in priority order):
    1. ESPN API - Most reliable for WNBA
    2. WNBA official (via ESPN) - Backup option
    """

    def __init__(self, season: int, **kwargs):
        """Initialize WNBA scraper.

        Args:
            season: Season year (e.g., 2026 for the 2026 season)
        """
        super().__init__("wnba", season, **kwargs)
        self._team_resolver = get_team_resolver("wnba")
        self._stadium_resolver = get_stadium_resolver("wnba")

    def _get_sources(self) -> list[str]:
        """Return source list in priority order."""
        return ["espn"]

    def _get_source_url(self, source: str, **kwargs) -> str:
        """Build URL for a source."""
        if source == "espn":
            date_str = kwargs.get("date", "")
            return f"https://site.api.espn.com/apis/site/v2/sports/basketball/wnba/scoreboard?dates={date_str}"

        raise ValueError(f"Unknown source: {source}")

    def _get_season_months(self) -> list[tuple[int, int]]:
        """Get the months to scrape for the WNBA season.

        The WNBA season runs May through September/October.
        """
        months = []

        # WNBA regular season + playoffs
        for month in range(5, 11):  # May-Oct
            months.append((self.season, month))

        return months

    def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
        """Scrape games from a specific source."""
        if source == "espn":
            return self._scrape_espn()
        else:
            raise ValueError(f"Unknown source: {source}")

    def _scrape_espn(self) -> list[RawGameData]:
        """Scrape games from the ESPN API using a date range query."""
        # Build date range for entire season (May-October)
        season_months = self._get_season_months()
        start_year, start_month = season_months[0]
        end_year, end_month = season_months[-1]

        # Get last day of end month
        if end_month == 12:
            end_date = date(end_year + 1, 1, 1) - timedelta(days=1)
        else:
            end_date = date(end_year, end_month + 1, 1) - timedelta(days=1)

        start_date = date(start_year, start_month, 1)
        date_range = f"{start_date.strftime('%Y%m%d')}-{end_date.strftime('%Y%m%d')}"

        url = f"https://site.api.espn.com/apis/site/v2/sports/basketball/wnba/scoreboard?limit=1000&dates={date_range}"
        self._logger.info(f"Fetching WNBA schedule: {date_range}")

        try:
            data = self.session.get_json(url)
            return self._parse_espn_response(data, url)
        except Exception as e:
            self._logger.error(f"ESPN error: {e}")
            return []

    def _parse_espn_response(
        self,
        data: dict,
        source_url: str,
    ) -> list[RawGameData]:
        """Parse ESPN API response."""
        games: list[RawGameData] = []

        events = data.get("events", [])

        for event in events:
            try:
                game = self._parse_espn_event(event, source_url)
                if game:
                    games.append(game)
            except Exception as e:
                self._logger.debug(f"Failed to parse ESPN event: {e}")
                continue

        return games

    def _parse_espn_event(
        self,
        event: dict,
        source_url: str,
    ) -> Optional[RawGameData]:
        """Parse a single ESPN event."""
        # Get date
        date_str = event.get("date", "")
        if not date_str:
            return None

        try:
            game_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
        except ValueError:
            return None

        # Get competitions
        competitions = event.get("competitions", [])
        if not competitions:
            return None

        competition = competitions[0]

        # Get teams
        competitors = competition.get("competitors", [])
        if len(competitors) != 2:
            return None

        home_team = None
        away_team = None
        home_score = None
        away_score = None

        for competitor in competitors:
            team_info = competitor.get("team", {})
            team_name = team_info.get("displayName", "")
            is_home = competitor.get("homeAway") == "home"
            score = competitor.get("score")

            if score:
                try:
                    score = int(score)
                except (ValueError, TypeError):
                    score = None

            if is_home:
                home_team = team_name
                home_score = score
            else:
                away_team = team_name
                away_score = score

        if not home_team or not away_team:
            return None

        # Get venue
        venue = competition.get("venue", {})
        stadium = venue.get("fullName")

        # Get status
        status_info = competition.get("status", {})
        status_type = status_info.get("type", {})
        status_name = status_type.get("name", "").lower()

        if status_name == "status_final":
            status = "final"
        elif status_name == "status_postponed":
            status = "postponed"
        elif status_name == "status_canceled":
            status = "cancelled"
        else:
            status = "scheduled"

        return RawGameData(
            game_date=game_date,
            home_team_raw=home_team,
            away_team_raw=away_team,
            stadium_raw=stadium,
            home_score=home_score,
            away_score=away_score,
            status=status,
            source_url=source_url,
        )

    def _normalize_games(
        self,
        raw_games: list[RawGameData],
    ) -> tuple[list[Game], list[ManualReviewItem]]:
        """Normalize raw games to Game objects with canonical IDs."""
        games: list[Game] = []
        review_items: list[ManualReviewItem] = []

        for raw in raw_games:
            game, item_reviews = self._normalize_single_game(raw)

            if game:
                games.append(game)
                log_game(
                    self.sport,
                    game.id,
                    game.home_team_id,
                    game.away_team_id,
                    game.game_date.strftime("%Y-%m-%d"),
                    game.status,
                )

            review_items.extend(item_reviews)

        return games, review_items

    def _normalize_single_game(
        self,
        raw: RawGameData,
    ) -> tuple[Optional[Game], list[ManualReviewItem]]:
        """Normalize a single raw game."""
        review_items: list[ManualReviewItem] = []

        # Resolve home team
        home_result = self._team_resolver.resolve(
            raw.home_team_raw,
            check_date=raw.game_date.date(),
            source_url=raw.source_url,
        )

        if home_result.review_item:
            review_items.append(home_result.review_item)

        if not home_result.canonical_id:
            log_warning(f"Could not resolve home team: {raw.home_team_raw}")
            return None, review_items

        # Resolve away team
        away_result = self._team_resolver.resolve(
            raw.away_team_raw,
            check_date=raw.game_date.date(),
            source_url=raw.source_url,
        )

        if away_result.review_item:
            review_items.append(away_result.review_item)

        if not away_result.canonical_id:
            log_warning(f"Could not resolve away team: {raw.away_team_raw}")
            return None, review_items

        # Resolve stadium
        stadium_id = None

        if raw.stadium_raw:
            stadium_result = self._stadium_resolver.resolve(
                raw.stadium_raw,
                check_date=raw.game_date.date(),
                source_url=raw.source_url,
            )

            if stadium_result.review_item:
                review_items.append(stadium_result.review_item)

            stadium_id = stadium_result.canonical_id

        # Get abbreviations for game ID
        home_abbrev = self._get_abbreviation(home_result.canonical_id)
        away_abbrev = self._get_abbreviation(away_result.canonical_id)

        # Generate canonical game ID
        game_id = generate_game_id(
            sport=self.sport,
            season=self.season,
            away_abbrev=away_abbrev,
            home_abbrev=home_abbrev,
            game_date=raw.game_date,
            game_number=None,
        )

        game = Game(
            id=game_id,
            sport=self.sport,
            season=self.season,
            home_team_id=home_result.canonical_id,
            away_team_id=away_result.canonical_id,
            stadium_id=stadium_id or "",
            game_date=raw.game_date,
            game_number=None,
            home_score=raw.home_score,
            away_score=raw.away_score,
            status=raw.status,
            source_url=raw.source_url,
        )

        return game, review_items

    def _get_abbreviation(self, team_id: str) -> str:
        """Extract abbreviation from team ID."""
        parts = team_id.split("_")
        return parts[-1] if parts else ""

    def scrape_teams(self) -> list[Team]:
        """Get all WNBA teams from hardcoded mappings."""
        teams: list[Team] = []
        seen: set[str] = set()

        for abbrev, (team_id, full_name, city, stadium_id) in TEAM_MAPPINGS.get("wnba", {}).items():
            if team_id in seen:
                continue
            seen.add(team_id)

            # Parse team name
            parts = full_name.split()
            team_name = parts[-1] if parts else full_name

            team = Team(
                id=team_id,
                sport="wnba",
                city=city,
                name=team_name,
                full_name=full_name,
                abbreviation=abbrev,
                conference=None,  # WNBA uses a single table now
                division=None,
                stadium_id=stadium_id,
            )
            teams.append(team)

        return teams

    def scrape_stadiums(self) -> list[Stadium]:
        """Get all WNBA stadiums from hardcoded mappings."""
        stadiums: list[Stadium] = []

        wnba_stadiums = STADIUM_MAPPINGS.get("wnba", {})
        for stadium_id, info in wnba_stadiums.items():
            stadium = Stadium(
                id=stadium_id,
                sport="wnba",
                name=info.name,
                city=info.city,
                state=info.state,
                country=info.country,
                latitude=info.latitude,
                longitude=info.longitude,
                surface="hardwood",
                roof_type="dome",
            )
            stadiums.append(stadium)

        return stadiums


def create_wnba_scraper(season: int) -> WNBAScraper:
    """Factory function to create a WNBA scraper."""
    return WNBAScraper(season=season)
1
sportstime_parser/tests/__init__.py
Normal file
@@ -0,0 +1 @@
"""Unit tests for sportstime_parser."""
48
sportstime_parser/tests/fixtures/__init__.py
vendored
Normal file
@@ -0,0 +1,48 @@
"""Test fixtures for sportstime-parser tests."""

import json
from pathlib import Path

FIXTURES_DIR = Path(__file__).parent

# NBA fixtures
NBA_FIXTURES_DIR = FIXTURES_DIR / "nba"
NBA_BR_OCTOBER_HTML = NBA_FIXTURES_DIR / "basketball_reference_october.html"
NBA_BR_EDGE_CASES_HTML = NBA_FIXTURES_DIR / "basketball_reference_edge_cases.html"
NBA_ESPN_SCOREBOARD_JSON = NBA_FIXTURES_DIR / "espn_scoreboard.json"

# MLB fixtures
MLB_FIXTURES_DIR = FIXTURES_DIR / "mlb"
MLB_ESPN_SCOREBOARD_JSON = MLB_FIXTURES_DIR / "espn_scoreboard.json"

# NFL fixtures
NFL_FIXTURES_DIR = FIXTURES_DIR / "nfl"
NFL_ESPN_SCOREBOARD_JSON = NFL_FIXTURES_DIR / "espn_scoreboard.json"

# NHL fixtures
NHL_FIXTURES_DIR = FIXTURES_DIR / "nhl"
NHL_ESPN_SCOREBOARD_JSON = NHL_FIXTURES_DIR / "espn_scoreboard.json"

# MLS fixtures
MLS_FIXTURES_DIR = FIXTURES_DIR / "mls"
MLS_ESPN_SCOREBOARD_JSON = MLS_FIXTURES_DIR / "espn_scoreboard.json"

# WNBA fixtures
WNBA_FIXTURES_DIR = FIXTURES_DIR / "wnba"
WNBA_ESPN_SCOREBOARD_JSON = WNBA_FIXTURES_DIR / "espn_scoreboard.json"

# NWSL fixtures
NWSL_FIXTURES_DIR = FIXTURES_DIR / "nwsl"
NWSL_ESPN_SCOREBOARD_JSON = NWSL_FIXTURES_DIR / "espn_scoreboard.json"


def load_fixture(path: Path) -> str:
    """Load a fixture file as text."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


def load_json_fixture(path: Path) -> dict:
    """Load a JSON fixture file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
245
sportstime_parser/tests/fixtures/mlb/espn_scoreboard.json
vendored
Normal file
@@ -0,0 +1,245 @@
{
  "leagues": [
    {
      "id": "10",
      "uid": "s:1~l:10",
      "name": "Major League Baseball",
      "abbreviation": "MLB"
    }
  ],
  "season": {
    "type": 2,
    "year": 2026
  },
  "day": {
    "date": "2026-04-15T00:00:00Z"
  },
  "events": [
    {
      "id": "401584801",
      "uid": "s:1~l:10~e:401584801",
      "date": "2026-04-15T23:05:00Z",
      "name": "New York Yankees at Boston Red Sox",
      "shortName": "NYY @ BOS",
      "competitions": [
        {
          "id": "401584801",
          "uid": "s:1~l:10~e:401584801~c:401584801",
          "date": "2026-04-15T23:05:00Z",
          "attendance": 37435,
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "3",
            "fullName": "Fenway Park",
            "address": {
              "city": "Boston",
              "state": "MA"
            },
            "capacity": 37755,
            "indoor": false
          },
          "competitors": [
            {
              "id": "2",
              "uid": "s:1~l:10~t:2",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "2",
                "uid": "s:1~l:10~t:2",
                "location": "Boston",
                "name": "Red Sox",
                "abbreviation": "BOS",
                "displayName": "Boston Red Sox"
              },
              "score": "5",
              "winner": true
            },
            {
              "id": "10",
              "uid": "s:1~l:10~t:10",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "10",
                "uid": "s:1~l:10~t:10",
                "location": "New York",
                "name": "Yankees",
                "abbreviation": "NYY",
                "displayName": "New York Yankees"
              },
              "score": "3",
              "winner": false
            }
          ],
          "status": {
            "clock": 0,
            "displayClock": "0:00",
            "period": 9,
            "type": {
              "id": "3",
              "name": "STATUS_FINAL",
              "state": "post",
              "completed": true
            }
          }
        }
      ]
    },
    {
      "id": "401584802",
      "uid": "s:1~l:10~e:401584802",
      "date": "2026-04-15T20:10:00Z",
      "name": "Chicago Cubs at St. Louis Cardinals",
      "shortName": "CHC @ STL",
      "competitions": [
        {
          "id": "401584802",
          "uid": "s:1~l:10~e:401584802~c:401584802",
          "date": "2026-04-15T20:10:00Z",
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "87",
            "fullName": "Busch Stadium",
            "address": {
              "city": "St. Louis",
              "state": "MO"
            },
            "capacity": 45538,
            "indoor": false
          },
          "competitors": [
            {
              "id": "24",
              "uid": "s:1~l:10~t:24",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "24",
                "uid": "s:1~l:10~t:24",
                "location": "St. Louis",
                "name": "Cardinals",
                "abbreviation": "STL",
                "displayName": "St. Louis Cardinals"
              },
              "score": "7",
              "winner": true
            },
            {
              "id": "16",
              "uid": "s:1~l:10~t:16",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "16",
                "uid": "s:1~l:10~t:16",
                "location": "Chicago",
                "name": "Cubs",
                "abbreviation": "CHC",
                "displayName": "Chicago Cubs"
              },
              "score": "4",
              "winner": false
            }
          ],
          "status": {
            "clock": 0,
            "displayClock": "0:00",
            "period": 9,
            "type": {
              "id": "3",
              "name": "STATUS_FINAL",
              "state": "post",
              "completed": true
            }
          }
        }
      ]
    },
    {
      "id": "401584803",
      "uid": "s:1~l:10~e:401584803",
      "date": "2026-04-16T00:10:00Z",
      "name": "Los Angeles Dodgers at San Francisco Giants",
      "shortName": "LAD @ SF",
      "competitions": [
        {
          "id": "401584803",
          "uid": "s:1~l:10~e:401584803~c:401584803",
          "date": "2026-04-16T00:10:00Z",
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "116",
            "fullName": "Oracle Park",
            "address": {
              "city": "San Francisco",
              "state": "CA"
            },
            "capacity": 41915,
            "indoor": false
          },
          "competitors": [
            {
              "id": "26",
              "uid": "s:1~l:10~t:26",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "26",
                "uid": "s:1~l:10~t:26",
                "location": "San Francisco",
                "name": "Giants",
                "abbreviation": "SF",
                "displayName": "San Francisco Giants"
              },
              "score": null,
              "winner": null
            },
            {
              "id": "19",
              "uid": "s:1~l:10~t:19",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "19",
                "uid": "s:1~l:10~t:19",
                "location": "Los Angeles",
                "name": "Dodgers",
                "abbreviation": "LAD",
                "displayName": "Los Angeles Dodgers"
              },
              "score": null,
              "winner": null
            }
          ],
          "status": {
            "clock": 0,
            "displayClock": "0:00",
            "period": 0,
            "type": {
              "id": "1",
              "name": "STATUS_SCHEDULED",
              "state": "pre",
              "completed": false
            }
          }
        }
      ]
    }
  ]
}
245
sportstime_parser/tests/fixtures/mls/espn_scoreboard.json
vendored
Normal file
@@ -0,0 +1,245 @@
{
  "leagues": [
    {
      "id": "19",
      "uid": "s:600~l:19",
      "name": "Major League Soccer",
      "abbreviation": "MLS"
    }
  ],
  "season": {
    "type": 2,
    "year": 2026
  },
  "day": {
    "date": "2026-03-15T00:00:00Z"
  },
  "events": [
    {
      "id": "401672001",
      "uid": "s:600~l:19~e:401672001",
      "date": "2026-03-15T22:00:00Z",
      "name": "LA Galaxy at LAFC",
      "shortName": "LA @ LAFC",
      "competitions": [
        {
          "id": "401672001",
          "uid": "s:600~l:19~e:401672001~c:401672001",
          "date": "2026-03-15T22:00:00Z",
          "attendance": 22000,
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "8909",
            "fullName": "BMO Stadium",
            "address": {
              "city": "Los Angeles",
              "state": "CA"
            },
            "capacity": 22000,
            "indoor": false
          },
          "competitors": [
            {
              "id": "21295",
              "uid": "s:600~l:19~t:21295",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "21295",
                "uid": "s:600~l:19~t:21295",
                "location": "Los Angeles",
                "name": "FC",
                "abbreviation": "LAFC",
                "displayName": "Los Angeles FC"
              },
              "score": "3",
              "winner": true
            },
            {
              "id": "3610",
              "uid": "s:600~l:19~t:3610",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "3610",
                "uid": "s:600~l:19~t:3610",
                "location": "Los Angeles",
                "name": "Galaxy",
                "abbreviation": "LA",
                "displayName": "LA Galaxy"
              },
              "score": "2",
              "winner": false
            }
          ],
          "status": {
            "clock": 90,
            "displayClock": "90'",
            "period": 2,
            "type": {
              "id": "3",
              "name": "STATUS_FINAL",
              "state": "post",
              "completed": true
            }
          }
        }
      ]
    },
    {
      "id": "401672002",
      "uid": "s:600~l:19~e:401672002",
      "date": "2026-03-15T23:00:00Z",
      "name": "Seattle Sounders at Portland Timbers",
      "shortName": "SEA @ POR",
      "competitions": [
        {
          "id": "401672002",
          "uid": "s:600~l:19~e:401672002~c:401672002",
          "date": "2026-03-15T23:00:00Z",
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "8070",
            "fullName": "Providence Park",
            "address": {
              "city": "Portland",
              "state": "OR"
            },
            "capacity": 25218,
            "indoor": false
          },
          "competitors": [
            {
              "id": "5282",
              "uid": "s:600~l:19~t:5282",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "5282",
                "uid": "s:600~l:19~t:5282",
                "location": "Portland",
                "name": "Timbers",
                "abbreviation": "POR",
                "displayName": "Portland Timbers"
              },
              "score": "2",
              "winner": false
            },
            {
              "id": "4687",
              "uid": "s:600~l:19~t:4687",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "4687",
                "uid": "s:600~l:19~t:4687",
                "location": "Seattle",
                "name": "Sounders FC",
                "abbreviation": "SEA",
                "displayName": "Seattle Sounders FC"
              },
              "score": "2",
              "winner": false
            }
          ],
          "status": {
            "clock": 90,
            "displayClock": "90'",
            "period": 2,
            "type": {
              "id": "3",
              "name": "STATUS_FINAL",
              "state": "post",
              "completed": true
            }
          }
        }
      ]
    },
    {
      "id": "401672003",
      "uid": "s:600~l:19~e:401672003",
      "date": "2026-03-16T00:00:00Z",
      "name": "New York Red Bulls at Atlanta United",
      "shortName": "NY @ ATL",
      "competitions": [
        {
          "id": "401672003",
          "uid": "s:600~l:19~e:401672003~c:401672003",
          "date": "2026-03-16T00:00:00Z",
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "8904",
            "fullName": "Mercedes-Benz Stadium",
            "address": {
              "city": "Atlanta",
              "state": "GA"
            },
            "capacity": 42500,
            "indoor": true
          },
          "competitors": [
            {
              "id": "18626",
              "uid": "s:600~l:19~t:18626",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "18626",
                "uid": "s:600~l:19~t:18626",
                "location": "Atlanta",
                "name": "United FC",
                "abbreviation": "ATL",
                "displayName": "Atlanta United FC"
              },
              "score": null,
              "winner": null
            },
            {
              "id": "399",
              "uid": "s:600~l:19~t:399",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "399",
                "uid": "s:600~l:19~t:399",
                "location": "New York",
                "name": "Red Bulls",
                "abbreviation": "NY",
                "displayName": "New York Red Bulls"
              },
              "score": null,
              "winner": null
            }
          ],
          "status": {
            "clock": 0,
            "displayClock": "0'",
            "period": 0,
            "type": {
              "id": "1",
              "name": "STATUS_SCHEDULED",
              "state": "pre",
              "completed": false
            }
          }
        }
      ]
    }
  ]
}
79
sportstime_parser/tests/fixtures/nba/basketball_reference_edge_cases.html
vendored
Normal file
@@ -0,0 +1,79 @@
<!DOCTYPE html>
<html>
<head>
  <title>2025-26 NBA Schedule - Edge Cases | Basketball-Reference.com</title>
</head>
<body>
  <table id="schedule" class="stats_table">
    <thead>
      <tr>
        <th data-stat="date_game">Date</th>
        <th data-stat="game_start_time">Start (ET)</th>
        <th data-stat="visitor_team_name">Visitor/Neutral</th>
        <th data-stat="visitor_pts">PTS</th>
        <th data-stat="home_team_name">Home/Neutral</th>
        <th data-stat="home_pts">PTS</th>
        <th data-stat="arena_name">Arena</th>
        <th data-stat="game_remarks">Notes</th>
      </tr>
    </thead>
    <tbody>
      <!-- Postponed game -->
      <tr>
        <th data-stat="date_game">Sat, Jan 11, 2026</th>
        <td data-stat="game_start_time">7:30p</td>
        <td data-stat="visitor_team_name">Los Angeles Lakers</td>
        <td data-stat="visitor_pts"></td>
        <td data-stat="home_team_name">Phoenix Suns</td>
        <td data-stat="home_pts"></td>
        <td data-stat="arena_name">Footprint Center</td>
        <td data-stat="game_remarks">Postponed - Weather</td>
      </tr>
      <!-- Neutral site game (Mexico City) -->
      <tr>
        <th data-stat="date_game">Sat, Nov 8, 2025</th>
        <td data-stat="game_start_time">7:00p</td>
        <td data-stat="visitor_team_name">Miami Heat</td>
        <td data-stat="visitor_pts">105</td>
        <td data-stat="home_team_name">Washington Wizards</td>
        <td data-stat="home_pts">99</td>
        <td data-stat="arena_name">Arena CDMX</td>
        <td data-stat="game_remarks">NBA Mexico City Games</td>
      </tr>
      <!-- Cancelled game -->
      <tr>
        <th data-stat="date_game">Wed, Dec 3, 2025</th>
        <td data-stat="game_start_time">8:00p</td>
        <td data-stat="visitor_team_name">Portland Trail Blazers</td>
        <td data-stat="visitor_pts"></td>
        <td data-stat="home_team_name">Sacramento Kings</td>
        <td data-stat="home_pts"></td>
        <td data-stat="arena_name">Golden 1 Center</td>
        <td data-stat="game_remarks">Cancelled</td>
      </tr>
      <!-- Regular completed game with high scores -->
      <tr>
        <th data-stat="date_game">Sun, Mar 15, 2026</th>
        <td data-stat="game_start_time">3:30p</td>
        <td data-stat="visitor_team_name">Indiana Pacers</td>
        <td data-stat="visitor_pts">147</td>
        <td data-stat="home_team_name">Atlanta Hawks</td>
        <td data-stat="home_pts">150</td>
        <td data-stat="arena_name">State Farm Arena</td>
        <td data-stat="game_remarks">OT</td>
      </tr>
      <!-- Game at arena with special characters -->
      <tr>
        <th data-stat="date_game">Mon, Feb 2, 2026</th>
        <td data-stat="game_start_time">10:30p</td>
        <td data-stat="visitor_team_name">Golden State Warriors</td>
        <td data-stat="visitor_pts">118</td>
        <td data-stat="home_team_name">Los Angeles Clippers</td>
        <td data-stat="home_pts">115</td>
        <td data-stat="arena_name">Intuit Dome</td>
        <td data-stat="game_remarks"></td>
      </tr>
    </tbody>
  </table>
</body>
</html>
94
sportstime_parser/tests/fixtures/nba/basketball_reference_october.html
vendored
Normal file
@@ -0,0 +1,94 @@
<!DOCTYPE html>
<html>
<head>
  <title>2025-26 NBA Schedule - October | Basketball-Reference.com</title>
</head>
<body>
  <table id="schedule" class="stats_table">
    <thead>
      <tr>
        <th data-stat="date_game">Date</th>
        <th data-stat="game_start_time">Start (ET)</th>
        <th data-stat="visitor_team_name">Visitor/Neutral</th>
        <th data-stat="visitor_pts">PTS</th>
        <th data-stat="home_team_name">Home/Neutral</th>
        <th data-stat="home_pts">PTS</th>
        <th data-stat="arena_name">Arena</th>
        <th data-stat="game_remarks">Notes</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <th data-stat="date_game">Tue, Oct 22, 2025</th>
        <td data-stat="game_start_time">7:30p</td>
        <td data-stat="visitor_team_name">Boston Celtics</td>
        <td data-stat="visitor_pts">112</td>
        <td data-stat="home_team_name">Cleveland Cavaliers</td>
        <td data-stat="home_pts">108</td>
        <td data-stat="arena_name">Rocket Mortgage FieldHouse</td>
        <td data-stat="game_remarks"></td>
      </tr>
      <tr>
        <th data-stat="date_game">Tue, Oct 22, 2025</th>
        <td data-stat="game_start_time">10:00p</td>
        <td data-stat="visitor_team_name">Denver Nuggets</td>
        <td data-stat="visitor_pts">119</td>
        <td data-stat="home_team_name">Los Angeles Lakers</td>
        <td data-stat="home_pts">127</td>
        <td data-stat="arena_name">Crypto.com Arena</td>
        <td data-stat="game_remarks"></td>
      </tr>
      <tr>
        <th data-stat="date_game">Wed, Oct 23, 2025</th>
        <td data-stat="game_start_time">7:00p</td>
        <td data-stat="visitor_team_name">Houston Rockets</td>
        <td data-stat="visitor_pts"></td>
        <td data-stat="home_team_name">Oklahoma City Thunder</td>
        <td data-stat="home_pts"></td>
        <td data-stat="arena_name">Paycom Center</td>
        <td data-stat="game_remarks"></td>
      </tr>
      <tr>
        <th data-stat="date_game">Wed, Oct 23, 2025</th>
        <td data-stat="game_start_time">7:30p</td>
        <td data-stat="visitor_team_name">New York Knicks</td>
        <td data-stat="visitor_pts"></td>
        <td data-stat="home_team_name">Brooklyn Nets</td>
        <td data-stat="home_pts"></td>
        <td data-stat="arena_name">Barclays Center</td>
        <td data-stat="game_remarks"></td>
      </tr>
      <tr>
        <th data-stat="date_game">Thu, Oct 24, 2025</th>
        <td data-stat="game_start_time">7:00p</td>
        <td data-stat="visitor_team_name">Chicago Bulls</td>
        <td data-stat="visitor_pts"></td>
        <td data-stat="home_team_name">Miami Heat</td>
        <td data-stat="home_pts"></td>
        <td data-stat="arena_name">Kaseya Center</td>
        <td data-stat="game_remarks"></td>
      </tr>
      <tr>
        <th data-stat="date_game">Fri, Oct 25, 2025</th>
        <td data-stat="game_start_time">7:30p</td>
        <td data-stat="visitor_team_name">Toronto Raptors</td>
        <td data-stat="visitor_pts"></td>
        <td data-stat="home_team_name">Boston Celtics</td>
        <td data-stat="home_pts"></td>
        <td data-stat="arena_name">TD Garden</td>
        <td data-stat="game_remarks"></td>
      </tr>
      <tr>
        <th data-stat="date_game">Sat, Oct 26, 2025</th>
        <td data-stat="game_start_time">8:00p</td>
        <td data-stat="visitor_team_name">Minnesota Timberwolves</td>
        <td data-stat="visitor_pts"></td>
        <td data-stat="home_team_name">Dallas Mavericks</td>
        <td data-stat="home_pts"></td>
        <td data-stat="arena_name">American Airlines Center</td>
        <td data-stat="game_remarks"></td>
      </tr>
    </tbody>
  </table>
</body>
</html>
245
sportstime_parser/tests/fixtures/nba/espn_scoreboard.json
vendored
Normal file
@@ -0,0 +1,245 @@
|
||||
{
  "leagues": [
    {
      "id": "46",
      "uid": "s:40~l:46",
      "name": "National Basketball Association",
      "abbreviation": "NBA"
    }
  ],
  "season": {
    "type": 2,
    "year": 2026
  },
  "day": {
    "date": "2025-10-22T00:00:00Z"
  },
  "events": [
    {
      "id": "401584721",
      "uid": "s:40~l:46~e:401584721",
      "date": "2025-10-22T23:30:00Z",
      "name": "Boston Celtics at Cleveland Cavaliers",
      "shortName": "BOS @ CLE",
      "competitions": [
        {
          "id": "401584721",
          "uid": "s:40~l:46~e:401584721~c:401584721",
          "date": "2025-10-22T23:30:00Z",
          "attendance": 20562,
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "5064",
            "fullName": "Rocket Mortgage FieldHouse",
            "address": {
              "city": "Cleveland",
              "state": "OH"
            },
            "capacity": 19432,
            "indoor": true
          },
          "competitors": [
            {
              "id": "5",
              "uid": "s:40~l:46~t:5",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "5",
                "uid": "s:40~l:46~t:5",
                "location": "Cleveland",
                "name": "Cavaliers",
                "abbreviation": "CLE",
                "displayName": "Cleveland Cavaliers"
              },
              "score": "108",
              "winner": false
            },
            {
              "id": "2",
              "uid": "s:40~l:46~t:2",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "2",
                "uid": "s:40~l:46~t:2",
                "location": "Boston",
                "name": "Celtics",
                "abbreviation": "BOS",
                "displayName": "Boston Celtics"
              },
              "score": "112",
              "winner": true
            }
          ],
          "status": {
            "clock": 0,
            "displayClock": "0:00",
            "period": 4,
            "type": {
              "id": "3",
              "name": "STATUS_FINAL",
              "state": "post",
              "completed": true
            }
          }
        }
      ]
    },
    {
      "id": "401584722",
      "uid": "s:40~l:46~e:401584722",
      "date": "2025-10-23T02:00:00Z",
      "name": "Denver Nuggets at Los Angeles Lakers",
      "shortName": "DEN @ LAL",
      "competitions": [
        {
          "id": "401584722",
          "uid": "s:40~l:46~e:401584722~c:401584722",
          "date": "2025-10-23T02:00:00Z",
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "5091",
            "fullName": "Crypto.com Arena",
            "address": {
              "city": "Los Angeles",
              "state": "CA"
            },
            "capacity": 19068,
            "indoor": true
          },
          "competitors": [
            {
              "id": "13",
              "uid": "s:40~l:46~t:13",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "13",
                "uid": "s:40~l:46~t:13",
                "location": "Los Angeles",
                "name": "Lakers",
                "abbreviation": "LAL",
                "displayName": "Los Angeles Lakers"
              },
              "score": "127",
              "winner": true
            },
            {
              "id": "7",
              "uid": "s:40~l:46~t:7",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "7",
                "uid": "s:40~l:46~t:7",
                "location": "Denver",
                "name": "Nuggets",
                "abbreviation": "DEN",
                "displayName": "Denver Nuggets"
              },
              "score": "119",
              "winner": false
            }
          ],
          "status": {
            "clock": 0,
            "displayClock": "0:00",
            "period": 4,
            "type": {
              "id": "3",
              "name": "STATUS_FINAL",
              "state": "post",
              "completed": true
            }
          }
        }
      ]
    },
    {
      "id": "401584723",
      "uid": "s:40~l:46~e:401584723",
      "date": "2025-10-24T00:00:00Z",
      "name": "Houston Rockets at Oklahoma City Thunder",
      "shortName": "HOU @ OKC",
      "competitions": [
        {
          "id": "401584723",
          "uid": "s:40~l:46~e:401584723~c:401584723",
          "date": "2025-10-24T00:00:00Z",
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "4922",
            "fullName": "Paycom Center",
            "address": {
              "city": "Oklahoma City",
              "state": "OK"
            },
            "capacity": 18203,
            "indoor": true
          },
          "competitors": [
            {
              "id": "25",
              "uid": "s:40~l:46~t:25",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "25",
                "uid": "s:40~l:46~t:25",
                "location": "Oklahoma City",
                "name": "Thunder",
                "abbreviation": "OKC",
                "displayName": "Oklahoma City Thunder"
              },
              "score": null,
              "winner": null
            },
            {
              "id": "10",
              "uid": "s:40~l:46~t:10",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "10",
                "uid": "s:40~l:46~t:10",
                "location": "Houston",
                "name": "Rockets",
                "abbreviation": "HOU",
                "displayName": "Houston Rockets"
              },
              "score": null,
              "winner": null
            }
          ],
          "status": {
            "clock": 0,
            "displayClock": "0:00",
            "period": 0,
            "type": {
              "id": "1",
              "name": "STATUS_SCHEDULED",
              "state": "pre",
              "completed": false
            }
          }
        }
      ]
    }
  ]
}
245 sportstime_parser/tests/fixtures/nfl/espn_scoreboard.json vendored Normal file
@@ -0,0 +1,245 @@
{
  "leagues": [
    {
      "id": "28",
      "uid": "s:20~l:28",
      "name": "National Football League",
      "abbreviation": "NFL"
    }
  ],
  "season": {
    "type": 2,
    "year": 2025
  },
  "week": {
    "number": 1
  },
  "events": [
    {
      "id": "401671801",
      "uid": "s:20~l:28~e:401671801",
      "date": "2025-09-07T20:00:00Z",
      "name": "Kansas City Chiefs at Baltimore Ravens",
      "shortName": "KC @ BAL",
      "competitions": [
        {
          "id": "401671801",
          "uid": "s:20~l:28~e:401671801~c:401671801",
          "date": "2025-09-07T20:00:00Z",
          "attendance": 71547,
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "3814",
            "fullName": "M&T Bank Stadium",
            "address": {
              "city": "Baltimore",
              "state": "MD"
            },
            "capacity": 71008,
            "indoor": false
          },
          "competitors": [
            {
              "id": "33",
              "uid": "s:20~l:28~t:33",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "33",
                "uid": "s:20~l:28~t:33",
                "location": "Baltimore",
                "name": "Ravens",
                "abbreviation": "BAL",
                "displayName": "Baltimore Ravens"
              },
              "score": "20",
              "winner": false
            },
            {
              "id": "12",
              "uid": "s:20~l:28~t:12",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "12",
                "uid": "s:20~l:28~t:12",
                "location": "Kansas City",
                "name": "Chiefs",
                "abbreviation": "KC",
                "displayName": "Kansas City Chiefs"
              },
              "score": "27",
              "winner": true
            }
          ],
          "status": {
            "clock": 0,
            "displayClock": "0:00",
            "period": 4,
            "type": {
              "id": "3",
              "name": "STATUS_FINAL",
              "state": "post",
              "completed": true
            }
          }
        }
      ]
    },
    {
      "id": "401671802",
      "uid": "s:20~l:28~e:401671802",
      "date": "2025-09-08T17:00:00Z",
      "name": "Philadelphia Eagles at Green Bay Packers",
      "shortName": "PHI @ GB",
      "competitions": [
        {
          "id": "401671802",
          "uid": "s:20~l:28~e:401671802~c:401671802",
          "date": "2025-09-08T17:00:00Z",
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "3798",
            "fullName": "Lambeau Field",
            "address": {
              "city": "Green Bay",
              "state": "WI"
            },
            "capacity": 81441,
            "indoor": false
          },
          "competitors": [
            {
              "id": "9",
              "uid": "s:20~l:28~t:9",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "9",
                "uid": "s:20~l:28~t:9",
                "location": "Green Bay",
                "name": "Packers",
                "abbreviation": "GB",
                "displayName": "Green Bay Packers"
              },
              "score": "34",
              "winner": true
            },
            {
              "id": "21",
              "uid": "s:20~l:28~t:21",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "21",
                "uid": "s:20~l:28~t:21",
                "location": "Philadelphia",
                "name": "Eagles",
                "abbreviation": "PHI",
                "displayName": "Philadelphia Eagles"
              },
              "score": "29",
              "winner": false
            }
          ],
          "status": {
            "clock": 0,
            "displayClock": "0:00",
            "period": 4,
            "type": {
              "id": "3",
              "name": "STATUS_FINAL",
              "state": "post",
              "completed": true
            }
          }
        }
      ]
    },
    {
      "id": "401671803",
      "uid": "s:20~l:28~e:401671803",
      "date": "2025-09-08T20:25:00Z",
      "name": "Dallas Cowboys at Cleveland Browns",
      "shortName": "DAL @ CLE",
      "competitions": [
        {
          "id": "401671803",
          "uid": "s:20~l:28~e:401671803~c:401671803",
          "date": "2025-09-08T20:25:00Z",
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "3653",
            "fullName": "Cleveland Browns Stadium",
            "address": {
              "city": "Cleveland",
              "state": "OH"
            },
            "capacity": 67431,
            "indoor": false
          },
          "competitors": [
            {
              "id": "5",
              "uid": "s:20~l:28~t:5",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "5",
                "uid": "s:20~l:28~t:5",
                "location": "Cleveland",
                "name": "Browns",
                "abbreviation": "CLE",
                "displayName": "Cleveland Browns"
              },
              "score": null,
              "winner": null
            },
            {
              "id": "6",
              "uid": "s:20~l:28~t:6",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "6",
                "uid": "s:20~l:28~t:6",
                "location": "Dallas",
                "name": "Cowboys",
                "abbreviation": "DAL",
                "displayName": "Dallas Cowboys"
              },
              "score": null,
              "winner": null
            }
          ],
          "status": {
            "clock": 0,
            "displayClock": "0:00",
            "period": 0,
            "type": {
              "id": "1",
              "name": "STATUS_SCHEDULED",
              "state": "pre",
              "completed": false
            }
          }
        }
      ]
    }
  ]
}
245 sportstime_parser/tests/fixtures/nhl/espn_scoreboard.json vendored Normal file
@@ -0,0 +1,245 @@
{
  "leagues": [
    {
      "id": "90",
      "uid": "s:70~l:90",
      "name": "National Hockey League",
      "abbreviation": "NHL"
    }
  ],
  "season": {
    "type": 2,
    "year": 2026
  },
  "day": {
    "date": "2025-10-08T00:00:00Z"
  },
  "events": [
    {
      "id": "401671901",
      "uid": "s:70~l:90~e:401671901",
      "date": "2025-10-08T23:00:00Z",
      "name": "Pittsburgh Penguins at Boston Bruins",
      "shortName": "PIT @ BOS",
      "competitions": [
        {
          "id": "401671901",
          "uid": "s:70~l:90~e:401671901~c:401671901",
          "date": "2025-10-08T23:00:00Z",
          "attendance": 17850,
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "1823",
            "fullName": "TD Garden",
            "address": {
              "city": "Boston",
              "state": "MA"
            },
            "capacity": 17850,
            "indoor": true
          },
          "competitors": [
            {
              "id": "1",
              "uid": "s:70~l:90~t:1",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "1",
                "uid": "s:70~l:90~t:1",
                "location": "Boston",
                "name": "Bruins",
                "abbreviation": "BOS",
                "displayName": "Boston Bruins"
              },
              "score": "4",
              "winner": true
            },
            {
              "id": "5",
              "uid": "s:70~l:90~t:5",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "5",
                "uid": "s:70~l:90~t:5",
                "location": "Pittsburgh",
                "name": "Penguins",
                "abbreviation": "PIT",
                "displayName": "Pittsburgh Penguins"
              },
              "score": "2",
              "winner": false
            }
          ],
          "status": {
            "clock": 0,
            "displayClock": "0:00",
            "period": 3,
            "type": {
              "id": "3",
              "name": "STATUS_FINAL",
              "state": "post",
              "completed": true
            }
          }
        }
      ]
    },
    {
      "id": "401671902",
      "uid": "s:70~l:90~e:401671902",
      "date": "2025-10-09T00:00:00Z",
      "name": "Toronto Maple Leafs at Montreal Canadiens",
      "shortName": "TOR @ MTL",
      "competitions": [
        {
          "id": "401671902",
          "uid": "s:70~l:90~e:401671902~c:401671902",
          "date": "2025-10-09T00:00:00Z",
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "1918",
            "fullName": "Bell Centre",
            "address": {
              "city": "Montreal",
              "state": "QC"
            },
            "capacity": 21302,
            "indoor": true
          },
          "competitors": [
            {
              "id": "8",
              "uid": "s:70~l:90~t:8",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "8",
                "uid": "s:70~l:90~t:8",
                "location": "Montreal",
                "name": "Canadiens",
                "abbreviation": "MTL",
                "displayName": "Montreal Canadiens"
              },
              "score": "3",
              "winner": false
            },
            {
              "id": "10",
              "uid": "s:70~l:90~t:10",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "10",
                "uid": "s:70~l:90~t:10",
                "location": "Toronto",
                "name": "Maple Leafs",
                "abbreviation": "TOR",
                "displayName": "Toronto Maple Leafs"
              },
              "score": "5",
              "winner": true
            }
          ],
          "status": {
            "clock": 0,
            "displayClock": "0:00",
            "period": 3,
            "type": {
              "id": "3",
              "name": "STATUS_FINAL",
              "state": "post",
              "completed": true
            }
          }
        }
      ]
    },
    {
      "id": "401671903",
      "uid": "s:70~l:90~e:401671903",
      "date": "2025-10-09T02:00:00Z",
      "name": "Vegas Golden Knights at Los Angeles Kings",
      "shortName": "VGK @ LAK",
      "competitions": [
        {
          "id": "401671903",
          "uid": "s:70~l:90~e:401671903~c:401671903",
          "date": "2025-10-09T02:00:00Z",
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "1816",
            "fullName": "Crypto.com Arena",
            "address": {
              "city": "Los Angeles",
              "state": "CA"
            },
            "capacity": 18230,
            "indoor": true
          },
          "competitors": [
            {
              "id": "26",
              "uid": "s:70~l:90~t:26",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "26",
                "uid": "s:70~l:90~t:26",
                "location": "Los Angeles",
                "name": "Kings",
                "abbreviation": "LAK",
                "displayName": "Los Angeles Kings"
              },
              "score": null,
              "winner": null
            },
            {
              "id": "54",
              "uid": "s:70~l:90~t:54",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "54",
                "uid": "s:70~l:90~t:54",
                "location": "Vegas",
                "name": "Golden Knights",
                "abbreviation": "VGK",
                "displayName": "Vegas Golden Knights"
              },
              "score": null,
              "winner": null
            }
          ],
          "status": {
            "clock": 0,
            "displayClock": "0:00",
            "period": 0,
            "type": {
              "id": "1",
              "name": "STATUS_SCHEDULED",
              "state": "pre",
              "completed": false
            }
          }
        }
      ]
    }
  ]
}
245 sportstime_parser/tests/fixtures/nwsl/espn_scoreboard.json vendored Normal file
@@ -0,0 +1,245 @@
{
  "leagues": [
    {
      "id": "761",
      "uid": "s:600~l:761",
      "name": "National Women's Soccer League",
      "abbreviation": "NWSL"
    }
  ],
  "season": {
    "type": 2,
    "year": 2026
  },
  "day": {
    "date": "2026-04-10T00:00:00Z"
  },
  "events": [
    {
      "id": "401672201",
      "uid": "s:600~l:761~e:401672201",
      "date": "2026-04-10T23:00:00Z",
      "name": "Angel City FC at Portland Thorns",
      "shortName": "LA @ POR",
      "competitions": [
        {
          "id": "401672201",
          "uid": "s:600~l:761~e:401672201~c:401672201",
          "date": "2026-04-10T23:00:00Z",
          "attendance": 22000,
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "8070",
            "fullName": "Providence Park",
            "address": {
              "city": "Portland",
              "state": "OR"
            },
            "capacity": 25218,
            "indoor": false
          },
          "competitors": [
            {
              "id": "15625",
              "uid": "s:600~l:761~t:15625",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "15625",
                "uid": "s:600~l:761~t:15625",
                "location": "Portland",
                "name": "Thorns FC",
                "abbreviation": "POR",
                "displayName": "Portland Thorns FC"
              },
              "score": "2",
              "winner": true
            },
            {
              "id": "19934",
              "uid": "s:600~l:761~t:19934",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "19934",
                "uid": "s:600~l:761~t:19934",
                "location": "Los Angeles",
                "name": "Angel City",
                "abbreviation": "LA",
                "displayName": "Angel City FC"
              },
              "score": "1",
              "winner": false
            }
          ],
          "status": {
            "clock": 90,
            "displayClock": "90'",
            "period": 2,
            "type": {
              "id": "3",
              "name": "STATUS_FINAL",
              "state": "post",
              "completed": true
            }
          }
        }
      ]
    },
    {
      "id": "401672202",
      "uid": "s:600~l:761~e:401672202",
      "date": "2026-04-11T00:00:00Z",
      "name": "Orlando Pride at North Carolina Courage",
      "shortName": "ORL @ NC",
      "competitions": [
        {
          "id": "401672202",
          "uid": "s:600~l:761~e:401672202~c:401672202",
          "date": "2026-04-11T00:00:00Z",
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "8073",
            "fullName": "WakeMed Soccer Park",
            "address": {
              "city": "Cary",
              "state": "NC"
            },
            "capacity": 10000,
            "indoor": false
          },
          "competitors": [
            {
              "id": "15618",
              "uid": "s:600~l:761~t:15618",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "15618",
                "uid": "s:600~l:761~t:15618",
                "location": "North Carolina",
                "name": "Courage",
                "abbreviation": "NC",
                "displayName": "North Carolina Courage"
              },
              "score": "3",
              "winner": true
            },
            {
              "id": "15626",
              "uid": "s:600~l:761~t:15626",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "15626",
                "uid": "s:600~l:761~t:15626",
                "location": "Orlando",
                "name": "Pride",
                "abbreviation": "ORL",
                "displayName": "Orlando Pride"
              },
              "score": "1",
              "winner": false
            }
          ],
          "status": {
            "clock": 90,
            "displayClock": "90'",
            "period": 2,
            "type": {
              "id": "3",
              "name": "STATUS_FINAL",
              "state": "post",
              "completed": true
            }
          }
        }
      ]
    },
    {
      "id": "401672203",
      "uid": "s:600~l:761~e:401672203",
      "date": "2026-04-11T02:00:00Z",
      "name": "San Diego Wave at Bay FC",
      "shortName": "SD @ BAY",
      "competitions": [
        {
          "id": "401672203",
          "uid": "s:600~l:761~e:401672203~c:401672203",
          "date": "2026-04-11T02:00:00Z",
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "3945",
            "fullName": "PayPal Park",
            "address": {
              "city": "San Jose",
              "state": "CA"
            },
            "capacity": 18000,
            "indoor": false
          },
          "competitors": [
            {
              "id": "25645",
              "uid": "s:600~l:761~t:25645",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "25645",
                "uid": "s:600~l:761~t:25645",
                "location": "Bay Area",
                "name": "FC",
                "abbreviation": "BAY",
                "displayName": "Bay FC"
              },
              "score": null,
              "winner": null
            },
            {
              "id": "22638",
              "uid": "s:600~l:761~t:22638",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "22638",
                "uid": "s:600~l:761~t:22638",
                "location": "San Diego",
                "name": "Wave FC",
                "abbreviation": "SD",
                "displayName": "San Diego Wave FC"
              },
              "score": null,
              "winner": null
            }
          ],
          "status": {
            "clock": 0,
            "displayClock": "0'",
            "period": 0,
            "type": {
              "id": "1",
              "name": "STATUS_SCHEDULED",
              "state": "pre",
              "completed": false
            }
          }
        }
      ]
    }
  ]
}
245 sportstime_parser/tests/fixtures/wnba/espn_scoreboard.json vendored Normal file
@@ -0,0 +1,245 @@
{
  "leagues": [
    {
      "id": "59",
      "uid": "s:40~l:59",
      "name": "Women's National Basketball Association",
      "abbreviation": "WNBA"
    }
  ],
  "season": {
    "type": 2,
    "year": 2026
  },
  "day": {
    "date": "2026-05-20T00:00:00Z"
  },
  "events": [
    {
      "id": "401672101",
      "uid": "s:40~l:59~e:401672101",
      "date": "2026-05-20T23:00:00Z",
      "name": "Las Vegas Aces at New York Liberty",
      "shortName": "LV @ NY",
      "competitions": [
        {
          "id": "401672101",
          "uid": "s:40~l:59~e:401672101~c:401672101",
          "date": "2026-05-20T23:00:00Z",
          "attendance": 17732,
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "4346",
            "fullName": "Barclays Center",
            "address": {
              "city": "Brooklyn",
              "state": "NY"
            },
            "capacity": 17732,
            "indoor": true
          },
          "competitors": [
            {
              "id": "9",
              "uid": "s:40~l:59~t:9",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "9",
                "uid": "s:40~l:59~t:9",
                "location": "New York",
                "name": "Liberty",
                "abbreviation": "NY",
                "displayName": "New York Liberty"
              },
              "score": "92",
              "winner": true
            },
            {
              "id": "20",
              "uid": "s:40~l:59~t:20",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "20",
                "uid": "s:40~l:59~t:20",
                "location": "Las Vegas",
                "name": "Aces",
                "abbreviation": "LV",
                "displayName": "Las Vegas Aces"
              },
              "score": "88",
              "winner": false
            }
          ],
          "status": {
            "clock": 0,
            "displayClock": "0:00",
            "period": 4,
            "type": {
              "id": "3",
              "name": "STATUS_FINAL",
              "state": "post",
              "completed": true
            }
          }
        }
      ]
    },
    {
      "id": "401672102",
      "uid": "s:40~l:59~e:401672102",
      "date": "2026-05-21T00:00:00Z",
      "name": "Connecticut Sun at Chicago Sky",
      "shortName": "CONN @ CHI",
      "competitions": [
        {
          "id": "401672102",
          "uid": "s:40~l:59~e:401672102~c:401672102",
          "date": "2026-05-21T00:00:00Z",
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "8086",
            "fullName": "Wintrust Arena",
            "address": {
              "city": "Chicago",
              "state": "IL"
            },
            "capacity": 10387,
            "indoor": true
          },
          "competitors": [
            {
              "id": "6",
              "uid": "s:40~l:59~t:6",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "6",
                "uid": "s:40~l:59~t:6",
                "location": "Chicago",
                "name": "Sky",
                "abbreviation": "CHI",
                "displayName": "Chicago Sky"
              },
              "score": "78",
              "winner": false
            },
            {
              "id": "5",
              "uid": "s:40~l:59~t:5",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "5",
                "uid": "s:40~l:59~t:5",
                "location": "Connecticut",
                "name": "Sun",
                "abbreviation": "CONN",
                "displayName": "Connecticut Sun"
              },
              "score": "85",
              "winner": true
            }
          ],
          "status": {
            "clock": 0,
            "displayClock": "0:00",
            "period": 4,
            "type": {
              "id": "3",
              "name": "STATUS_FINAL",
              "state": "post",
              "completed": true
            }
          }
        }
      ]
    },
    {
      "id": "401672103",
      "uid": "s:40~l:59~e:401672103",
      "date": "2026-05-21T02:00:00Z",
      "name": "Phoenix Mercury at Seattle Storm",
      "shortName": "PHX @ SEA",
      "competitions": [
        {
          "id": "401672103",
          "uid": "s:40~l:59~e:401672103~c:401672103",
          "date": "2026-05-21T02:00:00Z",
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "3097",
            "fullName": "Climate Pledge Arena",
            "address": {
              "city": "Seattle",
              "state": "WA"
            },
            "capacity": 18100,
            "indoor": true
          },
          "competitors": [
            {
              "id": "11",
              "uid": "s:40~l:59~t:11",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "11",
                "uid": "s:40~l:59~t:11",
                "location": "Seattle",
                "name": "Storm",
                "abbreviation": "SEA",
                "displayName": "Seattle Storm"
              },
              "score": null,
              "winner": null
            },
            {
              "id": "8",
              "uid": "s:40~l:59~t:8",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "8",
                "uid": "s:40~l:59~t:8",
                "location": "Phoenix",
                "name": "Mercury",
                "abbreviation": "PHX",
                "displayName": "Phoenix Mercury"
              },
              "score": null,
              "winner": null
            }
          ],
          "status": {
            "clock": 0,
            "displayClock": "0:00",
            "period": 0,
            "type": {
              "id": "1",
              "name": "STATUS_SCHEDULED",
              "state": "pre",
              "completed": false
            }
          }
        }
      ]
    }
  ]
}
269 sportstime_parser/tests/test_alias_loader.py Normal file
@@ -0,0 +1,269 @@
"""Tests for alias loaders."""

import pytest
import json
import tempfile
from datetime import date
from pathlib import Path

from sportstime_parser.normalizers.alias_loader import (
    TeamAliasLoader,
    StadiumAliasLoader,
)
from sportstime_parser.models.aliases import AliasType


class TestTeamAliasLoader:
    """Tests for TeamAliasLoader class."""

    @pytest.fixture
    def sample_aliases_file(self):
        """Create a temporary aliases file for testing."""
        data = [
            {
                "id": "1",
                "team_canonical_id": "nba_okc",
                "alias_type": "name",
                "alias_value": "Seattle SuperSonics",
                "valid_from": "1967-01-01",
                "valid_until": "2008-07-02",
            },
            {
                "id": "2",
                "team_canonical_id": "nba_okc",
                "alias_type": "name",
                "alias_value": "Oklahoma City Thunder",
                "valid_from": "2008-07-03",
                "valid_until": None,
            },
            {
                "id": "3",
                "team_canonical_id": "nba_okc",
                "alias_type": "abbreviation",
                "alias_value": "OKC",
                "valid_from": "2008-07-03",
                "valid_until": None,
            },
        ]
        with tempfile.NamedTemporaryFile(
            mode="w", suffix=".json", delete=False
        ) as f:
            json.dump(data, f)
        return Path(f.name)

    def test_load_aliases(self, sample_aliases_file):
        """Test loading aliases from file."""
        loader = TeamAliasLoader(sample_aliases_file)
        loader.load()
        assert len(loader._aliases) == 3

    def test_resolve_current_alias(self, sample_aliases_file):
        """Test resolving a current alias."""
        loader = TeamAliasLoader(sample_aliases_file)

        # Current date should resolve to Thunder
        result = loader.resolve("Oklahoma City Thunder")
        assert result == "nba_okc"

        # Abbreviation should also work
        result = loader.resolve("OKC")
        assert result == "nba_okc"

    def test_resolve_historical_alias(self, sample_aliases_file):
        """Test resolving a historical alias with date."""
        loader = TeamAliasLoader(sample_aliases_file)

        # Historical date should resolve to SuperSonics
        result = loader.resolve("Seattle SuperSonics", check_date=date(2007, 1, 1))
        assert result == "nba_okc"

        # After relocation, SuperSonics shouldn't resolve
        result = loader.resolve("Seattle SuperSonics", check_date=date(2010, 1, 1))
        assert result is None

    def test_resolve_case_insensitive(self, sample_aliases_file):
        """Test case-insensitive resolution."""
        loader = TeamAliasLoader(sample_aliases_file)

        result = loader.resolve("oklahoma city thunder")
        assert result == "nba_okc"

        result = loader.resolve("okc")
        assert result == "nba_okc"

    def test_resolve_with_type_filter(self, sample_aliases_file):
        """Test filtering by alias type."""
        loader = TeamAliasLoader(sample_aliases_file)

        # Should find when searching all types
        result = loader.resolve("OKC")
        assert result == "nba_okc"

        # Should not find when filtering to name only
        result = loader.resolve("OKC", alias_types=[AliasType.NAME])
        assert result is None

    def test_get_aliases_for_team(self, sample_aliases_file):
        """Test getting all aliases for a team."""
        loader = TeamAliasLoader(sample_aliases_file)

        aliases = loader.get_aliases_for_team("nba_okc")
        assert len(aliases) == 3

        # Filter by a post-relocation date
        aliases = loader.get_aliases_for_team(
            "nba_okc", check_date=date(2020, 1, 1)
        )
        assert len(aliases) == 2  # Thunder name + OKC abbreviation

    def test_missing_file(self):
        """Test handling of missing file."""
        loader = TeamAliasLoader(Path("/nonexistent/file.json"))
        loader.load()  # Should not raise
        assert len(loader._aliases) == 0


class TestStadiumAliasLoader:
    """Tests for StadiumAliasLoader class."""

    @pytest.fixture
    def sample_stadium_aliases(self):
        """Create a temporary stadium aliases file."""
        data = [
            {
                "alias_name": "Crypto.com Arena",
                "stadium_canonical_id": "crypto_arena_los_angeles_ca",
                "valid_from": "2021-12-25",
                "valid_until": None,
            },
            {
                "alias_name": "Staples Center",
                "stadium_canonical_id": "crypto_arena_los_angeles_ca",
                "valid_from": "1999-10-17",
                "valid_until": "2021-12-24",
            },
        ]
        with tempfile.NamedTemporaryFile(
            mode="w", suffix=".json", delete=False
        ) as f:
            json.dump(data, f)
        return Path(f.name)

    def test_load_stadium_aliases(self, sample_stadium_aliases):
        """Test loading stadium aliases."""
        loader = StadiumAliasLoader(sample_stadium_aliases)
        loader.load()
        assert len(loader._aliases) == 2

    def test_resolve_current_name(self, sample_stadium_aliases):
        """Test resolving current stadium name."""
        loader = StadiumAliasLoader(sample_stadium_aliases)

        result = loader.resolve("Crypto.com Arena")
        assert result == "crypto_arena_los_angeles_ca"

    def test_resolve_historical_name(self, sample_stadium_aliases):
        """Test resolving historical stadium name."""
        loader = StadiumAliasLoader(sample_stadium_aliases)

        # Staples Center in 2020
|
||||
result = loader.resolve("Staples Center", check_date=date(2020, 1, 1))
|
||||
assert result == "crypto_arena_los_angeles_ca"
|
||||
|
||||
# Staples Center after rename shouldn't resolve
|
||||
result = loader.resolve("Staples Center", check_date=date(2023, 1, 1))
|
||||
assert result is None
|
||||
|
||||
def test_date_boundary(self, sample_stadium_aliases):
|
||||
"""Test exact date boundaries."""
|
||||
loader = StadiumAliasLoader(sample_stadium_aliases)
|
||||
|
||||
# Last day of Staples Center
|
||||
result = loader.resolve("Staples Center", check_date=date(2021, 12, 24))
|
||||
assert result == "crypto_arena_los_angeles_ca"
|
||||
|
||||
# First day of Crypto.com Arena
|
||||
result = loader.resolve("Crypto.com Arena", check_date=date(2021, 12, 25))
|
||||
assert result == "crypto_arena_los_angeles_ca"
|
||||
|
||||
def test_get_all_names(self, sample_stadium_aliases):
|
||||
"""Test getting all stadium names."""
|
||||
loader = StadiumAliasLoader(sample_stadium_aliases)
|
||||
|
||||
names = loader.get_all_names()
|
||||
assert len(names) == 2
|
||||
assert "Crypto.com Arena" in names
|
||||
assert "Staples Center" in names
|
||||
|
||||
|
||||
class TestDateRangeHandling:
|
||||
"""Tests for date range edge cases in aliases."""
|
||||
|
||||
@pytest.fixture
|
||||
def date_range_aliases(self):
|
||||
"""Create aliases with various date range scenarios."""
|
||||
data = [
|
||||
{
|
||||
"id": "1",
|
||||
"team_canonical_id": "test_team",
|
||||
"alias_type": "name",
|
||||
"alias_value": "Always Valid",
|
||||
"valid_from": None,
|
||||
"valid_until": None,
|
||||
},
|
||||
{
|
||||
"id": "2",
|
||||
"team_canonical_id": "test_team",
|
||||
"alias_type": "name",
|
||||
"alias_value": "Future Only",
|
||||
"valid_from": "2030-01-01",
|
||||
"valid_until": None,
|
||||
},
|
||||
{
|
||||
"id": "3",
|
||||
"team_canonical_id": "test_team",
|
||||
"alias_type": "name",
|
||||
"alias_value": "Past Only",
|
||||
"valid_from": None,
|
||||
"valid_until": "2000-01-01",
|
||||
},
|
||||
]
|
||||
with tempfile.NamedTemporaryFile(
|
||||
mode="w", suffix=".json", delete=False
|
||||
) as f:
|
||||
json.dump(data, f)
|
||||
return Path(f.name)
|
||||
|
||||
def test_always_valid_alias(self, date_range_aliases):
|
||||
"""Test alias with no date restrictions."""
|
||||
loader = TeamAliasLoader(date_range_aliases)
|
||||
|
||||
result = loader.resolve("Always Valid", check_date=date(2025, 1, 1))
|
||||
assert result == "test_team"
|
||||
|
||||
result = loader.resolve("Always Valid", check_date=date(1990, 1, 1))
|
||||
assert result == "test_team"
|
||||
|
||||
def test_future_only_alias(self, date_range_aliases):
|
||||
"""Test alias that starts in the future."""
|
||||
loader = TeamAliasLoader(date_range_aliases)
|
||||
|
||||
# Before valid_from
|
||||
result = loader.resolve("Future Only", check_date=date(2025, 1, 1))
|
||||
assert result is None
|
||||
|
||||
# After valid_from
|
||||
result = loader.resolve("Future Only", check_date=date(2035, 1, 1))
|
||||
assert result == "test_team"
|
||||
|
||||
def test_past_only_alias(self, date_range_aliases):
|
||||
"""Test alias that expired in the past."""
|
||||
loader = TeamAliasLoader(date_range_aliases)
|
||||
|
||||
# Before valid_until
|
||||
result = loader.resolve("Past Only", check_date=date(1990, 1, 1))
|
||||
assert result == "test_team"
|
||||
|
||||
# After valid_until
|
||||
result = loader.resolve("Past Only", check_date=date(2025, 1, 1))
|
||||
assert result is None
|
||||
187
sportstime_parser/tests/test_canonical_id.py
Normal file
@@ -0,0 +1,187 @@
"""Tests for canonical ID generation."""

import pytest
from datetime import datetime, date

from sportstime_parser.normalizers.canonical_id import (
    generate_game_id,
    generate_team_id,
    generate_team_id_from_abbrev,
    generate_stadium_id,
    parse_game_id,
    normalize_string,
)


class TestNormalizeString:
    """Tests for normalize_string function."""

    def test_basic_normalization(self):
        """Test basic string normalization."""
        assert normalize_string("New York") == "new_york"
        assert normalize_string("Los Angeles") == "los_angeles"

    def test_removes_special_characters(self):
        """Test that special characters are removed."""
        assert normalize_string("AT&T Stadium") == "att_stadium"
        assert normalize_string("St. Louis") == "st_louis"
        assert normalize_string("O'Brien Field") == "obrien_field"

    def test_collapses_whitespace(self):
        """Test that multiple spaces are collapsed."""
        assert normalize_string("New  York") == "new_york"
        assert normalize_string(" Los Angeles ") == "los_angeles"

    def test_empty_string(self):
        """Test empty string handling."""
        assert normalize_string("") == ""
        assert normalize_string(" ") == ""

    def test_unicode_normalization(self):
        """Test unicode characters are handled."""
        assert normalize_string("Café") == "cafe"
        assert normalize_string("José") == "jose"


class TestGenerateGameId:
    """Tests for generate_game_id function."""

    def test_basic_game_id(self):
        """Test basic game ID generation."""
        game_id = generate_game_id(
            sport="nba",
            season=2025,
            away_abbrev="bos",
            home_abbrev="lal",
            game_date=date(2025, 12, 25),
        )
        assert game_id == "game_nba_2025_20251225_bos_lal"

    def test_game_id_with_datetime(self):
        """Test game ID generation with datetime object."""
        game_id = generate_game_id(
            sport="mlb",
            season=2026,
            away_abbrev="nyy",
            home_abbrev="bos",
            game_date=datetime(2026, 4, 1, 19, 0),
        )
        assert game_id == "game_mlb_2026_20260401_nyy_bos"

    def test_game_id_with_game_number(self):
        """Test game ID for doubleheader."""
        game_id_1 = generate_game_id(
            sport="mlb",
            season=2026,
            away_abbrev="nyy",
            home_abbrev="bos",
            game_date=date(2026, 7, 4),
            game_number=1,
        )
        game_id_2 = generate_game_id(
            sport="mlb",
            season=2026,
            away_abbrev="nyy",
            home_abbrev="bos",
            game_date=date(2026, 7, 4),
            game_number=2,
        )
        assert game_id_1 == "game_mlb_2026_20260704_nyy_bos_1"
        assert game_id_2 == "game_mlb_2026_20260704_nyy_bos_2"

    def test_sport_lowercased(self):
        """Test that sport and abbreviations are lowercased."""
        game_id = generate_game_id(
            sport="NBA",
            season=2025,
            away_abbrev="BOS",
            home_abbrev="LAL",
            game_date=date(2025, 12, 25),
        )
        assert game_id == "game_nba_2025_20251225_bos_lal"


class TestParseGameId:
    """Tests for parse_game_id function."""

    def test_parse_basic_game_id(self):
        """Test parsing a basic game ID."""
        parsed = parse_game_id("game_nba_2025_20251225_bos_lal")
        assert parsed["sport"] == "nba"
        assert parsed["season"] == 2025
        assert parsed["away_abbrev"] == "bos"
        assert parsed["home_abbrev"] == "lal"
        assert parsed["year"] == 2025
        assert parsed["month"] == 12
        assert parsed["day"] == 25
        assert parsed["game_number"] is None

    def test_parse_game_id_with_game_number(self):
        """Test parsing game ID with game number."""
        parsed = parse_game_id("game_mlb_2026_20260704_nyy_bos_2")
        assert parsed["sport"] == "mlb"
        assert parsed["season"] == 2026
        assert parsed["away_abbrev"] == "nyy"
        assert parsed["home_abbrev"] == "bos"
        assert parsed["year"] == 2026
        assert parsed["month"] == 7
        assert parsed["day"] == 4
        assert parsed["game_number"] == 2

    def test_parse_invalid_game_id(self):
        """Test parsing invalid game ID raises error."""
        with pytest.raises(ValueError):
            parse_game_id("invalid")
        with pytest.raises(ValueError):
            parse_game_id("nba_2025_bos")  # Missing game_ prefix
        with pytest.raises(ValueError):
            parse_game_id("")
        with pytest.raises(ValueError):
            parse_game_id("game_nba_2025_bos_lal")  # Missing date


class TestGenerateTeamId:
    """Tests for generate_team_id function."""

    def test_basic_team_id(self):
        """Test basic team ID generation from city and name."""
        team_id = generate_team_id(sport="nba", city="Los Angeles", name="Lakers")
        assert team_id == "team_nba_los_angeles_lakers"

    def test_team_id_normalizes_input(self):
        """Test that inputs are normalized."""
        team_id = generate_team_id(sport="NBA", city="New York", name="Yankees")
        assert team_id == "team_nba_new_york_yankees"


class TestGenerateTeamIdFromAbbrev:
    """Tests for generate_team_id_from_abbrev function."""

    def test_basic_team_id_from_abbrev(self):
        """Test team ID from abbreviation."""
        team_id = generate_team_id_from_abbrev(sport="nba", abbreviation="LAL")
        assert team_id == "team_nba_lal"

    def test_lowercases_abbreviation(self):
        """Test abbreviation is lowercased."""
        team_id = generate_team_id_from_abbrev(sport="MLB", abbreviation="NYY")
        assert team_id == "team_mlb_nyy"


class TestGenerateStadiumId:
    """Tests for generate_stadium_id function."""

    def test_basic_stadium_id(self):
        """Test basic stadium ID generation."""
        stadium_id = generate_stadium_id(sport="mlb", name="Fenway Park")
        assert stadium_id == "stadium_mlb_fenway_park"

    def test_stadium_id_special_characters(self):
        """Test stadium ID with special characters."""
        stadium_id = generate_stadium_id(sport="nfl", name="AT&T Stadium")
        assert stadium_id == "stadium_nfl_att_stadium"

    def test_stadium_id_with_sponsor(self):
        """Test stadium ID with sponsor name."""
        stadium_id = generate_stadium_id(sport="nba", name="Crypto.com Arena")
        assert stadium_id == "stadium_nba_cryptocom_arena"
194
sportstime_parser/tests/test_fuzzy.py
Normal file
@@ -0,0 +1,194 @@
"""Tests for fuzzy string matching utilities."""

import pytest

from sportstime_parser.normalizers.fuzzy import (
    normalize_for_matching,
    fuzzy_match_team,
    fuzzy_match_stadium,
    exact_match,
    best_match,
    calculate_similarity,
    MatchCandidate,
)


class TestNormalizeForMatching:
    """Tests for normalize_for_matching function."""

    def test_basic_normalization(self):
        """Test basic string normalization."""
        assert normalize_for_matching("Los Angeles Lakers") == "los angeles lakers"
        assert normalize_for_matching(" Boston Celtics ") == "boston celtics"

    def test_removes_common_prefixes(self):
        """Test removal of common prefixes."""
        assert normalize_for_matching("The Boston Celtics") == "boston celtics"
        assert normalize_for_matching("Team Lakers") == "lakers"

    def test_removes_stadium_suffixes(self):
        """Test removal of stadium-related suffixes."""
        assert normalize_for_matching("Fenway Park") == "fenway"
        assert normalize_for_matching("Madison Square Garden Arena") == "madison square garden"
        assert normalize_for_matching("Wrigley Field") == "wrigley"
        assert normalize_for_matching("TD Garden Center") == "td garden"


class TestExactMatch:
    """Tests for exact_match function."""

    def test_exact_match_primary_name(self):
        """Test exact match on primary name."""
        candidates = [
            MatchCandidate("nba_lal", "Los Angeles Lakers", ["Lakers", "LAL"]),
            MatchCandidate("nba_bos", "Boston Celtics", ["Celtics", "BOS"]),
        ]
        assert exact_match("Los Angeles Lakers", candidates) == "nba_lal"
        assert exact_match("Boston Celtics", candidates) == "nba_bos"

    def test_exact_match_alias(self):
        """Test exact match on alias."""
        candidates = [
            MatchCandidate("nba_lal", "Los Angeles Lakers", ["Lakers", "LAL"]),
        ]
        assert exact_match("Lakers", candidates) == "nba_lal"
        assert exact_match("LAL", candidates) == "nba_lal"

    def test_case_insensitive(self):
        """Test case-insensitive matching."""
        candidates = [
            MatchCandidate("nba_lal", "Los Angeles Lakers", ["Lakers"]),
        ]
        assert exact_match("los angeles lakers", candidates) == "nba_lal"
        assert exact_match("LAKERS", candidates) == "nba_lal"

    def test_no_match(self):
        """Test no match returns None."""
        candidates = [
            MatchCandidate("nba_lal", "Los Angeles Lakers", ["Lakers"]),
        ]
        assert exact_match("New York Knicks", candidates) is None


class TestFuzzyMatchTeam:
    """Tests for fuzzy_match_team function."""

    def test_close_match(self):
        """Test fuzzy matching finds close matches."""
        candidates = [
            MatchCandidate("nba_lal", "Los Angeles Lakers", ["Lakers", "LA Lakers"]),
            MatchCandidate("nba_lac", "Los Angeles Clippers", ["Clippers", "LA Clippers"]),
        ]
        matches = fuzzy_match_team("LA Lakers", candidates, threshold=70)
        assert len(matches) > 0
        assert matches[0].canonical_id == "nba_lal"

    def test_partial_name_match(self):
        """Test matching on partial team name."""
        candidates = [
            MatchCandidate("nba_bos", "Boston Celtics", ["Celtics", "BOS"]),
        ]
        matches = fuzzy_match_team("Celtics", candidates, threshold=80)
        assert len(matches) > 0
        assert matches[0].canonical_id == "nba_bos"

    def test_threshold_filtering(self):
        """Test that threshold filters low-confidence matches."""
        candidates = [
            MatchCandidate("nba_bos", "Boston Celtics", []),
        ]
        # Very different string should not match at high threshold
        matches = fuzzy_match_team("xyz123", candidates, threshold=90)
        assert len(matches) == 0

    def test_returns_top_n(self):
        """Test that top_n parameter limits results."""
        candidates = [
            MatchCandidate("nba_lal", "Los Angeles Lakers", []),
            MatchCandidate("nba_lac", "Los Angeles Clippers", []),
            MatchCandidate("mlb_lad", "Los Angeles Dodgers", []),
        ]
        matches = fuzzy_match_team("Los Angeles", candidates, threshold=50, top_n=2)
        assert len(matches) <= 2


class TestFuzzyMatchStadium:
    """Tests for fuzzy_match_stadium function."""

    def test_stadium_match(self):
        """Test fuzzy matching stadium names."""
        candidates = [
            MatchCandidate("fenway", "Fenway Park", ["Fenway"]),
            MatchCandidate("td_garden", "TD Garden", ["Boston Garden"]),
        ]
        matches = fuzzy_match_stadium("Fenway Park Boston", candidates, threshold=70)
        assert len(matches) > 0
        assert matches[0].canonical_id == "fenway"

    def test_naming_rights_change(self):
        """Test matching old stadium names."""
        candidates = [
            MatchCandidate(
                "chase_center",
                "Chase Center",
                ["Oracle Arena", "Oakland Coliseum Arena"],
            ),
        ]
        # Should match on alias
        matches = fuzzy_match_stadium("Oracle Arena", candidates, threshold=70)
        assert len(matches) > 0


class TestBestMatch:
    """Tests for best_match function."""

    def test_prefers_exact_match(self):
        """Test that exact match is preferred over fuzzy."""
        candidates = [
            MatchCandidate("nba_lal", "Los Angeles Lakers", ["Lakers"]),
            MatchCandidate("nba_bos", "Boston Celtics", ["Celtics"]),
        ]
        result = best_match("Lakers", candidates)
        assert result is not None
        assert result.canonical_id == "nba_lal"
        assert result.confidence == 100  # Exact match

    def test_falls_back_to_fuzzy(self):
        """Test fallback to fuzzy when no exact match."""
        candidates = [
            MatchCandidate("nba_lal", "Los Angeles Lakers", ["Lakers"]),
        ]
        result = best_match("LA Laker", candidates, threshold=70)
        assert result is not None
        assert result.confidence < 100  # Fuzzy match

    def test_no_match_below_threshold(self):
        """Test returns None when no match above threshold."""
        candidates = [
            MatchCandidate("nba_lal", "Los Angeles Lakers", []),
        ]
        result = best_match("xyz123", candidates, threshold=90)
        assert result is None


class TestCalculateSimilarity:
    """Tests for calculate_similarity function."""

    def test_identical_strings(self):
        """Test identical strings have 100% similarity."""
        assert calculate_similarity("Boston Celtics", "Boston Celtics") == 100

    def test_similar_strings(self):
        """Test similar strings have high similarity."""
        score = calculate_similarity("Boston Celtics", "Celtics Boston")
        assert score >= 90

    def test_different_strings(self):
        """Test different strings have low similarity."""
        score = calculate_similarity("Boston Celtics", "Los Angeles Lakers")
        assert score < 50

    def test_empty_string(self):
        """Test empty string handling."""
        score = calculate_similarity("", "Boston Celtics")
        assert score == 0
1
sportstime_parser/tests/test_scrapers/__init__.py
Normal file
@@ -0,0 +1 @@
"""Tests for scrapers module."""
257
sportstime_parser/tests/test_scrapers/test_mlb.py
Normal file
@@ -0,0 +1,257 @@
"""Tests for MLB scraper."""

from datetime import datetime
from unittest.mock import patch

import pytest

from sportstime_parser.scrapers.mlb import MLBScraper, create_mlb_scraper
from sportstime_parser.scrapers.base import RawGameData
from sportstime_parser.tests.fixtures import (
    load_json_fixture,
    MLB_ESPN_SCOREBOARD_JSON,
)


class TestMLBScraperInit:
    """Test MLBScraper initialization."""

    def test_creates_scraper_with_season(self):
        """Test scraper initializes with correct season."""
        scraper = MLBScraper(season=2026)
        assert scraper.sport == "mlb"
        assert scraper.season == 2026

    def test_factory_function_creates_scraper(self):
        """Test factory function creates correct scraper."""
        scraper = create_mlb_scraper(season=2026)
        assert isinstance(scraper, MLBScraper)
        assert scraper.season == 2026

    def test_expected_game_count(self):
        """Test expected game count is correct for MLB."""
        scraper = MLBScraper(season=2026)
        assert scraper.expected_game_count == 2430

    def test_sources_in_priority_order(self):
        """Test sources are returned in correct priority order."""
        scraper = MLBScraper(season=2026)
        sources = scraper._get_sources()
        assert sources == ["baseball_reference", "mlb_api", "espn"]


class TestESPNParsing:
    """Test ESPN API response parsing."""

    def test_parses_completed_games(self):
        """Test parsing completed games from ESPN."""
        scraper = MLBScraper(season=2026)
        data = load_json_fixture(MLB_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        completed = [g for g in games if g.status == "final"]
        assert len(completed) == 2

        # Yankees @ Red Sox
        nyy_bos = next(g for g in completed if g.away_team_raw == "New York Yankees")
        assert nyy_bos.home_team_raw == "Boston Red Sox"
        assert nyy_bos.away_score == 3
        assert nyy_bos.home_score == 5
        assert nyy_bos.stadium_raw == "Fenway Park"

    def test_parses_scheduled_games(self):
        """Test parsing scheduled games from ESPN."""
        scraper = MLBScraper(season=2026)
        data = load_json_fixture(MLB_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        scheduled = [g for g in games if g.status == "scheduled"]
        assert len(scheduled) == 1

        lad_sf = scheduled[0]
        assert lad_sf.away_team_raw == "Los Angeles Dodgers"
        assert lad_sf.home_team_raw == "San Francisco Giants"
        assert lad_sf.stadium_raw == "Oracle Park"

    def test_parses_venue_info(self):
        """Test venue information is extracted."""
        scraper = MLBScraper(season=2026)
        data = load_json_fixture(MLB_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        for game in games:
            assert game.stadium_raw is not None


class TestGameNormalization:
    """Test game normalization and canonical ID generation."""

    def test_normalizes_games_with_canonical_ids(self):
        """Test games are normalized with correct canonical IDs."""
        scraper = MLBScraper(season=2026)

        raw_games = [
            RawGameData(
                game_date=datetime(2026, 4, 15),
                home_team_raw="Boston Red Sox",
                away_team_raw="New York Yankees",
                stadium_raw="Fenway Park",
                home_score=5,
                away_score=3,
                status="final",
                source_url="http://example.com",
            )
        ]

        games, review_items = scraper._normalize_games(raw_games)

        assert len(games) == 1
        game = games[0]

        # Check canonical ID format
        assert game.id == "mlb_2026_nyy_bos_0415"
        assert game.sport == "mlb"
        assert game.season == 2026

        # Check team IDs
        assert game.home_team_id == "team_mlb_bos"
        assert game.away_team_id == "team_mlb_nyy"

        # Check scores preserved
        assert game.home_score == 5
        assert game.away_score == 3

    def test_creates_review_items_for_unresolved_teams(self):
        """Test review items are created for unresolved teams."""
        scraper = MLBScraper(season=2026)

        raw_games = [
            RawGameData(
                game_date=datetime(2026, 4, 15),
                home_team_raw="Unknown Team XYZ",
                away_team_raw="Boston Red Sox",
                stadium_raw="Fenway Park",
                status="scheduled",
            ),
        ]

        games, review_items = scraper._normalize_games(raw_games)

        # Game should not be created due to unresolved team
        assert len(games) == 0

        # But there should be a review item
        assert len(review_items) >= 1


class TestTeamAndStadiumScraping:
    """Test team and stadium data scraping."""

    def test_scrapes_all_mlb_teams(self):
        """Test all 30 MLB teams are returned."""
        scraper = MLBScraper(season=2026)
        teams = scraper.scrape_teams()

        # 30 MLB teams
        assert len(teams) == 30

        # Check team IDs are unique
        team_ids = [t.id for t in teams]
        assert len(set(team_ids)) == 30

        # Check all teams have required fields
        for team in teams:
            assert team.id.startswith("team_mlb_")
            assert team.sport == "mlb"
            assert team.city
            assert team.name
            assert team.full_name
            assert team.abbreviation

    def test_teams_have_leagues_and_divisions(self):
        """Test teams have league (conference) and division info."""
        scraper = MLBScraper(season=2026)
        teams = scraper.scrape_teams()

        # Count teams by league
        al = [t for t in teams if t.conference == "American"]
        nl = [t for t in teams if t.conference == "National"]

        assert len(al) == 15
        assert len(nl) == 15

    def test_scrapes_all_mlb_stadiums(self):
        """Test all MLB stadiums are returned."""
        scraper = MLBScraper(season=2026)
        stadiums = scraper.scrape_stadiums()

        # Should have stadiums for all teams
        assert len(stadiums) == 30

        # Check stadium IDs are unique
        stadium_ids = [s.id for s in stadiums]
        assert len(set(stadium_ids)) == 30

        # Check all stadiums have required fields
        for stadium in stadiums:
            assert stadium.id.startswith("stadium_mlb_")
            assert stadium.sport == "mlb"
            assert stadium.name
            assert stadium.city
            assert stadium.state
            assert stadium.country in ["USA", "Canada"]
            assert stadium.latitude != 0
            assert stadium.longitude != 0


class TestScrapeFallback:
    """Test multi-source fallback behavior."""

    def test_falls_back_to_next_source_on_failure(self):
        """Test scraper tries next source when first fails."""
        scraper = MLBScraper(season=2026)

        with patch.object(scraper, '_scrape_baseball_reference') as mock_br, \
             patch.object(scraper, '_scrape_mlb_api') as mock_mlb, \
             patch.object(scraper, '_scrape_espn') as mock_espn:

            # Make BR and MLB API fail
            mock_br.side_effect = Exception("Connection failed")
            mock_mlb.side_effect = Exception("API error")

            # Make ESPN return data
            mock_espn.return_value = [
                RawGameData(
                    game_date=datetime(2026, 4, 15),
                    home_team_raw="Boston Red Sox",
                    away_team_raw="New York Yankees",
                    stadium_raw="Fenway Park",
                    status="scheduled",
                )
            ]

            result = scraper.scrape_games()

            assert result.success
            assert result.source == "espn"
            assert mock_br.called
            assert mock_mlb.called
            assert mock_espn.called


class TestSeasonMonths:
    """Test season month calculation."""

    def test_gets_correct_season_months(self):
        """Test correct months are returned for MLB season."""
        scraper = MLBScraper(season=2026)
        months = scraper._get_season_months()

        # MLB season is March-November
        assert len(months) == 9  # Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov

        # Check first month is March of season year
        assert months[0] == (2026, 3)

        # Check last month is November
        assert months[-1] == (2026, 11)
251
sportstime_parser/tests/test_scrapers/test_mls.py
Normal file
@@ -0,0 +1,251 @@
"""Tests for MLS scraper."""

from datetime import datetime
from unittest.mock import patch

import pytest

from sportstime_parser.scrapers.mls import MLSScraper, create_mls_scraper
from sportstime_parser.scrapers.base import RawGameData
from sportstime_parser.tests.fixtures import (
    load_json_fixture,
    MLS_ESPN_SCOREBOARD_JSON,
)


class TestMLSScraperInit:
    """Test MLSScraper initialization."""

    def test_creates_scraper_with_season(self):
        """Test scraper initializes with correct season."""
        scraper = MLSScraper(season=2026)
        assert scraper.sport == "mls"
        assert scraper.season == 2026

    def test_factory_function_creates_scraper(self):
        """Test factory function creates correct scraper."""
        scraper = create_mls_scraper(season=2026)
        assert isinstance(scraper, MLSScraper)
        assert scraper.season == 2026

    def test_expected_game_count(self):
        """Test expected game count is correct for MLS."""
        scraper = MLSScraper(season=2026)
        assert scraper.expected_game_count == 493

    def test_sources_in_priority_order(self):
        """Test sources are returned in correct priority order."""
        scraper = MLSScraper(season=2026)
        sources = scraper._get_sources()
        assert sources == ["espn", "fbref"]


class TestESPNParsing:
    """Test ESPN API response parsing."""

    def test_parses_completed_games(self):
        """Test parsing completed games from ESPN."""
        scraper = MLSScraper(season=2026)
        data = load_json_fixture(MLS_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        completed = [g for g in games if g.status == "final"]
        assert len(completed) == 2

        # Galaxy @ LAFC
        la_lafc = next(g for g in completed if g.away_team_raw == "LA Galaxy")
        assert la_lafc.home_team_raw == "Los Angeles FC"
        assert la_lafc.away_score == 2
        assert la_lafc.home_score == 3
        assert la_lafc.stadium_raw == "BMO Stadium"

    def test_parses_scheduled_games(self):
        """Test parsing scheduled games from ESPN."""
        scraper = MLSScraper(season=2026)
        data = load_json_fixture(MLS_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        scheduled = [g for g in games if g.status == "scheduled"]
        assert len(scheduled) == 1

        ny_atl = scheduled[0]
        assert ny_atl.away_team_raw == "New York Red Bulls"
        assert ny_atl.home_team_raw == "Atlanta United FC"
        assert ny_atl.stadium_raw == "Mercedes-Benz Stadium"

    def test_parses_venue_info(self):
        """Test venue information is extracted."""
        scraper = MLSScraper(season=2026)
        data = load_json_fixture(MLS_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        for game in games:
            assert game.stadium_raw is not None


class TestGameNormalization:
    """Test game normalization and canonical ID generation."""

    def test_normalizes_games_with_canonical_ids(self):
        """Test games are normalized with correct canonical IDs."""
        scraper = MLSScraper(season=2026)

        raw_games = [
            RawGameData(
                game_date=datetime(2026, 3, 15),
                home_team_raw="Los Angeles FC",
                away_team_raw="LA Galaxy",
                stadium_raw="BMO Stadium",
                home_score=3,
                away_score=2,
                status="final",
                source_url="http://example.com",
            )
        ]

        games, review_items = scraper._normalize_games(raw_games)

        assert len(games) == 1
        game = games[0]

        # Check canonical ID format
        assert game.id == "mls_2026_lag_lafc_0315"
        assert game.sport == "mls"
        assert game.season == 2026

        # Check team IDs
        assert game.home_team_id == "team_mls_lafc"
        assert game.away_team_id == "team_mls_lag"

        # Check scores preserved
        assert game.home_score == 3
        assert game.away_score == 2

    def test_creates_review_items_for_unresolved_teams(self):
        """Test review items are created for unresolved teams."""
        scraper = MLSScraper(season=2026)

        raw_games = [
            RawGameData(
                game_date=datetime(2026, 3, 15),
                home_team_raw="Unknown Team XYZ",
                away_team_raw="LA Galaxy",
                stadium_raw="BMO Stadium",
                status="scheduled",
            ),
        ]

        games, review_items = scraper._normalize_games(raw_games)

        # Game should not be created due to unresolved team
        assert len(games) == 0

        # But there should be a review item
|
||||
assert len(review_items) >= 1
|
||||
|
||||
|
||||
class TestTeamAndStadiumScraping:
|
||||
"""Test team and stadium data scraping."""
|
||||
|
||||
def test_scrapes_all_mls_teams(self):
|
||||
"""Test all MLS teams are returned."""
|
||||
scraper = MLSScraper(season=2026)
|
||||
teams = scraper.scrape_teams()
|
||||
|
||||
# MLS has 29+ teams
|
||||
assert len(teams) >= 29
|
||||
|
||||
# Check team IDs are unique
|
||||
team_ids = [t.id for t in teams]
|
||||
assert len(set(team_ids)) == len(teams)
|
||||
|
||||
# Check all teams have required fields
|
||||
for team in teams:
|
||||
assert team.id.startswith("team_mls_")
|
||||
assert team.sport == "mls"
|
||||
assert team.city
|
||||
assert team.name
|
||||
assert team.full_name
|
||||
assert team.abbreviation
|
||||
|
||||
def test_teams_have_conferences(self):
|
||||
"""Test teams have conference info."""
|
||||
scraper = MLSScraper(season=2026)
|
||||
teams = scraper.scrape_teams()
|
||||
|
||||
# Count teams by conference
|
||||
eastern = [t for t in teams if t.conference == "Eastern"]
|
||||
western = [t for t in teams if t.conference == "Western"]
|
||||
|
||||
# MLS has two conferences
|
||||
assert len(eastern) >= 14
|
||||
assert len(western) >= 14
|
||||
|
||||
def test_scrapes_all_mls_stadiums(self):
|
||||
"""Test all MLS stadiums are returned."""
|
||||
scraper = MLSScraper(season=2026)
|
||||
stadiums = scraper.scrape_stadiums()
|
||||
|
||||
# Should have stadiums for all teams
|
||||
assert len(stadiums) >= 29
|
||||
|
||||
# Check all stadiums have required fields
|
||||
for stadium in stadiums:
|
||||
assert stadium.id.startswith("stadium_mls_")
|
||||
assert stadium.sport == "mls"
|
||||
assert stadium.name
|
||||
assert stadium.city
|
||||
assert stadium.state
|
||||
assert stadium.country in ["USA", "Canada"]
|
||||
assert stadium.latitude != 0
|
||||
assert stadium.longitude != 0
|
||||
|
||||
|
||||
class TestScrapeFallback:
|
||||
"""Test multi-source fallback behavior."""
|
||||
|
||||
def test_falls_back_to_next_source_on_failure(self):
|
||||
"""Test scraper tries next source when first fails."""
|
||||
scraper = MLSScraper(season=2026)
|
||||
|
||||
with patch.object(scraper, '_scrape_espn') as mock_espn, \
|
||||
patch.object(scraper, '_scrape_fbref') as mock_fbref:
|
||||
|
||||
# Make ESPN fail
|
||||
mock_espn.side_effect = Exception("Connection failed")
|
||||
|
||||
# Make FBref return data
|
||||
mock_fbref.return_value = [
|
||||
RawGameData(
|
||||
game_date=datetime(2026, 3, 15),
|
||||
home_team_raw="Los Angeles FC",
|
||||
away_team_raw="LA Galaxy",
|
||||
stadium_raw="BMO Stadium",
|
||||
status="scheduled",
|
||||
)
|
||||
]
|
||||
|
||||
result = scraper.scrape_games()
|
||||
|
||||
assert result.success
|
||||
assert result.source == "fbref"
|
||||
assert mock_espn.called
|
||||
assert mock_fbref.called
|
||||
|
||||
|
||||
class TestSeasonMonths:
|
||||
"""Test season month calculation."""
|
||||
|
||||
def test_gets_correct_season_months(self):
|
||||
"""Test correct months are returned for MLS season."""
|
||||
scraper = MLSScraper(season=2026)
|
||||
months = scraper._get_season_months()
|
||||
|
||||
# MLS season is February-November
|
||||
assert len(months) == 10 # Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov
|
||||
|
||||
# Check first month is February of season year
|
||||
assert months[0] == (2026, 2)
|
||||
|
||||
# Check last month is November
|
||||
assert months[-1] == (2026, 11)
|
||||
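The normalization tests above assert canonical IDs like `mls_2026_lag_lafc_0315` and (in the NBA doubleheader test) `nba_2025_nyk_bos_0401_1`. As a rough sketch of that convention — sport, season, away abbreviation, home abbreviation, MMDD, with an optional trailing game number — here is a hypothetical helper; `build_game_id` is illustrative only and not part of the package:

```python
from datetime import datetime


def build_game_id(sport: str, season: int, away: str, home: str,
                  game_date: datetime, game_number: int = 0) -> str:
    """Hypothetical reconstruction of the canonical game ID format the
    tests exercise: <sport>_<season>_<away>_<home>_<MMDD>, plus a
    trailing _<n> when a doubleheader needs disambiguation."""
    base = f"{sport}_{season}_{away}_{home}_{game_date:%m%d}"
    return f"{base}_{game_number}" if game_number else base


print(build_game_id("mls", 2026, "lag", "lafc", datetime(2026, 3, 15)))
# mls_2026_lag_lafc_0315
```

The away-before-home ordering matches the IDs asserted in the normalization tests, even though the record itself stores home and away separately.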
428 sportstime_parser/tests/test_scrapers/test_nba.py Normal file
@@ -0,0 +1,428 @@
"""Tests for NBA scraper."""

import json
from datetime import datetime
from unittest.mock import MagicMock, patch

import pytest

from sportstime_parser.scrapers.nba import NBAScraper, create_nba_scraper
from sportstime_parser.scrapers.base import RawGameData
from sportstime_parser.tests.fixtures import (
    load_fixture,
    load_json_fixture,
    NBA_BR_OCTOBER_HTML,
    NBA_BR_EDGE_CASES_HTML,
    NBA_ESPN_SCOREBOARD_JSON,
)


class TestNBAScraperInit:
    """Test NBAScraper initialization."""

    def test_creates_scraper_with_season(self):
        """Test scraper initializes with correct season."""
        scraper = NBAScraper(season=2025)
        assert scraper.sport == "nba"
        assert scraper.season == 2025

    def test_factory_function_creates_scraper(self):
        """Test factory function creates correct scraper."""
        scraper = create_nba_scraper(season=2025)
        assert isinstance(scraper, NBAScraper)
        assert scraper.season == 2025

    def test_expected_game_count(self):
        """Test expected game count is correct for NBA."""
        scraper = NBAScraper(season=2025)
        assert scraper.expected_game_count == 1230

    def test_sources_in_priority_order(self):
        """Test sources are returned in correct priority order."""
        scraper = NBAScraper(season=2025)
        sources = scraper._get_sources()
        assert sources == ["basketball_reference", "espn", "cbs"]


class TestBasketballReferenceParsing:
    """Test Basketball-Reference HTML parsing."""

    def test_parses_completed_games(self):
        """Test parsing completed games with scores."""
        scraper = NBAScraper(season=2025)
        html = load_fixture(NBA_BR_OCTOBER_HTML)
        games = scraper._parse_basketball_reference(html, "http://example.com")

        # Should find all games in fixture
        assert len(games) == 7

        # Check first completed game
        completed_games = [g for g in games if g.status == "final"]
        assert len(completed_games) == 2

        # Boston @ Cleveland
        bos_cle = next(g for g in games if g.away_team_raw == "Boston Celtics")
        assert bos_cle.home_team_raw == "Cleveland Cavaliers"
        assert bos_cle.away_score == 112
        assert bos_cle.home_score == 108
        assert bos_cle.stadium_raw == "Rocket Mortgage FieldHouse"
        assert bos_cle.status == "final"

    def test_parses_scheduled_games(self):
        """Test parsing scheduled games without scores."""
        scraper = NBAScraper(season=2025)
        html = load_fixture(NBA_BR_OCTOBER_HTML)
        games = scraper._parse_basketball_reference(html, "http://example.com")

        scheduled_games = [g for g in games if g.status == "scheduled"]
        assert len(scheduled_games) == 5

        # Houston @ OKC
        hou_okc = next(g for g in scheduled_games if g.away_team_raw == "Houston Rockets")
        assert hou_okc.home_team_raw == "Oklahoma City Thunder"
        assert hou_okc.away_score is None
        assert hou_okc.home_score is None
        assert hou_okc.stadium_raw == "Paycom Center"

    def test_parses_game_dates_correctly(self):
        """Test game dates are parsed correctly."""
        scraper = NBAScraper(season=2025)
        html = load_fixture(NBA_BR_OCTOBER_HTML)
        games = scraper._parse_basketball_reference(html, "http://example.com")

        # Check first game date
        first_game = games[0]
        assert first_game.game_date.year == 2025
        assert first_game.game_date.month == 10
        assert first_game.game_date.day == 22

    def test_tracks_source_url(self):
        """Test source URL is tracked for all games."""
        scraper = NBAScraper(season=2025)
        html = load_fixture(NBA_BR_OCTOBER_HTML)
        source_url = "http://basketball-reference.com/test"
        games = scraper._parse_basketball_reference(html, source_url)

        for game in games:
            assert game.source_url == source_url


class TestBasketballReferenceEdgeCases:
    """Test edge case handling in Basketball-Reference parsing."""

    def test_parses_postponed_games(self):
        """Test postponed games are identified correctly."""
        scraper = NBAScraper(season=2025)
        html = load_fixture(NBA_BR_EDGE_CASES_HTML)
        games = scraper._parse_basketball_reference(html, "http://example.com")

        postponed = [g for g in games if g.status == "postponed"]
        assert len(postponed) == 1
        assert postponed[0].away_team_raw == "Los Angeles Lakers"
        assert postponed[0].home_team_raw == "Phoenix Suns"

    def test_parses_cancelled_games(self):
        """Test cancelled games are identified correctly."""
        scraper = NBAScraper(season=2025)
        html = load_fixture(NBA_BR_EDGE_CASES_HTML)
        games = scraper._parse_basketball_reference(html, "http://example.com")

        cancelled = [g for g in games if g.status == "cancelled"]
        assert len(cancelled) == 1
        assert cancelled[0].away_team_raw == "Portland Trail Blazers"

    def test_parses_neutral_site_games(self):
        """Test neutral site games are parsed."""
        scraper = NBAScraper(season=2025)
        html = load_fixture(NBA_BR_EDGE_CASES_HTML)
        games = scraper._parse_basketball_reference(html, "http://example.com")

        # Mexico City game
        mexico = next(g for g in games if g.stadium_raw == "Arena CDMX")
        assert mexico.away_team_raw == "Miami Heat"
        assert mexico.home_team_raw == "Washington Wizards"
        assert mexico.status == "final"

    def test_parses_overtime_games(self):
        """Test overtime games with high scores."""
        scraper = NBAScraper(season=2025)
        html = load_fixture(NBA_BR_EDGE_CASES_HTML)
        games = scraper._parse_basketball_reference(html, "http://example.com")

        # High scoring OT game
        ot_game = next(g for g in games if g.away_score == 147)
        assert ot_game.home_score == 150
        assert ot_game.status == "final"


class TestESPNParsing:
    """Test ESPN API response parsing."""

    def test_parses_completed_games(self):
        """Test parsing completed games from ESPN."""
        scraper = NBAScraper(season=2025)
        data = load_json_fixture(NBA_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        completed = [g for g in games if g.status == "final"]
        assert len(completed) == 2

        # Boston @ Cleveland
        bos_cle = next(g for g in completed if g.away_team_raw == "Boston Celtics")
        assert bos_cle.home_team_raw == "Cleveland Cavaliers"
        assert bos_cle.away_score == 112
        assert bos_cle.home_score == 108
        assert bos_cle.stadium_raw == "Rocket Mortgage FieldHouse"

    def test_parses_scheduled_games(self):
        """Test parsing scheduled games from ESPN."""
        scraper = NBAScraper(season=2025)
        data = load_json_fixture(NBA_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        scheduled = [g for g in games if g.status == "scheduled"]
        assert len(scheduled) == 1

        hou_okc = scheduled[0]
        assert hou_okc.away_team_raw == "Houston Rockets"
        assert hou_okc.home_team_raw == "Oklahoma City Thunder"
        assert hou_okc.stadium_raw == "Paycom Center"

    def test_parses_venue_info(self):
        """Test venue information is extracted."""
        scraper = NBAScraper(season=2025)
        data = load_json_fixture(NBA_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        # Check all games have venue info
        for game in games:
            assert game.stadium_raw is not None


class TestGameNormalization:
    """Test game normalization and canonical ID generation."""

    def test_normalizes_games_with_canonical_ids(self):
        """Test games are normalized with correct canonical IDs."""
        scraper = NBAScraper(season=2025)

        raw_games = [
            RawGameData(
                game_date=datetime(2025, 10, 22),
                home_team_raw="Cleveland Cavaliers",
                away_team_raw="Boston Celtics",
                stadium_raw="Rocket Mortgage FieldHouse",
                home_score=108,
                away_score=112,
                status="final",
                source_url="http://example.com",
            )
        ]

        games, review_items = scraper._normalize_games(raw_games)

        assert len(games) == 1
        game = games[0]

        # Check canonical ID format
        assert game.id == "nba_2025_bos_cle_1022"
        assert game.sport == "nba"
        assert game.season == 2025

        # Check team IDs
        assert game.home_team_id == "team_nba_cle"
        assert game.away_team_id == "team_nba_bos"

        # Check scores preserved
        assert game.home_score == 108
        assert game.away_score == 112

    def test_detects_doubleheaders(self):
        """Test doubleheaders get correct game numbers."""
        scraper = NBAScraper(season=2025)

        raw_games = [
            RawGameData(
                game_date=datetime(2025, 4, 1, 13, 0),
                home_team_raw="Boston Celtics",
                away_team_raw="New York Knicks",
                stadium_raw="TD Garden",
                status="final",
                home_score=105,
                away_score=98,
            ),
            RawGameData(
                game_date=datetime(2025, 4, 1, 19, 0),
                home_team_raw="Boston Celtics",
                away_team_raw="New York Knicks",
                stadium_raw="TD Garden",
                status="final",
                home_score=110,
                away_score=102,
            ),
        ]

        games, _ = scraper._normalize_games(raw_games)

        assert len(games) == 2
        game_numbers = sorted([g.game_number for g in games])
        assert game_numbers == [1, 2]

        # Check IDs include game number
        game_ids = sorted([g.id for g in games])
        assert game_ids == ["nba_2025_nyk_bos_0401_1", "nba_2025_nyk_bos_0401_2"]

    def test_creates_review_items_for_unresolved_teams(self):
        """Test review items are created for unresolved teams."""
        scraper = NBAScraper(season=2025)

        raw_games = [
            RawGameData(
                game_date=datetime(2025, 10, 22),
                home_team_raw="Unknown Team XYZ",
                away_team_raw="Boston Celtics",
                stadium_raw="TD Garden",
                status="scheduled",
            ),
        ]

        games, review_items = scraper._normalize_games(raw_games)

        # Game should not be created due to unresolved team
        assert len(games) == 0

        # But there should be a review item
        assert len(review_items) >= 1


class TestTeamAndStadiumScraping:
    """Test team and stadium data scraping."""

    def test_scrapes_all_nba_teams(self):
        """Test all 30 NBA teams are returned."""
        scraper = NBAScraper(season=2025)
        teams = scraper.scrape_teams()

        # 30 NBA teams
        assert len(teams) == 30

        # Check team IDs are unique
        team_ids = [t.id for t in teams]
        assert len(set(team_ids)) == 30

        # Check all teams have required fields
        for team in teams:
            assert team.id.startswith("team_nba_")
            assert team.sport == "nba"
            assert team.city
            assert team.name
            assert team.full_name
            assert team.abbreviation

    def test_teams_have_conferences_and_divisions(self):
        """Test teams have conference and division info."""
        scraper = NBAScraper(season=2025)
        teams = scraper.scrape_teams()

        # Count teams by conference
        eastern = [t for t in teams if t.conference == "Eastern"]
        western = [t for t in teams if t.conference == "Western"]

        assert len(eastern) == 15
        assert len(western) == 15

    def test_scrapes_all_nba_stadiums(self):
        """Test all NBA stadiums are returned."""
        scraper = NBAScraper(season=2025)
        stadiums = scraper.scrape_stadiums()

        # Should have stadiums for all teams
        assert len(stadiums) == 30

        # Check stadium IDs are unique
        stadium_ids = [s.id for s in stadiums]
        assert len(set(stadium_ids)) == 30

        # Check all stadiums have required fields
        for stadium in stadiums:
            assert stadium.id.startswith("stadium_nba_")
            assert stadium.sport == "nba"
            assert stadium.name
            assert stadium.city
            assert stadium.state
            assert stadium.country in ["USA", "Canada"]
            assert stadium.latitude != 0
            assert stadium.longitude != 0


class TestScrapeFallback:
    """Test multi-source fallback behavior."""

    def test_falls_back_to_next_source_on_failure(self):
        """Test scraper tries next source when first fails."""
        scraper = NBAScraper(season=2025)

        with patch.object(scraper, '_scrape_basketball_reference') as mock_br, \
             patch.object(scraper, '_scrape_espn') as mock_espn:

            # Make BR fail
            mock_br.side_effect = Exception("Connection failed")

            # Make ESPN return data
            mock_espn.return_value = [
                RawGameData(
                    game_date=datetime(2025, 10, 22),
                    home_team_raw="Cleveland Cavaliers",
                    away_team_raw="Boston Celtics",
                    stadium_raw="Rocket Mortgage FieldHouse",
                    status="scheduled",
                )
            ]

            result = scraper.scrape_games()

            # Should have succeeded with ESPN
            assert result.success
            assert result.source == "espn"
            assert mock_br.called
            assert mock_espn.called

    def test_returns_failure_when_all_sources_fail(self):
        """Test scraper returns failure when all sources fail."""
        scraper = NBAScraper(season=2025)

        with patch.object(scraper, '_scrape_basketball_reference') as mock_br, \
             patch.object(scraper, '_scrape_espn') as mock_espn, \
             patch.object(scraper, '_scrape_cbs') as mock_cbs:

            mock_br.side_effect = Exception("BR failed")
            mock_espn.side_effect = Exception("ESPN failed")
            mock_cbs.side_effect = Exception("CBS failed")

            result = scraper.scrape_games()

            assert not result.success
            assert "All sources failed" in result.error_message
            assert "CBS failed" in result.error_message


class TestSeasonMonths:
    """Test season month calculation."""

    def test_gets_correct_season_months(self):
        """Test correct months are returned for NBA season."""
        scraper = NBAScraper(season=2025)
        months = scraper._get_season_months()

        # NBA season is Oct-Jun
        assert len(months) == 9  # Oct, Nov, Dec, Jan, Feb, Mar, Apr, May, Jun

        # Check first month is Oct of season year
        assert months[0] == (2025, 10)

        # Check last month is Jun of following year
        assert months[-1] == (2026, 6)

        # Check transition to new year
        assert months[2] == (2025, 12)  # December
        assert months[3] == (2026, 1)  # January
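The season-month assertions above imply a (year, month) sequence that rolls over the calendar year (October 2025 through June 2026 for the NBA). A minimal sketch of that calculation, with illustrative names rather than the package's actual API:

```python
def season_months(start_year: int, start_month: int, n_months: int) -> list[tuple[int, int]]:
    """Return n_months of (year, month) pairs starting at (start_year,
    start_month), rolling into the next calendar year after December."""
    months, year, month = [], start_year, start_month
    for _ in range(n_months):
        months.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


nba = season_months(2025, 10, 9)  # Oct 2025 .. Jun 2026
print(nba[0], nba[-1])
# (2025, 10) (2026, 6)
```

The same helper reproduces the MLS expectation: `season_months(2026, 2, 10)` runs from (2026, 2) to (2026, 11) with no year rollover.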
310 sportstime_parser/tests/test_scrapers/test_nfl.py Normal file
@@ -0,0 +1,310 @@
"""Tests for NFL scraper."""

from datetime import datetime
from unittest.mock import patch

import pytest

from sportstime_parser.scrapers.nfl import NFLScraper, create_nfl_scraper
from sportstime_parser.scrapers.base import RawGameData
from sportstime_parser.tests.fixtures import (
    load_json_fixture,
    NFL_ESPN_SCOREBOARD_JSON,
)


class TestNFLScraperInit:
    """Test NFLScraper initialization."""

    def test_creates_scraper_with_season(self):
        """Test scraper initializes with correct season."""
        scraper = NFLScraper(season=2025)
        assert scraper.sport == "nfl"
        assert scraper.season == 2025

    def test_factory_function_creates_scraper(self):
        """Test factory function creates correct scraper."""
        scraper = create_nfl_scraper(season=2025)
        assert isinstance(scraper, NFLScraper)
        assert scraper.season == 2025

    def test_expected_game_count(self):
        """Test expected game count is correct for NFL."""
        scraper = NFLScraper(season=2025)
        assert scraper.expected_game_count == 272

    def test_sources_in_priority_order(self):
        """Test sources are returned in correct priority order."""
        scraper = NFLScraper(season=2025)
        sources = scraper._get_sources()
        assert sources == ["espn", "pro_football_reference", "cbs"]


class TestESPNParsing:
    """Test ESPN API response parsing."""

    def test_parses_completed_games(self):
        """Test parsing completed games from ESPN."""
        scraper = NFLScraper(season=2025)
        data = load_json_fixture(NFL_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        completed = [g for g in games if g.status == "final"]
        assert len(completed) == 2

        # Chiefs @ Ravens
        kc_bal = next(g for g in completed if g.away_team_raw == "Kansas City Chiefs")
        assert kc_bal.home_team_raw == "Baltimore Ravens"
        assert kc_bal.away_score == 27
        assert kc_bal.home_score == 20
        assert kc_bal.stadium_raw == "M&T Bank Stadium"

    def test_parses_scheduled_games(self):
        """Test parsing scheduled games from ESPN."""
        scraper = NFLScraper(season=2025)
        data = load_json_fixture(NFL_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        scheduled = [g for g in games if g.status == "scheduled"]
        assert len(scheduled) == 1

        dal_cle = scheduled[0]
        assert dal_cle.away_team_raw == "Dallas Cowboys"
        assert dal_cle.home_team_raw == "Cleveland Browns"
        assert dal_cle.stadium_raw == "Cleveland Browns Stadium"

    def test_parses_venue_info(self):
        """Test venue information is extracted."""
        scraper = NFLScraper(season=2025)
        data = load_json_fixture(NFL_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        for game in games:
            assert game.stadium_raw is not None


class TestGameNormalization:
    """Test game normalization and canonical ID generation."""

    def test_normalizes_games_with_canonical_ids(self):
        """Test games are normalized with correct canonical IDs."""
        scraper = NFLScraper(season=2025)

        raw_games = [
            RawGameData(
                game_date=datetime(2025, 9, 7),
                home_team_raw="Baltimore Ravens",
                away_team_raw="Kansas City Chiefs",
                stadium_raw="M&T Bank Stadium",
                home_score=20,
                away_score=27,
                status="final",
                source_url="http://example.com",
            )
        ]

        games, review_items = scraper._normalize_games(raw_games)

        assert len(games) == 1
        game = games[0]

        # Check canonical ID format
        assert game.id == "nfl_2025_kc_bal_0907"
        assert game.sport == "nfl"
        assert game.season == 2025

        # Check team IDs
        assert game.home_team_id == "team_nfl_bal"
        assert game.away_team_id == "team_nfl_kc"

        # Check scores preserved
        assert game.home_score == 20
        assert game.away_score == 27

    def test_creates_review_items_for_unresolved_teams(self):
        """Test review items are created for unresolved teams."""
        scraper = NFLScraper(season=2025)

        raw_games = [
            RawGameData(
                game_date=datetime(2025, 9, 7),
                home_team_raw="Unknown Team XYZ",
                away_team_raw="Kansas City Chiefs",
                stadium_raw="Arrowhead Stadium",
                status="scheduled",
            ),
        ]

        games, review_items = scraper._normalize_games(raw_games)

        # Game should not be created due to unresolved team
        assert len(games) == 0

        # But there should be a review item
        assert len(review_items) >= 1


class TestTeamAndStadiumScraping:
    """Test team and stadium data scraping."""

    def test_scrapes_all_nfl_teams(self):
        """Test all 32 NFL teams are returned."""
        scraper = NFLScraper(season=2025)
        teams = scraper.scrape_teams()

        # 32 NFL teams
        assert len(teams) == 32

        # Check team IDs are unique
        team_ids = [t.id for t in teams]
        assert len(set(team_ids)) == 32

        # Check all teams have required fields
        for team in teams:
            assert team.id.startswith("team_nfl_")
            assert team.sport == "nfl"
            assert team.city
            assert team.name
            assert team.full_name
            assert team.abbreviation

    def test_teams_have_conferences_and_divisions(self):
        """Test teams have conference and division info."""
        scraper = NFLScraper(season=2025)
        teams = scraper.scrape_teams()

        # Count teams by conference
        afc = [t for t in teams if t.conference == "AFC"]
        nfc = [t for t in teams if t.conference == "NFC"]

        assert len(afc) == 16
        assert len(nfc) == 16

    def test_scrapes_all_nfl_stadiums(self):
        """Test all NFL stadiums are returned."""
        scraper = NFLScraper(season=2025)
        stadiums = scraper.scrape_stadiums()

        # Should have stadiums for all teams (some share)
        assert len(stadiums) >= 30

        # Check all stadiums have required fields
        for stadium in stadiums:
            assert stadium.id.startswith("stadium_nfl_")
            assert stadium.sport == "nfl"
            assert stadium.name
            assert stadium.city
            assert stadium.state
            assert stadium.country == "USA"
            assert stadium.latitude != 0
            assert stadium.longitude != 0


class TestScrapeFallback:
    """Test multi-source fallback behavior."""

    def test_falls_back_to_next_source_on_failure(self):
        """Test scraper tries next source when first fails."""
        scraper = NFLScraper(season=2025)

        with patch.object(scraper, '_scrape_espn') as mock_espn, \
             patch.object(scraper, '_scrape_pro_football_reference') as mock_pfr:

            # Make ESPN fail
            mock_espn.side_effect = Exception("Connection failed")

            # Make PFR return data
            mock_pfr.return_value = [
                RawGameData(
                    game_date=datetime(2025, 9, 7),
                    home_team_raw="Baltimore Ravens",
                    away_team_raw="Kansas City Chiefs",
                    stadium_raw="M&T Bank Stadium",
                    status="scheduled",
                )
            ]

            result = scraper.scrape_games()

            assert result.success
            assert result.source == "pro_football_reference"
            assert mock_espn.called
            assert mock_pfr.called


class TestSeasonMonths:
    """Test season month calculation."""

    def test_gets_correct_season_months(self):
        """Test correct months are returned for NFL season."""
        scraper = NFLScraper(season=2025)
        months = scraper._get_season_months()

        # NFL season is September-February
        assert len(months) == 6  # Sep, Oct, Nov, Dec, Jan, Feb

        # Check first month is September of season year
        assert months[0] == (2025, 9)

        # Check last month is February of following year
        assert months[-1] == (2026, 2)

        # Check transition to new year
        assert months[3] == (2025, 12)  # December
        assert months[4] == (2026, 1)  # January


class TestInternationalFiltering:
    """Test international game filtering.

    Note: Filtering happens in _parse_espn_response, not _normalize_games.
    """

    def test_filters_london_games_during_parsing(self):
        """Test London games are filtered out during ESPN parsing."""
        scraper = NFLScraper(season=2025)

        # Create ESPN-like data with London game
        espn_data = {
            "events": [
                {
                    "date": "2025-10-15T09:30:00Z",
                    "competitions": [
                        {
                            "neutralSite": True,
                            "venue": {
                                "fullName": "London Stadium",
                                "address": {"city": "London", "country": "UK"},
                            },
                            "competitors": [
                                {"homeAway": "home", "team": {"displayName": "Jacksonville Jaguars"}},
                                {"homeAway": "away", "team": {"displayName": "Buffalo Bills"}},
                            ],
                        }
                    ],
                }
            ]
        }

        games = scraper._parse_espn_response(espn_data, "http://espn.com/api")

        # London game should be filtered
        assert len(games) == 0

    def test_keeps_us_games(self):
        """Test US games are kept."""
        scraper = NFLScraper(season=2025)

        raw_games = [
            RawGameData(
                game_date=datetime(2025, 9, 7),
                home_team_raw="Baltimore Ravens",
                away_team_raw="Kansas City Chiefs",
                stadium_raw="M&T Bank Stadium",
                status="scheduled",
            ),
        ]

        games, _ = scraper._normalize_games(raw_games)

        assert len(games) == 1
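The international-filtering test above feeds a neutral-site London event through `_parse_espn_response` and expects it to be dropped. One plausible predicate for that check, sketched with field names taken from the ESPN-shaped fixture in the test (this is an assumed reconstruction, not the scraper's actual implementation):

```python
def is_international(event: dict) -> bool:
    """Return True for neutral-site games played outside the US/Canada,
    based on the venue address in an ESPN-style event payload.
    (Hypothetical helper; field paths mirror the test fixture above.)"""
    comp = event.get("competitions", [{}])[0]
    country = comp.get("venue", {}).get("address", {}).get("country", "USA")
    return bool(comp.get("neutralSite")) and country not in ("USA", "Canada")


london = {"competitions": [{"neutralSite": True,
                            "venue": {"fullName": "London Stadium",
                                      "address": {"city": "London", "country": "UK"}}}]}
print(is_international(london))
# True
```

Filtering at parse time (rather than in `_normalize_games`) keeps downstream normalization and review-item logic free of venue geography concerns, which is what the class docstring's note points out.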
317 sportstime_parser/tests/test_scrapers/test_nhl.py Normal file
@@ -0,0 +1,317 @@
"""Tests for NHL scraper."""

from datetime import datetime
from unittest.mock import patch

import pytest

from sportstime_parser.scrapers.nhl import NHLScraper, create_nhl_scraper
from sportstime_parser.scrapers.base import RawGameData
from sportstime_parser.tests.fixtures import (
    load_json_fixture,
    NHL_ESPN_SCOREBOARD_JSON,
)


class TestNHLScraperInit:
    """Test NHLScraper initialization."""

    def test_creates_scraper_with_season(self):
        """Test scraper initializes with correct season."""
        scraper = NHLScraper(season=2025)
        assert scraper.sport == "nhl"
        assert scraper.season == 2025

    def test_factory_function_creates_scraper(self):
        """Test factory function creates correct scraper."""
        scraper = create_nhl_scraper(season=2025)
        assert isinstance(scraper, NHLScraper)
        assert scraper.season == 2025

    def test_expected_game_count(self):
        """Test expected game count is correct for NHL."""
        scraper = NHLScraper(season=2025)
        assert scraper.expected_game_count == 1312

    def test_sources_in_priority_order(self):
        """Test sources are returned in correct priority order."""
        scraper = NHLScraper(season=2025)
        sources = scraper._get_sources()
        assert sources == ["hockey_reference", "nhl_api", "espn"]


class TestESPNParsing:
    """Test ESPN API response parsing."""

    def test_parses_completed_games(self):
        """Test parsing completed games from ESPN."""
        scraper = NHLScraper(season=2025)
        data = load_json_fixture(NHL_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        completed = [g for g in games if g.status == "final"]
        assert len(completed) == 2

        # Penguins @ Bruins
        pit_bos = next(g for g in completed if g.away_team_raw == "Pittsburgh Penguins")
        assert pit_bos.home_team_raw == "Boston Bruins"
        assert pit_bos.away_score == 2
        assert pit_bos.home_score == 4
        assert pit_bos.stadium_raw == "TD Garden"

    def test_parses_scheduled_games(self):
        """Test parsing scheduled games from ESPN."""
        scraper = NHLScraper(season=2025)
        data = load_json_fixture(NHL_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")
|
||||
|
||||
scheduled = [g for g in games if g.status == "scheduled"]
|
||||
assert len(scheduled) == 1
|
||||
|
||||
vgk_lak = scheduled[0]
|
||||
assert vgk_lak.away_team_raw == "Vegas Golden Knights"
|
||||
assert vgk_lak.home_team_raw == "Los Angeles Kings"
|
||||
assert vgk_lak.stadium_raw == "Crypto.com Arena"
|
||||
|
||||
def test_parses_venue_info(self):
|
||||
"""Test venue information is extracted."""
|
||||
scraper = NHLScraper(season=2025)
|
||||
data = load_json_fixture(NHL_ESPN_SCOREBOARD_JSON)
|
||||
games = scraper._parse_espn_response(data, "http://espn.com/api")
|
||||
|
||||
for game in games:
|
||||
assert game.stadium_raw is not None
|
||||
|
||||
|
||||
class TestGameNormalization:
|
||||
"""Test game normalization and canonical ID generation."""
|
||||
|
||||
def test_normalizes_games_with_canonical_ids(self):
|
||||
"""Test games are normalized with correct canonical IDs."""
|
||||
scraper = NHLScraper(season=2025)
|
||||
|
||||
raw_games = [
|
||||
RawGameData(
|
||||
game_date=datetime(2025, 10, 8),
|
||||
home_team_raw="Boston Bruins",
|
||||
away_team_raw="Pittsburgh Penguins",
|
||||
stadium_raw="TD Garden",
|
||||
home_score=4,
|
||||
away_score=2,
|
||||
status="final",
|
||||
source_url="http://example.com",
|
||||
)
|
||||
]
|
||||
|
||||
games, review_items = scraper._normalize_games(raw_games)
|
||||
|
||||
assert len(games) == 1
|
||||
game = games[0]
|
||||
|
||||
# Check canonical ID format
|
||||
assert game.id == "nhl_2025_pit_bos_1008"
|
||||
assert game.sport == "nhl"
|
||||
assert game.season == 2025
|
||||
|
||||
# Check team IDs
|
||||
assert game.home_team_id == "team_nhl_bos"
|
||||
assert game.away_team_id == "team_nhl_pit"
|
||||
|
||||
# Check scores preserved
|
||||
assert game.home_score == 4
|
||||
assert game.away_score == 2
|
||||
|
||||
def test_creates_review_items_for_unresolved_teams(self):
|
||||
"""Test review items are created for unresolved teams."""
|
||||
scraper = NHLScraper(season=2025)
|
||||
|
||||
raw_games = [
|
||||
RawGameData(
|
||||
game_date=datetime(2025, 10, 8),
|
||||
home_team_raw="Unknown Team XYZ",
|
||||
away_team_raw="Boston Bruins",
|
||||
stadium_raw="TD Garden",
|
||||
status="scheduled",
|
||||
),
|
||||
]
|
||||
|
||||
games, review_items = scraper._normalize_games(raw_games)
|
||||
|
||||
# Game should not be created due to unresolved team
|
||||
assert len(games) == 0
|
||||
|
||||
# But there should be a review item
|
||||
assert len(review_items) >= 1
|
||||
|
||||
|
||||
class TestTeamAndStadiumScraping:
|
||||
"""Test team and stadium data scraping."""
|
||||
|
||||
def test_scrapes_all_nhl_teams(self):
|
||||
"""Test all 32 NHL teams are returned."""
|
||||
scraper = NHLScraper(season=2025)
|
||||
teams = scraper.scrape_teams()
|
||||
|
||||
# 32 NHL teams
|
||||
assert len(teams) == 32
|
||||
|
||||
# Check team IDs are unique
|
||||
team_ids = [t.id for t in teams]
|
||||
assert len(set(team_ids)) == 32
|
||||
|
||||
# Check all teams have required fields
|
||||
for team in teams:
|
||||
assert team.id.startswith("team_nhl_")
|
||||
assert team.sport == "nhl"
|
||||
assert team.city
|
||||
assert team.name
|
||||
assert team.full_name
|
||||
assert team.abbreviation
|
||||
|
||||
def test_teams_have_conferences_and_divisions(self):
|
||||
"""Test teams have conference and division info."""
|
||||
scraper = NHLScraper(season=2025)
|
||||
teams = scraper.scrape_teams()
|
||||
|
||||
# Count teams by conference
|
||||
eastern = [t for t in teams if t.conference == "Eastern"]
|
||||
western = [t for t in teams if t.conference == "Western"]
|
||||
|
||||
assert len(eastern) == 16
|
||||
assert len(western) == 16
|
||||
|
||||
def test_scrapes_all_nhl_stadiums(self):
|
||||
"""Test all NHL stadiums are returned."""
|
||||
scraper = NHLScraper(season=2025)
|
||||
stadiums = scraper.scrape_stadiums()
|
||||
|
||||
# Should have stadiums for all teams
|
||||
assert len(stadiums) == 32
|
||||
|
||||
# Check stadium IDs are unique
|
||||
stadium_ids = [s.id for s in stadiums]
|
||||
assert len(set(stadium_ids)) == 32
|
||||
|
||||
# Check all stadiums have required fields
|
||||
for stadium in stadiums:
|
||||
assert stadium.id.startswith("stadium_nhl_")
|
||||
assert stadium.sport == "nhl"
|
||||
assert stadium.name
|
||||
assert stadium.city
|
||||
assert stadium.state
|
||||
assert stadium.country in ["USA", "Canada"]
|
||||
assert stadium.latitude != 0
|
||||
assert stadium.longitude != 0
|
||||
|
||||
|
||||
class TestScrapeFallback:
|
||||
"""Test multi-source fallback behavior."""
|
||||
|
||||
def test_falls_back_to_next_source_on_failure(self):
|
||||
"""Test scraper tries next source when first fails."""
|
||||
scraper = NHLScraper(season=2025)
|
||||
|
||||
with patch.object(scraper, '_scrape_hockey_reference') as mock_hr, \
|
||||
patch.object(scraper, '_scrape_nhl_api') as mock_nhl, \
|
||||
patch.object(scraper, '_scrape_espn') as mock_espn:
|
||||
|
||||
# Make HR and NHL API fail
|
||||
mock_hr.side_effect = Exception("Connection failed")
|
||||
mock_nhl.side_effect = Exception("API error")
|
||||
|
||||
# Make ESPN return data
|
||||
mock_espn.return_value = [
|
||||
RawGameData(
|
||||
game_date=datetime(2025, 10, 8),
|
||||
home_team_raw="Boston Bruins",
|
||||
away_team_raw="Pittsburgh Penguins",
|
||||
stadium_raw="TD Garden",
|
||||
status="scheduled",
|
||||
)
|
||||
]
|
||||
|
||||
result = scraper.scrape_games()
|
||||
|
||||
assert result.success
|
||||
assert result.source == "espn"
|
||||
assert mock_hr.called
|
||||
assert mock_nhl.called
|
||||
assert mock_espn.called
|
||||
|
||||
|
||||
class TestSeasonMonths:
|
||||
"""Test season month calculation."""
|
||||
|
||||
def test_gets_correct_season_months(self):
|
||||
"""Test correct months are returned for NHL season."""
|
||||
scraper = NHLScraper(season=2025)
|
||||
months = scraper._get_season_months()
|
||||
|
||||
# NHL season is October-June
|
||||
assert len(months) == 9 # Oct, Nov, Dec, Jan, Feb, Mar, Apr, May, Jun
|
||||
|
||||
# Check first month is October of season year
|
||||
assert months[0] == (2025, 10)
|
||||
|
||||
# Check last month is June of following year
|
||||
assert months[-1] == (2026, 6)
|
||||
|
||||
# Check transition to new year
|
||||
assert months[2] == (2025, 12) # December
|
||||
assert months[3] == (2026, 1) # January
|
||||
|
||||
|
||||
class TestInternationalFiltering:
|
||||
"""Test international game filtering.
|
||||
|
||||
Note: Filtering happens in _parse_espn_response, not _normalize_games.
|
||||
"""
|
||||
|
||||
def test_filters_european_games_during_parsing(self):
|
||||
"""Test European games are filtered out during ESPN parsing."""
|
||||
scraper = NHLScraper(season=2025)
|
||||
|
||||
# Create ESPN-like data with Prague game (Global Series)
|
||||
espn_data = {
|
||||
"events": [
|
||||
{
|
||||
"date": "2025-10-10T18:00:00Z",
|
||||
"competitions": [
|
||||
{
|
||||
"neutralSite": True,
|
||||
"venue": {
|
||||
"fullName": "O2 Arena, Prague",
|
||||
"address": {"city": "Prague", "country": "Czech Republic"},
|
||||
},
|
||||
"competitors": [
|
||||
{"homeAway": "home", "team": {"displayName": "Florida Panthers"}},
|
||||
{"homeAway": "away", "team": {"displayName": "Dallas Stars"}},
|
||||
],
|
||||
}
|
||||
],
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
games = scraper._parse_espn_response(espn_data, "http://espn.com/api")
|
||||
|
||||
# Prague game should be filtered
|
||||
assert len(games) == 0
|
||||
|
||||
def test_keeps_north_american_games(self):
|
||||
"""Test North American games are kept."""
|
||||
scraper = NHLScraper(season=2025)
|
||||
|
||||
raw_games = [
|
||||
RawGameData(
|
||||
game_date=datetime(2025, 10, 8),
|
||||
home_team_raw="Boston Bruins",
|
||||
away_team_raw="Pittsburgh Penguins",
|
||||
stadium_raw="TD Garden",
|
||||
status="scheduled",
|
||||
),
|
||||
]
|
||||
|
||||
games, _ = scraper._normalize_games(raw_games)
|
||||
|
||||
assert len(games) == 1
|
||||
226
sportstime_parser/tests/test_scrapers/test_nwsl.py
Normal file
@@ -0,0 +1,226 @@
"""Tests for NWSL scraper."""

from datetime import datetime
from unittest.mock import patch

import pytest

from sportstime_parser.scrapers.nwsl import NWSLScraper, create_nwsl_scraper
from sportstime_parser.scrapers.base import RawGameData
from sportstime_parser.tests.fixtures import (
    load_json_fixture,
    NWSL_ESPN_SCOREBOARD_JSON,
)


class TestNWSLScraperInit:
    """Test NWSLScraper initialization."""

    def test_creates_scraper_with_season(self):
        """Test scraper initializes with correct season."""
        scraper = NWSLScraper(season=2026)
        assert scraper.sport == "nwsl"
        assert scraper.season == 2026

    def test_factory_function_creates_scraper(self):
        """Test factory function creates correct scraper."""
        scraper = create_nwsl_scraper(season=2026)
        assert isinstance(scraper, NWSLScraper)
        assert scraper.season == 2026

    def test_expected_game_count(self):
        """Test expected game count is correct for NWSL."""
        scraper = NWSLScraper(season=2026)
        assert scraper.expected_game_count == 182

    def test_sources_in_priority_order(self):
        """Test sources are returned in correct priority order."""
        scraper = NWSLScraper(season=2026)
        sources = scraper._get_sources()
        assert sources == ["espn"]


class TestESPNParsing:
    """Test ESPN API response parsing."""

    def test_parses_completed_games(self):
        """Test parsing completed games from ESPN."""
        scraper = NWSLScraper(season=2026)
        data = load_json_fixture(NWSL_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        completed = [g for g in games if g.status == "final"]
        assert len(completed) == 2

        # Angel City @ Thorns
        la_por = next(g for g in completed if g.away_team_raw == "Angel City FC")
        assert la_por.home_team_raw == "Portland Thorns FC"
        assert la_por.away_score == 1
        assert la_por.home_score == 2
        assert la_por.stadium_raw == "Providence Park"

    def test_parses_scheduled_games(self):
        """Test parsing scheduled games from ESPN."""
        scraper = NWSLScraper(season=2026)
        data = load_json_fixture(NWSL_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        scheduled = [g for g in games if g.status == "scheduled"]
        assert len(scheduled) == 1

        sd_bay = scheduled[0]
        assert sd_bay.away_team_raw == "San Diego Wave FC"
        assert sd_bay.home_team_raw == "Bay FC"
        assert sd_bay.stadium_raw == "PayPal Park"

    def test_parses_venue_info(self):
        """Test venue information is extracted."""
        scraper = NWSLScraper(season=2026)
        data = load_json_fixture(NWSL_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        for game in games:
            assert game.stadium_raw is not None


class TestGameNormalization:
    """Test game normalization and canonical ID generation."""

    def test_normalizes_games_with_canonical_ids(self):
        """Test games are normalized with correct canonical IDs."""
        scraper = NWSLScraper(season=2026)

        raw_games = [
            RawGameData(
                game_date=datetime(2026, 4, 10),
                home_team_raw="Portland Thorns FC",
                away_team_raw="Angel City FC",
                stadium_raw="Providence Park",
                home_score=2,
                away_score=1,
                status="final",
                source_url="http://example.com",
            )
        ]

        games, review_items = scraper._normalize_games(raw_games)

        assert len(games) == 1
        game = games[0]

        # Check canonical ID format
        assert game.id == "nwsl_2026_anf_por_0410"
        assert game.sport == "nwsl"
        assert game.season == 2026

        # Check team IDs
        assert game.home_team_id == "team_nwsl_por"
        assert game.away_team_id == "team_nwsl_anf"

        # Check scores preserved
        assert game.home_score == 2
        assert game.away_score == 1

    def test_creates_review_items_for_unresolved_teams(self):
        """Test review items are created for unresolved teams."""
        scraper = NWSLScraper(season=2026)

        raw_games = [
            RawGameData(
                game_date=datetime(2026, 4, 10),
                home_team_raw="Unknown Team XYZ",
                away_team_raw="Portland Thorns FC",
                stadium_raw="Providence Park",
                status="scheduled",
            ),
        ]

        games, review_items = scraper._normalize_games(raw_games)

        # Game should not be created due to unresolved team
        assert len(games) == 0

        # But there should be a review item
        assert len(review_items) >= 1


class TestTeamAndStadiumScraping:
    """Test team and stadium data scraping."""

    def test_scrapes_all_nwsl_teams(self):
        """Test all NWSL teams are returned."""
        scraper = NWSLScraper(season=2026)
        teams = scraper.scrape_teams()

        # NWSL has 14 teams
        assert len(teams) == 14

        # Check team IDs are unique
        team_ids = [t.id for t in teams]
        assert len(set(team_ids)) == 14

        # Check all teams have required fields
        for team in teams:
            assert team.id.startswith("team_nwsl_")
            assert team.sport == "nwsl"
            assert team.city
            assert team.name
            assert team.full_name
            assert team.abbreviation

    def test_scrapes_all_nwsl_stadiums(self):
        """Test all NWSL stadiums are returned."""
        scraper = NWSLScraper(season=2026)
        stadiums = scraper.scrape_stadiums()

        # Should have stadiums for all teams
        assert len(stadiums) == 14

        # Check stadium IDs are unique
        stadium_ids = [s.id for s in stadiums]
        assert len(set(stadium_ids)) == 14

        # Check all stadiums have required fields
        for stadium in stadiums:
            assert stadium.id.startswith("stadium_nwsl_")
            assert stadium.sport == "nwsl"
            assert stadium.name
            assert stadium.city
            assert stadium.state
            assert stadium.country == "USA"
            assert stadium.latitude != 0
            assert stadium.longitude != 0


class TestScrapeFallback:
    """Test fallback behavior (NWSL only has ESPN)."""

    def test_returns_failure_when_espn_fails(self):
        """Test scraper returns failure when ESPN fails."""
        scraper = NWSLScraper(season=2026)

        with patch.object(scraper, '_scrape_espn') as mock_espn:
            mock_espn.side_effect = Exception("ESPN failed")

            result = scraper.scrape_games()

            assert not result.success
            assert "All sources failed" in result.error_message


class TestSeasonMonths:
    """Test season month calculation."""

    def test_gets_correct_season_months(self):
        """Test correct months are returned for NWSL season."""
        scraper = NWSLScraper(season=2026)
        months = scraper._get_season_months()

        # NWSL season is March-November
        assert len(months) == 9  # Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov

        # Check first month is March of season year
        assert months[0] == (2026, 3)

        # Check last month is November
        assert months[-1] == (2026, 11)
226
sportstime_parser/tests/test_scrapers/test_wnba.py
Normal file
@@ -0,0 +1,226 @@
"""Tests for WNBA scraper."""

from datetime import datetime
from unittest.mock import patch

import pytest

from sportstime_parser.scrapers.wnba import WNBAScraper, create_wnba_scraper
from sportstime_parser.scrapers.base import RawGameData
from sportstime_parser.tests.fixtures import (
    load_json_fixture,
    WNBA_ESPN_SCOREBOARD_JSON,
)


class TestWNBAScraperInit:
    """Test WNBAScraper initialization."""

    def test_creates_scraper_with_season(self):
        """Test scraper initializes with correct season."""
        scraper = WNBAScraper(season=2026)
        assert scraper.sport == "wnba"
        assert scraper.season == 2026

    def test_factory_function_creates_scraper(self):
        """Test factory function creates correct scraper."""
        scraper = create_wnba_scraper(season=2026)
        assert isinstance(scraper, WNBAScraper)
        assert scraper.season == 2026

    def test_expected_game_count(self):
        """Test expected game count is correct for WNBA."""
        scraper = WNBAScraper(season=2026)
        assert scraper.expected_game_count == 220

    def test_sources_in_priority_order(self):
        """Test sources are returned in correct priority order."""
        scraper = WNBAScraper(season=2026)
        sources = scraper._get_sources()
        assert sources == ["espn"]


class TestESPNParsing:
    """Test ESPN API response parsing."""

    def test_parses_completed_games(self):
        """Test parsing completed games from ESPN."""
        scraper = WNBAScraper(season=2026)
        data = load_json_fixture(WNBA_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        completed = [g for g in games if g.status == "final"]
        assert len(completed) == 2

        # Aces @ Liberty
        lv_ny = next(g for g in completed if g.away_team_raw == "Las Vegas Aces")
        assert lv_ny.home_team_raw == "New York Liberty"
        assert lv_ny.away_score == 88
        assert lv_ny.home_score == 92
        assert lv_ny.stadium_raw == "Barclays Center"

    def test_parses_scheduled_games(self):
        """Test parsing scheduled games from ESPN."""
        scraper = WNBAScraper(season=2026)
        data = load_json_fixture(WNBA_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        scheduled = [g for g in games if g.status == "scheduled"]
        assert len(scheduled) == 1

        phx_sea = scheduled[0]
        assert phx_sea.away_team_raw == "Phoenix Mercury"
        assert phx_sea.home_team_raw == "Seattle Storm"
        assert phx_sea.stadium_raw == "Climate Pledge Arena"

    def test_parses_venue_info(self):
        """Test venue information is extracted."""
        scraper = WNBAScraper(season=2026)
        data = load_json_fixture(WNBA_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        for game in games:
            assert game.stadium_raw is not None


class TestGameNormalization:
    """Test game normalization and canonical ID generation."""

    def test_normalizes_games_with_canonical_ids(self):
        """Test games are normalized with correct canonical IDs."""
        scraper = WNBAScraper(season=2026)

        raw_games = [
            RawGameData(
                game_date=datetime(2026, 5, 20),
                home_team_raw="New York Liberty",
                away_team_raw="Las Vegas Aces",
                stadium_raw="Barclays Center",
                home_score=92,
                away_score=88,
                status="final",
                source_url="http://example.com",
            )
        ]

        games, review_items = scraper._normalize_games(raw_games)

        assert len(games) == 1
        game = games[0]

        # Check canonical ID format
        assert game.id == "wnba_2026_lv_ny_0520"
        assert game.sport == "wnba"
        assert game.season == 2026

        # Check team IDs
        assert game.home_team_id == "team_wnba_ny"
        assert game.away_team_id == "team_wnba_lv"

        # Check scores preserved
        assert game.home_score == 92
        assert game.away_score == 88

    def test_creates_review_items_for_unresolved_teams(self):
        """Test review items are created for unresolved teams."""
        scraper = WNBAScraper(season=2026)

        raw_games = [
            RawGameData(
                game_date=datetime(2026, 5, 20),
                home_team_raw="Unknown Team XYZ",
                away_team_raw="Las Vegas Aces",
                stadium_raw="Barclays Center",
                status="scheduled",
            ),
        ]

        games, review_items = scraper._normalize_games(raw_games)

        # Game should not be created due to unresolved team
        assert len(games) == 0

        # But there should be a review item
        assert len(review_items) >= 1


class TestTeamAndStadiumScraping:
    """Test team and stadium data scraping."""

    def test_scrapes_all_wnba_teams(self):
        """Test all WNBA teams are returned."""
        scraper = WNBAScraper(season=2026)
        teams = scraper.scrape_teams()

        # WNBA has 13 teams (including Golden State Valkyries)
        assert len(teams) == 13

        # Check team IDs are unique
        team_ids = [t.id for t in teams]
        assert len(set(team_ids)) == 13

        # Check all teams have required fields
        for team in teams:
            assert team.id.startswith("team_wnba_")
            assert team.sport == "wnba"
            assert team.city
            assert team.name
            assert team.full_name
            assert team.abbreviation

    def test_scrapes_all_wnba_stadiums(self):
        """Test all WNBA stadiums are returned."""
        scraper = WNBAScraper(season=2026)
        stadiums = scraper.scrape_stadiums()

        # Should have stadiums for all teams
        assert len(stadiums) == 13

        # Check stadium IDs are unique
        stadium_ids = [s.id for s in stadiums]
        assert len(set(stadium_ids)) == 13

        # Check all stadiums have required fields
        for stadium in stadiums:
            assert stadium.id.startswith("stadium_wnba_")
            assert stadium.sport == "wnba"
            assert stadium.name
            assert stadium.city
            assert stadium.state
            assert stadium.country == "USA"
            assert stadium.latitude != 0
            assert stadium.longitude != 0


class TestScrapeFallback:
    """Test fallback behavior (WNBA only has ESPN)."""

    def test_returns_failure_when_espn_fails(self):
        """Test scraper returns failure when ESPN fails."""
        scraper = WNBAScraper(season=2026)

        with patch.object(scraper, '_scrape_espn') as mock_espn:
            mock_espn.side_effect = Exception("ESPN failed")

            result = scraper.scrape_games()

            assert not result.success
            assert "All sources failed" in result.error_message


class TestSeasonMonths:
    """Test season month calculation."""

    def test_gets_correct_season_months(self):
        """Test correct months are returned for WNBA season."""
        scraper = WNBAScraper(season=2026)
        months = scraper._get_season_months()

        # WNBA season is May-October
        assert len(months) == 6  # May, Jun, Jul, Aug, Sep, Oct

        # Check first month is May of season year
        assert months[0] == (2026, 5)

        # Check last month is October
        assert months[-1] == (2026, 10)
187
sportstime_parser/tests/test_timezone.py
Normal file
@@ -0,0 +1,187 @@
|
||||
"""Tests for timezone conversion utilities."""
|
||||
|
||||
import pytest
|
||||
from datetime import datetime, date
|
||||
from zoneinfo import ZoneInfo
|
||||
|
||||
from sportstime_parser.normalizers.timezone import (
|
||||
detect_timezone_from_string,
|
||||
detect_timezone_from_location,
|
||||
parse_datetime,
|
||||
convert_to_utc,
|
||||
get_stadium_timezone,
|
||||
TimezoneResult,
|
||||
)
|
||||
|
||||
|
||||
class TestDetectTimezoneFromString:
|
||||
"""Tests for detect_timezone_from_string function."""
|
||||
|
||||
def test_eastern_time(self):
|
||||
"""Test Eastern Time detection."""
|
||||
assert detect_timezone_from_string("7:00 PM ET") == "America/New_York"
|
||||
assert detect_timezone_from_string("7:00 PM EST") == "America/New_York"
|
||||
assert detect_timezone_from_string("7:00 PM EDT") == "America/New_York"
|
||||
|
||||
def test_central_time(self):
|
||||
"""Test Central Time detection."""
|
||||
assert detect_timezone_from_string("8:00 PM CT") == "America/Chicago"
|
||||
assert detect_timezone_from_string("8:00 PM CST") == "America/Chicago"
|
||||
assert detect_timezone_from_string("8:00 PM CDT") == "America/Chicago"
|
||||
|
||||
def test_mountain_time(self):
|
||||
"""Test Mountain Time detection."""
|
||||
assert detect_timezone_from_string("7:00 PM MT") == "America/Denver"
|
||||
assert detect_timezone_from_string("7:00 PM MST") == "America/Denver"
|
||||
|
||||
def test_pacific_time(self):
|
||||
"""Test Pacific Time detection."""
|
||||
assert detect_timezone_from_string("7:00 PM PT") == "America/Los_Angeles"
|
||||
assert detect_timezone_from_string("7:00 PM PST") == "America/Los_Angeles"
|
||||
assert detect_timezone_from_string("7:00 PM PDT") == "America/Los_Angeles"
|
||||
|
||||
def test_no_timezone(self):
|
||||
"""Test string with no timezone."""
|
||||
assert detect_timezone_from_string("7:00 PM") is None
|
||||
assert detect_timezone_from_string("19:00") is None
|
||||
|
||||
def test_case_insensitive(self):
|
||||
"""Test case insensitive matching."""
|
||||
assert detect_timezone_from_string("7:00 PM et") == "America/New_York"
|
||||
assert detect_timezone_from_string("7:00 PM Et") == "America/New_York"
|
||||
|
||||
|
||||
class TestDetectTimezoneFromLocation:
|
||||
"""Tests for detect_timezone_from_location function."""
|
||||
|
||||
def test_eastern_states(self):
|
||||
"""Test Eastern timezone states."""
|
||||
assert detect_timezone_from_location(state="NY") == "America/New_York"
|
||||
assert detect_timezone_from_location(state="MA") == "America/New_York"
|
||||
assert detect_timezone_from_location(state="FL") == "America/New_York"
|
||||
|
||||
def test_central_states(self):
|
||||
"""Test Central timezone states."""
|
||||
assert detect_timezone_from_location(state="TX") == "America/Chicago"
|
||||
assert detect_timezone_from_location(state="IL") == "America/Chicago"
|
||||
|
||||
def test_mountain_states(self):
|
||||
"""Test Mountain timezone states."""
|
||||
assert detect_timezone_from_location(state="CO") == "America/Denver"
|
||||
assert detect_timezone_from_location(state="AZ") == "America/Phoenix"
|
||||
|
||||
def test_pacific_states(self):
|
||||
"""Test Pacific timezone states."""
|
||||
assert detect_timezone_from_location(state="CA") == "America/Los_Angeles"
|
||||
assert detect_timezone_from_location(state="WA") == "America/Los_Angeles"
|
||||
|
||||
def test_canadian_provinces(self):
|
||||
"""Test Canadian provinces."""
|
||||
assert detect_timezone_from_location(state="ON") == "America/Toronto"
|
||||
assert detect_timezone_from_location(state="BC") == "America/Vancouver"
|
||||
assert detect_timezone_from_location(state="AB") == "America/Edmonton"
|
||||
|
||||
def test_case_insensitive(self):
|
||||
"""Test case insensitive matching."""
|
||||
assert detect_timezone_from_location(state="ny") == "America/New_York"
|
||||
assert detect_timezone_from_location(state="Ny") == "America/New_York"
|
||||
|
||||
def test_unknown_state(self):
|
||||
"""Test unknown state returns None."""
|
||||
assert detect_timezone_from_location(state="XX") is None
|
||||
assert detect_timezone_from_location(state=None) is None
|
||||
|
||||
|
||||
class TestParseDatetime:
|
||||
"""Tests for parse_datetime function."""
|
||||
|
||||
def test_basic_date_time(self):
|
||||
"""Test basic date and time parsing."""
|
||||
result = parse_datetime("2025-12-25", "7:00 PM ET")
|
||||
assert result.datetime_utc.year == 2025
|
||||
assert result.datetime_utc.month == 12
|
||||
assert result.datetime_utc.day == 26 # UTC is +5 hours ahead
|
||||
assert result.source_timezone == "America/New_York"
|
||||
assert result.confidence == "high"
|
||||
|
||||
def test_date_only(self):
|
||||
"""Test date only parsing."""
|
||||
result = parse_datetime("2025-10-21")
|
||||
assert result.datetime_utc.year == 2025
|
||||
assert result.datetime_utc.month == 10
|
||||
assert result.datetime_utc.day == 21
|
||||
|
||||
def test_timezone_hint(self):
|
||||
"""Test timezone hint is used when no timezone in string."""
|
||||
result = parse_datetime(
|
||||
"2025-10-21",
|
||||
"7:00 PM",
|
||||
timezone_hint="America/Chicago",
|
||||
)
|
||||
assert result.source_timezone == "America/Chicago"
|
||||
assert result.confidence == "medium"
|
||||
|
||||
def test_location_inference(self):
|
||||
"""Test timezone inference from location."""
|
||||
result = parse_datetime(
|
||||
"2025-10-21",
|
||||
"7:00 PM",
|
||||
location_state="CA",
|
||||
)
|
||||
assert result.source_timezone == "America/Los_Angeles"
|
||||
assert result.confidence == "medium"
|
||||
|
||||
def test_default_to_eastern(self):
|
||||
"""Test defaults to Eastern when no timezone info."""
|
||||
result = parse_datetime("2025-10-21", "7:00 PM")
|
||||
assert result.source_timezone == "America/New_York"
|
||||
assert result.confidence == "low"
|
||||
assert result.warning is not None
|
||||
|
||||
def test_invalid_date(self):
|
||||
"""Test handling of invalid date."""
|
||||
result = parse_datetime("not a date")
|
||||
assert result.confidence == "low"
|
||||
assert result.warning is not None
|
||||
|
||||
|
||||
class TestConvertToUtc:
|
||||
"""Tests for convert_to_utc function."""
|
||||
|
||||
def test_convert_naive_datetime(self):
|
||||
"""Test converting naive datetime to UTC."""
|
||||
dt = datetime(2025, 12, 25, 19, 0) # 7:00 PM
|
||||
utc = convert_to_utc(dt, "America/New_York")
|
||||
|
||||
# In December, Eastern Time is UTC-5
|
||||
assert utc.hour == 0 # Next day 00:00 UTC
|
||||
assert utc.day == 26
|
||||
|
||||
def test_convert_aware_datetime(self):
|
||||
"""Test converting timezone-aware datetime."""
|
||||
tz = ZoneInfo("America/Los_Angeles")
|
||||
dt = datetime(2025, 7, 4, 19, 0, tzinfo=tz) # 7:00 PM PT
|
||||
utc = convert_to_utc(dt, "America/Los_Angeles")
|
||||
|
||||
# In July, Pacific Time is UTC-7
|
||||
assert utc.hour == 2 # 02:00 UTC next day
|
||||
assert utc.day == 5
|
||||
|
||||
|
||||
class TestGetStadiumTimezone:
|
||||
"""Tests for get_stadium_timezone function."""
|
||||
|
||||
def test_explicit_timezone(self):
|
||||
"""Test explicit timezone override."""
|
||||
tz = get_stadium_timezone("AZ", stadium_timezone="America/Phoenix")
|
||||
assert tz == "America/Phoenix"
|
||||
|
||||
def test_state_inference(self):
|
||||
"""Test timezone from state."""
|
||||
tz = get_stadium_timezone("NY")
|
||||
assert tz == "America/New_York"
|
||||
|
||||
def test_default_eastern(self):
|
||||
"""Test default to Eastern for unknown state."""
|
||||
tz = get_stadium_timezone("XX")
|
||||
assert tz == "America/New_York"
|
||||
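The conversions these tests pin down can be sketched with the standard library's `zoneinfo` alone; the helper name `to_utc` below is illustrative, not the package's `convert_to_utc`:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def to_utc(dt: datetime, tz_name: str) -> datetime:
    # Attach the source timezone to naive datetimes, then convert to UTC.
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=ZoneInfo(tz_name))
    return dt.astimezone(timezone.utc)

# 7:00 PM Eastern on Dec 25 is 00:00 UTC on Dec 26 (EST = UTC-5).
result = to_utc(datetime(2025, 12, 25, 19, 0), "America/New_York")
```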
1
sportstime_parser/tests/test_uploaders/__init__.py
Normal file
@@ -0,0 +1 @@
"""Tests for the uploaders module."""
|
||||
461
sportstime_parser/tests/test_uploaders/test_cloudkit.py
Normal file
461
sportstime_parser/tests/test_uploaders/test_cloudkit.py
Normal file
@@ -0,0 +1,461 @@
|
||||
"""Tests for the CloudKit client."""
|
||||
|
||||
import json
|
||||
import pytest
|
||||
from datetime import datetime
|
||||
from unittest.mock import Mock, patch, MagicMock
|
||||
|
||||
from sportstime_parser.uploaders.cloudkit import (
|
||||
CloudKitClient,
|
||||
CloudKitRecord,
|
||||
CloudKitError,
|
||||
CloudKitAuthError,
|
||||
CloudKitRateLimitError,
|
||||
CloudKitServerError,
|
||||
RecordType,
|
||||
OperationResult,
|
||||
BatchResult,
|
||||
)
|
||||
|
||||
|
||||
class TestCloudKitRecord:
|
||||
"""Tests for CloudKitRecord dataclass."""
|
||||
|
||||
def test_create_record(self):
|
||||
"""Test creating a CloudKitRecord."""
|
||||
record = CloudKitRecord(
|
||||
record_name="nba_2025_hou_okc_1021",
|
||||
record_type=RecordType.GAME,
|
||||
fields={
|
||||
"sport": "nba",
|
||||
"season": 2025,
|
||||
},
|
||||
)
|
||||
|
||||
assert record.record_name == "nba_2025_hou_okc_1021"
|
||||
assert record.record_type == RecordType.GAME
|
||||
assert record.fields["sport"] == "nba"
|
||||
assert record.record_change_tag is None
|
||||
|
||||
def test_to_cloudkit_dict(self):
|
||||
"""Test converting to CloudKit API format."""
|
||||
record = CloudKitRecord(
|
||||
record_name="nba_2025_hou_okc_1021",
|
||||
record_type=RecordType.GAME,
|
||||
fields={
|
||||
"sport": "nba",
|
||||
"season": 2025,
|
||||
},
|
||||
)
|
||||
|
||||
data = record.to_cloudkit_dict()
|
||||
|
||||
assert data["recordName"] == "nba_2025_hou_okc_1021"
|
||||
assert data["recordType"] == "Game"
|
||||
assert "fields" in data
|
||||
assert "recordChangeTag" not in data
|
||||
|
||||
def test_to_cloudkit_dict_with_change_tag(self):
|
||||
"""Test converting with change tag for updates."""
|
||||
record = CloudKitRecord(
|
||||
record_name="nba_2025_hou_okc_1021",
|
||||
record_type=RecordType.GAME,
|
||||
fields={"sport": "nba"},
|
||||
record_change_tag="abc123",
|
||||
)
|
||||
|
||||
data = record.to_cloudkit_dict()
|
||||
|
||||
assert data["recordChangeTag"] == "abc123"
|
||||
|
||||
def test_format_string_field(self):
|
||||
"""Test formatting string fields."""
|
||||
record = CloudKitRecord(
|
||||
record_name="test",
|
||||
record_type=RecordType.GAME,
|
||||
fields={"name": "Test Name"},
|
||||
)
|
||||
|
||||
data = record.to_cloudkit_dict()
|
||||
|
||||
assert data["fields"]["name"]["value"] == "Test Name"
|
||||
assert data["fields"]["name"]["type"] == "STRING"
|
||||
|
||||
def test_format_int_field(self):
|
||||
"""Test formatting integer fields."""
|
||||
record = CloudKitRecord(
|
||||
record_name="test",
|
||||
record_type=RecordType.GAME,
|
||||
fields={"count": 42},
|
||||
)
|
||||
|
||||
data = record.to_cloudkit_dict()
|
||||
|
||||
assert data["fields"]["count"]["value"] == 42
|
||||
assert data["fields"]["count"]["type"] == "INT64"
|
||||
|
||||
def test_format_float_field(self):
|
||||
"""Test formatting float fields."""
|
||||
record = CloudKitRecord(
|
||||
record_name="test",
|
||||
record_type=RecordType.STADIUM,
|
||||
fields={"latitude": 35.4634},
|
||||
)
|
||||
|
||||
data = record.to_cloudkit_dict()
|
||||
|
||||
assert data["fields"]["latitude"]["value"] == 35.4634
|
||||
assert data["fields"]["latitude"]["type"] == "DOUBLE"
|
||||
|
||||
def test_format_datetime_field(self):
|
||||
"""Test formatting datetime fields."""
|
||||
dt = datetime(2025, 10, 21, 19, 0, 0)
|
||||
record = CloudKitRecord(
|
||||
record_name="test",
|
||||
record_type=RecordType.GAME,
|
||||
fields={"game_date": dt},
|
||||
)
|
||||
|
||||
data = record.to_cloudkit_dict()
|
||||
|
||||
expected_ms = int(dt.timestamp() * 1000)
|
||||
assert data["fields"]["game_date"]["value"] == expected_ms
|
||||
assert data["fields"]["game_date"]["type"] == "TIMESTAMP"
|
||||
|
||||
def test_format_location_field(self):
|
||||
"""Test formatting location fields."""
|
||||
record = CloudKitRecord(
|
||||
record_name="test",
|
||||
record_type=RecordType.STADIUM,
|
||||
fields={
|
||||
"location": {"latitude": 35.4634, "longitude": -97.5151},
|
||||
},
|
||||
)
|
||||
|
||||
data = record.to_cloudkit_dict()
|
||||
|
||||
assert data["fields"]["location"]["type"] == "LOCATION"
|
||||
assert data["fields"]["location"]["value"]["latitude"] == 35.4634
|
||||
assert data["fields"]["location"]["value"]["longitude"] == -97.5151
|
||||
|
||||
def test_skip_none_fields(self):
|
||||
"""Test that None fields are skipped."""
|
||||
record = CloudKitRecord(
|
||||
record_name="test",
|
||||
record_type=RecordType.GAME,
|
||||
fields={
|
||||
"sport": "nba",
|
||||
"score": None, # Should be skipped
|
||||
},
|
||||
)
|
||||
|
||||
data = record.to_cloudkit_dict()
|
||||
|
||||
assert "sport" in data["fields"]
|
||||
assert "score" not in data["fields"]
|
||||
|
||||
|
||||
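The field-formatting rules exercised above reduce to a small type dispatch. This standalone sketch (a hypothetical `format_field`, not the package's internal helper; the bool-to-INT64 branch is an assumption) shows the mapping the assertions check:

```python
from datetime import datetime

def format_field(value):
    """Map a Python value to a CloudKit-style field descriptor."""
    if isinstance(value, str):
        return {"value": value, "type": "STRING"}
    if isinstance(value, bool):  # bool before int: bool subclasses int (assumed mapping)
        return {"value": int(value), "type": "INT64"}
    if isinstance(value, int):
        return {"value": value, "type": "INT64"}
    if isinstance(value, float):
        return {"value": value, "type": "DOUBLE"}
    if isinstance(value, datetime):
        # CloudKit timestamps are milliseconds since the Unix epoch.
        return {"value": int(value.timestamp() * 1000), "type": "TIMESTAMP"}
    if isinstance(value, dict) and {"latitude", "longitude"} <= value.keys():
        return {"value": value, "type": "LOCATION"}
    raise TypeError(f"unsupported field type: {type(value).__name__}")
```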
class TestOperationResult:
    """Tests for OperationResult dataclass."""

    def test_successful_result(self):
        """Test creating a successful operation result."""
        result = OperationResult(
            record_name="test_record",
            success=True,
            record_change_tag="new_tag",
        )

        assert result.record_name == "test_record"
        assert result.success is True
        assert result.record_change_tag == "new_tag"
        assert result.error_code is None

    def test_failed_result(self):
        """Test creating a failed operation result."""
        result = OperationResult(
            record_name="test_record",
            success=False,
            error_code="SERVER_ERROR",
            error_message="Internal server error",
        )

        assert result.success is False
        assert result.error_code == "SERVER_ERROR"
        assert result.error_message == "Internal server error"


class TestBatchResult:
    """Tests for BatchResult dataclass."""

    def test_empty_batch_result(self):
        """Test empty batch result."""
        result = BatchResult()

        assert result.all_succeeded is True
        assert result.success_count == 0
        assert result.failure_count == 0

    def test_batch_with_successes(self):
        """Test batch with successful operations."""
        result = BatchResult()
        result.successful.append(OperationResult("rec1", True))
        result.successful.append(OperationResult("rec2", True))

        assert result.all_succeeded is True
        assert result.success_count == 2
        assert result.failure_count == 0

    def test_batch_with_failures(self):
        """Test batch with failed operations."""
        result = BatchResult()
        result.successful.append(OperationResult("rec1", True))
        result.failed.append(OperationResult("rec2", False, error_message="Error"))

        assert result.all_succeeded is False
        assert result.success_count == 1
        assert result.failure_count == 1


class TestCloudKitClient:
    """Tests for CloudKitClient."""

    def test_not_configured_without_credentials(self):
        """Test that the client reports not configured without credentials."""
        with patch.dict("os.environ", {}, clear=True):
            client = CloudKitClient()
            assert client.is_configured is False

    def test_configured_with_credentials(self):
        """Test that the client reports configured with credentials."""
        # Create a minimal mock for the private key
        mock_key = MagicMock()

        with patch.dict("os.environ", {
            "CLOUDKIT_KEY_ID": "test_key_id",
            "CLOUDKIT_PRIVATE_KEY": "-----BEGIN EC PRIVATE KEY-----\ntest\n-----END EC PRIVATE KEY-----",
        }):
            with patch("sportstime_parser.uploaders.cloudkit.serialization.load_pem_private_key") as mock_load:
                mock_load.return_value = mock_key
                client = CloudKitClient()
                assert client.is_configured is True

    def test_get_api_path(self):
        """Test API path construction."""
        client = CloudKitClient(
            container_id="iCloud.com.test.app",
            environment="development",
        )

        path = client._get_api_path("records/query")

        assert path == "/database/1/iCloud.com.test.app/development/public/records/query"

    @patch("sportstime_parser.uploaders.cloudkit.requests.Session")
    def test_fetch_records_query(self, mock_session_class):
        """Test fetching records with a query."""
        mock_session = MagicMock()
        mock_session_class.return_value = mock_session

        mock_response = MagicMock()
        mock_response.status_code = 200
        mock_response.json.return_value = {
            "records": [
                {"recordName": "rec1", "recordType": "Game"},
                {"recordName": "rec2", "recordType": "Game"},
            ]
        }
        mock_session.request.return_value = mock_response

        # Set up the client with mocked auth
        mock_key = MagicMock()
        mock_key.sign.return_value = b"signature"

        with patch.dict("os.environ", {
            "CLOUDKIT_KEY_ID": "test_key",
            "CLOUDKIT_PRIVATE_KEY": "-----BEGIN EC PRIVATE KEY-----\ntest\n-----END EC PRIVATE KEY-----",
        }):
            with patch("sportstime_parser.uploaders.cloudkit.serialization.load_pem_private_key") as mock_load:
                with patch("sportstime_parser.uploaders.cloudkit.jwt.encode") as mock_jwt:
                    mock_load.return_value = mock_key
                    mock_jwt.return_value = "test_token"

                    client = CloudKitClient()
                    records = client.fetch_records(RecordType.GAME)

                    assert len(records) == 2
                    assert records[0]["recordName"] == "rec1"

    @patch("sportstime_parser.uploaders.cloudkit.requests.Session")
    def test_save_records_success(self, mock_session_class):
        """Test saving records successfully."""
        mock_session = MagicMock()
        mock_session_class.return_value = mock_session

        mock_response = MagicMock()
        mock_response.status_code = 200
        mock_response.json.return_value = {
            "records": [
                {"recordName": "rec1", "recordChangeTag": "tag1"},
                {"recordName": "rec2", "recordChangeTag": "tag2"},
            ]
        }
        mock_session.request.return_value = mock_response

        mock_key = MagicMock()
        mock_key.sign.return_value = b"signature"

        with patch.dict("os.environ", {
            "CLOUDKIT_KEY_ID": "test_key",
            "CLOUDKIT_PRIVATE_KEY": "-----BEGIN EC PRIVATE KEY-----\ntest\n-----END EC PRIVATE KEY-----",
        }):
            with patch("sportstime_parser.uploaders.cloudkit.serialization.load_pem_private_key") as mock_load:
                with patch("sportstime_parser.uploaders.cloudkit.jwt.encode") as mock_jwt:
                    mock_load.return_value = mock_key
                    mock_jwt.return_value = "test_token"

                    client = CloudKitClient()

                    records = [
                        CloudKitRecord("rec1", RecordType.GAME, {"sport": "nba"}),
                        CloudKitRecord("rec2", RecordType.GAME, {"sport": "nba"}),
                    ]

                    result = client.save_records(records)

                    assert result.success_count == 2
                    assert result.failure_count == 0

    @patch("sportstime_parser.uploaders.cloudkit.requests.Session")
    def test_save_records_partial_failure(self, mock_session_class):
        """Test saving records with some failures."""
        mock_session = MagicMock()
        mock_session_class.return_value = mock_session

        mock_response = MagicMock()
        mock_response.status_code = 200
        mock_response.json.return_value = {
            "records": [
                {"recordName": "rec1", "recordChangeTag": "tag1"},
                {"recordName": "rec2", "serverErrorCode": "QUOTA_EXCEEDED", "reason": "Quota exceeded"},
            ]
        }
        mock_session.request.return_value = mock_response

        mock_key = MagicMock()
        mock_key.sign.return_value = b"signature"

        with patch.dict("os.environ", {
            "CLOUDKIT_KEY_ID": "test_key",
            "CLOUDKIT_PRIVATE_KEY": "-----BEGIN EC PRIVATE KEY-----\ntest\n-----END EC PRIVATE KEY-----",
        }):
            with patch("sportstime_parser.uploaders.cloudkit.serialization.load_pem_private_key") as mock_load:
                with patch("sportstime_parser.uploaders.cloudkit.jwt.encode") as mock_jwt:
                    mock_load.return_value = mock_key
                    mock_jwt.return_value = "test_token"

                    client = CloudKitClient()

                    records = [
                        CloudKitRecord("rec1", RecordType.GAME, {"sport": "nba"}),
                        CloudKitRecord("rec2", RecordType.GAME, {"sport": "nba"}),
                    ]

                    result = client.save_records(records)

                    assert result.success_count == 1
                    assert result.failure_count == 1
                    assert result.failed[0].error_code == "QUOTA_EXCEEDED"

    @patch("sportstime_parser.uploaders.cloudkit.requests.Session")
    def test_auth_error(self, mock_session_class):
        """Test handling an authentication error."""
        mock_session = MagicMock()
        mock_session_class.return_value = mock_session

        mock_response = MagicMock()
        mock_response.status_code = 421
        mock_session.request.return_value = mock_response

        mock_key = MagicMock()
        mock_key.sign.return_value = b"signature"

        with patch.dict("os.environ", {
            "CLOUDKIT_KEY_ID": "test_key",
            "CLOUDKIT_PRIVATE_KEY": "-----BEGIN EC PRIVATE KEY-----\ntest\n-----END EC PRIVATE KEY-----",
        }):
            with patch("sportstime_parser.uploaders.cloudkit.serialization.load_pem_private_key") as mock_load:
                with patch("sportstime_parser.uploaders.cloudkit.jwt.encode") as mock_jwt:
                    mock_load.return_value = mock_key
                    mock_jwt.return_value = "test_token"

                    client = CloudKitClient()

                    with pytest.raises(CloudKitAuthError):
                        client.fetch_records(RecordType.GAME)

    @patch("sportstime_parser.uploaders.cloudkit.requests.Session")
    def test_rate_limit_error(self, mock_session_class):
        """Test handling a rate limit error."""
        mock_session = MagicMock()
        mock_session_class.return_value = mock_session

        mock_response = MagicMock()
        mock_response.status_code = 429
        mock_session.request.return_value = mock_response

        mock_key = MagicMock()
        mock_key.sign.return_value = b"signature"

        with patch.dict("os.environ", {
            "CLOUDKIT_KEY_ID": "test_key",
            "CLOUDKIT_PRIVATE_KEY": "-----BEGIN EC PRIVATE KEY-----\ntest\n-----END EC PRIVATE KEY-----",
        }):
            with patch("sportstime_parser.uploaders.cloudkit.serialization.load_pem_private_key") as mock_load:
                with patch("sportstime_parser.uploaders.cloudkit.jwt.encode") as mock_jwt:
                    mock_load.return_value = mock_key
                    mock_jwt.return_value = "test_token"

                    client = CloudKitClient()

                    with pytest.raises(CloudKitRateLimitError):
                        client.fetch_records(RecordType.GAME)

    @patch("sportstime_parser.uploaders.cloudkit.requests.Session")
    def test_server_error(self, mock_session_class):
        """Test handling a server error."""
        mock_session = MagicMock()
        mock_session_class.return_value = mock_session

        mock_response = MagicMock()
        mock_response.status_code = 503
        mock_session.request.return_value = mock_response

        mock_key = MagicMock()
        mock_key.sign.return_value = b"signature"

        with patch.dict("os.environ", {
            "CLOUDKIT_KEY_ID": "test_key",
            "CLOUDKIT_PRIVATE_KEY": "-----BEGIN EC PRIVATE KEY-----\ntest\n-----END EC PRIVATE KEY-----",
        }):
            with patch("sportstime_parser.uploaders.cloudkit.serialization.load_pem_private_key") as mock_load:
                with patch("sportstime_parser.uploaders.cloudkit.jwt.encode") as mock_jwt:
                    mock_load.return_value = mock_key
                    mock_jwt.return_value = "test_token"

                    client = CloudKitClient()

                    with pytest.raises(CloudKitServerError):
                        client.fetch_records(RecordType.GAME)


class TestRecordType:
    """Tests for RecordType enum."""

    def test_record_type_values(self):
        """Test that record type values match the CloudKit schema."""
        assert RecordType.GAME.value == "Game"
        assert RecordType.TEAM.value == "Team"
        assert RecordType.STADIUM.value == "Stadium"
        assert RecordType.TEAM_ALIAS.value == "TeamAlias"
        assert RecordType.STADIUM_ALIAS.value == "StadiumAlias"
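On the wire, the save path these tests mock posts a JSON body to a `records/modify` endpoint. The shape below follows CloudKit Web Services conventions and is an assumption about this client's payload, not a verbatim capture:

```python
import json

payload = {
    "operations": [
        {
            # CloudKit Web Services operation types include
            # "create", "update", "forceUpdate", and "delete".
            "operationType": "forceUpdate",
            "record": {
                "recordName": "nba_2025_hou_okc_1021",
                "recordType": "Game",
                "fields": {"sport": {"value": "nba", "type": "STRING"}},
            },
        }
    ]
}
body = json.dumps(payload)
```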
350
sportstime_parser/tests/test_uploaders/test_diff.py
Normal file
@@ -0,0 +1,350 @@
"""Tests for the record differ."""
|
||||
|
||||
import pytest
|
||||
from datetime import datetime
|
||||
|
||||
from sportstime_parser.models.game import Game
|
||||
from sportstime_parser.models.team import Team
|
||||
from sportstime_parser.models.stadium import Stadium
|
||||
from sportstime_parser.uploaders.diff import (
|
||||
DiffAction,
|
||||
RecordDiff,
|
||||
DiffResult,
|
||||
RecordDiffer,
|
||||
game_to_cloudkit_record,
|
||||
team_to_cloudkit_record,
|
||||
stadium_to_cloudkit_record,
|
||||
)
|
||||
from sportstime_parser.uploaders.cloudkit import RecordType
|
||||
|
||||
|
||||
class TestRecordDiff:
|
||||
"""Tests for RecordDiff dataclass."""
|
||||
|
||||
def test_create_record_diff(self):
|
||||
"""Test creating a RecordDiff."""
|
||||
diff = RecordDiff(
|
||||
record_name="nba_2025_hou_okc_1021",
|
||||
record_type=RecordType.GAME,
|
||||
action=DiffAction.CREATE,
|
||||
)
|
||||
|
||||
assert diff.record_name == "nba_2025_hou_okc_1021"
|
||||
assert diff.record_type == RecordType.GAME
|
||||
assert diff.action == DiffAction.CREATE
|
||||
|
||||
|
||||
class TestDiffResult:
|
||||
"""Tests for DiffResult dataclass."""
|
||||
|
||||
def test_empty_result(self):
|
||||
"""Test empty DiffResult."""
|
||||
result = DiffResult()
|
||||
|
||||
assert result.create_count == 0
|
||||
assert result.update_count == 0
|
||||
assert result.delete_count == 0
|
||||
assert result.unchanged_count == 0
|
||||
assert result.total_changes == 0
|
||||
|
||||
def test_counts(self):
|
||||
"""Test counting different change types."""
|
||||
result = DiffResult()
|
||||
|
||||
result.creates.append(RecordDiff(
|
||||
record_name="game_1",
|
||||
record_type=RecordType.GAME,
|
||||
action=DiffAction.CREATE,
|
||||
))
|
||||
result.creates.append(RecordDiff(
|
||||
record_name="game_2",
|
||||
record_type=RecordType.GAME,
|
||||
action=DiffAction.CREATE,
|
||||
))
|
||||
result.updates.append(RecordDiff(
|
||||
record_name="game_3",
|
||||
record_type=RecordType.GAME,
|
||||
action=DiffAction.UPDATE,
|
||||
))
|
||||
result.deletes.append(RecordDiff(
|
||||
record_name="game_4",
|
||||
record_type=RecordType.GAME,
|
||||
action=DiffAction.DELETE,
|
||||
))
|
||||
result.unchanged.append(RecordDiff(
|
||||
record_name="game_5",
|
||||
record_type=RecordType.GAME,
|
||||
action=DiffAction.UNCHANGED,
|
||||
))
|
||||
|
||||
assert result.create_count == 2
|
||||
assert result.update_count == 1
|
||||
assert result.delete_count == 1
|
||||
assert result.unchanged_count == 1
|
||||
assert result.total_changes == 4 # excludes unchanged
|
||||
|
||||
|
||||
class TestRecordDiffer:
|
||||
"""Tests for RecordDiffer."""
|
||||
|
||||
@pytest.fixture
|
||||
def differ(self):
|
||||
"""Create a RecordDiffer instance."""
|
||||
return RecordDiffer()
|
||||
|
||||
@pytest.fixture
|
||||
def sample_game(self):
|
||||
"""Create a sample Game."""
|
||||
return Game(
|
||||
id="nba_2025_hou_okc_1021",
|
||||
sport="nba",
|
||||
season=2025,
|
||||
home_team_id="team_nba_okc",
|
||||
away_team_id="team_nba_hou",
|
||||
stadium_id="stadium_nba_paycom_center",
|
||||
game_date=datetime(2025, 10, 21, 19, 0, 0),
|
||||
status="scheduled",
|
||||
)
|
||||
|
||||
@pytest.fixture
|
||||
def sample_team(self):
|
||||
"""Create a sample Team."""
|
||||
return Team(
|
||||
id="team_nba_okc",
|
||||
sport="nba",
|
||||
city="Oklahoma City",
|
||||
name="Thunder",
|
||||
full_name="Oklahoma City Thunder",
|
||||
abbreviation="OKC",
|
||||
conference="Western",
|
||||
division="Northwest",
|
||||
)
|
||||
|
||||
@pytest.fixture
|
||||
def sample_stadium(self):
|
||||
"""Create a sample Stadium."""
|
||||
return Stadium(
|
||||
id="stadium_nba_paycom_center",
|
||||
sport="nba",
|
||||
name="Paycom Center",
|
||||
city="Oklahoma City",
|
||||
state="OK",
|
||||
country="USA",
|
||||
latitude=35.4634,
|
||||
longitude=-97.5151,
|
||||
capacity=18203,
|
||||
)
|
||||
|
||||
def test_diff_games_create(self, differ, sample_game):
|
||||
"""Test detecting new games to create."""
|
||||
local_games = [sample_game]
|
||||
remote_records = []
|
||||
|
||||
result = differ.diff_games(local_games, remote_records)
|
||||
|
||||
assert result.create_count == 1
|
||||
assert result.update_count == 0
|
||||
assert result.delete_count == 0
|
||||
assert result.creates[0].record_name == sample_game.id
|
||||
|
||||
def test_diff_games_delete(self, differ, sample_game):
|
||||
"""Test detecting games to delete."""
|
||||
local_games = []
|
||||
remote_records = [
|
||||
{
|
||||
"recordName": sample_game.id,
|
||||
"recordType": "Game",
|
||||
"fields": {
|
||||
"sport": {"value": "nba", "type": "STRING"},
|
||||
"season": {"value": 2025, "type": "INT64"},
|
||||
},
|
||||
"recordChangeTag": "abc123",
|
||||
}
|
||||
]
|
||||
|
||||
result = differ.diff_games(local_games, remote_records)
|
||||
|
||||
assert result.create_count == 0
|
||||
assert result.delete_count == 1
|
||||
assert result.deletes[0].record_name == sample_game.id
|
||||
|
||||
def test_diff_games_unchanged(self, differ, sample_game):
|
||||
"""Test detecting unchanged games."""
|
||||
local_games = [sample_game]
|
||||
remote_records = [
|
||||
{
|
||||
"recordName": sample_game.id,
|
||||
"recordType": "Game",
|
||||
"fields": {
|
||||
"sport": {"value": "nba", "type": "STRING"},
|
||||
"season": {"value": 2025, "type": "INT64"},
|
||||
"home_team_id": {"value": "team_nba_okc", "type": "STRING"},
|
||||
"away_team_id": {"value": "team_nba_hou", "type": "STRING"},
|
||||
"stadium_id": {"value": "stadium_nba_paycom_center", "type": "STRING"},
|
||||
"game_date": {"value": int(sample_game.game_date.timestamp() * 1000), "type": "TIMESTAMP"},
|
||||
"game_number": {"value": None, "type": "INT64"},
|
||||
"home_score": {"value": None, "type": "INT64"},
|
||||
"away_score": {"value": None, "type": "INT64"},
|
||||
"status": {"value": "scheduled", "type": "STRING"},
|
||||
},
|
||||
"recordChangeTag": "abc123",
|
||||
}
|
||||
]
|
||||
|
||||
result = differ.diff_games(local_games, remote_records)
|
||||
|
||||
assert result.create_count == 0
|
||||
assert result.update_count == 0
|
||||
assert result.unchanged_count == 1
|
||||
|
||||
def test_diff_games_update(self, differ, sample_game):
|
||||
"""Test detecting games that need update."""
|
||||
local_games = [sample_game]
|
||||
# Remote has different status
|
||||
remote_records = [
|
||||
{
|
||||
"recordName": sample_game.id,
|
||||
"recordType": "Game",
|
||||
"fields": {
|
||||
"sport": {"value": "nba", "type": "STRING"},
|
||||
"season": {"value": 2025, "type": "INT64"},
|
||||
"home_team_id": {"value": "team_nba_okc", "type": "STRING"},
|
||||
"away_team_id": {"value": "team_nba_hou", "type": "STRING"},
|
||||
"stadium_id": {"value": "stadium_nba_paycom_center", "type": "STRING"},
|
||||
"game_date": {"value": int(sample_game.game_date.timestamp() * 1000), "type": "TIMESTAMP"},
|
||||
"game_number": {"value": None, "type": "INT64"},
|
||||
"home_score": {"value": None, "type": "INT64"},
|
||||
"away_score": {"value": None, "type": "INT64"},
|
||||
"status": {"value": "postponed", "type": "STRING"}, # Different!
|
||||
},
|
||||
"recordChangeTag": "abc123",
|
||||
}
|
||||
]
|
||||
|
||||
result = differ.diff_games(local_games, remote_records)
|
||||
|
||||
assert result.update_count == 1
|
||||
assert "status" in result.updates[0].changed_fields
|
||||
assert result.updates[0].record_change_tag == "abc123"
|
||||
|
||||
def test_diff_teams_create(self, differ, sample_team):
|
||||
"""Test detecting new teams to create."""
|
||||
local_teams = [sample_team]
|
||||
remote_records = []
|
||||
|
||||
result = differ.diff_teams(local_teams, remote_records)
|
||||
|
||||
assert result.create_count == 1
|
||||
assert result.creates[0].record_name == sample_team.id
|
||||
|
||||
def test_diff_stadiums_create(self, differ, sample_stadium):
|
||||
"""Test detecting new stadiums to create."""
|
||||
local_stadiums = [sample_stadium]
|
||||
remote_records = []
|
||||
|
||||
result = differ.diff_stadiums(local_stadiums, remote_records)
|
||||
|
||||
assert result.create_count == 1
|
||||
assert result.creates[0].record_name == sample_stadium.id
|
||||
|
||||
def test_get_records_to_upload(self, differ, sample_game):
|
||||
"""Test getting CloudKitRecords for upload."""
|
||||
game2 = Game(
|
||||
id="nba_2025_lal_lac_1022",
|
||||
sport="nba",
|
||||
season=2025,
|
||||
home_team_id="team_nba_lac",
|
||||
away_team_id="team_nba_lal",
|
||||
stadium_id="stadium_nba_crypto_com",
|
||||
game_date=datetime(2025, 10, 22, 19, 0, 0),
|
||||
status="scheduled",
|
||||
)
|
||||
|
||||
local_games = [sample_game, game2]
|
||||
# Only game2 exists remotely with different status
|
||||
remote_records = [
|
||||
{
|
||||
"recordName": game2.id,
|
||||
"recordType": "Game",
|
||||
"fields": {
|
||||
"sport": {"value": "nba", "type": "STRING"},
|
||||
"season": {"value": 2025, "type": "INT64"},
|
||||
"home_team_id": {"value": "team_nba_lac", "type": "STRING"},
|
||||
"away_team_id": {"value": "team_nba_lal", "type": "STRING"},
|
||||
"stadium_id": {"value": "stadium_nba_crypto_com", "type": "STRING"},
|
||||
"game_date": {"value": int(game2.game_date.timestamp() * 1000), "type": "TIMESTAMP"},
|
||||
"status": {"value": "postponed", "type": "STRING"}, # Different!
|
||||
},
|
||||
"recordChangeTag": "xyz789",
|
||||
}
|
||||
]
|
||||
|
||||
result = differ.diff_games(local_games, remote_records)
|
||||
records = result.get_records_to_upload()
|
||||
|
||||
assert len(records) == 2 # 1 create + 1 update
|
||||
record_names = [r.record_name for r in records]
|
||||
assert sample_game.id in record_names
|
||||
assert game2.id in record_names
|
||||
|
||||
|
||||
class TestConvenienceFunctions:
|
||||
"""Tests for module-level convenience functions."""
|
||||
|
||||
def test_game_to_cloudkit_record(self):
|
||||
"""Test converting Game to CloudKitRecord."""
|
||||
game = Game(
|
||||
id="nba_2025_hou_okc_1021",
|
||||
sport="nba",
|
||||
season=2025,
|
||||
home_team_id="team_nba_okc",
|
||||
away_team_id="team_nba_hou",
|
||||
stadium_id="stadium_nba_paycom_center",
|
||||
game_date=datetime(2025, 10, 21, 19, 0, 0),
|
||||
status="scheduled",
|
||||
)
|
||||
|
||||
record = game_to_cloudkit_record(game)
|
||||
|
||||
assert record.record_name == game.id
|
||||
assert record.record_type == RecordType.GAME
|
||||
assert record.fields["sport"] == "nba"
|
||||
assert record.fields["season"] == 2025
|
||||
|
||||
def test_team_to_cloudkit_record(self):
|
||||
"""Test converting Team to CloudKitRecord."""
|
||||
team = Team(
|
||||
id="team_nba_okc",
|
||||
sport="nba",
|
||||
city="Oklahoma City",
|
||||
name="Thunder",
|
||||
full_name="Oklahoma City Thunder",
|
||||
abbreviation="OKC",
|
||||
)
|
||||
|
||||
record = team_to_cloudkit_record(team)
|
||||
|
||||
assert record.record_name == team.id
|
||||
assert record.record_type == RecordType.TEAM
|
||||
assert record.fields["city"] == "Oklahoma City"
|
||||
assert record.fields["name"] == "Thunder"
|
||||
|
||||
def test_stadium_to_cloudkit_record(self):
|
||||
"""Test converting Stadium to CloudKitRecord."""
|
||||
stadium = Stadium(
|
||||
id="stadium_nba_paycom_center",
|
||||
sport="nba",
|
||||
name="Paycom Center",
|
||||
city="Oklahoma City",
|
||||
state="OK",
|
||||
country="USA",
|
||||
latitude=35.4634,
|
||||
longitude=-97.5151,
|
||||
)
|
||||
|
||||
record = stadium_to_cloudkit_record(stadium)
|
||||
|
||||
assert record.record_name == stadium.id
|
||||
assert record.record_type == RecordType.STADIUM
|
||||
assert record.fields["name"] == "Paycom Center"
|
||||
assert record.fields["latitude"] == 35.4634
|
||||
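Stripped of the CloudKit specifics, the diffing these tests verify is a three-way comparison keyed on record name. A minimal sketch over plain dicts (the `classify` helper is illustrative, not part of the package):

```python
def classify(local: dict, remote: dict):
    """Partition record names into create/update/delete/unchanged buckets."""
    creates = sorted(k for k in local if k not in remote)
    deletes = sorted(k for k in remote if k not in local)
    updates, unchanged = [], []
    for name in sorted(local.keys() & remote.keys()):
        (updates if local[name] != remote[name] else unchanged).append(name)
    return creates, updates, deletes, unchanged

local = {"game_1": {"status": "scheduled"}, "game_2": {"status": "scheduled"}}
remote = {"game_2": {"status": "postponed"}, "game_3": {"status": "scheduled"}}
result = classify(local, remote)
# game_1 is new, game_2 changed, game_3 vanished locally
```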
472
sportstime_parser/tests/test_uploaders/test_state.py
Normal file
@@ -0,0 +1,472 @@
"""Tests for the upload state manager."""
|
||||
|
||||
import json
|
||||
import pytest
|
||||
from datetime import datetime, timedelta
|
||||
from pathlib import Path
|
||||
from tempfile import TemporaryDirectory
|
||||
|
||||
from sportstime_parser.uploaders.state import (
|
||||
RecordState,
|
||||
UploadSession,
|
||||
StateManager,
|
||||
)
|
||||
|
||||
|
||||
class TestRecordState:
|
||||
"""Tests for RecordState dataclass."""
|
||||
|
||||
def test_create_record_state(self):
|
||||
"""Test creating a RecordState with default values."""
|
||||
state = RecordState(
|
||||
record_name="nba_2025_hou_okc_1021",
|
||||
record_type="Game",
|
||||
)
|
||||
|
||||
assert state.record_name == "nba_2025_hou_okc_1021"
|
||||
assert state.record_type == "Game"
|
||||
assert state.status == "pending"
|
||||
assert state.uploaded_at is None
|
||||
assert state.record_change_tag is None
|
||||
assert state.error_message is None
|
||||
assert state.retry_count == 0
|
||||
|
||||
def test_record_state_to_dict(self):
|
||||
"""Test serializing RecordState to dictionary."""
|
||||
now = datetime.utcnow()
|
||||
state = RecordState(
|
||||
record_name="nba_2025_hou_okc_1021",
|
||||
record_type="Game",
|
||||
uploaded_at=now,
|
||||
record_change_tag="abc123",
|
||||
status="uploaded",
|
||||
)
|
||||
|
||||
data = state.to_dict()
|
||||
|
||||
assert data["record_name"] == "nba_2025_hou_okc_1021"
|
||||
assert data["record_type"] == "Game"
|
||||
assert data["status"] == "uploaded"
|
||||
assert data["uploaded_at"] == now.isoformat()
|
||||
assert data["record_change_tag"] == "abc123"
|
||||
|
||||
def test_record_state_from_dict(self):
|
||||
"""Test deserializing RecordState from dictionary."""
|
||||
data = {
|
||||
"record_name": "nba_2025_hou_okc_1021",
|
||||
"record_type": "Game",
|
||||
"uploaded_at": "2026-01-10T12:00:00",
|
||||
"record_change_tag": "abc123",
|
||||
"status": "uploaded",
|
||||
"error_message": None,
|
||||
"retry_count": 0,
|
||||
}
|
||||
|
||||
state = RecordState.from_dict(data)
|
||||
|
||||
assert state.record_name == "nba_2025_hou_okc_1021"
|
||||
assert state.record_type == "Game"
|
||||
assert state.status == "uploaded"
|
||||
assert state.uploaded_at == datetime.fromisoformat("2026-01-10T12:00:00")
|
||||
assert state.record_change_tag == "abc123"
|
||||
|
||||
|
||||
class TestUploadSession:
|
||||
"""Tests for UploadSession dataclass."""
|
||||
|
||||
def test_create_upload_session(self):
|
||||
"""Test creating an UploadSession."""
|
||||
session = UploadSession(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
)
|
||||
|
||||
assert session.sport == "nba"
|
||||
assert session.season == 2025
|
||||
assert session.environment == "development"
|
||||
assert session.total_count == 0
|
||||
assert len(session.records) == 0
|
||||
|
||||
def test_add_record(self):
|
||||
"""Test adding records to a session."""
|
||||
session = UploadSession(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
)
|
||||
|
||||
session.add_record("game_1", "Game")
|
||||
session.add_record("game_2", "Game")
|
||||
session.add_record("team_1", "Team")
|
||||
|
||||
assert session.total_count == 3
|
||||
assert len(session.records) == 3
|
||||
assert "game_1" in session.records
|
||||
assert session.records["game_1"].record_type == "Game"
|
||||
|
||||
def test_mark_uploaded(self):
|
||||
"""Test marking a record as uploaded."""
|
||||
session = UploadSession(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
)
|
||||
session.add_record("game_1", "Game")
|
||||
|
||||
session.mark_uploaded("game_1", "change_tag_123")
|
||||
|
||||
assert session.records["game_1"].status == "uploaded"
|
||||
assert session.records["game_1"].record_change_tag == "change_tag_123"
|
||||
assert session.records["game_1"].uploaded_at is not None
|
||||
|
||||
def test_mark_failed(self):
|
||||
"""Test marking a record as failed."""
|
||||
session = UploadSession(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
)
|
||||
session.add_record("game_1", "Game")
|
||||
|
||||
session.mark_failed("game_1", "Server error")
|
||||
|
||||
assert session.records["game_1"].status == "failed"
|
||||
assert session.records["game_1"].error_message == "Server error"
|
||||
assert session.records["game_1"].retry_count == 1
|
||||
|
||||
def test_mark_failed_increments_retry_count(self):
|
||||
"""Test that marking failed increments retry count."""
|
||||
session = UploadSession(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
)
|
||||
session.add_record("game_1", "Game")
|
||||
|
||||
session.mark_failed("game_1", "Error 1")
|
||||
session.mark_failed("game_1", "Error 2")
|
||||
session.mark_failed("game_1", "Error 3")
|
||||
|
||||
assert session.records["game_1"].retry_count == 3
|
||||
|
||||
def test_counts(self):
|
||||
"""Test session counts."""
|
||||
session = UploadSession(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
)
|
||||
session.add_record("game_1", "Game")
|
||||
session.add_record("game_2", "Game")
|
||||
session.add_record("game_3", "Game")
|
||||
|
||||
session.mark_uploaded("game_1")
|
||||
session.mark_failed("game_2", "Error")
|
||||
|
||||
assert session.uploaded_count == 1
|
||||
assert session.failed_count == 1
|
||||
assert session.pending_count == 1
|
||||
|
||||
def test_is_complete(self):
|
||||
"""Test is_complete property."""
|
||||
session = UploadSession(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
)
|
||||
session.add_record("game_1", "Game")
|
||||
session.add_record("game_2", "Game")
|
||||
|
||||
assert not session.is_complete
|
||||
|
||||
session.mark_uploaded("game_1")
|
||||
assert not session.is_complete
|
||||
|
||||
session.mark_uploaded("game_2")
|
||||
assert session.is_complete
|
||||
|
||||
def test_progress_percent(self):
|
||||
"""Test progress percentage calculation."""
|
||||
session = UploadSession(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
)
|
||||
session.add_record("game_1", "Game")
|
||||
session.add_record("game_2", "Game")
|
||||
session.add_record("game_3", "Game")
|
||||
session.add_record("game_4", "Game")
|
||||
|
||||
session.mark_uploaded("game_1")
|
||||
|
||||
assert session.progress_percent == 25.0
|
||||
|
||||
def test_get_pending_records(self):
|
||||
"""Test getting pending record names."""
|
||||
session = UploadSession(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
)
|
||||
session.add_record("game_1", "Game")
|
||||
session.add_record("game_2", "Game")
|
||||
session.add_record("game_3", "Game")
|
||||
|
||||
session.mark_uploaded("game_1")
|
||||
session.mark_failed("game_2", "Error")
|
||||
|
||||
pending = session.get_pending_records()
|
||||
|
||||
assert pending == ["game_3"]
|
||||
|
||||
def test_get_failed_records(self):
|
||||
"""Test getting failed record names."""
|
||||
session = UploadSession(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
)
|
||||
session.add_record("game_1", "Game")
|
||||
session.add_record("game_2", "Game")
|
||||
session.add_record("game_3", "Game")
|
||||
|
||||
session.mark_failed("game_1", "Error 1")
|
||||
session.mark_failed("game_3", "Error 3")
|
||||
|
||||
failed = session.get_failed_records()
|
||||
|
||||
assert set(failed) == {"game_1", "game_3"}
|
||||
|
||||
def test_get_retryable_records(self):
|
||||
"""Test getting records eligible for retry."""
|
||||
session = UploadSession(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
)
|
||||
session.add_record("game_1", "Game")
|
||||
session.add_record("game_2", "Game")
|
||||
session.add_record("game_3", "Game")
|
||||
|
||||
# Fail game_1 once
|
||||
session.mark_failed("game_1", "Error")
|
||||
|
||||
# Fail game_2 three times (max retries)
|
||||
session.mark_failed("game_2", "Error")
|
||||
session.mark_failed("game_2", "Error")
|
||||
session.mark_failed("game_2", "Error")
|
||||
|
||||
retryable = session.get_retryable_records(max_retries=3)
|
||||
|
||||
assert retryable == ["game_1"]
|
||||
|
||||
def test_to_dict_and_from_dict(self):
|
||||
"""Test round-trip serialization."""
|
||||
session = UploadSession(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
)
|
||||
session.add_record("game_1", "Game")
|
||||
session.add_record("game_2", "Game")
|
||||
session.mark_uploaded("game_1", "tag_123")
|
||||
|
||||
data = session.to_dict()
|
||||
restored = UploadSession.from_dict(data)
|
||||
|
||||
assert restored.sport == session.sport
|
||||
assert restored.season == session.season
|
||||
assert restored.environment == session.environment
|
||||
assert restored.total_count == session.total_count
|
||||
assert restored.uploaded_count == session.uploaded_count
|
||||
assert restored.records["game_1"].status == "uploaded"
|
||||
|
||||
|
||||
class TestStateManager:
|
||||
"""Tests for StateManager."""
|
||||
|
||||
def test_create_session(self):
|
||||
"""Test creating a new session."""
|
||||
with TemporaryDirectory() as tmpdir:
|
||||
manager = StateManager(state_dir=Path(tmpdir))
|
||||
|
||||
session = manager.create_session(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
record_names=[
|
||||
("game_1", "Game"),
|
||||
("game_2", "Game"),
|
||||
("team_1", "Team"),
|
||||
],
|
||||
)
|
||||
|
||||
assert session.sport == "nba"
|
||||
assert session.season == 2025
|
||||
assert session.total_count == 3
|
||||
|
||||
# Check file was created
|
||||
state_file = Path(tmpdir) / "upload_state_nba_2025_development.json"
|
||||
assert state_file.exists()
|
||||
|
||||
def test_load_session(self):
|
||||
"""Test loading an existing session."""
|
||||
with TemporaryDirectory() as tmpdir:
|
||||
manager = StateManager(state_dir=Path(tmpdir))
|
||||
|
||||
# Create and save a session
|
||||
original = manager.create_session(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
record_names=[("game_1", "Game")],
|
||||
)
|
||||
original.mark_uploaded("game_1", "tag_123")
|
||||
manager.save_session(original)
|
||||
|
||||
# Load it back
|
||||
loaded = manager.load_session("nba", 2025, "development")
|
||||
|
||||
assert loaded is not None
|
||||
assert loaded.sport == "nba"
|
||||
assert loaded.records["game_1"].status == "uploaded"
|
||||
|
||||
def test_load_nonexistent_session(self):
|
||||
"""Test loading a session that doesn't exist."""
|
||||
with TemporaryDirectory() as tmpdir:
|
||||
manager = StateManager(state_dir=Path(tmpdir))
|
||||
|
||||
session = manager.load_session("nba", 2025, "development")
|
||||
|
||||
assert session is None
|
||||
|
||||
def test_delete_session(self):
|
||||
"""Test deleting a session."""
|
||||
with TemporaryDirectory() as tmpdir:
|
||||
manager = StateManager(state_dir=Path(tmpdir))
|
||||
|
||||
# Create a session
|
||||
manager.create_session(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
record_names=[("game_1", "Game")],
|
||||
)
|
||||
|
||||
# Delete it
|
||||
result = manager.delete_session("nba", 2025, "development")
|
||||
|
||||
assert result is True
|
||||
|
||||
# Verify it's gone
|
||||
loaded = manager.load_session("nba", 2025, "development")
|
||||
assert loaded is None
|
||||
|
||||
def test_delete_nonexistent_session(self):
|
||||
"""Test deleting a session that doesn't exist."""
|
||||
with TemporaryDirectory() as tmpdir:
|
||||
manager = StateManager(state_dir=Path(tmpdir))
|
||||
|
||||
result = manager.delete_session("nba", 2025, "development")
|
||||
|
||||
assert result is False
|
||||
|
||||
def test_list_sessions(self):
|
||||
"""Test listing all sessions."""
|
||||
with TemporaryDirectory() as tmpdir:
|
||||
manager = StateManager(state_dir=Path(tmpdir))
|
||||
|
||||
# Create multiple sessions
|
||||
manager.create_session(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
record_names=[("game_1", "Game")],
|
||||
)
|
||||
manager.create_session(
|
||||
sport="mlb",
|
||||
season=2026,
|
||||
environment="production",
|
||||
record_names=[("game_2", "Game"), ("game_3", "Game")],
|
||||
)
|
||||
|
||||
sessions = manager.list_sessions()
|
||||
|
||||
assert len(sessions) == 2
|
||||
sports = {s["sport"] for s in sessions}
|
||||
assert sports == {"nba", "mlb"}
|
||||
|
||||
def test_get_session_or_create_new(self):
|
||||
"""Test getting a session when none exists."""
|
||||
with TemporaryDirectory() as tmpdir:
|
||||
manager = StateManager(state_dir=Path(tmpdir))
|
||||
|
||||
session = manager.get_session_or_create(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
record_names=[("game_1", "Game")],
|
||||
resume=False,
|
||||
)
|
||||
|
||||
assert session.sport == "nba"
|
||||
assert session.total_count == 1
|
||||
|
||||
def test_get_session_or_create_resume(self):
|
||||
"""Test resuming an existing session."""
|
||||
with TemporaryDirectory() as tmpdir:
|
||||
manager = StateManager(state_dir=Path(tmpdir))
|
||||
|
||||
# Create initial session
|
||||
original = manager.create_session(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
record_names=[("game_1", "Game"), ("game_2", "Game")],
|
||||
)
|
||||
original.mark_uploaded("game_1", "tag_123")
|
||||
manager.save_session(original)
|
||||
|
||||
# Resume with additional records
|
||||
session = manager.get_session_or_create(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
record_names=[("game_1", "Game"), ("game_2", "Game"), ("game_3", "Game")],
|
||||
resume=True,
|
||||
)
|
||||
|
||||
# Should have original progress plus new record
|
||||
assert session.records["game_1"].status == "uploaded"
|
||||
assert "game_3" in session.records
|
||||
assert session.total_count == 3
|
||||
|
||||
def test_get_session_or_create_overwrite(self):
|
||||
"""Test overwriting an existing session when not resuming."""
|
||||
with TemporaryDirectory() as tmpdir:
|
||||
manager = StateManager(state_dir=Path(tmpdir))
|
||||
|
||||
# Create initial session
|
||||
original = manager.create_session(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
record_names=[("game_1", "Game"), ("game_2", "Game")],
|
||||
)
|
||||
original.mark_uploaded("game_1", "tag_123")
|
||||
manager.save_session(original)
|
||||
|
||||
# Create new session (not resuming)
|
||||
session = manager.get_session_or_create(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
record_names=[("game_3", "Game")],
|
||||
resume=False,
|
||||
)
|
||||
|
||||
# Should be a fresh session
|
||||
assert session.total_count == 1
|
||||
assert "game_1" not in session.records
|
||||
assert "game_3" in session.records
|
||||
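The retry-eligibility rule exercised by `test_get_retryable_records` above can be sketched in isolation. Note that `Rec` and `retryable` here are illustrative stand-ins, not the package's `RecordState`/`UploadSession` API:

```python
from dataclasses import dataclass


@dataclass
class Rec:
    """Illustrative stand-in for a tracked upload record."""
    name: str
    status: str = "pending"
    retry_count: int = 0


def retryable(records: dict, max_retries: int = 3) -> list:
    # A failed record stays eligible for retry while its retry_count is
    # below max_retries; records that hit the cap are dropped from the list.
    return [r.name for r in records.values()
            if r.status == "failed" and r.retry_count < max_retries]


recs = {
    "game_1": Rec("game_1", status="failed", retry_count=1),
    "game_2": Rec("game_2", status="failed", retry_count=3),
    "game_3": Rec("game_3"),  # still pending, not a retry candidate
}
print(retryable(recs))  # ['game_1']
```

This mirrors the test's expectation: a record failed once is retryable, one failed `max_retries` times is not, and pending records are never retried.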
52
sportstime_parser/uploaders/__init__.py
Normal file
@@ -0,0 +1,52 @@
"""CloudKit uploaders for sportstime-parser."""

from .cloudkit import (
    CloudKitClient,
    CloudKitRecord,
    CloudKitError,
    CloudKitAuthError,
    CloudKitRateLimitError,
    CloudKitServerError,
    RecordType,
    OperationResult,
    BatchResult,
)
from .state import (
    RecordState,
    UploadSession,
    StateManager,
)
from .diff import (
    DiffAction,
    RecordDiff,
    DiffResult,
    RecordDiffer,
    game_to_cloudkit_record,
    team_to_cloudkit_record,
    stadium_to_cloudkit_record,
)

__all__ = [
    # CloudKit client
    "CloudKitClient",
    "CloudKitRecord",
    "CloudKitError",
    "CloudKitAuthError",
    "CloudKitRateLimitError",
    "CloudKitServerError",
    "RecordType",
    "OperationResult",
    "BatchResult",
    # State manager
    "RecordState",
    "UploadSession",
    "StateManager",
    # Differ
    "DiffAction",
    "RecordDiff",
    "DiffResult",
    "RecordDiffer",
    "game_to_cloudkit_record",
    "team_to_cloudkit_record",
    "stadium_to_cloudkit_record",
]
578
sportstime_parser/uploaders/cloudkit.py
Normal file
@@ -0,0 +1,578 @@
|
||||
"""CloudKit Web Services client for sportstime-parser.
|
||||
|
||||
This module provides a client for uploading data to CloudKit using the
|
||||
CloudKit Web Services API. It handles JWT authentication, request signing,
|
||||
and batch operations.
|
||||
|
||||
Reference: https://developer.apple.com/documentation/cloudkitwebservices
|
||||
"""
|
||||
|
||||
import base64
|
||||
import hashlib
|
||||
import json
|
||||
import os
|
||||
import time
|
||||
from dataclasses import dataclass, field
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
from typing import Any, Optional
|
||||
from enum import Enum
|
||||
|
||||
import jwt
|
||||
import requests
|
||||
from cryptography.hazmat.primitives import hashes, serialization
|
||||
from cryptography.hazmat.primitives.asymmetric import ec
|
||||
from cryptography.hazmat.backends import default_backend
|
||||
|
||||
from ..config import (
|
||||
CLOUDKIT_CONTAINER_ID,
|
||||
CLOUDKIT_ENVIRONMENT,
|
||||
CLOUDKIT_BATCH_SIZE,
|
||||
CLOUDKIT_KEY_ID,
|
||||
CLOUDKIT_PRIVATE_KEY_PATH,
|
||||
)
|
||||
from ..utils.logging import get_logger
|
||||
|
||||
|
||||
class RecordType(str, Enum):
|
||||
"""CloudKit record types for SportsTime.
|
||||
|
||||
Must match CKRecordType constants in CKModels.swift.
|
||||
"""
|
||||
GAME = "Game"
|
||||
TEAM = "Team"
|
||||
STADIUM = "Stadium"
|
||||
TEAM_ALIAS = "TeamAlias"
|
||||
STADIUM_ALIAS = "StadiumAlias"
|
||||
SPORT = "Sport"
|
||||
LEAGUE_STRUCTURE = "LeagueStructure"
|
||||
TRIP_POLL = "TripPoll"
|
||||
POLL_VOTE = "PollVote"
|
||||
ITINERARY_ITEM = "ItineraryItem"
|
||||
|
||||
|
||||
@dataclass
|
||||
class CloudKitRecord:
|
||||
"""Represents a CloudKit record for upload.
|
||||
|
||||
Attributes:
|
||||
record_name: Unique record identifier (canonical ID)
|
||||
record_type: CloudKit record type
|
||||
fields: Dictionary of field name -> field value
|
||||
record_change_tag: Version tag for conflict detection (None for new records)
|
||||
"""
|
||||
record_name: str
|
||||
record_type: RecordType
|
||||
fields: dict[str, Any]
|
||||
record_change_tag: Optional[str] = None
|
||||
|
||||
def to_cloudkit_dict(self) -> dict:
|
||||
"""Convert to CloudKit API format."""
|
||||
record = {
|
||||
"recordName": self.record_name,
|
||||
"recordType": self.record_type.value,
|
||||
"fields": self._format_fields(),
|
||||
}
|
||||
if self.record_change_tag:
|
||||
record["recordChangeTag"] = self.record_change_tag
|
||||
return record
|
||||
|
||||
def _format_fields(self) -> dict:
|
||||
"""Format fields for CloudKit API."""
|
||||
formatted = {}
|
||||
for key, value in self.fields.items():
|
||||
if value is None:
|
||||
continue
|
||||
formatted[key] = self._format_field_value(value)
|
||||
return formatted
|
||||
|
||||
def _format_field_value(self, value: Any) -> dict:
|
||||
"""Format a single field value for CloudKit API."""
|
||||
# Check bool BEFORE int (bool is a subclass of int in Python)
|
||||
if isinstance(value, bool):
|
||||
return {"value": 1 if value else 0, "type": "INT64"}
|
||||
elif isinstance(value, str):
|
||||
return {"value": value, "type": "STRING"}
|
||||
elif isinstance(value, int):
|
||||
return {"value": value, "type": "INT64"}
|
||||
elif isinstance(value, float):
|
||||
return {"value": value, "type": "DOUBLE"}
|
||||
elif isinstance(value, datetime):
|
||||
# CloudKit expects milliseconds since epoch
|
||||
timestamp_ms = int(value.timestamp() * 1000)
|
||||
return {"value": timestamp_ms, "type": "TIMESTAMP"}
|
||||
elif isinstance(value, list):
|
||||
return {"value": value, "type": "STRING_LIST"}
|
||||
elif isinstance(value, dict) and "latitude" in value and "longitude" in value:
|
||||
return {
|
||||
"value": {
|
||||
"latitude": value["latitude"],
|
||||
"longitude": value["longitude"],
|
||||
},
|
||||
"type": "LOCATION",
|
||||
}
|
||||
else:
|
||||
# Default to string
|
||||
return {"value": str(value), "type": "STRING"}
|
||||
|
||||
|
||||
@dataclass
|
||||
class OperationResult:
|
||||
"""Result of a CloudKit operation."""
|
||||
record_name: str
|
||||
success: bool
|
||||
record_change_tag: Optional[str] = None
|
||||
error_code: Optional[str] = None
|
||||
error_message: Optional[str] = None
|
||||
|
||||
|
||||
@dataclass
|
||||
class BatchResult:
|
||||
"""Result of a batch CloudKit operation."""
|
||||
successful: list[OperationResult] = field(default_factory=list)
|
||||
failed: list[OperationResult] = field(default_factory=list)
|
||||
|
||||
@property
|
||||
def all_succeeded(self) -> bool:
|
||||
return len(self.failed) == 0
|
||||
|
||||
@property
|
||||
def success_count(self) -> int:
|
||||
return len(self.successful)
|
||||
|
||||
@property
|
||||
def failure_count(self) -> int:
|
||||
return len(self.failed)
|
||||
|
||||
|
||||
class CloudKitClient:
|
||||
"""Client for CloudKit Web Services API.
|
||||
|
||||
Handles authentication via server-to-server JWT tokens and provides
|
||||
methods for CRUD operations on CloudKit records.
|
||||
|
||||
Authentication requires:
|
||||
- Key ID: CloudKit key identifier from Apple Developer Portal
|
||||
- Private Key: EC private key in PEM format
|
||||
|
||||
Environment variables:
|
||||
- CLOUDKIT_KEY_ID: The key identifier
|
||||
- CLOUDKIT_PRIVATE_KEY_PATH: Path to the private key file
|
||||
- CLOUDKIT_PRIVATE_KEY: The private key contents (alternative to path)
|
||||
"""
|
||||
|
||||
BASE_URL = "https://api.apple-cloudkit.com"
|
||||
TOKEN_EXPIRY_SECONDS = 3600 # 1 hour
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
container_id: str = CLOUDKIT_CONTAINER_ID,
|
||||
environment: str = CLOUDKIT_ENVIRONMENT,
|
||||
key_id: Optional[str] = None,
|
||||
private_key: Optional[str] = None,
|
||||
private_key_path: Optional[str] = None,
|
||||
):
|
||||
"""Initialize the CloudKit client.
|
||||
|
||||
Args:
|
||||
container_id: CloudKit container identifier
|
||||
environment: 'development' or 'production'
|
||||
key_id: CloudKit server-to-server key ID
|
||||
private_key: PEM-encoded EC private key contents
|
||||
private_key_path: Path to PEM-encoded EC private key file
|
||||
"""
|
||||
self.container_id = container_id
|
||||
self.environment = environment
|
||||
self.logger = get_logger()
|
||||
|
||||
# Load authentication credentials (config defaults > env vars > None)
|
||||
self.key_id = key_id or os.environ.get("CLOUDKIT_KEY_ID") or CLOUDKIT_KEY_ID
|
||||
|
||||
if private_key:
|
||||
self._private_key_pem = private_key
|
||||
elif private_key_path:
|
||||
self._private_key_pem = Path(private_key_path).read_text()
|
||||
elif os.environ.get("CLOUDKIT_PRIVATE_KEY"):
|
||||
self._private_key_pem = os.environ["CLOUDKIT_PRIVATE_KEY"]
|
||||
elif os.environ.get("CLOUDKIT_PRIVATE_KEY_PATH"):
|
||||
self._private_key_pem = Path(os.environ["CLOUDKIT_PRIVATE_KEY_PATH"]).read_text()
|
||||
elif CLOUDKIT_PRIVATE_KEY_PATH.exists():
|
||||
self._private_key_pem = CLOUDKIT_PRIVATE_KEY_PATH.read_text()
|
||||
else:
|
||||
self._private_key_pem = None
|
||||
|
||||
# Parse the private key if available
|
||||
self._private_key = None
|
||||
if self._private_key_pem:
|
||||
self._private_key = serialization.load_pem_private_key(
|
||||
self._private_key_pem.encode(),
|
||||
password=None,
|
||||
backend=default_backend(),
|
||||
)
|
||||
|
||||
# Token cache
|
||||
self._token: Optional[str] = None
|
||||
self._token_expiry: float = 0
|
||||
|
||||
# Session for connection pooling
|
||||
self._session = requests.Session()
|
||||
|
||||
@property
|
||||
def is_configured(self) -> bool:
|
||||
"""Check if the client has valid authentication credentials."""
|
||||
return bool(self.key_id and self._private_key)
|
||||
|
||||
def _get_api_path(self, operation: str) -> str:
|
||||
"""Build the full API path for an operation."""
|
||||
return f"/database/1/{self.container_id}/{self.environment}/public/{operation}"
|
||||
|
||||
def _get_token(self) -> str:
|
||||
"""Get a valid JWT token, generating a new one if needed."""
|
||||
if not self.is_configured:
|
||||
raise ValueError(
|
||||
"CloudKit client not configured. Set CLOUDKIT_KEY_ID and "
|
||||
"CLOUDKIT_PRIVATE_KEY_PATH environment variables."
|
||||
)
|
||||
|
||||
now = time.time()
|
||||
|
||||
# Return cached token if still valid (with 5 min buffer)
|
||||
if self._token and (self._token_expiry - now) > 300:
|
||||
return self._token
|
||||
|
||||
# Generate new token
|
||||
expiry = now + self.TOKEN_EXPIRY_SECONDS
|
||||
|
||||
payload = {
|
||||
"iss": self.key_id,
|
||||
"iat": int(now),
|
||||
"exp": int(expiry),
|
||||
"sub": self.container_id,
|
||||
}
|
||||
|
||||
self._token = jwt.encode(
|
||||
payload,
|
||||
self._private_key,
|
||||
algorithm="ES256",
|
||||
)
|
||||
self._token_expiry = expiry
|
||||
|
||||
return self._token
|
||||
|
||||
def _sign_request(self, method: str, path: str, body: Optional[bytes] = None) -> dict:
|
||||
"""Generate request headers with authentication.
|
||||
|
||||
Args:
|
||||
method: HTTP method
|
||||
path: API path
|
||||
body: Request body bytes
|
||||
|
||||
Returns:
|
||||
Dictionary of headers to include in the request
|
||||
"""
|
||||
token = self._get_token()
|
||||
|
||||
# CloudKit uses date in ISO format
|
||||
date_str = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
|
||||
|
||||
# Calculate body hash
|
||||
if body:
|
||||
body_hash = base64.b64encode(
|
||||
hashlib.sha256(body).digest()
|
||||
).decode()
|
||||
else:
|
||||
body_hash = base64.b64encode(
|
||||
hashlib.sha256(b"").digest()
|
||||
).decode()
|
||||
|
||||
# Build the message to sign
|
||||
# Format: date:body_hash:path
|
||||
message = f"{date_str}:{body_hash}:{path}"
|
||||
|
||||
# Sign the message
|
||||
signature = self._private_key.sign(
|
||||
message.encode(),
|
||||
ec.ECDSA(hashes.SHA256()),
|
||||
)
|
||||
signature_b64 = base64.b64encode(signature).decode()
|
||||
|
||||
return {
|
||||
"Authorization": f"Bearer {token}",
|
||||
"X-Apple-CloudKit-Request-KeyID": self.key_id,
|
||||
"X-Apple-CloudKit-Request-ISO8601Date": date_str,
|
||||
"X-Apple-CloudKit-Request-SignatureV1": signature_b64,
|
||||
"Content-Type": "application/json",
|
||||
}
|
||||
|
||||
def _request(
|
||||
self,
|
||||
method: str,
|
||||
operation: str,
|
||||
body: Optional[dict] = None,
|
||||
) -> dict:
|
||||
"""Make a request to the CloudKit API.
|
||||
|
||||
Args:
|
||||
method: HTTP method
|
||||
operation: API operation path
|
||||
body: Request body as dictionary
|
||||
|
||||
Returns:
|
||||
Response data as dictionary
|
||||
|
||||
Raises:
|
||||
CloudKitError: If the request fails
|
||||
"""
|
||||
path = self._get_api_path(operation)
|
||||
url = f"{self.BASE_URL}{path}"
|
||||
|
||||
body_bytes = json.dumps(body).encode() if body else None
|
||||
headers = self._sign_request(method, path, body_bytes)
|
||||
|
||||
response = self._session.request(
|
||||
method=method,
|
||||
url=url,
|
||||
headers=headers,
|
||||
data=body_bytes,
|
||||
)
|
||||
|
||||
if response.status_code == 200:
|
||||
return response.json()
|
||||
elif response.status_code == 421:
|
||||
# Authentication required - token may be expired
|
||||
self._token = None
|
||||
raise CloudKitAuthError("Authentication failed - check credentials")
|
||||
elif response.status_code == 429:
|
||||
raise CloudKitRateLimitError("Rate limit exceeded")
|
||||
elif response.status_code >= 500:
|
||||
raise CloudKitServerError(f"Server error: {response.status_code}")
|
||||
else:
|
||||
try:
|
||||
error_data = response.json()
|
||||
error_msg = error_data.get("serverErrorCode", str(response.status_code))
|
||||
except (json.JSONDecodeError, KeyError):
|
||||
error_msg = response.text
|
||||
raise CloudKitError(f"Request failed: {error_msg}")
|
||||
|
||||
def fetch_records(
|
||||
self,
|
||||
record_type: RecordType,
|
||||
record_names: Optional[list[str]] = None,
|
||||
limit: int = 200,
|
||||
) -> list[dict]:
|
||||
"""Fetch records from CloudKit.
|
||||
|
||||
Args:
|
||||
record_type: Type of records to fetch
|
||||
record_names: Specific record names to fetch (optional)
|
||||
limit: Maximum records to return (default 200)
|
||||
|
||||
Returns:
|
||||
List of record dictionaries
|
||||
"""
|
||||
if record_names:
|
||||
# Fetch specific records by name
|
||||
body = {
|
||||
"records": [{"recordName": name} for name in record_names],
|
||||
}
|
||||
response = self._request("POST", "records/lookup", body)
|
||||
else:
|
||||
# Query all records of type
|
||||
body = {
|
||||
"query": {
|
||||
"recordType": record_type.value,
|
||||
},
|
||||
"resultsLimit": limit,
|
||||
}
|
||||
response = self._request("POST", "records/query", body)
|
||||
|
||||
records = response.get("records", [])
|
||||
return [r for r in records if "recordName" in r]
|
||||
|
||||
def fetch_all_records(self, record_type: RecordType) -> list[dict]:
|
||||
"""Fetch all records of a type using pagination.
|
||||
|
||||
Args:
|
||||
record_type: Type of records to fetch
|
||||
|
||||
Returns:
|
||||
List of all record dictionaries
|
||||
"""
|
||||
all_records = []
|
||||
continuation_marker = None
|
||||
|
||||
while True:
|
||||
body = {
|
||||
"query": {
|
||||
"recordType": record_type.value,
|
||||
},
|
||||
"resultsLimit": 200,
|
||||
}
|
||||
|
||||
if continuation_marker:
|
||||
body["continuationMarker"] = continuation_marker
|
||||
|
||||
response = self._request("POST", "records/query", body)
|
||||
|
||||
records = response.get("records", [])
|
||||
all_records.extend([r for r in records if "recordName" in r])
|
||||
|
||||
continuation_marker = response.get("continuationMarker")
|
||||
if not continuation_marker:
|
||||
break
|
||||
|
||||
return all_records
|
||||
|
||||
def save_records(self, records: list[CloudKitRecord]) -> BatchResult:
|
||||
"""Save records to CloudKit (create or update).
|
||||
|
||||
Args:
|
||||
records: List of records to save
|
||||
|
||||
Returns:
|
||||
BatchResult with success/failure details
|
||||
"""
|
||||
result = BatchResult()
|
||||
|
||||
# Process in batches
|
||||
for i in range(0, len(records), CLOUDKIT_BATCH_SIZE):
|
||||
batch = records[i:i + CLOUDKIT_BATCH_SIZE]
|
||||
batch_result = self._save_batch(batch)
|
||||
result.successful.extend(batch_result.successful)
|
||||
result.failed.extend(batch_result.failed)
|
||||
|
||||
return result
|
||||
|
||||
def _save_batch(self, records: list[CloudKitRecord]) -> BatchResult:
|
||||
"""Save a single batch of records.
|
||||
|
||||
Args:
|
||||
records: List of records (max CLOUDKIT_BATCH_SIZE)
|
||||
|
||||
Returns:
|
||||
BatchResult with success/failure details
|
||||
"""
|
||||
result = BatchResult()
|
||||
|
||||
operations = []
|
||||
for record in records:
|
||||
op = {
|
||||
"operationType": "forceReplace",
|
||||
"record": record.to_cloudkit_dict(),
|
||||
}
|
||||
operations.append(op)
|
||||
|
||||
body = {"operations": operations}
|
||||
|
||||
try:
|
||||
response = self._request("POST", "records/modify", body)
|
||||
except CloudKitError as e:
|
||||
# Entire batch failed
|
||||
for record in records:
|
||||
result.failed.append(OperationResult(
|
||||
record_name=record.record_name,
|
||||
success=False,
|
||||
error_message=str(e),
|
||||
))
|
||||
return result
|
||||
|
||||
# Process individual results
|
||||
for record_data in response.get("records", []):
|
||||
record_name = record_data.get("recordName", "unknown")
|
||||
|
||||
if "serverErrorCode" in record_data:
|
||||
result.failed.append(OperationResult(
|
||||
                        record_name=record_name,
                        success=False,
                        error_code=record_data.get("serverErrorCode"),
                        error_message=record_data.get("reason"),
                    ))
                else:
                    result.successful.append(OperationResult(
                        record_name=record_name,
                        success=True,
                        record_change_tag=record_data.get("recordChangeTag"),
                    ))

        return result

    def delete_records(
        self,
        record_type: RecordType,
        records: list[dict],
    ) -> BatchResult:
        """Delete records from CloudKit.

        Args:
            record_type: Type of records to delete
            records: List of record dicts (must have recordName and recordChangeTag)

        Returns:
            BatchResult with success/failure details
        """
        result = BatchResult()

        # Process in batches
        for i in range(0, len(records), CLOUDKIT_BATCH_SIZE):
            batch = records[i:i + CLOUDKIT_BATCH_SIZE]

            operations = []
            for record in batch:
                operations.append({
                    "operationType": "delete",
                    "record": {
                        "recordName": record["recordName"],
                        "recordChangeTag": record.get("recordChangeTag"),
                    },
                })

            body = {"operations": operations}

            try:
                response = self._request("POST", "records/modify", body)
            except CloudKitError as e:
                for record in batch:
                    result.failed.append(OperationResult(
                        record_name=record["recordName"],
                        success=False,
                        error_message=str(e),
                    ))
                continue

            for record_data in response.get("records", []):
                record_name = record_data.get("recordName", "unknown")

                if "serverErrorCode" in record_data:
                    result.failed.append(OperationResult(
                        record_name=record_name,
                        success=False,
                        error_code=record_data.get("serverErrorCode"),
                        error_message=record_data.get("reason"),
                    ))
                else:
                    result.successful.append(OperationResult(
                        record_name=record_name,
                        success=True,
                    ))

        return result


class CloudKitError(Exception):
    """Base exception for CloudKit errors."""
    pass


class CloudKitAuthError(CloudKitError):
    """Authentication error."""
    pass


class CloudKitRateLimitError(CloudKitError):
    """Rate limit exceeded."""
    pass


class CloudKitServerError(CloudKitError):
    """Server-side error."""
    pass
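The exception hierarchy above lets callers distinguish retryable failures (rate limits, transient server errors) from fatal ones (auth). A minimal sketch of how a caller might wrap an upload call with exponential backoff — the `with_backoff` helper, `flaky_upload`, and the stand-in exception classes below are illustrative, not part of the package:

```python
import time

# Illustrative stand-ins mirroring the exception hierarchy above.
class CloudKitError(Exception):
    """Base exception for CloudKit errors."""

class CloudKitRateLimitError(CloudKitError):
    """Rate limit exceeded."""

def with_backoff(fn, max_retries=3, base_delay=0.01):
    """Call fn, retrying on rate-limit errors with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except CloudKitRateLimitError:
            if attempt == max_retries:
                raise
            # Sleep 1x, 2x, 4x ... the base delay between attempts
            time.sleep(base_delay * (2 ** attempt))

calls = {"count": 0}

def flaky_upload():
    """Fail twice with a rate-limit error, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise CloudKitRateLimitError("429 from CloudKit")
    return "uploaded"

result = with_backoff(flaky_upload)  # succeeds on the third attempt
```

Auth errors, by contrast, should not be retried: re-signing the same request with a bad key will fail identically each time.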
741
sportstime_parser/uploaders/diff.py
Normal file
@@ -0,0 +1,741 @@
"""Record differ for CloudKit uploads.
|
||||
|
||||
This module compares local records with CloudKit records to determine
|
||||
what needs to be created, updated, or deleted.
|
||||
|
||||
Field names must match CKModels.swift exactly:
|
||||
- Stadium: stadiumId, canonicalId, name, city, state, location (CLLocation),
|
||||
capacity, yearOpened, imageURL, sport
|
||||
- Team: teamId, canonicalId, name, abbreviation, sport, city, stadiumCanonicalId,
|
||||
logoURL, primaryColor, secondaryColor
|
||||
- Game: gameId, canonicalId, homeTeamCanonicalId, awayTeamCanonicalId,
|
||||
stadiumCanonicalId, dateTime, sport, season, isPlayoff, broadcastInfo
|
||||
- TeamAlias: aliasId, teamCanonicalId, aliasType, aliasValue, validFrom, validUntil
|
||||
- StadiumAlias: aliasName, stadiumCanonicalId, validFrom, validUntil
|
||||
- Sport: sportId, abbreviation, displayName, iconName, colorHex,
|
||||
seasonStartMonth, seasonEndMonth, isActive
|
||||
- LeagueStructure: structureId, sport, type, name, abbreviation, parentId, displayOrder
|
||||
"""
|
||||
|
||||
from dataclasses import dataclass, field
|
||||
from datetime import datetime, date
|
||||
from enum import Enum
|
||||
from typing import Any, Optional
|
||||
|
||||
from ..models.game import Game
|
||||
from ..models.team import Team
|
||||
from ..models.stadium import Stadium
|
||||
from ..models.aliases import TeamAlias, StadiumAlias, AliasType
|
||||
from ..models.sport import Sport, LeagueStructure
|
||||
from .cloudkit import CloudKitRecord, RecordType
|
||||
|
||||
|
||||
def _date_to_datetime(d: Optional[date]) -> Optional[datetime]:
|
||||
"""Convert a date to a datetime at midnight UTC.
|
||||
|
||||
CloudKit TIMESTAMP fields require datetime, not date.
|
||||
"""
|
||||
if d is None:
|
||||
return None
|
||||
return datetime(d.year, d.month, d.day, 0, 0, 0)
|
||||
|
||||
|
||||
class DiffAction(str, Enum):
|
||||
"""Action to take for a record."""
|
||||
CREATE = "create"
|
||||
UPDATE = "update"
|
||||
DELETE = "delete"
|
||||
UNCHANGED = "unchanged"
|
||||
|
||||
|
||||
@dataclass
|
||||
class RecordDiff:
|
||||
"""Represents the difference between local and remote records.
|
||||
|
||||
Attributes:
|
||||
record_name: Canonical record ID
|
||||
record_type: CloudKit record type
|
||||
action: Action to take (create, update, delete, unchanged)
|
||||
local_record: Local CloudKitRecord (None if delete)
|
||||
remote_record: Remote record dict (None if create)
|
||||
changed_fields: List of field names that changed (for update)
|
||||
record_change_tag: Remote record's change tag (for update)
|
||||
"""
|
||||
record_name: str
|
||||
record_type: RecordType
|
||||
action: DiffAction
|
||||
local_record: Optional[CloudKitRecord] = None
|
||||
remote_record: Optional[dict] = None
|
||||
changed_fields: list[str] = field(default_factory=list)
|
||||
record_change_tag: Optional[str] = None
|
||||
|
||||
|
||||
@dataclass
|
||||
class DiffResult:
|
||||
"""Result of diffing local and remote records.
|
||||
|
||||
Attributes:
|
||||
creates: Records to create
|
||||
updates: Records to update
|
||||
deletes: Records to delete (record names)
|
||||
unchanged: Records with no changes
|
||||
"""
|
||||
creates: list[RecordDiff] = field(default_factory=list)
|
||||
updates: list[RecordDiff] = field(default_factory=list)
|
||||
deletes: list[RecordDiff] = field(default_factory=list)
|
||||
unchanged: list[RecordDiff] = field(default_factory=list)
|
||||
|
||||
@property
|
||||
def create_count(self) -> int:
|
||||
return len(self.creates)
|
||||
|
||||
@property
|
||||
def update_count(self) -> int:
|
||||
return len(self.updates)
|
||||
|
||||
@property
|
||||
def delete_count(self) -> int:
|
||||
return len(self.deletes)
|
||||
|
||||
@property
|
||||
def unchanged_count(self) -> int:
|
||||
return len(self.unchanged)
|
||||
|
||||
@property
|
||||
def total_changes(self) -> int:
|
||||
return self.create_count + self.update_count + self.delete_count
|
||||
|
||||
def get_records_to_upload(self) -> list[CloudKitRecord]:
|
||||
"""Get all records that need to be uploaded (creates + updates)."""
|
||||
records = []
|
||||
|
||||
for diff in self.creates:
|
||||
if diff.local_record:
|
||||
records.append(diff.local_record)
|
||||
|
||||
for diff in self.updates:
|
||||
if diff.local_record:
|
||||
# Add change tag for update
|
||||
diff.local_record.record_change_tag = diff.record_change_tag
|
||||
records.append(diff.local_record)
|
||||
|
||||
return records
|
||||
|
||||
|
||||
class RecordDiffer:
|
||||
"""Compares local records with CloudKit records.
|
||||
|
||||
Field names must match CKModels.swift field keys exactly (camelCase).
|
||||
"""
|
||||
|
||||
# Fields to compare for each record type (matching CKModels.swift keys)
|
||||
GAME_FIELDS = [
|
||||
"gameId", "canonicalId", "sport", "season", "dateTime",
|
||||
"homeTeamCanonicalId", "awayTeamCanonicalId", "stadiumCanonicalId",
|
||||
"isPlayoff", "broadcastInfo",
|
||||
]
|
||||
|
||||
TEAM_FIELDS = [
|
||||
"teamId", "canonicalId", "sport", "city", "name", "abbreviation",
|
||||
"stadiumCanonicalId", "logoURL", "primaryColor", "secondaryColor",
|
||||
]
|
||||
|
||||
STADIUM_FIELDS = [
|
||||
"stadiumId", "canonicalId", "sport", "name", "city", "state",
|
||||
"location", "capacity", "yearOpened", "imageURL",
|
||||
]
|
||||
|
||||
TEAM_ALIAS_FIELDS = [
|
||||
"aliasId", "teamCanonicalId", "aliasType", "aliasValue",
|
||||
"validFrom", "validUntil",
|
||||
]
|
||||
|
||||
STADIUM_ALIAS_FIELDS = [
|
||||
"aliasName", "stadiumCanonicalId", "validFrom", "validUntil",
|
||||
]
|
||||
|
||||
SPORT_FIELDS = [
|
||||
"sportId", "abbreviation", "displayName", "iconName",
|
||||
"colorHex", "seasonStartMonth", "seasonEndMonth", "isActive",
|
||||
]
|
||||
|
||||
LEAGUE_STRUCTURE_FIELDS = [
|
||||
"structureId", "sport", "type", "name", "abbreviation",
|
||||
"parentId", "displayOrder",
|
||||
]
|
||||
|
||||
def diff_games(
|
||||
self,
|
||||
local_games: list[Game],
|
||||
remote_records: list[dict],
|
||||
) -> DiffResult:
|
||||
"""Diff local games against remote CloudKit records.
|
||||
|
||||
Args:
|
||||
local_games: List of local Game objects
|
||||
remote_records: List of remote record dictionaries
|
||||
|
||||
Returns:
|
||||
DiffResult with creates, updates, deletes
|
||||
"""
|
||||
local_records = [self._game_to_record(g) for g in local_games]
|
||||
return self._diff_records(
|
||||
local_records,
|
||||
remote_records,
|
||||
RecordType.GAME,
|
||||
self.GAME_FIELDS,
|
||||
)
|
||||
|
||||
def diff_teams(
|
||||
self,
|
||||
local_teams: list[Team],
|
||||
remote_records: list[dict],
|
||||
) -> DiffResult:
|
||||
"""Diff local teams against remote CloudKit records.
|
||||
|
||||
Args:
|
||||
local_teams: List of local Team objects
|
||||
remote_records: List of remote record dictionaries
|
||||
|
||||
Returns:
|
||||
DiffResult with creates, updates, deletes
|
||||
"""
|
||||
local_records = [self._team_to_record(t) for t in local_teams]
|
||||
return self._diff_records(
|
||||
local_records,
|
||||
remote_records,
|
||||
RecordType.TEAM,
|
||||
self.TEAM_FIELDS,
|
||||
)
|
||||
|
||||
def diff_stadiums(
|
||||
self,
|
||||
local_stadiums: list[Stadium],
|
||||
remote_records: list[dict],
|
||||
) -> DiffResult:
|
||||
"""Diff local stadiums against remote CloudKit records.
|
||||
|
||||
Args:
|
||||
local_stadiums: List of local Stadium objects
|
||||
remote_records: List of remote record dictionaries
|
||||
|
||||
Returns:
|
||||
DiffResult with creates, updates, deletes
|
||||
"""
|
||||
local_records = [self._stadium_to_record(s) for s in local_stadiums]
|
||||
return self._diff_records(
|
||||
local_records,
|
||||
remote_records,
|
||||
RecordType.STADIUM,
|
||||
self.STADIUM_FIELDS,
|
||||
)
|
||||
|
||||
def diff_team_aliases(
|
||||
self,
|
||||
local_aliases: list[TeamAlias],
|
||||
remote_records: list[dict],
|
||||
) -> DiffResult:
|
||||
"""Diff local team aliases against remote CloudKit records.
|
||||
|
||||
Args:
|
||||
local_aliases: List of local TeamAlias objects
|
||||
remote_records: List of remote record dictionaries
|
||||
|
||||
Returns:
|
||||
DiffResult with creates, updates, deletes
|
||||
"""
|
||||
local_records = [self._team_alias_to_record(a) for a in local_aliases]
|
||||
return self._diff_records(
|
||||
local_records,
|
||||
remote_records,
|
||||
RecordType.TEAM_ALIAS,
|
||||
self.TEAM_ALIAS_FIELDS,
|
||||
)
|
||||
|
||||
def diff_stadium_aliases(
|
||||
self,
|
||||
local_aliases: list[StadiumAlias],
|
||||
remote_records: list[dict],
|
||||
) -> DiffResult:
|
||||
"""Diff local stadium aliases against remote CloudKit records.
|
||||
|
||||
Args:
|
||||
local_aliases: List of local StadiumAlias objects
|
||||
remote_records: List of remote record dictionaries
|
||||
|
||||
Returns:
|
||||
DiffResult with creates, updates, deletes
|
||||
"""
|
||||
local_records = [self._stadium_alias_to_record(a) for a in local_aliases]
|
||||
return self._diff_records(
|
||||
local_records,
|
||||
remote_records,
|
||||
RecordType.STADIUM_ALIAS,
|
||||
self.STADIUM_ALIAS_FIELDS,
|
||||
)
|
||||
|
||||
def diff_sports(
|
||||
self,
|
||||
local_sports: list[Sport],
|
||||
remote_records: list[dict],
|
||||
) -> DiffResult:
|
||||
"""Diff local sports against remote CloudKit records.
|
||||
|
||||
Args:
|
||||
local_sports: List of local Sport objects
|
||||
remote_records: List of remote record dictionaries
|
||||
|
||||
Returns:
|
||||
DiffResult with creates, updates, deletes
|
||||
"""
|
||||
local_records = [self._sport_to_record(s) for s in local_sports]
|
||||
return self._diff_records(
|
||||
local_records,
|
||||
remote_records,
|
||||
RecordType.SPORT,
|
||||
self.SPORT_FIELDS,
|
||||
)
|
||||
|
||||
def diff_league_structures(
|
||||
self,
|
||||
local_structures: list[LeagueStructure],
|
||||
remote_records: list[dict],
|
||||
) -> DiffResult:
|
||||
"""Diff local league structures against remote CloudKit records.
|
||||
|
||||
Args:
|
||||
local_structures: List of local LeagueStructure objects
|
||||
remote_records: List of remote record dictionaries
|
||||
|
||||
Returns:
|
||||
DiffResult with creates, updates, deletes
|
||||
"""
|
||||
local_records = [self._league_structure_to_record(s) for s in local_structures]
|
||||
return self._diff_records(
|
||||
local_records,
|
||||
remote_records,
|
||||
RecordType.LEAGUE_STRUCTURE,
|
||||
self.LEAGUE_STRUCTURE_FIELDS,
|
||||
)
|
||||
|
||||
def _diff_records(
|
||||
self,
|
||||
local_records: list[CloudKitRecord],
|
||||
remote_records: list[dict],
|
||||
record_type: RecordType,
|
||||
compare_fields: list[str],
|
||||
) -> DiffResult:
|
||||
"""Compare local and remote records.
|
||||
|
||||
Args:
|
||||
local_records: List of local CloudKitRecord objects
|
||||
remote_records: List of remote record dictionaries
|
||||
record_type: Type of records being compared
|
||||
compare_fields: List of field names to compare
|
||||
|
||||
Returns:
|
||||
DiffResult with categorized differences
|
||||
"""
|
||||
result = DiffResult()
|
||||
|
||||
# Index remote records by name
|
||||
remote_by_name: dict[str, dict] = {}
|
||||
for record in remote_records:
|
||||
name = record.get("recordName")
|
||||
if name:
|
||||
remote_by_name[name] = record
|
||||
|
||||
# Index local records by name
|
||||
local_by_name: dict[str, CloudKitRecord] = {}
|
||||
for record in local_records:
|
||||
local_by_name[record.record_name] = record
|
||||
|
||||
# Find creates and updates
|
||||
for local_record in local_records:
|
||||
remote = remote_by_name.get(local_record.record_name)
|
||||
|
||||
if remote is None:
|
||||
# New record
|
||||
result.creates.append(RecordDiff(
|
||||
record_name=local_record.record_name,
|
||||
record_type=record_type,
|
||||
action=DiffAction.CREATE,
|
||||
local_record=local_record,
|
||||
))
|
||||
else:
|
||||
# Check for changes
|
||||
changed_fields = self._compare_fields(
|
||||
local_record.fields,
|
||||
remote.get("fields", {}),
|
||||
compare_fields,
|
||||
)
|
||||
|
||||
if changed_fields:
|
||||
result.updates.append(RecordDiff(
|
||||
record_name=local_record.record_name,
|
||||
record_type=record_type,
|
||||
action=DiffAction.UPDATE,
|
||||
local_record=local_record,
|
||||
remote_record=remote,
|
||||
changed_fields=changed_fields,
|
||||
record_change_tag=remote.get("recordChangeTag"),
|
||||
))
|
||||
else:
|
||||
result.unchanged.append(RecordDiff(
|
||||
record_name=local_record.record_name,
|
||||
record_type=record_type,
|
||||
action=DiffAction.UNCHANGED,
|
||||
local_record=local_record,
|
||||
remote_record=remote,
|
||||
record_change_tag=remote.get("recordChangeTag"),
|
||||
))
|
||||
|
||||
# Find deletes (remote records not in local)
|
||||
local_names = set(local_by_name.keys())
|
||||
for remote_name, remote in remote_by_name.items():
|
||||
if remote_name not in local_names:
|
||||
result.deletes.append(RecordDiff(
|
||||
record_name=remote_name,
|
||||
record_type=record_type,
|
||||
action=DiffAction.DELETE,
|
||||
remote_record=remote,
|
||||
record_change_tag=remote.get("recordChangeTag"),
|
||||
))
|
||||
|
||||
return result
|
||||
|
||||
def _compare_fields(
|
||||
self,
|
||||
local_fields: dict[str, Any],
|
||||
remote_fields: dict[str, dict],
|
||||
compare_fields: list[str],
|
||||
) -> list[str]:
|
||||
"""Compare field values between local and remote.
|
||||
|
||||
Args:
|
||||
local_fields: Local field values
|
||||
remote_fields: Remote field values (CloudKit format)
|
||||
compare_fields: Fields to compare
|
||||
|
||||
Returns:
|
||||
List of field names that differ
|
||||
"""
|
||||
changed = []
|
||||
|
||||
for field_name in compare_fields:
|
||||
local_value = local_fields.get(field_name)
|
||||
remote_field = remote_fields.get(field_name, {})
|
||||
remote_value = remote_field.get("value") if remote_field else None
|
||||
|
||||
# Normalize values for comparison
|
||||
local_normalized = self._normalize_value(local_value)
|
||||
remote_normalized = self._normalize_remote_value(remote_value, remote_field)
|
||||
|
||||
if local_normalized != remote_normalized:
|
||||
changed.append(field_name)
|
||||
|
||||
return changed
|
||||
|
||||
def _normalize_value(self, value: Any) -> Any:
|
||||
"""Normalize a local value for comparison."""
|
||||
if value is None:
|
||||
return None
|
||||
if isinstance(value, datetime):
|
||||
# Convert to milliseconds since epoch
|
||||
return int(value.timestamp() * 1000)
|
||||
if isinstance(value, float):
|
||||
# Round to 6 decimal places for coordinate comparison
|
||||
return round(value, 6)
|
||||
return value
|
||||
|
||||
def _normalize_remote_value(self, value: Any, field_data: dict) -> Any:
|
||||
"""Normalize a remote CloudKit value for comparison."""
|
||||
if value is None:
|
||||
return None
|
||||
|
||||
field_type = field_data.get("type", "")
|
||||
|
||||
if field_type == "TIMESTAMP":
|
||||
# Already in milliseconds
|
||||
return value
|
||||
if field_type == "DOUBLE":
|
||||
return round(value, 6)
|
||||
if field_type == "LOCATION":
|
||||
# Return as tuple for comparison
|
||||
if isinstance(value, dict):
|
||||
return (
|
||||
round(value.get("latitude", 0), 6),
|
||||
round(value.get("longitude", 0), 6),
|
||||
)
|
||||
|
||||
return value
|
||||
|
||||
def _game_to_record(self, game: Game) -> CloudKitRecord:
|
||||
"""Convert a Game to a CloudKitRecord.
|
||||
|
||||
Field names match CKGame keys in CKModels.swift:
|
||||
- gameId, canonicalId: Unique identifiers
|
||||
- homeTeamCanonicalId, awayTeamCanonicalId, stadiumCanonicalId: References as strings
|
||||
- dateTime: Game time as datetime (will be converted to TIMESTAMP)
|
||||
- sport: Sport code uppercase (e.g., "MLB")
|
||||
- season: Season string (e.g., "2025-26" or "2026")
|
||||
- isPlayoff: Boolean as int (1 or 0)
|
||||
- broadcastInfo: Optional broadcast network string
|
||||
"""
|
||||
# Format season as string
|
||||
sport_lower = game.sport.lower()
|
||||
if sport_lower in ("nba", "nhl"):
|
||||
season_str = f"{game.season}-{str(game.season + 1)[-2:]}"
|
||||
else:
|
||||
season_str = str(game.season)
|
||||
|
||||
return CloudKitRecord(
|
||||
record_name=game.id,
|
||||
record_type=RecordType.GAME,
|
||||
fields={
|
||||
"gameId": game.id,
|
||||
"canonicalId": game.id,
|
||||
"sport": game.sport.upper(),
|
||||
"season": season_str,
|
||||
"dateTime": game.game_date,
|
||||
"homeTeamCanonicalId": game.home_team_id,
|
||||
"awayTeamCanonicalId": game.away_team_id,
|
||||
"stadiumCanonicalId": game.stadium_id,
|
||||
"isPlayoff": False, # Default, can be overridden
|
||||
"broadcastInfo": None, # Default, can be overridden
|
||||
},
|
||||
)
|
||||
|
||||
def _team_to_record(self, team: Team) -> CloudKitRecord:
|
||||
"""Convert a Team to a CloudKitRecord.
|
||||
|
||||
Field names match CKTeam keys in CKModels.swift:
|
||||
- teamId, canonicalId: Unique identifiers
|
||||
- name, abbreviation, city: Team info
|
||||
- sport: Sport code uppercase (e.g., "NBA")
|
||||
- stadiumCanonicalId: Home stadium canonical ID string
|
||||
- logoURL: URL string for team logo
|
||||
- primaryColor, secondaryColor: Hex color strings
|
||||
"""
|
||||
return CloudKitRecord(
|
||||
record_name=team.id,
|
||||
record_type=RecordType.TEAM,
|
||||
fields={
|
||||
"teamId": team.id,
|
||||
"canonicalId": team.id,
|
||||
"sport": team.sport.upper(),
|
||||
"city": team.city,
|
||||
"name": team.name,
|
||||
"abbreviation": team.abbreviation,
|
||||
"stadiumCanonicalId": team.stadium_id,
|
||||
"logoURL": team.logo_url,
|
||||
"primaryColor": team.primary_color,
|
||||
"secondaryColor": team.secondary_color,
|
||||
},
|
||||
)
|
||||
|
||||
def _stadium_to_record(self, stadium: Stadium) -> CloudKitRecord:
|
||||
"""Convert a Stadium to a CloudKitRecord.
|
||||
|
||||
Field names match CKStadium keys in CKModels.swift:
|
||||
- stadiumId, canonicalId: Unique identifiers
|
||||
- name, city, state: Location info
|
||||
- location: CloudKit LOCATION type with latitude/longitude
|
||||
- capacity: Seating capacity as int
|
||||
- yearOpened: Year opened as int
|
||||
- imageURL: URL string for stadium image
|
||||
- sport: Sport code uppercase (e.g., "MLB")
|
||||
"""
|
||||
return CloudKitRecord(
|
||||
record_name=stadium.id,
|
||||
record_type=RecordType.STADIUM,
|
||||
fields={
|
||||
"stadiumId": stadium.id,
|
||||
"canonicalId": stadium.id,
|
||||
"sport": stadium.sport.upper(),
|
||||
"name": stadium.name,
|
||||
"city": stadium.city,
|
||||
"state": stadium.state,
|
||||
# CloudKit LOCATION type expects dict with latitude/longitude
|
||||
"location": {
|
||||
"latitude": stadium.latitude,
|
||||
"longitude": stadium.longitude,
|
||||
},
|
||||
"capacity": stadium.capacity,
|
||||
"yearOpened": stadium.opened_year,
|
||||
"imageURL": stadium.image_url,
|
||||
},
|
||||
)
|
||||
|
||||
def _team_alias_to_record(self, alias: TeamAlias) -> CloudKitRecord:
|
||||
"""Convert a TeamAlias to a CloudKitRecord.
|
||||
|
||||
Field names match CKTeamAlias keys in CKModels.swift:
|
||||
- aliasId: Unique identifier
|
||||
- teamCanonicalId: The canonical team this alias resolves to
|
||||
- aliasType: Type of alias ("abbreviation", "name", "city")
|
||||
- aliasValue: The alias value to match
|
||||
- validFrom, validUntil: Optional date bounds
|
||||
- schemaVersion, lastModified: Versioning fields
|
||||
"""
|
||||
return CloudKitRecord(
|
||||
record_name=alias.id,
|
||||
record_type=RecordType.TEAM_ALIAS,
|
||||
fields={
|
||||
"aliasId": alias.id,
|
||||
"teamCanonicalId": alias.team_canonical_id,
|
||||
"aliasType": alias.alias_type.value,
|
||||
"aliasValue": alias.alias_value,
|
||||
"validFrom": _date_to_datetime(alias.valid_from),
|
||||
"validUntil": _date_to_datetime(alias.valid_until),
|
||||
"schemaVersion": 1,
|
||||
"lastModified": datetime.utcnow(),
|
||||
},
|
||||
)
|
||||
|
||||
def _stadium_alias_to_record(self, alias: StadiumAlias) -> CloudKitRecord:
|
||||
"""Convert a StadiumAlias to a CloudKitRecord.
|
||||
|
||||
Field names match CKStadiumAlias keys in CKModels.swift:
|
||||
- aliasName: The alias name (used as record name/primary key)
|
||||
- stadiumCanonicalId: The canonical stadium this alias resolves to
|
||||
- validFrom, validUntil: Optional date bounds
|
||||
- schemaVersion, lastModified: Versioning fields
|
||||
"""
|
||||
# Record name must be unique - combine alias name with stadium ID
|
||||
# to handle cases like "yankee stadium" mapping to both MLB and MLS stadiums
|
||||
record_name = f"{alias.alias_name.lower()}|{alias.stadium_canonical_id}"
|
||||
return CloudKitRecord(
|
||||
record_name=record_name,
|
||||
record_type=RecordType.STADIUM_ALIAS,
|
||||
fields={
|
||||
"aliasName": alias.alias_name.lower(),
|
||||
"stadiumCanonicalId": alias.stadium_canonical_id,
|
||||
"validFrom": _date_to_datetime(alias.valid_from),
|
||||
"validUntil": _date_to_datetime(alias.valid_until),
|
||||
"schemaVersion": 1,
|
||||
"lastModified": datetime.utcnow(),
|
||||
},
|
||||
)
|
||||
|
||||
def _sport_to_record(self, sport: Sport) -> CloudKitRecord:
|
||||
"""Convert a Sport to a CloudKitRecord.
|
||||
|
||||
Field names match CKSport keys in CKModels.swift:
|
||||
- sportId: Unique identifier (e.g., 'MLB', 'NBA')
|
||||
- abbreviation: Sport abbreviation
|
||||
- displayName: Full display name
|
||||
- iconName: SF Symbol name
|
||||
- colorHex: Primary color as hex string
|
||||
- seasonStartMonth, seasonEndMonth: Season boundary months (1-12)
|
||||
- isActive: Whether sport is currently supported
|
||||
- schemaVersion, lastModified: Versioning fields
|
||||
"""
|
||||
return CloudKitRecord(
|
||||
record_name=sport.id,
|
||||
record_type=RecordType.SPORT,
|
||||
fields={
|
||||
"sportId": sport.id,
|
||||
"abbreviation": sport.abbreviation,
|
||||
"displayName": sport.display_name,
|
||||
"iconName": sport.icon_name,
|
||||
"colorHex": sport.color_hex,
|
||||
"seasonStartMonth": sport.season_start_month,
|
||||
"seasonEndMonth": sport.season_end_month,
|
||||
"isActive": sport.is_active,
|
||||
"schemaVersion": 1,
|
||||
"lastModified": datetime.utcnow(),
|
||||
},
|
||||
)
|
||||
|
||||
def _league_structure_to_record(self, structure: LeagueStructure) -> CloudKitRecord:
|
||||
"""Convert a LeagueStructure to a CloudKitRecord.
|
||||
|
||||
Field names match CKLeagueStructure keys in CKModels.swift:
|
||||
- structureId: Unique identifier (e.g., 'nba_eastern', 'mlb_al_east')
|
||||
- sport: Sport code (e.g., 'NBA', 'MLB')
|
||||
- type: Structure type ('conference', 'division', 'league')
|
||||
- name: Full name
|
||||
- abbreviation: Optional abbreviation
|
||||
- parentId: Parent structure ID (e.g., division's parent is conference)
|
||||
- displayOrder: Order for display (0-indexed)
|
||||
- schemaVersion, lastModified: Versioning fields
|
||||
"""
|
||||
return CloudKitRecord(
|
||||
record_name=structure.id,
|
||||
record_type=RecordType.LEAGUE_STRUCTURE,
|
||||
fields={
|
||||
"structureId": structure.id,
|
||||
"sport": structure.sport.upper(),
|
||||
"type": structure.structure_type.value,
|
||||
"name": structure.name,
|
||||
"abbreviation": structure.abbreviation,
|
||||
"parentId": structure.parent_id,
|
||||
"displayOrder": structure.display_order,
|
||||
"schemaVersion": 1,
|
||||
"lastModified": datetime.utcnow(),
|
||||
},
|
||||
)
|
||||
|
||||
|
||||
def game_to_cloudkit_record(game: Game) -> CloudKitRecord:
|
||||
"""Convert a Game to a CloudKitRecord.
|
||||
|
||||
Convenience function for external use.
|
||||
"""
|
||||
differ = RecordDiffer()
|
||||
return differ._game_to_record(game)
|
||||
|
||||
|
||||
def team_to_cloudkit_record(team: Team) -> CloudKitRecord:
|
||||
"""Convert a Team to a CloudKitRecord.
|
||||
|
||||
Convenience function for external use.
|
||||
"""
|
||||
differ = RecordDiffer()
|
||||
return differ._team_to_record(team)
|
||||
|
||||
|
||||
def stadium_to_cloudkit_record(stadium: Stadium) -> CloudKitRecord:
|
||||
"""Convert a Stadium to a CloudKitRecord.
|
||||
|
||||
Convenience function for external use.
|
||||
"""
|
||||
differ = RecordDiffer()
|
||||
return differ._stadium_to_record(stadium)
|
||||
|
||||
|
||||
def team_alias_to_cloudkit_record(alias: TeamAlias) -> CloudKitRecord:
|
||||
"""Convert a TeamAlias to a CloudKitRecord.
|
||||
|
||||
Convenience function for external use.
|
||||
"""
|
||||
differ = RecordDiffer()
|
||||
return differ._team_alias_to_record(alias)
|
||||
|
||||
|
||||
def stadium_alias_to_cloudkit_record(alias: StadiumAlias) -> CloudKitRecord:
|
||||
"""Convert a StadiumAlias to a CloudKitRecord.
|
||||
|
||||
Convenience function for external use.
|
||||
"""
|
||||
differ = RecordDiffer()
|
||||
return differ._stadium_alias_to_record(alias)
|
||||
|
||||
|
||||
def sport_to_cloudkit_record(sport: Sport) -> CloudKitRecord:
|
||||
"""Convert a Sport to a CloudKitRecord.
|
||||
|
||||
Convenience function for external use.
|
||||
"""
|
||||
differ = RecordDiffer()
|
||||
return differ._sport_to_record(sport)
|
||||
|
||||
|
||||
def league_structure_to_cloudkit_record(structure: LeagueStructure) -> CloudKitRecord:
|
||||
"""Convert a LeagueStructure to a CloudKitRecord.
|
||||
|
||||
Convenience function for external use.
|
||||
"""
|
||||
differ = RecordDiffer()
|
||||
return differ._league_structure_to_record(structure)
|
||||
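The create/update/delete partition implemented by `_diff_records` can be exercised in miniature on plain dicts. This sketch is illustrative only: record shapes are simplified, and the `diff_by_name` helper and sample team data are not part of the package.

```python
def diff_by_name(local, remote, compare_fields):
    """Partition records keyed by name into creates, updates, deletes, unchanged.

    local and remote are dicts of record_name -> {field: value}.
    """
    creates, updates, unchanged = [], [], []
    for name, fields in local.items():
        if name not in remote:
            creates.append(name)  # exists locally only -> create
        elif any(fields.get(f) != remote[name].get(f) for f in compare_fields):
            updates.append(name)  # exists in both, some field differs -> update
        else:
            unchanged.append(name)
    # Exists remotely only -> delete
    deletes = [name for name in remote if name not in local]
    return creates, updates, deletes, unchanged

local = {
    "nyy": {"name": "Yankees", "city": "New York"},
    "bos": {"name": "Red Sox", "city": "Boston"},
}
remote = {
    "bos": {"name": "Red Socks", "city": "Boston"},  # stale name -> update
    "laa": {"name": "Angels", "city": "Anaheim"},    # gone locally -> delete
}
creates, updates, deletes, unchanged = diff_by_name(local, remote, ["name", "city"])
# creates == ["nyy"], updates == ["bos"], deletes == ["laa"], unchanged == []
```

The real differ adds the pieces this sketch omits: per-type field lists, value normalization (timestamps to epoch milliseconds, coordinates rounded to 6 places), and carrying `recordChangeTag` forward so updates do not collide with concurrent writers.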
384
sportstime_parser/uploaders/state.py
Normal file
@@ -0,0 +1,384 @@
"""Upload state manager for resumable uploads.
|
||||
|
||||
This module tracks upload progress to enable resuming interrupted uploads.
|
||||
State is persisted to JSON files in the .parser_state directory.
|
||||
"""
|
||||
|
||||
import json
|
||||
from dataclasses import dataclass, field
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
from ..config import STATE_DIR
|
||||
|
||||
|
||||
@dataclass
|
||||
class RecordState:
|
||||
"""State of an individual record upload.
|
||||
|
||||
Attributes:
|
||||
record_name: Canonical record ID
|
||||
record_type: CloudKit record type
|
||||
uploaded_at: Timestamp when successfully uploaded
|
||||
record_change_tag: CloudKit version tag
|
||||
status: 'pending', 'uploaded', 'failed'
|
||||
error_message: Error message if failed
|
||||
retry_count: Number of retry attempts
|
||||
"""
|
||||
record_name: str
|
||||
record_type: str
|
||||
uploaded_at: Optional[datetime] = None
|
||||
record_change_tag: Optional[str] = None
|
||||
status: str = "pending"
|
||||
error_message: Optional[str] = None
|
||||
retry_count: int = 0
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
"""Convert to dictionary for JSON serialization."""
|
||||
return {
|
||||
"record_name": self.record_name,
|
||||
"record_type": self.record_type,
|
||||
"uploaded_at": self.uploaded_at.isoformat() if self.uploaded_at else None,
|
||||
"record_change_tag": self.record_change_tag,
|
||||
"status": self.status,
|
||||
"error_message": self.error_message,
|
||||
"retry_count": self.retry_count,
|
||||
}
|
||||
|
||||
@classmethod
|
||||
def from_dict(cls, data: dict) -> "RecordState":
|
||||
"""Create RecordState from dictionary."""
|
||||
uploaded_at = data.get("uploaded_at")
|
||||
if uploaded_at:
|
||||
uploaded_at = datetime.fromisoformat(uploaded_at)
|
||||
|
||||
return cls(
|
||||
record_name=data["record_name"],
|
||||
record_type=data["record_type"],
|
||||
uploaded_at=uploaded_at,
|
||||
record_change_tag=data.get("record_change_tag"),
|
||||
status=data.get("status", "pending"),
|
||||
error_message=data.get("error_message"),
|
||||
retry_count=data.get("retry_count", 0),
|
||||
)
|
||||
|
||||
|
||||
@dataclass
|
||||
class UploadSession:
|
||||
"""Tracks the state of an upload session.
|
||||
|
||||
Attributes:
|
||||
sport: Sport code
|
||||
season: Season start year
|
||||
environment: CloudKit environment
|
||||
started_at: When the upload session started
|
||||
last_updated: When the state was last updated
|
||||
records: Dictionary of record_name -> RecordState
|
||||
total_count: Total number of records to upload
|
||||
"""
|
||||
sport: str
|
||||
season: int
|
||||
environment: str
|
||||
started_at: datetime = field(default_factory=datetime.utcnow)
|
||||
last_updated: datetime = field(default_factory=datetime.utcnow)
|
||||
records: dict[str, RecordState] = field(default_factory=dict)
|
||||
total_count: int = 0
|
||||
|
||||
@property
|
||||
def uploaded_count(self) -> int:
|
||||
"""Count of successfully uploaded records."""
|
||||
return sum(1 for r in self.records.values() if r.status == "uploaded")
|
||||
|
||||
@property
|
||||
def pending_count(self) -> int:
|
||||
"""Count of pending records."""
|
||||
return sum(1 for r in self.records.values() if r.status == "pending")
|
||||
|
||||
@property
|
||||
def failed_count(self) -> int:
|
||||
"""Count of failed records."""
|
||||
return sum(1 for r in self.records.values() if r.status == "failed")
|
||||
|
||||
@property
|
||||
def is_complete(self) -> bool:
|
||||
"""Check if all records have been processed."""
|
||||
return self.pending_count == 0
|
||||
|
||||
@property
|
||||
def progress_percent(self) -> float:
|
||||
"""Calculate upload progress as percentage."""
|
||||
if self.total_count == 0:
|
||||
return 100.0
|
||||
return (self.uploaded_count / self.total_count) * 100
|
||||
|
||||
    def get_pending_records(self) -> list[str]:
        """Get list of record names that still need to be uploaded."""
        return [
            name for name, state in self.records.items()
            if state.status == "pending"
        ]

    def get_failed_records(self) -> list[str]:
        """Get list of record names that failed to upload."""
        return [
            name for name, state in self.records.items()
            if state.status == "failed"
        ]

    def get_retryable_records(self, max_retries: int = 3) -> list[str]:
        """Get failed records that can be retried."""
        return [
            name for name, state in self.records.items()
            if state.status == "failed" and state.retry_count < max_retries
        ]

    def mark_uploaded(
        self,
        record_name: str,
        record_change_tag: Optional[str] = None,
    ) -> None:
        """Mark a record as successfully uploaded."""
        if record_name in self.records:
            state = self.records[record_name]
            state.status = "uploaded"
            state.uploaded_at = datetime.utcnow()
            state.record_change_tag = record_change_tag
            state.error_message = None
            self.last_updated = datetime.utcnow()

    def mark_failed(self, record_name: str, error_message: str) -> None:
        """Mark a record as failed."""
        if record_name in self.records:
            state = self.records[record_name]
            state.status = "failed"
            state.error_message = error_message
            state.retry_count += 1
            self.last_updated = datetime.utcnow()

    def mark_pending(self, record_name: str) -> None:
        """Mark a record as pending (for retry)."""
        if record_name in self.records:
            state = self.records[record_name]
            state.status = "pending"
            state.error_message = None
            self.last_updated = datetime.utcnow()

    def add_record(self, record_name: str, record_type: str) -> None:
        """Add a new record to track."""
        if record_name not in self.records:
            self.records[record_name] = RecordState(
                record_name=record_name,
                record_type=record_type,
            )
            self.total_count = len(self.records)

    def to_dict(self) -> dict:
        """Convert to dictionary for JSON serialization."""
        return {
            "sport": self.sport,
            "season": self.season,
            "environment": self.environment,
            "started_at": self.started_at.isoformat(),
            "last_updated": self.last_updated.isoformat(),
            "total_count": self.total_count,
            "records": {
                name: state.to_dict()
                for name, state in self.records.items()
            },
        }

    @classmethod
    def from_dict(cls, data: dict) -> "UploadSession":
        """Create UploadSession from dictionary."""
        session = cls(
            sport=data["sport"],
            season=data["season"],
            environment=data["environment"],
            started_at=datetime.fromisoformat(data["started_at"]),
            last_updated=datetime.fromisoformat(data["last_updated"]),
            total_count=data.get("total_count", 0),
        )

        for name, record_data in data.get("records", {}).items():
            session.records[name] = RecordState.from_dict(record_data)

        return session


class StateManager:
    """Manages upload state persistence.

    State files are stored in .parser_state/ with naming convention:
    upload_state_{sport}_{season}_{environment}.json
    """

    def __init__(self, state_dir: Optional[Path] = None):
        """Initialize the state manager.

        Args:
            state_dir: Directory for state files (default: .parser_state/)
        """
        self.state_dir = state_dir or STATE_DIR
        self.state_dir.mkdir(parents=True, exist_ok=True)

    def _get_state_file(self, sport: str, season: int, environment: str) -> Path:
        """Get the path to a state file."""
        return self.state_dir / f"upload_state_{sport}_{season}_{environment}.json"

    def load_session(
        self,
        sport: str,
        season: int,
        environment: str,
    ) -> Optional[UploadSession]:
        """Load an existing upload session.

        Args:
            sport: Sport code
            season: Season start year
            environment: CloudKit environment

        Returns:
            UploadSession if exists, None otherwise
        """
        state_file = self._get_state_file(sport, season, environment)

        if not state_file.exists():
            return None

        try:
            with open(state_file, "r", encoding="utf-8") as f:
                data = json.load(f)
            return UploadSession.from_dict(data)
        except (json.JSONDecodeError, KeyError):
            # Corrupted state file
            return None

    def save_session(self, session: UploadSession) -> None:
        """Save an upload session to disk.

        Args:
            session: The session to save
        """
        state_file = self._get_state_file(
            session.sport,
            session.season,
            session.environment,
        )

        session.last_updated = datetime.utcnow()

        with open(state_file, "w", encoding="utf-8") as f:
            json.dump(session.to_dict(), f, indent=2)

    def create_session(
        self,
        sport: str,
        season: int,
        environment: str,
        record_names: list[tuple[str, str]],  # (record_name, record_type)
    ) -> UploadSession:
        """Create a new upload session.

        Args:
            sport: Sport code
            season: Season start year
            environment: CloudKit environment
            record_names: List of (record_name, record_type) tuples

        Returns:
            New UploadSession
        """
        session = UploadSession(
            sport=sport,
            season=season,
            environment=environment,
        )

        for record_name, record_type in record_names:
            session.add_record(record_name, record_type)

        self.save_session(session)
        return session

    def delete_session(self, sport: str, season: int, environment: str) -> bool:
        """Delete an upload session state file.

        Args:
            sport: Sport code
            season: Season start year
            environment: CloudKit environment

        Returns:
            True if deleted, False if not found
        """
        state_file = self._get_state_file(sport, season, environment)

        if state_file.exists():
            state_file.unlink()
            return True
        return False

    def list_sessions(self) -> list[dict]:
        """List all upload sessions.

        Returns:
            List of session summaries
        """
        sessions = []

        for state_file in self.state_dir.glob("upload_state_*.json"):
            try:
                with open(state_file, "r", encoding="utf-8") as f:
                    data = json.load(f)

                session = UploadSession.from_dict(data)
                sessions.append({
                    "sport": session.sport,
                    "season": session.season,
                    "environment": session.environment,
                    "started_at": session.started_at.isoformat(),
                    "last_updated": session.last_updated.isoformat(),
                    "progress": f"{session.uploaded_count}/{session.total_count}",
                    "progress_percent": f"{session.progress_percent:.1f}%",
                    "status": "complete" if session.is_complete else "in_progress",
                    "failed_count": session.failed_count,
                })
            except (json.JSONDecodeError, KeyError):
                continue

        return sessions

    def get_session_or_create(
        self,
        sport: str,
        season: int,
        environment: str,
        record_names: list[tuple[str, str]],
        resume: bool = False,
    ) -> UploadSession:
        """Get existing session or create new one.

        Args:
            sport: Sport code
            season: Season start year
            environment: CloudKit environment
            record_names: List of (record_name, record_type) tuples
            resume: Whether to resume existing session

        Returns:
            UploadSession (existing or new)
        """
        if resume:
            existing = self.load_session(sport, season, environment)
            if existing:
                # Add any new records not in existing session
                existing_names = set(existing.records.keys())
                for record_name, record_type in record_names:
                    if record_name not in existing_names:
                        existing.add_record(record_name, record_type)
                return existing

        # Create new session (overwrites existing)
        return self.create_session(sport, season, environment, record_names)
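The `UploadSession`/`StateManager` pair above persists per-record upload status so an interrupted run can be resumed with `get_session_or_create(..., resume=True)`. The core idea can be sketched with only the standard library; the function and record names below are illustrative, not the package's actual API:

```python
import json
import tempfile
from pathlib import Path


def save_state(path: Path, records: dict) -> None:
    """Persist a record-name -> status ("pending"/"uploaded"/"failed") map as JSON."""
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")


def load_state(path: Path) -> dict:
    """Load prior state; a missing or corrupted file means a fresh start."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def pending(records: dict) -> list:
    """Names that still need uploading after a resume."""
    return [name for name, status in records.items() if status == "pending"]


# Simulate an interrupted run, then resume from disk.
state_file = Path(tempfile.mkdtemp()) / "upload_state_nba_2024_development.json"
save_state(state_file, {"game-001": "uploaded", "game-002": "pending"})
remaining = pending(load_state(state_file))
```

Treating a corrupted file the same as a missing one mirrors `load_session`'s behavior: the worst case is re-uploading records the server will deduplicate, never losing track of pending ones.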
58 sportstime_parser/utils/__init__.py Normal file
@@ -0,0 +1,58 @@
"""Utility modules for sportstime-parser."""
|
||||
|
||||
from .logging import (
|
||||
get_console,
|
||||
get_logger,
|
||||
is_verbose,
|
||||
log_error,
|
||||
log_failure,
|
||||
log_game,
|
||||
log_stadium,
|
||||
log_success,
|
||||
log_team,
|
||||
log_warning,
|
||||
set_verbose,
|
||||
)
|
||||
from .http import (
|
||||
RateLimitedSession,
|
||||
get_session,
|
||||
fetch_url,
|
||||
fetch_json,
|
||||
fetch_html,
|
||||
)
|
||||
from .progress import (
|
||||
create_progress,
|
||||
create_spinner_progress,
|
||||
progress_bar,
|
||||
track_progress,
|
||||
ProgressTracker,
|
||||
ScrapeProgress,
|
||||
)
|
||||
|
||||
__all__ = [
|
||||
# Logging
|
||||
"get_console",
|
||||
"get_logger",
|
||||
"is_verbose",
|
||||
"log_error",
|
||||
"log_failure",
|
||||
"log_game",
|
||||
"log_stadium",
|
||||
"log_success",
|
||||
"log_team",
|
||||
"log_warning",
|
||||
"set_verbose",
|
||||
# HTTP
|
||||
"RateLimitedSession",
|
||||
"get_session",
|
||||
"fetch_url",
|
||||
"fetch_json",
|
||||
"fetch_html",
|
||||
# Progress
|
||||
"create_progress",
|
||||
"create_spinner_progress",
|
||||
"progress_bar",
|
||||
"track_progress",
|
||||
"ProgressTracker",
|
||||
"ScrapeProgress",
|
||||
]
|
||||
276 sportstime_parser/utils/http.py Normal file
@@ -0,0 +1,276 @@
"""HTTP utilities with rate limiting and exponential backoff."""
|
||||
|
||||
import random
|
||||
import time
|
||||
from typing import Optional
|
||||
from urllib.parse import urlparse
|
||||
|
||||
import requests
|
||||
from requests.adapters import HTTPAdapter
|
||||
from urllib3.util.retry import Retry
|
||||
|
||||
from ..config import (
|
||||
DEFAULT_REQUEST_DELAY,
|
||||
MAX_RETRIES,
|
||||
BACKOFF_FACTOR,
|
||||
INITIAL_BACKOFF,
|
||||
)
|
||||
from .logging import get_logger, log_warning
|
||||
|
||||
|
||||
# User agents for rotation to avoid blocks
|
||||
USER_AGENTS = [
|
||||
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
|
||||
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
|
||||
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
|
||||
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
|
||||
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0",
|
||||
]
|
||||
|
||||
|
||||
class RateLimitedSession:
|
||||
"""HTTP session with rate limiting and exponential backoff.
|
||||
|
||||
Features:
|
||||
- Configurable delay between requests
|
||||
- Automatic 429 detection with exponential backoff
|
||||
- User-agent rotation
|
||||
- Connection pooling
|
||||
- Automatic retries for transient errors
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
delay: float = DEFAULT_REQUEST_DELAY,
|
||||
max_retries: int = MAX_RETRIES,
|
||||
backoff_factor: float = BACKOFF_FACTOR,
|
||||
initial_backoff: float = INITIAL_BACKOFF,
|
||||
):
|
||||
"""Initialize the rate-limited session.
|
||||
|
||||
Args:
|
||||
delay: Minimum delay between requests in seconds
|
||||
max_retries: Maximum number of retry attempts
|
||||
backoff_factor: Multiplier for exponential backoff
|
||||
initial_backoff: Initial backoff duration in seconds
|
||||
"""
|
||||
self.delay = delay
|
||||
self.max_retries = max_retries
|
||||
self.backoff_factor = backoff_factor
|
||||
self.initial_backoff = initial_backoff
|
||||
self.last_request_time: float = 0.0
|
||||
self._domain_delays: dict[str, float] = {}
|
||||
|
||||
# Create session with retry adapter
|
||||
self.session = requests.Session()
|
||||
|
||||
# Configure automatic retries for connection errors
|
||||
retry_strategy = Retry(
|
||||
total=max_retries,
|
||||
backoff_factor=0.5,
|
||||
status_forcelist=[500, 502, 503, 504],
|
||||
allowed_methods=["GET", "HEAD"],
|
||||
)
|
||||
adapter = HTTPAdapter(max_retries=retry_strategy, pool_maxsize=10)
|
||||
self.session.mount("http://", adapter)
|
||||
self.session.mount("https://", adapter)
|
||||
|
||||
self._logger = get_logger()
|
||||
|
||||
def _get_user_agent(self) -> str:
|
||||
"""Get a random user agent."""
|
||||
return random.choice(USER_AGENTS)
|
||||
|
||||
def _get_domain(self, url: str) -> str:
|
||||
"""Extract domain from URL."""
|
||||
parsed = urlparse(url)
|
||||
return parsed.netloc
|
||||
|
||||
def _wait_for_rate_limit(self, url: str) -> None:
|
||||
"""Wait to respect rate limiting."""
|
||||
domain = self._get_domain(url)
|
||||
|
||||
# Get domain-specific delay (if 429 was received)
|
||||
domain_delay = self._domain_delays.get(domain, 0.0)
|
||||
effective_delay = max(self.delay, domain_delay)
|
||||
|
||||
elapsed = time.time() - self.last_request_time
|
||||
if elapsed < effective_delay:
|
||||
sleep_time = effective_delay - elapsed
|
||||
self._logger.debug(f"Rate limiting: sleeping {sleep_time:.2f}s")
|
||||
time.sleep(sleep_time)
|
||||
|
||||
def _handle_429(self, url: str, attempt: int) -> float:
|
||||
"""Handle 429 Too Many Requests with exponential backoff.
|
||||
|
||||
Returns the backoff duration in seconds.
|
||||
"""
|
||||
domain = self._get_domain(url)
|
||||
backoff = self.initial_backoff * (self.backoff_factor ** attempt)
|
||||
|
||||
# Add jitter to prevent thundering herd
|
||||
backoff += random.uniform(0, 1)
|
||||
|
||||
# Update domain-specific delay
|
||||
self._domain_delays[domain] = min(backoff * 2, 60.0) # Cap at 60s
|
||||
|
||||
log_warning(f"Rate limited (429) for {domain}, backing off {backoff:.1f}s")
|
||||
|
||||
return backoff
|
||||
|
||||
def get(
|
||||
self,
|
||||
url: str,
|
||||
headers: Optional[dict] = None,
|
||||
params: Optional[dict] = None,
|
||||
timeout: float = 30.0,
|
||||
) -> requests.Response:
|
||||
"""Make a rate-limited GET request with automatic retries.
|
||||
|
||||
Args:
|
||||
url: URL to fetch
|
||||
headers: Additional headers to include
|
||||
params: Query parameters
|
||||
timeout: Request timeout in seconds
|
||||
|
||||
Returns:
|
||||
Response object
|
||||
|
||||
Raises:
|
||||
requests.RequestException: If all retries fail
|
||||
"""
|
||||
# Prepare headers with user agent
|
||||
request_headers = {"User-Agent": self._get_user_agent()}
|
||||
if headers:
|
||||
request_headers.update(headers)
|
||||
|
||||
last_exception: Optional[Exception] = None
|
||||
|
||||
for attempt in range(self.max_retries + 1):
|
||||
try:
|
||||
# Wait for rate limit
|
||||
self._wait_for_rate_limit(url)
|
||||
|
||||
# Make request
|
||||
self.last_request_time = time.time()
|
||||
response = self.session.get(
|
||||
url,
|
||||
headers=request_headers,
|
||||
params=params,
|
||||
timeout=timeout,
|
||||
)
|
||||
|
||||
# Handle 429
|
||||
if response.status_code == 429:
|
||||
if attempt < self.max_retries:
|
||||
backoff = self._handle_429(url, attempt)
|
||||
time.sleep(backoff)
|
||||
continue
|
||||
else:
|
||||
response.raise_for_status()
|
||||
|
||||
# Return successful response
|
||||
return response
|
||||
|
||||
except requests.RequestException as e:
|
||||
last_exception = e
|
||||
if attempt < self.max_retries:
|
||||
backoff = self.initial_backoff * (self.backoff_factor ** attempt)
|
||||
self._logger.warning(
|
||||
f"Request failed (attempt {attempt + 1}): {e}, retrying in {backoff:.1f}s"
|
||||
)
|
||||
time.sleep(backoff)
|
||||
else:
|
||||
raise
|
||||
|
||||
# Should not reach here, but just in case
|
||||
if last_exception:
|
||||
raise last_exception
|
||||
|
||||
raise requests.RequestException("Max retries exceeded")
|
||||
|
||||
def get_json(
|
||||
self,
|
||||
url: str,
|
||||
headers: Optional[dict] = None,
|
||||
params: Optional[dict] = None,
|
||||
timeout: float = 30.0,
|
||||
) -> dict:
|
||||
"""Make a rate-limited GET request and parse JSON response.
|
||||
|
||||
Args:
|
||||
url: URL to fetch
|
||||
headers: Additional headers to include
|
||||
params: Query parameters
|
||||
timeout: Request timeout in seconds
|
||||
|
||||
Returns:
|
||||
Parsed JSON as dictionary
|
||||
|
||||
Raises:
|
||||
requests.RequestException: If request fails
|
||||
ValueError: If response is not valid JSON
|
||||
"""
|
||||
response = self.get(url, headers=headers, params=params, timeout=timeout)
|
||||
response.raise_for_status()
|
||||
return response.json()
|
||||
|
||||
def get_html(
|
||||
self,
|
||||
url: str,
|
||||
headers: Optional[dict] = None,
|
||||
params: Optional[dict] = None,
|
||||
timeout: float = 30.0,
|
||||
) -> str:
|
||||
"""Make a rate-limited GET request and return HTML text.
|
||||
|
||||
Args:
|
||||
url: URL to fetch
|
||||
headers: Additional headers to include
|
||||
params: Query parameters
|
||||
timeout: Request timeout in seconds
|
||||
|
||||
Returns:
|
||||
HTML text content
|
||||
|
||||
Raises:
|
||||
requests.RequestException: If request fails
|
||||
"""
|
||||
response = self.get(url, headers=headers, params=params, timeout=timeout)
|
||||
response.raise_for_status()
|
||||
return response.text
|
||||
|
||||
def reset_domain_delays(self) -> None:
|
||||
"""Reset domain-specific delays (e.g., after a long pause)."""
|
||||
self._domain_delays.clear()
|
||||
|
||||
def close(self) -> None:
|
||||
"""Close the session and release resources."""
|
||||
self.session.close()
|
||||
|
||||
|
||||
# Global session instance (lazy initialized)
|
||||
_global_session: Optional[RateLimitedSession] = None
|
||||
|
||||
|
||||
def get_session() -> RateLimitedSession:
|
||||
"""Get the global rate-limited session instance."""
|
||||
global _global_session
|
||||
if _global_session is None:
|
||||
_global_session = RateLimitedSession()
|
||||
return _global_session
|
||||
|
||||
|
||||
def fetch_url(url: str, **kwargs) -> requests.Response:
|
||||
"""Convenience function to fetch a URL with rate limiting."""
|
||||
return get_session().get(url, **kwargs)
|
||||
|
||||
|
||||
def fetch_json(url: str, **kwargs) -> dict:
|
||||
"""Convenience function to fetch JSON with rate limiting."""
|
||||
return get_session().get_json(url, **kwargs)
|
||||
|
||||
|
||||
def fetch_html(url: str, **kwargs) -> str:
|
||||
"""Convenience function to fetch HTML with rate limiting."""
|
||||
return get_session().get_html(url, **kwargs)
|
||||
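The retry loop in `get()` and `_handle_429()` above waits `initial_backoff * backoff_factor ** attempt` seconds plus up to one second of random jitter before retrying. Isolated from the session machinery, that schedule is just this small pure function; the default values here are illustrative and need not match the package's `config` constants:

```python
import random


def backoff_seconds(
    attempt: int,
    initial: float = 1.0,
    factor: float = 2.0,
    jitter: float = 1.0,
    cap: float = 60.0,
) -> float:
    """Exponential backoff: initial * factor**attempt, plus up to `jitter`
    seconds of randomness, capped so repeated 429s never wait unboundedly."""
    delay = initial * (factor ** attempt) + random.uniform(0.0, jitter)
    return min(delay, cap)


# With jitter disabled, the schedule is deterministic: 1s, 2s, 4s, 8s, 16s, ...
delays = [backoff_seconds(a, jitter=0.0) for a in range(5)]
```

The jitter term matters when many scraper processes hit the same host: without it they all retry at the same instant and trigger another 429 (the "thundering herd" the source comments mention).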
149 sportstime_parser/utils/logging.py Normal file
@@ -0,0 +1,149 @@
"""Logging infrastructure for sportstime-parser."""
|
||||
|
||||
import logging
|
||||
import sys
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
from rich.console import Console
|
||||
from rich.logging import RichHandler
|
||||
|
||||
from ..config import SCRIPTS_DIR
|
||||
|
||||
# Module-level state
|
||||
_logger: Optional[logging.Logger] = None
|
||||
_verbose: bool = False
|
||||
_console: Optional[Console] = None
|
||||
|
||||
|
||||
def get_console() -> Console:
|
||||
"""Get the shared Rich console instance."""
|
||||
global _console
|
||||
if _console is None:
|
||||
_console = Console()
|
||||
return _console
|
||||
|
||||
|
||||
def set_verbose(verbose: bool) -> None:
|
||||
"""Set verbose mode globally."""
|
||||
global _verbose
|
||||
_verbose = verbose
|
||||
|
||||
logger = get_logger()
|
||||
if verbose:
|
||||
logger.setLevel(logging.DEBUG)
|
||||
else:
|
||||
logger.setLevel(logging.INFO)
|
||||
|
||||
|
||||
def is_verbose() -> bool:
|
||||
"""Check if verbose mode is enabled."""
|
||||
return _verbose
|
||||
|
||||
|
||||
def get_logger() -> logging.Logger:
|
||||
"""Get or create the application logger."""
|
||||
global _logger
|
||||
|
||||
if _logger is not None:
|
||||
return _logger
|
||||
|
||||
_logger = logging.getLogger("sportstime_parser")
|
||||
_logger.setLevel(logging.INFO)
|
||||
|
||||
# Prevent propagation to root logger
|
||||
_logger.propagate = False
|
||||
|
||||
# Clear any existing handlers
|
||||
_logger.handlers.clear()
|
||||
|
||||
# Console handler with Rich formatting
|
||||
console_handler = RichHandler(
|
||||
console=get_console(),
|
||||
show_time=True,
|
||||
show_path=False,
|
||||
rich_tracebacks=True,
|
||||
tracebacks_show_locals=True,
|
||||
markup=True,
|
||||
)
|
||||
console_handler.setLevel(logging.DEBUG)
|
||||
console_format = logging.Formatter("%(message)s")
|
||||
console_handler.setFormatter(console_format)
|
||||
_logger.addHandler(console_handler)
|
||||
|
||||
# File handler for persistent logs
|
||||
log_dir = SCRIPTS_DIR / "logs"
|
||||
log_dir.mkdir(exist_ok=True)
|
||||
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
log_file = log_dir / f"parser_{timestamp}.log"
|
||||
|
||||
file_handler = logging.FileHandler(log_file, encoding="utf-8")
|
||||
file_handler.setLevel(logging.DEBUG)
|
||||
file_format = logging.Formatter(
|
||||
"%(asctime)s | %(levelname)-8s | %(message)s",
|
||||
datefmt="%Y-%m-%d %H:%M:%S",
|
||||
)
|
||||
file_handler.setFormatter(file_format)
|
||||
_logger.addHandler(file_handler)
|
||||
|
||||
return _logger
|
||||
|
||||
|
||||
def log_game(
|
||||
sport: str,
|
||||
game_id: str,
|
||||
home: str,
|
||||
away: str,
|
||||
date: str,
|
||||
status: str = "parsed",
|
||||
) -> None:
|
||||
"""Log a game being processed (only in verbose mode)."""
|
||||
if not is_verbose():
|
||||
return
|
||||
|
||||
logger = get_logger()
|
||||
logger.debug(f"[{sport.upper()}] {game_id}: {away} @ {home} ({date}) - {status}")
|
||||
|
||||
|
||||
def log_team(sport: str, team_id: str, name: str, status: str = "resolved") -> None:
|
||||
"""Log a team being processed (only in verbose mode)."""
|
||||
if not is_verbose():
|
||||
return
|
||||
|
||||
logger = get_logger()
|
||||
logger.debug(f"[{sport.upper()}] Team: {name} -> {team_id} ({status})")
|
||||
|
||||
|
||||
def log_stadium(sport: str, stadium_id: str, name: str, status: str = "resolved") -> None:
|
||||
"""Log a stadium being processed (only in verbose mode)."""
|
||||
if not is_verbose():
|
||||
return
|
||||
|
||||
logger = get_logger()
|
||||
logger.debug(f"[{sport.upper()}] Stadium: {name} -> {stadium_id} ({status})")
|
||||
|
||||
|
||||
def log_error(message: str, exc_info: bool = False) -> None:
|
||||
"""Log an error message."""
|
||||
logger = get_logger()
|
||||
logger.error(message, exc_info=exc_info)
|
||||
|
||||
|
||||
def log_warning(message: str) -> None:
|
||||
"""Log a warning message."""
|
||||
logger = get_logger()
|
||||
logger.warning(message)
|
||||
|
||||
|
||||
def log_success(message: str) -> None:
|
||||
"""Log a success message with green formatting."""
|
||||
logger = get_logger()
|
||||
logger.info(f"[green]✓[/green] {message}")
|
||||
|
||||
|
||||
def log_failure(message: str) -> None:
|
||||
"""Log a failure message with red formatting."""
|
||||
logger = get_logger()
|
||||
logger.info(f"[red]✗[/red] {message}")
|
||||
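The module above combines two patterns: a lazily-created singleton logger and a verbose flag that gates DEBUG output by adjusting the logger's level rather than filtering in each call site. A Rich-free sketch of that gating, using only the stdlib (the logger name here is hypothetical):

```python
import logging


def make_logger(verbose: bool = False) -> logging.Logger:
    """Logger whose effective level is gated by a verbose flag."""
    logger = logging.getLogger("sketch_parser")
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    logger.propagate = False   # keep records off the root logger
    logger.handlers.clear()    # idempotent: safe to call repeatedly
    handler = logging.StreamHandler()
    handler.setLevel(logging.DEBUG)  # handler passes everything; the logger level filters
    logger.addHandler(handler)
    return logger


quiet = make_logger(verbose=False)
quiet_debug_enabled = quiet.isEnabledFor(logging.DEBUG)

chatty = make_logger(verbose=True)
chatty_debug_enabled = chatty.isEnabledFor(logging.DEBUG)
```

Setting the level on the logger (not the handlers) is what lets `set_verbose()` above flip verbosity at runtime without touching the Rich or file handlers.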
360 sportstime_parser/utils/progress.py Normal file
@@ -0,0 +1,360 @@
"""Progress utilities using Rich for visual feedback."""
|
||||
|
||||
from contextlib import contextmanager
|
||||
from typing import Generator, Iterable, Optional, TypeVar
|
||||
|
||||
from rich.progress import (
|
||||
Progress,
|
||||
SpinnerColumn,
|
||||
TextColumn,
|
||||
BarColumn,
|
||||
TaskProgressColumn,
|
||||
TimeElapsedColumn,
|
||||
TimeRemainingColumn,
|
||||
MofNCompleteColumn,
|
||||
)
|
||||
from rich.console import Console
|
||||
|
||||
from .logging import get_console
|
||||
|
||||
T = TypeVar("T")
|
||||
|
||||
|
||||
def create_progress() -> Progress:
|
||||
"""Create a Rich progress bar with standard columns."""
|
||||
return Progress(
|
||||
SpinnerColumn(),
|
||||
TextColumn("[bold blue]{task.description}"),
|
||||
BarColumn(bar_width=40),
|
||||
TaskProgressColumn(),
|
||||
MofNCompleteColumn(),
|
||||
TimeElapsedColumn(),
|
||||
TimeRemainingColumn(),
|
||||
console=get_console(),
|
||||
transient=False,
|
||||
)
|
||||
|
||||
|
||||
def create_spinner_progress() -> Progress:
|
||||
"""Create a Rich progress bar with spinner only (for indeterminate tasks)."""
|
||||
return Progress(
|
||||
SpinnerColumn(),
|
||||
TextColumn("[bold blue]{task.description}"),
|
||||
TimeElapsedColumn(),
|
||||
console=get_console(),
|
||||
transient=True,
|
||||
)
|
||||
|
||||
|
||||
@contextmanager
|
||||
def progress_bar(
|
||||
description: str,
|
||||
total: Optional[int] = None,
|
||||
) -> Generator[tuple[Progress, int], None, None]:
|
||||
"""Context manager for a progress bar.
|
||||
|
||||
Args:
|
||||
description: Task description to display
|
||||
total: Total number of items (None for indeterminate)
|
||||
|
||||
Yields:
|
||||
Tuple of (Progress instance, task_id)
|
||||
|
||||
Example:
|
||||
with progress_bar("Scraping games", total=100) as (progress, task):
|
||||
for item in items:
|
||||
process(item)
|
||||
progress.advance(task)
|
||||
"""
|
||||
if total is None:
|
||||
progress = create_spinner_progress()
|
||||
else:
|
||||
progress = create_progress()
|
||||
|
||||
with progress:
|
||||
task_id = progress.add_task(description, total=total)
|
||||
yield progress, task_id
|
||||
|
||||
|
||||
def track_progress(
|
||||
iterable: Iterable[T],
|
||||
description: str,
|
||||
total: Optional[int] = None,
|
||||
) -> Generator[T, None, None]:
|
||||
"""Wrap an iterable with a progress bar.
|
||||
|
||||
Args:
|
||||
iterable: Items to iterate over
|
||||
description: Task description to display
|
||||
total: Total number of items (auto-detected if iterable has len)
|
||||
|
||||
Yields:
|
||||
Items from the iterable
|
||||
|
||||
Example:
|
||||
for game in track_progress(games, "Processing games"):
|
||||
process(game)
|
||||
"""
|
||||
# Try to get length if not provided
|
||||
if total is None:
|
||||
try:
|
||||
total = len(iterable) # type: ignore
|
||||
except TypeError:
|
||||
pass
|
||||
|
||||
if total is None:
|
||||
# Indeterminate progress
|
||||
progress = create_spinner_progress()
|
||||
with progress:
|
||||
task_id = progress.add_task(description, total=None)
|
||||
for item in iterable:
|
||||
yield item
|
||||
progress.update(task_id, advance=1)
|
||||
else:
|
||||
# Determinate progress
|
||||
progress = create_progress()
|
||||
with progress:
|
||||
task_id = progress.add_task(description, total=total)
|
||||
for item in iterable:
|
||||
yield item
|
||||
progress.advance(task_id)
|
||||
|
||||
|
||||
class ProgressTracker:
|
||||
"""Track progress across multiple phases with nested tasks.
|
||||
|
||||
Example:
|
||||
tracker = ProgressTracker()
|
||||
tracker.start("Scraping NBA")
|
||||
|
||||
with tracker.task("Fetching schedule", total=12) as advance:
|
||||
for month in months:
|
||||
fetch(month)
|
||||
advance()
|
||||
|
||||
with tracker.task("Parsing games", total=1230) as advance:
|
||||
for game in games:
|
||||
parse(game)
|
||||
advance()
|
||||
|
||||
tracker.finish("Completed NBA scrape")
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
"""Initialize the progress tracker."""
|
||||
self._console = get_console()
|
||||
self._current_progress: Optional[Progress] = None
|
||||
self._current_task: Optional[int] = None
|
||||
|
||||
def start(self, message: str) -> None:
|
||||
"""Start a new tracking session with a message."""
|
||||
self._console.print(f"\n[bold cyan]>>> {message}[/bold cyan]")
|
||||
|
||||
def finish(self, message: str) -> None:
|
||||
"""Finish the tracking session with a message."""
|
||||
self._console.print(f"[bold green]<<< {message}[/bold green]\n")
|
||||
|
||||
@contextmanager
|
||||
def task(
|
||||
self,
|
||||
description: str,
|
||||
total: Optional[int] = None,
|
||||
) -> Generator[callable, None, None]:
|
||||
"""Context manager for a tracked task.
|
||||
|
||||
Args:
|
||||
description: Task description
|
||||
total: Total items (None for indeterminate)
|
||||
|
||||
Yields:
|
||||
Callable to advance the progress
|
||||
|
||||
Example:
|
||||
with tracker.task("Processing", total=100) as advance:
|
||||
for item in items:
|
||||
process(item)
|
||||
advance()
|
||||
"""
|
||||
with progress_bar(description, total) as (progress, task_id):
|
||||
self._current_progress = progress
|
||||
self._current_task = task_id
|
||||
|
||||
def advance(amount: int = 1) -> None:
|
||||
progress.advance(task_id, advance=amount)
|
||||
|
||||
yield advance
|
||||
|
||||
self._current_progress = None
|
||||
self._current_task = None
|
||||
|
||||
def log(self, message: str) -> None:
|
||||
"""Log a message (will be displayed above progress bar if active)."""
|
||||
if self._current_progress:
|
||||
self._current_progress.console.print(f" {message}")
|
||||
else:
|
||||
self._console.print(f" {message}")
|
||||
|
||||
|
||||
class ScrapeProgress:
|
||||
"""Specialized progress tracker for scraping operations.
|
||||
|
||||
Tracks counts of games, teams, stadiums scraped and provides
|
||||
formatted status updates.
|
||||
"""
|
||||
|
||||
def __init__(self, sport: str, season: int):
|
||||
"""Initialize scrape progress for a sport.
|
||||
|
||||
Args:
|
||||
sport: Sport code (e.g., 'nba')
|
||||
season: Season start year
|
||||
"""
|
||||
self.sport = sport
|
||||
self.season = season
|
||||
self.games_count = 0
|
||||
self.teams_count = 0
|
||||
self.stadiums_count = 0
|
||||
self.errors_count = 0
|
||||
self._tracker = ProgressTracker()
|
||||
|
||||
def start(self) -> None:
|
||||
"""Start the scraping session."""
|
||||
self._tracker.start(
|
||||
f"Scraping {self.sport.upper()} {self.season}-{self.season + 1}"
|
||||
)
|
||||
|
||||
def finish(self) -> None:
|
||||
"""Finish the scraping session with summary."""
|
||||
summary = (
|
||||
f"Scraped {self.games_count} games, "
|
||||
f"{self.teams_count} teams, "
|
||||
f"{self.stadiums_count} stadiums"
|
||||
)
|
||||
if self.errors_count > 0:
|
||||
summary += f" ({self.errors_count} errors)"
|
||||
self._tracker.finish(summary)
|
||||
|
||||
@contextmanager
|
||||
def scraping_schedule(
|
||||
self,
|
||||
total_months: Optional[int] = None,
|
||||
) -> Generator[callable, None, None]:
|
||||
"""Track schedule scraping progress."""
|
||||
with self._tracker.task(
|
||||
f"Fetching {self.sport.upper()} schedule",
|
||||
total=total_months,
|
||||
) as advance:
|
||||
yield advance
|
||||
|
||||
@contextmanager
|
||||
def parsing_games(
|
||||
self,
|
||||
total_games: Optional[int] = None,
|
||||
) -> Generator[callable, None, None]:
|
||||
"""Track game parsing progress."""
|
||||
with self._tracker.task(
|
||||
"Parsing games",
|
||||
total=total_games,
|
||||
) as advance:
|
||||
|
||||
def advance_and_count(amount: int = 1) -> None:
|
||||
self.games_count += amount
|
||||
advance(amount)
|
||||
|
||||
yield advance_and_count
|
||||
|
||||
@contextmanager
|
||||
def resolving_teams(
|
||||
self,
|
||||
total_teams: Optional[int] = None,
|
||||
) -> Generator[callable, None, None]:
|
||||
"""Track team resolution progress."""
|
||||
with self._tracker.task(
|
||||
"Resolving teams",
|
||||
total=total_teams,
|
||||
) as advance:
|
||||
|
||||
def advance_and_count(amount: int = 1) -> None:
|
||||
self.teams_count += amount
|
||||
advance(amount)
|
||||
|
||||
yield advance_and_count
|
||||
|
||||
@contextmanager
    def resolving_stadiums(
        self,
        total_stadiums: Optional[int] = None,
    ) -> Generator[callable, None, None]:
        """Track stadium resolution progress."""
        with self._tracker.task(
            "Resolving stadiums",
            total=total_stadiums,
        ) as advance:

            def advance_and_count(amount: int = 1) -> None:
                self.stadiums_count += amount
                advance(amount)

            yield advance_and_count

    def log_error(self, message: str) -> None:
        """Log an error during scraping."""
        self.errors_count += 1
        self._tracker.log(f"[red]Error: {message}[/red]")

    def log_warning(self, message: str) -> None:
        """Log a warning during scraping."""
        self._tracker.log(f"[yellow]Warning: {message}[/yellow]")

    def log_info(self, message: str) -> None:
        """Log an info message during scraping."""
        self._tracker.log(message)


class SimpleProgressBar:
    """Simple progress bar wrapper for batch operations.

    Example:
        with create_progress_bar(total=100, description="Uploading") as progress:
            for item in items:
                upload(item)
                progress.advance()
    """

    def __init__(self, progress: Progress, task_id: int):
        self._progress = progress
        self._task_id = task_id

    def advance(self, amount: int = 1) -> None:
        """Advance the progress bar."""
        self._progress.advance(self._task_id, advance=amount)

    def update(self, completed: int) -> None:
        """Set the progress to a specific value."""
        self._progress.update(self._task_id, completed=completed)


@contextmanager
def create_progress_bar(
    total: int,
    description: str = "Progress",
) -> Generator[SimpleProgressBar, None, None]:
    """Create a simple progress bar for batch operations.

    Args:
        total: Total number of items
        description: Task description

    Yields:
        SimpleProgressBar with advance() and update() methods

    Example:
        with create_progress_bar(total=100, description="Uploading") as progress:
            for item in items:
                upload(item)
                progress.advance()
    """
    progress = create_progress()
    with progress:
        task_id = progress.add_task(description, total=total)
        yield SimpleProgressBar(progress, task_id)
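The wrapper above is a thin forwarding layer: `SimpleProgressBar` holds a backend and a task id, and `advance()`/`update()` delegate to it. A minimal dependency-free sketch of that pattern, with a hypothetical `FakeProgress` standing in for rich's `Progress`:

```python
# Sketch of the SimpleProgressBar delegation pattern, assuming a fake backend.
class FakeProgress:
    """Hypothetical stand-in for rich.progress.Progress."""

    def __init__(self) -> None:
        self.completed: dict[int, int] = {}

    def add_task(self, description: str, total: int) -> int:
        task_id = len(self.completed)
        self.completed[task_id] = 0
        return task_id

    def advance(self, task_id: int, advance: int = 1) -> None:
        self.completed[task_id] += advance

    def update(self, task_id: int, completed: int) -> None:
        self.completed[task_id] = completed


class Bar:
    """Same advance()/update() surface as SimpleProgressBar above."""

    def __init__(self, progress: FakeProgress, task_id: int) -> None:
        self._progress = progress
        self._task_id = task_id

    def advance(self, amount: int = 1) -> None:
        self._progress.advance(self._task_id, advance=amount)

    def update(self, completed: int) -> None:
        self._progress.update(self._task_id, completed=completed)


progress = FakeProgress()
bar = Bar(progress, progress.add_task("Uploading", total=3))
for _ in range(3):
    bar.advance()
print(progress.completed[0])  # 3
```

Because callers only see `advance()` and `update()`, the rich backend can be swapped out (e.g., for tests) without touching batch-upload code.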
32
sportstime_parser/validators/__init__.py
Normal file
@@ -0,0 +1,32 @@
"""Validators for scraped data."""
|
||||
|
||||
from .report import (
|
||||
ValidationReport,
|
||||
ValidationSummary,
|
||||
generate_report,
|
||||
detect_duplicate_games,
|
||||
validate_games,
|
||||
)
|
||||
|
||||
from .schema import (
|
||||
SchemaValidationError,
|
||||
validate_canonical_stadium,
|
||||
validate_canonical_team,
|
||||
validate_canonical_game,
|
||||
validate_and_raise,
|
||||
validate_batch,
|
||||
)
|
||||
|
||||
__all__ = [
|
||||
"ValidationReport",
|
||||
"ValidationSummary",
|
||||
"generate_report",
|
||||
"detect_duplicate_games",
|
||||
"validate_games",
|
||||
"SchemaValidationError",
|
||||
"validate_canonical_stadium",
|
||||
"validate_canonical_team",
|
||||
"validate_canonical_game",
|
||||
"validate_and_raise",
|
||||
"validate_batch",
|
||||
]
|
||||
409
sportstime_parser/validators/report.py
Normal file
@@ -0,0 +1,409 @@
"""Validation report generator for scraped data."""

from dataclasses import dataclass, field
from datetime import datetime
from pathlib import Path
from typing import Optional

from ..config import EXPECTED_GAME_COUNTS, OUTPUT_DIR
from ..models.game import Game
from ..models.team import Team
from ..models.stadium import Stadium
from ..models.aliases import ManualReviewItem, ReviewReason


@dataclass
class ValidationSummary:
    """Summary statistics for validation report.

    Attributes:
        total_games: Total number of games scraped
        valid_games: Number of games with all data resolved
        review_count: Number of items needing manual review
        unresolved_teams: Count of unresolved team names
        unresolved_stadiums: Count of unresolved stadium names
        duplicate_games: Count of potential duplicate games
        missing_data: Count of games with missing required data
        expected_games: Expected number of games for this sport
    """

    total_games: int = 0
    valid_games: int = 0
    review_count: int = 0
    unresolved_teams: int = 0
    unresolved_stadiums: int = 0
    duplicate_games: int = 0
    missing_data: int = 0
    expected_games: int = 0

    @property
    def game_coverage(self) -> float:
        """Percentage of expected games scraped."""
        if self.expected_games == 0:
            return 100.0
        return (self.total_games / self.expected_games) * 100

    @property
    def validity_rate(self) -> float:
        """Percentage of games that are valid."""
        if self.total_games == 0:
            return 100.0
        return (self.valid_games / self.total_games) * 100

    @property
    def needs_review(self) -> bool:
        """Check if report requires manual review."""
        return self.review_count > 0

    @property
    def status_emoji(self) -> str:
        """Get status emoji for report header."""
        if self.review_count == 0 and self.game_coverage >= 95:
            return "✅"
        elif self.review_count <= 10 and self.game_coverage >= 80:
            return "⚠️"
        else:
            return "❌"

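The coverage and validity properties above are plain percentages with a 100% fallback when the denominator is zero (no expectation configured, or no games scraped). A standalone check of that logic, using a local stand-in dataclass rather than the package import:

```python
# Standalone check of the ValidationSummary percentage logic above.
# Summary here is a local illustrative stand-in, not the package class.
from dataclasses import dataclass


@dataclass
class Summary:
    total_games: int = 0
    expected_games: int = 0

    @property
    def game_coverage(self) -> float:
        # Report 100% when no expectation is configured, avoiding division by zero.
        if self.expected_games == 0:
            return 100.0
        return (self.total_games / self.expected_games) * 100


s = Summary(total_games=950, expected_games=1000)
print(s.game_coverage)  # 95.0
print(Summary().game_coverage)  # 100.0 (no expected count configured)
```

The zero-denominator fallback means a sport with no entry in `EXPECTED_GAME_COUNTS` always reports full coverage, so the status emoji is driven only by review items in that case.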
@dataclass
class ValidationReport:
    """Complete validation report for a sport/season.

    Attributes:
        sport: Sport code
        season: Season start year
        source: Name of the data source used
        summary: Summary statistics
        review_items: Items requiring manual review
        games: All scraped games
        teams: All teams
        stadiums: All stadiums
        generated_at: Timestamp of report generation
    """

    sport: str
    season: int
    source: str
    summary: ValidationSummary
    review_items: list[ManualReviewItem] = field(default_factory=list)
    games: list[Game] = field(default_factory=list)
    teams: list[Team] = field(default_factory=list)
    stadiums: list[Stadium] = field(default_factory=list)
    generated_at: datetime = field(default_factory=datetime.now)

    def to_markdown(self) -> str:
        """Generate markdown report.

        Returns:
            Complete markdown report as string
        """
        lines = []

        # Header
        season_str = f"{self.season}-{str(self.season + 1)[-2:]}"
        lines.append(f"# Validation Report: {self.sport.upper()} {season_str}")
        lines.append("")
        lines.append(f"**Generated**: {self.generated_at.strftime('%Y-%m-%d %H:%M:%S')} UTC")
        lines.append(f"**Source**: {self.source}")
        lines.append(f"**Status**: {self.summary.status_emoji} {'Needs Review' if self.summary.needs_review else 'Ready'}")
        lines.append("")

        # Summary table
        lines.append("## Summary")
        lines.append("")
        lines.append("| Metric | Count |")
        lines.append("|--------|-------|")
        lines.append(f"| Total Games | {self.summary.total_games:,} |")
        lines.append(f"| Valid Games | {self.summary.valid_games:,} |")
        lines.append(f"| Expected Games | {self.summary.expected_games:,} |")
        lines.append(f"| Coverage | {self.summary.game_coverage:.1f}% |")
        lines.append(f"| Manual Review | {self.summary.review_count} |")
        lines.append(f"| Unresolved Teams | {self.summary.unresolved_teams} |")
        lines.append(f"| Unresolved Stadiums | {self.summary.unresolved_stadiums} |")
        lines.append(f"| Duplicate Games | {self.summary.duplicate_games} |")
        lines.append(f"| Missing Data | {self.summary.missing_data} |")
        lines.append("")

        # Manual review section
        if self.review_items:
            lines.append("## Manual Review Required")
            lines.append("")

            # Group by reason
            by_reason: dict[ReviewReason, list[ManualReviewItem]] = {}
            for item in self.review_items:
                if item.reason not in by_reason:
                    by_reason[item.reason] = []
                by_reason[item.reason].append(item)

            for reason, items in sorted(by_reason.items(), key=lambda x: x[0].value):
                reason_title = reason.value.replace("_", " ").title()
                lines.append(f"### {reason_title} ({len(items)})")
                lines.append("")

                for item in items[:10]:  # Limit to first 10 per category
                    lines.append(item.to_markdown())

                if len(items) > 10:
                    lines.append(f"*... and {len(items) - 10} more items*")
                lines.append("")

        # Teams section
        lines.append("## Teams")
        lines.append("")
        lines.append(f"Total teams: {len(self.teams)}")
        lines.append("")

        if self.teams:
            lines.append("| ID | Full Name | City | Conference | Division |")
            lines.append("|-----|-----------|------|------------|----------|")
            for team in sorted(self.teams, key=lambda t: t.full_name)[:20]:
                lines.append(
                    f"| `{team.id}` | {team.full_name} | {team.city} | "
                    f"{team.conference or '-'} | {team.division or '-'} |"
                )
            if len(self.teams) > 20:
                lines.append(f"*... and {len(self.teams) - 20} more teams*")
            lines.append("")

        # Stadiums section
        lines.append("## Stadiums")
        lines.append("")
        lines.append(f"Total stadiums: {len(self.stadiums)}")
        lines.append("")

        if self.stadiums:
            lines.append("| ID | Name | City | State |")
            lines.append("|-----|------|------|-------|")
            for stadium in sorted(self.stadiums, key=lambda s: s.name)[:20]:
                lines.append(
                    f"| `{stadium.id}` | {stadium.name} | "
                    f"{stadium.city} | {stadium.state} |"
                )
            if len(self.stadiums) > 20:
                lines.append(f"*... and {len(self.stadiums) - 20} more stadiums*")
            lines.append("")

        # Game samples section
        lines.append("## Game Samples")
        lines.append("")

        if self.games:
            # Show first 10 games
            lines.append("### First 10 Games")
            lines.append("")
            lines.append("| ID | Date | Away | Home | Status |")
            lines.append("|----|------|------|------|--------|")
            for game in self.games[:10]:
                date_str = game.game_date.strftime("%Y-%m-%d")
                lines.append(
                    f"| `{game.id}` | {date_str} | {game.away_team_id} | "
                    f"{game.home_team_id} | {game.status} |"
                )
            lines.append("")

            # Show games with issues
            problem_games = [g for g in self.games if not g.stadium_id]
            if problem_games:
                lines.append("### Games Missing Stadium")
                lines.append("")
                lines.append("| ID | Date | Away | Home | Raw Stadium |")
                lines.append("|----|------|------|------|-------------|")
                for game in problem_games[:10]:
                    date_str = game.game_date.strftime("%Y-%m-%d")
                    lines.append(
                        f"| `{game.id}` | {date_str} | {game.away_team_id} | "
                        f"{game.home_team_id} | {game.raw_stadium or '-'} |"
                    )
                if len(problem_games) > 10:
                    lines.append(f"*... and {len(problem_games) - 10} more*")
                lines.append("")

        lines.append("---")
        lines.append("")
        lines.append("*Generated by sportstime-parser*")

        return "\n".join(lines)

    def save(self, output_dir: Optional[Path] = None) -> Path:
        """Save report to markdown file.

        Args:
            output_dir: Directory to save to (default: OUTPUT_DIR)

        Returns:
            Path to saved file
        """
        if output_dir is None:
            output_dir = OUTPUT_DIR

        output_dir.mkdir(parents=True, exist_ok=True)

        filename = f"validation_{self.sport}_{self.season}.md"
        filepath = output_dir / filename

        with open(filepath, "w", encoding="utf-8") as f:
            f.write(self.to_markdown())

        return filepath


def generate_report(
    sport: str,
    season: int,
    source: str,
    games: list[Game],
    teams: list[Team],
    stadiums: list[Stadium],
    review_items: list[ManualReviewItem],
) -> ValidationReport:
    """Generate a validation report from scraped data.

    Args:
        sport: Sport code
        season: Season start year
        source: Data source name
        games: List of scraped games
        teams: List of teams
        stadiums: List of stadiums
        review_items: Items requiring review

    Returns:
        Complete ValidationReport
    """
    # Calculate summary
    summary = ValidationSummary(
        total_games=len(games),
        expected_games=EXPECTED_GAME_COUNTS.get(sport, 0),
        review_count=len(review_items),
    )

    # Count review item types
    for item in review_items:
        if item.reason == ReviewReason.UNRESOLVED_TEAM:
            summary.unresolved_teams += 1
        elif item.reason == ReviewReason.UNRESOLVED_STADIUM:
            summary.unresolved_stadiums += 1
        elif item.reason == ReviewReason.DUPLICATE_GAME:
            summary.duplicate_games += 1
        elif item.reason == ReviewReason.MISSING_DATA:
            summary.missing_data += 1

    # Count valid games (games with all required data)
    valid_count = 0
    for game in games:
        if game.home_team_id and game.away_team_id:
            valid_count += 1

    summary.valid_games = valid_count

    return ValidationReport(
        sport=sport,
        season=season,
        source=source,
        summary=summary,
        review_items=review_items,
        games=games,
        teams=teams,
        stadiums=stadiums,
    )

def detect_duplicate_games(games: list[Game]) -> list[ManualReviewItem]:
    """Detect potential duplicate games.

    Duplicates are identified by having the same:
    - Home team
    - Away team
    - Date (ignoring time)

    Args:
        games: List of games to check

    Returns:
        List of ManualReviewItems for duplicates
    """
    from uuid import uuid4

    seen: dict[str, Game] = {}
    duplicates: list[ManualReviewItem] = []

    for game in games:
        # Create a key for the game
        key = (
            f"{game.home_team_id}_{game.away_team_id}_"
            f"{game.game_date.strftime('%Y%m%d')}"
        )

        if key in seen:
            # Skip if it's a doubleheader (has game_number)
            if game.game_number:
                continue

            existing = seen[key]
            duplicates.append(
                ManualReviewItem(
                    id=f"dup_{uuid4().hex[:8]}",
                    reason=ReviewReason.DUPLICATE_GAME,
                    sport=game.sport,
                    raw_value=f"{game.id} vs {existing.id}",
                    context={
                        "game1_id": existing.id,
                        "game2_id": game.id,
                        "date": game.game_date.strftime("%Y-%m-%d"),
                        "home": game.home_team_id,
                        "away": game.away_team_id,
                    },
                    game_date=game.game_date.date(),
                )
            )
        else:
            seen[key] = game

    return duplicates

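The duplicate check above reduces each game to a string key of home team, away team, and calendar date, so two listings of the same matchup on the same day collapse to one key regardless of start time. A minimal sketch of that keying with plain dicts (the team IDs are illustrative):

```python
# Sketch of the home_away_YYYYMMDD duplicate key used by detect_duplicate_games.
from datetime import datetime

games = [
    {"home": "team_mlb_nyy", "away": "team_mlb_bos", "dt": datetime(2025, 6, 1, 19, 5)},
    {"home": "team_mlb_nyy", "away": "team_mlb_bos", "dt": datetime(2025, 6, 1, 13, 5)},
    {"home": "team_mlb_bos", "away": "team_mlb_nyy", "dt": datetime(2025, 6, 2, 19, 5)},
]

seen: set[str] = set()
dupes: list[str] = []
for g in games:
    # Time of day is deliberately excluded from the key.
    key = f"{g['home']}_{g['away']}_{g['dt'].strftime('%Y%m%d')}"
    if key in seen:
        dupes.append(key)
    else:
        seen.add(key)

print(dupes)  # ['team_mlb_nyy_team_mlb_bos_20250601']
```

In the real function a second game carrying a `game_number` is skipped before flagging, which is what keeps legitimate MLB doubleheaders out of the duplicate list.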
def validate_games(games: list[Game]) -> list[ManualReviewItem]:
    """Validate games and return issues found.

    Checks:
    - Missing stadium IDs
    - Missing team IDs
    - Invalid dates
    - Duplicate games

    Args:
        games: List of games to validate

    Returns:
        List of ManualReviewItems for issues
    """
    from uuid import uuid4

    issues: list[ManualReviewItem] = []

    for game in games:
        # Check for missing stadium
        if not game.stadium_id:
            issues.append(
                ManualReviewItem(
                    id=f"missing_{uuid4().hex[:8]}",
                    reason=ReviewReason.MISSING_DATA,
                    sport=game.sport,
                    raw_value=f"Game {game.id} missing stadium",
                    context={
                        "game_id": game.id,
                        "field": "stadium_id",
                        "raw_stadium": game.raw_stadium,
                    },
                    source_url=game.source_url,
                    game_date=game.game_date.date(),
                )
            )

    # Check for duplicates
    dup_issues = detect_duplicate_games(games)
    issues.extend(dup_issues)

    return issues
246
sportstime_parser/validators/schema.py
Normal file
@@ -0,0 +1,246 @@
"""JSON Schema validation for canonical output matching iOS app expectations.

This module defines schemas that match the Swift structs in BootstrapService.swift:
- JSONCanonicalStadium
- JSONCanonicalTeam
- JSONCanonicalGame

Validation is performed at runtime before outputting JSON to ensure
Python output matches what the iOS app expects.
"""

import re
from dataclasses import dataclass
from typing import Any, Callable, Optional, Union


class SchemaValidationError(Exception):
    """Raised when canonical output fails schema validation."""

    def __init__(self, model_type: str, errors: list[str]):
        self.model_type = model_type
        self.errors = errors
        super().__init__(f"{model_type} schema validation failed:\n" + "\n".join(f"  - {e}" for e in errors))


# ISO8601 UTC datetime pattern: YYYY-MM-DDTHH:MM:SSZ
ISO8601_UTC_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$")

# Season format patterns
SEASON_SPLIT_PATTERN = re.compile(r"^\d{4}-\d{2}$")  # e.g., "2025-26"
SEASON_SINGLE_PATTERN = re.compile(r"^\d{4}$")  # e.g., "2025"


@dataclass
class FieldSpec:
    """Specification for a field in the canonical schema."""

    name: str
    required: bool
    field_type: Union[type, tuple]
    validator: Optional[Callable] = None


# Schema definitions matching Swift structs in BootstrapService.swift

STADIUM_SCHEMA: list[FieldSpec] = [
    FieldSpec("canonical_id", required=True, field_type=str),
    FieldSpec("name", required=True, field_type=str),
    FieldSpec("city", required=True, field_type=str),
    FieldSpec("state", required=True, field_type=str),
    FieldSpec("latitude", required=True, field_type=(int, float)),
    FieldSpec("longitude", required=True, field_type=(int, float)),
    FieldSpec("capacity", required=True, field_type=int),
    FieldSpec("sport", required=True, field_type=str),
    FieldSpec("primary_team_abbrevs", required=True, field_type=list),
    FieldSpec("year_opened", required=False, field_type=(int, type(None))),
]

TEAM_SCHEMA: list[FieldSpec] = [
    FieldSpec("canonical_id", required=True, field_type=str),
    FieldSpec("name", required=True, field_type=str),
    FieldSpec("abbreviation", required=True, field_type=str),
    FieldSpec("sport", required=True, field_type=str),
    FieldSpec("city", required=True, field_type=str),
    FieldSpec("stadium_canonical_id", required=True, field_type=str),
    FieldSpec("conference_id", required=False, field_type=(str, type(None))),
    FieldSpec("division_id", required=False, field_type=(str, type(None))),
    FieldSpec("primary_color", required=False, field_type=(str, type(None))),
    FieldSpec("secondary_color", required=False, field_type=(str, type(None))),
]

GAME_SCHEMA: list[FieldSpec] = [
    FieldSpec("canonical_id", required=True, field_type=str),
    FieldSpec("sport", required=True, field_type=str),
    FieldSpec(
        "season",
        required=True,
        field_type=str,
        validator=lambda v: SEASON_SPLIT_PATTERN.match(v) or SEASON_SINGLE_PATTERN.match(v),
    ),
    FieldSpec(
        "game_datetime_utc",
        required=True,
        field_type=str,
        validator=lambda v: ISO8601_UTC_PATTERN.match(v),
    ),
    FieldSpec("home_team_canonical_id", required=True, field_type=str),
    FieldSpec("away_team_canonical_id", required=True, field_type=str),
    FieldSpec("stadium_canonical_id", required=True, field_type=str),
    FieldSpec("is_playoff", required=True, field_type=bool),
    FieldSpec("broadcast", required=False, field_type=(str, type(None))),
]


def validate_field(data: dict[str, Any], spec: FieldSpec) -> list[str]:
    """Validate a single field against its specification.

    Args:
        data: The dictionary to validate
        spec: The field specification

    Returns:
        List of error messages (empty if valid)
    """
    errors = []

    if spec.name not in data:
        if spec.required:
            errors.append(f"Missing required field: {spec.name}")
        return errors

    value = data[spec.name]

    # Check type
    if not isinstance(value, spec.field_type):
        expected = spec.field_type.__name__ if isinstance(spec.field_type, type) else str(spec.field_type)
        actual = type(value).__name__
        errors.append(f"Field '{spec.name}' has wrong type: expected {expected}, got {actual} (value: {value!r})")
        return errors

    # Check custom validator
    if spec.validator and value is not None:
        if not spec.validator(value):
            errors.append(f"Field '{spec.name}' failed validation: {value!r}")

    return errors


def validate_canonical_stadium(data: dict[str, Any]) -> list[str]:
    """Validate a canonical stadium dictionary.

    Args:
        data: Stadium dictionary from to_canonical_dict()

    Returns:
        List of error messages (empty if valid)
    """
    errors = []
    for spec in STADIUM_SCHEMA:
        errors.extend(validate_field(data, spec))

    # Additional validation: primary_team_abbrevs should contain strings
    if "primary_team_abbrevs" in data and isinstance(data["primary_team_abbrevs"], list):
        for i, abbrev in enumerate(data["primary_team_abbrevs"]):
            if not isinstance(abbrev, str):
                errors.append(f"primary_team_abbrevs[{i}] must be string, got {type(abbrev).__name__}")

    return errors


def validate_canonical_team(data: dict[str, Any]) -> list[str]:
    """Validate a canonical team dictionary.

    Args:
        data: Team dictionary from to_canonical_dict()

    Returns:
        List of error messages (empty if valid)
    """
    errors = []
    for spec in TEAM_SCHEMA:
        errors.extend(validate_field(data, spec))
    return errors


def validate_canonical_game(data: dict[str, Any]) -> list[str]:
    """Validate a canonical game dictionary.

    Args:
        data: Game dictionary from to_canonical_dict()

    Returns:
        List of error messages (empty if valid)
    """
    errors = []
    for spec in GAME_SCHEMA:
        errors.extend(validate_field(data, spec))
    return errors


def validate_and_raise(data: dict[str, Any], model_type: str) -> None:
    """Validate a canonical dictionary and raise on error.

    Args:
        data: Dictionary from to_canonical_dict()
        model_type: One of 'stadium', 'team', 'game'

    Raises:
        SchemaValidationError: If validation fails
        ValueError: If model_type is unknown
    """
    validators = {
        "stadium": validate_canonical_stadium,
        "team": validate_canonical_team,
        "game": validate_canonical_game,
    }

    if model_type not in validators:
        raise ValueError(f"Unknown model type: {model_type}")

    errors = validators[model_type](data)
    if errors:
        raise SchemaValidationError(model_type, errors)


def validate_batch(
    items: list[dict[str, Any]],
    model_type: str,
    fail_fast: bool = True,
) -> list[tuple[int, list[str]]]:
    """Validate a batch of canonical dictionaries.

    Args:
        items: List of dictionaries from to_canonical_dict()
        model_type: One of 'stadium', 'team', 'game'
        fail_fast: If True, raise on first error; if False, collect all errors

    Returns:
        List of (index, errors) tuples for items with validation errors

    Raises:
        SchemaValidationError: If fail_fast=True and validation fails
    """
    validators = {
        "stadium": validate_canonical_stadium,
        "team": validate_canonical_team,
        "game": validate_canonical_game,
    }

    if model_type not in validators:
        raise ValueError(f"Unknown model type: {model_type}")

    validator = validators[model_type]
    all_errors = []

    for i, item in enumerate(items):
        errors = validator(item)
        if errors:
            if fail_fast:
                raise SchemaValidationError(
                    model_type,
                    [f"Item {i}: {e}" for e in errors],
                )
            all_errors.append((i, errors))

    return all_errors
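The schema machinery above is driven entirely by `FieldSpec` entries: presence check, `isinstance` type check, then an optional regex/callable validator. A self-contained run of that pattern, trimmed to two fields (the `game_mlb_1` id and the malformed timestamp are illustrative data):

```python
# Self-contained run of the FieldSpec / validate_field pattern from schema.py,
# reduced to presence + type + regex-validator checks.
import re
from dataclasses import dataclass
from typing import Any, Callable, Optional, Union

ISO = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$")


@dataclass
class FieldSpec:
    name: str
    required: bool
    field_type: Union[type, tuple]
    validator: Optional[Callable] = None


def validate_field(data: dict[str, Any], spec: FieldSpec) -> list[str]:
    if spec.name not in data:
        return [f"Missing required field: {spec.name}"] if spec.required else []
    value = data[spec.name]
    if not isinstance(value, spec.field_type):
        return [f"Field '{spec.name}' has wrong type"]
    if spec.validator and value is not None and not spec.validator(value):
        return [f"Field '{spec.name}' failed validation: {value!r}"]
    return []


schema = [
    FieldSpec("canonical_id", required=True, field_type=str),
    FieldSpec("game_datetime_utc", required=True, field_type=str,
              validator=lambda v: ISO.match(v)),
]

# Timestamp lacks the trailing "Z", so the regex validator rejects it.
game = {"canonical_id": "game_mlb_1", "game_datetime_utc": "2025-06-01T23:05:00"}
errors = [e for spec in schema for e in validate_field(game, spec)]
print(errors)  # one error, naming game_datetime_utc
```

Because type failure returns early, a field never produces both a type error and a validator error for the same value, keeping batch reports readable.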
2036
stadium_aliases.json
Normal file
File diff suppressed because it is too large
Load Diff
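Each alias record below carries a `valid_from`/`valid_until` window, so a historical name only resolves for games dated inside that window. A minimal sketch of that lookup, using two records mirroring entries in team_aliases.json (the `resolve` helper is illustrative, not the package's resolver):

```python
# Sketch of time-windowed alias resolution over team_aliases.json-shaped records.
from datetime import date
from typing import Optional

aliases = [
    {"team_canonical_id": "team_mlb_wsn", "alias_value": "Montreal Expos",
     "valid_from": "1969-01-01", "valid_until": "2004-12-31"},
    {"team_canonical_id": "team_nba_okc", "alias_value": "Seattle SuperSonics",
     "valid_from": "1967-01-01", "valid_until": "2008-07-01"},
]


def resolve(name: str, game_date: date) -> Optional[str]:
    """Return the canonical team id if the alias is valid on game_date."""
    for a in aliases:
        if (a["alias_value"] == name
                and date.fromisoformat(a["valid_from"]) <= game_date
                and game_date <= date.fromisoformat(a["valid_until"])):
            return a["team_canonical_id"]
    return None


print(resolve("Montreal Expos", date(1994, 8, 11)))  # team_mlb_wsn
print(resolve("Montreal Expos", date(2010, 5, 1)))   # None (outside window)
```

The date bounds are what let one canonical team own several historical names without ambiguity: at most one name-window pair should match any given game date.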
634
team_aliases.json
Normal file
@@ -0,0 +1,634 @@
[
|
||||
{
|
||||
"id": "alias_mlb_1",
|
||||
"team_canonical_id": "team_mlb_wsn",
|
||||
"alias_type": "name",
|
||||
"alias_value": "Montreal Expos",
|
||||
"valid_from": "1969-01-01",
|
||||
"valid_until": "2004-12-31"
|
||||
},
|
||||
{
|
||||
"id": "alias_mlb_2",
|
||||
"team_canonical_id": "team_mlb_wsn",
|
||||
"alias_type": "abbreviation",
|
||||
"alias_value": "MON",
|
||||
"valid_from": "1969-01-01",
|
||||
"valid_until": "2004-12-31"
|
||||
},
|
||||
{
|
||||
"id": "alias_mlb_3",
|
||||
"team_canonical_id": "team_mlb_wsn",
|
||||
"alias_type": "city",
|
||||
"alias_value": "Montreal",
|
||||
"valid_from": "1969-01-01",
|
||||
"valid_until": "2004-12-31"
|
||||
},
|
||||
{
|
||||
"id": "alias_mlb_4",
|
||||
"team_canonical_id": "team_mlb_oak",
|
||||
"alias_type": "name",
|
||||
"alias_value": "Kansas City Athletics",
|
||||
"valid_from": "1955-01-01",
|
||||
"valid_until": "1967-12-31"
|
||||
},
|
||||
{
|
||||
"id": "alias_mlb_5",
|
||||
"team_canonical_id": "team_mlb_oak",
|
||||
"alias_type": "abbreviation",
|
||||
"alias_value": "KCA",
|
||||
"valid_from": "1955-01-01",
|
||||
"valid_until": "1967-12-31"
|
||||
},
|
||||
{
|
||||
"id": "alias_mlb_6",
|
||||
"team_canonical_id": "team_mlb_oak",
|
||||
"alias_type": "city",
|
||||
"alias_value": "Kansas City",
|
||||
"valid_from": "1955-01-01",
|
||||
"valid_until": "1967-12-31"
|
||||
},
|
||||
{
|
||||
"id": "alias_mlb_7",
|
||||
"team_canonical_id": "team_mlb_oak",
|
||||
"alias_type": "name",
|
||||
"alias_value": "Philadelphia Athletics",
|
||||
"valid_from": "1901-01-01",
|
||||
"valid_until": "1954-12-31"
|
||||
},
|
||||
{
|
||||
"id": "alias_mlb_8",
|
||||
"team_canonical_id": "team_mlb_oak",
|
||||
"alias_type": "abbreviation",
|
||||
"alias_value": "PHA",
|
||||
"valid_from": "1901-01-01",
|
||||
"valid_until": "1954-12-31"
|
||||
},
|
||||
{
|
||||
"id": "alias_mlb_9",
|
||||
"team_canonical_id": "team_mlb_oak",
|
||||
"alias_type": "city",
|
||||
"alias_value": "Philadelphia",
|
||||
"valid_from": "1901-01-01",
|
||||
"valid_until": "1954-12-31"
|
||||
},
|
||||
{
|
||||
"id": "alias_mlb_10",
|
||||
"team_canonical_id": "team_mlb_cle",
|
||||
"alias_type": "name",
|
||||
"alias_value": "Cleveland Indians",
|
||||
"valid_from": "1915-01-01",
|
||||
"valid_until": "2021-12-31"
|
||||
},
|
||||
{
|
||||
"id": "alias_mlb_11",
|
||||
"team_canonical_id": "team_mlb_tbr",
|
||||
"alias_type": "name",
|
||||
"alias_value": "Tampa Bay Devil Rays",
|
||||
"valid_from": "1998-01-01",
|
||||
"valid_until": "2007-12-31"
|
||||
},
|
||||
{
|
||||
"id": "alias_mlb_12",
|
||||
"team_canonical_id": "team_mlb_mia",
|
||||
"alias_type": "name",
|
||||
"alias_value": "Florida Marlins",
|
||||
"valid_from": "1993-01-01",
|
||||
"valid_until": "2011-12-31"
|
||||
},
|
||||
{
|
||||
"id": "alias_mlb_13",
|
||||
"team_canonical_id": "team_mlb_mia",
|
||||
"alias_type": "city",
|
||||
"alias_value": "Florida",
|
||||
"valid_from": "1993-01-01",
|
||||
"valid_until": "2011-12-31"
|
||||
},
|
||||
{
|
||||
"id": "alias_mlb_14",
|
||||
"team_canonical_id": "team_mlb_laa",
|
||||
"alias_type": "name",
|
||||
"alias_value": "Anaheim Angels",
|
||||
"valid_from": "1997-01-01",
|
||||
"valid_until": "2004-12-31"
|
||||
},
|
||||
{
|
||||
"id": "alias_mlb_15",
|
||||
"team_canonical_id": "team_mlb_laa",
|
||||
"alias_type": "name",
|
||||
"alias_value": "Los Angeles Angels of Anaheim",
|
||||
"valid_from": "2005-01-01",
|
||||
"valid_until": "2015-12-31"
|
||||
},
|
||||
{
|
||||
"id": "alias_mlb_16",
|
||||
"team_canonical_id": "team_mlb_laa",
|
||||
"alias_type": "name",
|
||||
"alias_value": "California Angels",
|
||||
"valid_from": "1965-01-01",
|
||||
"valid_until": "1996-12-31"
|
||||
},
|
||||
{
|
||||
"id": "alias_mlb_17",
|
||||
"team_canonical_id": "team_mlb_tex",
|
||||
"alias_type": "name",
|
||||
"alias_value": "Washington Senators",
|
||||
"valid_from": "1961-01-01",
|
||||
"valid_until": "1971-12-31"
|
||||
},
|
||||
{
|
||||
"id": "alias_mlb_18",
|
||||
"team_canonical_id": "team_mlb_tex",
|
||||
"alias_type": "abbreviation",
|
||||
"alias_value": "WS2",
|
||||
"valid_from": "1961-01-01",
|
||||
"valid_until": "1971-12-31"
|
||||
},
|
||||
{
|
||||
"id": "alias_mlb_19",
|
||||
"team_canonical_id": "team_mlb_tex",
|
||||
"alias_type": "city",
|
||||
"alias_value": "Washington",
|
||||
"valid_from": "1961-01-01",
|
||||
"valid_until": "1971-12-31"
|
||||
},
|
||||
{
|
||||
"id": "alias_mlb_20",
|
||||
"team_canonical_id": "team_mlb_mil",
|
||||
"alias_type": "name",
|
||||
"alias_value": "Seattle Pilots",
|
||||
"valid_from": "1969-01-01",
|
||||
"valid_until": "1969-12-31"
|
||||
},
|
||||
{
|
||||
"id": "alias_mlb_21",
|
||||
"team_canonical_id": "team_mlb_mil",
|
||||
"alias_type": "abbreviation",
|
||||
"alias_value": "SEP",
|
||||
"valid_from": "1969-01-01",
|
||||
"valid_until": "1969-12-31"
|
||||
},
|
||||
{
|
||||
"id": "alias_mlb_22",
|
||||
"team_canonical_id": "team_mlb_mil",
|
||||
"alias_type": "city",
|
||||
"alias_value": "Seattle",
|
||||
"valid_from": "1969-01-01",
|
||||
"valid_until": "1969-12-31"
|
||||
},
|
||||
{
|
||||
"id": "alias_mlb_23",
|
||||
"team_canonical_id": "team_mlb_hou",
|
||||
"alias_type": "name",
|
||||
"alias_value": "Houston Colt .45s",
|
||||
"valid_from": "1962-01-01",
|
||||
"valid_until": "1964-12-31"
|
||||
},
|
||||
{
|
||||
"id": "alias_nba_24",
|
||||
"team_canonical_id": "team_nba_brk",
|
||||
"alias_type": "name",
|
||||
"alias_value": "New Jersey Nets",
|
||||
"valid_from": "1977-01-01",
|
||||
"valid_until": "2012-04-30"
|
||||
},
|
||||
{
|
||||
"id": "alias_nba_25",
|
||||
"team_canonical_id": "team_nba_brk",
|
||||
"alias_type": "abbreviation",
|
||||
"alias_value": "NJN",
|
||||
"valid_from": "1977-01-01",
|
||||
"valid_until": "2012-04-30"
|
||||
},
|
||||
{
|
||||
"id": "alias_nba_26",
|
||||
"team_canonical_id": "team_nba_brk",
|
||||
"alias_type": "city",
|
||||
"alias_value": "New Jersey",
|
||||
"valid_from": "1977-01-01",
|
||||
"valid_until": "2012-04-30"
|
||||
},
|
||||
{
|
||||
"id": "alias_nba_27",
|
||||
"team_canonical_id": "team_nba_brk",
|
||||
    "alias_type": "name",
    "alias_value": "New York Nets",
    "valid_from": "1968-01-01",
    "valid_until": "1977-12-31"
  },
  {
    "id": "alias_nba_28",
    "team_canonical_id": "team_nba_okc",
    "alias_type": "name",
    "alias_value": "Seattle SuperSonics",
    "valid_from": "1967-01-01",
    "valid_until": "2008-07-01"
  },
  {
    "id": "alias_nba_29",
    "team_canonical_id": "team_nba_okc",
    "alias_type": "abbreviation",
    "alias_value": "SEA",
    "valid_from": "1967-01-01",
    "valid_until": "2008-07-01"
  },
  {
    "id": "alias_nba_30",
    "team_canonical_id": "team_nba_okc",
    "alias_type": "city",
    "alias_value": "Seattle",
    "valid_from": "1967-01-01",
    "valid_until": "2008-07-01"
  },
  {
    "id": "alias_nba_31",
    "team_canonical_id": "team_nba_mem",
    "alias_type": "name",
    "alias_value": "Vancouver Grizzlies",
    "valid_from": "1995-01-01",
    "valid_until": "2001-05-31"
  },
  {
    "id": "alias_nba_32",
    "team_canonical_id": "team_nba_mem",
    "alias_type": "abbreviation",
    "alias_value": "VAN",
    "valid_from": "1995-01-01",
    "valid_until": "2001-05-31"
  },
  {
    "id": "alias_nba_33",
    "team_canonical_id": "team_nba_mem",
    "alias_type": "city",
    "alias_value": "Vancouver",
    "valid_from": "1995-01-01",
    "valid_until": "2001-05-31"
  },
  {
    "id": "alias_nba_34",
    "team_canonical_id": "team_nba_nop",
    "alias_type": "name",
    "alias_value": "New Orleans Hornets",
    "valid_from": "2002-01-01",
    "valid_until": "2013-04-30"
  },
  {
    "id": "alias_nba_35",
    "team_canonical_id": "team_nba_nop",
    "alias_type": "abbreviation",
    "alias_value": "NOH",
    "valid_from": "2002-01-01",
    "valid_until": "2013-04-30"
  },
  {
    "id": "alias_nba_36",
    "team_canonical_id": "team_nba_nop",
    "alias_type": "name",
    "alias_value": "New Orleans/Oklahoma City Hornets",
    "valid_from": "2005-01-01",
    "valid_until": "2007-12-31"
  },
  {
    "id": "alias_nba_37",
    "team_canonical_id": "team_nba_cho",
    "alias_type": "name",
    "alias_value": "Charlotte Bobcats",
    "valid_from": "2004-01-01",
    "valid_until": "2014-04-30"
  },
  {
    "id": "alias_nba_38",
    "team_canonical_id": "team_nba_cho",
    "alias_type": "abbreviation",
    "alias_value": "CHA",
    "valid_from": "2004-01-01",
    "valid_until": "2014-04-30"
  },
  {
    "id": "alias_nba_39",
    "team_canonical_id": "team_nba_was",
    "alias_type": "name",
    "alias_value": "Washington Bullets",
    "valid_from": "1974-01-01",
    "valid_until": "1997-05-31"
  },
  {
    "id": "alias_nba_40",
    "team_canonical_id": "team_nba_was",
    "alias_type": "name",
    "alias_value": "Capital Bullets",
    "valid_from": "1973-01-01",
    "valid_until": "1973-12-31"
  },
  {
    "id": "alias_nba_41",
    "team_canonical_id": "team_nba_was",
    "alias_type": "name",
    "alias_value": "Baltimore Bullets",
    "valid_from": "1963-01-01",
    "valid_until": "1972-12-31"
  },
  {
    "id": "alias_nba_42",
    "team_canonical_id": "team_nba_lac",
    "alias_type": "name",
    "alias_value": "San Diego Clippers",
    "valid_from": "1978-01-01",
    "valid_until": "1984-05-31"
  },
  {
    "id": "alias_nba_43",
    "team_canonical_id": "team_nba_lac",
    "alias_type": "abbreviation",
    "alias_value": "SDC",
    "valid_from": "1978-01-01",
    "valid_until": "1984-05-31"
  },
  {
    "id": "alias_nba_44",
    "team_canonical_id": "team_nba_lac",
    "alias_type": "city",
    "alias_value": "San Diego",
    "valid_from": "1978-01-01",
    "valid_until": "1984-05-31"
  },
  {
    "id": "alias_nba_45",
    "team_canonical_id": "team_nba_lac",
    "alias_type": "name",
    "alias_value": "Buffalo Braves",
    "valid_from": "1970-01-01",
    "valid_until": "1978-05-31"
  },
  {
    "id": "alias_nba_46",
    "team_canonical_id": "team_nba_lac",
    "alias_type": "abbreviation",
    "alias_value": "BUF",
    "valid_from": "1970-01-01",
    "valid_until": "1978-05-31"
  },
  {
    "id": "alias_nba_47",
    "team_canonical_id": "team_nba_lac",
    "alias_type": "city",
    "alias_value": "Buffalo",
    "valid_from": "1970-01-01",
    "valid_until": "1978-05-31"
  },
  {
    "id": "alias_nba_48",
    "team_canonical_id": "team_nba_sac",
    "alias_type": "name",
    "alias_value": "Kansas City Kings",
    "valid_from": "1975-01-01",
    "valid_until": "1985-05-31"
  },
  {
    "id": "alias_nba_49",
    "team_canonical_id": "team_nba_sac",
    "alias_type": "abbreviation",
    "alias_value": "KCK",
    "valid_from": "1975-01-01",
    "valid_until": "1985-05-31"
  },
  {
    "id": "alias_nba_50",
    "team_canonical_id": "team_nba_sac",
    "alias_type": "city",
    "alias_value": "Kansas City",
    "valid_from": "1975-01-01",
    "valid_until": "1985-05-31"
  },
  {
    "id": "alias_nba_51",
    "team_canonical_id": "team_nba_uta",
    "alias_type": "name",
    "alias_value": "New Orleans Jazz",
    "valid_from": "1974-01-01",
    "valid_until": "1979-05-31"
  },
  {
    "id": "alias_nba_52",
    "team_canonical_id": "team_nba_uta",
    "alias_type": "city",
    "alias_value": "New Orleans",
    "valid_from": "1974-01-01",
    "valid_until": "1979-05-31"
  },
  {
    "id": "alias_nhl_53",
    "team_canonical_id": "team_nhl_ari",
    "alias_type": "name",
    "alias_value": "Arizona Coyotes",
    "valid_from": "2014-01-01",
    "valid_until": "2024-04-30"
  },
  {
    "id": "alias_nhl_54",
    "team_canonical_id": "team_nhl_ari",
    "alias_type": "name",
    "alias_value": "Phoenix Coyotes",
    "valid_from": "1996-01-01",
    "valid_until": "2013-12-31"
  },
  {
    "id": "alias_nhl_55",
    "team_canonical_id": "team_nhl_ari",
    "alias_type": "abbreviation",
    "alias_value": "PHX",
    "valid_from": "1996-01-01",
    "valid_until": "2013-12-31"
  },
  {
    "id": "alias_nhl_56",
    "team_canonical_id": "team_nhl_ari",
    "alias_type": "city",
    "alias_value": "Phoenix",
    "valid_from": "1996-01-01",
    "valid_until": "2013-12-31"
  },
  {
    "id": "alias_nhl_57",
    "team_canonical_id": "team_nhl_ari",
    "alias_type": "name",
    "alias_value": "Winnipeg Jets",
    "valid_from": "1979-01-01",
    "valid_until": "1996-05-31"
  },
  {
    "id": "alias_nhl_58",
    "team_canonical_id": "team_nhl_car",
    "alias_type": "name",
    "alias_value": "Hartford Whalers",
    "valid_from": "1979-01-01",
    "valid_until": "1997-05-31"
  },
  {
    "id": "alias_nhl_59",
    "team_canonical_id": "team_nhl_car",
    "alias_type": "abbreviation",
    "alias_value": "HFD",
    "valid_from": "1979-01-01",
    "valid_until": "1997-05-31"
  },
  {
    "id": "alias_nhl_60",
    "team_canonical_id": "team_nhl_car",
    "alias_type": "city",
    "alias_value": "Hartford",
    "valid_from": "1979-01-01",
    "valid_until": "1997-05-31"
  },
  {
    "id": "alias_nhl_61",
    "team_canonical_id": "team_nhl_col",
    "alias_type": "name",
    "alias_value": "Quebec Nordiques",
    "valid_from": "1979-01-01",
    "valid_until": "1995-05-31"
  },
  {
    "id": "alias_nhl_62",
    "team_canonical_id": "team_nhl_col",
    "alias_type": "abbreviation",
    "alias_value": "QUE",
    "valid_from": "1979-01-01",
    "valid_until": "1995-05-31"
  },
  {
    "id": "alias_nhl_63",
    "team_canonical_id": "team_nhl_col",
    "alias_type": "city",
    "alias_value": "Quebec",
    "valid_from": "1979-01-01",
    "valid_until": "1995-05-31"
  },
  {
    "id": "alias_nhl_64",
    "team_canonical_id": "team_nhl_dal",
    "alias_type": "name",
    "alias_value": "Minnesota North Stars",
    "valid_from": "1967-01-01",
    "valid_until": "1993-05-31"
  },
  {
    "id": "alias_nhl_65",
    "team_canonical_id": "team_nhl_dal",
    "alias_type": "abbreviation",
    "alias_value": "MNS",
    "valid_from": "1967-01-01",
    "valid_until": "1993-05-31"
  },
  {
    "id": "alias_nhl_66",
    "team_canonical_id": "team_nhl_dal",
    "alias_type": "city",
    "alias_value": "Minnesota",
    "valid_from": "1967-01-01",
    "valid_until": "1993-05-31"
  },
  {
    "id": "alias_nhl_67",
    "team_canonical_id": "team_nhl_njd",
    "alias_type": "name",
    "alias_value": "Colorado Rockies",
    "valid_from": "1976-01-01",
    "valid_until": "1982-05-31"
  },
  {
    "id": "alias_nhl_68",
    "team_canonical_id": "team_nhl_njd",
    "alias_type": "abbreviation",
    "alias_value": "CLR",
    "valid_from": "1976-01-01",
    "valid_until": "1982-05-31"
  },
  {
    "id": "alias_nhl_69",
    "team_canonical_id": "team_nhl_njd",
    "alias_type": "city",
    "alias_value": "Colorado",
    "valid_from": "1976-01-01",
    "valid_until": "1982-05-31"
  },
  {
    "id": "alias_nhl_70",
    "team_canonical_id": "team_nhl_njd",
    "alias_type": "name",
    "alias_value": "Kansas City Scouts",
    "valid_from": "1974-01-01",
    "valid_until": "1976-05-31"
  },
  {
    "id": "alias_nhl_71",
    "team_canonical_id": "team_nhl_njd",
    "alias_type": "abbreviation",
    "alias_value": "KCS",
    "valid_from": "1974-01-01",
    "valid_until": "1976-05-31"
  },
  {
    "id": "alias_nhl_72",
    "team_canonical_id": "team_nhl_njd",
    "alias_type": "city",
    "alias_value": "Kansas City",
    "valid_from": "1974-01-01",
    "valid_until": "1976-05-31"
  },
  {
    "id": "alias_nhl_73",
    "team_canonical_id": "team_nhl_wpg",
    "alias_type": "name",
    "alias_value": "Atlanta Thrashers",
    "valid_from": "1999-01-01",
    "valid_until": "2011-05-31"
  },
  {
    "id": "alias_nhl_74",
    "team_canonical_id": "team_nhl_wpg",
    "alias_type": "abbreviation",
    "alias_value": "ATL",
    "valid_from": "1999-01-01",
    "valid_until": "2011-05-31"
  },
  {
    "id": "alias_nhl_75",
    "team_canonical_id": "team_nhl_wpg",
    "alias_type": "city",
    "alias_value": "Atlanta",
    "valid_from": "1999-01-01",
    "valid_until": "2011-05-31"
  },
  {
    "id": "alias_nhl_76",
    "team_canonical_id": "team_nhl_fla",
    "alias_type": "city",
    "alias_value": "Miami",
    "valid_from": "1993-01-01",
    "valid_until": "1998-12-31"
  },
  {
    "id": "alias_nfl_77",
    "team_canonical_id": "team_nfl_was",
    "alias_type": "name",
    "alias_value": "Washington Redskins",
    "valid_from": "1937-01-01",
    "valid_until": "2020-07-13"
  },
  {
    "id": "alias_nfl_78",
    "team_canonical_id": "team_nfl_was",
    "alias_type": "name",
    "alias_value": "Washington Football Team",
    "valid_from": "2020-07-13",
    "valid_until": "2022-02-02"
  },
  {
    "id": "alias_nfl_79",
    "team_canonical_id": "team_nfl_was",
    "alias_type": "abbreviation",
    "alias_value": "WFT",
    "valid_from": "2020-07-13",
    "valid_until": "2022-02-02"
  }
]
99 validate_aliases.py Normal file
@@ -0,0 +1,99 @@
#!/usr/bin/env python3
"""Validate alias files for orphan references and format issues.

This script checks stadium_aliases.json and team_aliases.json for:
1. Orphan references (aliases pointing to non-existent canonical IDs)
2. JSON syntax errors
3. Required field presence

Usage:
    python validate_aliases.py

Returns exit code 0 on success, 1 on failure.
"""

import json
import sys
from pathlib import Path

# Add parent to path for imports
sys.path.insert(0, str(Path(__file__).parent))

from sportstime_parser.normalizers.stadium_resolver import STADIUM_MAPPINGS
from sportstime_parser.normalizers.team_resolver import TEAM_MAPPINGS


def main() -> int:
    """Run validation checks on alias files."""
    errors: list[str] = []

    # Build valid stadium ID set
    valid_stadium_ids: set[str] = set()
    for sport_stadiums in STADIUM_MAPPINGS.values():
        valid_stadium_ids.update(sport_stadiums.keys())

    # Build valid team ID set
    valid_team_ids: set[str] = set()
    for sport_teams in TEAM_MAPPINGS.values():
        for team_data in sport_teams.values():
            valid_team_ids.add(team_data[0])  # team_id is the first element

    print(f"Valid stadium IDs: {len(valid_stadium_ids)}")
    print(f"Valid team IDs: {len(valid_team_ids)}")
    print()

    # Check stadium aliases
    try:
        with open("stadium_aliases.json") as f:
            stadium_aliases = json.load(f)
        print(f"✓ stadium_aliases.json: Valid JSON ({len(stadium_aliases)} aliases)")

        for alias in stadium_aliases:
            # Check required fields
            if "alias_name" not in alias:
                errors.append(f"Stadium alias missing 'alias_name': {alias}")
            if "stadium_canonical_id" not in alias:
                errors.append(f"Stadium alias missing 'stadium_canonical_id': {alias}")
            elif alias["stadium_canonical_id"] not in valid_stadium_ids:
                errors.append(
                    f"Orphan stadium alias: '{alias.get('alias_name', '?')}' -> "
                    f"'{alias['stadium_canonical_id']}'"
                )
    except FileNotFoundError:
        errors.append("stadium_aliases.json not found")
    except json.JSONDecodeError as e:
        errors.append(f"stadium_aliases.json: Invalid JSON - {e}")

    # Check team aliases
    try:
        with open("team_aliases.json") as f:
            team_aliases = json.load(f)
        print(f"✓ team_aliases.json: Valid JSON ({len(team_aliases)} aliases)")

        for alias in team_aliases:
            # Check required fields
            if "team_canonical_id" not in alias:
                errors.append(f"Team alias missing 'team_canonical_id': {alias}")
            elif alias["team_canonical_id"] not in valid_team_ids:
                errors.append(
                    f"Orphan team alias: '{alias.get('alias_value', '?')}' -> "
                    f"'{alias['team_canonical_id']}'"
                )
    except FileNotFoundError:
        errors.append("team_aliases.json not found")
    except json.JSONDecodeError as e:
        errors.append(f"team_aliases.json: Invalid JSON - {e}")

    # Report results
    print()
    if errors:
        print(f"❌ Validation failed with {len(errors)} error(s):")
        for error in errors:
            print(f"  - {error}")
        return 1

    print("✅ All aliases valid")
    return 0


if __name__ == "__main__":
    sys.exit(main())