feat(scripts): rewrite parser as modular Python CLI

Replace monolithic scraping scripts with sportstime_parser package:

- Multi-source scrapers with automatic fallback for 7 sports
- Canonical ID generation for games, teams, and stadiums
- Fuzzy matching with configurable thresholds for name resolution
- CloudKit Web Services uploader with JWT auth, diff-based updates
- Resumable uploads with checkpoint state persistence
- Validation reports with manual review items and suggested matches
- Comprehensive test suite (249 tests)

CLI: sportstime-parser scrape|validate|upload|status|retry|clear

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Trey t
2026-01-10 21:06:12 -06:00
parent 284a10d9e1
commit eeaf900e5a
109 changed files with 18415 additions and 266211 deletions
@@ -1,145 +0,0 @@
# CloudKit Setup Guide for SportsTime
## 1. Configure Container in Apple Developer Portal
1. Go to [Apple Developer Portal](https://developer.apple.com/account)
2. Navigate to **Certificates, Identifiers & Profiles** > **Identifiers**
3. Select your App ID or create one for `com.sportstime.app`
4. Enable **iCloud** capability
5. Click **Configure** and create container: `iCloud.com.sportstime.app`
## 2. Configure in Xcode
1. Open `SportsTime.xcodeproj` in Xcode
2. Select the SportsTime target
3. Go to **Signing & Capabilities**
4. Ensure **iCloud** is added (should already be there)
5. Check **CloudKit** is selected
6. Select container `iCloud.com.sportstime.app`
## 3. Create Record Types in CloudKit Dashboard
Go to [CloudKit Dashboard](https://icloud.developer.apple.com/dashboard)
### Record Type: `Stadium`
| Field | Type | Notes |
|-------|------|-------|
| `stadiumId` | String | Unique identifier |
| `name` | String | Stadium name |
| `city` | String | City |
| `state` | String | State/Province |
| `location` | Location | CLLocation (lat/lng) |
| `capacity` | Int(64) | Seating capacity |
| `sport` | String | NBA, MLB, NHL |
| `teamAbbrevs` | String (List) | Team abbreviations |
| `source` | String | Data source |
| `yearOpened` | Int(64) | Optional |
**Indexes**:
- `sport` (Queryable, Sortable)
- `location` (Queryable) - for radius searches
- `teamAbbrevs` (Queryable)
### Record Type: `Team`
| Field | Type | Notes |
|-------|------|-------|
| `teamId` | String | Unique identifier |
| `name` | String | Full team name |
| `abbreviation` | String | 3-letter code |
| `sport` | String | NBA, MLB, NHL |
| `city` | String | City |
**Indexes**:
- `sport` (Queryable, Sortable)
- `abbreviation` (Queryable)
### Record Type: `Game`
| Field | Type | Notes |
|-------|------|-------|
| `gameId` | String | Unique identifier |
| `sport` | String | NBA, MLB, NHL |
| `season` | String | e.g., "2024-25" |
| `dateTime` | Date/Time | Game date and time |
| `homeTeamRef` | Reference | Reference to Team |
| `awayTeamRef` | Reference | Reference to Team |
| `venueRef` | Reference | Reference to Stadium |
| `isPlayoff` | Int(64) | 0 or 1 |
| `broadcastInfo` | String | TV channel |
| `source` | String | Data source |
**Indexes**:
- `sport` (Queryable, Sortable)
- `dateTime` (Queryable, Sortable)
- `homeTeamRef` (Queryable)
- `awayTeamRef` (Queryable)
- `season` (Queryable)
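As a rough illustration of how the `Game` fields above could map onto a record payload, the sketch below builds a plain dictionary in the general shape CloudKit Web Services uses for records (a `fields` map of `{"value": ...}` entries, with references carrying a `recordName`). The helper name `build_game_record`, the reuse of the game ID as `recordName`, and the example IDs are assumptions for illustration; they are not taken from the importer.

```python
from datetime import datetime, timezone


def build_game_record(game_id, sport, season, start, home_team_id,
                      away_team_id, stadium_id, is_playoff=False):
    """Build a dict shaped like a CloudKit `Game` record (illustrative only)."""

    def ref(record_name):
        # Reference fields point at another record by its record name.
        return {"value": {"recordName": record_name, "action": "NONE"}}

    return {
        "recordType": "Game",
        "recordName": game_id,  # assumption: record name mirrors gameId
        "fields": {
            "gameId": {"value": game_id},
            "sport": {"value": sport},
            "season": {"value": season},
            # Timestamps are sent as milliseconds since the Unix epoch.
            "dateTime": {"value": int(start.timestamp() * 1000)},
            "homeTeamRef": ref(home_team_id),
            "awayTeamRef": ref(away_team_id),
            "venueRef": ref(stadium_id),
            # isPlayoff is an Int(64) holding 0 or 1, per the table above.
            "isPlayoff": {"value": 1 if is_playoff else 0},
        },
    }


record = build_game_record(
    "game_nba_202526_20251021_hou_okc", "NBA", "2025-26",
    datetime(2025, 10, 21, 23, 30, tzinfo=timezone.utc),
    "team_nba_hou", "team_nba_okc", "stadium_nba_example",
)
```

Keeping the references as record names means `Team` and `Stadium` records need to exist before the `Game` records that point at them.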
## 4. Import Data
After creating record types:
```bash
# 1. First scrape the data
cd Scripts
python3 scrape_schedules.py --sport all --season 2025 --output ./data
# 2. Run the import script (requires running from Xcode or with proper entitlements)
# The Swift script cannot run standalone - use the app or create a macOS command-line tool
```
### Alternative: Import via App
Add this to your app for first-run data import:
```swift
// In AppDelegate or App init
Task {
    let importer = CloudKitImporter()

    // Load JSON from bundle or downloaded file
    if let stadiumsURL = Bundle.main.url(forResource: "stadiums", withExtension: "json"),
       let gamesURL = Bundle.main.url(forResource: "games", withExtension: "json") {
        do {
            // Import stadiums first
            let stadiumsData = try Data(contentsOf: stadiumsURL)
            let stadiums = try JSONDecoder().decode([ScrapedStadium].self, from: stadiumsData)
            let count = try await importer.importStadiums(from: stadiums)
            print("Imported \(count) stadiums")
            // gamesURL is resolved above but not imported in this snippet;
            // games should be imported only after the stadiums they reference.
        } catch {
            // Surface failures instead of letting the Task swallow them
            print("Stadium import failed: \(error)")
        }
    }
}
```
## 5. Security Roles (CloudKit Dashboard)
For the **Public Database**:
| Role | Stadium | Team | Game |
|------|---------|------|------|
| World | Read | Read | Read |
| Authenticated | Read | Read | Read |
| Creator | Read/Write | Read/Write | Read/Write |
Users should only read from the public database. Write access is reserved for your admin imports.
## 6. Testing
1. Build and run the app on simulator or device
2. Check CloudKit Dashboard > **Data** to see imported records
3. Use **Logs** tab to debug any issues
## Troubleshooting
### "Container not found"
- Ensure container is created in Developer Portal
- Check entitlements file has correct container ID
- Clean build and re-run
### "Permission denied"
- Check Security Roles in CloudKit Dashboard
- Ensure app is signed with correct provisioning profile
### "Record type not found"
- Create record types in Development environment first
- Deploy schema to Production when ready
@@ -1,72 +0,0 @@
# Sports Data Sources
## Schedule Data Sources (by league)
### NBA Schedule
| Source | URL Pattern | Data Available | Notes |
|--------|-------------|----------------|-------|
| Basketball-Reference | `https://www.basketball-reference.com/leagues/NBA_{YEAR}_games-{month}.html` | Date, Time, Teams, Arena, Attendance | Monthly pages (october, november, etc.) |
| ESPN | `https://www.espn.com/nba/schedule/_/date/{YYYYMMDD}` | Date, Time, Teams, TV | Daily schedule |
| NBA.com API | `https://cdn.nba.com/static/json/staticData/scheduleLeagueV2.json` | Full season JSON | Official source |
| FixtureDownload | `https://fixturedownload.com/download/nba-{year}-UTC.csv` | CSV download | Easy format |
### MLB Schedule
| Source | URL Pattern | Data Available | Notes |
|--------|-------------|----------------|-------|
| Baseball-Reference | `https://www.baseball-reference.com/leagues/majors/{YEAR}-schedule.shtml` | Date, Teams, Score, Attendance | Full season page |
| ESPN | `https://www.espn.com/mlb/schedule/_/date/{YYYYMMDD}` | Date, Time, Teams, TV | Daily schedule |
| MLB Stats API | `https://statsapi.mlb.com/api/v1/schedule?sportId=1&season={YEAR}` | Full season JSON | Official API |
| FixtureDownload | `https://fixturedownload.com/download/mlb-{year}-UTC.csv` | CSV download | Easy format |
### NHL Schedule
| Source | URL Pattern | Data Available | Notes |
|--------|-------------|----------------|-------|
| Hockey-Reference | `https://www.hockey-reference.com/leagues/NHL_{YEAR}_games.html` | Date, Teams, Score, Arena, Attendance | Full season page |
| ESPN | `https://www.espn.com/nhl/schedule/_/date/{YYYYMMDD}` | Date, Time, Teams, TV | Daily schedule |
| NHL API | `https://api-web.nhle.com/v1/schedule/{YYYY-MM-DD}` | Daily JSON | Official API |
| FixtureDownload | `https://fixturedownload.com/download/nhl-{year}-UTC.csv` | CSV download | Easy format |
---
## Stadium/Arena Data Sources
| Source | URL/Method | Data Available | Notes |
|--------|------------|----------------|-------|
| Wikipedia | Team pages | Name, City, Capacity, Coordinates | Manual or scrape |
| HIFLD Open Data | `https://hifld-geoplatform.opendata.arcgis.com/datasets/major-sport-venues` | GeoJSON with coordinates | US Government data |
| ESPN Team Pages | `https://www.espn.com/{sport}/team/_/name/{abbrev}` | Arena name, location | Per-team |
| Sports-Reference | Team pages | Arena name, capacity | In schedule data |
| OpenStreetMap | Nominatim API | Coordinates from address | For geocoding |
---
## Data Validation Strategy
### Cross-Reference Points
1. **Game Count**: Total games per team should match (82 NBA, 162 MLB, 82 NHL)
2. **Home/Away Balance**: Each team should have equal home/away games
3. **Date Alignment**: Same game should appear on same date across sources
4. **Team Names**: Map abbreviations across sources (NYK vs NY vs Knicks)
5. **Venue Names**: Stadiums may have different names (sponsorship changes)
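The first two cross-reference points can be checked mechanically. The sketch below is a minimal example, assuming each game is a dict with hypothetical `home` and `away` keys holding team abbreviations; the expected totals are the regular-season counts listed above.

```python
from collections import Counter

# Regular-season games per team, from the list above.
EXPECTED_GAMES = {"NBA": 82, "MLB": 162, "NHL": 82}


def check_schedule(games, sport):
    """Return human-readable problems found in one sport's schedule."""
    home = Counter(g["home"] for g in games)
    away = Counter(g["away"] for g in games)
    problems = []
    for team in sorted(set(home) | set(away)):
        total = home[team] + away[team]
        if total != EXPECTED_GAMES[sport]:
            problems.append(
                f"{team}: {total} games, expected {EXPECTED_GAMES[sport]}"
            )
        if home[team] != away[team]:
            problems.append(
                f"{team}: {home[team]} home vs {away[team]} away"
            )
    return problems


# Toy two-team league: 41 games at each venue is a balanced NBA schedule.
balanced = [{"home": "AAA", "away": "BBB"}] * 41 + [{"home": "BBB", "away": "AAA"}] * 41
issues_ok = check_schedule(balanced, "NBA")
issues_bad = check_schedule(balanced[:-1], "NBA")
```

An empty result means both checks passed; anything else is a candidate for the manual-review log.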
### Discrepancy Handling
- If sources disagree on game time: prefer official API (NBA.com, MLB.com, NHL.com)
- If sources disagree on venue: prefer Sports-Reference (most accurate historically)
- Log all discrepancies for manual review
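The rules above amount to a per-field source priority plus a discrepancy log. A minimal sketch follows; the source keys (`official_api`, `espn`, `sports_reference`) and the ordering of the lower-priority sources are assumptions, since only the preferred source for each field is stated above.

```python
# First entry per field follows the rules above; the rest of each
# ordering is an assumption made for this sketch.
SOURCE_PRIORITY = {
    "time": ["official_api", "espn", "sports_reference"],
    "venue": ["sports_reference", "official_api", "espn"],
}


def resolve_field(field, values_by_source, log):
    """Pick one value for `field`; record a log entry when sources disagree."""
    present = {src: val for src, val in values_by_source.items() if val}
    if len(set(present.values())) > 1:
        log.append({"field": field, "values": dict(present)})
    for source in SOURCE_PRIORITY[field]:
        if source in present:
            return present[source]
    return None


review_log = []
chosen_time = resolve_field(
    "time", {"espn": "19:00", "official_api": "19:30"}, review_log
)
chosen_venue = resolve_field(
    "venue", {"sports_reference": "Example Arena", "espn": "Example Arena"}, review_log
)
```

Only the disagreement on game time ends up in `review_log`; the venue values agree, so nothing is logged for them.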
---
## Rate Limiting Guidelines
| Source | Limit | Recommended Delay |
|--------|-------|-------------------|
| Sports-Reference sites | 20 req/min | 3 seconds between requests |
| ESPN | Unknown | 1 second between requests |
| Official APIs | Varies | 0.5 seconds between requests |
| Wikipedia | Polite | 1 second between requests |
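One way to apply these delays is a small per-source limiter that sleeps only for the remaining part of the interval. This is a sketch under stated assumptions: the class name `RateLimiter`, the source keys, and the injectable `clock`/`sleep` parameters are all hypothetical, and the delay values are copied from the table above.

```python
import time

# Recommended delay in seconds between requests, per the table above.
MIN_DELAY_SECONDS = {
    "sports_reference": 3.0,
    "espn": 1.0,
    "official_api": 0.5,
    "wikipedia": 1.0,
}


class RateLimiter:
    """Enforce a minimum gap between consecutive requests to each source."""

    def __init__(self, delays, clock=time.monotonic, sleep=time.sleep):
        self._delays = delays
        self._clock = clock  # injectable so the limiter can be tested
        self._sleep = sleep
        self._last = {}      # source -> time of the previous request

    def wait(self, source):
        """Block until the configured delay since the last call has passed."""
        now = self._clock()
        last = self._last.get(source)
        if last is not None:
            remaining = self._delays[source] - (now - last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last[source] = now


# Usage: call limiter.wait("espn") immediately before each ESPN request.
limiter = RateLimiter(MIN_DELAY_SECONDS)
```

Tracking the last request per source keeps a slow Sports-Reference crawl from delaying calls to the official APIs.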
---
## Team Abbreviation Mappings
See `team_mappings.json` for canonical mappings between sources.
@@ -1,147 +0,0 @@
# SportsTime Data Pipeline
Python scripts that scrape, canonicalize, and sync sports schedule data to CloudKit for the SportsTime iOS app.
## Overview
This pipeline ensures every game correctly links to its home/away teams and stadium with complete, accurate data across MLB, NBA, NHL, NFL, MLS, WNBA, and NWSL.
## Quick Start
```bash
# Install dependencies
pip install -r requirements.txt
# Scrape all sports for current season
python scrape_schedules.py --sport all --season 2026
# Run full pipeline (scrape + canonicalize)
python run_pipeline.py --sport all
# Validate data integrity
python cloudkit_import.py --validate
# Sync to CloudKit
python cloudkit_import.py --upload
```
## Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ SPORT MODULES │
│ mlb.py nba.py nhl.py nfl.py mls.py wnba.py nwsl.py │
└────────────────────────────┬────────────────────────────────────────┘
│ scrape
┌─────────────────────────────────────────────────────────────────────┐
│ RAW DATA │
│ data/games.csv data/stadiums.csv data/games.json │
└────────────────────────────┬────────────────────────────────────────┘
│ canonicalize
┌─────────────────────────────────────────────────────────────────────┐
│ CANONICAL JSON │
│ data/stadiums_canonical.json data/teams_canonical.json │
│ data/games/*.json (per-sport/season) │
└────────────────────────────┬────────────────────────────────────────┘
│ sync
┌─────────────────────────────────────────────────────────────────────┐
│ CloudKit (iCloud.com.sportstime.app) │
│ Bundled JSON (SportsTime/Resources/) │
└─────────────────────────────────────────────────────────────────────┘
```
## Module Reference
| Script | Purpose |
|--------|---------|
| `core.py` | Shared utilities: data classes, rate limiting, fallback system |
| `scrape_schedules.py` | Main orchestrator for scraping schedules from multiple sources |
| `run_pipeline.py` | Full pipeline runner (scrape + canonicalize in one command) |
| `canonicalize_stadiums.py` | Stadium name resolution with alias support |
| `canonicalize_teams.py` | Team name resolution with alias support |
| `canonicalize_games.py` | Game linking (game → team → stadium relationships) |
| `cloudkit_import.py` | CloudKit sync with full CRUD, validation, and diff reporting |
| `validate_canonical.py` | Data validation with completeness metrics |
| `generate_canonical_data.py` | Generate bundled JSON for iOS app bootstrap |
## Sport Modules
Each sport has its own module with hardcoded stadium data and sport-specific scraping logic:
| Module | Sport | Stadiums | Notes |
|--------|-------|----------|-------|
| `mlb.py` | MLB | 30 ballparks | Baseball-Reference scraper |
| `nba.py` | NBA | 30 arenas | Basketball-Reference scraper |
| `nhl.py` | NHL | 32 arenas | Hockey-Reference scraper |
| `nfl.py` | NFL | 30 stadiums | Cross-calendar season (2025-26) |
| `mls.py` | MLS | 30 stadiums | Soccer-specific capacities |
| `wnba.py` | WNBA | 13 arenas | Shares venues with NBA |
| `nwsl.py` | NWSL | 13 stadiums | Shares some MLS venues |
## Data Files
### Output Directory: `data/`
| File | Contents |
|------|----------|
| `games.csv` | Raw scraped game data (all sports) |
| `games.json` | Raw scraped games as JSON |
| `stadiums.json` | Raw stadium data |
| `stadiums_canonical.json` | Canonical stadiums with resolved aliases |
| `teams_canonical.json` | Canonical teams with resolved aliases |
| `stadium_aliases.json` | Stadium name → canonical ID mapping |
| `games/{sport}_{season}.json` | Per-sport canonical games |
### Alias Files
- `data/canonical/stadiums.json` - Master stadium database
- `data/canonical/teams.json` - Master team database
## Pipeline Commands
### Scraping
```bash
# Single sport
python scrape_schedules.py --sport nba --season 2025-26
# All sports
python scrape_schedules.py --sport all --season 2026
# With specific output directory
python scrape_schedules.py --sport mlb --season 2025 --output ./data
```
### Canonicalization
```bash
# Run canonicalization pipeline
python run_canonicalization_pipeline.py --sport all
```
### CloudKit Operations
```bash
# Validate data without uploading
python cloudkit_import.py --validate
# Show what would be uploaded (dry run)
python cloudkit_import.py --upload --dry-run
# Upload to CloudKit
python cloudkit_import.py --upload
# List orphan records (requires CloudKit connection)
python cloudkit_import.py --validate --list-orphans
# Delete orphan records
python cloudkit_import.py --delete-orphans
```
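The orphan commands above are easiest to picture as a set difference. The sketch below assumes that an "orphan" is a record present in CloudKit whose ID no longer appears in the local canonical data; that definition and the function name `find_orphans` are assumptions, not confirmed by `cloudkit_import.py`.

```python
def find_orphans(remote_record_names, local_canonical_ids):
    """Return remote record names that have no local canonical counterpart."""
    return sorted(set(remote_record_names) - set(local_canonical_ids))


remote = ["game_mlb_2026_20260615_bos_nyy", "game_mlb_2026_20260616_bos_nyy"]
local = ["game_mlb_2026_20260615_bos_nyy"]
orphans = find_orphans(remote, local)
```

Listing orphans before deleting them mirrors the `--validate --list-orphans` then `--delete-orphans` order shown above.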
## Related Documentation
- [DATA_SOURCES.md](DATA_SOURCES.md) - Data source URLs, rate limits, validation strategy
- [CLOUDKIT_SETUP.md](CLOUDKIT_SETUP.md) - CloudKit container setup, record types, security roles
@@ -1,508 +0,0 @@
#!/usr/bin/env python3
"""
Game Canonicalization for SportsTime
====================================
Stage 3 of the canonicalization pipeline.
Resolves team and stadium references in games, generates canonical game IDs.
Usage:
python canonicalize_games.py --games data/games.json --teams data/teams_canonical.json \
--aliases data/stadium_aliases.json --output data/
"""
import argparse
import json
from collections import defaultdict
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Optional
# =============================================================================
# DATA CLASSES
# =============================================================================
@dataclass
class CanonicalGame:
    """A canonicalized game with stable ID and resolved references."""
    canonical_id: str
    sport: str
    season: str
    date: str  # YYYY-MM-DD
    time: Optional[str]
    home_team_canonical_id: str
    away_team_canonical_id: str
    stadium_canonical_id: str
    is_playoff: bool = False
    broadcast: Optional[str] = None


@dataclass
class ResolutionWarning:
    """Warning about a resolution issue."""
    game_key: str
    issue: str
    details: str
# =============================================================================
# TEAM ABBREVIATION ALIASES
# Maps alternative abbreviations to canonical team IDs
# =============================================================================
TEAM_ABBREV_ALIASES = {
# NBA
('NBA', 'PHX'): 'team_nba_pho', # Phoenix
('NBA', 'BKN'): 'team_nba_brk', # Brooklyn
('NBA', 'CHA'): 'team_nba_cho', # Charlotte (older abbrev)
('NBA', 'NOP'): 'team_nba_nop', # New Orleans
('NBA', 'NO'): 'team_nba_nop', # New Orleans alt
('NBA', 'NY'): 'team_nba_nyk', # New York
('NBA', 'SA'): 'team_nba_sas', # San Antonio
('NBA', 'GS'): 'team_nba_gsw', # Golden State
('NBA', 'UTAH'): 'team_nba_uta', # Utah
# MLB
('MLB', 'AZ'): 'team_mlb_ari', # Arizona
('MLB', 'CWS'): 'team_mlb_chw', # Chicago White Sox
('MLB', 'KC'): 'team_mlb_kcr', # Kansas City
('MLB', 'SD'): 'team_mlb_sdp', # San Diego
('MLB', 'SF'): 'team_mlb_sfg', # San Francisco
('MLB', 'TB'): 'team_mlb_tbr', # Tampa Bay
('MLB', 'WSH'): 'team_mlb_wsn', # Washington
('MLB', 'WAS'): 'team_mlb_wsn', # Washington alt
('MLB', 'LA'): 'team_mlb_lad', # Los Angeles Dodgers
('MLB', 'ATH'): 'team_mlb_oak', # Oakland Athletics
# NHL
('NHL', 'ARI'): 'team_nhl_ari', # Arizona/Utah
('NHL', 'UTA'): 'team_nhl_ari', # Utah Hockey Club (uses ARI code)
('NHL', 'VGS'): 'team_nhl_vgk', # Vegas
('NHL', 'TB'): 'team_nhl_tbl', # Tampa Bay Lightning
('NHL', 'NJ'): 'team_nhl_njd', # New Jersey
('NHL', 'SJ'): 'team_nhl_sjs', # San Jose
('NHL', 'LA'): 'team_nhl_lak', # Los Angeles Kings
('NHL', 'MON'): 'team_nhl_mtl', # Montreal
# NFL
('NFL', 'JAC'): 'team_nfl_jax', # Jacksonville (JAC vs JAX)
('NFL', 'OAK'): 'team_nfl_lv', # Oakland → Las Vegas Raiders (moved 2020)
('NFL', 'SD'): 'team_nfl_lac', # San Diego → Los Angeles Chargers (moved 2017)
('NFL', 'STL'): 'team_nfl_lar', # St. Louis → Los Angeles Rams (moved 2016)
('NFL', 'GNB'): 'team_nfl_gb', # Green Bay alternate
('NFL', 'KAN'): 'team_nfl_kc', # Kansas City alternate
('NFL', 'NWE'): 'team_nfl_ne', # New England alternate
('NFL', 'NOR'): 'team_nfl_no', # New Orleans alternate
('NFL', 'TAM'): 'team_nfl_tb', # Tampa Bay alternate
('NFL', 'SFO'): 'team_nfl_sf', # San Francisco alternate
('NFL', 'WAS'): 'team_nfl_was', # Washington (direct match but include for completeness)
('NFL', 'WSH'): 'team_nfl_was', # Washington Commanders alternate abbrev
# MLS
('MLS', 'LA'): 'team_mls_lag', # LA Galaxy
('MLS', 'NYC'): 'team_mls_nycfc', # NYC FC
('MLS', 'RBNY'): 'team_mls_nyrb', # NY Red Bulls
('MLS', 'NYR'): 'team_mls_nyrb', # NY Red Bulls alt
('MLS', 'NY'): 'team_mls_nyrb', # NY Red Bulls short
('MLS', 'SJE'): 'team_mls_sj', # San Jose Earthquakes
('MLS', 'KC'): 'team_mls_skc', # Sporting KC
('MLS', 'DCU'): 'team_mls_dc', # DC United
('MLS', 'FCD'): 'team_mls_dal', # FC Dallas
('MLS', 'MON'): 'team_mls_mtl', # Montreal
('MLS', 'LAF'): 'team_mls_lafc', # LAFC alt
('MLS', 'ATX'): 'team_mls_aus', # Austin FC alt abbrev
# WNBA
('WNBA', 'LV'): 'team_wnba_lva', # Las Vegas Aces
('WNBA', 'LAS'): 'team_wnba_la', # LA Sparks
('WNBA', 'NYL'): 'team_wnba_ny', # New York Liberty
('WNBA', 'PHX'): 'team_wnba_pho', # Phoenix Mercury
('WNBA', 'CONN'): 'team_wnba_con', # Connecticut Sun
('WNBA', 'WSH'): 'team_wnba_was', # Washington Mystics
# NWSL
('NWSL', 'ANG'): 'team_nwsl_la', # Angel City FC (uses LA abbrev)
('NWSL', 'ACFC'): 'team_nwsl_la', # Angel City FC alt
('NWSL', 'NCC'): 'team_nwsl_nc', # North Carolina Courage
('NWSL', 'GOTHAM'): 'team_nwsl_nj', # NJ/NY Gotham FC
('NWSL', 'NY'): 'team_nwsl_nj', # NJ/NY Gotham FC alt
('NWSL', 'BAY'): 'team_nwsl_sj', # Bay FC (San Jose)
('NWSL', 'RLC'): 'team_nwsl_uta', # Racing Louisville -> Utah Royals (rebrand)
('NWSL', 'LOU'): 'team_nwsl_uta', # Louisville -> Utah alt
}
# =============================================================================
# ID GENERATION
# =============================================================================
def normalize_season(sport: str, season: str) -> str:
    """
    Normalize season format for ID generation.

    NBA/NHL: "2025-26" -> "202526"
    MLB: "2026" -> "2026"
    """
    return season.replace('-', '')


def generate_canonical_game_id(
    sport: str,
    season: str,
    date: str,  # YYYY-MM-DD
    away_abbrev: str,
    home_abbrev: str,
    sequence: int = 1
) -> str:
    """
    Generate deterministic canonical ID for game.

    Format: game_{sport}_{season}_{date}_{away}_{home}[_{sequence}]

    Example: game_nba_202526_20251021_hou_okc
             game_mlb_2026_20260615_bos_nyy_2 (doubleheader game 2)
    """
    normalized_season = normalize_season(sport, season)
    date_compact = date.replace('-', '')  # YYYYMMDD
    base_id = f"game_{sport.lower()}_{normalized_season}_{date_compact}_{away_abbrev.lower()}_{home_abbrev.lower()}"
    if sequence > 1:
        return f"{base_id}_{sequence}"
    return base_id
# =============================================================================
# RESOLUTION
# =============================================================================
def build_alias_lookup(stadium_aliases: list[dict]) -> dict[str, str]:
    """
    Build lookup from alias name to canonical stadium ID.

    Returns: {alias_name_lower: canonical_stadium_id}
    """
    lookup = {}
    for alias in stadium_aliases:
        alias_name = alias.get('alias_name', '').lower().strip()
        canonical_id = alias.get('stadium_canonical_id', '')
        if alias_name and canonical_id:
            lookup[alias_name] = canonical_id
    return lookup


def resolve_team(
    abbrev: str,
    sport: str,
    teams_by_abbrev: dict[tuple[str, str], dict],
    teams_by_id: dict[str, dict]
) -> Optional[dict]:
    """
    Resolve team abbreviation to canonical team.

    1. Try direct match by (sport, abbrev)
    2. Try alias lookup
    3. Return None if not found
    """
    key = (sport, abbrev.upper())

    # Direct match
    if key in teams_by_abbrev:
        return teams_by_abbrev[key]

    # Alias match
    if key in TEAM_ABBREV_ALIASES:
        canonical_id = TEAM_ABBREV_ALIASES[key]
        if canonical_id in teams_by_id:
            return teams_by_id[canonical_id]

    return None
def resolve_stadium_from_venue(
    venue: str,
    home_team: dict,
    sport: str,
    alias_lookup: dict[str, str],
    stadiums_by_id: dict[str, dict]
) -> str:
    """
    Resolve stadium canonical ID from venue name.

    Strategy:
    1. ALWAYS prefer home team's stadium (most reliable, sport-correct)
    2. Try exact sport-scoped alias match (only if home team has no stadium)
    3. Try partial alias match, restricted to the same sport
    4. Fall back to unknown stadium slug

    For multi-sport venues (MSG, Crypto.com Arena, etc.), home team's
    stadium_canonical_id is authoritative because it's already sport-scoped.

    Args:
        venue: Venue name from game data
        home_team: Resolved home team dict
        sport: Sport code (e.g. NBA, MLB, NHL)
        alias_lookup: {alias_name_lower: canonical_stadium_id}
        stadiums_by_id: {canonical_id: stadium_dict}

    Returns:
        canonical_stadium_id
    """
    # Strategy 1: Home team's stadium is most reliable (sport-scoped)
    if home_team:
        team_stadium = home_team.get('stadium_canonical_id', '')
        if team_stadium:
            return team_stadium

    # Strategy 2: Sport-scoped alias match (fallback for neutral sites)
    venue_lower = venue.lower().strip()
    sport_prefix = f"stadium_{sport.lower()}_"
    if venue_lower in alias_lookup:
        matched_id = alias_lookup[venue_lower]
        # Only use alias if it's for the correct sport
        if matched_id.startswith(sport_prefix):
            return matched_id

    # Strategy 3: Partial match with sport check
    for alias, canonical_id in alias_lookup.items():
        if len(alias) > 3 and (alias in venue_lower or venue_lower in alias):
            if canonical_id.startswith(sport_prefix):
                return canonical_id

    # Strategy 4: Unknown stadium
    slug = venue_lower[:30].replace(' ', '_').replace('.', '')
    return f"stadium_unknown_{slug}"
# =============================================================================
# CANONICALIZATION
# =============================================================================
def canonicalize_games(
    raw_games: list[dict],
    canonical_teams: list[dict],
    stadium_aliases: list[dict],
    verbose: bool = False
) -> tuple[list[CanonicalGame], list[ResolutionWarning]]:
    """
    Stage 3: Canonicalize games.

    1. Resolve team abbreviations to canonical IDs
    2. Resolve venues to stadium canonical IDs
    3. Generate canonical game IDs (handling doubleheaders)

    Args:
        raw_games: List of raw game dicts
        canonical_teams: List of canonical team dicts
        stadium_aliases: List of stadium alias dicts
        verbose: Print detailed progress

    Returns:
        (canonical_games, warnings)
    """
    games = []
    warnings = []

    # Build lookups
    teams_by_abbrev = {}  # (sport, abbrev) -> team dict
    teams_by_id = {}  # canonical_id -> team dict
    for team in canonical_teams:
        abbrev = team['abbreviation'].upper()
        sport = team['sport']
        teams_by_abbrev[(sport, abbrev)] = team
        teams_by_id[team['canonical_id']] = team

    alias_lookup = build_alias_lookup(stadium_aliases)
    stadiums_by_id = {}  # Would be populated from stadiums_canonical.json if needed

    # Track games for doubleheader detection
    game_counts = defaultdict(int)  # (date, away_id, home_id) -> count

    resolved_count = 0
    unresolved_teams = 0
    unresolved_stadiums = 0

    for raw in raw_games:
        sport = raw.get('sport', '').upper()
        season = raw.get('season', '')
        date = raw.get('date', '')
        home_abbrev = raw.get('home_team_abbrev', '').upper()
        away_abbrev = raw.get('away_team_abbrev', '').upper()
        venue = raw.get('venue', '')
        game_key = f"{date}_{away_abbrev}_{home_abbrev}"

        # Resolve teams
        home_team = resolve_team(home_abbrev, sport, teams_by_abbrev, teams_by_id)
        away_team = resolve_team(away_abbrev, sport, teams_by_abbrev, teams_by_id)

        if not home_team:
            warnings.append(ResolutionWarning(
                game_key=game_key,
                issue='Unknown home team',
                details=f"Could not resolve home team '{home_abbrev}' for sport {sport}"
            ))
            unresolved_teams += 1
            if verbose:
                print(f" WARNING: {game_key} - unknown home team {home_abbrev}")
            continue

        if not away_team:
            warnings.append(ResolutionWarning(
                game_key=game_key,
                issue='Unknown away team',
                details=f"Could not resolve away team '{away_abbrev}' for sport {sport}"
            ))
            unresolved_teams += 1
            if verbose:
                print(f" WARNING: {game_key} - unknown away team {away_abbrev}")
            continue

        # Resolve stadium
        stadium_canonical_id = resolve_stadium_from_venue(
            venue, home_team, sport, alias_lookup, stadiums_by_id
        )
        if stadium_canonical_id.startswith('stadium_unknown'):
            warnings.append(ResolutionWarning(
                game_key=game_key,
                issue='Unknown stadium',
                details=f"Could not resolve venue '{venue}', using home team stadium"
            ))
            unresolved_stadiums += 1
            # Fall back to home team stadium if it has one; `or` keeps the
            # unknown slug when the team's stadium ID is missing or empty
            stadium_canonical_id = home_team.get('stadium_canonical_id') or stadium_canonical_id

        # Handle doubleheaders
        matchup_key = (date, away_team['canonical_id'], home_team['canonical_id'])
        game_counts[matchup_key] += 1
        sequence = game_counts[matchup_key]

        # Generate canonical ID
        canonical_id = generate_canonical_game_id(
            sport, season, date,
            away_team['abbreviation'], home_team['abbreviation'],
            sequence
        )

        game = CanonicalGame(
            canonical_id=canonical_id,
            sport=sport,
            season=season,
            date=date,
            time=raw.get('time'),
            home_team_canonical_id=home_team['canonical_id'],
            away_team_canonical_id=away_team['canonical_id'],
            stadium_canonical_id=stadium_canonical_id,
            is_playoff=raw.get('is_playoff', False),
            broadcast=raw.get('broadcast')
        )
        games.append(game)
        resolved_count += 1

    if verbose:
        print(f"\n Resolved: {resolved_count} games")
        print(f" Unresolved teams: {unresolved_teams}")
        print(f" Unknown stadiums (used home team): {unresolved_stadiums}")

    return games, warnings
# =============================================================================
# MAIN
# =============================================================================
def main():
    parser = argparse.ArgumentParser(
        description='Canonicalize game data'
    )
    parser.add_argument(
        '--games', type=str, default='./data/games.json',
        help='Input raw games JSON file'
    )
    parser.add_argument(
        '--teams', type=str, default='./data/teams_canonical.json',
        help='Input canonical teams JSON file'
    )
    parser.add_argument(
        '--aliases', type=str, default='./data/stadium_aliases.json',
        help='Input stadium aliases JSON file'
    )
    parser.add_argument(
        '--output', type=str, default='./data',
        help='Output directory for canonical files'
    )
    parser.add_argument(
        '--verbose', '-v', action='store_true',
        help='Verbose output'
    )
    args = parser.parse_args()

    games_path = Path(args.games)
    teams_path = Path(args.teams)
    aliases_path = Path(args.aliases)
    output_dir = Path(args.output)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Load input files
    print(f"Loading raw games from {games_path}...")
    with open(games_path) as f:
        raw_games = json.load(f)
    print(f" Loaded {len(raw_games)} raw games")

    print(f"Loading canonical teams from {teams_path}...")
    with open(teams_path) as f:
        canonical_teams = json.load(f)
    print(f" Loaded {len(canonical_teams)} canonical teams")

    print(f"Loading stadium aliases from {aliases_path}...")
    with open(aliases_path) as f:
        stadium_aliases = json.load(f)
    print(f" Loaded {len(stadium_aliases)} stadium aliases")

    # Canonicalize games
    print("\nCanonicalizing games...")
    canonical_games, warnings = canonicalize_games(
        raw_games, canonical_teams, stadium_aliases, verbose=args.verbose
    )
    print(f" Created {len(canonical_games)} canonical games")

    if warnings:
        print(f"\n Warnings: {len(warnings)}")
        # Group by issue type
        by_issue = defaultdict(list)
        for w in warnings:
            by_issue[w.issue].append(w)
        for issue, issue_warnings in by_issue.items():
            print(f" - {issue}: {len(issue_warnings)}")

    # Export
    games_path = output_dir / 'games_canonical.json'
    warnings_path = output_dir / 'game_resolution_warnings.json'
    with open(games_path, 'w') as f:
        json.dump([asdict(g) for g in canonical_games], f, indent=2)
    print(f"\nExported games to {games_path}")

    if warnings:
        with open(warnings_path, 'w') as f:
            json.dump([asdict(w) for w in warnings], f, indent=2)
        print(f"Exported warnings to {warnings_path}")

    # Summary by sport
    print("\nSummary by sport:")
    by_sport = {}
    for g in canonical_games:
        by_sport[g.sport] = by_sport.get(g.sport, 0) + 1
    for sport, count in sorted(by_sport.items()):
        print(f" {sport}: {count} games")

    # Check for doubleheaders: only IDs ending in a numeric sequence suffix
    # (e.g. "..._bos_nyy_2") count. A substring test for '_2' would also match
    # the season and date parts of every ID.
    doubleheaders = sum(
        1 for g in canonical_games
        if g.canonical_id.rsplit('_', 1)[-1].isdigit()
    )
    if doubleheaders:
        print(f"\n Doubleheader games detected: {doubleheaders}")


if __name__ == '__main__':
    main()
@@ -1,515 +0,0 @@
#!/usr/bin/env python3
"""
Stadium Canonicalization for SportsTime
========================================
Stage 1 of the canonicalization pipeline.
Normalizes stadium data and generates deterministic canonical IDs.
Creates stadium name aliases for fuzzy matching during game resolution.
Usage:
python canonicalize_stadiums.py --input data/stadiums.json --output data/
"""
import argparse
import json
import re
from dataclasses import dataclass, asdict, field
from pathlib import Path
from typing import Optional
# =============================================================================
# DATA CLASSES
# =============================================================================
@dataclass
class CanonicalStadium:
    """A canonicalized stadium with stable ID."""
    canonical_id: str
    name: str
    city: str
    state: str
    latitude: float
    longitude: float
    capacity: int
    sport: str
    primary_team_abbrevs: list = field(default_factory=list)
    year_opened: Optional[int] = None


@dataclass
class StadiumAlias:
    """Maps an alias name to a canonical stadium ID."""
    alias_name: str  # Normalized (lowercase)
    stadium_canonical_id: str
    valid_from: Optional[str] = None
    valid_until: Optional[str] = None
# =============================================================================
# HISTORICAL STADIUM ALIASES
# Known name changes for stadiums (sponsorship changes, renames)
# =============================================================================
HISTORICAL_STADIUM_ALIASES = {
# MLB
'stadium_mlb_minute_maid_park': [
{'alias_name': 'daikin park', 'valid_from': '2025-01-01'},
{'alias_name': 'enron field', 'valid_from': '2000-04-01', 'valid_until': '2002-02-28'},
{'alias_name': 'astros field', 'valid_from': '2002-03-01', 'valid_until': '2002-06-04'},
],
'stadium_mlb_guaranteed_rate_field': [
{'alias_name': 'rate field', 'valid_from': '2024-01-01'},
{'alias_name': 'us cellular field', 'valid_from': '2003-01-01', 'valid_until': '2016-08-24'},
{'alias_name': 'comiskey park ii', 'valid_from': '1991-04-01', 'valid_until': '2002-12-31'},
{'alias_name': 'new comiskey park', 'valid_from': '1991-04-01', 'valid_until': '2002-12-31'},
],
'stadium_mlb_truist_park': [
{'alias_name': 'suntrust park', 'valid_from': '2017-04-01', 'valid_until': '2020-01-13'},
],
'stadium_mlb_progressive_field': [
{'alias_name': 'jacobs field', 'valid_from': '1994-04-01', 'valid_until': '2008-01-10'},
{'alias_name': 'the jake', 'valid_from': '1994-04-01', 'valid_until': '2008-01-10'},
],
'stadium_mlb_american_family_field': [
{'alias_name': 'miller park', 'valid_from': '2001-04-01', 'valid_until': '2020-12-31'},
],
'stadium_mlb_rogers_centre': [
{'alias_name': 'skydome', 'valid_from': '1989-06-01', 'valid_until': '2005-02-01'},
],
'stadium_mlb_loandepot_park': [
{'alias_name': 'marlins park', 'valid_from': '2012-04-01', 'valid_until': '2021-03-31'},
],
'stadium_mlb_t_mobile_park': [
{'alias_name': 'safeco field', 'valid_from': '1999-07-01', 'valid_until': '2018-12-31'},
],
'stadium_mlb_oracle_park': [
{'alias_name': 'att park', 'valid_from': '2006-01-01', 'valid_until': '2019-01-08'},
{'alias_name': 'sbc park', 'valid_from': '2004-01-01', 'valid_until': '2005-12-31'},
{'alias_name': 'pac bell park', 'valid_from': '2000-04-01', 'valid_until': '2003-12-31'},
],
'stadium_mlb_globe_life_field': [
# Globe Life Field opened 2020, no prior name (Choctaw Stadium is the renamed Globe Life Park, a separate venue)
],
# NBA
'stadium_nba_state_farm_arena': [
{'alias_name': 'philips arena', 'valid_from': '1999-09-01', 'valid_until': '2018-06-25'},
],
'stadium_nba_crypto_com_arena': [
{'alias_name': 'staples center', 'valid_from': '1999-10-01', 'valid_until': '2021-12-24'},
],
'stadium_nba_kaseya_center': [
{'alias_name': 'ftx arena', 'valid_from': '2021-06-01', 'valid_until': '2023-03-31'},
{'alias_name': 'american airlines arena', 'valid_from': '1999-12-01', 'valid_until': '2021-05-31'},
],
'stadium_nba_gainbridge_fieldhouse': [
{'alias_name': 'bankers life fieldhouse', 'valid_from': '2011-01-01', 'valid_until': '2021-12-31'},
{'alias_name': 'conseco fieldhouse', 'valid_from': '1999-11-01', 'valid_until': '2010-12-31'},
],
'stadium_nba_rocket_mortgage_fieldhouse': [
{'alias_name': 'quicken loans arena', 'valid_from': '2005-08-01', 'valid_until': '2019-08-08'},
{'alias_name': 'gund arena', 'valid_from': '1994-10-01', 'valid_until': '2005-07-31'},
],
'stadium_nba_kia_center': [
{'alias_name': 'amway center', 'valid_from': '2010-10-01', 'valid_until': '2023-07-12'},
],
'stadium_nba_frost_bank_center': [
{'alias_name': 'att center', 'valid_from': '2002-10-01', 'valid_until': '2023-10-01'},
],
'stadium_nba_intuit_dome': [
# New arena opened 2024, Clippers moved from Crypto.com Arena
],
'stadium_nba_delta_center': [
{'alias_name': 'vivint arena', 'valid_from': '2020-12-01', 'valid_until': '2023-07-01'},
{'alias_name': 'vivint smart home arena', 'valid_from': '2015-11-01', 'valid_until': '2020-11-30'},
{'alias_name': 'energysolutions arena', 'valid_from': '2006-11-01', 'valid_until': '2015-10-31'},
],
# NHL
'stadium_nhl_amerant_bank_arena': [
{'alias_name': 'fla live arena', 'valid_from': '2021-10-01', 'valid_until': '2024-05-31'},
{'alias_name': 'bb&t center', 'valid_from': '2012-06-01', 'valid_until': '2021-09-30'},
{'alias_name': 'bankatlantic center', 'valid_from': '2005-10-01', 'valid_until': '2012-05-31'},
],
'stadium_nhl_climate_pledge_arena': [
{'alias_name': 'keyarena', 'valid_from': '1995-01-01', 'valid_until': '2018-10-01'},
{'alias_name': 'seattle center coliseum', 'valid_from': '1962-01-01', 'valid_until': '1994-12-31'},
],
# NFL
'stadium_nfl_sofi_stadium': [
# SoFi Stadium opened 2020, no prior name
],
'stadium_nfl_allegiant_stadium': [
# Allegiant Stadium opened 2020, no prior name (Raiders moved from Oakland Coliseum)
],
'stadium_nfl_caesars_superdome': [
{'alias_name': 'mercedes-benz superdome', 'valid_from': '2011-10-01', 'valid_until': '2021-07-01'},
{'alias_name': 'louisiana superdome', 'valid_from': '1975-08-01', 'valid_until': '2011-09-30'},
{'alias_name': 'superdome', 'valid_from': '1975-08-01'},
],
'stadium_nfl_paycor_stadium': [
{'alias_name': 'paul brown stadium', 'valid_from': '2000-08-01', 'valid_until': '2022-09-05'},
],
'stadium_nfl_empower_field_at_mile_high': [
{'alias_name': 'broncos stadium at mile high', 'valid_from': '2018-09-01', 'valid_until': '2019-08-31'},
{'alias_name': 'sports authority field at mile high', 'valid_from': '2011-08-01', 'valid_until': '2018-08-31'},
{'alias_name': 'invesco field at mile high', 'valid_from': '2001-09-01', 'valid_until': '2011-07-31'},
{'alias_name': 'mile high stadium', 'valid_from': '1960-01-01', 'valid_until': '2001-08-31'},
],
'stadium_nfl_acrisure_stadium': [
{'alias_name': 'heinz field', 'valid_from': '2001-08-01', 'valid_until': '2022-07-10'},
],
'stadium_nfl_everbank_stadium': [
{'alias_name': 'tiaa bank field', 'valid_from': '2018-01-01', 'valid_until': '2023-03-31'},
{'alias_name': 'everbank field', 'valid_from': '2014-01-01', 'valid_until': '2017-12-31'},
{'alias_name': 'alltel stadium', 'valid_from': '1997-06-01', 'valid_until': '2006-12-31'},
{'alias_name': 'jacksonville municipal stadium', 'valid_from': '1995-08-01', 'valid_until': '1997-05-31'},
],
'stadium_nfl_northwest_stadium': [
{'alias_name': 'fedexfield', 'valid_from': '1999-11-01', 'valid_until': '2025-01-01'},
{'alias_name': 'fedex field', 'valid_from': '1999-11-01', 'valid_until': '2025-01-01'},
{'alias_name': 'jack kent cooke stadium', 'valid_from': '1997-09-01', 'valid_until': '1999-10-31'},
],
'stadium_nfl_hard_rock_stadium': [
{'alias_name': 'sun life stadium', 'valid_from': '2010-01-01', 'valid_until': '2016-07-31'},
{'alias_name': 'land shark stadium', 'valid_from': '2009-01-01', 'valid_until': '2009-12-31'},
{'alias_name': 'dolphin stadium', 'valid_from': '2005-01-01', 'valid_until': '2008-12-31'},
{'alias_name': 'pro player stadium', 'valid_from': '1996-04-01', 'valid_until': '2004-12-31'},
{'alias_name': 'joe robbie stadium', 'valid_from': '1987-08-01', 'valid_until': '1996-03-31'},
],
'stadium_nfl_highmark_stadium': [
{'alias_name': 'bills stadium', 'valid_from': '2020-03-01', 'valid_until': '2021-03-31'},
{'alias_name': 'new era field', 'valid_from': '2016-08-01', 'valid_until': '2020-02-29'},
{'alias_name': 'ralph wilson stadium', 'valid_from': '1998-08-01', 'valid_until': '2016-07-31'},
{'alias_name': 'rich stadium', 'valid_from': '1973-08-01', 'valid_until': '1998-07-31'},
],
'stadium_nfl_geha_field_at_arrowhead_stadium': [
{'alias_name': 'arrowhead stadium', 'valid_from': '1972-08-01'},
],
'stadium_nfl_att_stadium': [
{'alias_name': 'cowboys stadium', 'valid_from': '2009-05-01', 'valid_until': '2013-07-24'},
],
'stadium_nfl_us_bank_stadium': [
# Opened 2016, no prior name (Vikings moved from Metrodome)
],
'stadium_nfl_lumen_field': [
{'alias_name': 'centurylink field', 'valid_from': '2011-06-01', 'valid_until': '2020-11-18'},
{'alias_name': 'qwest field', 'valid_from': '2004-06-01', 'valid_until': '2011-05-31'},
{'alias_name': 'seahawks stadium', 'valid_from': '2002-07-01', 'valid_until': '2004-05-31'},
],
# MLS
'stadium_mls_bmo_stadium': [
{'alias_name': 'banc of california stadium', 'valid_from': '2018-04-01', 'valid_until': '2023-06-01'},
],
'stadium_mls_paypal_park': [
{'alias_name': 'earthquakes stadium', 'valid_from': '2015-03-01', 'valid_until': '2020-12-31'},
{'alias_name': 'avaya stadium', 'valid_from': '2015-03-01', 'valid_until': '2020-12-31'},
],
'stadium_mls_shell_energy_stadium': [
{'alias_name': 'pnc stadium', 'valid_from': '2021-03-01', 'valid_until': '2023-03-01'},
{'alias_name': 'bbva stadium', 'valid_from': '2019-01-01', 'valid_until': '2021-02-28'},
{'alias_name': 'bbva compass stadium', 'valid_from': '2012-05-01', 'valid_until': '2018-12-31'},
],
'stadium_mls_dignity_health_sports_park': [
{'alias_name': 'stubhub center', 'valid_from': '2013-06-01', 'valid_until': '2019-01-31'},
{'alias_name': 'home depot center', 'valid_from': '2003-06-01', 'valid_until': '2013-05-31'},
],
'stadium_mls_interandco_stadium': [
{'alias_name': 'exploria stadium', 'valid_from': '2017-03-01', 'valid_until': '2023-07-01'},
{'alias_name': 'orlando city stadium', 'valid_from': '2017-03-01', 'valid_until': '2019-01-01'},
],
'stadium_mls_chase_stadium': [
{'alias_name': 'drv pnk stadium', 'valid_from': '2020-07-01', 'valid_until': '2024-01-01'},
{'alias_name': 'inter miami cf stadium', 'valid_from': '2020-07-01', 'valid_until': '2020-09-01'},
],
'stadium_mls_america_first_field': [
{'alias_name': 'rio tinto stadium', 'valid_from': '2008-10-01', 'valid_until': '2021-08-01'},
],
'stadium_mls_lowercom_field': [
{'alias_name': 'lower.com field', 'valid_from': '2021-07-01'}, # Current name with period
{'alias_name': 'new crew stadium', 'valid_from': '2021-07-01', 'valid_until': '2021-07-01'},
],
# WNBA (most share NBA/NHL arenas with existing aliases; these are WNBA-specific arenas)
'stadium_wnba_michelob_ultra_arena': [
{'alias_name': 'mandalay bay events center', 'valid_from': '1999-03-01', 'valid_until': '2021-01-01'},
],
'stadium_wnba_gateway_center_arena': [
# Gateway Center Arena opened 2018, WNBA-specific venue
],
'stadium_wnba_wintrust_arena': [
# Wintrust Arena opened 2017, WNBA-specific venue
],
'stadium_wnba_college_park_center': [
# College Park Center opened 2012, university venue
],
# NWSL (most share MLS stadiums with existing aliases; these are NWSL-specific)
'stadium_nwsl_cpkc_stadium': [
# CPKC Stadium opened 2024, first soccer-specific stadium for NWSL team
],
'stadium_nwsl_seatgeek_stadium': [
{'alias_name': 'toyota park', 'valid_from': '2006-06-01', 'valid_until': '2018-04-30'},
{'alias_name': 'bridgeview stadium', 'valid_from': '2006-06-01', 'valid_until': '2006-06-01'},
],
'stadium_nwsl_wakemed_soccer_park': [
{'alias_name': 'sas soccer park', 'valid_from': '2002-04-01', 'valid_until': '2007-03-31'},
],
}
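A minimal sketch of how the `valid_from` / `valid_until` windows above could be used to resolve a historical name for a given game date. `SAMPLE_ALIASES` and `resolve_alias` are hypothetical names that do not appear in the original script, and treating `valid_until` as inclusive is an assumption.

```python
from datetime import date
from typing import Optional

# Hypothetical excerpt in the same shape as HISTORICAL_STADIUM_ALIASES.
SAMPLE_ALIASES = {
    'stadium_nba_kaseya_center': [
        {'alias_name': 'ftx arena', 'valid_from': '2021-06-01', 'valid_until': '2023-03-31'},
        {'alias_name': 'american airlines arena', 'valid_from': '1999-12-01', 'valid_until': '2021-05-31'},
    ],
}


def resolve_alias(alias_name: str, on: date, table: dict) -> Optional[str]:
    """Return the canonical ID whose alias window covers `on`, else None."""
    wanted = alias_name.lower().strip()
    for canonical_id, entries in table.items():
        for entry in entries:
            if entry['alias_name'] != wanted:
                continue
            start = date.fromisoformat(entry['valid_from'])
            end = entry.get('valid_until')  # a missing key means "still valid"
            # Assumption: valid_until is an inclusive end date.
            if start <= on and (end is None or on <= date.fromisoformat(end)):
                return canonical_id
    return None
```

Here `resolve_alias('FTX Arena', date(2022, 1, 15), SAMPLE_ALIASES)` resolves to the Kaseya Center ID, while the same name queried for a 2024 date returns `None` because the window has closed.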
# =============================================================================
# SLUG GENERATION
# =============================================================================
def normalize_stadium_name(name: str) -> str:
    """
    Normalize stadium name for slug generation.

    - Lowercase
    - Remove parentheticals like "(IV)"
    - Remove special characters except spaces
    - Collapse multiple spaces
    """
    normalized = name.lower()
    # Remove parentheticals
    normalized = re.sub(r'\s*\([^)]*\)', '', normalized)
    # Remove special characters except spaces and alphanumeric
    normalized = re.sub(r'[^a-z0-9\s]', '', normalized)
    # Replace multiple spaces with single space
    normalized = re.sub(r'\s+', ' ', normalized).strip()
    return normalized


def generate_stadium_slug(name: str) -> str:
    """
    Generate URL-safe slug from stadium name.

    Punctuation is dropped rather than replaced, so "." and "-" do not
    produce an underscore.

    Examples:
        "State Farm Arena" -> "state_farm_arena"
        "TD Garden" -> "td_garden"
        "Crypto.com Arena" -> "cryptocom_arena"
    """
    normalized = normalize_stadium_name(name)
    # Replace spaces with underscores
    slug = normalized.replace(' ', '_')
    # Truncate to 50 chars
    return slug[:50]


def generate_canonical_stadium_id(sport: str, name: str) -> str:
    """
    Generate deterministic canonical ID for stadium.

    Format: stadium_{sport}_{slug}
    Example: stadium_nba_state_farm_arena
    """
    slug = generate_stadium_slug(name)
    return f"stadium_{sport.lower()}_{slug}"
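The following sketch restates the slug logic in compact form so it runs on its own, and shows what IDs it yields for names containing punctuation. The function bodies mirror the ones above; the sample names are illustrative inputs, not data taken from the scraper.

```python
import re


def normalize_stadium_name(name: str) -> str:
    """Same steps as above: lowercase, drop parentheticals and punctuation."""
    normalized = name.lower()
    normalized = re.sub(r'\s*\([^)]*\)', '', normalized)
    normalized = re.sub(r'[^a-z0-9\s]', '', normalized)
    return re.sub(r'\s+', ' ', normalized).strip()


def generate_stadium_slug(name: str) -> str:
    """Spaces become underscores; the slug is capped at 50 characters."""
    return normalize_stadium_name(name).replace(' ', '_')[:50]


def generate_canonical_stadium_id(sport: str, name: str) -> str:
    return f"stadium_{sport.lower()}_{generate_stadium_slug(name)}"


for sport, name in [
    ('NBA', 'State Farm Arena'),
    ('NBA', 'Crypto.com Arena'),
    ('MLB', 'T-Mobile Park'),
]:
    print(generate_canonical_stadium_id(sport, name))
```

Because punctuation is removed instead of being turned into a separator, "Crypto.com Arena" and "T-Mobile Park" slug to `cryptocom_arena` and `tmobile_park`. Keys in `HISTORICAL_STADIUM_ALIASES` such as `stadium_nba_crypto_com_arena` and `stadium_mlb_t_mobile_park` would therefore only be picked up if the raw names arrive spelled with spaces in those positions, which may be worth checking against the scraper output.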
# =============================================================================
# CANONICALIZATION
# =============================================================================
def canonicalize_stadiums(
    raw_stadiums: list[dict],
    verbose: bool = False
) -> tuple[list[CanonicalStadium], list[StadiumAlias]]:
    """
    Stage 1: Canonicalize stadiums.

    1. Normalize names and cities
    2. Deduplicate by (sport, normalized_name, city)
    3. Generate canonical IDs
    4. Create name aliases

    Args:
        raw_stadiums: List of raw stadium dicts from scraper
        verbose: Print detailed progress

    Returns:
        (canonical_stadiums, aliases)
    """
    canonical_stadiums = []
    aliases = []
    seen_keys = {}  # (sport, normalized_name, city) -> canonical_id

    for raw in raw_stadiums:
        sport = raw.get('sport', '').upper()
        name = raw.get('name', '')
        city = raw.get('city', '')

        if not sport or not name:
            if verbose:
                print(f" Skipping invalid stadium: {raw}")
            continue

        # Generate canonical ID
        canonical_id = generate_canonical_stadium_id(sport, name)

        # Deduplication key (same stadium in same city for same sport)
        normalized_name = normalize_stadium_name(name)
        dedup_key = (sport, normalized_name, city.lower())

        if dedup_key in seen_keys:
            existing_canonical_id = seen_keys[dedup_key]
            # Add as alias if the display name differs
            alias_name = name.lower().strip()
            if alias_name != normalized_name:
                aliases.append(StadiumAlias(
                    alias_name=alias_name,
                    stadium_canonical_id=existing_canonical_id
                ))
            if verbose:
                print(f" Duplicate: {name} -> {existing_canonical_id}")
            continue

        seen_keys[dedup_key] = canonical_id

        # Create canonical stadium
        canonical = CanonicalStadium(
            canonical_id=canonical_id,
            name=name,
            city=city,
            state=raw.get('state', ''),
            latitude=raw.get('latitude', 0.0),
            longitude=raw.get('longitude', 0.0),
            capacity=raw.get('capacity', 0),
            sport=sport,
            primary_team_abbrevs=raw.get('team_abbrevs', []),
            year_opened=raw.get('year_opened')
        )
        canonical_stadiums.append(canonical)

        # Add primary name as alias (normalized)
        aliases.append(StadiumAlias(
            alias_name=name.lower().strip(),
            stadium_canonical_id=canonical_id
        ))

        # Also add normalized version if different
        if normalized_name != name.lower().strip():
            aliases.append(StadiumAlias(
                alias_name=normalized_name,
                stadium_canonical_id=canonical_id
            ))

        if verbose:
            print(f" {canonical_id}: {name} ({city})")

    return canonical_stadiums, aliases


def add_historical_aliases(
    aliases: list[StadiumAlias],
    canonical_ids: set[str]
) -> list[StadiumAlias]:
    """
    Add historical stadium name aliases.

    Only adds aliases for stadiums that exist in canonical_ids.
    """
    for canonical_id, historical in HISTORICAL_STADIUM_ALIASES.items():
        if canonical_id not in canonical_ids:
            continue
        for hist in historical:
            aliases.append(StadiumAlias(
                alias_name=hist['alias_name'],
                stadium_canonical_id=canonical_id,
                valid_from=hist.get('valid_from'),
                valid_until=hist.get('valid_until')
            ))
    return aliases


def deduplicate_aliases(aliases: list[StadiumAlias]) -> list[StadiumAlias]:
    """Remove duplicate aliases (same alias_name -> same canonical_id)."""
    seen = set()
    deduped = []
    for alias in aliases:
        key = (alias.alias_name, alias.stadium_canonical_id)
        if key not in seen:
            seen.add(key)
            deduped.append(alias)
    return deduped
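A small sketch of the deduplication rule, for illustration only. The `StadiumAlias` dataclass here is a stand-in whose fields are inferred from how the class is used in this script; the real definition lives earlier in the file and may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StadiumAlias:
    """Assumed shape, inferred from the keyword arguments used in this script."""
    alias_name: str
    stadium_canonical_id: str
    valid_from: Optional[str] = None
    valid_until: Optional[str] = None


def deduplicate_aliases(aliases: list) -> list:
    """Keep the first alias seen for each (alias_name, canonical_id) pair."""
    seen = set()
    deduped = []
    for alias in aliases:
        key = (alias.alias_name, alias.stadium_canonical_id)
        if key not in seen:
            seen.add(key)
            deduped.append(alias)
    return deduped


sample = [
    StadiumAlias('superdome', 'stadium_nfl_caesars_superdome'),
    StadiumAlias('superdome', 'stadium_nfl_caesars_superdome', valid_from='1975-08-01'),
    StadiumAlias('louisiana superdome', 'stadium_nfl_caesars_superdome', valid_from='1975-08-01'),
]
result = deduplicate_aliases(sample)
```

Note that the key ignores `valid_from` and `valid_until`, so when two entries share a name and canonical ID, the first one wins and the date window of the later entry is discarded. Since historical aliases are appended after the scraped ones, a dated historical entry can lose to an undated scraped entry with the same name.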
# =============================================================================
# MAIN
# =============================================================================
def main():
    parser = argparse.ArgumentParser(
        description='Canonicalize stadium data'
    )
    parser.add_argument(
        '--input', type=str, default='./data/stadiums.json',
        help='Input raw stadiums JSON file'
    )
    parser.add_argument(
        '--output', type=str, default='./data',
        help='Output directory for canonical files'
    )
    parser.add_argument(
        '--verbose', '-v', action='store_true',
        help='Verbose output'
    )
    args = parser.parse_args()

    input_path = Path(args.input)
    output_dir = Path(args.output)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Load raw stadiums
    print(f"Loading raw stadiums from {input_path}...")
    with open(input_path) as f:
        raw_stadiums = json.load(f)
    print(f" Loaded {len(raw_stadiums)} raw stadiums")

    # Canonicalize
    print("\nCanonicalizing stadiums...")
    canonical_stadiums, aliases = canonicalize_stadiums(
        raw_stadiums, verbose=args.verbose
    )
    print(f" Created {len(canonical_stadiums)} canonical stadiums")

    # Add historical aliases
    canonical_ids = {s.canonical_id for s in canonical_stadiums}
    aliases = add_historical_aliases(aliases, canonical_ids)

    # Deduplicate aliases
    aliases = deduplicate_aliases(aliases)
    print(f" Created {len(aliases)} stadium aliases")

    # Export
    stadiums_path = output_dir / 'stadiums_canonical.json'
    aliases_path = output_dir / 'stadium_aliases.json'

    with open(stadiums_path, 'w') as f:
        json.dump([asdict(s) for s in canonical_stadiums], f, indent=2)
    print(f"\nExported stadiums to {stadiums_path}")

    with open(aliases_path, 'w') as f:
        json.dump([asdict(a) for a in aliases], f, indent=2)
    print(f"Exported aliases to {aliases_path}")

    # Summary by sport
    print("\nSummary by sport:")
    by_sport = {}
    for s in canonical_stadiums:
        by_sport[s.sport] = by_sport.get(s.sport, 0) + 1
    for sport, count in sorted(by_sport.items()):
        print(f" {sport}: {count} stadiums")


if __name__ == '__main__':
    main()
-610
@@ -1,610 +0,0 @@
#!/usr/bin/env python3
"""
Team Canonicalization for SportsTime
====================================
Stage 2 of the canonicalization pipeline.
Generates canonical team IDs and fuzzy matches teams to stadiums.
Usage:
python canonicalize_teams.py --stadiums data/stadiums_canonical.json --output data/
"""
import argparse
import json
from dataclasses import dataclass, asdict, field
from difflib import SequenceMatcher
from pathlib import Path
from typing import Optional
# Import team mappings from scraper
from scrape_schedules import NBA_TEAMS, MLB_TEAMS, NHL_TEAMS, NFL_TEAMS
from mls import MLS_TEAMS
from wnba import WNBA_TEAMS
from nwsl import NWSL_TEAMS
# =============================================================================
# DATA CLASSES
# =============================================================================
@dataclass
class CanonicalTeam:
    """A canonicalized team with stable ID."""
    canonical_id: str
    name: str
    abbreviation: str
    sport: str
    city: str
    stadium_canonical_id: str
    conference_id: Optional[str] = None
    division_id: Optional[str] = None
    primary_color: Optional[str] = None
    secondary_color: Optional[str] = None


@dataclass
class MatchWarning:
    """Warning about a low-confidence match."""
    team_canonical_id: str
    team_name: str
    arena_name: str
    matched_stadium: Optional[str]
    issue: str
    confidence: float
# =============================================================================
# LEAGUE STRUCTURE
# Maps team abbreviation -> (conference_id, division_id)
# =============================================================================
NBA_DIVISIONS = {
# Eastern Conference - Atlantic
'BOS': ('nba_eastern', 'nba_atlantic'),
'BRK': ('nba_eastern', 'nba_atlantic'),
'NYK': ('nba_eastern', 'nba_atlantic'),
'PHI': ('nba_eastern', 'nba_atlantic'),
'TOR': ('nba_eastern', 'nba_atlantic'),
# Eastern Conference - Central
'CHI': ('nba_eastern', 'nba_central'),
'CLE': ('nba_eastern', 'nba_central'),
'DET': ('nba_eastern', 'nba_central'),
'IND': ('nba_eastern', 'nba_central'),
'MIL': ('nba_eastern', 'nba_central'),
# Eastern Conference - Southeast
'ATL': ('nba_eastern', 'nba_southeast'),
'CHO': ('nba_eastern', 'nba_southeast'),
'MIA': ('nba_eastern', 'nba_southeast'),
'ORL': ('nba_eastern', 'nba_southeast'),
'WAS': ('nba_eastern', 'nba_southeast'),
# Western Conference - Northwest
'DEN': ('nba_western', 'nba_northwest'),
'MIN': ('nba_western', 'nba_northwest'),
'OKC': ('nba_western', 'nba_northwest'),
'POR': ('nba_western', 'nba_northwest'),
'UTA': ('nba_western', 'nba_northwest'),
# Western Conference - Pacific
'GSW': ('nba_western', 'nba_pacific'),
'LAC': ('nba_western', 'nba_pacific'),
'LAL': ('nba_western', 'nba_pacific'),
'PHO': ('nba_western', 'nba_pacific'),
'SAC': ('nba_western', 'nba_pacific'),
# Western Conference - Southwest
'DAL': ('nba_western', 'nba_southwest'),
'HOU': ('nba_western', 'nba_southwest'),
'MEM': ('nba_western', 'nba_southwest'),
'NOP': ('nba_western', 'nba_southwest'),
'SAS': ('nba_western', 'nba_southwest'),
}
MLB_DIVISIONS = {
# American League - East
'NYY': ('mlb_al', 'mlb_al_east'),
'BOS': ('mlb_al', 'mlb_al_east'),
'TOR': ('mlb_al', 'mlb_al_east'),
'BAL': ('mlb_al', 'mlb_al_east'),
'TBR': ('mlb_al', 'mlb_al_east'),
# American League - Central
'CLE': ('mlb_al', 'mlb_al_central'),
'DET': ('mlb_al', 'mlb_al_central'),
'MIN': ('mlb_al', 'mlb_al_central'),
'CHW': ('mlb_al', 'mlb_al_central'),
'KCR': ('mlb_al', 'mlb_al_central'),
# American League - West
'HOU': ('mlb_al', 'mlb_al_west'),
'SEA': ('mlb_al', 'mlb_al_west'),
'TEX': ('mlb_al', 'mlb_al_west'),
'LAA': ('mlb_al', 'mlb_al_west'),
'OAK': ('mlb_al', 'mlb_al_west'),
# National League - East
'ATL': ('mlb_nl', 'mlb_nl_east'),
'PHI': ('mlb_nl', 'mlb_nl_east'),
'NYM': ('mlb_nl', 'mlb_nl_east'),
'MIA': ('mlb_nl', 'mlb_nl_east'),
'WSN': ('mlb_nl', 'mlb_nl_east'),
# National League - Central
'MIL': ('mlb_nl', 'mlb_nl_central'),
'CHC': ('mlb_nl', 'mlb_nl_central'),
'STL': ('mlb_nl', 'mlb_nl_central'),
'PIT': ('mlb_nl', 'mlb_nl_central'),
'CIN': ('mlb_nl', 'mlb_nl_central'),
# National League - West
'LAD': ('mlb_nl', 'mlb_nl_west'),
'ARI': ('mlb_nl', 'mlb_nl_west'),
'SDP': ('mlb_nl', 'mlb_nl_west'),
'SFG': ('mlb_nl', 'mlb_nl_west'),
'COL': ('mlb_nl', 'mlb_nl_west'),
}
NHL_DIVISIONS = {
# Eastern Conference - Atlantic
'BOS': ('nhl_eastern', 'nhl_atlantic'),
'BUF': ('nhl_eastern', 'nhl_atlantic'),
'DET': ('nhl_eastern', 'nhl_atlantic'),
'FLA': ('nhl_eastern', 'nhl_atlantic'),
'MTL': ('nhl_eastern', 'nhl_atlantic'),
'OTT': ('nhl_eastern', 'nhl_atlantic'),
'TBL': ('nhl_eastern', 'nhl_atlantic'),
'TOR': ('nhl_eastern', 'nhl_atlantic'),
# Eastern Conference - Metropolitan
'CAR': ('nhl_eastern', 'nhl_metropolitan'),
'CBJ': ('nhl_eastern', 'nhl_metropolitan'),
'NJD': ('nhl_eastern', 'nhl_metropolitan'),
'NYI': ('nhl_eastern', 'nhl_metropolitan'),
'NYR': ('nhl_eastern', 'nhl_metropolitan'),
'PHI': ('nhl_eastern', 'nhl_metropolitan'),
'PIT': ('nhl_eastern', 'nhl_metropolitan'),
'WSH': ('nhl_eastern', 'nhl_metropolitan'),
# Western Conference - Central
'ARI': ('nhl_western', 'nhl_central'), # Utah Hockey Club
'CHI': ('nhl_western', 'nhl_central'),
'COL': ('nhl_western', 'nhl_central'),
'DAL': ('nhl_western', 'nhl_central'),
'MIN': ('nhl_western', 'nhl_central'),
'NSH': ('nhl_western', 'nhl_central'),
'STL': ('nhl_western', 'nhl_central'),
'WPG': ('nhl_western', 'nhl_central'),
# Western Conference - Pacific
'ANA': ('nhl_western', 'nhl_pacific'),
'CGY': ('nhl_western', 'nhl_pacific'),
'EDM': ('nhl_western', 'nhl_pacific'),
'LAK': ('nhl_western', 'nhl_pacific'),
'SEA': ('nhl_western', 'nhl_pacific'),
'SJS': ('nhl_western', 'nhl_pacific'),
'VAN': ('nhl_western', 'nhl_pacific'),
'VGK': ('nhl_western', 'nhl_pacific'),
}
NFL_DIVISIONS = {
# AFC East
'BUF': ('nfl_afc', 'nfl_afc_east'),
'MIA': ('nfl_afc', 'nfl_afc_east'),
'NE': ('nfl_afc', 'nfl_afc_east'),
'NYJ': ('nfl_afc', 'nfl_afc_east'),
# AFC North
'BAL': ('nfl_afc', 'nfl_afc_north'),
'CIN': ('nfl_afc', 'nfl_afc_north'),
'CLE': ('nfl_afc', 'nfl_afc_north'),
'PIT': ('nfl_afc', 'nfl_afc_north'),
# AFC South
'HOU': ('nfl_afc', 'nfl_afc_south'),
'IND': ('nfl_afc', 'nfl_afc_south'),
'JAX': ('nfl_afc', 'nfl_afc_south'),
'TEN': ('nfl_afc', 'nfl_afc_south'),
# AFC West
'DEN': ('nfl_afc', 'nfl_afc_west'),
'KC': ('nfl_afc', 'nfl_afc_west'),
'LV': ('nfl_afc', 'nfl_afc_west'),
'LAC': ('nfl_afc', 'nfl_afc_west'),
# NFC East
'DAL': ('nfl_nfc', 'nfl_nfc_east'),
'NYG': ('nfl_nfc', 'nfl_nfc_east'),
'PHI': ('nfl_nfc', 'nfl_nfc_east'),
'WAS': ('nfl_nfc', 'nfl_nfc_east'),
# NFC North
'CHI': ('nfl_nfc', 'nfl_nfc_north'),
'DET': ('nfl_nfc', 'nfl_nfc_north'),
'GB': ('nfl_nfc', 'nfl_nfc_north'),
'MIN': ('nfl_nfc', 'nfl_nfc_north'),
# NFC South
'ATL': ('nfl_nfc', 'nfl_nfc_south'),
'CAR': ('nfl_nfc', 'nfl_nfc_south'),
'NO': ('nfl_nfc', 'nfl_nfc_south'),
'TB': ('nfl_nfc', 'nfl_nfc_south'),
# NFC West
'ARI': ('nfl_nfc', 'nfl_nfc_west'),
'LAR': ('nfl_nfc', 'nfl_nfc_west'),
'SF': ('nfl_nfc', 'nfl_nfc_west'),
'SEA': ('nfl_nfc', 'nfl_nfc_west'),
}
MLS_DIVISIONS = {
# Eastern Conference (MLS uses conferences, not divisions)
'ATL': ('mls_eastern', None),
'CHI': ('mls_eastern', None),
'CIN': ('mls_eastern', None),
'CLB': ('mls_eastern', None),
'CLT': ('mls_eastern', None),
'DC': ('mls_eastern', None),
'MIA': ('mls_eastern', None),
'MTL': ('mls_eastern', None),
'NE': ('mls_eastern', None),
'NYCFC': ('mls_eastern', None),
'NYRB': ('mls_eastern', None),
'ORL': ('mls_eastern', None),
'PHI': ('mls_eastern', None),
'TOR': ('mls_eastern', None),
# Western Conference
'AUS': ('mls_western', None),
'COL': ('mls_western', None),
'DAL': ('mls_western', None),
'HOU': ('mls_western', None),
'LAFC': ('mls_western', None),
'LAG': ('mls_western', None),
'MIN': ('mls_western', None),
'NSH': ('mls_eastern', None),  # Nashville SC has played in the Eastern Conference since 2023
'POR': ('mls_western', None),
'RSL': ('mls_western', None),
'SD': ('mls_western', None),
'SEA': ('mls_western', None),
'SJ': ('mls_western', None),
'SKC': ('mls_western', None),
'STL': ('mls_western', None),
'VAN': ('mls_western', None),
}
WNBA_DIVISIONS = {
# WNBA has no divisions (single league structure)
'ATL': ('wnba', None),
'CHI': ('wnba', None),
'CON': ('wnba', None),
'DAL': ('wnba', None),
'GSV': ('wnba', None),
'IND': ('wnba', None),
'LVA': ('wnba', None),
'LA': ('wnba', None),
'MIN': ('wnba', None),
'NY': ('wnba', None),
'PHO': ('wnba', None),
'SEA': ('wnba', None),
'WAS': ('wnba', None),
}
NWSL_DIVISIONS = {
# NWSL has no divisions (single league structure)
'LA': ('nwsl', None), # Angel City FC
'SJ': ('nwsl', None), # Bay FC
'CHI': ('nwsl', None), # Chicago Red Stars
'HOU': ('nwsl', None), # Houston Dash
'KC': ('nwsl', None), # Kansas City Current
'NJ': ('nwsl', None), # NJ/NY Gotham FC
'NC': ('nwsl', None), # North Carolina Courage
'ORL': ('nwsl', None), # Orlando Pride
'POR': ('nwsl', None), # Portland Thorns FC
'SEA': ('nwsl', None), # Seattle Reign FC
'SD': ('nwsl', None), # San Diego Wave FC
'UTA': ('nwsl', None), # Utah Royals FC
'WAS': ('nwsl', None), # Washington Spirit
}
# =============================================================================
# FUZZY MATCHING
# =============================================================================
def normalize_for_matching(text: str) -> str:
    """Normalize text for fuzzy matching."""
    import re
    text = text.lower().strip()
    # Remove common suffixes/prefixes
    text = re.sub(r'\s*(arena|center|stadium|field|park|centre)\s*', ' ', text)
    # Remove special characters
    text = re.sub(r'[^a-z0-9\s]', '', text)
    # Collapse spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text


def fuzzy_match_stadium(
    team_arena_name: str,
    team_city: str,
    sport: str,
    stadiums: list[dict],
    confidence_threshold: float = 0.6
) -> tuple[Optional[str], float]:
    """
    Fuzzy match team's arena to a canonical stadium.

    Matching strategy:
    - 70% weight: Name similarity (SequenceMatcher)
    - 30% weight: City match (exact=1.0, nearby=0.7, partial=0.5)

    Args:
        team_arena_name: The arena name from team mapping
        team_city: The team's city
        sport: Sport code (e.g. NBA, MLB, NHL)
        stadiums: List of canonical stadium dicts
        confidence_threshold: Minimum confidence for a match

    Returns:
        (canonical_stadium_id, confidence_score); the ID is None when the
        best score is below confidence_threshold
    """
    best_match = None
    best_score = 0.0

    # Normalize arena name
    arena_normalized = normalize_for_matching(team_arena_name)
    city_lower = team_city.lower()

    # Nearby cities (e.g., "San Francisco" team but "Oakland" arena).
    # Built once here rather than on every loop iteration.
    nearby_cities = {
        'san francisco': ['oakland', 'san jose'],
        'new york': ['brooklyn', 'queens', 'elmont', 'newark'],
        'los angeles': ['inglewood', 'anaheim'],
        'miami': ['sunrise', 'fort lauderdale'],
        'dallas': ['arlington', 'fort worth'],
        'washington': ['landover', 'capitol heights'],
        'minneapolis': ['st paul', 'st. paul'],
        'detroit': ['auburn hills', 'pontiac'],
    }

    # Filter to same sport
    sport_stadiums = [s for s in stadiums if s['sport'] == sport]

    for stadium in sport_stadiums:
        stadium_name_normalized = normalize_for_matching(stadium['name'])

        # Score 1: Name similarity
        name_score = SequenceMatcher(
            None,
            arena_normalized,
            stadium_name_normalized
        ).ratio()

        # Also check full names (unnormalized)
        full_name_score = SequenceMatcher(
            None,
            team_arena_name.lower(),
            stadium['name'].lower()
        ).ratio()

        # Take the better score
        name_score = max(name_score, full_name_score)

        # Score 2: City match
        city_score = 0.0
        stadium_city_lower = stadium['city'].lower()
        if city_lower == stadium_city_lower:
            city_score = 1.0
        elif city_lower in stadium_city_lower or stadium_city_lower in city_lower:
            city_score = 0.5

        # Check for nearby cities
        for main_city, nearby in nearby_cities.items():
            if city_lower == main_city and stadium_city_lower in nearby:
                city_score = 0.7
            elif stadium_city_lower == main_city and city_lower in nearby:
                city_score = 0.7

        # Combined score (weighted)
        combined = (name_score * 0.7) + (city_score * 0.3)

        if combined > best_score:
            best_score = combined
            best_match = stadium['canonical_id']

    if best_score >= confidence_threshold:
        return best_match, best_score
    return None, best_score
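To make the 70/30 weighting concrete, here is a minimal sketch of the combined score in isolation. `combined_score` is a hypothetical helper, not part of the original script; it uses only `difflib.SequenceMatcher` from the standard library and skips the suffix-stripping normalization for brevity.

```python
from difflib import SequenceMatcher

NAME_WEIGHT = 0.7  # weight on name similarity, as in fuzzy_match_stadium
CITY_WEIGHT = 0.3  # weight on the city score


def combined_score(arena_name: str, stadium_name: str, city_score: float) -> float:
    """Weighted blend of name similarity and a precomputed city score."""
    name_score = SequenceMatcher(
        None, arena_name.lower(), stadium_name.lower()
    ).ratio()
    return name_score * NAME_WEIGHT + city_score * CITY_WEIGHT


# Identical names, exact city match: the maximum possible score.
perfect = combined_score('TD Garden', 'TD Garden', 1.0)

# Identical names, no city match at all: the name weight alone.
name_only = combined_score('TD Garden', 'TD Garden', 0.0)
```

With these weights, a perfect name match with no city agreement scores about 0.7, which clears the default `confidence_threshold` of 0.6. Conversely, an exact city match contributes only 0.3, so the name similarity must still reach roughly 0.43 for the pair to be accepted.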
# =============================================================================
# CANONICALIZATION
# =============================================================================
def generate_canonical_team_id(sport: str, abbrev: str) -> str:
    """
    Generate deterministic canonical ID for team.

    Format: team_{sport}_{abbrev}
    Example: team_nba_atl
    """
    return f"team_{sport.lower()}_{abbrev.lower()}"


def canonicalize_teams(
    team_mappings: dict[str, dict],
    sport: str,
    canonical_stadiums: list[dict],
    verbose: bool = False
) -> tuple[list[CanonicalTeam], list[MatchWarning]]:
    """
    Stage 2: Canonicalize teams.

    1. Generate canonical IDs from abbreviations
    2. Fuzzy match to stadiums
    3. Log low-confidence matches for review

    Args:
        team_mappings: Team data dict (e.g., NBA_TEAMS)
        sport: Sport code
        canonical_stadiums: List of canonical stadium dicts
        verbose: Print detailed progress

    Returns:
        (canonical_teams, warnings)
    """
    teams = []
    warnings = []

    # Determine arena key based on sport
    arena_key = 'arena' if sport in ['NBA', 'NHL', 'WNBA'] else 'stadium'

    # Get division structure
    division_map = {
        'NBA': NBA_DIVISIONS,
        'MLB': MLB_DIVISIONS,
        'NHL': NHL_DIVISIONS,
        'NFL': NFL_DIVISIONS,
        'MLS': MLS_DIVISIONS,
        'WNBA': WNBA_DIVISIONS,
        'NWSL': NWSL_DIVISIONS,
    }.get(sport, {})

    for abbrev, info in team_mappings.items():
        canonical_id = generate_canonical_team_id(sport, abbrev)
        arena_name = info.get(arena_key, '')
        city = info.get('city', '')
        team_name = info.get('name', '')

        # Fuzzy match stadium
        stadium_canonical_id, confidence = fuzzy_match_stadium(
            arena_name, city, sport, canonical_stadiums
        )

        if stadium_canonical_id is None:
            warnings.append(MatchWarning(
                team_canonical_id=canonical_id,
                team_name=team_name,
                arena_name=arena_name,
                matched_stadium=None,
                issue='No stadium match found',
                confidence=confidence
            ))
            # Create placeholder ID
            stadium_canonical_id = f"stadium_unknown_{sport.lower()}_{abbrev.lower()}"
            if verbose:
                print(f" WARNING: {canonical_id} - no stadium match for '{arena_name}'")
        elif confidence < 0.8:
            warnings.append(MatchWarning(
                team_canonical_id=canonical_id,
                team_name=team_name,
                arena_name=arena_name,
                matched_stadium=stadium_canonical_id,
                issue='Low confidence stadium match',
                confidence=confidence
            ))
            if verbose:
                print(f" WARNING: {canonical_id} - low confidence ({confidence:.2f}) match to {stadium_canonical_id}")

        # Get conference/division
        conf_id, div_id = division_map.get(abbrev, (None, None))

        team = CanonicalTeam(
            canonical_id=canonical_id,
            name=team_name,
            abbreviation=abbrev,
            sport=sport,
            city=city,
            stadium_canonical_id=stadium_canonical_id,
            conference_id=conf_id,
            division_id=div_id
        )
        teams.append(team)

        if verbose and confidence >= 0.8:
            print(f" {canonical_id}: {team_name} -> {stadium_canonical_id} ({confidence:.2f})")

    return teams, warnings


def canonicalize_all_teams(
    canonical_stadiums: list[dict],
    verbose: bool = False
) -> tuple[list[CanonicalTeam], list[MatchWarning]]:
    """Canonicalize teams for all sports."""
    all_teams = []
    all_warnings = []

    sport_mappings = [
        ('NBA', NBA_TEAMS),
        ('MLB', MLB_TEAMS),
        ('NHL', NHL_TEAMS),
        ('NFL', NFL_TEAMS),
        ('MLS', MLS_TEAMS),
        ('WNBA', WNBA_TEAMS),
        ('NWSL', NWSL_TEAMS),
    ]

    for sport, team_map in sport_mappings:
        if verbose:
            print(f"\n{sport}:")
        teams, warnings = canonicalize_teams(
            team_map, sport, canonical_stadiums, verbose
        )
        all_teams.extend(teams)
        all_warnings.extend(warnings)

    return all_teams, all_warnings
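The three outcomes handled in `canonicalize_teams` (no match, low-confidence match, accepted match) can be summarized as a small triage function. `triage_match` and `REVIEW_THRESHOLD` are hypothetical names introduced for illustration; the 0.8 cutoff and the placeholder ID format are taken from the code above.

```python
from typing import Optional

REVIEW_THRESHOLD = 0.8  # matches below this are kept but flagged for review


def triage_match(
    sport: str,
    abbrev: str,
    stadium_id: Optional[str],
    confidence: float,
) -> tuple:
    """Return (stadium_id_to_store, issue), where issue is None if accepted."""
    if stadium_id is None:
        # The matcher returns None when the best score is under its own threshold.
        placeholder = f"stadium_unknown_{sport.lower()}_{abbrev.lower()}"
        return placeholder, 'No stadium match found'
    if confidence < REVIEW_THRESHOLD:
        return stadium_id, 'Low confidence stadium match'
    return stadium_id, None


unmatched = triage_match('NBA', 'LAC', None, 0.41)
flagged = triage_match('NBA', 'ATL', 'stadium_nba_state_farm_arena', 0.72)
accepted = triage_match('NBA', 'ATL', 'stadium_nba_state_farm_arena', 0.95)
```

One consequence of this design is that an unmatched team still receives a `stadium_canonical_id`, so downstream consumers need to recognize the `stadium_unknown_` prefix rather than test for a missing value.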
# =============================================================================
# MAIN
# =============================================================================
def main():
    parser = argparse.ArgumentParser(
        description='Canonicalize team data'
    )
    parser.add_argument(
        '--stadiums', type=str, default='./data/stadiums_canonical.json',
        help='Input canonical stadiums JSON file'
    )
    parser.add_argument(
        '--output', type=str, default='./data',
        help='Output directory for canonical files'
    )
    parser.add_argument(
        '--verbose', '-v', action='store_true',
        help='Verbose output'
    )
    args = parser.parse_args()

    stadiums_path = Path(args.stadiums)
    output_dir = Path(args.output)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Load canonical stadiums
    print(f"Loading canonical stadiums from {stadiums_path}...")
    with open(stadiums_path) as f:
        canonical_stadiums = json.load(f)
    print(f" Loaded {len(canonical_stadiums)} canonical stadiums")

    # Canonicalize teams
    print("\nCanonicalizing teams...")
    canonical_teams, warnings = canonicalize_all_teams(
        canonical_stadiums, verbose=args.verbose
    )
    print(f" Created {len(canonical_teams)} canonical teams")
    if warnings:
        print(f"\n Warnings: {len(warnings)}")
        for w in warnings:
            print(f" - {w.team_canonical_id}: {w.issue} (confidence: {w.confidence:.2f})")

    # Export
    teams_path = output_dir / 'teams_canonical.json'
    warnings_path = output_dir / 'team_matching_warnings.json'
    with open(teams_path, 'w') as f:
        json.dump([asdict(t) for t in canonical_teams], f, indent=2)
    print(f"\nExported teams to {teams_path}")
    if warnings:
        with open(warnings_path, 'w') as f:
            json.dump([asdict(w) for w in warnings], f, indent=2)
        print(f"Exported warnings to {warnings_path}")

    # Summary by sport
    print("\nSummary by sport:")
    by_sport = {}
    for t in canonical_teams:
        by_sport[t.sport] = by_sport.get(t.sport, 0) + 1
    for sport, count in sorted(by_sport.items()):
        print(f" {sport}: {count} teams")


if __name__ == '__main__':
    main()
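The script above writes `team_matching_warnings.json` only when warnings exist, and each warning carries `team_canonical_id`, `issue`, and `confidence`. A minimal sketch of triaging that file follows; the helper names and the `0.8` threshold are hypothetical and not part of the original script.

```python
import json
from pathlib import Path


def load_warnings(path):
    """Load exported warnings; a missing file is treated as 'no warnings'."""
    path = Path(path)
    if not path.exists():
        return []
    with open(path) as f:
        return json.load(f)


def low_confidence_warnings(warnings, threshold=0.8):
    """Return warnings below the (hypothetical) threshold, weakest match first."""
    flagged = [w for w in warnings if w["confidence"] < threshold]
    return sorted(flagged, key=lambda w: w["confidence"])


# Illustrative data only; these IDs and issues are made up.
sample = [
    {"team_canonical_id": "team_a", "issue": "fuzzy stadium match", "confidence": 0.91},
    {"team_canonical_id": "team_b", "issue": "fuzzy stadium match", "confidence": 0.55},
    {"team_canonical_id": "team_c", "issue": "stadium not found", "confidence": 0.0},
]
review_queue = low_confidence_warnings(sample)
```

Because the file is not rewritten when a run produces no warnings, a file left over from an earlier run may still be on disk; checking its modification time before trusting it may be worthwhile.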
File diff suppressed because it is too large Load Diff
-53
View File
@@ -1,53 +0,0 @@
DEFINE SCHEMA
RECORD TYPE Stadium (
"___createTime" TIMESTAMP,
"___createdBy" REFERENCE,
"___etag" STRING,
"___modTime" TIMESTAMP,
"___modifiedBy" REFERENCE,
"___recordID" REFERENCE QUERYABLE,
stadiumId STRING QUERYABLE,
name STRING QUERYABLE SEARCHABLE,
city STRING QUERYABLE,
state STRING,
location LOCATION QUERYABLE,
capacity INT64,
sport STRING QUERYABLE SORTABLE,
teamAbbrevs LIST<STRING>,
source STRING,
yearOpened INT64
);
RECORD TYPE Team (
"___createTime" TIMESTAMP,
"___createdBy" REFERENCE,
"___etag" STRING,
"___modTime" TIMESTAMP,
"___modifiedBy" REFERENCE,
"___recordID" REFERENCE QUERYABLE,
teamId STRING QUERYABLE,
name STRING QUERYABLE SEARCHABLE,
abbreviation STRING QUERYABLE,
city STRING QUERYABLE,
sport STRING QUERYABLE SORTABLE
);
RECORD TYPE Game (
"___createTime" TIMESTAMP,
"___createdBy" REFERENCE,
"___etag" STRING,
"___modTime" TIMESTAMP,
"___modifiedBy" REFERENCE,
"___recordID" REFERENCE QUERYABLE,
gameId STRING QUERYABLE,
sport STRING QUERYABLE SORTABLE,
season STRING QUERYABLE,
dateTime TIMESTAMP QUERYABLE SORTABLE,
homeTeamRef REFERENCE QUERYABLE,
awayTeamRef REFERENCE QUERYABLE,
venueRef REFERENCE,
isPlayoff INT64,
broadcastInfo STRING,
source STRING
);
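To make the schema above concrete, here is a rough sketch of mapping one scraped stadium (using the field names from the `Stadium` dataclass and CSV in this commit) onto the `Stadium` record type. The JSON envelope (`recordType`, `recordName`, `fields` with `{"value": ...}` wrappers) is assumed from CloudKit Web Services conventions and should be checked against Apple's reference; reusing the stadium `id` as `recordName` is also an assumption.

```python
def stadium_to_record(stadium: dict) -> dict:
    """Build a CloudKit-style record dict for the Stadium record type (sketch)."""
    fields = {
        "stadiumId": {"value": stadium["id"]},
        "name": {"value": stadium["name"]},
        "city": {"value": stadium["city"]},
        "state": {"value": stadium["state"]},
        # LOCATION fields take a latitude/longitude pair.
        "location": {"value": {
            "latitude": float(stadium["latitude"]),
            "longitude": float(stadium["longitude"]),
        }},
        "capacity": {"value": int(stadium["capacity"])},       # INT64
        "sport": {"value": stadium["sport"]},
        "teamAbbrevs": {"value": list(stadium["team_abbrevs"])},  # LIST<STRING>
        "source": {"value": stadium["source"]},
    }
    # yearOpened is optional in the dataclass, so only send it when present.
    if stadium.get("year_opened") is not None:
        fields["yearOpened"] = {"value": int(stadium["year_opened"])}
    return {"recordType": "Stadium", "recordName": stadium["id"], "fields": fields}


row = {
    "id": "mlb_chase_field", "name": "Chase Field", "city": "Phoenix", "state": "AZ",
    "latitude": 33.4453, "longitude": -112.0667, "capacity": 48519, "sport": "MLB",
    "team_abbrevs": ["ARI"], "source": "mlb_hardcoded", "year_opened": 1998,
}
record = stadium_to_record(row)
```

Note that the CSV export stores `team_abbrevs` as a Python list literal in a string (for example `"['LAC', 'LAR']"`), so rows read back from CSV would need that column parsed before a mapping like this one.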
-384
View File
@@ -1,384 +0,0 @@
#!/usr/bin/env python3
"""
Core shared utilities for SportsTime data scrapers.
This module provides:
- Rate limiting utilities
- Data classes (Game, Stadium)
- Multi-source fallback system
- ID generation
- Export utilities
"""
import json
import time
from collections import defaultdict
from dataclasses import dataclass, asdict, field
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional, Callable
import pandas as pd
import requests
from bs4 import BeautifulSoup
__all__ = [
# Constants
'REQUEST_DELAY',
# Rate limiting
'rate_limit',
'fetch_page',
# Data classes
'Game',
'Stadium',
'ScraperSource',
'StadiumScraperSource',
# Fallback system
'scrape_with_fallback',
'scrape_stadiums_with_fallback',
# ID generation
'assign_stable_ids',
# Export utilities
'export_to_json',
'validate_games',
]
# =============================================================================
# RATE LIMITING
# =============================================================================
REQUEST_DELAY = 3.0 # seconds between requests to same domain
last_request_time: dict[str, float] = {}


def rate_limit(domain: str) -> None:
    """Enforce rate limiting per domain."""
    now = time.time()
    if domain in last_request_time:
        elapsed = now - last_request_time[domain]
        if elapsed < REQUEST_DELAY:
            time.sleep(REQUEST_DELAY - elapsed)
    last_request_time[domain] = time.time()


def fetch_page(url: str, domain: str) -> Optional[BeautifulSoup]:
    """Fetch and parse a webpage with rate limiting. Returns None on request failure."""
    rate_limit(domain)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
        # 'br' is omitted: requests only decodes Brotli responses when the
        # optional brotli package is installed, otherwise the body is unreadable.
        'Accept-Encoding': 'gzip, deflate',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
        'Cache-Control': 'max-age=0',
    }
    try:
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()
        return BeautifulSoup(response.content, 'html.parser')
    except requests.RequestException as e:
        # Covers connection errors, timeouts, and non-2xx statuses.
        print(f"Error fetching {url}: {e}")
        return None

# =============================================================================
# DATA CLASSES
# =============================================================================

@dataclass
class Game:
    """Represents a single game."""
    id: str
    sport: str
    season: str
    date: str  # YYYY-MM-DD
    time: Optional[str]  # HH:MM (24hr, ET)
    home_team: str
    away_team: str
    home_team_abbrev: str
    away_team_abbrev: str
    venue: str
    source: str
    is_playoff: bool = False
    broadcast: Optional[str] = None


@dataclass
class Stadium:
    """Represents a stadium/arena/ballpark."""
    id: str
    name: str
    city: str
    state: str
    latitude: float
    longitude: float
    capacity: int
    sport: str
    team_abbrevs: list
    source: str
    year_opened: Optional[int] = None

# =============================================================================
# MULTI-SOURCE FALLBACK SYSTEM
# =============================================================================

@dataclass
class ScraperSource:
    """Represents a single data source for scraping games."""
    name: str
    scraper_func: Callable[[int], list]  # Takes season, returns list[Game]
    priority: int = 1  # Lower = higher priority (1 is best)
    min_games: int = 10  # Minimum games to consider successful


def scrape_with_fallback(
    sport: str,
    season: int,
    sources: list[ScraperSource],
    verbose: bool = True
) -> list[Game]:
    """
    Try multiple sources in priority order until one succeeds.

    A source succeeds when it returns at least `min_games` games without raising.

    Args:
        sport: Sport name for logging
        season: Season year
        sources: List of ScraperSource configs (sorted by priority internally)
        verbose: Whether to print status messages

    Returns:
        List of Game objects from the first successful source,
        or an empty list if every source fails.
    """
    sources = sorted(sources, key=lambda s: s.priority)
    for i, source in enumerate(sources):
        try:
            if verbose:
                attempt = f"[{i+1}/{len(sources)}]"
                print(f" {attempt} Trying {source.name}...")
            games = source.scraper_func(season)
            if games and len(games) >= source.min_games:
                if verbose:
                    print(f"{source.name} returned {len(games)} games")
                return games
            if verbose:
                count = len(games) if games else 0
                print(f"{source.name} returned only {count} games (min: {source.min_games})")
        except Exception as e:
            # Any scraper error just moves us on to the next source.
            if verbose:
                print(f"{source.name} failed: {e}")

    # All sources failed
    if verbose:
        print(f" ⚠ All {len(sources)} sources failed for {sport}")
    return []


@dataclass
class StadiumScraperSource:
    """Represents a single data source for stadium scraping."""
    name: str
    scraper_func: Callable[[], list]  # Returns list[Stadium]
    priority: int = 1  # Lower = higher priority (1 is best)
    min_venues: int = 5  # Minimum venues to consider successful


def scrape_stadiums_with_fallback(
    sport: str,
    sources: list[StadiumScraperSource],
    verbose: bool = True
) -> list[Stadium]:
    """
    Try multiple stadium sources in priority order until one succeeds.

    A source succeeds when it returns at least `min_venues` venues without raising.

    Args:
        sport: Sport name for logging
        sources: List of StadiumScraperSource configs (sorted by priority internally)
        verbose: Whether to print status messages

    Returns:
        List of Stadium objects from the first successful source,
        or an empty list if every source fails.
    """
    sources = sorted(sources, key=lambda s: s.priority)
    for i, source in enumerate(sources):
        try:
            if verbose:
                attempt = f"[{i+1}/{len(sources)}]"
                print(f" {attempt} Trying {source.name}...")
            stadiums = source.scraper_func()
            if stadiums and len(stadiums) >= source.min_venues:
                if verbose:
                    print(f"{source.name} returned {len(stadiums)} venues")
                return stadiums
            if verbose:
                count = len(stadiums) if stadiums else 0
                print(f"{source.name} returned only {count} venues (min: {source.min_venues})")
        except Exception as e:
            # Any scraper error just moves us on to the next source.
            if verbose:
                print(f"{source.name} failed: {e}")

    # All sources failed
    if verbose:
        print(f" ⚠ All {len(sources)} sources failed for {sport}")
    return []

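
The fallback pattern used by both functions above can be sketched in isolation as follows. This is a simplified stand-in, not the module's code: `Source`, `first_successful`, and the three fake scrapers are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Source:
    name: str
    fetch: Callable[[int], list]
    priority: int = 1   # lower value is tried first
    min_items: int = 10  # fewer results than this counts as a failure


def first_successful(sources, season):
    """Return (source name, items) from the first source that yields enough items."""
    for source in sorted(sources, key=lambda s: s.priority):
        try:
            items = source.fetch(season)
        except Exception:
            continue  # a crashing source is skipped, not fatal
        if items and len(items) >= source.min_items:
            return source.name, items
    return None, []


def broken(season):
    raise RuntimeError("site down")


def sparse(season):
    return ["g1"]  # below min_items, so treated as a failure


def healthy(season):
    return [f"g{i}" for i in range(12)]


sources = [
    Source("healthy_backup", healthy, priority=3),
    Source("broken_primary", broken, priority=1),
    Source("sparse_secondary", sparse, priority=2),
]
winner, games = first_successful(sources, 2026)
```

The minimum-count check matters as much as the exception handling: a source that responds but returns a near-empty schedule is rejected just like one that raises.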
# =============================================================================
# ID GENERATION
# =============================================================================

def assign_stable_ids(games: list[Game], sport: str, season: str) -> list[Game]:
    """
    Assign IDs based on matchup + date. Games are modified in place and returned.

    Format: {sport}_{season}_{away}_{home}_{MMDD} (or {MMDD}_2 for doubleheaders)

    When a game is rescheduled, its date (and therefore its ID) changes, so the
    old ID becomes orphaned and a new one is created.
    Use --delete-all before import to clean up orphaned records.
    """
    season_str = season.replace('-', '')

    # Track how many times we've seen each base ID (for doubleheaders)
    id_counts: dict[str, int] = defaultdict(int)
    for game in games:
        away = game.away_team_abbrev.lower()
        home = game.home_team_abbrev.lower()

        # Extract MMDD from date (YYYY-MM-DD); malformed dates fall back to "0000"
        date_parts = game.date.split('-')
        mmdd = f"{date_parts[1]}{date_parts[2]}" if len(date_parts) == 3 else "0000"
        base_id = f"{sport.lower()}_{season_str}_{away}_{home}_{mmdd}"
        id_counts[base_id] += 1

        # Add suffix for doubleheaders (game 2+)
        if id_counts[base_id] > 1:
            game.id = f"{base_id}_{id_counts[base_id]}"
        else:
            game.id = base_id
    return games

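
As an illustration of the ID format described above, the following standalone sketch applies the same rule to plain `(away, home, date)` tuples. The function name `stable_ids` and the sample matchups are hypothetical.

```python
from collections import defaultdict


def stable_ids(matchups, sport, season):
    """Build IDs of the form {sport}_{season}_{away}_{home}_{MMDD}[_N]."""
    season_str = season.replace("-", "")
    seen = defaultdict(int)
    ids = []
    for away, home, date in matchups:
        parts = date.split("-")
        mmdd = f"{parts[1]}{parts[2]}" if len(parts) == 3 else "0000"
        base = f"{sport.lower()}_{season_str}_{away.lower()}_{home.lower()}_{mmdd}"
        seen[base] += 1
        # The first game keeps the bare ID; later games that day get _2, _3, ...
        ids.append(base if seen[base] == 1 else f"{base}_{seen[base]}")
    return ids


mlb_ids = stable_ids(
    [("NYY", "BOS", "2026-07-04"), ("NYY", "BOS", "2026-07-04")],
    sport="MLB",
    season="2026",
)
nba_ids = stable_ids([("LAL", "BOS", "2026-01-15")], sport="NBA", season="2025-26")
```

One consequence of this scheme is that doubleheader suffixes depend on input order, so the games would need to arrive in a consistent order (for example, sorted by start time) for the `_2` suffix to land on the same game across runs.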
# =============================================================================
# EXPORT UTILITIES
# =============================================================================

def export_to_json(games: list[Game], stadiums: list[Stadium], output_dir: Path) -> None:
    """
    Export scraped data to organized JSON and CSV files.

    Structure:
        data/
            games/
                mlb_2025.json
                nba_2025.json
                ...
            canonical/
                stadiums.json
            games.json     (combined, for backward compatibility)
            stadiums.json  (legacy, for backward compatibility)
            games.csv      (only when games is non-empty)
            stadiums.csv   (only when stadiums is non-empty)
    """
    output_dir.mkdir(parents=True, exist_ok=True)

    # Create subdirectories
    games_dir = output_dir / 'games'
    canonical_dir = output_dir / 'canonical'
    games_dir.mkdir(exist_ok=True)
    canonical_dir.mkdir(exist_ok=True)

    # Group games by sport and season
    games_by_sport_season: dict[str, list[Game]] = defaultdict(list)
    for game in games:
        key = f"{game.sport.lower()}_{game.season}"
        games_by_sport_season[key].append(game)

    # Export games by sport/season
    total_exported = 0
    for key, sport_games in games_by_sport_season.items():
        games_data = [asdict(g) for g in sport_games]
        filepath = games_dir / f"{key}.json"
        with open(filepath, 'w') as f:
            json.dump(games_data, f, indent=2)
        print(f" Exported {len(sport_games):,} games to games/{key}.json")
        total_exported += len(sport_games)

    # Export combined games.json for backward compatibility
    all_games_data = [asdict(g) for g in games]
    with open(output_dir / 'games.json', 'w') as f:
        json.dump(all_games_data, f, indent=2)

    # Export stadiums to canonical/
    stadiums_data = [asdict(s) for s in stadiums]
    with open(canonical_dir / 'stadiums.json', 'w') as f:
        json.dump(stadiums_data, f, indent=2)

    # Also export to root for backward compatibility
    with open(output_dir / 'stadiums.json', 'w') as f:
        json.dump(stadiums_data, f, indent=2)

    # Export as CSV for easy viewing
    if games:
        df_games = pd.DataFrame(all_games_data)
        df_games.to_csv(output_dir / 'games.csv', index=False)
    if stadiums:
        df_stadiums = pd.DataFrame(stadiums_data)
        df_stadiums.to_csv(output_dir / 'stadiums.csv', index=False)

    print(f"\nExported {total_exported:,} games across {len(games_by_sport_season)} sport/season files")
    print(f"Exported {len(stadiums):,} stadiums to canonical/stadiums.json")


def validate_games(games_by_source: dict[str, list[Game]]) -> dict:
    """
    Cross-validate games from multiple sources.

    The first source in the dict is treated as primary. Only games present in
    the primary source but absent from a secondary source are reported
    ('missing_in_source'); the date, time, and venue mismatch lists are
    returned empty because those checks are not implemented here.

    Returns a dict of discrepancy lists keyed by discrepancy type.
    """
    discrepancies = {
        'missing_in_source': [],
        'date_mismatch': [],
        'time_mismatch': [],
        'venue_mismatch': [],
    }
    sources = list(games_by_source.keys())
    if len(sources) < 2:
        return discrepancies

    primary = sources[0]
    primary_games = {g.id: g for g in games_by_source[primary]}
    for source in sources[1:]:
        secondary_games = {g.id: g for g in games_by_source[source]}
        for game_id, game in primary_games.items():
            if game_id not in secondary_games:
                discrepancies['missing_in_source'].append({
                    'game_id': game_id,
                    'present_in': primary,
                    'missing_in': source
                })
    return discrepancies
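
`validate_games` only checks one direction (primary to secondary). A possible symmetric variant, sketched here over plain sets of game IDs rather than `Game` objects, reports gaps in both directions. The name `missing_by_source` and the sample source names are hypothetical.

```python
def missing_by_source(ids_by_source: dict[str, set[str]]) -> list[dict]:
    """Report every game ID present in one source but absent from another."""
    missing = []
    names = list(ids_by_source)
    for present_in in names:
        for missing_in in names:
            if present_in == missing_in:
                continue
            # Set difference: IDs the first source has that the second lacks.
            for game_id in sorted(ids_by_source[present_in] - ids_by_source[missing_in]):
                missing.append({
                    "game_id": game_id,
                    "present_in": present_in,
                    "missing_in": missing_in,
                })
    return missing


report = missing_by_source({
    "source_a": {"mlb_2026_nyy_bos_0704", "mlb_2026_lad_sfg_0705"},
    "source_b": {"mlb_2026_nyy_bos_0704", "mlb_2026_chc_stl_0706"},
})
```

Since the stable IDs embed the game date, two sources that disagree on a date would surface here as a pair of "missing" entries rather than as a date mismatch.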
File diff suppressed because it is too large Load Diff
@@ -1,13 +0,0 @@
{
"is_valid": true,
"error_count": 0,
"warning_count": 0,
"summary": {
"stadiums": 148,
"teams": 92,
"games": 0,
"aliases": 194,
"by_category": {}
},
"errors": []
}
@@ -1,42 +0,0 @@
[
{
"game_key": "2026-01-17_TBD_TBD",
"issue": "Unknown home team",
"details": "Could not resolve home team 'TBD' for sport NFL"
},
{
"game_key": "2026-01-17_TBD_TBD",
"issue": "Unknown home team",
"details": "Could not resolve home team 'TBD' for sport NFL"
},
{
"game_key": "2026-01-18_TBD_TBD",
"issue": "Unknown home team",
"details": "Could not resolve home team 'TBD' for sport NFL"
},
{
"game_key": "2026-01-18_TBD_TBD",
"issue": "Unknown home team",
"details": "Could not resolve home team 'TBD' for sport NFL"
},
{
"game_key": "2026-01-25_TBD_TBD",
"issue": "Unknown home team",
"details": "Could not resolve home team 'TBD' for sport NFL"
},
{
"game_key": "2026-01-25_TBD_TBD",
"issue": "Unknown home team",
"details": "Could not resolve home team 'TBD' for sport NFL"
},
{
"game_key": "2026-02-04_NFC_AFC",
"issue": "Unknown home team",
"details": "Could not resolve home team 'AFC' for sport NFL"
},
{
"game_key": "2026-02-08_TBD_TBD",
"issue": "Unknown home team",
"details": "Could not resolve home team 'TBD' for sport NFL"
}
]
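The issue list above contains exact duplicates and placeholder matchups (`TBD`) for playoff games whose teams were not yet known. A small sketch of collapsing and classifying such entries follows; the helper names are hypothetical, and treating `TBD` as a placeholder marker is an assumption drawn from the data shown.

```python
from collections import Counter


def summarize_issues(issues):
    """Collapse duplicate (game_key, issue) pairs into one row with a count."""
    counts = Counter((i["game_key"], i["issue"]) for i in issues)
    return [
        {"game_key": key, "issue": issue, "count": n}
        for (key, issue), n in counts.items()
    ]


def is_placeholder(issue):
    """True when either team slot in the game key is the literal 'TBD'."""
    return "TBD" in issue["game_key"].split("_")


issues = [
    {"game_key": "2026-01-17_TBD_TBD", "issue": "Unknown home team"},
    {"game_key": "2026-01-17_TBD_TBD", "issue": "Unknown home team"},
    {"game_key": "2026-02-04_NFC_AFC", "issue": "Unknown home team"},
]
summary = summarize_issues(issues)
needs_review = [i for i in issues if not is_placeholder(i)]
```

Separating placeholders this way would leave only entries such as the `NFC`/`AFC` game for manual review.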
File diff suppressed because it is too large Load Diff
-23
View File
@@ -1,23 +0,0 @@
{
"generated_at": "2026-01-10T11:03:46.586763",
"season": 2026,
"sport": "all",
"summary": {
"games_scraped": 5768,
"stadiums_scraped": 178,
"games_by_sport": {
"NBA": 1230,
"MLB": 2430,
"NHL": 1312,
"NFL": 286,
"WNBA": 0,
"MLS": 510,
"NWSL": 0
},
"high_severity": 0,
"medium_severity": 0,
"low_severity": 0
},
"game_validations": [],
"stadium_issues": []
}
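A report like the one above can be sanity-checked mechanically. The sketch below verifies that the per-sport counts add up to `games_scraped` and flags sports with no games; `check_report` is a hypothetical helper, and whether zero games is actually an error (for example, a schedule not yet published) is left to the reader.

```python
def check_report(report: dict) -> list[str]:
    """Return human-readable problems found in a scrape report summary."""
    summary = report["summary"]
    by_sport = summary["games_by_sport"]
    problems = []

    # The per-sport breakdown should account for every scraped game.
    total = sum(by_sport.values())
    if total != summary["games_scraped"]:
        problems.append(
            f"per-sport total {total} != games_scraped {summary['games_scraped']}"
        )

    # A sport with zero games may mean a failed scraper or an unpublished schedule.
    for sport, count in sorted(by_sport.items()):
        if count == 0:
            problems.append(f"{sport}: no games scraped")
    return problems


report = {
    "summary": {
        "games_scraped": 5768,
        "games_by_sport": {
            "NBA": 1230, "MLB": 2430, "NHL": 1312, "NFL": 286,
            "WNBA": 0, "MLS": 510, "NWSL": 0,
        },
    }
}
problems = check_report(report)
```

For the report shown, the totals agree, and only the two sports with zero games are flagged.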
-179
View File
@@ -1,179 +0,0 @@
id,name,city,state,latitude,longitude,capacity,sport,team_abbrevs,source,year_opened
mlb_chase_field,Chase Field,Phoenix,AZ,33.4453,-112.0667,48519,MLB,['ARI'],mlb_hardcoded,1998
mlb_truist_park,Truist Park,Atlanta,GA,33.8907,-84.4677,41084,MLB,['ATL'],mlb_hardcoded,2017
mlb_oriole_park_at_camden_yards,Oriole Park at Camden Yards,Baltimore,MD,39.2839,-76.6216,44970,MLB,['BAL'],mlb_hardcoded,1992
mlb_fenway_park,Fenway Park,Boston,MA,42.3467,-71.0972,37755,MLB,['BOS'],mlb_hardcoded,1912
mlb_wrigley_field,Wrigley Field,Chicago,IL,41.9484,-87.6553,41649,MLB,['CHC'],mlb_hardcoded,1914
mlb_guaranteed_rate_field,Guaranteed Rate Field,Chicago,IL,41.8299,-87.6338,40615,MLB,['CHW'],mlb_hardcoded,1991
mlb_great_american_ball_park,Great American Ball Park,Cincinnati,OH,39.0979,-84.5082,42319,MLB,['CIN'],mlb_hardcoded,2003
mlb_progressive_field,Progressive Field,Cleveland,OH,41.4958,-81.6853,34830,MLB,['CLE'],mlb_hardcoded,1994
mlb_coors_field,Coors Field,Denver,CO,39.7559,-104.9942,50144,MLB,['COL'],mlb_hardcoded,1995
mlb_comerica_park,Comerica Park,Detroit,MI,42.339,-83.0485,41083,MLB,['DET'],mlb_hardcoded,2000
mlb_minute_maid_park,Minute Maid Park,Houston,TX,29.7573,-95.3555,41168,MLB,['HOU'],mlb_hardcoded,2000
mlb_kauffman_stadium,Kauffman Stadium,Kansas City,MO,39.0517,-94.4803,37903,MLB,['KCR'],mlb_hardcoded,1973
mlb_angel_stadium,Angel Stadium,Anaheim,CA,33.8003,-117.8827,45517,MLB,['LAA'],mlb_hardcoded,1966
mlb_dodger_stadium,Dodger Stadium,Los Angeles,CA,34.0739,-118.24,56000,MLB,['LAD'],mlb_hardcoded,1962
mlb_loandepot_park,LoanDepot Park,Miami,FL,25.7781,-80.2196,36742,MLB,['MIA'],mlb_hardcoded,2012
mlb_american_family_field,American Family Field,Milwaukee,WI,43.028,-87.9712,41900,MLB,['MIL'],mlb_hardcoded,2001
mlb_target_field,Target Field,Minneapolis,MN,44.9818,-93.2775,38544,MLB,['MIN'],mlb_hardcoded,2010
mlb_citi_field,Citi Field,Queens,NY,40.7571,-73.8458,41922,MLB,['NYM'],mlb_hardcoded,2009
mlb_yankee_stadium,Yankee Stadium,Bronx,NY,40.8296,-73.9262,46537,MLB,['NYY'],mlb_hardcoded,2009
mlb_sutter_health_park,Sutter Health Park,Sacramento,CA,38.5803,-121.5108,14014,MLB,['OAK'],mlb_hardcoded,2000
mlb_citizens_bank_park,Citizens Bank Park,Philadelphia,PA,39.9061,-75.1665,42901,MLB,['PHI'],mlb_hardcoded,2004
mlb_pnc_park,PNC Park,Pittsburgh,PA,40.4469,-80.0057,38362,MLB,['PIT'],mlb_hardcoded,2001
mlb_petco_park,Petco Park,San Diego,CA,32.7073,-117.1566,40209,MLB,['SDP'],mlb_hardcoded,2004
mlb_oracle_park,Oracle Park,San Francisco,CA,37.7786,-122.3893,41915,MLB,['SFG'],mlb_hardcoded,2000
mlb_t-mobile_park,T-Mobile Park,Seattle,WA,47.5914,-122.3325,47929,MLB,['SEA'],mlb_hardcoded,1999
mlb_busch_stadium,Busch Stadium,St. Louis,MO,38.6226,-90.1928,45538,MLB,['STL'],mlb_hardcoded,2006
mlb_tropicana_field,Tropicana Field,St. Petersburg,FL,27.7682,-82.6534,25000,MLB,['TBR'],mlb_hardcoded,1990
mlb_globe_life_field,Globe Life Field,Arlington,TX,32.7473,-97.0844,40300,MLB,['TEX'],mlb_hardcoded,2020
mlb_rogers_centre,Rogers Centre,Toronto,ON,43.6414,-79.3894,49282,MLB,['TOR'],mlb_hardcoded,1989
mlb_nationals_park,Nationals Park,Washington,DC,38.8729,-77.0074,41339,MLB,['WSN'],mlb_hardcoded,2008
nba_state_farm_arena,State Farm Arena,Atlanta,GA,33.7573,-84.3963,18118,NBA,['ATL'],nba_hardcoded,1999
nba_td_garden,TD Garden,Boston,MA,42.3662,-71.0621,19156,NBA,['BOS'],nba_hardcoded,1995
nba_barclays_center,Barclays Center,Brooklyn,NY,40.6826,-73.9754,17732,NBA,['BRK'],nba_hardcoded,2012
nba_spectrum_center,Spectrum Center,Charlotte,NC,35.2251,-80.8392,19077,NBA,['CHO'],nba_hardcoded,2005
nba_united_center,United Center,Chicago,IL,41.8807,-87.6742,20917,NBA,['CHI'],nba_hardcoded,1994
nba_rocket_mortgage_fieldhouse,Rocket Mortgage FieldHouse,Cleveland,OH,41.4965,-81.6882,19432,NBA,['CLE'],nba_hardcoded,1994
nba_american_airlines_center,American Airlines Center,Dallas,TX,32.7905,-96.8103,19200,NBA,['DAL'],nba_hardcoded,2001
nba_ball_arena,Ball Arena,Denver,CO,39.7487,-105.0077,19520,NBA,['DEN'],nba_hardcoded,1999
nba_little_caesars_arena,Little Caesars Arena,Detroit,MI,42.3411,-83.0553,20332,NBA,['DET'],nba_hardcoded,2017
nba_chase_center,Chase Center,San Francisco,CA,37.768,-122.3879,18064,NBA,['GSW'],nba_hardcoded,2019
nba_toyota_center,Toyota Center,Houston,TX,29.7508,-95.3621,18055,NBA,['HOU'],nba_hardcoded,2003
nba_gainbridge_fieldhouse,Gainbridge Fieldhouse,Indianapolis,IN,39.764,-86.1555,17923,NBA,['IND'],nba_hardcoded,1999
nba_intuit_dome,Intuit Dome,Inglewood,CA,33.9425,-118.3419,18000,NBA,['LAC'],nba_hardcoded,2024
nba_crypto.com_arena,Crypto.com Arena,Los Angeles,CA,34.043,-118.2673,18997,NBA,['LAL'],nba_hardcoded,1999
nba_fedexforum,FedExForum,Memphis,TN,35.1382,-90.0506,17794,NBA,['MEM'],nba_hardcoded,2004
nba_kaseya_center,Kaseya Center,Miami,FL,25.7814,-80.187,19600,NBA,['MIA'],nba_hardcoded,1999
nba_fiserv_forum,Fiserv Forum,Milwaukee,WI,43.0451,-87.9174,17341,NBA,['MIL'],nba_hardcoded,2018
nba_target_center,Target Center,Minneapolis,MN,44.9795,-93.2761,18978,NBA,['MIN'],nba_hardcoded,1990
nba_smoothie_king_center,Smoothie King Center,New Orleans,LA,29.949,-90.0821,16867,NBA,['NOP'],nba_hardcoded,1999
nba_madison_square_garden,Madison Square Garden,New York,NY,40.7505,-73.9934,19812,NBA,['NYK'],nba_hardcoded,1968
nba_paycom_center,Paycom Center,Oklahoma City,OK,35.4634,-97.5151,18203,NBA,['OKC'],nba_hardcoded,2002
nba_kia_center,Kia Center,Orlando,FL,28.5392,-81.3839,18846,NBA,['ORL'],nba_hardcoded,1989
nba_wells_fargo_center,Wells Fargo Center,Philadelphia,PA,39.9012,-75.172,20478,NBA,['PHI'],nba_hardcoded,1996
nba_footprint_center,Footprint Center,Phoenix,AZ,33.4457,-112.0712,17071,NBA,['PHO'],nba_hardcoded,1992
nba_moda_center,Moda Center,Portland,OR,45.5316,-122.6668,19393,NBA,['POR'],nba_hardcoded,1995
nba_golden_1_center,Golden 1 Center,Sacramento,CA,38.5802,-121.4997,17608,NBA,['SAC'],nba_hardcoded,2016
nba_frost_bank_center,Frost Bank Center,San Antonio,TX,29.427,-98.4375,18418,NBA,['SAS'],nba_hardcoded,2002
nba_scotiabank_arena,Scotiabank Arena,Toronto,ON,43.6435,-79.3791,19800,NBA,['TOR'],nba_hardcoded,1999
nba_delta_center,Delta Center,Salt Lake City,UT,40.7683,-111.9011,18306,NBA,['UTA'],nba_hardcoded,1991
nba_capital_one_arena,Capital One Arena,Washington,DC,38.8982,-77.0209,20356,NBA,['WAS'],nba_hardcoded,1997
nhl_td_garden,TD Garden,Boston,MA,42.3662,-71.0621,17850,NHL,['BOS'],nhl_hardcoded,1995
nhl_keybank_center,KeyBank Center,Buffalo,NY,42.875,-78.8764,19070,NHL,['BUF'],nhl_hardcoded,1996
nhl_little_caesars_arena,Little Caesars Arena,Detroit,MI,42.3411,-83.0553,19515,NHL,['DET'],nhl_hardcoded,2017
nhl_amerant_bank_arena,Amerant Bank Arena,Sunrise,FL,26.1584,-80.3256,19250,NHL,['FLA'],nhl_hardcoded,1998
nhl_bell_centre,Bell Centre,Montreal,QC,45.4961,-73.5693,21302,NHL,['MTL'],nhl_hardcoded,1996
nhl_canadian_tire_centre,Canadian Tire Centre,Ottawa,ON,45.2969,-75.9272,18652,NHL,['OTT'],nhl_hardcoded,1996
nhl_amalie_arena,Amalie Arena,Tampa,FL,27.9426,-82.4519,19092,NHL,['TBL'],nhl_hardcoded,1996
nhl_scotiabank_arena,Scotiabank Arena,Toronto,ON,43.6435,-79.3791,18800,NHL,['TOR'],nhl_hardcoded,1999
nhl_pnc_arena,PNC Arena,Raleigh,NC,35.8033,-78.722,18680,NHL,['CAR'],nhl_hardcoded,1999
nhl_nationwide_arena,Nationwide Arena,Columbus,OH,39.9692,-83.0061,18500,NHL,['CBJ'],nhl_hardcoded,2000
nhl_prudential_center,Prudential Center,Newark,NJ,40.7334,-74.1713,16514,NHL,['NJD'],nhl_hardcoded,2007
nhl_ubs_arena,UBS Arena,Elmont,NY,40.717,-73.726,17255,NHL,['NYI'],nhl_hardcoded,2021
nhl_madison_square_garden,Madison Square Garden,New York,NY,40.7505,-73.9934,18006,NHL,['NYR'],nhl_hardcoded,1968
nhl_wells_fargo_center,Wells Fargo Center,Philadelphia,PA,39.9012,-75.172,19500,NHL,['PHI'],nhl_hardcoded,1996
nhl_ppg_paints_arena,PPG Paints Arena,Pittsburgh,PA,40.4395,-79.9892,18387,NHL,['PIT'],nhl_hardcoded,2010
nhl_capital_one_arena,Capital One Arena,Washington,DC,38.8982,-77.0209,18573,NHL,['WSH'],nhl_hardcoded,1997
nhl_united_center,United Center,Chicago,IL,41.8807,-87.6742,19717,NHL,['CHI'],nhl_hardcoded,1994
nhl_ball_arena,Ball Arena,Denver,CO,39.7487,-105.0077,18007,NHL,['COL'],nhl_hardcoded,1999
nhl_american_airlines_center,American Airlines Center,Dallas,TX,32.7905,-96.8103,18532,NHL,['DAL'],nhl_hardcoded,2001
nhl_xcel_energy_center,Xcel Energy Center,Saint Paul,MN,44.9448,-93.101,17954,NHL,['MIN'],nhl_hardcoded,2000
nhl_bridgestone_arena,Bridgestone Arena,Nashville,TN,36.1592,-86.7785,17159,NHL,['NSH'],nhl_hardcoded,1996
nhl_enterprise_center,Enterprise Center,St. Louis,MO,38.6268,-90.2025,18096,NHL,['STL'],nhl_hardcoded,1994
nhl_canada_life_centre,Canada Life Centre,Winnipeg,MB,49.8928,-97.1437,15321,NHL,['WPG'],nhl_hardcoded,2004
nhl_honda_center,Honda Center,Anaheim,CA,33.8078,-117.8765,17174,NHL,['ANA'],nhl_hardcoded,1993
nhl_delta_center,Delta Center,Salt Lake City,UT,40.7683,-111.9011,16210,NHL,['ARI'],nhl_hardcoded,1991
nhl_sap_center,SAP Center,San Jose,CA,37.3327,-121.9012,17562,NHL,['SJS'],nhl_hardcoded,1993
nhl_rogers_arena,Rogers Arena,Vancouver,BC,49.2778,-123.1089,18910,NHL,['VAN'],nhl_hardcoded,1995
nhl_t-mobile_arena,T-Mobile Arena,Las Vegas,NV,36.1028,-115.1784,17500,NHL,['VGK'],nhl_hardcoded,2016
nhl_climate_pledge_arena,Climate Pledge Arena,Seattle,WA,47.622,-122.354,17100,NHL,['SEA'],nhl_hardcoded,2021
nhl_crypto.com_arena,Crypto.com Arena,Los Angeles,CA,34.043,-118.2673,18230,NHL,['LAK'],nhl_hardcoded,1999
nhl_rogers_place,Rogers Place,Edmonton,AB,53.5469,-113.4979,18347,NHL,['EDM'],nhl_hardcoded,2016
nhl_scotiabank_saddledome,Scotiabank Saddledome,Calgary,AB,51.0374,-114.0519,19289,NHL,['CGY'],nhl_hardcoded,1983
nfl_state_farm_stadium,State Farm Stadium,Glendale,AZ,33.5276,-112.2626,63400,NFL,['ARI'],nfl_hardcoded,2006
nfl_mercedes-benz_stadium,Mercedes-Benz Stadium,Atlanta,GA,33.7553,-84.4006,71000,NFL,['ATL'],nfl_hardcoded,2017
nfl_m&t_bank_stadium,M&T Bank Stadium,Baltimore,MD,39.278,-76.6227,71008,NFL,['BAL'],nfl_hardcoded,1998
nfl_highmark_stadium,Highmark Stadium,Orchard Park,NY,42.7738,-78.787,71608,NFL,['BUF'],nfl_hardcoded,1973
nfl_bank_of_america_stadium,Bank of America Stadium,Charlotte,NC,35.2258,-80.8528,75523,NFL,['CAR'],nfl_hardcoded,1996
nfl_soldier_field,Soldier Field,Chicago,IL,41.8623,-87.6167,61500,NFL,['CHI'],nfl_hardcoded,1924
nfl_paycor_stadium,Paycor Stadium,Cincinnati,OH,39.0954,-84.516,65515,NFL,['CIN'],nfl_hardcoded,2000
nfl_cleveland_browns_stadium,Cleveland Browns Stadium,Cleveland,OH,41.5061,-81.6995,67895,NFL,['CLE'],nfl_hardcoded,1999
nfl_at&t_stadium,AT&T Stadium,Arlington,TX,32.748,-97.0928,80000,NFL,['DAL'],nfl_hardcoded,2009
nfl_empower_field_at_mile_high,Empower Field at Mile High,Denver,CO,39.7439,-105.0201,76125,NFL,['DEN'],nfl_hardcoded,2001
nfl_ford_field,Ford Field,Detroit,MI,42.34,-83.0456,65000,NFL,['DET'],nfl_hardcoded,2002
nfl_lambeau_field,Lambeau Field,Green Bay,WI,44.5013,-88.0622,81435,NFL,['GB'],nfl_hardcoded,1957
nfl_nrg_stadium,NRG Stadium,Houston,TX,29.6847,-95.4107,72220,NFL,['HOU'],nfl_hardcoded,2002
nfl_lucas_oil_stadium,Lucas Oil Stadium,Indianapolis,IN,39.7601,-86.1639,67000,NFL,['IND'],nfl_hardcoded,2008
nfl_everbank_stadium,EverBank Stadium,Jacksonville,FL,30.3239,-81.6373,67814,NFL,['JAX'],nfl_hardcoded,1995
nfl_geha_field_at_arrowhead_stadiu,GEHA Field at Arrowhead Stadium,Kansas City,MO,39.0489,-94.4839,76416,NFL,['KC'],nfl_hardcoded,1972
nfl_allegiant_stadium,Allegiant Stadium,Las Vegas,NV,36.0909,-115.1833,65000,NFL,['LV'],nfl_hardcoded,2020
nfl_sofi_stadium,SoFi Stadium,Inglewood,CA,33.9535,-118.3392,70240,NFL,"['LAC', 'LAR']",nfl_hardcoded,2020
nfl_hard_rock_stadium,Hard Rock Stadium,Miami Gardens,FL,25.958,-80.2389,64767,NFL,['MIA'],nfl_hardcoded,1987
nfl_u.s._bank_stadium,U.S. Bank Stadium,Minneapolis,MN,44.9736,-93.2575,66655,NFL,['MIN'],nfl_hardcoded,2016
nfl_gillette_stadium,Gillette Stadium,Foxborough,MA,42.0909,-71.2643,65878,NFL,['NE'],nfl_hardcoded,2002
nfl_caesars_superdome,Caesars Superdome,New Orleans,LA,29.9511,-90.0812,73208,NFL,['NO'],nfl_hardcoded,1975
nfl_metlife_stadium,MetLife Stadium,East Rutherford,NJ,40.8135,-74.0745,82500,NFL,"['NYG', 'NYJ']",nfl_hardcoded,2010
nfl_lincoln_financial_field,Lincoln Financial Field,Philadelphia,PA,39.9008,-75.1675,69596,NFL,['PHI'],nfl_hardcoded,2003
nfl_acrisure_stadium,Acrisure Stadium,Pittsburgh,PA,40.4468,-80.0158,68400,NFL,['PIT'],nfl_hardcoded,2001
nfl_levi's_stadium,Levi's Stadium,Santa Clara,CA,37.4032,-121.9698,68500,NFL,['SF'],nfl_hardcoded,2014
nfl_lumen_field,Lumen Field,Seattle,WA,47.5952,-122.3316,68740,NFL,['SEA'],nfl_hardcoded,2002
nfl_raymond_james_stadium,Raymond James Stadium,Tampa,FL,27.9759,-82.5033,65618,NFL,['TB'],nfl_hardcoded,1998
nfl_nissan_stadium,Nissan Stadium,Nashville,TN,36.1665,-86.7713,69143,NFL,['TEN'],nfl_hardcoded,1999
nfl_northwest_stadium,Northwest Stadium,Landover,MD,38.9076,-76.8645,67617,NFL,['WAS'],nfl_hardcoded,1997
mls_mercedes-benz_stadium,Mercedes-Benz Stadium,Atlanta,GA,33.7555,-84.4,42500,MLS,['ATL'],mls_hardcoded,2017
mls_q2_stadium,Q2 Stadium,Austin,TX,30.3877,-97.7195,20738,MLS,['AUS'],mls_hardcoded,2021
mls_bank_of_america_stadium,Bank of America Stadium,Charlotte,NC,35.2258,-80.8528,38000,MLS,['CLT'],mls_hardcoded,1996
mls_soldier_field,Soldier Field,Chicago,IL,41.8623,-87.6167,24995,MLS,['CHI'],mls_hardcoded,1924
mls_tql_stadium,TQL Stadium,Cincinnati,OH,39.1114,-84.5222,26000,MLS,['CIN'],mls_hardcoded,2021
mls_dicks_sporting_goods_park,Dick's Sporting Goods Park,Commerce City,CO,39.8056,-104.8919,18061,MLS,['COL'],mls_hardcoded,2007
mls_lowercom_field,Lower.com Field,Columbus,OH,39.9685,-83.0171,20371,MLS,['CLB'],mls_hardcoded,2021
mls_toyota_stadium,Toyota Stadium,Frisco,TX,33.1544,-96.8353,20500,MLS,['DAL'],mls_hardcoded,2005
mls_audi_field,Audi Field,Washington,DC,38.8684,-77.0129,20000,MLS,['DC'],mls_hardcoded,2018
mls_shell_energy_stadium,Shell Energy Stadium,Houston,TX,29.7522,-95.3524,22039,MLS,['HOU'],mls_hardcoded,2012
mls_dignity_health_sports_park,Dignity Health Sports Park,Carson,CA,33.864,-118.261,27000,MLS,['LAG'],mls_hardcoded,2003
mls_bmo_stadium,BMO Stadium,Los Angeles,CA,34.0128,-118.2841,22000,MLS,['LAFC'],mls_hardcoded,2018
mls_chase_stadium,Chase Stadium,Fort Lauderdale,FL,26.1933,-80.1607,21550,MLS,['MIA'],mls_hardcoded,2020
mls_allianz_field,Allianz Field,Saint Paul,MN,44.9531,-93.1647,19400,MLS,['MIN'],mls_hardcoded,2019
mls_stade_saputo,Stade Saputo,Montreal,QC,45.5631,-73.5525,19619,MLS,['MTL'],mls_hardcoded,2008
mls_geodis_park,Geodis Park,Nashville,TN,36.1301,-86.766,30000,MLS,['NSH'],mls_hardcoded,2022
mls_gillette_stadium,Gillette Stadium,Foxborough,MA,42.0909,-71.2643,22385,MLS,['NE'],mls_hardcoded,2002
mls_yankee_stadium,Yankee Stadium,Bronx,NY,40.8292,-73.9264,28000,MLS,['NYCFC'],mls_hardcoded,2009
mls_red_bull_arena,Red Bull Arena,Harrison,NJ,40.7367,-74.1503,25000,MLS,['NYRB'],mls_hardcoded,2010
mls_interandco_stadium,Inter&Co Stadium,Orlando,FL,28.5411,-81.3893,25500,MLS,['ORL'],mls_hardcoded,2017
mls_subaru_park,Subaru Park,Chester,PA,39.8322,-75.3789,18500,MLS,['PHI'],mls_hardcoded,2010
mls_providence_park,Providence Park,Portland,OR,45.5214,-122.6917,25218,MLS,['POR'],mls_hardcoded,1926
mls_america_first_field,America First Field,Sandy,UT,40.5829,-111.8934,20213,MLS,['RSL'],mls_hardcoded,2008
mls_paypal_park,PayPal Park,San Jose,CA,37.3514,-121.925,18000,MLS,['SJ'],mls_hardcoded,2015
mls_lumen_field,Lumen Field,Seattle,WA,47.5952,-122.3316,37722,MLS,['SEA'],mls_hardcoded,2002
mls_childrens_mercy_park,Children's Mercy Park,Kansas City,KS,39.1217,-94.8232,18467,MLS,['SKC'],mls_hardcoded,2011
mls_citypark,CityPark,St. Louis,MO,38.6314,-90.2103,22500,MLS,['STL'],mls_hardcoded,2023
mls_bmo_field,BMO Field,Toronto,ON,43.6332,-79.4186,30000,MLS,['TOR'],mls_hardcoded,2007
mls_bc_place,BC Place,Vancouver,BC,49.2767,-123.1119,22120,MLS,['VAN'],mls_hardcoded,1983
mls_snapdragon_stadium,Snapdragon Stadium,San Diego,CA,32.7844,-117.1228,35000,MLS,['SD'],mls_hardcoded,2022
wnba_gateway_center_arena,Gateway Center Arena,College Park,GA,33.6343,-84.4489,3500,WNBA,['ATL'],wnba_hardcoded,2018
wnba_wintrust_arena,Wintrust Arena,Chicago,IL,41.8514,-87.6226,10387,WNBA,['CHI'],wnba_hardcoded,2017
wnba_mohegan_sun_arena,Mohegan Sun Arena,Uncasville,CT,41.4933,-72.0904,10000,WNBA,['CON'],wnba_hardcoded,2001
wnba_college_park_center,College Park Center,Arlington,TX,32.7319,-97.1103,7000,WNBA,['DAL'],wnba_hardcoded,2012
wnba_michelob_ultra_arena,Michelob Ultra Arena,Las Vegas,NV,36.0909,-115.175,12000,WNBA,['LVA'],wnba_hardcoded,2016
wnba_entertainment_and_sports_arena,Entertainment & Sports Arena,Washington,DC,38.872,-76.987,4200,WNBA,['WAS'],wnba_hardcoded,2018
wnba_chase_center,Chase Center,San Francisco,CA,37.768,-122.3879,18064,WNBA,['GSV'],wnba_hardcoded,2019
wnba_gainbridge_fieldhouse,Gainbridge Fieldhouse,Indianapolis,IN,39.764,-86.1555,17923,WNBA,['IND'],wnba_hardcoded,1999
wnba_cryptocom_arena,Crypto.com Arena,Los Angeles,CA,34.043,-118.2673,19079,WNBA,['LA'],wnba_hardcoded,1999
wnba_target_center,Target Center,Minneapolis,MN,44.9795,-93.2761,18978,WNBA,['MIN'],wnba_hardcoded,1990
wnba_barclays_center,Barclays Center,Brooklyn,NY,40.6826,-73.9754,17732,WNBA,['NY'],wnba_hardcoded,2012
wnba_footprint_center,Footprint Center,Phoenix,AZ,33.4457,-112.0712,17071,WNBA,['PHO'],wnba_hardcoded,1992
wnba_climate_pledge_arena,Climate Pledge Arena,Seattle,WA,47.622,-122.354,17100,WNBA,['SEA'],wnba_hardcoded,1962
nwsl_bmo_stadium,BMO Stadium,Los Angeles,CA,34.0128,-118.2841,22000,NWSL,['LA'],nwsl_hardcoded,2018
nwsl_paypal_park,PayPal Park,San Jose,CA,37.3514,-121.925,18000,NWSL,['SJ'],nwsl_hardcoded,2015
nwsl_shell_energy_stadium,Shell Energy Stadium,Houston,TX,29.7522,-95.3524,22039,NWSL,['HOU'],nwsl_hardcoded,2012
nwsl_red_bull_arena,Red Bull Arena,Harrison,NJ,40.7367,-74.1503,25000,NWSL,['NJ'],nwsl_hardcoded,2010
nwsl_interandco_stadium,Inter&Co Stadium,Orlando,FL,28.5411,-81.3893,25500,NWSL,['ORL'],nwsl_hardcoded,2017
nwsl_providence_park,Providence Park,Portland,OR,45.5214,-122.6917,25218,NWSL,['POR'],nwsl_hardcoded,1926
nwsl_lumen_field,Lumen Field,Seattle,WA,47.5952,-122.3316,37722,NWSL,['SEA'],nwsl_hardcoded,2002
nwsl_snapdragon_stadium,Snapdragon Stadium,San Diego,CA,32.7844,-117.1228,35000,NWSL,['SD'],nwsl_hardcoded,2022
nwsl_america_first_field,America First Field,Sandy,UT,40.5829,-111.8934,20213,NWSL,['UTA'],nwsl_hardcoded,2008
nwsl_audi_field,Audi Field,Washington,DC,38.8684,-77.0129,20000,NWSL,['WAS'],nwsl_hardcoded,2018
nwsl_seatgeek_stadium,SeatGeek Stadium,Bridgeview,IL,41.7653,-87.8049,20000,NWSL,['CHI'],nwsl_hardcoded,2006
nwsl_cpkc_stadium,CPKC Stadium,Kansas City,MO,39.0975,-94.5556,11500,NWSL,['KC'],nwsl_hardcoded,2024
nwsl_wakemed_soccer_park,WakeMed Soccer Park,Cary,NC,35.8018,-78.7442,10000,NWSL,['NC'],nwsl_hardcoded,2002
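
A note on reading these rows: the `team_abbrevs` column holds a Python list literal such as `['KC']`, not a plain string. The sketch below is one way to load the rows with typed fields. It is not part of the commit; the `load_stadiums` helper is hypothetical, and the header row is assumed to use the same names as the `ScrapedStadium` fields. A row with several teams would contain a comma inside the list, so it is assumed to be quoted in the real file.

```python
import ast
import csv
import io

# Assumed header row, followed by two rows copied from the listing above.
SAMPLE = """id,name,city,state,latitude,longitude,capacity,sport,team_abbrevs,source,year_opened
nwsl_cpkc_stadium,CPKC Stadium,Kansas City,MO,39.0975,-94.5556,11500,NWSL,['KC'],nwsl_hardcoded,2024
wnba_chase_center,Chase Center,San Francisco,CA,37.768,-122.3879,18064,WNBA,['GSV'],wnba_hardcoded,2019
"""


def load_stadiums(text: str) -> list:
    """Parse stadium CSV text into dicts with typed fields (hypothetical helper)."""
    stadiums = []
    for row in csv.DictReader(io.StringIO(text)):
        row["latitude"] = float(row["latitude"])
        row["longitude"] = float(row["longitude"])
        row["capacity"] = int(row["capacity"])
        # literal_eval only evaluates Python literals, so "['KC']" becomes ["KC"]
        # without executing arbitrary code the way eval() would.
        row["team_abbrevs"] = ast.literal_eval(row["team_abbrevs"])
        # year_opened is optional in the Swift model, so tolerate an empty cell.
        row["year_opened"] = int(row["year_opened"]) if row["year_opened"] else None
        stadiums.append(row)
    return stadiums


if __name__ == "__main__":
    for stadium in load_stadiums(SAMPLE):
        print(stadium["id"], stadium["team_abbrevs"], stadium["capacity"])
```

Storing the list as JSON (`["KC"]`) instead of a Python repr would let non-Python consumers read the column directly; that would be a change to the export format, not something this listing does.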
File diff suppressed because it is too large.
-405
View File
@@ -1,405 +0,0 @@
#!/usr/bin/env python3
"""
Generate Canonical Data for SportsTime App
==========================================
Generates team_aliases.json and league_structure.json from team mappings.

Usage:
    python generate_canonical_data.py
    python generate_canonical_data.py --output ./data
"""

import argparse
import json
from datetime import datetime
from pathlib import Path

# =============================================================================
# LEAGUE STRUCTURE
# =============================================================================
MLB_STRUCTURE = {
"leagues": [
{"id": "mlb_al", "name": "American League", "abbreviation": "AL"},
{"id": "mlb_nl", "name": "National League", "abbreviation": "NL"},
],
"divisions": [
# American League
{"id": "mlb_al_east", "name": "AL East", "parent_id": "mlb_al", "teams": ["NYY", "BOS", "TOR", "BAL", "TBR"]},
{"id": "mlb_al_central", "name": "AL Central", "parent_id": "mlb_al", "teams": ["CLE", "DET", "MIN", "CHW", "KCR"]},
{"id": "mlb_al_west", "name": "AL West", "parent_id": "mlb_al", "teams": ["HOU", "SEA", "TEX", "LAA", "OAK"]},
# National League
{"id": "mlb_nl_east", "name": "NL East", "parent_id": "mlb_nl", "teams": ["ATL", "PHI", "NYM", "MIA", "WSN"]},
{"id": "mlb_nl_central", "name": "NL Central", "parent_id": "mlb_nl", "teams": ["MIL", "CHC", "STL", "PIT", "CIN"]},
{"id": "mlb_nl_west", "name": "NL West", "parent_id": "mlb_nl", "teams": ["LAD", "ARI", "SDP", "SFG", "COL"]},
]
}
NBA_STRUCTURE = {
"conferences": [
{"id": "nba_eastern", "name": "Eastern Conference", "abbreviation": "East"},
{"id": "nba_western", "name": "Western Conference", "abbreviation": "West"},
],
"divisions": [
# Eastern Conference
{"id": "nba_atlantic", "name": "Atlantic", "parent_id": "nba_eastern", "teams": ["BOS", "BRK", "NYK", "PHI", "TOR"]},
{"id": "nba_central", "name": "Central", "parent_id": "nba_eastern", "teams": ["CHI", "CLE", "DET", "IND", "MIL"]},
{"id": "nba_southeast", "name": "Southeast", "parent_id": "nba_eastern", "teams": ["ATL", "CHO", "MIA", "ORL", "WAS"]},
# Western Conference
{"id": "nba_northwest", "name": "Northwest", "parent_id": "nba_western", "teams": ["DEN", "MIN", "OKC", "POR", "UTA"]},
{"id": "nba_pacific", "name": "Pacific", "parent_id": "nba_western", "teams": ["GSW", "LAC", "LAL", "PHO", "SAC"]},
{"id": "nba_southwest", "name": "Southwest", "parent_id": "nba_western", "teams": ["DAL", "HOU", "MEM", "NOP", "SAS"]},
]
}
NHL_STRUCTURE = {
"conferences": [
{"id": "nhl_eastern", "name": "Eastern Conference", "abbreviation": "East"},
{"id": "nhl_western", "name": "Western Conference", "abbreviation": "West"},
],
"divisions": [
# Eastern Conference
{"id": "nhl_atlantic", "name": "Atlantic", "parent_id": "nhl_eastern", "teams": ["BOS", "BUF", "DET", "FLA", "MTL", "OTT", "TBL", "TOR"]},
{"id": "nhl_metropolitan", "name": "Metropolitan", "parent_id": "nhl_eastern", "teams": ["CAR", "CBJ", "NJD", "NYI", "NYR", "PHI", "PIT", "WSH"]},
# Western Conference
{"id": "nhl_central", "name": "Central", "parent_id": "nhl_western", "teams": ["ARI", "CHI", "COL", "DAL", "MIN", "NSH", "STL", "WPG"]},
{"id": "nhl_pacific", "name": "Pacific", "parent_id": "nhl_western", "teams": ["ANA", "CGY", "EDM", "LAK", "SEA", "SJS", "VAN", "VGK"]},
]
}
# =============================================================================
# TEAM ALIASES (Historical name changes, relocations, abbreviation changes)
# =============================================================================
# Format: {current_abbrev: [(alias_type, alias_value, valid_from, valid_until), ...]}
MLB_ALIASES = {
# Washington Nationals (formerly Montreal Expos)
"WSN": [
("name", "Montreal Expos", "1969-01-01", "2004-12-31"),
("abbreviation", "MON", "1969-01-01", "2004-12-31"),
("city", "Montreal", "1969-01-01", "2004-12-31"),
],
# Oakland Athletics (moving to Sacramento, formerly in Kansas City and Philadelphia)
"OAK": [
("name", "Kansas City Athletics", "1955-01-01", "1967-12-31"),
("abbreviation", "KCA", "1955-01-01", "1967-12-31"),
("city", "Kansas City", "1955-01-01", "1967-12-31"),
("name", "Philadelphia Athletics", "1901-01-01", "1954-12-31"),
("abbreviation", "PHA", "1901-01-01", "1954-12-31"),
("city", "Philadelphia", "1901-01-01", "1954-12-31"),
],
# Cleveland Guardians (formerly Indians)
"CLE": [
("name", "Cleveland Indians", "1915-01-01", "2021-12-31"),
],
# Tampa Bay Rays (formerly Devil Rays)
"TBR": [
("name", "Tampa Bay Devil Rays", "1998-01-01", "2007-12-31"),
],
# Miami Marlins (formerly Florida Marlins)
"MIA": [
("name", "Florida Marlins", "1993-01-01", "2011-12-31"),
("city", "Florida", "1993-01-01", "2011-12-31"),
],
# Los Angeles Angels (various names)
"LAA": [
("name", "Anaheim Angels", "1997-01-01", "2004-12-31"),
("name", "Los Angeles Angels of Anaheim", "2005-01-01", "2015-12-31"),
("name", "California Angels", "1965-01-01", "1996-12-31"),
],
# Texas Rangers (formerly Washington Senators II)
"TEX": [
("name", "Washington Senators", "1961-01-01", "1971-12-31"),
("abbreviation", "WS2", "1961-01-01", "1971-12-31"),
("city", "Washington", "1961-01-01", "1971-12-31"),
],
# Milwaukee Brewers (briefly Seattle Pilots)
"MIL": [
("name", "Seattle Pilots", "1969-01-01", "1969-12-31"),
("abbreviation", "SEP", "1969-01-01", "1969-12-31"),
("city", "Seattle", "1969-01-01", "1969-12-31"),
],
# Houston Astros (formerly Colt .45s)
"HOU": [
("name", "Houston Colt .45s", "1962-01-01", "1964-12-31"),
],
}
NBA_ALIASES = {
# Brooklyn Nets (formerly New Jersey Nets, New York Nets)
"BRK": [
("name", "New Jersey Nets", "1977-01-01", "2012-04-30"),
("abbreviation", "NJN", "1977-01-01", "2012-04-30"),
("city", "New Jersey", "1977-01-01", "2012-04-30"),
("name", "New York Nets", "1968-01-01", "1977-12-31"),
],
# Oklahoma City Thunder (formerly Seattle SuperSonics)
"OKC": [
("name", "Seattle SuperSonics", "1967-01-01", "2008-07-01"),
("abbreviation", "SEA", "1967-01-01", "2008-07-01"),
("city", "Seattle", "1967-01-01", "2008-07-01"),
],
# Memphis Grizzlies (formerly Vancouver Grizzlies)
"MEM": [
("name", "Vancouver Grizzlies", "1995-01-01", "2001-05-31"),
("abbreviation", "VAN", "1995-01-01", "2001-05-31"),
("city", "Vancouver", "1995-01-01", "2001-05-31"),
],
# New Orleans Pelicans (formerly Hornets, formerly Charlotte Hornets original)
"NOP": [
("name", "New Orleans Hornets", "2002-01-01", "2013-04-30"),
("abbreviation", "NOH", "2002-01-01", "2013-04-30"),
("name", "New Orleans/Oklahoma City Hornets", "2005-01-01", "2007-12-31"),
],
# Charlotte Hornets (current, formerly Bobcats)
"CHO": [
("name", "Charlotte Bobcats", "2004-01-01", "2014-04-30"),
("abbreviation", "CHA", "2004-01-01", "2014-04-30"),
],
# Washington Wizards (formerly Bullets)
"WAS": [
("name", "Washington Bullets", "1974-01-01", "1997-05-31"),
("name", "Capital Bullets", "1973-01-01", "1973-12-31"),
("name", "Baltimore Bullets", "1963-01-01", "1972-12-31"),
],
# Los Angeles Clippers (formerly San Diego, Buffalo)
"LAC": [
("name", "San Diego Clippers", "1978-01-01", "1984-05-31"),
("abbreviation", "SDC", "1978-01-01", "1984-05-31"),
("city", "San Diego", "1978-01-01", "1984-05-31"),
("name", "Buffalo Braves", "1970-01-01", "1978-05-31"),
("abbreviation", "BUF", "1970-01-01", "1978-05-31"),
("city", "Buffalo", "1970-01-01", "1978-05-31"),
],
# Sacramento Kings (formerly Kansas City Kings, etc.)
"SAC": [
("name", "Kansas City Kings", "1975-01-01", "1985-05-31"),
("abbreviation", "KCK", "1975-01-01", "1985-05-31"),
("city", "Kansas City", "1975-01-01", "1985-05-31"),
],
# Utah Jazz (formerly New Orleans Jazz)
"UTA": [
("name", "New Orleans Jazz", "1974-01-01", "1979-05-31"),
("city", "New Orleans", "1974-01-01", "1979-05-31"),
],
}
NHL_ALIASES = {
# Arizona/Utah Hockey Club (formerly Phoenix Coyotes, originally Winnipeg Jets)
"ARI": [
("name", "Arizona Coyotes", "2014-01-01", "2024-04-30"),
("name", "Phoenix Coyotes", "1996-01-01", "2013-12-31"),
("abbreviation", "PHX", "1996-01-01", "2013-12-31"),
("city", "Phoenix", "1996-01-01", "2013-12-31"),
("name", "Winnipeg Jets", "1979-01-01", "1996-05-31"), # Original Jets
],
# Carolina Hurricanes (formerly Hartford Whalers)
"CAR": [
("name", "Hartford Whalers", "1979-01-01", "1997-05-31"),
("abbreviation", "HFD", "1979-01-01", "1997-05-31"),
("city", "Hartford", "1979-01-01", "1997-05-31"),
],
# Colorado Avalanche (formerly Quebec Nordiques)
"COL": [
("name", "Quebec Nordiques", "1979-01-01", "1995-05-31"),
("abbreviation", "QUE", "1979-01-01", "1995-05-31"),
("city", "Quebec", "1979-01-01", "1995-05-31"),
],
# Dallas Stars (formerly Minnesota North Stars)
"DAL": [
("name", "Minnesota North Stars", "1967-01-01", "1993-05-31"),
("abbreviation", "MNS", "1967-01-01", "1993-05-31"),
("city", "Minnesota", "1967-01-01", "1993-05-31"),
],
# New Jersey Devils (formerly Kansas City Scouts, Colorado Rockies)
"NJD": [
("name", "Colorado Rockies", "1976-01-01", "1982-05-31"),
("abbreviation", "CLR", "1976-01-01", "1982-05-31"),
("city", "Colorado", "1976-01-01", "1982-05-31"),
("name", "Kansas City Scouts", "1974-01-01", "1976-05-31"),
("abbreviation", "KCS", "1974-01-01", "1976-05-31"),
("city", "Kansas City", "1974-01-01", "1976-05-31"),
],
# Winnipeg Jets (current, formerly Atlanta Thrashers)
"WPG": [
("name", "Atlanta Thrashers", "1999-01-01", "2011-05-31"),
("abbreviation", "ATL", "1999-01-01", "2011-05-31"),
("city", "Atlanta", "1999-01-01", "2011-05-31"),
],
# Florida Panthers (originally in Miami)
"FLA": [
("city", "Miami", "1993-01-01", "1998-12-31"),
],
# Vegas Golden Knights (no aliases, expansion team)
# Seattle Kraken (no aliases, expansion team)
}


def generate_league_structure() -> list[dict]:
    """Generate league_structure.json data."""
    structures = []
    order = 0

    # MLB
    structures.append({
        "id": "mlb_league",
        "sport": "MLB",
        "type": "league",
        "name": "Major League Baseball",
        "abbreviation": "MLB",
        "parent_id": None,
        "display_order": order,
    })
    order += 1

    for league in MLB_STRUCTURE["leagues"]:
        structures.append({
            "id": league["id"],
            "sport": "MLB",
            "type": "conference",  # AL/NL are like conferences
            "name": league["name"],
            "abbreviation": league["abbreviation"],
            "parent_id": "mlb_league",
            "display_order": order,
        })
        order += 1

    for div in MLB_STRUCTURE["divisions"]:
        structures.append({
            "id": div["id"],
            "sport": "MLB",
            "type": "division",
            "name": div["name"],
            "abbreviation": None,
            "parent_id": div["parent_id"],
            "display_order": order,
        })
        order += 1

    # NBA
    structures.append({
        "id": "nba_league",
        "sport": "NBA",
        "type": "league",
        "name": "National Basketball Association",
        "abbreviation": "NBA",
        "parent_id": None,
        "display_order": order,
    })
    order += 1

    for conf in NBA_STRUCTURE["conferences"]:
        structures.append({
            "id": conf["id"],
            "sport": "NBA",
            "type": "conference",
            "name": conf["name"],
            "abbreviation": conf["abbreviation"],
            "parent_id": "nba_league",
            "display_order": order,
        })
        order += 1

    for div in NBA_STRUCTURE["divisions"]:
        structures.append({
            "id": div["id"],
            "sport": "NBA",
            "type": "division",
            "name": div["name"],
            "abbreviation": None,
            "parent_id": div["parent_id"],
            "display_order": order,
        })
        order += 1

    # NHL
    structures.append({
        "id": "nhl_league",
        "sport": "NHL",
        "type": "league",
        "name": "National Hockey League",
        "abbreviation": "NHL",
        "parent_id": None,
        "display_order": order,
    })
    order += 1

    for conf in NHL_STRUCTURE["conferences"]:
        structures.append({
            "id": conf["id"],
            "sport": "NHL",
            "type": "conference",
            "name": conf["name"],
            "abbreviation": conf["abbreviation"],
            "parent_id": "nhl_league",
            "display_order": order,
        })
        order += 1

    for div in NHL_STRUCTURE["divisions"]:
        structures.append({
            "id": div["id"],
            "sport": "NHL",
            "type": "division",
            "name": div["name"],
            "abbreviation": None,
            "parent_id": div["parent_id"],
            "display_order": order,
        })
        order += 1

    return structures


def generate_team_aliases() -> list[dict]:
    """Generate team_aliases.json data."""
    aliases = []
    alias_id = 1

    for sport, sport_aliases in [("MLB", MLB_ALIASES), ("NBA", NBA_ALIASES), ("NHL", NHL_ALIASES)]:
        for current_abbrev, alias_list in sport_aliases.items():
            team_canonical_id = f"team_{sport.lower()}_{current_abbrev.lower()}"
            for alias_type, alias_value, valid_from, valid_until in alias_list:
                aliases.append({
                    "id": f"alias_{sport.lower()}_{alias_id}",
                    "team_canonical_id": team_canonical_id,
                    "alias_type": alias_type,
                    "alias_value": alias_value,
                    "valid_from": valid_from,
                    "valid_until": valid_until,
                })
                alias_id += 1

    return aliases
def main():
parser = argparse.ArgumentParser(description='Generate canonical data JSON files')
parser.add_argument('--output', type=str, default='./data', help='Output directory')
args = parser.parse_args()
output_dir = Path(args.output)
output_dir.mkdir(parents=True, exist_ok=True)
# Generate league structure
print("Generating league_structure.json...")
league_structure = generate_league_structure()
with open(output_dir / 'league_structure.json', 'w') as f:
json.dump(league_structure, f, indent=2)
print(f" Created {len(league_structure)} structure entries")
# Generate team aliases
print("Generating team_aliases.json...")
team_aliases = generate_team_aliases()
with open(output_dir / 'team_aliases.json', 'w') as f:
json.dump(team_aliases, f, indent=2)
print(f" Created {len(team_aliases)} alias entries")
print(f"\nFiles written to {output_dir}")
if __name__ == '__main__':
main()
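# Illustrative sketch (not part of the original script): one way the rows written
# to team_aliases.json might be queried by name and date. The helper
# `resolve_alias` and the two sample rows are hypothetical. The sketch assumes
# `valid_from` / `valid_until` hold ISO dates or None; the real alias tables are
# not shown in this file.

```python
from datetime import date
from typing import Optional

# Hypothetical rows in the shape produced by generate_team_aliases().
SAMPLE_ALIASES = [
    {"id": "alias_mlb_1", "team_canonical_id": "team_mlb_cle", "alias_type": "name",
     "alias_value": "Cleveland Indians", "valid_from": "1915-01-01", "valid_until": "2021-12-31"},
    {"id": "alias_mlb_2", "team_canonical_id": "team_mlb_wsn", "alias_type": "name",
     "alias_value": "Montreal Expos", "valid_from": "1969-01-01", "valid_until": "2004-12-31"},
]


def resolve_alias(aliases: list[dict], alias_value: str, on: date) -> Optional[str]:
    """Return the canonical team ID whose alias matches and is valid on `on`."""
    wanted = alias_value.strip().lower()
    for row in aliases:
        if row["alias_value"].lower() != wanted:
            continue
        # A missing bound is treated as open-ended (assumption).
        start = date.fromisoformat(row["valid_from"]) if row["valid_from"] else date.min
        end = date.fromisoformat(row["valid_until"]) if row["valid_until"] else date.max
        if start <= on <= end:
            return row["team_canonical_id"]
    return None
```

# Usage: resolve_alias(SAMPLE_ALIASES, "Cleveland Indians", date(2000, 5, 1))
# returns the canonical ID only when the date falls inside the alias window.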
-275
View File
@@ -1,275 +0,0 @@
#!/usr/bin/env swift
//
// import_to_cloudkit.swift
// SportsTime
//
// Imports scraped JSON data into CloudKit public database.
// Run from command line: swift import_to_cloudkit.swift --games data/games.json --stadiums data/stadiums.json
//
import Foundation
import CloudKit
// MARK: - Data Models (matching scraper output)
struct ScrapedGame: Codable {
let id: String
let sport: String
let season: String
let date: String
let time: String?
let home_team: String
let away_team: String
let home_team_abbrev: String
let away_team_abbrev: String
let venue: String
let source: String
let is_playoff: Bool?
let broadcast: String?
}
struct ScrapedStadium: Codable {
let id: String
let name: String
let city: String
let state: String
let latitude: Double
let longitude: Double
let capacity: Int
let sport: String
let team_abbrevs: [String]
let source: String
let year_opened: Int?
}
// MARK: - CloudKit Importer
class CloudKitImporter {
let container: CKContainer
let database: CKDatabase
init(containerIdentifier: String = "iCloud.com.sportstime.app") {
self.container = CKContainer(identifier: containerIdentifier)
self.database = container.publicCloudDatabase
}
// MARK: - Import Stadiums
func importStadiums(from stadiums: [ScrapedStadium]) async throws -> Int {
var imported = 0
for stadium in stadiums {
let record = CKRecord(recordType: "Stadium")
record["stadiumId"] = stadium.id
record["name"] = stadium.name
record["city"] = stadium.city
record["state"] = stadium.state
record["location"] = CLLocation(latitude: stadium.latitude, longitude: stadium.longitude)
record["capacity"] = stadium.capacity
record["sport"] = stadium.sport
record["teamAbbrevs"] = stadium.team_abbrevs
record["source"] = stadium.source
if let yearOpened = stadium.year_opened {
record["yearOpened"] = yearOpened
}
do {
_ = try await database.save(record)
imported += 1
print(" Imported stadium: \(stadium.name)")
} catch {
print(" Error importing \(stadium.name): \(error)")
}
}
return imported
}
// MARK: - Import Teams
func importTeams(from stadiums: [ScrapedStadium], teamMappings: [String: TeamInfo]) async throws -> [String: CKRecord.ID] {
var teamRecordIDs: [String: CKRecord.ID] = [:]
for (abbrev, info) in teamMappings {
let record = CKRecord(recordType: "Team")
record["teamId"] = UUID().uuidString
record["name"] = info.name
record["abbreviation"] = abbrev
record["sport"] = info.sport
record["city"] = info.city
do {
let saved = try await database.save(record)
teamRecordIDs[abbrev] = saved.recordID
print(" Imported team: \(info.name)")
} catch {
print(" Error importing team \(info.name): \(error)")
}
}
return teamRecordIDs
}
// MARK: - Import Games
func importGames(
from games: [ScrapedGame],
teamRecordIDs: [String: CKRecord.ID],
stadiumRecordIDs: [String: CKRecord.ID]
) async throws -> Int {
var imported = 0
// Batch imports for efficiency
let batchSize = 100
var batch: [CKRecord] = []
for game in games {
let record = CKRecord(recordType: "Game")
record["gameId"] = game.id
record["sport"] = game.sport
record["season"] = game.season
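// Parse date; pin the locale so the fixed format parses the same on any machine
let dateFormatter = DateFormatter()
dateFormatter.locale = Locale(identifier: "en_US_POSIX")
dateFormatter.dateFormat = "yyyy-MM-dd"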
if let date = dateFormatter.date(from: game.date) {
if let timeStr = game.time {
// Combine date and time
let timeFormatter = DateFormatter()
timeFormatter.dateFormat = "HH:mm"
if let time = timeFormatter.date(from: timeStr) {
let calendar = Calendar.current
let timeComponents = calendar.dateComponents([.hour, .minute], from: time)
if let combined = calendar.date(bySettingHour: timeComponents.hour ?? 19,
minute: timeComponents.minute ?? 0,
second: 0, of: date) {
record["dateTime"] = combined
}
}
} else {
// Default to 7 PM if no time
let calendar = Calendar.current
if let defaultTime = calendar.date(bySettingHour: 19, minute: 0, second: 0, of: date) {
record["dateTime"] = defaultTime
}
}
}
// Team references
if let homeTeamID = teamRecordIDs[game.home_team_abbrev] {
record["homeTeamRef"] = CKRecord.Reference(recordID: homeTeamID, action: .none)
}
if let awayTeamID = teamRecordIDs[game.away_team_abbrev] {
record["awayTeamRef"] = CKRecord.Reference(recordID: awayTeamID, action: .none)
}
record["isPlayoff"] = (game.is_playoff ?? false) ? 1 : 0
record["broadcastInfo"] = game.broadcast
record["source"] = game.source
batch.append(record)
// Save batch
if batch.count >= batchSize {
    do {
        // Apply the save policy on the call itself so .changedKeys actually takes effect.
        _ = try await database.modifyRecords(saving: batch, deleting: [], savePolicy: .changedKeys)
        imported += batch.count
        print(" Imported batch of \(batch.count) games (total: \(imported))")
    } catch {
        print(" Error importing batch: \(error)")
    }
    // Drop the batch even after a failure so it is not re-sent, one record larger, on every later iteration.
    batch.removeAll()
}
}
// Save remaining
if !batch.isEmpty {
    do {
        // Use the same save policy as the full batches.
        _ = try await database.modifyRecords(saving: batch, deleting: [], savePolicy: .changedKeys)
        imported += batch.count
    } catch {
        print(" Error importing final batch: \(error)")
    }
}
return imported
}
}
// MARK: - Team Info
struct TeamInfo {
let name: String
let city: String
let sport: String
}
// MARK: - Main
func loadJSON<T: Codable>(from path: String) throws -> T {
let url = URL(fileURLWithPath: path)
let data = try Data(contentsOf: url)
return try JSONDecoder().decode(T.self, from: data)
}
func main() async {
let args = CommandLine.arguments
guard args.count >= 3 else {
print("Usage: swift import_to_cloudkit.swift --games <path> --stadiums <path>")
return
}
var gamesPath: String?
var stadiumsPath: String?
for i in 1..<args.count {
if args[i] == "--games" && i + 1 < args.count {
gamesPath = args[i + 1]
}
if args[i] == "--stadiums" && i + 1 < args.count {
stadiumsPath = args[i + 1]
}
}
let importer = CloudKitImporter()
// Import stadiums
if let path = stadiumsPath {
print("\n=== Importing Stadiums ===")
do {
let stadiums: [ScrapedStadium] = try loadJSON(from: path)
let count = try await importer.importStadiums(from: stadiums)
print("Imported \(count) stadiums")
} catch {
print("Error loading stadiums: \(error)")
}
}
// Import games
if let path = gamesPath {
print("\n=== Importing Games ===")
do {
let games: [ScrapedGame] = try loadJSON(from: path)
// Note: Would need to first import teams and get their record IDs
// This is a simplified version
print("Loaded \(games.count) games for import")
} catch {
print("Error loading games: \(error)")
}
}
print("\n=== Import Complete ===")
}
// Run
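Task {
    await main()
    // End the process once the import finishes; the run loop below never returns on its own.
    exit(0)
}
// Keep the process alive while the async work runs
RunLoop.main.run()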
-510
View File
@@ -1,510 +0,0 @@
#!/usr/bin/env python3
"""
MLB schedule and stadium scrapers for SportsTime.
This module provides:
- MLB game scrapers (Baseball-Reference, Stats API, ESPN)
- MLB stadium scrapers (MLBScoreBot, GeoJSON, hardcoded)
- Multi-source fallback configurations
"""
from datetime import datetime
from typing import Optional
import requests
# Support both direct execution and import from parent directory
try:
from core import (
Game,
Stadium,
ScraperSource,
StadiumScraperSource,
fetch_page,
scrape_with_fallback,
scrape_stadiums_with_fallback,
)
except ImportError:
from Scripts.core import (
Game,
Stadium,
ScraperSource,
StadiumScraperSource,
fetch_page,
scrape_with_fallback,
scrape_stadiums_with_fallback,
)
__all__ = [
# Team data
'MLB_TEAMS',
# Game scrapers
'scrape_mlb_baseball_reference',
'scrape_mlb_statsapi',
'scrape_mlb_espn',
# Stadium scrapers
'scrape_mlb_stadiums_scorebot',
'scrape_mlb_stadiums_geojson',
'scrape_mlb_stadiums_hardcoded',
'scrape_mlb_stadiums',
# Source configurations
'MLB_GAME_SOURCES',
'MLB_STADIUM_SOURCES',
# Convenience function
'scrape_mlb_games',
]
# =============================================================================
# TEAM MAPPINGS
# =============================================================================
MLB_TEAMS = {
'ARI': {'name': 'Arizona Diamondbacks', 'city': 'Phoenix', 'stadium': 'Chase Field'},
'ATL': {'name': 'Atlanta Braves', 'city': 'Atlanta', 'stadium': 'Truist Park'},
'BAL': {'name': 'Baltimore Orioles', 'city': 'Baltimore', 'stadium': 'Oriole Park at Camden Yards'},
'BOS': {'name': 'Boston Red Sox', 'city': 'Boston', 'stadium': 'Fenway Park'},
'CHC': {'name': 'Chicago Cubs', 'city': 'Chicago', 'stadium': 'Wrigley Field'},
'CHW': {'name': 'Chicago White Sox', 'city': 'Chicago', 'stadium': 'Guaranteed Rate Field'},
'CIN': {'name': 'Cincinnati Reds', 'city': 'Cincinnati', 'stadium': 'Great American Ball Park'},
'CLE': {'name': 'Cleveland Guardians', 'city': 'Cleveland', 'stadium': 'Progressive Field'},
'COL': {'name': 'Colorado Rockies', 'city': 'Denver', 'stadium': 'Coors Field'},
'DET': {'name': 'Detroit Tigers', 'city': 'Detroit', 'stadium': 'Comerica Park'},
'HOU': {'name': 'Houston Astros', 'city': 'Houston', 'stadium': 'Minute Maid Park'},
'KCR': {'name': 'Kansas City Royals', 'city': 'Kansas City', 'stadium': 'Kauffman Stadium'},
'LAA': {'name': 'Los Angeles Angels', 'city': 'Anaheim', 'stadium': 'Angel Stadium'},
'LAD': {'name': 'Los Angeles Dodgers', 'city': 'Los Angeles', 'stadium': 'Dodger Stadium'},
'MIA': {'name': 'Miami Marlins', 'city': 'Miami', 'stadium': 'LoanDepot Park'},
'MIL': {'name': 'Milwaukee Brewers', 'city': 'Milwaukee', 'stadium': 'American Family Field'},
'MIN': {'name': 'Minnesota Twins', 'city': 'Minneapolis', 'stadium': 'Target Field'},
'NYM': {'name': 'New York Mets', 'city': 'New York', 'stadium': 'Citi Field'},
'NYY': {'name': 'New York Yankees', 'city': 'New York', 'stadium': 'Yankee Stadium'},
'OAK': {'name': 'Oakland Athletics', 'city': 'Sacramento', 'stadium': 'Sutter Health Park'},
'PHI': {'name': 'Philadelphia Phillies', 'city': 'Philadelphia', 'stadium': 'Citizens Bank Park'},
'PIT': {'name': 'Pittsburgh Pirates', 'city': 'Pittsburgh', 'stadium': 'PNC Park'},
'SDP': {'name': 'San Diego Padres', 'city': 'San Diego', 'stadium': 'Petco Park'},
'SFG': {'name': 'San Francisco Giants', 'city': 'San Francisco', 'stadium': 'Oracle Park'},
'SEA': {'name': 'Seattle Mariners', 'city': 'Seattle', 'stadium': 'T-Mobile Park'},
'STL': {'name': 'St. Louis Cardinals', 'city': 'St. Louis', 'stadium': 'Busch Stadium'},
'TBR': {'name': 'Tampa Bay Rays', 'city': 'St. Petersburg', 'stadium': 'Tropicana Field'},
'TEX': {'name': 'Texas Rangers', 'city': 'Arlington', 'stadium': 'Globe Life Field'},
'TOR': {'name': 'Toronto Blue Jays', 'city': 'Toronto', 'stadium': 'Rogers Centre'},
'WSN': {'name': 'Washington Nationals', 'city': 'Washington', 'stadium': 'Nationals Park'},
}
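def get_mlb_team_abbrev(team_name: str) -> str:
    """Get MLB team abbreviation from full name."""
    needle = team_name.strip().lower()
    # Check exact matches across all teams before falling back to substrings.
    for abbrev, info in MLB_TEAMS.items():
        if info['name'].lower() == needle:
            return abbrev
    # An empty name is a substring of every team name, so skip partial matching for it.
    if needle:
        for abbrev, info in MLB_TEAMS.items():
            if needle in info['name'].lower():
                return abbrev
    # Return first 3 letters as fallback
    return team_name[:3].upper()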
# =============================================================================
# GAME SCRAPERS
# =============================================================================
def scrape_mlb_baseball_reference(season: int) -> list[Game]:
    """
    Scrape MLB schedule from Baseball-Reference.

    URL: https://www.baseball-reference.com/leagues/majors/{YEAR}-schedule.shtml
    """
    games = []
    url = f"https://www.baseball-reference.com/leagues/majors/{season}-schedule.shtml"
    print(f"Scraping MLB {season} from Baseball-Reference...")
    soup = fetch_page(url, 'baseball-reference.com')
    if not soup:
        return games

    # Baseball-Reference groups games by date in h3 headers
    current_date = None

    # Find the schedule section
    schedule_div = soup.find('div', {'id': 'all_schedule'})
    if not schedule_div:
        schedule_div = soup

    # Process all elements to track date context
    for element in schedule_div.find_all(['h3', 'p', 'div']):
        # Check for date header
        if element.name == 'h3':
            date_text = element.get_text(strip=True)
            # Parse date like "Thursday, March 27, 2025"; an unparseable
            # header leaves the previous date in place.
            for fmt in ['%A, %B %d, %Y', '%B %d, %Y', '%a, %b %d, %Y']:
                try:
                    parsed = datetime.strptime(date_text, fmt)
                except ValueError:
                    continue
                current_date = parsed.strftime('%Y-%m-%d')
                break
        # Check for game entries
        elif element.name == 'p' and 'game' in element.get('class', []):
            if not current_date:
                continue
            try:
                links = element.find_all('a')
                if len(links) >= 2:
                    away_team = links[0].text.strip()
                    home_team = links[1].text.strip()
                    # Generate unique game ID
                    away_abbrev = get_mlb_team_abbrev(away_team)
                    home_abbrev = get_mlb_team_abbrev(home_team)
                    game_id = f"mlb_br_{current_date}_{away_abbrev}_{home_abbrev}".lower()
                    game = Game(
                        id=game_id,
                        sport='MLB',
                        season=str(season),
                        date=current_date,
                        time=None,
                        home_team=home_team,
                        away_team=away_team,
                        home_team_abbrev=home_abbrev,
                        away_team_abbrev=away_abbrev,
                        venue='',
                        source='baseball-reference.com'
                    )
                    games.append(game)
            except Exception:
                # Skip malformed game rows rather than abort the whole scrape
                continue

    print(f" Found {len(games)} games from Baseball-Reference")
    return games
def scrape_mlb_statsapi(season: int) -> list[Game]:
"""
Fetch MLB schedule from official Stats API (JSON).
URL: https://statsapi.mlb.com/api/v1/schedule?sportId=1&season={YEAR}&gameType=R
"""
games = []
url = f"https://statsapi.mlb.com/api/v1/schedule?sportId=1&season={season}&gameType=R&hydrate=team,venue"
print(f"Fetching MLB {season} from Stats API...")
try:
response = requests.get(url, timeout=30)
response.raise_for_status()
data = response.json()
for date_entry in data.get('dates', []):
game_date = date_entry.get('date', '')
for game_data in date_entry.get('games', []):
try:
teams = game_data.get('teams', {})
away = teams.get('away', {}).get('team', {})
home = teams.get('home', {}).get('team', {})
venue = game_data.get('venue', {})
game_time = game_data.get('gameDate', '')
if 'T' in game_time:
time_str = game_time.split('T')[1][:5]
else:
time_str = None
game = Game(
id='', # Will be assigned by assign_stable_ids
sport='MLB',
season=str(season),
date=game_date,
time=time_str,
home_team=home.get('name', ''),
away_team=away.get('name', ''),
home_team_abbrev=home.get('abbreviation', ''),
away_team_abbrev=away.get('abbreviation', ''),
venue=venue.get('name', ''),
source='statsapi.mlb.com'
)
games.append(game)
except Exception as e:
continue
except Exception as e:
print(f" Error fetching MLB API: {e}")
print(f" Found {len(games)} games from MLB Stats API")
return games
def scrape_mlb_espn(season: int) -> list[Game]:
"""Fetch MLB schedule from ESPN API."""
games = []
print(f"Fetching MLB {season} from ESPN API...")
# MLB regular season: Late March - Early October
start = f"{season}0320"
end = f"{season}1010"
url = "https://site.api.espn.com/apis/site/v2/sports/baseball/mlb/scoreboard"
params = {
'dates': f"{start}-{end}",
'limit': 1000
}
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
}
try:
response = requests.get(url, params=params, headers=headers, timeout=30)
response.raise_for_status()
data = response.json()
events = data.get('events', [])
for event in events:
try:
date_str = event.get('date', '')[:10]
time_str = event.get('date', '')[11:16] if len(event.get('date', '')) > 11 else None
competitions = event.get('competitions', [{}])
if not competitions:
continue
comp = competitions[0]
competitors = comp.get('competitors', [])
if len(competitors) < 2:
continue
home_team = away_team = home_abbrev = away_abbrev = None
for team in competitors:
team_data = team.get('team', {})
team_name = team_data.get('displayName', team_data.get('name', ''))
team_abbrev = team_data.get('abbreviation', '')
if team.get('homeAway') == 'home':
home_team = team_name
home_abbrev = team_abbrev
else:
away_team = team_name
away_abbrev = team_abbrev
if not home_team or not away_team:
continue
venue = comp.get('venue', {}).get('fullName', '')
game_id = f"mlb_{date_str}_{away_abbrev}_{home_abbrev}".lower()
game = Game(
id=game_id,
sport='MLB',
season=str(season),
date=date_str,
time=time_str,
home_team=home_team,
away_team=away_team,
home_team_abbrev=home_abbrev or get_mlb_team_abbrev(home_team),
away_team_abbrev=away_abbrev or get_mlb_team_abbrev(away_team),
venue=venue,
source='espn.com'
)
games.append(game)
except Exception:
continue
print(f" Found {len(games)} games from ESPN")
except Exception as e:
print(f"Error fetching ESPN MLB: {e}")
return games
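# Illustrative sketch (not part of the original module): the ESPN scraper above
# takes the game date from the first ten characters of the event timestamp. If
# that timestamp is in UTC (an assumption here, not verified against the API),
# a US evening game can land on the following calendar day. The helper
# `split_utc_timestamp` is hypothetical and uses a fixed UTC offset as a
# stand-in for a real time zone.

```python
from datetime import datetime, timedelta, timezone


def split_utc_timestamp(stamp: str, utc_offset_hours: int = -4) -> tuple[str, str]:
    """Shift an ISO 8601 UTC timestamp by a fixed offset and return (date, time)."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat().
    parsed = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
    local = parsed.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return local.strftime("%Y-%m-%d"), local.strftime("%H:%M")
```

# A fixed offset ignores daylight saving changes; a per-venue time zone would be
# needed for accurate local times across a full season.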
# =============================================================================
# STADIUM SCRAPERS
# =============================================================================
def scrape_mlb_stadiums_scorebot() -> list[Stadium]:
"""
Source 1: MLBScoreBot/ballparks GitHub (public domain).
"""
stadiums = []
url = "https://raw.githubusercontent.com/MLBScoreBot/ballparks/main/ballparks.json"
response = requests.get(url, timeout=30)
response.raise_for_status()
data = response.json()
for name, info in data.items():
stadium = Stadium(
id=f"mlb_{name.lower().replace(' ', '_')[:30]}",
name=name,
city=info.get('city', ''),
state=info.get('state', ''),
latitude=info.get('lat', 0) / 1000000 if info.get('lat') else 0,
longitude=info.get('long', 0) / 1000000 if info.get('long') else 0,
capacity=info.get('capacity', 0),
sport='MLB',
team_abbrevs=[info.get('team', '')],
source='github.com/MLBScoreBot'
)
stadiums.append(stadium)
return stadiums
def scrape_mlb_stadiums_geojson() -> list[Stadium]:
"""
Source 2: cageyjames/GeoJSON-Ballparks GitHub.
"""
stadiums = []
url = "https://raw.githubusercontent.com/cageyjames/GeoJSON-Ballparks/master/ballparks.geojson"
response = requests.get(url, timeout=30)
response.raise_for_status()
data = response.json()
for feature in data.get('features', []):
props = feature.get('properties', {})
coords = feature.get('geometry', {}).get('coordinates', [0, 0])
# Only include MLB stadiums (filter by League)
if props.get('League', '').upper() != 'MLB':
continue
stadium = Stadium(
id=f"mlb_{props.get('Ballpark', '').lower().replace(' ', '_')[:30]}",
name=props.get('Ballpark', ''),
city=props.get('City', ''),
state=props.get('State', ''),
latitude=coords[1] if len(coords) > 1 else 0,
longitude=coords[0] if len(coords) > 0 else 0,
capacity=0, # Not in this dataset
sport='MLB',
team_abbrevs=[props.get('Team', '')],
source='github.com/cageyjames'
)
stadiums.append(stadium)
return stadiums
def scrape_mlb_stadiums_hardcoded() -> list[Stadium]:
"""
Source 3: Hardcoded MLB ballparks (fallback).
"""
mlb_ballparks = {
'Chase Field': {'city': 'Phoenix', 'state': 'AZ', 'lat': 33.4453, 'lng': -112.0667, 'capacity': 48519, 'teams': ['ARI'], 'year_opened': 1998},
'Truist Park': {'city': 'Atlanta', 'state': 'GA', 'lat': 33.8907, 'lng': -84.4677, 'capacity': 41084, 'teams': ['ATL'], 'year_opened': 2017},
'Oriole Park at Camden Yards': {'city': 'Baltimore', 'state': 'MD', 'lat': 39.2839, 'lng': -76.6216, 'capacity': 44970, 'teams': ['BAL'], 'year_opened': 1992},
'Fenway Park': {'city': 'Boston', 'state': 'MA', 'lat': 42.3467, 'lng': -71.0972, 'capacity': 37755, 'teams': ['BOS'], 'year_opened': 1912},
'Wrigley Field': {'city': 'Chicago', 'state': 'IL', 'lat': 41.9484, 'lng': -87.6553, 'capacity': 41649, 'teams': ['CHC'], 'year_opened': 1914},
'Guaranteed Rate Field': {'city': 'Chicago', 'state': 'IL', 'lat': 41.8299, 'lng': -87.6338, 'capacity': 40615, 'teams': ['CHW'], 'year_opened': 1991},
'Great American Ball Park': {'city': 'Cincinnati', 'state': 'OH', 'lat': 39.0979, 'lng': -84.5082, 'capacity': 42319, 'teams': ['CIN'], 'year_opened': 2003},
'Progressive Field': {'city': 'Cleveland', 'state': 'OH', 'lat': 41.4958, 'lng': -81.6853, 'capacity': 34830, 'teams': ['CLE'], 'year_opened': 1994},
'Coors Field': {'city': 'Denver', 'state': 'CO', 'lat': 39.7559, 'lng': -104.9942, 'capacity': 50144, 'teams': ['COL'], 'year_opened': 1995},
'Comerica Park': {'city': 'Detroit', 'state': 'MI', 'lat': 42.3390, 'lng': -83.0485, 'capacity': 41083, 'teams': ['DET'], 'year_opened': 2000},
'Minute Maid Park': {'city': 'Houston', 'state': 'TX', 'lat': 29.7573, 'lng': -95.3555, 'capacity': 41168, 'teams': ['HOU'], 'year_opened': 2000},
'Kauffman Stadium': {'city': 'Kansas City', 'state': 'MO', 'lat': 39.0517, 'lng': -94.4803, 'capacity': 37903, 'teams': ['KCR'], 'year_opened': 1973},
'Angel Stadium': {'city': 'Anaheim', 'state': 'CA', 'lat': 33.8003, 'lng': -117.8827, 'capacity': 45517, 'teams': ['LAA'], 'year_opened': 1966},
'Dodger Stadium': {'city': 'Los Angeles', 'state': 'CA', 'lat': 34.0739, 'lng': -118.2400, 'capacity': 56000, 'teams': ['LAD'], 'year_opened': 1962},
'LoanDepot Park': {'city': 'Miami', 'state': 'FL', 'lat': 25.7781, 'lng': -80.2196, 'capacity': 36742, 'teams': ['MIA'], 'year_opened': 2012},
'American Family Field': {'city': 'Milwaukee', 'state': 'WI', 'lat': 43.0280, 'lng': -87.9712, 'capacity': 41900, 'teams': ['MIL'], 'year_opened': 2001},
'Target Field': {'city': 'Minneapolis', 'state': 'MN', 'lat': 44.9818, 'lng': -93.2775, 'capacity': 38544, 'teams': ['MIN'], 'year_opened': 2010},
'Citi Field': {'city': 'Queens', 'state': 'NY', 'lat': 40.7571, 'lng': -73.8458, 'capacity': 41922, 'teams': ['NYM'], 'year_opened': 2009},
'Yankee Stadium': {'city': 'Bronx', 'state': 'NY', 'lat': 40.8296, 'lng': -73.9262, 'capacity': 46537, 'teams': ['NYY'], 'year_opened': 2009},
'Sutter Health Park': {'city': 'Sacramento', 'state': 'CA', 'lat': 38.5803, 'lng': -121.5108, 'capacity': 14014, 'teams': ['OAK'], 'year_opened': 2000},
'Citizens Bank Park': {'city': 'Philadelphia', 'state': 'PA', 'lat': 39.9061, 'lng': -75.1665, 'capacity': 42901, 'teams': ['PHI'], 'year_opened': 2004},
'PNC Park': {'city': 'Pittsburgh', 'state': 'PA', 'lat': 40.4469, 'lng': -80.0057, 'capacity': 38362, 'teams': ['PIT'], 'year_opened': 2001},
'Petco Park': {'city': 'San Diego', 'state': 'CA', 'lat': 32.7073, 'lng': -117.1566, 'capacity': 40209, 'teams': ['SDP'], 'year_opened': 2004},
'Oracle Park': {'city': 'San Francisco', 'state': 'CA', 'lat': 37.7786, 'lng': -122.3893, 'capacity': 41915, 'teams': ['SFG'], 'year_opened': 2000},
'T-Mobile Park': {'city': 'Seattle', 'state': 'WA', 'lat': 47.5914, 'lng': -122.3325, 'capacity': 47929, 'teams': ['SEA'], 'year_opened': 1999},
'Busch Stadium': {'city': 'St. Louis', 'state': 'MO', 'lat': 38.6226, 'lng': -90.1928, 'capacity': 45538, 'teams': ['STL'], 'year_opened': 2006},
'Tropicana Field': {'city': 'St. Petersburg', 'state': 'FL', 'lat': 27.7682, 'lng': -82.6534, 'capacity': 25000, 'teams': ['TBR'], 'year_opened': 1990},
'Globe Life Field': {'city': 'Arlington', 'state': 'TX', 'lat': 32.7473, 'lng': -97.0844, 'capacity': 40300, 'teams': ['TEX'], 'year_opened': 2020},
'Rogers Centre': {'city': 'Toronto', 'state': 'ON', 'lat': 43.6414, 'lng': -79.3894, 'capacity': 49282, 'teams': ['TOR'], 'year_opened': 1989},
'Nationals Park': {'city': 'Washington', 'state': 'DC', 'lat': 38.8729, 'lng': -77.0074, 'capacity': 41339, 'teams': ['WSN'], 'year_opened': 2008},
}
stadiums = []
for name, info in mlb_ballparks.items():
stadium = Stadium(
id=f"mlb_{name.lower().replace(' ', '_')[:30]}",
name=name,
city=info['city'],
state=info['state'],
latitude=info['lat'],
longitude=info['lng'],
capacity=info['capacity'],
sport='MLB',
team_abbrevs=info['teams'],
source='mlb_hardcoded',
year_opened=info.get('year_opened')
)
stadiums.append(stadium)
return stadiums
def scrape_mlb_stadiums() -> list[Stadium]:
    """
    Fetch MLB stadium data with multi-source fallback.
    """
    print("\nMLB STADIUMS")
    print("-" * 40)
    # MLB_STADIUM_SOURCES (module level) lists the same three sources in the
    # same priority order; it is looked up at call time.
    return scrape_stadiums_with_fallback('MLB', MLB_STADIUM_SOURCES)
# =============================================================================
# SOURCE CONFIGURATIONS
# =============================================================================
MLB_GAME_SOURCES = [
ScraperSource('MLB Stats API', scrape_mlb_statsapi, priority=1, min_games=100),
ScraperSource('Baseball-Reference', scrape_mlb_baseball_reference, priority=2, min_games=100),
ScraperSource('ESPN', scrape_mlb_espn, priority=3, min_games=100),
]
MLB_STADIUM_SOURCES = [
StadiumScraperSource('MLBScoreBot', scrape_mlb_stadiums_scorebot, priority=1, min_venues=25),
StadiumScraperSource('GeoJSON-Ballparks', scrape_mlb_stadiums_geojson, priority=2, min_venues=25),
StadiumScraperSource('Hardcoded', scrape_mlb_stadiums_hardcoded, priority=3, min_venues=25),
]
# =============================================================================
# CONVENIENCE FUNCTIONS
# =============================================================================
def scrape_mlb_games(season: int) -> list[Game]:
"""
Scrape MLB games for a season using multi-source fallback.
Args:
season: Season year (e.g., 2026)
Returns:
List of Game objects from the first successful source
"""
print(f"\nMLB {season} SCHEDULE")
print("-" * 40)
return scrape_with_fallback('MLB', season, MLB_GAME_SOURCES)
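# Illustrative sketch (not part of the original module): `scrape_with_fallback`
# lives in `core`, which is not shown here. The code below is one plausible
# reading of "first successful source": try sources in priority order and accept
# the first result that meets its minimum size. `Source` and `first_sufficient`
# are hypothetical names, not the real `ScraperSource` API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Source:
    name: str
    scraper: Callable[[int], list]
    priority: int = 1
    min_games: int = 100


def first_sufficient(season: int, sources: list[Source]) -> list:
    """Return results from the highest-priority source that yields enough games."""
    for source in sorted(sources, key=lambda s: s.priority):
        try:
            games = source.scraper(season)
        except Exception as exc:
            # A failing source should not stop the lower-priority ones from running.
            print(f"  {source.name} failed: {exc}")
            continue
        if len(games) >= source.min_games:
            return games
        print(f"  {source.name} returned only {len(games)} games, trying next source")
    return []
```

# Returning an empty list when every source falls short is a design choice of
# this sketch; the real helper may raise or merge partial results instead.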
-343
View File
@@ -1,343 +0,0 @@
#!/usr/bin/env python3
"""
MLS schedule and stadium scrapers for SportsTime.
This module provides:
- MLS game scrapers (ESPN, FBref, MLSSoccer.com)
- MLS stadium scrapers (gavinr GeoJSON, hardcoded)
- Multi-source fallback configurations
"""
from typing import Optional
import requests
# Support both direct execution and import from parent directory
try:
from core import (
Game,
Stadium,
ScraperSource,
StadiumScraperSource,
fetch_page,
scrape_with_fallback,
scrape_stadiums_with_fallback,
)
except ImportError:
from Scripts.core import (
Game,
Stadium,
ScraperSource,
StadiumScraperSource,
fetch_page,
scrape_with_fallback,
scrape_stadiums_with_fallback,
)
__all__ = [
# Team data
'MLS_TEAMS',
# Stadium scrapers
'scrape_mls_stadiums_hardcoded',
'scrape_mls_stadiums_gavinr',
'scrape_mls_stadiums',
# Source configurations
'MLS_STADIUM_SOURCES',
# Convenience functions
'get_mls_team_abbrev',
]
# =============================================================================
# TEAM MAPPINGS
# =============================================================================
MLS_TEAMS = {
'ATL': {'name': 'Atlanta United FC', 'city': 'Atlanta', 'stadium': 'Mercedes-Benz Stadium'},
'AUS': {'name': 'Austin FC', 'city': 'Austin', 'stadium': 'Q2 Stadium'},
'CLT': {'name': 'Charlotte FC', 'city': 'Charlotte', 'stadium': 'Bank of America Stadium'},
'CHI': {'name': 'Chicago Fire FC', 'city': 'Chicago', 'stadium': 'Soldier Field'},
'CIN': {'name': 'FC Cincinnati', 'city': 'Cincinnati', 'stadium': 'TQL Stadium'},
'COL': {'name': 'Colorado Rapids', 'city': 'Commerce City', 'stadium': "Dick's Sporting Goods Park"},
'CLB': {'name': 'Columbus Crew', 'city': 'Columbus', 'stadium': 'Lower.com Field'},
'DAL': {'name': 'FC Dallas', 'city': 'Frisco', 'stadium': 'Toyota Stadium'},
'DC': {'name': 'D.C. United', 'city': 'Washington', 'stadium': 'Audi Field'},
'HOU': {'name': 'Houston Dynamo FC', 'city': 'Houston', 'stadium': 'Shell Energy Stadium'},
'LAG': {'name': 'LA Galaxy', 'city': 'Carson', 'stadium': 'Dignity Health Sports Park'},
'LAFC': {'name': 'Los Angeles FC', 'city': 'Los Angeles', 'stadium': 'BMO Stadium'},
'MIA': {'name': 'Inter Miami CF', 'city': 'Fort Lauderdale', 'stadium': 'Chase Stadium'},
'MIN': {'name': 'Minnesota United FC', 'city': 'Saint Paul', 'stadium': 'Allianz Field'},
'MTL': {'name': 'CF Montreal', 'city': 'Montreal', 'stadium': 'Stade Saputo'},
'NSH': {'name': 'Nashville SC', 'city': 'Nashville', 'stadium': 'Geodis Park'},
'NE': {'name': 'New England Revolution', 'city': 'Foxborough', 'stadium': 'Gillette Stadium'},
'NYCFC': {'name': 'New York City FC', 'city': 'New York', 'stadium': 'Yankee Stadium'},
'NYRB': {'name': 'New York Red Bulls', 'city': 'Harrison', 'stadium': 'Red Bull Arena'},
'ORL': {'name': 'Orlando City SC', 'city': 'Orlando', 'stadium': 'Inter&Co Stadium'},
'PHI': {'name': 'Philadelphia Union', 'city': 'Chester', 'stadium': 'Subaru Park'},
'POR': {'name': 'Portland Timbers', 'city': 'Portland', 'stadium': 'Providence Park'},
'RSL': {'name': 'Real Salt Lake', 'city': 'Sandy', 'stadium': 'America First Field'},
'SJ': {'name': 'San Jose Earthquakes', 'city': 'San Jose', 'stadium': 'PayPal Park'},
'SEA': {'name': 'Seattle Sounders FC', 'city': 'Seattle', 'stadium': 'Lumen Field'},
'SKC': {'name': 'Sporting Kansas City', 'city': 'Kansas City', 'stadium': "Children's Mercy Park"},
'STL': {'name': 'St. Louis City SC', 'city': 'St. Louis', 'stadium': 'CityPark'},
'TOR': {'name': 'Toronto FC', 'city': 'Toronto', 'stadium': 'BMO Field'},
'VAN': {'name': 'Vancouver Whitecaps FC', 'city': 'Vancouver', 'stadium': 'BC Place'},
'SD': {'name': 'San Diego FC', 'city': 'San Diego', 'stadium': 'Snapdragon Stadium'},
}
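def get_mls_team_abbrev(team_name: str) -> str:
    """Get MLS team abbreviation from full name."""
    needle = team_name.strip().lower()
    # Check exact matches across all teams before falling back to substrings.
    for abbrev, info in MLS_TEAMS.items():
        if info['name'].lower() == needle:
            return abbrev
    # An empty name is a substring of every team name, so skip partial matching for it.
    if needle:
        for abbrev, info in MLS_TEAMS.items():
            if needle in info['name'].lower():
                return abbrev
    # Return first 3 letters as fallback
    return team_name[:3].upper()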
# =============================================================================
# STADIUM SCRAPERS
# =============================================================================
def scrape_mls_stadiums_hardcoded() -> list[Stadium]:
"""
Source 1: Hardcoded MLS stadiums with complete data.
All 30 MLS stadiums with capacity (soccer configuration) and year_opened.
"""
mls_stadiums = {
'Mercedes-Benz Stadium': {
'city': 'Atlanta', 'state': 'GA',
'lat': 33.7555, 'lng': -84.4000,
'capacity': 42500, 'teams': ['ATL'], 'year_opened': 2017
},
'Q2 Stadium': {
'city': 'Austin', 'state': 'TX',
'lat': 30.3877, 'lng': -97.7195,
'capacity': 20738, 'teams': ['AUS'], 'year_opened': 2021
},
'Bank of America Stadium': {
'city': 'Charlotte', 'state': 'NC',
'lat': 35.2258, 'lng': -80.8528,
'capacity': 38000, 'teams': ['CLT'], 'year_opened': 1996
},
'Soldier Field': {
'city': 'Chicago', 'state': 'IL',
'lat': 41.8623, 'lng': -87.6167,
'capacity': 24995, 'teams': ['CHI'], 'year_opened': 1924
},
'TQL Stadium': {
'city': 'Cincinnati', 'state': 'OH',
'lat': 39.1114, 'lng': -84.5222,
'capacity': 26000, 'teams': ['CIN'], 'year_opened': 2021
},
"Dick's Sporting Goods Park": {
'city': 'Commerce City', 'state': 'CO',
'lat': 39.8056, 'lng': -104.8919,
'capacity': 18061, 'teams': ['COL'], 'year_opened': 2007
},
'Lower.com Field': {
'city': 'Columbus', 'state': 'OH',
'lat': 39.9685, 'lng': -83.0171,
'capacity': 20371, 'teams': ['CLB'], 'year_opened': 2021
},
'Toyota Stadium': {
'city': 'Frisco', 'state': 'TX',
'lat': 33.1544, 'lng': -96.8353,
'capacity': 20500, 'teams': ['DAL'], 'year_opened': 2005
},
'Audi Field': {
'city': 'Washington', 'state': 'DC',
'lat': 38.8684, 'lng': -77.0129,
'capacity': 20000, 'teams': ['DC'], 'year_opened': 2018
},
'Shell Energy Stadium': {
'city': 'Houston', 'state': 'TX',
'lat': 29.7522, 'lng': -95.3524,
'capacity': 22039, 'teams': ['HOU'], 'year_opened': 2012
},
'Dignity Health Sports Park': {
'city': 'Carson', 'state': 'CA',
'lat': 33.8640, 'lng': -118.2610,
'capacity': 27000, 'teams': ['LAG'], 'year_opened': 2003
},
'BMO Stadium': {
'city': 'Los Angeles', 'state': 'CA',
'lat': 34.0128, 'lng': -118.2841,
'capacity': 22000, 'teams': ['LAFC'], 'year_opened': 2018
},
'Chase Stadium': {
'city': 'Fort Lauderdale', 'state': 'FL',
'lat': 26.1933, 'lng': -80.1607,
'capacity': 21550, 'teams': ['MIA'], 'year_opened': 2020
},
'Allianz Field': {
'city': 'Saint Paul', 'state': 'MN',
'lat': 44.9531, 'lng': -93.1647,
'capacity': 19400, 'teams': ['MIN'], 'year_opened': 2019
},
'Stade Saputo': {
'city': 'Montreal', 'state': 'QC',
'lat': 45.5631, 'lng': -73.5525,
'capacity': 19619, 'teams': ['MTL'], 'year_opened': 2008
},
'Geodis Park': {
'city': 'Nashville', 'state': 'TN',
'lat': 36.1301, 'lng': -86.7660,
'capacity': 30000, 'teams': ['NSH'], 'year_opened': 2022
},
'Gillette Stadium': {
'city': 'Foxborough', 'state': 'MA',
'lat': 42.0909, 'lng': -71.2643,
'capacity': 22385, 'teams': ['NE'], 'year_opened': 2002
},
'Yankee Stadium': {
'city': 'Bronx', 'state': 'NY',
'lat': 40.8292, 'lng': -73.9264,
'capacity': 28000, 'teams': ['NYCFC'], 'year_opened': 2009
},
'Red Bull Arena': {
'city': 'Harrison', 'state': 'NJ',
'lat': 40.7367, 'lng': -74.1503,
'capacity': 25000, 'teams': ['NYRB'], 'year_opened': 2010
},
'Inter&Co Stadium': {
'city': 'Orlando', 'state': 'FL',
'lat': 28.5411, 'lng': -81.3893,
'capacity': 25500, 'teams': ['ORL'], 'year_opened': 2017
},
'Subaru Park': {
'city': 'Chester', 'state': 'PA',
'lat': 39.8322, 'lng': -75.3789,
'capacity': 18500, 'teams': ['PHI'], 'year_opened': 2010
},
'Providence Park': {
'city': 'Portland', 'state': 'OR',
'lat': 45.5214, 'lng': -122.6917,
'capacity': 25218, 'teams': ['POR'], 'year_opened': 1926
},
'America First Field': {
'city': 'Sandy', 'state': 'UT',
'lat': 40.5829, 'lng': -111.8934,
'capacity': 20213, 'teams': ['RSL'], 'year_opened': 2008
},
'PayPal Park': {
'city': 'San Jose', 'state': 'CA',
'lat': 37.3514, 'lng': -121.9250,
'capacity': 18000, 'teams': ['SJ'], 'year_opened': 2015
},
'Lumen Field': {
'city': 'Seattle', 'state': 'WA',
'lat': 47.5952, 'lng': -122.3316,
'capacity': 37722, 'teams': ['SEA'], 'year_opened': 2002
},
"Children's Mercy Park": {
'city': 'Kansas City', 'state': 'KS',
'lat': 39.1217, 'lng': -94.8232,
'capacity': 18467, 'teams': ['SKC'], 'year_opened': 2011
},
'CityPark': {
'city': 'St. Louis', 'state': 'MO',
'lat': 38.6314, 'lng': -90.2103,
'capacity': 22500, 'teams': ['STL'], 'year_opened': 2023
},
'BMO Field': {
'city': 'Toronto', 'state': 'ON',
'lat': 43.6332, 'lng': -79.4186,
'capacity': 30000, 'teams': ['TOR'], 'year_opened': 2007
},
'BC Place': {
'city': 'Vancouver', 'state': 'BC',
'lat': 49.2767, 'lng': -123.1119,
'capacity': 22120, 'teams': ['VAN'], 'year_opened': 1983
},
'Snapdragon Stadium': {
'city': 'San Diego', 'state': 'CA',
'lat': 32.7844, 'lng': -117.1228,
'capacity': 35000, 'teams': ['SD'], 'year_opened': 2022
},
}
stadiums = []
for name, info in mls_stadiums.items():
# Create normalized ID (f-strings can't have backslashes)
normalized_name = name.lower().replace(' ', '_').replace('&', 'and').replace('.', '').replace("'", '')
stadium_id = f"mls_{normalized_name[:30]}"
stadium = Stadium(
id=stadium_id,
name=name,
city=info['city'],
state=info['state'],
latitude=info['lat'],
longitude=info['lng'],
capacity=info['capacity'],
sport='MLS',
team_abbrevs=info['teams'],
source='mls_hardcoded',
year_opened=info.get('year_opened')
)
stadiums.append(stadium)
return stadiums
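# Illustrative sketch, not part of the original module: the inline ID
# normalization above pulled into a standalone helper. The name
# `make_stadium_id` and its `max_len` parameter are assumptions made for
# this example only.

```python
def make_stadium_id(prefix: str, name: str, max_len: int = 30) -> str:
    """Hypothetical helper mirroring the inline normalization above."""
    normalized = (
        name.lower()
        .replace(' ', '_')    # spaces become underscores
        .replace('&', 'and')  # "Inter&Co" -> "interandco"
        .replace('.', '')     # "Lower.com" -> "lowercom"
        .replace("'", '')     # "Dick's" -> "dicks"
    )
    # Only the normalized name is truncated; the sport prefix is kept whole.
    return f"{prefix}_{normalized[:max_len]}"
```

# Names that differ only after the first `max_len` characters would collide,
# so a real implementation may want a uniqueness check.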
def scrape_mls_stadiums_gavinr() -> list[Stadium]:
"""
Source 2: gavinr/usa-soccer GeoJSON (fallback for coordinates).
Note: This source lacks capacity and year_opened data.
"""
stadiums = []
url = "https://raw.githubusercontent.com/gavinr/usa-soccer/master/mls.geojson"
response = requests.get(url, timeout=30)
response.raise_for_status()
data = response.json()
for feature in data.get('features', []):
props = feature.get('properties', {})
coords = feature.get('geometry', {}).get('coordinates', [0, 0])
stadium = Stadium(
id=f"mls_{props.get('stadium', '').lower().replace(' ', '_')[:30]}",
name=props.get('stadium', ''),
city=props.get('city', ''),
state=props.get('state', ''),
latitude=coords[1] if len(coords) > 1 else 0,
longitude=coords[0] if len(coords) > 0 else 0,
capacity=props.get('capacity', 0),
sport='MLS',
team_abbrevs=[get_mls_team_abbrev(props.get('team', ''))],
source='github.com/gavinr'
)
stadiums.append(stadium)
return stadiums
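# Illustrative sketch, not part of the original module: GeoJSON stores point
# coordinates as [longitude, latitude], which is why the scraper above reads
# coords[1] as latitude. The helper name `feature_lat_lng` is an assumption
# for this example, and it also tolerates a missing or null geometry.

```python
def feature_lat_lng(feature: dict) -> tuple[float, float]:
    """Return (latitude, longitude) for a GeoJSON Point feature."""
    geometry = feature.get('geometry') or {}
    coords = geometry.get('coordinates') or []
    if len(coords) < 2:
        # Same default the scraper uses when coordinates are missing
        return (0.0, 0.0)
    lng, lat = coords[0], coords[1]  # GeoJSON order is longitude first
    return (float(lat), float(lng))
```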
def scrape_mls_stadiums() -> list[Stadium]:
"""
Fetch MLS stadium data with multi-source fallback.
Hardcoded source is primary (has complete data).
"""
print("\nMLS STADIUMS")
print("-" * 40)
return scrape_stadiums_with_fallback('MLS', MLS_STADIUM_SOURCES)
# =============================================================================
# SOURCE CONFIGURATIONS
# =============================================================================
MLS_STADIUM_SOURCES = [
StadiumScraperSource('Hardcoded', scrape_mls_stadiums_hardcoded, priority=1, min_venues=25),
StadiumScraperSource('gavinr GeoJSON', scrape_mls_stadiums_gavinr, priority=2, min_venues=20),
]
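# Illustrative sketch, not the actual `scrape_stadiums_with_fallback` from
# core (which is not shown here): one plausible way a priority-ordered
# fallback could work. `SourceSketch` and `first_sufficient` are hypothetical
# names, and the "first source that meets min_venues wins" rule is an
# assumption based on the parameter names above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SourceSketch:
    name: str
    fetch: Callable[[], list]
    priority: int
    min_venues: int


def first_sufficient(sources: list[SourceSketch]) -> list:
    """Try sources in priority order; return the first adequate result."""
    for source in sorted(sources, key=lambda s: s.priority):
        try:
            venues = source.fetch()
        except Exception as exc:  # a failing source should not stop the chain
            print(f"  {source.name} failed: {exc}")
            continue
        if len(venues) >= source.min_venues:
            return venues
    return []  # every source failed or returned too few venues
```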
-412
@@ -1,412 +0,0 @@
#!/usr/bin/env python3
"""
NBA schedule and stadium scrapers for SportsTime.
This module provides:
- NBA game scrapers (Basketball-Reference, ESPN, CBS Sports)
- NBA stadium scrapers (hardcoded with coordinates)
- Multi-source fallback configurations
"""
from datetime import datetime, timedelta
from typing import Optional
import requests
# Support both direct execution and import from parent directory
try:
from core import (
Game,
Stadium,
ScraperSource,
StadiumScraperSource,
fetch_page,
scrape_with_fallback,
scrape_stadiums_with_fallback,
)
except ImportError:
from Scripts.core import (
Game,
Stadium,
ScraperSource,
StadiumScraperSource,
fetch_page,
scrape_with_fallback,
scrape_stadiums_with_fallback,
)
__all__ = [
# Team data
'NBA_TEAMS',
# Game scrapers
'scrape_nba_basketball_reference',
'scrape_nba_espn',
'scrape_nba_cbssports',
# Stadium scrapers
'scrape_nba_stadiums',
# Source configurations
'NBA_GAME_SOURCES',
'NBA_STADIUM_SOURCES',
# Convenience functions
'scrape_nba_games',
'get_nba_season_string',
]
# =============================================================================
# TEAM MAPPINGS
# =============================================================================
NBA_TEAMS = {
'ATL': {'name': 'Atlanta Hawks', 'city': 'Atlanta', 'arena': 'State Farm Arena'},
'BOS': {'name': 'Boston Celtics', 'city': 'Boston', 'arena': 'TD Garden'},
'BRK': {'name': 'Brooklyn Nets', 'city': 'Brooklyn', 'arena': 'Barclays Center'},
'CHO': {'name': 'Charlotte Hornets', 'city': 'Charlotte', 'arena': 'Spectrum Center'},
'CHI': {'name': 'Chicago Bulls', 'city': 'Chicago', 'arena': 'United Center'},
'CLE': {'name': 'Cleveland Cavaliers', 'city': 'Cleveland', 'arena': 'Rocket Mortgage FieldHouse'},
'DAL': {'name': 'Dallas Mavericks', 'city': 'Dallas', 'arena': 'American Airlines Center'},
'DEN': {'name': 'Denver Nuggets', 'city': 'Denver', 'arena': 'Ball Arena'},
'DET': {'name': 'Detroit Pistons', 'city': 'Detroit', 'arena': 'Little Caesars Arena'},
'GSW': {'name': 'Golden State Warriors', 'city': 'San Francisco', 'arena': 'Chase Center'},
'HOU': {'name': 'Houston Rockets', 'city': 'Houston', 'arena': 'Toyota Center'},
'IND': {'name': 'Indiana Pacers', 'city': 'Indianapolis', 'arena': 'Gainbridge Fieldhouse'},
'LAC': {'name': 'Los Angeles Clippers', 'city': 'Inglewood', 'arena': 'Intuit Dome'},
'LAL': {'name': 'Los Angeles Lakers', 'city': 'Los Angeles', 'arena': 'Crypto.com Arena'},
'MEM': {'name': 'Memphis Grizzlies', 'city': 'Memphis', 'arena': 'FedExForum'},
'MIA': {'name': 'Miami Heat', 'city': 'Miami', 'arena': 'Kaseya Center'},
'MIL': {'name': 'Milwaukee Bucks', 'city': 'Milwaukee', 'arena': 'Fiserv Forum'},
'MIN': {'name': 'Minnesota Timberwolves', 'city': 'Minneapolis', 'arena': 'Target Center'},
'NOP': {'name': 'New Orleans Pelicans', 'city': 'New Orleans', 'arena': 'Smoothie King Center'},
'NYK': {'name': 'New York Knicks', 'city': 'New York', 'arena': 'Madison Square Garden'},
'OKC': {'name': 'Oklahoma City Thunder', 'city': 'Oklahoma City', 'arena': 'Paycom Center'},
'ORL': {'name': 'Orlando Magic', 'city': 'Orlando', 'arena': 'Kia Center'},
'PHI': {'name': 'Philadelphia 76ers', 'city': 'Philadelphia', 'arena': 'Wells Fargo Center'},
'PHO': {'name': 'Phoenix Suns', 'city': 'Phoenix', 'arena': 'Footprint Center'},
'POR': {'name': 'Portland Trail Blazers', 'city': 'Portland', 'arena': 'Moda Center'},
'SAC': {'name': 'Sacramento Kings', 'city': 'Sacramento', 'arena': 'Golden 1 Center'},
'SAS': {'name': 'San Antonio Spurs', 'city': 'San Antonio', 'arena': 'Frost Bank Center'},
'TOR': {'name': 'Toronto Raptors', 'city': 'Toronto', 'arena': 'Scotiabank Arena'},
'UTA': {'name': 'Utah Jazz', 'city': 'Salt Lake City', 'arena': 'Delta Center'},
'WAS': {'name': 'Washington Wizards', 'city': 'Washington', 'arena': 'Capital One Arena'},
}
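def get_nba_team_abbrev(team_name: str) -> str:
    """Get NBA team abbreviation from a full or partial team name."""
    needle = team_name.strip().lower()
    # Exact match first, so a partial hit on an earlier team cannot shadow it
    for abbrev, info in NBA_TEAMS.items():
        if info['name'].lower() == needle:
            return abbrev
    # Partial match; an empty name must not match every team
    if needle:
        for abbrev, info in NBA_TEAMS.items():
            if needle in info['name'].lower():
                return abbrev
    # Return first 3 letters as fallback
    return team_name.strip()[:3].upper()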
def get_nba_season_string(season: int) -> str:
"""
Get NBA season string in "2024-25" format.
Args:
season: The ending year of the season (e.g., 2025 for 2024-25 season)
Returns:
Season string like "2024-25"
"""
return f"{season-1}-{str(season)[2:]}"
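# Illustrative sketch, not part of the original module: the same season-string
# convention written with modulo arithmetic, which keeps the two-digit suffix
# zero-padded. The name `season_string` is an assumption for this example.

```python
def season_string(ending_year: int) -> str:
    """Format a season by its ending year, e.g. 2025 -> '2024-25'."""
    return f"{ending_year - 1}-{ending_year % 100:02d}"
```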
# =============================================================================
# GAME SCRAPERS
# =============================================================================
def scrape_nba_basketball_reference(season: int) -> list[Game]:
"""
Scrape NBA schedule from Basketball-Reference.
URL: https://www.basketball-reference.com/leagues/NBA_{YEAR}_games-{month}.html
Season year is the ending year (e.g., 2025 for 2024-25 season)
"""
games = []
months = ['october', 'november', 'december', 'january', 'february', 'march', 'april', 'may', 'june']
print(f"Scraping NBA {season} from Basketball-Reference...")
for month in months:
url = f"https://www.basketball-reference.com/leagues/NBA_{season}_games-{month}.html"
soup = fetch_page(url, 'basketball-reference.com')
if not soup:
continue
table = soup.find('table', {'id': 'schedule'})
if not table:
continue
tbody = table.find('tbody')
if not tbody:
continue
for row in tbody.find_all('tr'):
if row.get('class') and 'thead' in row.get('class'):
continue
cells = row.find_all(['td', 'th'])
if len(cells) < 6:
continue
try:
# Parse date
date_cell = row.find('th', {'data-stat': 'date_game'})
if not date_cell:
continue
date_link = date_cell.find('a')
date_str = date_link.text if date_link else date_cell.text
# Parse time
time_cell = row.find('td', {'data-stat': 'game_start_time'})
time_str = time_cell.text.strip() if time_cell else None
# Parse teams
visitor_cell = row.find('td', {'data-stat': 'visitor_team_name'})
home_cell = row.find('td', {'data-stat': 'home_team_name'})
if not visitor_cell or not home_cell:
continue
visitor_link = visitor_cell.find('a')
home_link = home_cell.find('a')
away_team = visitor_link.text if visitor_link else visitor_cell.text
home_team = home_link.text if home_link else home_cell.text
# Parse arena
arena_cell = row.find('td', {'data-stat': 'arena_name'})
arena = arena_cell.text.strip() if arena_cell else ''
# Convert date
try:
parsed_date = datetime.strptime(date_str.strip(), '%a, %b %d, %Y')
date_formatted = parsed_date.strftime('%Y-%m-%d')
except ValueError:
continue
# Generate game ID
away_abbrev = get_nba_team_abbrev(away_team)
home_abbrev = get_nba_team_abbrev(home_team)
game_id = f"nba_{date_formatted}_{away_abbrev}_{home_abbrev}".lower().replace(' ', '')
game = Game(
id=game_id,
sport='NBA',
season=get_nba_season_string(season),
date=date_formatted,
time=time_str,
home_team=home_team,
away_team=away_team,
home_team_abbrev=home_abbrev,
away_team_abbrev=away_abbrev,
venue=arena,
source='basketball-reference.com'
)
games.append(game)
except Exception as e:
print(f" Error parsing row: {e}")
continue
print(f" Found {len(games)} games from Basketball-Reference")
return games
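# Illustrative sketch, not part of the original module: the date conversion
# and game-ID construction used in the row loop above, isolated so it can be
# checked without fetching a page. `build_game_id` is a hypothetical name, and
# the sketch assumes an English locale for the '%a' and '%b' directives.

```python
from datetime import datetime
from typing import Optional


def build_game_id(sport: str, date_text: str, away: str, home: str) -> Optional[str]:
    """Return an ID like 'nba_2024-10-22_nyk_bos', or None for a bad date."""
    try:
        parsed = datetime.strptime(date_text.strip(), '%a, %b %d, %Y')
    except ValueError:
        # Non-date rows (for example a section header) are skipped
        return None
    date_formatted = parsed.strftime('%Y-%m-%d')
    return f"{sport}_{date_formatted}_{away}_{home}".lower().replace(' ', '')
```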
def scrape_nba_espn(season: int) -> list[Game]:
"""
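Placeholder for scraping the NBA schedule from ESPN; currently returns no games.
URL: https://www.espn.com/nba/schedule/_/date/{YYYYMMDD}
ESPN renders the schedule with JavaScript, so the fetched pages are not parsed yet.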
"""
games = []
print(f"Scraping NBA {season} from ESPN...")
# Determine date range for season
start_date = datetime(season - 1, 10, 1) # October of previous year
end_date = datetime(season, 6, 30) # June of season year
current_date = start_date
while current_date <= end_date:
date_str = current_date.strftime('%Y%m%d')
url = f"https://www.espn.com/nba/schedule/_/date/{date_str}"
soup = fetch_page(url, 'espn.com')
if soup:
# ESPN uses JavaScript rendering, so we need to parse what's available
# This is a simplified version - full implementation would need Selenium
pass
current_date += timedelta(days=7) # Sample weekly to respect rate limits
print(f" Found {len(games)} games from ESPN")
return games
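# Illustrative sketch, not part of the original module: the weekly date
# sampling from the loop above as a pure function, so the set of requested
# URLs can be inspected without any network access. The name
# `weekly_date_strings` is an assumption for this example.

```python
from datetime import datetime, timedelta


def weekly_date_strings(season: int) -> list[str]:
    """YYYYMMDD strings, one per week, from Oct 1 to Jun 30 of a season."""
    current = datetime(season - 1, 10, 1)  # October of the previous year
    end = datetime(season, 6, 30)          # June of the season year
    dates = []
    while current <= end:
        dates.append(current.strftime('%Y%m%d'))
        current += timedelta(days=7)
    return dates
```

# Each string would be appended to the schedule URL shown in the docstring.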
def scrape_nba_cbssports(season: int) -> list[Game]:
"""
Scrape NBA schedule from the CBS Sports schedule page.
Parses the HTML schedule tables; game dates are not extracted yet.
"""
games = []
print(f"Fetching NBA {season} from CBS Sports...")
# CBS Sports has a schedule endpoint
url = "https://www.cbssports.com/nba/schedule/"
soup = fetch_page(url, 'cbssports.com')
if not soup:
return games
# Find all game rows
tables = soup.find_all('table', class_='TableBase-table')
for table in tables:
rows = table.find_all('tr')
for row in rows:
try:
cells = row.find_all('td')
if len(cells) < 2:
continue
# Parse teams from row
team_cells = row.find_all('a', class_='TeamName')
if len(team_cells) < 2:
continue
away_team = team_cells[0].get_text(strip=True)
home_team = team_cells[1].get_text(strip=True)
# TODO: parse the real date from the table section.
# Until then, every game is stamped with today's date.
date_formatted = datetime.now().strftime('%Y-%m-%d') # Placeholder
away_abbrev = get_nba_team_abbrev(away_team)
home_abbrev = get_nba_team_abbrev(home_team)
game_id = f"nba_{date_formatted}_{away_abbrev}_{home_abbrev}".lower().replace(' ', '')
game = Game(
id=game_id,
sport='NBA',
season=get_nba_season_string(season),
date=date_formatted,
time=None,
home_team=home_team,
away_team=away_team,
home_team_abbrev=home_abbrev,
away_team_abbrev=away_abbrev,
venue='',
source='cbssports.com'
)
games.append(game)
except Exception:
continue
print(f" Found {len(games)} games from CBS Sports")
return games
# =============================================================================
# STADIUM SCRAPERS
# =============================================================================
def scrape_nba_stadiums() -> list[Stadium]:
"""
Fetch NBA arena data (hardcoded with accurate coordinates).
"""
print("\nNBA STADIUMS")
print("-" * 40)
print(" Loading NBA arenas...")
nba_arenas = {
'State Farm Arena': {'city': 'Atlanta', 'state': 'GA', 'lat': 33.7573, 'lng': -84.3963, 'capacity': 18118, 'teams': ['ATL'], 'year_opened': 1999},
'TD Garden': {'city': 'Boston', 'state': 'MA', 'lat': 42.3662, 'lng': -71.0621, 'capacity': 19156, 'teams': ['BOS'], 'year_opened': 1995},
'Barclays Center': {'city': 'Brooklyn', 'state': 'NY', 'lat': 40.6826, 'lng': -73.9754, 'capacity': 17732, 'teams': ['BRK'], 'year_opened': 2012},
'Spectrum Center': {'city': 'Charlotte', 'state': 'NC', 'lat': 35.2251, 'lng': -80.8392, 'capacity': 19077, 'teams': ['CHO'], 'year_opened': 2005},
'United Center': {'city': 'Chicago', 'state': 'IL', 'lat': 41.8807, 'lng': -87.6742, 'capacity': 20917, 'teams': ['CHI'], 'year_opened': 1994},
'Rocket Mortgage FieldHouse': {'city': 'Cleveland', 'state': 'OH', 'lat': 41.4965, 'lng': -81.6882, 'capacity': 19432, 'teams': ['CLE'], 'year_opened': 1994},
'American Airlines Center': {'city': 'Dallas', 'state': 'TX', 'lat': 32.7905, 'lng': -96.8103, 'capacity': 19200, 'teams': ['DAL'], 'year_opened': 2001},
'Ball Arena': {'city': 'Denver', 'state': 'CO', 'lat': 39.7487, 'lng': -105.0077, 'capacity': 19520, 'teams': ['DEN'], 'year_opened': 1999},
'Little Caesars Arena': {'city': 'Detroit', 'state': 'MI', 'lat': 42.3411, 'lng': -83.0553, 'capacity': 20332, 'teams': ['DET'], 'year_opened': 2017},
'Chase Center': {'city': 'San Francisco', 'state': 'CA', 'lat': 37.7680, 'lng': -122.3879, 'capacity': 18064, 'teams': ['GSW'], 'year_opened': 2019},
'Toyota Center': {'city': 'Houston', 'state': 'TX', 'lat': 29.7508, 'lng': -95.3621, 'capacity': 18055, 'teams': ['HOU'], 'year_opened': 2003},
'Gainbridge Fieldhouse': {'city': 'Indianapolis', 'state': 'IN', 'lat': 39.7640, 'lng': -86.1555, 'capacity': 17923, 'teams': ['IND'], 'year_opened': 1999},
'Intuit Dome': {'city': 'Inglewood', 'state': 'CA', 'lat': 33.9425, 'lng': -118.3419, 'capacity': 18000, 'teams': ['LAC'], 'year_opened': 2024},
'Crypto.com Arena': {'city': 'Los Angeles', 'state': 'CA', 'lat': 34.0430, 'lng': -118.2673, 'capacity': 18997, 'teams': ['LAL'], 'year_opened': 1999},
'FedExForum': {'city': 'Memphis', 'state': 'TN', 'lat': 35.1382, 'lng': -90.0506, 'capacity': 17794, 'teams': ['MEM'], 'year_opened': 2004},
'Kaseya Center': {'city': 'Miami', 'state': 'FL', 'lat': 25.7814, 'lng': -80.1870, 'capacity': 19600, 'teams': ['MIA'], 'year_opened': 1999},
'Fiserv Forum': {'city': 'Milwaukee', 'state': 'WI', 'lat': 43.0451, 'lng': -87.9174, 'capacity': 17341, 'teams': ['MIL'], 'year_opened': 2018},
'Target Center': {'city': 'Minneapolis', 'state': 'MN', 'lat': 44.9795, 'lng': -93.2761, 'capacity': 18978, 'teams': ['MIN'], 'year_opened': 1990},
'Smoothie King Center': {'city': 'New Orleans', 'state': 'LA', 'lat': 29.9490, 'lng': -90.0821, 'capacity': 16867, 'teams': ['NOP'], 'year_opened': 1999},
'Madison Square Garden': {'city': 'New York', 'state': 'NY', 'lat': 40.7505, 'lng': -73.9934, 'capacity': 19812, 'teams': ['NYK'], 'year_opened': 1968},
'Paycom Center': {'city': 'Oklahoma City', 'state': 'OK', 'lat': 35.4634, 'lng': -97.5151, 'capacity': 18203, 'teams': ['OKC'], 'year_opened': 2002},
'Kia Center': {'city': 'Orlando', 'state': 'FL', 'lat': 28.5392, 'lng': -81.3839, 'capacity': 18846, 'teams': ['ORL'], 'year_opened': 2010},
'Wells Fargo Center': {'city': 'Philadelphia', 'state': 'PA', 'lat': 39.9012, 'lng': -75.1720, 'capacity': 20478, 'teams': ['PHI'], 'year_opened': 1996},
'Footprint Center': {'city': 'Phoenix', 'state': 'AZ', 'lat': 33.4457, 'lng': -112.0712, 'capacity': 17071, 'teams': ['PHO'], 'year_opened': 1992},
'Moda Center': {'city': 'Portland', 'state': 'OR', 'lat': 45.5316, 'lng': -122.6668, 'capacity': 19393, 'teams': ['POR'], 'year_opened': 1995},
'Golden 1 Center': {'city': 'Sacramento', 'state': 'CA', 'lat': 38.5802, 'lng': -121.4997, 'capacity': 17608, 'teams': ['SAC'], 'year_opened': 2016},
'Frost Bank Center': {'city': 'San Antonio', 'state': 'TX', 'lat': 29.4270, 'lng': -98.4375, 'capacity': 18418, 'teams': ['SAS'], 'year_opened': 2002},
'Scotiabank Arena': {'city': 'Toronto', 'state': 'ON', 'lat': 43.6435, 'lng': -79.3791, 'capacity': 19800, 'teams': ['TOR'], 'year_opened': 1999},
'Delta Center': {'city': 'Salt Lake City', 'state': 'UT', 'lat': 40.7683, 'lng': -111.9011, 'capacity': 18306, 'teams': ['UTA'], 'year_opened': 1991},
'Capital One Arena': {'city': 'Washington', 'state': 'DC', 'lat': 38.8982, 'lng': -77.0209, 'capacity': 20356, 'teams': ['WAS'], 'year_opened': 1997},
}
stadiums = []
for name, info in nba_arenas.items():
stadium = Stadium(
id=f"nba_{name.lower().replace(' ', '_')[:30]}",
name=name,
city=info['city'],
state=info['state'],
latitude=info['lat'],
longitude=info['lng'],
capacity=info['capacity'],
sport='NBA',
team_abbrevs=info['teams'],
source='nba_hardcoded',
year_opened=info.get('year_opened')
)
stadiums.append(stadium)
print(f" ✓ Found {len(stadiums)} NBA arenas")
return stadiums
# =============================================================================
# SOURCE CONFIGURATIONS
# =============================================================================
NBA_GAME_SOURCES = [
ScraperSource('Basketball-Reference', scrape_nba_basketball_reference, priority=1, min_games=100),
ScraperSource('CBS Sports', scrape_nba_cbssports, priority=2, min_games=50),
ScraperSource('ESPN', scrape_nba_espn, priority=3, min_games=50),
]
NBA_STADIUM_SOURCES = [
StadiumScraperSource('Hardcoded', scrape_nba_stadiums, priority=1, min_venues=25),
]
# =============================================================================
# CONVENIENCE FUNCTIONS
# =============================================================================
def scrape_nba_games(season: int) -> list[Game]:
"""
Scrape NBA games for a season using multi-source fallback.
Args:
season: Season ending year (e.g., 2025 for 2024-25 season)
Returns:
List of Game objects from the first successful source
"""
print(f"\nNBA {get_nba_season_string(season)} SCHEDULE")
print("-" * 40)
return scrape_with_fallback('NBA', season, NBA_GAME_SOURCES)
-574
@@ -1,574 +0,0 @@
#!/usr/bin/env python3
"""
NFL schedule and stadium scrapers for SportsTime.
This module provides:
- NFL game scrapers (ESPN, Pro-Football-Reference, CBS Sports)
- NFL stadium scrapers (ScoreBot, GeoJSON, hardcoded)
- Multi-source fallback configurations
"""
from datetime import datetime
from typing import Optional
import requests
# Support both direct execution and import from parent directory
try:
from core import (
Game,
Stadium,
ScraperSource,
StadiumScraperSource,
fetch_page,
scrape_with_fallback,
scrape_stadiums_with_fallback,
)
except ImportError:
from Scripts.core import (
Game,
Stadium,
ScraperSource,
StadiumScraperSource,
fetch_page,
scrape_with_fallback,
scrape_stadiums_with_fallback,
)
__all__ = [
# Team data
'NFL_TEAMS',
# Game scrapers
'scrape_nfl_espn',
'scrape_nfl_pro_football_reference',
'scrape_nfl_cbssports',
# Stadium scrapers
'scrape_nfl_stadiums',
'scrape_nfl_stadiums_scorebot',
'scrape_nfl_stadiums_geojson',
'scrape_nfl_stadiums_hardcoded',
# Source configurations
'NFL_GAME_SOURCES',
'NFL_STADIUM_SOURCES',
# Convenience functions
'scrape_nfl_games',
'get_nfl_season_string',
]
# =============================================================================
# TEAM MAPPINGS
# =============================================================================
NFL_TEAMS = {
'ARI': {'name': 'Arizona Cardinals', 'city': 'Glendale', 'stadium': 'State Farm Stadium'},
'ATL': {'name': 'Atlanta Falcons', 'city': 'Atlanta', 'stadium': 'Mercedes-Benz Stadium'},
'BAL': {'name': 'Baltimore Ravens', 'city': 'Baltimore', 'stadium': 'M&T Bank Stadium'},
'BUF': {'name': 'Buffalo Bills', 'city': 'Orchard Park', 'stadium': 'Highmark Stadium'},
'CAR': {'name': 'Carolina Panthers', 'city': 'Charlotte', 'stadium': 'Bank of America Stadium'},
'CHI': {'name': 'Chicago Bears', 'city': 'Chicago', 'stadium': 'Soldier Field'},
'CIN': {'name': 'Cincinnati Bengals', 'city': 'Cincinnati', 'stadium': 'Paycor Stadium'},
'CLE': {'name': 'Cleveland Browns', 'city': 'Cleveland', 'stadium': 'Cleveland Browns Stadium'},
'DAL': {'name': 'Dallas Cowboys', 'city': 'Arlington', 'stadium': 'AT&T Stadium'},
'DEN': {'name': 'Denver Broncos', 'city': 'Denver', 'stadium': 'Empower Field at Mile High'},
'DET': {'name': 'Detroit Lions', 'city': 'Detroit', 'stadium': 'Ford Field'},
'GB': {'name': 'Green Bay Packers', 'city': 'Green Bay', 'stadium': 'Lambeau Field'},
'HOU': {'name': 'Houston Texans', 'city': 'Houston', 'stadium': 'NRG Stadium'},
'IND': {'name': 'Indianapolis Colts', 'city': 'Indianapolis', 'stadium': 'Lucas Oil Stadium'},
'JAX': {'name': 'Jacksonville Jaguars', 'city': 'Jacksonville', 'stadium': 'EverBank Stadium'},
'KC': {'name': 'Kansas City Chiefs', 'city': 'Kansas City', 'stadium': 'GEHA Field at Arrowhead Stadium'},
'LV': {'name': 'Las Vegas Raiders', 'city': 'Las Vegas', 'stadium': 'Allegiant Stadium'},
'LAC': {'name': 'Los Angeles Chargers', 'city': 'Inglewood', 'stadium': 'SoFi Stadium'},
'LAR': {'name': 'Los Angeles Rams', 'city': 'Inglewood', 'stadium': 'SoFi Stadium'},
'MIA': {'name': 'Miami Dolphins', 'city': 'Miami Gardens', 'stadium': 'Hard Rock Stadium'},
'MIN': {'name': 'Minnesota Vikings', 'city': 'Minneapolis', 'stadium': 'U.S. Bank Stadium'},
'NE': {'name': 'New England Patriots', 'city': 'Foxborough', 'stadium': 'Gillette Stadium'},
'NO': {'name': 'New Orleans Saints', 'city': 'New Orleans', 'stadium': 'Caesars Superdome'},
'NYG': {'name': 'New York Giants', 'city': 'East Rutherford', 'stadium': 'MetLife Stadium'},
'NYJ': {'name': 'New York Jets', 'city': 'East Rutherford', 'stadium': 'MetLife Stadium'},
'PHI': {'name': 'Philadelphia Eagles', 'city': 'Philadelphia', 'stadium': 'Lincoln Financial Field'},
'PIT': {'name': 'Pittsburgh Steelers', 'city': 'Pittsburgh', 'stadium': 'Acrisure Stadium'},
'SF': {'name': 'San Francisco 49ers', 'city': 'Santa Clara', 'stadium': "Levi's Stadium"},
'SEA': {'name': 'Seattle Seahawks', 'city': 'Seattle', 'stadium': 'Lumen Field'},
'TB': {'name': 'Tampa Bay Buccaneers', 'city': 'Tampa', 'stadium': 'Raymond James Stadium'},
'TEN': {'name': 'Tennessee Titans', 'city': 'Nashville', 'stadium': 'Nissan Stadium'},
'WAS': {'name': 'Washington Commanders', 'city': 'Landover', 'stadium': 'Northwest Stadium'},
}
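def get_nfl_team_abbrev(team_name: str) -> str:
    """Get NFL team abbreviation from a full or partial team name."""
    needle = team_name.strip().lower()
    # Exact match first, so a partial hit on an earlier team cannot shadow it
    for abbrev, info in NFL_TEAMS.items():
        if info['name'].lower() == needle:
            return abbrev
    # Partial match; an empty name must not match every team
    if needle:
        for abbrev, info in NFL_TEAMS.items():
            if needle in info['name'].lower():
                return abbrev
    # Return first 3 letters as fallback
    return team_name.strip()[:3].upper()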
def get_nfl_season_string(season: int) -> str:
"""
Get NFL season string in "2025-26" format.
Args:
season: The ending year of the season (e.g., 2026 for 2025-26 season)
Returns:
Season string like "2025-26"
"""
return f"{season-1}-{str(season)[2:]}"
# =============================================================================
# GAME SCRAPERS
# =============================================================================
def _scrape_espn_schedule(sport: str, league: str, season: int, date_range: tuple[str, str]) -> list[Game]:
"""
Fetch schedule from ESPN API.
Args:
sport: 'football'
league: 'nfl'
season: Season ending year (e.g., 2026 for the 2025-26 season)
date_range: (start_date, end_date) in YYYYMMDD format
"""
games = []
sport_upper = 'NFL'
print(f"Fetching {sport_upper} {season} from ESPN API...")
url = f"https://site.api.espn.com/apis/site/v2/sports/{sport}/{league}/scoreboard"
params = {
'dates': f"{date_range[0]}-{date_range[1]}",
'limit': 1000
}
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
}
try:
response = requests.get(url, params=params, headers=headers, timeout=30)
response.raise_for_status()
data = response.json()
events = data.get('events', [])
for event in events:
try:
# Parse date/time
date_str = event.get('date', '')[:10] # YYYY-MM-DD
time_str = event.get('date', '')[11:16] if len(event.get('date', '')) > 11 else None
# Get teams
competitions = event.get('competitions', [{}])
if not competitions:
continue
comp = competitions[0]
competitors = comp.get('competitors', [])
if len(competitors) < 2:
continue
home_team = None
away_team = None
home_abbrev = None
away_abbrev = None
for team in competitors:
team_data = team.get('team', {})
team_name = team_data.get('displayName', team_data.get('name', ''))
team_abbrev = team_data.get('abbreviation', '')
if team.get('homeAway') == 'home':
home_team = team_name
home_abbrev = team_abbrev
else:
away_team = team_name
away_abbrev = team_abbrev
if not home_team or not away_team:
continue
# Get venue
venue = comp.get('venue', {}).get('fullName', '')
game_id = f"nfl_{date_str}_{away_abbrev}_{home_abbrev}".lower()
game = Game(
id=game_id,
sport='NFL',
season=get_nfl_season_string(season),
date=date_str,
time=time_str,
home_team=home_team,
away_team=away_team,
home_team_abbrev=home_abbrev or get_nfl_team_abbrev(home_team),
away_team_abbrev=away_abbrev or get_nfl_team_abbrev(away_team),
venue=venue,
source='espn.com'
)
games.append(game)
except Exception:
continue
print(f" Found {len(games)} games from ESPN")
except Exception as e:
print(f"Error fetching ESPN NFL: {e}")
return games
def scrape_nfl_espn(season: int) -> list[Game]:
"""Fetch NFL schedule from ESPN API."""
# NFL season: September - February (spans years)
start = f"{season-1}0901"
end = f"{season}0228"
return _scrape_espn_schedule('football', 'nfl', season, (start, end))
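# Illustrative sketch, not part of the original module: the per-event parsing
# from `_scrape_espn_schedule` as a pure function. The payload shape is taken
# from the field names used in the code above, not from ESPN documentation,
# and `parse_espn_event` is a hypothetical name.

```python
from typing import Optional


def parse_espn_event(event: dict) -> Optional[dict]:
    """Extract date, time, teams and venue from one scoreboard event."""
    stamp = event.get('date', '')
    competitions = event.get('competitions') or []
    if not competitions:
        return None
    comp = competitions[0]
    sides = {}
    for competitor in comp.get('competitors', []):
        team = competitor.get('team', {})
        side = 'home' if competitor.get('homeAway') == 'home' else 'away'
        sides[side] = {
            'name': team.get('displayName', team.get('name', '')),
            'abbrev': team.get('abbreviation', ''),
        }
    if 'home' not in sides or 'away' not in sides:
        return None
    return {
        'date': stamp[:10],                                # YYYY-MM-DD
        'time': stamp[11:16] if len(stamp) > 11 else None,  # HH:MM
        'home': sides['home'],
        'away': sides['away'],
        'venue': comp.get('venue', {}).get('fullName', ''),
    }
```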
def scrape_nfl_pro_football_reference(season: int) -> list[Game]:
"""
Scrape NFL schedule from Pro-Football-Reference.
URL: https://www.pro-football-reference.com/years/{YEAR}/games.htm
`season` is the ending year (e.g., 2026 for the 2025-26 season); the PFR URL uses the starting year, season - 1.
"""
games = []
year = season - 1 # PFR uses starting year
url = f"https://www.pro-football-reference.com/years/{year}/games.htm"
print(f"Scraping NFL {season} from Pro-Football-Reference...")
soup = fetch_page(url, 'pro-football-reference.com')
if not soup:
return games
table = soup.find('table', {'id': 'games'})
if not table:
print(" Could not find games table")
return games
tbody = table.find('tbody')
if not tbody:
return games
for row in tbody.find_all('tr'):
if row.get('class') and 'thead' in row.get('class'):
continue
try:
# Parse date
date_cell = row.find('td', {'data-stat': 'game_date'})
if not date_cell:
continue
date_str = date_cell.text.strip()
# Parse teams
winner_cell = row.find('td', {'data-stat': 'winner'})
loser_cell = row.find('td', {'data-stat': 'loser'})
home_cell = row.find('td', {'data-stat': 'game_location'})
if not winner_cell or not loser_cell:
continue
winner_link = winner_cell.find('a')
loser_link = loser_cell.find('a')
winner = winner_link.text if winner_link else winner_cell.text.strip()
loser = loser_link.text if loser_link else loser_cell.text.strip()
# Determine home/away - '@' in game_location means winner was away
is_at_loser = home_cell and '@' in home_cell.text
if is_at_loser:
home_team, away_team = loser, winner
else:
home_team, away_team = winner, loser
# Convert date (e.g., "September 7" or "2025-09-07")
try:
if '-' in date_str:
parsed_date = datetime.strptime(date_str, '%Y-%m-%d')
else:
# Add year based on month
month_str = date_str.split()[0]
if month_str in ['January', 'February']:
date_with_year = f"{date_str}, {year + 1}"
else:
date_with_year = f"{date_str}, {year}"
parsed_date = datetime.strptime(date_with_year, '%B %d, %Y')
date_formatted = parsed_date.strftime('%Y-%m-%d')
except (ValueError, IndexError):
continue
away_abbrev = get_nfl_team_abbrev(away_team)
home_abbrev = get_nfl_team_abbrev(home_team)
game_id = f"nfl_{date_formatted}_{away_abbrev}_{home_abbrev}".lower().replace(' ', '')
game = Game(
id=game_id,
sport='NFL',
season=get_nfl_season_string(season),
date=date_formatted,
time=None,
home_team=home_team,
away_team=away_team,
home_team_abbrev=home_abbrev,
away_team_abbrev=away_abbrev,
venue='',
source='pro-football-reference.com'
)
games.append(game)
except Exception:
continue
print(f" Found {len(games)} games from Pro-Football-Reference")
return games
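# Illustrative sketch, not part of the original module: the two pieces of
# inference the scraper above relies on, written as pure functions. January
# and February dates belong to the year after the season starts, and an '@'
# in the location column means the winner was the away team. Both function
# names are assumptions for this example.

```python
from datetime import datetime


def pfr_game_date(date_text: str, start_year: int) -> str:
    """Normalize 'September 7' or '2025-09-07' to YYYY-MM-DD."""
    if '-' in date_text:
        parsed = datetime.strptime(date_text, '%Y-%m-%d')
    else:
        month = date_text.split()[0]
        # Playoff games in January/February fall in the following year
        year = start_year + 1 if month in ('January', 'February') else start_year
        parsed = datetime.strptime(f"{date_text}, {year}", '%B %d, %Y')
    return parsed.strftime('%Y-%m-%d')


def home_and_away(winner: str, loser: str, location_marker: str) -> tuple[str, str]:
    """Return (home_team, away_team) from PFR's winner/loser columns."""
    if '@' in location_marker:
        return loser, winner  # winner played at the loser's stadium
    return winner, loser
```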
def scrape_nfl_cbssports(season: int) -> list[Game]:
"""
Scrape NFL schedule from CBS Sports.
Parses the HTML schedule tables; game dates are not extracted yet.
"""
games = []
year = season - 1 # CBS uses starting year
print(f"Fetching NFL {season} from CBS Sports...")
# CBS Sports schedule endpoint
url = f"https://www.cbssports.com/nfl/schedule/{year}/regular/"
soup = fetch_page(url, 'cbssports.com')
if not soup:
return games
# Find game tables
tables = soup.find_all('table', class_='TableBase-table')
for table in tables:
rows = table.find_all('tr')
for row in rows:
try:
cells = row.find_all('td')
if len(cells) < 3:
continue
# Parse matchup
# The length check above guarantees both cells exist
away_cell, home_cell = cells[0], cells[1]
away_team = away_cell.get_text(strip=True)
home_team = home_cell.get_text(strip=True)
if not away_team or not home_team:
continue
# CBS includes @ symbol
away_team = away_team.replace('@', '').strip()
# TODO: parse the real date from the parent section.
# Until then, every game is stamped with today's date.
date_formatted = datetime.now().strftime('%Y-%m-%d') # Placeholder
away_abbrev = get_nfl_team_abbrev(away_team)
home_abbrev = get_nfl_team_abbrev(home_team)
game_id = f"nfl_{date_formatted}_{away_abbrev}_{home_abbrev}".lower().replace(' ', '')
game = Game(
id=game_id,
sport='NFL',
season=get_nfl_season_string(season),
date=date_formatted,
time=None,
home_team=home_team,
away_team=away_team,
home_team_abbrev=home_abbrev,
away_team_abbrev=away_abbrev,
venue='',
source='cbssports.com'
)
games.append(game)
except Exception:
continue
print(f" Found {len(games)} games from CBS Sports")
return games
# =============================================================================
# STADIUM SCRAPERS
# =============================================================================
def scrape_nfl_stadiums_scorebot() -> list[Stadium]:
"""
Source 1: NFLScoreBot/stadiums GitHub (public domain).
"""
stadiums = []
url = "https://raw.githubusercontent.com/NFLScoreBot/stadiums/main/stadiums.json"
response = requests.get(url, timeout=30)
response.raise_for_status()
data = response.json()
for name, info in data.items():
stadium = Stadium(
id=f"nfl_{name.lower().replace(' ', '_')[:30]}",
name=name,
city=info.get('city', ''),
state=info.get('state', ''),
latitude=info.get('lat', 0) / 1000000 if info.get('lat') else 0,
longitude=info.get('long', 0) / 1000000 if info.get('long') else 0,
capacity=info.get('capacity', 0),
sport='NFL',
team_abbrevs=info.get('teams', []),
source='github.com/NFLScoreBot'
)
stadiums.append(stadium)
return stadiums
def scrape_nfl_stadiums_geojson() -> list[Stadium]:
"""
Source 2: brianhatchl/nfl-stadiums GeoJSON gist.
"""
stadiums = []
url = "https://gist.githubusercontent.com/brianhatchl/6265918/raw/dbe6acfe5deb48f51ce5a4c4f8f5dded4f02b9bd/nfl_stadiums.geojson"
response = requests.get(url, timeout=30)
response.raise_for_status()
data = response.json()
for feature in data.get('features', []):
props = feature.get('properties', {})
coords = feature.get('geometry', {}).get('coordinates', [0, 0])
stadium = Stadium(
id=f"nfl_{props.get('Stadium', '').lower().replace(' ', '_')[:30]}",
name=props.get('Stadium', ''),
city=props.get('City', ''),
state=props.get('State', ''),
latitude=coords[1] if len(coords) > 1 else 0,
longitude=coords[0] if len(coords) > 0 else 0,
capacity=int(props.get('Capacity', 0) or 0),
sport='NFL',
team_abbrevs=[props.get('Team', '')],
source='gist.github.com/brianhatchl'
)
stadiums.append(stadium)
return stadiums
def scrape_nfl_stadiums_hardcoded() -> list[Stadium]:
"""
Source 3: Hardcoded NFL stadiums (fallback).
"""
nfl_stadiums_data = {
'State Farm Stadium': {'city': 'Glendale', 'state': 'AZ', 'lat': 33.5276, 'lng': -112.2626, 'capacity': 63400, 'teams': ['ARI'], 'year_opened': 2006},
'Mercedes-Benz Stadium': {'city': 'Atlanta', 'state': 'GA', 'lat': 33.7553, 'lng': -84.4006, 'capacity': 71000, 'teams': ['ATL'], 'year_opened': 2017},
'M&T Bank Stadium': {'city': 'Baltimore', 'state': 'MD', 'lat': 39.2780, 'lng': -76.6227, 'capacity': 71008, 'teams': ['BAL'], 'year_opened': 1998},
'Highmark Stadium': {'city': 'Orchard Park', 'state': 'NY', 'lat': 42.7738, 'lng': -78.7870, 'capacity': 71608, 'teams': ['BUF'], 'year_opened': 1973},
'Bank of America Stadium': {'city': 'Charlotte', 'state': 'NC', 'lat': 35.2258, 'lng': -80.8528, 'capacity': 75523, 'teams': ['CAR'], 'year_opened': 1996},
'Soldier Field': {'city': 'Chicago', 'state': 'IL', 'lat': 41.8623, 'lng': -87.6167, 'capacity': 61500, 'teams': ['CHI'], 'year_opened': 1924},
'Paycor Stadium': {'city': 'Cincinnati', 'state': 'OH', 'lat': 39.0954, 'lng': -84.5160, 'capacity': 65515, 'teams': ['CIN'], 'year_opened': 2000},
'Cleveland Browns Stadium': {'city': 'Cleveland', 'state': 'OH', 'lat': 41.5061, 'lng': -81.6995, 'capacity': 67895, 'teams': ['CLE'], 'year_opened': 1999},
'AT&T Stadium': {'city': 'Arlington', 'state': 'TX', 'lat': 32.7480, 'lng': -97.0928, 'capacity': 80000, 'teams': ['DAL'], 'year_opened': 2009},
'Empower Field at Mile High': {'city': 'Denver', 'state': 'CO', 'lat': 39.7439, 'lng': -105.0201, 'capacity': 76125, 'teams': ['DEN'], 'year_opened': 2001},
'Ford Field': {'city': 'Detroit', 'state': 'MI', 'lat': 42.3400, 'lng': -83.0456, 'capacity': 65000, 'teams': ['DET'], 'year_opened': 2002},
'Lambeau Field': {'city': 'Green Bay', 'state': 'WI', 'lat': 44.5013, 'lng': -88.0622, 'capacity': 81435, 'teams': ['GB'], 'year_opened': 1957},
'NRG Stadium': {'city': 'Houston', 'state': 'TX', 'lat': 29.6847, 'lng': -95.4107, 'capacity': 72220, 'teams': ['HOU'], 'year_opened': 2002},
'Lucas Oil Stadium': {'city': 'Indianapolis', 'state': 'IN', 'lat': 39.7601, 'lng': -86.1639, 'capacity': 67000, 'teams': ['IND'], 'year_opened': 2008},
'EverBank Stadium': {'city': 'Jacksonville', 'state': 'FL', 'lat': 30.3239, 'lng': -81.6373, 'capacity': 67814, 'teams': ['JAX'], 'year_opened': 1995},
'GEHA Field at Arrowhead Stadium': {'city': 'Kansas City', 'state': 'MO', 'lat': 39.0489, 'lng': -94.4839, 'capacity': 76416, 'teams': ['KC'], 'year_opened': 1972},
'Allegiant Stadium': {'city': 'Las Vegas', 'state': 'NV', 'lat': 36.0909, 'lng': -115.1833, 'capacity': 65000, 'teams': ['LV'], 'year_opened': 2020},
'SoFi Stadium': {'city': 'Inglewood', 'state': 'CA', 'lat': 33.9535, 'lng': -118.3392, 'capacity': 70240, 'teams': ['LAC', 'LAR'], 'year_opened': 2020},
'Hard Rock Stadium': {'city': 'Miami Gardens', 'state': 'FL', 'lat': 25.9580, 'lng': -80.2389, 'capacity': 64767, 'teams': ['MIA'], 'year_opened': 1987},
'U.S. Bank Stadium': {'city': 'Minneapolis', 'state': 'MN', 'lat': 44.9736, 'lng': -93.2575, 'capacity': 66655, 'teams': ['MIN'], 'year_opened': 2016},
'Gillette Stadium': {'city': 'Foxborough', 'state': 'MA', 'lat': 42.0909, 'lng': -71.2643, 'capacity': 65878, 'teams': ['NE'], 'year_opened': 2002},
'Caesars Superdome': {'city': 'New Orleans', 'state': 'LA', 'lat': 29.9511, 'lng': -90.0812, 'capacity': 73208, 'teams': ['NO'], 'year_opened': 1975},
'MetLife Stadium': {'city': 'East Rutherford', 'state': 'NJ', 'lat': 40.8135, 'lng': -74.0745, 'capacity': 82500, 'teams': ['NYG', 'NYJ'], 'year_opened': 2010},
'Lincoln Financial Field': {'city': 'Philadelphia', 'state': 'PA', 'lat': 39.9008, 'lng': -75.1675, 'capacity': 69596, 'teams': ['PHI'], 'year_opened': 2003},
'Acrisure Stadium': {'city': 'Pittsburgh', 'state': 'PA', 'lat': 40.4468, 'lng': -80.0158, 'capacity': 68400, 'teams': ['PIT'], 'year_opened': 2001},
"Levi's Stadium": {'city': 'Santa Clara', 'state': 'CA', 'lat': 37.4032, 'lng': -121.9698, 'capacity': 68500, 'teams': ['SF'], 'year_opened': 2014},
'Lumen Field': {'city': 'Seattle', 'state': 'WA', 'lat': 47.5952, 'lng': -122.3316, 'capacity': 68740, 'teams': ['SEA'], 'year_opened': 2002},
'Raymond James Stadium': {'city': 'Tampa', 'state': 'FL', 'lat': 27.9759, 'lng': -82.5033, 'capacity': 65618, 'teams': ['TB'], 'year_opened': 1998},
'Nissan Stadium': {'city': 'Nashville', 'state': 'TN', 'lat': 36.1665, 'lng': -86.7713, 'capacity': 69143, 'teams': ['TEN'], 'year_opened': 1999},
'Northwest Stadium': {'city': 'Landover', 'state': 'MD', 'lat': 38.9076, 'lng': -76.8645, 'capacity': 67617, 'teams': ['WAS'], 'year_opened': 1997},
}
stadiums = []
for name, info in nfl_stadiums_data.items():
stadium = Stadium(
id=f"nfl_{name.lower().replace(' ', '_')[:30]}",
name=name,
city=info['city'],
state=info['state'],
latitude=info['lat'],
longitude=info['lng'],
capacity=info['capacity'],
sport='NFL',
team_abbrevs=info['teams'],
source='nfl_hardcoded',
year_opened=info.get('year_opened')
)
stadiums.append(stadium)
return stadiums
def scrape_nfl_stadiums() -> list[Stadium]:
"""
Fetch NFL stadium data with multi-source fallback.
"""
print("\nNFL STADIUMS")
print("-" * 40)
return scrape_stadiums_with_fallback('NFL', NFL_STADIUM_SOURCES)
# =============================================================================
# SOURCE CONFIGURATIONS
# =============================================================================
NFL_GAME_SOURCES = [
ScraperSource('ESPN', scrape_nfl_espn, priority=1, min_games=200),
ScraperSource('Pro-Football-Reference', scrape_nfl_pro_football_reference, priority=2, min_games=200),
ScraperSource('CBS Sports', scrape_nfl_cbssports, priority=3, min_games=100),
]
NFL_STADIUM_SOURCES = [
StadiumScraperSource('NFLScoreBot', scrape_nfl_stadiums_scorebot, priority=1, min_venues=28),
StadiumScraperSource('GeoJSON-Gist', scrape_nfl_stadiums_geojson, priority=2, min_venues=28),
StadiumScraperSource('Hardcoded', scrape_nfl_stadiums_hardcoded, priority=3, min_venues=28),
]
# =============================================================================
# CONVENIENCE FUNCTIONS
# =============================================================================
def scrape_nfl_games(season: int) -> list[Game]:
"""
Scrape NFL games for a season using multi-source fallback.
Args:
season: Season ending year (e.g., 2026 for 2025-26 season)
Returns:
List of Game objects from the first successful source
"""
print(f"\nNFL {get_nfl_season_string(season)} SCHEDULE")
print("-" * 40)
return scrape_with_fallback('NFL', season, NFL_GAME_SOURCES)
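The source lists above pair each scraper with a priority and a minimum result count, but `scrape_with_fallback` itself lives in `core` and is not shown in this diff. The following is a minimal sketch of how such a fallback loop might work, assuming sources are tried in ascending priority order and a source is accepted once it returns at least `min_games` results. `Source` and `first_successful` are hypothetical names used for illustration, not the actual `core` API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Source:
    """Hypothetical stand-in for core.ScraperSource."""
    name: str
    scraper: Callable[[int], list]
    priority: int
    min_games: int


def first_successful(season: int, sources: list[Source]) -> list:
    """Try sources in priority order; return the first adequate result."""
    for source in sorted(sources, key=lambda s: s.priority):
        try:
            results = source.scraper(season)
        except Exception as exc:  # a failing source should not stop the rest
            print(f"  {source.name} failed: {exc}")
            continue
        if len(results) >= source.min_games:
            return results
        print(f"  {source.name} returned {len(results)} (< {source.min_games})")
    return []


def broken(season: int) -> list:
    """Simulates a source whose request fails."""
    raise RuntimeError("HTTP 503")


sources = [
    Source('Primary', broken, priority=1, min_games=200),
    Source('Stub', lambda season: [], priority=2, min_games=50),
    Source('Backup', lambda season: ['game'] * 272, priority=3, min_games=100),
]
games = first_successful(2026, sources)
print(len(games))
```

Under these assumptions, a stub scraper that returns an empty list is skipped in the same way as one that raises, which is why a per-source threshold is useful even for sources that never fail outright.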
-411
@@ -1,411 +0,0 @@
#!/usr/bin/env python3
"""
NHL schedule and stadium scrapers for SportsTime.
This module provides:
- NHL game scrapers (Hockey-Reference, NHL API, ESPN)
- NHL stadium scrapers (hardcoded with coordinates)
- Multi-source fallback configurations
"""
from datetime import datetime
from typing import Optional
import requests
# Support both direct execution and import from parent directory
try:
from core import (
Game,
Stadium,
ScraperSource,
StadiumScraperSource,
fetch_page,
scrape_with_fallback,
scrape_stadiums_with_fallback,
)
except ImportError:
from Scripts.core import (
Game,
Stadium,
ScraperSource,
StadiumScraperSource,
fetch_page,
scrape_with_fallback,
scrape_stadiums_with_fallback,
)
__all__ = [
# Team data
'NHL_TEAMS',
# Game scrapers
'scrape_nhl_hockey_reference',
'scrape_nhl_api',
'scrape_nhl_espn',
# Stadium scrapers
'scrape_nhl_stadiums',
# Source configurations
'NHL_GAME_SOURCES',
'NHL_STADIUM_SOURCES',
# Convenience functions
'scrape_nhl_games',
'get_nhl_season_string',
]
# =============================================================================
# TEAM MAPPINGS
# =============================================================================
NHL_TEAMS = {
'ANA': {'name': 'Anaheim Ducks', 'city': 'Anaheim', 'arena': 'Honda Center'},
'ARI': {'name': 'Utah Hockey Club', 'city': 'Salt Lake City', 'arena': 'Delta Center'},
'BOS': {'name': 'Boston Bruins', 'city': 'Boston', 'arena': 'TD Garden'},
'BUF': {'name': 'Buffalo Sabres', 'city': 'Buffalo', 'arena': 'KeyBank Center'},
'CGY': {'name': 'Calgary Flames', 'city': 'Calgary', 'arena': 'Scotiabank Saddledome'},
'CAR': {'name': 'Carolina Hurricanes', 'city': 'Raleigh', 'arena': 'PNC Arena'},
'CHI': {'name': 'Chicago Blackhawks', 'city': 'Chicago', 'arena': 'United Center'},
'COL': {'name': 'Colorado Avalanche', 'city': 'Denver', 'arena': 'Ball Arena'},
'CBJ': {'name': 'Columbus Blue Jackets', 'city': 'Columbus', 'arena': 'Nationwide Arena'},
'DAL': {'name': 'Dallas Stars', 'city': 'Dallas', 'arena': 'American Airlines Center'},
'DET': {'name': 'Detroit Red Wings', 'city': 'Detroit', 'arena': 'Little Caesars Arena'},
'EDM': {'name': 'Edmonton Oilers', 'city': 'Edmonton', 'arena': 'Rogers Place'},
'FLA': {'name': 'Florida Panthers', 'city': 'Sunrise', 'arena': 'Amerant Bank Arena'},
'LAK': {'name': 'Los Angeles Kings', 'city': 'Los Angeles', 'arena': 'Crypto.com Arena'},
'MIN': {'name': 'Minnesota Wild', 'city': 'St. Paul', 'arena': 'Xcel Energy Center'},
'MTL': {'name': 'Montreal Canadiens', 'city': 'Montreal', 'arena': 'Bell Centre'},
'NSH': {'name': 'Nashville Predators', 'city': 'Nashville', 'arena': 'Bridgestone Arena'},
'NJD': {'name': 'New Jersey Devils', 'city': 'Newark', 'arena': 'Prudential Center'},
'NYI': {'name': 'New York Islanders', 'city': 'Elmont', 'arena': 'UBS Arena'},
'NYR': {'name': 'New York Rangers', 'city': 'New York', 'arena': 'Madison Square Garden'},
'OTT': {'name': 'Ottawa Senators', 'city': 'Ottawa', 'arena': 'Canadian Tire Centre'},
'PHI': {'name': 'Philadelphia Flyers', 'city': 'Philadelphia', 'arena': 'Wells Fargo Center'},
'PIT': {'name': 'Pittsburgh Penguins', 'city': 'Pittsburgh', 'arena': 'PPG Paints Arena'},
'SJS': {'name': 'San Jose Sharks', 'city': 'San Jose', 'arena': 'SAP Center'},
'SEA': {'name': 'Seattle Kraken', 'city': 'Seattle', 'arena': 'Climate Pledge Arena'},
'STL': {'name': 'St. Louis Blues', 'city': 'St. Louis', 'arena': 'Enterprise Center'},
'TBL': {'name': 'Tampa Bay Lightning', 'city': 'Tampa', 'arena': 'Amalie Arena'},
'TOR': {'name': 'Toronto Maple Leafs', 'city': 'Toronto', 'arena': 'Scotiabank Arena'},
'VAN': {'name': 'Vancouver Canucks', 'city': 'Vancouver', 'arena': 'Rogers Arena'},
'VGK': {'name': 'Vegas Golden Knights', 'city': 'Las Vegas', 'arena': 'T-Mobile Arena'},
'WSH': {'name': 'Washington Capitals', 'city': 'Washington', 'arena': 'Capital One Arena'},
'WPG': {'name': 'Winnipeg Jets', 'city': 'Winnipeg', 'arena': 'Canada Life Centre'},
}
def get_nhl_team_abbrev(team_name: str) -> str:
"""Get NHL team abbreviation from full name."""
for abbrev, info in NHL_TEAMS.items():
if info['name'].lower() == team_name.lower():
return abbrev
if team_name.lower() in info['name'].lower():
return abbrev
# Return first 3 letters as fallback
return team_name[:3].upper()
def get_nhl_season_string(season: int) -> str:
"""
Get NHL season string in "2024-25" format.
Args:
season: The ending year of the season (e.g., 2025 for 2024-25 season)
Returns:
Season string like "2024-25"
"""
return f"{season-1}-{str(season)[2:]}"
# =============================================================================
# GAME SCRAPERS
# =============================================================================
def scrape_nhl_hockey_reference(season: int) -> list[Game]:
"""
Scrape NHL schedule from Hockey-Reference.
URL: https://www.hockey-reference.com/leagues/NHL_{YEAR}_games.html
"""
games = []
url = f"https://www.hockey-reference.com/leagues/NHL_{season}_games.html"
print(f"Scraping NHL {season} from Hockey-Reference...")
soup = fetch_page(url, 'hockey-reference.com')
if not soup:
return games
table = soup.find('table', {'id': 'games'})
if not table:
print(" Could not find games table")
return games
tbody = table.find('tbody')
if not tbody:
return games
for row in tbody.find_all('tr'):
try:
cells = row.find_all(['td', 'th'])
if len(cells) < 5:
continue
# Parse date
date_cell = row.find('th', {'data-stat': 'date_game'})
if not date_cell:
continue
date_link = date_cell.find('a')
date_str = date_link.text if date_link else date_cell.text
# Parse teams
visitor_cell = row.find('td', {'data-stat': 'visitor_team_name'})
home_cell = row.find('td', {'data-stat': 'home_team_name'})
if not visitor_cell or not home_cell:
continue
visitor_link = visitor_cell.find('a')
home_link = home_cell.find('a')
away_team = visitor_link.text if visitor_link else visitor_cell.text
home_team = home_link.text if home_link else home_cell.text
# Convert date
try:
parsed_date = datetime.strptime(date_str.strip(), '%Y-%m-%d')
date_formatted = parsed_date.strftime('%Y-%m-%d')
except ValueError:  # date cell is not in YYYY-MM-DD form (e.g. a repeated header row)
continue
away_abbrev = get_nhl_team_abbrev(away_team)
home_abbrev = get_nhl_team_abbrev(home_team)
game_id = f"nhl_{date_formatted}_{away_abbrev}_{home_abbrev}".lower().replace(' ', '')
game = Game(
id=game_id,
sport='NHL',
season=get_nhl_season_string(season),
date=date_formatted,
time=None,
home_team=home_team,
away_team=away_team,
home_team_abbrev=home_abbrev,
away_team_abbrev=away_abbrev,
venue='',
source='hockey-reference.com'
)
games.append(game)
except Exception:
continue
print(f" Found {len(games)} games from Hockey-Reference")
return games
def scrape_nhl_api(season: int) -> list[Game]:
    """
    Fetch NHL schedule from official API (JSON). Not yet implemented.
    URL: https://api-web.nhle.com/v1/schedule/{YYYY-MM-DD}
    """
    games = []
    print(f"Fetching NHL {season} from NHL API...")
    # Stub: the endpoint is keyed by date, so covering a full season would
    # require iterating over dates (or teams). Until that is written, this
    # returns an empty list, which is below the min_games=50 set for this
    # source in NHL_GAME_SOURCES.
    return games
def scrape_nhl_espn(season: int) -> list[Game]:
"""Fetch NHL schedule from ESPN API."""
games = []
print(f"Fetching NHL {season} from ESPN API...")
# NHL regular season: October - April (spans calendar years)
start = f"{season-1}1001"
end = f"{season}0430"
url = "https://site.api.espn.com/apis/site/v2/sports/hockey/nhl/scoreboard"
params = {
'dates': f"{start}-{end}",
'limit': 1000
}
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
}
try:
response = requests.get(url, params=params, headers=headers, timeout=30)
response.raise_for_status()
data = response.json()
events = data.get('events', [])
for event in events:
try:
date_str = event.get('date', '')[:10]
time_str = event.get('date', '')[11:16] if len(event.get('date', '')) > 11 else None
competitions = event.get('competitions', [{}])
if not competitions:
continue
comp = competitions[0]
competitors = comp.get('competitors', [])
if len(competitors) < 2:
continue
home_team = away_team = home_abbrev = away_abbrev = None
for team in competitors:
team_data = team.get('team', {})
team_name = team_data.get('displayName', team_data.get('name', ''))
team_abbrev = team_data.get('abbreviation', '')
if team.get('homeAway') == 'home':
home_team = team_name
home_abbrev = team_abbrev
else:
away_team = team_name
away_abbrev = team_abbrev
if not home_team or not away_team:
continue
venue = comp.get('venue', {}).get('fullName', '')
game_id = f"nhl_{date_str}_{away_abbrev}_{home_abbrev}".lower()
game = Game(
id=game_id,
sport='NHL',
season=get_nhl_season_string(season),
date=date_str,
time=time_str,
home_team=home_team,
away_team=away_team,
home_team_abbrev=home_abbrev or get_nhl_team_abbrev(home_team),
away_team_abbrev=away_abbrev or get_nhl_team_abbrev(away_team),
venue=venue,
source='espn.com'
)
games.append(game)
except Exception:
continue
print(f" Found {len(games)} games from ESPN")
except Exception as e:
print(f"Error fetching ESPN NHL: {e}")
return games
# =============================================================================
# STADIUM SCRAPERS
# =============================================================================
def scrape_nhl_stadiums() -> list[Stadium]:
"""
Fetch NHL arena data (hardcoded with accurate coordinates).
"""
print("\nNHL STADIUMS")
print("-" * 40)
print(" Loading NHL arenas...")
nhl_arenas = {
'TD Garden': {'city': 'Boston', 'state': 'MA', 'lat': 42.3662, 'lng': -71.0621, 'capacity': 17850, 'teams': ['BOS'], 'year_opened': 1995},
'KeyBank Center': {'city': 'Buffalo', 'state': 'NY', 'lat': 42.8750, 'lng': -78.8764, 'capacity': 19070, 'teams': ['BUF'], 'year_opened': 1996},
'Little Caesars Arena': {'city': 'Detroit', 'state': 'MI', 'lat': 42.3411, 'lng': -83.0553, 'capacity': 19515, 'teams': ['DET'], 'year_opened': 2017},
'Amerant Bank Arena': {'city': 'Sunrise', 'state': 'FL', 'lat': 26.1584, 'lng': -80.3256, 'capacity': 19250, 'teams': ['FLA'], 'year_opened': 1998},
'Bell Centre': {'city': 'Montreal', 'state': 'QC', 'lat': 45.4961, 'lng': -73.5693, 'capacity': 21302, 'teams': ['MTL'], 'year_opened': 1996},
'Canadian Tire Centre': {'city': 'Ottawa', 'state': 'ON', 'lat': 45.2969, 'lng': -75.9272, 'capacity': 18652, 'teams': ['OTT'], 'year_opened': 1996},
'Amalie Arena': {'city': 'Tampa', 'state': 'FL', 'lat': 27.9426, 'lng': -82.4519, 'capacity': 19092, 'teams': ['TBL'], 'year_opened': 1996},
'Scotiabank Arena': {'city': 'Toronto', 'state': 'ON', 'lat': 43.6435, 'lng': -79.3791, 'capacity': 18800, 'teams': ['TOR'], 'year_opened': 1999},
'PNC Arena': {'city': 'Raleigh', 'state': 'NC', 'lat': 35.8033, 'lng': -78.7220, 'capacity': 18680, 'teams': ['CAR'], 'year_opened': 1999},
'Nationwide Arena': {'city': 'Columbus', 'state': 'OH', 'lat': 39.9692, 'lng': -83.0061, 'capacity': 18500, 'teams': ['CBJ'], 'year_opened': 2000},
'Prudential Center': {'city': 'Newark', 'state': 'NJ', 'lat': 40.7334, 'lng': -74.1713, 'capacity': 16514, 'teams': ['NJD'], 'year_opened': 2007},
'UBS Arena': {'city': 'Elmont', 'state': 'NY', 'lat': 40.7170, 'lng': -73.7260, 'capacity': 17255, 'teams': ['NYI'], 'year_opened': 2021},
'Madison Square Garden': {'city': 'New York', 'state': 'NY', 'lat': 40.7505, 'lng': -73.9934, 'capacity': 18006, 'teams': ['NYR'], 'year_opened': 1968},
'Wells Fargo Center': {'city': 'Philadelphia', 'state': 'PA', 'lat': 39.9012, 'lng': -75.1720, 'capacity': 19500, 'teams': ['PHI'], 'year_opened': 1996},
'PPG Paints Arena': {'city': 'Pittsburgh', 'state': 'PA', 'lat': 40.4395, 'lng': -79.9892, 'capacity': 18387, 'teams': ['PIT'], 'year_opened': 2010},
'Capital One Arena': {'city': 'Washington', 'state': 'DC', 'lat': 38.8982, 'lng': -77.0209, 'capacity': 18573, 'teams': ['WSH'], 'year_opened': 1997},
'United Center': {'city': 'Chicago', 'state': 'IL', 'lat': 41.8807, 'lng': -87.6742, 'capacity': 19717, 'teams': ['CHI'], 'year_opened': 1994},
'Ball Arena': {'city': 'Denver', 'state': 'CO', 'lat': 39.7487, 'lng': -105.0077, 'capacity': 18007, 'teams': ['COL'], 'year_opened': 1999},
'American Airlines Center': {'city': 'Dallas', 'state': 'TX', 'lat': 32.7905, 'lng': -96.8103, 'capacity': 18532, 'teams': ['DAL'], 'year_opened': 2001},
'Xcel Energy Center': {'city': 'Saint Paul', 'state': 'MN', 'lat': 44.9448, 'lng': -93.1010, 'capacity': 17954, 'teams': ['MIN'], 'year_opened': 2000},
'Bridgestone Arena': {'city': 'Nashville', 'state': 'TN', 'lat': 36.1592, 'lng': -86.7785, 'capacity': 17159, 'teams': ['NSH'], 'year_opened': 1996},
'Enterprise Center': {'city': 'St. Louis', 'state': 'MO', 'lat': 38.6268, 'lng': -90.2025, 'capacity': 18096, 'teams': ['STL'], 'year_opened': 1994},
'Canada Life Centre': {'city': 'Winnipeg', 'state': 'MB', 'lat': 49.8928, 'lng': -97.1437, 'capacity': 15321, 'teams': ['WPG'], 'year_opened': 2004},
'Honda Center': {'city': 'Anaheim', 'state': 'CA', 'lat': 33.8078, 'lng': -117.8765, 'capacity': 17174, 'teams': ['ANA'], 'year_opened': 1993},
'Delta Center': {'city': 'Salt Lake City', 'state': 'UT', 'lat': 40.7683, 'lng': -111.9011, 'capacity': 16210, 'teams': ['ARI'], 'year_opened': 1991},
'SAP Center': {'city': 'San Jose', 'state': 'CA', 'lat': 37.3327, 'lng': -121.9012, 'capacity': 17562, 'teams': ['SJS'], 'year_opened': 1993},
'Rogers Arena': {'city': 'Vancouver', 'state': 'BC', 'lat': 49.2778, 'lng': -123.1089, 'capacity': 18910, 'teams': ['VAN'], 'year_opened': 1995},
'T-Mobile Arena': {'city': 'Las Vegas', 'state': 'NV', 'lat': 36.1028, 'lng': -115.1784, 'capacity': 17500, 'teams': ['VGK'], 'year_opened': 2016},
'Climate Pledge Arena': {'city': 'Seattle', 'state': 'WA', 'lat': 47.6220, 'lng': -122.3540, 'capacity': 17100, 'teams': ['SEA'], 'year_opened': 2021},
'Crypto.com Arena': {'city': 'Los Angeles', 'state': 'CA', 'lat': 34.0430, 'lng': -118.2673, 'capacity': 18230, 'teams': ['LAK'], 'year_opened': 1999},
'Rogers Place': {'city': 'Edmonton', 'state': 'AB', 'lat': 53.5469, 'lng': -113.4979, 'capacity': 18347, 'teams': ['EDM'], 'year_opened': 2016},
'Scotiabank Saddledome': {'city': 'Calgary', 'state': 'AB', 'lat': 51.0374, 'lng': -114.0519, 'capacity': 19289, 'teams': ['CGY'], 'year_opened': 1983},
}
stadiums = []
for name, info in nhl_arenas.items():
stadium = Stadium(
id=f"nhl_{name.lower().replace(' ', '_')[:30]}",
name=name,
city=info['city'],
state=info['state'],
latitude=info['lat'],
longitude=info['lng'],
capacity=info['capacity'],
sport='NHL',
team_abbrevs=info['teams'],
source='nhl_hardcoded',
year_opened=info.get('year_opened')
)
stadiums.append(stadium)
print(f" ✓ Found {len(stadiums)} NHL arenas")
return stadiums
# =============================================================================
# SOURCE CONFIGURATIONS
# =============================================================================
NHL_GAME_SOURCES = [
ScraperSource('Hockey-Reference', scrape_nhl_hockey_reference, priority=1, min_games=100),
ScraperSource('ESPN', scrape_nhl_espn, priority=2, min_games=50),
ScraperSource('NHL API', scrape_nhl_api, priority=3, min_games=50),
]
NHL_STADIUM_SOURCES = [
StadiumScraperSource('Hardcoded', scrape_nhl_stadiums, priority=1, min_venues=25),
]
# =============================================================================
# CONVENIENCE FUNCTIONS
# =============================================================================
def scrape_nhl_games(season: int) -> list[Game]:
"""
Scrape NHL games for a season using multi-source fallback.
Args:
season: Season ending year (e.g., 2025 for 2024-25 season)
Returns:
List of Game objects from the first successful source
"""
print(f"\nNHL {get_nhl_season_string(season)} SCHEDULE")
print("-" * 40)
return scrape_with_fallback('NHL', season, NHL_GAME_SOURCES)
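`get_nhl_team_abbrev` tests the exact match and the substring match for each team in a single pass, so a partial name resolves to whichever entry comes first in `NHL_TEAMS` (for example, `'New York'` resolves to `NYI`), and an empty string matches the first entry. A hedged sketch of a stricter variant follows: it checks every exact match before any substring match and rejects ambiguous input. The function name `resolve_abbrev` and the trimmed `TEAMS` table are illustrative only.

```python
from typing import Optional

# Trimmed, illustrative subset of a team table
TEAMS = {
    'NYI': 'New York Islanders',
    'NYR': 'New York Rangers',
    'TOR': 'Toronto Maple Leafs',
}


def resolve_abbrev(team_name: str, teams: dict[str, str] = TEAMS) -> Optional[str]:
    """Return the abbreviation for team_name, or None if missing or ambiguous."""
    needle = team_name.strip().lower()
    if not needle:
        return None
    # Pass 1: an exact match always wins
    for abbrev, name in teams.items():
        if name.lower() == needle:
            return abbrev
    # Pass 2: accept a substring match only when it is unique
    hits = [abbrev for abbrev, name in teams.items() if needle in name.lower()]
    return hits[0] if len(hits) == 1 else None


print(resolve_abbrev('New York Rangers'))  # NYR
print(resolve_abbrev('Maple Leafs'))       # TOR
print(resolve_abbrev('New York'))          # None (ambiguous)
print(resolve_abbrev(''))                  # None
```

Returning `None` instead of the first three letters of the input leaves the decision to the caller, which could then flag the name for manual review rather than silently emitting a guessed abbreviation.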
-222
@@ -1,222 +0,0 @@
#!/usr/bin/env python3
"""
NWSL schedule and stadium scrapers for SportsTime.
This module provides:
- NWSL team mappings (13 teams)
- NWSL stadium scrapers (hardcoded with coordinates)
- Multi-source fallback configurations
Note: Many NWSL teams share stadiums with MLS teams.
Coordinates are cross-referenced from mls.py where applicable.
"""
from typing import Optional
import requests
# Support both direct execution and import from parent directory
try:
from core import (
Game,
Stadium,
ScraperSource,
StadiumScraperSource,
fetch_page,
scrape_with_fallback,
scrape_stadiums_with_fallback,
)
except ImportError:
from Scripts.core import (
Game,
Stadium,
ScraperSource,
StadiumScraperSource,
fetch_page,
scrape_with_fallback,
scrape_stadiums_with_fallback,
)
__all__ = [
# Team data
'NWSL_TEAMS',
# Stadium scrapers
'scrape_nwsl_stadiums_hardcoded',
'scrape_nwsl_stadiums',
# Source configurations
'NWSL_STADIUM_SOURCES',
# Convenience functions
'get_nwsl_team_abbrev',
]
# =============================================================================
# TEAM MAPPINGS
# =============================================================================
NWSL_TEAMS = {
'LA': {'name': 'Angel City FC', 'city': 'Los Angeles', 'stadium': 'BMO Stadium'},
'SJ': {'name': 'Bay FC', 'city': 'San Jose', 'stadium': 'PayPal Park'},
'CHI': {'name': 'Chicago Red Stars', 'city': 'Bridgeview', 'stadium': 'SeatGeek Stadium'},
'HOU': {'name': 'Houston Dash', 'city': 'Houston', 'stadium': 'Shell Energy Stadium'},
'KC': {'name': 'Kansas City Current', 'city': 'Kansas City', 'stadium': 'CPKC Stadium'},
'NJ': {'name': 'NJ/NY Gotham FC', 'city': 'Harrison', 'stadium': 'Red Bull Arena'},
'NC': {'name': 'North Carolina Courage', 'city': 'Cary', 'stadium': 'WakeMed Soccer Park'},
'ORL': {'name': 'Orlando Pride', 'city': 'Orlando', 'stadium': 'Inter&Co Stadium'},
'POR': {'name': 'Portland Thorns FC', 'city': 'Portland', 'stadium': 'Providence Park'},
'SEA': {'name': 'Seattle Reign FC', 'city': 'Seattle', 'stadium': 'Lumen Field'},
'SD': {'name': 'San Diego Wave FC', 'city': 'San Diego', 'stadium': 'Snapdragon Stadium'},
'UTA': {'name': 'Utah Royals FC', 'city': 'Sandy', 'stadium': 'America First Field'},
'WAS': {'name': 'Washington Spirit', 'city': 'Washington', 'stadium': 'Audi Field'},
}
def get_nwsl_team_abbrev(team_name: str) -> str:
"""Get NWSL team abbreviation from full name."""
for abbrev, info in NWSL_TEAMS.items():
if info['name'].lower() == team_name.lower():
return abbrev
if team_name.lower() in info['name'].lower():
return abbrev
# Return first 3 letters as fallback
return team_name[:3].upper()
# =============================================================================
# STADIUM SCRAPERS
# =============================================================================
def scrape_nwsl_stadiums_hardcoded() -> list[Stadium]:
"""
Source 1: Hardcoded NWSL stadiums with complete data.
All 13 NWSL stadiums with capacity (NWSL configuration) and year_opened.
Shared stadium coordinates are cross-referenced from MLS module:
- BMO Stadium (shared with LAFC)
- PayPal Park (shared with SJ Earthquakes)
- Shell Energy Stadium (shared with Houston Dynamo)
- Red Bull Arena (shared with NY Red Bulls)
- Inter&Co Stadium (shared with Orlando City SC)
- Providence Park (shared with Portland Timbers)
- Lumen Field (shared with Seattle Sounders/Seahawks)
- Snapdragon Stadium (shared with San Diego FC)
- America First Field (shared with Real Salt Lake)
- Audi Field (shared with DC United)
"""
nwsl_stadiums = {
# Shared stadiums with MLS teams (coordinates from mls.py)
'BMO Stadium': {
'city': 'Los Angeles', 'state': 'CA',
'lat': 34.0128, 'lng': -118.2841,
'capacity': 22000, 'teams': ['LA'], 'year_opened': 2018
},
'PayPal Park': {
'city': 'San Jose', 'state': 'CA',
'lat': 37.3514, 'lng': -121.9250,
'capacity': 18000, 'teams': ['SJ'], 'year_opened': 2015
},
'Shell Energy Stadium': {
'city': 'Houston', 'state': 'TX',
'lat': 29.7522, 'lng': -95.3524,
'capacity': 22039, 'teams': ['HOU'], 'year_opened': 2012
},
'Red Bull Arena': {
'city': 'Harrison', 'state': 'NJ',
'lat': 40.7367, 'lng': -74.1503,
'capacity': 25000, 'teams': ['NJ'], 'year_opened': 2010
},
'Inter&Co Stadium': {
'city': 'Orlando', 'state': 'FL',
'lat': 28.5411, 'lng': -81.3893,
'capacity': 25500, 'teams': ['ORL'], 'year_opened': 2017
},
'Providence Park': {
'city': 'Portland', 'state': 'OR',
'lat': 45.5214, 'lng': -122.6917,
'capacity': 25218, 'teams': ['POR'], 'year_opened': 1926
},
'Lumen Field': {
'city': 'Seattle', 'state': 'WA',
'lat': 47.5952, 'lng': -122.3316,
'capacity': 37722, 'teams': ['SEA'], 'year_opened': 2002
},
'Snapdragon Stadium': {
'city': 'San Diego', 'state': 'CA',
'lat': 32.7844, 'lng': -117.1228,
'capacity': 35000, 'teams': ['SD'], 'year_opened': 2022
},
'America First Field': {
'city': 'Sandy', 'state': 'UT',
'lat': 40.5829, 'lng': -111.8934,
'capacity': 20213, 'teams': ['UTA'], 'year_opened': 2008
},
'Audi Field': {
'city': 'Washington', 'state': 'DC',
'lat': 38.8684, 'lng': -77.0129,
'capacity': 20000, 'teams': ['WAS'], 'year_opened': 2018
},
# NWSL-specific stadiums
'SeatGeek Stadium': {
'city': 'Bridgeview', 'state': 'IL',
'lat': 41.7653, 'lng': -87.8049,
'capacity': 20000, 'teams': ['CHI'], 'year_opened': 2006
},
'CPKC Stadium': {
'city': 'Kansas City', 'state': 'MO',
'lat': 39.0975, 'lng': -94.5556,
'capacity': 11500, 'teams': ['KC'], 'year_opened': 2024
},
'WakeMed Soccer Park': {
'city': 'Cary', 'state': 'NC',
'lat': 35.8018, 'lng': -78.7442,
'capacity': 10000, 'teams': ['NC'], 'year_opened': 2002
},
}
stadiums = []
for name, info in nwsl_stadiums.items():
# Create normalized ID (f-strings can't have backslashes)
normalized_name = name.lower().replace(' ', '_').replace('&', 'and').replace('.', '').replace("'", '')
stadium_id = f"nwsl_{normalized_name[:30]}"
stadium = Stadium(
id=stadium_id,
name=name,
city=info['city'],
state=info['state'],
latitude=info['lat'],
longitude=info['lng'],
capacity=info['capacity'],
sport='NWSL',
team_abbrevs=info['teams'],
source='nwsl_hardcoded',
year_opened=info.get('year_opened')
)
stadiums.append(stadium)
return stadiums
def scrape_nwsl_stadiums() -> list[Stadium]:
"""
Fetch NWSL stadium data with multi-source fallback.
Hardcoded source is primary (has complete data).
"""
print("\nNWSL STADIUMS")
print("-" * 40)
sources = [
StadiumScraperSource('Hardcoded', scrape_nwsl_stadiums_hardcoded, priority=1, min_venues=10),
]
return scrape_stadiums_with_fallback('NWSL', sources)
# =============================================================================
# SOURCE CONFIGURATIONS
# =============================================================================
NWSL_STADIUM_SOURCES = [
StadiumScraperSource('Hardcoded', scrape_nwsl_stadiums_hardcoded, priority=1, min_venues=10),
]
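The stadium IDs in these modules are built in two different ways: the NFL and NHL loops only replace spaces, so a name like `M&T Bank Stadium` keeps its `&`, while the NWSL loop also rewrites `&` as `and` and drops `.` and `'`. A shared helper would keep IDs consistent across sports. The sketch below is one possible shape, assuming IDs should contain only lowercase letters, digits, and underscores; `make_stadium_id` is a hypothetical name, not a function taken from the new package.

```python
import re


def make_stadium_id(sport: str, name: str, max_len: int = 30) -> str:
    """Build an ID such as 'nfl_mandt_bank_stadium' from a sport and venue name."""
    slug = name.lower().replace('&', 'and')
    slug = re.sub(r"[.']", '', slug)          # drop periods and apostrophes
    slug = re.sub(r'[^a-z0-9]+', '_', slug)   # collapse everything else to '_'
    slug = slug.strip('_')[:max_len].rstrip('_')
    return f"{sport.lower()}_{slug}"


print(make_stadium_id('NFL', 'M&T Bank Stadium'))
print(make_stadium_id('NWSL', 'Inter&Co Stadium'))
print(make_stadium_id('NFL', "Levi's Stadium"))
print(make_stadium_id('NHL', 'Crypto.com Arena'))
```

Truncating to 30 characters, as the original loops do, can still cut a long name mid-word, so any real implementation would also need a uniqueness check across venues.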
+66
@@ -0,0 +1,66 @@
[build-system]
requires = ["setuptools>=61.0", "wheel"]
build-backend = "setuptools.build_meta"
[project]
name = "sportstime-parser"
version = "0.1.0"
description = "Sports data scraper and CloudKit uploader for SportsTime app"
readme = "README.md"
requires-python = ">=3.11"
license = {text = "MIT"}
authors = [
{name = "SportsTime Team"}
]
keywords = ["sports", "scraper", "cloudkit", "nba", "mlb", "nfl", "nhl", "mls"]
classifiers = [
"Development Status :: 3 - Alpha",
"Intended Audience :: Developers",
"License :: OSI Approved :: MIT License",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13",
]
dependencies = [
"requests>=2.31.0",
"beautifulsoup4>=4.12.0",
"lxml>=5.0.0",
"rapidfuzz>=3.5.0",
"python-dateutil>=2.8.0",
"pytz>=2024.1",
"rich>=13.7.0",
"pyjwt>=2.8.0",
"cryptography>=42.0.0",
]
[project.optional-dependencies]
dev = [
"pytest>=8.0.0",
"pytest-cov>=4.1.0",
"responses>=0.25.0",
]
[project.scripts]
sportstime-parser = "sportstime_parser.__main__:main"
[tool.setuptools.packages.find]
where = ["."]
include = ["sportstime_parser*"]
[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py"]
python_functions = ["test_*"]
addopts = "-v --tb=short"
[tool.coverage.run]
source = ["sportstime_parser"]
omit = ["tests/*"]
[tool.coverage.report]
exclude_lines = [
"pragma: no cover",
"if __name__ == .__main__.:",
"raise NotImplementedError",
]
+14 -7
@@ -1,8 +1,15 @@
-# Sports Schedule Scraper Dependencies
-requests>=2.28.0
-beautifulsoup4>=4.11.0
-pandas>=2.0.0
-lxml>=4.9.0
+# Core dependencies
+requests>=2.31.0
+beautifulsoup4>=4.12.0
+lxml>=5.0.0
+rapidfuzz>=3.5.0
+python-dateutil>=2.8.0
+pytz>=2024.1
+rich>=13.7.0
+pyjwt>=2.8.0
+cryptography>=42.0.0
-# CloudKit Import (optional - only needed for cloudkit_import.py)
-cryptography>=41.0.0
+# Development dependencies
+pytest>=8.0.0
+pytest-cov>=4.1.0
+responses>=0.25.0
-517
@@ -1,517 +0,0 @@
#!/usr/bin/env python3
"""
SportsTime Canonicalization Pipeline
====================================
Master script that orchestrates all data canonicalization steps.
This is the NEW pipeline that performs local identity resolution
BEFORE any CloudKit upload.
Pipeline Stages:
1. SCRAPE: Fetch raw data from web sources
2. CANONICALIZE STADIUMS: Generate canonical stadium IDs and aliases
3. CANONICALIZE TEAMS: Match teams to stadiums, generate canonical IDs
4. CANONICALIZE GAMES: Resolve all references, generate canonical IDs
5. VALIDATE: Verify all data is internally consistent
6. (Optional) UPLOAD: CloudKit upload (separate script)
Usage:
python run_canonicalization_pipeline.py # Full pipeline
python run_canonicalization_pipeline.py --season 2026 # Specify season
python run_canonicalization_pipeline.py --skip-scrape # Use existing raw data
python run_canonicalization_pipeline.py --verbose # Detailed output
"""
import argparse
import json
import sys
from datetime import datetime
from pathlib import Path
from dataclasses import dataclass, asdict
# Import from core module
from core import (
ScraperSource, scrape_with_fallback,
assign_stable_ids, export_to_json,
)
# Import from sport modules
from nba import scrape_nba_basketball_reference, scrape_nba_espn, scrape_nba_cbssports
from mlb import scrape_mlb_statsapi, scrape_mlb_baseball_reference, scrape_mlb_espn
from nhl import scrape_nhl_hockey_reference, scrape_nhl_espn, scrape_nhl_api
from nfl import scrape_nfl_espn, scrape_nfl_pro_football_reference, scrape_nfl_cbssports
# Import secondary sports from scrape_schedules (stubs)
from scrape_schedules import (
# WNBA sources
scrape_wnba_espn, scrape_wnba_basketball_reference, scrape_wnba_cbssports,
# MLS sources
scrape_mls_espn, scrape_mls_fbref, scrape_mls_mlssoccer,
# NWSL sources
scrape_nwsl_espn, scrape_nwsl_fbref, scrape_nwsl_nwslsoccer,
# Utilities
generate_stadiums_from_teams,
)
from canonicalize_stadiums import (
canonicalize_stadiums,
add_historical_aliases,
deduplicate_aliases,
)
from canonicalize_teams import canonicalize_all_teams
from canonicalize_games import canonicalize_games
from validate_canonical import validate_canonical_data
@dataclass
class PipelineResult:
    """Result of the full canonicalization pipeline."""
    success: bool
    stadiums_count: int
    teams_count: int
    games_count: int
    aliases_count: int
    validation_errors: int
    validation_warnings: int
    duration_seconds: float
    output_dir: str


def print_header(text: str):
    """Print a formatted header."""
    print()
    print("=" * 70)
    print(f" {text}")
    print("=" * 70)


def print_section(text: str):
    """Print a section header."""
    print()
    print(f"--- {text} ---")
def run_pipeline(
season: int = 2026,
output_dir: Path = Path('./data'),
skip_scrape: bool = False,
validate: bool = True,
verbose: bool = False,
) -> PipelineResult:
"""
Run the complete canonicalization pipeline.
Args:
season: Season year (e.g., 2026)
output_dir: Directory for output files
skip_scrape: Skip scraping, use existing raw data
validate: Run validation step
verbose: Print detailed output
Returns:
PipelineResult with statistics
"""
start_time = datetime.now()
output_dir.mkdir(parents=True, exist_ok=True)
# =========================================================================
# STAGE 1: SCRAPE RAW DATA
# =========================================================================
if not skip_scrape:
print_header("STAGE 1: SCRAPING RAW DATA")
all_games = []
all_stadiums = []
# Scrape stadiums from team mappings
print_section("Stadiums")
all_stadiums = generate_stadiums_from_teams()
print(f" Generated {len(all_stadiums)} stadiums from team data")
# Scrape all sports with multi-source fallback
print_section(f"NBA {season}")
nba_sources = [
ScraperSource('Basketball-Reference', scrape_nba_basketball_reference, priority=1, min_games=500),
ScraperSource('ESPN', scrape_nba_espn, priority=2, min_games=500),
ScraperSource('CBS Sports', scrape_nba_cbssports, priority=3, min_games=100),
]
nba_games = scrape_with_fallback('NBA', season, nba_sources)
nba_season = f"{season-1}-{str(season)[2:]}"
nba_games = assign_stable_ids(nba_games, 'NBA', nba_season)
all_games.extend(nba_games)
print_section(f"MLB {season}")
mlb_sources = [
ScraperSource('MLB Stats API', scrape_mlb_statsapi, priority=1, min_games=1000),
ScraperSource('Baseball-Reference', scrape_mlb_baseball_reference, priority=2, min_games=500),
ScraperSource('ESPN', scrape_mlb_espn, priority=3, min_games=500),
]
mlb_games = scrape_with_fallback('MLB', season, mlb_sources)
mlb_games = assign_stable_ids(mlb_games, 'MLB', str(season))
all_games.extend(mlb_games)
print_section(f"NHL {season}")
nhl_sources = [
ScraperSource('Hockey-Reference', scrape_nhl_hockey_reference, priority=1, min_games=500),
ScraperSource('ESPN', scrape_nhl_espn, priority=2, min_games=500),
ScraperSource('NHL API', scrape_nhl_api, priority=3, min_games=100),
]
nhl_games = scrape_with_fallback('NHL', season, nhl_sources)
nhl_season = f"{season-1}-{str(season)[2:]}"
nhl_games = assign_stable_ids(nhl_games, 'NHL', nhl_season)
all_games.extend(nhl_games)
print_section(f"NFL {season}")
nfl_sources = [
ScraperSource('ESPN', scrape_nfl_espn, priority=1, min_games=200),
ScraperSource('Pro-Football-Reference', scrape_nfl_pro_football_reference, priority=2, min_games=200),
ScraperSource('CBS Sports', scrape_nfl_cbssports, priority=3, min_games=100),
]
nfl_games = scrape_with_fallback('NFL', season, nfl_sources)
nfl_season = f"{season-1}-{str(season)[2:]}"
nfl_games = assign_stable_ids(nfl_games, 'NFL', nfl_season)
all_games.extend(nfl_games)
print_section(f"WNBA {season}")
wnba_sources = [
ScraperSource('ESPN', scrape_wnba_espn, priority=1, min_games=100),
ScraperSource('Basketball-Reference', scrape_wnba_basketball_reference, priority=2, min_games=100),
ScraperSource('CBS Sports', scrape_wnba_cbssports, priority=3, min_games=50),
]
wnba_games = scrape_with_fallback('WNBA', season, wnba_sources)
wnba_games = assign_stable_ids(wnba_games, 'WNBA', str(season))
all_games.extend(wnba_games)
print_section(f"MLS {season}")
mls_sources = [
ScraperSource('ESPN', scrape_mls_espn, priority=1, min_games=200),
ScraperSource('FBref', scrape_mls_fbref, priority=2, min_games=100),
ScraperSource('MLSSoccer.com', scrape_mls_mlssoccer, priority=3, min_games=100),
]
mls_games = scrape_with_fallback('MLS', season, mls_sources)
mls_games = assign_stable_ids(mls_games, 'MLS', str(season))
all_games.extend(mls_games)
print_section(f"NWSL {season}")
nwsl_sources = [
ScraperSource('ESPN', scrape_nwsl_espn, priority=1, min_games=100),
ScraperSource('FBref', scrape_nwsl_fbref, priority=2, min_games=50),
ScraperSource('NWSL.com', scrape_nwsl_nwslsoccer, priority=3, min_games=50),
]
nwsl_games = scrape_with_fallback('NWSL', season, nwsl_sources)
nwsl_games = assign_stable_ids(nwsl_games, 'NWSL', str(season))
all_games.extend(nwsl_games)
# Export raw data
print_section("Exporting Raw Data")
export_to_json(all_games, all_stadiums, output_dir)
print(f" Exported to {output_dir}")
raw_games = [g.__dict__ for g in all_games]
raw_stadiums = [s.__dict__ for s in all_stadiums]
else:
print_header("LOADING EXISTING RAW DATA")
# Try loading from new structure first (games/*.json)
games_dir = output_dir / 'games'
raw_games = []
if games_dir.exists() and any(games_dir.glob('*.json')):
print_section("Loading from games/ directory")
for games_file in sorted(games_dir.glob('*.json')):
with open(games_file) as f:
file_games = json.load(f)
raw_games.extend(file_games)
print(f" Loaded {len(file_games):,} games from {games_file.name}")
else:
# Fallback to legacy games.json
print_section("Loading from legacy games.json")
games_file = output_dir / 'games.json'
with open(games_file) as f:
raw_games = json.load(f)
print(f" Total: {len(raw_games):,} raw games")
# Try loading stadiums from canonical/ first, then legacy
canonical_dir = output_dir / 'canonical'
if (canonical_dir / 'stadiums.json').exists():
with open(canonical_dir / 'stadiums.json') as f:
raw_stadiums = json.load(f)
print(f" Loaded {len(raw_stadiums)} raw stadiums from canonical/stadiums.json")
else:
with open(output_dir / 'stadiums.json') as f:
raw_stadiums = json.load(f)
print(f" Loaded {len(raw_stadiums)} raw stadiums from stadiums.json")
# =========================================================================
# STAGE 2: CANONICALIZE STADIUMS
# =========================================================================
print_header("STAGE 2: CANONICALIZING STADIUMS")
canonical_stadiums, stadium_aliases = canonicalize_stadiums(
raw_stadiums, verbose=verbose
)
print(f" Created {len(canonical_stadiums)} canonical stadiums")
# Add historical aliases
canonical_ids = {s.canonical_id for s in canonical_stadiums}
stadium_aliases = add_historical_aliases(stadium_aliases, canonical_ids)
stadium_aliases = deduplicate_aliases(stadium_aliases)
print(f" Created {len(stadium_aliases)} stadium aliases")
# Export
stadiums_canonical_path = output_dir / 'stadiums_canonical.json'
aliases_path = output_dir / 'stadium_aliases.json'
with open(stadiums_canonical_path, 'w') as f:
json.dump([asdict(s) for s in canonical_stadiums], f, indent=2)
with open(aliases_path, 'w') as f:
json.dump([asdict(a) for a in stadium_aliases], f, indent=2)
print(f" Exported to {stadiums_canonical_path}")
print(f" Exported to {aliases_path}")
# =========================================================================
# STAGE 3: CANONICALIZE TEAMS
# =========================================================================
print_header("STAGE 3: CANONICALIZING TEAMS")
# Convert canonical stadiums to dicts for team matching
stadiums_list = [asdict(s) for s in canonical_stadiums]
canonical_teams, team_warnings = canonicalize_all_teams(
stadiums_list, verbose=verbose
)
print(f" Created {len(canonical_teams)} canonical teams")
if team_warnings:
print(f" Warnings: {len(team_warnings)}")
if verbose:
for w in team_warnings:
print(f" - {w.team_canonical_id}: {w.issue}")
# Export
teams_canonical_path = output_dir / 'teams_canonical.json'
with open(teams_canonical_path, 'w') as f:
json.dump([asdict(t) for t in canonical_teams], f, indent=2)
print(f" Exported to {teams_canonical_path}")
# =========================================================================
# STAGE 4: CANONICALIZE GAMES
# =========================================================================
print_header("STAGE 4: CANONICALIZING GAMES")
# Convert data to dicts for game canonicalization
teams_list = [asdict(t) for t in canonical_teams]
aliases_list = [asdict(a) for a in stadium_aliases]
canonical_games_list, game_warnings = canonicalize_games(
raw_games, teams_list, aliases_list, verbose=verbose
)
print(f" Created {len(canonical_games_list)} canonical games")
if game_warnings:
print(f" Warnings: {len(game_warnings)}")
if verbose:
from collections import defaultdict
by_issue = defaultdict(int)
for w in game_warnings:
by_issue[w.issue] += 1
for issue, count in by_issue.items():
print(f" - {issue}: {count}")
# Export games to new structure: canonical/games/{sport}_{season}.json
canonical_games_dir = output_dir / 'canonical' / 'games'
canonical_games_dir.mkdir(parents=True, exist_ok=True)
# Group games by sport and season
games_by_sport_season = {}
for game in canonical_games_list:
sport = game.sport.lower()
season = game.season
key = f"{sport}_{season}"
if key not in games_by_sport_season:
games_by_sport_season[key] = []
games_by_sport_season[key].append(game)
# Export each sport/season file
for key, sport_games in sorted(games_by_sport_season.items()):
filepath = canonical_games_dir / f"{key}.json"
with open(filepath, 'w') as f:
json.dump([asdict(g) for g in sport_games], f, indent=2)
print(f" Exported {len(sport_games):,} games to canonical/games/{key}.json")
# Also export combined games_canonical.json for backward compatibility
games_canonical_path = output_dir / 'games_canonical.json'
with open(games_canonical_path, 'w') as f:
json.dump([asdict(g) for g in canonical_games_list], f, indent=2)
print(f" Exported combined to {games_canonical_path}")
# =========================================================================
# STAGE 5: VALIDATE
# =========================================================================
validation_result = None
if validate:
print_header("STAGE 5: VALIDATION")
# Reload as dicts for validation
canonical_stadiums_dicts = [asdict(s) for s in canonical_stadiums]
canonical_teams_dicts = [asdict(t) for t in canonical_teams]
canonical_games_dicts = [asdict(g) for g in canonical_games_list]
aliases_dicts = [asdict(a) for a in stadium_aliases]
validation_result = validate_canonical_data(
canonical_stadiums_dicts,
canonical_teams_dicts,
canonical_games_dicts,
aliases_dicts,
verbose=verbose
)
if validation_result.is_valid:
print(f" STATUS: PASSED")
else:
print(f" STATUS: FAILED")
print(f" Errors: {validation_result.error_count}")
print(f" Warnings: {validation_result.warning_count}")
# Export validation report
validation_path = output_dir / 'canonicalization_validation.json'
with open(validation_path, 'w') as f:
json.dump({
'is_valid': validation_result.is_valid,
'error_count': validation_result.error_count,
'warning_count': validation_result.warning_count,
'summary': validation_result.summary,
'errors': validation_result.errors[:100], # Limit to 100 for readability
}, f, indent=2)
print(f" Report exported to {validation_path}")
# =========================================================================
# SUMMARY
# =========================================================================
duration = (datetime.now() - start_time).total_seconds()
print_header("PIPELINE COMPLETE")
print()
print(f" Duration: {duration:.1f} seconds")
print(f" Stadiums: {len(canonical_stadiums)}")
print(f" Teams: {len(canonical_teams)}")
print(f" Games: {len(canonical_games_list)}")
print(f" Aliases: {len(stadium_aliases)}")
print()
# Games by sport
print(" Games by sport:")
by_sport = {}
for g in canonical_games_list:
by_sport[g.sport] = by_sport.get(g.sport, 0) + 1
for sport, count in sorted(by_sport.items()):
print(f" {sport}: {count:,} games")
print()
print(" Output files:")
print(f" - {output_dir / 'stadiums_canonical.json'}")
print(f" - {output_dir / 'stadium_aliases.json'}")
print(f" - {output_dir / 'teams_canonical.json'}")
print(f" - {output_dir / 'games_canonical.json'} (combined)")
print(f" - {output_dir / 'canonical' / 'games' / '*.json'} (by sport/season)")
print(f" - {output_dir / 'canonicalization_validation.json'}")
print()
# Final status
success = True
if validation_result and not validation_result.is_valid:
success = False
print(" PIPELINE FAILED - Validation errors detected")
print(" CloudKit upload should NOT proceed until errors are fixed")
else:
print(" PIPELINE SUCCEEDED - Ready for CloudKit upload")
print()
return PipelineResult(
success=success,
stadiums_count=len(canonical_stadiums),
teams_count=len(canonical_teams),
games_count=len(canonical_games_list),
aliases_count=len(stadium_aliases),
validation_errors=validation_result.error_count if validation_result else 0,
validation_warnings=validation_result.warning_count if validation_result else 0,
duration_seconds=duration,
output_dir=str(output_dir),
)
def main():
parser = argparse.ArgumentParser(
description='SportsTime Canonicalization Pipeline',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Pipeline Stages:
1. SCRAPE: Fetch raw data from web sources
2. CANONICALIZE STADIUMS: Generate canonical IDs and aliases
3. CANONICALIZE TEAMS: Match teams to stadiums
4. CANONICALIZE GAMES: Resolve all references
5. VALIDATE: Verify internal consistency
Examples:
python run_canonicalization_pipeline.py # Full pipeline
python run_canonicalization_pipeline.py --season 2026 # Specify season
python run_canonicalization_pipeline.py --skip-scrape # Use existing raw data
python run_canonicalization_pipeline.py --verbose # Show all details
"""
)
parser.add_argument(
'--season', type=int, default=2026,
help='Season year (default: 2026)'
)
parser.add_argument(
'--output', type=str, default='./data',
help='Output directory (default: ./data)'
)
parser.add_argument(
'--skip-scrape', action='store_true',
help='Skip scraping, use existing raw data files'
)
parser.add_argument(
'--no-validate', action='store_true',
help='Skip validation step'
)
parser.add_argument(
'--verbose', '-v', action='store_true',
help='Verbose output'
)
parser.add_argument(
'--strict', action='store_true',
help='Exit with error code if validation fails'
)
args = parser.parse_args()
result = run_pipeline(
season=args.season,
output_dir=Path(args.output),
skip_scrape=args.skip_scrape,
validate=not args.no_validate,
verbose=args.verbose,
)
# Exit with error code if requested and validation failed
if args.strict and not result.success:
sys.exit(1)
if __name__ == '__main__':
main()
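Two small conventions in the removed pipeline are easy to miss: NBA, NHL, and NFL seasons are labelled with a split year (`f"{season-1}-{str(season)[2:]}"`), and canonical games are written to one file per `{sport}_{season}` key. The sketch below reproduces both. The helper names `split_season_label` and `group_by_sport_season` are hypothetical, and games are represented as plain dicts here, whereas the original operates on game objects with `sport` and `season` attributes.

```python
from collections import defaultdict


def split_season_label(season: int) -> str:
    """Label a season that spans two calendar years by its ending year."""
    return f"{season - 1}-{str(season)[2:]}"


def group_by_sport_season(games: list[dict]) -> dict[str, list[dict]]:
    """Bucket games under a lowercase '{sport}_{season}' key, one bucket per output file."""
    grouped = defaultdict(list)
    for game in games:
        grouped[f"{game['sport'].lower()}_{game['season']}"].append(game)
    return dict(grouped)


games = [
    {"sport": "NBA", "season": split_season_label(2026)},
    {"sport": "MLB", "season": "2026"},
    {"sport": "NBA", "season": split_season_label(2026)},
]
grouped = group_by_sport_season(games)
```

Each key in `grouped` corresponds to a file name such as `nba_2025-26.json` under `canonical/games/`.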
-523
@@ -1,523 +0,0 @@
#!/usr/bin/env python3
"""
SportsTime Data Pipeline
========================
Master script that orchestrates all data fetching, validation, and reporting.
Usage:
python run_pipeline.py # Full pipeline with defaults
python run_pipeline.py --season 2026 # Specify season
python run_pipeline.py --sport nba # Single sport only
python run_pipeline.py --skip-scrape # Validate existing data only
python run_pipeline.py --verbose # Detailed output
"""
import argparse
import json
import sys
from datetime import datetime
from pathlib import Path
from dataclasses import dataclass
from typing import Optional
from enum import Enum
# Import from core module
from core import (
Game, Stadium, ScraperSource, scrape_with_fallback,
assign_stable_ids, export_to_json,
)
# Import from sport modules
from nba import scrape_nba_basketball_reference, scrape_nba_espn, scrape_nba_cbssports
from mlb import scrape_mlb_statsapi, scrape_mlb_baseball_reference, scrape_mlb_espn
from nhl import scrape_nhl_hockey_reference, scrape_nhl_espn, scrape_nhl_api
from nfl import scrape_nfl_espn, scrape_nfl_pro_football_reference, scrape_nfl_cbssports
# Import secondary sports from scrape_schedules (stubs)
from scrape_schedules import (
# WNBA sources
scrape_wnba_espn, scrape_wnba_basketball_reference, scrape_wnba_cbssports,
# MLS sources
scrape_mls_espn, scrape_mls_fbref, scrape_mls_mlssoccer,
# NWSL sources
scrape_nwsl_espn, scrape_nwsl_fbref, scrape_nwsl_nwslsoccer,
# Utilities
scrape_all_stadiums,
)
from validate_data import (
validate_games,
validate_stadiums,
scrape_mlb_all_sources,
scrape_nba_all_sources,
scrape_nhl_all_sources,
ValidationReport,
)
class Severity(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass
class PipelineResult:
    success: bool
    games_scraped: int
    stadiums_scraped: int
    games_by_sport: dict
    validation_reports: list
    stadium_issues: list
    high_severity_count: int
    medium_severity_count: int
    low_severity_count: int
    output_dir: Path
    duration_seconds: float


def print_header(text: str):
    """Print a formatted header."""
    print()
    print("=" * 70)
    print(f" {text}")
    print("=" * 70)


def print_section(text: str):
    """Print a section header."""
    print()
    print(f"--- {text} ---")


def print_severity(severity: str, message: str):
    """Print a message with severity indicator."""
    icons = {
        'high': '🔴 HIGH',
        'medium': '🟡 MEDIUM',
        'low': '🟢 LOW',
    }
    print(f" {icons.get(severity, '')} {message}")
def run_pipeline(
season: int = 2025,
sport: str = 'all',
output_dir: Path = Path('./data'),
skip_scrape: bool = False,
validate: bool = True,
verbose: bool = False,
) -> PipelineResult:
"""
Run the complete data pipeline.
"""
start_time = datetime.now()
all_games = []
all_stadiums = []
games_by_sport = {}
validation_reports = []
stadium_issues = []
output_dir.mkdir(parents=True, exist_ok=True)
# =========================================================================
# PHASE 1: SCRAPE DATA
# =========================================================================
if not skip_scrape:
print_header("PHASE 1: SCRAPING DATA")
# Scrape stadiums
print_section("Stadiums")
all_stadiums = scrape_all_stadiums()
print(f" Generated {len(all_stadiums)} stadiums from team data")
# Scrape by sport with multi-source fallback
if sport in ['nba', 'all']:
print_section(f"NBA {season}")
nba_sources = [
ScraperSource('Basketball-Reference', scrape_nba_basketball_reference, priority=1, min_games=500),
ScraperSource('ESPN', scrape_nba_espn, priority=2, min_games=500),
ScraperSource('CBS Sports', scrape_nba_cbssports, priority=3, min_games=100),
]
nba_games = scrape_with_fallback('NBA', season, nba_sources)
nba_season = f"{season-1}-{str(season)[2:]}"
nba_games = assign_stable_ids(nba_games, 'NBA', nba_season)
all_games.extend(nba_games)
games_by_sport['NBA'] = len(nba_games)
if sport in ['mlb', 'all']:
print_section(f"MLB {season}")
mlb_sources = [
ScraperSource('MLB Stats API', scrape_mlb_statsapi, priority=1, min_games=1000),
ScraperSource('Baseball-Reference', scrape_mlb_baseball_reference, priority=2, min_games=500),
ScraperSource('ESPN', scrape_mlb_espn, priority=3, min_games=500),
]
mlb_games = scrape_with_fallback('MLB', season, mlb_sources)
mlb_games = assign_stable_ids(mlb_games, 'MLB', str(season))
all_games.extend(mlb_games)
games_by_sport['MLB'] = len(mlb_games)
if sport in ['nhl', 'all']:
print_section(f"NHL {season}")
nhl_sources = [
ScraperSource('Hockey-Reference', scrape_nhl_hockey_reference, priority=1, min_games=500),
ScraperSource('ESPN', scrape_nhl_espn, priority=2, min_games=500),
ScraperSource('NHL API', scrape_nhl_api, priority=3, min_games=100),
]
nhl_games = scrape_with_fallback('NHL', season, nhl_sources)
nhl_season = f"{season-1}-{str(season)[2:]}"
nhl_games = assign_stable_ids(nhl_games, 'NHL', nhl_season)
all_games.extend(nhl_games)
games_by_sport['NHL'] = len(nhl_games)
if sport in ['nfl', 'all']:
print_section(f"NFL {season}")
nfl_sources = [
ScraperSource('ESPN', scrape_nfl_espn, priority=1, min_games=200),
ScraperSource('Pro-Football-Reference', scrape_nfl_pro_football_reference, priority=2, min_games=200),
ScraperSource('CBS Sports', scrape_nfl_cbssports, priority=3, min_games=100),
]
nfl_games = scrape_with_fallback('NFL', season, nfl_sources)
nfl_season = f"{season-1}-{str(season)[2:]}"
nfl_games = assign_stable_ids(nfl_games, 'NFL', nfl_season)
all_games.extend(nfl_games)
games_by_sport['NFL'] = len(nfl_games)
if sport in ['wnba', 'all']:
print_section(f"WNBA {season}")
wnba_sources = [
ScraperSource('ESPN', scrape_wnba_espn, priority=1, min_games=100),
ScraperSource('Basketball-Reference', scrape_wnba_basketball_reference, priority=2, min_games=100),
ScraperSource('CBS Sports', scrape_wnba_cbssports, priority=3, min_games=50),
]
wnba_games = scrape_with_fallback('WNBA', season, wnba_sources)
wnba_games = assign_stable_ids(wnba_games, 'WNBA', str(season))
all_games.extend(wnba_games)
games_by_sport['WNBA'] = len(wnba_games)
if sport in ['mls', 'all']:
print_section(f"MLS {season}")
mls_sources = [
ScraperSource('ESPN', scrape_mls_espn, priority=1, min_games=200),
ScraperSource('FBref', scrape_mls_fbref, priority=2, min_games=100),
ScraperSource('MLSSoccer.com', scrape_mls_mlssoccer, priority=3, min_games=100),
]
mls_games = scrape_with_fallback('MLS', season, mls_sources)
mls_games = assign_stable_ids(mls_games, 'MLS', str(season))
all_games.extend(mls_games)
games_by_sport['MLS'] = len(mls_games)
if sport in ['nwsl', 'all']:
print_section(f"NWSL {season}")
nwsl_sources = [
ScraperSource('ESPN', scrape_nwsl_espn, priority=1, min_games=100),
ScraperSource('FBref', scrape_nwsl_fbref, priority=2, min_games=50),
ScraperSource('NWSL.com', scrape_nwsl_nwslsoccer, priority=3, min_games=50),
]
nwsl_games = scrape_with_fallback('NWSL', season, nwsl_sources)
nwsl_games = assign_stable_ids(nwsl_games, 'NWSL', str(season))
all_games.extend(nwsl_games)
games_by_sport['NWSL'] = len(nwsl_games)
# Export data
print_section("Exporting Data")
export_to_json(all_games, all_stadiums, output_dir)
print(f" Exported to {output_dir}")
else:
# Load existing data
print_header("LOADING EXISTING DATA")
games_file = output_dir / 'games.json'
stadiums_file = output_dir / 'stadiums.json'
if games_file.exists():
with open(games_file) as f:
games_data = json.load(f)
all_games = [Game(**g) for g in games_data]
for g in all_games:
games_by_sport[g.sport] = games_by_sport.get(g.sport, 0) + 1
print(f" Loaded {len(all_games)} games")
if stadiums_file.exists():
with open(stadiums_file) as f:
stadiums_data = json.load(f)
all_stadiums = [Stadium(**s) for s in stadiums_data]
print(f" Loaded {len(all_stadiums)} stadiums")
# =========================================================================
# PHASE 2: VALIDATE DATA
# =========================================================================
if validate:
print_header("PHASE 2: CROSS-VALIDATION")
# MLB validation (has two good sources)
if sport in ['mlb', 'all']:
print_section("MLB Cross-Validation")
try:
mlb_sources = scrape_mlb_all_sources(season)
source_names = list(mlb_sources.keys())
if len(source_names) >= 2:
games1 = mlb_sources[source_names[0]]
games2 = mlb_sources[source_names[1]]
if games1 and games2:
report = validate_games(
games1, games2,
source_names[0], source_names[1],
'MLB', str(season)
)
validation_reports.append(report)
print(f" Sources: {source_names[0]} vs {source_names[1]}")
print(f" Games compared: {report.total_games_source1} vs {report.total_games_source2}")
print(f" Matched: {report.games_matched}")
print(f" Discrepancies: {len(report.discrepancies)}")
except Exception as e:
print(f" Error during MLB validation: {e}")
# Stadium validation
print_section("Stadium Validation")
stadium_issues = validate_stadiums(all_stadiums)
print(f" Issues found: {len(stadium_issues)}")
# Data quality checks
print_section("Data Quality Checks")
# Check game counts per team
if sport in ['nba', 'all']:
nba_games = [g for g in all_games if g.sport == 'NBA']
team_counts = {}
for g in nba_games:
team_counts[g.home_team_abbrev] = team_counts.get(g.home_team_abbrev, 0) + 1
team_counts[g.away_team_abbrev] = team_counts.get(g.away_team_abbrev, 0) + 1
for team, count in sorted(team_counts.items()):
if count < 75 or count > 90:
print(f" NBA: {team} has {count} games (expected ~82)")
if sport in ['nhl', 'all']:
nhl_games = [g for g in all_games if g.sport == 'NHL']
team_counts = {}
for g in nhl_games:
team_counts[g.home_team_abbrev] = team_counts.get(g.home_team_abbrev, 0) + 1
team_counts[g.away_team_abbrev] = team_counts.get(g.away_team_abbrev, 0) + 1
for team, count in sorted(team_counts.items()):
if count < 75 or count > 90:
print(f" NHL: {team} has {count} games (expected ~82)")
if sport in ['nfl', 'all']:
nfl_games = [g for g in all_games if g.sport == 'NFL']
team_counts = {}
for g in nfl_games:
team_counts[g.home_team_abbrev] = team_counts.get(g.home_team_abbrev, 0) + 1
team_counts[g.away_team_abbrev] = team_counts.get(g.away_team_abbrev, 0) + 1
for team, count in sorted(team_counts.items()):
if count < 15 or count > 20:
print(f" NFL: {team} has {count} games (expected ~17)")
# =========================================================================
# PHASE 3: GENERATE REPORT
# =========================================================================
print_header("PHASE 3: DISCREPANCY REPORT")
# Count by severity
high_count = 0
medium_count = 0
low_count = 0
# Game discrepancies
for report in validation_reports:
for d in report.discrepancies:
if d.severity == 'high':
high_count += 1
elif d.severity == 'medium':
medium_count += 1
else:
low_count += 1
# Stadium issues
for issue in stadium_issues:
if issue['severity'] == 'high':
high_count += 1
elif issue['severity'] == 'medium':
medium_count += 1
else:
low_count += 1
# Print summary
print()
print(f" 🔴 HIGH severity: {high_count}")
print(f" 🟡 MEDIUM severity: {medium_count}")
print(f" 🟢 LOW severity: {low_count}")
print()
# Print high severity issues (always)
if high_count > 0:
print_section("HIGH Severity Issues (Requires Attention)")
shown = 0
max_show = 10 if not verbose else 50
for report in validation_reports:
for d in report.discrepancies:
if d.severity == 'high' and shown < max_show:
print_severity('high', f"[{report.sport}] {d.field}: {d.game_key}")
if verbose:
print(f" {d.source1}: {d.value1}")
print(f" {d.source2}: {d.value2}")
shown += 1
for issue in stadium_issues:
if issue['severity'] == 'high' and shown < max_show:
print_severity('high', f"[Stadium] {issue['stadium']}: {issue['issue']}")
shown += 1
if high_count > max_show:
print(f" ... and {high_count - max_show} more (use --verbose to see all)")
# Print medium severity if verbose
if medium_count > 0 and verbose:
print_section("MEDIUM Severity Issues")
for report in validation_reports:
for d in report.discrepancies:
if d.severity == 'medium':
print_severity('medium', f"[{report.sport}] {d.field}: {d.game_key}")
for issue in stadium_issues:
if issue['severity'] == 'medium':
print_severity('medium', f"[Stadium] {issue['stadium']}: {issue['issue']}")
# Save full report
report_path = output_dir / 'pipeline_report.json'
full_report = {
'generated_at': datetime.now().isoformat(),
'season': season,
'sport': sport,
'summary': {
'games_scraped': len(all_games),
'stadiums_scraped': len(all_stadiums),
'games_by_sport': games_by_sport,
'high_severity': high_count,
'medium_severity': medium_count,
'low_severity': low_count,
},
'game_validations': [r.to_dict() for r in validation_reports],
'stadium_issues': stadium_issues,
}
with open(report_path, 'w') as f:
json.dump(full_report, f, indent=2)
# =========================================================================
# FINAL SUMMARY
# =========================================================================
duration = (datetime.now() - start_time).total_seconds()
print_header("PIPELINE COMPLETE")
print()
print(f" Duration: {duration:.1f} seconds")
print(f" Games: {len(all_games):,}")
print(f" Stadiums: {len(all_stadiums)}")
print(f" Output: {output_dir.absolute()}")
print()
for sport_name, count in sorted(games_by_sport.items()):
print(f" {sport_name}: {count:,} games")
print()
print(f" Reports saved to:")
print(f" - {output_dir / 'games.json'}")
print(f" - {output_dir / 'stadiums.json'}")
print(f" - {output_dir / 'pipeline_report.json'}")
print()
# Status indicator
if high_count > 0:
print(" ⚠️ STATUS: Review required - high severity issues found")
elif medium_count > 0:
print(" ✓ STATUS: Complete with warnings")
else:
print(" ✅ STATUS: All checks passed")
print()
return PipelineResult(
success=high_count == 0,
games_scraped=len(all_games),
stadiums_scraped=len(all_stadiums),
games_by_sport=games_by_sport,
validation_reports=validation_reports,
stadium_issues=stadium_issues,
high_severity_count=high_count,
medium_severity_count=medium_count,
low_severity_count=low_count,
output_dir=output_dir,
duration_seconds=duration,
)
def main():
parser = argparse.ArgumentParser(
description='SportsTime Data Pipeline - Fetch, validate, and report on sports data',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
python run_pipeline.py # Full pipeline
python run_pipeline.py --season 2026 # Different season
python run_pipeline.py --sport mlb # MLB only
python run_pipeline.py --skip-scrape # Validate existing data
python run_pipeline.py --verbose # Show all issues
"""
)
parser.add_argument(
'--season', type=int, default=2025,
help='Season year (default: 2025)'
)
parser.add_argument(
'--sport', choices=['nba', 'mlb', 'nhl', 'nfl', 'wnba', 'mls', 'nwsl', 'all'], default='all',
help='Sport to process (default: all)'
)
parser.add_argument(
'--output', type=str, default='./data',
help='Output directory (default: ./data)'
)
parser.add_argument(
'--skip-scrape', action='store_true',
help='Skip scraping, validate existing data only'
)
parser.add_argument(
'--no-validate', action='store_true',
help='Skip validation step'
)
parser.add_argument(
'--verbose', '-v', action='store_true',
help='Verbose output with all issues'
)
args = parser.parse_args()
result = run_pipeline(
season=args.season,
sport=args.sport,
output_dir=Path(args.output),
skip_scrape=args.skip_scrape,
validate=not args.no_validate,
verbose=args.verbose,
)
# Exit with error code if high severity issues
sys.exit(0 if result.success else 1)
if __name__ == '__main__':
main()
-527
@@ -1,527 +0,0 @@
#!/usr/bin/env python3
"""
Sports Schedule Scraper Orchestrator
This script coordinates scraping across sport-specific modules:
- core.py: Shared utilities, data classes, fallback system
- mlb.py: MLB scrapers
- nba.py: NBA scrapers
- nhl.py: NHL scrapers
- nfl.py: NFL scrapers
- mls.py: MLS stadiums
- wnba.py: WNBA stadiums
- nwsl.py: NWSL stadiums
Usage:
python scrape_schedules.py --sport nba --season 2026
python scrape_schedules.py --sport all --season 2026
python scrape_schedules.py --stadiums-only
"""
import argparse
import csv
import json
import time
from collections import defaultdict
from dataclasses import asdict
from datetime import datetime
from io import StringIO
from pathlib import Path
from typing import Optional
import requests
# Import from core module
from core import (
Game,
Stadium,
ScraperSource,
StadiumScraperSource,
fetch_page,
scrape_with_fallback,
scrape_stadiums_with_fallback,
assign_stable_ids,
export_to_json,
)
# Import from sport modules (core 4 sports)
from mlb import (
scrape_mlb_games,
scrape_mlb_stadiums,
MLB_TEAMS,
)
from nba import (
scrape_nba_games,
scrape_nba_stadiums,
get_nba_season_string,
NBA_TEAMS,
)
from nhl import (
scrape_nhl_games,
scrape_nhl_stadiums,
get_nhl_season_string,
NHL_TEAMS,
)
from nfl import (
scrape_nfl_games,
scrape_nfl_stadiums,
get_nfl_season_string,
NFL_TEAMS,
)
from mls import (
MLS_TEAMS,
get_mls_team_abbrev,
scrape_mls_stadiums,
MLS_STADIUM_SOURCES,
)
from wnba import (
WNBA_TEAMS,
get_wnba_team_abbrev,
scrape_wnba_stadiums,
WNBA_STADIUM_SOURCES,
)
from nwsl import (
NWSL_TEAMS,
get_nwsl_team_abbrev,
scrape_nwsl_stadiums,
NWSL_STADIUM_SOURCES,
)
# =============================================================================
# NON-CORE SPORT SCRAPERS
# NOTE: MLS, WNBA, NWSL stadiums are now imported from their respective modules
# =============================================================================
def _scrape_espn_schedule(sport: str, league: str, season: int, date_range: tuple[str, str]) -> list[Game]:
"""
Fetch schedule from ESPN API.
Shared helper for non-core sports that use ESPN API.
"""
games = []
sport_upper = {
'wnba': 'WNBA',
'usa.1': 'MLS',
'usa.nwsl': 'NWSL',
}.get(league, league.upper())
print(f"Fetching {sport_upper} {season} from ESPN API...")
url = f"https://site.api.espn.com/apis/site/v2/sports/{sport}/{league}/scoreboard"
params = {
'dates': f"{date_range[0]}-{date_range[1]}",
'limit': 1000
}
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
}
try:
response = requests.get(url, params=params, headers=headers, timeout=30)
response.raise_for_status()
data = response.json()
events = data.get('events', [])
for event in events:
try:
date_str = event.get('date', '')[:10]
time_str = event.get('date', '')[11:16] if len(event.get('date', '')) > 11 else None
competitions = event.get('competitions', [{}])
if not competitions:
continue
comp = competitions[0]
competitors = comp.get('competitors', [])
if len(competitors) < 2:
continue
home_team = None
away_team = None
home_abbrev = None
away_abbrev = None
for team in competitors:
team_data = team.get('team', {})
team_name = team_data.get('displayName', team_data.get('name', ''))
team_abbrev = team_data.get('abbreviation', '')
if team.get('homeAway') == 'home':
home_team = team_name
home_abbrev = team_abbrev
else:
away_team = team_name
away_abbrev = team_abbrev
if not home_team or not away_team:
continue
venue = comp.get('venue', {}).get('fullName', '')
game_id = f"{sport_upper.lower()}_{date_str}_{away_abbrev}_{home_abbrev}".lower()
game = Game(
id=game_id,
sport=sport_upper,
season=str(season),
date=date_str,
time=time_str,
home_team=home_team,
away_team=away_team,
home_team_abbrev=home_abbrev or get_team_abbrev(home_team, sport_upper),
away_team_abbrev=away_abbrev or get_team_abbrev(away_team, sport_upper),
venue=venue,
source='espn.com'
)
games.append(game)
except Exception:
continue
print(f" Found {len(games)} games from ESPN")
except Exception as e:
print(f"Error fetching ESPN {sport_upper}: {e}")
return games
def scrape_wnba_espn(season: int) -> list[Game]:
"""Fetch WNBA schedule from ESPN API."""
start = f"{season}0501"
end = f"{season}1031"
return _scrape_espn_schedule('basketball', 'wnba', season, (start, end))
def scrape_mls_espn(season: int) -> list[Game]:
"""Fetch MLS schedule from ESPN API."""
start = f"{season}0201"
end = f"{season}1231"
return _scrape_espn_schedule('soccer', 'usa.1', season, (start, end))
def scrape_nwsl_espn(season: int) -> list[Game]:
"""Fetch NWSL schedule from ESPN API."""
start = f"{season}0301"
end = f"{season}1130"
return _scrape_espn_schedule('soccer', 'usa.nwsl', season, (start, end))
def scrape_wnba_basketball_reference(season: int) -> list[Game]:
"""Scrape WNBA schedule from Basketball-Reference."""
games = []
url = f"https://www.basketball-reference.com/wnba/years/{season}_games.html"
print(f"Scraping WNBA {season} from Basketball-Reference...")
soup = fetch_page(url, 'basketball-reference.com')
if not soup:
return games
table = soup.find('table', {'id': 'schedule'})
if not table:
return games
tbody = table.find('tbody')
if not tbody:
return games
for row in tbody.find_all('tr'):
if row.get('class') and 'thead' in row.get('class'):
continue
try:
date_cell = row.find('th', {'data-stat': 'date_game'})
if not date_cell:
continue
date_link = date_cell.find('a')
date_str = date_link.text if date_link else date_cell.text
visitor_cell = row.find('td', {'data-stat': 'visitor_team_name'})
home_cell = row.find('td', {'data-stat': 'home_team_name'})
if not visitor_cell or not home_cell:
continue
visitor_link = visitor_cell.find('a')
home_link = home_cell.find('a')
away_team = visitor_link.text if visitor_link else visitor_cell.text
home_team = home_link.text if home_link else home_cell.text
try:
parsed_date = datetime.strptime(date_str.strip(), '%a, %b %d, %Y')
date_formatted = parsed_date.strftime('%Y-%m-%d')
except ValueError:
continue
away_abbrev = get_team_abbrev(away_team, 'WNBA')
home_abbrev = get_team_abbrev(home_team, 'WNBA')
game_id = f"wnba_{date_formatted}_{away_abbrev}_{home_abbrev}".lower().replace(' ', '')
game = Game(
id=game_id,
sport='WNBA',
season=str(season),
date=date_formatted,
time=None,
home_team=home_team,
away_team=away_team,
home_team_abbrev=home_abbrev,
away_team_abbrev=away_abbrev,
venue='',
source='basketball-reference.com'
)
games.append(game)
except Exception:
continue
print(f" Found {len(games)} games from Basketball-Reference")
return games
def scrape_wnba_cbssports(season: int) -> list[Game]:
"""Fetch WNBA schedule from CBS Sports."""
games = []
print(f"Fetching WNBA {season} from CBS Sports...")
# Placeholder - CBS Sports scraping would go here
print(f" Found {len(games)} games from CBS Sports")
return games
def scrape_mls_fbref(season: int) -> list[Game]:
"""Scrape MLS schedule from FBref."""
games = []
print(f"Scraping MLS {season} from FBref...")
# Placeholder - FBref scraping would go here
print(f" Found {len(games)} games from FBref")
return games
def scrape_mls_mlssoccer(season: int) -> list[Game]:
"""Scrape MLS schedule from MLSSoccer.com."""
games = []
print(f"Scraping MLS {season} from MLSSoccer.com...")
# Placeholder - MLSSoccer.com scraping would go here
print(f" Found {len(games)} games from MLSSoccer.com")
return games
def scrape_nwsl_fbref(season: int) -> list[Game]:
"""Scrape NWSL schedule from FBref."""
games = []
print(f"Scraping NWSL {season} from FBref...")
# Placeholder - FBref scraping would go here
print(f" Found {len(games)} games from FBref")
return games
def scrape_nwsl_nwslsoccer(season: int) -> list[Game]:
"""Scrape NWSL schedule from NWSL.com."""
games = []
print(f"Scraping NWSL {season} from NWSL.com...")
# Placeholder - NWSL.com scraping would go here
print(f" Found {len(games)} games from NWSL.com")
return games
# =============================================================================
# LEGACY STADIUM FUNCTIONS
# =============================================================================
def scrape_stadiums_hifld() -> list[Stadium]:
"""Legacy: Scrape from HIFLD open data."""
# Placeholder for legacy HIFLD scraping
return []
def generate_stadiums_from_teams() -> list[Stadium]:
"""Generate stadium entries from team data with hardcoded coordinates."""
stadiums = []
# This function would generate stadiums from all team dictionaries
# Keeping as placeholder since sport modules have their own stadium scrapers
return stadiums
def scrape_all_stadiums() -> list[Stadium]:
"""Comprehensive stadium scraping for all sports."""
all_stadiums = []
# Core sports (from modules)
all_stadiums.extend(scrape_mlb_stadiums())
all_stadiums.extend(scrape_nba_stadiums())
all_stadiums.extend(scrape_nhl_stadiums())
all_stadiums.extend(scrape_nfl_stadiums())
# Non-core sports
all_stadiums.extend(scrape_mls_stadiums())
all_stadiums.extend(scrape_wnba_stadiums())
all_stadiums.extend(scrape_nwsl_stadiums())
return all_stadiums
# =============================================================================
# HELPERS
# =============================================================================
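def get_team_abbrev(team_name: str, sport: str) -> str:
    """Get team abbreviation from full name."""
    teams = {
        'NBA': NBA_TEAMS,
        'MLB': MLB_TEAMS,
        'NHL': NHL_TEAMS,
        'NFL': NFL_TEAMS,
        'WNBA': WNBA_TEAMS,
        'MLS': MLS_TEAMS,
        'NWSL': NWSL_TEAMS,
    }.get(sport, {})
    wanted = team_name.lower()
    # Prefer an exact name match over any partial match, so that a partial
    # hit on an earlier team cannot shadow an exact hit on a later one.
    for abbrev, info in teams.items():
        if info['name'].lower() == wanted:
            return abbrev
    # Then accept a partial match; skip empty names, which match everything.
    for abbrev, info in teams.items():
        if wanted and wanted in info['name'].lower():
            return abbrev
    # Return first 3 letters as fallback
    return team_name[:3].upper()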
# =============================================================================
# MAIN ORCHESTRATOR
# =============================================================================
def main():
parser = argparse.ArgumentParser(description='Scrape sports schedules')
parser.add_argument('--sport', choices=['nba', 'mlb', 'nhl', 'nfl', 'wnba', 'mls', 'nwsl', 'all'], default='all')
parser.add_argument('--season', type=int, default=2026, help='Season year (ending year)')
parser.add_argument('--stadiums-only', action='store_true', help='Only scrape stadium data (legacy method)')
parser.add_argument('--stadiums-update', action='store_true', help='Scrape ALL stadium data for all 7 sports (comprehensive)')
parser.add_argument('--output', type=str, default='./data', help='Output directory')
args = parser.parse_args()
output_dir = Path(args.output)
all_games = []
all_stadiums = []
# Scrape stadiums
print("\n" + "="*60)
print("SCRAPING STADIUMS")
print("="*60)
if args.stadiums_update:
print("Using comprehensive stadium scrapers for all sports...")
all_stadiums.extend(scrape_all_stadiums())
print(f" Total stadiums scraped: {len(all_stadiums)}")
else:
all_stadiums.extend(scrape_stadiums_hifld())
all_stadiums.extend(generate_stadiums_from_teams())
# If stadiums-only mode, export and exit
if args.stadiums_only:
export_to_json([], all_stadiums, output_dir)
return
# Scrape schedules using sport modules
if args.sport in ['nba', 'all']:
print("\n" + "="*60)
print(f"SCRAPING NBA {args.season}")
print("="*60)
nba_games = scrape_nba_games(args.season)
nba_season = get_nba_season_string(args.season)
nba_games = assign_stable_ids(nba_games, 'NBA', nba_season)
all_games.extend(nba_games)
if args.sport in ['mlb', 'all']:
print("\n" + "="*60)
print(f"SCRAPING MLB {args.season}")
print("="*60)
mlb_games = scrape_mlb_games(args.season)
mlb_games = assign_stable_ids(mlb_games, 'MLB', str(args.season))
all_games.extend(mlb_games)
if args.sport in ['nhl', 'all']:
print("\n" + "="*60)
print(f"SCRAPING NHL {args.season}")
print("="*60)
nhl_games = scrape_nhl_games(args.season)
nhl_season = get_nhl_season_string(args.season)
nhl_games = assign_stable_ids(nhl_games, 'NHL', nhl_season)
all_games.extend(nhl_games)
if args.sport in ['nfl', 'all']:
print("\n" + "="*60)
print(f"SCRAPING NFL {args.season}")
print("="*60)
nfl_games = scrape_nfl_games(args.season)
nfl_season = get_nfl_season_string(args.season)
nfl_games = assign_stable_ids(nfl_games, 'NFL', nfl_season)
all_games.extend(nfl_games)
# Non-core sports (TODO: Extract to modules)
if args.sport in ['wnba', 'all']:
print("\n" + "="*60)
print(f"SCRAPING WNBA {args.season}")
print("="*60)
wnba_sources = [
ScraperSource('ESPN', scrape_wnba_espn, priority=1, min_games=100),
ScraperSource('Basketball-Reference', scrape_wnba_basketball_reference, priority=2, min_games=100),
ScraperSource('CBS Sports', scrape_wnba_cbssports, priority=3, min_games=50),
]
wnba_games = scrape_with_fallback('WNBA', args.season, wnba_sources)
wnba_games = assign_stable_ids(wnba_games, 'WNBA', str(args.season))
all_games.extend(wnba_games)
if args.sport in ['mls', 'all']:
print("\n" + "="*60)
print(f"SCRAPING MLS {args.season}")
print("="*60)
mls_sources = [
ScraperSource('ESPN', scrape_mls_espn, priority=1, min_games=200),
ScraperSource('FBref', scrape_mls_fbref, priority=2, min_games=100),
ScraperSource('MLSSoccer.com', scrape_mls_mlssoccer, priority=3, min_games=100),
]
mls_games = scrape_with_fallback('MLS', args.season, mls_sources)
mls_games = assign_stable_ids(mls_games, 'MLS', str(args.season))
all_games.extend(mls_games)
if args.sport in ['nwsl', 'all']:
print("\n" + "="*60)
print(f"SCRAPING NWSL {args.season}")
print("="*60)
nwsl_sources = [
ScraperSource('ESPN', scrape_nwsl_espn, priority=1, min_games=100),
ScraperSource('FBref', scrape_nwsl_fbref, priority=2, min_games=50),
ScraperSource('NWSL.com', scrape_nwsl_nwslsoccer, priority=3, min_games=50),
]
nwsl_games = scrape_with_fallback('NWSL', args.season, nwsl_sources)
nwsl_games = assign_stable_ids(nwsl_games, 'NWSL', str(args.season))
all_games.extend(nwsl_games)
# Export
print("\n" + "="*60)
print("EXPORTING DATA")
print("="*60)
export_to_json(all_games, all_stadiums, output_dir)
# Summary
print("\n" + "="*60)
print("SUMMARY")
print("="*60)
print(f"Total games scraped: {len(all_games)}")
print(f"Total stadiums: {len(all_stadiums)}")
by_sport = {}
for g in all_games:
by_sport[g.sport] = by_sport.get(g.sport, 0) + 1
for sport, count in by_sport.items():
print(f" {sport}: {count} games")
if __name__ == '__main__':
main()
+688
@@ -0,0 +1,688 @@
# SportsTime Parser
A Python CLI tool for scraping sports schedules, normalizing data with canonical IDs, and uploading to CloudKit.
## Features
- Scrapes game schedules from multiple sources with automatic fallback
- Supports 7 major sports leagues: NBA, MLB, NFL, NHL, MLS, WNBA, NWSL
- Generates deterministic canonical IDs for games, teams, and stadiums
- Produces validation reports with manual review lists
- Uploads to CloudKit with resumable, diff-based updates
## Requirements
- Python 3.11+
- CloudKit credentials (for upload functionality)
## Installation
```bash
# From the Scripts directory
cd Scripts
# Install in development mode
pip install -e ".[dev]"
# Or install dependencies only
pip install -r requirements.txt
```
## Quick Start
```bash
# Scrape NBA 2025-26 season
sportstime-parser scrape nba --season 2025
# Scrape all sports
sportstime-parser scrape all --season 2025
# Validate existing scraped data
sportstime-parser validate nba --season 2025
# Check status
sportstime-parser status
# Upload to CloudKit (development)
sportstime-parser upload nba --season 2025
# Upload to CloudKit (production)
sportstime-parser upload nba --season 2025 --environment production
```
## CLI Reference
### scrape
Scrape game schedules, teams, and stadiums from web sources.
```bash
sportstime-parser scrape <sport> [options]
Arguments:
sport Sport to scrape: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"
Options:
--season, -s INT Season start year (default: 2025)
--dry-run Parse and validate only, don't write output files
--verbose, -v Enable verbose output
```
**Examples:**
```bash
# Scrape NBA 2025-26 season
sportstime-parser scrape nba --season 2025
# Scrape all sports with verbose output
sportstime-parser scrape all --season 2025 --verbose
# Dry run to test without writing files
sportstime-parser scrape mlb --season 2026 --dry-run
```
### validate
Run validation on existing scraped data and regenerate reports. Validation performs these checks:
1. **Game Coverage**: Compares scraped game count against expected totals per league (e.g., ~1,230 for NBA, ~2,430 for MLB)
2. **Team Resolution**: Identifies team names that couldn't be matched to canonical IDs using fuzzy matching
3. **Stadium Resolution**: Identifies venue names that couldn't be matched to canonical stadium IDs
4. **Duplicate Detection**: Finds games with the same home/away teams on the same date (potential doubleheader issues or data errors)
5. **Missing Data**: Flags games missing required fields (stadium_id, team IDs, valid dates)
The output is a Markdown report with:
- Summary statistics (total games, valid games, coverage percentage)
- Manual review items grouped by type (unresolved teams, unresolved stadiums, duplicates)
- Fuzzy match suggestions with confidence scores to help resolve unmatched names
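The coverage and duplicate checks can be sketched roughly as follows. This is a minimal illustration rather than the package's validator: the `GameRow` record and both function names are hypothetical, and the expected total is passed in by the caller.

```python
from collections import Counter
from typing import NamedTuple


class GameRow(NamedTuple):
    """Hypothetical minimal game record used only in this sketch."""
    date: str        # ISO date, e.g. "2025-10-21"
    away_team: str
    home_team: str


def coverage_percent(games: list[GameRow], expected_total: int) -> float:
    """Return scraped games as a percentage of the expected season total."""
    if expected_total <= 0:
        return 0.0
    return round(100.0 * len(games) / expected_total, 1)


def find_duplicates(games: list[GameRow]) -> list[GameRow]:
    """Return each (date, away, home) combination that appears more than once."""
    counts = Counter(games)
    return [game for game, seen in counts.items() if seen > 1]
```

A real doubleheader would also be reported by `find_duplicates`, which is why the report lists these rows for manual review instead of dropping them.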
```bash
sportstime-parser validate <sport> [options]
Arguments:
sport Sport to validate: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"
Options:
--season, -s INT Season start year (default: 2025)
```
**Examples:**
```bash
# Validate NBA data
sportstime-parser validate nba --season 2025
# Validate all sports
sportstime-parser validate all
```
### upload
Upload scraped data to CloudKit with diff-based updates.
```bash
sportstime-parser upload <sport> [options]
Arguments:
sport Sport to upload: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"
Options:
--season, -s INT Season start year (default: 2025)
--environment, -e CloudKit environment: development or production (default: development)
--resume Resume interrupted upload from last checkpoint
```
**Examples:**
```bash
# Upload NBA to development
sportstime-parser upload nba --season 2025
# Upload to production
sportstime-parser upload nba --season 2025 --environment production
# Resume interrupted upload
sportstime-parser upload mlb --season 2026 --resume
```
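As a rough idea of what a diff-based update involves, the sketch below compares local records against records already uploaded and sorts their IDs into three groups. It is an assumption-laden illustration: `diff_records` is a hypothetical helper, records are plain dictionaries keyed by canonical ID, and records that exist only remotely are ignored because this README does not say how the uploader treats them.

```python
def diff_records(local: dict[str, dict], remote: dict[str, dict]) -> dict[str, list[str]]:
    """Classify local record IDs by comparing their fields with the uploaded copy."""
    to_create = sorted(rid for rid in local if rid not in remote)
    to_update = sorted(
        rid for rid in local if rid in remote and local[rid] != remote[rid]
    )
    unchanged = sorted(
        rid for rid in local if rid in remote and local[rid] == remote[rid]
    )
    return {"create": to_create, "update": to_update, "unchanged": unchanged}
```

Only the `create` and `update` groups would need to be sent, which is what keeps repeated uploads of a mostly unchanged schedule cheap.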
### status
Show current scrape and upload status.
```bash
sportstime-parser status
```
### retry
Retry failed uploads from previous attempts.
```bash
sportstime-parser retry <sport> [options]
Arguments:
sport Sport to retry: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"
Options:
--season, -s INT Season start year (default: 2025)
--environment, -e CloudKit environment (default: development)
--max-retries INT Maximum retry attempts per record (default: 3)
```
### clear
Clear upload session state to start fresh.
```bash
sportstime-parser clear <sport> [options]
Arguments:
sport Sport to clear: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"
Options:
--season, -s INT Season start year (default: 2025)
--environment, -e CloudKit environment (default: development)
```
## CloudKit Configuration
To upload data to CloudKit, you need to configure authentication credentials.
### 1. Get Credentials from Apple Developer Portal
1. Go to [Apple Developer Portal](https://developer.apple.com)
2. Navigate to **Certificates, Identifiers & Profiles** > **Keys**
3. Create a new key with **CloudKit** capability
4. Download the private key file (.p8)
5. Note the Key ID
### 2. Set Environment Variables
```bash
# Key ID from Apple Developer Portal
export CLOUDKIT_KEY_ID="your_key_id_here"
# Path to private key file
export CLOUDKIT_PRIVATE_KEY_PATH="/path/to/AuthKey_XXXXXX.p8"
# Or provide key content directly (useful for CI/CD)
export CLOUDKIT_PRIVATE_KEY="-----BEGIN EC PRIVATE KEY-----
...key content...
-----END EC PRIVATE KEY-----"
```
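For orientation, here is a minimal sketch of how these variables could be read in Python. The helper name `load_cloudkit_credentials` is hypothetical, and the precedence (inline key content before the file path) is an assumption; the package may resolve them differently.

```python
import os
from collections.abc import Mapping
from pathlib import Path


def load_cloudkit_credentials(env: Mapping[str, str] = os.environ) -> tuple[str, str]:
    """Return (key_id, private_key_pem) read from the variables above."""
    key_id = env.get("CLOUDKIT_KEY_ID")
    if not key_id:
        raise RuntimeError("CLOUDKIT_KEY_ID is not set")

    # Assumption: inline key content takes precedence over the file path.
    inline_key = env.get("CLOUDKIT_PRIVATE_KEY")
    if inline_key:
        return key_id, inline_key

    key_path = env.get("CLOUDKIT_PRIVATE_KEY_PATH")
    if key_path:
        return key_id, Path(key_path).expanduser().read_text()

    raise RuntimeError("Set CLOUDKIT_PRIVATE_KEY_PATH or CLOUDKIT_PRIVATE_KEY")
```

Passing the environment as a parameter keeps the function easy to exercise with a plain dictionary in tests.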
### 3. Verify Configuration
```bash
sportstime-parser status
```
The status output will show whether CloudKit is configured correctly.
## Output Files
Scraped data is saved to the `output/` directory:
```
output/
    games_nba_2025.json       # Game schedules
    teams_nba.json            # Team data
    stadiums_nba.json         # Stadium data
    validation_nba_2025.md    # Validation report
```
## Validation Reports
Validation reports are generated in Markdown format at `output/validation_{sport}_{season}.md`.
### Report Sections
**Summary Table**
| Metric | Description |
|--------|-------------|
| Total Games | Number of games scraped |
| Valid Games | Games with all required fields resolved |
| Coverage | Percentage of expected games found (based on league schedule) |
| Unresolved Teams | Team names that couldn't be matched |
| Unresolved Stadiums | Venue names that couldn't be matched |
| Duplicates | Potential duplicate game entries |
**Manual Review Items**
Items are grouped by type and include the raw value, source URL, and suggested fixes:
- **Unresolved Teams**: Team names not in the alias mapping. Add to `team_aliases.json` to resolve.
- **Unresolved Stadiums**: Venue names not recognized. Common for renamed arenas (naming rights changes). Add to `stadium_aliases.json`.
- **Duplicate Games**: Same matchup on same date. May indicate doubleheader parsing issues or duplicate entries from different sources.
- **Missing Data**: Games missing stadium coordinates or other required fields.
**Fuzzy Match Suggestions**
For each unresolved name, the validator provides the top fuzzy matches with confidence scores (0-100). High-confidence matches (>80) are likely correct; lower scores need manual verification.
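A ranked suggestion list of this kind could look like the sketch below. Note the substitution: it uses the standard library's `difflib.SequenceMatcher` ratio as a stand-in similarity measure, not the Levenshtein-based scoring the parser describes, so the actual scores will differ. The function name `suggest_matches` is hypothetical.

```python
from difflib import SequenceMatcher


def suggest_matches(raw_name: str, known_names: list[str], limit: int = 3) -> list[tuple[str, int]]:
    """Return up to `limit` (name, score) pairs, best first, with scores 0-100."""

    def score(candidate: str) -> int:
        # Case-insensitive similarity ratio scaled to an integer percentage.
        ratio = SequenceMatcher(None, raw_name.lower(), candidate.lower()).ratio()
        return round(ratio * 100)

    ranked = sorted(
        ((name, score(name)) for name in known_names),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:limit]
```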
## Canonical IDs
Canonical IDs are stable, deterministic identifiers that enable cross-referencing between games, teams, and stadiums across different data sources.
### ID Formats
**Games**
```
{sport}_{season}_{away}_{home}_{MMDD}[_{game_number}]
```
Examples:
- `nba_2025_hou_okc_1021` - NBA 2025-26, Houston @ OKC, Oct 21
- `mlb_2026_nyy_bos_0401_1` - MLB 2026, Yankees @ Red Sox, Apr 1, Game 1 (doubleheader)
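The game ID format can be reproduced with a few lines of Python. This is a sketch under the assumption that team abbreviations are simply lowercased; `make_game_id` is a hypothetical name, not necessarily the function in `canonical_id.py`.

```python
from datetime import date
from typing import Optional


def make_game_id(sport: str, season: int, away: str, home: str,
                 game_date: date, game_number: Optional[int] = None) -> str:
    """Build {sport}_{season}_{away}_{home}_{MMDD}[_{game_number}]."""
    parts = [
        sport.lower(),
        str(season),
        away.lower(),
        home.lower(),
        game_date.strftime("%m%d"),  # zero-padded month and day
    ]
    if game_number is not None:
        parts.append(str(game_number))  # only present for doubleheaders
    return "_".join(parts)
```

Because every input is a resolved value and no counter or timestamp is involved, scraping the same game twice produces the same ID.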
**Teams**
```
{sport}_{city}_{name}
```
Examples:
- `nba_la_lakers`
- `mlb_new_york_yankees`
- `nfl_new_york_giants`
**Stadiums**
```
{sport}_{normalized_name}
```
Examples:
- `mlb_yankee_stadium`
- `nba_crypto_com_arena`
- `nfl_sofi_stadium`
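The stadium examples are consistent with a simple normalization rule: lowercase the name and replace each run of non-alphanumeric characters with one underscore. The sketch below implements that inferred rule; both the rule and the name `make_stadium_id` are assumptions drawn from the examples, not taken from the package source.

```python
import re


def make_stadium_id(sport: str, stadium_name: str) -> str:
    """Build {sport}_{normalized_name} from a display name."""
    # Collapse spaces, dots, hyphens, etc. into single underscores.
    slug = re.sub(r"[^a-z0-9]+", "_", stadium_name.lower()).strip("_")
    return f"{sport.lower()}_{slug}"
```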
### Generated vs Matched IDs
| Entity | Generated | Matched |
|--------|-----------|---------|
| **Teams** | Pre-defined in `team_resolver.py` mappings | Resolved from raw scraped names via aliases + fuzzy matching |
| **Stadiums** | Pre-defined in `stadium_resolver.py` mappings | Resolved from raw venue names via aliases + fuzzy matching |
| **Games** | Generated at scrape time from resolved team IDs + date | N/A (always generated, never matched) |
**Resolution Flow:**
```
Raw Name (from scraper)
    ↓
Exact Match (alias lookup in team_aliases.json / stadium_aliases.json)
    ↓ (if no match)
Fuzzy Match (Levenshtein distance against known names)
    ↓ (if confidence > threshold)
Canonical ID assigned
    ↓ (if no match)
Manual Review Item created
```
### Cross-References
Entities reference each other via canonical IDs:
```
Game
    id:           nba_2025_hou_okc_1021
    home_team_id: nba_oklahoma_city_thunder   → Team (Thunder) id
    away_team_id: nba_houston_rockets         → Team (Rockets) id
    stadium_id:   nba_paycom_center           → Stadium id

Stadium
    id:        nba_paycom_center
    name:      "Paycom Center"
    city:      "Oklahoma City"
    latitude:  35.4634
    longitude: -97.5151

Team
    id:         nba_houston_rockets
    name:       "Rockets"
    city:       "Houston"
    stadium_id: nba_toyota_center

Team
    id:         nba_oklahoma_city_thunder
    name:       "Thunder"
    city:       "Oklahoma City"
    stadium_id: nba_paycom_center
```
### Alias Files
Aliases map variant names to canonical IDs:
**`team_aliases.json`**
```json
{
"nba": {
"LA Lakers": "nba_la_lakers",
"Los Angeles Lakers": "nba_la_lakers",
"LAL": "nba_la_lakers"
}
}
```
**`stadium_aliases.json`**
```json
{
"nba": {
"Crypto.com Arena": "nba_crypto_com_arena",
"Staples Center": "nba_crypto_com_arena",
"STAPLES Center": "nba_crypto_com_arena"
}
}
```
When a scraper returns a raw name like "LA Lakers", the resolver:
1. Checks `team_aliases.json` for an exact match → finds `nba_la_lakers`
2. If no exact match, runs fuzzy matching against all known team names
3. If fuzzy match confidence > 80%, uses that canonical ID
4. Otherwise, creates a manual review item for human resolution
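The four steps above can be sketched as a single function. This is an illustration only: `resolve_name` is a hypothetical helper, the alias and known-name tables are passed in as plain dictionaries, and `difflib.SequenceMatcher` stands in for the parser's own fuzzy matcher.

```python
from difflib import SequenceMatcher
from typing import Optional


def resolve_name(raw_name: str, aliases: dict[str, str], known: dict[str, str],
                 threshold: int = 80) -> Optional[str]:
    """Return a canonical ID, or None when the name needs manual review.

    aliases: variant name -> canonical ID (one sport's block of an alias file)
    known:   official name -> canonical ID, used for fuzzy matching
    """
    # Step 1: exact alias lookup.
    if raw_name in aliases:
        return aliases[raw_name]

    # Steps 2-3: find the best fuzzy match and accept it only above the threshold.
    best_id, best_score = None, 0.0
    for name, canonical_id in known.items():
        score = 100 * SequenceMatcher(None, raw_name.lower(), name.lower()).ratio()
        if score > best_score:
            best_id, best_score = canonical_id, score
    if best_score > threshold:
        return best_id

    # Step 4: no confident match; the caller records a manual review item.
    return None
```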
## Adding a New Sport
To add support for a new sport (e.g., `cfb` for college football), update these files:
### 1. Configuration (`config.py`)
Add the sport to `SUPPORTED_SPORTS` and `EXPECTED_GAME_COUNTS`:
```python
SUPPORTED_SPORTS: list[str] = [
    "nba", "mlb", "nfl", "nhl", "mls", "wnba", "nwsl",
    "cfb",  # ← Add new sport
]

EXPECTED_GAME_COUNTS: dict[str, int] = {
    # ... existing sports ...
    "cfb": 900,  # ← Add expected game count for validation
}
```
### 2. Team Mappings (`normalizers/team_resolver.py`)
Add team definitions to `TEAM_MAPPINGS`. Each entry maps an abbreviation to `(canonical_id, full_name, city)`:
```python
TEAM_MAPPINGS: dict[str, dict[str, tuple[str, str, str]]] = {
    # ... existing sports ...
    "cfb": {
        "ALA": ("team_cfb_ala", "Alabama Crimson Tide", "Tuscaloosa"),
        "OSU": ("team_cfb_osu", "Ohio State Buckeyes", "Columbus"),
        # ... all teams ...
    },
}
```
### 3. Stadium Mappings (`normalizers/stadium_resolver.py`)
Add stadium definitions to `STADIUM_MAPPINGS`. Each entry is a `StadiumInfo` with coordinates:
```python
STADIUM_MAPPINGS: dict[str, dict[str, StadiumInfo]] = {
    # ... existing sports ...
    "cfb": {
        "stadium_cfb_bryant_denny": StadiumInfo(
            id="stadium_cfb_bryant_denny",
            name="Bryant-Denny Stadium",
            city="Tuscaloosa",
            state="AL",
            country="USA",
            sport="cfb",
            latitude=33.2083,
            longitude=-87.5503,
        ),
        # ... all stadiums ...
    },
}
```
### 4. Scraper Implementation (`scrapers/cfb.py`)
Create a new scraper class extending `BaseScraper`:
```python
from .base import BaseScraper, RawGameData, ScrapeResult


class CFBScraper(BaseScraper):
    def __init__(self, season: int, **kwargs):
        super().__init__("cfb", season, **kwargs)
        self._team_resolver = get_team_resolver("cfb")
        self._stadium_resolver = get_stadium_resolver("cfb")

    def _get_sources(self) -> list[str]:
        return ["espn", "sports_reference"]  # Priority order

    def _get_source_url(self, source: str, **kwargs) -> str:
        # Return URL for each source
        ...

    def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
        # Implement scraping logic
        ...

    def _normalize_games(self, raw_games: list[RawGameData]) -> tuple[list[Game], list[ManualReviewItem]]:
        # Convert raw data to Game objects using resolvers
        ...

    def scrape_teams(self) -> list[Team]:
        # Return Team objects from TEAM_MAPPINGS
        ...

    def scrape_stadiums(self) -> list[Stadium]:
        # Return Stadium objects from STADIUM_MAPPINGS
        ...


def create_cfb_scraper(season: int) -> CFBScraper:
    return CFBScraper(season=season)
```
### 5. Register Scraper (`scrapers/__init__.py`)
Export the new scraper:
```python
from .cfb import CFBScraper, create_cfb_scraper

__all__ = [
    # ... existing exports ...
    "CFBScraper",
    "create_cfb_scraper",
]
```
### 6. CLI Registration (`cli.py`)
Add the sport to `get_scraper()`:
```python
def get_scraper(sport: str, season: int):
    # ... existing sports ...
    elif sport == "cfb":
        from .scrapers.cfb import create_cfb_scraper
        return create_cfb_scraper(season)
```
### 7. Alias Files (`team_aliases.json`, `stadium_aliases.json`)
Add initial aliases for common name variants:
**`team_aliases.json`**
```json
{
  "cfb": {
    "Alabama": "team_cfb_ala",
    "Bama": "team_cfb_ala",
    "Roll Tide": "team_cfb_ala"
  }
}
```
**`stadium_aliases.json`**
```json
{
  "cfb": {
    "Bryant Denny Stadium": "stadium_cfb_bryant_denny",
    "Bryant-Denny": "stadium_cfb_bryant_denny"
  }
}
```
### 8. Documentation (`SOURCES.md`)
Document data sources with URLs, rate limits, and notes:
```markdown
## CFB (College Football)
**Teams**: 134 (FBS)
**Expected Games**: ~900 per season
**Season**: August - January
### Sources
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/football/college-football/scoreboard` | JSON |
| 2 | Sports-Reference | `sports-reference.com/cfb/years/{YEAR}-schedule.html` | HTML |
```
### 9. Tests (`tests/test_scrapers/test_cfb.py`)
Create tests for the new scraper:
```python
import pytest

from sportstime_parser.scrapers.cfb import CFBScraper, create_cfb_scraper


class TestCFBScraper:
    def test_factory_creates_scraper(self):
        scraper = create_cfb_scraper(season=2025)
        assert scraper.sport == "cfb"
        assert scraper.season == 2025

    def test_get_sources_returns_priority_list(self):
        scraper = CFBScraper(season=2025)
        sources = scraper._get_sources()
        assert "espn" in sources

    # ... more tests ...
```
### Checklist
- [ ] Add to `SUPPORTED_SPORTS` in `config.py`
- [ ] Add to `EXPECTED_GAME_COUNTS` in `config.py`
- [ ] Add team mappings to `team_resolver.py`
- [ ] Add stadium mappings to `stadium_resolver.py`
- [ ] Create `scrapers/{sport}.py` with scraper class
- [ ] Export in `scrapers/__init__.py`
- [ ] Register in `cli.py` `get_scraper()`
- [ ] Add aliases to `team_aliases.json`
- [ ] Add aliases to `stadium_aliases.json`
- [ ] Document sources in `SOURCES.md`
- [ ] Create tests in `tests/test_scrapers/`
- [ ] Run `pytest` to verify all tests pass
- [ ] Run dry-run scrape: `sportstime-parser scrape {sport} --season 2025 --dry-run`
## Development
### Running Tests
```bash
# Run all tests
pytest
# Run with coverage
pytest --cov=sportstime_parser --cov-report=html
# Run specific test file
pytest tests/test_scrapers/test_nba.py
# Run with verbose output
pytest -v
```
### Project Structure
```
sportstime_parser/
    __init__.py
    __main__.py                 # CLI entry point
    cli.py                      # Subcommand definitions
    config.py                   # Constants, defaults
    models/
        game.py                 # Game dataclass
        team.py                 # Team dataclass
        stadium.py              # Stadium dataclass
        aliases.py              # Alias dataclasses
    scrapers/
        base.py                 # BaseScraper abstract class
        nba.py                  # NBA scrapers
        mlb.py                  # MLB scrapers
        nfl.py                  # NFL scrapers
        nhl.py                  # NHL scrapers
        mls.py                  # MLS scrapers
        wnba.py                 # WNBA scrapers
        nwsl.py                 # NWSL scrapers
    normalizers/
        canonical_id.py         # ID generation
        team_resolver.py        # Team name resolution
        stadium_resolver.py     # Stadium name resolution
        timezone.py             # Timezone conversion
        fuzzy.py                # Fuzzy matching
    validators/
        report.py               # Validation report generator
    uploaders/
        cloudkit.py             # CloudKit Web Services client
        state.py                # Resumable upload state
        diff.py                 # Record comparison
    utils/
        http.py                 # Rate-limited HTTP client
        logging.py              # Verbose logger
        progress.py             # Progress bars
```
## Troubleshooting
### "No games file found"
Run the scrape command first:
```bash
sportstime-parser scrape nba --season 2025
```
### "CloudKit not configured"
Set the required environment variables:
```bash
export CLOUDKIT_KEY_ID="your_key_id"
export CLOUDKIT_PRIVATE_KEY_PATH="/path/to/key.p8"
```
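To see which setting is missing before running an upload, a small check along these lines may help. It is only a sketch: the function name `missing_cloudkit_settings` is hypothetical, and it assumes (as the `status` command in `cli.py` suggests) that an inline `CLOUDKIT_PRIVATE_KEY` is accepted as an alternative to `CLOUDKIT_PRIVATE_KEY_PATH`.

```python
import os
from typing import Mapping


def missing_cloudkit_settings(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of CloudKit settings that are not set (hypothetical helper)."""
    missing = []
    if not env.get("CLOUDKIT_KEY_ID"):
        missing.append("CLOUDKIT_KEY_ID")
    # Assumption: either a key file path or an inline key is sufficient.
    if not (env.get("CLOUDKIT_PRIVATE_KEY_PATH") or env.get("CLOUDKIT_PRIVATE_KEY")):
        missing.append("CLOUDKIT_PRIVATE_KEY_PATH or CLOUDKIT_PRIVATE_KEY")
    return missing


if __name__ == "__main__":
    for name in missing_cloudkit_settings():
        print(f"Not set: {name}")
```

An empty list means both the key ID and a private key source are present; it does not verify that the key file exists or is valid.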
### Rate limit errors
The scraper includes automatic rate limiting and exponential backoff. If you encounter persistent rate limit errors:
1. Wait a few minutes before retrying
2. Try scraping one sport at a time instead of "all"
3. Check that you're not running multiple instances
### Scrape fails with no data
1. Check your internet connection
2. Run with `--verbose` to see detailed error messages
3. The scraper tries multiple sources in turn; if all of them fail, the source websites may be temporarily unavailable
## License
MIT
@@ -0,0 +1,254 @@
# Data Sources
This document lists all data sources used by the SportsTime parser, including URLs, rate limits, and data freshness expectations.
## Source Priority
Each sport has multiple sources configured in priority order. The scraper tries each source in order and uses the first one that succeeds. If a source fails (network error, parsing error, etc.), it falls back to the next source.
---
## NBA (National Basketball Association)
**Teams**: 30
**Expected Games**: ~1,230 per season
**Season**: October - June (spans two calendar years)
### Sources
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | Basketball-Reference | `basketball-reference.com/leagues/NBA_{YEAR}_games-{month}.html` | HTML |
| 2 | ESPN API | `site.api.espn.com/apis/site/v2/sports/basketball/nba/scoreboard` | JSON |
| 3 | CBS Sports | `cbssports.com/nba/schedule/` | HTML |
### Rate Limits
- **Basketball-Reference**: ~1 request/second recommended
- **ESPN API**: No published limit; use 1 request/second to be safe
- **CBS Sports**: ~1 request/second recommended
### Notes
- Basketball-Reference is the most reliable source with complete historical data
- ESPN API is good for current/future seasons
- Games organized by month on Basketball-Reference
---
## MLB (Major League Baseball)
**Teams**: 30
**Expected Games**: ~2,430 per season
**Season**: March/April - October/November (single calendar year)
### Sources
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | Baseball-Reference | `baseball-reference.com/leagues/majors/{YEAR}-schedule.shtml` | HTML |
| 2 | MLB Stats API | `statsapi.mlb.com/api/v1/schedule` | JSON |
| 3 | ESPN API | `site.api.espn.com/apis/site/v2/sports/baseball/mlb/scoreboard` | JSON |
### Rate Limits
- **Baseball-Reference**: ~1 request/second recommended
- **MLB Stats API**: No published limit; use 0.5 requests/second
- **ESPN API**: ~1 request/second
### Notes
- MLB has doubleheaders; games are suffixed with `_1`, `_2`
- Single schedule page per season on Baseball-Reference
- MLB Stats API allows date range queries for efficiency
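As an illustration of the doubleheader note, the sketch below appends `_1`, `_2` to IDs that occur more than once on the same day. The base ID format and the function name `add_doubleheader_suffixes` are hypothetical; the real logic lives in `normalizers/canonical_id.py` and may differ.

```python
from collections import Counter


def add_doubleheader_suffixes(game_ids: list[str]) -> list[str]:
    """Suffix repeated IDs with _1, _2, ... in order; leave unique IDs unchanged."""
    totals = Counter(game_ids)
    seen: Counter = Counter()
    result = []
    for game_id in game_ids:
        if totals[game_id] > 1:
            seen[game_id] += 1
            result.append(f"{game_id}_{seen[game_id]}")
        else:
            result.append(game_id)
    return result


# Hypothetical base IDs: two games between the same teams on one date, plus one other game.
ids = add_doubleheader_suffixes(
    ["mlb_20250704_bos_nyy", "mlb_20250704_bos_nyy", "mlb_20250704_chc_stl"]
)
```

The input order is assumed to be chronological, so the earlier game of a doubleheader receives `_1`.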
---
## NFL (National Football League)
**Teams**: 32
**Expected Games**: ~272 per season (regular season only)
**Season**: September - February (spans two calendar years)
### Sources
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/football/nfl/scoreboard` | JSON |
| 2 | Pro-Football-Reference | `pro-football-reference.com/years/{YEAR}/games.htm` | HTML |
| 3 | CBS Sports | `cbssports.com/nfl/schedule/` | HTML |
### Rate Limits
- **ESPN API**: ~1 request/second
- **Pro-Football-Reference**: ~1 request/second
- **CBS Sports**: ~1 request/second
### Notes
- ESPN API uses week numbers instead of dates
- International games (London, Mexico City, Frankfurt, etc.) are filtered out
- Includes preseason, regular season, and playoffs
---
## NHL (National Hockey League)
**Teams**: 32 (including Utah Hockey Club)
**Expected Games**: ~1,312 per season
**Season**: October - June (spans two calendar years)
### Sources
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | Hockey-Reference | `hockey-reference.com/leagues/NHL_{YEAR}_games.html` | HTML |
| 2 | NHL API | `api-web.nhle.com/v1/schedule/{date}` | JSON |
| 3 | ESPN API | `site.api.espn.com/apis/site/v2/sports/hockey/nhl/scoreboard` | JSON |
### Rate Limits
- **Hockey-Reference**: ~1 request/second
- **NHL API**: No published limit; use 0.5 requests/second
- **ESPN API**: ~1 request/second
### Notes
- International games (Prague, Stockholm, Helsinki, etc.) are filtered out
- Single schedule page per season on Hockey-Reference
---
## MLS (Major League Soccer)
**Teams**: 30 (including San Diego FC)
**Expected Games**: ~493 per season
**Season**: February/March - October/November (single calendar year)
### Sources
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/soccer/usa.1/scoreboard` | JSON |
| 2 | FBref | `fbref.com/en/comps/22/{YEAR}/schedule/` | HTML |
### Rate Limits
- **ESPN API**: ~1 request/second
- **FBref**: ~1 request/second
### Notes
- MLS runs within a single calendar year
- Some teams share stadiums with NFL teams
---
## WNBA (Women's National Basketball Association)
**Teams**: 13 (including Golden State Valkyries)
**Expected Games**: ~220 per season
**Season**: May - October (single calendar year)
### Sources
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/basketball/wnba/scoreboard` | JSON |
### Rate Limits
- **ESPN API**: ~1 request/second
### Notes
- Many WNBA teams share arenas with NBA teams
- Teams and stadiums are hardcoded (smaller league)
---
## NWSL (National Women's Soccer League)
**Teams**: 14
**Expected Games**: ~182 per season
**Season**: March - November (single calendar year)
### Sources
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/soccer/usa.nwsl/scoreboard` | JSON |
### Rate Limits
- **ESPN API**: ~1 request/second
### Notes
- Many NWSL teams share stadiums with MLS teams
- Teams and stadiums are hardcoded (smaller league)
---
## Stadium Data Sources
Stadium coordinates and metadata come from multiple sources:
| Sport | Sources |
|-------|---------|
| MLB | MLBScoreBot GitHub, cageyjames GeoJSON, hardcoded |
| NFL | NFLScoreBot GitHub, brianhatchl GeoJSON, hardcoded |
| NBA | Hardcoded |
| NHL | Hardcoded |
| MLS | gavinr GeoJSON, hardcoded |
| WNBA | Hardcoded (shared with NBA) |
| NWSL | Hardcoded (shared with MLS) |
---
## General Guidelines
### Rate Limiting
All scrapers implement:
1. **Default delay**: 1 second between requests
2. **Auto-detection**: Detects HTTP 429 (Too Many Requests) responses
3. **Exponential backoff**: Starts at 1 second and doubles after each retry, up to 3 retries
4. **Connection pooling**: Reuses HTTP connections for efficiency
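A minimal sketch of the 429 handling and backoff schedule listed above, assuming the delay doubles after each retry (1 s, 2 s, 4 s). The names `backoff_delays` and `fetch_with_backoff` are illustrative and are not taken from `utils/http.py`; `fetch` stands in for any callable that returns an HTTP status code.

```python
import time
from typing import Callable


def backoff_delays(initial: float = 1.0, max_retries: int = 3) -> list[float]:
    """Delays before each retry: initial, then doubled every time."""
    return [initial * 2**attempt for attempt in range(max_retries)]


def fetch_with_backoff(
    fetch: Callable[[], int],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call fetch() again while it returns 429, sleeping between attempts."""
    status = fetch()
    for delay in backoff_delays(max_retries=max_retries):
        if status != 429:
            break
        sleep(delay)
        status = fetch()
    return status


# Simulated responses: two rate-limit replies, then success.
responses = iter([429, 429, 200])
slept: list[float] = []
final_status = fetch_with_backoff(lambda: next(responses), sleep=slept.append)
```

Passing `sleep` as a parameter keeps the retry loop testable without real waiting.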
### Error Handling
- **Partial data**: If a source fails mid-scrape, partial data is discarded
- **Source fallback**: Automatically tries the next source on failure
- **Logging**: All errors are logged for debugging
### Data Freshness
| Data Type | Freshness |
|-----------|-----------|
| Games (future) | Check weekly during season |
| Games (past) | Final scores available within hours |
| Teams | Update at start of each season |
| Stadiums | Update when venues change |
### Geographic Filter
Games at venues outside USA, Canada, and Mexico are automatically filtered out:
- **NFL**: London, Frankfurt, Munich, Mexico City, São Paulo
- **NHL**: Prague, Stockholm, Helsinki, Tampere, Gothenburg
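One way such a filter could look is sketched below, using only the city names listed above. The actual parser may match on country, coordinates, or stadium IDs instead; `EXCLUDED_CITIES` and `is_filtered_out` are hypothetical names.

```python
# City names copied from the list above; the matching strategy is an assumption.
EXCLUDED_CITIES: dict[str, set[str]] = {
    "nfl": {"london", "frankfurt", "munich", "mexico city", "são paulo"},
    "nhl": {"prague", "stockholm", "helsinki", "tampere", "gothenburg"},
}


def is_filtered_out(sport: str, venue_city: str) -> bool:
    """Return True if a game in this city would be dropped for the given sport."""
    cities = EXCLUDED_CITIES.get(sport.lower(), set())
    return venue_city.strip().casefold() in cities


games = [
    {"sport": "nfl", "city": "London"},
    {"sport": "nfl", "city": "Green Bay"},
    {"sport": "nhl", "city": "Prague"},
]
kept = [g for g in games if not is_filtered_out(g["sport"], g["city"])]
```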
---
## Legal Considerations
This tool is designed for personal/educational use. When using these sources:
1. Respect robots.txt files
2. Don't make excessive requests
3. Cache responses when possible
4. Check each source's Terms of Service
5. Consider that schedule data may be copyrighted
The ESPN API is undocumented but publicly accessible. Sports-Reference sites publish a bot-traffic policy that caps request rates; check it before scraping and stay well under the limit.
@@ -0,0 +1,8 @@
"""SportsTime Parser - Sports data scraper and CloudKit uploader."""
__version__ = "0.1.0"
__author__ = "SportsTime Team"
from .cli import run_cli
__all__ = ["run_cli", "__version__"]
@@ -0,0 +1,14 @@
"""Entry point for sportstime-parser CLI."""
import sys
from .cli import run_cli
def main() -> int:
    """Main entry point."""
    return run_cli()


if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,914 @@
"""CLI subcommand definitions for sportstime-parser."""
import argparse
import sys
from typing import Optional
from .config import (
DEFAULT_SEASON,
CLOUDKIT_ENVIRONMENT,
SUPPORTED_SPORTS,
OUTPUT_DIR,
)
from .utils.logging import get_logger, set_verbose, log_success, log_failure
def create_parser() -> argparse.ArgumentParser:
"""Create the main argument parser with all subcommands."""
parser = argparse.ArgumentParser(
prog="sportstime-parser",
description="Sports data scraper and CloudKit uploader for SportsTime app",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
sportstime-parser scrape nba --season 2025
sportstime-parser scrape all --season 2025
sportstime-parser validate nba --season 2025
sportstime-parser upload nba --season 2025
sportstime-parser status
""",
)
parser.add_argument(
"--verbose", "-v",
action="store_true",
help="Enable verbose output",
)
subparsers = parser.add_subparsers(
dest="command",
title="commands",
description="Available commands",
metavar="COMMAND",
)
# Scrape subcommand
scrape_parser = subparsers.add_parser(
"scrape",
help="Scrape game schedules, teams, and stadiums",
description="Scrape sports data from multiple sources",
)
scrape_parser.add_argument(
"sport",
choices=SUPPORTED_SPORTS + ["all"],
help="Sport to scrape (or 'all' for all sports)",
)
scrape_parser.add_argument(
"--season", "-s",
type=int,
default=DEFAULT_SEASON,
help=f"Season start year (default: {DEFAULT_SEASON})",
)
scrape_parser.add_argument(
"--dry-run",
action="store_true",
help="Parse and validate only, don't write output files",
)
scrape_parser.set_defaults(func=cmd_scrape)
# Validate subcommand
validate_parser = subparsers.add_parser(
"validate",
help="Run validation on existing scraped data",
description="Validate scraped data and regenerate reports",
)
validate_parser.add_argument(
"sport",
choices=SUPPORTED_SPORTS + ["all"],
help="Sport to validate (or 'all' for all sports)",
)
validate_parser.add_argument(
"--season", "-s",
type=int,
default=DEFAULT_SEASON,
help=f"Season start year (default: {DEFAULT_SEASON})",
)
validate_parser.set_defaults(func=cmd_validate)
# Upload subcommand
upload_parser = subparsers.add_parser(
"upload",
help="Upload scraped data to CloudKit",
description="Upload data to CloudKit with resumable, diff-based updates",
)
upload_parser.add_argument(
"sport",
choices=SUPPORTED_SPORTS + ["all"],
help="Sport to upload (or 'all' for all sports)",
)
upload_parser.add_argument(
"--season", "-s",
type=int,
default=DEFAULT_SEASON,
help=f"Season start year (default: {DEFAULT_SEASON})",
)
upload_parser.add_argument(
"--environment", "-e",
choices=["development", "production"],
default=CLOUDKIT_ENVIRONMENT,
help=f"CloudKit environment (default: {CLOUDKIT_ENVIRONMENT})",
)
upload_parser.add_argument(
"--resume",
action="store_true",
help="Resume interrupted upload from last checkpoint",
)
upload_parser.set_defaults(func=cmd_upload)
# Status subcommand
status_parser = subparsers.add_parser(
"status",
help="Show current scrape and upload status",
description="Display summary of scraped data and upload progress",
)
status_parser.set_defaults(func=cmd_status)
# Retry subcommand
retry_parser = subparsers.add_parser(
"retry",
help="Retry failed uploads",
description="Retry records that failed during previous upload attempts",
)
retry_parser.add_argument(
"sport",
choices=SUPPORTED_SPORTS + ["all"],
help="Sport to retry (or 'all' for all sports)",
)
retry_parser.add_argument(
"--season", "-s",
type=int,
default=DEFAULT_SEASON,
help=f"Season start year (default: {DEFAULT_SEASON})",
)
retry_parser.add_argument(
"--environment", "-e",
choices=["development", "production"],
default=CLOUDKIT_ENVIRONMENT,
help=f"CloudKit environment (default: {CLOUDKIT_ENVIRONMENT})",
)
retry_parser.add_argument(
"--max-retries",
type=int,
default=3,
help="Maximum retry attempts per record (default: 3)",
)
retry_parser.set_defaults(func=cmd_retry)
# Clear subcommand
clear_parser = subparsers.add_parser(
"clear",
help="Clear upload session state",
description="Delete upload session state files to start fresh",
)
clear_parser.add_argument(
"sport",
choices=SUPPORTED_SPORTS + ["all"],
help="Sport to clear (or 'all' for all sports)",
)
clear_parser.add_argument(
"--season", "-s",
type=int,
default=DEFAULT_SEASON,
help=f"Season start year (default: {DEFAULT_SEASON})",
)
clear_parser.add_argument(
"--environment", "-e",
choices=["development", "production"],
default=CLOUDKIT_ENVIRONMENT,
help=f"CloudKit environment (default: {CLOUDKIT_ENVIRONMENT})",
)
clear_parser.set_defaults(func=cmd_clear)
return parser
def get_scraper(sport: str, season: int):
    """Get the appropriate scraper for a sport.

    Args:
        sport: Sport code
        season: Season start year

    Returns:
        Scraper instance

    Raises:
        NotImplementedError: If sport scraper is not yet implemented
    """
    if sport == "nba":
        from .scrapers.nba import create_nba_scraper
        return create_nba_scraper(season)
    elif sport == "mlb":
        from .scrapers.mlb import create_mlb_scraper
        return create_mlb_scraper(season)
    elif sport == "nfl":
        from .scrapers.nfl import create_nfl_scraper
        return create_nfl_scraper(season)
    elif sport == "nhl":
        from .scrapers.nhl import create_nhl_scraper
        return create_nhl_scraper(season)
    elif sport == "mls":
        from .scrapers.mls import create_mls_scraper
        return create_mls_scraper(season)
    elif sport == "wnba":
        from .scrapers.wnba import create_wnba_scraper
        return create_wnba_scraper(season)
    elif sport == "nwsl":
        from .scrapers.nwsl import create_nwsl_scraper
        return create_nwsl_scraper(season)
    else:
        raise NotImplementedError(f"Scraper for {sport} not yet implemented")
def cmd_scrape(args: argparse.Namespace) -> int:
"""Execute the scrape command."""
from .models.game import save_games
from .models.team import save_teams
from .models.stadium import save_stadiums
from .validators.report import generate_report, validate_games
logger = get_logger()
sports = SUPPORTED_SPORTS if args.sport == "all" else [args.sport]
logger.info(f"Scraping {', '.join(sports)} for {args.season}-{args.season + 1} season")
if args.dry_run:
logger.info("Dry run mode - no files will be written")
# Ensure output directory exists
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
success_count = 0
failure_count = 0
for sport in sports:
logger.info(f"\n{'='*50}")
logger.info(f"Scraping {sport.upper()}...")
logger.info(f"{'='*50}")
try:
# Get scraper for this sport
scraper = get_scraper(sport, args.season)
# Scrape all data
result = scraper.scrape_all()
if not result.success:
log_failure(f"{sport.upper()}: {result.error_message}")
failure_count += 1
continue
# Validate games
validation_issues = validate_games(result.games)
all_review_items = result.review_items + validation_issues
# Generate validation report
report = generate_report(
sport=sport,
season=args.season,
source=result.source,
games=result.games,
teams=result.teams,
stadiums=result.stadiums,
review_items=all_review_items,
)
# Log summary
logger.info(f"Games: {report.summary.total_games}")
logger.info(f"Teams: {len(result.teams)}")
logger.info(f"Stadiums: {len(result.stadiums)}")
logger.info(f"Coverage: {report.summary.game_coverage:.1f}%")
logger.info(f"Review items: {report.summary.review_count}")
if not args.dry_run:
# Save output files
games_file = OUTPUT_DIR / f"games_{sport}_{args.season}.json"
teams_file = OUTPUT_DIR / f"teams_{sport}.json"
stadiums_file = OUTPUT_DIR / f"stadiums_{sport}.json"
save_games(result.games, str(games_file))
save_teams(result.teams, str(teams_file))
save_stadiums(result.stadiums, str(stadiums_file))
# Save validation report
report_path = report.save()
logger.info(f"Saved games to: {games_file}")
logger.info(f"Saved teams to: {teams_file}")
logger.info(f"Saved stadiums to: {stadiums_file}")
logger.info(f"Saved report to: {report_path}")
log_success(f"{sport.upper()}: Scraped {result.game_count} games")
success_count += 1
except NotImplementedError as e:
logger.warning(str(e))
failure_count += 1
continue
except Exception as e:
log_failure(f"{sport.upper()}: {e}")
logger.exception("Scraping failed")
failure_count += 1
continue
# Final summary
logger.info(f"\n{'='*50}")
logger.info("SUMMARY")
logger.info(f"{'='*50}")
logger.info(f"Successful: {success_count}")
logger.info(f"Failed: {failure_count}")
return 0 if failure_count == 0 else 1
def cmd_validate(args: argparse.Namespace) -> int:
    """Execute the validate command."""
    from .models.game import load_games
    from .models.team import load_teams
    from .models.stadium import load_stadiums
    from .validators.report import generate_report, validate_games

    logger = get_logger()
    sports = SUPPORTED_SPORTS if args.sport == "all" else [args.sport]
    logger.info(f"Validating {', '.join(sports)} for {args.season}-{args.season + 1} season")

    failure_count = 0

    for sport in sports:
        logger.info(f"\nValidating {sport.upper()}...")

        # Load existing data
        games_file = OUTPUT_DIR / f"games_{sport}_{args.season}.json"
        teams_file = OUTPUT_DIR / f"teams_{sport}.json"
        stadiums_file = OUTPUT_DIR / f"stadiums_{sport}.json"

        if not games_file.exists():
            logger.warning(f"No games file found: {games_file}")
            continue

        try:
            games = load_games(str(games_file))
            teams = load_teams(str(teams_file)) if teams_file.exists() else []
            stadiums = load_stadiums(str(stadiums_file)) if stadiums_file.exists() else []

            # Run validation
            review_items = validate_games(games)

            # Generate report
            report = generate_report(
                sport=sport,
                season=args.season,
                source="existing",
                games=games,
                teams=teams,
                stadiums=stadiums,
                review_items=review_items,
            )

            # Save report
            report_path = report.save()

            logger.info(f"Games: {report.summary.total_games}")
            logger.info(f"Valid: {report.summary.valid_games}")
            logger.info(f"Review items: {report.summary.review_count}")
            logger.info(f"Saved report to: {report_path}")
            log_success(f"{sport.upper()}: Validation complete")
        except Exception as e:
            log_failure(f"{sport.upper()}: {e}")
            logger.exception("Validation failed")
            failure_count += 1
            continue

    # Exit non-zero if any sport raised during validation, matching scrape and upload
    return 0 if failure_count == 0 else 1
def cmd_upload(args: argparse.Namespace) -> int:
"""Execute the upload command."""
from .models.game import load_games
from .models.team import load_teams
from .models.stadium import load_stadiums
from .uploaders import (
CloudKitClient,
CloudKitError,
CloudKitAuthError,
CloudKitRateLimitError,
RecordType,
RecordDiffer,
StateManager,
game_to_cloudkit_record,
team_to_cloudkit_record,
stadium_to_cloudkit_record,
)
from .utils.progress import create_progress_bar
logger = get_logger()
sports = SUPPORTED_SPORTS if args.sport == "all" else [args.sport]
logger.info(f"Uploading {', '.join(sports)} for {args.season}-{args.season + 1} season")
logger.info(f"Environment: {args.environment}")
# Initialize CloudKit client
client = CloudKitClient(environment=args.environment)
if not client.is_configured:
log_failure("CloudKit not configured")
logger.error(
"Set CLOUDKIT_KEY_ID and CLOUDKIT_PRIVATE_KEY_PATH environment variables.\n"
"Get credentials from Apple Developer Portal > Certificates, Identifiers & Profiles > Keys"
)
return 1
# Initialize state manager
state_manager = StateManager()
differ = RecordDiffer()
success_count = 0
failure_count = 0
for sport in sports:
logger.info(f"\n{'='*50}")
logger.info(f"Uploading {sport.upper()}...")
logger.info(f"{'='*50}")
try:
# Load local data
games_file = OUTPUT_DIR / f"games_{sport}_{args.season}.json"
teams_file = OUTPUT_DIR / f"teams_{sport}.json"
stadiums_file = OUTPUT_DIR / f"stadiums_{sport}.json"
if not games_file.exists():
logger.warning(f"No games file found: {games_file}")
logger.warning("Run 'scrape' command first")
failure_count += 1
continue
games = load_games(str(games_file))
teams = load_teams(str(teams_file)) if teams_file.exists() else []
stadiums = load_stadiums(str(stadiums_file)) if stadiums_file.exists() else []
logger.info(f"Loaded {len(games)} games, {len(teams)} teams, {len(stadiums)} stadiums")
# Fetch existing CloudKit records for diff
logger.info("Fetching existing CloudKit records...")
try:
remote_games = client.fetch_all_records(RecordType.GAME)
remote_teams = client.fetch_all_records(RecordType.TEAM)
remote_stadiums = client.fetch_all_records(RecordType.STADIUM)
except CloudKitAuthError as e:
log_failure(f"Authentication failed: {e}")
return 1
except CloudKitRateLimitError:
log_failure("Rate limit exceeded - try again later")
return 1
except CloudKitError as e:
log_failure(f"Failed to fetch records: {e}")
failure_count += 1
continue
# Filter remote records to this sport/season
remote_games = [
r for r in remote_games
if r.get("fields", {}).get("sport", {}).get("value") == sport
and r.get("fields", {}).get("season", {}).get("value") == args.season
]
remote_teams = [
r for r in remote_teams
if r.get("fields", {}).get("sport", {}).get("value") == sport
]
remote_stadiums = [
r for r in remote_stadiums
if r.get("fields", {}).get("sport", {}).get("value") == sport
]
logger.info(f"Found {len(remote_games)} games, {len(remote_teams)} teams, {len(remote_stadiums)} stadiums in CloudKit")
# Calculate diffs
logger.info("Calculating changes...")
game_diff = differ.diff_games(games, remote_games)
team_diff = differ.diff_teams(teams, remote_teams)
stadium_diff = differ.diff_stadiums(stadiums, remote_stadiums)
total_creates = game_diff.create_count + team_diff.create_count + stadium_diff.create_count
total_updates = game_diff.update_count + team_diff.update_count + stadium_diff.update_count
total_unchanged = game_diff.unchanged_count + team_diff.unchanged_count + stadium_diff.unchanged_count
logger.info(f"Creates: {total_creates}, Updates: {total_updates}, Unchanged: {total_unchanged}")
if total_creates == 0 and total_updates == 0:
log_success(f"{sport.upper()}: Already up to date")
success_count += 1
continue
# Prepare records for upload
all_records = []
all_records.extend(game_diff.get_records_to_upload())
all_records.extend(team_diff.get_records_to_upload())
all_records.extend(stadium_diff.get_records_to_upload())
# Create or resume upload session
record_info = [(r.record_name, r.record_type.value) for r in all_records]
session = state_manager.get_session_or_create(
sport=sport,
season=args.season,
environment=args.environment,
record_names=record_info,
resume=args.resume,
)
if args.resume:
pending = session.get_pending_records()
logger.info(f"Resuming: {len(pending)} records pending")
# Filter to only pending records
pending_set = set(pending)
all_records = [r for r in all_records if r.record_name in pending_set]
# Upload records with progress
logger.info(f"Uploading {len(all_records)} records...")
with create_progress_bar(total=len(all_records), description="Uploading") as progress:
batch_result = client.save_records(all_records)
# Update session state
for op_result in batch_result.successful:
session.mark_uploaded(op_result.record_name, op_result.record_change_tag)
progress.advance()
for op_result in batch_result.failed:
session.mark_failed(op_result.record_name, op_result.error_message or "Unknown error")
progress.advance()
# Save session state
state_manager.save_session(session)
# Report results
logger.info(f"Uploaded: {batch_result.success_count}")
logger.info(f"Failed: {batch_result.failure_count}")
if batch_result.failure_count > 0:
log_failure(f"{sport.upper()}: {batch_result.failure_count} records failed")
for op_result in batch_result.failed[:5]: # Show first 5 failures
logger.error(f" {op_result.record_name}: {op_result.error_message}")
if batch_result.failure_count > 5:
logger.error(f" ... and {batch_result.failure_count - 5} more")
failure_count += 1
else:
log_success(f"{sport.upper()}: Uploaded {batch_result.success_count} records")
# Clear session on complete success
state_manager.delete_session(sport, args.season, args.environment)
success_count += 1
except Exception as e:
log_failure(f"{sport.upper()}: {e}")
logger.exception("Upload failed")
failure_count += 1
continue
# Final summary
logger.info(f"\n{'='*50}")
logger.info("SUMMARY")
logger.info(f"{'='*50}")
logger.info(f"Successful: {success_count}")
logger.info(f"Failed: {failure_count}")
return 0 if failure_count == 0 else 1
def cmd_status(args: argparse.Namespace) -> int:
"""Execute the status command."""
from datetime import datetime
from .config import STATE_DIR, EXPECTED_GAME_COUNTS
from .uploaders import StateManager
logger = get_logger()
logger.info("SportsTime Parser Status")
logger.info("=" * 50)
logger.info("")
# Check for scraped data
logger.info("[bold]Scraped Data[/bold]")
logger.info("-" * 40)
total_games = 0
scraped_sports = 0
for sport in SUPPORTED_SPORTS:
games_file = OUTPUT_DIR / f"games_{sport}_{DEFAULT_SEASON}.json"
teams_file = OUTPUT_DIR / f"teams_{sport}.json"
stadiums_file = OUTPUT_DIR / f"stadiums_{sport}.json"
if games_file.exists():
from .models.game import load_games
from .models.team import load_teams
from .models.stadium import load_stadiums
try:
games = load_games(str(games_file))
teams = load_teams(str(teams_file)) if teams_file.exists() else []
stadiums = load_stadiums(str(stadiums_file)) if stadiums_file.exists() else []
game_count = len(games)
expected = EXPECTED_GAME_COUNTS.get(sport, 0)
coverage = (game_count / expected * 100) if expected > 0 else 0
# Format with coverage indicator
if coverage >= 95:
status = "[green]✓[/green]"
elif coverage >= 80:
status = "[yellow]~[/yellow]"
else:
status = "[red]![/red]"
logger.info(
f" {status} {sport.upper():6} {game_count:5} games, "
f"{len(teams):2} teams, {len(stadiums):2} stadiums "
f"({coverage:.0f}% coverage)"
)
total_games += game_count
scraped_sports += 1
except Exception as e:
logger.info(f" [red]✗[/red] {sport.upper():6} Error loading: {e}")
else:
logger.info(f" [dim]-[/dim] {sport.upper():6} Not scraped")
logger.info("-" * 40)
logger.info(f" Total: {total_games} games across {scraped_sports} sports")
logger.info("")
# Check for upload sessions
logger.info("[bold]Upload Sessions[/bold]")
logger.info("-" * 40)
state_manager = StateManager()
sessions = state_manager.list_sessions()
if sessions:
for session in sessions:
sport = session["sport"].upper()
season = session["season"]
env = session["environment"]
progress = session["progress"]
percent = session["progress_percent"]
status = session["status"]
failed = session["failed_count"]
if status == "complete":
status_icon = "[green]✓[/green]"
elif failed > 0:
status_icon = "[yellow]![/yellow]"
else:
status_icon = "[blue]→[/blue]"
logger.info(
f" {status_icon} {sport} {season} ({env}): "
f"{progress} ({percent})"
)
if failed > 0:
logger.info(f" [yellow]⚠ {failed} failed records[/yellow]")
# Show last updated time
try:
last_updated = datetime.fromisoformat(session["last_updated"])
age = datetime.utcnow() - last_updated
if age.days > 0:
age_str = f"{age.days} days ago"
elif age.seconds > 3600:
age_str = f"{age.seconds // 3600} hours ago"
elif age.seconds > 60:
age_str = f"{age.seconds // 60} minutes ago"
else:
age_str = "just now"
logger.info(f" Last updated: {age_str}")
except (ValueError, KeyError):
pass
else:
logger.info(" No upload sessions found")
logger.info("")
# CloudKit configuration status
logger.info("[bold]CloudKit Configuration[/bold]")
logger.info("-" * 40)
import os
key_id = os.environ.get("CLOUDKIT_KEY_ID")
key_path = os.environ.get("CLOUDKIT_PRIVATE_KEY_PATH")
key_content = os.environ.get("CLOUDKIT_PRIVATE_KEY")
if key_id:
logger.info(f" [green]✓[/green] CLOUDKIT_KEY_ID: {key_id[:8]}...")
else:
logger.info(" [red]✗[/red] CLOUDKIT_KEY_ID: Not set")
if key_path:
from pathlib import Path
if Path(key_path).exists():
logger.info(f" [green]✓[/green] CLOUDKIT_PRIVATE_KEY_PATH: {key_path}")
else:
logger.info(f" [red]✗[/red] CLOUDKIT_PRIVATE_KEY_PATH: File not found: {key_path}")
elif key_content:
logger.info(" [green]✓[/green] CLOUDKIT_PRIVATE_KEY: Set (inline)")
else:
logger.info(" [red]✗[/red] CLOUDKIT_PRIVATE_KEY: Not set")
logger.info("")
return 0
def cmd_retry(args: argparse.Namespace) -> int:
"""Execute the retry command for failed uploads."""
from .models.game import load_games
from .models.team import load_teams
from .models.stadium import load_stadiums
from .uploaders import (
CloudKitClient,
CloudKitError,
CloudKitAuthError,
CloudKitRateLimitError,
StateManager,
game_to_cloudkit_record,
team_to_cloudkit_record,
stadium_to_cloudkit_record,
)
from .utils.progress import create_progress_bar
logger = get_logger()
sports = SUPPORTED_SPORTS if args.sport == "all" else [args.sport]
logger.info(f"Retrying failed uploads for {', '.join(sports)}")
logger.info(f"Environment: {args.environment}")
logger.info(f"Max retries per record: {args.max_retries}")
# Initialize CloudKit client
client = CloudKitClient(environment=args.environment)
if not client.is_configured:
log_failure("CloudKit not configured")
return 1
# Initialize state manager
state_manager = StateManager()
total_retried = 0
total_succeeded = 0
total_failed = 0
for sport in sports:
# Load existing session
session = state_manager.load_session(sport, args.season, args.environment)
if session is None:
logger.info(f"{sport.upper()}: No upload session found")
continue
# Get records eligible for retry
retryable = session.get_retryable_records(max_retries=args.max_retries)
if not retryable:
failed_count = session.failed_count
if failed_count > 0:
logger.info(f"{sport.upper()}: {failed_count} failed records exceeded max retries")
else:
logger.info(f"{sport.upper()}: No failed records to retry")
continue
logger.info(f"{sport.upper()}: Retrying {len(retryable)} failed records...")
# Load local data to get the records
games_file = OUTPUT_DIR / f"games_{sport}_{args.season}.json"
teams_file = OUTPUT_DIR / f"teams_{sport}.json"
stadiums_file = OUTPUT_DIR / f"stadiums_{sport}.json"
if not games_file.exists():
logger.warning(f"No games file found: {games_file}")
continue
games = load_games(str(games_file))
teams = load_teams(str(teams_file)) if teams_file.exists() else []
stadiums = load_stadiums(str(stadiums_file)) if stadiums_file.exists() else []
# Build record lookup
records_to_retry = []
retryable_set = set(retryable)
for game in games:
if game.id in retryable_set:
records_to_retry.append(game_to_cloudkit_record(game))
for team in teams:
if team.id in retryable_set:
records_to_retry.append(team_to_cloudkit_record(team))
for stadium in stadiums:
if stadium.id in retryable_set:
records_to_retry.append(stadium_to_cloudkit_record(stadium))
if not records_to_retry:
logger.warning(f"{sport.upper()}: Could not find records for retry")
continue
# Mark as pending for retry
for record_name in retryable:
session.mark_pending(record_name)
# Retry upload
try:
with create_progress_bar(total=len(records_to_retry), description="Retrying") as progress:
batch_result = client.save_records(records_to_retry)
for op_result in batch_result.successful:
session.mark_uploaded(op_result.record_name, op_result.record_change_tag)
progress.advance()
total_succeeded += 1
for op_result in batch_result.failed:
session.mark_failed(op_result.record_name, op_result.error_message or "Unknown error")
progress.advance()
total_failed += 1
state_manager.save_session(session)
total_retried += len(records_to_retry)
if batch_result.failure_count > 0:
log_failure(f"{sport.upper()}: {batch_result.failure_count} still failing")
else:
log_success(f"{sport.upper()}: All {batch_result.success_count} retries succeeded")
# Clear session if all complete
if session.is_complete:
state_manager.delete_session(sport, args.season, args.environment)
except CloudKitAuthError as e:
log_failure(f"Authentication failed: {e}")
return 1
except CloudKitRateLimitError:
log_failure("Rate limit exceeded - try again later")
state_manager.save_session(session)
return 1
except CloudKitError as e:
log_failure(f"Upload error: {e}")
state_manager.save_session(session)
continue
# Summary
logger.info(f"\n{'='*50}")
logger.info("RETRY SUMMARY")
logger.info(f"{'='*50}")
logger.info(f"Retried: {total_retried}")
logger.info(f"Succeeded: {total_succeeded}")
logger.info(f"Failed: {total_failed}")
return 0 if total_failed == 0 else 1
def cmd_clear(args: argparse.Namespace) -> int:
    """Execute the clear command to delete upload state."""
    from .uploaders import StateManager

    logger = get_logger()
    sports = SUPPORTED_SPORTS if args.sport == "all" else [args.sport]
    logger.info(f"Clearing upload state for {', '.join(sports)}")

    state_manager = StateManager()
    cleared_count = 0
    for sport in sports:
        if state_manager.delete_session(sport, args.season, args.environment):
            logger.info(f" [green]✓[/green] Cleared {sport.upper()} {args.season} ({args.environment})")
            cleared_count += 1
        else:
            logger.info(f" [dim]-[/dim] No session for {sport.upper()} {args.season} ({args.environment})")

    logger.info(f"\nCleared {cleared_count} session(s)")
    return 0
def run_cli(argv: Optional[list[str]] = None) -> int:
    """Parse arguments and run the appropriate command."""
    parser = create_parser()
    args = parser.parse_args(argv)

    if args.verbose:
        set_verbose(True)

    # No subcommand given: show usage and signal failure to the shell.
    if args.command is None:
        parser.print_help()
        return 1

    return args.func(args)
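The dispatch in `run_cli` relies on each subcommand storing its handler on the parsed namespace as `func`. Since `create_parser` is not part of this excerpt, the following is only a minimal sketch of how such a parser could be wired with `argparse`; the `clear` stand-in handler, its flags, and their defaults are assumptions made for illustration.

```python
import argparse
from typing import Optional


def cmd_clear(args: argparse.Namespace) -> int:
    """Stand-in handler (assumption): the real one deletes saved upload sessions."""
    print(f"would clear {args.sport} {args.season}")
    return 0


def create_parser() -> argparse.ArgumentParser:
    """Hypothetical parser with a single subcommand."""
    parser = argparse.ArgumentParser(prog="sportstime-parser")
    parser.add_argument("--verbose", action="store_true")
    subparsers = parser.add_subparsers(dest="command")

    clear = subparsers.add_parser("clear", help="Delete upload state")
    clear.add_argument("--sport", default="all")
    clear.add_argument("--season", type=int, default=2025)
    # Each subcommand registers its handler under the attribute `func`.
    clear.set_defaults(func=cmd_clear)
    return parser


def run_cli(argv: Optional[list[str]] = None) -> int:
    parser = create_parser()
    args = parser.parse_args(argv)
    if args.command is None:
        # No subcommand: print usage and return a non-zero exit code.
        parser.print_help()
        return 1
    return args.func(args)
```

Returning an `int` from every handler lets the entry point pass the result straight to `sys.exit`, which is presumably why the handlers in this module all return `0` or `1`.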
@@ -0,0 +1,56 @@
"""Configuration constants for sportstime-parser."""
from pathlib import Path
# Package paths
PACKAGE_DIR = Path(__file__).parent
SCRIPTS_DIR = PACKAGE_DIR.parent
OUTPUT_DIR = SCRIPTS_DIR / "output"
STATE_DIR = SCRIPTS_DIR / ".parser_state"
# Alias files (existing in Scripts/)
TEAM_ALIASES_FILE = SCRIPTS_DIR / "team_aliases.json"
STADIUM_ALIASES_FILE = SCRIPTS_DIR / "stadium_aliases.json"
LEAGUE_STRUCTURE_FILE = SCRIPTS_DIR / "league_structure.json"
# Supported sports
SUPPORTED_SPORTS: list[str] = [
"nba",
"mlb",
"nfl",
"nhl",
"mls",
"wnba",
"nwsl",
]
# Default season (start year of the season, e.g., 2025 for 2025-26)
DEFAULT_SEASON: int = 2025
# CloudKit configuration
CLOUDKIT_CONTAINER_ID: str = "iCloud.com.sportstime.app"
CLOUDKIT_ENVIRONMENT: str = "development"
CLOUDKIT_BATCH_SIZE: int = 200
# Rate limiting
DEFAULT_REQUEST_DELAY: float = 1.0 # seconds between requests
MAX_RETRIES: int = 3
BACKOFF_FACTOR: float = 2.0 # exponential backoff multiplier
INITIAL_BACKOFF: float = 1.0 # initial backoff in seconds
# Expected game counts per sport (approximate, for validation)
EXPECTED_GAME_COUNTS: dict[str, int] = {
"nba": 1230, # 30 teams × 82 games / 2
"mlb": 2430, # 30 teams × 162 games / 2
"nfl": 272, # 32 teams × 17 games / 2
"nhl": 1312, # 32 teams × 82 games / 2
"mls": 493, # 30 teams × varies
"wnba": 220, # approximate (13 teams × 40 games / 2 would be 260)
"nwsl": 182, # 14 teams × 26 games / 2
}
# Minimum match score for fuzzy matching (0-100)
FUZZY_MATCH_THRESHOLD: int = 80
# Geographic filter (only include games in these countries)
ALLOWED_COUNTRIES: set[str] = {"USA", "US", "United States", "Canada", "Mexico"}
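The three backoff constants above imply a retry schedule, although the code that consumes them is not shown in this excerpt. As a rough sketch, assuming the delay before retry `n` is `INITIAL_BACKOFF * BACKOFF_FACTOR ** n` (the helper name `backoff_delays` is hypothetical), the waits would be:

```python
MAX_RETRIES = 3
BACKOFF_FACTOR = 2.0
INITIAL_BACKOFF = 1.0


def backoff_delays(
    max_retries: int = MAX_RETRIES,
    factor: float = BACKOFF_FACTOR,
    initial: float = INITIAL_BACKOFF,
) -> list[float]:
    """Return the wait in seconds before each retry attempt (assumed formula)."""
    return [initial * factor**attempt for attempt in range(max_retries)]


print(backoff_delays())  # [1.0, 2.0, 4.0]
```

With the defaults, a request that keeps failing would wait about seven seconds in total before giving up, on top of any per-request delay.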
@@ -0,0 +1,35 @@
"""Data models for sportstime-parser."""
from .game import Game, save_games, load_games
from .team import Team, save_teams, load_teams
from .stadium import Stadium, save_stadiums, load_stadiums
from .aliases import (
AliasType,
ReviewReason,
TeamAlias,
StadiumAlias,
FuzzyMatch,
ManualReviewItem,
)
__all__ = [
# Game
"Game",
"save_games",
"load_games",
# Team
"Team",
"save_teams",
"load_teams",
# Stadium
"Stadium",
"save_stadiums",
"load_stadiums",
# Aliases
"AliasType",
"ReviewReason",
"TeamAlias",
"StadiumAlias",
"FuzzyMatch",
"ManualReviewItem",
]
@@ -0,0 +1,262 @@
"""Alias and manual review data models for sportstime-parser."""
from dataclasses import dataclass, field
from datetime import date, datetime
from enum import Enum
from typing import Optional
import json
class AliasType(Enum):
"""Type of team alias."""
NAME = "name"
ABBREVIATION = "abbreviation"
CITY = "city"
class ReviewReason(Enum):
"""Reason an item requires manual review."""
UNRESOLVED_TEAM = "unresolved_team"
UNRESOLVED_STADIUM = "unresolved_stadium"
LOW_CONFIDENCE_MATCH = "low_confidence_match"
MISSING_DATA = "missing_data"
DUPLICATE_GAME = "duplicate_game"
TIMEZONE_UNKNOWN = "timezone_unknown"
GEOGRAPHIC_FILTER = "geographic_filter"
@dataclass
class TeamAlias:
"""Represents a team alias with optional date validity.
Attributes:
id: Unique alias ID
team_canonical_id: The canonical team ID this alias resolves to
alias_type: Type of alias (name, abbreviation, city)
alias_value: The alias value to match against
valid_from: Start date of alias validity (None = always valid)
valid_until: End date of alias validity (None = still valid)
"""
id: str
team_canonical_id: str
alias_type: AliasType
alias_value: str
valid_from: Optional[date] = None
valid_until: Optional[date] = None
def is_valid_on(self, check_date: date) -> bool:
"""Check if this alias is valid on the given date."""
if self.valid_from and check_date < self.valid_from:
return False
if self.valid_until and check_date > self.valid_until:
return False
return True
def to_dict(self) -> dict:
"""Convert to dictionary for JSON serialization."""
return {
"id": self.id,
"team_canonical_id": self.team_canonical_id,
"alias_type": self.alias_type.value,
"alias_value": self.alias_value,
"valid_from": self.valid_from.isoformat() if self.valid_from else None,
"valid_until": self.valid_until.isoformat() if self.valid_until else None,
}
@classmethod
def from_dict(cls, data: dict) -> "TeamAlias":
"""Create a TeamAlias from a dictionary."""
valid_from = None
if data.get("valid_from"):
valid_from = date.fromisoformat(data["valid_from"])
valid_until = None
if data.get("valid_until"):
valid_until = date.fromisoformat(data["valid_until"])
return cls(
id=data["id"],
team_canonical_id=data["team_canonical_id"],
alias_type=AliasType(data["alias_type"]),
alias_value=data["alias_value"],
valid_from=valid_from,
valid_until=valid_until,
)
@dataclass
class StadiumAlias:
"""Represents a stadium alias with optional date validity.
Attributes:
alias_name: The alias name to match against (lowercase)
stadium_canonical_id: The canonical stadium ID this alias resolves to
valid_from: Start date of alias validity (None = always valid)
valid_until: End date of alias validity (None = still valid)
"""
alias_name: str
stadium_canonical_id: str
valid_from: Optional[date] = None
valid_until: Optional[date] = None
def is_valid_on(self, check_date: date) -> bool:
"""Check if this alias is valid on the given date."""
if self.valid_from and check_date < self.valid_from:
return False
if self.valid_until and check_date > self.valid_until:
return False
return True
def to_dict(self) -> dict:
"""Convert to dictionary for JSON serialization."""
return {
"alias_name": self.alias_name,
"stadium_canonical_id": self.stadium_canonical_id,
"valid_from": self.valid_from.isoformat() if self.valid_from else None,
"valid_until": self.valid_until.isoformat() if self.valid_until else None,
}
@classmethod
def from_dict(cls, data: dict) -> "StadiumAlias":
"""Create a StadiumAlias from a dictionary."""
valid_from = None
if data.get("valid_from"):
valid_from = date.fromisoformat(data["valid_from"])
valid_until = None
if data.get("valid_until"):
valid_until = date.fromisoformat(data["valid_until"])
return cls(
alias_name=data["alias_name"],
stadium_canonical_id=data["stadium_canonical_id"],
valid_from=valid_from,
valid_until=valid_until,
)
@dataclass
class FuzzyMatch:
"""Represents a fuzzy match suggestion with confidence score."""
canonical_id: str
canonical_name: str
confidence: int # 0-100
def to_dict(self) -> dict:
"""Convert to dictionary for JSON serialization."""
return {
"canonical_id": self.canonical_id,
"canonical_name": self.canonical_name,
"confidence": self.confidence,
}
@dataclass
class ManualReviewItem:
"""Represents an item requiring manual review.
Attributes:
id: Unique review item ID
reason: Why this item needs review
sport: Sport code
raw_value: The original unresolved value
context: Additional context about the issue
source_url: URL of the source page
suggested_matches: List of potential matches with confidence scores
game_date: Date of the game (if applicable)
created_at: When this review item was created
"""
id: str
reason: ReviewReason
sport: str
raw_value: str
context: dict = field(default_factory=dict)
source_url: Optional[str] = None
suggested_matches: list[FuzzyMatch] = field(default_factory=list)
game_date: Optional[date] = None
created_at: datetime = field(default_factory=datetime.now)
def to_dict(self) -> dict:
"""Convert to dictionary for JSON serialization."""
return {
"id": self.id,
"reason": self.reason.value,
"sport": self.sport,
"raw_value": self.raw_value,
"context": self.context,
"source_url": self.source_url,
"suggested_matches": [m.to_dict() for m in self.suggested_matches],
"game_date": self.game_date.isoformat() if self.game_date else None,
"created_at": self.created_at.isoformat(),
}
@classmethod
def from_dict(cls, data: dict) -> "ManualReviewItem":
"""Create a ManualReviewItem from a dictionary."""
game_date = None
if data.get("game_date"):
game_date = date.fromisoformat(data["game_date"])
created_at = datetime.now()
if data.get("created_at"):
created_at = datetime.fromisoformat(data["created_at"])
suggested_matches = []
for match_data in data.get("suggested_matches", []):
suggested_matches.append(FuzzyMatch(
canonical_id=match_data["canonical_id"],
canonical_name=match_data["canonical_name"],
confidence=match_data["confidence"],
))
return cls(
id=data["id"],
reason=ReviewReason(data["reason"]),
sport=data["sport"],
raw_value=data["raw_value"],
context=data.get("context", {}),
source_url=data.get("source_url"),
suggested_matches=suggested_matches,
game_date=game_date,
created_at=created_at,
)
def to_markdown(self) -> str:
"""Generate markdown representation for validation report."""
lines = [
f"### {self.reason.value.replace('_', ' ').title()}: {self.raw_value}",
"",
f"**Sport**: {self.sport.upper()}",
]
if self.game_date:
lines.append(f"**Game Date**: {self.game_date.isoformat()}")
if self.context:
lines.append("")
lines.append("**Context**:")
for key, value in self.context.items():
lines.append(f"- {key}: {value}")
if self.suggested_matches:
lines.append("")
lines.append("**Suggested Matches**:")
for i, match in enumerate(self.suggested_matches, 1):
marker = " <- likely correct" if match.confidence >= 90 else ""
lines.append(
f"{i}. `{match.canonical_id}` ({match.confidence}%){marker}"
)
if self.source_url:
lines.append("")
lines.append(f"**Source**: [{self.source_url}]({self.source_url})")
lines.append("")
lines.append("---")
lines.append("")
return "\n".join(lines)
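To make the report format concrete, here is a small sketch of the suggestion list that `to_markdown` builds. It copies the `FuzzyMatch` fields and the 90% marker rule from the code above; the standalone function `render_suggestions` and the confidence values in the usage note are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class FuzzyMatch:
    canonical_id: str
    canonical_name: str
    confidence: int  # 0-100


def render_suggestions(matches: list[FuzzyMatch], likely_cutoff: int = 90) -> list[str]:
    """Render numbered suggestions, flagging those at or above the cutoff."""
    lines = []
    for i, match in enumerate(matches, 1):
        marker = " <- likely correct" if match.confidence >= likely_cutoff else ""
        lines.append(f"{i}. `{match.canonical_id}` ({match.confidence}%){marker}")
    return lines


suggestions = render_suggestions([
    FuzzyMatch("team_nba_okc", "Oklahoma City Thunder", 94),  # illustrative score
    FuzzyMatch("team_nba_orl", "Orlando Magic", 61),          # illustrative score
])
print("\n".join(suggestions))
```

Only the canonical ID and the score appear in the rendered line; `canonical_name` is carried in the data but not printed by this part of the report.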
@@ -0,0 +1,112 @@
"""Game data model for sportstime-parser."""
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional
import json
@dataclass
class Game:
"""Represents a game with all CloudKit fields.
Attributes:
id: Canonical game ID (e.g., 'nba_2025_hou_okc_1021')
sport: Sport code (e.g., 'nba', 'mlb')
season: Season start year (e.g., 2025 for 2025-26)
home_team_id: Canonical home team ID
away_team_id: Canonical away team ID
stadium_id: Canonical stadium ID
game_date: Game date/time in UTC
game_number: Game number for doubleheaders (1 or 2), None for single games
home_score: Final home team score (None if not played)
away_score: Final away team score (None if not played)
status: Game status ('scheduled', 'final', 'postponed', 'cancelled')
source_url: URL of the source page for manual review
raw_home_team: Original home team name from source (for debugging)
raw_away_team: Original away team name from source (for debugging)
raw_stadium: Original stadium name from source (for debugging)
"""
id: str
sport: str
season: int
home_team_id: str
away_team_id: str
stadium_id: str
game_date: datetime
game_number: Optional[int] = None
home_score: Optional[int] = None
away_score: Optional[int] = None
status: str = "scheduled"
source_url: Optional[str] = None
raw_home_team: Optional[str] = None
raw_away_team: Optional[str] = None
raw_stadium: Optional[str] = None
def to_dict(self) -> dict:
"""Convert to dictionary for JSON serialization."""
return {
"id": self.id,
"sport": self.sport,
"season": self.season,
"home_team_id": self.home_team_id,
"away_team_id": self.away_team_id,
"stadium_id": self.stadium_id,
"game_date": self.game_date.isoformat(),
"game_number": self.game_number,
"home_score": self.home_score,
"away_score": self.away_score,
"status": self.status,
"source_url": self.source_url,
"raw_home_team": self.raw_home_team,
"raw_away_team": self.raw_away_team,
"raw_stadium": self.raw_stadium,
}
@classmethod
def from_dict(cls, data: dict) -> "Game":
"""Create a Game from a dictionary."""
game_date = data["game_date"]
if isinstance(game_date, str):
game_date = datetime.fromisoformat(game_date)
return cls(
id=data["id"],
sport=data["sport"],
season=data["season"],
home_team_id=data["home_team_id"],
away_team_id=data["away_team_id"],
stadium_id=data["stadium_id"],
game_date=game_date,
game_number=data.get("game_number"),
home_score=data.get("home_score"),
away_score=data.get("away_score"),
status=data.get("status", "scheduled"),
source_url=data.get("source_url"),
raw_home_team=data.get("raw_home_team"),
raw_away_team=data.get("raw_away_team"),
raw_stadium=data.get("raw_stadium"),
)
def to_json(self) -> str:
"""Serialize to JSON string."""
return json.dumps(self.to_dict(), indent=2)
@classmethod
def from_json(cls, json_str: str) -> "Game":
"""Deserialize from JSON string."""
return cls.from_dict(json.loads(json_str))
def save_games(games: list[Game], filepath: str) -> None:
    """Save a list of games to a JSON file."""
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump([g.to_dict() for g in games], f, indent=2)


def load_games(filepath: str) -> list[Game]:
    """Load a list of games from a JSON file."""
    with open(filepath, "r", encoding="utf-8") as f:
        data = json.load(f)
    return [Game.from_dict(d) for d in data]
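`Game` stores `game_date` through `isoformat()` and restores it with `datetime.fromisoformat()`. The sketch below, which uses a hypothetical tip-off time, illustrates that a timezone-aware UTC value survives the JSON round trip with its offset intact, while a naive value comes back naive.

```python
import json
from datetime import datetime, timezone

# Hypothetical tip-off time, expressed in UTC as the Game docstring requires.
tipoff = datetime(2025, 10, 21, 23, 30, tzinfo=timezone.utc)

payload = json.dumps({"id": "nba_2025_hou_okc_1021", "game_date": tipoff.isoformat()})
restored = datetime.fromisoformat(json.loads(payload)["game_date"])

# A naive datetime round-trips too, but carries no offset at all.
naive = datetime(2025, 10, 21, 23, 30)
restored_naive = datetime.fromisoformat(naive.isoformat())

print(restored == tipoff, restored.tzinfo, restored_naive.tzinfo)
```

Because nothing in `from_dict` checks for an offset, callers that need the "UTC" guarantee may want to ensure the datetime is timezone-aware before constructing a `Game`.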
@@ -0,0 +1,108 @@
"""Stadium data model for sportstime-parser."""
from dataclasses import dataclass
from typing import Optional
import json
@dataclass
class Stadium:
"""Represents a stadium with all CloudKit fields.
Attributes:
id: Canonical stadium ID (e.g., 'stadium_nba_paycom_center')
sport: Primary sport code (e.g., 'nba', 'mlb')
name: Current stadium name (e.g., 'Paycom Center')
city: City name (e.g., 'Oklahoma City')
state: State/province code (e.g., 'OK', 'ON')
country: Country code (e.g., 'USA', 'Canada')
latitude: Latitude coordinate
longitude: Longitude coordinate
capacity: Seating capacity
surface: Playing surface (e.g., 'grass', 'turf', 'hardwood')
roof_type: Roof type (e.g., 'dome', 'retractable', 'open')
opened_year: Year stadium opened
image_url: URL to stadium image
timezone: IANA timezone (e.g., 'America/Chicago')
"""
id: str
sport: str
name: str
city: str
state: str
country: str
latitude: float
longitude: float
capacity: Optional[int] = None
surface: Optional[str] = None
roof_type: Optional[str] = None
opened_year: Optional[int] = None
image_url: Optional[str] = None
timezone: Optional[str] = None
def to_dict(self) -> dict:
"""Convert to dictionary for JSON serialization."""
return {
"id": self.id,
"sport": self.sport,
"name": self.name,
"city": self.city,
"state": self.state,
"country": self.country,
"latitude": self.latitude,
"longitude": self.longitude,
"capacity": self.capacity,
"surface": self.surface,
"roof_type": self.roof_type,
"opened_year": self.opened_year,
"image_url": self.image_url,
"timezone": self.timezone,
}
@classmethod
def from_dict(cls, data: dict) -> "Stadium":
"""Create a Stadium from a dictionary."""
return cls(
id=data["id"],
sport=data["sport"],
name=data["name"],
city=data["city"],
state=data["state"],
country=data["country"],
latitude=data["latitude"],
longitude=data["longitude"],
capacity=data.get("capacity"),
surface=data.get("surface"),
roof_type=data.get("roof_type"),
opened_year=data.get("opened_year"),
image_url=data.get("image_url"),
timezone=data.get("timezone"),
)
def to_json(self) -> str:
"""Serialize to JSON string."""
return json.dumps(self.to_dict(), indent=2)
@classmethod
def from_json(cls, json_str: str) -> "Stadium":
"""Deserialize from JSON string."""
return cls.from_dict(json.loads(json_str))
def is_in_allowed_region(self) -> bool:
"""Check if stadium is in USA, Canada, or Mexico."""
allowed = {"USA", "US", "United States", "Canada", "CA", "Mexico", "MX"}
return self.country in allowed
def save_stadiums(stadiums: list[Stadium], filepath: str) -> None:
"""Save a list of stadiums to a JSON file."""
with open(filepath, "w", encoding="utf-8") as f:
json.dump([s.to_dict() for s in stadiums], f, indent=2)
def load_stadiums(filepath: str) -> list[Stadium]:
"""Load a list of stadiums from a JSON file."""
with open(filepath, "r", encoding="utf-8") as f:
data = json.load(f)
return [Stadium.from_dict(d) for d in data]
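`Stadium.is_in_allowed_region` keeps its own set of country strings instead of reading `ALLOWED_COUNTRIES` from the configuration module, and both checks are case-sensitive. The sketch below copies the two sets from this diff to show where they differ; `is_allowed_country` is a hypothetical helper, not part of the package, that illustrates a case-insensitive comparison.

```python
# Copied from the configuration module in this diff.
ALLOWED_COUNTRIES = {"USA", "US", "United States", "Canada", "Mexico"}
# Copied from Stadium.is_in_allowed_region.
STADIUM_ALLOWED = {"USA", "US", "United States", "Canada", "CA", "Mexico", "MX"}

# Values accepted by the stadium check but absent from the config set.
only_in_stadium_check = sorted(STADIUM_ALLOWED - ALLOWED_COUNTRIES)
print(only_in_stadium_check)


def is_allowed_country(country: str, allowed=frozenset(STADIUM_ALLOWED)) -> bool:
    """Hypothetical helper: compare case-insensitively after trimming."""
    normalized = {name.casefold() for name in allowed}
    return country.strip().casefold() in normalized
```

Whether scraped sources ever emit lowercase or padded country names is not known from this excerpt, so the helper is a precaution rather than a confirmed fix.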
@@ -0,0 +1,95 @@
"""Team data model for sportstime-parser."""
from dataclasses import dataclass
from typing import Optional
import json
@dataclass
class Team:
"""Represents a team with all CloudKit fields.
Attributes:
id: Canonical team ID (e.g., 'team_nba_okc')
sport: Sport code (e.g., 'nba', 'mlb')
city: Team city (e.g., 'Oklahoma City')
name: Team name (e.g., 'Thunder')
full_name: Full team name (e.g., 'Oklahoma City Thunder')
abbreviation: Official abbreviation (e.g., 'OKC')
conference: Conference name (e.g., 'Western', 'American')
division: Division name (e.g., 'Northwest', 'AL West')
primary_color: Primary team color as hex (e.g., '#007AC1')
secondary_color: Secondary team color as hex (e.g., '#EF3B24')
logo_url: URL to team logo image
stadium_id: Canonical ID of home stadium
"""
id: str
sport: str
city: str
name: str
full_name: str
abbreviation: str
conference: Optional[str] = None
division: Optional[str] = None
primary_color: Optional[str] = None
secondary_color: Optional[str] = None
logo_url: Optional[str] = None
stadium_id: Optional[str] = None
def to_dict(self) -> dict:
"""Convert to dictionary for JSON serialization."""
return {
"id": self.id,
"sport": self.sport,
"city": self.city,
"name": self.name,
"full_name": self.full_name,
"abbreviation": self.abbreviation,
"conference": self.conference,
"division": self.division,
"primary_color": self.primary_color,
"secondary_color": self.secondary_color,
"logo_url": self.logo_url,
"stadium_id": self.stadium_id,
}
@classmethod
def from_dict(cls, data: dict) -> "Team":
"""Create a Team from a dictionary."""
return cls(
id=data["id"],
sport=data["sport"],
city=data["city"],
name=data["name"],
full_name=data["full_name"],
abbreviation=data["abbreviation"],
conference=data.get("conference"),
division=data.get("division"),
primary_color=data.get("primary_color"),
secondary_color=data.get("secondary_color"),
logo_url=data.get("logo_url"),
stadium_id=data.get("stadium_id"),
)
def to_json(self) -> str:
"""Serialize to JSON string."""
return json.dumps(self.to_dict(), indent=2)
@classmethod
def from_json(cls, json_str: str) -> "Team":
"""Deserialize from JSON string."""
return cls.from_dict(json.loads(json_str))
def save_teams(teams: list[Team], filepath: str) -> None:
"""Save a list of teams to a JSON file."""
with open(filepath, "w", encoding="utf-8") as f:
json.dump([t.to_dict() for t in teams], f, indent=2)
def load_teams(filepath: str) -> list[Team]:
"""Load a list of teams from a JSON file."""
with open(filepath, "r", encoding="utf-8") as f:
data = json.load(f)
return [Team.from_dict(d) for d in data]
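The `save_*`/`load_*` pairs in these model modules all follow the same pattern: serialize a list of dataclasses to a JSON array and rebuild them on load. A minimal round-trip sketch follows, using a cut-down `Team` with only four fields and `dataclasses.asdict` in place of the hand-written `to_dict`; the file name `teams_nba.json` is an assumption.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass
class Team:
    """Reduced stand-in for the real Team model."""
    id: str
    sport: str
    abbreviation: str
    stadium_id: Optional[str] = None


def save_teams(teams: list[Team], filepath: str) -> None:
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump([asdict(t) for t in teams], f, indent=2)


def load_teams(filepath: str) -> list[Team]:
    with open(filepath, "r", encoding="utf-8") as f:
        return [Team(**d) for d in json.load(f)]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "teams_nba.json"  # hypothetical file name
    original = [Team("team_nba_okc", "nba", "OKC", "stadium_nba_paycom_center")]
    save_teams(original, str(path))
    round_tripped = load_teams(str(path))
    # Mirror the CLI's guard for files that were never written.
    missing = Path(tmp) / "absent.json"
    fallback = load_teams(str(missing)) if missing.exists() else []
```

Dataclass equality compares field by field, so `round_tripped == original` is a convenient check that nothing was lost in serialization.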
@@ -0,0 +1,91 @@
"""Normalizers for team, stadium, and game data."""
from .canonical_id import (
generate_game_id,
generate_team_id,
generate_team_id_from_abbrev,
generate_stadium_id,
parse_game_id,
normalize_string,
)
from .timezone import (
TimezoneResult,
parse_datetime,
convert_to_utc,
detect_timezone_from_string,
detect_timezone_from_location,
get_stadium_timezone,
create_timezone_warning,
)
from .fuzzy import (
MatchCandidate,
fuzzy_match_team,
fuzzy_match_stadium,
exact_match,
best_match,
calculate_similarity,
normalize_for_matching,
)
from .alias_loader import (
TeamAliasLoader,
StadiumAliasLoader,
get_team_alias_loader,
get_stadium_alias_loader,
resolve_team_alias,
resolve_stadium_alias,
)
from .team_resolver import (
TeamResolver,
TeamResolveResult,
get_team_resolver,
resolve_team,
)
from .stadium_resolver import (
StadiumResolver,
StadiumResolveResult,
get_stadium_resolver,
resolve_stadium,
)
__all__ = [
# Canonical ID
"generate_game_id",
"generate_team_id",
"generate_team_id_from_abbrev",
"generate_stadium_id",
"parse_game_id",
"normalize_string",
# Timezone
"TimezoneResult",
"parse_datetime",
"convert_to_utc",
"detect_timezone_from_string",
"detect_timezone_from_location",
"get_stadium_timezone",
"create_timezone_warning",
# Fuzzy matching
"MatchCandidate",
"fuzzy_match_team",
"fuzzy_match_stadium",
"exact_match",
"best_match",
"calculate_similarity",
"normalize_for_matching",
# Alias loaders
"TeamAliasLoader",
"StadiumAliasLoader",
"get_team_alias_loader",
"get_stadium_alias_loader",
"resolve_team_alias",
"resolve_stadium_alias",
# Team resolver
"TeamResolver",
"TeamResolveResult",
"get_team_resolver",
"resolve_team",
# Stadium resolver
"StadiumResolver",
"StadiumResolveResult",
"get_stadium_resolver",
"resolve_stadium",
]
@@ -0,0 +1,312 @@
"""Alias file loaders for team and stadium name resolution."""
import json
from datetime import date
from pathlib import Path
from typing import Optional
from ..config import TEAM_ALIASES_FILE, STADIUM_ALIASES_FILE
from ..models.aliases import TeamAlias, StadiumAlias, AliasType
class TeamAliasLoader:
"""Loader for team aliases with date-aware resolution.
Loads team aliases from JSON and provides lookup methods
with support for historical name changes.
"""
def __init__(self, filepath: Optional[Path] = None):
"""Initialize the loader.
Args:
filepath: Path to team_aliases.json, defaults to config value
"""
self.filepath = filepath or TEAM_ALIASES_FILE
self._aliases: list[TeamAlias] = []
self._by_value: dict[str, list[TeamAlias]] = {}
self._by_team: dict[str, list[TeamAlias]] = {}
self._loaded = False
def load(self) -> None:
"""Load aliases from the JSON file."""
if not self.filepath.exists():
self._loaded = True
return
with open(self.filepath, "r", encoding="utf-8") as f:
data = json.load(f)
self._aliases = []
self._by_value = {}
self._by_team = {}
for item in data:
alias = TeamAlias.from_dict(item)
self._aliases.append(alias)
# Index by lowercase value
value_key = alias.alias_value.lower()
if value_key not in self._by_value:
self._by_value[value_key] = []
self._by_value[value_key].append(alias)
# Index by team ID
if alias.team_canonical_id not in self._by_team:
self._by_team[alias.team_canonical_id] = []
self._by_team[alias.team_canonical_id].append(alias)
self._loaded = True
def _ensure_loaded(self) -> None:
"""Ensure aliases are loaded."""
if not self._loaded:
self.load()
def resolve(
self,
value: str,
check_date: Optional[date] = None,
alias_types: Optional[list[AliasType]] = None,
) -> Optional[str]:
"""Resolve an alias value to a canonical team ID.
Args:
value: Alias value to look up (case-insensitive)
check_date: Date to check validity (None = current date)
alias_types: Types of aliases to check (None = all types)
Returns:
Canonical team ID if found, None otherwise
"""
self._ensure_loaded()
if check_date is None:
check_date = date.today()
value_key = value.lower().strip()
aliases = self._by_value.get(value_key, [])
for alias in aliases:
# Check type filter
if alias_types and alias.alias_type not in alias_types:
continue
# Check date validity
if alias.is_valid_on(check_date):
return alias.team_canonical_id
return None
def get_aliases_for_team(
self,
team_id: str,
check_date: Optional[date] = None,
) -> list[TeamAlias]:
"""Get all aliases for a team.
Args:
team_id: Canonical team ID
check_date: Date to filter by (None = all aliases)
Returns:
List of TeamAlias objects
"""
self._ensure_loaded()
aliases = self._by_team.get(team_id, [])
if check_date:
aliases = [a for a in aliases if a.is_valid_on(check_date)]
return aliases
def get_all_values(
self,
alias_type: Optional[AliasType] = None,
) -> list[str]:
"""Get all alias values.
Args:
alias_type: Filter by alias type (None = all types)
Returns:
List of alias values
"""
self._ensure_loaded()
values = []
for alias in self._aliases:
if alias_type is None or alias.alias_type == alias_type:
values.append(alias.alias_value)
return values
class StadiumAliasLoader:
"""Loader for stadium aliases with date-aware resolution.
Loads stadium aliases from JSON and provides lookup methods
with support for historical name changes (e.g., naming rights).
"""
def __init__(self, filepath: Optional[Path] = None):
"""Initialize the loader.
Args:
filepath: Path to stadium_aliases.json, defaults to config value
"""
self.filepath = filepath or STADIUM_ALIASES_FILE
self._aliases: list[StadiumAlias] = []
self._by_name: dict[str, list[StadiumAlias]] = {}
self._by_stadium: dict[str, list[StadiumAlias]] = {}
self._loaded = False
def load(self) -> None:
"""Load aliases from the JSON file."""
if not self.filepath.exists():
self._loaded = True
return
with open(self.filepath, "r", encoding="utf-8") as f:
data = json.load(f)
self._aliases = []
self._by_name = {}
self._by_stadium = {}
for item in data:
alias = StadiumAlias.from_dict(item)
self._aliases.append(alias)
# Index by lowercase name
name_key = alias.alias_name.lower()
if name_key not in self._by_name:
self._by_name[name_key] = []
self._by_name[name_key].append(alias)
# Index by stadium ID
if alias.stadium_canonical_id not in self._by_stadium:
self._by_stadium[alias.stadium_canonical_id] = []
self._by_stadium[alias.stadium_canonical_id].append(alias)
self._loaded = True
def _ensure_loaded(self) -> None:
"""Ensure aliases are loaded."""
if not self._loaded:
self.load()
def resolve(
self,
name: str,
check_date: Optional[date] = None,
) -> Optional[str]:
"""Resolve a stadium name to a canonical stadium ID.
Args:
name: Stadium name to look up (case-insensitive)
check_date: Date to check validity (None = current date)
Returns:
Canonical stadium ID if found, None otherwise
"""
self._ensure_loaded()
if check_date is None:
check_date = date.today()
name_key = name.lower().strip()
aliases = self._by_name.get(name_key, [])
for alias in aliases:
if alias.is_valid_on(check_date):
return alias.stadium_canonical_id
return None
def get_aliases_for_stadium(
self,
stadium_id: str,
check_date: Optional[date] = None,
) -> list[StadiumAlias]:
"""Get all aliases for a stadium.
Args:
stadium_id: Canonical stadium ID
check_date: Date to filter by (None = all aliases)
Returns:
List of StadiumAlias objects
"""
self._ensure_loaded()
aliases = self._by_stadium.get(stadium_id, [])
if check_date:
aliases = [a for a in aliases if a.is_valid_on(check_date)]
return aliases
def get_all_names(self) -> list[str]:
"""Get all stadium alias names.
Returns:
List of stadium names
"""
self._ensure_loaded()
return [alias.alias_name for alias in self._aliases]
# Global loader instances (lazy initialized)
_team_alias_loader: Optional[TeamAliasLoader] = None
_stadium_alias_loader: Optional[StadiumAliasLoader] = None
def get_team_alias_loader() -> TeamAliasLoader:
"""Get the global team alias loader instance."""
global _team_alias_loader
if _team_alias_loader is None:
_team_alias_loader = TeamAliasLoader()
return _team_alias_loader
def get_stadium_alias_loader() -> StadiumAliasLoader:
"""Get the global stadium alias loader instance."""
global _stadium_alias_loader
if _stadium_alias_loader is None:
_stadium_alias_loader = StadiumAliasLoader()
return _stadium_alias_loader
def resolve_team_alias(
value: str,
check_date: Optional[date] = None,
) -> Optional[str]:
"""Convenience function to resolve a team alias.
Args:
value: Alias value (name, abbreviation, or city)
check_date: Date to check validity
Returns:
Canonical team ID if found
"""
return get_team_alias_loader().resolve(value, check_date)
def resolve_stadium_alias(
name: str,
check_date: Optional[date] = None,
) -> Optional[str]:
"""Convenience function to resolve a stadium alias.
Args:
name: Stadium name
check_date: Date to check validity
Returns:
Canonical stadium ID if found
"""
return get_stadium_alias_loader().resolve(name, check_date)
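The loaders above combine two ideas: an index keyed by the lowercased alias, and a per-alias validity window whose bounds are inclusive. The sketch below reproduces both with in-memory data instead of a JSON file; the arena name, IDs, and dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class StadiumAlias:
    alias_name: str
    stadium_canonical_id: str
    valid_from: Optional[date] = None
    valid_until: Optional[date] = None

    def is_valid_on(self, check_date: date) -> bool:
        # Both bounds are inclusive; None means unbounded on that side.
        if self.valid_from and check_date < self.valid_from:
            return False
        if self.valid_until and check_date > self.valid_until:
            return False
        return True


def build_index(aliases: list[StadiumAlias]) -> dict[str, list[StadiumAlias]]:
    """Group aliases by lowercased name, preserving file order."""
    index: dict[str, list[StadiumAlias]] = {}
    for alias in aliases:
        index.setdefault(alias.alias_name.lower(), []).append(alias)
    return index


def resolve(index: dict[str, list[StadiumAlias]], name: str, check_date: date) -> Optional[str]:
    """Return the first alias valid on check_date, or None."""
    for alias in index.get(name.lower().strip(), []):
        if alias.is_valid_on(check_date):
            return alias.stadium_canonical_id
    return None


# Hypothetical data: one name that referred to two venues in different eras.
aliases = [
    StadiumAlias("Example Arena", "stadium_nba_old_example", valid_until=date(2019, 12, 31)),
    StadiumAlias("Example Arena", "stadium_nba_new_example", valid_from=date(2020, 1, 1)),
]
index = build_index(aliases)
print(resolve(index, "  example arena ", date(2019, 12, 31)))
print(resolve(index, "EXAMPLE ARENA", date(2020, 1, 1)))
```

Because the first valid alias wins, overlapping validity windows resolve according to the order of entries in the alias file. Note also that lookups strip whitespace while the index keys are only lowercased, so alias values stored with surrounding spaces would never match.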
@@ -0,0 +1,279 @@
"""Canonical ID generation for games, teams, and stadiums."""
import re
import unicodedata
from datetime import date, datetime
from typing import Optional
def normalize_string(s: str) -> str:
    """Normalize a string for use in canonical IDs.

    - Convert to lowercase
    - Strip accents by reducing to ASCII (e.g., é -> e)
    - Replace spaces and hyphens with underscores
    - Remove special characters (except underscores)
    - Collapse multiple underscores
    - Strip leading/trailing underscores

    Args:
        s: String to normalize

    Returns:
        Normalized string suitable for IDs
    """
    # Convert to lowercase
    result = s.lower()

    # Normalize unicode (e.g., é -> e)
    result = unicodedata.normalize("NFKD", result)
    result = result.encode("ascii", "ignore").decode("ascii")

    # Replace spaces and hyphens with underscores
    result = re.sub(r"[\s\-]+", "_", result)

    # Remove special characters except underscores
    result = re.sub(r"[^a-z0-9_]", "", result)

    # Collapse multiple underscores
    result = re.sub(r"_+", "_", result)

    # Strip leading/trailing underscores
    result = result.strip("_")

    return result
def generate_game_id(
    sport: str,
    season: int,
    away_abbrev: str,
    home_abbrev: str,
    game_date: date | datetime,
    game_number: Optional[int] = None,
) -> str:
    """Generate a canonical game ID.

    Format: {sport}_{season}_{away}_{home}_{MMDD}[_{game_number}]

    Args:
        sport: Sport code (e.g., 'nba', 'mlb')
        season: Season start year (e.g., 2025 for 2025-26)
        away_abbrev: Away team abbreviation (e.g., 'HOU')
        home_abbrev: Home team abbreviation (e.g., 'OKC')
        game_date: Date of the game
        game_number: Game number for doubleheaders (1 or 2), None for single games

    Returns:
        Canonical game ID (e.g., 'nba_2025_hou_okc_1021')

    Examples:
        >>> generate_game_id('nba', 2025, 'HOU', 'OKC', date(2025, 10, 21))
        'nba_2025_hou_okc_1021'
        >>> generate_game_id('mlb', 2026, 'NYY', 'BOS', date(2026, 4, 1), game_number=1)
        'mlb_2026_nyy_bos_0401_1'
    """
    # Normalize sport and abbreviations
    sport_norm = sport.lower()
    away_norm = away_abbrev.lower()
    home_norm = home_abbrev.lower()

    # Format date as MMDD
    if isinstance(game_date, datetime):
        game_date = game_date.date()
    date_str = game_date.strftime("%m%d")

    # Build ID
    parts = [sport_norm, str(season), away_norm, home_norm, date_str]

    # Add game number for doubleheaders
    if game_number is not None:
        parts.append(str(game_number))

    return "_".join(parts)
def generate_team_id(sport: str, city: str, name: str) -> str:
    """Generate a fallback canonical team ID from city and name.

    Format: team_{sport}_{city}_{name}

    Most teams use the abbreviation-based ID produced by
    generate_team_id_from_abbrev. This function generates a fallback ID
    based on city and name for teams without a known abbreviation.

    Args:
        sport: Sport code (e.g., 'nba', 'mlb')
        city: Team city (e.g., 'Los Angeles')
        name: Team name (e.g., 'Lakers')

    Returns:
        Canonical team ID (e.g., 'team_nba_los_angeles_lakers')

    Examples:
        >>> generate_team_id('nba', 'Los Angeles', 'Lakers')
        'team_nba_los_angeles_lakers'
        >>> generate_team_id('mlb', 'New York', 'Yankees')
        'team_mlb_new_york_yankees'
    """
    sport_norm = sport.lower()
    city_norm = normalize_string(city)
    name_norm = normalize_string(name)
    return f"team_{sport_norm}_{city_norm}_{name_norm}"
def generate_team_id_from_abbrev(sport: str, abbreviation: str) -> str:
"""Generate a canonical team ID from abbreviation.
Format: team_{sport}_{abbreviation}
Args:
sport: Sport code (e.g., 'nba', 'mlb')
abbreviation: Team abbreviation (e.g., 'LAL', 'NYY')
Returns:
Canonical team ID (e.g., 'team_nba_lal')
Examples:
>>> generate_team_id_from_abbrev('nba', 'LAL')
'team_nba_lal'
>>> generate_team_id_from_abbrev('mlb', 'NYY')
'team_mlb_nyy'
"""
sport_norm = sport.lower()
abbrev_norm = abbreviation.lower()
return f"team_{sport_norm}_{abbrev_norm}"
def generate_stadium_id(sport: str, name: str) -> str:
"""Generate a canonical stadium ID.
Format: stadium_{sport}_{normalized_name}
Args:
sport: Sport code (e.g., 'nba', 'mlb')
name: Stadium name (e.g., 'Yankee Stadium')
Returns:
Canonical stadium ID (e.g., 'stadium_mlb_yankee_stadium')
Examples:
>>> generate_stadium_id('nba', 'Crypto.com Arena')
'stadium_nba_cryptocom_arena'
>>> generate_stadium_id('mlb', 'Yankee Stadium')
'stadium_mlb_yankee_stadium'
"""
sport_norm = sport.lower()
name_norm = normalize_string(name)
return f"stadium_{sport_norm}_{name_norm}"
def parse_game_id(game_id: str) -> dict:
"""Parse a canonical game ID into its components.
Args:
game_id: Canonical game ID (e.g., 'nba_2025_hou_okc_1021')
Returns:
Dictionary with keys: sport, season, away_abbrev, home_abbrev,
month, day, game_number (None unless the ID carries a doubleheader suffix)
Raises:
ValueError: If game_id format is invalid
Examples:
>>> parse_game_id('nba_2025_hou_okc_1021')
{'sport': 'nba', 'season': 2025, 'away_abbrev': 'hou',
'home_abbrev': 'okc', 'month': 10, 'day': 21, 'game_number': None}
>>> parse_game_id('mlb_2026_nyy_bos_0401_1')
{'sport': 'mlb', 'season': 2026, 'away_abbrev': 'nyy',
'home_abbrev': 'bos', 'month': 4, 'day': 1, 'game_number': 1}
"""
parts = game_id.split("_")
if len(parts) < 5 or len(parts) > 6:
raise ValueError(f"Invalid game ID format: {game_id}")
sport = parts[0]
season = int(parts[1])
away_abbrev = parts[2]
home_abbrev = parts[3]
date_str = parts[4]
if len(date_str) != 4:
raise ValueError(f"Invalid date format in game ID: {game_id}")
month = int(date_str[:2])
day = int(date_str[2:])
game_number = None
if len(parts) == 6:
game_number = int(parts[5])
return {
"sport": sport,
"season": season,
"away_abbrev": away_abbrev,
"home_abbrev": home_abbrev,
"month": month,
"day": day,
"game_number": game_number,
}
def parse_team_id(team_id: str) -> dict:
"""Parse a canonical team ID into its components.
Args:
team_id: Canonical team ID (e.g., 'team_nba_lal')
Returns:
Dictionary with keys: sport, identifier (abbreviation or city_name)
Raises:
ValueError: If team_id format is invalid
"""
if not team_id.startswith("team_"):
raise ValueError(f"Invalid team ID format: {team_id}")
parts = team_id.split("_", 2)
if len(parts) < 3:
raise ValueError(f"Invalid team ID format: {team_id}")
return {
"sport": parts[1],
"identifier": parts[2],
}
def parse_stadium_id(stadium_id: str) -> dict:
"""Parse a canonical stadium ID into its components.
Args:
stadium_id: Canonical stadium ID (e.g., 'stadium_nba_paycom_center')
Returns:
Dictionary with keys: sport, name
Raises:
ValueError: If stadium_id format is invalid
"""
if not stadium_id.startswith("stadium_"):
raise ValueError(f"Invalid stadium ID format: {stadium_id}")
parts = stadium_id.split("_", 2)
if len(parts) < 3:
raise ValueError(f"Invalid stadium ID format: {stadium_id}")
return {
"sport": parts[1],
"name": parts[2],
}
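The game ID layout handled above (`{sport}_{season}_{away}_{home}_{MMDD}`, plus an optional doubleheader suffix) can be exercised with a standalone round-trip sketch. `build_game_id` and `split_game_id` are hypothetical stand-ins written for this illustration, not the package's functions. They assume the sport code is lowercased like the abbreviations and that no component contains an underscore.

```python
from datetime import date
from typing import Optional


def build_game_id(
    sport: str,
    season: int,
    away: str,
    home: str,
    game_date: date,
    game_number: Optional[int] = None,
) -> str:
    """Hypothetical stand-in: join the ID components with underscores."""
    parts = [
        sport.lower(),
        str(season),
        away.lower(),
        home.lower(),
        game_date.strftime("%m%d"),  # zero-padded month and day
    ]
    # The suffix is only present for doubleheaders
    if game_number is not None:
        parts.append(str(game_number))
    return "_".join(parts)


def split_game_id(game_id: str) -> dict:
    """Hypothetical stand-in: split an ID back into typed components."""
    parts = game_id.split("_")
    if len(parts) not in (5, 6) or len(parts[4]) != 4:
        raise ValueError(f"Invalid game ID format: {game_id}")
    return {
        "sport": parts[0],
        "season": int(parts[1]),
        "away_abbrev": parts[2],
        "home_abbrev": parts[3],
        "month": int(parts[4][:2]),
        "day": int(parts[4][2:]),
        "game_number": int(parts[5]) if len(parts) == 6 else None,
    }


game_id = build_game_id("MLB", 2026, "NYY", "BOS", date(2026, 4, 1), game_number=1)
parsed = split_game_id(game_id)
print(game_id)
print(parsed)
```

Because the separator is also the only delimiter the parser knows about, an abbreviation containing an underscore would shift every later field. That is why the sketch (like the parser above) rejects anything other than five or six parts.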
@@ -0,0 +1,272 @@
"""Fuzzy string matching utilities for team and stadium name resolution."""
from dataclasses import dataclass
from typing import Optional
from rapidfuzz import fuzz
from ..config import FUZZY_MATCH_THRESHOLD
from ..models.aliases import FuzzyMatch
@dataclass
class MatchCandidate:
"""A candidate for fuzzy matching.
Attributes:
canonical_id: The canonical ID of this candidate
name: The display name for this candidate
aliases: List of alternative names to match against
"""
canonical_id: str
name: str
aliases: list[str]
def normalize_for_matching(s: str) -> str:
"""Normalize a string for fuzzy matching.
- Convert to lowercase
- Remove common prefixes/suffixes
- Collapse whitespace
Args:
s: String to normalize
Returns:
Normalized string
"""
result = s.lower().strip()
# Remove common prefixes
prefixes = ["the ", "team ", "stadium "]
for prefix in prefixes:
if result.startswith(prefix):
result = result[len(prefix) :]
# Remove common suffixes
suffixes = [" stadium", " arena", " center", " field", " park"]
for suffix in suffixes:
if result.endswith(suffix):
result = result[: -len(suffix)]
# Collapse internal whitespace runs (split() also trims both ends)
return " ".join(result.split())
def fuzzy_match_team(
query: str,
candidates: list[MatchCandidate],
threshold: int = FUZZY_MATCH_THRESHOLD,
top_n: int = 3,
) -> list[FuzzyMatch]:
"""Find fuzzy matches for a team name.
Blends three scorers into a weighted average (weights 0.5, 0.3, 0.2 in the order below):
1. Token set ratio (handles word order differences)
2. Partial ratio (handles substring matches)
3. Standard ratio (overall similarity)
Args:
query: Team name to match
candidates: List of candidate teams to match against
threshold: Minimum score to consider a match (0-100)
top_n: Maximum number of matches to return
Returns:
List of FuzzyMatch objects sorted by confidence (descending)
"""
query_norm = normalize_for_matching(query)
# Build list of all matchable strings with their canonical IDs
match_strings: list[tuple[str, str, str]] = [] # (string, canonical_id, name)
for candidate in candidates:
# Add primary name
match_strings.append(
(normalize_for_matching(candidate.name), candidate.canonical_id, candidate.name)
)
# Add aliases
for alias in candidate.aliases:
match_strings.append(
(normalize_for_matching(alias), candidate.canonical_id, candidate.name)
)
# Score all candidates
scored: dict[str, tuple[int, str]] = {} # canonical_id -> (best_score, name)
for match_str, canonical_id, name in match_strings:
# Use multiple scoring methods
token_score = fuzz.token_set_ratio(query_norm, match_str)
partial_score = fuzz.partial_ratio(query_norm, match_str)
ratio_score = fuzz.ratio(query_norm, match_str)
# Weighted average favoring token_set_ratio for team names
score = int(0.5 * token_score + 0.3 * partial_score + 0.2 * ratio_score)
# Keep best score for each canonical ID
if canonical_id not in scored or score > scored[canonical_id][0]:
scored[canonical_id] = (score, name)
# Filter by threshold and sort
matches = [
FuzzyMatch(canonical_id=cid, canonical_name=name, confidence=score)
for cid, (score, name) in scored.items()
if score >= threshold
]
# Sort by confidence descending
matches.sort(key=lambda m: m.confidence, reverse=True)
return matches[:top_n]
def fuzzy_match_stadium(
query: str,
candidates: list[MatchCandidate],
threshold: int = FUZZY_MATCH_THRESHOLD,
top_n: int = 3,
) -> list[FuzzyMatch]:
"""Find fuzzy matches for a stadium name.
Blends three scorers suited to stadium names into a weighted average
(weights 0.4, 0.4, 0.2 in the order below):
1. Token sort ratio (handles "X Stadium" vs "Stadium X")
2. Partial ratio (handles naming rights changes)
3. Standard ratio (overall similarity)
Args:
query: Stadium name to match
candidates: List of candidate stadiums to match against
threshold: Minimum score to consider a match (0-100)
top_n: Maximum number of matches to return
Returns:
List of FuzzyMatch objects sorted by confidence (descending)
"""
query_norm = normalize_for_matching(query)
# Build list of all matchable strings
match_strings: list[tuple[str, str, str]] = []
for candidate in candidates:
match_strings.append(
(normalize_for_matching(candidate.name), candidate.canonical_id, candidate.name)
)
for alias in candidate.aliases:
match_strings.append(
(normalize_for_matching(alias), candidate.canonical_id, candidate.name)
)
# Score all candidates
scored: dict[str, tuple[int, str]] = {}
for match_str, canonical_id, name in match_strings:
# Use scoring methods suited for stadium names
token_sort_score = fuzz.token_sort_ratio(query_norm, match_str)
partial_score = fuzz.partial_ratio(query_norm, match_str)
ratio_score = fuzz.ratio(query_norm, match_str)
# Weighted average
score = int(0.4 * token_sort_score + 0.4 * partial_score + 0.2 * ratio_score)
if canonical_id not in scored or score > scored[canonical_id][0]:
scored[canonical_id] = (score, name)
# Filter and sort
matches = [
FuzzyMatch(canonical_id=cid, canonical_name=name, confidence=score)
for cid, (score, name) in scored.items()
if score >= threshold
]
matches.sort(key=lambda m: m.confidence, reverse=True)
return matches[:top_n]
def exact_match(
query: str,
candidates: list[MatchCandidate],
case_sensitive: bool = False,
) -> Optional[str]:
"""Find an exact match for a string.
Args:
query: String to match
candidates: List of candidates to match against
case_sensitive: Whether to use case-sensitive matching
Returns:
Canonical ID if exact match found, None otherwise
"""
if case_sensitive:
query_norm = query.strip()
else:
query_norm = query.lower().strip()
for candidate in candidates:
# Check primary name
name = candidate.name if case_sensitive else candidate.name.lower()
if query_norm == name.strip():
return candidate.canonical_id
# Check aliases
for alias in candidate.aliases:
alias_norm = alias if case_sensitive else alias.lower()
if query_norm == alias_norm.strip():
return candidate.canonical_id
return None
def best_match(
query: str,
candidates: list[MatchCandidate],
threshold: int = FUZZY_MATCH_THRESHOLD,
) -> Optional[FuzzyMatch]:
"""Find the best match for a query string.
First tries exact match, then falls back to fuzzy matching.
Args:
query: String to match
candidates: List of candidates
threshold: Minimum fuzzy match score
Returns:
Best FuzzyMatch or None if no match above threshold
"""
# Try exact match first
exact = exact_match(query, candidates)
if exact:
# Find the name for this ID
for c in candidates:
if c.canonical_id == exact:
return FuzzyMatch(
canonical_id=exact,
canonical_name=c.name,
confidence=100,
)
# Fall back to fuzzy matching. The team scorer is used for every candidate
# type; call fuzzy_match_stadium() directly for stadium-specific weighting.
matches = fuzzy_match_team(query, candidates, threshold=threshold, top_n=1)
return matches[0] if matches else None
def calculate_similarity(s1: str, s2: str) -> int:
"""Calculate similarity between two strings.
Args:
s1: First string
s2: Second string
Returns:
Similarity score 0-100
"""
s1_norm = normalize_for_matching(s1)
s2_norm = normalize_for_matching(s2)
# rapidfuzz scorers return floats; cast to match the declared int return type
return int(fuzz.token_set_ratio(s1_norm, s2_norm))
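The aggregation step shared by the two fuzzy matchers above (score every name and alias, keep the best score per canonical ID, filter by threshold, sort, truncate) can be sketched without third-party packages. This sketch swaps in `difflib.SequenceMatcher` from the standard library for the rapidfuzz scorers, so its scores will differ from the real module, and it uses a single scorer rather than the weighted blend. `Candidate`, `similarity`, and `rank` are hypothetical names, and the threshold of 80 is an assumed value, not the package's `FUZZY_MATCH_THRESHOLD`.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Candidate:
    """Hypothetical stand-in for MatchCandidate."""

    canonical_id: str
    name: str
    aliases: list[str] = field(default_factory=list)


def similarity(a: str, b: str) -> int:
    """0-100 similarity via difflib (stand-in for a rapidfuzz scorer)."""
    return int(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def rank(
    query: str,
    candidates: list[Candidate],
    threshold: int = 80,  # assumed value for illustration
    top_n: int = 3,
) -> list[tuple[str, str, int]]:
    # canonical_id -> (best_score, display_name)
    best: dict[str, tuple[int, str]] = {}
    for candidate in candidates:
        # Score the primary name and every alias against the query
        for text in [candidate.name, *candidate.aliases]:
            score = similarity(query, text)
            current = best.get(candidate.canonical_id)
            if current is None or score > current[0]:
                best[candidate.canonical_id] = (score, candidate.name)
    # Drop weak matches, then order by score, highest first
    ranked = [
        (cid, name, score)
        for cid, (score, name) in best.items()
        if score >= threshold
    ]
    ranked.sort(key=lambda item: item[2], reverse=True)
    return ranked[:top_n]


candidates = [
    Candidate("team_nba_lal", "Los Angeles Lakers", ["LA Lakers", "Lakers"]),
    Candidate("team_nba_lac", "Los Angeles Clippers", ["LA Clippers", "Clippers"]),
]
results = rank("lakers", candidates)
print(results)
```

Keeping only the best score per canonical ID means a candidate with many aliases appears once in the results, so `top_n` counts distinct teams or stadiums rather than distinct strings.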
@@ -0,0 +1,474 @@
"""Stadium name resolver with exact, alias, and fuzzy matching."""
from dataclasses import dataclass
from datetime import date
from typing import Optional
from uuid import uuid4
from ..config import FUZZY_MATCH_THRESHOLD, ALLOWED_COUNTRIES
from ..models.aliases import FuzzyMatch, ManualReviewItem, ReviewReason
from .alias_loader import get_stadium_alias_loader, StadiumAliasLoader
from .fuzzy import MatchCandidate, fuzzy_match_stadium
@dataclass
class StadiumResolveResult:
"""Result of stadium resolution.
Attributes:
canonical_id: Resolved canonical stadium ID (None if unresolved)
confidence: Confidence in the match (100 for exact, lower for fuzzy)
match_type: How the match was made ('exact', 'alias', 'fuzzy', 'unresolved')
filtered_reason: Reason if stadium was filtered out (e.g., 'geographic')
review_item: ManualReviewItem if resolution failed or low confidence
"""
canonical_id: Optional[str]
confidence: int
match_type: str
filtered_reason: Optional[str] = None
review_item: Optional[ManualReviewItem] = None
@dataclass
class StadiumInfo:
"""Stadium information for matching."""
canonical_id: str
name: str
city: str
state: str
country: str
sport: str
latitude: float
longitude: float
# Hardcoded stadium mappings
# Format: {sport: {canonical_id: StadiumInfo}}
STADIUM_MAPPINGS: dict[str, dict[str, StadiumInfo]] = {
"nba": {
"stadium_nba_state_farm_arena": StadiumInfo("stadium_nba_state_farm_arena", "State Farm Arena", "Atlanta", "GA", "USA", "nba", 33.7573, -84.3963),
"stadium_nba_td_garden": StadiumInfo("stadium_nba_td_garden", "TD Garden", "Boston", "MA", "USA", "nba", 42.3662, -71.0621),
"stadium_nba_barclays_center": StadiumInfo("stadium_nba_barclays_center", "Barclays Center", "Brooklyn", "NY", "USA", "nba", 40.6826, -73.9754),
"stadium_nba_spectrum_center": StadiumInfo("stadium_nba_spectrum_center", "Spectrum Center", "Charlotte", "NC", "USA", "nba", 35.2251, -80.8392),
"stadium_nba_united_center": StadiumInfo("stadium_nba_united_center", "United Center", "Chicago", "IL", "USA", "nba", 41.8807, -87.6742),
"stadium_nba_rocket_mortgage_fieldhouse": StadiumInfo("stadium_nba_rocket_mortgage_fieldhouse", "Rocket Mortgage FieldHouse", "Cleveland", "OH", "USA", "nba", 41.4965, -81.6882),
"stadium_nba_american_airlines_center": StadiumInfo("stadium_nba_american_airlines_center", "American Airlines Center", "Dallas", "TX", "USA", "nba", 32.7905, -96.8103),
"stadium_nba_ball_arena": StadiumInfo("stadium_nba_ball_arena", "Ball Arena", "Denver", "CO", "USA", "nba", 39.7487, -105.0077),
"stadium_nba_little_caesars_arena": StadiumInfo("stadium_nba_little_caesars_arena", "Little Caesars Arena", "Detroit", "MI", "USA", "nba", 42.3411, -83.0553),
"stadium_nba_chase_center": StadiumInfo("stadium_nba_chase_center", "Chase Center", "San Francisco", "CA", "USA", "nba", 37.7680, -122.3877),
"stadium_nba_toyota_center": StadiumInfo("stadium_nba_toyota_center", "Toyota Center", "Houston", "TX", "USA", "nba", 29.7508, -95.3621),
"stadium_nba_gainbridge_fieldhouse": StadiumInfo("stadium_nba_gainbridge_fieldhouse", "Gainbridge Fieldhouse", "Indianapolis", "IN", "USA", "nba", 39.7640, -86.1555),
"stadium_nba_intuit_dome": StadiumInfo("stadium_nba_intuit_dome", "Intuit Dome", "Inglewood", "CA", "USA", "nba", 33.9425, -118.3417),
"stadium_nba_cryptocom_arena": StadiumInfo("stadium_nba_cryptocom_arena", "Crypto.com Arena", "Los Angeles", "CA", "USA", "nba", 34.0430, -118.2673),
"stadium_nba_fedexforum": StadiumInfo("stadium_nba_fedexforum", "FedExForum", "Memphis", "TN", "USA", "nba", 35.1383, -90.0505),
"stadium_nba_kaseya_center": StadiumInfo("stadium_nba_kaseya_center", "Kaseya Center", "Miami", "FL", "USA", "nba", 25.7814, -80.1870),
"stadium_nba_fiserv_forum": StadiumInfo("stadium_nba_fiserv_forum", "Fiserv Forum", "Milwaukee", "WI", "USA", "nba", 43.0451, -87.9172),
"stadium_nba_target_center": StadiumInfo("stadium_nba_target_center", "Target Center", "Minneapolis", "MN", "USA", "nba", 44.9795, -93.2761),
"stadium_nba_smoothie_king_center": StadiumInfo("stadium_nba_smoothie_king_center", "Smoothie King Center", "New Orleans", "LA", "USA", "nba", 29.9490, -90.0821),
"stadium_nba_madison_square_garden": StadiumInfo("stadium_nba_madison_square_garden", "Madison Square Garden", "New York", "NY", "USA", "nba", 40.7505, -73.9934),
"stadium_nba_paycom_center": StadiumInfo("stadium_nba_paycom_center", "Paycom Center", "Oklahoma City", "OK", "USA", "nba", 35.4634, -97.5151),
"stadium_nba_kia_center": StadiumInfo("stadium_nba_kia_center", "Kia Center", "Orlando", "FL", "USA", "nba", 28.5392, -81.3839),
"stadium_nba_wells_fargo_center": StadiumInfo("stadium_nba_wells_fargo_center", "Wells Fargo Center", "Philadelphia", "PA", "USA", "nba", 39.9012, -75.1720),
"stadium_nba_footprint_center": StadiumInfo("stadium_nba_footprint_center", "Footprint Center", "Phoenix", "AZ", "USA", "nba", 33.4457, -112.0712),
"stadium_nba_moda_center": StadiumInfo("stadium_nba_moda_center", "Moda Center", "Portland", "OR", "USA", "nba", 45.5316, -122.6668),
"stadium_nba_golden_1_center": StadiumInfo("stadium_nba_golden_1_center", "Golden 1 Center", "Sacramento", "CA", "USA", "nba", 38.5802, -121.4997),
"stadium_nba_frost_bank_center": StadiumInfo("stadium_nba_frost_bank_center", "Frost Bank Center", "San Antonio", "TX", "USA", "nba", 29.4270, -98.4375),
"stadium_nba_scotiabank_arena": StadiumInfo("stadium_nba_scotiabank_arena", "Scotiabank Arena", "Toronto", "ON", "Canada", "nba", 43.6435, -79.3791),
"stadium_nba_delta_center": StadiumInfo("stadium_nba_delta_center", "Delta Center", "Salt Lake City", "UT", "USA", "nba", 40.7683, -111.9011),
"stadium_nba_capital_one_arena": StadiumInfo("stadium_nba_capital_one_arena", "Capital One Arena", "Washington", "DC", "USA", "nba", 38.8981, -77.0209),
},
"mlb": {
"stadium_mlb_chase_field": StadiumInfo("stadium_mlb_chase_field", "Chase Field", "Phoenix", "AZ", "USA", "mlb", 33.4455, -112.0667),
"stadium_mlb_truist_park": StadiumInfo("stadium_mlb_truist_park", "Truist Park", "Atlanta", "GA", "USA", "mlb", 33.8908, -84.4678),
"stadium_mlb_oriole_park_at_camden_yards": StadiumInfo("stadium_mlb_oriole_park_at_camden_yards", "Oriole Park at Camden Yards", "Baltimore", "MD", "USA", "mlb", 39.2839, -76.6217),
"stadium_mlb_fenway_park": StadiumInfo("stadium_mlb_fenway_park", "Fenway Park", "Boston", "MA", "USA", "mlb", 42.3467, -71.0972),
"stadium_mlb_wrigley_field": StadiumInfo("stadium_mlb_wrigley_field", "Wrigley Field", "Chicago", "IL", "USA", "mlb", 41.9484, -87.6553),
"stadium_mlb_guaranteed_rate_field": StadiumInfo("stadium_mlb_guaranteed_rate_field", "Guaranteed Rate Field", "Chicago", "IL", "USA", "mlb", 41.8299, -87.6338),
"stadium_mlb_great_american_ball_park": StadiumInfo("stadium_mlb_great_american_ball_park", "Great American Ball Park", "Cincinnati", "OH", "USA", "mlb", 39.0974, -84.5082),
"stadium_mlb_progressive_field": StadiumInfo("stadium_mlb_progressive_field", "Progressive Field", "Cleveland", "OH", "USA", "mlb", 41.4962, -81.6852),
"stadium_mlb_coors_field": StadiumInfo("stadium_mlb_coors_field", "Coors Field", "Denver", "CO", "USA", "mlb", 39.7559, -104.9942),
"stadium_mlb_comerica_park": StadiumInfo("stadium_mlb_comerica_park", "Comerica Park", "Detroit", "MI", "USA", "mlb", 42.3390, -83.0485),
"stadium_mlb_minute_maid_park": StadiumInfo("stadium_mlb_minute_maid_park", "Minute Maid Park", "Houston", "TX", "USA", "mlb", 29.7573, -95.3555),
"stadium_mlb_kauffman_stadium": StadiumInfo("stadium_mlb_kauffman_stadium", "Kauffman Stadium", "Kansas City", "MO", "USA", "mlb", 39.0517, -94.4803),
"stadium_mlb_angel_stadium": StadiumInfo("stadium_mlb_angel_stadium", "Angel Stadium", "Anaheim", "CA", "USA", "mlb", 33.8003, -117.8827),
"stadium_mlb_dodger_stadium": StadiumInfo("stadium_mlb_dodger_stadium", "Dodger Stadium", "Los Angeles", "CA", "USA", "mlb", 34.0739, -118.2400),
"stadium_mlb_loandepot_park": StadiumInfo("stadium_mlb_loandepot_park", "loanDepot park", "Miami", "FL", "USA", "mlb", 25.7781, -80.2195),
"stadium_mlb_american_family_field": StadiumInfo("stadium_mlb_american_family_field", "American Family Field", "Milwaukee", "WI", "USA", "mlb", 43.0280, -87.9712),
"stadium_mlb_target_field": StadiumInfo("stadium_mlb_target_field", "Target Field", "Minneapolis", "MN", "USA", "mlb", 44.9818, -93.2775),
"stadium_mlb_citi_field": StadiumInfo("stadium_mlb_citi_field", "Citi Field", "New York", "NY", "USA", "mlb", 40.7571, -73.8458),
"stadium_mlb_yankee_stadium": StadiumInfo("stadium_mlb_yankee_stadium", "Yankee Stadium", "Bronx", "NY", "USA", "mlb", 40.8296, -73.9262),
"stadium_mlb_sutter_health_park": StadiumInfo("stadium_mlb_sutter_health_park", "Sutter Health Park", "Sacramento", "CA", "USA", "mlb", 38.5803, -121.5005),
"stadium_mlb_citizens_bank_park": StadiumInfo("stadium_mlb_citizens_bank_park", "Citizens Bank Park", "Philadelphia", "PA", "USA", "mlb", 39.9061, -75.1665),
"stadium_mlb_pnc_park": StadiumInfo("stadium_mlb_pnc_park", "PNC Park", "Pittsburgh", "PA", "USA", "mlb", 40.4469, -80.0057),
"stadium_mlb_petco_park": StadiumInfo("stadium_mlb_petco_park", "Petco Park", "San Diego", "CA", "USA", "mlb", 32.7076, -117.1570),
"stadium_mlb_oracle_park": StadiumInfo("stadium_mlb_oracle_park", "Oracle Park", "San Francisco", "CA", "USA", "mlb", 37.7786, -122.3893),
"stadium_mlb_tmobile_park": StadiumInfo("stadium_mlb_tmobile_park", "T-Mobile Park", "Seattle", "WA", "USA", "mlb", 47.5914, -122.3325),
"stadium_mlb_busch_stadium": StadiumInfo("stadium_mlb_busch_stadium", "Busch Stadium", "St. Louis", "MO", "USA", "mlb", 38.6226, -90.1928),
"stadium_mlb_tropicana_field": StadiumInfo("stadium_mlb_tropicana_field", "Tropicana Field", "St. Petersburg", "FL", "USA", "mlb", 27.7682, -82.6534),
"stadium_mlb_globe_life_field": StadiumInfo("stadium_mlb_globe_life_field", "Globe Life Field", "Arlington", "TX", "USA", "mlb", 32.7473, -97.0845),
"stadium_mlb_rogers_centre": StadiumInfo("stadium_mlb_rogers_centre", "Rogers Centre", "Toronto", "ON", "Canada", "mlb", 43.6414, -79.3894),
"stadium_mlb_nationals_park": StadiumInfo("stadium_mlb_nationals_park", "Nationals Park", "Washington", "DC", "USA", "mlb", 38.8730, -77.0074),
},
"nfl": {
"stadium_nfl_state_farm_stadium": StadiumInfo("stadium_nfl_state_farm_stadium", "State Farm Stadium", "Glendale", "AZ", "USA", "nfl", 33.5276, -112.2626),
"stadium_nfl_mercedes_benz_stadium": StadiumInfo("stadium_nfl_mercedes_benz_stadium", "Mercedes-Benz Stadium", "Atlanta", "GA", "USA", "nfl", 33.7553, -84.4006),
"stadium_nfl_mandt_bank_stadium": StadiumInfo("stadium_nfl_mandt_bank_stadium", "M&T Bank Stadium", "Baltimore", "MD", "USA", "nfl", 39.2780, -76.6227),
"stadium_nfl_highmark_stadium": StadiumInfo("stadium_nfl_highmark_stadium", "Highmark Stadium", "Orchard Park", "NY", "USA", "nfl", 42.7738, -78.7870),
"stadium_nfl_bank_of_america_stadium": StadiumInfo("stadium_nfl_bank_of_america_stadium", "Bank of America Stadium", "Charlotte", "NC", "USA", "nfl", 35.2258, -80.8528),
"stadium_nfl_soldier_field": StadiumInfo("stadium_nfl_soldier_field", "Soldier Field", "Chicago", "IL", "USA", "nfl", 41.8623, -87.6167),
"stadium_nfl_paycor_stadium": StadiumInfo("stadium_nfl_paycor_stadium", "Paycor Stadium", "Cincinnati", "OH", "USA", "nfl", 39.0955, -84.5161),
"stadium_nfl_huntington_bank_field": StadiumInfo("stadium_nfl_huntington_bank_field", "Huntington Bank Field", "Cleveland", "OH", "USA", "nfl", 41.5061, -81.6995),
"stadium_nfl_att_stadium": StadiumInfo("stadium_nfl_att_stadium", "AT&T Stadium", "Arlington", "TX", "USA", "nfl", 32.7473, -97.0945),
"stadium_nfl_empower_field": StadiumInfo("stadium_nfl_empower_field", "Empower Field at Mile High", "Denver", "CO", "USA", "nfl", 39.7439, -105.0201),
"stadium_nfl_ford_field": StadiumInfo("stadium_nfl_ford_field", "Ford Field", "Detroit", "MI", "USA", "nfl", 42.3400, -83.0456),
"stadium_nfl_lambeau_field": StadiumInfo("stadium_nfl_lambeau_field", "Lambeau Field", "Green Bay", "WI", "USA", "nfl", 44.5013, -88.0622),
"stadium_nfl_nrg_stadium": StadiumInfo("stadium_nfl_nrg_stadium", "NRG Stadium", "Houston", "TX", "USA", "nfl", 29.6847, -95.4107),
"stadium_nfl_lucas_oil_stadium": StadiumInfo("stadium_nfl_lucas_oil_stadium", "Lucas Oil Stadium", "Indianapolis", "IN", "USA", "nfl", 39.7601, -86.1639),
"stadium_nfl_everbank_stadium": StadiumInfo("stadium_nfl_everbank_stadium", "EverBank Stadium", "Jacksonville", "FL", "USA", "nfl", 30.3239, -81.6373),
"stadium_nfl_arrowhead_stadium": StadiumInfo("stadium_nfl_arrowhead_stadium", "Arrowhead Stadium", "Kansas City", "MO", "USA", "nfl", 39.0489, -94.4839),
"stadium_nfl_allegiant_stadium": StadiumInfo("stadium_nfl_allegiant_stadium", "Allegiant Stadium", "Las Vegas", "NV", "USA", "nfl", 36.0909, -115.1833),
"stadium_nfl_sofi_stadium": StadiumInfo("stadium_nfl_sofi_stadium", "SoFi Stadium", "Inglewood", "CA", "USA", "nfl", 33.9534, -118.3386),
"stadium_nfl_hard_rock_stadium": StadiumInfo("stadium_nfl_hard_rock_stadium", "Hard Rock Stadium", "Miami Gardens", "FL", "USA", "nfl", 25.9580, -80.2389),
"stadium_nfl_us_bank_stadium": StadiumInfo("stadium_nfl_us_bank_stadium", "U.S. Bank Stadium", "Minneapolis", "MN", "USA", "nfl", 44.9737, -93.2575),
"stadium_nfl_gillette_stadium": StadiumInfo("stadium_nfl_gillette_stadium", "Gillette Stadium", "Foxborough", "MA", "USA", "nfl", 42.0909, -71.2643),
"stadium_nfl_caesars_superdome": StadiumInfo("stadium_nfl_caesars_superdome", "Caesars Superdome", "New Orleans", "LA", "USA", "nfl", 29.9511, -90.0812),
"stadium_nfl_metlife_stadium": StadiumInfo("stadium_nfl_metlife_stadium", "MetLife Stadium", "East Rutherford", "NJ", "USA", "nfl", 40.8128, -74.0742),
"stadium_nfl_lincoln_financial_field": StadiumInfo("stadium_nfl_lincoln_financial_field", "Lincoln Financial Field", "Philadelphia", "PA", "USA", "nfl", 39.9008, -75.1675),
"stadium_nfl_acrisure_stadium": StadiumInfo("stadium_nfl_acrisure_stadium", "Acrisure Stadium", "Pittsburgh", "PA", "USA", "nfl", 40.4468, -80.0158),
"stadium_nfl_levis_stadium": StadiumInfo("stadium_nfl_levis_stadium", "Levi's Stadium", "Santa Clara", "CA", "USA", "nfl", 37.4033, -121.9695),
"stadium_nfl_lumen_field": StadiumInfo("stadium_nfl_lumen_field", "Lumen Field", "Seattle", "WA", "USA", "nfl", 47.5952, -122.3316),
"stadium_nfl_raymond_james_stadium": StadiumInfo("stadium_nfl_raymond_james_stadium", "Raymond James Stadium", "Tampa", "FL", "USA", "nfl", 27.9759, -82.5033),
"stadium_nfl_nissan_stadium": StadiumInfo("stadium_nfl_nissan_stadium", "Nissan Stadium", "Nashville", "TN", "USA", "nfl", 36.1665, -86.7713),
"stadium_nfl_northwest_stadium": StadiumInfo("stadium_nfl_northwest_stadium", "Northwest Stadium", "Landover", "MD", "USA", "nfl", 38.9076, -76.8645),
},
"nhl": {
"stadium_nhl_honda_center": StadiumInfo("stadium_nhl_honda_center", "Honda Center", "Anaheim", "CA", "USA", "nhl", 33.8078, -117.8765),
"stadium_nhl_delta_center": StadiumInfo("stadium_nhl_delta_center", "Delta Center", "Salt Lake City", "UT", "USA", "nhl", 40.7683, -111.9011),
"stadium_nhl_td_garden": StadiumInfo("stadium_nhl_td_garden", "TD Garden", "Boston", "MA", "USA", "nhl", 42.3662, -71.0621),
"stadium_nhl_keybank_center": StadiumInfo("stadium_nhl_keybank_center", "KeyBank Center", "Buffalo", "NY", "USA", "nhl", 42.8750, -78.8764),
"stadium_nhl_scotiabank_saddledome": StadiumInfo("stadium_nhl_scotiabank_saddledome", "Scotiabank Saddledome", "Calgary", "AB", "Canada", "nhl", 51.0374, -114.0519),
"stadium_nhl_pnc_arena": StadiumInfo("stadium_nhl_pnc_arena", "PNC Arena", "Raleigh", "NC", "USA", "nhl", 35.8033, -78.7220),
"stadium_nhl_united_center": StadiumInfo("stadium_nhl_united_center", "United Center", "Chicago", "IL", "USA", "nhl", 41.8807, -87.6742),
"stadium_nhl_ball_arena": StadiumInfo("stadium_nhl_ball_arena", "Ball Arena", "Denver", "CO", "USA", "nhl", 39.7487, -105.0077),
"stadium_nhl_nationwide_arena": StadiumInfo("stadium_nhl_nationwide_arena", "Nationwide Arena", "Columbus", "OH", "USA", "nhl", 39.9692, -83.0061),
"stadium_nhl_american_airlines_center": StadiumInfo("stadium_nhl_american_airlines_center", "American Airlines Center", "Dallas", "TX", "USA", "nhl", 32.7905, -96.8103),
"stadium_nhl_little_caesars_arena": StadiumInfo("stadium_nhl_little_caesars_arena", "Little Caesars Arena", "Detroit", "MI", "USA", "nhl", 42.3411, -83.0553),
"stadium_nhl_rogers_place": StadiumInfo("stadium_nhl_rogers_place", "Rogers Place", "Edmonton", "AB", "Canada", "nhl", 53.5469, -113.4979),
"stadium_nhl_amerant_bank_arena": StadiumInfo("stadium_nhl_amerant_bank_arena", "Amerant Bank Arena", "Sunrise", "FL", "USA", "nhl", 26.1584, -80.3256),
"stadium_nhl_cryptocom_arena": StadiumInfo("stadium_nhl_cryptocom_arena", "Crypto.com Arena", "Los Angeles", "CA", "USA", "nhl", 34.0430, -118.2673),
"stadium_nhl_xcel_energy_center": StadiumInfo("stadium_nhl_xcel_energy_center", "Xcel Energy Center", "St. Paul", "MN", "USA", "nhl", 44.9448, -93.1010),
"stadium_nhl_bell_centre": StadiumInfo("stadium_nhl_bell_centre", "Bell Centre", "Montreal", "QC", "Canada", "nhl", 45.4961, -73.5693),
"stadium_nhl_bridgestone_arena": StadiumInfo("stadium_nhl_bridgestone_arena", "Bridgestone Arena", "Nashville", "TN", "USA", "nhl", 36.1592, -86.7785),
"stadium_nhl_prudential_center": StadiumInfo("stadium_nhl_prudential_center", "Prudential Center", "Newark", "NJ", "USA", "nhl", 40.7334, -74.1712),
"stadium_nhl_ubs_arena": StadiumInfo("stadium_nhl_ubs_arena", "UBS Arena", "Elmont", "NY", "USA", "nhl", 40.7170, -73.7255),
"stadium_nhl_madison_square_garden": StadiumInfo("stadium_nhl_madison_square_garden", "Madison Square Garden", "New York", "NY", "USA", "nhl", 40.7505, -73.9934),
"stadium_nhl_canadian_tire_centre": StadiumInfo("stadium_nhl_canadian_tire_centre", "Canadian Tire Centre", "Ottawa", "ON", "Canada", "nhl", 45.2969, -75.9272),
"stadium_nhl_wells_fargo_center": StadiumInfo("stadium_nhl_wells_fargo_center", "Wells Fargo Center", "Philadelphia", "PA", "USA", "nhl", 39.9012, -75.1720),
"stadium_nhl_ppg_paints_arena": StadiumInfo("stadium_nhl_ppg_paints_arena", "PPG Paints Arena", "Pittsburgh", "PA", "USA", "nhl", 40.4395, -79.9890),
"stadium_nhl_sap_center": StadiumInfo("stadium_nhl_sap_center", "SAP Center", "San Jose", "CA", "USA", "nhl", 37.3327, -121.9011),
"stadium_nhl_climate_pledge_arena": StadiumInfo("stadium_nhl_climate_pledge_arena", "Climate Pledge Arena", "Seattle", "WA", "USA", "nhl", 47.6221, -122.3540),
"stadium_nhl_enterprise_center": StadiumInfo("stadium_nhl_enterprise_center", "Enterprise Center", "St. Louis", "MO", "USA", "nhl", 38.6268, -90.2025),
"stadium_nhl_amalie_arena": StadiumInfo("stadium_nhl_amalie_arena", "Amalie Arena", "Tampa", "FL", "USA", "nhl", 27.9428, -82.4519),
"stadium_nhl_scotiabank_arena": StadiumInfo("stadium_nhl_scotiabank_arena", "Scotiabank Arena", "Toronto", "ON", "Canada", "nhl", 43.6435, -79.3791),
"stadium_nhl_rogers_arena": StadiumInfo("stadium_nhl_rogers_arena", "Rogers Arena", "Vancouver", "BC", "Canada", "nhl", 49.2778, -123.1088),
"stadium_nhl_tmobile_arena": StadiumInfo("stadium_nhl_tmobile_arena", "T-Mobile Arena", "Las Vegas", "NV", "USA", "nhl", 36.1028, -115.1783),
"stadium_nhl_capital_one_arena": StadiumInfo("stadium_nhl_capital_one_arena", "Capital One Arena", "Washington", "DC", "USA", "nhl", 38.8981, -77.0209),
"stadium_nhl_canada_life_centre": StadiumInfo("stadium_nhl_canada_life_centre", "Canada Life Centre", "Winnipeg", "MB", "Canada", "nhl", 49.8928, -97.1433),
},
"mls": {
"stadium_mls_mercedes_benz_stadium": StadiumInfo("stadium_mls_mercedes_benz_stadium", "Mercedes-Benz Stadium", "Atlanta", "GA", "USA", "mls", 33.7553, -84.4006),
"stadium_mls_q2_stadium": StadiumInfo("stadium_mls_q2_stadium", "Q2 Stadium", "Austin", "TX", "USA", "mls", 30.3875, -97.7186),
"stadium_mls_bank_of_america_stadium": StadiumInfo("stadium_mls_bank_of_america_stadium", "Bank of America Stadium", "Charlotte", "NC", "USA", "mls", 35.2258, -80.8528),
"stadium_mls_soldier_field": StadiumInfo("stadium_mls_soldier_field", "Soldier Field", "Chicago", "IL", "USA", "mls", 41.8623, -87.6167),
"stadium_mls_tql_stadium": StadiumInfo("stadium_mls_tql_stadium", "TQL Stadium", "Cincinnati", "OH", "USA", "mls", 39.1112, -84.5225),
"stadium_mls_dicks_sporting_goods_park": StadiumInfo("stadium_mls_dicks_sporting_goods_park", "Dick's Sporting Goods Park", "Commerce City", "CO", "USA", "mls", 39.8056, -104.8922),
"stadium_mls_lower_com_field": StadiumInfo("stadium_mls_lower_com_field", "Lower.com Field", "Columbus", "OH", "USA", "mls", 39.9689, -83.0173),
"stadium_mls_toyota_stadium": StadiumInfo("stadium_mls_toyota_stadium", "Toyota Stadium", "Frisco", "TX", "USA", "mls", 33.1545, -96.8353),
"stadium_mls_audi_field": StadiumInfo("stadium_mls_audi_field", "Audi Field", "Washington", "DC", "USA", "mls", 38.8687, -77.0128),
"stadium_mls_shell_energy_stadium": StadiumInfo("stadium_mls_shell_energy_stadium", "Shell Energy Stadium", "Houston", "TX", "USA", "mls", 29.7522, -95.3527),
"stadium_mls_dignity_health_sports_park": StadiumInfo("stadium_mls_dignity_health_sports_park", "Dignity Health Sports Park", "Carson", "CA", "USA", "mls", 33.8644, -118.2611),
"stadium_mls_bmo_stadium": StadiumInfo("stadium_mls_bmo_stadium", "BMO Stadium", "Los Angeles", "CA", "USA", "mls", 34.0128, -118.2841),
"stadium_mls_chase_stadium": StadiumInfo("stadium_mls_chase_stadium", "Chase Stadium", "Fort Lauderdale", "FL", "USA", "mls", 26.1930, -80.1611),
"stadium_mls_allianz_field": StadiumInfo("stadium_mls_allianz_field", "Allianz Field", "St. Paul", "MN", "USA", "mls", 44.9528, -93.1650),
"stadium_mls_stade_saputo": StadiumInfo("stadium_mls_stade_saputo", "Stade Saputo", "Montreal", "QC", "Canada", "mls", 45.5622, -73.5528),
"stadium_mls_geodis_park": StadiumInfo("stadium_mls_geodis_park", "GEODIS Park", "Nashville", "TN", "USA", "mls", 36.1304, -86.7651),
"stadium_mls_gillette_stadium": StadiumInfo("stadium_mls_gillette_stadium", "Gillette Stadium", "Foxborough", "MA", "USA", "mls", 42.0909, -71.2643),
"stadium_mls_yankee_stadium": StadiumInfo("stadium_mls_yankee_stadium", "Yankee Stadium", "Bronx", "NY", "USA", "mls", 40.8296, -73.9262),
"stadium_mls_red_bull_arena": StadiumInfo("stadium_mls_red_bull_arena", "Red Bull Arena", "Harrison", "NJ", "USA", "mls", 40.7369, -74.1503),
"stadium_mls_inter_co_stadium": StadiumInfo("stadium_mls_inter_co_stadium", "Inter&Co Stadium", "Orlando", "FL", "USA", "mls", 28.5411, -81.3895),
"stadium_mls_subaru_park": StadiumInfo("stadium_mls_subaru_park", "Subaru Park", "Chester", "PA", "USA", "mls", 39.8328, -75.3789),
"stadium_mls_providence_park": StadiumInfo("stadium_mls_providence_park", "Providence Park", "Portland", "OR", "USA", "mls", 45.5216, -122.6917),
"stadium_mls_america_first_field": StadiumInfo("stadium_mls_america_first_field", "America First Field", "Sandy", "UT", "USA", "mls", 40.5830, -111.8933),
"stadium_mls_paypal_park": StadiumInfo("stadium_mls_paypal_park", "PayPal Park", "San Jose", "CA", "USA", "mls", 37.3511, -121.9250),
"stadium_mls_snapdragon_stadium": StadiumInfo("stadium_mls_snapdragon_stadium", "Snapdragon Stadium", "San Diego", "CA", "USA", "mls", 32.7837, -117.1225),
"stadium_mls_lumen_field": StadiumInfo("stadium_mls_lumen_field", "Lumen Field", "Seattle", "WA", "USA", "mls", 47.5952, -122.3316),
"stadium_mls_childrens_mercy_park": StadiumInfo("stadium_mls_childrens_mercy_park", "Children's Mercy Park", "Kansas City", "KS", "USA", "mls", 39.1217, -94.8231),
"stadium_mls_citypark": StadiumInfo("stadium_mls_citypark", "CITYPARK", "St. Louis", "MO", "USA", "mls", 38.6316, -90.2106),
"stadium_mls_bmo_field": StadiumInfo("stadium_mls_bmo_field", "BMO Field", "Toronto", "ON", "Canada", "mls", 43.6332, -79.4186),
"stadium_mls_bc_place": StadiumInfo("stadium_mls_bc_place", "BC Place", "Vancouver", "BC", "Canada", "mls", 49.2768, -123.1118),
},
"wnba": {
"stadium_wnba_gateway_center_arena": StadiumInfo("stadium_wnba_gateway_center_arena", "Gateway Center Arena", "College Park", "GA", "USA", "wnba", 33.6510, -84.4474),
"stadium_wnba_wintrust_arena": StadiumInfo("stadium_wnba_wintrust_arena", "Wintrust Arena", "Chicago", "IL", "USA", "wnba", 41.8658, -87.6169),
"stadium_wnba_mohegan_sun_arena": StadiumInfo("stadium_wnba_mohegan_sun_arena", "Mohegan Sun Arena", "Uncasville", "CT", "USA", "wnba", 41.4931, -72.0912),
"stadium_wnba_college_park_center": StadiumInfo("stadium_wnba_college_park_center", "College Park Center", "Arlington", "TX", "USA", "wnba", 32.7304, -97.1077),
"stadium_wnba_chase_center": StadiumInfo("stadium_wnba_chase_center", "Chase Center", "San Francisco", "CA", "USA", "wnba", 37.7680, -122.3877),
"stadium_wnba_gainbridge_fieldhouse": StadiumInfo("stadium_wnba_gainbridge_fieldhouse", "Gainbridge Fieldhouse", "Indianapolis", "IN", "USA", "wnba", 39.7640, -86.1555),
"stadium_wnba_michelob_ultra_arena": StadiumInfo("stadium_wnba_michelob_ultra_arena", "Michelob Ultra Arena", "Las Vegas", "NV", "USA", "wnba", 36.0902, -115.1756),
"stadium_wnba_cryptocom_arena": StadiumInfo("stadium_wnba_cryptocom_arena", "Crypto.com Arena", "Los Angeles", "CA", "USA", "wnba", 34.0430, -118.2673),
"stadium_wnba_target_center": StadiumInfo("stadium_wnba_target_center", "Target Center", "Minneapolis", "MN", "USA", "wnba", 44.9795, -93.2761),
"stadium_wnba_barclays_center": StadiumInfo("stadium_wnba_barclays_center", "Barclays Center", "Brooklyn", "NY", "USA", "wnba", 40.6826, -73.9754),
"stadium_wnba_footprint_center": StadiumInfo("stadium_wnba_footprint_center", "Footprint Center", "Phoenix", "AZ", "USA", "wnba", 33.4457, -112.0712),
"stadium_wnba_climate_pledge_arena": StadiumInfo("stadium_wnba_climate_pledge_arena", "Climate Pledge Arena", "Seattle", "WA", "USA", "wnba", 47.6221, -122.3540),
"stadium_wnba_entertainment_sports_arena": StadiumInfo("stadium_wnba_entertainment_sports_arena", "Entertainment & Sports Arena", "Washington", "DC", "USA", "wnba", 38.8690, -76.9745),
},
"nwsl": {
"stadium_nwsl_bmo_stadium": StadiumInfo("stadium_nwsl_bmo_stadium", "BMO Stadium", "Los Angeles", "CA", "USA", "nwsl", 34.0128, -118.2841),
"stadium_nwsl_seatgeek_stadium": StadiumInfo("stadium_nwsl_seatgeek_stadium", "SeatGeek Stadium", "Bridgeview", "IL", "USA", "nwsl", 41.7500, -87.8028),
"stadium_nwsl_shell_energy_stadium": StadiumInfo("stadium_nwsl_shell_energy_stadium", "Shell Energy Stadium", "Houston", "TX", "USA", "nwsl", 29.7522, -95.3527),
"stadium_nwsl_cpkc_stadium": StadiumInfo("stadium_nwsl_cpkc_stadium", "CPKC Stadium", "Kansas City", "MO", "USA", "nwsl", 39.1050, -94.5580),
"stadium_nwsl_red_bull_arena": StadiumInfo("stadium_nwsl_red_bull_arena", "Red Bull Arena", "Harrison", "NJ", "USA", "nwsl", 40.7369, -74.1503),
"stadium_nwsl_wakemed_soccer_park": StadiumInfo("stadium_nwsl_wakemed_soccer_park", "WakeMed Soccer Park", "Cary", "NC", "USA", "nwsl", 35.7879, -78.7806),
"stadium_nwsl_inter_co_stadium": StadiumInfo("stadium_nwsl_inter_co_stadium", "Inter&Co Stadium", "Orlando", "FL", "USA", "nwsl", 28.5411, -81.3895),
"stadium_nwsl_providence_park": StadiumInfo("stadium_nwsl_providence_park", "Providence Park", "Portland", "OR", "USA", "nwsl", 45.5216, -122.6917),
"stadium_nwsl_lynn_family_stadium": StadiumInfo("stadium_nwsl_lynn_family_stadium", "Lynn Family Stadium", "Louisville", "KY", "USA", "nwsl", 38.2219, -85.7381),
"stadium_nwsl_snapdragon_stadium": StadiumInfo("stadium_nwsl_snapdragon_stadium", "Snapdragon Stadium", "San Diego", "CA", "USA", "nwsl", 32.7837, -117.1225),
"stadium_nwsl_lumen_field": StadiumInfo("stadium_nwsl_lumen_field", "Lumen Field", "Seattle", "WA", "USA", "nwsl", 47.5952, -122.3316),
"stadium_nwsl_america_first_field": StadiumInfo("stadium_nwsl_america_first_field", "America First Field", "Sandy", "UT", "USA", "nwsl", 40.5830, -111.8933),
"stadium_nwsl_audi_field": StadiumInfo("stadium_nwsl_audi_field", "Audi Field", "Washington", "DC", "USA", "nwsl", 38.8687, -77.0128),
"stadium_nwsl_paypal_park": StadiumInfo("stadium_nwsl_paypal_park", "PayPal Park", "San Jose", "CA", "USA", "nwsl", 37.3511, -121.9250),
},
}
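# Illustrative sketch, not part of the package: the stadium keys above appear to
# follow the shape stadium_{sport}_{slug}. The helper below is hypothetical
# (stadium_id_sketch is an assumed name) and only reproduces the slugs visible
# in this table; the package's real canonical ID generator may differ.

```python
import re


def stadium_id_sketch(sport: str, name: str) -> str:
    """Hypothetical helper: rebuild an ID in the shape of the keys above."""
    slug = name.lower()
    # Drop apostrophes and periods: "Children's" -> "childrens", "Crypto.com" -> "cryptocom"
    slug = re.sub(r"['.]", "", slug)
    # Collapse every other run of non-alphanumerics ("&", spaces) into one underscore
    slug = re.sub(r"[^a-z0-9]+", "_", slug).strip("_")
    return f"stadium_{sport.lower()}_{slug}"


print(stadium_id_sketch("mls", "Inter&Co Stadium"))
print(stadium_id_sketch("wnba", "Entertainment & Sports Arena"))
```

# A check like this could be used to spot keys that drift from the naming pattern
# when new venues are added by hand.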
class StadiumResolver:
"""Resolves stadium names to canonical IDs.
Resolution order:
1. Exact match against stadium names
2. Alias lookup (with date awareness)
3. Fuzzy match against all known names
4. Geographic filter check
5. Unresolved (returns ManualReviewItem)
"""
def __init__(
self,
sport: str,
alias_loader: Optional[StadiumAliasLoader] = None,
fuzzy_threshold: int = FUZZY_MATCH_THRESHOLD,
):
"""Initialize the resolver.
Args:
sport: Sport code (e.g., 'nba', 'mlb')
alias_loader: Stadium alias loader (default: global loader)
fuzzy_threshold: Minimum fuzzy match score
"""
self.sport = sport.lower()
self.alias_loader = alias_loader or get_stadium_alias_loader()
self.fuzzy_threshold = fuzzy_threshold
self._stadiums = STADIUM_MAPPINGS.get(self.sport, {})
# Build match candidates
self._candidates = self._build_candidates()
def _build_candidates(self) -> list[MatchCandidate]:
"""Build match candidates from stadium mappings."""
candidates = []
for stadium_id, info in self._stadiums.items():
# Get aliases for this stadium
aliases = [a.alias_name for a in self.alias_loader.get_aliases_for_stadium(stadium_id)]
# Add city as alias
aliases.append(info.city)
candidates.append(MatchCandidate(
canonical_id=stadium_id,
name=info.name,
aliases=aliases,
))
return candidates
def resolve(
self,
name: str,
check_date: Optional[date] = None,
country: Optional[str] = None,
source_url: Optional[str] = None,
) -> StadiumResolveResult:
"""Resolve a stadium name to a canonical ID.
Args:
name: Stadium name to resolve
check_date: Date for alias validity (None = today)
country: Country for geographic filtering (None = no filter)
source_url: Source URL for manual review items
Returns:
StadiumResolveResult with resolution details
"""
name_lower = name.lower().strip()
# 1. Exact match against stadium names
for stadium_id, info in self._stadiums.items():
if name_lower == info.name.lower():
return StadiumResolveResult(
canonical_id=stadium_id,
confidence=100,
match_type="exact",
)
# 2. Alias lookup
alias_result = self.alias_loader.resolve(name, check_date)
if alias_result:
# Verify it's for the right sport (alias file has all sports)
if alias_result.startswith(f"stadium_{self.sport}_"):
return StadiumResolveResult(
canonical_id=alias_result,
confidence=95,
match_type="alias",
)
# 3. Fuzzy match
matches = fuzzy_match_stadium(
name,
self._candidates,
threshold=self.fuzzy_threshold,
)
if matches:
best = matches[0]
review_item = None
# Create review item for low confidence matches
if best.confidence < 90:
review_item = ManualReviewItem(
id=f"stadium_{uuid4().hex[:8]}",
reason=ReviewReason.LOW_CONFIDENCE_MATCH,
sport=self.sport,
raw_value=name,
context={"match_type": "fuzzy"},
source_url=source_url,
suggested_matches=matches,
game_date=check_date,
)
return StadiumResolveResult(
canonical_id=best.canonical_id,
confidence=best.confidence,
match_type="fuzzy",
review_item=review_item,
)
# 4. Geographic filter check
if country and country not in ALLOWED_COUNTRIES:
review_item = ManualReviewItem(
id=f"stadium_{uuid4().hex[:8]}",
reason=ReviewReason.GEOGRAPHIC_FILTER,
sport=self.sport,
raw_value=name,
context={"country": country, "reason": "Stadium outside USA/Canada/Mexico"},
source_url=source_url,
game_date=check_date,
)
return StadiumResolveResult(
canonical_id=None,
confidence=0,
match_type="filtered",
filtered_reason="geographic",
review_item=review_item,
)
# 5. Unresolved
review_item = ManualReviewItem(
id=f"stadium_{uuid4().hex[:8]}",
reason=ReviewReason.UNRESOLVED_STADIUM,
sport=self.sport,
raw_value=name,
context={},
source_url=source_url,
suggested_matches=fuzzy_match_stadium(
name,
self._candidates,
threshold=50, # Lower threshold for suggestions
top_n=5,
),
game_date=check_date,
)
return StadiumResolveResult(
canonical_id=None,
confidence=0,
match_type="unresolved",
review_item=review_item,
)
def get_stadium_info(self, stadium_id: str) -> Optional[StadiumInfo]:
"""Get stadium info by ID.
Args:
stadium_id: Canonical stadium ID
Returns:
StadiumInfo or None
"""
return self._stadiums.get(stadium_id)
def get_all_stadiums(self) -> list[StadiumInfo]:
"""Get all stadiums for this sport.
Returns:
List of StadiumInfo objects
"""
return list(self._stadiums.values())
def is_in_allowed_region(self, stadium_id: str) -> bool:
"""Check if a stadium is in an allowed region.
Args:
stadium_id: Canonical stadium ID
Returns:
True if stadium is in USA, Canada, or Mexico
"""
info = self._stadiums.get(stadium_id)
if not info:
return False
return info.country in ALLOWED_COUNTRIES
# Cached resolvers
_resolvers: dict[str, StadiumResolver] = {}
def get_stadium_resolver(sport: str) -> StadiumResolver:
"""Get or create a stadium resolver for a sport."""
sport_lower = sport.lower()
if sport_lower not in _resolvers:
_resolvers[sport_lower] = StadiumResolver(sport_lower)
return _resolvers[sport_lower]
def resolve_stadium(
sport: str,
name: str,
check_date: Optional[date] = None,
) -> StadiumResolveResult:
"""Convenience function to resolve a stadium name.
Args:
sport: Sport code
name: Stadium name to resolve
check_date: Date for alias validity
Returns:
StadiumResolveResult
"""
return get_stadium_resolver(sport).resolve(name, check_date)
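# Illustrative sketch, not part of the package: a self-contained miniature of the
# exact -> alias -> fuzzy -> unresolved cascade used by StadiumResolver.resolve.
# Assumptions: NAMES and ALIASES are tiny made-up tables, resolve_sketch is a
# hypothetical name, and difflib stands in for the package's fuzzy_match_stadium,
# whose scoring is not shown here and may differ.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical miniature data in place of STADIUM_MAPPINGS and the alias file
NAMES = {
    "stadium_mls_bmo_field": "BMO Field",
    "stadium_mls_bc_place": "BC Place",
}
ALIASES = {"bc place stadium": "stadium_mls_bc_place"}


def resolve_sketch(name: str, threshold: int = 80) -> tuple[Optional[str], int, str]:
    """Return (canonical_id, confidence, match_type) for a raw stadium name."""
    key = name.lower().strip()

    # 1. Exact match against canonical names
    for stadium_id, canonical in NAMES.items():
        if key == canonical.lower():
            return stadium_id, 100, "exact"

    # 2. Alias lookup
    if key in ALIASES:
        return ALIASES[key], 95, "alias"

    # 3. Fuzzy match: score every candidate on a 0-100 scale and keep the best
    scored = sorted(
        (
            (round(SequenceMatcher(None, key, canonical.lower()).ratio() * 100), stadium_id)
            for stadium_id, canonical in NAMES.items()
        ),
        reverse=True,
    )
    score, stadium_id = scored[0]
    if score >= threshold:
        return stadium_id, score, "fuzzy"

    # 4. Nothing cleared the threshold
    return None, 0, "unresolved"


print(resolve_sketch("  bmo field "))
print(resolve_sketch("BC Place Stadium"))
print(resolve_sketch("BMO Feild"))
```

# Each tier returns as soon as it matches, so the cheap, certain checks run first
# and the fuzzy scorer only runs on names that are not already known.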
@@ -0,0 +1,482 @@
"""Team name resolver with exact, alias, and fuzzy matching."""
from dataclasses import dataclass
from datetime import date
from typing import Optional
from uuid import uuid4
from ..config import FUZZY_MATCH_THRESHOLD
from ..models.aliases import (
AliasType,
FuzzyMatch,
ManualReviewItem,
ReviewReason,
)
from .alias_loader import get_team_alias_loader, TeamAliasLoader
from .fuzzy import MatchCandidate, fuzzy_match_team, exact_match
@dataclass
class TeamResolveResult:
"""Result of team resolution.
Attributes:
canonical_id: Resolved canonical team ID (None if unresolved)
confidence: Confidence in the match (100 for exact, lower for fuzzy)
match_type: How the match was made ('exact', 'alias', 'fuzzy', 'unresolved')
review_item: ManualReviewItem if resolution failed or low confidence
"""
canonical_id: Optional[str]
confidence: int
match_type: str
review_item: Optional[ManualReviewItem] = None
# Hardcoded team mappings for each sport
# Format: {sport: {abbreviation: (canonical_id, full_name, city)}}
TEAM_MAPPINGS: dict[str, dict[str, tuple[str, str, str]]] = {
"nba": {
"ATL": ("team_nba_atl", "Atlanta Hawks", "Atlanta"),
"BOS": ("team_nba_bos", "Boston Celtics", "Boston"),
"BKN": ("team_nba_brk", "Brooklyn Nets", "Brooklyn"),
"BRK": ("team_nba_brk", "Brooklyn Nets", "Brooklyn"),
"CHA": ("team_nba_cho", "Charlotte Hornets", "Charlotte"),
"CHO": ("team_nba_cho", "Charlotte Hornets", "Charlotte"),
"CHI": ("team_nba_chi", "Chicago Bulls", "Chicago"),
"CLE": ("team_nba_cle", "Cleveland Cavaliers", "Cleveland"),
"DAL": ("team_nba_dal", "Dallas Mavericks", "Dallas"),
"DEN": ("team_nba_den", "Denver Nuggets", "Denver"),
"DET": ("team_nba_det", "Detroit Pistons", "Detroit"),
"GSW": ("team_nba_gsw", "Golden State Warriors", "Golden State"),
"GS": ("team_nba_gsw", "Golden State Warriors", "Golden State"),
"HOU": ("team_nba_hou", "Houston Rockets", "Houston"),
"IND": ("team_nba_ind", "Indiana Pacers", "Indiana"),
"LAC": ("team_nba_lac", "Los Angeles Clippers", "Los Angeles"),
"LAL": ("team_nba_lal", "Los Angeles Lakers", "Los Angeles"),
"MEM": ("team_nba_mem", "Memphis Grizzlies", "Memphis"),
"MIA": ("team_nba_mia", "Miami Heat", "Miami"),
"MIL": ("team_nba_mil", "Milwaukee Bucks", "Milwaukee"),
"MIN": ("team_nba_min", "Minnesota Timberwolves", "Minnesota"),
"NOP": ("team_nba_nop", "New Orleans Pelicans", "New Orleans"),
"NO": ("team_nba_nop", "New Orleans Pelicans", "New Orleans"),
"NYK": ("team_nba_nyk", "New York Knicks", "New York"),
"NY": ("team_nba_nyk", "New York Knicks", "New York"),
"OKC": ("team_nba_okc", "Oklahoma City Thunder", "Oklahoma City"),
"ORL": ("team_nba_orl", "Orlando Magic", "Orlando"),
"PHI": ("team_nba_phi", "Philadelphia 76ers", "Philadelphia"),
"PHX": ("team_nba_phx", "Phoenix Suns", "Phoenix"),
"PHO": ("team_nba_phx", "Phoenix Suns", "Phoenix"),
"POR": ("team_nba_por", "Portland Trail Blazers", "Portland"),
"SAC": ("team_nba_sac", "Sacramento Kings", "Sacramento"),
"SAS": ("team_nba_sas", "San Antonio Spurs", "San Antonio"),
"SA": ("team_nba_sas", "San Antonio Spurs", "San Antonio"),
"TOR": ("team_nba_tor", "Toronto Raptors", "Toronto"),
"UTA": ("team_nba_uta", "Utah Jazz", "Utah"),
"WAS": ("team_nba_was", "Washington Wizards", "Washington"),
"WSH": ("team_nba_was", "Washington Wizards", "Washington"),
},
"mlb": {
"ARI": ("team_mlb_ari", "Arizona Diamondbacks", "Arizona"),
"ATL": ("team_mlb_atl", "Atlanta Braves", "Atlanta"),
"BAL": ("team_mlb_bal", "Baltimore Orioles", "Baltimore"),
"BOS": ("team_mlb_bos", "Boston Red Sox", "Boston"),
"CHC": ("team_mlb_chc", "Chicago Cubs", "Chicago"),
"CHW": ("team_mlb_chw", "Chicago White Sox", "Chicago"),
"CWS": ("team_mlb_chw", "Chicago White Sox", "Chicago"),
"CIN": ("team_mlb_cin", "Cincinnati Reds", "Cincinnati"),
"CLE": ("team_mlb_cle", "Cleveland Guardians", "Cleveland"),
"COL": ("team_mlb_col", "Colorado Rockies", "Colorado"),
"DET": ("team_mlb_det", "Detroit Tigers", "Detroit"),
"HOU": ("team_mlb_hou", "Houston Astros", "Houston"),
"KC": ("team_mlb_kc", "Kansas City Royals", "Kansas City"),
"KCR": ("team_mlb_kc", "Kansas City Royals", "Kansas City"),
"LAA": ("team_mlb_laa", "Los Angeles Angels", "Los Angeles"),
"ANA": ("team_mlb_laa", "Los Angeles Angels", "Anaheim"),
"LAD": ("team_mlb_lad", "Los Angeles Dodgers", "Los Angeles"),
"MIA": ("team_mlb_mia", "Miami Marlins", "Miami"),
"FLA": ("team_mlb_mia", "Miami Marlins", "Florida"),
"MIL": ("team_mlb_mil", "Milwaukee Brewers", "Milwaukee"),
"MIN": ("team_mlb_min", "Minnesota Twins", "Minnesota"),
"NYM": ("team_mlb_nym", "New York Mets", "New York"),
"NYY": ("team_mlb_nyy", "New York Yankees", "New York"),
"OAK": ("team_mlb_oak", "Oakland Athletics", "Oakland"),
"PHI": ("team_mlb_phi", "Philadelphia Phillies", "Philadelphia"),
"PIT": ("team_mlb_pit", "Pittsburgh Pirates", "Pittsburgh"),
"SD": ("team_mlb_sd", "San Diego Padres", "San Diego"),
"SDP": ("team_mlb_sd", "San Diego Padres", "San Diego"),
"SF": ("team_mlb_sf", "San Francisco Giants", "San Francisco"),
"SFG": ("team_mlb_sf", "San Francisco Giants", "San Francisco"),
"SEA": ("team_mlb_sea", "Seattle Mariners", "Seattle"),
"STL": ("team_mlb_stl", "St. Louis Cardinals", "St. Louis"),
"TB": ("team_mlb_tbr", "Tampa Bay Rays", "Tampa Bay"),
"TBR": ("team_mlb_tbr", "Tampa Bay Rays", "Tampa Bay"),
"TEX": ("team_mlb_tex", "Texas Rangers", "Texas"),
"TOR": ("team_mlb_tor", "Toronto Blue Jays", "Toronto"),
"WSN": ("team_mlb_wsn", "Washington Nationals", "Washington"),
"WAS": ("team_mlb_wsn", "Washington Nationals", "Washington"),
},
"nfl": {
"ARI": ("team_nfl_ari", "Arizona Cardinals", "Arizona"),
"ATL": ("team_nfl_atl", "Atlanta Falcons", "Atlanta"),
"BAL": ("team_nfl_bal", "Baltimore Ravens", "Baltimore"),
"BUF": ("team_nfl_buf", "Buffalo Bills", "Buffalo"),
"CAR": ("team_nfl_car", "Carolina Panthers", "Carolina"),
"CHI": ("team_nfl_chi", "Chicago Bears", "Chicago"),
"CIN": ("team_nfl_cin", "Cincinnati Bengals", "Cincinnati"),
"CLE": ("team_nfl_cle", "Cleveland Browns", "Cleveland"),
"DAL": ("team_nfl_dal", "Dallas Cowboys", "Dallas"),
"DEN": ("team_nfl_den", "Denver Broncos", "Denver"),
"DET": ("team_nfl_det", "Detroit Lions", "Detroit"),
"GB": ("team_nfl_gb", "Green Bay Packers", "Green Bay"),
"GNB": ("team_nfl_gb", "Green Bay Packers", "Green Bay"),
"HOU": ("team_nfl_hou", "Houston Texans", "Houston"),
"IND": ("team_nfl_ind", "Indianapolis Colts", "Indianapolis"),
"JAX": ("team_nfl_jax", "Jacksonville Jaguars", "Jacksonville"),
"JAC": ("team_nfl_jax", "Jacksonville Jaguars", "Jacksonville"),
"KC": ("team_nfl_kc", "Kansas City Chiefs", "Kansas City"),
"KAN": ("team_nfl_kc", "Kansas City Chiefs", "Kansas City"),
"LV": ("team_nfl_lv", "Las Vegas Raiders", "Las Vegas"),
"LAC": ("team_nfl_lac", "Los Angeles Chargers", "Los Angeles"),
"LAR": ("team_nfl_lar", "Los Angeles Rams", "Los Angeles"),
"MIA": ("team_nfl_mia", "Miami Dolphins", "Miami"),
"MIN": ("team_nfl_min", "Minnesota Vikings", "Minnesota"),
"NE": ("team_nfl_ne", "New England Patriots", "New England"),
"NWE": ("team_nfl_ne", "New England Patriots", "New England"),
"NO": ("team_nfl_no", "New Orleans Saints", "New Orleans"),
"NOR": ("team_nfl_no", "New Orleans Saints", "New Orleans"),
"NYG": ("team_nfl_nyg", "New York Giants", "New York"),
"NYJ": ("team_nfl_nyj", "New York Jets", "New York"),
"PHI": ("team_nfl_phi", "Philadelphia Eagles", "Philadelphia"),
"PIT": ("team_nfl_pit", "Pittsburgh Steelers", "Pittsburgh"),
"SF": ("team_nfl_sf", "San Francisco 49ers", "San Francisco"),
"SFO": ("team_nfl_sf", "San Francisco 49ers", "San Francisco"),
"SEA": ("team_nfl_sea", "Seattle Seahawks", "Seattle"),
"TB": ("team_nfl_tb", "Tampa Bay Buccaneers", "Tampa Bay"),
"TAM": ("team_nfl_tb", "Tampa Bay Buccaneers", "Tampa Bay"),
"TEN": ("team_nfl_ten", "Tennessee Titans", "Tennessee"),
"WAS": ("team_nfl_was", "Washington Commanders", "Washington"),
"WSH": ("team_nfl_was", "Washington Commanders", "Washington"),
},
"nhl": {
"ANA": ("team_nhl_ana", "Anaheim Ducks", "Anaheim"),
"ARI": ("team_nhl_ari", "Utah Hockey Club", "Utah"), # Franchise relocated from Arizona in 2024; legacy "ari" ID kept
"UTA": ("team_nhl_ari", "Utah Hockey Club", "Utah"),
"BOS": ("team_nhl_bos", "Boston Bruins", "Boston"),
"BUF": ("team_nhl_buf", "Buffalo Sabres", "Buffalo"),
"CGY": ("team_nhl_cgy", "Calgary Flames", "Calgary"),
"CAR": ("team_nhl_car", "Carolina Hurricanes", "Carolina"),
"CHI": ("team_nhl_chi", "Chicago Blackhawks", "Chicago"),
"COL": ("team_nhl_col", "Colorado Avalanche", "Colorado"),
"CBJ": ("team_nhl_cbj", "Columbus Blue Jackets", "Columbus"),
"DAL": ("team_nhl_dal", "Dallas Stars", "Dallas"),
"DET": ("team_nhl_det", "Detroit Red Wings", "Detroit"),
"EDM": ("team_nhl_edm", "Edmonton Oilers", "Edmonton"),
"FLA": ("team_nhl_fla", "Florida Panthers", "Florida"),
"LA": ("team_nhl_la", "Los Angeles Kings", "Los Angeles"),
"LAK": ("team_nhl_la", "Los Angeles Kings", "Los Angeles"),
"MIN": ("team_nhl_min", "Minnesota Wild", "Minnesota"),
"MTL": ("team_nhl_mtl", "Montreal Canadiens", "Montreal"),
"MON": ("team_nhl_mtl", "Montreal Canadiens", "Montreal"),
"NSH": ("team_nhl_nsh", "Nashville Predators", "Nashville"),
"NAS": ("team_nhl_nsh", "Nashville Predators", "Nashville"),
"NJ": ("team_nhl_njd", "New Jersey Devils", "New Jersey"),
"NJD": ("team_nhl_njd", "New Jersey Devils", "New Jersey"),
"NYI": ("team_nhl_nyi", "New York Islanders", "New York"),
"NYR": ("team_nhl_nyr", "New York Rangers", "New York"),
"OTT": ("team_nhl_ott", "Ottawa Senators", "Ottawa"),
"PHI": ("team_nhl_phi", "Philadelphia Flyers", "Philadelphia"),
"PIT": ("team_nhl_pit", "Pittsburgh Penguins", "Pittsburgh"),
"SJ": ("team_nhl_sj", "San Jose Sharks", "San Jose"),
"SJS": ("team_nhl_sj", "San Jose Sharks", "San Jose"),
"SEA": ("team_nhl_sea", "Seattle Kraken", "Seattle"),
"STL": ("team_nhl_stl", "St. Louis Blues", "St. Louis"),
"TB": ("team_nhl_tb", "Tampa Bay Lightning", "Tampa Bay"),
"TBL": ("team_nhl_tb", "Tampa Bay Lightning", "Tampa Bay"),
"TOR": ("team_nhl_tor", "Toronto Maple Leafs", "Toronto"),
"VAN": ("team_nhl_van", "Vancouver Canucks", "Vancouver"),
"VGK": ("team_nhl_vgk", "Vegas Golden Knights", "Vegas"),
"VEG": ("team_nhl_vgk", "Vegas Golden Knights", "Vegas"),
"WAS": ("team_nhl_was", "Washington Capitals", "Washington"),
"WSH": ("team_nhl_was", "Washington Capitals", "Washington"),
"WPG": ("team_nhl_wpg", "Winnipeg Jets", "Winnipeg"),
},
"mls": {
"ATL": ("team_mls_atl", "Atlanta United", "Atlanta"),
"AUS": ("team_mls_aus", "Austin FC", "Austin"),
"CLT": ("team_mls_clt", "Charlotte FC", "Charlotte"),
"CHI": ("team_mls_chi", "Chicago Fire", "Chicago"),
"CIN": ("team_mls_cin", "FC Cincinnati", "Cincinnati"),
"COL": ("team_mls_col", "Colorado Rapids", "Colorado"),
"CLB": ("team_mls_clb", "Columbus Crew", "Columbus"),
"DAL": ("team_mls_dal", "FC Dallas", "Dallas"),
"DC": ("team_mls_dc", "D.C. United", "Washington"),
"HOU": ("team_mls_hou", "Houston Dynamo", "Houston"),
"LAG": ("team_mls_lag", "LA Galaxy", "Los Angeles"),
"LAFC": ("team_mls_lafc", "Los Angeles FC", "Los Angeles"),
"MIA": ("team_mls_mia", "Inter Miami", "Miami"),
"MIN": ("team_mls_min", "Minnesota United", "Minnesota"),
"MTL": ("team_mls_mtl", "CF Montreal", "Montreal"),
"NSH": ("team_mls_nsh", "Nashville SC", "Nashville"),
"NE": ("team_mls_ne", "New England Revolution", "New England"),
"NYC": ("team_mls_nyc", "New York City FC", "New York"),
"RB": ("team_mls_ny", "New York Red Bulls", "New York"),
"RBNY": ("team_mls_ny", "New York Red Bulls", "New York"),
"ORL": ("team_mls_orl", "Orlando City", "Orlando"),
"PHI": ("team_mls_phi", "Philadelphia Union", "Philadelphia"),
"POR": ("team_mls_por", "Portland Timbers", "Portland"),
"SLC": ("team_mls_slc", "Real Salt Lake", "Salt Lake"),
"RSL": ("team_mls_slc", "Real Salt Lake", "Salt Lake"),
"SJ": ("team_mls_sj", "San Jose Earthquakes", "San Jose"),
"SD": ("team_mls_sd", "San Diego FC", "San Diego"),
"SEA": ("team_mls_sea", "Seattle Sounders", "Seattle"),
"SKC": ("team_mls_skc", "Sporting Kansas City", "Kansas City"),
"STL": ("team_mls_stl", "St. Louis City SC", "St. Louis"),
"TOR": ("team_mls_tor", "Toronto FC", "Toronto"),
"VAN": ("team_mls_van", "Vancouver Whitecaps", "Vancouver"),
},
"wnba": {
"ATL": ("team_wnba_atl", "Atlanta Dream", "Atlanta"),
"CHI": ("team_wnba_chi", "Chicago Sky", "Chicago"),
"CON": ("team_wnba_con", "Connecticut Sun", "Connecticut"),
"DAL": ("team_wnba_dal", "Dallas Wings", "Dallas"),
"GSV": ("team_wnba_gsv", "Golden State Valkyries", "Golden State"),
"IND": ("team_wnba_ind", "Indiana Fever", "Indiana"),
"LV": ("team_wnba_lv", "Las Vegas Aces", "Las Vegas"),
"LA": ("team_wnba_la", "Los Angeles Sparks", "Los Angeles"),
"MIN": ("team_wnba_min", "Minnesota Lynx", "Minnesota"),
"NY": ("team_wnba_ny", "New York Liberty", "New York"),
"PHX": ("team_wnba_phx", "Phoenix Mercury", "Phoenix"),
"SEA": ("team_wnba_sea", "Seattle Storm", "Seattle"),
"WAS": ("team_wnba_was", "Washington Mystics", "Washington"),
},
"nwsl": {
"ANF": ("team_nwsl_anf", "Angel City FC", "Los Angeles"),
"CHI": ("team_nwsl_chi", "Chicago Red Stars", "Chicago"),
"HOU": ("team_nwsl_hou", "Houston Dash", "Houston"),
"KC": ("team_nwsl_kc", "Kansas City Current", "Kansas City"),
"NJ": ("team_nwsl_nj", "NJ/NY Gotham FC", "New Jersey"),
"NC": ("team_nwsl_nc", "North Carolina Courage", "North Carolina"),
"ORL": ("team_nwsl_orl", "Orlando Pride", "Orlando"),
"POR": ("team_nwsl_por", "Portland Thorns", "Portland"),
"RGN": ("team_nwsl_rgn", "Racing Louisville", "Louisville"),
"SD": ("team_nwsl_sd", "San Diego Wave", "San Diego"),
"SEA": ("team_nwsl_sea", "Seattle Reign", "Seattle"),
"SLC": ("team_nwsl_slc", "Utah Royals", "Utah"),
"WAS": ("team_nwsl_was", "Washington Spirit", "Washington"),
"BFC": ("team_nwsl_bfc", "Bay FC", "San Francisco"),
},
}
class TeamResolver:
"""Resolves team names to canonical IDs.
Resolution order:
1. Exact match against abbreviation mappings
2. Exact match against full team names or cities
3. Alias lookup (with date awareness)
4. Fuzzy match against all known names
5. Unresolved (returns ManualReviewItem)
"""
def __init__(
self,
sport: str,
alias_loader: Optional[TeamAliasLoader] = None,
fuzzy_threshold: int = FUZZY_MATCH_THRESHOLD,
):
"""Initialize the resolver.
Args:
sport: Sport code (e.g., 'nba', 'mlb')
alias_loader: Team alias loader (default: global loader)
fuzzy_threshold: Minimum fuzzy match score
"""
self.sport = sport.lower()
self.alias_loader = alias_loader or get_team_alias_loader()
self.fuzzy_threshold = fuzzy_threshold
self._mappings = TEAM_MAPPINGS.get(self.sport, {})
# Build match candidates for fuzzy matching
self._candidates = self._build_candidates()
def _build_candidates(self) -> list[MatchCandidate]:
"""Build match candidates from team mappings."""
# Group by canonical ID to avoid duplicates
by_id: dict[str, tuple[str, list[str]]] = {}
for abbrev, (canonical_id, full_name, city) in self._mappings.items():
if canonical_id not in by_id:
by_id[canonical_id] = (full_name, [])
# Add abbreviation as alias
by_id[canonical_id][1].append(abbrev)
by_id[canonical_id][1].append(city)
return [
MatchCandidate(
canonical_id=cid,
name=name,
aliases=list(set(aliases)), # Dedupe
)
for cid, (name, aliases) in by_id.items()
]
def resolve(
self,
value: str,
check_date: Optional[date] = None,
source_url: Optional[str] = None,
) -> TeamResolveResult:
"""Resolve a team name to a canonical ID.
Args:
value: Team name, abbreviation, or city to resolve
check_date: Date for alias validity (None = today)
source_url: Source URL for manual review items
Returns:
TeamResolveResult with resolution details
"""
value_upper = value.upper().strip()
value_lower = value.lower().strip()
# 1. Exact match against abbreviation
if value_upper in self._mappings:
canonical_id, full_name, _ = self._mappings[value_upper]
return TeamResolveResult(
canonical_id=canonical_id,
confidence=100,
match_type="exact",
)
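# 2. Exact match against full names or cities
# (a city shared by several teams, e.g. "Los Angeles", resolves to the first mapping entry)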
for abbrev, (canonical_id, full_name, city) in self._mappings.items():
if value_lower == full_name.lower() or value_lower == city.lower():
return TeamResolveResult(
canonical_id=canonical_id,
confidence=100,
match_type="exact",
)
# 3. Alias lookup
alias_result = self.alias_loader.resolve(value, check_date)
if alias_result:
return TeamResolveResult(
canonical_id=alias_result,
confidence=95,
match_type="alias",
)
# 4. Fuzzy match
matches = fuzzy_match_team(
value,
self._candidates,
threshold=self.fuzzy_threshold,
)
if matches:
best = matches[0]
review_item = None
# Create review item for low confidence matches
if best.confidence < 90:
review_item = ManualReviewItem(
id=f"team_{uuid4().hex[:8]}",
reason=ReviewReason.LOW_CONFIDENCE_MATCH,
sport=self.sport,
raw_value=value,
context={"match_type": "fuzzy"},
source_url=source_url,
suggested_matches=matches,
game_date=check_date,
)
return TeamResolveResult(
canonical_id=best.canonical_id,
confidence=best.confidence,
match_type="fuzzy",
review_item=review_item,
)
# 5. Unresolved
review_item = ManualReviewItem(
id=f"team_{uuid4().hex[:8]}",
reason=ReviewReason.UNRESOLVED_TEAM,
sport=self.sport,
raw_value=value,
context={},
source_url=source_url,
suggested_matches=fuzzy_match_team(
value,
self._candidates,
threshold=50, # Lower threshold for suggestions
top_n=5,
),
game_date=check_date,
)
return TeamResolveResult(
canonical_id=None,
confidence=0,
match_type="unresolved",
review_item=review_item,
)
def get_team_info(self, abbreviation: str) -> Optional[tuple[str, str, str]]:
"""Get team info by abbreviation.
Args:
abbreviation: Team abbreviation
Returns:
Tuple of (canonical_id, full_name, city) or None
"""
return self._mappings.get(abbreviation.upper())
def get_all_teams(self) -> list[tuple[str, str, str]]:
"""Get all teams for this sport.
Returns:
List of (canonical_id, full_name, city) tuples
"""
seen = set()
result = []
for abbrev, (canonical_id, full_name, city) in self._mappings.items():
if canonical_id not in seen:
seen.add(canonical_id)
result.append((canonical_id, full_name, city))
return result
# Cached resolvers
_resolvers: dict[str, TeamResolver] = {}
def get_team_resolver(sport: str) -> TeamResolver:
"""Get or create a team resolver for a sport."""
sport_lower = sport.lower()
if sport_lower not in _resolvers:
_resolvers[sport_lower] = TeamResolver(sport_lower)
return _resolvers[sport_lower]
def resolve_team(
sport: str,
value: str,
check_date: Optional[date] = None,
) -> TeamResolveResult:
"""Convenience function to resolve a team name.
Args:
sport: Sport code
value: Team name to resolve
check_date: Date for alias validity
Returns:
TeamResolveResult
"""
return get_team_resolver(sport).resolve(value, check_date)
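# Illustrative sketch, not part of the package: step 2 of TeamResolver.resolve also
# accepts a bare city, and several cities host more than one team in a league.
# The hypothetical helper below (ambiguous_cities is an assumed name) lists such
# cities for a mapping in the same (canonical_id, full_name, city) shape.

```python
from collections import defaultdict

# Hypothetical excerpt in the same shape as TEAM_MAPPINGS["nba"]
NBA_EXCERPT = {
    "LAC": ("team_nba_lac", "Los Angeles Clippers", "Los Angeles"),
    "LAL": ("team_nba_lal", "Los Angeles Lakers", "Los Angeles"),
    "PHX": ("team_nba_phx", "Phoenix Suns", "Phoenix"),
    "PHO": ("team_nba_phx", "Phoenix Suns", "Phoenix"),
}


def ambiguous_cities(mappings: dict[str, tuple[str, str, str]]) -> dict[str, list[str]]:
    """Return {city: sorted canonical IDs} for cities shared by more than one team."""
    ids_by_city: dict[str, set[str]] = defaultdict(set)
    for canonical_id, _full_name, city in mappings.values():
        ids_by_city[city].add(canonical_id)
    # Alternate abbreviations (PHX/PHO) share one ID, so they do not count as ambiguous
    return {city: sorted(ids) for city, ids in ids_by_city.items() if len(ids) > 1}


print(ambiguous_cities(NBA_EXCERPT))
```

# Callers that only have a city string may want to treat these cities as needing
# a full team name rather than trusting a confidence of 100.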
@@ -0,0 +1,344 @@
"""Timezone conversion utilities for normalizing game times to UTC."""
import re
from dataclasses import dataclass
from datetime import datetime, date, time
from typing import Optional
from zoneinfo import ZoneInfo
from dateutil import parser as dateutil_parser
from dateutil.tz import gettz, tzutc
from ..models.aliases import ReviewReason, ManualReviewItem
# Common timezone abbreviations to IANA timezones
TIMEZONE_ABBREV_MAP: dict[str, str] = {
# US timezones
"ET": "America/New_York",
"EST": "America/New_York",
"EDT": "America/New_York",
"CT": "America/Chicago",
"CST": "America/Chicago",
"CDT": "America/Chicago",
"MT": "America/Denver",
"MST": "America/Denver",
"MDT": "America/Denver",
"PT": "America/Los_Angeles",
"PST": "America/Los_Angeles",
"PDT": "America/Los_Angeles",
"AT": "America/Anchorage", # Treated as Alaska here; "AT" is also widely used for Atlantic Time
"AKST": "America/Anchorage",
"AKDT": "America/Anchorage",
"HT": "Pacific/Honolulu",
"HST": "Pacific/Honolulu",
# Canada
"AST": "America/Halifax",
"ADT": "America/Halifax",
"NST": "America/St_Johns",
"NDT": "America/St_Johns",
# Mexico
"CDST": "America/Mexico_City",
# UTC
"UTC": "UTC",
"GMT": "UTC",
"Z": "UTC",
}
# State/region to timezone mapping for inferring timezone from location
STATE_TIMEZONE_MAP: dict[str, str] = {
# Eastern
"CT": "America/New_York",
"DE": "America/New_York",
"FL": "America/New_York", # Most of Florida
"GA": "America/New_York",
"MA": "America/New_York",
"MD": "America/New_York",
"ME": "America/New_York",
"MI": "America/Detroit",
"NC": "America/New_York",
"NH": "America/New_York",
"NJ": "America/New_York",
"NY": "America/New_York",
"OH": "America/New_York",
"PA": "America/New_York",
"RI": "America/New_York",
"SC": "America/New_York",
"VA": "America/New_York",
"VT": "America/New_York",
"WV": "America/New_York",
"DC": "America/New_York",
# Central
"AL": "America/Chicago",
"AR": "America/Chicago",
"IA": "America/Chicago",
"IL": "America/Chicago",
"IN": "America/Indiana/Indianapolis", # Eastern (Indianapolis)
"KS": "America/Chicago",
"KY": "America/Kentucky/Louisville", # Eastern (Louisville)
"LA": "America/Chicago",
"MN": "America/Chicago",
"MO": "America/Chicago",
"MS": "America/Chicago",
"ND": "America/Chicago",
"NE": "America/Chicago",
"OK": "America/Chicago",
"SD": "America/Chicago",
"TN": "America/Chicago",
"TX": "America/Chicago",
"WI": "America/Chicago",
# Mountain
"AZ": "America/Phoenix", # No DST
"CO": "America/Denver",
"ID": "America/Boise",
"MT": "America/Denver",
"NM": "America/Denver",
"UT": "America/Denver",
"WY": "America/Denver",
# Pacific
"CA": "America/Los_Angeles",
"NV": "America/Los_Angeles",
"OR": "America/Los_Angeles",
"WA": "America/Los_Angeles",
# Alaska/Hawaii
"AK": "America/Anchorage",
"HI": "Pacific/Honolulu",
# Canada provinces
"ON": "America/Toronto",
"QC": "America/Montreal", # Legacy tzdata name; now a link to America/Toronto
"BC": "America/Vancouver",
"AB": "America/Edmonton",
"MB": "America/Winnipeg",
"SK": "America/Regina",
"NS": "America/Halifax",
"NB": "America/Moncton",
"NL": "America/St_Johns",
"PE": "America/Halifax",
}
@dataclass
class TimezoneResult:
"""Result of timezone conversion.
Attributes:
datetime_utc: The datetime converted to UTC
source_timezone: The timezone that was detected/used
confidence: Confidence in the timezone detection ('high', 'medium', 'low')
warning: Warning message if timezone was uncertain
"""
datetime_utc: datetime
source_timezone: str
confidence: str
warning: Optional[str] = None
def detect_timezone_from_string(time_str: str) -> Optional[str]:
"""Detect timezone from a time string containing a timezone abbreviation.
Args:
time_str: Time string that may contain timezone info (e.g., '7:00 PM ET')
Returns:
IANA timezone string if detected, None otherwise
"""
# Look for a timezone abbreviation anywhere in the string (whole-word match)
for abbrev, tz in TIMEZONE_ABBREV_MAP.items():
pattern = rf"\b{re.escape(abbrev)}\b"
if re.search(pattern, time_str, re.IGNORECASE):
return tz
return None
def detect_timezone_from_location(
state: Optional[str] = None,
city: Optional[str] = None,
) -> Optional[str]:
"""Detect timezone from location information.
Args:
state: State/province code (e.g., 'NY', 'ON')
city: City name (optional, for special cases)
Returns:
IANA timezone string if detected, None otherwise
"""
if state and state.upper() in STATE_TIMEZONE_MAP:
return STATE_TIMEZONE_MAP[state.upper()]
return None
def parse_datetime(
date_str: str,
time_str: Optional[str] = None,
timezone_hint: Optional[str] = None,
location_state: Optional[str] = None,
) -> TimezoneResult:
"""Parse a date/time string and convert to UTC.
Attempts to detect timezone from:
1. Explicit timezone in the string
2. Provided timezone hint
3. Location-based inference
4. Default to Eastern Time with warning
Args:
date_str: Date string (e.g., '2025-10-21', 'October 21, 2025')
time_str: Optional time string (e.g., '7:00 PM ET', '19:00')
timezone_hint: Optional IANA timezone to use if not detected
location_state: Optional state code for timezone inference
Returns:
TimezoneResult with UTC datetime and metadata
"""
# Parse the date
try:
if time_str:
# Combine date and time
full_str = f"{date_str} {time_str}"
else:
full_str = date_str
parsed = dateutil_parser.parse(full_str, fuzzy=True)
except (ValueError, OverflowError) as e:
# If parsing fails, return a placeholder with low confidence
return TimezoneResult(
datetime_utc=datetime.now(tz=ZoneInfo("UTC")),
source_timezone="unknown",
confidence="low",
warning=f"Failed to parse datetime: {e}",
)
# Determine timezone
detected_tz = None
confidence = "high"
warning = None
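# Check if datetime already has timezone
if parsed.tzinfo is not None:
# Prefer the zone name (e.g. "UTC"); str() of a dateutil tz object such as
# tzutc() is not a valid IANA key
detected_tz = parsed.tzname() or str(parsed.tzinfo)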
else:
# Try to detect from time string
if time_str:
detected_tz = detect_timezone_from_string(time_str)
# Try timezone hint
if not detected_tz and timezone_hint:
detected_tz = timezone_hint
confidence = "medium"
# Try location inference
if not detected_tz and location_state:
detected_tz = detect_timezone_from_location(state=location_state)
confidence = "medium"
# Default to Eastern Time
if not detected_tz:
detected_tz = "America/New_York"
confidence = "low"
warning = "Timezone not detected, defaulting to Eastern Time"
# Apply timezone and convert to UTC
try:
tz = ZoneInfo(detected_tz)
except KeyError:
# Invalid timezone, try to resolve abbreviation
if detected_tz in TIMEZONE_ABBREV_MAP:
tz = ZoneInfo(TIMEZONE_ABBREV_MAP[detected_tz])
detected_tz = TIMEZONE_ABBREV_MAP[detected_tz]
else:
tz = ZoneInfo("America/New_York")
confidence = "low"
warning = f"Unknown timezone '{detected_tz}', defaulting to Eastern Time"
detected_tz = "America/New_York"
# Apply timezone if not already set
if parsed.tzinfo is None:
parsed = parsed.replace(tzinfo=tz)
# Convert to UTC
utc_dt = parsed.astimezone(ZoneInfo("UTC"))
return TimezoneResult(
datetime_utc=utc_dt,
source_timezone=detected_tz,
confidence=confidence,
warning=warning,
)
def convert_to_utc(
dt: datetime,
source_timezone: str,
) -> datetime:
"""Convert a datetime from a known timezone to UTC.
Args:
dt: Datetime to convert (timezone-naive or timezone-aware)
source_timezone: IANA timezone of the datetime
Returns:
Datetime in UTC
"""
tz = ZoneInfo(source_timezone)
if dt.tzinfo is None:
# Localize naive datetime
dt = dt.replace(tzinfo=tz)
return dt.astimezone(ZoneInfo("UTC"))
def create_timezone_warning(
raw_value: str,
sport: str,
game_date: Optional[date] = None,
source_url: Optional[str] = None,
) -> ManualReviewItem:
"""Create a manual review item for an undetermined timezone.
Args:
raw_value: The original time string that couldn't be resolved
sport: Sport code
game_date: Date of the game
source_url: URL of the source page
Returns:
ManualReviewItem for timezone review
"""
return ManualReviewItem(
id=f"tz_{sport}_{raw_value[:20].replace(' ', '_')}",
reason=ReviewReason.TIMEZONE_UNKNOWN,
sport=sport,
raw_value=raw_value,
context={"issue": "Could not determine timezone for game time"},
source_url=source_url,
game_date=game_date,
)
def get_stadium_timezone(
stadium_state: str,
stadium_timezone: Optional[str] = None,
) -> str:
"""Get the timezone for a stadium based on its location.
Args:
stadium_state: State/province code
stadium_timezone: Explicit timezone override from stadium data
Returns:
IANA timezone string
"""
if stadium_timezone:
return stadium_timezone
tz = detect_timezone_from_location(state=stadium_state)
if tz:
return tz
# Default to Eastern
return "America/New_York"
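The precedence used by `parse_datetime` (explicit abbreviation, then hint, then state, then an Eastern default) can be sketched in isolation. This is a minimal illustration, not the module itself: `ABBREV_MAP`, `STATE_MAP`, `resolve_timezone`, and `to_utc` are hypothetical stand-ins, and the map contents are assumed, trimmed-down versions of `TIMEZONE_ABBREV_MAP` and `STATE_TIMEZONE_MAP`.

```python
from datetime import datetime, timezone
from typing import Optional
from zoneinfo import ZoneInfo

# Assumed, trimmed-down stand-ins for the module's lookup tables.
ABBREV_MAP = {"ET": "America/New_York", "CT": "America/Chicago", "PT": "America/Los_Angeles"}
STATE_MAP = {"NY": "America/New_York", "IL": "America/Chicago", "AZ": "America/Phoenix"}


def resolve_timezone(
    time_str: str,
    hint: Optional[str] = None,
    state: Optional[str] = None,
) -> tuple[str, str]:
    """Return (iana_name, confidence) using the same precedence as parse_datetime."""
    # 1. Explicit abbreviation in the time string wins.
    for token in time_str.upper().split():
        if token in ABBREV_MAP:
            return ABBREV_MAP[token], "high"
    # 2. Caller-supplied hint.
    if hint:
        return hint, "medium"
    # 3. Location-based inference.
    if state and state.upper() in STATE_MAP:
        return STATE_MAP[state.upper()], "medium"
    # 4. Last resort: Eastern Time, flagged as low confidence.
    return "America/New_York", "low"


def to_utc(naive: datetime, tz_name: str) -> datetime:
    """Attach the zone to a naive local time, then convert to UTC."""
    return naive.replace(tzinfo=ZoneInfo(tz_name)).astimezone(timezone.utc)


if __name__ == "__main__":
    tz_name, confidence = resolve_timezone("7:00 PM CT")
    print(tz_name, confidence, to_utc(datetime(2025, 10, 21, 19, 0), tz_name))
```

Attaching the zone with `replace(tzinfo=...)` is safe with `zoneinfo` because the UTC offset is computed lazily from the wall-clock time, so daylight saving is handled per date rather than per zone.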
@@ -0,0 +1,46 @@
"""Scrapers for fetching sports data from various sources."""
from .base import (
BaseScraper,
RawGameData,
ScrapeResult,
ScraperError,
PartialDataError,
)
from .nba import NBAScraper, create_nba_scraper
from .mlb import MLBScraper, create_mlb_scraper
from .nfl import NFLScraper, create_nfl_scraper
from .nhl import NHLScraper, create_nhl_scraper
from .mls import MLSScraper, create_mls_scraper
from .wnba import WNBAScraper, create_wnba_scraper
from .nwsl import NWSLScraper, create_nwsl_scraper
__all__ = [
# Base
"BaseScraper",
"RawGameData",
"ScrapeResult",
"ScraperError",
"PartialDataError",
# NBA
"NBAScraper",
"create_nba_scraper",
# MLB
"MLBScraper",
"create_mlb_scraper",
# NFL
"NFLScraper",
"create_nfl_scraper",
# NHL
"NHLScraper",
"create_nhl_scraper",
# MLS
"MLSScraper",
"create_mls_scraper",
# WNBA
"WNBAScraper",
"create_wnba_scraper",
# NWSL
"NWSLScraper",
"create_nwsl_scraper",
]
+322
@@ -0,0 +1,322 @@
"""Base scraper class for all sport scrapers."""
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from datetime import date, datetime
from typing import Optional
from ..config import EXPECTED_GAME_COUNTS
from ..models.game import Game
from ..models.team import Team
from ..models.stadium import Stadium
from ..models.aliases import ManualReviewItem
from ..utils.http import RateLimitedSession, get_session
from ..utils.logging import get_logger, log_error, log_warning
from ..utils.progress import ScrapeProgress
@dataclass
class RawGameData:
"""Raw game data before normalization.
This intermediate format holds data as scraped from sources,
before team/stadium resolution and canonical ID generation.
"""
game_date: datetime
home_team_raw: str
away_team_raw: str
stadium_raw: Optional[str] = None
home_score: Optional[int] = None
away_score: Optional[int] = None
status: str = "scheduled"
source_url: Optional[str] = None
game_number: Optional[int] = None # For doubleheaders
@dataclass
class ScrapeResult:
"""Result of a scraping operation.
Attributes:
games: List of normalized Game objects
teams: List of Team objects
stadiums: List of Stadium objects
review_items: Items requiring manual review
source: Name of the source used
success: Whether scraping succeeded
error_message: Error message if failed
"""
games: list[Game] = field(default_factory=list)
teams: list[Team] = field(default_factory=list)
stadiums: list[Stadium] = field(default_factory=list)
review_items: list[ManualReviewItem] = field(default_factory=list)
source: str = ""
success: bool = True
error_message: Optional[str] = None
@property
def game_count(self) -> int:
return len(self.games)
@property
def team_count(self) -> int:
return len(self.teams)
@property
def stadium_count(self) -> int:
return len(self.stadiums)
@property
def review_count(self) -> int:
return len(self.review_items)
class BaseScraper(ABC):
"""Abstract base class for sport scrapers.
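Subclasses must implement:
- _get_sources(): Return list of source names in priority order
- _scrape_games_from_source(): Fetch raw games from a single source
- _normalize_games(): Convert raw games to Game objects and review items
- scrape_teams(): Fetch team information
- scrape_stadiums(): Fetch stadium information
scrape_games() is provided here and drives the multi-source fallback.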
Features:
- Multi-source fallback (try sources in order)
- Built-in rate limiting
- Error handling with partial data discard
- Progress tracking
- Source URL tracking for manual review
"""
def __init__(
self,
sport: str,
season: int,
session: Optional[RateLimitedSession] = None,
):
"""Initialize the scraper.
Args:
sport: Sport code (e.g., 'nba', 'mlb')
season: Season start year (e.g., 2025 for 2025-26)
session: Optional HTTP session (default: global session)
"""
self.sport = sport.lower()
self.season = season
self.session = session or get_session()
self._logger = get_logger()
self._progress: Optional[ScrapeProgress] = None
@property
def expected_game_count(self) -> int:
"""Get expected number of games for this sport."""
return EXPECTED_GAME_COUNTS.get(self.sport, 0)
@abstractmethod
def _get_sources(self) -> list[str]:
"""Return list of source names in priority order.
Returns:
List of source identifiers (e.g., ['basketball_reference', 'espn', 'cbs'])
"""
pass
@abstractmethod
def _scrape_games_from_source(
self,
source: str,
) -> list[RawGameData]:
"""Scrape games from a specific source.
Args:
source: Source identifier
Returns:
List of raw game data
Raises:
Exception: If scraping fails
"""
pass
@abstractmethod
def _normalize_games(
self,
raw_games: list[RawGameData],
) -> tuple[list[Game], list[ManualReviewItem]]:
"""Normalize raw game data to Game objects.
Args:
raw_games: Raw scraped data
Returns:
Tuple of (normalized games, review items)
"""
pass
@abstractmethod
def scrape_teams(self) -> list[Team]:
"""Fetch team information.
Returns:
List of Team objects
"""
pass
@abstractmethod
def scrape_stadiums(self) -> list[Stadium]:
"""Fetch stadium information.
Returns:
List of Stadium objects
"""
pass
def scrape_games(self) -> ScrapeResult:
"""Scrape games with multi-source fallback.
Tries each source in priority order. On failure, discards
partial data and tries the next source.
Returns:
ScrapeResult with games, review items, and status
"""
sources = self._get_sources()
last_error: Optional[str] = None
for source in sources:
self._logger.info(f"Trying source: {source}")
try:
# Scrape raw data
raw_games = self._scrape_games_from_source(source)
if not raw_games:
log_warning(f"No games found from {source}")
continue
self._logger.info(f"Found {len(raw_games)} raw games from {source}")
# Normalize data
games, review_items = self._normalize_games(raw_games)
self._logger.info(
f"Normalized {len(games)} games, {len(review_items)} need review"
)
return ScrapeResult(
games=games,
review_items=review_items,
source=source,
success=True,
)
except Exception as e:
last_error = str(e)
log_error(f"Failed to scrape from {source}: {e}", exc_info=True)
# Discard partial data and try next source
continue
# All sources failed
return ScrapeResult(
success=False,
error_message=f"All sources failed. Last error: {last_error}",
)
def scrape_all(self) -> ScrapeResult:
"""Scrape games, teams, and stadiums.
Returns:
Complete ScrapeResult with all data
"""
self._progress = ScrapeProgress(self.sport, self.season)
self._progress.start()
try:
# Scrape games
result = self.scrape_games()
if not result.success:
self._progress.log_error(result.error_message or "Unknown error")
self._progress.finish()
return result
# Scrape teams
teams = self.scrape_teams()
result.teams = teams
# Scrape stadiums
stadiums = self.scrape_stadiums()
result.stadiums = stadiums
# Update progress
self._progress.games_count = result.game_count
self._progress.teams_count = result.team_count
self._progress.stadiums_count = result.stadium_count
self._progress.errors_count = result.review_count
self._progress.finish()
return result
except Exception as e:
log_error(f"Scraping failed: {e}", exc_info=True)
self._progress.finish()
return ScrapeResult(
success=False,
error_message=str(e),
)
def _get_season_months(self) -> list[tuple[int, int]]:
"""Get the months to scrape for this sport's season.
Returns:
List of (year, month) tuples
"""
# Default implementation for sports with fall-spring seasons
# (NBA, NHL, etc.)
months = []
# Fall months of season start year
for month in range(10, 13): # Oct-Dec
months.append((self.season, month))
# Winter-spring months of following year
for month in range(1, 7): # Jan-Jun
months.append((self.season + 1, month))
return months
def _get_source_url(self, source: str, **kwargs) -> str:
"""Build a source URL with parameters.
Subclasses should override this to build URLs for their sources.
Args:
source: Source identifier
**kwargs: URL parameters
Returns:
Complete URL string
"""
raise NotImplementedError(f"URL builder not implemented for {source}")
class ScraperError(Exception):
"""Exception raised when scraping fails."""
def __init__(self, source: str, message: str):
self.source = source
self.message = message
super().__init__(f"[{source}] {message}")
class PartialDataError(ScraperError):
"""Exception raised when only partial data was retrieved."""
def __init__(self, source: str, message: str, partial_count: int):
self.partial_count = partial_count
super().__init__(source, f"{message} (got {partial_count} items)")
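The fallback loop in `scrape_games()` reduces to a small pattern: try each source in order, discard anything from a source that raises, skip sources that return nothing, and stop at the first non-empty result. The sketch below shows that pattern on its own; `first_successful` and the callables passed to it are hypothetical names for illustration, not part of the package.

```python
from typing import Callable, Optional


def first_successful(
    sources: dict[str, Callable[[], list]],
) -> tuple[Optional[str], list, Optional[str]]:
    """Try sources in insertion order; return (source_name, rows, last_error)."""
    last_error: Optional[str] = None
    for name, fetch in sources.items():
        try:
            rows = fetch()
        except Exception as exc:  # any failure discards this source entirely
            last_error = str(exc)
            continue
        if rows:  # an empty result is treated as "no data", so keep going
            return name, rows, None
    # Every source either raised or came back empty.
    return None, [], last_error


if __name__ == "__main__":
    def blocked() -> list:
        raise RuntimeError("HTTP 403")

    print(first_successful({"primary": blocked, "backup": lambda: ["game-1"]}))
```

As in the original loop, an empty result does not update the last error, so when every source returns nothing the reported error is `None`.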
+707
@@ -0,0 +1,707 @@
"""MLB scraper implementation with multi-source fallback."""
from datetime import datetime, date
from typing import Optional
from bs4 import BeautifulSoup
from .base import BaseScraper, RawGameData, ScrapeResult
from ..models.game import Game
from ..models.team import Team
from ..models.stadium import Stadium
from ..models.aliases import ManualReviewItem
from ..normalizers.canonical_id import generate_game_id
from ..normalizers.team_resolver import (
TeamResolver,
TEAM_MAPPINGS,
get_team_resolver,
)
from ..normalizers.stadium_resolver import (
StadiumResolver,
STADIUM_MAPPINGS,
get_stadium_resolver,
)
from ..normalizers.timezone import parse_datetime
from ..utils.logging import get_logger, log_game, log_warning
class MLBScraper(BaseScraper):
"""MLB schedule scraper with multi-source fallback.
Sources (in priority order):
1. Baseball-Reference - Most reliable, complete historical data
2. MLB Stats API - Official MLB data
3. ESPN API - Backup option
"""
def __init__(self, season: int, **kwargs):
"""Initialize MLB scraper.
Args:
season: Season year (e.g., 2026 for 2026 season)
"""
super().__init__("mlb", season, **kwargs)
self._team_resolver = get_team_resolver("mlb")
self._stadium_resolver = get_stadium_resolver("mlb")
def _get_sources(self) -> list[str]:
"""Return source list in priority order."""
return ["baseball_reference", "mlb_api", "espn"]
def _get_source_url(self, source: str, **kwargs) -> str:
"""Build URL for a source."""
if source == "baseball_reference":
# Baseball-Reference serves one schedule page per season, keyed by year
return f"https://www.baseball-reference.com/leagues/majors/{self.season}-schedule.shtml"
elif source == "mlb_api":
start_date = kwargs.get("start_date", "")
end_date = kwargs.get("end_date", "")
return f"https://statsapi.mlb.com/api/v1/schedule?sportId=1&startDate={start_date}&endDate={end_date}"
elif source == "espn":
date_str = kwargs.get("date", "")
return f"https://site.api.espn.com/apis/site/v2/sports/baseball/mlb/scoreboard?dates={date_str}"
raise ValueError(f"Unknown source: {source}")
def _get_season_months(self) -> list[tuple[int, int]]:
"""Get the months to scrape for MLB season.
MLB season runs March/April through October/November.
"""
months = []
# Spring training / early season
for month in range(3, 12): # March-November
months.append((self.season, month))
return months
def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
"""Scrape games from a specific source."""
if source == "baseball_reference":
return self._scrape_baseball_reference()
elif source == "mlb_api":
return self._scrape_mlb_api()
elif source == "espn":
return self._scrape_espn()
else:
raise ValueError(f"Unknown source: {source}")
def _scrape_baseball_reference(self) -> list[RawGameData]:
"""Scrape games from Baseball-Reference.
BR has a single schedule page per season.
Format: https://www.baseball-reference.com/leagues/majors/YYYY-schedule.shtml
"""
url = self._get_source_url("baseball_reference")
try:
html = self.session.get_html(url)
games = self._parse_baseball_reference(html, url)
return games
except Exception as e:
self._logger.error(f"Failed to scrape Baseball-Reference: {e}")
raise
def _parse_baseball_reference(
self,
html: str,
source_url: str,
) -> list[RawGameData]:
"""Parse Baseball-Reference schedule HTML.
Expected structure: each date is an <h3> header followed by one
<p class="game"> element per game.
Each game row has: away team, away score, home team, home score, and sometimes a venue.
"""
soup = BeautifulSoup(html, "lxml")
games: list[RawGameData] = []
# Walk headers and paragraphs in document order so that each
# <p class="game"> picks up the date from the most recent <h3>
current_date = None
for elem in soup.find_all(["h3", "p"]):
# H3 contains date headers
if elem.name == "h3":
date_text = elem.get_text(strip=True)
try:
# Format: "Wednesday, April 1, 2026"
current_date = datetime.strptime(date_text, "%A, %B %d, %Y")
except ValueError:
continue
elif elem.name == "p" and "game" in elem.get("class", []):
if current_date is None:
continue
try:
game = self._parse_br_game(elem, current_date, source_url)
if game:
games.append(game)
except Exception as e:
self._logger.debug(f"Failed to parse game: {e}")
continue
return games
def _parse_br_game(
self,
elem,
game_date: datetime,
source_url: str,
) -> Optional[RawGameData]:
"""Parse a single Baseball-Reference game element."""
text = elem.get_text(" ", strip=True)
# Parse game text - formats vary:
# "Team A (5) @ Team B (3)" or "Team A @ Team B"
# Also handles doubleheader notation
# Find all links - usually team names
links = elem.find_all("a")
if len(links) < 2:
return None
# First link is away team, second is home team
away_team = links[0].get_text(strip=True)
home_team = links[1].get_text(strip=True)
# Try to extract scores from text
away_score = None
home_score = None
# Look for score pattern "(N)"
import re
score_pattern = r"\((\d+)\)"
scores = re.findall(score_pattern, text)
if len(scores) >= 2:
try:
away_score = int(scores[0])
home_score = int(scores[1])
except (ValueError, IndexError):
pass
# Determine status
status = "final" if home_score is not None else "scheduled"
# Check for postponed/cancelled
text_lower = text.lower()
if "postponed" in text_lower:
status = "postponed"
elif "cancelled" in text_lower or "canceled" in text_lower:
status = "cancelled"
# Extract venue if present
stadium = None
if len(links) > 2:
# Assume a third link, when present, is the stadium
stadium = links[2].get_text(strip=True)
return RawGameData(
game_date=game_date,
home_team_raw=home_team,
away_team_raw=away_team,
stadium_raw=stadium,
home_score=home_score,
away_score=away_score,
status=status,
source_url=source_url,
)
def _scrape_mlb_api(self) -> list[RawGameData]:
"""Scrape games from MLB Stats API.
MLB API allows date range queries.
"""
all_games: list[RawGameData] = []
# Query by month to avoid hitting API limits
for year, month in self._get_season_months():
start_date = date(year, month, 1)
# Start from the first day of the following month...
if month == 12:
end_date = date(year + 1, 1, 1)
else:
end_date = date(year, month + 1, 1)
# ...then step back one day to land on the last day of this month
from datetime import timedelta
end_date = end_date - timedelta(days=1)
url = self._get_source_url(
"mlb_api",
start_date=start_date.strftime("%Y-%m-%d"),
end_date=end_date.strftime("%Y-%m-%d"),
)
try:
data = self.session.get_json(url)
games = self._parse_mlb_api_response(data, url)
all_games.extend(games)
self._logger.debug(f"Found {len(games)} games in {year}-{month:02d}")
except Exception as e:
self._logger.debug(f"MLB API error for {year}-{month}: {e}")
continue
return all_games
def _parse_mlb_api_response(
self,
data: dict,
source_url: str,
) -> list[RawGameData]:
"""Parse MLB Stats API response."""
games: list[RawGameData] = []
dates = data.get("dates", [])
for date_entry in dates:
for game in date_entry.get("games", []):
try:
raw_game = self._parse_mlb_api_game(game, source_url)
if raw_game:
games.append(raw_game)
except Exception as e:
self._logger.debug(f"Failed to parse MLB API game: {e}")
continue
return games
def _parse_mlb_api_game(
self,
game: dict,
source_url: str,
) -> Optional[RawGameData]:
"""Parse a single MLB API game."""
# Get game date/time
game_date_str = game.get("gameDate", "")
if not game_date_str:
return None
try:
game_date = datetime.fromisoformat(game_date_str.replace("Z", "+00:00"))
except ValueError:
return None
# Get teams
teams = game.get("teams", {})
away_data = teams.get("away", {})
home_data = teams.get("home", {})
away_team_info = away_data.get("team", {})
home_team_info = home_data.get("team", {})
away_team = away_team_info.get("name", "")
home_team = home_team_info.get("name", "")
if not away_team or not home_team:
return None
# Get scores
away_score = away_data.get("score")
home_score = home_data.get("score")
# Get venue
venue = game.get("venue", {})
stadium = venue.get("name")
# Get status
status_data = game.get("status", {})
abstract_game_state = status_data.get("abstractGameState", "").lower()
detailed_state = status_data.get("detailedState", "").lower()
if abstract_game_state == "final":
status = "final"
elif "postponed" in detailed_state:
status = "postponed"
elif "cancelled" in detailed_state or "canceled" in detailed_state:
status = "cancelled"
else:
status = "scheduled"
# Keep the game number only for doubleheaders (doubleHeader == "Y")
is_doubleheader = game.get("doubleHeader") == "Y"
game_number = game.get("gameNumber", 1) if is_doubleheader else None
return RawGameData(
game_date=game_date,
home_team_raw=home_team,
away_team_raw=away_team,
stadium_raw=stadium,
home_score=home_score,
away_score=away_score,
status=status,
source_url=source_url,
game_number=game_number,
)
def _scrape_espn(self) -> list[RawGameData]:
"""Scrape games from ESPN API."""
all_games: list[RawGameData] = []
for year, month in self._get_season_months():
# Get number of days in month
if month == 12:
next_month = date(year + 1, 1, 1)
else:
next_month = date(year, month + 1, 1)
days_in_month = (next_month - date(year, month, 1)).days
for day in range(1, days_in_month + 1):
try:
game_date = date(year, month, day)
date_str = game_date.strftime("%Y%m%d")
url = self._get_source_url("espn", date=date_str)
data = self.session.get_json(url)
games = self._parse_espn_response(data, url)
all_games.extend(games)
except Exception as e:
self._logger.debug(f"ESPN error for {year}-{month}-{day}: {e}")
continue
return all_games
def _parse_espn_response(
self,
data: dict,
source_url: str,
) -> list[RawGameData]:
"""Parse ESPN API response."""
games: list[RawGameData] = []
events = data.get("events", [])
for event in events:
try:
game = self._parse_espn_event(event, source_url)
if game:
games.append(game)
except Exception as e:
self._logger.debug(f"Failed to parse ESPN event: {e}")
continue
return games
def _parse_espn_event(
self,
event: dict,
source_url: str,
) -> Optional[RawGameData]:
"""Parse a single ESPN event."""
# Get date
date_str = event.get("date", "")
if not date_str:
return None
try:
game_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
except ValueError:
return None
# Get competitions
competitions = event.get("competitions", [])
if not competitions:
return None
competition = competitions[0]
# Get teams
competitors = competition.get("competitors", [])
if len(competitors) != 2:
return None
home_team = None
away_team = None
home_score = None
away_score = None
for competitor in competitors:
team_info = competitor.get("team", {})
team_name = team_info.get("displayName", "")
is_home = competitor.get("homeAway") == "home"
score = competitor.get("score")
# Convert any present value; an empty or malformed score becomes None
if score is not None:
try:
score = int(score)
except (ValueError, TypeError):
score = None
if is_home:
home_team = team_name
home_score = score
else:
away_team = team_name
away_score = score
if not home_team or not away_team:
return None
# Get venue
venue = competition.get("venue", {})
stadium = venue.get("fullName")
# Get status
status_info = competition.get("status", {})
status_type = status_info.get("type", {})
status_name = status_type.get("name", "").lower()
if status_name == "status_final":
status = "final"
elif status_name == "status_postponed":
status = "postponed"
elif status_name == "status_canceled":
status = "cancelled"
else:
status = "scheduled"
return RawGameData(
game_date=game_date,
home_team_raw=home_team,
away_team_raw=away_team,
stadium_raw=stadium,
home_score=home_score,
away_score=away_score,
status=status,
source_url=source_url,
)
def _normalize_games(
self,
raw_games: list[RawGameData],
) -> tuple[list[Game], list[ManualReviewItem]]:
"""Normalize raw games to Game objects with canonical IDs."""
games: list[Game] = []
review_items: list[ManualReviewItem] = []
# Track games by date/matchup for doubleheader detection
games_by_matchup: dict[str, list[RawGameData]] = {}
for raw in raw_games:
date_key = raw.game_date.strftime("%Y%m%d")
matchup_key = f"{date_key}_{raw.away_team_raw}_{raw.home_team_raw}"
if matchup_key not in games_by_matchup:
games_by_matchup[matchup_key] = []
games_by_matchup[matchup_key].append(raw)
# Process games with doubleheader detection
for matchup_key, matchup_games in games_by_matchup.items():
is_doubleheader = len(matchup_games) > 1
# Sort by time if doubleheader
if is_doubleheader:
matchup_games.sort(key=lambda g: g.game_date)
for i, raw in enumerate(matchup_games):
# Use provided game_number or calculate from order
game_number = raw.game_number or ((i + 1) if is_doubleheader else None)
game, item_reviews = self._normalize_single_game(raw, game_number)
if game:
games.append(game)
log_game(
self.sport,
game.id,
game.home_team_id,
game.away_team_id,
game.game_date.strftime("%Y-%m-%d"),
game.status,
)
review_items.extend(item_reviews)
return games, review_items
def _normalize_single_game(
self,
raw: RawGameData,
game_number: Optional[int],
) -> tuple[Optional[Game], list[ManualReviewItem]]:
"""Normalize a single raw game."""
review_items: list[ManualReviewItem] = []
# Resolve home team
home_result = self._team_resolver.resolve(
raw.home_team_raw,
check_date=raw.game_date.date(),
source_url=raw.source_url,
)
if home_result.review_item:
review_items.append(home_result.review_item)
if not home_result.canonical_id:
log_warning(f"Could not resolve home team: {raw.home_team_raw}")
return None, review_items
# Resolve away team
away_result = self._team_resolver.resolve(
raw.away_team_raw,
check_date=raw.game_date.date(),
source_url=raw.source_url,
)
if away_result.review_item:
review_items.append(away_result.review_item)
if not away_result.canonical_id:
log_warning(f"Could not resolve away team: {raw.away_team_raw}")
return None, review_items
# Resolve stadium
stadium_id = None
if raw.stadium_raw:
stadium_result = self._stadium_resolver.resolve(
raw.stadium_raw,
check_date=raw.game_date.date(),
source_url=raw.source_url,
)
if stadium_result.review_item:
review_items.append(stadium_result.review_item)
stadium_id = stadium_result.canonical_id
# Get abbreviations for game ID
home_abbrev = self._get_abbreviation(home_result.canonical_id)
away_abbrev = self._get_abbreviation(away_result.canonical_id)
# Generate canonical game ID
game_id = generate_game_id(
sport=self.sport,
season=self.season,
away_abbrev=away_abbrev,
home_abbrev=home_abbrev,
game_date=raw.game_date,
game_number=game_number,
)
game = Game(
id=game_id,
sport=self.sport,
season=self.season,
home_team_id=home_result.canonical_id,
away_team_id=away_result.canonical_id,
stadium_id=stadium_id or "",
game_date=raw.game_date,
game_number=game_number,
home_score=raw.home_score,
away_score=raw.away_score,
status=raw.status,
source_url=raw.source_url,
raw_home_team=raw.home_team_raw,
raw_away_team=raw.away_team_raw,
raw_stadium=raw.stadium_raw,
)
return game, review_items
def _get_abbreviation(self, team_id: str) -> str:
"""Extract abbreviation from team ID."""
# team_mlb_nyy -> nyy
parts = team_id.split("_")
return parts[-1] if parts else ""
def scrape_teams(self) -> list[Team]:
"""Get all MLB teams from hardcoded mappings."""
teams: list[Team] = []
seen: set[str] = set()
# MLB league/division structure
divisions = {
"AL East": ("American", ["BAL", "BOS", "NYY", "TB", "TOR"]),
"AL Central": ("American", ["CHW", "CLE", "DET", "KC", "MIN"]),
"AL West": ("American", ["HOU", "LAA", "OAK", "SEA", "TEX"]),
"NL East": ("National", ["ATL", "MIA", "NYM", "PHI", "WSN"]),
"NL Central": ("National", ["CHC", "CIN", "MIL", "PIT", "STL"]),
"NL West": ("National", ["ARI", "COL", "LAD", "SD", "SF"]),
}
# Build reverse lookup
team_divisions: dict[str, tuple[str, str]] = {}
for div, (league, abbrevs) in divisions.items():
for abbrev in abbrevs:
team_divisions[abbrev] = (league, div)
for abbrev, (team_id, full_name, city) in TEAM_MAPPINGS.get("mlb", {}).items():
if team_id in seen:
continue
seen.add(team_id)
# Parse team name from full name
parts = full_name.split()
if len(parts) >= 2:
team_name = parts[-1]
# Handle multi-word team names
if team_name in ["Sox", "Jays"]:
team_name = " ".join(parts[-2:])
else:
team_name = full_name
# Get league and division
league, div = team_divisions.get(abbrev, (None, None))
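# Get stadium ID by city match. A city shared by two teams can resolve
# to whichever stadium is listed first in STADIUM_MAPPINGS.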
stadium_id = None
mlb_stadiums = STADIUM_MAPPINGS.get("mlb", {})
for sid, sinfo in mlb_stadiums.items():
if city.lower() in sinfo.city.lower() or sinfo.city.lower() in city.lower():
stadium_id = sid
break
team = Team(
id=team_id,
sport="mlb",
city=city,
name=team_name,
full_name=full_name,
abbreviation=abbrev,
conference=league, # MLB uses "league" but we map to conference field
division=div,
stadium_id=stadium_id,
)
teams.append(team)
return teams
def scrape_stadiums(self) -> list[Stadium]:
"""Get all MLB stadiums from hardcoded mappings."""
stadiums: list[Stadium] = []
mlb_stadiums = STADIUM_MAPPINGS.get("mlb", {})
for stadium_id, info in mlb_stadiums.items():
stadium = Stadium(
id=stadium_id,
sport="mlb",
name=info.name,
city=info.city,
state=info.state,
country=info.country,
latitude=info.latitude,
longitude=info.longitude,
surface="grass", # Most MLB stadiums
roof_type="open", # Most MLB stadiums
)
stadiums.append(stadium)
return stadiums
def create_mlb_scraper(season: int) -> MLBScraper:
"""Factory function to create an MLB scraper."""
return MLBScraper(season=season)
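The doubleheader handling in `_normalize_games` groups games by date and matchup, sorts each group by start time, and numbers the games only when a group has more than one entry. A minimal sketch of that grouping follows, using plain tuples instead of `RawGameData`; `number_doubleheaders` is a hypothetical helper, and it ignores the source-provided `game_number` that the real method prefers when present.

```python
from collections import defaultdict
from datetime import date, datetime
from typing import Optional

Matchup = tuple[date, str, str]  # (game date, away team, home team)


def number_doubleheaders(
    games: list[tuple[datetime, str, str]],
) -> list[tuple[datetime, str, str, Optional[int]]]:
    """Attach a game number (1, 2, ...) to same-day repeat matchups, else None."""
    by_matchup: dict[Matchup, list[datetime]] = defaultdict(list)
    for start, away, home in games:
        by_matchup[(start.date(), away, home)].append(start)

    numbered: list[tuple[datetime, str, str, Optional[int]]] = []
    for (_, away, home), starts in by_matchup.items():
        starts.sort()  # earlier first pitch becomes game 1
        is_doubleheader = len(starts) > 1
        for i, start in enumerate(starts, start=1):
            numbered.append((start, away, home, i if is_doubleheader else None))
    return numbered


if __name__ == "__main__":
    schedule = [
        (datetime(2026, 7, 4, 18, 5), "BOS", "NYY"),
        (datetime(2026, 7, 4, 13, 5), "BOS", "NYY"),
        (datetime(2026, 7, 4, 19, 10), "CHC", "STL"),
    ]
    for row in number_doubleheaders(schedule):
        print(row)
```

Because the grouping key uses the calendar date of the stored datetime, a UTC timestamp for a late West Coast game can fall on the following day; grouping on the local game date avoids splitting such a doubleheader.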
+410
@@ -0,0 +1,410 @@
"""MLS scraper implementation with multi-source fallback."""
from datetime import datetime, date
from typing import Optional
from .base import BaseScraper, RawGameData, ScrapeResult
from ..models.game import Game
from ..models.team import Team
from ..models.stadium import Stadium
from ..models.aliases import ManualReviewItem
from ..normalizers.canonical_id import generate_game_id
from ..normalizers.team_resolver import (
TeamResolver,
TEAM_MAPPINGS,
get_team_resolver,
)
from ..normalizers.stadium_resolver import (
StadiumResolver,
STADIUM_MAPPINGS,
get_stadium_resolver,
)
from ..utils.logging import get_logger, log_game, log_warning
class MLSScraper(BaseScraper):
"""MLS schedule scraper with multi-source fallback.
Sources (in priority order):
1. ESPN API - Most reliable for MLS
2. FBref - Backup option
"""
def __init__(self, season: int, **kwargs):
"""Initialize MLS scraper.
Args:
season: Season year (e.g., 2026 for 2026 season)
"""
super().__init__("mls", season, **kwargs)
self._team_resolver = get_team_resolver("mls")
self._stadium_resolver = get_stadium_resolver("mls")
def _get_sources(self) -> list[str]:
"""Return source list in priority order."""
return ["espn", "fbref"]
def _get_source_url(self, source: str, **kwargs) -> str:
"""Build URL for a source."""
if source == "espn":
date_str = kwargs.get("date", "")
return f"https://site.api.espn.com/apis/site/v2/sports/soccer/usa.1/scoreboard?dates={date_str}"
elif source == "fbref":
return f"https://fbref.com/en/comps/22/{self.season}/schedule/{self.season}-Major-League-Soccer-Scores-and-Fixtures"
raise ValueError(f"Unknown source: {source}")
def _get_season_months(self) -> list[tuple[int, int]]:
"""Get the months to scrape for MLS season.
The regular season starts in February/March; the playoffs end with
MLS Cup, which has been played in early December in most recent seasons.
"""
months = []
# MLS runs within a calendar year
for month in range(2, 13): # Feb-Dec
months.append((self.season, month))
return months
def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
"""Scrape games from a specific source."""
if source == "espn":
return self._scrape_espn()
elif source == "fbref":
return self._scrape_fbref()
else:
raise ValueError(f"Unknown source: {source}")
def _scrape_espn(self) -> list[RawGameData]:
"""Scrape games from ESPN API."""
all_games: list[RawGameData] = []
for year, month in self._get_season_months():
# Get number of days in month
if month == 12:
next_month = date(year + 1, 1, 1)
else:
next_month = date(year, month + 1, 1)
days_in_month = (next_month - date(year, month, 1)).days
for day in range(1, days_in_month + 1):
try:
game_date = date(year, month, day)
date_str = game_date.strftime("%Y%m%d")
url = self._get_source_url("espn", date=date_str)
data = self.session.get_json(url)
games = self._parse_espn_response(data, url)
all_games.extend(games)
except Exception as e:
self._logger.debug(f"ESPN error for {year}-{month}-{day}: {e}")
continue
return all_games
def _parse_espn_response(
self,
data: dict,
source_url: str,
) -> list[RawGameData]:
"""Parse ESPN API response."""
games: list[RawGameData] = []
events = data.get("events", [])
for event in events:
try:
game = self._parse_espn_event(event, source_url)
if game:
games.append(game)
except Exception as e:
self._logger.debug(f"Failed to parse ESPN event: {e}")
continue
return games
def _parse_espn_event(
self,
event: dict,
source_url: str,
) -> Optional[RawGameData]:
"""Parse a single ESPN event."""
# Get date
date_str = event.get("date", "")
if not date_str:
return None
try:
game_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
except ValueError:
return None
# Get competitions
competitions = event.get("competitions", [])
if not competitions:
return None
competition = competitions[0]
# Get teams
competitors = competition.get("competitors", [])
if len(competitors) != 2:
return None
home_team = None
away_team = None
home_score = None
away_score = None
for competitor in competitors:
team_info = competitor.get("team", {})
team_name = team_info.get("displayName", "")
is_home = competitor.get("homeAway") == "home"
score = competitor.get("score")
if score is not None: # "" and other non-numeric values become None below
try:
score = int(score)
except (ValueError, TypeError):
score = None
if is_home:
home_team = team_name
home_score = score
else:
away_team = team_name
away_score = score
if not home_team or not away_team:
return None
# Get venue
venue = competition.get("venue", {})
stadium = venue.get("fullName")
# Get status
status_info = competition.get("status", {})
status_type = status_info.get("type", {})
status_name = status_type.get("name", "").lower()
if status_name == "status_final":
status = "final"
elif status_name == "status_postponed":
status = "postponed"
elif status_name == "status_canceled":
status = "cancelled"
else:
status = "scheduled"
return RawGameData(
game_date=game_date,
home_team_raw=home_team,
away_team_raw=away_team,
stadium_raw=stadium,
home_score=home_score,
away_score=away_score,
status=status,
source_url=source_url,
)
def _scrape_fbref(self) -> list[RawGameData]:
"""Scrape games from FBref."""
# FBref scraping would go here
raise NotImplementedError("FBref scraper not implemented")
def _normalize_games(
self,
raw_games: list[RawGameData],
) -> tuple[list[Game], list[ManualReviewItem]]:
"""Normalize raw games to Game objects with canonical IDs."""
games: list[Game] = []
review_items: list[ManualReviewItem] = []
for raw in raw_games:
game, item_reviews = self._normalize_single_game(raw)
if game:
games.append(game)
log_game(
self.sport,
game.id,
game.home_team_id,
game.away_team_id,
game.game_date.strftime("%Y-%m-%d"),
game.status,
)
review_items.extend(item_reviews)
return games, review_items
def _normalize_single_game(
self,
raw: RawGameData,
) -> tuple[Optional[Game], list[ManualReviewItem]]:
"""Normalize a single raw game."""
review_items: list[ManualReviewItem] = []
# Resolve home team
home_result = self._team_resolver.resolve(
raw.home_team_raw,
check_date=raw.game_date.date(),
source_url=raw.source_url,
)
if home_result.review_item:
review_items.append(home_result.review_item)
if not home_result.canonical_id:
log_warning(f"Could not resolve home team: {raw.home_team_raw}")
return None, review_items
# Resolve away team
away_result = self._team_resolver.resolve(
raw.away_team_raw,
check_date=raw.game_date.date(),
source_url=raw.source_url,
)
if away_result.review_item:
review_items.append(away_result.review_item)
if not away_result.canonical_id:
log_warning(f"Could not resolve away team: {raw.away_team_raw}")
return None, review_items
# Resolve stadium
stadium_id = None
if raw.stadium_raw:
stadium_result = self._stadium_resolver.resolve(
raw.stadium_raw,
check_date=raw.game_date.date(),
source_url=raw.source_url,
)
if stadium_result.review_item:
review_items.append(stadium_result.review_item)
stadium_id = stadium_result.canonical_id
# Get abbreviations for game ID
home_abbrev = self._get_abbreviation(home_result.canonical_id)
away_abbrev = self._get_abbreviation(away_result.canonical_id)
# Generate canonical game ID
game_id = generate_game_id(
sport=self.sport,
season=self.season,
away_abbrev=away_abbrev,
home_abbrev=home_abbrev,
game_date=raw.game_date,
game_number=None,
)
game = Game(
id=game_id,
sport=self.sport,
season=self.season,
home_team_id=home_result.canonical_id,
away_team_id=away_result.canonical_id,
stadium_id=stadium_id or "",
game_date=raw.game_date,
game_number=None,
home_score=raw.home_score,
away_score=raw.away_score,
status=raw.status,
source_url=raw.source_url,
raw_home_team=raw.home_team_raw,
raw_away_team=raw.away_team_raw,
raw_stadium=raw.stadium_raw,
)
return game, review_items
def _get_abbreviation(self, team_id: str) -> str:
"""Extract abbreviation from team ID."""
parts = team_id.split("_")
return parts[-1] if parts else ""
def scrape_teams(self) -> list[Team]:
"""Get all MLS teams from hardcoded mappings."""
teams: list[Team] = []
seen: set[str] = set()
# MLS conference structure
conferences = {
"Eastern": ["ATL", "CLT", "CHI", "CIN", "CLB", "DC", "MIA", "MTL", "NSH", "NE", "NYC", "RB", "ORL", "PHI", "TOR"],
"Western": ["AUS", "COL", "DAL", "HOU", "LAG", "LAFC", "MIN", "POR", "SLC", "SD", "SJ", "SEA", "SKC", "STL", "VAN"],
}
# Build reverse lookup
team_conferences: dict[str, str] = {}
for conf, abbrevs in conferences.items():
for abbrev in abbrevs:
team_conferences[abbrev] = conf
for abbrev, (team_id, full_name, city) in TEAM_MAPPINGS.get("mls", {}).items():
if team_id in seen:
continue
seen.add(team_id)
# Parse team name
team_name = full_name
# Get conference
conf = team_conferences.get(abbrev)
# Get stadium ID
stadium_id = None
mls_stadiums = STADIUM_MAPPINGS.get("mls", {})
for sid, sinfo in mls_stadiums.items():
if city.lower() in sinfo.city.lower() or sinfo.city.lower() in city.lower():
stadium_id = sid
break
team = Team(
id=team_id,
sport="mls",
city=city,
name=team_name,
full_name=full_name,
abbreviation=abbrev,
conference=conf,
division=None, # MLS doesn't have divisions
stadium_id=stadium_id,
)
teams.append(team)
return teams
def scrape_stadiums(self) -> list[Stadium]:
"""Get all MLS stadiums from hardcoded mappings."""
stadiums: list[Stadium] = []
mls_stadiums = STADIUM_MAPPINGS.get("mls", {})
for stadium_id, info in mls_stadiums.items():
stadium = Stadium(
id=stadium_id,
sport="mls",
name=info.name,
city=info.city,
state=info.state,
country=info.country,
latitude=info.latitude,
longitude=info.longitude,
surface="grass", # Default only; several MLS venues use artificial turf
roof_type="open", # Default only; some MLS venues have a roof
)
stadiums.append(stadium)
return stadiums
def create_mls_scraper(season: int) -> MLSScraper:
"""Factory function to create an MLS scraper."""
return MLSScraper(season=season)
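The ESPN event handling in `_parse_espn_event` can be tried out without network access. What follows is a minimal sketch, assuming a hand-written payload that contains only the fields the scraper reads. `SAMPLE_EVENT`, `parse_score`, and `parse_event` are hypothetical names used for illustration and are not part of `sportstime_parser`; the team and venue names are invented.

```python
from datetime import datetime
from typing import Any, Optional

# Status names the scraper checks, mapped to its own status values.
STATUS_MAP = {
    "status_final": "final",
    "status_postponed": "postponed",
    "status_canceled": "cancelled",
}


def parse_score(value: Any) -> Optional[int]:
    """Return the score as an int, or None when it is missing or not numeric."""
    if value is None:
        return None
    try:
        return int(value)
    except (ValueError, TypeError):
        return None


def parse_event(event: dict) -> Optional[dict]:
    """Extract date, teams, scores, venue, and status from one event dict."""
    date_str = event.get("date", "")
    if not date_str:
        return None
    try:
        # Swap the trailing "Z" for an explicit offset before parsing.
        game_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
    except ValueError:
        return None

    competitions = event.get("competitions", [])
    if not competitions:
        return None
    competition = competitions[0]

    competitors = competition.get("competitors", [])
    if len(competitors) != 2:
        return None

    sides = {}
    for competitor in competitors:
        side = "home" if competitor.get("homeAway") == "home" else "away"
        sides[side] = (
            competitor.get("team", {}).get("displayName", ""),
            parse_score(competitor.get("score")),
        )
    if "home" not in sides or "away" not in sides:
        return None
    (home_team, home_score), (away_team, away_score) = sides["home"], sides["away"]
    if not home_team or not away_team:
        return None

    status_name = competition.get("status", {}).get("type", {}).get("name", "").lower()
    return {
        "game_date": game_date,
        "home_team": home_team,
        "away_team": away_team,
        "home_score": home_score,
        "away_score": away_score,
        "stadium": competition.get("venue", {}).get("fullName"),
        # Anything not listed in STATUS_MAP is treated as still scheduled.
        "status": STATUS_MAP.get(status_name, "scheduled"),
    }


# Hypothetical payload: invented teams, only the fields read above.
SAMPLE_EVENT = {
    "date": "2026-03-01T20:30:00Z",
    "competitions": [{
        "competitors": [
            {"homeAway": "home", "team": {"displayName": "Home FC"}, "score": "2"},
            {"homeAway": "away", "team": {"displayName": "Away SC"}, "score": "0"},
        ],
        "venue": {"fullName": "Example Park"},
        "status": {"type": {"name": "STATUS_FINAL"}},
    }],
}

if __name__ == "__main__":
    print(parse_event(SAMPLE_EVENT))
```

Keeping score conversion in its own helper makes the "0" versus missing-score distinction explicit: a shutout score of `"0"` is kept as `0`, while an empty or absent score becomes `None`.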
+637
@@ -0,0 +1,637 @@
"""NBA scraper implementation with multi-source fallback."""
from datetime import datetime, date, timezone
from typing import Optional
from bs4 import BeautifulSoup
import re
from .base import BaseScraper, RawGameData, ScrapeResult
from ..models.game import Game
from ..models.team import Team
from ..models.stadium import Stadium
from ..models.aliases import ManualReviewItem
from ..normalizers.canonical_id import generate_game_id
from ..normalizers.team_resolver import (
TeamResolver,
TEAM_MAPPINGS,
get_team_resolver,
)
from ..normalizers.stadium_resolver import (
StadiumResolver,
STADIUM_MAPPINGS,
get_stadium_resolver,
)
from ..normalizers.timezone import parse_datetime
from ..utils.logging import get_logger, log_game, log_warning
# Month name to number mapping
MONTH_MAP = {
"january": 1, "february": 2, "march": 3, "april": 4,
"may": 5, "june": 6, "july": 7, "august": 8,
"september": 9, "october": 10, "november": 11, "december": 12,
}
# Basketball Reference month URLs
BR_MONTHS = [
"october", "november", "december",
"january", "february", "march", "april", "may", "june",
]
class NBAScraper(BaseScraper):
"""NBA schedule scraper with multi-source fallback.
Sources (in priority order):
1. Basketball-Reference - Most reliable, complete historical data
2. ESPN API - Good for current/future seasons
3. CBS Sports - Backup option
"""
def __init__(self, season: int, **kwargs):
"""Initialize NBA scraper.
Args:
season: Season start year (e.g., 2025 for 2025-26)
"""
super().__init__("nba", season, **kwargs)
self._team_resolver = get_team_resolver("nba")
self._stadium_resolver = get_stadium_resolver("nba")
def _get_sources(self) -> list[str]:
"""Return source list in priority order."""
return ["basketball_reference", "espn", "cbs"]
def _get_source_url(self, source: str, **kwargs) -> str:
"""Build URL for a source."""
if source == "basketball_reference":
month = kwargs.get("month", "october")
year = kwargs.get("year", self.season + 1)
return f"https://www.basketball-reference.com/leagues/NBA_{year}_games-{month}.html"
elif source == "espn":
date_str = kwargs.get("date", "")
return f"https://site.api.espn.com/apis/site/v2/sports/basketball/nba/scoreboard?dates={date_str}"
elif source == "cbs":
return "https://www.cbssports.com/nba/schedule/"
raise ValueError(f"Unknown source: {source}")
def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
"""Scrape games from a specific source."""
if source == "basketball_reference":
return self._scrape_basketball_reference()
elif source == "espn":
return self._scrape_espn()
elif source == "cbs":
return self._scrape_cbs()
else:
raise ValueError(f"Unknown source: {source}")
def _scrape_basketball_reference(self) -> list[RawGameData]:
"""Scrape games from Basketball-Reference.
BR organizes games by month with separate pages.
Format: https://www.basketball-reference.com/leagues/NBA_YYYY_games-month.html
where YYYY is the ending year of the season.
"""
all_games: list[RawGameData] = []
end_year = self.season + 1
for month in BR_MONTHS:
url = self._get_source_url("basketball_reference", month=month, year=end_year)
try:
html = self.session.get_html(url)
games = self._parse_basketball_reference(html, url)
all_games.extend(games)
self._logger.debug(f"Found {len(games)} games in {month}")
except Exception as e:
# A month page may be missing (e.g., playoff months before that part of the schedule exists)
self._logger.debug(f"No data for {month}: {e}")
continue
return all_games
def _parse_basketball_reference(
self,
html: str,
source_url: str,
) -> list[RawGameData]:
"""Parse Basketball-Reference schedule HTML.
Table structure:
- th[data-stat="date_game"]: Date (e.g., "Tue, Oct 22, 2024")
- td[data-stat="visitor_team_name"]: Away team
- td[data-stat="home_team_name"]: Home team
- td[data-stat="visitor_pts"]: Away score
- td[data-stat="home_pts"]: Home score
- td[data-stat="arena_name"]: Arena/stadium name
"""
soup = BeautifulSoup(html, "lxml")
games: list[RawGameData] = []
# Find the schedule table
table = soup.find("table", id="schedule")
if not table:
return games
tbody = table.find("tbody")
if not tbody:
return games
for row in tbody.find_all("tr"):
# Skip header rows
if "thead" in (row.get("class") or []):
continue
try:
game = self._parse_br_row(row, source_url)
if game:
games.append(game)
except Exception as e:
self._logger.debug(f"Failed to parse row: {e}")
continue
return games
def _parse_br_row(
self,
row,
source_url: str,
) -> Optional[RawGameData]:
"""Parse a single Basketball-Reference table row."""
# Get date
date_cell = row.find("th", {"data-stat": "date_game"})
if not date_cell:
return None
date_text = date_cell.get_text(strip=True)
if not date_text:
return None
# Parse date (format: "Tue, Oct 22, 2024")
try:
game_date = datetime.strptime(date_text, "%a, %b %d, %Y")
except ValueError:
# Try alternative format
try:
game_date = datetime.strptime(date_text, "%B %d, %Y")
except ValueError:
self._logger.debug(f"Could not parse date: {date_text}")
return None
# Get teams
away_cell = row.find("td", {"data-stat": "visitor_team_name"})
home_cell = row.find("td", {"data-stat": "home_team_name"})
if not away_cell or not home_cell:
return None
away_team = away_cell.get_text(strip=True)
home_team = home_cell.get_text(strip=True)
if not away_team or not home_team:
return None
# Get scores (may be empty for future games)
away_score_cell = row.find("td", {"data-stat": "visitor_pts"})
home_score_cell = row.find("td", {"data-stat": "home_pts"})
away_score = None
home_score = None
if away_score_cell and away_score_cell.get_text(strip=True):
try:
away_score = int(away_score_cell.get_text(strip=True))
except ValueError:
pass
if home_score_cell and home_score_cell.get_text(strip=True):
try:
home_score = int(home_score_cell.get_text(strip=True))
except ValueError:
pass
# Get arena
arena_cell = row.find("td", {"data-stat": "arena_name"})
arena = arena_cell.get_text(strip=True) if arena_cell else None
# Determine status
status = "final" if home_score is not None else "scheduled"
# Check for postponed/cancelled
notes_cell = row.find("td", {"data-stat": "game_remarks"})
if notes_cell:
notes = notes_cell.get_text(strip=True).lower()
if "postponed" in notes:
status = "postponed"
elif "cancelled" in notes or "canceled" in notes:
status = "cancelled"
return RawGameData(
game_date=game_date,
home_team_raw=home_team,
away_team_raw=away_team,
stadium_raw=arena,
home_score=home_score,
away_score=away_score,
status=status,
source_url=source_url,
)
def _scrape_espn(self) -> list[RawGameData]:
"""Scrape games from ESPN API.
The scoreboard endpoint is queried with a single date (dates=YYYYMMDD),
so we request each day of the season in turn.
"""
all_games: list[RawGameData] = []
for year, month in self._get_season_months():
# Get number of days in month
if month == 12:
next_month = date(year + 1, 1, 1)
else:
next_month = date(year, month + 1, 1)
days_in_month = (next_month - date(year, month, 1)).days
for day in range(1, days_in_month + 1):
try:
game_date = date(year, month, day)
date_str = game_date.strftime("%Y%m%d")
url = self._get_source_url("espn", date=date_str)
data = self.session.get_json(url)
games = self._parse_espn_response(data, url)
all_games.extend(games)
except Exception as e:
self._logger.debug(f"ESPN error for {year}-{month}-{day}: {e}")
continue
return all_games
def _parse_espn_response(
self,
data: dict,
source_url: str,
) -> list[RawGameData]:
"""Parse ESPN API response."""
games: list[RawGameData] = []
events = data.get("events", [])
for event in events:
try:
game = self._parse_espn_event(event, source_url)
if game:
games.append(game)
except Exception as e:
self._logger.debug(f"Failed to parse ESPN event: {e}")
continue
return games
def _parse_espn_event(
self,
event: dict,
source_url: str,
) -> Optional[RawGameData]:
"""Parse a single ESPN event."""
# Get date
date_str = event.get("date", "")
if not date_str:
return None
try:
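# ESPN returns an ISO 8601 timestamp with a trailing "Z" (UTC);
# fromisoformat() rejects the "Z" before Python 3.11, so swap in an offset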
game_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
except ValueError:
return None
# Get competitions (usually just one)
competitions = event.get("competitions", [])
if not competitions:
return None
competition = competitions[0]
# Get teams
competitors = competition.get("competitors", [])
if len(competitors) != 2:
return None
home_team = None
away_team = None
home_score = None
away_score = None
for competitor in competitors:
team_info = competitor.get("team", {})
team_name = team_info.get("displayName", "")
is_home = competitor.get("homeAway") == "home"
score = competitor.get("score")
if score is not None: # "" and other non-numeric values become None below
try:
score = int(score)
except (ValueError, TypeError):
score = None
if is_home:
home_team = team_name
home_score = score
else:
away_team = team_name
away_score = score
if not home_team or not away_team:
return None
# Get venue
venue = competition.get("venue", {})
arena = venue.get("fullName")
# Get status
status_info = competition.get("status", {})
status_type = status_info.get("type", {})
status_name = status_type.get("name", "").lower()
if status_name == "status_final":
status = "final"
elif status_name == "status_postponed":
status = "postponed"
elif status_name == "status_canceled":
status = "cancelled"
else:
status = "scheduled"
return RawGameData(
game_date=game_date,
home_team_raw=home_team,
away_team_raw=away_team,
stadium_raw=arena,
home_score=home_score,
away_score=away_score,
status=status,
source_url=source_url,
)
def _scrape_cbs(self) -> list[RawGameData]:
"""Scrape games from CBS Sports.
CBS Sports is a backup source with less structured data.
"""
# CBS Sports scraping is not implemented yet; raise (rather than return an
# empty list) so the failure is visible to the caller
raise NotImplementedError("CBS scraper not implemented")
def _normalize_games(
self,
raw_games: list[RawGameData],
) -> tuple[list[Game], list[ManualReviewItem]]:
"""Normalize raw games to Game objects with canonical IDs."""
games: list[Game] = []
review_items: list[ManualReviewItem] = []
# Group games by date and matchup for doubleheader detection
games_by_date: dict[str, list[RawGameData]] = {}
for raw in raw_games:
date_key = raw.game_date.strftime("%Y%m%d")
matchup_key = f"{date_key}_{raw.away_team_raw}_{raw.home_team_raw}"
games_by_date.setdefault(matchup_key, []).append(raw)
# Process games with doubleheader detection
for matchup_key, matchup_games in games_by_date.items():
is_doubleheader = len(matchup_games) > 1
for i, raw in enumerate(matchup_games):
game_number = (i + 1) if is_doubleheader else None
game, item_reviews = self._normalize_single_game(raw, game_number)
if game:
games.append(game)
log_game(
self.sport,
game.id,
game.home_team_id,
game.away_team_id,
game.game_date.strftime("%Y-%m-%d"),
game.status,
)
review_items.extend(item_reviews)
return games, review_items
def _normalize_single_game(
self,
raw: RawGameData,
game_number: Optional[int],
) -> tuple[Optional[Game], list[ManualReviewItem]]:
"""Normalize a single raw game."""
review_items: list[ManualReviewItem] = []
# Resolve home team
home_result = self._team_resolver.resolve(
raw.home_team_raw,
check_date=raw.game_date.date(),
source_url=raw.source_url,
)
if home_result.review_item:
review_items.append(home_result.review_item)
if not home_result.canonical_id:
log_warning(f"Could not resolve home team: {raw.home_team_raw}")
return None, review_items
# Resolve away team
away_result = self._team_resolver.resolve(
raw.away_team_raw,
check_date=raw.game_date.date(),
source_url=raw.source_url,
)
if away_result.review_item:
review_items.append(away_result.review_item)
if not away_result.canonical_id:
log_warning(f"Could not resolve away team: {raw.away_team_raw}")
return None, review_items
# Resolve stadium (optional - use home team's stadium if not found)
stadium_id = None
if raw.stadium_raw:
stadium_result = self._stadium_resolver.resolve(
raw.stadium_raw,
check_date=raw.game_date.date(),
source_url=raw.source_url,
)
if stadium_result.review_item:
review_items.append(stadium_result.review_item)
stadium_id = stadium_result.canonical_id
# If no stadium found, use home team's default stadium
if not stadium_id:
# Look up home team's stadium from mappings
home_abbrev = home_result.canonical_id.split("_")[-1].upper()
team_info = self._team_resolver.get_team_info(home_abbrev)
if team_info:
# Try to find stadium by team's home arena
for sid, sinfo in STADIUM_MAPPINGS.get("nba", {}).items():
# Match by city
if sinfo.city.lower() in team_info[2].lower():
stadium_id = sid
break
# Get abbreviations for game ID
home_abbrev = self._get_abbreviation(home_result.canonical_id)
away_abbrev = self._get_abbreviation(away_result.canonical_id)
# Generate canonical game ID
game_id = generate_game_id(
sport=self.sport,
season=self.season,
away_abbrev=away_abbrev,
home_abbrev=home_abbrev,
game_date=raw.game_date,
game_number=game_number,
)
game = Game(
id=game_id,
sport=self.sport,
season=self.season,
home_team_id=home_result.canonical_id,
away_team_id=away_result.canonical_id,
stadium_id=stadium_id or "",
game_date=raw.game_date,
game_number=game_number,
home_score=raw.home_score,
away_score=raw.away_score,
status=raw.status,
source_url=raw.source_url,
raw_home_team=raw.home_team_raw,
raw_away_team=raw.away_team_raw,
raw_stadium=raw.stadium_raw,
)
return game, review_items
def _get_abbreviation(self, team_id: str) -> str:
"""Extract abbreviation from team ID."""
# team_nba_okc -> okc
parts = team_id.split("_")
return parts[-1] if parts else ""
def scrape_teams(self) -> list[Team]:
"""Get all NBA teams from hardcoded mappings."""
teams: list[Team] = []
seen: set[str] = set()
# NBA conference/division structure
divisions = {
"Atlantic": ("Eastern", ["BOS", "BKN", "NYK", "PHI", "TOR"]),
"Central": ("Eastern", ["CHI", "CLE", "DET", "IND", "MIL"]),
"Southeast": ("Eastern", ["ATL", "CHA", "MIA", "ORL", "WAS"]),
"Northwest": ("Western", ["DEN", "MIN", "OKC", "POR", "UTA"]),
"Pacific": ("Western", ["GSW", "LAC", "LAL", "PHX", "SAC"]),
"Southwest": ("Western", ["DAL", "HOU", "MEM", "NOP", "SAS"]),
}
# Build reverse lookup
team_divisions: dict[str, tuple[str, str]] = {}
for div, (conf, abbrevs) in divisions.items():
for abbrev in abbrevs:
team_divisions[abbrev] = (conf, div)
for abbrev, (team_id, full_name, city) in TEAM_MAPPINGS.get("nba", {}).items():
if team_id in seen:
continue
seen.add(team_id)
# Derive the nickname by stripping the city prefix so that multi-word
# nicknames (e.g. "Trail Blazers") are kept whole; otherwise fall back
# to the last word of the full name
parts = full_name.split()
if full_name.startswith(f"{city} "):
team_name = full_name[len(city):].strip()
elif len(parts) >= 2:
team_name = parts[-1]
else:
team_name = full_name
# Get conference and division
conf, div = team_divisions.get(abbrev, (None, None))
# Get stadium ID
stadium_id = None
for sid, sinfo in STADIUM_MAPPINGS.get("nba", {}).items():
if city.lower() in sinfo.city.lower() or sinfo.city.lower() in city.lower():
stadium_id = sid
break
team = Team(
id=team_id,
sport="nba",
city=city,
name=team_name,
full_name=full_name,
abbreviation=abbrev,
conference=conf,
division=div,
stadium_id=stadium_id,
)
teams.append(team)
return teams
def scrape_stadiums(self) -> list[Stadium]:
"""Get all NBA stadiums from hardcoded mappings."""
stadiums: list[Stadium] = []
for stadium_id, info in STADIUM_MAPPINGS.get("nba", {}).items():
stadium = Stadium(
id=stadium_id,
sport="nba",
name=info.name,
city=info.city,
state=info.state,
country=info.country,
latitude=info.latitude,
longitude=info.longitude,
surface="hardwood",
roof_type="dome",
)
stadiums.append(stadium)
return stadiums
def create_nba_scraper(season: int) -> NBAScraper:
"""Factory function to create an NBA scraper."""
return NBAScraper(season=season)
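The doubleheader handling in `_normalize_games` groups raw games by date and matchup and numbers them only when a group has more than one entry. The sketch below isolates that idea. `RawGame` and `assign_game_numbers` are hypothetical stand-ins (not the package's `RawGameData` or any of its functions), and sorting each group by start time is an added assumption; the original numbers games in the order they were scraped.

```python
from collections import defaultdict
from datetime import datetime
from typing import NamedTuple, Optional


class RawGame(NamedTuple):
    """Hypothetical, trimmed-down stand-in for RawGameData."""
    game_date: datetime
    away: str
    home: str


def assign_game_numbers(
    raw_games: list[RawGame],
) -> list[tuple[RawGame, Optional[int]]]:
    """Pair each game with a game number, or None when it is not a doubleheader."""
    groups: dict[tuple[str, str, str], list[RawGame]] = defaultdict(list)
    for raw in raw_games:
        # Same calendar date and same away/home pairing means same matchup.
        key = (raw.game_date.strftime("%Y%m%d"), raw.away, raw.home)
        groups[key].append(raw)

    numbered: list[tuple[RawGame, Optional[int]]] = []
    for matchup in groups.values():
        is_doubleheader = len(matchup) > 1
        # Assumption: order by start time so game 1 is the earlier start.
        ordered = sorted(matchup, key=lambda g: g.game_date)
        for i, raw in enumerate(ordered, start=1):
            numbered.append((raw, i if is_doubleheader else None))
    return numbered


if __name__ == "__main__":
    games = [
        RawGame(datetime(2025, 10, 22, 19, 0), "AAA", "BBB"),
        RawGame(datetime(2025, 10, 23, 18, 0), "CCC", "DDD"),
        RawGame(datetime(2025, 10, 23, 13, 0), "CCC", "DDD"),
    ]
    for game, number in assign_game_numbers(games):
        print(game.away, "@", game.home, game.game_date, number)
```

Leaving the number as `None` for ordinary games keeps their canonical IDs free of a suffix, so only true same-day rematches need a distinguishing number.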
+586
@@ -0,0 +1,586 @@
"""NFL scraper implementation with multi-source fallback."""
from datetime import datetime, date
from typing import Optional
from bs4 import BeautifulSoup
from .base import BaseScraper, RawGameData, ScrapeResult
from ..models.game import Game
from ..models.team import Team
from ..models.stadium import Stadium
from ..models.aliases import ManualReviewItem
from ..normalizers.canonical_id import generate_game_id
from ..normalizers.team_resolver import (
TeamResolver,
TEAM_MAPPINGS,
get_team_resolver,
)
from ..normalizers.stadium_resolver import (
StadiumResolver,
STADIUM_MAPPINGS,
get_stadium_resolver,
)
from ..utils.logging import get_logger, log_game, log_warning
# International game locations to filter out
# (Dublin, Berlin, and Madrid hosted games in the 2025 season)
INTERNATIONAL_LOCATIONS = {"London", "Mexico City", "Frankfurt", "Munich", "São Paulo", "Dublin", "Berlin", "Madrid"}
class NFLScraper(BaseScraper):
"""NFL schedule scraper with multi-source fallback.
Sources (in priority order):
1. ESPN API - Most reliable for NFL
2. Pro-Football-Reference - Complete historical data
3. CBS Sports - Backup option
"""
def __init__(self, season: int, **kwargs):
"""Initialize NFL scraper.
Args:
season: Season year (e.g., 2025 for 2025 season)
"""
super().__init__("nfl", season, **kwargs)
self._team_resolver = get_team_resolver("nfl")
self._stadium_resolver = get_stadium_resolver("nfl")
def _get_sources(self) -> list[str]:
"""Return source list in priority order."""
return ["espn", "pro_football_reference", "cbs"]
def _get_source_url(self, source: str, **kwargs) -> str:
"""Build URL for a source."""
if source == "espn":
week = kwargs.get("week", 1)
season_type = kwargs.get("season_type", 2) # 1=preseason, 2=regular, 3=postseason
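# Pass the season year explicitly; without it the endpoint is not tied to self.season
return f"https://site.api.espn.com/apis/site/v2/sports/football/nfl/scoreboard?dates={self.season}&seasontype={season_type}&week={week}"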
elif source == "pro_football_reference":
return f"https://www.pro-football-reference.com/years/{self.season}/games.htm"
elif source == "cbs":
return "https://www.cbssports.com/nfl/schedule/"
raise ValueError(f"Unknown source: {source}")
def _get_season_months(self) -> list[tuple[int, int]]:
"""Get the months to scrape for NFL season.
NFL season runs September through February.
"""
months = []
# Regular season months
for month in range(9, 13): # Sept-Dec
months.append((self.season, month))
# Playoff months
for month in range(1, 3): # Jan-Feb
months.append((self.season + 1, month))
return months
def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
"""Scrape games from a specific source."""
if source == "espn":
return self._scrape_espn()
elif source == "pro_football_reference":
return self._scrape_pro_football_reference()
elif source == "cbs":
return self._scrape_cbs()
else:
raise ValueError(f"Unknown source: {source}")
def _scrape_espn(self) -> list[RawGameData]:
"""Scrape games from ESPN API.
ESPN NFL API uses week numbers.
"""
all_games: list[RawGameData] = []
# Scrape preseason (4 weeks)
for week in range(1, 5):
try:
url = self._get_source_url("espn", week=week, season_type=1)
data = self.session.get_json(url)
games = self._parse_espn_response(data, url)
all_games.extend(games)
except Exception as e:
self._logger.debug(f"ESPN preseason week {week} error: {e}")
continue
# Scrape regular season (18 weeks)
for week in range(1, 19):
try:
url = self._get_source_url("espn", week=week, season_type=2)
data = self.session.get_json(url)
games = self._parse_espn_response(data, url)
all_games.extend(games)
self._logger.debug(f"Found {len(games)} games in week {week}")
except Exception as e:
self._logger.debug(f"ESPN regular season week {week} error: {e}")
continue
# Scrape postseason (4 rounds; ESPN numbers the Pro Bowl as week 4 and the
# Super Bowl as week 5, so request weeks 1-5)
for week in range(1, 6):
try:
url = self._get_source_url("espn", week=week, season_type=3)
data = self.session.get_json(url)
games = self._parse_espn_response(data, url)
all_games.extend(games)
except Exception as e:
self._logger.debug(f"ESPN postseason week {week} error: {e}")
continue
return all_games
def _parse_espn_response(
self,
data: dict,
source_url: str,
) -> list[RawGameData]:
"""Parse ESPN API response."""
games: list[RawGameData] = []
events = data.get("events", [])
for event in events:
try:
game = self._parse_espn_event(event, source_url)
if game:
# Filter international games
if game.stadium_raw and any(loc in game.stadium_raw for loc in INTERNATIONAL_LOCATIONS):
self._logger.debug(f"Skipping international game: {game.stadium_raw}")
continue
games.append(game)
except Exception as e:
self._logger.debug(f"Failed to parse ESPN event: {e}")
continue
return games
def _parse_espn_event(
self,
event: dict,
source_url: str,
) -> Optional[RawGameData]:
"""Parse a single ESPN event."""
# Get date
date_str = event.get("date", "")
if not date_str:
return None
try:
game_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
except ValueError:
return None
# Get competitions
competitions = event.get("competitions", [])
if not competitions:
return None
competition = competitions[0]
# Check for neutral site (international games)
if competition.get("neutralSite"):
venue = competition.get("venue", {})
venue_city = venue.get("address", {}).get("city", "")
if venue_city in INTERNATIONAL_LOCATIONS:
return None
# Get teams
competitors = competition.get("competitors", [])
if len(competitors) != 2:
return None
home_team = None
away_team = None
home_score = None
away_score = None
for competitor in competitors:
team_info = competitor.get("team", {})
team_name = team_info.get("displayName", "")
is_home = competitor.get("homeAway") == "home"
score = competitor.get("score")
if score is not None: # "" and other non-numeric values become None below
try:
score = int(score)
except (ValueError, TypeError):
score = None
if is_home:
home_team = team_name
home_score = score
else:
away_team = team_name
away_score = score
if not home_team or not away_team:
return None
# Get venue
venue = competition.get("venue", {})
stadium = venue.get("fullName")
# Get status
status_info = competition.get("status", {})
status_type = status_info.get("type", {})
status_name = status_type.get("name", "").lower()
if status_name == "status_final":
status = "final"
elif status_name == "status_postponed":
status = "postponed"
elif status_name == "status_canceled":
status = "cancelled"
else:
status = "scheduled"
return RawGameData(
game_date=game_date,
home_team_raw=home_team,
away_team_raw=away_team,
stadium_raw=stadium,
home_score=home_score,
away_score=away_score,
status=status,
source_url=source_url,
)
def _scrape_pro_football_reference(self) -> list[RawGameData]:
"""Scrape games from Pro-Football-Reference.
PFR has a single schedule page per season.
"""
url = self._get_source_url("pro_football_reference")
try:
html = self.session.get_html(url)
games = self._parse_pfr(html, url)
return games
except Exception as e:
self._logger.error(f"Failed to scrape Pro-Football-Reference: {e}")
raise
def _parse_pfr(
self,
html: str,
source_url: str,
) -> list[RawGameData]:
"""Parse Pro-Football-Reference schedule HTML."""
soup = BeautifulSoup(html, "lxml")
games: list[RawGameData] = []
# Find the schedule table
table = soup.find("table", id="games")
if not table:
return games
tbody = table.find("tbody")
if not tbody:
return games
for row in tbody.find_all("tr"):
# Skip header rows
if "thead" in (row.get("class") or []):
continue
try:
game = self._parse_pfr_row(row, source_url)
if game:
games.append(game)
except Exception as e:
self._logger.debug(f"Failed to parse PFR row: {e}")
continue
return games
def _parse_pfr_row(
self,
row,
source_url: str,
) -> Optional[RawGameData]:
"""Parse a single Pro-Football-Reference table row."""
# Get date
date_cell = row.find("td", {"data-stat": "game_date"})
if not date_cell:
return None
date_text = date_cell.get_text(strip=True)
if not date_text:
return None
# Parse date
try:
# PFR uses YYYY-MM-DD format
game_date = datetime.strptime(date_text, "%Y-%m-%d")
except ValueError:
return None
# Get teams
winner_cell = row.find("td", {"data-stat": "winner"})
loser_cell = row.find("td", {"data-stat": "loser"})
if not winner_cell or not loser_cell:
return None
winner = winner_cell.get_text(strip=True)
loser = loser_cell.get_text(strip=True)
if not winner or not loser:
return None
# An "@" in game_location means the winner was the away team
game_location = row.find("td", {"data-stat": "game_location"})
at_home = game_location and "@" in game_location.get_text()
if at_home:
home_team = loser
away_team = winner
else:
home_team = winner
away_team = loser
# Get scores
pts_win_cell = row.find("td", {"data-stat": "pts_win"})
pts_lose_cell = row.find("td", {"data-stat": "pts_lose"})
home_score = None
away_score = None
if pts_win_cell and pts_lose_cell:
try:
winner_pts = int(pts_win_cell.get_text(strip=True))
loser_pts = int(pts_lose_cell.get_text(strip=True))
if at_home:
home_score = loser_pts
away_score = winner_pts
else:
home_score = winner_pts
away_score = loser_pts
except ValueError:
pass
# Determine status
status = "final" if home_score is not None else "scheduled"
return RawGameData(
game_date=game_date,
home_team_raw=home_team,
away_team_raw=away_team,
stadium_raw=None,  # Stadium is not parsed from the PFR schedule table
home_score=home_score,
away_score=away_score,
status=status,
source_url=source_url,
)
def _scrape_cbs(self) -> list[RawGameData]:
"""Scrape games from CBS Sports."""
raise NotImplementedError("CBS scraper not implemented")
def _normalize_games(
self,
raw_games: list[RawGameData],
) -> tuple[list[Game], list[ManualReviewItem]]:
"""Normalize raw games to Game objects with canonical IDs."""
games: list[Game] = []
review_items: list[ManualReviewItem] = []
for raw in raw_games:
game, item_reviews = self._normalize_single_game(raw)
if game:
games.append(game)
log_game(
self.sport,
game.id,
game.home_team_id,
game.away_team_id,
game.game_date.strftime("%Y-%m-%d"),
game.status,
)
review_items.extend(item_reviews)
return games, review_items
def _normalize_single_game(
self,
raw: RawGameData,
) -> tuple[Optional[Game], list[ManualReviewItem]]:
"""Normalize a single raw game."""
review_items: list[ManualReviewItem] = []
# Resolve home team
home_result = self._team_resolver.resolve(
raw.home_team_raw,
check_date=raw.game_date.date(),
source_url=raw.source_url,
)
if home_result.review_item:
review_items.append(home_result.review_item)
if not home_result.canonical_id:
log_warning(f"Could not resolve home team: {raw.home_team_raw}")
return None, review_items
# Resolve away team
away_result = self._team_resolver.resolve(
raw.away_team_raw,
check_date=raw.game_date.date(),
source_url=raw.source_url,
)
if away_result.review_item:
review_items.append(away_result.review_item)
if not away_result.canonical_id:
log_warning(f"Could not resolve away team: {raw.away_team_raw}")
return None, review_items
# Resolve stadium
stadium_id = None
if raw.stadium_raw:
stadium_result = self._stadium_resolver.resolve(
raw.stadium_raw,
check_date=raw.game_date.date(),
source_url=raw.source_url,
)
if stadium_result.review_item:
review_items.append(stadium_result.review_item)
stadium_id = stadium_result.canonical_id
# Get abbreviations for game ID
home_abbrev = self._get_abbreviation(home_result.canonical_id)
away_abbrev = self._get_abbreviation(away_result.canonical_id)
# Generate canonical game ID
game_id = generate_game_id(
sport=self.sport,
season=self.season,
away_abbrev=away_abbrev,
home_abbrev=home_abbrev,
game_date=raw.game_date,
game_number=None, # NFL doesn't have doubleheaders
)
game = Game(
id=game_id,
sport=self.sport,
season=self.season,
home_team_id=home_result.canonical_id,
away_team_id=away_result.canonical_id,
stadium_id=stadium_id or "",
game_date=raw.game_date,
game_number=None,
home_score=raw.home_score,
away_score=raw.away_score,
status=raw.status,
source_url=raw.source_url,
raw_home_team=raw.home_team_raw,
raw_away_team=raw.away_team_raw,
raw_stadium=raw.stadium_raw,
)
return game, review_items
def _get_abbreviation(self, team_id: str) -> str:
"""Extract abbreviation from team ID."""
parts = team_id.split("_")
return parts[-1] if parts else ""
def scrape_teams(self) -> list[Team]:
"""Get all NFL teams from hardcoded mappings."""
teams: list[Team] = []
seen: set[str] = set()
# NFL conference/division structure
divisions = {
"AFC East": ("AFC", ["BUF", "MIA", "NE", "NYJ"]),
"AFC North": ("AFC", ["BAL", "CIN", "CLE", "PIT"]),
"AFC South": ("AFC", ["HOU", "IND", "JAX", "TEN"]),
"AFC West": ("AFC", ["DEN", "KC", "LV", "LAC"]),
"NFC East": ("NFC", ["DAL", "NYG", "PHI", "WAS"]),
"NFC North": ("NFC", ["CHI", "DET", "GB", "MIN"]),
"NFC South": ("NFC", ["ATL", "CAR", "NO", "TB"]),
"NFC West": ("NFC", ["ARI", "LAR", "SF", "SEA"]),
}
# Build reverse lookup
team_divisions: dict[str, tuple[str, str]] = {}
for div, (conf, abbrevs) in divisions.items():
for abbrev in abbrevs:
team_divisions[abbrev] = (conf, div)
for abbrev, (team_id, full_name, city) in TEAM_MAPPINGS.get("nfl", {}).items():
if team_id in seen:
continue
seen.add(team_id)
# Parse team name
parts = full_name.split()
team_name = parts[-1] if parts else full_name
# Get conference and division
conf, div = team_divisions.get(abbrev, (None, None))
# Get stadium ID
stadium_id = None
nfl_stadiums = STADIUM_MAPPINGS.get("nfl", {})
for sid, sinfo in nfl_stadiums.items():
if city.lower() in sinfo.city.lower() or sinfo.city.lower() in city.lower():
stadium_id = sid
break
team = Team(
id=team_id,
sport="nfl",
city=city,
name=team_name,
full_name=full_name,
abbreviation=abbrev,
conference=conf,
division=div,
stadium_id=stadium_id,
)
teams.append(team)
return teams
def scrape_stadiums(self) -> list[Stadium]:
"""Get all NFL stadiums from hardcoded mappings."""
stadiums: list[Stadium] = []
nfl_stadiums = STADIUM_MAPPINGS.get("nfl", {})
for stadium_id, info in nfl_stadiums.items():
stadium = Stadium(
id=stadium_id,
sport="nfl",
name=info.name,
city=info.city,
state=info.state,
country=info.country,
latitude=info.latitude,
longitude=info.longitude,
surface="turf",  # Placeholder default; not every NFL stadium uses turf
roof_type="open",  # Placeholder default; some NFL stadiums are domed or retractable
)
stadiums.append(stadium)
return stadiums
def create_nfl_scraper(season: int) -> NFLScraper:
"""Factory function to create an NFL scraper."""
return NFLScraper(season=season)
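The home/away logic in `_parse_pfr_row` can be hard to follow because the table is keyed by winner and loser rather than by home and away. The standalone sketch below illustrates the same mapping on plain values. The function name `assign_home_away` and its parameters are hypothetical and are not part of the package; it only mirrors the rule used above (an "@" marker means the winner was the visiting team).

```python
from typing import Optional


def assign_home_away(
    winner: str,
    loser: str,
    location_marker: str,
    winner_pts: Optional[int] = None,
    loser_pts: Optional[int] = None,
) -> dict:
    """Map winner/loser columns onto home/away fields (illustrative helper)."""
    # "@" in the location column: the winner played on the road.
    winner_was_away = "@" in location_marker
    if winner_was_away:
        home, away = loser, winner
        home_pts, away_pts = loser_pts, winner_pts
    else:
        home, away = winner, loser
        home_pts, away_pts = winner_pts, loser_pts
    return {
        "home_team": home,
        "away_team": away,
        "home_score": home_pts,
        "away_score": away_pts,
        # Same rule as the scraper: a parsed score implies a finished game.
        "status": "final" if home_pts is not None else "scheduled",
    }
```

Keeping the swap for team names and scores in one branch avoids the two checks drifting apart, which is a risk when the same flag is tested in separate places.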
+655
@@ -0,0 +1,655 @@
"""NHL scraper implementation with multi-source fallback."""
from datetime import datetime, date
from typing import Optional
from bs4 import BeautifulSoup
from .base import BaseScraper, RawGameData, ScrapeResult
from ..models.game import Game
from ..models.team import Team
from ..models.stadium import Stadium
from ..models.aliases import ManualReviewItem
from ..normalizers.canonical_id import generate_game_id
from ..normalizers.team_resolver import (
TeamResolver,
TEAM_MAPPINGS,
get_team_resolver,
)
from ..normalizers.stadium_resolver import (
StadiumResolver,
STADIUM_MAPPINGS,
get_stadium_resolver,
)
from ..utils.logging import get_logger, log_game, log_warning
# International game locations to filter out
INTERNATIONAL_LOCATIONS = {"Prague", "Stockholm", "Helsinki", "Tampere", "Gothenburg"}
# Hockey Reference month URLs
HR_MONTHS = [
"october", "november", "december",
"january", "february", "march", "april", "may", "june",
]
class NHLScraper(BaseScraper):
"""NHL schedule scraper with multi-source fallback.
Sources (in priority order):
1. Hockey-Reference - Most reliable for NHL
2. NHL API - Official NHL data
3. ESPN API - Backup option
"""
def __init__(self, season: int, **kwargs):
"""Initialize NHL scraper.
Args:
season: Season start year (e.g., 2025 for 2025-26)
"""
super().__init__("nhl", season, **kwargs)
self._team_resolver = get_team_resolver("nhl")
self._stadium_resolver = get_stadium_resolver("nhl")
def _get_sources(self) -> list[str]:
"""Return source list in priority order."""
return ["hockey_reference", "nhl_api", "espn"]
    def _get_source_url(self, source: str, **kwargs) -> str:
        """Build URL for a source."""
        if source == "hockey_reference":
            # The season page is keyed by the season's ending year.
            year = kwargs.get("year", self.season + 1)
            return f"https://www.hockey-reference.com/leagues/NHL_{year}_games.html"
        elif source == "nhl_api":
            # Only the start date is part of the URL path.
            start_date = kwargs.get("start_date", "")
            return f"https://api-web.nhle.com/v1/schedule/{start_date}"
        elif source == "espn":
            date_str = kwargs.get("date", "")
            return f"https://site.api.espn.com/apis/site/v2/sports/hockey/nhl/scoreboard?dates={date_str}"
        raise ValueError(f"Unknown source: {source}")
def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
"""Scrape games from a specific source."""
if source == "hockey_reference":
return self._scrape_hockey_reference()
elif source == "nhl_api":
return self._scrape_nhl_api()
elif source == "espn":
return self._scrape_espn()
else:
raise ValueError(f"Unknown source: {source}")
def _scrape_hockey_reference(self) -> list[RawGameData]:
"""Scrape games from Hockey-Reference.
HR has a single schedule page per season.
"""
end_year = self.season + 1
url = self._get_source_url("hockey_reference", year=end_year)
try:
html = self.session.get_html(url)
games = self._parse_hockey_reference(html, url)
return games
except Exception as e:
self._logger.error(f"Failed to scrape Hockey-Reference: {e}")
raise
def _parse_hockey_reference(
self,
html: str,
source_url: str,
) -> list[RawGameData]:
"""Parse Hockey-Reference schedule HTML."""
soup = BeautifulSoup(html, "lxml")
games: list[RawGameData] = []
# Find the schedule table
table = soup.find("table", id="games")
if not table:
return games
tbody = table.find("tbody")
if not tbody:
return games
for row in tbody.find_all("tr"):
# Skip header rows
if row.get("class") and "thead" in row.get("class", []):
continue
try:
game = self._parse_hr_row(row, source_url)
if game:
# Filter international games
if game.stadium_raw and any(loc in game.stadium_raw for loc in INTERNATIONAL_LOCATIONS):
continue
games.append(game)
except Exception as e:
self._logger.debug(f"Failed to parse HR row: {e}")
continue
return games
def _parse_hr_row(
self,
row,
source_url: str,
) -> Optional[RawGameData]:
"""Parse a single Hockey-Reference table row."""
# Get date
date_cell = row.find("th", {"data-stat": "date_game"})
if not date_cell:
return None
date_text = date_cell.get_text(strip=True)
if not date_text:
return None
# Parse date (format: "2025-10-15")
try:
game_date = datetime.strptime(date_text, "%Y-%m-%d")
except ValueError:
return None
# Get teams
visitor_cell = row.find("td", {"data-stat": "visitor_team_name"})
home_cell = row.find("td", {"data-stat": "home_team_name"})
if not visitor_cell or not home_cell:
return None
away_team = visitor_cell.get_text(strip=True)
home_team = home_cell.get_text(strip=True)
if not away_team or not home_team:
return None
# Get scores
visitor_goals_cell = row.find("td", {"data-stat": "visitor_goals"})
home_goals_cell = row.find("td", {"data-stat": "home_goals"})
away_score = None
home_score = None
if visitor_goals_cell and visitor_goals_cell.get_text(strip=True):
try:
away_score = int(visitor_goals_cell.get_text(strip=True))
except ValueError:
pass
if home_goals_cell and home_goals_cell.get_text(strip=True):
try:
home_score = int(home_goals_cell.get_text(strip=True))
except ValueError:
pass
# Determine status
status = "final" if home_score is not None else "scheduled"
# Check for OT/SO
overtimes_cell = row.find("td", {"data-stat": "overtimes"})
if overtimes_cell:
ot_text = overtimes_cell.get_text(strip=True)
if ot_text:
status = "final" # OT games are still final
return RawGameData(
game_date=game_date,
home_team_raw=home_team,
away_team_raw=away_team,
stadium_raw=None, # HR doesn't have stadium
home_score=home_score,
away_score=away_score,
status=status,
source_url=source_url,
)
def _scrape_nhl_api(self) -> list[RawGameData]:
"""Scrape games from NHL API."""
all_games: list[RawGameData] = []
for year, month in self._get_season_months():
start_date = date(year, month, 1)
url = self._get_source_url("nhl_api", start_date=start_date.strftime("%Y-%m-%d"))
try:
data = self.session.get_json(url)
games = self._parse_nhl_api_response(data, url)
all_games.extend(games)
except Exception as e:
self._logger.debug(f"NHL API error for {year}-{month}: {e}")
continue
return all_games
    def _parse_nhl_api_response(
        self,
        data: dict,
        source_url: str,
    ) -> list[RawGameData]:
        """Parse NHL API response."""
        games: list[RawGameData] = []
        # Each "gameWeek" entry carries its own "games" list.
        for entry in data.get("gameWeek", []):
            for game_data in entry.get("games", []):
                try:
                    game = self._parse_nhl_api_game(game_data, source_url)
                    if game:
                        games.append(game)
                except Exception as e:
                    self._logger.debug(f"Failed to parse NHL API game: {e}")
                    continue
        return games
def _parse_nhl_api_game(
self,
game: dict,
source_url: str,
) -> Optional[RawGameData]:
"""Parse a single NHL API game."""
# Get date
start_time = game.get("startTimeUTC", "")
if not start_time:
return None
try:
game_date = datetime.fromisoformat(start_time.replace("Z", "+00:00"))
except ValueError:
return None
# Get teams
away_team_data = game.get("awayTeam", {})
home_team_data = game.get("homeTeam", {})
away_team = away_team_data.get("placeName", {}).get("default", "")
home_team = home_team_data.get("placeName", {}).get("default", "")
if not away_team or not home_team:
# Try full name
away_team = away_team_data.get("name", {}).get("default", "")
home_team = home_team_data.get("name", {}).get("default", "")
if not away_team or not home_team:
return None
# Get scores
away_score = away_team_data.get("score")
home_score = home_team_data.get("score")
# Get venue
venue = game.get("venue", {})
stadium = venue.get("default")
# Get status
game_state = game.get("gameState", "").lower()
if game_state in ["final", "off"]:
status = "final"
elif game_state == "postponed":
status = "postponed"
elif game_state in ["cancelled", "canceled"]:
status = "cancelled"
else:
status = "scheduled"
return RawGameData(
game_date=game_date,
home_team_raw=home_team,
away_team_raw=away_team,
stadium_raw=stadium,
home_score=home_score,
away_score=away_score,
status=status,
source_url=source_url,
)
def _scrape_espn(self) -> list[RawGameData]:
"""Scrape games from ESPN API."""
all_games: list[RawGameData] = []
for year, month in self._get_season_months():
# Get number of days in month
if month == 12:
next_month = date(year + 1, 1, 1)
else:
next_month = date(year, month + 1, 1)
days_in_month = (next_month - date(year, month, 1)).days
for day in range(1, days_in_month + 1):
try:
game_date = date(year, month, day)
date_str = game_date.strftime("%Y%m%d")
url = self._get_source_url("espn", date=date_str)
data = self.session.get_json(url)
games = self._parse_espn_response(data, url)
all_games.extend(games)
except Exception as e:
self._logger.debug(f"ESPN error for {year}-{month}-{day}: {e}")
continue
return all_games
def _parse_espn_response(
self,
data: dict,
source_url: str,
) -> list[RawGameData]:
"""Parse ESPN API response."""
games: list[RawGameData] = []
events = data.get("events", [])
for event in events:
try:
game = self._parse_espn_event(event, source_url)
if game:
games.append(game)
except Exception as e:
self._logger.debug(f"Failed to parse ESPN event: {e}")
continue
return games
def _parse_espn_event(
self,
event: dict,
source_url: str,
) -> Optional[RawGameData]:
"""Parse a single ESPN event."""
# Get date
date_str = event.get("date", "")
if not date_str:
return None
try:
game_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
except ValueError:
return None
# Get competitions
competitions = event.get("competitions", [])
if not competitions:
return None
competition = competitions[0]
# Check for neutral site (international games like Global Series)
if competition.get("neutralSite"):
venue = competition.get("venue", {})
venue_city = venue.get("address", {}).get("city", "")
if venue_city in INTERNATIONAL_LOCATIONS:
return None
# Get teams
competitors = competition.get("competitors", [])
if len(competitors) != 2:
return None
home_team = None
away_team = None
home_score = None
away_score = None
for competitor in competitors:
team_info = competitor.get("team", {})
team_name = team_info.get("displayName", "")
is_home = competitor.get("homeAway") == "home"
score = competitor.get("score")
# Convert to int; empty or malformed values become None
if score is not None:
try:
score = int(score)
except (ValueError, TypeError):
score = None
if is_home:
home_team = team_name
home_score = score
else:
away_team = team_name
away_score = score
if not home_team or not away_team:
return None
# Get venue
venue = competition.get("venue", {})
stadium = venue.get("fullName")
# Get status
status_info = competition.get("status", {})
status_type = status_info.get("type", {})
status_name = status_type.get("name", "").lower()
if status_name == "status_final":
status = "final"
elif status_name == "status_postponed":
status = "postponed"
elif status_name == "status_canceled":
status = "cancelled"
else:
status = "scheduled"
return RawGameData(
game_date=game_date,
home_team_raw=home_team,
away_team_raw=away_team,
stadium_raw=stadium,
home_score=home_score,
away_score=away_score,
status=status,
source_url=source_url,
)
def _normalize_games(
self,
raw_games: list[RawGameData],
) -> tuple[list[Game], list[ManualReviewItem]]:
"""Normalize raw games to Game objects with canonical IDs."""
games: list[Game] = []
review_items: list[ManualReviewItem] = []
for raw in raw_games:
game, item_reviews = self._normalize_single_game(raw)
if game:
games.append(game)
log_game(
self.sport,
game.id,
game.home_team_id,
game.away_team_id,
game.game_date.strftime("%Y-%m-%d"),
game.status,
)
review_items.extend(item_reviews)
return games, review_items
def _normalize_single_game(
self,
raw: RawGameData,
) -> tuple[Optional[Game], list[ManualReviewItem]]:
"""Normalize a single raw game."""
review_items: list[ManualReviewItem] = []
# Resolve home team
home_result = self._team_resolver.resolve(
raw.home_team_raw,
check_date=raw.game_date.date(),
source_url=raw.source_url,
)
if home_result.review_item:
review_items.append(home_result.review_item)
if not home_result.canonical_id:
log_warning(f"Could not resolve home team: {raw.home_team_raw}")
return None, review_items
# Resolve away team
away_result = self._team_resolver.resolve(
raw.away_team_raw,
check_date=raw.game_date.date(),
source_url=raw.source_url,
)
if away_result.review_item:
review_items.append(away_result.review_item)
if not away_result.canonical_id:
log_warning(f"Could not resolve away team: {raw.away_team_raw}")
return None, review_items
# Resolve stadium
stadium_id = None
if raw.stadium_raw:
stadium_result = self._stadium_resolver.resolve(
raw.stadium_raw,
check_date=raw.game_date.date(),
source_url=raw.source_url,
)
if stadium_result.review_item:
review_items.append(stadium_result.review_item)
stadium_id = stadium_result.canonical_id
# Get abbreviations for game ID
home_abbrev = self._get_abbreviation(home_result.canonical_id)
away_abbrev = self._get_abbreviation(away_result.canonical_id)
# Generate canonical game ID
game_id = generate_game_id(
sport=self.sport,
season=self.season,
away_abbrev=away_abbrev,
home_abbrev=home_abbrev,
game_date=raw.game_date,
game_number=None, # NHL doesn't have doubleheaders
)
game = Game(
id=game_id,
sport=self.sport,
season=self.season,
home_team_id=home_result.canonical_id,
away_team_id=away_result.canonical_id,
stadium_id=stadium_id or "",
game_date=raw.game_date,
game_number=None,
home_score=raw.home_score,
away_score=raw.away_score,
status=raw.status,
source_url=raw.source_url,
raw_home_team=raw.home_team_raw,
raw_away_team=raw.away_team_raw,
raw_stadium=raw.stadium_raw,
)
return game, review_items
def _get_abbreviation(self, team_id: str) -> str:
"""Extract abbreviation from team ID."""
parts = team_id.split("_")
return parts[-1] if parts else ""
def scrape_teams(self) -> list[Team]:
"""Get all NHL teams from hardcoded mappings."""
teams: list[Team] = []
seen: set[str] = set()
# NHL conference/division structure
divisions = {
"Atlantic": ("Eastern", ["BOS", "BUF", "DET", "FLA", "MTL", "OTT", "TB", "TOR"]),
"Metropolitan": ("Eastern", ["CAR", "CBJ", "NJ", "NYI", "NYR", "PHI", "PIT", "WAS"]),
"Central": ("Western", ["ARI", "CHI", "COL", "DAL", "MIN", "NSH", "STL", "WPG"]),
"Pacific": ("Western", ["ANA", "CGY", "EDM", "LA", "SJ", "SEA", "VAN", "VGK"]),
}
# Build reverse lookup
team_divisions: dict[str, tuple[str, str]] = {}
for div, (conf, abbrevs) in divisions.items():
for abbrev in abbrevs:
team_divisions[abbrev] = (conf, div)
for abbrev, (team_id, full_name, city) in TEAM_MAPPINGS.get("nhl", {}).items():
if team_id in seen:
continue
seen.add(team_id)
# Parse team name
parts = full_name.split()
team_name = parts[-1] if parts else full_name
# Handle multi-word names
if team_name in ["Wings", "Jackets", "Knights", "Leafs"]:
team_name = " ".join(parts[-2:])
# Get conference and division
conf, div = team_divisions.get(abbrev, (None, None))
# Get stadium ID
stadium_id = None
nhl_stadiums = STADIUM_MAPPINGS.get("nhl", {})
for sid, sinfo in nhl_stadiums.items():
if city.lower() in sinfo.city.lower() or sinfo.city.lower() in city.lower():
stadium_id = sid
break
team = Team(
id=team_id,
sport="nhl",
city=city,
name=team_name,
full_name=full_name,
abbreviation=abbrev,
conference=conf,
division=div,
stadium_id=stadium_id,
)
teams.append(team)
return teams
def scrape_stadiums(self) -> list[Stadium]:
"""Get all NHL stadiums from hardcoded mappings."""
stadiums: list[Stadium] = []
nhl_stadiums = STADIUM_MAPPINGS.get("nhl", {})
for stadium_id, info in nhl_stadiums.items():
stadium = Stadium(
id=stadium_id,
sport="nhl",
name=info.name,
city=info.city,
state=info.state,
country=info.country,
latitude=info.latitude,
longitude=info.longitude,
surface="ice",
roof_type="dome",
)
stadiums.append(stadium)
return stadiums
def create_nhl_scraper(season: int) -> NHLScraper:
"""Factory function to create an NHL scraper."""
return NHLScraper(season=season)
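`NHLScraper` calls `self._get_season_months()` but does not define it in this file, so it is presumably inherited from `BaseScraper`. As a rough sketch of what such a helper could look like for a season that spans two calendar years, the code below assumes October of the start year through June of the following year, matching the `HR_MONTHS` list. The names `nhl_season_months` and `iter_season_dates` are hypothetical, and `calendar.monthrange` is used in place of the manual next-month arithmetic in `_scrape_espn`.

```python
from calendar import monthrange
from datetime import date
from typing import Iterator


def nhl_season_months(season: int) -> list[tuple[int, int]]:
    """Return (year, month) pairs from October of `season` through June of the next year."""
    months = [(season, month) for month in range(10, 13)]      # Oct-Dec of the start year
    months += [(season + 1, month) for month in range(1, 7)]   # Jan-Jun of the end year
    return months


def iter_season_dates(season: int) -> Iterator[date]:
    """Yield every calendar date covered by the season months."""
    for year, month in nhl_season_months(season):
        # monthrange returns (weekday of the 1st, number of days in the month).
        _, days_in_month = monthrange(year, month)
        for day in range(1, days_in_month + 1):
            yield date(year, month, day)
```

A per-day generator like this keeps the date arithmetic in one place, so the ESPN loop would only need to format each date as `%Y%m%d` and request it.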
+385
@@ -0,0 +1,385 @@
"""NWSL scraper implementation with multi-source fallback."""
from datetime import datetime, date
from typing import Optional
from .base import BaseScraper, RawGameData, ScrapeResult
from ..models.game import Game
from ..models.team import Team
from ..models.stadium import Stadium
from ..models.aliases import ManualReviewItem
from ..normalizers.canonical_id import generate_game_id
from ..normalizers.team_resolver import (
TeamResolver,
TEAM_MAPPINGS,
get_team_resolver,
)
from ..normalizers.stadium_resolver import (
StadiumResolver,
STADIUM_MAPPINGS,
get_stadium_resolver,
)
from ..utils.logging import get_logger, log_game, log_warning
class NWSLScraper(BaseScraper):
"""NWSL schedule scraper.
Sources (in priority order):
1. ESPN API - The only source currently configured
"""
def __init__(self, season: int, **kwargs):
"""Initialize NWSL scraper.
Args:
season: Season year (e.g., 2026 for 2026 season)
"""
super().__init__("nwsl", season, **kwargs)
self._team_resolver = get_team_resolver("nwsl")
self._stadium_resolver = get_stadium_resolver("nwsl")
def _get_sources(self) -> list[str]:
"""Return source list in priority order."""
return ["espn"]
def _get_source_url(self, source: str, **kwargs) -> str:
"""Build URL for a source."""
if source == "espn":
date_str = kwargs.get("date", "")
return f"https://site.api.espn.com/apis/site/v2/sports/soccer/usa.nwsl/scoreboard?dates={date_str}"
raise ValueError(f"Unknown source: {source}")
def _get_season_months(self) -> list[tuple[int, int]]:
"""Get the months to scrape for NWSL season.
NWSL season runs March through November.
"""
months = []
# NWSL regular season + playoffs
for month in range(3, 12): # March-Nov
months.append((self.season, month))
return months
def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
"""Scrape games from a specific source."""
if source == "espn":
return self._scrape_espn()
else:
raise ValueError(f"Unknown source: {source}")
def _scrape_espn(self) -> list[RawGameData]:
"""Scrape games from ESPN API."""
all_games: list[RawGameData] = []
for year, month in self._get_season_months():
# Get number of days in month
if month == 12:
next_month = date(year + 1, 1, 1)
else:
next_month = date(year, month + 1, 1)
days_in_month = (next_month - date(year, month, 1)).days
for day in range(1, days_in_month + 1):
try:
game_date = date(year, month, day)
date_str = game_date.strftime("%Y%m%d")
url = self._get_source_url("espn", date=date_str)
data = self.session.get_json(url)
games = self._parse_espn_response(data, url)
all_games.extend(games)
except Exception as e:
self._logger.debug(f"ESPN error for {year}-{month}-{day}: {e}")
continue
return all_games
def _parse_espn_response(
self,
data: dict,
source_url: str,
) -> list[RawGameData]:
"""Parse ESPN API response."""
games: list[RawGameData] = []
events = data.get("events", [])
for event in events:
try:
game = self._parse_espn_event(event, source_url)
if game:
games.append(game)
except Exception as e:
self._logger.debug(f"Failed to parse ESPN event: {e}")
continue
return games
def _parse_espn_event(
self,
event: dict,
source_url: str,
) -> Optional[RawGameData]:
"""Parse a single ESPN event."""
# Get date
date_str = event.get("date", "")
if not date_str:
return None
try:
game_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
except ValueError:
return None
# Get competitions
competitions = event.get("competitions", [])
if not competitions:
return None
competition = competitions[0]
# Get teams
competitors = competition.get("competitors", [])
if len(competitors) != 2:
return None
home_team = None
away_team = None
home_score = None
away_score = None
for competitor in competitors:
team_info = competitor.get("team", {})
team_name = team_info.get("displayName", "")
is_home = competitor.get("homeAway") == "home"
score = competitor.get("score")
# Convert to int; empty or malformed values become None
if score is not None:
try:
score = int(score)
except (ValueError, TypeError):
score = None
if is_home:
home_team = team_name
home_score = score
else:
away_team = team_name
away_score = score
if not home_team or not away_team:
return None
# Get venue
venue = competition.get("venue", {})
stadium = venue.get("fullName")
# Get status
status_info = competition.get("status", {})
status_type = status_info.get("type", {})
status_name = status_type.get("name", "").lower()
if status_name == "status_final":
status = "final"
elif status_name == "status_postponed":
status = "postponed"
elif status_name == "status_canceled":
status = "cancelled"
else:
status = "scheduled"
return RawGameData(
game_date=game_date,
home_team_raw=home_team,
away_team_raw=away_team,
stadium_raw=stadium,
home_score=home_score,
away_score=away_score,
status=status,
source_url=source_url,
)
def _normalize_games(
self,
raw_games: list[RawGameData],
) -> tuple[list[Game], list[ManualReviewItem]]:
"""Normalize raw games to Game objects with canonical IDs."""
games: list[Game] = []
review_items: list[ManualReviewItem] = []
for raw in raw_games:
game, item_reviews = self._normalize_single_game(raw)
if game:
games.append(game)
log_game(
self.sport,
game.id,
game.home_team_id,
game.away_team_id,
game.game_date.strftime("%Y-%m-%d"),
game.status,
)
review_items.extend(item_reviews)
return games, review_items
def _normalize_single_game(
self,
raw: RawGameData,
) -> tuple[Optional[Game], list[ManualReviewItem]]:
"""Normalize a single raw game."""
review_items: list[ManualReviewItem] = []
# Resolve home team
home_result = self._team_resolver.resolve(
raw.home_team_raw,
check_date=raw.game_date.date(),
source_url=raw.source_url,
)
if home_result.review_item:
review_items.append(home_result.review_item)
if not home_result.canonical_id:
log_warning(f"Could not resolve home team: {raw.home_team_raw}")
return None, review_items
# Resolve away team
away_result = self._team_resolver.resolve(
raw.away_team_raw,
check_date=raw.game_date.date(),
source_url=raw.source_url,
)
if away_result.review_item:
review_items.append(away_result.review_item)
if not away_result.canonical_id:
log_warning(f"Could not resolve away team: {raw.away_team_raw}")
return None, review_items
# Resolve stadium
stadium_id = None
if raw.stadium_raw:
stadium_result = self._stadium_resolver.resolve(
raw.stadium_raw,
check_date=raw.game_date.date(),
source_url=raw.source_url,
)
if stadium_result.review_item:
review_items.append(stadium_result.review_item)
stadium_id = stadium_result.canonical_id
# Get abbreviations for game ID
home_abbrev = self._get_abbreviation(home_result.canonical_id)
away_abbrev = self._get_abbreviation(away_result.canonical_id)
# Generate canonical game ID
game_id = generate_game_id(
sport=self.sport,
season=self.season,
away_abbrev=away_abbrev,
home_abbrev=home_abbrev,
game_date=raw.game_date,
game_number=None,
)
game = Game(
id=game_id,
sport=self.sport,
season=self.season,
home_team_id=home_result.canonical_id,
away_team_id=away_result.canonical_id,
stadium_id=stadium_id or "",
game_date=raw.game_date,
game_number=None,
home_score=raw.home_score,
away_score=raw.away_score,
status=raw.status,
source_url=raw.source_url,
raw_home_team=raw.home_team_raw,
raw_away_team=raw.away_team_raw,
raw_stadium=raw.stadium_raw,
)
return game, review_items
def _get_abbreviation(self, team_id: str) -> str:
"""Extract abbreviation from team ID."""
parts = team_id.split("_")
return parts[-1] if parts else ""
def scrape_teams(self) -> list[Team]:
"""Get all NWSL teams from hardcoded mappings."""
teams: list[Team] = []
seen: set[str] = set()
for abbrev, (team_id, full_name, city) in TEAM_MAPPINGS.get("nwsl", {}).items():
if team_id in seen:
continue
seen.add(team_id)
# Parse team name
team_name = full_name
# Get stadium ID
stadium_id = None
nwsl_stadiums = STADIUM_MAPPINGS.get("nwsl", {})
for sid, sinfo in nwsl_stadiums.items():
if city.lower() in sinfo.city.lower() or sinfo.city.lower() in city.lower():
stadium_id = sid
break
team = Team(
id=team_id,
sport="nwsl",
city=city,
name=team_name,
full_name=full_name,
abbreviation=abbrev,
conference=None, # NWSL uses single table
division=None,
stadium_id=stadium_id,
)
teams.append(team)
return teams
def scrape_stadiums(self) -> list[Stadium]:
"""Get all NWSL stadiums from hardcoded mappings."""
stadiums: list[Stadium] = []
nwsl_stadiums = STADIUM_MAPPINGS.get("nwsl", {})
for stadium_id, info in nwsl_stadiums.items():
stadium = Stadium(
id=stadium_id,
sport="nwsl",
name=info.name,
city=info.city,
state=info.state,
country=info.country,
latitude=info.latitude,
longitude=info.longitude,
surface="grass",
roof_type="open",
)
stadiums.append(stadium)
return stadiums
def create_nwsl_scraper(season: int) -> NWSLScraper:
"""Factory function to create an NWSL scraper."""
return NWSLScraper(season=season)
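The ESPN event parsing is repeated almost verbatim in the NHL, NWSL, and WNBA scrapers. The sketch below shows one way the shared steps (date parsing, competitor split, score conversion, status mapping) could be expressed as a standalone function. It is an illustration under assumptions, not the package's code: `summarize_event`, `to_int_score`, `ESPN_STATUS_MAP`, and the `SAMPLE_EVENT` payload are hypothetical, and the payload contains only the keys the scrapers above read.

```python
from datetime import datetime
from typing import Any, Optional

# Status names as compared (lowercased) in the scrapers above.
ESPN_STATUS_MAP = {
    "status_final": "final",
    "status_postponed": "postponed",
    "status_canceled": "cancelled",
}


def to_int_score(value: Any) -> Optional[int]:
    """Convert a raw score to int; missing or malformed values become None."""
    if value is None:
        return None
    try:
        return int(value)
    except (ValueError, TypeError):
        return None


def summarize_event(event: dict) -> Optional[dict]:
    """Extract the fields the scrapers use from one scoreboard event."""
    date_str = event.get("date", "")
    if not date_str:
        return None
    try:
        game_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
    except ValueError:
        return None
    competitions = event.get("competitions", [])
    if not competitions:
        return None
    competition = competitions[0]
    competitors = competition.get("competitors", [])
    if len(competitors) != 2:
        return None
    sides: dict = {}
    for competitor in competitors:
        side = "home" if competitor.get("homeAway") == "home" else "away"
        name = competitor.get("team", {}).get("displayName", "")
        sides[side] = (name, to_int_score(competitor.get("score")))
    # Both sides must be present and named.
    if not sides.get("home", ("",))[0] or not sides.get("away", ("",))[0]:
        return None
    status_name = competition.get("status", {}).get("type", {}).get("name", "").lower()
    return {
        "game_date": game_date,
        "home_team": sides["home"][0],
        "away_team": sides["away"][0],
        "home_score": sides["home"][1],
        "away_score": sides["away"][1],
        "stadium": competition.get("venue", {}).get("fullName"),
        "status": ESPN_STATUS_MAP.get(status_name, "scheduled"),
    }


# Hypothetical payload for illustration only.
SAMPLE_EVENT = {
    "date": "2026-03-14T23:30:00Z",
    "competitions": [{
        "venue": {"fullName": "Example Park"},
        "status": {"type": {"name": "STATUS_FINAL"}},
        "competitors": [
            {"homeAway": "home", "score": "2", "team": {"displayName": "Home FC"}},
            {"homeAway": "away", "score": "1", "team": {"displayName": "Away FC"}},
        ],
    }],
}
```

Calling `summarize_event(SAMPLE_EVENT)` yields a dict with `Home FC` as the home team and status `final`. A lookup table with a `"scheduled"` default replaces the `if`/`elif` chain, so a new status would need only one new entry.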
+386
@@ -0,0 +1,386 @@
"""WNBA scraper implementation with multi-source fallback."""
from datetime import datetime, date
from typing import Optional
from .base import BaseScraper, RawGameData, ScrapeResult
from ..models.game import Game
from ..models.team import Team
from ..models.stadium import Stadium
from ..models.aliases import ManualReviewItem
from ..normalizers.canonical_id import generate_game_id
from ..normalizers.team_resolver import (
TeamResolver,
TEAM_MAPPINGS,
get_team_resolver,
)
from ..normalizers.stadium_resolver import (
StadiumResolver,
STADIUM_MAPPINGS,
get_stadium_resolver,
)
from ..utils.logging import get_logger, log_game, log_warning
class WNBAScraper(BaseScraper):
"""WNBA schedule scraper.
Sources (in priority order):
1. ESPN API - The only source currently configured
"""
def __init__(self, season: int, **kwargs):
"""Initialize WNBA scraper.
Args:
season: Season year (e.g., 2026 for 2026 season)
"""
super().__init__("wnba", season, **kwargs)
self._team_resolver = get_team_resolver("wnba")
self._stadium_resolver = get_stadium_resolver("wnba")
def _get_sources(self) -> list[str]:
"""Return source list in priority order."""
return ["espn"]

    def _get_source_url(self, source: str, **kwargs) -> str:
        """Build URL for a source."""
        if source == "espn":
            date_str = kwargs.get("date", "")
            return f"https://site.api.espn.com/apis/site/v2/sports/basketball/wnba/scoreboard?dates={date_str}"

        raise ValueError(f"Unknown source: {source}")

    def _get_season_months(self) -> list[tuple[int, int]]:
        """Get the months to scrape for WNBA season.

        WNBA season runs May through September/October.
        """
        months = []

        # WNBA regular season + playoffs
        for month in range(5, 11):  # May-Oct
            months.append((self.season, month))

        return months

    def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
        """Scrape games from a specific source."""
        if source == "espn":
            return self._scrape_espn()
        else:
            raise ValueError(f"Unknown source: {source}")

    def _scrape_espn(self) -> list[RawGameData]:
        """Scrape games from ESPN API."""
        all_games: list[RawGameData] = []

        for year, month in self._get_season_months():
            # Get number of days in month
            if month == 12:
                next_month = date(year + 1, 1, 1)
            else:
                next_month = date(year, month + 1, 1)

            days_in_month = (next_month - date(year, month, 1)).days

            for day in range(1, days_in_month + 1):
                try:
                    game_date = date(year, month, day)
                    date_str = game_date.strftime("%Y%m%d")
                    url = self._get_source_url("espn", date=date_str)

                    data = self.session.get_json(url)
                    games = self._parse_espn_response(data, url)
                    all_games.extend(games)

                except Exception as e:
                    self._logger.debug(f"ESPN error for {year}-{month}-{day}: {e}")
                    continue

        return all_games

    def _parse_espn_response(
        self,
        data: dict,
        source_url: str,
    ) -> list[RawGameData]:
        """Parse ESPN API response."""
        games: list[RawGameData] = []

        events = data.get("events", [])

        for event in events:
            try:
                game = self._parse_espn_event(event, source_url)
                if game:
                    games.append(game)
            except Exception as e:
                self._logger.debug(f"Failed to parse ESPN event: {e}")
                continue

        return games

    def _parse_espn_event(
        self,
        event: dict,
        source_url: str,
    ) -> Optional[RawGameData]:
        """Parse a single ESPN event."""
        # Get date
        date_str = event.get("date", "")
        if not date_str:
            return None

        try:
            game_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
        except ValueError:
            return None

        # Get competitions
        competitions = event.get("competitions", [])
        if not competitions:
            return None

        competition = competitions[0]

        # Get teams
        competitors = competition.get("competitors", [])
        if len(competitors) != 2:
            return None

        home_team = None
        away_team = None
        home_score = None
        away_score = None

        for competitor in competitors:
            team_info = competitor.get("team", {})
            team_name = team_info.get("displayName", "")
            is_home = competitor.get("homeAway") == "home"
            score = competitor.get("score")

            # Scores arrive as strings; anything that is missing, empty,
            # or non-numeric is stored as None rather than passed through.
            if score is not None:
                try:
                    score = int(score)
                except (ValueError, TypeError):
                    score = None

            if is_home:
                home_team = team_name
                home_score = score
            else:
                away_team = team_name
                away_score = score

        if not home_team or not away_team:
            return None

        # Get venue
        venue = competition.get("venue", {})
        stadium = venue.get("fullName")

        # Get status (anything other than final/postponed/canceled,
        # including in-progress games, is treated as scheduled)
        status_info = competition.get("status", {})
        status_type = status_info.get("type", {})
        status_name = status_type.get("name", "").lower()

        if status_name == "status_final":
            status = "final"
        elif status_name == "status_postponed":
            status = "postponed"
        elif status_name == "status_canceled":
            status = "cancelled"
        else:
            status = "scheduled"

        return RawGameData(
            game_date=game_date,
            home_team_raw=home_team,
            away_team_raw=away_team,
            stadium_raw=stadium,
            home_score=home_score,
            away_score=away_score,
            status=status,
            source_url=source_url,
        )

    def _normalize_games(
        self,
        raw_games: list[RawGameData],
    ) -> tuple[list[Game], list[ManualReviewItem]]:
        """Normalize raw games to Game objects with canonical IDs."""
        games: list[Game] = []
        review_items: list[ManualReviewItem] = []

        for raw in raw_games:
            game, item_reviews = self._normalize_single_game(raw)

            if game:
                games.append(game)
                log_game(
                    self.sport,
                    game.id,
                    game.home_team_id,
                    game.away_team_id,
                    game.game_date.strftime("%Y-%m-%d"),
                    game.status,
                )

            review_items.extend(item_reviews)

        return games, review_items

    def _normalize_single_game(
        self,
        raw: RawGameData,
    ) -> tuple[Optional[Game], list[ManualReviewItem]]:
        """Normalize a single raw game."""
        review_items: list[ManualReviewItem] = []

        # Resolve home team
        home_result = self._team_resolver.resolve(
            raw.home_team_raw,
            check_date=raw.game_date.date(),
            source_url=raw.source_url,
        )

        if home_result.review_item:
            review_items.append(home_result.review_item)

        if not home_result.canonical_id:
            log_warning(f"Could not resolve home team: {raw.home_team_raw}")
            return None, review_items

        # Resolve away team
        away_result = self._team_resolver.resolve(
            raw.away_team_raw,
            check_date=raw.game_date.date(),
            source_url=raw.source_url,
        )

        if away_result.review_item:
            review_items.append(away_result.review_item)

        if not away_result.canonical_id:
            log_warning(f"Could not resolve away team: {raw.away_team_raw}")
            return None, review_items

        # Resolve stadium
        stadium_id = None

        if raw.stadium_raw:
            stadium_result = self._stadium_resolver.resolve(
                raw.stadium_raw,
                check_date=raw.game_date.date(),
                source_url=raw.source_url,
            )

            if stadium_result.review_item:
                review_items.append(stadium_result.review_item)

            stadium_id = stadium_result.canonical_id

        # Get abbreviations for game ID
        home_abbrev = self._get_abbreviation(home_result.canonical_id)
        away_abbrev = self._get_abbreviation(away_result.canonical_id)

        # Generate canonical game ID
        game_id = generate_game_id(
            sport=self.sport,
            season=self.season,
            away_abbrev=away_abbrev,
            home_abbrev=home_abbrev,
            game_date=raw.game_date,
            game_number=None,
        )

        game = Game(
            id=game_id,
            sport=self.sport,
            season=self.season,
            home_team_id=home_result.canonical_id,
            away_team_id=away_result.canonical_id,
            stadium_id=stadium_id or "",
            game_date=raw.game_date,
            game_number=None,
            home_score=raw.home_score,
            away_score=raw.away_score,
            status=raw.status,
            source_url=raw.source_url,
            raw_home_team=raw.home_team_raw,
            raw_away_team=raw.away_team_raw,
            raw_stadium=raw.stadium_raw,
        )

        return game, review_items

    def _get_abbreviation(self, team_id: str) -> str:
        """Extract abbreviation from team ID."""
        parts = team_id.split("_")
        return parts[-1] if parts else ""

    def scrape_teams(self) -> list[Team]:
        """Get all WNBA teams from hardcoded mappings."""
        teams: list[Team] = []
        seen: set[str] = set()

        for abbrev, (team_id, full_name, city) in TEAM_MAPPINGS.get("wnba", {}).items():
            if team_id in seen:
                continue
            seen.add(team_id)

            # Parse team name
            parts = full_name.split()
            team_name = parts[-1] if parts else full_name

            # Get stadium ID
            stadium_id = None
            wnba_stadiums = STADIUM_MAPPINGS.get("wnba", {})
            for sid, sinfo in wnba_stadiums.items():
                if city.lower() in sinfo.city.lower() or sinfo.city.lower() in city.lower():
                    stadium_id = sid
                    break

            team = Team(
                id=team_id,
                sport="wnba",
                city=city,
                name=team_name,
                full_name=full_name,
                abbreviation=abbrev,
                conference=None,  # WNBA uses single table now
                division=None,
                stadium_id=stadium_id,
            )
            teams.append(team)

        return teams

    def scrape_stadiums(self) -> list[Stadium]:
        """Get all WNBA stadiums from hardcoded mappings."""
        stadiums: list[Stadium] = []

        wnba_stadiums = STADIUM_MAPPINGS.get("wnba", {})
        for stadium_id, info in wnba_stadiums.items():
            stadium = Stadium(
                id=stadium_id,
                sport="wnba",
                name=info.name,
                city=info.city,
                state=info.state,
                country=info.country,
                latitude=info.latitude,
                longitude=info.longitude,
                surface="hardwood",
                roof_type="dome",
            )
            stadiums.append(stadium)

        return stadiums


def create_wnba_scraper(season: int) -> WNBAScraper:
    """Factory function to create a WNBA scraper."""
    return WNBAScraper(season=season)
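The event-parsing step in `_parse_espn_event` reduces one ESPN scoreboard event to a start time, two team names, optional scores, a venue, and a status. A rough standalone sketch of that flow follows. It is not the package's code: `summarize_event`, `STATUS_MAP`, and the sample event are hypothetical names and data invented for illustration, and the dict lookup stands in for the scraper's if/elif chain.

```python
from datetime import datetime
from typing import Optional

# Hypothetical lookup table; unknown status names fall back to "scheduled".
STATUS_MAP = {
    "status_final": "final",
    "status_postponed": "postponed",
    "status_canceled": "cancelled",
}


def summarize_event(event: dict) -> Optional[dict]:
    """Reduce one scoreboard event to the fields the scraper keeps."""
    date_str = event.get("date", "")
    competitions = event.get("competitions", [])
    if not date_str or not competitions:
        return None

    competition = competitions[0]
    competitors = competition.get("competitors", [])
    if len(competitors) != 2:
        return None

    sides = {}
    for competitor in competitors:
        side = "home" if competitor.get("homeAway") == "home" else "away"
        raw_score = competitor.get("score")
        try:
            # Scores are strings in the payload; null stays None.
            score = int(raw_score) if raw_score is not None else None
        except (ValueError, TypeError):
            score = None
        name = competitor.get("team", {}).get("displayName", "")
        sides[side] = (name, score)

    if "home" not in sides or "away" not in sides:
        return None

    status_name = competition.get("status", {}).get("type", {}).get("name", "")
    return {
        "game_date": datetime.fromisoformat(date_str.replace("Z", "+00:00")),
        "home": sides["home"],
        "away": sides["away"],
        "stadium": competition.get("venue", {}).get("fullName"),
        "status": STATUS_MAP.get(status_name.lower(), "scheduled"),
    }


# Made-up event in the same shape as the ESPN fixtures in this commit.
sample_event = {
    "date": "2026-06-01T23:00:00Z",
    "competitions": [
        {
            "venue": {"fullName": "Example Arena"},
            "competitors": [
                {"homeAway": "home", "team": {"displayName": "Home Club"}, "score": "81"},
                {"homeAway": "away", "team": {"displayName": "Away Club"}, "score": "77"},
            ],
            "status": {"type": {"name": "STATUS_FINAL"}},
        }
    ],
}

summary = summarize_event(sample_event)
print(summary["status"], summary["home"], summary["away"])
```

Returning `None` for malformed events, instead of raising, mirrors how the scraper skips bad entries and keeps going.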
@@ -0,0 +1 @@
"""Unit tests for sportstime_parser."""
@@ -0,0 +1,48 @@
"""Test fixtures for sportstime-parser tests."""
from pathlib import Path
FIXTURES_DIR = Path(__file__).parent
# NBA fixtures
NBA_FIXTURES_DIR = FIXTURES_DIR / "nba"
NBA_BR_OCTOBER_HTML = NBA_FIXTURES_DIR / "basketball_reference_october.html"
NBA_BR_EDGE_CASES_HTML = NBA_FIXTURES_DIR / "basketball_reference_edge_cases.html"
NBA_ESPN_SCOREBOARD_JSON = NBA_FIXTURES_DIR / "espn_scoreboard.json"
# MLB fixtures
MLB_FIXTURES_DIR = FIXTURES_DIR / "mlb"
MLB_ESPN_SCOREBOARD_JSON = MLB_FIXTURES_DIR / "espn_scoreboard.json"
# NFL fixtures
NFL_FIXTURES_DIR = FIXTURES_DIR / "nfl"
NFL_ESPN_SCOREBOARD_JSON = NFL_FIXTURES_DIR / "espn_scoreboard.json"
# NHL fixtures
NHL_FIXTURES_DIR = FIXTURES_DIR / "nhl"
NHL_ESPN_SCOREBOARD_JSON = NHL_FIXTURES_DIR / "espn_scoreboard.json"
# MLS fixtures
MLS_FIXTURES_DIR = FIXTURES_DIR / "mls"
MLS_ESPN_SCOREBOARD_JSON = MLS_FIXTURES_DIR / "espn_scoreboard.json"
# WNBA fixtures
WNBA_FIXTURES_DIR = FIXTURES_DIR / "wnba"
WNBA_ESPN_SCOREBOARD_JSON = WNBA_FIXTURES_DIR / "espn_scoreboard.json"
# NWSL fixtures
NWSL_FIXTURES_DIR = FIXTURES_DIR / "nwsl"
NWSL_ESPN_SCOREBOARD_JSON = NWSL_FIXTURES_DIR / "espn_scoreboard.json"


def load_fixture(path: Path) -> str:
    """Load a fixture file as text."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


def load_json_fixture(path: Path) -> dict:
    """Load a JSON fixture file."""
    import json

    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
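A test would typically pass one of the path constants above to `load_json_fixture` and assert on the parsed events. The sketch below shows that pattern in isolation. It repeats the helper so it can run on its own and writes a made-up payload to a temporary directory in place of a real fixture file; the payload contents and the `event_ids` check are assumptions for illustration only.

```python
import json
import tempfile
from pathlib import Path


def load_json_fixture(path: Path) -> dict:
    """Same behavior as the helper above, repeated so the sketch runs alone."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Hypothetical stand-in for a file such as WNBA_ESPN_SCOREBOARD_JSON.
payload = {"events": [{"id": "1", "date": "2026-06-01T23:00:00Z"}]}

with tempfile.TemporaryDirectory() as tmp:
    fixture_path = Path(tmp) / "espn_scoreboard.json"
    fixture_path.write_text(json.dumps(payload), encoding="utf-8")
    data = load_json_fixture(fixture_path)

# Typical assertion target: the list of event IDs in the payload.
event_ids = [event["id"] for event in data.get("events", [])]
print(event_ids)
```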
@@ -0,0 +1,245 @@
{
"leagues": [
{
"id": "10",
"uid": "s:1~l:10",
"name": "Major League Baseball",
"abbreviation": "MLB"
}
],
"season": {
"type": 2,
"year": 2026
},
"day": {
"date": "2026-04-15T00:00:00Z"
},
"events": [
{
"id": "401584801",
"uid": "s:1~l:10~e:401584801",
"date": "2026-04-15T23:05:00Z",
"name": "New York Yankees at Boston Red Sox",
"shortName": "NYY @ BOS",
"competitions": [
{
"id": "401584801",
"uid": "s:1~l:10~e:401584801~c:401584801",
"date": "2026-04-15T23:05:00Z",
"attendance": 37435,
"type": {
"id": "1",
"abbreviation": "STD"
},
"venue": {
"id": "3",
"fullName": "Fenway Park",
"address": {
"city": "Boston",
"state": "MA"
},
"capacity": 37755,
"indoor": false
},
"competitors": [
{
"id": "2",
"uid": "s:1~l:10~t:2",
"type": "team",
"order": 0,
"homeAway": "home",
"team": {
"id": "2",
"uid": "s:1~l:10~t:2",
"location": "Boston",
"name": "Red Sox",
"abbreviation": "BOS",
"displayName": "Boston Red Sox"
},
"score": "5",
"winner": true
},
{
"id": "10",
"uid": "s:1~l:10~t:10",
"type": "team",
"order": 1,
"homeAway": "away",
"team": {
"id": "10",
"uid": "s:1~l:10~t:10",
"location": "New York",
"name": "Yankees",
"abbreviation": "NYY",
"displayName": "New York Yankees"
},
"score": "3",
"winner": false
}
],
"status": {
"clock": 0,
"displayClock": "0:00",
"period": 9,
"type": {
"id": "3",
"name": "STATUS_FINAL",
"state": "post",
"completed": true
}
}
}
]
},
{
"id": "401584802",
"uid": "s:1~l:10~e:401584802",
"date": "2026-04-15T20:10:00Z",
"name": "Chicago Cubs at St. Louis Cardinals",
"shortName": "CHC @ STL",
"competitions": [
{
"id": "401584802",
"uid": "s:1~l:10~e:401584802~c:401584802",
"date": "2026-04-15T20:10:00Z",
"type": {
"id": "1",
"abbreviation": "STD"
},
"venue": {
"id": "87",
"fullName": "Busch Stadium",
"address": {
"city": "St. Louis",
"state": "MO"
},
"capacity": 45538,
"indoor": false
},
"competitors": [
{
"id": "24",
"uid": "s:1~l:10~t:24",
"type": "team",
"order": 0,
"homeAway": "home",
"team": {
"id": "24",
"uid": "s:1~l:10~t:24",
"location": "St. Louis",
"name": "Cardinals",
"abbreviation": "STL",
"displayName": "St. Louis Cardinals"
},
"score": "7",
"winner": true
},
{
"id": "16",
"uid": "s:1~l:10~t:16",
"type": "team",
"order": 1,
"homeAway": "away",
"team": {
"id": "16",
"uid": "s:1~l:10~t:16",
"location": "Chicago",
"name": "Cubs",
"abbreviation": "CHC",
"displayName": "Chicago Cubs"
},
"score": "4",
"winner": false
}
],
"status": {
"clock": 0,
"displayClock": "0:00",
"period": 9,
"type": {
"id": "3",
"name": "STATUS_FINAL",
"state": "post",
"completed": true
}
}
}
]
},
{
"id": "401584803",
"uid": "s:1~l:10~e:401584803",
"date": "2026-04-16T00:10:00Z",
"name": "Los Angeles Dodgers at San Francisco Giants",
"shortName": "LAD @ SF",
"competitions": [
{
"id": "401584803",
"uid": "s:1~l:10~e:401584803~c:401584803",
"date": "2026-04-16T00:10:00Z",
"type": {
"id": "1",
"abbreviation": "STD"
},
"venue": {
"id": "116",
"fullName": "Oracle Park",
"address": {
"city": "San Francisco",
"state": "CA"
},
"capacity": 41915,
"indoor": false
},
"competitors": [
{
"id": "26",
"uid": "s:1~l:10~t:26",
"type": "team",
"order": 0,
"homeAway": "home",
"team": {
"id": "26",
"uid": "s:1~l:10~t:26",
"location": "San Francisco",
"name": "Giants",
"abbreviation": "SF",
"displayName": "San Francisco Giants"
},
"score": null,
"winner": null
},
{
"id": "19",
"uid": "s:1~l:10~t:19",
"type": "team",
"order": 1,
"homeAway": "away",
"team": {
"id": "19",
"uid": "s:1~l:10~t:19",
"location": "Los Angeles",
"name": "Dodgers",
"abbreviation": "LAD",
"displayName": "Los Angeles Dodgers"
},
"score": null,
"winner": null
}
],
"status": {
"clock": 0,
"displayClock": "0:00",
"period": 0,
"type": {
"id": "1",
"name": "STATUS_SCHEDULED",
"state": "pre",
"completed": false
}
}
}
]
}
]
}
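The third event in this fixture starts at `2026-04-16T00:10:00Z`, although the scoreboard `day` is 2026-04-15. ESPN timestamps are UTC, so a West Coast evening game lands on the next calendar day once parsed. The sketch below illustrates the difference with a fixed UTC-7 offset, which is an assumption standing in for Pacific Daylight Time; whether `generate_game_id` keys on the UTC or the local date is not shown in this section.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Pacific Daylight Time (UTC-7) applies at Oracle Park in April.
PACIFIC_DAYLIGHT = timezone(timedelta(hours=-7))

# Same parsing idiom the scrapers use for ESPN "date" fields.
utc_start = datetime.fromisoformat("2026-04-16T00:10:00Z".replace("Z", "+00:00"))
local_start = utc_start.astimezone(PACIFIC_DAYLIGHT)

print(utc_start.date())    # 2026-04-16
print(local_start.date())  # 2026-04-15
```

If IDs or lookups are built from the UTC date, tests against this fixture should expect 2026-04-16 for the Dodgers at Giants game.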
@@ -0,0 +1,245 @@
{
"leagues": [
{
"id": "19",
"uid": "s:600~l:19",
"name": "Major League Soccer",
"abbreviation": "MLS"
}
],
"season": {
"type": 2,
"year": 2026
},
"day": {
"date": "2026-03-15T00:00:00Z"
},
"events": [
{
"id": "401672001",
"uid": "s:600~l:19~e:401672001",
"date": "2026-03-15T22:00:00Z",
"name": "LA Galaxy at LAFC",
"shortName": "LA @ LAFC",
"competitions": [
{
"id": "401672001",
"uid": "s:600~l:19~e:401672001~c:401672001",
"date": "2026-03-15T22:00:00Z",
"attendance": 22000,
"type": {
"id": "1",
"abbreviation": "STD"
},
"venue": {
"id": "8909",
"fullName": "BMO Stadium",
"address": {
"city": "Los Angeles",
"state": "CA"
},
"capacity": 22000,
"indoor": false
},
"competitors": [
{
"id": "21295",
"uid": "s:600~l:19~t:21295",
"type": "team",
"order": 0,
"homeAway": "home",
"team": {
"id": "21295",
"uid": "s:600~l:19~t:21295",
"location": "Los Angeles",
"name": "FC",
"abbreviation": "LAFC",
"displayName": "Los Angeles FC"
},
"score": "3",
"winner": true
},
{
"id": "3610",
"uid": "s:600~l:19~t:3610",
"type": "team",
"order": 1,
"homeAway": "away",
"team": {
"id": "3610",
"uid": "s:600~l:19~t:3610",
"location": "Los Angeles",
"name": "Galaxy",
"abbreviation": "LA",
"displayName": "LA Galaxy"
},
"score": "2",
"winner": false
}
],
"status": {
"clock": 90,
"displayClock": "90'",
"period": 2,
"type": {
"id": "3",
"name": "STATUS_FINAL",
"state": "post",
"completed": true
}
}
}
]
},
{
"id": "401672002",
"uid": "s:600~l:19~e:401672002",
"date": "2026-03-15T23:00:00Z",
"name": "Seattle Sounders at Portland Timbers",
"shortName": "SEA @ POR",
"competitions": [
{
"id": "401672002",
"uid": "s:600~l:19~e:401672002~c:401672002",
"date": "2026-03-15T23:00:00Z",
"type": {
"id": "1",
"abbreviation": "STD"
},
"venue": {
"id": "8070",
"fullName": "Providence Park",
"address": {
"city": "Portland",
"state": "OR"
},
"capacity": 25218,
"indoor": false
},
"competitors": [
{
"id": "5282",
"uid": "s:600~l:19~t:5282",
"type": "team",
"order": 0,
"homeAway": "home",
"team": {
"id": "5282",
"uid": "s:600~l:19~t:5282",
"location": "Portland",
"name": "Timbers",
"abbreviation": "POR",
"displayName": "Portland Timbers"
},
"score": "2",
"winner": false
},
{
"id": "4687",
"uid": "s:600~l:19~t:4687",
"type": "team",
"order": 1,
"homeAway": "away",
"team": {
"id": "4687",
"uid": "s:600~l:19~t:4687",
"location": "Seattle",
"name": "Sounders FC",
"abbreviation": "SEA",
"displayName": "Seattle Sounders FC"
},
"score": "2",
"winner": false
}
],
"status": {
"clock": 90,
"displayClock": "90'",
"period": 2,
"type": {
"id": "3",
"name": "STATUS_FINAL",
"state": "post",
"completed": true
}
}
}
]
},
{
"id": "401672003",
"uid": "s:600~l:19~e:401672003",
"date": "2026-03-16T00:00:00Z",
"name": "New York Red Bulls at Atlanta United",
"shortName": "NY @ ATL",
"competitions": [
{
"id": "401672003",
"uid": "s:600~l:19~e:401672003~c:401672003",
"date": "2026-03-16T00:00:00Z",
"type": {
"id": "1",
"abbreviation": "STD"
},
"venue": {
"id": "8904",
"fullName": "Mercedes-Benz Stadium",
"address": {
"city": "Atlanta",
"state": "GA"
},
"capacity": 42500,
"indoor": true
},
"competitors": [
{
"id": "18626",
"uid": "s:600~l:19~t:18626",
"type": "team",
"order": 0,
"homeAway": "home",
"team": {
"id": "18626",
"uid": "s:600~l:19~t:18626",
"location": "Atlanta",
"name": "United FC",
"abbreviation": "ATL",
"displayName": "Atlanta United FC"
},
"score": null,
"winner": null
},
{
"id": "399",
"uid": "s:600~l:19~t:399",
"type": "team",
"order": 1,
"homeAway": "away",
"team": {
"id": "399",
"uid": "s:600~l:19~t:399",
"location": "New York",
"name": "Red Bulls",
"abbreviation": "NY",
"displayName": "New York Red Bulls"
},
"score": null,
"winner": null
}
],
"status": {
"clock": 0,
"displayClock": "0'",
"period": 0,
"type": {
"id": "1",
"name": "STATUS_SCHEDULED",
"state": "pre",
"completed": false
}
}
}
]
}
]
}
@@ -0,0 +1,79 @@
<!DOCTYPE html>
<html>
<head>
<title>2025-26 NBA Schedule - Edge Cases | Basketball-Reference.com</title>
</head>
<body>
<table id="schedule" class="stats_table">
<thead>
<tr>
<th data-stat="date_game">Date</th>
<th data-stat="game_start_time">Start (ET)</th>
<th data-stat="visitor_team_name">Visitor/Neutral</th>
<th data-stat="visitor_pts">PTS</th>
<th data-stat="home_team_name">Home/Neutral</th>
<th data-stat="home_pts">PTS</th>
<th data-stat="arena_name">Arena</th>
<th data-stat="game_remarks">Notes</th>
</tr>
</thead>
<tbody>
<!-- Postponed game -->
<tr>
<th data-stat="date_game">Sun, Jan 11, 2026</th>
<td data-stat="game_start_time">7:30p</td>
<td data-stat="visitor_team_name">Los Angeles Lakers</td>
<td data-stat="visitor_pts"></td>
<td data-stat="home_team_name">Phoenix Suns</td>
<td data-stat="home_pts"></td>
<td data-stat="arena_name">Footprint Center</td>
<td data-stat="game_remarks">Postponed - Weather</td>
</tr>
<!-- Neutral site game (Mexico City) -->
<tr>
<th data-stat="date_game">Sat, Nov 8, 2025</th>
<td data-stat="game_start_time">7:00p</td>
<td data-stat="visitor_team_name">Miami Heat</td>
<td data-stat="visitor_pts">105</td>
<td data-stat="home_team_name">Washington Wizards</td>
<td data-stat="home_pts">99</td>
<td data-stat="arena_name">Arena CDMX</td>
<td data-stat="game_remarks">NBA Mexico City Games</td>
</tr>
<!-- Cancelled game -->
<tr>
<th data-stat="date_game">Wed, Dec 3, 2025</th>
<td data-stat="game_start_time">8:00p</td>
<td data-stat="visitor_team_name">Portland Trail Blazers</td>
<td data-stat="visitor_pts"></td>
<td data-stat="home_team_name">Sacramento Kings</td>
<td data-stat="home_pts"></td>
<td data-stat="arena_name">Golden 1 Center</td>
<td data-stat="game_remarks">Cancelled</td>
</tr>
<!-- Regular completed game with high scores -->
<tr>
<th data-stat="date_game">Sun, Mar 15, 2026</th>
<td data-stat="game_start_time">3:30p</td>
<td data-stat="visitor_team_name">Indiana Pacers</td>
<td data-stat="visitor_pts">147</td>
<td data-stat="home_team_name">Atlanta Hawks</td>
<td data-stat="home_pts">150</td>
<td data-stat="arena_name">State Farm Arena</td>
<td data-stat="game_remarks">OT</td>
</tr>
<!-- Game at arena with special characters -->
<tr>
<th data-stat="date_game">Mon, Feb 2, 2026</th>
<td data-stat="game_start_time">10:30p</td>
<td data-stat="visitor_team_name">Golden State Warriors</td>
<td data-stat="visitor_pts">118</td>
<td data-stat="home_team_name">Los Angeles Clippers</td>
<td data-stat="home_pts">115</td>
<td data-stat="arena_name">Intuit Dome</td>
<td data-stat="game_remarks"></td>
</tr>
</tbody>
</table>
</body>
</html>
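Each row in this fixture carries its meaning in `data-stat` attributes, so a parser can key cells by attribute rather than by column position. The sketch below does that with the standard-library `html.parser` module on two rows copied from the fixture. `ScheduleRowParser` and `classify` are hypothetical names, and the status rules in `classify` are an assumed reading of the edge cases, not the NBA scraper's actual logic, which is not shown in this section.

```python
from html.parser import HTMLParser


class ScheduleRowParser(HTMLParser):
    """Collect each <tbody> row as a dict keyed by its data-stat attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.rows = []
        self._in_body = False
        self._row = None
        self._stat = None

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self._in_body = True
        elif tag == "tr" and self._in_body:
            self._row = {}
        elif tag in ("th", "td") and self._row is not None:
            self._stat = dict(attrs).get("data-stat")
            if self._stat:
                self._row[self._stat] = ""

    def handle_data(self, data):
        # Only text inside a data-stat cell is kept.
        if self._row is not None and self._stat:
            self._row[self._stat] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._stat = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "tbody":
            self._in_body = False


def classify(row: dict) -> str:
    """Assumed status rules: remarks first, then presence of both scores."""
    remarks = row.get("game_remarks", "").lower()
    if "postponed" in remarks:
        return "postponed"
    if "cancel" in remarks:
        return "cancelled"
    if row.get("visitor_pts") and row.get("home_pts"):
        return "final"
    return "scheduled"


sample_html = """
<table id="schedule"><tbody>
<tr>
<th data-stat="date_game">Wed, Dec 3, 2025</th>
<td data-stat="visitor_team_name">Portland Trail Blazers</td>
<td data-stat="visitor_pts"></td>
<td data-stat="home_team_name">Sacramento Kings</td>
<td data-stat="home_pts"></td>
<td data-stat="game_remarks">Cancelled</td>
</tr>
<tr>
<th data-stat="date_game">Sun, Mar 15, 2026</th>
<td data-stat="visitor_team_name">Indiana Pacers</td>
<td data-stat="visitor_pts">147</td>
<td data-stat="home_team_name">Atlanta Hawks</td>
<td data-stat="home_pts">150</td>
<td data-stat="game_remarks">OT</td>
</tr>
</tbody></table>
"""

parser = ScheduleRowParser()
parser.feed(sample_html)
parser.close()

statuses = [classify(row) for row in parser.rows]
print(statuses)
```

Keying on `data-stat` keeps the parser working if columns are reordered, which matters for a scraper that depends on a third-party page layout.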
@@ -0,0 +1,94 @@
<!DOCTYPE html>
<html>
<head>
<title>2025-26 NBA Schedule - October | Basketball-Reference.com</title>
</head>
<body>
<table id="schedule" class="stats_table">
<thead>
<tr>
<th data-stat="date_game">Date</th>
<th data-stat="game_start_time">Start (ET)</th>
<th data-stat="visitor_team_name">Visitor/Neutral</th>
<th data-stat="visitor_pts">PTS</th>
<th data-stat="home_team_name">Home/Neutral</th>
<th data-stat="home_pts">PTS</th>
<th data-stat="arena_name">Arena</th>
<th data-stat="game_remarks">Notes</th>
</tr>
</thead>
<tbody>
<tr>
<th data-stat="date_game">Wed, Oct 22, 2025</th>
<td data-stat="game_start_time">7:30p</td>
<td data-stat="visitor_team_name">Boston Celtics</td>
<td data-stat="visitor_pts">112</td>
<td data-stat="home_team_name">Cleveland Cavaliers</td>
<td data-stat="home_pts">108</td>
<td data-stat="arena_name">Rocket Mortgage FieldHouse</td>
<td data-stat="game_remarks"></td>
</tr>
<tr>
<th data-stat="date_game">Wed, Oct 22, 2025</th>
<td data-stat="game_start_time">10:00p</td>
<td data-stat="visitor_team_name">Denver Nuggets</td>
<td data-stat="visitor_pts">119</td>
<td data-stat="home_team_name">Los Angeles Lakers</td>
<td data-stat="home_pts">127</td>
<td data-stat="arena_name">Crypto.com Arena</td>
<td data-stat="game_remarks"></td>
</tr>
<tr>
<th data-stat="date_game">Thu, Oct 23, 2025</th>
<td data-stat="game_start_time">7:00p</td>
<td data-stat="visitor_team_name">Houston Rockets</td>
<td data-stat="visitor_pts"></td>
<td data-stat="home_team_name">Oklahoma City Thunder</td>
<td data-stat="home_pts"></td>
<td data-stat="arena_name">Paycom Center</td>
<td data-stat="game_remarks"></td>
</tr>
<tr>
<th data-stat="date_game">Thu, Oct 23, 2025</th>
<td data-stat="game_start_time">7:30p</td>
<td data-stat="visitor_team_name">New York Knicks</td>
<td data-stat="visitor_pts"></td>
<td data-stat="home_team_name">Brooklyn Nets</td>
<td data-stat="home_pts"></td>
<td data-stat="arena_name">Barclays Center</td>
<td data-stat="game_remarks"></td>
</tr>
<tr>
<th data-stat="date_game">Fri, Oct 24, 2025</th>
<td data-stat="game_start_time">7:00p</td>
<td data-stat="visitor_team_name">Chicago Bulls</td>
<td data-stat="visitor_pts"></td>
<td data-stat="home_team_name">Miami Heat</td>
<td data-stat="home_pts"></td>
<td data-stat="arena_name">Kaseya Center</td>
<td data-stat="game_remarks"></td>
</tr>
<tr>
<th data-stat="date_game">Sat, Oct 25, 2025</th>
<td data-stat="game_start_time">7:30p</td>
<td data-stat="visitor_team_name">Toronto Raptors</td>
<td data-stat="visitor_pts"></td>
<td data-stat="home_team_name">Boston Celtics</td>
<td data-stat="home_pts"></td>
<td data-stat="arena_name">TD Garden</td>
<td data-stat="game_remarks"></td>
</tr>
<tr>
<th data-stat="date_game">Sun, Oct 26, 2025</th>
<td data-stat="game_start_time">8:00p</td>
<td data-stat="visitor_team_name">Minnesota Timberwolves</td>
<td data-stat="visitor_pts"></td>
<td data-stat="home_team_name">Dallas Mavericks</td>
<td data-stat="home_pts"></td>
<td data-stat="arena_name">American Airlines Center</td>
<td data-stat="game_remarks"></td>
</tr>
</tbody>
</table>
</body>
</html>
@@ -0,0 +1,245 @@
{
"leagues": [
{
"id": "46",
"uid": "s:40~l:46",
"name": "National Basketball Association",
"abbreviation": "NBA"
}
],
"season": {
"type": 2,
"year": 2026
},
"day": {
"date": "2025-10-22T00:00:00Z"
},
"events": [
{
"id": "401584721",
"uid": "s:40~l:46~e:401584721",
"date": "2025-10-22T23:30:00Z",
"name": "Boston Celtics at Cleveland Cavaliers",
"shortName": "BOS @ CLE",
"competitions": [
{
"id": "401584721",
"uid": "s:40~l:46~e:401584721~c:401584721",
"date": "2025-10-22T23:30:00Z",
"attendance": 20562,
"type": {
"id": "1",
"abbreviation": "STD"
},
"venue": {
"id": "5064",
"fullName": "Rocket Mortgage FieldHouse",
"address": {
"city": "Cleveland",
"state": "OH"
},
"capacity": 19432,
"indoor": true
},
"competitors": [
{
"id": "5",
"uid": "s:40~l:46~t:5",
"type": "team",
"order": 0,
"homeAway": "home",
"team": {
"id": "5",
"uid": "s:40~l:46~t:5",
"location": "Cleveland",
"name": "Cavaliers",
"abbreviation": "CLE",
"displayName": "Cleveland Cavaliers"
},
"score": "108",
"winner": false
},
{
"id": "2",
"uid": "s:40~l:46~t:2",
"type": "team",
"order": 1,
"homeAway": "away",
"team": {
"id": "2",
"uid": "s:40~l:46~t:2",
"location": "Boston",
"name": "Celtics",
"abbreviation": "BOS",
"displayName": "Boston Celtics"
},
"score": "112",
"winner": true
}
],
"status": {
"clock": 0,
"displayClock": "0:00",
"period": 4,
"type": {
"id": "3",
"name": "STATUS_FINAL",
"state": "post",
"completed": true
}
}
}
]
},
{
"id": "401584722",
"uid": "s:40~l:46~e:401584722",
"date": "2025-10-23T02:00:00Z",
"name": "Denver Nuggets at Los Angeles Lakers",
"shortName": "DEN @ LAL",
"competitions": [
{
"id": "401584722",
"uid": "s:40~l:46~e:401584722~c:401584722",
"date": "2025-10-23T02:00:00Z",
"type": {
"id": "1",
"abbreviation": "STD"
},
"venue": {
"id": "5091",
"fullName": "Crypto.com Arena",
"address": {
"city": "Los Angeles",
"state": "CA"
},
"capacity": 19068,
"indoor": true
},
"competitors": [
{
"id": "13",
"uid": "s:40~l:46~t:13",
"type": "team",
"order": 0,
"homeAway": "home",
"team": {
"id": "13",
"uid": "s:40~l:46~t:13",
"location": "Los Angeles",
"name": "Lakers",
"abbreviation": "LAL",
"displayName": "Los Angeles Lakers"
},
"score": "127",
"winner": true
},
{
"id": "7",
"uid": "s:40~l:46~t:7",
"type": "team",
"order": 1,
"homeAway": "away",
"team": {
"id": "7",
"uid": "s:40~l:46~t:7",
"location": "Denver",
"name": "Nuggets",
"abbreviation": "DEN",
"displayName": "Denver Nuggets"
},
"score": "119",
"winner": false
}
],
"status": {
"clock": 0,
"displayClock": "0:00",
"period": 4,
"type": {
"id": "3",
"name": "STATUS_FINAL",
"state": "post",
"completed": true
}
}
}
]
},
{
"id": "401584723",
"uid": "s:40~l:46~e:401584723",
"date": "2025-10-24T00:00:00Z",
"name": "Houston Rockets at Oklahoma City Thunder",
"shortName": "HOU @ OKC",
"competitions": [
{
"id": "401584723",
"uid": "s:40~l:46~e:401584723~c:401584723",
"date": "2025-10-24T00:00:00Z",
"type": {
"id": "1",
"abbreviation": "STD"
},
"venue": {
"id": "4922",
"fullName": "Paycom Center",
"address": {
"city": "Oklahoma City",
"state": "OK"
},
"capacity": 18203,
"indoor": true
},
"competitors": [
{
"id": "25",
"uid": "s:40~l:46~t:25",
"type": "team",
"order": 0,
"homeAway": "home",
"team": {
"id": "25",
"uid": "s:40~l:46~t:25",
"location": "Oklahoma City",
"name": "Thunder",
"abbreviation": "OKC",
"displayName": "Oklahoma City Thunder"
},
"score": null,
"winner": null
},
{
"id": "10",
"uid": "s:40~l:46~t:10",
"type": "team",
"order": 1,
"homeAway": "away",
"team": {
"id": "10",
"uid": "s:40~l:46~t:10",
"location": "Houston",
"name": "Rockets",
"abbreviation": "HOU",
"displayName": "Houston Rockets"
},
"score": null,
"winner": null
}
],
"status": {
"clock": 0,
"displayClock": "0:00",
"period": 0,
"type": {
"id": "1",
"name": "STATUS_SCHEDULED",
"state": "pre",
"completed": false
}
}
}
]
}
]
}
@@ -0,0 +1,245 @@
{
"leagues": [
{
"id": "28",
"uid": "s:20~l:28",
"name": "National Football League",
"abbreviation": "NFL"
}
],
"season": {
"type": 2,
"year": 2025
},
"week": {
"number": 1
},
"events": [
{
"id": "401671801",
"uid": "s:20~l:28~e:401671801",
"date": "2025-09-07T20:00:00Z",
"name": "Kansas City Chiefs at Baltimore Ravens",
"shortName": "KC @ BAL",
"competitions": [
{
"id": "401671801",
"uid": "s:20~l:28~e:401671801~c:401671801",
"date": "2025-09-07T20:00:00Z",
"attendance": 71547,
"type": {
"id": "1",
"abbreviation": "STD"
},
"venue": {
"id": "3814",
"fullName": "M&T Bank Stadium",
"address": {
"city": "Baltimore",
"state": "MD"
},
"capacity": 71008,
"indoor": false
},
"competitors": [
{
"id": "33",
"uid": "s:20~l:28~t:33",
"type": "team",
"order": 0,
"homeAway": "home",
"team": {
"id": "33",
"uid": "s:20~l:28~t:33",
"location": "Baltimore",
"name": "Ravens",
"abbreviation": "BAL",
"displayName": "Baltimore Ravens"
},
"score": "20",
"winner": false
},
{
"id": "12",
"uid": "s:20~l:28~t:12",
"type": "team",
"order": 1,
"homeAway": "away",
"team": {
"id": "12",
"uid": "s:20~l:28~t:12",
"location": "Kansas City",
"name": "Chiefs",
"abbreviation": "KC",
"displayName": "Kansas City Chiefs"
},
"score": "27",
"winner": true
}
],
"status": {
"clock": 0,
"displayClock": "0:00",
"period": 4,
"type": {
"id": "3",
"name": "STATUS_FINAL",
"state": "post",
"completed": true
}
}
}
]
},
{
"id": "401671802",
"uid": "s:20~l:28~e:401671802",
"date": "2025-09-08T17:00:00Z",
"name": "Philadelphia Eagles at Green Bay Packers",
"shortName": "PHI @ GB",
"competitions": [
{
"id": "401671802",
"uid": "s:20~l:28~e:401671802~c:401671802",
"date": "2025-09-08T17:00:00Z",
"type": {
"id": "1",
"abbreviation": "STD"
},
"venue": {
"id": "3798",
"fullName": "Lambeau Field",
"address": {
"city": "Green Bay",
"state": "WI"
},
"capacity": 81441,
"indoor": false
},
"competitors": [
{
"id": "9",
"uid": "s:20~l:28~t:9",
"type": "team",
"order": 0,
"homeAway": "home",
"team": {
"id": "9",
"uid": "s:20~l:28~t:9",
"location": "Green Bay",
"name": "Packers",
"abbreviation": "GB",
"displayName": "Green Bay Packers"
},
"score": "34",
"winner": true
},
{
"id": "21",
"uid": "s:20~l:28~t:21",
"type": "team",
"order": 1,
"homeAway": "away",
"team": {
"id": "21",
"uid": "s:20~l:28~t:21",
"location": "Philadelphia",
"name": "Eagles",
"abbreviation": "PHI",
"displayName": "Philadelphia Eagles"
},
"score": "29",
"winner": false
}
],
"status": {
"clock": 0,
"displayClock": "0:00",
"period": 4,
"type": {
"id": "3",
"name": "STATUS_FINAL",
"state": "post",
"completed": true
}
}
}
]
},
{
"id": "401671803",
"uid": "s:20~l:28~e:401671803",
"date": "2025-09-08T20:25:00Z",
"name": "Dallas Cowboys at Cleveland Browns",
"shortName": "DAL @ CLE",
"competitions": [
{
"id": "401671803",
"uid": "s:20~l:28~e:401671803~c:401671803",
"date": "2025-09-08T20:25:00Z",
"type": {
"id": "1",
"abbreviation": "STD"
},
"venue": {
"id": "3653",
"fullName": "Cleveland Browns Stadium",
"address": {
"city": "Cleveland",
"state": "OH"
},
"capacity": 67431,
"indoor": false
},
"competitors": [
{
"id": "5",
"uid": "s:20~l:28~t:5",
"type": "team",
"order": 0,
"homeAway": "home",
"team": {
"id": "5",
"uid": "s:20~l:28~t:5",
"location": "Cleveland",
"name": "Browns",
"abbreviation": "CLE",
"displayName": "Cleveland Browns"
},
"score": null,
"winner": null
},
{
"id": "6",
"uid": "s:20~l:28~t:6",
"type": "team",
"order": 1,
"homeAway": "away",
"team": {
"id": "6",
"uid": "s:20~l:28~t:6",
"location": "Dallas",
"name": "Cowboys",
"abbreviation": "DAL",
"displayName": "Dallas Cowboys"
},
"score": null,
"winner": null
}
],
"status": {
"clock": 0,
"displayClock": "0:00",
"period": 0,
"type": {
"id": "1",
"name": "STATUS_SCHEDULED",
"state": "pre",
"completed": false
}
}
}
]
}
]
}
@@ -0,0 +1,245 @@
{
"leagues": [
{
"id": "90",
"uid": "s:70~l:90",
"name": "National Hockey League",
"abbreviation": "NHL"
}
],
"season": {
"type": 2,
"year": 2026
},
"day": {
"date": "2025-10-08T00:00:00Z"
},
"events": [
{
"id": "401671901",
"uid": "s:70~l:90~e:401671901",
"date": "2025-10-08T23:00:00Z",
"name": "Pittsburgh Penguins at Boston Bruins",
"shortName": "PIT @ BOS",
"competitions": [
{
"id": "401671901",
"uid": "s:70~l:90~e:401671901~c:401671901",
"date": "2025-10-08T23:00:00Z",
"attendance": 17850,
"type": {
"id": "1",
"abbreviation": "STD"
},
"venue": {
"id": "1823",
"fullName": "TD Garden",
"address": {
"city": "Boston",
"state": "MA"
},
"capacity": 17850,
"indoor": true
},
"competitors": [
{
"id": "1",
"uid": "s:70~l:90~t:1",
"type": "team",
"order": 0,
"homeAway": "home",
"team": {
"id": "1",
"uid": "s:70~l:90~t:1",
"location": "Boston",
"name": "Bruins",
"abbreviation": "BOS",
"displayName": "Boston Bruins"
},
"score": "4",
"winner": true
},
{
"id": "5",
"uid": "s:70~l:90~t:5",
"type": "team",
"order": 1,
"homeAway": "away",
"team": {
"id": "5",
"uid": "s:70~l:90~t:5",
"location": "Pittsburgh",
"name": "Penguins",
"abbreviation": "PIT",
"displayName": "Pittsburgh Penguins"
},
"score": "2",
"winner": false
}
],
"status": {
"clock": 0,
"displayClock": "0:00",
"period": 3,
"type": {
"id": "3",
"name": "STATUS_FINAL",
"state": "post",
"completed": true
}
}
}
]
},
{
"id": "401671902",
"uid": "s:70~l:90~e:401671902",
"date": "2025-10-09T00:00:00Z",
"name": "Toronto Maple Leafs at Montreal Canadiens",
"shortName": "TOR @ MTL",
"competitions": [
{
"id": "401671902",
"uid": "s:70~l:90~e:401671902~c:401671902",
"date": "2025-10-09T00:00:00Z",
"type": {
"id": "1",
"abbreviation": "STD"
},
"venue": {
"id": "1918",
"fullName": "Bell Centre",
"address": {
"city": "Montreal",
"state": "QC"
},
"capacity": 21302,
"indoor": true
},
"competitors": [
{
"id": "8",
"uid": "s:70~l:90~t:8",
"type": "team",
"order": 0,
"homeAway": "home",
"team": {
"id": "8",
"uid": "s:70~l:90~t:8",
"location": "Montreal",
"name": "Canadiens",
"abbreviation": "MTL",
"displayName": "Montreal Canadiens"
},
"score": "3",
"winner": false
},
{
"id": "10",
"uid": "s:70~l:90~t:10",
"type": "team",
"order": 1,
"homeAway": "away",
"team": {
"id": "10",
"uid": "s:70~l:90~t:10",
"location": "Toronto",
"name": "Maple Leafs",
"abbreviation": "TOR",
"displayName": "Toronto Maple Leafs"
},
"score": "5",
"winner": true
}
],
"status": {
"clock": 0,
"displayClock": "0:00",
"period": 3,
"type": {
"id": "3",
"name": "STATUS_FINAL",
"state": "post",
"completed": true
}
}
}
]
},
{
"id": "401671903",
"uid": "s:70~l:90~e:401671903",
"date": "2025-10-09T02:00:00Z",
"name": "Vegas Golden Knights at Los Angeles Kings",
"shortName": "VGK @ LAK",
"competitions": [
{
"id": "401671903",
"uid": "s:70~l:90~e:401671903~c:401671903",
"date": "2025-10-09T02:00:00Z",
"type": {
"id": "1",
"abbreviation": "STD"
},
"venue": {
"id": "1816",
"fullName": "Crypto.com Arena",
"address": {
"city": "Los Angeles",
"state": "CA"
},
"capacity": 18230,
"indoor": true
},
"competitors": [
{
"id": "26",
"uid": "s:70~l:90~t:26",
"type": "team",
"order": 0,
"homeAway": "home",
"team": {
"id": "26",
"uid": "s:70~l:90~t:26",
"location": "Los Angeles",
"name": "Kings",
"abbreviation": "LAK",
"displayName": "Los Angeles Kings"
},
"score": null,
"winner": null
},
{
"id": "54",
"uid": "s:70~l:90~t:54",
"type": "team",
"order": 1,
"homeAway": "away",
"team": {
"id": "54",
"uid": "s:70~l:90~t:54",
"location": "Vegas",
"name": "Golden Knights",
"abbreviation": "VGK",
"displayName": "Vegas Golden Knights"
},
"score": null,
"winner": null
}
],
"status": {
"clock": 0,
"displayClock": "0:00",
"period": 0,
"type": {
"id": "1",
"name": "STATUS_SCHEDULED",
"state": "pre",
"completed": false
}
}
}
]
}
]
}
@@ -0,0 +1,245 @@
{
"leagues": [
{
"id": "761",
"uid": "s:600~l:761",
"name": "National Women's Soccer League",
"abbreviation": "NWSL"
}
],
"season": {
"type": 2,
"year": 2026
},
"day": {
"date": "2026-04-10T00:00:00Z"
},
"events": [
{
"id": "401672201",
"uid": "s:600~l:761~e:401672201",
"date": "2026-04-10T23:00:00Z",
"name": "Angel City FC at Portland Thorns",
"shortName": "LA @ POR",
"competitions": [
{
"id": "401672201",
"uid": "s:600~l:761~e:401672201~c:401672201",
"date": "2026-04-10T23:00:00Z",
"attendance": 22000,
"type": {
"id": "1",
"abbreviation": "STD"
},
"venue": {
"id": "8070",
"fullName": "Providence Park",
"address": {
"city": "Portland",
"state": "OR"
},
"capacity": 25218,
"indoor": false
},
"competitors": [
{
"id": "15625",
"uid": "s:600~l:761~t:15625",
"type": "team",
"order": 0,
"homeAway": "home",
"team": {
"id": "15625",
"uid": "s:600~l:761~t:15625",
"location": "Portland",
"name": "Thorns FC",
"abbreviation": "POR",
"displayName": "Portland Thorns FC"
},
"score": "2",
"winner": true
},
{
"id": "19934",
"uid": "s:600~l:761~t:19934",
"type": "team",
"order": 1,
"homeAway": "away",
"team": {
"id": "19934",
"uid": "s:600~l:761~t:19934",
"location": "Los Angeles",
"name": "Angel City",
"abbreviation": "LA",
"displayName": "Angel City FC"
},
"score": "1",
"winner": false
}
],
"status": {
"clock": 90,
"displayClock": "90'",
"period": 2,
"type": {
"id": "3",
"name": "STATUS_FINAL",
"state": "post",
"completed": true
}
}
}
]
},
{
"id": "401672202",
"uid": "s:600~l:761~e:401672202",
"date": "2026-04-11T00:00:00Z",
"name": "Orlando Pride at North Carolina Courage",
"shortName": "ORL @ NC",
"competitions": [
{
"id": "401672202",
"uid": "s:600~l:761~e:401672202~c:401672202",
"date": "2026-04-11T00:00:00Z",
"type": {
"id": "1",
"abbreviation": "STD"
},
"venue": {
"id": "8073",
"fullName": "WakeMed Soccer Park",
"address": {
"city": "Cary",
"state": "NC"
},
"capacity": 10000,
"indoor": false
},
"competitors": [
{
"id": "15618",
"uid": "s:600~l:761~t:15618",
"type": "team",
"order": 0,
"homeAway": "home",
"team": {
"id": "15618",
"uid": "s:600~l:761~t:15618",
"location": "North Carolina",
"name": "Courage",
"abbreviation": "NC",
"displayName": "North Carolina Courage"
},
"score": "3",
"winner": true
},
{
"id": "15626",
"uid": "s:600~l:761~t:15626",
"type": "team",
"order": 1,
"homeAway": "away",
"team": {
"id": "15626",
"uid": "s:600~l:761~t:15626",
"location": "Orlando",
"name": "Pride",
"abbreviation": "ORL",
"displayName": "Orlando Pride"
},
"score": "1",
"winner": false
}
],
"status": {
"clock": 90,
"displayClock": "90'",
"period": 2,
"type": {
"id": "3",
"name": "STATUS_FINAL",
"state": "post",
"completed": true
}
}
}
]
},
{
"id": "401672203",
"uid": "s:600~l:761~e:401672203",
"date": "2026-04-11T02:00:00Z",
"name": "San Diego Wave at Bay FC",
"shortName": "SD @ BAY",
"competitions": [
{
"id": "401672203",
"uid": "s:600~l:761~e:401672203~c:401672203",
"date": "2026-04-11T02:00:00Z",
"type": {
"id": "1",
"abbreviation": "STD"
},
"venue": {
"id": "3945",
"fullName": "PayPal Park",
"address": {
"city": "San Jose",
"state": "CA"
},
"capacity": 18000,
"indoor": false
},
"competitors": [
{
"id": "25645",
"uid": "s:600~l:761~t:25645",
"type": "team",
"order": 0,
"homeAway": "home",
"team": {
"id": "25645",
"uid": "s:600~l:761~t:25645",
"location": "Bay Area",
"name": "FC",
"abbreviation": "BAY",
"displayName": "Bay FC"
},
"score": null,
"winner": null
},
{
"id": "22638",
"uid": "s:600~l:761~t:22638",
"type": "team",
"order": 1,
"homeAway": "away",
"team": {
"id": "22638",
"uid": "s:600~l:761~t:22638",
"location": "San Diego",
"name": "Wave FC",
"abbreviation": "SD",
"displayName": "San Diego Wave FC"
},
"score": null,
"winner": null
}
],
"status": {
"clock": 0,
"displayClock": "0'",
"period": 0,
"type": {
"id": "1",
"name": "STATUS_SCHEDULED",
"state": "pre",
"completed": false
}
}
}
]
}
]
}
@@ -0,0 +1,245 @@
{
"leagues": [
{
"id": "59",
"uid": "s:40~l:59",
"name": "Women's National Basketball Association",
"abbreviation": "WNBA"
}
],
"season": {
"type": 2,
"year": 2026
},
"day": {
"date": "2026-05-20T00:00:00Z"
},
"events": [
{
"id": "401672101",
"uid": "s:40~l:59~e:401672101",
"date": "2026-05-20T23:00:00Z",
"name": "Las Vegas Aces at New York Liberty",
"shortName": "LV @ NY",
"competitions": [
{
"id": "401672101",
"uid": "s:40~l:59~e:401672101~c:401672101",
"date": "2026-05-20T23:00:00Z",
"attendance": 17732,
"type": {
"id": "1",
"abbreviation": "STD"
},
"venue": {
"id": "4346",
"fullName": "Barclays Center",
"address": {
"city": "Brooklyn",
"state": "NY"
},
"capacity": 17732,
"indoor": true
},
"competitors": [
{
"id": "9",
"uid": "s:40~l:59~t:9",
"type": "team",
"order": 0,
"homeAway": "home",
"team": {
"id": "9",
"uid": "s:40~l:59~t:9",
"location": "New York",
"name": "Liberty",
"abbreviation": "NY",
"displayName": "New York Liberty"
},
"score": "92",
"winner": true
},
{
"id": "20",
"uid": "s:40~l:59~t:20",
"type": "team",
"order": 1,
"homeAway": "away",
"team": {
"id": "20",
"uid": "s:40~l:59~t:20",
"location": "Las Vegas",
"name": "Aces",
"abbreviation": "LV",
"displayName": "Las Vegas Aces"
},
"score": "88",
"winner": false
}
],
"status": {
"clock": 0,
"displayClock": "0:00",
"period": 4,
"type": {
"id": "3",
"name": "STATUS_FINAL",
"state": "post",
"completed": true
}
}
}
]
},
{
"id": "401672102",
"uid": "s:40~l:59~e:401672102",
"date": "2026-05-21T00:00:00Z",
"name": "Connecticut Sun at Chicago Sky",
"shortName": "CONN @ CHI",
"competitions": [
{
"id": "401672102",
"uid": "s:40~l:59~e:401672102~c:401672102",
"date": "2026-05-21T00:00:00Z",
"type": {
"id": "1",
"abbreviation": "STD"
},
"venue": {
"id": "8086",
"fullName": "Wintrust Arena",
"address": {
"city": "Chicago",
"state": "IL"
},
"capacity": 10387,
"indoor": true
},
"competitors": [
{
"id": "6",
"uid": "s:40~l:59~t:6",
"type": "team",
"order": 0,
"homeAway": "home",
"team": {
"id": "6",
"uid": "s:40~l:59~t:6",
"location": "Chicago",
"name": "Sky",
"abbreviation": "CHI",
"displayName": "Chicago Sky"
},
"score": "78",
"winner": false
},
{
"id": "5",
"uid": "s:40~l:59~t:5",
"type": "team",
"order": 1,
"homeAway": "away",
"team": {
"id": "5",
"uid": "s:40~l:59~t:5",
"location": "Connecticut",
"name": "Sun",
"abbreviation": "CONN",
"displayName": "Connecticut Sun"
},
"score": "85",
"winner": true
}
],
"status": {
"clock": 0,
"displayClock": "0:00",
"period": 4,
"type": {
"id": "3",
"name": "STATUS_FINAL",
"state": "post",
"completed": true
}
}
}
]
},
{
"id": "401672103",
"uid": "s:40~l:59~e:401672103",
"date": "2026-05-21T02:00:00Z",
"name": "Phoenix Mercury at Seattle Storm",
"shortName": "PHX @ SEA",
"competitions": [
{
"id": "401672103",
"uid": "s:40~l:59~e:401672103~c:401672103",
"date": "2026-05-21T02:00:00Z",
"type": {
"id": "1",
"abbreviation": "STD"
},
"venue": {
"id": "3097",
"fullName": "Climate Pledge Arena",
"address": {
"city": "Seattle",
"state": "WA"
},
"capacity": 18100,
"indoor": true
},
"competitors": [
{
"id": "11",
"uid": "s:40~l:59~t:11",
"type": "team",
"order": 0,
"homeAway": "home",
"team": {
"id": "11",
"uid": "s:40~l:59~t:11",
"location": "Seattle",
"name": "Storm",
"abbreviation": "SEA",
"displayName": "Seattle Storm"
},
"score": null,
"winner": null
},
{
"id": "8",
"uid": "s:40~l:59~t:8",
"type": "team",
"order": 1,
"homeAway": "away",
"team": {
"id": "8",
"uid": "s:40~l:59~t:8",
"location": "Phoenix",
"name": "Mercury",
"abbreviation": "PHX",
"displayName": "Phoenix Mercury"
},
"score": null,
"winner": null
}
],
"status": {
"clock": 0,
"displayClock": "0:00",
"period": 0,
"type": {
"id": "1",
"name": "STATUS_SCHEDULED",
"state": "pre",
"completed": false
}
}
}
]
}
]
}
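The three scoreboard fixtures above share one shape: `events[].competitions[]` holds a `venue`, two `competitors` tagged by `homeAway`, and a `status.type.state` of `pre` or `post`. A minimal sketch of flattening that shape follows. The function name `parse_scoreboard`, the helper `_to_int`, and the output keys are hypothetical and are not the package's `_parse_espn_response`; only the JSON field names come from the fixtures.

```python
from typing import Any, Optional

# Only the two states that appear in the fixtures are mapped; anything else
# is reported as "unknown" rather than guessed.
_STATE_TO_STATUS = {"pre": "scheduled", "post": "final"}


def _to_int(value: Optional[str]) -> Optional[int]:
    """Scores arrive as strings ("92") or null for games not yet played."""
    return int(value) if value is not None else None


def parse_scoreboard(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten an ESPN-style scoreboard payload into one dict per game."""
    games: list[dict[str, Any]] = []
    for event in payload.get("events", []):
        for comp in event.get("competitions", []):
            # Index competitors by their "homeAway" tag instead of by position.
            sides = {c.get("homeAway"): c for c in comp.get("competitors", [])}
            home, away = sides.get("home"), sides.get("away")
            if home is None or away is None:
                continue  # malformed entry: skip rather than guess
            state = comp.get("status", {}).get("type", {}).get("state")
            games.append(
                {
                    "date": comp.get("date"),
                    "home": home["team"]["displayName"],
                    "away": away["team"]["displayName"],
                    "home_score": _to_int(home.get("score")),
                    "away_score": _to_int(away.get("score")),
                    "stadium": comp.get("venue", {}).get("fullName"),
                    "status": _STATE_TO_STATUS.get(state, "unknown"),
                }
            )
    return games
```

Keying on `homeAway` rather than on `order` keeps the sketch independent of the competitor ordering, which the fixtures happen to keep consistent but which the sketch does not rely on.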
@@ -0,0 +1,269 @@
"""Tests for alias loaders."""
import pytest
import json
import tempfile
from datetime import date
from pathlib import Path
from sportstime_parser.normalizers.alias_loader import (
TeamAliasLoader,
StadiumAliasLoader,
)
from sportstime_parser.models.aliases import AliasType
class TestTeamAliasLoader:
"""Tests for TeamAliasLoader class."""
@pytest.fixture
def sample_aliases_file(self):
"""Create a temporary aliases file for testing."""
data = [
{
"id": "1",
"team_canonical_id": "nba_okc",
"alias_type": "name",
"alias_value": "Seattle SuperSonics",
"valid_from": "1967-01-01",
"valid_until": "2008-07-02",
},
{
"id": "2",
"team_canonical_id": "nba_okc",
"alias_type": "name",
"alias_value": "Oklahoma City Thunder",
"valid_from": "2008-07-03",
"valid_until": None,
},
{
"id": "3",
"team_canonical_id": "nba_okc",
"alias_type": "abbreviation",
"alias_value": "OKC",
"valid_from": "2008-07-03",
"valid_until": None,
},
]
with tempfile.NamedTemporaryFile(
mode="w", suffix=".json", delete=False
) as f:
json.dump(data, f)
return Path(f.name)
def test_load_aliases(self, sample_aliases_file):
"""Test loading aliases from file."""
loader = TeamAliasLoader(sample_aliases_file)
loader.load()
assert len(loader._aliases) == 3
def test_resolve_current_alias(self, sample_aliases_file):
"""Test resolving a current alias."""
loader = TeamAliasLoader(sample_aliases_file)
# Current date should resolve to Thunder
result = loader.resolve("Oklahoma City Thunder")
assert result == "nba_okc"
# Abbreviation should also work
result = loader.resolve("OKC")
assert result == "nba_okc"
def test_resolve_historical_alias(self, sample_aliases_file):
"""Test resolving a historical alias with date."""
loader = TeamAliasLoader(sample_aliases_file)
# Historical date should resolve SuperSonics
result = loader.resolve("Seattle SuperSonics", check_date=date(2007, 1, 1))
assert result == "nba_okc"
# After relocation, SuperSonics shouldn't resolve
result = loader.resolve("Seattle SuperSonics", check_date=date(2010, 1, 1))
assert result is None
def test_resolve_case_insensitive(self, sample_aliases_file):
"""Test case insensitive resolution."""
loader = TeamAliasLoader(sample_aliases_file)
result = loader.resolve("oklahoma city thunder")
assert result == "nba_okc"
result = loader.resolve("okc")
assert result == "nba_okc"
def test_resolve_with_type_filter(self, sample_aliases_file):
"""Test filtering by alias type."""
loader = TeamAliasLoader(sample_aliases_file)
# Should find when searching all types
result = loader.resolve("OKC")
assert result == "nba_okc"
# Should not find when filtering to name only
result = loader.resolve("OKC", alias_types=[AliasType.NAME])
assert result is None
def test_get_aliases_for_team(self, sample_aliases_file):
"""Test getting all aliases for a team."""
loader = TeamAliasLoader(sample_aliases_file)
aliases = loader.get_aliases_for_team("nba_okc")
assert len(aliases) == 3
# Filter by a specific date (2020): the SuperSonics alias has expired by then
aliases = loader.get_aliases_for_team(
"nba_okc", check_date=date(2020, 1, 1)
)
assert len(aliases) == 2 # Thunder name + OKC abbreviation
def test_missing_file(self):
"""Test handling of missing file."""
loader = TeamAliasLoader(Path("/nonexistent/file.json"))
loader.load() # Should not raise
assert len(loader._aliases) == 0
class TestStadiumAliasLoader:
"""Tests for StadiumAliasLoader class."""
@pytest.fixture
def sample_stadium_aliases(self):
"""Create a temporary stadium aliases file."""
data = [
{
"alias_name": "Crypto.com Arena",
"stadium_canonical_id": "crypto_arena_los_angeles_ca",
"valid_from": "2021-12-25",
"valid_until": None,
},
{
"alias_name": "Staples Center",
"stadium_canonical_id": "crypto_arena_los_angeles_ca",
"valid_from": "1999-10-17",
"valid_until": "2021-12-24",
},
]
with tempfile.NamedTemporaryFile(
mode="w", suffix=".json", delete=False
) as f:
json.dump(data, f)
return Path(f.name)
def test_load_stadium_aliases(self, sample_stadium_aliases):
"""Test loading stadium aliases."""
loader = StadiumAliasLoader(sample_stadium_aliases)
loader.load()
assert len(loader._aliases) == 2
def test_resolve_current_name(self, sample_stadium_aliases):
"""Test resolving current stadium name."""
loader = StadiumAliasLoader(sample_stadium_aliases)
result = loader.resolve("Crypto.com Arena")
assert result == "crypto_arena_los_angeles_ca"
def test_resolve_historical_name(self, sample_stadium_aliases):
"""Test resolving historical stadium name."""
loader = StadiumAliasLoader(sample_stadium_aliases)
# Staples Center in 2020
result = loader.resolve("Staples Center", check_date=date(2020, 1, 1))
assert result == "crypto_arena_los_angeles_ca"
# Staples Center after rename shouldn't resolve
result = loader.resolve("Staples Center", check_date=date(2023, 1, 1))
assert result is None
def test_date_boundary(self, sample_stadium_aliases):
"""Test exact date boundaries."""
loader = StadiumAliasLoader(sample_stadium_aliases)
# Last day of Staples Center
result = loader.resolve("Staples Center", check_date=date(2021, 12, 24))
assert result == "crypto_arena_los_angeles_ca"
# First day of Crypto.com Arena
result = loader.resolve("Crypto.com Arena", check_date=date(2021, 12, 25))
assert result == "crypto_arena_los_angeles_ca"
def test_get_all_names(self, sample_stadium_aliases):
"""Test getting all stadium names."""
loader = StadiumAliasLoader(sample_stadium_aliases)
names = loader.get_all_names()
assert len(names) == 2
assert "Crypto.com Arena" in names
assert "Staples Center" in names
class TestDateRangeHandling:
"""Tests for date range edge cases in aliases."""
@pytest.fixture
def date_range_aliases(self):
"""Create aliases with various date range scenarios."""
data = [
{
"id": "1",
"team_canonical_id": "test_team",
"alias_type": "name",
"alias_value": "Always Valid",
"valid_from": None,
"valid_until": None,
},
{
"id": "2",
"team_canonical_id": "test_team",
"alias_type": "name",
"alias_value": "Future Only",
"valid_from": "2030-01-01",
"valid_until": None,
},
{
"id": "3",
"team_canonical_id": "test_team",
"alias_type": "name",
"alias_value": "Past Only",
"valid_from": None,
"valid_until": "2000-01-01",
},
]
with tempfile.NamedTemporaryFile(
mode="w", suffix=".json", delete=False
) as f:
json.dump(data, f)
return Path(f.name)
def test_always_valid_alias(self, date_range_aliases):
"""Test alias with no date restrictions."""
loader = TeamAliasLoader(date_range_aliases)
result = loader.resolve("Always Valid", check_date=date(2025, 1, 1))
assert result == "test_team"
result = loader.resolve("Always Valid", check_date=date(1990, 1, 1))
assert result == "test_team"
def test_future_only_alias(self, date_range_aliases):
"""Test alias that starts in the future."""
loader = TeamAliasLoader(date_range_aliases)
# Before valid_from
result = loader.resolve("Future Only", check_date=date(2025, 1, 1))
assert result is None
# After valid_from
result = loader.resolve("Future Only", check_date=date(2035, 1, 1))
assert result == "test_team"
def test_past_only_alias(self, date_range_aliases):
"""Test alias that expired in the past."""
loader = TeamAliasLoader(date_range_aliases)
# Before valid_until
result = loader.resolve("Past Only", check_date=date(1990, 1, 1))
assert result == "test_team"
# After valid_until
result = loader.resolve("Past Only", check_date=date(2025, 1, 1))
assert result is None
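The alias tests above imply a date-window rule: an alias resolves only when the check date falls inside `valid_from`..`valid_until`, both ends are inclusive, `None` means unbounded, and name comparison ignores case. A minimal sketch of that rule follows. The `Alias` dataclass and `resolve_alias` function are hypothetical stand-ins, not the package's `TeamAliasLoader` or `StadiumAliasLoader`.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class Alias:
    """Hypothetical alias record; None on either bound means open-ended."""

    value: str
    canonical_id: str
    valid_from: Optional[date] = None
    valid_until: Optional[date] = None

    def is_valid_on(self, day: date) -> bool:
        # Both bounds are inclusive, matching the boundary test above.
        if self.valid_from is not None and day < self.valid_from:
            return False
        if self.valid_until is not None and day > self.valid_until:
            return False
        return True


def resolve_alias(
    name: str, aliases: Iterable[Alias], check_date: Optional[date] = None
) -> Optional[str]:
    """Return the canonical ID for name on check_date (default: today)."""
    day = check_date or date.today()
    key = name.strip().lower()
    for alias in aliases:
        if alias.value.lower() == key and alias.is_valid_on(day):
            return alias.canonical_id
    return None
```

A linear scan is enough for fixture-sized data; a real loader would more likely index aliases by lowercased value, which this sketch leaves out.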
@@ -0,0 +1,183 @@
"""Tests for canonical ID generation."""
import pytest
from datetime import datetime, date
from sportstime_parser.normalizers.canonical_id import (
generate_game_id,
generate_team_id,
generate_team_id_from_abbrev,
generate_stadium_id,
parse_game_id,
normalize_string,
)
class TestNormalizeString:
"""Tests for normalize_string function."""
def test_basic_normalization(self):
"""Test basic string normalization."""
assert normalize_string("New York") == "new_york"
assert normalize_string("Los Angeles") == "los_angeles"
def test_removes_special_characters(self):
"""Test that special characters are removed."""
assert normalize_string("AT&T Stadium") == "att_stadium"
assert normalize_string("St. Louis") == "st_louis"
assert normalize_string("O'Brien Field") == "obrien_field"
def test_collapses_whitespace(self):
"""Test that multiple spaces are collapsed."""
assert normalize_string("New   York") == "new_york"
assert normalize_string("  Los Angeles  ") == "los_angeles"
def test_empty_string(self):
"""Test empty string handling."""
assert normalize_string("") == ""
assert normalize_string(" ") == ""
def test_unicode_normalization(self):
"""Test unicode characters are handled."""
assert normalize_string("Café") == "cafe"
assert normalize_string("José") == "jose"
class TestGenerateGameId:
"""Tests for generate_game_id function."""
def test_basic_game_id(self):
"""Test basic game ID generation."""
game_id = generate_game_id(
sport="nba",
season=2025,
away_abbrev="bos",
home_abbrev="lal",
game_date=date(2025, 12, 25),
)
assert game_id == "nba_2025_bos_lal_1225"
def test_game_id_with_datetime(self):
"""Test game ID generation with datetime object."""
game_id = generate_game_id(
sport="mlb",
season=2026,
away_abbrev="nyy",
home_abbrev="bos",
game_date=datetime(2026, 4, 1, 19, 0),
)
assert game_id == "mlb_2026_nyy_bos_0401"
def test_game_id_with_game_number(self):
"""Test game ID for doubleheader."""
game_id_1 = generate_game_id(
sport="mlb",
season=2026,
away_abbrev="nyy",
home_abbrev="bos",
game_date=date(2026, 7, 4),
game_number=1,
)
game_id_2 = generate_game_id(
sport="mlb",
season=2026,
away_abbrev="nyy",
home_abbrev="bos",
game_date=date(2026, 7, 4),
game_number=2,
)
assert game_id_1 == "mlb_2026_nyy_bos_0704_1"
assert game_id_2 == "mlb_2026_nyy_bos_0704_2"
def test_sport_lowercased(self):
"""Test that sport is lowercased."""
game_id = generate_game_id(
sport="NBA",
season=2025,
away_abbrev="BOS",
home_abbrev="LAL",
game_date=date(2025, 12, 25),
)
assert game_id == "nba_2025_bos_lal_1225"
class TestParseGameId:
"""Tests for parse_game_id function."""
def test_parse_basic_game_id(self):
"""Test parsing a basic game ID."""
parsed = parse_game_id("nba_2025_bos_lal_1225")
assert parsed["sport"] == "nba"
assert parsed["season"] == 2025
assert parsed["away_abbrev"] == "bos"
assert parsed["home_abbrev"] == "lal"
assert parsed["month"] == 12
assert parsed["day"] == 25
assert parsed["game_number"] is None
def test_parse_game_id_with_game_number(self):
"""Test parsing game ID with game number."""
parsed = parse_game_id("mlb_2026_nyy_bos_0704_2")
assert parsed["sport"] == "mlb"
assert parsed["season"] == 2026
assert parsed["away_abbrev"] == "nyy"
assert parsed["home_abbrev"] == "bos"
assert parsed["month"] == 7
assert parsed["day"] == 4
assert parsed["game_number"] == 2
def test_parse_invalid_game_id(self):
"""Test parsing invalid game ID raises error."""
with pytest.raises(ValueError):
parse_game_id("invalid")
with pytest.raises(ValueError):
parse_game_id("nba_2025_bos")
with pytest.raises(ValueError):
parse_game_id("")
class TestGenerateTeamId:
"""Tests for generate_team_id function."""
def test_basic_team_id(self):
"""Test basic team ID generation from city and name."""
team_id = generate_team_id(sport="nba", city="Los Angeles", name="Lakers")
assert team_id == "team_nba_los_angeles_lakers"
def test_team_id_normalizes_input(self):
"""Test that inputs are normalized."""
team_id = generate_team_id(sport="NBA", city="New York", name="Knicks")
assert team_id == "team_nba_new_york_knicks"
class TestGenerateTeamIdFromAbbrev:
"""Tests for generate_team_id_from_abbrev function."""
def test_basic_team_id_from_abbrev(self):
"""Test team ID from abbreviation."""
team_id = generate_team_id_from_abbrev(sport="nba", abbreviation="LAL")
assert team_id == "team_nba_lal"
def test_lowercases_abbreviation(self):
"""Test abbreviation is lowercased."""
team_id = generate_team_id_from_abbrev(sport="MLB", abbreviation="NYY")
assert team_id == "team_mlb_nyy"
class TestGenerateStadiumId:
"""Tests for generate_stadium_id function."""
def test_basic_stadium_id(self):
"""Test basic stadium ID generation."""
stadium_id = generate_stadium_id(sport="mlb", name="Fenway Park")
assert stadium_id == "stadium_mlb_fenway_park"
def test_stadium_id_special_characters(self):
"""Test stadium ID with special characters."""
stadium_id = generate_stadium_id(sport="nfl", name="AT&T Stadium")
assert stadium_id == "stadium_nfl_att_stadium"
def test_stadium_id_with_sponsor(self):
"""Test stadium ID with sponsor name."""
stadium_id = generate_stadium_id(sport="nba", name="Crypto.com Arena")
assert stadium_id == "stadium_nba_cryptocom_arena"
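The assertions above pin down the ID formats closely enough to sketch them: strings are ASCII-folded, lowercased, stripped of punctuation, and joined with underscores, and a game ID is `sport_season_away_home_MMDD` with an optional trailing game number. The code below is one way to satisfy those assertions. It reuses the tested function names for readability but is an assumption, not the contents of `sportstime_parser.normalizers.canonical_id`.

```python
import re
import unicodedata
from datetime import date
from typing import Optional


def normalize_string(value: str) -> str:
    """Fold accents, lowercase, drop punctuation, join words with '_'."""
    # NFKD splits "é" into "e" plus a combining mark; the ASCII encode drops the mark.
    folded = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    cleaned = re.sub(r"[^a-z0-9\s]", "", folded.lower())
    # split() with no argument collapses runs of whitespace and trims the ends.
    return "_".join(cleaned.split())


def generate_game_id(
    sport: str,
    season: int,
    away_abbrev: str,
    home_abbrev: str,
    game_date: date,  # datetime is a date subclass, so both work
    game_number: Optional[int] = None,
) -> str:
    parts = [
        sport.lower(),
        str(season),
        away_abbrev.lower(),
        home_abbrev.lower(),
        game_date.strftime("%m%d"),
    ]
    if game_number is not None:
        parts.append(str(game_number))  # doubleheader suffix
    return "_".join(parts)


def parse_game_id(game_id: str) -> dict:
    """Inverse of generate_game_id; raises ValueError on malformed input."""
    parts = game_id.split("_")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected game ID shape: {game_id!r}")
    sport, season, away, home, mmdd = parts[:5]
    if not season.isdigit() or len(mmdd) != 4 or not mmdd.isdigit():
        raise ValueError(f"bad season or date in game ID: {game_id!r}")
    return {
        "sport": sport,
        "season": int(season),
        "away_abbrev": away,
        "home_abbrev": home,
        "month": int(mmdd[:2]),
        "day": int(mmdd[2:]),
        "game_number": int(parts[5]) if len(parts) == 6 else None,
    }
```

Because the ID is split on underscores, this sketch assumes abbreviations never contain one; the tests above only use plain three-letter codes.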
@@ -0,0 +1,194 @@
"""Tests for fuzzy string matching utilities."""
import pytest
from sportstime_parser.normalizers.fuzzy import (
normalize_for_matching,
fuzzy_match_team,
fuzzy_match_stadium,
exact_match,
best_match,
calculate_similarity,
MatchCandidate,
)
class TestNormalizeForMatching:
"""Tests for normalize_for_matching function."""
def test_basic_normalization(self):
"""Test basic string normalization."""
assert normalize_for_matching("Los Angeles Lakers") == "los angeles lakers"
assert normalize_for_matching(" Boston Celtics ") == "boston celtics"
def test_removes_common_prefixes(self):
"""Test removal of common prefixes."""
assert normalize_for_matching("The Boston Celtics") == "boston celtics"
assert normalize_for_matching("Team Lakers") == "lakers"
def test_removes_stadium_suffixes(self):
"""Test removal of stadium-related suffixes."""
assert normalize_for_matching("Fenway Park") == "fenway"
assert normalize_for_matching("Madison Square Garden Arena") == "madison square garden"
assert normalize_for_matching("Wrigley Field") == "wrigley"
assert normalize_for_matching("TD Garden Center") == "td garden"
class TestExactMatch:
"""Tests for exact_match function."""
def test_exact_match_primary_name(self):
"""Test exact match on primary name."""
candidates = [
MatchCandidate("nba_lal", "Los Angeles Lakers", ["Lakers", "LAL"]),
MatchCandidate("nba_bos", "Boston Celtics", ["Celtics", "BOS"]),
]
assert exact_match("Los Angeles Lakers", candidates) == "nba_lal"
assert exact_match("Boston Celtics", candidates) == "nba_bos"
def test_exact_match_alias(self):
"""Test exact match on alias."""
candidates = [
MatchCandidate("nba_lal", "Los Angeles Lakers", ["Lakers", "LAL"]),
]
assert exact_match("Lakers", candidates) == "nba_lal"
assert exact_match("LAL", candidates) == "nba_lal"
def test_case_insensitive(self):
"""Test case insensitive matching."""
candidates = [
MatchCandidate("nba_lal", "Los Angeles Lakers", ["Lakers"]),
]
assert exact_match("los angeles lakers", candidates) == "nba_lal"
assert exact_match("LAKERS", candidates) == "nba_lal"
def test_no_match(self):
"""Test no match returns None."""
candidates = [
MatchCandidate("nba_lal", "Los Angeles Lakers", ["Lakers"]),
]
assert exact_match("New York Knicks", candidates) is None
class TestFuzzyMatchTeam:
"""Tests for fuzzy_match_team function."""
def test_close_match(self):
"""Test fuzzy matching finds close matches."""
candidates = [
MatchCandidate("nba_lal", "Los Angeles Lakers", ["Lakers", "LA Lakers"]),
MatchCandidate("nba_lac", "Los Angeles Clippers", ["Clippers", "LA Clippers"]),
]
matches = fuzzy_match_team("LA Lakers", candidates, threshold=70)
assert len(matches) > 0
assert matches[0].canonical_id == "nba_lal"
def test_partial_name_match(self):
"""Test matching on partial team name."""
candidates = [
MatchCandidate("nba_bos", "Boston Celtics", ["Celtics", "BOS"]),
]
matches = fuzzy_match_team("Celtics", candidates, threshold=80)
assert len(matches) > 0
assert matches[0].canonical_id == "nba_bos"
def test_threshold_filtering(self):
"""Test that threshold filters low-confidence matches."""
candidates = [
MatchCandidate("nba_bos", "Boston Celtics", []),
]
# Very different string should not match at high threshold
matches = fuzzy_match_team("xyz123", candidates, threshold=90)
assert len(matches) == 0
def test_returns_top_n(self):
"""Test that top_n parameter limits results."""
candidates = [
MatchCandidate("nba_lal", "Los Angeles Lakers", []),
MatchCandidate("nba_lac", "Los Angeles Clippers", []),
MatchCandidate("mlb_lad", "Los Angeles Dodgers", []),
]
matches = fuzzy_match_team("Los Angeles", candidates, threshold=50, top_n=2)
assert len(matches) <= 2
class TestFuzzyMatchStadium:
"""Tests for fuzzy_match_stadium function."""
def test_stadium_match(self):
"""Test fuzzy matching stadium names."""
candidates = [
MatchCandidate("fenway", "Fenway Park", ["Fenway"]),
MatchCandidate("td_garden", "TD Garden", ["Boston Garden"]),
]
matches = fuzzy_match_stadium("Fenway Park Boston", candidates, threshold=70)
assert len(matches) > 0
assert matches[0].canonical_id == "fenway"
def test_naming_rights_change(self):
"""Test matching old stadium names."""
candidates = [
MatchCandidate(
"chase_center",
"Chase Center",
["Oracle Arena", "Oakland Coliseum Arena"],
),
]
# Should match on alias
matches = fuzzy_match_stadium("Oracle Arena", candidates, threshold=70)
assert len(matches) > 0
class TestBestMatch:
"""Tests for best_match function."""
def test_prefers_exact_match(self):
"""Test that exact match is preferred over fuzzy."""
candidates = [
MatchCandidate("nba_lal", "Los Angeles Lakers", ["Lakers"]),
MatchCandidate("nba_bos", "Boston Celtics", ["Celtics"]),
]
result = best_match("Lakers", candidates)
assert result is not None
assert result.canonical_id == "nba_lal"
assert result.confidence == 100 # Exact match
def test_falls_back_to_fuzzy(self):
"""Test fallback to fuzzy when no exact match."""
candidates = [
MatchCandidate("nba_lal", "Los Angeles Lakers", ["Lakers"]),
]
result = best_match("LA Laker", candidates, threshold=70)
assert result is not None
assert result.confidence < 100 # Fuzzy match
def test_no_match_below_threshold(self):
"""Test returns None when no match above threshold."""
candidates = [
MatchCandidate("nba_lal", "Los Angeles Lakers", []),
]
result = best_match("xyz123", candidates, threshold=90)
assert result is None
class TestCalculateSimilarity:
"""Tests for calculate_similarity function."""
def test_identical_strings(self):
"""Test identical strings have 100% similarity."""
assert calculate_similarity("Boston Celtics", "Boston Celtics") == 100
def test_similar_strings(self):
"""Test similar strings have high similarity."""
score = calculate_similarity("Boston Celtics", "Celtics Boston")
assert score >= 90
def test_different_strings(self):
"""Test different strings have low similarity."""
score = calculate_similarity("Boston Celtics", "Los Angeles Lakers")
assert score < 50
def test_empty_string(self):
"""Test empty string handling."""
score = calculate_similarity("", "Boston Celtics")
assert score == 0
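The fuzzy tests above describe a two-stage lookup: try a case-insensitive exact match against the name and aliases first (confidence 100), then fall back to a 0-100 similarity score filtered by a threshold. The sketch below illustrates that flow using only the standard library's `difflib.SequenceMatcher` with token sorting. This is a swapped-in technique: the source does not show which scoring library the package uses, so scores will not necessarily agree with it. `Candidate`, `similarity`, and `best_match_sketch` are hypothetical names.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Candidate:
    """Hypothetical stand-in for the package's MatchCandidate."""

    canonical_id: str
    name: str
    aliases: list[str] = field(default_factory=list)


def similarity(a: str, b: str) -> int:
    """Order-insensitive similarity on a 0-100 scale; empty input scores 0."""
    if not a.strip() or not b.strip():
        return 0
    # Sorting tokens makes "Celtics Boston" compare equal to "Boston Celtics".
    left = " ".join(sorted(a.lower().split()))
    right = " ".join(sorted(b.lower().split()))
    return round(SequenceMatcher(None, left, right).ratio() * 100)


def best_match_sketch(
    query: str, candidates: list[Candidate], threshold: int = 80
) -> Optional[tuple[str, int]]:
    """Return (canonical_id, confidence), preferring exact over fuzzy hits."""
    key = query.strip().lower()
    best: Optional[tuple[str, int]] = None
    for cand in candidates:
        for label in [cand.name, *cand.aliases]:
            if label.lower() == key:
                return cand.canonical_id, 100  # exact match wins immediately
            score = similarity(query, label)
            if score >= threshold and (best is None or score > best[1]):
                best = (cand.canonical_id, score)
    return best
```

Scoring each alias separately, rather than only the primary name, is what lets a short query such as a nickname clear the threshold even when the full name would not.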
@@ -0,0 +1 @@
"""Tests for scrapers module."""
@@ -0,0 +1,257 @@
"""Tests for MLB scraper."""
from datetime import datetime
from unittest.mock import patch
import pytest
from sportstime_parser.scrapers.mlb import MLBScraper, create_mlb_scraper
from sportstime_parser.scrapers.base import RawGameData
from sportstime_parser.tests.fixtures import (
load_json_fixture,
MLB_ESPN_SCOREBOARD_JSON,
)
class TestMLBScraperInit:
"""Test MLBScraper initialization."""
def test_creates_scraper_with_season(self):
"""Test scraper initializes with correct season."""
scraper = MLBScraper(season=2026)
assert scraper.sport == "mlb"
assert scraper.season == 2026
def test_factory_function_creates_scraper(self):
"""Test factory function creates correct scraper."""
scraper = create_mlb_scraper(season=2026)
assert isinstance(scraper, MLBScraper)
assert scraper.season == 2026
def test_expected_game_count(self):
"""Test expected game count is correct for MLB."""
scraper = MLBScraper(season=2026)
assert scraper.expected_game_count == 2430
def test_sources_in_priority_order(self):
"""Test sources are returned in correct priority order."""
scraper = MLBScraper(season=2026)
sources = scraper._get_sources()
assert sources == ["baseball_reference", "mlb_api", "espn"]
class TestESPNParsing:
"""Test ESPN API response parsing."""
def test_parses_completed_games(self):
"""Test parsing completed games from ESPN."""
scraper = MLBScraper(season=2026)
data = load_json_fixture(MLB_ESPN_SCOREBOARD_JSON)
games = scraper._parse_espn_response(data, "http://espn.com/api")
completed = [g for g in games if g.status == "final"]
assert len(completed) == 2
# Yankees @ Red Sox
nyy_bos = next(g for g in completed if g.away_team_raw == "New York Yankees")
assert nyy_bos.home_team_raw == "Boston Red Sox"
assert nyy_bos.away_score == 3
assert nyy_bos.home_score == 5
assert nyy_bos.stadium_raw == "Fenway Park"
def test_parses_scheduled_games(self):
"""Test parsing scheduled games from ESPN."""
scraper = MLBScraper(season=2026)
data = load_json_fixture(MLB_ESPN_SCOREBOARD_JSON)
games = scraper._parse_espn_response(data, "http://espn.com/api")
scheduled = [g for g in games if g.status == "scheduled"]
assert len(scheduled) == 1
lad_sf = scheduled[0]
assert lad_sf.away_team_raw == "Los Angeles Dodgers"
assert lad_sf.home_team_raw == "San Francisco Giants"
assert lad_sf.stadium_raw == "Oracle Park"
def test_parses_venue_info(self):
"""Test venue information is extracted."""
scraper = MLBScraper(season=2026)
data = load_json_fixture(MLB_ESPN_SCOREBOARD_JSON)
games = scraper._parse_espn_response(data, "http://espn.com/api")
for game in games:
assert game.stadium_raw is not None
class TestGameNormalization:
"""Test game normalization and canonical ID generation."""
def test_normalizes_games_with_canonical_ids(self):
"""Test games are normalized with correct canonical IDs."""
scraper = MLBScraper(season=2026)
raw_games = [
RawGameData(
game_date=datetime(2026, 4, 15),
home_team_raw="Boston Red Sox",
away_team_raw="New York Yankees",
stadium_raw="Fenway Park",
home_score=5,
away_score=3,
status="final",
source_url="http://example.com",
)
]
games, review_items = scraper._normalize_games(raw_games)
assert len(games) == 1
game = games[0]
# Check canonical ID format
assert game.id == "mlb_2026_nyy_bos_0415"
assert game.sport == "mlb"
assert game.season == 2026
# Check team IDs
assert game.home_team_id == "team_mlb_bos"
assert game.away_team_id == "team_mlb_nyy"
# Check scores preserved
assert game.home_score == 5
assert game.away_score == 3
def test_creates_review_items_for_unresolved_teams(self):
"""Test review items are created for unresolved teams."""
scraper = MLBScraper(season=2026)
raw_games = [
RawGameData(
game_date=datetime(2026, 4, 15),
home_team_raw="Unknown Team XYZ",
away_team_raw="Boston Red Sox",
stadium_raw="Fenway Park",
status="scheduled",
),
]
games, review_items = scraper._normalize_games(raw_games)
# Game should not be created due to unresolved team
assert len(games) == 0
# But there should be a review item
assert len(review_items) >= 1
class TestTeamAndStadiumScraping:
"""Test team and stadium data scraping."""
def test_scrapes_all_mlb_teams(self):
"""Test all 30 MLB teams are returned."""
scraper = MLBScraper(season=2026)
teams = scraper.scrape_teams()
# 30 MLB teams
assert len(teams) == 30
# Check team IDs are unique
team_ids = [t.id for t in teams]
assert len(set(team_ids)) == 30
# Check all teams have required fields
for team in teams:
assert team.id.startswith("team_mlb_")
assert team.sport == "mlb"
assert team.city
assert team.name
assert team.full_name
assert team.abbreviation
def test_teams_have_leagues_and_divisions(self):
"""Test teams have league (conference) and division info."""
scraper = MLBScraper(season=2026)
teams = scraper.scrape_teams()
# Count teams by league
al = [t for t in teams if t.conference == "American"]
nl = [t for t in teams if t.conference == "National"]
assert len(al) == 15
assert len(nl) == 15
def test_scrapes_all_mlb_stadiums(self):
"""Test all MLB stadiums are returned."""
scraper = MLBScraper(season=2026)
stadiums = scraper.scrape_stadiums()
# Should have stadiums for all teams
assert len(stadiums) == 30
# Check stadium IDs are unique
stadium_ids = [s.id for s in stadiums]
assert len(set(stadium_ids)) == 30
# Check all stadiums have required fields
for stadium in stadiums:
assert stadium.id.startswith("stadium_mlb_")
assert stadium.sport == "mlb"
assert stadium.name
assert stadium.city
assert stadium.state
assert stadium.country in ["USA", "Canada"]
assert stadium.latitude != 0
assert stadium.longitude != 0
class TestScrapeFallback:
"""Test multi-source fallback behavior."""
def test_falls_back_to_next_source_on_failure(self):
"""Test scraper tries next source when first fails."""
scraper = MLBScraper(season=2026)
with patch.object(scraper, '_scrape_baseball_reference') as mock_br, \
patch.object(scraper, '_scrape_mlb_api') as mock_mlb, \
patch.object(scraper, '_scrape_espn') as mock_espn:
# Make BR and MLB API fail
mock_br.side_effect = Exception("Connection failed")
mock_mlb.side_effect = Exception("API error")
# Make ESPN return data
mock_espn.return_value = [
RawGameData(
game_date=datetime(2026, 4, 15),
home_team_raw="Boston Red Sox",
away_team_raw="New York Yankees",
stadium_raw="Fenway Park",
status="scheduled",
)
]
result = scraper.scrape_games()
assert result.success
assert result.source == "espn"
assert mock_br.called
assert mock_mlb.called
assert mock_espn.called
class TestSeasonMonths:
"""Test season month calculation."""
def test_gets_correct_season_months(self):
"""Test correct months are returned for MLB season."""
scraper = MLBScraper(season=2026)
months = scraper._get_season_months()
# MLB season is March-November
assert len(months) == 9 # Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov
# Check first month is March of season year
assert months[0] == (2026, 3)
# Check last month is November
assert months[-1] == (2026, 11)
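Review note: the fallback tests above mock each source method and expect the first source that returns data to win. Below is a minimal sketch of that loop, assuming a priority-ordered list of named fetch callables. `ScrapeOutcome` and `scrape_with_fallback` are hypothetical names used for illustration and are not the package's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ScrapeOutcome:
    """Hypothetical stand-in for the result object the tests inspect."""
    success: bool
    source: Optional[str] = None
    games: list = field(default_factory=list)
    error_message: str = ""


def scrape_with_fallback(sources: list[tuple[str, Callable[[], list]]]) -> ScrapeOutcome:
    """Try each (name, fetch) pair in priority order and return the first hit."""
    errors = []
    for name, fetch in sources:
        try:
            games = fetch()
        except Exception as exc:  # assumption: any error moves on to the next source
            errors.append(f"{name}: {exc}")
            continue
        if games:
            return ScrapeOutcome(success=True, source=name, games=games)
        errors.append(f"{name}: returned no games")
    # Every source either raised or came back empty.
    return ScrapeOutcome(
        success=False,
        error_message="All sources failed: " + "; ".join(errors),
    )
```

Collecting every error message, rather than only the last one, keeps the failure report useful when the sources fail for different reasons.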
@@ -0,0 +1,251 @@
"""Tests for MLS scraper."""
from datetime import datetime
from unittest.mock import patch
import pytest
from sportstime_parser.scrapers.mls import MLSScraper, create_mls_scraper
from sportstime_parser.scrapers.base import RawGameData
from sportstime_parser.tests.fixtures import (
load_json_fixture,
MLS_ESPN_SCOREBOARD_JSON,
)
class TestMLSScraperInit:
"""Test MLSScraper initialization."""
def test_creates_scraper_with_season(self):
"""Test scraper initializes with correct season."""
scraper = MLSScraper(season=2026)
assert scraper.sport == "mls"
assert scraper.season == 2026
def test_factory_function_creates_scraper(self):
"""Test factory function creates correct scraper."""
scraper = create_mls_scraper(season=2026)
assert isinstance(scraper, MLSScraper)
assert scraper.season == 2026
def test_expected_game_count(self):
"""Test expected game count is correct for MLS."""
scraper = MLSScraper(season=2026)
assert scraper.expected_game_count == 493
def test_sources_in_priority_order(self):
"""Test sources are returned in correct priority order."""
scraper = MLSScraper(season=2026)
sources = scraper._get_sources()
assert sources == ["espn", "fbref"]
class TestESPNParsing:
"""Test ESPN API response parsing."""
def test_parses_completed_games(self):
"""Test parsing completed games from ESPN."""
scraper = MLSScraper(season=2026)
data = load_json_fixture(MLS_ESPN_SCOREBOARD_JSON)
games = scraper._parse_espn_response(data, "http://espn.com/api")
completed = [g for g in games if g.status == "final"]
assert len(completed) == 2
# Galaxy @ LAFC
la_lafc = next(g for g in completed if g.away_team_raw == "LA Galaxy")
assert la_lafc.home_team_raw == "Los Angeles FC"
assert la_lafc.away_score == 2
assert la_lafc.home_score == 3
assert la_lafc.stadium_raw == "BMO Stadium"
def test_parses_scheduled_games(self):
"""Test parsing scheduled games from ESPN."""
scraper = MLSScraper(season=2026)
data = load_json_fixture(MLS_ESPN_SCOREBOARD_JSON)
games = scraper._parse_espn_response(data, "http://espn.com/api")
scheduled = [g for g in games if g.status == "scheduled"]
assert len(scheduled) == 1
ny_atl = scheduled[0]
assert ny_atl.away_team_raw == "New York Red Bulls"
assert ny_atl.home_team_raw == "Atlanta United FC"
assert ny_atl.stadium_raw == "Mercedes-Benz Stadium"
def test_parses_venue_info(self):
"""Test venue information is extracted."""
scraper = MLSScraper(season=2026)
data = load_json_fixture(MLS_ESPN_SCOREBOARD_JSON)
games = scraper._parse_espn_response(data, "http://espn.com/api")
for game in games:
assert game.stadium_raw is not None
class TestGameNormalization:
"""Test game normalization and canonical ID generation."""
def test_normalizes_games_with_canonical_ids(self):
"""Test games are normalized with correct canonical IDs."""
scraper = MLSScraper(season=2026)
raw_games = [
RawGameData(
game_date=datetime(2026, 3, 15),
home_team_raw="Los Angeles FC",
away_team_raw="LA Galaxy",
stadium_raw="BMO Stadium",
home_score=3,
away_score=2,
status="final",
source_url="http://example.com",
)
]
games, review_items = scraper._normalize_games(raw_games)
assert len(games) == 1
game = games[0]
# Check canonical ID format
assert game.id == "mls_2026_lag_lafc_0315"
assert game.sport == "mls"
assert game.season == 2026
# Check team IDs
assert game.home_team_id == "team_mls_lafc"
assert game.away_team_id == "team_mls_lag"
# Check scores preserved
assert game.home_score == 3
assert game.away_score == 2
def test_creates_review_items_for_unresolved_teams(self):
"""Test review items are created for unresolved teams."""
scraper = MLSScraper(season=2026)
raw_games = [
RawGameData(
game_date=datetime(2026, 3, 15),
home_team_raw="Unknown Team XYZ",
away_team_raw="LA Galaxy",
stadium_raw="BMO Stadium",
status="scheduled",
),
]
games, review_items = scraper._normalize_games(raw_games)
# Game should not be created due to unresolved team
assert len(games) == 0
# But there should be a review item
assert len(review_items) >= 1
class TestTeamAndStadiumScraping:
"""Test team and stadium data scraping."""
def test_scrapes_all_mls_teams(self):
"""Test all MLS teams are returned."""
scraper = MLSScraper(season=2026)
teams = scraper.scrape_teams()
# MLS has 29+ teams
assert len(teams) >= 29
# Check team IDs are unique
team_ids = [t.id for t in teams]
assert len(set(team_ids)) == len(teams)
# Check all teams have required fields
for team in teams:
assert team.id.startswith("team_mls_")
assert team.sport == "mls"
assert team.city
assert team.name
assert team.full_name
assert team.abbreviation
def test_teams_have_conferences(self):
"""Test teams have conference info."""
scraper = MLSScraper(season=2026)
teams = scraper.scrape_teams()
# Count teams by conference
eastern = [t for t in teams if t.conference == "Eastern"]
western = [t for t in teams if t.conference == "Western"]
# MLS has two conferences
assert len(eastern) >= 14
assert len(western) >= 14
def test_scrapes_all_mls_stadiums(self):
"""Test all MLS stadiums are returned."""
scraper = MLSScraper(season=2026)
stadiums = scraper.scrape_stadiums()
# Should have stadiums for all teams
assert len(stadiums) >= 29
# Check all stadiums have required fields
for stadium in stadiums:
assert stadium.id.startswith("stadium_mls_")
assert stadium.sport == "mls"
assert stadium.name
assert stadium.city
assert stadium.state
assert stadium.country in ["USA", "Canada"]
assert stadium.latitude != 0
assert stadium.longitude != 0
class TestScrapeFallback:
"""Test multi-source fallback behavior."""
def test_falls_back_to_next_source_on_failure(self):
"""Test scraper tries next source when first fails."""
scraper = MLSScraper(season=2026)
with patch.object(scraper, '_scrape_espn') as mock_espn, \
patch.object(scraper, '_scrape_fbref') as mock_fbref:
# Make ESPN fail
mock_espn.side_effect = Exception("Connection failed")
# Make FBref return data
mock_fbref.return_value = [
RawGameData(
game_date=datetime(2026, 3, 15),
home_team_raw="Los Angeles FC",
away_team_raw="LA Galaxy",
stadium_raw="BMO Stadium",
status="scheduled",
)
]
result = scraper.scrape_games()
assert result.success
assert result.source == "fbref"
assert mock_espn.called
assert mock_fbref.called
class TestSeasonMonths:
"""Test season month calculation."""
def test_gets_correct_season_months(self):
"""Test correct months are returned for MLS season."""
scraper = MLSScraper(season=2026)
months = scraper._get_season_months()
# MLS season is February-November
assert len(months) == 10 # Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov
# Check first month is February of season year
assert months[0] == (2026, 2)
# Check last month is November
assert months[-1] == (2026, 11)
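Review note: the season-month tests in this commit expect a list of `(year, month)` tuples, starting in the season year and rolling into the next year for leagues whose season crosses January. A minimal sketch of that calculation follows. `season_months` and its parameters are hypothetical, and the real `_get_season_months` may be structured differently.

```python
def season_months(season: int, start_month: int, end_month: int) -> list[tuple[int, int]]:
    """Return (year, month) pairs from start_month of `season` through end_month.

    Assumption: if end_month falls earlier in the calendar than start_month,
    the season rolls over into the following year.
    """
    end_year = season if end_month >= start_month else season + 1
    months = []
    year, month = season, start_month
    while (year, month) <= (end_year, end_month):
        months.append((year, month))
        month += 1
        if month > 12:  # wrap December into January of the next year
            year, month = year + 1, 1
    return months
```

Called as `season_months(2026, 2, 11)` this yields the ten February to November entries the MLS test checks, and `season_months(2025, 10, 6)` crosses the year boundary the way the NBA test expects.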
@@ -0,0 +1,428 @@
"""Tests for NBA scraper."""
from datetime import datetime
from unittest.mock import patch
import pytest
from sportstime_parser.scrapers.nba import NBAScraper, create_nba_scraper
from sportstime_parser.scrapers.base import RawGameData
from sportstime_parser.tests.fixtures import (
load_fixture,
load_json_fixture,
NBA_BR_OCTOBER_HTML,
NBA_BR_EDGE_CASES_HTML,
NBA_ESPN_SCOREBOARD_JSON,
)
class TestNBAScraperInit:
"""Test NBAScraper initialization."""
def test_creates_scraper_with_season(self):
"""Test scraper initializes with correct season."""
scraper = NBAScraper(season=2025)
assert scraper.sport == "nba"
assert scraper.season == 2025
def test_factory_function_creates_scraper(self):
"""Test factory function creates correct scraper."""
scraper = create_nba_scraper(season=2025)
assert isinstance(scraper, NBAScraper)
assert scraper.season == 2025
def test_expected_game_count(self):
"""Test expected game count is correct for NBA."""
scraper = NBAScraper(season=2025)
assert scraper.expected_game_count == 1230
def test_sources_in_priority_order(self):
"""Test sources are returned in correct priority order."""
scraper = NBAScraper(season=2025)
sources = scraper._get_sources()
assert sources == ["basketball_reference", "espn", "cbs"]
class TestBasketballReferenceParsing:
"""Test Basketball-Reference HTML parsing."""
def test_parses_completed_games(self):
"""Test parsing completed games with scores."""
scraper = NBAScraper(season=2025)
html = load_fixture(NBA_BR_OCTOBER_HTML)
games = scraper._parse_basketball_reference(html, "http://example.com")
# Should find all games in fixture
assert len(games) == 7
# Check completed game count
completed_games = [g for g in games if g.status == "final"]
assert len(completed_games) == 2
# Boston @ Cleveland
bos_cle = next(g for g in games if g.away_team_raw == "Boston Celtics")
assert bos_cle.home_team_raw == "Cleveland Cavaliers"
assert bos_cle.away_score == 112
assert bos_cle.home_score == 108
assert bos_cle.stadium_raw == "Rocket Mortgage FieldHouse"
assert bos_cle.status == "final"
def test_parses_scheduled_games(self):
"""Test parsing scheduled games without scores."""
scraper = NBAScraper(season=2025)
html = load_fixture(NBA_BR_OCTOBER_HTML)
games = scraper._parse_basketball_reference(html, "http://example.com")
scheduled_games = [g for g in games if g.status == "scheduled"]
assert len(scheduled_games) == 5
# Houston @ OKC
hou_okc = next(g for g in scheduled_games if g.away_team_raw == "Houston Rockets")
assert hou_okc.home_team_raw == "Oklahoma City Thunder"
assert hou_okc.away_score is None
assert hou_okc.home_score is None
assert hou_okc.stadium_raw == "Paycom Center"
def test_parses_game_dates_correctly(self):
"""Test game dates are parsed correctly."""
scraper = NBAScraper(season=2025)
html = load_fixture(NBA_BR_OCTOBER_HTML)
games = scraper._parse_basketball_reference(html, "http://example.com")
# Check first game date
first_game = games[0]
assert first_game.game_date.year == 2025
assert first_game.game_date.month == 10
assert first_game.game_date.day == 22
def test_tracks_source_url(self):
"""Test source URL is tracked for all games."""
scraper = NBAScraper(season=2025)
html = load_fixture(NBA_BR_OCTOBER_HTML)
source_url = "http://basketball-reference.com/test"
games = scraper._parse_basketball_reference(html, source_url)
for game in games:
assert game.source_url == source_url
class TestBasketballReferenceEdgeCases:
"""Test edge case handling in Basketball-Reference parsing."""
def test_parses_postponed_games(self):
"""Test postponed games are identified correctly."""
scraper = NBAScraper(season=2025)
html = load_fixture(NBA_BR_EDGE_CASES_HTML)
games = scraper._parse_basketball_reference(html, "http://example.com")
postponed = [g for g in games if g.status == "postponed"]
assert len(postponed) == 1
assert postponed[0].away_team_raw == "Los Angeles Lakers"
assert postponed[0].home_team_raw == "Phoenix Suns"
def test_parses_cancelled_games(self):
"""Test cancelled games are identified correctly."""
scraper = NBAScraper(season=2025)
html = load_fixture(NBA_BR_EDGE_CASES_HTML)
games = scraper._parse_basketball_reference(html, "http://example.com")
cancelled = [g for g in games if g.status == "cancelled"]
assert len(cancelled) == 1
assert cancelled[0].away_team_raw == "Portland Trail Blazers"
def test_parses_neutral_site_games(self):
"""Test neutral site games are parsed."""
scraper = NBAScraper(season=2025)
html = load_fixture(NBA_BR_EDGE_CASES_HTML)
games = scraper._parse_basketball_reference(html, "http://example.com")
# Mexico City game
mexico = next(g for g in games if g.stadium_raw == "Arena CDMX")
assert mexico.away_team_raw == "Miami Heat"
assert mexico.home_team_raw == "Washington Wizards"
assert mexico.status == "final"
def test_parses_overtime_games(self):
"""Test overtime games with high scores."""
scraper = NBAScraper(season=2025)
html = load_fixture(NBA_BR_EDGE_CASES_HTML)
games = scraper._parse_basketball_reference(html, "http://example.com")
# High scoring OT game
ot_game = next(g for g in games if g.away_score == 147)
assert ot_game.home_score == 150
assert ot_game.status == "final"
class TestESPNParsing:
"""Test ESPN API response parsing."""
def test_parses_completed_games(self):
"""Test parsing completed games from ESPN."""
scraper = NBAScraper(season=2025)
data = load_json_fixture(NBA_ESPN_SCOREBOARD_JSON)
games = scraper._parse_espn_response(data, "http://espn.com/api")
completed = [g for g in games if g.status == "final"]
assert len(completed) == 2
# Boston @ Cleveland
bos_cle = next(g for g in completed if g.away_team_raw == "Boston Celtics")
assert bos_cle.home_team_raw == "Cleveland Cavaliers"
assert bos_cle.away_score == 112
assert bos_cle.home_score == 108
assert bos_cle.stadium_raw == "Rocket Mortgage FieldHouse"
def test_parses_scheduled_games(self):
"""Test parsing scheduled games from ESPN."""
scraper = NBAScraper(season=2025)
data = load_json_fixture(NBA_ESPN_SCOREBOARD_JSON)
games = scraper._parse_espn_response(data, "http://espn.com/api")
scheduled = [g for g in games if g.status == "scheduled"]
assert len(scheduled) == 1
hou_okc = scheduled[0]
assert hou_okc.away_team_raw == "Houston Rockets"
assert hou_okc.home_team_raw == "Oklahoma City Thunder"
assert hou_okc.stadium_raw == "Paycom Center"
def test_parses_venue_info(self):
"""Test venue information is extracted."""
scraper = NBAScraper(season=2025)
data = load_json_fixture(NBA_ESPN_SCOREBOARD_JSON)
games = scraper._parse_espn_response(data, "http://espn.com/api")
# Check all games have venue info
for game in games:
assert game.stadium_raw is not None
class TestGameNormalization:
"""Test game normalization and canonical ID generation."""
def test_normalizes_games_with_canonical_ids(self):
"""Test games are normalized with correct canonical IDs."""
scraper = NBAScraper(season=2025)
raw_games = [
RawGameData(
game_date=datetime(2025, 10, 22),
home_team_raw="Cleveland Cavaliers",
away_team_raw="Boston Celtics",
stadium_raw="Rocket Mortgage FieldHouse",
home_score=108,
away_score=112,
status="final",
source_url="http://example.com",
)
]
games, review_items = scraper._normalize_games(raw_games)
assert len(games) == 1
game = games[0]
# Check canonical ID format
assert game.id == "nba_2025_bos_cle_1022"
assert game.sport == "nba"
assert game.season == 2025
# Check team IDs
assert game.home_team_id == "team_nba_cle"
assert game.away_team_id == "team_nba_bos"
# Check scores preserved
assert game.home_score == 108
assert game.away_score == 112
def test_detects_doubleheaders(self):
"""Test doubleheaders get correct game numbers."""
scraper = NBAScraper(season=2025)
raw_games = [
RawGameData(
game_date=datetime(2025, 4, 1, 13, 0),
home_team_raw="Boston Celtics",
away_team_raw="New York Knicks",
stadium_raw="TD Garden",
status="final",
home_score=105,
away_score=98,
),
RawGameData(
game_date=datetime(2025, 4, 1, 19, 0),
home_team_raw="Boston Celtics",
away_team_raw="New York Knicks",
stadium_raw="TD Garden",
status="final",
home_score=110,
away_score=102,
),
]
games, _ = scraper._normalize_games(raw_games)
assert len(games) == 2
game_numbers = sorted([g.game_number for g in games])
assert game_numbers == [1, 2]
# Check IDs include game number
game_ids = sorted([g.id for g in games])
assert game_ids == ["nba_2025_nyk_bos_0401_1", "nba_2025_nyk_bos_0401_2"]
def test_creates_review_items_for_unresolved_teams(self):
"""Test review items are created for unresolved teams."""
scraper = NBAScraper(season=2025)
raw_games = [
RawGameData(
game_date=datetime(2025, 10, 22),
home_team_raw="Unknown Team XYZ",
away_team_raw="Boston Celtics",
stadium_raw="TD Garden",
status="scheduled",
),
]
games, review_items = scraper._normalize_games(raw_games)
# Game should not be created due to unresolved team
assert len(games) == 0
# But there should be a review item
assert len(review_items) >= 1
class TestTeamAndStadiumScraping:
"""Test team and stadium data scraping."""
def test_scrapes_all_nba_teams(self):
"""Test all 30 NBA teams are returned."""
scraper = NBAScraper(season=2025)
teams = scraper.scrape_teams()
# 30 NBA teams
assert len(teams) == 30
# Check team IDs are unique
team_ids = [t.id for t in teams]
assert len(set(team_ids)) == 30
# Check all teams have required fields
for team in teams:
assert team.id.startswith("team_nba_")
assert team.sport == "nba"
assert team.city
assert team.name
assert team.full_name
assert team.abbreviation
def test_teams_have_conferences_and_divisions(self):
"""Test teams split evenly between the Eastern and Western conferences."""
scraper = NBAScraper(season=2025)
teams = scraper.scrape_teams()
# Count teams by conference
eastern = [t for t in teams if t.conference == "Eastern"]
western = [t for t in teams if t.conference == "Western"]
assert len(eastern) == 15
assert len(western) == 15
def test_scrapes_all_nba_stadiums(self):
"""Test all NBA stadiums are returned."""
scraper = NBAScraper(season=2025)
stadiums = scraper.scrape_stadiums()
# Should have stadiums for all teams
assert len(stadiums) == 30
# Check stadium IDs are unique
stadium_ids = [s.id for s in stadiums]
assert len(set(stadium_ids)) == 30
# Check all stadiums have required fields
for stadium in stadiums:
assert stadium.id.startswith("stadium_nba_")
assert stadium.sport == "nba"
assert stadium.name
assert stadium.city
assert stadium.state
assert stadium.country in ["USA", "Canada"]
assert stadium.latitude != 0
assert stadium.longitude != 0
class TestScrapeFallback:
"""Test multi-source fallback behavior."""
def test_falls_back_to_next_source_on_failure(self):
"""Test scraper tries next source when first fails."""
scraper = NBAScraper(season=2025)
with patch.object(scraper, '_scrape_basketball_reference') as mock_br, \
patch.object(scraper, '_scrape_espn') as mock_espn:
# Make BR fail
mock_br.side_effect = Exception("Connection failed")
# Make ESPN return data
mock_espn.return_value = [
RawGameData(
game_date=datetime(2025, 10, 22),
home_team_raw="Cleveland Cavaliers",
away_team_raw="Boston Celtics",
stadium_raw="Rocket Mortgage FieldHouse",
status="scheduled",
)
]
result = scraper.scrape_games()
# Should have succeeded with ESPN
assert result.success
assert result.source == "espn"
assert mock_br.called
assert mock_espn.called
def test_returns_failure_when_all_sources_fail(self):
"""Test scraper returns failure when all sources fail."""
scraper = NBAScraper(season=2025)
with patch.object(scraper, '_scrape_basketball_reference') as mock_br, \
patch.object(scraper, '_scrape_espn') as mock_espn, \
patch.object(scraper, '_scrape_cbs') as mock_cbs:
mock_br.side_effect = Exception("BR failed")
mock_espn.side_effect = Exception("ESPN failed")
mock_cbs.side_effect = Exception("CBS failed")
result = scraper.scrape_games()
assert not result.success
assert "All sources failed" in result.error_message
assert "CBS failed" in result.error_message
class TestSeasonMonths:
"""Test season month calculation."""
def test_gets_correct_season_months(self):
"""Test correct months are returned for NBA season."""
scraper = NBAScraper(season=2025)
months = scraper._get_season_months()
# NBA season is Oct-Jun
assert len(months) == 9 # Oct, Nov, Dec, Jan, Feb, Mar, Apr, May, Jun
# Check first month is Oct of season year
assert months[0] == (2025, 10)
# Check last month is Jun of following year
assert months[-1] == (2026, 6)
# Check transition to new year
assert months[2] == (2025, 12) # December
assert months[3] == (2026, 1) # January
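Review note: the normalization tests assert IDs of the form `{sport}_{season}_{away}_{home}_{MMDD}`, with a `_1` / `_2` suffix when the same matchup occurs twice on one date. A minimal sketch that reproduces those strings follows. `game_id` and `assign_ids` are hypothetical helpers, and numbering by start time is an assumption, since the tests only compare sorted results.

```python
from collections import defaultdict
from datetime import datetime


def game_id(sport, season, away, home, game_date, game_number=None):
    """Build an ID such as 'nba_2025_bos_cle_1022' (away team first, then home)."""
    base = f"{sport}_{season}_{away}_{home}_{game_date:%m%d}"
    return base if game_number is None else f"{base}_{game_number}"


def assign_ids(sport, season, games):
    """games: list of (away_abbrev, home_abbrev, datetime). Returns IDs in input order."""
    groups = defaultdict(list)
    for index, (away, home, when) in enumerate(games):
        groups[(away, home, when.date())].append((when, index))

    ids = [None] * len(games)
    for (away, home, _day), entries in groups.items():
        entries.sort()  # assumption: earlier start time gets the lower game number
        for number, (when, index) in enumerate(entries, start=1):
            suffix = number if len(entries) > 1 else None
            ids[index] = game_id(sport, season, away, home, when, suffix)
    return ids
```

Only matchups that repeat on the same date receive a suffix, so single games keep the shorter ID asserted elsewhere in these tests.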
@@ -0,0 +1,310 @@
"""Tests for NFL scraper."""
from datetime import datetime
from unittest.mock import patch
import pytest
from sportstime_parser.scrapers.nfl import NFLScraper, create_nfl_scraper
from sportstime_parser.scrapers.base import RawGameData
from sportstime_parser.tests.fixtures import (
load_json_fixture,
NFL_ESPN_SCOREBOARD_JSON,
)
class TestNFLScraperInit:
"""Test NFLScraper initialization."""
def test_creates_scraper_with_season(self):
"""Test scraper initializes with correct season."""
scraper = NFLScraper(season=2025)
assert scraper.sport == "nfl"
assert scraper.season == 2025
def test_factory_function_creates_scraper(self):
"""Test factory function creates correct scraper."""
scraper = create_nfl_scraper(season=2025)
assert isinstance(scraper, NFLScraper)
assert scraper.season == 2025
def test_expected_game_count(self):
"""Test expected game count is correct for NFL."""
scraper = NFLScraper(season=2025)
assert scraper.expected_game_count == 272
def test_sources_in_priority_order(self):
"""Test sources are returned in correct priority order."""
scraper = NFLScraper(season=2025)
sources = scraper._get_sources()
assert sources == ["espn", "pro_football_reference", "cbs"]
class TestESPNParsing:
"""Test ESPN API response parsing."""
def test_parses_completed_games(self):
"""Test parsing completed games from ESPN."""
scraper = NFLScraper(season=2025)
data = load_json_fixture(NFL_ESPN_SCOREBOARD_JSON)
games = scraper._parse_espn_response(data, "http://espn.com/api")
completed = [g for g in games if g.status == "final"]
assert len(completed) == 2
# Chiefs @ Ravens
kc_bal = next(g for g in completed if g.away_team_raw == "Kansas City Chiefs")
assert kc_bal.home_team_raw == "Baltimore Ravens"
assert kc_bal.away_score == 27
assert kc_bal.home_score == 20
assert kc_bal.stadium_raw == "M&T Bank Stadium"
def test_parses_scheduled_games(self):
"""Test parsing scheduled games from ESPN."""
scraper = NFLScraper(season=2025)
data = load_json_fixture(NFL_ESPN_SCOREBOARD_JSON)
games = scraper._parse_espn_response(data, "http://espn.com/api")
scheduled = [g for g in games if g.status == "scheduled"]
assert len(scheduled) == 1
dal_cle = scheduled[0]
assert dal_cle.away_team_raw == "Dallas Cowboys"
assert dal_cle.home_team_raw == "Cleveland Browns"
assert dal_cle.stadium_raw == "Cleveland Browns Stadium"
def test_parses_venue_info(self):
"""Test venue information is extracted."""
scraper = NFLScraper(season=2025)
data = load_json_fixture(NFL_ESPN_SCOREBOARD_JSON)
games = scraper._parse_espn_response(data, "http://espn.com/api")
for game in games:
assert game.stadium_raw is not None
class TestGameNormalization:
"""Test game normalization and canonical ID generation."""
def test_normalizes_games_with_canonical_ids(self):
"""Test games are normalized with correct canonical IDs."""
scraper = NFLScraper(season=2025)
raw_games = [
RawGameData(
game_date=datetime(2025, 9, 7),
home_team_raw="Baltimore Ravens",
away_team_raw="Kansas City Chiefs",
stadium_raw="M&T Bank Stadium",
home_score=20,
away_score=27,
status="final",
source_url="http://example.com",
)
]
games, review_items = scraper._normalize_games(raw_games)
assert len(games) == 1
game = games[0]
# Check canonical ID format
assert game.id == "nfl_2025_kc_bal_0907"
assert game.sport == "nfl"
assert game.season == 2025
# Check team IDs
assert game.home_team_id == "team_nfl_bal"
assert game.away_team_id == "team_nfl_kc"
# Check scores preserved
assert game.home_score == 20
assert game.away_score == 27
def test_creates_review_items_for_unresolved_teams(self):
"""Test review items are created for unresolved teams."""
scraper = NFLScraper(season=2025)
raw_games = [
RawGameData(
game_date=datetime(2025, 9, 7),
home_team_raw="Unknown Team XYZ",
away_team_raw="Kansas City Chiefs",
stadium_raw="Arrowhead Stadium",
status="scheduled",
),
]
games, review_items = scraper._normalize_games(raw_games)
# Game should not be created due to unresolved team
assert len(games) == 0
# But there should be a review item
assert len(review_items) >= 1
class TestTeamAndStadiumScraping:
"""Test team and stadium data scraping."""
def test_scrapes_all_nfl_teams(self):
"""Test all 32 NFL teams are returned."""
scraper = NFLScraper(season=2025)
teams = scraper.scrape_teams()
# 32 NFL teams
assert len(teams) == 32
# Check team IDs are unique
team_ids = [t.id for t in teams]
assert len(set(team_ids)) == 32
# Check all teams have required fields
for team in teams:
assert team.id.startswith("team_nfl_")
assert team.sport == "nfl"
assert team.city
assert team.name
assert team.full_name
assert team.abbreviation
def test_teams_have_conferences_and_divisions(self):
"""Test teams split evenly between the AFC and NFC."""
scraper = NFLScraper(season=2025)
teams = scraper.scrape_teams()
# Count teams by conference
afc = [t for t in teams if t.conference == "AFC"]
nfc = [t for t in teams if t.conference == "NFC"]
assert len(afc) == 16
assert len(nfc) == 16
def test_scrapes_all_nfl_stadiums(self):
"""Test all NFL stadiums are returned."""
scraper = NFLScraper(season=2025)
stadiums = scraper.scrape_stadiums()
# 32 teams, but some share a stadium, so expect at least 30
assert len(stadiums) >= 30
# Check all stadiums have required fields
for stadium in stadiums:
assert stadium.id.startswith("stadium_nfl_")
assert stadium.sport == "nfl"
assert stadium.name
assert stadium.city
assert stadium.state
assert stadium.country == "USA"
assert stadium.latitude != 0
assert stadium.longitude != 0
class TestScrapeFallback:
"""Test multi-source fallback behavior."""
def test_falls_back_to_next_source_on_failure(self):
"""Test scraper tries next source when first fails."""
scraper = NFLScraper(season=2025)
with patch.object(scraper, '_scrape_espn') as mock_espn, \
patch.object(scraper, '_scrape_pro_football_reference') as mock_pfr:
# Make ESPN fail
mock_espn.side_effect = Exception("Connection failed")
# Make PFR return data
mock_pfr.return_value = [
RawGameData(
game_date=datetime(2025, 9, 7),
home_team_raw="Baltimore Ravens",
away_team_raw="Kansas City Chiefs",
stadium_raw="M&T Bank Stadium",
status="scheduled",
)
]
result = scraper.scrape_games()
assert result.success
assert result.source == "pro_football_reference"
assert mock_espn.called
assert mock_pfr.called
class TestSeasonMonths:
"""Test season month calculation."""
def test_gets_correct_season_months(self):
"""Test correct months are returned for NFL season."""
scraper = NFLScraper(season=2025)
months = scraper._get_season_months()
# NFL season is September-February
assert len(months) == 6 # Sep, Oct, Nov, Dec, Jan, Feb
# Check first month is September of season year
assert months[0] == (2025, 9)
# Check last month is February of following year
assert months[-1] == (2026, 2)
# Check transition to new year
assert months[3] == (2025, 12) # December
assert months[4] == (2026, 1) # January
class TestInternationalFiltering:
"""Test international game filtering.
Note: Filtering happens in _parse_espn_response, not _normalize_games.
"""
def test_filters_london_games_during_parsing(self):
"""Test London games are filtered out during ESPN parsing."""
scraper = NFLScraper(season=2025)
# Create ESPN-like data with a London game
espn_data = {
"events": [
{
"date": "2025-10-15T09:30:00Z",
"competitions": [
{
"neutralSite": True,
"venue": {
"fullName": "London Stadium",
"address": {"city": "London", "country": "UK"},
},
"competitors": [
{"homeAway": "home", "team": {"displayName": "Jacksonville Jaguars"}},
{"homeAway": "away", "team": {"displayName": "Buffalo Bills"}},
],
}
],
}
]
}
games = scraper._parse_espn_response(espn_data, "http://espn.com/api")
# London game should be filtered
assert len(games) == 0
def test_keeps_us_games(self):
"""Test US games pass through normalization and are kept."""
scraper = NFLScraper(season=2025)
raw_games = [
RawGameData(
game_date=datetime(2025, 9, 7),
home_team_raw="Baltimore Ravens",
away_team_raw="Kansas City Chiefs",
stadium_raw="M&T Bank Stadium",
status="scheduled",
),
]
games, _ = scraper._normalize_games(raw_games)
assert len(games) == 1
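Review note: the London test above feeds `_parse_espn_response` an event whose venue address carries a non-US country and expects it to be dropped. Below is a minimal sketch of such a filter over dicts shaped like that test payload. `is_international`, `drop_international`, and `DOMESTIC_COUNTRIES` are hypothetical, and treating a missing country as domestic is an assumption rather than confirmed package behavior.

```python
# Hypothetical helper: the country spelling and the "missing country means
# domestic" rule are assumptions, not taken from sportstime_parser.
DOMESTIC_COUNTRIES = {"USA"}


def is_international(event: dict) -> bool:
    """Return True if any competition in the event is hosted outside DOMESTIC_COUNTRIES."""
    for competition in event.get("competitions", []):
        address = competition.get("venue", {}).get("address", {})
        country = address.get("country")
        if country is not None and country not in DOMESTIC_COUNTRIES:
            return True
    return False


def drop_international(events: list[dict]) -> list[dict]:
    """Keep only events that are not flagged as international."""
    return [event for event in events if not is_international(event)]
```

Filtering on the venue country rather than the `neutralSite` flag avoids discarding neutral-site games that are still played domestically. Whether the real parser makes the same choice is not shown in this diff.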
@@ -0,0 +1,317 @@
"""Tests for NHL scraper."""
from datetime import datetime
from unittest.mock import patch
import pytest
from sportstime_parser.scrapers.nhl import NHLScraper, create_nhl_scraper
from sportstime_parser.scrapers.base import RawGameData
from sportstime_parser.tests.fixtures import (
load_json_fixture,
NHL_ESPN_SCOREBOARD_JSON,
)
class TestNHLScraperInit:
"""Test NHLScraper initialization."""
def test_creates_scraper_with_season(self):
"""Test scraper initializes with correct season."""
scraper = NHLScraper(season=2025)
assert scraper.sport == "nhl"
assert scraper.season == 2025
def test_factory_function_creates_scraper(self):
"""Test factory function creates correct scraper."""
scraper = create_nhl_scraper(season=2025)
assert isinstance(scraper, NHLScraper)
assert scraper.season == 2025
def test_expected_game_count(self):
"""Test expected game count is correct for NHL."""
scraper = NHLScraper(season=2025)
assert scraper.expected_game_count == 1312
def test_sources_in_priority_order(self):
"""Test sources are returned in correct priority order."""
scraper = NHLScraper(season=2025)
sources = scraper._get_sources()
assert sources == ["hockey_reference", "nhl_api", "espn"]
class TestESPNParsing:
"""Test ESPN API response parsing."""
def test_parses_completed_games(self):
"""Test parsing completed games from ESPN."""
scraper = NHLScraper(season=2025)
data = load_json_fixture(NHL_ESPN_SCOREBOARD_JSON)
games = scraper._parse_espn_response(data, "http://espn.com/api")
completed = [g for g in games if g.status == "final"]
assert len(completed) == 2
# Penguins @ Bruins
pit_bos = next(g for g in completed if g.away_team_raw == "Pittsburgh Penguins")
assert pit_bos.home_team_raw == "Boston Bruins"
assert pit_bos.away_score == 2
assert pit_bos.home_score == 4
assert pit_bos.stadium_raw == "TD Garden"
def test_parses_scheduled_games(self):
"""Test parsing scheduled games from ESPN."""
scraper = NHLScraper(season=2025)
data = load_json_fixture(NHL_ESPN_SCOREBOARD_JSON)
games = scraper._parse_espn_response(data, "http://espn.com/api")
scheduled = [g for g in games if g.status == "scheduled"]
assert len(scheduled) == 1
vgk_lak = scheduled[0]
assert vgk_lak.away_team_raw == "Vegas Golden Knights"
assert vgk_lak.home_team_raw == "Los Angeles Kings"
assert vgk_lak.stadium_raw == "Crypto.com Arena"
def test_parses_venue_info(self):
"""Test venue information is extracted."""
scraper = NHLScraper(season=2025)
data = load_json_fixture(NHL_ESPN_SCOREBOARD_JSON)
games = scraper._parse_espn_response(data, "http://espn.com/api")
for game in games:
assert game.stadium_raw is not None
class TestGameNormalization:
"""Test game normalization and canonical ID generation."""
def test_normalizes_games_with_canonical_ids(self):
"""Test games are normalized with correct canonical IDs."""
scraper = NHLScraper(season=2025)
raw_games = [
RawGameData(
game_date=datetime(2025, 10, 8),
home_team_raw="Boston Bruins",
away_team_raw="Pittsburgh Penguins",
stadium_raw="TD Garden",
home_score=4,
away_score=2,
status="final",
source_url="http://example.com",
)
]
games, review_items = scraper._normalize_games(raw_games)
assert len(games) == 1
game = games[0]
# Check canonical ID format
assert game.id == "nhl_2025_pit_bos_1008"
assert game.sport == "nhl"
assert game.season == 2025
# Check team IDs
assert game.home_team_id == "team_nhl_bos"
assert game.away_team_id == "team_nhl_pit"
# Check scores preserved
assert game.home_score == 4
assert game.away_score == 2
def test_creates_review_items_for_unresolved_teams(self):
"""Test review items are created for unresolved teams."""
scraper = NHLScraper(season=2025)
raw_games = [
RawGameData(
game_date=datetime(2025, 10, 8),
home_team_raw="Unknown Team XYZ",
away_team_raw="Boston Bruins",
stadium_raw="TD Garden",
status="scheduled",
),
]
games, review_items = scraper._normalize_games(raw_games)
# Game should not be created due to unresolved team
assert len(games) == 0
# But there should be a review item
assert len(review_items) >= 1
class TestTeamAndStadiumScraping:
"""Test team and stadium data scraping."""
def test_scrapes_all_nhl_teams(self):
"""Test all 32 NHL teams are returned."""
scraper = NHLScraper(season=2025)
teams = scraper.scrape_teams()
# 32 NHL teams
assert len(teams) == 32
# Check team IDs are unique
team_ids = [t.id for t in teams]
assert len(set(team_ids)) == 32
# Check all teams have required fields
for team in teams:
assert team.id.startswith("team_nhl_")
assert team.sport == "nhl"
assert team.city
assert team.name
assert team.full_name
assert team.abbreviation
def test_teams_have_conferences_and_divisions(self):
"""Test teams have conference and division info."""
scraper = NHLScraper(season=2025)
teams = scraper.scrape_teams()
# Count teams by conference
eastern = [t for t in teams if t.conference == "Eastern"]
western = [t for t in teams if t.conference == "Western"]
assert len(eastern) == 16
assert len(western) == 16
def test_scrapes_all_nhl_stadiums(self):
"""Test all NHL stadiums are returned."""
scraper = NHLScraper(season=2025)
stadiums = scraper.scrape_stadiums()
# Should have stadiums for all teams
assert len(stadiums) == 32
# Check stadium IDs are unique
stadium_ids = [s.id for s in stadiums]
assert len(set(stadium_ids)) == 32
# Check all stadiums have required fields
for stadium in stadiums:
assert stadium.id.startswith("stadium_nhl_")
assert stadium.sport == "nhl"
assert stadium.name
assert stadium.city
assert stadium.state
assert stadium.country in ["USA", "Canada"]
assert stadium.latitude != 0
assert stadium.longitude != 0
class TestScrapeFallback:
    """Test multi-source fallback behavior."""

    def test_falls_back_to_next_source_on_failure(self):
        """Test scraper tries next source when first fails."""
        scraper = NHLScraper(season=2025)
        with patch.object(scraper, '_scrape_hockey_reference') as mock_hr, \
             patch.object(scraper, '_scrape_nhl_api') as mock_nhl, \
             patch.object(scraper, '_scrape_espn') as mock_espn:
            # Make HR and NHL API fail
            mock_hr.side_effect = Exception("Connection failed")
            mock_nhl.side_effect = Exception("API error")
            # Make ESPN return data
            mock_espn.return_value = [
                RawGameData(
                    game_date=datetime(2025, 10, 8),
                    home_team_raw="Boston Bruins",
                    away_team_raw="Pittsburgh Penguins",
                    stadium_raw="TD Garden",
                    status="scheduled",
                )
            ]
            result = scraper.scrape_games()
            assert result.success
            assert result.source == "espn"
            assert mock_hr.called
            assert mock_nhl.called
            assert mock_espn.called
class TestSeasonMonths:
"""Test season month calculation."""
def test_gets_correct_season_months(self):
"""Test correct months are returned for NHL season."""
scraper = NHLScraper(season=2025)
months = scraper._get_season_months()
# NHL season is October-June
assert len(months) == 9 # Oct, Nov, Dec, Jan, Feb, Mar, Apr, May, Jun
# Check first month is October of season year
assert months[0] == (2025, 10)
# Check last month is June of following year
assert months[-1] == (2026, 6)
# Check transition to new year
assert months[2] == (2025, 12) # December
assert months[3] == (2026, 1) # January
class TestInternationalFiltering:
    """Test international game filtering.

    Note: Filtering happens in _parse_espn_response, not _normalize_games.
    """

    def test_filters_european_games_during_parsing(self):
        """Test European games are filtered out during ESPN parsing."""
        scraper = NHLScraper(season=2025)
        # Create ESPN-like data with Prague game (Global Series)
        espn_data = {
            "events": [
                {
                    "date": "2025-10-10T18:00:00Z",
                    "competitions": [
                        {
                            "neutralSite": True,
                            "venue": {
                                "fullName": "O2 Arena, Prague",
                                "address": {"city": "Prague", "country": "Czech Republic"},
                            },
                            "competitors": [
                                {"homeAway": "home", "team": {"displayName": "Florida Panthers"}},
                                {"homeAway": "away", "team": {"displayName": "Dallas Stars"}},
                            ],
                        }
                    ],
                }
            ]
        }
        games = scraper._parse_espn_response(espn_data, "http://espn.com/api")
        # Prague game should be filtered
        assert len(games) == 0

    def test_keeps_north_american_games(self):
        """Test North American games are kept."""
        scraper = NHLScraper(season=2025)
        raw_games = [
            RawGameData(
                game_date=datetime(2025, 10, 8),
                home_team_raw="Boston Bruins",
                away_team_raw="Pittsburgh Penguins",
                stadium_raw="TD Garden",
                status="scheduled",
            ),
        ]
        games, _ = scraper._normalize_games(raw_games)
        assert len(games) == 1
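Review note: the behavior that `TestScrapeFallback` asserts (try each source in priority order, return the first one that yields games, report failure only when every source fails) can be sketched as below. This is a minimal illustration, not the package's implementation; `ScrapeOutcome` and `scrape_with_fallback` are hypothetical names, and treating an empty result as a failed source is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ScrapeOutcome:
    """Hypothetical stand-in for the object returned by scrape_games()."""
    success: bool
    source: Optional[str] = None
    games: list = field(default_factory=list)
    error_message: Optional[str] = None


def scrape_with_fallback(
    sources: list[str],
    fetchers: dict[str, Callable[[], list]],
) -> ScrapeOutcome:
    """Try each source in priority order and return the first usable result."""
    errors = []
    for name in sources:
        try:
            games = fetchers[name]()
        except Exception as exc:  # any source error moves us to the next source
            errors.append(f"{name}: {exc}")
            continue
        if games:
            return ScrapeOutcome(success=True, source=name, games=games)
        # Assumption: an empty result counts as a failed source.
        errors.append(f"{name}: no games returned")
    return ScrapeOutcome(
        success=False,
        error_message="All sources failed: " + "; ".join(errors),
    )
```

With the NHL priority list `["hockey_reference", "nhl_api", "espn"]` and the first two fetchers raising, the sketch returns a successful outcome whose `source` is `"espn"`, which mirrors what the test checks.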
@@ -0,0 +1,226 @@
"""Tests for NWSL scraper."""
from datetime import datetime
from unittest.mock import patch
import pytest
from sportstime_parser.scrapers.nwsl import NWSLScraper, create_nwsl_scraper
from sportstime_parser.scrapers.base import RawGameData
from sportstime_parser.tests.fixtures import (
load_json_fixture,
NWSL_ESPN_SCOREBOARD_JSON,
)
class TestNWSLScraperInit:
"""Test NWSLScraper initialization."""
def test_creates_scraper_with_season(self):
"""Test scraper initializes with correct season."""
scraper = NWSLScraper(season=2026)
assert scraper.sport == "nwsl"
assert scraper.season == 2026
def test_factory_function_creates_scraper(self):
"""Test factory function creates correct scraper."""
scraper = create_nwsl_scraper(season=2026)
assert isinstance(scraper, NWSLScraper)
assert scraper.season == 2026
def test_expected_game_count(self):
"""Test expected game count is correct for NWSL."""
scraper = NWSLScraper(season=2026)
assert scraper.expected_game_count == 182
def test_sources_in_priority_order(self):
"""Test sources are returned in correct priority order."""
scraper = NWSLScraper(season=2026)
sources = scraper._get_sources()
assert sources == ["espn"]
class TestESPNParsing:
"""Test ESPN API response parsing."""
def test_parses_completed_games(self):
"""Test parsing completed games from ESPN."""
scraper = NWSLScraper(season=2026)
data = load_json_fixture(NWSL_ESPN_SCOREBOARD_JSON)
games = scraper._parse_espn_response(data, "http://espn.com/api")
completed = [g for g in games if g.status == "final"]
assert len(completed) == 2
# Angel City @ Thorns
la_por = next(g for g in completed if g.away_team_raw == "Angel City FC")
assert la_por.home_team_raw == "Portland Thorns FC"
assert la_por.away_score == 1
assert la_por.home_score == 2
assert la_por.stadium_raw == "Providence Park"
def test_parses_scheduled_games(self):
"""Test parsing scheduled games from ESPN."""
scraper = NWSLScraper(season=2026)
data = load_json_fixture(NWSL_ESPN_SCOREBOARD_JSON)
games = scraper._parse_espn_response(data, "http://espn.com/api")
scheduled = [g for g in games if g.status == "scheduled"]
assert len(scheduled) == 1
sd_bay = scheduled[0]
assert sd_bay.away_team_raw == "San Diego Wave FC"
assert sd_bay.home_team_raw == "Bay FC"
assert sd_bay.stadium_raw == "PayPal Park"
def test_parses_venue_info(self):
"""Test venue information is extracted."""
scraper = NWSLScraper(season=2026)
data = load_json_fixture(NWSL_ESPN_SCOREBOARD_JSON)
games = scraper._parse_espn_response(data, "http://espn.com/api")
for game in games:
assert game.stadium_raw is not None
class TestGameNormalization:
"""Test game normalization and canonical ID generation."""
def test_normalizes_games_with_canonical_ids(self):
"""Test games are normalized with correct canonical IDs."""
scraper = NWSLScraper(season=2026)
raw_games = [
RawGameData(
game_date=datetime(2026, 4, 10),
home_team_raw="Portland Thorns FC",
away_team_raw="Angel City FC",
stadium_raw="Providence Park",
home_score=2,
away_score=1,
status="final",
source_url="http://example.com",
)
]
games, review_items = scraper._normalize_games(raw_games)
assert len(games) == 1
game = games[0]
# Check canonical ID format
assert game.id == "nwsl_2026_anf_por_0410"
assert game.sport == "nwsl"
assert game.season == 2026
# Check team IDs
assert game.home_team_id == "team_nwsl_por"
assert game.away_team_id == "team_nwsl_anf"
# Check scores preserved
assert game.home_score == 2
assert game.away_score == 1
def test_creates_review_items_for_unresolved_teams(self):
"""Test review items are created for unresolved teams."""
scraper = NWSLScraper(season=2026)
raw_games = [
RawGameData(
game_date=datetime(2026, 4, 10),
home_team_raw="Unknown Team XYZ",
away_team_raw="Portland Thorns FC",
stadium_raw="Providence Park",
status="scheduled",
),
]
games, review_items = scraper._normalize_games(raw_games)
# Game should not be created due to unresolved team
assert len(games) == 0
# But there should be a review item
assert len(review_items) >= 1
class TestTeamAndStadiumScraping:
"""Test team and stadium data scraping."""
def test_scrapes_all_nwsl_teams(self):
"""Test all NWSL teams are returned."""
scraper = NWSLScraper(season=2026)
teams = scraper.scrape_teams()
# NWSL has 14 teams
assert len(teams) == 14
# Check team IDs are unique
team_ids = [t.id for t in teams]
assert len(set(team_ids)) == 14
# Check all teams have required fields
for team in teams:
assert team.id.startswith("team_nwsl_")
assert team.sport == "nwsl"
assert team.city
assert team.name
assert team.full_name
assert team.abbreviation
def test_scrapes_all_nwsl_stadiums(self):
"""Test all NWSL stadiums are returned."""
scraper = NWSLScraper(season=2026)
stadiums = scraper.scrape_stadiums()
# Should have stadiums for all teams
assert len(stadiums) == 14
# Check stadium IDs are unique
stadium_ids = [s.id for s in stadiums]
assert len(set(stadium_ids)) == 14
# Check all stadiums have required fields
for stadium in stadiums:
assert stadium.id.startswith("stadium_nwsl_")
assert stadium.sport == "nwsl"
assert stadium.name
assert stadium.city
assert stadium.state
assert stadium.country == "USA"
assert stadium.latitude != 0
assert stadium.longitude != 0
class TestScrapeFallback:
    """Test fallback behavior (NWSL only has ESPN)."""

    def test_returns_failure_when_espn_fails(self):
        """Test scraper returns failure when ESPN fails."""
        scraper = NWSLScraper(season=2026)
        with patch.object(scraper, '_scrape_espn') as mock_espn:
            mock_espn.side_effect = Exception("ESPN failed")
            result = scraper.scrape_games()
            assert not result.success
            assert "All sources failed" in result.error_message
class TestSeasonMonths:
"""Test season month calculation."""
def test_gets_correct_season_months(self):
"""Test correct months are returned for NWSL season."""
scraper = NWSLScraper(season=2026)
months = scraper._get_season_months()
# NWSL season is March-November
assert len(months) == 9 # Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov
# Check first month is March of season year
assert months[0] == (2026, 3)
# Check last month is November
assert months[-1] == (2026, 11)
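Review note: the `TestSeasonMonths` cases in this commit cover both a single-year season (NWSL, March to November) and a season that wraps into the next calendar year (NHL, October to June). A minimal sketch of a month generator that satisfies both shapes follows. `season_months` is a hypothetical helper; the real `_get_season_months` may be structured differently.

```python
def season_months(start_year: int, start_month: int, end_month: int) -> list[tuple[int, int]]:
    """Return (year, month) tuples from the start month through the end month.

    If end_month is earlier in the calendar than start_month, the season is
    assumed to finish in the following year.
    """
    end_year = start_year + 1 if end_month < start_month else start_year
    months = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        months.append((year, month))
        month += 1
        if month > 12:  # roll over from December to January
            year, month = year + 1, 1
    return months
```

For example, `season_months(2026, 3, 11)` yields nine entries from `(2026, 3)` to `(2026, 11)`, and `season_months(2025, 10, 6)` yields nine entries from `(2025, 10)` to `(2026, 6)`.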
@@ -0,0 +1,226 @@
"""Tests for WNBA scraper."""
from datetime import datetime
from unittest.mock import patch
import pytest
from sportstime_parser.scrapers.wnba import WNBAScraper, create_wnba_scraper
from sportstime_parser.scrapers.base import RawGameData
from sportstime_parser.tests.fixtures import (
load_json_fixture,
WNBA_ESPN_SCOREBOARD_JSON,
)
class TestWNBAScraperInit:
"""Test WNBAScraper initialization."""
def test_creates_scraper_with_season(self):
"""Test scraper initializes with correct season."""
scraper = WNBAScraper(season=2026)
assert scraper.sport == "wnba"
assert scraper.season == 2026
def test_factory_function_creates_scraper(self):
"""Test factory function creates correct scraper."""
scraper = create_wnba_scraper(season=2026)
assert isinstance(scraper, WNBAScraper)
assert scraper.season == 2026
def test_expected_game_count(self):
"""Test expected game count is correct for WNBA."""
scraper = WNBAScraper(season=2026)
assert scraper.expected_game_count == 220
def test_sources_in_priority_order(self):
"""Test sources are returned in correct priority order."""
scraper = WNBAScraper(season=2026)
sources = scraper._get_sources()
assert sources == ["espn"]
class TestESPNParsing:
"""Test ESPN API response parsing."""
def test_parses_completed_games(self):
"""Test parsing completed games from ESPN."""
scraper = WNBAScraper(season=2026)
data = load_json_fixture(WNBA_ESPN_SCOREBOARD_JSON)
games = scraper._parse_espn_response(data, "http://espn.com/api")
completed = [g for g in games if g.status == "final"]
assert len(completed) == 2
# Aces @ Liberty
lv_ny = next(g for g in completed if g.away_team_raw == "Las Vegas Aces")
assert lv_ny.home_team_raw == "New York Liberty"
assert lv_ny.away_score == 88
assert lv_ny.home_score == 92
assert lv_ny.stadium_raw == "Barclays Center"
def test_parses_scheduled_games(self):
"""Test parsing scheduled games from ESPN."""
scraper = WNBAScraper(season=2026)
data = load_json_fixture(WNBA_ESPN_SCOREBOARD_JSON)
games = scraper._parse_espn_response(data, "http://espn.com/api")
scheduled = [g for g in games if g.status == "scheduled"]
assert len(scheduled) == 1
phx_sea = scheduled[0]
assert phx_sea.away_team_raw == "Phoenix Mercury"
assert phx_sea.home_team_raw == "Seattle Storm"
assert phx_sea.stadium_raw == "Climate Pledge Arena"
def test_parses_venue_info(self):
"""Test venue information is extracted."""
scraper = WNBAScraper(season=2026)
data = load_json_fixture(WNBA_ESPN_SCOREBOARD_JSON)
games = scraper._parse_espn_response(data, "http://espn.com/api")
for game in games:
assert game.stadium_raw is not None
class TestGameNormalization:
"""Test game normalization and canonical ID generation."""
def test_normalizes_games_with_canonical_ids(self):
"""Test games are normalized with correct canonical IDs."""
scraper = WNBAScraper(season=2026)
raw_games = [
RawGameData(
game_date=datetime(2026, 5, 20),
home_team_raw="New York Liberty",
away_team_raw="Las Vegas Aces",
stadium_raw="Barclays Center",
home_score=92,
away_score=88,
status="final",
source_url="http://example.com",
)
]
games, review_items = scraper._normalize_games(raw_games)
assert len(games) == 1
game = games[0]
# Check canonical ID format
assert game.id == "wnba_2026_lv_ny_0520"
assert game.sport == "wnba"
assert game.season == 2026
# Check team IDs
assert game.home_team_id == "team_wnba_ny"
assert game.away_team_id == "team_wnba_lv"
# Check scores preserved
assert game.home_score == 92
assert game.away_score == 88
def test_creates_review_items_for_unresolved_teams(self):
"""Test review items are created for unresolved teams."""
scraper = WNBAScraper(season=2026)
raw_games = [
RawGameData(
game_date=datetime(2026, 5, 20),
home_team_raw="Unknown Team XYZ",
away_team_raw="Las Vegas Aces",
stadium_raw="Barclays Center",
status="scheduled",
),
]
games, review_items = scraper._normalize_games(raw_games)
# Game should not be created due to unresolved team
assert len(games) == 0
# But there should be a review item
assert len(review_items) >= 1
class TestTeamAndStadiumScraping:
"""Test team and stadium data scraping."""
def test_scrapes_all_wnba_teams(self):
"""Test all WNBA teams are returned."""
scraper = WNBAScraper(season=2026)
teams = scraper.scrape_teams()
# WNBA has 13 teams (including Golden State Valkyries)
assert len(teams) == 13
# Check team IDs are unique
team_ids = [t.id for t in teams]
assert len(set(team_ids)) == 13
# Check all teams have required fields
for team in teams:
assert team.id.startswith("team_wnba_")
assert team.sport == "wnba"
assert team.city
assert team.name
assert team.full_name
assert team.abbreviation
def test_scrapes_all_wnba_stadiums(self):
"""Test all WNBA stadiums are returned."""
scraper = WNBAScraper(season=2026)
stadiums = scraper.scrape_stadiums()
# Should have stadiums for all teams
assert len(stadiums) == 13
# Check stadium IDs are unique
stadium_ids = [s.id for s in stadiums]
assert len(set(stadium_ids)) == 13
# Check all stadiums have required fields
for stadium in stadiums:
assert stadium.id.startswith("stadium_wnba_")
assert stadium.sport == "wnba"
assert stadium.name
assert stadium.city
assert stadium.state
assert stadium.country == "USA"
assert stadium.latitude != 0
assert stadium.longitude != 0
class TestScrapeFallback:
    """Test fallback behavior (WNBA only has ESPN)."""

    def test_returns_failure_when_espn_fails(self):
        """Test scraper returns failure when ESPN fails."""
        scraper = WNBAScraper(season=2026)
        with patch.object(scraper, '_scrape_espn') as mock_espn:
            mock_espn.side_effect = Exception("ESPN failed")
            result = scraper.scrape_games()
            assert not result.success
            assert "All sources failed" in result.error_message
class TestSeasonMonths:
"""Test season month calculation."""
def test_gets_correct_season_months(self):
"""Test correct months are returned for WNBA season."""
scraper = WNBAScraper(season=2026)
months = scraper._get_season_months()
# WNBA season is May-October
assert len(months) == 6 # May, Jun, Jul, Aug, Sep, Oct
# Check first month is May of season year
assert months[0] == (2026, 5)
# Check last month is October
assert months[-1] == (2026, 10)
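Review note: the canonical IDs asserted in the `TestGameNormalization` classes (`nhl_2025_pit_bos_1008`, `nwsl_2026_anf_por_0410`, `wnba_2026_lv_ny_0520`) all appear to follow the pattern `{sport}_{season}_{away}_{home}_{MMDD}`. The sketch below reproduces that pattern. `canonical_game_id` is a hypothetical name, and the sketch does not attempt to disambiguate two games between the same teams on the same date.

```python
from datetime import datetime


def canonical_game_id(
    sport: str,
    season: int,
    away_abbrev: str,
    home_abbrev: str,
    game_date: datetime,
) -> str:
    """Build an ID of the form sport_season_away_home_MMDD (all lowercase)."""
    return (
        f"{sport.lower()}_{season}_"
        f"{away_abbrev.lower()}_{home_abbrev.lower()}_"
        f"{game_date:%m%d}"  # zero-padded month and day
    )
```

Note that the away team comes before the home team, matching the "Aces @ Liberty" reading of the fixtures.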
@@ -0,0 +1,187 @@
"""Tests for timezone conversion utilities."""
import pytest
from datetime import datetime, date
from zoneinfo import ZoneInfo
from sportstime_parser.normalizers.timezone import (
detect_timezone_from_string,
detect_timezone_from_location,
parse_datetime,
convert_to_utc,
get_stadium_timezone,
TimezoneResult,
)
class TestDetectTimezoneFromString:
"""Tests for detect_timezone_from_string function."""
def test_eastern_time(self):
"""Test Eastern Time detection."""
assert detect_timezone_from_string("7:00 PM ET") == "America/New_York"
assert detect_timezone_from_string("7:00 PM EST") == "America/New_York"
assert detect_timezone_from_string("7:00 PM EDT") == "America/New_York"
def test_central_time(self):
"""Test Central Time detection."""
assert detect_timezone_from_string("8:00 PM CT") == "America/Chicago"
assert detect_timezone_from_string("8:00 PM CST") == "America/Chicago"
assert detect_timezone_from_string("8:00 PM CDT") == "America/Chicago"
def test_mountain_time(self):
"""Test Mountain Time detection."""
assert detect_timezone_from_string("7:00 PM MT") == "America/Denver"
assert detect_timezone_from_string("7:00 PM MST") == "America/Denver"
def test_pacific_time(self):
"""Test Pacific Time detection."""
assert detect_timezone_from_string("7:00 PM PT") == "America/Los_Angeles"
assert detect_timezone_from_string("7:00 PM PST") == "America/Los_Angeles"
assert detect_timezone_from_string("7:00 PM PDT") == "America/Los_Angeles"
def test_no_timezone(self):
"""Test string with no timezone."""
assert detect_timezone_from_string("7:00 PM") is None
assert detect_timezone_from_string("19:00") is None
def test_case_insensitive(self):
"""Test case insensitive matching."""
assert detect_timezone_from_string("7:00 PM et") == "America/New_York"
assert detect_timezone_from_string("7:00 PM Et") == "America/New_York"
class TestDetectTimezoneFromLocation:
"""Tests for detect_timezone_from_location function."""
def test_eastern_states(self):
"""Test Eastern timezone states."""
assert detect_timezone_from_location(state="NY") == "America/New_York"
assert detect_timezone_from_location(state="MA") == "America/New_York"
assert detect_timezone_from_location(state="FL") == "America/New_York"
def test_central_states(self):
"""Test Central timezone states."""
assert detect_timezone_from_location(state="TX") == "America/Chicago"
assert detect_timezone_from_location(state="IL") == "America/Chicago"
def test_mountain_states(self):
"""Test Mountain timezone states."""
assert detect_timezone_from_location(state="CO") == "America/Denver"
assert detect_timezone_from_location(state="AZ") == "America/Phoenix"
def test_pacific_states(self):
"""Test Pacific timezone states."""
assert detect_timezone_from_location(state="CA") == "America/Los_Angeles"
assert detect_timezone_from_location(state="WA") == "America/Los_Angeles"
def test_canadian_provinces(self):
"""Test Canadian provinces."""
assert detect_timezone_from_location(state="ON") == "America/Toronto"
assert detect_timezone_from_location(state="BC") == "America/Vancouver"
assert detect_timezone_from_location(state="AB") == "America/Edmonton"
def test_case_insensitive(self):
"""Test case insensitive matching."""
assert detect_timezone_from_location(state="ny") == "America/New_York"
assert detect_timezone_from_location(state="Ny") == "America/New_York"
def test_unknown_state(self):
"""Test unknown state returns None."""
assert detect_timezone_from_location(state="XX") is None
assert detect_timezone_from_location(state=None) is None
class TestParseDatetime:
"""Tests for parse_datetime function."""
def test_basic_date_time(self):
"""Test basic date and time parsing."""
result = parse_datetime("2025-12-25", "7:00 PM ET")
assert result.datetime_utc.year == 2025
assert result.datetime_utc.month == 12
assert result.datetime_utc.day == 26  # 7:00 PM EST (UTC-5) is 00:00 UTC on the next day
assert result.source_timezone == "America/New_York"
assert result.confidence == "high"
def test_date_only(self):
"""Test date only parsing."""
result = parse_datetime("2025-10-21")
assert result.datetime_utc.year == 2025
assert result.datetime_utc.month == 10
assert result.datetime_utc.day == 21
def test_timezone_hint(self):
"""Test timezone hint is used when no timezone in string."""
result = parse_datetime(
"2025-10-21",
"7:00 PM",
timezone_hint="America/Chicago",
)
assert result.source_timezone == "America/Chicago"
assert result.confidence == "medium"
def test_location_inference(self):
"""Test timezone inference from location."""
result = parse_datetime(
"2025-10-21",
"7:00 PM",
location_state="CA",
)
assert result.source_timezone == "America/Los_Angeles"
assert result.confidence == "medium"
def test_default_to_eastern(self):
"""Test defaults to Eastern when no timezone info."""
result = parse_datetime("2025-10-21", "7:00 PM")
assert result.source_timezone == "America/New_York"
assert result.confidence == "low"
assert result.warning is not None
def test_invalid_date(self):
"""Test handling of invalid date."""
result = parse_datetime("not a date")
assert result.confidence == "low"
assert result.warning is not None
class TestConvertToUtc:
"""Tests for convert_to_utc function."""
def test_convert_naive_datetime(self):
"""Test converting naive datetime to UTC."""
dt = datetime(2025, 12, 25, 19, 0) # 7:00 PM
utc = convert_to_utc(dt, "America/New_York")
# In December, Eastern Time is UTC-5
assert utc.hour == 0 # Next day 00:00 UTC
assert utc.day == 26
def test_convert_aware_datetime(self):
"""Test converting timezone-aware datetime."""
tz = ZoneInfo("America/Los_Angeles")
dt = datetime(2025, 7, 4, 19, 0, tzinfo=tz) # 7:00 PM PT
utc = convert_to_utc(dt, "America/Los_Angeles")
# In July, Pacific Time is UTC-7
assert utc.hour == 2 # 02:00 UTC next day
assert utc.day == 5
class TestGetStadiumTimezone:
"""Tests for get_stadium_timezone function."""
def test_explicit_timezone(self):
"""Test explicit timezone override."""
tz = get_stadium_timezone("AZ", stadium_timezone="America/Phoenix")
assert tz == "America/Phoenix"
def test_state_inference(self):
"""Test timezone from state."""
tz = get_stadium_timezone("NY")
assert tz == "America/New_York"
def test_default_eastern(self):
"""Test default to Eastern for unknown state."""
tz = get_stadium_timezone("XX")
assert tz == "America/New_York"
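Review note: the two `TestConvertToUtc` cases (a naive datetime that is interpreted in a named zone, and an already-aware datetime) can be reproduced with the standard library alone. The sketch below uses `zoneinfo` and `datetime.astimezone`. `to_utc` is a hypothetical name and is not claimed to match the body of `convert_to_utc`.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_utc(dt: datetime, tz_name: str) -> datetime:
    """Convert dt to UTC, treating a naive dt as local time in tz_name."""
    if dt.tzinfo is None:
        # Attach the zone without shifting the wall-clock time.
        dt = dt.replace(tzinfo=ZoneInfo(tz_name))
    # Aware datetimes keep their own zone; tz_name is ignored for them.
    return dt.astimezone(timezone.utc)
```

Because `ZoneInfo` carries daylight-saving rules, 7:00 PM in `America/New_York` on 25 December maps to 00:00 UTC on the 26th (UTC-5), while 7:00 PM in `America/Los_Angeles` on 4 July maps to 02:00 UTC on the 5th (UTC-7).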
@@ -0,0 +1 @@
"""Tests for the uploaders module."""
@@ -0,0 +1,461 @@
"""Tests for the CloudKit client."""
import json
import pytest
from datetime import datetime
from unittest.mock import Mock, patch, MagicMock
from sportstime_parser.uploaders.cloudkit import (
CloudKitClient,
CloudKitRecord,
CloudKitError,
CloudKitAuthError,
CloudKitRateLimitError,
CloudKitServerError,
RecordType,
OperationResult,
BatchResult,
)
class TestCloudKitRecord:
"""Tests for CloudKitRecord dataclass."""
def test_create_record(self):
"""Test creating a CloudKitRecord."""
record = CloudKitRecord(
record_name="nba_2025_hou_okc_1021",
record_type=RecordType.GAME,
fields={
"sport": "nba",
"season": 2025,
},
)
assert record.record_name == "nba_2025_hou_okc_1021"
assert record.record_type == RecordType.GAME
assert record.fields["sport"] == "nba"
assert record.record_change_tag is None
def test_to_cloudkit_dict(self):
"""Test converting to CloudKit API format."""
record = CloudKitRecord(
record_name="nba_2025_hou_okc_1021",
record_type=RecordType.GAME,
fields={
"sport": "nba",
"season": 2025,
},
)
data = record.to_cloudkit_dict()
assert data["recordName"] == "nba_2025_hou_okc_1021"
assert data["recordType"] == "Game"
assert "fields" in data
assert "recordChangeTag" not in data
def test_to_cloudkit_dict_with_change_tag(self):
"""Test converting with change tag for updates."""
record = CloudKitRecord(
record_name="nba_2025_hou_okc_1021",
record_type=RecordType.GAME,
fields={"sport": "nba"},
record_change_tag="abc123",
)
data = record.to_cloudkit_dict()
assert data["recordChangeTag"] == "abc123"
def test_format_string_field(self):
"""Test formatting string fields."""
record = CloudKitRecord(
record_name="test",
record_type=RecordType.GAME,
fields={"name": "Test Name"},
)
data = record.to_cloudkit_dict()
assert data["fields"]["name"]["value"] == "Test Name"
assert data["fields"]["name"]["type"] == "STRING"
def test_format_int_field(self):
"""Test formatting integer fields."""
record = CloudKitRecord(
record_name="test",
record_type=RecordType.GAME,
fields={"count": 42},
)
data = record.to_cloudkit_dict()
assert data["fields"]["count"]["value"] == 42
assert data["fields"]["count"]["type"] == "INT64"
def test_format_float_field(self):
"""Test formatting float fields."""
record = CloudKitRecord(
record_name="test",
record_type=RecordType.STADIUM,
fields={"latitude": 35.4634},
)
data = record.to_cloudkit_dict()
assert data["fields"]["latitude"]["value"] == 35.4634
assert data["fields"]["latitude"]["type"] == "DOUBLE"
def test_format_datetime_field(self):
"""Test formatting datetime fields."""
dt = datetime(2025, 10, 21, 19, 0, 0)
record = CloudKitRecord(
record_name="test",
record_type=RecordType.GAME,
fields={"game_date": dt},
)
data = record.to_cloudkit_dict()
expected_ms = int(dt.timestamp() * 1000)
assert data["fields"]["game_date"]["value"] == expected_ms
assert data["fields"]["game_date"]["type"] == "TIMESTAMP"
def test_format_location_field(self):
"""Test formatting location fields."""
record = CloudKitRecord(
record_name="test",
record_type=RecordType.STADIUM,
fields={
"location": {"latitude": 35.4634, "longitude": -97.5151},
},
)
data = record.to_cloudkit_dict()
assert data["fields"]["location"]["type"] == "LOCATION"
assert data["fields"]["location"]["value"]["latitude"] == 35.4634
assert data["fields"]["location"]["value"]["longitude"] == -97.5151
def test_skip_none_fields(self):
"""Test that None fields are skipped."""
record = CloudKitRecord(
record_name="test",
record_type=RecordType.GAME,
fields={
"sport": "nba",
"score": None, # Should be skipped
},
)
data = record.to_cloudkit_dict()
assert "sport" in data["fields"]
assert "score" not in data["fields"]
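Review note: taken together, the `TestCloudKitRecord` formatting tests describe a mapping from Python values to typed field dictionaries (`STRING`, `INT64`, `DOUBLE`, `TIMESTAMP` in milliseconds, `LOCATION`), with `None` values omitted. A minimal sketch of that mapping, inferred only from the assertions above, is shown here. `format_field` and `format_fields` are hypothetical helpers rather than methods of `CloudKitRecord`, and raising `TypeError` for other types is an assumption.

```python
from datetime import datetime
from typing import Any


def format_field(value: Any) -> dict:
    """Wrap a Python value in a {"value": ..., "type": ...} field dictionary."""
    if isinstance(value, str):
        return {"value": value, "type": "STRING"}
    if isinstance(value, int):
        return {"value": value, "type": "INT64"}
    if isinstance(value, float):
        return {"value": value, "type": "DOUBLE"}
    if isinstance(value, datetime):
        # Timestamps are expressed as milliseconds since the Unix epoch.
        return {"value": int(value.timestamp() * 1000), "type": "TIMESTAMP"}
    if isinstance(value, dict) and {"latitude", "longitude"} <= value.keys():
        return {
            "value": {"latitude": value["latitude"], "longitude": value["longitude"]},
            "type": "LOCATION",
        }
    # Assumption: unsupported types are rejected rather than silently dropped.
    raise TypeError(f"Unsupported field type: {type(value).__name__}")


def format_fields(fields: dict) -> dict:
    """Format every field, skipping entries whose value is None."""
    return {name: format_field(v) for name, v in fields.items() if v is not None}
```

As in `test_format_datetime_field`, a naive `datetime` is converted with `timestamp()`, so the resulting milliseconds depend on the local timezone of the machine running the code.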
class TestOperationResult:
"""Tests for OperationResult dataclass."""
def test_successful_result(self):
"""Test creating a successful operation result."""
result = OperationResult(
record_name="test_record",
success=True,
record_change_tag="new_tag",
)
assert result.record_name == "test_record"
assert result.success is True
assert result.record_change_tag == "new_tag"
assert result.error_code is None
def test_failed_result(self):
"""Test creating a failed operation result."""
result = OperationResult(
record_name="test_record",
success=False,
error_code="SERVER_ERROR",
error_message="Internal server error",
)
assert result.success is False
assert result.error_code == "SERVER_ERROR"
assert result.error_message == "Internal server error"
class TestBatchResult:
"""Tests for BatchResult dataclass."""
def test_empty_batch_result(self):
"""Test empty batch result."""
result = BatchResult()
assert result.all_succeeded is True
assert result.success_count == 0
assert result.failure_count == 0
def test_batch_with_successes(self):
"""Test batch with successful operations."""
result = BatchResult()
result.successful.append(OperationResult("rec1", True))
result.successful.append(OperationResult("rec2", True))
assert result.all_succeeded is True
assert result.success_count == 2
assert result.failure_count == 0
def test_batch_with_failures(self):
"""Test batch with failed operations."""
result = BatchResult()
result.successful.append(OperationResult("rec1", True))
result.failed.append(OperationResult("rec2", False, error_message="Error"))
assert result.all_succeeded is False
assert result.success_count == 1
assert result.failure_count == 1
class TestCloudKitClient:
"""Tests for CloudKitClient."""
def test_not_configured_without_credentials(self):
"""Test that client reports not configured without credentials."""
with patch.dict("os.environ", {}, clear=True):
client = CloudKitClient()
assert client.is_configured is False
def test_configured_with_credentials(self):
"""Test that client reports configured with credentials."""
# Create a minimal mock for the private key
mock_key = MagicMock()
with patch.dict("os.environ", {
"CLOUDKIT_KEY_ID": "test_key_id",
"CLOUDKIT_PRIVATE_KEY": "-----BEGIN EC PRIVATE KEY-----\ntest\n-----END EC PRIVATE KEY-----",
}):
with patch("sportstime_parser.uploaders.cloudkit.serialization.load_pem_private_key") as mock_load:
mock_load.return_value = mock_key
client = CloudKitClient()
assert client.is_configured is True
def test_get_api_path(self):
"""Test API path construction."""
client = CloudKitClient(
container_id="iCloud.com.test.app",
environment="development",
)
path = client._get_api_path("records/query")
assert path == "/database/1/iCloud.com.test.app/development/public/records/query"
@patch("sportstime_parser.uploaders.cloudkit.requests.Session")
def test_fetch_records_query(self, mock_session_class):
"""Test fetching records with query."""
mock_session = MagicMock()
mock_session_class.return_value = mock_session
mock_response = MagicMock()
mock_response.status_code = 200
mock_response.json.return_value = {
"records": [
{"recordName": "rec1", "recordType": "Game"},
{"recordName": "rec2", "recordType": "Game"},
]
}
mock_session.request.return_value = mock_response
# Setup client with mocked auth
mock_key = MagicMock()
mock_key.sign.return_value = b"signature"
with patch.dict("os.environ", {
"CLOUDKIT_KEY_ID": "test_key",
"CLOUDKIT_PRIVATE_KEY": "-----BEGIN EC PRIVATE KEY-----\ntest\n-----END EC PRIVATE KEY-----",
}):
with patch("sportstime_parser.uploaders.cloudkit.serialization.load_pem_private_key") as mock_load:
with patch("sportstime_parser.uploaders.cloudkit.jwt.encode") as mock_jwt:
mock_load.return_value = mock_key
mock_jwt.return_value = "test_token"
client = CloudKitClient()
records = client.fetch_records(RecordType.GAME)
assert len(records) == 2
assert records[0]["recordName"] == "rec1"
@patch("sportstime_parser.uploaders.cloudkit.requests.Session")
def test_save_records_success(self, mock_session_class):
"""Test saving records successfully."""
mock_session = MagicMock()
mock_session_class.return_value = mock_session
mock_response = MagicMock()
mock_response.status_code = 200
mock_response.json.return_value = {
"records": [
{"recordName": "rec1", "recordChangeTag": "tag1"},
{"recordName": "rec2", "recordChangeTag": "tag2"},
]
}
mock_session.request.return_value = mock_response
mock_key = MagicMock()
mock_key.sign.return_value = b"signature"
with patch.dict("os.environ", {
"CLOUDKIT_KEY_ID": "test_key",
"CLOUDKIT_PRIVATE_KEY": "-----BEGIN EC PRIVATE KEY-----\ntest\n-----END EC PRIVATE KEY-----",
}):
with patch("sportstime_parser.uploaders.cloudkit.serialization.load_pem_private_key") as mock_load:
with patch("sportstime_parser.uploaders.cloudkit.jwt.encode") as mock_jwt:
mock_load.return_value = mock_key
mock_jwt.return_value = "test_token"
client = CloudKitClient()
records = [
CloudKitRecord("rec1", RecordType.GAME, {"sport": "nba"}),
CloudKitRecord("rec2", RecordType.GAME, {"sport": "nba"}),
]
result = client.save_records(records)
assert result.success_count == 2
assert result.failure_count == 0
@patch("sportstime_parser.uploaders.cloudkit.requests.Session")
def test_save_records_partial_failure(self, mock_session_class):
"""Test saving records with some failures."""
mock_session = MagicMock()
mock_session_class.return_value = mock_session
mock_response = MagicMock()
mock_response.status_code = 200
mock_response.json.return_value = {
"records": [
{"recordName": "rec1", "recordChangeTag": "tag1"},
{"recordName": "rec2", "serverErrorCode": "QUOTA_EXCEEDED", "reason": "Quota exceeded"},
]
}
mock_session.request.return_value = mock_response
mock_key = MagicMock()
mock_key.sign.return_value = b"signature"
with patch.dict("os.environ", {
"CLOUDKIT_KEY_ID": "test_key",
"CLOUDKIT_PRIVATE_KEY": "-----BEGIN EC PRIVATE KEY-----\ntest\n-----END EC PRIVATE KEY-----",
}):
with patch("sportstime_parser.uploaders.cloudkit.serialization.load_pem_private_key") as mock_load:
with patch("sportstime_parser.uploaders.cloudkit.jwt.encode") as mock_jwt:
mock_load.return_value = mock_key
mock_jwt.return_value = "test_token"
client = CloudKitClient()
records = [
CloudKitRecord("rec1", RecordType.GAME, {"sport": "nba"}),
CloudKitRecord("rec2", RecordType.GAME, {"sport": "nba"}),
]
result = client.save_records(records)
assert result.success_count == 1
assert result.failure_count == 1
assert result.failed[0].error_code == "QUOTA_EXCEEDED"
@patch("sportstime_parser.uploaders.cloudkit.requests.Session")
def test_auth_error(self, mock_session_class):
"""Test handling authentication error."""
mock_session = MagicMock()
mock_session_class.return_value = mock_session
mock_response = MagicMock()
mock_response.status_code = 421
mock_session.request.return_value = mock_response
mock_key = MagicMock()
mock_key.sign.return_value = b"signature"
with patch.dict("os.environ", {
"CLOUDKIT_KEY_ID": "test_key",
"CLOUDKIT_PRIVATE_KEY": "-----BEGIN EC PRIVATE KEY-----\ntest\n-----END EC PRIVATE KEY-----",
}):
with patch("sportstime_parser.uploaders.cloudkit.serialization.load_pem_private_key") as mock_load:
with patch("sportstime_parser.uploaders.cloudkit.jwt.encode") as mock_jwt:
mock_load.return_value = mock_key
mock_jwt.return_value = "test_token"
client = CloudKitClient()
with pytest.raises(CloudKitAuthError):
client.fetch_records(RecordType.GAME)
@patch("sportstime_parser.uploaders.cloudkit.requests.Session")
def test_rate_limit_error(self, mock_session_class):
"""Test handling rate limit error."""
mock_session = MagicMock()
mock_session_class.return_value = mock_session
mock_response = MagicMock()
mock_response.status_code = 429
mock_session.request.return_value = mock_response
mock_key = MagicMock()
mock_key.sign.return_value = b"signature"
with patch.dict("os.environ", {
"CLOUDKIT_KEY_ID": "test_key",
"CLOUDKIT_PRIVATE_KEY": "-----BEGIN EC PRIVATE KEY-----\ntest\n-----END EC PRIVATE KEY-----",
}):
with patch("sportstime_parser.uploaders.cloudkit.serialization.load_pem_private_key") as mock_load:
with patch("sportstime_parser.uploaders.cloudkit.jwt.encode") as mock_jwt:
mock_load.return_value = mock_key
mock_jwt.return_value = "test_token"
client = CloudKitClient()
with pytest.raises(CloudKitRateLimitError):
client.fetch_records(RecordType.GAME)
@patch("sportstime_parser.uploaders.cloudkit.requests.Session")
def test_server_error(self, mock_session_class):
"""Test handling server error."""
mock_session = MagicMock()
mock_session_class.return_value = mock_session
mock_response = MagicMock()
mock_response.status_code = 503
mock_session.request.return_value = mock_response
mock_key = MagicMock()
mock_key.sign.return_value = b"signature"
with patch.dict("os.environ", {
"CLOUDKIT_KEY_ID": "test_key",
"CLOUDKIT_PRIVATE_KEY": "-----BEGIN EC PRIVATE KEY-----\ntest\n-----END EC PRIVATE KEY-----",
}):
with patch("sportstime_parser.uploaders.cloudkit.serialization.load_pem_private_key") as mock_load:
with patch("sportstime_parser.uploaders.cloudkit.jwt.encode") as mock_jwt:
mock_load.return_value = mock_key
mock_jwt.return_value = "test_token"
client = CloudKitClient()
with pytest.raises(CloudKitServerError):
client.fetch_records(RecordType.GAME)
class TestRecordType:
"""Tests for RecordType enum."""
def test_record_type_values(self):
"""Test that record type values match CloudKit schema."""
assert RecordType.GAME.value == "Game"
assert RecordType.TEAM.value == "Team"
assert RecordType.STADIUM.value == "Stadium"
assert RecordType.TEAM_ALIAS.value == "TeamAlias"
assert RecordType.STADIUM_ALIAS.value == "StadiumAlias"
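The three error tests in TestCloudKitClient pin down how HTTP status codes map to exception types: 421 to CloudKitAuthError, 429 to CloudKitRateLimitError, and 503 to CloudKitServerError. A minimal, self-contained sketch of such a mapping is shown here. The exception classes are local stand-ins, and `raise_for_cloudkit_status`, the handling of 401, and the treatment of every 5xx code as a server error are assumptions for illustration, not the package's actual implementation.

```python
class CloudKitError(Exception):
    """Stand-in base class for this sketch."""


class CloudKitAuthError(CloudKitError):
    """Stand-in for an authentication failure."""


class CloudKitRateLimitError(CloudKitError):
    """Stand-in for a rate limit response."""


class CloudKitServerError(CloudKitError):
    """Stand-in for a server-side failure."""


def raise_for_cloudkit_status(status_code: int) -> None:
    """Hypothetical helper: raise a typed error for a non-success status."""
    if status_code in (401, 421):  # assumption: 401 is handled like 421
        raise CloudKitAuthError(f"authentication failed (HTTP {status_code})")
    if status_code == 429:
        raise CloudKitRateLimitError("rate limited (HTTP 429)")
    if 500 <= status_code < 600:  # assumption: any 5xx is a server error
        raise CloudKitServerError(f"server error (HTTP {status_code})")
    if status_code >= 400:
        raise CloudKitError(f"request failed (HTTP {status_code})")
```

Keeping the mapping in one function would let a retry layer catch CloudKitRateLimitError and CloudKitServerError separately from authentication failures, which are unlikely to succeed on retry.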
@@ -0,0 +1,350 @@
"""Tests for the record differ."""
import pytest
from datetime import datetime
from sportstime_parser.models.game import Game
from sportstime_parser.models.team import Team
from sportstime_parser.models.stadium import Stadium
from sportstime_parser.uploaders.diff import (
DiffAction,
RecordDiff,
DiffResult,
RecordDiffer,
game_to_cloudkit_record,
team_to_cloudkit_record,
stadium_to_cloudkit_record,
)
from sportstime_parser.uploaders.cloudkit import RecordType
class TestRecordDiff:
"""Tests for RecordDiff dataclass."""
def test_create_record_diff(self):
"""Test creating a RecordDiff."""
diff = RecordDiff(
record_name="nba_2025_hou_okc_1021",
record_type=RecordType.GAME,
action=DiffAction.CREATE,
)
assert diff.record_name == "nba_2025_hou_okc_1021"
assert diff.record_type == RecordType.GAME
assert diff.action == DiffAction.CREATE
class TestDiffResult:
"""Tests for DiffResult dataclass."""
def test_empty_result(self):
"""Test empty DiffResult."""
result = DiffResult()
assert result.create_count == 0
assert result.update_count == 0
assert result.delete_count == 0
assert result.unchanged_count == 0
assert result.total_changes == 0
def test_counts(self):
"""Test counting different change types."""
result = DiffResult()
result.creates.append(RecordDiff(
record_name="game_1",
record_type=RecordType.GAME,
action=DiffAction.CREATE,
))
result.creates.append(RecordDiff(
record_name="game_2",
record_type=RecordType.GAME,
action=DiffAction.CREATE,
))
result.updates.append(RecordDiff(
record_name="game_3",
record_type=RecordType.GAME,
action=DiffAction.UPDATE,
))
result.deletes.append(RecordDiff(
record_name="game_4",
record_type=RecordType.GAME,
action=DiffAction.DELETE,
))
result.unchanged.append(RecordDiff(
record_name="game_5",
record_type=RecordType.GAME,
action=DiffAction.UNCHANGED,
))
assert result.create_count == 2
assert result.update_count == 1
assert result.delete_count == 1
assert result.unchanged_count == 1
assert result.total_changes == 4 # excludes unchanged
class TestRecordDiffer:
"""Tests for RecordDiffer."""
@pytest.fixture
def differ(self):
"""Create a RecordDiffer instance."""
return RecordDiffer()
@pytest.fixture
def sample_game(self):
"""Create a sample Game."""
return Game(
id="nba_2025_hou_okc_1021",
sport="nba",
season=2025,
home_team_id="team_nba_okc",
away_team_id="team_nba_hou",
stadium_id="stadium_nba_paycom_center",
game_date=datetime(2025, 10, 21, 19, 0, 0),
status="scheduled",
)
@pytest.fixture
def sample_team(self):
"""Create a sample Team."""
return Team(
id="team_nba_okc",
sport="nba",
city="Oklahoma City",
name="Thunder",
full_name="Oklahoma City Thunder",
abbreviation="OKC",
conference="Western",
division="Northwest",
)
@pytest.fixture
def sample_stadium(self):
"""Create a sample Stadium."""
return Stadium(
id="stadium_nba_paycom_center",
sport="nba",
name="Paycom Center",
city="Oklahoma City",
state="OK",
country="USA",
latitude=35.4634,
longitude=-97.5151,
capacity=18203,
)
def test_diff_games_create(self, differ, sample_game):
"""Test detecting new games to create."""
local_games = [sample_game]
remote_records = []
result = differ.diff_games(local_games, remote_records)
assert result.create_count == 1
assert result.update_count == 0
assert result.delete_count == 0
assert result.creates[0].record_name == sample_game.id
def test_diff_games_delete(self, differ, sample_game):
"""Test detecting games to delete."""
local_games = []
remote_records = [
{
"recordName": sample_game.id,
"recordType": "Game",
"fields": {
"sport": {"value": "nba", "type": "STRING"},
"season": {"value": 2025, "type": "INT64"},
},
"recordChangeTag": "abc123",
}
]
result = differ.diff_games(local_games, remote_records)
assert result.create_count == 0
assert result.delete_count == 1
assert result.deletes[0].record_name == sample_game.id
def test_diff_games_unchanged(self, differ, sample_game):
"""Test detecting unchanged games."""
local_games = [sample_game]
remote_records = [
{
"recordName": sample_game.id,
"recordType": "Game",
"fields": {
"sport": {"value": "nba", "type": "STRING"},
"season": {"value": 2025, "type": "INT64"},
"home_team_id": {"value": "team_nba_okc", "type": "STRING"},
"away_team_id": {"value": "team_nba_hou", "type": "STRING"},
"stadium_id": {"value": "stadium_nba_paycom_center", "type": "STRING"},
"game_date": {"value": int(sample_game.game_date.timestamp() * 1000), "type": "TIMESTAMP"},
"game_number": {"value": None, "type": "INT64"},
"home_score": {"value": None, "type": "INT64"},
"away_score": {"value": None, "type": "INT64"},
"status": {"value": "scheduled", "type": "STRING"},
},
"recordChangeTag": "abc123",
}
]
result = differ.diff_games(local_games, remote_records)
assert result.create_count == 0
assert result.update_count == 0
assert result.unchanged_count == 1
def test_diff_games_update(self, differ, sample_game):
"""Test detecting games that need update."""
local_games = [sample_game]
# Remote has different status
remote_records = [
{
"recordName": sample_game.id,
"recordType": "Game",
"fields": {
"sport": {"value": "nba", "type": "STRING"},
"season": {"value": 2025, "type": "INT64"},
"home_team_id": {"value": "team_nba_okc", "type": "STRING"},
"away_team_id": {"value": "team_nba_hou", "type": "STRING"},
"stadium_id": {"value": "stadium_nba_paycom_center", "type": "STRING"},
"game_date": {"value": int(sample_game.game_date.timestamp() * 1000), "type": "TIMESTAMP"},
"game_number": {"value": None, "type": "INT64"},
"home_score": {"value": None, "type": "INT64"},
"away_score": {"value": None, "type": "INT64"},
"status": {"value": "postponed", "type": "STRING"}, # Different!
},
"recordChangeTag": "abc123",
}
]
result = differ.diff_games(local_games, remote_records)
assert result.update_count == 1
assert "status" in result.updates[0].changed_fields
assert result.updates[0].record_change_tag == "abc123"
def test_diff_teams_create(self, differ, sample_team):
"""Test detecting new teams to create."""
local_teams = [sample_team]
remote_records = []
result = differ.diff_teams(local_teams, remote_records)
assert result.create_count == 1
assert result.creates[0].record_name == sample_team.id
def test_diff_stadiums_create(self, differ, sample_stadium):
"""Test detecting new stadiums to create."""
local_stadiums = [sample_stadium]
remote_records = []
result = differ.diff_stadiums(local_stadiums, remote_records)
assert result.create_count == 1
assert result.creates[0].record_name == sample_stadium.id
def test_get_records_to_upload(self, differ, sample_game):
"""Test getting CloudKitRecords for upload."""
game2 = Game(
id="nba_2025_lal_lac_1022",
sport="nba",
season=2025,
home_team_id="team_nba_lac",
away_team_id="team_nba_lal",
stadium_id="stadium_nba_crypto_com",
game_date=datetime(2025, 10, 22, 19, 0, 0),
status="scheduled",
)
local_games = [sample_game, game2]
# Only game2 exists remotely with different status
remote_records = [
{
"recordName": game2.id,
"recordType": "Game",
"fields": {
"sport": {"value": "nba", "type": "STRING"},
"season": {"value": 2025, "type": "INT64"},
"home_team_id": {"value": "team_nba_lac", "type": "STRING"},
"away_team_id": {"value": "team_nba_lal", "type": "STRING"},
"stadium_id": {"value": "stadium_nba_crypto_com", "type": "STRING"},
"game_date": {"value": int(game2.game_date.timestamp() * 1000), "type": "TIMESTAMP"},
"status": {"value": "postponed", "type": "STRING"}, # Different!
},
"recordChangeTag": "xyz789",
}
]
result = differ.diff_games(local_games, remote_records)
records = result.get_records_to_upload()
assert len(records) == 2 # 1 create + 1 update
record_names = [r.record_name for r in records]
assert sample_game.id in record_names
assert game2.id in record_names
class TestConvenienceFunctions:
"""Tests for module-level convenience functions."""
def test_game_to_cloudkit_record(self):
"""Test converting Game to CloudKitRecord."""
game = Game(
id="nba_2025_hou_okc_1021",
sport="nba",
season=2025,
home_team_id="team_nba_okc",
away_team_id="team_nba_hou",
stadium_id="stadium_nba_paycom_center",
game_date=datetime(2025, 10, 21, 19, 0, 0),
status="scheduled",
)
record = game_to_cloudkit_record(game)
assert record.record_name == game.id
assert record.record_type == RecordType.GAME
assert record.fields["sport"] == "nba"
assert record.fields["season"] == 2025
def test_team_to_cloudkit_record(self):
"""Test converting Team to CloudKitRecord."""
team = Team(
id="team_nba_okc",
sport="nba",
city="Oklahoma City",
name="Thunder",
full_name="Oklahoma City Thunder",
abbreviation="OKC",
)
record = team_to_cloudkit_record(team)
assert record.record_name == team.id
assert record.record_type == RecordType.TEAM
assert record.fields["city"] == "Oklahoma City"
assert record.fields["name"] == "Thunder"
def test_stadium_to_cloudkit_record(self):
"""Test converting Stadium to CloudKitRecord."""
stadium = Stadium(
id="stadium_nba_paycom_center",
sport="nba",
name="Paycom Center",
city="Oklahoma City",
state="OK",
country="USA",
latitude=35.4634,
longitude=-97.5151,
)
record = stadium_to_cloudkit_record(stadium)
assert record.record_name == stadium.id
assert record.record_type == RecordType.STADIUM
assert record.fields["name"] == "Paycom Center"
assert record.fields["latitude"] == 35.4634
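The differ tests above describe four outcomes: a record that exists only locally is a create, one that exists only remotely is a delete, one whose fields differ is an update, and the rest are unchanged. A minimal sketch of that classification, assuming records are already reduced to plain `name -> fields` dictionaries, could look like this. `diff_by_name` is a hypothetical helper and is not how RecordDiffer is necessarily implemented.

```python
from typing import Any


def diff_by_name(
    local: dict[str, dict[str, Any]],
    remote: dict[str, dict[str, Any]],
) -> dict[str, list[str]]:
    """Classify record names by comparing local and remote field dicts."""
    result: dict[str, list[str]] = {
        "create": [], "update": [], "delete": [], "unchanged": [],
    }
    for name, fields in local.items():
        if name not in remote:
            result["create"].append(name)
        elif fields != remote[name]:
            result["update"].append(name)
        else:
            result["unchanged"].append(name)
    # Anything that exists only remotely is a delete candidate.
    result["delete"] = [name for name in remote if name not in local]
    return result


local = {
    "game_1": {"status": "scheduled"},
    "game_2": {"status": "scheduled"},
    "game_3": {"status": "final"},
}
remote = {
    "game_2": {"status": "postponed"},
    "game_3": {"status": "final"},
    "game_4": {"status": "scheduled"},
}
changes = diff_by_name(local, remote)
```

Only the create and update buckets would need to be uploaded, which matches what `test_get_records_to_upload` expects (one create plus one update).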
@@ -0,0 +1,472 @@
"""Tests for the upload state manager."""
import json
import pytest
from datetime import datetime, timedelta
from pathlib import Path
from tempfile import TemporaryDirectory
from sportstime_parser.uploaders.state import (
RecordState,
UploadSession,
StateManager,
)
class TestRecordState:
"""Tests for RecordState dataclass."""
def test_create_record_state(self):
"""Test creating a RecordState with default values."""
state = RecordState(
record_name="nba_2025_hou_okc_1021",
record_type="Game",
)
assert state.record_name == "nba_2025_hou_okc_1021"
assert state.record_type == "Game"
assert state.status == "pending"
assert state.uploaded_at is None
assert state.record_change_tag is None
assert state.error_message is None
assert state.retry_count == 0
def test_record_state_to_dict(self):
"""Test serializing RecordState to dictionary."""
now = datetime.utcnow()
state = RecordState(
record_name="nba_2025_hou_okc_1021",
record_type="Game",
uploaded_at=now,
record_change_tag="abc123",
status="uploaded",
)
data = state.to_dict()
assert data["record_name"] == "nba_2025_hou_okc_1021"
assert data["record_type"] == "Game"
assert data["status"] == "uploaded"
assert data["uploaded_at"] == now.isoformat()
assert data["record_change_tag"] == "abc123"
def test_record_state_from_dict(self):
"""Test deserializing RecordState from dictionary."""
data = {
"record_name": "nba_2025_hou_okc_1021",
"record_type": "Game",
"uploaded_at": "2026-01-10T12:00:00",
"record_change_tag": "abc123",
"status": "uploaded",
"error_message": None,
"retry_count": 0,
}
state = RecordState.from_dict(data)
assert state.record_name == "nba_2025_hou_okc_1021"
assert state.record_type == "Game"
assert state.status == "uploaded"
assert state.uploaded_at == datetime.fromisoformat("2026-01-10T12:00:00")
assert state.record_change_tag == "abc123"
class TestUploadSession:
"""Tests for UploadSession dataclass."""
def test_create_upload_session(self):
"""Test creating an UploadSession."""
session = UploadSession(
sport="nba",
season=2025,
environment="development",
)
assert session.sport == "nba"
assert session.season == 2025
assert session.environment == "development"
assert session.total_count == 0
assert len(session.records) == 0
def test_add_record(self):
"""Test adding records to a session."""
session = UploadSession(
sport="nba",
season=2025,
environment="development",
)
session.add_record("game_1", "Game")
session.add_record("game_2", "Game")
session.add_record("team_1", "Team")
assert session.total_count == 3
assert len(session.records) == 3
assert "game_1" in session.records
assert session.records["game_1"].record_type == "Game"
def test_mark_uploaded(self):
"""Test marking a record as uploaded."""
session = UploadSession(
sport="nba",
season=2025,
environment="development",
)
session.add_record("game_1", "Game")
session.mark_uploaded("game_1", "change_tag_123")
assert session.records["game_1"].status == "uploaded"
assert session.records["game_1"].record_change_tag == "change_tag_123"
assert session.records["game_1"].uploaded_at is not None
def test_mark_failed(self):
"""Test marking a record as failed."""
session = UploadSession(
sport="nba",
season=2025,
environment="development",
)
session.add_record("game_1", "Game")
session.mark_failed("game_1", "Server error")
assert session.records["game_1"].status == "failed"
assert session.records["game_1"].error_message == "Server error"
assert session.records["game_1"].retry_count == 1
def test_mark_failed_increments_retry_count(self):
"""Test that marking failed increments retry count."""
session = UploadSession(
sport="nba",
season=2025,
environment="development",
)
session.add_record("game_1", "Game")
session.mark_failed("game_1", "Error 1")
session.mark_failed("game_1", "Error 2")
session.mark_failed("game_1", "Error 3")
assert session.records["game_1"].retry_count == 3
def test_counts(self):
"""Test session counts."""
session = UploadSession(
sport="nba",
season=2025,
environment="development",
)
session.add_record("game_1", "Game")
session.add_record("game_2", "Game")
session.add_record("game_3", "Game")
session.mark_uploaded("game_1")
session.mark_failed("game_2", "Error")
assert session.uploaded_count == 1
assert session.failed_count == 1
assert session.pending_count == 1
def test_is_complete(self):
"""Test is_complete property."""
session = UploadSession(
sport="nba",
season=2025,
environment="development",
)
session.add_record("game_1", "Game")
session.add_record("game_2", "Game")
assert not session.is_complete
session.mark_uploaded("game_1")
assert not session.is_complete
session.mark_uploaded("game_2")
assert session.is_complete
def test_progress_percent(self):
"""Test progress percentage calculation."""
session = UploadSession(
sport="nba",
season=2025,
environment="development",
)
session.add_record("game_1", "Game")
session.add_record("game_2", "Game")
session.add_record("game_3", "Game")
session.add_record("game_4", "Game")
session.mark_uploaded("game_1")
assert session.progress_percent == 25.0
def test_get_pending_records(self):
"""Test getting pending record names."""
session = UploadSession(
sport="nba",
season=2025,
environment="development",
)
session.add_record("game_1", "Game")
session.add_record("game_2", "Game")
session.add_record("game_3", "Game")
session.mark_uploaded("game_1")
session.mark_failed("game_2", "Error")
pending = session.get_pending_records()
assert pending == ["game_3"]
def test_get_failed_records(self):
"""Test getting failed record names."""
session = UploadSession(
sport="nba",
season=2025,
environment="development",
)
session.add_record("game_1", "Game")
session.add_record("game_2", "Game")
session.add_record("game_3", "Game")
session.mark_failed("game_1", "Error 1")
session.mark_failed("game_3", "Error 3")
failed = session.get_failed_records()
assert set(failed) == {"game_1", "game_3"}
def test_get_retryable_records(self):
"""Test getting records eligible for retry."""
session = UploadSession(
sport="nba",
season=2025,
environment="development",
)
session.add_record("game_1", "Game")
session.add_record("game_2", "Game")
session.add_record("game_3", "Game")
# Fail game_1 once
session.mark_failed("game_1", "Error")
# Fail game_2 three times (max retries)
session.mark_failed("game_2", "Error")
session.mark_failed("game_2", "Error")
session.mark_failed("game_2", "Error")
retryable = session.get_retryable_records(max_retries=3)
assert retryable == ["game_1"]
def test_to_dict_and_from_dict(self):
"""Test round-trip serialization."""
session = UploadSession(
sport="nba",
season=2025,
environment="development",
)
session.add_record("game_1", "Game")
session.add_record("game_2", "Game")
session.mark_uploaded("game_1", "tag_123")
data = session.to_dict()
restored = UploadSession.from_dict(data)
assert restored.sport == session.sport
assert restored.season == session.season
assert restored.environment == session.environment
assert restored.total_count == session.total_count
assert restored.uploaded_count == session.uploaded_count
assert restored.records["game_1"].status == "uploaded"
class TestStateManager:
"""Tests for StateManager."""
def test_create_session(self):
"""Test creating a new session."""
with TemporaryDirectory() as tmpdir:
manager = StateManager(state_dir=Path(tmpdir))
session = manager.create_session(
sport="nba",
season=2025,
environment="development",
record_names=[
("game_1", "Game"),
("game_2", "Game"),
("team_1", "Team"),
],
)
assert session.sport == "nba"
assert session.season == 2025
assert session.total_count == 3
# Check file was created
state_file = Path(tmpdir) / "upload_state_nba_2025_development.json"
assert state_file.exists()
def test_load_session(self):
"""Test loading an existing session."""
with TemporaryDirectory() as tmpdir:
manager = StateManager(state_dir=Path(tmpdir))
# Create and save a session
original = manager.create_session(
sport="nba",
season=2025,
environment="development",
record_names=[("game_1", "Game")],
)
original.mark_uploaded("game_1", "tag_123")
manager.save_session(original)
# Load it back
loaded = manager.load_session("nba", 2025, "development")
assert loaded is not None
assert loaded.sport == "nba"
assert loaded.records["game_1"].status == "uploaded"
def test_load_nonexistent_session(self):
"""Test loading a session that doesn't exist."""
with TemporaryDirectory() as tmpdir:
manager = StateManager(state_dir=Path(tmpdir))
session = manager.load_session("nba", 2025, "development")
assert session is None
def test_delete_session(self):
"""Test deleting a session."""
with TemporaryDirectory() as tmpdir:
manager = StateManager(state_dir=Path(tmpdir))
# Create a session
manager.create_session(
sport="nba",
season=2025,
environment="development",
record_names=[("game_1", "Game")],
)
# Delete it
result = manager.delete_session("nba", 2025, "development")
assert result is True
# Verify it's gone
loaded = manager.load_session("nba", 2025, "development")
assert loaded is None
def test_delete_nonexistent_session(self):
"""Test deleting a session that doesn't exist."""
with TemporaryDirectory() as tmpdir:
manager = StateManager(state_dir=Path(tmpdir))
result = manager.delete_session("nba", 2025, "development")
assert result is False
def test_list_sessions(self):
"""Test listing all sessions."""
with TemporaryDirectory() as tmpdir:
manager = StateManager(state_dir=Path(tmpdir))
# Create multiple sessions
manager.create_session(
sport="nba",
season=2025,
environment="development",
record_names=[("game_1", "Game")],
)
manager.create_session(
sport="mlb",
season=2026,
environment="production",
record_names=[("game_2", "Game"), ("game_3", "Game")],
)
sessions = manager.list_sessions()
assert len(sessions) == 2
sports = {s["sport"] for s in sessions}
assert sports == {"nba", "mlb"}
def test_get_session_or_create_new(self):
"""Test getting a session when none exists."""
with TemporaryDirectory() as tmpdir:
manager = StateManager(state_dir=Path(tmpdir))
session = manager.get_session_or_create(
sport="nba",
season=2025,
environment="development",
record_names=[("game_1", "Game")],
resume=False,
)
assert session.sport == "nba"
assert session.total_count == 1
def test_get_session_or_create_resume(self):
"""Test resuming an existing session."""
with TemporaryDirectory() as tmpdir:
manager = StateManager(state_dir=Path(tmpdir))
# Create initial session
original = manager.create_session(
sport="nba",
season=2025,
environment="development",
record_names=[("game_1", "Game"), ("game_2", "Game")],
)
original.mark_uploaded("game_1", "tag_123")
manager.save_session(original)
# Resume with additional records
session = manager.get_session_or_create(
sport="nba",
season=2025,
environment="development",
record_names=[("game_1", "Game"), ("game_2", "Game"), ("game_3", "Game")],
resume=True,
)
# Should have original progress plus new record
assert session.records["game_1"].status == "uploaded"
assert "game_3" in session.records
assert session.total_count == 3
def test_get_session_or_create_overwrite(self):
"""Test overwriting an existing session when not resuming."""
with TemporaryDirectory() as tmpdir:
manager = StateManager(state_dir=Path(tmpdir))
# Create initial session
original = manager.create_session(
sport="nba",
season=2025,
environment="development",
record_names=[("game_1", "Game"), ("game_2", "Game")],
)
original.mark_uploaded("game_1", "tag_123")
manager.save_session(original)
# Create new session (not resuming)
session = manager.get_session_or_create(
sport="nba",
season=2025,
environment="development",
record_names=[("game_3", "Game")],
resume=False,
)
# Should be a fresh session
assert session.total_count == 1
assert "game_1" not in session.records
assert "game_3" in session.records
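The state tests above rely on two behaviors: per-record state survives a round trip through a JSON file, and a failed record stays retryable only while its retry count is below the limit. A minimal sketch of both, using plain dictionaries instead of the RecordState and UploadSession classes, is shown here. The `retryable` helper and the dictionary layout are assumptions for illustration; only the file name pattern is taken from the tests.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory


def retryable(records: dict[str, dict], max_retries: int = 3) -> list[str]:
    """Names of failed records that have not yet used up their retries."""
    return [
        name
        for name, state in records.items()
        if state["status"] == "failed" and state["retry_count"] < max_retries
    ]


records = {
    "game_1": {"status": "failed", "retry_count": 1},
    "game_2": {"status": "failed", "retry_count": 3},
    "game_3": {"status": "pending", "retry_count": 0},
}

with TemporaryDirectory() as tmpdir:
    # File name follows the pattern the tests check for.
    state_file = Path(tmpdir) / "upload_state_nba_2025_development.json"
    state_file.write_text(json.dumps(records, indent=2))
    restored = json.loads(state_file.read_text())

names = retryable(restored, max_retries=3)
```

With `max_retries=3`, a record that has already failed three times is excluded, which is the same boundary `test_get_retryable_records` asserts.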
@@ -0,0 +1,52 @@
"""CloudKit uploaders for sportstime-parser."""
from .cloudkit import (
CloudKitClient,
CloudKitRecord,
CloudKitError,
CloudKitAuthError,
CloudKitRateLimitError,
CloudKitServerError,
RecordType,
OperationResult,
BatchResult,
)
from .state import (
RecordState,
UploadSession,
StateManager,
)
from .diff import (
DiffAction,
RecordDiff,
DiffResult,
RecordDiffer,
game_to_cloudkit_record,
team_to_cloudkit_record,
stadium_to_cloudkit_record,
)
__all__ = [
# CloudKit client
"CloudKitClient",
"CloudKitRecord",
"CloudKitError",
"CloudKitAuthError",
"CloudKitRateLimitError",
"CloudKitServerError",
"RecordType",
"OperationResult",
"BatchResult",
# State manager
"RecordState",
"UploadSession",
"StateManager",
# Differ
"DiffAction",
"RecordDiff",
"DiffResult",
"RecordDiffer",
"game_to_cloudkit_record",
"team_to_cloudkit_record",
"stadium_to_cloudkit_record",
]
@@ -0,0 +1,565 @@
"""CloudKit Web Services client for sportstime-parser.
This module provides a client for uploading data to CloudKit using the
CloudKit Web Services API. It handles JWT authentication, request signing,
and batch operations.
Reference: https://developer.apple.com/documentation/cloudkitwebservices
"""
import base64
import hashlib
import json
import os
import time
from dataclasses import dataclass, field
from datetime import datetime
from pathlib import Path
from typing import Any, Optional
from enum import Enum
import jwt
import requests
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.hazmat.backends import default_backend
from ..config import (
CLOUDKIT_CONTAINER_ID,
CLOUDKIT_ENVIRONMENT,
CLOUDKIT_BATCH_SIZE,
)
from ..utils.logging import get_logger
class RecordType(str, Enum):
"""CloudKit record types for SportsTime."""
GAME = "Game"
TEAM = "Team"
STADIUM = "Stadium"
TEAM_ALIAS = "TeamAlias"
STADIUM_ALIAS = "StadiumAlias"
@dataclass
class CloudKitRecord:
"""Represents a CloudKit record for upload.
Attributes:
record_name: Unique record identifier (canonical ID)
record_type: CloudKit record type
fields: Dictionary of field name -> field value
record_change_tag: Version tag for conflict detection (None for new records)
"""
record_name: str
record_type: RecordType
fields: dict[str, Any]
record_change_tag: Optional[str] = None
def to_cloudkit_dict(self) -> dict:
"""Convert to CloudKit API format."""
record = {
"recordName": self.record_name,
"recordType": self.record_type.value,
"fields": self._format_fields(),
}
if self.record_change_tag:
record["recordChangeTag"] = self.record_change_tag
return record
def _format_fields(self) -> dict:
"""Format fields for CloudKit API."""
formatted = {}
for key, value in self.fields.items():
if value is None:
continue
formatted[key] = self._format_field_value(value)
return formatted
def _format_field_value(self, value: Any) -> dict:
"""Format a single field value for CloudKit API."""
if isinstance(value, str):
return {"value": value, "type": "STRING"}
elif isinstance(value, bool):
# Checked before int: bool is a subclass of int, so an int check placed first would match booleans
return {"value": 1 if value else 0, "type": "INT64"}
elif isinstance(value, int):
return {"value": value, "type": "INT64"}
elif isinstance(value, float):
return {"value": value, "type": "DOUBLE"}
elif isinstance(value, datetime):
# CloudKit expects milliseconds since epoch
timestamp_ms = int(value.timestamp() * 1000)
return {"value": timestamp_ms, "type": "TIMESTAMP"}
elif isinstance(value, list):
return {"value": value, "type": "STRING_LIST"}
elif isinstance(value, dict) and "latitude" in value and "longitude" in value:
return {
"value": {
"latitude": value["latitude"],
"longitude": value["longitude"],
},
"type": "LOCATION",
}
else:
# Default to string
return {"value": str(value), "type": "STRING"}
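The order of the isinstance checks in a type mapping like this one matters, because `bool` is a subclass of `int` in Python and `isinstance(True, int)` is true. The following standalone sketch covers a subset of the types and tests `bool` first. `format_field_value` is a simplified stand-in written for this example, not the method from the diff.

```python
from datetime import datetime, timezone
from typing import Any


def format_field_value(value: Any) -> dict:
    """Sketch of a Python-to-CloudKit field mapping (subset of types)."""
    # bool must be tested before int: isinstance(True, int) is True.
    if isinstance(value, bool):
        return {"value": 1 if value else 0, "type": "INT64"}
    if isinstance(value, int):
        return {"value": value, "type": "INT64"}
    if isinstance(value, float):
        return {"value": value, "type": "DOUBLE"}
    if isinstance(value, datetime):
        # Milliseconds since the Unix epoch.
        return {"value": int(value.timestamp() * 1000), "type": "TIMESTAMP"}
    return {"value": str(value), "type": "STRING"}


flag = format_field_value(True)
when = format_field_value(datetime(2025, 10, 21, 19, 0, tzinfo=timezone.utc))
```

If the `int` branch came first, `True` would be passed through unchanged and serialized by `json.dumps` as `true` rather than `1`.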
@dataclass
class OperationResult:
"""Result of a CloudKit operation."""
record_name: str
success: bool
record_change_tag: Optional[str] = None
error_code: Optional[str] = None
error_message: Optional[str] = None
@dataclass
class BatchResult:
"""Result of a batch CloudKit operation."""
successful: list[OperationResult] = field(default_factory=list)
failed: list[OperationResult] = field(default_factory=list)
@property
def all_succeeded(self) -> bool:
return len(self.failed) == 0
@property
def success_count(self) -> int:
return len(self.successful)
@property
def failure_count(self) -> int:
return len(self.failed)
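A batch response can mix successes and failures, as the mocked response in `test_save_records_partial_failure` shows. A minimal sketch of splitting such a response into the two lists is given here, assuming (as in that mock) that a failed entry carries `serverErrorCode` and `reason` keys. `partition_response` and the plain dictionaries it returns are hypothetical stand-ins for whatever builds BatchResult in the real client.

```python
from typing import Any


def partition_response(response: dict[str, Any]) -> tuple[list[dict], list[dict]]:
    """Split a batch response into successful and failed entries."""
    successful: list[dict] = []
    failed: list[dict] = []
    for item in response.get("records", []):
        # Assumption: a per-record failure carries a "serverErrorCode" key.
        if "serverErrorCode" in item:
            failed.append({
                "record_name": item.get("recordName"),
                "error_code": item["serverErrorCode"],
                "error_message": item.get("reason"),
            })
        else:
            successful.append({
                "record_name": item.get("recordName"),
                "record_change_tag": item.get("recordChangeTag"),
            })
    return successful, failed


response = {
    "records": [
        {"recordName": "rec1", "recordChangeTag": "tag1"},
        {"recordName": "rec2", "serverErrorCode": "QUOTA_EXCEEDED", "reason": "Quota exceeded"},
    ]
}
ok, bad = partition_response(response)
```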
class CloudKitClient:
"""Client for CloudKit Web Services API.
Handles authentication via server-to-server JWT tokens and provides
methods for CRUD operations on CloudKit records.
Authentication requires:
- Key ID: CloudKit key identifier from Apple Developer Portal
- Private Key: EC private key in PEM format
Environment variables:
- CLOUDKIT_KEY_ID: The key identifier
- CLOUDKIT_PRIVATE_KEY: The private key contents (checked first)
- CLOUDKIT_PRIVATE_KEY_PATH: Path to the private key file (used if CLOUDKIT_PRIVATE_KEY is unset)
"""
BASE_URL = "https://api.apple-cloudkit.com"
TOKEN_EXPIRY_SECONDS = 3600 # 1 hour
def __init__(
self,
container_id: str = CLOUDKIT_CONTAINER_ID,
environment: str = CLOUDKIT_ENVIRONMENT,
key_id: Optional[str] = None,
private_key: Optional[str] = None,
private_key_path: Optional[str] = None,
):
"""Initialize the CloudKit client.
Args:
container_id: CloudKit container identifier
environment: 'development' or 'production'
key_id: CloudKit server-to-server key ID
private_key: PEM-encoded EC private key contents
private_key_path: Path to PEM-encoded EC private key file
"""
self.container_id = container_id
self.environment = environment
self.logger = get_logger()
# Load authentication credentials
self.key_id = key_id or os.environ.get("CLOUDKIT_KEY_ID")
if private_key:
self._private_key_pem = private_key
elif private_key_path:
self._private_key_pem = Path(private_key_path).read_text()
elif os.environ.get("CLOUDKIT_PRIVATE_KEY"):
self._private_key_pem = os.environ["CLOUDKIT_PRIVATE_KEY"]
elif os.environ.get("CLOUDKIT_PRIVATE_KEY_PATH"):
self._private_key_pem = Path(os.environ["CLOUDKIT_PRIVATE_KEY_PATH"]).read_text()
else:
self._private_key_pem = None
# Parse the private key if available
self._private_key = None
if self._private_key_pem:
self._private_key = serialization.load_pem_private_key(
self._private_key_pem.encode(),
password=None,
backend=default_backend(),
)
# Token cache
self._token: Optional[str] = None
self._token_expiry: float = 0
# Session for connection pooling
self._session = requests.Session()
@property
def is_configured(self) -> bool:
"""Check if the client has valid authentication credentials."""
return bool(self.key_id and self._private_key)
def _get_api_path(self, operation: str) -> str:
"""Build the full API path for an operation."""
return f"/database/1/{self.container_id}/{self.environment}/public/{operation}"
def _get_token(self) -> str:
"""Get a valid JWT token, generating a new one if needed."""
if not self.is_configured:
raise ValueError(
"CloudKit client not configured. Set CLOUDKIT_KEY_ID and either "
"CLOUDKIT_PRIVATE_KEY or CLOUDKIT_PRIVATE_KEY_PATH."
)
now = time.time()
# Return cached token if still valid (with 5 min buffer)
if self._token and (self._token_expiry - now) > 300:
return self._token
# Generate new token
expiry = now + self.TOKEN_EXPIRY_SECONDS
payload = {
"iss": self.key_id,
"iat": int(now),
"exp": int(expiry),
"sub": self.container_id,
}
self._token = jwt.encode(
payload,
self._private_key,
algorithm="ES256",
)
self._token_expiry = expiry
return self._token
def _sign_request(self, method: str, path: str, body: Optional[bytes] = None) -> dict:
"""Generate request headers with authentication.
Args:
method: HTTP method
path: API path
body: Request body bytes
Returns:
Dictionary of headers to include in the request
"""
token = self._get_token()
# CloudKit expects the request date as an ISO 8601 UTC timestamp
date_str = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
# Calculate body hash
if body:
body_hash = base64.b64encode(
hashlib.sha256(body).digest()
).decode()
else:
body_hash = base64.b64encode(
hashlib.sha256(b"").digest()
).decode()
# Build the message to sign
# Format: date:body_hash:path
message = f"{date_str}:{body_hash}:{path}"
# Sign the message
signature = self._private_key.sign(
message.encode(),
ec.ECDSA(hashes.SHA256()),
)
signature_b64 = base64.b64encode(signature).decode()
return {
"Authorization": f"Bearer {token}",
"X-Apple-CloudKit-Request-KeyID": self.key_id,
"X-Apple-CloudKit-Request-ISO8601Date": date_str,
"X-Apple-CloudKit-Request-SignatureV1": signature_b64,
"Content-Type": "application/json",
}
def _request(
self,
method: str,
operation: str,
body: Optional[dict] = None,
) -> dict:
"""Make a request to the CloudKit API.
Args:
method: HTTP method
operation: API operation path
body: Request body as dictionary
Returns:
Response data as dictionary
Raises:
CloudKitError: If the request fails
"""
path = self._get_api_path(operation)
url = f"{self.BASE_URL}{path}"
body_bytes = json.dumps(body).encode() if body else None
headers = self._sign_request(method, path, body_bytes)
response = self._session.request(
method=method,
url=url,
headers=headers,
data=body_bytes,
)
if response.status_code == 200:
return response.json()
elif response.status_code in (401, 421):
# Authentication failed (401) or required (421) - drop the cached token
self._token = None
raise CloudKitAuthError("Authentication failed - check credentials")
elif response.status_code == 429:
raise CloudKitRateLimitError("Rate limit exceeded")
elif response.status_code >= 500:
raise CloudKitServerError(f"Server error: {response.status_code}")
else:
try:
error_data = response.json()
error_msg = error_data.get("serverErrorCode", str(response.status_code))
except ValueError:  # response body was not valid JSON
error_msg = response.text
raise CloudKitError(f"Request failed: {error_msg}")
def fetch_records(
self,
record_type: RecordType,
record_names: Optional[list[str]] = None,
limit: int = 200,
) -> list[dict]:
"""Fetch records from CloudKit.
Args:
record_type: Type of records to fetch
record_names: Specific record names to fetch (optional)
limit: Maximum records to return (default 200)
Returns:
List of record dictionaries
"""
if record_names:
# Fetch specific records by name
body = {
"records": [{"recordName": name} for name in record_names],
}
response = self._request("POST", "records/lookup", body)
else:
# Query all records of type
body = {
"query": {
"recordType": record_type.value,
},
"resultsLimit": limit,
}
response = self._request("POST", "records/query", body)
records = response.get("records", [])
return [r for r in records if "recordName" in r]
def fetch_all_records(self, record_type: RecordType) -> list[dict]:
"""Fetch all records of a type using pagination.
Args:
record_type: Type of records to fetch
Returns:
List of all record dictionaries
"""
all_records = []
continuation_marker = None
while True:
body = {
"query": {
"recordType": record_type.value,
},
"resultsLimit": 200,
}
if continuation_marker:
body["continuationMarker"] = continuation_marker
response = self._request("POST", "records/query", body)
records = response.get("records", [])
all_records.extend([r for r in records if "recordName" in r])
continuation_marker = response.get("continuationMarker")
if not continuation_marker:
break
return all_records
def save_records(self, records: list[CloudKitRecord]) -> BatchResult:
"""Save records to CloudKit (create or update).
Args:
records: List of records to save
Returns:
BatchResult with success/failure details
"""
result = BatchResult()
# Process in batches
for i in range(0, len(records), CLOUDKIT_BATCH_SIZE):
batch = records[i:i + CLOUDKIT_BATCH_SIZE]
batch_result = self._save_batch(batch)
result.successful.extend(batch_result.successful)
result.failed.extend(batch_result.failed)
return result
def _save_batch(self, records: list[CloudKitRecord]) -> BatchResult:
"""Save a single batch of records.
Args:
records: List of records (max CLOUDKIT_BATCH_SIZE)
Returns:
BatchResult with success/failure details
"""
result = BatchResult()
operations = []
for record in records:
op = {
"operationType": "forceReplace",
"record": record.to_cloudkit_dict(),
}
operations.append(op)
body = {"operations": operations}
try:
response = self._request("POST", "records/modify", body)
except CloudKitError as e:
# Entire batch failed
for record in records:
result.failed.append(OperationResult(
record_name=record.record_name,
success=False,
error_message=str(e),
))
return result
# Process individual results
for record_data in response.get("records", []):
record_name = record_data.get("recordName", "unknown")
if "serverErrorCode" in record_data:
result.failed.append(OperationResult(
record_name=record_name,
success=False,
error_code=record_data.get("serverErrorCode"),
error_message=record_data.get("reason"),
))
else:
result.successful.append(OperationResult(
record_name=record_name,
success=True,
record_change_tag=record_data.get("recordChangeTag"),
))
return result
def delete_records(
self,
record_type: RecordType,
record_names: list[str],
) -> BatchResult:
"""Delete records from CloudKit.
Args:
record_type: Type of records to delete
record_names: List of record names to delete
Returns:
BatchResult with success/failure details
"""
result = BatchResult()
# Process in batches
for i in range(0, len(record_names), CLOUDKIT_BATCH_SIZE):
batch = record_names[i:i + CLOUDKIT_BATCH_SIZE]
operations = []
for name in batch:
operations.append({
"operationType": "delete",
"record": {
"recordName": name,
"recordType": record_type.value,
},
})
body = {"operations": operations}
try:
response = self._request("POST", "records/modify", body)
except CloudKitError as e:
for name in batch:
result.failed.append(OperationResult(
record_name=name,
success=False,
error_message=str(e),
))
continue
for record_data in response.get("records", []):
record_name = record_data.get("recordName", "unknown")
if "serverErrorCode" in record_data:
result.failed.append(OperationResult(
record_name=record_name,
success=False,
error_code=record_data.get("serverErrorCode"),
error_message=record_data.get("reason"),
))
else:
result.successful.append(OperationResult(
record_name=record_name,
success=True,
))
return result
class CloudKitError(Exception):
"""Base exception for CloudKit errors."""
pass
class CloudKitAuthError(CloudKitError):
"""Authentication error."""
pass
class CloudKitRateLimitError(CloudKitError):
"""Rate limit exceeded."""
pass
class CloudKitServerError(CloudKitError):
"""Server-side error."""
pass
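The message that `_sign_request` signs has the shape `date:body_hash:path`. The following is a minimal standalone sketch of how that string is assembled, using only the standard library. The helper name `build_signing_message`, the container ID, and the request body are hypothetical values chosen for illustration; the ECDSA signing step itself is left out.

```python
import base64
import hashlib
from datetime import datetime, timezone


def build_signing_message(date_str: str, body: bytes, path: str) -> str:
    """Return the 'date:body_hash:path' string that is passed to the signer.

    Hypothetical helper: it mirrors the message layout used in _sign_request,
    but is not part of the package.
    """
    # The body hash is the base64-encoded SHA-256 digest of the raw body bytes.
    body_hash = base64.b64encode(hashlib.sha256(body).digest()).decode()
    return f"{date_str}:{body_hash}:{path}"


# Illustrative values only (assumed container ID and request body).
date_str = datetime(2026, 1, 10, 21, 6, 12, tzinfo=timezone.utc).strftime(
    "%Y-%m-%dT%H:%M:%SZ"
)
path = "/database/1/iCloud.com.example.app/development/public/records/query"
message = build_signing_message(date_str, b'{"resultsLimit": 200}', path)
print(message)
```

Because the date string itself contains colons, a receiver cannot split the message on `:` naively; the three parts are only meaningful when rebuilt from the same inputs on both sides.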
@@ -0,0 +1,425 @@
"""Record differ for CloudKit uploads.
This module compares local records with CloudKit records to determine
what needs to be created, updated, or deleted.
"""
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Any, Optional
from ..models.game import Game
from ..models.team import Team
from ..models.stadium import Stadium
from .cloudkit import CloudKitRecord, RecordType
class DiffAction(str, Enum):
"""Action to take for a record."""
CREATE = "create"
UPDATE = "update"
DELETE = "delete"
UNCHANGED = "unchanged"
@dataclass
class RecordDiff:
"""Represents the difference between local and remote records.
Attributes:
record_name: Canonical record ID
record_type: CloudKit record type
action: Action to take (create, update, delete, unchanged)
local_record: Local CloudKitRecord (None if delete)
remote_record: Remote record dict (None if create)
changed_fields: List of field names that changed (for update)
record_change_tag: Remote record's change tag (for update)
"""
record_name: str
record_type: RecordType
action: DiffAction
local_record: Optional[CloudKitRecord] = None
remote_record: Optional[dict] = None
changed_fields: list[str] = field(default_factory=list)
record_change_tag: Optional[str] = None
@dataclass
class DiffResult:
"""Result of diffing local and remote records.
Attributes:
creates: Records to create
updates: Records to update
deletes: Records to delete (present remotely, absent locally)
unchanged: Records with no changes
"""
creates: list[RecordDiff] = field(default_factory=list)
updates: list[RecordDiff] = field(default_factory=list)
deletes: list[RecordDiff] = field(default_factory=list)
unchanged: list[RecordDiff] = field(default_factory=list)
@property
def create_count(self) -> int:
return len(self.creates)
@property
def update_count(self) -> int:
return len(self.updates)
@property
def delete_count(self) -> int:
return len(self.deletes)
@property
def unchanged_count(self) -> int:
return len(self.unchanged)
@property
def total_changes(self) -> int:
return self.create_count + self.update_count + self.delete_count
def get_records_to_upload(self) -> list[CloudKitRecord]:
"""Get all records that need to be uploaded (creates + updates)."""
records = []
for diff in self.creates:
if diff.local_record:
records.append(diff.local_record)
for diff in self.updates:
if diff.local_record:
# Add change tag for update
diff.local_record.record_change_tag = diff.record_change_tag
records.append(diff.local_record)
return records
class RecordDiffer:
"""Compares local records with CloudKit records."""
# Fields to compare for each record type
GAME_FIELDS = [
"sport", "season", "home_team_id", "away_team_id", "stadium_id",
"game_date", "game_number", "home_score", "away_score", "status",
]
TEAM_FIELDS = [
"sport", "city", "name", "full_name", "abbreviation",
"conference", "division", "primary_color", "secondary_color",
"logo_url", "stadium_id",
]
STADIUM_FIELDS = [
"sport", "name", "city", "state", "country",
"latitude", "longitude", "capacity", "surface",
"roof_type", "opened_year", "image_url", "timezone",
]
def diff_games(
self,
local_games: list[Game],
remote_records: list[dict],
) -> DiffResult:
"""Diff local games against remote CloudKit records.
Args:
local_games: List of local Game objects
remote_records: List of remote record dictionaries
Returns:
DiffResult with creates, updates, deletes
"""
local_records = [self._game_to_record(g) for g in local_games]
return self._diff_records(
local_records,
remote_records,
RecordType.GAME,
self.GAME_FIELDS,
)
def diff_teams(
self,
local_teams: list[Team],
remote_records: list[dict],
) -> DiffResult:
"""Diff local teams against remote CloudKit records.
Args:
local_teams: List of local Team objects
remote_records: List of remote record dictionaries
Returns:
DiffResult with creates, updates, deletes
"""
local_records = [self._team_to_record(t) for t in local_teams]
return self._diff_records(
local_records,
remote_records,
RecordType.TEAM,
self.TEAM_FIELDS,
)
def diff_stadiums(
self,
local_stadiums: list[Stadium],
remote_records: list[dict],
) -> DiffResult:
"""Diff local stadiums against remote CloudKit records.
Args:
local_stadiums: List of local Stadium objects
remote_records: List of remote record dictionaries
Returns:
DiffResult with creates, updates, deletes
"""
local_records = [self._stadium_to_record(s) for s in local_stadiums]
return self._diff_records(
local_records,
remote_records,
RecordType.STADIUM,
self.STADIUM_FIELDS,
)
def _diff_records(
self,
local_records: list[CloudKitRecord],
remote_records: list[dict],
record_type: RecordType,
compare_fields: list[str],
) -> DiffResult:
"""Compare local and remote records.
Args:
local_records: List of local CloudKitRecord objects
remote_records: List of remote record dictionaries
record_type: Type of records being compared
compare_fields: List of field names to compare
Returns:
DiffResult with categorized differences
"""
result = DiffResult()
# Index remote records by name
remote_by_name: dict[str, dict] = {}
for record in remote_records:
name = record.get("recordName")
if name:
remote_by_name[name] = record
# Index local records by name
local_by_name: dict[str, CloudKitRecord] = {}
for record in local_records:
local_by_name[record.record_name] = record
# Find creates and updates
for local_record in local_records:
remote = remote_by_name.get(local_record.record_name)
if remote is None:
# New record
result.creates.append(RecordDiff(
record_name=local_record.record_name,
record_type=record_type,
action=DiffAction.CREATE,
local_record=local_record,
))
else:
# Check for changes
changed_fields = self._compare_fields(
local_record.fields,
remote.get("fields", {}),
compare_fields,
)
if changed_fields:
result.updates.append(RecordDiff(
record_name=local_record.record_name,
record_type=record_type,
action=DiffAction.UPDATE,
local_record=local_record,
remote_record=remote,
changed_fields=changed_fields,
record_change_tag=remote.get("recordChangeTag"),
))
else:
result.unchanged.append(RecordDiff(
record_name=local_record.record_name,
record_type=record_type,
action=DiffAction.UNCHANGED,
local_record=local_record,
remote_record=remote,
record_change_tag=remote.get("recordChangeTag"),
))
# Find deletes (remote records not in local)
local_names = set(local_by_name.keys())
for remote_name, remote in remote_by_name.items():
if remote_name not in local_names:
result.deletes.append(RecordDiff(
record_name=remote_name,
record_type=record_type,
action=DiffAction.DELETE,
remote_record=remote,
record_change_tag=remote.get("recordChangeTag"),
))
return result
def _compare_fields(
self,
local_fields: dict[str, Any],
remote_fields: dict[str, dict],
compare_fields: list[str],
) -> list[str]:
"""Compare field values between local and remote.
Args:
local_fields: Local field values
remote_fields: Remote field values (CloudKit format)
compare_fields: Fields to compare
Returns:
List of field names that differ
"""
changed = []
for field_name in compare_fields:
local_value = local_fields.get(field_name)
remote_field = remote_fields.get(field_name, {})
remote_value = remote_field.get("value") if remote_field else None
# Normalize values for comparison
local_normalized = self._normalize_value(local_value)
remote_normalized = self._normalize_remote_value(remote_value, remote_field)
if local_normalized != remote_normalized:
changed.append(field_name)
return changed
def _normalize_value(self, value: Any) -> Any:
"""Normalize a local value for comparison."""
if value is None:
return None
if isinstance(value, datetime):
# Convert to milliseconds since epoch
return int(value.timestamp() * 1000)
if isinstance(value, float):
# Round to 6 decimal places for coordinate comparison
return round(value, 6)
return value
def _normalize_remote_value(self, value: Any, field_data: dict) -> Any:
"""Normalize a remote CloudKit value for comparison."""
if value is None:
return None
field_type = field_data.get("type", "")
if field_type == "TIMESTAMP":
# Already in milliseconds
return value
if field_type == "DOUBLE":
return round(value, 6)
if field_type == "LOCATION":
# Return as tuple for comparison
if isinstance(value, dict):
return (
round(value.get("latitude", 0), 6),
round(value.get("longitude", 0), 6),
)
return value
def _game_to_record(self, game: Game) -> CloudKitRecord:
"""Convert a Game to a CloudKitRecord."""
return CloudKitRecord(
record_name=game.id,
record_type=RecordType.GAME,
fields={
"sport": game.sport,
"season": game.season,
"home_team_id": game.home_team_id,
"away_team_id": game.away_team_id,
"stadium_id": game.stadium_id,
"game_date": game.game_date,
"game_number": game.game_number,
"home_score": game.home_score,
"away_score": game.away_score,
"status": game.status,
},
)
def _team_to_record(self, team: Team) -> CloudKitRecord:
"""Convert a Team to a CloudKitRecord."""
return CloudKitRecord(
record_name=team.id,
record_type=RecordType.TEAM,
fields={
"sport": team.sport,
"city": team.city,
"name": team.name,
"full_name": team.full_name,
"abbreviation": team.abbreviation,
"conference": team.conference,
"division": team.division,
"primary_color": team.primary_color,
"secondary_color": team.secondary_color,
"logo_url": team.logo_url,
"stadium_id": team.stadium_id,
},
)
def _stadium_to_record(self, stadium: Stadium) -> CloudKitRecord:
"""Convert a Stadium to a CloudKitRecord."""
return CloudKitRecord(
record_name=stadium.id,
record_type=RecordType.STADIUM,
fields={
"sport": stadium.sport,
"name": stadium.name,
"city": stadium.city,
"state": stadium.state,
"country": stadium.country,
"latitude": stadium.latitude,
"longitude": stadium.longitude,
"capacity": stadium.capacity,
"surface": stadium.surface,
"roof_type": stadium.roof_type,
"opened_year": stadium.opened_year,
"image_url": stadium.image_url,
"timezone": stadium.timezone,
},
)
def game_to_cloudkit_record(game: Game) -> CloudKitRecord:
"""Convert a Game to a CloudKitRecord.
Convenience function for external use.
"""
differ = RecordDiffer()
return differ._game_to_record(game)
def team_to_cloudkit_record(team: Team) -> CloudKitRecord:
"""Convert a Team to a CloudKitRecord.
Convenience function for external use.
"""
differ = RecordDiffer()
return differ._team_to_record(team)
def stadium_to_cloudkit_record(stadium: Stadium) -> CloudKitRecord:
"""Convert a Stadium to a CloudKitRecord.
Convenience function for external use.
"""
differ = RecordDiffer()
return differ._stadium_to_record(stadium)
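The categorization done by `_diff_records` can be sketched with plain dictionaries. This is a simplified illustration, not the module's implementation: the function name `diff_by_name` and the sample record names are hypothetical, and value normalization (timestamps, rounding of floats) is omitted.

```python
def diff_by_name(
    local: dict[str, dict],
    remote: dict[str, dict],
    compare_fields: list[str],
) -> dict[str, list[str]]:
    """Sort record names into create / update / delete / unchanged buckets.

    Hypothetical helper: both inputs map record name -> flat field dict.
    """
    result: dict[str, list[str]] = {
        "create": [], "update": [], "delete": [], "unchanged": [],
    }
    for name, fields in local.items():
        if name not in remote:
            # Only exists locally: needs to be created remotely.
            result["create"].append(name)
        elif any(fields.get(f) != remote[name].get(f) for f in compare_fields):
            # Exists on both sides, but a compared field differs.
            result["update"].append(name)
        else:
            result["unchanged"].append(name)
    # Anything that exists remotely but not locally is a delete candidate.
    result["delete"] = [name for name in remote if name not in local]
    return result


# Illustrative data only.
local = {
    "team_a": {"city": "Boston", "name": "Celtics"},
    "team_b": {"city": "Seattle", "name": "Kraken"},
    "team_c": {"city": "Denver", "name": "Nuggets"},
}
remote = {
    "team_b": {"city": "Seattle", "name": "Kraken"},
    "team_c": {"city": "Denver", "name": "Rockets"},
    "team_d": {"city": "Austin", "name": "Example"},
}
summary = diff_by_name(local, remote, ["city", "name"])
print(summary)
```

Indexing both sides by record name keeps the comparison linear in the number of records, which is the same idea as the `remote_by_name` and `local_by_name` lookups above.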
@@ -0,0 +1,384 @@
"""Upload state manager for resumable uploads.
This module tracks upload progress to enable resuming interrupted uploads.
State is persisted to JSON files in the .parser_state directory.
"""
import json
from dataclasses import dataclass, field
from datetime import datetime
from pathlib import Path
from typing import Optional
from ..config import STATE_DIR
@dataclass
class RecordState:
"""State of an individual record upload.
Attributes:
record_name: Canonical record ID
record_type: CloudKit record type
uploaded_at: Timestamp when successfully uploaded
record_change_tag: CloudKit version tag
status: 'pending', 'uploaded', 'failed'
error_message: Error message if failed
retry_count: Number of retry attempts
"""
record_name: str
record_type: str
uploaded_at: Optional[datetime] = None
record_change_tag: Optional[str] = None
status: str = "pending"
error_message: Optional[str] = None
retry_count: int = 0
def to_dict(self) -> dict:
"""Convert to dictionary for JSON serialization."""
return {
"record_name": self.record_name,
"record_type": self.record_type,
"uploaded_at": self.uploaded_at.isoformat() if self.uploaded_at else None,
"record_change_tag": self.record_change_tag,
"status": self.status,
"error_message": self.error_message,
"retry_count": self.retry_count,
}
@classmethod
def from_dict(cls, data: dict) -> "RecordState":
"""Create RecordState from dictionary."""
uploaded_at = data.get("uploaded_at")
if uploaded_at:
uploaded_at = datetime.fromisoformat(uploaded_at)
return cls(
record_name=data["record_name"],
record_type=data["record_type"],
uploaded_at=uploaded_at,
record_change_tag=data.get("record_change_tag"),
status=data.get("status", "pending"),
error_message=data.get("error_message"),
retry_count=data.get("retry_count", 0),
)
@dataclass
class UploadSession:
"""Tracks the state of an upload session.
Attributes:
sport: Sport code
season: Season start year
environment: CloudKit environment
started_at: When the upload session started
last_updated: When the state was last updated
records: Dictionary of record_name -> RecordState
total_count: Total number of records to upload
"""
sport: str
season: int
environment: str
started_at: datetime = field(default_factory=datetime.utcnow)
last_updated: datetime = field(default_factory=datetime.utcnow)
records: dict[str, RecordState] = field(default_factory=dict)
total_count: int = 0
@property
def uploaded_count(self) -> int:
"""Count of successfully uploaded records."""
return sum(1 for r in self.records.values() if r.status == "uploaded")
@property
def pending_count(self) -> int:
"""Count of pending records."""
return sum(1 for r in self.records.values() if r.status == "pending")
@property
def failed_count(self) -> int:
"""Count of failed records."""
return sum(1 for r in self.records.values() if r.status == "failed")
@property
def is_complete(self) -> bool:
"""Check if all records have been processed."""
return self.pending_count == 0
@property
def progress_percent(self) -> float:
"""Calculate upload progress as percentage."""
if self.total_count == 0:
return 100.0
return (self.uploaded_count / self.total_count) * 100
def get_pending_records(self) -> list[str]:
"""Get list of record names that still need to be uploaded."""
return [
name for name, state in self.records.items()
if state.status == "pending"
]
def get_failed_records(self) -> list[str]:
"""Get list of record names that failed to upload."""
return [
name for name, state in self.records.items()
if state.status == "failed"
]
def get_retryable_records(self, max_retries: int = 3) -> list[str]:
"""Get failed records that can be retried."""
return [
name for name, state in self.records.items()
if state.status == "failed" and state.retry_count < max_retries
]
def mark_uploaded(
self,
record_name: str,
record_change_tag: Optional[str] = None,
) -> None:
"""Mark a record as successfully uploaded."""
if record_name in self.records:
state = self.records[record_name]
state.status = "uploaded"
state.uploaded_at = datetime.utcnow()
state.record_change_tag = record_change_tag
state.error_message = None
self.last_updated = datetime.utcnow()
def mark_failed(self, record_name: str, error_message: str) -> None:
"""Mark a record as failed."""
if record_name in self.records:
state = self.records[record_name]
state.status = "failed"
state.error_message = error_message
state.retry_count += 1
self.last_updated = datetime.utcnow()
def mark_pending(self, record_name: str) -> None:
"""Mark a record as pending (for retry)."""
if record_name in self.records:
state = self.records[record_name]
state.status = "pending"
state.error_message = None
self.last_updated = datetime.utcnow()
def add_record(self, record_name: str, record_type: str) -> None:
"""Add a new record to track."""
if record_name not in self.records:
self.records[record_name] = RecordState(
record_name=record_name,
record_type=record_type,
)
self.total_count = len(self.records)
def to_dict(self) -> dict:
"""Convert to dictionary for JSON serialization."""
return {
"sport": self.sport,
"season": self.season,
"environment": self.environment,
"started_at": self.started_at.isoformat(),
"last_updated": self.last_updated.isoformat(),
"total_count": self.total_count,
"records": {
name: state.to_dict()
for name, state in self.records.items()
},
}
@classmethod
def from_dict(cls, data: dict) -> "UploadSession":
"""Create UploadSession from dictionary."""
session = cls(
sport=data["sport"],
season=data["season"],
environment=data["environment"],
started_at=datetime.fromisoformat(data["started_at"]),
last_updated=datetime.fromisoformat(data["last_updated"]),
total_count=data.get("total_count", 0),
)
for name, record_data in data.get("records", {}).items():
session.records[name] = RecordState.from_dict(record_data)
return session
class StateManager:
"""Manages upload state persistence.
State files are stored in .parser_state/ with naming convention:
upload_state_{sport}_{season}_{environment}.json
"""
def __init__(self, state_dir: Optional[Path] = None):
"""Initialize the state manager.
Args:
state_dir: Directory for state files (default: .parser_state/)
"""
self.state_dir = state_dir or STATE_DIR
self.state_dir.mkdir(parents=True, exist_ok=True)
def _get_state_file(self, sport: str, season: int, environment: str) -> Path:
"""Get the path to a state file."""
return self.state_dir / f"upload_state_{sport}_{season}_{environment}.json"
def load_session(
self,
sport: str,
season: int,
environment: str,
) -> Optional[UploadSession]:
"""Load an existing upload session.
Args:
sport: Sport code
season: Season start year
environment: CloudKit environment
Returns:
UploadSession if exists, None otherwise
"""
state_file = self._get_state_file(sport, season, environment)
if not state_file.exists():
return None
try:
with open(state_file, "r", encoding="utf-8") as f:
data = json.load(f)
return UploadSession.from_dict(data)
except (json.JSONDecodeError, KeyError, ValueError):
# Corrupted state file: invalid JSON, missing keys, or malformed timestamps
return None
def save_session(self, session: UploadSession) -> None:
"""Save an upload session to disk.
Args:
session: The session to save
"""
state_file = self._get_state_file(
session.sport,
session.season,
session.environment,
)
session.last_updated = datetime.utcnow()
with open(state_file, "w", encoding="utf-8") as f:
json.dump(session.to_dict(), f, indent=2)
def create_session(
self,
sport: str,
season: int,
environment: str,
record_names: list[tuple[str, str]], # (record_name, record_type)
) -> UploadSession:
"""Create a new upload session.
Args:
sport: Sport code
season: Season start year
environment: CloudKit environment
record_names: List of (record_name, record_type) tuples
Returns:
New UploadSession
"""
session = UploadSession(
sport=sport,
season=season,
environment=environment,
)
for record_name, record_type in record_names:
session.add_record(record_name, record_type)
self.save_session(session)
return session
def delete_session(self, sport: str, season: int, environment: str) -> bool:
"""Delete an upload session state file.
Args:
sport: Sport code
season: Season start year
environment: CloudKit environment
Returns:
True if deleted, False if not found
"""
state_file = self._get_state_file(sport, season, environment)
if state_file.exists():
state_file.unlink()
return True
return False
def list_sessions(self) -> list[dict]:
"""List all upload sessions.
Returns:
List of session summaries
"""
sessions = []
for state_file in self.state_dir.glob("upload_state_*.json"):
try:
with open(state_file, "r", encoding="utf-8") as f:
data = json.load(f)
session = UploadSession.from_dict(data)
sessions.append({
"sport": session.sport,
"season": session.season,
"environment": session.environment,
"started_at": session.started_at.isoformat(),
"last_updated": session.last_updated.isoformat(),
"progress": f"{session.uploaded_count}/{session.total_count}",
"progress_percent": f"{session.progress_percent:.1f}%",
"status": "complete" if session.is_complete else "in_progress",
"failed_count": session.failed_count,
})
except (json.JSONDecodeError, KeyError, ValueError):
continue
return sessions
def get_session_or_create(
self,
sport: str,
season: int,
environment: str,
record_names: list[tuple[str, str]],
resume: bool = False,
) -> UploadSession:
"""Get existing session or create new one.
Args:
sport: Sport code
season: Season start year
environment: CloudKit environment
record_names: List of (record_name, record_type) tuples
resume: Whether to resume existing session
Returns:
UploadSession (existing or new)
"""
if resume:
existing = self.load_session(sport, season, environment)
if existing:
# Add any new records not in existing session
existing_names = set(existing.records.keys())
for record_name, record_type in record_names:
if record_name not in existing_names:
existing.add_record(record_name, record_type)
return existing
# Create new session (overwrites existing)
return self.create_session(sport, season, environment, record_names)
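The resume behaviour rests on a simple idea: persist per-record statuses to JSON, then reload them and continue with whatever is still pending. Below is a minimal sketch of that round trip under simplifying assumptions. The helpers `save_state` and `load_pending`, the flat name-to-status layout, and the record names are hypothetical; the real `UploadSession` stores richer per-record state.

```python
import json
import tempfile
from pathlib import Path


def save_state(path: Path, statuses: dict[str, str]) -> None:
    """Write a record_name -> status mapping to disk as JSON."""
    path.write_text(json.dumps(statuses, indent=2), encoding="utf-8")


def load_pending(path: Path) -> list[str]:
    """Return the record names still marked 'pending', or [] if no state exists."""
    if not path.exists():
        return []
    statuses = json.loads(path.read_text(encoding="utf-8"))
    return [name for name, status in statuses.items() if status == "pending"]


with tempfile.TemporaryDirectory() as tmp:
    # File name follows the upload_state_{sport}_{season}_{environment}.json pattern.
    state_file = Path(tmp) / "upload_state_nba_2025_development.json"
    save_state(
        state_file,
        {"game_a": "uploaded", "game_b": "pending", "game_c": "failed"},
    )
    # A later run would pick up from here instead of re-uploading everything.
    pending = load_pending(state_file)
    print(pending)
```

Failed records are deliberately not returned here; in the module above they are handled separately through `get_retryable_records`, which also checks the retry count.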
@@ -0,0 +1,58 @@
"""Utility modules for sportstime-parser."""
from .logging import (
get_console,
get_logger,
is_verbose,
log_error,
log_failure,
log_game,
log_stadium,
log_success,
log_team,
log_warning,
set_verbose,
)
from .http import (
RateLimitedSession,
get_session,
fetch_url,
fetch_json,
fetch_html,
)
from .progress import (
create_progress,
create_spinner_progress,
progress_bar,
track_progress,
ProgressTracker,
ScrapeProgress,
)
__all__ = [
# Logging
"get_console",
"get_logger",
"is_verbose",
"log_error",
"log_failure",
"log_game",
"log_stadium",
"log_success",
"log_team",
"log_warning",
"set_verbose",
# HTTP
"RateLimitedSession",
"get_session",
"fetch_url",
"fetch_json",
"fetch_html",
# Progress
"create_progress",
"create_spinner_progress",
"progress_bar",
"track_progress",
"ProgressTracker",
"ScrapeProgress",
]
@@ -0,0 +1,276 @@
"""HTTP utilities with rate limiting and exponential backoff."""
import random
import time
from typing import Optional
from urllib.parse import urlparse
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from ..config import (
DEFAULT_REQUEST_DELAY,
MAX_RETRIES,
BACKOFF_FACTOR,
INITIAL_BACKOFF,
)
from .logging import get_logger, log_warning
# User agents for rotation to avoid blocks
USER_AGENTS = [
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0",
]
class RateLimitedSession:
"""HTTP session with rate limiting and exponential backoff.
Features:
- Configurable delay between requests
- Automatic 429 detection with exponential backoff
- User-agent rotation
- Connection pooling
- Automatic retries for transient errors
"""
def __init__(
self,
delay: float = DEFAULT_REQUEST_DELAY,
max_retries: int = MAX_RETRIES,
backoff_factor: float = BACKOFF_FACTOR,
initial_backoff: float = INITIAL_BACKOFF,
):
"""Initialize the rate-limited session.
Args:
delay: Minimum delay between requests in seconds
max_retries: Maximum number of retry attempts
backoff_factor: Multiplier for exponential backoff
initial_backoff: Initial backoff duration in seconds
"""
self.delay = delay
self.max_retries = max_retries
self.backoff_factor = backoff_factor
self.initial_backoff = initial_backoff
self.last_request_time: float = 0.0
self._domain_delays: dict[str, float] = {}
# Create session with retry adapter
self.session = requests.Session()
# Configure automatic retries for connection errors and 5xx responses
retry_strategy = Retry(
total=max_retries,
backoff_factor=0.5,
status_forcelist=[500, 502, 503, 504],
allowed_methods=["GET", "HEAD"],
)
adapter = HTTPAdapter(max_retries=retry_strategy, pool_maxsize=10)
self.session.mount("http://", adapter)
self.session.mount("https://", adapter)
self._logger = get_logger()

    def _get_user_agent(self) -> str:
        """Get a random user agent."""
        return random.choice(USER_AGENTS)

    def _get_domain(self, url: str) -> str:
        """Extract domain from URL."""
        parsed = urlparse(url)
        return parsed.netloc

    def _wait_for_rate_limit(self, url: str) -> None:
        """Wait to respect rate limiting."""
        domain = self._get_domain(url)

        # Get domain-specific delay (if 429 was received)
        domain_delay = self._domain_delays.get(domain, 0.0)
        effective_delay = max(self.delay, domain_delay)

        elapsed = time.time() - self.last_request_time
        if elapsed < effective_delay:
            sleep_time = effective_delay - elapsed
            self._logger.debug(f"Rate limiting: sleeping {sleep_time:.2f}s")
            time.sleep(sleep_time)

    def _handle_429(self, url: str, attempt: int) -> float:
        """Handle 429 Too Many Requests with exponential backoff.

        Returns the backoff duration in seconds.
        """
        domain = self._get_domain(url)
        backoff = self.initial_backoff * (self.backoff_factor ** attempt)

        # Add jitter to prevent thundering herd
        backoff += random.uniform(0, 1)

        # Update domain-specific delay
        self._domain_delays[domain] = min(backoff * 2, 60.0)  # Cap at 60s

        log_warning(f"Rate limited (429) for {domain}, backing off {backoff:.1f}s")
        return backoff

    def get(
        self,
        url: str,
        headers: Optional[dict] = None,
        params: Optional[dict] = None,
        timeout: float = 30.0,
    ) -> requests.Response:
        """Make a rate-limited GET request with automatic retries.

        Args:
            url: URL to fetch
            headers: Additional headers to include
            params: Query parameters
            timeout: Request timeout in seconds

        Returns:
            Response object

        Raises:
            requests.RequestException: If all retries fail
        """
        # Prepare headers with user agent
        request_headers = {"User-Agent": self._get_user_agent()}
        if headers:
            request_headers.update(headers)

        last_exception: Optional[Exception] = None

        for attempt in range(self.max_retries + 1):
            try:
                # Wait for rate limit
                self._wait_for_rate_limit(url)

                # Make request
                self.last_request_time = time.time()
                response = self.session.get(
                    url,
                    headers=request_headers,
                    params=params,
                    timeout=timeout,
                )

                # Handle 429
                if response.status_code == 429:
                    if attempt < self.max_retries:
                        backoff = self._handle_429(url, attempt)
                        time.sleep(backoff)
                        continue
                    else:
                        response.raise_for_status()

                # Return successful response
                return response

            except requests.RequestException as e:
                last_exception = e
                if attempt < self.max_retries:
                    backoff = self.initial_backoff * (self.backoff_factor ** attempt)
                    self._logger.warning(
                        f"Request failed (attempt {attempt + 1}): {e}, retrying in {backoff:.1f}s"
                    )
                    time.sleep(backoff)
                else:
                    raise

        # Should not reach here, but just in case
        if last_exception:
            raise last_exception
        raise requests.RequestException("Max retries exceeded")

    def get_json(
        self,
        url: str,
        headers: Optional[dict] = None,
        params: Optional[dict] = None,
        timeout: float = 30.0,
    ) -> dict:
        """Make a rate-limited GET request and parse JSON response.

        Args:
            url: URL to fetch
            headers: Additional headers to include
            params: Query parameters
            timeout: Request timeout in seconds

        Returns:
            Parsed JSON as dictionary

        Raises:
            requests.RequestException: If request fails
            ValueError: If response is not valid JSON
        """
        response = self.get(url, headers=headers, params=params, timeout=timeout)
        response.raise_for_status()
        return response.json()

    def get_html(
        self,
        url: str,
        headers: Optional[dict] = None,
        params: Optional[dict] = None,
        timeout: float = 30.0,
    ) -> str:
        """Make a rate-limited GET request and return HTML text.

        Args:
            url: URL to fetch
            headers: Additional headers to include
            params: Query parameters
            timeout: Request timeout in seconds

        Returns:
            HTML text content

        Raises:
            requests.RequestException: If request fails
        """
        response = self.get(url, headers=headers, params=params, timeout=timeout)
        response.raise_for_status()
        return response.text

    def reset_domain_delays(self) -> None:
        """Reset domain-specific delays (e.g., after a long pause)."""
        self._domain_delays.clear()

    def close(self) -> None:
        """Close the session and release resources."""
        self.session.close()


# Global session instance (lazy initialized)
_global_session: Optional[RateLimitedSession] = None


def get_session() -> RateLimitedSession:
    """Get the global rate-limited session instance."""
    global _global_session
    if _global_session is None:
        _global_session = RateLimitedSession()
    return _global_session


def fetch_url(url: str, **kwargs) -> requests.Response:
    """Convenience function to fetch a URL with rate limiting."""
    return get_session().get(url, **kwargs)


def fetch_json(url: str, **kwargs) -> dict:
    """Convenience function to fetch JSON with rate limiting."""
    return get_session().get_json(url, **kwargs)


def fetch_html(url: str, **kwargs) -> str:
    """Convenience function to fetch HTML with rate limiting."""
    return get_session().get_html(url, **kwargs)
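
The 429 handling in `RateLimitedSession` grows the wait geometrically per attempt and remembers a doubled, capped delay for the offending domain. The arithmetic can be sketched as follows. This is only an illustration: the values 5.0 and 2.0 are assumed, since the real `INITIAL_BACKOFF` and `BACKOFF_FACTOR` come from the config module and are not shown in this diff, and `backoff_schedule` and `domain_delay` are hypothetical helper names that do not exist in the package. Jitter is left out so the result is deterministic.

```python
def backoff_schedule(initial_backoff: float, backoff_factor: float, max_retries: int) -> list[float]:
    """Return the base wait (seconds) before each retry, without jitter."""
    return [initial_backoff * (backoff_factor ** attempt) for attempt in range(max_retries)]


def domain_delay(backoff: float, cap: float = 60.0) -> float:
    """Per-domain delay remembered after a 429: twice the backoff, capped."""
    return min(backoff * 2, cap)


# Assumed example values, not the package's actual configuration.
schedule = backoff_schedule(initial_backoff=5.0, backoff_factor=2.0, max_retries=3)
delays = [domain_delay(b) for b in schedule]
print(schedule)
print(delays)
```

Under these assumed values, a third consecutive 429 would leave the domain with a 40-second minimum spacing until `reset_domain_delays()` is called.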
+149
@@ -0,0 +1,149 @@
"""Logging infrastructure for sportstime-parser."""

import logging
import sys
from datetime import datetime
from pathlib import Path
from typing import Optional

from rich.console import Console
from rich.logging import RichHandler

from ..config import SCRIPTS_DIR

# Module-level state
_logger: Optional[logging.Logger] = None
_verbose: bool = False
_console: Optional[Console] = None


def get_console() -> Console:
    """Get the shared Rich console instance."""
    global _console
    if _console is None:
        _console = Console()
    return _console


def set_verbose(verbose: bool) -> None:
    """Set verbose mode globally."""
    global _verbose
    _verbose = verbose
    logger = get_logger()
    if verbose:
        logger.setLevel(logging.DEBUG)
    else:
        logger.setLevel(logging.INFO)


def is_verbose() -> bool:
    """Check if verbose mode is enabled."""
    return _verbose


def get_logger() -> logging.Logger:
    """Get or create the application logger."""
    global _logger
    if _logger is not None:
        return _logger

    _logger = logging.getLogger("sportstime_parser")
    _logger.setLevel(logging.INFO)

    # Prevent propagation to root logger
    _logger.propagate = False

    # Clear any existing handlers
    _logger.handlers.clear()

    # Console handler with Rich formatting
    console_handler = RichHandler(
        console=get_console(),
        show_time=True,
        show_path=False,
        rich_tracebacks=True,
        tracebacks_show_locals=True,
        markup=True,
    )
    console_handler.setLevel(logging.DEBUG)
    console_format = logging.Formatter("%(message)s")
    console_handler.setFormatter(console_format)
    _logger.addHandler(console_handler)

    # File handler for persistent logs (one timestamped file per process)
    log_dir = SCRIPTS_DIR / "logs"
    # parents=True so a missing intermediate directory does not abort startup
    log_dir.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_file = log_dir / f"parser_{timestamp}.log"
    file_handler = logging.FileHandler(log_file, encoding="utf-8")
    file_handler.setLevel(logging.DEBUG)
    file_format = logging.Formatter(
        "%(asctime)s | %(levelname)-8s | %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    )
    file_handler.setFormatter(file_format)
    _logger.addHandler(file_handler)

    return _logger


def log_game(
    sport: str,
    game_id: str,
    home: str,
    away: str,
    date: str,
    status: str = "parsed",
) -> None:
    """Log a game being processed (only in verbose mode)."""
    if not is_verbose():
        return
    logger = get_logger()
    logger.debug(f"[{sport.upper()}] {game_id}: {away} @ {home} ({date}) - {status}")


def log_team(sport: str, team_id: str, name: str, status: str = "resolved") -> None:
    """Log a team being processed (only in verbose mode)."""
    if not is_verbose():
        return
    logger = get_logger()
    logger.debug(f"[{sport.upper()}] Team: {name} -> {team_id} ({status})")


def log_stadium(sport: str, stadium_id: str, name: str, status: str = "resolved") -> None:
    """Log a stadium being processed (only in verbose mode)."""
    if not is_verbose():
        return
    logger = get_logger()
    logger.debug(f"[{sport.upper()}] Stadium: {name} -> {stadium_id} ({status})")


def log_error(message: str, exc_info: bool = False) -> None:
    """Log an error message."""
    logger = get_logger()
    logger.error(message, exc_info=exc_info)


def log_warning(message: str) -> None:
    """Log a warning message."""
    logger = get_logger()
    logger.warning(message)


def log_success(message: str) -> None:
    """Log a success message with green formatting."""
    logger = get_logger()
    logger.info(f"[green]✓[/green] {message}")


def log_failure(message: str) -> None:
    """Log a failure message with red formatting."""
    logger = get_logger()
    logger.info(f"[red]✗[/red] {message}")
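
Both handlers in `get_logger()` are set to `DEBUG`, so verbosity is controlled only by the logger's own level, which `set_verbose()` toggles. A minimal standard-library sketch of that gating is shown below. It uses a `StringIO` stream in place of the Rich and file handlers, and the logger name `sketch_parser` and the helper `make_logger` are hypothetical, chosen for illustration.

```python
import io
import logging


def make_logger(stream: io.StringIO) -> logging.Logger:
    """Build a logger whose handler accepts everything; the logger level gates output."""
    logger = logging.getLogger("sketch_parser")  # hypothetical name
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)  # default: debug records are dropped here

    handler = logging.StreamHandler(stream)
    handler.setLevel(logging.DEBUG)  # handler never filters below the logger
    handler.setFormatter(logging.Formatter("%(levelname)s | %(message)s"))
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
logger = make_logger(stream)

logger.debug("hidden")          # dropped: logger level is INFO
logger.setLevel(logging.DEBUG)  # what a verbose toggle would do
logger.debug("shown")           # now reaches the handler

output = stream.getvalue()
print(output)
```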
+360
@@ -0,0 +1,360 @@
"""Progress utilities using Rich for visual feedback."""

from contextlib import contextmanager
from typing import Generator, Iterable, Optional, TypeVar

from rich.progress import (
    Progress,
    SpinnerColumn,
    TextColumn,
    BarColumn,
    TaskProgressColumn,
    TimeElapsedColumn,
    TimeRemainingColumn,
    MofNCompleteColumn,
)
from rich.console import Console

from .logging import get_console

T = TypeVar("T")


def create_progress() -> Progress:
    """Create a Rich progress bar with standard columns."""
    return Progress(
        SpinnerColumn(),
        TextColumn("[bold blue]{task.description}"),
        BarColumn(bar_width=40),
        TaskProgressColumn(),
        MofNCompleteColumn(),
        TimeElapsedColumn(),
        TimeRemainingColumn(),
        console=get_console(),
        transient=False,
    )


def create_spinner_progress() -> Progress:
    """Create a Rich progress bar with spinner only (for indeterminate tasks)."""
    return Progress(
        SpinnerColumn(),
        TextColumn("[bold blue]{task.description}"),
        TimeElapsedColumn(),
        console=get_console(),
        transient=True,
    )


@contextmanager
def progress_bar(
    description: str,
    total: Optional[int] = None,
) -> Generator[tuple[Progress, int], None, None]:
    """Context manager for a progress bar.

    Args:
        description: Task description to display
        total: Total number of items (None for indeterminate)

    Yields:
        Tuple of (Progress instance, task_id)

    Example:
        with progress_bar("Scraping games", total=100) as (progress, task):
            for item in items:
                process(item)
                progress.advance(task)
    """
    if total is None:
        progress = create_spinner_progress()
    else:
        progress = create_progress()

    with progress:
        task_id = progress.add_task(description, total=total)
        yield progress, task_id


def track_progress(
    iterable: Iterable[T],
    description: str,
    total: Optional[int] = None,
) -> Generator[T, None, None]:
    """Wrap an iterable with a progress bar.

    Args:
        iterable: Items to iterate over
        description: Task description to display
        total: Total number of items (auto-detected if iterable has len)

    Yields:
        Items from the iterable

    Example:
        for game in track_progress(games, "Processing games"):
            process(game)
    """
    # Try to get length if not provided
    if total is None:
        try:
            total = len(iterable)  # type: ignore
        except TypeError:
            pass

    if total is None:
        # Indeterminate progress
        progress = create_spinner_progress()
        with progress:
            task_id = progress.add_task(description, total=None)
            for item in iterable:
                yield item
                progress.update(task_id, advance=1)
    else:
        # Determinate progress
        progress = create_progress()
        with progress:
            task_id = progress.add_task(description, total=total)
            for item in iterable:
                yield item
                progress.advance(task_id)


class ProgressTracker:
    """Track progress across multiple phases with nested tasks.

    Example:
        tracker = ProgressTracker()
        tracker.start("Scraping NBA")

        with tracker.task("Fetching schedule", total=12) as advance:
            for month in months:
                fetch(month)
                advance()

        with tracker.task("Parsing games", total=1230) as advance:
            for game in games:
                parse(game)
                advance()

        tracker.finish("Completed NBA scrape")
    """

    def __init__(self):
        """Initialize the progress tracker."""
        self._console = get_console()
        self._current_progress: Optional[Progress] = None
        self._current_task: Optional[int] = None

    def start(self, message: str) -> None:
        """Start a new tracking session with a message."""
        self._console.print(f"\n[bold cyan]>>> {message}[/bold cyan]")

    def finish(self, message: str) -> None:
        """Finish the tracking session with a message."""
        self._console.print(f"[bold green]<<< {message}[/bold green]\n")

    @contextmanager
    def task(
        self,
        description: str,
        total: Optional[int] = None,
    ) -> Generator[callable, None, None]:
        """Context manager for a tracked task.

        Args:
            description: Task description
            total: Total items (None for indeterminate)

        Yields:
            Callable to advance the progress

        Example:
            with tracker.task("Processing", total=100) as advance:
                for item in items:
                    process(item)
                    advance()
        """
        with progress_bar(description, total) as (progress, task_id):
            self._current_progress = progress
            self._current_task = task_id

            def advance(amount: int = 1) -> None:
                progress.advance(task_id, advance=amount)

            try:
                yield advance
            finally:
                # Reset even if the caller's block raised, so log() does not
                # keep writing to a progress display that has already stopped.
                self._current_progress = None
                self._current_task = None

    def log(self, message: str) -> None:
        """Log a message (will be displayed above progress bar if active)."""
        if self._current_progress:
            self._current_progress.console.print(f" {message}")
        else:
            self._console.print(f" {message}")


class ScrapeProgress:
    """Specialized progress tracker for scraping operations.

    Tracks counts of games, teams, stadiums scraped and provides
    formatted status updates.
    """

    def __init__(self, sport: str, season: int):
        """Initialize scrape progress for a sport.

        Args:
            sport: Sport code (e.g., 'nba')
            season: Season start year
        """
        self.sport = sport
        self.season = season
        self.games_count = 0
        self.teams_count = 0
        self.stadiums_count = 0
        self.errors_count = 0
        self._tracker = ProgressTracker()

    def start(self) -> None:
        """Start the scraping session."""
        self._tracker.start(
            f"Scraping {self.sport.upper()} {self.season}-{self.season + 1}"
        )

    def finish(self) -> None:
        """Finish the scraping session with summary."""
        summary = (
            f"Scraped {self.games_count} games, "
            f"{self.teams_count} teams, "
            f"{self.stadiums_count} stadiums"
        )
        if self.errors_count > 0:
            summary += f" ({self.errors_count} errors)"
        self._tracker.finish(summary)

    @contextmanager
    def scraping_schedule(
        self,
        total_months: Optional[int] = None,
    ) -> Generator[callable, None, None]:
        """Track schedule scraping progress."""
        with self._tracker.task(
            f"Fetching {self.sport.upper()} schedule",
            total=total_months,
        ) as advance:
            yield advance

    @contextmanager
    def parsing_games(
        self,
        total_games: Optional[int] = None,
    ) -> Generator[callable, None, None]:
        """Track game parsing progress."""
        with self._tracker.task(
            "Parsing games",
            total=total_games,
        ) as advance:

            def advance_and_count(amount: int = 1) -> None:
                self.games_count += amount
                advance(amount)

            yield advance_and_count

    @contextmanager
    def resolving_teams(
        self,
        total_teams: Optional[int] = None,
    ) -> Generator[callable, None, None]:
        """Track team resolution progress."""
        with self._tracker.task(
            "Resolving teams",
            total=total_teams,
        ) as advance:

            def advance_and_count(amount: int = 1) -> None:
                self.teams_count += amount
                advance(amount)

            yield advance_and_count

    @contextmanager
    def resolving_stadiums(
        self,
        total_stadiums: Optional[int] = None,
    ) -> Generator[callable, None, None]:
        """Track stadium resolution progress."""
        with self._tracker.task(
            "Resolving stadiums",
            total=total_stadiums,
        ) as advance:

            def advance_and_count(amount: int = 1) -> None:
                self.stadiums_count += amount
                advance(amount)

            yield advance_and_count

    def log_error(self, message: str) -> None:
        """Log an error during scraping."""
        self.errors_count += 1
        self._tracker.log(f"[red]Error: {message}[/red]")

    def log_warning(self, message: str) -> None:
        """Log a warning during scraping."""
        self._tracker.log(f"[yellow]Warning: {message}[/yellow]")

    def log_info(self, message: str) -> None:
        """Log an info message during scraping."""
        self._tracker.log(message)


class SimpleProgressBar:
    """Simple progress bar wrapper for batch operations.

    Example:
        with create_progress_bar(total=100, description="Uploading") as progress:
            for item in items:
                upload(item)
                progress.advance()
    """

    def __init__(self, progress: Progress, task_id: int):
        self._progress = progress
        self._task_id = task_id

    def advance(self, amount: int = 1) -> None:
        """Advance the progress bar."""
        self._progress.advance(self._task_id, advance=amount)

    def update(self, completed: int) -> None:
        """Set the progress to a specific value."""
        self._progress.update(self._task_id, completed=completed)


@contextmanager
def create_progress_bar(
    total: int,
    description: str = "Progress",
) -> Generator[SimpleProgressBar, None, None]:
    """Create a simple progress bar for batch operations.

    Args:
        total: Total number of items
        description: Task description

    Yields:
        SimpleProgressBar with advance() and update() methods

    Example:
        with create_progress_bar(total=100, description="Uploading") as progress:
            for item in items:
                upload(item)
                progress.advance()
    """
    progress = create_progress()
    with progress:
        task_id = progress.add_task(description, total=total)
        yield SimpleProgressBar(progress, task_id)
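
The tracker classes hand the caller an `advance` callable from inside a generator-based context manager and, in `ScrapeProgress`, wrap it so each call also updates a counter. The pattern can be sketched without Rich as below. `CountingTracker` is a hypothetical stand-in, not a class from the package, and the `try`/`finally` is one way to make sure state is reset even when the body of the `with` block raises.

```python
from contextlib import contextmanager


class CountingTracker:
    """Hypothetical stand-in for the tracker classes, with no Rich dependency."""

    def __init__(self):
        self.count = 0
        self.active = False

    @contextmanager
    def task(self):
        self.active = True

        def advance(amount: int = 1) -> None:
            # Count every unit of work reported by the caller.
            self.count += amount

        try:
            yield advance
        finally:
            # Runs whether the with-block finished normally or raised.
            self.active = False


tracker = CountingTracker()
with tracker.task() as advance:
    advance()
    advance(2)

print(tracker.count, tracker.active)
```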

Some files were not shown because too many files have changed in this diff