feat(scripts): rewrite parser as modular Python CLI

Replace monolithic scraping scripts with sportstime_parser package:

- Multi-source scrapers with automatic fallback for 7 sports
- Canonical ID generation for games, teams, and stadiums
- Fuzzy matching with configurable thresholds for name resolution
- CloudKit Web Services uploader with JWT auth, diff-based updates
- Resumable uploads with checkpoint state persistence
- Validation reports with manual review items and suggested matches
- Comprehensive test suite (249 tests)

CLI: sportstime-parser scrape|validate|upload|status|retry|clear

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -1,145 +0,0 @@
# CloudKit Setup Guide for SportsTime

## 1. Configure Container in Apple Developer Portal

1. Go to [Apple Developer Portal](https://developer.apple.com/account)
2. Navigate to **Certificates, Identifiers & Profiles** > **Identifiers**
3. Select your App ID or create one for `com.sportstime.app`
4. Enable **iCloud** capability
5. Click **Configure** and create container: `iCloud.com.sportstime.app`

## 2. Configure in Xcode

1. Open `SportsTime.xcodeproj` in Xcode
2. Select the SportsTime target
3. Go to **Signing & Capabilities**
4. Ensure **iCloud** is added (should already be there)
5. Check **CloudKit** is selected
6. Select container `iCloud.com.sportstime.app`

## 3. Create Record Types in CloudKit Dashboard

Go to [CloudKit Dashboard](https://icloud.developer.apple.com/dashboard)

### Record Type: `Stadium`

| Field | Type | Notes |
|-------|------|-------|
| `stadiumId` | String | Unique identifier |
| `name` | String | Stadium name |
| `city` | String | City |
| `state` | String | State/Province |
| `location` | Location | CLLocation (lat/lng) |
| `capacity` | Int(64) | Seating capacity |
| `sport` | String | NBA, MLB, NHL |
| `teamAbbrevs` | String (List) | Team abbreviations |
| `source` | String | Data source |
| `yearOpened` | Int(64) | Optional |

**Indexes**:
- `sport` (Queryable, Sortable)
- `location` (Queryable) - for radius searches
- `teamAbbrevs` (Queryable)
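
The `location` index is what allows radius searches on the CloudKit side. To sanity-check scraped coordinates locally before upload, the same idea can be sketched in Python with the haversine formula. The function names, the `lat`/`lng` keys, and the sample coordinates below are assumptions for illustration and are not part of the pipeline.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lng points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def stadiums_within(stadiums, lat, lon, radius_km):
    """Return stadium dicts (hypothetical 'lat'/'lng' keys) within radius_km."""
    return [s for s in stadiums
            if haversine_km(lat, lon, s["lat"], s["lng"]) <= radius_km]


# Illustrative, approximate coordinates only.
sample = [
    {"name": "Venue A", "lat": 40.75, "lng": -73.99},
    {"name": "Venue B", "lat": 34.04, "lng": -118.27},
]
nearby = stadiums_within(sample, 40.70, -74.00, radius_km=25)
```

A linear scan like this is fine for a few hundred venues; the CloudKit index matters only for queries issued by the app itself.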

### Record Type: `Team`

| Field | Type | Notes |
|-------|------|-------|
| `teamId` | String | Unique identifier |
| `name` | String | Full team name |
| `abbreviation` | String | 3-letter code |
| `sport` | String | NBA, MLB, NHL |
| `city` | String | City |

**Indexes**:
- `sport` (Queryable, Sortable)
- `abbreviation` (Queryable)

### Record Type: `Game`

| Field | Type | Notes |
|-------|------|-------|
| `gameId` | String | Unique identifier |
| `sport` | String | NBA, MLB, NHL |
| `season` | String | e.g., "2024-25" |
| `dateTime` | Date/Time | Game date and time |
| `homeTeamRef` | Reference | Reference to Team |
| `awayTeamRef` | Reference | Reference to Team |
| `venueRef` | Reference | Reference to Stadium |
| `isPlayoff` | Int(64) | 0 or 1 |
| `broadcastInfo` | String | TV channel |
| `source` | String | Data source |

**Indexes**:
- `sport` (Queryable, Sortable)
- `dateTime` (Queryable, Sortable)
- `homeTeamRef` (Queryable)
- `awayTeamRef` (Queryable)
- `season` (Queryable)
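
The field types above map onto the record dictionaries that CloudKit Web Services accepts. The following is a rough sketch of one `Game` record built in Python. The helper name and the sample IDs are hypothetical, and the payload shape (a `TIMESTAMP` in milliseconds, `REFERENCE` values carrying a `recordName`) is an assumption that should be checked against Apple's CloudKit Web Services reference before relying on it.

```python
from datetime import datetime, timezone


def game_record(game_id, sport, season, start, home_id, away_id, venue_id,
                is_playoff=False):
    """Build a CloudKit-style record dict for the Game type (illustrative only)."""
    def ref(record_name):
        # Reference fields point at another record by its record name.
        return {"value": {"recordName": record_name, "action": "NONE"},
                "type": "REFERENCE"}

    return {
        "recordType": "Game",
        "recordName": game_id,
        "fields": {
            "gameId": {"value": game_id},
            "sport": {"value": sport},
            "season": {"value": season},
            # Date/Time fields are sent as milliseconds since the Unix epoch.
            "dateTime": {"value": int(start.timestamp() * 1000),
                         "type": "TIMESTAMP"},
            "homeTeamRef": ref(home_id),
            "awayTeamRef": ref(away_id),
            "venueRef": ref(venue_id),
            # isPlayoff is stored as Int(64): 0 or 1.
            "isPlayoff": {"value": 1 if is_playoff else 0},
        },
    }


# Placeholder IDs, not real records.
record = game_record(
    "game-001", "NBA", "2024-25",
    datetime(2024, 10, 22, 23, 30, tzinfo=timezone.utc),
    home_id="team-home", away_id="team-away", venue_id="stadium-001",
)
```

Using the record name as the reference target means Team and Stadium records have to exist, or be uploaded in the same batch, before the games that point at them.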

## 4. Import Data

After creating record types:

```bash
# 1. First scrape the data
cd Scripts
python3 scrape_schedules.py --sport all --season 2025 --output ./data

# 2. Run the import script (requires running from Xcode or with proper entitlements)
# The Swift script cannot run standalone - use the app or create a macOS command-line tool
```

### Alternative: Import via App

Add this to your app for first-run data import:

```swift
// In AppDelegate or App init
Task {
    let importer = CloudKitImporter()

    // Load JSON from bundle or downloaded file
    if let stadiumsURL = Bundle.main.url(forResource: "stadiums", withExtension: "json"),
       let gamesURL = Bundle.main.url(forResource: "games", withExtension: "json") {
        // Import stadiums first
        let stadiumsData = try Data(contentsOf: stadiumsURL)
        let stadiums = try JSONDecoder().decode([ScrapedStadium].self, from: stadiumsData)
        let count = try await importer.importStadiums(from: stadiums)
        print("Imported \(count) stadiums")
    }
}
```

## 5. Security Roles (CloudKit Dashboard)

For the **Public Database**:

| Role | Stadium | Team | Game |
|------|---------|------|------|
| World | Read | Read | Read |
| Authenticated | Read | Read | Read |
| Creator | Read/Write | Read/Write | Read/Write |

Users should only read from the public database. Write access is reserved for your admin imports.

## 6. Testing

1. Build and run the app on a simulator or device
2. Check CloudKit Dashboard > **Data** to see imported records
3. Use the **Logs** tab to debug any issues

## Troubleshooting

### "Container not found"
- Ensure the container is created in the Developer Portal
- Check that the entitlements file has the correct container ID
- Clean build and re-run

### "Permission denied"
- Check Security Roles in CloudKit Dashboard
- Ensure the app is signed with the correct provisioning profile

### "Record type not found"
- Create record types in the Development environment first
- Deploy the schema to Production when ready
@@ -1,72 +0,0 @@
# Sports Data Sources

## Schedule Data Sources (by league)

### NBA Schedule
| Source | URL Pattern | Data Available | Notes |
|--------|-------------|----------------|-------|
| Basketball-Reference | `https://www.basketball-reference.com/leagues/NBA_{YEAR}_games-{month}.html` | Date, Time, Teams, Arena, Attendance | Monthly pages (october, november, etc.) |
| ESPN | `https://www.espn.com/nba/schedule/_/date/{YYYYMMDD}` | Date, Time, Teams, TV | Daily schedule |
| NBA.com API | `https://cdn.nba.com/static/json/staticData/scheduleLeagueV2.json` | Full season JSON | Official source |
| FixtureDownload | `https://fixturedownload.com/download/nba-{year}-UTC.csv` | CSV download | Easy format |

### MLB Schedule
| Source | URL Pattern | Data Available | Notes |
|--------|-------------|----------------|-------|
| Baseball-Reference | `https://www.baseball-reference.com/leagues/majors/{YEAR}-schedule.shtml` | Date, Teams, Score, Attendance | Full season page |
| ESPN | `https://www.espn.com/mlb/schedule/_/date/{YYYYMMDD}` | Date, Time, Teams, TV | Daily schedule |
| MLB Stats API | `https://statsapi.mlb.com/api/v1/schedule?sportId=1&season={YEAR}` | Full season JSON | Official API |
| FixtureDownload | `https://fixturedownload.com/download/mlb-{year}-UTC.csv` | CSV download | Easy format |

### NHL Schedule
| Source | URL Pattern | Data Available | Notes |
|--------|-------------|----------------|-------|
| Hockey-Reference | `https://www.hockey-reference.com/leagues/NHL_{YEAR}_games.html` | Date, Teams, Score, Arena, Attendance | Full season page |
| ESPN | `https://www.espn.com/nhl/schedule/_/date/{YYYYMMDD}` | Date, Time, Teams, TV | Daily schedule |
| NHL API | `https://api-web.nhle.com/v1/schedule/{YYYY-MM-DD}` | Daily JSON | Official API |
| FixtureDownload | `https://fixturedownload.com/download/nhl-{year}-UTC.csv` | CSV download | Easy format |
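
Each URL pattern in the tables above is a template with year, month, or date placeholders. Here is a minimal sketch of expanding two of them with `str.format`. The function names are hypothetical, the month list is an assumption about which monthly pages exist for a given season, and the NHL placeholder is renamed to `{day}` because `{YYYY-MM-DD}` cannot be passed as a keyword argument.

```python
from datetime import date, timedelta

BREF_NBA = "https://www.basketball-reference.com/leagues/NBA_{YEAR}_games-{month}.html"
NHL_API = "https://api-web.nhle.com/v1/schedule/{day}"

# Assumed regular-season months; the real set of monthly pages may differ by season.
NBA_MONTHS = ["october", "november", "december", "january",
              "february", "march", "april"]


def nba_month_urls(year):
    """One Basketball-Reference URL per monthly schedule page."""
    return [BREF_NBA.format(YEAR=year, month=m) for m in NBA_MONTHS]


def nhl_daily_urls(start, days):
    """NHL API URLs for `days` consecutive dates, formatted as YYYY-MM-DD."""
    return [NHL_API.format(day=(start + timedelta(days=i)).isoformat())
            for i in range(days)]


urls = nba_month_urls(2026)
nhl = nhl_daily_urls(date(2025, 10, 7), 2)
```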

---

## Stadium/Arena Data Sources

| Source | URL/Method | Data Available | Notes |
|--------|------------|----------------|-------|
| Wikipedia | Team pages | Name, City, Capacity, Coordinates | Manual or scrape |
| HIFLD Open Data | `https://hifld-geoplatform.opendata.arcgis.com/datasets/major-sport-venues` | GeoJSON with coordinates | US Government data |
| ESPN Team Pages | `https://www.espn.com/{sport}/team/_/name/{abbrev}` | Arena name, location | Per-team |
| Sports-Reference | Team pages | Arena name, capacity | In schedule data |
| OpenStreetMap | Nominatim API | Coordinates from address | For geocoding |

---

## Data Validation Strategy

### Cross-Reference Points
1. **Game Count**: Total regular-season games per team should match the league schedule (82 NBA, 162 MLB, 82 NHL)
2. **Home/Away Balance**: Each team should have an equal number of home and away games
3. **Date Alignment**: The same game should appear on the same date across sources
4. **Team Names**: Map abbreviations across sources (NYK vs NY vs Knicks)
5. **Venue Names**: Stadiums may appear under different names (sponsorship changes)
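
The first two checks above can be sketched as a simple tally over scraped games. The dict keys (`sport`, `home`, `away`) and the function name are hypothetical; the expected totals are the regular-season counts listed in point 1.

```python
from collections import Counter

EXPECTED_GAMES = {"NBA": 82, "MLB": 162, "NHL": 82}


def check_game_counts(games):
    """Return human-readable problems for game-count and home/away balance."""
    home, away, sport_of = Counter(), Counter(), {}
    for g in games:
        home[g["home"]] += 1
        away[g["away"]] += 1
        sport_of[g["home"]] = sport_of[g["away"]] = g["sport"]

    problems = []
    for team in sorted(sport_of):
        total = home[team] + away[team]
        expected = EXPECTED_GAMES.get(sport_of[team])
        if expected is not None and total != expected:
            problems.append(f"{team}: {total} games, expected {expected}")
        if home[team] != away[team]:
            problems.append(f"{team}: {home[team]} home vs {away[team]} away")
    return problems


# Tiny illustrative input: two games, both hosted by the same team.
sample = [
    {"sport": "NBA", "home": "BOS", "away": "NYK"},
    {"sport": "NBA", "home": "BOS", "away": "NYK"},
]
issues = check_game_counts(sample)
```

This sketch keys teams by abbreviation alone, so it assumes abbreviations have already been mapped to one canonical form per sport (point 4).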

### Discrepancy Handling
- If sources disagree on game time: prefer official API (NBA.com, MLB.com, NHL.com)
- If sources disagree on venue: prefer Sports-Reference (most accurate historically)
- Log all discrepancies for manual review
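
These rules amount to a per-field source priority plus a discrepancy log. A minimal sketch follows, assuming hypothetical source labels and a hypothetical `resolve_field` helper. Only the top choice for each field comes from the rules above; the rest of each ordering is an assumption.

```python
# Source preference per field, highest priority first (labels are illustrative).
PREFERRED = {
    "time": ["official_api", "espn", "sports_reference"],
    "venue": ["sports_reference", "official_api", "espn"],
}


def resolve_field(field, values_by_source, log):
    """Pick one value for `field`; record a discrepancy if sources disagree."""
    present = {s: v for s, v in values_by_source.items() if v is not None}
    if not present:
        return None
    if len(set(present.values())) > 1:
        # Sources disagree: keep every reported value for manual review.
        log.append({"field": field, "values": dict(present)})
    for source in PREFERRED[field]:
        if source in present:
            return present[source]
    # No preferred source reported a value: fall back to any available one.
    return next(iter(present.values()))


discrepancies = []
time_value = resolve_field(
    "time", {"espn": "19:30", "official_api": "19:00"}, discrepancies)
venue_value = resolve_field(
    "venue", {"espn": "Arena X", "sports_reference": "Arena X"}, discrepancies)
```

Only the disagreement on game time ends up in `discrepancies`; the venue values match, so nothing is logged for them.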

---

## Rate Limiting Guidelines

| Source | Limit | Recommended Delay |
|--------|-------|-------------------|
| Sports-Reference sites | 20 req/min | 3 seconds between requests |
| ESPN | Unknown | 1 second between requests |
| Official APIs | Varies | 0.5 seconds between requests |
| Wikipedia | Polite | 1 second between requests |
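
One way to apply these delays is a small limiter that remembers when each source was last hit and sleeps for the remainder of the gap. This is a sketch under assumptions: the class name and source keys are hypothetical, and the clock and sleep functions are injectable only so the behaviour can be tested without waiting.

```python
import time

# Recommended delays from the table above, in seconds.
DELAYS = {
    "sports_reference": 3.0,
    "espn": 1.0,
    "official_api": 0.5,
    "wikipedia": 1.0,
}


class RateLimiter:
    """Enforce a minimum gap between consecutive requests to the same source."""

    def __init__(self, delays, clock=time.monotonic, sleep=time.sleep):
        self.delays = delays
        self.clock = clock
        self.sleep = sleep
        self.last_request = {}  # source -> time of the previous request

    def wait(self, source):
        """Block until `source` may be hit again; return the seconds slept."""
        now = self.clock()
        last = self.last_request.get(source)
        pause = 0.0
        if last is not None:
            pause = max(0.0, self.delays[source] - (now - last))
        if pause > 0:
            self.sleep(pause)
        self.last_request[source] = now + pause
        return pause


limiter = RateLimiter(DELAYS)
first_pause = limiter.wait("official_api")  # the first call never sleeps
```

Calling `limiter.wait("sports_reference")` before every request keeps traffic at or below the 20 requests per minute listed for those sites.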

---

## Team Abbreviation Mappings

See `team_mappings.json` for canonical mappings between sources.
@@ -1,147 +0,0 @@
# SportsTime Data Pipeline

Python scripts that scrape, canonicalize, and sync sports schedule data to CloudKit for the SportsTime iOS app.

## Overview

This pipeline ensures every game correctly links to its home/away teams and stadium with complete, accurate data across MLB, NBA, NHL, NFL, MLS, WNBA, and NWSL.

## Quick Start

```bash
# Install dependencies
pip install -r requirements.txt

# Scrape all sports for current season
python scrape_schedules.py --sport all --season 2026

# Run full pipeline (scrape + canonicalize)
python run_pipeline.py --sport all

# Validate data integrity
python cloudkit_import.py --validate

# Sync to CloudKit
python cloudkit_import.py --upload
```

## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                            SPORT MODULES                            │
│  mlb.py  nba.py  nhl.py  nfl.py  mls.py  wnba.py  nwsl.py           │
└────────────────────────────┬────────────────────────────────────────┘
                             │ scrape
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                              RAW DATA                               │
│  data/games.csv  data/stadiums.csv  data/games.json                 │
└────────────────────────────┬────────────────────────────────────────┘
                             │ canonicalize
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                           CANONICAL JSON                            │
│  data/stadiums_canonical.json  data/teams_canonical.json            │
│  data/games/*.json (per-sport/season)                               │
└────────────────────────────┬────────────────────────────────────────┘
                             │ sync
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│  CloudKit (iCloud.com.sportstime.app)                               │
│  Bundled JSON (SportsTime/Resources/)                               │
└─────────────────────────────────────────────────────────────────────┘
```

## Module Reference

| Script | Purpose |
|--------|---------|
| `core.py` | Shared utilities: data classes, rate limiting, fallback system |
| `scrape_schedules.py` | Main orchestrator for scraping schedules from multiple sources |
| `run_pipeline.py` | Full pipeline runner (scrape + canonicalize in one command) |
| `canonicalize_stadiums.py` | Stadium name resolution with alias support |
| `canonicalize_teams.py` | Team name resolution with alias support |
| `canonicalize_games.py` | Game linking (game → team → stadium relationships) |
| `cloudkit_import.py` | CloudKit sync with full CRUD, validation, and diff reporting |
| `validate_canonical.py` | Data validation with completeness metrics |
| `generate_canonical_data.py` | Generate bundled JSON for iOS app bootstrap |

## Sport Modules

Each sport has its own module with hardcoded stadium data and sport-specific scraping logic:

| Module | Sport | Stadiums | Notes |
|--------|-------|----------|-------|
| `mlb.py` | MLB | 30 ballparks | Baseball-Reference scraper |
| `nba.py` | NBA | 30 arenas | Basketball-Reference scraper |
| `nhl.py` | NHL | 32 arenas | Hockey-Reference scraper |
| `nfl.py` | NFL | 30 stadiums | Cross-calendar season (2025-26) |
| `mls.py` | MLS | 30 stadiums | Soccer-specific capacities |
| `wnba.py` | WNBA | 13 arenas | Shares venues with NBA |
| `nwsl.py` | NWSL | 13 stadiums | Shares some MLS venues |

## Data Files

### Output Directory: `data/`

| File | Contents |
|------|----------|
| `games.csv` | Raw scraped game data (all sports) |
| `games.json` | Raw scraped games as JSON |
| `stadiums.json` | Raw stadium data |
| `stadiums_canonical.json` | Canonical stadiums with resolved aliases |
| `teams_canonical.json` | Canonical teams with resolved aliases |
| `stadium_aliases.json` | Stadium name → canonical ID mapping |
| `games/{sport}_{season}.json` | Per-sport canonical games |

### Alias Files

- `data/canonical/stadiums.json` - Master stadium database
- `data/canonical/teams.json` - Master team database

## Pipeline Commands

### Scraping

```bash
# Single sport
python scrape_schedules.py --sport nba --season 2025-26

# All sports
python scrape_schedules.py --sport all --season 2026

# With specific output directory
python scrape_schedules.py --sport mlb --season 2025 --output ./data
```

### Canonicalization

```bash
# Run canonicalization pipeline
python run_canonicalization_pipeline.py --sport all
```

### CloudKit Operations

```bash
# Validate data without uploading
python cloudkit_import.py --validate

# Show what would be uploaded (dry run)
python cloudkit_import.py --upload --dry-run

# Upload to CloudKit
python cloudkit_import.py --upload

# List orphan records (requires CloudKit connection)
python cloudkit_import.py --validate --list-orphans

# Delete orphan records
python cloudkit_import.py --delete-orphans
```
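
The dry-run and orphan commands both depend on comparing local canonical data with what is already in CloudKit. The sketch below shows that comparison as plain set arithmetic. It assumes an orphan is a record present in CloudKit whose ID no longer appears in the local canonical data; the function name and sample IDs are hypothetical, and this is not the logic of `cloudkit_import.py` itself.

```python
def diff_records(local, remote):
    """Compare {record_id: fields} maps and classify the differences."""
    local_ids, remote_ids = set(local), set(remote)
    return {
        # Present locally, missing remotely: needs to be created.
        "create": sorted(local_ids - remote_ids),
        # Present on both sides with different fields: needs an update.
        "update": sorted(i for i in local_ids & remote_ids
                         if local[i] != remote[i]),
        # Present remotely only: candidate orphan.
        "orphans": sorted(remote_ids - local_ids),
    }


local = {"game_a": {"time": "19:00"}, "game_b": {"time": "20:00"}}
remote = {"game_b": {"time": "19:30"}, "game_c": {"time": "18:00"}}
plan = diff_records(local, remote)
```

Printing `plan` without acting on it is effectively a dry run; deleting the IDs under `orphans` is the destructive step, so it is worth reviewing that list first.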

## Related Documentation

- [DATA_SOURCES.md](DATA_SOURCES.md) - Data source URLs, rate limits, validation strategy
- [CLOUDKIT_SETUP.md](CLOUDKIT_SETUP.md) - CloudKit container setup, record types, security roles
@@ -1,508 +0,0 @@
#!/usr/bin/env python3
"""
Game Canonicalization for SportsTime
====================================
Stage 3 of the canonicalization pipeline.

Resolves team and stadium references in games, generates canonical game IDs.

Usage:
    python canonicalize_games.py --games data/games.json --teams data/teams_canonical.json \
        --aliases data/stadium_aliases.json --output data/
"""

import argparse
import json
from collections import defaultdict
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Optional


# =============================================================================
# DATA CLASSES
# =============================================================================

@dataclass
class CanonicalGame:
    """A canonicalized game with stable ID and resolved references."""
    canonical_id: str
    sport: str
    season: str
    date: str  # YYYY-MM-DD
    time: Optional[str]
    home_team_canonical_id: str
    away_team_canonical_id: str
    stadium_canonical_id: str
    is_playoff: bool = False
    broadcast: Optional[str] = None


@dataclass
class ResolutionWarning:
    """Warning about a resolution issue."""
    game_key: str
    issue: str
    details: str


# =============================================================================
# TEAM ABBREVIATION ALIASES
# Maps alternative abbreviations to canonical team IDs
# =============================================================================

TEAM_ABBREV_ALIASES = {
    # NBA
    ('NBA', 'PHX'): 'team_nba_pho',  # Phoenix
    ('NBA', 'BKN'): 'team_nba_brk',  # Brooklyn
    ('NBA', 'CHA'): 'team_nba_cho',  # Charlotte (older abbrev)
    ('NBA', 'NOP'): 'team_nba_nop',  # New Orleans
    ('NBA', 'NO'): 'team_nba_nop',  # New Orleans alt
    ('NBA', 'NY'): 'team_nba_nyk',  # New York
    ('NBA', 'SA'): 'team_nba_sas',  # San Antonio
    ('NBA', 'GS'): 'team_nba_gsw',  # Golden State
    ('NBA', 'UTAH'): 'team_nba_uta',  # Utah

    # MLB
    ('MLB', 'AZ'): 'team_mlb_ari',  # Arizona
    ('MLB', 'CWS'): 'team_mlb_chw',  # Chicago White Sox
    ('MLB', 'KC'): 'team_mlb_kcr',  # Kansas City
    ('MLB', 'SD'): 'team_mlb_sdp',  # San Diego
    ('MLB', 'SF'): 'team_mlb_sfg',  # San Francisco
    ('MLB', 'TB'): 'team_mlb_tbr',  # Tampa Bay
    ('MLB', 'WSH'): 'team_mlb_wsn',  # Washington
    ('MLB', 'WAS'): 'team_mlb_wsn',  # Washington alt
    ('MLB', 'LA'): 'team_mlb_lad',  # Los Angeles Dodgers
    ('MLB', 'ATH'): 'team_mlb_oak',  # Oakland Athletics

    # NHL
    ('NHL', 'ARI'): 'team_nhl_ari',  # Arizona/Utah
    ('NHL', 'UTA'): 'team_nhl_ari',  # Utah Hockey Club (uses ARI code)
    ('NHL', 'VGS'): 'team_nhl_vgk',  # Vegas
    ('NHL', 'TB'): 'team_nhl_tbl',  # Tampa Bay Lightning
    ('NHL', 'NJ'): 'team_nhl_njd',  # New Jersey
    ('NHL', 'SJ'): 'team_nhl_sjs',  # San Jose
    ('NHL', 'LA'): 'team_nhl_lak',  # Los Angeles Kings
    ('NHL', 'MON'): 'team_nhl_mtl',  # Montreal

    # NFL
    ('NFL', 'JAC'): 'team_nfl_jax',  # Jacksonville (JAC vs JAX)
    ('NFL', 'OAK'): 'team_nfl_lv',  # Oakland → Las Vegas Raiders (moved 2020)
    ('NFL', 'SD'): 'team_nfl_lac',  # San Diego → Los Angeles Chargers (moved 2017)
    ('NFL', 'STL'): 'team_nfl_lar',  # St. Louis → Los Angeles Rams (moved 2016)
    ('NFL', 'GNB'): 'team_nfl_gb',  # Green Bay alternate
    ('NFL', 'KAN'): 'team_nfl_kc',  # Kansas City alternate
    ('NFL', 'NWE'): 'team_nfl_ne',  # New England alternate
    ('NFL', 'NOR'): 'team_nfl_no',  # New Orleans alternate
    ('NFL', 'TAM'): 'team_nfl_tb',  # Tampa Bay alternate
    ('NFL', 'SFO'): 'team_nfl_sf',  # San Francisco alternate
    ('NFL', 'WAS'): 'team_nfl_was',  # Washington (direct match but include for completeness)
    ('NFL', 'WSH'): 'team_nfl_was',  # Washington Commanders alternate abbrev

    # MLS
    ('MLS', 'LA'): 'team_mls_lag',  # LA Galaxy
    ('MLS', 'NYC'): 'team_mls_nycfc',  # NYC FC
    ('MLS', 'RBNY'): 'team_mls_nyrb',  # NY Red Bulls
    ('MLS', 'NYR'): 'team_mls_nyrb',  # NY Red Bulls alt
    ('MLS', 'NY'): 'team_mls_nyrb',  # NY Red Bulls short
    ('MLS', 'SJE'): 'team_mls_sj',  # San Jose Earthquakes
    ('MLS', 'KC'): 'team_mls_skc',  # Sporting KC
    ('MLS', 'DCU'): 'team_mls_dc',  # DC United
    ('MLS', 'FCD'): 'team_mls_dal',  # FC Dallas
    ('MLS', 'MON'): 'team_mls_mtl',  # Montreal
    ('MLS', 'LAF'): 'team_mls_lafc',  # LAFC alt
    ('MLS', 'ATX'): 'team_mls_aus',  # Austin FC alt abbrev

    # WNBA
    ('WNBA', 'LV'): 'team_wnba_lva',  # Las Vegas Aces
    ('WNBA', 'LAS'): 'team_wnba_la',  # LA Sparks
    ('WNBA', 'NYL'): 'team_wnba_ny',  # New York Liberty
    ('WNBA', 'PHX'): 'team_wnba_pho',  # Phoenix Mercury
    ('WNBA', 'CONN'): 'team_wnba_con',  # Connecticut Sun
    ('WNBA', 'WSH'): 'team_wnba_was',  # Washington Mystics

    # NWSL
    ('NWSL', 'ANG'): 'team_nwsl_la',  # Angel City FC (uses LA abbrev)
    ('NWSL', 'ACFC'): 'team_nwsl_la',  # Angel City FC alt
    ('NWSL', 'NCC'): 'team_nwsl_nc',  # North Carolina Courage
    ('NWSL', 'GOTHAM'): 'team_nwsl_nj',  # NJ/NY Gotham FC
    ('NWSL', 'NY'): 'team_nwsl_nj',  # NJ/NY Gotham FC alt
    ('NWSL', 'BAY'): 'team_nwsl_sj',  # Bay FC (San Jose)
    ('NWSL', 'RLC'): 'team_nwsl_uta',  # Racing Louisville -> Utah Royals (rebrand)
    ('NWSL', 'LOU'): 'team_nwsl_uta',  # Louisville -> Utah alt
}


# =============================================================================
# ID GENERATION
# =============================================================================

def normalize_season(sport: str, season: str) -> str:
    """
    Normalize season format for ID generation.

    NBA/NHL: "2025-26" -> "202526"
    MLB: "2026" -> "2026"
    """
    return season.replace('-', '')


def generate_canonical_game_id(
    sport: str,
    season: str,
    date: str,  # YYYY-MM-DD
    away_abbrev: str,
    home_abbrev: str,
    sequence: int = 1
) -> str:
    """
    Generate deterministic canonical ID for game.

    Format: game_{sport}_{season}_{date}_{away}_{home}[_{sequence}]
    Example: game_nba_202526_20251021_hou_okc
             game_mlb_2026_20260615_bos_nyy_2 (doubleheader game 2)
    """
    normalized_season = normalize_season(sport, season)
    date_compact = date.replace('-', '')  # YYYYMMDD

    base_id = f"game_{sport.lower()}_{normalized_season}_{date_compact}_{away_abbrev.lower()}_{home_abbrev.lower()}"

    if sequence > 1:
        return f"{base_id}_{sequence}"
    return base_id


# =============================================================================
# RESOLUTION
# =============================================================================

def build_alias_lookup(stadium_aliases: list[dict]) -> dict[str, str]:
    """
    Build lookup from alias name to canonical stadium ID.

    Returns: {alias_name_lower: canonical_stadium_id}
    """
    lookup = {}
    for alias in stadium_aliases:
        alias_name = alias.get('alias_name', '').lower().strip()
        canonical_id = alias.get('stadium_canonical_id', '')
        if alias_name and canonical_id:
            lookup[alias_name] = canonical_id
    return lookup


def resolve_team(
    abbrev: str,
    sport: str,
    teams_by_abbrev: dict[tuple[str, str], dict],
    teams_by_id: dict[str, dict]
) -> Optional[dict]:
    """
    Resolve team abbreviation to canonical team.

    1. Try direct match by (sport, abbrev)
    2. Try alias lookup
    3. Return None if not found
    """
    key = (sport, abbrev.upper())

    # Direct match
    if key in teams_by_abbrev:
        return teams_by_abbrev[key]

    # Alias match
    if key in TEAM_ABBREV_ALIASES:
        canonical_id = TEAM_ABBREV_ALIASES[key]
        if canonical_id in teams_by_id:
            return teams_by_id[canonical_id]

    return None


def resolve_stadium_from_venue(
    venue: str,
    home_team: dict,
    sport: str,
    alias_lookup: dict[str, str],
    stadiums_by_id: dict[str, dict]
) -> str:
    """
    Resolve stadium canonical ID from venue name.

    Strategy:
    1. ALWAYS prefer home team's stadium (most reliable, sport-correct)
    2. Try exact sport-scoped alias match (only if home team has no stadium)
    3. Try partial alias match, again restricted to the same sport
    4. Fall back to unknown stadium slug

    For multi-sport venues (MSG, Crypto.com Arena, etc.), home team's
    stadium_canonical_id is authoritative because it's already sport-scoped.

    Args:
        venue: Venue name from game data
        home_team: Resolved home team dict
        sport: Sport code (e.g. NBA, MLB, NHL)
        alias_lookup: {alias_name_lower: canonical_stadium_id}
        stadiums_by_id: {canonical_id: stadium_dict} (currently unused)

    Returns:
        canonical_stadium_id
    """
    # Strategy 1: Home team's stadium is most reliable (sport-scoped)
    if home_team:
        team_stadium = home_team.get('stadium_canonical_id', '')
        if team_stadium:
            return team_stadium

    # Strategy 2: Sport-scoped alias match (fallback for neutral sites)
    venue_lower = venue.lower().strip()
    sport_prefix = f"stadium_{sport.lower()}_"

    if venue_lower in alias_lookup:
        matched_id = alias_lookup[venue_lower]
        # Only use alias if it's for the correct sport
        if matched_id.startswith(sport_prefix):
            return matched_id

    # Strategy 3: Partial match with sport check
    for alias, canonical_id in alias_lookup.items():
        if len(alias) > 3 and (alias in venue_lower or venue_lower in alias):
            if canonical_id.startswith(sport_prefix):
                return canonical_id

    # Strategy 4: Unknown stadium
    slug = venue_lower[:30].replace(' ', '_').replace('.', '')
    return f"stadium_unknown_{slug}"


# =============================================================================
# CANONICALIZATION
# =============================================================================

def canonicalize_games(
    raw_games: list[dict],
    canonical_teams: list[dict],
    stadium_aliases: list[dict],
    verbose: bool = False
) -> tuple[list[CanonicalGame], list[ResolutionWarning]]:
    """
    Stage 3: Canonicalize games.

    1. Resolve team abbreviations to canonical IDs
    2. Resolve venues to stadium canonical IDs
    3. Generate canonical game IDs (handling doubleheaders)

    Args:
        raw_games: List of raw game dicts
        canonical_teams: List of canonical team dicts
        stadium_aliases: List of stadium alias dicts
        verbose: Print detailed progress

    Returns:
        (canonical_games, warnings)
    """
    games = []
    warnings = []

    # Build lookups
    teams_by_abbrev = {}  # (sport, abbrev) -> team dict
    teams_by_id = {}  # canonical_id -> team dict

    for team in canonical_teams:
        abbrev = team['abbreviation'].upper()
        sport = team['sport']
        teams_by_abbrev[(sport, abbrev)] = team
        teams_by_id[team['canonical_id']] = team

    alias_lookup = build_alias_lookup(stadium_aliases)
    stadiums_by_id = {}  # Would be populated from stadiums_canonical.json if needed

    # Track games for doubleheader detection
    game_counts = defaultdict(int)  # (date, away_id, home_id) -> count

    resolved_count = 0
    unresolved_teams = 0
    unresolved_stadiums = 0

    for raw in raw_games:
        sport = raw.get('sport', '').upper()
        season = raw.get('season', '')
        date = raw.get('date', '')
        home_abbrev = raw.get('home_team_abbrev', '').upper()
        away_abbrev = raw.get('away_team_abbrev', '').upper()
        venue = raw.get('venue', '')

        game_key = f"{date}_{away_abbrev}_{home_abbrev}"

        # Resolve teams
        home_team = resolve_team(home_abbrev, sport, teams_by_abbrev, teams_by_id)
        away_team = resolve_team(away_abbrev, sport, teams_by_abbrev, teams_by_id)

        if not home_team:
            warnings.append(ResolutionWarning(
                game_key=game_key,
                issue='Unknown home team',
                details=f"Could not resolve home team '{home_abbrev}' for sport {sport}"
            ))
            unresolved_teams += 1
            if verbose:
                print(f" WARNING: {game_key} - unknown home team {home_abbrev}")
            continue

        if not away_team:
            warnings.append(ResolutionWarning(
                game_key=game_key,
                issue='Unknown away team',
                details=f"Could not resolve away team '{away_abbrev}' for sport {sport}"
            ))
            unresolved_teams += 1
            if verbose:
                print(f" WARNING: {game_key} - unknown away team {away_abbrev}")
            continue

        # Resolve stadium
        stadium_canonical_id = resolve_stadium_from_venue(
            venue, home_team, sport, alias_lookup, stadiums_by_id
        )

        if stadium_canonical_id.startswith('stadium_unknown'):
            warnings.append(ResolutionWarning(
                game_key=game_key,
                issue='Unknown stadium',
                details=f"Could not resolve venue '{venue}', using home team stadium"
            ))
            unresolved_stadiums += 1
            # Fall back to home team stadium. Use `or` so that an empty or
            # missing stadium ID keeps the placeholder slug instead of
            # overwriting it with an empty string.
            stadium_canonical_id = home_team.get('stadium_canonical_id') or stadium_canonical_id

        # Handle doubleheaders
        matchup_key = (date, away_team['canonical_id'], home_team['canonical_id'])
        game_counts[matchup_key] += 1
        sequence = game_counts[matchup_key]

        # Generate canonical ID
        canonical_id = generate_canonical_game_id(
            sport, season, date,
            away_team['abbreviation'], home_team['abbreviation'],
            sequence
        )

        game = CanonicalGame(
            canonical_id=canonical_id,
            sport=sport,
            season=season,
            date=date,
            time=raw.get('time'),
            home_team_canonical_id=home_team['canonical_id'],
            away_team_canonical_id=away_team['canonical_id'],
            stadium_canonical_id=stadium_canonical_id,
            is_playoff=raw.get('is_playoff', False),
            broadcast=raw.get('broadcast')
        )
        games.append(game)
        resolved_count += 1

    if verbose:
        print(f"\n Resolved: {resolved_count} games")
        print(f" Unresolved teams: {unresolved_teams}")
        print(f" Unknown stadiums (used home team): {unresolved_stadiums}")

    return games, warnings


# =============================================================================
# MAIN
# =============================================================================

def main():
    parser = argparse.ArgumentParser(
        description='Canonicalize game data'
    )
    parser.add_argument(
        '--games', type=str, default='./data/games.json',
        help='Input raw games JSON file'
    )
    parser.add_argument(
        '--teams', type=str, default='./data/teams_canonical.json',
        help='Input canonical teams JSON file'
    )
    parser.add_argument(
        '--aliases', type=str, default='./data/stadium_aliases.json',
        help='Input stadium aliases JSON file'
    )
    parser.add_argument(
        '--output', type=str, default='./data',
        help='Output directory for canonical files'
    )
    parser.add_argument(
        '--verbose', '-v', action='store_true',
        help='Verbose output'
    )

    args = parser.parse_args()
    games_path = Path(args.games)
    teams_path = Path(args.teams)
    aliases_path = Path(args.aliases)
    output_dir = Path(args.output)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Load input files
    print(f"Loading raw games from {games_path}...")
    with open(games_path) as f:
        raw_games = json.load(f)
    print(f" Loaded {len(raw_games)} raw games")

    print(f"Loading canonical teams from {teams_path}...")
    with open(teams_path) as f:
        canonical_teams = json.load(f)
    print(f" Loaded {len(canonical_teams)} canonical teams")

    print(f"Loading stadium aliases from {aliases_path}...")
    with open(aliases_path) as f:
        stadium_aliases = json.load(f)
    print(f" Loaded {len(stadium_aliases)} stadium aliases")

    # Canonicalize games
    print("\nCanonicalizing games...")
    canonical_games, warnings = canonicalize_games(
        raw_games, canonical_teams, stadium_aliases, verbose=args.verbose
    )
    print(f" Created {len(canonical_games)} canonical games")

    if warnings:
        print(f"\n Warnings: {len(warnings)}")
        # Group by issue type
        by_issue = defaultdict(list)
        for w in warnings:
            by_issue[w.issue].append(w)
        for issue, issue_warnings in by_issue.items():
            print(f" - {issue}: {len(issue_warnings)}")

    # Export
    games_path = output_dir / 'games_canonical.json'
    warnings_path = output_dir / 'game_resolution_warnings.json'

    with open(games_path, 'w') as f:
        json.dump([asdict(g) for g in canonical_games], f, indent=2)
    print(f"\nExported games to {games_path}")

    if warnings:
        with open(warnings_path, 'w') as f:
            json.dump([asdict(w) for w in warnings], f, indent=2)
|
||||
print(f"Exported warnings to {warnings_path}")
|
||||
|
||||
# Summary by sport
|
||||
print("\nSummary by sport:")
|
||||
by_sport = {}
|
||||
for g in canonical_games:
|
||||
by_sport[g.sport] = by_sport.get(g.sport, 0) + 1
|
||||
for sport, count in sorted(by_sport.items()):
|
||||
print(f" {sport}: {count} games")
|
||||
|
||||
# Check for doubleheaders
|
||||
doubleheaders = sum(1 for g in canonical_games if g.canonical_id.endswith(('_2', '_3')))  # suffix match: a substring test for '_2' may also hit other ID segments (e.g. a year)
|
||||
if doubleheaders:
|
||||
print(f"\n Doubleheader games detected: {doubleheaders}")
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
@@ -1,515 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Stadium Canonicalization for SportsTime
|
||||
========================================
|
||||
Stage 1 of the canonicalization pipeline.
|
||||
|
||||
Normalizes stadium data and generates deterministic canonical IDs.
|
||||
Creates stadium name aliases for fuzzy matching during game resolution.
|
||||
|
||||
Usage:
|
||||
python canonicalize_stadiums.py --input data/stadiums.json --output data/
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import re
|
||||
from dataclasses import dataclass, asdict, field
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# DATA CLASSES
|
||||
# =============================================================================
|
||||
|
||||
@dataclass
|
||||
class CanonicalStadium:
|
||||
"""A canonicalized stadium with stable ID."""
|
||||
canonical_id: str
|
||||
name: str
|
||||
city: str
|
||||
state: str
|
||||
latitude: float
|
||||
longitude: float
|
||||
capacity: int
|
||||
sport: str
|
||||
primary_team_abbrevs: list = field(default_factory=list)
|
||||
year_opened: Optional[int] = None
|
||||
|
||||
|
||||
@dataclass
|
||||
class StadiumAlias:
|
||||
"""Maps an alias name to a canonical stadium ID."""
|
||||
alias_name: str # Normalized (lowercase)
|
||||
stadium_canonical_id: str
|
||||
valid_from: Optional[str] = None
|
||||
valid_until: Optional[str] = None
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# HISTORICAL STADIUM ALIASES
|
||||
# Known name changes for stadiums (sponsorship changes, renames)
|
||||
# =============================================================================
|
||||
|
||||
HISTORICAL_STADIUM_ALIASES = {
|
||||
# MLB
|
||||
'stadium_mlb_minute_maid_park': [
|
||||
{'alias_name': 'daikin park', 'valid_from': '2025-01-01'},
|
||||
{'alias_name': 'enron field', 'valid_from': '2000-04-01', 'valid_until': '2002-02-28'},
|
||||
{'alias_name': 'astros field', 'valid_from': '2002-03-01', 'valid_until': '2002-06-04'},
|
||||
],
|
||||
'stadium_mlb_guaranteed_rate_field': [
|
||||
{'alias_name': 'rate field', 'valid_from': '2024-01-01'},
|
||||
{'alias_name': 'us cellular field', 'valid_from': '2003-01-01', 'valid_until': '2016-08-24'},
|
||||
{'alias_name': 'comiskey park ii', 'valid_from': '1991-04-01', 'valid_until': '2002-12-31'},
|
||||
{'alias_name': 'new comiskey park', 'valid_from': '1991-04-01', 'valid_until': '2002-12-31'},
|
||||
],
|
||||
'stadium_mlb_truist_park': [
|
||||
{'alias_name': 'suntrust park', 'valid_from': '2017-04-01', 'valid_until': '2020-01-13'},
|
||||
],
|
||||
'stadium_mlb_progressive_field': [
|
||||
{'alias_name': 'jacobs field', 'valid_from': '1994-04-01', 'valid_until': '2008-01-10'},
|
||||
{'alias_name': 'the jake', 'valid_from': '1994-04-01', 'valid_until': '2008-01-10'},
|
||||
],
|
||||
'stadium_mlb_american_family_field': [
|
||||
{'alias_name': 'miller park', 'valid_from': '2001-04-01', 'valid_until': '2020-12-31'},
|
||||
],
|
||||
'stadium_mlb_rogers_centre': [
|
||||
{'alias_name': 'skydome', 'valid_from': '1989-06-01', 'valid_until': '2005-02-01'},
|
||||
],
|
||||
'stadium_mlb_loandepot_park': [
|
||||
{'alias_name': 'marlins park', 'valid_from': '2012-04-01', 'valid_until': '2021-03-31'},
|
||||
],
|
||||
'stadium_mlb_t_mobile_park': [
|
||||
{'alias_name': 'safeco field', 'valid_from': '1999-07-01', 'valid_until': '2018-12-31'},
|
||||
],
|
||||
'stadium_mlb_oracle_park': [
|
||||
{'alias_name': 'att park', 'valid_from': '2006-01-01', 'valid_until': '2019-01-08'},
|
||||
{'alias_name': 'sbc park', 'valid_from': '2004-01-01', 'valid_until': '2005-12-31'},
|
||||
{'alias_name': 'pac bell park', 'valid_from': '2000-04-01', 'valid_until': '2003-12-31'},
|
||||
],
|
||||
'stadium_mlb_globe_life_field': [
|
||||
{'alias_name': 'choctaw stadium', 'valid_from': '2020-01-01'}, # Globe Life Field opened 2020. NOTE: Choctaw Stadium is the renamed Globe Life Park, a separate venue; verify this alias
|
||||
],
|
||||
|
||||
# NBA
|
||||
'stadium_nba_state_farm_arena': [
|
||||
{'alias_name': 'philips arena', 'valid_from': '1999-09-01', 'valid_until': '2018-06-25'},
|
||||
],
|
||||
'stadium_nba_crypto_com_arena': [
|
||||
{'alias_name': 'staples center', 'valid_from': '1999-10-01', 'valid_until': '2021-12-24'},
|
||||
],
|
||||
'stadium_nba_kaseya_center': [
|
||||
{'alias_name': 'ftx arena', 'valid_from': '2021-06-01', 'valid_until': '2023-03-31'},
|
||||
{'alias_name': 'american airlines arena', 'valid_from': '1999-12-01', 'valid_until': '2021-05-31'},
|
||||
],
|
||||
'stadium_nba_gainbridge_fieldhouse': [
|
||||
{'alias_name': 'bankers life fieldhouse', 'valid_from': '2011-01-01', 'valid_until': '2021-12-31'},
|
||||
{'alias_name': 'conseco fieldhouse', 'valid_from': '1999-11-01', 'valid_until': '2010-12-31'},
|
||||
],
|
||||
'stadium_nba_rocket_mortgage_fieldhouse': [
|
||||
{'alias_name': 'quicken loans arena', 'valid_from': '2005-08-01', 'valid_until': '2019-08-08'},
|
||||
{'alias_name': 'gund arena', 'valid_from': '1994-10-01', 'valid_until': '2005-07-31'},
|
||||
],
|
||||
'stadium_nba_kia_center': [
|
||||
{'alias_name': 'amway center', 'valid_from': '2010-10-01', 'valid_until': '2023-07-12'},
|
||||
],
|
||||
'stadium_nba_frost_bank_center': [
|
||||
{'alias_name': 'att center', 'valid_from': '2002-10-01', 'valid_until': '2023-10-01'},
|
||||
],
|
||||
'stadium_nba_intuit_dome': [
|
||||
# New arena opened 2024, Clippers moved from Crypto.com Arena
|
||||
],
|
||||
'stadium_nba_delta_center': [
|
||||
{'alias_name': 'vivint arena', 'valid_from': '2020-12-01', 'valid_until': '2023-07-01'},
|
||||
{'alias_name': 'vivint smart home arena', 'valid_from': '2015-11-01', 'valid_until': '2020-11-30'},
|
||||
{'alias_name': 'energysolutions arena', 'valid_from': '2006-11-01', 'valid_until': '2015-10-31'},
|
||||
],
|
||||
|
||||
# NHL
|
||||
'stadium_nhl_amerant_bank_arena': [
|
||||
{'alias_name': 'fla live arena', 'valid_from': '2021-10-01', 'valid_until': '2024-05-31'},
|
||||
{'alias_name': 'bb&t center', 'valid_from': '2012-06-01', 'valid_until': '2021-09-30'},
|
||||
{'alias_name': 'bankatlantic center', 'valid_from': '2005-10-01', 'valid_until': '2012-05-31'},
|
||||
],
|
||||
'stadium_nhl_climate_pledge_arena': [
|
||||
{'alias_name': 'keyarena', 'valid_from': '1995-01-01', 'valid_until': '2018-10-01'},
|
||||
{'alias_name': 'seattle center coliseum', 'valid_from': '1962-01-01', 'valid_until': '1994-12-31'},
|
||||
],
|
||||
|
||||
# NFL
|
||||
'stadium_nfl_sofi_stadium': [
|
||||
# SoFi Stadium opened 2020, no prior name
|
||||
],
|
||||
'stadium_nfl_allegiant_stadium': [
|
||||
# Allegiant Stadium opened 2020, no prior name (Raiders moved from Oakland Coliseum)
|
||||
],
|
||||
'stadium_nfl_caesars_superdome': [
|
||||
{'alias_name': 'mercedes-benz superdome', 'valid_from': '2011-10-01', 'valid_until': '2021-07-01'},
|
||||
{'alias_name': 'louisiana superdome', 'valid_from': '1975-08-01', 'valid_until': '2011-09-30'},
|
||||
{'alias_name': 'superdome', 'valid_from': '1975-08-01'},
|
||||
],
|
||||
'stadium_nfl_paycor_stadium': [
|
||||
{'alias_name': 'paul brown stadium', 'valid_from': '2000-08-01', 'valid_until': '2022-09-05'},
|
||||
],
|
||||
'stadium_nfl_empower_field_at_mile_high': [
|
||||
{'alias_name': 'broncos stadium at mile high', 'valid_from': '2018-09-01', 'valid_until': '2019-08-31'},
|
||||
{'alias_name': 'sports authority field at mile high', 'valid_from': '2011-08-01', 'valid_until': '2018-08-31'},
|
||||
{'alias_name': 'invesco field at mile high', 'valid_from': '2001-09-01', 'valid_until': '2011-07-31'},
|
||||
{'alias_name': 'mile high stadium', 'valid_from': '1960-01-01', 'valid_until': '2001-08-31'},
|
||||
],
|
||||
'stadium_nfl_acrisure_stadium': [
|
||||
{'alias_name': 'heinz field', 'valid_from': '2001-08-01', 'valid_until': '2022-07-10'},
|
||||
],
|
||||
'stadium_nfl_everbank_stadium': [
|
||||
{'alias_name': 'tiaa bank field', 'valid_from': '2018-01-01', 'valid_until': '2023-03-31'},
|
||||
{'alias_name': 'everbank field', 'valid_from': '2014-01-01', 'valid_until': '2017-12-31'},
|
||||
{'alias_name': 'alltel stadium', 'valid_from': '1997-06-01', 'valid_until': '2006-12-31'},
|
||||
{'alias_name': 'jacksonville municipal stadium', 'valid_from': '1995-08-01', 'valid_until': '1997-05-31'},
|
||||
],
|
||||
'stadium_nfl_northwest_stadium': [
|
||||
{'alias_name': 'fedexfield', 'valid_from': '1999-11-01', 'valid_until': '2025-01-01'},
|
||||
{'alias_name': 'fedex field', 'valid_from': '1999-11-01', 'valid_until': '2025-01-01'},
|
||||
{'alias_name': 'jack kent cooke stadium', 'valid_from': '1997-09-01', 'valid_until': '1999-10-31'},
|
||||
],
|
||||
'stadium_nfl_hard_rock_stadium': [
|
||||
{'alias_name': 'sun life stadium', 'valid_from': '2010-01-01', 'valid_until': '2016-07-31'},
|
||||
{'alias_name': 'land shark stadium', 'valid_from': '2009-01-01', 'valid_until': '2009-12-31'},
|
||||
{'alias_name': 'dolphin stadium', 'valid_from': '2005-01-01', 'valid_until': '2008-12-31'},
|
||||
{'alias_name': 'pro player stadium', 'valid_from': '1996-04-01', 'valid_until': '2004-12-31'},
|
||||
{'alias_name': 'joe robbie stadium', 'valid_from': '1987-08-01', 'valid_until': '1996-03-31'},
|
||||
],
|
||||
'stadium_nfl_highmark_stadium': [
|
||||
{'alias_name': 'bills stadium', 'valid_from': '2020-03-01', 'valid_until': '2021-03-31'},
|
||||
{'alias_name': 'new era field', 'valid_from': '2016-08-01', 'valid_until': '2020-02-29'},
|
||||
{'alias_name': 'ralph wilson stadium', 'valid_from': '1998-08-01', 'valid_until': '2016-07-31'},
|
||||
{'alias_name': 'rich stadium', 'valid_from': '1973-08-01', 'valid_until': '1998-07-31'},
|
||||
],
|
||||
'stadium_nfl_geha_field_at_arrowhead_stadium': [
|
||||
{'alias_name': 'arrowhead stadium', 'valid_from': '1972-08-01'},
|
||||
],
|
||||
'stadium_nfl_att_stadium': [
|
||||
{'alias_name': 'cowboys stadium', 'valid_from': '2009-05-01', 'valid_until': '2013-07-24'},
|
||||
],
|
||||
'stadium_nfl_us_bank_stadium': [
|
||||
# Opened 2016, no prior name (Vikings moved from Metrodome)
|
||||
],
|
||||
'stadium_nfl_lumen_field': [
|
||||
{'alias_name': 'centurylink field', 'valid_from': '2011-06-01', 'valid_until': '2020-11-18'},
|
||||
{'alias_name': 'qwest field', 'valid_from': '2004-06-01', 'valid_until': '2011-05-31'},
|
||||
{'alias_name': 'seahawks stadium', 'valid_from': '2002-07-01', 'valid_until': '2004-05-31'},
|
||||
],
|
||||
|
||||
# MLS
|
||||
'stadium_mls_bmo_stadium': [
|
||||
{'alias_name': 'banc of california stadium', 'valid_from': '2018-04-01', 'valid_until': '2023-06-01'},
|
||||
],
|
||||
'stadium_mls_paypal_park': [
|
||||
{'alias_name': 'earthquakes stadium', 'valid_from': '2015-03-01', 'valid_until': '2020-12-31'},
|
||||
{'alias_name': 'avaya stadium', 'valid_from': '2015-03-01', 'valid_until': '2020-12-31'},
|
||||
],
|
||||
'stadium_mls_shell_energy_stadium': [
|
||||
{'alias_name': 'pnc stadium', 'valid_from': '2021-03-01', 'valid_until': '2023-03-01'},
|
||||
{'alias_name': 'bbva stadium', 'valid_from': '2019-01-01', 'valid_until': '2021-02-28'},
|
||||
{'alias_name': 'bbva compass stadium', 'valid_from': '2012-05-01', 'valid_until': '2018-12-31'},
|
||||
],
|
||||
'stadium_mls_dignity_health_sports_park': [
|
||||
{'alias_name': 'stubhub center', 'valid_from': '2013-06-01', 'valid_until': '2019-01-31'},
|
||||
{'alias_name': 'home depot center', 'valid_from': '2003-06-01', 'valid_until': '2013-05-31'},
|
||||
],
|
||||
'stadium_mls_interandco_stadium': [
|
||||
{'alias_name': 'exploria stadium', 'valid_from': '2017-03-01', 'valid_until': '2023-07-01'},
|
||||
{'alias_name': 'orlando city stadium', 'valid_from': '2017-03-01', 'valid_until': '2019-01-01'},
|
||||
],
|
||||
'stadium_mls_chase_stadium': [
|
||||
{'alias_name': 'drv pnk stadium', 'valid_from': '2020-07-01', 'valid_until': '2024-01-01'},
|
||||
{'alias_name': 'inter miami cf stadium', 'valid_from': '2020-07-01', 'valid_until': '2020-09-01'},
|
||||
],
|
||||
'stadium_mls_america_first_field': [
|
||||
{'alias_name': 'rio tinto stadium', 'valid_from': '2008-10-01', 'valid_until': '2021-08-01'},
|
||||
],
|
||||
'stadium_mls_lowercom_field': [
|
||||
{'alias_name': 'lower.com field', 'valid_from': '2021-07-01'}, # Current name with period
|
||||
{'alias_name': 'new crew stadium', 'valid_from': '2021-07-01', 'valid_until': '2021-07-01'},
|
||||
],
|
||||
|
||||
# WNBA (most share NBA/NHL arenas with existing aliases; these are WNBA-specific arenas)
|
||||
'stadium_wnba_michelob_ultra_arena': [
|
||||
{'alias_name': 'mandalay bay events center', 'valid_from': '1999-03-01', 'valid_until': '2021-01-01'},
|
||||
],
|
||||
'stadium_wnba_gateway_center_arena': [
|
||||
# Gateway Center Arena opened 2018, WNBA-specific venue
|
||||
],
|
||||
'stadium_wnba_wintrust_arena': [
|
||||
# Wintrust Arena opened 2017, WNBA-specific venue
|
||||
],
|
||||
'stadium_wnba_college_park_center': [
|
||||
# College Park Center opened 2012, university venue
|
||||
],
|
||||
|
||||
# NWSL (most share MLS stadiums with existing aliases; these are NWSL-specific)
|
||||
'stadium_nwsl_cpkc_stadium': [
|
||||
# CPKC Stadium opened 2024, first soccer-specific stadium for NWSL team
|
||||
],
|
||||
'stadium_nwsl_seatgeek_stadium': [
|
||||
{'alias_name': 'toyota park', 'valid_from': '2006-06-01', 'valid_until': '2018-04-30'},
|
||||
{'alias_name': 'bridgeview stadium', 'valid_from': '2006-06-01', 'valid_until': '2006-06-01'},
|
||||
],
|
||||
'stadium_nwsl_wakemed_soccer_park': [
|
||||
{'alias_name': 'sas soccer park', 'valid_from': '2002-04-01', 'valid_until': '2007-03-31'},
|
||||
],
|
||||
}
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# SLUG GENERATION
|
||||
# =============================================================================
|
||||
|
||||
def normalize_stadium_name(name: str) -> str:
|
||||
"""
|
||||
Normalize stadium name for slug generation.
|
||||
|
||||
- Lowercase
|
||||
- Remove parentheticals like "(IV)"
|
||||
- Remove special characters except spaces
|
||||
- Collapse multiple spaces
|
||||
"""
|
||||
normalized = name.lower()
|
||||
# Remove parentheticals
|
||||
normalized = re.sub(r'\s*\([^)]*\)', '', normalized)
|
||||
# Remove everything except lowercase letters, digits, and whitespace
|
||||
normalized = re.sub(r'[^a-z0-9\s]', '', normalized)
|
||||
# Replace multiple spaces with single space
|
||||
normalized = re.sub(r'\s+', ' ', normalized).strip()
|
||||
return normalized
|
||||
|
||||
|
||||
def generate_stadium_slug(name: str) -> str:
|
||||
"""
|
||||
Generate URL-safe slug from stadium name.
|
||||
|
||||
Examples:
|
||||
"State Farm Arena" -> "state_farm_arena"
|
||||
"TD Garden" -> "td_garden"
|
||||
"Crypto.com Arena" -> "cryptocom_arena" (punctuation is dropped, not replaced)
|
||||
"""
|
||||
normalized = normalize_stadium_name(name)
|
||||
# Replace spaces with underscores
|
||||
slug = normalized.replace(' ', '_')
|
||||
# Truncate to 50 chars
|
||||
return slug[:50]
|
||||
|
||||
|
||||
def generate_canonical_stadium_id(sport: str, name: str) -> str:
|
||||
"""
|
||||
Generate deterministic canonical ID for stadium.
|
||||
|
||||
Format: stadium_{sport}_{slug}
|
||||
Example: stadium_nba_state_farm_arena
|
||||
"""
|
||||
slug = generate_stadium_slug(name)
|
||||
return f"stadium_{sport.lower()}_{slug}"
|
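The three slug helpers above can be exercised together. The sketch below restates them with their indentation restored and runs them on a few names. Punctuation is removed rather than turned into a separator, so a name like "Crypto.com Arena" produces `cryptocom_arena`.

```python
import re


def normalize_stadium_name(name: str) -> str:
    """Lowercase, drop parentheticals and punctuation, collapse whitespace."""
    normalized = name.lower()
    normalized = re.sub(r"\s*\([^)]*\)", "", normalized)   # remove "(IV)" etc.
    normalized = re.sub(r"[^a-z0-9\s]", "", normalized)    # keep letters, digits, spaces
    normalized = re.sub(r"\s+", " ", normalized).strip()   # collapse runs of whitespace
    return normalized


def generate_stadium_slug(name: str) -> str:
    """Underscore-separated slug, truncated to 50 characters."""
    return normalize_stadium_name(name).replace(" ", "_")[:50]


def generate_canonical_stadium_id(sport: str, name: str) -> str:
    """Format: stadium_{sport}_{slug}."""
    return f"stadium_{sport.lower()}_{generate_stadium_slug(name)}"


examples = {
    name: generate_stadium_slug(name)
    for name in ["State Farm Arena", "TD Garden", "Crypto.com Arena", "Lower.com Field"]
}
nba_id = generate_canonical_stadium_id("NBA", "State Farm Arena")
```

Because IDs are derived purely from the display name, any key in a hand-maintained table such as `HISTORICAL_STADIUM_ALIASES` has to match the generated slug exactly, or the historical aliases for that stadium are silently skipped.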
||||
|
||||
|
||||
# =============================================================================
|
||||
# CANONICALIZATION
|
||||
# =============================================================================
|
||||
|
||||
def canonicalize_stadiums(
|
||||
raw_stadiums: list[dict],
|
||||
verbose: bool = False
|
||||
) -> tuple[list[CanonicalStadium], list[StadiumAlias]]:
|
||||
"""
|
||||
Stage 1: Canonicalize stadiums.
|
||||
|
||||
1. Normalize names and cities
|
||||
2. Deduplicate by (sport, normalized_name, city)
|
||||
3. Generate canonical IDs
|
||||
4. Create name aliases
|
||||
|
||||
Args:
|
||||
raw_stadiums: List of raw stadium dicts from scraper
|
||||
verbose: Print detailed progress
|
||||
|
||||
Returns:
|
||||
(canonical_stadiums, aliases)
|
||||
"""
|
||||
canonical_stadiums = []
|
||||
aliases = []
|
||||
seen_keys = {} # (sport, normalized_name, city) -> canonical_id
|
||||
|
||||
for raw in raw_stadiums:
|
||||
sport = raw.get('sport', '').upper()
|
||||
name = raw.get('name', '')
|
||||
city = raw.get('city', '')
|
||||
|
||||
if not sport or not name:
|
||||
if verbose:
|
||||
print(f" Skipping invalid stadium: {raw}")
|
||||
continue
|
||||
|
||||
# Generate canonical ID
|
||||
canonical_id = generate_canonical_stadium_id(sport, name)
|
||||
|
||||
# Deduplication key (same stadium in same city for same sport)
|
||||
normalized_name = normalize_stadium_name(name)
|
||||
dedup_key = (sport, normalized_name, city.lower())
|
||||
|
||||
if dedup_key in seen_keys:
|
||||
existing_canonical_id = seen_keys[dedup_key]
|
||||
# Add as alias if the display name differs
|
||||
alias_name = name.lower().strip()
|
||||
if alias_name != normalized_name:
|
||||
aliases.append(StadiumAlias(
|
||||
alias_name=alias_name,
|
||||
stadium_canonical_id=existing_canonical_id
|
||||
))
|
||||
if verbose:
|
||||
print(f" Duplicate: {name} -> {existing_canonical_id}")
|
||||
continue
|
||||
|
||||
seen_keys[dedup_key] = canonical_id
|
||||
|
||||
# Create canonical stadium
|
||||
canonical = CanonicalStadium(
|
||||
canonical_id=canonical_id,
|
||||
name=name,
|
||||
city=city,
|
||||
state=raw.get('state', ''),
|
||||
latitude=raw.get('latitude', 0.0),
|
||||
longitude=raw.get('longitude', 0.0),
|
||||
capacity=raw.get('capacity', 0),
|
||||
sport=sport,
|
||||
primary_team_abbrevs=raw.get('team_abbrevs', []),
|
||||
year_opened=raw.get('year_opened')
|
||||
)
|
||||
canonical_stadiums.append(canonical)
|
||||
|
||||
# Add primary name as alias (lowercased and stripped)
|
||||
aliases.append(StadiumAlias(
|
||||
alias_name=name.lower().strip(),
|
||||
stadium_canonical_id=canonical_id
|
||||
))
|
||||
|
||||
# Also add normalized version if different
|
||||
if normalized_name != name.lower().strip():
|
||||
aliases.append(StadiumAlias(
|
||||
alias_name=normalized_name,
|
||||
stadium_canonical_id=canonical_id
|
||||
))
|
||||
|
||||
if verbose:
|
||||
print(f" {canonical_id}: {name} ({city})")
|
||||
|
||||
return canonical_stadiums, aliases
|
||||
|
||||
|
||||
def add_historical_aliases(
|
||||
aliases: list[StadiumAlias],
|
||||
canonical_ids: set[str]
|
||||
) -> list[StadiumAlias]:
|
||||
"""
|
||||
Add historical stadium name aliases.
|
||||
|
||||
Only adds aliases for stadiums that exist in canonical_ids.
|
||||
"""
|
||||
for canonical_id, historical in HISTORICAL_STADIUM_ALIASES.items():
|
||||
if canonical_id not in canonical_ids:
|
||||
continue
|
||||
|
||||
for hist in historical:
|
||||
aliases.append(StadiumAlias(
|
||||
alias_name=hist['alias_name'],
|
||||
stadium_canonical_id=canonical_id,
|
||||
valid_from=hist.get('valid_from'),
|
||||
valid_until=hist.get('valid_until')
|
||||
))
|
||||
|
||||
return aliases
|
||||
|
||||
|
||||
def deduplicate_aliases(aliases: list[StadiumAlias]) -> list[StadiumAlias]:
|
||||
"""Remove duplicate aliases (same alias_name -> same canonical_id)."""
|
||||
seen = set()
|
||||
deduped = []
|
||||
|
||||
for alias in aliases:
|
||||
key = (alias.alias_name, alias.stadium_canonical_id)
|
||||
if key not in seen:
|
||||
seen.add(key)
|
||||
deduped.append(alias)
|
||||
|
||||
return deduped
|
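The `valid_from` / `valid_until` fields on `StadiumAlias` suggest that a later stage can pick the alias that was in effect on a given game date. The following is a minimal sketch of such a lookup. `alias_valid_on` and `resolve_alias` are hypothetical helpers that do not appear in this script, and treating both bounds as inclusive is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class StadiumAlias:
    alias_name: str  # lowercase
    stadium_canonical_id: str
    valid_from: Optional[str] = None
    valid_until: Optional[str] = None


def alias_valid_on(alias: StadiumAlias, game_date: str) -> bool:
    """Hypothetical helper: True if the alias covers game_date ('YYYY-MM-DD')."""
    day = date.fromisoformat(game_date)
    # Assumption: a missing bound means "open-ended"; bounds are inclusive.
    if alias.valid_from and day < date.fromisoformat(alias.valid_from):
        return False
    if alias.valid_until and day > date.fromisoformat(alias.valid_until):
        return False
    return True


def resolve_alias(aliases, name: str, game_date: str) -> Optional[str]:
    """Return the canonical ID of the first alias matching name on game_date."""
    wanted = name.lower().strip()
    for alias in aliases:
        if alias.alias_name == wanted and alias_valid_on(alias, game_date):
            return alias.stadium_canonical_id
    return None


aliases = [
    StadiumAlias("staples center", "stadium_nba_crypto_com_arena",
                 "1999-10-01", "2021-12-24"),
]
in_range = resolve_alias(aliases, "Staples Center", "2015-01-01")
out_of_range = resolve_alias(aliases, "Staples Center", "2023-01-01")
```

Note that `deduplicate_aliases` keys only on `(alias_name, stadium_canonical_id)`, so two entries with the same name and ID but different date ranges would collapse to the first one seen.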
||||
|
||||
|
||||
# =============================================================================
|
||||
# MAIN
|
||||
# =============================================================================
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description='Canonicalize stadium data'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--input', type=str, default='./data/stadiums.json',
|
||||
help='Input raw stadiums JSON file'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--output', type=str, default='./data',
|
||||
help='Output directory for canonical files'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--verbose', '-v', action='store_true',
|
||||
help='Verbose output'
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
input_path = Path(args.input)
|
||||
output_dir = Path(args.output)
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Load raw stadiums
|
||||
print(f"Loading raw stadiums from {input_path}...")
|
||||
with open(input_path) as f:
|
||||
raw_stadiums = json.load(f)
|
||||
print(f" Loaded {len(raw_stadiums)} raw stadiums")
|
||||
|
||||
# Canonicalize
|
||||
print("\nCanonicalizing stadiums...")
|
||||
canonical_stadiums, aliases = canonicalize_stadiums(
|
||||
raw_stadiums, verbose=args.verbose
|
||||
)
|
||||
print(f" Created {len(canonical_stadiums)} canonical stadiums")
|
||||
|
||||
# Add historical aliases
|
||||
canonical_ids = {s.canonical_id for s in canonical_stadiums}
|
||||
aliases = add_historical_aliases(aliases, canonical_ids)
|
||||
|
||||
# Deduplicate aliases
|
||||
aliases = deduplicate_aliases(aliases)
|
||||
print(f" Created {len(aliases)} stadium aliases")
|
||||
|
||||
# Export
|
||||
stadiums_path = output_dir / 'stadiums_canonical.json'
|
||||
aliases_path = output_dir / 'stadium_aliases.json'
|
||||
|
||||
with open(stadiums_path, 'w') as f:
|
||||
json.dump([asdict(s) for s in canonical_stadiums], f, indent=2)
|
||||
print(f"\nExported stadiums to {stadiums_path}")
|
||||
|
||||
with open(aliases_path, 'w') as f:
|
||||
json.dump([asdict(a) for a in aliases], f, indent=2)
|
||||
print(f"Exported aliases to {aliases_path}")
|
||||
|
||||
# Summary by sport
|
||||
print("\nSummary by sport:")
|
||||
by_sport = {}
|
||||
for s in canonical_stadiums:
|
||||
by_sport[s.sport] = by_sport.get(s.sport, 0) + 1
|
||||
for sport, count in sorted(by_sport.items()):
|
||||
print(f" {sport}: {count} stadiums")
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
@@ -1,610 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Team Canonicalization for SportsTime
|
||||
====================================
|
||||
Stage 2 of the canonicalization pipeline.
|
||||
|
||||
Generates canonical team IDs and fuzzy matches teams to stadiums.
|
||||
|
||||
Usage:
|
||||
python canonicalize_teams.py --stadiums data/stadiums_canonical.json --output data/
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
from dataclasses import dataclass, asdict, field
|
||||
from difflib import SequenceMatcher
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
# Import team mappings from scraper
|
||||
from scrape_schedules import NBA_TEAMS, MLB_TEAMS, NHL_TEAMS, NFL_TEAMS
|
||||
from mls import MLS_TEAMS
|
||||
from wnba import WNBA_TEAMS
|
||||
from nwsl import NWSL_TEAMS
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# DATA CLASSES
|
||||
# =============================================================================
|
||||
|
||||
@dataclass
|
||||
class CanonicalTeam:
|
||||
"""A canonicalized team with stable ID."""
|
||||
canonical_id: str
|
||||
name: str
|
||||
abbreviation: str
|
||||
sport: str
|
||||
city: str
|
||||
stadium_canonical_id: str
|
||||
conference_id: Optional[str] = None
|
||||
division_id: Optional[str] = None
|
||||
primary_color: Optional[str] = None
|
||||
secondary_color: Optional[str] = None
|
||||
|
||||
|
||||
@dataclass
|
||||
class MatchWarning:
|
||||
"""Warning about a low-confidence match."""
|
||||
team_canonical_id: str
|
||||
team_name: str
|
||||
arena_name: str
|
||||
matched_stadium: Optional[str]
|
||||
issue: str
|
||||
confidence: float
|
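`MatchWarning` records a `confidence` value for team-to-stadium matches, and the module imports `SequenceMatcher` for fuzzy matching. A minimal sketch of how such a score could be computed is shown below. The `best_stadium_match` helper and the `0.8` threshold are assumptions for illustration, not values taken from this script.

```python
from difflib import SequenceMatcher


def best_stadium_match(arena_name, stadium_names, threshold=0.8):
    """Return (best_name, confidence, is_low_confidence) for arena_name.

    Hypothetical helper: compares lowercased names with SequenceMatcher.ratio()
    and flags the result when the best score falls below the threshold.
    """
    target = arena_name.lower().strip()
    best_name, best_score = None, 0.0
    for candidate in stadium_names:
        score = SequenceMatcher(None, target, candidate.lower().strip()).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name, best_score, best_score < threshold


stadiums = ["State Farm Arena", "TD Garden"]
exact = best_stadium_match("TD Garden", stadiums)
miss = best_stadium_match("xyz", ["TD Garden"])
```

A low-confidence result would be the natural place to emit a `MatchWarning`, with the score stored in its `confidence` field for manual review.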
||||
|
||||
|
||||
# =============================================================================
|
||||
# LEAGUE STRUCTURE
|
||||
# Maps team abbreviation -> (conference_id, division_id)
|
||||
# =============================================================================
|
||||
|
||||
NBA_DIVISIONS = {
|
||||
# Eastern Conference - Atlantic
|
||||
'BOS': ('nba_eastern', 'nba_atlantic'),
|
||||
'BRK': ('nba_eastern', 'nba_atlantic'),
|
||||
'NYK': ('nba_eastern', 'nba_atlantic'),
|
||||
'PHI': ('nba_eastern', 'nba_atlantic'),
|
||||
'TOR': ('nba_eastern', 'nba_atlantic'),
|
||||
# Eastern Conference - Central
|
||||
'CHI': ('nba_eastern', 'nba_central'),
|
||||
'CLE': ('nba_eastern', 'nba_central'),
|
||||
'DET': ('nba_eastern', 'nba_central'),
|
||||
'IND': ('nba_eastern', 'nba_central'),
|
||||
'MIL': ('nba_eastern', 'nba_central'),
|
||||
# Eastern Conference - Southeast
|
||||
'ATL': ('nba_eastern', 'nba_southeast'),
|
||||
'CHO': ('nba_eastern', 'nba_southeast'),
|
||||
'MIA': ('nba_eastern', 'nba_southeast'),
|
||||
'ORL': ('nba_eastern', 'nba_southeast'),
|
||||
'WAS': ('nba_eastern', 'nba_southeast'),
|
||||
# Western Conference - Northwest
|
||||
'DEN': ('nba_western', 'nba_northwest'),
|
||||
'MIN': ('nba_western', 'nba_northwest'),
|
||||
'OKC': ('nba_western', 'nba_northwest'),
|
||||
'POR': ('nba_western', 'nba_northwest'),
|
||||
'UTA': ('nba_western', 'nba_northwest'),
|
||||
# Western Conference - Pacific
|
||||
'GSW': ('nba_western', 'nba_pacific'),
|
||||
'LAC': ('nba_western', 'nba_pacific'),
|
||||
'LAL': ('nba_western', 'nba_pacific'),
|
||||
'PHO': ('nba_western', 'nba_pacific'),
|
||||
'SAC': ('nba_western', 'nba_pacific'),
|
||||
# Western Conference - Southwest
|
||||
'DAL': ('nba_western', 'nba_southwest'),
|
||||
'HOU': ('nba_western', 'nba_southwest'),
|
||||
'MEM': ('nba_western', 'nba_southwest'),
|
||||
'NOP': ('nba_western', 'nba_southwest'),
|
||||
'SAS': ('nba_western', 'nba_southwest'),
|
||||
}
|
||||
|
||||
MLB_DIVISIONS = {
|
||||
# American League - East
|
||||
'NYY': ('mlb_al', 'mlb_al_east'),
|
||||
'BOS': ('mlb_al', 'mlb_al_east'),
|
||||
'TOR': ('mlb_al', 'mlb_al_east'),
|
||||
'BAL': ('mlb_al', 'mlb_al_east'),
|
||||
'TBR': ('mlb_al', 'mlb_al_east'),
|
||||
# American League - Central
|
||||
'CLE': ('mlb_al', 'mlb_al_central'),
|
||||
'DET': ('mlb_al', 'mlb_al_central'),
|
||||
'MIN': ('mlb_al', 'mlb_al_central'),
|
||||
'CHW': ('mlb_al', 'mlb_al_central'),
|
||||
'KCR': ('mlb_al', 'mlb_al_central'),
|
||||
# American League - West
|
||||
'HOU': ('mlb_al', 'mlb_al_west'),
|
||||
'SEA': ('mlb_al', 'mlb_al_west'),
|
||||
'TEX': ('mlb_al', 'mlb_al_west'),
|
||||
'LAA': ('mlb_al', 'mlb_al_west'),
|
||||
'OAK': ('mlb_al', 'mlb_al_west'),
|
||||
# National League - East
|
||||
'ATL': ('mlb_nl', 'mlb_nl_east'),
|
||||
'PHI': ('mlb_nl', 'mlb_nl_east'),
|
||||
'NYM': ('mlb_nl', 'mlb_nl_east'),
|
||||
'MIA': ('mlb_nl', 'mlb_nl_east'),
|
||||
'WSN': ('mlb_nl', 'mlb_nl_east'),
|
||||
# National League - Central
|
||||
'MIL': ('mlb_nl', 'mlb_nl_central'),
|
||||
'CHC': ('mlb_nl', 'mlb_nl_central'),
|
||||
'STL': ('mlb_nl', 'mlb_nl_central'),
|
||||
'PIT': ('mlb_nl', 'mlb_nl_central'),
|
||||
'CIN': ('mlb_nl', 'mlb_nl_central'),
|
||||
# National League - West
|
||||
'LAD': ('mlb_nl', 'mlb_nl_west'),
|
||||
'ARI': ('mlb_nl', 'mlb_nl_west'),
|
||||
'SDP': ('mlb_nl', 'mlb_nl_west'),
|
||||
'SFG': ('mlb_nl', 'mlb_nl_west'),
|
||||
'COL': ('mlb_nl', 'mlb_nl_west'),
|
||||
}
|
||||
|
||||
NHL_DIVISIONS = {
|
||||
# Eastern Conference - Atlantic
|
||||
'BOS': ('nhl_eastern', 'nhl_atlantic'),
|
||||
'BUF': ('nhl_eastern', 'nhl_atlantic'),
|
||||
'DET': ('nhl_eastern', 'nhl_atlantic'),
|
||||
'FLA': ('nhl_eastern', 'nhl_atlantic'),
|
||||
'MTL': ('nhl_eastern', 'nhl_atlantic'),
|
||||
'OTT': ('nhl_eastern', 'nhl_atlantic'),
|
||||
'TBL': ('nhl_eastern', 'nhl_atlantic'),
|
||||
'TOR': ('nhl_eastern', 'nhl_atlantic'),
|
||||
# Eastern Conference - Metropolitan
|
||||
'CAR': ('nhl_eastern', 'nhl_metropolitan'),
|
||||
'CBJ': ('nhl_eastern', 'nhl_metropolitan'),
|
||||
'NJD': ('nhl_eastern', 'nhl_metropolitan'),
|
||||
'NYI': ('nhl_eastern', 'nhl_metropolitan'),
|
||||
'NYR': ('nhl_eastern', 'nhl_metropolitan'),
|
||||
'PHI': ('nhl_eastern', 'nhl_metropolitan'),
|
||||
'PIT': ('nhl_eastern', 'nhl_metropolitan'),
|
||||
'WSH': ('nhl_eastern', 'nhl_metropolitan'),
|
||||
# Western Conference - Central
|
||||
'ARI': ('nhl_western', 'nhl_central'), # Utah Hockey Club
|
||||
'CHI': ('nhl_western', 'nhl_central'),
|
||||
'COL': ('nhl_western', 'nhl_central'),
|
||||
'DAL': ('nhl_western', 'nhl_central'),
|
||||
'MIN': ('nhl_western', 'nhl_central'),
|
||||
'NSH': ('nhl_western', 'nhl_central'),
|
||||
'STL': ('nhl_western', 'nhl_central'),
|
||||
'WPG': ('nhl_western', 'nhl_central'),
|
||||
# Western Conference - Pacific
|
||||
'ANA': ('nhl_western', 'nhl_pacific'),
|
||||
'CGY': ('nhl_western', 'nhl_pacific'),
|
||||
'EDM': ('nhl_western', 'nhl_pacific'),
|
||||
'LAK': ('nhl_western', 'nhl_pacific'),
|
||||
'SEA': ('nhl_western', 'nhl_pacific'),
|
||||
'SJS': ('nhl_western', 'nhl_pacific'),
|
||||
'VAN': ('nhl_western', 'nhl_pacific'),
|
||||
'VGK': ('nhl_western', 'nhl_pacific'),
|
||||
}
|
||||
|
||||
NFL_DIVISIONS = {
|
||||
# AFC East
|
||||
'BUF': ('nfl_afc', 'nfl_afc_east'),
|
||||
'MIA': ('nfl_afc', 'nfl_afc_east'),
|
||||
'NE': ('nfl_afc', 'nfl_afc_east'),
|
||||
'NYJ': ('nfl_afc', 'nfl_afc_east'),
|
||||
# AFC North
|
||||
'BAL': ('nfl_afc', 'nfl_afc_north'),
|
||||
'CIN': ('nfl_afc', 'nfl_afc_north'),
|
||||
'CLE': ('nfl_afc', 'nfl_afc_north'),
|
||||
'PIT': ('nfl_afc', 'nfl_afc_north'),
|
||||
# AFC South
|
||||
'HOU': ('nfl_afc', 'nfl_afc_south'),
|
||||
'IND': ('nfl_afc', 'nfl_afc_south'),
|
||||
'JAX': ('nfl_afc', 'nfl_afc_south'),
|
||||
'TEN': ('nfl_afc', 'nfl_afc_south'),
|
||||
# AFC West
|
||||
'DEN': ('nfl_afc', 'nfl_afc_west'),
|
||||
'KC': ('nfl_afc', 'nfl_afc_west'),
|
||||
'LV': ('nfl_afc', 'nfl_afc_west'),
|
||||
'LAC': ('nfl_afc', 'nfl_afc_west'),
|
||||
# NFC East
|
||||
'DAL': ('nfl_nfc', 'nfl_nfc_east'),
|
||||
'NYG': ('nfl_nfc', 'nfl_nfc_east'),
|
||||
'PHI': ('nfl_nfc', 'nfl_nfc_east'),
|
||||
'WAS': ('nfl_nfc', 'nfl_nfc_east'),
|
||||
# NFC North
|
||||
'CHI': ('nfl_nfc', 'nfl_nfc_north'),
|
||||
'DET': ('nfl_nfc', 'nfl_nfc_north'),
|
||||
'GB': ('nfl_nfc', 'nfl_nfc_north'),
|
||||
'MIN': ('nfl_nfc', 'nfl_nfc_north'),
|
||||
# NFC South
|
||||
'ATL': ('nfl_nfc', 'nfl_nfc_south'),
|
||||
'CAR': ('nfl_nfc', 'nfl_nfc_south'),
|
||||
'NO': ('nfl_nfc', 'nfl_nfc_south'),
|
||||
'TB': ('nfl_nfc', 'nfl_nfc_south'),
|
||||
# NFC West
|
||||
'ARI': ('nfl_nfc', 'nfl_nfc_west'),
|
||||
'LAR': ('nfl_nfc', 'nfl_nfc_west'),
|
||||
'SF': ('nfl_nfc', 'nfl_nfc_west'),
|
||||
'SEA': ('nfl_nfc', 'nfl_nfc_west'),
|
||||
}
|
||||
|
||||
MLS_DIVISIONS = {
|
||||
# Eastern Conference (MLS uses conferences, not divisions)
|
||||
'ATL': ('mls_eastern', None),
|
||||
'CHI': ('mls_eastern', None),
|
||||
'CIN': ('mls_eastern', None),
|
||||
'CLB': ('mls_eastern', None),
|
||||
'CLT': ('mls_eastern', None),
|
||||
'DC': ('mls_eastern', None),
|
||||
'MIA': ('mls_eastern', None),
|
||||
'MTL': ('mls_eastern', None),
|
||||
'NE': ('mls_eastern', None),
|
||||
'NYCFC': ('mls_eastern', None),
|
||||
'NYRB': ('mls_eastern', None),
|
||||
'ORL': ('mls_eastern', None),
|
||||
'PHI': ('mls_eastern', None),
|
||||
'TOR': ('mls_eastern', None),
|
||||
# Western Conference
|
||||
'AUS': ('mls_western', None),
|
||||
'COL': ('mls_western', None),
|
||||
'DAL': ('mls_western', None),
|
||||
'HOU': ('mls_western', None),
|
||||
'LAFC': ('mls_western', None),
|
||||
'LAG': ('mls_western', None),
|
||||
'MIN': ('mls_western', None),
|
||||
'NSH': ('mls_western', None),
|
||||
'POR': ('mls_western', None),
|
||||
'RSL': ('mls_western', None),
|
||||
'SD': ('mls_western', None),
|
||||
'SEA': ('mls_western', None),
|
||||
'SJ': ('mls_western', None),
|
||||
'SKC': ('mls_western', None),
|
||||
'STL': ('mls_western', None),
|
||||
'VAN': ('mls_western', None),
|
||||
}
|
||||
|
||||
WNBA_DIVISIONS = {
|
||||
# WNBA has no divisions (single league structure)
|
||||
'ATL': ('wnba', None),
|
||||
'CHI': ('wnba', None),
|
||||
'CON': ('wnba', None),
|
||||
'DAL': ('wnba', None),
|
||||
'GSV': ('wnba', None),
|
||||
'IND': ('wnba', None),
|
||||
'LVA': ('wnba', None),
|
||||
'LA': ('wnba', None),
|
||||
'MIN': ('wnba', None),
|
||||
'NY': ('wnba', None),
|
||||
'PHO': ('wnba', None),
|
||||
'SEA': ('wnba', None),
|
||||
'WAS': ('wnba', None),
|
||||
}
|
||||
|
||||
NWSL_DIVISIONS = {
|
||||
# NWSL has no divisions (single league structure)
|
||||
'LA': ('nwsl', None), # Angel City FC
|
||||
'SJ': ('nwsl', None), # Bay FC
|
||||
'CHI': ('nwsl', None), # Chicago Red Stars
|
||||
'HOU': ('nwsl', None), # Houston Dash
|
||||
'KC': ('nwsl', None), # Kansas City Current
|
||||
'NJ': ('nwsl', None), # NJ/NY Gotham FC
|
||||
'NC': ('nwsl', None), # North Carolina Courage
|
||||
'ORL': ('nwsl', None), # Orlando Pride
|
||||
'POR': ('nwsl', None), # Portland Thorns FC
|
||||
'SEA': ('nwsl', None), # Seattle Reign FC
|
||||
'SD': ('nwsl', None), # San Diego Wave FC
|
||||
'UTA': ('nwsl', None), # Utah Royals FC
|
||||
'WAS': ('nwsl', None), # Washington Spirit
|
||||
}
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# FUZZY MATCHING
|
||||
# =============================================================================
|
||||
|
||||
def normalize_for_matching(text: str) -> str:
|
||||
"""Normalize text for fuzzy matching."""
|
||||
import re
|
||||
text = text.lower().strip()
|
||||
# Remove common suffixes/prefixes
|
||||
text = re.sub(r'\s*(arena|center|stadium|field|park|centre)\s*', ' ', text)
|
||||
# Remove special characters
|
||||
text = re.sub(r'[^a-z0-9\s]', '', text)
|
||||
# Collapse spaces
|
||||
text = re.sub(r'\s+', ' ', text).strip()
|
||||
return text
|
||||
|
||||
|
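To make the effect of `normalize_for_matching` concrete, the sketch below repeats the function body from above and runs it on a few names. The input "Parker Coliseum" is a made-up name, used only to show that the venue-word pattern has no word boundaries and therefore also strips "park" from inside a longer word.

```python
import re


def normalize_for_matching(text: str) -> str:
    """Same steps as the function above: lowercase, strip venue words, clean up."""
    text = text.lower().strip()
    # Venue words are removed wherever they occur, even inside a longer word
    text = re.sub(r'\s*(arena|center|stadium|field|park|centre)\s*', ' ', text)
    # Drop punctuation, then collapse runs of whitespace
    text = re.sub(r'[^a-z0-9\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text


print(normalize_for_matching("T-Mobile Park"))                # tmobile
print(normalize_for_matching("Oriole Park at Camden Yards"))  # oriole at camden yards
print(normalize_for_matching("Parker Coliseum"))              # er coliseum (hypothetical name)
```

Because `fuzzy_match_stadium` also scores the unnormalized names and keeps the better of the two ratios, an over-eager strip like the last case lowers only one of the two name scores.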
||||
def fuzzy_match_stadium(
|
||||
team_arena_name: str,
|
||||
team_city: str,
|
||||
sport: str,
|
||||
stadiums: list[dict],
|
||||
confidence_threshold: float = 0.6
|
||||
) -> tuple[Optional[str], float]:
|
||||
"""
|
||||
Fuzzy match team's arena to a canonical stadium.
|
||||
|
||||
Matching strategy:
|
||||
- 70% weight: Name similarity (SequenceMatcher)
|
||||
- 30% weight: City match (exact=1.0, nearby metro city=0.7, partial=0.5)
|
||||
|
||||
Args:
|
||||
team_arena_name: The arena name from team mapping
|
||||
team_city: The team's city
|
||||
sport: Sport code (e.g., NBA, MLB, NHL)
|
||||
stadiums: List of canonical stadium dicts
|
||||
confidence_threshold: Minimum confidence for a match
|
||||
|
||||
Returns:
|
||||
(canonical_stadium_id, confidence_score)
|
||||
"""
|
||||
best_match = None
|
||||
best_score = 0.0
|
||||
|
||||
# Normalize arena name
|
||||
arena_normalized = normalize_for_matching(team_arena_name)
|
||||
city_lower = team_city.lower()
|
||||
|
||||
# Filter to same sport
|
||||
sport_stadiums = [s for s in stadiums if s['sport'] == sport]
|
||||
|
||||
for stadium in sport_stadiums:
|
||||
stadium_name_normalized = normalize_for_matching(stadium['name'])
|
||||
|
||||
# Score 1: Name similarity
|
||||
name_score = SequenceMatcher(
|
||||
None,
|
||||
arena_normalized,
|
||||
stadium_name_normalized
|
||||
).ratio()
|
||||
|
||||
# Also check full names (unnormalized)
|
||||
full_name_score = SequenceMatcher(
|
||||
None,
|
||||
team_arena_name.lower(),
|
||||
stadium['name'].lower()
|
||||
).ratio()
|
||||
|
||||
# Take the better score
|
||||
name_score = max(name_score, full_name_score)
|
||||
|
||||
# Score 2: City match
|
||||
city_score = 0.0
|
||||
stadium_city_lower = stadium['city'].lower()
|
||||
|
||||
if city_lower == stadium_city_lower:
|
||||
city_score = 1.0
|
||||
elif city_lower in stadium_city_lower or stadium_city_lower in city_lower:
|
||||
city_score = 0.5
|
||||
# Check for nearby cities (e.g., "San Francisco" team but "Oakland" arena)
|
||||
nearby_cities = {
|
||||
'san francisco': ['oakland', 'san jose'],
|
||||
'new york': ['brooklyn', 'queens', 'elmont', 'newark'],
|
||||
'los angeles': ['inglewood', 'anaheim'],
|
||||
'miami': ['sunrise', 'fort lauderdale'],
|
||||
'dallas': ['arlington', 'fort worth'],
|
||||
'washington': ['landover', 'capitol heights'],
|
||||
'minneapolis': ['st paul', 'st. paul'],
|
||||
'detroit': ['auburn hills', 'pontiac'],
|
||||
}
|
||||
for main_city, nearby in nearby_cities.items():
|
||||
if city_lower == main_city and stadium_city_lower in nearby:
|
||||
city_score = 0.7
|
||||
elif stadium_city_lower == main_city and city_lower in nearby:
|
||||
city_score = 0.7
|
||||
|
||||
# Combined score (weighted)
|
||||
combined = (name_score * 0.7) + (city_score * 0.3)
|
||||
|
||||
if combined > best_score:
|
||||
best_score = combined
|
||||
best_match = stadium['canonical_id']
|
||||
|
||||
if best_score >= confidence_threshold:
|
||||
return best_match, best_score
|
||||
|
||||
return None, best_score
|
||||
|
||||
|
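The 70/30 weighting in `fuzzy_match_stadium` interacts with the default threshold of 0.6 in a way that is easy to miss. The sketch below is a simplified illustration, not the function itself: `combined_score` is a hypothetical helper that scores only the lowercased names and takes the city score as an input.

```python
from difflib import SequenceMatcher

NAME_WEIGHT = 0.7   # weights taken from the function above
CITY_WEIGHT = 0.3
THRESHOLD = 0.6     # default confidence_threshold


def combined_score(arena: str, candidate: str, city_score: float) -> float:
    """Hypothetical helper: weighted sum of name similarity and a given city score."""
    name_score = SequenceMatcher(None, arena.lower(), candidate.lower()).ratio()
    return name_score * NAME_WEIGHT + city_score * CITY_WEIGHT


# Identical names, unrelated cities: the name alone clears the threshold
name_only = combined_score("Chase Center", "Chase Center", city_score=0.0)

# Completely different names, identical cities: the city alone cannot
city_only = combined_score("abc", "xyz", city_score=1.0)

print(name_only >= THRESHOLD)  # True
print(city_only >= THRESHOLD)  # False
```

Under these weights a perfect city match contributes at most 0.3, so a match always needs a name similarity of at least about 0.43 to reach 0.6.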
||||
# =============================================================================
|
||||
# CANONICALIZATION
|
||||
# =============================================================================
|
||||
|
||||
def generate_canonical_team_id(sport: str, abbrev: str) -> str:
|
||||
"""
|
||||
Generate deterministic canonical ID for team.
|
||||
|
||||
Format: team_{sport}_{abbrev}
|
||||
Example: team_nba_atl
|
||||
"""
|
||||
return f"team_{sport.lower()}_{abbrev.lower()}"
|
||||
|
||||
|
||||
def canonicalize_teams(
|
||||
team_mappings: dict[str, dict],
|
||||
sport: str,
|
||||
canonical_stadiums: list[dict],
|
||||
verbose: bool = False
|
||||
) -> tuple[list[CanonicalTeam], list[MatchWarning]]:
|
||||
"""
|
||||
Stage 2: Canonicalize teams.
|
||||
|
||||
1. Generate canonical IDs from abbreviations
|
||||
2. Fuzzy match to stadiums
|
||||
3. Log low-confidence matches for review
|
||||
|
||||
Args:
|
||||
team_mappings: Team data dict (e.g., NBA_TEAMS)
|
||||
sport: Sport code
|
||||
canonical_stadiums: List of canonical stadium dicts
|
||||
verbose: Print detailed progress
|
||||
|
||||
Returns:
|
||||
(canonical_teams, warnings)
|
||||
"""
|
||||
teams = []
|
||||
warnings = []
|
||||
|
||||
# Determine arena key based on sport
|
||||
arena_key = 'arena' if sport in ['NBA', 'NHL', 'WNBA'] else 'stadium'
|
||||
|
||||
# Get division structure
|
||||
division_map = {
|
||||
'NBA': NBA_DIVISIONS,
|
||||
'MLB': MLB_DIVISIONS,
|
||||
'NHL': NHL_DIVISIONS,
|
||||
'NFL': NFL_DIVISIONS,
|
||||
'MLS': MLS_DIVISIONS,
|
||||
'WNBA': WNBA_DIVISIONS,
|
||||
'NWSL': NWSL_DIVISIONS,
|
||||
}.get(sport, {})
|
||||
|
||||
for abbrev, info in team_mappings.items():
|
||||
canonical_id = generate_canonical_team_id(sport, abbrev)
|
||||
arena_name = info.get(arena_key, '')
|
||||
city = info.get('city', '')
|
||||
team_name = info.get('name', '')
|
||||
|
||||
# Fuzzy match stadium
|
||||
stadium_canonical_id, confidence = fuzzy_match_stadium(
|
||||
arena_name, city, sport, canonical_stadiums
|
||||
)
|
||||
|
||||
if stadium_canonical_id is None:
|
||||
warnings.append(MatchWarning(
|
||||
team_canonical_id=canonical_id,
|
||||
team_name=team_name,
|
||||
arena_name=arena_name,
|
||||
matched_stadium=None,
|
||||
issue='No stadium match found',
|
||||
confidence=confidence
|
||||
))
|
||||
# Create placeholder ID
|
||||
stadium_canonical_id = f"stadium_unknown_{sport.lower()}_{abbrev.lower()}"
|
||||
if verbose:
|
||||
print(f" WARNING: {canonical_id} - no stadium match for '{arena_name}'")
|
||||
|
||||
elif confidence < 0.8:
|
||||
warnings.append(MatchWarning(
|
||||
team_canonical_id=canonical_id,
|
||||
team_name=team_name,
|
||||
arena_name=arena_name,
|
||||
matched_stadium=stadium_canonical_id,
|
||||
issue='Low confidence stadium match',
|
||||
confidence=confidence
|
||||
))
|
||||
if verbose:
|
||||
print(f" WARNING: {canonical_id} - low confidence ({confidence:.2f}) match to {stadium_canonical_id}")
|
||||
|
||||
# Get conference/division
|
||||
conf_id, div_id = division_map.get(abbrev, (None, None))
|
||||
|
||||
team = CanonicalTeam(
|
||||
canonical_id=canonical_id,
|
||||
name=team_name,
|
||||
abbreviation=abbrev,
|
||||
sport=sport,
|
||||
city=city,
|
||||
stadium_canonical_id=stadium_canonical_id,
|
||||
conference_id=conf_id,
|
||||
division_id=div_id
|
||||
)
|
||||
teams.append(team)
|
||||
|
||||
if verbose and confidence >= 0.8:
|
||||
print(f" {canonical_id}: {team_name} -> {stadium_canonical_id} ({confidence:.2f})")
|
||||
|
||||
return teams, warnings
|
||||
|
||||
|
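`canonicalize_teams` effectively sorts each team into one of three outcomes based on the result of the fuzzy match. The sketch below restates that branching as two small helpers; `classify_match` and `placeholder_stadium_id` are hypothetical names introduced for illustration, and the stadium ID in the example calls is made up.

```python
from typing import Optional

REVIEW_THRESHOLD = 0.8  # the hard-coded cut-off used in canonicalize_teams


def classify_match(stadium_id: Optional[str], confidence: float) -> str:
    """Hypothetical helper mirroring the three branches in canonicalize_teams."""
    if stadium_id is None:
        # fuzzy_match_stadium returns None when the score is below its own threshold
        return "unmatched"
    if confidence < REVIEW_THRESHOLD:
        return "low_confidence"
    return "ok"


def placeholder_stadium_id(sport: str, abbrev: str) -> str:
    """Same placeholder format the function assigns to unmatched teams."""
    return f"stadium_unknown_{sport.lower()}_{abbrev.lower()}"


print(classify_match(None, 0.42))                        # unmatched
print(classify_match("stadium_nba_chase_center", 0.72))  # low_confidence
print(classify_match("stadium_nba_chase_center", 0.95))  # ok
print(placeholder_stadium_id("NBA", "GSW"))              # stadium_unknown_nba_gsw
```

Both the "unmatched" and "low_confidence" outcomes produce a `MatchWarning`; only the first replaces the stadium ID with a placeholder.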
||||
def canonicalize_all_teams(
|
||||
canonical_stadiums: list[dict],
|
||||
verbose: bool = False
|
||||
) -> tuple[list[CanonicalTeam], list[MatchWarning]]:
|
||||
"""Canonicalize teams for all sports."""
|
||||
all_teams = []
|
||||
all_warnings = []
|
||||
|
||||
sport_mappings = [
|
||||
('NBA', NBA_TEAMS),
|
||||
('MLB', MLB_TEAMS),
|
||||
('NHL', NHL_TEAMS),
|
||||
('NFL', NFL_TEAMS),
|
||||
('MLS', MLS_TEAMS),
|
||||
('WNBA', WNBA_TEAMS),
|
||||
('NWSL', NWSL_TEAMS),
|
||||
]
|
||||
|
||||
for sport, team_map in sport_mappings:
|
||||
if verbose:
|
||||
print(f"\n{sport}:")
|
||||
|
||||
teams, warnings = canonicalize_teams(
|
||||
team_map, sport, canonical_stadiums, verbose
|
||||
)
|
||||
all_teams.extend(teams)
|
||||
all_warnings.extend(warnings)
|
||||
|
||||
return all_teams, all_warnings
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# MAIN
|
||||
# =============================================================================
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description='Canonicalize team data'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--stadiums', type=str, default='./data/stadiums_canonical.json',
|
||||
help='Input canonical stadiums JSON file'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--output', type=str, default='./data',
|
||||
help='Output directory for canonical files'
|
||||
)
|
||||
parser.add_argument(
|
||||
'--verbose', '-v', action='store_true',
|
||||
help='Verbose output'
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
stadiums_path = Path(args.stadiums)
|
||||
output_dir = Path(args.output)
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Load canonical stadiums
|
||||
print(f"Loading canonical stadiums from {stadiums_path}...")
|
||||
with open(stadiums_path) as f:
|
||||
canonical_stadiums = json.load(f)
|
||||
print(f" Loaded {len(canonical_stadiums)} canonical stadiums")
|
||||
|
||||
# Canonicalize teams
|
||||
print("\nCanonicalizing teams...")
|
||||
canonical_teams, warnings = canonicalize_all_teams(
|
||||
canonical_stadiums, verbose=args.verbose
|
||||
)
|
||||
print(f" Created {len(canonical_teams)} canonical teams")
|
||||
|
||||
if warnings:
|
||||
print(f"\n Warnings: {len(warnings)}")
|
||||
for w in warnings:
|
||||
print(f" - {w.team_canonical_id}: {w.issue} (confidence: {w.confidence:.2f})")
|
||||
|
||||
# Export
|
||||
teams_path = output_dir / 'teams_canonical.json'
|
||||
warnings_path = output_dir / 'team_matching_warnings.json'
|
||||
|
||||
with open(teams_path, 'w') as f:
|
||||
json.dump([asdict(t) for t in canonical_teams], f, indent=2)
|
||||
print(f"\nExported teams to {teams_path}")
|
||||
|
||||
if warnings:
|
||||
with open(warnings_path, 'w') as f:
|
||||
json.dump([asdict(w) for w in warnings], f, indent=2)
|
||||
print(f"Exported warnings to {warnings_path}")
|
||||
|
||||
# Summary by sport
|
||||
print("\nSummary by sport:")
|
||||
by_sport = {}
|
||||
for t in canonical_teams:
|
||||
by_sport[t.sport] = by_sport.get(t.sport, 0) + 1
|
||||
for sport, count in sorted(by_sport.items()):
|
||||
print(f" {sport}: {count} teams")
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
File diff suppressed because it is too large
@@ -1,53 +0,0 @@
|
||||
DEFINE SCHEMA
|
||||
|
||||
RECORD TYPE Stadium (
|
||||
"___createTime" TIMESTAMP,
|
||||
"___createdBy" REFERENCE,
|
||||
"___etag" STRING,
|
||||
"___modTime" TIMESTAMP,
|
||||
"___modifiedBy" REFERENCE,
|
||||
"___recordID" REFERENCE QUERYABLE,
|
||||
stadiumId STRING QUERYABLE,
|
||||
name STRING QUERYABLE SEARCHABLE,
|
||||
city STRING QUERYABLE,
|
||||
state STRING,
|
||||
location LOCATION QUERYABLE,
|
||||
capacity INT64,
|
||||
sport STRING QUERYABLE SORTABLE,
|
||||
teamAbbrevs LIST<STRING>,
|
||||
source STRING,
|
||||
yearOpened INT64
|
||||
);
|
||||
|
||||
RECORD TYPE Team (
|
||||
"___createTime" TIMESTAMP,
|
||||
"___createdBy" REFERENCE,
|
||||
"___etag" STRING,
|
||||
"___modTime" TIMESTAMP,
|
||||
"___modifiedBy" REFERENCE,
|
||||
"___recordID" REFERENCE QUERYABLE,
|
||||
teamId STRING QUERYABLE,
|
||||
name STRING QUERYABLE SEARCHABLE,
|
||||
abbreviation STRING QUERYABLE,
|
||||
city STRING QUERYABLE,
|
||||
sport STRING QUERYABLE SORTABLE
|
||||
);
|
||||
|
||||
RECORD TYPE Game (
|
||||
"___createTime" TIMESTAMP,
|
||||
"___createdBy" REFERENCE,
|
||||
"___etag" STRING,
|
||||
"___modTime" TIMESTAMP,
|
||||
"___modifiedBy" REFERENCE,
|
||||
"___recordID" REFERENCE QUERYABLE,
|
||||
gameId STRING QUERYABLE,
|
||||
sport STRING QUERYABLE SORTABLE,
|
||||
season STRING QUERYABLE,
|
||||
dateTime TIMESTAMP QUERYABLE SORTABLE,
|
||||
homeTeamRef REFERENCE QUERYABLE,
|
||||
awayTeamRef REFERENCE QUERYABLE,
|
||||
venueRef REFERENCE,
|
||||
isPlayoff INT64,
|
||||
broadcastInfo STRING,
|
||||
source STRING
|
||||
);
|
||||
-384
@@ -1,384 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Core shared utilities for SportsTime data scrapers.
|
||||
|
||||
This module provides:
|
||||
- Rate limiting utilities
|
||||
- Data classes (Game, Stadium)
|
||||
- Multi-source fallback system
|
||||
- ID generation
|
||||
- Export utilities
|
||||
"""
|
||||
|
||||
import json
|
||||
import time
|
||||
from collections import defaultdict
|
||||
from dataclasses import dataclass, asdict, field
|
||||
from datetime import datetime, timedelta
|
||||
from pathlib import Path
|
||||
from typing import Optional, Callable
|
||||
|
||||
import pandas as pd
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
|
||||
__all__ = [
|
||||
# Constants
|
||||
'REQUEST_DELAY',
|
||||
# Rate limiting
|
||||
'rate_limit',
|
||||
'fetch_page',
|
||||
# Data classes
|
||||
'Game',
|
||||
'Stadium',
|
||||
'ScraperSource',
|
||||
'StadiumScraperSource',
|
||||
# Fallback system
|
||||
'scrape_with_fallback',
|
||||
'scrape_stadiums_with_fallback',
|
||||
# ID generation
|
||||
'assign_stable_ids',
|
||||
# Export utilities
|
||||
'export_to_json',
|
||||
'validate_games',
|
||||
]
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# RATE LIMITING
|
||||
# =============================================================================
|
||||
|
||||
REQUEST_DELAY = 3.0 # seconds between requests to same domain
|
||||
last_request_time: dict[str, float] = {}
|
||||
|
||||
|
||||
def rate_limit(domain: str) -> None:
|
||||
"""Enforce rate limiting per domain."""
|
||||
now = time.time()
|
||||
if domain in last_request_time:
|
||||
elapsed = now - last_request_time[domain]
|
||||
if elapsed < REQUEST_DELAY:
|
||||
time.sleep(REQUEST_DELAY - elapsed)
|
||||
last_request_time[domain] = time.time()
|
||||
|
||||
|
||||
def fetch_page(url: str, domain: str) -> Optional[BeautifulSoup]:
|
||||
"""Fetch and parse a webpage with rate limiting."""
|
||||
rate_limit(domain)
|
||||
headers = {
|
||||
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
|
||||
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
|
||||
'Accept-Language': 'en-US,en;q=0.9',
|
||||
'Accept-Encoding': 'gzip, deflate, br',
|
||||
'Connection': 'keep-alive',
|
||||
'Upgrade-Insecure-Requests': '1',
|
||||
'Sec-Fetch-Dest': 'document',
|
||||
'Sec-Fetch-Mode': 'navigate',
|
||||
'Sec-Fetch-Site': 'none',
|
||||
'Sec-Fetch-User': '?1',
|
||||
'Cache-Control': 'max-age=0',
|
||||
}
|
||||
try:
|
||||
response = requests.get(url, headers=headers, timeout=30)
|
||||
response.raise_for_status()
|
||||
return BeautifulSoup(response.content, 'html.parser')
|
||||
except Exception as e:
|
||||
print(f"Error fetching {url}: {e}")
|
||||
return None
|
||||
|
||||
|
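The per-domain throttle above can be exercised without waiting three seconds per call. The following is a minimal sketch under two assumptions that are not in the original: the delay is passed in as a parameter, and `time.monotonic()` is used instead of `time.time()` so that wall-clock adjustments cannot affect the wait. The domains are placeholders.

```python
import time

_last_request: dict[str, float] = {}


def rate_limit(domain: str, delay: float) -> float:
    """Sleep until `delay` seconds have passed since the last call for `domain`.

    Returns the number of seconds slept (0.0 if no wait was needed).
    """
    now = time.monotonic()
    waited = 0.0
    last = _last_request.get(domain)
    if last is not None:
        remaining = delay - (now - last)
        if remaining > 0:
            time.sleep(remaining)
            waited = remaining
    # Record the time after any sleep, as the original does
    _last_request[domain] = time.monotonic()
    return waited


first = rate_limit("example.com", delay=0.05)   # first call for this domain: no wait
second = rate_limit("example.com", delay=0.05)  # same domain again: waits
other = rate_limit("example.org", delay=0.05)   # different domain: tracked separately
```

Returning the time slept is a convenience for testing; the original returns `None`.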
||||
# =============================================================================
|
||||
# DATA CLASSES
|
||||
# =============================================================================
|
||||
|
||||
@dataclass
|
||||
class Game:
|
||||
"""Represents a single game."""
|
||||
id: str
|
||||
sport: str
|
||||
season: str
|
||||
date: str # YYYY-MM-DD
|
||||
time: Optional[str] # HH:MM (24hr, ET)
|
||||
home_team: str
|
||||
away_team: str
|
||||
home_team_abbrev: str
|
||||
away_team_abbrev: str
|
||||
venue: str
|
||||
source: str
|
||||
is_playoff: bool = False
|
||||
broadcast: Optional[str] = None
|
||||
|
||||
|
||||
@dataclass
|
||||
class Stadium:
|
||||
"""Represents a stadium/arena/ballpark."""
|
||||
id: str
|
||||
name: str
|
||||
city: str
|
||||
state: str
|
||||
latitude: float
|
||||
longitude: float
|
||||
capacity: int
|
||||
sport: str
|
||||
team_abbrevs: list
|
||||
source: str
|
||||
year_opened: Optional[int] = None
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# MULTI-SOURCE FALLBACK SYSTEM
|
||||
# =============================================================================
|
||||
|
||||
@dataclass
|
||||
class ScraperSource:
|
||||
"""Represents a single data source for scraping games."""
|
||||
name: str
|
||||
scraper_func: Callable[[int], list] # Takes season, returns list[Game]
|
||||
priority: int = 1 # Lower = higher priority (1 is best)
|
||||
min_games: int = 10 # Minimum games to consider successful
|
||||
|
||||
|
||||
def scrape_with_fallback(
|
||||
sport: str,
|
||||
season: int,
|
||||
sources: list[ScraperSource],
|
||||
verbose: bool = True
|
||||
) -> list[Game]:
|
||||
"""
|
||||
Try multiple sources in priority order until one succeeds.
|
||||
|
||||
Args:
|
||||
sport: Sport name for logging
|
||||
season: Season year
|
||||
sources: List of ScraperSource configs, sorted by priority
|
||||
verbose: Whether to print status messages
|
||||
|
||||
Returns:
|
||||
List of Game objects from the first successful source
|
||||
"""
|
||||
sources = sorted(sources, key=lambda s: s.priority)
|
||||
|
||||
for i, source in enumerate(sources):
|
||||
try:
|
||||
if verbose:
|
||||
attempt = f"[{i+1}/{len(sources)}]"
|
||||
print(f" {attempt} Trying {source.name}...")
|
||||
|
||||
games = source.scraper_func(season)
|
||||
|
||||
if games and len(games) >= source.min_games:
|
||||
if verbose:
|
||||
print(f" ✓ {source.name} returned {len(games)} games")
|
||||
return games
|
||||
else:
|
||||
if verbose:
|
||||
count = len(games) if games else 0
|
||||
print(f" ✗ {source.name} returned only {count} games (min: {source.min_games})")
|
||||
|
||||
except Exception as e:
|
||||
if verbose:
|
||||
print(f" ✗ {source.name} failed: {e}")
|
||||
continue
|
||||
|
||||
# All sources failed
|
||||
if verbose:
|
||||
print(f" ⚠ All {len(sources)} sources failed for {sport}")
|
||||
return []
|
||||
|
||||
|
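A short usage sketch may help show how priority, exceptions, and `min_games` interact in the fallback loop. It uses a condensed stand-in, `first_successful`, rather than `scrape_with_fallback` itself, and three fake scraper functions; all of these names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ScraperSource:
    name: str
    scraper_func: Callable[[int], list]
    priority: int = 1    # lower = tried first
    min_games: int = 10  # fewer results than this counts as a failure


def first_successful(sources: list, season: int) -> tuple:
    """Condensed stand-in for the fallback loop, without the logging."""
    for source in sorted(sources, key=lambda s: s.priority):
        try:
            games = source.scraper_func(season)
        except Exception:
            continue  # a crashing source is skipped, not fatal
        if games and len(games) >= source.min_games:
            return source.name, games
    return None, []


def broken(season: int) -> list:
    raise RuntimeError("site down")


def sparse(season: int) -> list:
    return ["only_one_game"]


def healthy(season: int) -> list:
    return [f"game_{i}" for i in range(12)]


sources = [
    ScraperSource("healthy", healthy, priority=3),
    ScraperSource("broken", broken, priority=1),
    ScraperSource("sparse", sparse, priority=2),
]
winner, games = first_successful(sources, 2026)
print(winner, len(games))  # healthy 12
```

The list order does not matter because the sources are re-sorted by priority; the lowest-priority source wins here only because the two preferred ones fail or return too little.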
||||
@dataclass
|
||||
class StadiumScraperSource:
|
||||
"""Represents a single data source for stadium scraping."""
|
||||
name: str
|
||||
scraper_func: Callable[[], list] # Returns list[Stadium]
|
||||
priority: int = 1 # Lower = higher priority (1 is best)
|
||||
min_venues: int = 5 # Minimum venues to consider successful
|
||||
|
||||
|
||||
def scrape_stadiums_with_fallback(
|
||||
sport: str,
|
||||
sources: list[StadiumScraperSource],
|
||||
verbose: bool = True
|
||||
) -> list[Stadium]:
|
||||
"""
|
||||
Try multiple stadium sources in priority order until one succeeds.
|
||||
|
||||
Args:
|
||||
sport: Sport name for logging
|
||||
sources: List of StadiumScraperSource configs, sorted by priority
|
||||
verbose: Whether to print status messages
|
||||
|
||||
Returns:
|
||||
List of Stadium objects from the first successful source
|
||||
"""
|
||||
sources = sorted(sources, key=lambda s: s.priority)
|
||||
|
||||
for i, source in enumerate(sources):
|
||||
try:
|
||||
if verbose:
|
||||
attempt = f"[{i+1}/{len(sources)}]"
|
||||
print(f" {attempt} Trying {source.name}...")
|
||||
|
||||
stadiums = source.scraper_func()
|
||||
|
||||
if stadiums and len(stadiums) >= source.min_venues:
|
||||
if verbose:
|
||||
print(f" ✓ {source.name} returned {len(stadiums)} venues")
|
||||
return stadiums
|
||||
else:
|
||||
if verbose:
|
||||
count = len(stadiums) if stadiums else 0
|
||||
print(f" ✗ {source.name} returned only {count} venues (min: {source.min_venues})")
|
||||
|
||||
except Exception as e:
|
||||
if verbose:
|
||||
print(f" ✗ {source.name} failed: {e}")
|
||||
continue
|
||||
|
||||
# All sources failed
|
||||
if verbose:
|
||||
print(f" ⚠ All {len(sources)} sources failed for {sport}")
|
||||
return []
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# ID GENERATION
|
||||
# =============================================================================
|
||||
|
||||
def assign_stable_ids(games: list[Game], sport: str, season: str) -> list[Game]:
|
||||
"""
|
||||
Assign IDs based on matchup + date.
|
||||
Format: {sport}_{season}_{away}_{home}_{MMDD} (or {MMDD}_2 for doubleheaders)
|
||||
|
||||
When games are rescheduled, the old ID becomes orphaned and a new one is created.
|
||||
Use --delete-all before import to clean up orphaned records.
|
||||
"""
|
||||
season_str = season.replace('-', '')
|
||||
|
||||
# Track how many times we've seen each base ID (for doubleheaders)
|
||||
id_counts: dict[str, int] = defaultdict(int)
|
||||
|
||||
for game in games:
|
||||
away = game.away_team_abbrev.lower()
|
||||
home = game.home_team_abbrev.lower()
|
||||
# Extract MMDD from date (YYYY-MM-DD)
|
||||
date_parts = game.date.split('-')
|
||||
mmdd = f"{date_parts[1]}{date_parts[2]}" if len(date_parts) == 3 else "0000"
|
||||
|
||||
base_id = f"{sport.lower()}_{season_str}_{away}_{home}_{mmdd}"
|
||||
id_counts[base_id] += 1
|
||||
|
||||
# Add suffix for doubleheaders (game 2+)
|
||||
if id_counts[base_id] > 1:
|
||||
game.id = f"{base_id}_{id_counts[base_id]}"
|
||||
else:
|
||||
game.id = base_id
|
||||
|
||||
return games
|
||||
|
||||
|
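The ID format and the doubleheader suffix are easiest to see on a small input. The sketch below applies the same rules to plain `(away, home, date)` tuples instead of `Game` objects; `stable_ids` is a hypothetical helper and the matchups are illustrative, not taken from any schedule.

```python
from collections import defaultdict


def stable_ids(matchups: list, sport: str, season: str) -> list:
    """Build IDs as {sport}_{season}_{away}_{home}_{MMDD}, suffixing repeats."""
    season_str = season.replace('-', '')
    counts = defaultdict(int)
    ids = []
    for away, home, date in matchups:
        parts = date.split('-')
        mmdd = f"{parts[1]}{parts[2]}" if len(parts) == 3 else "0000"
        base = f"{sport.lower()}_{season_str}_{away.lower()}_{home.lower()}_{mmdd}"
        counts[base] += 1
        # The first game keeps the bare ID; later games that day get _2, _3, ...
        ids.append(base if counts[base] == 1 else f"{base}_{counts[base]}")
    return ids


ids = stable_ids(
    [
        ("NYY", "BOS", "2025-07-04"),
        ("NYY", "BOS", "2025-07-04"),  # second game of a doubleheader
        ("CHC", "STL", "2025-07-05"),
    ],
    sport="MLB",
    season="2025",
)
print(ids)
# ['mlb_2025_nyy_bos_0704', 'mlb_2025_nyy_bos_0704_2', 'mlb_2025_chc_stl_0705']
```

Since the suffix depends on the order in which games are seen, the two halves of a doubleheader swap IDs if the input order changes between runs.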
||||
# =============================================================================
|
||||
# EXPORT UTILITIES
|
||||
# =============================================================================
|
||||
|
||||
def export_to_json(games: list[Game], stadiums: list[Stadium], output_dir: Path) -> None:
|
||||
"""
|
||||
Export scraped data to organized JSON files.
|
||||
|
||||
Structure:
|
||||
data/
|
||||
games/
|
||||
mlb_2025.json
|
||||
nba_2025.json
|
||||
...
|
||||
canonical/
|
||||
stadiums.json
|
||||
stadiums.json (legacy, for backward compatibility)
|
||||
"""
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Create subdirectories
|
||||
games_dir = output_dir / 'games'
|
||||
canonical_dir = output_dir / 'canonical'
|
||||
games_dir.mkdir(exist_ok=True)
|
||||
canonical_dir.mkdir(exist_ok=True)
|
||||
|
||||
# Group games by sport and season
|
||||
games_by_sport_season: dict[str, list[Game]] = {}
|
||||
for game in games:
|
||||
sport = game.sport.lower()
|
||||
season = game.season
|
||||
key = f"{sport}_{season}"
|
||||
if key not in games_by_sport_season:
|
||||
games_by_sport_season[key] = []
|
||||
games_by_sport_season[key].append(game)
|
||||
|
||||
# Export games by sport/season
|
||||
total_exported = 0
|
||||
for key, sport_games in games_by_sport_season.items():
|
||||
games_data = [asdict(g) for g in sport_games]
|
||||
filepath = games_dir / f"{key}.json"
|
||||
with open(filepath, 'w') as f:
|
||||
json.dump(games_data, f, indent=2)
|
||||
print(f" Exported {len(sport_games):,} games to games/{key}.json")
|
||||
total_exported += len(sport_games)
|
||||
|
||||
# Export combined games.json for backward compatibility
|
||||
all_games_data = [asdict(g) for g in games]
|
||||
with open(output_dir / 'games.json', 'w') as f:
|
||||
json.dump(all_games_data, f, indent=2)
|
||||
|
||||
# Export stadiums to canonical/
|
||||
stadiums_data = [asdict(s) for s in stadiums]
|
||||
with open(canonical_dir / 'stadiums.json', 'w') as f:
|
||||
json.dump(stadiums_data, f, indent=2)
|
||||
|
||||
# Also export to root for backward compatibility
|
||||
with open(output_dir / 'stadiums.json', 'w') as f:
|
||||
json.dump(stadiums_data, f, indent=2)
|
||||
|
||||
# Export as CSV for easy viewing
|
||||
if games:
|
||||
df_games = pd.DataFrame(all_games_data)
|
||||
df_games.to_csv(output_dir / 'games.csv', index=False)
|
||||
|
||||
if stadiums:
|
||||
df_stadiums = pd.DataFrame(stadiums_data)
|
||||
df_stadiums.to_csv(output_dir / 'stadiums.csv', index=False)
|
||||
|
||||
print(f"\nExported {total_exported:,} games across {len(games_by_sport_season)} sport/season files")
|
||||
print(f"Exported {len(stadiums):,} stadiums to canonical/stadiums.json")
|
||||
|
||||
|
||||
def validate_games(games_by_source: dict[str, list[Game]]) -> dict:
|
||||
"""
|
||||
Cross-validate games from multiple sources.
|
||||
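Returns discrepancies. Only games present in the first source but missing from a later source are reported; the date, time, and venue lists are returned empty.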
|
||||
"""
|
||||
discrepancies = {
|
||||
'missing_in_source': [],
|
||||
'date_mismatch': [],
|
||||
'time_mismatch': [],
|
||||
'venue_mismatch': [],
|
||||
}
|
||||
|
||||
sources = list(games_by_source.keys())
|
||||
if len(sources) < 2:
|
||||
return discrepancies
|
||||
|
||||
primary = sources[0]
|
||||
primary_games = {g.id: g for g in games_by_source[primary]}
|
||||
|
||||
for source in sources[1:]:
|
||||
secondary_games = {g.id: g for g in games_by_source[source]}
|
||||
|
||||
for game_id, game in primary_games.items():
|
||||
if game_id not in secondary_games:
|
||||
discrepancies['missing_in_source'].append({
|
||||
'game_id': game_id,
|
||||
'present_in': primary,
|
||||
'missing_in': source
|
||||
})
|
||||
|
||||
return discrepancies
|
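`validate_games` declares `time_mismatch` and `venue_mismatch` lists but never fills them. As a sketch of how a field-level comparison might look, assuming games are matched by ID as above, the code below compares two sources; `GameStub` and `compare_sources` are hypothetical names and the records are invented. A date check is omitted because the ID already encodes the month and day.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GameStub:
    """Hypothetical cut-down record with only the fields being compared."""
    id: str
    time: Optional[str]
    venue: str


def compare_sources(primary: list, secondary: list) -> dict:
    found = {'missing_in_source': [], 'time_mismatch': [], 'venue_mismatch': []}
    other = {g.id: g for g in secondary}
    for game in primary:
        match = other.get(game.id)
        if match is None:
            found['missing_in_source'].append(game.id)
            continue
        if game.time != match.time:
            found['time_mismatch'].append(game.id)
        if game.venue != match.venue:
            found['venue_mismatch'].append(game.id)
    return found


primary = [
    GameStub("g1", "19:00", "Venue A"),
    GameStub("g2", "20:00", "Venue B"),
    GameStub("g3", "13:00", "Venue C"),
]
secondary = [
    GameStub("g1", "19:30", "Venue A"),  # different start time
    GameStub("g2", "20:00", "Venue X"),  # different venue
]
report = compare_sources(primary, secondary)
```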
||||
File diff suppressed because it is too large
@@ -1,13 +0,0 @@
|
||||
{
|
||||
"is_valid": true,
|
||||
"error_count": 0,
|
||||
"warning_count": 0,
|
||||
"summary": {
|
||||
"stadiums": 148,
|
||||
"teams": 92,
|
||||
"games": 0,
|
||||
"aliases": 194,
|
||||
"by_category": {}
|
||||
},
|
||||
"errors": []
|
||||
}
|
||||
@@ -1,42 +0,0 @@
|
||||
[
|
||||
{
|
||||
"game_key": "2026-01-17_TBD_TBD",
|
||||
"issue": "Unknown home team",
|
||||
"details": "Could not resolve home team 'TBD' for sport NFL"
|
||||
},
|
||||
{
|
||||
"game_key": "2026-01-17_TBD_TBD",
|
||||
"issue": "Unknown home team",
|
||||
"details": "Could not resolve home team 'TBD' for sport NFL"
|
||||
},
|
||||
{
|
||||
"game_key": "2026-01-18_TBD_TBD",
|
||||
"issue": "Unknown home team",
|
||||
"details": "Could not resolve home team 'TBD' for sport NFL"
|
||||
},
|
||||
{
|
||||
"game_key": "2026-01-18_TBD_TBD",
|
||||
"issue": "Unknown home team",
|
||||
"details": "Could not resolve home team 'TBD' for sport NFL"
|
||||
},
|
||||
{
|
||||
"game_key": "2026-01-25_TBD_TBD",
|
||||
"issue": "Unknown home team",
|
||||
"details": "Could not resolve home team 'TBD' for sport NFL"
|
||||
},
|
||||
{
|
||||
"game_key": "2026-01-25_TBD_TBD",
|
||||
"issue": "Unknown home team",
|
||||
"details": "Could not resolve home team 'TBD' for sport NFL"
|
||||
},
|
||||
{
|
||||
"game_key": "2026-02-04_NFC_AFC",
|
||||
"issue": "Unknown home team",
|
||||
"details": "Could not resolve home team 'AFC' for sport NFL"
|
||||
},
|
||||
{
|
||||
"game_key": "2026-02-08_TBD_TBD",
|
||||
"issue": "Unknown home team",
|
||||
"details": "Could not resolve home team 'TBD' for sport NFL"
|
||||
}
|
||||
]
|
||||
File diff suppressed because it is too large
-86522
@@ -1,23 +0,0 @@
|
||||
{
|
||||
"generated_at": "2026-01-10T11:03:46.586763",
|
||||
"season": 2026,
|
||||
"sport": "all",
|
||||
"summary": {
|
||||
"games_scraped": 5768,
|
||||
"stadiums_scraped": 178,
|
||||
"games_by_sport": {
|
||||
"NBA": 1230,
|
||||
"MLB": 2430,
|
||||
"NHL": 1312,
|
||||
"NFL": 286,
|
||||
"WNBA": 0,
|
||||
"MLS": 510,
|
||||
"NWSL": 0
|
||||
},
|
||||
"high_severity": 0,
|
||||
"medium_severity": 0,
|
||||
"low_severity": 0
|
||||
},
|
||||
"game_validations": [],
|
||||
"stadium_issues": []
|
||||
}
|
||||
@@ -1,179 +0,0 @@
id,name,city,state,latitude,longitude,capacity,sport,team_abbrevs,source,year_opened
mlb_chase_field,Chase Field,Phoenix,AZ,33.4453,-112.0667,48519,MLB,['ARI'],mlb_hardcoded,1998
mlb_truist_park,Truist Park,Atlanta,GA,33.8907,-84.4677,41084,MLB,['ATL'],mlb_hardcoded,2017
mlb_oriole_park_at_camden_yards,Oriole Park at Camden Yards,Baltimore,MD,39.2839,-76.6216,44970,MLB,['BAL'],mlb_hardcoded,1992
mlb_fenway_park,Fenway Park,Boston,MA,42.3467,-71.0972,37755,MLB,['BOS'],mlb_hardcoded,1912
mlb_wrigley_field,Wrigley Field,Chicago,IL,41.9484,-87.6553,41649,MLB,['CHC'],mlb_hardcoded,1914
mlb_guaranteed_rate_field,Guaranteed Rate Field,Chicago,IL,41.8299,-87.6338,40615,MLB,['CHW'],mlb_hardcoded,1991
mlb_great_american_ball_park,Great American Ball Park,Cincinnati,OH,39.0979,-84.5082,42319,MLB,['CIN'],mlb_hardcoded,2003
mlb_progressive_field,Progressive Field,Cleveland,OH,41.4958,-81.6853,34830,MLB,['CLE'],mlb_hardcoded,1994
mlb_coors_field,Coors Field,Denver,CO,39.7559,-104.9942,50144,MLB,['COL'],mlb_hardcoded,1995
mlb_comerica_park,Comerica Park,Detroit,MI,42.339,-83.0485,41083,MLB,['DET'],mlb_hardcoded,2000
mlb_minute_maid_park,Minute Maid Park,Houston,TX,29.7573,-95.3555,41168,MLB,['HOU'],mlb_hardcoded,2000
mlb_kauffman_stadium,Kauffman Stadium,Kansas City,MO,39.0517,-94.4803,37903,MLB,['KCR'],mlb_hardcoded,1973
mlb_angel_stadium,Angel Stadium,Anaheim,CA,33.8003,-117.8827,45517,MLB,['LAA'],mlb_hardcoded,1966
mlb_dodger_stadium,Dodger Stadium,Los Angeles,CA,34.0739,-118.24,56000,MLB,['LAD'],mlb_hardcoded,1962
mlb_loandepot_park,LoanDepot Park,Miami,FL,25.7781,-80.2196,36742,MLB,['MIA'],mlb_hardcoded,2012
mlb_american_family_field,American Family Field,Milwaukee,WI,43.028,-87.9712,41900,MLB,['MIL'],mlb_hardcoded,2001
mlb_target_field,Target Field,Minneapolis,MN,44.9818,-93.2775,38544,MLB,['MIN'],mlb_hardcoded,2010
mlb_citi_field,Citi Field,Queens,NY,40.7571,-73.8458,41922,MLB,['NYM'],mlb_hardcoded,2009
mlb_yankee_stadium,Yankee Stadium,Bronx,NY,40.8296,-73.9262,46537,MLB,['NYY'],mlb_hardcoded,2009
mlb_sutter_health_park,Sutter Health Park,Sacramento,CA,38.5803,-121.5108,14014,MLB,['OAK'],mlb_hardcoded,2000
mlb_citizens_bank_park,Citizens Bank Park,Philadelphia,PA,39.9061,-75.1665,42901,MLB,['PHI'],mlb_hardcoded,2004
mlb_pnc_park,PNC Park,Pittsburgh,PA,40.4469,-80.0057,38362,MLB,['PIT'],mlb_hardcoded,2001
mlb_petco_park,Petco Park,San Diego,CA,32.7073,-117.1566,40209,MLB,['SDP'],mlb_hardcoded,2004
mlb_oracle_park,Oracle Park,San Francisco,CA,37.7786,-122.3893,41915,MLB,['SFG'],mlb_hardcoded,2000
mlb_t-mobile_park,T-Mobile Park,Seattle,WA,47.5914,-122.3325,47929,MLB,['SEA'],mlb_hardcoded,1999
mlb_busch_stadium,Busch Stadium,St. Louis,MO,38.6226,-90.1928,45538,MLB,['STL'],mlb_hardcoded,2006
mlb_tropicana_field,Tropicana Field,St. Petersburg,FL,27.7682,-82.6534,25000,MLB,['TBR'],mlb_hardcoded,1990
mlb_globe_life_field,Globe Life Field,Arlington,TX,32.7473,-97.0844,40300,MLB,['TEX'],mlb_hardcoded,2020
mlb_rogers_centre,Rogers Centre,Toronto,ON,43.6414,-79.3894,49282,MLB,['TOR'],mlb_hardcoded,1989
mlb_nationals_park,Nationals Park,Washington,DC,38.8729,-77.0074,41339,MLB,['WSN'],mlb_hardcoded,2008
nba_state_farm_arena,State Farm Arena,Atlanta,GA,33.7573,-84.3963,18118,NBA,['ATL'],nba_hardcoded,1999
nba_td_garden,TD Garden,Boston,MA,42.3662,-71.0621,19156,NBA,['BOS'],nba_hardcoded,1995
nba_barclays_center,Barclays Center,Brooklyn,NY,40.6826,-73.9754,17732,NBA,['BRK'],nba_hardcoded,2012
nba_spectrum_center,Spectrum Center,Charlotte,NC,35.2251,-80.8392,19077,NBA,['CHO'],nba_hardcoded,2005
nba_united_center,United Center,Chicago,IL,41.8807,-87.6742,20917,NBA,['CHI'],nba_hardcoded,1994
nba_rocket_mortgage_fieldhouse,Rocket Mortgage FieldHouse,Cleveland,OH,41.4965,-81.6882,19432,NBA,['CLE'],nba_hardcoded,1994
nba_american_airlines_center,American Airlines Center,Dallas,TX,32.7905,-96.8103,19200,NBA,['DAL'],nba_hardcoded,2001
nba_ball_arena,Ball Arena,Denver,CO,39.7487,-105.0077,19520,NBA,['DEN'],nba_hardcoded,1999
nba_little_caesars_arena,Little Caesars Arena,Detroit,MI,42.3411,-83.0553,20332,NBA,['DET'],nba_hardcoded,2017
nba_chase_center,Chase Center,San Francisco,CA,37.768,-122.3879,18064,NBA,['GSW'],nba_hardcoded,2019
nba_toyota_center,Toyota Center,Houston,TX,29.7508,-95.3621,18055,NBA,['HOU'],nba_hardcoded,2003
nba_gainbridge_fieldhouse,Gainbridge Fieldhouse,Indianapolis,IN,39.764,-86.1555,17923,NBA,['IND'],nba_hardcoded,1999
nba_intuit_dome,Intuit Dome,Inglewood,CA,33.9425,-118.3419,18000,NBA,['LAC'],nba_hardcoded,2024
nba_crypto.com_arena,Crypto.com Arena,Los Angeles,CA,34.043,-118.2673,18997,NBA,['LAL'],nba_hardcoded,1999
nba_fedexforum,FedExForum,Memphis,TN,35.1382,-90.0506,17794,NBA,['MEM'],nba_hardcoded,2004
nba_kaseya_center,Kaseya Center,Miami,FL,25.7814,-80.187,19600,NBA,['MIA'],nba_hardcoded,1999
nba_fiserv_forum,Fiserv Forum,Milwaukee,WI,43.0451,-87.9174,17341,NBA,['MIL'],nba_hardcoded,2018
nba_target_center,Target Center,Minneapolis,MN,44.9795,-93.2761,18978,NBA,['MIN'],nba_hardcoded,1990
nba_smoothie_king_center,Smoothie King Center,New Orleans,LA,29.949,-90.0821,16867,NBA,['NOP'],nba_hardcoded,1999
nba_madison_square_garden,Madison Square Garden,New York,NY,40.7505,-73.9934,19812,NBA,['NYK'],nba_hardcoded,1968
nba_paycom_center,Paycom Center,Oklahoma City,OK,35.4634,-97.5151,18203,NBA,['OKC'],nba_hardcoded,2002
nba_kia_center,Kia Center,Orlando,FL,28.5392,-81.3839,18846,NBA,['ORL'],nba_hardcoded,1989
nba_wells_fargo_center,Wells Fargo Center,Philadelphia,PA,39.9012,-75.172,20478,NBA,['PHI'],nba_hardcoded,1996
nba_footprint_center,Footprint Center,Phoenix,AZ,33.4457,-112.0712,17071,NBA,['PHO'],nba_hardcoded,1992
nba_moda_center,Moda Center,Portland,OR,45.5316,-122.6668,19393,NBA,['POR'],nba_hardcoded,1995
nba_golden_1_center,Golden 1 Center,Sacramento,CA,38.5802,-121.4997,17608,NBA,['SAC'],nba_hardcoded,2016
nba_frost_bank_center,Frost Bank Center,San Antonio,TX,29.427,-98.4375,18418,NBA,['SAS'],nba_hardcoded,2002
nba_scotiabank_arena,Scotiabank Arena,Toronto,ON,43.6435,-79.3791,19800,NBA,['TOR'],nba_hardcoded,1999
nba_delta_center,Delta Center,Salt Lake City,UT,40.7683,-111.9011,18306,NBA,['UTA'],nba_hardcoded,1991
nba_capital_one_arena,Capital One Arena,Washington,DC,38.8982,-77.0209,20356,NBA,['WAS'],nba_hardcoded,1997
nhl_td_garden,TD Garden,Boston,MA,42.3662,-71.0621,17850,NHL,['BOS'],nhl_hardcoded,1995
nhl_keybank_center,KeyBank Center,Buffalo,NY,42.875,-78.8764,19070,NHL,['BUF'],nhl_hardcoded,1996
nhl_little_caesars_arena,Little Caesars Arena,Detroit,MI,42.3411,-83.0553,19515,NHL,['DET'],nhl_hardcoded,2017
nhl_amerant_bank_arena,Amerant Bank Arena,Sunrise,FL,26.1584,-80.3256,19250,NHL,['FLA'],nhl_hardcoded,1998
nhl_bell_centre,Bell Centre,Montreal,QC,45.4961,-73.5693,21302,NHL,['MTL'],nhl_hardcoded,1996
nhl_canadian_tire_centre,Canadian Tire Centre,Ottawa,ON,45.2969,-75.9272,18652,NHL,['OTT'],nhl_hardcoded,1996
nhl_amalie_arena,Amalie Arena,Tampa,FL,27.9426,-82.4519,19092,NHL,['TBL'],nhl_hardcoded,1996
nhl_scotiabank_arena,Scotiabank Arena,Toronto,ON,43.6435,-79.3791,18800,NHL,['TOR'],nhl_hardcoded,1999
nhl_pnc_arena,PNC Arena,Raleigh,NC,35.8033,-78.722,18680,NHL,['CAR'],nhl_hardcoded,1999
nhl_nationwide_arena,Nationwide Arena,Columbus,OH,39.9692,-83.0061,18500,NHL,['CBJ'],nhl_hardcoded,2000
nhl_prudential_center,Prudential Center,Newark,NJ,40.7334,-74.1713,16514,NHL,['NJD'],nhl_hardcoded,2007
nhl_ubs_arena,UBS Arena,Elmont,NY,40.717,-73.726,17255,NHL,['NYI'],nhl_hardcoded,2021
nhl_madison_square_garden,Madison Square Garden,New York,NY,40.7505,-73.9934,18006,NHL,['NYR'],nhl_hardcoded,1968
nhl_wells_fargo_center,Wells Fargo Center,Philadelphia,PA,39.9012,-75.172,19500,NHL,['PHI'],nhl_hardcoded,1996
nhl_ppg_paints_arena,PPG Paints Arena,Pittsburgh,PA,40.4395,-79.9892,18387,NHL,['PIT'],nhl_hardcoded,2010
nhl_capital_one_arena,Capital One Arena,Washington,DC,38.8982,-77.0209,18573,NHL,['WSH'],nhl_hardcoded,1997
nhl_united_center,United Center,Chicago,IL,41.8807,-87.6742,19717,NHL,['CHI'],nhl_hardcoded,1994
nhl_ball_arena,Ball Arena,Denver,CO,39.7487,-105.0077,18007,NHL,['COL'],nhl_hardcoded,1999
nhl_american_airlines_center,American Airlines Center,Dallas,TX,32.7905,-96.8103,18532,NHL,['DAL'],nhl_hardcoded,2001
nhl_xcel_energy_center,Xcel Energy Center,Saint Paul,MN,44.9448,-93.101,17954,NHL,['MIN'],nhl_hardcoded,2000
nhl_bridgestone_arena,Bridgestone Arena,Nashville,TN,36.1592,-86.7785,17159,NHL,['NSH'],nhl_hardcoded,1996
nhl_enterprise_center,Enterprise Center,St. Louis,MO,38.6268,-90.2025,18096,NHL,['STL'],nhl_hardcoded,1994
nhl_canada_life_centre,Canada Life Centre,Winnipeg,MB,49.8928,-97.1437,15321,NHL,['WPG'],nhl_hardcoded,2004
nhl_honda_center,Honda Center,Anaheim,CA,33.8078,-117.8765,17174,NHL,['ANA'],nhl_hardcoded,1993
nhl_delta_center,Delta Center,Salt Lake City,UT,40.7683,-111.9011,16210,NHL,['ARI'],nhl_hardcoded,1991
nhl_sap_center,SAP Center,San Jose,CA,37.3327,-121.9012,17562,NHL,['SJS'],nhl_hardcoded,1993
nhl_rogers_arena,Rogers Arena,Vancouver,BC,49.2778,-123.1089,18910,NHL,['VAN'],nhl_hardcoded,1995
nhl_t-mobile_arena,T-Mobile Arena,Las Vegas,NV,36.1028,-115.1784,17500,NHL,['VGK'],nhl_hardcoded,2016
nhl_climate_pledge_arena,Climate Pledge Arena,Seattle,WA,47.622,-122.354,17100,NHL,['SEA'],nhl_hardcoded,2021
nhl_crypto.com_arena,Crypto.com Arena,Los Angeles,CA,34.043,-118.2673,18230,NHL,['LAK'],nhl_hardcoded,1999
nhl_rogers_place,Rogers Place,Edmonton,AB,53.5469,-113.4979,18347,NHL,['EDM'],nhl_hardcoded,2016
nhl_scotiabank_saddledome,Scotiabank Saddledome,Calgary,AB,51.0374,-114.0519,19289,NHL,['CGY'],nhl_hardcoded,1983
nfl_state_farm_stadium,State Farm Stadium,Glendale,AZ,33.5276,-112.2626,63400,NFL,['ARI'],nfl_hardcoded,2006
nfl_mercedes-benz_stadium,Mercedes-Benz Stadium,Atlanta,GA,33.7553,-84.4006,71000,NFL,['ATL'],nfl_hardcoded,2017
nfl_m&t_bank_stadium,M&T Bank Stadium,Baltimore,MD,39.278,-76.6227,71008,NFL,['BAL'],nfl_hardcoded,1998
nfl_highmark_stadium,Highmark Stadium,Orchard Park,NY,42.7738,-78.787,71608,NFL,['BUF'],nfl_hardcoded,1973
nfl_bank_of_america_stadium,Bank of America Stadium,Charlotte,NC,35.2258,-80.8528,75523,NFL,['CAR'],nfl_hardcoded,1996
nfl_soldier_field,Soldier Field,Chicago,IL,41.8623,-87.6167,61500,NFL,['CHI'],nfl_hardcoded,1924
nfl_paycor_stadium,Paycor Stadium,Cincinnati,OH,39.0954,-84.516,65515,NFL,['CIN'],nfl_hardcoded,2000
nfl_cleveland_browns_stadium,Cleveland Browns Stadium,Cleveland,OH,41.5061,-81.6995,67895,NFL,['CLE'],nfl_hardcoded,1999
nfl_at&t_stadium,AT&T Stadium,Arlington,TX,32.748,-97.0928,80000,NFL,['DAL'],nfl_hardcoded,2009
nfl_empower_field_at_mile_high,Empower Field at Mile High,Denver,CO,39.7439,-105.0201,76125,NFL,['DEN'],nfl_hardcoded,2001
nfl_ford_field,Ford Field,Detroit,MI,42.34,-83.0456,65000,NFL,['DET'],nfl_hardcoded,2002
nfl_lambeau_field,Lambeau Field,Green Bay,WI,44.5013,-88.0622,81435,NFL,['GB'],nfl_hardcoded,1957
nfl_nrg_stadium,NRG Stadium,Houston,TX,29.6847,-95.4107,72220,NFL,['HOU'],nfl_hardcoded,2002
nfl_lucas_oil_stadium,Lucas Oil Stadium,Indianapolis,IN,39.7601,-86.1639,67000,NFL,['IND'],nfl_hardcoded,2008
nfl_everbank_stadium,EverBank Stadium,Jacksonville,FL,30.3239,-81.6373,67814,NFL,['JAX'],nfl_hardcoded,1995
nfl_geha_field_at_arrowhead_stadiu,GEHA Field at Arrowhead Stadium,Kansas City,MO,39.0489,-94.4839,76416,NFL,['KC'],nfl_hardcoded,1972
nfl_allegiant_stadium,Allegiant Stadium,Las Vegas,NV,36.0909,-115.1833,65000,NFL,['LV'],nfl_hardcoded,2020
nfl_sofi_stadium,SoFi Stadium,Inglewood,CA,33.9535,-118.3392,70240,NFL,"['LAC', 'LAR']",nfl_hardcoded,2020
nfl_hard_rock_stadium,Hard Rock Stadium,Miami Gardens,FL,25.958,-80.2389,64767,NFL,['MIA'],nfl_hardcoded,1987
nfl_u.s._bank_stadium,U.S. Bank Stadium,Minneapolis,MN,44.9736,-93.2575,66655,NFL,['MIN'],nfl_hardcoded,2016
nfl_gillette_stadium,Gillette Stadium,Foxborough,MA,42.0909,-71.2643,65878,NFL,['NE'],nfl_hardcoded,2002
nfl_caesars_superdome,Caesars Superdome,New Orleans,LA,29.9511,-90.0812,73208,NFL,['NO'],nfl_hardcoded,1975
nfl_metlife_stadium,MetLife Stadium,East Rutherford,NJ,40.8135,-74.0745,82500,NFL,"['NYG', 'NYJ']",nfl_hardcoded,2010
nfl_lincoln_financial_field,Lincoln Financial Field,Philadelphia,PA,39.9008,-75.1675,69596,NFL,['PHI'],nfl_hardcoded,2003
nfl_acrisure_stadium,Acrisure Stadium,Pittsburgh,PA,40.4468,-80.0158,68400,NFL,['PIT'],nfl_hardcoded,2001
nfl_levi's_stadium,Levi's Stadium,Santa Clara,CA,37.4032,-121.9698,68500,NFL,['SF'],nfl_hardcoded,2014
nfl_lumen_field,Lumen Field,Seattle,WA,47.5952,-122.3316,68740,NFL,['SEA'],nfl_hardcoded,2002
nfl_raymond_james_stadium,Raymond James Stadium,Tampa,FL,27.9759,-82.5033,65618,NFL,['TB'],nfl_hardcoded,1998
nfl_nissan_stadium,Nissan Stadium,Nashville,TN,36.1665,-86.7713,69143,NFL,['TEN'],nfl_hardcoded,1999
nfl_northwest_stadium,Northwest Stadium,Landover,MD,38.9076,-76.8645,67617,NFL,['WAS'],nfl_hardcoded,1997
mls_mercedes-benz_stadium,Mercedes-Benz Stadium,Atlanta,GA,33.7555,-84.4,42500,MLS,['ATL'],mls_hardcoded,2017
mls_q2_stadium,Q2 Stadium,Austin,TX,30.3877,-97.7195,20738,MLS,['AUS'],mls_hardcoded,2021
mls_bank_of_america_stadium,Bank of America Stadium,Charlotte,NC,35.2258,-80.8528,38000,MLS,['CLT'],mls_hardcoded,1996
mls_soldier_field,Soldier Field,Chicago,IL,41.8623,-87.6167,24995,MLS,['CHI'],mls_hardcoded,1924
mls_tql_stadium,TQL Stadium,Cincinnati,OH,39.1114,-84.5222,26000,MLS,['CIN'],mls_hardcoded,2021
mls_dicks_sporting_goods_park,Dick's Sporting Goods Park,Commerce City,CO,39.8056,-104.8919,18061,MLS,['COL'],mls_hardcoded,2007
mls_lowercom_field,Lower.com Field,Columbus,OH,39.9685,-83.0171,20371,MLS,['CLB'],mls_hardcoded,2021
mls_toyota_stadium,Toyota Stadium,Frisco,TX,33.1544,-96.8353,20500,MLS,['DAL'],mls_hardcoded,2005
mls_audi_field,Audi Field,Washington,DC,38.8684,-77.0129,20000,MLS,['DC'],mls_hardcoded,2018
mls_shell_energy_stadium,Shell Energy Stadium,Houston,TX,29.7522,-95.3524,22039,MLS,['HOU'],mls_hardcoded,2012
mls_dignity_health_sports_park,Dignity Health Sports Park,Carson,CA,33.864,-118.261,27000,MLS,['LAG'],mls_hardcoded,2003
mls_bmo_stadium,BMO Stadium,Los Angeles,CA,34.0128,-118.2841,22000,MLS,['LAFC'],mls_hardcoded,2018
mls_chase_stadium,Chase Stadium,Fort Lauderdale,FL,26.1933,-80.1607,21550,MLS,['MIA'],mls_hardcoded,2020
mls_allianz_field,Allianz Field,Saint Paul,MN,44.9531,-93.1647,19400,MLS,['MIN'],mls_hardcoded,2019
mls_stade_saputo,Stade Saputo,Montreal,QC,45.5631,-73.5525,19619,MLS,['MTL'],mls_hardcoded,2008
mls_geodis_park,Geodis Park,Nashville,TN,36.1301,-86.766,30000,MLS,['NSH'],mls_hardcoded,2022
mls_gillette_stadium,Gillette Stadium,Foxborough,MA,42.0909,-71.2643,22385,MLS,['NE'],mls_hardcoded,2002
mls_yankee_stadium,Yankee Stadium,Bronx,NY,40.8292,-73.9264,28000,MLS,['NYCFC'],mls_hardcoded,2009
mls_red_bull_arena,Red Bull Arena,Harrison,NJ,40.7367,-74.1503,25000,MLS,['NYRB'],mls_hardcoded,2010
mls_interandco_stadium,Inter&Co Stadium,Orlando,FL,28.5411,-81.3893,25500,MLS,['ORL'],mls_hardcoded,2017
mls_subaru_park,Subaru Park,Chester,PA,39.8322,-75.3789,18500,MLS,['PHI'],mls_hardcoded,2010
mls_providence_park,Providence Park,Portland,OR,45.5214,-122.6917,25218,MLS,['POR'],mls_hardcoded,1926
mls_america_first_field,America First Field,Sandy,UT,40.5829,-111.8934,20213,MLS,['RSL'],mls_hardcoded,2008
mls_paypal_park,PayPal Park,San Jose,CA,37.3514,-121.925,18000,MLS,['SJ'],mls_hardcoded,2015
mls_lumen_field,Lumen Field,Seattle,WA,47.5952,-122.3316,37722,MLS,['SEA'],mls_hardcoded,2002
mls_childrens_mercy_park,Children's Mercy Park,Kansas City,KS,39.1217,-94.8232,18467,MLS,['SKC'],mls_hardcoded,2011
mls_citypark,CityPark,St. Louis,MO,38.6314,-90.2103,22500,MLS,['STL'],mls_hardcoded,2023
mls_bmo_field,BMO Field,Toronto,ON,43.6332,-79.4186,30000,MLS,['TOR'],mls_hardcoded,2007
mls_bc_place,BC Place,Vancouver,BC,49.2767,-123.1119,22120,MLS,['VAN'],mls_hardcoded,1983
mls_snapdragon_stadium,Snapdragon Stadium,San Diego,CA,32.7844,-117.1228,35000,MLS,['SD'],mls_hardcoded,2022
wnba_gateway_center_arena,Gateway Center Arena,College Park,GA,33.6343,-84.4489,3500,WNBA,['ATL'],wnba_hardcoded,2018
wnba_wintrust_arena,Wintrust Arena,Chicago,IL,41.8514,-87.6226,10387,WNBA,['CHI'],wnba_hardcoded,2017
wnba_mohegan_sun_arena,Mohegan Sun Arena,Uncasville,CT,41.4933,-72.0904,10000,WNBA,['CON'],wnba_hardcoded,2001
wnba_college_park_center,College Park Center,Arlington,TX,32.7319,-97.1103,7000,WNBA,['DAL'],wnba_hardcoded,2012
wnba_michelob_ultra_arena,Michelob Ultra Arena,Las Vegas,NV,36.0909,-115.175,12000,WNBA,['LVA'],wnba_hardcoded,2016
wnba_entertainment_and_sports_arena,Entertainment & Sports Arena,Washington,DC,38.872,-76.987,4200,WNBA,['WAS'],wnba_hardcoded,2018
wnba_chase_center,Chase Center,San Francisco,CA,37.768,-122.3879,18064,WNBA,['GSV'],wnba_hardcoded,2019
wnba_gainbridge_fieldhouse,Gainbridge Fieldhouse,Indianapolis,IN,39.764,-86.1555,17923,WNBA,['IND'],wnba_hardcoded,1999
wnba_cryptocom_arena,Crypto.com Arena,Los Angeles,CA,34.043,-118.2673,19079,WNBA,['LA'],wnba_hardcoded,1999
wnba_target_center,Target Center,Minneapolis,MN,44.9795,-93.2761,18978,WNBA,['MIN'],wnba_hardcoded,1990
wnba_barclays_center,Barclays Center,Brooklyn,NY,40.6826,-73.9754,17732,WNBA,['NY'],wnba_hardcoded,2012
wnba_footprint_center,Footprint Center,Phoenix,AZ,33.4457,-112.0712,17071,WNBA,['PHO'],wnba_hardcoded,1992
wnba_climate_pledge_arena,Climate Pledge Arena,Seattle,WA,47.622,-122.354,17100,WNBA,['SEA'],wnba_hardcoded,1962
nwsl_bmo_stadium,BMO Stadium,Los Angeles,CA,34.0128,-118.2841,22000,NWSL,['LA'],nwsl_hardcoded,2018
nwsl_paypal_park,PayPal Park,San Jose,CA,37.3514,-121.925,18000,NWSL,['SJ'],nwsl_hardcoded,2015
nwsl_shell_energy_stadium,Shell Energy Stadium,Houston,TX,29.7522,-95.3524,22039,NWSL,['HOU'],nwsl_hardcoded,2012
nwsl_red_bull_arena,Red Bull Arena,Harrison,NJ,40.7367,-74.1503,25000,NWSL,['NJ'],nwsl_hardcoded,2010
nwsl_interandco_stadium,Inter&Co Stadium,Orlando,FL,28.5411,-81.3893,25500,NWSL,['ORL'],nwsl_hardcoded,2017
nwsl_providence_park,Providence Park,Portland,OR,45.5214,-122.6917,25218,NWSL,['POR'],nwsl_hardcoded,1926
nwsl_lumen_field,Lumen Field,Seattle,WA,47.5952,-122.3316,37722,NWSL,['SEA'],nwsl_hardcoded,2002
nwsl_snapdragon_stadium,Snapdragon Stadium,San Diego,CA,32.7844,-117.1228,35000,NWSL,['SD'],nwsl_hardcoded,2022
nwsl_america_first_field,America First Field,Sandy,UT,40.5829,-111.8934,20213,NWSL,['UTA'],nwsl_hardcoded,2008
nwsl_audi_field,Audi Field,Washington,DC,38.8684,-77.0129,20000,NWSL,['WAS'],nwsl_hardcoded,2018
nwsl_seatgeek_stadium,SeatGeek Stadium,Bridgeview,IL,41.7653,-87.8049,20000,NWSL,['CHI'],nwsl_hardcoded,2006
nwsl_cpkc_stadium,CPKC Stadium,Kansas City,MO,39.0975,-94.5556,11500,NWSL,['KC'],nwsl_hardcoded,2024
nwsl_wakemed_soccer_park,WakeMed Soccer Park,Cary,NC,35.8018,-78.7442,10000,NWSL,['NC'],nwsl_hardcoded,2002
|
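The `team_abbrevs` column in the stadium CSV above holds Python list literals (for example `['ARI']`, or the quoted `"['LAC', 'LAR']"` for shared venues), so a reader has to decode it after the normal CSV parse. The following is a minimal sketch of one way to do that with the standard library. `load_stadiums` is a hypothetical name and is not taken from the deleted scripts; the two sample rows are copied from the file.

```python
import ast
import csv
import io

# Two rows copied from the stadium CSV; the second shows a shared venue.
SAMPLE = """id,name,city,state,latitude,longitude,capacity,sport,team_abbrevs,source,year_opened
mlb_chase_field,Chase Field,Phoenix,AZ,33.4453,-112.0667,48519,MLB,['ARI'],mlb_hardcoded,1998
nfl_sofi_stadium,SoFi Stadium,Inglewood,CA,33.9535,-118.3392,70240,NFL,"['LAC', 'LAR']",nfl_hardcoded,2020
"""


def load_stadiums(text: str) -> list[dict]:
    """Parse stadium rows, converting numeric columns and the list literal.

    Hypothetical helper: ast.literal_eval only evaluates literals, so it is
    a safer choice than eval() for the stringified team list.
    """
    stadiums = []
    for row in csv.DictReader(io.StringIO(text)):
        row["latitude"] = float(row["latitude"])
        row["longitude"] = float(row["longitude"])
        row["capacity"] = int(row["capacity"])
        row["year_opened"] = int(row["year_opened"])
        row["team_abbrevs"] = ast.literal_eval(row["team_abbrevs"])
        stadiums.append(row)
    return stadiums


if __name__ == "__main__":
    for stadium in load_stadiums(SAMPLE):
        print(stadium["id"], stadium["team_abbrevs"])
```

Storing the list as a delimited string (or as JSON) would avoid the `literal_eval` step, but that is a design choice for the writer of the file rather than something this sketch assumes.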
@@ -1,405 +0,0 @@
#!/usr/bin/env python3
"""
Generate Canonical Data for SportsTime App
==========================================
Generates team_aliases.json and league_structure.json from team mappings.

Usage:
    python generate_canonical_data.py
    python generate_canonical_data.py --output ./data
"""

import argparse
import json
from datetime import datetime
from pathlib import Path

# =============================================================================
# LEAGUE STRUCTURE
# =============================================================================

MLB_STRUCTURE = {
    "leagues": [
        {"id": "mlb_al", "name": "American League", "abbreviation": "AL"},
        {"id": "mlb_nl", "name": "National League", "abbreviation": "NL"},
    ],
    "divisions": [
        # American League
        {"id": "mlb_al_east", "name": "AL East", "parent_id": "mlb_al", "teams": ["NYY", "BOS", "TOR", "BAL", "TBR"]},
        {"id": "mlb_al_central", "name": "AL Central", "parent_id": "mlb_al", "teams": ["CLE", "DET", "MIN", "CHW", "KCR"]},
        {"id": "mlb_al_west", "name": "AL West", "parent_id": "mlb_al", "teams": ["HOU", "SEA", "TEX", "LAA", "OAK"]},
        # National League
        {"id": "mlb_nl_east", "name": "NL East", "parent_id": "mlb_nl", "teams": ["ATL", "PHI", "NYM", "MIA", "WSN"]},
        {"id": "mlb_nl_central", "name": "NL Central", "parent_id": "mlb_nl", "teams": ["MIL", "CHC", "STL", "PIT", "CIN"]},
        {"id": "mlb_nl_west", "name": "NL West", "parent_id": "mlb_nl", "teams": ["LAD", "ARI", "SDP", "SFG", "COL"]},
    ]
}

NBA_STRUCTURE = {
    "conferences": [
        {"id": "nba_eastern", "name": "Eastern Conference", "abbreviation": "East"},
        {"id": "nba_western", "name": "Western Conference", "abbreviation": "West"},
    ],
    "divisions": [
        # Eastern Conference
        {"id": "nba_atlantic", "name": "Atlantic", "parent_id": "nba_eastern", "teams": ["BOS", "BRK", "NYK", "PHI", "TOR"]},
        {"id": "nba_central", "name": "Central", "parent_id": "nba_eastern", "teams": ["CHI", "CLE", "DET", "IND", "MIL"]},
        {"id": "nba_southeast", "name": "Southeast", "parent_id": "nba_eastern", "teams": ["ATL", "CHO", "MIA", "ORL", "WAS"]},
        # Western Conference
        {"id": "nba_northwest", "name": "Northwest", "parent_id": "nba_western", "teams": ["DEN", "MIN", "OKC", "POR", "UTA"]},
        {"id": "nba_pacific", "name": "Pacific", "parent_id": "nba_western", "teams": ["GSW", "LAC", "LAL", "PHO", "SAC"]},
        {"id": "nba_southwest", "name": "Southwest", "parent_id": "nba_western", "teams": ["DAL", "HOU", "MEM", "NOP", "SAS"]},
    ]
}

NHL_STRUCTURE = {
    "conferences": [
        {"id": "nhl_eastern", "name": "Eastern Conference", "abbreviation": "East"},
        {"id": "nhl_western", "name": "Western Conference", "abbreviation": "West"},
    ],
    "divisions": [
        # Eastern Conference
        {"id": "nhl_atlantic", "name": "Atlantic", "parent_id": "nhl_eastern", "teams": ["BOS", "BUF", "DET", "FLA", "MTL", "OTT", "TBL", "TOR"]},
        {"id": "nhl_metropolitan", "name": "Metropolitan", "parent_id": "nhl_eastern", "teams": ["CAR", "CBJ", "NJD", "NYI", "NYR", "PHI", "PIT", "WSH"]},
        # Western Conference
        {"id": "nhl_central", "name": "Central", "parent_id": "nhl_western", "teams": ["ARI", "CHI", "COL", "DAL", "MIN", "NSH", "STL", "WPG"]},
        {"id": "nhl_pacific", "name": "Pacific", "parent_id": "nhl_western", "teams": ["ANA", "CGY", "EDM", "LAK", "SEA", "SJS", "VAN", "VGK"]},
    ]
}


# =============================================================================
# TEAM ALIASES (Historical name changes, relocations, abbreviation changes)
# =============================================================================

# Format: {current_abbrev: [(alias_type, alias_value, valid_from, valid_until), ...]}

MLB_ALIASES = {
    # Washington Nationals (formerly Montreal Expos)
    "WSN": [
        ("name", "Montreal Expos", "1969-01-01", "2004-12-31"),
        ("abbreviation", "MON", "1969-01-01", "2004-12-31"),
        ("city", "Montreal", "1969-01-01", "2004-12-31"),
    ],
    # Oakland Athletics (moving to Sacramento, formerly in Kansas City and Philadelphia)
    "OAK": [
        ("name", "Kansas City Athletics", "1955-01-01", "1967-12-31"),
        ("abbreviation", "KCA", "1955-01-01", "1967-12-31"),
        ("city", "Kansas City", "1955-01-01", "1967-12-31"),
        ("name", "Philadelphia Athletics", "1901-01-01", "1954-12-31"),
        ("abbreviation", "PHA", "1901-01-01", "1954-12-31"),
        ("city", "Philadelphia", "1901-01-01", "1954-12-31"),
    ],
    # Cleveland Guardians (formerly Indians)
    "CLE": [
        ("name", "Cleveland Indians", "1915-01-01", "2021-12-31"),
    ],
    # Tampa Bay Rays (formerly Devil Rays)
    "TBR": [
        ("name", "Tampa Bay Devil Rays", "1998-01-01", "2007-12-31"),
    ],
    # Miami Marlins (formerly Florida Marlins)
    "MIA": [
        ("name", "Florida Marlins", "1993-01-01", "2011-12-31"),
        ("city", "Florida", "1993-01-01", "2011-12-31"),
    ],
    # Los Angeles Angels (various names)
    "LAA": [
        ("name", "Anaheim Angels", "1997-01-01", "2004-12-31"),
        ("name", "Los Angeles Angels of Anaheim", "2005-01-01", "2015-12-31"),
        ("name", "California Angels", "1965-01-01", "1996-12-31"),
    ],
    # Texas Rangers (formerly Washington Senators II)
    "TEX": [
        ("name", "Washington Senators", "1961-01-01", "1971-12-31"),
        ("abbreviation", "WS2", "1961-01-01", "1971-12-31"),
        ("city", "Washington", "1961-01-01", "1971-12-31"),
    ],
    # Milwaukee Brewers (briefly Seattle Pilots)
    "MIL": [
        ("name", "Seattle Pilots", "1969-01-01", "1969-12-31"),
        ("abbreviation", "SEP", "1969-01-01", "1969-12-31"),
        ("city", "Seattle", "1969-01-01", "1969-12-31"),
    ],
    # Houston Astros (formerly Colt .45s)
    "HOU": [
        ("name", "Houston Colt .45s", "1962-01-01", "1964-12-31"),
    ],
}

NBA_ALIASES = {
    # Brooklyn Nets (formerly New Jersey Nets, New York Nets)
    "BRK": [
        ("name", "New Jersey Nets", "1977-01-01", "2012-04-30"),
        ("abbreviation", "NJN", "1977-01-01", "2012-04-30"),
        ("city", "New Jersey", "1977-01-01", "2012-04-30"),
        ("name", "New York Nets", "1968-01-01", "1977-12-31"),
    ],
    # Oklahoma City Thunder (formerly Seattle SuperSonics)
    "OKC": [
        ("name", "Seattle SuperSonics", "1967-01-01", "2008-07-01"),
        ("abbreviation", "SEA", "1967-01-01", "2008-07-01"),
        ("city", "Seattle", "1967-01-01", "2008-07-01"),
    ],
    # Memphis Grizzlies (formerly Vancouver Grizzlies)
    "MEM": [
        ("name", "Vancouver Grizzlies", "1995-01-01", "2001-05-31"),
        ("abbreviation", "VAN", "1995-01-01", "2001-05-31"),
        ("city", "Vancouver", "1995-01-01", "2001-05-31"),
    ],
    # New Orleans Pelicans (formerly Hornets, formerly Charlotte Hornets original)
    "NOP": [
        ("name", "New Orleans Hornets", "2002-01-01", "2013-04-30"),
        ("abbreviation", "NOH", "2002-01-01", "2013-04-30"),
        ("name", "New Orleans/Oklahoma City Hornets", "2005-01-01", "2007-12-31"),
    ],
    # Charlotte Hornets (current, formerly Bobcats)
    "CHO": [
        ("name", "Charlotte Bobcats", "2004-01-01", "2014-04-30"),
        ("abbreviation", "CHA", "2004-01-01", "2014-04-30"),
    ],
    # Washington Wizards (formerly Bullets)
    "WAS": [
        ("name", "Washington Bullets", "1974-01-01", "1997-05-31"),
        ("name", "Capital Bullets", "1973-01-01", "1973-12-31"),
        ("name", "Baltimore Bullets", "1963-01-01", "1972-12-31"),
    ],
    # Los Angeles Clippers (formerly San Diego, Buffalo)
    "LAC": [
        ("name", "San Diego Clippers", "1978-01-01", "1984-05-31"),
        ("abbreviation", "SDC", "1978-01-01", "1984-05-31"),
        ("city", "San Diego", "1978-01-01", "1984-05-31"),
        ("name", "Buffalo Braves", "1970-01-01", "1978-05-31"),
        ("abbreviation", "BUF", "1970-01-01", "1978-05-31"),
        ("city", "Buffalo", "1970-01-01", "1978-05-31"),
    ],
    # Sacramento Kings (formerly Kansas City Kings, etc.)
    "SAC": [
        ("name", "Kansas City Kings", "1975-01-01", "1985-05-31"),
        ("abbreviation", "KCK", "1975-01-01", "1985-05-31"),
        ("city", "Kansas City", "1975-01-01", "1985-05-31"),
    ],
    # Utah Jazz (formerly New Orleans Jazz)
    "UTA": [
        ("name", "New Orleans Jazz", "1974-01-01", "1979-05-31"),
        ("city", "New Orleans", "1974-01-01", "1979-05-31"),
    ],
}

NHL_ALIASES = {
    # Arizona/Utah Hockey Club (formerly Phoenix Coyotes, originally Winnipeg Jets)
    "ARI": [
        ("name", "Arizona Coyotes", "2014-01-01", "2024-04-30"),
        ("name", "Phoenix Coyotes", "1996-01-01", "2013-12-31"),
        ("abbreviation", "PHX", "1996-01-01", "2013-12-31"),
        ("city", "Phoenix", "1996-01-01", "2013-12-31"),
        ("name", "Winnipeg Jets", "1979-01-01", "1996-05-31"),  # Original Jets
    ],
    # Carolina Hurricanes (formerly Hartford Whalers)
    "CAR": [
        ("name", "Hartford Whalers", "1979-01-01", "1997-05-31"),
        ("abbreviation", "HFD", "1979-01-01", "1997-05-31"),
        ("city", "Hartford", "1979-01-01", "1997-05-31"),
    ],
    # Colorado Avalanche (formerly Quebec Nordiques)
    "COL": [
        ("name", "Quebec Nordiques", "1979-01-01", "1995-05-31"),
        ("abbreviation", "QUE", "1979-01-01", "1995-05-31"),
        ("city", "Quebec", "1979-01-01", "1995-05-31"),
    ],
    # Dallas Stars (formerly Minnesota North Stars)
    "DAL": [
        ("name", "Minnesota North Stars", "1967-01-01", "1993-05-31"),
        ("abbreviation", "MNS", "1967-01-01", "1993-05-31"),
        ("city", "Minnesota", "1967-01-01", "1993-05-31"),
    ],
    # New Jersey Devils (formerly Kansas City Scouts, Colorado Rockies)
    "NJD": [
        ("name", "Colorado Rockies", "1976-01-01", "1982-05-31"),
        ("abbreviation", "CLR", "1976-01-01", "1982-05-31"),
        ("city", "Colorado", "1976-01-01", "1982-05-31"),
        ("name", "Kansas City Scouts", "1974-01-01", "1976-05-31"),
        ("abbreviation", "KCS", "1974-01-01", "1976-05-31"),
        ("city", "Kansas City", "1974-01-01", "1976-05-31"),
    ],
    # Winnipeg Jets (current, formerly Atlanta Thrashers)
    "WPG": [
        ("name", "Atlanta Thrashers", "1999-01-01", "2011-05-31"),
        ("abbreviation", "ATL", "1999-01-01", "2011-05-31"),
        ("city", "Atlanta", "1999-01-01", "2011-05-31"),
    ],
    # Florida Panthers (originally in Miami)
    "FLA": [
        ("city", "Miami", "1993-01-01", "1998-12-31"),
    ],
    # Vegas Golden Knights (no aliases, expansion team)
    # Seattle Kraken (no aliases, expansion team)
}


def generate_league_structure() -> list[dict]:
    """Generate league_structure.json data."""
    structures = []
    order = 0

    # MLB
    structures.append({
        "id": "mlb_league",
        "sport": "MLB",
        "type": "league",
        "name": "Major League Baseball",
        "abbreviation": "MLB",
        "parent_id": None,
        "display_order": order,
    })
    order += 1

    for league in MLB_STRUCTURE["leagues"]:
        structures.append({
            "id": league["id"],
            "sport": "MLB",
            "type": "conference",  # AL/NL are like conferences
            "name": league["name"],
            "abbreviation": league["abbreviation"],
            "parent_id": "mlb_league",
            "display_order": order,
        })
        order += 1

    for div in MLB_STRUCTURE["divisions"]:
        structures.append({
            "id": div["id"],
            "sport": "MLB",
            "type": "division",
            "name": div["name"],
            "abbreviation": None,
            "parent_id": div["parent_id"],
            "display_order": order,
        })
        order += 1

    # NBA
    structures.append({
        "id": "nba_league",
        "sport": "NBA",
        "type": "league",
        "name": "National Basketball Association",
        "abbreviation": "NBA",
        "parent_id": None,
        "display_order": order,
    })
    order += 1

    for conf in NBA_STRUCTURE["conferences"]:
        structures.append({
            "id": conf["id"],
            "sport": "NBA",
            "type": "conference",
            "name": conf["name"],
            "abbreviation": conf["abbreviation"],
            "parent_id": "nba_league",
            "display_order": order,
        })
        order += 1

    for div in NBA_STRUCTURE["divisions"]:
        structures.append({
            "id": div["id"],
            "sport": "NBA",
            "type": "division",
            "name": div["name"],
            "abbreviation": None,
            "parent_id": div["parent_id"],
            "display_order": order,
        })
        order += 1

    # NHL
    structures.append({
        "id": "nhl_league",
        "sport": "NHL",
        "type": "league",
        "name": "National Hockey League",
        "abbreviation": "NHL",
        "parent_id": None,
        "display_order": order,
    })
    order += 1

    for conf in NHL_STRUCTURE["conferences"]:
        structures.append({
            "id": conf["id"],
            "sport": "NHL",
            "type": "conference",
            "name": conf["name"],
            "abbreviation": conf["abbreviation"],
            "parent_id": "nhl_league",
            "display_order": order,
        })
        order += 1

    for div in NHL_STRUCTURE["divisions"]:
        structures.append({
            "id": div["id"],
            "sport": "NHL",
            "type": "division",
            "name": div["name"],
            "abbreviation": None,
            "parent_id": div["parent_id"],
            "display_order": order,
        })
        order += 1

    return structures


def generate_team_aliases() -> list[dict]:
    """Generate team_aliases.json data."""
    aliases = []
    alias_id = 1

    for sport, sport_aliases in [("MLB", MLB_ALIASES), ("NBA", NBA_ALIASES), ("NHL", NHL_ALIASES)]:
        for current_abbrev, alias_list in sport_aliases.items():
            team_canonical_id = f"team_{sport.lower()}_{current_abbrev.lower()}"

            for alias_type, alias_value, valid_from, valid_until in alias_list:
                aliases.append({
                    "id": f"alias_{sport.lower()}_{alias_id}",
                    "team_canonical_id": team_canonical_id,
                    "alias_type": alias_type,
                    "alias_value": alias_value,
                    "valid_from": valid_from,
                    "valid_until": valid_until,
                })
                alias_id += 1

    return aliases
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description='Generate canonical data JSON files')
|
||||
parser.add_argument('--output', type=str, default='./data', help='Output directory')
|
||||
args = parser.parse_args()
|
||||
|
||||
output_dir = Path(args.output)
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Generate league structure
|
||||
print("Generating league_structure.json...")
|
||||
league_structure = generate_league_structure()
|
||||
with open(output_dir / 'league_structure.json', 'w') as f:
|
||||
json.dump(league_structure, f, indent=2)
|
||||
print(f" Created {len(league_structure)} structure entries")
|
||||
|
||||
# Generate team aliases
|
||||
print("Generating team_aliases.json...")
|
||||
team_aliases = generate_team_aliases()
|
||||
with open(output_dir / 'team_aliases.json', 'w') as f:
|
||||
json.dump(team_aliases, f, indent=2)
|
||||
print(f" Created {len(team_aliases)} alias entries")
|
||||
|
||||
print(f"\nFiles written to {output_dir}")
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
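The alias entries produced by `generate_team_aliases()` carry a validity window (`valid_from` / `valid_until`), which suggests date-aware lookups. The following is a minimal sketch of such a lookup, not code from this repository: `resolve_alias` is a hypothetical helper, the sample `id` value is illustrative, and treating a missing bound as open-ended is an assumption.

```python
from datetime import date

# Sample entry shaped like the dicts built by generate_team_aliases().
# The Florida Panthers values come from the alias table above; the "id" is illustrative.
SAMPLE_ALIASES = [
    {
        "id": "alias_nhl_example",
        "team_canonical_id": "team_nhl_fla",
        "alias_type": "city",
        "alias_value": "Miami",
        "valid_from": "1993-01-01",
        "valid_until": "1998-12-31",
    },
]


def resolve_alias(aliases, alias_type, alias_value, on_date):
    """Hypothetical helper: return the canonical team ID for an alias valid on on_date."""
    for entry in aliases:
        if entry["alias_type"] != alias_type:
            continue
        if entry["alias_value"].lower() != alias_value.lower():
            continue
        # Assumption: a missing bound means the window is open-ended on that side.
        start = date.fromisoformat(entry["valid_from"]) if entry["valid_from"] else date.min
        end = date.fromisoformat(entry["valid_until"]) if entry["valid_until"] else date.max
        if start <= on_date <= end:
            return entry["team_canonical_id"]
    return None
```

With the sample data, a 1995 game listed under the city "Miami" would resolve to `team_nhl_fla`, while the same lookup for a 2005 date would return `None`.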
@@ -1,275 +0,0 @@
|
||||
#!/usr/bin/env swift
|
||||
//
|
||||
// import_to_cloudkit.swift
|
||||
// SportsTime
|
||||
//
|
||||
// Imports scraped JSON data into CloudKit public database.
|
||||
// Run from command line: swift import_to_cloudkit.swift --games data/games.json --stadiums data/stadiums.json
|
||||
//
|
||||
|
||||
import Foundation
|
||||
import CloudKit
|
||||
|
||||
// MARK: - Data Models (matching scraper output)
|
||||
|
||||
struct ScrapedGame: Codable {
|
||||
let id: String
|
||||
let sport: String
|
||||
let season: String
|
||||
let date: String
|
||||
let time: String?
|
||||
let home_team: String
|
||||
let away_team: String
|
||||
let home_team_abbrev: String
|
||||
let away_team_abbrev: String
|
||||
let venue: String
|
||||
let source: String
|
||||
let is_playoff: Bool?
|
||||
let broadcast: String?
|
||||
}
|
||||
|
||||
struct ScrapedStadium: Codable {
|
||||
let id: String
|
||||
let name: String
|
||||
let city: String
|
||||
let state: String
|
||||
let latitude: Double
|
||||
let longitude: Double
|
||||
let capacity: Int
|
||||
let sport: String
|
||||
let team_abbrevs: [String]
|
||||
let source: String
|
||||
let year_opened: Int?
|
||||
}
|
||||
|
||||
// MARK: - CloudKit Importer
|
||||
|
||||
class CloudKitImporter {
|
||||
let container: CKContainer
|
||||
let database: CKDatabase
|
||||
|
||||
init(containerIdentifier: String = "iCloud.com.sportstime.app") {
|
||||
self.container = CKContainer(identifier: containerIdentifier)
|
||||
self.database = container.publicCloudDatabase
|
||||
}
|
||||
|
||||
// MARK: - Import Stadiums
|
||||
|
||||
func importStadiums(from stadiums: [ScrapedStadium]) async throws -> Int {
|
||||
var imported = 0
|
||||
|
||||
for stadium in stadiums {
|
||||
let record = CKRecord(recordType: "Stadium")
|
||||
record["stadiumId"] = stadium.id
|
||||
record["name"] = stadium.name
|
||||
record["city"] = stadium.city
|
||||
record["state"] = stadium.state
|
||||
record["location"] = CLLocation(latitude: stadium.latitude, longitude: stadium.longitude)
|
||||
record["capacity"] = stadium.capacity
|
||||
record["sport"] = stadium.sport
|
||||
record["teamAbbrevs"] = stadium.team_abbrevs
|
||||
record["source"] = stadium.source
|
||||
|
||||
if let yearOpened = stadium.year_opened {
|
||||
record["yearOpened"] = yearOpened
|
||||
}
|
||||
|
||||
do {
|
||||
_ = try await database.save(record)
|
||||
imported += 1
|
||||
print(" Imported stadium: \(stadium.name)")
|
||||
} catch {
|
||||
print(" Error importing \(stadium.name): \(error)")
|
||||
}
|
||||
}
|
||||
|
||||
return imported
|
||||
}
|
||||
|
||||
// MARK: - Import Teams
|
||||
|
||||
func importTeams(from stadiums: [ScrapedStadium], teamMappings: [String: TeamInfo]) async throws -> [String: CKRecord.ID] {
|
||||
var teamRecordIDs: [String: CKRecord.ID] = [:]
|
||||
|
||||
for (abbrev, info) in teamMappings {
|
||||
let record = CKRecord(recordType: "Team")
|
||||
record["teamId"] = UUID().uuidString
|
||||
record["name"] = info.name
|
||||
record["abbreviation"] = abbrev
|
||||
record["sport"] = info.sport
|
||||
record["city"] = info.city
|
||||
|
||||
do {
|
||||
let saved = try await database.save(record)
|
||||
teamRecordIDs[abbrev] = saved.recordID
|
||||
print(" Imported team: \(info.name)")
|
||||
} catch {
|
||||
print(" Error importing team \(info.name): \(error)")
|
||||
}
|
||||
}
|
||||
|
||||
return teamRecordIDs
|
||||
}
|
||||
|
||||
// MARK: - Import Games
|
||||
|
||||
func importGames(
|
||||
from games: [ScrapedGame],
|
||||
teamRecordIDs: [String: CKRecord.ID],
|
||||
stadiumRecordIDs: [String: CKRecord.ID]
|
||||
) async throws -> Int {
|
||||
var imported = 0
|
||||
|
||||
// Batch imports for efficiency
|
||||
let batchSize = 100
|
||||
var batch: [CKRecord] = []
|
||||
|
||||
for game in games {
|
||||
let record = CKRecord(recordType: "Game")
|
||||
record["gameId"] = game.id
|
||||
record["sport"] = game.sport
|
||||
record["season"] = game.season
|
||||
|
||||
// Parse date
|
||||
let dateFormatter = DateFormatter()
|
||||
dateFormatter.dateFormat = "yyyy-MM-dd"
|
||||
if let date = dateFormatter.date(from: game.date) {
|
||||
if let timeStr = game.time {
|
||||
// Combine date and time
|
||||
let timeFormatter = DateFormatter()
|
||||
timeFormatter.dateFormat = "HH:mm"
|
||||
if let time = timeFormatter.date(from: timeStr) {
|
||||
let calendar = Calendar.current
|
||||
let timeComponents = calendar.dateComponents([.hour, .minute], from: time)
|
||||
if let combined = calendar.date(bySettingHour: timeComponents.hour ?? 19,
|
||||
minute: timeComponents.minute ?? 0,
|
||||
second: 0, of: date) {
|
||||
record["dateTime"] = combined
|
||||
}
|
||||
}
|
||||
} else {
|
||||
// Default to 7 PM if no time
|
||||
let calendar = Calendar.current
|
||||
if let defaultTime = calendar.date(bySettingHour: 19, minute: 0, second: 0, of: date) {
|
||||
record["dateTime"] = defaultTime
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Team references
|
||||
if let homeTeamID = teamRecordIDs[game.home_team_abbrev] {
|
||||
record["homeTeamRef"] = CKRecord.Reference(recordID: homeTeamID, action: .none)
|
||||
}
|
||||
if let awayTeamID = teamRecordIDs[game.away_team_abbrev] {
|
||||
record["awayTeamRef"] = CKRecord.Reference(recordID: awayTeamID, action: .none)
|
||||
}
|
||||
|
||||
record["isPlayoff"] = (game.is_playoff ?? false) ? 1 : 0
|
||||
record["broadcastInfo"] = game.broadcast
|
||||
record["source"] = game.source
|
||||
|
||||
batch.append(record)
|
||||
|
||||
            // Save batch
            if batch.count >= batchSize {
                do {
                    // Pass the save policy to modifyRecords directly; a separate operation object would go unused
                    _ = try await database.modifyRecords(saving: batch, deleting: [], savePolicy: .changedKeys)
                    imported += batch.count
                    print(" Imported batch of \(batch.count) games (total: \(imported))")
                } catch {
                    print(" Error importing batch: \(error)")
                }
                // Clear the batch even after a failure so it cannot grow past batchSize
                batch.removeAll()
            }
|
||||
}
|
||||
|
||||
// Save remaining
|
||||
if !batch.isEmpty {
|
||||
do {
|
||||
try await database.modifyRecords(saving: batch, deleting: [])
|
||||
imported += batch.count
|
||||
} catch {
|
||||
print(" Error importing final batch: \(error)")
|
||||
}
|
||||
}
|
||||
|
||||
return imported
|
||||
}
|
||||
}
|
||||
|
||||
// MARK: - Team Info
|
||||
|
||||
struct TeamInfo {
|
||||
let name: String
|
||||
let city: String
|
||||
let sport: String
|
||||
}
|
||||
|
||||
// MARK: - Main
|
||||
|
||||
func loadJSON<T: Codable>(from path: String) throws -> T {
|
||||
let url = URL(fileURLWithPath: path)
|
||||
let data = try Data(contentsOf: url)
|
||||
return try JSONDecoder().decode(T.self, from: data)
|
||||
}
|
||||
|
||||
func main() async {
|
||||
let args = CommandLine.arguments
|
||||
|
||||
guard args.count >= 3 else {
|
||||
print("Usage: swift import_to_cloudkit.swift --games <path> --stadiums <path>")
|
||||
return
|
||||
}
|
||||
|
||||
var gamesPath: String?
|
||||
var stadiumsPath: String?
|
||||
|
||||
for i in 1..<args.count {
|
||||
if args[i] == "--games" && i + 1 < args.count {
|
||||
gamesPath = args[i + 1]
|
||||
}
|
||||
if args[i] == "--stadiums" && i + 1 < args.count {
|
||||
stadiumsPath = args[i + 1]
|
||||
}
|
||||
}
|
||||
|
||||
let importer = CloudKitImporter()
|
||||
|
||||
// Import stadiums
|
||||
if let path = stadiumsPath {
|
||||
print("\n=== Importing Stadiums ===")
|
||||
do {
|
||||
let stadiums: [ScrapedStadium] = try loadJSON(from: path)
|
||||
let count = try await importer.importStadiums(from: stadiums)
|
||||
print("Imported \(count) stadiums")
|
||||
} catch {
|
||||
print("Error loading stadiums: \(error)")
|
||||
}
|
||||
}
|
||||
|
||||
// Import games
|
||||
if let path = gamesPath {
|
||||
print("\n=== Importing Games ===")
|
||||
do {
|
||||
let games: [ScrapedGame] = try loadJSON(from: path)
|
||||
// Note: Would need to first import teams and get their record IDs
|
||||
// This is a simplified version
|
||||
print("Loaded \(games.count) games for import")
|
||||
} catch {
|
||||
print("Error loading games: \(error)")
|
||||
}
|
||||
}
|
||||
|
||||
print("\n=== Import Complete ===")
|
||||
}
|
||||
|
||||
// Run
Task {
    await main()
    // Exit once the import finishes; otherwise the run loop below never returns
    exit(0)
}

// Keep the process running for async operations
RunLoop.main.run()
|
||||
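The batching idea in `importGames` (collect records, save every `batchSize`, then save the remainder) can be sketched in Python, the language of the surrounding scripts. This is an illustrative sketch only: `chunked` and `upload_in_batches` are hypothetical names, and `save_batch` stands in for whatever call actually writes a batch to CloudKit.

```python
from typing import Callable, Iterator, Sequence


def chunked(records: Sequence[dict], size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most `size` records."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(records), size):
        yield records[start:start + size]


def upload_in_batches(records: Sequence[dict], save_batch: Callable, size: int = 100):
    """Hypothetical helper: save batch by batch and return (saved, failed) counts."""
    saved = failed = 0
    for batch in chunked(records, size):
        try:
            save_batch(batch)
            saved += len(batch)
        except Exception as exc:
            # A failed batch is reported and skipped so later batches still run.
            print(f" Error importing batch: {exc}")
            failed += len(batch)
    return saved, failed
```

Slicing up front means the final partial batch needs no special case, and a failed batch never carries over into the next one.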
@@ -1,510 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
MLB schedule and stadium scrapers for SportsTime.
|
||||
|
||||
This module provides:
|
||||
- MLB game scrapers (Baseball-Reference, Stats API, ESPN)
|
||||
- MLB stadium scrapers (MLBScoreBot, GeoJSON, hardcoded)
|
||||
- Multi-source fallback configurations
|
||||
"""
|
||||
|
||||
from datetime import datetime
|
||||
from typing import Optional
|
||||
|
||||
import requests
|
||||
|
||||
# Support both direct execution and import from parent directory
|
||||
try:
|
||||
from core import (
|
||||
Game,
|
||||
Stadium,
|
||||
ScraperSource,
|
||||
StadiumScraperSource,
|
||||
fetch_page,
|
||||
scrape_with_fallback,
|
||||
scrape_stadiums_with_fallback,
|
||||
)
|
||||
except ImportError:
|
||||
from Scripts.core import (
|
||||
Game,
|
||||
Stadium,
|
||||
ScraperSource,
|
||||
StadiumScraperSource,
|
||||
fetch_page,
|
||||
scrape_with_fallback,
|
||||
scrape_stadiums_with_fallback,
|
||||
)
|
||||
|
||||
|
||||
__all__ = [
|
||||
# Team data
|
||||
'MLB_TEAMS',
|
||||
# Game scrapers
|
||||
'scrape_mlb_baseball_reference',
|
||||
'scrape_mlb_statsapi',
|
||||
'scrape_mlb_espn',
|
||||
# Stadium scrapers
|
||||
'scrape_mlb_stadiums_scorebot',
|
||||
'scrape_mlb_stadiums_geojson',
|
||||
'scrape_mlb_stadiums_hardcoded',
|
||||
'scrape_mlb_stadiums',
|
||||
# Source configurations
|
||||
'MLB_GAME_SOURCES',
|
||||
'MLB_STADIUM_SOURCES',
|
||||
# Convenience function
|
||||
'scrape_mlb_games',
|
||||
]
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# TEAM MAPPINGS
|
||||
# =============================================================================
|
||||
|
||||
MLB_TEAMS = {
|
||||
'ARI': {'name': 'Arizona Diamondbacks', 'city': 'Phoenix', 'stadium': 'Chase Field'},
|
||||
'ATL': {'name': 'Atlanta Braves', 'city': 'Atlanta', 'stadium': 'Truist Park'},
|
||||
'BAL': {'name': 'Baltimore Orioles', 'city': 'Baltimore', 'stadium': 'Oriole Park at Camden Yards'},
|
||||
'BOS': {'name': 'Boston Red Sox', 'city': 'Boston', 'stadium': 'Fenway Park'},
|
||||
'CHC': {'name': 'Chicago Cubs', 'city': 'Chicago', 'stadium': 'Wrigley Field'},
|
||||
'CHW': {'name': 'Chicago White Sox', 'city': 'Chicago', 'stadium': 'Guaranteed Rate Field'},
|
||||
'CIN': {'name': 'Cincinnati Reds', 'city': 'Cincinnati', 'stadium': 'Great American Ball Park'},
|
||||
'CLE': {'name': 'Cleveland Guardians', 'city': 'Cleveland', 'stadium': 'Progressive Field'},
|
||||
'COL': {'name': 'Colorado Rockies', 'city': 'Denver', 'stadium': 'Coors Field'},
|
||||
'DET': {'name': 'Detroit Tigers', 'city': 'Detroit', 'stadium': 'Comerica Park'},
|
||||
'HOU': {'name': 'Houston Astros', 'city': 'Houston', 'stadium': 'Minute Maid Park'},
|
||||
'KCR': {'name': 'Kansas City Royals', 'city': 'Kansas City', 'stadium': 'Kauffman Stadium'},
|
||||
'LAA': {'name': 'Los Angeles Angels', 'city': 'Anaheim', 'stadium': 'Angel Stadium'},
|
||||
'LAD': {'name': 'Los Angeles Dodgers', 'city': 'Los Angeles', 'stadium': 'Dodger Stadium'},
|
||||
'MIA': {'name': 'Miami Marlins', 'city': 'Miami', 'stadium': 'LoanDepot Park'},
|
||||
'MIL': {'name': 'Milwaukee Brewers', 'city': 'Milwaukee', 'stadium': 'American Family Field'},
|
||||
'MIN': {'name': 'Minnesota Twins', 'city': 'Minneapolis', 'stadium': 'Target Field'},
|
||||
'NYM': {'name': 'New York Mets', 'city': 'New York', 'stadium': 'Citi Field'},
|
||||
'NYY': {'name': 'New York Yankees', 'city': 'New York', 'stadium': 'Yankee Stadium'},
|
||||
'OAK': {'name': 'Oakland Athletics', 'city': 'Sacramento', 'stadium': 'Sutter Health Park'},
|
||||
'PHI': {'name': 'Philadelphia Phillies', 'city': 'Philadelphia', 'stadium': 'Citizens Bank Park'},
|
||||
'PIT': {'name': 'Pittsburgh Pirates', 'city': 'Pittsburgh', 'stadium': 'PNC Park'},
|
||||
'SDP': {'name': 'San Diego Padres', 'city': 'San Diego', 'stadium': 'Petco Park'},
|
||||
'SFG': {'name': 'San Francisco Giants', 'city': 'San Francisco', 'stadium': 'Oracle Park'},
|
||||
'SEA': {'name': 'Seattle Mariners', 'city': 'Seattle', 'stadium': 'T-Mobile Park'},
|
||||
'STL': {'name': 'St. Louis Cardinals', 'city': 'St. Louis', 'stadium': 'Busch Stadium'},
|
||||
'TBR': {'name': 'Tampa Bay Rays', 'city': 'St. Petersburg', 'stadium': 'Tropicana Field'},
|
||||
'TEX': {'name': 'Texas Rangers', 'city': 'Arlington', 'stadium': 'Globe Life Field'},
|
||||
'TOR': {'name': 'Toronto Blue Jays', 'city': 'Toronto', 'stadium': 'Rogers Centre'},
|
||||
'WSN': {'name': 'Washington Nationals', 'city': 'Washington', 'stadium': 'Nationals Park'},
|
||||
}
|
||||
|
||||
|
||||
def get_mlb_team_abbrev(team_name: str) -> str:
    """Get MLB team abbreviation from full name."""
    name = team_name.strip().lower()
    if not name:
        # An empty string is a substring of every name; do not guess a team
        return ''

    # Prefer an exact match over any partial match
    for abbrev, info in MLB_TEAMS.items():
        if info['name'].lower() == name:
            return abbrev
    for abbrev, info in MLB_TEAMS.items():
        if name in info['name'].lower():
            return abbrev

    # Return first 3 letters as fallback
    return team_name[:3].upper()
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# GAME SCRAPERS
|
||||
# =============================================================================
|
||||
|
||||
def scrape_mlb_baseball_reference(season: int) -> list[Game]:
    """
    Scrape MLB schedule from Baseball-Reference.
    URL: https://www.baseball-reference.com/leagues/majors/{YEAR}-schedule.shtml
    """
    games = []
    url = f"https://www.baseball-reference.com/leagues/majors/{season}-schedule.shtml"

    print(f"Scraping MLB {season} from Baseball-Reference...")
    soup = fetch_page(url, 'baseball-reference.com')

    if not soup:
        return games

    # Baseball-Reference groups games by date in h3 headers
    current_date = None

    # Find the schedule section
    schedule_div = soup.find('div', {'id': 'all_schedule'})
    if not schedule_div:
        schedule_div = soup

    # Process all elements to track date context
    for element in schedule_div.find_all(['h3', 'p', 'div']):
        # Check for date header
        if element.name == 'h3':
            date_text = element.get_text(strip=True)
            # Parse date like "Thursday, March 27, 2025"
            for fmt in ['%A, %B %d, %Y', '%B %d, %Y', '%a, %b %d, %Y']:
                try:
                    parsed = datetime.strptime(date_text, fmt)
                    current_date = parsed.strftime('%Y-%m-%d')
                    break
                except ValueError:
                    # Header does not match this format; try the next one
                    continue

        # Check for game entries
        elif element.name == 'p' and 'game' in element.get('class', []):
            if not current_date:
                continue

            try:
                links = element.find_all('a')
                if len(links) >= 2:
                    away_team = links[0].text.strip()
                    home_team = links[1].text.strip()

                    # Generate unique game ID
                    away_abbrev = get_mlb_team_abbrev(away_team)
                    home_abbrev = get_mlb_team_abbrev(home_team)
                    game_id = f"mlb_br_{current_date}_{away_abbrev}_{home_abbrev}".lower()

                    game = Game(
                        id=game_id,
                        sport='MLB',
                        season=str(season),
                        date=current_date,
                        time=None,
                        home_team=home_team,
                        away_team=away_team,
                        home_team_abbrev=home_abbrev,
                        away_team_abbrev=away_abbrev,
                        venue='',
                        source='baseball-reference.com'
                    )
                    games.append(game)

            except Exception:
                # Skip a malformed entry rather than abort the whole scrape
                continue

    print(f" Found {len(games)} games from Baseball-Reference")
    return games
|
||||
|
||||
|
||||
def scrape_mlb_statsapi(season: int) -> list[Game]:
|
||||
"""
|
||||
Fetch MLB schedule from official Stats API (JSON).
|
||||
URL: https://statsapi.mlb.com/api/v1/schedule?sportId=1&season={YEAR}&gameType=R
|
||||
"""
|
||||
games = []
|
||||
url = f"https://statsapi.mlb.com/api/v1/schedule?sportId=1&season={season}&gameType=R&hydrate=team,venue"
|
||||
|
||||
print(f"Fetching MLB {season} from Stats API...")
|
||||
|
||||
try:
|
||||
response = requests.get(url, timeout=30)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
|
||||
for date_entry in data.get('dates', []):
|
||||
game_date = date_entry.get('date', '')
|
||||
|
||||
for game_data in date_entry.get('games', []):
|
||||
try:
|
||||
teams = game_data.get('teams', {})
|
||||
away = teams.get('away', {}).get('team', {})
|
||||
home = teams.get('home', {}).get('team', {})
|
||||
venue = game_data.get('venue', {})
|
||||
|
||||
game_time = game_data.get('gameDate', '')
|
||||
if 'T' in game_time:
|
||||
time_str = game_time.split('T')[1][:5]
|
||||
else:
|
||||
time_str = None
|
||||
|
||||
game = Game(
|
||||
id='', # Will be assigned by assign_stable_ids
|
||||
sport='MLB',
|
||||
season=str(season),
|
||||
date=game_date,
|
||||
time=time_str,
|
||||
home_team=home.get('name', ''),
|
||||
away_team=away.get('name', ''),
|
||||
home_team_abbrev=home.get('abbreviation', ''),
|
||||
away_team_abbrev=away.get('abbreviation', ''),
|
||||
venue=venue.get('name', ''),
|
||||
source='statsapi.mlb.com'
|
||||
)
|
||||
games.append(game)
|
||||
|
||||
except Exception as e:
|
||||
continue
|
||||
|
||||
except Exception as e:
|
||||
print(f" Error fetching MLB API: {e}")
|
||||
|
||||
print(f" Found {len(games)} games from MLB Stats API")
|
||||
return games
|
||||
|
||||
|
||||
def scrape_mlb_espn(season: int) -> list[Game]:
|
||||
"""Fetch MLB schedule from ESPN API."""
|
||||
games = []
|
||||
print(f"Fetching MLB {season} from ESPN API...")
|
||||
|
||||
# MLB regular season: Late March - Early October
|
||||
start = f"{season}0320"
|
||||
end = f"{season}1010"
|
||||
|
||||
url = "https://site.api.espn.com/apis/site/v2/sports/baseball/mlb/scoreboard"
|
||||
params = {
|
||||
'dates': f"{start}-{end}",
|
||||
'limit': 1000
|
||||
}
|
||||
|
||||
headers = {
|
||||
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
|
||||
}
|
||||
|
||||
try:
|
||||
response = requests.get(url, params=params, headers=headers, timeout=30)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
|
||||
events = data.get('events', [])
|
||||
|
||||
for event in events:
|
||||
try:
|
||||
date_str = event.get('date', '')[:10]
|
||||
time_str = event.get('date', '')[11:16] if len(event.get('date', '')) > 11 else None
|
||||
|
||||
competitions = event.get('competitions', [{}])
|
||||
if not competitions:
|
||||
continue
|
||||
|
||||
comp = competitions[0]
|
||||
competitors = comp.get('competitors', [])
|
||||
|
||||
if len(competitors) < 2:
|
||||
continue
|
||||
|
||||
home_team = away_team = home_abbrev = away_abbrev = None
|
||||
|
||||
for team in competitors:
|
||||
team_data = team.get('team', {})
|
||||
team_name = team_data.get('displayName', team_data.get('name', ''))
|
||||
team_abbrev = team_data.get('abbreviation', '')
|
||||
|
||||
if team.get('homeAway') == 'home':
|
||||
home_team = team_name
|
||||
home_abbrev = team_abbrev
|
||||
else:
|
||||
away_team = team_name
|
||||
away_abbrev = team_abbrev
|
||||
|
||||
if not home_team or not away_team:
|
||||
continue
|
||||
|
||||
venue = comp.get('venue', {}).get('fullName', '')
|
||||
|
||||
game_id = f"mlb_{date_str}_{away_abbrev}_{home_abbrev}".lower()
|
||||
|
||||
game = Game(
|
||||
id=game_id,
|
||||
sport='MLB',
|
||||
season=str(season),
|
||||
date=date_str,
|
||||
time=time_str,
|
||||
home_team=home_team,
|
||||
away_team=away_team,
|
||||
home_team_abbrev=home_abbrev or get_mlb_team_abbrev(home_team),
|
||||
away_team_abbrev=away_abbrev or get_mlb_team_abbrev(away_team),
|
||||
venue=venue,
|
||||
source='espn.com'
|
||||
)
|
||||
games.append(game)
|
||||
|
||||
except Exception:
|
||||
continue
|
||||
|
||||
print(f" Found {len(games)} games from ESPN")
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error fetching ESPN MLB: {e}")
|
||||
|
||||
return games
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# STADIUM SCRAPERS
|
||||
# =============================================================================
|
||||
|
||||
def scrape_mlb_stadiums_scorebot() -> list[Stadium]:
|
||||
"""
|
||||
Source 1: MLBScoreBot/ballparks GitHub (public domain).
|
||||
"""
|
||||
stadiums = []
|
||||
url = "https://raw.githubusercontent.com/MLBScoreBot/ballparks/main/ballparks.json"
|
||||
|
||||
response = requests.get(url, timeout=30)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
|
||||
for name, info in data.items():
|
||||
stadium = Stadium(
|
||||
id=f"mlb_{name.lower().replace(' ', '_')[:30]}",
|
||||
name=name,
|
||||
city=info.get('city', ''),
|
||||
state=info.get('state', ''),
|
||||
latitude=info.get('lat', 0) / 1000000 if info.get('lat') else 0,
|
||||
longitude=info.get('long', 0) / 1000000 if info.get('long') else 0,
|
||||
capacity=info.get('capacity', 0),
|
||||
sport='MLB',
|
||||
team_abbrevs=[info.get('team', '')],
|
||||
source='github.com/MLBScoreBot'
|
||||
)
|
||||
stadiums.append(stadium)
|
||||
|
||||
return stadiums
|
||||
|
||||
|
||||
def scrape_mlb_stadiums_geojson() -> list[Stadium]:
|
||||
"""
|
||||
Source 2: cageyjames/GeoJSON-Ballparks GitHub.
|
||||
"""
|
||||
stadiums = []
|
||||
url = "https://raw.githubusercontent.com/cageyjames/GeoJSON-Ballparks/master/ballparks.geojson"
|
||||
|
||||
response = requests.get(url, timeout=30)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
|
||||
for feature in data.get('features', []):
|
||||
props = feature.get('properties', {})
|
||||
coords = feature.get('geometry', {}).get('coordinates', [0, 0])
|
||||
|
||||
# Only include MLB stadiums (filter by League)
|
||||
if props.get('League', '').upper() != 'MLB':
|
||||
continue
|
||||
|
||||
stadium = Stadium(
|
||||
id=f"mlb_{props.get('Ballpark', '').lower().replace(' ', '_')[:30]}",
|
||||
name=props.get('Ballpark', ''),
|
||||
city=props.get('City', ''),
|
||||
state=props.get('State', ''),
|
||||
latitude=coords[1] if len(coords) > 1 else 0,
|
||||
longitude=coords[0] if len(coords) > 0 else 0,
|
||||
capacity=0, # Not in this dataset
|
||||
sport='MLB',
|
||||
team_abbrevs=[props.get('Team', '')],
|
||||
source='github.com/cageyjames'
|
||||
)
|
||||
stadiums.append(stadium)
|
||||
|
||||
return stadiums
|
||||
|
||||
|
||||
def scrape_mlb_stadiums_hardcoded() -> list[Stadium]:
|
||||
"""
|
||||
Source 3: Hardcoded MLB ballparks (fallback).
|
||||
"""
|
||||
mlb_ballparks = {
|
||||
'Chase Field': {'city': 'Phoenix', 'state': 'AZ', 'lat': 33.4453, 'lng': -112.0667, 'capacity': 48519, 'teams': ['ARI'], 'year_opened': 1998},
|
||||
'Truist Park': {'city': 'Atlanta', 'state': 'GA', 'lat': 33.8907, 'lng': -84.4677, 'capacity': 41084, 'teams': ['ATL'], 'year_opened': 2017},
|
||||
'Oriole Park at Camden Yards': {'city': 'Baltimore', 'state': 'MD', 'lat': 39.2839, 'lng': -76.6216, 'capacity': 44970, 'teams': ['BAL'], 'year_opened': 1992},
|
||||
'Fenway Park': {'city': 'Boston', 'state': 'MA', 'lat': 42.3467, 'lng': -71.0972, 'capacity': 37755, 'teams': ['BOS'], 'year_opened': 1912},
|
||||
'Wrigley Field': {'city': 'Chicago', 'state': 'IL', 'lat': 41.9484, 'lng': -87.6553, 'capacity': 41649, 'teams': ['CHC'], 'year_opened': 1914},
|
||||
'Guaranteed Rate Field': {'city': 'Chicago', 'state': 'IL', 'lat': 41.8299, 'lng': -87.6338, 'capacity': 40615, 'teams': ['CHW'], 'year_opened': 1991},
|
||||
'Great American Ball Park': {'city': 'Cincinnati', 'state': 'OH', 'lat': 39.0979, 'lng': -84.5082, 'capacity': 42319, 'teams': ['CIN'], 'year_opened': 2003},
|
||||
'Progressive Field': {'city': 'Cleveland', 'state': 'OH', 'lat': 41.4958, 'lng': -81.6853, 'capacity': 34830, 'teams': ['CLE'], 'year_opened': 1994},
|
||||
'Coors Field': {'city': 'Denver', 'state': 'CO', 'lat': 39.7559, 'lng': -104.9942, 'capacity': 50144, 'teams': ['COL'], 'year_opened': 1995},
|
||||
'Comerica Park': {'city': 'Detroit', 'state': 'MI', 'lat': 42.3390, 'lng': -83.0485, 'capacity': 41083, 'teams': ['DET'], 'year_opened': 2000},
|
||||
'Minute Maid Park': {'city': 'Houston', 'state': 'TX', 'lat': 29.7573, 'lng': -95.3555, 'capacity': 41168, 'teams': ['HOU'], 'year_opened': 2000},
|
||||
'Kauffman Stadium': {'city': 'Kansas City', 'state': 'MO', 'lat': 39.0517, 'lng': -94.4803, 'capacity': 37903, 'teams': ['KCR'], 'year_opened': 1973},
|
||||
'Angel Stadium': {'city': 'Anaheim', 'state': 'CA', 'lat': 33.8003, 'lng': -117.8827, 'capacity': 45517, 'teams': ['LAA'], 'year_opened': 1966},
|
||||
'Dodger Stadium': {'city': 'Los Angeles', 'state': 'CA', 'lat': 34.0739, 'lng': -118.2400, 'capacity': 56000, 'teams': ['LAD'], 'year_opened': 1962},
|
||||
'LoanDepot Park': {'city': 'Miami', 'state': 'FL', 'lat': 25.7781, 'lng': -80.2196, 'capacity': 36742, 'teams': ['MIA'], 'year_opened': 2012},
|
||||
'American Family Field': {'city': 'Milwaukee', 'state': 'WI', 'lat': 43.0280, 'lng': -87.9712, 'capacity': 41900, 'teams': ['MIL'], 'year_opened': 2001},
|
||||
'Target Field': {'city': 'Minneapolis', 'state': 'MN', 'lat': 44.9818, 'lng': -93.2775, 'capacity': 38544, 'teams': ['MIN'], 'year_opened': 2010},
|
||||
'Citi Field': {'city': 'Queens', 'state': 'NY', 'lat': 40.7571, 'lng': -73.8458, 'capacity': 41922, 'teams': ['NYM'], 'year_opened': 2009},
|
||||
'Yankee Stadium': {'city': 'Bronx', 'state': 'NY', 'lat': 40.8296, 'lng': -73.9262, 'capacity': 46537, 'teams': ['NYY'], 'year_opened': 2009},
|
||||
'Sutter Health Park': {'city': 'Sacramento', 'state': 'CA', 'lat': 38.5803, 'lng': -121.5108, 'capacity': 14014, 'teams': ['OAK'], 'year_opened': 2000},
|
||||
'Citizens Bank Park': {'city': 'Philadelphia', 'state': 'PA', 'lat': 39.9061, 'lng': -75.1665, 'capacity': 42901, 'teams': ['PHI'], 'year_opened': 2004},
|
||||
'PNC Park': {'city': 'Pittsburgh', 'state': 'PA', 'lat': 40.4469, 'lng': -80.0057, 'capacity': 38362, 'teams': ['PIT'], 'year_opened': 2001},
|
||||
'Petco Park': {'city': 'San Diego', 'state': 'CA', 'lat': 32.7073, 'lng': -117.1566, 'capacity': 40209, 'teams': ['SDP'], 'year_opened': 2004},
|
||||
'Oracle Park': {'city': 'San Francisco', 'state': 'CA', 'lat': 37.7786, 'lng': -122.3893, 'capacity': 41915, 'teams': ['SFG'], 'year_opened': 2000},
|
||||
'T-Mobile Park': {'city': 'Seattle', 'state': 'WA', 'lat': 47.5914, 'lng': -122.3325, 'capacity': 47929, 'teams': ['SEA'], 'year_opened': 1999},
|
||||
'Busch Stadium': {'city': 'St. Louis', 'state': 'MO', 'lat': 38.6226, 'lng': -90.1928, 'capacity': 45538, 'teams': ['STL'], 'year_opened': 2006},
|
||||
'Tropicana Field': {'city': 'St. Petersburg', 'state': 'FL', 'lat': 27.7682, 'lng': -82.6534, 'capacity': 25000, 'teams': ['TBR'], 'year_opened': 1990},
|
||||
'Globe Life Field': {'city': 'Arlington', 'state': 'TX', 'lat': 32.7473, 'lng': -97.0844, 'capacity': 40300, 'teams': ['TEX'], 'year_opened': 2020},
|
||||
'Rogers Centre': {'city': 'Toronto', 'state': 'ON', 'lat': 43.6414, 'lng': -79.3894, 'capacity': 49282, 'teams': ['TOR'], 'year_opened': 1989},
|
||||
'Nationals Park': {'city': 'Washington', 'state': 'DC', 'lat': 38.8729, 'lng': -77.0074, 'capacity': 41339, 'teams': ['WSN'], 'year_opened': 2008},
|
||||
}
|
||||
|
||||
stadiums = []
|
||||
for name, info in mlb_ballparks.items():
|
||||
stadium = Stadium(
|
||||
id=f"mlb_{name.lower().replace(' ', '_')[:30]}",
|
||||
name=name,
|
||||
city=info['city'],
|
||||
state=info['state'],
|
||||
latitude=info['lat'],
|
||||
longitude=info['lng'],
|
||||
capacity=info['capacity'],
|
||||
sport='MLB',
|
||||
team_abbrevs=info['teams'],
|
||||
source='mlb_hardcoded',
|
||||
year_opened=info.get('year_opened')
|
||||
)
|
||||
stadiums.append(stadium)
|
||||
|
||||
return stadiums
|
||||
|
||||
|
||||
def scrape_mlb_stadiums() -> list[Stadium]:
|
||||
"""
|
||||
Fetch MLB stadium data with multi-source fallback.
|
||||
"""
|
||||
print("\nMLB STADIUMS")
|
||||
print("-" * 40)
|
||||
|
||||
sources = [
|
||||
StadiumScraperSource('MLBScoreBot', scrape_mlb_stadiums_scorebot, priority=1, min_venues=25),
|
||||
StadiumScraperSource('GeoJSON-Ballparks', scrape_mlb_stadiums_geojson, priority=2, min_venues=25),
|
||||
StadiumScraperSource('Hardcoded', scrape_mlb_stadiums_hardcoded, priority=3, min_venues=25),
|
||||
]
|
||||
|
||||
return scrape_stadiums_with_fallback('MLB', sources)
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# SOURCE CONFIGURATIONS
|
||||
# =============================================================================
|
||||
|
||||
MLB_GAME_SOURCES = [
|
||||
ScraperSource('MLB Stats API', scrape_mlb_statsapi, priority=1, min_games=100),
|
||||
ScraperSource('Baseball-Reference', scrape_mlb_baseball_reference, priority=2, min_games=100),
|
||||
ScraperSource('ESPN', scrape_mlb_espn, priority=3, min_games=100),
|
||||
]
|
||||
|
||||
MLB_STADIUM_SOURCES = [
|
||||
StadiumScraperSource('MLBScoreBot', scrape_mlb_stadiums_scorebot, priority=1, min_venues=25),
|
||||
StadiumScraperSource('GeoJSON-Ballparks', scrape_mlb_stadiums_geojson, priority=2, min_venues=25),
|
||||
StadiumScraperSource('Hardcoded', scrape_mlb_stadiums_hardcoded, priority=3, min_venues=25),
|
||||
]
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# CONVENIENCE FUNCTIONS
|
||||
# =============================================================================
|
||||
|
||||
def scrape_mlb_games(season: int) -> list[Game]:
|
||||
"""
|
||||
Scrape MLB games for a season using multi-source fallback.
|
||||
|
||||
Args:
|
||||
season: Season year (e.g., 2026)
|
||||
|
||||
Returns:
|
||||
List of Game objects from the first successful source
|
||||
"""
|
||||
print(f"\nMLB {season} SCHEDULE")
|
||||
print("-" * 40)
|
||||
|
||||
return scrape_with_fallback('MLB', season, MLB_GAME_SOURCES)
|
||||
-343
@@ -1,343 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
MLS stadium scrapers and team mappings for SportsTime.
|
||||
|
||||
This module provides:
|
||||
- MLS team mappings and abbreviation lookup (no game scrapers are defined in this module)
|
||||
- MLS stadium scrapers (gavinr GeoJSON, hardcoded)
|
||||
- Multi-source fallback configurations
|
||||
"""
|
||||
|
||||
from typing import Optional
|
||||
|
||||
import requests
|
||||
|
||||
# Support both direct execution and import from parent directory
|
||||
try:
|
||||
from core import (
|
||||
Game,
|
||||
Stadium,
|
||||
ScraperSource,
|
||||
StadiumScraperSource,
|
||||
fetch_page,
|
||||
scrape_with_fallback,
|
||||
scrape_stadiums_with_fallback,
|
||||
)
|
||||
except ImportError:
|
||||
from Scripts.core import (
|
||||
Game,
|
||||
Stadium,
|
||||
ScraperSource,
|
||||
StadiumScraperSource,
|
||||
fetch_page,
|
||||
scrape_with_fallback,
|
||||
scrape_stadiums_with_fallback,
|
||||
)
|
||||
|
||||
|
||||
__all__ = [
|
||||
# Team data
|
||||
'MLS_TEAMS',
|
||||
# Stadium scrapers
|
||||
'scrape_mls_stadiums_hardcoded',
|
||||
'scrape_mls_stadiums_gavinr',
|
||||
'scrape_mls_stadiums',
|
||||
# Source configurations
|
||||
'MLS_STADIUM_SOURCES',
|
||||
# Convenience functions
|
||||
'get_mls_team_abbrev',
|
||||
]
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# TEAM MAPPINGS
|
||||
# =============================================================================
|
||||
|
||||
MLS_TEAMS = {
|
||||
'ATL': {'name': 'Atlanta United FC', 'city': 'Atlanta', 'stadium': 'Mercedes-Benz Stadium'},
|
||||
'AUS': {'name': 'Austin FC', 'city': 'Austin', 'stadium': 'Q2 Stadium'},
|
||||
'CLT': {'name': 'Charlotte FC', 'city': 'Charlotte', 'stadium': 'Bank of America Stadium'},
|
||||
'CHI': {'name': 'Chicago Fire FC', 'city': 'Chicago', 'stadium': 'Soldier Field'},
|
||||
'CIN': {'name': 'FC Cincinnati', 'city': 'Cincinnati', 'stadium': 'TQL Stadium'},
|
||||
'COL': {'name': 'Colorado Rapids', 'city': 'Commerce City', 'stadium': "Dick's Sporting Goods Park"},
|
||||
'CLB': {'name': 'Columbus Crew', 'city': 'Columbus', 'stadium': 'Lower.com Field'},
|
||||
'DAL': {'name': 'FC Dallas', 'city': 'Frisco', 'stadium': 'Toyota Stadium'},
|
||||
'DC': {'name': 'D.C. United', 'city': 'Washington', 'stadium': 'Audi Field'},
|
||||
'HOU': {'name': 'Houston Dynamo FC', 'city': 'Houston', 'stadium': 'Shell Energy Stadium'},
|
||||
'LAG': {'name': 'LA Galaxy', 'city': 'Carson', 'stadium': 'Dignity Health Sports Park'},
|
||||
'LAFC': {'name': 'Los Angeles FC', 'city': 'Los Angeles', 'stadium': 'BMO Stadium'},
|
||||
'MIA': {'name': 'Inter Miami CF', 'city': 'Fort Lauderdale', 'stadium': 'Chase Stadium'},
|
||||
'MIN': {'name': 'Minnesota United FC', 'city': 'Saint Paul', 'stadium': 'Allianz Field'},
|
||||
'MTL': {'name': 'CF Montreal', 'city': 'Montreal', 'stadium': 'Stade Saputo'},
|
||||
'NSH': {'name': 'Nashville SC', 'city': 'Nashville', 'stadium': 'Geodis Park'},
|
||||
'NE': {'name': 'New England Revolution', 'city': 'Foxborough', 'stadium': 'Gillette Stadium'},
|
||||
'NYCFC': {'name': 'New York City FC', 'city': 'New York', 'stadium': 'Yankee Stadium'},
|
||||
'NYRB': {'name': 'New York Red Bulls', 'city': 'Harrison', 'stadium': 'Red Bull Arena'},
|
||||
'ORL': {'name': 'Orlando City SC', 'city': 'Orlando', 'stadium': 'Inter&Co Stadium'},
|
||||
'PHI': {'name': 'Philadelphia Union', 'city': 'Chester', 'stadium': 'Subaru Park'},
|
||||
'POR': {'name': 'Portland Timbers', 'city': 'Portland', 'stadium': 'Providence Park'},
|
||||
'RSL': {'name': 'Real Salt Lake', 'city': 'Sandy', 'stadium': 'America First Field'},
|
||||
'SJ': {'name': 'San Jose Earthquakes', 'city': 'San Jose', 'stadium': 'PayPal Park'},
|
||||
'SEA': {'name': 'Seattle Sounders FC', 'city': 'Seattle', 'stadium': 'Lumen Field'},
|
||||
'SKC': {'name': 'Sporting Kansas City', 'city': 'Kansas City', 'stadium': "Children's Mercy Park"},
|
||||
'STL': {'name': 'St. Louis City SC', 'city': 'St. Louis', 'stadium': 'CityPark'},
|
||||
'TOR': {'name': 'Toronto FC', 'city': 'Toronto', 'stadium': 'BMO Field'},
|
||||
'VAN': {'name': 'Vancouver Whitecaps FC', 'city': 'Vancouver', 'stadium': 'BC Place'},
|
||||
'SD': {'name': 'San Diego FC', 'city': 'San Diego', 'stadium': 'Snapdragon Stadium'},
|
||||
}
|
||||
|
||||
|
||||
def get_mls_team_abbrev(team_name: str) -> str:
|
||||
"""Get MLS team abbreviation from full name."""
|
||||
for abbrev, info in MLS_TEAMS.items():
|
||||
if info['name'].lower() == team_name.lower():
|
||||
return abbrev
|
||||
if team_name and team_name.lower() in info['name'].lower():
|
||||
return abbrev
|
||||
|
||||
# Return first 3 letters as fallback
|
||||
return team_name[:3].upper()
|
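The lookup above returns the first entry whose name contains the input, so a partial name such as "New York" resolves to whichever team is listed first. A minimal sketch of a stricter two-pass variant follows. `TEAMS` is a hypothetical three-team subset, `lookup_abbrev` is an illustrative name, and returning `None` instead of a three-letter guess is an assumption about the desired behavior, not what the module does.

```python
from typing import Optional

# Hypothetical three-team subset standing in for the MLS_TEAMS mapping.
TEAMS = {
    'NYCFC': {'name': 'New York City FC'},
    'NYRB': {'name': 'New York Red Bulls'},
    'LAFC': {'name': 'Los Angeles FC'},
}


def lookup_abbrev(team_name: str, teams: dict = TEAMS) -> Optional[str]:
    """Return an abbreviation, or None when the name is empty or ambiguous."""
    needle = team_name.strip().lower()
    if not needle:
        return None  # '' is a substring of every name, so reject it early

    # Pass 1: exact match across the whole table.
    for abbrev, info in teams.items():
        if info['name'].lower() == needle:
            return abbrev

    # Pass 2: substring match, accepted only when exactly one team matches.
    hits = [a for a, info in teams.items() if needle in info['name'].lower()]
    return hits[0] if len(hits) == 1 else None
```

Running the exact-match pass over the whole table first means a full name can never be shadowed by a substring hit on an earlier entry.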
||||
|
||||
|
||||
# =============================================================================
|
||||
# STADIUM SCRAPERS
|
||||
# =============================================================================
|
||||
|
||||
def scrape_mls_stadiums_hardcoded() -> list[Stadium]:
|
||||
"""
|
||||
Source 1: Hardcoded MLS stadiums with complete data.
|
||||
All 30 MLS stadiums with capacity (soccer configuration) and year_opened.
|
||||
"""
|
||||
mls_stadiums = {
|
||||
'Mercedes-Benz Stadium': {
|
||||
'city': 'Atlanta', 'state': 'GA',
|
||||
'lat': 33.7555, 'lng': -84.4000,
|
||||
'capacity': 42500, 'teams': ['ATL'], 'year_opened': 2017
|
||||
},
|
||||
'Q2 Stadium': {
|
||||
'city': 'Austin', 'state': 'TX',
|
||||
'lat': 30.3877, 'lng': -97.7195,
|
||||
'capacity': 20738, 'teams': ['AUS'], 'year_opened': 2021
|
||||
},
|
||||
'Bank of America Stadium': {
|
||||
'city': 'Charlotte', 'state': 'NC',
|
||||
'lat': 35.2258, 'lng': -80.8528,
|
||||
'capacity': 38000, 'teams': ['CLT'], 'year_opened': 1996
|
||||
},
|
||||
'Soldier Field': {
|
||||
'city': 'Chicago', 'state': 'IL',
|
||||
'lat': 41.8623, 'lng': -87.6167,
|
||||
'capacity': 24995, 'teams': ['CHI'], 'year_opened': 1924
|
||||
},
|
||||
'TQL Stadium': {
|
||||
'city': 'Cincinnati', 'state': 'OH',
|
||||
'lat': 39.1114, 'lng': -84.5222,
|
||||
'capacity': 26000, 'teams': ['CIN'], 'year_opened': 2021
|
||||
},
|
||||
"Dick's Sporting Goods Park": {
|
||||
'city': 'Commerce City', 'state': 'CO',
|
||||
'lat': 39.8056, 'lng': -104.8919,
|
||||
'capacity': 18061, 'teams': ['COL'], 'year_opened': 2007
|
||||
},
|
||||
'Lower.com Field': {
|
||||
'city': 'Columbus', 'state': 'OH',
|
||||
'lat': 39.9685, 'lng': -83.0171,
|
||||
'capacity': 20371, 'teams': ['CLB'], 'year_opened': 2021
|
||||
},
|
||||
'Toyota Stadium': {
|
||||
'city': 'Frisco', 'state': 'TX',
|
||||
'lat': 33.1544, 'lng': -96.8353,
|
||||
'capacity': 20500, 'teams': ['DAL'], 'year_opened': 2005
|
||||
},
|
||||
'Audi Field': {
|
||||
'city': 'Washington', 'state': 'DC',
|
||||
'lat': 38.8684, 'lng': -77.0129,
|
||||
'capacity': 20000, 'teams': ['DC'], 'year_opened': 2018
|
||||
},
|
||||
'Shell Energy Stadium': {
|
||||
'city': 'Houston', 'state': 'TX',
|
||||
'lat': 29.7522, 'lng': -95.3524,
|
||||
'capacity': 22039, 'teams': ['HOU'], 'year_opened': 2012
|
||||
},
|
||||
'Dignity Health Sports Park': {
|
||||
'city': 'Carson', 'state': 'CA',
|
||||
'lat': 33.8640, 'lng': -118.2610,
|
||||
'capacity': 27000, 'teams': ['LAG'], 'year_opened': 2003
|
||||
},
|
||||
'BMO Stadium': {
|
||||
'city': 'Los Angeles', 'state': 'CA',
|
||||
'lat': 34.0128, 'lng': -118.2841,
|
||||
'capacity': 22000, 'teams': ['LAFC'], 'year_opened': 2018
|
||||
},
|
||||
'Chase Stadium': {
|
||||
'city': 'Fort Lauderdale', 'state': 'FL',
|
||||
'lat': 26.1933, 'lng': -80.1607,
|
||||
'capacity': 21550, 'teams': ['MIA'], 'year_opened': 2020
|
||||
},
|
||||
'Allianz Field': {
|
||||
'city': 'Saint Paul', 'state': 'MN',
|
||||
'lat': 44.9531, 'lng': -93.1647,
|
||||
'capacity': 19400, 'teams': ['MIN'], 'year_opened': 2019
|
||||
},
|
||||
'Stade Saputo': {
|
||||
'city': 'Montreal', 'state': 'QC',
|
||||
'lat': 45.5631, 'lng': -73.5525,
|
||||
'capacity': 19619, 'teams': ['MTL'], 'year_opened': 2008
|
||||
},
|
||||
'Geodis Park': {
|
||||
'city': 'Nashville', 'state': 'TN',
|
||||
'lat': 36.1301, 'lng': -86.7660,
|
||||
'capacity': 30000, 'teams': ['NSH'], 'year_opened': 2022
|
||||
},
|
||||
'Gillette Stadium': {
|
||||
'city': 'Foxborough', 'state': 'MA',
|
||||
'lat': 42.0909, 'lng': -71.2643,
|
||||
'capacity': 22385, 'teams': ['NE'], 'year_opened': 2002
|
||||
},
|
||||
'Yankee Stadium': {
|
||||
'city': 'Bronx', 'state': 'NY',
|
||||
'lat': 40.8292, 'lng': -73.9264,
|
||||
'capacity': 28000, 'teams': ['NYCFC'], 'year_opened': 2009
|
||||
},
|
||||
'Red Bull Arena': {
|
||||
'city': 'Harrison', 'state': 'NJ',
|
||||
'lat': 40.7367, 'lng': -74.1503,
|
||||
'capacity': 25000, 'teams': ['NYRB'], 'year_opened': 2010
|
||||
},
|
||||
'Inter&Co Stadium': {
|
||||
'city': 'Orlando', 'state': 'FL',
|
||||
'lat': 28.5411, 'lng': -81.3893,
|
||||
'capacity': 25500, 'teams': ['ORL'], 'year_opened': 2017
|
||||
},
|
||||
'Subaru Park': {
|
||||
'city': 'Chester', 'state': 'PA',
|
||||
'lat': 39.8322, 'lng': -75.3789,
|
||||
'capacity': 18500, 'teams': ['PHI'], 'year_opened': 2010
|
||||
},
|
||||
'Providence Park': {
|
||||
'city': 'Portland', 'state': 'OR',
|
||||
'lat': 45.5214, 'lng': -122.6917,
|
||||
'capacity': 25218, 'teams': ['POR'], 'year_opened': 1926
|
||||
},
|
||||
'America First Field': {
|
||||
'city': 'Sandy', 'state': 'UT',
|
||||
'lat': 40.5829, 'lng': -111.8934,
|
||||
'capacity': 20213, 'teams': ['RSL'], 'year_opened': 2008
|
||||
},
|
||||
'PayPal Park': {
|
||||
'city': 'San Jose', 'state': 'CA',
|
||||
'lat': 37.3514, 'lng': -121.9250,
|
||||
'capacity': 18000, 'teams': ['SJ'], 'year_opened': 2015
|
||||
},
|
||||
'Lumen Field': {
|
||||
'city': 'Seattle', 'state': 'WA',
|
||||
'lat': 47.5952, 'lng': -122.3316,
|
||||
'capacity': 37722, 'teams': ['SEA'], 'year_opened': 2002
|
||||
},
|
||||
"Children's Mercy Park": {
|
||||
'city': 'Kansas City', 'state': 'KS',
|
||||
'lat': 39.1217, 'lng': -94.8232,
|
||||
'capacity': 18467, 'teams': ['SKC'], 'year_opened': 2011
|
||||
},
|
||||
'CityPark': {
|
||||
'city': 'St. Louis', 'state': 'MO',
|
||||
'lat': 38.6314, 'lng': -90.2103,
|
||||
'capacity': 22500, 'teams': ['STL'], 'year_opened': 2023
|
||||
},
|
||||
'BMO Field': {
|
||||
'city': 'Toronto', 'state': 'ON',
|
||||
'lat': 43.6332, 'lng': -79.4186,
|
||||
'capacity': 30000, 'teams': ['TOR'], 'year_opened': 2007
|
||||
},
|
||||
'BC Place': {
|
||||
'city': 'Vancouver', 'state': 'BC',
|
||||
'lat': 49.2767, 'lng': -123.1119,
|
||||
'capacity': 22120, 'teams': ['VAN'], 'year_opened': 1983
|
||||
},
|
||||
'Snapdragon Stadium': {
|
||||
'city': 'San Diego', 'state': 'CA',
|
||||
'lat': 32.7844, 'lng': -117.1228,
|
||||
'capacity': 35000, 'teams': ['SD'], 'year_opened': 2022
|
||||
},
|
||||
}
|
||||
|
||||
stadiums = []
|
||||
for name, info in mls_stadiums.items():
|
||||
# Build the normalized ID outside the f-string: before Python 3.12, f-string expressions cannot contain backslashes or reuse the enclosing quote character
|
||||
normalized_name = name.lower().replace(' ', '_').replace('&', 'and').replace('.', '').replace("'", '')
|
||||
stadium_id = f"mls_{normalized_name[:30]}"
|
||||
stadium = Stadium(
|
||||
id=stadium_id,
|
||||
name=name,
|
||||
city=info['city'],
|
||||
state=info['state'],
|
||||
latitude=info['lat'],
|
||||
longitude=info['lng'],
|
||||
capacity=info['capacity'],
|
||||
sport='MLS',
|
||||
team_abbrevs=info['teams'],
|
||||
source='mls_hardcoded',
|
||||
year_opened=info.get('year_opened')
|
||||
)
|
||||
stadiums.append(stadium)
|
||||
|
||||
return stadiums
|
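The chained `replace` calls that build the stadium ID above could also be written as one reusable helper. This is a sketch only: `make_stadium_id` is a hypothetical name, and unlike the original it also turns hyphens and other punctuation into underscores, so IDs for names such as "Mercedes-Benz Stadium" would differ from the ones generated above.

```python
import re


def make_stadium_id(sport: str, name: str, max_len: int = 30) -> str:
    """Build an ID such as 'mls_lowercom_field' from a display name."""
    slug = name.lower().replace('&', 'and')
    slug = re.sub(r"['.]", '', slug)           # drop apostrophes and dots
    slug = re.sub(r'[^a-z0-9]+', '_', slug)    # collapse everything else to '_'
    return f"{sport.lower()}_{slug.strip('_')[:max_len]}"
```

A shared helper like this would keep the hardcoded and GeoJSON sources from generating different IDs for the same venue.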
||||
|
||||
|
||||
def scrape_mls_stadiums_gavinr() -> list[Stadium]:
|
||||
"""
|
||||
Source 2: gavinr/usa-soccer GeoJSON (fallback for coordinates).
|
||||
Note: This source lacks capacity and year_opened data.
|
||||
"""
|
||||
stadiums = []
|
||||
url = "https://raw.githubusercontent.com/gavinr/usa-soccer/master/mls.geojson"
|
||||
|
||||
response = requests.get(url, timeout=30)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
|
||||
for feature in data.get('features', []):
|
||||
props = feature.get('properties', {})
|
||||
coords = (feature.get('geometry') or {}).get('coordinates', [0, 0])  # 'geometry' may be null in GeoJSON
|
||||
|
||||
stadium = Stadium(
|
||||
id=f"mls_{props.get('stadium', '').lower().replace(' ', '_')[:30]}",
|
||||
name=props.get('stadium', ''),
|
||||
city=props.get('city', ''),
|
||||
state=props.get('state', ''),
|
||||
latitude=coords[1] if len(coords) > 1 else 0,
|
||||
longitude=coords[0] if len(coords) > 0 else 0,
|
||||
capacity=props.get('capacity', 0),
|
||||
sport='MLS',
|
||||
team_abbrevs=[get_mls_team_abbrev(props.get('team', ''))],
|
||||
source='github.com/gavinr'
|
||||
)
|
||||
stadiums.append(stadium)
|
||||
|
||||
return stadiums
|
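GeoJSON stores positions as `[longitude, latitude]`, which is why the scraper above reads `coords[1]` as the latitude. The sketch below isolates that step and returns `None` instead of a `0, 0` placeholder when a feature has no usable point. `feature_lat_lng` and the sample feature are illustrative, not part of the module, and the `stadium` property name is taken from the code above rather than checked against the upstream file.

```python
from typing import Optional, Tuple


def feature_lat_lng(feature: dict) -> Optional[Tuple[float, float]]:
    """Return (lat, lng) for a GeoJSON Point feature, or None if unusable."""
    geometry = feature.get('geometry') or {}   # 'geometry' may be null
    coords = geometry.get('coordinates') or []
    if geometry.get('type') != 'Point' or len(coords) < 2:
        return None
    lng, lat = coords[0], coords[1]            # GeoJSON order: [lng, lat]
    return float(lat), float(lng)


sample = {
    'type': 'Feature',
    'properties': {'stadium': 'Audi Field'},   # property name as used above
    'geometry': {'type': 'Point', 'coordinates': [-77.0129, 38.8684]},
}
```

Returning `None` lets the caller skip a venue rather than place it at latitude 0, longitude 0.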
||||
|
||||
|
||||
def scrape_mls_stadiums() -> list[Stadium]:
|
||||
"""
|
||||
Fetch MLS stadium data with multi-source fallback.
|
||||
Hardcoded source is primary (has complete data).
|
||||
"""
|
||||
print("\nMLS STADIUMS")
|
||||
print("-" * 40)
|
||||
|
||||
sources = [
|
||||
StadiumScraperSource('Hardcoded', scrape_mls_stadiums_hardcoded, priority=1, min_venues=25),
|
||||
StadiumScraperSource('gavinr GeoJSON', scrape_mls_stadiums_gavinr, priority=2, min_venues=20),
|
||||
]
|
||||
|
||||
return scrape_stadiums_with_fallback('MLS', sources)
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# SOURCE CONFIGURATIONS
|
||||
# =============================================================================
|
||||
|
||||
MLS_STADIUM_SOURCES = [
|
||||
StadiumScraperSource('Hardcoded', scrape_mls_stadiums_hardcoded, priority=1, min_venues=25),
|
||||
StadiumScraperSource('gavinr GeoJSON', scrape_mls_stadiums_gavinr, priority=2, min_venues=20),
|
||||
]
|
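`scrape_stadiums_with_fallback` is imported from `core` and does not appear in this diff. The sketch below shows one way a priority-ordered fallback with a minimum-result threshold could work. `Source` and `first_successful` are hypothetical stand-ins whose field names mirror the keyword arguments used above; the real helper may behave differently.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Source:
    # Hypothetical stand-in for StadiumScraperSource.
    name: str
    scraper: Callable[[], list]
    priority: int = 1
    min_venues: int = 0


def first_successful(sources: List[Source]) -> list:
    """Try sources in priority order; return the first adequate result."""
    for source in sorted(sources, key=lambda s: s.priority):
        try:
            venues = source.scraper()
        except Exception as exc:  # a failing source should not abort the run
            print(f"  {source.name} failed: {exc}")
            continue
        if len(venues) >= source.min_venues:
            return venues
        print(f"  {source.name} returned only {len(venues)} venues")
    return []
```

Under this reading, `min_venues` is what lets an incomplete response from a higher-priority source fall through to the next one.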
||||
-412
@@ -1,412 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
NBA schedule and stadium scrapers for SportsTime.
|
||||
|
||||
This module provides:
|
||||
- NBA game scrapers (Basketball-Reference, ESPN, CBS Sports)
|
||||
- NBA stadium scrapers (hardcoded with coordinates)
|
||||
- Multi-source fallback configurations
|
||||
"""
|
||||
|
||||
from datetime import datetime, timedelta
|
||||
from typing import Optional
|
||||
|
||||
import requests
|
||||
|
||||
# Support both direct execution and import from parent directory
|
||||
try:
|
||||
from core import (
|
||||
Game,
|
||||
Stadium,
|
||||
ScraperSource,
|
||||
StadiumScraperSource,
|
||||
fetch_page,
|
||||
scrape_with_fallback,
|
||||
scrape_stadiums_with_fallback,
|
||||
)
|
||||
except ImportError:
|
||||
from Scripts.core import (
|
||||
Game,
|
||||
Stadium,
|
||||
ScraperSource,
|
||||
StadiumScraperSource,
|
||||
fetch_page,
|
||||
scrape_with_fallback,
|
||||
scrape_stadiums_with_fallback,
|
||||
)
|
||||
|
||||
|
||||
__all__ = [
|
||||
# Team data
|
||||
'NBA_TEAMS',
|
||||
# Game scrapers
|
||||
'scrape_nba_basketball_reference',
|
||||
'scrape_nba_espn',
|
||||
'scrape_nba_cbssports',
|
||||
# Stadium scrapers
|
||||
'scrape_nba_stadiums',
|
||||
# Source configurations
|
||||
'NBA_GAME_SOURCES',
|
||||
'NBA_STADIUM_SOURCES',
|
||||
# Convenience functions
|
||||
'scrape_nba_games',
|
||||
'get_nba_season_string',
|
||||
]
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# TEAM MAPPINGS
|
||||
# =============================================================================
|
||||
|
||||
NBA_TEAMS = {
|
||||
'ATL': {'name': 'Atlanta Hawks', 'city': 'Atlanta', 'arena': 'State Farm Arena'},
|
||||
'BOS': {'name': 'Boston Celtics', 'city': 'Boston', 'arena': 'TD Garden'},
|
||||
'BRK': {'name': 'Brooklyn Nets', 'city': 'Brooklyn', 'arena': 'Barclays Center'},
|
||||
'CHO': {'name': 'Charlotte Hornets', 'city': 'Charlotte', 'arena': 'Spectrum Center'},
|
||||
'CHI': {'name': 'Chicago Bulls', 'city': 'Chicago', 'arena': 'United Center'},
|
||||
'CLE': {'name': 'Cleveland Cavaliers', 'city': 'Cleveland', 'arena': 'Rocket Mortgage FieldHouse'},
|
||||
'DAL': {'name': 'Dallas Mavericks', 'city': 'Dallas', 'arena': 'American Airlines Center'},
|
||||
'DEN': {'name': 'Denver Nuggets', 'city': 'Denver', 'arena': 'Ball Arena'},
|
||||
'DET': {'name': 'Detroit Pistons', 'city': 'Detroit', 'arena': 'Little Caesars Arena'},
|
||||
'GSW': {'name': 'Golden State Warriors', 'city': 'San Francisco', 'arena': 'Chase Center'},
|
||||
'HOU': {'name': 'Houston Rockets', 'city': 'Houston', 'arena': 'Toyota Center'},
|
||||
'IND': {'name': 'Indiana Pacers', 'city': 'Indianapolis', 'arena': 'Gainbridge Fieldhouse'},
|
||||
'LAC': {'name': 'Los Angeles Clippers', 'city': 'Inglewood', 'arena': 'Intuit Dome'},
|
||||
'LAL': {'name': 'Los Angeles Lakers', 'city': 'Los Angeles', 'arena': 'Crypto.com Arena'},
|
||||
'MEM': {'name': 'Memphis Grizzlies', 'city': 'Memphis', 'arena': 'FedExForum'},
|
||||
'MIA': {'name': 'Miami Heat', 'city': 'Miami', 'arena': 'Kaseya Center'},
|
||||
'MIL': {'name': 'Milwaukee Bucks', 'city': 'Milwaukee', 'arena': 'Fiserv Forum'},
|
||||
'MIN': {'name': 'Minnesota Timberwolves', 'city': 'Minneapolis', 'arena': 'Target Center'},
|
||||
'NOP': {'name': 'New Orleans Pelicans', 'city': 'New Orleans', 'arena': 'Smoothie King Center'},
|
||||
'NYK': {'name': 'New York Knicks', 'city': 'New York', 'arena': 'Madison Square Garden'},
|
||||
'OKC': {'name': 'Oklahoma City Thunder', 'city': 'Oklahoma City', 'arena': 'Paycom Center'},
|
||||
'ORL': {'name': 'Orlando Magic', 'city': 'Orlando', 'arena': 'Kia Center'},
|
||||
'PHI': {'name': 'Philadelphia 76ers', 'city': 'Philadelphia', 'arena': 'Wells Fargo Center'},
|
||||
'PHO': {'name': 'Phoenix Suns', 'city': 'Phoenix', 'arena': 'Footprint Center'},
|
||||
'POR': {'name': 'Portland Trail Blazers', 'city': 'Portland', 'arena': 'Moda Center'},
|
||||
'SAC': {'name': 'Sacramento Kings', 'city': 'Sacramento', 'arena': 'Golden 1 Center'},
|
||||
'SAS': {'name': 'San Antonio Spurs', 'city': 'San Antonio', 'arena': 'Frost Bank Center'},
|
||||
'TOR': {'name': 'Toronto Raptors', 'city': 'Toronto', 'arena': 'Scotiabank Arena'},
|
||||
'UTA': {'name': 'Utah Jazz', 'city': 'Salt Lake City', 'arena': 'Delta Center'},
|
||||
'WAS': {'name': 'Washington Wizards', 'city': 'Washington', 'arena': 'Capital One Arena'},
|
||||
}
|
||||
|
||||
|
||||
def get_nba_team_abbrev(team_name: str) -> str:
|
||||
"""Get NBA team abbreviation from full name."""
|
||||
for abbrev, info in NBA_TEAMS.items():
|
||||
if info['name'].lower() == team_name.lower():
|
||||
return abbrev
|
||||
if team_name and team_name.lower() in info['name'].lower():
|
||||
return abbrev
|
||||
|
||||
# Return first 3 letters as fallback
|
||||
return team_name[:3].upper()
|
||||
|
||||
|
||||
def get_nba_season_string(season: int) -> str:
|
||||
"""
|
||||
Get NBA season string in "2024-25" format.
|
||||
|
||||
Args:
|
||||
season: The ending year of the season (e.g., 2025 for 2024-25 season)
|
||||
|
||||
Returns:
|
||||
Season string like "2024-25"
|
||||
"""
|
||||
return f"{season-1}-{str(season)[2:]}"
|
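As a quick check of the season-string convention described above, here is an arithmetic variant. `season_string` is an illustrative name, and it should agree with the slicing version for four-digit years.

```python
def season_string(season: int) -> str:
    """Format the ending year of a season as 'YYYY-YY'."""
    # season % 100 keeps the last two digits; :02d preserves a leading zero.
    return f"{season - 1}-{season % 100:02d}"
```

For example, `season_string(2025)` gives `"2024-25"`, and a century boundary such as 2000 gives `"1999-00"`.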
||||
|
||||
|
||||
# =============================================================================
|
||||
# GAME SCRAPERS
|
||||
# =============================================================================
|
||||
|
||||
def scrape_nba_basketball_reference(season: int) -> list[Game]:
|
||||
"""
|
||||
Scrape NBA schedule from Basketball-Reference.
|
||||
URL: https://www.basketball-reference.com/leagues/NBA_{YEAR}_games-{month}.html
|
||||
Season year is the ending year (e.g., 2025 for 2024-25 season)
|
||||
"""
|
||||
games = []
|
||||
months = ['october', 'november', 'december', 'january', 'february', 'march', 'april', 'may', 'june']
|
||||
|
||||
print(f"Scraping NBA {season} from Basketball-Reference...")
|
||||
|
||||
for month in months:
|
||||
url = f"https://www.basketball-reference.com/leagues/NBA_{season}_games-{month}.html"
|
||||
soup = fetch_page(url, 'basketball-reference.com')
|
||||
|
||||
if not soup:
|
||||
continue
|
||||
|
||||
table = soup.find('table', {'id': 'schedule'})
|
||||
if not table:
|
||||
continue
|
||||
|
||||
tbody = table.find('tbody')
|
||||
if not tbody:
|
||||
continue
|
||||
|
||||
for row in tbody.find_all('tr'):
|
||||
if row.get('class') and 'thead' in row.get('class'):
|
||||
continue
|
||||
|
||||
cells = row.find_all(['td', 'th'])
|
||||
if len(cells) < 6:
|
||||
continue
|
||||
|
||||
try:
|
||||
# Parse date
|
||||
date_cell = row.find('th', {'data-stat': 'date_game'})
|
||||
if not date_cell:
|
||||
continue
|
||||
date_link = date_cell.find('a')
|
||||
date_str = date_link.text if date_link else date_cell.text
|
||||
|
||||
# Parse time
|
||||
time_cell = row.find('td', {'data-stat': 'game_start_time'})
|
||||
time_str = time_cell.text.strip() if time_cell else None
|
||||
|
||||
# Parse teams
|
||||
visitor_cell = row.find('td', {'data-stat': 'visitor_team_name'})
|
||||
home_cell = row.find('td', {'data-stat': 'home_team_name'})
|
||||
|
||||
if not visitor_cell or not home_cell:
|
||||
continue
|
||||
|
||||
visitor_link = visitor_cell.find('a')
|
||||
home_link = home_cell.find('a')
|
||||
|
||||
away_team = visitor_link.text if visitor_link else visitor_cell.text
|
||||
home_team = home_link.text if home_link else home_cell.text
|
||||
|
||||
# Parse arena
|
||||
arena_cell = row.find('td', {'data-stat': 'arena_name'})
|
||||
arena = arena_cell.text.strip() if arena_cell else ''
|
||||
|
||||
# Convert date
|
||||
try:
|
||||
parsed_date = datetime.strptime(date_str.strip(), '%a, %b %d, %Y')
|
||||
date_formatted = parsed_date.strftime('%Y-%m-%d')
|
||||
except ValueError:
|
||||
continue
|
||||
|
||||
# Generate game ID
|
||||
away_abbrev = get_nba_team_abbrev(away_team)
|
||||
home_abbrev = get_nba_team_abbrev(home_team)
|
||||
game_id = f"nba_{date_formatted}_{away_abbrev}_{home_abbrev}".lower().replace(' ', '')
|
||||
|
||||
game = Game(
|
||||
id=game_id,
|
||||
sport='NBA',
|
||||
season=get_nba_season_string(season),
|
||||
date=date_formatted,
|
||||
time=time_str,
|
||||
home_team=home_team,
|
||||
away_team=away_team,
|
||||
home_team_abbrev=home_abbrev,
|
||||
away_team_abbrev=away_abbrev,
|
||||
venue=arena,
|
||||
source='basketball-reference.com'
|
||||
)
|
||||
games.append(game)
|
||||
|
||||
except Exception as e:
|
||||
print(f" Error parsing row: {e}")
|
||||
continue
|
||||
|
||||
print(f" Found {len(games)} games from Basketball-Reference")
|
||||
return games
|
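The date conversion and game-ID steps in the scraper above can be pulled out and tested without any network access. This is a sketch: `to_iso_date` and `make_game_id` are hypothetical helper names, and `%a`/`%b` parsing depends on the locale, so it assumes English day and month abbreviations.

```python
from datetime import datetime
from typing import Optional


def to_iso_date(date_str: str) -> Optional[str]:
    """Convert 'Tue, Oct 22, 2024' to '2024-10-22'; None if it does not parse."""
    try:
        parsed = datetime.strptime(date_str.strip(), '%a, %b %d, %Y')
    except ValueError:
        return None  # header rows and other non-date text land here
    return parsed.strftime('%Y-%m-%d')


def make_game_id(iso_date: str, away: str, home: str) -> str:
    """Mirror the ID format used above: nba_<date>_<away>_<home>, lowercased."""
    return f"nba_{iso_date}_{away}_{home}".lower().replace(' ', '')
```

Catching only `ValueError` keeps unrelated failures, such as a `KeyboardInterrupt`, from being swallowed by the date parser.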
||||
|
||||
|
||||
def scrape_nba_espn(season: int) -> list[Game]:
|
||||
"""
|
||||
Scrape NBA schedule from ESPN (stub: pages are fetched but not parsed, so no games are returned).
|
||||
URL: https://www.espn.com/nba/schedule/_/date/{YYYYMMDD}
|
||||
"""
|
||||
games = []
|
||||
print(f"Scraping NBA {season} from ESPN...")
|
||||
|
||||
# Determine date range for season
|
||||
start_date = datetime(season - 1, 10, 1) # October of previous year
|
||||
end_date = datetime(season, 6, 30) # June of season year
|
||||
|
||||
current_date = start_date
|
||||
while current_date <= end_date:
|
||||
date_str = current_date.strftime('%Y%m%d')
|
||||
url = f"https://www.espn.com/nba/schedule/_/date/{date_str}"
|
||||
|
||||
soup = fetch_page(url, 'espn.com')
|
||||
if soup:
|
||||
# ESPN uses JavaScript rendering, so we need to parse what's available
|
||||
# This is a simplified version - full implementation would need Selenium
|
||||
pass
|
||||
|
||||
current_date += timedelta(days=7) # Sample weekly to respect rate limits
|
||||
|
||||
print(f" Found {len(games)} games from ESPN")
|
||||
return games
|
||||
|
||||
|
||||
def scrape_nba_cbssports(season: int) -> list[Game]:
|
||||
"""
|
||||
Fetch NBA schedule from CBS Sports.
|
||||
Parses the HTML schedule page; game dates are not extracted yet (see placeholder below).
|
||||
"""
|
||||
games = []
|
||||
print(f"Fetching NBA {season} from CBS Sports...")
|
||||
|
||||
# CBS Sports schedule page (HTML, current schedule only; the season argument is not used in the URL)
|
||||
url = "https://www.cbssports.com/nba/schedule/"
|
||||
|
||||
soup = fetch_page(url, 'cbssports.com')
|
||||
if not soup:
|
||||
return games
|
||||
|
||||
# Find all game rows
|
||||
tables = soup.find_all('table', class_='TableBase-table')
|
||||
|
||||
for table in tables:
|
||||
rows = table.find_all('tr')
|
||||
for row in rows:
|
||||
try:
|
||||
cells = row.find_all('td')
|
||||
if len(cells) < 2:
|
||||
continue
|
||||
|
||||
# Parse teams from row
|
||||
team_cells = row.find_all('a', class_='TeamName')
|
||||
if len(team_cells) < 2:
|
||||
continue
|
||||
|
||||
away_team = team_cells[0].get_text(strip=True)
|
||||
home_team = team_cells[1].get_text(strip=True)
|
||||
|
||||
# TODO: read the game date from the table section; today's date is used as a placeholder
|
||||
date_formatted = datetime.now().strftime('%Y-%m-%d') # Placeholder
|
||||
|
||||
away_abbrev = get_nba_team_abbrev(away_team)
|
||||
home_abbrev = get_nba_team_abbrev(home_team)
|
||||
game_id = f"nba_{date_formatted}_{away_abbrev}_{home_abbrev}".lower().replace(' ', '')
|
||||
|
||||
game = Game(
|
||||
id=game_id,
|
||||
sport='NBA',
|
||||
season=get_nba_season_string(season),
|
||||
date=date_formatted,
|
||||
time=None,
|
||||
home_team=home_team,
|
||||
away_team=away_team,
|
||||
home_team_abbrev=home_abbrev,
|
||||
away_team_abbrev=away_abbrev,
|
||||
venue='',
|
||||
source='cbssports.com'
|
||||
)
|
||||
games.append(game)
|
||||
|
||||
except Exception:
|
||||
continue
|
||||
|
||||
print(f" Found {len(games)} games from CBS Sports")
|
||||
return games
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# STADIUM SCRAPERS
|
||||
# =============================================================================
|
||||
|
||||
def scrape_nba_stadiums() -> list[Stadium]:
|
||||
"""
|
||||
Fetch NBA arena data (hardcoded with accurate coordinates).
|
||||
"""
|
||||
print("\nNBA STADIUMS")
|
||||
print("-" * 40)
|
||||
print(" Loading NBA arenas...")
|
||||
|
||||
nba_arenas = {
|
||||
'State Farm Arena': {'city': 'Atlanta', 'state': 'GA', 'lat': 33.7573, 'lng': -84.3963, 'capacity': 18118, 'teams': ['ATL'], 'year_opened': 1999},
|
||||
'TD Garden': {'city': 'Boston', 'state': 'MA', 'lat': 42.3662, 'lng': -71.0621, 'capacity': 19156, 'teams': ['BOS'], 'year_opened': 1995},
|
||||
'Barclays Center': {'city': 'Brooklyn', 'state': 'NY', 'lat': 40.6826, 'lng': -73.9754, 'capacity': 17732, 'teams': ['BRK'], 'year_opened': 2012},
|
||||
'Spectrum Center': {'city': 'Charlotte', 'state': 'NC', 'lat': 35.2251, 'lng': -80.8392, 'capacity': 19077, 'teams': ['CHO'], 'year_opened': 2005},
|
||||
'United Center': {'city': 'Chicago', 'state': 'IL', 'lat': 41.8807, 'lng': -87.6742, 'capacity': 20917, 'teams': ['CHI'], 'year_opened': 1994},
|
||||
'Rocket Mortgage FieldHouse': {'city': 'Cleveland', 'state': 'OH', 'lat': 41.4965, 'lng': -81.6882, 'capacity': 19432, 'teams': ['CLE'], 'year_opened': 1994},
|
||||
'American Airlines Center': {'city': 'Dallas', 'state': 'TX', 'lat': 32.7905, 'lng': -96.8103, 'capacity': 19200, 'teams': ['DAL'], 'year_opened': 2001},
|
||||
'Ball Arena': {'city': 'Denver', 'state': 'CO', 'lat': 39.7487, 'lng': -105.0077, 'capacity': 19520, 'teams': ['DEN'], 'year_opened': 1999},
|
||||
'Little Caesars Arena': {'city': 'Detroit', 'state': 'MI', 'lat': 42.3411, 'lng': -83.0553, 'capacity': 20332, 'teams': ['DET'], 'year_opened': 2017},
|
||||
'Chase Center': {'city': 'San Francisco', 'state': 'CA', 'lat': 37.7680, 'lng': -122.3879, 'capacity': 18064, 'teams': ['GSW'], 'year_opened': 2019},
|
||||
'Toyota Center': {'city': 'Houston', 'state': 'TX', 'lat': 29.7508, 'lng': -95.3621, 'capacity': 18055, 'teams': ['HOU'], 'year_opened': 2003},
|
||||
'Gainbridge Fieldhouse': {'city': 'Indianapolis', 'state': 'IN', 'lat': 39.7640, 'lng': -86.1555, 'capacity': 17923, 'teams': ['IND'], 'year_opened': 1999},
|
||||
'Intuit Dome': {'city': 'Inglewood', 'state': 'CA', 'lat': 33.9425, 'lng': -118.3419, 'capacity': 18000, 'teams': ['LAC'], 'year_opened': 2024},
|
||||
'Crypto.com Arena': {'city': 'Los Angeles', 'state': 'CA', 'lat': 34.0430, 'lng': -118.2673, 'capacity': 18997, 'teams': ['LAL'], 'year_opened': 1999},
|
||||
'FedExForum': {'city': 'Memphis', 'state': 'TN', 'lat': 35.1382, 'lng': -90.0506, 'capacity': 17794, 'teams': ['MEM'], 'year_opened': 2004},
|
||||
'Kaseya Center': {'city': 'Miami', 'state': 'FL', 'lat': 25.7814, 'lng': -80.1870, 'capacity': 19600, 'teams': ['MIA'], 'year_opened': 1999},
|
||||
'Fiserv Forum': {'city': 'Milwaukee', 'state': 'WI', 'lat': 43.0451, 'lng': -87.9174, 'capacity': 17341, 'teams': ['MIL'], 'year_opened': 2018},
|
||||
'Target Center': {'city': 'Minneapolis', 'state': 'MN', 'lat': 44.9795, 'lng': -93.2761, 'capacity': 18978, 'teams': ['MIN'], 'year_opened': 1990},
|
||||
'Smoothie King Center': {'city': 'New Orleans', 'state': 'LA', 'lat': 29.9490, 'lng': -90.0821, 'capacity': 16867, 'teams': ['NOP'], 'year_opened': 1999},
|
||||
'Madison Square Garden': {'city': 'New York', 'state': 'NY', 'lat': 40.7505, 'lng': -73.9934, 'capacity': 19812, 'teams': ['NYK'], 'year_opened': 1968},
|
||||
'Paycom Center': {'city': 'Oklahoma City', 'state': 'OK', 'lat': 35.4634, 'lng': -97.5151, 'capacity': 18203, 'teams': ['OKC'], 'year_opened': 2002},
|
||||
'Kia Center': {'city': 'Orlando', 'state': 'FL', 'lat': 28.5392, 'lng': -81.3839, 'capacity': 18846, 'teams': ['ORL'], 'year_opened': 1989},
|
||||
'Wells Fargo Center': {'city': 'Philadelphia', 'state': 'PA', 'lat': 39.9012, 'lng': -75.1720, 'capacity': 20478, 'teams': ['PHI'], 'year_opened': 1996},
|
||||
'Footprint Center': {'city': 'Phoenix', 'state': 'AZ', 'lat': 33.4457, 'lng': -112.0712, 'capacity': 17071, 'teams': ['PHO'], 'year_opened': 1992},
|
||||
'Moda Center': {'city': 'Portland', 'state': 'OR', 'lat': 45.5316, 'lng': -122.6668, 'capacity': 19393, 'teams': ['POR'], 'year_opened': 1995},
|
||||
'Golden 1 Center': {'city': 'Sacramento', 'state': 'CA', 'lat': 38.5802, 'lng': -121.4997, 'capacity': 17608, 'teams': ['SAC'], 'year_opened': 2016},
|
||||
'Frost Bank Center': {'city': 'San Antonio', 'state': 'TX', 'lat': 29.4270, 'lng': -98.4375, 'capacity': 18418, 'teams': ['SAS'], 'year_opened': 2002},
|
||||
'Scotiabank Arena': {'city': 'Toronto', 'state': 'ON', 'lat': 43.6435, 'lng': -79.3791, 'capacity': 19800, 'teams': ['TOR'], 'year_opened': 1999},
|
||||
'Delta Center': {'city': 'Salt Lake City', 'state': 'UT', 'lat': 40.7683, 'lng': -111.9011, 'capacity': 18306, 'teams': ['UTA'], 'year_opened': 1991},
|
||||
'Capital One Arena': {'city': 'Washington', 'state': 'DC', 'lat': 38.8982, 'lng': -77.0209, 'capacity': 20356, 'teams': ['WAS'], 'year_opened': 1997},
|
||||
}
|
||||
|
||||
stadiums = []
|
||||
for name, info in nba_arenas.items():
|
||||
stadium = Stadium(
|
||||
id=f"nba_{name.lower().replace(' ', '_')[:30]}",
|
||||
name=name,
|
||||
city=info['city'],
|
||||
state=info['state'],
|
||||
latitude=info['lat'],
|
||||
longitude=info['lng'],
|
||||
capacity=info['capacity'],
|
||||
sport='NBA',
|
||||
team_abbrevs=info['teams'],
|
||||
source='nba_hardcoded',
|
||||
year_opened=info.get('year_opened')
|
||||
)
|
||||
stadiums.append(stadium)
|
||||
|
||||
print(f" ✓ Found {len(stadiums)} NBA arenas")
|
||||
return stadiums
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# SOURCE CONFIGURATIONS
|
||||
# =============================================================================
|
||||
|
||||
NBA_GAME_SOURCES = [
|
||||
ScraperSource('Basketball-Reference', scrape_nba_basketball_reference, priority=1, min_games=100),
|
||||
ScraperSource('CBS Sports', scrape_nba_cbssports, priority=2, min_games=50),
|
||||
ScraperSource('ESPN', scrape_nba_espn, priority=3, min_games=50),
|
||||
]
|
||||
|
||||
NBA_STADIUM_SOURCES = [
|
||||
StadiumScraperSource('Hardcoded', scrape_nba_stadiums, priority=1, min_venues=25),
|
||||
]
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# CONVENIENCE FUNCTIONS
|
||||
# =============================================================================
|
||||
|
||||
def scrape_nba_games(season: int) -> list[Game]:
|
||||
"""
|
||||
Scrape NBA games for a season using multi-source fallback.
|
||||
|
||||
Args:
|
||||
season: Season ending year (e.g., 2025 for 2024-25 season)
|
||||
|
||||
Returns:
|
||||
List of Game objects from the first successful source
|
||||
"""
|
||||
print(f"\nNBA {get_nba_season_string(season)} SCHEDULE")
|
||||
print("-" * 40)
|
||||
|
||||
return scrape_with_fallback('NBA', season, NBA_GAME_SOURCES)
|
||||
-574
@@ -1,574 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
NFL schedule and stadium scrapers for SportsTime.
|
||||
|
||||
This module provides:
|
||||
- NFL game scrapers (ESPN, Pro-Football-Reference, CBS Sports)
|
||||
- NFL stadium scrapers (ScoreBot, GeoJSON, hardcoded)
|
||||
- Multi-source fallback configurations
|
||||
"""
|
||||
|
||||
from datetime import datetime
|
||||
from typing import Optional
|
||||
|
||||
import requests
|
||||
|
||||
# Support both direct execution and import from parent directory
|
||||
try:
|
||||
from core import (
|
||||
Game,
|
||||
Stadium,
|
||||
ScraperSource,
|
||||
StadiumScraperSource,
|
||||
fetch_page,
|
||||
scrape_with_fallback,
|
||||
scrape_stadiums_with_fallback,
|
||||
)
|
||||
except ImportError:
|
||||
from Scripts.core import (
|
||||
Game,
|
||||
Stadium,
|
||||
ScraperSource,
|
||||
StadiumScraperSource,
|
||||
fetch_page,
|
||||
scrape_with_fallback,
|
||||
scrape_stadiums_with_fallback,
|
||||
)
|
||||
|
||||
|
||||
__all__ = [
|
||||
# Team data
|
||||
'NFL_TEAMS',
|
||||
# Game scrapers
|
||||
'scrape_nfl_espn',
|
||||
'scrape_nfl_pro_football_reference',
|
||||
'scrape_nfl_cbssports',
|
||||
# Stadium scrapers
|
||||
'scrape_nfl_stadiums',
|
||||
'scrape_nfl_stadiums_scorebot',
|
||||
'scrape_nfl_stadiums_geojson',
|
||||
'scrape_nfl_stadiums_hardcoded',
|
||||
# Source configurations
|
||||
'NFL_GAME_SOURCES',
|
||||
'NFL_STADIUM_SOURCES',
|
||||
# Convenience functions
|
||||
'scrape_nfl_games',
|
||||
'get_nfl_season_string',
|
||||
]
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# TEAM MAPPINGS
|
||||
# =============================================================================
|
||||
|
||||
NFL_TEAMS = {
|
||||
'ARI': {'name': 'Arizona Cardinals', 'city': 'Glendale', 'stadium': 'State Farm Stadium'},
|
||||
'ATL': {'name': 'Atlanta Falcons', 'city': 'Atlanta', 'stadium': 'Mercedes-Benz Stadium'},
|
||||
'BAL': {'name': 'Baltimore Ravens', 'city': 'Baltimore', 'stadium': 'M&T Bank Stadium'},
|
||||
'BUF': {'name': 'Buffalo Bills', 'city': 'Orchard Park', 'stadium': 'Highmark Stadium'},
|
||||
'CAR': {'name': 'Carolina Panthers', 'city': 'Charlotte', 'stadium': 'Bank of America Stadium'},
|
||||
'CHI': {'name': 'Chicago Bears', 'city': 'Chicago', 'stadium': 'Soldier Field'},
|
||||
'CIN': {'name': 'Cincinnati Bengals', 'city': 'Cincinnati', 'stadium': 'Paycor Stadium'},
|
||||
'CLE': {'name': 'Cleveland Browns', 'city': 'Cleveland', 'stadium': 'Cleveland Browns Stadium'},
|
||||
'DAL': {'name': 'Dallas Cowboys', 'city': 'Arlington', 'stadium': 'AT&T Stadium'},
|
||||
'DEN': {'name': 'Denver Broncos', 'city': 'Denver', 'stadium': 'Empower Field at Mile High'},
|
||||
'DET': {'name': 'Detroit Lions', 'city': 'Detroit', 'stadium': 'Ford Field'},
|
||||
'GB': {'name': 'Green Bay Packers', 'city': 'Green Bay', 'stadium': 'Lambeau Field'},
|
||||
'HOU': {'name': 'Houston Texans', 'city': 'Houston', 'stadium': 'NRG Stadium'},
|
||||
'IND': {'name': 'Indianapolis Colts', 'city': 'Indianapolis', 'stadium': 'Lucas Oil Stadium'},
|
||||
'JAX': {'name': 'Jacksonville Jaguars', 'city': 'Jacksonville', 'stadium': 'EverBank Stadium'},
|
||||
'KC': {'name': 'Kansas City Chiefs', 'city': 'Kansas City', 'stadium': 'GEHA Field at Arrowhead Stadium'},
|
||||
'LV': {'name': 'Las Vegas Raiders', 'city': 'Las Vegas', 'stadium': 'Allegiant Stadium'},
|
||||
'LAC': {'name': 'Los Angeles Chargers', 'city': 'Inglewood', 'stadium': 'SoFi Stadium'},
|
||||
'LAR': {'name': 'Los Angeles Rams', 'city': 'Inglewood', 'stadium': 'SoFi Stadium'},
|
||||
'MIA': {'name': 'Miami Dolphins', 'city': 'Miami Gardens', 'stadium': 'Hard Rock Stadium'},
|
||||
'MIN': {'name': 'Minnesota Vikings', 'city': 'Minneapolis', 'stadium': 'U.S. Bank Stadium'},
|
||||
'NE': {'name': 'New England Patriots', 'city': 'Foxborough', 'stadium': 'Gillette Stadium'},
|
||||
'NO': {'name': 'New Orleans Saints', 'city': 'New Orleans', 'stadium': 'Caesars Superdome'},
|
||||
'NYG': {'name': 'New York Giants', 'city': 'East Rutherford', 'stadium': 'MetLife Stadium'},
|
||||
'NYJ': {'name': 'New York Jets', 'city': 'East Rutherford', 'stadium': 'MetLife Stadium'},
|
||||
'PHI': {'name': 'Philadelphia Eagles', 'city': 'Philadelphia', 'stadium': 'Lincoln Financial Field'},
|
||||
'PIT': {'name': 'Pittsburgh Steelers', 'city': 'Pittsburgh', 'stadium': 'Acrisure Stadium'},
|
||||
'SF': {'name': 'San Francisco 49ers', 'city': 'Santa Clara', 'stadium': "Levi's Stadium"},
|
||||
'SEA': {'name': 'Seattle Seahawks', 'city': 'Seattle', 'stadium': 'Lumen Field'},
|
||||
'TB': {'name': 'Tampa Bay Buccaneers', 'city': 'Tampa', 'stadium': 'Raymond James Stadium'},
|
||||
'TEN': {'name': 'Tennessee Titans', 'city': 'Nashville', 'stadium': 'Nissan Stadium'},
|
||||
'WAS': {'name': 'Washington Commanders', 'city': 'Landover', 'stadium': 'Northwest Stadium'},
|
||||
}
|
||||
|
||||
|
||||
def get_nfl_team_abbrev(team_name: str) -> str:
|
||||
"""Get NFL team abbreviation from full name."""
|
||||
for abbrev, info in NFL_TEAMS.items():
|
||||
if info['name'].lower() == team_name.lower():
|
||||
return abbrev
|
||||
if team_name.lower() in info['name'].lower():
|
||||
return abbrev
|
||||
|
||||
# Return first 3 letters as fallback
|
||||
return team_name[:3].upper()
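# Illustrative sketch, not part of the original module. The lookup above tests
# exact and substring matches in a single pass, and an empty name matches the
# first team. A two-pass variant is sketched below; `lookup_abbrev` and the
# three-team `TEAMS` mapping are hypothetical stand-ins for the real data.

```python
# Hypothetical subset of the NFL_TEAMS mapping, for illustration only.
TEAMS = {
    'LAR': {'name': 'Los Angeles Rams'},
    'NYG': {'name': 'New York Giants'},
    'NYJ': {'name': 'New York Jets'},
}


def lookup_abbrev(team_name: str, teams: dict = TEAMS) -> str:
    """Resolve a team name to its abbreviation, preferring exact matches."""
    needle = team_name.strip().lower()

    # Pass 1: exact, case-insensitive match on the full team name.
    for abbrev, info in teams.items():
        if info['name'].lower() == needle:
            return abbrev

    # Pass 2: substring match, skipped for empty input so that '' does not
    # match every team.
    if needle:
        for abbrev, info in teams.items():
            if needle in info['name'].lower():
                return abbrev

    # Fallback mirrors the original: first three letters, upper-cased.
    return team_name.strip()[:3].upper()
```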
|
||||
|
||||
|
||||
def get_nfl_season_string(season: int) -> str:
|
||||
"""
|
||||
Get NFL season string in "2025-26" format.
|
||||
|
||||
Args:
|
||||
season: The ending year of the season (e.g., 2026 for 2025-26 season)
|
||||
|
||||
Returns:
|
||||
Season string like "2025-26"
|
||||
"""
|
||||
return f"{season-1}-{str(season)[2:]}"
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# GAME SCRAPERS
|
||||
# =============================================================================
|
||||
|
||||
def _scrape_espn_schedule(sport: str, league: str, season: int, date_range: tuple[str, str]) -> list[Game]:
|
||||
"""
|
||||
Fetch schedule from ESPN API.
|
||||
|
||||
Args:
|
||||
sport: 'football'
|
||||
league: 'nfl'
|
||||
season: Season ending year (e.g., 2026 for 2025-26 season)
|
||||
date_range: (start_date, end_date) in YYYYMMDD format
|
||||
"""
|
||||
games = []
|
||||
sport_upper = 'NFL'
|
||||
|
||||
print(f"Fetching {sport_upper} {season} from ESPN API...")
|
||||
|
||||
url = f"https://site.api.espn.com/apis/site/v2/sports/{sport}/{league}/scoreboard"
|
||||
params = {
|
||||
'dates': f"{date_range[0]}-{date_range[1]}",
|
||||
'limit': 1000
|
||||
}
|
||||
|
||||
headers = {
|
||||
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
|
||||
}
|
||||
|
||||
try:
|
||||
response = requests.get(url, params=params, headers=headers, timeout=30)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
|
||||
events = data.get('events', [])
|
||||
|
||||
for event in events:
|
||||
try:
|
||||
# Parse date/time
|
||||
date_str = event.get('date', '')[:10] # YYYY-MM-DD
|
||||
time_str = event.get('date', '')[11:16] if len(event.get('date', '')) > 11 else None
|
||||
|
||||
# Get teams
|
||||
competitions = event.get('competitions', [{}])
|
||||
if not competitions:
|
||||
continue
|
||||
|
||||
comp = competitions[0]
|
||||
competitors = comp.get('competitors', [])
|
||||
|
||||
if len(competitors) < 2:
|
||||
continue
|
||||
|
||||
home_team = None
|
||||
away_team = None
|
||||
home_abbrev = None
|
||||
away_abbrev = None
|
||||
|
||||
for team in competitors:
|
||||
team_data = team.get('team', {})
|
||||
team_name = team_data.get('displayName', team_data.get('name', ''))
|
||||
team_abbrev = team_data.get('abbreviation', '')
|
||||
|
||||
if team.get('homeAway') == 'home':
|
||||
home_team = team_name
|
||||
home_abbrev = team_abbrev
|
||||
else:
|
||||
away_team = team_name
|
||||
away_abbrev = team_abbrev
|
||||
|
||||
if not home_team or not away_team:
|
||||
continue
|
||||
|
||||
# Get venue
|
||||
venue = comp.get('venue', {}).get('fullName', '')
|
||||
|
||||
game_id = f"nfl_{date_str}_{away_abbrev}_{home_abbrev}".lower()
|
||||
|
||||
game = Game(
|
||||
id=game_id,
|
||||
sport='NFL',
|
||||
season=get_nfl_season_string(season),
|
||||
date=date_str,
|
||||
time=time_str,
|
||||
home_team=home_team,
|
||||
away_team=away_team,
|
||||
home_team_abbrev=home_abbrev or get_nfl_team_abbrev(home_team),
|
||||
away_team_abbrev=away_abbrev or get_nfl_team_abbrev(away_team),
|
||||
venue=venue,
|
||||
source='espn.com'
|
||||
)
|
||||
games.append(game)
|
||||
|
||||
except Exception:
|
||||
continue
|
||||
|
||||
print(f" Found {len(games)} games from ESPN")
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error fetching ESPN NFL: {e}")
|
||||
|
||||
return games
|
||||
|
||||
|
||||
def scrape_nfl_espn(season: int) -> list[Game]:
|
||||
"""Fetch NFL schedule from ESPN API."""
|
||||
# NFL season: September - February (spans years)
|
||||
start = f"{season-1}0901"
|
||||
end = f"{season}0228"
|
||||
return _scrape_espn_schedule('football', 'nfl', season, (start, end))
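# Illustrative sketch, not part of the original module. It shows how an
# ending-year `season` maps to the season label and the ESPN date window used
# above. `nfl_season_window` is a hypothetical helper name.

```python
def nfl_season_window(season: int) -> tuple[str, tuple[str, str]]:
    """Return the season label and the (start, end) dates in YYYYMMDD form."""
    # Label: starting year, then the last two digits of the ending year.
    label = f"{season - 1}-{str(season)[2:]}"
    # Window: September 1 of the starting year to February 28 of the ending year.
    start = f"{season - 1}0901"
    end = f"{season}0228"
    return label, (start, end)
```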
|
||||
|
||||
|
||||
def scrape_nfl_pro_football_reference(season: int) -> list[Game]:
|
||||
"""
|
||||
Scrape NFL schedule from Pro-Football-Reference.
|
||||
URL: https://www.pro-football-reference.com/years/{YEAR}/games.htm
|
||||
The URL uses the starting year; season is the ending year (e.g., 2026 for the 2025-26 season)
|
||||
"""
|
||||
games = []
|
||||
year = season - 1 # PFR uses starting year
|
||||
url = f"https://www.pro-football-reference.com/years/{year}/games.htm"
|
||||
|
||||
print(f"Scraping NFL {season} from Pro-Football-Reference...")
|
||||
soup = fetch_page(url, 'pro-football-reference.com')
|
||||
|
||||
if not soup:
|
||||
return games
|
||||
|
||||
table = soup.find('table', {'id': 'games'})
|
||||
if not table:
|
||||
print(" Could not find games table")
|
||||
return games
|
||||
|
||||
tbody = table.find('tbody')
|
||||
if not tbody:
|
||||
return games
|
||||
|
||||
for row in tbody.find_all('tr'):
|
||||
if row.get('class') and 'thead' in row.get('class'):
|
||||
continue
|
||||
|
||||
try:
|
||||
# Parse date
|
||||
date_cell = row.find('td', {'data-stat': 'game_date'})
|
||||
if not date_cell:
|
||||
continue
|
||||
date_str = date_cell.text.strip()
|
||||
|
||||
# Parse teams
|
||||
winner_cell = row.find('td', {'data-stat': 'winner'})
|
||||
loser_cell = row.find('td', {'data-stat': 'loser'})
|
||||
home_cell = row.find('td', {'data-stat': 'game_location'})
|
||||
|
||||
if not winner_cell or not loser_cell:
|
||||
continue
|
||||
|
||||
winner_link = winner_cell.find('a')
|
||||
loser_link = loser_cell.find('a')
|
||||
|
||||
winner = winner_link.text if winner_link else winner_cell.text.strip()
|
||||
loser = loser_link.text if loser_link else loser_cell.text.strip()
|
||||
|
||||
# Determine home/away - '@' in game_location means winner was away
|
||||
is_at_loser = home_cell and '@' in home_cell.text
|
||||
if is_at_loser:
|
||||
home_team, away_team = loser, winner
|
||||
else:
|
||||
home_team, away_team = winner, loser
|
||||
|
||||
# Convert date (e.g., "September 7" or "2025-09-07")
|
||||
try:
|
||||
if '-' in date_str:
|
||||
parsed_date = datetime.strptime(date_str, '%Y-%m-%d')
|
||||
else:
|
||||
# Add year based on month
|
||||
month_str = date_str.split()[0]
|
||||
if month_str in ['January', 'February']:
|
||||
date_with_year = f"{date_str}, {year + 1}"
|
||||
else:
|
||||
date_with_year = f"{date_str}, {year}"
|
||||
parsed_date = datetime.strptime(date_with_year, '%B %d, %Y')
|
||||
date_formatted = parsed_date.strftime('%Y-%m-%d')
|
||||
except (ValueError, IndexError):
|
||||
continue
|
||||
|
||||
away_abbrev = get_nfl_team_abbrev(away_team)
|
||||
home_abbrev = get_nfl_team_abbrev(home_team)
|
||||
game_id = f"nfl_{date_formatted}_{away_abbrev}_{home_abbrev}".lower().replace(' ', '')
|
||||
|
||||
game = Game(
|
||||
id=game_id,
|
||||
sport='NFL',
|
||||
season=get_nfl_season_string(season),
|
||||
date=date_formatted,
|
||||
time=None,
|
||||
home_team=home_team,
|
||||
away_team=away_team,
|
||||
home_team_abbrev=home_abbrev,
|
||||
away_team_abbrev=away_abbrev,
|
||||
venue='',
|
||||
source='pro-football-reference.com'
|
||||
)
|
||||
games.append(game)
|
||||
|
||||
except Exception:
|
||||
continue
|
||||
|
||||
print(f" Found {len(games)} games from Pro-Football-Reference")
|
||||
return games
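# Illustrative sketch, not part of the original module. It isolates the date
# handling used in the Pro-Football-Reference scraper above: ISO dates pass
# through, and "Month Day" strings get a year inferred from the month.
# `parse_pfr_date` is a hypothetical helper name.

```python
from datetime import datetime
from typing import Optional


def parse_pfr_date(date_str: str, start_year: int) -> Optional[str]:
    """Normalize a schedule date to YYYY-MM-DD, or return None if unparseable."""
    date_str = date_str.strip()
    try:
        if '-' in date_str:
            # Already in ISO form, e.g. "2025-09-07".
            parsed = datetime.strptime(date_str, '%Y-%m-%d')
        else:
            # "September 7" style: January and February games belong to the
            # calendar year after the season's starting year.
            month = date_str.split()[0]
            year = start_year + 1 if month in ('January', 'February') else start_year
            parsed = datetime.strptime(f"{date_str}, {year}", '%B %d, %Y')
    except (ValueError, IndexError):
        # ValueError: text is not a date; IndexError: empty string.
        return None
    return parsed.strftime('%Y-%m-%d')
```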
|
||||
|
||||
|
||||
def scrape_nfl_cbssports(season: int) -> list[Game]:
|
||||
"""
|
||||
Scrape NFL schedule from CBS Sports.
|
||||
Provides structured schedule data via web scraping.
|
||||
"""
|
||||
games = []
|
||||
year = season - 1 # CBS uses starting year
|
||||
print(f"Fetching NFL {season} from CBS Sports...")
|
||||
|
||||
# CBS Sports schedule endpoint
|
||||
url = f"https://www.cbssports.com/nfl/schedule/{year}/regular/"
|
||||
|
||||
soup = fetch_page(url, 'cbssports.com')
|
||||
if not soup:
|
||||
return games
|
||||
|
||||
# Find game tables
|
||||
tables = soup.find_all('table', class_='TableBase-table')
|
||||
|
||||
for table in tables:
|
||||
rows = table.find_all('tr')
|
||||
for row in rows:
|
||||
try:
|
||||
cells = row.find_all('td')
|
||||
if len(cells) < 3:
|
||||
continue
|
||||
|
||||
# Parse matchup
|
||||
away_cell = cells[0] if len(cells) > 0 else None
|
||||
home_cell = cells[1] if len(cells) > 1 else None
|
||||
|
||||
if not away_cell or not home_cell:
|
||||
continue
|
||||
|
||||
away_team = away_cell.get_text(strip=True)
|
||||
home_team = home_cell.get_text(strip=True)
|
||||
|
||||
if not away_team or not home_team:
|
||||
continue
|
||||
|
||||
# CBS includes @ symbol
|
||||
away_team = away_team.replace('@', '').strip()
|
||||
|
||||
# TODO: parse the game date from the parent section; today's date is a placeholder
|
||||
date_formatted = datetime.now().strftime('%Y-%m-%d') # Placeholder
|
||||
|
||||
away_abbrev = get_nfl_team_abbrev(away_team)
|
||||
home_abbrev = get_nfl_team_abbrev(home_team)
|
||||
game_id = f"nfl_{date_formatted}_{away_abbrev}_{home_abbrev}".lower().replace(' ', '')
|
||||
|
||||
game = Game(
|
||||
id=game_id,
|
||||
sport='NFL',
|
||||
season=get_nfl_season_string(season),
|
||||
date=date_formatted,
|
||||
time=None,
|
||||
home_team=home_team,
|
||||
away_team=away_team,
|
||||
home_team_abbrev=home_abbrev,
|
||||
away_team_abbrev=away_abbrev,
|
||||
venue='',
|
||||
source='cbssports.com'
|
||||
)
|
||||
games.append(game)
|
||||
|
||||
except Exception:
|
||||
continue
|
||||
|
||||
print(f" Found {len(games)} games from CBS Sports")
|
||||
return games
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# STADIUM SCRAPERS
|
||||
# =============================================================================
|
||||
|
||||
def scrape_nfl_stadiums_scorebot() -> list[Stadium]:
|
||||
"""
|
||||
Source 1: NFLScoreBot/stadiums GitHub (public domain).
|
||||
"""
|
||||
stadiums = []
|
||||
url = "https://raw.githubusercontent.com/NFLScoreBot/stadiums/main/stadiums.json"
|
||||
|
||||
response = requests.get(url, timeout=30)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
|
||||
for name, info in data.items():
|
||||
stadium = Stadium(
|
||||
id=f"nfl_{name.lower().replace(' ', '_')[:30]}",
|
||||
name=name,
|
||||
city=info.get('city', ''),
|
||||
state=info.get('state', ''),
|
||||
latitude=info.get('lat', 0) / 1000000 if info.get('lat') else 0,
|
||||
longitude=info.get('long', 0) / 1000000 if info.get('long') else 0,
|
||||
capacity=info.get('capacity', 0),
|
||||
sport='NFL',
|
||||
team_abbrevs=info.get('teams', []),
|
||||
source='github.com/NFLScoreBot'
|
||||
)
|
||||
stadiums.append(stadium)
|
||||
|
||||
return stadiums
|
||||
|
||||
|
||||
def scrape_nfl_stadiums_geojson() -> list[Stadium]:
|
||||
"""
|
||||
Source 2: brianhatchl/nfl-stadiums GeoJSON gist.
|
||||
"""
|
||||
stadiums = []
|
||||
url = "https://gist.githubusercontent.com/brianhatchl/6265918/raw/dbe6acfe5deb48f51ce5a4c4f8f5dded4f02b9bd/nfl_stadiums.geojson"
|
||||
|
||||
response = requests.get(url, timeout=30)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
|
||||
for feature in data.get('features', []):
|
||||
props = feature.get('properties', {})
|
||||
coords = feature.get('geometry', {}).get('coordinates', [0, 0])
|
||||
|
||||
stadium = Stadium(
|
||||
id=f"nfl_{props.get('Stadium', '').lower().replace(' ', '_')[:30]}",
|
||||
name=props.get('Stadium', ''),
|
||||
city=props.get('City', ''),
|
||||
state=props.get('State', ''),
|
||||
latitude=coords[1] if len(coords) > 1 else 0,
|
||||
longitude=coords[0] if len(coords) > 0 else 0,
|
||||
capacity=int(props.get('Capacity', 0) or 0),
|
||||
sport='NFL',
|
||||
team_abbrevs=[props.get('Team', '')],
|
||||
source='gist.github.com/brianhatchl'
|
||||
)
|
||||
stadiums.append(stadium)
|
||||
|
||||
return stadiums
|
||||
|
||||
|
||||
def scrape_nfl_stadiums_hardcoded() -> list[Stadium]:
|
||||
"""
|
||||
Source 3: Hardcoded NFL stadiums (fallback).
|
||||
"""
|
||||
nfl_stadiums_data = {
|
||||
'State Farm Stadium': {'city': 'Glendale', 'state': 'AZ', 'lat': 33.5276, 'lng': -112.2626, 'capacity': 63400, 'teams': ['ARI'], 'year_opened': 2006},
|
||||
'Mercedes-Benz Stadium': {'city': 'Atlanta', 'state': 'GA', 'lat': 33.7553, 'lng': -84.4006, 'capacity': 71000, 'teams': ['ATL'], 'year_opened': 2017},
|
||||
'M&T Bank Stadium': {'city': 'Baltimore', 'state': 'MD', 'lat': 39.2780, 'lng': -76.6227, 'capacity': 71008, 'teams': ['BAL'], 'year_opened': 1998},
|
||||
'Highmark Stadium': {'city': 'Orchard Park', 'state': 'NY', 'lat': 42.7738, 'lng': -78.7870, 'capacity': 71608, 'teams': ['BUF'], 'year_opened': 1973},
|
||||
'Bank of America Stadium': {'city': 'Charlotte', 'state': 'NC', 'lat': 35.2258, 'lng': -80.8528, 'capacity': 75523, 'teams': ['CAR'], 'year_opened': 1996},
|
||||
'Soldier Field': {'city': 'Chicago', 'state': 'IL', 'lat': 41.8623, 'lng': -87.6167, 'capacity': 61500, 'teams': ['CHI'], 'year_opened': 1924},
|
||||
'Paycor Stadium': {'city': 'Cincinnati', 'state': 'OH', 'lat': 39.0954, 'lng': -84.5160, 'capacity': 65515, 'teams': ['CIN'], 'year_opened': 2000},
|
||||
'Cleveland Browns Stadium': {'city': 'Cleveland', 'state': 'OH', 'lat': 41.5061, 'lng': -81.6995, 'capacity': 67895, 'teams': ['CLE'], 'year_opened': 1999},
|
||||
'AT&T Stadium': {'city': 'Arlington', 'state': 'TX', 'lat': 32.7480, 'lng': -97.0928, 'capacity': 80000, 'teams': ['DAL'], 'year_opened': 2009},
|
||||
'Empower Field at Mile High': {'city': 'Denver', 'state': 'CO', 'lat': 39.7439, 'lng': -105.0201, 'capacity': 76125, 'teams': ['DEN'], 'year_opened': 2001},
|
||||
'Ford Field': {'city': 'Detroit', 'state': 'MI', 'lat': 42.3400, 'lng': -83.0456, 'capacity': 65000, 'teams': ['DET'], 'year_opened': 2002},
|
||||
'Lambeau Field': {'city': 'Green Bay', 'state': 'WI', 'lat': 44.5013, 'lng': -88.0622, 'capacity': 81435, 'teams': ['GB'], 'year_opened': 1957},
|
||||
'NRG Stadium': {'city': 'Houston', 'state': 'TX', 'lat': 29.6847, 'lng': -95.4107, 'capacity': 72220, 'teams': ['HOU'], 'year_opened': 2002},
|
||||
'Lucas Oil Stadium': {'city': 'Indianapolis', 'state': 'IN', 'lat': 39.7601, 'lng': -86.1639, 'capacity': 67000, 'teams': ['IND'], 'year_opened': 2008},
|
||||
'EverBank Stadium': {'city': 'Jacksonville', 'state': 'FL', 'lat': 30.3239, 'lng': -81.6373, 'capacity': 67814, 'teams': ['JAX'], 'year_opened': 1995},
|
||||
'GEHA Field at Arrowhead Stadium': {'city': 'Kansas City', 'state': 'MO', 'lat': 39.0489, 'lng': -94.4839, 'capacity': 76416, 'teams': ['KC'], 'year_opened': 1972},
|
||||
'Allegiant Stadium': {'city': 'Las Vegas', 'state': 'NV', 'lat': 36.0909, 'lng': -115.1833, 'capacity': 65000, 'teams': ['LV'], 'year_opened': 2020},
|
||||
'SoFi Stadium': {'city': 'Inglewood', 'state': 'CA', 'lat': 33.9535, 'lng': -118.3392, 'capacity': 70240, 'teams': ['LAC', 'LAR'], 'year_opened': 2020},
|
||||
'Hard Rock Stadium': {'city': 'Miami Gardens', 'state': 'FL', 'lat': 25.9580, 'lng': -80.2389, 'capacity': 64767, 'teams': ['MIA'], 'year_opened': 1987},
|
||||
'U.S. Bank Stadium': {'city': 'Minneapolis', 'state': 'MN', 'lat': 44.9736, 'lng': -93.2575, 'capacity': 66655, 'teams': ['MIN'], 'year_opened': 2016},
|
||||
'Gillette Stadium': {'city': 'Foxborough', 'state': 'MA', 'lat': 42.0909, 'lng': -71.2643, 'capacity': 65878, 'teams': ['NE'], 'year_opened': 2002},
|
||||
'Caesars Superdome': {'city': 'New Orleans', 'state': 'LA', 'lat': 29.9511, 'lng': -90.0812, 'capacity': 73208, 'teams': ['NO'], 'year_opened': 1975},
|
||||
'MetLife Stadium': {'city': 'East Rutherford', 'state': 'NJ', 'lat': 40.8135, 'lng': -74.0745, 'capacity': 82500, 'teams': ['NYG', 'NYJ'], 'year_opened': 2010},
|
||||
'Lincoln Financial Field': {'city': 'Philadelphia', 'state': 'PA', 'lat': 39.9008, 'lng': -75.1675, 'capacity': 69596, 'teams': ['PHI'], 'year_opened': 2003},
|
||||
'Acrisure Stadium': {'city': 'Pittsburgh', 'state': 'PA', 'lat': 40.4468, 'lng': -80.0158, 'capacity': 68400, 'teams': ['PIT'], 'year_opened': 2001},
|
||||
"Levi's Stadium": {'city': 'Santa Clara', 'state': 'CA', 'lat': 37.4032, 'lng': -121.9698, 'capacity': 68500, 'teams': ['SF'], 'year_opened': 2014},
|
||||
'Lumen Field': {'city': 'Seattle', 'state': 'WA', 'lat': 47.5952, 'lng': -122.3316, 'capacity': 68740, 'teams': ['SEA'], 'year_opened': 2002},
|
||||
'Raymond James Stadium': {'city': 'Tampa', 'state': 'FL', 'lat': 27.9759, 'lng': -82.5033, 'capacity': 65618, 'teams': ['TB'], 'year_opened': 1998},
|
||||
'Nissan Stadium': {'city': 'Nashville', 'state': 'TN', 'lat': 36.1665, 'lng': -86.7713, 'capacity': 69143, 'teams': ['TEN'], 'year_opened': 1999},
|
||||
'Northwest Stadium': {'city': 'Landover', 'state': 'MD', 'lat': 38.9076, 'lng': -76.8645, 'capacity': 67617, 'teams': ['WAS'], 'year_opened': 1997},
|
||||
}
|
||||
|
||||
stadiums = []
|
||||
for name, info in nfl_stadiums_data.items():
|
||||
stadium = Stadium(
|
||||
id=f"nfl_{name.lower().replace(' ', '_')[:30]}",
|
||||
name=name,
|
||||
city=info['city'],
|
||||
state=info['state'],
|
||||
latitude=info['lat'],
|
||||
longitude=info['lng'],
|
||||
capacity=info['capacity'],
|
||||
sport='NFL',
|
||||
team_abbrevs=info['teams'],
|
||||
source='nfl_hardcoded',
|
||||
year_opened=info.get('year_opened')
|
||||
)
|
||||
stadiums.append(stadium)
|
||||
|
||||
return stadiums
|
||||
|
||||
|
||||
def scrape_nfl_stadiums() -> list[Stadium]:
|
||||
"""
|
||||
Fetch NFL stadium data with multi-source fallback.
|
||||
"""
|
||||
print("\nNFL STADIUMS")
|
||||
print("-" * 40)
|
||||
|
||||
return scrape_stadiums_with_fallback('NFL', NFL_STADIUM_SOURCES)
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# SOURCE CONFIGURATIONS
|
||||
# =============================================================================
|
||||
|
||||
NFL_GAME_SOURCES = [
|
||||
ScraperSource('ESPN', scrape_nfl_espn, priority=1, min_games=200),
|
||||
ScraperSource('Pro-Football-Reference', scrape_nfl_pro_football_reference, priority=2, min_games=200),
|
||||
ScraperSource('CBS Sports', scrape_nfl_cbssports, priority=3, min_games=100),
|
||||
]
|
||||
|
||||
NFL_STADIUM_SOURCES = [
|
||||
StadiumScraperSource('NFLScoreBot', scrape_nfl_stadiums_scorebot, priority=1, min_venues=28),
|
||||
StadiumScraperSource('GeoJSON-Gist', scrape_nfl_stadiums_geojson, priority=2, min_venues=28),
|
||||
StadiumScraperSource('Hardcoded', scrape_nfl_stadiums_hardcoded, priority=3, min_venues=28),
|
||||
]
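# Illustrative sketch, not part of the original module. The implementation of
# scrape_stadiums_with_fallback lives in core and is not shown here. This is
# one way a priority-ordered fallback with a minimum-venue threshold could
# work. `VenueSource` and `first_sufficient` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VenueSource:
    """Hypothetical stand-in for StadiumScraperSource."""
    name: str
    scraper: Callable[[], list]
    priority: int = 1
    min_venues: int = 1


def first_sufficient(sources: list[VenueSource]) -> list:
    """Return results from the first source that yields enough venues."""
    for source in sorted(sources, key=lambda s: s.priority):
        try:
            venues = source.scraper()
        except Exception as exc:
            # A failing source (e.g. a network error) should not stop the chain.
            print(f"  {source.name} failed: {exc}")
            continue
        if len(venues) >= source.min_venues:
            return venues
        print(f"  {source.name} returned {len(venues)} venues, below {source.min_venues}")
    # No source met its threshold.
    return []
```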
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# CONVENIENCE FUNCTIONS
|
||||
# =============================================================================
|
||||
|
||||
def scrape_nfl_games(season: int) -> list[Game]:
|
||||
"""
|
||||
Scrape NFL games for a season using multi-source fallback.
|
||||
|
||||
Args:
|
||||
season: Season ending year (e.g., 2026 for 2025-26 season)
|
||||
|
||||
Returns:
|
||||
List of Game objects from the first successful source
|
||||
"""
|
||||
print(f"\nNFL {get_nfl_season_string(season)} SCHEDULE")
|
||||
print("-" * 40)
|
||||
|
||||
return scrape_with_fallback('NFL', season, NFL_GAME_SOURCES)
|
||||
-411
@@ -1,411 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
NHL schedule and stadium scrapers for SportsTime.
|
||||
|
||||
This module provides:
|
||||
- NHL game scrapers (Hockey-Reference, NHL API, ESPN)
|
||||
- NHL stadium scrapers (hardcoded with coordinates)
|
||||
- Multi-source fallback configurations
|
||||
"""
|
||||
|
||||
from datetime import datetime
|
||||
from typing import Optional
|
||||
|
||||
import requests
|
||||
|
||||
# Support both direct execution and import from parent directory
|
||||
try:
|
||||
from core import (
|
||||
Game,
|
||||
Stadium,
|
||||
ScraperSource,
|
||||
StadiumScraperSource,
|
||||
fetch_page,
|
||||
scrape_with_fallback,
|
||||
scrape_stadiums_with_fallback,
|
||||
)
|
||||
except ImportError:
|
||||
from Scripts.core import (
|
||||
Game,
|
||||
Stadium,
|
||||
ScraperSource,
|
||||
StadiumScraperSource,
|
||||
fetch_page,
|
||||
scrape_with_fallback,
|
||||
scrape_stadiums_with_fallback,
|
||||
)
|
||||
|
||||
|
||||
__all__ = [
|
||||
# Team data
|
||||
'NHL_TEAMS',
|
||||
# Game scrapers
|
||||
'scrape_nhl_hockey_reference',
|
||||
'scrape_nhl_api',
|
||||
'scrape_nhl_espn',
|
||||
# Stadium scrapers
|
||||
'scrape_nhl_stadiums',
|
||||
# Source configurations
|
||||
'NHL_GAME_SOURCES',
|
||||
'NHL_STADIUM_SOURCES',
|
||||
# Convenience functions
|
||||
'scrape_nhl_games',
|
||||
'get_nhl_season_string',
|
||||
]
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# TEAM MAPPINGS
|
||||
# =============================================================================
|
||||
|
||||
NHL_TEAMS = {
|
||||
'ANA': {'name': 'Anaheim Ducks', 'city': 'Anaheim', 'arena': 'Honda Center'},
|
||||
'ARI': {'name': 'Utah Hockey Club', 'city': 'Salt Lake City', 'arena': 'Delta Center'},
|
||||
'BOS': {'name': 'Boston Bruins', 'city': 'Boston', 'arena': 'TD Garden'},
|
||||
'BUF': {'name': 'Buffalo Sabres', 'city': 'Buffalo', 'arena': 'KeyBank Center'},
|
||||
'CGY': {'name': 'Calgary Flames', 'city': 'Calgary', 'arena': 'Scotiabank Saddledome'},
|
||||
'CAR': {'name': 'Carolina Hurricanes', 'city': 'Raleigh', 'arena': 'PNC Arena'},
|
||||
'CHI': {'name': 'Chicago Blackhawks', 'city': 'Chicago', 'arena': 'United Center'},
|
||||
'COL': {'name': 'Colorado Avalanche', 'city': 'Denver', 'arena': 'Ball Arena'},
|
||||
'CBJ': {'name': 'Columbus Blue Jackets', 'city': 'Columbus', 'arena': 'Nationwide Arena'},
|
||||
'DAL': {'name': 'Dallas Stars', 'city': 'Dallas', 'arena': 'American Airlines Center'},
|
||||
'DET': {'name': 'Detroit Red Wings', 'city': 'Detroit', 'arena': 'Little Caesars Arena'},
|
||||
'EDM': {'name': 'Edmonton Oilers', 'city': 'Edmonton', 'arena': 'Rogers Place'},
|
||||
'FLA': {'name': 'Florida Panthers', 'city': 'Sunrise', 'arena': 'Amerant Bank Arena'},
|
||||
'LAK': {'name': 'Los Angeles Kings', 'city': 'Los Angeles', 'arena': 'Crypto.com Arena'},
|
||||
'MIN': {'name': 'Minnesota Wild', 'city': 'St. Paul', 'arena': 'Xcel Energy Center'},
|
||||
'MTL': {'name': 'Montreal Canadiens', 'city': 'Montreal', 'arena': 'Bell Centre'},
|
||||
'NSH': {'name': 'Nashville Predators', 'city': 'Nashville', 'arena': 'Bridgestone Arena'},
|
||||
'NJD': {'name': 'New Jersey Devils', 'city': 'Newark', 'arena': 'Prudential Center'},
|
||||
'NYI': {'name': 'New York Islanders', 'city': 'Elmont', 'arena': 'UBS Arena'},
|
||||
'NYR': {'name': 'New York Rangers', 'city': 'New York', 'arena': 'Madison Square Garden'},
|
||||
'OTT': {'name': 'Ottawa Senators', 'city': 'Ottawa', 'arena': 'Canadian Tire Centre'},
|
||||
'PHI': {'name': 'Philadelphia Flyers', 'city': 'Philadelphia', 'arena': 'Wells Fargo Center'},
|
||||
'PIT': {'name': 'Pittsburgh Penguins', 'city': 'Pittsburgh', 'arena': 'PPG Paints Arena'},
|
||||
'SJS': {'name': 'San Jose Sharks', 'city': 'San Jose', 'arena': 'SAP Center'},
|
||||
'SEA': {'name': 'Seattle Kraken', 'city': 'Seattle', 'arena': 'Climate Pledge Arena'},
|
||||
'STL': {'name': 'St. Louis Blues', 'city': 'St. Louis', 'arena': 'Enterprise Center'},
|
||||
'TBL': {'name': 'Tampa Bay Lightning', 'city': 'Tampa', 'arena': 'Amalie Arena'},
|
||||
'TOR': {'name': 'Toronto Maple Leafs', 'city': 'Toronto', 'arena': 'Scotiabank Arena'},
|
||||
'VAN': {'name': 'Vancouver Canucks', 'city': 'Vancouver', 'arena': 'Rogers Arena'},
|
||||
'VGK': {'name': 'Vegas Golden Knights', 'city': 'Las Vegas', 'arena': 'T-Mobile Arena'},
|
||||
'WSH': {'name': 'Washington Capitals', 'city': 'Washington', 'arena': 'Capital One Arena'},
|
||||
'WPG': {'name': 'Winnipeg Jets', 'city': 'Winnipeg', 'arena': 'Canada Life Centre'},
|
||||
}
|
||||
|
||||
|
||||
def get_nhl_team_abbrev(team_name: str) -> str:
|
||||
"""Get NHL team abbreviation from full name."""
|
||||
for abbrev, info in NHL_TEAMS.items():
|
||||
if info['name'].lower() == team_name.lower():
|
||||
return abbrev
|
||||
if team_name.lower() in info['name'].lower():
|
||||
return abbrev
|
||||
|
||||
# Return first 3 letters as fallback
|
||||
return team_name[:3].upper()
|
||||
|
||||
|
||||
def get_nhl_season_string(season: int) -> str:
|
||||
"""
|
||||
Get NHL season string in "2024-25" format.
|
||||
|
||||
Args:
|
||||
season: The ending year of the season (e.g., 2025 for 2024-25 season)
|
||||
|
||||
Returns:
|
||||
Season string like "2024-25"
|
||||
"""
|
||||
return f"{season-1}-{str(season)[2:]}"
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# GAME SCRAPERS
|
||||
# =============================================================================
|
||||
|
||||
def scrape_nhl_hockey_reference(season: int) -> list[Game]:
    """
    Scrape NHL schedule from Hockey-Reference.
    URL: https://www.hockey-reference.com/leagues/NHL_{YEAR}_games.html
    """
    games = []
    url = f"https://www.hockey-reference.com/leagues/NHL_{season}_games.html"

    print(f"Scraping NHL {season} from Hockey-Reference...")
    soup = fetch_page(url, 'hockey-reference.com')

    if not soup:
        return games

    table = soup.find('table', {'id': 'games'})
    if not table:
        print(" Could not find games table")
        return games

    tbody = table.find('tbody')
    if not tbody:
        return games

    for row in tbody.find_all('tr'):
        try:
            cells = row.find_all(['td', 'th'])
            if len(cells) < 5:
                continue

            # Parse date
            date_cell = row.find('th', {'data-stat': 'date_game'})
            if not date_cell:
                continue
            date_link = date_cell.find('a')
            date_str = date_link.text if date_link else date_cell.text

            # Parse teams
            visitor_cell = row.find('td', {'data-stat': 'visitor_team_name'})
            home_cell = row.find('td', {'data-stat': 'home_team_name'})

            if not visitor_cell or not home_cell:
                continue

            visitor_link = visitor_cell.find('a')
            home_link = home_cell.find('a')

            away_team = visitor_link.text if visitor_link else visitor_cell.text
            home_team = home_link.text if home_link else home_cell.text

            # Convert date; rows whose date cell is not YYYY-MM-DD
            # (e.g. repeated header rows) are skipped.
            try:
                parsed_date = datetime.strptime(date_str.strip(), '%Y-%m-%d')
                date_formatted = parsed_date.strftime('%Y-%m-%d')
            except ValueError:
                continue

            away_abbrev = get_nhl_team_abbrev(away_team)
            home_abbrev = get_nhl_team_abbrev(home_team)
            game_id = f"nhl_{date_formatted}_{away_abbrev}_{home_abbrev}".lower().replace(' ', '')

            game = Game(
                id=game_id,
                sport='NHL',
                season=get_nhl_season_string(season),
                date=date_formatted,
                time=None,
                home_team=home_team,
                away_team=away_team,
                home_team_abbrev=home_abbrev,
                away_team_abbrev=away_abbrev,
                venue='',
                source='hockey-reference.com'
            )
            games.append(game)

        except Exception:
            # Skip rows that cannot be parsed rather than aborting the scrape.
            continue

    print(f" Found {len(games)} games from Hockey-Reference")
    return games
|
||||
|
||||
|
||||
def scrape_nhl_api(season: int) -> list[Game]:
    """
    Fetch NHL schedule from official API (JSON).
    URL: https://api-web.nhle.com/v1/schedule/{YYYY-MM-DD}

    Not implemented yet: this stub always returns an empty list.
    """
    games = []
    print(f"Fetching NHL {season} from NHL API...")

    # The NHL API provides schedules per date and per club, so covering a
    # full season means iterating through dates or teams. That loop is not
    # written yet.

    return games
|
||||
|
||||
|
||||
def scrape_nhl_espn(season: int) -> list[Game]:
    """Fetch NHL schedule from ESPN API."""
    games = []
    print(f"Fetching NHL {season} from ESPN API...")

    # NHL regular season: October - April (spans calendar years)
    start = f"{season-1}1001"
    end = f"{season}0430"

    url = "https://site.api.espn.com/apis/site/v2/sports/hockey/nhl/scoreboard"
    params = {
        'dates': f"{start}-{end}",
        'limit': 1000
    }

    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
    }

    try:
        response = requests.get(url, params=params, headers=headers, timeout=30)
        response.raise_for_status()
        data = response.json()

        events = data.get('events', [])

        for event in events:
            try:
                # Timestamp string: date in [0:10], HH:MM in [11:16]
                event_date = event.get('date', '')
                date_str = event_date[:10]
                time_str = event_date[11:16] if len(event_date) > 11 else None

                competitions = event.get('competitions', [])
                if not competitions:
                    continue

                comp = competitions[0]
                competitors = comp.get('competitors', [])

                if len(competitors) < 2:
                    continue

                home_team = away_team = home_abbrev = away_abbrev = None

                for team in competitors:
                    team_data = team.get('team', {})
                    team_name = team_data.get('displayName', team_data.get('name', ''))
                    team_abbrev = team_data.get('abbreviation', '')

                    if team.get('homeAway') == 'home':
                        home_team = team_name
                        home_abbrev = team_abbrev
                    else:
                        away_team = team_name
                        away_abbrev = team_abbrev

                if not home_team or not away_team:
                    continue

                # Resolve abbreviations before building the ID so the ID and
                # the Game fields agree even when ESPN omits an abbreviation.
                home_abbrev = home_abbrev or get_nhl_team_abbrev(home_team)
                away_abbrev = away_abbrev or get_nhl_team_abbrev(away_team)

                venue = comp.get('venue', {}).get('fullName', '')

                game_id = f"nhl_{date_str}_{away_abbrev}_{home_abbrev}".lower()

                game = Game(
                    id=game_id,
                    sport='NHL',
                    season=get_nhl_season_string(season),
                    date=date_str,
                    time=time_str,
                    home_team=home_team,
                    away_team=away_team,
                    home_team_abbrev=home_abbrev,
                    away_team_abbrev=away_abbrev,
                    venue=venue,
                    source='espn.com'
                )
                games.append(game)

            except Exception:
                # Skip malformed events rather than aborting the whole fetch.
                continue

        print(f" Found {len(games)} games from ESPN")

    except Exception as e:
        print(f"Error fetching ESPN NHL: {e}")

    return games
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# STADIUM SCRAPERS
|
||||
# =============================================================================
|
||||
|
||||
def scrape_nhl_stadiums() -> list[Stadium]:
|
||||
"""
|
||||
Fetch NHL arena data (hardcoded with accurate coordinates).
|
||||
"""
|
||||
print("\nNHL STADIUMS")
|
||||
print("-" * 40)
|
||||
print(" Loading NHL arenas...")
|
||||
|
||||
nhl_arenas = {
|
||||
'TD Garden': {'city': 'Boston', 'state': 'MA', 'lat': 42.3662, 'lng': -71.0621, 'capacity': 17850, 'teams': ['BOS'], 'year_opened': 1995},
|
||||
'KeyBank Center': {'city': 'Buffalo', 'state': 'NY', 'lat': 42.8750, 'lng': -78.8764, 'capacity': 19070, 'teams': ['BUF'], 'year_opened': 1996},
|
||||
'Little Caesars Arena': {'city': 'Detroit', 'state': 'MI', 'lat': 42.3411, 'lng': -83.0553, 'capacity': 19515, 'teams': ['DET'], 'year_opened': 2017},
|
||||
'Amerant Bank Arena': {'city': 'Sunrise', 'state': 'FL', 'lat': 26.1584, 'lng': -80.3256, 'capacity': 19250, 'teams': ['FLA'], 'year_opened': 1998},
|
||||
'Bell Centre': {'city': 'Montreal', 'state': 'QC', 'lat': 45.4961, 'lng': -73.5693, 'capacity': 21302, 'teams': ['MTL'], 'year_opened': 1996},
|
||||
'Canadian Tire Centre': {'city': 'Ottawa', 'state': 'ON', 'lat': 45.2969, 'lng': -75.9272, 'capacity': 18652, 'teams': ['OTT'], 'year_opened': 1996},
|
||||
'Amalie Arena': {'city': 'Tampa', 'state': 'FL', 'lat': 27.9426, 'lng': -82.4519, 'capacity': 19092, 'teams': ['TBL'], 'year_opened': 1996},
|
||||
'Scotiabank Arena': {'city': 'Toronto', 'state': 'ON', 'lat': 43.6435, 'lng': -79.3791, 'capacity': 18800, 'teams': ['TOR'], 'year_opened': 1999},
|
||||
'PNC Arena': {'city': 'Raleigh', 'state': 'NC', 'lat': 35.8033, 'lng': -78.7220, 'capacity': 18680, 'teams': ['CAR'], 'year_opened': 1999},
|
||||
'Nationwide Arena': {'city': 'Columbus', 'state': 'OH', 'lat': 39.9692, 'lng': -83.0061, 'capacity': 18500, 'teams': ['CBJ'], 'year_opened': 2000},
|
||||
'Prudential Center': {'city': 'Newark', 'state': 'NJ', 'lat': 40.7334, 'lng': -74.1713, 'capacity': 16514, 'teams': ['NJD'], 'year_opened': 2007},
|
||||
'UBS Arena': {'city': 'Elmont', 'state': 'NY', 'lat': 40.7170, 'lng': -73.7260, 'capacity': 17255, 'teams': ['NYI'], 'year_opened': 2021},
|
||||
'Madison Square Garden': {'city': 'New York', 'state': 'NY', 'lat': 40.7505, 'lng': -73.9934, 'capacity': 18006, 'teams': ['NYR'], 'year_opened': 1968},
|
||||
'Wells Fargo Center': {'city': 'Philadelphia', 'state': 'PA', 'lat': 39.9012, 'lng': -75.1720, 'capacity': 19500, 'teams': ['PHI'], 'year_opened': 1996},
|
||||
'PPG Paints Arena': {'city': 'Pittsburgh', 'state': 'PA', 'lat': 40.4395, 'lng': -79.9892, 'capacity': 18387, 'teams': ['PIT'], 'year_opened': 2010},
|
||||
'Capital One Arena': {'city': 'Washington', 'state': 'DC', 'lat': 38.8982, 'lng': -77.0209, 'capacity': 18573, 'teams': ['WSH'], 'year_opened': 1997},
|
||||
'United Center': {'city': 'Chicago', 'state': 'IL', 'lat': 41.8807, 'lng': -87.6742, 'capacity': 19717, 'teams': ['CHI'], 'year_opened': 1994},
|
||||
'Ball Arena': {'city': 'Denver', 'state': 'CO', 'lat': 39.7487, 'lng': -105.0077, 'capacity': 18007, 'teams': ['COL'], 'year_opened': 1999},
|
||||
'American Airlines Center': {'city': 'Dallas', 'state': 'TX', 'lat': 32.7905, 'lng': -96.8103, 'capacity': 18532, 'teams': ['DAL'], 'year_opened': 2001},
|
||||
'Xcel Energy Center': {'city': 'Saint Paul', 'state': 'MN', 'lat': 44.9448, 'lng': -93.1010, 'capacity': 17954, 'teams': ['MIN'], 'year_opened': 2000},
|
||||
'Bridgestone Arena': {'city': 'Nashville', 'state': 'TN', 'lat': 36.1592, 'lng': -86.7785, 'capacity': 17159, 'teams': ['NSH'], 'year_opened': 1996},
|
||||
'Enterprise Center': {'city': 'St. Louis', 'state': 'MO', 'lat': 38.6268, 'lng': -90.2025, 'capacity': 18096, 'teams': ['STL'], 'year_opened': 1994},
|
||||
'Canada Life Centre': {'city': 'Winnipeg', 'state': 'MB', 'lat': 49.8928, 'lng': -97.1437, 'capacity': 15321, 'teams': ['WPG'], 'year_opened': 2004},
|
||||
'Honda Center': {'city': 'Anaheim', 'state': 'CA', 'lat': 33.8078, 'lng': -117.8765, 'capacity': 17174, 'teams': ['ANA'], 'year_opened': 1993},
|
||||
'Delta Center': {'city': 'Salt Lake City', 'state': 'UT', 'lat': 40.7683, 'lng': -111.9011, 'capacity': 16210, 'teams': ['ARI'], 'year_opened': 1991},
|
||||
'SAP Center': {'city': 'San Jose', 'state': 'CA', 'lat': 37.3327, 'lng': -121.9012, 'capacity': 17562, 'teams': ['SJS'], 'year_opened': 1993},
|
||||
'Rogers Arena': {'city': 'Vancouver', 'state': 'BC', 'lat': 49.2778, 'lng': -123.1089, 'capacity': 18910, 'teams': ['VAN'], 'year_opened': 1995},
|
||||
'T-Mobile Arena': {'city': 'Las Vegas', 'state': 'NV', 'lat': 36.1028, 'lng': -115.1784, 'capacity': 17500, 'teams': ['VGK'], 'year_opened': 2016},
|
||||
'Climate Pledge Arena': {'city': 'Seattle', 'state': 'WA', 'lat': 47.6220, 'lng': -122.3540, 'capacity': 17100, 'teams': ['SEA'], 'year_opened': 2021},
|
||||
'Crypto.com Arena': {'city': 'Los Angeles', 'state': 'CA', 'lat': 34.0430, 'lng': -118.2673, 'capacity': 18230, 'teams': ['LAK'], 'year_opened': 1999},
|
||||
'Rogers Place': {'city': 'Edmonton', 'state': 'AB', 'lat': 53.5469, 'lng': -113.4979, 'capacity': 18347, 'teams': ['EDM'], 'year_opened': 2016},
|
||||
'Scotiabank Saddledome': {'city': 'Calgary', 'state': 'AB', 'lat': 51.0374, 'lng': -114.0519, 'capacity': 19289, 'teams': ['CGY'], 'year_opened': 1983},
|
||||
}
|
||||
|
||||
stadiums = []
|
||||
for name, info in nhl_arenas.items():
|
||||
stadium = Stadium(
|
||||
id=f"nhl_{name.lower().replace(' ', '_')[:30]}",
|
||||
name=name,
|
||||
city=info['city'],
|
||||
state=info['state'],
|
||||
latitude=info['lat'],
|
||||
longitude=info['lng'],
|
||||
capacity=info['capacity'],
|
||||
sport='NHL',
|
||||
team_abbrevs=info['teams'],
|
||||
source='nhl_hardcoded',
|
||||
year_opened=info.get('year_opened')
|
||||
)
|
||||
stadiums.append(stadium)
|
||||
|
||||
print(f" ✓ Found {len(stadiums)} NHL arenas")
|
||||
return stadiums
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# SOURCE CONFIGURATIONS
|
||||
# =============================================================================
|
||||
|
||||
NHL_GAME_SOURCES = [
|
||||
ScraperSource('Hockey-Reference', scrape_nhl_hockey_reference, priority=1, min_games=100),
|
||||
ScraperSource('ESPN', scrape_nhl_espn, priority=2, min_games=50),
|
||||
ScraperSource('NHL API', scrape_nhl_api, priority=3, min_games=50),
|
||||
]
|
||||
|
||||
NHL_STADIUM_SOURCES = [
|
||||
StadiumScraperSource('Hardcoded', scrape_nhl_stadiums, priority=1, min_venues=25),
|
||||
]
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# CONVENIENCE FUNCTIONS
|
||||
# =============================================================================
|
||||
|
||||
def scrape_nhl_games(season: int) -> list[Game]:
|
||||
"""
|
||||
Scrape NHL games for a season using multi-source fallback.
|
||||
|
||||
Args:
|
||||
season: Season ending year (e.g., 2025 for 2024-25 season)
|
||||
|
||||
Returns:
|
||||
List of Game objects from the first successful source
|
||||
"""
|
||||
print(f"\nNHL {get_nhl_season_string(season)} SCHEDULE")
|
||||
print("-" * 40)
|
||||
|
||||
return scrape_with_fallback('NHL', season, NHL_GAME_SOURCES)
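The `core` implementation of `scrape_with_fallback` is not part of this hunk. The sketch below is only one plausible reading of "multi-source fallback": try sources in priority order and accept the first one that returns at least `min_games` results. The names `SketchSource` and `scrape_with_fallback_sketch` are hypothetical and exist only for illustration; the field names `priority` and `min_games` mirror the keyword arguments used above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SketchSource:
    """Hypothetical stand-in for core.ScraperSource (assumed shape)."""
    name: str
    scraper_func: Callable[[int], list]
    priority: int = 1
    min_games: int = 50


def scrape_with_fallback_sketch(sport: str, season: int,
                                sources: list[SketchSource]) -> list:
    """Return results from the first source that meets its min_games bar."""
    for source in sorted(sources, key=lambda s: s.priority):
        try:
            games = source.scraper_func(season)
        except Exception as exc:
            # A failing source must not stop the lower-priority ones.
            print(f"  {sport}: {source.name} failed: {exc}")
            continue
        if len(games) >= source.min_games:
            print(f"  {sport}: using {source.name} ({len(games)} games)")
            return games
        print(f"  {sport}: {source.name} returned only {len(games)} games")
    return []


# Tiny fake scrapers to exercise the three outcomes.
def _broken(season: int) -> list:
    raise RuntimeError("site down")


def _sparse(season: int) -> list:
    return ["g0"]


def _full(season: int) -> list:
    return [f"g{i}" for i in range(3)]


demo_sources = [
    SketchSource("Broken", _broken, priority=1, min_games=2),
    SketchSource("Sparse", _sparse, priority=2, min_games=2),
    SketchSource("Full", _full, priority=3, min_games=2),
]
demo_result = scrape_with_fallback_sketch("NHL", 2025, demo_sources)
```

Under this reading, a stub such as `scrape_nhl_api`, which returns an empty list, would simply be passed over because it never reaches its `min_games` threshold.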
|
||||
@@ -1,222 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
NWSL schedule and stadium scrapers for SportsTime.
|
||||
|
||||
This module provides:
|
||||
- NWSL team mappings (13 teams)
|
||||
- NWSL stadium scrapers (hardcoded with coordinates)
|
||||
- Multi-source fallback configurations
|
||||
|
||||
Note: Many NWSL teams share stadiums with MLS teams.
|
||||
Coordinates are cross-referenced from mls.py where applicable.
|
||||
"""
|
||||
|
||||
# Support both direct execution and import from parent directory.
# Only the names this module actually uses are imported.
try:
    from core import (
        Stadium,
        StadiumScraperSource,
        scrape_stadiums_with_fallback,
    )
except ImportError:
    from Scripts.core import (
        Stadium,
        StadiumScraperSource,
        scrape_stadiums_with_fallback,
    )
|
||||
|
||||
|
||||
__all__ = [
|
||||
# Team data
|
||||
'NWSL_TEAMS',
|
||||
# Stadium scrapers
|
||||
'scrape_nwsl_stadiums_hardcoded',
|
||||
'scrape_nwsl_stadiums',
|
||||
# Source configurations
|
||||
'NWSL_STADIUM_SOURCES',
|
||||
# Convenience functions
|
||||
'get_nwsl_team_abbrev',
|
||||
]
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# TEAM MAPPINGS
|
||||
# =============================================================================
|
||||
|
||||
NWSL_TEAMS = {
|
||||
'LA': {'name': 'Angel City FC', 'city': 'Los Angeles', 'stadium': 'BMO Stadium'},
|
||||
'SJ': {'name': 'Bay FC', 'city': 'San Jose', 'stadium': 'PayPal Park'},
|
||||
'CHI': {'name': 'Chicago Red Stars', 'city': 'Bridgeview', 'stadium': 'SeatGeek Stadium'},
|
||||
'HOU': {'name': 'Houston Dash', 'city': 'Houston', 'stadium': 'Shell Energy Stadium'},
|
||||
'KC': {'name': 'Kansas City Current', 'city': 'Kansas City', 'stadium': 'CPKC Stadium'},
|
||||
'NJ': {'name': 'NJ/NY Gotham FC', 'city': 'Harrison', 'stadium': 'Red Bull Arena'},
|
||||
'NC': {'name': 'North Carolina Courage', 'city': 'Cary', 'stadium': 'WakeMed Soccer Park'},
|
||||
'ORL': {'name': 'Orlando Pride', 'city': 'Orlando', 'stadium': 'Inter&Co Stadium'},
|
||||
'POR': {'name': 'Portland Thorns FC', 'city': 'Portland', 'stadium': 'Providence Park'},
|
||||
'SEA': {'name': 'Seattle Reign FC', 'city': 'Seattle', 'stadium': 'Lumen Field'},
|
||||
'SD': {'name': 'San Diego Wave FC', 'city': 'San Diego', 'stadium': 'Snapdragon Stadium'},
|
||||
'UTA': {'name': 'Utah Royals FC', 'city': 'Sandy', 'stadium': 'America First Field'},
|
||||
'WAS': {'name': 'Washington Spirit', 'city': 'Washington', 'stadium': 'Audi Field'},
|
||||
}
|
||||
|
||||
|
||||
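def get_nwsl_team_abbrev(team_name: str) -> str:
    """Get NWSL team abbreviation from full name."""
    needle = team_name.strip().lower()

    # Try an exact match across all teams first, so a partial hit on an
    # earlier entry cannot shadow an exact hit on a later one.
    for abbrev, info in NWSL_TEAMS.items():
        if info['name'].lower() == needle:
            return abbrev

    # Partial match; skip empty input, which is a substring of every name.
    if needle:
        for abbrev, info in NWSL_TEAMS.items():
            if needle in info['name'].lower():
                return abbrev

    # Return first 3 letters as fallback
    return team_name[:3].upper()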
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# STADIUM SCRAPERS
|
||||
# =============================================================================
|
||||
|
||||
def scrape_nwsl_stadiums_hardcoded() -> list[Stadium]:
|
||||
"""
|
||||
Source 1: Hardcoded NWSL stadiums with complete data.
|
||||
All 13 NWSL stadiums with capacity (NWSL configuration) and year_opened.
|
||||
|
||||
Shared stadium coordinates are cross-referenced from MLS module:
|
||||
- BMO Stadium (shared with LAFC)
|
||||
- PayPal Park (shared with SJ Earthquakes)
|
||||
- Shell Energy Stadium (shared with Houston Dynamo)
|
||||
- Red Bull Arena (shared with NY Red Bulls)
|
||||
- Inter&Co Stadium (shared with Orlando City SC)
|
||||
- Providence Park (shared with Portland Timbers)
|
||||
- Lumen Field (shared with Seattle Sounders/Seahawks)
|
||||
- Snapdragon Stadium (shared with San Diego FC)
|
||||
- America First Field (shared with Real Salt Lake)
|
||||
- Audi Field (shared with DC United)
|
||||
"""
|
||||
nwsl_stadiums = {
|
||||
# Shared stadiums with MLS teams (coordinates from mls.py)
|
||||
'BMO Stadium': {
|
||||
'city': 'Los Angeles', 'state': 'CA',
|
||||
'lat': 34.0128, 'lng': -118.2841,
|
||||
'capacity': 22000, 'teams': ['LA'], 'year_opened': 2018
|
||||
},
|
||||
'PayPal Park': {
|
||||
'city': 'San Jose', 'state': 'CA',
|
||||
'lat': 37.3514, 'lng': -121.9250,
|
||||
'capacity': 18000, 'teams': ['SJ'], 'year_opened': 2015
|
||||
},
|
||||
'Shell Energy Stadium': {
|
||||
'city': 'Houston', 'state': 'TX',
|
||||
'lat': 29.7522, 'lng': -95.3524,
|
||||
'capacity': 22039, 'teams': ['HOU'], 'year_opened': 2012
|
||||
},
|
||||
'Red Bull Arena': {
|
||||
'city': 'Harrison', 'state': 'NJ',
|
||||
'lat': 40.7367, 'lng': -74.1503,
|
||||
'capacity': 25000, 'teams': ['NJ'], 'year_opened': 2010
|
||||
},
|
||||
'Inter&Co Stadium': {
|
||||
'city': 'Orlando', 'state': 'FL',
|
||||
'lat': 28.5411, 'lng': -81.3893,
|
||||
'capacity': 25500, 'teams': ['ORL'], 'year_opened': 2017
|
||||
},
|
||||
'Providence Park': {
|
||||
'city': 'Portland', 'state': 'OR',
|
||||
'lat': 45.5214, 'lng': -122.6917,
|
||||
'capacity': 25218, 'teams': ['POR'], 'year_opened': 1926
|
||||
},
|
||||
'Lumen Field': {
|
||||
'city': 'Seattle', 'state': 'WA',
|
||||
'lat': 47.5952, 'lng': -122.3316,
|
||||
'capacity': 37722, 'teams': ['SEA'], 'year_opened': 2002
|
||||
},
|
||||
'Snapdragon Stadium': {
|
||||
'city': 'San Diego', 'state': 'CA',
|
||||
'lat': 32.7844, 'lng': -117.1228,
|
||||
'capacity': 35000, 'teams': ['SD'], 'year_opened': 2022
|
||||
},
|
||||
'America First Field': {
|
||||
'city': 'Sandy', 'state': 'UT',
|
||||
'lat': 40.5829, 'lng': -111.8934,
|
||||
'capacity': 20213, 'teams': ['UTA'], 'year_opened': 2008
|
||||
},
|
||||
'Audi Field': {
|
||||
'city': 'Washington', 'state': 'DC',
|
||||
'lat': 38.8684, 'lng': -77.0129,
|
||||
'capacity': 20000, 'teams': ['WAS'], 'year_opened': 2018
|
||||
},
|
||||
# NWSL-specific stadiums
|
||||
'SeatGeek Stadium': {
|
||||
'city': 'Bridgeview', 'state': 'IL',
|
||||
'lat': 41.7653, 'lng': -87.8049,
|
||||
'capacity': 20000, 'teams': ['CHI'], 'year_opened': 2006
|
||||
},
|
||||
'CPKC Stadium': {
|
||||
'city': 'Kansas City', 'state': 'MO',
|
||||
'lat': 39.0975, 'lng': -94.5556,
|
||||
'capacity': 11500, 'teams': ['KC'], 'year_opened': 2024
|
||||
},
|
||||
'WakeMed Soccer Park': {
|
||||
'city': 'Cary', 'state': 'NC',
|
||||
'lat': 35.8018, 'lng': -78.7442,
|
||||
'capacity': 10000, 'teams': ['NC'], 'year_opened': 2002
|
||||
},
|
||||
}
|
||||
|
||||
stadiums = []
|
||||
for name, info in nwsl_stadiums.items():
|
||||
# Create normalized ID (f-strings can't have backslashes)
|
||||
normalized_name = name.lower().replace(' ', '_').replace('&', 'and').replace('.', '').replace("'", '')
|
||||
stadium_id = f"nwsl_{normalized_name[:30]}"
|
||||
stadium = Stadium(
|
||||
id=stadium_id,
|
||||
name=name,
|
||||
city=info['city'],
|
||||
state=info['state'],
|
||||
latitude=info['lat'],
|
||||
longitude=info['lng'],
|
||||
capacity=info['capacity'],
|
||||
sport='NWSL',
|
||||
team_abbrevs=info['teams'],
|
||||
source='nwsl_hardcoded',
|
||||
year_opened=info.get('year_opened')
|
||||
)
|
||||
stadiums.append(stadium)
|
||||
|
||||
return stadiums
|
||||
|
||||
|
||||
def scrape_nwsl_stadiums() -> list[Stadium]:
    """
    Fetch NWSL stadium data with multi-source fallback.
    Hardcoded source is primary (has complete data).
    """
    print("\nNWSL STADIUMS")
    print("-" * 40)

    # Reuse the module-level NWSL_STADIUM_SOURCES instead of a duplicate
    # local list. The name is resolved when this function is called, so it
    # may be defined later in the module.
    return scrape_stadiums_with_fallback('NWSL', NWSL_STADIUM_SOURCES)
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# SOURCE CONFIGURATIONS
|
||||
# =============================================================================
|
||||
|
||||
NWSL_STADIUM_SOURCES = [
|
||||
StadiumScraperSource('Hardcoded', scrape_nwsl_stadiums_hardcoded, priority=1, min_venues=10),
|
||||
]
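The stadium ID normalization in `scrape_nwsl_stadiums_hardcoded` can be pulled into a small helper to see what IDs it yields. This is a sketch that repeats the same chain of `replace` calls and the 30-character cap; the helper name `make_stadium_id` is hypothetical and does not exist in the module.

```python
def make_stadium_id(sport: str, name: str, max_len: int = 30) -> str:
    """Build a stadium ID like 'nwsl_bmo_stadium' from a display name."""
    normalized = (
        name.lower()
        .replace(' ', '_')    # spaces become underscores
        .replace('&', 'and')  # 'Inter&Co' -> 'interandco'
        .replace('.', '')     # drop dots, e.g. in 'Crypto.com'
        .replace("'", '')     # drop apostrophes
    )
    # Truncate only the name part, then prefix with the lowercase sport.
    return f"{sport.lower()}_{normalized[:max_len]}"


inter_co_id = make_stadium_id('NWSL', 'Inter&Co Stadium')
crypto_id = make_stadium_id('NHL', 'Crypto.com Arena')
print(inter_co_id)  # nwsl_interandco_stadium
print(crypto_id)    # nhl_cryptocom_arena
```

Note that the NHL module shown earlier only lowercases and replaces spaces, so its ID for the same arena keeps the dot (`nhl_crypto.com_arena`). If IDs are ever compared across modules, the two schemes would need to agree.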
|
||||
@@ -0,0 +1,66 @@
|
||||
[build-system]
|
||||
requires = ["setuptools>=61.0", "wheel"]
|
||||
build-backend = "setuptools.build_meta"
|
||||
|
||||
[project]
|
||||
name = "sportstime-parser"
|
||||
version = "0.1.0"
|
||||
description = "Sports data scraper and CloudKit uploader for SportsTime app"
|
||||
readme = "README.md"
|
||||
requires-python = ">=3.11"
|
||||
license = {text = "MIT"}
|
||||
authors = [
|
||||
{name = "SportsTime Team"}
|
||||
]
|
||||
keywords = ["sports", "scraper", "cloudkit", "nba", "mlb", "nfl", "nhl", "mls"]
|
||||
classifiers = [
|
||||
"Development Status :: 3 - Alpha",
|
||||
"Intended Audience :: Developers",
|
||||
"License :: OSI Approved :: MIT License",
|
||||
"Programming Language :: Python :: 3",
|
||||
"Programming Language :: Python :: 3.11",
|
||||
"Programming Language :: Python :: 3.12",
|
||||
"Programming Language :: Python :: 3.13",
|
||||
]
|
||||
dependencies = [
|
||||
"requests>=2.31.0",
|
||||
"beautifulsoup4>=4.12.0",
|
||||
"lxml>=5.0.0",
|
||||
"rapidfuzz>=3.5.0",
|
||||
"python-dateutil>=2.8.0",
|
||||
"pytz>=2024.1",
|
||||
"rich>=13.7.0",
|
||||
"pyjwt>=2.8.0",
|
||||
"cryptography>=42.0.0",
|
||||
]
|
||||
|
||||
[project.optional-dependencies]
|
||||
dev = [
|
||||
"pytest>=8.0.0",
|
||||
"pytest-cov>=4.1.0",
|
||||
"responses>=0.25.0",
|
||||
]
|
||||
|
||||
[project.scripts]
|
||||
sportstime-parser = "sportstime_parser.__main__:main"
|
||||
|
||||
[tool.setuptools.packages.find]
|
||||
where = ["."]
|
||||
include = ["sportstime_parser*"]
|
||||
|
||||
[tool.pytest.ini_options]
|
||||
testpaths = ["tests"]
|
||||
python_files = ["test_*.py"]
|
||||
python_functions = ["test_*"]
|
||||
addopts = "-v --tb=short"
|
||||
|
||||
[tool.coverage.run]
|
||||
source = ["sportstime_parser"]
|
||||
omit = ["tests/*"]
|
||||
|
||||
[tool.coverage.report]
|
||||
exclude_lines = [
|
||||
"pragma: no cover",
|
||||
"if __name__ == .__main__.:",
|
||||
"raise NotImplementedError",
|
||||
]
|
||||
@@ -1,8 +1,15 @@
|
||||
# Sports Schedule Scraper Dependencies
|
||||
requests>=2.28.0
|
||||
beautifulsoup4>=4.11.0
|
||||
pandas>=2.0.0
|
||||
lxml>=4.9.0
|
||||
# Core dependencies
|
||||
requests>=2.31.0
|
||||
beautifulsoup4>=4.12.0
|
||||
lxml>=5.0.0
|
||||
rapidfuzz>=3.5.0
|
||||
python-dateutil>=2.8.0
|
||||
pytz>=2024.1
|
||||
rich>=13.7.0
|
||||
pyjwt>=2.8.0
|
||||
cryptography>=42.0.0
|
||||
|
||||
# CloudKit Import (optional - only needed for cloudkit_import.py)
|
||||
cryptography>=41.0.0
|
||||
# Development dependencies
|
||||
pytest>=8.0.0
|
||||
pytest-cov>=4.1.0
|
||||
responses>=0.25.0
|
||||
|
||||
@@ -1,517 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
SportsTime Canonicalization Pipeline
|
||||
====================================
|
||||
Master script that orchestrates all data canonicalization steps.
|
||||
|
||||
This is the NEW pipeline that performs local identity resolution
|
||||
BEFORE any CloudKit upload.
|
||||
|
||||
Pipeline Stages:
|
||||
1. SCRAPE: Fetch raw data from web sources
|
||||
2. CANONICALIZE STADIUMS: Generate canonical stadium IDs and aliases
|
||||
3. CANONICALIZE TEAMS: Match teams to stadiums, generate canonical IDs
|
||||
4. CANONICALIZE GAMES: Resolve all references, generate canonical IDs
|
||||
5. VALIDATE: Verify all data is internally consistent
|
||||
6. (Optional) UPLOAD: CloudKit upload (separate script)
|
||||
|
||||
Usage:
|
||||
python run_canonicalization_pipeline.py # Full pipeline
|
||||
python run_canonicalization_pipeline.py --season 2026 # Specify season
|
||||
python run_canonicalization_pipeline.py --skip-scrape # Use existing raw data
|
||||
python run_canonicalization_pipeline.py --verbose # Detailed output
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import sys
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
from dataclasses import dataclass, asdict
|
||||
|
||||
# Import from core module
|
||||
from core import (
|
||||
ScraperSource, scrape_with_fallback,
|
||||
assign_stable_ids, export_to_json,
|
||||
)
|
||||
|
||||
# Import from sport modules
|
||||
from nba import scrape_nba_basketball_reference, scrape_nba_espn, scrape_nba_cbssports
|
||||
from mlb import scrape_mlb_statsapi, scrape_mlb_baseball_reference, scrape_mlb_espn
|
||||
from nhl import scrape_nhl_hockey_reference, scrape_nhl_espn, scrape_nhl_api
|
||||
from nfl import scrape_nfl_espn, scrape_nfl_pro_football_reference, scrape_nfl_cbssports
|
||||
|
||||
# Import secondary sports from scrape_schedules (stubs)
|
||||
from scrape_schedules import (
|
||||
# WNBA sources
|
||||
scrape_wnba_espn, scrape_wnba_basketball_reference, scrape_wnba_cbssports,
|
||||
# MLS sources
|
||||
scrape_mls_espn, scrape_mls_fbref, scrape_mls_mlssoccer,
|
||||
# NWSL sources
|
||||
scrape_nwsl_espn, scrape_nwsl_fbref, scrape_nwsl_nwslsoccer,
|
||||
# Utilities
|
||||
generate_stadiums_from_teams,
|
||||
)
|
||||
from canonicalize_stadiums import (
|
||||
canonicalize_stadiums,
|
||||
add_historical_aliases,
|
||||
deduplicate_aliases,
|
||||
)
|
||||
from canonicalize_teams import canonicalize_all_teams
|
||||
from canonicalize_games import canonicalize_games
|
||||
from validate_canonical import validate_canonical_data
|
||||
|
||||
|
||||
@dataclass
|
||||
class PipelineResult:
|
||||
"""Result of the full canonicalization pipeline."""
|
||||
success: bool
|
||||
stadiums_count: int
|
||||
teams_count: int
|
||||
games_count: int
|
||||
aliases_count: int
|
||||
validation_errors: int
|
||||
validation_warnings: int
|
||||
duration_seconds: float
|
||||
output_dir: str
|
||||
|
||||
|
||||
def print_header(text: str):
|
||||
"""Print a formatted header."""
|
||||
print()
|
||||
print("=" * 70)
|
||||
print(f" {text}")
|
||||
print("=" * 70)
|
||||
|
||||
|
||||
def print_section(text: str):
|
||||
"""Print a section header."""
|
||||
print()
|
||||
print(f"--- {text} ---")
|
||||
|
||||
|
||||
def run_pipeline(
|
||||
season: int = 2026,
|
||||
output_dir: Path = Path('./data'),
|
||||
skip_scrape: bool = False,
|
||||
validate: bool = True,
|
||||
verbose: bool = False,
|
||||
) -> PipelineResult:
|
||||
"""
|
||||
Run the complete canonicalization pipeline.
|
||||
|
||||
Args:
|
||||
season: Season year (e.g., 2026)
|
||||
output_dir: Directory for output files
|
||||
skip_scrape: Skip scraping, use existing raw data
|
||||
validate: Run validation step
|
||||
verbose: Print detailed output
|
||||
|
||||
Returns:
|
||||
PipelineResult with statistics
|
||||
"""
|
||||
start_time = datetime.now()
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# =========================================================================
|
||||
# STAGE 1: SCRAPE RAW DATA
|
||||
# =========================================================================
|
||||
|
||||
if not skip_scrape:
|
||||
print_header("STAGE 1: SCRAPING RAW DATA")
|
||||
|
||||
all_games = []
|
||||
all_stadiums = []
|
||||
|
||||
# Scrape stadiums from team mappings
|
||||
print_section("Stadiums")
|
||||
all_stadiums = generate_stadiums_from_teams()
|
||||
print(f" Generated {len(all_stadiums)} stadiums from team data")
|
||||
|
||||
# Scrape all sports with multi-source fallback
|
||||
print_section(f"NBA {season}")
|
||||
nba_sources = [
|
||||
ScraperSource('Basketball-Reference', scrape_nba_basketball_reference, priority=1, min_games=500),
|
||||
ScraperSource('ESPN', scrape_nba_espn, priority=2, min_games=500),
|
||||
ScraperSource('CBS Sports', scrape_nba_cbssports, priority=3, min_games=100),
|
||||
]
|
||||
nba_games = scrape_with_fallback('NBA', season, nba_sources)
|
||||
nba_season = f"{season-1}-{str(season)[2:]}"
|
||||
nba_games = assign_stable_ids(nba_games, 'NBA', nba_season)
|
||||
all_games.extend(nba_games)
|
||||
|
||||
print_section(f"MLB {season}")
|
||||
mlb_sources = [
|
||||
ScraperSource('MLB Stats API', scrape_mlb_statsapi, priority=1, min_games=1000),
|
||||
ScraperSource('Baseball-Reference', scrape_mlb_baseball_reference, priority=2, min_games=500),
|
||||
ScraperSource('ESPN', scrape_mlb_espn, priority=3, min_games=500),
|
||||
]
|
||||
mlb_games = scrape_with_fallback('MLB', season, mlb_sources)
|
||||
mlb_games = assign_stable_ids(mlb_games, 'MLB', str(season))
|
||||
all_games.extend(mlb_games)
|
||||
|
||||
print_section(f"NHL {season}")
|
||||
nhl_sources = [
|
||||
ScraperSource('Hockey-Reference', scrape_nhl_hockey_reference, priority=1, min_games=500),
|
||||
ScraperSource('ESPN', scrape_nhl_espn, priority=2, min_games=500),
|
||||
ScraperSource('NHL API', scrape_nhl_api, priority=3, min_games=100),
|
||||
]
|
||||
nhl_games = scrape_with_fallback('NHL', season, nhl_sources)
|
||||
nhl_season = f"{season-1}-{str(season)[2:]}"
|
||||
nhl_games = assign_stable_ids(nhl_games, 'NHL', nhl_season)
|
||||
all_games.extend(nhl_games)
|
||||
|
||||
print_section(f"NFL {season}")
|
||||
nfl_sources = [
|
||||
ScraperSource('ESPN', scrape_nfl_espn, priority=1, min_games=200),
|
||||
ScraperSource('Pro-Football-Reference', scrape_nfl_pro_football_reference, priority=2, min_games=200),
|
||||
ScraperSource('CBS Sports', scrape_nfl_cbssports, priority=3, min_games=100),
|
||||
]
|
||||
nfl_games = scrape_with_fallback('NFL', season, nfl_sources)
|
||||
nfl_season = f"{season-1}-{str(season)[2:]}"
|
||||
nfl_games = assign_stable_ids(nfl_games, 'NFL', nfl_season)
|
||||
all_games.extend(nfl_games)
|
||||
|
||||
print_section(f"WNBA {season}")
|
||||
wnba_sources = [
|
||||
ScraperSource('ESPN', scrape_wnba_espn, priority=1, min_games=100),
|
||||
ScraperSource('Basketball-Reference', scrape_wnba_basketball_reference, priority=2, min_games=100),
|
||||
ScraperSource('CBS Sports', scrape_wnba_cbssports, priority=3, min_games=50),
|
||||
]
|
||||
wnba_games = scrape_with_fallback('WNBA', season, wnba_sources)
|
||||
wnba_games = assign_stable_ids(wnba_games, 'WNBA', str(season))
|
||||
all_games.extend(wnba_games)
|
||||
|
||||
print_section(f"MLS {season}")
|
||||
mls_sources = [
|
||||
ScraperSource('ESPN', scrape_mls_espn, priority=1, min_games=200),
|
||||
ScraperSource('FBref', scrape_mls_fbref, priority=2, min_games=100),
|
||||
ScraperSource('MLSSoccer.com', scrape_mls_mlssoccer, priority=3, min_games=100),
|
||||
]
|
||||
mls_games = scrape_with_fallback('MLS', season, mls_sources)
|
||||
mls_games = assign_stable_ids(mls_games, 'MLS', str(season))
|
||||
all_games.extend(mls_games)
|
||||
|
||||
print_section(f"NWSL {season}")
|
||||
nwsl_sources = [
|
||||
ScraperSource('ESPN', scrape_nwsl_espn, priority=1, min_games=100),
|
||||
ScraperSource('FBref', scrape_nwsl_fbref, priority=2, min_games=50),
|
||||
ScraperSource('NWSL.com', scrape_nwsl_nwslsoccer, priority=3, min_games=50),
|
||||
]
|
||||
nwsl_games = scrape_with_fallback('NWSL', season, nwsl_sources)
|
||||
nwsl_games = assign_stable_ids(nwsl_games, 'NWSL', str(season))
|
||||
all_games.extend(nwsl_games)
|
||||
|
||||
# Export raw data
|
||||
print_section("Exporting Raw Data")
|
||||
export_to_json(all_games, all_stadiums, output_dir)
|
||||
print(f" Exported to {output_dir}")
|
||||
|
||||
raw_games = [g.__dict__ for g in all_games]
|
||||
raw_stadiums = [s.__dict__ for s in all_stadiums]
|
||||
|
||||
    else:
        print_header("LOADING EXISTING RAW DATA")

        # Try loading from new structure first (games/*.json)
        games_dir = output_dir / 'games'
        raw_games = []

        if games_dir.exists() and any(games_dir.glob('*.json')):
            print_section("Loading from games/ directory")
            for games_file in sorted(games_dir.glob('*.json')):
                with open(games_file) as f:
                    file_games = json.load(f)
                raw_games.extend(file_games)
                print(f" Loaded {len(file_games):,} games from {games_file.name}")
        else:
            # Fallback to legacy games.json
            print_section("Loading from legacy games.json")
            games_file = output_dir / 'games.json'
            with open(games_file) as f:
                raw_games = json.load(f)

        print(f" Total: {len(raw_games):,} raw games")

        # Try loading stadiums from canonical/ first, then legacy
        canonical_dir = output_dir / 'canonical'
        if (canonical_dir / 'stadiums.json').exists():
            with open(canonical_dir / 'stadiums.json') as f:
                raw_stadiums = json.load(f)
            print(f" Loaded {len(raw_stadiums)} raw stadiums from canonical/stadiums.json")
        else:
            with open(output_dir / 'stadiums.json') as f:
                raw_stadiums = json.load(f)
            print(f" Loaded {len(raw_stadiums)} raw stadiums from stadiums.json")

    # =========================================================================
    # STAGE 2: CANONICALIZE STADIUMS
    # =========================================================================

    print_header("STAGE 2: CANONICALIZING STADIUMS")

    canonical_stadiums, stadium_aliases = canonicalize_stadiums(
        raw_stadiums, verbose=verbose
    )
    print(f" Created {len(canonical_stadiums)} canonical stadiums")

    # Add historical aliases
    canonical_ids = {s.canonical_id for s in canonical_stadiums}
    stadium_aliases = add_historical_aliases(stadium_aliases, canonical_ids)
    stadium_aliases = deduplicate_aliases(stadium_aliases)
    print(f" Created {len(stadium_aliases)} stadium aliases")

    # Export
    stadiums_canonical_path = output_dir / 'stadiums_canonical.json'
    aliases_path = output_dir / 'stadium_aliases.json'

    with open(stadiums_canonical_path, 'w') as f:
        json.dump([asdict(s) for s in canonical_stadiums], f, indent=2)

    with open(aliases_path, 'w') as f:
        json.dump([asdict(a) for a in stadium_aliases], f, indent=2)

    print(f" Exported to {stadiums_canonical_path}")
    print(f" Exported to {aliases_path}")

    # =========================================================================
    # STAGE 3: CANONICALIZE TEAMS
    # =========================================================================

    print_header("STAGE 3: CANONICALIZING TEAMS")

    # Convert canonical stadiums to dicts for team matching
    stadiums_list = [asdict(s) for s in canonical_stadiums]

    canonical_teams, team_warnings = canonicalize_all_teams(
        stadiums_list, verbose=verbose
    )
    print(f" Created {len(canonical_teams)} canonical teams")

    if team_warnings:
        print(f" Warnings: {len(team_warnings)}")
        if verbose:
            for w in team_warnings:
                print(f" - {w.team_canonical_id}: {w.issue}")

    # Export
    teams_canonical_path = output_dir / 'teams_canonical.json'

    with open(teams_canonical_path, 'w') as f:
        json.dump([asdict(t) for t in canonical_teams], f, indent=2)

    print(f" Exported to {teams_canonical_path}")

    # =========================================================================
    # STAGE 4: CANONICALIZE GAMES
    # =========================================================================

    print_header("STAGE 4: CANONICALIZING GAMES")

    # Convert data to dicts for game canonicalization
    teams_list = [asdict(t) for t in canonical_teams]
    aliases_list = [asdict(a) for a in stadium_aliases]

    canonical_games_list, game_warnings = canonicalize_games(
        raw_games, teams_list, aliases_list, verbose=verbose
    )
    print(f" Created {len(canonical_games_list)} canonical games")

    if game_warnings:
        print(f" Warnings: {len(game_warnings)}")
        if verbose:
            from collections import defaultdict
            by_issue = defaultdict(int)
            for w in game_warnings:
                by_issue[w.issue] += 1
            for issue, count in by_issue.items():
                print(f" - {issue}: {count}")

    # Export games to new structure: canonical/games/{sport}_{season}.json
    canonical_games_dir = output_dir / 'canonical' / 'games'
    canonical_games_dir.mkdir(parents=True, exist_ok=True)

    # Group games by sport and season
    games_by_sport_season = {}
    for game in canonical_games_list:
        sport = game.sport.lower()
        season = game.season
        key = f"{sport}_{season}"
        if key not in games_by_sport_season:
            games_by_sport_season[key] = []
        games_by_sport_season[key].append(game)

    # Export each sport/season file
    for key, sport_games in sorted(games_by_sport_season.items()):
        filepath = canonical_games_dir / f"{key}.json"
        with open(filepath, 'w') as f:
            json.dump([asdict(g) for g in sport_games], f, indent=2)
        print(f" Exported {len(sport_games):,} games to canonical/games/{key}.json")

    # Also export combined games_canonical.json for backward compatibility
    games_canonical_path = output_dir / 'games_canonical.json'
    with open(games_canonical_path, 'w') as f:
        json.dump([asdict(g) for g in canonical_games_list], f, indent=2)
    print(f" Exported combined to {games_canonical_path}")

    # =========================================================================
    # STAGE 5: VALIDATE
    # =========================================================================

    validation_result = None
    if validate:
        print_header("STAGE 5: VALIDATION")

        # Reload as dicts for validation
        canonical_stadiums_dicts = [asdict(s) for s in canonical_stadiums]
        canonical_teams_dicts = [asdict(t) for t in canonical_teams]
        canonical_games_dicts = [asdict(g) for g in canonical_games_list]
        aliases_dicts = [asdict(a) for a in stadium_aliases]

        validation_result = validate_canonical_data(
            canonical_stadiums_dicts,
            canonical_teams_dicts,
            canonical_games_dicts,
            aliases_dicts,
            verbose=verbose
        )

        if validation_result.is_valid:
            print(f" STATUS: PASSED")
        else:
            print(f" STATUS: FAILED")

        print(f" Errors: {validation_result.error_count}")
        print(f" Warnings: {validation_result.warning_count}")

        # Export validation report
        validation_path = output_dir / 'canonicalization_validation.json'
        with open(validation_path, 'w') as f:
            json.dump({
                'is_valid': validation_result.is_valid,
                'error_count': validation_result.error_count,
                'warning_count': validation_result.warning_count,
                'summary': validation_result.summary,
                'errors': validation_result.errors[:100],  # Limit to 100 for readability
            }, f, indent=2)
        print(f" Report exported to {validation_path}")

    # =========================================================================
    # SUMMARY
    # =========================================================================

    duration = (datetime.now() - start_time).total_seconds()

    print_header("PIPELINE COMPLETE")
    print()
    print(f" Duration: {duration:.1f} seconds")
    print(f" Stadiums: {len(canonical_stadiums)}")
    print(f" Teams: {len(canonical_teams)}")
    print(f" Games: {len(canonical_games_list)}")
    print(f" Aliases: {len(stadium_aliases)}")
    print()

    # Games by sport
    print(" Games by sport:")
    by_sport = {}
    for g in canonical_games_list:
        by_sport[g.sport] = by_sport.get(g.sport, 0) + 1
    for sport, count in sorted(by_sport.items()):
        print(f" {sport}: {count:,} games")

    print()
    print(" Output files:")
    print(f" - {output_dir / 'stadiums_canonical.json'}")
    print(f" - {output_dir / 'stadium_aliases.json'}")
    print(f" - {output_dir / 'teams_canonical.json'}")
    print(f" - {output_dir / 'games_canonical.json'} (combined)")
    print(f" - {output_dir / 'canonical' / 'games' / '*.json'} (by sport/season)")
    print(f" - {output_dir / 'canonicalization_validation.json'}")
    print()

    # Final status
    success = True
    if validation_result and not validation_result.is_valid:
        success = False
        print(" PIPELINE FAILED - Validation errors detected")
        print(" CloudKit upload should NOT proceed until errors are fixed")
    else:
        print(" PIPELINE SUCCEEDED - Ready for CloudKit upload")

    print()

    return PipelineResult(
        success=success,
        stadiums_count=len(canonical_stadiums),
        teams_count=len(canonical_teams),
        games_count=len(canonical_games_list),
        aliases_count=len(stadium_aliases),
        validation_errors=validation_result.error_count if validation_result else 0,
        validation_warnings=validation_result.warning_count if validation_result else 0,
        duration_seconds=duration,
        output_dir=str(output_dir),
    )


def main():
    parser = argparse.ArgumentParser(
        description='SportsTime Canonicalization Pipeline',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Pipeline Stages:
  1. SCRAPE: Fetch raw data from web sources
  2. CANONICALIZE STADIUMS: Generate canonical IDs and aliases
  3. CANONICALIZE TEAMS: Match teams to stadiums
  4. CANONICALIZE GAMES: Resolve all references
  5. VALIDATE: Verify internal consistency

Examples:
  python run_canonicalization_pipeline.py                 # Full pipeline
  python run_canonicalization_pipeline.py --season 2025   # Different season
  python run_canonicalization_pipeline.py --skip-scrape   # Use existing raw data
  python run_canonicalization_pipeline.py --verbose       # Show all details
"""
    )

    parser.add_argument(
        '--season', type=int, default=2026,
        help='Season year (default: 2026)'
    )
    parser.add_argument(
        '--output', type=str, default='./data',
        help='Output directory (default: ./data)'
    )
    parser.add_argument(
        '--skip-scrape', action='store_true',
        help='Skip scraping, use existing raw data files'
    )
    parser.add_argument(
        '--no-validate', action='store_true',
        help='Skip validation step'
    )
    parser.add_argument(
        '--verbose', '-v', action='store_true',
        help='Verbose output'
    )
    parser.add_argument(
        '--strict', action='store_true',
        help='Exit with error code if validation fails'
    )

    args = parser.parse_args()

    result = run_pipeline(
        season=args.season,
        output_dir=Path(args.output),
        skip_scrape=args.skip_scrape,
        validate=not args.no_validate,
        verbose=args.verbose,
    )

    # Exit with error code if requested and validation failed
    if args.strict and not result.success:
        sys.exit(1)


if __name__ == '__main__':
    main()
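Reviewer note: the removed script passes prioritized `ScraperSource` lists to `scrape_with_fallback`, whose body is not part of this diff. The sketch below is only a guess at the general pattern (try sources in priority order, accept the first one that returns at least `min_games` results). `SourceSketch` and `first_sufficient` are hypothetical names introduced for illustration and are not the `core` module's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SourceSketch:
    """Hypothetical stand-in for core.ScraperSource (assumed fields)."""
    name: str
    scrape: Callable[[int], list]
    priority: int = 1
    min_games: int = 0


def first_sufficient(season: int, sources: list) -> tuple:
    """Return (source_name, games) from the first source, in ascending
    priority order, that yields at least min_games results."""
    for source in sorted(sources, key=lambda s: s.priority):
        try:
            games = source.scrape(season)
        except Exception as exc:  # a failing source should not stop the run
            print(f"  {source.name} failed: {exc}")
            continue
        if len(games) >= source.min_games:
            return source.name, games
        print(f"  {source.name} returned only {len(games)} games, trying next source")
    # No source met its threshold
    no_source: Optional[str] = None
    return no_source, []
```

A threshold per source (rather than one global minimum) matches how the lists above are declared, where lower-priority sources carry smaller `min_games` values.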
@@ -1,523 +0,0 @@
#!/usr/bin/env python3
"""
SportsTime Data Pipeline
========================
Master script that orchestrates all data fetching, validation, and reporting.

Usage:
    python run_pipeline.py                  # Full pipeline with defaults
    python run_pipeline.py --season 2026    # Specify season
    python run_pipeline.py --sport nba      # Single sport only
    python run_pipeline.py --skip-scrape    # Validate existing data only
    python run_pipeline.py --verbose        # Detailed output
"""

import argparse
import json
import sys
from datetime import datetime
from pathlib import Path
from dataclasses import dataclass
from typing import Optional
from enum import Enum

# Import from core module
from core import (
    Game, Stadium, ScraperSource, scrape_with_fallback,
    assign_stable_ids, export_to_json,
)

# Import from sport modules
from nba import scrape_nba_basketball_reference, scrape_nba_espn, scrape_nba_cbssports
from mlb import scrape_mlb_statsapi, scrape_mlb_baseball_reference, scrape_mlb_espn
from nhl import scrape_nhl_hockey_reference, scrape_nhl_espn, scrape_nhl_api
from nfl import scrape_nfl_espn, scrape_nfl_pro_football_reference, scrape_nfl_cbssports

# Import secondary sports from scrape_schedules (stubs)
from scrape_schedules import (
    # WNBA sources
    scrape_wnba_espn, scrape_wnba_basketball_reference, scrape_wnba_cbssports,
    # MLS sources
    scrape_mls_espn, scrape_mls_fbref, scrape_mls_mlssoccer,
    # NWSL sources
    scrape_nwsl_espn, scrape_nwsl_fbref, scrape_nwsl_nwslsoccer,
    # Utilities
    scrape_all_stadiums,
)
from validate_data import (
    validate_games,
    validate_stadiums,
    scrape_mlb_all_sources,
    scrape_nba_all_sources,
    scrape_nhl_all_sources,
    ValidationReport,
)


class Severity(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass
class PipelineResult:
    success: bool
    games_scraped: int
    stadiums_scraped: int
    games_by_sport: dict
    validation_reports: list
    stadium_issues: list
    high_severity_count: int
    medium_severity_count: int
    low_severity_count: int
    output_dir: Path
    duration_seconds: float


def print_header(text: str):
    """Print a formatted header."""
    print()
    print("=" * 70)
    print(f" {text}")
    print("=" * 70)


def print_section(text: str):
    """Print a section header."""
    print()
    print(f"--- {text} ---")


def print_severity(severity: str, message: str):
    """Print a message with severity indicator."""
    icons = {
        'high': '🔴 HIGH',
        'medium': '🟡 MEDIUM',
        'low': '🟢 LOW',
    }
    print(f" {icons.get(severity, '⚪')} {message}")


def run_pipeline(
    season: int = 2025,
    sport: str = 'all',
    output_dir: Path = Path('./data'),
    skip_scrape: bool = False,
    validate: bool = True,
    verbose: bool = False,
) -> PipelineResult:
    """
    Run the complete data pipeline.
    """
    start_time = datetime.now()

    all_games = []
    all_stadiums = []
    games_by_sport = {}
    validation_reports = []
    stadium_issues = []

    output_dir.mkdir(parents=True, exist_ok=True)

    # =========================================================================
    # PHASE 1: SCRAPE DATA
    # =========================================================================

    if not skip_scrape:
        print_header("PHASE 1: SCRAPING DATA")

        # Scrape stadiums
        print_section("Stadiums")
        all_stadiums = scrape_all_stadiums()
        print(f" Generated {len(all_stadiums)} stadiums from team data")

        # Scrape by sport with multi-source fallback
        if sport in ['nba', 'all']:
            print_section(f"NBA {season}")
            nba_sources = [
                ScraperSource('Basketball-Reference', scrape_nba_basketball_reference, priority=1, min_games=500),
                ScraperSource('ESPN', scrape_nba_espn, priority=2, min_games=500),
                ScraperSource('CBS Sports', scrape_nba_cbssports, priority=3, min_games=100),
            ]
            nba_games = scrape_with_fallback('NBA', season, nba_sources)
            nba_season = f"{season-1}-{str(season)[2:]}"
            nba_games = assign_stable_ids(nba_games, 'NBA', nba_season)
            all_games.extend(nba_games)
            games_by_sport['NBA'] = len(nba_games)

        if sport in ['mlb', 'all']:
            print_section(f"MLB {season}")
            mlb_sources = [
                ScraperSource('MLB Stats API', scrape_mlb_statsapi, priority=1, min_games=1000),
                ScraperSource('Baseball-Reference', scrape_mlb_baseball_reference, priority=2, min_games=500),
                ScraperSource('ESPN', scrape_mlb_espn, priority=3, min_games=500),
            ]
            mlb_games = scrape_with_fallback('MLB', season, mlb_sources)
            mlb_games = assign_stable_ids(mlb_games, 'MLB', str(season))
            all_games.extend(mlb_games)
            games_by_sport['MLB'] = len(mlb_games)

        if sport in ['nhl', 'all']:
            print_section(f"NHL {season}")
            nhl_sources = [
                ScraperSource('Hockey-Reference', scrape_nhl_hockey_reference, priority=1, min_games=500),
                ScraperSource('ESPN', scrape_nhl_espn, priority=2, min_games=500),
                ScraperSource('NHL API', scrape_nhl_api, priority=3, min_games=100),
            ]
            nhl_games = scrape_with_fallback('NHL', season, nhl_sources)
            nhl_season = f"{season-1}-{str(season)[2:]}"
            nhl_games = assign_stable_ids(nhl_games, 'NHL', nhl_season)
            all_games.extend(nhl_games)
            games_by_sport['NHL'] = len(nhl_games)

        if sport in ['nfl', 'all']:
            print_section(f"NFL {season}")
            nfl_sources = [
                ScraperSource('ESPN', scrape_nfl_espn, priority=1, min_games=200),
                ScraperSource('Pro-Football-Reference', scrape_nfl_pro_football_reference, priority=2, min_games=200),
                ScraperSource('CBS Sports', scrape_nfl_cbssports, priority=3, min_games=100),
            ]
            nfl_games = scrape_with_fallback('NFL', season, nfl_sources)
            nfl_season = f"{season-1}-{str(season)[2:]}"
            nfl_games = assign_stable_ids(nfl_games, 'NFL', nfl_season)
            all_games.extend(nfl_games)
            games_by_sport['NFL'] = len(nfl_games)

        if sport in ['wnba', 'all']:
            print_section(f"WNBA {season}")
            wnba_sources = [
                ScraperSource('ESPN', scrape_wnba_espn, priority=1, min_games=100),
                ScraperSource('Basketball-Reference', scrape_wnba_basketball_reference, priority=2, min_games=100),
                ScraperSource('CBS Sports', scrape_wnba_cbssports, priority=3, min_games=50),
            ]
            wnba_games = scrape_with_fallback('WNBA', season, wnba_sources)
            wnba_games = assign_stable_ids(wnba_games, 'WNBA', str(season))
            all_games.extend(wnba_games)
            games_by_sport['WNBA'] = len(wnba_games)

        if sport in ['mls', 'all']:
            print_section(f"MLS {season}")
            mls_sources = [
                ScraperSource('ESPN', scrape_mls_espn, priority=1, min_games=200),
                ScraperSource('FBref', scrape_mls_fbref, priority=2, min_games=100),
                ScraperSource('MLSSoccer.com', scrape_mls_mlssoccer, priority=3, min_games=100),
            ]
            mls_games = scrape_with_fallback('MLS', season, mls_sources)
            mls_games = assign_stable_ids(mls_games, 'MLS', str(season))
            all_games.extend(mls_games)
            games_by_sport['MLS'] = len(mls_games)

        if sport in ['nwsl', 'all']:
            print_section(f"NWSL {season}")
            nwsl_sources = [
                ScraperSource('ESPN', scrape_nwsl_espn, priority=1, min_games=100),
                ScraperSource('FBref', scrape_nwsl_fbref, priority=2, min_games=50),
                ScraperSource('NWSL.com', scrape_nwsl_nwslsoccer, priority=3, min_games=50),
            ]
            nwsl_games = scrape_with_fallback('NWSL', season, nwsl_sources)
            nwsl_games = assign_stable_ids(nwsl_games, 'NWSL', str(season))
            all_games.extend(nwsl_games)
            games_by_sport['NWSL'] = len(nwsl_games)

        # Export data
        print_section("Exporting Data")
        export_to_json(all_games, all_stadiums, output_dir)
        print(f" Exported to {output_dir}")

    else:
        # Load existing data
        print_header("LOADING EXISTING DATA")

        games_file = output_dir / 'games.json'
        stadiums_file = output_dir / 'stadiums.json'

        if games_file.exists():
            with open(games_file) as f:
                games_data = json.load(f)
            all_games = [Game(**g) for g in games_data]
            for g in all_games:
                games_by_sport[g.sport] = games_by_sport.get(g.sport, 0) + 1
            print(f" Loaded {len(all_games)} games")

        if stadiums_file.exists():
            with open(stadiums_file) as f:
                stadiums_data = json.load(f)
            all_stadiums = [Stadium(**s) for s in stadiums_data]
            print(f" Loaded {len(all_stadiums)} stadiums")

    # =========================================================================
    # PHASE 2: VALIDATE DATA
    # =========================================================================

    if validate:
        print_header("PHASE 2: CROSS-VALIDATION")

        # MLB validation (has two good sources)
        if sport in ['mlb', 'all']:
            print_section("MLB Cross-Validation")
            try:
                mlb_sources = scrape_mlb_all_sources(season)
                source_names = list(mlb_sources.keys())

                if len(source_names) >= 2:
                    games1 = mlb_sources[source_names[0]]
                    games2 = mlb_sources[source_names[1]]

                    if games1 and games2:
                        report = validate_games(
                            games1, games2,
                            source_names[0], source_names[1],
                            'MLB', str(season)
                        )
                        validation_reports.append(report)

                        print(f" Sources: {source_names[0]} vs {source_names[1]}")
                        print(f" Games compared: {report.total_games_source1} vs {report.total_games_source2}")
                        print(f" Matched: {report.games_matched}")
                        print(f" Discrepancies: {len(report.discrepancies)}")
            except Exception as e:
                print(f" Error during MLB validation: {e}")

        # Stadium validation
        print_section("Stadium Validation")
        stadium_issues = validate_stadiums(all_stadiums)
        print(f" Issues found: {len(stadium_issues)}")

        # Data quality checks
        print_section("Data Quality Checks")

        # Check game counts per team
        if sport in ['nba', 'all']:
            nba_games = [g for g in all_games if g.sport == 'NBA']
            team_counts = {}
            for g in nba_games:
                team_counts[g.home_team_abbrev] = team_counts.get(g.home_team_abbrev, 0) + 1
                team_counts[g.away_team_abbrev] = team_counts.get(g.away_team_abbrev, 0) + 1

            for team, count in sorted(team_counts.items()):
                if count < 75 or count > 90:
                    print(f" NBA: {team} has {count} games (expected ~82)")

        if sport in ['nhl', 'all']:
            nhl_games = [g for g in all_games if g.sport == 'NHL']
            team_counts = {}
            for g in nhl_games:
                team_counts[g.home_team_abbrev] = team_counts.get(g.home_team_abbrev, 0) + 1
                team_counts[g.away_team_abbrev] = team_counts.get(g.away_team_abbrev, 0) + 1

            for team, count in sorted(team_counts.items()):
                if count < 75 or count > 90:
                    print(f" NHL: {team} has {count} games (expected ~82)")

        if sport in ['nfl', 'all']:
            nfl_games = [g for g in all_games if g.sport == 'NFL']
            team_counts = {}
            for g in nfl_games:
                team_counts[g.home_team_abbrev] = team_counts.get(g.home_team_abbrev, 0) + 1
                team_counts[g.away_team_abbrev] = team_counts.get(g.away_team_abbrev, 0) + 1

            for team, count in sorted(team_counts.items()):
                if count < 15 or count > 20:
                    print(f" NFL: {team} has {count} games (expected ~17)")

    # =========================================================================
    # PHASE 3: GENERATE REPORT
    # =========================================================================

    print_header("PHASE 3: DISCREPANCY REPORT")

    # Count by severity
    high_count = 0
    medium_count = 0
    low_count = 0

    # Game discrepancies
    for report in validation_reports:
        for d in report.discrepancies:
            if d.severity == 'high':
                high_count += 1
            elif d.severity == 'medium':
                medium_count += 1
            else:
                low_count += 1

    # Stadium issues
    for issue in stadium_issues:
        if issue['severity'] == 'high':
            high_count += 1
        elif issue['severity'] == 'medium':
            medium_count += 1
        else:
            low_count += 1

    # Print summary
    print()
    print(f" 🔴 HIGH severity: {high_count}")
    print(f" 🟡 MEDIUM severity: {medium_count}")
    print(f" 🟢 LOW severity: {low_count}")
    print()

    # Print high severity issues (always)
    if high_count > 0:
        print_section("HIGH Severity Issues (Requires Attention)")

        shown = 0
        max_show = 10 if not verbose else 50

        for report in validation_reports:
            for d in report.discrepancies:
                if d.severity == 'high' and shown < max_show:
                    print_severity('high', f"[{report.sport}] {d.field}: {d.game_key}")
                    if verbose:
                        print(f" {d.source1}: {d.value1}")
                        print(f" {d.source2}: {d.value2}")
                    shown += 1

        for issue in stadium_issues:
            if issue['severity'] == 'high' and shown < max_show:
                print_severity('high', f"[Stadium] {issue['stadium']}: {issue['issue']}")
                shown += 1

        if high_count > max_show:
            # --verbose raises the cap to 50 but does not show everything,
            # so only suggest it when it is not already enabled.
            hint = "" if verbose else " (use --verbose to see more)"
            print(f" ... and {high_count - max_show} more{hint}")

    # Print medium severity if verbose
    if medium_count > 0 and verbose:
        print_section("MEDIUM Severity Issues")

        for report in validation_reports:
            for d in report.discrepancies:
                if d.severity == 'medium':
                    print_severity('medium', f"[{report.sport}] {d.field}: {d.game_key}")

        for issue in stadium_issues:
            if issue['severity'] == 'medium':
                print_severity('medium', f"[Stadium] {issue['stadium']}: {issue['issue']}")

    # Save full report
    report_path = output_dir / 'pipeline_report.json'
    full_report = {
        'generated_at': datetime.now().isoformat(),
        'season': season,
        'sport': sport,
        'summary': {
            'games_scraped': len(all_games),
            'stadiums_scraped': len(all_stadiums),
            'games_by_sport': games_by_sport,
            'high_severity': high_count,
            'medium_severity': medium_count,
            'low_severity': low_count,
        },
        'game_validations': [r.to_dict() for r in validation_reports],
        'stadium_issues': stadium_issues,
    }

    with open(report_path, 'w') as f:
        json.dump(full_report, f, indent=2)

    # =========================================================================
    # FINAL SUMMARY
    # =========================================================================

    duration = (datetime.now() - start_time).total_seconds()

    print_header("PIPELINE COMPLETE")
    print()
    print(f" Duration: {duration:.1f} seconds")
    print(f" Games: {len(all_games):,}")
    print(f" Stadiums: {len(all_stadiums)}")
    print(f" Output: {output_dir.absolute()}")
    print()

    for sport_name, count in sorted(games_by_sport.items()):
        print(f" {sport_name}: {count:,} games")

    print()
    print(f" Reports saved to:")
    print(f" - {output_dir / 'games.json'}")
    print(f" - {output_dir / 'stadiums.json'}")
    print(f" - {output_dir / 'pipeline_report.json'}")
    print()

    # Status indicator
    if high_count > 0:
        print(" ⚠️ STATUS: Review required - high severity issues found")
    elif medium_count > 0:
        print(" ✓ STATUS: Complete with warnings")
    else:
        print(" ✅ STATUS: All checks passed")

    print()

    return PipelineResult(
        success=high_count == 0,
        games_scraped=len(all_games),
        stadiums_scraped=len(all_stadiums),
        games_by_sport=games_by_sport,
        validation_reports=validation_reports,
        stadium_issues=stadium_issues,
        high_severity_count=high_count,
        medium_severity_count=medium_count,
        low_severity_count=low_count,
        output_dir=output_dir,
        duration_seconds=duration,
    )


def main():
    parser = argparse.ArgumentParser(
        description='SportsTime Data Pipeline - Fetch, validate, and report on sports data',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  python run_pipeline.py                  # Full pipeline
  python run_pipeline.py --season 2026    # Different season
  python run_pipeline.py --sport mlb      # MLB only
  python run_pipeline.py --skip-scrape    # Validate existing data
  python run_pipeline.py --verbose        # Show all issues
"""
    )

    parser.add_argument(
        '--season', type=int, default=2025,
        help='Season year (default: 2025)'
    )
    parser.add_argument(
        '--sport', choices=['nba', 'mlb', 'nhl', 'nfl', 'wnba', 'mls', 'nwsl', 'all'], default='all',
        help='Sport to process (default: all)'
    )
    parser.add_argument(
        '--output', type=str, default='./data',
        help='Output directory (default: ./data)'
    )
    parser.add_argument(
        '--skip-scrape', action='store_true',
        help='Skip scraping, validate existing data only'
    )
    parser.add_argument(
        '--no-validate', action='store_true',
        help='Skip validation step'
    )
    parser.add_argument(
        '--verbose', '-v', action='store_true',
        help='Verbose output with all issues'
    )

    args = parser.parse_args()

    result = run_pipeline(
        season=args.season,
        sport=args.sport,
        output_dir=Path(args.output),
        skip_scrape=args.skip_scrape,
        validate=not args.no_validate,
        verbose=args.verbose,
    )

    # Exit with error code if high severity issues
    sys.exit(0 if result.success else 1)


if __name__ == '__main__':
    main()
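Reviewer note: the per-team game-count check in the removed `run_pipeline.py` is repeated three times (NBA, NHL, NFL) with only the bounds changing. A minimal sketch of the same check as one helper is shown here. `teams_outside_range` is a hypothetical name, and games are simplified to `(home_abbrev, away_abbrev)` pairs instead of the `Game` objects the script uses.

```python
from collections import Counter


def teams_outside_range(games, low, high):
    """Count appearances per team (home or away) and return the teams
    whose total falls outside the inclusive range [low, high].

    games: iterable of (home_abbrev, away_abbrev) pairs (simplified input).
    """
    counts = Counter()
    for home, away in games:
        counts[home] += 1
        counts[away] += 1
    return {team: n for team, n in sorted(counts.items()) if n < low or n > high}


# Usage with the NBA bounds from the script (75 to 90, expected ~82)
sample = [('BOS', 'NYK')] * 80 + [('BOS', 'LAL')] * 2
flagged = teams_outside_range(sample, 75, 90)
print(flagged)
```

With bounds passed in, the NFL case would call the same helper with `15` and `20`.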
@@ -1,527 +0,0 @@
#!/usr/bin/env python3
"""
Sports Schedule Scraper Orchestrator

This script coordinates scraping across sport-specific modules:
- core.py: Shared utilities, data classes, fallback system
- mlb.py: MLB scrapers
- nba.py: NBA scrapers
- nhl.py: NHL scrapers
- nfl.py: NFL scrapers
- mls.py: MLS stadiums
- wnba.py: WNBA stadiums
- nwsl.py: NWSL stadiums

Usage:
    python scrape_schedules.py --sport nba --season 2026
    python scrape_schedules.py --sport all --season 2026
    python scrape_schedules.py --stadiums-only
"""

import argparse
import csv
import json
import time
from collections import defaultdict
from dataclasses import asdict
from datetime import datetime
from io import StringIO
from pathlib import Path
from typing import Optional

import requests

# Import from core module
from core import (
    Game,
    Stadium,
    ScraperSource,
    StadiumScraperSource,
    fetch_page,
    scrape_with_fallback,
    scrape_stadiums_with_fallback,
    assign_stable_ids,
    export_to_json,
)

# Import from sport modules (core 4 sports)
from mlb import (
    scrape_mlb_games,
    scrape_mlb_stadiums,
    MLB_TEAMS,
)
from nba import (
    scrape_nba_games,
    scrape_nba_stadiums,
    get_nba_season_string,
    NBA_TEAMS,
)
from nhl import (
    scrape_nhl_games,
    scrape_nhl_stadiums,
    get_nhl_season_string,
    NHL_TEAMS,
)
from nfl import (
    scrape_nfl_games,
    scrape_nfl_stadiums,
    get_nfl_season_string,
    NFL_TEAMS,
)
from mls import (
    MLS_TEAMS,
    get_mls_team_abbrev,
    scrape_mls_stadiums,
    MLS_STADIUM_SOURCES,
)
from wnba import (
    WNBA_TEAMS,
    get_wnba_team_abbrev,
    scrape_wnba_stadiums,
    WNBA_STADIUM_SOURCES,
)
from nwsl import (
    NWSL_TEAMS,
    get_nwsl_team_abbrev,
    scrape_nwsl_stadiums,
    NWSL_STADIUM_SOURCES,
)


# =============================================================================
|
||||
# NON-CORE SPORT SCRAPERS
|
||||
# NOTE: MLS, WNBA, NWSL stadiums are now imported from their respective modules
|
||||
# =============================================================================
|
||||
|
||||
def _scrape_espn_schedule(sport: str, league: str, season: int, date_range: tuple[str, str]) -> list[Game]:
|
||||
"""
|
||||
Fetch schedule from ESPN API.
|
||||
Shared helper for non-core sports that use ESPN API.
|
||||
"""
|
||||
games = []
|
||||
sport_upper = {
|
||||
'wnba': 'WNBA',
|
||||
'usa.1': 'MLS',
|
||||
'usa.nwsl': 'NWSL',
|
||||
}.get(league, league.upper())
|
||||
|
||||
print(f"Fetching {sport_upper} {season} from ESPN API...")
|
||||
|
||||
url = f"https://site.api.espn.com/apis/site/v2/sports/{sport}/{league}/scoreboard"
|
||||
params = {
|
||||
'dates': f"{date_range[0]}-{date_range[1]}",
|
||||
'limit': 1000
|
||||
}
|
||||
|
||||
headers = {
|
||||
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
|
||||
}
|
||||
|
||||
try:
|
||||
response = requests.get(url, params=params, headers=headers, timeout=30)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
|
||||
events = data.get('events', [])
|
||||
|
||||
for event in events:
|
||||
try:
|
||||
date_str = event.get('date', '')[:10]
|
||||
time_str = event.get('date', '')[11:16] if len(event.get('date', '')) > 11 else None
|
||||
|
||||
competitions = event.get('competitions', [{}])
|
||||
if not competitions:
|
||||
continue
|
||||
|
||||
comp = competitions[0]
|
||||
competitors = comp.get('competitors', [])
|
||||
|
||||
if len(competitors) < 2:
|
||||
continue
|
||||
|
||||
home_team = None
|
||||
away_team = None
|
||||
home_abbrev = None
|
||||
away_abbrev = None
|
||||
|
||||
for team in competitors:
|
||||
team_data = team.get('team', {})
|
||||
team_name = team_data.get('displayName', team_data.get('name', ''))
|
||||
team_abbrev = team_data.get('abbreviation', '')
|
||||
|
||||
if team.get('homeAway') == 'home':
|
||||
home_team = team_name
|
||||
home_abbrev = team_abbrev
|
||||
else:
|
||||
away_team = team_name
|
||||
away_abbrev = team_abbrev
|
||||
|
||||
if not home_team or not away_team:
|
||||
continue
|
||||
|
||||
venue = comp.get('venue', {}).get('fullName', '')
|
||||
game_id = f"{sport_upper.lower()}_{date_str}_{away_abbrev}_{home_abbrev}".lower()
|
||||
|
||||
game = Game(
|
||||
id=game_id,
|
||||
sport=sport_upper,
|
||||
season=str(season),
|
||||
date=date_str,
|
||||
time=time_str,
|
||||
home_team=home_team,
|
||||
away_team=away_team,
|
||||
home_team_abbrev=home_abbrev or get_team_abbrev(home_team, sport_upper),
|
||||
away_team_abbrev=away_abbrev or get_team_abbrev(away_team, sport_upper),
|
||||
venue=venue,
|
||||
source='espn.com'
|
||||
)
|
||||
games.append(game)
|
||||
|
||||
except Exception:
|
||||
continue
|
||||
|
||||
print(f" Found {len(games)} games from ESPN")
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error fetching ESPN {sport_upper}: {e}")
|
||||
|
||||
return games
|
||||
|
||||
|
||||
def scrape_wnba_espn(season: int) -> list[Game]:
|
||||
"""Fetch WNBA schedule from ESPN API."""
|
||||
start = f"{season}0501"
|
||||
end = f"{season}1031"
|
||||
return _scrape_espn_schedule('basketball', 'wnba', season, (start, end))
|
||||
|
||||
|
||||
def scrape_mls_espn(season: int) -> list[Game]:
|
||||
"""Fetch MLS schedule from ESPN API."""
|
||||
start = f"{season}0201"
|
||||
end = f"{season}1231"
|
||||
return _scrape_espn_schedule('soccer', 'usa.1', season, (start, end))
|
||||
|
||||
|
||||
def scrape_nwsl_espn(season: int) -> list[Game]:
|
||||
"""Fetch NWSL schedule from ESPN API."""
|
||||
start = f"{season}0301"
|
||||
end = f"{season}1130"
|
||||
return _scrape_espn_schedule('soccer', 'usa.nwsl', season, (start, end))
|
||||
|
||||
|
||||
def scrape_wnba_basketball_reference(season: int) -> list[Game]:
|
||||
"""Scrape WNBA schedule from Basketball-Reference."""
|
||||
games = []
|
||||
url = f"https://www.basketball-reference.com/wnba/years/{season}_games.html"
|
||||
|
||||
print(f"Scraping WNBA {season} from Basketball-Reference...")
|
||||
soup = fetch_page(url, 'basketball-reference.com')
|
||||
|
||||
if not soup:
|
||||
return games
|
||||
|
||||
table = soup.find('table', {'id': 'schedule'})
|
||||
if not table:
|
||||
return games
|
||||
|
||||
tbody = table.find('tbody')
|
||||
if not tbody:
|
||||
return games
|
||||
|
||||
for row in tbody.find_all('tr'):
|
||||
if row.get('class') and 'thead' in row.get('class'):
|
||||
continue
|
||||
|
||||
try:
|
||||
date_cell = row.find('th', {'data-stat': 'date_game'})
|
||||
if not date_cell:
|
||||
continue
|
||||
date_link = date_cell.find('a')
|
||||
date_str = date_link.text if date_link else date_cell.text
|
||||
|
||||
visitor_cell = row.find('td', {'data-stat': 'visitor_team_name'})
|
||||
home_cell = row.find('td', {'data-stat': 'home_team_name'})
|
||||
|
||||
if not visitor_cell or not home_cell:
|
||||
continue
|
||||
|
||||
visitor_link = visitor_cell.find('a')
|
||||
home_link = home_cell.find('a')
|
||||
|
||||
away_team = visitor_link.text if visitor_link else visitor_cell.text
|
||||
home_team = home_link.text if home_link else home_cell.text
|
||||
|
||||
try:
|
||||
parsed_date = datetime.strptime(date_str.strip(), '%a, %b %d, %Y')
|
||||
date_formatted = parsed_date.strftime('%Y-%m-%d')
|
||||
except ValueError:
|
||||
continue
|
||||
|
||||
away_abbrev = get_team_abbrev(away_team, 'WNBA')
|
||||
home_abbrev = get_team_abbrev(home_team, 'WNBA')
|
||||
game_id = f"wnba_{date_formatted}_{away_abbrev}_{home_abbrev}".lower().replace(' ', '')
|
||||
|
||||
game = Game(
|
||||
id=game_id,
|
||||
sport='WNBA',
|
||||
season=str(season),
|
||||
date=date_formatted,
|
||||
time=None,
|
||||
home_team=home_team,
|
||||
away_team=away_team,
|
||||
home_team_abbrev=home_abbrev,
|
||||
away_team_abbrev=away_abbrev,
|
||||
venue='',
|
||||
source='basketball-reference.com'
|
||||
)
|
||||
games.append(game)
|
||||
|
||||
except Exception:
|
||||
continue
|
||||
|
||||
print(f" Found {len(games)} games from Basketball-Reference")
|
||||
return games
|
||||
|
||||
|
||||
def scrape_wnba_cbssports(season: int) -> list[Game]:
|
||||
"""Fetch WNBA schedule from CBS Sports."""
|
||||
games = []
|
||||
print(f"Fetching WNBA {season} from CBS Sports...")
|
||||
# Placeholder - CBS Sports scraping would go here
|
||||
print(f" Found {len(games)} games from CBS Sports")
|
||||
return games
|
||||
|
||||
|
||||
def scrape_mls_fbref(season: int) -> list[Game]:
|
||||
"""Scrape MLS schedule from FBref."""
|
||||
games = []
|
||||
print(f"Scraping MLS {season} from FBref...")
|
||||
# Placeholder - FBref scraping would go here
|
||||
print(f" Found {len(games)} games from FBref")
|
||||
return games
|
||||
|
||||
|
||||
def scrape_mls_mlssoccer(season: int) -> list[Game]:
|
||||
"""Scrape MLS schedule from MLSSoccer.com."""
|
||||
games = []
|
||||
print(f"Scraping MLS {season} from MLSSoccer.com...")
|
||||
# Placeholder - MLSSoccer.com scraping would go here
|
||||
print(f" Found {len(games)} games from MLSSoccer.com")
|
||||
return games
|
||||
|
||||
|
||||
def scrape_nwsl_fbref(season: int) -> list[Game]:
|
||||
"""Scrape NWSL schedule from FBref."""
|
||||
games = []
|
||||
print(f"Scraping NWSL {season} from FBref...")
|
||||
# Placeholder - FBref scraping would go here
|
||||
print(f" Found {len(games)} games from FBref")
|
||||
return games
|
||||
|
||||
|
||||
def scrape_nwsl_nwslsoccer(season: int) -> list[Game]:
|
||||
"""Scrape NWSL schedule from NWSL.com."""
|
||||
games = []
|
||||
print(f"Scraping NWSL {season} from NWSL.com...")
|
||||
# Placeholder - NWSL.com scraping would go here
|
||||
print(f" Found {len(games)} games from NWSL.com")
|
||||
return games
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# LEGACY STADIUM FUNCTIONS
|
||||
# =============================================================================
|
||||
|
||||
def scrape_stadiums_hifld() -> list[Stadium]:
|
||||
"""Legacy: Scrape from HIFLD open data."""
|
||||
# Placeholder for legacy HIFLD scraping
|
||||
return []
|
||||
|
||||
|
||||
def generate_stadiums_from_teams() -> list[Stadium]:
|
||||
"""Generate stadium entries from team data with hardcoded coordinates."""
|
||||
stadiums = []
|
||||
# This function would generate stadiums from all team dictionaries
|
||||
# Keeping as placeholder since sport modules have their own stadium scrapers
|
||||
return stadiums
|
||||
|
||||
|
||||
def scrape_all_stadiums() -> list[Stadium]:
|
||||
"""Comprehensive stadium scraping for all sports."""
|
||||
all_stadiums = []
|
||||
|
||||
# Core sports (from modules)
|
||||
all_stadiums.extend(scrape_mlb_stadiums())
|
||||
all_stadiums.extend(scrape_nba_stadiums())
|
||||
all_stadiums.extend(scrape_nhl_stadiums())
|
||||
all_stadiums.extend(scrape_nfl_stadiums())
|
||||
|
||||
# Non-core sports
|
||||
all_stadiums.extend(scrape_mls_stadiums())
|
||||
all_stadiums.extend(scrape_wnba_stadiums())
|
||||
all_stadiums.extend(scrape_nwsl_stadiums())
|
||||
|
||||
return all_stadiums
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# HELPERS
|
||||
# =============================================================================
|
||||
|
||||
def get_team_abbrev(team_name: str, sport: str) -> str:
|
||||
"""Get team abbreviation from full name."""
|
||||
teams = {
|
||||
'NBA': NBA_TEAMS,
|
||||
'MLB': MLB_TEAMS,
|
||||
'NHL': NHL_TEAMS,
|
||||
'NFL': NFL_TEAMS,
|
||||
'WNBA': WNBA_TEAMS,
|
||||
'MLS': MLS_TEAMS,
|
||||
'NWSL': NWSL_TEAMS,
|
||||
}.get(sport, {})
|
||||
|
||||
for abbrev, info in teams.items():
|
||||
if info['name'].lower() == team_name.lower():
|
||||
return abbrev
|
||||
if team_name.lower() in info['name'].lower():
|
||||
return abbrev
|
||||
|
||||
# Return first 3 letters as fallback
|
||||
return team_name[:3].upper()
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# MAIN ORCHESTRATOR
|
||||
# =============================================================================
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description='Scrape sports schedules')
|
||||
parser.add_argument('--sport', choices=['nba', 'mlb', 'nhl', 'nfl', 'wnba', 'mls', 'nwsl', 'all'], default='all')
|
||||
parser.add_argument('--season', type=int, default=2026, help='Season year (ending year)')
|
||||
parser.add_argument('--stadiums-only', action='store_true', help='Only scrape stadium data (legacy method)')
|
||||
parser.add_argument('--stadiums-update', action='store_true', help='Scrape ALL stadium data for all 7 sports (comprehensive)')
|
||||
parser.add_argument('--output', type=str, default='./data', help='Output directory')
|
||||
|
||||
args = parser.parse_args()
|
||||
output_dir = Path(args.output)
|
||||
|
||||
all_games = []
|
||||
all_stadiums = []
|
||||
|
||||
# Scrape stadiums
|
||||
print("\n" + "="*60)
|
||||
print("SCRAPING STADIUMS")
|
||||
print("="*60)
|
||||
|
||||
if args.stadiums_update:
|
||||
print("Using comprehensive stadium scrapers for all sports...")
|
||||
all_stadiums.extend(scrape_all_stadiums())
|
||||
print(f" Total stadiums scraped: {len(all_stadiums)}")
|
||||
else:
|
||||
all_stadiums.extend(scrape_stadiums_hifld())
|
||||
all_stadiums.extend(generate_stadiums_from_teams())
|
||||
|
||||
# If stadiums-only mode, export and exit
|
||||
if args.stadiums_only:
|
||||
export_to_json([], all_stadiums, output_dir)
|
||||
return
|
||||
|
||||
# Scrape schedules using sport modules
|
||||
if args.sport in ['nba', 'all']:
|
||||
print("\n" + "="*60)
|
||||
print(f"SCRAPING NBA {args.season}")
|
||||
print("="*60)
|
||||
nba_games = scrape_nba_games(args.season)
|
||||
nba_season = get_nba_season_string(args.season)
|
||||
nba_games = assign_stable_ids(nba_games, 'NBA', nba_season)
|
||||
all_games.extend(nba_games)
|
||||
|
||||
if args.sport in ['mlb', 'all']:
|
||||
print("\n" + "="*60)
|
||||
print(f"SCRAPING MLB {args.season}")
|
||||
print("="*60)
|
||||
mlb_games = scrape_mlb_games(args.season)
|
||||
mlb_games = assign_stable_ids(mlb_games, 'MLB', str(args.season))
|
||||
all_games.extend(mlb_games)
|
||||
|
||||
if args.sport in ['nhl', 'all']:
|
||||
print("\n" + "="*60)
|
||||
print(f"SCRAPING NHL {args.season}")
|
||||
print("="*60)
|
||||
nhl_games = scrape_nhl_games(args.season)
|
||||
nhl_season = get_nhl_season_string(args.season)
|
||||
nhl_games = assign_stable_ids(nhl_games, 'NHL', nhl_season)
|
||||
all_games.extend(nhl_games)
|
||||
|
||||
if args.sport in ['nfl', 'all']:
|
||||
print("\n" + "="*60)
|
||||
print(f"SCRAPING NFL {args.season}")
|
||||
print("="*60)
|
||||
nfl_games = scrape_nfl_games(args.season)
|
||||
nfl_season = get_nfl_season_string(args.season)
|
||||
nfl_games = assign_stable_ids(nfl_games, 'NFL', nfl_season)
|
||||
all_games.extend(nfl_games)
|
||||
|
||||
# Non-core sports (TODO: Extract to modules)
|
||||
if args.sport in ['wnba', 'all']:
|
||||
print("\n" + "="*60)
|
||||
print(f"SCRAPING WNBA {args.season}")
|
||||
print("="*60)
|
||||
wnba_sources = [
|
||||
ScraperSource('ESPN', scrape_wnba_espn, priority=1, min_games=100),
|
||||
ScraperSource('Basketball-Reference', scrape_wnba_basketball_reference, priority=2, min_games=100),
|
||||
ScraperSource('CBS Sports', scrape_wnba_cbssports, priority=3, min_games=50),
|
||||
]
|
||||
wnba_games = scrape_with_fallback('WNBA', args.season, wnba_sources)
|
||||
wnba_games = assign_stable_ids(wnba_games, 'WNBA', str(args.season))
|
||||
all_games.extend(wnba_games)
|
||||
|
||||
if args.sport in ['mls', 'all']:
|
||||
print("\n" + "="*60)
|
||||
print(f"SCRAPING MLS {args.season}")
|
||||
print("="*60)
|
||||
mls_sources = [
|
||||
ScraperSource('ESPN', scrape_mls_espn, priority=1, min_games=200),
|
||||
ScraperSource('FBref', scrape_mls_fbref, priority=2, min_games=100),
|
||||
ScraperSource('MLSSoccer.com', scrape_mls_mlssoccer, priority=3, min_games=100),
|
||||
]
|
||||
mls_games = scrape_with_fallback('MLS', args.season, mls_sources)
|
||||
mls_games = assign_stable_ids(mls_games, 'MLS', str(args.season))
|
||||
all_games.extend(mls_games)
|
||||
|
||||
if args.sport in ['nwsl', 'all']:
|
||||
print("\n" + "="*60)
|
||||
print(f"SCRAPING NWSL {args.season}")
|
||||
print("="*60)
|
||||
nwsl_sources = [
|
||||
ScraperSource('ESPN', scrape_nwsl_espn, priority=1, min_games=100),
|
||||
ScraperSource('FBref', scrape_nwsl_fbref, priority=2, min_games=50),
|
||||
ScraperSource('NWSL.com', scrape_nwsl_nwslsoccer, priority=3, min_games=50),
|
||||
]
|
||||
nwsl_games = scrape_with_fallback('NWSL', args.season, nwsl_sources)
|
||||
nwsl_games = assign_stable_ids(nwsl_games, 'NWSL', str(args.season))
|
||||
all_games.extend(nwsl_games)
|
||||
|
||||
# Export
|
||||
print("\n" + "="*60)
|
||||
print("EXPORTING DATA")
|
||||
print("="*60)
|
||||
|
||||
export_to_json(all_games, all_stadiums, output_dir)
|
||||
|
||||
# Summary
|
||||
print("\n" + "="*60)
|
||||
print("SUMMARY")
|
||||
print("="*60)
|
||||
print(f"Total games scraped: {len(all_games)}")
|
||||
print(f"Total stadiums: {len(all_stadiums)}")
|
||||
|
||||
by_sport = {}
|
||||
for g in all_games:
|
||||
by_sport[g.sport] = by_sport.get(g.sport, 0) + 1
|
||||
for sport, count in by_sport.items():
|
||||
print(f" {sport}: {count} games")
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
@@ -0,0 +1,688 @@
# SportsTime Parser

A Python CLI tool for scraping sports schedules, normalizing data with canonical IDs, and uploading to CloudKit.

## Features

- Scrapes game schedules from multiple sources with automatic fallback
- Supports 7 major sports leagues: NBA, MLB, NFL, NHL, MLS, WNBA, NWSL
- Generates deterministic canonical IDs for games, teams, and stadiums
- Produces validation reports with manual review lists
- Uploads to CloudKit with resumable, diff-based updates

## Requirements

- Python 3.11+
- CloudKit credentials (for upload functionality)

## Installation

```bash
# From the Scripts directory
cd Scripts

# Install in development mode
pip install -e ".[dev]"

# Or install dependencies only
pip install -r requirements.txt
```

## Quick Start

```bash
# Scrape NBA 2025-26 season
sportstime-parser scrape nba --season 2025

# Scrape all sports
sportstime-parser scrape all --season 2025

# Validate existing scraped data
sportstime-parser validate nba --season 2025

# Check status
sportstime-parser status

# Upload to CloudKit (development)
sportstime-parser upload nba --season 2025

# Upload to CloudKit (production)
sportstime-parser upload nba --season 2025 --environment production
```

## CLI Reference

### scrape

Scrape game schedules, teams, and stadiums from web sources.

```bash
sportstime-parser scrape <sport> [options]

Arguments:
  sport              Sport to scrape: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT   Season start year (default: 2025)
  --dry-run          Parse and validate only, don't write output files
  --verbose, -v      Enable verbose output
```

**Examples:**

```bash
# Scrape NBA 2025-26 season
sportstime-parser scrape nba --season 2025

# Scrape all sports with verbose output
sportstime-parser scrape all --season 2025 --verbose

# Dry run to test without writing files
sportstime-parser scrape mlb --season 2026 --dry-run
```

### validate

Run validation on existing scraped data and regenerate reports. Validation performs these checks:

1. **Game Coverage**: Compares scraped game count against expected totals per league (e.g., ~1,230 for NBA, ~2,430 for MLB)
2. **Team Resolution**: Identifies team names that couldn't be matched to canonical IDs using fuzzy matching
3. **Stadium Resolution**: Identifies venue names that couldn't be matched to canonical stadium IDs
4. **Duplicate Detection**: Finds games with the same home/away teams on the same date (potential doubleheader issues or data errors)
5. **Missing Data**: Flags games missing required fields (stadium_id, team IDs, valid dates)

The output is a Markdown report with:
- Summary statistics (total games, valid games, coverage percentage)
- Manual review items grouped by type (unresolved teams, unresolved stadiums, duplicates)
- Fuzzy match suggestions with confidence scores to help resolve unmatched names
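
As a rough illustration of checks 1 and 4 above, the sketch below computes a coverage percentage and groups games that share a date and matchup. The `GameRow` type and the `coverage_pct` and `find_duplicates` helpers are hypothetical names chosen for this example; they are not taken from the package, whose real model carries more fields.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GameRow:
    """Hypothetical minimal game record used only for this sketch."""
    date: str           # ISO date, e.g. "2025-10-21"
    home_team_id: str
    away_team_id: str


def coverage_pct(scraped: int, expected: int) -> float:
    """Percentage of the expected schedule that was actually scraped."""
    return round(100.0 * scraped / expected, 1) if expected else 0.0


def find_duplicates(games: list[GameRow]) -> list[list[GameRow]]:
    """Group games by (date, home, away) and keep groups with 2+ entries."""
    groups: dict[tuple[str, str, str], list[GameRow]] = defaultdict(list)
    for game in games:
        groups[(game.date, game.home_team_id, game.away_team_id)].append(game)
    return [rows for rows in groups.values() if len(rows) > 1]
```

A group returned by `find_duplicates` is only a candidate: a genuine doubleheader produces the same key, which is why the report lists these for manual review rather than dropping them.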

```bash
sportstime-parser validate <sport> [options]

Arguments:
  sport              Sport to validate: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT   Season start year (default: 2025)
```

**Examples:**

```bash
# Validate NBA data
sportstime-parser validate nba --season 2025

# Validate all sports
sportstime-parser validate all
```

### upload

Upload scraped data to CloudKit with diff-based updates.

```bash
sportstime-parser upload <sport> [options]

Arguments:
  sport               Sport to upload: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT    Season start year (default: 2025)
  --environment, -e   CloudKit environment: development or production (default: development)
  --resume            Resume interrupted upload from last checkpoint
```

**Examples:**

```bash
# Upload NBA to development
sportstime-parser upload nba --season 2025

# Upload to production
sportstime-parser upload nba --season 2025 --environment production

# Resume interrupted upload
sportstime-parser upload mlb --season 2026 --resume
```
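
This README does not spell out how the diff-based update works. One plausible shape, shown purely as an assumption, is to compare local records with the records already in CloudKit, both keyed by canonical ID, and send only what differs. The `plan_upload` helper and the dictionary layout are hypothetical.

```python
def plan_upload(local: dict[str, dict], remote: dict[str, dict]) -> dict[str, list[str]]:
    """Split local record IDs into create / update / unchanged buckets.

    Both arguments map a canonical ID to that record's field dictionary.
    """
    plan: dict[str, list[str]] = {"create": [], "update": [], "unchanged": []}
    for record_id, fields in local.items():
        if record_id not in remote:
            plan["create"].append(record_id)      # new record
        elif remote[record_id] != fields:
            plan["update"].append(record_id)      # at least one field differs
        else:
            plan["unchanged"].append(record_id)   # nothing to send
    return plan
```

Under this assumption a re-upload of an unchanged season sends nothing, which is the main appeal of diffing before writing.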

### status

Show current scrape and upload status.

```bash
sportstime-parser status
```

### retry

Retry failed uploads from previous attempts.

```bash
sportstime-parser retry <sport> [options]

Arguments:
  sport               Sport to retry: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT    Season start year (default: 2025)
  --environment, -e   CloudKit environment (default: development)
  --max-retries INT   Maximum retry attempts per record (default: 3)
```

### clear

Clear upload session state to start fresh.

```bash
sportstime-parser clear <sport> [options]

Arguments:
  sport               Sport to clear: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT    Season start year (default: 2025)
  --environment, -e   CloudKit environment (default: development)
```

## CloudKit Configuration

To upload data to CloudKit, you need to configure authentication credentials.

### 1. Get Credentials from Apple Developer Portal

1. Go to [Apple Developer Portal](https://developer.apple.com)
2. Navigate to **Certificates, Identifiers & Profiles** > **Keys**
3. Create a new key with **CloudKit** capability
4. Download the private key file (.p8)
5. Note the Key ID

### 2. Set Environment Variables

```bash
# Key ID from Apple Developer Portal
export CLOUDKIT_KEY_ID="your_key_id_here"

# Path to private key file
export CLOUDKIT_PRIVATE_KEY_PATH="/path/to/AuthKey_XXXXXX.p8"

# Or provide key content directly (useful for CI/CD)
export CLOUDKIT_PRIVATE_KEY="-----BEGIN EC PRIVATE KEY-----
...key content...
-----END EC PRIVATE KEY-----"
```

### 3. Verify Configuration

```bash
sportstime-parser status
```

The status output will show whether CloudKit is configured correctly.
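
For scripts that want a similar check without invoking the CLI, the sketch below inspects only the three environment variables named above. `cloudkit_config_problems` is a hypothetical helper, and the rule that inline key content stands in for a key file path is an assumption rather than documented behaviour.

```python
import os
from pathlib import Path
from typing import Optional


def cloudkit_config_problems(env: Optional[dict[str, str]] = None) -> list[str]:
    """Return human-readable problems; an empty list means 'looks configured'."""
    env = dict(os.environ) if env is None else env
    problems: list[str] = []
    if not env.get("CLOUDKIT_KEY_ID"):
        problems.append("CLOUDKIT_KEY_ID is not set")
    key_path = env.get("CLOUDKIT_PRIVATE_KEY_PATH")
    if env.get("CLOUDKIT_PRIVATE_KEY"):
        pass  # assumed: inline key content replaces the file path
    elif not key_path:
        problems.append("set CLOUDKIT_PRIVATE_KEY_PATH or CLOUDKIT_PRIVATE_KEY")
    elif not Path(key_path).is_file():
        problems.append(f"private key file not found: {key_path}")
    return problems
```

This only confirms that the variables are present and the file exists; it does not parse the key or contact CloudKit.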

## Output Files

Scraped data is saved to the `output/` directory:

```
output/
  games_nba_2025.json      # Game schedules
  teams_nba.json           # Team data
  stadiums_nba.json        # Stadium data
  validation_nba_2025.md   # Validation report
```

## Validation Reports

Validation reports are generated in Markdown format at `output/validation_{sport}_{season}.md`.

### Report Sections

**Summary Table**
| Metric | Description |
|--------|-------------|
| Total Games | Number of games scraped |
| Valid Games | Games with all required fields resolved |
| Coverage | Percentage of expected games found (based on league schedule) |
| Unresolved Teams | Team names that couldn't be matched |
| Unresolved Stadiums | Venue names that couldn't be matched |
| Duplicates | Potential duplicate game entries |

**Manual Review Items**

Items are grouped by type and include the raw value, source URL, and suggested fixes:

- **Unresolved Teams**: Team names not in the alias mapping. Add to `team_aliases.json` to resolve.
- **Unresolved Stadiums**: Venue names not recognized. Common for renamed arenas (naming rights changes). Add to `stadium_aliases.json`.
- **Duplicate Games**: Same matchup on same date. May indicate doubleheader parsing issues or duplicate entries from different sources.
- **Missing Data**: Games missing stadium coordinates or other required fields.

**Fuzzy Match Suggestions**

For each unresolved name, the validator provides the top fuzzy matches with confidence scores (0-100). High-confidence matches (>80) are likely correct; lower scores need manual verification.

## Canonical IDs

Canonical IDs are stable, deterministic identifiers that enable cross-referencing between games, teams, and stadiums across different data sources.

### ID Formats

**Games**
```
{sport}_{season}_{away}_{home}_{MMDD}[_{game_number}]
```
Examples:
- `nba_2025_hou_okc_1021` - NBA 2025-26, Houston @ OKC, Oct 21
- `mlb_2026_nyy_bos_0401_1` - MLB 2026, Yankees @ Red Sox, Apr 1, Game 1 (doubleheader)
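
The game pattern above can be sketched as a small pure function. `make_game_id` is a hypothetical name used for illustration; the package's own generator may take different arguments.

```python
from datetime import date
from typing import Optional


def make_game_id(sport: str, season: int, away: str, home: str,
                 game_date: date, game_number: Optional[int] = None) -> str:
    """Build an ID of the form {sport}_{season}_{away}_{home}_{MMDD}[_{n}]."""
    parts = [sport.lower(), str(season), away.lower(), home.lower(),
             game_date.strftime("%m%d")]
    if game_number is not None:  # suffix only used for doubleheaders
        parts.append(str(game_number))
    return "_".join(parts)
```

Because the function depends only on its inputs, scraping the same game from two sources yields the same ID, which is what makes the IDs usable as join keys.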

**Teams**
```
{sport}_{city}_{name}
```
Examples:
- `nba_la_lakers`
- `mlb_new_york_yankees`
- `nfl_new_york_giants`

**Stadiums**
```
{sport}_{normalized_name}
```
Examples:
- `mlb_yankee_stadium`
- `nba_crypto_com_arena`
- `nfl_sofi_stadium`
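
The README does not define `normalized_name`. One normalization that reproduces the three stadium examples above is to lowercase the name and collapse every run of non-alphanumeric characters into a single underscore; `make_stadium_id` is a hypothetical helper written under that assumption.

```python
import re


def make_stadium_id(sport: str, name: str) -> str:
    """Lowercase the venue name and replace non-alphanumeric runs with '_'."""
    slug = re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")
    return f"{sport.lower()}_{slug}"
```

Note that this simple rule would also replace accented letters with underscores, so venue names outside plain ASCII would need extra handling.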

### Generated vs Matched IDs

| Entity | Generated | Matched |
|--------|-----------|---------|
| **Teams** | Pre-defined in `team_resolver.py` mappings | Resolved from raw scraped names via aliases + fuzzy matching |
| **Stadiums** | Pre-defined in `stadium_resolver.py` mappings | Resolved from raw venue names via aliases + fuzzy matching |
| **Games** | Generated at scrape time from resolved team IDs + date | N/A (always generated, never matched) |

**Resolution Flow:**
```
Raw Name (from scraper)
    ↓
Exact Match (alias lookup in team_aliases.json / stadium_aliases.json)
    ↓ (if no match)
Fuzzy Match (Levenshtein distance against known names)
    ↓ (if confidence > threshold)
Canonical ID assigned
    ↓ (if no match)
Manual Review Item created
```

### Cross-References

Entities reference each other via canonical IDs:

```
┌─────────────────────────────────────────────────────────────┐
│ Game                                                        │
│   id: nba_2025_hou_okc_1021                                 │
│   home_team_id: nba_oklahoma_city_thunder ──────────────┐   │
│   away_team_id: nba_houston_rockets ────────────────┐   │   │
│   stadium_id: nba_paycom_center ────────────────┐   │   │   │
└─────────────────────────────────────────────────│───│───│───┘
                                                  │   │   │
┌─────────────────────────────────────────────────│───│───│───┐
│ Stadium                                         │   │   │   │
│   id: nba_paycom_center ◄───────────────────────┘   │   │   │
│   name: "Paycom Center"                             │   │   │
│   city: "Oklahoma City"                             │   │   │
│   latitude: 35.4634                                 │   │   │
│   longitude: -97.5151                               │   │   │
└─────────────────────────────────────────────────────│───│───┘
                                                      │   │
┌─────────────────────────────────────────────────────│───│───┐
│ Team                                                │   │   │
│   id: nba_houston_rockets ◄─────────────────────────┘   │   │
│   name: "Rockets"                                       │   │
│   city: "Houston"                                       │   │
│   stadium_id: nba_toyota_center                         │   │
└─────────────────────────────────────────────────────────│───┘
                                                          │
┌─────────────────────────────────────────────────────────│───┐
│ Team                                                    │   │
│   id: nba_oklahoma_city_thunder ◄───────────────────────┘   │
│   name: "Thunder"                                           │
│   city: "Oklahoma City"                                     │
│   stadium_id: nba_paycom_center                             │
└─────────────────────────────────────────────────────────────┘
```

### Alias Files

Aliases map variant names to canonical IDs:

**`team_aliases.json`**
```json
{
  "nba": {
    "LA Lakers": "nba_la_lakers",
    "Los Angeles Lakers": "nba_la_lakers",
    "LAL": "nba_la_lakers"
  }
}
```

**`stadium_aliases.json`**
```json
{
  "nba": {
    "Crypto.com Arena": "nba_crypto_com_arena",
    "Staples Center": "nba_crypto_com_arena",
    "STAPLES Center": "nba_crypto_com_arena"
  }
}
```

When a scraper returns a raw name like "LA Lakers", the resolver:
1. Checks `team_aliases.json` for an exact match → finds `nba_la_lakers`
2. If no exact match, runs fuzzy matching against all known team names
3. If fuzzy match confidence > 80%, uses that canonical ID
4. Otherwise, creates a manual review item for human resolution
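
The four resolver steps can be sketched as follows. This is a minimal illustration, not the package's resolver: `resolve_name` is a hypothetical function, and the standard library's `difflib.SequenceMatcher` stands in for the Levenshtein-based scoring named earlier, so the exact scores will differ from the real tool.

```python
from difflib import SequenceMatcher
from typing import Optional


def resolve_name(raw: str, aliases: dict[str, str], known: dict[str, str],
                 threshold: int = 80) -> tuple[Optional[str], int]:
    """Return (canonical_id, confidence); the ID is None when review is needed.

    `aliases` maps variant name -> canonical ID.
    `known` maps canonical ID -> display name.
    """
    if raw in aliases:                       # step 1: exact alias lookup
        return aliases[raw], 100
    best_id: Optional[str] = None
    best_score = 0
    for canonical_id, name in known.items():  # step 2: fuzzy match
        ratio = SequenceMatcher(None, raw.lower(), name.lower()).ratio()
        score = round(100 * ratio)
        if score > best_score:
            best_id, best_score = canonical_id, score
    if best_score > threshold:               # step 3: accept confident match
        return best_id, best_score
    return None, best_score                  # step 4: caller files a review item
```

Returning the best score even on failure lets the caller attach it to the manual review item as a suggestion.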

## Adding a New Sport

To add support for a new sport (e.g., `cfb` for college football), update these files:

### 1. Configuration (`config.py`)

Add the sport to `SUPPORTED_SPORTS` and `EXPECTED_GAME_COUNTS`:

```python
SUPPORTED_SPORTS: list[str] = [
    "nba", "mlb", "nfl", "nhl", "mls", "wnba", "nwsl",
    "cfb",  # ← Add new sport
]

EXPECTED_GAME_COUNTS: dict[str, int] = {
    # ... existing sports ...
    "cfb": 900,  # ← Add expected game count for validation
}
```

### 2. Team Mappings (`normalizers/team_resolver.py`)

Add team definitions to `TEAM_MAPPINGS`. Each entry maps an abbreviation to `(canonical_id, full_name, city)`:

```python
TEAM_MAPPINGS: dict[str, dict[str, tuple[str, str, str]]] = {
    # ... existing sports ...
    "cfb": {
        "ALA": ("team_cfb_ala", "Alabama Crimson Tide", "Tuscaloosa"),
        "OSU": ("team_cfb_osu", "Ohio State Buckeyes", "Columbus"),
        # ... all teams ...
    },
}
```

### 3. Stadium Mappings (`normalizers/stadium_resolver.py`)

Add stadium definitions to `STADIUM_MAPPINGS`. Each entry is a `StadiumInfo` with coordinates:

```python
STADIUM_MAPPINGS: dict[str, dict[str, StadiumInfo]] = {
    # ... existing sports ...
    "cfb": {
        "stadium_cfb_bryant_denny": StadiumInfo(
            id="stadium_cfb_bryant_denny",
            name="Bryant-Denny Stadium",
            city="Tuscaloosa",
            state="AL",
            country="USA",
            sport="cfb",
            latitude=33.2083,
            longitude=-87.5503,
        ),
        # ... all stadiums ...
    },
}
```

### 4. Scraper Implementation (`scrapers/cfb.py`)

Create a new scraper class extending `BaseScraper`:

```python
from .base import BaseScraper, RawGameData, ScrapeResult

class CFBScraper(BaseScraper):
    def __init__(self, season: int, **kwargs):
        super().__init__("cfb", season, **kwargs)
        self._team_resolver = get_team_resolver("cfb")
        self._stadium_resolver = get_stadium_resolver("cfb")

    def _get_sources(self) -> list[str]:
        return ["espn", "sports_reference"]  # Priority order

    def _get_source_url(self, source: str, **kwargs) -> str:
        # Return URL for each source
        ...

    def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
        # Implement scraping logic
        ...

    def _normalize_games(self, raw_games: list[RawGameData]) -> tuple[list[Game], list[ManualReviewItem]]:
        # Convert raw data to Game objects using resolvers
        ...

    def scrape_teams(self) -> list[Team]:
        # Return Team objects from TEAM_MAPPINGS
        ...

    def scrape_stadiums(self) -> list[Stadium]:
        # Return Stadium objects from STADIUM_MAPPINGS
        ...

def create_cfb_scraper(season: int) -> CFBScraper:
    return CFBScraper(season=season)
```
|
||||
|
||||
### 5. Register Scraper (`scrapers/__init__.py`)
|
||||
|
||||
Export the new scraper:
|
||||
|
||||
```python
from .cfb import CFBScraper, create_cfb_scraper

__all__ = [
    # ... existing exports ...
    "CFBScraper",
    "create_cfb_scraper",
]
```
|
||||
|
||||
### 6. CLI Registration (`cli.py`)
|
||||
|
||||
Add the sport to `get_scraper()`:
|
||||
|
||||
```python
def get_scraper(sport: str, season: int):
    # ... existing sports ...
    elif sport == "cfb":
        from .scrapers.cfb import create_cfb_scraper
        return create_cfb_scraper(season)
```
|
||||
|
||||
### 7. Alias Files (`team_aliases.json`, `stadium_aliases.json`)
|
||||
|
||||
Add initial aliases for common name variants:
|
||||
|
||||
```json
// team_aliases.json
{
  "cfb": {
    "Alabama": "team_cfb_ala",
    "Bama": "team_cfb_ala",
    "Roll Tide": "team_cfb_ala"
  }
}

// stadium_aliases.json
{
  "cfb": {
    "Bryant Denny Stadium": "stadium_cfb_bryant_denny",
    "Bryant-Denny": "stadium_cfb_bryant_denny"
  }
}
```
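As a rough illustration of how alias files like these might be consulted, the sketch below looks a raw name up in a per-sport alias table. The function name `resolve_alias` and the case-insensitive comparison are assumptions made for this example; the package's actual resolver may behave differently (for instance, by falling back to fuzzy matching).

```python
from typing import Optional

# Alias data copied from the example above.
TEAM_ALIASES = {
    "cfb": {
        "Alabama": "team_cfb_ala",
        "Bama": "team_cfb_ala",
        "Roll Tide": "team_cfb_ala",
    }
}


def resolve_alias(name: str, aliases: dict, sport: str) -> Optional[str]:
    """Return the canonical ID for `name`, or None if no alias matches.

    Hypothetical helper: matching ignores case and surrounding whitespace.
    """
    sport_aliases = aliases.get(sport, {})
    normalized = {key.casefold(): value for key, value in sport_aliases.items()}
    return normalized.get(name.strip().casefold())


team_id = resolve_alias(" bama ", TEAM_ALIASES, "cfb")
unknown = resolve_alias("Crimson", TEAM_ALIASES, "cfb")
```

A name that matches no alias returns `None`, which a caller could treat as a signal to queue the item for manual review.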
|
||||
|
||||
### 8. Documentation (`SOURCES.md`)
|
||||
|
||||
Document data sources with URLs, rate limits, and notes:
|
||||
|
||||
```markdown
## CFB (College Football)

**Teams**: 134 (FBS)
**Expected Games**: ~900 per season
**Season**: August - January

### Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/football/college-football/scoreboard` | JSON |
| 2 | Sports-Reference | `sports-reference.com/cfb/years/{YEAR}-schedule.html` | HTML |
```
|
||||
|
||||
### 9. Tests (`tests/test_scrapers/test_cfb.py`)
|
||||
|
||||
Create tests for the new scraper:
|
||||
|
||||
```python
import pytest
from sportstime_parser.scrapers.cfb import CFBScraper, create_cfb_scraper

class TestCFBScraper:
    def test_factory_creates_scraper(self):
        scraper = create_cfb_scraper(season=2025)
        assert scraper.sport == "cfb"
        assert scraper.season == 2025

    def test_get_sources_returns_priority_list(self):
        scraper = CFBScraper(season=2025)
        sources = scraper._get_sources()
        assert "espn" in sources

    # ... more tests ...
```
|
||||
|
||||
### Checklist
|
||||
|
||||
- [ ] Add to `SUPPORTED_SPORTS` in `config.py`
- [ ] Add to `EXPECTED_GAME_COUNTS` in `config.py`
- [ ] Add team mappings to `team_resolver.py`
- [ ] Add stadium mappings to `stadium_resolver.py`
- [ ] Create `scrapers/{sport}.py` with scraper class
- [ ] Export in `scrapers/__init__.py`
- [ ] Register in `cli.py` `get_scraper()`
- [ ] Add aliases to `team_aliases.json`
- [ ] Add aliases to `stadium_aliases.json`
- [ ] Document sources in `SOURCES.md`
- [ ] Create tests in `tests/test_scrapers/`
- [ ] Run `pytest` to verify all tests pass
- [ ] Run dry-run scrape: `sportstime-parser scrape {sport} --season 2025 --dry-run`
|
||||
|
||||
## Development
|
||||
|
||||
### Running Tests
|
||||
|
||||
```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=sportstime_parser --cov-report=html

# Run specific test file
pytest tests/test_scrapers/test_nba.py

# Run with verbose output
pytest -v
```
|
||||
|
||||
### Project Structure
|
||||
|
||||
```
sportstime_parser/
    __init__.py
    __main__.py              # CLI entry point
    cli.py                   # Subcommand definitions
    config.py                # Constants, defaults

    models/
        game.py              # Game dataclass
        team.py              # Team dataclass
        stadium.py           # Stadium dataclass
        aliases.py           # Alias dataclasses

    scrapers/
        base.py              # BaseScraper abstract class
        nba.py               # NBA scrapers
        mlb.py               # MLB scrapers
        nfl.py               # NFL scrapers
        nhl.py               # NHL scrapers
        mls.py               # MLS scrapers
        wnba.py              # WNBA scrapers
        nwsl.py              # NWSL scrapers

    normalizers/
        canonical_id.py      # ID generation
        team_resolver.py     # Team name resolution
        stadium_resolver.py  # Stadium name resolution
        timezone.py          # Timezone conversion
        fuzzy.py             # Fuzzy matching

    validators/
        report.py            # Validation report generator

    uploaders/
        cloudkit.py          # CloudKit Web Services client
        state.py             # Resumable upload state
        diff.py              # Record comparison

    utils/
        http.py              # Rate-limited HTTP client
        logging.py           # Verbose logger
        progress.py          # Progress bars
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### "No games file found"
|
||||
|
||||
Run the scrape command first:
|
||||
```bash
|
||||
sportstime-parser scrape nba --season 2025
|
||||
```
|
||||
|
||||
### "CloudKit not configured"
|
||||
|
||||
Set the required environment variables:
|
||||
```bash
|
||||
export CLOUDKIT_KEY_ID="your_key_id"
|
||||
export CLOUDKIT_PRIVATE_KEY_PATH="/path/to/key.p8"
|
||||
```
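To confirm that both variables are visible to the Python process before running an upload, a small check along these lines may help. This is an illustrative sketch only; `missing_cloudkit_vars` is a hypothetical helper, not part of the package, and it does not validate the key itself.

```python
import os
from typing import Mapping, Optional

# The two variables named in the error message above.
REQUIRED_VARS = ("CLOUDKIT_KEY_ID", "CLOUDKIT_PRIVATE_KEY_PATH")


def missing_cloudkit_vars(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
missing = missing_cloudkit_vars({"CLOUDKIT_KEY_ID": "your_key_id"})
```

Calling `missing_cloudkit_vars()` with no argument inspects the real environment; an empty list means both variables are set.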
|
||||
|
||||
### Rate limit errors
|
||||
|
||||
The scraper includes automatic rate limiting and exponential backoff. If you encounter persistent rate limit errors:
|
||||
|
||||
1. Wait a few minutes before retrying
|
||||
2. Try scraping one sport at a time instead of "all"
|
||||
3. Check that you're not running multiple instances
|
||||
|
||||
### Scrape fails with no data
|
||||
|
||||
1. Check your internet connection
|
||||
2. Run with `--verbose` to see detailed error messages
|
||||
3. The scraper tries multiple sources; if all of them fail, the source websites may be temporarily unavailable
|
||||
|
||||
## License
|
||||
|
||||
MIT
|
||||
@@ -0,0 +1,254 @@
|
||||
# Data Sources
|
||||
|
||||
This document lists all data sources used by the SportsTime parser, including URLs, rate limits, and data freshness expectations.
|
||||
|
||||
## Source Priority
|
||||
|
||||
Each sport has multiple sources configured in priority order. The scraper tries each source in order and uses the first one that succeeds. If a source fails (network error, parsing error, etc.), it falls back to the next source.
|
||||
|
||||
---
|
||||
|
||||
## NBA (National Basketball Association)
|
||||
|
||||
**Teams**: 30
|
||||
**Expected Games**: ~1,230 per season
|
||||
**Season**: October - June (spans two calendar years)
|
||||
|
||||
### Sources
|
||||
|
||||
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | Basketball-Reference | `basketball-reference.com/leagues/NBA_{YEAR}_games-{month}.html` | HTML |
| 2 | ESPN API | `site.api.espn.com/apis/site/v2/sports/basketball/nba/scoreboard` | JSON |
| 3 | CBS Sports | `cbssports.com/nba/schedule/` | HTML |
|
||||
|
||||
### Rate Limits
|
||||
|
||||
- **Basketball-Reference**: ~1 request/second recommended
|
||||
- **ESPN API**: No published limit, use 1 request/second to be safe
|
||||
- **CBS Sports**: ~1 request/second recommended
|
||||
|
||||
### Notes
|
||||
|
||||
- Basketball-Reference is the most reliable source with complete historical data
|
||||
- ESPN API is good for current/future seasons
|
||||
- Games are organized by month on Basketball-Reference
|
||||
|
||||
---
|
||||
|
||||
## MLB (Major League Baseball)
|
||||
|
||||
**Teams**: 30
|
||||
**Expected Games**: ~2,430 per season
|
||||
**Season**: March/April - October/November (single calendar year)
|
||||
|
||||
### Sources
|
||||
|
||||
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | Baseball-Reference | `baseball-reference.com/leagues/majors/{YEAR}-schedule.shtml` | HTML |
| 2 | MLB Stats API | `statsapi.mlb.com/api/v1/schedule` | JSON |
| 3 | ESPN API | `site.api.espn.com/apis/site/v2/sports/baseball/mlb/scoreboard` | JSON |
|
||||
|
||||
### Rate Limits
|
||||
|
||||
- **Baseball-Reference**: ~1 request/second recommended
|
||||
- **MLB Stats API**: No published limit, use 0.5 requests/second
|
||||
- **ESPN API**: ~1 request/second
|
||||
|
||||
### Notes
|
||||
|
||||
- MLB has doubleheaders; games are suffixed with `_1`, `_2`
|
||||
- Single schedule page per season on Baseball-Reference
|
||||
- MLB Stats API allows date range queries for efficiency
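The doubleheader suffixing mentioned in the notes could look something like the sketch below. The base ID strings and the rule that only repeated IDs receive a suffix are assumptions made for illustration; the real canonical ID format is defined in `normalizers/canonical_id.py` and may differ.

```python
from collections import Counter


def suffix_doubleheaders(base_ids: list[str]) -> list[str]:
    """Append _1, _2, ... to IDs that occur more than once.

    Assumes `base_ids` is in chronological order, so the first game of a
    doubleheader gets `_1` and the second gets `_2`.
    """
    totals = Counter(base_ids)  # how many games share each base ID
    seen = Counter()            # how many of each we have emitted so far
    result = []
    for base_id in base_ids:
        if totals[base_id] > 1:
            seen[base_id] += 1
            result.append(f"{base_id}_{seen[base_id]}")
        else:
            result.append(base_id)
    return result


# Placeholder IDs, not the package's real format.
ids = suffix_doubleheaders(["game_a", "game_b", "game_a"])
```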
|
||||
|
||||
---
|
||||
|
||||
## NFL (National Football League)
|
||||
|
||||
**Teams**: 32
|
||||
**Expected Games**: ~272 per season (regular season only)
|
||||
**Season**: September - February (spans two calendar years)
|
||||
|
||||
### Sources
|
||||
|
||||
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/football/nfl/scoreboard` | JSON |
| 2 | Pro-Football-Reference | `pro-football-reference.com/years/{YEAR}/games.htm` | HTML |
| 3 | CBS Sports | `cbssports.com/nfl/schedule/` | HTML |
|
||||
|
||||
### Rate Limits
|
||||
|
||||
- **ESPN API**: ~1 request/second
|
||||
- **Pro-Football-Reference**: ~1 request/second
|
||||
- **CBS Sports**: ~1 request/second
|
||||
|
||||
### Notes
|
||||
|
||||
- ESPN API uses week numbers instead of dates
|
||||
- International games (London, Mexico City, Frankfurt, etc.) are filtered out
|
||||
- Includes preseason, regular season, and playoffs
|
||||
|
||||
---
|
||||
|
||||
## NHL (National Hockey League)
|
||||
|
||||
**Teams**: 32 (including Utah Hockey Club)
|
||||
**Expected Games**: ~1,312 per season
|
||||
**Season**: October - June (spans two calendar years)
|
||||
|
||||
### Sources
|
||||
|
||||
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | Hockey-Reference | `hockey-reference.com/leagues/NHL_{YEAR}_games.html` | HTML |
| 2 | NHL API | `api-web.nhle.com/v1/schedule/{date}` | JSON |
| 3 | ESPN API | `site.api.espn.com/apis/site/v2/sports/hockey/nhl/scoreboard` | JSON |
|
||||
|
||||
### Rate Limits
|
||||
|
||||
- **Hockey-Reference**: ~1 request/second
|
||||
- **NHL API**: No published limit, use 0.5 requests/second
|
||||
- **ESPN API**: ~1 request/second
|
||||
|
||||
### Notes
|
||||
|
||||
- International games (Prague, Stockholm, Helsinki, etc.) are filtered out
|
||||
- Single schedule page per season on Hockey-Reference
|
||||
|
||||
---
|
||||
|
||||
## MLS (Major League Soccer)
|
||||
|
||||
**Teams**: 30 (including San Diego FC)
|
||||
**Expected Games**: ~493 per season
|
||||
**Season**: February/March - October/November (single calendar year)
|
||||
|
||||
### Sources
|
||||
|
||||
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/soccer/usa.1/scoreboard` | JSON |
| 2 | FBref | `fbref.com/en/comps/22/{YEAR}/schedule/` | HTML |
|
||||
|
||||
### Rate Limits
|
||||
|
||||
- **ESPN API**: ~1 request/second
|
||||
- **FBref**: ~1 request/second
|
||||
|
||||
### Notes
|
||||
|
||||
- MLS runs within a single calendar year
|
||||
- Some teams share stadiums with NFL teams
|
||||
|
||||
---
|
||||
|
||||
## WNBA (Women's National Basketball Association)
|
||||
|
||||
**Teams**: 13 (including Golden State Valkyries)
|
||||
**Expected Games**: ~220 per season
|
||||
**Season**: May - October (single calendar year)
|
||||
|
||||
### Sources
|
||||
|
||||
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/basketball/wnba/scoreboard` | JSON |
|
||||
|
||||
### Rate Limits
|
||||
|
||||
- **ESPN API**: ~1 request/second
|
||||
|
||||
### Notes
|
||||
|
||||
- Many WNBA teams share arenas with NBA teams
|
||||
- Teams and stadiums are hardcoded (smaller league)
|
||||
|
||||
---
|
||||
|
||||
## NWSL (National Women's Soccer League)
|
||||
|
||||
**Teams**: 14
|
||||
**Expected Games**: ~182 per season
|
||||
**Season**: March - November (single calendar year)
|
||||
|
||||
### Sources
|
||||
|
||||
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/soccer/usa.nwsl/scoreboard` | JSON |
|
||||
|
||||
### Rate Limits
|
||||
|
||||
- **ESPN API**: ~1 request/second
|
||||
|
||||
### Notes
|
||||
|
||||
- Many NWSL teams share stadiums with MLS teams
|
||||
- Teams and stadiums are hardcoded (smaller league)
|
||||
|
||||
---
|
||||
|
||||
## Stadium Data Sources
|
||||
|
||||
Stadium coordinates and metadata come from multiple sources:
|
||||
|
||||
| Sport | Sources |
|-------|---------|
| MLB | MLBScoreBot GitHub, cageyjames GeoJSON, hardcoded |
| NFL | NFLScoreBot GitHub, brianhatchl GeoJSON, hardcoded |
| NBA | Hardcoded |
| NHL | Hardcoded |
| MLS | gavinr GeoJSON, hardcoded |
| WNBA | Hardcoded (shared with NBA) |
| NWSL | Hardcoded (shared with MLS) |
|
||||
|
||||
---
|
||||
|
||||
## General Guidelines
|
||||
|
||||
### Rate Limiting
|
||||
|
||||
All scrapers implement:
|
||||
|
||||
1. **Default delay**: 1 second between requests
|
||||
2. **Auto-detection**: Detects HTTP 429 (Too Many Requests) responses
|
||||
3. **Exponential backoff**: Starts at 1 second and doubles after each failed attempt, up to 3 retries
|
||||
4. **Connection pooling**: Reuses HTTP connections for efficiency
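The retry behaviour in items 2 and 3 can be sketched as below. This is an illustration under stated assumptions, not the code in `utils/http.py`: `fetch_with_backoff` is a hypothetical name, `send` stands in for whatever performs the request and returns an HTTP status code, and `sleep` is injectable so the example runs without waiting.

```python
import time
from typing import Callable


def fetch_with_backoff(
    send: Callable[[], int],
    max_retries: int = 3,
    initial_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` and retry while it returns HTTP 429.

    The delay starts at `initial_delay` seconds and doubles after each
    retry, for at most `max_retries` retries.
    """
    delay = initial_delay
    status = send()
    for _ in range(max_retries):
        if status != 429:
            break
        sleep(delay)  # 1s, then 2s, then 4s with the defaults
        delay *= 2
        status = send()
    return status


# Simulated responses: two rate-limit replies, then success.
responses = iter([429, 429, 200])
waits: list[float] = []
final_status = fetch_with_backoff(lambda: next(responses), sleep=waits.append)
```

If the server still answers 429 after the last retry, the function returns 429 and leaves the decision (for example, falling back to another source) to the caller.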
|
||||
|
||||
### Error Handling
|
||||
|
||||
- **Partial data**: If a source fails mid-scrape, partial data is discarded
|
||||
- **Source fallback**: Automatically tries the next source on failure
|
||||
- **Logging**: All errors are logged for debugging
|
||||
|
||||
### Data Freshness
|
||||
|
||||
| Data Type | Freshness |
|-----------|-----------|
| Games (future) | Check weekly during season |
| Games (past) | Final scores available within hours |
| Teams | Update at start of each season |
| Stadiums | Update when venues change |
|
||||
|
||||
### Geographic Filter
|
||||
|
||||
Games at venues outside USA, Canada, and Mexico are automatically filtered out:
|
||||
|
||||
- **NFL**: London, Frankfurt, Munich, Mexico City, São Paulo
|
||||
- **NHL**: Prague, Stockholm, Helsinki, Tampere, Gothenburg
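A country-based filter of this kind might be sketched as follows. The allowed set is taken from the sentence above; the `stadium_country` key, the country spellings, and the function name are assumptions for illustration, and the real filter may rely on city names or stadium records instead.

```python
# Countries kept by the filter, per the sentence above (spelling assumed).
ALLOWED_COUNTRIES = frozenset({"USA", "Canada", "Mexico"})


def filter_games_by_country(
    games: list[dict],
    allowed: frozenset = ALLOWED_COUNTRIES,
) -> list[dict]:
    """Keep only games whose venue country is in `allowed`."""
    return [game for game in games if game.get("stadium_country") in allowed]


# Hypothetical game records with a venue country attached.
sample_games = [
    {"id": "g1", "stadium_country": "USA"},
    {"id": "g2", "stadium_country": "United Kingdom"},
    {"id": "g3", "stadium_country": "Canada"},
]
kept = filter_games_by_country(sample_games)
```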
|
||||
|
||||
---
|
||||
|
||||
## Legal Considerations
|
||||
|
||||
This tool is designed for personal/educational use. When using these sources:
|
||||
|
||||
1. Respect robots.txt files
|
||||
2. Don't make excessive requests
|
||||
3. Cache responses when possible
|
||||
4. Check each source's Terms of Service
|
||||
5. Consider that schedule data may be copyrighted
|
||||
|
||||
The ESPN API is undocumented but publicly accessible. Sports-Reference sites allow scraping but request reasonable rate limiting.
|
||||
@@ -0,0 +1,8 @@
|
||||
"""SportsTime Parser - Sports data scraper and CloudKit uploader."""
|
||||
|
||||
__version__ = "0.1.0"
|
||||
__author__ = "SportsTime Team"
|
||||
|
||||
from .cli import run_cli
|
||||
|
||||
__all__ = ["run_cli", "__version__"]
|
||||
@@ -0,0 +1,14 @@
|
||||
"""Entry point for sportstime-parser CLI."""

import sys

from .cli import run_cli


def main() -> int:
    """Main entry point."""
    return run_cli()


if __name__ == "__main__":
    sys.exit(main())
|
||||
@@ -0,0 +1,914 @@
|
||||
"""CLI subcommand definitions for sportstime-parser."""

import argparse
import sys
from typing import Optional

from .config import (
    DEFAULT_SEASON,
    CLOUDKIT_ENVIRONMENT,
    SUPPORTED_SPORTS,
    OUTPUT_DIR,
)
from .utils.logging import get_logger, set_verbose, log_success, log_failure
|
||||
|
||||
|
||||
def create_parser() -> argparse.ArgumentParser:
    """Create the main argument parser with all subcommands."""
    parser = argparse.ArgumentParser(
        prog="sportstime-parser",
        description="Sports data scraper and CloudKit uploader for SportsTime app",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  sportstime-parser scrape nba --season 2025
  sportstime-parser scrape all --season 2025
  sportstime-parser validate nba --season 2025
  sportstime-parser upload nba --season 2025
  sportstime-parser status
""",
    )

    parser.add_argument(
        "--verbose", "-v",
        action="store_true",
        help="Enable verbose output",
    )

    subparsers = parser.add_subparsers(
        dest="command",
        title="commands",
        description="Available commands",
        metavar="COMMAND",
    )

    # Scrape subcommand
    scrape_parser = subparsers.add_parser(
        "scrape",
        help="Scrape game schedules, teams, and stadiums",
        description="Scrape sports data from multiple sources",
    )
    scrape_parser.add_argument(
        "sport",
        choices=SUPPORTED_SPORTS + ["all"],
        help="Sport to scrape (or 'all' for all sports)",
    )
    scrape_parser.add_argument(
        "--season", "-s",
        type=int,
        default=DEFAULT_SEASON,
        help=f"Season start year (default: {DEFAULT_SEASON})",
    )
    scrape_parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Parse and validate only, don't write output files",
    )
    scrape_parser.set_defaults(func=cmd_scrape)

    # Validate subcommand
    validate_parser = subparsers.add_parser(
        "validate",
        help="Run validation on existing scraped data",
        description="Validate scraped data and regenerate reports",
    )
    validate_parser.add_argument(
        "sport",
        choices=SUPPORTED_SPORTS + ["all"],
        help="Sport to validate (or 'all' for all sports)",
    )
    validate_parser.add_argument(
        "--season", "-s",
        type=int,
        default=DEFAULT_SEASON,
        help=f"Season start year (default: {DEFAULT_SEASON})",
    )
    validate_parser.set_defaults(func=cmd_validate)

    # Upload subcommand
    upload_parser = subparsers.add_parser(
        "upload",
        help="Upload scraped data to CloudKit",
        description="Upload data to CloudKit with resumable, diff-based updates",
    )
    upload_parser.add_argument(
        "sport",
        choices=SUPPORTED_SPORTS + ["all"],
        help="Sport to upload (or 'all' for all sports)",
    )
    upload_parser.add_argument(
        "--season", "-s",
        type=int,
        default=DEFAULT_SEASON,
        help=f"Season start year (default: {DEFAULT_SEASON})",
    )
    upload_parser.add_argument(
        "--environment", "-e",
        choices=["development", "production"],
        default=CLOUDKIT_ENVIRONMENT,
        help=f"CloudKit environment (default: {CLOUDKIT_ENVIRONMENT})",
    )
    upload_parser.add_argument(
        "--resume",
        action="store_true",
        help="Resume interrupted upload from last checkpoint",
    )
    upload_parser.set_defaults(func=cmd_upload)

    # Status subcommand
    status_parser = subparsers.add_parser(
        "status",
        help="Show current scrape and upload status",
        description="Display summary of scraped data and upload progress",
    )
    status_parser.set_defaults(func=cmd_status)

    # Retry subcommand
    retry_parser = subparsers.add_parser(
        "retry",
        help="Retry failed uploads",
        description="Retry records that failed during previous upload attempts",
    )
    retry_parser.add_argument(
        "sport",
        choices=SUPPORTED_SPORTS + ["all"],
        help="Sport to retry (or 'all' for all sports)",
    )
    retry_parser.add_argument(
        "--season", "-s",
        type=int,
        default=DEFAULT_SEASON,
        help=f"Season start year (default: {DEFAULT_SEASON})",
    )
    retry_parser.add_argument(
        "--environment", "-e",
        choices=["development", "production"],
        default=CLOUDKIT_ENVIRONMENT,
        help=f"CloudKit environment (default: {CLOUDKIT_ENVIRONMENT})",
    )
    retry_parser.add_argument(
        "--max-retries",
        type=int,
        default=3,
        help="Maximum retry attempts per record (default: 3)",
    )
    retry_parser.set_defaults(func=cmd_retry)

    # Clear subcommand
    clear_parser = subparsers.add_parser(
        "clear",
        help="Clear upload session state",
        description="Delete upload session state files to start fresh",
    )
    clear_parser.add_argument(
        "sport",
        choices=SUPPORTED_SPORTS + ["all"],
        help="Sport to clear (or 'all' for all sports)",
    )
    clear_parser.add_argument(
        "--season", "-s",
        type=int,
        default=DEFAULT_SEASON,
        help=f"Season start year (default: {DEFAULT_SEASON})",
    )
    clear_parser.add_argument(
        "--environment", "-e",
        choices=["development", "production"],
        default=CLOUDKIT_ENVIRONMENT,
        help=f"CloudKit environment (default: {CLOUDKIT_ENVIRONMENT})",
    )
    clear_parser.set_defaults(func=cmd_clear)

    return parser
|
||||
|
||||
|
||||
def get_scraper(sport: str, season: int):
    """Get the appropriate scraper for a sport.

    Args:
        sport: Sport code
        season: Season start year

    Returns:
        Scraper instance

    Raises:
        NotImplementedError: If sport scraper is not yet implemented
    """
    if sport == "nba":
        from .scrapers.nba import create_nba_scraper
        return create_nba_scraper(season)
    elif sport == "mlb":
        from .scrapers.mlb import create_mlb_scraper
        return create_mlb_scraper(season)
    elif sport == "nfl":
        from .scrapers.nfl import create_nfl_scraper
        return create_nfl_scraper(season)
    elif sport == "nhl":
        from .scrapers.nhl import create_nhl_scraper
        return create_nhl_scraper(season)
    elif sport == "mls":
        from .scrapers.mls import create_mls_scraper
        return create_mls_scraper(season)
    elif sport == "wnba":
        from .scrapers.wnba import create_wnba_scraper
        return create_wnba_scraper(season)
    elif sport == "nwsl":
        from .scrapers.nwsl import create_nwsl_scraper
        return create_nwsl_scraper(season)
    else:
        raise NotImplementedError(f"Scraper for {sport} not yet implemented")
|
||||
|
||||
|
||||
def cmd_scrape(args: argparse.Namespace) -> int:
    """Execute the scrape command."""
    from .models.game import save_games
    from .models.team import save_teams
    from .models.stadium import save_stadiums
    from .validators.report import generate_report, validate_games

    logger = get_logger()

    sports = SUPPORTED_SPORTS if args.sport == "all" else [args.sport]

    logger.info(f"Scraping {', '.join(sports)} for {args.season}-{args.season + 1} season")

    if args.dry_run:
        logger.info("Dry run mode - no files will be written")

    # Ensure output directory exists
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

    success_count = 0
    failure_count = 0

    for sport in sports:
        logger.info(f"\n{'='*50}")
        logger.info(f"Scraping {sport.upper()}...")
        logger.info(f"{'='*50}")

        try:
            # Get scraper for this sport
            scraper = get_scraper(sport, args.season)

            # Scrape all data
            result = scraper.scrape_all()

            if not result.success:
                log_failure(f"{sport.upper()}: {result.error_message}")
                failure_count += 1
                continue

            # Validate games
            validation_issues = validate_games(result.games)
            all_review_items = result.review_items + validation_issues

            # Generate validation report
            report = generate_report(
                sport=sport,
                season=args.season,
                source=result.source,
                games=result.games,
                teams=result.teams,
                stadiums=result.stadiums,
                review_items=all_review_items,
            )

            # Log summary
            logger.info(f"Games: {report.summary.total_games}")
            logger.info(f"Teams: {len(result.teams)}")
            logger.info(f"Stadiums: {len(result.stadiums)}")
            logger.info(f"Coverage: {report.summary.game_coverage:.1f}%")
            logger.info(f"Review items: {report.summary.review_count}")

            if not args.dry_run:
                # Save output files
                games_file = OUTPUT_DIR / f"games_{sport}_{args.season}.json"
                teams_file = OUTPUT_DIR / f"teams_{sport}.json"
                stadiums_file = OUTPUT_DIR / f"stadiums_{sport}.json"

                save_games(result.games, str(games_file))
                save_teams(result.teams, str(teams_file))
                save_stadiums(result.stadiums, str(stadiums_file))

                # Save validation report
                report_path = report.save()

                logger.info(f"Saved games to: {games_file}")
                logger.info(f"Saved teams to: {teams_file}")
                logger.info(f"Saved stadiums to: {stadiums_file}")
                logger.info(f"Saved report to: {report_path}")

            log_success(f"{sport.upper()}: Scraped {result.game_count} games")
            success_count += 1

        except NotImplementedError as e:
            logger.warning(str(e))
            failure_count += 1
            continue

        except Exception as e:
            log_failure(f"{sport.upper()}: {e}")
            logger.exception("Scraping failed")
            failure_count += 1
            continue

    # Final summary
    logger.info(f"\n{'='*50}")
    logger.info("SUMMARY")
    logger.info(f"{'='*50}")
    logger.info(f"Successful: {success_count}")
    logger.info(f"Failed: {failure_count}")

    return 0 if failure_count == 0 else 1
|
||||
|
||||
|
||||
def cmd_validate(args: argparse.Namespace) -> int:
    """Execute the validate command."""
    from .models.game import load_games
    from .models.team import load_teams
    from .models.stadium import load_stadiums
    from .validators.report import generate_report, validate_games

    logger = get_logger()

    sports = SUPPORTED_SPORTS if args.sport == "all" else [args.sport]

    logger.info(f"Validating {', '.join(sports)} for {args.season}-{args.season + 1} season")

    for sport in sports:
        logger.info(f"\nValidating {sport.upper()}...")

        # Load existing data
        games_file = OUTPUT_DIR / f"games_{sport}_{args.season}.json"
        teams_file = OUTPUT_DIR / f"teams_{sport}.json"
        stadiums_file = OUTPUT_DIR / f"stadiums_{sport}.json"

        if not games_file.exists():
            logger.warning(f"No games file found: {games_file}")
            continue

        try:
            games = load_games(str(games_file))
            teams = load_teams(str(teams_file)) if teams_file.exists() else []
            stadiums = load_stadiums(str(stadiums_file)) if stadiums_file.exists() else []

            # Run validation
            review_items = validate_games(games)

            # Generate report
            report = generate_report(
                sport=sport,
                season=args.season,
                source="existing",
                games=games,
                teams=teams,
                stadiums=stadiums,
                review_items=review_items,
            )

            # Save report
            report_path = report.save()

            logger.info(f"Games: {report.summary.total_games}")
            logger.info(f"Valid: {report.summary.valid_games}")
            logger.info(f"Review items: {report.summary.review_count}")
            logger.info(f"Saved report to: {report_path}")

            log_success(f"{sport.upper()}: Validation complete")

        except Exception as e:
            log_failure(f"{sport.upper()}: {e}")
            logger.exception("Validation failed")
            continue

    return 0
|
||||
|
||||
|
||||
def cmd_upload(args: argparse.Namespace) -> int:
    """Execute the upload command."""
    from .models.game import load_games
    from .models.team import load_teams
    from .models.stadium import load_stadiums
    from .uploaders import (
        CloudKitClient,
        CloudKitError,
        CloudKitAuthError,
        CloudKitRateLimitError,
        RecordType,
        RecordDiffer,
        StateManager,
        game_to_cloudkit_record,
        team_to_cloudkit_record,
        stadium_to_cloudkit_record,
    )
    from .utils.progress import create_progress_bar

    logger = get_logger()

    sports = SUPPORTED_SPORTS if args.sport == "all" else [args.sport]

    logger.info(f"Uploading {', '.join(sports)} for {args.season}-{args.season + 1} season")
    logger.info(f"Environment: {args.environment}")

    # Initialize CloudKit client
    client = CloudKitClient(environment=args.environment)

    if not client.is_configured:
        log_failure("CloudKit not configured")
        logger.error(
            "Set CLOUDKIT_KEY_ID and CLOUDKIT_PRIVATE_KEY_PATH environment variables.\n"
            "Get credentials from Apple Developer Portal > Certificates, Identifiers & Profiles > Keys"
        )
        return 1

    # Initialize state manager
    state_manager = StateManager()
    differ = RecordDiffer()

    success_count = 0
    failure_count = 0

    for sport in sports:
        logger.info(f"\n{'='*50}")
        logger.info(f"Uploading {sport.upper()}...")
        logger.info(f"{'='*50}")

        try:
            # Load local data
            games_file = OUTPUT_DIR / f"games_{sport}_{args.season}.json"
            teams_file = OUTPUT_DIR / f"teams_{sport}.json"
            stadiums_file = OUTPUT_DIR / f"stadiums_{sport}.json"

            if not games_file.exists():
                logger.warning(f"No games file found: {games_file}")
                logger.warning("Run 'scrape' command first")
                failure_count += 1
                continue

            games = load_games(str(games_file))
            teams = load_teams(str(teams_file)) if teams_file.exists() else []
            stadiums = load_stadiums(str(stadiums_file)) if stadiums_file.exists() else []

            logger.info(f"Loaded {len(games)} games, {len(teams)} teams, {len(stadiums)} stadiums")

            # Fetch existing CloudKit records for diff
            logger.info("Fetching existing CloudKit records...")

            try:
                remote_games = client.fetch_all_records(RecordType.GAME)
                remote_teams = client.fetch_all_records(RecordType.TEAM)
                remote_stadiums = client.fetch_all_records(RecordType.STADIUM)
            except CloudKitAuthError as e:
                log_failure(f"Authentication failed: {e}")
                return 1
            except CloudKitRateLimitError:
                log_failure("Rate limit exceeded - try again later")
                return 1
            except CloudKitError as e:
                log_failure(f"Failed to fetch records: {e}")
                failure_count += 1
                continue

            # Filter remote records to this sport/season
            remote_games = [
                r for r in remote_games
                if r.get("fields", {}).get("sport", {}).get("value") == sport
                and r.get("fields", {}).get("season", {}).get("value") == args.season
            ]
            remote_teams = [
                r for r in remote_teams
                if r.get("fields", {}).get("sport", {}).get("value") == sport
            ]
            remote_stadiums = [
                r for r in remote_stadiums
                if r.get("fields", {}).get("sport", {}).get("value") == sport
            ]

            logger.info(f"Found {len(remote_games)} games, {len(remote_teams)} teams, {len(remote_stadiums)} stadiums in CloudKit")

            # Calculate diffs
            logger.info("Calculating changes...")

            game_diff = differ.diff_games(games, remote_games)
            team_diff = differ.diff_teams(teams, remote_teams)
            stadium_diff = differ.diff_stadiums(stadiums, remote_stadiums)

            total_creates = game_diff.create_count + team_diff.create_count + stadium_diff.create_count
            total_updates = game_diff.update_count + team_diff.update_count + stadium_diff.update_count
            total_unchanged = game_diff.unchanged_count + team_diff.unchanged_count + stadium_diff.unchanged_count

            logger.info(f"Creates: {total_creates}, Updates: {total_updates}, Unchanged: {total_unchanged}")

            if total_creates == 0 and total_updates == 0:
                log_success(f"{sport.upper()}: Already up to date")
                success_count += 1
                continue

            # Prepare records for upload
            all_records = []
            all_records.extend(game_diff.get_records_to_upload())
            all_records.extend(team_diff.get_records_to_upload())
            all_records.extend(stadium_diff.get_records_to_upload())

            # Create or resume upload session
            record_info = [(r.record_name, r.record_type.value) for r in all_records]
            session = state_manager.get_session_or_create(
                sport=sport,
                season=args.season,
                environment=args.environment,
                record_names=record_info,
                resume=args.resume,
            )

            if args.resume:
                pending = session.get_pending_records()
                logger.info(f"Resuming: {len(pending)} records pending")
                # Filter to only pending records
|
||||
pending_set = set(pending)
|
||||
all_records = [r for r in all_records if r.record_name in pending_set]
|
||||
|
||||
# Upload records with progress
|
||||
logger.info(f"Uploading {len(all_records)} records...")
|
||||
|
||||
with create_progress_bar(total=len(all_records), description="Uploading") as progress:
|
||||
batch_result = client.save_records(all_records)
|
||||
|
||||
# Update session state
|
||||
for op_result in batch_result.successful:
|
||||
session.mark_uploaded(op_result.record_name, op_result.record_change_tag)
|
||||
progress.advance()
|
||||
|
||||
for op_result in batch_result.failed:
|
||||
session.mark_failed(op_result.record_name, op_result.error_message or "Unknown error")
|
||||
progress.advance()
|
||||
|
||||
# Save session state
|
||||
state_manager.save_session(session)
|
||||
|
||||
# Report results
|
||||
logger.info(f"Uploaded: {batch_result.success_count}")
|
||||
logger.info(f"Failed: {batch_result.failure_count}")
|
||||
|
||||
if batch_result.failure_count > 0:
|
||||
log_failure(f"{sport.upper()}: {batch_result.failure_count} records failed")
|
||||
for op_result in batch_result.failed[:5]: # Show first 5 failures
|
||||
logger.error(f" {op_result.record_name}: {op_result.error_message}")
|
||||
if batch_result.failure_count > 5:
|
||||
logger.error(f" ... and {batch_result.failure_count - 5} more")
|
||||
failure_count += 1
|
||||
else:
|
||||
log_success(f"{sport.upper()}: Uploaded {batch_result.success_count} records")
|
||||
# Clear session on complete success
|
||||
state_manager.delete_session(sport, args.season, args.environment)
|
||||
success_count += 1
|
||||
|
||||
except Exception as e:
|
||||
log_failure(f"{sport.upper()}: {e}")
|
||||
logger.exception("Upload failed")
|
||||
failure_count += 1
|
||||
continue
|
||||
|
||||
# Final summary
|
||||
logger.info(f"\n{'='*50}")
|
||||
logger.info("SUMMARY")
|
||||
logger.info(f"{'='*50}")
|
||||
logger.info(f"Successful: {success_count}")
|
||||
logger.info(f"Failed: {failure_count}")
|
||||
|
||||
return 0 if failure_count == 0 else 1
|
||||
|
||||
|
||||
def cmd_status(args: argparse.Namespace) -> int:
    """Execute the status command."""
    import os
    from datetime import datetime, timezone
    from pathlib import Path

    from .config import EXPECTED_GAME_COUNTS
    from .models.game import load_games
    from .models.team import load_teams
    from .models.stadium import load_stadiums
    from .uploaders import StateManager

    logger = get_logger()

    logger.info("SportsTime Parser Status")
    logger.info("=" * 50)
    logger.info("")

    # Check for scraped data
    logger.info("[bold]Scraped Data[/bold]")
    logger.info("-" * 40)

    total_games = 0
    scraped_sports = 0

    for sport in SUPPORTED_SPORTS:
        games_file = OUTPUT_DIR / f"games_{sport}_{DEFAULT_SEASON}.json"
        teams_file = OUTPUT_DIR / f"teams_{sport}.json"
        stadiums_file = OUTPUT_DIR / f"stadiums_{sport}.json"

        if games_file.exists():
            try:
                games = load_games(str(games_file))
                teams = load_teams(str(teams_file)) if teams_file.exists() else []
                stadiums = load_stadiums(str(stadiums_file)) if stadiums_file.exists() else []

                game_count = len(games)
                expected = EXPECTED_GAME_COUNTS.get(sport, 0)
                coverage = (game_count / expected * 100) if expected > 0 else 0

                # Format with coverage indicator
                if coverage >= 95:
                    status = "[green]✓[/green]"
                elif coverage >= 80:
                    status = "[yellow]~[/yellow]"
                else:
                    status = "[red]![/red]"

                logger.info(
                    f" {status} {sport.upper():6} {game_count:5} games, "
                    f"{len(teams):2} teams, {len(stadiums):2} stadiums "
                    f"({coverage:.0f}% coverage)"
                )

                total_games += game_count
                scraped_sports += 1

            except Exception as e:
                logger.info(f" [red]✗[/red] {sport.upper():6} Error loading: {e}")
        else:
            logger.info(f" [dim]-[/dim] {sport.upper():6} Not scraped")

    logger.info("-" * 40)
    logger.info(f" Total: {total_games} games across {scraped_sports} sports")
    logger.info("")

    # Check for upload sessions
    logger.info("[bold]Upload Sessions[/bold]")
    logger.info("-" * 40)

    state_manager = StateManager()
    sessions = state_manager.list_sessions()

    if sessions:
        for session in sessions:
            sport = session["sport"].upper()
            season = session["season"]
            env = session["environment"]
            progress = session["progress"]
            percent = session["progress_percent"]
            status = session["status"]
            failed = session["failed_count"]

            if status == "complete":
                status_icon = "[green]✓[/green]"
            elif failed > 0:
                status_icon = "[yellow]![/yellow]"
            else:
                status_icon = "[blue]→[/blue]"

            logger.info(
                f" {status_icon} {sport} {season} ({env}): "
                f"{progress} ({percent})"
            )

            if failed > 0:
                logger.info(f" [yellow]⚠ {failed} failed records[/yellow]")

            # Show last updated time
            try:
                last_updated = datetime.fromisoformat(session["last_updated"])
                if last_updated.tzinfo is None:
                    # Naive timestamps are treated as UTC, matching the
                    # previous datetime.utcnow() comparison
                    last_updated = last_updated.replace(tzinfo=timezone.utc)
                age = datetime.now(timezone.utc) - last_updated
                if age.days > 0:
                    age_str = f"{age.days} days ago"
                elif age.seconds >= 3600:
                    age_str = f"{age.seconds // 3600} hours ago"
                elif age.seconds >= 60:
                    age_str = f"{age.seconds // 60} minutes ago"
                else:
                    age_str = "just now"
                logger.info(f" Last updated: {age_str}")
            except (ValueError, KeyError):
                pass

    else:
        logger.info(" No upload sessions found")

    logger.info("")

    # CloudKit configuration status
    logger.info("[bold]CloudKit Configuration[/bold]")
    logger.info("-" * 40)

    key_id = os.environ.get("CLOUDKIT_KEY_ID")
    key_path = os.environ.get("CLOUDKIT_PRIVATE_KEY_PATH")
    key_content = os.environ.get("CLOUDKIT_PRIVATE_KEY")

    if key_id:
        logger.info(f" [green]✓[/green] CLOUDKIT_KEY_ID: {key_id[:8]}...")
    else:
        logger.info(" [red]✗[/red] CLOUDKIT_KEY_ID: Not set")

    if key_path:
        if Path(key_path).exists():
            logger.info(f" [green]✓[/green] CLOUDKIT_PRIVATE_KEY_PATH: {key_path}")
        else:
            logger.info(f" [red]✗[/red] CLOUDKIT_PRIVATE_KEY_PATH: File not found: {key_path}")
    elif key_content:
        logger.info(" [green]✓[/green] CLOUDKIT_PRIVATE_KEY: Set (inline)")
    else:
        logger.info(" [red]✗[/red] CLOUDKIT_PRIVATE_KEY: Not set")

    logger.info("")

    return 0


def cmd_retry(args: argparse.Namespace) -> int:
    """Execute the retry command for failed uploads."""
    from .models.game import load_games
    from .models.team import load_teams
    from .models.stadium import load_stadiums
    from .uploaders import (
        CloudKitClient,
        CloudKitError,
        CloudKitAuthError,
        CloudKitRateLimitError,
        StateManager,
        game_to_cloudkit_record,
        team_to_cloudkit_record,
        stadium_to_cloudkit_record,
    )
    from .utils.progress import create_progress_bar

    logger = get_logger()

    sports = SUPPORTED_SPORTS if args.sport == "all" else [args.sport]

    logger.info(f"Retrying failed uploads for {', '.join(sports)}")
    logger.info(f"Environment: {args.environment}")
    logger.info(f"Max retries per record: {args.max_retries}")

    # Initialize CloudKit client
    client = CloudKitClient(environment=args.environment)

    if not client.is_configured:
        log_failure("CloudKit not configured")
        return 1

    # Initialize state manager
    state_manager = StateManager()

    total_retried = 0
    total_succeeded = 0
    total_failed = 0

    for sport in sports:
        # Load existing session
        session = state_manager.load_session(sport, args.season, args.environment)

        if session is None:
            logger.info(f"{sport.upper()}: No upload session found")
            continue

        # Get records eligible for retry
        retryable = session.get_retryable_records(max_retries=args.max_retries)

        if not retryable:
            failed_count = session.failed_count
            if failed_count > 0:
                logger.info(f"{sport.upper()}: {failed_count} failed records exceeded max retries")
            else:
                logger.info(f"{sport.upper()}: No failed records to retry")
            continue

        logger.info(f"{sport.upper()}: Retrying {len(retryable)} failed records...")

        # Load local data to get the records
        games_file = OUTPUT_DIR / f"games_{sport}_{args.season}.json"
        teams_file = OUTPUT_DIR / f"teams_{sport}.json"
        stadiums_file = OUTPUT_DIR / f"stadiums_{sport}.json"

        if not games_file.exists():
            logger.warning(f"No games file found: {games_file}")
            continue

        games = load_games(str(games_file))
        teams = load_teams(str(teams_file)) if teams_file.exists() else []
        stadiums = load_stadiums(str(stadiums_file)) if stadiums_file.exists() else []

        # Build record lookup
        records_to_retry = []
        retryable_set = set(retryable)

        for game in games:
            if game.id in retryable_set:
                records_to_retry.append(game_to_cloudkit_record(game))

        for team in teams:
            if team.id in retryable_set:
                records_to_retry.append(team_to_cloudkit_record(team))

        for stadium in stadiums:
            if stadium.id in retryable_set:
                records_to_retry.append(stadium_to_cloudkit_record(stadium))

        if not records_to_retry:
            logger.warning(f"{sport.upper()}: Could not find records for retry")
            continue

        # Mark as pending for retry
        for record_name in retryable:
            session.mark_pending(record_name)

        # Retry upload
        try:
            with create_progress_bar(total=len(records_to_retry), description="Retrying") as progress:
                batch_result = client.save_records(records_to_retry)

                for op_result in batch_result.successful:
                    session.mark_uploaded(op_result.record_name, op_result.record_change_tag)
                    progress.advance()
                    total_succeeded += 1

                for op_result in batch_result.failed:
                    session.mark_failed(op_result.record_name, op_result.error_message or "Unknown error")
                    progress.advance()
                    total_failed += 1

            state_manager.save_session(session)

            total_retried += len(records_to_retry)

            if batch_result.failure_count > 0:
                log_failure(f"{sport.upper()}: {batch_result.failure_count} still failing")
            else:
                log_success(f"{sport.upper()}: All {batch_result.success_count} retries succeeded")

            # Clear session if all complete
            if session.is_complete:
                state_manager.delete_session(sport, args.season, args.environment)

        except CloudKitAuthError as e:
            log_failure(f"Authentication failed: {e}")
            return 1
        except CloudKitRateLimitError:
            log_failure("Rate limit exceeded - try again later")
            state_manager.save_session(session)
            return 1
        except CloudKitError as e:
            log_failure(f"Upload error: {e}")
            state_manager.save_session(session)
            continue

    # Summary
    logger.info(f"\n{'='*50}")
    logger.info("RETRY SUMMARY")
    logger.info(f"{'='*50}")
    logger.info(f"Retried: {total_retried}")
    logger.info(f"Succeeded: {total_succeeded}")
    logger.info(f"Failed: {total_failed}")

    return 0 if total_failed == 0 else 1


def cmd_clear(args: argparse.Namespace) -> int:
    """Execute the clear command to delete upload state."""
    from .uploaders import StateManager

    logger = get_logger()

    sports = SUPPORTED_SPORTS if args.sport == "all" else [args.sport]

    logger.info(f"Clearing upload state for {', '.join(sports)}")

    state_manager = StateManager()
    cleared_count = 0

    for sport in sports:
        if state_manager.delete_session(sport, args.season, args.environment):
            logger.info(f" [green]✓[/green] Cleared {sport.upper()} {args.season} ({args.environment})")
            cleared_count += 1
        else:
            logger.info(f" [dim]-[/dim] No session for {sport.upper()} {args.season} ({args.environment})")

    logger.info(f"\nCleared {cleared_count} session(s)")

    return 0


def run_cli(argv: Optional[list[str]] = None) -> int:
    """Parse arguments and run the appropriate command."""
    parser = create_parser()
    args = parser.parse_args(argv)

    if args.verbose:
        set_verbose(True)

    if args.command is None:
        parser.print_help()
        return 1

    return args.func(args)
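
The coverage indicator that `cmd_status` prints (95% and above, 80% and above, or lower) can be pulled out into a pure helper that is easy to test on its own. The following is a minimal sketch under that assumption; `coverage_percent` and `coverage_label` are hypothetical names that do not appear in this commit, and the plain-text labels stand in for the coloured markup used in the CLI.

```python
def coverage_percent(game_count: int, expected: int) -> float:
    """Percentage of expected games that were scraped (0.0 if nothing is expected)."""
    return (game_count / expected * 100) if expected > 0 else 0.0


def coverage_label(coverage: float) -> str:
    """Map a coverage percentage to the three tiers used by the status command."""
    if coverage >= 95:
        return "ok"
    if coverage >= 80:
        return "partial"
    return "low"


if __name__ == "__main__":
    # Hypothetical counts, for illustration only
    for count, expected in [(1230, 1230), (1000, 1230), (5, 0)]:
        pct = coverage_percent(count, expected)
        print(f"{count}/{expected}: {pct:.0f}% -> {coverage_label(pct)}")
```

Keeping the thresholds in one function would let the status output and any validation report share the same tiers.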
@@ -0,0 +1,56 @@
"""Configuration constants for sportstime-parser."""

from pathlib import Path

# Package paths
PACKAGE_DIR = Path(__file__).parent
SCRIPTS_DIR = PACKAGE_DIR.parent
OUTPUT_DIR = SCRIPTS_DIR / "output"
STATE_DIR = SCRIPTS_DIR / ".parser_state"

# Alias files (existing in Scripts/)
TEAM_ALIASES_FILE = SCRIPTS_DIR / "team_aliases.json"
STADIUM_ALIASES_FILE = SCRIPTS_DIR / "stadium_aliases.json"
LEAGUE_STRUCTURE_FILE = SCRIPTS_DIR / "league_structure.json"

# Supported sports
SUPPORTED_SPORTS: list[str] = [
    "nba",
    "mlb",
    "nfl",
    "nhl",
    "mls",
    "wnba",
    "nwsl",
]

# Default season (start year of the season, e.g., 2025 for 2025-26)
DEFAULT_SEASON: int = 2025

# CloudKit configuration
CLOUDKIT_CONTAINER_ID: str = "iCloud.com.sportstime.app"
CLOUDKIT_ENVIRONMENT: str = "development"
CLOUDKIT_BATCH_SIZE: int = 200

# Rate limiting
DEFAULT_REQUEST_DELAY: float = 1.0  # seconds between requests
MAX_RETRIES: int = 3
BACKOFF_FACTOR: float = 2.0  # exponential backoff multiplier
INITIAL_BACKOFF: float = 1.0  # initial backoff in seconds

# Expected game counts per sport (approximate, for validation)
EXPECTED_GAME_COUNTS: dict[str, int] = {
    "nba": 1230,  # 30 teams × 82 games / 2
    "mlb": 2430,  # 30 teams × 162 games / 2
    "nfl": 272,  # 32 teams × 17 games / 2
    "nhl": 1312,  # 32 teams × 82 games / 2
    "mls": 493,  # 30 teams × varies
    "wnba": 220,  # approximate (13 teams × 40 games / 2 would be 260)
    "nwsl": 182,  # 14 teams × 26 games / 2
}

# Minimum match score for fuzzy matching (0-100)
FUZZY_MATCH_THRESHOLD: int = 80

# Geographic filter (only include games in these countries)
ALLOWED_COUNTRIES: set[str] = {"USA", "US", "United States", "Canada", "Mexico"}
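
The rate-limiting constants above describe an exponential backoff, but the code that consumes them is not part of this file. As a rough illustration of the schedule they imply, here is a minimal sketch; `backoff_delays` is a hypothetical helper, and the assumption that the delay before attempt `n` is `INITIAL_BACKOFF * BACKOFF_FACTOR ** n` is mine, not confirmed by the commit.

```python
# Values copied from the config constants above
INITIAL_BACKOFF = 1.0   # seconds
BACKOFF_FACTOR = 2.0    # multiplier applied after each failed attempt
MAX_RETRIES = 3


def backoff_delays(
    initial: float = INITIAL_BACKOFF,
    factor: float = BACKOFF_FACTOR,
    retries: int = MAX_RETRIES,
) -> list[float]:
    """Return the wait time before each retry, assuming plain exponential growth."""
    return [initial * factor ** attempt for attempt in range(retries)]


if __name__ == "__main__":
    print(backoff_delays())  # [1.0, 2.0, 4.0]
```

With the defaults, a request that keeps failing would wait about seven seconds in total before giving up; real clients often add random jitter on top of such a schedule.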
@@ -0,0 +1,35 @@
"""Data models for sportstime-parser."""

from .game import Game, save_games, load_games
from .team import Team, save_teams, load_teams
from .stadium import Stadium, save_stadiums, load_stadiums
from .aliases import (
    AliasType,
    ReviewReason,
    TeamAlias,
    StadiumAlias,
    FuzzyMatch,
    ManualReviewItem,
)

__all__ = [
    # Game
    "Game",
    "save_games",
    "load_games",
    # Team
    "Team",
    "save_teams",
    "load_teams",
    # Stadium
    "Stadium",
    "save_stadiums",
    "load_stadiums",
    # Aliases
    "AliasType",
    "ReviewReason",
    "TeamAlias",
    "StadiumAlias",
    "FuzzyMatch",
    "ManualReviewItem",
]
@@ -0,0 +1,262 @@
"""Alias and manual review data models for sportstime-parser."""

from dataclasses import dataclass, field
from datetime import date, datetime
from enum import Enum
from typing import Optional
import json


class AliasType(Enum):
    """Type of team alias."""
    NAME = "name"
    ABBREVIATION = "abbreviation"
    CITY = "city"


class ReviewReason(Enum):
    """Reason an item requires manual review."""
    UNRESOLVED_TEAM = "unresolved_team"
    UNRESOLVED_STADIUM = "unresolved_stadium"
    LOW_CONFIDENCE_MATCH = "low_confidence_match"
    MISSING_DATA = "missing_data"
    DUPLICATE_GAME = "duplicate_game"
    TIMEZONE_UNKNOWN = "timezone_unknown"
    GEOGRAPHIC_FILTER = "geographic_filter"


@dataclass
class TeamAlias:
    """Represents a team alias with optional date validity.

    Attributes:
        id: Unique alias ID
        team_canonical_id: The canonical team ID this alias resolves to
        alias_type: Type of alias (name, abbreviation, city)
        alias_value: The alias value to match against
        valid_from: Start date of alias validity (None = always valid)
        valid_until: End date of alias validity (None = still valid)
    """

    id: str
    team_canonical_id: str
    alias_type: AliasType
    alias_value: str
    valid_from: Optional[date] = None
    valid_until: Optional[date] = None

    def is_valid_on(self, check_date: date) -> bool:
        """Check if this alias is valid on the given date."""
        if self.valid_from and check_date < self.valid_from:
            return False
        if self.valid_until and check_date > self.valid_until:
            return False
        return True

    def to_dict(self) -> dict:
        """Convert to dictionary for JSON serialization."""
        return {
            "id": self.id,
            "team_canonical_id": self.team_canonical_id,
            "alias_type": self.alias_type.value,
            "alias_value": self.alias_value,
            "valid_from": self.valid_from.isoformat() if self.valid_from else None,
            "valid_until": self.valid_until.isoformat() if self.valid_until else None,
        }

    @classmethod
    def from_dict(cls, data: dict) -> "TeamAlias":
        """Create a TeamAlias from a dictionary."""
        valid_from = None
        if data.get("valid_from"):
            valid_from = date.fromisoformat(data["valid_from"])

        valid_until = None
        if data.get("valid_until"):
            valid_until = date.fromisoformat(data["valid_until"])

        return cls(
            id=data["id"],
            team_canonical_id=data["team_canonical_id"],
            alias_type=AliasType(data["alias_type"]),
            alias_value=data["alias_value"],
            valid_from=valid_from,
            valid_until=valid_until,
        )


@dataclass
class StadiumAlias:
    """Represents a stadium alias with optional date validity.

    Attributes:
        alias_name: The alias name to match against (lowercase)
        stadium_canonical_id: The canonical stadium ID this alias resolves to
        valid_from: Start date of alias validity (None = always valid)
        valid_until: End date of alias validity (None = still valid)
    """

    alias_name: str
    stadium_canonical_id: str
    valid_from: Optional[date] = None
    valid_until: Optional[date] = None

    def is_valid_on(self, check_date: date) -> bool:
        """Check if this alias is valid on the given date."""
        if self.valid_from and check_date < self.valid_from:
            return False
        if self.valid_until and check_date > self.valid_until:
            return False
        return True

    def to_dict(self) -> dict:
        """Convert to dictionary for JSON serialization."""
        return {
            "alias_name": self.alias_name,
            "stadium_canonical_id": self.stadium_canonical_id,
            "valid_from": self.valid_from.isoformat() if self.valid_from else None,
            "valid_until": self.valid_until.isoformat() if self.valid_until else None,
        }

    @classmethod
    def from_dict(cls, data: dict) -> "StadiumAlias":
        """Create a StadiumAlias from a dictionary."""
        valid_from = None
        if data.get("valid_from"):
            valid_from = date.fromisoformat(data["valid_from"])

        valid_until = None
        if data.get("valid_until"):
            valid_until = date.fromisoformat(data["valid_until"])

        return cls(
            alias_name=data["alias_name"],
            stadium_canonical_id=data["stadium_canonical_id"],
            valid_from=valid_from,
            valid_until=valid_until,
        )


@dataclass
class FuzzyMatch:
    """Represents a fuzzy match suggestion with confidence score."""

    canonical_id: str
    canonical_name: str
    confidence: int  # 0-100

    def to_dict(self) -> dict:
        """Convert to dictionary for JSON serialization."""
        return {
            "canonical_id": self.canonical_id,
            "canonical_name": self.canonical_name,
            "confidence": self.confidence,
        }


@dataclass
class ManualReviewItem:
    """Represents an item requiring manual review.

    Attributes:
        id: Unique review item ID
        reason: Why this item needs review
        sport: Sport code
        raw_value: The original unresolved value
        context: Additional context about the issue
        source_url: URL of the source page
        suggested_matches: List of potential matches with confidence scores
        game_date: Date of the game (if applicable)
        created_at: When this review item was created
    """

    id: str
    reason: ReviewReason
    sport: str
    raw_value: str
    context: dict = field(default_factory=dict)
    source_url: Optional[str] = None
    suggested_matches: list[FuzzyMatch] = field(default_factory=list)
    game_date: Optional[date] = None
    created_at: datetime = field(default_factory=datetime.now)

    def to_dict(self) -> dict:
        """Convert to dictionary for JSON serialization."""
        return {
            "id": self.id,
            "reason": self.reason.value,
            "sport": self.sport,
            "raw_value": self.raw_value,
            "context": self.context,
            "source_url": self.source_url,
            "suggested_matches": [m.to_dict() for m in self.suggested_matches],
            "game_date": self.game_date.isoformat() if self.game_date else None,
            "created_at": self.created_at.isoformat(),
        }

    @classmethod
    def from_dict(cls, data: dict) -> "ManualReviewItem":
        """Create a ManualReviewItem from a dictionary."""
        game_date = None
        if data.get("game_date"):
            game_date = date.fromisoformat(data["game_date"])

        created_at = datetime.now()
        if data.get("created_at"):
            created_at = datetime.fromisoformat(data["created_at"])

        suggested_matches = []
        for match_data in data.get("suggested_matches", []):
            suggested_matches.append(FuzzyMatch(
                canonical_id=match_data["canonical_id"],
                canonical_name=match_data["canonical_name"],
                confidence=match_data["confidence"],
            ))

        return cls(
            id=data["id"],
            reason=ReviewReason(data["reason"]),
            sport=data["sport"],
            raw_value=data["raw_value"],
            context=data.get("context", {}),
            source_url=data.get("source_url"),
            suggested_matches=suggested_matches,
            game_date=game_date,
            created_at=created_at,
        )

    def to_markdown(self) -> str:
        """Generate markdown representation for validation report."""
        lines = [
            f"### {self.reason.value.replace('_', ' ').title()}: {self.raw_value}",
            "",
            f"**Sport**: {self.sport.upper()}",
        ]

        if self.game_date:
            lines.append(f"**Game Date**: {self.game_date.isoformat()}")

        if self.context:
            lines.append("")
            lines.append("**Context**:")
            for key, value in self.context.items():
                lines.append(f"- {key}: {value}")

        if self.suggested_matches:
            lines.append("")
            lines.append("**Suggested Matches**:")
            for i, match in enumerate(self.suggested_matches, 1):
                marker = " <- likely correct" if match.confidence >= 90 else ""
                lines.append(
                    f"{i}. `{match.canonical_id}` ({match.confidence}%){marker}"
                )

        if self.source_url:
            lines.append("")
            lines.append(f"**Source**: [{self.source_url}]({self.source_url})")

        lines.append("")
        lines.append("---")
        lines.append("")

        return "\n".join(lines)
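
The `valid_from` / `valid_until` fields on the alias models let one name resolve differently depending on the game date, for example when a venue is renamed. How the package's resolver actually consumes these aliases is not shown in this file, so the following is only a sketch of the idea: `DatedAlias`, `resolve_alias`, and the sample data are hypothetical, and the validity check mirrors `is_valid_on` above (both bounds inclusive).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DatedAlias:
    """Simplified stand-in for StadiumAlias (hypothetical, for illustration)."""
    alias_name: str
    canonical_id: str
    valid_from: Optional[date] = None
    valid_until: Optional[date] = None

    def is_valid_on(self, check_date: date) -> bool:
        # Same inclusive range check as the alias models above
        if self.valid_from and check_date < self.valid_from:
            return False
        if self.valid_until and check_date > self.valid_until:
            return False
        return True


def resolve_alias(name: str, on: date, aliases: list[DatedAlias]) -> Optional[str]:
    """Return the canonical ID for `name` if an alias is valid on the given date."""
    key = name.strip().lower()  # alias names are stored lowercase
    for alias in aliases:
        if alias.alias_name == key and alias.is_valid_on(on):
            return alias.canonical_id
    return None


# Made-up venue that changed its name on 2021-07-01
ALIASES = [
    DatedAlias("old name arena", "stadium_demo_arena", valid_until=date(2021, 6, 30)),
    DatedAlias("new name arena", "stadium_demo_arena", valid_from=date(2021, 7, 1)),
]

if __name__ == "__main__":
    print(resolve_alias("Old Name Arena", date(2020, 1, 1), ALIASES))
    print(resolve_alias("Old Name Arena", date(2022, 1, 1), ALIASES))
```

An unresolved lookup (the second call) is the kind of case that would presumably end up as a `ManualReviewItem` with `ReviewReason.UNRESOLVED_STADIUM`.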
@@ -0,0 +1,112 @@
"""Game data model for sportstime-parser."""

from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional
import json


@dataclass
class Game:
    """Represents a game with all CloudKit fields.

    Attributes:
        id: Canonical game ID (e.g., 'nba_2025_hou_okc_1021')
        sport: Sport code (e.g., 'nba', 'mlb')
        season: Season start year (e.g., 2025 for 2025-26)
        home_team_id: Canonical home team ID
        away_team_id: Canonical away team ID
        stadium_id: Canonical stadium ID
        game_date: Game date/time in UTC
        game_number: Game number for doubleheaders (1 or 2), None for single games
        home_score: Final home team score (None if not played)
        away_score: Final away team score (None if not played)
        status: Game status ('scheduled', 'final', 'postponed', 'cancelled')
        source_url: URL of the source page for manual review
        raw_home_team: Original home team name from source (for debugging)
        raw_away_team: Original away team name from source (for debugging)
        raw_stadium: Original stadium name from source (for debugging)
    """

    id: str
    sport: str
    season: int
    home_team_id: str
    away_team_id: str
    stadium_id: str
    game_date: datetime
    game_number: Optional[int] = None
    home_score: Optional[int] = None
    away_score: Optional[int] = None
    status: str = "scheduled"
    source_url: Optional[str] = None
    raw_home_team: Optional[str] = None
    raw_away_team: Optional[str] = None
    raw_stadium: Optional[str] = None

    def to_dict(self) -> dict:
        """Convert to dictionary for JSON serialization."""
        return {
            "id": self.id,
            "sport": self.sport,
            "season": self.season,
            "home_team_id": self.home_team_id,
            "away_team_id": self.away_team_id,
            "stadium_id": self.stadium_id,
            "game_date": self.game_date.isoformat(),
            "game_number": self.game_number,
            "home_score": self.home_score,
            "away_score": self.away_score,
            "status": self.status,
            "source_url": self.source_url,
            "raw_home_team": self.raw_home_team,
            "raw_away_team": self.raw_away_team,
            "raw_stadium": self.raw_stadium,
        }

    @classmethod
    def from_dict(cls, data: dict) -> "Game":
        """Create a Game from a dictionary."""
        game_date = data["game_date"]
        if isinstance(game_date, str):
            game_date = datetime.fromisoformat(game_date)

        return cls(
            id=data["id"],
            sport=data["sport"],
            season=data["season"],
            home_team_id=data["home_team_id"],
            away_team_id=data["away_team_id"],
            stadium_id=data["stadium_id"],
            game_date=game_date,
            game_number=data.get("game_number"),
            home_score=data.get("home_score"),
            away_score=data.get("away_score"),
            status=data.get("status", "scheduled"),
            source_url=data.get("source_url"),
            raw_home_team=data.get("raw_home_team"),
            raw_away_team=data.get("raw_away_team"),
            raw_stadium=data.get("raw_stadium"),
        )

    def to_json(self) -> str:
        """Serialize to JSON string."""
        return json.dumps(self.to_dict(), indent=2)

    @classmethod
    def from_json(cls, json_str: str) -> "Game":
        """Deserialize from JSON string."""
        return cls.from_dict(json.loads(json_str))


def save_games(games: list[Game], filepath: str) -> None:
    """Save a list of games to a JSON file."""
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump([g.to_dict() for g in games], f, indent=2)


def load_games(filepath: str) -> list[Game]:
    """Load a list of games from a JSON file."""
    with open(filepath, "r", encoding="utf-8") as f:
        data = json.load(f)
    return [Game.from_dict(d) for d in data]
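
The `Game` model round-trips through JSON by writing `game_date` with `isoformat()` and reading it back with `datetime.fromisoformat()`. The sketch below shows that pattern on a deliberately reduced dataclass; `MiniGame` and its sample values are hypothetical stand-ins rather than the real model, and a timezone-aware UTC datetime is used on the assumption that game times are stored in UTC as the docstring states.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class MiniGame:
    """Reduced stand-in for Game with just an ID and a UTC start time."""
    id: str
    game_date: datetime

    def to_dict(self) -> dict:
        # datetime is not JSON-serializable, so store it as an ISO 8601 string
        return {"id": self.id, "game_date": self.game_date.isoformat()}

    @classmethod
    def from_dict(cls, data: dict) -> "MiniGame":
        game_date = data["game_date"]
        if isinstance(game_date, str):
            game_date = datetime.fromisoformat(game_date)
        return cls(id=data["id"], game_date=game_date)


if __name__ == "__main__":
    original = MiniGame(
        id="demo_2025_aaa_bbb_1021",
        game_date=datetime(2025, 10, 21, 23, 30, tzinfo=timezone.utc),
    )
    payload = json.dumps(original.to_dict())
    restored = MiniGame.from_dict(json.loads(payload))
    print(restored == original)
```

Because the UTC offset is part of the ISO string, the restored value compares equal to the original, which is what a diff-based uploader would need to avoid spurious updates.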
@@ -0,0 +1,108 @@
"""Stadium data model for sportstime-parser."""

from dataclasses import dataclass
from typing import Optional
import json


@dataclass
class Stadium:
    """Represents a stadium with all CloudKit fields.

    Attributes:
        id: Canonical stadium ID (e.g., 'stadium_nba_paycom_center')
        sport: Primary sport code (e.g., 'nba', 'mlb')
        name: Current stadium name (e.g., 'Paycom Center')
        city: City name (e.g., 'Oklahoma City')
        state: State/province code (e.g., 'OK', 'ON')
        country: Country code (e.g., 'USA', 'Canada')
        latitude: Latitude coordinate
        longitude: Longitude coordinate
        capacity: Seating capacity
        surface: Playing surface (e.g., 'grass', 'turf', 'hardwood')
        roof_type: Roof type (e.g., 'dome', 'retractable', 'open')
        opened_year: Year stadium opened
        image_url: URL to stadium image
        timezone: IANA timezone (e.g., 'America/Chicago')
    """

    id: str
    sport: str
    name: str
    city: str
    state: str
    country: str
    latitude: float
    longitude: float
    capacity: Optional[int] = None
    surface: Optional[str] = None
    roof_type: Optional[str] = None
    opened_year: Optional[int] = None
    image_url: Optional[str] = None
    timezone: Optional[str] = None

    def to_dict(self) -> dict:
        """Convert to dictionary for JSON serialization."""
        return {
            "id": self.id,
            "sport": self.sport,
            "name": self.name,
            "city": self.city,
            "state": self.state,
            "country": self.country,
            "latitude": self.latitude,
            "longitude": self.longitude,
            "capacity": self.capacity,
            "surface": self.surface,
            "roof_type": self.roof_type,
            "opened_year": self.opened_year,
            "image_url": self.image_url,
            "timezone": self.timezone,
        }

    @classmethod
    def from_dict(cls, data: dict) -> "Stadium":
        """Create a Stadium from a dictionary."""
        return cls(
            id=data["id"],
            sport=data["sport"],
            name=data["name"],
            city=data["city"],
            state=data["state"],
            country=data["country"],
            latitude=data["latitude"],
            longitude=data["longitude"],
            capacity=data.get("capacity"),
            surface=data.get("surface"),
            roof_type=data.get("roof_type"),
            opened_year=data.get("opened_year"),
            image_url=data.get("image_url"),
            timezone=data.get("timezone"),
        )

    def to_json(self) -> str:
        """Serialize to JSON string."""
        return json.dumps(self.to_dict(), indent=2)

    @classmethod
    def from_json(cls, json_str: str) -> "Stadium":
        """Deserialize from JSON string."""
        return cls.from_dict(json.loads(json_str))
|
||||
|
||||
def is_in_allowed_region(self) -> bool:
|
||||
"""Check if stadium is in USA, Canada, or Mexico."""
|
||||
allowed = {"USA", "US", "United States", "Canada", "CA", "Mexico", "MX"}
|
||||
return self.country in allowed
|
||||
|
||||
|
||||
def save_stadiums(stadiums: list[Stadium], filepath: str) -> None:
|
||||
"""Save a list of stadiums to a JSON file."""
|
||||
with open(filepath, "w", encoding="utf-8") as f:
|
||||
json.dump([s.to_dict() for s in stadiums], f, indent=2)
|
||||
|
||||
|
||||
def load_stadiums(filepath: str) -> list[Stadium]:
|
||||
"""Load a list of stadiums from a JSON file."""
|
||||
with open(filepath, "r", encoding="utf-8") as f:
|
||||
data = json.load(f)
|
||||
return [Stadium.from_dict(d) for d in data]
|
||||
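The `to_dict`/`from_dict` pair in the Stadium model above can be exercised with a round trip through JSON. The sketch below is a minimal illustration, not the package's code: `StadiumSketch` is a hypothetical trimmed stand-in with only four of the fields, and it uses `dataclasses.asdict` in place of the hand-written `to_dict`, which should behave the same for a flat dataclass like this one.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class StadiumSketch:
    """Hypothetical trimmed stand-in for Stadium (assumption, not the real model)."""

    id: str
    name: str
    country: str
    capacity: Optional[int] = None

    @classmethod
    def from_dict(cls, data: dict) -> "StadiumSketch":
        # Required keys use [] so a missing key raises KeyError early;
        # optional keys use .get() and fall back to None.
        return cls(
            id=data["id"],
            name=data["name"],
            country=data["country"],
            capacity=data.get("capacity"),
        )


original = StadiumSketch("stadium_nba_paycom_center", "Paycom Center", "USA")

# Serialize to a JSON string, then rebuild the object from the parsed dict.
payload = json.dumps(asdict(original))
restored = StadiumSketch.from_dict(json.loads(payload))
```

Because required keys are read with `data[...]`, a malformed record fails with `KeyError` at load time, while optional keys read with `.get()` simply default to `None`.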
@@ -0,0 +1,95 @@
|
||||
"""Team data model for sportstime-parser."""
|
||||
|
||||
from dataclasses import dataclass
|
||||
from typing import Optional
|
||||
import json
|
||||
|
||||
|
||||
@dataclass
|
||||
class Team:
|
||||
"""Represents a team with all CloudKit fields.
|
||||
|
||||
Attributes:
|
||||
id: Canonical team ID (e.g., 'team_nba_okc')
|
||||
sport: Sport code (e.g., 'nba', 'mlb')
|
||||
city: Team city (e.g., 'Oklahoma City')
|
||||
name: Team name (e.g., 'Thunder')
|
||||
full_name: Full team name (e.g., 'Oklahoma City Thunder')
|
||||
abbreviation: Official abbreviation (e.g., 'OKC')
|
||||
conference: Conference name (e.g., 'Western', 'American')
|
||||
division: Division name (e.g., 'Northwest', 'AL West')
|
||||
primary_color: Primary team color as hex (e.g., '#007AC1')
|
||||
secondary_color: Secondary team color as hex (e.g., '#EF3B24')
|
||||
logo_url: URL to team logo image
|
||||
stadium_id: Canonical ID of home stadium
|
||||
"""
|
||||
|
||||
id: str
|
||||
sport: str
|
||||
city: str
|
||||
name: str
|
||||
full_name: str
|
||||
abbreviation: str
|
||||
conference: Optional[str] = None
|
||||
division: Optional[str] = None
|
||||
primary_color: Optional[str] = None
|
||||
secondary_color: Optional[str] = None
|
||||
logo_url: Optional[str] = None
|
||||
stadium_id: Optional[str] = None
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
"""Convert to dictionary for JSON serialization."""
|
||||
return {
|
||||
"id": self.id,
|
||||
"sport": self.sport,
|
||||
"city": self.city,
|
||||
"name": self.name,
|
||||
"full_name": self.full_name,
|
||||
"abbreviation": self.abbreviation,
|
||||
"conference": self.conference,
|
||||
"division": self.division,
|
||||
"primary_color": self.primary_color,
|
||||
"secondary_color": self.secondary_color,
|
||||
"logo_url": self.logo_url,
|
||||
"stadium_id": self.stadium_id,
|
||||
}
|
||||
|
||||
@classmethod
|
||||
def from_dict(cls, data: dict) -> "Team":
|
||||
"""Create a Team from a dictionary."""
|
||||
return cls(
|
||||
id=data["id"],
|
||||
sport=data["sport"],
|
||||
city=data["city"],
|
||||
name=data["name"],
|
||||
full_name=data["full_name"],
|
||||
abbreviation=data["abbreviation"],
|
||||
conference=data.get("conference"),
|
||||
division=data.get("division"),
|
||||
primary_color=data.get("primary_color"),
|
||||
secondary_color=data.get("secondary_color"),
|
||||
logo_url=data.get("logo_url"),
|
||||
stadium_id=data.get("stadium_id"),
|
||||
)
|
||||
|
||||
def to_json(self) -> str:
|
||||
"""Serialize to JSON string."""
|
||||
return json.dumps(self.to_dict(), indent=2)
|
||||
|
||||
@classmethod
|
||||
def from_json(cls, json_str: str) -> "Team":
|
||||
"""Deserialize from JSON string."""
|
||||
return cls.from_dict(json.loads(json_str))
|
||||
|
||||
|
||||
def save_teams(teams: list[Team], filepath: str) -> None:
|
||||
"""Save a list of teams to a JSON file."""
|
||||
with open(filepath, "w", encoding="utf-8") as f:
|
||||
json.dump([t.to_dict() for t in teams], f, indent=2)
|
||||
|
||||
|
||||
def load_teams(filepath: str) -> list[Team]:
|
||||
"""Load a list of teams from a JSON file."""
|
||||
with open(filepath, "r", encoding="utf-8") as f:
|
||||
data = json.load(f)
|
||||
return [Team.from_dict(d) for d in data]
|
||||
@@ -0,0 +1,91 @@
|
||||
"""Normalizers for team, stadium, and game data."""
|
||||
|
||||
from .canonical_id import (
|
||||
generate_game_id,
|
||||
generate_team_id,
|
||||
generate_team_id_from_abbrev,
|
||||
generate_stadium_id,
|
||||
parse_game_id,
|
||||
normalize_string,
|
||||
)
|
||||
from .timezone import (
|
||||
TimezoneResult,
|
||||
parse_datetime,
|
||||
convert_to_utc,
|
||||
detect_timezone_from_string,
|
||||
detect_timezone_from_location,
|
||||
get_stadium_timezone,
|
||||
create_timezone_warning,
|
||||
)
|
||||
from .fuzzy import (
|
||||
MatchCandidate,
|
||||
fuzzy_match_team,
|
||||
fuzzy_match_stadium,
|
||||
exact_match,
|
||||
best_match,
|
||||
calculate_similarity,
|
||||
normalize_for_matching,
|
||||
)
|
||||
from .alias_loader import (
|
||||
TeamAliasLoader,
|
||||
StadiumAliasLoader,
|
||||
get_team_alias_loader,
|
||||
get_stadium_alias_loader,
|
||||
resolve_team_alias,
|
||||
resolve_stadium_alias,
|
||||
)
|
||||
from .team_resolver import (
|
||||
TeamResolver,
|
||||
TeamResolveResult,
|
||||
get_team_resolver,
|
||||
resolve_team,
|
||||
)
|
||||
from .stadium_resolver import (
|
||||
StadiumResolver,
|
||||
StadiumResolveResult,
|
||||
get_stadium_resolver,
|
||||
resolve_stadium,
|
||||
)
|
||||
|
||||
__all__ = [
|
||||
# Canonical ID
|
||||
"generate_game_id",
|
||||
"generate_team_id",
|
||||
"generate_team_id_from_abbrev",
|
||||
"generate_stadium_id",
|
||||
"parse_game_id",
|
||||
"normalize_string",
|
||||
# Timezone
|
||||
"TimezoneResult",
|
||||
"parse_datetime",
|
||||
"convert_to_utc",
|
||||
"detect_timezone_from_string",
|
||||
"detect_timezone_from_location",
|
||||
"get_stadium_timezone",
|
||||
"create_timezone_warning",
|
||||
# Fuzzy matching
|
||||
"MatchCandidate",
|
||||
"fuzzy_match_team",
|
||||
"fuzzy_match_stadium",
|
||||
"exact_match",
|
||||
"best_match",
|
||||
"calculate_similarity",
|
||||
"normalize_for_matching",
|
||||
# Alias loaders
|
||||
"TeamAliasLoader",
|
||||
"StadiumAliasLoader",
|
||||
"get_team_alias_loader",
|
||||
"get_stadium_alias_loader",
|
||||
"resolve_team_alias",
|
||||
"resolve_stadium_alias",
|
||||
# Team resolver
|
||||
"TeamResolver",
|
||||
"TeamResolveResult",
|
||||
"get_team_resolver",
|
||||
"resolve_team",
|
||||
# Stadium resolver
|
||||
"StadiumResolver",
|
||||
"StadiumResolveResult",
|
||||
"get_stadium_resolver",
|
||||
"resolve_stadium",
|
||||
]
|
||||
@@ -0,0 +1,312 @@
|
||||
"""Alias file loaders for team and stadium name resolution."""
|
||||
|
||||
import json
|
||||
from datetime import date
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
from ..config import TEAM_ALIASES_FILE, STADIUM_ALIASES_FILE
|
||||
from ..models.aliases import TeamAlias, StadiumAlias, AliasType
|
||||
|
||||
|
||||
class TeamAliasLoader:
|
||||
"""Loader for team aliases with date-aware resolution.
|
||||
|
||||
Loads team aliases from JSON and provides lookup methods
|
||||
with support for historical name changes.
|
||||
"""
|
||||
|
||||
def __init__(self, filepath: Optional[Path] = None):
|
||||
"""Initialize the loader.
|
||||
|
||||
Args:
|
||||
filepath: Path to team_aliases.json, defaults to config value
|
||||
"""
|
||||
self.filepath = filepath or TEAM_ALIASES_FILE
|
||||
self._aliases: list[TeamAlias] = []
|
||||
self._by_value: dict[str, list[TeamAlias]] = {}
|
||||
self._by_team: dict[str, list[TeamAlias]] = {}
|
||||
self._loaded = False
|
||||
|
||||
def load(self) -> None:
|
||||
"""Load aliases from the JSON file."""
|
||||
if not self.filepath.exists():
|
||||
self._loaded = True
|
||||
return
|
||||
|
||||
with open(self.filepath, "r", encoding="utf-8") as f:
|
||||
data = json.load(f)
|
||||
|
||||
self._aliases = []
|
||||
self._by_value = {}
|
||||
self._by_team = {}
|
||||
|
||||
for item in data:
|
||||
alias = TeamAlias.from_dict(item)
|
||||
self._aliases.append(alias)
|
||||
|
||||
# Index by lowercase value
|
||||
value_key = alias.alias_value.lower().strip()
|
||||
if value_key not in self._by_value:
|
||||
self._by_value[value_key] = []
|
||||
self._by_value[value_key].append(alias)
|
||||
|
||||
# Index by team ID
|
||||
if alias.team_canonical_id not in self._by_team:
|
||||
self._by_team[alias.team_canonical_id] = []
|
||||
self._by_team[alias.team_canonical_id].append(alias)
|
||||
|
||||
self._loaded = True
|
||||
|
||||
def _ensure_loaded(self) -> None:
|
||||
"""Ensure aliases are loaded."""
|
||||
if not self._loaded:
|
||||
self.load()
|
||||
|
||||
def resolve(
|
||||
self,
|
||||
value: str,
|
||||
check_date: Optional[date] = None,
|
||||
alias_types: Optional[list[AliasType]] = None,
|
||||
) -> Optional[str]:
|
||||
"""Resolve an alias value to a canonical team ID.
|
||||
|
||||
Args:
|
||||
value: Alias value to look up (case-insensitive)
|
||||
check_date: Date to check validity (None = current date)
|
||||
alias_types: Types of aliases to check (None = all types)
|
||||
|
||||
Returns:
|
||||
Canonical team ID if found, None otherwise
|
||||
"""
|
||||
self._ensure_loaded()
|
||||
|
||||
if check_date is None:
|
||||
check_date = date.today()
|
||||
|
||||
value_key = value.lower().strip()
|
||||
aliases = self._by_value.get(value_key, [])
|
||||
|
||||
for alias in aliases:
|
||||
# Check type filter
|
||||
if alias_types and alias.alias_type not in alias_types:
|
||||
continue
|
||||
|
||||
# Check date validity
|
||||
if alias.is_valid_on(check_date):
|
||||
return alias.team_canonical_id
|
||||
|
||||
return None
|
||||
|
||||
def get_aliases_for_team(
|
||||
self,
|
||||
team_id: str,
|
||||
check_date: Optional[date] = None,
|
||||
) -> list[TeamAlias]:
|
||||
"""Get all aliases for a team.
|
||||
|
||||
Args:
|
||||
team_id: Canonical team ID
|
||||
check_date: Date to filter by (None = all aliases)
|
||||
|
||||
Returns:
|
||||
List of TeamAlias objects
|
||||
"""
|
||||
self._ensure_loaded()
|
||||
|
||||
aliases = self._by_team.get(team_id, [])
|
||||
|
||||
if check_date:
|
||||
aliases = [a for a in aliases if a.is_valid_on(check_date)]
|
||||
|
||||
return aliases
|
||||
|
||||
def get_all_values(
|
||||
self,
|
||||
alias_type: Optional[AliasType] = None,
|
||||
) -> list[str]:
|
||||
"""Get all alias values.
|
||||
|
||||
Args:
|
||||
alias_type: Filter by alias type (None = all types)
|
||||
|
||||
Returns:
|
||||
List of alias values
|
||||
"""
|
||||
self._ensure_loaded()
|
||||
|
||||
values = []
|
||||
for alias in self._aliases:
|
||||
if alias_type is None or alias.alias_type == alias_type:
|
||||
values.append(alias.alias_value)
|
||||
|
||||
return values
|
||||
|
||||
|
||||
class StadiumAliasLoader:
|
||||
"""Loader for stadium aliases with date-aware resolution.
|
||||
|
||||
Loads stadium aliases from JSON and provides lookup methods
|
||||
with support for historical name changes (e.g., naming rights).
|
||||
"""
|
||||
|
||||
def __init__(self, filepath: Optional[Path] = None):
|
||||
"""Initialize the loader.
|
||||
|
||||
Args:
|
||||
filepath: Path to stadium_aliases.json, defaults to config value
|
||||
"""
|
||||
self.filepath = filepath or STADIUM_ALIASES_FILE
|
||||
self._aliases: list[StadiumAlias] = []
|
||||
self._by_name: dict[str, list[StadiumAlias]] = {}
|
||||
self._by_stadium: dict[str, list[StadiumAlias]] = {}
|
||||
self._loaded = False
|
||||
|
||||
def load(self) -> None:
|
||||
"""Load aliases from the JSON file."""
|
||||
if not self.filepath.exists():
|
||||
self._loaded = True
|
||||
return
|
||||
|
||||
with open(self.filepath, "r", encoding="utf-8") as f:
|
||||
data = json.load(f)
|
||||
|
||||
self._aliases = []
|
||||
self._by_name = {}
|
||||
self._by_stadium = {}
|
||||
|
||||
for item in data:
|
||||
alias = StadiumAlias.from_dict(item)
|
||||
self._aliases.append(alias)
|
||||
|
||||
# Index by lowercase name
|
||||
name_key = alias.alias_name.lower().strip()
|
||||
if name_key not in self._by_name:
|
||||
self._by_name[name_key] = []
|
||||
self._by_name[name_key].append(alias)
|
||||
|
||||
# Index by stadium ID
|
||||
if alias.stadium_canonical_id not in self._by_stadium:
|
||||
self._by_stadium[alias.stadium_canonical_id] = []
|
||||
self._by_stadium[alias.stadium_canonical_id].append(alias)
|
||||
|
||||
self._loaded = True
|
||||
|
||||
def _ensure_loaded(self) -> None:
|
||||
"""Ensure aliases are loaded."""
|
||||
if not self._loaded:
|
||||
self.load()
|
||||
|
||||
def resolve(
|
||||
self,
|
||||
name: str,
|
||||
check_date: Optional[date] = None,
|
||||
) -> Optional[str]:
|
||||
"""Resolve a stadium name to a canonical stadium ID.
|
||||
|
||||
Args:
|
||||
name: Stadium name to look up (case-insensitive)
|
||||
check_date: Date to check validity (None = current date)
|
||||
|
||||
Returns:
|
||||
Canonical stadium ID if found, None otherwise
|
||||
"""
|
||||
self._ensure_loaded()
|
||||
|
||||
if check_date is None:
|
||||
check_date = date.today()
|
||||
|
||||
name_key = name.lower().strip()
|
||||
aliases = self._by_name.get(name_key, [])
|
||||
|
||||
for alias in aliases:
|
||||
if alias.is_valid_on(check_date):
|
||||
return alias.stadium_canonical_id
|
||||
|
||||
return None
|
||||
|
||||
def get_aliases_for_stadium(
|
||||
self,
|
||||
stadium_id: str,
|
||||
check_date: Optional[date] = None,
|
||||
) -> list[StadiumAlias]:
|
||||
"""Get all aliases for a stadium.
|
||||
|
||||
Args:
|
||||
stadium_id: Canonical stadium ID
|
||||
check_date: Date to filter by (None = all aliases)
|
||||
|
||||
Returns:
|
||||
List of StadiumAlias objects
|
||||
"""
|
||||
self._ensure_loaded()
|
||||
|
||||
aliases = self._by_stadium.get(stadium_id, [])
|
||||
|
||||
if check_date:
|
||||
aliases = [a for a in aliases if a.is_valid_on(check_date)]
|
||||
|
||||
return aliases
|
||||
|
||||
def get_all_names(self) -> list[str]:
|
||||
"""Get all stadium alias names.
|
||||
|
||||
Returns:
|
||||
List of stadium names
|
||||
"""
|
||||
self._ensure_loaded()
|
||||
|
||||
return [alias.alias_name for alias in self._aliases]
|
||||
|
||||
|
||||
# Global loader instances (lazy initialized)
|
||||
_team_alias_loader: Optional[TeamAliasLoader] = None
|
||||
_stadium_alias_loader: Optional[StadiumAliasLoader] = None
|
||||
|
||||
|
||||
def get_team_alias_loader() -> TeamAliasLoader:
|
||||
"""Get the global team alias loader instance."""
|
||||
global _team_alias_loader
|
||||
if _team_alias_loader is None:
|
||||
_team_alias_loader = TeamAliasLoader()
|
||||
return _team_alias_loader
|
||||
|
||||
|
||||
def get_stadium_alias_loader() -> StadiumAliasLoader:
|
||||
"""Get the global stadium alias loader instance."""
|
||||
global _stadium_alias_loader
|
||||
if _stadium_alias_loader is None:
|
||||
_stadium_alias_loader = StadiumAliasLoader()
|
||||
return _stadium_alias_loader
|
||||
|
||||
|
||||
def resolve_team_alias(
|
||||
value: str,
|
||||
check_date: Optional[date] = None,
|
||||
) -> Optional[str]:
|
||||
"""Convenience function to resolve a team alias.
|
||||
|
||||
Args:
|
||||
value: Alias value (name, abbreviation, or city)
|
||||
check_date: Date to check validity
|
||||
|
||||
Returns:
|
||||
Canonical team ID if found
|
||||
"""
|
||||
return get_team_alias_loader().resolve(value, check_date)
|
||||
|
||||
|
||||
def resolve_stadium_alias(
|
||||
name: str,
|
||||
check_date: Optional[date] = None,
|
||||
) -> Optional[str]:
|
||||
"""Convenience function to resolve a stadium alias.
|
||||
|
||||
Args:
|
||||
name: Stadium name
|
||||
check_date: Date to check validity
|
||||
|
||||
Returns:
|
||||
Canonical stadium ID if found
|
||||
"""
|
||||
return get_stadium_alias_loader().resolve(name, check_date)
|
||||
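The date-aware lookup in the alias loaders above can be sketched in isolation. This is a rough illustration under stated assumptions: `AliasSketch`, its `valid_from`/`valid_until` fields, and the inclusive bounds in `is_valid_on` are hypothetical, since the real `TeamAlias` model lives in `models/aliases` and is not shown here. The alias value and team IDs are made up.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class AliasSketch:
    """Hypothetical stand-in for TeamAlias; field names are assumptions."""

    alias_value: str
    team_canonical_id: str
    valid_from: Optional[date] = None
    valid_until: Optional[date] = None

    def is_valid_on(self, check_date: date) -> bool:
        # Open-ended bounds (None) never exclude a date.
        if self.valid_from and check_date < self.valid_from:
            return False
        if self.valid_until and check_date > self.valid_until:
            return False
        return True


def resolve(index: dict, value: str, check_date: date) -> Optional[str]:
    # Same shape as the loader: normalize the key, then return the first
    # alias whose validity window covers the date.
    for alias in index.get(value.lower().strip(), []):
        if alias.is_valid_on(check_date):
            return alias.team_canonical_id
    return None


# The same alias value points at different (made-up) teams over time.
aliases = [
    AliasSketch("EXA", "team_demo_first", valid_until=date(2019, 12, 31)),
    AliasSketch("EXA", "team_demo_second", valid_from=date(2020, 1, 1)),
]
index: dict = {}
for a in aliases:
    index.setdefault(a.alias_value.lower().strip(), []).append(a)

early = resolve(index, " exa ", date(2015, 6, 1))
late = resolve(index, "EXA", date(2024, 1, 1))
missing = resolve(index, "zzz", date(2024, 1, 1))
```

Note that the first valid alias wins, so the order of entries in the JSON file matters whenever two validity windows overlap.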
@@ -0,0 +1,279 @@
|
||||
"""Canonical ID generation for games, teams, and stadiums."""
|
||||
|
||||
import re
|
||||
import unicodedata
|
||||
from datetime import date, datetime
|
||||
from typing import Optional
|
||||
|
||||
|
||||
def normalize_string(s: str) -> str:
|
||||
"""Normalize a string for use in canonical IDs.
|
||||
|
||||
- Convert to lowercase
|
||||
- Replace spaces and hyphens with underscores
|
||||
- Remove special characters (except underscores)
|
||||
- Collapse multiple underscores
|
||||
- Strip leading/trailing underscores
|
||||
|
||||
Args:
|
||||
s: String to normalize
|
||||
|
||||
Returns:
|
||||
Normalized string suitable for IDs
|
||||
"""
|
||||
# Convert to lowercase
|
||||
result = s.lower()
|
||||
|
||||
# Normalize unicode (e.g., é -> e)
|
||||
result = unicodedata.normalize("NFKD", result)
|
||||
result = result.encode("ascii", "ignore").decode("ascii")
|
||||
|
||||
# Replace spaces and hyphens with underscores
|
||||
result = re.sub(r"[\s\-]+", "_", result)
|
||||
|
||||
# Remove special characters except underscores
|
||||
result = re.sub(r"[^a-z0-9_]", "", result)
|
||||
|
||||
# Collapse multiple underscores
|
||||
result = re.sub(r"_+", "_", result)
|
||||
|
||||
# Strip leading/trailing underscores
|
||||
result = result.strip("_")
|
||||
|
||||
return result
|
||||
|
||||
|
||||
def generate_game_id(
|
||||
sport: str,
|
||||
season: int,
|
||||
away_abbrev: str,
|
||||
home_abbrev: str,
|
||||
game_date: date | datetime,
|
||||
game_number: Optional[int] = None,
|
||||
) -> str:
|
||||
"""Generate a canonical game ID.
|
||||
|
||||
Format: {sport}_{season}_{away}_{home}_{MMDD}[_{game_number}]
|
||||
|
||||
Args:
|
||||
sport: Sport code (e.g., 'nba', 'mlb')
|
||||
season: Season start year (e.g., 2025 for 2025-26)
|
||||
away_abbrev: Away team abbreviation (e.g., 'HOU')
|
||||
home_abbrev: Home team abbreviation (e.g., 'OKC')
|
||||
game_date: Date of the game
|
||||
game_number: Game number for doubleheaders (1 or 2), None for single games
|
||||
|
||||
Returns:
|
||||
Canonical game ID (e.g., 'nba_2025_hou_okc_1021')
|
||||
|
||||
Examples:
|
||||
>>> generate_game_id('nba', 2025, 'HOU', 'OKC', date(2025, 10, 21))
|
||||
'nba_2025_hou_okc_1021'
|
||||
|
||||
>>> generate_game_id('mlb', 2026, 'NYY', 'BOS', date(2026, 4, 1), game_number=1)
|
||||
'mlb_2026_nyy_bos_0401_1'
|
||||
"""
|
||||
# Normalize sport and abbreviations
|
||||
sport_norm = sport.lower()
|
||||
away_norm = away_abbrev.lower()
|
||||
home_norm = home_abbrev.lower()
|
||||
|
||||
# Format date as MMDD
|
||||
if isinstance(game_date, datetime):
|
||||
game_date = game_date.date()
|
||||
date_str = game_date.strftime("%m%d")
|
||||
|
||||
# Build ID
|
||||
parts = [sport_norm, str(season), away_norm, home_norm, date_str]
|
||||
|
||||
# Add game number for doubleheaders
|
||||
if game_number is not None:
|
||||
parts.append(str(game_number))
|
||||
|
||||
return "_".join(parts)
|
||||
|
||||
|
||||
def generate_team_id(sport: str, city: str, name: str) -> str:
|
||||
"""Generate a canonical team ID.
|
||||
|
||||
Format: team_{sport}_{city}_{name}
|
||||
|
||||
For most teams, we use the standard abbreviation. This function generates
|
||||
a fallback ID based on city and name for teams without a known abbreviation.
|
||||
|
||||
Args:
|
||||
sport: Sport code (e.g., 'nba', 'mlb')
|
||||
city: Team city (e.g., 'Los Angeles')
|
||||
name: Team name (e.g., 'Lakers')
|
||||
|
||||
Returns:
|
||||
Canonical team ID (e.g., 'team_nba_los_angeles_lakers')
|
||||
|
||||
Examples:
|
||||
>>> generate_team_id('nba', 'Los Angeles', 'Lakers')
|
||||
'team_nba_los_angeles_lakers'
|
||||
|
||||
>>> generate_team_id('mlb', 'New York', 'Yankees')
|
||||
'team_mlb_new_york_yankees'
|
||||
"""
|
||||
sport_norm = sport.lower()
|
||||
city_norm = normalize_string(city)
|
||||
name_norm = normalize_string(name)
|
||||
|
||||
return f"team_{sport_norm}_{city_norm}_{name_norm}"
|
||||
|
||||
|
||||
def generate_team_id_from_abbrev(sport: str, abbreviation: str) -> str:
|
||||
"""Generate a canonical team ID from abbreviation.
|
||||
|
||||
Format: team_{sport}_{abbreviation}
|
||||
|
||||
Args:
|
||||
sport: Sport code (e.g., 'nba', 'mlb')
|
||||
abbreviation: Team abbreviation (e.g., 'LAL', 'NYY')
|
||||
|
||||
Returns:
|
||||
Canonical team ID (e.g., 'team_nba_lal')
|
||||
|
||||
Examples:
|
||||
>>> generate_team_id_from_abbrev('nba', 'LAL')
|
||||
'team_nba_lal'
|
||||
|
||||
>>> generate_team_id_from_abbrev('mlb', 'NYY')
|
||||
'team_mlb_nyy'
|
||||
"""
|
||||
sport_norm = sport.lower()
|
||||
abbrev_norm = abbreviation.lower()
|
||||
|
||||
return f"team_{sport_norm}_{abbrev_norm}"
|
||||
|
||||
|
||||
def generate_stadium_id(sport: str, name: str) -> str:
|
||||
"""Generate a canonical stadium ID.
|
||||
|
||||
Format: stadium_{sport}_{normalized_name}
|
||||
|
||||
Args:
|
||||
sport: Sport code (e.g., 'nba', 'mlb')
|
||||
name: Stadium name (e.g., 'Yankee Stadium')
|
||||
|
||||
Returns:
|
||||
Canonical stadium ID (e.g., 'stadium_mlb_yankee_stadium')
|
||||
|
||||
Examples:
|
||||
>>> generate_stadium_id('nba', 'Crypto.com Arena')
|
||||
'stadium_nba_cryptocom_arena'
|
||||
|
||||
>>> generate_stadium_id('mlb', 'Yankee Stadium')
|
||||
'stadium_mlb_yankee_stadium'
|
||||
"""
|
||||
sport_norm = sport.lower()
|
||||
name_norm = normalize_string(name)
|
||||
|
||||
return f"stadium_{sport_norm}_{name_norm}"
|
||||
|
||||
|
||||
def parse_game_id(game_id: str) -> dict:
|
||||
"""Parse a canonical game ID into its components.
|
||||
|
||||
Args:
|
||||
game_id: Canonical game ID (e.g., 'nba_2025_hou_okc_1021')
|
||||
|
||||
Returns:
|
||||
Dictionary with keys: sport, season, away_abbrev, home_abbrev,
|
||||
month, day, game_number (optional)
|
||||
|
||||
Raises:
|
||||
ValueError: If game_id format is invalid
|
||||
|
||||
Examples:
|
||||
>>> parse_game_id('nba_2025_hou_okc_1021')
|
||||
{'sport': 'nba', 'season': 2025, 'away_abbrev': 'hou',
|
||||
'home_abbrev': 'okc', 'month': 10, 'day': 21, 'game_number': None}
|
||||
|
||||
>>> parse_game_id('mlb_2026_nyy_bos_0401_1')
|
||||
{'sport': 'mlb', 'season': 2026, 'away_abbrev': 'nyy',
|
||||
'home_abbrev': 'bos', 'month': 4, 'day': 1, 'game_number': 1}
|
||||
"""
|
||||
parts = game_id.split("_")
|
||||
|
||||
if len(parts) < 5 or len(parts) > 6:
|
||||
raise ValueError(f"Invalid game ID format: {game_id}")
|
||||
|
||||
sport = parts[0]
|
||||
season = int(parts[1])
|
||||
away_abbrev = parts[2]
|
||||
home_abbrev = parts[3]
|
||||
date_str = parts[4]
|
||||
|
||||
if len(date_str) != 4:
|
||||
raise ValueError(f"Invalid date format in game ID: {game_id}")
|
||||
|
||||
month = int(date_str[:2])
|
||||
day = int(date_str[2:])
|
||||
|
||||
game_number = None
|
||||
if len(parts) == 6:
|
||||
game_number = int(parts[5])
|
||||
|
||||
return {
|
||||
"sport": sport,
|
||||
"season": season,
|
||||
"away_abbrev": away_abbrev,
|
||||
"home_abbrev": home_abbrev,
|
||||
"month": month,
|
||||
"day": day,
|
||||
"game_number": game_number,
|
||||
}
|
||||
|
||||
|
||||
def parse_team_id(team_id: str) -> dict:
|
||||
"""Parse a canonical team ID into its components.
|
||||
|
||||
Args:
|
||||
team_id: Canonical team ID (e.g., 'team_nba_lal')
|
||||
|
||||
Returns:
|
||||
Dictionary with keys: sport, identifier (abbreviation or city_name)
|
||||
|
||||
Raises:
|
||||
ValueError: If team_id format is invalid
|
||||
"""
|
||||
if not team_id.startswith("team_"):
|
||||
raise ValueError(f"Invalid team ID format: {team_id}")
|
||||
|
||||
parts = team_id.split("_", 2)
|
||||
|
||||
if len(parts) < 3:
|
||||
raise ValueError(f"Invalid team ID format: {team_id}")
|
||||
|
||||
return {
|
||||
"sport": parts[1],
|
||||
"identifier": parts[2],
|
||||
}
|
||||
|
||||
|
||||
def parse_stadium_id(stadium_id: str) -> dict:
|
||||
"""Parse a canonical stadium ID into its components.
|
||||
|
||||
Args:
|
||||
stadium_id: Canonical stadium ID (e.g., 'stadium_nba_paycom_center')
|
||||
|
||||
Returns:
|
||||
Dictionary with keys: sport, name
|
||||
|
||||
Raises:
|
||||
ValueError: If stadium_id format is invalid
|
||||
"""
|
||||
if not stadium_id.startswith("stadium_"):
|
||||
raise ValueError(f"Invalid stadium ID format: {stadium_id}")
|
||||
|
||||
parts = stadium_id.split("_", 2)
|
||||
|
||||
if len(parts) < 3:
|
||||
raise ValueError(f"Invalid stadium ID format: {stadium_id}")
|
||||
|
||||
return {
|
||||
"sport": parts[1],
|
||||
"name": parts[2],
|
||||
}
|
||||
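The game ID format `{sport}_{season}_{away}_{home}_{MMDD}[_{game_number}]` from the module above can be checked with a build-and-parse round trip. The sketch below is a compact stand-in written for illustration; `build_game_id` and `split_game_id` are hypothetical names that mirror, but are not, the module's `generate_game_id` and `parse_game_id`.

```python
from datetime import date
from typing import Optional


def build_game_id(
    sport: str,
    season: int,
    away: str,
    home: str,
    game_date: date,
    game_number: Optional[int] = None,
) -> str:
    # Lowercase the text parts and encode the date as MMDD (no year).
    parts = [sport.lower(), str(season), away.lower(), home.lower(),
             game_date.strftime("%m%d")]
    if game_number is not None:
        parts.append(str(game_number))  # doubleheader suffix
    return "_".join(parts)


def split_game_id(game_id: str) -> dict:
    parts = game_id.split("_")
    if not 5 <= len(parts) <= 6:
        raise ValueError(f"Invalid game ID format: {game_id}")
    return {
        "sport": parts[0],
        "season": int(parts[1]),
        "away_abbrev": parts[2],
        "home_abbrev": parts[3],
        "month": int(parts[4][:2]),
        "day": int(parts[4][2:]),
        "game_number": int(parts[5]) if len(parts) == 6 else None,
    }


single = build_game_id("nba", 2025, "HOU", "OKC", date(2025, 10, 21))
doubleheader = build_game_id("mlb", 2026, "NYY", "BOS", date(2026, 4, 1), game_number=1)
parsed = split_game_id(doubleheader)
```

Two properties follow from the format: the ID carries only month and day, so the calendar year has to be inferred from the season, and parsing splits on `_`, so it relies on abbreviations that contain no underscore.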
@@ -0,0 +1,272 @@
|
||||
"""Fuzzy string matching utilities for team and stadium name resolution."""
|
||||
|
||||
from dataclasses import dataclass
|
||||
from typing import Optional
|
||||
|
||||
from rapidfuzz import fuzz, process
|
||||
from rapidfuzz.utils import default_process
|
||||
|
||||
from ..config import FUZZY_MATCH_THRESHOLD
|
||||
from ..models.aliases import FuzzyMatch
|
||||
|
||||
|
||||
@dataclass
|
||||
class MatchCandidate:
|
||||
"""A candidate for fuzzy matching.
|
||||
|
||||
Attributes:
|
||||
canonical_id: The canonical ID of this candidate
|
||||
name: The display name for this candidate
|
||||
aliases: List of alternative names to match against
|
||||
"""
|
||||
|
||||
canonical_id: str
|
||||
name: str
|
||||
aliases: list[str]
|
||||
|
||||
|
||||
def normalize_for_matching(s: str) -> str:
|
||||
"""Normalize a string for fuzzy matching.
|
||||
|
||||
- Convert to lowercase
|
||||
- Remove common prefixes/suffixes
|
||||
- Collapse whitespace
|
||||
|
||||
Args:
|
||||
s: String to normalize
|
||||
|
||||
Returns:
|
||||
Normalized string
|
||||
"""
|
||||
result = s.lower().strip()
|
||||
|
||||
# Remove common prefixes
|
||||
prefixes = ["the ", "team ", "stadium "]
|
||||
for prefix in prefixes:
|
||||
if result.startswith(prefix):
|
||||
result = result[len(prefix) :]
|
||||
|
||||
# Remove common suffixes
|
||||
suffixes = [" stadium", " arena", " center", " field", " park"]
|
||||
for suffix in suffixes:
|
||||
if result.endswith(suffix):
|
||||
result = result[: -len(suffix)]
|
||||
|
||||
return result.strip()
|
||||
|
||||
|
||||
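The effect of the prefix and suffix stripping in `normalize_for_matching` is easiest to see on a few inputs. The block below repeats the function body from the diff so that it runs on its own; the "Example Arena" and "Example Center" names are made up for illustration.

```python
def normalize_for_matching(s: str) -> str:
    # Copy of the logic shown above, repeated here so the example is runnable.
    result = s.lower().strip()
    for prefix in ["the ", "team ", "stadium "]:
        if result.startswith(prefix):
            result = result[len(prefix):]
    for suffix in [" stadium", " arena", " center", " field", " park"]:
        if result.endswith(suffix):
            result = result[: -len(suffix)]
    return result.strip()


paycom = normalize_for_matching("The Paycom Center")
crypto = normalize_for_matching("Crypto.com Arena")

# Two hypothetical venues that differ only by suffix collapse to one key.
same_key = (
    normalize_for_matching("Example Arena")
    == normalize_for_matching("Example Center")
)
```

Since names that differ only in a generic suffix normalize to the same string, fuzzy scores for such pairs will be identical, which is one reason the resolvers may want exact and alias matches to run before fuzzy matching.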
def fuzzy_match_team(
|
||||
query: str,
|
||||
candidates: list[MatchCandidate],
|
||||
threshold: int = FUZZY_MATCH_THRESHOLD,
|
||||
top_n: int = 3,
|
||||
) -> list[FuzzyMatch]:
|
||||
"""Find fuzzy matches for a team name.
|
||||
|
||||
Uses multiple matching strategies:
|
||||
1. Token set ratio (handles word order differences)
|
||||
2. Partial ratio (handles substring matches)
|
||||
3. Standard ratio (overall similarity)
|
||||
|
||||
Args:
|
||||
query: Team name to match
|
||||
candidates: List of candidate teams to match against
|
||||
threshold: Minimum score to consider a match (0-100)
|
||||
top_n: Maximum number of matches to return
|
||||
|
||||
Returns:
|
||||
List of FuzzyMatch objects sorted by confidence (descending)
|
||||
"""
|
||||
query_norm = normalize_for_matching(query)
|
||||
|
||||
# Build list of all matchable strings with their canonical IDs
|
||||
match_strings: list[tuple[str, str, str]] = [] # (string, canonical_id, name)
|
||||
|
||||
for candidate in candidates:
|
||||
# Add primary name
|
||||
match_strings.append(
|
||||
(normalize_for_matching(candidate.name), candidate.canonical_id, candidate.name)
|
||||
)
|
||||
# Add aliases
|
||||
for alias in candidate.aliases:
|
||||
match_strings.append(
|
||||
(normalize_for_matching(alias), candidate.canonical_id, candidate.name)
|
||||
)
|
||||
|
||||
# Score all candidates
|
||||
scored: dict[str, tuple[int, str]] = {} # canonical_id -> (best_score, name)
|
||||
|
||||
for match_str, canonical_id, name in match_strings:
|
||||
# Use multiple scoring methods
|
||||
token_score = fuzz.token_set_ratio(query_norm, match_str)
|
||||
partial_score = fuzz.partial_ratio(query_norm, match_str)
|
||||
ratio_score = fuzz.ratio(query_norm, match_str)
|
||||
|
||||
# Weighted average favoring token_set_ratio for team names
|
||||
score = int(0.5 * token_score + 0.3 * partial_score + 0.2 * ratio_score)
|
||||
|
||||
# Keep best score for each canonical ID
|
||||
if canonical_id not in scored or score > scored[canonical_id][0]:
|
||||
scored[canonical_id] = (score, name)
|
||||
|
||||
# Filter by threshold and sort
|
||||
matches = [
|
||||
FuzzyMatch(canonical_id=cid, canonical_name=name, confidence=score)
|
||||
for cid, (score, name) in scored.items()
|
||||
if score >= threshold
|
||||
]
|
||||
|
||||
    # Sort by confidence descending
    matches.sort(key=lambda m: m.confidence, reverse=True)

    return matches[:top_n]


def fuzzy_match_stadium(
    query: str,
    candidates: list[MatchCandidate],
    threshold: int = FUZZY_MATCH_THRESHOLD,
    top_n: int = 3,
) -> list[FuzzyMatch]:
    """Find fuzzy matches for a stadium name.

    Uses matching strategies optimized for stadium names:
    1. Token sort ratio (handles "X Stadium" vs "Stadium X")
    2. Partial ratio (handles naming rights changes)
    3. Standard ratio

    Args:
        query: Stadium name to match
        candidates: List of candidate stadiums to match against
        threshold: Minimum score to consider a match (0-100)
        top_n: Maximum number of matches to return

    Returns:
        List of FuzzyMatch objects sorted by confidence (descending)
    """
    query_norm = normalize_for_matching(query)

    # Build list of all matchable strings
    match_strings: list[tuple[str, str, str]] = []

    for candidate in candidates:
        match_strings.append(
            (normalize_for_matching(candidate.name), candidate.canonical_id, candidate.name)
        )
        for alias in candidate.aliases:
            match_strings.append(
                (normalize_for_matching(alias), candidate.canonical_id, candidate.name)
            )

    # Score all candidates
    scored: dict[str, tuple[int, str]] = {}

    for match_str, canonical_id, name in match_strings:
        # Use scoring methods suited for stadium names
        token_sort_score = fuzz.token_sort_ratio(query_norm, match_str)
        partial_score = fuzz.partial_ratio(query_norm, match_str)
        ratio_score = fuzz.ratio(query_norm, match_str)

        # Weighted average
        score = int(0.4 * token_sort_score + 0.4 * partial_score + 0.2 * ratio_score)

        if canonical_id not in scored or score > scored[canonical_id][0]:
            scored[canonical_id] = (score, name)

    # Filter and sort
    matches = [
        FuzzyMatch(canonical_id=cid, canonical_name=name, confidence=score)
        for cid, (score, name) in scored.items()
        if score >= threshold
    ]

    matches.sort(key=lambda m: m.confidence, reverse=True)

    return matches[:top_n]
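The aggregation step in `fuzzy_match_stadium` (score every name and alias, keep the best score per canonical ID, then filter and rank) can be tried in isolation. The sketch below is only an illustration under stated assumptions: it uses `difflib.SequenceMatcher` from the standard library as a stand-in for the `fuzz` scorers, and `Candidate`, `similarity`, and `best_scores` are hypothetical names that do not exist in `sportstime_parser`.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Candidate:
    """Hypothetical stand-in for MatchCandidate."""
    canonical_id: str
    name: str
    aliases: list[str] = field(default_factory=list)


def similarity(a: str, b: str) -> int:
    """Stand-in scorer (0-100) built on difflib, not on the fuzz library."""
    return int(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)


def best_scores(
    query: str,
    candidates: list[Candidate],
    threshold: int,
    top_n: int = 3,
) -> list[tuple[str, int]]:
    """Keep the highest score per canonical ID across name and aliases."""
    scored: dict[str, int] = {}
    for cand in candidates:
        for text in [cand.name, *cand.aliases]:
            score = similarity(query, text)
            if score > scored.get(cand.canonical_id, -1):
                scored[cand.canonical_id] = score

    # Rank by score (descending), drop anything under the threshold,
    # and cap the result at top_n entries.
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return [(cid, score) for cid, score in ranked if score >= threshold][:top_n]


# Illustrative data; the alias is a former name of the venue.
candidates = [
    Candidate("stadium_nba_td_garden", "TD Garden", ["FleetCenter"]),
    Candidate("stadium_nba_chase_center", "Chase Center"),
]
results = best_scores("td garden", candidates, threshold=80)
```

Because the score is tracked per canonical ID rather than per string, a stadium with many aliases still contributes at most one entry to the ranked output.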
def exact_match(
    query: str,
    candidates: list[MatchCandidate],
    case_sensitive: bool = False,
) -> Optional[str]:
    """Find an exact match for a string.

    Args:
        query: String to match
        candidates: List of candidates to match against
        case_sensitive: Whether to use case-sensitive matching

    Returns:
        Canonical ID if exact match found, None otherwise
    """
    if case_sensitive:
        query_norm = query.strip()
    else:
        query_norm = query.lower().strip()

    for candidate in candidates:
        # Check primary name
        name = candidate.name if case_sensitive else candidate.name.lower()
        if query_norm == name.strip():
            return candidate.canonical_id

        # Check aliases
        for alias in candidate.aliases:
            alias_norm = alias if case_sensitive else alias.lower()
            if query_norm == alias_norm.strip():
                return candidate.canonical_id

    return None


def best_match(
    query: str,
    candidates: list[MatchCandidate],
    threshold: int = FUZZY_MATCH_THRESHOLD,
) -> Optional[FuzzyMatch]:
    """Find the best match for a query string.

    First tries exact match, then falls back to fuzzy matching.

    Args:
        query: String to match
        candidates: List of candidates
        threshold: Minimum fuzzy match score

    Returns:
        Best FuzzyMatch or None if no match above threshold
    """
    # Try exact match first
    exact = exact_match(query, candidates)
    if exact:
        # Find the name for this ID
        for c in candidates:
            if c.canonical_id == exact:
                return FuzzyMatch(
                    canonical_id=exact,
                    canonical_name=c.name,
                    confidence=100,
                )

    # Fall back to fuzzy matching
    # Use team matching by default (works for both)
    matches = fuzzy_match_team(query, candidates, threshold=threshold, top_n=1)

    return matches[0] if matches else None
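The exact-then-fuzzy order used by `best_match` can be sketched in a few lines. This is a minimal illustration, not the package's implementation: `NAMES` and `resolve` are hypothetical, and `difflib.SequenceMatcher` is assumed as a stand-in for the fuzzy scorer so the block runs with the standard library alone.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical lookup table: canonical ID -> display name.
NAMES = {
    "stadium_mlb_fenway_park": "Fenway Park",
    "stadium_mlb_wrigley_field": "Wrigley Field",
}


def resolve(query: str, threshold: int = 80) -> Optional[tuple[str, int]]:
    """Return (canonical_id, confidence), trying exact match before fuzzy."""
    wanted = query.lower().strip()

    # Step 1: an exact (case-insensitive) match gets full confidence.
    for cid, name in NAMES.items():
        if wanted == name.lower():
            return cid, 100

    # Step 2: fuzzy fallback, scored with difflib as a stand-in scorer.
    best_id: Optional[str] = None
    best_score = -1
    for cid, name in NAMES.items():
        score = int(SequenceMatcher(None, wanted, name.lower()).ratio() * 100)
        if score > best_score:
            best_id, best_score = cid, score

    # Only report the fuzzy winner if it clears the threshold.
    if best_id is not None and best_score >= threshold:
        return best_id, best_score
    return None


exact = resolve("  FENWAY PARK ")
fuzzy = resolve("Fenway Prk")
missing = resolve("zzzz")
```

Running the exact check first keeps well-formed input from ever receiving a confidence below 100, so the fuzzy path is reserved for names that genuinely differ.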
def calculate_similarity(s1: str, s2: str) -> int:
    """Calculate similarity between two strings.

    Args:
        s1: First string
        s2: Second string

    Returns:
        Similarity score 0-100
    """
    s1_norm = normalize_for_matching(s1)
    s2_norm = normalize_for_matching(s2)

    return fuzz.token_set_ratio(s1_norm, s2_norm)
@@ -0,0 +1,474 @@
"""Stadium name resolver with exact, alias, and fuzzy matching."""

from dataclasses import dataclass
from datetime import date
from typing import Optional
from uuid import uuid4

from ..config import FUZZY_MATCH_THRESHOLD, ALLOWED_COUNTRIES
from ..models.aliases import FuzzyMatch, ManualReviewItem, ReviewReason
from .alias_loader import get_stadium_alias_loader, StadiumAliasLoader
from .fuzzy import MatchCandidate, fuzzy_match_stadium


@dataclass
class StadiumResolveResult:
    """Result of stadium resolution.

    Attributes:
        canonical_id: Resolved canonical stadium ID (None if unresolved)
        confidence: Confidence in the match (100 for exact, lower for fuzzy)
        match_type: How the match was made ('exact', 'alias', 'fuzzy', 'unresolved')
        filtered_reason: Reason if stadium was filtered out (e.g., 'geographic')
        review_item: ManualReviewItem if resolution failed or low confidence
    """

    canonical_id: Optional[str]
    confidence: int
    match_type: str
    filtered_reason: Optional[str] = None
    review_item: Optional[ManualReviewItem] = None


@dataclass
class StadiumInfo:
    """Stadium information for matching."""

    canonical_id: str
    name: str
    city: str
    state: str
    country: str
    sport: str
    latitude: float
    longitude: float
|
||||
|
||||
|
||||
# Hardcoded stadium mappings
|
||||
# Format: {sport: {canonical_id: StadiumInfo}}
|
||||
STADIUM_MAPPINGS: dict[str, dict[str, StadiumInfo]] = {
|
||||
"nba": {
|
||||
"stadium_nba_state_farm_arena": StadiumInfo("stadium_nba_state_farm_arena", "State Farm Arena", "Atlanta", "GA", "USA", "nba", 33.7573, -84.3963),
|
||||
"stadium_nba_td_garden": StadiumInfo("stadium_nba_td_garden", "TD Garden", "Boston", "MA", "USA", "nba", 42.3662, -71.0621),
|
||||
"stadium_nba_barclays_center": StadiumInfo("stadium_nba_barclays_center", "Barclays Center", "Brooklyn", "NY", "USA", "nba", 40.6826, -73.9754),
|
||||
"stadium_nba_spectrum_center": StadiumInfo("stadium_nba_spectrum_center", "Spectrum Center", "Charlotte", "NC", "USA", "nba", 35.2251, -80.8392),
|
||||
"stadium_nba_united_center": StadiumInfo("stadium_nba_united_center", "United Center", "Chicago", "IL", "USA", "nba", 41.8807, -87.6742),
|
||||
"stadium_nba_rocket_mortgage_fieldhouse": StadiumInfo("stadium_nba_rocket_mortgage_fieldhouse", "Rocket Mortgage FieldHouse", "Cleveland", "OH", "USA", "nba", 41.4965, -81.6882),
|
||||
"stadium_nba_american_airlines_center": StadiumInfo("stadium_nba_american_airlines_center", "American Airlines Center", "Dallas", "TX", "USA", "nba", 32.7905, -96.8103),
|
||||
"stadium_nba_ball_arena": StadiumInfo("stadium_nba_ball_arena", "Ball Arena", "Denver", "CO", "USA", "nba", 39.7487, -105.0077),
|
||||
"stadium_nba_little_caesars_arena": StadiumInfo("stadium_nba_little_caesars_arena", "Little Caesars Arena", "Detroit", "MI", "USA", "nba", 42.3411, -83.0553),
|
||||
"stadium_nba_chase_center": StadiumInfo("stadium_nba_chase_center", "Chase Center", "San Francisco", "CA", "USA", "nba", 37.7680, -122.3877),
|
||||
"stadium_nba_toyota_center": StadiumInfo("stadium_nba_toyota_center", "Toyota Center", "Houston", "TX", "USA", "nba", 29.7508, -95.3621),
|
||||
"stadium_nba_gainbridge_fieldhouse": StadiumInfo("stadium_nba_gainbridge_fieldhouse", "Gainbridge Fieldhouse", "Indianapolis", "IN", "USA", "nba", 39.7640, -86.1555),
|
||||
"stadium_nba_intuit_dome": StadiumInfo("stadium_nba_intuit_dome", "Intuit Dome", "Inglewood", "CA", "USA", "nba", 33.9425, -118.3417),
|
||||
"stadium_nba_cryptocom_arena": StadiumInfo("stadium_nba_cryptocom_arena", "Crypto.com Arena", "Los Angeles", "CA", "USA", "nba", 34.0430, -118.2673),
|
||||
"stadium_nba_fedexforum": StadiumInfo("stadium_nba_fedexforum", "FedExForum", "Memphis", "TN", "USA", "nba", 35.1383, -90.0505),
|
||||
"stadium_nba_kaseya_center": StadiumInfo("stadium_nba_kaseya_center", "Kaseya Center", "Miami", "FL", "USA", "nba", 25.7814, -80.1870),
|
||||
"stadium_nba_fiserv_forum": StadiumInfo("stadium_nba_fiserv_forum", "Fiserv Forum", "Milwaukee", "WI", "USA", "nba", 43.0451, -87.9172),
|
||||
"stadium_nba_target_center": StadiumInfo("stadium_nba_target_center", "Target Center", "Minneapolis", "MN", "USA", "nba", 44.9795, -93.2761),
|
||||
"stadium_nba_smoothie_king_center": StadiumInfo("stadium_nba_smoothie_king_center", "Smoothie King Center", "New Orleans", "LA", "USA", "nba", 29.9490, -90.0821),
|
||||
"stadium_nba_madison_square_garden": StadiumInfo("stadium_nba_madison_square_garden", "Madison Square Garden", "New York", "NY", "USA", "nba", 40.7505, -73.9934),
|
||||
"stadium_nba_paycom_center": StadiumInfo("stadium_nba_paycom_center", "Paycom Center", "Oklahoma City", "OK", "USA", "nba", 35.4634, -97.5151),
|
||||
"stadium_nba_kia_center": StadiumInfo("stadium_nba_kia_center", "Kia Center", "Orlando", "FL", "USA", "nba", 28.5392, -81.3839),
|
||||
"stadium_nba_wells_fargo_center": StadiumInfo("stadium_nba_wells_fargo_center", "Wells Fargo Center", "Philadelphia", "PA", "USA", "nba", 39.9012, -75.1720),
|
||||
"stadium_nba_footprint_center": StadiumInfo("stadium_nba_footprint_center", "Footprint Center", "Phoenix", "AZ", "USA", "nba", 33.4457, -112.0712),
|
||||
"stadium_nba_moda_center": StadiumInfo("stadium_nba_moda_center", "Moda Center", "Portland", "OR", "USA", "nba", 45.5316, -122.6668),
|
||||
"stadium_nba_golden_1_center": StadiumInfo("stadium_nba_golden_1_center", "Golden 1 Center", "Sacramento", "CA", "USA", "nba", 38.5802, -121.4997),
|
||||
"stadium_nba_frost_bank_center": StadiumInfo("stadium_nba_frost_bank_center", "Frost Bank Center", "San Antonio", "TX", "USA", "nba", 29.4270, -98.4375),
|
||||
"stadium_nba_scotiabank_arena": StadiumInfo("stadium_nba_scotiabank_arena", "Scotiabank Arena", "Toronto", "ON", "Canada", "nba", 43.6435, -79.3791),
|
||||
"stadium_nba_delta_center": StadiumInfo("stadium_nba_delta_center", "Delta Center", "Salt Lake City", "UT", "USA", "nba", 40.7683, -111.9011),
|
||||
"stadium_nba_capital_one_arena": StadiumInfo("stadium_nba_capital_one_arena", "Capital One Arena", "Washington", "DC", "USA", "nba", 38.8981, -77.0209),
|
||||
},
|
||||
"mlb": {
|
||||
"stadium_mlb_chase_field": StadiumInfo("stadium_mlb_chase_field", "Chase Field", "Phoenix", "AZ", "USA", "mlb", 33.4455, -112.0667),
|
||||
"stadium_mlb_truist_park": StadiumInfo("stadium_mlb_truist_park", "Truist Park", "Atlanta", "GA", "USA", "mlb", 33.8908, -84.4678),
|
||||
"stadium_mlb_oriole_park_at_camden_yards": StadiumInfo("stadium_mlb_oriole_park_at_camden_yards", "Oriole Park at Camden Yards", "Baltimore", "MD", "USA", "mlb", 39.2839, -76.6217),
|
||||
"stadium_mlb_fenway_park": StadiumInfo("stadium_mlb_fenway_park", "Fenway Park", "Boston", "MA", "USA", "mlb", 42.3467, -71.0972),
|
||||
"stadium_mlb_wrigley_field": StadiumInfo("stadium_mlb_wrigley_field", "Wrigley Field", "Chicago", "IL", "USA", "mlb", 41.9484, -87.6553),
|
||||
"stadium_mlb_guaranteed_rate_field": StadiumInfo("stadium_mlb_guaranteed_rate_field", "Guaranteed Rate Field", "Chicago", "IL", "USA", "mlb", 41.8299, -87.6338),
|
||||
"stadium_mlb_great_american_ball_park": StadiumInfo("stadium_mlb_great_american_ball_park", "Great American Ball Park", "Cincinnati", "OH", "USA", "mlb", 39.0974, -84.5082),
|
||||
"stadium_mlb_progressive_field": StadiumInfo("stadium_mlb_progressive_field", "Progressive Field", "Cleveland", "OH", "USA", "mlb", 41.4962, -81.6852),
|
||||
"stadium_mlb_coors_field": StadiumInfo("stadium_mlb_coors_field", "Coors Field", "Denver", "CO", "USA", "mlb", 39.7559, -104.9942),
|
||||
"stadium_mlb_comerica_park": StadiumInfo("stadium_mlb_comerica_park", "Comerica Park", "Detroit", "MI", "USA", "mlb", 42.3390, -83.0485),
|
||||
"stadium_mlb_minute_maid_park": StadiumInfo("stadium_mlb_minute_maid_park", "Minute Maid Park", "Houston", "TX", "USA", "mlb", 29.7573, -95.3555),
|
||||
"stadium_mlb_kauffman_stadium": StadiumInfo("stadium_mlb_kauffman_stadium", "Kauffman Stadium", "Kansas City", "MO", "USA", "mlb", 39.0517, -94.4803),
|
||||
"stadium_mlb_angel_stadium": StadiumInfo("stadium_mlb_angel_stadium", "Angel Stadium", "Anaheim", "CA", "USA", "mlb", 33.8003, -117.8827),
|
||||
"stadium_mlb_dodger_stadium": StadiumInfo("stadium_mlb_dodger_stadium", "Dodger Stadium", "Los Angeles", "CA", "USA", "mlb", 34.0739, -118.2400),
|
||||
"stadium_mlb_loandepot_park": StadiumInfo("stadium_mlb_loandepot_park", "loanDepot park", "Miami", "FL", "USA", "mlb", 25.7781, -80.2195),
|
||||
"stadium_mlb_american_family_field": StadiumInfo("stadium_mlb_american_family_field", "American Family Field", "Milwaukee", "WI", "USA", "mlb", 43.0280, -87.9712),
|
||||
"stadium_mlb_target_field": StadiumInfo("stadium_mlb_target_field", "Target Field", "Minneapolis", "MN", "USA", "mlb", 44.9818, -93.2775),
|
||||
"stadium_mlb_citi_field": StadiumInfo("stadium_mlb_citi_field", "Citi Field", "New York", "NY", "USA", "mlb", 40.7571, -73.8458),
|
||||
"stadium_mlb_yankee_stadium": StadiumInfo("stadium_mlb_yankee_stadium", "Yankee Stadium", "Bronx", "NY", "USA", "mlb", 40.8296, -73.9262),
|
||||
"stadium_mlb_sutter_health_park": StadiumInfo("stadium_mlb_sutter_health_park", "Sutter Health Park", "Sacramento", "CA", "USA", "mlb", 38.5803, -121.5005),
|
||||
"stadium_mlb_citizens_bank_park": StadiumInfo("stadium_mlb_citizens_bank_park", "Citizens Bank Park", "Philadelphia", "PA", "USA", "mlb", 39.9061, -75.1665),
|
||||
"stadium_mlb_pnc_park": StadiumInfo("stadium_mlb_pnc_park", "PNC Park", "Pittsburgh", "PA", "USA", "mlb", 40.4469, -80.0057),
|
||||
"stadium_mlb_petco_park": StadiumInfo("stadium_mlb_petco_park", "Petco Park", "San Diego", "CA", "USA", "mlb", 32.7076, -117.1570),
|
||||
"stadium_mlb_oracle_park": StadiumInfo("stadium_mlb_oracle_park", "Oracle Park", "San Francisco", "CA", "USA", "mlb", 37.7786, -122.3893),
|
||||
"stadium_mlb_tmobile_park": StadiumInfo("stadium_mlb_tmobile_park", "T-Mobile Park", "Seattle", "WA", "USA", "mlb", 47.5914, -122.3325),
|
||||
"stadium_mlb_busch_stadium": StadiumInfo("stadium_mlb_busch_stadium", "Busch Stadium", "St. Louis", "MO", "USA", "mlb", 38.6226, -90.1928),
|
||||
"stadium_mlb_tropicana_field": StadiumInfo("stadium_mlb_tropicana_field", "Tropicana Field", "St. Petersburg", "FL", "USA", "mlb", 27.7682, -82.6534),
|
||||
"stadium_mlb_globe_life_field": StadiumInfo("stadium_mlb_globe_life_field", "Globe Life Field", "Arlington", "TX", "USA", "mlb", 32.7473, -97.0845),
|
||||
"stadium_mlb_rogers_centre": StadiumInfo("stadium_mlb_rogers_centre", "Rogers Centre", "Toronto", "ON", "Canada", "mlb", 43.6414, -79.3894),
|
||||
"stadium_mlb_nationals_park": StadiumInfo("stadium_mlb_nationals_park", "Nationals Park", "Washington", "DC", "USA", "mlb", 38.8730, -77.0074),
|
||||
},
|
||||
"nfl": {
|
||||
"stadium_nfl_state_farm_stadium": StadiumInfo("stadium_nfl_state_farm_stadium", "State Farm Stadium", "Glendale", "AZ", "USA", "nfl", 33.5276, -112.2626),
|
||||
"stadium_nfl_mercedes_benz_stadium": StadiumInfo("stadium_nfl_mercedes_benz_stadium", "Mercedes-Benz Stadium", "Atlanta", "GA", "USA", "nfl", 33.7553, -84.4006),
|
||||
"stadium_nfl_mandt_bank_stadium": StadiumInfo("stadium_nfl_mandt_bank_stadium", "M&T Bank Stadium", "Baltimore", "MD", "USA", "nfl", 39.2780, -76.6227),
|
||||
"stadium_nfl_highmark_stadium": StadiumInfo("stadium_nfl_highmark_stadium", "Highmark Stadium", "Orchard Park", "NY", "USA", "nfl", 42.7738, -78.7870),
|
||||
"stadium_nfl_bank_of_america_stadium": StadiumInfo("stadium_nfl_bank_of_america_stadium", "Bank of America Stadium", "Charlotte", "NC", "USA", "nfl", 35.2258, -80.8528),
|
||||
"stadium_nfl_soldier_field": StadiumInfo("stadium_nfl_soldier_field", "Soldier Field", "Chicago", "IL", "USA", "nfl", 41.8623, -87.6167),
|
||||
"stadium_nfl_paycor_stadium": StadiumInfo("stadium_nfl_paycor_stadium", "Paycor Stadium", "Cincinnati", "OH", "USA", "nfl", 39.0955, -84.5161),
|
||||
"stadium_nfl_huntington_bank_field": StadiumInfo("stadium_nfl_huntington_bank_field", "Huntington Bank Field", "Cleveland", "OH", "USA", "nfl", 41.5061, -81.6995),
|
||||
"stadium_nfl_att_stadium": StadiumInfo("stadium_nfl_att_stadium", "AT&T Stadium", "Arlington", "TX", "USA", "nfl", 32.7473, -97.0945),
|
||||
"stadium_nfl_empower_field": StadiumInfo("stadium_nfl_empower_field", "Empower Field at Mile High", "Denver", "CO", "USA", "nfl", 39.7439, -105.0201),
|
||||
"stadium_nfl_ford_field": StadiumInfo("stadium_nfl_ford_field", "Ford Field", "Detroit", "MI", "USA", "nfl", 42.3400, -83.0456),
|
||||
"stadium_nfl_lambeau_field": StadiumInfo("stadium_nfl_lambeau_field", "Lambeau Field", "Green Bay", "WI", "USA", "nfl", 44.5013, -88.0622),
|
||||
"stadium_nfl_nrg_stadium": StadiumInfo("stadium_nfl_nrg_stadium", "NRG Stadium", "Houston", "TX", "USA", "nfl", 29.6847, -95.4107),
|
||||
"stadium_nfl_lucas_oil_stadium": StadiumInfo("stadium_nfl_lucas_oil_stadium", "Lucas Oil Stadium", "Indianapolis", "IN", "USA", "nfl", 39.7601, -86.1639),
|
||||
"stadium_nfl_everbank_stadium": StadiumInfo("stadium_nfl_everbank_stadium", "EverBank Stadium", "Jacksonville", "FL", "USA", "nfl", 30.3239, -81.6373),
|
||||
"stadium_nfl_arrowhead_stadium": StadiumInfo("stadium_nfl_arrowhead_stadium", "Arrowhead Stadium", "Kansas City", "MO", "USA", "nfl", 39.0489, -94.4839),
|
||||
"stadium_nfl_allegiant_stadium": StadiumInfo("stadium_nfl_allegiant_stadium", "Allegiant Stadium", "Las Vegas", "NV", "USA", "nfl", 36.0909, -115.1833),
|
||||
"stadium_nfl_sofi_stadium": StadiumInfo("stadium_nfl_sofi_stadium", "SoFi Stadium", "Inglewood", "CA", "USA", "nfl", 33.9534, -118.3386),
|
||||
"stadium_nfl_hard_rock_stadium": StadiumInfo("stadium_nfl_hard_rock_stadium", "Hard Rock Stadium", "Miami Gardens", "FL", "USA", "nfl", 25.9580, -80.2389),
|
||||
"stadium_nfl_us_bank_stadium": StadiumInfo("stadium_nfl_us_bank_stadium", "U.S. Bank Stadium", "Minneapolis", "MN", "USA", "nfl", 44.9737, -93.2575),
|
||||
"stadium_nfl_gillette_stadium": StadiumInfo("stadium_nfl_gillette_stadium", "Gillette Stadium", "Foxborough", "MA", "USA", "nfl", 42.0909, -71.2643),
|
||||
"stadium_nfl_caesars_superdome": StadiumInfo("stadium_nfl_caesars_superdome", "Caesars Superdome", "New Orleans", "LA", "USA", "nfl", 29.9511, -90.0812),
|
||||
"stadium_nfl_metlife_stadium": StadiumInfo("stadium_nfl_metlife_stadium", "MetLife Stadium", "East Rutherford", "NJ", "USA", "nfl", 40.8128, -74.0742),
|
||||
"stadium_nfl_lincoln_financial_field": StadiumInfo("stadium_nfl_lincoln_financial_field", "Lincoln Financial Field", "Philadelphia", "PA", "USA", "nfl", 39.9008, -75.1675),
|
||||
"stadium_nfl_acrisure_stadium": StadiumInfo("stadium_nfl_acrisure_stadium", "Acrisure Stadium", "Pittsburgh", "PA", "USA", "nfl", 40.4468, -80.0158),
|
||||
"stadium_nfl_levis_stadium": StadiumInfo("stadium_nfl_levis_stadium", "Levi's Stadium", "Santa Clara", "CA", "USA", "nfl", 37.4033, -121.9695),
|
||||
"stadium_nfl_lumen_field": StadiumInfo("stadium_nfl_lumen_field", "Lumen Field", "Seattle", "WA", "USA", "nfl", 47.5952, -122.3316),
|
||||
"stadium_nfl_raymond_james_stadium": StadiumInfo("stadium_nfl_raymond_james_stadium", "Raymond James Stadium", "Tampa", "FL", "USA", "nfl", 27.9759, -82.5033),
|
||||
"stadium_nfl_nissan_stadium": StadiumInfo("stadium_nfl_nissan_stadium", "Nissan Stadium", "Nashville", "TN", "USA", "nfl", 36.1665, -86.7713),
|
||||
"stadium_nfl_northwest_stadium": StadiumInfo("stadium_nfl_northwest_stadium", "Northwest Stadium", "Landover", "MD", "USA", "nfl", 38.9076, -76.8645),
|
||||
},
|
||||
"nhl": {
|
||||
"stadium_nhl_honda_center": StadiumInfo("stadium_nhl_honda_center", "Honda Center", "Anaheim", "CA", "USA", "nhl", 33.8078, -117.8765),
|
||||
"stadium_nhl_delta_center": StadiumInfo("stadium_nhl_delta_center", "Delta Center", "Salt Lake City", "UT", "USA", "nhl", 40.7683, -111.9011),
|
||||
"stadium_nhl_td_garden": StadiumInfo("stadium_nhl_td_garden", "TD Garden", "Boston", "MA", "USA", "nhl", 42.3662, -71.0621),
|
||||
"stadium_nhl_keybank_center": StadiumInfo("stadium_nhl_keybank_center", "KeyBank Center", "Buffalo", "NY", "USA", "nhl", 42.8750, -78.8764),
|
||||
"stadium_nhl_scotiabank_saddledome": StadiumInfo("stadium_nhl_scotiabank_saddledome", "Scotiabank Saddledome", "Calgary", "AB", "Canada", "nhl", 51.0374, -114.0519),
|
||||
"stadium_nhl_pnc_arena": StadiumInfo("stadium_nhl_pnc_arena", "PNC Arena", "Raleigh", "NC", "USA", "nhl", 35.8033, -78.7220),
|
||||
"stadium_nhl_united_center": StadiumInfo("stadium_nhl_united_center", "United Center", "Chicago", "IL", "USA", "nhl", 41.8807, -87.6742),
|
||||
"stadium_nhl_ball_arena": StadiumInfo("stadium_nhl_ball_arena", "Ball Arena", "Denver", "CO", "USA", "nhl", 39.7487, -105.0077),
|
||||
"stadium_nhl_nationwide_arena": StadiumInfo("stadium_nhl_nationwide_arena", "Nationwide Arena", "Columbus", "OH", "USA", "nhl", 39.9692, -83.0061),
|
||||
"stadium_nhl_american_airlines_center": StadiumInfo("stadium_nhl_american_airlines_center", "American Airlines Center", "Dallas", "TX", "USA", "nhl", 32.7905, -96.8103),
|
||||
"stadium_nhl_little_caesars_arena": StadiumInfo("stadium_nhl_little_caesars_arena", "Little Caesars Arena", "Detroit", "MI", "USA", "nhl", 42.3411, -83.0553),
|
||||
"stadium_nhl_rogers_place": StadiumInfo("stadium_nhl_rogers_place", "Rogers Place", "Edmonton", "AB", "Canada", "nhl", 53.5469, -113.4979),
|
||||
"stadium_nhl_amerant_bank_arena": StadiumInfo("stadium_nhl_amerant_bank_arena", "Amerant Bank Arena", "Sunrise", "FL", "USA", "nhl", 26.1584, -80.3256),
|
||||
"stadium_nhl_cryptocom_arena": StadiumInfo("stadium_nhl_cryptocom_arena", "Crypto.com Arena", "Los Angeles", "CA", "USA", "nhl", 34.0430, -118.2673),
|
||||
"stadium_nhl_xcel_energy_center": StadiumInfo("stadium_nhl_xcel_energy_center", "Xcel Energy Center", "St. Paul", "MN", "USA", "nhl", 44.9448, -93.1010),
|
||||
"stadium_nhl_bell_centre": StadiumInfo("stadium_nhl_bell_centre", "Bell Centre", "Montreal", "QC", "Canada", "nhl", 45.4961, -73.5693),
|
||||
"stadium_nhl_bridgestone_arena": StadiumInfo("stadium_nhl_bridgestone_arena", "Bridgestone Arena", "Nashville", "TN", "USA", "nhl", 36.1592, -86.7785),
|
||||
"stadium_nhl_prudential_center": StadiumInfo("stadium_nhl_prudential_center", "Prudential Center", "Newark", "NJ", "USA", "nhl", 40.7334, -74.1712),
|
||||
"stadium_nhl_ubs_arena": StadiumInfo("stadium_nhl_ubs_arena", "UBS Arena", "Elmont", "NY", "USA", "nhl", 40.7170, -73.7255),
|
||||
"stadium_nhl_madison_square_garden": StadiumInfo("stadium_nhl_madison_square_garden", "Madison Square Garden", "New York", "NY", "USA", "nhl", 40.7505, -73.9934),
|
||||
"stadium_nhl_canadian_tire_centre": StadiumInfo("stadium_nhl_canadian_tire_centre", "Canadian Tire Centre", "Ottawa", "ON", "Canada", "nhl", 45.2969, -75.9272),
|
||||
"stadium_nhl_wells_fargo_center": StadiumInfo("stadium_nhl_wells_fargo_center", "Wells Fargo Center", "Philadelphia", "PA", "USA", "nhl", 39.9012, -75.1720),
|
||||
"stadium_nhl_ppg_paints_arena": StadiumInfo("stadium_nhl_ppg_paints_arena", "PPG Paints Arena", "Pittsburgh", "PA", "USA", "nhl", 40.4395, -79.9890),
|
||||
"stadium_nhl_sap_center": StadiumInfo("stadium_nhl_sap_center", "SAP Center", "San Jose", "CA", "USA", "nhl", 37.3327, -121.9011),
|
||||
"stadium_nhl_climate_pledge_arena": StadiumInfo("stadium_nhl_climate_pledge_arena", "Climate Pledge Arena", "Seattle", "WA", "USA", "nhl", 47.6221, -122.3540),
|
||||
"stadium_nhl_enterprise_center": StadiumInfo("stadium_nhl_enterprise_center", "Enterprise Center", "St. Louis", "MO", "USA", "nhl", 38.6268, -90.2025),
|
||||
"stadium_nhl_amalie_arena": StadiumInfo("stadium_nhl_amalie_arena", "Amalie Arena", "Tampa", "FL", "USA", "nhl", 27.9428, -82.4519),
|
||||
"stadium_nhl_scotiabank_arena": StadiumInfo("stadium_nhl_scotiabank_arena", "Scotiabank Arena", "Toronto", "ON", "Canada", "nhl", 43.6435, -79.3791),
|
||||
"stadium_nhl_rogers_arena": StadiumInfo("stadium_nhl_rogers_arena", "Rogers Arena", "Vancouver", "BC", "Canada", "nhl", 49.2778, -123.1088),
|
||||
"stadium_nhl_tmobile_arena": StadiumInfo("stadium_nhl_tmobile_arena", "T-Mobile Arena", "Las Vegas", "NV", "USA", "nhl", 36.1028, -115.1783),
|
||||
"stadium_nhl_capital_one_arena": StadiumInfo("stadium_nhl_capital_one_arena", "Capital One Arena", "Washington", "DC", "USA", "nhl", 38.8981, -77.0209),
|
||||
"stadium_nhl_canada_life_centre": StadiumInfo("stadium_nhl_canada_life_centre", "Canada Life Centre", "Winnipeg", "MB", "Canada", "nhl", 49.8928, -97.1433),
|
||||
},
|
||||
"mls": {
|
||||
"stadium_mls_mercedes_benz_stadium": StadiumInfo("stadium_mls_mercedes_benz_stadium", "Mercedes-Benz Stadium", "Atlanta", "GA", "USA", "mls", 33.7553, -84.4006),
|
||||
"stadium_mls_q2_stadium": StadiumInfo("stadium_mls_q2_stadium", "Q2 Stadium", "Austin", "TX", "USA", "mls", 30.3875, -97.7186),
|
||||
"stadium_mls_bank_of_america_stadium": StadiumInfo("stadium_mls_bank_of_america_stadium", "Bank of America Stadium", "Charlotte", "NC", "USA", "mls", 35.2258, -80.8528),
|
||||
"stadium_mls_soldier_field": StadiumInfo("stadium_mls_soldier_field", "Soldier Field", "Chicago", "IL", "USA", "mls", 41.8623, -87.6167),
|
||||
"stadium_mls_tql_stadium": StadiumInfo("stadium_mls_tql_stadium", "TQL Stadium", "Cincinnati", "OH", "USA", "mls", 39.1112, -84.5225),
|
||||
"stadium_mls_dicks_sporting_goods_park": StadiumInfo("stadium_mls_dicks_sporting_goods_park", "Dick's Sporting Goods Park", "Commerce City", "CO", "USA", "mls", 39.8056, -104.8922),
|
||||
"stadium_mls_lower_com_field": StadiumInfo("stadium_mls_lower_com_field", "Lower.com Field", "Columbus", "OH", "USA", "mls", 39.9689, -83.0173),
|
||||
"stadium_mls_toyota_stadium": StadiumInfo("stadium_mls_toyota_stadium", "Toyota Stadium", "Frisco", "TX", "USA", "mls", 33.1545, -96.8353),
|
||||
"stadium_mls_audi_field": StadiumInfo("stadium_mls_audi_field", "Audi Field", "Washington", "DC", "USA", "mls", 38.8687, -77.0128),
|
||||
"stadium_mls_shell_energy_stadium": StadiumInfo("stadium_mls_shell_energy_stadium", "Shell Energy Stadium", "Houston", "TX", "USA", "mls", 29.7522, -95.3527),
|
||||
"stadium_mls_dignity_health_sports_park": StadiumInfo("stadium_mls_dignity_health_sports_park", "Dignity Health Sports Park", "Carson", "CA", "USA", "mls", 33.8644, -118.2611),
|
||||
"stadium_mls_bmo_stadium": StadiumInfo("stadium_mls_bmo_stadium", "BMO Stadium", "Los Angeles", "CA", "USA", "mls", 34.0128, -118.2841),
|
||||
"stadium_mls_chase_stadium": StadiumInfo("stadium_mls_chase_stadium", "Chase Stadium", "Fort Lauderdale", "FL", "USA", "mls", 26.1930, -80.1611),
|
||||
"stadium_mls_allianz_field": StadiumInfo("stadium_mls_allianz_field", "Allianz Field", "St. Paul", "MN", "USA", "mls", 44.9528, -93.1650),
|
||||
"stadium_mls_stade_saputo": StadiumInfo("stadium_mls_stade_saputo", "Stade Saputo", "Montreal", "QC", "Canada", "mls", 45.5622, -73.5528),
|
||||
"stadium_mls_geodis_park": StadiumInfo("stadium_mls_geodis_park", "GEODIS Park", "Nashville", "TN", "USA", "mls", 36.1304, -86.7651),
|
||||
"stadium_mls_gillette_stadium": StadiumInfo("stadium_mls_gillette_stadium", "Gillette Stadium", "Foxborough", "MA", "USA", "mls", 42.0909, -71.2643),
|
||||
"stadium_mls_yankee_stadium": StadiumInfo("stadium_mls_yankee_stadium", "Yankee Stadium", "Bronx", "NY", "USA", "mls", 40.8296, -73.9262),
|
||||
"stadium_mls_red_bull_arena": StadiumInfo("stadium_mls_red_bull_arena", "Red Bull Arena", "Harrison", "NJ", "USA", "mls", 40.7369, -74.1503),
|
||||
"stadium_mls_inter_co_stadium": StadiumInfo("stadium_mls_inter_co_stadium", "Inter&Co Stadium", "Orlando", "FL", "USA", "mls", 28.5411, -81.3895),
|
||||
"stadium_mls_subaru_park": StadiumInfo("stadium_mls_subaru_park", "Subaru Park", "Chester", "PA", "USA", "mls", 39.8328, -75.3789),
|
||||
"stadium_mls_providence_park": StadiumInfo("stadium_mls_providence_park", "Providence Park", "Portland", "OR", "USA", "mls", 45.5216, -122.6917),
|
||||
"stadium_mls_america_first_field": StadiumInfo("stadium_mls_america_first_field", "America First Field", "Sandy", "UT", "USA", "mls", 40.5830, -111.8933),
|
||||
"stadium_mls_paypal_park": StadiumInfo("stadium_mls_paypal_park", "PayPal Park", "San Jose", "CA", "USA", "mls", 37.3511, -121.9250),
|
||||
"stadium_mls_snapdragon_stadium": StadiumInfo("stadium_mls_snapdragon_stadium", "Snapdragon Stadium", "San Diego", "CA", "USA", "mls", 32.7837, -117.1225),
|
||||
"stadium_mls_lumen_field": StadiumInfo("stadium_mls_lumen_field", "Lumen Field", "Seattle", "WA", "USA", "mls", 47.5952, -122.3316),
|
||||
"stadium_mls_childrens_mercy_park": StadiumInfo("stadium_mls_childrens_mercy_park", "Children's Mercy Park", "Kansas City", "KS", "USA", "mls", 39.1217, -94.8231),
|
||||
"stadium_mls_citypark": StadiumInfo("stadium_mls_citypark", "CITYPARK", "St. Louis", "MO", "USA", "mls", 38.6316, -90.2106),
|
||||
"stadium_mls_bmo_field": StadiumInfo("stadium_mls_bmo_field", "BMO Field", "Toronto", "ON", "Canada", "mls", 43.6332, -79.4186),
|
||||
"stadium_mls_bc_place": StadiumInfo("stadium_mls_bc_place", "BC Place", "Vancouver", "BC", "Canada", "mls", 49.2768, -123.1118),
|
||||
},
|
||||
"wnba": {
|
||||
"stadium_wnba_gateway_center_arena": StadiumInfo("stadium_wnba_gateway_center_arena", "Gateway Center Arena", "College Park", "GA", "USA", "wnba", 33.6510, -84.4474),
|
||||
"stadium_wnba_wintrust_arena": StadiumInfo("stadium_wnba_wintrust_arena", "Wintrust Arena", "Chicago", "IL", "USA", "wnba", 41.8658, -87.6169),
|
||||
"stadium_wnba_mohegan_sun_arena": StadiumInfo("stadium_wnba_mohegan_sun_arena", "Mohegan Sun Arena", "Uncasville", "CT", "USA", "wnba", 41.4931, -72.0912),
|
||||
"stadium_wnba_college_park_center": StadiumInfo("stadium_wnba_college_park_center", "College Park Center", "Arlington", "TX", "USA", "wnba", 32.7304, -97.1077),
|
||||
"stadium_wnba_chase_center": StadiumInfo("stadium_wnba_chase_center", "Chase Center", "San Francisco", "CA", "USA", "wnba", 37.7680, -122.3877),
|
||||
"stadium_wnba_gainbridge_fieldhouse": StadiumInfo("stadium_wnba_gainbridge_fieldhouse", "Gainbridge Fieldhouse", "Indianapolis", "IN", "USA", "wnba", 39.7640, -86.1555),
|
||||
"stadium_wnba_michelob_ultra_arena": StadiumInfo("stadium_wnba_michelob_ultra_arena", "Michelob Ultra Arena", "Las Vegas", "NV", "USA", "wnba", 36.0902, -115.1756),
|
||||
"stadium_wnba_cryptocom_arena": StadiumInfo("stadium_wnba_cryptocom_arena", "Crypto.com Arena", "Los Angeles", "CA", "USA", "wnba", 34.0430, -118.2673),
|
||||
"stadium_wnba_target_center": StadiumInfo("stadium_wnba_target_center", "Target Center", "Minneapolis", "MN", "USA", "wnba", 44.9795, -93.2761),
|
||||
"stadium_wnba_barclays_center": StadiumInfo("stadium_wnba_barclays_center", "Barclays Center", "Brooklyn", "NY", "USA", "wnba", 40.6826, -73.9754),
|
||||
"stadium_wnba_footprint_center": StadiumInfo("stadium_wnba_footprint_center", "Footprint Center", "Phoenix", "AZ", "USA", "wnba", 33.4457, -112.0712),
|
||||
"stadium_wnba_climate_pledge_arena": StadiumInfo("stadium_wnba_climate_pledge_arena", "Climate Pledge Arena", "Seattle", "WA", "USA", "wnba", 47.6221, -122.3540),
|
||||
"stadium_wnba_entertainment_sports_arena": StadiumInfo("stadium_wnba_entertainment_sports_arena", "Entertainment & Sports Arena", "Washington", "DC", "USA", "wnba", 38.8690, -76.9745),
    },
    "nwsl": {
        "stadium_nwsl_bmo_stadium": StadiumInfo("stadium_nwsl_bmo_stadium", "BMO Stadium", "Los Angeles", "CA", "USA", "nwsl", 34.0128, -118.2841),
        "stadium_nwsl_seatgeek_stadium": StadiumInfo("stadium_nwsl_seatgeek_stadium", "SeatGeek Stadium", "Bridgeview", "IL", "USA", "nwsl", 41.7500, -87.8028),
        "stadium_nwsl_shell_energy_stadium": StadiumInfo("stadium_nwsl_shell_energy_stadium", "Shell Energy Stadium", "Houston", "TX", "USA", "nwsl", 29.7522, -95.3527),
        "stadium_nwsl_cpkc_stadium": StadiumInfo("stadium_nwsl_cpkc_stadium", "CPKC Stadium", "Kansas City", "MO", "USA", "nwsl", 39.1050, -94.5580),
        "stadium_nwsl_red_bull_arena": StadiumInfo("stadium_nwsl_red_bull_arena", "Red Bull Arena", "Harrison", "NJ", "USA", "nwsl", 40.7369, -74.1503),
        "stadium_nwsl_wakemed_soccer_park": StadiumInfo("stadium_nwsl_wakemed_soccer_park", "WakeMed Soccer Park", "Cary", "NC", "USA", "nwsl", 35.7879, -78.7806),
        "stadium_nwsl_inter_co_stadium": StadiumInfo("stadium_nwsl_inter_co_stadium", "Inter&Co Stadium", "Orlando", "FL", "USA", "nwsl", 28.5411, -81.3895),
        "stadium_nwsl_providence_park": StadiumInfo("stadium_nwsl_providence_park", "Providence Park", "Portland", "OR", "USA", "nwsl", 45.5216, -122.6917),
        "stadium_nwsl_lynn_family_stadium": StadiumInfo("stadium_nwsl_lynn_family_stadium", "Lynn Family Stadium", "Louisville", "KY", "USA", "nwsl", 38.2219, -85.7381),
        "stadium_nwsl_snapdragon_stadium": StadiumInfo("stadium_nwsl_snapdragon_stadium", "Snapdragon Stadium", "San Diego", "CA", "USA", "nwsl", 32.7837, -117.1225),
        "stadium_nwsl_lumen_field": StadiumInfo("stadium_nwsl_lumen_field", "Lumen Field", "Seattle", "WA", "USA", "nwsl", 47.5952, -122.3316),
        "stadium_nwsl_america_first_field": StadiumInfo("stadium_nwsl_america_first_field", "America First Field", "Sandy", "UT", "USA", "nwsl", 40.5830, -111.8933),
        "stadium_nwsl_audi_field": StadiumInfo("stadium_nwsl_audi_field", "Audi Field", "Washington", "DC", "USA", "nwsl", 38.8687, -77.0128),
        "stadium_nwsl_paypal_park": StadiumInfo("stadium_nwsl_paypal_park", "PayPal Park", "San Jose", "CA", "USA", "nwsl", 37.3511, -121.9250),
    },
}


class StadiumResolver:
    """Resolves stadium names to canonical IDs.

    Resolution order:
    1. Exact match against stadium names
    2. Alias lookup (with date awareness)
    3. Fuzzy match against all known names
    4. Geographic filter check
    5. Unresolved (returns ManualReviewItem)
    """

    def __init__(
        self,
        sport: str,
        alias_loader: Optional[StadiumAliasLoader] = None,
        fuzzy_threshold: int = FUZZY_MATCH_THRESHOLD,
    ):
        """Initialize the resolver.

        Args:
            sport: Sport code (e.g., 'nba', 'mlb')
            alias_loader: Stadium alias loader (default: global loader)
            fuzzy_threshold: Minimum fuzzy match score
        """
        self.sport = sport.lower()
        self.alias_loader = alias_loader or get_stadium_alias_loader()
        self.fuzzy_threshold = fuzzy_threshold
        self._stadiums = STADIUM_MAPPINGS.get(self.sport, {})

        # Build match candidates
        self._candidates = self._build_candidates()

    def _build_candidates(self) -> list[MatchCandidate]:
        """Build match candidates from stadium mappings."""
        candidates = []

        for stadium_id, info in self._stadiums.items():
            # Get aliases for this stadium
            aliases = [a.alias_name for a in self.alias_loader.get_aliases_for_stadium(stadium_id)]

            # Add city as alias
            aliases.append(info.city)

            candidates.append(MatchCandidate(
                canonical_id=stadium_id,
                name=info.name,
                aliases=aliases,
            ))

        return candidates

    def resolve(
        self,
        name: str,
        check_date: Optional[date] = None,
        country: Optional[str] = None,
        source_url: Optional[str] = None,
    ) -> StadiumResolveResult:
        """Resolve a stadium name to a canonical ID.

        Args:
            name: Stadium name to resolve
            check_date: Date for alias validity (None = today)
            country: Country for geographic filtering (None = no filter)
            source_url: Source URL for manual review items

        Returns:
            StadiumResolveResult with resolution details
        """
        name_lower = name.lower().strip()

        # 1. Exact match against stadium names
        for stadium_id, info in self._stadiums.items():
            if name_lower == info.name.lower():
                return StadiumResolveResult(
                    canonical_id=stadium_id,
                    confidence=100,
                    match_type="exact",
                )

        # 2. Alias lookup
        alias_result = self.alias_loader.resolve(name, check_date)
        if alias_result:
            # Verify it's for the right sport (alias file has all sports)
            if alias_result.startswith(f"stadium_{self.sport}_"):
                return StadiumResolveResult(
                    canonical_id=alias_result,
                    confidence=95,
                    match_type="alias",
                )

        # 3. Fuzzy match
        matches = fuzzy_match_stadium(
            name,
            self._candidates,
            threshold=self.fuzzy_threshold,
        )

        if matches:
            best = matches[0]
            review_item = None

            # Create review item for low confidence matches
            if best.confidence < 90:
                review_item = ManualReviewItem(
                    id=f"stadium_{uuid4().hex[:8]}",
                    reason=ReviewReason.LOW_CONFIDENCE_MATCH,
                    sport=self.sport,
                    raw_value=name,
                    context={"match_type": "fuzzy"},
                    source_url=source_url,
                    suggested_matches=matches,
                    game_date=check_date,
                )

            return StadiumResolveResult(
                canonical_id=best.canonical_id,
                confidence=best.confidence,
                match_type="fuzzy",
                review_item=review_item,
            )

        # 4. Geographic filter check
        if country and country not in ALLOWED_COUNTRIES:
            review_item = ManualReviewItem(
                id=f"stadium_{uuid4().hex[:8]}",
                reason=ReviewReason.GEOGRAPHIC_FILTER,
                sport=self.sport,
                raw_value=name,
                context={"country": country, "reason": "Stadium outside USA/Canada/Mexico"},
                source_url=source_url,
                game_date=check_date,
            )

            return StadiumResolveResult(
                canonical_id=None,
                confidence=0,
                match_type="filtered",
                filtered_reason="geographic",
                review_item=review_item,
            )

        # 5. Unresolved
        review_item = ManualReviewItem(
            id=f"stadium_{uuid4().hex[:8]}",
            reason=ReviewReason.UNRESOLVED_STADIUM,
            sport=self.sport,
            raw_value=name,
            context={},
            source_url=source_url,
            suggested_matches=fuzzy_match_stadium(
                name,
                self._candidates,
                threshold=50,  # Lower threshold for suggestions
                top_n=5,
            ),
            game_date=check_date,
        )

        return StadiumResolveResult(
            canonical_id=None,
            confidence=0,
            match_type="unresolved",
            review_item=review_item,
        )

    def get_stadium_info(self, stadium_id: str) -> Optional[StadiumInfo]:
        """Get stadium info by ID.

        Args:
            stadium_id: Canonical stadium ID

        Returns:
            StadiumInfo or None
        """
        return self._stadiums.get(stadium_id)

    def get_all_stadiums(self) -> list[StadiumInfo]:
        """Get all stadiums for this sport.

        Returns:
            List of StadiumInfo objects
        """
        return list(self._stadiums.values())

    def is_in_allowed_region(self, stadium_id: str) -> bool:
        """Check if a stadium is in an allowed region.

        Args:
            stadium_id: Canonical stadium ID

        Returns:
            True if stadium is in USA, Canada, or Mexico
        """
        info = self._stadiums.get(stadium_id)
        if not info:
            return False

        return info.country in ALLOWED_COUNTRIES


# Cached resolvers
_resolvers: dict[str, StadiumResolver] = {}


def get_stadium_resolver(sport: str) -> StadiumResolver:
    """Get or create a stadium resolver for a sport."""
    sport_lower = sport.lower()
    if sport_lower not in _resolvers:
        _resolvers[sport_lower] = StadiumResolver(sport_lower)
    return _resolvers[sport_lower]


def resolve_stadium(
    sport: str,
    name: str,
    check_date: Optional[date] = None,
) -> StadiumResolveResult:
    """Convenience function to resolve a stadium name.

    Args:
        sport: Sport code
        name: Stadium name to resolve
        check_date: Date for alias validity

    Returns:
        StadiumResolveResult
    """
    return get_stadium_resolver(sport).resolve(name, check_date)
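
The exact, alias, fuzzy, unresolved cascade used by the stadium resolver can be sketched in isolation. This is a minimal illustration under stated assumptions, not the package's implementation: `STADIUMS`, `ALIASES`, `Result`, and `similarity` are hypothetical stand-ins, and `difflib.SequenceMatcher` substitutes for whatever scorer `fuzzy_match_stadium` actually uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical sample data; the real mappings and alias files are much larger.
STADIUMS = {
    "stadium_nwsl_audi_field": "Audi Field",
    "stadium_nwsl_lumen_field": "Lumen Field",
    "stadium_nwsl_paypal_park": "PayPal Park",
}
ALIASES = {"earthquakes stadium": "stadium_nwsl_paypal_park"}


@dataclass
class Result:
    canonical_id: Optional[str]
    confidence: int
    match_type: str


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score (difflib stands in for the real scorer)."""
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)


def resolve(name: str, threshold: int = 80) -> Result:
    key = name.lower().strip()
    # 1. Exact match on the canonical name.
    for stadium_id, canonical in STADIUMS.items():
        if key == canonical.lower():
            return Result(stadium_id, 100, "exact")
    # 2. Alias lookup.
    if key in ALIASES:
        return Result(ALIASES[key], 95, "alias")
    # 3. Fuzzy match: keep the best-scoring candidate if it clears the threshold.
    best_id, best_score = max(
        ((sid, similarity(key, canonical)) for sid, canonical in STADIUMS.items()),
        key=lambda pair: pair[1],
    )
    if best_score >= threshold:
        return Result(best_id, best_score, "fuzzy")
    # 4. Nothing matched well enough.
    return Result(None, 0, "unresolved")
```

Ordering the steps from cheapest and most certain to most permissive means a fuzzy score is only computed when the stricter lookups have already failed, which keeps high-confidence results free of scorer noise.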
@@ -0,0 +1,482 @@
"""Team name resolver with exact, alias, and fuzzy matching."""

from dataclasses import dataclass
from datetime import date
from typing import Optional
from uuid import uuid4

from ..config import FUZZY_MATCH_THRESHOLD
from ..models.aliases import (
    AliasType,
    FuzzyMatch,
    ManualReviewItem,
    ReviewReason,
)
from .alias_loader import get_team_alias_loader, TeamAliasLoader
from .fuzzy import MatchCandidate, fuzzy_match_team, exact_match


@dataclass
class TeamResolveResult:
    """Result of team resolution.

    Attributes:
        canonical_id: Resolved canonical team ID (None if unresolved)
        confidence: Confidence in the match (100 for exact, lower for fuzzy)
        match_type: How the match was made ('exact', 'alias', 'fuzzy', 'unresolved')
        review_item: ManualReviewItem if resolution failed or low confidence
    """

    canonical_id: Optional[str]
    confidence: int
    match_type: str
    review_item: Optional[ManualReviewItem] = None


# Hardcoded team mappings for each sport
# Format: {sport: {abbreviation: (canonical_id, full_name, city)}}
TEAM_MAPPINGS: dict[str, dict[str, tuple[str, str, str]]] = {
    "nba": {
        "ATL": ("team_nba_atl", "Atlanta Hawks", "Atlanta"),
        "BOS": ("team_nba_bos", "Boston Celtics", "Boston"),
        "BKN": ("team_nba_brk", "Brooklyn Nets", "Brooklyn"),
        "BRK": ("team_nba_brk", "Brooklyn Nets", "Brooklyn"),
        "CHA": ("team_nba_cho", "Charlotte Hornets", "Charlotte"),
        "CHO": ("team_nba_cho", "Charlotte Hornets", "Charlotte"),
        "CHI": ("team_nba_chi", "Chicago Bulls", "Chicago"),
        "CLE": ("team_nba_cle", "Cleveland Cavaliers", "Cleveland"),
        "DAL": ("team_nba_dal", "Dallas Mavericks", "Dallas"),
        "DEN": ("team_nba_den", "Denver Nuggets", "Denver"),
        "DET": ("team_nba_det", "Detroit Pistons", "Detroit"),
        "GSW": ("team_nba_gsw", "Golden State Warriors", "Golden State"),
        "GS": ("team_nba_gsw", "Golden State Warriors", "Golden State"),
        "HOU": ("team_nba_hou", "Houston Rockets", "Houston"),
        "IND": ("team_nba_ind", "Indiana Pacers", "Indiana"),
        "LAC": ("team_nba_lac", "Los Angeles Clippers", "Los Angeles"),
        "LAL": ("team_nba_lal", "Los Angeles Lakers", "Los Angeles"),
        "MEM": ("team_nba_mem", "Memphis Grizzlies", "Memphis"),
        "MIA": ("team_nba_mia", "Miami Heat", "Miami"),
        "MIL": ("team_nba_mil", "Milwaukee Bucks", "Milwaukee"),
        "MIN": ("team_nba_min", "Minnesota Timberwolves", "Minnesota"),
        "NOP": ("team_nba_nop", "New Orleans Pelicans", "New Orleans"),
        "NO": ("team_nba_nop", "New Orleans Pelicans", "New Orleans"),
        "NYK": ("team_nba_nyk", "New York Knicks", "New York"),
        "NY": ("team_nba_nyk", "New York Knicks", "New York"),
        "OKC": ("team_nba_okc", "Oklahoma City Thunder", "Oklahoma City"),
        "ORL": ("team_nba_orl", "Orlando Magic", "Orlando"),
        "PHI": ("team_nba_phi", "Philadelphia 76ers", "Philadelphia"),
        "PHX": ("team_nba_phx", "Phoenix Suns", "Phoenix"),
        "PHO": ("team_nba_phx", "Phoenix Suns", "Phoenix"),
        "POR": ("team_nba_por", "Portland Trail Blazers", "Portland"),
        "SAC": ("team_nba_sac", "Sacramento Kings", "Sacramento"),
        "SAS": ("team_nba_sas", "San Antonio Spurs", "San Antonio"),
        "SA": ("team_nba_sas", "San Antonio Spurs", "San Antonio"),
        "TOR": ("team_nba_tor", "Toronto Raptors", "Toronto"),
        "UTA": ("team_nba_uta", "Utah Jazz", "Utah"),
        "WAS": ("team_nba_was", "Washington Wizards", "Washington"),
        "WSH": ("team_nba_was", "Washington Wizards", "Washington"),
    },
    "mlb": {
        "ARI": ("team_mlb_ari", "Arizona Diamondbacks", "Arizona"),
        "ATL": ("team_mlb_atl", "Atlanta Braves", "Atlanta"),
        "BAL": ("team_mlb_bal", "Baltimore Orioles", "Baltimore"),
        "BOS": ("team_mlb_bos", "Boston Red Sox", "Boston"),
        "CHC": ("team_mlb_chc", "Chicago Cubs", "Chicago"),
        "CHW": ("team_mlb_chw", "Chicago White Sox", "Chicago"),
        "CWS": ("team_mlb_chw", "Chicago White Sox", "Chicago"),
        "CIN": ("team_mlb_cin", "Cincinnati Reds", "Cincinnati"),
        "CLE": ("team_mlb_cle", "Cleveland Guardians", "Cleveland"),
        "COL": ("team_mlb_col", "Colorado Rockies", "Colorado"),
        "DET": ("team_mlb_det", "Detroit Tigers", "Detroit"),
        "HOU": ("team_mlb_hou", "Houston Astros", "Houston"),
        "KC": ("team_mlb_kc", "Kansas City Royals", "Kansas City"),
        "KCR": ("team_mlb_kc", "Kansas City Royals", "Kansas City"),
        "LAA": ("team_mlb_laa", "Los Angeles Angels", "Los Angeles"),
        "ANA": ("team_mlb_laa", "Los Angeles Angels", "Anaheim"),
        "LAD": ("team_mlb_lad", "Los Angeles Dodgers", "Los Angeles"),
        "MIA": ("team_mlb_mia", "Miami Marlins", "Miami"),
        "FLA": ("team_mlb_mia", "Miami Marlins", "Florida"),
        "MIL": ("team_mlb_mil", "Milwaukee Brewers", "Milwaukee"),
        "MIN": ("team_mlb_min", "Minnesota Twins", "Minnesota"),
        "NYM": ("team_mlb_nym", "New York Mets", "New York"),
        "NYY": ("team_mlb_nyy", "New York Yankees", "New York"),
        "OAK": ("team_mlb_oak", "Oakland Athletics", "Oakland"),
        "PHI": ("team_mlb_phi", "Philadelphia Phillies", "Philadelphia"),
        "PIT": ("team_mlb_pit", "Pittsburgh Pirates", "Pittsburgh"),
        "SD": ("team_mlb_sd", "San Diego Padres", "San Diego"),
        "SDP": ("team_mlb_sd", "San Diego Padres", "San Diego"),
        "SF": ("team_mlb_sf", "San Francisco Giants", "San Francisco"),
        "SFG": ("team_mlb_sf", "San Francisco Giants", "San Francisco"),
        "SEA": ("team_mlb_sea", "Seattle Mariners", "Seattle"),
        "STL": ("team_mlb_stl", "St. Louis Cardinals", "St. Louis"),
        "TB": ("team_mlb_tbr", "Tampa Bay Rays", "Tampa Bay"),
        "TBR": ("team_mlb_tbr", "Tampa Bay Rays", "Tampa Bay"),
        "TEX": ("team_mlb_tex", "Texas Rangers", "Texas"),
        "TOR": ("team_mlb_tor", "Toronto Blue Jays", "Toronto"),
        "WSN": ("team_mlb_wsn", "Washington Nationals", "Washington"),
        "WAS": ("team_mlb_wsn", "Washington Nationals", "Washington"),
    },
    "nfl": {
        "ARI": ("team_nfl_ari", "Arizona Cardinals", "Arizona"),
        "ATL": ("team_nfl_atl", "Atlanta Falcons", "Atlanta"),
        "BAL": ("team_nfl_bal", "Baltimore Ravens", "Baltimore"),
        "BUF": ("team_nfl_buf", "Buffalo Bills", "Buffalo"),
        "CAR": ("team_nfl_car", "Carolina Panthers", "Carolina"),
        "CHI": ("team_nfl_chi", "Chicago Bears", "Chicago"),
        "CIN": ("team_nfl_cin", "Cincinnati Bengals", "Cincinnati"),
        "CLE": ("team_nfl_cle", "Cleveland Browns", "Cleveland"),
        "DAL": ("team_nfl_dal", "Dallas Cowboys", "Dallas"),
        "DEN": ("team_nfl_den", "Denver Broncos", "Denver"),
        "DET": ("team_nfl_det", "Detroit Lions", "Detroit"),
        "GB": ("team_nfl_gb", "Green Bay Packers", "Green Bay"),
        "GNB": ("team_nfl_gb", "Green Bay Packers", "Green Bay"),
        "HOU": ("team_nfl_hou", "Houston Texans", "Houston"),
        "IND": ("team_nfl_ind", "Indianapolis Colts", "Indianapolis"),
        "JAX": ("team_nfl_jax", "Jacksonville Jaguars", "Jacksonville"),
        "JAC": ("team_nfl_jax", "Jacksonville Jaguars", "Jacksonville"),
        "KC": ("team_nfl_kc", "Kansas City Chiefs", "Kansas City"),
        "KAN": ("team_nfl_kc", "Kansas City Chiefs", "Kansas City"),
        "LV": ("team_nfl_lv", "Las Vegas Raiders", "Las Vegas"),
        "LAC": ("team_nfl_lac", "Los Angeles Chargers", "Los Angeles"),
        "LAR": ("team_nfl_lar", "Los Angeles Rams", "Los Angeles"),
        "MIA": ("team_nfl_mia", "Miami Dolphins", "Miami"),
        "MIN": ("team_nfl_min", "Minnesota Vikings", "Minnesota"),
        "NE": ("team_nfl_ne", "New England Patriots", "New England"),
        "NWE": ("team_nfl_ne", "New England Patriots", "New England"),
        "NO": ("team_nfl_no", "New Orleans Saints", "New Orleans"),
        "NOR": ("team_nfl_no", "New Orleans Saints", "New Orleans"),
        "NYG": ("team_nfl_nyg", "New York Giants", "New York"),
        "NYJ": ("team_nfl_nyj", "New York Jets", "New York"),
        "PHI": ("team_nfl_phi", "Philadelphia Eagles", "Philadelphia"),
        "PIT": ("team_nfl_pit", "Pittsburgh Steelers", "Pittsburgh"),
        "SF": ("team_nfl_sf", "San Francisco 49ers", "San Francisco"),
        "SFO": ("team_nfl_sf", "San Francisco 49ers", "San Francisco"),
        "SEA": ("team_nfl_sea", "Seattle Seahawks", "Seattle"),
        "TB": ("team_nfl_tb", "Tampa Bay Buccaneers", "Tampa Bay"),
        "TAM": ("team_nfl_tb", "Tampa Bay Buccaneers", "Tampa Bay"),
        "TEN": ("team_nfl_ten", "Tennessee Titans", "Tennessee"),
        "WAS": ("team_nfl_was", "Washington Commanders", "Washington"),
        "WSH": ("team_nfl_was", "Washington Commanders", "Washington"),
    },
    "nhl": {
        "ANA": ("team_nhl_ana", "Anaheim Ducks", "Anaheim"),
        "ARI": ("team_nhl_ari", "Utah Hockey Club", "Utah"),  # Relocated to Utah in 2024; legacy abbreviation kept
        "UTA": ("team_nhl_ari", "Utah Hockey Club", "Utah"),
        "BOS": ("team_nhl_bos", "Boston Bruins", "Boston"),
        "BUF": ("team_nhl_buf", "Buffalo Sabres", "Buffalo"),
        "CGY": ("team_nhl_cgy", "Calgary Flames", "Calgary"),
        "CAR": ("team_nhl_car", "Carolina Hurricanes", "Carolina"),
        "CHI": ("team_nhl_chi", "Chicago Blackhawks", "Chicago"),
        "COL": ("team_nhl_col", "Colorado Avalanche", "Colorado"),
        "CBJ": ("team_nhl_cbj", "Columbus Blue Jackets", "Columbus"),
        "DAL": ("team_nhl_dal", "Dallas Stars", "Dallas"),
        "DET": ("team_nhl_det", "Detroit Red Wings", "Detroit"),
        "EDM": ("team_nhl_edm", "Edmonton Oilers", "Edmonton"),
        "FLA": ("team_nhl_fla", "Florida Panthers", "Florida"),
        "LA": ("team_nhl_la", "Los Angeles Kings", "Los Angeles"),
        "LAK": ("team_nhl_la", "Los Angeles Kings", "Los Angeles"),
        "MIN": ("team_nhl_min", "Minnesota Wild", "Minnesota"),
        "MTL": ("team_nhl_mtl", "Montreal Canadiens", "Montreal"),
        "MON": ("team_nhl_mtl", "Montreal Canadiens", "Montreal"),
        "NSH": ("team_nhl_nsh", "Nashville Predators", "Nashville"),
        "NAS": ("team_nhl_nsh", "Nashville Predators", "Nashville"),
        "NJ": ("team_nhl_njd", "New Jersey Devils", "New Jersey"),
        "NJD": ("team_nhl_njd", "New Jersey Devils", "New Jersey"),
        "NYI": ("team_nhl_nyi", "New York Islanders", "New York"),
        "NYR": ("team_nhl_nyr", "New York Rangers", "New York"),
        "OTT": ("team_nhl_ott", "Ottawa Senators", "Ottawa"),
        "PHI": ("team_nhl_phi", "Philadelphia Flyers", "Philadelphia"),
        "PIT": ("team_nhl_pit", "Pittsburgh Penguins", "Pittsburgh"),
        "SJ": ("team_nhl_sj", "San Jose Sharks", "San Jose"),
        "SJS": ("team_nhl_sj", "San Jose Sharks", "San Jose"),
        "SEA": ("team_nhl_sea", "Seattle Kraken", "Seattle"),
        "STL": ("team_nhl_stl", "St. Louis Blues", "St. Louis"),
        "TB": ("team_nhl_tb", "Tampa Bay Lightning", "Tampa Bay"),
        "TBL": ("team_nhl_tb", "Tampa Bay Lightning", "Tampa Bay"),
        "TOR": ("team_nhl_tor", "Toronto Maple Leafs", "Toronto"),
        "VAN": ("team_nhl_van", "Vancouver Canucks", "Vancouver"),
        "VGK": ("team_nhl_vgk", "Vegas Golden Knights", "Vegas"),
        "VEG": ("team_nhl_vgk", "Vegas Golden Knights", "Vegas"),
        "WAS": ("team_nhl_was", "Washington Capitals", "Washington"),
        "WSH": ("team_nhl_was", "Washington Capitals", "Washington"),
        "WPG": ("team_nhl_wpg", "Winnipeg Jets", "Winnipeg"),
    },
    "mls": {
        "ATL": ("team_mls_atl", "Atlanta United", "Atlanta"),
        "AUS": ("team_mls_aus", "Austin FC", "Austin"),
        "CLT": ("team_mls_clt", "Charlotte FC", "Charlotte"),
        "CHI": ("team_mls_chi", "Chicago Fire", "Chicago"),
        "CIN": ("team_mls_cin", "FC Cincinnati", "Cincinnati"),
        "COL": ("team_mls_col", "Colorado Rapids", "Colorado"),
        "CLB": ("team_mls_clb", "Columbus Crew", "Columbus"),
        "DAL": ("team_mls_dal", "FC Dallas", "Dallas"),
        "DC": ("team_mls_dc", "D.C. United", "Washington"),
        "HOU": ("team_mls_hou", "Houston Dynamo", "Houston"),
        "LAG": ("team_mls_lag", "LA Galaxy", "Los Angeles"),
        "LAFC": ("team_mls_lafc", "Los Angeles FC", "Los Angeles"),
        "MIA": ("team_mls_mia", "Inter Miami", "Miami"),
        "MIN": ("team_mls_min", "Minnesota United", "Minnesota"),
        "MTL": ("team_mls_mtl", "CF Montreal", "Montreal"),
        "NSH": ("team_mls_nsh", "Nashville SC", "Nashville"),
        "NE": ("team_mls_ne", "New England Revolution", "New England"),
        "NYC": ("team_mls_nyc", "New York City FC", "New York"),
        "RB": ("team_mls_ny", "New York Red Bulls", "New York"),
        "RBNY": ("team_mls_ny", "New York Red Bulls", "New York"),
        "ORL": ("team_mls_orl", "Orlando City", "Orlando"),
        "PHI": ("team_mls_phi", "Philadelphia Union", "Philadelphia"),
        "POR": ("team_mls_por", "Portland Timbers", "Portland"),
        "SLC": ("team_mls_slc", "Real Salt Lake", "Salt Lake"),
        "RSL": ("team_mls_slc", "Real Salt Lake", "Salt Lake"),
        "SJ": ("team_mls_sj", "San Jose Earthquakes", "San Jose"),
        "SD": ("team_mls_sd", "San Diego FC", "San Diego"),
        "SEA": ("team_mls_sea", "Seattle Sounders", "Seattle"),
        "SKC": ("team_mls_skc", "Sporting Kansas City", "Kansas City"),
        "STL": ("team_mls_stl", "St. Louis City SC", "St. Louis"),
        "TOR": ("team_mls_tor", "Toronto FC", "Toronto"),
        "VAN": ("team_mls_van", "Vancouver Whitecaps", "Vancouver"),
    },
    "wnba": {
        "ATL": ("team_wnba_atl", "Atlanta Dream", "Atlanta"),
        "CHI": ("team_wnba_chi", "Chicago Sky", "Chicago"),
        "CON": ("team_wnba_con", "Connecticut Sun", "Connecticut"),
        "DAL": ("team_wnba_dal", "Dallas Wings", "Dallas"),
        "GSV": ("team_wnba_gsv", "Golden State Valkyries", "Golden State"),
        "IND": ("team_wnba_ind", "Indiana Fever", "Indiana"),
        "LV": ("team_wnba_lv", "Las Vegas Aces", "Las Vegas"),
        "LA": ("team_wnba_la", "Los Angeles Sparks", "Los Angeles"),
        "MIN": ("team_wnba_min", "Minnesota Lynx", "Minnesota"),
        "NY": ("team_wnba_ny", "New York Liberty", "New York"),
        "PHX": ("team_wnba_phx", "Phoenix Mercury", "Phoenix"),
        "SEA": ("team_wnba_sea", "Seattle Storm", "Seattle"),
        "WAS": ("team_wnba_was", "Washington Mystics", "Washington"),
    },
    "nwsl": {
        "ANF": ("team_nwsl_anf", "Angel City FC", "Los Angeles"),
        "CHI": ("team_nwsl_chi", "Chicago Red Stars", "Chicago"),
        "HOU": ("team_nwsl_hou", "Houston Dash", "Houston"),
        "KC": ("team_nwsl_kc", "Kansas City Current", "Kansas City"),
        "NJ": ("team_nwsl_nj", "NJ/NY Gotham FC", "New Jersey"),
        "NC": ("team_nwsl_nc", "North Carolina Courage", "North Carolina"),
        "ORL": ("team_nwsl_orl", "Orlando Pride", "Orlando"),
        "POR": ("team_nwsl_por", "Portland Thorns", "Portland"),
        "RGN": ("team_nwsl_rgn", "Racing Louisville", "Louisville"),
        "SD": ("team_nwsl_sd", "San Diego Wave", "San Diego"),
        "SEA": ("team_nwsl_sea", "Seattle Reign", "Seattle"),
        "SLC": ("team_nwsl_slc", "Utah Royals", "Utah"),
        "WAS": ("team_nwsl_was", "Washington Spirit", "Washington"),
        "BFC": ("team_nwsl_bfc", "Bay FC", "San Francisco"),
    },
}


class TeamResolver:
    """Resolves team names to canonical IDs.

    Resolution order:
    1. Exact match against abbreviation mappings
    2. Exact match against full team names or cities
    3. Alias lookup (with date awareness)
    4. Fuzzy match against all known names
    5. Unresolved (returns ManualReviewItem)
    """

    def __init__(
        self,
        sport: str,
        alias_loader: Optional[TeamAliasLoader] = None,
        fuzzy_threshold: int = FUZZY_MATCH_THRESHOLD,
    ):
        """Initialize the resolver.

        Args:
            sport: Sport code (e.g., 'nba', 'mlb')
            alias_loader: Team alias loader (default: global loader)
            fuzzy_threshold: Minimum fuzzy match score
        """
        self.sport = sport.lower()
        self.alias_loader = alias_loader or get_team_alias_loader()
        self.fuzzy_threshold = fuzzy_threshold
        self._mappings = TEAM_MAPPINGS.get(self.sport, {})

        # Build match candidates for fuzzy matching
        self._candidates = self._build_candidates()

    def _build_candidates(self) -> list[MatchCandidate]:
        """Build match candidates from team mappings."""
        # Group by canonical ID to avoid duplicates
        by_id: dict[str, tuple[str, list[str]]] = {}

        for abbrev, (canonical_id, full_name, city) in self._mappings.items():
            if canonical_id not in by_id:
                by_id[canonical_id] = (full_name, [])

            # Add abbreviation and city as aliases
            by_id[canonical_id][1].append(abbrev)
            by_id[canonical_id][1].append(city)

        return [
            MatchCandidate(
                canonical_id=cid,
                name=name,
                aliases=list(set(aliases)),  # Dedupe
            )
            for cid, (name, aliases) in by_id.items()
        ]

    def resolve(
        self,
        value: str,
        check_date: Optional[date] = None,
        source_url: Optional[str] = None,
    ) -> TeamResolveResult:
        """Resolve a team name to a canonical ID.

        Args:
            value: Team name, abbreviation, or city to resolve
            check_date: Date for alias validity (None = today)
            source_url: Source URL for manual review items

        Returns:
            TeamResolveResult with resolution details
        """
        value_upper = value.upper().strip()
        value_lower = value.lower().strip()

        # 1. Exact match against abbreviation
        if value_upper in self._mappings:
            canonical_id, full_name, _ = self._mappings[value_upper]
            return TeamResolveResult(
                canonical_id=canonical_id,
                confidence=100,
                match_type="exact",
            )

        # 2. Exact match against full names or cities. For a city shared by
        # several teams (e.g. "New York"), the first mapping entry wins.
        for abbrev, (canonical_id, full_name, city) in self._mappings.items():
            if value_lower == full_name.lower() or value_lower == city.lower():
                return TeamResolveResult(
                    canonical_id=canonical_id,
                    confidence=100,
                    match_type="exact",
                )

        # 3. Alias lookup
        alias_result = self.alias_loader.resolve(value, check_date)
        if alias_result:
            return TeamResolveResult(
                canonical_id=alias_result,
                confidence=95,
                match_type="alias",
            )

        # 4. Fuzzy match
        matches = fuzzy_match_team(
            value,
            self._candidates,
            threshold=self.fuzzy_threshold,
        )

        if matches:
            best = matches[0]
            review_item = None

            # Create review item for low confidence matches
            if best.confidence < 90:
                review_item = ManualReviewItem(
                    id=f"team_{uuid4().hex[:8]}",
                    reason=ReviewReason.LOW_CONFIDENCE_MATCH,
                    sport=self.sport,
                    raw_value=value,
                    context={"match_type": "fuzzy"},
                    source_url=source_url,
                    suggested_matches=matches,
                    game_date=check_date,
                )

            return TeamResolveResult(
                canonical_id=best.canonical_id,
                confidence=best.confidence,
                match_type="fuzzy",
                review_item=review_item,
            )

        # 5. Unresolved
        review_item = ManualReviewItem(
            id=f"team_{uuid4().hex[:8]}",
            reason=ReviewReason.UNRESOLVED_TEAM,
            sport=self.sport,
            raw_value=value,
            context={},
            source_url=source_url,
            suggested_matches=fuzzy_match_team(
                value,
                self._candidates,
                threshold=50,  # Lower threshold for suggestions
                top_n=5,
            ),
            game_date=check_date,
        )

        return TeamResolveResult(
            canonical_id=None,
            confidence=0,
            match_type="unresolved",
            review_item=review_item,
        )

    def get_team_info(self, abbreviation: str) -> Optional[tuple[str, str, str]]:
        """Get team info by abbreviation.

        Args:
            abbreviation: Team abbreviation

        Returns:
            Tuple of (canonical_id, full_name, city) or None
        """
        return self._mappings.get(abbreviation.upper())

    def get_all_teams(self) -> list[tuple[str, str, str]]:
        """Get all teams for this sport.

        Returns:
            List of (canonical_id, full_name, city) tuples
        """
        seen = set()
        result = []

        for abbrev, (canonical_id, full_name, city) in self._mappings.items():
            if canonical_id not in seen:
                seen.add(canonical_id)
                result.append((canonical_id, full_name, city))

        return result


# Cached resolvers
_resolvers: dict[str, TeamResolver] = {}


def get_team_resolver(sport: str) -> TeamResolver:
    """Get or create a team resolver for a sport."""
    sport_lower = sport.lower()
    if sport_lower not in _resolvers:
        _resolvers[sport_lower] = TeamResolver(sport_lower)
    return _resolvers[sport_lower]


def resolve_team(
    sport: str,
    value: str,
    check_date: Optional[date] = None,
) -> TeamResolveResult:
    """Convenience function to resolve a team name.

    Args:
        sport: Sport code
        value: Team name to resolve
        check_date: Date for alias validity

    Returns:
        TeamResolveResult
    """
    return get_team_resolver(sport).resolve(value, check_date)
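
Because several abbreviations map to one canonical ID, and several teams can share a city, it may help to see both groupings on a small sample. The sketch below is illustrative only: `MAPPINGS` is a hypothetical excerpt in the same `(canonical_id, full_name, city)` shape, and `group_aliases` and `shared_cities` are helper names invented here, not part of the package.

```python
from collections import defaultdict

# Hypothetical excerpt in the same (canonical_id, full_name, city) shape.
MAPPINGS = {
    "NYK": ("team_nba_nyk", "New York Knicks", "New York"),
    "NY": ("team_nba_nyk", "New York Knicks", "New York"),
    "BKN": ("team_nba_brk", "Brooklyn Nets", "Brooklyn"),
    "BRK": ("team_nba_brk", "Brooklyn Nets", "Brooklyn"),
    "LAC": ("team_nba_lac", "Los Angeles Clippers", "Los Angeles"),
    "LAL": ("team_nba_lal", "Los Angeles Lakers", "Los Angeles"),
}


def group_aliases(mappings):
    """Collect every abbreviation and city under its canonical team ID."""
    grouped = defaultdict(set)
    for abbrev, (canonical_id, _full_name, city) in mappings.items():
        grouped[canonical_id].update({abbrev, city})
    return {cid: sorted(aliases) for cid, aliases in grouped.items()}


def shared_cities(mappings):
    """Return cities that map to more than one canonical team ID."""
    teams_by_city = defaultdict(set)
    for canonical_id, _full_name, city in mappings.values():
        teams_by_city[city].add(canonical_id)
    return {
        city: sorted(ids)
        for city, ids in teams_by_city.items()
        if len(ids) > 1
    }
```

Any city reported by `shared_cities` is ambiguous as a lookup key on its own, so callers that only have a city name may want to treat such a match as lower confidence than a full-name match.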
@@ -0,0 +1,344 @@
"""Timezone conversion utilities for normalizing game times to UTC."""

import re
from dataclasses import dataclass
from datetime import datetime, date, time
from typing import Optional
from zoneinfo import ZoneInfo

from dateutil import parser as dateutil_parser
from dateutil.tz import gettz, tzutc

from ..models.aliases import ReviewReason, ManualReviewItem


# Common timezone abbreviations to IANA timezones
TIMEZONE_ABBREV_MAP: dict[str, str] = {
    # US timezones
    "ET": "America/New_York",
    "EST": "America/New_York",
    "EDT": "America/New_York",
    "CT": "America/Chicago",
    "CST": "America/Chicago",
    "CDT": "America/Chicago",
    "MT": "America/Denver",
    "MST": "America/Denver",
    "MDT": "America/Denver",
    "PT": "America/Los_Angeles",
    "PST": "America/Los_Angeles",
    "PDT": "America/Los_Angeles",
    "AT": "America/Anchorage",
    "AKST": "America/Anchorage",
    "AKDT": "America/Anchorage",
    "HT": "Pacific/Honolulu",
    "HST": "Pacific/Honolulu",
    # Canada
    "AST": "America/Halifax",
    "ADT": "America/Halifax",
    "NST": "America/St_Johns",
    "NDT": "America/St_Johns",
    # Mexico
    "CDST": "America/Mexico_City",
    # UTC
    "UTC": "UTC",
    "GMT": "UTC",
    "Z": "UTC",
}

# State/region to timezone mapping for inferring timezone from location
STATE_TIMEZONE_MAP: dict[str, str] = {
    # Eastern
    "CT": "America/New_York",
    "DE": "America/New_York",
    "FL": "America/New_York",  # Most of Florida
    "GA": "America/New_York",
    "MA": "America/New_York",
    "MD": "America/New_York",
    "ME": "America/New_York",
    "MI": "America/Detroit",
    "NC": "America/New_York",
    "NH": "America/New_York",
    "NJ": "America/New_York",
    "NY": "America/New_York",
    "OH": "America/New_York",
    "PA": "America/New_York",
    "RI": "America/New_York",
    "SC": "America/New_York",
    "VA": "America/New_York",
    "VT": "America/New_York",
    "WV": "America/New_York",
    "DC": "America/New_York",
    # Central
    "AL": "America/Chicago",
    "AR": "America/Chicago",
    "IA": "America/Chicago",
    "IL": "America/Chicago",
    "IN": "America/Indiana/Indianapolis",  # Eastern time (most of Indiana)
    "KS": "America/Chicago",
    "KY": "America/Kentucky/Louisville",  # Eastern time (Louisville area)
    "LA": "America/Chicago",
    "MN": "America/Chicago",
    "MO": "America/Chicago",
    "MS": "America/Chicago",
    "ND": "America/Chicago",
    "NE": "America/Chicago",
    "OK": "America/Chicago",
    "SD": "America/Chicago",
    "TN": "America/Chicago",  # Nashville and Memphis; East Tennessee is Eastern
    "TX": "America/Chicago",
    "WI": "America/Chicago",
    # Mountain
    "AZ": "America/Phoenix",  # No DST
    "CO": "America/Denver",
    "ID": "America/Boise",
    "MT": "America/Denver",
    "NM": "America/Denver",
    "UT": "America/Denver",
    "WY": "America/Denver",
    # Pacific
    "CA": "America/Los_Angeles",
    "NV": "America/Los_Angeles",
    "OR": "America/Los_Angeles",
    "WA": "America/Los_Angeles",
    # Alaska/Hawaii
    "AK": "America/Anchorage",
    "HI": "Pacific/Honolulu",
    # Canada provinces
    "ON": "America/Toronto",
    "QC": "America/Montreal",  # Legacy tzdata alias of America/Toronto
    "BC": "America/Vancouver",
    "AB": "America/Edmonton",
    "MB": "America/Winnipeg",
    "SK": "America/Regina",
    "NS": "America/Halifax",
    "NB": "America/Moncton",
    "NL": "America/St_Johns",
    "PE": "America/Halifax",
}


@dataclass
|
||||
class TimezoneResult:
|
||||
"""Result of timezone conversion.
|
||||
|
||||
Attributes:
|
||||
datetime_utc: The datetime converted to UTC
|
||||
source_timezone: The timezone that was detected/used
|
||||
confidence: Confidence in the timezone detection ('high', 'medium', 'low')
|
||||
warning: Warning message if timezone was uncertain
|
||||
"""
|
||||
|
||||
datetime_utc: datetime
|
||||
source_timezone: str
|
||||
confidence: str
|
||||
warning: Optional[str] = None
|
||||
|
||||
|
||||
def detect_timezone_from_string(time_str: str) -> Optional[str]:
|
||||
"""Detect timezone from a time string containing a timezone abbreviation.
|
||||
|
||||
Args:
|
||||
time_str: Time string that may contain timezone info (e.g., '7:00 PM ET')
|
||||
|
||||
Returns:
|
||||
IANA timezone string if detected, None otherwise
|
||||
"""
|
||||
# Look for a whole-word timezone abbreviation anywhere in the string (case-insensitive)
|
||||
for abbrev, tz in TIMEZONE_ABBREV_MAP.items():
|
||||
pattern = rf"\b{re.escape(abbrev)}\b"
|
||||
if re.search(pattern, time_str, re.IGNORECASE):
|
||||
return tz
|
||||
|
||||
return None
|
||||
|
||||
|
||||
def detect_timezone_from_location(
|
||||
state: Optional[str] = None,
|
||||
city: Optional[str] = None,
|
||||
) -> Optional[str]:
|
||||
"""Detect timezone from location information.
|
||||
|
||||
Args:
|
||||
state: State/province code (e.g., 'NY', 'ON')
|
||||
city: City name (currently unused; reserved for special cases)
|
||||
|
||||
Returns:
|
||||
IANA timezone string if detected, None otherwise
|
||||
"""
|
||||
if state and state.upper() in STATE_TIMEZONE_MAP:
|
||||
return STATE_TIMEZONE_MAP[state.upper()]
|
||||
|
||||
return None
|
||||
|
||||
|
||||
def parse_datetime(
|
||||
date_str: str,
|
||||
time_str: Optional[str] = None,
|
||||
timezone_hint: Optional[str] = None,
|
||||
location_state: Optional[str] = None,
|
||||
) -> TimezoneResult:
|
||||
"""Parse a date/time string and convert to UTC.
|
||||
|
||||
Attempts to detect timezone from:
|
||||
1. Explicit timezone in the string
|
||||
2. Provided timezone hint
|
||||
3. Location-based inference
|
||||
4. Default to Eastern Time with warning
|
||||
|
||||
Args:
|
||||
date_str: Date string (e.g., '2025-10-21', 'October 21, 2025')
|
||||
time_str: Optional time string (e.g., '7:00 PM ET', '19:00')
|
||||
timezone_hint: Optional IANA timezone to use if not detected
|
||||
location_state: Optional state code for timezone inference
|
||||
|
||||
Returns:
|
||||
TimezoneResult with UTC datetime and metadata
|
||||
"""
|
||||
# Parse the date
|
||||
try:
|
||||
if time_str:
|
||||
# Combine date and time
|
||||
full_str = f"{date_str} {time_str}"
|
||||
else:
|
||||
full_str = date_str
|
||||
|
||||
parsed = dateutil_parser.parse(full_str, fuzzy=True)
|
||||
except (ValueError, OverflowError) as e:
|
||||
# If parsing fails, return the current UTC time as a placeholder with low confidence
|
||||
return TimezoneResult(
|
||||
datetime_utc=datetime.now(tz=ZoneInfo("UTC")),
|
||||
source_timezone="unknown",
|
||||
confidence="low",
|
||||
warning=f"Failed to parse datetime: {e}",
|
||||
)
|
||||
|
||||
# Determine timezone
|
||||
detected_tz = None
|
||||
confidence = "high"
|
||||
warning = None
|
||||
|
||||
# Check if datetime already has timezone
|
||||
if parsed.tzinfo is not None:
|
||||
detected_tz = str(parsed.tzinfo)
|
||||
else:
|
||||
# Try to detect from time string
|
||||
if time_str:
|
||||
detected_tz = detect_timezone_from_string(time_str)
|
||||
|
||||
# Try timezone hint
|
||||
if not detected_tz and timezone_hint:
|
||||
detected_tz = timezone_hint
|
||||
confidence = "medium"
|
||||
|
||||
# Try location inference
|
||||
if not detected_tz and location_state:
|
||||
detected_tz = detect_timezone_from_location(state=location_state)
|
||||
confidence = "medium"
|
||||
|
||||
# Default to Eastern Time
|
||||
if not detected_tz:
|
||||
detected_tz = "America/New_York"
|
||||
confidence = "low"
|
||||
warning = "Timezone not detected, defaulting to Eastern Time"
|
||||
|
||||
# Apply timezone and convert to UTC
|
||||
try:
|
||||
tz = ZoneInfo(detected_tz)
|
||||
except KeyError:
|
||||
# Invalid timezone, try to resolve abbreviation
|
||||
if detected_tz in TIMEZONE_ABBREV_MAP:
|
||||
tz = ZoneInfo(TIMEZONE_ABBREV_MAP[detected_tz])
|
||||
detected_tz = TIMEZONE_ABBREV_MAP[detected_tz]
|
||||
else:
|
||||
tz = ZoneInfo("America/New_York")
|
||||
confidence = "low"
|
||||
warning = f"Unknown timezone '{detected_tz}', defaulting to Eastern Time"
|
||||
detected_tz = "America/New_York"
|
||||
|
||||
# Apply timezone if not already set
|
||||
if parsed.tzinfo is None:
|
||||
parsed = parsed.replace(tzinfo=tz)
|
||||
|
||||
# Convert to UTC
|
||||
utc_dt = parsed.astimezone(ZoneInfo("UTC"))
|
||||
|
||||
return TimezoneResult(
|
||||
datetime_utc=utc_dt,
|
||||
source_timezone=detected_tz,
|
||||
confidence=confidence,
|
||||
warning=warning,
|
||||
)
|
||||
|
||||
|
||||
def convert_to_utc(
|
||||
dt: datetime,
|
||||
source_timezone: str,
|
||||
) -> datetime:
|
||||
"""Convert a datetime from a known timezone to UTC.
|
||||
|
||||
Args:
|
||||
dt: Datetime to convert (timezone-naive or timezone-aware)
|
||||
source_timezone: IANA timezone of the datetime
|
||||
|
||||
Returns:
|
||||
Datetime in UTC
|
||||
"""
|
||||
tz = ZoneInfo(source_timezone)
|
||||
|
||||
if dt.tzinfo is None:
|
||||
# Localize naive datetime
|
||||
dt = dt.replace(tzinfo=tz)
|
||||
|
||||
return dt.astimezone(ZoneInfo("UTC"))
|
||||
|
||||
|
||||
def create_timezone_warning(
|
||||
raw_value: str,
|
||||
sport: str,
|
||||
game_date: Optional[date] = None,
|
||||
source_url: Optional[str] = None,
|
||||
) -> ManualReviewItem:
|
||||
"""Create a manual review item for an undetermined timezone.
|
||||
|
||||
Args:
|
||||
raw_value: The original time string that couldn't be resolved
|
||||
sport: Sport code
|
||||
game_date: Date of the game
|
||||
source_url: URL of the source page
|
||||
|
||||
Returns:
|
||||
ManualReviewItem for timezone review
|
||||
"""
|
||||
return ManualReviewItem(
|
||||
id=f"tz_{sport}_{raw_value[:20].replace(' ', '_')}",
|
||||
reason=ReviewReason.TIMEZONE_UNKNOWN,
|
||||
sport=sport,
|
||||
raw_value=raw_value,
|
||||
context={"issue": "Could not determine timezone for game time"},
|
||||
source_url=source_url,
|
||||
game_date=game_date,
|
||||
)
|
||||
|
||||
|
||||
def get_stadium_timezone(
|
||||
stadium_state: str,
|
||||
stadium_timezone: Optional[str] = None,
|
||||
) -> str:
|
||||
"""Get the timezone for a stadium based on its location.
|
||||
|
||||
Args:
|
||||
stadium_state: State/province code
|
||||
stadium_timezone: Explicit timezone override from stadium data
|
||||
|
||||
Returns:
|
||||
IANA timezone string
|
||||
"""
|
||||
if stadium_timezone:
|
||||
return stadium_timezone
|
||||
|
||||
tz = detect_timezone_from_location(state=stadium_state)
|
||||
if tz:
|
||||
return tz
|
||||
|
||||
# Default to Eastern
|
||||
return "America/New_York"
|
||||
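The resolution order used by `parse_datetime` above (explicit abbreviation, then hint, then state, then an Eastern default) can be sketched in isolation. This is a minimal illustration under stated assumptions, not the package's implementation: `resolve_timezone`, `ABBREV_MAP`, and `STATE_MAP` are hypothetical names introduced here, and the two maps are trimmed-down stand-ins for the module's tables.

```python
import re
from typing import Optional

# Hypothetical, trimmed-down stand-ins for the module's lookup tables.
ABBREV_MAP = {
    "ET": "America/New_York",
    "CT": "America/Chicago",
    "PT": "America/Los_Angeles",
}
STATE_MAP = {
    "NY": "America/New_York",
    "IL": "America/Chicago",
    "CA": "America/Los_Angeles",
}


def resolve_timezone(
    time_str: Optional[str] = None,
    timezone_hint: Optional[str] = None,
    state: Optional[str] = None,
) -> tuple[str, str]:
    """Return (iana_name, confidence) using the priority order described above."""
    # 1. Explicit abbreviation in the time string (whole word, case-insensitive).
    if time_str:
        for abbrev, tz in ABBREV_MAP.items():
            if re.search(rf"\b{re.escape(abbrev)}\b", time_str, re.IGNORECASE):
                return tz, "high"
    # 2. Caller-supplied hint.
    if timezone_hint:
        return timezone_hint, "medium"
    # 3. Inference from the state/province code.
    if state and state.upper() in STATE_MAP:
        return STATE_MAP[state.upper()], "medium"
    # 4. Fall back to Eastern Time with low confidence.
    return "America/New_York", "low"
```

Returning the confidence alongside the zone name lets a caller decide whether to queue a manual review item, which is the role `TimezoneResult.confidence` plays in the module itself.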
@@ -0,0 +1,46 @@
|
||||
"""Scrapers for fetching sports data from various sources."""
|
||||
|
||||
from .base import (
|
||||
BaseScraper,
|
||||
RawGameData,
|
||||
ScrapeResult,
|
||||
ScraperError,
|
||||
PartialDataError,
|
||||
)
|
||||
from .nba import NBAScraper, create_nba_scraper
|
||||
from .mlb import MLBScraper, create_mlb_scraper
|
||||
from .nfl import NFLScraper, create_nfl_scraper
|
||||
from .nhl import NHLScraper, create_nhl_scraper
|
||||
from .mls import MLSScraper, create_mls_scraper
|
||||
from .wnba import WNBAScraper, create_wnba_scraper
|
||||
from .nwsl import NWSLScraper, create_nwsl_scraper
|
||||
|
||||
__all__ = [
|
||||
# Base
|
||||
"BaseScraper",
|
||||
"RawGameData",
|
||||
"ScrapeResult",
|
||||
"ScraperError",
|
||||
"PartialDataError",
|
||||
# NBA
|
||||
"NBAScraper",
|
||||
"create_nba_scraper",
|
||||
# MLB
|
||||
"MLBScraper",
|
||||
"create_mlb_scraper",
|
||||
# NFL
|
||||
"NFLScraper",
|
||||
"create_nfl_scraper",
|
||||
# NHL
|
||||
"NHLScraper",
|
||||
"create_nhl_scraper",
|
||||
# MLS
|
||||
"MLSScraper",
|
||||
"create_mls_scraper",
|
||||
# WNBA
|
||||
"WNBAScraper",
|
||||
"create_wnba_scraper",
|
||||
# NWSL
|
||||
"NWSLScraper",
|
||||
"create_nwsl_scraper",
|
||||
]
|
||||
@@ -0,0 +1,322 @@
|
||||
"""Base scraper class for all sport scrapers."""
|
||||
|
||||
from abc import ABC, abstractmethod
|
||||
from dataclasses import dataclass, field
|
||||
from datetime import date, datetime
|
||||
from typing import Optional
|
||||
|
||||
from ..config import EXPECTED_GAME_COUNTS
|
||||
from ..models.game import Game
|
||||
from ..models.team import Team
|
||||
from ..models.stadium import Stadium
|
||||
from ..models.aliases import ManualReviewItem
|
||||
from ..utils.http import RateLimitedSession, get_session
|
||||
from ..utils.logging import get_logger, log_error, log_warning
|
||||
from ..utils.progress import ScrapeProgress
|
||||
|
||||
|
||||
@dataclass
|
||||
class RawGameData:
|
||||
"""Raw game data before normalization.
|
||||
|
||||
This intermediate format holds data as scraped from sources,
|
||||
before team/stadium resolution and canonical ID generation.
|
||||
"""
|
||||
|
||||
game_date: datetime
|
||||
home_team_raw: str
|
||||
away_team_raw: str
|
||||
stadium_raw: Optional[str] = None
|
||||
home_score: Optional[int] = None
|
||||
away_score: Optional[int] = None
|
||||
status: str = "scheduled"
|
||||
source_url: Optional[str] = None
|
||||
game_number: Optional[int] = None # For doubleheaders
|
||||
|
||||
|
||||
@dataclass
|
||||
class ScrapeResult:
|
||||
"""Result of a scraping operation.
|
||||
|
||||
Attributes:
|
||||
games: List of normalized Game objects
|
||||
teams: List of Team objects
|
||||
stadiums: List of Stadium objects
|
||||
review_items: Items requiring manual review
|
||||
source: Name of the source used
|
||||
success: Whether scraping succeeded
|
||||
error_message: Error message if failed
|
||||
"""
|
||||
|
||||
games: list[Game] = field(default_factory=list)
|
||||
teams: list[Team] = field(default_factory=list)
|
||||
stadiums: list[Stadium] = field(default_factory=list)
|
||||
review_items: list[ManualReviewItem] = field(default_factory=list)
|
||||
source: str = ""
|
||||
success: bool = True
|
||||
error_message: Optional[str] = None
|
||||
|
||||
@property
|
||||
def game_count(self) -> int:
|
||||
return len(self.games)
|
||||
|
||||
@property
|
||||
def team_count(self) -> int:
|
||||
return len(self.teams)
|
||||
|
||||
@property
|
||||
def stadium_count(self) -> int:
|
||||
return len(self.stadiums)
|
||||
|
||||
@property
|
||||
def review_count(self) -> int:
|
||||
return len(self.review_items)
|
||||
|
||||
|
||||
class BaseScraper(ABC):
|
||||
"""Abstract base class for sport scrapers.
|
||||
|
||||
Subclasses must implement:
|
||||
- _scrape_games_from_source(): Fetch raw games from a single source
- _normalize_games(): Convert raw games to Game objects and review items
|
||||
- scrape_teams(): Fetch team information
|
||||
- scrape_stadiums(): Fetch stadium information
|
||||
- _get_sources(): Return list of source names in priority order
|
||||
|
||||
Features:
|
||||
- Multi-source fallback (try sources in order)
|
||||
- Built-in rate limiting
|
||||
- Error handling that discards partial data from a failed source
|
||||
- Progress tracking
|
||||
- Source URL tracking for manual review
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
sport: str,
|
||||
season: int,
|
||||
session: Optional[RateLimitedSession] = None,
|
||||
):
|
||||
"""Initialize the scraper.
|
||||
|
||||
Args:
|
||||
sport: Sport code (e.g., 'nba', 'mlb')
|
||||
season: Season start year (e.g., 2025 for 2025-26)
|
||||
session: Optional HTTP session (default: global session)
|
||||
"""
|
||||
self.sport = sport.lower()
|
||||
self.season = season
|
||||
self.session = session or get_session()
|
||||
self._logger = get_logger()
|
||||
self._progress: Optional[ScrapeProgress] = None
|
||||
|
||||
@property
|
||||
def expected_game_count(self) -> int:
|
||||
"""Get expected number of games for this sport."""
|
||||
return EXPECTED_GAME_COUNTS.get(self.sport, 0)
|
||||
|
||||
@abstractmethod
|
||||
def _get_sources(self) -> list[str]:
|
||||
"""Return list of source names in priority order.
|
||||
|
||||
Returns:
|
||||
List of source identifiers (e.g., ['basketball_reference', 'espn', 'cbs'])
|
||||
"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def _scrape_games_from_source(
|
||||
self,
|
||||
source: str,
|
||||
) -> list[RawGameData]:
|
||||
"""Scrape games from a specific source.
|
||||
|
||||
Args:
|
||||
source: Source identifier
|
||||
|
||||
Returns:
|
||||
List of raw game data
|
||||
|
||||
Raises:
|
||||
Exception: If scraping fails
|
||||
"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def _normalize_games(
|
||||
self,
|
||||
raw_games: list[RawGameData],
|
||||
) -> tuple[list[Game], list[ManualReviewItem]]:
|
||||
"""Normalize raw game data to Game objects.
|
||||
|
||||
Args:
|
||||
raw_games: Raw scraped data
|
||||
|
||||
Returns:
|
||||
Tuple of (normalized games, review items)
|
||||
"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def scrape_teams(self) -> list[Team]:
|
||||
"""Fetch team information.
|
||||
|
||||
Returns:
|
||||
List of Team objects
|
||||
"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def scrape_stadiums(self) -> list[Stadium]:
|
||||
"""Fetch stadium information.
|
||||
|
||||
Returns:
|
||||
List of Stadium objects
|
||||
"""
|
||||
pass
|
||||
|
||||
def scrape_games(self) -> ScrapeResult:
|
||||
"""Scrape games with multi-source fallback.
|
||||
|
||||
Tries each source in priority order. On failure, discards
|
||||
partial data and tries the next source.
|
||||
|
||||
Returns:
|
||||
ScrapeResult with games, review items, and status
|
||||
"""
|
||||
sources = self._get_sources()
|
||||
last_error: Optional[str] = None
|
||||
|
||||
for source in sources:
|
||||
self._logger.info(f"Trying source: {source}")
|
||||
|
||||
try:
|
||||
# Scrape raw data
|
||||
raw_games = self._scrape_games_from_source(source)
|
||||
|
||||
if not raw_games:
|
||||
log_warning(f"No games found from {source}")
|
||||
continue
|
||||
|
||||
self._logger.info(f"Found {len(raw_games)} raw games from {source}")
|
||||
|
||||
# Normalize data
|
||||
games, review_items = self._normalize_games(raw_games)
|
||||
|
||||
self._logger.info(
|
||||
f"Normalized {len(games)} games, {len(review_items)} need review"
|
||||
)
|
||||
|
||||
return ScrapeResult(
|
||||
games=games,
|
||||
review_items=review_items,
|
||||
source=source,
|
||||
success=True,
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
last_error = str(e)
|
||||
log_error(f"Failed to scrape from {source}: {e}", exc_info=True)
|
||||
# Discard partial data and try next source
|
||||
continue
|
||||
|
||||
# All sources failed
|
||||
return ScrapeResult(
|
||||
success=False,
|
||||
error_message=f"All sources failed. Last error: {last_error}",
|
||||
)
|
||||
|
||||
def scrape_all(self) -> ScrapeResult:
|
||||
"""Scrape games, teams, and stadiums.
|
||||
|
||||
Returns:
|
||||
Complete ScrapeResult with all data
|
||||
"""
|
||||
self._progress = ScrapeProgress(self.sport, self.season)
|
||||
self._progress.start()
|
||||
|
||||
try:
|
||||
# Scrape games
|
||||
result = self.scrape_games()
|
||||
|
||||
if not result.success:
|
||||
self._progress.log_error(result.error_message or "Unknown error")
|
||||
self._progress.finish()
|
||||
return result
|
||||
|
||||
# Scrape teams
|
||||
teams = self.scrape_teams()
|
||||
result.teams = teams
|
||||
|
||||
# Scrape stadiums
|
||||
stadiums = self.scrape_stadiums()
|
||||
result.stadiums = stadiums
|
||||
|
||||
# Update progress
|
||||
self._progress.games_count = result.game_count
|
||||
self._progress.teams_count = result.team_count
|
||||
self._progress.stadiums_count = result.stadium_count
|
||||
self._progress.errors_count = result.review_count
|
||||
|
||||
self._progress.finish()
|
||||
|
||||
return result
|
||||
|
||||
except Exception as e:
|
||||
log_error(f"Scraping failed: {e}", exc_info=True)
|
||||
self._progress.finish()
|
||||
|
||||
return ScrapeResult(
|
||||
success=False,
|
||||
error_message=str(e),
|
||||
)
|
||||
|
||||
def _get_season_months(self) -> list[tuple[int, int]]:
|
||||
"""Get the months to scrape for this sport's season.
|
||||
|
||||
Returns:
|
||||
List of (year, month) tuples
|
||||
"""
|
||||
# Default implementation for sports with fall-spring seasons
|
||||
# (NBA, NHL, etc.)
|
||||
months = []
|
||||
|
||||
# Fall months of season start year
|
||||
for month in range(10, 13): # Oct-Dec
|
||||
months.append((self.season, month))
|
||||
|
||||
# Winter-spring months of following year
|
||||
for month in range(1, 7): # Jan-Jun
|
||||
months.append((self.season + 1, month))
|
||||
|
||||
return months
|
||||
|
||||
def _get_source_url(self, source: str, **kwargs) -> str:
|
||||
"""Build a source URL with parameters.
|
||||
|
||||
Subclasses should override this to build URLs for their sources.
|
||||
|
||||
Args:
|
||||
source: Source identifier
|
||||
**kwargs: URL parameters
|
||||
|
||||
Returns:
|
||||
Complete URL string
|
||||
"""
|
||||
raise NotImplementedError(f"URL builder not implemented for {source}")
|
||||
|
||||
|
||||
class ScraperError(Exception):
|
||||
"""Exception raised when scraping fails."""
|
||||
|
||||
def __init__(self, source: str, message: str):
|
||||
self.source = source
|
||||
self.message = message
|
||||
super().__init__(f"[{source}] {message}")
|
||||
|
||||
|
||||
class PartialDataError(ScraperError):
|
||||
"""Exception raised when only partial data was retrieved."""
|
||||
|
||||
def __init__(self, source: str, message: str, partial_count: int):
|
||||
self.partial_count = partial_count
|
||||
super().__init__(source, f"{message} (got {partial_count} items)")
|
||||
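The fallback loop in `BaseScraper.scrape_games` can be reduced to a small standalone pattern. The sketch below is illustrative only: `first_successful` and `fake_fetch` are hypothetical names that do not exist in the package, and plain dicts stand in for `RawGameData`.

```python
from typing import Callable, Optional


def first_successful(
    sources: list[str],
    fetch: Callable[[str], list[dict]],
) -> tuple[Optional[str], list[dict], Optional[str]]:
    """Try each source in priority order; return (source, items, last_error)."""
    last_error: Optional[str] = None
    for source in sources:
        try:
            items = fetch(source)
        except Exception as exc:
            # A failure discards anything this source produced; move on.
            last_error = str(exc)
            continue
        if items:
            return source, items, None
        # An empty result is treated like a miss, not an error.
    return None, [], last_error


def fake_fetch(source: str) -> list[dict]:
    """Hypothetical fetcher: the first source errors, the second is empty."""
    if source == "primary":
        raise RuntimeError("HTTP 503")
    if source == "secondary":
        return []
    return [{"home": "NYY", "away": "BOS"}]
```

Calling `first_successful(["primary", "secondary", "tertiary"], fake_fetch)` skips the failing and the empty source and returns the data from `"tertiary"`; if every source fails, the last error message is returned so the caller can report it.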
@@ -0,0 +1,707 @@
|
||||
"""MLB scraper implementation with multi-source fallback."""
|
||||
|
||||
from datetime import datetime, date
|
||||
from typing import Optional
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
from .base import BaseScraper, RawGameData, ScrapeResult
|
||||
from ..models.game import Game
|
||||
from ..models.team import Team
|
||||
from ..models.stadium import Stadium
|
||||
from ..models.aliases import ManualReviewItem
|
||||
from ..normalizers.canonical_id import generate_game_id
|
||||
from ..normalizers.team_resolver import (
|
||||
TeamResolver,
|
||||
TEAM_MAPPINGS,
|
||||
get_team_resolver,
|
||||
)
|
||||
from ..normalizers.stadium_resolver import (
|
||||
StadiumResolver,
|
||||
STADIUM_MAPPINGS,
|
||||
get_stadium_resolver,
|
||||
)
|
||||
from ..normalizers.timezone import parse_datetime
|
||||
from ..utils.logging import get_logger, log_game, log_warning
|
||||
|
||||
|
||||
class MLBScraper(BaseScraper):
|
||||
"""MLB schedule scraper with multi-source fallback.
|
||||
|
||||
Sources (in priority order):
|
||||
1. Baseball-Reference - Most reliable, complete historical data
|
||||
2. MLB Stats API - Official MLB data
|
||||
3. ESPN API - Backup option
|
||||
"""
|
||||
|
||||
def __init__(self, season: int, **kwargs):
|
||||
"""Initialize MLB scraper.
|
||||
|
||||
Args:
|
||||
season: Season year (e.g., 2026 for the 2026 season)
|
||||
"""
|
||||
super().__init__("mlb", season, **kwargs)
|
||||
self._team_resolver = get_team_resolver("mlb")
|
||||
self._stadium_resolver = get_stadium_resolver("mlb")
|
||||
|
||||
def _get_sources(self) -> list[str]:
|
||||
"""Return source list in priority order."""
|
||||
return ["baseball_reference", "mlb_api", "espn"]
|
||||
|
||||
def _get_source_url(self, source: str, **kwargs) -> str:
|
||||
"""Build URL for a source."""
|
||||
if source == "baseball_reference":
|
||||
# Baseball-Reference uses season year in URL
|
||||
return f"https://www.baseball-reference.com/leagues/majors/{self.season}-schedule.shtml"
|
||||
|
||||
elif source == "mlb_api":
|
||||
start_date = kwargs.get("start_date", "")
|
||||
end_date = kwargs.get("end_date", "")
|
||||
return f"https://statsapi.mlb.com/api/v1/schedule?sportId=1&startDate={start_date}&endDate={end_date}"
|
||||
|
||||
elif source == "espn":
|
||||
date_str = kwargs.get("date", "")
|
||||
return f"https://site.api.espn.com/apis/site/v2/sports/baseball/mlb/scoreboard?dates={date_str}"
|
||||
|
||||
raise ValueError(f"Unknown source: {source}")
|
||||
|
||||
def _get_season_months(self) -> list[tuple[int, int]]:
|
||||
"""Get the months to scrape for MLB season.
|
||||
|
||||
MLB season runs March/April through October/November.
|
||||
"""
|
||||
months = []
|
||||
|
||||
# Full season window, from spring through the postseason
|
||||
for month in range(3, 12): # March-November
|
||||
months.append((self.season, month))
|
||||
|
||||
return months
|
||||
|
||||
def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
|
||||
"""Scrape games from a specific source."""
|
||||
if source == "baseball_reference":
|
||||
return self._scrape_baseball_reference()
|
||||
elif source == "mlb_api":
|
||||
return self._scrape_mlb_api()
|
||||
elif source == "espn":
|
||||
return self._scrape_espn()
|
||||
else:
|
||||
raise ValueError(f"Unknown source: {source}")
|
||||
|
||||
def _scrape_baseball_reference(self) -> list[RawGameData]:
|
||||
"""Scrape games from Baseball-Reference.
|
||||
|
||||
BR has a single schedule page per season.
|
||||
Format: https://www.baseball-reference.com/leagues/majors/YYYY-schedule.shtml
|
||||
"""
|
||||
url = self._get_source_url("baseball_reference")
|
||||
|
||||
try:
|
||||
html = self.session.get_html(url)
|
||||
games = self._parse_baseball_reference(html, url)
|
||||
return games
|
||||
|
||||
except Exception as e:
|
||||
self._logger.error(f"Failed to scrape Baseball-Reference: {e}")
|
||||
raise
|
||||
|
||||
def _parse_baseball_reference(
|
||||
self,
|
||||
html: str,
|
||||
source_url: str,
|
||||
) -> list[RawGameData]:
|
||||
"""Parse Baseball-Reference schedule HTML.
|
||||
|
||||
Structure: Games are organized by date in div elements.
|
||||
Each game row has: date, away team, away score, home team, home score, venue.
|
||||
"""
|
||||
soup = BeautifulSoup(html, "lxml")
|
||||
games: list[RawGameData] = []
|
||||
|
||||
# Baseball-Reference lists each game as <p class="game"> under an <h3> date header;
# walk both in document order so every game picks up the preceding date.
|
||||
|
||||
current_date = None
|
||||
|
||||
for elem in soup.find_all(["h3", "p"]):
|
||||
# H3 contains date headers
|
||||
if elem.name == "h3":
|
||||
date_text = elem.get_text(strip=True)
|
||||
try:
|
||||
# Format: "Thursday, April 1, 2026"
|
||||
current_date = datetime.strptime(date_text, "%A, %B %d, %Y")
|
||||
except ValueError:
|
||||
continue
|
||||
|
||||
elif elem.name == "p" and "game" in elem.get("class", []):
|
||||
if current_date is None:
|
||||
continue
|
||||
|
||||
try:
|
||||
game = self._parse_br_game(elem, current_date, source_url)
|
||||
if game:
|
||||
games.append(game)
|
||||
except Exception as e:
|
||||
self._logger.debug(f"Failed to parse game: {e}")
|
||||
continue
|
||||
|
||||
return games
|
||||
|
||||
def _parse_br_game(
|
||||
self,
|
||||
elem,
|
||||
game_date: datetime,
|
||||
source_url: str,
|
||||
) -> Optional[RawGameData]:
|
||||
"""Parse a single Baseball-Reference game element."""
|
||||
text = elem.get_text(" ", strip=True)
|
||||
|
||||
# Parse game text - formats vary:
|
||||
# "Team A (5) @ Team B (3)" or "Team A @ Team B"
|
||||
# Also handles doubleheader notation
|
||||
|
||||
# Find all links - usually team names
|
||||
links = elem.find_all("a")
|
||||
if len(links) < 2:
|
||||
return None
|
||||
|
||||
# First link is away team, second is home team
|
||||
away_team = links[0].get_text(strip=True)
|
||||
home_team = links[1].get_text(strip=True)
|
||||
|
||||
# Try to extract scores from text
|
||||
away_score = None
|
||||
home_score = None
|
||||
|
||||
# Look for score pattern "(N)"
|
||||
import re
|
||||
score_pattern = r"\((\d+)\)"
|
||||
scores = re.findall(score_pattern, text)
|
||||
|
||||
if len(scores) >= 2:
|
||||
try:
|
||||
away_score = int(scores[0])
|
||||
home_score = int(scores[1])
|
||||
except (ValueError, IndexError):
|
||||
pass
|
||||
|
||||
# Determine status
|
||||
status = "final" if home_score is not None else "scheduled"
|
||||
|
||||
# Check for postponed/cancelled
|
||||
text_lower = text.lower()
|
||||
if "postponed" in text_lower:
|
||||
status = "postponed"
|
||||
elif "cancelled" in text_lower or "canceled" in text_lower:
|
||||
status = "cancelled"
|
||||
|
||||
# Extract venue if present (assumed to be the third link; verify against the live markup)
|
||||
stadium = None
|
||||
if len(links) > 2:
|
||||
# Third link might be stadium
|
||||
stadium = links[2].get_text(strip=True)
|
||||
|
||||
return RawGameData(
|
||||
game_date=game_date,
|
||||
home_team_raw=home_team,
|
||||
away_team_raw=away_team,
|
||||
stadium_raw=stadium,
|
||||
home_score=home_score,
|
||||
away_score=away_score,
|
||||
status=status,
|
||||
source_url=source_url,
|
||||
)
|
||||
|
||||
def _scrape_mlb_api(self) -> list[RawGameData]:
|
||||
"""Scrape games from MLB Stats API.
|
||||
|
||||
MLB API allows date range queries.
|
||||
"""
|
||||
all_games: list[RawGameData] = []
|
||||
|
||||
# Query by month to avoid hitting API limits
|
||||
for year, month in self._get_season_months():
|
||||
start_date = date(year, month, 1)
|
||||
|
||||
# Get last day of month
|
||||
if month == 12:
|
||||
end_date = date(year + 1, 1, 1)
|
||||
else:
|
||||
end_date = date(year, month + 1, 1)
|
||||
|
||||
# Adjust end date to last day of month
|
||||
from datetime import timedelta
|
||||
end_date = end_date - timedelta(days=1)
|
||||
|
||||
url = self._get_source_url(
|
||||
"mlb_api",
|
||||
start_date=start_date.strftime("%Y-%m-%d"),
|
||||
end_date=end_date.strftime("%Y-%m-%d"),
|
||||
)
|
||||
|
||||
try:
|
||||
data = self.session.get_json(url)
|
||||
games = self._parse_mlb_api_response(data, url)
|
||||
all_games.extend(games)
|
||||
self._logger.debug(f"Found {len(games)} games in {year}-{month:02d}")
|
||||
|
||||
except Exception as e:
|
||||
self._logger.debug(f"MLB API error for {year}-{month:02d}: {e}")
|
||||
continue
|
||||
|
||||
return all_games
|
||||
|
||||
def _parse_mlb_api_response(
|
||||
self,
|
||||
data: dict,
|
||||
source_url: str,
|
||||
) -> list[RawGameData]:
|
||||
"""Parse MLB Stats API response."""
|
||||
games: list[RawGameData] = []
|
||||
|
||||
dates = data.get("dates", [])
|
||||
|
||||
for date_entry in dates:
|
||||
for game in date_entry.get("games", []):
|
||||
try:
|
||||
raw_game = self._parse_mlb_api_game(game, source_url)
|
||||
if raw_game:
|
||||
games.append(raw_game)
|
||||
except Exception as e:
|
||||
self._logger.debug(f"Failed to parse MLB API game: {e}")
|
||||
continue
|
||||
|
||||
return games
|
||||
|
||||
def _parse_mlb_api_game(
|
||||
self,
|
||||
game: dict,
|
||||
source_url: str,
|
||||
) -> Optional[RawGameData]:
|
||||
"""Parse a single MLB API game."""
|
||||
# Get game date/time
|
||||
game_date_str = game.get("gameDate", "")
|
||||
if not game_date_str:
|
||||
return None
|
||||
|
||||
try:
|
||||
game_date = datetime.fromisoformat(game_date_str.replace("Z", "+00:00"))
|
||||
except ValueError:
|
||||
return None
|
||||
|
||||
# Get teams
|
||||
teams = game.get("teams", {})
|
||||
away_data = teams.get("away", {})
|
||||
home_data = teams.get("home", {})
|
||||
|
||||
away_team_info = away_data.get("team", {})
|
||||
home_team_info = home_data.get("team", {})
|
||||
|
||||
away_team = away_team_info.get("name", "")
|
||||
home_team = home_team_info.get("name", "")
|
||||
|
||||
if not away_team or not home_team:
|
||||
return None
|
||||
|
||||
# Get scores
|
||||
away_score = away_data.get("score")
|
||||
home_score = home_data.get("score")
|
||||
|
||||
# Get venue
|
||||
venue = game.get("venue", {})
|
||||
stadium = venue.get("name")
|
||||
|
||||
# Get status
|
||||
status_data = game.get("status", {})
|
||||
abstract_game_state = status_data.get("abstractGameState", "").lower()
|
||||
detailed_state = status_data.get("detailedState", "").lower()
|
||||
|
||||
if abstract_game_state == "final":
|
||||
status = "final"
|
||||
elif "postponed" in detailed_state:
|
||||
status = "postponed"
|
||||
elif "cancelled" in detailed_state or "canceled" in detailed_state:
|
||||
status = "cancelled"
|
||||
else:
|
||||
status = "scheduled"
|
||||
|
||||
# Check for doubleheader
|
||||
game_number = game.get("gameNumber")
|
||||
if game.get("doubleHeader") == "Y":
|
||||
game_number = game.get("gameNumber", 1)
|
||||
|
||||
return RawGameData(
|
||||
game_date=game_date,
|
||||
home_team_raw=home_team,
|
||||
away_team_raw=away_team,
|
||||
stadium_raw=stadium,
|
||||
home_score=home_score,
|
||||
away_score=away_score,
|
||||
status=status,
|
||||
source_url=source_url,
|
||||
game_number=game_number if game.get("doubleHeader") in ("Y", "S") else None,  # "S" = split doubleheader
|
||||
)
|
||||
|
||||
def _scrape_espn(self) -> list[RawGameData]:
|
||||
"""Scrape games from ESPN API."""
|
||||
all_games: list[RawGameData] = []
|
||||
|
||||
for year, month in self._get_season_months():
|
||||
# Get number of days in month
|
||||
if month == 12:
|
||||
next_month = date(year + 1, 1, 1)
|
||||
else:
|
||||
next_month = date(year, month + 1, 1)
|
||||
|
||||
days_in_month = (next_month - date(year, month, 1)).days
|
||||
|
||||
for day in range(1, days_in_month + 1):
|
||||
try:
|
||||
game_date = date(year, month, day)
|
||||
date_str = game_date.strftime("%Y%m%d")
|
||||
url = self._get_source_url("espn", date=date_str)
|
||||
|
||||
data = self.session.get_json(url)
|
||||
games = self._parse_espn_response(data, url)
|
||||
all_games.extend(games)
|
||||
|
||||
except Exception as e:
|
||||
self._logger.debug(f"ESPN error for {year}-{month:02d}-{day:02d}: {e}")
|
||||
continue
|
||||
|
||||
return all_games
|
||||
|
||||
def _parse_espn_response(
|
||||
self,
|
||||
data: dict,
|
||||
source_url: str,
|
||||
) -> list[RawGameData]:
|
||||
"""Parse ESPN API response."""
|
||||
games: list[RawGameData] = []
|
||||
|
||||
events = data.get("events", [])
|
||||
|
||||
for event in events:
|
||||
try:
|
||||
game = self._parse_espn_event(event, source_url)
|
||||
if game:
|
||||
games.append(game)
|
||||
except Exception as e:
|
||||
self._logger.debug(f"Failed to parse ESPN event: {e}")
|
||||
continue
|
||||
|
||||
return games
|
||||
|
||||
def _parse_espn_event(
|
||||
self,
|
||||
event: dict,
|
||||
source_url: str,
|
||||
) -> Optional[RawGameData]:
|
||||
"""Parse a single ESPN event."""
|
||||
# Get date
|
||||
date_str = event.get("date", "")
|
||||
if not date_str:
|
||||
return None
|
||||
|
||||
try:
|
||||
game_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
|
||||
except ValueError:
|
||||
return None
|
||||
|
||||
# Get competitions
|
||||
competitions = event.get("competitions", [])
|
||||
if not competitions:
|
||||
return None
|
||||
|
||||
competition = competitions[0]
|
||||
|
||||
# Get teams
|
||||
competitors = competition.get("competitors", [])
|
||||
if len(competitors) != 2:
|
||||
return None
|
||||
|
||||
home_team = None
|
||||
away_team = None
|
||||
home_score = None
|
||||
away_score = None
|
||||
|
||||
for competitor in competitors:
|
||||
team_info = competitor.get("team", {})
|
||||
team_name = team_info.get("displayName", "")
|
||||
is_home = competitor.get("homeAway") == "home"
|
||||
score = competitor.get("score")
|
||||
|
||||
if score is not None:
|
||||
try:
|
||||
score = int(score)
|
||||
except (ValueError, TypeError):
|
||||
score = None
|
||||
|
||||
if is_home:
|
||||
home_team = team_name
|
||||
home_score = score
|
||||
else:
|
||||
away_team = team_name
|
||||
away_score = score
|
||||
|
||||
if not home_team or not away_team:
|
||||
return None
|
||||
|
||||
# Get venue
|
||||
venue = competition.get("venue", {})
|
||||
stadium = venue.get("fullName")
|
||||
|
||||
# Get status
|
||||
status_info = competition.get("status", {})
|
||||
status_type = status_info.get("type", {})
|
||||
status_name = status_type.get("name", "").lower()
|
||||
|
||||
if status_name == "status_final":
|
||||
status = "final"
|
||||
elif status_name == "status_postponed":
|
||||
status = "postponed"
|
||||
elif status_name == "status_canceled":
|
||||
status = "cancelled"
|
||||
else:
|
||||
status = "scheduled"
|
||||
|
||||
return RawGameData(
|
||||
game_date=game_date,
|
||||
home_team_raw=home_team,
|
||||
away_team_raw=away_team,
|
||||
stadium_raw=stadium,
|
||||
home_score=home_score,
|
||||
away_score=away_score,
|
||||
status=status,
|
||||
source_url=source_url,
|
||||
)
|
||||
|
||||
def _normalize_games(
|
||||
self,
|
||||
raw_games: list[RawGameData],
|
||||
) -> tuple[list[Game], list[ManualReviewItem]]:
|
||||
"""Normalize raw games to Game objects with canonical IDs."""
|
||||
games: list[Game] = []
|
||||
review_items: list[ManualReviewItem] = []
|
||||
|
||||
# Track games by date/matchup for doubleheader detection
|
||||
games_by_matchup: dict[str, list[RawGameData]] = {}
|
||||
|
||||
for raw in raw_games:
|
||||
date_key = raw.game_date.strftime("%Y%m%d")
|
||||
matchup_key = f"{date_key}_{raw.away_team_raw}_{raw.home_team_raw}"
|
||||
|
||||
if matchup_key not in games_by_matchup:
|
||||
games_by_matchup[matchup_key] = []
|
||||
games_by_matchup[matchup_key].append(raw)
|
||||
|
||||
# Process games with doubleheader detection
|
||||
for matchup_key, matchup_games in games_by_matchup.items():
|
||||
is_doubleheader = len(matchup_games) > 1
|
||||
|
||||
# Sort by time if doubleheader
|
||||
if is_doubleheader:
|
||||
matchup_games.sort(key=lambda g: g.game_date)
|
||||
|
||||
for i, raw in enumerate(matchup_games):
|
||||
# Use provided game_number or calculate from order
|
||||
game_number = raw.game_number or ((i + 1) if is_doubleheader else None)
|
||||
|
||||
game, item_reviews = self._normalize_single_game(raw, game_number)
|
||||
|
||||
if game:
|
||||
games.append(game)
|
||||
log_game(
|
||||
self.sport,
|
||||
game.id,
|
||||
game.home_team_id,
|
||||
game.away_team_id,
|
||||
game.game_date.strftime("%Y-%m-%d"),
|
||||
game.status,
|
||||
)
|
||||
|
||||
review_items.extend(item_reviews)
|
||||
|
||||
return games, review_items
|
||||
|
||||
def _normalize_single_game(
|
||||
self,
|
||||
raw: RawGameData,
|
||||
game_number: Optional[int],
|
||||
) -> tuple[Optional[Game], list[ManualReviewItem]]:
|
||||
"""Normalize a single raw game."""
|
||||
review_items: list[ManualReviewItem] = []
|
||||
|
||||
# Resolve home team
|
||||
home_result = self._team_resolver.resolve(
|
||||
raw.home_team_raw,
|
||||
check_date=raw.game_date.date(),
|
||||
source_url=raw.source_url,
|
||||
)
|
||||
|
||||
if home_result.review_item:
|
||||
review_items.append(home_result.review_item)
|
||||
|
||||
if not home_result.canonical_id:
|
||||
log_warning(f"Could not resolve home team: {raw.home_team_raw}")
|
||||
return None, review_items
|
||||
|
||||
# Resolve away team
|
||||
away_result = self._team_resolver.resolve(
|
||||
raw.away_team_raw,
|
||||
check_date=raw.game_date.date(),
|
||||
source_url=raw.source_url,
|
||||
)
|
||||
|
||||
if away_result.review_item:
|
||||
review_items.append(away_result.review_item)
|
||||
|
||||
if not away_result.canonical_id:
|
||||
log_warning(f"Could not resolve away team: {raw.away_team_raw}")
|
||||
return None, review_items
|
||||
|
||||
# Resolve stadium
|
||||
stadium_id = None
|
||||
|
||||
if raw.stadium_raw:
|
||||
stadium_result = self._stadium_resolver.resolve(
|
||||
raw.stadium_raw,
|
||||
check_date=raw.game_date.date(),
|
||||
source_url=raw.source_url,
|
||||
)
|
||||
|
||||
if stadium_result.review_item:
|
||||
review_items.append(stadium_result.review_item)
|
||||
|
||||
stadium_id = stadium_result.canonical_id
|
||||
|
||||
# Get abbreviations for game ID
|
||||
home_abbrev = self._get_abbreviation(home_result.canonical_id)
|
||||
away_abbrev = self._get_abbreviation(away_result.canonical_id)
|
||||
|
||||
# Generate canonical game ID
|
||||
game_id = generate_game_id(
|
||||
sport=self.sport,
|
||||
season=self.season,
|
||||
away_abbrev=away_abbrev,
|
||||
home_abbrev=home_abbrev,
|
||||
game_date=raw.game_date,
|
||||
game_number=game_number,
|
||||
)
|
||||
|
||||
game = Game(
|
||||
id=game_id,
|
||||
sport=self.sport,
|
||||
season=self.season,
|
||||
home_team_id=home_result.canonical_id,
|
||||
away_team_id=away_result.canonical_id,
|
||||
stadium_id=stadium_id or "",
|
||||
game_date=raw.game_date,
|
||||
game_number=game_number,
|
||||
home_score=raw.home_score,
|
||||
away_score=raw.away_score,
|
||||
status=raw.status,
|
||||
source_url=raw.source_url,
|
||||
raw_home_team=raw.home_team_raw,
|
||||
raw_away_team=raw.away_team_raw,
|
||||
raw_stadium=raw.stadium_raw,
|
||||
)
|
||||
|
||||
return game, review_items
|
||||
|
||||
def _get_abbreviation(self, team_id: str) -> str:
|
||||
"""Extract abbreviation from team ID."""
|
||||
# team_mlb_nyy -> nyy
|
||||
parts = team_id.split("_")
|
||||
return parts[-1] if parts else ""
|
||||
|
||||
def scrape_teams(self) -> list[Team]:
|
||||
"""Get all MLB teams from hardcoded mappings."""
|
||||
teams: list[Team] = []
|
||||
seen: set[str] = set()
|
||||
|
||||
# MLB league/division structure
|
||||
divisions = {
|
||||
"AL East": ("American", ["BAL", "BOS", "NYY", "TB", "TOR"]),
|
||||
"AL Central": ("American", ["CHW", "CLE", "DET", "KC", "MIN"]),
|
||||
"AL West": ("American", ["HOU", "LAA", "OAK", "SEA", "TEX"]),
|
||||
"NL East": ("National", ["ATL", "MIA", "NYM", "PHI", "WSN"]),
|
||||
"NL Central": ("National", ["CHC", "CIN", "MIL", "PIT", "STL"]),
|
||||
"NL West": ("National", ["ARI", "COL", "LAD", "SD", "SF"]),
|
||||
}
|
||||
|
||||
# Build reverse lookup
|
||||
team_divisions: dict[str, tuple[str, str]] = {}
|
||||
for div, (league, abbrevs) in divisions.items():
|
||||
for abbrev in abbrevs:
|
||||
team_divisions[abbrev] = (league, div)
|
||||
|
||||
for abbrev, (team_id, full_name, city) in TEAM_MAPPINGS.get("mlb", {}).items():
|
||||
if team_id in seen:
|
||||
continue
|
||||
seen.add(team_id)
|
||||
|
||||
# Parse team name from full name
|
||||
parts = full_name.split()
|
||||
if len(parts) >= 2:
|
||||
team_name = parts[-1]
|
||||
# Handle multi-word team names
|
||||
if team_name in ["Sox", "Jays"]:
|
||||
team_name = " ".join(parts[-2:])
|
||||
else:
|
||||
team_name = full_name
|
||||
|
||||
# Get league and division
|
||||
league, div = team_divisions.get(abbrev, (None, None))
|
||||
|
||||
# Get stadium ID (first stadium whose city matches; teams sharing a city resolve to the same stadium)
|
||||
stadium_id = None
|
||||
mlb_stadiums = STADIUM_MAPPINGS.get("mlb", {})
|
||||
for sid, sinfo in mlb_stadiums.items():
|
||||
if city.lower() in sinfo.city.lower() or sinfo.city.lower() in city.lower():
|
||||
stadium_id = sid
|
||||
break
|
||||
|
||||
team = Team(
|
||||
id=team_id,
|
||||
sport="mlb",
|
||||
city=city,
|
||||
name=team_name,
|
||||
full_name=full_name,
|
||||
abbreviation=abbrev,
|
||||
conference=league, # MLB uses "league" but we map to conference field
|
||||
division=div,
|
||||
stadium_id=stadium_id,
|
||||
)
|
||||
teams.append(team)
|
||||
|
||||
return teams
|
||||
|
||||
def scrape_stadiums(self) -> list[Stadium]:
|
||||
"""Get all MLB stadiums from hardcoded mappings."""
|
||||
stadiums: list[Stadium] = []
|
||||
|
||||
mlb_stadiums = STADIUM_MAPPINGS.get("mlb", {})
|
||||
for stadium_id, info in mlb_stadiums.items():
|
||||
stadium = Stadium(
|
||||
id=stadium_id,
|
||||
sport="mlb",
|
||||
name=info.name,
|
||||
city=info.city,
|
||||
state=info.state,
|
||||
country=info.country,
|
||||
latitude=info.latitude,
|
||||
longitude=info.longitude,
|
||||
surface="grass", # Most MLB stadiums
|
||||
roof_type="open", # Most MLB stadiums
|
||||
)
|
||||
stadiums.append(stadium)
|
||||
|
||||
return stadiums
|
||||
|
||||
|
||||
def create_mlb_scraper(season: int) -> MLBScraper:
|
||||
"""Factory function to create an MLB scraper."""
|
||||
return MLBScraper(season=season)
|
||||
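The doubleheader handling in `_normalize_games` above can be sketched in isolation. This is a minimal illustration, not the package's code: `RawGame` and `assign_game_numbers` are hypothetical stand-ins that keep only the fields the grouping logic reads, and the sample games are invented.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class RawGame:
    """Hypothetical stand-in for RawGameData with only the fields used here."""
    game_date: datetime
    home_team_raw: str
    away_team_raw: str
    game_number: Optional[int] = None


def assign_game_numbers(raw_games: list[RawGame]) -> list[tuple[RawGame, Optional[int]]]:
    """Pair each game with its doubleheader number (None for single games)."""
    # Group by calendar date plus matchup, as the scraper does.
    by_matchup: dict[str, list[RawGame]] = defaultdict(list)
    for raw in raw_games:
        key = f"{raw.game_date:%Y%m%d}_{raw.away_team_raw}_{raw.home_team_raw}"
        by_matchup[key].append(raw)

    numbered: list[tuple[RawGame, Optional[int]]] = []
    for matchup_games in by_matchup.values():
        is_doubleheader = len(matchup_games) > 1
        if is_doubleheader:
            # Earlier start time becomes game 1.
            matchup_games.sort(key=lambda g: g.game_date)
        for i, raw in enumerate(matchup_games):
            # Prefer a source-provided number, else derive it from the order.
            number = raw.game_number or ((i + 1) if is_doubleheader else None)
            numbered.append((raw, number))
    return numbered


# Invented sample data: two games of one matchup on the same date, plus a single game.
games = [
    RawGame(datetime(2026, 7, 4, 23, 5), "New York Yankees", "Boston Red Sox"),
    RawGame(datetime(2026, 7, 4, 17, 5), "New York Yankees", "Boston Red Sox"),
    RawGame(datetime(2026, 7, 4, 19, 10), "Chicago Cubs", "St. Louis Cardinals"),
]
result = assign_game_numbers(games)
```

One consequence of keying on the date of `game_date`: if the timestamps are in UTC, a nightcap that starts after midnight UTC would land under the next day's key and would not be grouped with the first game. Whether that matters depends on how the upstream timestamps are expressed.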
@@ -0,0 +1,410 @@
|
||||
"""MLS scraper implementation with multi-source fallback."""
|
||||
|
||||
from datetime import datetime, date
|
||||
from typing import Optional
|
||||
|
||||
from .base import BaseScraper, RawGameData, ScrapeResult
|
||||
from ..models.game import Game
|
||||
from ..models.team import Team
|
||||
from ..models.stadium import Stadium
|
||||
from ..models.aliases import ManualReviewItem
|
||||
from ..normalizers.canonical_id import generate_game_id
|
||||
from ..normalizers.team_resolver import (
|
||||
TeamResolver,
|
||||
TEAM_MAPPINGS,
|
||||
get_team_resolver,
|
||||
)
|
||||
from ..normalizers.stadium_resolver import (
|
||||
StadiumResolver,
|
||||
STADIUM_MAPPINGS,
|
||||
get_stadium_resolver,
|
||||
)
|
||||
from ..utils.logging import get_logger, log_game, log_warning
|
||||
|
||||
|
||||
class MLSScraper(BaseScraper):
|
||||
"""MLS schedule scraper with multi-source fallback.
|
||||
|
||||
Sources (in priority order):
|
||||
1. ESPN API - Most reliable for MLS
|
||||
2. FBref - Backup option
|
||||
"""
|
||||
|
||||
def __init__(self, season: int, **kwargs):
|
||||
"""Initialize MLS scraper.
|
||||
|
||||
Args:
|
||||
season: Season year (e.g., 2026 for the 2026 season)
|
||||
"""
|
||||
super().__init__("mls", season, **kwargs)
|
||||
self._team_resolver = get_team_resolver("mls")
|
||||
self._stadium_resolver = get_stadium_resolver("mls")
|
||||
|
||||
def _get_sources(self) -> list[str]:
|
||||
"""Return source list in priority order."""
|
||||
return ["espn", "fbref"]
|
||||
|
||||
def _get_source_url(self, source: str, **kwargs) -> str:
|
||||
"""Build URL for a source."""
|
||||
if source == "espn":
|
||||
date_str = kwargs.get("date", "")
|
||||
return f"https://site.api.espn.com/apis/site/v2/sports/soccer/usa.1/scoreboard?dates={date_str}"
|
||||
|
||||
elif source == "fbref":
|
||||
return f"https://fbref.com/en/comps/22/{self.season}/schedule/{self.season}-Major-League-Soccer-Scores-and-Fixtures"
|
||||
|
||||
raise ValueError(f"Unknown source: {source}")
|
||||
|
||||
def _get_season_months(self) -> list[tuple[int, int]]:
|
||||
"""Get the months to scrape for MLS season.
|
||||
|
||||
Covers February through November of the season year.
|
||||
"""
|
||||
months = []
|
||||
|
||||
# MLS runs within a calendar year
|
||||
for month in range(2, 12): # Feb-Nov
|
||||
months.append((self.season, month))
|
||||
|
||||
return months
|
||||
|
||||
def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
|
||||
"""Scrape games from a specific source."""
|
||||
if source == "espn":
|
||||
return self._scrape_espn()
|
||||
elif source == "fbref":
|
||||
return self._scrape_fbref()
|
||||
else:
|
||||
raise ValueError(f"Unknown source: {source}")
|
||||
|
||||
def _scrape_espn(self) -> list[RawGameData]:
|
||||
"""Scrape games from ESPN API."""
|
||||
all_games: list[RawGameData] = []
|
||||
|
||||
for year, month in self._get_season_months():
|
||||
# Get number of days in month
|
||||
if month == 12:
|
||||
next_month = date(year + 1, 1, 1)
|
||||
else:
|
||||
next_month = date(year, month + 1, 1)
|
||||
|
||||
days_in_month = (next_month - date(year, month, 1)).days
|
||||
|
||||
for day in range(1, days_in_month + 1):
|
||||
try:
|
||||
game_date = date(year, month, day)
|
||||
date_str = game_date.strftime("%Y%m%d")
|
||||
url = self._get_source_url("espn", date=date_str)
|
||||
|
||||
data = self.session.get_json(url)
|
||||
games = self._parse_espn_response(data, url)
|
||||
all_games.extend(games)
|
||||
|
||||
except Exception as e:
|
||||
self._logger.debug(f"ESPN error for {year}-{month}-{day}: {e}")
|
||||
continue
|
||||
|
||||
return all_games
|
||||
|
||||
def _parse_espn_response(
|
||||
self,
|
||||
data: dict,
|
||||
source_url: str,
|
||||
) -> list[RawGameData]:
|
||||
"""Parse ESPN API response."""
|
||||
games: list[RawGameData] = []
|
||||
|
||||
events = data.get("events", [])
|
||||
|
||||
for event in events:
|
||||
try:
|
||||
game = self._parse_espn_event(event, source_url)
|
||||
if game:
|
||||
games.append(game)
|
||||
except Exception as e:
|
||||
self._logger.debug(f"Failed to parse ESPN event: {e}")
|
||||
continue
|
||||
|
||||
return games
|
||||
|
||||
def _parse_espn_event(
|
||||
self,
|
||||
event: dict,
|
||||
source_url: str,
|
||||
) -> Optional[RawGameData]:
|
||||
"""Parse a single ESPN event."""
|
||||
# Get date
|
||||
date_str = event.get("date", "")
|
||||
if not date_str:
|
||||
return None
|
||||
|
||||
try:
|
||||
game_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
|
||||
except ValueError:
|
||||
return None
|
||||
|
||||
# Get competitions
|
||||
competitions = event.get("competitions", [])
|
||||
if not competitions:
|
||||
return None
|
||||
|
||||
competition = competitions[0]
|
||||
|
||||
# Get teams
|
||||
competitors = competition.get("competitors", [])
|
||||
if len(competitors) != 2:
|
||||
return None
|
||||
|
||||
home_team = None
|
||||
away_team = None
|
||||
home_score = None
|
||||
away_score = None
|
||||
|
||||
for competitor in competitors:
|
||||
team_info = competitor.get("team", {})
|
||||
team_name = team_info.get("displayName", "")
|
||||
is_home = competitor.get("homeAway") == "home"
|
||||
score = competitor.get("score")
|
||||
|
||||
if score is not None:
|
||||
try:
|
||||
score = int(score)
|
||||
except (ValueError, TypeError):
|
||||
score = None
|
||||
|
||||
if is_home:
|
||||
home_team = team_name
|
||||
home_score = score
|
||||
else:
|
||||
away_team = team_name
|
||||
away_score = score
|
||||
|
||||
if not home_team or not away_team:
|
||||
return None
|
||||
|
||||
# Get venue
|
||||
venue = competition.get("venue", {})
|
||||
stadium = venue.get("fullName")
|
||||
|
||||
# Get status
|
||||
status_info = competition.get("status", {})
|
||||
status_type = status_info.get("type", {})
|
||||
status_name = status_type.get("name", "").lower()
|
||||
|
||||
if status_name == "status_final":
|
||||
status = "final"
|
||||
elif status_name == "status_postponed":
|
||||
status = "postponed"
|
||||
elif status_name == "status_canceled":
|
||||
status = "cancelled"
|
||||
else:
|
||||
status = "scheduled"
|
||||
|
||||
return RawGameData(
|
||||
game_date=game_date,
|
||||
home_team_raw=home_team,
|
||||
away_team_raw=away_team,
|
||||
stadium_raw=stadium,
|
||||
home_score=home_score,
|
||||
away_score=away_score,
|
||||
status=status,
|
||||
source_url=source_url,
|
||||
)
|
||||
|
||||
def _scrape_fbref(self) -> list[RawGameData]:
|
||||
"""Scrape games from FBref."""
|
||||
# FBref scraping would go here
|
||||
raise NotImplementedError("FBref scraper not implemented")
|
||||
|
||||
def _normalize_games(
|
||||
self,
|
||||
raw_games: list[RawGameData],
|
||||
) -> tuple[list[Game], list[ManualReviewItem]]:
|
||||
"""Normalize raw games to Game objects with canonical IDs."""
|
||||
games: list[Game] = []
|
||||
review_items: list[ManualReviewItem] = []
|
||||
|
||||
for raw in raw_games:
|
||||
game, item_reviews = self._normalize_single_game(raw)
|
||||
|
||||
if game:
|
||||
games.append(game)
|
||||
log_game(
|
||||
self.sport,
|
||||
game.id,
|
||||
game.home_team_id,
|
||||
game.away_team_id,
|
||||
game.game_date.strftime("%Y-%m-%d"),
|
||||
game.status,
|
||||
)
|
||||
|
||||
review_items.extend(item_reviews)
|
||||
|
||||
return games, review_items
|
||||
|
||||
def _normalize_single_game(
|
||||
self,
|
||||
raw: RawGameData,
|
||||
) -> tuple[Optional[Game], list[ManualReviewItem]]:
|
||||
"""Normalize a single raw game."""
|
||||
review_items: list[ManualReviewItem] = []
|
||||
|
||||
# Resolve home team
|
||||
home_result = self._team_resolver.resolve(
|
||||
raw.home_team_raw,
|
||||
check_date=raw.game_date.date(),
|
||||
source_url=raw.source_url,
|
||||
)
|
||||
|
||||
if home_result.review_item:
|
||||
review_items.append(home_result.review_item)
|
||||
|
||||
if not home_result.canonical_id:
|
||||
log_warning(f"Could not resolve home team: {raw.home_team_raw}")
|
||||
return None, review_items
|
||||
|
||||
# Resolve away team
|
||||
away_result = self._team_resolver.resolve(
|
||||
raw.away_team_raw,
|
||||
check_date=raw.game_date.date(),
|
||||
source_url=raw.source_url,
|
||||
)
|
||||
|
||||
if away_result.review_item:
|
||||
review_items.append(away_result.review_item)
|
||||
|
||||
if not away_result.canonical_id:
|
||||
log_warning(f"Could not resolve away team: {raw.away_team_raw}")
|
||||
return None, review_items
|
||||
|
||||
# Resolve stadium
|
||||
stadium_id = None
|
||||
|
||||
if raw.stadium_raw:
|
||||
stadium_result = self._stadium_resolver.resolve(
|
||||
raw.stadium_raw,
|
||||
check_date=raw.game_date.date(),
|
||||
source_url=raw.source_url,
|
||||
)
|
||||
|
||||
if stadium_result.review_item:
|
||||
review_items.append(stadium_result.review_item)
|
||||
|
||||
stadium_id = stadium_result.canonical_id
|
||||
|
||||
# Get abbreviations for game ID
|
||||
home_abbrev = self._get_abbreviation(home_result.canonical_id)
|
||||
away_abbrev = self._get_abbreviation(away_result.canonical_id)
|
||||
|
||||
# Generate canonical game ID
|
||||
game_id = generate_game_id(
|
||||
sport=self.sport,
|
||||
season=self.season,
|
||||
away_abbrev=away_abbrev,
|
||||
home_abbrev=home_abbrev,
|
||||
game_date=raw.game_date,
|
||||
game_number=None,
|
||||
)
|
||||
|
||||
game = Game(
|
||||
id=game_id,
|
||||
sport=self.sport,
|
||||
season=self.season,
|
||||
home_team_id=home_result.canonical_id,
|
||||
away_team_id=away_result.canonical_id,
|
||||
stadium_id=stadium_id or "",
|
||||
game_date=raw.game_date,
|
||||
game_number=None,
|
||||
home_score=raw.home_score,
|
||||
away_score=raw.away_score,
|
||||
status=raw.status,
|
||||
source_url=raw.source_url,
|
||||
raw_home_team=raw.home_team_raw,
|
||||
raw_away_team=raw.away_team_raw,
|
||||
raw_stadium=raw.stadium_raw,
|
||||
)
|
||||
|
||||
return game, review_items
|
||||
|
||||
def _get_abbreviation(self, team_id: str) -> str:
|
||||
"""Extract abbreviation from team ID."""
|
||||
parts = team_id.split("_")
|
||||
return parts[-1] if parts else ""
|
||||
|
||||
def scrape_teams(self) -> list[Team]:
|
||||
"""Get all MLS teams from hardcoded mappings."""
|
||||
teams: list[Team] = []
|
||||
seen: set[str] = set()
|
||||
|
||||
# MLS conference structure
|
||||
conferences = {
|
||||
"Eastern": ["ATL", "CLT", "CHI", "CIN", "CLB", "DC", "MIA", "MTL", "NSH", "NE", "NYC", "RB", "ORL", "PHI", "TOR"],
|
||||
"Western": ["AUS", "COL", "DAL", "HOU", "LAG", "LAFC", "MIN", "POR", "SLC", "SD", "SJ", "SEA", "SKC", "STL", "VAN"],
|
||||
}
|
||||
|
||||
# Build reverse lookup
|
||||
team_conferences: dict[str, str] = {}
|
||||
for conf, abbrevs in conferences.items():
|
||||
for abbrev in abbrevs:
|
||||
team_conferences[abbrev] = conf
|
||||
|
||||
for abbrev, (team_id, full_name, city) in TEAM_MAPPINGS.get("mls", {}).items():
|
||||
if team_id in seen:
|
||||
continue
|
||||
seen.add(team_id)
|
||||
|
||||
# Parse team name
|
||||
team_name = full_name
|
||||
|
||||
# Get conference
|
||||
conf = team_conferences.get(abbrev)
|
||||
|
||||
# Get stadium ID (first stadium whose city matches; teams sharing a city resolve to the same stadium)
|
||||
stadium_id = None
|
||||
mls_stadiums = STADIUM_MAPPINGS.get("mls", {})
|
||||
for sid, sinfo in mls_stadiums.items():
|
||||
if city.lower() in sinfo.city.lower() or sinfo.city.lower() in city.lower():
|
||||
stadium_id = sid
|
||||
break
|
||||
|
||||
team = Team(
|
||||
id=team_id,
|
||||
sport="mls",
|
||||
city=city,
|
||||
name=team_name,
|
||||
full_name=full_name,
|
||||
abbreviation=abbrev,
|
||||
conference=conf,
|
||||
division=None, # MLS doesn't have divisions
|
||||
stadium_id=stadium_id,
|
||||
)
|
||||
teams.append(team)
|
||||
|
||||
return teams
|
||||
|
||||
def scrape_stadiums(self) -> list[Stadium]:
|
||||
"""Get all MLS stadiums from hardcoded mappings."""
|
||||
stadiums: list[Stadium] = []
|
||||
|
||||
mls_stadiums = STADIUM_MAPPINGS.get("mls", {})
|
||||
for stadium_id, info in mls_stadiums.items():
|
||||
stadium = Stadium(
|
||||
id=stadium_id,
|
||||
sport="mls",
|
||||
name=info.name,
|
||||
city=info.city,
|
||||
state=info.state,
|
||||
country=info.country,
|
||||
latitude=info.latitude,
|
||||
longitude=info.longitude,
|
||||
surface="grass",
|
||||
roof_type="open",
|
||||
)
|
||||
stadiums.append(stadium)
|
||||
|
||||
return stadiums
|
||||
|
||||
|
||||
def create_mls_scraper(season: int) -> MLSScraper:
|
||||
"""Factory function to create an MLS scraper."""
|
||||
return MLSScraper(season=season)
|
||||
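The ESPN event parsing used by the scrapers above can be exercised on its own with a small payload. The sketch below is an assumption-laden illustration: `parse_event` and `STATUS_MAP` are hypothetical names, the function returns a plain dict instead of `RawGameData`, and `sample_event` is hand-written to contain only the fields the scraper reads, so it is not a real API response.

```python
from datetime import datetime
from typing import Optional

# Mirrors the status mapping in _parse_espn_event; anything else is "scheduled".
STATUS_MAP = {
    "status_final": "final",
    "status_postponed": "postponed",
    "status_canceled": "cancelled",
}


def parse_event(event: dict) -> Optional[dict]:
    """Extract date, teams, venue and status from one event, or None if unusable."""
    date_str = event.get("date", "")
    if not date_str:
        return None
    try:
        # A trailing "Z" is rewritten as an explicit UTC offset before parsing.
        game_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
    except ValueError:
        return None

    competitions = event.get("competitions", [])
    if not competitions:
        return None
    competition = competitions[0]

    competitors = competition.get("competitors", [])
    if len(competitors) != 2:
        return None

    teams: dict[str, str] = {}
    for competitor in competitors:
        side = "home" if competitor.get("homeAway") == "home" else "away"
        teams[side] = competitor.get("team", {}).get("displayName", "")
    if not teams.get("home") or not teams.get("away"):
        return None

    status_name = competition.get("status", {}).get("type", {}).get("name", "").lower()
    return {
        "game_date": game_date,
        "home": teams["home"],
        "away": teams["away"],
        "stadium": competition.get("venue", {}).get("fullName"),
        "status": STATUS_MAP.get(status_name, "scheduled"),
    }


# Hand-written sample (hypothetical values), shaped like the fields read above.
sample_event = {
    "date": "2026-03-01T00:30Z",
    "competitions": [
        {
            "competitors": [
                {"homeAway": "home", "team": {"displayName": "Seattle Sounders FC"}},
                {"homeAway": "away", "team": {"displayName": "Portland Timbers"}},
            ],
            "venue": {"fullName": "Lumen Field"},
            "status": {"type": {"name": "STATUS_FINAL"}},
        }
    ],
}
parsed = parse_event(sample_event)
```

Using a lookup table with a default keeps unknown status names on the "scheduled" path, which matches the `else` branch of the original if/elif chain.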
@@ -0,0 +1,637 @@
|
||||
"""NBA scraper implementation with multi-source fallback."""
|
||||
|
||||
from datetime import datetime, date, timezone
|
||||
from typing import Optional
|
||||
from bs4 import BeautifulSoup
|
||||
import re
|
||||
|
||||
from .base import BaseScraper, RawGameData, ScrapeResult
|
||||
from ..models.game import Game
|
||||
from ..models.team import Team
|
||||
from ..models.stadium import Stadium
|
||||
from ..models.aliases import ManualReviewItem
|
||||
from ..normalizers.canonical_id import generate_game_id
|
||||
from ..normalizers.team_resolver import (
|
||||
TeamResolver,
|
||||
TEAM_MAPPINGS,
|
||||
get_team_resolver,
|
||||
)
|
||||
from ..normalizers.stadium_resolver import (
|
||||
StadiumResolver,
|
||||
STADIUM_MAPPINGS,
|
||||
get_stadium_resolver,
|
||||
)
|
||||
from ..normalizers.timezone import parse_datetime
|
||||
from ..utils.logging import get_logger, log_game, log_warning
|
||||
|
||||
|
||||
# Month name to number mapping
|
||||
MONTH_MAP = {
|
||||
"january": 1, "february": 2, "march": 3, "april": 4,
|
||||
"may": 5, "june": 6, "july": 7, "august": 8,
|
||||
"september": 9, "october": 10, "november": 11, "december": 12,
|
||||
}
|
||||
|
||||
# Basketball Reference month URLs
|
||||
BR_MONTHS = [
|
||||
"october", "november", "december",
|
||||
"january", "february", "march", "april", "may", "june",
|
||||
]
|
||||
|
||||
|
||||
class NBAScraper(BaseScraper):
|
||||
"""NBA schedule scraper with multi-source fallback.
|
||||
|
||||
Sources (in priority order):
|
||||
1. Basketball-Reference - Most reliable, complete historical data
|
||||
2. ESPN API - Good for current/future seasons
|
||||
3. CBS Sports - Backup option
|
||||
"""
|
||||
|
||||
def __init__(self, season: int, **kwargs):
|
||||
"""Initialize NBA scraper.
|
||||
|
||||
Args:
|
||||
season: Season start year (e.g., 2025 for 2025-26)
|
||||
"""
|
||||
super().__init__("nba", season, **kwargs)
|
||||
self._team_resolver = get_team_resolver("nba")
|
||||
self._stadium_resolver = get_stadium_resolver("nba")
|
||||
|
||||
def _get_sources(self) -> list[str]:
|
||||
"""Return source list in priority order."""
|
||||
return ["basketball_reference", "espn", "cbs"]
|
||||
|
||||
def _get_source_url(self, source: str, **kwargs) -> str:
|
||||
"""Build URL for a source."""
|
||||
if source == "basketball_reference":
|
||||
month = kwargs.get("month", "october")
|
||||
year = kwargs.get("year", self.season + 1)
|
||||
return f"https://www.basketball-reference.com/leagues/NBA_{year}_games-{month}.html"
|
||||
|
||||
elif source == "espn":
|
||||
date_str = kwargs.get("date", "")
|
||||
return f"https://site.api.espn.com/apis/site/v2/sports/basketball/nba/scoreboard?dates={date_str}"
|
||||
|
||||
elif source == "cbs":
|
||||
return "https://www.cbssports.com/nba/schedule/"
|
||||
|
||||
raise ValueError(f"Unknown source: {source}")
|
||||
|
||||
def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
|
||||
"""Scrape games from a specific source."""
|
||||
if source == "basketball_reference":
|
||||
return self._scrape_basketball_reference()
|
||||
elif source == "espn":
|
||||
return self._scrape_espn()
|
||||
elif source == "cbs":
|
||||
return self._scrape_cbs()
|
||||
else:
|
||||
raise ValueError(f"Unknown source: {source}")
|
||||
|
||||
def _scrape_basketball_reference(self) -> list[RawGameData]:
|
||||
"""Scrape games from Basketball-Reference.
|
||||
|
||||
BR organizes games by month with separate pages.
|
||||
Format: https://www.basketball-reference.com/leagues/NBA_YYYY_games-month.html
|
||||
where YYYY is the ending year of the season.
|
||||
"""
|
||||
all_games: list[RawGameData] = []
|
||||
end_year = self.season + 1
|
||||
|
||||
for month in BR_MONTHS:
|
||||
url = self._get_source_url("basketball_reference", month=month, year=end_year)
|
||||
|
||||
try:
|
||||
html = self.session.get_html(url)
|
||||
games = self._parse_basketball_reference(html, url)
|
||||
all_games.extend(games)
|
||||
self._logger.debug(f"Found {len(games)} games in {month}")
|
||||
|
||||
except Exception as e:
|
||||
# Some month pages may not exist (e.g., months with no games scheduled)
|
||||
self._logger.debug(f"No data for {month}: {e}")
|
||||
continue
|
||||
|
||||
return all_games
|
||||
|
||||
def _parse_basketball_reference(
|
||||
self,
|
||||
html: str,
|
||||
source_url: str,
|
||||
) -> list[RawGameData]:
|
||||
"""Parse Basketball-Reference schedule HTML.
|
||||
|
||||
Table structure:
|
||||
- th[data-stat="date_game"]: Date (e.g., "Tue, Oct 22, 2024")
|
||||
- td[data-stat="visitor_team_name"]: Away team
|
||||
- td[data-stat="home_team_name"]: Home team
|
||||
- td[data-stat="visitor_pts"]: Away score
|
||||
- td[data-stat="home_pts"]: Home score
|
||||
- td[data-stat="arena_name"]: Arena/stadium name
|
||||
"""
|
||||
soup = BeautifulSoup(html, "lxml")
|
||||
games: list[RawGameData] = []
|
||||
|
||||
# Find the schedule table
|
||||
table = soup.find("table", id="schedule")
|
||||
if not table:
|
||||
return games
|
||||
|
||||
tbody = table.find("tbody")
|
||||
if not tbody:
|
||||
return games
|
||||
|
||||
for row in tbody.find_all("tr"):
|
||||
# Skip header rows
|
||||
if "thead" in (row.get("class") or []):
|
||||
continue
|
||||
|
||||
try:
|
||||
game = self._parse_br_row(row, source_url)
|
||||
if game:
|
||||
games.append(game)
|
||||
except Exception as e:
|
||||
self._logger.debug(f"Failed to parse row: {e}")
|
||||
continue
|
||||
|
||||
return games
|
||||
|
||||
def _parse_br_row(
|
||||
self,
|
||||
row,
|
||||
source_url: str,
|
||||
) -> Optional[RawGameData]:
|
||||
"""Parse a single Basketball-Reference table row."""
|
||||
# Get date
|
||||
date_cell = row.find("th", {"data-stat": "date_game"})
|
||||
if not date_cell:
|
||||
return None
|
||||
|
||||
date_text = date_cell.get_text(strip=True)
|
||||
if not date_text:
|
||||
return None
|
||||
|
||||
# Parse date (format: "Tue, Oct 22, 2024")
|
||||
try:
|
||||
game_date = datetime.strptime(date_text, "%a, %b %d, %Y")
|
||||
except ValueError:
|
||||
# Try alternative format
|
||||
try:
|
||||
game_date = datetime.strptime(date_text, "%B %d, %Y")
|
||||
except ValueError:
|
||||
self._logger.debug(f"Could not parse date: {date_text}")
|
||||
return None
|
||||
|
||||
# Get teams
|
||||
away_cell = row.find("td", {"data-stat": "visitor_team_name"})
|
||||
home_cell = row.find("td", {"data-stat": "home_team_name"})
|
||||
|
||||
if not away_cell or not home_cell:
|
||||
return None
|
||||
|
||||
away_team = away_cell.get_text(strip=True)
|
||||
home_team = home_cell.get_text(strip=True)
|
||||
|
||||
if not away_team or not home_team:
|
||||
return None
|
||||
|
||||
# Get scores (may be empty for future games)
|
||||
away_score_cell = row.find("td", {"data-stat": "visitor_pts"})
|
||||
home_score_cell = row.find("td", {"data-stat": "home_pts"})
|
||||
|
||||
away_score = None
|
||||
home_score = None
|
||||
|
||||
if away_score_cell and away_score_cell.get_text(strip=True):
|
||||
try:
|
||||
away_score = int(away_score_cell.get_text(strip=True))
|
||||
except ValueError:
|
||||
pass
|
||||
|
||||
if home_score_cell and home_score_cell.get_text(strip=True):
|
||||
try:
|
||||
home_score = int(home_score_cell.get_text(strip=True))
|
||||
except ValueError:
|
||||
pass
|
||||
|
||||
# Get arena
|
||||
arena_cell = row.find("td", {"data-stat": "arena_name"})
|
||||
arena = arena_cell.get_text(strip=True) if arena_cell else None
|
||||
|
||||
# Determine status
|
||||
status = "final" if home_score is not None else "scheduled"
|
||||
|
||||
# Check for postponed/cancelled
|
||||
notes_cell = row.find("td", {"data-stat": "game_remarks"})
|
||||
if notes_cell:
|
||||
notes = notes_cell.get_text(strip=True).lower()
|
||||
if "postponed" in notes:
|
||||
status = "postponed"
|
||||
elif "cancelled" in notes or "canceled" in notes:
|
||||
status = "cancelled"
|
||||
|
||||
return RawGameData(
|
||||
game_date=game_date,
|
||||
home_team_raw=home_team,
|
||||
away_team_raw=away_team,
|
||||
stadium_raw=arena,
|
||||
home_score=home_score,
|
||||
away_score=away_score,
|
||||
status=status,
|
||||
source_url=source_url,
|
||||
)
|
||||
|
||||
def _scrape_espn(self) -> list[RawGameData]:
|
||||
"""Scrape games from ESPN API.
|
||||
|
||||
Each ESPN scoreboard request covers a single date (dates=YYYYMMDD),
|
||||
so we iterate through each day of the season.
|
||||
"""
|
||||
all_games: list[RawGameData] = []
|
||||
|
||||
for year, month in self._get_season_months():
|
||||
# Get number of days in month
|
||||
if month == 12:
|
||||
next_month = date(year + 1, 1, 1)
|
||||
else:
|
||||
next_month = date(year, month + 1, 1)
|
||||
|
||||
days_in_month = (next_month - date(year, month, 1)).days
|
||||
|
||||
for day in range(1, days_in_month + 1):
|
||||
try:
|
||||
game_date = date(year, month, day)
|
||||
date_str = game_date.strftime("%Y%m%d")
|
||||
url = self._get_source_url("espn", date=date_str)
|
||||
|
||||
data = self.session.get_json(url)
|
||||
games = self._parse_espn_response(data, url)
|
||||
all_games.extend(games)
|
||||
|
||||
except Exception as e:
|
||||
self._logger.debug(f"ESPN error for {year}-{month}-{day}: {e}")
|
||||
continue
|
||||
|
||||
return all_games
|
||||
|
||||
def _parse_espn_response(
|
||||
self,
|
||||
data: dict,
|
||||
source_url: str,
|
||||
) -> list[RawGameData]:
|
||||
"""Parse ESPN API response."""
|
||||
games: list[RawGameData] = []
|
||||
|
||||
events = data.get("events", [])
|
||||
|
||||
for event in events:
|
||||
try:
|
||||
game = self._parse_espn_event(event, source_url)
|
||||
if game:
|
||||
games.append(game)
|
||||
except Exception as e:
|
||||
self._logger.debug(f"Failed to parse ESPN event: {e}")
|
||||
continue
|
||||
|
||||
return games
|
||||
|
||||
def _parse_espn_event(
|
||||
self,
|
||||
event: dict,
|
||||
source_url: str,
|
||||
) -> Optional[RawGameData]:
|
||||
"""Parse a single ESPN event."""
|
||||
# Get date
|
||||
date_str = event.get("date", "")
|
||||
if not date_str:
|
||||
return None
|
||||
|
||||
try:
|
||||
# ESPN uses ISO format
|
||||
game_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
|
||||
except ValueError:
|
||||
return None
|
||||
|
||||
# Get competitions (usually just one)
|
||||
competitions = event.get("competitions", [])
|
||||
if not competitions:
|
||||
return None
|
||||
|
||||
competition = competitions[0]
|
||||
|
||||
# Get teams
|
||||
competitors = competition.get("competitors", [])
|
||||
if len(competitors) != 2:
|
||||
return None
|
||||
|
||||
home_team = None
|
||||
away_team = None
|
||||
home_score = None
|
||||
away_score = None
|
||||
|
||||
for competitor in competitors:
|
||||
team_info = competitor.get("team", {})
|
||||
team_name = team_info.get("displayName", "")
|
||||
is_home = competitor.get("homeAway") == "home"
|
||||
score = competitor.get("score")
|
||||
|
||||
if score is not None:
|
||||
try:
|
||||
score = int(score)
|
||||
except (ValueError, TypeError):
|
||||
score = None
|
||||
|
||||
if is_home:
|
||||
home_team = team_name
|
||||
home_score = score
|
||||
else:
|
||||
away_team = team_name
|
||||
away_score = score
|
||||
|
||||
if not home_team or not away_team:
|
||||
return None
|
||||
|
||||
# Get venue
|
||||
venue = competition.get("venue", {})
|
||||
arena = venue.get("fullName")
|
||||
|
||||
# Get status
|
||||
status_info = competition.get("status", {})
|
||||
status_type = status_info.get("type", {})
|
||||
status_name = status_type.get("name", "").lower()
|
||||
|
||||
if status_name == "status_final":
|
||||
status = "final"
|
||||
elif status_name == "status_postponed":
|
||||
status = "postponed"
|
||||
elif status_name == "status_canceled":
|
||||
status = "cancelled"
|
||||
else:
|
||||
status = "scheduled"
|
||||
|
||||
return RawGameData(
|
||||
game_date=game_date,
|
||||
home_team_raw=home_team,
|
||||
away_team_raw=away_team,
|
||||
stadium_raw=arena,
|
||||
home_score=home_score,
|
||||
away_score=away_score,
|
||||
status=status,
|
||||
source_url=source_url,
|
||||
)
|
||||
|
||||
def _scrape_cbs(self) -> list[RawGameData]:
|
||||
"""Scrape games from CBS Sports.
|
||||
|
||||
CBS Sports is a backup source with less structured data.
|
||||
"""
|
||||
# CBS Sports scraping would go here
|
||||
# Not implemented yet: raise rather than return an empty list
|
||||
raise NotImplementedError("CBS scraper not implemented")
|
||||
|
||||
def _normalize_games(
|
||||
self,
|
||||
raw_games: list[RawGameData],
|
||||
) -> tuple[list[Game], list[ManualReviewItem]]:
|
||||
"""Normalize raw games to Game objects with canonical IDs."""
|
||||
games: list[Game] = []
|
||||
review_items: list[ManualReviewItem] = []
|
||||
|
||||
# Track games by date for doubleheader detection
|
||||
games_by_date: dict[str, list[RawGameData]] = {}
|
||||
|
||||
for raw in raw_games:
|
||||
date_key = raw.game_date.strftime("%Y%m%d")
|
||||
matchup_key = f"{date_key}_{raw.away_team_raw}_{raw.home_team_raw}"
|
||||
|
||||
if matchup_key not in games_by_date:
|
||||
games_by_date[matchup_key] = []
|
||||
games_by_date[matchup_key].append(raw)
|
||||
|
||||
# Process games with doubleheader detection
|
||||
for matchup_key, matchup_games in games_by_date.items():
|
||||
is_doubleheader = len(matchup_games) > 1
|
||||
|
||||
for i, raw in enumerate(matchup_games):
|
||||
game_number = (i + 1) if is_doubleheader else None
|
||||
|
||||
game, item_reviews = self._normalize_single_game(raw, game_number)
|
||||
|
||||
if game:
|
||||
games.append(game)
|
||||
log_game(
|
||||
self.sport,
|
||||
game.id,
|
||||
game.home_team_id,
|
||||
game.away_team_id,
|
||||
game.game_date.strftime("%Y-%m-%d"),
|
||||
game.status,
|
||||
)
|
||||
|
||||
review_items.extend(item_reviews)
|
||||
|
||||
return games, review_items
|
||||
|
||||
def _normalize_single_game(
|
||||
self,
|
||||
raw: RawGameData,
|
||||
game_number: Optional[int],
|
||||
) -> tuple[Optional[Game], list[ManualReviewItem]]:
|
||||
"""Normalize a single raw game."""
|
||||
review_items: list[ManualReviewItem] = []
|
||||
|
||||
# Resolve home team
|
||||
home_result = self._team_resolver.resolve(
|
||||
raw.home_team_raw,
|
||||
check_date=raw.game_date.date(),
|
||||
source_url=raw.source_url,
|
||||
)
|
||||
|
||||
if home_result.review_item:
|
||||
review_items.append(home_result.review_item)
|
||||
|
||||
if not home_result.canonical_id:
|
||||
log_warning(f"Could not resolve home team: {raw.home_team_raw}")
|
||||
return None, review_items
|
||||
|
||||
# Resolve away team
|
||||
away_result = self._team_resolver.resolve(
|
||||
raw.away_team_raw,
|
||||
check_date=raw.game_date.date(),
|
||||
source_url=raw.source_url,
|
||||
)
|
||||
|
||||
if away_result.review_item:
|
||||
review_items.append(away_result.review_item)
|
||||
|
||||
if not away_result.canonical_id:
|
||||
log_warning(f"Could not resolve away team: {raw.away_team_raw}")
|
||||
return None, review_items
|
||||
|
||||
# Resolve stadium (optional - use home team's stadium if not found)
|
||||
stadium_id = None
|
||||
|
||||
if raw.stadium_raw:
|
||||
stadium_result = self._stadium_resolver.resolve(
|
||||
raw.stadium_raw,
|
||||
check_date=raw.game_date.date(),
|
||||
source_url=raw.source_url,
|
||||
)
|
||||
|
||||
if stadium_result.review_item:
|
||||
review_items.append(stadium_result.review_item)
|
||||
|
||||
stadium_id = stadium_result.canonical_id
|
||||
|
||||
# If no stadium found, use home team's default stadium
|
||||
if not stadium_id:
|
||||
# Look up home team's stadium from mappings
|
||||
home_abbrev = home_result.canonical_id.split("_")[-1].upper()
|
||||
team_info = self._team_resolver.get_team_info(home_abbrev)
|
||||
|
||||
if team_info:
|
||||
# Fall back to the first NBA stadium whose city appears in the team's city
|
||||
for sid, sinfo in STADIUM_MAPPINGS.get("nba", {}).items():
|
||||
# Match by city
|
||||
if sinfo.city.lower() in team_info[2].lower():
|
||||
stadium_id = sid
|
||||
break
|
||||
|
||||
# Get abbreviations for game ID
|
||||
home_abbrev = self._get_abbreviation(home_result.canonical_id)
|
||||
away_abbrev = self._get_abbreviation(away_result.canonical_id)
|
||||
|
||||
# Generate canonical game ID
|
||||
game_id = generate_game_id(
|
||||
sport=self.sport,
|
||||
season=self.season,
|
||||
away_abbrev=away_abbrev,
|
||||
home_abbrev=home_abbrev,
|
||||
game_date=raw.game_date,
|
||||
game_number=game_number,
|
||||
)
|
||||
|
||||
game = Game(
|
||||
id=game_id,
|
||||
sport=self.sport,
|
||||
season=self.season,
|
||||
home_team_id=home_result.canonical_id,
|
||||
away_team_id=away_result.canonical_id,
|
||||
stadium_id=stadium_id or "",
|
||||
game_date=raw.game_date,
|
||||
game_number=game_number,
|
||||
home_score=raw.home_score,
|
||||
away_score=raw.away_score,
|
||||
status=raw.status,
|
||||
source_url=raw.source_url,
|
||||
raw_home_team=raw.home_team_raw,
|
||||
raw_away_team=raw.away_team_raw,
|
||||
raw_stadium=raw.stadium_raw,
|
||||
)
|
||||
|
||||
return game, review_items
|
||||
|
||||
def _get_abbreviation(self, team_id: str) -> str:
|
||||
"""Extract abbreviation from team ID."""
|
||||
# team_nba_okc -> okc
|
||||
parts = team_id.split("_")
|
||||
return parts[-1] if parts else ""
|
||||
|
||||
def scrape_teams(self) -> list[Team]:
|
||||
"""Get all NBA teams from hardcoded mappings."""
|
||||
teams: list[Team] = []
|
||||
seen: set[str] = set()
|
||||
|
||||
# NBA conference/division structure
|
||||
divisions = {
|
||||
"Atlantic": ("Eastern", ["BOS", "BKN", "NYK", "PHI", "TOR"]),
|
||||
"Central": ("Eastern", ["CHI", "CLE", "DET", "IND", "MIL"]),
|
||||
"Southeast": ("Eastern", ["ATL", "CHA", "MIA", "ORL", "WAS"]),
|
||||
"Northwest": ("Western", ["DEN", "MIN", "OKC", "POR", "UTA"]),
|
||||
"Pacific": ("Western", ["GSW", "LAC", "LAL", "PHX", "SAC"]),
|
||||
"Southwest": ("Western", ["DAL", "HOU", "MEM", "NOP", "SAS"]),
|
||||
}
|
||||
|
||||
# Build reverse lookup
|
||||
team_divisions: dict[str, tuple[str, str]] = {}
|
||||
for div, (conf, abbrevs) in divisions.items():
|
||||
for abbrev in abbrevs:
|
||||
team_divisions[abbrev] = (conf, div)
|
||||
|
||||
for abbrev, (team_id, full_name, city) in TEAM_MAPPINGS.get("nba", {}).items():
|
||||
if team_id in seen:
|
||||
continue
|
||||
seen.add(team_id)
|
||||
|
||||
# Parse full name into city and name parts
|
||||
parts = full_name.split()
|
||||
if len(parts) >= 2:
|
||||
# Handle special cases like "Oklahoma City Thunder"
|
||||
if city == "Oklahoma City":
|
||||
team_name = "Thunder"
|
||||
elif city == "Golden State":
|
||||
team_name = "Warriors"
|
||||
elif city == "San Antonio":
|
||||
team_name = "Spurs"
|
||||
elif city == "New York":
|
||||
team_name = parts[-1] # Knicks
|
||||
elif city == "New Orleans":
|
||||
team_name = "Pelicans"
|
||||
elif city == "Los Angeles":
|
||||
team_name = parts[-1] # Lakers or Clippers
|
||||
else:
|
||||
team_name = parts[-1]
|
||||
else:
|
||||
team_name = full_name
|
||||
|
||||
# Get conference and division
|
||||
conf, div = team_divisions.get(abbrev, (None, None))
|
||||
|
||||
# Get stadium ID
|
||||
stadium_id = None
|
||||
for sid, sinfo in STADIUM_MAPPINGS.get("nba", {}).items():
|
||||
if city.lower() in sinfo.city.lower() or sinfo.city.lower() in city.lower():
|
||||
stadium_id = sid
|
||||
break
|
||||
|
||||
team = Team(
|
||||
id=team_id,
|
||||
sport="nba",
|
||||
city=city,
|
||||
name=team_name,
|
||||
full_name=full_name,
|
||||
abbreviation=abbrev,
|
||||
conference=conf,
|
||||
division=div,
|
||||
stadium_id=stadium_id,
|
||||
)
|
||||
teams.append(team)
|
||||
|
||||
return teams
|
||||
|
||||
def scrape_stadiums(self) -> list[Stadium]:
|
||||
"""Get all NBA stadiums from hardcoded mappings."""
|
||||
stadiums: list[Stadium] = []
|
||||
|
||||
for stadium_id, info in STADIUM_MAPPINGS.get("nba", {}).items():
|
||||
stadium = Stadium(
|
||||
id=stadium_id,
|
||||
sport="nba",
|
||||
name=info.name,
|
||||
city=info.city,
|
||||
state=info.state,
|
||||
country=info.country,
|
||||
latitude=info.latitude,
|
||||
longitude=info.longitude,
|
||||
surface="hardwood",
|
||||
roof_type="dome",
|
||||
)
|
||||
stadiums.append(stadium)
|
||||
|
||||
return stadiums
|
||||
|
||||
|
||||
def create_nba_scraper(season: int) -> NBAScraper:
|
||||
"""Factory function to create an NBA scraper."""
|
||||
return NBAScraper(season=season)
|
||||
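The doubleheader numbering used by `_normalize_games` in the NBA scraper can be sketched in isolation. This is a minimal illustration, not the package's code: `RawGame` and `number_doubleheaders` are hypothetical names, and `RawGame` carries only the three fields the grouping key needs. Games are numbered in input order, as in the original loop, so the caller is assumed to pass them in chronological order.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class RawGame:
    """Hypothetical stand-in for RawGameData with only the fields used here."""
    game_date: datetime
    home_team_raw: str
    away_team_raw: str


def number_doubleheaders(
    raw_games: list[RawGame],
) -> list[tuple[RawGame, Optional[int]]]:
    """Pair each game with a game number (1, 2, ...) or None for single games."""
    # Group by date plus matchup; dicts preserve insertion order.
    by_matchup: dict[str, list[RawGame]] = defaultdict(list)
    for raw in raw_games:
        key = f"{raw.game_date:%Y%m%d}_{raw.away_team_raw}_{raw.home_team_raw}"
        by_matchup[key].append(raw)

    numbered: list[tuple[RawGame, Optional[int]]] = []
    for matchup_games in by_matchup.values():
        is_doubleheader = len(matchup_games) > 1
        for i, raw in enumerate(matchup_games):
            # Only games that share a key with another game get a number.
            numbered.append((raw, i + 1 if is_doubleheader else None))
    return numbered
```

One thing to keep in mind: the key uses whatever calendar date the `datetime` carries. ESPN timestamps are parsed as UTC in the scraper, so a late-evening game may fall on the next UTC date.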
@@ -0,0 +1,586 @@
|
||||
"""NFL scraper implementation with multi-source fallback."""
|
||||
|
||||
from datetime import datetime, date
|
||||
from typing import Optional
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
from .base import BaseScraper, RawGameData, ScrapeResult
|
||||
from ..models.game import Game
|
||||
from ..models.team import Team
|
||||
from ..models.stadium import Stadium
|
||||
from ..models.aliases import ManualReviewItem
|
||||
from ..normalizers.canonical_id import generate_game_id
|
||||
from ..normalizers.team_resolver import (
|
||||
TeamResolver,
|
||||
TEAM_MAPPINGS,
|
||||
get_team_resolver,
|
||||
)
|
||||
from ..normalizers.stadium_resolver import (
|
||||
StadiumResolver,
|
||||
STADIUM_MAPPINGS,
|
||||
get_stadium_resolver,
|
||||
)
|
||||
from ..utils.logging import get_logger, log_game, log_warning
|
||||
|
||||
|
||||
# International game locations to filter out
|
||||
INTERNATIONAL_LOCATIONS = {"London", "Mexico City", "Frankfurt", "Munich", "São Paulo"}
|
||||
|
||||
|
||||
class NFLScraper(BaseScraper):
|
||||
"""NFL schedule scraper with multi-source fallback.
|
||||
|
||||
Sources (in priority order):
|
||||
1. ESPN API - Most reliable for NFL
|
||||
2. Pro-Football-Reference - Complete historical data
|
||||
3. CBS Sports - Backup option
|
||||
"""
|
||||
|
||||
def __init__(self, season: int, **kwargs):
|
||||
"""Initialize NFL scraper.
|
||||
|
||||
Args:
|
||||
season: Season year (e.g., 2025 for 2025 season)
|
||||
"""
|
||||
super().__init__("nfl", season, **kwargs)
|
||||
self._team_resolver = get_team_resolver("nfl")
|
||||
self._stadium_resolver = get_stadium_resolver("nfl")
|
||||
|
||||
def _get_sources(self) -> list[str]:
|
||||
"""Return source list in priority order."""
|
||||
return ["espn", "pro_football_reference", "cbs"]
|
||||
|
||||
def _get_source_url(self, source: str, **kwargs) -> str:
|
||||
"""Build URL for a source."""
|
||||
if source == "espn":
|
||||
week = kwargs.get("week", 1)
|
||||
season_type = kwargs.get("season_type", 2) # 1=preseason, 2=regular, 3=postseason
|
||||
return f"https://site.api.espn.com/apis/site/v2/sports/football/nfl/scoreboard?dates={self.season}&seasontype={season_type}&week={week}"
|
||||
|
||||
elif source == "pro_football_reference":
|
||||
return f"https://www.pro-football-reference.com/years/{self.season}/games.htm"
|
||||
|
||||
elif source == "cbs":
|
||||
return "https://www.cbssports.com/nfl/schedule/"
|
||||
|
||||
raise ValueError(f"Unknown source: {source}")
|
||||
|
||||
def _get_season_months(self) -> list[tuple[int, int]]:
|
||||
"""Get the months to scrape for NFL season.
|
||||
|
||||
NFL season runs September through February.
|
||||
"""
|
||||
months = []
|
||||
|
||||
# Regular season months
|
||||
for month in range(9, 13): # Sept-Dec
|
||||
months.append((self.season, month))
|
||||
|
||||
# Playoff months
|
||||
for month in range(1, 3): # Jan-Feb
|
||||
months.append((self.season + 1, month))
|
||||
|
||||
return months
|
||||
|
||||
def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
|
||||
"""Scrape games from a specific source."""
|
||||
if source == "espn":
|
||||
return self._scrape_espn()
|
||||
elif source == "pro_football_reference":
|
||||
return self._scrape_pro_football_reference()
|
||||
elif source == "cbs":
|
||||
return self._scrape_cbs()
|
||||
else:
|
||||
raise ValueError(f"Unknown source: {source}")
|
||||
|
||||
def _scrape_espn(self) -> list[RawGameData]:
|
||||
"""Scrape games from ESPN API.
|
||||
|
||||
ESPN NFL API uses week numbers.
|
||||
"""
|
||||
all_games: list[RawGameData] = []
|
||||
|
||||
# Scrape preseason (4 weeks)
|
||||
for week in range(1, 5):
|
||||
try:
|
||||
url = self._get_source_url("espn", week=week, season_type=1)
|
||||
data = self.session.get_json(url)
|
||||
games = self._parse_espn_response(data, url)
|
||||
all_games.extend(games)
|
||||
except Exception as e:
|
||||
self._logger.debug(f"ESPN preseason week {week} error: {e}")
|
||||
continue
|
||||
|
||||
# Scrape regular season (18 weeks)
|
||||
for week in range(1, 19):
|
||||
try:
|
||||
url = self._get_source_url("espn", week=week, season_type=2)
|
||||
data = self.session.get_json(url)
|
||||
games = self._parse_espn_response(data, url)
|
||||
all_games.extend(games)
|
||||
self._logger.debug(f"Found {len(games)} games in week {week}")
|
||||
except Exception as e:
|
||||
self._logger.debug(f"ESPN regular season week {week} error: {e}")
|
||||
continue
|
||||
|
||||
# Scrape postseason (weeks 1-5; ESPN appears to list the Super Bowl as week 5)
|
||||
for week in range(1, 6):
|
||||
try:
|
||||
url = self._get_source_url("espn", week=week, season_type=3)
|
||||
data = self.session.get_json(url)
|
||||
games = self._parse_espn_response(data, url)
|
||||
all_games.extend(games)
|
||||
except Exception as e:
|
||||
self._logger.debug(f"ESPN postseason week {week} error: {e}")
|
||||
continue
|
||||
|
||||
return all_games
|
||||
|
||||
def _parse_espn_response(
|
||||
self,
|
||||
data: dict,
|
||||
source_url: str,
|
||||
) -> list[RawGameData]:
|
||||
"""Parse ESPN API response."""
|
||||
games: list[RawGameData] = []
|
||||
|
||||
events = data.get("events", [])
|
||||
|
||||
for event in events:
|
||||
try:
|
||||
game = self._parse_espn_event(event, source_url)
|
||||
if game:
|
||||
# Filter international games (substring match of city names against the venue name)
|
||||
if game.stadium_raw and any(loc in game.stadium_raw for loc in INTERNATIONAL_LOCATIONS):
|
||||
self._logger.debug(f"Skipping international game: {game.stadium_raw}")
|
||||
continue
|
||||
games.append(game)
|
||||
except Exception as e:
|
||||
self._logger.debug(f"Failed to parse ESPN event: {e}")
|
||||
continue
|
||||
|
||||
return games
|
||||
|
||||
def _parse_espn_event(
|
||||
self,
|
||||
event: dict,
|
||||
source_url: str,
|
||||
) -> Optional[RawGameData]:
|
||||
"""Parse a single ESPN event."""
|
||||
# Get date
|
||||
date_str = event.get("date", "")
|
||||
if not date_str:
|
||||
return None
|
||||
|
||||
try:
|
||||
game_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
|
||||
except ValueError:
|
||||
return None
|
||||
|
||||
# Get competitions
|
||||
competitions = event.get("competitions", [])
|
||||
if not competitions:
|
||||
return None
|
||||
|
||||
competition = competitions[0]
|
||||
|
||||
# Check for neutral site (international games)
|
||||
if competition.get("neutralSite"):
|
||||
venue = competition.get("venue", {})
|
||||
venue_city = venue.get("address", {}).get("city", "")
|
||||
if venue_city in INTERNATIONAL_LOCATIONS:
|
||||
return None
|
||||
|
||||
# Get teams
|
||||
competitors = competition.get("competitors", [])
|
||||
if len(competitors) != 2:
|
||||
return None
|
||||
|
||||
home_team = None
|
||||
away_team = None
|
||||
home_score = None
|
||||
away_score = None
|
||||
|
||||
for competitor in competitors:
|
||||
team_info = competitor.get("team", {})
|
||||
team_name = team_info.get("displayName", "")
|
||||
is_home = competitor.get("homeAway") == "home"
|
||||
score = competitor.get("score")
|
||||
|
||||
if score is not None:
|
||||
try:
|
||||
score = int(score)
|
||||
except (ValueError, TypeError):
|
||||
score = None
|
||||
|
||||
if is_home:
|
||||
home_team = team_name
|
||||
home_score = score
|
||||
else:
|
||||
away_team = team_name
|
||||
away_score = score
|
||||
|
||||
if not home_team or not away_team:
|
||||
return None
|
||||
|
||||
# Get venue
|
||||
venue = competition.get("venue", {})
|
||||
stadium = venue.get("fullName")
|
||||
|
||||
# Get status
|
||||
status_info = competition.get("status", {})
|
||||
status_type = status_info.get("type", {})
|
||||
status_name = status_type.get("name", "").lower()
|
||||
|
||||
if status_name == "status_final":
|
||||
status = "final"
|
||||
elif status_name == "status_postponed":
|
||||
status = "postponed"
|
||||
elif status_name == "status_canceled":
|
||||
status = "cancelled"
|
||||
else:
|
||||
status = "scheduled"
|
||||
|
||||
return RawGameData(
|
||||
game_date=game_date,
|
||||
home_team_raw=home_team,
|
||||
away_team_raw=away_team,
|
||||
stadium_raw=stadium,
|
||||
home_score=home_score,
|
||||
away_score=away_score,
|
||||
status=status,
|
||||
source_url=source_url,
|
||||
)
|
||||
|
||||
def _scrape_pro_football_reference(self) -> list[RawGameData]:
|
||||
"""Scrape games from Pro-Football-Reference.
|
||||
|
||||
PFR has a single schedule page per season.
|
||||
"""
|
||||
url = self._get_source_url("pro_football_reference")
|
||||
|
||||
try:
|
||||
html = self.session.get_html(url)
|
||||
games = self._parse_pfr(html, url)
|
||||
return games
|
||||
except Exception as e:
|
||||
self._logger.error(f"Failed to scrape Pro-Football-Reference: {e}")
|
||||
raise
|
||||
|
||||
def _parse_pfr(
|
||||
self,
|
||||
html: str,
|
||||
source_url: str,
|
||||
) -> list[RawGameData]:
|
||||
"""Parse Pro-Football-Reference schedule HTML."""
|
||||
soup = BeautifulSoup(html, "lxml")
|
||||
games: list[RawGameData] = []
|
||||
|
||||
# Find the schedule table
|
||||
table = soup.find("table", id="games")
|
||||
if not table:
|
||||
return games
|
||||
|
||||
tbody = table.find("tbody")
|
||||
if not tbody:
|
||||
return games
|
||||
|
||||
for row in tbody.find_all("tr"):
|
||||
# Skip header rows
|
||||
if row.get("class") and "thead" in row.get("class", []):
|
||||
continue
|
||||
|
||||
try:
|
||||
game = self._parse_pfr_row(row, source_url)
|
||||
if game:
|
||||
games.append(game)
|
||||
except Exception as e:
|
||||
self._logger.debug(f"Failed to parse PFR row: {e}")
|
||||
continue
|
||||
|
||||
return games
|
||||
|
||||
def _parse_pfr_row(
|
||||
self,
|
||||
row,
|
||||
source_url: str,
|
||||
) -> Optional[RawGameData]:
|
||||
"""Parse a single Pro-Football-Reference table row."""
|
||||
# Get date
|
||||
date_cell = row.find("td", {"data-stat": "game_date"})
|
||||
if not date_cell:
|
||||
return None
|
||||
|
||||
date_text = date_cell.get_text(strip=True)
|
||||
if not date_text:
|
||||
return None
|
||||
|
||||
# Parse date
|
||||
try:
|
||||
# PFR uses YYYY-MM-DD format
|
||||
game_date = datetime.strptime(date_text, "%Y-%m-%d")
|
||||
except ValueError:
|
||||
return None
|
||||
|
||||
# Get teams
|
||||
winner_cell = row.find("td", {"data-stat": "winner"})
|
||||
loser_cell = row.find("td", {"data-stat": "loser"})
|
||||
|
||||
if not winner_cell or not loser_cell:
|
||||
return None
|
||||
|
||||
winner = winner_cell.get_text(strip=True)
|
||||
loser = loser_cell.get_text(strip=True)
|
||||
|
||||
if not winner or not loser:
|
||||
return None
|
||||
|
||||
# Determine home/away based on @ symbol
|
||||
game_location = row.find("td", {"data-stat": "game_location"})
|
||||
at_home = game_location and "@" in game_location.get_text()
|
||||
|
||||
if at_home:
|
||||
home_team = loser
|
||||
away_team = winner
|
||||
else:
|
||||
home_team = winner
|
||||
away_team = loser
|
||||
|
||||
# Get scores
|
||||
pts_win_cell = row.find("td", {"data-stat": "pts_win"})
|
||||
pts_lose_cell = row.find("td", {"data-stat": "pts_lose"})
|
||||
|
||||
home_score = None
|
||||
away_score = None
|
||||
|
||||
if pts_win_cell and pts_lose_cell:
|
||||
try:
|
||||
winner_pts = int(pts_win_cell.get_text(strip=True))
|
||||
loser_pts = int(pts_lose_cell.get_text(strip=True))
|
||||
|
||||
if at_home:
|
||||
home_score = loser_pts
|
||||
away_score = winner_pts
|
||||
else:
|
||||
home_score = winner_pts
|
||||
away_score = loser_pts
|
||||
except ValueError:
|
||||
pass
|
||||
|
||||
# Determine status
|
||||
status = "final" if home_score is not None else "scheduled"
|
||||
|
||||
return RawGameData(
|
||||
game_date=game_date,
|
||||
home_team_raw=home_team,
|
||||
away_team_raw=away_team,
|
||||
stadium_raw=None, # PFR doesn't always have stadium
|
||||
home_score=home_score,
|
||||
away_score=away_score,
|
||||
status=status,
|
||||
source_url=source_url,
|
||||
)
|
||||
|
||||
def _scrape_cbs(self) -> list[RawGameData]:
|
||||
"""Scrape games from CBS Sports."""
|
||||
raise NotImplementedError("CBS scraper not implemented")
|
||||
|
||||
def _normalize_games(
|
||||
self,
|
||||
raw_games: list[RawGameData],
|
||||
) -> tuple[list[Game], list[ManualReviewItem]]:
|
||||
"""Normalize raw games to Game objects with canonical IDs."""
|
||||
games: list[Game] = []
|
||||
review_items: list[ManualReviewItem] = []
|
||||
|
||||
for raw in raw_games:
|
||||
game, item_reviews = self._normalize_single_game(raw)
|
||||
|
||||
if game:
|
||||
games.append(game)
|
||||
log_game(
|
||||
self.sport,
|
||||
game.id,
|
||||
game.home_team_id,
|
||||
game.away_team_id,
|
||||
game.game_date.strftime("%Y-%m-%d"),
|
||||
game.status,
|
||||
)
|
||||
|
||||
review_items.extend(item_reviews)
|
||||
|
||||
return games, review_items
|
||||
|
||||
def _normalize_single_game(
|
||||
self,
|
||||
raw: RawGameData,
|
||||
) -> tuple[Optional[Game], list[ManualReviewItem]]:
|
||||
"""Normalize a single raw game."""
|
||||
review_items: list[ManualReviewItem] = []
|
||||
|
||||
# Resolve home team
|
||||
home_result = self._team_resolver.resolve(
|
||||
raw.home_team_raw,
|
||||
check_date=raw.game_date.date(),
|
||||
source_url=raw.source_url,
|
||||
)
|
||||
|
||||
if home_result.review_item:
|
||||
review_items.append(home_result.review_item)
|
||||
|
||||
if not home_result.canonical_id:
|
||||
log_warning(f"Could not resolve home team: {raw.home_team_raw}")
|
||||
return None, review_items
|
||||
|
||||
# Resolve away team
|
||||
away_result = self._team_resolver.resolve(
|
||||
raw.away_team_raw,
|
||||
check_date=raw.game_date.date(),
|
||||
source_url=raw.source_url,
|
||||
)
|
||||
|
||||
if away_result.review_item:
|
||||
review_items.append(away_result.review_item)
|
||||
|
||||
if not away_result.canonical_id:
|
||||
log_warning(f"Could not resolve away team: {raw.away_team_raw}")
|
||||
return None, review_items
|
||||
|
||||
# Resolve stadium
|
||||
stadium_id = None
|
||||
|
||||
if raw.stadium_raw:
|
||||
stadium_result = self._stadium_resolver.resolve(
|
||||
raw.stadium_raw,
|
||||
check_date=raw.game_date.date(),
|
||||
source_url=raw.source_url,
|
||||
)
|
||||
|
||||
if stadium_result.review_item:
|
||||
review_items.append(stadium_result.review_item)
|
||||
|
||||
stadium_id = stadium_result.canonical_id
|
||||
|
||||
# Get abbreviations for game ID
|
||||
home_abbrev = self._get_abbreviation(home_result.canonical_id)
|
||||
away_abbrev = self._get_abbreviation(away_result.canonical_id)
|
||||
|
||||
# Generate canonical game ID
|
||||
game_id = generate_game_id(
|
||||
sport=self.sport,
|
||||
season=self.season,
|
||||
away_abbrev=away_abbrev,
|
||||
home_abbrev=home_abbrev,
|
||||
game_date=raw.game_date,
|
||||
game_number=None, # NFL doesn't have doubleheaders
|
||||
)
|
||||
|
||||
game = Game(
|
||||
id=game_id,
|
||||
sport=self.sport,
|
||||
season=self.season,
|
||||
home_team_id=home_result.canonical_id,
|
||||
away_team_id=away_result.canonical_id,
|
||||
stadium_id=stadium_id or "",
|
||||
game_date=raw.game_date,
|
||||
game_number=None,
|
||||
home_score=raw.home_score,
|
||||
away_score=raw.away_score,
|
||||
status=raw.status,
|
||||
source_url=raw.source_url,
|
||||
raw_home_team=raw.home_team_raw,
|
||||
raw_away_team=raw.away_team_raw,
|
||||
raw_stadium=raw.stadium_raw,
|
||||
)
|
||||
|
||||
return game, review_items
|
||||
|
||||
def _get_abbreviation(self, team_id: str) -> str:
|
||||
"""Extract abbreviation from team ID."""
|
||||
parts = team_id.split("_")
|
||||
return parts[-1] if parts else ""
|
||||
|
||||
def scrape_teams(self) -> list[Team]:
|
||||
"""Get all NFL teams from hardcoded mappings."""
|
||||
teams: list[Team] = []
|
||||
seen: set[str] = set()
|
||||
|
||||
# NFL conference/division structure
|
||||
divisions = {
|
||||
"AFC East": ("AFC", ["BUF", "MIA", "NE", "NYJ"]),
|
||||
"AFC North": ("AFC", ["BAL", "CIN", "CLE", "PIT"]),
|
||||
"AFC South": ("AFC", ["HOU", "IND", "JAX", "TEN"]),
|
||||
"AFC West": ("AFC", ["DEN", "KC", "LV", "LAC"]),
|
||||
"NFC East": ("NFC", ["DAL", "NYG", "PHI", "WAS"]),
|
||||
"NFC North": ("NFC", ["CHI", "DET", "GB", "MIN"]),
|
||||
"NFC South": ("NFC", ["ATL", "CAR", "NO", "TB"]),
|
||||
"NFC West": ("NFC", ["ARI", "LAR", "SF", "SEA"]),
|
||||
}
|
||||
|
||||
# Build reverse lookup
|
||||
team_divisions: dict[str, tuple[str, str]] = {}
|
||||
for div, (conf, abbrevs) in divisions.items():
|
||||
for abbrev in abbrevs:
|
||||
team_divisions[abbrev] = (conf, div)
|
||||
|
||||
for abbrev, (team_id, full_name, city) in TEAM_MAPPINGS.get("nfl", {}).items():
|
||||
if team_id in seen:
|
||||
continue
|
||||
seen.add(team_id)
|
||||
|
||||
# Parse team name
|
||||
parts = full_name.split()
|
||||
team_name = parts[-1] if parts else full_name
|
||||
|
||||
# Get conference and division
|
||||
conf, div = team_divisions.get(abbrev, (None, None))
|
||||
|
||||
# Get stadium ID
|
||||
stadium_id = None
|
||||
nfl_stadiums = STADIUM_MAPPINGS.get("nfl", {})
|
||||
for sid, sinfo in nfl_stadiums.items():
|
||||
if city.lower() in sinfo.city.lower() or sinfo.city.lower() in city.lower():
|
||||
stadium_id = sid
|
||||
break
|
||||
|
||||
team = Team(
|
||||
id=team_id,
|
||||
sport="nfl",
|
||||
city=city,
|
||||
name=team_name,
|
||||
full_name=full_name,
|
||||
abbreviation=abbrev,
|
||||
conference=conf,
|
||||
division=div,
|
||||
stadium_id=stadium_id,
|
||||
)
|
||||
teams.append(team)
|
||||
|
||||
return teams
|
||||
|
||||
def scrape_stadiums(self) -> list[Stadium]:
|
||||
"""Get all NFL stadiums from hardcoded mappings."""
|
||||
stadiums: list[Stadium] = []
|
||||
|
||||
nfl_stadiums = STADIUM_MAPPINGS.get("nfl", {})
|
||||
for stadium_id, info in nfl_stadiums.items():
|
||||
stadium = Stadium(
|
||||
id=stadium_id,
|
||||
sport="nfl",
|
||||
name=info.name,
|
||||
city=info.city,
|
||||
state=info.state,
|
||||
country=info.country,
|
||||
latitude=info.latitude,
|
||||
longitude=info.longitude,
|
||||
surface="turf", # Placeholder default; actual surface varies by stadium
|
||||
roof_type="open", # Placeholder default; some NFL stadiums have a roof
|
||||
)
|
||||
stadiums.append(stadium)
|
||||
|
||||
return stadiums
|
||||
|
||||
|
||||
def create_nfl_scraper(season: int) -> NFLScraper:
|
||||
"""Factory function to create an NFL scraper."""
|
||||
return NFLScraper(season=season)
|
||||
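The winner/loser to home/away mapping in `_parse_pfr_row` is easy to get backwards, so here is a small standalone sketch of it. It assumes, as the scraper does, that an `@` in the `game_location` cell means the winner was the visiting team. `Matchup` and `assign_home_away` are hypothetical names introduced for illustration only.

```python
from typing import NamedTuple, Optional


class Matchup(NamedTuple):
    """Hypothetical result type: teams and scores from the home/away view."""
    home_team: str
    away_team: str
    home_score: Optional[int]
    away_score: Optional[int]


def assign_home_away(
    winner: str,
    loser: str,
    location_marker: str,
    pts_win: Optional[int] = None,
    pts_lose: Optional[int] = None,
) -> Matchup:
    """Convert a winner/loser row into home/away teams and scores."""
    # "@" marks a game the winner played on the road, so the loser was home.
    winner_was_away = "@" in location_marker
    if winner_was_away:
        return Matchup(loser, winner, pts_lose, pts_win)
    return Matchup(winner, loser, pts_win, pts_lose)
```

Any other marker, including an empty cell, is treated as a home win for the winner. That matches the scraper's behavior and means neutral-site rows are not distinguished.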
@@ -0,0 +1,655 @@
|
||||
"""NHL scraper implementation with multi-source fallback."""
|
||||
|
||||
from datetime import datetime, date
|
||||
from typing import Optional
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
from .base import BaseScraper, RawGameData, ScrapeResult
|
||||
from ..models.game import Game
|
||||
from ..models.team import Team
|
||||
from ..models.stadium import Stadium
|
||||
from ..models.aliases import ManualReviewItem
|
||||
from ..normalizers.canonical_id import generate_game_id
|
||||
from ..normalizers.team_resolver import (
|
||||
TeamResolver,
|
||||
TEAM_MAPPINGS,
|
||||
get_team_resolver,
|
||||
)
|
||||
from ..normalizers.stadium_resolver import (
|
||||
StadiumResolver,
|
||||
STADIUM_MAPPINGS,
|
||||
get_stadium_resolver,
|
||||
)
|
||||
from ..utils.logging import get_logger, log_game, log_warning
|
||||
|
||||
|
||||
# International game locations to filter out
|
||||
INTERNATIONAL_LOCATIONS = {"Prague", "Stockholm", "Helsinki", "Tampere", "Gothenburg"}
|
||||
|
||||
# Hockey Reference month URLs
|
||||
HR_MONTHS = [
|
||||
"october", "november", "december",
|
||||
"january", "february", "march", "april", "may", "june",
|
||||
]
|
||||
|
||||
|
||||
class NHLScraper(BaseScraper):
|
||||
"""NHL schedule scraper with multi-source fallback.
|
||||
|
||||
Sources (in priority order):
|
||||
1. Hockey-Reference - Most reliable for NHL
|
||||
2. NHL API - Official NHL data
|
||||
3. ESPN API - Backup option
|
||||
"""
|
||||
|
||||
def __init__(self, season: int, **kwargs):
|
||||
"""Initialize NHL scraper.
|
||||
|
||||
Args:
|
||||
season: Season start year (e.g., 2025 for 2025-26)
|
||||
"""
|
||||
super().__init__("nhl", season, **kwargs)
|
||||
self._team_resolver = get_team_resolver("nhl")
|
||||
self._stadium_resolver = get_stadium_resolver("nhl")
|
||||
|
||||
def _get_sources(self) -> list[str]:
|
||||
"""Return source list in priority order."""
|
||||
return ["hockey_reference", "nhl_api", "espn"]
|
||||
|
||||
def _get_source_url(self, source: str, **kwargs) -> str:
|
||||
"""Build URL for a source."""
|
||||
if source == "hockey_reference":
|
||||
month = kwargs.get("month", "october")
|
||||
year = kwargs.get("year", self.season + 1)
|
||||
return f"https://www.hockey-reference.com/leagues/NHL_{year}_games.html"
|
||||
|
||||
elif source == "nhl_api":
|
||||
start_date = kwargs.get("start_date", "")
|
||||
end_date = kwargs.get("end_date", "")
|
||||
return f"https://api-web.nhle.com/v1/schedule/{start_date}"
|
||||
|
||||
elif source == "espn":
|
||||
date_str = kwargs.get("date", "")
|
||||
return f"https://site.api.espn.com/apis/site/v2/sports/hockey/nhl/scoreboard?dates={date_str}"
|
||||
|
||||
raise ValueError(f"Unknown source: {source}")
|
||||
|
||||
def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
|
||||
"""Scrape games from a specific source."""
|
||||
if source == "hockey_reference":
|
||||
return self._scrape_hockey_reference()
|
||||
elif source == "nhl_api":
|
||||
return self._scrape_nhl_api()
|
||||
elif source == "espn":
|
||||
return self._scrape_espn()
|
||||
else:
|
||||
raise ValueError(f"Unknown source: {source}")
|
||||
|
||||
def _scrape_hockey_reference(self) -> list[RawGameData]:
|
||||
"""Scrape games from Hockey-Reference.
|
||||
|
||||
HR has a single schedule page per season.
|
||||
"""
|
||||
end_year = self.season + 1
|
||||
url = self._get_source_url("hockey_reference", year=end_year)
|
||||
|
||||
try:
|
||||
html = self.session.get_html(url)
|
||||
games = self._parse_hockey_reference(html, url)
|
||||
return games
|
||||
except Exception as e:
|
||||
self._logger.error(f"Failed to scrape Hockey-Reference: {e}")
|
||||
raise
|
||||
|
||||
def _parse_hockey_reference(
|
||||
self,
|
||||
html: str,
|
||||
source_url: str,
|
||||
) -> list[RawGameData]:
|
||||
"""Parse Hockey-Reference schedule HTML."""
|
||||
soup = BeautifulSoup(html, "lxml")
|
||||
games: list[RawGameData] = []
|
||||
|
||||
# Find the schedule table
|
||||
table = soup.find("table", id="games")
|
||||
if not table:
|
||||
return games
|
||||
|
||||
tbody = table.find("tbody")
|
||||
if not tbody:
|
||||
return games
|
||||
|
||||
for row in tbody.find_all("tr"):
|
||||
# Skip header rows
|
||||
if row.get("class") and "thead" in row.get("class", []):
|
||||
continue
|
||||
|
||||
try:
|
||||
game = self._parse_hr_row(row, source_url)
|
||||
if game:
|
||||
# Filter international games (substring match of city names against the venue name)
|
||||
if game.stadium_raw and any(loc in game.stadium_raw for loc in INTERNATIONAL_LOCATIONS):
|
||||
continue
|
||||
games.append(game)
|
||||
except Exception as e:
|
||||
self._logger.debug(f"Failed to parse HR row: {e}")
|
||||
continue
|
||||
|
||||
return games
|
||||
|
||||
def _parse_hr_row(
|
||||
self,
|
||||
row,
|
||||
source_url: str,
|
||||
) -> Optional[RawGameData]:
|
||||
"""Parse a single Hockey-Reference table row."""
|
||||
# Get date
|
||||
date_cell = row.find("th", {"data-stat": "date_game"})
|
||||
if not date_cell:
|
||||
return None
|
||||
|
||||
date_text = date_cell.get_text(strip=True)
|
||||
if not date_text:
|
||||
return None
|
||||
|
||||
# Parse date (format: "2025-10-15")
|
||||
try:
|
||||
game_date = datetime.strptime(date_text, "%Y-%m-%d")
|
||||
except ValueError:
|
||||
return None
|
||||
|
||||
# Get teams
|
||||
visitor_cell = row.find("td", {"data-stat": "visitor_team_name"})
|
||||
home_cell = row.find("td", {"data-stat": "home_team_name"})
|
||||
|
||||
if not visitor_cell or not home_cell:
|
||||
return None
|
||||
|
||||
away_team = visitor_cell.get_text(strip=True)
|
||||
home_team = home_cell.get_text(strip=True)
|
||||
|
||||
if not away_team or not home_team:
|
||||
return None
|
||||
|
||||
# Get scores
|
||||
visitor_goals_cell = row.find("td", {"data-stat": "visitor_goals"})
|
||||
home_goals_cell = row.find("td", {"data-stat": "home_goals"})
|
||||
|
||||
away_score = None
|
||||
home_score = None
|
||||
|
||||
if visitor_goals_cell and visitor_goals_cell.get_text(strip=True):
|
||||
try:
|
||||
away_score = int(visitor_goals_cell.get_text(strip=True))
|
||||
except ValueError:
|
||||
pass
|
||||
|
||||
if home_goals_cell and home_goals_cell.get_text(strip=True):
|
||||
try:
|
||||
home_score = int(home_goals_cell.get_text(strip=True))
|
||||
except ValueError:
|
||||
pass
|
||||
|
||||
# Determine status
|
||||
status = "final" if home_score is not None and away_score is not None else "scheduled"
|
||||
|
||||
# Check for OT/SO
|
||||
overtimes_cell = row.find("td", {"data-stat": "overtimes"})
|
||||
if overtimes_cell:
|
||||
ot_text = overtimes_cell.get_text(strip=True)
|
||||
if ot_text:
|
||||
status = "final" # OT games are still final
|
||||
|
||||
return RawGameData(
|
||||
game_date=game_date,
|
||||
home_team_raw=home_team,
|
||||
away_team_raw=away_team,
|
||||
stadium_raw=None, # HR doesn't have stadium
|
||||
home_score=home_score,
|
||||
away_score=away_score,
|
||||
status=status,
|
||||
source_url=source_url,
|
||||
)
|
||||
|
||||
def _scrape_nhl_api(self) -> list[RawGameData]:
|
||||
"""Scrape games from NHL API."""
|
||||
all_games: list[RawGameData] = []
|
||||
|
||||
for year, month in self._get_season_months():
|
||||
start_date = date(year, month, 1)
|
||||
|
||||
url = self._get_source_url("nhl_api", start_date=start_date.strftime("%Y-%m-%d"))
|
||||
|
||||
try:
|
||||
data = self.session.get_json(url)
|
||||
games = self._parse_nhl_api_response(data, url)
|
||||
all_games.extend(games)
|
||||
except Exception as e:
|
||||
self._logger.debug(f"NHL API error for {year}-{month:02d}: {e}")
|
||||
continue
|
||||
|
||||
return all_games
|
||||
|
||||
def _parse_nhl_api_response(
|
||||
self,
|
||||
data: dict,
|
||||
source_url: str,
|
||||
) -> list[RawGameData]:
|
||||
"""Parse NHL API response."""
|
||||
games: list[RawGameData] = []
|
||||
|
||||
game_weeks = data.get("gameWeek", [])
|
||||
|
||||
for week in game_weeks:
|
||||
for game_day in week.get("games", []):
|
||||
try:
|
||||
game = self._parse_nhl_api_game(game_day, source_url)
|
||||
if game:
|
||||
games.append(game)
|
||||
except Exception as e:
|
||||
self._logger.debug(f"Failed to parse NHL API game: {e}")
|
||||
continue
|
||||
|
||||
return games
|
||||
|
||||
def _parse_nhl_api_game(
|
||||
self,
|
||||
game: dict,
|
||||
source_url: str,
|
||||
) -> Optional[RawGameData]:
|
||||
"""Parse a single NHL API game."""
|
||||
# Get date
|
||||
start_time = game.get("startTimeUTC", "")
|
||||
if not start_time:
|
||||
return None
|
||||
|
||||
try:
|
||||
game_date = datetime.fromisoformat(start_time.replace("Z", "+00:00"))
|
||||
except ValueError:
|
||||
return None
|
||||
|
||||
# Get teams
|
||||
away_team_data = game.get("awayTeam", {})
|
||||
home_team_data = game.get("homeTeam", {})
|
||||
|
||||
away_team = away_team_data.get("placeName", {}).get("default", "")
|
||||
home_team = home_team_data.get("placeName", {}).get("default", "")
|
||||
|
||||
if not away_team or not home_team:
|
||||
# Try full name
|
||||
away_team = away_team_data.get("name", {}).get("default", "")
|
||||
home_team = home_team_data.get("name", {}).get("default", "")
|
||||
|
||||
if not away_team or not home_team:
|
||||
return None
|
||||
|
||||
# Get scores
|
||||
away_score = away_team_data.get("score")
|
||||
home_score = home_team_data.get("score")
|
||||
|
||||
# Get venue
|
||||
venue = game.get("venue", {})
|
||||
stadium = venue.get("default")
|
||||
|
||||
# Get status
|
||||
game_state = game.get("gameState", "").lower()
|
||||
|
||||
if game_state in ["final", "off"]:
|
||||
status = "final"
|
||||
elif game_state == "postponed":
|
||||
status = "postponed"
|
||||
elif game_state in ["cancelled", "canceled"]:
|
||||
status = "cancelled"
|
||||
else:
|
||||
status = "scheduled"
|
||||
|
||||
return RawGameData(
|
||||
game_date=game_date,
|
||||
home_team_raw=home_team,
|
||||
away_team_raw=away_team,
|
||||
stadium_raw=stadium,
|
||||
home_score=home_score,
|
||||
away_score=away_score,
|
||||
status=status,
|
||||
source_url=source_url,
|
||||
)
|
||||
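The `gameState` branching in `_parse_nhl_api_game` can also be read as a lookup table. The following is a minimal sketch, not the commit's code: the names `NHL_STATE_TO_STATUS` and `map_game_state` are hypothetical, and the state strings are only those the method above already checks.

```python
# Hypothetical lookup table mirroring the if/elif chain in _parse_nhl_api_game.
NHL_STATE_TO_STATUS = {
    "final": "final",
    "off": "final",
    "postponed": "postponed",
    "cancelled": "cancelled",
    "canceled": "cancelled",
}


def map_game_state(game_state: str) -> str:
    """Map a raw gameState string to a status, defaulting to 'scheduled'."""
    return NHL_STATE_TO_STATUS.get(game_state.lower(), "scheduled")
```

Any state not listed in the table falls through to `"scheduled"`, which matches the `else` branch above.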
|
||||
def _scrape_espn(self) -> list[RawGameData]:
|
||||
"""Scrape games from ESPN API."""
|
||||
all_games: list[RawGameData] = []
|
||||
|
||||
for year, month in self._get_season_months():
|
||||
# Get number of days in month
|
||||
if month == 12:
|
||||
next_month = date(year + 1, 1, 1)
|
||||
else:
|
||||
next_month = date(year, month + 1, 1)
|
||||
|
||||
days_in_month = (next_month - date(year, month, 1)).days
|
||||
|
||||
for day in range(1, days_in_month + 1):
|
||||
try:
|
||||
game_date = date(year, month, day)
|
||||
date_str = game_date.strftime("%Y%m%d")
|
||||
url = self._get_source_url("espn", date=date_str)
|
||||
|
||||
data = self.session.get_json(url)
|
||||
games = self._parse_espn_response(data, url)
|
||||
all_games.extend(games)
|
||||
|
||||
except Exception as e:
|
||||
self._logger.debug(f"ESPN error for {year}-{month:02d}-{day:02d}: {e}")
|
||||
continue
|
||||
|
||||
return all_games
|
||||
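`_scrape_espn` derives the number of days in a month by subtracting two dates and then builds one `YYYYMMDD` string per day. As a sketch of the same idea using the standard library's `calendar.monthrange`, assuming a hypothetical helper named `espn_date_strings` that is not part of the commit:

```python
import calendar
from datetime import date


def espn_date_strings(year: int, month: int) -> list[str]:
    """Return one YYYYMMDD string for every day of the given month."""
    # monthrange returns (weekday of the 1st, number of days in the month).
    _, days_in_month = calendar.monthrange(year, month)
    return [
        date(year, month, day).strftime("%Y%m%d")
        for day in range(1, days_in_month + 1)
    ]
```

This avoids the special case for December, because `monthrange` never needs the first day of the following month.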
|
||||
def _parse_espn_response(
|
||||
self,
|
||||
data: dict,
|
||||
source_url: str,
|
||||
) -> list[RawGameData]:
|
||||
"""Parse ESPN API response."""
|
||||
games: list[RawGameData] = []
|
||||
|
||||
events = data.get("events", [])
|
||||
|
||||
for event in events:
|
||||
try:
|
||||
game = self._parse_espn_event(event, source_url)
|
||||
if game:
|
||||
games.append(game)
|
||||
except Exception as e:
|
||||
self._logger.debug(f"Failed to parse ESPN event: {e}")
|
||||
continue
|
||||
|
||||
return games
|
||||
|
||||
def _parse_espn_event(
|
||||
self,
|
||||
event: dict,
|
||||
source_url: str,
|
||||
) -> Optional[RawGameData]:
|
||||
"""Parse a single ESPN event."""
|
||||
# Get date
|
||||
date_str = event.get("date", "")
|
||||
if not date_str:
|
||||
return None
|
||||
|
||||
try:
|
||||
game_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
|
||||
except ValueError:
|
||||
return None
|
||||
|
||||
# Get competitions
|
||||
competitions = event.get("competitions", [])
|
||||
if not competitions:
|
||||
return None
|
||||
|
||||
competition = competitions[0]
|
||||
|
||||
# Check for neutral site (international games like Global Series)
|
||||
if competition.get("neutralSite"):
|
||||
venue = competition.get("venue", {})
|
||||
venue_city = venue.get("address", {}).get("city", "")
|
||||
if venue_city in INTERNATIONAL_LOCATIONS:
|
||||
return None
|
||||
|
||||
# Get teams
|
||||
competitors = competition.get("competitors", [])
|
||||
if len(competitors) != 2:
|
||||
return None
|
||||
|
||||
home_team = None
|
||||
away_team = None
|
||||
home_score = None
|
||||
away_score = None
|
||||
|
||||
for competitor in competitors:
|
||||
team_info = competitor.get("team", {})
|
||||
team_name = team_info.get("displayName", "")
|
||||
is_home = competitor.get("homeAway") == "home"
|
||||
score = competitor.get("score")
|
||||
|
||||
if score is not None:
|
||||
try:
|
||||
score = int(score)
|
||||
except (ValueError, TypeError):
|
||||
score = None
|
||||
|
||||
if is_home:
|
||||
home_team = team_name
|
||||
home_score = score
|
||||
else:
|
||||
away_team = team_name
|
||||
away_score = score
|
||||
|
||||
if not home_team or not away_team:
|
||||
return None
|
||||
|
||||
# Get venue
|
||||
venue = competition.get("venue", {})
|
||||
stadium = venue.get("fullName")
|
||||
|
||||
# Get status
|
||||
status_info = competition.get("status", {})
|
||||
status_type = status_info.get("type", {})
|
||||
status_name = status_type.get("name", "").lower()
|
||||
|
||||
if status_name == "status_final":
|
||||
status = "final"
|
||||
elif status_name == "status_postponed":
|
||||
status = "postponed"
|
||||
elif status_name == "status_canceled":
|
||||
status = "cancelled"
|
||||
else:
|
||||
status = "scheduled"
|
||||
|
||||
return RawGameData(
|
||||
game_date=game_date,
|
||||
home_team_raw=home_team,
|
||||
away_team_raw=away_team,
|
||||
stadium_raw=stadium,
|
||||
home_score=home_score,
|
||||
away_score=away_score,
|
||||
status=status,
|
||||
source_url=source_url,
|
||||
)
|
||||
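The score handling inside the competitor loop of `_parse_espn_event` could be isolated as a small helper. This is a sketch only; `coerce_score` is a hypothetical name, and the input shapes in the usage note (string, int, empty string, dict) are assumptions about what a scoreboard payload might contain rather than documented ESPN behaviour.

```python
from typing import Optional


def coerce_score(value: object) -> Optional[int]:
    """Convert a raw score to int, returning None when it cannot be parsed."""
    if value is None:
        return None
    try:
        return int(value)
    except (ValueError, TypeError):
        # Empty strings raise ValueError; dicts and lists raise TypeError.
        return None
```

With this shape, `coerce_score("3")` gives `3`, a numeric `0` is preserved as `0`, and an empty string or a nested object gives `None` instead of leaking through unparsed.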
|
||||
def _normalize_games(
|
||||
self,
|
||||
raw_games: list[RawGameData],
|
||||
) -> tuple[list[Game], list[ManualReviewItem]]:
|
||||
"""Normalize raw games to Game objects with canonical IDs."""
|
||||
games: list[Game] = []
|
||||
review_items: list[ManualReviewItem] = []
|
||||
|
||||
for raw in raw_games:
|
||||
game, item_reviews = self._normalize_single_game(raw)
|
||||
|
||||
if game:
|
||||
games.append(game)
|
||||
log_game(
|
||||
self.sport,
|
||||
game.id,
|
||||
game.home_team_id,
|
||||
game.away_team_id,
|
||||
game.game_date.strftime("%Y-%m-%d"),
|
||||
game.status,
|
||||
)
|
||||
|
||||
review_items.extend(item_reviews)
|
||||
|
||||
return games, review_items
|
||||
|
||||
def _normalize_single_game(
|
||||
self,
|
||||
raw: RawGameData,
|
||||
) -> tuple[Optional[Game], list[ManualReviewItem]]:
|
||||
"""Normalize a single raw game."""
|
||||
review_items: list[ManualReviewItem] = []
|
||||
|
||||
# Resolve home team
|
||||
home_result = self._team_resolver.resolve(
|
||||
raw.home_team_raw,
|
||||
check_date=raw.game_date.date(),
|
||||
source_url=raw.source_url,
|
||||
)
|
||||
|
||||
if home_result.review_item:
|
||||
review_items.append(home_result.review_item)
|
||||
|
||||
if not home_result.canonical_id:
|
||||
log_warning(f"Could not resolve home team: {raw.home_team_raw}")
|
||||
return None, review_items
|
||||
|
||||
# Resolve away team
|
||||
away_result = self._team_resolver.resolve(
|
||||
raw.away_team_raw,
|
||||
check_date=raw.game_date.date(),
|
||||
source_url=raw.source_url,
|
||||
)
|
||||
|
||||
if away_result.review_item:
|
||||
review_items.append(away_result.review_item)
|
||||
|
||||
if not away_result.canonical_id:
|
||||
log_warning(f"Could not resolve away team: {raw.away_team_raw}")
|
||||
return None, review_items
|
||||
|
||||
# Resolve stadium
|
||||
stadium_id = None
|
||||
|
||||
if raw.stadium_raw:
|
||||
stadium_result = self._stadium_resolver.resolve(
|
||||
raw.stadium_raw,
|
||||
check_date=raw.game_date.date(),
|
||||
source_url=raw.source_url,
|
||||
)
|
||||
|
||||
if stadium_result.review_item:
|
||||
review_items.append(stadium_result.review_item)
|
||||
|
||||
stadium_id = stadium_result.canonical_id
|
||||
|
||||
# Get abbreviations for game ID
|
||||
home_abbrev = self._get_abbreviation(home_result.canonical_id)
|
||||
away_abbrev = self._get_abbreviation(away_result.canonical_id)
|
||||
|
||||
# Generate canonical game ID
|
||||
game_id = generate_game_id(
|
||||
sport=self.sport,
|
||||
season=self.season,
|
||||
away_abbrev=away_abbrev,
|
||||
home_abbrev=home_abbrev,
|
||||
game_date=raw.game_date,
|
||||
game_number=None, # NHL doesn't have doubleheaders
|
||||
)
|
||||
|
||||
game = Game(
|
||||
id=game_id,
|
||||
sport=self.sport,
|
||||
season=self.season,
|
||||
home_team_id=home_result.canonical_id,
|
||||
away_team_id=away_result.canonical_id,
|
||||
stadium_id=stadium_id or "",
|
||||
game_date=raw.game_date,
|
||||
game_number=None,
|
||||
home_score=raw.home_score,
|
||||
away_score=raw.away_score,
|
||||
status=raw.status,
|
||||
source_url=raw.source_url,
|
||||
raw_home_team=raw.home_team_raw,
|
||||
raw_away_team=raw.away_team_raw,
|
||||
raw_stadium=raw.stadium_raw,
|
||||
)
|
||||
|
||||
return game, review_items
|
||||
|
||||
def _get_abbreviation(self, team_id: str) -> str:
|
||||
"""Extract abbreviation from team ID."""
|
||||
parts = team_id.split("_")
|
||||
return parts[-1] if parts else ""
|
||||
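`_get_abbreviation` takes the last underscore-separated segment of a canonical team ID. A minimal sketch of that behaviour follows; the ID `team_nhl_bos` is an assumed format used purely for illustration, since the real format is defined in `canonical_id` and not shown here, and `get_abbreviation` is a hypothetical standalone name.

```python
def get_abbreviation(team_id: str) -> str:
    """Return the text after the last underscore, or the whole ID if none."""
    # rsplit with maxsplit=1 always yields at least one element.
    return team_id.rsplit("_", 1)[-1]
```

Because splitting a string always returns at least one element, an ID without underscores comes back unchanged and an empty ID yields an empty string.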
|
||||
def scrape_teams(self) -> list[Team]:
|
||||
"""Get all NHL teams from hardcoded mappings."""
|
||||
teams: list[Team] = []
|
||||
seen: set[str] = set()
|
||||
|
||||
# NHL conference/division structure
|
||||
divisions = {
|
||||
"Atlantic": ("Eastern", ["BOS", "BUF", "DET", "FLA", "MTL", "OTT", "TB", "TOR"]),
|
||||
"Metropolitan": ("Eastern", ["CAR", "CBJ", "NJ", "NYI", "NYR", "PHI", "PIT", "WAS"]),
|
||||
"Central": ("Western", ["ARI", "CHI", "COL", "DAL", "MIN", "NSH", "STL", "WPG"]),
|
||||
"Pacific": ("Western", ["ANA", "CGY", "EDM", "LA", "SJ", "SEA", "VAN", "VGK"]),
|
||||
}
|
||||
|
||||
# Build reverse lookup
|
||||
team_divisions: dict[str, tuple[str, str]] = {}
|
||||
for div, (conf, abbrevs) in divisions.items():
|
||||
for abbrev in abbrevs:
|
||||
team_divisions[abbrev] = (conf, div)
|
||||
|
||||
for abbrev, (team_id, full_name, city) in TEAM_MAPPINGS.get("nhl", {}).items():
|
||||
if team_id in seen:
|
||||
continue
|
||||
seen.add(team_id)
|
||||
|
||||
# Parse team name
|
||||
parts = full_name.split()
|
||||
team_name = parts[-1] if parts else full_name
|
||||
# Handle multi-word names
|
||||
if team_name in ["Wings", "Jackets", "Knights", "Leafs"]:
|
||||
team_name = " ".join(parts[-2:])
|
||||
|
||||
# Get conference and division
|
||||
conf, div = team_divisions.get(abbrev, (None, None))
|
||||
|
||||
# Get stadium ID
|
||||
stadium_id = None
|
||||
nhl_stadiums = STADIUM_MAPPINGS.get("nhl", {})
|
||||
for sid, sinfo in nhl_stadiums.items():
|
||||
if city.lower() in sinfo.city.lower() or sinfo.city.lower() in city.lower():
|
||||
stadium_id = sid
|
||||
break
|
||||
|
||||
team = Team(
|
||||
id=team_id,
|
||||
sport="nhl",
|
||||
city=city,
|
||||
name=team_name,
|
||||
full_name=full_name,
|
||||
abbreviation=abbrev,
|
||||
conference=conf,
|
||||
division=div,
|
||||
stadium_id=stadium_id,
|
||||
)
|
||||
teams.append(team)
|
||||
|
||||
return teams
|
||||
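The reverse lookup built in `scrape_teams` inverts a division-to-teams table into a team-to-(conference, division) table. A sketch with a trimmed, illustrative subset of the data, where the name `sample_divisions` is hypothetical:

```python
# Trimmed sample of the divisions table; the full table is in scrape_teams.
sample_divisions = {
    "Atlantic": ("Eastern", ["BOS", "TOR"]),
    "Pacific": ("Western", ["SEA", "VAN"]),
}

# Invert: abbreviation -> (conference, division).
team_divisions = {
    abbrev: (conf, div)
    for div, (conf, abbrevs) in sample_divisions.items()
    for abbrev in abbrevs
}

# Unknown abbreviations fall back to (None, None), as in scrape_teams.
conference, division = team_divisions.get("XXX", (None, None))
```

An abbreviation that appears in `TEAM_MAPPINGS` but not in the divisions table therefore produces a team with no conference or division rather than an error.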
|
||||
def scrape_stadiums(self) -> list[Stadium]:
|
||||
"""Get all NHL stadiums from hardcoded mappings."""
|
||||
stadiums: list[Stadium] = []
|
||||
|
||||
nhl_stadiums = STADIUM_MAPPINGS.get("nhl", {})
|
||||
for stadium_id, info in nhl_stadiums.items():
|
||||
stadium = Stadium(
|
||||
id=stadium_id,
|
||||
sport="nhl",
|
||||
name=info.name,
|
||||
city=info.city,
|
||||
state=info.state,
|
||||
country=info.country,
|
||||
latitude=info.latitude,
|
||||
longitude=info.longitude,
|
||||
surface="ice",
|
||||
roof_type="dome",
|
||||
)
|
||||
stadiums.append(stadium)
|
||||
|
||||
return stadiums
|
||||
|
||||
|
||||
def create_nhl_scraper(season: int) -> NHLScraper:
|
||||
"""Factory function to create an NHL scraper."""
|
||||
return NHLScraper(season=season)
|
||||
@@ -0,0 +1,385 @@
|
||||
"""NWSL scraper implementation with multi-source fallback."""
|
||||
|
||||
from datetime import datetime, date
|
||||
from typing import Optional
|
||||
|
||||
from .base import BaseScraper, RawGameData, ScrapeResult
|
||||
from ..models.game import Game
|
||||
from ..models.team import Team
|
||||
from ..models.stadium import Stadium
|
||||
from ..models.aliases import ManualReviewItem
|
||||
from ..normalizers.canonical_id import generate_game_id
|
||||
from ..normalizers.team_resolver import (
|
||||
TeamResolver,
|
||||
TEAM_MAPPINGS,
|
||||
get_team_resolver,
|
||||
)
|
||||
from ..normalizers.stadium_resolver import (
|
||||
StadiumResolver,
|
||||
STADIUM_MAPPINGS,
|
||||
get_stadium_resolver,
|
||||
)
|
||||
from ..utils.logging import get_logger, log_game, log_warning
|
||||
|
||||
|
||||
class NWSLScraper(BaseScraper):
|
||||
"""NWSL schedule scraper with multi-source fallback.
|
||||
|
||||
Sources (in priority order):
|
||||
1. ESPN API - currently the only source implemented (see _get_sources)
|
||||
"""
|
||||
|
||||
def __init__(self, season: int, **kwargs):
|
||||
"""Initialize NWSL scraper.
|
||||
|
||||
Args:
|
||||
season: Season year (e.g., 2026 for 2026 season)
|
||||
"""
|
||||
super().__init__("nwsl", season, **kwargs)
|
||||
self._team_resolver = get_team_resolver("nwsl")
|
||||
self._stadium_resolver = get_stadium_resolver("nwsl")
|
||||
|
||||
def _get_sources(self) -> list[str]:
|
||||
"""Return source list in priority order."""
|
||||
return ["espn"]
|
||||
|
||||
def _get_source_url(self, source: str, **kwargs) -> str:
|
||||
"""Build URL for a source."""
|
||||
if source == "espn":
|
||||
date_str = kwargs.get("date", "")
|
||||
return f"https://site.api.espn.com/apis/site/v2/sports/soccer/usa.nwsl/scoreboard?dates={date_str}"
|
||||
|
||||
raise ValueError(f"Unknown source: {source}")
|
||||
|
||||
def _get_season_months(self) -> list[tuple[int, int]]:
|
||||
"""Get the months to scrape for NWSL season.
|
||||
|
||||
NWSL season runs March through November.
|
||||
"""
|
||||
months = []
|
||||
|
||||
# NWSL regular season + playoffs
|
||||
for month in range(3, 12): # March-Nov
|
||||
months.append((self.season, month))
|
||||
|
||||
return months
|
||||
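The month list returned by `_get_season_months` can be checked in isolation. A sketch assuming a hypothetical free function `nwsl_season_months` with the same March-through-November range:

```python
def nwsl_season_months(season: int) -> list[tuple[int, int]]:
    """Return (year, month) pairs for March through November of the season."""
    # range(3, 12) stops before 12, so December is excluded.
    return [(season, month) for month in range(3, 12)]
```

Since the range never reaches December, the `month == 12` branch in the NWSL `_scrape_espn` is not exercised for this sport.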
|
||||
def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
|
||||
"""Scrape games from a specific source."""
|
||||
if source == "espn":
|
||||
return self._scrape_espn()
|
||||
else:
|
||||
raise ValueError(f"Unknown source: {source}")
|
||||
|
||||
def _scrape_espn(self) -> list[RawGameData]:
|
||||
"""Scrape games from ESPN API."""
|
||||
all_games: list[RawGameData] = []
|
||||
|
||||
for year, month in self._get_season_months():
|
||||
# Get number of days in month
|
||||
if month == 12:
|
||||
next_month = date(year + 1, 1, 1)
|
||||
else:
|
||||
next_month = date(year, month + 1, 1)
|
||||
|
||||
days_in_month = (next_month - date(year, month, 1)).days
|
||||
|
||||
for day in range(1, days_in_month + 1):
|
||||
try:
|
||||
game_date = date(year, month, day)
|
||||
date_str = game_date.strftime("%Y%m%d")
|
||||
url = self._get_source_url("espn", date=date_str)
|
||||
|
||||
data = self.session.get_json(url)
|
||||
games = self._parse_espn_response(data, url)
|
||||
all_games.extend(games)
|
||||
|
||||
except Exception as e:
|
||||
self._logger.debug(f"ESPN error for {year}-{month:02d}-{day:02d}: {e}")
|
||||
continue
|
||||
|
||||
return all_games
|
||||
|
||||
def _parse_espn_response(
|
||||
self,
|
||||
data: dict,
|
||||
source_url: str,
|
||||
) -> list[RawGameData]:
|
||||
"""Parse ESPN API response."""
|
||||
games: list[RawGameData] = []
|
||||
|
||||
events = data.get("events", [])
|
||||
|
||||
for event in events:
|
||||
try:
|
||||
game = self._parse_espn_event(event, source_url)
|
||||
if game:
|
||||
games.append(game)
|
||||
except Exception as e:
|
||||
self._logger.debug(f"Failed to parse ESPN event: {e}")
|
||||
continue
|
||||
|
||||
return games
|
||||
|
||||
def _parse_espn_event(
|
||||
self,
|
||||
event: dict,
|
||||
source_url: str,
|
||||
) -> Optional[RawGameData]:
|
||||
"""Parse a single ESPN event."""
|
||||
# Get date
|
||||
date_str = event.get("date", "")
|
||||
if not date_str:
|
||||
return None
|
||||
|
||||
try:
|
||||
game_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
|
||||
except ValueError:
|
||||
return None
|
||||
|
||||
# Get competitions
|
||||
competitions = event.get("competitions", [])
|
||||
if not competitions:
|
||||
return None
|
||||
|
||||
competition = competitions[0]
|
||||
|
||||
# Get teams
|
||||
competitors = competition.get("competitors", [])
|
||||
if len(competitors) != 2:
|
||||
return None
|
||||
|
||||
home_team = None
|
||||
away_team = None
|
||||
home_score = None
|
||||
away_score = None
|
||||
|
||||
for competitor in competitors:
|
||||
team_info = competitor.get("team", {})
|
||||
team_name = team_info.get("displayName", "")
|
||||
is_home = competitor.get("homeAway") == "home"
|
||||
score = competitor.get("score")
|
||||
|
||||
if score is not None:
|
||||
try:
|
||||
score = int(score)
|
||||
except (ValueError, TypeError):
|
||||
score = None
|
||||
|
||||
if is_home:
|
||||
home_team = team_name
|
||||
home_score = score
|
||||
else:
|
||||
away_team = team_name
|
||||
away_score = score
|
||||
|
||||
if not home_team or not away_team:
|
||||
return None
|
||||
|
||||
# Get venue
|
||||
venue = competition.get("venue", {})
|
||||
stadium = venue.get("fullName")
|
||||
|
||||
# Get status
|
||||
status_info = competition.get("status", {})
|
||||
status_type = status_info.get("type", {})
|
||||
status_name = status_type.get("name", "").lower()
|
||||
|
||||
if status_name == "status_final":
|
||||
status = "final"
|
||||
elif status_name == "status_postponed":
|
||||
status = "postponed"
|
||||
elif status_name == "status_canceled":
|
||||
status = "cancelled"
|
||||
else:
|
||||
status = "scheduled"
|
||||
|
||||
return RawGameData(
|
||||
game_date=game_date,
|
||||
home_team_raw=home_team,
|
||||
away_team_raw=away_team,
|
||||
stadium_raw=stadium,
|
||||
home_score=home_score,
|
||||
away_score=away_score,
|
||||
status=status,
|
||||
source_url=source_url,
|
||||
)
|
||||
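`_parse_espn_event` turns the event's `date` string into a timezone-aware datetime by rewriting a trailing `Z` as `+00:00`. The sketch below shows that conversion and one consequence worth keeping in mind: calling `.date()` on the result gives the UTC calendar day, which can be a day later than the local day for evening games. The timestamp and the fixed UTC-7 offset are made-up illustration values, and `parse_utc` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp whose UTC marker is a trailing 'Z'."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


# Hypothetical 7:00 pm kickoff on the US west coast, expressed in UTC.
start = parse_utc("2025-10-08T02:00Z")
pacific_daylight = timezone(timedelta(hours=-7))  # fixed offset, assumption

utc_day = start.date()
local_day = start.astimezone(pacific_daylight).date()
```

Here `utc_day` and `local_day` differ by one day, so anything keyed on `game_date.date()` is keyed on the UTC day.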
|
||||
def _normalize_games(
|
||||
self,
|
||||
raw_games: list[RawGameData],
|
||||
) -> tuple[list[Game], list[ManualReviewItem]]:
|
||||
"""Normalize raw games to Game objects with canonical IDs."""
|
||||
games: list[Game] = []
|
||||
review_items: list[ManualReviewItem] = []
|
||||
|
||||
for raw in raw_games:
|
||||
game, item_reviews = self._normalize_single_game(raw)
|
||||
|
||||
if game:
|
||||
games.append(game)
|
||||
log_game(
|
||||
self.sport,
|
||||
game.id,
|
||||
game.home_team_id,
|
||||
game.away_team_id,
|
||||
game.game_date.strftime("%Y-%m-%d"),
|
||||
game.status,
|
||||
)
|
||||
|
||||
review_items.extend(item_reviews)
|
||||
|
||||
return games, review_items
|
||||
|
||||
def _normalize_single_game(
|
||||
self,
|
||||
raw: RawGameData,
|
||||
) -> tuple[Optional[Game], list[ManualReviewItem]]:
|
||||
"""Normalize a single raw game."""
|
||||
review_items: list[ManualReviewItem] = []
|
||||
|
||||
# Resolve home team
|
||||
home_result = self._team_resolver.resolve(
|
||||
raw.home_team_raw,
|
||||
check_date=raw.game_date.date(),
|
||||
source_url=raw.source_url,
|
||||
)
|
||||
|
||||
if home_result.review_item:
|
||||
review_items.append(home_result.review_item)
|
||||
|
||||
if not home_result.canonical_id:
|
||||
log_warning(f"Could not resolve home team: {raw.home_team_raw}")
|
||||
return None, review_items
|
||||
|
||||
# Resolve away team
|
||||
away_result = self._team_resolver.resolve(
|
||||
raw.away_team_raw,
|
||||
check_date=raw.game_date.date(),
|
||||
source_url=raw.source_url,
|
||||
)
|
||||
|
||||
if away_result.review_item:
|
||||
review_items.append(away_result.review_item)
|
||||
|
||||
if not away_result.canonical_id:
|
||||
log_warning(f"Could not resolve away team: {raw.away_team_raw}")
|
||||
return None, review_items
|
||||
|
||||
# Resolve stadium
|
||||
stadium_id = None
|
||||
|
||||
if raw.stadium_raw:
|
||||
stadium_result = self._stadium_resolver.resolve(
|
||||
raw.stadium_raw,
|
||||
check_date=raw.game_date.date(),
|
||||
source_url=raw.source_url,
|
||||
)
|
||||
|
||||
if stadium_result.review_item:
|
||||
review_items.append(stadium_result.review_item)
|
||||
|
||||
stadium_id = stadium_result.canonical_id
|
||||
|
||||
# Get abbreviations for game ID
|
||||
home_abbrev = self._get_abbreviation(home_result.canonical_id)
|
||||
away_abbrev = self._get_abbreviation(away_result.canonical_id)
|
||||
|
||||
# Generate canonical game ID
|
||||
game_id = generate_game_id(
|
||||
sport=self.sport,
|
||||
season=self.season,
|
||||
away_abbrev=away_abbrev,
|
||||
home_abbrev=home_abbrev,
|
||||
game_date=raw.game_date,
|
||||
game_number=None,
|
||||
)
|
||||
|
||||
game = Game(
|
||||
id=game_id,
|
||||
sport=self.sport,
|
||||
season=self.season,
|
||||
home_team_id=home_result.canonical_id,
|
||||
away_team_id=away_result.canonical_id,
|
||||
stadium_id=stadium_id or "",
|
||||
game_date=raw.game_date,
|
||||
game_number=None,
|
||||
home_score=raw.home_score,
|
||||
away_score=raw.away_score,
|
||||
status=raw.status,
|
||||
source_url=raw.source_url,
|
||||
raw_home_team=raw.home_team_raw,
|
||||
raw_away_team=raw.away_team_raw,
|
||||
raw_stadium=raw.stadium_raw,
|
||||
)
|
||||
|
||||
return game, review_items
|
||||
|
||||
def _get_abbreviation(self, team_id: str) -> str:
|
||||
"""Extract abbreviation from team ID."""
|
||||
parts = team_id.split("_")
|
||||
return parts[-1] if parts else ""
|
||||
|
||||
def scrape_teams(self) -> list[Team]:
|
||||
"""Get all NWSL teams from hardcoded mappings."""
|
||||
teams: list[Team] = []
|
||||
seen: set[str] = set()
|
||||
|
||||
for abbrev, (team_id, full_name, city) in TEAM_MAPPINGS.get("nwsl", {}).items():
|
||||
if team_id in seen:
|
||||
continue
|
||||
seen.add(team_id)
|
||||
|
||||
# Use the full club name as the team name (no city/nickname split for NWSL)
|
||||
team_name = full_name
|
||||
|
||||
# Get stadium ID
|
||||
stadium_id = None
|
||||
nwsl_stadiums = STADIUM_MAPPINGS.get("nwsl", {})
|
||||
for sid, sinfo in nwsl_stadiums.items():
|
||||
if city.lower() in sinfo.city.lower() or sinfo.city.lower() in city.lower():
|
||||
stadium_id = sid
|
||||
break
|
||||
|
||||
team = Team(
|
||||
id=team_id,
|
||||
sport="nwsl",
|
||||
city=city,
|
||||
name=team_name,
|
||||
full_name=full_name,
|
||||
abbreviation=abbrev,
|
||||
conference=None, # NWSL uses single table
|
||||
division=None,
|
||||
stadium_id=stadium_id,
|
||||
)
|
||||
teams.append(team)
|
||||
|
||||
return teams
|
||||
|
||||
def scrape_stadiums(self) -> list[Stadium]:
|
||||
"""Get all NWSL stadiums from hardcoded mappings."""
|
||||
stadiums: list[Stadium] = []
|
||||
|
||||
nwsl_stadiums = STADIUM_MAPPINGS.get("nwsl", {})
|
||||
for stadium_id, info in nwsl_stadiums.items():
|
||||
stadium = Stadium(
|
||||
id=stadium_id,
|
||||
sport="nwsl",
|
||||
name=info.name,
|
||||
city=info.city,
|
||||
state=info.state,
|
||||
country=info.country,
|
||||
latitude=info.latitude,
|
||||
longitude=info.longitude,
|
||||
surface="grass",
|
||||
roof_type="open",
|
||||
)
|
||||
stadiums.append(stadium)
|
||||
|
||||
return stadiums
|
||||
|
||||
|
||||
def create_nwsl_scraper(season: int) -> NWSLScraper:
|
||||
"""Factory function to create an NWSL scraper."""
|
||||
return NWSLScraper(season=season)
|
||||
@@ -0,0 +1,386 @@
|
||||
"""WNBA scraper implementation with multi-source fallback."""
|
||||
|
||||
from datetime import datetime, date
|
||||
from typing import Optional
|
||||
|
||||
from .base import BaseScraper, RawGameData, ScrapeResult
|
||||
from ..models.game import Game
|
||||
from ..models.team import Team
|
||||
from ..models.stadium import Stadium
|
||||
from ..models.aliases import ManualReviewItem
|
||||
from ..normalizers.canonical_id import generate_game_id
|
||||
from ..normalizers.team_resolver import (
|
||||
TeamResolver,
|
||||
TEAM_MAPPINGS,
|
||||
get_team_resolver,
|
||||
)
|
||||
from ..normalizers.stadium_resolver import (
|
||||
StadiumResolver,
|
||||
STADIUM_MAPPINGS,
|
||||
get_stadium_resolver,
|
||||
)
|
||||
from ..utils.logging import get_logger, log_game, log_warning
|
||||
|
||||
|
||||
class WNBAScraper(BaseScraper):
|
||||
"""WNBA schedule scraper with multi-source fallback.
|
||||
|
||||
Sources (in priority order):
|
||||
1. ESPN API - currently the only source implemented (see _get_sources)
|
||||
"""
|
||||
|
||||
def __init__(self, season: int, **kwargs):
|
||||
"""Initialize WNBA scraper.
|
||||
|
||||
Args:
|
||||
season: Season year (e.g., 2026 for 2026 season)
|
||||
"""
|
||||
super().__init__("wnba", season, **kwargs)
|
||||
self._team_resolver = get_team_resolver("wnba")
|
||||
self._stadium_resolver = get_stadium_resolver("wnba")
|
||||
|
||||
def _get_sources(self) -> list[str]:
|
||||
"""Return source list in priority order."""
|
||||
return ["espn"]
|
||||
|
||||
def _get_source_url(self, source: str, **kwargs) -> str:
|
||||
"""Build URL for a source."""
|
||||
if source == "espn":
|
||||
date_str = kwargs.get("date", "")
|
||||
return f"https://site.api.espn.com/apis/site/v2/sports/basketball/wnba/scoreboard?dates={date_str}"
|
||||
|
||||
raise ValueError(f"Unknown source: {source}")
|
||||
|
||||
def _get_season_months(self) -> list[tuple[int, int]]:
|
||||
"""Get the months to scrape for WNBA season.
|
||||
|
||||
WNBA season runs May through September/October.
|
||||
"""
|
||||
months = []
|
||||
|
||||
# WNBA regular season + playoffs
|
||||
for month in range(5, 11): # May-Oct
|
||||
months.append((self.season, month))
|
||||
|
||||
return months
|
||||
|
||||
def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
|
||||
"""Scrape games from a specific source."""
|
||||
if source == "espn":
|
||||
return self._scrape_espn()
|
||||
else:
|
||||
raise ValueError(f"Unknown source: {source}")
|
||||
|
||||
def _scrape_espn(self) -> list[RawGameData]:
|
||||
"""Scrape games from ESPN API."""
|
||||
all_games: list[RawGameData] = []
|
||||
|
||||
for year, month in self._get_season_months():
|
||||
# Get number of days in month
|
||||
if month == 12:
|
||||
next_month = date(year + 1, 1, 1)
|
||||
else:
|
||||
next_month = date(year, month + 1, 1)
|
||||
|
||||
days_in_month = (next_month - date(year, month, 1)).days
|
||||
|
||||
for day in range(1, days_in_month + 1):
|
||||
try:
|
||||
game_date = date(year, month, day)
|
||||
date_str = game_date.strftime("%Y%m%d")
|
||||
url = self._get_source_url("espn", date=date_str)
|
||||
|
||||
data = self.session.get_json(url)
|
||||
games = self._parse_espn_response(data, url)
|
||||
all_games.extend(games)
|
||||
|
||||
except Exception as e:
|
||||
self._logger.debug(f"ESPN error for {year}-{month:02d}-{day:02d}: {e}")
|
||||
continue
|
||||
|
||||
return all_games
|
||||
|
||||
def _parse_espn_response(
|
||||
self,
|
||||
data: dict,
|
||||
source_url: str,
|
||||
) -> list[RawGameData]:
|
||||
"""Parse ESPN API response."""
|
||||
games: list[RawGameData] = []
|
||||
|
||||
events = data.get("events", [])
|
||||
|
||||
for event in events:
|
||||
try:
|
||||
game = self._parse_espn_event(event, source_url)
|
||||
if game:
|
||||
games.append(game)
|
||||
except Exception as e:
|
||||
self._logger.debug(f"Failed to parse ESPN event: {e}")
|
||||
continue
|
||||
|
||||
return games
|
||||
|
||||
def _parse_espn_event(
|
||||
self,
|
||||
event: dict,
|
||||
source_url: str,
|
||||
) -> Optional[RawGameData]:
|
||||
"""Parse a single ESPN event."""
|
||||
# Get date
|
||||
date_str = event.get("date", "")
|
||||
if not date_str:
|
||||
return None
|
||||
|
||||
try:
|
||||
game_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
|
||||
except ValueError:
|
||||
return None
|
||||
|
||||
# Get competitions
|
||||
competitions = event.get("competitions", [])
|
||||
if not competitions:
|
||||
return None
|
||||
|
||||
competition = competitions[0]
|
||||
|
||||
# Get teams
|
||||
competitors = competition.get("competitors", [])
|
||||
if len(competitors) != 2:
|
||||
return None
|
||||
|
||||
home_team = None
|
||||
away_team = None
|
||||
home_score = None
|
||||
away_score = None
|
||||
|
||||
for competitor in competitors:
|
||||
team_info = competitor.get("team", {})
|
||||
team_name = team_info.get("displayName", "")
|
||||
is_home = competitor.get("homeAway") == "home"
|
||||
score = competitor.get("score")
|
||||
|
||||
if score is not None:
|
||||
try:
|
||||
score = int(score)
|
||||
except (ValueError, TypeError):
|
||||
score = None
|
||||
|
||||
if is_home:
|
||||
home_team = team_name
|
||||
home_score = score
|
||||
else:
|
||||
away_team = team_name
|
||||
away_score = score
|
||||
|
||||
if not home_team or not away_team:
|
||||
return None
|
||||
|
||||
# Get venue
|
||||
venue = competition.get("venue", {})
|
||||
stadium = venue.get("fullName")
|
||||
|
||||
# Get status
|
||||
status_info = competition.get("status", {})
|
||||
status_type = status_info.get("type", {})
|
||||
status_name = status_type.get("name", "").lower()
|
||||
|
||||
if status_name == "status_final":
|
||||
status = "final"
|
||||
elif status_name == "status_postponed":
|
||||
status = "postponed"
|
||||
elif status_name == "status_canceled":
|
||||
status = "cancelled"
|
||||
else:
|
||||
status = "scheduled"
|
||||
|
||||
return RawGameData(
|
||||
game_date=game_date,
|
||||
home_team_raw=home_team,
|
||||
away_team_raw=away_team,
|
||||
stadium_raw=stadium,
|
||||
home_score=home_score,
|
||||
away_score=away_score,
|
||||
status=status,
|
||||
source_url=source_url,
|
||||
)
|
||||
|
||||
    def _normalize_games(
        self,
        raw_games: list[RawGameData],
    ) -> tuple[list[Game], list[ManualReviewItem]]:
        """Normalize raw games to Game objects with canonical IDs."""
        games: list[Game] = []
        review_items: list[ManualReviewItem] = []

        for raw in raw_games:
            game, item_reviews = self._normalize_single_game(raw)

            if game:
                games.append(game)
                log_game(
                    self.sport,
                    game.id,
                    game.home_team_id,
                    game.away_team_id,
                    game.game_date.strftime("%Y-%m-%d"),
                    game.status,
                )

            review_items.extend(item_reviews)

        return games, review_items

    def _normalize_single_game(
        self,
        raw: RawGameData,
    ) -> tuple[Optional[Game], list[ManualReviewItem]]:
        """Normalize a single raw game."""
        review_items: list[ManualReviewItem] = []

        # Resolve home team
        home_result = self._team_resolver.resolve(
            raw.home_team_raw,
            check_date=raw.game_date.date(),
            source_url=raw.source_url,
        )

        if home_result.review_item:
            review_items.append(home_result.review_item)

        if not home_result.canonical_id:
            log_warning(f"Could not resolve home team: {raw.home_team_raw}")
            return None, review_items

        # Resolve away team
        away_result = self._team_resolver.resolve(
            raw.away_team_raw,
            check_date=raw.game_date.date(),
            source_url=raw.source_url,
        )

        if away_result.review_item:
            review_items.append(away_result.review_item)

        if not away_result.canonical_id:
            log_warning(f"Could not resolve away team: {raw.away_team_raw}")
            return None, review_items

        # Resolve stadium
        stadium_id = None

        if raw.stadium_raw:
            stadium_result = self._stadium_resolver.resolve(
                raw.stadium_raw,
                check_date=raw.game_date.date(),
                source_url=raw.source_url,
            )

            if stadium_result.review_item:
                review_items.append(stadium_result.review_item)

            stadium_id = stadium_result.canonical_id

        # Get abbreviations for game ID
        home_abbrev = self._get_abbreviation(home_result.canonical_id)
        away_abbrev = self._get_abbreviation(away_result.canonical_id)

        # Generate canonical game ID
        game_id = generate_game_id(
            sport=self.sport,
            season=self.season,
            away_abbrev=away_abbrev,
            home_abbrev=home_abbrev,
            game_date=raw.game_date,
            game_number=None,
        )

        game = Game(
            id=game_id,
            sport=self.sport,
            season=self.season,
            home_team_id=home_result.canonical_id,
            away_team_id=away_result.canonical_id,
            stadium_id=stadium_id or "",
            game_date=raw.game_date,
            game_number=None,
            home_score=raw.home_score,
            away_score=raw.away_score,
            status=raw.status,
            source_url=raw.source_url,
            raw_home_team=raw.home_team_raw,
            raw_away_team=raw.away_team_raw,
            raw_stadium=raw.stadium_raw,
        )

        return game, review_items

    def _get_abbreviation(self, team_id: str) -> str:
        """Extract abbreviation from team ID."""
        parts = team_id.split("_")
        return parts[-1] if parts else ""

    def scrape_teams(self) -> list[Team]:
        """Get all WNBA teams from hardcoded mappings."""
        teams: list[Team] = []
        seen: set[str] = set()

        for abbrev, (team_id, full_name, city) in TEAM_MAPPINGS.get("wnba", {}).items():
            if team_id in seen:
                continue
            seen.add(team_id)

            # Parse team name
            parts = full_name.split()
            team_name = parts[-1] if parts else full_name

            # Get stadium ID
            stadium_id = None
            wnba_stadiums = STADIUM_MAPPINGS.get("wnba", {})
            for sid, sinfo in wnba_stadiums.items():
                if city.lower() in sinfo.city.lower() or sinfo.city.lower() in city.lower():
                    stadium_id = sid
                    break

            team = Team(
                id=team_id,
                sport="wnba",
                city=city,
                name=team_name,
                full_name=full_name,
                abbreviation=abbrev,
                conference=None,  # WNBA uses single table now
                division=None,
                stadium_id=stadium_id,
            )
            teams.append(team)

        return teams

    def scrape_stadiums(self) -> list[Stadium]:
        """Get all WNBA stadiums from hardcoded mappings."""
        stadiums: list[Stadium] = []

        wnba_stadiums = STADIUM_MAPPINGS.get("wnba", {})
        for stadium_id, info in wnba_stadiums.items():
            stadium = Stadium(
                id=stadium_id,
                sport="wnba",
                name=info.name,
                city=info.city,
                state=info.state,
                country=info.country,
                latitude=info.latitude,
                longitude=info.longitude,
                surface="hardwood",
                roof_type="dome",
            )
            stadiums.append(stadium)

        return stadiums


def create_wnba_scraper(season: int) -> WNBAScraper:
    """Factory function to create a WNBA scraper."""
    return WNBAScraper(season=season)
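The event-parsing steps in `_parse_espn_event` can be tried outside the scraper class. The following is a minimal sketch under stated assumptions, not the package's code: `STATUS_MAP`, `to_int`, and `summarize_espn_event` are hypothetical names introduced for illustration, and the payload is a trimmed copy of the shape used by the ESPN scoreboard fixtures in this commit.

```python
from datetime import datetime
from typing import Optional

# Hypothetical lookup table mirroring the if/elif status chain in the scraper.
STATUS_MAP = {
    "status_final": "final",
    "status_postponed": "postponed",
    "status_canceled": "cancelled",
}


def to_int(value) -> Optional[int]:
    """Coerce an ESPN score (usually a string) to int, or None if unusable."""
    try:
        return int(value)
    except (ValueError, TypeError):
        return None


def summarize_espn_event(event: dict) -> Optional[dict]:
    """Return a plain dict for one event, or None if required fields are missing."""
    date_str = event.get("date", "")
    competitions = event.get("competitions", [])
    if not date_str or not competitions:
        return None
    try:
        game_date = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
    except ValueError:
        return None

    competition = competitions[0]
    sides = {}
    for competitor in competition.get("competitors", []):
        side = "home" if competitor.get("homeAway") == "home" else "away"
        name = competitor.get("team", {}).get("displayName", "")
        sides[side] = (name, to_int(competitor.get("score")))

    # Both team names are required, as in the scraper.
    if not sides.get("home", ("",))[0] or not sides.get("away", ("",))[0]:
        return None

    status_name = competition.get("status", {}).get("type", {}).get("name", "").lower()
    return {
        "game_date": game_date,
        "home": sides["home"],
        "away": sides["away"],
        "stadium": competition.get("venue", {}).get("fullName"),
        "status": STATUS_MAP.get(status_name, "scheduled"),
    }


# Trimmed payload in the shape of the ESPN scoreboard fixtures.
event = {
    "date": "2026-04-15T23:05:00Z",
    "competitions": [
        {
            "venue": {"fullName": "Fenway Park"},
            "competitors": [
                {"homeAway": "home", "team": {"displayName": "Boston Red Sox"}, "score": "5"},
                {"homeAway": "away", "team": {"displayName": "New York Yankees"}, "score": "3"},
            ],
            "status": {"type": {"name": "STATUS_FINAL"}},
        }
    ],
}

result = summarize_espn_event(event)
print(result["home"], result["away"], result["status"])
```

Using a dictionary lookup with a `"scheduled"` default keeps the fallback behavior of the original `else` branch while making it easier to add further ESPN status names later.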
@@ -0,0 +1 @@
"""Unit tests for sportstime_parser."""
@@ -0,0 +1,48 @@
"""Test fixtures for sportstime-parser tests."""

from pathlib import Path

FIXTURES_DIR = Path(__file__).parent

# NBA fixtures
NBA_FIXTURES_DIR = FIXTURES_DIR / "nba"
NBA_BR_OCTOBER_HTML = NBA_FIXTURES_DIR / "basketball_reference_october.html"
NBA_BR_EDGE_CASES_HTML = NBA_FIXTURES_DIR / "basketball_reference_edge_cases.html"
NBA_ESPN_SCOREBOARD_JSON = NBA_FIXTURES_DIR / "espn_scoreboard.json"

# MLB fixtures
MLB_FIXTURES_DIR = FIXTURES_DIR / "mlb"
MLB_ESPN_SCOREBOARD_JSON = MLB_FIXTURES_DIR / "espn_scoreboard.json"

# NFL fixtures
NFL_FIXTURES_DIR = FIXTURES_DIR / "nfl"
NFL_ESPN_SCOREBOARD_JSON = NFL_FIXTURES_DIR / "espn_scoreboard.json"

# NHL fixtures
NHL_FIXTURES_DIR = FIXTURES_DIR / "nhl"
NHL_ESPN_SCOREBOARD_JSON = NHL_FIXTURES_DIR / "espn_scoreboard.json"

# MLS fixtures
MLS_FIXTURES_DIR = FIXTURES_DIR / "mls"
MLS_ESPN_SCOREBOARD_JSON = MLS_FIXTURES_DIR / "espn_scoreboard.json"

# WNBA fixtures
WNBA_FIXTURES_DIR = FIXTURES_DIR / "wnba"
WNBA_ESPN_SCOREBOARD_JSON = WNBA_FIXTURES_DIR / "espn_scoreboard.json"

# NWSL fixtures
NWSL_FIXTURES_DIR = FIXTURES_DIR / "nwsl"
NWSL_ESPN_SCOREBOARD_JSON = NWSL_FIXTURES_DIR / "espn_scoreboard.json"


def load_fixture(path: Path) -> str:
    """Load a fixture file as text."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


def load_json_fixture(path: Path) -> dict:
    """Load a JSON fixture file."""
    import json
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
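A test would typically pass one of the path constants above, such as `MLB_ESPN_SCOREBOARD_JSON`, to `load_json_fixture` and then walk the `events` list. The sketch below is illustrative only: it redefines the loader so that it runs on its own, and it writes a tiny stand-in file to a temporary directory instead of reading the real fixture.

```python
import json
import tempfile
from pathlib import Path


def load_json_fixture(path: Path) -> dict:
    """Load a JSON fixture file (same behavior as the helper in this hunk)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Stand-in for a real fixture file; the content here is made up for the demo.
with tempfile.TemporaryDirectory() as tmp:
    fixture_path = Path(tmp) / "espn_scoreboard.json"
    fixture_path.write_text(
        json.dumps({"events": [{"id": "1"}, {"id": "2"}]}),
        encoding="utf-8",
    )
    data = load_json_fixture(fixture_path)

event_ids = [event["id"] for event in data["events"]]
print(event_ids)
```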
@@ -0,0 +1,245 @@
{
  "leagues": [
    {
      "id": "10",
      "uid": "s:1~l:10",
      "name": "Major League Baseball",
      "abbreviation": "MLB"
    }
  ],
  "season": {
    "type": 2,
    "year": 2026
  },
  "day": {
    "date": "2026-04-15T00:00:00Z"
  },
  "events": [
    {
      "id": "401584801",
      "uid": "s:1~l:10~e:401584801",
      "date": "2026-04-15T23:05:00Z",
      "name": "New York Yankees at Boston Red Sox",
      "shortName": "NYY @ BOS",
      "competitions": [
        {
          "id": "401584801",
          "uid": "s:1~l:10~e:401584801~c:401584801",
          "date": "2026-04-15T23:05:00Z",
          "attendance": 37435,
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "3",
            "fullName": "Fenway Park",
            "address": {
              "city": "Boston",
              "state": "MA"
            },
            "capacity": 37755,
            "indoor": false
          },
          "competitors": [
            {
              "id": "2",
              "uid": "s:1~l:10~t:2",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "2",
                "uid": "s:1~l:10~t:2",
                "location": "Boston",
                "name": "Red Sox",
                "abbreviation": "BOS",
                "displayName": "Boston Red Sox"
              },
              "score": "5",
              "winner": true
            },
            {
              "id": "10",
              "uid": "s:1~l:10~t:10",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "10",
                "uid": "s:1~l:10~t:10",
                "location": "New York",
                "name": "Yankees",
                "abbreviation": "NYY",
                "displayName": "New York Yankees"
              },
              "score": "3",
              "winner": false
            }
          ],
          "status": {
            "clock": 0,
            "displayClock": "0:00",
            "period": 9,
            "type": {
              "id": "3",
              "name": "STATUS_FINAL",
              "state": "post",
              "completed": true
            }
          }
        }
      ]
    },
    {
      "id": "401584802",
      "uid": "s:1~l:10~e:401584802",
      "date": "2026-04-15T20:10:00Z",
      "name": "Chicago Cubs at St. Louis Cardinals",
      "shortName": "CHC @ STL",
      "competitions": [
        {
          "id": "401584802",
          "uid": "s:1~l:10~e:401584802~c:401584802",
          "date": "2026-04-15T20:10:00Z",
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "87",
            "fullName": "Busch Stadium",
            "address": {
              "city": "St. Louis",
              "state": "MO"
            },
            "capacity": 45538,
            "indoor": false
          },
          "competitors": [
            {
              "id": "24",
              "uid": "s:1~l:10~t:24",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "24",
                "uid": "s:1~l:10~t:24",
                "location": "St. Louis",
                "name": "Cardinals",
                "abbreviation": "STL",
                "displayName": "St. Louis Cardinals"
              },
              "score": "7",
              "winner": true
            },
            {
              "id": "16",
              "uid": "s:1~l:10~t:16",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "16",
                "uid": "s:1~l:10~t:16",
                "location": "Chicago",
                "name": "Cubs",
                "abbreviation": "CHC",
                "displayName": "Chicago Cubs"
              },
              "score": "4",
              "winner": false
            }
          ],
          "status": {
            "clock": 0,
            "displayClock": "0:00",
            "period": 9,
            "type": {
              "id": "3",
              "name": "STATUS_FINAL",
              "state": "post",
              "completed": true
            }
          }
        }
      ]
    },
    {
      "id": "401584803",
      "uid": "s:1~l:10~e:401584803",
      "date": "2026-04-16T00:10:00Z",
      "name": "Los Angeles Dodgers at San Francisco Giants",
      "shortName": "LAD @ SF",
      "competitions": [
        {
          "id": "401584803",
          "uid": "s:1~l:10~e:401584803~c:401584803",
          "date": "2026-04-16T00:10:00Z",
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "116",
            "fullName": "Oracle Park",
            "address": {
              "city": "San Francisco",
              "state": "CA"
            },
            "capacity": 41915,
            "indoor": false
          },
          "competitors": [
            {
              "id": "26",
              "uid": "s:1~l:10~t:26",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "26",
                "uid": "s:1~l:10~t:26",
                "location": "San Francisco",
                "name": "Giants",
                "abbreviation": "SF",
                "displayName": "San Francisco Giants"
              },
              "score": null,
              "winner": null
            },
            {
              "id": "19",
              "uid": "s:1~l:10~t:19",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "19",
                "uid": "s:1~l:10~t:19",
                "location": "Los Angeles",
                "name": "Dodgers",
                "abbreviation": "LAD",
                "displayName": "Los Angeles Dodgers"
              },
              "score": null,
              "winner": null
            }
          ],
          "status": {
            "clock": 0,
            "displayClock": "0:00",
            "period": 0,
            "type": {
              "id": "1",
              "name": "STATUS_SCHEDULED",
              "state": "pre",
              "completed": false
            }
          }
        }
      ]
    }
  ]
}
@@ -0,0 +1,245 @@
{
  "leagues": [
    {
      "id": "19",
      "uid": "s:600~l:19",
      "name": "Major League Soccer",
      "abbreviation": "MLS"
    }
  ],
  "season": {
    "type": 2,
    "year": 2026
  },
  "day": {
    "date": "2026-03-15T00:00:00Z"
  },
  "events": [
    {
      "id": "401672001",
      "uid": "s:600~l:19~e:401672001",
      "date": "2026-03-15T22:00:00Z",
      "name": "LA Galaxy at LAFC",
      "shortName": "LA @ LAFC",
      "competitions": [
        {
          "id": "401672001",
          "uid": "s:600~l:19~e:401672001~c:401672001",
          "date": "2026-03-15T22:00:00Z",
          "attendance": 22000,
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "8909",
            "fullName": "BMO Stadium",
            "address": {
              "city": "Los Angeles",
              "state": "CA"
            },
            "capacity": 22000,
            "indoor": false
          },
          "competitors": [
            {
              "id": "21295",
              "uid": "s:600~l:19~t:21295",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "21295",
                "uid": "s:600~l:19~t:21295",
                "location": "Los Angeles",
                "name": "FC",
                "abbreviation": "LAFC",
                "displayName": "Los Angeles FC"
              },
              "score": "3",
              "winner": true
            },
            {
              "id": "3610",
              "uid": "s:600~l:19~t:3610",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "3610",
                "uid": "s:600~l:19~t:3610",
                "location": "Los Angeles",
                "name": "Galaxy",
                "abbreviation": "LA",
                "displayName": "LA Galaxy"
              },
              "score": "2",
              "winner": false
            }
          ],
          "status": {
            "clock": 90,
            "displayClock": "90'",
            "period": 2,
            "type": {
              "id": "3",
              "name": "STATUS_FINAL",
              "state": "post",
              "completed": true
            }
          }
        }
      ]
    },
    {
      "id": "401672002",
      "uid": "s:600~l:19~e:401672002",
      "date": "2026-03-15T23:00:00Z",
      "name": "Seattle Sounders at Portland Timbers",
      "shortName": "SEA @ POR",
      "competitions": [
        {
          "id": "401672002",
          "uid": "s:600~l:19~e:401672002~c:401672002",
          "date": "2026-03-15T23:00:00Z",
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "8070",
            "fullName": "Providence Park",
            "address": {
              "city": "Portland",
              "state": "OR"
            },
            "capacity": 25218,
            "indoor": false
          },
          "competitors": [
            {
              "id": "5282",
              "uid": "s:600~l:19~t:5282",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "5282",
                "uid": "s:600~l:19~t:5282",
                "location": "Portland",
                "name": "Timbers",
                "abbreviation": "POR",
                "displayName": "Portland Timbers"
              },
              "score": "2",
              "winner": false
            },
            {
              "id": "4687",
              "uid": "s:600~l:19~t:4687",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "4687",
                "uid": "s:600~l:19~t:4687",
                "location": "Seattle",
                "name": "Sounders FC",
                "abbreviation": "SEA",
                "displayName": "Seattle Sounders FC"
              },
              "score": "2",
              "winner": false
            }
          ],
          "status": {
            "clock": 90,
            "displayClock": "90'",
            "period": 2,
            "type": {
              "id": "3",
              "name": "STATUS_FINAL",
              "state": "post",
              "completed": true
            }
          }
        }
      ]
    },
    {
      "id": "401672003",
      "uid": "s:600~l:19~e:401672003",
      "date": "2026-03-16T00:00:00Z",
      "name": "New York Red Bulls at Atlanta United",
      "shortName": "NY @ ATL",
      "competitions": [
        {
          "id": "401672003",
          "uid": "s:600~l:19~e:401672003~c:401672003",
          "date": "2026-03-16T00:00:00Z",
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "8904",
            "fullName": "Mercedes-Benz Stadium",
            "address": {
              "city": "Atlanta",
              "state": "GA"
            },
            "capacity": 42500,
            "indoor": true
          },
          "competitors": [
            {
              "id": "18626",
              "uid": "s:600~l:19~t:18626",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "18626",
                "uid": "s:600~l:19~t:18626",
                "location": "Atlanta",
                "name": "United FC",
                "abbreviation": "ATL",
                "displayName": "Atlanta United FC"
              },
              "score": null,
              "winner": null
            },
            {
              "id": "399",
              "uid": "s:600~l:19~t:399",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "399",
                "uid": "s:600~l:19~t:399",
                "location": "New York",
                "name": "Red Bulls",
                "abbreviation": "NY",
                "displayName": "New York Red Bulls"
              },
              "score": null,
              "winner": null
            }
          ],
          "status": {
            "clock": 0,
            "displayClock": "0'",
            "period": 0,
            "type": {
              "id": "1",
              "name": "STATUS_SCHEDULED",
              "state": "pre",
              "completed": false
            }
          }
        }
      ]
    }
  ]
}
@@ -0,0 +1,79 @@
<!DOCTYPE html>
<html>
<head>
  <title>2025-26 NBA Schedule - Edge Cases | Basketball-Reference.com</title>
</head>
<body>
  <table id="schedule" class="stats_table">
    <thead>
      <tr>
        <th data-stat="date_game">Date</th>
        <th data-stat="game_start_time">Start (ET)</th>
        <th data-stat="visitor_team_name">Visitor/Neutral</th>
        <th data-stat="visitor_pts">PTS</th>
        <th data-stat="home_team_name">Home/Neutral</th>
        <th data-stat="home_pts">PTS</th>
        <th data-stat="arena_name">Arena</th>
        <th data-stat="game_remarks">Notes</th>
      </tr>
    </thead>
    <tbody>
      <!-- Postponed game -->
      <tr>
        <th data-stat="date_game">Sat, Jan 11, 2026</th>
        <td data-stat="game_start_time">7:30p</td>
        <td data-stat="visitor_team_name">Los Angeles Lakers</td>
        <td data-stat="visitor_pts"></td>
        <td data-stat="home_team_name">Phoenix Suns</td>
        <td data-stat="home_pts"></td>
        <td data-stat="arena_name">Footprint Center</td>
        <td data-stat="game_remarks">Postponed - Weather</td>
      </tr>
      <!-- Neutral site game (Mexico City) -->
      <tr>
        <th data-stat="date_game">Sat, Nov 8, 2025</th>
        <td data-stat="game_start_time">7:00p</td>
        <td data-stat="visitor_team_name">Miami Heat</td>
        <td data-stat="visitor_pts">105</td>
        <td data-stat="home_team_name">Washington Wizards</td>
        <td data-stat="home_pts">99</td>
        <td data-stat="arena_name">Arena CDMX</td>
        <td data-stat="game_remarks">NBA Mexico City Games</td>
      </tr>
      <!-- Cancelled game -->
      <tr>
        <th data-stat="date_game">Wed, Dec 3, 2025</th>
        <td data-stat="game_start_time">8:00p</td>
        <td data-stat="visitor_team_name">Portland Trail Blazers</td>
        <td data-stat="visitor_pts"></td>
        <td data-stat="home_team_name">Sacramento Kings</td>
        <td data-stat="home_pts"></td>
        <td data-stat="arena_name">Golden 1 Center</td>
        <td data-stat="game_remarks">Cancelled</td>
      </tr>
      <!-- Regular completed game with high scores -->
      <tr>
        <th data-stat="date_game">Sun, Mar 15, 2026</th>
        <td data-stat="game_start_time">3:30p</td>
        <td data-stat="visitor_team_name">Indiana Pacers</td>
        <td data-stat="visitor_pts">147</td>
        <td data-stat="home_team_name">Atlanta Hawks</td>
        <td data-stat="home_pts">150</td>
        <td data-stat="arena_name">State Farm Arena</td>
        <td data-stat="game_remarks">OT</td>
      </tr>
      <!-- Game at arena with special characters -->
      <tr>
        <th data-stat="date_game">Mon, Feb 2, 2026</th>
        <td data-stat="game_start_time">10:30p</td>
        <td data-stat="visitor_team_name">Golden State Warriors</td>
        <td data-stat="visitor_pts">118</td>
        <td data-stat="home_team_name">Los Angeles Clippers</td>
        <td data-stat="home_pts">115</td>
        <td data-stat="arena_name">Intuit Dome</td>
        <td data-stat="game_remarks"></td>
      </tr>
    </tbody>
  </table>
</body>
</html>
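The edge-case fixture above encodes each field in a `data-stat` attribute, so a parser can key cells by that attribute and not by column position. The sketch below shows one way to do this using only the standard library. It is an illustration under assumptions, not the package's Basketball Reference scraper: `ScheduleTableParser` and `classify` are hypothetical names, and the status rules in `classify` are a guess at reasonable behavior based on the fixture's `game_remarks` values.

```python
from html.parser import HTMLParser


class ScheduleTableParser(HTMLParser):
    """Collect one dict per <tbody> row, keyed by each cell's data-stat attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.rows = []
        self._in_tbody = False
        self._row = None   # dict for the row currently being read
        self._stat = None  # data-stat of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self._in_tbody = True
        elif tag == "tr" and self._in_tbody:
            self._row = {}
        elif tag in ("th", "td") and self._row is not None:
            self._stat = dict(attrs).get("data-stat")
            if self._stat:
                self._row[self._stat] = ""

    def handle_data(self, data):
        if self._row is not None and self._stat:
            self._row[self._stat] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._stat = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "tbody":
            self._in_tbody = False


def classify(row: dict) -> str:
    """Hypothetical status rule: remarks first, then presence of both scores."""
    remarks = row.get("game_remarks", "").lower()
    if "postponed" in remarks:
        return "postponed"
    if "cancel" in remarks:
        return "cancelled"
    if row.get("visitor_pts") and row.get("home_pts"):
        return "final"
    return "scheduled"


SAMPLE = """
<table id="schedule"><thead><tr><th data-stat="date_game">Date</th></tr></thead>
<tbody>
<!-- Postponed game -->
<tr><th data-stat="date_game">Sat, Jan 11, 2026</th>
<td data-stat="visitor_pts"></td><td data-stat="home_pts"></td>
<td data-stat="arena_name">Footprint Center</td>
<td data-stat="game_remarks">Postponed - Weather</td></tr>
<tr><th data-stat="date_game">Sat, Nov 8, 2025</th>
<td data-stat="visitor_pts">105</td><td data-stat="home_pts">99</td>
<td data-stat="arena_name">Arena CDMX</td>
<td data-stat="game_remarks">NBA Mexico City Games</td></tr>
</tbody></table>
"""

parser = ScheduleTableParser()
parser.feed(SAMPLE)
statuses = [classify(row) for row in parser.rows]
print(statuses)
```

Header cells inside `<thead>` are skipped because rows are only opened once `<tbody>` has started, which keeps the column titles out of the parsed data.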
@@ -0,0 +1,94 @@
<!DOCTYPE html>
<html>
<head>
  <title>2025-26 NBA Schedule - October | Basketball-Reference.com</title>
</head>
<body>
  <table id="schedule" class="stats_table">
    <thead>
      <tr>
        <th data-stat="date_game">Date</th>
        <th data-stat="game_start_time">Start (ET)</th>
        <th data-stat="visitor_team_name">Visitor/Neutral</th>
        <th data-stat="visitor_pts">PTS</th>
        <th data-stat="home_team_name">Home/Neutral</th>
        <th data-stat="home_pts">PTS</th>
        <th data-stat="arena_name">Arena</th>
        <th data-stat="game_remarks">Notes</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <th data-stat="date_game">Tue, Oct 22, 2025</th>
        <td data-stat="game_start_time">7:30p</td>
        <td data-stat="visitor_team_name">Boston Celtics</td>
        <td data-stat="visitor_pts">112</td>
        <td data-stat="home_team_name">Cleveland Cavaliers</td>
        <td data-stat="home_pts">108</td>
        <td data-stat="arena_name">Rocket Mortgage FieldHouse</td>
        <td data-stat="game_remarks"></td>
      </tr>
      <tr>
        <th data-stat="date_game">Tue, Oct 22, 2025</th>
        <td data-stat="game_start_time">10:00p</td>
        <td data-stat="visitor_team_name">Denver Nuggets</td>
        <td data-stat="visitor_pts">119</td>
        <td data-stat="home_team_name">Los Angeles Lakers</td>
        <td data-stat="home_pts">127</td>
        <td data-stat="arena_name">Crypto.com Arena</td>
        <td data-stat="game_remarks"></td>
      </tr>
      <tr>
        <th data-stat="date_game">Wed, Oct 23, 2025</th>
        <td data-stat="game_start_time">7:00p</td>
        <td data-stat="visitor_team_name">Houston Rockets</td>
        <td data-stat="visitor_pts"></td>
        <td data-stat="home_team_name">Oklahoma City Thunder</td>
        <td data-stat="home_pts"></td>
        <td data-stat="arena_name">Paycom Center</td>
        <td data-stat="game_remarks"></td>
      </tr>
      <tr>
        <th data-stat="date_game">Wed, Oct 23, 2025</th>
        <td data-stat="game_start_time">7:30p</td>
        <td data-stat="visitor_team_name">New York Knicks</td>
        <td data-stat="visitor_pts"></td>
        <td data-stat="home_team_name">Brooklyn Nets</td>
        <td data-stat="home_pts"></td>
        <td data-stat="arena_name">Barclays Center</td>
        <td data-stat="game_remarks"></td>
      </tr>
      <tr>
        <th data-stat="date_game">Thu, Oct 24, 2025</th>
        <td data-stat="game_start_time">7:00p</td>
        <td data-stat="visitor_team_name">Chicago Bulls</td>
        <td data-stat="visitor_pts"></td>
        <td data-stat="home_team_name">Miami Heat</td>
        <td data-stat="home_pts"></td>
        <td data-stat="arena_name">Kaseya Center</td>
        <td data-stat="game_remarks"></td>
      </tr>
      <tr>
        <th data-stat="date_game">Fri, Oct 25, 2025</th>
        <td data-stat="game_start_time">7:30p</td>
        <td data-stat="visitor_team_name">Toronto Raptors</td>
        <td data-stat="visitor_pts"></td>
        <td data-stat="home_team_name">Boston Celtics</td>
        <td data-stat="home_pts"></td>
        <td data-stat="arena_name">TD Garden</td>
        <td data-stat="game_remarks"></td>
      </tr>
      <tr>
        <th data-stat="date_game">Sat, Oct 26, 2025</th>
        <td data-stat="game_start_time">8:00p</td>
        <td data-stat="visitor_team_name">Minnesota Timberwolves</td>
        <td data-stat="visitor_pts"></td>
        <td data-stat="home_team_name">Dallas Mavericks</td>
        <td data-stat="home_pts"></td>
        <td data-stat="arena_name">American Airlines Center</td>
        <td data-stat="game_remarks"></td>
      </tr>
    </tbody>
  </table>
</body>
</html>
@@ -0,0 +1,245 @@
{
  "leagues": [
    {
      "id": "46",
      "uid": "s:40~l:46",
      "name": "National Basketball Association",
      "abbreviation": "NBA"
    }
  ],
  "season": {
    "type": 2,
    "year": 2026
  },
  "day": {
    "date": "2025-10-22T00:00:00Z"
  },
  "events": [
    {
      "id": "401584721",
      "uid": "s:40~l:46~e:401584721",
      "date": "2025-10-22T23:30:00Z",
      "name": "Boston Celtics at Cleveland Cavaliers",
      "shortName": "BOS @ CLE",
      "competitions": [
        {
          "id": "401584721",
          "uid": "s:40~l:46~e:401584721~c:401584721",
          "date": "2025-10-22T23:30:00Z",
          "attendance": 20562,
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "5064",
            "fullName": "Rocket Mortgage FieldHouse",
            "address": {
              "city": "Cleveland",
              "state": "OH"
            },
            "capacity": 19432,
            "indoor": true
          },
          "competitors": [
            {
              "id": "5",
              "uid": "s:40~l:46~t:5",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "5",
                "uid": "s:40~l:46~t:5",
                "location": "Cleveland",
                "name": "Cavaliers",
                "abbreviation": "CLE",
                "displayName": "Cleveland Cavaliers"
              },
              "score": "108",
              "winner": false
            },
            {
              "id": "2",
              "uid": "s:40~l:46~t:2",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "2",
                "uid": "s:40~l:46~t:2",
                "location": "Boston",
                "name": "Celtics",
                "abbreviation": "BOS",
                "displayName": "Boston Celtics"
              },
              "score": "112",
              "winner": true
            }
          ],
          "status": {
            "clock": 0,
            "displayClock": "0:00",
            "period": 4,
            "type": {
              "id": "3",
              "name": "STATUS_FINAL",
              "state": "post",
              "completed": true
            }
          }
        }
      ]
    },
    {
      "id": "401584722",
      "uid": "s:40~l:46~e:401584722",
      "date": "2025-10-23T02:00:00Z",
      "name": "Denver Nuggets at Los Angeles Lakers",
      "shortName": "DEN @ LAL",
      "competitions": [
        {
          "id": "401584722",
          "uid": "s:40~l:46~e:401584722~c:401584722",
          "date": "2025-10-23T02:00:00Z",
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "5091",
            "fullName": "Crypto.com Arena",
            "address": {
              "city": "Los Angeles",
              "state": "CA"
            },
            "capacity": 19068,
            "indoor": true
          },
          "competitors": [
            {
              "id": "13",
              "uid": "s:40~l:46~t:13",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "13",
                "uid": "s:40~l:46~t:13",
                "location": "Los Angeles",
                "name": "Lakers",
                "abbreviation": "LAL",
                "displayName": "Los Angeles Lakers"
              },
              "score": "127",
              "winner": true
            },
            {
              "id": "7",
              "uid": "s:40~l:46~t:7",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "7",
                "uid": "s:40~l:46~t:7",
                "location": "Denver",
                "name": "Nuggets",
                "abbreviation": "DEN",
                "displayName": "Denver Nuggets"
              },
              "score": "119",
              "winner": false
            }
          ],
          "status": {
            "clock": 0,
            "displayClock": "0:00",
            "period": 4,
            "type": {
              "id": "3",
              "name": "STATUS_FINAL",
              "state": "post",
              "completed": true
            }
          }
        }
      ]
    },
    {
      "id": "401584723",
      "uid": "s:40~l:46~e:401584723",
      "date": "2025-10-24T00:00:00Z",
      "name": "Houston Rockets at Oklahoma City Thunder",
      "shortName": "HOU @ OKC",
      "competitions": [
        {
          "id": "401584723",
          "uid": "s:40~l:46~e:401584723~c:401584723",
          "date": "2025-10-24T00:00:00Z",
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "4922",
            "fullName": "Paycom Center",
            "address": {
              "city": "Oklahoma City",
              "state": "OK"
            },
            "capacity": 18203,
            "indoor": true
          },
          "competitors": [
            {
              "id": "25",
              "uid": "s:40~l:46~t:25",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "25",
                "uid": "s:40~l:46~t:25",
                "location": "Oklahoma City",
                "name": "Thunder",
                "abbreviation": "OKC",
                "displayName": "Oklahoma City Thunder"
              },
              "score": null,
              "winner": null
            },
            {
              "id": "10",
              "uid": "s:40~l:46~t:10",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "10",
                "uid": "s:40~l:46~t:10",
                "location": "Houston",
                "name": "Rockets",
                "abbreviation": "HOU",
                "displayName": "Houston Rockets"
              },
              "score": null,
              "winner": null
            }
          ],
          "status": {
            "clock": 0,
            "displayClock": "0:00",
            "period": 0,
            "type": {
              "id": "1",
              "name": "STATUS_SCHEDULED",
              "state": "pre",
              "completed": false
            }
          }
        }
      ]
    }
  ]
}
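One detail in the NBA fixture above is worth illustrating: ESPN timestamps are in UTC, so the Denver at Los Angeles Lakers game (`2025-10-23T02:00:00Z`) falls on October 23 in UTC even though the Basketball Reference fixture lists it at 10:00p ET on October 22. The sketch below shows the difference. It assumes a fixed UTC-4 offset as a stand-in for US Eastern daylight time, and `parse_espn_date` is a hypothetical helper; whether the package converts time zones before formatting dates is not shown in this section.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-4 offset stands in for US Eastern daylight time.
# Production code would more likely use zoneinfo.ZoneInfo("America/New_York").
EASTERN_DAYLIGHT = timezone(timedelta(hours=-4))


def parse_espn_date(date_str: str) -> datetime:
    """Parse an ESPN ISO-8601 timestamp, treating a trailing 'Z' as UTC."""
    return datetime.fromisoformat(date_str.replace("Z", "+00:00"))


tipoff_utc = parse_espn_date("2025-10-23T02:00:00Z")
tipoff_local = tipoff_utc.astimezone(EASTERN_DAYLIGHT)

utc_day = tipoff_utc.strftime("%Y-%m-%d")
local_day = tipoff_local.strftime("%Y-%m-%d %H:%M")

print(utc_day)    # calendar date in UTC
print(local_day)  # calendar date and time at UTC-4
```

If canonical game IDs are built from the calendar date, the two sources would only agree on the date for late games when both are expressed in the same time zone.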
@@ -0,0 +1,245 @@
{
  "leagues": [
    {
      "id": "28",
      "uid": "s:20~l:28",
      "name": "National Football League",
      "abbreviation": "NFL"
    }
  ],
  "season": {
    "type": 2,
    "year": 2025
  },
  "week": {
    "number": 1
  },
  "events": [
    {
      "id": "401671801",
      "uid": "s:20~l:28~e:401671801",
      "date": "2025-09-07T20:00:00Z",
      "name": "Kansas City Chiefs at Baltimore Ravens",
      "shortName": "KC @ BAL",
      "competitions": [
        {
          "id": "401671801",
          "uid": "s:20~l:28~e:401671801~c:401671801",
          "date": "2025-09-07T20:00:00Z",
          "attendance": 71547,
          "type": {
            "id": "1",
            "abbreviation": "STD"
          },
          "venue": {
            "id": "3814",
            "fullName": "M&T Bank Stadium",
            "address": {
              "city": "Baltimore",
              "state": "MD"
            },
            "capacity": 71008,
            "indoor": false
          },
          "competitors": [
            {
              "id": "33",
              "uid": "s:20~l:28~t:33",
              "type": "team",
              "order": 0,
              "homeAway": "home",
              "team": {
                "id": "33",
                "uid": "s:20~l:28~t:33",
                "location": "Baltimore",
                "name": "Ravens",
                "abbreviation": "BAL",
                "displayName": "Baltimore Ravens"
              },
              "score": "20",
              "winner": false
            },
            {
              "id": "12",
              "uid": "s:20~l:28~t:12",
              "type": "team",
              "order": 1,
              "homeAway": "away",
              "team": {
                "id": "12",
                "uid": "s:20~l:28~t:12",
                "location": "Kansas City",
                "name": "Chiefs",
                "abbreviation": "KC",
                "displayName": "Kansas City Chiefs"
              },
              "score": "27",
              "winner": true
            }
          ],
          "status": {
            "clock": 0,
            "displayClock": "0:00",
            "period": 4,
            "type": {
              "id": "3",
              "name": "STATUS_FINAL",
              "state": "post",
              "completed": true
            }
          }
        }
      ]
    },
    {
      "id": "401671802",
      "uid": "s:20~l:28~e:401671802",
      "date": "2025-09-08T17:00:00Z",
||||
"name": "Philadelphia Eagles at Green Bay Packers",
|
||||
"shortName": "PHI @ GB",
|
||||
"competitions": [
|
||||
{
|
||||
"id": "401671802",
|
||||
"uid": "s:20~l:28~e:401671802~c:401671802",
|
||||
"date": "2025-09-08T17:00:00Z",
|
||||
"type": {
|
||||
"id": "1",
|
||||
"abbreviation": "STD"
|
||||
},
|
||||
"venue": {
|
||||
"id": "3798",
|
||||
"fullName": "Lambeau Field",
|
||||
"address": {
|
||||
"city": "Green Bay",
|
||||
"state": "WI"
|
||||
},
|
||||
"capacity": 81441,
|
||||
"indoor": false
|
||||
},
|
||||
"competitors": [
|
||||
{
|
||||
"id": "9",
|
||||
"uid": "s:20~l:28~t:9",
|
||||
"type": "team",
|
||||
"order": 0,
|
||||
"homeAway": "home",
|
||||
"team": {
|
||||
"id": "9",
|
||||
"uid": "s:20~l:28~t:9",
|
||||
"location": "Green Bay",
|
||||
"name": "Packers",
|
||||
"abbreviation": "GB",
|
||||
"displayName": "Green Bay Packers"
|
||||
},
|
||||
"score": "34",
|
||||
"winner": true
|
||||
},
|
||||
{
|
||||
"id": "21",
|
||||
"uid": "s:20~l:28~t:21",
|
||||
"type": "team",
|
||||
"order": 1,
|
||||
"homeAway": "away",
|
||||
"team": {
|
||||
"id": "21",
|
||||
"uid": "s:20~l:28~t:21",
|
||||
"location": "Philadelphia",
|
||||
"name": "Eagles",
|
||||
"abbreviation": "PHI",
|
||||
"displayName": "Philadelphia Eagles"
|
||||
},
|
||||
"score": "29",
|
||||
"winner": false
|
||||
}
|
||||
],
|
||||
"status": {
|
||||
"clock": 0,
|
||||
"displayClock": "0:00",
|
||||
"period": 4,
|
||||
"type": {
|
||||
"id": "3",
|
||||
"name": "STATUS_FINAL",
|
||||
"state": "post",
|
||||
"completed": true
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"id": "401671803",
|
||||
"uid": "s:20~l:28~e:401671803",
|
||||
"date": "2025-09-08T20:25:00Z",
|
||||
"name": "Dallas Cowboys at Cleveland Browns",
|
||||
"shortName": "DAL @ CLE",
|
||||
"competitions": [
|
||||
{
|
||||
"id": "401671803",
|
||||
"uid": "s:20~l:28~e:401671803~c:401671803",
|
||||
"date": "2025-09-08T20:25:00Z",
|
||||
"type": {
|
||||
"id": "1",
|
||||
"abbreviation": "STD"
|
||||
},
|
||||
"venue": {
|
||||
"id": "3653",
|
||||
"fullName": "Cleveland Browns Stadium",
|
||||
"address": {
|
||||
"city": "Cleveland",
|
||||
"state": "OH"
|
||||
},
|
||||
"capacity": 67431,
|
||||
"indoor": false
|
||||
},
|
||||
"competitors": [
|
||||
{
|
||||
"id": "5",
|
||||
"uid": "s:20~l:28~t:5",
|
||||
"type": "team",
|
||||
"order": 0,
|
||||
"homeAway": "home",
|
||||
"team": {
|
||||
"id": "5",
|
||||
"uid": "s:20~l:28~t:5",
|
||||
"location": "Cleveland",
|
||||
"name": "Browns",
|
||||
"abbreviation": "CLE",
|
||||
"displayName": "Cleveland Browns"
|
||||
},
|
||||
"score": null,
|
||||
"winner": null
|
||||
},
|
||||
{
|
||||
"id": "6",
|
||||
"uid": "s:20~l:28~t:6",
|
||||
"type": "team",
|
||||
"order": 1,
|
||||
"homeAway": "away",
|
||||
"team": {
|
||||
"id": "6",
|
||||
"uid": "s:20~l:28~t:6",
|
||||
"location": "Dallas",
|
||||
"name": "Cowboys",
|
||||
"abbreviation": "DAL",
|
||||
"displayName": "Dallas Cowboys"
|
||||
},
|
||||
"score": null,
|
||||
"winner": null
|
||||
}
|
||||
],
|
||||
"status": {
|
||||
"clock": 0,
|
||||
"displayClock": "0:00",
|
||||
"period": 0,
|
||||
"type": {
|
||||
"id": "1",
|
||||
"name": "STATUS_SCHEDULED",
|
||||
"state": "pre",
|
||||
"completed": false
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
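Review note: the fixture above nests each matchup as events, then competitions, then competitors, with `homeAway` marking the two sides. The sketch below shows one way to read that shape. It assumes only the keys visible in the fixture; `extract_matchups` is a hypothetical helper for illustration and is not claimed to be part of `sportstime_parser`.

```python
from typing import Any


def extract_matchups(scoreboard: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten a scoreboard payload into one summary dict per competition.

    Hypothetical helper; relies only on keys that appear in the fixture
    (events, competitions, competitors, homeAway, team, venue, status).
    """
    matchups: list[dict[str, Any]] = []
    for event in scoreboard.get("events", []):
        for comp in event.get("competitions", []):
            # Index the competitors by their "homeAway" marker.
            sides = {c.get("homeAway"): c for c in comp.get("competitors", [])}
            home, away = sides.get("home"), sides.get("away")
            if home is None or away is None:
                continue  # Skip malformed entries rather than guess.
            matchups.append(
                {
                    "event_id": event["id"],
                    "date": comp["date"],
                    "home": home["team"]["abbreviation"],
                    "away": away["team"]["abbreviation"],
                    "venue": comp.get("venue", {}).get("fullName"),
                    "completed": comp["status"]["type"]["completed"],
                }
            )
    return matchups


# Trimmed copy of the third event in the fixture (DAL @ CLE, not yet played).
sample = {
    "events": [
        {
            "id": "401671803",
            "competitions": [
                {
                    "date": "2025-09-08T20:25:00Z",
                    "venue": {"fullName": "Cleveland Browns Stadium"},
                    "competitors": [
                        {"homeAway": "home", "team": {"abbreviation": "CLE"}},
                        {"homeAway": "away", "team": {"abbreviation": "DAL"}},
                    ],
                    "status": {"type": {"completed": False}},
                }
            ],
        }
    ]
}

matchups = extract_matchups(sample)
```

Keying on `homeAway` rather than on list position or `order` keeps the sketch independent of how a source happens to sort the two competitors.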
|
||||
@@ -0,0 +1,245 @@
|
||||
{
|
||||
"leagues": [
|
||||
{
|
||||
"id": "90",
|
||||
"uid": "s:70~l:90",
|
||||
"name": "National Hockey League",
|
||||
"abbreviation": "NHL"
|
||||
}
|
||||
],
|
||||
"season": {
|
||||
"type": 2,
|
||||
"year": 2026
|
||||
},
|
||||
"day": {
|
||||
"date": "2025-10-08T00:00:00Z"
|
||||
},
|
||||
"events": [
|
||||
{
|
||||
"id": "401671901",
|
||||
"uid": "s:70~l:90~e:401671901",
|
||||
"date": "2025-10-08T23:00:00Z",
|
||||
"name": "Pittsburgh Penguins at Boston Bruins",
|
||||
"shortName": "PIT @ BOS",
|
||||
"competitions": [
|
||||
{
|
||||
"id": "401671901",
|
||||
"uid": "s:70~l:90~e:401671901~c:401671901",
|
||||
"date": "2025-10-08T23:00:00Z",
|
||||
"attendance": 17850,
|
||||
"type": {
|
||||
"id": "1",
|
||||
"abbreviation": "STD"
|
||||
},
|
||||
"venue": {
|
||||
"id": "1823",
|
||||
"fullName": "TD Garden",
|
||||
"address": {
|
||||
"city": "Boston",
|
||||
"state": "MA"
|
||||
},
|
||||
"capacity": 17850,
|
||||
"indoor": true
|
||||
},
|
||||
"competitors": [
|
||||
{
|
||||
"id": "1",
|
||||
"uid": "s:70~l:90~t:1",
|
||||
"type": "team",
|
||||
"order": 0,
|
||||
"homeAway": "home",
|
||||
"team": {
|
||||
"id": "1",
|
||||
"uid": "s:70~l:90~t:1",
|
||||
"location": "Boston",
|
||||
"name": "Bruins",
|
||||
"abbreviation": "BOS",
|
||||
"displayName": "Boston Bruins"
|
||||
},
|
||||
"score": "4",
|
||||
"winner": true
|
||||
},
|
||||
{
|
||||
"id": "5",
|
||||
"uid": "s:70~l:90~t:5",
|
||||
"type": "team",
|
||||
"order": 1,
|
||||
"homeAway": "away",
|
||||
"team": {
|
||||
"id": "5",
|
||||
"uid": "s:70~l:90~t:5",
|
||||
"location": "Pittsburgh",
|
||||
"name": "Penguins",
|
||||
"abbreviation": "PIT",
|
||||
"displayName": "Pittsburgh Penguins"
|
||||
},
|
||||
"score": "2",
|
||||
"winner": false
|
||||
}
|
||||
],
|
||||
"status": {
|
||||
"clock": 0,
|
||||
"displayClock": "0:00",
|
||||
"period": 3,
|
||||
"type": {
|
||||
"id": "3",
|
||||
"name": "STATUS_FINAL",
|
||||
"state": "post",
|
||||
"completed": true
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"id": "401671902",
|
||||
"uid": "s:70~l:90~e:401671902",
|
||||
"date": "2025-10-09T00:00:00Z",
|
||||
"name": "Toronto Maple Leafs at Montreal Canadiens",
|
||||
"shortName": "TOR @ MTL",
|
||||
"competitions": [
|
||||
{
|
||||
"id": "401671902",
|
||||
"uid": "s:70~l:90~e:401671902~c:401671902",
|
||||
"date": "2025-10-09T00:00:00Z",
|
||||
"type": {
|
||||
"id": "1",
|
||||
"abbreviation": "STD"
|
||||
},
|
||||
"venue": {
|
||||
"id": "1918",
|
||||
"fullName": "Bell Centre",
|
||||
"address": {
|
||||
"city": "Montreal",
|
||||
"state": "QC"
|
||||
},
|
||||
"capacity": 21302,
|
||||
"indoor": true
|
||||
},
|
||||
"competitors": [
|
||||
{
|
||||
"id": "8",
|
||||
"uid": "s:70~l:90~t:8",
|
||||
"type": "team",
|
||||
"order": 0,
|
||||
"homeAway": "home",
|
||||
"team": {
|
||||
"id": "8",
|
||||
"uid": "s:70~l:90~t:8",
|
||||
"location": "Montreal",
|
||||
"name": "Canadiens",
|
||||
"abbreviation": "MTL",
|
||||
"displayName": "Montreal Canadiens"
|
||||
},
|
||||
"score": "3",
|
||||
"winner": false
|
||||
},
|
||||
{
|
||||
"id": "10",
|
||||
"uid": "s:70~l:90~t:10",
|
||||
"type": "team",
|
||||
"order": 1,
|
||||
"homeAway": "away",
|
||||
"team": {
|
||||
"id": "10",
|
||||
"uid": "s:70~l:90~t:10",
|
||||
"location": "Toronto",
|
||||
"name": "Maple Leafs",
|
||||
"abbreviation": "TOR",
|
||||
"displayName": "Toronto Maple Leafs"
|
||||
},
|
||||
"score": "5",
|
||||
"winner": true
|
||||
}
|
||||
],
|
||||
"status": {
|
||||
"clock": 0,
|
||||
"displayClock": "0:00",
|
||||
"period": 3,
|
||||
"type": {
|
||||
"id": "3",
|
||||
"name": "STATUS_FINAL",
|
||||
"state": "post",
|
||||
"completed": true
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"id": "401671903",
|
||||
"uid": "s:70~l:90~e:401671903",
|
||||
"date": "2025-10-09T02:00:00Z",
|
||||
"name": "Vegas Golden Knights at Los Angeles Kings",
|
||||
"shortName": "VGK @ LAK",
|
||||
"competitions": [
|
||||
{
|
||||
"id": "401671903",
|
||||
"uid": "s:70~l:90~e:401671903~c:401671903",
|
||||
"date": "2025-10-09T02:00:00Z",
|
||||
"type": {
|
||||
"id": "1",
|
||||
"abbreviation": "STD"
|
||||
},
|
||||
"venue": {
|
||||
"id": "1816",
|
||||
"fullName": "Crypto.com Arena",
|
||||
"address": {
|
||||
"city": "Los Angeles",
|
||||
"state": "CA"
|
||||
},
|
||||
"capacity": 18230,
|
||||
"indoor": true
|
||||
},
|
||||
"competitors": [
|
||||
{
|
||||
"id": "26",
|
||||
"uid": "s:70~l:90~t:26",
|
||||
"type": "team",
|
||||
"order": 0,
|
||||
"homeAway": "home",
|
||||
"team": {
|
||||
"id": "26",
|
||||
"uid": "s:70~l:90~t:26",
|
||||
"location": "Los Angeles",
|
||||
"name": "Kings",
|
||||
"abbreviation": "LAK",
|
||||
"displayName": "Los Angeles Kings"
|
||||
},
|
||||
"score": null,
|
||||
"winner": null
|
||||
},
|
||||
{
|
||||
"id": "54",
|
||||
"uid": "s:70~l:90~t:54",
|
||||
"type": "team",
|
||||
"order": 1,
|
||||
"homeAway": "away",
|
||||
"team": {
|
||||
"id": "54",
|
||||
"uid": "s:70~l:90~t:54",
|
||||
"location": "Vegas",
|
||||
"name": "Golden Knights",
|
||||
"abbreviation": "VGK",
|
||||
"displayName": "Vegas Golden Knights"
|
||||
},
|
||||
"score": null,
|
||||
"winner": null
|
||||
}
|
||||
],
|
||||
"status": {
|
||||
"clock": 0,
|
||||
"displayClock": "0:00",
|
||||
"period": 0,
|
||||
"type": {
|
||||
"id": "1",
|
||||
"name": "STATUS_SCHEDULED",
|
||||
"state": "pre",
|
||||
"completed": false
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -0,0 +1,245 @@
|
||||
{
|
||||
"leagues": [
|
||||
{
|
||||
"id": "761",
|
||||
"uid": "s:600~l:761",
|
||||
"name": "National Women's Soccer League",
|
||||
"abbreviation": "NWSL"
|
||||
}
|
||||
],
|
||||
"season": {
|
||||
"type": 2,
|
||||
"year": 2026
|
||||
},
|
||||
"day": {
|
||||
"date": "2026-04-10T00:00:00Z"
|
||||
},
|
||||
"events": [
|
||||
{
|
||||
"id": "401672201",
|
||||
"uid": "s:600~l:761~e:401672201",
|
||||
"date": "2026-04-10T23:00:00Z",
|
||||
"name": "Angel City FC at Portland Thorns",
|
||||
"shortName": "LA @ POR",
|
||||
"competitions": [
|
||||
{
|
||||
"id": "401672201",
|
||||
"uid": "s:600~l:761~e:401672201~c:401672201",
|
||||
"date": "2026-04-10T23:00:00Z",
|
||||
"attendance": 22000,
|
||||
"type": {
|
||||
"id": "1",
|
||||
"abbreviation": "STD"
|
||||
},
|
||||
"venue": {
|
||||
"id": "8070",
|
||||
"fullName": "Providence Park",
|
||||
"address": {
|
||||
"city": "Portland",
|
||||
"state": "OR"
|
||||
},
|
||||
"capacity": 25218,
|
||||
"indoor": false
|
||||
},
|
||||
"competitors": [
|
||||
{
|
||||
"id": "15625",
|
||||
"uid": "s:600~l:761~t:15625",
|
||||
"type": "team",
|
||||
"order": 0,
|
||||
"homeAway": "home",
|
||||
"team": {
|
||||
"id": "15625",
|
||||
"uid": "s:600~l:761~t:15625",
|
||||
"location": "Portland",
|
||||
"name": "Thorns FC",
|
||||
"abbreviation": "POR",
|
||||
"displayName": "Portland Thorns FC"
|
||||
},
|
||||
"score": "2",
|
||||
"winner": true
|
||||
},
|
||||
{
|
||||
"id": "19934",
|
||||
"uid": "s:600~l:761~t:19934",
|
||||
"type": "team",
|
||||
"order": 1,
|
||||
"homeAway": "away",
|
||||
"team": {
|
||||
"id": "19934",
|
||||
"uid": "s:600~l:761~t:19934",
|
||||
"location": "Los Angeles",
|
||||
"name": "Angel City",
|
||||
"abbreviation": "LA",
|
||||
"displayName": "Angel City FC"
|
||||
},
|
||||
"score": "1",
|
||||
"winner": false
|
||||
}
|
||||
],
|
||||
"status": {
|
||||
"clock": 90,
|
||||
"displayClock": "90'",
|
||||
"period": 2,
|
||||
"type": {
|
||||
"id": "3",
|
||||
"name": "STATUS_FINAL",
|
||||
"state": "post",
|
||||
"completed": true
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"id": "401672202",
|
||||
"uid": "s:600~l:761~e:401672202",
|
||||
"date": "2026-04-11T00:00:00Z",
|
||||
"name": "Orlando Pride at North Carolina Courage",
|
||||
"shortName": "ORL @ NC",
|
||||
"competitions": [
|
||||
{
|
||||
"id": "401672202",
|
||||
"uid": "s:600~l:761~e:401672202~c:401672202",
|
||||
"date": "2026-04-11T00:00:00Z",
|
||||
"type": {
|
||||
"id": "1",
|
||||
"abbreviation": "STD"
|
||||
},
|
||||
"venue": {
|
||||
"id": "8073",
|
||||
"fullName": "WakeMed Soccer Park",
|
||||
"address": {
|
||||
"city": "Cary",
|
||||
"state": "NC"
|
||||
},
|
||||
"capacity": 10000,
|
||||
"indoor": false
|
||||
},
|
||||
"competitors": [
|
||||
{
|
||||
"id": "15618",
|
||||
"uid": "s:600~l:761~t:15618",
|
||||
"type": "team",
|
||||
"order": 0,
|
||||
"homeAway": "home",
|
||||
"team": {
|
||||
"id": "15618",
|
||||
"uid": "s:600~l:761~t:15618",
|
||||
"location": "North Carolina",
|
||||
"name": "Courage",
|
||||
"abbreviation": "NC",
|
||||
"displayName": "North Carolina Courage"
|
||||
},
|
||||
"score": "3",
|
||||
"winner": true
|
||||
},
|
||||
{
|
||||
"id": "15626",
|
||||
"uid": "s:600~l:761~t:15626",
|
||||
"type": "team",
|
||||
"order": 1,
|
||||
"homeAway": "away",
|
||||
"team": {
|
||||
"id": "15626",
|
||||
"uid": "s:600~l:761~t:15626",
|
||||
"location": "Orlando",
|
||||
"name": "Pride",
|
||||
"abbreviation": "ORL",
|
||||
"displayName": "Orlando Pride"
|
||||
},
|
||||
"score": "1",
|
||||
"winner": false
|
||||
}
|
||||
],
|
||||
"status": {
|
||||
"clock": 90,
|
||||
"displayClock": "90'",
|
||||
"period": 2,
|
||||
"type": {
|
||||
"id": "3",
|
||||
"name": "STATUS_FINAL",
|
||||
"state": "post",
|
||||
"completed": true
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"id": "401672203",
|
||||
"uid": "s:600~l:761~e:401672203",
|
||||
"date": "2026-04-11T02:00:00Z",
|
||||
"name": "San Diego Wave at Bay FC",
|
||||
"shortName": "SD @ BAY",
|
||||
"competitions": [
|
||||
{
|
||||
"id": "401672203",
|
||||
"uid": "s:600~l:761~e:401672203~c:401672203",
|
||||
"date": "2026-04-11T02:00:00Z",
|
||||
"type": {
|
||||
"id": "1",
|
||||
"abbreviation": "STD"
|
||||
},
|
||||
"venue": {
|
||||
"id": "3945",
|
||||
"fullName": "PayPal Park",
|
||||
"address": {
|
||||
"city": "San Jose",
|
||||
"state": "CA"
|
||||
},
|
||||
"capacity": 18000,
|
||||
"indoor": false
|
||||
},
|
||||
"competitors": [
|
||||
{
|
||||
"id": "25645",
|
||||
"uid": "s:600~l:761~t:25645",
|
||||
"type": "team",
|
||||
"order": 0,
|
||||
"homeAway": "home",
|
||||
"team": {
|
||||
"id": "25645",
|
||||
"uid": "s:600~l:761~t:25645",
|
||||
"location": "Bay Area",
|
||||
"name": "FC",
|
||||
"abbreviation": "BAY",
|
||||
"displayName": "Bay FC"
|
||||
},
|
||||
"score": null,
|
||||
"winner": null
|
||||
},
|
||||
{
|
||||
"id": "22638",
|
||||
"uid": "s:600~l:761~t:22638",
|
||||
"type": "team",
|
||||
"order": 1,
|
||||
"homeAway": "away",
|
||||
"team": {
|
||||
"id": "22638",
|
||||
"uid": "s:600~l:761~t:22638",
|
||||
"location": "San Diego",
|
||||
"name": "Wave FC",
|
||||
"abbreviation": "SD",
|
||||
"displayName": "San Diego Wave FC"
|
||||
},
|
||||
"score": null,
|
||||
"winner": null
|
||||
}
|
||||
],
|
||||
"status": {
|
||||
"clock": 0,
|
||||
"displayClock": "0'",
|
||||
"period": 0,
|
||||
"type": {
|
||||
"id": "1",
|
||||
"name": "STATUS_SCHEDULED",
|
||||
"state": "pre",
|
||||
"completed": false
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -0,0 +1,245 @@
|
||||
{
|
||||
"leagues": [
|
||||
{
|
||||
"id": "59",
|
||||
"uid": "s:40~l:59",
|
||||
"name": "Women's National Basketball Association",
|
||||
"abbreviation": "WNBA"
|
||||
}
|
||||
],
|
||||
"season": {
|
||||
"type": 2,
|
||||
"year": 2026
|
||||
},
|
||||
"day": {
|
||||
"date": "2026-05-20T00:00:00Z"
|
||||
},
|
||||
"events": [
|
||||
{
|
||||
"id": "401672101",
|
||||
"uid": "s:40~l:59~e:401672101",
|
||||
"date": "2026-05-20T23:00:00Z",
|
||||
"name": "Las Vegas Aces at New York Liberty",
|
||||
"shortName": "LV @ NY",
|
||||
"competitions": [
|
||||
{
|
||||
"id": "401672101",
|
||||
"uid": "s:40~l:59~e:401672101~c:401672101",
|
||||
"date": "2026-05-20T23:00:00Z",
|
||||
"attendance": 17732,
|
||||
"type": {
|
||||
"id": "1",
|
||||
"abbreviation": "STD"
|
||||
},
|
||||
"venue": {
|
||||
"id": "4346",
|
||||
"fullName": "Barclays Center",
|
||||
"address": {
|
||||
"city": "Brooklyn",
|
||||
"state": "NY"
|
||||
},
|
||||
"capacity": 17732,
|
||||
"indoor": true
|
||||
},
|
||||
"competitors": [
|
||||
{
|
||||
"id": "9",
|
||||
"uid": "s:40~l:59~t:9",
|
||||
"type": "team",
|
||||
"order": 0,
|
||||
"homeAway": "home",
|
||||
"team": {
|
||||
"id": "9",
|
||||
"uid": "s:40~l:59~t:9",
|
||||
"location": "New York",
|
||||
"name": "Liberty",
|
||||
"abbreviation": "NY",
|
||||
"displayName": "New York Liberty"
|
||||
},
|
||||
"score": "92",
|
||||
"winner": true
|
||||
},
|
||||
{
|
||||
"id": "20",
|
||||
"uid": "s:40~l:59~t:20",
|
||||
"type": "team",
|
||||
"order": 1,
|
||||
"homeAway": "away",
|
||||
"team": {
|
||||
"id": "20",
|
||||
"uid": "s:40~l:59~t:20",
|
||||
"location": "Las Vegas",
|
||||
"name": "Aces",
|
||||
"abbreviation": "LV",
|
||||
"displayName": "Las Vegas Aces"
|
||||
},
|
||||
"score": "88",
|
||||
"winner": false
|
||||
}
|
||||
],
|
||||
"status": {
|
||||
"clock": 0,
|
||||
"displayClock": "0:00",
|
||||
"period": 4,
|
||||
"type": {
|
||||
"id": "3",
|
||||
"name": "STATUS_FINAL",
|
||||
"state": "post",
|
||||
"completed": true
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"id": "401672102",
|
||||
"uid": "s:40~l:59~e:401672102",
|
||||
"date": "2026-05-21T00:00:00Z",
|
||||
"name": "Connecticut Sun at Chicago Sky",
|
||||
"shortName": "CONN @ CHI",
|
||||
"competitions": [
|
||||
{
|
||||
"id": "401672102",
|
||||
"uid": "s:40~l:59~e:401672102~c:401672102",
|
||||
"date": "2026-05-21T00:00:00Z",
|
||||
"type": {
|
||||
"id": "1",
|
||||
"abbreviation": "STD"
|
||||
},
|
||||
"venue": {
|
||||
"id": "8086",
|
||||
"fullName": "Wintrust Arena",
|
||||
"address": {
|
||||
"city": "Chicago",
|
||||
"state": "IL"
|
||||
},
|
||||
"capacity": 10387,
|
||||
"indoor": true
|
||||
},
|
||||
"competitors": [
|
||||
{
|
||||
"id": "6",
|
||||
"uid": "s:40~l:59~t:6",
|
||||
"type": "team",
|
||||
"order": 0,
|
||||
"homeAway": "home",
|
||||
"team": {
|
||||
"id": "6",
|
||||
"uid": "s:40~l:59~t:6",
|
||||
"location": "Chicago",
|
||||
"name": "Sky",
|
||||
"abbreviation": "CHI",
|
||||
"displayName": "Chicago Sky"
|
||||
},
|
||||
"score": "78",
|
||||
"winner": false
|
||||
},
|
||||
{
|
||||
"id": "5",
|
||||
"uid": "s:40~l:59~t:5",
|
||||
"type": "team",
|
||||
"order": 1,
|
||||
"homeAway": "away",
|
||||
"team": {
|
||||
"id": "5",
|
||||
"uid": "s:40~l:59~t:5",
|
||||
"location": "Connecticut",
|
||||
"name": "Sun",
|
||||
"abbreviation": "CONN",
|
||||
"displayName": "Connecticut Sun"
|
||||
},
|
||||
"score": "85",
|
||||
"winner": true
|
||||
}
|
||||
],
|
||||
"status": {
|
||||
"clock": 0,
|
||||
"displayClock": "0:00",
|
||||
"period": 4,
|
||||
"type": {
|
||||
"id": "3",
|
||||
"name": "STATUS_FINAL",
|
||||
"state": "post",
|
||||
"completed": true
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"id": "401672103",
|
||||
"uid": "s:40~l:59~e:401672103",
|
||||
"date": "2026-05-21T02:00:00Z",
|
||||
"name": "Phoenix Mercury at Seattle Storm",
|
||||
"shortName": "PHX @ SEA",
|
||||
"competitions": [
|
||||
{
|
||||
"id": "401672103",
|
||||
"uid": "s:40~l:59~e:401672103~c:401672103",
|
||||
"date": "2026-05-21T02:00:00Z",
|
||||
"type": {
|
||||
"id": "1",
|
||||
"abbreviation": "STD"
|
||||
},
|
||||
"venue": {
|
||||
"id": "3097",
|
||||
"fullName": "Climate Pledge Arena",
|
||||
"address": {
|
||||
"city": "Seattle",
|
||||
"state": "WA"
|
||||
},
|
||||
"capacity": 18100,
|
||||
"indoor": true
|
||||
},
|
||||
"competitors": [
|
||||
{
|
||||
"id": "11",
|
||||
"uid": "s:40~l:59~t:11",
|
||||
"type": "team",
|
||||
"order": 0,
|
||||
"homeAway": "home",
|
||||
"team": {
|
||||
"id": "11",
|
||||
"uid": "s:40~l:59~t:11",
|
||||
"location": "Seattle",
|
||||
"name": "Storm",
|
||||
"abbreviation": "SEA",
|
||||
"displayName": "Seattle Storm"
|
||||
},
|
||||
"score": null,
|
||||
"winner": null
|
||||
},
|
||||
{
|
||||
"id": "8",
|
||||
"uid": "s:40~l:59~t:8",
|
||||
"type": "team",
|
||||
"order": 1,
|
||||
"homeAway": "away",
|
||||
"team": {
|
||||
"id": "8",
|
||||
"uid": "s:40~l:59~t:8",
|
||||
"location": "Phoenix",
|
||||
"name": "Mercury",
|
||||
"abbreviation": "PHX",
|
||||
"displayName": "Phoenix Mercury"
|
||||
},
|
||||
"score": null,
|
||||
"winner": null
|
||||
}
|
||||
],
|
||||
"status": {
|
||||
"clock": 0,
|
||||
"displayClock": "0:00",
|
||||
"period": 0,
|
||||
"type": {
|
||||
"id": "1",
|
||||
"name": "STATUS_SCHEDULED",
|
||||
"state": "pre",
|
||||
"completed": false
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -0,0 +1,269 @@
|
||||
"""Tests for alias loaders."""
|
||||
|
||||
import pytest
|
||||
import json
|
||||
import tempfile
|
||||
from datetime import date
|
||||
from pathlib import Path
|
||||
|
||||
from sportstime_parser.normalizers.alias_loader import (
|
||||
TeamAliasLoader,
|
||||
StadiumAliasLoader,
|
||||
)
|
||||
from sportstime_parser.models.aliases import AliasType
|
||||
|
||||
|
||||
class TestTeamAliasLoader:
|
||||
"""Tests for TeamAliasLoader class."""
|
||||
|
||||
@pytest.fixture
|
||||
def sample_aliases_file(self):
|
||||
"""Create a temporary aliases file for testing."""
|
||||
data = [
|
||||
{
|
||||
"id": "1",
|
||||
"team_canonical_id": "nba_okc",
|
||||
"alias_type": "name",
|
||||
"alias_value": "Seattle SuperSonics",
|
||||
"valid_from": "1967-01-01",
|
||||
"valid_until": "2008-07-02",
|
||||
},
|
||||
{
|
||||
"id": "2",
|
||||
"team_canonical_id": "nba_okc",
|
||||
"alias_type": "name",
|
||||
"alias_value": "Oklahoma City Thunder",
|
||||
"valid_from": "2008-07-03",
|
||||
"valid_until": None,
|
||||
},
|
||||
{
|
||||
"id": "3",
|
||||
"team_canonical_id": "nba_okc",
|
||||
"alias_type": "abbreviation",
|
||||
"alias_value": "OKC",
|
||||
"valid_from": "2008-07-03",
|
||||
"valid_until": None,
|
||||
},
|
||||
]
|
||||
with tempfile.NamedTemporaryFile(
|
||||
mode="w", suffix=".json", delete=False
|
||||
) as f:
|
||||
json.dump(data, f)
|
||||
return Path(f.name)
|
||||
|
||||
def test_load_aliases(self, sample_aliases_file):
|
||||
"""Test loading aliases from file."""
|
||||
loader = TeamAliasLoader(sample_aliases_file)
|
||||
loader.load()
|
||||
assert len(loader._aliases) == 3
|
||||
|
||||
def test_resolve_current_alias(self, sample_aliases_file):
|
||||
"""Test resolving a current alias."""
|
||||
loader = TeamAliasLoader(sample_aliases_file)
|
||||
|
||||
# Current date should resolve to Thunder
|
||||
result = loader.resolve("Oklahoma City Thunder")
|
||||
assert result == "nba_okc"
|
||||
|
||||
# Abbreviation should also work
|
||||
result = loader.resolve("OKC")
|
||||
assert result == "nba_okc"
|
||||
|
||||
def test_resolve_historical_alias(self, sample_aliases_file):
|
||||
"""Test resolving a historical alias with date."""
|
||||
loader = TeamAliasLoader(sample_aliases_file)
|
||||
|
||||
# Historical date should resolve SuperSonics
|
||||
result = loader.resolve("Seattle SuperSonics", check_date=date(2007, 1, 1))
|
||||
assert result == "nba_okc"
|
||||
|
||||
# After relocation, SuperSonics shouldn't resolve
|
||||
result = loader.resolve("Seattle SuperSonics", check_date=date(2010, 1, 1))
|
||||
assert result is None
|
||||
|
||||
def test_resolve_case_insensitive(self, sample_aliases_file):
|
||||
"""Test case insensitive resolution."""
|
||||
loader = TeamAliasLoader(sample_aliases_file)
|
||||
|
||||
result = loader.resolve("oklahoma city thunder")
|
||||
assert result == "nba_okc"
|
||||
|
||||
result = loader.resolve("okc")
|
||||
assert result == "nba_okc"
|
||||
|
||||
def test_resolve_with_type_filter(self, sample_aliases_file):
|
||||
"""Test filtering by alias type."""
|
||||
loader = TeamAliasLoader(sample_aliases_file)
|
||||
|
||||
# Should find when searching all types
|
||||
result = loader.resolve("OKC")
|
||||
assert result == "nba_okc"
|
||||
|
||||
# Should not find when filtering to name only
|
||||
result = loader.resolve("OKC", alias_types=[AliasType.NAME])
|
||||
assert result is None
|
||||
|
||||
def test_get_aliases_for_team(self, sample_aliases_file):
|
||||
"""Test getting all aliases for a team."""
|
||||
loader = TeamAliasLoader(sample_aliases_file)
|
||||
|
||||
aliases = loader.get_aliases_for_team("nba_okc")
|
||||
assert len(aliases) == 3
|
||||
|
||||
# Filter by current date
|
||||
aliases = loader.get_aliases_for_team(
|
||||
"nba_okc", check_date=date(2020, 1, 1)
|
||||
)
|
||||
assert len(aliases) == 2 # Thunder name + OKC abbreviation
|
||||
|
||||
def test_missing_file(self):
|
||||
"""Test handling of missing file."""
|
||||
loader = TeamAliasLoader(Path("/nonexistent/file.json"))
|
||||
loader.load() # Should not raise
|
||||
assert len(loader._aliases) == 0
|
||||
|
||||
|
||||
class TestStadiumAliasLoader:
|
||||
"""Tests for StadiumAliasLoader class."""
|
||||
|
||||
@pytest.fixture
|
||||
def sample_stadium_aliases(self):
|
||||
"""Create a temporary stadium aliases file."""
|
||||
data = [
|
||||
{
|
||||
"alias_name": "Crypto.com Arena",
|
||||
"stadium_canonical_id": "crypto_arena_los_angeles_ca",
|
||||
"valid_from": "2021-12-25",
|
||||
"valid_until": None,
|
||||
},
|
||||
{
|
||||
"alias_name": "Staples Center",
|
||||
"stadium_canonical_id": "crypto_arena_los_angeles_ca",
|
||||
"valid_from": "1999-10-17",
|
||||
"valid_until": "2021-12-24",
|
||||
},
|
||||
]
|
||||
with tempfile.NamedTemporaryFile(
|
||||
mode="w", suffix=".json", delete=False
|
||||
) as f:
|
||||
json.dump(data, f)
|
||||
return Path(f.name)
|
||||
|
||||
def test_load_stadium_aliases(self, sample_stadium_aliases):
|
||||
"""Test loading stadium aliases."""
|
||||
loader = StadiumAliasLoader(sample_stadium_aliases)
|
||||
loader.load()
|
||||
assert len(loader._aliases) == 2
|
||||
|
||||
def test_resolve_current_name(self, sample_stadium_aliases):
|
||||
"""Test resolving current stadium name."""
|
||||
loader = StadiumAliasLoader(sample_stadium_aliases)
|
||||
|
||||
result = loader.resolve("Crypto.com Arena")
|
||||
assert result == "crypto_arena_los_angeles_ca"
|
||||
|
||||
def test_resolve_historical_name(self, sample_stadium_aliases):
|
||||
"""Test resolving historical stadium name."""
|
||||
loader = StadiumAliasLoader(sample_stadium_aliases)
|
||||
|
||||
# Staples Center in 2020
|
||||
result = loader.resolve("Staples Center", check_date=date(2020, 1, 1))
|
||||
assert result == "crypto_arena_los_angeles_ca"
|
||||
|
||||
# Staples Center after rename shouldn't resolve
|
||||
result = loader.resolve("Staples Center", check_date=date(2023, 1, 1))
|
||||
assert result is None
|
||||
|
||||
def test_date_boundary(self, sample_stadium_aliases):
|
||||
"""Test exact date boundaries."""
|
||||
loader = StadiumAliasLoader(sample_stadium_aliases)
|
||||
|
||||
# Last day of Staples Center
|
||||
result = loader.resolve("Staples Center", check_date=date(2021, 12, 24))
|
||||
assert result == "crypto_arena_los_angeles_ca"
|
||||
|
||||
# First day of Crypto.com Arena
|
||||
result = loader.resolve("Crypto.com Arena", check_date=date(2021, 12, 25))
|
||||
assert result == "crypto_arena_los_angeles_ca"
|
||||
|
||||
def test_get_all_names(self, sample_stadium_aliases):
|
||||
"""Test getting all stadium names."""
|
||||
loader = StadiumAliasLoader(sample_stadium_aliases)
|
||||
|
||||
names = loader.get_all_names()
|
||||
assert len(names) == 2
|
||||
assert "Crypto.com Arena" in names
|
||||
assert "Staples Center" in names
|
||||
|
||||
|
||||
class TestDateRangeHandling:
|
||||
"""Tests for date range edge cases in aliases."""
|
||||
|
||||
@pytest.fixture
|
||||
def date_range_aliases(self):
|
||||
"""Create aliases with various date range scenarios."""
|
||||
data = [
|
||||
{
|
||||
"id": "1",
|
||||
"team_canonical_id": "test_team",
|
||||
"alias_type": "name",
|
||||
"alias_value": "Always Valid",
|
||||
"valid_from": None,
|
||||
"valid_until": None,
|
||||
},
|
||||
{
|
||||
"id": "2",
|
||||
"team_canonical_id": "test_team",
|
||||
"alias_type": "name",
|
||||
"alias_value": "Future Only",
|
||||
"valid_from": "2030-01-01",
|
||||
"valid_until": None,
|
||||
},
|
||||
{
|
||||
"id": "3",
|
||||
"team_canonical_id": "test_team",
|
||||
"alias_type": "name",
|
||||
"alias_value": "Past Only",
|
||||
"valid_from": None,
|
||||
"valid_until": "2000-01-01",
|
||||
},
|
||||
]
|
||||
with tempfile.NamedTemporaryFile(
|
||||
mode="w", suffix=".json", delete=False
|
||||
) as f:
|
||||
json.dump(data, f)
|
||||
return Path(f.name)
|
||||
|
||||
def test_always_valid_alias(self, date_range_aliases):
|
||||
"""Test alias with no date restrictions."""
|
||||
loader = TeamAliasLoader(date_range_aliases)
|
||||
|
||||
result = loader.resolve("Always Valid", check_date=date(2025, 1, 1))
|
||||
assert result == "test_team"
|
||||
|
||||
result = loader.resolve("Always Valid", check_date=date(1990, 1, 1))
|
||||
assert result == "test_team"
|
||||
|
||||
def test_future_only_alias(self, date_range_aliases):
|
||||
"""Test alias that starts in the future."""
|
||||
loader = TeamAliasLoader(date_range_aliases)
|
||||
|
||||
# Before valid_from
|
||||
result = loader.resolve("Future Only", check_date=date(2025, 1, 1))
|
||||
assert result is None
|
||||
|
||||
# After valid_from
|
||||
result = loader.resolve("Future Only", check_date=date(2035, 1, 1))
|
||||
assert result == "test_team"
|
||||
|
||||
def test_past_only_alias(self, date_range_aliases):
|
||||
"""Test alias that expired in the past."""
|
||||
loader = TeamAliasLoader(date_range_aliases)
|
||||
|
||||
# Before valid_until
|
||||
result = loader.resolve("Past Only", check_date=date(1990, 1, 1))
|
||||
assert result == "test_team"
|
||||
|
||||
# After valid_until
|
||||
result = loader.resolve("Past Only", check_date=date(2025, 1, 1))
|
||||
assert result is None
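Review note: the date tests in this file imply a validity window that is inclusive at both ends (Staples Center still resolves on 2021-12-24, Crypto.com Arena already resolves on 2021-12-25) and open-ended when a bound is `None`. A minimal sketch of that check follows. `is_alias_active` is a hypothetical name used for illustration; the real loader may implement the rule differently.

```python
from datetime import date
from typing import Optional


def is_alias_active(
    valid_from: Optional[date],
    valid_until: Optional[date],
    check_date: date,
) -> bool:
    """Return True if check_date falls inside the alias validity window.

    Assumption drawn from the tests: both bounds are inclusive, and a
    bound of None means the window is unbounded on that side.
    """
    if valid_from is not None and check_date < valid_from:
        return False
    if valid_until is not None and check_date > valid_until:
        return False
    return True


# Staples Center window from the stadium fixture.
staples_from, staples_until = date(1999, 10, 17), date(2021, 12, 24)
last_day_ok = is_alias_active(staples_from, staples_until, date(2021, 12, 24))
day_after_ok = is_alias_active(staples_from, staples_until, date(2021, 12, 25))
```

Treating `None` as unbounded keeps the "Always Valid", "Future Only", and "Past Only" cases in one code path instead of three.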
|
||||
@@ -0,0 +1,183 @@
|
||||
"""Tests for canonical ID generation."""
|
||||
|
||||
import pytest
|
||||
from datetime import datetime, date
|
||||
|
||||
from sportstime_parser.normalizers.canonical_id import (
|
||||
generate_game_id,
|
||||
generate_team_id,
|
||||
generate_team_id_from_abbrev,
|
||||
generate_stadium_id,
|
||||
parse_game_id,
|
||||
normalize_string,
|
||||
)
|
||||
|
||||
|
||||
class TestNormalizeString:
|
||||
"""Tests for normalize_string function."""
|
||||
|
||||
def test_basic_normalization(self):
|
||||
"""Test basic string normalization."""
|
||||
assert normalize_string("New York") == "new_york"
|
||||
assert normalize_string("Los Angeles") == "los_angeles"
|
||||
|
||||
def test_removes_special_characters(self):
|
||||
"""Test that special characters are removed."""
|
||||
assert normalize_string("AT&T Stadium") == "att_stadium"
|
||||
assert normalize_string("St. Louis") == "st_louis"
|
||||
assert normalize_string("O'Brien Field") == "obrien_field"
|
||||
|
||||
    def test_collapses_whitespace(self):
        """Test that runs of whitespace collapse to a single separator."""
        assert normalize_string("New    York") == "new_york"
        assert normalize_string("  Los   Angeles  ") == "los_angeles"

    def test_empty_string(self):
        """Test that empty and whitespace-only strings normalize to empty."""
        assert normalize_string("") == ""
        assert normalize_string("   ") == ""
|
||||
|
||||
def test_unicode_normalization(self):
|
||||
"""Test unicode characters are handled."""
|
||||
assert normalize_string("Café") == "cafe"
|
||||
assert normalize_string("José") == "jose"
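A minimal sketch of a normalizer that would satisfy the assertions in `TestNormalizeString`. The real `normalize_string` in `sportstime_parser.normalizers.canonical_id` is not shown in this diff, so the name `normalize_string_sketch` and the implementation below are assumptions for illustration, not the package's code.

```python
import re
import unicodedata


def normalize_string_sketch(value: str) -> str:
    """Hypothetical stand-in for normalize_string (not the package's code)."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Lowercase and remove everything except letters, digits, and whitespace.
    cleaned = re.sub(r"[^a-z0-9\s]", "", ascii_only.lower())
    # split() with no arguments trims the ends and collapses whitespace runs.
    return "_".join(cleaned.split())
```

Removing punctuation outright (rather than replacing it with a separator) is what makes "AT&T" become "att" and "Crypto.com" become "cryptocom", matching the expected IDs in these tests.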
|
||||
|
||||
|
||||
class TestGenerateGameId:
|
||||
"""Tests for generate_game_id function."""
|
||||
|
||||
def test_basic_game_id(self):
|
||||
"""Test basic game ID generation."""
|
||||
game_id = generate_game_id(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
away_abbrev="bos",
|
||||
home_abbrev="lal",
|
||||
game_date=date(2025, 12, 25),
|
||||
)
|
||||
assert game_id == "nba_2025_bos_lal_1225"
|
||||
|
||||
def test_game_id_with_datetime(self):
|
||||
"""Test game ID generation with datetime object."""
|
||||
game_id = generate_game_id(
|
||||
sport="mlb",
|
||||
season=2026,
|
||||
away_abbrev="nyy",
|
||||
home_abbrev="bos",
|
||||
game_date=datetime(2026, 4, 1, 19, 0),
|
||||
)
|
||||
assert game_id == "mlb_2026_nyy_bos_0401"
|
||||
|
||||
def test_game_id_with_game_number(self):
|
||||
"""Test game ID for doubleheader."""
|
||||
game_id_1 = generate_game_id(
|
||||
sport="mlb",
|
||||
season=2026,
|
||||
away_abbrev="nyy",
|
||||
home_abbrev="bos",
|
||||
game_date=date(2026, 7, 4),
|
||||
game_number=1,
|
||||
)
|
||||
game_id_2 = generate_game_id(
|
||||
sport="mlb",
|
||||
season=2026,
|
||||
away_abbrev="nyy",
|
||||
home_abbrev="bos",
|
||||
game_date=date(2026, 7, 4),
|
||||
game_number=2,
|
||||
)
|
||||
assert game_id_1 == "mlb_2026_nyy_bos_0704_1"
|
||||
assert game_id_2 == "mlb_2026_nyy_bos_0704_2"
|
||||
|
||||
def test_sport_lowercased(self):
|
||||
"""Test that sport is lowercased."""
|
||||
game_id = generate_game_id(
|
||||
sport="NBA",
|
||||
season=2025,
|
||||
away_abbrev="BOS",
|
||||
home_abbrev="LAL",
|
||||
game_date=date(2025, 12, 25),
|
||||
)
|
||||
assert game_id == "nba_2025_bos_lal_1225"
|
||||
|
||||
|
||||
class TestParseGameId:
|
||||
"""Tests for parse_game_id function."""
|
||||
|
||||
def test_parse_basic_game_id(self):
|
||||
"""Test parsing a basic game ID."""
|
||||
parsed = parse_game_id("nba_2025_bos_lal_1225")
|
||||
assert parsed["sport"] == "nba"
|
||||
assert parsed["season"] == 2025
|
||||
assert parsed["away_abbrev"] == "bos"
|
||||
assert parsed["home_abbrev"] == "lal"
|
||||
assert parsed["month"] == 12
|
||||
assert parsed["day"] == 25
|
||||
assert parsed["game_number"] is None
|
||||
|
||||
def test_parse_game_id_with_game_number(self):
|
||||
"""Test parsing game ID with game number."""
|
||||
parsed = parse_game_id("mlb_2026_nyy_bos_0704_2")
|
||||
assert parsed["sport"] == "mlb"
|
||||
assert parsed["season"] == 2026
|
||||
assert parsed["away_abbrev"] == "nyy"
|
||||
assert parsed["home_abbrev"] == "bos"
|
||||
assert parsed["month"] == 7
|
||||
assert parsed["day"] == 4
|
||||
assert parsed["game_number"] == 2
|
||||
|
||||
def test_parse_invalid_game_id(self):
|
||||
"""Test parsing invalid game ID raises error."""
|
||||
with pytest.raises(ValueError):
|
||||
parse_game_id("invalid")
|
||||
with pytest.raises(ValueError):
|
||||
parse_game_id("nba_2025_bos")
|
||||
with pytest.raises(ValueError):
|
||||
parse_game_id("")
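The ID layout exercised by `TestGenerateGameId` and `TestParseGameId` can be sketched as a join and split on underscores. The function names `game_id_sketch` and `parse_game_id_sketch` are hypothetical, and the validation rules are an assumption inferred from the test cases; the package's actual implementation is not part of this section.

```python
from datetime import date, datetime
from typing import Optional, Union


def game_id_sketch(
    sport: str,
    season: int,
    away_abbrev: str,
    home_abbrev: str,
    game_date: Union[date, datetime],
    game_number: Optional[int] = None,
) -> str:
    """Hypothetical: build sport_season_away_home_MMDD[_N]."""
    parts = [
        sport.lower(),
        str(season),
        away_abbrev.lower(),
        home_abbrev.lower(),
        game_date.strftime("%m%d"),  # works for both date and datetime
    ]
    if game_number is not None:
        parts.append(str(game_number))  # doubleheader suffix
    return "_".join(parts)


def parse_game_id_sketch(game_id: str) -> dict:
    """Hypothetical inverse of game_id_sketch; raises ValueError on bad input."""
    parts = game_id.split("_")
    if len(parts) not in (5, 6):
        raise ValueError(f"Invalid game ID: {game_id!r}")
    sport, season, away, home, mmdd = parts[:5]
    if not (season.isdigit() and mmdd.isdigit() and len(mmdd) == 4):
        raise ValueError(f"Invalid game ID: {game_id!r}")
    game_number = None
    if len(parts) == 6:
        if not parts[5].isdigit():
            raise ValueError(f"Invalid game ID: {game_id!r}")
        game_number = int(parts[5])
    return {
        "sport": sport,
        "season": int(season),
        "away_abbrev": away,
        "home_abbrev": home,
        "month": int(mmdd[:2]),
        "day": int(mmdd[2:]),
        "game_number": game_number,
    }
```

Splitting on underscores only works while abbreviations never contain one, which holds for every abbreviation used in these tests.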
|
||||
|
||||
|
||||
class TestGenerateTeamId:
|
||||
"""Tests for generate_team_id function."""
|
||||
|
||||
def test_basic_team_id(self):
|
||||
"""Test basic team ID generation from city and name."""
|
||||
team_id = generate_team_id(sport="nba", city="Los Angeles", name="Lakers")
|
||||
assert team_id == "team_nba_los_angeles_lakers"
|
||||
|
||||
    def test_team_id_normalizes_input(self):
        """Test that sport, city, and name are normalized."""
        team_id = generate_team_id(sport="NBA", city="New York", name="Knicks")
        assert team_id == "team_nba_new_york_knicks"
|
||||
|
||||
|
||||
class TestGenerateTeamIdFromAbbrev:
|
||||
"""Tests for generate_team_id_from_abbrev function."""
|
||||
|
||||
def test_basic_team_id_from_abbrev(self):
|
||||
"""Test team ID from abbreviation."""
|
||||
team_id = generate_team_id_from_abbrev(sport="nba", abbreviation="LAL")
|
||||
assert team_id == "team_nba_lal"
|
||||
|
||||
def test_lowercases_abbreviation(self):
|
||||
"""Test abbreviation is lowercased."""
|
||||
team_id = generate_team_id_from_abbrev(sport="MLB", abbreviation="NYY")
|
||||
assert team_id == "team_mlb_nyy"
|
||||
|
||||
|
||||
class TestGenerateStadiumId:
|
||||
"""Tests for generate_stadium_id function."""
|
||||
|
||||
def test_basic_stadium_id(self):
|
||||
"""Test basic stadium ID generation."""
|
||||
stadium_id = generate_stadium_id(sport="mlb", name="Fenway Park")
|
||||
assert stadium_id == "stadium_mlb_fenway_park"
|
||||
|
||||
def test_stadium_id_special_characters(self):
|
||||
"""Test stadium ID with special characters."""
|
||||
stadium_id = generate_stadium_id(sport="nfl", name="AT&T Stadium")
|
||||
assert stadium_id == "stadium_nfl_att_stadium"
|
||||
|
||||
def test_stadium_id_with_sponsor(self):
|
||||
"""Test stadium ID with sponsor name."""
|
||||
stadium_id = generate_stadium_id(sport="nba", name="Crypto.com Arena")
|
||||
assert stadium_id == "stadium_nba_cryptocom_arena"
|
||||
@@ -0,0 +1,194 @@
|
||||
"""Tests for fuzzy string matching utilities."""
|
||||
|
||||
import pytest
|
||||
|
||||
from sportstime_parser.normalizers.fuzzy import (
    normalize_for_matching,
    fuzzy_match_team,
    fuzzy_match_stadium,
    exact_match,
    best_match,
    calculate_similarity,
    MatchCandidate,
)
|
||||
|
||||
|
||||
class TestNormalizeForMatching:
|
||||
"""Tests for normalize_for_matching function."""
|
||||
|
||||
def test_basic_normalization(self):
|
||||
"""Test basic string normalization."""
|
||||
assert normalize_for_matching("Los Angeles Lakers") == "los angeles lakers"
|
||||
assert normalize_for_matching(" Boston Celtics ") == "boston celtics"
|
||||
|
||||
def test_removes_common_prefixes(self):
|
||||
"""Test removal of common prefixes."""
|
||||
assert normalize_for_matching("The Boston Celtics") == "boston celtics"
|
||||
assert normalize_for_matching("Team Lakers") == "lakers"
|
||||
|
||||
def test_removes_stadium_suffixes(self):
|
||||
"""Test removal of stadium-related suffixes."""
|
||||
assert normalize_for_matching("Fenway Park") == "fenway"
|
||||
assert normalize_for_matching("Madison Square Garden Arena") == "madison square garden"
|
||||
assert normalize_for_matching("Wrigley Field") == "wrigley"
|
||||
assert normalize_for_matching("TD Garden Center") == "td garden"
|
||||
|
||||
|
||||
class TestExactMatch:
|
||||
"""Tests for exact_match function."""
|
||||
|
||||
def test_exact_match_primary_name(self):
|
||||
"""Test exact match on primary name."""
|
||||
candidates = [
|
||||
MatchCandidate("nba_lal", "Los Angeles Lakers", ["Lakers", "LAL"]),
|
||||
MatchCandidate("nba_bos", "Boston Celtics", ["Celtics", "BOS"]),
|
||||
]
|
||||
assert exact_match("Los Angeles Lakers", candidates) == "nba_lal"
|
||||
assert exact_match("Boston Celtics", candidates) == "nba_bos"
|
||||
|
||||
def test_exact_match_alias(self):
|
||||
"""Test exact match on alias."""
|
||||
candidates = [
|
||||
MatchCandidate("nba_lal", "Los Angeles Lakers", ["Lakers", "LAL"]),
|
||||
]
|
||||
assert exact_match("Lakers", candidates) == "nba_lal"
|
||||
assert exact_match("LAL", candidates) == "nba_lal"
|
||||
|
||||
def test_case_insensitive(self):
|
||||
"""Test case insensitive matching."""
|
||||
candidates = [
|
||||
MatchCandidate("nba_lal", "Los Angeles Lakers", ["Lakers"]),
|
||||
]
|
||||
assert exact_match("los angeles lakers", candidates) == "nba_lal"
|
||||
assert exact_match("LAKERS", candidates) == "nba_lal"
|
||||
|
||||
def test_no_match(self):
|
||||
"""Test no match returns None."""
|
||||
candidates = [
|
||||
MatchCandidate("nba_lal", "Los Angeles Lakers", ["Lakers"]),
|
||||
]
|
||||
assert exact_match("New York Knicks", candidates) is None
|
||||
|
||||
|
||||
class TestFuzzyMatchTeam:
|
||||
"""Tests for fuzzy_match_team function."""
|
||||
|
||||
def test_close_match(self):
|
||||
"""Test fuzzy matching finds close matches."""
|
||||
candidates = [
|
||||
MatchCandidate("nba_lal", "Los Angeles Lakers", ["Lakers", "LA Lakers"]),
|
||||
MatchCandidate("nba_lac", "Los Angeles Clippers", ["Clippers", "LA Clippers"]),
|
||||
]
|
||||
matches = fuzzy_match_team("LA Lakers", candidates, threshold=70)
|
||||
assert len(matches) > 0
|
||||
assert matches[0].canonical_id == "nba_lal"
|
||||
|
||||
def test_partial_name_match(self):
|
||||
"""Test matching on partial team name."""
|
||||
candidates = [
|
||||
MatchCandidate("nba_bos", "Boston Celtics", ["Celtics", "BOS"]),
|
||||
]
|
||||
matches = fuzzy_match_team("Celtics", candidates, threshold=80)
|
||||
assert len(matches) > 0
|
||||
assert matches[0].canonical_id == "nba_bos"
|
||||
|
||||
def test_threshold_filtering(self):
|
||||
"""Test that threshold filters low-confidence matches."""
|
||||
candidates = [
|
||||
MatchCandidate("nba_bos", "Boston Celtics", []),
|
||||
]
|
||||
# Very different string should not match at high threshold
|
||||
matches = fuzzy_match_team("xyz123", candidates, threshold=90)
|
||||
assert len(matches) == 0
|
||||
|
||||
def test_returns_top_n(self):
|
||||
"""Test that top_n parameter limits results."""
|
||||
candidates = [
|
||||
MatchCandidate("nba_lal", "Los Angeles Lakers", []),
|
||||
MatchCandidate("nba_lac", "Los Angeles Clippers", []),
|
||||
MatchCandidate("mlb_lad", "Los Angeles Dodgers", []),
|
||||
]
|
||||
matches = fuzzy_match_team("Los Angeles", candidates, threshold=50, top_n=2)
|
||||
assert len(matches) <= 2
|
||||
|
||||
|
||||
class TestFuzzyMatchStadium:
|
||||
"""Tests for fuzzy_match_stadium function."""
|
||||
|
||||
def test_stadium_match(self):
|
||||
"""Test fuzzy matching stadium names."""
|
||||
candidates = [
|
||||
MatchCandidate("fenway", "Fenway Park", ["Fenway"]),
|
||||
MatchCandidate("td_garden", "TD Garden", ["Boston Garden"]),
|
||||
]
|
||||
matches = fuzzy_match_stadium("Fenway Park Boston", candidates, threshold=70)
|
||||
assert len(matches) > 0
|
||||
assert matches[0].canonical_id == "fenway"
|
||||
|
||||
def test_naming_rights_change(self):
|
||||
"""Test matching old stadium names."""
|
||||
candidates = [
|
||||
MatchCandidate(
|
||||
"chase_center",
|
||||
"Chase Center",
|
||||
["Oracle Arena", "Oakland Coliseum Arena"],
|
||||
),
|
||||
]
|
||||
# Should match on alias
|
||||
matches = fuzzy_match_stadium("Oracle Arena", candidates, threshold=70)
|
||||
assert len(matches) > 0
|
||||
|
||||
|
||||
class TestBestMatch:
|
||||
"""Tests for best_match function."""
|
||||
|
||||
def test_prefers_exact_match(self):
|
||||
"""Test that exact match is preferred over fuzzy."""
|
||||
candidates = [
|
||||
MatchCandidate("nba_lal", "Los Angeles Lakers", ["Lakers"]),
|
||||
MatchCandidate("nba_bos", "Boston Celtics", ["Celtics"]),
|
||||
]
|
||||
result = best_match("Lakers", candidates)
|
||||
assert result is not None
|
||||
assert result.canonical_id == "nba_lal"
|
||||
assert result.confidence == 100 # Exact match
|
||||
|
||||
def test_falls_back_to_fuzzy(self):
|
||||
"""Test fallback to fuzzy when no exact match."""
|
||||
candidates = [
|
||||
MatchCandidate("nba_lal", "Los Angeles Lakers", ["Lakers"]),
|
||||
]
|
||||
result = best_match("LA Laker", candidates, threshold=70)
|
||||
assert result is not None
|
||||
assert result.confidence < 100 # Fuzzy match
|
||||
|
||||
def test_no_match_below_threshold(self):
|
||||
"""Test returns None when no match above threshold."""
|
||||
candidates = [
|
||||
MatchCandidate("nba_lal", "Los Angeles Lakers", []),
|
||||
]
|
||||
result = best_match("xyz123", candidates, threshold=90)
|
||||
assert result is None
|
||||
|
||||
|
||||
class TestCalculateSimilarity:
|
||||
"""Tests for calculate_similarity function."""
|
||||
|
||||
def test_identical_strings(self):
|
||||
"""Test identical strings have 100% similarity."""
|
||||
assert calculate_similarity("Boston Celtics", "Boston Celtics") == 100
|
||||
|
||||
def test_similar_strings(self):
|
||||
"""Test similar strings have high similarity."""
|
||||
score = calculate_similarity("Boston Celtics", "Celtics Boston")
|
||||
assert score >= 90
|
||||
|
||||
def test_different_strings(self):
|
||||
"""Test different strings have low similarity."""
|
||||
score = calculate_similarity("Boston Celtics", "Los Angeles Lakers")
|
||||
assert score < 50
|
||||
|
||||
def test_empty_string(self):
|
||||
"""Test empty string handling."""
|
||||
score = calculate_similarity("", "Boston Celtics")
|
||||
assert score == 0
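The behaviour checked in `TestCalculateSimilarity` (identical strings score 100, reordered words still score high, an empty string scores 0) resembles a token-sort comparison. The sketch below uses the standard library's `difflib` purely for illustration; which matching library `sportstime_parser.normalizers.fuzzy` actually uses is not visible here, and `token_sort_similarity` is a hypothetical name.

```python
from difflib import SequenceMatcher


def token_sort_similarity(a: str, b: str) -> int:
    """Hypothetical 0-100 similarity that ignores word order and case."""
    # Treat an empty or whitespace-only input as having no similarity.
    if not a.strip() or not b.strip():
        return 0
    # Sorting the tokens makes "Boston Celtics" and "Celtics Boston" identical.
    key_a = " ".join(sorted(a.lower().split()))
    key_b = " ".join(sorted(b.lower().split()))
    return round(SequenceMatcher(None, key_a, key_b).ratio() * 100)
```

A threshold such as the `threshold=90` used in the tests would then be applied to this score by the caller.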
|
||||
@@ -0,0 +1 @@
|
||||
"""Tests for scrapers module."""
|
||||
@@ -0,0 +1,257 @@
|
||||
"""Tests for MLB scraper."""
|
||||
|
||||
from datetime import datetime
|
||||
from unittest.mock import patch
|
||||
|
||||
import pytest
|
||||
|
||||
from sportstime_parser.scrapers.mlb import MLBScraper, create_mlb_scraper
|
||||
from sportstime_parser.scrapers.base import RawGameData
|
||||
from sportstime_parser.tests.fixtures import (
|
||||
load_json_fixture,
|
||||
MLB_ESPN_SCOREBOARD_JSON,
|
||||
)
|
||||
|
||||
|
||||
class TestMLBScraperInit:
|
||||
"""Test MLBScraper initialization."""
|
||||
|
||||
def test_creates_scraper_with_season(self):
|
||||
"""Test scraper initializes with correct season."""
|
||||
scraper = MLBScraper(season=2026)
|
||||
assert scraper.sport == "mlb"
|
||||
assert scraper.season == 2026
|
||||
|
||||
def test_factory_function_creates_scraper(self):
|
||||
"""Test factory function creates correct scraper."""
|
||||
scraper = create_mlb_scraper(season=2026)
|
||||
assert isinstance(scraper, MLBScraper)
|
||||
assert scraper.season == 2026
|
||||
|
||||
def test_expected_game_count(self):
|
||||
"""Test expected game count is correct for MLB."""
|
||||
scraper = MLBScraper(season=2026)
|
||||
assert scraper.expected_game_count == 2430
|
||||
|
||||
def test_sources_in_priority_order(self):
|
||||
"""Test sources are returned in correct priority order."""
|
||||
scraper = MLBScraper(season=2026)
|
||||
sources = scraper._get_sources()
|
||||
assert sources == ["baseball_reference", "mlb_api", "espn"]
|
||||
|
||||
|
||||
class TestESPNParsing:
|
||||
"""Test ESPN API response parsing."""
|
||||
|
||||
def test_parses_completed_games(self):
|
||||
"""Test parsing completed games from ESPN."""
|
||||
scraper = MLBScraper(season=2026)
|
||||
data = load_json_fixture(MLB_ESPN_SCOREBOARD_JSON)
|
||||
games = scraper._parse_espn_response(data, "http://espn.com/api")
|
||||
|
||||
completed = [g for g in games if g.status == "final"]
|
||||
assert len(completed) == 2
|
||||
|
||||
# Yankees @ Red Sox
|
||||
nyy_bos = next(g for g in completed if g.away_team_raw == "New York Yankees")
|
||||
assert nyy_bos.home_team_raw == "Boston Red Sox"
|
||||
assert nyy_bos.away_score == 3
|
||||
assert nyy_bos.home_score == 5
|
||||
assert nyy_bos.stadium_raw == "Fenway Park"
|
||||
|
||||
def test_parses_scheduled_games(self):
|
||||
"""Test parsing scheduled games from ESPN."""
|
||||
scraper = MLBScraper(season=2026)
|
||||
data = load_json_fixture(MLB_ESPN_SCOREBOARD_JSON)
|
||||
games = scraper._parse_espn_response(data, "http://espn.com/api")
|
||||
|
||||
scheduled = [g for g in games if g.status == "scheduled"]
|
||||
assert len(scheduled) == 1
|
||||
|
||||
lad_sf = scheduled[0]
|
||||
assert lad_sf.away_team_raw == "Los Angeles Dodgers"
|
||||
assert lad_sf.home_team_raw == "San Francisco Giants"
|
||||
assert lad_sf.stadium_raw == "Oracle Park"
|
||||
|
||||
def test_parses_venue_info(self):
|
||||
"""Test venue information is extracted."""
|
||||
scraper = MLBScraper(season=2026)
|
||||
data = load_json_fixture(MLB_ESPN_SCOREBOARD_JSON)
|
||||
games = scraper._parse_espn_response(data, "http://espn.com/api")
|
||||
|
||||
for game in games:
|
||||
assert game.stadium_raw is not None
|
||||
|
||||
|
||||
class TestGameNormalization:
|
||||
"""Test game normalization and canonical ID generation."""
|
||||
|
||||
def test_normalizes_games_with_canonical_ids(self):
|
||||
"""Test games are normalized with correct canonical IDs."""
|
||||
scraper = MLBScraper(season=2026)
|
||||
|
||||
raw_games = [
|
||||
RawGameData(
|
||||
game_date=datetime(2026, 4, 15),
|
||||
home_team_raw="Boston Red Sox",
|
||||
away_team_raw="New York Yankees",
|
||||
stadium_raw="Fenway Park",
|
||||
home_score=5,
|
||||
away_score=3,
|
||||
status="final",
|
||||
source_url="http://example.com",
|
||||
)
|
||||
]
|
||||
|
||||
games, review_items = scraper._normalize_games(raw_games)
|
||||
|
||||
assert len(games) == 1
|
||||
game = games[0]
|
||||
|
||||
# Check canonical ID format
|
||||
assert game.id == "mlb_2026_nyy_bos_0415"
|
||||
assert game.sport == "mlb"
|
||||
assert game.season == 2026
|
||||
|
||||
# Check team IDs
|
||||
assert game.home_team_id == "team_mlb_bos"
|
||||
assert game.away_team_id == "team_mlb_nyy"
|
||||
|
||||
# Check scores preserved
|
||||
assert game.home_score == 5
|
||||
assert game.away_score == 3
|
||||
|
||||
def test_creates_review_items_for_unresolved_teams(self):
|
||||
"""Test review items are created for unresolved teams."""
|
||||
scraper = MLBScraper(season=2026)
|
||||
|
||||
raw_games = [
|
||||
RawGameData(
|
||||
game_date=datetime(2026, 4, 15),
|
||||
home_team_raw="Unknown Team XYZ",
|
||||
away_team_raw="Boston Red Sox",
|
||||
stadium_raw="Fenway Park",
|
||||
status="scheduled",
|
||||
),
|
||||
]
|
||||
|
||||
games, review_items = scraper._normalize_games(raw_games)
|
||||
|
||||
# Game should not be created due to unresolved team
|
||||
assert len(games) == 0
|
||||
|
||||
# But there should be a review item
|
||||
assert len(review_items) >= 1
|
||||
|
||||
|
||||
class TestTeamAndStadiumScraping:
|
||||
"""Test team and stadium data scraping."""
|
||||
|
||||
def test_scrapes_all_mlb_teams(self):
|
||||
"""Test all 30 MLB teams are returned."""
|
||||
scraper = MLBScraper(season=2026)
|
||||
teams = scraper.scrape_teams()
|
||||
|
||||
# 30 MLB teams
|
||||
assert len(teams) == 30
|
||||
|
||||
# Check team IDs are unique
|
||||
team_ids = [t.id for t in teams]
|
||||
assert len(set(team_ids)) == 30
|
||||
|
||||
# Check all teams have required fields
|
||||
for team in teams:
|
||||
assert team.id.startswith("team_mlb_")
|
||||
assert team.sport == "mlb"
|
||||
assert team.city
|
||||
assert team.name
|
||||
assert team.full_name
|
||||
assert team.abbreviation
|
||||
|
||||
    def test_teams_have_leagues_and_divisions(self):
        """Test teams split 15/15 by league (stored in the conference field)."""
        scraper = MLBScraper(season=2026)
        teams = scraper.scrape_teams()

        # Count teams by league
        al = [t for t in teams if t.conference == "American"]
        nl = [t for t in teams if t.conference == "National"]

        assert len(al) == 15
        assert len(nl) == 15
|
||||
|
||||
def test_scrapes_all_mlb_stadiums(self):
|
||||
"""Test all MLB stadiums are returned."""
|
||||
scraper = MLBScraper(season=2026)
|
||||
stadiums = scraper.scrape_stadiums()
|
||||
|
||||
# Should have stadiums for all teams
|
||||
assert len(stadiums) == 30
|
||||
|
||||
# Check stadium IDs are unique
|
||||
stadium_ids = [s.id for s in stadiums]
|
||||
assert len(set(stadium_ids)) == 30
|
||||
|
||||
# Check all stadiums have required fields
|
||||
for stadium in stadiums:
|
||||
assert stadium.id.startswith("stadium_mlb_")
|
||||
assert stadium.sport == "mlb"
|
||||
assert stadium.name
|
||||
assert stadium.city
|
||||
assert stadium.state
|
||||
assert stadium.country in ["USA", "Canada"]
|
||||
assert stadium.latitude != 0
|
||||
assert stadium.longitude != 0
|
||||
|
||||
|
||||
class TestScrapeFallback:
|
||||
"""Test multi-source fallback behavior."""
|
||||
|
||||
    def test_falls_back_to_next_source_on_failure(self):
        """Test scraper tries next source when first fails."""
        scraper = MLBScraper(season=2026)

        with patch.object(scraper, '_scrape_baseball_reference') as mock_br, \
             patch.object(scraper, '_scrape_mlb_api') as mock_mlb, \
             patch.object(scraper, '_scrape_espn') as mock_espn:

            # Make BR and MLB API fail
            mock_br.side_effect = Exception("Connection failed")
            mock_mlb.side_effect = Exception("API error")

            # Make ESPN return data
            mock_espn.return_value = [
                RawGameData(
                    game_date=datetime(2026, 4, 15),
                    home_team_raw="Boston Red Sox",
                    away_team_raw="New York Yankees",
                    stadium_raw="Fenway Park",
                    status="scheduled",
                )
            ]

            result = scraper.scrape_games()

            assert result.success
            assert result.source == "espn"
            assert mock_br.called
            assert mock_mlb.called
            assert mock_espn.called
|
||||
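The fallback behaviour asserted in `TestScrapeFallback` (try each source in priority order, skip the ones that raise, report which source finally succeeded) can be sketched as a small loop. `scrape_with_fallback_sketch` and its return shape are assumptions for illustration; the scraper's real `scrape_games` and its result type are not shown in this section.

```python
from typing import Callable, Dict, List, Optional, Tuple


def scrape_with_fallback_sketch(
    sources: List[Tuple[str, Callable[[], list]]],
) -> Tuple[Optional[str], list, Dict[str, str]]:
    """Hypothetical: return (source_name, games, errors) for the first source that works."""
    errors: Dict[str, str] = {}
    for name, fetch in sources:
        try:
            games = fetch()
        except Exception as exc:  # a failing source must not abort the run
            errors[name] = str(exc)
            continue
        if games:
            return name, games, errors
        # An empty result is treated like a miss, so the next source is tried.
    return None, [], errors
```

Keeping the per-source errors alongside the result lets a caller log why earlier sources were skipped even when a later one succeeds.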
|
||||
|
||||
class TestSeasonMonths:
|
||||
"""Test season month calculation."""
|
||||
|
||||
def test_gets_correct_season_months(self):
|
||||
"""Test correct months are returned for MLB season."""
|
||||
scraper = MLBScraper(season=2026)
|
||||
months = scraper._get_season_months()
|
||||
|
||||
# MLB season is March-November
|
||||
assert len(months) == 9 # Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov
|
||||
|
||||
# Check first month is March of season year
|
||||
assert months[0] == (2026, 3)
|
||||
|
||||
# Check last month is November
|
||||
assert months[-1] == (2026, 11)
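The month list checked in `TestSeasonMonths` can be produced by walking from a start month to an end month. The helper below is a sketch under assumptions: `season_months_sketch` is a hypothetical name, and the wrap into the following calendar year (for leagues whose season spans New Year) is a guess at how other scrapers might behave, not something this diff confirms.

```python
from typing import List, Tuple


def season_months_sketch(
    season: int, start_month: int, end_month: int
) -> List[Tuple[int, int]]:
    """Hypothetical: list (year, month) pairs from start_month through end_month."""
    # If the end month precedes the start month, assume the season wraps
    # into the next calendar year.
    end_year = season if end_month >= start_month else season + 1
    months: List[Tuple[int, int]] = []
    year, month = season, start_month
    while (year, month) <= (end_year, end_month):
        months.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months
```

For the MLB case in this test, `season_months_sketch(2026, 3, 11)` yields nine pairs running from `(2026, 3)` to `(2026, 11)`.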
|
||||
@@ -0,0 +1,251 @@
|
||||
"""Tests for MLS scraper."""
|
||||
|
||||
from datetime import datetime
|
||||
from unittest.mock import patch
|
||||
|
||||
import pytest
|
||||
|
||||
from sportstime_parser.scrapers.mls import MLSScraper, create_mls_scraper
|
||||
from sportstime_parser.scrapers.base import RawGameData
|
||||
from sportstime_parser.tests.fixtures import (
|
||||
load_json_fixture,
|
||||
MLS_ESPN_SCOREBOARD_JSON,
|
||||
)
|
||||
|
||||
|
||||
class TestMLSScraperInit:
|
||||
"""Test MLSScraper initialization."""
|
||||
|
||||
def test_creates_scraper_with_season(self):
|
||||
"""Test scraper initializes with correct season."""
|
||||
scraper = MLSScraper(season=2026)
|
||||
assert scraper.sport == "mls"
|
||||
assert scraper.season == 2026
|
||||
|
||||
def test_factory_function_creates_scraper(self):
|
||||
"""Test factory function creates correct scraper."""
|
||||
scraper = create_mls_scraper(season=2026)
|
||||
assert isinstance(scraper, MLSScraper)
|
||||
assert scraper.season == 2026
|
||||
|
||||
def test_expected_game_count(self):
|
||||
"""Test expected game count is correct for MLS."""
|
||||
scraper = MLSScraper(season=2026)
|
||||
assert scraper.expected_game_count == 493
|
||||
|
||||
def test_sources_in_priority_order(self):
|
||||
"""Test sources are returned in correct priority order."""
|
||||
scraper = MLSScraper(season=2026)
|
||||
sources = scraper._get_sources()
|
||||
assert sources == ["espn", "fbref"]
|
||||
|
||||
|
||||
class TestESPNParsing:
|
||||
"""Test ESPN API response parsing."""
|
||||
|
||||
def test_parses_completed_games(self):
|
||||
"""Test parsing completed games from ESPN."""
|
||||
scraper = MLSScraper(season=2026)
|
||||
data = load_json_fixture(MLS_ESPN_SCOREBOARD_JSON)
|
||||
games = scraper._parse_espn_response(data, "http://espn.com/api")
|
||||
|
||||
completed = [g for g in games if g.status == "final"]
|
||||
assert len(completed) == 2
|
||||
|
||||
# Galaxy @ LAFC
|
||||
la_lafc = next(g for g in completed if g.away_team_raw == "LA Galaxy")
|
||||
assert la_lafc.home_team_raw == "Los Angeles FC"
|
||||
assert la_lafc.away_score == 2
|
||||
assert la_lafc.home_score == 3
|
||||
assert la_lafc.stadium_raw == "BMO Stadium"
|
||||
|
||||
def test_parses_scheduled_games(self):
|
||||
"""Test parsing scheduled games from ESPN."""
|
||||
scraper = MLSScraper(season=2026)
|
||||
data = load_json_fixture(MLS_ESPN_SCOREBOARD_JSON)
|
||||
games = scraper._parse_espn_response(data, "http://espn.com/api")
|
||||
|
||||
scheduled = [g for g in games if g.status == "scheduled"]
|
||||
assert len(scheduled) == 1
|
||||
|
||||
ny_atl = scheduled[0]
|
||||
assert ny_atl.away_team_raw == "New York Red Bulls"
|
||||
assert ny_atl.home_team_raw == "Atlanta United FC"
|
||||
assert ny_atl.stadium_raw == "Mercedes-Benz Stadium"
|
||||
|
||||
def test_parses_venue_info(self):
|
||||
"""Test venue information is extracted."""
|
||||
scraper = MLSScraper(season=2026)
|
||||
data = load_json_fixture(MLS_ESPN_SCOREBOARD_JSON)
|
||||
games = scraper._parse_espn_response(data, "http://espn.com/api")
|
||||
|
||||
for game in games:
|
||||
assert game.stadium_raw is not None
|
||||
|
||||
|
||||
class TestGameNormalization:
|
||||
"""Test game normalization and canonical ID generation."""
|
||||
|
||||
def test_normalizes_games_with_canonical_ids(self):
|
||||
"""Test games are normalized with correct canonical IDs."""
|
||||
scraper = MLSScraper(season=2026)
|
||||
|
||||
raw_games = [
|
||||
RawGameData(
|
||||
game_date=datetime(2026, 3, 15),
|
||||
home_team_raw="Los Angeles FC",
|
||||
away_team_raw="LA Galaxy",
|
||||
stadium_raw="BMO Stadium",
|
||||
home_score=3,
|
||||
away_score=2,
|
||||
status="final",
|
||||
source_url="http://example.com",
|
||||
)
|
||||
]
|
||||
|
||||
games, review_items = scraper._normalize_games(raw_games)
|
||||
|
||||
assert len(games) == 1
|
||||
game = games[0]
|
||||
|
||||
# Check canonical ID format
|
||||
assert game.id == "mls_2026_lag_lafc_0315"
|
||||
assert game.sport == "mls"
|
||||
assert game.season == 2026
|
||||
|
||||
# Check team IDs
|
||||
assert game.home_team_id == "team_mls_lafc"
|
||||
assert game.away_team_id == "team_mls_lag"
|
||||
|
||||
# Check scores preserved
|
||||
assert game.home_score == 3
|
||||
assert game.away_score == 2
|
||||
|
||||
def test_creates_review_items_for_unresolved_teams(self):
|
||||
"""Test review items are created for unresolved teams."""
|
||||
scraper = MLSScraper(season=2026)
|
||||
|
||||
raw_games = [
|
||||
RawGameData(
|
||||
game_date=datetime(2026, 3, 15),
|
||||
home_team_raw="Unknown Team XYZ",
|
||||
away_team_raw="LA Galaxy",
|
||||
stadium_raw="BMO Stadium",
|
||||
status="scheduled",
|
||||
),
|
||||
]
|
||||
|
||||
games, review_items = scraper._normalize_games(raw_games)
|
||||
|
||||
# Game should not be created due to unresolved team
|
||||
assert len(games) == 0
|
||||
|
||||
# But there should be a review item
|
||||
assert len(review_items) >= 1
|
||||
|
||||
|
||||
class TestTeamAndStadiumScraping:
|
||||
"""Test team and stadium data scraping."""
|
||||
|
||||
def test_scrapes_all_mls_teams(self):
|
||||
"""Test all MLS teams are returned."""
|
||||
scraper = MLSScraper(season=2026)
|
||||
teams = scraper.scrape_teams()
|
||||
|
||||
# MLS has 29+ teams
|
||||
assert len(teams) >= 29
|
||||
|
||||
# Check team IDs are unique
|
||||
team_ids = [t.id for t in teams]
|
||||
assert len(set(team_ids)) == len(teams)
|
||||
|
||||
# Check all teams have required fields
|
||||
for team in teams:
|
||||
assert team.id.startswith("team_mls_")
|
||||
assert team.sport == "mls"
|
||||
assert team.city
|
||||
assert team.name
|
||||
assert team.full_name
|
||||
assert team.abbreviation
|
||||
|
||||
def test_teams_have_conferences(self):
|
||||
"""Test teams have conference info."""
|
||||
scraper = MLSScraper(season=2026)
|
||||
teams = scraper.scrape_teams()
|
||||
|
||||
# Count teams by conference
|
||||
eastern = [t for t in teams if t.conference == "Eastern"]
|
||||
western = [t for t in teams if t.conference == "Western"]
|
||||
|
||||
# MLS has two conferences
|
||||
assert len(eastern) >= 14
|
||||
assert len(western) >= 14
|
||||
|
||||
def test_scrapes_all_mls_stadiums(self):
|
||||
"""Test all MLS stadiums are returned."""
|
||||
scraper = MLSScraper(season=2026)
|
||||
stadiums = scraper.scrape_stadiums()
|
||||
|
||||
# Should have stadiums for all teams
|
||||
assert len(stadiums) >= 29
|
||||
|
||||
# Check all stadiums have required fields
|
||||
for stadium in stadiums:
|
||||
assert stadium.id.startswith("stadium_mls_")
|
||||
assert stadium.sport == "mls"
|
||||
assert stadium.name
|
||||
assert stadium.city
|
||||
assert stadium.state
|
||||
assert stadium.country in ["USA", "Canada"]
|
||||
assert stadium.latitude != 0
|
||||
assert stadium.longitude != 0
|
||||
|
||||
|
||||
class TestScrapeFallback:
|
||||
"""Test multi-source fallback behavior."""
|
||||
|
||||
    def test_falls_back_to_next_source_on_failure(self):
        """Test scraper tries next source when first fails."""
        scraper = MLSScraper(season=2026)

        with patch.object(scraper, '_scrape_espn') as mock_espn, \
             patch.object(scraper, '_scrape_fbref') as mock_fbref:

            # Make ESPN fail
            mock_espn.side_effect = Exception("Connection failed")

            # Make FBref return data
            mock_fbref.return_value = [
                RawGameData(
                    game_date=datetime(2026, 3, 15),
                    home_team_raw="Los Angeles FC",
                    away_team_raw="LA Galaxy",
                    stadium_raw="BMO Stadium",
                    status="scheduled",
                )
            ]

            result = scraper.scrape_games()

            assert result.success
            assert result.source == "fbref"
            assert mock_espn.called
            assert mock_fbref.called
|
||||
|
||||
|
||||
class TestSeasonMonths:
|
||||
"""Test season month calculation."""
|
||||
|
||||
def test_gets_correct_season_months(self):
|
||||
"""Test correct months are returned for MLS season."""
|
||||
scraper = MLSScraper(season=2026)
|
||||
months = scraper._get_season_months()
|
||||
|
||||
# MLS season is February-November
|
||||
assert len(months) == 10 # Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov
|
||||
|
||||
# Check first month is February of season year
|
||||
assert months[0] == (2026, 2)
|
||||
|
||||
# Check last month is November
|
||||
assert months[-1] == (2026, 11)
|
||||
@@ -0,0 +1,428 @@
|
||||
"""Tests for NBA scraper."""
|
||||
|
||||
import json
|
||||
from datetime import datetime
|
||||
from unittest.mock import MagicMock, patch
|
||||
|
||||
import pytest
|
||||
|
||||
from sportstime_parser.scrapers.nba import NBAScraper, create_nba_scraper
|
||||
from sportstime_parser.scrapers.base import RawGameData
|
||||
from sportstime_parser.tests.fixtures import (
    load_fixture,
    load_json_fixture,
    NBA_BR_OCTOBER_HTML,
    NBA_BR_EDGE_CASES_HTML,
    NBA_ESPN_SCOREBOARD_JSON,
)
|
||||
|
||||
|
||||
class TestNBAScraperInit:
|
||||
"""Test NBAScraper initialization."""
|
||||
|
||||
def test_creates_scraper_with_season(self):
|
||||
"""Test scraper initializes with correct season."""
|
||||
scraper = NBAScraper(season=2025)
|
||||
assert scraper.sport == "nba"
|
||||
assert scraper.season == 2025
|
||||
|
||||
def test_factory_function_creates_scraper(self):
|
||||
"""Test factory function creates correct scraper."""
|
||||
scraper = create_nba_scraper(season=2025)
|
||||
assert isinstance(scraper, NBAScraper)
|
||||
assert scraper.season == 2025
|
||||
|
||||
def test_expected_game_count(self):
|
||||
"""Test expected game count is correct for NBA."""
|
||||
scraper = NBAScraper(season=2025)
|
||||
assert scraper.expected_game_count == 1230
|
||||
|
||||
def test_sources_in_priority_order(self):
|
||||
"""Test sources are returned in correct priority order."""
|
||||
scraper = NBAScraper(season=2025)
|
||||
sources = scraper._get_sources()
|
||||
assert sources == ["basketball_reference", "espn", "cbs"]
|
||||
|
||||
|
||||
class TestBasketballReferenceParsing:
|
||||
"""Test Basketball-Reference HTML parsing."""
|
||||
|
||||
    def test_parses_completed_games(self):
        """Test parsing completed games with scores."""
        scraper = NBAScraper(season=2025)
        html = load_fixture(NBA_BR_OCTOBER_HTML)
        games = scraper._parse_basketball_reference(html, "http://example.com")

        # Should find all games in fixture
        assert len(games) == 7

        # Two of those games are completed
        completed_games = [g for g in games if g.status == "final"]
        assert len(completed_games) == 2

        # Boston @ Cleveland
        bos_cle = next(g for g in games if g.away_team_raw == "Boston Celtics")
        assert bos_cle.home_team_raw == "Cleveland Cavaliers"
        assert bos_cle.away_score == 112
        assert bos_cle.home_score == 108
        assert bos_cle.stadium_raw == "Rocket Mortgage FieldHouse"
        assert bos_cle.status == "final"
|
||||
|
||||
def test_parses_scheduled_games(self):
|
||||
"""Test parsing scheduled games without scores."""
|
||||
scraper = NBAScraper(season=2025)
|
||||
html = load_fixture(NBA_BR_OCTOBER_HTML)
|
||||
games = scraper._parse_basketball_reference(html, "http://example.com")
|
||||
|
||||
scheduled_games = [g for g in games if g.status == "scheduled"]
|
||||
assert len(scheduled_games) == 5
|
||||
|
||||
# Houston @ OKC
|
||||
hou_okc = next(g for g in scheduled_games if g.away_team_raw == "Houston Rockets")
|
||||
assert hou_okc.home_team_raw == "Oklahoma City Thunder"
|
||||
assert hou_okc.away_score is None
|
||||
assert hou_okc.home_score is None
|
||||
assert hou_okc.stadium_raw == "Paycom Center"
|
||||
|
||||
def test_parses_game_dates_correctly(self):
|
||||
"""Test game dates are parsed correctly."""
|
||||
scraper = NBAScraper(season=2025)
|
||||
html = load_fixture(NBA_BR_OCTOBER_HTML)
|
||||
games = scraper._parse_basketball_reference(html, "http://example.com")
|
||||
|
||||
# Check first game date
|
||||
first_game = games[0]
|
||||
assert first_game.game_date.year == 2025
|
||||
assert first_game.game_date.month == 10
|
||||
assert first_game.game_date.day == 22
|
||||
|
||||
def test_tracks_source_url(self):
|
||||
"""Test source URL is tracked for all games."""
|
||||
scraper = NBAScraper(season=2025)
|
||||
html = load_fixture(NBA_BR_OCTOBER_HTML)
|
||||
source_url = "http://basketball-reference.com/test"
|
||||
games = scraper._parse_basketball_reference(html, source_url)
|
||||
|
||||
for game in games:
|
||||
assert game.source_url == source_url
|
||||
|
||||
|
||||
class TestBasketballReferenceEdgeCases:
    """Test edge case handling in Basketball-Reference parsing."""

    def test_parses_postponed_games(self):
        """Test postponed games are identified correctly."""
        scraper = NBAScraper(season=2025)
        html = load_fixture(NBA_BR_EDGE_CASES_HTML)
        games = scraper._parse_basketball_reference(html, "http://example.com")

        postponed = [g for g in games if g.status == "postponed"]
        assert len(postponed) == 1
        assert postponed[0].away_team_raw == "Los Angeles Lakers"
        assert postponed[0].home_team_raw == "Phoenix Suns"

    def test_parses_cancelled_games(self):
        """Test cancelled games are identified correctly."""
        scraper = NBAScraper(season=2025)
        html = load_fixture(NBA_BR_EDGE_CASES_HTML)
        games = scraper._parse_basketball_reference(html, "http://example.com")

        cancelled = [g for g in games if g.status == "cancelled"]
        assert len(cancelled) == 1
        assert cancelled[0].away_team_raw == "Portland Trail Blazers"

    def test_parses_neutral_site_games(self):
        """Test neutral site games are parsed."""
        scraper = NBAScraper(season=2025)
        html = load_fixture(NBA_BR_EDGE_CASES_HTML)
        games = scraper._parse_basketball_reference(html, "http://example.com")

        # Mexico City game
        mexico = next(g for g in games if g.stadium_raw == "Arena CDMX")
        assert mexico.away_team_raw == "Miami Heat"
        assert mexico.home_team_raw == "Washington Wizards"
        assert mexico.status == "final"

    def test_parses_overtime_games(self):
        """Test overtime games with high scores."""
        scraper = NBAScraper(season=2025)
        html = load_fixture(NBA_BR_EDGE_CASES_HTML)
        games = scraper._parse_basketball_reference(html, "http://example.com")

        # High scoring OT game
        ot_game = next(g for g in games if g.away_score == 147)
        assert ot_game.home_score == 150
        assert ot_game.status == "final"


class TestESPNParsing:
    """Test ESPN API response parsing."""

    def test_parses_completed_games(self):
        """Test parsing completed games from ESPN."""
        scraper = NBAScraper(season=2025)
        data = load_json_fixture(NBA_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        completed = [g for g in games if g.status == "final"]
        assert len(completed) == 2

        # Boston @ Cleveland
        bos_cle = next(g for g in completed if g.away_team_raw == "Boston Celtics")
        assert bos_cle.home_team_raw == "Cleveland Cavaliers"
        assert bos_cle.away_score == 112
        assert bos_cle.home_score == 108
        assert bos_cle.stadium_raw == "Rocket Mortgage FieldHouse"

    def test_parses_scheduled_games(self):
        """Test parsing scheduled games from ESPN."""
        scraper = NBAScraper(season=2025)
        data = load_json_fixture(NBA_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        scheduled = [g for g in games if g.status == "scheduled"]
        assert len(scheduled) == 1

        hou_okc = scheduled[0]
        assert hou_okc.away_team_raw == "Houston Rockets"
        assert hou_okc.home_team_raw == "Oklahoma City Thunder"
        assert hou_okc.stadium_raw == "Paycom Center"

    def test_parses_venue_info(self):
        """Test venue information is extracted."""
        scraper = NBAScraper(season=2025)
        data = load_json_fixture(NBA_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        # Check all games have venue info
        for game in games:
            assert game.stadium_raw is not None


class TestGameNormalization:
    """Test game normalization and canonical ID generation."""

    def test_normalizes_games_with_canonical_ids(self):
        """Test games are normalized with correct canonical IDs."""
        scraper = NBAScraper(season=2025)

        raw_games = [
            RawGameData(
                game_date=datetime(2025, 10, 22),
                home_team_raw="Cleveland Cavaliers",
                away_team_raw="Boston Celtics",
                stadium_raw="Rocket Mortgage FieldHouse",
                home_score=108,
                away_score=112,
                status="final",
                source_url="http://example.com",
            )
        ]

        games, review_items = scraper._normalize_games(raw_games)

        assert len(games) == 1
        game = games[0]

        # Check canonical ID format
        assert game.id == "nba_2025_bos_cle_1022"
        assert game.sport == "nba"
        assert game.season == 2025

        # Check team IDs
        assert game.home_team_id == "team_nba_cle"
        assert game.away_team_id == "team_nba_bos"

        # Check scores preserved
        assert game.home_score == 108
        assert game.away_score == 112

    def test_detects_doubleheaders(self):
        """Test doubleheaders get correct game numbers."""
        scraper = NBAScraper(season=2025)

        raw_games = [
            RawGameData(
                game_date=datetime(2025, 4, 1, 13, 0),
                home_team_raw="Boston Celtics",
                away_team_raw="New York Knicks",
                stadium_raw="TD Garden",
                status="final",
                home_score=105,
                away_score=98,
            ),
            RawGameData(
                game_date=datetime(2025, 4, 1, 19, 0),
                home_team_raw="Boston Celtics",
                away_team_raw="New York Knicks",
                stadium_raw="TD Garden",
                status="final",
                home_score=110,
                away_score=102,
            ),
        ]

        games, _ = scraper._normalize_games(raw_games)

        assert len(games) == 2
        game_numbers = sorted([g.game_number for g in games])
        assert game_numbers == [1, 2]

        # Check IDs include game number
        game_ids = sorted([g.id for g in games])
        assert game_ids == ["nba_2025_nyk_bos_0401_1", "nba_2025_nyk_bos_0401_2"]

    def test_creates_review_items_for_unresolved_teams(self):
        """Test review items are created for unresolved teams."""
        scraper = NBAScraper(season=2025)

        raw_games = [
            RawGameData(
                game_date=datetime(2025, 10, 22),
                home_team_raw="Unknown Team XYZ",
                away_team_raw="Boston Celtics",
                stadium_raw="TD Garden",
                status="scheduled",
            ),
        ]

        games, review_items = scraper._normalize_games(raw_games)

        # Game should not be created due to unresolved team
        assert len(games) == 0

        # But there should be a review item
        assert len(review_items) >= 1


class TestTeamAndStadiumScraping:
    """Test team and stadium data scraping."""

    def test_scrapes_all_nba_teams(self):
        """Test all 30 NBA teams are returned."""
        scraper = NBAScraper(season=2025)
        teams = scraper.scrape_teams()

        # 30 NBA teams
        assert len(teams) == 30

        # Check team IDs are unique
        team_ids = [t.id for t in teams]
        assert len(set(team_ids)) == 30

        # Check all teams have required fields
        for team in teams:
            assert team.id.startswith("team_nba_")
            assert team.sport == "nba"
            assert team.city
            assert team.name
            assert team.full_name
            assert team.abbreviation

    def test_teams_have_conferences_and_divisions(self):
        """Test teams have conference and division info."""
        scraper = NBAScraper(season=2025)
        teams = scraper.scrape_teams()

        # Count teams by conference
        eastern = [t for t in teams if t.conference == "Eastern"]
        western = [t for t in teams if t.conference == "Western"]

        assert len(eastern) == 15
        assert len(western) == 15

    def test_scrapes_all_nba_stadiums(self):
        """Test all NBA stadiums are returned."""
        scraper = NBAScraper(season=2025)
        stadiums = scraper.scrape_stadiums()

        # Should have stadiums for all teams
        assert len(stadiums) == 30

        # Check stadium IDs are unique
        stadium_ids = [s.id for s in stadiums]
        assert len(set(stadium_ids)) == 30

        # Check all stadiums have required fields
        for stadium in stadiums:
            assert stadium.id.startswith("stadium_nba_")
            assert stadium.sport == "nba"
            assert stadium.name
            assert stadium.city
            assert stadium.state
            assert stadium.country in ["USA", "Canada"]
            assert stadium.latitude != 0
            assert stadium.longitude != 0


class TestScrapeFallback:
    """Test multi-source fallback behavior."""

    def test_falls_back_to_next_source_on_failure(self):
        """Test scraper tries next source when first fails."""
        scraper = NBAScraper(season=2025)

        with patch.object(scraper, '_scrape_basketball_reference') as mock_br, \
             patch.object(scraper, '_scrape_espn') as mock_espn:

            # Make BR fail
            mock_br.side_effect = Exception("Connection failed")

            # Make ESPN return data
            mock_espn.return_value = [
                RawGameData(
                    game_date=datetime(2025, 10, 22),
                    home_team_raw="Cleveland Cavaliers",
                    away_team_raw="Boston Celtics",
                    stadium_raw="Rocket Mortgage FieldHouse",
                    status="scheduled",
                )
            ]

            result = scraper.scrape_games()

            # Should have succeeded with ESPN
            assert result.success
            assert result.source == "espn"
            assert mock_br.called
            assert mock_espn.called

    def test_returns_failure_when_all_sources_fail(self):
        """Test scraper returns failure when all sources fail."""
        scraper = NBAScraper(season=2025)

        with patch.object(scraper, '_scrape_basketball_reference') as mock_br, \
             patch.object(scraper, '_scrape_espn') as mock_espn, \
             patch.object(scraper, '_scrape_cbs') as mock_cbs:

            mock_br.side_effect = Exception("BR failed")
            mock_espn.side_effect = Exception("ESPN failed")
            mock_cbs.side_effect = Exception("CBS failed")

            result = scraper.scrape_games()

            assert not result.success
            assert "All sources failed" in result.error_message
            assert "CBS failed" in result.error_message


class TestSeasonMonths:
    """Test season month calculation."""

    def test_gets_correct_season_months(self):
        """Test correct months are returned for NBA season."""
        scraper = NBAScraper(season=2025)
        months = scraper._get_season_months()

        # NBA season is Oct-Jun
        assert len(months) == 9  # Oct, Nov, Dec, Jan, Feb, Mar, Apr, May, Jun

        # Check first month is Oct of season year
        assert months[0] == (2025, 10)

        # Check last month is Jun of following year
        assert months[-1] == (2026, 6)

        # Check transition to new year
        assert months[2] == (2025, 12)  # December
        assert months[3] == (2026, 1)  # January
|
||||
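The season-month assertions in the NBA tests above (nine months, October 2025 through June 2026, with the year rolling over after December) can be reproduced with a short helper. The sketch below is only an illustration: `season_months` and its `start_month`/`end_month` parameters are hypothetical names and are not taken from the actual `_get_season_months` implementation in `sportstime_parser`, which is not shown in this diff.

```python
def season_months(season: int, start_month: int, end_month: int) -> list[tuple[int, int]]:
    """Return (year, month) pairs from start_month of `season` through end_month.

    Assumption: if end_month comes earlier in the calendar than start_month,
    the season wraps into the following year (e.g. Oct 2025 -> Jun 2026).
    """
    end_year = season + 1 if end_month < start_month else season
    months = []
    year, month = season, start_month
    # Tuple comparison orders (year, month) pairs chronologically.
    while (year, month) <= (end_year, end_month):
        months.append((year, month))
        month += 1
        if month > 12:
            # Roll over from December to January of the next year.
            year, month = year + 1, 1
    return months


# An October-to-June season, as asserted for the NBA scraper.
nba_months = season_months(2025, start_month=10, end_month=6)
print(nba_months[0], nba_months[-1], len(nba_months))
```

A season that stays inside one calendar year (for example `season_months(2026, 2, 11)`) goes through the same loop without hitting the rollover branch.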
@@ -0,0 +1,310 @@
"""Tests for NFL scraper."""

from datetime import datetime
from unittest.mock import patch

import pytest

from sportstime_parser.scrapers.nfl import NFLScraper, create_nfl_scraper
from sportstime_parser.scrapers.base import RawGameData
from sportstime_parser.tests.fixtures import (
    load_json_fixture,
    NFL_ESPN_SCOREBOARD_JSON,
)


class TestNFLScraperInit:
    """Test NFLScraper initialization."""

    def test_creates_scraper_with_season(self):
        """Test scraper initializes with correct season."""
        scraper = NFLScraper(season=2025)
        assert scraper.sport == "nfl"
        assert scraper.season == 2025

    def test_factory_function_creates_scraper(self):
        """Test factory function creates correct scraper."""
        scraper = create_nfl_scraper(season=2025)
        assert isinstance(scraper, NFLScraper)
        assert scraper.season == 2025

    def test_expected_game_count(self):
        """Test expected game count is correct for NFL."""
        scraper = NFLScraper(season=2025)
        assert scraper.expected_game_count == 272

    def test_sources_in_priority_order(self):
        """Test sources are returned in correct priority order."""
        scraper = NFLScraper(season=2025)
        sources = scraper._get_sources()
        assert sources == ["espn", "pro_football_reference", "cbs"]


class TestESPNParsing:
    """Test ESPN API response parsing."""

    def test_parses_completed_games(self):
        """Test parsing completed games from ESPN."""
        scraper = NFLScraper(season=2025)
        data = load_json_fixture(NFL_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        completed = [g for g in games if g.status == "final"]
        assert len(completed) == 2

        # Chiefs @ Ravens
        kc_bal = next(g for g in completed if g.away_team_raw == "Kansas City Chiefs")
        assert kc_bal.home_team_raw == "Baltimore Ravens"
        assert kc_bal.away_score == 27
        assert kc_bal.home_score == 20
        assert kc_bal.stadium_raw == "M&T Bank Stadium"

    def test_parses_scheduled_games(self):
        """Test parsing scheduled games from ESPN."""
        scraper = NFLScraper(season=2025)
        data = load_json_fixture(NFL_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        scheduled = [g for g in games if g.status == "scheduled"]
        assert len(scheduled) == 1

        dal_cle = scheduled[0]
        assert dal_cle.away_team_raw == "Dallas Cowboys"
        assert dal_cle.home_team_raw == "Cleveland Browns"
        assert dal_cle.stadium_raw == "Cleveland Browns Stadium"

    def test_parses_venue_info(self):
        """Test venue information is extracted."""
        scraper = NFLScraper(season=2025)
        data = load_json_fixture(NFL_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        for game in games:
            assert game.stadium_raw is not None


class TestGameNormalization:
    """Test game normalization and canonical ID generation."""

    def test_normalizes_games_with_canonical_ids(self):
        """Test games are normalized with correct canonical IDs."""
        scraper = NFLScraper(season=2025)

        raw_games = [
            RawGameData(
                game_date=datetime(2025, 9, 7),
                home_team_raw="Baltimore Ravens",
                away_team_raw="Kansas City Chiefs",
                stadium_raw="M&T Bank Stadium",
                home_score=20,
                away_score=27,
                status="final",
                source_url="http://example.com",
            )
        ]

        games, review_items = scraper._normalize_games(raw_games)

        assert len(games) == 1
        game = games[0]

        # Check canonical ID format
        assert game.id == "nfl_2025_kc_bal_0907"
        assert game.sport == "nfl"
        assert game.season == 2025

        # Check team IDs
        assert game.home_team_id == "team_nfl_bal"
        assert game.away_team_id == "team_nfl_kc"

        # Check scores preserved
        assert game.home_score == 20
        assert game.away_score == 27

    def test_creates_review_items_for_unresolved_teams(self):
        """Test review items are created for unresolved teams."""
        scraper = NFLScraper(season=2025)

        raw_games = [
            RawGameData(
                game_date=datetime(2025, 9, 7),
                home_team_raw="Unknown Team XYZ",
                away_team_raw="Kansas City Chiefs",
                stadium_raw="Arrowhead Stadium",
                status="scheduled",
            ),
        ]

        games, review_items = scraper._normalize_games(raw_games)

        # Game should not be created due to unresolved team
        assert len(games) == 0

        # But there should be a review item
        assert len(review_items) >= 1


class TestTeamAndStadiumScraping:
    """Test team and stadium data scraping."""

    def test_scrapes_all_nfl_teams(self):
        """Test all 32 NFL teams are returned."""
        scraper = NFLScraper(season=2025)
        teams = scraper.scrape_teams()

        # 32 NFL teams
        assert len(teams) == 32

        # Check team IDs are unique
        team_ids = [t.id for t in teams]
        assert len(set(team_ids)) == 32

        # Check all teams have required fields
        for team in teams:
            assert team.id.startswith("team_nfl_")
            assert team.sport == "nfl"
            assert team.city
            assert team.name
            assert team.full_name
            assert team.abbreviation

    def test_teams_have_conferences_and_divisions(self):
        """Test teams have conference and division info."""
        scraper = NFLScraper(season=2025)
        teams = scraper.scrape_teams()

        # Count teams by conference
        afc = [t for t in teams if t.conference == "AFC"]
        nfc = [t for t in teams if t.conference == "NFC"]

        assert len(afc) == 16
        assert len(nfc) == 16

    def test_scrapes_all_nfl_stadiums(self):
        """Test all NFL stadiums are returned."""
        scraper = NFLScraper(season=2025)
        stadiums = scraper.scrape_stadiums()

        # 32 teams, but some share a stadium (MetLife, SoFi), so expect at least 30
        assert len(stadiums) >= 30

        # Check all stadiums have required fields
        for stadium in stadiums:
            assert stadium.id.startswith("stadium_nfl_")
            assert stadium.sport == "nfl"
            assert stadium.name
            assert stadium.city
            assert stadium.state
            assert stadium.country == "USA"
            assert stadium.latitude != 0
            assert stadium.longitude != 0


class TestScrapeFallback:
    """Test multi-source fallback behavior."""

    def test_falls_back_to_next_source_on_failure(self):
        """Test scraper tries next source when first fails."""
        scraper = NFLScraper(season=2025)

        with patch.object(scraper, '_scrape_espn') as mock_espn, \
             patch.object(scraper, '_scrape_pro_football_reference') as mock_pfr:

            # Make ESPN fail
            mock_espn.side_effect = Exception("Connection failed")

            # Make PFR return data
            mock_pfr.return_value = [
                RawGameData(
                    game_date=datetime(2025, 9, 7),
                    home_team_raw="Baltimore Ravens",
                    away_team_raw="Kansas City Chiefs",
                    stadium_raw="M&T Bank Stadium",
                    status="scheduled",
                )
            ]

            result = scraper.scrape_games()

            assert result.success
            assert result.source == "pro_football_reference"
            assert mock_espn.called
            assert mock_pfr.called


class TestSeasonMonths:
    """Test season month calculation."""

    def test_gets_correct_season_months(self):
        """Test correct months are returned for NFL season."""
        scraper = NFLScraper(season=2025)
        months = scraper._get_season_months()

        # NFL season is September-February
        assert len(months) == 6  # Sep, Oct, Nov, Dec, Jan, Feb

        # Check first month is September of season year
        assert months[0] == (2025, 9)

        # Check last month is February of following year
        assert months[-1] == (2026, 2)

        # Check transition to new year
        assert months[3] == (2025, 12)  # December
        assert months[4] == (2026, 1)  # January


class TestInternationalFiltering:
    """Test international game filtering.

    Note: Filtering happens in _parse_espn_response, not _normalize_games.
    """

    def test_filters_london_games_during_parsing(self):
        """Test London games are filtered out during ESPN parsing."""
        scraper = NFLScraper(season=2025)

        # Create ESPN-like data with London game
        espn_data = {
            "events": [
                {
                    "date": "2025-10-15T09:30:00Z",
                    "competitions": [
                        {
                            "neutralSite": True,
                            "venue": {
                                "fullName": "London Stadium",
                                "address": {"city": "London", "country": "UK"},
                            },
                            "competitors": [
                                {"homeAway": "home", "team": {"displayName": "Jacksonville Jaguars"}},
                                {"homeAway": "away", "team": {"displayName": "Buffalo Bills"}},
                            ],
                        }
                    ],
                }
            ]
        }

        games = scraper._parse_espn_response(espn_data, "http://espn.com/api")

        # London game should be filtered
        assert len(games) == 0

    def test_keeps_us_games(self):
        """Test US games are kept."""
        scraper = NFLScraper(season=2025)

        raw_games = [
            RawGameData(
                game_date=datetime(2025, 9, 7),
                home_team_raw="Baltimore Ravens",
                away_team_raw="Kansas City Chiefs",
                stadium_raw="M&T Bank Stadium",
                status="scheduled",
            ),
        ]

        games, _ = scraper._normalize_games(raw_games)

        assert len(games) == 1
|
||||
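The London test above feeds `_parse_espn_response` an event whose venue address carries a non-US country and expects it to be dropped. One way such a filter could work is sketched below. Everything named here is hypothetical: `is_domestic_venue`, `filter_events`, and the `ALLOWED_COUNTRIES` set are illustrative and not taken from the scraper, which may rely on a different signal (the venue city or the `neutralSite` flag, for instance). Only the shape of the input dictionary follows the test data.

```python
from typing import Any

# Hypothetical allow-list; the scraper's real criteria are not visible in the tests.
ALLOWED_COUNTRIES = {"USA"}


def is_domestic_venue(competition: dict[str, Any]) -> bool:
    """Return True when the venue's country is allowed or not reported."""
    address = competition.get("venue", {}).get("address", {})
    country = address.get("country")
    # Assumption: keep games with no country rather than drop them on missing data.
    return country is None or country in ALLOWED_COUNTRIES


def filter_events(data: dict[str, Any]) -> list[dict[str, Any]]:
    """Keep only events whose first competition is played at a domestic venue."""
    kept = []
    for event in data.get("events", []):
        competitions = event.get("competitions", [])
        if competitions and is_domestic_venue(competitions[0]):
            kept.append(event)
    return kept


# Same shape as the ESPN-like payload used in the London test.
london = {
    "events": [
        {
            "date": "2025-10-15T09:30:00Z",
            "competitions": [
                {
                    "neutralSite": True,
                    "venue": {
                        "fullName": "London Stadium",
                        "address": {"city": "London", "country": "UK"},
                    },
                }
            ],
        }
    ]
}

print(len(filter_events(london)))
```

Filtering on the raw payload, before any `RawGameData` is built, matches the note in the test class that filtering happens during parsing rather than in `_normalize_games`.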
@@ -0,0 +1,317 @@
"""Tests for NHL scraper."""

from datetime import datetime
from unittest.mock import patch

import pytest

from sportstime_parser.scrapers.nhl import NHLScraper, create_nhl_scraper
from sportstime_parser.scrapers.base import RawGameData
from sportstime_parser.tests.fixtures import (
    load_json_fixture,
    NHL_ESPN_SCOREBOARD_JSON,
)


class TestNHLScraperInit:
    """Test NHLScraper initialization."""

    def test_creates_scraper_with_season(self):
        """Test scraper initializes with correct season."""
        scraper = NHLScraper(season=2025)
        assert scraper.sport == "nhl"
        assert scraper.season == 2025

    def test_factory_function_creates_scraper(self):
        """Test factory function creates correct scraper."""
        scraper = create_nhl_scraper(season=2025)
        assert isinstance(scraper, NHLScraper)
        assert scraper.season == 2025

    def test_expected_game_count(self):
        """Test expected game count is correct for NHL."""
        scraper = NHLScraper(season=2025)
        assert scraper.expected_game_count == 1312

    def test_sources_in_priority_order(self):
        """Test sources are returned in correct priority order."""
        scraper = NHLScraper(season=2025)
        sources = scraper._get_sources()
        assert sources == ["hockey_reference", "nhl_api", "espn"]


class TestESPNParsing:
    """Test ESPN API response parsing."""

    def test_parses_completed_games(self):
        """Test parsing completed games from ESPN."""
        scraper = NHLScraper(season=2025)
        data = load_json_fixture(NHL_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        completed = [g for g in games if g.status == "final"]
        assert len(completed) == 2

        # Penguins @ Bruins
        pit_bos = next(g for g in completed if g.away_team_raw == "Pittsburgh Penguins")
        assert pit_bos.home_team_raw == "Boston Bruins"
        assert pit_bos.away_score == 2
        assert pit_bos.home_score == 4
        assert pit_bos.stadium_raw == "TD Garden"

    def test_parses_scheduled_games(self):
        """Test parsing scheduled games from ESPN."""
        scraper = NHLScraper(season=2025)
        data = load_json_fixture(NHL_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        scheduled = [g for g in games if g.status == "scheduled"]
        assert len(scheduled) == 1

        vgk_lak = scheduled[0]
        assert vgk_lak.away_team_raw == "Vegas Golden Knights"
        assert vgk_lak.home_team_raw == "Los Angeles Kings"
        assert vgk_lak.stadium_raw == "Crypto.com Arena"

    def test_parses_venue_info(self):
        """Test venue information is extracted."""
        scraper = NHLScraper(season=2025)
        data = load_json_fixture(NHL_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        for game in games:
            assert game.stadium_raw is not None


class TestGameNormalization:
    """Test game normalization and canonical ID generation."""

    def test_normalizes_games_with_canonical_ids(self):
        """Test games are normalized with correct canonical IDs."""
        scraper = NHLScraper(season=2025)

        raw_games = [
            RawGameData(
                game_date=datetime(2025, 10, 8),
                home_team_raw="Boston Bruins",
                away_team_raw="Pittsburgh Penguins",
                stadium_raw="TD Garden",
                home_score=4,
                away_score=2,
                status="final",
                source_url="http://example.com",
            )
        ]

        games, review_items = scraper._normalize_games(raw_games)

        assert len(games) == 1
        game = games[0]

        # Check canonical ID format
        assert game.id == "nhl_2025_pit_bos_1008"
        assert game.sport == "nhl"
        assert game.season == 2025

        # Check team IDs
        assert game.home_team_id == "team_nhl_bos"
        assert game.away_team_id == "team_nhl_pit"

        # Check scores preserved
        assert game.home_score == 4
        assert game.away_score == 2

    def test_creates_review_items_for_unresolved_teams(self):
        """Test review items are created for unresolved teams."""
        scraper = NHLScraper(season=2025)

        raw_games = [
            RawGameData(
                game_date=datetime(2025, 10, 8),
                home_team_raw="Unknown Team XYZ",
                away_team_raw="Boston Bruins",
                stadium_raw="TD Garden",
                status="scheduled",
            ),
        ]

        games, review_items = scraper._normalize_games(raw_games)

        # Game should not be created due to unresolved team
        assert len(games) == 0

        # But there should be a review item
        assert len(review_items) >= 1


class TestTeamAndStadiumScraping:
    """Test team and stadium data scraping."""

    def test_scrapes_all_nhl_teams(self):
        """Test all 32 NHL teams are returned."""
        scraper = NHLScraper(season=2025)
        teams = scraper.scrape_teams()

        # 32 NHL teams
        assert len(teams) == 32

        # Check team IDs are unique
        team_ids = [t.id for t in teams]
        assert len(set(team_ids)) == 32

        # Check all teams have required fields
        for team in teams:
            assert team.id.startswith("team_nhl_")
            assert team.sport == "nhl"
            assert team.city
            assert team.name
            assert team.full_name
            assert team.abbreviation

    def test_teams_have_conferences_and_divisions(self):
        """Test teams have conference and division info."""
        scraper = NHLScraper(season=2025)
        teams = scraper.scrape_teams()

        # Count teams by conference
        eastern = [t for t in teams if t.conference == "Eastern"]
        western = [t for t in teams if t.conference == "Western"]

        assert len(eastern) == 16
        assert len(western) == 16

    def test_scrapes_all_nhl_stadiums(self):
        """Test all NHL stadiums are returned."""
        scraper = NHLScraper(season=2025)
        stadiums = scraper.scrape_stadiums()

        # Should have stadiums for all teams
        assert len(stadiums) == 32

        # Check stadium IDs are unique
        stadium_ids = [s.id for s in stadiums]
        assert len(set(stadium_ids)) == 32

        # Check all stadiums have required fields
        for stadium in stadiums:
            assert stadium.id.startswith("stadium_nhl_")
            assert stadium.sport == "nhl"
            assert stadium.name
            assert stadium.city
            assert stadium.state
            assert stadium.country in ["USA", "Canada"]
            assert stadium.latitude != 0
            assert stadium.longitude != 0


class TestScrapeFallback:
    """Test multi-source fallback behavior."""

    def test_falls_back_to_next_source_on_failure(self):
        """Test scraper tries next source when first fails."""
        scraper = NHLScraper(season=2025)

        with patch.object(scraper, '_scrape_hockey_reference') as mock_hr, \
             patch.object(scraper, '_scrape_nhl_api') as mock_nhl, \
             patch.object(scraper, '_scrape_espn') as mock_espn:

            # Make HR and NHL API fail
            mock_hr.side_effect = Exception("Connection failed")
            mock_nhl.side_effect = Exception("API error")

            # Make ESPN return data
            mock_espn.return_value = [
                RawGameData(
                    game_date=datetime(2025, 10, 8),
                    home_team_raw="Boston Bruins",
                    away_team_raw="Pittsburgh Penguins",
                    stadium_raw="TD Garden",
                    status="scheduled",
                )
            ]

            result = scraper.scrape_games()

            assert result.success
            assert result.source == "espn"
            assert mock_hr.called
            assert mock_nhl.called
            assert mock_espn.called


class TestSeasonMonths:
    """Test season month calculation."""

    def test_gets_correct_season_months(self):
        """Test correct months are returned for NHL season."""
        scraper = NHLScraper(season=2025)
        months = scraper._get_season_months()

        # NHL season is October-June
        assert len(months) == 9  # Oct, Nov, Dec, Jan, Feb, Mar, Apr, May, Jun

        # Check first month is October of season year
        assert months[0] == (2025, 10)

        # Check last month is June of following year
        assert months[-1] == (2026, 6)

        # Check transition to new year
        assert months[2] == (2025, 12)  # December
        assert months[3] == (2026, 1)  # January


class TestInternationalFiltering:
|
||||
"""Test international game filtering.
|
||||
|
||||
Note: Filtering happens in _parse_espn_response, not _normalize_games.
|
||||
"""
|
||||
|
||||
def test_filters_european_games_during_parsing(self):
|
||||
"""Test European games are filtered out during ESPN parsing."""
|
||||
scraper = NHLScraper(season=2025)
|
||||
|
||||
# Create ESPN-like data with Prague game (Global Series)
|
||||
espn_data = {
|
||||
"events": [
|
||||
{
|
||||
"date": "2025-10-10T18:00:00Z",
|
||||
"competitions": [
|
||||
{
|
||||
"neutralSite": True,
|
||||
"venue": {
|
||||
"fullName": "O2 Arena, Prague",
|
||||
"address": {"city": "Prague", "country": "Czech Republic"},
|
||||
},
|
||||
"competitors": [
|
||||
{"homeAway": "home", "team": {"displayName": "Florida Panthers"}},
|
||||
{"homeAway": "away", "team": {"displayName": "Dallas Stars"}},
|
||||
],
|
||||
}
|
||||
],
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
games = scraper._parse_espn_response(espn_data, "http://espn.com/api")
|
||||
|
||||
# Prague game should be filtered
|
||||
assert len(games) == 0
|
||||
|
||||
def test_keeps_north_american_games(self):
|
||||
"""Test North American games are kept."""
|
||||
scraper = NHLScraper(season=2025)
|
||||
|
||||
raw_games = [
|
||||
RawGameData(
|
||||
game_date=datetime(2025, 10, 8),
|
||||
home_team_raw="Boston Bruins",
|
||||
away_team_raw="Pittsburgh Penguins",
|
||||
stadium_raw="TD Garden",
|
||||
status="scheduled",
|
||||
),
|
||||
]
|
||||
|
||||
games, _ = scraper._normalize_games(raw_games)
|
||||
|
||||
assert len(games) == 1
|
||||
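The fallback test for the NHL scraper exercises a "try each source in priority order" pattern. The block below is a minimal sketch of that pattern, not the package's implementation: `ScrapeOutcome`, `scrape_with_fallback`, and `_failing` are hypothetical names introduced for illustration. Only the observable behaviour (first successful source wins, and an "All sources failed" message otherwise) is taken from the tests in this commit.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ScrapeOutcome:
    """Hypothetical result type; the real package's result class may differ."""
    success: bool
    source: Optional[str] = None
    games: list = field(default_factory=list)
    error_message: Optional[str] = None


def scrape_with_fallback(sources: list[tuple[str, Callable[[], list]]]) -> ScrapeOutcome:
    """Try each (name, fetch) pair in priority order and stop at the first success."""
    errors = []
    for name, fetch in sources:
        try:
            games = fetch()
        except Exception as exc:  # any failure moves on to the next source
            errors.append(f"{name}: {exc}")
            continue
        return ScrapeOutcome(success=True, source=name, games=games)
    return ScrapeOutcome(
        success=False,
        error_message="All sources failed: " + "; ".join(errors),
    )


def _failing(message: str) -> Callable[[], list]:
    """Build a fetch callable that always raises, to simulate a broken source."""
    def fetch() -> list:
        raise RuntimeError(message)
    return fetch


outcome = scrape_with_fallback([
    ("hockey_reference", _failing("Connection failed")),
    ("nhl_api", _failing("API error")),
    ("espn", lambda: ["game"]),
])
all_down = scrape_with_fallback([("espn", _failing("ESPN failed"))])
```

Collecting per-source errors before giving up keeps the final message useful when every source is down, which is the single-source case the NWSL and WNBA tests cover.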
@@ -0,0 +1,226 @@
"""Tests for NWSL scraper."""

from datetime import datetime
from unittest.mock import patch

import pytest

from sportstime_parser.scrapers.nwsl import NWSLScraper, create_nwsl_scraper
from sportstime_parser.scrapers.base import RawGameData
from sportstime_parser.tests.fixtures import (
    load_json_fixture,
    NWSL_ESPN_SCOREBOARD_JSON,
)


class TestNWSLScraperInit:
    """Test NWSLScraper initialization."""

    def test_creates_scraper_with_season(self):
        """Test scraper initializes with correct season."""
        scraper = NWSLScraper(season=2026)
        assert scraper.sport == "nwsl"
        assert scraper.season == 2026

    def test_factory_function_creates_scraper(self):
        """Test factory function creates correct scraper."""
        scraper = create_nwsl_scraper(season=2026)
        assert isinstance(scraper, NWSLScraper)
        assert scraper.season == 2026

    def test_expected_game_count(self):
        """Test expected game count is correct for NWSL."""
        scraper = NWSLScraper(season=2026)
        assert scraper.expected_game_count == 182

    def test_sources_in_priority_order(self):
        """Test sources are returned in correct priority order."""
        scraper = NWSLScraper(season=2026)
        sources = scraper._get_sources()
        assert sources == ["espn"]


class TestESPNParsing:
    """Test ESPN API response parsing."""

    def test_parses_completed_games(self):
        """Test parsing completed games from ESPN."""
        scraper = NWSLScraper(season=2026)
        data = load_json_fixture(NWSL_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        completed = [g for g in games if g.status == "final"]
        assert len(completed) == 2

        # Angel City @ Thorns
        la_por = next(g for g in completed if g.away_team_raw == "Angel City FC")
        assert la_por.home_team_raw == "Portland Thorns FC"
        assert la_por.away_score == 1
        assert la_por.home_score == 2
        assert la_por.stadium_raw == "Providence Park"

    def test_parses_scheduled_games(self):
        """Test parsing scheduled games from ESPN."""
        scraper = NWSLScraper(season=2026)
        data = load_json_fixture(NWSL_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        scheduled = [g for g in games if g.status == "scheduled"]
        assert len(scheduled) == 1

        sd_bay = scheduled[0]
        assert sd_bay.away_team_raw == "San Diego Wave FC"
        assert sd_bay.home_team_raw == "Bay FC"
        assert sd_bay.stadium_raw == "PayPal Park"

    def test_parses_venue_info(self):
        """Test venue information is extracted."""
        scraper = NWSLScraper(season=2026)
        data = load_json_fixture(NWSL_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        for game in games:
            assert game.stadium_raw is not None


class TestGameNormalization:
    """Test game normalization and canonical ID generation."""

    def test_normalizes_games_with_canonical_ids(self):
        """Test games are normalized with correct canonical IDs."""
        scraper = NWSLScraper(season=2026)

        raw_games = [
            RawGameData(
                game_date=datetime(2026, 4, 10),
                home_team_raw="Portland Thorns FC",
                away_team_raw="Angel City FC",
                stadium_raw="Providence Park",
                home_score=2,
                away_score=1,
                status="final",
                source_url="http://example.com",
            )
        ]

        games, review_items = scraper._normalize_games(raw_games)

        assert len(games) == 1
        game = games[0]

        # Check canonical ID format
        assert game.id == "nwsl_2026_anf_por_0410"
        assert game.sport == "nwsl"
        assert game.season == 2026

        # Check team IDs
        assert game.home_team_id == "team_nwsl_por"
        assert game.away_team_id == "team_nwsl_anf"

        # Check scores preserved
        assert game.home_score == 2
        assert game.away_score == 1

    def test_creates_review_items_for_unresolved_teams(self):
        """Test review items are created for unresolved teams."""
        scraper = NWSLScraper(season=2026)

        raw_games = [
            RawGameData(
                game_date=datetime(2026, 4, 10),
                home_team_raw="Unknown Team XYZ",
                away_team_raw="Portland Thorns FC",
                stadium_raw="Providence Park",
                status="scheduled",
            ),
        ]

        games, review_items = scraper._normalize_games(raw_games)

        # Game should not be created due to unresolved team
        assert len(games) == 0

        # But there should be a review item
        assert len(review_items) >= 1


class TestTeamAndStadiumScraping:
    """Test team and stadium data scraping."""

    def test_scrapes_all_nwsl_teams(self):
        """Test all NWSL teams are returned."""
        scraper = NWSLScraper(season=2026)
        teams = scraper.scrape_teams()

        # NWSL has 14 teams
        assert len(teams) == 14

        # Check team IDs are unique
        team_ids = [t.id for t in teams]
        assert len(set(team_ids)) == 14

        # Check all teams have required fields
        for team in teams:
            assert team.id.startswith("team_nwsl_")
            assert team.sport == "nwsl"
            assert team.city
            assert team.name
            assert team.full_name
            assert team.abbreviation

    def test_scrapes_all_nwsl_stadiums(self):
        """Test all NWSL stadiums are returned."""
        scraper = NWSLScraper(season=2026)
        stadiums = scraper.scrape_stadiums()

        # Should have stadiums for all teams
        assert len(stadiums) == 14

        # Check stadium IDs are unique
        stadium_ids = [s.id for s in stadiums]
        assert len(set(stadium_ids)) == 14

        # Check all stadiums have required fields
        for stadium in stadiums:
            assert stadium.id.startswith("stadium_nwsl_")
            assert stadium.sport == "nwsl"
            assert stadium.name
            assert stadium.city
            assert stadium.state
            assert stadium.country == "USA"
            assert stadium.latitude != 0
            assert stadium.longitude != 0


class TestScrapeFallback:
    """Test fallback behavior (NWSL only has ESPN)."""

    def test_returns_failure_when_espn_fails(self):
        """Test scraper returns failure when ESPN fails."""
        scraper = NWSLScraper(season=2026)

        with patch.object(scraper, '_scrape_espn') as mock_espn:
            mock_espn.side_effect = Exception("ESPN failed")

            result = scraper.scrape_games()

            assert not result.success
            assert "All sources failed" in result.error_message


class TestSeasonMonths:
    """Test season month calculation."""

    def test_gets_correct_season_months(self):
        """Test correct months are returned for NWSL season."""
        scraper = NWSLScraper(season=2026)
        months = scraper._get_season_months()

        # NWSL season is March-November
        assert len(months) == 9  # Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov

        # Check first month is March of season year
        assert months[0] == (2026, 3)

        # Check last month is November
        assert months[-1] == (2026, 11)
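The normalization tests in this commit expect canonical game IDs such as `nwsl_2026_anf_por_0410`, which reads as sport, season, away abbreviation, home abbreviation, and month-day. The following is a minimal sketch of that layout under the assumption that nothing else goes into the ID; `canonical_game_id` is a hypothetical helper, and the package's actual generator may handle more cases than this.

```python
from datetime import datetime


def canonical_game_id(sport: str, season: int, away_abbrev: str,
                      home_abbrev: str, game_date: datetime) -> str:
    """Build an ID of the form <sport>_<season>_<away>_<home>_<MMDD> (assumed layout)."""
    return (
        f"{sport.lower()}_{season}_"
        f"{away_abbrev.lower()}_{home_abbrev.lower()}_"
        f"{game_date:%m%d}"
    )


nwsl_id = canonical_game_id("nwsl", 2026, "ANF", "POR", datetime(2026, 4, 10))
wnba_id = canonical_game_id("wnba", 2026, "LV", "NY", datetime(2026, 5, 20))
```

Putting the away team before the home team matches the "Angel City @ Thorns" reading used in the test comments.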
@@ -0,0 +1,226 @@
"""Tests for WNBA scraper."""

from datetime import datetime
from unittest.mock import patch

import pytest

from sportstime_parser.scrapers.wnba import WNBAScraper, create_wnba_scraper
from sportstime_parser.scrapers.base import RawGameData
from sportstime_parser.tests.fixtures import (
    load_json_fixture,
    WNBA_ESPN_SCOREBOARD_JSON,
)


class TestWNBAScraperInit:
    """Test WNBAScraper initialization."""

    def test_creates_scraper_with_season(self):
        """Test scraper initializes with correct season."""
        scraper = WNBAScraper(season=2026)
        assert scraper.sport == "wnba"
        assert scraper.season == 2026

    def test_factory_function_creates_scraper(self):
        """Test factory function creates correct scraper."""
        scraper = create_wnba_scraper(season=2026)
        assert isinstance(scraper, WNBAScraper)
        assert scraper.season == 2026

    def test_expected_game_count(self):
        """Test expected game count is correct for WNBA."""
        scraper = WNBAScraper(season=2026)
        assert scraper.expected_game_count == 220

    def test_sources_in_priority_order(self):
        """Test sources are returned in correct priority order."""
        scraper = WNBAScraper(season=2026)
        sources = scraper._get_sources()
        assert sources == ["espn"]


class TestESPNParsing:
    """Test ESPN API response parsing."""

    def test_parses_completed_games(self):
        """Test parsing completed games from ESPN."""
        scraper = WNBAScraper(season=2026)
        data = load_json_fixture(WNBA_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        completed = [g for g in games if g.status == "final"]
        assert len(completed) == 2

        # Aces @ Liberty
        lv_ny = next(g for g in completed if g.away_team_raw == "Las Vegas Aces")
        assert lv_ny.home_team_raw == "New York Liberty"
        assert lv_ny.away_score == 88
        assert lv_ny.home_score == 92
        assert lv_ny.stadium_raw == "Barclays Center"

    def test_parses_scheduled_games(self):
        """Test parsing scheduled games from ESPN."""
        scraper = WNBAScraper(season=2026)
        data = load_json_fixture(WNBA_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        scheduled = [g for g in games if g.status == "scheduled"]
        assert len(scheduled) == 1

        phx_sea = scheduled[0]
        assert phx_sea.away_team_raw == "Phoenix Mercury"
        assert phx_sea.home_team_raw == "Seattle Storm"
        assert phx_sea.stadium_raw == "Climate Pledge Arena"

    def test_parses_venue_info(self):
        """Test venue information is extracted."""
        scraper = WNBAScraper(season=2026)
        data = load_json_fixture(WNBA_ESPN_SCOREBOARD_JSON)
        games = scraper._parse_espn_response(data, "http://espn.com/api")

        for game in games:
            assert game.stadium_raw is not None


class TestGameNormalization:
    """Test game normalization and canonical ID generation."""

    def test_normalizes_games_with_canonical_ids(self):
        """Test games are normalized with correct canonical IDs."""
        scraper = WNBAScraper(season=2026)

        raw_games = [
            RawGameData(
                game_date=datetime(2026, 5, 20),
                home_team_raw="New York Liberty",
                away_team_raw="Las Vegas Aces",
                stadium_raw="Barclays Center",
                home_score=92,
                away_score=88,
                status="final",
                source_url="http://example.com",
            )
        ]

        games, review_items = scraper._normalize_games(raw_games)

        assert len(games) == 1
        game = games[0]

        # Check canonical ID format
        assert game.id == "wnba_2026_lv_ny_0520"
        assert game.sport == "wnba"
        assert game.season == 2026

        # Check team IDs
        assert game.home_team_id == "team_wnba_ny"
        assert game.away_team_id == "team_wnba_lv"

        # Check scores preserved
        assert game.home_score == 92
        assert game.away_score == 88

    def test_creates_review_items_for_unresolved_teams(self):
        """Test review items are created for unresolved teams."""
        scraper = WNBAScraper(season=2026)

        raw_games = [
            RawGameData(
                game_date=datetime(2026, 5, 20),
                home_team_raw="Unknown Team XYZ",
                away_team_raw="Las Vegas Aces",
                stadium_raw="Barclays Center",
                status="scheduled",
            ),
        ]

        games, review_items = scraper._normalize_games(raw_games)

        # Game should not be created due to unresolved team
        assert len(games) == 0

        # But there should be a review item
        assert len(review_items) >= 1


class TestTeamAndStadiumScraping:
    """Test team and stadium data scraping."""

    def test_scrapes_all_wnba_teams(self):
        """Test all WNBA teams are returned."""
        scraper = WNBAScraper(season=2026)
        teams = scraper.scrape_teams()

        # WNBA has 13 teams (including Golden State Valkyries)
        assert len(teams) == 13

        # Check team IDs are unique
        team_ids = [t.id for t in teams]
        assert len(set(team_ids)) == 13

        # Check all teams have required fields
        for team in teams:
            assert team.id.startswith("team_wnba_")
            assert team.sport == "wnba"
            assert team.city
            assert team.name
            assert team.full_name
            assert team.abbreviation

    def test_scrapes_all_wnba_stadiums(self):
        """Test all WNBA stadiums are returned."""
        scraper = WNBAScraper(season=2026)
        stadiums = scraper.scrape_stadiums()

        # Should have stadiums for all teams
        assert len(stadiums) == 13

        # Check stadium IDs are unique
        stadium_ids = [s.id for s in stadiums]
        assert len(set(stadium_ids)) == 13

        # Check all stadiums have required fields
        for stadium in stadiums:
            assert stadium.id.startswith("stadium_wnba_")
            assert stadium.sport == "wnba"
            assert stadium.name
            assert stadium.city
            assert stadium.state
            assert stadium.country == "USA"
            assert stadium.latitude != 0
            assert stadium.longitude != 0


class TestScrapeFallback:
    """Test fallback behavior (WNBA only has ESPN)."""

    def test_returns_failure_when_espn_fails(self):
        """Test scraper returns failure when ESPN fails."""
        scraper = WNBAScraper(season=2026)

        with patch.object(scraper, '_scrape_espn') as mock_espn:
            mock_espn.side_effect = Exception("ESPN failed")

            result = scraper.scrape_games()

            assert not result.success
            assert "All sources failed" in result.error_message


class TestSeasonMonths:
    """Test season month calculation."""

    def test_gets_correct_season_months(self):
        """Test correct months are returned for WNBA season."""
        scraper = WNBAScraper(season=2026)
        months = scraper._get_season_months()

        # WNBA season is May-October
        assert len(months) == 6  # May, Jun, Jul, Aug, Sep, Oct

        # Check first month is May of season year
        assert months[0] == (2026, 5)

        # Check last month is October
        assert months[-1] == (2026, 10)
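The season-month tests for NHL, NWSL, and WNBA all check the same shape of result: a list of `(year, month)` pairs that starts in the season year and, for the NHL, rolls over into the following calendar year. Below is a minimal sketch of one way to produce such a list. `season_months` and its parameters are hypothetical and are not the scrapers' `_get_season_months` method, whose internals are not shown in this diff.

```python
def season_months(season: int, start_month: int, end_month: int) -> list[tuple[int, int]]:
    """Return (year, month) pairs from start_month of `season` through end_month.

    If end_month is earlier in the calendar than start_month, the season is
    assumed to wrap into the following year (as for an October-June season).
    """
    end_year = season if end_month >= start_month else season + 1
    months = []
    year, month = season, start_month
    while (year, month) <= (end_year, end_month):
        months.append((year, month))
        month += 1
        if month > 12:  # roll over from December to January
            year, month = year + 1, 1
    return months


nhl = season_months(2025, 10, 6)    # October 2025 through June 2026
nwsl = season_months(2026, 3, 11)   # March through November 2026
wnba = season_months(2026, 5, 10)   # May through October 2026
```

Comparing `(year, month)` tuples directly keeps the loop condition correct across the December-January boundary without any special casing.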
@@ -0,0 +1,187 @@
"""Tests for timezone conversion utilities."""

import pytest
from datetime import datetime, date
from zoneinfo import ZoneInfo

from sportstime_parser.normalizers.timezone import (
    detect_timezone_from_string,
    detect_timezone_from_location,
    parse_datetime,
    convert_to_utc,
    get_stadium_timezone,
    TimezoneResult,
)


class TestDetectTimezoneFromString:
    """Tests for detect_timezone_from_string function."""

    def test_eastern_time(self):
        """Test Eastern Time detection."""
        assert detect_timezone_from_string("7:00 PM ET") == "America/New_York"
        assert detect_timezone_from_string("7:00 PM EST") == "America/New_York"
        assert detect_timezone_from_string("7:00 PM EDT") == "America/New_York"

    def test_central_time(self):
        """Test Central Time detection."""
        assert detect_timezone_from_string("8:00 PM CT") == "America/Chicago"
        assert detect_timezone_from_string("8:00 PM CST") == "America/Chicago"
        assert detect_timezone_from_string("8:00 PM CDT") == "America/Chicago"

    def test_mountain_time(self):
        """Test Mountain Time detection."""
        assert detect_timezone_from_string("7:00 PM MT") == "America/Denver"
        assert detect_timezone_from_string("7:00 PM MST") == "America/Denver"

    def test_pacific_time(self):
        """Test Pacific Time detection."""
        assert detect_timezone_from_string("7:00 PM PT") == "America/Los_Angeles"
        assert detect_timezone_from_string("7:00 PM PST") == "America/Los_Angeles"
        assert detect_timezone_from_string("7:00 PM PDT") == "America/Los_Angeles"

    def test_no_timezone(self):
        """Test string with no timezone."""
        assert detect_timezone_from_string("7:00 PM") is None
        assert detect_timezone_from_string("19:00") is None

    def test_case_insensitive(self):
        """Test case insensitive matching."""
        assert detect_timezone_from_string("7:00 PM et") == "America/New_York"
        assert detect_timezone_from_string("7:00 PM Et") == "America/New_York"


class TestDetectTimezoneFromLocation:
    """Tests for detect_timezone_from_location function."""

    def test_eastern_states(self):
        """Test Eastern timezone states."""
        assert detect_timezone_from_location(state="NY") == "America/New_York"
        assert detect_timezone_from_location(state="MA") == "America/New_York"
        assert detect_timezone_from_location(state="FL") == "America/New_York"

    def test_central_states(self):
        """Test Central timezone states."""
        assert detect_timezone_from_location(state="TX") == "America/Chicago"
        assert detect_timezone_from_location(state="IL") == "America/Chicago"

    def test_mountain_states(self):
        """Test Mountain timezone states."""
        assert detect_timezone_from_location(state="CO") == "America/Denver"
        assert detect_timezone_from_location(state="AZ") == "America/Phoenix"

    def test_pacific_states(self):
        """Test Pacific timezone states."""
        assert detect_timezone_from_location(state="CA") == "America/Los_Angeles"
        assert detect_timezone_from_location(state="WA") == "America/Los_Angeles"

    def test_canadian_provinces(self):
        """Test Canadian provinces."""
        assert detect_timezone_from_location(state="ON") == "America/Toronto"
        assert detect_timezone_from_location(state="BC") == "America/Vancouver"
        assert detect_timezone_from_location(state="AB") == "America/Edmonton"

    def test_case_insensitive(self):
        """Test case insensitive matching."""
        assert detect_timezone_from_location(state="ny") == "America/New_York"
        assert detect_timezone_from_location(state="Ny") == "America/New_York"

    def test_unknown_state(self):
        """Test unknown state returns None."""
        assert detect_timezone_from_location(state="XX") is None
        assert detect_timezone_from_location(state=None) is None


class TestParseDatetime:
    """Tests for parse_datetime function."""

    def test_basic_date_time(self):
        """Test basic date and time parsing."""
        result = parse_datetime("2025-12-25", "7:00 PM ET")
        assert result.datetime_utc.year == 2025
        assert result.datetime_utc.month == 12
        assert result.datetime_utc.day == 26  # 7:00 PM EST (UTC-5) is 00:00 UTC the next day
        assert result.source_timezone == "America/New_York"
        assert result.confidence == "high"

    def test_date_only(self):
        """Test date only parsing."""
        result = parse_datetime("2025-10-21")
        assert result.datetime_utc.year == 2025
        assert result.datetime_utc.month == 10
        assert result.datetime_utc.day == 21

    def test_timezone_hint(self):
        """Test timezone hint is used when no timezone in string."""
        result = parse_datetime(
            "2025-10-21",
            "7:00 PM",
            timezone_hint="America/Chicago",
        )
        assert result.source_timezone == "America/Chicago"
        assert result.confidence == "medium"

    def test_location_inference(self):
        """Test timezone inference from location."""
        result = parse_datetime(
            "2025-10-21",
            "7:00 PM",
            location_state="CA",
        )
        assert result.source_timezone == "America/Los_Angeles"
        assert result.confidence == "medium"

    def test_default_to_eastern(self):
        """Test defaults to Eastern when no timezone info."""
        result = parse_datetime("2025-10-21", "7:00 PM")
        assert result.source_timezone == "America/New_York"
        assert result.confidence == "low"
        assert result.warning is not None

    def test_invalid_date(self):
        """Test handling of invalid date."""
        result = parse_datetime("not a date")
        assert result.confidence == "low"
        assert result.warning is not None


class TestConvertToUtc:
    """Tests for convert_to_utc function."""

    def test_convert_naive_datetime(self):
        """Test converting naive datetime to UTC."""
        dt = datetime(2025, 12, 25, 19, 0)  # 7:00 PM
        utc = convert_to_utc(dt, "America/New_York")

        # In December, Eastern Time is UTC-5
        assert utc.hour == 0  # Next day 00:00 UTC
        assert utc.day == 26

    def test_convert_aware_datetime(self):
        """Test converting timezone-aware datetime."""
        tz = ZoneInfo("America/Los_Angeles")
        dt = datetime(2025, 7, 4, 19, 0, tzinfo=tz)  # 7:00 PM PT
        utc = convert_to_utc(dt, "America/Los_Angeles")

        # In July, Pacific Time is UTC-7
        assert utc.hour == 2  # 02:00 UTC next day
        assert utc.day == 5


class TestGetStadiumTimezone:
    """Tests for get_stadium_timezone function."""

    def test_explicit_timezone(self):
        """Test explicit timezone override."""
        tz = get_stadium_timezone("AZ", stadium_timezone="America/Phoenix")
        assert tz == "America/Phoenix"

    def test_state_inference(self):
        """Test timezone from state."""
        tz = get_stadium_timezone("NY")
        assert tz == "America/New_York"

    def test_default_eastern(self):
        """Test default to Eastern for unknown state."""
        tz = get_stadium_timezone("XX")
        assert tz == "America/New_York"
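The UTC conversion tests rely on two offsets: Eastern Time is UTC-5 in December, and Pacific Time is UTC-7 in July. The sketch below reproduces both cases with the standard library only. `to_utc` is a hypothetical stand-in, not the package's `convert_to_utc`, and it assumes a naive datetime should be interpreted as local time in the named zone.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_utc(local_dt: datetime, tz_name: str) -> datetime:
    """Interpret a naive datetime in tz_name (assumption), then convert to UTC.

    Datetimes that already carry tzinfo are converted as they are.
    """
    if local_dt.tzinfo is None:
        local_dt = local_dt.replace(tzinfo=ZoneInfo(tz_name))
    return local_dt.astimezone(timezone.utc)


# 7:00 PM Eastern on 25 December (standard time, UTC-5)
winter = to_utc(datetime(2025, 12, 25, 19, 0), "America/New_York")

# 7:00 PM Pacific on 4 July (daylight time, UTC-7)
summer = to_utc(datetime(2025, 7, 4, 19, 0), "America/Los_Angeles")
```

Using IANA zone names rather than fixed offsets lets `zoneinfo` pick the standard or daylight offset for the given date, which is why the same helper covers both cases.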
@@ -0,0 +1 @@
"""Tests for the uploaders module."""
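The CloudKit record tests in this commit check that each field is serialized as a `{"value": ..., "type": ...}` dictionary, with the types `STRING`, `INT64`, `DOUBLE`, `TIMESTAMP` (milliseconds since the epoch), and `LOCATION`, and that `None` values are dropped. The block below is a minimal sketch of such a mapping. `format_field` and `format_fields` are hypothetical helpers written for illustration; they are not the `CloudKitRecord.to_cloudkit_dict` implementation, and the handling of `bool` is an assumption the tests do not cover.

```python
from datetime import datetime
from typing import Any


def format_field(value: Any) -> dict:
    """Wrap one Python value in a {"value", "type"} dict (assumed mapping)."""
    if isinstance(value, str):
        return {"value": value, "type": "STRING"}
    if isinstance(value, bool):
        # Assumption: bool is a subclass of int, so store it as 0 or 1.
        return {"value": int(value), "type": "INT64"}
    if isinstance(value, int):
        return {"value": value, "type": "INT64"}
    if isinstance(value, float):
        return {"value": value, "type": "DOUBLE"}
    if isinstance(value, datetime):
        # Milliseconds since the Unix epoch, as the datetime test expects.
        return {"value": int(value.timestamp() * 1000), "type": "TIMESTAMP"}
    if isinstance(value, dict) and {"latitude", "longitude"} <= value.keys():
        return {"value": dict(value), "type": "LOCATION"}
    raise TypeError(f"Unsupported field type: {type(value).__name__}")


def format_fields(fields: dict) -> dict:
    """Format every field, skipping those whose value is None."""
    return {
        name: format_field(value)
        for name, value in fields.items()
        if value is not None
    }


formatted = format_fields({
    "sport": "nba",
    "season": 2025,
    "latitude": 35.4634,
    "location": {"latitude": 35.4634, "longitude": -97.5151},
    "score": None,
})
```

Checking `bool` before `int` matters because `isinstance(True, int)` is true in Python. Note also that `timestamp()` on a naive datetime uses the local timezone of the machine running the code.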
@@ -0,0 +1,461 @@
"""Tests for the CloudKit client."""

import json
import pytest
from datetime import datetime
from unittest.mock import Mock, patch, MagicMock

from sportstime_parser.uploaders.cloudkit import (
    CloudKitClient,
    CloudKitRecord,
    CloudKitError,
    CloudKitAuthError,
    CloudKitRateLimitError,
    CloudKitServerError,
    RecordType,
    OperationResult,
    BatchResult,
)


class TestCloudKitRecord:
    """Tests for CloudKitRecord dataclass."""

    def test_create_record(self):
        """Test creating a CloudKitRecord."""
        record = CloudKitRecord(
            record_name="nba_2025_hou_okc_1021",
            record_type=RecordType.GAME,
            fields={
                "sport": "nba",
                "season": 2025,
            },
        )

        assert record.record_name == "nba_2025_hou_okc_1021"
        assert record.record_type == RecordType.GAME
        assert record.fields["sport"] == "nba"
        assert record.record_change_tag is None

    def test_to_cloudkit_dict(self):
        """Test converting to CloudKit API format."""
        record = CloudKitRecord(
            record_name="nba_2025_hou_okc_1021",
            record_type=RecordType.GAME,
            fields={
                "sport": "nba",
                "season": 2025,
            },
        )

        data = record.to_cloudkit_dict()

        assert data["recordName"] == "nba_2025_hou_okc_1021"
        assert data["recordType"] == "Game"
        assert "fields" in data
        assert "recordChangeTag" not in data

    def test_to_cloudkit_dict_with_change_tag(self):
        """Test converting with change tag for updates."""
        record = CloudKitRecord(
            record_name="nba_2025_hou_okc_1021",
            record_type=RecordType.GAME,
            fields={"sport": "nba"},
            record_change_tag="abc123",
        )

        data = record.to_cloudkit_dict()

        assert data["recordChangeTag"] == "abc123"

    def test_format_string_field(self):
        """Test formatting string fields."""
        record = CloudKitRecord(
            record_name="test",
            record_type=RecordType.GAME,
            fields={"name": "Test Name"},
        )

        data = record.to_cloudkit_dict()

        assert data["fields"]["name"]["value"] == "Test Name"
        assert data["fields"]["name"]["type"] == "STRING"

    def test_format_int_field(self):
        """Test formatting integer fields."""
        record = CloudKitRecord(
            record_name="test",
            record_type=RecordType.GAME,
            fields={"count": 42},
        )

        data = record.to_cloudkit_dict()

        assert data["fields"]["count"]["value"] == 42
        assert data["fields"]["count"]["type"] == "INT64"

    def test_format_float_field(self):
        """Test formatting float fields."""
        record = CloudKitRecord(
            record_name="test",
            record_type=RecordType.STADIUM,
            fields={"latitude": 35.4634},
        )

        data = record.to_cloudkit_dict()

        assert data["fields"]["latitude"]["value"] == 35.4634
        assert data["fields"]["latitude"]["type"] == "DOUBLE"

    def test_format_datetime_field(self):
        """Test formatting datetime fields."""
        dt = datetime(2025, 10, 21, 19, 0, 0)
        record = CloudKitRecord(
            record_name="test",
            record_type=RecordType.GAME,
            fields={"game_date": dt},
        )

        data = record.to_cloudkit_dict()

        expected_ms = int(dt.timestamp() * 1000)
        assert data["fields"]["game_date"]["value"] == expected_ms
        assert data["fields"]["game_date"]["type"] == "TIMESTAMP"

    def test_format_location_field(self):
        """Test formatting location fields."""
        record = CloudKitRecord(
            record_name="test",
            record_type=RecordType.STADIUM,
            fields={
                "location": {"latitude": 35.4634, "longitude": -97.5151},
            },
        )

        data = record.to_cloudkit_dict()

        assert data["fields"]["location"]["type"] == "LOCATION"
        assert data["fields"]["location"]["value"]["latitude"] == 35.4634
        assert data["fields"]["location"]["value"]["longitude"] == -97.5151

    def test_skip_none_fields(self):
        """Test that None fields are skipped."""
        record = CloudKitRecord(
            record_name="test",
            record_type=RecordType.GAME,
            fields={
                "sport": "nba",
                "score": None,  # Should be skipped
            },
        )

        data = record.to_cloudkit_dict()

        assert "sport" in data["fields"]
        assert "score" not in data["fields"]


class TestOperationResult:
    """Tests for OperationResult dataclass."""

    def test_successful_result(self):
        """Test creating a successful operation result."""
        result = OperationResult(
            record_name="test_record",
            success=True,
            record_change_tag="new_tag",
        )

        assert result.record_name == "test_record"
        assert result.success is True
        assert result.record_change_tag == "new_tag"
        assert result.error_code is None

    def test_failed_result(self):
        """Test creating a failed operation result."""
        result = OperationResult(
            record_name="test_record",
            success=False,
            error_code="SERVER_ERROR",
            error_message="Internal server error",
        )

        assert result.success is False
        assert result.error_code == "SERVER_ERROR"
        assert result.error_message == "Internal server error"


class TestBatchResult:
    """Tests for BatchResult dataclass."""

    def test_empty_batch_result(self):
        """Test empty batch result."""
        result = BatchResult()

        assert result.all_succeeded is True
        assert result.success_count == 0
        assert result.failure_count == 0

    def test_batch_with_successes(self):
        """Test batch with successful operations."""
        result = BatchResult()
        result.successful.append(OperationResult("rec1", True))
        result.successful.append(OperationResult("rec2", True))

        assert result.all_succeeded is True
        assert result.success_count == 2
        assert result.failure_count == 0

    def test_batch_with_failures(self):
        """Test batch with failed operations."""
        result = BatchResult()
        result.successful.append(OperationResult("rec1", True))
        result.failed.append(OperationResult("rec2", False, error_message="Error"))

        assert result.all_succeeded is False
        assert result.success_count == 1
        assert result.failure_count == 1


class TestCloudKitClient:
|
||||
"""Tests for CloudKitClient."""
|
||||
|
||||
def test_not_configured_without_credentials(self):
|
||||
"""Test that client reports not configured without credentials."""
|
||||
with patch.dict("os.environ", {}, clear=True):
|
||||
client = CloudKitClient()
|
||||
assert client.is_configured is False
|
||||
|
||||
def test_configured_with_credentials(self):
|
||||
"""Test that client reports configured with credentials."""
|
||||
# Create a minimal mock for the private key
|
||||
mock_key = MagicMock()
|
||||
|
||||
with patch.dict("os.environ", {
|
||||
"CLOUDKIT_KEY_ID": "test_key_id",
|
||||
"CLOUDKIT_PRIVATE_KEY": "-----BEGIN EC PRIVATE KEY-----\ntest\n-----END EC PRIVATE KEY-----",
|
||||
}):
|
||||
with patch("sportstime_parser.uploaders.cloudkit.serialization.load_pem_private_key") as mock_load:
|
||||
mock_load.return_value = mock_key
|
||||
client = CloudKitClient()
|
||||
assert client.is_configured is True
|
||||
|
||||
def test_get_api_path(self):
|
||||
"""Test API path construction."""
|
||||
client = CloudKitClient(
|
||||
container_id="iCloud.com.test.app",
|
||||
environment="development",
|
||||
)
|
||||
|
||||
path = client._get_api_path("records/query")
|
||||
|
||||
assert path == "/database/1/iCloud.com.test.app/development/public/records/query"
|
||||
|
||||
@patch("sportstime_parser.uploaders.cloudkit.requests.Session")
|
||||
def test_fetch_records_query(self, mock_session_class):
|
||||
"""Test fetching records with query."""
|
||||
mock_session = MagicMock()
|
||||
mock_session_class.return_value = mock_session
|
||||
|
||||
mock_response = MagicMock()
|
||||
mock_response.status_code = 200
|
||||
mock_response.json.return_value = {
|
||||
"records": [
|
||||
{"recordName": "rec1", "recordType": "Game"},
|
||||
{"recordName": "rec2", "recordType": "Game"},
|
||||
]
|
||||
}
|
||||
mock_session.request.return_value = mock_response
|
||||
|
||||
# Setup client with mocked auth
|
||||
mock_key = MagicMock()
|
||||
mock_key.sign.return_value = b"signature"
|
||||
|
||||
with patch.dict("os.environ", {
|
||||
"CLOUDKIT_KEY_ID": "test_key",
|
||||
"CLOUDKIT_PRIVATE_KEY": "-----BEGIN EC PRIVATE KEY-----\ntest\n-----END EC PRIVATE KEY-----",
|
||||
}):
|
||||
with patch("sportstime_parser.uploaders.cloudkit.serialization.load_pem_private_key") as mock_load:
|
||||
with patch("sportstime_parser.uploaders.cloudkit.jwt.encode") as mock_jwt:
|
||||
mock_load.return_value = mock_key
|
||||
mock_jwt.return_value = "test_token"
|
||||
|
||||
client = CloudKitClient()
|
||||
records = client.fetch_records(RecordType.GAME)
|
||||
|
||||
assert len(records) == 2
|
||||
assert records[0]["recordName"] == "rec1"
|
||||
|
||||
@patch("sportstime_parser.uploaders.cloudkit.requests.Session")
|
||||
def test_save_records_success(self, mock_session_class):
|
||||
"""Test saving records successfully."""
|
||||
mock_session = MagicMock()
|
||||
mock_session_class.return_value = mock_session
|
||||
|
||||
mock_response = MagicMock()
|
||||
mock_response.status_code = 200
|
||||
mock_response.json.return_value = {
|
||||
"records": [
|
||||
{"recordName": "rec1", "recordChangeTag": "tag1"},
|
||||
{"recordName": "rec2", "recordChangeTag": "tag2"},
|
||||
]
|
||||
}
|
||||
mock_session.request.return_value = mock_response
|
||||
|
||||
mock_key = MagicMock()
|
||||
mock_key.sign.return_value = b"signature"
|
||||
|
||||
with patch.dict("os.environ", {
|
||||
"CLOUDKIT_KEY_ID": "test_key",
|
||||
"CLOUDKIT_PRIVATE_KEY": "-----BEGIN EC PRIVATE KEY-----\ntest\n-----END EC PRIVATE KEY-----",
|
||||
}):
|
||||
with patch("sportstime_parser.uploaders.cloudkit.serialization.load_pem_private_key") as mock_load:
|
||||
with patch("sportstime_parser.uploaders.cloudkit.jwt.encode") as mock_jwt:
|
||||
mock_load.return_value = mock_key
|
||||
mock_jwt.return_value = "test_token"
|
||||
|
||||
client = CloudKitClient()
|
||||
|
||||
records = [
|
||||
CloudKitRecord("rec1", RecordType.GAME, {"sport": "nba"}),
|
||||
CloudKitRecord("rec2", RecordType.GAME, {"sport": "nba"}),
|
||||
]
|
||||
|
||||
result = client.save_records(records)
|
||||
|
||||
assert result.success_count == 2
|
||||
assert result.failure_count == 0
|
||||
|
||||
@patch("sportstime_parser.uploaders.cloudkit.requests.Session")
|
||||
def test_save_records_partial_failure(self, mock_session_class):
|
||||
"""Test saving records with some failures."""
|
||||
mock_session = MagicMock()
|
||||
mock_session_class.return_value = mock_session
|
||||
|
||||
mock_response = MagicMock()
|
||||
mock_response.status_code = 200
|
||||
mock_response.json.return_value = {
|
||||
"records": [
|
||||
{"recordName": "rec1", "recordChangeTag": "tag1"},
|
||||
{"recordName": "rec2", "serverErrorCode": "QUOTA_EXCEEDED", "reason": "Quota exceeded"},
|
||||
]
|
||||
}
|
||||
mock_session.request.return_value = mock_response
|
||||
|
||||
mock_key = MagicMock()
|
||||
mock_key.sign.return_value = b"signature"
|
||||
|
||||
with patch.dict("os.environ", {
|
||||
"CLOUDKIT_KEY_ID": "test_key",
|
||||
"CLOUDKIT_PRIVATE_KEY": "-----BEGIN EC PRIVATE KEY-----\ntest\n-----END EC PRIVATE KEY-----",
|
||||
}):
|
||||
with patch("sportstime_parser.uploaders.cloudkit.serialization.load_pem_private_key") as mock_load:
|
||||
with patch("sportstime_parser.uploaders.cloudkit.jwt.encode") as mock_jwt:
|
||||
mock_load.return_value = mock_key
|
||||
mock_jwt.return_value = "test_token"
|
||||
|
||||
client = CloudKitClient()
|
||||
|
||||
records = [
|
||||
CloudKitRecord("rec1", RecordType.GAME, {"sport": "nba"}),
|
||||
CloudKitRecord("rec2", RecordType.GAME, {"sport": "nba"}),
|
||||
]
|
||||
|
||||
result = client.save_records(records)
|
||||
|
||||
assert result.success_count == 1
|
||||
assert result.failure_count == 1
|
||||
assert result.failed[0].error_code == "QUOTA_EXCEEDED"
|
||||
|
||||
@patch("sportstime_parser.uploaders.cloudkit.requests.Session")
|
||||
def test_auth_error(self, mock_session_class):
|
||||
"""Test handling authentication error."""
|
||||
mock_session = MagicMock()
|
||||
mock_session_class.return_value = mock_session
|
||||
|
||||
mock_response = MagicMock()
|
||||
        mock_response.status_code = 421  # CloudKit Web Services reports AUTHENTICATION_REQUIRED as HTTP 421
|
||||
mock_session.request.return_value = mock_response
|
||||
|
||||
mock_key = MagicMock()
|
||||
mock_key.sign.return_value = b"signature"
|
||||
|
||||
with patch.dict("os.environ", {
|
||||
"CLOUDKIT_KEY_ID": "test_key",
|
||||
"CLOUDKIT_PRIVATE_KEY": "-----BEGIN EC PRIVATE KEY-----\ntest\n-----END EC PRIVATE KEY-----",
|
||||
}):
|
||||
with patch("sportstime_parser.uploaders.cloudkit.serialization.load_pem_private_key") as mock_load:
|
||||
with patch("sportstime_parser.uploaders.cloudkit.jwt.encode") as mock_jwt:
|
||||
mock_load.return_value = mock_key
|
||||
mock_jwt.return_value = "test_token"
|
||||
|
||||
client = CloudKitClient()
|
||||
|
||||
with pytest.raises(CloudKitAuthError):
|
||||
client.fetch_records(RecordType.GAME)
|
||||
|
||||
@patch("sportstime_parser.uploaders.cloudkit.requests.Session")
|
||||
def test_rate_limit_error(self, mock_session_class):
|
||||
"""Test handling rate limit error."""
|
||||
mock_session = MagicMock()
|
||||
mock_session_class.return_value = mock_session
|
||||
|
||||
mock_response = MagicMock()
|
||||
mock_response.status_code = 429
|
||||
mock_session.request.return_value = mock_response
|
||||
|
||||
mock_key = MagicMock()
|
||||
mock_key.sign.return_value = b"signature"
|
||||
|
||||
with patch.dict("os.environ", {
|
||||
"CLOUDKIT_KEY_ID": "test_key",
|
||||
"CLOUDKIT_PRIVATE_KEY": "-----BEGIN EC PRIVATE KEY-----\ntest\n-----END EC PRIVATE KEY-----",
|
||||
}):
|
||||
with patch("sportstime_parser.uploaders.cloudkit.serialization.load_pem_private_key") as mock_load:
|
||||
with patch("sportstime_parser.uploaders.cloudkit.jwt.encode") as mock_jwt:
|
||||
mock_load.return_value = mock_key
|
||||
mock_jwt.return_value = "test_token"
|
||||
|
||||
client = CloudKitClient()
|
||||
|
||||
with pytest.raises(CloudKitRateLimitError):
|
||||
client.fetch_records(RecordType.GAME)
|
||||
|
||||
@patch("sportstime_parser.uploaders.cloudkit.requests.Session")
|
||||
def test_server_error(self, mock_session_class):
|
||||
"""Test handling server error."""
|
||||
mock_session = MagicMock()
|
||||
mock_session_class.return_value = mock_session
|
||||
|
||||
mock_response = MagicMock()
|
||||
mock_response.status_code = 503
|
||||
mock_session.request.return_value = mock_response
|
||||
|
||||
mock_key = MagicMock()
|
||||
mock_key.sign.return_value = b"signature"
|
||||
|
||||
with patch.dict("os.environ", {
|
||||
"CLOUDKIT_KEY_ID": "test_key",
|
||||
"CLOUDKIT_PRIVATE_KEY": "-----BEGIN EC PRIVATE KEY-----\ntest\n-----END EC PRIVATE KEY-----",
|
||||
}):
|
||||
with patch("sportstime_parser.uploaders.cloudkit.serialization.load_pem_private_key") as mock_load:
|
||||
with patch("sportstime_parser.uploaders.cloudkit.jwt.encode") as mock_jwt:
|
||||
mock_load.return_value = mock_key
|
||||
mock_jwt.return_value = "test_token"
|
||||
|
||||
client = CloudKitClient()
|
||||
|
||||
with pytest.raises(CloudKitServerError):
|
||||
client.fetch_records(RecordType.GAME)
|
||||
|
||||
|
||||
class TestRecordType:
|
||||
"""Tests for RecordType enum."""
|
||||
|
||||
def test_record_type_values(self):
|
||||
"""Test that record type values match CloudKit schema."""
|
||||
assert RecordType.GAME.value == "Game"
|
||||
assert RecordType.TEAM.value == "Team"
|
||||
assert RecordType.STADIUM.value == "Stadium"
|
||||
assert RecordType.TEAM_ALIAS.value == "TeamAlias"
|
||||
assert RecordType.STADIUM_ALIAS.value == "StadiumAlias"
|
||||
@@ -0,0 +1,350 @@
|
||||
"""Tests for the record differ."""
|
||||
|
||||
import pytest
|
||||
from datetime import datetime
|
||||
|
||||
from sportstime_parser.models.game import Game
|
||||
from sportstime_parser.models.team import Team
|
||||
from sportstime_parser.models.stadium import Stadium
|
||||
from sportstime_parser.uploaders.diff import (
|
||||
DiffAction,
|
||||
RecordDiff,
|
||||
DiffResult,
|
||||
RecordDiffer,
|
||||
game_to_cloudkit_record,
|
||||
team_to_cloudkit_record,
|
||||
stadium_to_cloudkit_record,
|
||||
)
|
||||
from sportstime_parser.uploaders.cloudkit import RecordType
|
||||
|
||||
|
||||
class TestRecordDiff:
|
||||
"""Tests for RecordDiff dataclass."""
|
||||
|
||||
def test_create_record_diff(self):
|
||||
"""Test creating a RecordDiff."""
|
||||
diff = RecordDiff(
|
||||
record_name="nba_2025_hou_okc_1021",
|
||||
record_type=RecordType.GAME,
|
||||
action=DiffAction.CREATE,
|
||||
)
|
||||
|
||||
assert diff.record_name == "nba_2025_hou_okc_1021"
|
||||
assert diff.record_type == RecordType.GAME
|
||||
assert diff.action == DiffAction.CREATE
|
||||
|
||||
|
||||
class TestDiffResult:
|
||||
"""Tests for DiffResult dataclass."""
|
||||
|
||||
def test_empty_result(self):
|
||||
"""Test empty DiffResult."""
|
||||
result = DiffResult()
|
||||
|
||||
assert result.create_count == 0
|
||||
assert result.update_count == 0
|
||||
assert result.delete_count == 0
|
||||
assert result.unchanged_count == 0
|
||||
assert result.total_changes == 0
|
||||
|
||||
def test_counts(self):
|
||||
"""Test counting different change types."""
|
||||
result = DiffResult()
|
||||
|
||||
result.creates.append(RecordDiff(
|
||||
record_name="game_1",
|
||||
record_type=RecordType.GAME,
|
||||
action=DiffAction.CREATE,
|
||||
))
|
||||
result.creates.append(RecordDiff(
|
||||
record_name="game_2",
|
||||
record_type=RecordType.GAME,
|
||||
action=DiffAction.CREATE,
|
||||
))
|
||||
result.updates.append(RecordDiff(
|
||||
record_name="game_3",
|
||||
record_type=RecordType.GAME,
|
||||
action=DiffAction.UPDATE,
|
||||
))
|
||||
result.deletes.append(RecordDiff(
|
||||
record_name="game_4",
|
||||
record_type=RecordType.GAME,
|
||||
action=DiffAction.DELETE,
|
||||
))
|
||||
result.unchanged.append(RecordDiff(
|
||||
record_name="game_5",
|
||||
record_type=RecordType.GAME,
|
||||
action=DiffAction.UNCHANGED,
|
||||
))
|
||||
|
||||
assert result.create_count == 2
|
||||
assert result.update_count == 1
|
||||
assert result.delete_count == 1
|
||||
assert result.unchanged_count == 1
|
||||
        assert result.total_changes == 4  # 2 creates + 1 update + 1 delete; unchanged is excluded
|
||||
|
||||
|
||||
class TestRecordDiffer:
|
||||
"""Tests for RecordDiffer."""
|
||||
|
||||
@pytest.fixture
|
||||
def differ(self):
|
||||
"""Create a RecordDiffer instance."""
|
||||
return RecordDiffer()
|
||||
|
||||
@pytest.fixture
|
||||
def sample_game(self):
|
||||
"""Create a sample Game."""
|
||||
return Game(
|
||||
id="nba_2025_hou_okc_1021",
|
||||
sport="nba",
|
||||
season=2025,
|
||||
home_team_id="team_nba_okc",
|
||||
away_team_id="team_nba_hou",
|
||||
stadium_id="stadium_nba_paycom_center",
|
||||
game_date=datetime(2025, 10, 21, 19, 0, 0),
|
||||
status="scheduled",
|
||||
)
|
||||
|
||||
@pytest.fixture
|
||||
def sample_team(self):
|
||||
"""Create a sample Team."""
|
||||
return Team(
|
||||
id="team_nba_okc",
|
||||
sport="nba",
|
||||
city="Oklahoma City",
|
||||
name="Thunder",
|
||||
full_name="Oklahoma City Thunder",
|
||||
abbreviation="OKC",
|
||||
conference="Western",
|
||||
division="Northwest",
|
||||
)
|
||||
|
||||
@pytest.fixture
|
||||
def sample_stadium(self):
|
||||
"""Create a sample Stadium."""
|
||||
return Stadium(
|
||||
id="stadium_nba_paycom_center",
|
||||
sport="nba",
|
||||
name="Paycom Center",
|
||||
city="Oklahoma City",
|
||||
state="OK",
|
||||
country="USA",
|
||||
latitude=35.4634,
|
||||
longitude=-97.5151,
|
||||
capacity=18203,
|
||||
)
|
||||
|
||||
def test_diff_games_create(self, differ, sample_game):
|
||||
"""Test detecting new games to create."""
|
||||
local_games = [sample_game]
|
||||
remote_records = []
|
||||
|
||||
result = differ.diff_games(local_games, remote_records)
|
||||
|
||||
assert result.create_count == 1
|
||||
assert result.update_count == 0
|
||||
assert result.delete_count == 0
|
||||
assert result.creates[0].record_name == sample_game.id
|
||||
|
||||
def test_diff_games_delete(self, differ, sample_game):
|
||||
"""Test detecting games to delete."""
|
||||
local_games = []
|
||||
remote_records = [
|
||||
{
|
||||
"recordName": sample_game.id,
|
||||
"recordType": "Game",
|
||||
"fields": {
|
||||
"sport": {"value": "nba", "type": "STRING"},
|
||||
"season": {"value": 2025, "type": "INT64"},
|
||||
},
|
||||
"recordChangeTag": "abc123",
|
||||
}
|
||||
]
|
||||
|
||||
result = differ.diff_games(local_games, remote_records)
|
||||
|
||||
assert result.create_count == 0
|
||||
assert result.delete_count == 1
|
||||
assert result.deletes[0].record_name == sample_game.id
|
||||
|
||||
def test_diff_games_unchanged(self, differ, sample_game):
|
||||
"""Test detecting unchanged games."""
|
||||
local_games = [sample_game]
|
||||
remote_records = [
|
||||
{
|
||||
"recordName": sample_game.id,
|
||||
"recordType": "Game",
|
||||
"fields": {
|
||||
"sport": {"value": "nba", "type": "STRING"},
|
||||
"season": {"value": 2025, "type": "INT64"},
|
||||
"home_team_id": {"value": "team_nba_okc", "type": "STRING"},
|
||||
"away_team_id": {"value": "team_nba_hou", "type": "STRING"},
|
||||
"stadium_id": {"value": "stadium_nba_paycom_center", "type": "STRING"},
|
||||
"game_date": {"value": int(sample_game.game_date.timestamp() * 1000), "type": "TIMESTAMP"},
|
||||
"game_number": {"value": None, "type": "INT64"},
|
||||
"home_score": {"value": None, "type": "INT64"},
|
||||
"away_score": {"value": None, "type": "INT64"},
|
||||
"status": {"value": "scheduled", "type": "STRING"},
|
||||
},
|
||||
"recordChangeTag": "abc123",
|
||||
}
|
||||
]
|
||||
|
||||
result = differ.diff_games(local_games, remote_records)
|
||||
|
||||
assert result.create_count == 0
|
||||
assert result.update_count == 0
|
||||
assert result.unchanged_count == 1
|
||||
|
||||
def test_diff_games_update(self, differ, sample_game):
|
||||
"""Test detecting games that need update."""
|
||||
local_games = [sample_game]
|
||||
# Remote has different status
|
||||
remote_records = [
|
||||
{
|
||||
"recordName": sample_game.id,
|
||||
"recordType": "Game",
|
||||
"fields": {
|
||||
"sport": {"value": "nba", "type": "STRING"},
|
||||
"season": {"value": 2025, "type": "INT64"},
|
||||
"home_team_id": {"value": "team_nba_okc", "type": "STRING"},
|
||||
"away_team_id": {"value": "team_nba_hou", "type": "STRING"},
|
||||
"stadium_id": {"value": "stadium_nba_paycom_center", "type": "STRING"},
|
||||
"game_date": {"value": int(sample_game.game_date.timestamp() * 1000), "type": "TIMESTAMP"},
|
||||
"game_number": {"value": None, "type": "INT64"},
|
||||
"home_score": {"value": None, "type": "INT64"},
|
||||
"away_score": {"value": None, "type": "INT64"},
|
||||
                    "status": {"value": "postponed", "type": "STRING"},  # local status is "scheduled"
|
||||
},
|
||||
"recordChangeTag": "abc123",
|
||||
}
|
||||
]
|
||||
|
||||
result = differ.diff_games(local_games, remote_records)
|
||||
|
||||
assert result.update_count == 1
|
||||
assert "status" in result.updates[0].changed_fields
|
||||
assert result.updates[0].record_change_tag == "abc123"
|
||||
|
||||
def test_diff_teams_create(self, differ, sample_team):
|
||||
"""Test detecting new teams to create."""
|
||||
local_teams = [sample_team]
|
||||
remote_records = []
|
||||
|
||||
result = differ.diff_teams(local_teams, remote_records)
|
||||
|
||||
assert result.create_count == 1
|
||||
assert result.creates[0].record_name == sample_team.id
|
||||
|
||||
def test_diff_stadiums_create(self, differ, sample_stadium):
|
||||
"""Test detecting new stadiums to create."""
|
||||
local_stadiums = [sample_stadium]
|
||||
remote_records = []
|
||||
|
||||
result = differ.diff_stadiums(local_stadiums, remote_records)
|
||||
|
||||
assert result.create_count == 1
|
||||
assert result.creates[0].record_name == sample_stadium.id
|
||||
|
||||
def test_get_records_to_upload(self, differ, sample_game):
|
||||
"""Test getting CloudKitRecords for upload."""
|
||||
game2 = Game(
|
||||
id="nba_2025_lal_lac_1022",
|
||||
sport="nba",
|
||||
season=2025,
|
||||
home_team_id="team_nba_lac",
|
||||
away_team_id="team_nba_lal",
|
||||
stadium_id="stadium_nba_crypto_com",
|
||||
game_date=datetime(2025, 10, 22, 19, 0, 0),
|
||||
status="scheduled",
|
||||
)
|
||||
|
||||
local_games = [sample_game, game2]
|
||||
        # Only game2 exists remotely, and its status differs from the local copy
|
||||
remote_records = [
|
||||
{
|
||||
"recordName": game2.id,
|
||||
"recordType": "Game",
|
||||
"fields": {
|
||||
"sport": {"value": "nba", "type": "STRING"},
|
||||
"season": {"value": 2025, "type": "INT64"},
|
||||
"home_team_id": {"value": "team_nba_lac", "type": "STRING"},
|
||||
"away_team_id": {"value": "team_nba_lal", "type": "STRING"},
|
||||
"stadium_id": {"value": "stadium_nba_crypto_com", "type": "STRING"},
|
||||
"game_date": {"value": int(game2.game_date.timestamp() * 1000), "type": "TIMESTAMP"},
|
||||
                    "status": {"value": "postponed", "type": "STRING"},  # local status is "scheduled"
|
||||
},
|
||||
"recordChangeTag": "xyz789",
|
||||
}
|
||||
]
|
||||
|
||||
result = differ.diff_games(local_games, remote_records)
|
||||
records = result.get_records_to_upload()
|
||||
|
||||
        assert len(records) == 2  # sample_game is created, game2 is updated
|
||||
record_names = [r.record_name for r in records]
|
||||
assert sample_game.id in record_names
|
||||
assert game2.id in record_names
|
||||
|
||||
|
||||
class TestConvenienceFunctions:
|
||||
"""Tests for module-level convenience functions."""
|
||||
|
||||
def test_game_to_cloudkit_record(self):
|
||||
"""Test converting Game to CloudKitRecord."""
|
||||
game = Game(
|
||||
id="nba_2025_hou_okc_1021",
|
||||
sport="nba",
|
||||
season=2025,
|
||||
home_team_id="team_nba_okc",
|
||||
away_team_id="team_nba_hou",
|
||||
stadium_id="stadium_nba_paycom_center",
|
||||
game_date=datetime(2025, 10, 21, 19, 0, 0),
|
||||
status="scheduled",
|
||||
)
|
||||
|
||||
record = game_to_cloudkit_record(game)
|
||||
|
||||
assert record.record_name == game.id
|
||||
assert record.record_type == RecordType.GAME
|
||||
assert record.fields["sport"] == "nba"
|
||||
assert record.fields["season"] == 2025
|
||||
|
||||
def test_team_to_cloudkit_record(self):
|
||||
"""Test converting Team to CloudKitRecord."""
|
||||
team = Team(
|
||||
id="team_nba_okc",
|
||||
sport="nba",
|
||||
city="Oklahoma City",
|
||||
name="Thunder",
|
||||
full_name="Oklahoma City Thunder",
|
||||
abbreviation="OKC",
|
||||
)
|
||||
|
||||
record = team_to_cloudkit_record(team)
|
||||
|
||||
assert record.record_name == team.id
|
||||
assert record.record_type == RecordType.TEAM
|
||||
assert record.fields["city"] == "Oklahoma City"
|
||||
assert record.fields["name"] == "Thunder"
|
||||
|
||||
def test_stadium_to_cloudkit_record(self):
|
||||
"""Test converting Stadium to CloudKitRecord."""
|
||||
stadium = Stadium(
|
||||
id="stadium_nba_paycom_center",
|
||||
sport="nba",
|
||||
name="Paycom Center",
|
||||
city="Oklahoma City",
|
||||
state="OK",
|
||||
country="USA",
|
||||
latitude=35.4634,
|
||||
longitude=-97.5151,
|
||||
)
|
||||
|
||||
record = stadium_to_cloudkit_record(stadium)
|
||||
|
||||
assert record.record_name == stadium.id
|
||||
assert record.record_type == RecordType.STADIUM
|
||||
assert record.fields["name"] == "Paycom Center"
|
||||
assert record.fields["latitude"] == 35.4634
|
||||
@@ -0,0 +1,472 @@
|
||||
"""Tests for the upload state manager."""
|
||||
|
||||
import json
|
||||
import pytest
|
||||
from datetime import datetime, timedelta
|
||||
from pathlib import Path
|
||||
from tempfile import TemporaryDirectory
|
||||
|
||||
from sportstime_parser.uploaders.state import (
|
||||
RecordState,
|
||||
UploadSession,
|
||||
StateManager,
|
||||
)
|
||||
|
||||
|
||||
class TestRecordState:
|
||||
"""Tests for RecordState dataclass."""
|
||||
|
||||
def test_create_record_state(self):
|
||||
"""Test creating a RecordState with default values."""
|
||||
state = RecordState(
|
||||
record_name="nba_2025_hou_okc_1021",
|
||||
record_type="Game",
|
||||
)
|
||||
|
||||
assert state.record_name == "nba_2025_hou_okc_1021"
|
||||
assert state.record_type == "Game"
|
||||
assert state.status == "pending"
|
||||
assert state.uploaded_at is None
|
||||
assert state.record_change_tag is None
|
||||
assert state.error_message is None
|
||||
assert state.retry_count == 0
|
||||
|
||||
def test_record_state_to_dict(self):
|
||||
"""Test serializing RecordState to dictionary."""
|
||||
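        # A fixed timestamp keeps the test deterministic and avoids datetime.utcnow(),
        # which is deprecated as of Python 3.12.
        now = datetime(2026, 1, 10, 12, 0, 0)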
|
||||
state = RecordState(
|
||||
record_name="nba_2025_hou_okc_1021",
|
||||
record_type="Game",
|
||||
uploaded_at=now,
|
||||
record_change_tag="abc123",
|
||||
status="uploaded",
|
||||
)
|
||||
|
||||
data = state.to_dict()
|
||||
|
||||
assert data["record_name"] == "nba_2025_hou_okc_1021"
|
||||
assert data["record_type"] == "Game"
|
||||
assert data["status"] == "uploaded"
|
||||
assert data["uploaded_at"] == now.isoformat()
|
||||
assert data["record_change_tag"] == "abc123"
|
||||
|
||||
def test_record_state_from_dict(self):
|
||||
"""Test deserializing RecordState from dictionary."""
|
||||
data = {
|
||||
"record_name": "nba_2025_hou_okc_1021",
|
||||
"record_type": "Game",
|
||||
"uploaded_at": "2026-01-10T12:00:00",
|
||||
"record_change_tag": "abc123",
|
||||
"status": "uploaded",
|
||||
"error_message": None,
|
||||
"retry_count": 0,
|
||||
}
|
||||
|
||||
state = RecordState.from_dict(data)
|
||||
|
||||
assert state.record_name == "nba_2025_hou_okc_1021"
|
||||
assert state.record_type == "Game"
|
||||
assert state.status == "uploaded"
|
||||
assert state.uploaded_at == datetime.fromisoformat("2026-01-10T12:00:00")
|
||||
assert state.record_change_tag == "abc123"
|
||||
|
||||
|
||||
class TestUploadSession:
|
||||
"""Tests for UploadSession dataclass."""
|
||||
|
||||
def test_create_upload_session(self):
|
||||
"""Test creating an UploadSession."""
|
||||
session = UploadSession(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
)
|
||||
|
||||
assert session.sport == "nba"
|
||||
assert session.season == 2025
|
||||
assert session.environment == "development"
|
||||
assert session.total_count == 0
|
||||
assert len(session.records) == 0
|
||||
|
||||
def test_add_record(self):
|
||||
"""Test adding records to a session."""
|
||||
session = UploadSession(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
)
|
||||
|
||||
session.add_record("game_1", "Game")
|
||||
session.add_record("game_2", "Game")
|
||||
session.add_record("team_1", "Team")
|
||||
|
||||
assert session.total_count == 3
|
||||
assert len(session.records) == 3
|
||||
assert "game_1" in session.records
|
||||
assert session.records["game_1"].record_type == "Game"
|
||||
|
||||
def test_mark_uploaded(self):
|
||||
"""Test marking a record as uploaded."""
|
||||
session = UploadSession(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
)
|
||||
session.add_record("game_1", "Game")
|
||||
|
||||
session.mark_uploaded("game_1", "change_tag_123")
|
||||
|
||||
assert session.records["game_1"].status == "uploaded"
|
||||
assert session.records["game_1"].record_change_tag == "change_tag_123"
|
||||
assert session.records["game_1"].uploaded_at is not None
|
||||
|
||||
def test_mark_failed(self):
|
||||
"""Test marking a record as failed."""
|
||||
session = UploadSession(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
)
|
||||
session.add_record("game_1", "Game")
|
||||
|
||||
session.mark_failed("game_1", "Server error")
|
||||
|
||||
assert session.records["game_1"].status == "failed"
|
||||
assert session.records["game_1"].error_message == "Server error"
|
||||
assert session.records["game_1"].retry_count == 1
|
||||
|
||||
def test_mark_failed_increments_retry_count(self):
|
||||
"""Test that marking failed increments retry count."""
|
||||
session = UploadSession(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
)
|
||||
session.add_record("game_1", "Game")
|
||||
|
||||
session.mark_failed("game_1", "Error 1")
|
||||
session.mark_failed("game_1", "Error 2")
|
||||
session.mark_failed("game_1", "Error 3")
|
||||
|
||||
assert session.records["game_1"].retry_count == 3
|
||||
|
||||
def test_counts(self):
|
||||
"""Test session counts."""
|
||||
session = UploadSession(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
)
|
||||
session.add_record("game_1", "Game")
|
||||
session.add_record("game_2", "Game")
|
||||
session.add_record("game_3", "Game")
|
||||
|
||||
session.mark_uploaded("game_1")
|
||||
session.mark_failed("game_2", "Error")
|
||||
|
||||
assert session.uploaded_count == 1
|
||||
assert session.failed_count == 1
|
||||
assert session.pending_count == 1
|
||||
|
||||
def test_is_complete(self):
|
||||
"""Test is_complete property."""
|
||||
session = UploadSession(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
)
|
||||
session.add_record("game_1", "Game")
|
||||
session.add_record("game_2", "Game")
|
||||
|
||||
assert not session.is_complete
|
||||
|
||||
session.mark_uploaded("game_1")
|
||||
assert not session.is_complete
|
||||
|
||||
session.mark_uploaded("game_2")
|
||||
assert session.is_complete
|
||||
|
||||
def test_progress_percent(self):
|
||||
"""Test progress percentage calculation."""
|
||||
session = UploadSession(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
)
|
||||
session.add_record("game_1", "Game")
|
||||
session.add_record("game_2", "Game")
|
||||
session.add_record("game_3", "Game")
|
||||
session.add_record("game_4", "Game")
|
||||
|
||||
session.mark_uploaded("game_1")
|
||||
|
||||
assert session.progress_percent == 25.0
|
||||
|
||||
def test_get_pending_records(self):
|
||||
"""Test getting pending record names."""
|
||||
session = UploadSession(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
)
|
||||
session.add_record("game_1", "Game")
|
||||
session.add_record("game_2", "Game")
|
||||
session.add_record("game_3", "Game")
|
||||
|
||||
session.mark_uploaded("game_1")
|
||||
session.mark_failed("game_2", "Error")
|
||||
|
||||
pending = session.get_pending_records()
|
||||
|
||||
assert pending == ["game_3"]
|
||||
|
||||
def test_get_failed_records(self):
|
||||
"""Test getting failed record names."""
|
||||
session = UploadSession(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
)
|
||||
session.add_record("game_1", "Game")
|
||||
session.add_record("game_2", "Game")
|
||||
session.add_record("game_3", "Game")
|
||||
|
||||
session.mark_failed("game_1", "Error 1")
|
||||
session.mark_failed("game_3", "Error 3")
|
||||
|
||||
failed = session.get_failed_records()
|
||||
|
||||
assert set(failed) == {"game_1", "game_3"}
|
||||
|
||||
def test_get_retryable_records(self):
|
||||
"""Test getting records eligible for retry."""
|
||||
session = UploadSession(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
)
|
||||
session.add_record("game_1", "Game")
|
||||
session.add_record("game_2", "Game")
|
||||
session.add_record("game_3", "Game")
|
||||
|
||||
# Fail game_1 once
|
||||
session.mark_failed("game_1", "Error")
|
||||
|
||||
        # Fail game_2 three times so it reaches max_retries and is no longer retryable
|
||||
session.mark_failed("game_2", "Error")
|
||||
session.mark_failed("game_2", "Error")
|
||||
session.mark_failed("game_2", "Error")
|
||||
|
||||
retryable = session.get_retryable_records(max_retries=3)
|
||||
|
||||
assert retryable == ["game_1"]
|
||||
|
||||
def test_to_dict_and_from_dict(self):
|
||||
"""Test round-trip serialization."""
|
||||
session = UploadSession(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
)
|
||||
session.add_record("game_1", "Game")
|
||||
session.add_record("game_2", "Game")
|
||||
session.mark_uploaded("game_1", "tag_123")
|
||||
|
||||
data = session.to_dict()
|
||||
restored = UploadSession.from_dict(data)
|
||||
|
||||
assert restored.sport == session.sport
|
||||
assert restored.season == session.season
|
||||
assert restored.environment == session.environment
|
||||
assert restored.total_count == session.total_count
|
||||
assert restored.uploaded_count == session.uploaded_count
|
||||
assert restored.records["game_1"].status == "uploaded"
|
||||
|
||||
|
||||
class TestStateManager:
|
||||
"""Tests for StateManager."""
|
||||
|
||||
def test_create_session(self):
|
||||
"""Test creating a new session."""
|
||||
with TemporaryDirectory() as tmpdir:
|
||||
manager = StateManager(state_dir=Path(tmpdir))
|
||||
|
||||
session = manager.create_session(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
record_names=[
|
||||
("game_1", "Game"),
|
||||
("game_2", "Game"),
|
||||
("team_1", "Team"),
|
||||
],
|
||||
)
|
||||
|
||||
assert session.sport == "nba"
|
||||
assert session.season == 2025
|
||||
assert session.total_count == 3
|
||||
|
||||
# Check file was created
|
||||
state_file = Path(tmpdir) / "upload_state_nba_2025_development.json"
|
||||
assert state_file.exists()
|
||||
|
||||
def test_load_session(self):
|
||||
"""Test loading an existing session."""
|
||||
with TemporaryDirectory() as tmpdir:
|
||||
manager = StateManager(state_dir=Path(tmpdir))
|
||||
|
||||
# Create and save a session
|
||||
original = manager.create_session(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
record_names=[("game_1", "Game")],
|
||||
)
|
||||
original.mark_uploaded("game_1", "tag_123")
|
||||
manager.save_session(original)
|
||||
|
||||
# Load it back
|
||||
loaded = manager.load_session("nba", 2025, "development")
|
||||
|
||||
assert loaded is not None
|
||||
assert loaded.sport == "nba"
|
||||
assert loaded.records["game_1"].status == "uploaded"
|
||||
|
||||
def test_load_nonexistent_session(self):
|
||||
"""Test loading a session that doesn't exist."""
|
||||
with TemporaryDirectory() as tmpdir:
|
||||
manager = StateManager(state_dir=Path(tmpdir))
|
||||
|
||||
session = manager.load_session("nba", 2025, "development")
|
||||
|
||||
assert session is None
|
||||
|
||||
def test_delete_session(self):
|
||||
"""Test deleting a session."""
|
||||
with TemporaryDirectory() as tmpdir:
|
||||
manager = StateManager(state_dir=Path(tmpdir))
|
||||
|
||||
# Create a session
|
||||
manager.create_session(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
record_names=[("game_1", "Game")],
|
||||
)
|
||||
|
||||
# Delete it
|
||||
result = manager.delete_session("nba", 2025, "development")
|
||||
|
||||
assert result is True
|
||||
|
||||
# Verify it's gone
|
||||
loaded = manager.load_session("nba", 2025, "development")
|
||||
assert loaded is None
|
||||
|
||||
def test_delete_nonexistent_session(self):
|
||||
"""Test deleting a session that doesn't exist."""
|
||||
with TemporaryDirectory() as tmpdir:
|
||||
manager = StateManager(state_dir=Path(tmpdir))
|
||||
|
||||
result = manager.delete_session("nba", 2025, "development")
|
||||
|
||||
assert result is False
|
||||
|
||||
def test_list_sessions(self):
|
||||
"""Test listing all sessions."""
|
||||
with TemporaryDirectory() as tmpdir:
|
||||
manager = StateManager(state_dir=Path(tmpdir))
|
||||
|
||||
# Create multiple sessions
|
||||
manager.create_session(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
record_names=[("game_1", "Game")],
|
||||
)
|
||||
manager.create_session(
|
||||
sport="mlb",
|
||||
season=2026,
|
||||
environment="production",
|
||||
record_names=[("game_2", "Game"), ("game_3", "Game")],
|
||||
)
|
||||
|
||||
sessions = manager.list_sessions()
|
||||
|
||||
assert len(sessions) == 2
|
||||
sports = {s["sport"] for s in sessions}
|
||||
assert sports == {"nba", "mlb"}
|
||||
|
||||
def test_get_session_or_create_new(self):
|
||||
"""Test getting a session when none exists."""
|
||||
with TemporaryDirectory() as tmpdir:
|
||||
manager = StateManager(state_dir=Path(tmpdir))
|
||||
|
||||
session = manager.get_session_or_create(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
record_names=[("game_1", "Game")],
|
||||
resume=False,
|
||||
)
|
||||
|
||||
assert session.sport == "nba"
|
||||
assert session.total_count == 1
|
||||
|
||||
def test_get_session_or_create_resume(self):
|
||||
"""Test resuming an existing session."""
|
||||
with TemporaryDirectory() as tmpdir:
|
||||
manager = StateManager(state_dir=Path(tmpdir))
|
||||
|
||||
# Create initial session
|
||||
original = manager.create_session(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
record_names=[("game_1", "Game"), ("game_2", "Game")],
|
||||
)
|
||||
original.mark_uploaded("game_1", "tag_123")
|
||||
manager.save_session(original)
|
||||
|
||||
# Resume with additional records
|
||||
session = manager.get_session_or_create(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
record_names=[("game_1", "Game"), ("game_2", "Game"), ("game_3", "Game")],
|
||||
resume=True,
|
||||
)
|
||||
|
||||
# Should have original progress plus new record
|
||||
assert session.records["game_1"].status == "uploaded"
|
||||
assert "game_3" in session.records
|
||||
assert session.total_count == 3
|
||||
|
||||
def test_get_session_or_create_overwrite(self):
|
||||
"""Test overwriting an existing session when not resuming."""
|
||||
with TemporaryDirectory() as tmpdir:
|
||||
manager = StateManager(state_dir=Path(tmpdir))
|
||||
|
||||
# Create initial session
|
||||
original = manager.create_session(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
record_names=[("game_1", "Game"), ("game_2", "Game")],
|
||||
)
|
||||
original.mark_uploaded("game_1", "tag_123")
|
||||
manager.save_session(original)
|
||||
|
||||
# Create new session (not resuming)
|
||||
session = manager.get_session_or_create(
|
||||
sport="nba",
|
||||
season=2025,
|
||||
environment="development",
|
||||
record_names=[("game_3", "Game")],
|
||||
resume=False,
|
||||
)
|
||||
|
||||
# Should be a fresh session
|
||||
assert session.total_count == 1
|
||||
assert "game_1" not in session.records
|
||||
assert "game_3" in session.records
|
||||
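The StateManager tests above exercise three behaviours of `get_session_or_create`: creating a fresh session, resuming a session while keeping uploaded progress and adding new records, and overwriting when `resume=False`. Below is a minimal in-memory sketch of that merge logic. The `merge_session` helper and its dict layout are hypothetical and are not the package's implementation, which persists sessions to JSON state files.

```python
from typing import Optional


def merge_session(
    existing: Optional[dict[str, str]],
    record_names: list[tuple[str, str]],
    resume: bool,
) -> dict[str, str]:
    """Return a record_name -> status map for an upload session.

    Hypothetical helper: `existing` maps record names to a status string
    ("pending" or "uploaded") from a previous run, or is None.
    """
    if not resume or existing is None:
        # Fresh session: every requested record starts as pending.
        return {name: "pending" for name, _type in record_names}

    merged = dict(existing)  # copy so prior progress is preserved
    for name, _type in record_names:
        # Only records not seen before are added, as pending.
        merged.setdefault(name, "pending")
    return merged


previous = {"game_1": "uploaded", "game_2": "pending"}
names = [("game_1", "Game"), ("game_2", "Game"), ("game_3", "Game")]

resumed = merge_session(previous, names, resume=True)
fresh = merge_session(previous, [("game_3", "Game")], resume=False)
```

Under these assumptions, `resumed` keeps `game_1` as uploaded and gains `game_3`, while `fresh` contains only `game_3`, which matches what the resume and overwrite tests assert.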
@@ -0,0 +1,52 @@
|
||||
"""CloudKit uploaders for sportstime-parser."""
|
||||
|
||||
from .cloudkit import (
|
||||
CloudKitClient,
|
||||
CloudKitRecord,
|
||||
CloudKitError,
|
||||
CloudKitAuthError,
|
||||
CloudKitRateLimitError,
|
||||
CloudKitServerError,
|
||||
RecordType,
|
||||
OperationResult,
|
||||
BatchResult,
|
||||
)
|
||||
from .state import (
|
||||
RecordState,
|
||||
UploadSession,
|
||||
StateManager,
|
||||
)
|
||||
from .diff import (
|
||||
DiffAction,
|
||||
RecordDiff,
|
||||
DiffResult,
|
||||
RecordDiffer,
|
||||
game_to_cloudkit_record,
|
||||
team_to_cloudkit_record,
|
||||
stadium_to_cloudkit_record,
|
||||
)
|
||||
|
||||
__all__ = [
|
||||
# CloudKit client
|
||||
"CloudKitClient",
|
||||
"CloudKitRecord",
|
||||
"CloudKitError",
|
||||
"CloudKitAuthError",
|
||||
"CloudKitRateLimitError",
|
||||
"CloudKitServerError",
|
||||
"RecordType",
|
||||
"OperationResult",
|
||||
"BatchResult",
|
||||
# State manager
|
||||
"RecordState",
|
||||
"UploadSession",
|
||||
"StateManager",
|
||||
# Differ
|
||||
"DiffAction",
|
||||
"RecordDiff",
|
||||
"DiffResult",
|
||||
"RecordDiffer",
|
||||
"game_to_cloudkit_record",
|
||||
"team_to_cloudkit_record",
|
||||
"stadium_to_cloudkit_record",
|
||||
]
|
||||
@@ -0,0 +1,565 @@
|
||||
"""CloudKit Web Services client for sportstime-parser.
|
||||
|
||||
This module provides a client for uploading data to CloudKit using the
|
||||
CloudKit Web Services API. It handles JWT authentication, request signing,
|
||||
and batch operations.
|
||||
|
||||
Reference: https://developer.apple.com/documentation/cloudkitwebservices
|
||||
"""
|
||||
|
||||
import base64
|
||||
import hashlib
|
||||
import json
|
||||
import os
|
||||
import time
|
||||
from dataclasses import dataclass, field
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
from typing import Any, Optional
|
||||
from enum import Enum
|
||||
|
||||
import jwt
|
||||
import requests
|
||||
from cryptography.hazmat.primitives import hashes, serialization
|
||||
from cryptography.hazmat.primitives.asymmetric import ec
|
||||
from cryptography.hazmat.backends import default_backend
|
||||
|
||||
from ..config import (
|
||||
CLOUDKIT_CONTAINER_ID,
|
||||
CLOUDKIT_ENVIRONMENT,
|
||||
CLOUDKIT_BATCH_SIZE,
|
||||
)
|
||||
from ..utils.logging import get_logger
|
||||
|
||||
|
||||
class RecordType(str, Enum):
|
||||
"""CloudKit record types for SportsTime."""
|
||||
GAME = "Game"
|
||||
TEAM = "Team"
|
||||
STADIUM = "Stadium"
|
||||
TEAM_ALIAS = "TeamAlias"
|
||||
STADIUM_ALIAS = "StadiumAlias"
|
||||
|
||||
|
||||
@dataclass
|
||||
class CloudKitRecord:
|
||||
"""Represents a CloudKit record for upload.
|
||||
|
||||
Attributes:
|
||||
record_name: Unique record identifier (canonical ID)
|
||||
record_type: CloudKit record type
|
||||
fields: Dictionary of field name -> field value
|
||||
record_change_tag: Version tag for conflict detection (None for new records)
|
||||
"""
|
||||
record_name: str
|
||||
record_type: RecordType
|
||||
fields: dict[str, Any]
|
||||
record_change_tag: Optional[str] = None
|
||||
|
||||
def to_cloudkit_dict(self) -> dict:
|
||||
"""Convert to CloudKit API format."""
|
||||
record = {
|
||||
"recordName": self.record_name,
|
||||
"recordType": self.record_type.value,
|
||||
"fields": self._format_fields(),
|
||||
}
|
||||
if self.record_change_tag:
|
||||
record["recordChangeTag"] = self.record_change_tag
|
||||
return record
|
||||
|
||||
def _format_fields(self) -> dict:
|
||||
"""Format fields for CloudKit API."""
|
||||
formatted = {}
|
||||
for key, value in self.fields.items():
|
||||
if value is None:
|
||||
continue
|
||||
formatted[key] = self._format_field_value(value)
|
||||
return formatted
|
||||
|
||||
def _format_field_value(self, value: Any) -> dict:
|
||||
"""Format a single field value for CloudKit API."""
|
||||
if isinstance(value, str):
|
||||
return {"value": value, "type": "STRING"}
|
||||
elif isinstance(value, int) and not isinstance(value, bool):  # bool subclasses int; handled below
|
||||
return {"value": value, "type": "INT64"}
|
||||
elif isinstance(value, float):
|
||||
return {"value": value, "type": "DOUBLE"}
|
||||
elif isinstance(value, bool):
|
||||
return {"value": 1 if value else 0, "type": "INT64"}
|
||||
elif isinstance(value, datetime):
|
||||
# CloudKit expects milliseconds since epoch
|
||||
timestamp_ms = int(value.timestamp() * 1000)
|
||||
return {"value": timestamp_ms, "type": "TIMESTAMP"}
|
||||
elif isinstance(value, list):
|
||||
return {"value": value, "type": "STRING_LIST"}
|
||||
elif isinstance(value, dict) and "latitude" in value and "longitude" in value:
|
||||
return {
|
||||
"value": {
|
||||
"latitude": value["latitude"],
|
||||
"longitude": value["longitude"],
|
||||
},
|
||||
"type": "LOCATION",
|
||||
}
|
||||
else:
|
||||
# Default to string
|
||||
return {"value": str(value), "type": "STRING"}
|
||||
|
||||
|
||||
@dataclass
|
||||
class OperationResult:
|
||||
"""Result of a CloudKit operation."""
|
||||
record_name: str
|
||||
success: bool
|
||||
record_change_tag: Optional[str] = None
|
||||
error_code: Optional[str] = None
|
||||
error_message: Optional[str] = None
|
||||
|
||||
|
||||
@dataclass
|
||||
class BatchResult:
|
||||
"""Result of a batch CloudKit operation."""
|
||||
successful: list[OperationResult] = field(default_factory=list)
|
||||
failed: list[OperationResult] = field(default_factory=list)
|
||||
|
||||
@property
|
||||
def all_succeeded(self) -> bool:
|
||||
return len(self.failed) == 0
|
||||
|
||||
@property
|
||||
def success_count(self) -> int:
|
||||
return len(self.successful)
|
||||
|
||||
@property
|
||||
def failure_count(self) -> int:
|
||||
return len(self.failed)
|
||||
|
||||
|
||||
class CloudKitClient:
|
||||
"""Client for CloudKit Web Services API.
|
||||
|
||||
Handles authentication via server-to-server JWT tokens and provides
|
||||
methods for CRUD operations on CloudKit records.
|
||||
|
||||
Authentication requires:
|
||||
- Key ID: CloudKit key identifier from Apple Developer Portal
|
||||
- Private Key: EC private key in PEM format
|
||||
|
||||
Environment variables:
|
||||
- CLOUDKIT_KEY_ID: The key identifier
|
||||
- CLOUDKIT_PRIVATE_KEY_PATH: Path to the private key file
|
||||
- CLOUDKIT_PRIVATE_KEY: The private key contents (alternative to path)
|
||||
"""
|
||||
|
||||
BASE_URL = "https://api.apple-cloudkit.com"
|
||||
TOKEN_EXPIRY_SECONDS = 3600 # 1 hour
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
container_id: str = CLOUDKIT_CONTAINER_ID,
|
||||
environment: str = CLOUDKIT_ENVIRONMENT,
|
||||
key_id: Optional[str] = None,
|
||||
private_key: Optional[str] = None,
|
||||
private_key_path: Optional[str] = None,
|
||||
):
|
||||
"""Initialize the CloudKit client.
|
||||
|
||||
Args:
|
||||
container_id: CloudKit container identifier
|
||||
environment: 'development' or 'production'
|
||||
key_id: CloudKit server-to-server key ID
|
||||
private_key: PEM-encoded EC private key contents
|
||||
private_key_path: Path to PEM-encoded EC private key file
|
||||
"""
|
||||
self.container_id = container_id
|
||||
self.environment = environment
|
||||
self.logger = get_logger()
|
||||
|
||||
# Load authentication credentials
|
||||
self.key_id = key_id or os.environ.get("CLOUDKIT_KEY_ID")
|
||||
|
||||
if private_key:
|
||||
self._private_key_pem = private_key
|
||||
elif private_key_path:
|
||||
self._private_key_pem = Path(private_key_path).read_text()
|
||||
elif os.environ.get("CLOUDKIT_PRIVATE_KEY"):
|
||||
self._private_key_pem = os.environ["CLOUDKIT_PRIVATE_KEY"]
|
||||
elif os.environ.get("CLOUDKIT_PRIVATE_KEY_PATH"):
|
||||
self._private_key_pem = Path(os.environ["CLOUDKIT_PRIVATE_KEY_PATH"]).read_text()
|
||||
else:
|
||||
self._private_key_pem = None
|
||||
|
||||
# Parse the private key if available
|
||||
self._private_key = None
|
||||
if self._private_key_pem:
|
||||
self._private_key = serialization.load_pem_private_key(
|
||||
self._private_key_pem.encode(),
|
||||
password=None,
|
||||
backend=default_backend(),
|
||||
)
|
||||
|
||||
# Token cache
|
||||
self._token: Optional[str] = None
|
||||
self._token_expiry: float = 0
|
||||
|
||||
# Session for connection pooling
|
||||
self._session = requests.Session()
|
||||
|
||||
@property
|
||||
def is_configured(self) -> bool:
|
||||
"""Check if the client has valid authentication credentials."""
|
||||
return bool(self.key_id and self._private_key)
|
||||
|
||||
def _get_api_path(self, operation: str) -> str:
|
||||
"""Build the full API path for an operation."""
|
||||
return f"/database/1/{self.container_id}/{self.environment}/public/{operation}"
|
||||
|
||||
def _get_token(self) -> str:
|
||||
"""Get a valid JWT token, generating a new one if needed."""
|
||||
if not self.is_configured:
|
||||
raise ValueError(
|
||||
"CloudKit client not configured. Set CLOUDKIT_KEY_ID and either "
|
||||
"CLOUDKIT_PRIVATE_KEY or CLOUDKIT_PRIVATE_KEY_PATH."
|
||||
)
|
||||
|
||||
now = time.time()
|
||||
|
||||
# Return cached token if still valid (with 5 min buffer)
|
||||
if self._token and (self._token_expiry - now) > 300:
|
||||
return self._token
|
||||
|
||||
# Generate new token
|
||||
expiry = now + self.TOKEN_EXPIRY_SECONDS
|
||||
|
||||
payload = {
|
||||
"iss": self.key_id,
|
||||
"iat": int(now),
|
||||
"exp": int(expiry),
|
||||
"sub": self.container_id,
|
||||
}
|
||||
|
||||
self._token = jwt.encode(
|
||||
payload,
|
||||
self._private_key,
|
||||
algorithm="ES256",
|
||||
)
|
||||
self._token_expiry = expiry
|
||||
|
||||
return self._token
|
||||
|
||||
def _sign_request(self, method: str, path: str, body: Optional[bytes] = None) -> dict:
|
||||
"""Generate request headers with authentication.
|
||||
|
||||
Args:
|
||||
method: HTTP method
|
||||
path: API path
|
||||
body: Request body bytes
|
||||
|
||||
Returns:
|
||||
Dictionary of headers to include in the request
|
||||
"""
|
||||
token = self._get_token()
|
||||
|
||||
# CloudKit expects an ISO 8601 UTC timestamp; time.gmtime() avoids the deprecated datetime.utcnow()
|
||||
date_str = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
|
||||
|
||||
# Calculate body hash
|
||||
if body:
|
||||
body_hash = base64.b64encode(
|
||||
hashlib.sha256(body).digest()
|
||||
).decode()
|
||||
else:
|
||||
body_hash = base64.b64encode(
|
||||
hashlib.sha256(b"").digest()
|
||||
).decode()
|
||||
|
||||
# Build the message to sign
|
||||
# Format: date:body_hash:path
|
||||
message = f"{date_str}:{body_hash}:{path}"
|
||||
|
||||
# Sign the message
|
||||
signature = self._private_key.sign(
|
||||
message.encode(),
|
||||
ec.ECDSA(hashes.SHA256()),
|
||||
)
|
||||
signature_b64 = base64.b64encode(signature).decode()
|
||||
|
||||
return {
|
||||
"Authorization": f"Bearer {token}",
|
||||
"X-Apple-CloudKit-Request-KeyID": self.key_id,
|
||||
"X-Apple-CloudKit-Request-ISO8601Date": date_str,
|
||||
"X-Apple-CloudKit-Request-SignatureV1": signature_b64,
|
||||
"Content-Type": "application/json",
|
||||
}
|
||||
|
||||
def _request(
|
||||
self,
|
||||
method: str,
|
||||
operation: str,
|
||||
body: Optional[dict] = None,
|
||||
) -> dict:
|
||||
"""Make a request to the CloudKit API.
|
||||
|
||||
Args:
|
||||
method: HTTP method
|
||||
operation: API operation path
|
||||
body: Request body as dictionary
|
||||
|
||||
Returns:
|
||||
Response data as dictionary
|
||||
|
||||
Raises:
|
||||
CloudKitError: If the request fails
|
||||
"""
|
||||
path = self._get_api_path(operation)
|
||||
url = f"{self.BASE_URL}{path}"
|
||||
|
||||
body_bytes = json.dumps(body).encode() if body else None
|
||||
headers = self._sign_request(method, path, body_bytes)
|
||||
|
||||
response = self._session.request(
|
||||
method=method,
|
||||
url=url,
|
||||
headers=headers,
|
||||
data=body_bytes,
|
||||
)
|
||||
|
||||
if response.status_code == 200:
|
||||
return response.json()
|
||||
elif response.status_code == 421:
|
||||
# Authentication required - token may be expired
|
||||
self._token = None
|
||||
raise CloudKitAuthError("Authentication failed - check credentials")
|
||||
elif response.status_code == 429:
|
||||
raise CloudKitRateLimitError("Rate limit exceeded")
|
||||
elif response.status_code >= 500:
|
||||
raise CloudKitServerError(f"Server error: {response.status_code}")
|
||||
else:
|
||||
try:
|
||||
error_data = response.json()
|
||||
error_msg = error_data.get("serverErrorCode", str(response.status_code))
|
||||
except (json.JSONDecodeError, KeyError):
|
||||
error_msg = response.text
|
||||
raise CloudKitError(f"Request failed: {error_msg}")
|
||||
|
||||
def fetch_records(
|
||||
self,
|
||||
record_type: RecordType,
|
||||
record_names: Optional[list[str]] = None,
|
||||
limit: int = 200,
|
||||
) -> list[dict]:
|
||||
"""Fetch records from CloudKit.
|
||||
|
||||
Args:
|
||||
record_type: Type of records to fetch
|
||||
record_names: Specific record names to fetch (optional)
|
||||
limit: Maximum records to return (default 200)
|
||||
|
||||
Returns:
|
||||
List of record dictionaries
|
||||
"""
|
||||
if record_names:
|
||||
# Fetch specific records by name
|
||||
body = {
|
||||
"records": [{"recordName": name} for name in record_names],
|
||||
}
|
||||
response = self._request("POST", "records/lookup", body)
|
||||
else:
|
||||
# Query all records of type
|
||||
body = {
|
||||
"query": {
|
||||
"recordType": record_type.value,
|
||||
},
|
||||
"resultsLimit": limit,
|
||||
}
|
||||
response = self._request("POST", "records/query", body)
|
||||
|
||||
records = response.get("records", [])
|
||||
return [r for r in records if "recordName" in r]
|
||||
|
||||
def fetch_all_records(self, record_type: RecordType) -> list[dict]:
|
||||
"""Fetch all records of a type using pagination.
|
||||
|
||||
Args:
|
||||
record_type: Type of records to fetch
|
||||
|
||||
Returns:
|
||||
List of all record dictionaries
|
||||
"""
|
||||
all_records = []
|
||||
continuation_marker = None
|
||||
|
||||
while True:
|
||||
body = {
|
||||
"query": {
|
||||
"recordType": record_type.value,
|
||||
},
|
||||
"resultsLimit": 200,
|
||||
}
|
||||
|
||||
if continuation_marker:
|
||||
body["continuationMarker"] = continuation_marker
|
||||
|
||||
response = self._request("POST", "records/query", body)
|
||||
|
||||
records = response.get("records", [])
|
||||
all_records.extend([r for r in records if "recordName" in r])
|
||||
|
||||
continuation_marker = response.get("continuationMarker")
|
||||
if not continuation_marker:
|
||||
break
|
||||
|
||||
return all_records
|
||||
|
||||
def save_records(self, records: list[CloudKitRecord]) -> BatchResult:
|
||||
"""Save records to CloudKit (create or update).
|
||||
|
||||
Args:
|
||||
records: List of records to save
|
||||
|
||||
Returns:
|
||||
BatchResult with success/failure details
|
||||
"""
|
||||
result = BatchResult()
|
||||
|
||||
# Process in batches
|
||||
for i in range(0, len(records), CLOUDKIT_BATCH_SIZE):
|
||||
batch = records[i:i + CLOUDKIT_BATCH_SIZE]
|
||||
batch_result = self._save_batch(batch)
|
||||
result.successful.extend(batch_result.successful)
|
||||
result.failed.extend(batch_result.failed)
|
||||
|
||||
return result
|
||||
|
||||
def _save_batch(self, records: list[CloudKitRecord]) -> BatchResult:
|
||||
"""Save a single batch of records.
|
||||
|
||||
Args:
|
||||
records: List of records (max CLOUDKIT_BATCH_SIZE)
|
||||
|
||||
Returns:
|
||||
BatchResult with success/failure details
|
||||
"""
|
||||
result = BatchResult()
|
||||
|
||||
operations = []
|
||||
for record in records:
|
||||
op = {
|
||||
"operationType": "forceReplace",
|
||||
"record": record.to_cloudkit_dict(),
|
||||
}
|
||||
operations.append(op)
|
||||
|
||||
body = {"operations": operations}
|
||||
|
||||
try:
|
||||
response = self._request("POST", "records/modify", body)
|
||||
except CloudKitError as e:
|
||||
# Entire batch failed
|
||||
for record in records:
|
||||
result.failed.append(OperationResult(
|
||||
record_name=record.record_name,
|
||||
success=False,
|
||||
error_message=str(e),
|
||||
))
|
||||
return result
|
||||
|
||||
# Process individual results
|
||||
for record_data in response.get("records", []):
|
||||
record_name = record_data.get("recordName", "unknown")
|
||||
|
||||
if "serverErrorCode" in record_data:
|
||||
result.failed.append(OperationResult(
|
||||
record_name=record_name,
|
||||
success=False,
|
||||
error_code=record_data.get("serverErrorCode"),
|
||||
error_message=record_data.get("reason"),
|
||||
))
|
||||
else:
|
||||
result.successful.append(OperationResult(
|
||||
record_name=record_name,
|
||||
success=True,
|
||||
record_change_tag=record_data.get("recordChangeTag"),
|
||||
))
|
||||
|
||||
return result
|
||||
|
||||
def delete_records(
|
||||
self,
|
||||
record_type: RecordType,
|
||||
record_names: list[str],
|
||||
) -> BatchResult:
|
||||
"""Delete records from CloudKit.
|
||||
|
||||
Args:
|
||||
record_type: Type of records to delete
|
||||
record_names: List of record names to delete
|
||||
|
||||
Returns:
|
||||
BatchResult with success/failure details
|
||||
"""
|
||||
result = BatchResult()
|
||||
|
||||
# Process in batches
|
||||
for i in range(0, len(record_names), CLOUDKIT_BATCH_SIZE):
|
||||
batch = record_names[i:i + CLOUDKIT_BATCH_SIZE]
|
||||
|
||||
operations = []
|
||||
for name in batch:
|
||||
operations.append({
|
||||
"operationType": "delete",
|
||||
"record": {
|
||||
"recordName": name,
|
||||
"recordType": record_type.value,
|
||||
},
|
||||
})
|
||||
|
||||
body = {"operations": operations}
|
||||
|
||||
try:
|
||||
response = self._request("POST", "records/modify", body)
|
||||
except CloudKitError as e:
|
||||
for name in batch:
|
||||
result.failed.append(OperationResult(
|
||||
record_name=name,
|
||||
success=False,
|
||||
error_message=str(e),
|
||||
))
|
||||
continue
|
||||
|
||||
for record_data in response.get("records", []):
|
||||
record_name = record_data.get("recordName", "unknown")
|
||||
|
||||
if "serverErrorCode" in record_data:
|
||||
result.failed.append(OperationResult(
|
||||
record_name=record_name,
|
||||
success=False,
|
||||
error_code=record_data.get("serverErrorCode"),
|
||||
error_message=record_data.get("reason"),
|
||||
))
|
||||
else:
|
||||
result.successful.append(OperationResult(
|
||||
record_name=record_name,
|
||||
success=True,
|
||||
))
|
||||
|
||||
return result
|
||||
|
||||
|
||||
class CloudKitError(Exception):
|
||||
"""Base exception for CloudKit errors."""
|
||||
pass
|
||||
|
||||
|
||||
class CloudKitAuthError(CloudKitError):
|
||||
"""Authentication error."""
|
||||
pass
|
||||
|
||||
|
||||
class CloudKitRateLimitError(CloudKitError):
|
||||
"""Rate limit exceeded."""
|
||||
pass
|
||||
|
||||
|
||||
class CloudKitServerError(CloudKitError):
|
||||
"""Server-side error."""
|
||||
pass
|
||||
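`_sign_request` above builds the string `date:body_hash:path` and signs it with ECDSA over SHA-256. The sketch below reproduces only the message-construction step, using the standard library so that it can be checked without a private key. `build_signing_message` is a hypothetical helper name, and the container ID and path in the usage lines are placeholders rather than values from this project.

```python
import base64
import hashlib
import json
from typing import Optional


def build_signing_message(date_str: str, path: str, body: Optional[bytes]) -> str:
    """Build the 'date:body_hash:path' string that is later signed.

    Hypothetical helper mirroring the message layout used in _sign_request.
    """
    # A missing body is hashed as the empty byte string.
    digest = hashlib.sha256(body or b"").digest()
    body_hash = base64.b64encode(digest).decode()
    return f"{date_str}:{body_hash}:{path}"


# Placeholder values for illustration only.
path = "/database/1/iCloud.com.example.app/development/public/records/query"
body = json.dumps({"query": {"recordType": "Game"}}).encode()

message = build_signing_message("2025-01-01T00:00:00Z", path, body)
empty_message = build_signing_message("2025-01-01T00:00:00Z", path, None)
```

Because the hash covers the exact bytes sent, the same `body` bytes need to be passed both to the signer and to the HTTP request, which is what `_request` does by serialising once into `body_bytes`.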
@@ -0,0 +1,425 @@
|
||||
"""Record differ for CloudKit uploads.
|
||||
|
||||
This module compares local records with CloudKit records to determine
|
||||
what needs to be created, updated, or deleted.
|
||||
"""
|
||||
|
||||
from dataclasses import dataclass, field
|
||||
from datetime import datetime
|
||||
from enum import Enum
|
||||
from typing import Any, Optional
|
||||
|
||||
from ..models.game import Game
|
||||
from ..models.team import Team
|
||||
from ..models.stadium import Stadium
|
||||
from .cloudkit import CloudKitRecord, RecordType
|
||||
|
||||
|
||||
class DiffAction(str, Enum):
|
||||
"""Action to take for a record."""
|
||||
CREATE = "create"
|
||||
UPDATE = "update"
|
||||
DELETE = "delete"
|
||||
UNCHANGED = "unchanged"
|
||||
|
||||
|
||||
@dataclass
|
||||
class RecordDiff:
|
||||
"""Represents the difference between local and remote records.
|
||||
|
||||
Attributes:
|
||||
record_name: Canonical record ID
|
||||
record_type: CloudKit record type
|
||||
action: Action to take (create, update, delete, unchanged)
|
||||
local_record: Local CloudKitRecord (None if delete)
|
||||
remote_record: Remote record dict (None if create)
|
||||
changed_fields: List of field names that changed (for update)
|
||||
record_change_tag: Remote record's change tag (for update)
|
||||
"""
|
||||
record_name: str
|
||||
record_type: RecordType
|
||||
action: DiffAction
|
||||
local_record: Optional[CloudKitRecord] = None
|
||||
remote_record: Optional[dict] = None
|
||||
changed_fields: list[str] = field(default_factory=list)
|
||||
record_change_tag: Optional[str] = None
|
||||
|
||||
|
||||
@dataclass
|
||||
class DiffResult:
|
||||
"""Result of diffing local and remote records.
|
||||
|
||||
Attributes:
|
||||
creates: Records to create
|
||||
updates: Records to update
|
||||
deletes: Remote records to delete (present in CloudKit, absent locally)
|
||||
unchanged: Records with no changes
|
||||
"""
|
||||
creates: list[RecordDiff] = field(default_factory=list)
|
||||
updates: list[RecordDiff] = field(default_factory=list)
|
||||
deletes: list[RecordDiff] = field(default_factory=list)
|
||||
unchanged: list[RecordDiff] = field(default_factory=list)
|
||||
|
||||
@property
|
||||
def create_count(self) -> int:
|
||||
return len(self.creates)
|
||||
|
||||
@property
|
||||
def update_count(self) -> int:
|
||||
return len(self.updates)
|
||||
|
||||
@property
|
||||
def delete_count(self) -> int:
|
||||
return len(self.deletes)
|
||||
|
||||
@property
|
||||
def unchanged_count(self) -> int:
|
||||
return len(self.unchanged)
|
||||
|
||||
@property
|
||||
def total_changes(self) -> int:
|
||||
return self.create_count + self.update_count + self.delete_count
|
||||
|
||||
def get_records_to_upload(self) -> list[CloudKitRecord]:
|
||||
"""Get all records that need to be uploaded (creates + updates)."""
|
||||
records = []
|
||||
|
||||
for diff in self.creates:
|
||||
if diff.local_record:
|
||||
records.append(diff.local_record)
|
||||
|
||||
for diff in self.updates:
|
||||
if diff.local_record:
|
||||
# Add change tag for update
|
||||
diff.local_record.record_change_tag = diff.record_change_tag
|
||||
records.append(diff.local_record)
|
||||
|
||||
return records
|
||||
|
||||
|
||||
class RecordDiffer:
|
||||
"""Compares local records with CloudKit records."""
|
||||
|
||||
# Fields to compare for each record type
|
||||
GAME_FIELDS = [
|
||||
"sport", "season", "home_team_id", "away_team_id", "stadium_id",
|
||||
"game_date", "game_number", "home_score", "away_score", "status",
|
||||
]
|
||||
|
||||
TEAM_FIELDS = [
|
||||
"sport", "city", "name", "full_name", "abbreviation",
|
||||
"conference", "division", "primary_color", "secondary_color",
|
||||
"logo_url", "stadium_id",
|
||||
]
|
||||
|
||||
STADIUM_FIELDS = [
|
||||
"sport", "name", "city", "state", "country",
|
||||
"latitude", "longitude", "capacity", "surface",
|
||||
"roof_type", "opened_year", "image_url", "timezone",
|
||||
]
|
||||
|
||||
def diff_games(
|
||||
self,
|
||||
local_games: list[Game],
|
||||
remote_records: list[dict],
|
||||
) -> DiffResult:
|
||||
"""Diff local games against remote CloudKit records.
|
||||
|
||||
Args:
|
||||
local_games: List of local Game objects
|
||||
remote_records: List of remote record dictionaries
|
||||
|
||||
Returns:
|
||||
DiffResult with creates, updates, deletes
|
||||
"""
|
||||
local_records = [self._game_to_record(g) for g in local_games]
|
||||
return self._diff_records(
|
||||
local_records,
|
||||
remote_records,
|
||||
RecordType.GAME,
|
||||
self.GAME_FIELDS,
|
||||
)
|
||||
|
||||
def diff_teams(
|
||||
self,
|
||||
local_teams: list[Team],
|
||||
remote_records: list[dict],
|
||||
) -> DiffResult:
|
||||
"""Diff local teams against remote CloudKit records.
|
||||
|
||||
Args:
|
||||
local_teams: List of local Team objects
|
||||
remote_records: List of remote record dictionaries
|
||||
|
||||
Returns:
|
||||
DiffResult with creates, updates, deletes
|
||||
"""
|
||||
local_records = [self._team_to_record(t) for t in local_teams]
|
||||
return self._diff_records(
|
||||
local_records,
|
||||
remote_records,
|
||||
RecordType.TEAM,
|
||||
self.TEAM_FIELDS,
|
||||
)
|
||||
|
||||
def diff_stadiums(
|
||||
self,
|
||||
local_stadiums: list[Stadium],
|
||||
remote_records: list[dict],
|
||||
) -> DiffResult:
|
||||
"""Diff local stadiums against remote CloudKit records.
|
||||
|
||||
Args:
|
||||
local_stadiums: List of local Stadium objects
|
||||
remote_records: List of remote record dictionaries
|
||||
|
||||
Returns:
|
||||
DiffResult with creates, updates, deletes
|
||||
"""
|
||||
local_records = [self._stadium_to_record(s) for s in local_stadiums]
|
||||
return self._diff_records(
|
||||
local_records,
|
||||
remote_records,
|
||||
RecordType.STADIUM,
|
||||
self.STADIUM_FIELDS,
|
||||
)
|
||||
|
||||
def _diff_records(
|
||||
self,
|
||||
local_records: list[CloudKitRecord],
|
||||
remote_records: list[dict],
|
||||
record_type: RecordType,
|
||||
compare_fields: list[str],
|
||||
) -> DiffResult:
|
||||
"""Compare local and remote records.
|
||||
|
||||
Args:
|
||||
local_records: List of local CloudKitRecord objects
|
||||
remote_records: List of remote record dictionaries
|
||||
record_type: Type of records being compared
|
||||
compare_fields: List of field names to compare
|
||||
|
||||
Returns:
|
||||
DiffResult with categorized differences
|
||||
"""
|
||||
result = DiffResult()
|
||||
|
||||
# Index remote records by name
|
||||
remote_by_name: dict[str, dict] = {}
|
||||
for record in remote_records:
|
||||
name = record.get("recordName")
|
||||
if name:
|
||||
remote_by_name[name] = record
|
||||
|
||||
# Index local records by name
|
||||
local_by_name: dict[str, CloudKitRecord] = {}
|
||||
for record in local_records:
|
||||
local_by_name[record.record_name] = record
|
||||
|
||||
# Find creates and updates
|
||||
for local_record in local_records:
|
||||
remote = remote_by_name.get(local_record.record_name)
|
||||
|
||||
if remote is None:
|
||||
# New record
|
||||
result.creates.append(RecordDiff(
|
||||
record_name=local_record.record_name,
|
||||
record_type=record_type,
|
||||
action=DiffAction.CREATE,
|
||||
local_record=local_record,
|
||||
))
|
||||
else:
|
||||
# Check for changes
|
||||
changed_fields = self._compare_fields(
|
||||
local_record.fields,
|
||||
remote.get("fields", {}),
|
||||
compare_fields,
|
||||
)
|
||||
|
||||
if changed_fields:
|
||||
result.updates.append(RecordDiff(
|
||||
record_name=local_record.record_name,
|
||||
record_type=record_type,
|
||||
action=DiffAction.UPDATE,
|
||||
local_record=local_record,
|
||||
remote_record=remote,
|
||||
changed_fields=changed_fields,
|
||||
record_change_tag=remote.get("recordChangeTag"),
|
||||
))
|
||||
else:
|
||||
result.unchanged.append(RecordDiff(
|
||||
record_name=local_record.record_name,
|
||||
record_type=record_type,
|
||||
action=DiffAction.UNCHANGED,
|
||||
local_record=local_record,
|
||||
remote_record=remote,
|
||||
record_change_tag=remote.get("recordChangeTag"),
|
||||
))
|
||||
|
||||
# Find deletes (remote records not in local)
|
||||
local_names = set(local_by_name.keys())
|
||||
for remote_name, remote in remote_by_name.items():
|
||||
if remote_name not in local_names:
|
||||
result.deletes.append(RecordDiff(
|
||||
record_name=remote_name,
|
||||
record_type=record_type,
|
||||
action=DiffAction.DELETE,
|
||||
remote_record=remote,
|
||||
record_change_tag=remote.get("recordChangeTag"),
|
||||
))
|
||||
|
||||
return result
|
||||
|
||||
def _compare_fields(
|
||||
self,
|
||||
local_fields: dict[str, Any],
|
||||
remote_fields: dict[str, dict],
|
||||
compare_fields: list[str],
|
||||
) -> list[str]:
|
||||
"""Compare field values between local and remote.
|
||||
|
||||
Args:
|
||||
local_fields: Local field values
|
||||
remote_fields: Remote field values (CloudKit format)
|
||||
compare_fields: Fields to compare
|
||||
|
||||
Returns:
|
||||
List of field names that differ
|
||||
"""
|
||||
changed = []
|
||||
|
||||
for field_name in compare_fields:
|
||||
local_value = local_fields.get(field_name)
|
||||
remote_field = remote_fields.get(field_name, {})
|
||||
remote_value = remote_field.get("value") if remote_field else None
|
||||
|
||||
# Normalize values for comparison
|
||||
local_normalized = self._normalize_value(local_value)
|
||||
remote_normalized = self._normalize_remote_value(remote_value, remote_field)
|
||||
|
||||
if local_normalized != remote_normalized:
|
||||
changed.append(field_name)
|
||||
|
||||
return changed
|
||||
|
||||
def _normalize_value(self, value: Any) -> Any:
|
||||
"""Normalize a local value for comparison."""
|
||||
if value is None:
|
||||
return None
|
||||
if isinstance(value, datetime):
|
||||
            # Convert to milliseconds since epoch (timestamp() treats naive datetimes as local time)
|
||||
return int(value.timestamp() * 1000)
|
||||
if isinstance(value, float):
|
||||
# Round to 6 decimal places for coordinate comparison
|
||||
return round(value, 6)
|
||||
return value
|
||||
|
||||
def _normalize_remote_value(self, value: Any, field_data: dict) -> Any:
|
||||
"""Normalize a remote CloudKit value for comparison."""
|
||||
if value is None:
|
||||
return None
|
||||
|
||||
field_type = field_data.get("type", "")
|
||||
|
||||
if field_type == "TIMESTAMP":
|
||||
# Already in milliseconds
|
||||
return value
|
||||
if field_type == "DOUBLE":
|
||||
return round(value, 6)
|
||||
if field_type == "LOCATION":
|
||||
# Return as tuple for comparison
|
||||
if isinstance(value, dict):
|
||||
return (
|
||||
round(value.get("latitude", 0), 6),
|
||||
round(value.get("longitude", 0), 6),
|
||||
)
|
||||
|
||||
return value
|
||||
|
||||
def _game_to_record(self, game: Game) -> CloudKitRecord:
|
||||
"""Convert a Game to a CloudKitRecord."""
|
||||
return CloudKitRecord(
|
||||
record_name=game.id,
|
||||
record_type=RecordType.GAME,
|
||||
fields={
|
||||
"sport": game.sport,
|
||||
"season": game.season,
|
||||
"home_team_id": game.home_team_id,
|
||||
"away_team_id": game.away_team_id,
|
||||
"stadium_id": game.stadium_id,
|
||||
"game_date": game.game_date,
|
||||
"game_number": game.game_number,
|
||||
"home_score": game.home_score,
|
||||
"away_score": game.away_score,
|
||||
"status": game.status,
|
||||
},
|
||||
)
|
||||
|
||||
def _team_to_record(self, team: Team) -> CloudKitRecord:
|
||||
"""Convert a Team to a CloudKitRecord."""
|
||||
return CloudKitRecord(
|
||||
record_name=team.id,
|
||||
record_type=RecordType.TEAM,
|
||||
fields={
|
||||
"sport": team.sport,
|
||||
"city": team.city,
|
||||
"name": team.name,
|
||||
"full_name": team.full_name,
|
||||
"abbreviation": team.abbreviation,
|
||||
"conference": team.conference,
|
||||
"division": team.division,
|
||||
"primary_color": team.primary_color,
|
||||
"secondary_color": team.secondary_color,
|
||||
"logo_url": team.logo_url,
|
||||
"stadium_id": team.stadium_id,
|
||||
},
|
||||
)
|
||||
|
||||
def _stadium_to_record(self, stadium: Stadium) -> CloudKitRecord:
|
||||
"""Convert a Stadium to a CloudKitRecord."""
|
||||
return CloudKitRecord(
|
||||
record_name=stadium.id,
|
||||
record_type=RecordType.STADIUM,
|
||||
fields={
|
||||
"sport": stadium.sport,
|
||||
"name": stadium.name,
|
||||
"city": stadium.city,
|
||||
"state": stadium.state,
|
||||
"country": stadium.country,
|
||||
"latitude": stadium.latitude,
|
||||
"longitude": stadium.longitude,
|
||||
"capacity": stadium.capacity,
|
||||
"surface": stadium.surface,
|
||||
"roof_type": stadium.roof_type,
|
||||
"opened_year": stadium.opened_year,
|
||||
"image_url": stadium.image_url,
|
||||
"timezone": stadium.timezone,
|
||||
},
|
||||
)
|
||||
|
||||
|
||||
def game_to_cloudkit_record(game: Game) -> CloudKitRecord:
|
||||
"""Convert a Game to a CloudKitRecord.
|
||||
|
||||
Convenience function for external use.
|
||||
"""
|
||||
differ = RecordDiffer()
|
||||
return differ._game_to_record(game)
|
||||
|
||||
|
||||
def team_to_cloudkit_record(team: Team) -> CloudKitRecord:
|
||||
"""Convert a Team to a CloudKitRecord.
|
||||
|
||||
Convenience function for external use.
|
||||
"""
|
||||
differ = RecordDiffer()
|
||||
return differ._team_to_record(team)
|
||||
|
||||
|
||||
def stadium_to_cloudkit_record(stadium: Stadium) -> CloudKitRecord:
|
||||
"""Convert a Stadium to a CloudKitRecord.
|
||||
|
||||
Convenience function for external use.
|
||||
"""
|
||||
differ = RecordDiffer()
|
||||
return differ._stadium_to_record(stadium)
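The field comparison in the differ above can be sketched in isolation. This is a minimal, standalone approximation rather than the package's own code: `normalize_local`, `normalize_remote`, and `changed_fields` are hypothetical stand-ins for the private methods, the sample record is invented, and only the TIMESTAMP and DOUBLE cases are covered.

```python
from datetime import datetime, timezone
from typing import Any


def normalize_local(value: Any) -> Any:
    """Normalize a local value the way the differ above does."""
    if isinstance(value, datetime):
        return int(value.timestamp() * 1000)  # milliseconds since epoch
    if isinstance(value, float):
        return round(value, 6)  # coordinate-level precision
    return value


def normalize_remote(field: dict) -> Any:
    """Normalize a CloudKit-style {"value": ..., "type": ...} field."""
    value = field.get("value")
    if value is not None and field.get("type") == "DOUBLE":
        return round(value, 6)
    return value  # TIMESTAMP values are assumed to already be in milliseconds


def changed_fields(local: dict, remote: dict, names: list[str]) -> list[str]:
    """Return the field names whose normalized values differ."""
    return [
        name for name in names
        if normalize_local(local.get(name)) != normalize_remote(remote.get(name, {}))
    ]


# Hypothetical sample data, not taken from the commit.
kickoff = datetime(2025, 4, 1, 19, 0, tzinfo=timezone.utc)
local = {"name": "Example Park", "latitude": 42.34670001, "game_date": kickoff}
remote = {
    "name": {"value": "Example Pk", "type": "STRING"},
    "latitude": {"value": 42.3467, "type": "DOUBLE"},
    "game_date": {"value": int(kickoff.timestamp() * 1000), "type": "TIMESTAMP"},
}

diff = changed_fields(local, remote, ["name", "latitude", "game_date"])
print(diff)
```

Rounding both sides to six decimals keeps tiny floating-point differences in coordinates from being reported as updates, while a real change such as the renamed stadium is still detected.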
|
||||
@@ -0,0 +1,384 @@
|
||||
"""Upload state manager for resumable uploads.
|
||||
|
||||
This module tracks upload progress to enable resuming interrupted uploads.
|
||||
State is persisted to JSON files in the .parser_state directory.
|
||||
"""
|
||||
|
||||
import json
|
||||
from dataclasses import dataclass, field
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
from ..config import STATE_DIR
|
||||
|
||||
|
||||
@dataclass
|
||||
class RecordState:
|
||||
"""State of an individual record upload.
|
||||
|
||||
Attributes:
|
||||
record_name: Canonical record ID
|
||||
record_type: CloudKit record type
|
||||
uploaded_at: Timestamp when successfully uploaded
|
||||
record_change_tag: CloudKit version tag
|
||||
status: 'pending', 'uploaded', 'failed'
|
||||
error_message: Error message if failed
|
||||
retry_count: Number of retry attempts
|
||||
"""
|
||||
record_name: str
|
||||
record_type: str
|
||||
uploaded_at: Optional[datetime] = None
|
||||
record_change_tag: Optional[str] = None
|
||||
status: str = "pending"
|
||||
error_message: Optional[str] = None
|
||||
retry_count: int = 0
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
"""Convert to dictionary for JSON serialization."""
|
||||
return {
|
||||
"record_name": self.record_name,
|
||||
"record_type": self.record_type,
|
||||
"uploaded_at": self.uploaded_at.isoformat() if self.uploaded_at else None,
|
||||
"record_change_tag": self.record_change_tag,
|
||||
"status": self.status,
|
||||
"error_message": self.error_message,
|
||||
"retry_count": self.retry_count,
|
||||
}
|
||||
|
||||
@classmethod
|
||||
def from_dict(cls, data: dict) -> "RecordState":
|
||||
"""Create RecordState from dictionary."""
|
||||
uploaded_at = data.get("uploaded_at")
|
||||
if uploaded_at:
|
||||
uploaded_at = datetime.fromisoformat(uploaded_at)
|
||||
|
||||
return cls(
|
||||
record_name=data["record_name"],
|
||||
record_type=data["record_type"],
|
||||
uploaded_at=uploaded_at,
|
||||
record_change_tag=data.get("record_change_tag"),
|
||||
status=data.get("status", "pending"),
|
||||
error_message=data.get("error_message"),
|
||||
retry_count=data.get("retry_count", 0),
|
||||
)
|
||||
|
||||
|
||||
@dataclass
|
||||
class UploadSession:
|
||||
"""Tracks the state of an upload session.
|
||||
|
||||
Attributes:
|
||||
sport: Sport code
|
||||
season: Season start year
|
||||
environment: CloudKit environment
|
||||
started_at: When the upload session started
|
||||
last_updated: When the state was last updated
|
||||
records: Dictionary of record_name -> RecordState
|
||||
total_count: Total number of records to upload
|
||||
"""
|
||||
sport: str
|
||||
season: int
|
||||
environment: str
|
||||
started_at: datetime = field(default_factory=datetime.utcnow)
|
||||
last_updated: datetime = field(default_factory=datetime.utcnow)
|
||||
records: dict[str, RecordState] = field(default_factory=dict)
|
||||
total_count: int = 0
|
||||
|
||||
@property
|
||||
def uploaded_count(self) -> int:
|
||||
"""Count of successfully uploaded records."""
|
||||
return sum(1 for r in self.records.values() if r.status == "uploaded")
|
||||
|
||||
@property
|
||||
def pending_count(self) -> int:
|
||||
"""Count of pending records."""
|
||||
return sum(1 for r in self.records.values() if r.status == "pending")
|
||||
|
||||
@property
|
||||
def failed_count(self) -> int:
|
||||
"""Count of failed records."""
|
||||
return sum(1 for r in self.records.values() if r.status == "failed")
|
||||
|
||||
@property
|
||||
def is_complete(self) -> bool:
|
||||
        """Check whether no records are still pending (failed records count as processed)."""
|
||||
return self.pending_count == 0
|
||||
|
||||
@property
|
||||
def progress_percent(self) -> float:
|
||||
        """Percentage of tracked records uploaded (100.0 when there is nothing to upload)."""
|
||||
if self.total_count == 0:
|
||||
return 100.0
|
||||
return (self.uploaded_count / self.total_count) * 100
|
||||
|
||||
def get_pending_records(self) -> list[str]:
|
||||
"""Get list of record names that still need to be uploaded."""
|
||||
return [
|
||||
name for name, state in self.records.items()
|
||||
if state.status == "pending"
|
||||
]
|
||||
|
||||
def get_failed_records(self) -> list[str]:
|
||||
"""Get list of record names that failed to upload."""
|
||||
return [
|
||||
name for name, state in self.records.items()
|
||||
if state.status == "failed"
|
||||
]
|
||||
|
||||
def get_retryable_records(self, max_retries: int = 3) -> list[str]:
|
||||
"""Get failed records that can be retried."""
|
||||
return [
|
||||
name for name, state in self.records.items()
|
||||
if state.status == "failed" and state.retry_count < max_retries
|
||||
]
|
||||
|
||||
def mark_uploaded(
|
||||
self,
|
||||
record_name: str,
|
||||
record_change_tag: Optional[str] = None,
|
||||
) -> None:
|
||||
"""Mark a record as successfully uploaded."""
|
||||
if record_name in self.records:
|
||||
state = self.records[record_name]
|
||||
state.status = "uploaded"
|
||||
state.uploaded_at = datetime.utcnow()
|
||||
state.record_change_tag = record_change_tag
|
||||
state.error_message = None
|
||||
self.last_updated = datetime.utcnow()
|
||||
|
||||
def mark_failed(self, record_name: str, error_message: str) -> None:
|
||||
"""Mark a record as failed."""
|
||||
if record_name in self.records:
|
||||
state = self.records[record_name]
|
||||
state.status = "failed"
|
||||
state.error_message = error_message
|
||||
state.retry_count += 1
|
||||
self.last_updated = datetime.utcnow()
|
||||
|
||||
def mark_pending(self, record_name: str) -> None:
|
||||
"""Mark a record as pending (for retry)."""
|
||||
if record_name in self.records:
|
||||
state = self.records[record_name]
|
||||
state.status = "pending"
|
||||
state.error_message = None
|
||||
self.last_updated = datetime.utcnow()
|
||||
|
||||
def add_record(self, record_name: str, record_type: str) -> None:
|
||||
"""Add a new record to track."""
|
||||
if record_name not in self.records:
|
||||
self.records[record_name] = RecordState(
|
||||
record_name=record_name,
|
||||
record_type=record_type,
|
||||
)
|
||||
self.total_count = len(self.records)
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
"""Convert to dictionary for JSON serialization."""
|
||||
return {
|
||||
"sport": self.sport,
|
||||
"season": self.season,
|
||||
"environment": self.environment,
|
||||
"started_at": self.started_at.isoformat(),
|
||||
"last_updated": self.last_updated.isoformat(),
|
||||
"total_count": self.total_count,
|
||||
"records": {
|
||||
name: state.to_dict()
|
||||
for name, state in self.records.items()
|
||||
},
|
||||
}
|
||||
|
||||
@classmethod
|
||||
def from_dict(cls, data: dict) -> "UploadSession":
|
||||
"""Create UploadSession from dictionary."""
|
||||
session = cls(
|
||||
sport=data["sport"],
|
||||
season=data["season"],
|
||||
environment=data["environment"],
|
||||
started_at=datetime.fromisoformat(data["started_at"]),
|
||||
last_updated=datetime.fromisoformat(data["last_updated"]),
|
||||
total_count=data.get("total_count", 0),
|
||||
)
|
||||
|
||||
for name, record_data in data.get("records", {}).items():
|
||||
session.records[name] = RecordState.from_dict(record_data)
|
||||
|
||||
return session
|
||||
|
||||
|
||||
class StateManager:
|
||||
"""Manages upload state persistence.
|
||||
|
||||
State files are stored in .parser_state/ with naming convention:
|
||||
upload_state_{sport}_{season}_{environment}.json
|
||||
"""
|
||||
|
||||
def __init__(self, state_dir: Optional[Path] = None):
|
||||
"""Initialize the state manager.
|
||||
|
||||
Args:
|
||||
state_dir: Directory for state files (default: .parser_state/)
|
||||
"""
|
||||
self.state_dir = state_dir or STATE_DIR
|
||||
self.state_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
def _get_state_file(self, sport: str, season: int, environment: str) -> Path:
|
||||
"""Get the path to a state file."""
|
||||
return self.state_dir / f"upload_state_{sport}_{season}_{environment}.json"
|
||||
|
||||
def load_session(
|
||||
self,
|
||||
sport: str,
|
||||
season: int,
|
||||
environment: str,
|
||||
) -> Optional[UploadSession]:
|
||||
"""Load an existing upload session.
|
||||
|
||||
Args:
|
||||
sport: Sport code
|
||||
season: Season start year
|
||||
environment: CloudKit environment
|
||||
|
||||
Returns:
|
||||
UploadSession if exists, None otherwise
|
||||
"""
|
||||
state_file = self._get_state_file(sport, season, environment)
|
||||
|
||||
if not state_file.exists():
|
||||
return None
|
||||
|
||||
try:
|
||||
with open(state_file, "r", encoding="utf-8") as f:
|
||||
data = json.load(f)
|
||||
return UploadSession.from_dict(data)
|
||||
        except (json.JSONDecodeError, KeyError, ValueError):  # ValueError: malformed ISO timestamp
|
||||
# Corrupted state file
|
||||
return None
|
||||
|
||||
def save_session(self, session: UploadSession) -> None:
|
||||
"""Save an upload session to disk.
|
||||
|
||||
Args:
|
||||
session: The session to save
|
||||
"""
|
||||
state_file = self._get_state_file(
|
||||
session.sport,
|
||||
session.season,
|
||||
session.environment,
|
||||
)
|
||||
|
||||
session.last_updated = datetime.utcnow()
|
||||
|
||||
with open(state_file, "w", encoding="utf-8") as f:
|
||||
json.dump(session.to_dict(), f, indent=2)
|
||||
|
||||
def create_session(
|
||||
self,
|
||||
sport: str,
|
||||
season: int,
|
||||
environment: str,
|
||||
record_names: list[tuple[str, str]], # (record_name, record_type)
|
||||
) -> UploadSession:
|
||||
"""Create a new upload session.
|
||||
|
||||
Args:
|
||||
sport: Sport code
|
||||
season: Season start year
|
||||
environment: CloudKit environment
|
||||
record_names: List of (record_name, record_type) tuples
|
||||
|
||||
Returns:
|
||||
New UploadSession
|
||||
"""
|
||||
session = UploadSession(
|
||||
sport=sport,
|
||||
season=season,
|
||||
environment=environment,
|
||||
)
|
||||
|
||||
for record_name, record_type in record_names:
|
||||
session.add_record(record_name, record_type)
|
||||
|
||||
self.save_session(session)
|
||||
return session
|
||||
|
||||
def delete_session(self, sport: str, season: int, environment: str) -> bool:
|
||||
"""Delete an upload session state file.
|
||||
|
||||
Args:
|
||||
sport: Sport code
|
||||
season: Season start year
|
||||
environment: CloudKit environment
|
||||
|
||||
Returns:
|
||||
True if deleted, False if not found
|
||||
"""
|
||||
state_file = self._get_state_file(sport, season, environment)
|
||||
|
||||
if state_file.exists():
|
||||
state_file.unlink()
|
||||
return True
|
||||
return False
|
||||
|
||||
def list_sessions(self) -> list[dict]:
|
||||
"""List all upload sessions.
|
||||
|
||||
Returns:
|
||||
List of session summaries
|
||||
"""
|
||||
sessions = []
|
||||
|
||||
for state_file in self.state_dir.glob("upload_state_*.json"):
|
||||
try:
|
||||
with open(state_file, "r", encoding="utf-8") as f:
|
||||
data = json.load(f)
|
||||
|
||||
session = UploadSession.from_dict(data)
|
||||
sessions.append({
|
||||
"sport": session.sport,
|
||||
"season": session.season,
|
||||
"environment": session.environment,
|
||||
"started_at": session.started_at.isoformat(),
|
||||
"last_updated": session.last_updated.isoformat(),
|
||||
"progress": f"{session.uploaded_count}/{session.total_count}",
|
||||
"progress_percent": f"{session.progress_percent:.1f}%",
|
||||
"status": "complete" if session.is_complete else "in_progress",
|
||||
"failed_count": session.failed_count,
|
||||
})
|
||||
            except (json.JSONDecodeError, KeyError, ValueError):  # skip unreadable state files
|
||||
continue
|
||||
|
||||
return sessions
|
||||
|
||||
def get_session_or_create(
|
||||
self,
|
||||
sport: str,
|
||||
season: int,
|
||||
environment: str,
|
||||
record_names: list[tuple[str, str]],
|
||||
resume: bool = False,
|
||||
) -> UploadSession:
|
||||
"""Get existing session or create new one.
|
||||
|
||||
Args:
|
||||
sport: Sport code
|
||||
season: Season start year
|
||||
environment: CloudKit environment
|
||||
record_names: List of (record_name, record_type) tuples
|
||||
resume: Whether to resume existing session
|
||||
|
||||
Returns:
|
||||
UploadSession (existing or new)
|
||||
"""
|
||||
if resume:
|
||||
existing = self.load_session(sport, season, environment)
|
||||
if existing:
|
||||
# Add any new records not in existing session
|
||||
existing_names = set(existing.records.keys())
|
||||
for record_name, record_type in record_names:
|
||||
if record_name not in existing_names:
|
||||
existing.add_record(record_name, record_type)
|
||||
return existing
|
||||
|
||||
# Create new session (overwrites existing)
|
||||
return self.create_session(sport, season, environment, record_names)
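The resume behavior of the state manager above can be illustrated with a much smaller stand-in. This is a sketch under stated assumptions: `load_or_create` is a hypothetical helper that stores only a `{record_name: status}` mapping, the file name merely imitates the naming convention, and unlike `get_session_or_create` it writes the state back to disk on resume as well.

```python
import json
import tempfile
from pathlib import Path


def load_or_create(path: Path, record_names: list[str], resume: bool) -> dict:
    """Return {record_name: status}, reusing earlier statuses when resuming."""
    state: dict[str, str] = {}
    if resume and path.exists():
        state = json.loads(path.read_text(encoding="utf-8"))
    # Records not seen before start out as pending; known ones keep their status.
    for name in record_names:
        state.setdefault(name, "pending")
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return state


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name following the upload_state_{sport}_{season}_{environment} pattern.
    state_file = Path(tmp) / "upload_state_mlb_2025_development.json"

    # First run: two records, one of which uploads before an interruption.
    first = load_or_create(state_file, ["g1", "g2"], resume=False)
    first["g1"] = "uploaded"
    state_file.write_text(json.dumps(first), encoding="utf-8")

    # Second run with resume=True: g1 is skipped, g2 and the new g3 remain pending.
    resumed = load_or_create(state_file, ["g1", "g2", "g3"], resume=True)
    pending = sorted(n for n, s in resumed.items() if s == "pending")
    print(pending)
```

Passing `resume=False` on the second run would discard the earlier statuses and mark every record pending again, which matches the "overwrites existing" comment in the code above.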
|
||||
@@ -0,0 +1,58 @@
|
||||
"""Utility modules for sportstime-parser."""
|
||||
|
||||
from .logging import (
|
||||
get_console,
|
||||
get_logger,
|
||||
is_verbose,
|
||||
log_error,
|
||||
log_failure,
|
||||
log_game,
|
||||
log_stadium,
|
||||
log_success,
|
||||
log_team,
|
||||
log_warning,
|
||||
set_verbose,
|
||||
)
|
||||
from .http import (
|
||||
RateLimitedSession,
|
||||
get_session,
|
||||
fetch_url,
|
||||
fetch_json,
|
||||
fetch_html,
|
||||
)
|
||||
from .progress import (
|
||||
create_progress,
|
||||
create_spinner_progress,
|
||||
progress_bar,
|
||||
track_progress,
|
||||
ProgressTracker,
|
||||
ScrapeProgress,
|
||||
)
|
||||
|
||||
__all__ = [
|
||||
# Logging
|
||||
"get_console",
|
||||
"get_logger",
|
||||
"is_verbose",
|
||||
"log_error",
|
||||
"log_failure",
|
||||
"log_game",
|
||||
"log_stadium",
|
||||
"log_success",
|
||||
"log_team",
|
||||
"log_warning",
|
||||
"set_verbose",
|
||||
# HTTP
|
||||
"RateLimitedSession",
|
||||
"get_session",
|
||||
"fetch_url",
|
||||
"fetch_json",
|
||||
"fetch_html",
|
||||
# Progress
|
||||
"create_progress",
|
||||
"create_spinner_progress",
|
||||
"progress_bar",
|
||||
"track_progress",
|
||||
"ProgressTracker",
|
||||
"ScrapeProgress",
|
||||
]
|
||||
@@ -0,0 +1,276 @@
|
||||
"""HTTP utilities with rate limiting and exponential backoff."""
|
||||
|
||||
import random
|
||||
import time
|
||||
from typing import Optional
|
||||
from urllib.parse import urlparse
|
||||
|
||||
import requests
|
||||
from requests.adapters import HTTPAdapter
|
||||
from urllib3.util.retry import Retry
|
||||
|
||||
from ..config import (
|
||||
DEFAULT_REQUEST_DELAY,
|
||||
MAX_RETRIES,
|
||||
BACKOFF_FACTOR,
|
||||
INITIAL_BACKOFF,
|
||||
)
|
||||
from .logging import get_logger, log_warning
|
||||
|
||||
|
||||
# User agents for rotation to avoid blocks
|
||||
USER_AGENTS = [
|
||||
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
|
||||
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
|
||||
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
|
||||
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
|
||||
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0",
|
||||
]
|
||||
|
||||
|
||||
class RateLimitedSession:
|
||||
"""HTTP session with rate limiting and exponential backoff.
|
||||
|
||||
Features:
|
||||
- Configurable delay between requests
|
||||
- Automatic 429 detection with exponential backoff
|
||||
- User-agent rotation
|
||||
- Connection pooling
|
||||
- Automatic retries for transient errors
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
delay: float = DEFAULT_REQUEST_DELAY,
|
||||
max_retries: int = MAX_RETRIES,
|
||||
backoff_factor: float = BACKOFF_FACTOR,
|
||||
initial_backoff: float = INITIAL_BACKOFF,
|
||||
):
|
||||
"""Initialize the rate-limited session.
|
||||
|
||||
Args:
|
||||
delay: Minimum delay between requests in seconds
|
||||
max_retries: Maximum number of retry attempts
|
||||
backoff_factor: Multiplier for exponential backoff
|
||||
initial_backoff: Initial backoff duration in seconds
|
||||
"""
|
||||
self.delay = delay
|
||||
self.max_retries = max_retries
|
||||
self.backoff_factor = backoff_factor
|
||||
self.initial_backoff = initial_backoff
|
||||
self.last_request_time: float = 0.0
|
||||
self._domain_delays: dict[str, float] = {}
|
||||
|
||||
# Create session with retry adapter
|
||||
self.session = requests.Session()
|
||||
|
||||
        # Configure transport-level retries for connection errors and 5xx responses (GET/HEAD only)
|
||||
retry_strategy = Retry(
|
||||
total=max_retries,
|
||||
backoff_factor=0.5,
|
||||
status_forcelist=[500, 502, 503, 504],
|
||||
allowed_methods=["GET", "HEAD"],
|
||||
)
|
||||
adapter = HTTPAdapter(max_retries=retry_strategy, pool_maxsize=10)
|
||||
self.session.mount("http://", adapter)
|
||||
self.session.mount("https://", adapter)
|
||||
|
||||
self._logger = get_logger()
|
||||
|
||||
def _get_user_agent(self) -> str:
|
||||
"""Get a random user agent."""
|
||||
return random.choice(USER_AGENTS)
|
||||
|
||||
def _get_domain(self, url: str) -> str:
|
||||
"""Extract domain from URL."""
|
||||
parsed = urlparse(url)
|
||||
return parsed.netloc
|
||||
|
||||
def _wait_for_rate_limit(self, url: str) -> None:
|
||||
"""Wait to respect rate limiting."""
|
||||
domain = self._get_domain(url)
|
||||
|
||||
# Get domain-specific delay (if 429 was received)
|
||||
domain_delay = self._domain_delays.get(domain, 0.0)
|
||||
effective_delay = max(self.delay, domain_delay)
|
||||
|
||||
elapsed = time.time() - self.last_request_time
|
||||
if elapsed < effective_delay:
|
||||
sleep_time = effective_delay - elapsed
|
||||
self._logger.debug(f"Rate limiting: sleeping {sleep_time:.2f}s")
|
||||
time.sleep(sleep_time)
|
||||
|
||||
def _handle_429(self, url: str, attempt: int) -> float:
|
||||
"""Handle 429 Too Many Requests with exponential backoff.
|
||||
|
||||
Returns the backoff duration in seconds.
|
||||
"""
|
||||
domain = self._get_domain(url)
|
||||
backoff = self.initial_backoff * (self.backoff_factor ** attempt)
|
||||
|
||||
# Add jitter to prevent thundering herd
|
||||
backoff += random.uniform(0, 1)
|
||||
|
||||
# Update domain-specific delay
|
||||
self._domain_delays[domain] = min(backoff * 2, 60.0) # Cap at 60s
|
||||
|
||||
log_warning(f"Rate limited (429) for {domain}, backing off {backoff:.1f}s")
|
||||
|
||||
return backoff
|
||||
|
||||
def get(
|
||||
self,
|
||||
url: str,
|
||||
headers: Optional[dict] = None,
|
||||
params: Optional[dict] = None,
|
||||
timeout: float = 30.0,
|
||||
) -> requests.Response:
|
||||
"""Make a rate-limited GET request with automatic retries.
|
||||
|
||||
Args:
|
||||
url: URL to fetch
|
||||
headers: Additional headers to include
|
||||
params: Query parameters
|
||||
timeout: Request timeout in seconds
|
||||
|
||||
Returns:
|
||||
Response object
|
||||
|
||||
Raises:
|
||||
requests.RequestException: If all retries fail
|
||||
"""
|
||||
# Prepare headers with user agent
|
||||
request_headers = {"User-Agent": self._get_user_agent()}
|
||||
if headers:
|
||||
request_headers.update(headers)
|
||||
|
||||
last_exception: Optional[Exception] = None
|
||||
|
||||
for attempt in range(self.max_retries + 1):
|
||||
try:
|
||||
# Wait for rate limit
|
||||
self._wait_for_rate_limit(url)
|
||||
|
||||
# Make request
|
||||
self.last_request_time = time.time()
|
||||
response = self.session.get(
|
||||
url,
|
||||
headers=request_headers,
|
||||
params=params,
|
||||
timeout=timeout,
|
||||
)
|
||||
|
||||
# Handle 429
|
||||
if response.status_code == 429:
|
||||
if attempt < self.max_retries:
|
||||
backoff = self._handle_429(url, attempt)
|
||||
time.sleep(backoff)
|
||||
continue
|
||||
else:
|
||||
response.raise_for_status()
|
||||
|
||||
# Return successful response
|
||||
return response
|
||||
|
||||
except requests.RequestException as e:
|
||||
last_exception = e
|
||||
if attempt < self.max_retries:
|
||||
backoff = self.initial_backoff * (self.backoff_factor ** attempt)
|
||||
self._logger.warning(
|
||||
f"Request failed (attempt {attempt + 1}): {e}, retrying in {backoff:.1f}s"
|
||||
)
|
||||
time.sleep(backoff)
|
||||
else:
|
||||
raise
|
||||
|
||||
# Should not reach here, but just in case
|
||||
if last_exception:
|
||||
raise last_exception
|
||||
|
||||
raise requests.RequestException("Max retries exceeded")
|
||||
|
||||
def get_json(
|
||||
self,
|
||||
url: str,
|
||||
headers: Optional[dict] = None,
|
||||
params: Optional[dict] = None,
|
||||
timeout: float = 30.0,
|
||||
) -> dict:
|
||||
"""Make a rate-limited GET request and parse JSON response.
|
||||
|
||||
Args:
|
||||
url: URL to fetch
|
||||
headers: Additional headers to include
|
||||
params: Query parameters
|
||||
timeout: Request timeout in seconds
|
||||
|
||||
Returns:
|
||||
Parsed JSON as dictionary
|
||||
|
||||
Raises:
|
||||
requests.RequestException: If request fails
|
||||
ValueError: If response is not valid JSON
|
||||
"""
|
||||
response = self.get(url, headers=headers, params=params, timeout=timeout)
|
||||
response.raise_for_status()
|
||||
return response.json()
|
||||
|
||||
def get_html(
|
||||
self,
|
||||
url: str,
|
||||
headers: Optional[dict] = None,
|
||||
params: Optional[dict] = None,
|
||||
timeout: float = 30.0,
|
||||
) -> str:
|
||||
"""Make a rate-limited GET request and return HTML text.
|
||||
|
||||
Args:
|
||||
url: URL to fetch
|
||||
headers: Additional headers to include
|
||||
params: Query parameters
|
||||
timeout: Request timeout in seconds
|
||||
|
||||
Returns:
|
||||
HTML text content
|
||||
|
||||
Raises:
|
||||
requests.RequestException: If request fails
|
||||
"""
|
||||
response = self.get(url, headers=headers, params=params, timeout=timeout)
|
||||
response.raise_for_status()
|
||||
return response.text
|
||||
|
||||
def reset_domain_delays(self) -> None:
|
||||
"""Reset domain-specific delays (e.g., after a long pause)."""
|
||||
self._domain_delays.clear()
|
||||
|
||||
def close(self) -> None:
|
||||
"""Close the session and release resources."""
|
||||
self.session.close()
|
||||
|
||||
|
||||
# Global session instance (lazy initialized)
|
||||
_global_session: Optional[RateLimitedSession] = None
|
||||
|
||||
|
||||
def get_session() -> RateLimitedSession:
|
||||
"""Get the global rate-limited session instance."""
|
||||
global _global_session
|
||||
if _global_session is None:
|
||||
_global_session = RateLimitedSession()
|
||||
return _global_session
|
||||
|
||||
|
||||
def fetch_url(url: str, **kwargs) -> requests.Response:
|
||||
"""Convenience function to fetch a URL with rate limiting."""
|
||||
return get_session().get(url, **kwargs)
|
||||
|
||||
|
||||
def fetch_json(url: str, **kwargs) -> dict:
|
||||
"""Convenience function to fetch JSON with rate limiting."""
|
||||
return get_session().get_json(url, **kwargs)
|
||||
|
||||
|
||||
def fetch_html(url: str, **kwargs) -> str:
|
||||
"""Convenience function to fetch HTML with rate limiting."""
|
||||
return get_session().get_html(url, **kwargs)
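The 429 handling above grows the wait exponentially, adds up to one second of jitter, and remembers a capped per-domain delay. The sketch below reproduces only that arithmetic. `backoff_schedule` is a hypothetical helper, and the values 5.0, 2.0, and 3 are illustrative assumptions, since the real `INITIAL_BACKOFF`, `BACKOFF_FACTOR`, and `MAX_RETRIES` live in the config module, which is not shown here.

```python
import random
from typing import Optional


def backoff_schedule(
    initial: float,
    factor: float,
    max_retries: int,
    jitter: float = 1.0,
    rng: Optional[random.Random] = None,
) -> list[float]:
    """Backoff in seconds for attempts 0..max_retries-1, with additive jitter."""
    rng = rng or random.Random()
    return [
        initial * (factor ** attempt) + rng.uniform(0, jitter)
        for attempt in range(max_retries)
    ]


# Assumed settings for illustration only.
no_jitter = backoff_schedule(initial=5.0, factor=2.0, max_retries=3, jitter=0.0)

# The per-domain delay is twice the backoff, capped at 60 seconds.
domain_delays = [min(b * 2, 60.0) for b in no_jitter]

with_jitter = backoff_schedule(initial=5.0, factor=2.0, max_retries=3)
print(no_jitter, domain_delays, with_jitter)
```

The jitter spreads out retries from clients that were rate limited at the same moment, and the cap keeps a long run of 429 responses from slowing later requests to that domain indefinitely.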
|
||||
@@ -0,0 +1,149 @@
|
||||
"""Logging infrastructure for sportstime-parser."""
|
||||
|
||||
import logging
|
||||
import sys
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
from rich.console import Console
|
||||
from rich.logging import RichHandler
|
||||
|
||||
from ..config import SCRIPTS_DIR
|
||||
|
||||
# Module-level state
|
||||
_logger: Optional[logging.Logger] = None
|
||||
_verbose: bool = False
|
||||
_console: Optional[Console] = None
|
||||
|
||||
|
||||
def get_console() -> Console:
|
||||
"""Get the shared Rich console instance."""
|
||||
global _console
|
||||
if _console is None:
|
||||
_console = Console()
|
||||
return _console
|
||||
|
||||
|
||||
def set_verbose(verbose: bool) -> None:
|
||||
"""Set verbose mode globally."""
|
||||
global _verbose
|
||||
_verbose = verbose
|
||||
|
||||
logger = get_logger()
|
||||
if verbose:
|
||||
logger.setLevel(logging.DEBUG)
|
||||
else:
|
||||
logger.setLevel(logging.INFO)
|
||||
|
||||
|
||||
def is_verbose() -> bool:
|
||||
"""Check if verbose mode is enabled."""
|
||||
return _verbose
|
||||
|
||||
|
||||
def get_logger() -> logging.Logger:
|
||||
"""Get or create the application logger."""
|
||||
global _logger
|
||||
|
||||
if _logger is not None:
|
||||
return _logger
|
||||
|
||||
_logger = logging.getLogger("sportstime_parser")
|
||||
_logger.setLevel(logging.INFO)
|
||||
|
||||
# Prevent propagation to root logger
|
||||
_logger.propagate = False
|
||||
|
||||
# Clear any existing handlers
|
||||
_logger.handlers.clear()
|
||||
|
||||
# Console handler with Rich formatting
|
||||
console_handler = RichHandler(
|
||||
console=get_console(),
|
||||
show_time=True,
|
||||
show_path=False,
|
||||
rich_tracebacks=True,
|
||||
tracebacks_show_locals=True,
|
||||
markup=True,
|
||||
)
|
||||
console_handler.setLevel(logging.DEBUG)
|
||||
console_format = logging.Formatter("%(message)s")
|
||||
console_handler.setFormatter(console_format)
|
||||
_logger.addHandler(console_handler)
|
||||
|
||||
# File handler for persistent logs
|
||||
log_dir = SCRIPTS_DIR / "logs"
|
||||
    log_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
log_file = log_dir / f"parser_{timestamp}.log"
|
||||
|
||||
file_handler = logging.FileHandler(log_file, encoding="utf-8")
|
||||
file_handler.setLevel(logging.DEBUG)
|
||||
file_format = logging.Formatter(
|
||||
"%(asctime)s | %(levelname)-8s | %(message)s",
|
||||
datefmt="%Y-%m-%d %H:%M:%S",
|
||||
)
|
||||
file_handler.setFormatter(file_format)
|
||||
_logger.addHandler(file_handler)
|
||||
|
||||
return _logger
|
||||
|
||||
|
||||
def log_game(
|
||||
sport: str,
|
||||
game_id: str,
|
||||
home: str,
|
||||
away: str,
|
||||
date: str,
|
||||
status: str = "parsed",
|
||||
) -> None:
|
||||
"""Log a game being processed (only in verbose mode)."""
|
||||
if not is_verbose():
|
||||
return
|
||||
|
||||
logger = get_logger()
|
||||
logger.debug(f"[{sport.upper()}] {game_id}: {away} @ {home} ({date}) - {status}")
|
||||
|
||||
|
||||
def log_team(sport: str, team_id: str, name: str, status: str = "resolved") -> None:
|
||||
"""Log a team being processed (only in verbose mode)."""
|
||||
if not is_verbose():
|
||||
return
|
||||
|
||||
logger = get_logger()
|
||||
logger.debug(f"[{sport.upper()}] Team: {name} -> {team_id} ({status})")
|
||||
|
||||
|
||||
def log_stadium(sport: str, stadium_id: str, name: str, status: str = "resolved") -> None:
|
||||
"""Log a stadium being processed (only in verbose mode)."""
|
||||
if not is_verbose():
|
||||
return
|
||||
|
||||
logger = get_logger()
|
||||
logger.debug(f"[{sport.upper()}] Stadium: {name} -> {stadium_id} ({status})")
|
||||
|
||||
|
||||
def log_error(message: str, exc_info: bool = False) -> None:
|
||||
"""Log an error message."""
|
||||
logger = get_logger()
|
||||
logger.error(message, exc_info=exc_info)
|
||||
|
||||
|
||||
def log_warning(message: str) -> None:
|
||||
"""Log a warning message."""
|
||||
logger = get_logger()
|
||||
logger.warning(message)
|
||||
|
||||
|
||||
def log_success(message: str) -> None:
|
||||
"""Log a success message with green formatting."""
|
||||
logger = get_logger()
|
||||
logger.info(f"[green]✓[/green] {message}")
|
||||
|
||||
|
||||
def log_failure(message: str) -> None:
|
||||
"""Log a failure message with red formatting."""
|
||||
logger = get_logger()
|
||||
logger.info(f"[red]✗[/red] {message}")
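In the logging module above, both handlers accept DEBUG records, so the logger's own level is what `set_verbose` uses to decide whether debug output appears. The standard-library sketch below shows that gating; `ListHandler` and the logger name `sportstime_parser_demo` are hypothetical, and Rich is left out so the example runs without extra dependencies.

```python
import logging


class ListHandler(logging.Handler):
    """Collects formatted messages in memory instead of printing them."""

    def __init__(self) -> None:
        super().__init__(level=logging.DEBUG)  # handler itself accepts everything
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


logger = logging.getLogger("sportstime_parser_demo")
logger.propagate = False  # keep records away from the root logger
logger.handlers.clear()
handler = ListHandler()
logger.addHandler(handler)

logger.setLevel(logging.INFO)   # non-verbose: debug records are dropped by the logger
logger.debug("hidden")
logger.setLevel(logging.DEBUG)  # verbose: the same call now reaches the handler
logger.debug("shown")

print(handler.messages)
```

Because the level check happens on the logger, switching verbosity at runtime needs no changes to the handlers themselves.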
|
||||
@@ -0,0 +1,360 @@
"""Progress utilities using Rich for visual feedback."""

from contextlib import contextmanager
from typing import Generator, Iterable, Optional, TypeVar

from rich.progress import (
    Progress,
    SpinnerColumn,
    TextColumn,
    BarColumn,
    TaskProgressColumn,
    TimeElapsedColumn,
    TimeRemainingColumn,
    MofNCompleteColumn,
)
from rich.console import Console

from .logging import get_console

T = TypeVar("T")


def create_progress() -> Progress:
    """Create a Rich progress bar with standard columns."""
    return Progress(
        SpinnerColumn(),
        TextColumn("[bold blue]{task.description}"),
        BarColumn(bar_width=40),
        TaskProgressColumn(),
        MofNCompleteColumn(),
        TimeElapsedColumn(),
        TimeRemainingColumn(),
        console=get_console(),
        transient=False,
    )


def create_spinner_progress() -> Progress:
    """Create a Rich progress bar with spinner only (for indeterminate tasks)."""
    return Progress(
        SpinnerColumn(),
        TextColumn("[bold blue]{task.description}"),
        TimeElapsedColumn(),
        console=get_console(),
        transient=True,
    )


@contextmanager
def progress_bar(
    description: str,
    total: Optional[int] = None,
) -> Generator[tuple[Progress, int], None, None]:
    """Context manager for a progress bar.

    Args:
        description: Task description to display
        total: Total number of items (None for indeterminate)

    Yields:
        Tuple of (Progress instance, task_id)

    Example:
        with progress_bar("Scraping games", total=100) as (progress, task):
            for item in items:
                process(item)
                progress.advance(task)
    """
    if total is None:
        progress = create_spinner_progress()
    else:
        progress = create_progress()

    with progress:
        task_id = progress.add_task(description, total=total)
        yield progress, task_id


def track_progress(
    iterable: Iterable[T],
    description: str,
    total: Optional[int] = None,
) -> Generator[T, None, None]:
    """Wrap an iterable with a progress bar.

    Args:
        iterable: Items to iterate over
        description: Task description to display
        total: Total number of items (auto-detected if iterable has len)

    Yields:
        Items from the iterable

    Example:
        for game in track_progress(games, "Processing games"):
            process(game)
    """
    # Try to get length if not provided
    if total is None:
        try:
            total = len(iterable)  # type: ignore
        except TypeError:
            pass

    if total is None:
        # Indeterminate progress
        progress = create_spinner_progress()
        with progress:
            task_id = progress.add_task(description, total=None)
            for item in iterable:
                yield item
                progress.update(task_id, advance=1)
    else:
        # Determinate progress
        progress = create_progress()
        with progress:
            task_id = progress.add_task(description, total=total)
            for item in iterable:
                yield item
                progress.advance(task_id)
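`track_progress` picks determinate or indeterminate mode depending on whether `len()` works on the iterable. The sketch below isolates that decision without Rich. `CountingReporter` and `track` are hypothetical stand-ins invented here, not part of the package; they only record what a real progress display would be told.

```python
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


class CountingReporter:
    """Stand-in for a progress display; records the total and the steps taken."""

    def __init__(self, total: Optional[int]) -> None:
        self.total = total
        self.completed = 0

    def advance(self) -> None:
        self.completed += 1


def track(iterable: Iterable[T], reporters: list, total: Optional[int] = None) -> Iterator[T]:
    """Yield items while counting them; detect the total when possible."""
    if total is None:
        try:
            total = len(iterable)  # type: ignore[arg-type]
        except TypeError:
            total = None  # e.g. a generator: fall back to indeterminate mode
    reporter = CountingReporter(total)
    reporters.append(reporter)
    for item in iterable:
        yield item
        reporter.advance()  # runs only after the caller is done with the item


sized, unsized = [], []
list_items = list(track(["a", "b", "c"], sized))
gen_items = list(track((n for n in range(2)), unsized))
```

Because the step is recorded after `yield`, an item counts as completed only once the caller's loop body has finished with it.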


class ProgressTracker:
    """Track progress across multiple phases with nested tasks.

    Example:
        tracker = ProgressTracker()
        tracker.start("Scraping NBA")

        with tracker.task("Fetching schedule", total=12) as advance:
            for month in months:
                fetch(month)
                advance()

        with tracker.task("Parsing games", total=1230) as advance:
            for game in games:
                parse(game)
                advance()

        tracker.finish("Completed NBA scrape")
    """

    def __init__(self):
        """Initialize the progress tracker."""
        self._console = get_console()
        self._current_progress: Optional[Progress] = None
        self._current_task: Optional[int] = None

    def start(self, message: str) -> None:
        """Start a new tracking session with a message."""
        self._console.print(f"\n[bold cyan]>>> {message}[/bold cyan]")

    def finish(self, message: str) -> None:
        """Finish the tracking session with a message."""
        self._console.print(f"[bold green]<<< {message}[/bold green]\n")

    @contextmanager
    def task(
        self,
        description: str,
        total: Optional[int] = None,
    ) -> Generator[callable, None, None]:
        """Context manager for a tracked task.

        Args:
            description: Task description
            total: Total items (None for indeterminate)

        Yields:
            Callable to advance the progress

        Example:
            with tracker.task("Processing", total=100) as advance:
                for item in items:
                    process(item)
                    advance()
        """
        with progress_bar(description, total) as (progress, task_id):
            self._current_progress = progress
            self._current_task = task_id

            def advance(amount: int = 1) -> None:
                progress.advance(task_id, advance=amount)

            try:
                yield advance
            finally:
                # Reset state even if the with-body raises, so log() does not
                # keep printing through a progress display that has stopped.
                self._current_progress = None
                self._current_task = None

    def log(self, message: str) -> None:
        """Log a message (will be displayed above progress bar if active)."""
        if self._current_progress:
            self._current_progress.console.print(f" {message}")
        else:
            self._console.print(f" {message}")
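In a `@contextmanager` generator, statements placed after `yield` are skipped when the `with` body raises unless they sit in a `finally` clause. The sketch below shows that behaviour for the kind of "current task" bookkeeping `ProgressTracker.task` does. The `Tracker` class here is a hypothetical, Rich-free stand-in written only for this illustration.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, Optional


class Tracker:
    """Hypothetical stand-in that only tracks which task is active."""

    def __init__(self) -> None:
        self.current: Optional[str] = None
        self.completed = 0

    @contextmanager
    def task(self, description: str) -> Iterator[Callable[..., None]]:
        self.current = description

        def advance(amount: int = 1) -> None:
            self.completed += amount

        try:
            yield advance
        finally:
            # Runs on normal exit and when the with-body raises.
            self.current = None


tracker = Tracker()
try:
    with tracker.task("Parsing games") as advance:
        advance()
        raise ValueError("bad row")
except ValueError:
    pass
```

After the failed block, `tracker.current` is cleared again, so later log calls would not be routed to a task that is no longer running.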


class ScrapeProgress:
    """Specialized progress tracker for scraping operations.

    Tracks counts of games, teams, stadiums scraped and provides
    formatted status updates.
    """

    def __init__(self, sport: str, season: int):
        """Initialize scrape progress for a sport.

        Args:
            sport: Sport code (e.g., 'nba')
            season: Season start year
        """
        self.sport = sport
        self.season = season
        self.games_count = 0
        self.teams_count = 0
        self.stadiums_count = 0
        self.errors_count = 0
        self._tracker = ProgressTracker()

    def start(self) -> None:
        """Start the scraping session."""
        self._tracker.start(
            f"Scraping {self.sport.upper()} {self.season}-{self.season + 1}"
        )

    def finish(self) -> None:
        """Finish the scraping session with summary."""
        summary = (
            f"Scraped {self.games_count} games, "
            f"{self.teams_count} teams, "
            f"{self.stadiums_count} stadiums"
        )
        if self.errors_count > 0:
            summary += f" ({self.errors_count} errors)"
        self._tracker.finish(summary)

    @contextmanager
    def scraping_schedule(
        self,
        total_months: Optional[int] = None,
    ) -> Generator[callable, None, None]:
        """Track schedule scraping progress."""
        with self._tracker.task(
            f"Fetching {self.sport.upper()} schedule",
            total=total_months,
        ) as advance:
            yield advance

    @contextmanager
    def parsing_games(
        self,
        total_games: Optional[int] = None,
    ) -> Generator[callable, None, None]:
        """Track game parsing progress."""
        with self._tracker.task(
            "Parsing games",
            total=total_games,
        ) as advance:

            def advance_and_count(amount: int = 1) -> None:
                self.games_count += amount
                advance(amount)

            yield advance_and_count

    @contextmanager
    def resolving_teams(
        self,
        total_teams: Optional[int] = None,
    ) -> Generator[callable, None, None]:
        """Track team resolution progress."""
        with self._tracker.task(
            "Resolving teams",
            total=total_teams,
        ) as advance:

            def advance_and_count(amount: int = 1) -> None:
                self.teams_count += amount
                advance(amount)

            yield advance_and_count

    @contextmanager
    def resolving_stadiums(
        self,
        total_stadiums: Optional[int] = None,
    ) -> Generator[callable, None, None]:
        """Track stadium resolution progress."""
        with self._tracker.task(
            "Resolving stadiums",
            total=total_stadiums,
        ) as advance:

            def advance_and_count(amount: int = 1) -> None:
                self.stadiums_count += amount
                advance(amount)

            yield advance_and_count

    def log_error(self, message: str) -> None:
        """Log an error during scraping."""
        self.errors_count += 1
        self._tracker.log(f"[red]Error: {message}[/red]")

    def log_warning(self, message: str) -> None:
        """Log a warning during scraping."""
        self._tracker.log(f"[yellow]Warning: {message}[/yellow]")

    def log_info(self, message: str) -> None:
        """Log an info message during scraping."""
        self._tracker.log(message)
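`ScrapeProgress` wraps the tracker's `advance` callable in an `advance_and_count` closure so that one call both moves the bar and updates a running total. Here is a minimal sketch of that wrapping, assuming nothing about Rich. The `Counts` class, `_task`, and `bar_position` are hypothetical names used only in this example.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


class Counts:
    """Hypothetical stand-in for ScrapeProgress with a single counter."""

    def __init__(self) -> None:
        self.games = 0
        self.bar_position = 0  # stands in for the progress task's completed value

    @contextmanager
    def _task(self) -> Iterator[Callable[..., None]]:
        def advance(amount: int = 1) -> None:
            self.bar_position += amount

        yield advance

    @contextmanager
    def parsing_games(self) -> Iterator[Callable[..., None]]:
        with self._task() as advance:

            def advance_and_count(amount: int = 1) -> None:
                # Update the summary counter, then delegate to the bar.
                self.games += amount
                advance(amount)

            yield advance_and_count


counts = Counts()
with counts.parsing_games() as step:
    for _ in range(3):
        step()
    step(2)  # a batch of two games at once
```

Keeping the counter inside the closure means callers cannot advance the bar without also updating the total reported in the final summary.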


class SimpleProgressBar:
    """Simple progress bar wrapper for batch operations.

    Example:
        with create_progress_bar(total=100, description="Uploading") as progress:
            for item in items:
                upload(item)
                progress.advance()
    """

    def __init__(self, progress: Progress, task_id: int):
        self._progress = progress
        self._task_id = task_id

    def advance(self, amount: int = 1) -> None:
        """Advance the progress bar."""
        self._progress.advance(self._task_id, advance=amount)

    def update(self, completed: int) -> None:
        """Set the progress to a specific value."""
        self._progress.update(self._task_id, completed=completed)


@contextmanager
def create_progress_bar(
    total: int,
    description: str = "Progress",
) -> Generator[SimpleProgressBar, None, None]:
    """Create a simple progress bar for batch operations.

    Args:
        total: Total number of items
        description: Task description

    Yields:
        SimpleProgressBar with advance() and update() methods

    Example:
        with create_progress_bar(total=100, description="Uploading") as progress:
            for item in items:
                upload(item)
                progress.advance()
    """
    progress = create_progress()
    with progress:
        task_id = progress.add_task(description, total=total)
        yield SimpleProgressBar(progress, task_id)