feat(scripts): complete data pipeline remediation

Scripts changes:
- Add WNBA abbreviation aliases to team_resolver.py
- Fix NHL stadium coordinates in stadium_resolver.py
- Add validate_aliases.py script for orphan detection
- Update scrapers with improved error handling
- Add DATA_AUDIT.md and REMEDIATION_PLAN.md documentation
- Update alias JSON files with new mappings

iOS bundle updates:
- Update games_canonical.json with latest scraped data
- Update teams_canonical.json and stadiums_canonical.json
- Sync alias files with Scripts versions

All 5 remediation phases complete.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
Trey t
2026-01-20 18:58:47 -06:00
parent 51419fccf2
commit 8ea3e6112a
21 changed files with 56592 additions and 35714 deletions

View File

@@ -19,7 +19,9 @@
"Bash(python -m py_compile:*)",
"WebFetch(domain:en.wikipedia.org)",
"Bash(tree:*)",
"Bash(python:*)"
"Bash(python:*)",
"Bash(grep:*)",
"Bash(xargs cat:*)"
]
}
}

49
Scripts/.gitignore vendored Normal file
View File

@@ -0,0 +1,49 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
# Virtual environments
venv/
ENV/
env/
.venv/
# IDE
.idea/
.vscode/
*.swp
*.swo
# Output and logs
output/
logs/
*.log
# Secrets
*.pem
.env
.env.*
# Parser state
.parser_state/
# Claude Code
.claude/

805
Scripts/docs/DATA_AUDIT.md Normal file
View File

@@ -0,0 +1,805 @@
# SportsTime Data Audit Report
**Generated:** 2026-01-20
**Scope:** NBA, MLB, NFL, NHL, MLS, WNBA, NWSL
**Data Pipeline:** Scripts → CloudKit → iOS App
---
## Executive Summary
The data audit identified **15 issues** across the SportsTime data pipeline, with significant gaps in source reliability, stadium resolution, and iOS data freshness.
| Severity | Count | Description |
|----------|-------|-------------|
| **Critical** | 1 | iOS bundled data severely outdated |
| **High** | 4 | Single-source sports, NHL stadium data, NBA naming rights |
| **Medium** | 6 | Alias gaps, outdated config, silent game exclusion |
| **Low** | 4 | Minor configuration and coverage issues |
### Key Findings
**Data Pipeline Health:**
-**Canonical ID system**: 100% format compliance across 7,186 IDs
-**Team mappings**: All 183 teams correctly mapped with current abbreviations
-**Referential integrity**: Zero orphan references (0 games pointing to non-existent teams/stadiums)
- ⚠️ **Stadium resolution**: 1,466 games (21.6%) have unresolved stadiums
**Critical Risks:**
1. **ESPN single-point-of-failure** for WNBA, NWSL, MLS - if ESPN changes, 3 sports lose all data
2. **NHL has 100% missing stadiums** - Hockey Reference provides no venue data
3. **iOS bundled data 27% behind** - 1,820 games missing from first-launch experience
**Root Causes:**
- Stadium naming rights changed faster than alias updates (2024-2025)
- Fallback source limit (`max_sources_to_try = 2`) prevents third source from being tried
- Hockey Reference source limitation (no venue info) combined with fallback limit
- iOS bundled JSON not updated with latest pipeline output
---
## Phase Status Tracking
| Phase | Status | Issues Found |
|-------|--------|--------------|
| 1. Hardcoded Mapping Audit | ✅ COMPLETE | 1 Low |
| 2. Alias File Completeness | ✅ COMPLETE | 1 Medium, 1 Low |
| 3. Scraper Source Reliability | ✅ COMPLETE | 2 High, 1 Medium |
| 4. Game Count & Coverage | ✅ COMPLETE | 2 High, 2 Medium, 1 Low |
| 5. Canonical ID Consistency | ✅ COMPLETE | 0 issues |
| 6. Referential Integrity | ✅ COMPLETE | 1 Medium (NHL source) |
| 7. iOS Data Reception | ✅ COMPLETE | 1 Critical, 1 Medium, 1 Low |
---
## Phase 1 Results: Hardcoded Mapping Audit
**Files Audited:**
- `sportstime_parser/normalizers/team_resolver.py` (TEAM_MAPPINGS)
- `sportstime_parser/normalizers/stadium_resolver.py` (STADIUM_MAPPINGS)
### Team Counts
| Sport | Hardcoded | Expected | Abbreviations | Status |
|-------|-----------|----------|---------------|--------|
| NBA | 30 | 30 | 38 | ✅ |
| MLB | 30 | 30 | 38 | ✅ |
| NFL | 32 | 32 | 40 | ✅ |
| NHL | 32 | 32 | 41 | ✅ |
| MLS | 30 | 30* | 32 | ✅ |
| WNBA | 13 | 13 | 13 | ✅ |
| NWSL | 16 | 16 | 24 | ✅ |
*MLS: 29 original teams + San Diego FC (2025 expansion) = 30
### Stadium Counts
| Sport | Hardcoded | Notes | Status |
|-------|-----------|-------|--------|
| NBA | 30 | 1 per team | ✅ |
| MLB | 57 | 30 regular + 18 spring training + 9 special venues | ✅ |
| NFL | 30 | Includes shared venues (SoFi Stadium: LAR+LAC, MetLife: NYG+NYJ) | ✅ |
| NHL | 32 | 1 per team | ✅ |
| MLS | 30 | 1 per team | ✅ |
| WNBA | 13 | 1 per team | ✅ |
| NWSL | 19 | 14 current + 5 expansion team venues (Boston/Denver) | ✅ |
### Recent Updates Verification
| Update | Type | Status | Notes |
|--------|------|--------|-------|
| Utah Hockey Club (NHL) | Relocation | ✅ Present | ARI + UTA abbreviations both map to `team_nhl_ari` |
| Golden State Valkyries (WNBA) | Expansion 2025 | ✅ Present | `team_wnba_gsv` with Chase Center venue |
| Boston Legacy FC (NWSL) | Expansion 2026 | ✅ Present | `team_nwsl_bos` with Gillette Stadium |
| Denver Summit FC (NWSL) | Expansion 2026 | ✅ Present | `team_nwsl_den` with Dick's Sporting Goods Park |
| Oakland A's → Sacramento | Temporary relocation | ✅ Present | `stadium_mlb_sutter_health_park` |
| San Diego FC (MLS) | Expansion 2025 | ✅ Present | `team_mls_sd` with Snapdragon Stadium |
| FedExField → Northwest Stadium | Naming rights | ✅ Present | `stadium_nfl_northwest_stadium` |
### NFL Stadium Sharing
| Stadium | Teams | Status |
|---------|-------|--------|
| SoFi Stadium | LAR, LAC | ✅ Correct |
| MetLife Stadium | NYG, NYJ | ✅ Correct |
### Issues Found
| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 1 | WNBA single abbreviations | Low | All 13 WNBA teams have only 1 abbreviation each. May need additional abbreviations for source compatibility. |
### Phase 1 Summary
**Result: PASS** - All team and stadium mappings are complete and up-to-date with 2025-2026 changes.
- ✅ All 7 sports have correct team counts
- ✅ All stadium counts are appropriate (including spring training, special venues)
- ✅ Recent franchise moves/expansions are reflected
- ✅ Stadium sharing is correctly handled
- ✅ Naming rights updates are current
---
## Phase 2 Results: Alias File Completeness
**Files Audited:**
- `Scripts/team_aliases.json`
- `Scripts/stadium_aliases.json`
### Team Aliases Summary
| Sport | Entries | Coverage | Status |
|-------|---------|----------|--------|
| MLB | 23 | Historical relocations/renames | ✅ |
| NBA | 29 | Historical relocations/renames | ✅ |
| NHL | 24 | Historical relocations/renames | ✅ |
| NFL | 0 | **No aliases** | ⚠️ |
| MLS | 0 | No aliases (newer league) | ✅ |
| WNBA | 0 | No aliases (newer league) | ✅ |
| NWSL | 0 | No aliases (newer league) | ✅ |
| **Total** | **76** | | |
- All 76 entries have valid date ranges
- No orphan references (all canonical IDs exist in mappings)
### Stadium Aliases Summary
| Sport | Entries | Coverage | Status |
|-------|---------|----------|--------|
| MLB | 109 | Regular + spring training + special venues | ✅ |
| NFL | 65 | Naming rights history | ✅ |
| NBA | 44 | Naming rights history | ✅ |
| NHL | 39 | Naming rights history | ✅ |
| MLS | 35 | Current + naming variants | ✅ |
| WNBA | 15 | Current + naming variants | ✅ |
| NWSL | 14 | Current + naming variants | ✅ |
| **Total** | **321** | | |
- 65 entries have date ranges (historical naming rights)
- 256 entries are permanent aliases (no date restrictions)
### Orphan Reference Check
| Type | Count | Status |
|------|-------|--------|
| Team aliases with invalid references | 0 | ✅ |
| Stadium aliases with invalid references | **5** | ❌ |
**Orphan Stadium References Found:**
| Alias Name | References (Invalid) | Correct ID |
|------------|---------------------|------------|
| Broncos Stadium at Mile High | `stadium_nfl_empower_field_at_mile_high` | `stadium_nfl_empower_field` |
| Sports Authority Field at Mile High | `stadium_nfl_empower_field_at_mile_high` | `stadium_nfl_empower_field` |
| Invesco Field at Mile High | `stadium_nfl_empower_field_at_mile_high` | `stadium_nfl_empower_field` |
| Mile High Stadium | `stadium_nfl_empower_field_at_mile_high` | `stadium_nfl_empower_field` |
| Arrowhead Stadium | `stadium_nfl_geha_field_at_arrowhead_stadium` | `stadium_nfl_arrowhead_stadium` |
### Historical Changes Coverage
| Historical Name | Current Team | In Aliases? |
|-----------------|--------------|-------------|
| Montreal Expos | Washington Nationals | ✅ |
| Seattle SuperSonics | Oklahoma City Thunder | ✅ |
| Arizona Coyotes | Utah Hockey Club | ✅ |
| Cleveland Indians | Cleveland Guardians | ✅ |
| Hartford Whalers | Carolina Hurricanes | ✅ |
| Quebec Nordiques | Colorado Avalanche | ✅ |
| Vancouver Grizzlies | Memphis Grizzlies | ✅ |
| Washington Redskins | Washington Commanders | ❌ Missing |
| Washington Football Team | Washington Commanders | ❌ Missing |
| Brooklyn Dodgers | Los Angeles Dodgers | ❌ Missing |
### Issues Found
| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 2 | Orphan stadium alias references | Medium | 5 stadium aliases point to non-existent canonical IDs (`stadium_nfl_empower_field_at_mile_high`, `stadium_nfl_geha_field_at_arrowhead_stadium`). Causes resolution failures for historical Denver/KC stadium names. |
| 3 | No NFL team aliases | Low | Missing Washington Redskins/Football Team historical names. Limits historical game matching for NFL. |
### Phase 2 Summary
**Result: PASS with issues** - Alias files cover most historical changes but have referential integrity bugs.
- ✅ Team aliases cover MLB/NBA/NHL historical changes
- ✅ Stadium aliases cover naming rights changes across all sports
- ✅ No date range validation errors
- ❌ 5 orphan stadium references need fixing
- ⚠️ No NFL team aliases (Washington Redskins/Football Team missing)
---
## Phase 3 Results: Scraper Source Reliability
**Files Audited:**
- `sportstime_parser/scrapers/base.py` (fallback logic)
- `sportstime_parser/scrapers/nba.py`, `mlb.py`, `nfl.py`, `nhl.py`, `mls.py`, `wnba.py`, `nwsl.py`
### Source Dependency Matrix
| Sport | Primary | Status | Fallback 1 | Status | Fallback 2 | Status | Risk |
|-------|---------|--------|------------|--------|------------|--------|------|
| NBA | basketball_reference | ✅ | espn | ✅ | cbs | ❌ NOT IMPL | Medium |
| MLB | mlb_api | ✅ | espn | ✅ | baseball_reference | ✅ | Low |
| NFL | espn | ✅ | pro_football_reference | ✅ | cbs | ❌ NOT IMPL | Medium |
| NHL | hockey_reference | ✅ | nhl_api | ✅ | espn | ✅ | Low |
| MLS | espn | ✅ | fbref | ❌ NOT IMPL | - | - | **HIGH** |
| WNBA | espn | ✅ | - | - | - | - | **HIGH** |
| NWSL | espn | ✅ | - | - | - | - | **HIGH** |
### Unimplemented Sources
| Sport | Source | Line | Status |
|-------|--------|------|--------|
| NBA | cbs | `nba.py:421` | `raise NotImplementedError("CBS scraper not implemented")` |
| NFL | cbs | `nfl.py:386` | `raise NotImplementedError("CBS scraper not implemented")` |
| MLS | fbref | `mls.py:214` | `raise NotImplementedError("FBref scraper not implemented")` |
### Fallback Logic Analysis
**File:** `base.py:189`
```python
max_sources_to_try = 2 # Don't try all sources if first few return nothing
```
**Impact:**
- Even if 3 sources are declared, only 2 are tried
- If sources 1 and 2 fail, source 3 is never attempted
- This limits resilience for NBA, MLB, NFL, NHL which have 3 sources
### International Game Filtering
| Sport | Hardcoded Locations | Notes |
|-------|---------------------|-------|
| NFL | London, Mexico City, Frankfurt, Munich, São Paulo | ✅ Complete for 2025 |
| NHL | Prague, Stockholm, Helsinki, Tampere, Gothenburg | ✅ Complete for 2025 |
| NBA | None | ⚠️ No international filtering (Abu Dhabi games?) |
| MLB | None | ⚠️ No international filtering (Mexico City games?) |
| MLS | None | N/A (domestic only) |
| WNBA | None | N/A (domestic only) |
| NWSL | None | N/A (domestic only) |
### Single Point of Failure Risk
| Sport | Primary Source | If ESPN Fails... | Risk Level |
|-------|----------------|------------------|------------|
| WNBA | ESPN only | **Complete data loss** | Critical |
| NWSL | ESPN only | **Complete data loss** | Critical |
| MLS | ESPN only (fbref not impl) | **Complete data loss** | Critical |
| NBA | Basketball-Ref → ESPN | ESPN fallback available | Low |
| NFL | ESPN → Pro-Football-Ref | Fallback available | Low |
| NHL | Hockey-Ref → NHL API → ESPN | Two fallbacks | Very Low |
| MLB | MLB API → ESPN → B-Ref | Two fallbacks | Very Low |
### Issues Found
| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 4 | WNBA/NWSL/MLS single source | High | ESPN is the only working source for 3 sports. If ESPN changes or fails, data collection completely stops. |
| 5 | max_sources_to_try = 2 | High | Third fallback source never tried even if available. Reduces resilience for NBA/MLB/NFL/NHL. |
| 6 | CBS/FBref not implemented | Medium | Declared fallback sources raise NotImplementedError. Appears functional in config but fails at runtime. |
### Phase 3 Summary
**Result: FAIL** - Critical single-point-of-failure for 3 sports.
- ❌ WNBA, NWSL, MLS have only ESPN (no resilience)
- ❌ Fallback limit of 2 prevents third source from being tried
- ⚠️ CBS and FBref declared but not implemented
- ✅ MLB and NHL have full fallback chains
- ✅ International game filtering present for NFL/NHL
---
## Phase 4 Results: Game Count & Coverage
**Files Audited:**
- `Scripts/output/games_*.json` (all 2025 season files)
- `Scripts/output/validation_*.md` (all validation reports)
- `sportstime_parser/config.py` (EXPECTED_GAME_COUNTS)
### Coverage Summary
| Sport | Scraped | Expected | Coverage | Status |
|-------|---------|----------|----------|--------|
| NBA | 1,231 | 1,230 | 100.1% | ✅ |
| MLB | 2,866 | 2,430 | 117.9% | ⚠️ Includes spring training |
| NFL | 330 | 272 | 121.3% | ⚠️ Includes preseason/playoffs |
| NHL | 1,312 | 1,312 | 100.0% | ✅ |
| MLS | 542 | 493 | 109.9% | ✅ Includes playoffs |
| WNBA | 322 | 220 | **146.4%** | ⚠️ Expected count outdated |
| NWSL | 189 | 182 | 103.8% | ✅ |
### Date Range Analysis
| Sport | Start Date | End Date | Notes |
|-------|------------|----------|-------|
| NBA | 2025-10-21 | 2026-04-12 | Regular season only |
| MLB | 2025-03-01 | 2025-11-02 | Includes spring training (417 games in March) |
| NFL | 2025-08-01 | 2026-01-25 | Includes preseason (49 in Aug) + playoffs (28 in Jan) |
| NHL | 2025-10-07 | 2026-04-16 | Regular season only |
| MLS | 2025-02-22 | 2025-11-30 | Regular season + playoffs |
| WNBA | 2025-05-02 | 2025-10-11 | Regular season + playoffs |
| NWSL | 2025-03-15 | 2025-11-23 | Regular season + playoffs |
### Game Status Distribution
All games across all sports have status `unknown` - game status is not being properly parsed from sources.
### Duplicate Game Detection
| Sport | Duplicates Found | Details |
|-------|-----------------|---------|
| NBA | 0 | ✅ |
| MLB | 1 | `game_mlb_2025_20250508_det_col_1` appears twice (doubleheader handling issue) |
| NFL | 0 | ✅ |
| NHL | 0 | ✅ |
| MLS | 0 | ✅ |
| WNBA | 0 | ✅ |
| NWSL | 0 | ✅ |
### Validation Report Analysis
| Sport | Total Games | Unresolved Teams | Unresolved Stadiums | Manual Review Items |
|-------|-------------|------------------|---------------------|---------------------|
| NBA | 1,231 | 0 | **131** | 131 |
| MLB | 2,866 | 12 | 4 | 20 |
| NFL | 330 | 1 | 5 | 11 |
| NHL | 1,312 | 0 | 0 | **1,312** (all missing stadiums) |
| MLS | 542 | 1 | **64** | 129 |
| WNBA | 322 | 5 | **65** | 135 |
| NWSL | 189 | 0 | **16** | 32 |
### Top Unresolved Stadium Names (Recent Naming Rights)
| Stadium Name | Occurrences | Actual Venue | Issue |
|--------------|-------------|--------------|-------|
| Sports Illustrated Stadium | 11 | MLS expansion venue | New venue, missing alias |
| Mortgage Matchup Center | 8 | Rocket Mortgage FieldHouse (CLE) | 2025 naming rights change |
| ScottsMiracle-Gro Field | 4 | MLS Columbus Crew | Missing alias |
| Energizer Park | 3 | MLS CITY SC (STL?) | Missing alias |
| Xfinity Mobile Arena | 3 | Intuit Dome (LAC) | 2025 naming rights change |
| Rocket Arena | 3 | Toyota Center (HOU) | Potential name change |
| CareFirst Arena | 2 | Washington Mystics venue | New WNBA venue name |
### Unresolved Teams (Exhibition/International)
| Team Name | Sport | Type | Games |
|-----------|-------|------|-------|
| BRAZIL | WNBA | International exhibition | 2 |
| Toyota Antelopes | WNBA | Japanese team | 2 |
| TEAM CLARK | WNBA | All-Star Game | 1 |
| (Various MLB) | MLB | International teams | 12 |
| (MLS international) | MLS | CCL/exhibition | 1 |
| (NFL preseason) | NFL | Pre-season exhibition | 1 |
### NHL Stadium Data Issue
**Critical:** Hockey Reference does not provide stadium data. All 1,312 NHL games have `raw_stadium: None`, causing 100% of games to have missing stadium IDs. The NHL fallback sources (NHL API, ESPN) should provide this data, but the `max_sources_to_try = 2` limit combined with Hockey Reference success means fallbacks are never attempted.
### Expected Count Updates Needed
| Sport | Current Expected | Recommended | Reason |
|-------|------------------|-------------|--------|
| WNBA | 220 | **286** | 13 teams × 44 games / 2 (expanded with Golden State Valkyries) |
| NFL | 272 | 272 (filter preseason) | Or document that 330 includes preseason |
| MLB | 2,430 | 2,430 (filter spring training) | Or document that 2,866 includes spring training |
### Issues Found
| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 7 | NHL has no stadium data | High | Hockey Reference provides no venue info. All 1,312 games missing stadium_id. Fallback sources not tried. |
| 8 | 131 NBA stadium resolution failures | High | Recent naming rights changes ("Mortgage Matchup Center", "Xfinity Mobile Arena") not in aliases. |
| 9 | Outdated WNBA expected count | Medium | Config says 220 but WNBA expanded to 13 teams in 2025; actual is 322 (286 regular + playoffs). |
| 10 | MLS/WNBA stadium alias gaps | Medium | 64 MLS + 65 WNBA unresolved stadiums from new/renamed venues. |
| 11 | Game status not parsed | Low | All games have status `unknown` instead of final/scheduled/postponed. |
### Phase 4 Summary
**Result: FAIL** - Significant stadium resolution failures across multiple sports.
- ❌ 131 NBA games missing stadium (naming rights changes)
- ❌ 1,312 NHL games missing stadium (source doesn't provide data)
- ❌ 64 MLS + 65 WNBA stadiums unresolved (new/renamed venues)
- ⚠️ WNBA expected count severely outdated (220 vs 322 actual)
- ⚠️ MLB/NFL include preseason/spring training games
- ✅ No significant duplicate games (1 MLB doubleheader edge case)
- ✅ All teams resolved except exhibition/international games
---
## Phase 5 Results: Canonical ID Consistency
**Files Audited:**
- `sportstime_parser/normalizers/canonical_id.py` (Python ID generation)
- `SportsTime/Core/Models/Local/CanonicalModels.swift` (iOS models)
- `SportsTime/Core/Services/BootstrapService.swift` (iOS JSON parsing)
- All `Scripts/output/*.json` files (generated IDs)
### Format Validation
| Type | Total IDs | Valid | Invalid | Pass Rate |
|------|-----------|-------|---------|-----------|
| Team | 183 | 183 | 0 | 100.0% ✅ |
| Stadium | 211 | 211 | 0 | 100.0% ✅ |
| Game | 6,792 | 6,792 | 0 | 100.0% ✅ |
### ID Format Patterns (all validated)
```
Teams: team_{sport}_{abbrev} → team_nba_lal
Stadiums: stadium_{sport}_{normalized_name} → stadium_nba_cryptocom_arena
Games: game_{sport}_{season}_{YYYYMMDD}_{away}_{home}[_{#}]
→ game_nba_2025_20251021_hou_okc
```
### Normalization Quality
| Check | Result |
|-------|--------|
| Double underscores (`__`) | 0 found ✅ |
| Leading/trailing underscores | 0 found ✅ |
| Uppercase letters | 0 found ✅ |
| Special characters | 0 found ✅ |
### Abbreviation Lengths (Teams)
| Length | Count |
|--------|-------|
| 2 chars | 21 |
| 3 chars | 161 |
| 4 chars | 1 |
### Stadium ID Lengths
- Minimum: 8 characters
- Maximum: 29 characters
- Average: 16.2 characters
### iOS Cross-Compatibility
| Aspect | Status | Notes |
|--------|--------|-------|
| Field naming convention | ✅ Compatible | Python uses snake_case; iOS `BootstrapService` uses matching Codable structs |
| Deterministic UUID generation | ✅ Compatible | iOS uses SHA256 hash of canonical_id - matches any valid string |
| Schema version | ✅ Compatible | Both use version 1 |
| Required fields | ✅ Present | All iOS-required fields present in JSON output |
### Field Mapping (Python → iOS)
| Python Field | iOS Field | Notes |
|--------------|-----------|-------|
| `canonical_id` | `canonicalId` | Mapped via `JSONCanonicalStadium.canonical_id``CanonicalStadium.canonicalId` |
| `home_team_canonical_id` | `homeTeamCanonicalId` | Explicit mapping in BootstrapService |
| `away_team_canonical_id` | `awayTeamCanonicalId` | Explicit mapping in BootstrapService |
| `stadium_canonical_id` | `stadiumCanonicalId` | Explicit mapping in BootstrapService |
| `game_datetime_utc` | `dateTime` | ISO 8601 parsing with fallback to legacy format |
### Issues Found
**No issues found.** All canonical IDs are:
- Correctly formatted according to defined patterns
- Properly normalized (lowercase, no special characters)
- Deterministic (same input produces same output)
- Compatible with iOS parsing
### Phase 5 Summary
**Result: PASS** - All canonical IDs are consistent and iOS-compatible.
- ✅ 100% format validation pass rate across 7,186 IDs
- ✅ No normalization issues found
- ✅ iOS BootstrapService explicitly handles snake_case → camelCase mapping
- ✅ Deterministic UUID generation using SHA256 hash
---
## Phase 6 Results: Referential Integrity
**Files Audited:**
- `Scripts/output/games_*_2025.json`
- `Scripts/output/teams_*.json`
- `Scripts/output/stadiums_*.json`
### Game → Team References
| Sport | Total Games | Valid Home | Valid Away | Orphan Home | Orphan Away | Status |
|-------|-------------|------------|------------|-------------|-------------|--------|
| NBA | 1,231 | 1,231 | 1,231 | 0 | 0 | ✅ |
| MLB | 2,866 | 2,866 | 2,866 | 0 | 0 | ✅ |
| NFL | 330 | 330 | 330 | 0 | 0 | ✅ |
| NHL | 1,312 | 1,312 | 1,312 | 0 | 0 | ✅ |
| MLS | 542 | 542 | 542 | 0 | 0 | ✅ |
| WNBA | 322 | 322 | 322 | 0 | 0 | ✅ |
| NWSL | 189 | 189 | 189 | 0 | 0 | ✅ |
**Result:** 100% valid team references across all 6,792 games.
### Game → Stadium References
| Sport | Total Games | Valid | Missing | Percentage Missing |
|-------|-------------|-------|---------|-------------------|
| NBA | 1,231 | 1,231 | 0 | 0.0% ✅ |
| MLB | 2,866 | 2,862 | 4 | 0.1% ✅ |
| NFL | 330 | 325 | 5 | 1.5% ✅ |
| NHL | 1,312 | 0 | **1,312** | **100%** ❌ |
| MLS | 542 | 478 | 64 | 11.8% ⚠️ |
| WNBA | 322 | 257 | 65 | 20.2% ⚠️ |
| NWSL | 189 | 173 | 16 | 8.5% ⚠️ |
**Note:** "Missing" means `stadium_canonical_id` is empty (resolution failed at scrape time). This is NOT orphan references to non-existent stadiums.
### Team → Stadium References
| Sport | Teams | Valid Stadium | Invalid | Status |
|-------|-------|---------------|---------|--------|
| NBA | 30 | 30 | 0 | ✅ |
| MLB | 30 | 30 | 0 | ✅ |
| NFL | 32 | 32 | 0 | ✅ |
| NHL | 32 | 32 | 0 | ✅ |
| MLS | 30 | 30 | 0 | ✅ |
| WNBA | 13 | 13 | 0 | ✅ |
| NWSL | 16 | 16 | 0 | ✅ |
**Result:** 100% valid team → stadium references.
### Cross-Sport Stadium Check
✅ No stadiums are duplicated across sports. Each `stadium_{sport}_*` ID is unique to its sport.
### Missing Stadium Root Causes
| Sport | Missing | Root Cause |
|-------|---------|------------|
| NHL | 1,312 | **Hockey Reference provides no venue data** - source limitation |
| MLS | 64 | New/renamed stadiums not in aliases (see Phase 4) |
| WNBA | 65 | New venue names not in aliases (see Phase 4) |
| NWSL | 16 | Expansion team venues + alternate venues |
| NFL | 5 | International games not in stadium mappings |
| MLB | 4 | Exhibition/international games |
### Orphan Reference Summary
| Reference Type | Total Checked | Orphans Found |
|----------------|---------------|---------------|
| Game → Home Team | 6,792 | 0 ✅ |
| Game → Away Team | 6,792 | 0 ✅ |
| Game → Stadium | 6,792 | 0 ✅ |
| Team → Stadium | 183 | 0 ✅ |
**Note:** Zero orphan references. All "missing" stadiums are resolution failures (empty string), not references to non-existent canonical IDs.
### Issues Found
| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 12 | NHL games have no stadium data | Medium | Hockey Reference source doesn't provide venue information. All 1,312 NHL games have empty stadium_canonical_id. Fallback sources could provide this data but are limited by `max_sources_to_try = 2`. |
### Phase 6 Summary
**Result: PASS with known limitations** - No orphan references exist; missing stadiums are resolution failures.
- ✅ 100% valid team references (home and away)
- ✅ 100% valid team → stadium references
- ✅ No orphan references to non-existent canonical IDs
- ⚠️ 1,466 games (21.6%) have empty stadium_canonical_id (resolution failures, not orphans)
- ⚠️ NHL accounts for 90% of missing stadium data (source limitation)
---
## Phase 7 Results: iOS Data Reception
**Files Audited:**
- `SportsTime/Core/Services/BootstrapService.swift` (JSON parsing)
- `SportsTime/Core/Services/CanonicalSyncService.swift` (CloudKit sync)
- `SportsTime/Core/Services/DataProvider.swift` (data access)
- `SportsTime/Core/Models/Local/CanonicalModels.swift` (SwiftData models)
- `SportsTime/Resources/*_canonical.json` (bundled data files)
### Bundled Data Comparison
| Data Type | iOS Bundled | Scripts Output | Difference | Status |
|-----------|-------------|----------------|------------|--------|
| Teams | 148 | 183 | **-35** (19%) | ❌ STALE |
| Stadiums | 122 | 211 | **-89** (42%) | ❌ STALE |
| Games | 4,972 | 6,792 | **-1,820** (27%) | ❌ STALE |
**iOS bundled data is significantly outdated compared to Scripts output.**
### Field Mapping Verification
| Python Field | iOS JSON Struct | iOS Model | Type Match | Status |
|--------------|-----------------|-----------|------------|--------|
| `canonical_id` | `canonical_id` | `canonicalId` | String ✅ | ✅ |
| `name` | `name` | `name` | String ✅ | ✅ |
| `game_datetime_utc` | `game_datetime_utc` | `dateTime` | ISO 8601 → Date ✅ | ✅ |
| `date` + `time` (legacy) | `date`, `time` | `dateTime` | Fallback parsing ✅ | ✅ |
| `home_team_canonical_id` | `home_team_canonical_id` | `homeTeamCanonicalId` | String ✅ | ✅ |
| `away_team_canonical_id` | `away_team_canonical_id` | `awayTeamCanonicalId` | String ✅ | ✅ |
| `stadium_canonical_id` | `stadium_canonical_id` | `stadiumCanonicalId` | String ✅ | ✅ |
| `sport` | `sport` | `sport` | String ✅ | ✅ |
| `season` | `season` | `season` | String ✅ | ✅ |
| `is_playoff` | `is_playoff` | `isPlayoff` | Bool ✅ | ✅ |
| `broadcast_info` | `broadcast_info` | `broadcastInfo` | String? ✅ | ✅ |
**Result:** All field mappings are correct and compatible.
### Date Parsing Compatibility
iOS `BootstrapService` supports both formats:
```swift
// New canonical format (preferred)
let game_datetime_utc: String? // ISO 8601
// Legacy format (fallback)
let date: String? // "YYYY-MM-DD"
let time: String? // "HH:mm" or "TBD"
```
**Current iOS bundled games use legacy format.** After updating bundled data, new `game_datetime_utc` format will be used.
### Missing Reference Handling
**`DataProvider.filterRichGames()` behavior:**
```swift
return games.compactMap { game in
guard let homeTeam = teamsById[game.homeTeamId],
let awayTeam = teamsById[game.awayTeamId],
let stadium = stadiumsById[game.stadiumId] else {
return nil // Silently drops game
}
return RichGame(...)
}
```
**Impact:**
- Games with missing stadium IDs are **silently excluded** from RichGame queries
- No error logging or fallback behavior
- User sees fewer games than expected without explanation
### Deduplication Logic
**Bootstrap:** No explicit deduplication. If bundled JSON contains duplicate canonical IDs, both would be inserted into SwiftData (leading to potential query issues).
**CloudKit Sync:** Uses upsert pattern with canonical ID as unique key - duplicates would overwrite.
### Schema Version Compatibility
| Component | Schema Version | Status |
|-----------|----------------|--------|
| Scripts output | 1 | ✅ |
| iOS CanonicalModels | 1 | ✅ |
| iOS BootstrapService | Expects 1 | ✅ |
**Compatible.** Schema version mismatch protection exists in `CanonicalSyncService`:
```swift
case .schemaVersionTooNew(let version):
return "Data requires app version supporting schema \(version). Please update the app."
```
### Bootstrap Order Validation
iOS bootstraps in correct dependency order:
1. Stadiums (no dependencies)
2. Stadium aliases (depends on stadiums)
3. League structure (no dependencies)
4. Teams (depends on stadiums)
5. Team aliases (depends on teams)
6. Games (depends on teams + stadiums)
**Correct - prevents orphan references during bootstrap.**
### CloudKit Sync Validation
`CanonicalSyncService` syncs in same dependency order and tracks:
- Per-entity sync timestamps
- Skipped records (incompatible schema version)
- Skipped records (older than local)
- Sync duration and cancellation
**Well-designed sync infrastructure.**
### Issues Found
| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 13 | iOS bundled data severely outdated | **Critical** | Missing 35 teams (19%), 89 stadiums (42%), 1,820 games (27%). First-launch experience shows incomplete data until CloudKit sync completes. |
| 14 | Silent game exclusion in RichGame queries | Medium | `filterRichGames()` silently drops games with missing team/stadium references. Users see fewer games without explanation. |
| 15 | No bootstrap deduplication | Low | Duplicate game IDs in bundled JSON would create duplicate SwiftData records. Low risk since JSON is generated correctly. |
### Phase 7 Summary
**Result: FAIL** - iOS bundled data is critically outdated.
- ❌ iOS bundled data missing 35 teams, 89 stadiums, 1,820 games
- ⚠️ Games with unresolved references silently dropped from RichGame queries
- ✅ Field mapping between Python and iOS is correct
- ✅ Date parsing supports both legacy and new formats
- ✅ Schema versions are compatible
- ✅ Bootstrap/sync order handles dependencies correctly
---
## Prioritized Issue List
| # | Issue | Severity | Phase | Root Cause | Remediation |
|---|-------|----------|-------|------------|-------------|
| 13 | iOS bundled data severely outdated | **Critical** | 7 | Bundled JSON not updated after pipeline runs | Copy Scripts/output/*_canonical.json to iOS Resources/ and rebuild |
| 4 | WNBA/NWSL/MLS ESPN-only source | **High** | 3 | No implemented fallback sources | Implement alternative scrapers (FBref for MLS, WNBA League Pass) |
| 5 | max_sources_to_try = 2 limits fallback | **High** | 3 | Hardcoded limit in base.py:189 | Increase to 3 or remove limit for sports with 3+ sources |
| 7 | NHL has no stadium data from primary source | **High** | 4 | Hockey Reference doesn't provide venue info | Force NHL to use NHL API or ESPN as primary (they provide venues) |
| 8 | 131 NBA stadium resolution failures | **High** | 4 | 2024-2025 naming rights not in aliases | Add aliases: "Mortgage Matchup Center" → Rocket Mortgage FieldHouse, "Xfinity Mobile Arena" → Intuit Dome |
| 2 | Orphan stadium alias references | **Medium** | 2 | Wrong canonical IDs in stadium_aliases.json | Fix 5 Denver/KC stadium aliases pointing to non-existent IDs |
| 6 | CBS/FBref scrapers declared but not implemented | **Medium** | 3 | NotImplementedError at runtime | Either implement or remove from source lists to avoid confusion |
| 9 | Outdated WNBA expected count | **Medium** | 4 | WNBA expanded to 13 teams in 2025 | Update config.py EXPECTED_GAME_COUNTS["wnba"] from 220 to 286 |
| 10 | MLS/WNBA stadium alias gaps | **Medium** | 4 | New/renamed venues missing from aliases | Add 129 missing stadium aliases (64 MLS + 65 WNBA) |
| 12 | NHL games have no stadium data | **Medium** | 6 | Same as Issue #7 | See Issue #7 remediation |
| 14 | Silent game exclusion in RichGame queries | **Medium** | 7 | compactMap silently drops games | Log dropped games or return partial RichGame with placeholder stadium |
| 1 | WNBA single abbreviations | **Low** | 1 | Only 1 abbreviation per team | Add alternative abbreviations for source compatibility |
| 3 | No NFL team aliases | **Low** | 2 | Missing Washington Redskins/Football Team | Add historical Washington team name aliases |
| 11 | Game status not parsed | **Low** | 4 | Status field always "unknown" | Parse game status from source data (final, scheduled, postponed) |
| 15 | No bootstrap deduplication | **Low** | 7 | No explicit duplicate check during bootstrap | Add deduplication check in bootstrapGames() |
---
## Recommended Next Steps
### Immediate (Before Next Release)
1. **Update iOS bundled data** (Issue #13)
```bash
cp Scripts/output/stadiums_*.json SportsTime/Resources/stadiums_canonical.json
cp Scripts/output/teams_*.json SportsTime/Resources/teams_canonical.json
cp Scripts/output/games_*.json SportsTime/Resources/games_canonical.json
```
2. **Fix NHL stadium data** (Issues #7, #12)
- Change NHL primary source from Hockey Reference to NHL API
- Or: Increase `max_sources_to_try` to 3 so fallbacks are attempted
3. **Add critical stadium aliases** (Issues #8, #10)
- "Mortgage Matchup Center" → `stadium_nba_rocket_mortgage_fieldhouse`
- "Xfinity Mobile Arena" → `stadium_nba_intuit_dome`
- Run validation report to identify all unresolved venue names
### Short-term (This Quarter)
4. **Implement MLS fallback source** (Issue #4)
- FBref has MLS data with venue information
- Reduces ESPN single-point-of-failure risk
5. **Fix orphan alias references** (Issue #2)
- Correct 5 NFL stadium aliases pointing to wrong canonical IDs
- Add validation check to prevent future orphan references
6. **Update expected game counts** (Issue #9)
- WNBA: 220 → 286 (13 teams × 44 games / 2)
### Long-term (Next Quarter)
7. **Implement WNBA/NWSL fallback sources** (Issue #4)
- Consider WNBA League Pass API or other sources
- NWSL has limited data availability - may need to accept ESPN-only
8. **Add RichGame partial loading** (Issue #14)
- Log games dropped due to missing references
- Consider returning games with placeholder stadiums for NHL
9. **Parse game status** (Issue #11)
- Extract final/scheduled/postponed from source data
- Enables filtering by game state
---
## Verification Checklist
After implementing fixes, verify:
- [ ] Run `python -m sportstime_parser scrape --sport all --season 2025`
- [ ] Check validation reports show <5% unresolved stadiums per sport
- [ ] Copy output JSON to iOS Resources/
- [ ] Build iOS app and verify data loads at startup
- [ ] Query RichGames and verify game count matches expectations
- [ ] Run CloudKit sync and verify no errors

File diff suppressed because it is too large Load Diff

View File

@@ -161,6 +161,105 @@
"parent_id": "nba_western",
"display_order": 17
},
{
"id": "nfl_league",
"sport": "NFL",
"type": "league",
"name": "National Football League",
"abbreviation": "NFL",
"parent_id": null,
"display_order": 18
},
{
"id": "nfl_afc",
"sport": "NFL",
"type": "conference",
"name": "American Football Conference",
"abbreviation": "AFC",
"parent_id": "nfl_league",
"display_order": 19
},
{
"id": "nfl_nfc",
"sport": "NFL",
"type": "conference",
"name": "National Football Conference",
"abbreviation": "NFC",
"parent_id": "nfl_league",
"display_order": 20
},
{
"id": "nfl_afc_east",
"sport": "NFL",
"type": "division",
"name": "AFC East",
"abbreviation": null,
"parent_id": "nfl_afc",
"display_order": 21
},
{
"id": "nfl_afc_north",
"sport": "NFL",
"type": "division",
"name": "AFC North",
"abbreviation": null,
"parent_id": "nfl_afc",
"display_order": 22
},
{
"id": "nfl_afc_south",
"sport": "NFL",
"type": "division",
"name": "AFC South",
"abbreviation": null,
"parent_id": "nfl_afc",
"display_order": 23
},
{
"id": "nfl_afc_west",
"sport": "NFL",
"type": "division",
"name": "AFC West",
"abbreviation": null,
"parent_id": "nfl_afc",
"display_order": 24
},
{
"id": "nfl_nfc_east",
"sport": "NFL",
"type": "division",
"name": "NFC East",
"abbreviation": null,
"parent_id": "nfl_nfc",
"display_order": 25
},
{
"id": "nfl_nfc_north",
"sport": "NFL",
"type": "division",
"name": "NFC North",
"abbreviation": null,
"parent_id": "nfl_nfc",
"display_order": 26
},
{
"id": "nfl_nfc_south",
"sport": "NFL",
"type": "division",
"name": "NFC South",
"abbreviation": null,
"parent_id": "nfl_nfc",
"display_order": 27
},
{
"id": "nfl_nfc_west",
"sport": "NFL",
"type": "division",
"name": "NFC West",
"abbreviation": null,
"parent_id": "nfl_nfc",
"display_order": 28
},
{
"id": "nhl_league",
"sport": "NHL",
@@ -168,7 +267,7 @@
"name": "National Hockey League",
"abbreviation": "NHL",
"parent_id": null,
"display_order": 18
"display_order": 29
},
{
"id": "nhl_eastern",
@@ -177,7 +276,7 @@
"name": "Eastern Conference",
"abbreviation": "East",
"parent_id": "nhl_league",
"display_order": 19
"display_order": 30
},
{
"id": "nhl_western",
@@ -186,7 +285,7 @@
"name": "Western Conference",
"abbreviation": "West",
"parent_id": "nhl_league",
"display_order": 20
"display_order": 31
},
{
"id": "nhl_atlantic",
@@ -195,7 +294,7 @@
"name": "Atlantic",
"abbreviation": null,
"parent_id": "nhl_eastern",
"display_order": 21
"display_order": 32
},
{
"id": "nhl_metropolitan",
@@ -204,7 +303,7 @@
"name": "Metropolitan",
"abbreviation": null,
"parent_id": "nhl_eastern",
"display_order": 22
"display_order": 33
},
{
"id": "nhl_central",
@@ -213,7 +312,7 @@
"name": "Central",
"abbreviation": null,
"parent_id": "nhl_western",
"display_order": 23
"display_order": 34
},
{
"id": "nhl_pacific",
@@ -222,6 +321,51 @@
"name": "Pacific",
"abbreviation": null,
"parent_id": "nhl_western",
"display_order": 24
"display_order": 35
},
{
"id": "wnba_league",
"sport": "WNBA",
"type": "league",
"name": "Women's National Basketball Association",
"abbreviation": "WNBA",
"parent_id": null,
"display_order": 36
},
{
"id": "mls_league",
"sport": "MLS",
"type": "league",
"name": "Major League Soccer",
"abbreviation": "MLS",
"parent_id": null,
"display_order": 37
},
{
"id": "mls_eastern",
"sport": "MLS",
"type": "conference",
"name": "Eastern Conference",
"abbreviation": "East",
"parent_id": "mls_league",
"display_order": 38
},
{
"id": "mls_western",
"sport": "MLS",
"type": "conference",
"name": "Western Conference",
"abbreviation": "West",
"parent_id": "mls_league",
"display_order": 39
},
{
"id": "nwsl_league",
"sport": "NWSL",
"type": "league",
"name": "National Women's Soccer League",
"abbreviation": "NWSL",
"parent_id": null,
"display_order": 40
}
]
]

View File

@@ -41,14 +41,15 @@ BACKOFF_FACTOR: float = 2.0 # exponential backoff multiplier
INITIAL_BACKOFF: float = 1.0 # initial backoff in seconds
# Expected game counts per sport (approximate, for validation)
# Updated 2026-01-20 based on 2025-26 season data
EXPECTED_GAME_COUNTS: dict[str, int] = {
"nba": 1230, # 30 teams × 82 games / 2
"mlb": 2430, # 30 teams × 162 games / 2
"nfl": 272, # 32 teams × 17 games / 2
"mlb": 2430, # 30 teams × 162 games / 2 (regular season only)
"nfl": 272, # 32 teams × 17 games / 2 (regular season only)
"nhl": 1312, # 32 teams × 82 games / 2
"mls": 493, # 30 teams × varies
"wnba": 220, # 13 teams × 40 games / 2 (approx)
"nwsl": 182, # 14 teams × 26 games / 2
"mls": 540, # 30 teams × varies (updated for 2025 expansion)
"wnba": 286, # 13 teams × 44 games / 2 (updated for 2025 expansion)
"nwsl": 188, # 14→16 teams × varies (updated for 2025 expansion)
}
# Minimum match score for fuzzy matching (0-100)

View File

@@ -79,6 +79,8 @@ STADIUM_MAPPINGS: dict[str, dict[str, StadiumInfo]] = {
"stadium_nba_scotiabank_arena": StadiumInfo("stadium_nba_scotiabank_arena", "Scotiabank Arena", "Toronto", "ON", "Canada", "nba", 43.6435, -79.3791, "America/Toronto"),
"stadium_nba_delta_center": StadiumInfo("stadium_nba_delta_center", "Delta Center", "Salt Lake City", "UT", "USA", "nba", 40.7683, -111.9011, "America/Denver"),
"stadium_nba_capital_one_arena": StadiumInfo("stadium_nba_capital_one_arena", "Capital One Arena", "Washington", "DC", "USA", "nba", 38.8981, -77.0209),
# International venues
"stadium_nba_mexico_city_arena": StadiumInfo("stadium_nba_mexico_city_arena", "Mexico City Arena", "Mexico City", "CDMX", "Mexico", "nba", 19.4042, -99.0970, "America/Mexico_City"),
},
"mlb": {
"stadium_mlb_chase_field": StadiumInfo("stadium_mlb_chase_field", "Chase Field", "Phoenix", "AZ", "USA", "mlb", 33.4455, -112.0667),
@@ -215,7 +217,7 @@ STADIUM_MAPPINGS: dict[str, dict[str, StadiumInfo]] = {
"stadium_mls_soldier_field": StadiumInfo("stadium_mls_soldier_field", "Soldier Field", "Chicago", "IL", "USA", "mls", 41.8623, -87.6167),
"stadium_mls_tql_stadium": StadiumInfo("stadium_mls_tql_stadium", "TQL Stadium", "Cincinnati", "OH", "USA", "mls", 39.1112, -84.5225),
"stadium_mls_dicks_sporting_goods_park": StadiumInfo("stadium_mls_dicks_sporting_goods_park", "Dick's Sporting Goods Park", "Commerce City", "CO", "USA", "mls", 39.8056, -104.8922),
"stadium_mls_lower_com_field": StadiumInfo("stadium_mls_lower_com_field", "Lower.com Field", "Columbus", "OH", "USA", "mls", 39.9689, -83.0173),
"stadium_mls_lowercom_field": StadiumInfo("stadium_mls_lowercom_field", "Lower.com Field", "Columbus", "OH", "USA", "mls", 39.9689, -83.0173),
"stadium_mls_toyota_stadium": StadiumInfo("stadium_mls_toyota_stadium", "Toyota Stadium", "Frisco", "TX", "USA", "mls", 33.1545, -96.8353),
"stadium_mls_audi_field": StadiumInfo("stadium_mls_audi_field", "Audi Field", "Washington", "DC", "USA", "mls", 38.8687, -77.0128),
"stadium_mls_shell_energy_stadium": StadiumInfo("stadium_mls_shell_energy_stadium", "Shell Energy Stadium", "Houston", "TX", "USA", "mls", 29.7522, -95.3527),
@@ -228,7 +230,7 @@ STADIUM_MAPPINGS: dict[str, dict[str, StadiumInfo]] = {
"stadium_mls_gillette_stadium": StadiumInfo("stadium_mls_gillette_stadium", "Gillette Stadium", "Foxborough", "MA", "USA", "mls", 42.0909, -71.2643),
"stadium_mls_yankee_stadium": StadiumInfo("stadium_mls_yankee_stadium", "Yankee Stadium", "Bronx", "NY", "USA", "mls", 40.8296, -73.9262),
"stadium_mls_red_bull_arena": StadiumInfo("stadium_mls_red_bull_arena", "Red Bull Arena", "Harrison", "NJ", "USA", "mls", 40.7369, -74.1503),
"stadium_mls_inter_co_stadium": StadiumInfo("stadium_mls_inter_co_stadium", "Inter&Co Stadium", "Orlando", "FL", "USA", "mls", 28.5411, -81.3895),
"stadium_mls_interco_stadium": StadiumInfo("stadium_mls_interco_stadium", "Inter&Co Stadium", "Orlando", "FL", "USA", "mls", 28.5411, -81.3895),
"stadium_mls_subaru_park": StadiumInfo("stadium_mls_subaru_park", "Subaru Park", "Chester", "PA", "USA", "mls", 39.8328, -75.3789),
"stadium_mls_providence_park": StadiumInfo("stadium_mls_providence_park", "Providence Park", "Portland", "OR", "USA", "mls", 45.5216, -122.6917),
"stadium_mls_america_first_field": StadiumInfo("stadium_mls_america_first_field", "America First Field", "Sandy", "UT", "USA", "mls", 40.5830, -111.8933),
@@ -254,6 +256,10 @@ STADIUM_MAPPINGS: dict[str, dict[str, StadiumInfo]] = {
"stadium_wnba_footprint_center": StadiumInfo("stadium_wnba_footprint_center", "Footprint Center", "Phoenix", "AZ", "USA", "wnba", 33.4457, -112.0712),
"stadium_wnba_climate_pledge_arena": StadiumInfo("stadium_wnba_climate_pledge_arena", "Climate Pledge Arena", "Seattle", "WA", "USA", "wnba", 47.6221, -122.3540),
"stadium_wnba_entertainment_sports_arena": StadiumInfo("stadium_wnba_entertainment_sports_arena", "Entertainment & Sports Arena", "Washington", "DC", "USA", "wnba", 38.8690, -76.9745),
"stadium_wnba_state_farm_arena": StadiumInfo("stadium_wnba_state_farm_arena", "State Farm Arena", "Atlanta", "GA", "USA", "wnba", 33.7573, -84.3963),
"stadium_wnba_rocket_mortgage_fieldhouse": StadiumInfo("stadium_wnba_rocket_mortgage_fieldhouse", "Rocket Mortgage FieldHouse", "Cleveland", "OH", "USA", "wnba", 41.4965, -81.6882),
"stadium_wnba_cfg_bank_arena": StadiumInfo("stadium_wnba_cfg_bank_arena", "CFG Bank Arena", "Baltimore", "MD", "USA", "wnba", 39.2825, -76.6220),
"stadium_wnba_purcell_pavilion": StadiumInfo("stadium_wnba_purcell_pavilion", "Purcell Pavilion", "Notre Dame", "IN", "USA", "wnba", 41.6987, -86.2340),
},
"nwsl": {
"stadium_nwsl_bmo_stadium": StadiumInfo("stadium_nwsl_bmo_stadium", "BMO Stadium", "Los Angeles", "CA", "USA", "nwsl", 34.0128, -118.2841),
@@ -262,7 +268,7 @@ STADIUM_MAPPINGS: dict[str, dict[str, StadiumInfo]] = {
"stadium_nwsl_cpkc_stadium": StadiumInfo("stadium_nwsl_cpkc_stadium", "CPKC Stadium", "Kansas City", "MO", "USA", "nwsl", 39.1050, -94.5580),
"stadium_nwsl_red_bull_arena": StadiumInfo("stadium_nwsl_red_bull_arena", "Red Bull Arena", "Harrison", "NJ", "USA", "nwsl", 40.7369, -74.1503),
"stadium_nwsl_wakemed_soccer_park": StadiumInfo("stadium_nwsl_wakemed_soccer_park", "WakeMed Soccer Park", "Cary", "NC", "USA", "nwsl", 35.7879, -78.7806),
"stadium_nwsl_inter_co_stadium": StadiumInfo("stadium_nwsl_inter_co_stadium", "Inter&Co Stadium", "Orlando", "FL", "USA", "nwsl", 28.5411, -81.3895),
"stadium_nwsl_interco_stadium": StadiumInfo("stadium_nwsl_interco_stadium", "Inter&Co Stadium", "Orlando", "FL", "USA", "nwsl", 28.5411, -81.3895),
"stadium_nwsl_providence_park": StadiumInfo("stadium_nwsl_providence_park", "Providence Park", "Portland", "OR", "USA", "nwsl", 45.5216, -122.6917),
"stadium_nwsl_lynn_family_stadium": StadiumInfo("stadium_nwsl_lynn_family_stadium", "Lynn Family Stadium", "Louisville", "KY", "USA", "nwsl", 38.2219, -85.7381),
"stadium_nwsl_snapdragon_stadium": StadiumInfo("stadium_nwsl_snapdragon_stadium", "Snapdragon Stadium", "San Diego", "CA", "USA", "nwsl", 32.7837, -117.1225),
@@ -277,6 +283,9 @@ STADIUM_MAPPINGS: dict[str, dict[str, StadiumInfo]] = {
"stadium_nwsl_empower_field": StadiumInfo("stadium_nwsl_empower_field", "Empower Field at Mile High", "Denver", "CO", "USA", "nwsl", 39.7439, -105.0201, "America/Denver"),
"stadium_nwsl_dicks_sporting_goods_park": StadiumInfo("stadium_nwsl_dicks_sporting_goods_park", "Dick's Sporting Goods Park", "Commerce City", "CO", "USA", "nwsl", 39.8056, -104.8922, "America/Denver"),
"stadium_nwsl_centennial_stadium": StadiumInfo("stadium_nwsl_centennial_stadium", "Centennial Stadium", "Centennial", "CO", "USA", "nwsl", 39.6000, -104.8800, "America/Denver"),
# Shared NFL/MLB venues
"stadium_nwsl_soldier_field": StadiumInfo("stadium_nwsl_soldier_field", "Soldier Field", "Chicago", "IL", "USA", "nwsl", 41.8623, -87.6167),
"stadium_nwsl_oracle_park": StadiumInfo("stadium_nwsl_oracle_park", "Oracle Park", "San Francisco", "CA", "USA", "nwsl", 37.7786, -122.3893, "America/Los_Angeles"),
},
}

View File

@@ -208,7 +208,7 @@ TEAM_MAPPINGS: dict[str, dict[str, tuple[str, str, str, str]]] = {
"CHI": ("team_mls_chi", "Chicago Fire", "Chicago", "stadium_mls_soldier_field"),
"CIN": ("team_mls_cin", "FC Cincinnati", "Cincinnati", "stadium_mls_tql_stadium"),
"COL": ("team_mls_col", "Colorado Rapids", "Colorado", "stadium_mls_dicks_sporting_goods_park"),
"CLB": ("team_mls_clb", "Columbus Crew", "Columbus", "stadium_mls_lower_com_field"),
"CLB": ("team_mls_clb", "Columbus Crew", "Columbus", "stadium_mls_lowercom_field"),
"DAL": ("team_mls_dal", "FC Dallas", "Dallas", "stadium_mls_toyota_stadium"),
"DC": ("team_mls_dc", "D.C. United", "Washington", "stadium_mls_audi_field"),
"HOU": ("team_mls_hou", "Houston Dynamo", "Houston", "stadium_mls_shell_energy_stadium"),
@@ -222,7 +222,7 @@ TEAM_MAPPINGS: dict[str, dict[str, tuple[str, str, str, str]]] = {
"NYC": ("team_mls_nyc", "New York City FC", "New York", "stadium_mls_yankee_stadium"),
"RB": ("team_mls_ny", "New York Red Bulls", "New York", "stadium_mls_red_bull_arena"),
"RBNY": ("team_mls_ny", "New York Red Bulls", "New York", "stadium_mls_red_bull_arena"),
"ORL": ("team_mls_orl", "Orlando City", "Orlando", "stadium_mls_inter_co_stadium"),
"ORL": ("team_mls_orl", "Orlando City", "Orlando", "stadium_mls_interco_stadium"),
"PHI": ("team_mls_phi", "Philadelphia Union", "Philadelphia", "stadium_mls_subaru_park"),
"POR": ("team_mls_por", "Portland Timbers", "Portland", "stadium_mls_providence_park"),
"SLC": ("team_mls_slc", "Real Salt Lake", "Salt Lake", "stadium_mls_america_first_field"),
@@ -237,34 +237,64 @@ TEAM_MAPPINGS: dict[str, dict[str, tuple[str, str, str, str]]] = {
},
"wnba": {
"ATL": ("team_wnba_atl", "Atlanta Dream", "Atlanta", "stadium_wnba_gateway_center_arena"),
"DREAM": ("team_wnba_atl", "Atlanta Dream", "Atlanta", "stadium_wnba_gateway_center_arena"), # alias
"CHI": ("team_wnba_chi", "Chicago Sky", "Chicago", "stadium_wnba_wintrust_arena"),
"SKY": ("team_wnba_chi", "Chicago Sky", "Chicago", "stadium_wnba_wintrust_arena"), # alias
"CON": ("team_wnba_con", "Connecticut Sun", "Connecticut", "stadium_wnba_mohegan_sun_arena"),
"CONN": ("team_wnba_con", "Connecticut Sun", "Connecticut", "stadium_wnba_mohegan_sun_arena"), # alias
"SUN": ("team_wnba_con", "Connecticut Sun", "Connecticut", "stadium_wnba_mohegan_sun_arena"), # alias
"DAL": ("team_wnba_dal", "Dallas Wings", "Dallas", "stadium_wnba_college_park_center"),
"WINGS": ("team_wnba_dal", "Dallas Wings", "Dallas", "stadium_wnba_college_park_center"), # alias
"GSV": ("team_wnba_gsv", "Golden State Valkyries", "Golden State", "stadium_wnba_chase_center"),
"GS": ("team_wnba_gsv", "Golden State Valkyries", "Golden State", "stadium_wnba_chase_center"), # alias
"VAL": ("team_wnba_gsv", "Golden State Valkyries", "Golden State", "stadium_wnba_chase_center"), # alias
"IND": ("team_wnba_ind", "Indiana Fever", "Indiana", "stadium_wnba_gainbridge_fieldhouse"),
"FEVER": ("team_wnba_ind", "Indiana Fever", "Indiana", "stadium_wnba_gainbridge_fieldhouse"), # alias
"LV": ("team_wnba_lv", "Las Vegas Aces", "Las Vegas", "stadium_wnba_michelob_ultra_arena"),
"LVA": ("team_wnba_lv", "Las Vegas Aces", "Las Vegas", "stadium_wnba_michelob_ultra_arena"), # alias
"ACES": ("team_wnba_lv", "Las Vegas Aces", "Las Vegas", "stadium_wnba_michelob_ultra_arena"), # alias
"LA": ("team_wnba_la", "Los Angeles Sparks", "Los Angeles", "stadium_wnba_cryptocom_arena"),
"LAS": ("team_wnba_la", "Los Angeles Sparks", "Los Angeles", "stadium_wnba_cryptocom_arena"), # alias
"SPARKS": ("team_wnba_la", "Los Angeles Sparks", "Los Angeles", "stadium_wnba_cryptocom_arena"), # alias
"MIN": ("team_wnba_min", "Minnesota Lynx", "Minnesota", "stadium_wnba_target_center"),
"LYNX": ("team_wnba_min", "Minnesota Lynx", "Minnesota", "stadium_wnba_target_center"), # alias
"NY": ("team_wnba_ny", "New York Liberty", "New York", "stadium_wnba_barclays_center"),
"NYL": ("team_wnba_ny", "New York Liberty", "New York", "stadium_wnba_barclays_center"), # alias
"LIB": ("team_wnba_ny", "New York Liberty", "New York", "stadium_wnba_barclays_center"), # alias
"PHX": ("team_wnba_phx", "Phoenix Mercury", "Phoenix", "stadium_wnba_footprint_center"),
"PHO": ("team_wnba_phx", "Phoenix Mercury", "Phoenix", "stadium_wnba_footprint_center"), # alias
"MERCURY": ("team_wnba_phx", "Phoenix Mercury", "Phoenix", "stadium_wnba_footprint_center"), # alias
"SEA": ("team_wnba_sea", "Seattle Storm", "Seattle", "stadium_wnba_climate_pledge_arena"),
"STORM": ("team_wnba_sea", "Seattle Storm", "Seattle", "stadium_wnba_climate_pledge_arena"), # alias
"WAS": ("team_wnba_was", "Washington Mystics", "Washington", "stadium_wnba_entertainment_sports_arena"),
"WSH": ("team_wnba_was", "Washington Mystics", "Washington", "stadium_wnba_entertainment_sports_arena"), # alias
"MYSTICS": ("team_wnba_was", "Washington Mystics", "Washington", "stadium_wnba_entertainment_sports_arena"), # alias
},
"nwsl": {
"ANF": ("team_nwsl_anf", "Angel City FC", "Los Angeles", "stadium_nwsl_bmo_stadium"),
# Canonical IDs aligned with teams_canonical.json
"ANG": ("team_nwsl_ang", "Angel City FC", "Los Angeles", "stadium_nwsl_bmo_stadium"),
"ANF": ("team_nwsl_ang", "Angel City FC", "Los Angeles", "stadium_nwsl_bmo_stadium"), # alias
"CHI": ("team_nwsl_chi", "Chicago Red Stars", "Chicago", "stadium_nwsl_seatgeek_stadium"),
"HOU": ("team_nwsl_hou", "Houston Dash", "Houston", "stadium_nwsl_shell_energy_stadium"),
"KC": ("team_nwsl_kc", "Kansas City Current", "Kansas City", "stadium_nwsl_cpkc_stadium"),
"NJ": ("team_nwsl_nj", "NJ/NY Gotham FC", "New Jersey", "stadium_nwsl_red_bull_arena"),
"NC": ("team_nwsl_nc", "North Carolina Courage", "North Carolina", "stadium_nwsl_wakemed_soccer_park"),
"ORL": ("team_nwsl_orl", "Orlando Pride", "Orlando", "stadium_nwsl_inter_co_stadium"),
"KCC": ("team_nwsl_kcc", "Kansas City Current", "Kansas City", "stadium_nwsl_cpkc_stadium"),
"KC": ("team_nwsl_kcc", "Kansas City Current", "Kansas City", "stadium_nwsl_cpkc_stadium"), # alias
"NJY": ("team_nwsl_njy", "NJ/NY Gotham FC", "New Jersey", "stadium_nwsl_red_bull_arena"),
"NJ": ("team_nwsl_njy", "NJ/NY Gotham FC", "New Jersey", "stadium_nwsl_red_bull_arena"), # alias
"NCC": ("team_nwsl_ncc", "North Carolina Courage", "North Carolina", "stadium_nwsl_wakemed_soccer_park"),
"NC": ("team_nwsl_ncc", "North Carolina Courage", "North Carolina", "stadium_nwsl_wakemed_soccer_park"), # alias
"ORL": ("team_nwsl_orl", "Orlando Pride", "Orlando", "stadium_nwsl_interco_stadium"),
"POR": ("team_nwsl_por", "Portland Thorns", "Portland", "stadium_nwsl_providence_park"),
"RGN": ("team_nwsl_rgn", "Racing Louisville", "Louisville", "stadium_nwsl_lynn_family_stadium"),
"SD": ("team_nwsl_sd", "San Diego Wave", "San Diego", "stadium_nwsl_snapdragon_stadium"),
"SDW": ("team_nwsl_sdw", "San Diego Wave", "San Diego", "stadium_nwsl_snapdragon_stadium"),
"SD": ("team_nwsl_sdw", "San Diego Wave", "San Diego", "stadium_nwsl_snapdragon_stadium"), # alias
"SEA": ("team_nwsl_sea", "Seattle Reign", "Seattle", "stadium_nwsl_lumen_field"),
"SLC": ("team_nwsl_slc", "Utah Royals", "Utah", "stadium_nwsl_america_first_field"),
"WAS": ("team_nwsl_was", "Washington Spirit", "Washington", "stadium_nwsl_audi_field"),
"BFC": ("team_nwsl_bfc", "Bay FC", "San Francisco", "stadium_nwsl_paypal_park"),
"UTA": ("team_nwsl_uta", "Utah Royals", "Utah", "stadium_nwsl_america_first_field"),
"SLC": ("team_nwsl_uta", "Utah Royals", "Utah", "stadium_nwsl_america_first_field"), # alias
"WSH": ("team_nwsl_wsh", "Washington Spirit", "Washington", "stadium_nwsl_audi_field"),
"WAS": ("team_nwsl_wsh", "Washington Spirit", "Washington", "stadium_nwsl_audi_field"), # alias
"BAY": ("team_nwsl_bay", "Bay FC", "San Francisco", "stadium_nwsl_paypal_park"),
"BFC": ("team_nwsl_bay", "Bay FC", "San Francisco", "stadium_nwsl_paypal_park"), # alias
# Expansion teams (2026) - need to be added to teams_canonical.json
"BOS": ("team_nwsl_bos", "Boston Legacy FC", "Boston", "stadium_nwsl_gillette_stadium"),
"DEN": ("team_nwsl_den", "Denver Summit FC", "Denver", "stadium_nwsl_dicks_sporting_goods_park"),
},

View File

@@ -186,7 +186,9 @@ class BaseScraper(ABC):
sources = self._get_sources()
last_error: Optional[str] = None
sources_tried = 0
max_sources_to_try = 2 # Don't try all sources if first few return nothing
# Allow 3 sources to be tried. This enables NHL to fall back to NHL API
# for venue data since Hockey Reference doesn't provide it.
max_sources_to_try = 3
for source in sources:
self._logger.info(f"Trying source: {source}")

View File

@@ -42,7 +42,8 @@ class MLSScraper(BaseScraper):
def _get_sources(self) -> list[str]:
"""Return source list in priority order."""
return ["espn", "fbref"]
# FBref scraper not yet implemented - TODO for future
return ["espn"]
def _get_source_url(self, source: str, **kwargs) -> str:
"""Build URL for a source."""

View File

@@ -60,7 +60,8 @@ class NBAScraper(BaseScraper):
def _get_sources(self) -> list[str]:
"""Return source list in priority order."""
return ["basketball_reference", "espn", "cbs"]
# CBS scraper not yet implemented - TODO for future
return ["basketball_reference", "espn"]
def _get_source_url(self, source: str, **kwargs) -> str:
"""Build URL for a source."""

View File

@@ -48,7 +48,8 @@ class NFLScraper(BaseScraper):
def _get_sources(self) -> list[str]:
"""Return source list in priority order."""
return ["espn", "pro_football_reference", "cbs"]
# CBS scraper not yet implemented - TODO for future
return ["espn", "pro_football_reference"]
def _get_source_url(self, source: str, **kwargs) -> str:
"""Build URL for a source."""

View File

@@ -531,6 +531,16 @@ class NHLScraper(BaseScraper):
stadium_id = stadium_result.canonical_id
# Fallback: Use home team's default stadium if no venue provided
# This is common for Hockey-Reference which doesn't have venue data
if not stadium_id:
home_team_data = TEAM_MAPPINGS.get("nhl", {})
home_abbrev = self._get_abbreviation(home_result.canonical_id)
for abbrev, (team_id, _, _, default_stadium) in home_team_data.items():
if team_id == home_result.canonical_id:
stadium_id = default_stadium
break
# Get abbreviations for game ID
home_abbrev = self._get_abbreviation(home_result.canonical_id)
away_abbrev = self._get_abbreviation(away_result.canonical_id)

View File

@@ -583,25 +583,25 @@
},
{
"alias_name": "mercedes-benz stadium",
"stadium_canonical_id": "stadium_nfl_mercedesbenz_stadium",
"stadium_canonical_id": "stadium_nfl_mercedes_benz_stadium",
"valid_from": null,
"valid_until": null
},
{
"alias_name": "mercedesbenz stadium",
"stadium_canonical_id": "stadium_nfl_mercedesbenz_stadium",
"stadium_canonical_id": "stadium_nfl_mercedes_benz_stadium",
"valid_from": null,
"valid_until": null
},
{
"alias_name": "m&t bank stadium",
"stadium_canonical_id": "stadium_nfl_mt_bank_stadium",
"stadium_canonical_id": "stadium_nfl_mandt_bank_stadium",
"valid_from": null,
"valid_until": null
},
{
"alias_name": "mt bank stadium",
"stadium_canonical_id": "stadium_nfl_mt_bank_stadium",
"stadium_canonical_id": "stadium_nfl_mandt_bank_stadium",
"valid_from": null,
"valid_until": null
},
@@ -631,7 +631,7 @@
},
{
"alias_name": "cleveland browns stadium",
"stadium_canonical_id": "stadium_nfl_cleveland_browns_stadium",
"stadium_canonical_id": "stadium_nfl_huntington_bank_field",
"valid_from": null,
"valid_until": null
},
@@ -649,7 +649,7 @@
},
{
"alias_name": "empower field at mile high",
"stadium_canonical_id": "stadium_nfl_empower_field_at_mile_high",
"stadium_canonical_id": "stadium_nfl_empower_field",
"valid_from": null,
"valid_until": null
},
@@ -685,7 +685,7 @@
},
{
"alias_name": "geha field at arrowhead stadium",
"stadium_canonical_id": "stadium_nfl_geha_field_at_arrowhead_stadium",
"stadium_canonical_id": "stadium_nfl_arrowhead_stadium",
"valid_from": null,
"valid_until": null
},
@@ -787,13 +787,13 @@
},
{
"alias_name": "mercedes-benz stadium",
"stadium_canonical_id": "stadium_mls_mercedesbenz_stadium",
"stadium_canonical_id": "stadium_mls_mercedes_benz_stadium",
"valid_from": null,
"valid_until": null
},
{
"alias_name": "mercedesbenz stadium",
"stadium_canonical_id": "stadium_mls_mercedesbenz_stadium",
"stadium_canonical_id": "stadium_mls_mercedes_benz_stadium",
"valid_from": null,
"valid_until": null
},
@@ -1405,25 +1405,25 @@
},
{
"alias_name": "broncos stadium at mile high",
"stadium_canonical_id": "stadium_nfl_empower_field_at_mile_high",
"stadium_canonical_id": "stadium_nfl_empower_field",
"valid_from": "2018-09-01",
"valid_until": "2019-08-31"
},
{
"alias_name": "sports authority field at mile high",
"stadium_canonical_id": "stadium_nfl_empower_field_at_mile_high",
"stadium_canonical_id": "stadium_nfl_empower_field",
"valid_from": "2011-08-01",
"valid_until": "2018-08-31"
},
{
"alias_name": "invesco field at mile high",
"stadium_canonical_id": "stadium_nfl_empower_field_at_mile_high",
"stadium_canonical_id": "stadium_nfl_empower_field",
"valid_from": "2001-09-01",
"valid_until": "2011-07-31"
},
{
"alias_name": "mile high stadium",
"stadium_canonical_id": "stadium_nfl_empower_field_at_mile_high",
"stadium_canonical_id": "stadium_nfl_empower_field",
"valid_from": "1960-01-01",
"valid_until": "2001-08-31"
},
@@ -1531,7 +1531,7 @@
},
{
"alias_name": "arrowhead stadium",
"stadium_canonical_id": "stadium_nfl_geha_field_at_arrowhead_stadium",
"stadium_canonical_id": "stadium_nfl_arrowhead_stadium",
"valid_from": "1972-08-01",
"valid_until": null
},
@@ -1924,5 +1924,113 @@
"stadium_canonical_id": "stadium_mlb_journey_bank_ballpark",
"valid_from": null,
"valid_until": null
},
{
"alias_name": "mortgage matchup center",
"stadium_canonical_id": "stadium_nba_rocket_mortgage_fieldhouse",
"valid_from": "2025-01-01",
"valid_until": null
},
{
"alias_name": "xfinity mobile arena",
"stadium_canonical_id": "stadium_nba_intuit_dome",
"valid_from": "2025-01-01",
"valid_until": null
},
{
"alias_name": "rocket arena",
"stadium_canonical_id": "stadium_nba_toyota_center",
"valid_from": "2025-01-01",
"valid_until": null
},
{
"alias_name": "mexico city arena",
"stadium_canonical_id": "stadium_nba_mexico_city_arena",
"valid_from": null,
"valid_until": null
},
{
"alias_name": "arena cdmx",
"stadium_canonical_id": "stadium_nba_mexico_city_arena",
"valid_from": null,
"valid_until": null
},
{
"alias_name": "scottsmiracle-gro field",
"stadium_canonical_id": "stadium_mls_lowercom_field",
"valid_from": "2025-01-01",
"valid_until": null
},
{
"alias_name": "scotts miracle-gro field",
"stadium_canonical_id": "stadium_mls_lowercom_field",
"valid_from": "2025-01-01",
"valid_until": null
},
{
"alias_name": "energizer park",
"stadium_canonical_id": "stadium_mls_citypark",
"valid_from": "2025-01-01",
"valid_until": null
},
{
"alias_name": "sports illustrated stadium",
"stadium_canonical_id": "stadium_mls_red_bull_arena",
"valid_from": "2025-01-01",
"valid_until": null
},
{
"alias_name": "sports illustrated stadium",
"stadium_canonical_id": "stadium_nwsl_red_bull_arena",
"valid_from": "2025-01-01",
"valid_until": null
},
{
"alias_name": "soldier field",
"stadium_canonical_id": "stadium_nwsl_soldier_field",
"valid_from": null,
"valid_until": null
},
{
"alias_name": "oracle park",
"stadium_canonical_id": "stadium_nwsl_oracle_park",
"valid_from": null,
"valid_until": null
},
{
"alias_name": "carefirst arena",
"stadium_canonical_id": "stadium_wnba_entertainment_sports_arena",
"valid_from": "2025-01-01",
"valid_until": null
},
{
"alias_name": "care first arena",
"stadium_canonical_id": "stadium_wnba_entertainment_sports_arena",
"valid_from": "2025-01-01",
"valid_until": null
},
{
"alias_name": "mortgage matchup center",
"stadium_canonical_id": "stadium_wnba_rocket_mortgage_fieldhouse",
"valid_from": "2025-01-01",
"valid_until": null
},
{
"alias_name": "state farm arena",
"stadium_canonical_id": "stadium_wnba_state_farm_arena",
"valid_from": null,
"valid_until": null
},
{
"alias_name": "cfg bank arena",
"stadium_canonical_id": "stadium_wnba_cfg_bank_arena",
"valid_from": null,
"valid_until": null
},
{
"alias_name": "purcell pavilion",
"stadium_canonical_id": "stadium_wnba_purcell_pavilion",
"valid_from": null,
"valid_until": null
}
]

View File

@@ -606,5 +606,29 @@
"alias_value": "Miami",
"valid_from": "1993-01-01",
"valid_until": "1998-12-31"
},
{
"id": "alias_nfl_77",
"team_canonical_id": "team_nfl_was",
"alias_type": "name",
"alias_value": "Washington Redskins",
"valid_from": "1937-01-01",
"valid_until": "2020-07-13"
},
{
"id": "alias_nfl_78",
"team_canonical_id": "team_nfl_was",
"alias_type": "name",
"alias_value": "Washington Football Team",
"valid_from": "2020-07-13",
"valid_until": "2022-02-02"
},
{
"id": "alias_nfl_79",
"team_canonical_id": "team_nfl_was",
"alias_type": "abbreviation",
"alias_value": "WFT",
"valid_from": "2020-07-13",
"valid_until": "2022-02-02"
}
]

View File

@@ -0,0 +1,99 @@
#!/usr/bin/env python3
"""Validate alias files for orphan references and format issues.
This script checks stadium_aliases.json and team_aliases.json for:
1. Orphan references (aliases pointing to non-existent canonical IDs)
2. JSON syntax errors
3. Required field presence
Usage:
python validate_aliases.py
Returns exit code 0 on success, 1 on failure.
"""
import json
import sys
from pathlib import Path
# Add parent to path for imports
sys.path.insert(0, str(Path(__file__).parent))
from sportstime_parser.normalizers.stadium_resolver import STADIUM_MAPPINGS
from sportstime_parser.normalizers.team_resolver import TEAM_MAPPINGS
def main() -> int:
"""Run validation checks on alias files."""
errors: list[str] = []
# Build valid stadium ID set
valid_stadium_ids: set[str] = set()
for sport_stadiums in STADIUM_MAPPINGS.values():
for stadium_id in sport_stadiums.keys():
valid_stadium_ids.add(stadium_id)
# Build valid team ID set
valid_team_ids: set[str] = set()
for sport_teams in TEAM_MAPPINGS.values():
for abbrev, team_data in sport_teams.items():
valid_team_ids.add(team_data[0]) # team_id is first element
print(f"Valid stadium IDs: {len(valid_stadium_ids)}")
print(f"Valid team IDs: {len(valid_team_ids)}")
print()
# Check stadium aliases
try:
stadium_aliases = json.load(open("stadium_aliases.json"))
print(f"✓ stadium_aliases.json: Valid JSON ({len(stadium_aliases)} aliases)")
for alias in stadium_aliases:
# Check required fields
if "alias_name" not in alias:
errors.append(f"Stadium alias missing 'alias_name': {alias}")
if "stadium_canonical_id" not in alias:
errors.append(f"Stadium alias missing 'stadium_canonical_id': {alias}")
elif alias["stadium_canonical_id"] not in valid_stadium_ids:
errors.append(
f"Orphan stadium alias: '{alias.get('alias_name', '?')}' -> "
f"'{alias['stadium_canonical_id']}'"
)
except FileNotFoundError:
errors.append("stadium_aliases.json not found")
except json.JSONDecodeError as e:
errors.append(f"stadium_aliases.json: Invalid JSON - {e}")
# Check team aliases
try:
team_aliases = json.load(open("team_aliases.json"))
print(f"✓ team_aliases.json: Valid JSON ({len(team_aliases)} aliases)")
for alias in team_aliases:
# Check required fields
if "team_canonical_id" not in alias:
errors.append(f"Team alias missing 'team_canonical_id': {alias}")
elif alias["team_canonical_id"] not in valid_team_ids:
errors.append(
f"Orphan team alias: '{alias.get('alias_value', '?')}' -> "
f"'{alias['team_canonical_id']}'"
)
except FileNotFoundError:
errors.append("team_aliases.json not found")
except json.JSONDecodeError as e:
errors.append(f"team_aliases.json: Invalid JSON - {e}")
# Report results
print()
if errors:
print(f"❌ Validation failed with {len(errors)} error(s):")
for error in errors:
print(f" - {error}")
return 1
print("✅ All aliases valid")
return 0
if __name__ == "__main__":
sys.exit(main())