feat(scripts): complete data pipeline remediation

Scripts changes:
- Add WNBA abbreviation aliases to team_resolver.py
- Fix NHL stadium coordinates in stadium_resolver.py
- Add validate_aliases.py script for orphan detection
- Update scrapers with improved error handling
- Add DATA_AUDIT.md and REMEDIATION_PLAN.md documentation
- Update alias JSON files with new mappings

iOS bundle updates:
- Update games_canonical.json with latest scraped data
- Update teams_canonical.json and stadiums_canonical.json
- Sync alias files with Scripts versions

All 5 remediation phases complete.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
805
Scripts/docs/DATA_AUDIT.md
Normal file
@@ -0,0 +1,805 @@
# SportsTime Data Audit Report

**Generated:** 2026-01-20
**Scope:** NBA, MLB, NFL, NHL, MLS, WNBA, NWSL
**Data Pipeline:** Scripts → CloudKit → iOS App

---

## Executive Summary

The data audit identified **15 issues** across the SportsTime data pipeline, with significant gaps in source reliability, stadium resolution, and iOS data freshness.

| Severity | Count | Description |
|----------|-------|-------------|
| **Critical** | 1 | iOS bundled data severely outdated |
| **High** | 4 | Single-source sports, NHL stadium data, NBA naming rights |
| **Medium** | 6 | Alias gaps, outdated config, silent game exclusion |
| **Low** | 4 | Minor configuration and coverage issues |

### Key Findings

**Data Pipeline Health:**
- ✅ **Canonical ID system**: 100% format compliance across 7,186 IDs
- ✅ **Team mappings**: All 183 teams correctly mapped with current abbreviations
- ✅ **Referential integrity**: Zero orphan references (0 games pointing to non-existent teams/stadiums)
- ⚠️ **Stadium resolution**: 1,466 games (21.6%) have unresolved stadiums

**Critical Risks:**
1. **ESPN single-point-of-failure** for WNBA, NWSL, MLS - if ESPN changes, 3 sports lose all data
2. **NHL has 100% missing stadiums** - Hockey Reference provides no venue data
3. **iOS bundled data 27% behind** - 1,820 games missing from first-launch experience

**Root Causes:**
- Stadium naming rights changed faster than alias updates (2024-2025)
- Fallback source limit (`max_sources_to_try = 2`) prevents third source from being tried
- Hockey Reference source limitation (no venue info) combined with fallback limit
- iOS bundled JSON not updated with latest pipeline output

---

## Phase Status Tracking

| Phase | Status | Issues Found |
|-------|--------|--------------|
| 1. Hardcoded Mapping Audit | ✅ COMPLETE | 1 Low |
| 2. Alias File Completeness | ✅ COMPLETE | 1 Medium, 1 Low |
| 3. Scraper Source Reliability | ✅ COMPLETE | 2 High, 1 Medium |
| 4. Game Count & Coverage | ✅ COMPLETE | 2 High, 2 Medium, 1 Low |
| 5. Canonical ID Consistency | ✅ COMPLETE | 0 issues |
| 6. Referential Integrity | ✅ COMPLETE | 1 Medium (NHL source) |
| 7. iOS Data Reception | ✅ COMPLETE | 1 Critical, 1 Medium, 1 Low |

---

## Phase 1 Results: Hardcoded Mapping Audit

**Files Audited:**
- `sportstime_parser/normalizers/team_resolver.py` (TEAM_MAPPINGS)
- `sportstime_parser/normalizers/stadium_resolver.py` (STADIUM_MAPPINGS)

### Team Counts

| Sport | Hardcoded | Expected | Abbreviations | Status |
|-------|-----------|----------|---------------|--------|
| NBA | 30 | 30 | 38 | ✅ |
| MLB | 30 | 30 | 38 | ✅ |
| NFL | 32 | 32 | 40 | ✅ |
| NHL | 32 | 32 | 41 | ✅ |
| MLS | 30 | 30* | 32 | ✅ |
| WNBA | 13 | 13 | 13 | ✅ |
| NWSL | 16 | 16 | 24 | ✅ |

*MLS: 29 original teams + San Diego FC (2025 expansion) = 30

### Stadium Counts

| Sport | Hardcoded | Notes | Status |
|-------|-----------|-------|--------|
| NBA | 30 | 1 per team | ✅ |
| MLB | 57 | 30 regular + 18 spring training + 9 special venues | ✅ |
| NFL | 30 | Includes shared venues (SoFi Stadium: LAR+LAC, MetLife: NYG+NYJ) | ✅ |
| NHL | 32 | 1 per team | ✅ |
| MLS | 30 | 1 per team | ✅ |
| WNBA | 13 | 1 per team | ✅ |
| NWSL | 19 | 14 current + 5 expansion team venues (Boston/Denver) | ✅ |

### Recent Updates Verification

| Update | Type | Status | Notes |
|--------|------|--------|-------|
| Utah Hockey Club (NHL) | Relocation | ✅ Present | ARI + UTA abbreviations both map to `team_nhl_ari` |
| Golden State Valkyries (WNBA) | Expansion 2025 | ✅ Present | `team_wnba_gsv` with Chase Center venue |
| Boston Legacy FC (NWSL) | Expansion 2026 | ✅ Present | `team_nwsl_bos` with Gillette Stadium |
| Denver Summit FC (NWSL) | Expansion 2026 | ✅ Present | `team_nwsl_den` with Dick's Sporting Goods Park |
| Oakland A's → Sacramento | Temporary relocation | ✅ Present | `stadium_mlb_sutter_health_park` |
| San Diego FC (MLS) | Expansion 2025 | ✅ Present | `team_mls_sd` with Snapdragon Stadium |
| FedExField → Northwest Stadium | Naming rights | ✅ Present | `stadium_nfl_northwest_stadium` |

### NFL Stadium Sharing

| Stadium | Teams | Status |
|---------|-------|--------|
| SoFi Stadium | LAR, LAC | ✅ Correct |
| MetLife Stadium | NYG, NYJ | ✅ Correct |

### Issues Found

| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 1 | WNBA single abbreviations | Low | All 13 WNBA teams have only 1 abbreviation each. May need additional abbreviations for source compatibility. |

### Phase 1 Summary

**Result: PASS** - All team and stadium mappings are complete and up-to-date with 2025-2026 changes.

- ✅ All 7 sports have correct team counts
- ✅ All stadium counts are appropriate (including spring training, special venues)
- ✅ Recent franchise moves/expansions are reflected
- ✅ Stadium sharing is correctly handled
- ✅ Naming rights updates are current

---

## Phase 2 Results: Alias File Completeness

**Files Audited:**
- `Scripts/team_aliases.json`
- `Scripts/stadium_aliases.json`

### Team Aliases Summary

| Sport | Entries | Coverage | Status |
|-------|---------|----------|--------|
| MLB | 23 | Historical relocations/renames | ✅ |
| NBA | 29 | Historical relocations/renames | ✅ |
| NHL | 24 | Historical relocations/renames | ✅ |
| NFL | 0 | **No aliases** | ⚠️ |
| MLS | 0 | No aliases (newer league) | ✅ |
| WNBA | 0 | No aliases (newer league) | ✅ |
| NWSL | 0 | No aliases (newer league) | ✅ |
| **Total** | **76** | | |

- All 76 entries have valid date ranges
- No orphan team alias references (every referenced canonical team ID exists in the mappings)

### Stadium Aliases Summary

| Sport | Entries | Coverage | Status |
|-------|---------|----------|--------|
| MLB | 109 | Regular + spring training + special venues | ✅ |
| NFL | 65 | Naming rights history | ✅ |
| NBA | 44 | Naming rights history | ✅ |
| NHL | 39 | Naming rights history | ✅ |
| MLS | 35 | Current + naming variants | ✅ |
| WNBA | 15 | Current + naming variants | ✅ |
| NWSL | 14 | Current + naming variants | ✅ |
| **Total** | **321** | | |

- 65 entries have date ranges (historical naming rights)
- 256 entries are permanent aliases (no date restrictions)

### Orphan Reference Check

| Type | Count | Status |
|------|-------|--------|
| Team aliases with invalid references | 0 | ✅ |
| Stadium aliases with invalid references | **5** | ❌ |

**Orphan Stadium References Found:**

| Alias Name | References (Invalid) | Correct ID |
|------------|---------------------|------------|
| Broncos Stadium at Mile High | `stadium_nfl_empower_field_at_mile_high` | `stadium_nfl_empower_field` |
| Sports Authority Field at Mile High | `stadium_nfl_empower_field_at_mile_high` | `stadium_nfl_empower_field` |
| Invesco Field at Mile High | `stadium_nfl_empower_field_at_mile_high` | `stadium_nfl_empower_field` |
| Mile High Stadium | `stadium_nfl_empower_field_at_mile_high` | `stadium_nfl_empower_field` |
| Arrowhead Stadium | `stadium_nfl_geha_field_at_arrowhead_stadium` | `stadium_nfl_arrowhead_stadium` |
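
The orphan check behind the table above can be sketched as a set-membership test. This is a minimal illustration, not the project's `validate_aliases.py`: the field names `alias_name` and `stadium_canonical_id` and the sample data are assumptions made for the example.

```python
from typing import Iterable


def find_orphan_aliases(aliases: Iterable[dict], canonical_ids: set[str]) -> list[dict]:
    """Return alias entries whose target ID is not a known canonical stadium ID."""
    return [a for a in aliases if a["stadium_canonical_id"] not in canonical_ids]


# Hypothetical sample data mirroring two rows of the orphan table.
canonical_ids = {"stadium_nfl_empower_field", "stadium_nfl_arrowhead_stadium"}
aliases = [
    {"alias_name": "Mile High Stadium",
     "stadium_canonical_id": "stadium_nfl_empower_field_at_mile_high"},
    {"alias_name": "Arrowhead Stadium",
     "stadium_canonical_id": "stadium_nfl_arrowhead_stadium"},
]

orphans = find_orphan_aliases(aliases, canonical_ids)
for entry in orphans:
    print(f"orphan: {entry['alias_name']} -> {entry['stadium_canonical_id']}")
```

Running a check like this whenever the alias JSON changes would likely catch ID renames before they cause resolution failures.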

### Historical Changes Coverage

| Historical Name | Current Team | In Aliases? |
|-----------------|--------------|-------------|
| Montreal Expos | Washington Nationals | ✅ |
| Seattle SuperSonics | Oklahoma City Thunder | ✅ |
| Arizona Coyotes | Utah Hockey Club | ✅ |
| Cleveland Indians | Cleveland Guardians | ✅ |
| Hartford Whalers | Carolina Hurricanes | ✅ |
| Quebec Nordiques | Colorado Avalanche | ✅ |
| Vancouver Grizzlies | Memphis Grizzlies | ✅ |
| Washington Redskins | Washington Commanders | ❌ Missing |
| Washington Football Team | Washington Commanders | ❌ Missing |
| Brooklyn Dodgers | Los Angeles Dodgers | ❌ Missing |

### Issues Found

| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 2 | Orphan stadium alias references | Medium | 5 stadium aliases point to non-existent canonical IDs (`stadium_nfl_empower_field_at_mile_high`, `stadium_nfl_geha_field_at_arrowhead_stadium`). Causes resolution failures for historical Denver/KC stadium names. |
| 3 | No NFL team aliases | Low | Missing Washington Redskins/Football Team historical names. Limits historical game matching for NFL. |

### Phase 2 Summary

**Result: PASS with issues** - Alias files cover most historical changes but have referential integrity bugs.

- ✅ Team aliases cover MLB/NBA/NHL historical changes
- ✅ Stadium aliases cover naming rights changes across all sports
- ✅ No date range validation errors
- ❌ 5 orphan stadium references need fixing
- ⚠️ No NFL team aliases (Washington Redskins/Football Team missing)

---

## Phase 3 Results: Scraper Source Reliability

**Files Audited:**
- `sportstime_parser/scrapers/base.py` (fallback logic)
- `sportstime_parser/scrapers/nba.py`, `mlb.py`, `nfl.py`, `nhl.py`, `mls.py`, `wnba.py`, `nwsl.py`

### Source Dependency Matrix

| Sport | Primary | Status | Fallback 1 | Status | Fallback 2 | Status | Risk |
|-------|---------|--------|------------|--------|------------|--------|------|
| NBA | basketball_reference | ✅ | espn | ✅ | cbs | ❌ NOT IMPL | Medium |
| MLB | mlb_api | ✅ | espn | ✅ | baseball_reference | ✅ | Low |
| NFL | espn | ✅ | pro_football_reference | ✅ | cbs | ❌ NOT IMPL | Medium |
| NHL | hockey_reference | ✅ | nhl_api | ✅ | espn | ✅ | Low |
| MLS | espn | ✅ | fbref | ❌ NOT IMPL | - | - | **HIGH** |
| WNBA | espn | ✅ | - | - | - | - | **HIGH** |
| NWSL | espn | ✅ | - | - | - | - | **HIGH** |

### Unimplemented Sources

| Sport | Source | Line | Status |
|-------|--------|------|--------|
| NBA | cbs | `nba.py:421` | `raise NotImplementedError("CBS scraper not implemented")` |
| NFL | cbs | `nfl.py:386` | `raise NotImplementedError("CBS scraper not implemented")` |
| MLS | fbref | `mls.py:214` | `raise NotImplementedError("FBref scraper not implemented")` |

### Fallback Logic Analysis

**File:** `base.py:189`

```python
max_sources_to_try = 2  # Don't try all sources if first few return nothing
```

**Impact:**
- Even if 3 sources are declared, only 2 are tried
- If sources 1 and 2 fail, source 3 is never attempted
- This limits resilience for NBA, MLB, NFL, NHL which have 3 sources
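
The effect of the cap can be shown with a small fallback loop. This is a sketch under assumptions, not the code in `base.py`: the function `scrape_with_fallback`, the stub sources, and the name `third_source` are hypothetical, and the only detail taken from the audit is the limit of 2.

```python
from typing import Callable, Optional

Source = tuple[str, Callable[[], list[dict]]]


def scrape_with_fallback(
    sources: list[Source],
    max_sources_to_try: Optional[int] = 2,
) -> tuple[list[dict], list[str]]:
    """Try sources in order; return (games, names of the sources attempted)."""
    attempted: list[str] = []
    limit = len(sources) if max_sources_to_try is None else max_sources_to_try
    for name, fetch in sources[:limit]:
        attempted.append(name)
        try:
            games = fetch()
        except Exception:
            continue  # this source raised; fall through to the next one
        if games:
            return games, attempted
    return [], attempted


def failing() -> list[dict]:
    raise RuntimeError("source unavailable")


def empty() -> list[dict]:
    return []


def working() -> list[dict]:
    return [{"id": "game_nba_2025_20251021_hou_okc"}]


sources = [("basketball_reference", failing), ("espn", empty), ("third_source", working)]

capped_games, capped_attempted = scrape_with_fallback(sources)      # limit of 2
full_games, full_attempted = scrape_with_fallback(sources, None)    # no limit
```

With the cap in place the third source is never called even though it would have returned data; passing `None` lets the loop reach it.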

### International Game Filtering

| Sport | Hardcoded Locations | Notes |
|-------|---------------------|-------|
| NFL | London, Mexico City, Frankfurt, Munich, São Paulo | ✅ Complete for 2025 |
| NHL | Prague, Stockholm, Helsinki, Tampere, Gothenburg | ✅ Complete for 2025 |
| NBA | None | ⚠️ No international filtering (Abu Dhabi games?) |
| MLB | None | ⚠️ No international filtering (Mexico City games?) |
| MLS | None | N/A (domestic only) |
| WNBA | None | N/A (domestic only) |
| NWSL | None | N/A (domestic only) |

### Single Point of Failure Risk

| Sport | Primary Source | If ESPN Fails... | Risk Level |
|-------|----------------|------------------|------------|
| WNBA | ESPN only | **Complete data loss** | Critical |
| NWSL | ESPN only | **Complete data loss** | Critical |
| MLS | ESPN only (fbref not impl) | **Complete data loss** | Critical |
| NBA | Basketball-Ref → ESPN | ESPN fallback available | Low |
| NFL | ESPN → Pro-Football-Ref | Fallback available | Low |
| NHL | Hockey-Ref → NHL API → ESPN | Two fallbacks | Very Low |
| MLB | MLB API → ESPN → B-Ref | Two fallbacks | Very Low |

### Issues Found

| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 4 | WNBA/NWSL/MLS single source | High | ESPN is the only working source for 3 sports. If ESPN changes or fails, data collection completely stops. |
| 5 | max_sources_to_try = 2 | High | Third fallback source never tried even if available. Reduces resilience for NBA/MLB/NFL/NHL. |
| 6 | CBS/FBref not implemented | Medium | Declared fallback sources raise NotImplementedError. Appears functional in config but fails at runtime. |

### Phase 3 Summary

**Result: FAIL** - Critical single-point-of-failure for 3 sports.

- ❌ WNBA, NWSL, MLS have only ESPN (no resilience)
- ❌ Fallback limit of 2 prevents third source from being tried
- ⚠️ CBS and FBref declared but not implemented
- ✅ MLB and NHL have full fallback chains
- ✅ International game filtering present for NFL/NHL

---

## Phase 4 Results: Game Count & Coverage

**Files Audited:**
- `Scripts/output/games_*.json` (all 2025 season files)
- `Scripts/output/validation_*.md` (all validation reports)
- `sportstime_parser/config.py` (EXPECTED_GAME_COUNTS)

### Coverage Summary

| Sport | Scraped | Expected | Coverage | Status |
|-------|---------|----------|----------|--------|
| NBA | 1,231 | 1,230 | 100.1% | ✅ |
| MLB | 2,866 | 2,430 | 117.9% | ⚠️ Includes spring training |
| NFL | 330 | 272 | 121.3% | ⚠️ Includes preseason/playoffs |
| NHL | 1,312 | 1,312 | 100.0% | ✅ |
| MLS | 542 | 493 | 109.9% | ✅ Includes playoffs |
| WNBA | 322 | 220 | **146.4%** | ⚠️ Expected count outdated |
| NWSL | 189 | 182 | 103.8% | ✅ |

### Date Range Analysis

| Sport | Start Date | End Date | Notes |
|-------|------------|----------|-------|
| NBA | 2025-10-21 | 2026-04-12 | Regular season only |
| MLB | 2025-03-01 | 2025-11-02 | Includes spring training (417 games in March) |
| NFL | 2025-08-01 | 2026-01-25 | Includes preseason (49 in Aug) + playoffs (28 in Jan) |
| NHL | 2025-10-07 | 2026-04-16 | Regular season only |
| MLS | 2025-02-22 | 2025-11-30 | Regular season + playoffs |
| WNBA | 2025-05-02 | 2025-10-11 | Regular season + playoffs |
| NWSL | 2025-03-15 | 2025-11-23 | Regular season + playoffs |

### Game Status Distribution

All games across all sports have status `unknown` - game status is not being properly parsed from sources.

### Duplicate Game Detection

| Sport | Duplicates Found | Details |
|-------|-----------------|---------|
| NBA | 0 | ✅ |
| MLB | 1 | `game_mlb_2025_20250508_det_col_1` appears twice (doubleheader handling issue) |
| NFL | 0 | ✅ |
| NHL | 0 | ✅ |
| MLS | 0 | ✅ |
| WNBA | 0 | ✅ |
| NWSL | 0 | ✅ |
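
A duplicate check of this kind amounts to counting canonical IDs. The sketch below assumes each game record carries a `canonical_id` key; the second doubleheader ID (`..._det_col_2`) is a made-up example of what correct numbering would look like.

```python
from collections import Counter


def find_duplicate_ids(games: list[dict]) -> dict[str, int]:
    """Map each canonical ID that occurs more than once to its occurrence count."""
    counts = Counter(game["canonical_id"] for game in games)
    return {game_id: n for game_id, n in counts.items() if n > 1}


games = [
    {"canonical_id": "game_mlb_2025_20250508_det_col_1"},
    {"canonical_id": "game_mlb_2025_20250508_det_col_1"},  # second game reused suffix _1
    {"canonical_id": "game_mlb_2025_20250508_det_col_2"},  # hypothetical correct suffix
]

duplicates = find_duplicate_ids(games)
print(duplicates)
```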

### Validation Report Analysis

| Sport | Total Games | Unresolved Teams | Unresolved Stadiums | Manual Review Items |
|-------|-------------|------------------|---------------------|---------------------|
| NBA | 1,231 | 0 | **131** | 131 |
| MLB | 2,866 | 12 | 4 | 20 |
| NFL | 330 | 1 | 5 | 11 |
| NHL | 1,312 | 0 | 0 | **1,312** (all missing stadiums) |
| MLS | 542 | 1 | **64** | 129 |
| WNBA | 322 | 5 | **65** | 135 |
| NWSL | 189 | 0 | **16** | 32 |

### Top Unresolved Stadium Names (Recent Naming Rights)

| Stadium Name | Occurrences | Actual Venue | Issue |
|--------------|-------------|--------------|-------|
| Sports Illustrated Stadium | 11 | Formerly Red Bull Arena (New York Red Bulls) | Naming rights change, missing alias |
| Mortgage Matchup Center | 8 | Formerly Footprint Center (PHX) | 2025 naming rights change |
| ScottsMiracle-Gro Field | 4 | Formerly Lower.com Field (Columbus Crew) | Naming rights change, missing alias |
| Energizer Park | 3 | Formerly CityPark (St. Louis CITY SC) | Naming rights change, missing alias |
| Xfinity Mobile Arena | 3 | Formerly Wells Fargo Center (PHI) | 2025 naming rights change |
| Rocket Arena | 3 | Formerly Rocket Mortgage FieldHouse (CLE) | Naming rights change |
| CareFirst Arena | 2 | Washington Mystics venue | New WNBA venue name |

### Unresolved Teams (Exhibition/International)

| Team Name | Sport | Type | Games |
|-----------|-------|------|-------|
| BRAZIL | WNBA | International exhibition | 2 |
| Toyota Antelopes | WNBA | Japanese team | 2 |
| TEAM CLARK | WNBA | All-Star Game | 1 |
| (Various MLB) | MLB | International teams | 12 |
| (MLS international) | MLS | CCL/exhibition | 1 |
| (NFL preseason) | NFL | Pre-season exhibition | 1 |

### NHL Stadium Data Issue

**Critical:** Hockey Reference does not provide stadium data. All 1,312 NHL games have `raw_stadium: None`, causing 100% of games to have missing stadium IDs. The NHL fallback sources (NHL API, ESPN) should provide this data, but the `max_sources_to_try = 2` limit combined with Hockey Reference success means fallbacks are never attempted.

### Expected Count Updates Needed

| Sport | Current Expected | Recommended | Reason |
|-------|------------------|-------------|--------|
| WNBA | 220 | **286** | 13 teams × 44 games / 2 (expanded with Golden State Valkyries) |
| NFL | 272 | 272 (filter preseason) | Or document that 330 includes preseason |
| MLB | 2,430 | 2,430 (filter spring training) | Or document that 2,866 includes spring training |
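
The recommended WNBA figure follows from the arithmetic in the table: every game involves two teams, so the league total is teams × games per team / 2. A short sketch, with hypothetical helper names, reproduces that number and the coverage percentages:

```python
def expected_regular_season_games(teams: int, games_per_team: int) -> int:
    """League-wide game count: each game is shared by two teams."""
    total_team_games = teams * games_per_team
    if total_team_games % 2:
        raise ValueError("teams * games_per_team must be even")
    return total_team_games // 2


def coverage_pct(scraped: int, expected: int) -> float:
    """Scraped games as a percentage of the expected count, to one decimal."""
    return round(100 * scraped / expected, 1)


wnba_expected = expected_regular_season_games(teams=13, games_per_team=44)
wnba_coverage_old = coverage_pct(322, 220)            # against the outdated config value
wnba_coverage_new = coverage_pct(322, wnba_expected)  # against the recommended value

print(wnba_expected, wnba_coverage_old, wnba_coverage_new)
```

Against the recommended value, the 322 scraped games still exceed 100% because the scrape also includes playoff and exhibition games.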

### Issues Found

| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 7 | NHL has no stadium data | High | Hockey Reference provides no venue info. All 1,312 games missing stadium_id. Fallback sources not tried. |
| 8 | 131 NBA stadium resolution failures | High | Recent naming rights changes ("Mortgage Matchup Center", "Xfinity Mobile Arena") not in aliases. |
| 9 | Outdated WNBA expected count | Medium | Config says 220 but WNBA expanded to 13 teams in 2025; actual is 322 (286 regular + playoffs). |
| 10 | MLS/WNBA stadium alias gaps | Medium | 64 MLS + 65 WNBA unresolved stadiums from new/renamed venues. |
| 11 | Game status not parsed | Low | All games have status `unknown` instead of final/scheduled/postponed. |

### Phase 4 Summary

**Result: FAIL** - Significant stadium resolution failures across multiple sports.

- ❌ 131 NBA games missing stadium (naming rights changes)
- ❌ 1,312 NHL games missing stadium (source doesn't provide data)
- ❌ 64 MLS + 65 WNBA stadiums unresolved (new/renamed venues)
- ⚠️ WNBA expected count severely outdated (220 vs 322 actual)
- ⚠️ MLB/NFL counts include spring training/preseason games
- ✅ No significant duplicate games (1 MLB doubleheader edge case)
- ✅ All teams resolved except exhibition/international games

---

## Phase 5 Results: Canonical ID Consistency

**Files Audited:**
- `sportstime_parser/normalizers/canonical_id.py` (Python ID generation)
- `SportsTime/Core/Models/Local/CanonicalModels.swift` (iOS models)
- `SportsTime/Core/Services/BootstrapService.swift` (iOS JSON parsing)
- All `Scripts/output/*.json` files (generated IDs)

### Format Validation

| Type | Total IDs | Valid | Invalid | Pass Rate |
|------|-----------|-------|---------|-----------|
| Team | 183 | 183 | 0 | 100.0% ✅ |
| Stadium | 211 | 211 | 0 | 100.0% ✅ |
| Game | 6,792 | 6,792 | 0 | 100.0% ✅ |

### ID Format Patterns (all validated)

```
Teams:    team_{sport}_{abbrev}                                  → team_nba_lal
Stadiums: stadium_{sport}_{normalized_name}                      → stadium_nba_cryptocom_arena
Games:    game_{sport}_{season}_{YYYYMMDD}_{away}_{home}[_{#}]
                                                                 → game_nba_2025_20251021_hou_okc
```

### Normalization Quality

| Check | Result |
|-------|--------|
| Double underscores (`__`) | 0 found ✅ |
| Leading/trailing underscores | 0 found ✅ |
| Uppercase letters | 0 found ✅ |
| Special characters | 0 found ✅ |
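
The format patterns and normalization checks above can be expressed as regular expressions. This is a sketch, not the validator used for the audit: the patterns assume 2 to 4 lowercase letters per team abbreviation, a 4-digit season, and lowercase alphanumeric name parts separated by single underscores.

```python
import re

SPORTS = "nba|mlb|nfl|nhl|mls|wnba|nwsl"

# Assumed patterns derived from the documented ID formats.
TEAM_RE = re.compile(rf"^team_({SPORTS})_[a-z]{{2,4}}$")
STADIUM_RE = re.compile(rf"^stadium_({SPORTS})_[a-z0-9]+(?:_[a-z0-9]+)*$")
GAME_RE = re.compile(
    rf"^game_({SPORTS})_\d{{4}}_\d{{8}}_[a-z]{{2,4}}_[a-z]{{2,4}}(?:_\d+)?$"
)


def is_valid_id(canonical_id: str) -> bool:
    """True if the ID matches the team, stadium, or game pattern."""
    return any(p.match(canonical_id) for p in (TEAM_RE, STADIUM_RE, GAME_RE))


for candidate in (
    "team_nba_lal",
    "stadium_nba_cryptocom_arena",
    "game_nba_2025_20251021_hou_okc",
    "stadium_nba__arena",   # double underscore
    "team_NBA_LAL",         # uppercase letters
):
    print(candidate, is_valid_id(candidate))
```

Because each name part must contain at least one character, the stadium pattern rejects double, leading, and trailing underscores without separate checks.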

### Abbreviation Lengths (Teams)

| Length | Count |
|--------|-------|
| 2 chars | 21 |
| 3 chars | 161 |
| 4 chars | 1 |

### Stadium ID Lengths

- Minimum: 8 characters
- Maximum: 29 characters
- Average: 16.2 characters

### iOS Cross-Compatibility

| Aspect | Status | Notes |
|--------|--------|-------|
| Field naming convention | ✅ Compatible | Python uses snake_case; iOS `BootstrapService` uses matching Codable structs |
| Deterministic UUID generation | ✅ Compatible | iOS uses SHA256 hash of canonical_id - matches any valid string |
| Schema version | ✅ Compatible | Both use version 1 |
| Required fields | ✅ Present | All iOS-required fields present in JSON output |

### Field Mapping (Python → iOS)

| Python Field | iOS Field | Notes |
|--------------|-----------|-------|
| `canonical_id` | `canonicalId` | Mapped via `JSONCanonicalStadium.canonical_id` → `CanonicalStadium.canonicalId` |
| `home_team_canonical_id` | `homeTeamCanonicalId` | Explicit mapping in BootstrapService |
| `away_team_canonical_id` | `awayTeamCanonicalId` | Explicit mapping in BootstrapService |
| `stadium_canonical_id` | `stadiumCanonicalId` | Explicit mapping in BootstrapService |
| `game_datetime_utc` | `dateTime` | ISO 8601 parsing with fallback to legacy format |

### Issues Found

**No issues found.** All canonical IDs are:
- Correctly formatted according to defined patterns
- Properly normalized (lowercase, no special characters)
- Deterministic (same input produces same output)
- Compatible with iOS parsing

### Phase 5 Summary

**Result: PASS** - All canonical IDs are consistent and iOS-compatible.

- ✅ 100% format validation pass rate across 7,186 IDs
- ✅ No normalization issues found
- ✅ iOS BootstrapService explicitly handles snake_case → camelCase mapping
- ✅ Deterministic UUID generation using SHA256 hash

---

## Phase 6 Results: Referential Integrity

**Files Audited:**
- `Scripts/output/games_*_2025.json`
- `Scripts/output/teams_*.json`
- `Scripts/output/stadiums_*.json`

### Game → Team References

| Sport | Total Games | Valid Home | Valid Away | Orphan Home | Orphan Away | Status |
|-------|-------------|------------|------------|-------------|-------------|--------|
| NBA | 1,231 | 1,231 | 1,231 | 0 | 0 | ✅ |
| MLB | 2,866 | 2,866 | 2,866 | 0 | 0 | ✅ |
| NFL | 330 | 330 | 330 | 0 | 0 | ✅ |
| NHL | 1,312 | 1,312 | 1,312 | 0 | 0 | ✅ |
| MLS | 542 | 542 | 542 | 0 | 0 | ✅ |
| WNBA | 322 | 322 | 322 | 0 | 0 | ✅ |
| NWSL | 189 | 189 | 189 | 0 | 0 | ✅ |

**Result:** 100% valid team references across all 6,792 games.

### Game → Stadium References

| Sport | Total Games | Valid | Missing | Percentage Missing |
|-------|-------------|-------|---------|-------------------|
| NBA | 1,231 | 1,231 | 0 | 0.0% ✅ |
| MLB | 2,866 | 2,862 | 4 | 0.1% ✅ |
| NFL | 330 | 325 | 5 | 1.5% ✅ |
| NHL | 1,312 | 0 | **1,312** | **100%** ❌ |
| MLS | 542 | 478 | 64 | 11.8% ⚠️ |
| WNBA | 322 | 257 | 65 | 20.2% ⚠️ |
| NWSL | 189 | 173 | 16 | 8.5% ⚠️ |

**Note:** "Missing" means `stadium_canonical_id` is empty (resolution failed at scrape time). This is NOT orphan references to non-existent stadiums.

### Team → Stadium References

| Sport | Teams | Valid Stadium | Invalid | Status |
|-------|-------|---------------|---------|--------|
| NBA | 30 | 30 | 0 | ✅ |
| MLB | 30 | 30 | 0 | ✅ |
| NFL | 32 | 32 | 0 | ✅ |
| NHL | 32 | 32 | 0 | ✅ |
| MLS | 30 | 30 | 0 | ✅ |
| WNBA | 13 | 13 | 0 | ✅ |
| NWSL | 16 | 16 | 0 | ✅ |

**Result:** 100% valid team → stadium references.

### Cross-Sport Stadium Check

✅ No stadiums are duplicated across sports. Each `stadium_{sport}_*` ID is unique to its sport.

### Missing Stadium Root Causes

| Sport | Missing | Root Cause |
|-------|---------|------------|
| NHL | 1,312 | **Hockey Reference provides no venue data** - source limitation |
| MLS | 64 | New/renamed stadiums not in aliases (see Phase 4) |
| WNBA | 65 | New venue names not in aliases (see Phase 4) |
| NWSL | 16 | Expansion team venues + alternate venues |
| NFL | 5 | International games not in stadium mappings |
| MLB | 4 | Exhibition/international games |

### Orphan Reference Summary

| Reference Type | Total Checked | Orphans Found |
|----------------|---------------|---------------|
| Game → Home Team | 6,792 | 0 ✅ |
| Game → Away Team | 6,792 | 0 ✅ |
| Game → Stadium | 6,792 | 0 ✅ |
| Team → Stadium | 183 | 0 ✅ |
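
The distinction this phase draws between a *missing* reference (empty ID) and an *orphan* reference (non-empty ID with no matching record) can be sketched as a three-way classification. The function name and sample records are hypothetical; only the `stadium_canonical_id` field name comes from the report.

```python
def classify_stadium_refs(games: list[dict], stadium_ids: set[str]) -> dict[str, int]:
    """Count games whose stadium reference is valid, missing (empty), or orphaned."""
    counts = {"valid": 0, "missing": 0, "orphan": 0}
    for game in games:
        ref = game.get("stadium_canonical_id") or ""
        if not ref:
            counts["missing"] += 1      # resolution failed at scrape time
        elif ref in stadium_ids:
            counts["valid"] += 1
        else:
            counts["orphan"] += 1       # points at a stadium that does not exist
    return counts


stadium_ids = {"stadium_nba_cryptocom_arena"}
sample_games = [
    {"stadium_canonical_id": "stadium_nba_cryptocom_arena"},
    {"stadium_canonical_id": ""},                        # e.g. an NHL game with no venue
    {"stadium_canonical_id": "stadium_nba_not_a_venue"},  # made-up orphan
]

ref_counts = classify_stadium_refs(sample_games, stadium_ids)
print(ref_counts)
```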

**Note:** Zero orphan references. All "missing" stadiums are resolution failures (empty string), not references to non-existent canonical IDs.

### Issues Found

| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 12 | NHL games have no stadium data | Medium | Hockey Reference source doesn't provide venue information. All 1,312 NHL games have empty stadium_canonical_id. Fallback sources could provide this data but are limited by `max_sources_to_try = 2`. |

### Phase 6 Summary

**Result: PASS with known limitations** - No orphan references exist; missing stadiums are resolution failures.

- ✅ 100% valid team references (home and away)
- ✅ 100% valid team → stadium references
- ✅ No orphan references to non-existent canonical IDs
- ⚠️ 1,466 games (21.6%) have empty stadium_canonical_id (resolution failures, not orphans)
- ⚠️ NHL accounts for 90% of missing stadium data (source limitation)

---

## Phase 7 Results: iOS Data Reception

**Files Audited:**
- `SportsTime/Core/Services/BootstrapService.swift` (JSON parsing)
- `SportsTime/Core/Services/CanonicalSyncService.swift` (CloudKit sync)
- `SportsTime/Core/Services/DataProvider.swift` (data access)
- `SportsTime/Core/Models/Local/CanonicalModels.swift` (SwiftData models)
- `SportsTime/Resources/*_canonical.json` (bundled data files)

### Bundled Data Comparison

| Data Type | iOS Bundled | Scripts Output | Difference | Status |
|-----------|-------------|----------------|------------|--------|
| Teams | 148 | 183 | **-35** (19%) | ❌ STALE |
| Stadiums | 122 | 211 | **-89** (42%) | ❌ STALE |
| Games | 4,972 | 6,792 | **-1,820** (27%) | ❌ STALE |

**iOS bundled data is significantly outdated compared to Scripts output.**

### Field Mapping Verification

| Python Field | iOS JSON Struct | iOS Model | Type Match | Status |
|--------------|-----------------|-----------|------------|--------|
| `canonical_id` | `canonical_id` | `canonicalId` | String ✅ | ✅ |
| `name` | `name` | `name` | String ✅ | ✅ |
| `game_datetime_utc` | `game_datetime_utc` | `dateTime` | ISO 8601 → Date ✅ | ✅ |
| `date` + `time` (legacy) | `date`, `time` | `dateTime` | Fallback parsing ✅ | ✅ |
| `home_team_canonical_id` | `home_team_canonical_id` | `homeTeamCanonicalId` | String ✅ | ✅ |
| `away_team_canonical_id` | `away_team_canonical_id` | `awayTeamCanonicalId` | String ✅ | ✅ |
| `stadium_canonical_id` | `stadium_canonical_id` | `stadiumCanonicalId` | String ✅ | ✅ |
| `sport` | `sport` | `sport` | String ✅ | ✅ |
| `season` | `season` | `season` | String ✅ | ✅ |
| `is_playoff` | `is_playoff` | `isPlayoff` | Bool ✅ | ✅ |
| `broadcast_info` | `broadcast_info` | `broadcastInfo` | String? ✅ | ✅ |

**Result:** All field mappings are correct and compatible.

### Date Parsing Compatibility

iOS `BootstrapService` supports both formats:

```swift
// New canonical format (preferred)
let game_datetime_utc: String?  // ISO 8601

// Legacy format (fallback)
let date: String?  // "YYYY-MM-DD"
let time: String?  // "HH:mm" or "TBD"
```

**Current iOS bundled games use legacy format.** After updating bundled data, new `game_datetime_utc` format will be used.
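
For pipeline-side checks, the same preferred-then-legacy order can be mirrored in Python. This sketch is an illustration only and does not reproduce the Swift implementation: the function name is hypothetical, and treating a `"TBD"` or absent time as midnight is an assumption made for the example.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_game_datetime(record: dict) -> Optional[datetime]:
    """Prefer `game_datetime_utc`; fall back to legacy `date` + `time` fields."""
    iso = record.get("game_datetime_utc")
    if iso:
        # fromisoformat() accepts a trailing "Z" only on Python 3.11+, so normalize it.
        return datetime.fromisoformat(iso.replace("Z", "+00:00"))

    date_str = record.get("date")
    if not date_str:
        return None

    time_str = record.get("time")
    if not time_str or time_str == "TBD":
        time_str = "00:00"  # assumption: an unknown time is treated as midnight
    return datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %H:%M")


new_style = parse_game_datetime({"game_datetime_utc": "2025-10-21T23:30:00Z"})
legacy = parse_game_datetime({"date": "2025-10-21", "time": "19:30"})
legacy_tbd = parse_game_datetime({"date": "2025-10-21", "time": "TBD"})
```

Note that the legacy branch returns a naive datetime because the legacy fields carry no time zone; a real implementation would need to decide which zone those values represent.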

### Missing Reference Handling

**`DataProvider.filterRichGames()` behavior:**

```swift
return games.compactMap { game in
    guard let homeTeam = teamsById[game.homeTeamId],
          let awayTeam = teamsById[game.awayTeamId],
          let stadium = stadiumsById[game.stadiumId] else {
        return nil  // ⚠️ Silently drops game
    }
    return RichGame(...)
}
```

**Impact:**
- Games with missing stadium IDs are **silently excluded** from RichGame queries
- No error logging or fallback behavior
- User sees fewer games than expected without explanation
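
One way to make the exclusion visible is to return the dropped games alongside the kept ones so a caller can log or surface them. The sketch below shows the idea in Python rather than Swift, to stay consistent with the pipeline scripts; every name in it (`DroppedGame`, `filter_rich_games`, the `*_id` keys) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DroppedGame:
    game_id: str
    missing: list[str]  # which references could not be resolved


def filter_rich_games(
    games: list[dict], teams_by_id: dict, stadiums_by_id: dict
) -> tuple[list[dict], list[DroppedGame]]:
    """Split games into resolvable ones and a report of the ones excluded."""
    kept: list[dict] = []
    dropped: list[DroppedGame] = []
    for game in games:
        missing = []
        if game["home_team_id"] not in teams_by_id:
            missing.append("home_team")
        if game["away_team_id"] not in teams_by_id:
            missing.append("away_team")
        if game["stadium_id"] not in stadiums_by_id:
            missing.append("stadium")
        if missing:
            dropped.append(DroppedGame(game["id"], missing))
        else:
            kept.append(game)
    return kept, dropped


teams = {"team_nhl_ari": {}, "team_nhl_bos": {}}
stadiums = {"stadium_nhl_td_garden": {}}
sample = [
    {"id": "g1", "home_team_id": "team_nhl_bos", "away_team_id": "team_nhl_ari",
     "stadium_id": "stadium_nhl_td_garden"},
    {"id": "g2", "home_team_id": "team_nhl_bos", "away_team_id": "team_nhl_ari",
     "stadium_id": ""},  # empty stadium reference, as in the NHL data
]

kept, dropped = filter_rich_games(sample, teams, stadiums)
for d in dropped:
    print(f"excluded {d.game_id}: missing {', '.join(d.missing)}")
```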
|
||||
|
||||
### Deduplication Logic

**Bootstrap:** No explicit deduplication. If the bundled JSON contains duplicate canonical IDs, both records are inserted into SwiftData, which can lead to query issues.

**CloudKit Sync:** Uses an upsert pattern with the canonical ID as the unique key, so a duplicate overwrites the existing record.

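The difference between the two paths can be illustrated with a short Python sketch. It assumes each record carries its ID under a `canonical_id` key (an assumption about the JSON layout); both helpers are hypothetical and only model the behavior described above.

```python
from collections import Counter
from typing import Iterable, List


def find_duplicate_ids(records: Iterable[dict], key: str = "canonical_id") -> List[str]:
    """Return IDs that appear more than once (what a plain insert would duplicate)."""
    counts = Counter(record[key] for record in records)
    return sorted(cid for cid, n in counts.items() if n > 1)


def upsert_by_id(records: Iterable[dict], key: str = "canonical_id") -> List[dict]:
    """Keep one record per ID; a later record overwrites an earlier one."""
    merged = {}
    for record in records:
        merged[record[key]] = record
    return list(merged.values())
```

Running `find_duplicate_ids` over the generated JSON before bundling would catch the case that bootstrap does not guard against.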
### Schema Version Compatibility

| Component | Schema Version | Status |
|-----------|----------------|--------|
| Scripts output | 1 | ✅ |
| iOS CanonicalModels | 1 | ✅ |
| iOS BootstrapService | Expects 1 | ✅ |

**Compatible.** Schema version mismatch protection exists in `CanonicalSyncService`:

```swift
case .schemaVersionTooNew(let version):
    return "Data requires app version supporting schema \(version). Please update the app."
```

### Bootstrap Order Validation

iOS bootstraps in the correct dependency order:

1. Stadiums (no dependencies)
2. Stadium aliases (depends on stadiums)
3. League structure (no dependencies)
4. Teams (depends on stadiums)
5. Team aliases (depends on teams)
6. Games (depends on teams + stadiums)

**Correct: this order prevents orphan references during bootstrap.**

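The ordering claim can be checked mechanically. The Python sketch below encodes the dependencies from the list above and reports any entity that is loaded before something it depends on. The entity names and the `order_violations` helper are illustrative, not identifiers from the iOS code.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Dependencies taken from the bootstrap list above (entity -> prerequisites).
DEPENDENCIES: Dict[str, Set[str]] = {
    "stadiums": set(),
    "stadium_aliases": {"stadiums"},
    "league_structure": set(),
    "teams": {"stadiums"},
    "team_aliases": {"teams"},
    "games": {"teams", "stadiums"},
}


def order_violations(
    order: Sequence[str],
    deps: Dict[str, Set[str]] = DEPENDENCIES,
) -> List[Tuple[str, str]]:
    """Return (entity, missing_prerequisite) pairs for a proposed load order."""
    loaded: Set[str] = set()
    violations: List[Tuple[str, str]] = []
    for entity in order:
        for prerequisite in sorted(deps[entity] - loaded):
            violations.append((entity, prerequisite))
        loaded.add(entity)
    return violations
```

For the documented order the function returns an empty list; moving `games` to the front yields one violation per missing prerequisite.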
### CloudKit Sync Validation

`CanonicalSyncService` syncs in the same dependency order and tracks:

- Per-entity sync timestamps
- Skipped records (incompatible schema version)
- Skipped records (older than local)
- Sync duration and cancellation

**Well-designed sync infrastructure.**

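The two skip conditions tracked above amount to a small per-record decision. The Python sketch below models that decision only; it is not the `CanonicalSyncService` logic. The field names `schema_version` and `modified_at` (as epoch seconds) are hypothetical, and the supported version of 1 is taken from the schema version table earlier in this phase.

```python
from enum import Enum
from typing import Optional


class SyncDecision(Enum):
    APPLY = "apply"
    SKIP_SCHEMA_TOO_NEW = "skip_schema_too_new"
    SKIP_OLDER_THAN_LOCAL = "skip_older_than_local"


def decide(remote: dict, local: Optional[dict], supported_version: int = 1) -> SyncDecision:
    """Decide whether a remote record should replace the local copy.

    `schema_version` and `modified_at` are hypothetical field names.
    """
    if remote["schema_version"] > supported_version:
        return SyncDecision.SKIP_SCHEMA_TOO_NEW
    # Only a strictly older remote record is skipped.
    if local is not None and remote["modified_at"] < local["modified_at"]:
        return SyncDecision.SKIP_OLDER_THAN_LOCAL
    return SyncDecision.APPLY
```

Counting the decisions per entity type would reproduce the skipped-record tallies listed above.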
### Issues Found

| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 13 | iOS bundled data severely outdated | **Critical** | Missing 35 teams (19%), 89 stadiums (42%), 1,820 games (27%). First-launch experience shows incomplete data until CloudKit sync completes. |
| 14 | Silent game exclusion in RichGame queries | Medium | `filterRichGames()` silently drops games with missing team/stadium references. Users see fewer games without explanation. |
| 15 | No bootstrap deduplication | Low | Duplicate game IDs in bundled JSON would create duplicate SwiftData records. Low risk since JSON is generated correctly. |

### Phase 7 Summary

**Result: FAIL** - iOS bundled data is critically outdated.

- ❌ iOS bundled data missing 35 teams, 89 stadiums, 1,820 games
- ⚠️ Games with unresolved references silently dropped from RichGame queries
- ✅ Field mapping between Python and iOS is correct
- ✅ Date parsing supports both legacy and new formats
- ✅ Schema versions are compatible
- ✅ Bootstrap/sync order handles dependencies correctly

---

## Prioritized Issue List

| # | Issue | Severity | Phase | Root Cause | Remediation |
|---|-------|----------|-------|------------|-------------|
| 13 | iOS bundled data severely outdated | **Critical** | 7 | Bundled JSON not updated after pipeline runs | Copy Scripts/output/*_canonical.json to iOS Resources/ and rebuild |
| 4 | WNBA/NWSL/MLS ESPN-only source | **High** | 3 | No implemented fallback sources | Implement alternative scrapers (FBref for MLS, WNBA League Pass) |
| 5 | max_sources_to_try = 2 limits fallback | **High** | 3 | Hardcoded limit in base.py:189 | Increase to 3 or remove limit for sports with 3+ sources |
| 7 | NHL has no stadium data from primary source | **High** | 4 | Hockey Reference doesn't provide venue info | Force NHL to use NHL API or ESPN as primary (they provide venues) |
| 8 | 131 NBA stadium resolution failures | **High** | 4 | 2024-2025 naming rights not in aliases | Add aliases mapping each new name to its venue's existing canonical ID: "Mortgage Matchup Center" (Phoenix, formerly Footprint Center), "Xfinity Mobile Arena" (Philadelphia, formerly Wells Fargo Center) |
| 2 | Orphan stadium alias references | **Medium** | 2 | Wrong canonical IDs in stadium_aliases.json | Fix 5 Denver/KC stadium aliases pointing to non-existent IDs |
| 6 | CBS/FBref scrapers declared but not implemented | **Medium** | 3 | NotImplementedError at runtime | Either implement or remove from source lists to avoid confusion |
| 9 | Outdated WNBA expected count | **Medium** | 4 | WNBA expanded to 13 teams in 2025 | Update config.py EXPECTED_GAME_COUNTS["wnba"] from 220 to 286 |
| 10 | MLS/WNBA stadium alias gaps | **Medium** | 4 | New/renamed venues missing from aliases | Add 129 missing stadium aliases (64 MLS + 65 WNBA) |
| 12 | NHL games have no stadium data | **Medium** | 6 | Same as Issue #7 | See Issue #7 remediation |
| 14 | Silent game exclusion in RichGame queries | **Medium** | 7 | compactMap silently drops games | Log dropped games or return partial RichGame with placeholder stadium |
| 1 | WNBA single abbreviations | **Low** | 1 | Only 1 abbreviation per team | Add alternative abbreviations for source compatibility |
| 3 | No NFL team aliases | **Low** | 2 | Missing Washington Redskins/Football Team | Add historical Washington team name aliases |
| 11 | Game status not parsed | **Low** | 4 | Status field always "unknown" | Parse game status from source data (final, scheduled, postponed) |
| 15 | No bootstrap deduplication | **Low** | 7 | No explicit duplicate check during bootstrap | Add deduplication check in bootstrapGames() |

---

## Recommended Next Steps

### Immediate (Before Next Release)

1. **Update iOS bundled data** (Issue #13)
   ```bash
   cp Scripts/output/stadiums_*.json SportsTime/Resources/stadiums_canonical.json
   cp Scripts/output/teams_*.json SportsTime/Resources/teams_canonical.json
   cp Scripts/output/games_*.json SportsTime/Resources/games_canonical.json
   ```

2. **Fix NHL stadium data** (Issues #7, #12)
   - Change NHL primary source from Hockey Reference to NHL API
   - Or: Increase `max_sources_to_try` to 3 so fallbacks are attempted

3. **Add critical stadium aliases** (Issues #8, #10)
   - "Mortgage Matchup Center" → canonical ID of the Phoenix arena (formerly Footprint Center)
   - "Xfinity Mobile Arena" → canonical ID of the Philadelphia arena (formerly Wells Fargo Center)
   - Run validation report to identify all unresolved venue names

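The copy step in item 1 relies on each glob matching exactly one file: with a single destination file, `cp` fails if `games_*.json` expands to several paths. A guarded version of the same step could look like the Python sketch below. The `copy_canonical` helper is hypothetical, and the directory layout and file-name patterns are taken from the commands above rather than verified against the repository.

```python
import shutil
from pathlib import Path
from typing import List, Sequence


def copy_canonical(
    output_dir: str,
    resources_dir: str,
    kinds: Sequence[str] = ("stadiums", "teams", "games"),
) -> List[Path]:
    """Copy one `<kind>_*.json` per kind to `<kind>_canonical.json`.

    Raises RuntimeError when a pattern matches zero or several files,
    instead of copying an arbitrary one.
    """
    out, res = Path(output_dir), Path(resources_dir)
    copied: List[Path] = []
    for kind in kinds:
        matches = sorted(out.glob(f"{kind}_*.json"))
        if len(matches) != 1:
            raise RuntimeError(
                f"expected exactly one {kind}_*.json in {out}, found {len(matches)}"
            )
        target = res / f"{kind}_canonical.json"
        shutil.copyfile(matches[0], target)
        copied.append(target)
    return copied
```

Under these assumptions, the call would be `copy_canonical("Scripts/output", "SportsTime/Resources")`, run from the repository root.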
### Short-term (This Quarter)

4. **Implement MLS fallback source** (Issue #4)
   - FBref has MLS data with venue information
   - Reduces ESPN single-point-of-failure risk

5. **Fix orphan alias references** (Issue #2)
   - Correct 5 NFL stadium aliases pointing to wrong canonical IDs
   - Add validation check to prevent future orphan references

6. **Update expected game counts** (Issue #9)
   - WNBA: 220 → 286 (13 teams × 44 games / 2)

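The validation check suggested in item 5 reduces to a set lookup. The Python sketch below is an illustration under assumed data shapes, not the project's `validate_aliases.py`: it assumes each alias entry references its stadium through a `stadium_canonical_id` key and each stadium carries a `canonical_id` key, and the IDs in the usage example are made up.

```python
from typing import Iterable, List


def find_orphan_aliases(
    aliases: Iterable[dict],
    stadiums: Iterable[dict],
    id_key: str = "canonical_id",            # assumed key on stadium records
    ref_key: str = "stadium_canonical_id",   # assumed key on alias records
) -> List[dict]:
    """Return alias entries whose target ID is absent from the stadium list."""
    known_ids = {stadium[id_key] for stadium in stadiums}
    return [alias for alias in aliases if alias.get(ref_key) not in known_ids]


if __name__ == "__main__":
    # Made-up IDs, for illustration only.
    stadiums = [{"canonical_id": "stadium_example_a"}]
    aliases = [
        {"alias_name": "Old Name A", "stadium_canonical_id": "stadium_example_a"},
        {"alias_name": "Old Name B", "stadium_canonical_id": "stadium_example_missing"},
    ]
    for orphan in find_orphan_aliases(aliases, stadiums):
        print(f"orphan alias: {orphan['alias_name']} -> {orphan['stadium_canonical_id']}")
```

Failing the pipeline run whenever the returned list is non-empty would stop new orphan references from reaching the bundled alias files.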
### Long-term (Next Quarter)

7. **Implement WNBA/NWSL fallback sources** (Issue #4)
   - Consider WNBA League Pass API or other sources
   - NWSL has limited data availability - may need to accept ESPN-only

8. **Add RichGame partial loading** (Issue #14)
   - Log games dropped due to missing references
   - Consider returning games with placeholder stadiums for NHL

9. **Parse game status** (Issue #11)
   - Extract final/scheduled/postponed from source data
   - Enables filtering by game state

---

## Verification Checklist

After implementing fixes, verify:

- [ ] Run `python -m sportstime_parser scrape --sport all --season 2025`
- [ ] Check validation reports show <5% unresolved stadiums per sport
- [ ] Copy output JSON to iOS Resources/
- [ ] Build iOS app and verify data loads at startup
- [ ] Query RichGames and verify game count matches expectations
- [ ] Run CloudKit sync and verify no errors

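The "<5% unresolved stadiums per sport" item can be checked directly against the games JSON. The Python sketch below assumes an unresolved stadium shows up as a missing, null, or empty `stadium_canonical_id`; how the pipeline actually marks unresolved venues is not stated here, and both helpers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def unresolved_stadium_rates(games: Iterable[dict]) -> Dict[str, float]:
    """Fraction of games per sport without a stadium reference."""
    totals: Dict[str, int] = defaultdict(int)
    unresolved: Dict[str, int] = defaultdict(int)
    for game in games:
        sport = game["sport"]
        totals[sport] += 1
        # Assumption: missing, null, or empty ID means "unresolved".
        if not game.get("stadium_canonical_id"):
            unresolved[sport] += 1
    return {sport: unresolved[sport] / totals[sport] for sport in totals}


def sports_over_threshold(games: Iterable[dict], threshold: float = 0.05) -> List[str]:
    """Sports that fail the checklist item (rate is not below the threshold)."""
    rates = unresolved_stadium_rates(games)
    return sorted(sport for sport, rate in rates.items() if rate >= threshold)
```

An empty result from `sports_over_threshold` corresponds to the checklist item passing for every sport in the file.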
1046
Scripts/docs/REMEDIATION_PLAN.md
Normal file
File diff suppressed because it is too large