feat(scripts): add sportstime-parser data pipeline

Complete Python package for scraping, normalizing, and uploading
sports schedule data to CloudKit. Includes:

- Multi-source scrapers for NBA, MLB, NFL, NHL, MLS, WNBA, NWSL
- Canonical ID system for teams, stadiums, and games
- Fuzzy matching with manual alias support
- CloudKit uploader with batch operations and deduplication
- Comprehensive test suite with fixtures
- WNBA abbreviation aliases for improved team resolution
- Alias validation script to detect orphan references

All 5 phases of data remediation plan completed:
- Phase 1: Alias fixes (team/stadium alias additions)
- Phase 2: NHL stadium coordinate fixes
- Phase 3: Re-scrape validation
- Phase 4: iOS bundle update
- Phase 5: Code quality improvements (WNBA aliases)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Trey t
2026-01-20 18:56:25 -06:00
parent ac78042a7e
commit 52d445bca4
76 changed files with 25065 additions and 0 deletions

docs/DATA_AUDIT.md Normal file

@@ -0,0 +1,805 @@
# SportsTime Data Audit Report
**Generated:** 2026-01-20
**Scope:** NBA, MLB, NFL, NHL, MLS, WNBA, NWSL
**Data Pipeline:** Scripts → CloudKit → iOS App
---
## Executive Summary
The data audit identified **15 issues** across the SportsTime data pipeline, with significant gaps in source reliability, stadium resolution, and iOS data freshness.
| Severity | Count | Description |
|----------|-------|-------------|
| **Critical** | 1 | iOS bundled data severely outdated |
| **High** | 4 | Single-source sports, NHL stadium data, NBA naming rights |
| **Medium** | 6 | Alias gaps, outdated config, silent game exclusion |
| **Low** | 4 | Minor configuration and coverage issues |
### Key Findings
**Data Pipeline Health:**
- ✅ **Canonical ID system**: 100% format compliance across 7,186 IDs
- ✅ **Team mappings**: All 183 teams correctly mapped with current abbreviations
- ✅ **Referential integrity**: Zero orphan references (0 games pointing to non-existent teams/stadiums)
- ⚠️ **Stadium resolution**: 1,466 games (21.6%) have unresolved stadiums
**Critical Risks:**
1. **ESPN single-point-of-failure** for WNBA, NWSL, MLS - if ESPN changes, 3 sports lose all data
2. **NHL has 100% missing stadiums** - Hockey Reference provides no venue data
3. **iOS bundled data 27% behind** - 1,820 games missing from first-launch experience
**Root Causes:**
- Stadium naming rights changed faster than alias updates (2024-2025)
- Fallback source limit (`max_sources_to_try = 2`) prevents third source from being tried
- Hockey Reference source limitation (no venue info) combined with fallback limit
- iOS bundled JSON not updated with latest pipeline output
---
## Phase Status Tracking
| Phase | Status | Issues Found |
|-------|--------|--------------|
| 1. Hardcoded Mapping Audit | ✅ COMPLETE | 1 Low |
| 2. Alias File Completeness | ✅ COMPLETE | 1 Medium, 1 Low |
| 3. Scraper Source Reliability | ✅ COMPLETE | 2 High, 1 Medium |
| 4. Game Count & Coverage | ✅ COMPLETE | 2 High, 2 Medium, 1 Low |
| 5. Canonical ID Consistency | ✅ COMPLETE | 0 issues |
| 6. Referential Integrity | ✅ COMPLETE | 1 Medium (NHL source) |
| 7. iOS Data Reception | ✅ COMPLETE | 1 Critical, 1 Medium, 1 Low |
---
## Phase 1 Results: Hardcoded Mapping Audit
**Files Audited:**
- `sportstime_parser/normalizers/team_resolver.py` (TEAM_MAPPINGS)
- `sportstime_parser/normalizers/stadium_resolver.py` (STADIUM_MAPPINGS)
### Team Counts
| Sport | Hardcoded | Expected | Abbreviations | Status |
|-------|-----------|----------|---------------|--------|
| NBA | 30 | 30 | 38 | ✅ |
| MLB | 30 | 30 | 38 | ✅ |
| NFL | 32 | 32 | 40 | ✅ |
| NHL | 32 | 32 | 41 | ✅ |
| MLS | 30 | 30* | 32 | ✅ |
| WNBA | 13 | 13 | 13 | ✅ |
| NWSL | 16 | 16 | 24 | ✅ |
*MLS: 29 original teams + San Diego FC (2025 expansion) = 30
### Stadium Counts
| Sport | Hardcoded | Notes | Status |
|-------|-----------|-------|--------|
| NBA | 30 | 1 per team | ✅ |
| MLB | 57 | 30 regular + 18 spring training + 9 special venues | ✅ |
| NFL | 30 | Includes shared venues (SoFi Stadium: LAR+LAC, MetLife: NYG+NYJ) | ✅ |
| NHL | 32 | 1 per team | ✅ |
| MLS | 30 | 1 per team | ✅ |
| WNBA | 13 | 1 per team | ✅ |
| NWSL | 19 | 14 current + 5 expansion team venues (Boston/Denver) | ✅ |
### Recent Updates Verification
| Update | Type | Status | Notes |
|--------|------|--------|-------|
| Utah Hockey Club (NHL) | Relocation | ✅ Present | ARI + UTA abbreviations both map to `team_nhl_ari` |
| Golden State Valkyries (WNBA) | Expansion 2025 | ✅ Present | `team_wnba_gsv` with Chase Center venue |
| Boston Legacy FC (NWSL) | Expansion 2026 | ✅ Present | `team_nwsl_bos` with Gillette Stadium |
| Denver Summit FC (NWSL) | Expansion 2026 | ✅ Present | `team_nwsl_den` with Dick's Sporting Goods Park |
| Oakland A's → Sacramento | Temporary relocation | ✅ Present | `stadium_mlb_sutter_health_park` |
| San Diego FC (MLS) | Expansion 2025 | ✅ Present | `team_mls_sd` with Snapdragon Stadium |
| FedExField → Northwest Stadium | Naming rights | ✅ Present | `stadium_nfl_northwest_stadium` |
### NFL Stadium Sharing
| Stadium | Teams | Status |
|---------|-------|--------|
| SoFi Stadium | LAR, LAC | ✅ Correct |
| MetLife Stadium | NYG, NYJ | ✅ Correct |
### Issues Found
| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 1 | WNBA single abbreviations | Low | All 13 WNBA teams have only 1 abbreviation each. May need additional abbreviations for source compatibility. |
### Phase 1 Summary
**Result: PASS** - All team and stadium mappings are complete and up-to-date with 2025-2026 changes.
- ✅ All 7 sports have correct team counts
- ✅ All stadium counts are appropriate (including spring training, special venues)
- ✅ Recent franchise moves/expansions are reflected
- ✅ Stadium sharing is correctly handled
- ✅ Naming rights updates are current
---
## Phase 2 Results: Alias File Completeness
**Files Audited:**
- `Scripts/team_aliases.json`
- `Scripts/stadium_aliases.json`
### Team Aliases Summary
| Sport | Entries | Coverage | Status |
|-------|---------|----------|--------|
| MLB | 23 | Historical relocations/renames | ✅ |
| NBA | 29 | Historical relocations/renames | ✅ |
| NHL | 24 | Historical relocations/renames | ✅ |
| NFL | 0 | **No aliases** | ⚠️ |
| MLS | 0 | No aliases (newer league) | ✅ |
| WNBA | 0 | No aliases (newer league) | ✅ |
| NWSL | 0 | No aliases (newer league) | ✅ |
| **Total** | **76** | | |
- All 76 entries have valid date ranges
- No orphan references (all canonical IDs exist in mappings)
### Stadium Aliases Summary
| Sport | Entries | Coverage | Status |
|-------|---------|----------|--------|
| MLB | 109 | Regular + spring training + special venues | ✅ |
| NFL | 65 | Naming rights history | ✅ |
| NBA | 44 | Naming rights history | ✅ |
| NHL | 39 | Naming rights history | ✅ |
| MLS | 35 | Current + naming variants | ✅ |
| WNBA | 15 | Current + naming variants | ✅ |
| NWSL | 14 | Current + naming variants | ✅ |
| **Total** | **321** | | |
- 65 entries have date ranges (historical naming rights)
- 256 entries are permanent aliases (no date restrictions)
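The split between dated and permanent aliases can be illustrated with a small lookup. This is a minimal sketch, not the resolver in `stadium_resolver.py`: the field names (`alias_name`, `canonical_id`, `valid_from`, `valid_until`), the venue names, and the dates are all hypothetical, and the real schema of `stadium_aliases.json` may differ.

```python
from datetime import date
from typing import Optional

# Hypothetical alias records; names, IDs, and dates are illustrative only.
ALIASES = [
    {"alias_name": "Old Sponsor Field", "canonical_id": "stadium_demo_example_park",
     "valid_from": "2010-01-01", "valid_until": "2019-12-31"},  # dated alias
    {"alias_name": "Example Park", "canonical_id": "stadium_demo_example_park",
     "valid_from": None, "valid_until": None},                  # permanent alias
]

def resolve_alias(name: str, game_date: date, aliases=ALIASES) -> Optional[str]:
    """Return the canonical ID if an alias with this name is valid on game_date."""
    wanted = name.strip().lower()
    for entry in aliases:
        if entry["alias_name"].lower() != wanted:
            continue
        start, end = entry["valid_from"], entry["valid_until"]
        # A missing bound means the alias is open-ended on that side.
        if start and game_date < date.fromisoformat(start):
            continue
        if end and game_date > date.fromisoformat(end):
            continue
        return entry["canonical_id"]
    return None
```

Under these assumptions a dated alias only resolves for games inside its range, while a permanent alias resolves for any date.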
### Orphan Reference Check
| Type | Count | Status |
|------|-------|--------|
| Team aliases with invalid references | 0 | ✅ |
| Stadium aliases with invalid references | **5** | ❌ |
**Orphan Stadium References Found:**
| Alias Name | References (Invalid) | Correct ID |
|------------|---------------------|------------|
| Broncos Stadium at Mile High | `stadium_nfl_empower_field_at_mile_high` | `stadium_nfl_empower_field` |
| Sports Authority Field at Mile High | `stadium_nfl_empower_field_at_mile_high` | `stadium_nfl_empower_field` |
| Invesco Field at Mile High | `stadium_nfl_empower_field_at_mile_high` | `stadium_nfl_empower_field` |
| Mile High Stadium | `stadium_nfl_empower_field_at_mile_high` | `stadium_nfl_empower_field` |
| Arrowhead Stadium | `stadium_nfl_geha_field_at_arrowhead_stadium` | `stadium_nfl_arrowhead_stadium` |
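An orphan check of this kind can be sketched as a set-membership test. The example below is an assumption-laden illustration rather than the project's alias validation script: the record fields (`alias_name`, `canonical_id`) are hypothetical, and the third alias entry is invented to show a passing case.

```python
def find_orphan_aliases(aliases, known_ids):
    """Return (alias_name, referenced_id) pairs whose target ID is not in known_ids."""
    return [
        (entry["alias_name"], entry["canonical_id"])
        for entry in aliases
        if entry["canonical_id"] not in known_ids
    ]

# Canonical IDs that exist in the stadium mappings, per the table above.
KNOWN_STADIUM_IDS = {"stadium_nfl_empower_field", "stadium_nfl_arrowhead_stadium"}

SAMPLE_ALIASES = [
    {"alias_name": "Mile High Stadium",
     "canonical_id": "stadium_nfl_empower_field_at_mile_high"},
    {"alias_name": "Arrowhead Stadium",
     "canonical_id": "stadium_nfl_geha_field_at_arrowhead_stadium"},
    # Hypothetical well-formed entry, included only as a passing case.
    {"alias_name": "Empower Field", "canonical_id": "stadium_nfl_empower_field"},
]

orphans = find_orphan_aliases(SAMPLE_ALIASES, KNOWN_STADIUM_IDS)
```

Running a check like this whenever the alias files change would surface the Denver and Kansas City entries before they reach a scrape.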
### Historical Changes Coverage
| Historical Name | Current Team | In Aliases? |
|-----------------|--------------|-------------|
| Montreal Expos | Washington Nationals | ✅ |
| Seattle SuperSonics | Oklahoma City Thunder | ✅ |
| Arizona Coyotes | Utah Hockey Club | ✅ |
| Cleveland Indians | Cleveland Guardians | ✅ |
| Hartford Whalers | Carolina Hurricanes | ✅ |
| Quebec Nordiques | Colorado Avalanche | ✅ |
| Vancouver Grizzlies | Memphis Grizzlies | ✅ |
| Washington Redskins | Washington Commanders | ❌ Missing |
| Washington Football Team | Washington Commanders | ❌ Missing |
| Brooklyn Dodgers | Los Angeles Dodgers | ❌ Missing |
### Issues Found
| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 2 | Orphan stadium alias references | Medium | 5 stadium aliases point to non-existent canonical IDs (`stadium_nfl_empower_field_at_mile_high`, `stadium_nfl_geha_field_at_arrowhead_stadium`). Causes resolution failures for historical Denver/KC stadium names. |
| 3 | No NFL team aliases | Low | Missing Washington Redskins/Football Team historical names. Limits historical game matching for NFL. |
### Phase 2 Summary
**Result: PASS with issues** - Alias files cover most historical changes but have referential integrity bugs.
- ✅ Team aliases cover MLB/NBA/NHL historical changes
- ✅ Stadium aliases cover naming rights changes across all sports
- ✅ No date range validation errors
- ❌ 5 orphan stadium references need fixing
- ⚠️ No NFL team aliases (Washington Redskins/Football Team missing)
---
## Phase 3 Results: Scraper Source Reliability
**Files Audited:**
- `sportstime_parser/scrapers/base.py` (fallback logic)
- `sportstime_parser/scrapers/nba.py`, `mlb.py`, `nfl.py`, `nhl.py`, `mls.py`, `wnba.py`, `nwsl.py`
### Source Dependency Matrix
| Sport | Primary | Status | Fallback 1 | Status | Fallback 2 | Status | Risk |
|-------|---------|--------|------------|--------|------------|--------|------|
| NBA | basketball_reference | ✅ | espn | ✅ | cbs | ❌ NOT IMPL | Medium |
| MLB | mlb_api | ✅ | espn | ✅ | baseball_reference | ✅ | Low |
| NFL | espn | ✅ | pro_football_reference | ✅ | cbs | ❌ NOT IMPL | Medium |
| NHL | hockey_reference | ✅ | nhl_api | ✅ | espn | ✅ | Low |
| MLS | espn | ✅ | fbref | ❌ NOT IMPL | - | - | **HIGH** |
| WNBA | espn | ✅ | - | - | - | - | **HIGH** |
| NWSL | espn | ✅ | - | - | - | - | **HIGH** |
### Unimplemented Sources
| Sport | Source | Line | Status |
|-------|--------|------|--------|
| NBA | cbs | `nba.py:421` | `raise NotImplementedError("CBS scraper not implemented")` |
| NFL | cbs | `nfl.py:386` | `raise NotImplementedError("CBS scraper not implemented")` |
| MLS | fbref | `mls.py:214` | `raise NotImplementedError("FBref scraper not implemented")` |
### Fallback Logic Analysis
**File:** `base.py:189`
```python
max_sources_to_try = 2 # Don't try all sources if first few return nothing
```
**Impact:**
- Even if 3 sources are declared, only 2 are tried
- If sources 1 and 2 fail, source 3 is never attempted
- This limits resilience for NBA, MLB, NFL, NHL which have 3 sources
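The effect of the cap can be reproduced with a toy fallback loop. This is a sketch under stated assumptions, not the code in `base.py`: the function name `scrape_with_fallback`, the source tuples, and the rule "stop at the first source that returns games" are hypothetical stand-ins for the real logic.

```python
from typing import Callable, List, Sequence, Tuple

Source = Tuple[str, Callable[[], List[dict]]]

def scrape_with_fallback(sources: Sequence[Source], max_sources_to_try: int = 2):
    """Try sources in order; return (games, names_attempted).

    Stops at the first source that returns games, or once the cap is reached.
    """
    attempted = []
    for name, fetch in sources[:max_sources_to_try]:
        attempted.append(name)
        try:
            games = fetch()
        except Exception:
            continue  # treat a failing source like an empty one
        if games:
            return games, attempted
    return [], attempted

def _failing() -> List[dict]:
    raise RuntimeError("source unavailable")

def _empty() -> List[dict]:
    return []

def _working() -> List[dict]:
    return [{"canonical_id": "game_demo"}]

# Hypothetical chain: only the third source has data.
SOURCES = [("primary", _failing), ("fallback_1", _empty), ("fallback_2", _working)]
```

With the default cap of 2 the working third source is never reached; passing `max_sources_to_try=3` lets the same chain recover.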
### International Game Filtering
| Sport | Hardcoded Locations | Notes |
|-------|---------------------|-------|
| NFL | London, Mexico City, Frankfurt, Munich, São Paulo | ✅ Complete for 2025 |
| NHL | Prague, Stockholm, Helsinki, Tampere, Gothenburg | ✅ Complete for 2025 |
| NBA | None | ⚠️ No international filtering (Abu Dhabi games?) |
| MLB | None | ⚠️ No international filtering (Mexico City games?) |
| MLS | None | N/A (domestic only) |
| WNBA | None | N/A (domestic only) |
| NWSL | None | N/A (domestic only) |
### Single Point of Failure Risk
| Sport | Primary Source | If ESPN Fails... | Risk Level |
|-------|----------------|------------------|------------|
| WNBA | ESPN only | **Complete data loss** | Critical |
| NWSL | ESPN only | **Complete data loss** | Critical |
| MLS | ESPN only (fbref not impl) | **Complete data loss** | Critical |
| NBA | Basketball-Ref → ESPN | ESPN fallback available | Low |
| NFL | ESPN → Pro-Football-Ref | Fallback available | Low |
| NHL | Hockey-Ref → NHL API → ESPN | Two fallbacks | Very Low |
| MLB | MLB API → ESPN → B-Ref | Two fallbacks | Very Low |
### Issues Found
| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 4 | WNBA/NWSL/MLS single source | High | ESPN is the only working source for 3 sports. If ESPN changes or fails, data collection completely stops. |
| 5 | max_sources_to_try = 2 | High | Third fallback source never tried even if available. Reduces resilience for NBA/MLB/NFL/NHL. |
| 6 | CBS/FBref not implemented | Medium | Declared fallback sources raise NotImplementedError. Appears functional in config but fails at runtime. |
### Phase 3 Summary
**Result: FAIL** - Critical single-point-of-failure for 3 sports.
- ❌ WNBA, NWSL, MLS have only ESPN (no resilience)
- ❌ Fallback limit of 2 prevents third source from being tried
- ⚠️ CBS and FBref declared but not implemented
- ✅ MLB and NHL have full fallback chains
- ✅ International game filtering present for NFL/NHL
---
## Phase 4 Results: Game Count & Coverage
**Files Audited:**
- `Scripts/output/games_*.json` (all 2025 season files)
- `Scripts/output/validation_*.md` (all validation reports)
- `sportstime_parser/config.py` (EXPECTED_GAME_COUNTS)
### Coverage Summary
| Sport | Scraped | Expected | Coverage | Status |
|-------|---------|----------|----------|--------|
| NBA | 1,231 | 1,230 | 100.1% | ✅ |
| MLB | 2,866 | 2,430 | 117.9% | ⚠️ Includes spring training |
| NFL | 330 | 272 | 121.3% | ⚠️ Includes preseason/playoffs |
| NHL | 1,312 | 1,312 | 100.0% | ✅ |
| MLS | 542 | 493 | 109.9% | ✅ Includes playoffs |
| WNBA | 322 | 220 | **146.4%** | ⚠️ Expected count outdated |
| NWSL | 189 | 182 | 103.8% | ✅ |
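The coverage column is scraped games divided by expected games. The sketch below recomputes it from the counts in the table; the 5% tolerance used for flagging is a hypothetical threshold and does not reproduce the Status column, which also accounts for playoffs and preseason.

```python
# Counts copied from the coverage table above.
SCRAPED = {"nba": 1231, "mlb": 2866, "nfl": 330, "nhl": 1312,
           "mls": 542, "wnba": 322, "nwsl": 189}
EXPECTED = {"nba": 1230, "mlb": 2430, "nfl": 272, "nhl": 1312,
            "mls": 493, "wnba": 220, "nwsl": 182}

def coverage_pct(scraped: int, expected: int) -> float:
    """Scraped games as a percentage of the expected count, to one decimal."""
    return round(100 * scraped / expected, 1)

def coverage_report(scraped=SCRAPED, expected=EXPECTED, tolerance_pct=5.0):
    """Map sport -> (coverage %, flagged); flagged means outside 100% +/- tolerance."""
    report = {}
    for sport, count in scraped.items():
        pct = coverage_pct(count, expected[sport])
        report[sport] = (pct, abs(pct - 100.0) > tolerance_pct)
    return report
```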
### Date Range Analysis
| Sport | Start Date | End Date | Notes |
|-------|------------|----------|-------|
| NBA | 2025-10-21 | 2026-04-12 | Regular season only |
| MLB | 2025-03-01 | 2025-11-02 | Includes spring training (417 games in March) |
| NFL | 2025-08-01 | 2026-01-25 | Includes preseason (49 in Aug) + playoffs (28 in Jan) |
| NHL | 2025-10-07 | 2026-04-16 | Regular season only |
| MLS | 2025-02-22 | 2025-11-30 | Regular season + playoffs |
| WNBA | 2025-05-02 | 2025-10-11 | Regular season + playoffs |
| NWSL | 2025-03-15 | 2025-11-23 | Regular season + playoffs |
### Game Status Distribution
All games across all sports have status `unknown` - game status is not being properly parsed from sources.
### Duplicate Game Detection
| Sport | Duplicates Found | Details |
|-------|-----------------|---------|
| NBA | 0 | ✅ |
| MLB | 1 | `game_mlb_2025_20250508_det_col_1` appears twice (doubleheader handling issue) |
| NFL | 0 | ✅ |
| NHL | 0 | ✅ |
| MLS | 0 | ✅ |
| WNBA | 0 | ✅ |
| NWSL | 0 | ✅ |
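Duplicate detection reduces to counting canonical IDs. A minimal sketch, assuming each game record carries a `canonical_id` field; the `_2` record is a hypothetical second game of the doubleheader, added for contrast.

```python
from collections import Counter

def find_duplicate_ids(games):
    """Return {canonical_id: occurrences} for every ID that appears more than once."""
    counts = Counter(game["canonical_id"] for game in games)
    return {game_id: n for game_id, n in counts.items() if n > 1}

SAMPLE_GAMES = [
    {"canonical_id": "game_mlb_2025_20250508_det_col_1"},
    {"canonical_id": "game_mlb_2025_20250508_det_col_1"},  # the reported duplicate
    {"canonical_id": "game_mlb_2025_20250508_det_col_2"},  # hypothetical game 2
]

duplicates = find_duplicate_ids(SAMPLE_GAMES)
```

If both games of a doubleheader receive the `_1` suffix, the check reports the collision instead of letting one record overwrite the other downstream.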
### Validation Report Analysis
| Sport | Total Games | Unresolved Teams | Unresolved Stadiums | Manual Review Items |
|-------|-------------|------------------|---------------------|---------------------|
| NBA | 1,231 | 0 | **131** | 131 |
| MLB | 2,866 | 12 | 4 | 20 |
| NFL | 330 | 1 | 5 | 11 |
| NHL | 1,312 | 0 | 0 | **1,312** (all missing stadiums) |
| MLS | 542 | 1 | **64** | 129 |
| WNBA | 322 | 5 | **65** | 135 |
| NWSL | 189 | 0 | **16** | 32 |
### Top Unresolved Stadium Names (Recent Naming Rights)
| Stadium Name | Occurrences | Actual Venue | Issue |
|--------------|-------------|--------------|-------|
| Sports Illustrated Stadium | 11 | Formerly Red Bull Arena (Harrison, NJ) | Naming rights change, missing alias |
| Mortgage Matchup Center | 8 | Phoenix Suns arena, formerly Footprint Center (PHX) | 2025 naming rights change |
| ScottsMiracle-Gro Field | 4 | MLS Columbus Crew | Missing alias |
| Energizer Park | 3 | St. Louis CITY SC (STL) | Missing alias |
| Xfinity Mobile Arena | 3 | Formerly Wells Fargo Center (PHI) | 2025 naming rights change |
| Rocket Arena | 3 | Formerly Rocket Mortgage FieldHouse (CLE) | 2025 naming rights change |
| CareFirst Arena | 2 | Washington Mystics venue | New WNBA venue name |
### Unresolved Teams (Exhibition/International)
| Team Name | Sport | Type | Games |
|-----------|-------|------|-------|
| BRAZIL | WNBA | International exhibition | 2 |
| Toyota Antelopes | WNBA | Japanese team | 2 |
| TEAM CLARK | WNBA | All-Star Game | 1 |
| (Various MLB) | MLB | International teams | 12 |
| (MLS international) | MLS | CCL/exhibition | 1 |
| (NFL preseason) | NFL | Pre-season exhibition | 1 |
### NHL Stadium Data Issue
**Critical:** Hockey Reference does not provide stadium data. All 1,312 NHL games have `raw_stadium: None`, so 100% of NHL games are missing stadium IDs. The NHL fallback sources (NHL API, ESPN) should provide this data, but they are never attempted: Hockey Reference succeeds as the primary source, so the fallback chain stops there, and the `max_sources_to_try = 2` limit would cap the chain at two sources in any case.
### Expected Count Updates Needed
| Sport | Current Expected | Recommended | Reason |
|-------|------------------|-------------|--------|
| WNBA | 220 | **286** | 13 teams × 44 games / 2 (expanded with Golden State Valkyries) |
| NFL | 272 | 272 (filter preseason) | Or document that 330 includes preseason |
| MLB | 2,430 | 2,430 (filter spring training) | Or document that 2,866 includes spring training |
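The recommended WNBA figure follows from the fact that every game is shared by two teams. A short worked calculation, using only the numbers in this section:

```python
def regular_season_games(teams: int, games_per_team: int) -> int:
    """Total league games: each game counts once for each of its two teams."""
    total, remainder = divmod(teams * games_per_team, 2)
    if remainder:
        raise ValueError("teams * games_per_team must be even")
    return total

wnba_expected = regular_season_games(13, 44)          # 13 teams, 44 games each
wnba_coverage = round(100 * 322 / wnba_expected, 1)   # scraped 322 vs new expected
wnba_extra_games = 322 - wnba_expected                # playoffs and exhibitions
```

Against the updated count, the 322 scraped WNBA games come out at about 112.6% coverage, with the 36 extra games attributable to playoffs and exhibitions.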
### Issues Found
| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 7 | NHL has no stadium data | High | Hockey Reference provides no venue info. All 1,312 games missing stadium_id. Fallback sources not tried. |
| 8 | 131 NBA stadium resolution failures | High | Recent naming rights changes ("Mortgage Matchup Center", "Xfinity Mobile Arena") not in aliases. |
| 9 | Outdated WNBA expected count | Medium | Config says 220 but WNBA expanded to 13 teams in 2025; actual is 322 (286 regular + playoffs). |
| 10 | MLS/WNBA stadium alias gaps | Medium | 64 MLS + 65 WNBA unresolved stadiums from new/renamed venues. |
| 11 | Game status not parsed | Low | All games have status `unknown` instead of final/scheduled/postponed. |
### Phase 4 Summary
**Result: FAIL** - Significant stadium resolution failures across multiple sports.
- ❌ 131 NBA games missing stadium (naming rights changes)
- ❌ 1,312 NHL games missing stadium (source doesn't provide data)
- ❌ 64 MLS + 65 WNBA stadiums unresolved (new/renamed venues)
- ⚠️ WNBA expected count severely outdated (220 vs 322 actual)
- ⚠️ MLB/NFL include preseason/spring training games
- ✅ No significant duplicate games (1 MLB doubleheader edge case)
- ✅ All teams resolved except exhibition/international games
---
## Phase 5 Results: Canonical ID Consistency
**Files Audited:**
- `sportstime_parser/normalizers/canonical_id.py` (Python ID generation)
- `SportsTime/Core/Models/Local/CanonicalModels.swift` (iOS models)
- `SportsTime/Core/Services/BootstrapService.swift` (iOS JSON parsing)
- All `Scripts/output/*.json` files (generated IDs)
### Format Validation
| Type | Total IDs | Valid | Invalid | Pass Rate |
|------|-----------|-------|---------|-----------|
| Team | 183 | 183 | 0 | 100.0% ✅ |
| Stadium | 211 | 211 | 0 | 100.0% ✅ |
| Game | 6,792 | 6,792 | 0 | 100.0% ✅ |
### ID Format Patterns (all validated)
```
Teams: team_{sport}_{abbrev} → team_nba_lal
Stadiums: stadium_{sport}_{normalized_name} → stadium_nba_cryptocom_arena
Games: game_{sport}_{season}_{YYYYMMDD}_{away}_{home}[_{#}]
→ game_nba_2025_20251021_hou_okc
```
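These patterns translate directly into regular expressions. The sketch below is one possible validator, not the code in `canonical_id.py`; it assumes a four-digit season, abbreviations of 2 to 4 lowercase alphanumeric characters (matching the lengths reported in this audit), and the seven sport codes covered here.

```python
import re

SPORT = r"(?:nba|mlb|nfl|nhl|mls|wnba|nwsl)"
ABBREV = r"[a-z0-9]{2,4}"            # assumed: 2 to 4 lowercase alphanumerics
SLUG = r"[a-z0-9]+(?:_[a-z0-9]+)*"   # no double, leading, or trailing underscores

PATTERNS = {
    "team": re.compile("team_" + SPORT + "_" + ABBREV),
    "stadium": re.compile("stadium_" + SPORT + "_" + SLUG),
    # Optional trailing _N distinguishes doubleheader games.
    "game": re.compile(
        "game_" + SPORT + r"_\d{4}_\d{8}_" + ABBREV + "_" + ABBREV + r"(?:_\d+)?"
    ),
}

def id_kind(canonical_id: str):
    """Return 'team', 'stadium', or 'game' for a well-formed ID, else None."""
    for kind, pattern in PATTERNS.items():
        if pattern.fullmatch(canonical_id):
            return kind
    return None
```

Using `fullmatch` rather than `match` rejects IDs with trailing characters, which covers the trailing-underscore case from the normalization checks.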
### Normalization Quality
| Check | Result |
|-------|--------|
| Double underscores (`__`) | 0 found ✅ |
| Leading/trailing underscores | 0 found ✅ |
| Uppercase letters | 0 found ✅ |
| Special characters | 0 found ✅ |
### Abbreviation Lengths (Teams)
| Length | Count |
|--------|-------|
| 2 chars | 21 |
| 3 chars | 161 |
| 4 chars | 1 |
### Stadium ID Lengths
- Minimum: 8 characters
- Maximum: 29 characters
- Average: 16.2 characters
### iOS Cross-Compatibility
| Aspect | Status | Notes |
|--------|--------|-------|
| Field naming convention | ✅ Compatible | Python uses snake_case; iOS `BootstrapService` uses matching Codable structs |
| Deterministic UUID generation | ✅ Compatible | iOS uses SHA256 hash of canonical_id - matches any valid string |
| Schema version | ✅ Compatible | Both use version 1 |
| Required fields | ✅ Present | All iOS-required fields present in JSON output |
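Deriving a UUID from a hash of the canonical ID is what makes the mapping deterministic. The sketch below shows the idea in Python; taking the first 16 bytes of the SHA-256 digest is an assumption, and the byte layout used by the iOS app may differ.

```python
import hashlib
import uuid

def deterministic_uuid(canonical_id: str) -> uuid.UUID:
    """Derive a stable UUID from a canonical ID.

    Assumption: the UUID is built from the first 16 bytes of the SHA-256
    digest; the actual iOS implementation is not shown in this report.
    """
    digest = hashlib.sha256(canonical_id.encode("utf-8")).digest()
    return uuid.UUID(bytes=digest[:16])
```

Because the digest depends only on the input string, the same canonical ID yields the same UUID on every run and on every device, and any valid string is accepted.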
### Field Mapping (Python → iOS)
| Python Field | iOS Field | Notes |
|--------------|-----------|-------|
| `canonical_id` | `canonicalId` | Mapped via `JSONCanonicalStadium.canonical_id` → `CanonicalStadium.canonicalId` |
| `home_team_canonical_id` | `homeTeamCanonicalId` | Explicit mapping in BootstrapService |
| `away_team_canonical_id` | `awayTeamCanonicalId` | Explicit mapping in BootstrapService |
| `stadium_canonical_id` | `stadiumCanonicalId` | Explicit mapping in BootstrapService |
| `game_datetime_utc` | `dateTime` | ISO 8601 parsing with fallback to legacy format |
### Issues Found
**No issues found.** All canonical IDs are:
- Correctly formatted according to defined patterns
- Properly normalized (lowercase, no special characters)
- Deterministic (same input produces same output)
- Compatible with iOS parsing
### Phase 5 Summary
**Result: PASS** - All canonical IDs are consistent and iOS-compatible.
- ✅ 100% format validation pass rate across 7,186 IDs
- ✅ No normalization issues found
- ✅ iOS BootstrapService explicitly handles snake_case → camelCase mapping
- ✅ Deterministic UUID generation using SHA256 hash
---
## Phase 6 Results: Referential Integrity
**Files Audited:**
- `Scripts/output/games_*_2025.json`
- `Scripts/output/teams_*.json`
- `Scripts/output/stadiums_*.json`
### Game → Team References
| Sport | Total Games | Valid Home | Valid Away | Orphan Home | Orphan Away | Status |
|-------|-------------|------------|------------|-------------|-------------|--------|
| NBA | 1,231 | 1,231 | 1,231 | 0 | 0 | ✅ |
| MLB | 2,866 | 2,866 | 2,866 | 0 | 0 | ✅ |
| NFL | 330 | 330 | 330 | 0 | 0 | ✅ |
| NHL | 1,312 | 1,312 | 1,312 | 0 | 0 | ✅ |
| MLS | 542 | 542 | 542 | 0 | 0 | ✅ |
| WNBA | 322 | 322 | 322 | 0 | 0 | ✅ |
| NWSL | 189 | 189 | 189 | 0 | 0 | ✅ |
**Result:** 100% valid team references across all 6,792 games.
### Game → Stadium References
| Sport | Total Games | Valid | Missing | Percentage Missing |
|-------|-------------|-------|---------|-------------------|
| NBA | 1,231 | 1,231 | 0 | 0.0% ✅ |
| MLB | 2,866 | 2,862 | 4 | 0.1% ✅ |
| NFL | 330 | 325 | 5 | 1.5% ✅ |
| NHL | 1,312 | 0 | **1,312** | **100%** ❌ |
| MLS | 542 | 478 | 64 | 11.8% ⚠️ |
| WNBA | 322 | 257 | 65 | 20.2% ⚠️ |
| NWSL | 189 | 173 | 16 | 8.5% ⚠️ |
**Note:** "Missing" means `stadium_canonical_id` is empty (resolution failed at scrape time). This is NOT orphan references to non-existent stadiums.
### Team → Stadium References
| Sport | Teams | Valid Stadium | Invalid | Status |
|-------|-------|---------------|---------|--------|
| NBA | 30 | 30 | 0 | ✅ |
| MLB | 30 | 30 | 0 | ✅ |
| NFL | 32 | 32 | 0 | ✅ |
| NHL | 32 | 32 | 0 | ✅ |
| MLS | 30 | 30 | 0 | ✅ |
| WNBA | 13 | 13 | 0 | ✅ |
| NWSL | 16 | 16 | 0 | ✅ |
**Result:** 100% valid team → stadium references.
### Cross-Sport Stadium Check
✅ No stadiums are duplicated across sports. Each `stadium_{sport}_*` ID is unique to its sport.
### Missing Stadium Root Causes
| Sport | Missing | Root Cause |
|-------|---------|------------|
| NHL | 1,312 | **Hockey Reference provides no venue data** - source limitation |
| MLS | 64 | New/renamed stadiums not in aliases (see Phase 4) |
| WNBA | 65 | New venue names not in aliases (see Phase 4) |
| NWSL | 16 | Expansion team venues + alternate venues |
| NFL | 5 | International games not in stadium mappings |
| MLB | 4 | Exhibition/international games |
### Orphan Reference Summary
| Reference Type | Total Checked | Orphans Found |
|----------------|---------------|---------------|
| Game → Home Team | 6,792 | 0 ✅ |
| Game → Away Team | 6,792 | 0 ✅ |
| Game → Stadium | 6,792 | 0 ✅ |
| Team → Stadium | 183 | 0 ✅ |
**Note:** Zero orphan references. All "missing" stadiums are resolution failures (empty string), not references to non-existent canonical IDs.
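The distinction between a missing reference and an orphan reference can be made explicit in code. A minimal sketch, assuming each game record exposes `stadium_canonical_id` as a string that is empty when resolution failed:

```python
def classify_stadium_refs(games, stadium_ids):
    """Count games by stadium reference state: valid, missing (empty), or orphan."""
    summary = {"valid": 0, "missing": 0, "orphan": 0}
    for game in games:
        ref = game.get("stadium_canonical_id") or ""
        if not ref:
            summary["missing"] += 1   # resolution failed at scrape time
        elif ref in stadium_ids:
            summary["valid"] += 1
        else:
            summary["orphan"] += 1    # points at an ID that does not exist
    return summary

# Totals reported in this phase: missing games per sport.
MISSING_BY_SPORT = {"nhl": 1312, "mls": 64, "wnba": 65, "nwsl": 16, "nfl": 5, "mlb": 4}
total_missing = sum(MISSING_BY_SPORT.values())
missing_pct = round(100 * total_missing / 6792, 1)
```

Summing the per-sport figures gives the 1,466 games (21.6% of 6,792) quoted in the summary.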
### Issues Found
| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 12 | NHL games have no stadium data | Medium | Hockey Reference source doesn't provide venue information. All 1,312 NHL games have empty stadium_canonical_id. Fallback sources could provide this data but are limited by `max_sources_to_try = 2`. |
### Phase 6 Summary
**Result: PASS with known limitations** - No orphan references exist; missing stadiums are resolution failures.
- ✅ 100% valid team references (home and away)
- ✅ 100% valid team → stadium references
- ✅ No orphan references to non-existent canonical IDs
- ⚠️ 1,466 games (21.6%) have empty stadium_canonical_id (resolution failures, not orphans)
- ⚠️ NHL accounts for roughly 90% of missing stadium data (1,312 of 1,466 games; source limitation)
---
## Phase 7 Results: iOS Data Reception
**Files Audited:**
- `SportsTime/Core/Services/BootstrapService.swift` (JSON parsing)
- `SportsTime/Core/Services/CanonicalSyncService.swift` (CloudKit sync)
- `SportsTime/Core/Services/DataProvider.swift` (data access)
- `SportsTime/Core/Models/Local/CanonicalModels.swift` (SwiftData models)
- `SportsTime/Resources/*_canonical.json` (bundled data files)
### Bundled Data Comparison
| Data Type | iOS Bundled | Scripts Output | Difference | Status |
|-----------|-------------|----------------|------------|--------|
| Teams | 148 | 183 | **-35** (19%) | ❌ STALE |
| Stadiums | 122 | 211 | **-89** (42%) | ❌ STALE |
| Games | 4,972 | 6,792 | **-1,820** (27%) | ❌ STALE |
**iOS bundled data is significantly outdated compared to Scripts output.**
### Field Mapping Verification
| Python Field | iOS JSON Struct | iOS Model | Type Match | Status |
|--------------|-----------------|-----------|------------|--------|
| `canonical_id` | `canonical_id` | `canonicalId` | String ✅ | ✅ |
| `name` | `name` | `name` | String ✅ | ✅ |
| `game_datetime_utc` | `game_datetime_utc` | `dateTime` | ISO 8601 → Date ✅ | ✅ |
| `date` + `time` (legacy) | `date`, `time` | `dateTime` | Fallback parsing ✅ | ✅ |
| `home_team_canonical_id` | `home_team_canonical_id` | `homeTeamCanonicalId` | String ✅ | ✅ |
| `away_team_canonical_id` | `away_team_canonical_id` | `awayTeamCanonicalId` | String ✅ | ✅ |
| `stadium_canonical_id` | `stadium_canonical_id` | `stadiumCanonicalId` | String ✅ | ✅ |
| `sport` | `sport` | `sport` | String ✅ | ✅ |
| `season` | `season` | `season` | String ✅ | ✅ |
| `is_playoff` | `is_playoff` | `isPlayoff` | Bool ✅ | ✅ |
| `broadcast_info` | `broadcast_info` | `broadcastInfo` | String? ✅ | ✅ |
**Result:** All field mappings are correct and compatible.
### Date Parsing Compatibility
iOS `BootstrapService` supports both formats:
```swift
// New canonical format (preferred)
let game_datetime_utc: String? // ISO 8601
// Legacy format (fallback)
let date: String? // "YYYY-MM-DD"
let time: String? // "HH:mm" or "TBD"
```
**Current iOS bundled games use legacy format.** After updating bundled data, new `game_datetime_utc` format will be used.
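The preferred-then-fallback parsing can be sketched in Python as follows. This illustrates the logic only and is not a port of `BootstrapService`: treating legacy times as UTC and mapping `"TBD"` to midnight are assumptions, since the report does not say how the app handles either case.

```python
from datetime import datetime, timezone
from typing import Optional

def parse_game_datetime(record: dict) -> Optional[datetime]:
    """Prefer the ISO 8601 field; fall back to legacy date + time fields."""
    iso = record.get("game_datetime_utc")
    if iso:
        # Older Python versions reject a trailing "Z", so normalize it first.
        return datetime.fromisoformat(iso.replace("Z", "+00:00"))
    day = record.get("date")
    if not day:
        return None
    time_str = record.get("time") or "TBD"
    if time_str == "TBD":
        time_str = "00:00"  # assumption: placeholder time for unscheduled games
    parsed = datetime.strptime(f"{day} {time_str}", "%Y-%m-%d %H:%M")
    return parsed.replace(tzinfo=timezone.utc)  # assumption: legacy times are UTC
```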
### Missing Reference Handling
**`DataProvider.filterRichGames()` behavior:**
```swift
return games.compactMap { game in
    guard let homeTeam = teamsById[game.homeTeamId],
          let awayTeam = teamsById[game.awayTeamId],
          let stadium = stadiumsById[game.stadiumId] else {
        return nil // Silently drops game
    }
    return RichGame(...)
}
```
**Impact:**
- Games with missing stadium IDs are **silently excluded** from RichGame queries
- No error logging or fallback behavior
- User sees fewer games than expected without explanation
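One way to make the exclusion visible is to record which lookups failed before dropping a game. The sketch below expresses that logic in Python purely as an illustration; it is not Swift code for `DataProvider`, and the record keys (`id`, `home_team_id`, `away_team_id`, `stadium_id`) are hypothetical.

```python
import logging

logger = logging.getLogger("rich_games")

def filter_rich_games(games, teams_by_id, stadiums_by_id):
    """Return (rich_games, dropped); each dropped entry names the failed lookups."""
    lookups = (
        ("home_team_id", teams_by_id),
        ("away_team_id", teams_by_id),
        ("stadium_id", stadiums_by_id),
    )
    rich, dropped = [], []
    for game in games:
        missing = [field for field, table in lookups if game.get(field) not in table]
        if missing:
            # Log instead of silently discarding, so the gap can be diagnosed.
            logger.warning("Dropping game %s: unresolved %s",
                           game["id"], ", ".join(missing))
            dropped.append((game["id"], missing))
            continue
        rich.append({
            "game": game,
            "home_team": teams_by_id[game["home_team_id"]],
            "away_team": teams_by_id[game["away_team_id"]],
            "stadium": stadiums_by_id[game["stadium_id"]],
        })
    return rich, dropped
```

Returning the dropped list alongside the results lets a caller surface a count or a placeholder instead of showing a shorter schedule with no explanation.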
### Deduplication Logic
**Bootstrap:** No explicit deduplication. If bundled JSON contains duplicate canonical IDs, both would be inserted into SwiftData (leading to potential query issues).
**CloudKit Sync:** Uses upsert pattern with canonical ID as unique key - duplicates would overwrite.
### Schema Version Compatibility
| Component | Schema Version | Status |
|-----------|----------------|--------|
| Scripts output | 1 | ✅ |
| iOS CanonicalModels | 1 | ✅ |
| iOS BootstrapService | Expects 1 | ✅ |
**Compatible.** Schema version mismatch protection exists in `CanonicalSyncService`:
```swift
case .schemaVersionTooNew(let version):
    return "Data requires app version supporting schema \(version). Please update the app."
```
### Bootstrap Order Validation
iOS bootstraps in correct dependency order:
1. Stadiums (no dependencies)
2. Stadium aliases (depends on stadiums)
3. League structure (no dependencies)
4. Teams (depends on stadiums)
5. Team aliases (depends on teams)
6. Games (depends on teams + stadiums)
**Correct - prevents orphan references during bootstrap.**
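The documented order can be checked against the stated dependencies with a topological sort. A minimal sketch using Python's standard `graphlib` (Python 3.9+); the entity keys are shorthand labels for the six steps above, not identifiers from the iOS code.

```python
from graphlib import TopologicalSorter

# Each entity maps to the set of entities it depends on, as listed above.
DEPENDENCIES = {
    "stadiums": set(),
    "stadium_aliases": {"stadiums"},
    "league_structure": set(),
    "teams": {"stadiums"},
    "team_aliases": {"teams"},
    "games": {"teams", "stadiums"},
}

DOCUMENTED_ORDER = [
    "stadiums", "stadium_aliases", "league_structure",
    "teams", "team_aliases", "games",
]

def is_valid_order(order, deps=DEPENDENCIES) -> bool:
    """True if every entity appears after all of its dependencies."""
    position = {name: index for index, name in enumerate(order)}
    return all(
        position[dep] < position[name]
        for name, requires in deps.items()
        for dep in requires
    )

# One valid order derived from the graph; ties may be broken differently
# from the documented list, which is fine as long as dependencies come first.
derived_order = list(TopologicalSorter(DEPENDENCIES).static_order())
```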
### CloudKit Sync Validation
`CanonicalSyncService` syncs in same dependency order and tracks:
- Per-entity sync timestamps
- Skipped records (incompatible schema version)
- Skipped records (older than local)
- Sync duration and cancellation
**Well-designed sync infrastructure.**
### Issues Found
| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 13 | iOS bundled data severely outdated | **Critical** | Missing 35 teams (19%), 89 stadiums (42%), 1,820 games (27%). First-launch experience shows incomplete data until CloudKit sync completes. |
| 14 | Silent game exclusion in RichGame queries | Medium | `filterRichGames()` silently drops games with missing team/stadium references. Users see fewer games without explanation. |
| 15 | No bootstrap deduplication | Low | Duplicate game IDs in bundled JSON would create duplicate SwiftData records. Low risk since JSON is generated correctly. |
### Phase 7 Summary
**Result: FAIL** - iOS bundled data is critically outdated.
- ❌ iOS bundled data missing 35 teams, 89 stadiums, 1,820 games
- ⚠️ Games with unresolved references silently dropped from RichGame queries
- ✅ Field mapping between Python and iOS is correct
- ✅ Date parsing supports both legacy and new formats
- ✅ Schema versions are compatible
- ✅ Bootstrap/sync order handles dependencies correctly
---
## Prioritized Issue List
| # | Issue | Severity | Phase | Root Cause | Remediation |
|---|-------|----------|-------|------------|-------------|
| 13 | iOS bundled data severely outdated | **Critical** | 7 | Bundled JSON not updated after pipeline runs | Copy Scripts/output/*_canonical.json to iOS Resources/ and rebuild |
| 4 | WNBA/NWSL/MLS ESPN-only source | **High** | 3 | No implemented fallback sources | Implement alternative scrapers (FBref for MLS, WNBA League Pass) |
| 5 | max_sources_to_try = 2 limits fallback | **High** | 3 | Hardcoded limit in base.py:189 | Increase to 3 or remove limit for sports with 3+ sources |
| 7 | NHL has no stadium data from primary source | **High** | 4 | Hockey Reference doesn't provide venue info | Force NHL to use NHL API or ESPN as primary (they provide venues) |
| 8 | 131 NBA stadium resolution failures | **High** | 4 | 2024-2025 naming rights not in aliases | Add aliases: "Mortgage Matchup Center" → Phoenix Suns arena (formerly Footprint Center), "Xfinity Mobile Arena" → Philadelphia 76ers arena (formerly Wells Fargo Center) |
| 2 | Orphan stadium alias references | **Medium** | 2 | Wrong canonical IDs in stadium_aliases.json | Fix 5 Denver/KC stadium aliases pointing to non-existent IDs |
| 6 | CBS/FBref scrapers declared but not implemented | **Medium** | 3 | NotImplementedError at runtime | Either implement or remove from source lists to avoid confusion |
| 9 | Outdated WNBA expected count | **Medium** | 4 | WNBA expanded to 13 teams in 2025 | Update config.py EXPECTED_GAME_COUNTS["wnba"] from 220 to 286 |
| 10 | MLS/WNBA stadium alias gaps | **Medium** | 4 | New/renamed venues missing from aliases | Add aliases for the new/renamed venues behind the 129 unresolved stadium references (64 MLS + 65 WNBA games) |
| 12 | NHL games have no stadium data | **Medium** | 6 | Same as Issue #7 | See Issue #7 remediation |
| 14 | Silent game exclusion in RichGame queries | **Medium** | 7 | compactMap silently drops games | Log dropped games or return partial RichGame with placeholder stadium |
| 1 | WNBA single abbreviations | **Low** | 1 | Only 1 abbreviation per team | Add alternative abbreviations for source compatibility |
| 3 | No NFL team aliases | **Low** | 2 | Missing Washington Redskins/Football Team | Add historical Washington team name aliases |
| 11 | Game status not parsed | **Low** | 4 | Status field always "unknown" | Parse game status from source data (final, scheduled, postponed) |
| 15 | No bootstrap deduplication | **Low** | 7 | No explicit duplicate check during bootstrap | Add deduplication check in bootstrapGames() |
---
## Recommended Next Steps
### Immediate (Before Next Release)
1. **Update iOS bundled data** (Issue #13)
```bash
# Copy the generated canonical files into the iOS bundle. The destination is a
# directory, so the glob can safely match several files.
cp Scripts/output/*_canonical.json SportsTime/Resources/
```
2. **Fix NHL stadium data** (Issues #7, #12)
- Change NHL primary source from Hockey Reference to NHL API
- Or: Increase `max_sources_to_try` to 3 so fallbacks are attempted
3. **Add critical stadium aliases** (Issues #8, #10)
- "Mortgage Matchup Center" → canonical ID of the Phoenix Suns arena (formerly Footprint Center)
- "Xfinity Mobile Arena" → canonical ID of the Philadelphia 76ers arena (formerly Wells Fargo Center)
- Run validation report to identify all unresolved venue names
### Short-term (This Quarter)
4. **Implement MLS fallback source** (Issue #4)
- FBref has MLS data with venue information
- Reduces ESPN single-point-of-failure risk
5. **Fix orphan alias references** (Issue #2)
- Correct 5 NFL stadium aliases pointing to wrong canonical IDs
- Add validation check to prevent future orphan references
6. **Update expected game counts** (Issue #9)
- WNBA: 220 → 286 (13 teams × 44 games / 2)
### Long-term (Next Quarter)
7. **Implement WNBA/NWSL fallback sources** (Issue #4)
- Consider WNBA League Pass API or other sources
- NWSL has limited data availability - may need to accept ESPN-only
8. **Add RichGame partial loading** (Issue #14)
- Log games dropped due to missing references
- Consider returning games with placeholder stadiums for NHL
9. **Parse game status** (Issue #11)
- Extract final/scheduled/postponed from source data
- Enables filtering by game state
---
## Verification Checklist
After implementing fixes, verify:
- [ ] Run `python -m sportstime_parser scrape --sport all --season 2025`
- [ ] Check validation reports show <5% unresolved stadiums per sport
- [ ] Copy output JSON to iOS Resources/
- [ ] Build iOS app and verify data loads at startup
- [ ] Query RichGames and verify game count matches expectations
- [ ] Run CloudKit sync and verify no errors