Files
SportstimeAPI/docs/DATA_AUDIT.md
Trey t 52d445bca4 feat(scripts): add sportstime-parser data pipeline
Complete Python package for scraping, normalizing, and uploading
sports schedule data to CloudKit. Includes:

- Multi-source scrapers for NBA, MLB, NFL, NHL, MLS, WNBA, NWSL
- Canonical ID system for teams, stadiums, and games
- Fuzzy matching with manual alias support
- CloudKit uploader with batch operations and deduplication
- Comprehensive test suite with fixtures
- WNBA abbreviation aliases for improved team resolution
- Alias validation script to detect orphan references

All 5 phases of data remediation plan completed:
- Phase 1: Alias fixes (team/stadium alias additions)
- Phase 2: NHL stadium coordinate fixes
- Phase 3: Re-scrape validation
- Phase 4: iOS bundle update
- Phase 5: Code quality improvements (WNBA aliases)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 18:56:25 -06:00

806 lines
34 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters
This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# SportsTime Data Audit Report
**Generated:** 2026-01-20
**Scope:** NBA, MLB, NFL, NHL, MLS, WNBA, NWSL
**Data Pipeline:** Scripts → CloudKit → iOS App
---
## Executive Summary
The data audit identified **15 issues** across the SportsTime data pipeline, with significant gaps in source reliability, stadium resolution, and iOS data freshness.
| Severity | Count | Description |
|----------|-------|-------------|
| **Critical** | 1 | iOS bundled data severely outdated |
| **High** | 4 | Single-source sports, NHL stadium data, NBA naming rights |
| **Medium** | 6 | Alias gaps, outdated config, silent game exclusion |
| **Low** | 4 | Minor configuration and coverage issues |
### Key Findings
**Data Pipeline Health:**
-**Canonical ID system**: 100% format compliance across 7,186 IDs
-**Team mappings**: All 183 teams correctly mapped with current abbreviations
-**Referential integrity**: Zero orphan references (0 games pointing to non-existent teams/stadiums)
- ⚠️ **Stadium resolution**: 1,466 games (21.6%) have unresolved stadiums
**Critical Risks:**
1. **ESPN single-point-of-failure** for WNBA, NWSL, MLS - if ESPN changes, 3 sports lose all data
2. **NHL has 100% missing stadiums** - Hockey Reference provides no venue data
3. **iOS bundled data 27% behind** - 1,820 games missing from first-launch experience
**Root Causes:**
- Stadium naming rights changed faster than alias updates (2024-2025)
- Fallback source limit (`max_sources_to_try = 2`) prevents third source from being tried
- Hockey Reference source limitation (no venue info) combined with fallback limit
- iOS bundled JSON not updated with latest pipeline output
---
## Phase Status Tracking
| Phase | Status | Issues Found |
|-------|--------|--------------|
| 1. Hardcoded Mapping Audit | ✅ COMPLETE | 1 Low |
| 2. Alias File Completeness | ✅ COMPLETE | 1 Medium, 1 Low |
| 3. Scraper Source Reliability | ✅ COMPLETE | 2 High, 1 Medium |
| 4. Game Count & Coverage | ✅ COMPLETE | 2 High, 2 Medium, 1 Low |
| 5. Canonical ID Consistency | ✅ COMPLETE | 0 issues |
| 6. Referential Integrity | ✅ COMPLETE | 1 Medium (NHL source) |
| 7. iOS Data Reception | ✅ COMPLETE | 1 Critical, 1 Medium, 1 Low |
---
## Phase 1 Results: Hardcoded Mapping Audit
**Files Audited:**
- `sportstime_parser/normalizers/team_resolver.py` (TEAM_MAPPINGS)
- `sportstime_parser/normalizers/stadium_resolver.py` (STADIUM_MAPPINGS)
### Team Counts
| Sport | Hardcoded | Expected | Abbreviations | Status |
|-------|-----------|----------|---------------|--------|
| NBA | 30 | 30 | 38 | ✅ |
| MLB | 30 | 30 | 38 | ✅ |
| NFL | 32 | 32 | 40 | ✅ |
| NHL | 32 | 32 | 41 | ✅ |
| MLS | 30 | 30* | 32 | ✅ |
| WNBA | 13 | 13 | 13 | ✅ |
| NWSL | 16 | 16 | 24 | ✅ |
*MLS: 29 original teams + San Diego FC (2025 expansion) = 30
### Stadium Counts
| Sport | Hardcoded | Notes | Status |
|-------|-----------|-------|--------|
| NBA | 30 | 1 per team | ✅ |
| MLB | 57 | 30 regular + 18 spring training + 9 special venues | ✅ |
| NFL | 30 | Includes shared venues (SoFi Stadium: LAR+LAC, MetLife: NYG+NYJ) | ✅ |
| NHL | 32 | 1 per team | ✅ |
| MLS | 30 | 1 per team | ✅ |
| WNBA | 13 | 1 per team | ✅ |
| NWSL | 19 | 14 current + 5 expansion team venues (Boston/Denver) | ✅ |
### Recent Updates Verification
| Update | Type | Status | Notes |
|--------|------|--------|-------|
| Utah Hockey Club (NHL) | Relocation | ✅ Present | ARI + UTA abbreviations both map to `team_nhl_ari` |
| Golden State Valkyries (WNBA) | Expansion 2025 | ✅ Present | `team_wnba_gsv` with Chase Center venue |
| Boston Legacy FC (NWSL) | Expansion 2026 | ✅ Present | `team_nwsl_bos` with Gillette Stadium |
| Denver Summit FC (NWSL) | Expansion 2026 | ✅ Present | `team_nwsl_den` with Dick's Sporting Goods Park |
| Oakland A's → Sacramento | Temporary relocation | ✅ Present | `stadium_mlb_sutter_health_park` |
| San Diego FC (MLS) | Expansion 2025 | ✅ Present | `team_mls_sd` with Snapdragon Stadium |
| FedExField → Northwest Stadium | Naming rights | ✅ Present | `stadium_nfl_northwest_stadium` |
### NFL Stadium Sharing
| Stadium | Teams | Status |
|---------|-------|--------|
| SoFi Stadium | LAR, LAC | ✅ Correct |
| MetLife Stadium | NYG, NYJ | ✅ Correct |
### Issues Found
| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 1 | WNBA single abbreviations | Low | All 13 WNBA teams have only 1 abbreviation each. May need additional abbreviations for source compatibility. |
### Phase 1 Summary
**Result: PASS** - All team and stadium mappings are complete and up-to-date with 2025-2026 changes.
- ✅ All 7 sports have correct team counts
- ✅ All stadium counts are appropriate (including spring training, special venues)
- ✅ Recent franchise moves/expansions are reflected
- ✅ Stadium sharing is correctly handled
- ✅ Naming rights updates are current
---
## Phase 2 Results: Alias File Completeness
**Files Audited:**
- `Scripts/team_aliases.json`
- `Scripts/stadium_aliases.json`
### Team Aliases Summary
| Sport | Entries | Coverage | Status |
|-------|---------|----------|--------|
| MLB | 23 | Historical relocations/renames | ✅ |
| NBA | 29 | Historical relocations/renames | ✅ |
| NHL | 24 | Historical relocations/renames | ✅ |
| NFL | 0 | **No aliases** | ⚠️ |
| MLS | 0 | No aliases (newer league) | ✅ |
| WNBA | 0 | No aliases (newer league) | ✅ |
| NWSL | 0 | No aliases (newer league) | ✅ |
| **Total** | **76** | | |
- All 76 entries have valid date ranges
- No orphan references (all canonical IDs exist in mappings)
### Stadium Aliases Summary
| Sport | Entries | Coverage | Status |
|-------|---------|----------|--------|
| MLB | 109 | Regular + spring training + special venues | ✅ |
| NFL | 65 | Naming rights history | ✅ |
| NBA | 44 | Naming rights history | ✅ |
| NHL | 39 | Naming rights history | ✅ |
| MLS | 35 | Current + naming variants | ✅ |
| WNBA | 15 | Current + naming variants | ✅ |
| NWSL | 14 | Current + naming variants | ✅ |
| **Total** | **321** | | |
- 65 entries have date ranges (historical naming rights)
- 256 entries are permanent aliases (no date restrictions)
### Orphan Reference Check
| Type | Count | Status |
|------|-------|--------|
| Team aliases with invalid references | 0 | ✅ |
| Stadium aliases with invalid references | **5** | ❌ |
**Orphan Stadium References Found:**
| Alias Name | References (Invalid) | Correct ID |
|------------|---------------------|------------|
| Broncos Stadium at Mile High | `stadium_nfl_empower_field_at_mile_high` | `stadium_nfl_empower_field` |
| Sports Authority Field at Mile High | `stadium_nfl_empower_field_at_mile_high` | `stadium_nfl_empower_field` |
| Invesco Field at Mile High | `stadium_nfl_empower_field_at_mile_high` | `stadium_nfl_empower_field` |
| Mile High Stadium | `stadium_nfl_empower_field_at_mile_high` | `stadium_nfl_empower_field` |
| Arrowhead Stadium | `stadium_nfl_geha_field_at_arrowhead_stadium` | `stadium_nfl_arrowhead_stadium` |
### Historical Changes Coverage
| Historical Name | Current Team | In Aliases? |
|-----------------|--------------|-------------|
| Montreal Expos | Washington Nationals | ✅ |
| Seattle SuperSonics | Oklahoma City Thunder | ✅ |
| Arizona Coyotes | Utah Hockey Club | ✅ |
| Cleveland Indians | Cleveland Guardians | ✅ |
| Hartford Whalers | Carolina Hurricanes | ✅ |
| Quebec Nordiques | Colorado Avalanche | ✅ |
| Vancouver Grizzlies | Memphis Grizzlies | ✅ |
| Washington Redskins | Washington Commanders | ❌ Missing |
| Washington Football Team | Washington Commanders | ❌ Missing |
| Brooklyn Dodgers | Los Angeles Dodgers | ❌ Missing |
### Issues Found
| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 2 | Orphan stadium alias references | Medium | 5 stadium aliases point to non-existent canonical IDs (`stadium_nfl_empower_field_at_mile_high`, `stadium_nfl_geha_field_at_arrowhead_stadium`). Causes resolution failures for historical Denver/KC stadium names. |
| 3 | No NFL team aliases | Low | Missing Washington Redskins/Football Team historical names. Limits historical game matching for NFL. |
### Phase 2 Summary
**Result: PASS with issues** - Alias files cover most historical changes but have referential integrity bugs.
- ✅ Team aliases cover MLB/NBA/NHL historical changes
- ✅ Stadium aliases cover naming rights changes across all sports
- ✅ No date range validation errors
- ❌ 5 orphan stadium references need fixing
- ⚠️ No NFL team aliases (Washington Redskins/Football Team missing)
---
## Phase 3 Results: Scraper Source Reliability
**Files Audited:**
- `sportstime_parser/scrapers/base.py` (fallback logic)
- `sportstime_parser/scrapers/nba.py`, `mlb.py`, `nfl.py`, `nhl.py`, `mls.py`, `wnba.py`, `nwsl.py`
### Source Dependency Matrix
| Sport | Primary | Status | Fallback 1 | Status | Fallback 2 | Status | Risk |
|-------|---------|--------|------------|--------|------------|--------|------|
| NBA | basketball_reference | ✅ | espn | ✅ | cbs | ❌ NOT IMPL | Medium |
| MLB | mlb_api | ✅ | espn | ✅ | baseball_reference | ✅ | Low |
| NFL | espn | ✅ | pro_football_reference | ✅ | cbs | ❌ NOT IMPL | Medium |
| NHL | hockey_reference | ✅ | nhl_api | ✅ | espn | ✅ | Low |
| MLS | espn | ✅ | fbref | ❌ NOT IMPL | - | - | **HIGH** |
| WNBA | espn | ✅ | - | - | - | - | **HIGH** |
| NWSL | espn | ✅ | - | - | - | - | **HIGH** |
### Unimplemented Sources
| Sport | Source | Line | Status |
|-------|--------|------|--------|
| NBA | cbs | `nba.py:421` | `raise NotImplementedError("CBS scraper not implemented")` |
| NFL | cbs | `nfl.py:386` | `raise NotImplementedError("CBS scraper not implemented")` |
| MLS | fbref | `mls.py:214` | `raise NotImplementedError("FBref scraper not implemented")` |
### Fallback Logic Analysis
**File:** `base.py:189`
```python
max_sources_to_try = 2 # Don't try all sources if first few return nothing
```
**Impact:**
- Even if 3 sources are declared, only 2 are tried
- If sources 1 and 2 fail, source 3 is never attempted
- This limits resilience for NBA, MLB, NFL, NHL which have 3 sources
### International Game Filtering
| Sport | Hardcoded Locations | Notes |
|-------|---------------------|-------|
| NFL | London, Mexico City, Frankfurt, Munich, São Paulo | ✅ Complete for 2025 |
| NHL | Prague, Stockholm, Helsinki, Tampere, Gothenburg | ✅ Complete for 2025 |
| NBA | None | ⚠️ No international filtering (Abu Dhabi games?) |
| MLB | None | ⚠️ No international filtering (Mexico City games?) |
| MLS | None | N/A (domestic only) |
| WNBA | None | N/A (domestic only) |
| NWSL | None | N/A (domestic only) |
### Single Point of Failure Risk
| Sport | Primary Source | If ESPN Fails... | Risk Level |
|-------|----------------|------------------|------------|
| WNBA | ESPN only | **Complete data loss** | Critical |
| NWSL | ESPN only | **Complete data loss** | Critical |
| MLS | ESPN only (fbref not impl) | **Complete data loss** | Critical |
| NBA | Basketball-Ref → ESPN | ESPN fallback available | Low |
| NFL | ESPN → Pro-Football-Ref | Fallback available | Low |
| NHL | Hockey-Ref → NHL API → ESPN | Two fallbacks | Very Low |
| MLB | MLB API → ESPN → B-Ref | Two fallbacks | Very Low |
### Issues Found
| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 4 | WNBA/NWSL/MLS single source | High | ESPN is the only working source for 3 sports. If ESPN changes or fails, data collection completely stops. |
| 5 | max_sources_to_try = 2 | High | Third fallback source never tried even if available. Reduces resilience for NBA/MLB/NFL/NHL. |
| 6 | CBS/FBref not implemented | Medium | Declared fallback sources raise NotImplementedError. Appears functional in config but fails at runtime. |
### Phase 3 Summary
**Result: FAIL** - Critical single-point-of-failure for 3 sports.
- ❌ WNBA, NWSL, MLS have only ESPN (no resilience)
- ❌ Fallback limit of 2 prevents third source from being tried
- ⚠️ CBS and FBref declared but not implemented
- ✅ MLB and NHL have full fallback chains
- ✅ International game filtering present for NFL/NHL
---
## Phase 4 Results: Game Count & Coverage
**Files Audited:**
- `Scripts/output/games_*.json` (all 2025 season files)
- `Scripts/output/validation_*.md` (all validation reports)
- `sportstime_parser/config.py` (EXPECTED_GAME_COUNTS)
### Coverage Summary
| Sport | Scraped | Expected | Coverage | Status |
|-------|---------|----------|----------|--------|
| NBA | 1,231 | 1,230 | 100.1% | ✅ |
| MLB | 2,866 | 2,430 | 117.9% | ⚠️ Includes spring training |
| NFL | 330 | 272 | 121.3% | ⚠️ Includes preseason/playoffs |
| NHL | 1,312 | 1,312 | 100.0% | ✅ |
| MLS | 542 | 493 | 109.9% | ✅ Includes playoffs |
| WNBA | 322 | 220 | **146.4%** | ⚠️ Expected count outdated |
| NWSL | 189 | 182 | 103.8% | ✅ |
### Date Range Analysis
| Sport | Start Date | End Date | Notes |
|-------|------------|----------|-------|
| NBA | 2025-10-21 | 2026-04-12 | Regular season only |
| MLB | 2025-03-01 | 2025-11-02 | Includes spring training (417 games in March) |
| NFL | 2025-08-01 | 2026-01-25 | Includes preseason (49 in Aug) + playoffs (28 in Jan) |
| NHL | 2025-10-07 | 2026-04-16 | Regular season only |
| MLS | 2025-02-22 | 2025-11-30 | Regular season + playoffs |
| WNBA | 2025-05-02 | 2025-10-11 | Regular season + playoffs |
| NWSL | 2025-03-15 | 2025-11-23 | Regular season + playoffs |
### Game Status Distribution
All games across all sports have status `unknown` - game status is not being properly parsed from sources.
### Duplicate Game Detection
| Sport | Duplicates Found | Details |
|-------|-----------------|---------|
| NBA | 0 | ✅ |
| MLB | 1 | `game_mlb_2025_20250508_det_col_1` appears twice (doubleheader handling issue) |
| NFL | 0 | ✅ |
| NHL | 0 | ✅ |
| MLS | 0 | ✅ |
| WNBA | 0 | ✅ |
| NWSL | 0 | ✅ |
### Validation Report Analysis
| Sport | Total Games | Unresolved Teams | Unresolved Stadiums | Manual Review Items |
|-------|-------------|------------------|---------------------|---------------------|
| NBA | 1,231 | 0 | **131** | 131 |
| MLB | 2,866 | 12 | 4 | 20 |
| NFL | 330 | 1 | 5 | 11 |
| NHL | 1,312 | 0 | 0 | **1,312** (all missing stadiums) |
| MLS | 542 | 1 | **64** | 129 |
| WNBA | 322 | 5 | **65** | 135 |
| NWSL | 189 | 0 | **16** | 32 |
### Top Unresolved Stadium Names (Recent Naming Rights)
| Stadium Name | Occurrences | Actual Venue | Issue |
|--------------|-------------|--------------|-------|
| Sports Illustrated Stadium | 11 | MLS expansion venue | New venue, missing alias |
| Mortgage Matchup Center | 8 | Rocket Mortgage FieldHouse (CLE) | 2025 naming rights change |
| ScottsMiracle-Gro Field | 4 | MLS Columbus Crew | Missing alias |
| Energizer Park | 3 | MLS CITY SC (STL?) | Missing alias |
| Xfinity Mobile Arena | 3 | Intuit Dome (LAC) | 2025 naming rights change |
| Rocket Arena | 3 | Toyota Center (HOU) | Potential name change |
| CareFirst Arena | 2 | Washington Mystics venue | New WNBA venue name |
### Unresolved Teams (Exhibition/International)
| Team Name | Sport | Type | Games |
|-----------|-------|------|-------|
| BRAZIL | WNBA | International exhibition | 2 |
| Toyota Antelopes | WNBA | Japanese team | 2 |
| TEAM CLARK | WNBA | All-Star Game | 1 |
| (Various MLB) | MLB | International teams | 12 |
| (MLS international) | MLS | CCL/exhibition | 1 |
| (NFL preseason) | NFL | Pre-season exhibition | 1 |
### NHL Stadium Data Issue
**Critical:** Hockey Reference does not provide stadium data. All 1,312 NHL games have `raw_stadium: None`, causing 100% of games to have missing stadium IDs. The NHL fallback sources (NHL API, ESPN) should provide this data, but the `max_sources_to_try = 2` limit combined with Hockey Reference success means fallbacks are never attempted.
### Expected Count Updates Needed
| Sport | Current Expected | Recommended | Reason |
|-------|------------------|-------------|--------|
| WNBA | 220 | **286** | 13 teams × 44 games / 2 (expanded with Golden State Valkyries) |
| NFL | 272 | 272 (filter preseason) | Or document that 330 includes preseason |
| MLB | 2,430 | 2,430 (filter spring training) | Or document that 2,866 includes spring training |
### Issues Found
| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 7 | NHL has no stadium data | High | Hockey Reference provides no venue info. All 1,312 games missing stadium_id. Fallback sources not tried. |
| 8 | 131 NBA stadium resolution failures | High | Recent naming rights changes ("Mortgage Matchup Center", "Xfinity Mobile Arena") not in aliases. |
| 9 | Outdated WNBA expected count | Medium | Config says 220 but WNBA expanded to 13 teams in 2025; actual is 322 (286 regular + playoffs). |
| 10 | MLS/WNBA stadium alias gaps | Medium | 64 MLS + 65 WNBA unresolved stadiums from new/renamed venues. |
| 11 | Game status not parsed | Low | All games have status `unknown` instead of final/scheduled/postponed. |
### Phase 4 Summary
**Result: FAIL** - Significant stadium resolution failures across multiple sports.
- ❌ 131 NBA games missing stadium (naming rights changes)
- ❌ 1,312 NHL games missing stadium (source doesn't provide data)
- ❌ 64 MLS + 65 WNBA stadiums unresolved (new/renamed venues)
- ⚠️ WNBA expected count severely outdated (220 vs 322 actual)
- ⚠️ MLB/NFL include preseason/spring training games
- ✅ No significant duplicate games (1 MLB doubleheader edge case)
- ✅ All teams resolved except exhibition/international games
---
## Phase 5 Results: Canonical ID Consistency
**Files Audited:**
- `sportstime_parser/normalizers/canonical_id.py` (Python ID generation)
- `SportsTime/Core/Models/Local/CanonicalModels.swift` (iOS models)
- `SportsTime/Core/Services/BootstrapService.swift` (iOS JSON parsing)
- All `Scripts/output/*.json` files (generated IDs)
### Format Validation
| Type | Total IDs | Valid | Invalid | Pass Rate |
|------|-----------|-------|---------|-----------|
| Team | 183 | 183 | 0 | 100.0% ✅ |
| Stadium | 211 | 211 | 0 | 100.0% ✅ |
| Game | 6,792 | 6,792 | 0 | 100.0% ✅ |
### ID Format Patterns (all validated)
```
Teams: team_{sport}_{abbrev} → team_nba_lal
Stadiums: stadium_{sport}_{normalized_name} → stadium_nba_cryptocom_arena
Games: game_{sport}_{season}_{YYYYMMDD}_{away}_{home}[_{#}]
→ game_nba_2025_20251021_hou_okc
```
### Normalization Quality
| Check | Result |
|-------|--------|
| Double underscores (`__`) | 0 found ✅ |
| Leading/trailing underscores | 0 found ✅ |
| Uppercase letters | 0 found ✅ |
| Special characters | 0 found ✅ |
### Abbreviation Lengths (Teams)
| Length | Count |
|--------|-------|
| 2 chars | 21 |
| 3 chars | 161 |
| 4 chars | 1 |
### Stadium ID Lengths
- Minimum: 8 characters
- Maximum: 29 characters
- Average: 16.2 characters
### iOS Cross-Compatibility
| Aspect | Status | Notes |
|--------|--------|-------|
| Field naming convention | ✅ Compatible | Python uses snake_case; iOS `BootstrapService` uses matching Codable structs |
| Deterministic UUID generation | ✅ Compatible | iOS uses SHA256 hash of canonical_id - matches any valid string |
| Schema version | ✅ Compatible | Both use version 1 |
| Required fields | ✅ Present | All iOS-required fields present in JSON output |
### Field Mapping (Python → iOS)
| Python Field | iOS Field | Notes |
|--------------|-----------|-------|
| `canonical_id` | `canonicalId` | Mapped via `JSONCanonicalStadium.canonical_id``CanonicalStadium.canonicalId` |
| `home_team_canonical_id` | `homeTeamCanonicalId` | Explicit mapping in BootstrapService |
| `away_team_canonical_id` | `awayTeamCanonicalId` | Explicit mapping in BootstrapService |
| `stadium_canonical_id` | `stadiumCanonicalId` | Explicit mapping in BootstrapService |
| `game_datetime_utc` | `dateTime` | ISO 8601 parsing with fallback to legacy format |
### Issues Found
**No issues found.** All canonical IDs are:
- Correctly formatted according to defined patterns
- Properly normalized (lowercase, no special characters)
- Deterministic (same input produces same output)
- Compatible with iOS parsing
### Phase 5 Summary
**Result: PASS** - All canonical IDs are consistent and iOS-compatible.
- ✅ 100% format validation pass rate across 7,186 IDs
- ✅ No normalization issues found
- ✅ iOS BootstrapService explicitly handles snake_case → camelCase mapping
- ✅ Deterministic UUID generation using SHA256 hash
---
## Phase 6 Results: Referential Integrity
**Files Audited:**
- `Scripts/output/games_*_2025.json`
- `Scripts/output/teams_*.json`
- `Scripts/output/stadiums_*.json`
### Game → Team References
| Sport | Total Games | Valid Home | Valid Away | Orphan Home | Orphan Away | Status |
|-------|-------------|------------|------------|-------------|-------------|--------|
| NBA | 1,231 | 1,231 | 1,231 | 0 | 0 | ✅ |
| MLB | 2,866 | 2,866 | 2,866 | 0 | 0 | ✅ |
| NFL | 330 | 330 | 330 | 0 | 0 | ✅ |
| NHL | 1,312 | 1,312 | 1,312 | 0 | 0 | ✅ |
| MLS | 542 | 542 | 542 | 0 | 0 | ✅ |
| WNBA | 322 | 322 | 322 | 0 | 0 | ✅ |
| NWSL | 189 | 189 | 189 | 0 | 0 | ✅ |
**Result:** 100% valid team references across all 6,792 games.
### Game → Stadium References
| Sport | Total Games | Valid | Missing | Percentage Missing |
|-------|-------------|-------|---------|-------------------|
| NBA | 1,231 | 1,231 | 0 | 0.0% ✅ |
| MLB | 2,866 | 2,862 | 4 | 0.1% ✅ |
| NFL | 330 | 325 | 5 | 1.5% ✅ |
| NHL | 1,312 | 0 | **1,312** | **100%** ❌ |
| MLS | 542 | 478 | 64 | 11.8% ⚠️ |
| WNBA | 322 | 257 | 65 | 20.2% ⚠️ |
| NWSL | 189 | 173 | 16 | 8.5% ⚠️ |
**Note:** "Missing" means `stadium_canonical_id` is empty (resolution failed at scrape time). This is NOT orphan references to non-existent stadiums.
### Team → Stadium References
| Sport | Teams | Valid Stadium | Invalid | Status |
|-------|-------|---------------|---------|--------|
| NBA | 30 | 30 | 0 | ✅ |
| MLB | 30 | 30 | 0 | ✅ |
| NFL | 32 | 32 | 0 | ✅ |
| NHL | 32 | 32 | 0 | ✅ |
| MLS | 30 | 30 | 0 | ✅ |
| WNBA | 13 | 13 | 0 | ✅ |
| NWSL | 16 | 16 | 0 | ✅ |
**Result:** 100% valid team → stadium references.
### Cross-Sport Stadium Check
✅ No stadiums are duplicated across sports. Each `stadium_{sport}_*` ID is unique to its sport.
### Missing Stadium Root Causes
| Sport | Missing | Root Cause |
|-------|---------|------------|
| NHL | 1,312 | **Hockey Reference provides no venue data** - source limitation |
| MLS | 64 | New/renamed stadiums not in aliases (see Phase 4) |
| WNBA | 65 | New venue names not in aliases (see Phase 4) |
| NWSL | 16 | Expansion team venues + alternate venues |
| NFL | 5 | International games not in stadium mappings |
| MLB | 4 | Exhibition/international games |
### Orphan Reference Summary
| Reference Type | Total Checked | Orphans Found |
|----------------|---------------|---------------|
| Game → Home Team | 6,792 | 0 ✅ |
| Game → Away Team | 6,792 | 0 ✅ |
| Game → Stadium | 6,792 | 0 ✅ |
| Team → Stadium | 183 | 0 ✅ |
**Note:** Zero orphan references. All "missing" stadiums are resolution failures (empty string), not references to non-existent canonical IDs.
### Issues Found
| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 12 | NHL games have no stadium data | Medium | Hockey Reference source doesn't provide venue information. All 1,312 NHL games have empty stadium_canonical_id. Fallback sources could provide this data but are limited by `max_sources_to_try = 2`. |
### Phase 6 Summary
**Result: PASS with known limitations** - No orphan references exist; missing stadiums are resolution failures.
- ✅ 100% valid team references (home and away)
- ✅ 100% valid team → stadium references
- ✅ No orphan references to non-existent canonical IDs
- ⚠️ 1,466 games (21.6%) have empty stadium_canonical_id (resolution failures, not orphans)
- ⚠️ NHL accounts for 90% of missing stadium data (source limitation)
---
## Phase 7 Results: iOS Data Reception
**Files Audited:**
- `SportsTime/Core/Services/BootstrapService.swift` (JSON parsing)
- `SportsTime/Core/Services/CanonicalSyncService.swift` (CloudKit sync)
- `SportsTime/Core/Services/DataProvider.swift` (data access)
- `SportsTime/Core/Models/Local/CanonicalModels.swift` (SwiftData models)
- `SportsTime/Resources/*_canonical.json` (bundled data files)
### Bundled Data Comparison
| Data Type | iOS Bundled | Scripts Output | Difference | Status |
|-----------|-------------|----------------|------------|--------|
| Teams | 148 | 183 | **-35** (19%) | ❌ STALE |
| Stadiums | 122 | 211 | **-89** (42%) | ❌ STALE |
| Games | 4,972 | 6,792 | **-1,820** (27%) | ❌ STALE |
**iOS bundled data is significantly outdated compared to Scripts output.**
### Field Mapping Verification
| Python Field | iOS JSON Struct | iOS Model | Type Match | Status |
|--------------|-----------------|-----------|------------|--------|
| `canonical_id` | `canonical_id` | `canonicalId` | String ✅ | ✅ |
| `name` | `name` | `name` | String ✅ | ✅ |
| `game_datetime_utc` | `game_datetime_utc` | `dateTime` | ISO 8601 → Date ✅ | ✅ |
| `date` + `time` (legacy) | `date`, `time` | `dateTime` | Fallback parsing ✅ | ✅ |
| `home_team_canonical_id` | `home_team_canonical_id` | `homeTeamCanonicalId` | String ✅ | ✅ |
| `away_team_canonical_id` | `away_team_canonical_id` | `awayTeamCanonicalId` | String ✅ | ✅ |
| `stadium_canonical_id` | `stadium_canonical_id` | `stadiumCanonicalId` | String ✅ | ✅ |
| `sport` | `sport` | `sport` | String ✅ | ✅ |
| `season` | `season` | `season` | String ✅ | ✅ |
| `is_playoff` | `is_playoff` | `isPlayoff` | Bool ✅ | ✅ |
| `broadcast_info` | `broadcast_info` | `broadcastInfo` | String? ✅ | ✅ |
**Result:** All field mappings are correct and compatible.
### Date Parsing Compatibility
iOS `BootstrapService` supports both formats:
```swift
// New canonical format (preferred)
let game_datetime_utc: String? // ISO 8601
// Legacy format (fallback)
let date: String? // "YYYY-MM-DD"
let time: String? // "HH:mm" or "TBD"
```
**Current iOS bundled games use legacy format.** After updating bundled data, new `game_datetime_utc` format will be used.
### Missing Reference Handling
**`DataProvider.filterRichGames()` behavior:**
```swift
return games.compactMap { game in
guard let homeTeam = teamsById[game.homeTeamId],
let awayTeam = teamsById[game.awayTeamId],
let stadium = stadiumsById[game.stadiumId] else {
return nil // Silently drops game
}
return RichGame(...)
}
```
**Impact:**
- Games with missing stadium IDs are **silently excluded** from RichGame queries
- No error logging or fallback behavior
- User sees fewer games than expected without explanation
### Deduplication Logic
**Bootstrap:** No explicit deduplication. If bundled JSON contains duplicate canonical IDs, both would be inserted into SwiftData (leading to potential query issues).
**CloudKit Sync:** Uses upsert pattern with canonical ID as unique key - duplicates would overwrite.
### Schema Version Compatibility
| Component | Schema Version | Status |
|-----------|----------------|--------|
| Scripts output | 1 | ✅ |
| iOS CanonicalModels | 1 | ✅ |
| iOS BootstrapService | Expects 1 | ✅ |
**Compatible.** Schema version mismatch protection exists in `CanonicalSyncService`:
```swift
case .schemaVersionTooNew(let version):
return "Data requires app version supporting schema \(version). Please update the app."
```
### Bootstrap Order Validation
iOS bootstraps in correct dependency order:
1. Stadiums (no dependencies)
2. Stadium aliases (depends on stadiums)
3. League structure (no dependencies)
4. Teams (depends on stadiums)
5. Team aliases (depends on teams)
6. Games (depends on teams + stadiums)
**Correct - prevents orphan references during bootstrap.**
### CloudKit Sync Validation
`CanonicalSyncService` syncs in same dependency order and tracks:
- Per-entity sync timestamps
- Skipped records (incompatible schema version)
- Skipped records (older than local)
- Sync duration and cancellation
**Well-designed sync infrastructure.**
### Issues Found
| # | Issue | Severity | Description |
|---|-------|----------|-------------|
| 13 | iOS bundled data severely outdated | **Critical** | Missing 35 teams (19%), 89 stadiums (42%), 1,820 games (27%). First-launch experience shows incomplete data until CloudKit sync completes. |
| 14 | Silent game exclusion in RichGame queries | Medium | `filterRichGames()` silently drops games with missing team/stadium references. Users see fewer games without explanation. |
| 15 | No bootstrap deduplication | Low | Duplicate game IDs in bundled JSON would create duplicate SwiftData records. Low risk since JSON is generated correctly. |
### Phase 7 Summary
**Result: FAIL** - iOS bundled data is critically outdated.
- ❌ iOS bundled data missing 35 teams, 89 stadiums, 1,820 games
- ⚠️ Games with unresolved references silently dropped from RichGame queries
- ✅ Field mapping between Python and iOS is correct
- ✅ Date parsing supports both legacy and new formats
- ✅ Schema versions are compatible
- ✅ Bootstrap/sync order handles dependencies correctly
---
## Prioritized Issue List
| # | Issue | Severity | Phase | Root Cause | Remediation |
|---|-------|----------|-------|------------|-------------|
| 13 | iOS bundled data severely outdated | **Critical** | 7 | Bundled JSON not updated after pipeline runs | Copy Scripts/output/*_canonical.json to iOS Resources/ and rebuild |
| 4 | WNBA/NWSL/MLS ESPN-only source | **High** | 3 | No implemented fallback sources | Implement alternative scrapers (FBref for MLS, WNBA League Pass) |
| 5 | max_sources_to_try = 2 limits fallback | **High** | 3 | Hardcoded limit in base.py:189 | Increase to 3 or remove limit for sports with 3+ sources |
| 7 | NHL has no stadium data from primary source | **High** | 4 | Hockey Reference doesn't provide venue info | Force NHL to use NHL API or ESPN as primary (they provide venues) |
| 8 | 131 NBA stadium resolution failures | **High** | 4 | 2024-2025 naming rights not in aliases | Add aliases: "Mortgage Matchup Center" → Rocket Mortgage FieldHouse, "Xfinity Mobile Arena" → Intuit Dome |
| 2 | Orphan stadium alias references | **Medium** | 2 | Wrong canonical IDs in stadium_aliases.json | Fix 5 Denver/KC stadium aliases pointing to non-existent IDs |
| 6 | CBS/FBref scrapers declared but not implemented | **Medium** | 3 | NotImplementedError at runtime | Either implement or remove from source lists to avoid confusion |
| 9 | Outdated WNBA expected count | **Medium** | 4 | WNBA expanded to 13 teams in 2025 | Update config.py EXPECTED_GAME_COUNTS["wnba"] from 220 to 286 |
| 10 | MLS/WNBA stadium alias gaps | **Medium** | 4 | New/renamed venues missing from aliases | Add 129 missing stadium aliases (64 MLS + 65 WNBA) |
| 12 | NHL games have no stadium data | **Medium** | 6 | Same as Issue #7 | See Issue #7 remediation |
| 14 | Silent game exclusion in RichGame queries | **Medium** | 7 | compactMap silently drops games | Log dropped games or return partial RichGame with placeholder stadium |
| 1 | WNBA single abbreviations | **Low** | 1 | Only 1 abbreviation per team | Add alternative abbreviations for source compatibility |
| 3 | No NFL team aliases | **Low** | 2 | Missing Washington Redskins/Football Team | Add historical Washington team name aliases |
| 11 | Game status not parsed | **Low** | 4 | Status field always "unknown" | Parse game status from source data (final, scheduled, postponed) |
| 15 | No bootstrap deduplication | **Low** | 7 | No explicit duplicate check during bootstrap | Add deduplication check in bootstrapGames() |
---
## Recommended Next Steps
### Immediate (Before Next Release)
1. **Update iOS bundled data** (Issue #13)
```bash
cp Scripts/output/stadiums_*.json SportsTime/Resources/stadiums_canonical.json
cp Scripts/output/teams_*.json SportsTime/Resources/teams_canonical.json
cp Scripts/output/games_*.json SportsTime/Resources/games_canonical.json
```
2. **Fix NHL stadium data** (Issues #7, #12)
- Change NHL primary source from Hockey Reference to NHL API
- Or: Increase `max_sources_to_try` to 3 so fallbacks are attempted
3. **Add critical stadium aliases** (Issues #8, #10)
- "Mortgage Matchup Center" → `stadium_nba_rocket_mortgage_fieldhouse`
- "Xfinity Mobile Arena" → `stadium_nba_intuit_dome`
- Run validation report to identify all unresolved venue names
### Short-term (This Quarter)
4. **Implement MLS fallback source** (Issue #4)
- FBref has MLS data with venue information
- Reduces ESPN single-point-of-failure risk
5. **Fix orphan alias references** (Issue #2)
- Correct 5 NFL stadium aliases pointing to wrong canonical IDs
- Add validation check to prevent future orphan references
6. **Update expected game counts** (Issue #9)
- WNBA: 220 → 286 (13 teams × 44 games / 2)
### Long-term (Next Quarter)
7. **Implement WNBA/NWSL fallback sources** (Issue #4)
- Consider WNBA League Pass API or other sources
- NWSL has limited data availability - may need to accept ESPN-only
8. **Add RichGame partial loading** (Issue #14)
- Log games dropped due to missing references
- Consider returning games with placeholder stadiums for NHL
9. **Parse game status** (Issue #11)
- Extract final/scheduled/postponed from source data
- Enables filtering by game state
---
## Verification Checklist
After implementing fixes, verify:
- [ ] Run `python -m sportstime_parser scrape --sport all --season 2025`
- [ ] Check validation reports show <5% unresolved stadiums per sport
- [ ] Copy output JSON to iOS Resources/
- [ ] Build iOS app and verify data loads at startup
- [ ] Query RichGames and verify game count matches expectations
- [ ] Run CloudKit sync and verify no errors