SportsTime Data Audit Report

Generated: 2026-01-20
Scope: NBA, MLB, NFL, NHL, MLS, WNBA, NWSL
Data Pipeline: Scripts → CloudKit → iOS App


Executive Summary

The data audit identified 15 issues across the SportsTime data pipeline, with significant gaps in source reliability, stadium resolution, and iOS data freshness.

| Severity | Count | Description |
|----------|-------|-------------|
| Critical | 1 | iOS bundled data severely outdated |
| High | 4 | Single-source sports, NHL stadium data, NBA naming rights |
| Medium | 6 | Alias gaps, outdated config, silent game exclusion |
| Low | 4 | Minor configuration and coverage issues |

Key Findings

Data Pipeline Health:

  • Canonical ID system: 100% format compliance across 7,186 IDs
  • Team mappings: All 183 teams correctly mapped with current abbreviations
  • Referential integrity: Zero orphan references (0 games pointing to non-existent teams/stadiums)
  • ⚠️ Stadium resolution: 1,466 games (21.6%) have unresolved stadiums

Critical Risks:

  1. ESPN single-point-of-failure for WNBA, NWSL, MLS - if ESPN changes, 3 sports lose all data
  2. NHL has 100% missing stadiums - Hockey Reference provides no venue data
  3. iOS bundled data 27% behind - 1,820 games missing from first-launch experience

Root Causes:

  • Stadium naming rights changed faster than alias updates (2024-2025)
  • Fallback source limit (max_sources_to_try = 2) prevents third source from being tried
  • Hockey Reference source limitation (no venue info) combined with fallback limit
  • iOS bundled JSON not updated with latest pipeline output

Phase Status Tracking

| Phase | Status | Issues Found |
|-------|--------|--------------|
| 1. Hardcoded Mapping Audit | COMPLETE | 1 Low |
| 2. Alias File Completeness | COMPLETE | 1 Medium, 1 Low |
| 3. Scraper Source Reliability | COMPLETE | 2 High, 1 Medium |
| 4. Game Count & Coverage | COMPLETE | 2 High, 2 Medium, 1 Low |
| 5. Canonical ID Consistency | COMPLETE | 0 issues |
| 6. Referential Integrity | COMPLETE | 1 Medium (NHL source) |
| 7. iOS Data Reception | COMPLETE | 1 Critical, 1 Medium, 1 Low |

Phase 1 Results: Hardcoded Mapping Audit

Files Audited:

  • sportstime_parser/normalizers/team_resolver.py (TEAM_MAPPINGS)
  • sportstime_parser/normalizers/stadium_resolver.py (STADIUM_MAPPINGS)

Team Counts

Sport Hardcoded Expected Abbreviations Status
NBA 30 30 38
MLB 30 30 38
NFL 32 32 40
NHL 32 32 41
MLS 30 30* 32
WNBA 13 13 13
NWSL 16 16 24

*MLS: 29 original teams + San Diego FC (2025 expansion) = 30

Stadium Counts

Sport Hardcoded Notes Status
NBA 30 1 per team
MLB 57 30 regular + 18 spring training + 9 special venues
NFL 30 Includes shared venues (SoFi Stadium: LAR+LAC, MetLife: NYG+NYJ)
NHL 32 1 per team
MLS 30 1 per team
WNBA 13 1 per team
NWSL 19 14 current + 5 expansion team venues (Boston/Denver)

Recent Updates Verification

| Update | Type | Status | Notes |
|--------|------|--------|-------|
| Utah Hockey Club (NHL) | Relocation | Present | ARI + UTA abbreviations both map to team_nhl_ari |
| Golden State Valkyries (WNBA) | Expansion 2025 | Present | team_wnba_gsv with Chase Center venue |
| Boston Legacy FC (NWSL) | Expansion 2026 | Present | team_nwsl_bos with Gillette Stadium |
| Denver Summit FC (NWSL) | Expansion 2026 | Present | team_nwsl_den with Dick's Sporting Goods Park |
| Oakland A's → Sacramento | Temporary relocation | Present | stadium_mlb_sutter_health_park |
| San Diego FC (MLS) | Expansion 2025 | Present | team_mls_sd with Snapdragon Stadium |
| FedExField → Northwest Stadium | Naming rights | Present | stadium_nfl_northwest_stadium |

NFL Stadium Sharing

Stadium Teams Status
SoFi Stadium LAR, LAC Correct
MetLife Stadium NYG, NYJ Correct

Issues Found

# Issue Severity Description
1 WNBA single abbreviations Low All 13 WNBA teams have only 1 abbreviation each. May need additional abbreviations for source compatibility.

Phase 1 Summary

Result: PASS - All team and stadium mappings are complete and up-to-date with 2025-2026 changes.

  • All 7 sports have correct team counts
  • All stadium counts are appropriate (including spring training, special venues)
  • Recent franchise moves/expansions are reflected
  • Stadium sharing is correctly handled
  • Naming rights updates are current

Phase 2 Results: Alias File Completeness

Files Audited:

  • Scripts/team_aliases.json
  • Scripts/stadium_aliases.json

Team Aliases Summary

Sport Entries Coverage Status
MLB 23 Historical relocations/renames
NBA 29 Historical relocations/renames
NHL 24 Historical relocations/renames
NFL 0 No aliases ⚠️
MLS 0 No aliases (newer league)
WNBA 0 No aliases (newer league)
NWSL 0 No aliases (newer league)
Total 76
  • All 76 entries have valid date ranges
  • No orphan references (all canonical IDs exist in mappings)

Stadium Aliases Summary

Sport Entries Coverage Status
MLB 109 Regular + spring training + special venues
NFL 65 Naming rights history
NBA 44 Naming rights history
NHL 39 Naming rights history
MLS 35 Current + naming variants
WNBA 15 Current + naming variants
NWSL 14 Current + naming variants
Total 321
  • 65 entries have date ranges (historical naming rights)
  • 256 entries are permanent aliases (no date restrictions)

Orphan Reference Check

Type Count Status
Team aliases with invalid references 0
Stadium aliases with invalid references 5

Orphan Stadium References Found:

| Alias Name | References (Invalid) | Correct ID |
|------------|----------------------|------------|
| Broncos Stadium at Mile High | stadium_nfl_empower_field_at_mile_high | stadium_nfl_empower_field |
| Sports Authority Field at Mile High | stadium_nfl_empower_field_at_mile_high | stadium_nfl_empower_field |
| Invesco Field at Mile High | stadium_nfl_empower_field_at_mile_high | stadium_nfl_empower_field |
| Mile High Stadium | stadium_nfl_empower_field_at_mile_high | stadium_nfl_empower_field |
| Arrowhead Stadium | stadium_nfl_geha_field_at_arrowhead_stadium | stadium_nfl_arrowhead_stadium |

Historical Changes Coverage

Historical Name Current Team In Aliases?
Montreal Expos Washington Nationals
Seattle SuperSonics Oklahoma City Thunder
Arizona Coyotes Utah Hockey Club
Cleveland Indians Cleveland Guardians
Hartford Whalers Carolina Hurricanes
Quebec Nordiques Colorado Avalanche
Vancouver Grizzlies Memphis Grizzlies
Washington Redskins Washington Commanders Missing
Washington Football Team Washington Commanders Missing
Brooklyn Dodgers Los Angeles Dodgers Missing

Issues Found

# Issue Severity Description
2 Orphan stadium alias references Medium 5 stadium aliases point to non-existent canonical IDs (stadium_nfl_empower_field_at_mile_high, stadium_nfl_geha_field_at_arrowhead_stadium). Causes resolution failures for historical Denver/KC stadium names.
3 No NFL team aliases Low Missing Washington Redskins/Football Team historical names. Limits historical game matching for NFL.

Phase 2 Summary

Result: PASS with issues - Alias files cover most historical changes but have referential integrity bugs.

  • Team aliases cover MLB/NBA/NHL historical changes
  • Stadium aliases cover naming rights changes across all sports
  • No date range validation errors
  • 5 orphan stadium references need fixing
  • ⚠️ No NFL team aliases (Washington Redskins/Football Team missing)

Phase 3 Results: Scraper Source Reliability

Files Audited:

  • sportstime_parser/scrapers/base.py (fallback logic)
  • sportstime_parser/scrapers/nba.py, mlb.py, nfl.py, nhl.py, mls.py, wnba.py, nwsl.py

Source Dependency Matrix

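| Sport | Primary | Status | Fallback 1 | Status | Fallback 2 | Status | Risk |
|-------|---------|--------|------------|--------|------------|--------|------|
| NBA | basketball_reference | | espn | | cbs | NOT IMPL | Medium |
| MLB | mlb_api | | espn | | baseball_reference | | Low |
| NFL | espn | | pro_football_reference | | cbs | NOT IMPL | Medium |
| NHL | hockey_reference | | nhl_api | | espn | | Low |
| MLS | espn | | fbref | NOT IMPL | - | - | HIGH |
| WNBA | espn | | - | - | - | - | HIGH |
| NWSL | espn | | - | - | - | - | HIGH |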

Unimplemented Sources

Sport Source Line Status
NBA cbs nba.py:421 raise NotImplementedError("CBS scraper not implemented")
NFL cbs nfl.py:386 raise NotImplementedError("CBS scraper not implemented")
MLS fbref mls.py:214 raise NotImplementedError("FBref scraper not implemented")

Fallback Logic Analysis

File: base.py:189

max_sources_to_try = 2  # Don't try all sources if first few return nothing

Impact:

  • Even if 3 sources are declared, only 2 are tried
  • If sources 1 and 2 fail, source 3 is never attempted
  • This limits resilience for NBA, MLB, NFL, NHL which have 3 sources
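The effect of the cap can be illustrated with a minimal sketch. It assumes a loop that tries sources in declared order and stops at the first non-empty result; the function name, the `(name, fetch)` pair shape, and the sample sources are hypothetical, and only `max_sources_to_try` is taken from `base.py`.

```python
def scrape_with_fallback(sources, max_sources_to_try=2):
    """Try sources in order and stop at the first one that returns games.

    `sources` is a list of (name, fetch) pairs, where fetch() returns a list
    of games or raises. Returns (source_name, games, names_tried).
    """
    tried = []
    for name, fetch in sources[:max_sources_to_try]:
        tried.append(name)
        try:
            games = fetch()
        except Exception:
            continue  # a failing source falls through to the next one
        if games:
            return name, games, tried
    return None, [], tried


def not_implemented():
    raise NotImplementedError("scraper not implemented")


# Hypothetical chain: the first source returns nothing, the second raises,
# and only the third would return data.
sources = [
    ("primary", lambda: []),
    ("fallback_1", not_implemented),
    ("fallback_2", lambda: [{"canonical_id": "game_nba_2025_20251021_hou_okc"}]),
]

capped = scrape_with_fallback(sources)                          # cap of 2
uncapped = scrape_with_fallback(sources, max_sources_to_try=3)  # cap of 3
print(capped[0], capped[2])
print(uncapped[0], uncapped[2])
```

With the cap at 2 the third source is never reached, so the call returns no games even though data is available; raising the cap to 3 lets the third source supply them.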

International Game Filtering

Sport Hardcoded Locations Notes
NFL London, Mexico City, Frankfurt, Munich, São Paulo Complete for 2025
NHL Prague, Stockholm, Helsinki, Tampere, Gothenburg Complete for 2025
NBA None ⚠️ No international filtering (Abu Dhabi games?)
MLB None ⚠️ No international filtering (Mexico City games?)
MLS None N/A (domestic only)
WNBA None N/A (domestic only)
NWSL None N/A (domestic only)

Single Point of Failure Risk

Sport Primary Source If ESPN Fails... Risk Level
WNBA ESPN only Complete data loss Critical
NWSL ESPN only Complete data loss Critical
MLS ESPN only (fbref not impl) Complete data loss Critical
NBA Basketball-Ref → ESPN ESPN fallback available Low
NFL ESPN → Pro-Football-Ref Fallback available Low
NHL Hockey-Ref → NHL API → ESPN Two fallbacks Very Low
MLB MLB API → ESPN → B-Ref Two fallbacks Very Low

Issues Found

# Issue Severity Description
4 WNBA/NWSL/MLS single source High ESPN is the only working source for 3 sports. If ESPN changes or fails, data collection completely stops.
5 max_sources_to_try = 2 High Third fallback source never tried even if available. Reduces resilience for NBA/MLB/NFL/NHL.
6 CBS/FBref not implemented Medium Declared fallback sources raise NotImplementedError. Appears functional in config but fails at runtime.

Phase 3 Summary

Result: FAIL - Critical single-point-of-failure for 3 sports.

  • WNBA, NWSL, MLS have only ESPN (no resilience)
  • Fallback limit of 2 prevents third source from being tried
  • ⚠️ CBS and FBref declared but not implemented
  • MLB and NHL have full fallback chains
  • International game filtering present for NFL/NHL

Phase 4 Results: Game Count & Coverage

Files Audited:

  • Scripts/output/games_*.json (all 2025 season files)
  • Scripts/output/validation_*.md (all validation reports)
  • sportstime_parser/config.py (EXPECTED_GAME_COUNTS)

Coverage Summary

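| Sport | Scraped | Expected | Coverage | Status |
|-------|---------|----------|----------|--------|
| NBA | 1,231 | 1,230 | 100.1% | |
| MLB | 2,866 | 2,430 | 117.9% | ⚠️ Includes spring training |
| NFL | 330 | 272 | 121.3% | ⚠️ Includes preseason/playoffs |
| NHL | 1,312 | 1,312 | 100.0% | |
| MLS | 542 | 493 | 109.9% | Includes playoffs |
| WNBA | 322 | 220 | 146.4% | ⚠️ Expected count outdated |
| NWSL | 189 | 182 | 103.8% | |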
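The coverage column is scraped games divided by expected games. A short sketch that recomputes it from the counts in the table (the dictionary and function names are illustrative, not the layout of `config.py`):

```python
# Counts copied from the Coverage Summary table.
scraped = {"nba": 1231, "mlb": 2866, "nfl": 330, "nhl": 1312,
           "mls": 542, "wnba": 322, "nwsl": 189}
expected = {"nba": 1230, "mlb": 2430, "nfl": 272, "nhl": 1312,
            "mls": 493, "wnba": 220, "nwsl": 182}


def coverage_pct(scraped_count, expected_count):
    """Scraped games as a percentage of expected games, to one decimal."""
    return round(100 * scraped_count / expected_count, 1)


for sport, count in scraped.items():
    print(f"{sport.upper():5} {coverage_pct(count, expected[sport]):6.1f}%")
```

Coverage above 100% is not an error in itself; it signals either extra game types in the scrape (spring training, preseason, playoffs) or an outdated expected count.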

Date Range Analysis

Sport Start Date End Date Notes
NBA 2025-10-21 2026-04-12 Regular season only
MLB 2025-03-01 2025-11-02 Includes spring training (417 games in March)
NFL 2025-08-01 2026-01-25 Includes preseason (49 in Aug) + playoffs (28 in Jan)
NHL 2025-10-07 2026-04-16 Regular season only
MLS 2025-02-22 2025-11-30 Regular season + playoffs
WNBA 2025-05-02 2025-10-11 Regular season + playoffs
NWSL 2025-03-15 2025-11-23 Regular season + playoffs

Game Status Distribution

All games across all sports have status unknown - game status is not being properly parsed from sources.

Duplicate Game Detection

Sport Duplicates Found Details
NBA 0
MLB 1 game_mlb_2025_20250508_det_col_1 appears twice (doubleheader handling issue)
NFL 0
NHL 0
MLS 0
WNBA 0
NWSL 0
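Duplicate detection amounts to counting canonical IDs. A minimal sketch, assuming each game record is a dict with a `canonical_id` key; the function name and the third sample ID are hypothetical:

```python
from collections import Counter


def find_duplicate_ids(games):
    """Map each canonical ID that appears more than once to its count."""
    counts = Counter(game["canonical_id"] for game in games)
    return {cid: n for cid, n in counts.items() if n > 1}


games = [
    # The doubleheader ID reported above, appearing twice.
    {"canonical_id": "game_mlb_2025_20250508_det_col_1"},
    {"canonical_id": "game_mlb_2025_20250508_det_col_1"},
    # Hypothetical second game of the doubleheader with a distinct suffix.
    {"canonical_id": "game_mlb_2025_20250508_det_col_2"},
]

print(find_duplicate_ids(games))
```

If the second game of a doubleheader were given the `_2` suffix, as the ID pattern's optional `_{#}` allows, the duplicate would disappear.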

Validation Report Analysis

Sport Total Games Unresolved Teams Unresolved Stadiums Manual Review Items
NBA 1,231 0 131 131
MLB 2,866 12 4 20
NFL 330 1 5 11
NHL 1,312 0 0 1,312 (all missing stadiums)
MLS 542 1 64 129
WNBA 322 5 65 135
NWSL 189 0 16 32

Top Unresolved Stadium Names (Recent Naming Rights)

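| Stadium Name | Occurrences | Actual Venue | Issue |
|--------------|-------------|--------------|-------|
| Sports Illustrated Stadium | 11 | New York Red Bulls (formerly Red Bull Arena) | Renamed venue, missing alias |
| Mortgage Matchup Center | 8 | Phoenix Suns (formerly Footprint Center) | 2025 naming rights change |
| ScottsMiracle-Gro Field | 4 | Columbus Crew (formerly Lower.com Field) | 2025 naming rights change, missing alias |
| Energizer Park | 3 | St. Louis CITY SC (formerly CityPark) | 2025 naming rights change, missing alias |
| Xfinity Mobile Arena | 3 | Philadelphia 76ers (formerly Wells Fargo Center) | 2025 naming rights change |
| Rocket Arena | 3 | Cleveland Cavaliers (formerly Rocket Mortgage FieldHouse) | 2025 naming rights change |
| CareFirst Arena | 2 | Washington Mystics venue | New WNBA venue name |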

Unresolved Teams (Exhibition/International)

Team Name Sport Type Games
BRAZIL WNBA International exhibition 2
Toyota Antelopes WNBA Japanese team 2
TEAM CLARK WNBA All-Star Game 1
(Various MLB) MLB International teams 12
(MLS international) MLS CCL/exhibition 1
(NFL preseason) NFL Pre-season exhibition 1

NHL Stadium Data Issue

Critical: Hockey Reference does not provide stadium data. All 1,312 NHL games have raw_stadium: None, so 100% of NHL games are missing stadium IDs. The NHL fallback sources (NHL API, ESPN) should provide this data, but they are never attempted: Hockey Reference returns games successfully, so the fallback chain stops at the primary source, and the max_sources_to_try = 2 limit would exclude the third source (ESPN) in any case.

Expected Count Updates Needed

| Sport | Current Expected | Recommended | Reason |
|-------|------------------|-------------|--------|
| WNBA | 220 | 286 | 13 teams × 44 games / 2 (expanded with Golden State Valkyries) |
| NFL | 272 | 272 (filter preseason) | Or document that 330 includes preseason |
| MLB | 2,430 | 2,430 (filter spring training) | Or document that 2,866 includes spring training |
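The recommended WNBA figure follows from the fact that each game involves two teams, so the league total is teams × games per team ÷ 2. A small sketch of that arithmetic (the function name is illustrative):

```python
def expected_regular_season_games(teams, games_per_team):
    """League-wide game count when every team plays `games_per_team` games."""
    total_team_games = teams * games_per_team
    if total_team_games % 2:
        raise ValueError("teams x games_per_team must be even")
    return total_team_games // 2


print(expected_regular_season_games(13, 44))  # WNBA, 13 teams and 44 games each
```

The same formula reproduces the other expected counts in the coverage table, for example 30 teams × 82 games for the NBA and 32 teams × 17 games for the NFL.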

Issues Found

# Issue Severity Description
7 NHL has no stadium data High Hockey Reference provides no venue info. All 1,312 games missing stadium_id. Fallback sources not tried.
8 131 NBA stadium resolution failures High Recent naming rights changes ("Mortgage Matchup Center", "Xfinity Mobile Arena") not in aliases.
9 Outdated WNBA expected count Medium Config says 220 but WNBA expanded to 13 teams in 2025; actual is 322 (286 regular + playoffs).
10 MLS/WNBA stadium alias gaps Medium 64 MLS + 65 WNBA unresolved stadiums from new/renamed venues.
11 Game status not parsed Low All games have status unknown instead of final/scheduled/postponed.

Phase 4 Summary

Result: FAIL - Significant stadium resolution failures across multiple sports.

  • 131 NBA games missing stadium (naming rights changes)
  • 1,312 NHL games missing stadium (source doesn't provide data)
  • 64 MLS + 65 WNBA stadiums unresolved (new/renamed venues)
  • ⚠️ WNBA expected count severely outdated (220 vs 322 actual)
  • ⚠️ MLB/NFL include preseason/spring training games
  • No significant duplicate games (1 MLB doubleheader edge case)
  • All teams resolved except exhibition/international games

Phase 5 Results: Canonical ID Consistency

Files Audited:

  • sportstime_parser/normalizers/canonical_id.py (Python ID generation)
  • SportsTime/Core/Models/Local/CanonicalModels.swift (iOS models)
  • SportsTime/Core/Services/BootstrapService.swift (iOS JSON parsing)
  • All Scripts/output/*.json files (generated IDs)

Format Validation

Type Total IDs Valid Invalid Pass Rate
Team 183 183 0 100.0%
Stadium 211 211 0 100.0%
Game 6,792 6,792 0 100.0%

ID Format Patterns (all validated)

Teams:    team_{sport}_{abbrev}                  → team_nba_lal
Stadiums: stadium_{sport}_{normalized_name}       → stadium_nba_cryptocom_arena
Games:    game_{sport}_{season}_{YYYYMMDD}_{away}_{home}[_{#}]
                                                  → game_nba_2025_20251021_hou_okc
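These patterns can be expressed as regular expressions. The sketch below is one possible encoding, not the validator in `canonical_id.py`; it assumes a four-digit season, abbreviations of 2 to 4 lowercase alphanumeric characters (matching the lengths reported below), and names made of lowercase alphanumeric segments joined by single underscores.

```python
import re

SPORT = r"(?:nba|mlb|nfl|nhl|mls|wnba|nwsl)"
ABBREV = r"[a-z0-9]{2,4}"

PATTERNS = {
    "team": re.compile(rf"team_{SPORT}_{ABBREV}"),
    # Segments joined by single underscores: no "__", no trailing "_".
    "stadium": re.compile(rf"stadium_{SPORT}_[a-z0-9]+(?:_[a-z0-9]+)*"),
    # Optional trailing number distinguishes doubleheader games.
    "game": re.compile(rf"game_{SPORT}_\d{{4}}_\d{{8}}_{ABBREV}_{ABBREV}(?:_\d+)?"),
}


def is_valid_id(kind, canonical_id):
    """True if the whole string matches the pattern for `kind`."""
    return PATTERNS[kind].fullmatch(canonical_id) is not None


for kind, cid in [
    ("team", "team_nba_lal"),
    ("stadium", "stadium_nba_cryptocom_arena"),
    ("game", "game_nba_2025_20251021_hou_okc"),
    ("stadium", "stadium_nba__arena"),  # double underscore, rejected
]:
    print(kind, cid, is_valid_id(kind, cid))
```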

Normalization Quality

Check Result
Double underscores (__) 0 found
Leading/trailing underscores 0 found
Uppercase letters 0 found
Special characters 0 found

Abbreviation Lengths (Teams)

Length Count
2 chars 21
3 chars 161
4 chars 1

Stadium ID Lengths

  • Minimum: 8 characters
  • Maximum: 29 characters
  • Average: 16.2 characters

iOS Cross-Compatibility

Aspect Status Notes
Field naming convention Compatible Python uses snake_case; iOS BootstrapService uses matching Codable structs
Deterministic UUID generation Compatible iOS uses SHA256 hash of canonical_id - matches any valid string
Schema version Compatible Both use version 1
Required fields Present All iOS-required fields present in JSON output

Field Mapping (Python → iOS)

Python Field iOS Field Notes
canonical_id canonicalId Mapped via JSONCanonicalStadium.canonical_id → CanonicalStadium.canonicalId
home_team_canonical_id homeTeamCanonicalId Explicit mapping in BootstrapService
away_team_canonical_id awayTeamCanonicalId Explicit mapping in BootstrapService
stadium_canonical_id stadiumCanonicalId Explicit mapping in BootstrapService
game_datetime_utc dateTime ISO 8601 parsing with fallback to legacy format

Issues Found

No issues found. All canonical IDs are:

  • Correctly formatted according to defined patterns
  • Properly normalized (lowercase, no special characters)
  • Deterministic (same input produces same output)
  • Compatible with iOS parsing

Phase 5 Summary

Result: PASS - All canonical IDs are consistent and iOS-compatible.

  • 100% format validation pass rate across 7,186 IDs
  • No normalization issues found
  • iOS BootstrapService explicitly handles snake_case → camelCase mapping
  • Deterministic UUID generation using SHA256 hash

Phase 6 Results: Referential Integrity

Files Audited:

  • Scripts/output/games_*_2025.json
  • Scripts/output/teams_*.json
  • Scripts/output/stadiums_*.json

Game → Team References

Sport Total Games Valid Home Valid Away Orphan Home Orphan Away Status
NBA 1,231 1,231 1,231 0 0
MLB 2,866 2,866 2,866 0 0
NFL 330 330 330 0 0
NHL 1,312 1,312 1,312 0 0
MLS 542 542 542 0 0
WNBA 322 322 322 0 0
NWSL 189 189 189 0 0

Result: 100% valid team references across all 6,792 games.

Game → Stadium References

Sport Total Games Valid Missing Percentage Missing
NBA 1,231 1,231 0 0.0%
MLB 2,866 2,862 4 0.1%
NFL 330 325 5 1.5%
NHL 1,312 0 1,312 100%
MLS 542 478 64 11.8% ⚠️
WNBA 322 257 65 20.2% ⚠️
NWSL 189 173 16 8.5% ⚠️

Note: "Missing" means stadium_canonical_id is empty (resolution failed at scrape time). These are NOT orphan references to non-existent stadiums.

Team → Stadium References

Sport Teams Valid Stadium Invalid Status
NBA 30 30 0
MLB 30 30 0
NFL 32 32 0
NHL 32 32 0
MLS 30 30 0
WNBA 13 13 0
NWSL 16 16 0

Result: 100% valid team → stadium references.

Cross-Sport Stadium Check

No stadiums are duplicated across sports. Each stadium_{sport}_* ID is unique to its sport.

Missing Stadium Root Causes

Sport Missing Root Cause
NHL 1,312 Hockey Reference provides no venue data - source limitation
MLS 64 New/renamed stadiums not in aliases (see Phase 4)
WNBA 65 New venue names not in aliases (see Phase 4)
NWSL 16 Expansion team venues + alternate venues
NFL 5 International games not in stadium mappings
MLB 4 Exhibition/international games

Orphan Reference Summary

Reference Type Total Checked Orphans Found
Game → Home Team 6,792 0
Game → Away Team 6,792 0
Game → Stadium 6,792 0
Team → Stadium 183 0

Note: Zero orphan references. All "missing" stadiums are resolution failures (empty string), not references to non-existent canonical IDs.
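The distinction between a missing reference and an orphan reference can be made explicit in code. A minimal sketch, assuming game records are dicts with a `stadium_canonical_id` key; the function name and sample records are hypothetical, and the third record is a deliberately invalid case of a kind the audit did not find:

```python
def classify_stadium_refs(games, stadium_ids):
    """Count valid, missing (empty), and orphan (unknown ID) stadium references."""
    known = set(stadium_ids)
    result = {"valid": 0, "missing": 0, "orphan": 0}
    for game in games:
        ref = game.get("stadium_canonical_id") or ""
        if not ref:
            result["missing"] += 1   # resolution failed at scrape time
        elif ref in known:
            result["valid"] += 1
        else:
            result["orphan"] += 1    # points at an ID that does not exist
    return result


stadium_ids = ["stadium_nba_cryptocom_arena"]
games = [
    {"stadium_canonical_id": "stadium_nba_cryptocom_arena"},
    {"stadium_canonical_id": ""},
    {"stadium_canonical_id": "stadium_nfl_empower_field_at_mile_high"},
]

print(classify_stadium_refs(games, stadium_ids))
```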

Issues Found

# Issue Severity Description
12 NHL games have no stadium data Medium Hockey Reference source doesn't provide venue information. All 1,312 NHL games have empty stadium_canonical_id. Fallback sources could provide this data but are limited by max_sources_to_try = 2.

Phase 6 Summary

Result: PASS with known limitations - No orphan references exist; missing stadiums are resolution failures.

  • 100% valid team references (home and away)
  • 100% valid team → stadium references
  • No orphan references to non-existent canonical IDs
  • ⚠️ 1,466 games (21.6%) have empty stadium_canonical_id (resolution failures, not orphans)
  • ⚠️ NHL accounts for about 90% of missing stadium data (1,312 of 1,466 games; source limitation)

Phase 7 Results: iOS Data Reception

Files Audited:

  • SportsTime/Core/Services/BootstrapService.swift (JSON parsing)
  • SportsTime/Core/Services/CanonicalSyncService.swift (CloudKit sync)
  • SportsTime/Core/Services/DataProvider.swift (data access)
  • SportsTime/Core/Models/Local/CanonicalModels.swift (SwiftData models)
  • SportsTime/Resources/*_canonical.json (bundled data files)

Bundled Data Comparison

Data Type iOS Bundled Scripts Output Difference Status
Teams 148 183 -35 (19%) STALE
Stadiums 122 211 -89 (42%) STALE
Games 4,972 6,792 -1,820 (27%) STALE

iOS bundled data is significantly outdated compared to Scripts output.

Field Mapping Verification

Python Field iOS JSON Struct iOS Model Type Match Status
canonical_id canonical_id canonicalId String
name name name String
game_datetime_utc game_datetime_utc dateTime ISO 8601 → Date
date + time (legacy) date, time dateTime Fallback parsing
home_team_canonical_id home_team_canonical_id homeTeamCanonicalId String
away_team_canonical_id away_team_canonical_id awayTeamCanonicalId String
stadium_canonical_id stadium_canonical_id stadiumCanonicalId String
sport sport sport String
season season season String
is_playoff is_playoff isPlayoff Bool
broadcast_info broadcast_info broadcastInfo String?

Result: All field mappings are correct and compatible.
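Most rows in the table follow a mechanical snake_case to camelCase rule, with `game_datetime_utc` → `dateTime` as the one exception. The sketch below only illustrates that naming relationship; the iOS side reportedly uses explicit Codable structs rather than a conversion function, and both function names and the `OVERRIDES` dict are hypothetical.

```python
def snake_to_camel(name):
    """Convert snake_case to lowerCamelCase."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


# The one field in the table that does not follow the mechanical rule.
OVERRIDES = {"game_datetime_utc": "dateTime"}


def ios_field_name(python_field):
    """iOS model property name for a Python output field."""
    return OVERRIDES.get(python_field, snake_to_camel(python_field))


for field in ["canonical_id", "home_team_canonical_id", "is_playoff",
              "broadcast_info", "game_datetime_utc"]:
    print(f"{field} -> {ios_field_name(field)}")
```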

Date Parsing Compatibility

iOS BootstrapService supports both formats:

// New canonical format (preferred)
let game_datetime_utc: String?  // ISO 8601

// Legacy format (fallback)
let date: String?   // "YYYY-MM-DD"
let time: String?   // "HH:mm" or "TBD"

The current iOS bundled games use the legacy format. Once the bundled data is updated, the new game_datetime_utc format will be used.
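The preferred-then-fallback parsing can be sketched in Python for reference. This is not the BootstrapService implementation: the function name is hypothetical, and treating legacy times as UTC and mapping "TBD" to midnight are assumptions made only for this example.

```python
from datetime import datetime, timezone


def parse_game_datetime(game):
    """Prefer game_datetime_utc; fall back to the legacy date/time fields."""
    iso = game.get("game_datetime_utc")
    if iso:
        # Older Python versions reject a trailing "Z" in fromisoformat().
        parsed = datetime.fromisoformat(iso.replace("Z", "+00:00"))
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed

    date_str, time_str = game.get("date"), game.get("time")
    if not date_str:
        return None
    if not time_str or time_str == "TBD":
        time_str = "00:00"  # assumption: midnight placeholder for unannounced times
    naive = datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %H:%M")
    return naive.replace(tzinfo=timezone.utc)  # assumption: legacy times are UTC


print(parse_game_datetime({"game_datetime_utc": "2025-10-21T23:30:00Z"}))
print(parse_game_datetime({"date": "2025-10-21", "time": "TBD"}))
```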

Missing Reference Handling

DataProvider.filterRichGames() behavior:

return games.compactMap { game in
    guard let homeTeam = teamsById[game.homeTeamId],
          let awayTeam = teamsById[game.awayTeamId],
          let stadium = stadiumsById[game.stadiumId] else {
        return nil  // ⚠️ Silently drops game
    }
    return RichGame(...)
}

Impact:

  • Games with missing stadium IDs are silently excluded from RichGame queries
  • No error logging or fallback behavior
  • User sees fewer games than expected without explanation
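The logging remediation can be sketched independently of Swift. The following Python sketch is not the DataProvider implementation; the record keys (`id`, `home_team_id`, `away_team_id`, `stadium_id`), the logger name, and the sample data are hypothetical. It keeps the same filtering behavior but records which reference was unresolved for each dropped game.

```python
import logging

logger = logging.getLogger("rich_games")


def filter_rich_games(games, teams_by_id, stadiums_by_id):
    """Split games into resolvable ones and dropped ones, logging each drop."""
    rich, dropped = [], []
    checks = (
        ("home team", "home_team_id", teams_by_id),
        ("away team", "away_team_id", teams_by_id),
        ("stadium", "stadium_id", stadiums_by_id),
    )
    for game in games:
        missing = [label for label, key, table in checks
                   if game.get(key) not in table]
        if missing:
            logger.warning("dropping %s: unresolved %s",
                           game.get("id"), ", ".join(missing))
            dropped.append((game, missing))
        else:
            rich.append(game)
    return rich, dropped


teams_by_id = {"team_nhl_bos": {}, "team_nhl_mtl": {}}
stadiums_by_id = {"stadium_nhl_td_garden": {}}  # hypothetical ID
games = [
    {"id": "g1", "home_team_id": "team_nhl_bos", "away_team_id": "team_nhl_mtl",
     "stadium_id": "stadium_nhl_td_garden"},
    {"id": "g2", "home_team_id": "team_nhl_bos", "away_team_id": "team_nhl_mtl",
     "stadium_id": ""},  # empty stadium, as in the NHL data
]

rich, dropped = filter_rich_games(games, teams_by_id, stadiums_by_id)
print(len(rich), [(g["id"], why) for g, why in dropped])
```

Returning the dropped games alongside the resolved ones also makes it possible to surface a count to the user instead of showing a shorter list without explanation.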

Deduplication Logic

Bootstrap: No explicit deduplication. If bundled JSON contains duplicate canonical IDs, both would be inserted into SwiftData (leading to potential query issues).

CloudKit Sync: Uses upsert pattern with canonical ID as unique key - duplicates would overwrite.

Schema Version Compatibility

Component Schema Version Status
Scripts output 1
iOS CanonicalModels 1
iOS BootstrapService Expects 1

Compatible. Schema version mismatch protection exists in CanonicalSyncService:

case .schemaVersionTooNew(let version):
    return "Data requires app version supporting schema \(version). Please update the app."

Bootstrap Order Validation

iOS bootstraps in correct dependency order:

  1. Stadiums (no dependencies)
  2. Stadium aliases (depends on stadiums)
  3. League structure (no dependencies)
  4. Teams (depends on stadiums)
  5. Team aliases (depends on teams)
  6. Games (depends on teams + stadiums)

Correct - prevents orphan references during bootstrap.
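The ordering constraint can be checked mechanically by treating the list above as a dependency graph. A minimal sketch using the standard-library `graphlib` module (Python 3.9+); the entity names and helper functions are illustrative and not taken from the iOS code:

```python
from graphlib import TopologicalSorter

# Each entity maps to the entities that must be loaded before it.
DEPENDENCIES = {
    "stadiums": set(),
    "stadium_aliases": {"stadiums"},
    "league_structure": set(),
    "teams": {"stadiums"},
    "team_aliases": {"teams"},
    "games": {"teams", "stadiums"},
}


def bootstrap_order(dependencies):
    """Return one valid load order (several may exist)."""
    return list(TopologicalSorter(dependencies).static_order())


def respects_dependencies(order, dependencies):
    """True if every entity appears after all of its dependencies."""
    position = {name: index for index, name in enumerate(order)}
    return all(position[dep] < position[name]
               for name, deps in dependencies.items()
               for dep in deps)


documented = ["stadiums", "stadium_aliases", "league_structure",
              "teams", "team_aliases", "games"]
print(respects_dependencies(documented, DEPENDENCIES))
print(bootstrap_order(DEPENDENCIES))
```

The order produced by `static_order()` may differ from the documented one, since independent entities such as league structure can be placed anywhere; what matters is that both satisfy the same constraints.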

CloudKit Sync Validation

CanonicalSyncService syncs in same dependency order and tracks:

  • Per-entity sync timestamps
  • Skipped records (incompatible schema version)
  • Skipped records (older than local)
  • Sync duration and cancellation

Well-designed sync infrastructure.

Issues Found

# Issue Severity Description
13 iOS bundled data severely outdated Critical Missing 35 teams (19%), 89 stadiums (42%), 1,820 games (27%). First-launch experience shows incomplete data until CloudKit sync completes.
14 Silent game exclusion in RichGame queries Medium filterRichGames() silently drops games with missing team/stadium references. Users see fewer games without explanation.
15 No bootstrap deduplication Low Duplicate game IDs in bundled JSON would create duplicate SwiftData records. Low risk since JSON is generated correctly.

Phase 7 Summary

Result: FAIL - iOS bundled data is critically outdated.

  • iOS bundled data missing 35 teams, 89 stadiums, 1,820 games
  • ⚠️ Games with unresolved references silently dropped from RichGame queries
  • Field mapping between Python and iOS is correct
  • Date parsing supports both legacy and new formats
  • Schema versions are compatible
  • Bootstrap/sync order handles dependencies correctly

Prioritized Issue List

# Issue Severity Phase Root Cause Remediation
13 iOS bundled data severely outdated Critical 7 Bundled JSON not updated after pipeline runs Copy Scripts/output/*_canonical.json to iOS Resources/ and rebuild
4 WNBA/NWSL/MLS ESPN-only source High 3 No implemented fallback sources Implement alternative scrapers (FBref for MLS, WNBA League Pass)
5 max_sources_to_try = 2 limits fallback High 3 Hardcoded limit in base.py:189 Increase to 3 or remove limit for sports with 3+ sources
7 NHL has no stadium data from primary source High 4 Hockey Reference doesn't provide venue info Force NHL to use NHL API or ESPN as primary (they provide venues)
8 131 NBA stadium resolution failures High 4 2024-2025 naming rights not in aliases Add aliases: "Mortgage Matchup Center" → Phoenix Suns arena (formerly Footprint Center), "Xfinity Mobile Arena" → Philadelphia 76ers arena (formerly Wells Fargo Center)
2 Orphan stadium alias references Medium 2 Wrong canonical IDs in stadium_aliases.json Fix 5 Denver/KC stadium aliases pointing to non-existent IDs
6 CBS/FBref scrapers declared but not implemented Medium 3 NotImplementedError at runtime Either implement or remove from source lists to avoid confusion
9 Outdated WNBA expected count Medium 4 WNBA expanded to 13 teams in 2025 Update config.py EXPECTED_GAME_COUNTS["wnba"] from 220 to 286
10 MLS/WNBA stadium alias gaps Medium 4 New/renamed venues missing from aliases Add 129 missing stadium aliases (64 MLS + 65 WNBA)
12 NHL games have no stadium data Medium 6 Same as Issue #7 See Issue #7 remediation
14 Silent game exclusion in RichGame queries Medium 7 compactMap silently drops games Log dropped games or return partial RichGame with placeholder stadium
1 WNBA single abbreviations Low 1 Only 1 abbreviation per team Add alternative abbreviations for source compatibility
3 No NFL team aliases Low 2 Missing Washington Redskins/Football Team Add historical Washington team name aliases
11 Game status not parsed Low 4 Status field always "unknown" Parse game status from source data (final, scheduled, postponed)
15 No bootstrap deduplication Low 7 No explicit duplicate check during bootstrap Add deduplication check in bootstrapGames()

Immediate (Before Next Release)

  1. Update iOS bundled data (Issue #13)

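    # Each glob must expand to exactly one file: cp rejects multiple sources when the destination is a single file.
    cp Scripts/output/stadiums_*.json SportsTime/Resources/stadiums_canonical.json
    cp Scripts/output/teams_*.json SportsTime/Resources/teams_canonical.json
    cp Scripts/output/games_*.json SportsTime/Resources/games_canonical.json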
    
  2. Fix NHL stadium data (Issues #7, #12)

    • Change NHL primary source from Hockey Reference to NHL API
    • Or: Increase max_sources_to_try to 3 so fallbacks are attempted
  3. Add critical stadium aliases (Issues #8, #10)

    • "Mortgage Matchup Center" → canonical ID of the Phoenix Suns arena (formerly Footprint Center)
    • "Xfinity Mobile Arena" → canonical ID of the Philadelphia 76ers arena (formerly Wells Fargo Center)
    • Run validation report to identify all unresolved venue names

Short-term (This Quarter)

  1. Implement MLS fallback source (Issue #4)

    • FBref has MLS data with venue information
    • Reduces ESPN single-point-of-failure risk
  2. Fix orphan alias references (Issue #2)

    • Correct 5 NFL stadium aliases pointing to wrong canonical IDs
    • Add validation check to prevent future orphan references
  3. Update expected game counts (Issue #9)

    • WNBA: 220 → 286 (13 teams × 44 games / 2)

Long-term (Next Quarter)

  1. Implement WNBA/NWSL fallback sources (Issue #4)

    • Consider WNBA League Pass API or other sources
    • NWSL has limited data availability - may need to accept ESPN-only
  2. Add RichGame partial loading (Issue #14)

    • Log games dropped due to missing references
    • Consider returning games with placeholder stadiums for NHL
  3. Parse game status (Issue #11)

    • Extract final/scheduled/postponed from source data
    • Enables filtering by game state

Verification Checklist

After implementing fixes, verify:

  • Run python -m sportstime_parser scrape --sport all --season 2025
  • Check validation reports show <5% unresolved stadiums per sport
  • Copy output JSON to iOS Resources/
  • Build iOS app and verify data loads at startup
  • Query RichGames and verify game count matches expectations
  • Run CloudKit sync and verify no errors
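The "<5% unresolved stadiums per sport" check can be automated. A minimal sketch, assuming per-sport (total, unresolved) counts have already been extracted from the validation reports; the function name and input shape are hypothetical, and the sample counts are the pre-fix figures from the Game → Stadium References table in Phase 6.

```python
def sports_over_threshold(stats, max_unresolved_pct=5.0):
    """Return {sport: unresolved %} for sports at or above the threshold."""
    failures = {}
    for sport, (total, unresolved) in stats.items():
        pct = 100 * unresolved / total if total else 0.0
        if pct >= max_unresolved_pct:
            failures[sport] = round(pct, 1)
    return failures


# (total games, games with empty stadium_canonical_id) before remediation.
stats = {
    "nba": (1231, 0), "mlb": (2866, 4), "nfl": (330, 5), "nhl": (1312, 1312),
    "mls": (542, 64), "wnba": (322, 65), "nwsl": (189, 16),
}

failures = sports_over_threshold(stats)
print(failures)
```

After the fixes are applied, the same call on fresh counts should return an empty dict.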