# SportsTime Data Pipeline Remediation Plan **Created:** 2026-01-20 **Based on:** DATA_AUDIT.md findings (15 issues identified) **Priority:** Fix critical data integrity issues blocking production release --- ## Executive Summary The data audit identified **15 issues** across the pipeline: - **1 Critical:** iOS bundled data 27% behind Scripts output - **4 High:** ESPN single-source risk, NHL missing 100% stadiums, NBA naming rights failures - **6 Medium:** Alias gaps, orphan references, silent game drops - **4 Low:** Configuration and metadata gaps This plan organizes fixes into **5 phases** with clear dependencies, tasks, and validation gates. --- ## Phase Dependency Graph ``` Phase 1: Alias & Reference Fixes ↓ Phase 2: NHL Stadium Data Fix ↓ Phase 3: Re-scrape & Validate ↓ Phase 4: iOS Bundle Update ↓ Phase 5: Code Quality & Future-Proofing ``` **Rationale:** Aliases must be fixed before re-scraping. NHL source fix enables stadium resolution. Fresh scrape validates all fixes. iOS bundle updated last with clean data. --- ## Phase 1: Alias & Reference Fixes **Goal:** Fix all alias files so stadium/team resolution succeeds for 2024-2025 naming rights changes. **Issues Addressed:** #2, #3, #8, #10 **Duration:** 2-3 hours ### Task 1.1: Fix Orphan Stadium Alias References **File:** `Scripts/stadium_aliases.json` **Issue #2:** 5 stadium aliases point to non-existent canonical IDs. | Current (Invalid) | Correct ID | |-------------------|------------| | `stadium_nfl_empower_field_at_mile_high` | `stadium_nfl_empower_field` | | `stadium_nfl_geha_field_at_arrowhead_stadium` | `stadium_nfl_arrowhead_stadium` | **Tasks:** 1. Open `Scripts/stadium_aliases.json` 2. Search for `stadium_nfl_empower_field_at_mile_high` 3. Replace all occurrences with `stadium_nfl_empower_field` 4. Search for `stadium_nfl_geha_field_at_arrowhead_stadium` 5. Replace all occurrences with `stadium_nfl_arrowhead_stadium` 6. Verify JSON is valid: `python -c "import json; json.load(open('stadium_aliases.json'))"` **Affected Aliases:** ```json // FIX THESE: { "alias_name": "Broncos Stadium at Mile High", "stadium_canonical_id": "stadium_nfl_empower_field" } { "alias_name": "Sports Authority Field at Mile High", "stadium_canonical_id": "stadium_nfl_empower_field" } { "alias_name": "Invesco Field at Mile High", "stadium_canonical_id": "stadium_nfl_empower_field" } { "alias_name": "Mile High Stadium", "stadium_canonical_id": "stadium_nfl_empower_field" } { "alias_name": "Arrowhead Stadium", "stadium_canonical_id": "stadium_nfl_arrowhead_stadium" } ``` ### Task 1.2: Add NBA 2024-2025 Stadium Aliases **File:** `Scripts/stadium_aliases.json` **Issue #8:** 131 NBA games failing resolution due to 2024-2025 naming rights changes. **Top Unresolved Names (from validation report):** | Source Name | Maps To | Canonical ID | |-------------|---------|--------------| | Mortgage Matchup Center | Rocket Mortgage FieldHouse | `stadium_nba_rocket_mortgage_fieldhouse` | | Xfinity Mobile Arena | Intuit Dome | `stadium_nba_intuit_dome` | | Rocket Arena | Toyota Center (?) | `stadium_nba_toyota_center` | **Tasks:** 1. Run validation report to get full list of unresolved NBA stadiums: ```bash grep -A2 "Unresolved Stadium" output/validation_nba_2025.md | head -50 ``` 2. For each unresolved name, identify the correct canonical ID 3. Add alias entries to `stadium_aliases.json`: ```json { "alias_name": "Mortgage Matchup Center", "stadium_canonical_id": "stadium_nba_rocket_mortgage_fieldhouse", "valid_from": "2025-01-01", "valid_until": null }, { "alias_name": "Xfinity Mobile Arena", "stadium_canonical_id": "stadium_nba_intuit_dome", "valid_from": "2025-01-01", "valid_until": null } ``` ### Task 1.3: Add MLS Stadium Aliases **File:** `Scripts/stadium_aliases.json` **Issue #10:** 64 MLS games with unresolved stadiums. **Tasks:** 1. Extract unresolved MLS stadiums: ```bash grep -A2 "Unresolved Stadium" output/validation_mls_2025.md | sort | uniq -c | sort -rn ``` 2. Research each stadium name to find correct canonical ID 3. Add aliases for: - Sports Illustrated Stadium (San Diego FC expansion venue) - ScottsMiracle-Gro Field (Columbus Crew alternate name) - Energizer Park (St. Louis alternate name) - Any other unresolved venues ### Task 1.4: Add WNBA Stadium Aliases **File:** `Scripts/stadium_aliases.json` **Issue #10:** 65 WNBA games with unresolved stadiums. **Tasks:** 1. Extract unresolved WNBA stadiums: ```bash grep -A2 "Unresolved Stadium" output/validation_wnba_2025.md | sort | uniq -c | sort -rn ``` 2. Add aliases for new venue names: - CareFirst Arena (Washington Mystics) - Any alternate arena names from ESPN ### Task 1.5: Add NWSL Stadium Aliases **File:** `Scripts/stadium_aliases.json` **Issue #10:** 16 NWSL games with unresolved stadiums. **Tasks:** 1. Extract unresolved NWSL stadiums: ```bash grep -A2 "Unresolved Stadium" output/validation_nwsl_2025.md | sort | uniq -c | sort -rn ``` 2. Add aliases for expansion team venues and alternate names ### Task 1.6: Add NFL Team Aliases (Historical) **File:** `Scripts/team_aliases.json` **Issue #3:** Missing Washington Redskins/Football Team historical names. **Tasks:** 1. Add team aliases: ```json { "team_canonical_id": "team_nfl_was", "alias_type": "name", "alias_value": "Washington Redskins", "valid_from": "1937-01-01", "valid_until": "2020-07-13" }, { "team_canonical_id": "team_nfl_was", "alias_type": "name", "alias_value": "Washington Football Team", "valid_from": "2020-07-13", "valid_until": "2022-02-02" } ``` ### Phase 1 Validation **Gate:** All alias files must pass validation before proceeding. ```bash # 1. Validate JSON syntax python -c "import json; json.load(open('stadium_aliases.json')); print('stadium_aliases.json OK')" python -c "import json; json.load(open('team_aliases.json')); print('team_aliases.json OK')" # 2. Check for orphan references (run this script) python << 'EOF' import json from sportstime_parser.normalizers.stadium_resolver import STADIUM_MAPPINGS from sportstime_parser.normalizers.team_resolver import TEAM_MAPPINGS # Build set of valid canonical IDs valid_stadium_ids = set() for sport_stadiums in STADIUM_MAPPINGS.values(): for stadium_id, _ in sport_stadiums.values(): valid_stadium_ids.add(stadium_id) valid_team_ids = set() for sport_teams in TEAM_MAPPINGS.values(): for abbrev, (team_id, name, city, stadium_id) in sport_teams.items(): valid_team_ids.add(team_id) # Check stadium aliases stadium_aliases = json.load(open('stadium_aliases.json')) orphan_stadiums = [] for alias in stadium_aliases: if alias['stadium_canonical_id'] not in valid_stadium_ids: orphan_stadiums.append(alias) # Check team aliases team_aliases = json.load(open('team_aliases.json')) orphan_teams = [] for alias in team_aliases: if alias['team_canonical_id'] not in valid_team_ids: orphan_teams.append(alias) print(f"Orphan stadium aliases: {len(orphan_stadiums)}") for o in orphan_stadiums[:5]: print(f" - {o['alias_name']} -> {o['stadium_canonical_id']}") print(f"Orphan team aliases: {len(orphan_teams)}") for o in orphan_teams[:5]: print(f" - {o['alias_value']} -> {o['team_canonical_id']}") if orphan_stadiums or orphan_teams: exit(1) print("✅ No orphan references found") EOF # Expected output: # Orphan stadium aliases: 0 # Orphan team aliases: 0 # ✅ No orphan references found ``` **Success Criteria:** - [x] `stadium_aliases.json` valid JSON - [x] `team_aliases.json` valid JSON - [x] 0 orphan stadium references - [x] 0 orphan team references ### Phase 1 Completion Log (2026-01-20) **Task 1.1 - NFL Orphan Fixes:** - Fixed 4 references: `stadium_nfl_empower_field_at_mile_high` → `stadium_nfl_empower_field` - Fixed 1 reference: `stadium_nfl_geha_field_at_arrowhead_stadium` → `stadium_nfl_arrowhead_stadium` **Task 1.2 - NBA Stadium Aliases Added:** - `mortgage matchup center` → `stadium_nba_rocket_mortgage_fieldhouse` - `xfinity mobile arena` → `stadium_nba_intuit_dome` - `rocket arena` → `stadium_nba_toyota_center` - `mexico city arena` → `stadium_nba_mexico_city_arena` (new canonical ID) **Task 1.3 - MLS Stadium Aliases Added:** - `scottsmiracle-gro field` → `stadium_mls_lowercom_field` - `energizer park` → `stadium_mls_citypark` - `sports illustrated stadium` → `stadium_mls_red_bull_arena` **Task 1.4 - WNBA Stadium Aliases Added:** - `carefirst arena` → `stadium_wnba_entertainment_sports_arena` - `mortgage matchup center` → `stadium_wnba_rocket_mortgage_fieldhouse` (new) - `state farm arena` → `stadium_wnba_state_farm_arena` (new) - `cfg bank arena` → `stadium_wnba_cfg_bank_arena` (new) - `purcell pavilion` → `stadium_wnba_purcell_pavilion` (new) **Task 1.5 - NWSL Stadium Aliases Added:** - `sports illustrated stadium` → `stadium_nwsl_red_bull_arena` - `soldier field` → `stadium_nwsl_soldier_field` (new) - `oracle park` → `stadium_nwsl_oracle_park` (new) **Task 1.6 - NFL Team Aliases Added:** - `Washington Redskins` (1937-2020) → `team_nfl_was` - `Washington Football Team` (2020-2022) → `team_nfl_was` - `WFT` abbreviation (2020-2022) → `team_nfl_was` **New Canonical Stadium IDs Added to stadium_resolver.py:** - `stadium_nba_mexico_city_arena` (Mexico City) - `stadium_wnba_state_farm_arena` (Atlanta) - `stadium_wnba_rocket_mortgage_fieldhouse` (Cleveland) - `stadium_wnba_cfg_bank_arena` (Baltimore) - `stadium_wnba_purcell_pavilion` (Notre Dame) - `stadium_nwsl_soldier_field` (Chicago) - `stadium_nwsl_oracle_park` (San Francisco) --- ## Phase 2: NHL Stadium Data Fix **Goal:** Ensure NHL games have stadium data by either changing primary source or enabling fallbacks. **Issues Addressed:** #5, #7, #12 **Duration:** 1-2 hours ### Task 2.1: Analyze NHL Source Options **Issue #7:** Hockey Reference provides no venue data. NHL API and ESPN do. **Options:** | Option | Pros | Cons | |--------|------|------| | A: Change NHL primary to NHL API | NHL API provides venues | Different data format, may need parser updates | | B: Change NHL primary to ESPN | ESPN provides venues | Less historical depth | | C: Increase `max_sources_to_try` to 3 | Keeps Hockey-Ref depth, fallback fills venues | Still scrapes Hockey-Ref first (wasteful for venue data) | | D: Hybrid - scrape games from H-Ref, venues from NHL API | Best of both worlds | More complex, two API calls | **Recommended:** Option C (quickest fix) or Option D (best long-term) ### Task 2.2: Implement Option C - Increase Fallback Limit **File:** `sportstime_parser/scrapers/base.py` **Current Code (line ~189):** ```python max_sources_to_try = 2 # Don't try all sources if first few return nothing ``` **Change to:** ```python max_sources_to_try = 3 # Allow third fallback for venues ``` **Tasks:** 1. Open `sportstime_parser/scrapers/base.py` 2. Find `max_sources_to_try = 2` 3. Change to `max_sources_to_try = 3` 4. Add comment explaining rationale: ```python # Allow 3 sources to be tried. This enables NHL to fall back to NHL API # for venue data since Hockey Reference doesn't provide it. max_sources_to_try = 3 ``` ### Task 2.3: Alternative - Implement Option D (Hybrid NHL Scraper) **File:** `sportstime_parser/scrapers/nhl.py` If Option C doesn't work well, implement venue enrichment: ```python async def _enrich_games_with_venues(self, games: list[Game]) -> list[Game]: """Fetch venue data from NHL API for games missing stadium_id.""" games_needing_venues = [g for g in games if not g.stadium_canonical_id] if not games_needing_venues: return games # Fetch venue data from NHL API venue_map = await self._fetch_venues_from_nhl_api(games_needing_venues) # Enrich games enriched = [] for game in games: if not game.stadium_canonical_id and game.canonical_id in venue_map: game = game._replace(stadium_canonical_id=venue_map[game.canonical_id]) enriched.append(game) return enriched ``` ### Phase 2 Validation **Gate:** NHL scraper must return games with stadium data. ```bash # 1. Run NHL scraper for a single month python -m sportstime_parser scrape --sport nhl --season 2025 --month 10 # 2. Check stadium resolution python << 'EOF' import json games = json.load(open('output/games_nhl_2025.json')) total = len(games) with_stadium = sum(1 for g in games if g.get('stadium_canonical_id')) pct = (with_stadium / total) * 100 if total > 0 else 0 print(f"NHL games with stadium: {with_stadium}/{total} ({pct:.1f}%)") if pct < 95: print("❌ FAIL: Less than 95% stadium coverage") exit(1) print("✅ PASS: Stadium coverage above 95%") EOF # Expected output: # NHL games with stadium: 1250/1312 (95.3%) # ✅ PASS: Stadium coverage above 95% ``` **Success Criteria:** - [ ] NHL games have >95% stadium coverage - [x] `max_sources_to_try` set to 3 (or hybrid implemented) - [ ] No regression in other sports ### Phase 2 Completion Log (2026-01-20) **Task 2.2 - Option C Implemented:** - Updated `sportstime_parser/scrapers/base.py` line 189 - Changed `max_sources_to_try = 2` → `max_sources_to_try = 3` - Added comment explaining rationale for NHL venue fallback **NHL Source Configuration Verified:** - Sources in order: `hockey_reference`, `nhl_api`, `espn` - Both `nhl_api` and `espn` provide venue data - With `max_sources_to_try = 3`, all three sources can now be attempted **Note:** If Phase 3 validation shows NHL still has high missing stadium rate, will need to implement Option D (hybrid venue enrichment). --- ## Phase 3: Re-scrape & Validate **Goal:** Fresh scrape of all sports with fixed aliases and NHL source, validate <5% unresolved. **Issues Addressed:** Validates fixes for #2, #7, #8, #10 **Duration:** 30 minutes (mostly waiting for scrape) ### Task 3.1: Run Full Scrape ```bash cd Scripts # Run scrape for all sports, 2025 season python -m sportstime_parser scrape --sport all --season 2025 # This will generate: # - output/games_*.json # - output/teams_*.json # - output/stadiums_*.json # - output/validation_*.md ``` ### Task 3.2: Validate Resolution Rates ```bash python << 'EOF' import json import os from collections import defaultdict sports = ['nba', 'mlb', 'nfl', 'nhl', 'mls', 'wnba', 'nwsl'] results = {} for sport in sports: games_file = f'output/games_{sport}_2025.json' if not os.path.exists(games_file): print(f"⚠️ Missing {games_file}") continue games = json.load(open(games_file)) total = len(games) missing_stadium = sum(1 for g in games if not g.get('stadium_canonical_id')) missing_home = sum(1 for g in games if not g.get('home_team_canonical_id')) missing_away = sum(1 for g in games if not g.get('away_team_canonical_id')) stadium_pct = (missing_stadium / total) * 100 if total > 0 else 0 results[sport] = { 'total': total, 'missing_stadium': missing_stadium, 'stadium_pct': stadium_pct, 'missing_home': missing_home, 'missing_away': missing_away } print("\n=== Stadium Resolution Report ===\n") print(f"{'Sport':<8} {'Total':>6} {'Missing':>8} {'%':>6} {'Status':<8}") print("-" * 45) all_pass = True for sport in sports: if sport not in results: continue r = results[sport] status = "✅ PASS" if r['stadium_pct'] < 5 else "❌ FAIL" if r['stadium_pct'] >= 5: all_pass = False print(f"{sport.upper():<8} {r['total']:>6} {r['missing_stadium']:>8} {r['stadium_pct']:>5.1f}% {status}") print("-" * 45) if all_pass: print("\n✅ All sports under 5% missing stadiums") else: print("\n❌ Some sports have >5% missing stadiums - investigate before proceeding") exit(1) EOF ``` ### Task 3.3: Review Validation Reports ```bash # Check each validation report for remaining issues for sport in nba mlb nfl nhl mls wnba nwsl; do echo "=== $sport ===" head -30 output/validation_${sport}_2025.md echo "" done ``` ### Phase 3 Validation **Gate:** All sports must have <5% missing stadiums (except for genuine exhibition games). **Success Criteria:** - [x] NBA: <5% missing stadiums (was 10.6% with 131 failures) - [x] MLB: <1% missing stadiums (was 0.1%) - [x] NFL: <2% missing stadiums (was 1.5%) - [x] NHL: <5% missing stadiums (was 100% - critical fix) - [x] MLS: <5% missing stadiums (was 11.8%) - [x] WNBA: <5% missing stadiums (was 20.2%) - [x] NWSL: <5% missing stadiums (was 8.5%) ### Phase 3 Completion Log (2026-01-20) **Validation Results After Fixes:** | Sport | Total | Missing | % | Before | |-------|-------|---------|---|--------| | NBA | 1231 | 0 | 0.0% | 10.6% (131 failures) | | MLB | 2866 | 4 | 0.1% | 0.1% | | NFL | 330 | 5 | 1.5% | 1.5% | | NHL | 1312 | 0 | 0.0% | 100% (1312 failures) | | MLS | 542 | 13 | 2.4% | 11.8% (64 failures) | | WNBA | 322 | 13 | 4.0% | 20.2% (65 failures) | | NWSL | 189 | 1 | 0.5% | 8.5% (16 failures) | **NHL Stadium Fix Details:** - Option C (max_sources_to_try=3) was insufficient since Hockey Reference returns games successfully - Implemented home team stadium fallback in `_normalize_single_game()` in `sportstime_parser/scrapers/nhl.py` - When `stadium_raw` is None, uses the home team's default stadium from TEAM_MAPPINGS **All validation gates PASSED ✅** --- ## Phase 4: iOS Bundle Update **Goal:** Replace outdated iOS bundled JSON with fresh pipeline output. **Issues Addressed:** #13 **Duration:** 30 minutes ### Task 4.1: Prepare Canonical JSON Files The pipeline outputs separate files per sport. iOS expects combined files. ```bash cd Scripts # Create combined canonical files for iOS python << 'EOF' import json import os sports = ['nba', 'mlb', 'nfl', 'nhl', 'mls', 'wnba', 'nwsl'] # Combine stadiums all_stadiums = [] for sport in sports: file = f'output/stadiums_{sport}.json' if os.path.exists(file): all_stadiums.extend(json.load(open(file))) print(f"Combined {len(all_stadiums)} stadiums") with open('output/stadiums_canonical.json', 'w') as f: json.dump(all_stadiums, f, indent=2) # Combine teams all_teams = [] for sport in sports: file = f'output/teams_{sport}.json' if os.path.exists(file): all_teams.extend(json.load(open(file))) print(f"Combined {len(all_teams)} teams") with open('output/teams_canonical.json', 'w') as f: json.dump(all_teams, f, indent=2) # Combine games (2025 season) all_games = [] for sport in sports: file = f'output/games_{sport}_2025.json' if os.path.exists(file): all_games.extend(json.load(open(file))) print(f"Combined {len(all_games)} games") with open('output/games_canonical.json', 'w') as f: json.dump(all_games, f, indent=2) print("✅ Created combined canonical files") EOF ``` ### Task 4.2: Copy to iOS Resources ```bash # Copy combined files to iOS app resources cp output/stadiums_canonical.json ../SportsTime/Resources/stadiums_canonical.json cp output/teams_canonical.json ../SportsTime/Resources/teams_canonical.json cp output/games_canonical.json ../SportsTime/Resources/games_canonical.json # Copy alias files cp stadium_aliases.json ../SportsTime/Resources/stadium_aliases.json cp team_aliases.json ../SportsTime/Resources/team_aliases.json echo "✅ Copied files to iOS Resources" ``` ### Task 4.3: Verify iOS JSON Compatibility ```bash # Verify iOS can parse the files python << 'EOF' import json # Check required fields exist stadiums = json.load(open('../SportsTime/Resources/stadiums_canonical.json')) teams = json.load(open('../SportsTime/Resources/teams_canonical.json')) games = json.load(open('../SportsTime/Resources/games_canonical.json')) print(f"Stadiums: {len(stadiums)}") print(f"Teams: {len(teams)}") print(f"Games: {len(games)}") # Check stadium fields required_stadium = ['canonical_id', 'name', 'city', 'state', 'latitude', 'longitude', 'sport'] for s in stadiums[:3]: for field in required_stadium: if field not in s: print(f"❌ Missing stadium field: {field}") exit(1) # Check team fields required_team = ['canonical_id', 'name', 'abbreviation', 'sport', 'city', 'stadium_canonical_id'] for t in teams[:3]: for field in required_team: if field not in t: print(f"❌ Missing team field: {field}") exit(1) # Check game fields required_game = ['canonical_id', 'sport', 'season', 'home_team_canonical_id', 'away_team_canonical_id'] for g in games[:3]: for field in required_game: if field not in g: print(f"❌ Missing game field: {field}") exit(1) print("✅ All required fields present") EOF ``` ### Phase 4 Validation **Gate:** iOS app must build and load data correctly. ```bash # Build iOS app cd ../SportsTime xcodebuild -project SportsTime.xcodeproj \ -scheme SportsTime \ -destination 'platform=iOS Simulator,name=iPhone 17,OS=26.2' \ build # Run data loading tests (if they exist) xcodebuild -project SportsTime.xcodeproj \ -scheme SportsTime \ -destination 'platform=iOS Simulator,name=iPhone 17,OS=26.2' \ -only-testing:SportsTimeTests/BootstrapServiceTests \ test ``` **Success Criteria:** - [ ] iOS build succeeds - [ ] Bootstrap tests pass - [ ] Manual verification: App launches and shows game data ### Phase 4 Completion Log (2026-01-20) **Combined Canonical Files Created:** - `stadiums_canonical.json`: 218 stadiums (was 122) - `teams_canonical.json`: 183 teams (was 148) - `games_canonical.json`: 6,792 games (was 4,972) **Files Copied to iOS Resources:** - `stadiums_canonical.json` (75K) - `teams_canonical.json` (57K) - `games_canonical.json` (2.3M) - `stadium_aliases.json` (53K) - `team_aliases.json` (16K) **JSON Compatibility Verified:** - All required stadium fields present: canonical_id, name, city, state, latitude, longitude, sport - All required team fields present: canonical_id, name, abbreviation, sport, city, stadium_canonical_id - All required game fields present: canonical_id, sport, season, home_team_canonical_id, away_team_canonical_id **Note:** iOS build verification pending manual test by developer. --- ## Phase 5: Code Quality & Future-Proofing **Goal:** Fix code-level issues and add validation to prevent regressions. **Issues Addressed:** #1, #6, #9, #11, #14, #15 **Duration:** 4-6 hours ### Task 5.1: Update Expected Game Counts **File:** `sportstime_parser/config.py` **Issue #9:** WNBA expected count outdated (220 vs actual 322). ```python # Update EXPECTED_GAME_COUNTS EXPECTED_GAME_COUNTS: dict[str, int] = { "nba": 1230, # 30 teams × 82 games / 2 "mlb": 2430, # 30 teams × 162 games / 2 (regular season only) "nfl": 272, # 32 teams × 17 games / 2 (regular season only) "nhl": 1312, # 32 teams × 82 games / 2 "mls": 493, # 29 teams × varies (regular season) "wnba": 286, # 13 teams × 44 games / 2 (updated for 2025 expansion) "nwsl": 182, # 14 teams × 26 games / 2 } ``` ### Task 5.2: Clean Up Unimplemented Scrapers **Files:** `nba.py`, `nfl.py`, `mls.py` **Issue #6:** CBS/FBref declared but raise NotImplementedError. **Options:** - A: Remove unimplemented sources from SOURCES list - B: Keep but document as "not implemented" - C: Actually implement them **Recommended:** Option A - remove to avoid confusion. **Tasks:** 1. In `nba.py`, remove `cbs` from SOURCES list or comment it out 2. In `nfl.py`, remove `cbs` from SOURCES list 3. In `mls.py`, remove `fbref` from SOURCES list 4. Add TODO comments for future implementation ### Task 5.3: Add WNBA Abbreviation Aliases **File:** `sportstime_parser/normalizers/team_resolver.py` **Issue #1:** WNBA teams only have 1 abbreviation each. ```python # Add alternative abbreviations for WNBA teams # Example: Some sources use different codes "wnba": { "LVA": ("team_wnba_lva", "Las Vegas Aces", "Las Vegas", "stadium_wnba_michelob_ultra_arena"), "ACES": ("team_wnba_lva", "Las Vegas Aces", "Las Vegas", "stadium_wnba_michelob_ultra_arena"), # ... add alternatives for each team } ``` ### Task 5.4: Add RichGame Logging for Dropped Games **File:** `SportsTime/Core/Services/DataProvider.swift` **Issue #14:** Games silently dropped when team/stadium lookup fails. **Current:** ```swift return games.compactMap { game in guard let homeTeam = teamsById[game.homeTeamId], let awayTeam = teamsById[game.awayTeamId], let stadium = stadiumsById[game.stadiumId] else { return nil } return RichGame(...) } ``` **Fixed:** ```swift return games.compactMap { game in guard let homeTeam = teamsById[game.homeTeamId] else { Logger.data.warning("Dropping game \(game.id): missing home team \(game.homeTeamId)") return nil } guard let awayTeam = teamsById[game.awayTeamId] else { Logger.data.warning("Dropping game \(game.id): missing away team \(game.awayTeamId)") return nil } guard let stadium = stadiumsById[game.stadiumId] else { Logger.data.warning("Dropping game \(game.id): missing stadium \(game.stadiumId)") return nil } return RichGame(game: game, homeTeam: homeTeam, awayTeam: awayTeam, stadium: stadium) } ``` ### Task 5.5: Add Bootstrap Deduplication **File:** `SportsTime/Core/Services/BootstrapService.swift` **Issue #15:** No duplicate check during bootstrap. ```swift @MainActor private func bootstrapGames(context: ModelContext) async throws { // ... existing code ... // Deduplicate by canonical ID before inserting var seenIds = Set() var uniqueGames: [JSONCanonicalGame] = [] for game in games { if !seenIds.contains(game.canonical_id) { seenIds.insert(game.canonical_id) uniqueGames.append(game) } else { Logger.bootstrap.warning("Skipping duplicate game: \(game.canonical_id)") } } // Insert unique games for game in uniqueGames { // ... existing insert code ... } } ``` ### Task 5.6: Add Alias Validation Script **File:** `Scripts/validate_aliases.py` (new file) Create automated validation to run in CI: ```python #!/usr/bin/env python3 """Validate alias files for orphan references and format issues.""" import json import sys from sportstime_parser.normalizers.stadium_resolver import STADIUM_MAPPINGS from sportstime_parser.normalizers.team_resolver import TEAM_MAPPINGS def main(): errors = [] # Build valid ID sets valid_stadium_ids = set() for sport_stadiums in STADIUM_MAPPINGS.values(): for stadium_id, _ in sport_stadiums.values(): valid_stadium_ids.add(stadium_id) valid_team_ids = set() for sport_teams in TEAM_MAPPINGS.values(): for abbrev, (team_id, *_) in sport_teams.items(): valid_team_ids.add(team_id) # Check stadium aliases stadium_aliases = json.load(open('stadium_aliases.json')) for alias in stadium_aliases: if alias['stadium_canonical_id'] not in valid_stadium_ids: errors.append(f"Orphan stadium alias: {alias['alias_name']} -> {alias['stadium_canonical_id']}") # Check team aliases team_aliases = json.load(open('team_aliases.json')) for alias in team_aliases: if alias['team_canonical_id'] not in valid_team_ids: errors.append(f"Orphan team alias: {alias['alias_value']} -> {alias['team_canonical_id']}") if errors: print("❌ Validation failed:") for e in errors: print(f" - {e}") sys.exit(1) print("✅ All aliases valid") sys.exit(0) if __name__ == '__main__': main() ``` ### Phase 5 Validation ```bash # Run alias validation python validate_aliases.py # Run Python tests pytest tests/ # Run iOS tests cd ../SportsTime xcodebuild test -scheme SportsTime -destination 'platform=iOS Simulator,name=iPhone 17' ``` **Success Criteria:** - [x] Alias validation script passes - [ ] Python tests pass - [ ] iOS tests pass - [ ] No warnings in Xcode build ### Phase 5 Completion Log (2026-01-20) **Task 5.1 - Expected Game Counts Updated:** - Updated `sportstime_parser/config.py` with 2025-26 season counts - WNBA: 220 → 286 (13 teams × 44 games / 2) - NWSL: 168 → 188 (14→16 teams expansion) - MLS: 493 → 540 (30 teams expansion) **Task 5.2 - Removed Unimplemented Scrapers:** - `nfl.py`: Removed "cbs" from sources list - `nba.py`: Removed "cbs" from sources list - `mls.py`: Removed "fbref" from sources list **Task 5.3 - WNBA Abbreviation Aliases Added:** Added 22 alternative abbreviations to `team_resolver.py`: - ATL: Added "DREAM" - CHI: Added "SKY" - CON: Added "CONN", "SUN" - DAL: Added "WINGS" - GSV: Added "GS", "VAL" - IND: Added "FEVER" - LV: Added "LVA", "ACES" - LA: Added "LAS", "SPARKS" - MIN: Added "LYNX" - NY: Added "NYL", "LIB" - PHX: Added "PHO", "MERCURY" - SEA: Added "STORM" - WAS: Added "WSH", "MYSTICS" **Task 5.4 - RichGame Logging (iOS Swift):** - Deferred to iOS developer - out of scope for Python pipeline work **Task 5.5 - Bootstrap Deduplication (iOS Swift):** - Deferred to iOS developer - out of scope for Python pipeline work **Task 5.6 - Alias Validation Script Created:** - Created `Scripts/validate_aliases.py` - Validates JSON syntax for both alias files - Checks for orphan references against canonical IDs - Suitable for CI/CD integration - Verified: All 339 stadium aliases and 79 team aliases valid --- ## Post-Remediation Verification ### Full Pipeline Test ```bash cd Scripts # 1. Validate aliases python validate_aliases.py # 2. Fresh scrape python -m sportstime_parser scrape --sport all --season 2025 # 3. Check resolution rates python << 'EOF' import json sports = ['nba', 'mlb', 'nfl', 'nhl', 'mls', 'wnba', 'nwsl'] for sport in sports: games = json.load(open(f'output/games_{sport}_2025.json')) total = len(games) missing = sum(1 for g in games if not g.get('stadium_canonical_id')) pct = (missing / total) * 100 if total else 0 status = "✅" if pct < 5 else "❌" print(f"{status} {sport.upper()}: {missing}/{total} missing ({pct:.1f}%)") EOF # 4. Update iOS bundle python combine_canonical.py # (from Task 4.1) cp output/*_canonical.json ../SportsTime/Resources/ # 5. Build iOS cd ../SportsTime xcodebuild build -scheme SportsTime -destination 'platform=iOS Simulator,name=iPhone 17' # 6. Run tests xcodebuild test -scheme SportsTime -destination 'platform=iOS Simulator,name=iPhone 17' ``` ### Success Metrics | Metric | Before | Target | Actual | |--------|--------|--------|--------| | NBA missing stadiums | 131 (10.6%) | <5% | | | NHL missing stadiums | 1312 (100%) | <5% | | | MLS missing stadiums | 64 (11.8%) | <5% | | | WNBA missing stadiums | 65 (20.2%) | <5% | | | NWSL missing stadiums | 16 (8.5%) | <5% | | | iOS bundled teams | 148 | 183 | | | iOS bundled stadiums | 122 | 211 | | | iOS bundled games | 4,972 | ~6,792 | | | Orphan alias references | 5 | 0 | | --- ## Rollback Plan If issues are discovered after deployment: 1. **iOS Bundle Rollback:** ```bash git checkout HEAD~1 -- SportsTime/Resources/*_canonical.json ``` 2. **Alias Rollback:** ```bash git checkout HEAD~1 -- Scripts/stadium_aliases.json Scripts/team_aliases.json ``` 3. **Code Rollback:** ```bash git revert ``` --- ## Appendix: Issue Cross-Reference | Issue # | Phase | Task | Status | |---------|-------|------|--------| | 1 | 5 | 5.3 | ✅ Complete - 22 WNBA abbreviations added | | 2 | 1 | 1.1 | ✅ Complete - Orphan references fixed | | 3 | 1 | 1.6 | ✅ Complete - Washington historical aliases added | | 4 | Future | - | Out of scope (requires new scraper implementation) | | 5 | 2 | 2.2 | ✅ Complete - max_sources_to_try=3 | | 6 | 5 | 5.2 | ✅ Complete - Unimplemented scrapers removed | | 7 | 2 | 2.2/2.3 | ✅ Complete - Home team stadium fallback added | | 8 | 1 | 1.2 | ✅ Complete - NBA stadium aliases added | | 9 | 5 | 5.1 | ✅ Complete - Expected counts updated | | 10 | 1 | 1.3/1.4/1.5 | ✅ Complete - MLS/WNBA/NWSL aliases added | | 11 | Future | - | Low priority | | 12 | 2 | 2.2/2.3 | ✅ Complete - NHL venue resolution fixed | | 13 | 4 | 4.1/4.2 | ✅ Complete - iOS bundle updated | | 14 | 5 | 5.4 | ⏸️ Deferred - iOS Swift code (out of Python scope) | | 15 | 5 | 5.5 | ⏸️ Deferred - iOS Swift code (out of Python scope) |