---
phase: 04-canonical-linking
plan: 01
type: execute
---

Generate canonical games with correct team and stadium links for all 7 sports.

Purpose: Complete the data pipeline by resolving raw game data to canonical team/stadium IDs, enabling the iOS app to correctly display game→team→stadium relationships.

Output: `games_canonical.json` with all games linked to canonical teams and stadiums, validated and ready for CloudKit upload.

~/.claude/get-shit-done/workflows/execute-phase.md
~/.claude/get-shit-done/templates/summary.md

@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md

# Prior phase summaries (dependency graph):
@.planning/phases/03-alias-systems/03-01-SUMMARY.md
@.planning/phases/03-alias-systems/03-02-SUMMARY.md

# Key files:
@Scripts/canonicalize_games.py
@Scripts/validate_canonical.py

**Tech stack available:** Python canonicalization pipeline, team/stadium aliases

**Established patterns:** 3-stage canonicalization (stadiums → teams → games), sport-scoped resolution

**Constraining decisions:**
- Phase 03-01: Team abbreviation aliases handle relocations and data source variations
- Phase 03-02: All 7 sports (NBA, MLB, NHL, NFL, MLS, WNBA, NWSL) canonicalized with 180 total teams

**Current state:**
- `games.json`: 2.2MB raw game data
- `games_canonical.json`: Empty `[]` - needs to be generated
- `teams_canonical.json`: 180 teams across 7 sports
- `stadiums_canonical.json`: Complete stadium data
- `stadium_aliases.json`: Historical name aliases

Task 1: Run game canonicalization pipeline

Files: Scripts/data/games_canonical.json, Scripts/data/game_resolution_warnings.json

Run the game canonicalization to generate canonical games:

```bash
cd Scripts && python canonicalize_games.py \
  --games data/games.json \
  --teams data/teams_canonical.json \
  --aliases data/stadium_aliases.json \
  --output data/ \
  --verbose
```

This will:
1. Load raw games from games.json
2. Resolve team abbreviations to canonical IDs using TEAM_ABBREV_ALIASES
3. Resolve venues to stadium canonical IDs (preferring home team stadium)
4. Generate canonical game IDs with doubleheader handling
5. Output games_canonical.json and any warnings

Expected output: ~10,000+ canonical games across all sports with home_team_canonical_id, away_team_canonical_id, and stadium_canonical_id populated.

Verify:
- games_canonical.json exists and is non-empty
- File size > 1MB (indicates substantial data)
- Sample game has all three canonical ID fields populated

Done: games_canonical.json generated with canonical team/stadium links for all games

Task 2: Validate canonical links

Files: Scripts/data/canonicalization_validation.json

Run validation to ensure all game→team→stadium references resolve:

```bash
cd Scripts && python validate_canonical.py --data-dir data/ --verbose
```

Check validation output for:
1. **ERROR-level issues**: Must be zero (blocks CloudKit upload)
2. **Unknown teams**: Any team_canonical_id not found in teams_canonical.json
3. **Unknown stadiums**: Any stadium_canonical_id starting with "stadium_unknown"
4. **Game count warnings**: Teams with unusual game counts per EXPECTED_GAMES config

If validation passes with no errors, the linking is complete.

Verify:
- validate_canonical.py exits with code 0
- No ERROR-level issues reported
- All teams and stadiums resolve to known entities

Done: All canonical games validated - no broken team or stadium links

Task 3: Fix resolution issues (if any)

Files: Scripts/canonicalize_games.py, Scripts/canonicalize_stadiums.py

Review game_resolution_warnings.json and fix any issues:

**If "Unknown home/away team" warnings:**
- Add missing team abbreviation alias to TEAM_ABBREV_ALIASES in canonicalize_games.py
- Format: `('SPORT', 'ABBREV'): 'team_sport_canonical',`

**If "Unknown stadium" warnings:**
- Check if venue name needs alias in HISTORICAL_STADIUM_ALIASES in canonicalize_stadiums.py
- Or verify home team has correct stadium_canonical_id in sport module

**After fixes:**
1. Re-run canonicalization: `python canonicalize_games.py --verbose`
2. Re-run validation: `python validate_canonical.py --verbose`

If no warnings exist, mark this task as complete with "No resolution issues found."

Verify:
- game_resolution_warnings.json is empty or contains only acceptable warnings
- Re-run canonicalization produces no new warnings

Done: All resolution issues fixed, or no issues found

Before declaring phase complete:
- [ ] `games_canonical.json` exists with >1MB of data
- [ ] All games have valid `home_team_canonical_id` (no "team_unknown_*")
- [ ] All games have valid `away_team_canonical_id` (no "team_unknown_*")
- [ ] All games have valid `stadium_canonical_id` (no "stadium_unknown_*")
- [ ] `validate_canonical.py` passes with 0 errors
- [ ] Game counts per team within expected ranges

Success criteria:
- All tasks completed
- All verification checks pass
- games_canonical.json ready for CloudKit upload
- No broken team or stadium links in any game

After completion, create `.planning/phases/04-canonical-linking/04-01-SUMMARY.md` with:

```markdown
---
phase: 04-canonical-linking
plan: 01
subsystem: data-pipeline
tags: [games, canonicalization, linking, validation]

# Dependency graph
requires:
  - phase: 03-alias-systems
    provides: Team/stadium aliases for all 7 sports
provides:
  - Canonical games with resolved team/stadium links
  - Validated game→team→stadium relationships
affects: [05-cloudkit-crud, ios-app-data]

# Tech tracking
tech-stack:
  added: []
  patterns: [game-canonicalization, link-validation]

key-files:
  created:
    - Scripts/data/games_canonical.json
  modified: []

key-decisions:
  - [Any decisions made during execution]

patterns-established:
  - [Any new patterns]

issues-created: []

# Metrics
duration: X min
completed: YYYY-MM-DD
---

# Phase 4 Plan 01: Canonical Linking Summary

**[One-liner: X games canonicalized with Y% resolution rate]**

## Accomplishments
- [Key outcomes]

## Files Created/Modified
- [List with descriptions]

## Decisions Made
- [Or "None"]

## Issues Encountered
- [Or "None"]

## Next Phase Readiness
- Ready for Phase 5: CloudKit CRUD
```
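
Optional spot check: the manual verification items in this plan (file present and larger than 1MB, no `team_unknown_*` or `stadium_unknown_*` IDs) could be checked independently of `validate_canonical.py` with a small script like the one below. This is a minimal sketch, not part of the existing pipeline. It assumes `games_canonical.json` is a JSON array of game objects carrying the three canonical ID fields named in this plan. The helper names `find_unresolved_links` and `check_file` are hypothetical.

```python
import json
from pathlib import Path

# Field names and "unknown" prefixes come from this plan's checklist.
# The file layout (a JSON array of game objects) is an assumption.
LINK_FIELDS = {
    "home_team_canonical_id": "team_unknown",
    "away_team_canonical_id": "team_unknown",
    "stadium_canonical_id": "stadium_unknown",
}


def find_unresolved_links(games):
    """Return (game_index, field, value) for every missing or unknown link."""
    problems = []
    for index, game in enumerate(games):
        for field, unknown_prefix in LINK_FIELDS.items():
            value = game.get(field)
            # A link is unresolved if it is absent/empty or carries the
            # placeholder prefix used for unknown entities.
            if not value or str(value).startswith(unknown_prefix):
                problems.append((index, field, value))
    return problems


def check_file(path):
    """Load a canonical games file and report size, count, and bad links."""
    path = Path(path)
    games = json.loads(path.read_text(encoding="utf-8"))
    return {
        "size_bytes": path.stat().st_size,
        "game_count": len(games),
        "unresolved": find_unresolved_links(games),
    }
```

Calling `check_file("Scripts/data/games_canonical.json")` from the repository root would return the file size, the game count, and a list of unresolved links. An empty `unresolved` list together with `size_bytes` above 1,000,000 corresponds to the first four checklist items. It does not replace the ERROR-level and game-count checks in `validate_canonical.py`.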