diff --git a/.planning/phases/04-canonical-linking/04-01-PLAN.md b/.planning/phases/04-canonical-linking/04-01-PLAN.md
new file mode 100644
index 0000000..d03c0ab
--- /dev/null
+++ b/.planning/phases/04-canonical-linking/04-01-PLAN.md
@@ -0,0 +1,209 @@
+---
+phase: 04-canonical-linking
+plan: 01
+type: execute
+---
+
+
+Generate canonical games with correct team and stadium links for all 7 sports.
+
+Purpose: Complete the data pipeline by resolving raw game data to canonical team/stadium IDs, enabling the iOS app to correctly display game→team→stadium relationships.
+Output: `games_canonical.json` with all games linked to canonical teams and stadiums, validated and ready for CloudKit upload.
+
+
+
+~/.claude/get-shit-done/workflows/execute-phase.md
+~/.claude/get-shit-done/templates/summary.md
+
+
+
+@.planning/PROJECT.md
+@.planning/ROADMAP.md
+@.planning/STATE.md
+
+# Prior phase summaries (dependency graph):
+@.planning/phases/03-alias-systems/03-01-SUMMARY.md
+@.planning/phases/03-alias-systems/03-02-SUMMARY.md
+
+# Key files:
+@Scripts/canonicalize_games.py
+@Scripts/validate_canonical.py
+
+**Tech stack available:** Python canonicalization pipeline, team/stadium aliases
+**Established patterns:** 3-stage canonicalization (stadiums → teams → games), sport-scoped resolution
+**Constraining decisions:**
+- Phase 03-01: Team abbreviation aliases handle relocations and data source variations
+- Phase 03-02: All 7 sports (NBA, MLB, NHL, NFL, MLS, WNBA, NWSL) canonicalized with 180 total teams
+
+**Current state:**
+- `games.json`: 2.2MB raw game data
+- `games_canonical.json`: Empty `[]` - needs to be generated
+- `teams_canonical.json`: 180 teams across 7 sports
+- `stadiums_canonical.json`: Complete stadium data
+- `stadium_aliases.json`: Historical name aliases
+
+
+
+
+
+ Task 1: Run game canonicalization pipeline
+ Scripts/data/games_canonical.json, Scripts/data/game_resolution_warnings.json
+
+Run the game canonicalization to generate canonical games:
+```bash
+cd Scripts && python canonicalize_games.py --games data/games.json --teams data/teams_canonical.json --aliases data/stadium_aliases.json --output data/ --verbose
+```
+
+This will:
+1. Load raw games from games.json
+2. Resolve team abbreviations to canonical IDs using TEAM_ABBREV_ALIASES
+3. Resolve venues to stadium canonical IDs (preferring home team stadium)
+4. Generate canonical game IDs with doubleheader handling
+5. Output games_canonical.json and any warnings
+
+Expected output: ~10,000+ canonical games across all sports with home_team_canonical_id, away_team_canonical_id, and stadium_canonical_id populated.
+
+
+- games_canonical.json exists and is non-empty
+- File size > 1MB (indicates substantial data)
+- Sample game has all three canonical ID fields populated
+
+ games_canonical.json generated with canonical team/stadium links for all games
+
+
+
+ Task 2: Validate canonical links
+ Scripts/data/canonicalization_validation.json
+
+Run validation to ensure all game→team→stadium references resolve:
+
+```bash
+cd Scripts && python validate_canonical.py --data-dir data/ --verbose
+```
+
+Check validation output for:
+1. **ERROR-level issues**: Must be zero (blocks CloudKit upload)
+2. **Unknown teams**: Any team_canonical_id not found in teams_canonical.json
+3. **Unknown stadiums**: Any stadium_canonical_id starting with "stadium_unknown"
+4. **Game count warnings**: Teams with unusual game counts per EXPECTED_GAMES config
+
+If validation passes with no errors, the linking is complete.
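
The "unknown teams" and "unknown stadiums" checks listed above can be illustrated with a minimal sketch. It is not the logic of `validate_canonical.py`, whose internals are not shown here. The function name `find_broken_links` and the sample IDs are hypothetical; only the three `*_canonical_id` field names and the `stadium_unknown` prefix come from this plan.

```python
def find_broken_links(games, known_team_ids):
    """Return (games with unknown teams, games with unknown stadiums).

    Assumes each game is a dict carrying the three canonical ID fields
    named in this plan; a missing field is treated as a broken link.
    """
    bad_teams = [
        g for g in games
        if g.get("home_team_canonical_id") not in known_team_ids
        or g.get("away_team_canonical_id") not in known_team_ids
    ]
    bad_stadiums = []
    for g in games:
        stadium_id = g.get("stadium_canonical_id") or ""
        if not stadium_id or stadium_id.startswith("stadium_unknown"):
            bad_stadiums.append(g)
    return bad_teams, bad_stadiums


# Hypothetical sample data; real IDs come from teams_canonical.json.
known = {"team_nba_bos", "team_nba_lal"}
sample_games = [
    {"home_team_canonical_id": "team_nba_bos",
     "away_team_canonical_id": "team_nba_lal",
     "stadium_canonical_id": "stadium_nba_example"},
    {"home_team_canonical_id": "team_nba_xyz",  # not in the known set
     "away_team_canonical_id": "team_nba_lal",
     "stadium_canonical_id": "stadium_unknown_foo"},
    {"home_team_canonical_id": "team_nba_bos",
     "away_team_canonical_id": "team_nba_lal"},  # stadium field missing
]
bad_teams, bad_stadiums = find_broken_links(sample_games, known)
print(len(bad_teams), len(bad_stadiums))
```

Keeping the check as a pure function over already-loaded lists makes it easy to run against `games_canonical.json` after a `json.load`, without depending on the validator's report format.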
+
+
+- validate_canonical.py exits with code 0
+- No ERROR-level issues reported
+- All teams and stadiums resolve to known entities
+
+ All canonical games validated - no broken team or stadium links
+
+
+
+ Task 3: Fix resolution issues (if any)
+ Scripts/canonicalize_games.py, Scripts/canonicalize_stadiums.py
+
+Review game_resolution_warnings.json and fix any issues:
+
+**If "Unknown home/away team" warnings:**
+- Add missing team abbreviation alias to TEAM_ABBREV_ALIASES in canonicalize_games.py
+- Format: `('SPORT', 'ABBREV'): 'team_sport_canonical',`
+
+**If "Unknown stadium" warnings:**
+- Check if venue name needs alias in HISTORICAL_STADIUM_ALIASES in canonicalize_stadiums.py
+- Or verify home team has correct stadium_canonical_id in sport module
+
+**After fixes:**
+1. Re-run canonicalization: `python canonicalize_games.py --verbose`
+2. Re-run validation: `python validate_canonical.py --verbose`
+
+If no warnings exist, mark this task as complete with "No resolution issues found."
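
To show how an entry in the `('SPORT', 'ABBREV'): 'team_sport_canonical'` format might be consumed, here is a small sketch of a sport-scoped lookup. The alias entries, the `resolve_team` helper, and the fallback that builds `team_<sport>_<abbrev>` directly are all assumptions for illustration; the real table and resolution order live in `canonicalize_games.py`.

```python
# Hypothetical alias entries in the format described above.
TEAM_ABBREV_ALIASES = {
    ("NBA", "NJN"): "team_nba_brk",
    ("MLB", "FLA"): "team_mlb_mia",
}


def resolve_team(sport, abbrev, canonical_ids, aliases=TEAM_ABBREV_ALIASES):
    """Resolve a (sport, abbreviation) pair to a canonical team ID.

    Aliases are checked first; otherwise a direct ID is built and accepted
    only if it is a known canonical ID. Returns None when unresolved, which
    is the case that would call for adding a new alias entry.
    """
    key = (sport.upper(), abbrev.upper())
    if key in aliases:
        return aliases[key]
    candidate = f"team_{sport.lower()}_{abbrev.lower()}"
    return candidate if candidate in canonical_ids else None


canonical_ids = {"team_nba_brk", "team_nba_bos", "team_mlb_mia"}
print(resolve_team("NBA", "NJN", canonical_ids))  # alias hit
print(resolve_team("nba", "bos", canonical_ids))  # direct match
print(resolve_team("NHL", "NJN", canonical_ids))  # unresolved
```

Keying the table on the `(sport, abbreviation)` tuple keeps the same abbreviation from colliding across leagues, which is the point of sport-scoped resolution.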
+
+
+- game_resolution_warnings.json is empty or contains only acceptable warnings
+- Re-run canonicalization produces no new warnings
+
+ All resolution issues fixed, or no issues found
+
+
+
+
+
+Before declaring phase complete:
+- [ ] `games_canonical.json` exists with >1MB of data
+- [ ] All games have valid `home_team_canonical_id` (no "team_unknown_*")
+- [ ] All games have valid `away_team_canonical_id` (no "team_unknown_*")
+- [ ] All games have valid `stadium_canonical_id` (no "stadium_unknown_*")
+- [ ] `validate_canonical.py` passes with 0 errors
+- [ ] Game counts per team within expected ranges
+
+
+
+
+- All tasks completed
+- All verification checks pass
+- games_canonical.json ready for CloudKit upload
+- No broken team or stadium links in any game
+
+
+
+After completion, create `.planning/phases/04-canonical-linking/04-01-SUMMARY.md` with:
+
+```markdown
+---
+phase: 04-canonical-linking
+plan: 01
+subsystem: data-pipeline
+tags: [games, canonicalization, linking, validation]
+
+# Dependency graph
+requires:
+  - phase: 03-alias-systems
+    provides: Team/stadium aliases for all 7 sports
+provides:
+  - Canonical games with resolved team/stadium links
+  - Validated game→team→stadium relationships
+affects: [05-cloudkit-crud, ios-app-data]
+
+# Tech tracking
+tech-stack:
+  added: []
+  patterns: [game-canonicalization, link-validation]
+
+key-files:
+  created:
+    - Scripts/data/games_canonical.json
+  modified: []
+
+key-decisions:
+  - [Any decisions made during execution]
+
+patterns-established:
+  - [Any new patterns]
+
+issues-created: []
+
+# Metrics
+duration: X min
+completed: YYYY-MM-DD
+---
+
+# Phase 4 Plan 01: Canonical Linking Summary
+
+**[One-liner: X games canonicalized with Y% resolution rate]**
+
+## Accomplishments
+- [Key outcomes]
+
+## Files Created/Modified
+- [List with descriptions]
+
+## Decisions Made
+- [Or "None"]
+
+## Issues Encountered
+- [Or "None"]
+
+## Next Phase Readiness
+- Ready for Phase 5: CloudKit CRUD
+```
+
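
The summary one-liner ("X games canonicalized with Y% resolution rate") could be computed along these lines. This is a sketch under the assumption that a game counts as resolved when all three canonical ID fields are present and none uses the `team_unknown_` or `stadium_unknown_` prefix from the checklist above; `resolution_summary` and the sample IDs are hypothetical.

```python
ID_FIELDS = (
    "home_team_canonical_id",
    "away_team_canonical_id",
    "stadium_canonical_id",
)
UNKNOWN_PREFIXES = ("team_unknown_", "stadium_unknown_")


def is_resolved(game):
    """True if every canonical ID field is present and not an unknown placeholder."""
    for field in ID_FIELDS:
        value = game.get(field)
        if not value or value.startswith(UNKNOWN_PREFIXES):
            return False
    return True


def resolution_summary(games):
    """Build the one-line summary string for the SUMMARY.md template."""
    total = len(games)
    resolved = sum(1 for g in games if is_resolved(g))
    rate = 100.0 * resolved / total if total else 0.0
    return f"{total} games canonicalized with {rate:.1f}% resolution rate"


# Hypothetical sample: three resolved games and one with an unknown stadium.
ok = {"home_team_canonical_id": "team_nba_bos",
      "away_team_canonical_id": "team_nba_lal",
      "stadium_canonical_id": "stadium_nba_example"}
bad = dict(ok, stadium_canonical_id="stadium_unknown_foo")
summary = resolution_summary([ok, ok, ok, bad])
print(summary)
```

In practice the list would come from `json.load` on `Scripts/data/games_canonical.json`; the empty-list guard avoids a division by zero while the file is still `[]`.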