Sportstime/.planning/phases/04-canonical-linking/04-01-PLAN.md
Trey t dbfaca206d docs(04-01): create canonical linking plan
Phase 4: Canonical Linking
- 1 plan created
- 3 tasks defined (game canonicalization, validation, fix issues)
- Ready for execution
2026-01-10 09:52:58 -06:00

---
phase: 04-canonical-linking
plan: 01
type: execute
---

Generate canonical games with correct team and stadium links for all 7 sports.

Purpose: Complete the data pipeline by resolving raw game data to canonical team/stadium IDs, enabling the iOS app to correctly display game→team→stadium relationships.

Output: games_canonical.json with all games linked to canonical teams and stadiums, validated and ready for CloudKit upload.

<execution_context> ~/.claude/get-shit-done/workflows/execute-phase.md ~/.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/PROJECT.md @.planning/ROADMAP.md @.planning/STATE.md

Prior phase summaries (dependency graph):

@.planning/phases/03-alias-systems/03-01-SUMMARY.md @.planning/phases/03-alias-systems/03-02-SUMMARY.md

Key files:

@Scripts/canonicalize_games.py @Scripts/validate_canonical.py

Tech stack available: Python canonicalization pipeline, team/stadium aliases

Established patterns: 3-stage canonicalization (stadiums → teams → games), sport-scoped resolution

Constraining decisions:

  • Phase 03-01: Team abbreviation aliases handle relocations and data source variations
  • Phase 03-02: All 7 sports (NBA, MLB, NHL, NFL, MLS, WNBA, NWSL) canonicalized with 180 total teams

Current state:

  • games.json: 2.2MB raw game data
  • games_canonical.json: Empty [] - needs to be generated
  • teams_canonical.json: 180 teams across 7 sports
  • stadiums_canonical.json: Complete stadium data
  • stadium_aliases.json: Historical name aliases
Task 1: Run game canonicalization pipeline

Files: Scripts/data/games_canonical.json, Scripts/data/game_resolution_warnings.json

Run the game canonicalization to generate canonical games:
cd Scripts && python canonicalize_games.py --games data/games.json --teams data/teams_canonical.json --aliases data/stadium_aliases.json --output data/ --verbose

This will:

  1. Load raw games from games.json
  2. Resolve team abbreviations to canonical IDs using TEAM_ABBREV_ALIASES
  3. Resolve venues to stadium canonical IDs (preferring home team stadium)
  4. Generate canonical game IDs with doubleheader handling
  5. Output games_canonical.json and any warnings

Expected output: roughly 10,000 or more canonical games across all sports, each with home_team_canonical_id, away_team_canonical_id, and stadium_canonical_id populated.
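
The resolution steps listed above can be sketched as follows. This is a minimal illustration, not the contents of canonicalize_games.py: the alias entries, the `HOME_STADIUM_BY_TEAM` map, the raw game keys (`sport`, `date`, `home`, `away`), the `canonical_id` field name, and the game ID format are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical alias table keyed by (sport, abbreviation).
TEAM_ABBREV_ALIASES = {
    ("NBA", "BKN"): "team_nba_bkn",
    ("NBA", "BRK"): "team_nba_bkn",  # data-source variation of the same team
    ("NBA", "BOS"): "team_nba_bos",
}

# Hypothetical map from canonical team ID to its home stadium ID.
HOME_STADIUM_BY_TEAM = {
    "team_nba_bkn": "stadium_nba_barclays_center",
    "team_nba_bos": "stadium_nba_td_garden",
}


def resolve_team(sport, abbrev):
    """Sport-scoped lookup; unresolved abbreviations get a team_unknown_ ID."""
    fallback = f"team_unknown_{abbrev.lower()}"
    return TEAM_ABBREV_ALIASES.get((sport.upper(), abbrev.upper()), fallback)


def canonicalize_games(raw_games):
    """Return canonical game records with linked team and stadium IDs."""
    seen = defaultdict(int)  # how many games share the same base ID
    canonical = []
    for game in raw_games:
        sport = game["sport"]
        home = resolve_team(sport, game["home"])
        away = resolve_team(sport, game["away"])
        # Prefer the home team's stadium; mark the link unknown otherwise.
        stadium = HOME_STADIUM_BY_TEAM.get(home, "stadium_unknown")
        base_id = f"game_{sport.lower()}_{game['date']}_{away}_{home}"
        seen[base_id] += 1
        # Doubleheader handling: later games on the same day get a suffix.
        game_id = base_id if seen[base_id] == 1 else f"{base_id}_{seen[base_id]}"
        canonical.append({
            "canonical_id": game_id,
            "home_team_canonical_id": home,
            "away_team_canonical_id": away,
            "stadium_canonical_id": stadium,
        })
    return canonical
```

Keying the alias table on the (sport, abbreviation) pair keeps resolution sport-scoped, so the same abbreviation in two leagues cannot collide.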

Verify:

  • games_canonical.json exists and is non-empty
  • File size > 1MB (indicates substantial data)
  • Sample game has all three canonical ID fields populated

Done: games_canonical.json generated with canonical team/stadium links for all games
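
The three Task 1 checks can be automated with a short script. The sketch below assumes games_canonical.json is a top-level JSON array of game objects and treats 1MB as 1,000,000 bytes; the helper name `check_games_file` is hypothetical.

```python
import json
import os

REQUIRED_FIELDS = (
    "home_team_canonical_id",
    "away_team_canonical_id",
    "stadium_canonical_id",
)


def check_games_file(path, min_bytes=1_000_000):
    """Return a list of problems found; an empty list means all checks pass."""
    if not os.path.exists(path):
        return [f"{path} does not exist"]
    problems = []
    size = os.path.getsize(path)
    if size <= min_bytes:
        problems.append(f"{path} is {size} bytes, expected more than {min_bytes}")
    with open(path, encoding="utf-8") as fh:
        games = json.load(fh)
    if not games:
        problems.append(f"{path} contains no games")
        return problems
    # Spot-check the first game for the three canonical ID fields.
    missing = [field for field in REQUIRED_FIELDS if not games[0].get(field)]
    if missing:
        problems.append(f"sample game is missing: {', '.join(missing)}")
    return problems
```

Calling `check_games_file("Scripts/data/games_canonical.json")` from the repository root returns an empty list when all three checks hold.
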
Task 2: Validate canonical links

Files: Scripts/data/canonicalization_validation.json

Run validation to ensure all game→team→stadium references resolve:
cd Scripts && python validate_canonical.py --data-dir data/ --verbose

Check validation output for:

  1. ERROR-level issues: Must be zero (blocks CloudKit upload)
  2. Unknown teams: Any home_team_canonical_id or away_team_canonical_id not found in teams_canonical.json
  3. Unknown stadiums: Any stadium_canonical_id starting with "stadium_unknown"
  4. Game count warnings: Teams with unusual game counts per EXPECTED_GAMES config

If validation passes with no errors, the linking is complete.

Verify:

  • validate_canonical.py exits with code 0
  • No ERROR-level issues reported
  • All teams and stadiums resolve to known entities

Done: All canonical games validated - no broken team or stadium links
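
For reference, the core of the link check can be sketched in a few lines. This is an illustration of the checks described above, not the logic of validate_canonical.py; the function name, the `canonical_id` field, and the idea of passing the known IDs in as sets are assumptions.

```python
def find_broken_links(games, known_team_ids, known_stadium_ids):
    """Return (game_id, field, value) tuples for every unresolved link."""
    errors = []
    for game in games:
        game_id = game.get("canonical_id", "<no id>")
        # Both team references must point at a known canonical team.
        for field in ("home_team_canonical_id", "away_team_canonical_id"):
            team_id = game.get(field)
            if team_id not in known_team_ids:
                errors.append((game_id, field, team_id))
        # Stadium must be present, not a placeholder, and known.
        stadium_id = game.get("stadium_canonical_id")
        if (
            not stadium_id
            or stadium_id.startswith("stadium_unknown")
            or stadium_id not in known_stadium_ids
        ):
            errors.append((game_id, "stadium_canonical_id", stadium_id))
    return errors
```

An empty result corresponds to the "zero ERROR-level issues" condition; a wrapper script could exit non-zero whenever the list is non-empty.
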
Task 3: Fix resolution issues (if any)

Files: Scripts/canonicalize_games.py, Scripts/canonicalize_stadiums.py

Review game_resolution_warnings.json and fix any issues:

If "Unknown home/away team" warnings:

  • Add missing team abbreviation alias to TEAM_ABBREV_ALIASES in canonicalize_games.py
  • Format: ('SPORT', 'ABBREV'): 'team_sport_canonical',
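
As an illustration of that entry format, the sketch below shows two alias entries and a helper that lists which (sport, abbreviation) keys still need one. The canonical ID values, the raw game keys (`sport`, `home`, `away`), and the `missing_aliases` helper are assumptions for the example, not names taken from canonicalize_games.py.

```python
# Entries follow the ('SPORT', 'ABBREV'): 'canonical_id' format.
TEAM_ABBREV_ALIASES = {
    ("NFL", "LV"): "team_nfl_lv",
    ("NFL", "OAK"): "team_nfl_lv",  # relocation: Oakland -> Las Vegas
}


def missing_aliases(raw_games, aliases=TEAM_ABBREV_ALIASES):
    """Return the sorted (sport, abbrev) keys that have no alias entry."""
    missing = set()
    for game in raw_games:
        for side in ("home", "away"):
            key = (game["sport"].upper(), game[side].upper())
            if key not in aliases:
                missing.add(key)
    return sorted(missing)
```

Each key returned by `missing_aliases` corresponds to one line that would need to be added to the alias table.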

If "Unknown stadium" warnings:

  • Check if venue name needs alias in HISTORICAL_STADIUM_ALIASES in canonicalize_stadiums.py
  • Or verify home team has correct stadium_canonical_id in sport module

After fixes:

  1. Re-run canonicalization: python canonicalize_games.py --verbose
  2. Re-run validation: python validate_canonical.py --verbose

If no warnings exist, mark this task as complete with "No resolution issues found."

Verify:

  • game_resolution_warnings.json is empty or contains only acceptable warnings
  • Re-running canonicalization produces no new warnings

Done: All resolution issues fixed, or no issues found
Before declaring phase complete:

- [ ] `games_canonical.json` exists with >1MB of data
- [ ] All games have valid `home_team_canonical_id` (no "team_unknown_*")
- [ ] All games have valid `away_team_canonical_id` (no "team_unknown_*")
- [ ] All games have valid `stadium_canonical_id` (no "stadium_unknown_*")
- [ ] `validate_canonical.py` passes with 0 errors
- [ ] Game counts per team within expected ranges
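
The last checklist item, game counts per team, can be spot-checked with a sketch like the one below. The real EXPECTED_GAMES config is not shown in this plan, so the structure used here (sport mapped to an inclusive min/max range), the example ranges, and the `sport` key on each game are all assumptions.

```python
from collections import Counter

# Illustrative ranges only; the project's EXPECTED_GAMES may differ.
EXPECTED_GAMES = {"NBA": (75, 90), "NFL": (15, 25)}


def game_count_warnings(games, expected=EXPECTED_GAMES):
    """Return a warning string for each team outside its sport's range."""
    counts = Counter()
    for game in games:
        # Each game counts once for the home team and once for the away team.
        for field in ("home_team_canonical_id", "away_team_canonical_id"):
            counts[(game["sport"], game[field])] += 1
    warnings = []
    for (sport, team_id), total in sorted(counts.items()):
        if sport not in expected:
            continue  # no configured range for this sport
        low, high = expected[sport]
        if not low <= total <= high:
            warnings.append(f"{team_id}: {total} games, expected {low}-{high}")
    return warnings
```

A team with an unusually low count often points to an abbreviation that resolved to a different or unknown canonical ID, which ties this check back to Task 3.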

<success_criteria>

  • All tasks completed
  • All verification checks pass
  • games_canonical.json ready for CloudKit upload
  • No broken team or stadium links in any game

</success_criteria>
After completion, create `.planning/phases/04-canonical-linking/04-01-SUMMARY.md` with:
---
phase: 04-canonical-linking
plan: 01
subsystem: data-pipeline
tags: [games, canonicalization, linking, validation]

# Dependency graph
requires:
  - phase: 03-alias-systems
    provides: Team/stadium aliases for all 7 sports
provides:
  - Canonical games with resolved team/stadium links
  - Validated game→team→stadium relationships
affects: [05-cloudkit-crud, ios-app-data]

# Tech tracking
tech-stack:
  added: []
  patterns: [game-canonicalization, link-validation]

key-files:
  created:
    - Scripts/data/games_canonical.json
  modified: []

key-decisions:
  - [Any decisions made during execution]

patterns-established:
  - [Any new patterns]

issues-created: []

# Metrics
duration: X min
completed: YYYY-MM-DD
---

# Phase 4 Plan 01: Canonical Linking Summary

**[One-liner: X games canonicalized with Y% resolution rate]**

## Accomplishments
- [Key outcomes]

## Files Created/Modified
- [List with descriptions]

## Decisions Made
- [Or "None"]

## Issues Encountered
- [Or "None"]

## Next Phase Readiness
- Ready for Phase 5: CloudKit CRUD