Sportstime/.planning/ROADMAP.md
Trey t 80bfb5919b docs(03-02): complete secondary sports canonicalization plan
Tasks completed: 3/3
- Add MLS to canonicalization pipeline (30 teams + 10 aliases + 8 stadium aliases)
- Add WNBA to canonicalization pipeline (13 teams + 6 aliases + 4 stadium aliases)
- Add NWSL to canonicalization pipeline (13 teams + 7 aliases + 3 stadium aliases)

Phase 3 complete - all 7 sports now have alias support (180 teams total)

SUMMARY: .planning/phases/03-alias-systems/03-02-SUMMARY.md
2026-01-10 09:45:09 -06:00


Roadmap: SportsTime Data Pipeline

Overview

Transform the monolithic data scraping scripts into a maintainable, sport-organized pipeline that ensures every game correctly links to its teams and stadium. Starting with script restructuring, we'll complete the stadium database, add alias systems for name variations, establish correct canonical linking, implement full CloudKit CRUD operations, and finish with comprehensive validation reports.

Domain Expertise

None

Phases

Phase Numbering:

  • Integer phases (1, 2, 3): Planned milestone work

  • Decimal phases (2.1, 2.2): Urgent insertions (marked with INSERTED)

  • Phase 1: Script Architecture - Split monolithic scripts into sport-specific modules (3/3 plans)

  • Phase 2: Stadium Foundation - Complete stadium database with coordinates and names (2/2 plans)

  • Phase 2.1: Additional Sports Stadiums - Add stadium data for MLS, WNBA, NWSL; CBB deferred (INSERTED) (3/3 plans)

  • Phase 3: Alias Systems - Stadium and team alias systems for name variations (2/2 plans)

  • Phase 4: Canonical Linking - Correct game→team→stadium relationships

  • Phase 5: CloudKit CRUD - Full create, read, update, delete operations

  • Phase 6: Validation Reports - Reports showing counts, gaps, orphan records

Phase Details

Phase 1: Script Architecture

Goal: Reorganize monolithic scraping scripts into sport-specific modules (MLB, NBA, NHL, NFL) for easier debugging and maintenance
Depends on: Nothing (first phase)
Research: Unlikely (internal refactoring, Python module patterns)
Plans: 3 plans

Plans:

  • 01-01: Create core.py shared module + mlb.py sport module
  • 01-02: Create nba.py + nhl.py sport modules
  • 01-03: Create nfl.py + refactor scrape_schedules.py orchestrator
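One way the split in plans 01-01 through 01-03 could look is a shared registry that each sport module adds itself to, with the orchestrator looping over the registry. This is a minimal single-file sketch, not the project's actual code: `Game`, `SCRAPERS`, `register`, and `scrape_all` are hypothetical names, and the returned games are placeholder data rather than scraped results.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Game:
    sport: str
    home_team: str
    away_team: str
    stadium: str


# Hypothetical registry that a shared core module could expose.
SCRAPERS: Dict[str, Callable[[int], List[Game]]] = {}


def register(sport: str):
    """Decorator that records a sport module's scrape function under its sport code."""
    def wrap(fn):
        SCRAPERS[sport] = fn
        return fn
    return wrap


@register("MLB")
def scrape_mlb(season: int) -> List[Game]:
    # Placeholder data; a real module would fetch and parse a schedule source.
    return [Game("MLB", "Chicago Cubs", "St. Louis Cardinals", "Wrigley Field")]


@register("NBA")
def scrape_nba(season: int) -> List[Game]:
    # Placeholder data, as above.
    return [Game("NBA", "Boston Celtics", "New York Knicks", "TD Garden")]


def scrape_all(season: int, sports: Optional[List[str]] = None) -> List[Game]:
    """Orchestrator: run each requested sport's scraper and combine the results."""
    games: List[Game] = []
    for sport in (sports or sorted(SCRAPERS)):
        games.extend(SCRAPERS[sport](season))
    return games
```

With this layout, debugging one sport means calling `scrape_all(season, ["NBA"])` and reading one function, which is the maintenance benefit the phase goal describes.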

Phase 2: Stadium Foundation

Goal: Complete stadium database with correct coordinates, names, and venue data for all 4 sports
Depends on: Phase 1
Research: No (hardcoded data exists in sport modules, internal pipeline work)
Plans: 2 plans

Plans:

  • 02-01: Audit & complete hardcoded stadium data in sport modules
  • 02-02: Regenerate canonical data and verify pipeline
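The audit in plan 02-01 could be partly automated with a check like the one below. It is a sketch under assumptions: the `Stadium` shape, the `audit_stadiums` helper, and the stadium IDs are all hypothetical, and the sample entries are illustrative rather than taken from the project's data.

```python
from typing import Dict, List, Optional, TypedDict


class Stadium(TypedDict):
    name: str
    latitude: Optional[float]
    longitude: Optional[float]


def audit_stadiums(stadiums: Dict[str, Stadium]) -> List[str]:
    """Return human-readable problems for entries with missing or implausible data."""
    problems: List[str] = []
    for stadium_id, stadium in sorted(stadiums.items()):
        if not stadium["name"].strip():
            problems.append(f"{stadium_id}: missing name")
        lat, lon = stadium["latitude"], stadium["longitude"]
        if lat is None or lon is None:
            problems.append(f"{stadium_id}: missing coordinates")
        elif not (-90 <= lat <= 90 and -180 <= lon <= 180):
            problems.append(f"{stadium_id}: coordinates out of range")
    return problems


# Illustrative entries with hypothetical IDs.
sample: Dict[str, Stadium] = {
    "stadium_a": {"name": "Wrigley Field", "latitude": 41.9484, "longitude": -87.6553},
    "stadium_b": {"name": "Example Arena", "latitude": None, "longitude": None},
    "stadium_c": {"name": "", "latitude": 120.0, "longitude": 10.0},
}
problems = audit_stadiums(sample)
```

An empty list from `audit_stadiums` would be a reasonable precondition for regenerating the canonical data in plan 02-02.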

Phase 2.1: Additional Sports Stadiums (INSERTED)

Goal: Add hardcoded stadium data for secondary sports: MLS, WNBA, NWSL (CBB deferred - 350+ D1 teams requires separate scoped phase)
Depends on: Phase 2
Research: No (stadium data compilation following established patterns)
Plans: 3 plans

Plans:

  • 02.1-01: Create MLS module with 30 hardcoded stadiums
  • 02.1-02: Create WNBA module with 13 hardcoded arenas
  • 02.1-03: Create NWSL module with 13 hardcoded stadiums

Phase 3: Alias Systems

Goal: Implement alias systems for both stadiums and teams to handle name variations across data sources
Depends on: Phase 2.1
Research: No (internal mapping logic)
Plans: 2 plans

Plans:

  • 03-01: Add NFL to canonicalization pipeline with aliases
  • 03-02: Add MLS, WNBA, NWSL to canonicalization pipeline with aliases
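The alias lookup this phase describes can be sketched as a normalized dictionary from name variants to canonical IDs. The `AliasResolver` class, the `normalize` helper, and the ID `team_mls_lafc` are hypothetical and only illustrate the idea; the project's actual alias format is not shown in this roadmap.

```python
import re
from typing import Dict, Optional


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return " ".join(cleaned.split())


class AliasResolver:
    """Map official names and their aliases to a single canonical ID."""

    def __init__(self, canonical: Dict[str, str], aliases: Dict[str, str]):
        # canonical: canonical_id -> official name
        # aliases:   variant name  -> canonical_id
        self._lookup = {normalize(name): cid for cid, name in canonical.items()}
        for alias, cid in aliases.items():
            if cid not in canonical:
                raise ValueError(f"alias {alias!r} points at unknown id {cid!r}")
            self._lookup[normalize(alias)] = cid

    def resolve(self, name: str) -> Optional[str]:
        """Return the canonical ID for a name, or None when nothing matches."""
        return self._lookup.get(normalize(name))


# Hypothetical ID and aliases, for illustration only.
resolver = AliasResolver(
    canonical={"team_mls_lafc": "Los Angeles FC"},
    aliases={"LAFC": "team_mls_lafc", "L.A. FC": "team_mls_lafc"},
)
```

Returning `None` for an unknown name, instead of guessing, keeps unresolved names visible so they can be added as aliases later.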

Phase 4: Canonical Linking

Goal: Ensure every game correctly links to its home/away teams and stadium via canonical IDs
Depends on: Phase 3
Research: Unlikely (existing model relationships)
Plans: TBD

Plans:

  • 04-01: TBD
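Since the plans for this phase are still TBD, the following is only a sketch of what game→team→stadium linking might involve: resolve each raw name to a canonical ID and collect, rather than silently drop, any game that fails to resolve. `RawGame`, `LinkedGame`, `link_games`, and all IDs are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RawGame:
    game_id: str
    home: str
    away: str
    venue: str


@dataclass
class LinkedGame:
    game_id: str
    home_team_id: str
    away_team_id: str
    stadium_id: str


def link_games(
    raw_games: List[RawGame],
    team_ids: Dict[str, str],
    stadium_ids: Dict[str, str],
) -> Tuple[List[LinkedGame], List[str]]:
    """Attach canonical IDs to each game; report games with unresolved references."""
    linked: List[LinkedGame] = []
    errors: List[str] = []
    for game in raw_games:
        home = team_ids.get(game.home)
        away = team_ids.get(game.away)
        stadium = stadium_ids.get(game.venue)
        missing = [
            label
            for label, value in (("home", home), ("away", away), ("stadium", stadium))
            if value is None
        ]
        if missing:
            errors.append(f"{game.game_id}: unresolved {', '.join(missing)}")
            continue
        linked.append(LinkedGame(game.game_id, home, away, stadium))
    return linked, errors


# Hypothetical names and IDs, for illustration only.
teams = {"Team A": "team_a", "Team B": "team_b"}
stadiums = {"Arena A": "stadium_a"}
linked, errors = link_games(
    [RawGame("g1", "Team A", "Team B", "Arena A"),
     RawGame("g2", "Team A", "Team C", "Arena Z")],
    teams,
    stadiums,
)
```

In practice the plain dictionaries would presumably be replaced by the Phase 3 alias lookup, so that name variants resolve before a game is reported as unlinked.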

Phase 5: CloudKit CRUD

Goal: Implement full create, read, update, delete operations for CloudKit management
Depends on: Phase 4
Research: Likely (CloudKit server-to-server API)
Research topics: CloudKit server-to-server authentication, record modification operations, batch operations, conflict resolution
Plans: TBD

Plans:

  • 05-01: TBD
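As a starting point for the batch-operations research topic, the sketch below builds request bodies in the general shape used by CloudKit Web Services record modification and splits them into batches. Everything here is an assumption to verify during research: the 200-operation cap, the exact body shape, and the hypothetical `Stadium` record type. Authentication, request signing, and the HTTP call itself are left out, and update or delete operations would likely need additional fields such as a record change tag.

```python
from typing import Any, Dict, Iterator, List

# Assumed per-request operation cap; confirm against the CloudKit Web Services docs.
MAX_OPERATIONS_PER_REQUEST = 200


def make_operation(
    operation_type: str,
    record_type: str,
    record_name: str,
    fields: Dict[str, Any],
) -> Dict[str, Any]:
    """Build one operation entry for a modify request (shape assumed, not verified)."""
    return {
        "operationType": operation_type,  # e.g. "create", "update", "delete"
        "record": {
            "recordType": record_type,
            "recordName": record_name,
            # Each field value is wrapped in a {"value": ...} dictionary.
            "fields": {key: {"value": value} for key, value in fields.items()},
        },
    }


def batched(
    operations: List[Dict[str, Any]],
    size: int = MAX_OPERATIONS_PER_REQUEST,
) -> Iterator[Dict[str, Any]]:
    """Yield request bodies that each carry at most `size` operations."""
    for start in range(0, len(operations), size):
        yield {"operations": operations[start:start + size]}


# Hypothetical record type and names, for illustration only.
operations = [
    make_operation("create", "Stadium", f"stadium_{i}", {"name": f"Stadium {i}"})
    for i in range(450)
]
bodies = list(batched(operations))
```

Keeping payload construction separate from transport means the batching logic can be tested without credentials or network access.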

Phase 6: Validation Reports

Goal: Generate validation reports showing record counts, data gaps, orphan records, and relationship integrity
Depends on: Phase 5
Research: Unlikely (internal reporting logic)
Plans: TBD

Plans:

  • 06-01: TBD
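A validation report of the kind described here could be assembled from three checks: counts per sport, games whose references do not exist (orphans), and stadiums no game uses (gaps). The sketch below assumes games are plain dictionaries with the hypothetical keys shown; the real record layout is not specified in this roadmap.

```python
from collections import Counter
from typing import Any, Dict, List, Set


def build_report(
    games: List[Dict[str, str]],
    team_ids: Set[str],
    stadium_ids: Set[str],
) -> Dict[str, Any]:
    """Summarize counts, orphan games, and unused stadiums."""
    counts = Counter(game["sport"] for game in games)
    orphan_games = sorted(
        game["id"]
        for game in games
        if game["home_team_id"] not in team_ids
        or game["away_team_id"] not in team_ids
        or game["stadium_id"] not in stadium_ids
    )
    used_stadiums = {game["stadium_id"] for game in games}
    return {
        "games_per_sport": dict(counts),
        "orphan_games": orphan_games,
        "unused_stadiums": sorted(stadium_ids - used_stadiums),
    }


# Hypothetical records, for illustration only.
games = [
    {"id": "g1", "sport": "MLS", "home_team_id": "t1", "away_team_id": "t2", "stadium_id": "s1"},
    {"id": "g2", "sport": "MLS", "home_team_id": "t1", "away_team_id": "t9", "stadium_id": "s1"},
    {"id": "g3", "sport": "WNBA", "home_team_id": "t3", "away_team_id": "t4", "stadium_id": "s7"},
]
report = build_report(games, team_ids={"t1", "t2", "t3", "t4"}, stadium_ids={"s1", "s2"})
```

Here `g2` is an orphan because team `t9` is unknown, `g3` because stadium `s7` is unknown, and `s2` is reported as unused.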

Progress

Execution Order: Phases execute in numeric order: 1 → 2 → 2.1 → 3 → 4 → 5 → 6
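If phase identifiers are ever sorted by a script, comparing them as tuples of integers reproduces this order and keeps working for later insertions. This is a small sketch; `phase_key` is a hypothetical helper, and phases 2.9 and 2.10 in the second example do not exist in this roadmap.

```python
from typing import List, Tuple


def phase_key(phase: str) -> Tuple[int, ...]:
    """Turn '2.1' into (2, 1) so inserted phases sort right after their parent."""
    return tuple(int(part) for part in phase.split("."))


phases: List[str] = ["3", "2.1", "1", "6", "2", "5", "4"]
ordered = sorted(phases, key=phase_key)

# Tuple keys also order a hypothetical 2.10 after 2.9, which float parsing would not.
inserted = sorted(["2.10", "2.9", "2"], key=phase_key)
```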

| Phase | Plans Complete | Status | Completed |
|-------|----------------|--------|-----------|
| 1. Script Architecture | 3/3 | Complete | 2026-01-10 |
| 2. Stadium Foundation | 2/2 | Complete | 2026-01-10 |
| 2.1. Additional Sports Stadiums | 3/3 | Complete | 2026-01-10 |
| 3. Alias Systems | 2/2 | Complete | 2026-01-10 |
| 4. Canonical Linking | 0/TBD | Not started | - |
| 5. CloudKit CRUD | 0/TBD | Not started | - |
| 6. Validation Reports | 0/TBD | Not started | - |