Sportstime/.planning/milestones/v1.0-ROADMAP.md
Trey t ca9fa535f1 chore: complete v1.0 Data Pipeline milestone
- Added MILESTONES.md entry with key accomplishments
- Evolved PROJECT.md with validated requirements
- Reorganized ROADMAP.md with milestone grouping
- Created milestone archive: milestones/v1.0-ROADMAP.md
- Updated STATE.md for next milestone planning
- Tagged v1.0

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 11:15:19 -06:00


Milestone v1.0: Data Pipeline

Status: SHIPPED 2026-01-10
Phases: 1-7 (+ 2.1 inserted)
Total Plans: 15

Overview

This milestone transformed the monolithic data scraping scripts into a maintainable, sport-organized pipeline that ensures every game correctly links to its teams and stadium. Work began with script restructuring, then completed the stadium database, added alias systems for name variations, established correct canonical linking, implemented full CloudKit CRUD operations, and finished with comprehensive validation reports.

Phases

Phase 1: Script Architecture

Goal: Reorganize monolithic scraping scripts into sport-specific modules (MLB, NBA, NHL, NFL) for easier debugging and maintenance
Depends on: Nothing (first phase)
Plans: 3 plans

Plans:

  • 01-01: Create core.py shared module + mlb.py sport module
  • 01-02: Create nba.py + nhl.py sport modules
  • 01-03: Create nfl.py + refactor scrape_schedules.py orchestrator

Details:

  • Created core.py with shared Game/Stadium dataclasses and scraper utilities
  • Each sport module exports: {SPORT}_TEAMS, scrape_{sport}_games, {SPORT}_GAME_SOURCES
  • Reduced orchestrator from 3359 to 733 lines (78% reduction)
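
The module layout described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the dataclass fields, the sample team table, and the `_scrape_example_source` helper are assumptions made up for the example. Only the export names (`MLB_TEAMS`, `scrape_mlb_games`, `MLB_GAME_SOURCES`) follow the naming pattern stated in the details.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# --- core.py (sketch): shared dataclasses; field names are assumptions ---
@dataclass
class Stadium:
    name: str
    city: str
    latitude: float
    longitude: float

@dataclass
class Game:
    sport: str
    date: str                      # ISO date string
    home_team: str
    away_team: str
    stadium: Optional[str] = None

# --- mlb.py (sketch): the three exports each sport module provides ---
MLB_TEAMS = {"BOS": "Boston Red Sox", "NYY": "New York Yankees"}

def _scrape_example_source(season: int) -> List[Game]:
    # Hypothetical stand-in for a real scraper; returns fixed data so it runs offline.
    return [Game("MLB", f"{season}-04-01", "BOS", "NYY", "Fenway Park")]

# Ordered list of (source name, fetch function) pairs.
MLB_GAME_SOURCES: List[Tuple[str, Callable[[int], List[Game]]]] = [
    ("example", _scrape_example_source),
]

def scrape_mlb_games(season: int) -> List[Game]:
    """Try each source in order and return the first non-empty result."""
    for _name, fetch in MLB_GAME_SOURCES:
        games = fetch(season)
        if games:
            return games
    return []

# --- scrape_schedules.py (sketch): the orchestrator only dispatches by sport ---
SCRAPERS = {"MLB": scrape_mlb_games}

if __name__ == "__main__":
    for sport, scrape in SCRAPERS.items():
        print(sport, len(scrape(2026)))
```

Keeping the orchestrator down to a dispatch table like `SCRAPERS` is one plausible way the line count could shrink, since the per-sport logic moves into the modules.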

Phase 2: Stadium Foundation

Goal: Complete stadium database with correct coordinates, names, and venue data for all 4 sports
Depends on: Phase 1
Plans: 2 plans

Plans:

  • 02-01: Audit & complete hardcoded stadium data in sport modules
  • 02-02: Regenerate canonical data and verify pipeline

Details:

  • Audited and completed hardcoded stadium data for MLB, NBA, NHL, NFL
  • Used original opening years (not renovation years) for year_opened field
  • Regenerated canonical JSON files with complete stadium coverage

Phase 2.1: Additional Sports Stadiums (INSERTED)

Goal: Add hardcoded stadium data for secondary sports: MLS, WNBA, NWSL
Depends on: Phase 2
Plans: 3 plans

Plans:

  • 02.1-01: Create MLS module with 30 hardcoded stadiums
  • 02.1-02: Create WNBA module with 13 hardcoded arenas
  • 02.1-03: Create NWSL module with 13 hardcoded stadiums

Details:

  • Created mls.py with 30 MLS stadiums including shared NFL venues
  • Created wnba.py with 13 arenas cross-referenced from NBA/NHL
  • Created nwsl.py with 13 stadiums cross-referenced from MLS
  • CBB deferred (350+ D1 teams require a separately scoped phase)

Phase 3: Alias Systems

Goal: Implement alias systems for both stadiums and teams to handle name variations across data sources
Depends on: Phase 2.1
Plans: 2 plans

Plans:

  • 03-01: Add NFL to canonicalization pipeline with aliases
  • 03-02: Add MLS, WNBA, NWSL to canonicalization pipeline with aliases

Details:

  • Added TEAM_ABBREV_ALIASES for cross-source team name variations
  • Stadium aliases handle historical names and source-specific variations
  • All 7 sports now have alias support
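
A minimal sketch of how such alias tables might be applied during canonicalization. `TEAM_ABBREV_ALIASES` is named in the details above, but its structure here (a per-sport mapping from source abbreviation to canonical abbreviation) is an assumption, as are `STADIUM_ALIASES`, the two resolver functions, and the canonical ID format.

```python
from typing import Dict, Optional

# Assumed shape: sport -> {source-specific abbreviation: canonical abbreviation}.
TEAM_ABBREV_ALIASES: Dict[str, Dict[str, str]] = {
    "NBA": {"PHO": "PHX", "BRK": "BKN"},
}

# Hypothetical table: lowercased venue name (current or historical) -> canonical stadium ID.
STADIUM_ALIASES: Dict[str, str] = {
    "crypto.com arena": "stadium_nba_crypto_com_arena",
    "staples center": "stadium_nba_crypto_com_arena",   # historical name
}

def resolve_team_abbrev(sport: str, abbrev: str) -> str:
    """Return the canonical abbreviation, or the normalized input if no alias exists."""
    key = abbrev.strip().upper()
    return TEAM_ABBREV_ALIASES.get(sport, {}).get(key, key)

def resolve_stadium(name: str) -> Optional[str]:
    """Return the canonical stadium ID for a venue name, or None if unknown."""
    return STADIUM_ALIASES.get(name.strip().lower())

if __name__ == "__main__":
    print(resolve_team_abbrev("NBA", "pho"))
    print(resolve_stadium("Staples Center"))
```

Returning `None` for an unknown venue, rather than guessing, leaves it to the caller to report the gap so a new alias can be added.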

Phase 4: Canonical Linking

Goal: Ensure every game correctly links to its home/away teams and stadium via canonical IDs
Depends on: Phase 3
Plans: 1 plan

Plans:

  • 04-01: Generate canonical games with resolved team/stadium links

Details:

  • All games correctly link to teams and stadiums via canonical IDs
  • Team abbreviation aliases discovered during canonicalization added iteratively
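
The linking guarantee above can be checked with a pass like the one below, which also illustrates how unresolved references might surface the need for a new alias. It is a sketch under assumptions: the `CanonicalGame` fields, the ID formats, and the `find_broken_links` helper are hypothetical and not taken from the project's scripts.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple

@dataclass
class CanonicalGame:
    canonical_id: str
    home_team_id: str
    away_team_id: str
    stadium_id: str

def find_broken_links(
    games: Iterable[CanonicalGame],
    team_ids: Set[str],
    stadium_ids: Set[str],
) -> List[Tuple[str, List[str]]]:
    """Return (game ID, names of unresolved fields) for every game with a dangling link."""
    broken = []
    for game in games:
        missing = []
        if game.home_team_id not in team_ids:
            missing.append("home_team")
        if game.away_team_id not in team_ids:
            missing.append("away_team")
        if game.stadium_id not in stadium_ids:
            missing.append("stadium")
        if missing:
            broken.append((game.canonical_id, missing))
    return broken

if __name__ == "__main__":
    teams = {"team_mlb_bos", "team_mlb_nyy"}
    stadiums = {"stadium_mlb_fenway_park"}
    games = [
        CanonicalGame("game_1", "team_mlb_bos", "team_mlb_nyy", "stadium_mlb_fenway_park"),
        CanonicalGame("game_2", "team_mlb_bos", "team_mlb_xxx", "stadium_mlb_unknown"),
    ]
    print(find_broken_links(games, teams, stadiums))
```

An empty result corresponds to the state described in the details: every game resolves to known team and stadium IDs.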

Phase 5: CloudKit CRUD

Goal: Implement full create, read, update, delete operations for CloudKit management
Depends on: Phase 4
Plans: 2 plans

Plans:

  • 05-01: Smart sync with change detection (diff reporting, differential upload)
  • 05-02: Verification and record management (sync verification, individual CRUD)

Details:

  • New records use forceReplace; updates use recordChangeTag for conflict detection
  • Orphan deletion requires explicit --delete-orphans flag (safe by default)
  • Triple lookup fallback: direct recordName -> deterministic UUID -> canonicalId query
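
The triple lookup fallback can be illustrated with an in-memory stand-in for the record store. This is a sketch only: a plain dict replaces the CloudKit calls, and the UUID namespace, the `canonicalId` field layout, and both function names are assumptions. `uuid.uuid5` is used here as one common way to derive a deterministic UUID; the project's actual derivation is not stated in this document.

```python
import uuid
from typing import Dict, Optional

# Assumed namespace for deriving deterministic UUIDs from canonical IDs.
NAMESPACE = uuid.NAMESPACE_URL

def deterministic_uuid(canonical_id: str) -> str:
    """Same canonical ID always yields the same UUID string."""
    return str(uuid.uuid5(NAMESPACE, canonical_id))

def find_record(store: Dict[str, dict], canonical_id: str) -> Optional[dict]:
    """Look a record up three ways, from cheapest to most expensive."""
    # 1. Direct lookup: recordName equals the canonical ID.
    record = store.get(canonical_id)
    if record is not None:
        return record
    # 2. recordName is a UUID derived deterministically from the canonical ID.
    record = store.get(deterministic_uuid(canonical_id))
    if record is not None:
        return record
    # 3. Fall back to a query on the canonicalId field (a scan in this stand-in).
    for record in store.values():
        if record.get("canonicalId") == canonical_id:
            return record
    return None

if __name__ == "__main__":
    store = {
        "stadium_a": {"canonicalId": "stadium_a"},
        deterministic_uuid("stadium_b"): {"canonicalId": "stadium_b"},
        "legacy-record-name": {"canonicalId": "stadium_c"},
    }
    for cid in ("stadium_a", "stadium_b", "stadium_c", "stadium_d"):
        print(cid, find_record(store, cid))
```

Ordering the lookups this way means the field query, typically the slowest path against a remote service, runs only for records whose names match neither naming scheme.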

Phase 6: Validation Reports

Goal: Generate validation reports showing record counts, data gaps, orphan records, and relationship integrity
Depends on: Phase 5
Plans: 1 plan

Plans:

  • 06-01: Comprehensive validation with orphan listing and completeness metrics

Details:

  • Health score formula: avg_completeness - orphan_penalty (capped at 30 points) - unknown_penalty (capped at 10 points)
  • --list-orphans requires CloudKit connection; --validate works offline
  • Completeness metrics per sport with expected game counts
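
The health score formula can be worked through in a few lines. The caps of 30 and 10 points come from the formula above; everything else is an assumption made for illustration: the one-point-per-record penalty rates, the floor at zero, and the `health_score` function name and signature.

```python
from typing import Dict

ORPHAN_PENALTY_CAP = 30.0    # from the formula above
UNKNOWN_PENALTY_CAP = 10.0   # from the formula above

def health_score(
    completeness_by_sport: Dict[str, float],
    orphan_count: int,
    unknown_count: int,
    points_per_orphan: float = 1.0,    # assumed rate
    points_per_unknown: float = 1.0,   # assumed rate
) -> float:
    """avg_completeness - orphan_penalty - unknown_penalty, with capped penalties."""
    values = list(completeness_by_sport.values())
    avg_completeness = sum(values) / len(values) if values else 0.0
    orphan_penalty = min(ORPHAN_PENALTY_CAP, orphan_count * points_per_orphan)
    unknown_penalty = min(UNKNOWN_PENALTY_CAP, unknown_count * points_per_unknown)
    # Flooring at zero is an assumption; the source does not say how negatives are handled.
    return max(0.0, avg_completeness - orphan_penalty - unknown_penalty)

if __name__ == "__main__":
    completeness = {"MLB": 100.0, "NBA": 90.0}
    print(health_score(completeness, orphan_count=5, unknown_count=2))    # 95 - 5 - 2 = 88.0
    print(health_score(completeness, orphan_count=50, unknown_count=20))  # 95 - 30 - 10 = 55.0
```

With the caps in place, even a very large number of orphans cannot drag the score down by more than 40 points in total, so completeness remains the dominant term.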

Phase 7: Testing & Documentation

Goal: Complete pipeline documentation and finalize project status
Depends on: Phase 6
Plans: 1 plan

Plans:

  • 07-01: Create Scripts/README.md and update PROJECT.md with completion status

Details:

  • Created comprehensive Scripts/README.md with usage examples
  • Updated PROJECT.md with completion status and validated requirements

Milestone Summary

Decimal Phases:

  • Phase 2.1: Additional Sports Stadiums (inserted after Phase 2 for MLS/WNBA/NWSL coverage)

Key Decisions:

  • Split by sport, not function (user preference for organization)
  • Validation reports over automated tests (faster feedback, easier debugging)
  • Full CRUD over upload-only (enable data corrections without full rebuild)
  • Each sport module independent with own team abbrev functions
  • Non-core sports remain inline with TODO markers for future extraction

Issues Resolved:

  • Game linking failures (games now correctly link to teams/stadiums)
  • Missing stadium data (148 stadiums complete with coordinates)
  • Name variation mismatches (alias systems handle cross-source differences)

Issues Deferred:

  • CBB support (350+ D1 teams require a separately scoped phase)

Technical Debt Incurred:

  • None significant

For current project status, see .planning/ROADMAP.md