Milestone v1.0: Data Pipeline
Status: SHIPPED 2026-01-10
Phases: 1-7 (+ 2.1 inserted)
Total Plans: 15
Overview
Transform the monolithic data scraping scripts into a maintainable, sport-organized pipeline that ensures every game correctly links to its teams and stadium. Starting with script restructuring, we completed the stadium database, added alias systems for name variations, established correct canonical linking, implemented full CloudKit CRUD operations, and finished with comprehensive validation reports.
Phases
Phase 1: Script Architecture
Goal: Reorganize monolithic scraping scripts into sport-specific modules (MLB, NBA, NHL, NFL) for easier debugging and maintenance
Depends on: Nothing (first phase)
Plans: 3 plans
Plans:
- 01-01: Create core.py shared module + mlb.py sport module
- 01-02: Create nba.py + nhl.py sport modules
- 01-03: Create nfl.py + refactor scrape_schedules.py orchestrator
Details:
- Created core.py with shared Game/Stadium dataclasses and scraper utilities
- Each sport module exports: {SPORT}_TEAMS, scrape_{sport}_games, {SPORT}_GAME_SOURCES
- Reduced orchestrator from 3359 to 733 lines (78% reduction)
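A minimal sketch of what the per-sport export convention might look like, assuming the three exports follow the underscore pattern of {SPORT}_GAME_SOURCES. The dataclass fields, the sample team entries, and the `_fetch_from_sample_source` helper are hypothetical; the real core.py and mlb.py are not shown in this archive.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Stadium:
    # Field names are assumptions, not the real core.py schema.
    name: str
    latitude: float
    longitude: float
    year_opened: Optional[int] = None


@dataclass
class Game:
    # Field names are assumptions, not the real core.py schema.
    sport: str
    date: str
    home_team: str
    away_team: str
    stadium: str


# --- what a sport module such as mlb.py might export ---
MLB_TEAMS = {"NYY": "New York Yankees", "BOS": "Boston Red Sox"}


def _fetch_from_sample_source(season: int) -> List[Game]:
    # Hypothetical stand-in for a real scraper; returns one fixed game.
    return [Game("MLB", f"{season}-04-01", "NYY", "BOS", "Yankee Stadium")]


# Ordered list of (source name, fetch function); earlier entries are tried first.
MLB_GAME_SOURCES: List[Tuple[str, Callable[[int], List[Game]]]] = [
    ("sample_source", _fetch_from_sample_source),
]


def scrape_mlb_games(season: int) -> List[Game]:
    """Return games from the first source that yields any."""
    for _name, fetch in MLB_GAME_SOURCES:
        try:
            games = fetch(season)
        except Exception:
            continue  # fall through to the next source
        if games:
            return games
    return []
```

With every module exposing the same three names, an orchestrator can loop over sports without sport-specific branches, which is one plausible reason the orchestrator shrank.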
Phase 2: Stadium Foundation
Goal: Complete stadium database with correct coordinates, names, and venue data for all 4 sports
Depends on: Phase 1
Plans: 2 plans
Plans:
- 02-01: Audit & complete hardcoded stadium data in sport modules
- 02-02: Regenerate canonical data and verify pipeline
Details:
- Audited and completed hardcoded stadium data for MLB, NBA, NHL, NFL
- Used original opening years (not renovation years) for year_opened field
- Regenerated canonical JSON files with complete stadium coverage
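The audit step could be approximated with a small completeness check like the one below. This is a sketch under assumptions: the required field names (other than year_opened, which appears above) and the dict-based record shape are hypothetical, and the actual audit may have been manual.

```python
# Hypothetical required fields; only year_opened is named in this document.
REQUIRED_FIELDS = ("name", "city", "latitude", "longitude", "year_opened")


def audit_stadium(stadium: dict) -> list:
    """Return a list of human-readable problems for one stadium record."""
    problems = [
        f"missing {field}"
        for field in REQUIRED_FIELDS
        if stadium.get(field) in (None, "")
    ]
    lat = stadium.get("latitude")
    lon = stadium.get("longitude")
    if lat is not None and not -90 <= lat <= 90:
        problems.append("latitude out of range")
    if lon is not None and not -180 <= lon <= 180:
        problems.append("longitude out of range")
    return problems


# year_opened holds the original opening year, not a renovation year.
fenway = {
    "name": "Fenway Park",
    "city": "Boston",
    "latitude": 42.3467,
    "longitude": -71.0972,
    "year_opened": 1912,
}
incomplete = {"name": "X", "city": "Y", "latitude": 123.0, "longitude": 0.0}
```

Running `audit_stadium` over every hardcoded entry before regenerating the canonical JSON would flag gaps early.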
Phase 2.1: Additional Sports Stadiums (INSERTED)
Goal: Add hardcoded stadium data for secondary sports: MLS, WNBA, NWSL
Depends on: Phase 2
Plans: 3 plans
Plans:
- 02.1-01: Create MLS module with 30 hardcoded stadiums
- 02.1-02: Create WNBA module with 13 hardcoded arenas
- 02.1-03: Create NWSL module with 13 hardcoded stadiums
Details:
- Created mls.py with 30 MLS stadiums including shared NFL venues
- Created wnba.py with 13 arenas cross-referenced from NBA/NHL
- Created nwsl.py with 13 stadiums cross-referenced from MLS
- CBB deferred (350+ D1 teams require a separately scoped phase)
Phase 3: Alias Systems
Goal: Implement alias systems for both stadiums and teams to handle name variations across data sources
Depends on: Phase 2.1
Plans: 2 plans
Plans:
- 03-01: Add NFL to canonicalization pipeline with aliases
- 03-02: Add MLS, WNBA, NWSL to canonicalization pipeline with aliases
Details:
- Added TEAM_ABBREV_ALIASES for cross-source team name variations
- Stadium aliases handle historical names and source-specific variations
- All 7 sports now have alias support
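As an illustration of how alias lookup of this kind can work, here is a hedged sketch. TEAM_ABBREV_ALIASES is named above, but its per-sport nesting, the STADIUM_ALIASES table, both function names, and all sample entries are assumptions made for the example.

```python
# Assumed shape: sport -> {source-specific abbreviation -> canonical abbreviation}.
TEAM_ABBREV_ALIASES = {
    "MLB": {"CHW": "CWS", "WSN": "WSH"},  # illustrative entries only
}

# Assumed shape: normalized historical or source-specific name -> current name.
STADIUM_ALIASES = {
    "staples center": "Crypto.com Arena",
    "skydome": "Rogers Centre",
}


def canonical_team_abbrev(sport: str, abbrev: str) -> str:
    """Map a source abbreviation to its canonical form, or return it unchanged."""
    key = abbrev.strip().upper()
    return TEAM_ABBREV_ALIASES.get(sport, {}).get(key, key)


def canonical_stadium_name(name: str) -> str:
    """Map a historical or variant stadium name to the canonical name."""
    key = " ".join(name.lower().split())  # collapse whitespace, ignore case
    return STADIUM_ALIASES.get(key, name.strip())
```

Unknown inputs pass through unchanged, so a missing alias surfaces later as an unresolved link rather than a silent mismatch.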
Phase 4: Canonical Linking
Goal: Ensure every game correctly links to its home/away teams and stadium via canonical IDs
Depends on: Phase 3
Plans: 1 plan
Plans:
- 04-01: Generate canonical games with resolved team/stadium links
Details:
- All games correctly link to teams and stadiums via canonical IDs
- Team abbreviation aliases discovered during canonicalization added iteratively
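The iterative alias discovery described above could be supported by a linker that collects whatever it cannot resolve. The sketch below assumes dict-based raw games and ID tables; the `link_games` name, the key names, and the ID formats are all hypothetical.

```python
from collections import Counter


def link_games(raw_games, team_ids, stadium_ids, team_aliases):
    """Resolve raw games to canonical IDs.

    Returns (linked, unresolved), where unresolved counts each
    ("team" | "stadium", name) pair that had no canonical ID.
    """
    linked, unresolved = [], Counter()
    for game in raw_games:
        home = team_aliases.get(game["home"], game["home"])
        away = team_aliases.get(game["away"], game["away"])
        missing = [("team", t) for t in (home, away) if t not in team_ids]
        if game["stadium"] not in stadium_ids:
            missing.append(("stadium", game["stadium"]))
        if missing:
            unresolved.update(missing)  # candidates for new aliases
            continue
        linked.append({
            "date": game["date"],
            "home_team_id": team_ids[home],
            "away_team_id": team_ids[away],
            "stadium_id": stadium_ids[game["stadium"]],
        })
    return linked, unresolved


team_ids = {"CWS": "team_mlb_cws", "NYY": "team_mlb_nyy"}
stadium_ids = {"Yankee Stadium": "stadium_mlb_yankee_stadium"}
aliases = {"CHW": "CWS"}
raw = [
    {"date": "2026-04-01", "home": "NYY", "away": "CHW", "stadium": "Yankee Stadium"},
    {"date": "2026-04-02", "home": "NYY", "away": "KCR", "stadium": "Yankee Stadium"},
]
linked, unresolved = link_games(raw, team_ids, stadium_ids, aliases)
```

Here the second game stays unlinked and `unresolved` reports `("team", "KCR")`, which is the kind of signal that would prompt adding another alias and rerunning.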
Phase 5: CloudKit CRUD
Goal: Implement full create, read, update, delete operations for CloudKit management
Depends on: Phase 4
Plans: 2 plans
Plans:
- 05-01: Smart sync with change detection (diff reporting, differential upload)
- 05-02: Verification and record management (sync verification, individual CRUD)
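Change detection for a differential upload can be sketched as a comparison of local canonical records against what is already in CloudKit. This example makes no CloudKit calls; `diff_records`, `plan_sync`, and the dict-of-dicts record shape are assumptions, and the `delete_orphans` parameter simply mirrors the opt-in --delete-orphans flag.

```python
def diff_records(local: dict, remote: dict) -> dict:
    """Classify records keyed by canonical ID into new/changed/unchanged/orphans."""
    return {
        "new": sorted(k for k in local if k not in remote),
        "changed": sorted(k for k in local if k in remote and local[k] != remote[k]),
        "unchanged": sorted(k for k in local if k in remote and local[k] == remote[k]),
        "orphans": sorted(k for k in remote if k not in local),
    }


def plan_sync(diff: dict, delete_orphans: bool = False) -> list:
    """Turn a diff into an ordered list of (operation, canonical_id) pairs."""
    ops = [("create", k) for k in diff["new"]]
    ops += [("update", k) for k in diff["changed"]]
    if delete_orphans:  # deletion is opt-in, so the default is non-destructive
        ops += [("delete", k) for k in diff["orphans"]]
    return ops


local = {"a": {"name": "A"}, "b": {"name": "B2"}, "c": {"name": "C"}}
remote = {"b": {"name": "B"}, "c": {"name": "C"}, "z": {"name": "Z"}}
diff = diff_records(local, remote)
```

Unchanged records produce no operations, which is what makes the upload differential rather than a full rebuild.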
Details:
- New records use forceReplace; updates use recordChangeTag for conflict detection
- Orphan deletion requires explicit --delete-orphans flag (safe by default)
- Triple lookup fallback: direct recordName -> deterministic UUID -> canonicalId query
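One way to read the triple lookup fallback is sketched below. It is an interpretation, not the shipped code: the use of `uuid.uuid5`, the namespace value, and the two injected callables (`fetch_by_name`, `query_by_canonical_id`) are assumptions standing in for the real CloudKit requests.

```python
import uuid

# Hypothetical namespace; the real pipeline's UUID derivation is not documented here.
RECORD_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.invalid")


def deterministic_record_name(canonical_id: str) -> str:
    """Derive a stable UUID string from a canonical ID (same input, same output)."""
    return str(uuid.uuid5(RECORD_NAMESPACE, canonical_id))


def find_record(canonical_id, fetch_by_name, query_by_canonical_id):
    """Try each lookup strategy in order and return the first hit, else None.

    fetch_by_name(record_name) and query_by_canonical_id(canonical_id) are
    injected callables that return a record or None.
    """
    # 1. Direct recordName, 2. deterministic UUID derived from the canonical ID.
    for record_name in (canonical_id, deterministic_record_name(canonical_id)):
        record = fetch_by_name(record_name)
        if record is not None:
            return record
    # 3. Fall back to a query on the canonicalId field.
    return query_by_canonical_id(canonical_id)


# In-memory stand-in for the remote store, keyed by recordName.
store = {deterministic_record_name("game_1"): {"canonicalId": "game_1"}}
found = find_record("game_1", store.get, lambda cid: None)
missing = find_record("game_2", store.get, lambda cid: None)
```

Ordering the cheap name-based fetches before the query keeps the common case to a single request.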
Phase 6: Validation Reports
Goal: Generate validation reports showing record counts, data gaps, orphan records, and relationship integrity
Depends on: Phase 5
Plans: 1 plan
Plans:
- 06-01: Comprehensive validation with orphan listing and completeness metrics
Details:
- Health score formula: avg_completeness - orphan_penalty (max -30) - unknown_penalty (max -10)
- --list-orphans requires CloudKit connection; --validate works offline
- Completeness metrics per sport with expected game counts
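The health score formula above can be worked through with a short sketch. The caps (30 and 10) come from the formula; the one-point-per-record penalty weights, the floor at 0, and the function names are assumptions, since the actual weighting is not recorded here.

```python
def completeness(actual_games: int, expected_games: int) -> float:
    """Percentage of expected games present, capped at 100."""
    if expected_games <= 0:
        return 0.0
    return min(100.0, 100.0 * actual_games / expected_games)


def health_score(completeness_by_sport: dict, orphan_count: int, unknown_count: int,
                 orphan_weight: float = 1.0, unknown_weight: float = 1.0) -> float:
    """avg_completeness - orphan_penalty (max 30) - unknown_penalty (max 10)."""
    if not completeness_by_sport:
        return 0.0
    avg = sum(completeness_by_sport.values()) / len(completeness_by_sport)
    orphan_penalty = min(30.0, orphan_count * orphan_weight)    # assumed weight
    unknown_penalty = min(10.0, unknown_count * unknown_weight)  # assumed weight
    return max(0.0, avg - orphan_penalty - unknown_penalty)      # assumed floor


# An MLB regular season has 2430 games (30 teams x 162 games / 2).
mlb = completeness(2430, 2430)
score = health_score({"MLB": mlb, "NBA": 90.0}, orphan_count=5, unknown_count=2)
```

Because the penalties are capped, a fully complete dataset cannot score below 60 under these assumptions, no matter how many orphans exist.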
Phase 7: Testing & Documentation
Goal: Complete pipeline documentation and finalize project status
Depends on: Phase 6
Plans: 1 plan
Plans:
- 07-01: Create Scripts/README.md and update PROJECT.md with completion status
Details:
- Created comprehensive Scripts/README.md with usage examples
- Updated PROJECT.md with completion status and validated requirements
Milestone Summary
Decimal Phases:
- Phase 2.1: Additional Sports Stadiums (inserted after Phase 2 for MLS/WNBA/NWSL coverage)
Key Decisions:
- Split by sport, not function (user preference for organization)
- Validation reports over automated tests (faster feedback, easier debugging)
- Full CRUD over upload-only (enable data corrections without full rebuild)
- Each sport module is independent, with its own team abbreviation functions
- Non-core sports remain inline with TODO markers for future extraction
Issues Resolved:
- Game linking failures (games now correctly link to teams/stadiums)
- Missing stadium data (148 stadiums complete with coordinates)
- Name variation mismatches (alias systems handle cross-source differences)
Issues Deferred:
- CBB support (350+ D1 teams require a separately scoped phase)
Technical Debt Incurred:
- None significant
For current project status, see .planning/ROADMAP.md