chore: complete v1.0 Data Pipeline milestone

- Added MILESTONES.md entry with key accomplishments
- Evolved PROJECT.md with validated requirements
- Reorganized ROADMAP.md with milestone grouping
- Created milestone archive: milestones/v1.0-ROADMAP.md
- Updated STATE.md for next milestone planning
- Tagged v1.0

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Trey t
2026-01-10 11:15:19 -06:00
parent 1b796a604c
commit ca9fa535f1
5 changed files with 229 additions and 179 deletions

.planning/MILESTONES.md Normal file

@@ -0,0 +1,29 @@
# Project Milestones: SportsTime
## v1.0 Data Pipeline (Shipped: 2026-01-10)
**Delivered:** Complete Python data pipeline with sport-organized modules, full stadium database, alias systems, and CloudKit CRUD operations for the SportsTime iOS app.
**Phases completed:** 1-7 (+ 2.1 inserted) — 15 plans total
**Key accomplishments:**
- Reorganized monolithic scripts into 7 sport modules (mlb.py, nba.py, nhl.py, nfl.py, mls.py, wnba.py, nwsl.py)
- Built complete stadium database with 148 venues across all 7 sports with verified coordinates
- Implemented team and stadium alias systems for cross-source name variations
- Added full CloudKit CRUD with diff reporting, smart sync, and orphan detection
- Created comprehensive validation with health scores and completeness metrics
- Reduced scrape_schedules.py orchestrator from 3359 to 733 lines (78% reduction)
**Stats:**
- 15 plans executed across 8 phases
- ~12,000 lines of Python
- 92 minutes total execution time (6.1 min average per plan)
- 2 days from start to ship (2026-01-09 → 2026-01-10)
**Git range:** `feat(01-01)` → `docs(07-01)`
**What's next:** iOS app improvements — IAP, UI/UX, testing
---

@@ -16,16 +16,17 @@ Every game must correctly link to its teams and stadium — a game at the wrong
- ✓ Canonical data models (stadiums, teams, games) — existing
- ✓ CloudKit import capability — existing
- ✓ Bundled JSON generation for offline-first — existing
- ✓ Split scripts by sport (MLB, NBA, NHL, NFL as separate modules) — v1.0
- ✓ Complete stadium database with correct coordinates and names (148 stadiums) — v1.0
- ✓ Stadium alias system for name variations across sources — v1.0
- ✓ Correct game→team→stadium canonical linking for all sports — v1.0
- ✓ Full CRUD CloudKit management (create, read, update, delete) — v1.0
- ✓ Validation reports showing counts, gaps, and orphan records — v1.0
- ✓ Team alias system for name variations across sources — v1.0
### Active
- ✓ Split scripts by sport (MLB, NBA, NHL, NFL as separate modules) — 7 sport modules
- ✓ Complete stadium database with correct coordinates and names — 148 stadiums
- ✓ Stadium alias system for name variations across sources — alias JSON files
- ✓ Correct game→team→stadium canonical linking for all sports — canonicalize_games.py
- ✓ Full CRUD CloudKit management (create, read, update, delete) — cloudkit_import.py
- ✓ Validation reports showing counts, gaps, and orphan records — --validate flag
- ✓ Team alias system for name variations across sources — TEAM_ABBREV_ALIASES
(None — v1.0 complete, planning next milestone)
### Out of Scope
@@ -68,4 +69,4 @@ Every game must correctly link to its teams and stadium — a game at the wrong
| Full CRUD over upload-only | Enable data corrections without full rebuild | ✓ Completed — create/update/delete with diff reporting and orphan detection |
---
*Last updated: 2026-01-10 — Project complete (all 7 phases finished)*
*Last updated: 2026-01-10 after v1.0 Data Pipeline milestone*

@@ -1,121 +1,36 @@
# Roadmap: SportsTime Data Pipeline
# Roadmap: SportsTime
## Overview
## Milestones
Transform the monolithic data scraping scripts into a maintainable, sport-organized pipeline that ensures every game correctly links to its teams and stadium. Starting with script restructuring, we'll complete the stadium database, add alias systems for name variations, establish correct canonical linking, implement full CloudKit CRUD operations, and finish with comprehensive validation reports.
- [v1.0 Data Pipeline](milestones/v1.0-ROADMAP.md) (Phases 1-7) — SHIPPED 2026-01-10
## Domain Expertise
## Completed Milestones
None
<details>
<summary>v1.0 Data Pipeline (Phases 1-7) — SHIPPED 2026-01-10</summary>
## Phases
- [x] Phase 1: Script Architecture (3/3 plans) — completed 2026-01-10
- [x] Phase 2: Stadium Foundation (2/2 plans) — completed 2026-01-10
- [x] Phase 2.1: Additional Sports Stadiums (3/3 plans) — completed 2026-01-10
- [x] Phase 3: Alias Systems (2/2 plans) — completed 2026-01-10
- [x] Phase 4: Canonical Linking (1/1 plans) — completed 2026-01-10
- [x] Phase 5: CloudKit CRUD (2/2 plans) — completed 2026-01-10
- [x] Phase 6: Validation Reports (1/1 plans) — completed 2026-01-10
- [x] Phase 7: Testing & Documentation (1/1 plans) — completed 2026-01-10
**Phase Numbering:**
- Integer phases (1, 2, 3): Planned milestone work
- Decimal phases (2.1, 2.2): Urgent insertions (marked with INSERTED)
**Full details:** [milestones/v1.0-ROADMAP.md](milestones/v1.0-ROADMAP.md)
- [x] **Phase 1: Script Architecture** - Split monolithic scripts into sport-specific modules (3/3 plans)
- [x] **Phase 2: Stadium Foundation** - Complete stadium database with coordinates and names (2/2 plans)
- [x] **Phase 2.1: Additional Sports Stadiums** - Add stadium data for MLS, WNBA, NWSL, CBB (INSERTED) (3/3 plans)
- [x] **Phase 3: Alias Systems** - Stadium and team alias systems for name variations (2/2 plans)
- [x] **Phase 4: Canonical Linking** - Correct game→team→stadium relationships (1/1 plans)
- [x] **Phase 5: CloudKit CRUD** - Full create, read, update, delete operations (2/2 plans)
- [x] **Phase 6: Validation Reports** - Reports showing counts, gaps, orphan records (1/1 plans)
- [x] **Phase 7: Testing & Documentation** - Test coverage and documentation updates (1/1 plans)
## Phase Details
### Phase 1: Script Architecture
**Goal**: Reorganize monolithic scraping scripts into sport-specific modules (MLB, NBA, NHL, NFL) for easier debugging and maintenance
**Depends on**: Nothing (first phase)
**Research**: Unlikely (internal refactoring, Python module patterns)
**Plans**: 3 plans
Plans:
- [x] 01-01: Create core.py shared module + mlb.py sport module
- [x] 01-02: Create nba.py + nhl.py sport modules
- [x] 01-03: Create nfl.py + refactor scrape_schedules.py orchestrator
### Phase 2: Stadium Foundation
**Goal**: Complete stadium database with correct coordinates, names, and venue data for all 4 sports
**Depends on**: Phase 1
**Research**: No (hardcoded data exists in sport modules, internal pipeline work)
**Plans**: 2 plans
Plans:
- [x] 02-01: Audit & complete hardcoded stadium data in sport modules
- [x] 02-02: Regenerate canonical data and verify pipeline
### Phase 2.1: Additional Sports Stadiums (INSERTED)
**Goal**: Add hardcoded stadium data for secondary sports: MLS, WNBA, NWSL (CBB deferred - 350+ D1 teams requires separate scoped phase)
**Depends on**: Phase 2
**Research**: No (stadium data compilation following established patterns)
**Plans**: 3 plans
Plans:
- [x] 02.1-01: Create MLS module with 30 hardcoded stadiums
- [x] 02.1-02: Create WNBA module with 13 hardcoded arenas
- [x] 02.1-03: Create NWSL module with 13 hardcoded stadiums
### Phase 3: Alias Systems
**Goal**: Implement alias systems for both stadiums and teams to handle name variations across data sources
**Depends on**: Phase 2.1
**Research**: No (internal mapping logic)
**Plans**: 2 plans
Plans:
- [x] 03-01: Add NFL to canonicalization pipeline with aliases
- [x] 03-02: Add MLS, WNBA, NWSL to canonicalization pipeline with aliases
### Phase 4: Canonical Linking
**Goal**: Ensure every game correctly links to its home/away teams and stadium via canonical IDs
**Depends on**: Phase 3
**Research**: Unlikely (existing model relationships)
**Plans**: 1 plan
Plans:
- [x] 04-01: Generate canonical games with resolved team/stadium links
### Phase 5: CloudKit CRUD
**Goal**: Implement full create, read, update, delete operations for CloudKit management
**Depends on**: Phase 4
**Research**: No (existing patterns in cloudkit_import.py sufficient)
**Plans**: 2 plans
Plans:
- [x] 05-01: Smart sync with change detection (diff reporting, differential upload)
- [x] 05-02: Verification and record management (sync verification, individual CRUD)
### Phase 6: Validation Reports
**Goal**: Generate validation reports showing record counts, data gaps, orphan records, and relationship integrity
**Depends on**: Phase 5
**Research**: Unlikely (internal reporting logic)
**Plans**: 1 plan
Plans:
- [x] 06-01: Comprehensive validation with orphan listing and completeness metrics
### Phase 7: Testing & Documentation
**Goal**: Complete pipeline documentation and finalize project status
**Depends on**: Phase 6
**Research**: No (internal documentation)
**Plans**: 1 plan
Plans:
- [x] 07-01: Create Scripts/README.md and update PROJECT.md with completion status
</details>
## Progress
**Execution Order:**
Phases execute in numeric order: 1 → 2 → 2.1 → 3 → 4 → 5 → 6 → 7
| Phase | Plans Complete | Status | Completed |
|-------|----------------|--------|-----------|
| 1. Script Architecture | 3/3 | Complete | 2026-01-10 |
| 2. Stadium Foundation | 2/2 | Complete | 2026-01-10 |
| 2.1. Additional Sports Stadiums | 3/3 | Complete | 2026-01-10 |
| 3. Alias Systems | 2/2 | Complete | 2026-01-10 |
| 4. Canonical Linking | 1/1 | Complete | 2026-01-10 |
| 5. CloudKit CRUD | 2/2 | Complete | 2026-01-10 |
| 6. Validation Reports | 1/1 | Complete | 2026-01-10 |
| 7. Testing & Documentation | 1/1 | Complete | 2026-01-10 |
| Phase | Milestone | Plans Complete | Status | Completed |
|-------|-----------|----------------|--------|-----------|
| 1. Script Architecture | v1.0 | 3/3 | Complete | 2026-01-10 |
| 2. Stadium Foundation | v1.0 | 2/2 | Complete | 2026-01-10 |
| 2.1. Additional Sports Stadiums | v1.0 | 3/3 | Complete | 2026-01-10 |
| 3. Alias Systems | v1.0 | 2/2 | Complete | 2026-01-10 |
| 4. Canonical Linking | v1.0 | 1/1 | Complete | 2026-01-10 |
| 5. CloudKit CRUD | v1.0 | 2/2 | Complete | 2026-01-10 |
| 6. Validation Reports | v1.0 | 1/1 | Complete | 2026-01-10 |
| 7. Testing & Documentation | v1.0 | 1/1 | Complete | 2026-01-10 |

@@ -2,88 +2,43 @@
## Project Reference
See: .planning/PROJECT.md (updated 2026-01-09)
See: .planning/PROJECT.md (updated 2026-01-10)
**Core value:** Every game must correctly link to its teams and stadium — a game at the wrong venue or with broken team links ruins trip planning.
**Current focus:** Phase 7 — Testing & Documentation
**Current focus:** Planning next milestone
## Current Position
Phase: 7 of 7 (Testing & Documentation)
Plan: 1 of 1 in current phase
Status: Complete
Last activity: 2026-01-10 — Completed 07-01-PLAN.md (final plan)
Phase: N/A (between milestones)
Plan: N/A
Status: Ready to plan next milestone
Last activity: 2026-01-10 — v1.0 Data Pipeline shipped
Progress: ██████████ 100% (15 of 15 plans complete — MILESTONE COMPLETE)
Progress: v1.0 complete — 15 plans across 8 phases
## Performance Metrics
## Shipped Milestones
**Velocity:**
- Total plans completed: 15
- Average duration: 6.1 min
- Total execution time: 92 min
**By Phase:**
| Phase | Plans | Total | Avg/Plan |
|-------|-------|-------|----------|
| 1. Script Architecture | 3/3 | 23 min | 7.7 min |
| 2. Stadium Foundation | 2/2 | 14 min | 7 min |
| 2.1. Additional Sports Stadiums | 3/3 | 17 min | 5.7 min |
| 3. Alias Systems | 2/2 | 6 min | 3 min |
| 4. Canonical Linking | 1/1 | 4 min | 4 min |
| 5. CloudKit CRUD | 2/2 | 14 min | 7 min |
| 6. Validation Reports | 1/1 | 12 min | 12 min |
| 7. Testing & Documentation | 1/1 | 2 min | 2 min |
**Recent Trend:**
- Last 5 plans: 05-01 (6 min), 05-02 (8 min), 06-01 (12 min), 07-01 (2 min)
- Trend: Complete — all phases finished
| Milestone | Phases | Plans | Shipped |
|-----------|--------|-------|---------|
| v1.0 Data Pipeline | 1-7 (+2.1) | 15 | 2026-01-10 |
## Accumulated Context
### Decisions
Decisions are logged in PROJECT.md Key Decisions table.
Recent decisions affecting current work:
- **01-01**: Each sport module has its own `get_{sport}_team_abbrev()` function for independence
- **01-01**: Import fallback pattern (try/except) for running from Scripts/ or project root
- **01-02**: NBA/NHL use season string format (2024-25) for cross-calendar-year seasons
- **01-02**: Each module has hardcoded stadium list with coordinates as reliable fallback
- **01-03**: NFL uses cross-calendar-year season format (2025-26) like NBA/NHL
- **01-03**: Non-core sports (WNBA, MLS, NWSL, CBB) remain inline with TODO markers
- **02-01**: Used original opening years (not renovation years) for year_opened field
- **02-01**: Stadium dataclass already supported year_opened - no changes needed to core.py
- **02-02**: MLS stadiums excluded from bundled JSON (incomplete data), deferred to Phase 2.1
- **02.1-01**: Used soccer configuration capacities for shared NFL stadiums (e.g., Mercedes-Benz 42,500 for soccer vs 71,000 for NFL)
- **02.1-01**: Prioritized hardcoded source (priority=1) over gavinr GeoJSON (priority=2) for complete data
- **02.1-02**: Cross-referenced shared arena coordinates from nba.py and nhl.py for WNBA venues
- **02.1-03**: Cross-referenced 10 of 13 NWSL stadiums from mls.py for shared venue coordinates
- **02.1-03**: CBB deferred to future phase (350+ D1 teams requires separate scoped approach)
- **04-01**: Team abbreviation aliases discovered during canonicalization runs are added iteratively to TEAM_ABBREV_ALIASES
- **05-01**: New records use forceReplace; updated records use update with recordChangeTag for conflict detection
- **05-01**: Orphan deletion requires explicit --delete-orphans flag for safety (safe by default)
- **05-02**: Triple lookup fallback: direct recordName -> deterministic UUID -> canonicalId query
- **06-01**: Health score formula: avg_completeness - orphan_penalty (max -30) - unknown_penalty (max -10)
- **06-01**: --list-orphans requires CloudKit connection; --validate works with or without
### Roadmap Evolution
- Phase 2.1 inserted after Phase 2: Add stadium data for MLS, WNBA, NWSL, CBB (INSERTED)
- Phase 7 added: Testing & Documentation
### Deferred Issues
None yet.
- CBB support (350+ D1 teams require a separately scoped phase)
### Blockers/Concerns
None yet.
None.
## Session Continuity
Last session: 2026-01-10
Stopped at: Milestone complete — all 7 phases finished
Stopped at: v1.0 milestone shipped
Resume file: N/A
Next action: Run /gsd:complete-milestone to archive and summarize
Next action: /gsd:discuss-milestone or /gsd:new-milestone to plan next work

@@ -0,0 +1,150 @@
# Milestone v1.0: Data Pipeline
**Status:** SHIPPED 2026-01-10
**Phases:** 1-7 (+ 2.1 inserted)
**Total Plans:** 15
## Overview
Transformed the monolithic data scraping scripts into a maintainable, sport-organized pipeline that ensures every game correctly links to its teams and stadium. After restructuring the scripts, we completed the stadium database, added alias systems for name variations, established correct canonical linking, implemented full CloudKit CRUD operations, and finished with comprehensive validation reports.
## Phases
### Phase 1: Script Architecture
**Goal**: Reorganize monolithic scraping scripts into sport-specific modules (MLB, NBA, NHL, NFL) for easier debugging and maintenance
**Depends on**: Nothing (first phase)
**Plans**: 3 plans
Plans:
- [x] 01-01: Create core.py shared module + mlb.py sport module
- [x] 01-02: Create nba.py + nhl.py sport modules
- [x] 01-03: Create nfl.py + refactor scrape_schedules.py orchestrator
**Details:**
- Created `core.py` with shared Game/Stadium dataclasses and scraper utilities
- Each sport module exports: {SPORT}_TEAMS, scrape_{sport}_games, {SPORT}_GAME_SOURCES
- Reduced orchestrator from 3359 to 733 lines (78% reduction)
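The module layout described above can be sketched roughly as follows. This is an illustration only: the dataclass fields (apart from `year_opened`), the shape of `MLB_TEAMS` and `MLB_GAME_SOURCES`, and the sample game are assumptions, not the actual contents of `core.py` or `mlb.py`.

```python
from dataclasses import dataclass
from typing import Optional


# --- core.py (sketch): shared dataclasses; field names are assumed ---
@dataclass
class Stadium:
    name: str
    city: str
    latitude: float
    longitude: float
    year_opened: Optional[int] = None


@dataclass
class Game:
    sport: str
    date: str
    home_team: str
    away_team: str
    stadium: str


# --- mlb.py (sketch): follows the {SPORT}_TEAMS / scrape_{sport}_games pattern ---
# Hypothetical shape: abbreviation -> team info.
MLB_TEAMS = {
    "BOS": {"name": "Boston Red Sox", "stadium": "Fenway Park"},
}


def get_mlb_team_abbrev(team_name: str) -> Optional[str]:
    """Return the abbreviation for a full team name, or None if unknown."""
    for abbrev, info in MLB_TEAMS.items():
        if info["name"].lower() == team_name.lower():
            return abbrev
    return None


def scrape_mlb_games(season: int) -> list:
    """Placeholder scraper: returns one made-up game instead of fetching data."""
    return [Game("MLB", f"{season}-04-01", "BOS", "NYY", "Fenway Park")]


# Hypothetical shape: ordered (source name, scraper function) pairs.
MLB_GAME_SOURCES = [("sample", scrape_mlb_games)]
```

Keeping the same three export names in every sport module would let the orchestrator loop over sports without sport-specific branching, which is one plausible reason the orchestrator could shrink as much as it did.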
### Phase 2: Stadium Foundation
**Goal**: Complete stadium database with correct coordinates, names, and venue data for all 4 sports
**Depends on**: Phase 1
**Plans**: 2 plans
Plans:
- [x] 02-01: Audit & complete hardcoded stadium data in sport modules
- [x] 02-02: Regenerate canonical data and verify pipeline
**Details:**
- Audited and completed hardcoded stadium data for MLB, NBA, NHL, NFL
- Used original opening years (not renovation years) for year_opened field
- Regenerated canonical JSON files with complete stadium coverage
### Phase 2.1: Additional Sports Stadiums (INSERTED)
**Goal**: Add hardcoded stadium data for secondary sports: MLS, WNBA, NWSL
**Depends on**: Phase 2
**Plans**: 3 plans
Plans:
- [x] 02.1-01: Create MLS module with 30 hardcoded stadiums
- [x] 02.1-02: Create WNBA module with 13 hardcoded arenas
- [x] 02.1-03: Create NWSL module with 13 hardcoded stadiums
**Details:**
- Created mls.py with 30 MLS stadiums including shared NFL venues
- Created wnba.py with 13 arenas cross-referenced from NBA/NHL
- Created nwsl.py with 13 stadiums cross-referenced from MLS
- CBB deferred (350+ D1 teams require a separately scoped phase)
### Phase 3: Alias Systems
**Goal**: Implement alias systems for both stadiums and teams to handle name variations across data sources
**Depends on**: Phase 2.1
**Plans**: 2 plans
Plans:
- [x] 03-01: Add NFL to canonicalization pipeline with aliases
- [x] 03-02: Add MLS, WNBA, NWSL to canonicalization pipeline with aliases
**Details:**
- Added TEAM_ABBREV_ALIASES for cross-source team name variations
- Stadium aliases handle historical names and source-specific variations
- All 7 sports now have alias support
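A minimal sketch of how alias resolution along these lines might work. Only the name `TEAM_ABBREV_ALIASES` comes from the pipeline; its per-sport nesting, the example entries, `STADIUM_ALIASES`, and both helper functions are assumptions made for illustration.

```python
# Hypothetical structure: sport -> {source-specific abbreviation: canonical abbreviation}.
TEAM_ABBREV_ALIASES = {
    "NBA": {"PHO": "PHX", "BKN": "BRK"},
}

# Hypothetical structure: normalized historical or variant name -> canonical name.
STADIUM_ALIASES = {
    "staples center": "Crypto.com Arena",
}


def resolve_team_abbrev(sport: str, abbrev: str) -> str:
    """Map a source-specific abbreviation to its canonical form.

    Unknown abbreviations pass through unchanged (upper-cased).
    """
    key = abbrev.upper()
    return TEAM_ABBREV_ALIASES.get(sport, {}).get(key, key)


def resolve_stadium_name(name: str) -> str:
    """Map a historical or variant stadium name to the canonical name.

    Lookup is case-insensitive and collapses repeated whitespace.
    """
    key = " ".join(name.lower().split())
    return STADIUM_ALIASES.get(key, name)
```

Passing unknown values through unchanged keeps the resolver safe to call on every record; unresolved names can then surface later in validation instead of failing the scrape.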
### Phase 4: Canonical Linking
**Goal**: Ensure every game correctly links to its home/away teams and stadium via canonical IDs
**Depends on**: Phase 3
**Plans**: 1 plan
Plans:
- [x] 04-01: Generate canonical games with resolved team/stadium links
**Details:**
- All games correctly link to teams and stadiums via canonical IDs
- Team abbreviation aliases discovered during canonicalization added iteratively
### Phase 5: CloudKit CRUD
**Goal**: Implement full create, read, update, delete operations for CloudKit management
**Depends on**: Phase 4
**Plans**: 2 plans
Plans:
- [x] 05-01: Smart sync with change detection (diff reporting, differential upload)
- [x] 05-02: Verification and record management (sync verification, individual CRUD)
**Details:**
- New records use forceReplace; updates use recordChangeTag for conflict detection
- Orphan deletion requires explicit --delete-orphans flag (safe by default)
- Triple lookup fallback: direct recordName -> deterministic UUID -> canonicalId query
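The triple lookup fallback can be sketched against an in-memory stand-in for CloudKit. This is an assumption-heavy illustration: the namespace used for the deterministic UUID, the `canonicalId` field access, and both function names are hypothetical, and a real implementation would issue CloudKit requests rather than scan a dict.

```python
import uuid
from typing import Optional

# Assumed namespace; the pipeline's actual UUID derivation is not documented here.
RECORD_NAMESPACE = uuid.NAMESPACE_URL


def deterministic_record_name(canonical_id: str) -> str:
    """Derive a stable record name from a canonical ID (same input, same UUID)."""
    return str(uuid.uuid5(RECORD_NAMESPACE, canonical_id))


def find_record(records: dict, identifier: str) -> Optional[dict]:
    """Look up a record by trying three strategies in order.

    `records` maps recordName -> record fields and stands in for the remote store.
    """
    # 1. Direct lookup: the identifier is already a recordName.
    if identifier in records:
        return records[identifier]

    # 2. Deterministic UUID: the identifier is a canonical ID whose
    #    recordName was derived from it.
    derived = deterministic_record_name(identifier)
    if derived in records:
        return records[derived]

    # 3. Query by canonicalId field (a linear scan stands in for a query).
    for record in records.values():
        if record.get("canonicalId") == identifier:
            return record

    return None
```

Ordering the strategies from cheapest to most expensive means the query step only runs for records whose names were not derived deterministically.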
### Phase 6: Validation Reports
**Goal**: Generate validation reports showing record counts, data gaps, orphan records, and relationship integrity
**Depends on**: Phase 5
**Plans**: 1 plan
Plans:
- [x] 06-01: Comprehensive validation with orphan listing and completeness metrics
**Details:**
- Health score formula: avg_completeness - orphan_penalty (max -30) - unknown_penalty (max -10)
- --list-orphans requires CloudKit connection; --validate works offline
- Completeness metrics per sport with expected game counts
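The health score formula above could be computed along these lines. Only the overall shape (average completeness minus two capped penalties) comes from the formula; how each penalty scales with the number of orphans or unknowns, the per-item weights, and the floor at zero are assumptions.

```python
def health_score(
    completeness_by_sport: dict,
    orphan_count: int,
    unknown_count: int,
    orphan_weight: float = 1.0,   # assumed: points deducted per orphan record
    unknown_weight: float = 1.0,  # assumed: points deducted per unknown reference
) -> float:
    """Return avg_completeness - orphan_penalty - unknown_penalty.

    completeness_by_sport maps sport -> completeness percentage (0-100).
    The orphan penalty is capped at 30 and the unknown penalty at 10.
    """
    values = list(completeness_by_sport.values())
    avg_completeness = sum(values) / len(values) if values else 0.0

    orphan_penalty = min(30.0, orphan_count * orphan_weight)
    unknown_penalty = min(10.0, unknown_count * unknown_weight)

    # Assumed floor: never report a negative score.
    return max(0.0, avg_completeness - orphan_penalty - unknown_penalty)
```

With the caps in place, even a badly orphaned dataset loses at most 40 points to penalties, so the completeness average remains the dominant term.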
### Phase 7: Testing & Documentation
**Goal**: Complete pipeline documentation and finalize project status
**Depends on**: Phase 6
**Plans**: 1 plan
Plans:
- [x] 07-01: Create Scripts/README.md and update PROJECT.md with completion status
**Details:**
- Created comprehensive Scripts/README.md with usage examples
- Updated PROJECT.md with completion status and validated requirements
---
## Milestone Summary
**Decimal Phases:**
- Phase 2.1: Additional Sports Stadiums (inserted after Phase 2 for MLS/WNBA/NWSL coverage)
**Key Decisions:**
- Split by sport, not function (user preference for organization)
- Validation reports over automated tests (faster feedback, easier debugging)
- Full CRUD over upload-only (enable data corrections without full rebuild)
- Each sport module independent with own team abbrev functions
- Non-core sports remain inline with TODO markers for future extraction
**Issues Resolved:**
- Game linking failures (games now correctly link to teams/stadiums)
- Missing stadium data (148 stadiums complete with coordinates)
- Name variation mismatches (alias systems handle cross-source differences)
**Issues Deferred:**
- CBB support (350+ D1 teams require a separately scoped phase)
**Technical Debt Incurred:**
- None significant
---
_For current project status, see .planning/ROADMAP.md_