# SportsTime Data Pipeline
## What This Is
A Python data pipeline that scrapes, canonicalizes, and syncs sports schedule data to CloudKit for the SportsTime iOS app. The pipeline ensures every game correctly links to its home/away teams and stadium with complete, accurate data across MLB, NBA, NHL, and NFL.
## Core Value
Every game must correctly link to its teams and stadium — a game at the wrong venue or with broken team links ruins trip planning.
## Requirements
### Validated
- ✓ Basic schedule scraping for MLB, NBA, NHL, NFL — existing
- ✓ Canonical data models (stadiums, teams, games) — existing
- ✓ CloudKit import capability — existing
- ✓ Bundled JSON generation for offline-first — existing
- ✓ Split scripts by sport (MLB, NBA, NHL, NFL as separate modules) — v1.0
- ✓ Complete stadium database with correct coordinates and names (148 stadiums) — v1.0
- ✓ Stadium alias system for name variations across sources — v1.0
- ✓ Correct game→team→stadium canonical linking for all sports — v1.0
- ✓ Full CRUD CloudKit management (create, read, update, delete) — v1.0
- ✓ Validation reports showing counts, gaps, and orphan records — v1.0
- ✓ Team alias system for name variations across sources — v1.0
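The alias systems and canonical linking listed above could work roughly as follows. This is a minimal sketch, not the pipeline's actual code: the alias tables, the ID format, the field names, and the `resolve` and `link_game` helpers are all hypothetical and chosen only for illustration.

```python
from typing import Optional

# Hypothetical alias tables: normalized source name -> canonical ID.
# The ID format is an assumption, not the pipeline's real scheme.
STADIUM_ALIASES = {
    "fenway park": "stadium_mlb_fenway_park",
    "fenway": "stadium_mlb_fenway_park",
}
TEAM_ALIASES = {
    "boston red sox": "team_mlb_bos",
    "red sox": "team_mlb_bos",
    "new york yankees": "team_mlb_nyy",
    "ny yankees": "team_mlb_nyy",
}


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so source variants compare equal."""
    return " ".join(name.lower().split())


def resolve(name: str, aliases: dict) -> Optional[str]:
    """Return the canonical ID for a source name, or None if unknown."""
    return aliases.get(normalize(name))


def link_game(raw: dict) -> dict:
    """Replace scraped names with canonical IDs and list unresolved fields."""
    linked = {
        "home_team_id": resolve(raw["home_team"], TEAM_ALIASES),
        "away_team_id": resolve(raw["away_team"], TEAM_ALIASES),
        "stadium_id": resolve(raw["stadium"], STADIUM_ALIASES),
    }
    # The right-hand side is fully evaluated before the key is added.
    linked["unresolved"] = sorted(k for k, v in linked.items() if v is None)
    return linked
```

In this sketch an unknown name surfaces as `None` plus an entry in `unresolved`, so a game is never silently linked to the wrong team or venue.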
### Active
(None — v1.0 complete, planning next milestone)
### Out of Scope
- Real-time scores — this is schedule data, not live game tracking
- Adding new sports (MLS, WNBA, etc.) — stabilize current 4 first
- iOS app changes — this is purely backend/script work
## Context
**Current State:**
- Data quality: Resolved — all games correctly link to teams and stadiums via canonical IDs
- Stadium database: Complete — 148 stadiums across 7 sports with verified coordinates
- Script organization: Resolved — sport-specific modules (mlb.py, nba.py, nhl.py, nfl.py, mls.py, wnba.py, nwsl.py)
- CloudKit: Full CRUD (create, update, delete) with diff reporting, verification, and orphan detection

**Existing Infrastructure:**
- Python 3 with requests, beautifulsoup4, pandas, lxml
- CloudKit server-to-server auth via cryptography package
- Bundled JSON in `SportsTime/Resources/` for offline bootstrap
- Data sources: Basketball-Reference, Baseball-Reference, Hockey-Reference, official APIs

**iOS App Dependency:**
- `AppDataProvider.shared` is single source of truth
- SwiftData models: `CanonicalStadium`, `CanonicalTeam`, `CanonicalGame`
- Domain models expect correct relationships via canonical IDs
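Because the app relies on those canonical IDs, a check for dangling references is useful before any data ships. The sketch below is an assumption about how such a check might look. The field names (`id`, `home_team_id`, `away_team_id`, `stadium_id`) and the record shapes are hypothetical and are not taken from the Swift models.

```python
def find_broken_links(games, team_ids, stadium_ids):
    """Return {game_id: [fields whose ID is missing or unknown]}."""
    checks = (
        ("home_team_id", team_ids),
        ("away_team_id", team_ids),
        ("stadium_id", stadium_ids),
    )
    broken = {}
    for game in games:
        bad = [field for field, known in checks if game.get(field) not in known]
        if bad:
            broken[game["id"]] = bad
    return broken


def find_orphans(games, team_ids, stadium_ids):
    """Return the teams and stadiums that no game references."""
    used_teams = {g.get("home_team_id") for g in games}
    used_teams |= {g.get("away_team_id") for g in games}
    used_stadiums = {g.get("stadium_id") for g in games}
    return {
        "teams": sorted(set(team_ids) - used_teams),
        "stadiums": sorted(set(stadium_ids) - used_stadiums),
    }
```

The first function catches games that point at nothing, and the second catches canonical records that nothing points at. Both kinds of finding could feed a validation report.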
## Constraints
- **Tech Stack**: Must remain Python (existing tooling, team familiarity)
- **Data Sources**: Free/public APIs and sites only (no paid subscriptions)
- **CloudKit**: Must use existing container (`iCloud.com.sportstime.app`)
- **Compatibility**: Output must match existing Swift model expectations
## Key Decisions
| Decision | Rationale | Outcome |
|----------|-----------|---------|
| Split by sport, not function | User preference for organization | ✓ Completed: 7 sport modules (mlb.py, nba.py, nhl.py, nfl.py, mls.py, wnba.py, nwsl.py) |
| Validation reports over automated tests | Faster feedback, easier debugging | ✓ Completed: `--validate` flag with health scores and completeness metrics |
| Full CRUD over upload-only | Enable data corrections without full rebuild | ✓ Completed: create/update/delete with diff reporting and orphan detection |
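The diff reporting behind the Full CRUD decision can be pictured as a comparison between local canonical records and what CloudKit currently holds. This is a minimal sketch under assumed record shapes (dictionaries keyed by record ID). `plan_sync` is a hypothetical name, the sample records are made up, and no real CloudKit calls are made.

```python
def plan_sync(local, remote):
    """Compare two {record_id: fields} mappings and plan the changes."""
    return {
        # In the local data but not yet in CloudKit.
        "create": sorted(local.keys() - remote.keys()),
        # Present on both sides but with different field values.
        "update": sorted(
            rid for rid in local.keys() & remote.keys() if local[rid] != remote[rid]
        ),
        # In CloudKit but no longer in the local data.
        "delete": sorted(remote.keys() - local.keys()),
    }


# Illustrative data only: IDs and names are placeholders.
local = {"s1": {"name": "Fenway Park"}, "s2": {"name": "Wrigley Field"}}
remote = {"s2": {"name": "Wrigley"}, "s3": {"name": "Old Stadium"}}
print(plan_sync(local, remote))
# {'create': ['s1'], 'update': ['s2'], 'delete': ['s3']}
```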
---
*Last updated: 2026-01-10 after v1.0 Data Pipeline milestone*