Fix data quality issues across MLB, NBA, NHL, NFL with correct game→team→stadium canonical linking. Creates PROJECT.md with requirements and constraints. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
SportsTime Data Pipeline
What This Is
A Python data pipeline that scrapes, canonicalizes, and syncs sports schedule data to CloudKit for the SportsTime iOS app. The pipeline ensures every game correctly links to its home/away teams and stadium with complete, accurate data across MLB, NBA, NHL, and NFL.
Core Value
Every game must correctly link to its teams and stadium — a game at the wrong venue or with broken team links ruins trip planning.
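As a minimal sketch of what "correctly link" could mean in code, the example below follows a game to its home team and then to a stadium, and fails loudly on a broken link. The record classes, field names, and ID strings are illustrative assumptions, not the pipeline's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, Optional


# Hypothetical, simplified records; field names are assumptions, not the real schema.
@dataclass(frozen=True)
class Stadium:
    id: str
    name: str


@dataclass(frozen=True)
class Team:
    id: str
    name: str
    stadium_id: str


@dataclass(frozen=True)
class Game:
    id: str
    home_team_id: str
    away_team_id: str
    stadium_id: Optional[str] = None  # None means "fall back to the home team's stadium"


def resolve_stadium(game: Game, teams: Dict[str, Team], stadiums: Dict[str, Stadium]) -> Stadium:
    """Follow game -> home team -> stadium; raise KeyError on any broken link."""
    home = teams[game.home_team_id]
    teams[game.away_team_id]  # lookup only, so a broken away-team link also raises
    stadium_id = game.stadium_id or home.stadium_id
    return stadiums[stadium_id]


stadiums = {
    "stadium_fenway": Stadium("stadium_fenway", "Fenway Park"),
    "stadium_yankee": Stadium("stadium_yankee", "Yankee Stadium"),
}
teams = {
    "team_bos": Team("team_bos", "Boston Red Sox", "stadium_fenway"),
    "team_nyy": Team("team_nyy", "New York Yankees", "stadium_yankee"),
}
game = Game("game_1", home_team_id="team_bos", away_team_id="team_nyy")
print(resolve_stadium(game, teams, stadiums).name)
```

An explicit `stadium_id` on the game takes precedence over the home team's stadium here, which is one possible way to model neutral-site games; whether the real pipeline does this is not stated above.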
Requirements
Validated
- ✓ Basic schedule scraping for MLB, NBA, NHL, NFL — existing
- ✓ Canonical data models (stadiums, teams, games) — existing
- ✓ CloudKit import capability — existing
- ✓ Bundled JSON generation for offline-first — existing
Active
- Split scripts by sport (MLB, NBA, NHL, NFL as separate modules)
- Complete stadium database with correct coordinates and names
- Stadium alias system for name variations across sources
- Correct game→team→stadium canonical linking for all sports
- Full CRUD CloudKit management (create, read, update, delete)
- Validation reports showing counts, gaps, and orphan records
- Team alias system for name variations across sources
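The stadium and team alias items above could share one small lookup structure. The sketch below is an assumption about how such an alias system might work: names are normalized, then mapped to a canonical ID, and conflicting aliases are rejected. `AliasIndex`, `normalize`, and the example names and IDs are all hypothetical.

```python
import re
from typing import Dict, Optional


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return " ".join(cleaned.split())


class AliasIndex:
    """Maps name variants to one canonical ID (hypothetical helper)."""

    def __init__(self) -> None:
        self._by_alias: Dict[str, str] = {}

    def add(self, canonical_id: str, *names: str) -> None:
        for name in names:
            key = normalize(name)
            existing = self._by_alias.setdefault(key, canonical_id)
            if existing != canonical_id:
                # The same spelling pointing at two records is a data error.
                raise ValueError(f"alias {name!r} maps to both {existing} and {canonical_id}")

    def resolve(self, name: str) -> Optional[str]:
        """Return the canonical ID for a name variant, or None if unknown."""
        return self._by_alias.get(normalize(name))


# Made-up venue names, used only to show the mechanics.
stadium_aliases = AliasIndex()
stadium_aliases.add("stadium_example_field", "St. Example Field", "Example Field (Old Sponsor)")

print(stadium_aliases.resolve("ST EXAMPLE FIELD"))
print(stadium_aliases.resolve("Unknown Park"))
```

Returning `None` for an unknown name, rather than guessing a closest match, keeps unresolved venues visible so they can be surfaced in a validation report.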
Out of Scope
- Real-time scores — this is schedule data, not live game tracking
- Adding new sports (MLS, WNBA, etc.) — stabilize current 4 first
- iOS app changes — this is purely backend/script work
Context
Current State:
- Data quality issues exist across all sports (wrong stadiums, missing games, broken team links)
- Stadium problems include: missing venues, wrong coordinates, name mismatches between sources
- Single large script files that are hard to debug and maintain
- Existing CloudKit import works but lacks verification and CRUD operations
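The verification gap described above might be narrowed with a report along these lines. This is a sketch under assumed dict-based records and field names (`home_team_id`, `away_team_id`, `stadium_id`, `sport`), none of which are confirmed by this document; it counts records, lists games with broken links, and lists stadiums that no game references.

```python
import json
from collections import Counter


def validation_report(games, teams, stadiums):
    """Summarize counts and broken links.

    games: list of dicts; teams and stadiums: dicts keyed by canonical ID.
    All field names are assumptions for illustration.
    """
    report = {
        "counts": {"games": len(games), "teams": len(teams), "stadiums": len(stadiums)},
        "games_per_sport": dict(Counter(g["sport"] for g in games)),
        "orphan_games": [],
        "unused_stadiums": [],
    }
    used_stadiums = set()
    for g in games:
        broken = [f for f in ("home_team_id", "away_team_id") if g.get(f) not in teams]
        if g.get("stadium_id") not in stadiums:
            broken.append("stadium_id")
        else:
            used_stadiums.add(g["stadium_id"])
        if broken:
            report["orphan_games"].append({"game_id": g["id"], "broken": broken})
    report["unused_stadiums"] = sorted(set(stadiums) - used_stadiums)
    return report


# Placeholder data: g2 has an unknown away team and no stadium.
teams = {"team_a": {}, "team_b": {}}
stadiums = {"stadium_a": {}, "stadium_b": {}}
games = [
    {"id": "g1", "sport": "MLB", "home_team_id": "team_a",
     "away_team_id": "team_b", "stadium_id": "stadium_a"},
    {"id": "g2", "sport": "MLB", "home_team_id": "team_a",
     "away_team_id": "team_x", "stadium_id": None},
]
report = validation_report(games, teams, stadiums)
print(json.dumps(report, indent=2))
```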
Existing Infrastructure:
- Python 3 with requests, beautifulsoup4, pandas, lxml
- CloudKit server-to-server auth via cryptography package
- Bundled JSON in SportsTime/Resources/ for offline bootstrap
- Data sources: Basketball-Reference, Baseball-Reference, Hockey-Reference, official APIs
iOS App Dependency:
- AppDataProvider.shared is the single source of truth
- SwiftData models: CanonicalStadium, CanonicalTeam, CanonicalGame
- Domain models expect correct relationships via canonical IDs
Constraints
- Tech Stack: Must remain Python (existing tooling, team familiarity)
- Data Sources: Free/public APIs and sites only (no paid subscriptions)
- CloudKit: Must use existing container (iCloud.com.sportstime.app)
- Compatibility: Output must match existing Swift model expectations
Key Decisions
| Decision | Rationale | Outcome |
|---|---|---|
| Split by sport, not function | User preference for organization | — Pending |
| Validation reports over automated tests | Faster feedback, easier debugging | — Pending |
| Full CRUD over upload-only | Enable data corrections without full rebuild | — Pending |
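To illustrate why full CRUD allows corrections without a full rebuild, the sketch below diffs local canonical records against what is already stored remotely and returns only the IDs that need a create, update, or delete. `plan_sync` and the record shapes are hypothetical, and no CloudKit calls are made; how the remote records are fetched and written is outside this example.

```python
def plan_sync(local, remote):
    """Compare two dicts of record ID -> fields and return the minimal change set."""
    local_ids, remote_ids = set(local), set(remote)
    return {
        "create": sorted(local_ids - remote_ids),
        "update": sorted(i for i in local_ids & remote_ids if local[i] != remote[i]),
        "delete": sorted(remote_ids - local_ids),
    }


# Placeholder records: one unchanged, one corrected, one new, one stale.
local = {
    "stadium_a": {"name": "Example Arena", "lat": 10.0},
    "stadium_b": {"name": "Example Field", "lat": 20.5},
    "stadium_c": {"name": "Example Park", "lat": 30.0},
}
remote = {
    "stadium_a": {"name": "Example Arena", "lat": 10.0},
    "stadium_b": {"name": "Example Field", "lat": 21.0},
    "stadium_old": {"name": "Closed Venue", "lat": 0.0},
}
plan = plan_sync(local, remote)
print(plan)
```

With upload-only, fixing the coordinate on `stadium_b` or removing `stadium_old` would require re-importing everything; with a plan like this, only three records are touched.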
Last updated: 2026-01-09 after initialization