Sportstime/.planning/PROJECT.md
Trey t f1adaf342e docs(07-01): update PROJECT.md with completion status
- Mark all Active requirements as complete (7 items)
- Update Key Decisions outcomes (split by sport, validation reports, full CRUD)
- Update Current State to reflect resolved data quality and complete pipeline
- Update last updated date to 2026-01-10

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 10:43:32 -06:00


SportsTime Data Pipeline

What This Is

A Python data pipeline that scrapes, canonicalizes, and syncs sports schedule data to CloudKit for the SportsTime iOS app. The pipeline ensures every game correctly links to its home/away teams and stadium with complete, accurate data across MLB, NBA, NHL, and NFL.

Core Value

Every game must correctly link to its teams and stadium — a game at the wrong venue or with broken team links ruins trip planning.

Requirements

Validated

  • ✓ Basic schedule scraping for MLB, NBA, NHL, NFL — existing
  • ✓ Canonical data models (stadiums, teams, games) — existing
  • ✓ CloudKit import capability — existing
  • ✓ Bundled JSON generation for offline-first — existing

Active

  • ✓ Split scripts by sport (MLB, NBA, NHL, NFL as separate modules) — 7 sport modules
  • ✓ Complete stadium database with correct coordinates and names — 148 stadiums
  • ✓ Stadium alias system for name variations across sources — alias JSON files
  • ✓ Correct game→team→stadium canonical linking for all sports — canonicalize_games.py
  • ✓ Full CRUD CloudKit management (create, read, update, delete) — cloudkit_import.py
  • ✓ Validation reports showing counts, gaps, and orphan records — --validate flag
  • ✓ Team alias system for name variations across sources — TEAM_ABBREV_ALIASES
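
The alias and linking requirements above can be illustrated with a minimal sketch. It is not the code in canonicalize_games.py: the alias entries, the ID formats (`team_<sport>_<abbrev>`, `stadium_...`), and the helper names `normalize`, `canonical_team_id`, `canonical_stadium_id`, and `link_game` are all assumptions made for illustration. Only the name TEAM_ABBREV_ALIASES comes from this document, and its real contents are not shown here.

```python
from typing import Dict, Optional

# Hypothetical alias data. The real pipeline keeps stadium aliases in JSON
# files and team aliases in TEAM_ABBREV_ALIASES; these entries are examples.
TEAM_ABBREV_ALIASES: Dict[str, Dict[str, str]] = {
    "NBA": {"BRK": "BKN", "PHO": "PHX"},
}
STADIUM_ALIASES: Dict[str, str] = {
    "crypto.com arena": "stadium_nba_crypto_com_arena",
    "staples center": "stadium_nba_crypto_com_arena",
}


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so lookups ignore case and spacing."""
    return " ".join(name.lower().split())


def canonical_team_id(sport: str, abbrev: str) -> str:
    """Map a source-specific abbreviation to one canonical team ID."""
    abbrev = abbrev.strip().upper()
    abbrev = TEAM_ABBREV_ALIASES.get(sport.upper(), {}).get(abbrev, abbrev)
    return f"team_{sport.lower()}_{abbrev.lower()}"


def canonical_stadium_id(raw_name: str) -> Optional[str]:
    """Return the canonical stadium ID, or None if the name is unknown."""
    return STADIUM_ALIASES.get(normalize(raw_name))


def link_game(sport: str, home: str, away: str, venue: str) -> Dict[str, str]:
    """Resolve a scraped game row to canonical team and stadium IDs."""
    stadium_id = canonical_stadium_id(venue)
    if stadium_id is None:
        # Fail loudly rather than emit a game with a broken stadium link.
        raise ValueError(f"unresolved stadium name: {venue!r}")
    return {
        "home_team_id": canonical_team_id(sport, home),
        "away_team_id": canonical_team_id(sport, away),
        "stadium_id": stadium_id,
    }
```

Raising on an unknown venue is one possible design choice here; a pipeline could instead collect unresolved names and surface them in a validation report.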

Out of Scope

  • Real-time scores — this is schedule data, not live game tracking
  • Adding new sports (MLS, WNBA, etc.) — stabilize current 4 first
  • iOS app changes — this is purely backend/script work

Context

Current State:

  • Data quality: Resolved — all games correctly link to teams and stadiums via canonical IDs
  • Stadium database: Complete — 148 stadiums across 7 sports with verified coordinates
  • Script organization: Resolved — sport-specific modules (mlb.py, nba.py, nhl.py, nfl.py, mls.py, wnba.py, nwsl.py)
  • CloudKit: Full CRUD — create, update, delete with diff reporting, verification, and orphan detection
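
As a rough illustration of the diff reporting mentioned above, the sketch below compares local canonical records against records already in CloudKit and classifies each record name. It makes no CloudKit calls and is not taken from cloudkit_import.py. The function name `diff_records`, the dictionary shapes, and the reading of "orphan" as a remote record with no local counterpart are assumptions.

```python
from typing import Dict, List


def diff_records(
    local: Dict[str, dict], remote: Dict[str, dict]
) -> Dict[str, List[str]]:
    """Classify record names by the sync action they would need.

    Both arguments map a record name to its field dictionary (hypothetical
    shape). Remote records missing locally are treated as orphans to delete.
    """
    local_names, remote_names = set(local), set(remote)
    return {
        "create": sorted(local_names - remote_names),
        "update": sorted(
            name
            for name in local_names & remote_names
            if local[name] != remote[name]
        ),
        "delete": sorted(remote_names - local_names),
    }


local = {"game_1": {"stadium": "s1"}, "game_2": {"stadium": "s2"}}
remote = {"game_2": {"stadium": "s9"}, "game_3": {"stadium": "s3"}}
print(diff_records(local, remote))
```

Computing the diff before issuing any writes makes it possible to print the planned changes first and to apply corrections without a full rebuild.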

Existing Infrastructure:

  • Python 3 with requests, beautifulsoup4, pandas, lxml
  • CloudKit server-to-server auth via cryptography package
  • Bundled JSON in SportsTime/Resources/ for offline bootstrap
  • Data sources: Basketball-Reference, Baseball-Reference, Hockey-Reference, official APIs

iOS App Dependency:

  • AppDataProvider.shared is single source of truth
  • SwiftData models: CanonicalStadium, CanonicalTeam, CanonicalGame
  • Domain models expect correct relationships via canonical IDs

Constraints

  • Tech Stack: Must remain Python (existing tooling, team familiarity)
  • Data Sources: Free/public APIs and sites only (no paid subscriptions)
  • CloudKit: Must use existing container (iCloud.com.sportstime.app)
  • Compatibility: Output must match existing Swift model expectations

Key Decisions

| Decision | Rationale | Outcome |
|---|---|---|
| Split by sport, not function | User preference for organization | ✓ Completed: 7 sport modules (mlb.py, nba.py, nhl.py, nfl.py, mls.py, wnba.py, nwsl.py) |
| Validation reports over automated tests | Faster feedback, easier debugging | ✓ Completed: --validate flag with health scores and completeness metrics |
| Full CRUD over upload-only | Enable data corrections without full rebuild | ✓ Completed: create/update/delete with diff reporting and orphan detection |
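
To make the validation-report decision concrete, here is a minimal sketch of a report covering counts, gaps, and orphan records. It does not reproduce the actual --validate output: the function name `validation_report`, the game field names, and the report keys are hypothetical, and the health scores mentioned in the Outcome column are not modeled.

```python
from typing import Dict, Iterable, List


def validation_report(
    games: List[Dict[str, str]],
    team_ids: Iterable[str],
    stadium_ids: Iterable[str],
) -> Dict[str, object]:
    """Summarize counts, broken links, and unused teams or stadiums."""
    teams, stadiums = set(team_ids), set(stadium_ids)

    # A game is broken if any of its three links points at an unknown ID.
    broken = [
        g["id"]
        for g in games
        if g["home_team_id"] not in teams
        or g["away_team_id"] not in teams
        or g["stadium_id"] not in stadiums
    ]
    used_teams = {g["home_team_id"] for g in games} | {
        g["away_team_id"] for g in games
    }
    used_stadiums = {g["stadium_id"] for g in games}

    return {
        "counts": {
            "games": len(games),
            "teams": len(teams),
            "stadiums": len(stadiums),
        },
        "broken_links": broken,
        "teams_without_games": sorted(teams - used_teams),
        "orphan_stadiums": sorted(stadiums - used_stadiums),
    }
```

A report of this kind lists the offending record IDs directly, which fits the stated rationale of faster feedback and easier debugging.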

Last updated: 2026-01-10 — Project complete (all 7 phases finished)