Sportstime/Scripts
Trey t 63fb06c41a fix: update pipeline imports to use sport modules
After Phase 1 refactoring moved scraper functions to sport-specific
modules (nba.py, mlb.py, etc.), these pipeline scripts still imported
from scrape_schedules.py.

- run_pipeline.py: import from core.py and sport modules
- validate_data.py: import from core.py and sport modules
- run_canonicalization_pipeline.py: import from core.py and sport modules

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 10:52:13 -06:00

SportsTime Data Pipeline

Python scripts that scrape, canonicalize, and sync sports schedule data to CloudKit for the SportsTime iOS app.

Overview

The pipeline links every game to its home team, away team, and stadium, and checks that the resulting data is complete and accurate across MLB, NBA, NHL, NFL, MLS, WNBA, and NWSL.
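
The linking guarantee amounts to a referential-integrity check: every team and stadium ID on a game must resolve to a known record. A minimal sketch of such a check follows, assuming a hypothetical `Game` record and ID sets; the real data classes live in core.py and may be shaped differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Game:
    # Hypothetical record shape for illustration only.
    game_id: str
    home_team_id: str
    away_team_id: str
    stadium_id: str


def find_broken_links(games, team_ids, stadium_ids):
    """Return (game_id, field, missing_id) for every reference that does not resolve."""
    problems = []
    for game in games:
        for field in ("home_team_id", "away_team_id"):
            value = getattr(game, field)
            if value not in team_ids:
                problems.append((game.game_id, field, value))
        if game.stadium_id not in stadium_ids:
            problems.append((game.game_id, "stadium_id", game.stadium_id))
    return problems


# Illustrative data: g2 points at a team and a stadium that do not exist.
games = [
    Game("g1", "team_a", "team_b", "stadium_a"),
    Game("g2", "team_a", "team_x", "stadium_z"),
]
problems = find_broken_links(games, {"team_a", "team_b"}, {"stadium_a"})
for game_id, field, missing in problems:
    print(f"{game_id}: {field} -> {missing} not found")
```

An empty result means every game resolves cleanly; anything else names the game and the field that needs attention.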

Quick Start

# Install dependencies
pip install -r requirements.txt

# Scrape all sports for current season
python scrape_schedules.py --sport all --season 2026

# Run full pipeline (scrape + canonicalize)
python run_pipeline.py --sport all

# Validate data integrity
python cloudkit_import.py --validate

# Sync to CloudKit
python cloudkit_import.py --upload

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        SPORT MODULES                                │
│  mlb.py  nba.py  nhl.py  nfl.py  mls.py  wnba.py  nwsl.py         │
└────────────────────────────┬────────────────────────────────────────┘
                             │ scrape
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                        RAW DATA                                     │
│  data/games.csv    data/stadiums.csv    data/games.json            │
└────────────────────────────┬────────────────────────────────────────┘
                             │ canonicalize
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     CANONICAL JSON                                  │
│  data/stadiums_canonical.json    data/teams_canonical.json         │
│  data/games/*.json (per-sport/season)                              │
└────────────────────────────┬────────────────────────────────────────┘
                             │ sync
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│               CloudKit (iCloud.com.sportstime.app)                 │
│               Bundled JSON (SportsTime/Resources/)                  │
└─────────────────────────────────────────────────────────────────────┘

Module Reference

Script                      Purpose
core.py                     Shared utilities: data classes, rate limiting, fallback system
scrape_schedules.py         Main orchestrator for scraping schedules from multiple sources
run_pipeline.py             Full pipeline runner (scrape + canonicalize in one command)
canonicalize_stadiums.py    Stadium name resolution with alias support
canonicalize_teams.py       Team name resolution with alias support
canonicalize_games.py       Game linking (game → team → stadium relationships)
cloudkit_import.py          CloudKit sync with full CRUD, validation, and diff reporting
validate_canonical.py       Data validation with completeness metrics
generate_canonical_data.py  Generate bundled JSON for iOS app bootstrap
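
Name resolution with alias support can be pictured as a normalized lookup table from every known spelling to one canonical ID. The sketch below is an assumption about the general technique, not the logic in canonicalize_stadiums.py or canonicalize_teams.py; the function names, alias entries, and ID format are all hypothetical.

```python
def normalize(name):
    """Lowercase and collapse whitespace so lookups ignore trivial differences."""
    return " ".join(name.lower().split())


def build_alias_index(aliases):
    """Map each normalized alias to its canonical ID."""
    return {normalize(alias): canonical_id for alias, canonical_id in aliases.items()}


def resolve(name, index):
    """Return the canonical ID for a raw name, or None if the name is unknown."""
    return index.get(normalize(name))


# Illustrative aliases: two spellings that point at the same (made-up) stadium.
STADIUM_ALIASES = {
    "Example Park": "stadium_example_park",
    "Example Field": "stadium_example_park",
}

index = build_alias_index(STADIUM_ALIASES)
print(resolve("  example   FIELD ", index))
print(resolve("Unknown Arena", index))
```

Returning `None` for an unknown name, rather than guessing, lets a later validation step report unresolved names explicitly.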

Sport Modules

Each sport has its own module with hardcoded stadium data and sport-specific scraping logic:

Module   Sport  Stadiums      Notes
mlb.py   MLB    30 ballparks  Baseball-Reference scraper
nba.py   NBA    30 arenas     Basketball-Reference scraper
nhl.py   NHL    32 arenas     Hockey-Reference scraper
nfl.py   NFL    30 stadiums   Cross-calendar season (2025-26)
mls.py   MLS    30 stadiums   Soccer-specific capacities
wnba.py  WNBA   13 arenas     Shares venues with NBA
nwsl.py  NWSL   13 stadiums   Shares some MLS venues
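
One way an orchestrator can route `--sport nba` or `--sport all` to per-sport modules is a small registry keyed by sport. This is a sketch under assumptions: the `scrape_nba` and `scrape_mlb` functions are stand-ins defined inline, and the real modules may expose different names and signatures.

```python
from typing import Callable, Dict, List


# Stand-ins for functions that would live in nba.py and mlb.py (hypothetical names).
def scrape_nba(season: str) -> List[dict]:
    return [{"sport": "nba", "season": season}]


def scrape_mlb(season: str) -> List[dict]:
    return [{"sport": "mlb", "season": season}]


SCRAPERS: Dict[str, Callable[[str], List[dict]]] = {
    "nba": scrape_nba,
    "mlb": scrape_mlb,
}


def scrape(sport: str, season: str) -> List[dict]:
    """Run one sport's scraper, or every registered scraper when sport is 'all'."""
    if sport == "all":
        selected = sorted(SCRAPERS)
    elif sport in SCRAPERS:
        selected = [sport]
    else:
        raise ValueError(f"unknown sport: {sport!r}")
    games: List[dict] = []
    for key in selected:
        games.extend(SCRAPERS[key](season))
    return games


print(scrape("all", "2026"))
```

Adding a sport then means registering one more entry rather than editing the dispatch logic.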

Data Files

Output Directory: data/

File                         Contents
games.csv                    Raw scraped game data (all sports)
games.json                   Raw scraped games as JSON
stadiums.json                Raw stadium data
stadiums_canonical.json      Canonical stadiums with resolved aliases
teams_canonical.json         Canonical teams with resolved aliases
stadium_aliases.json         Stadium name → canonical ID mapping
games/{sport}_{season}.json  Per-sport canonical games
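
The `games/{sport}_{season}.json` pattern implies that canonical games are bucketed by sport and season before being written. A minimal sketch of that bucketing follows; the helper names and the `sport`/`season` dictionary keys are assumptions, as is the use of the raw season token (such as `2025-26`) in the file name.

```python
from collections import defaultdict
from pathlib import Path


def canonical_games_path(data_dir, sport, season):
    """Build the per-sport output path, e.g. data/games/nba_2025-26.json."""
    return Path(data_dir) / "games" / f"{sport}_{season}.json"


def group_games(games):
    """Bucket game dicts by their (sport, season) pair."""
    grouped = defaultdict(list)
    for game in games:
        grouped[(game["sport"], game["season"])].append(game)
    return dict(grouped)


# Illustrative records only.
games = [
    {"id": "g1", "sport": "nba", "season": "2025-26"},
    {"id": "g2", "sport": "mlb", "season": "2025"},
    {"id": "g3", "sport": "nba", "season": "2025-26"},
]
for (sport, season), bucket in group_games(games).items():
    print(canonical_games_path("data", sport, season).as_posix(), len(bucket))
```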

Alias Files

  • data/canonical/stadiums.json - Master stadium database
  • data/canonical/teams.json - Master team database

Pipeline Commands

Scraping

# Single sport
python scrape_schedules.py --sport nba --season 2025-26

# All sports
python scrape_schedules.py --sport all --season 2026

# With specific output directory
python scrape_schedules.py --sport mlb --season 2025 --output ./data
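
# The module reference credits core.py with rate limiting and a fallback system for scraping from multiple sources. The sketch below shows one common shape for that combination: wait between requests, try sources in order, and return the first non-empty result. Every name here is hypothetical, and the actual behavior of core.py may differ.

```python
import time


class RateLimiter:
    """Enforce a minimum number of seconds between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


def scrape_with_fallback(sources, limiter):
    """Try each (name, fetch) pair in order; return the first non-empty result."""
    errors = []
    for name, fetch in sources:
        limiter.wait()
        try:
            games = fetch()
        except Exception as exc:  # a failed source should not stop the run
            errors.append((name, str(exc)))
            continue
        if games:
            return name, games
        errors.append((name, "no games returned"))
    raise RuntimeError(f"all sources failed: {errors}")


# Illustrative sources: the primary fails, the backup succeeds.
def broken_source():
    raise ConnectionError("timeout")


def working_source():
    return [{"id": "g1"}]


source, games = scrape_with_fallback(
    [("primary", broken_source), ("backup", working_source)],
    RateLimiter(min_interval=0.0),
)
print(source, len(games))
```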

Canonicalization

# Run canonicalization pipeline
python run_canonicalization_pipeline.py --sport all

CloudKit Operations

# Validate data without uploading
python cloudkit_import.py --validate

# Show what would be uploaded (dry run)
python cloudkit_import.py --upload --dry-run

# Upload to CloudKit
python cloudkit_import.py --upload

# List orphan records (requires CloudKit connection)
python cloudkit_import.py --validate --list-orphans

# Delete orphan records
python cloudkit_import.py --delete-orphans
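
# The dry-run, diff, and orphan commands above all depend on comparing local canonical records with what CloudKit already holds. A minimal sketch of that comparison follows. It assumes an "orphan" is a remote record with no local counterpart and that records are keyed by record name; neither assumption is confirmed by cloudkit_import.py, and no CloudKit calls are made here.

```python
def diff_records(local, remote):
    """Compare two {record_name: fields} mappings and classify each record."""
    local_names = set(local)
    remote_names = set(remote)
    shared = local_names & remote_names
    return {
        "create": sorted(local_names - remote_names),
        "update": sorted(n for n in shared if local[n] != remote[n]),
        "unchanged": sorted(n for n in shared if local[n] == remote[n]),
        "orphans": sorted(remote_names - local_names),
    }


# Illustrative records only.
local = {
    "game_1": {"stadium": "stadium_a"},
    "game_2": {"stadium": "stadium_b"},
    "game_3": {"stadium": "stadium_c"},
}
remote = {
    "game_2": {"stadium": "stadium_b"},
    "game_3": {"stadium": "stadium_old"},
    "game_9": {"stadium": "stadium_a"},
}
report = diff_records(local, remote)
for action, names in report.items():
    print(f"{action}: {names}")
```

Computing the full report before touching the remote side is what makes a dry run cheap: the upload step can print the report and stop.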