# SportsTime Data Pipeline

Python scripts that scrape, canonicalize, and sync sports schedule data to CloudKit for the SportsTime iOS app.

## Overview

The pipeline links every game to its home team, away team, and stadium, and checks that the linked data is complete and accurate. It covers MLB, NBA, NHL, NFL, MLS, WNBA, and NWSL.
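
To make the linking requirement concrete, here is a minimal sketch of a referential-integrity check. The field names (`id`, `home_team_id`, `away_team_id`, `stadium_id`) and the function `find_broken_links` are assumptions for illustration and are not taken from the pipeline's actual schema or code.

```python
from typing import Dict, Iterable, List, Set


def find_broken_links(
    games: Iterable[Dict[str, str]],
    team_ids: Set[str],
    stadium_ids: Set[str],
) -> List[str]:
    """Return one message per game reference that points at an unknown ID.

    Field names here are hypothetical; adjust them to the real game schema.
    """
    checks = (
        ("home_team_id", team_ids),
        ("away_team_id", team_ids),
        ("stadium_id", stadium_ids),
    )
    problems: List[str] = []
    for game in games:
        game_id = game.get("id", "<missing id>")
        for field, valid_ids in checks:
            value = game.get(field)
            if value not in valid_ids:
                problems.append(f"{game_id}: unknown {field} {value!r}")
    return problems
```

An empty result means every game in the input resolves to a known home team, away team, and stadium.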

## Quick Start

```bash
# Install dependencies
pip install -r requirements.txt

# Scrape all sports for the 2026 season
python scrape_schedules.py --sport all --season 2026

# Run full pipeline (scrape + canonicalize)
python run_pipeline.py --sport all

# Validate data integrity
python cloudkit_import.py --validate

# Sync to CloudKit
python cloudkit_import.py --upload
```

## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                        SPORT MODULES                                │
│  mlb.py  nba.py  nhl.py  nfl.py  mls.py  wnba.py  nwsl.py         │
└────────────────────────────┬────────────────────────────────────────┘
                             │ scrape
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                        RAW DATA                                     │
│  data/games.csv    data/stadiums.csv    data/games.json            │
└────────────────────────────┬────────────────────────────────────────┘
                             │ canonicalize
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     CANONICAL JSON                                  │
│  data/stadiums_canonical.json    data/teams_canonical.json         │
│  data/games/*.json (per-sport/season)                              │
└────────────────────────────┬────────────────────────────────────────┘
                             │ sync
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│               CloudKit (iCloud.com.sportstime.app)                 │
│               Bundled JSON (SportsTime/Resources/)                  │
└─────────────────────────────────────────────────────────────────────┘
```

## Module Reference

| Script | Purpose |
|--------|---------|
| `core.py` | Shared utilities: data classes, rate limiting, fallback system |
| `scrape_schedules.py` | Main orchestrator for scraping schedules from multiple sources |
| `run_pipeline.py` | Full pipeline runner (scrape + canonicalize in one command) |
| `canonicalize_stadiums.py` | Stadium name resolution with alias support |
| `canonicalize_teams.py` | Team name resolution with alias support |
| `canonicalize_games.py` | Game linking (game → team → stadium relationships) |
| `cloudkit_import.py` | CloudKit sync with full CRUD, validation, and diff reporting |
| `validate_canonical.py` | Data validation with completeness metrics |
| `generate_canonical_data.py` | Generate bundled JSON for iOS app bootstrap |
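
As a rough illustration of what name resolution with alias support can look like, the sketch below maps scraped name variants to a canonical ID. The alias table, the ID format, and the functions `normalize` and `resolve_stadium` are hypothetical; they are not the contents of `canonicalize_stadiums.py`.

```python
from typing import Optional

# Hypothetical alias table: normalized name variant -> canonical stadium ID.
# These names and IDs are illustrative only, not the pipeline's real data.
STADIUM_ALIASES = {
    "fenway park": "stadium_mlb_fenway_park",
    "fenway": "stadium_mlb_fenway_park",
    "madison square garden": "stadium_nba_madison_square_garden",
    "msg": "stadium_nba_madison_square_garden",
}


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so lookups tolerate formatting noise."""
    return " ".join(name.lower().split())


def resolve_stadium(name: str) -> Optional[str]:
    """Return the canonical ID for a scraped stadium name, or None if unknown."""
    return STADIUM_ALIASES.get(normalize(name))
```

Returning `None` for unknown names, rather than guessing, keeps unresolved stadiums visible so they can be added to the alias table.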

## Sport Modules

Each sport has its own module with hardcoded stadium data and sport-specific scraping logic:

| Module | Sport | Stadiums | Notes |
|--------|-------|----------|-------|
| `mlb.py` | MLB | 30 ballparks | Baseball-Reference scraper |
| `nba.py` | NBA | 30 arenas | Basketball-Reference scraper |
| `nhl.py` | NHL | 32 arenas | Hockey-Reference scraper |
| `nfl.py` | NFL | 30 stadiums | Cross-calendar season (2025-26) |
| `mls.py` | MLS | 30 stadiums | Soccer-specific capacities |
| `wnba.py` | WNBA | 13 arenas | Shares some venues with NBA teams |
| `nwsl.py` | NWSL | 13 stadiums | Shares some MLS venues |
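
Season labels appear in two shapes in this README: a single year (`2026`) and a cross-calendar range (`2025-26`). The sketch below shows one way to turn either shape into a pair of years. `parse_season` is a hypothetical helper; how `scrape_schedules.py` actually interprets `--season` is not documented here.

```python
from typing import Tuple


def parse_season(label: str) -> Tuple[int, int]:
    """Parse '2026' or '2025-26' into (start_year, end_year).

    Hypothetical helper for illustration; not part of the pipeline.
    """
    start_text, sep, end_text = label.strip().partition("-")
    start_year = int(start_text)
    if not sep:
        # Single-year season, e.g. '2026'.
        return start_year, start_year

    end_year = int(end_text)
    if len(end_text) == 2:
        # Expand a two-digit end year using the start year's century,
        # rolling over for labels such as '1999-00'.
        end_year += (start_year // 100) * 100
        if end_year < start_year:
            end_year += 100

    if end_year - start_year != 1:
        raise ValueError(f"season must span consecutive years: {label!r}")
    return start_year, end_year
```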

## Data Files

### Output Directory: `data/`

| File | Contents |
|------|----------|
| `games.csv` | Raw scraped game data (all sports) |
| `games.json` | Raw scraped games as JSON |
| `stadiums.json` | Raw stadium data |
| `stadiums_canonical.json` | Canonical stadiums with resolved aliases |
| `teams_canonical.json` | Canonical teams with resolved aliases |
| `stadium_aliases.json` | Stadium name → canonical ID mapping |
| `games/{sport}_{season}.json` | Per-sport canonical games |
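
The per-sport file pattern `games/{sport}_{season}.json` can be resolved and read as sketched below. The helper names are hypothetical. The lowercase sport code and the exact season string used in real file names are assumptions, and the JSON is returned as-is because its schema is not described here.

```python
import json
from pathlib import Path
from typing import Any


def canonical_games_path(data_dir: Path, sport: str, season: str) -> Path:
    """Build the path for one sport/season file under data/games/.

    Assumes lowercase sport codes in file names (an unverified assumption).
    """
    return data_dir / "games" / f"{sport.lower()}_{season}.json"


def load_canonical_games(data_dir: Path, sport: str, season: str) -> Any:
    """Read and parse the canonical games file for one sport and season."""
    path = canonical_games_path(data_dir, sport, season)
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `canonical_games_path(Path("data"), "nba", "2025-26")` yields `data/games/nba_2025-26.json` under these assumptions.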

### Alias Files

- `data/canonical/stadiums.json` - Master stadium database
- `data/canonical/teams.json` - Master team database

## Pipeline Commands

### Scraping

```bash
# Single sport
python scrape_schedules.py --sport nba --season 2025-26

# All sports
python scrape_schedules.py --sport all --season 2026

# With specific output directory
python scrape_schedules.py --sport mlb --season 2025 --output ./data
```

### Canonicalization

```bash
# Run canonicalization pipeline
python run_canonicalization_pipeline.py --sport all
```

### CloudKit Operations

```bash
# Validate data without uploading
python cloudkit_import.py --validate

# Show what would be uploaded (dry run)
python cloudkit_import.py --upload --dry-run

# Upload to CloudKit
python cloudkit_import.py --upload

# List orphan records (requires CloudKit connection)
python cloudkit_import.py --validate --list-orphans

# Delete orphan records
python cloudkit_import.py --delete-orphans
```
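
This README does not define "orphan". Assuming it means a record that exists in CloudKit but has no counterpart in the local canonical data, detection reduces to a set difference, as in the sketch below. `find_orphans` is a hypothetical helper and is not the logic used by `cloudkit_import.py`.

```python
from typing import Iterable, List


def find_orphans(remote_ids: Iterable[str], local_ids: Iterable[str]) -> List[str]:
    """Return record IDs present remotely but missing from local canonical data.

    Assumes 'orphan' means remote-only; the result is sorted for stable output.
    """
    return sorted(set(remote_ids) - set(local_ids))
```

Reviewing this list with `--validate --list-orphans` before running `--delete-orphans` is a sensible precaution, since deletion is destructive.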

## Related Documentation

- [DATA_SOURCES.md](DATA_SOURCES.md) - Data source URLs, rate limits, validation strategy
- [CLOUDKIT_SETUP.md](CLOUDKIT_SETUP.md) - CloudKit container setup, record types, security roles