From d9f446bccb2bd66126eaf21a7aff31fce3706dcc Mon Sep 17 00:00:00 2001
From: Trey t <treytartt@fastmail.com>
Date: Sat, 10 Jan 2026 10:42:47 -0600
Subject: [PATCH] docs(07-01): create Scripts/README.md with pipeline
 documentation

- Overview and quick start commands
- ASCII architecture diagram showing data flow
- Module reference table for all Python scripts
- Sport modules table with stadium counts
- Data files and alias file documentation
- Pipeline commands for scraping, canonicalization, CloudKit

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---
 Scripts/README.md | 147 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 147 insertions(+)
 create mode 100644 Scripts/README.md

diff --git a/Scripts/README.md b/Scripts/README.md
new file mode 100644
index 0000000..dc40108
--- /dev/null
+++ b/Scripts/README.md
@@ -0,0 +1,147 @@
+# SportsTime Data Pipeline
+
+Python scripts that scrape, canonicalize, and sync sports schedule data to CloudKit for the SportsTime iOS app.
+
+## Overview
+
+The pipeline links every game to its home and away teams and to its stadium, then validates the result for completeness, across MLB, NBA, NHL, NFL, MLS, WNBA, and NWSL.
+
+## Quick Start
+
+```bash
+# Install dependencies
+pip install -r requirements.txt
+
+# Scrape all sports for current season
+python scrape_schedules.py --sport all --season 2026
+
+# Run full pipeline (scrape + canonicalize)
+python run_pipeline.py --sport all
+
+# Validate data integrity
+python cloudkit_import.py --validate
+
+# Sync to CloudKit
+python cloudkit_import.py --upload
+```
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│                        SPORT MODULES                                │
+│  mlb.py  nba.py  nhl.py  nfl.py  mls.py  wnba.py  nwsl.py           │
+└────────────────────────────┬────────────────────────────────────────┘
+                             │ scrape
+                             ▼
+┌─────────────────────────────────────────────────────────────────────┐
+│                        RAW DATA                                     │
+│  data/games.csv    data/stadiums.csv    data/games.json             │
+└────────────────────────────┬────────────────────────────────────────┘
+                             │ canonicalize
+                             ▼
+┌─────────────────────────────────────────────────────────────────────┐
+│                     CANONICAL JSON                                  │
+│  data/stadiums_canonical.json    data/teams_canonical.json          │
+│  data/games/*.json (per-sport/season)                               │
+└────────────────────────────┬────────────────────────────────────────┘
+                             │ sync
+                             ▼
+┌─────────────────────────────────────────────────────────────────────┐
+│               CloudKit (iCloud.com.sportstime.app)                  │
+│               Bundled JSON (SportsTime/Resources/)                  │
+└─────────────────────────────────────────────────────────────────────┘
+```
+
+## Module Reference
+
+| Script | Purpose |
+|--------|---------|
+| `core.py` | Shared utilities: data classes, rate limiting, fallback system |
+| `scrape_schedules.py` | Main orchestrator for scraping schedules from multiple sources |
+| `run_pipeline.py` | Full pipeline runner (scrape + canonicalize in one command) |
+| `canonicalize_stadiums.py` | Stadium name resolution with alias support |
+| `canonicalize_teams.py` | Team name resolution with alias support |
+| `canonicalize_games.py` | Game linking (game → team → stadium relationships) |
+| `cloudkit_import.py` | CloudKit sync with full CRUD, validation, and diff reporting |
+| `validate_canonical.py` | Data validation with completeness metrics |
+| `generate_canonical_data.py` | Generate bundled JSON for iOS app bootstrap |
+
+## Sport Modules
+
+Each sport has its own module with hardcoded stadium data and sport-specific scraping logic:
+
+| Module | Sport | Stadiums | Notes |
+|--------|-------|----------|-------|
+| `mlb.py` | MLB | 30 ballparks | Baseball-Reference scraper |
+| `nba.py` | NBA | 30 arenas | Basketball-Reference scraper |
+| `nhl.py` | NHL | 32 arenas | Hockey-Reference scraper |
+| `nfl.py` | NFL | 30 stadiums | Cross-calendar season (2025-26) |
+| `mls.py` | MLS | 30 stadiums | Soccer-specific capacities |
+| `wnba.py` | WNBA | 13 arenas | Shares some venues with NBA |
+| `nwsl.py` | NWSL | 13 stadiums | Shares some MLS venues |
+
+## Data Files
+
+### Output Directory: `data/`
+
+| File | Contents |
+|------|----------|
+| `games.csv` | Raw scraped game data (all sports) |
+| `games.json` | Raw scraped games as JSON |
+| `stadiums.json` | Raw stadium data |
+| `stadiums_canonical.json` | Canonical stadiums with resolved aliases |
+| `teams_canonical.json` | Canonical teams with resolved aliases |
+| `stadium_aliases.json` | Stadium name → canonical ID mapping |
+| `games/{sport}_{season}.json` | Per-sport canonical games |
+
+### Alias Files
+
+- `data/canonical/stadiums.json` - Master stadium database
+- `data/canonical/teams.json` - Master team database
+
+## Pipeline Commands
+
+### Scraping
+
+```bash
+# Single sport
+python scrape_schedules.py --sport nba --season 2025-26
+
+# All sports
+python scrape_schedules.py --sport all --season 2026
+
+# With specific output directory
+python scrape_schedules.py --sport mlb --season 2025 --output ./data
+```
+
+### Canonicalization
+
+```bash
+# Run canonicalization pipeline
+python run_canonicalization_pipeline.py --sport all
+```
+
+### CloudKit Operations
+
+```bash
+# Validate data without uploading
+python cloudkit_import.py --validate
+
+# Show what would be uploaded (dry run)
+python cloudkit_import.py --upload --dry-run
+
+# Upload to CloudKit
+python cloudkit_import.py --upload
+
+# List orphan records (requires CloudKit connection)
+python cloudkit_import.py --validate --list-orphans
+
+# Delete orphan records
+python cloudkit_import.py --delete-orphans
+```
+
+## Related Documentation
+
+- [DATA_SOURCES.md](DATA_SOURCES.md) - Data source URLs, rate limits, validation strategy
+- [CLOUDKIT_SETUP.md](CLOUDKIT_SETUP.md) - CloudKit container setup, record types, security roles