docs(07-01): create Scripts/README.md with pipeline documentation
- Overview and quick start commands
- ASCII architecture diagram showing data flow
- Module reference table for all Python scripts
- Sport modules table with stadium counts
- Data files and alias file documentation
- Pipeline commands for scraping, canonicalization, CloudKit

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
147
Scripts/README.md
Normal file
@@ -0,0 +1,147 @@
# SportsTime Data Pipeline

Python scripts that scrape, canonicalize, and sync sports schedule data to CloudKit for the SportsTime iOS app.

## Overview

This pipeline ensures every game correctly links to its home/away teams and stadium with complete, accurate data across MLB, NBA, NHL, NFL, MLS, WNBA, and NWSL.

## Quick Start

```bash
# Install dependencies
pip install -r requirements.txt

# Scrape all sports for current season
python scrape_schedules.py --sport all --season 2026

# Run full pipeline (scrape + canonicalize)
python run_pipeline.py --sport all

# Validate data integrity
python cloudkit_import.py --validate

# Sync to CloudKit
python cloudkit_import.py --upload
```

## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                            SPORT MODULES                            │
│      mlb.py  nba.py  nhl.py  nfl.py  mls.py  wnba.py  nwsl.py       │
└────────────────────────────┬────────────────────────────────────────┘
                             │ scrape
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                              RAW DATA                               │
│        data/games.csv   data/stadiums.csv   data/games.json         │
└────────────────────────────┬────────────────────────────────────────┘
                             │ canonicalize
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                           CANONICAL JSON                            │
│      data/stadiums_canonical.json   data/teams_canonical.json       │
│                data/games/*.json (per-sport/season)                 │
└────────────────────────────┬────────────────────────────────────────┘
                             │ sync
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                CloudKit (iCloud.com.sportstime.app)                 │
│                Bundled JSON (SportsTime/Resources/)                 │
└─────────────────────────────────────────────────────────────────────┘
```

## Module Reference

| Script | Purpose |
|--------|---------|
| `core.py` | Shared utilities: data classes, rate limiting, fallback system |
| `scrape_schedules.py` | Main orchestrator for scraping schedules from multiple sources |
| `run_pipeline.py` | Full pipeline runner (scrape + canonicalize in one command) |
| `canonicalize_stadiums.py` | Stadium name resolution with alias support |
| `canonicalize_teams.py` | Team name resolution with alias support |
| `canonicalize_games.py` | Game linking (game → team → stadium relationships) |
| `cloudkit_import.py` | CloudKit sync with full CRUD, validation, and diff reporting |
| `validate_canonical.py` | Data validation with completeness metrics |
| `generate_canonical_data.py` | Generate bundled JSON for iOS app bootstrap |

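
To illustrate the game → team → stadium linking listed for `canonicalize_games.py`, here is a minimal sketch of a link check. The `Game` fields and the `find_broken_links` helper are hypothetical names chosen for this example; the actual data classes in `core.py` may look different.

```python
from dataclasses import dataclass


# Hypothetical record shape (assumption); the real data classes live in core.py.
@dataclass(frozen=True)
class Game:
    id: str
    home_team_id: str
    away_team_id: str
    stadium_id: str


def find_broken_links(games, team_ids, stadium_ids):
    """Return (game_id, field, missing_id) for every reference that does not resolve."""
    problems = []
    for game in games:
        for field in ("home_team_id", "away_team_id"):
            ref = getattr(game, field)
            if ref not in team_ids:
                problems.append((game.id, field, ref))
        if game.stadium_id not in stadium_ids:
            problems.append((game.id, "stadium_id", game.stadium_id))
    return problems


# Made-up IDs, for illustration only.
games = [
    Game("g1", "team_a", "team_b", "stadium_1"),
    Game("g2", "team_a", "team_x", "stadium_9"),
]
broken = find_broken_links(games, {"team_a", "team_b"}, {"stadium_1"})
for game_id, field, missing in broken:
    print(f"{game_id}: {field} -> {missing} not found")
```

An empty result would mean every game in the sample resolves to known teams and a known stadium.
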
## Sport Modules

Each sport has its own module with hardcoded stadium data and sport-specific scraping logic:

| Module | Sport | Stadiums | Notes |
|--------|-------|----------|-------|
| `mlb.py` | MLB | 30 ballparks | Baseball-Reference scraper |
| `nba.py` | NBA | 30 arenas | Basketball-Reference scraper |
| `nhl.py` | NHL | 32 arenas | Hockey-Reference scraper |
| `nfl.py` | NFL | 30 stadiums | Cross-calendar season (2025-26) |
| `mls.py` | MLS | 30 stadiums | Soccer-specific capacities |
| `wnba.py` | WNBA | 13 arenas | Shares venues with NBA |
| `nwsl.py` | NWSL | 13 stadiums | Shares some MLS venues |

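
Season labels in this README take two shapes: a single year (`2026`) and a cross-calendar label (`2025-26`). The sketch below shows one way such labels could be normalized to a start and end year. `parse_season` is a hypothetical helper; how `scrape_schedules.py` actually interprets `--season` is not documented here, so treat this as an assumption.

```python
import re


def parse_season(label: str) -> tuple[int, int]:
    """Return (start_year, end_year) for labels like '2026' or '2025-26'."""
    match = re.fullmatch(r"(\d{4})(?:-(\d{2}))?", label)
    if match is None:
        raise ValueError(f"unrecognized season label: {label!r}")
    start = int(match.group(1))
    if match.group(2) is None:
        # Single-year label: assumed to start and end in the same calendar year.
        return start, start
    end = (start // 100) * 100 + int(match.group(2))
    if end < start:
        # Century rollover, e.g. '1999-00'.
        end += 100
    if end != start + 1:
        raise ValueError(f"season must span consecutive years: {label!r}")
    return start, end


print(parse_season("2025-26"))
print(parse_season("2026"))
```

Rejecting labels such as `2025-27` early keeps a typo from silently selecting the wrong schedule.
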
## Data Files

### Output Directory: `data/`

| File | Contents |
|------|----------|
| `games.csv` | Raw scraped game data (all sports) |
| `games.json` | Raw scraped games as JSON |
| `stadiums.json` | Raw stadium data |
| `stadiums_canonical.json` | Canonical stadiums with resolved aliases |
| `teams_canonical.json` | Canonical teams with resolved aliases |
| `stadium_aliases.json` | Stadium name → canonical ID mapping |
| `games/{sport}_{season}.json` | Per-sport canonical games |

### Alias Files

- `data/canonical/stadiums.json` - Master stadium database
- `data/canonical/teams.json` - Master team database

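
As a rough illustration of a stadium name → canonical ID mapping, the sketch below resolves scraped names through an alias table. The JSON shape, the sample names, and the helper functions are all assumptions made for this example; the real schema of `stadium_aliases.json` is not shown in this README.

```python
import json
from typing import Optional

# Hypothetical alias data (assumption): alias name -> canonical stadium ID.
ALIASES_JSON = """
{
  "Example Park": "stadium_example",
  "Example Field at Sponsor Park": "stadium_example"
}
"""


def normalize(name: str) -> str:
    """Case-fold and collapse whitespace so minor spelling noise does not matter."""
    return " ".join(name.casefold().split())


def build_alias_index(raw: dict[str, str]) -> dict[str, str]:
    """Key the alias table by normalized name."""
    return {normalize(alias): canonical_id for alias, canonical_id in raw.items()}


def resolve_stadium(name: str, index: dict[str, str]) -> Optional[str]:
    """Return the canonical ID for a scraped name, or None if no alias matches."""
    return index.get(normalize(name))


index = build_alias_index(json.loads(ALIASES_JSON))
print(resolve_stadium("  example   PARK ", index))
print(resolve_stadium("Unknown Dome", index))
```

Returning `None` for unknown names, rather than guessing, leaves unresolved stadiums visible to a later validation step.
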
## Pipeline Commands

### Scraping

```bash
# Single sport
python scrape_schedules.py --sport nba --season 2025-26

# All sports
python scrape_schedules.py --sport all --season 2026

# With specific output directory
python scrape_schedules.py --sport mlb --season 2025 --output ./data
```

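The module table mentions rate limiting and a fallback system for scraping from multiple sources. The following is a minimal sketch of how those two ideas can fit together, not the implementation in `core.py`: `RateLimiter`, `scrape_with_fallback`, and the fake sources are hypothetical names used only for illustration.

```python
import time
from typing import Callable, Optional, Sequence


class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._last: Optional[float] = None

    def wait(self) -> None:
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


def scrape_with_fallback(
    sources: Sequence[tuple[str, Callable[[], list]]],
    limiter: RateLimiter,
) -> tuple[str, list]:
    """Try each source in order; return the first non-empty result."""
    errors = []
    for name, fetch in sources:
        limiter.wait()
        try:
            games = fetch()
        except Exception as exc:  # a failing source should not stop the run
            errors.append(f"{name}: {exc}")
            continue
        if games:
            return name, games
        errors.append(f"{name}: returned no games")
    raise RuntimeError("all sources failed: " + "; ".join(errors))


# Fake sources standing in for real scrapers (assumption).
def primary() -> list:
    raise ConnectionError("timed out")


def backup() -> list:
    return [{"id": "g1"}]


used, games = scrape_with_fallback(
    [("primary", primary), ("backup", backup)],
    RateLimiter(min_interval=0.0),
)
print(used, len(games))
```

Collecting each source's error message before raising makes it easier to tell a total outage from a single broken scraper.
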
### Canonicalization

```bash
# Run canonicalization pipeline
python run_canonicalization_pipeline.py --sport all
```

### CloudKit Operations

```bash
# Validate data without uploading
python cloudkit_import.py --validate

# Show what would be uploaded (dry run)
python cloudkit_import.py --upload --dry-run

# Upload to CloudKit
python cloudkit_import.py --upload

# List orphan records (requires CloudKit connection)
python cloudkit_import.py --validate --list-orphans

# Delete orphan records
python cloudkit_import.py --delete-orphans
```

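To make the dry-run and orphan commands more concrete, here is a small sketch of a diff between local canonical records and records already in CloudKit. It assumes an "orphan" is a remote record with no local counterpart; this README does not define the term, and `diff_records` is a hypothetical helper rather than part of `cloudkit_import.py`.

```python
def diff_records(local: dict[str, dict], remote: dict[str, dict]) -> dict[str, list[str]]:
    """Classify record IDs by the action a sync would need to take."""
    local_ids, remote_ids = set(local), set(remote)
    return {
        # Present locally but not yet uploaded.
        "create": sorted(local_ids - remote_ids),
        # Present on both sides with differing fields.
        "update": sorted(i for i in local_ids & remote_ids if local[i] != remote[i]),
        # Present remotely only (assumed meaning of "orphan").
        "orphans": sorted(remote_ids - local_ids),
    }


# Made-up records, for illustration only.
local = {
    "stadium_1": {"name": "Example Park", "capacity": 40000},
    "stadium_2": {"name": "Sample Arena", "capacity": 18000},
}
remote = {
    "stadium_1": {"name": "Example Park", "capacity": 39000},
    "stadium_old": {"name": "Demolished Dome", "capacity": 50000},
}

report = diff_records(local, remote)
for action, ids in report.items():
    print(f"{action}: {ids}")
```

Printing a report like this before any write is the general idea behind a dry run: the changes are listed, but nothing is sent.
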
## Related Documentation

- [DATA_SOURCES.md](DATA_SOURCES.md) - Data source URLs, rate limits, validation strategy
- [CLOUDKIT_SETUP.md](CLOUDKIT_SETUP.md) - CloudKit container setup, record types, security roles