# Data Scraping System

This document describes the SportsTime schedule scraping system, including all data sources, the fallback architecture, and operational procedures.

## Overview

The scraping system (`Scripts/scrape_schedules.py`) fetches game schedules for 7 sports leagues from multiple data sources. It uses a **multi-source fallback architecture** to ensure reliability: if one source fails or returns insufficient data, the system automatically tries backup sources.

## Supported Sports

| Sport | League | Season Format | Typical Games |
|-------|--------|---------------|---------------|
| NBA | National Basketball Association | 2024-25 | ~1,230 |
| MLB | Major League Baseball | 2025 | ~2,430 |
| NHL | National Hockey League | 2024-25 | ~1,312 |
| NFL | National Football League | 2025-26 | ~272 |
| WNBA | Women's National Basketball Association | 2025 | ~200 |
| MLS | Major League Soccer | 2025 | ~500 |
| NWSL | National Women's Soccer League | 2025 | ~180 |

## Data Sources by Sport

Each sport has 3 data sources configured in priority order. The scraper tries sources sequentially until one returns sufficient data.
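The Season Format column above mixes cross-year labels (NBA, NHL, NFL) with single-year labels. A minimal sketch of a normalization helper, assuming seasons are identified by their ending year (as the `--season` option does); the function and constant names are hypothetical, not part of `scrape_schedules.py`:

```python
# Hypothetical helper: build a display label from a sport and the
# season's ending year, matching the "Season Format" column above.
CROSS_YEAR_SPORTS = {"NBA", "NHL", "NFL"}  # seasons span two calendar years

def season_label(sport: str, end_year: int) -> str:
    if sport.upper() in CROSS_YEAR_SPORTS:
        # e.g. end_year 2025 -> "2024-25"
        return f"{end_year - 1}-{end_year % 100:02d}"
    return str(end_year)
```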
### NBA (National Basketball Association)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | Basketball-Reference | `basketball-reference.com/leagues/NBA_{year}_games-{month}.html` | 500 |
| 2 | ESPN API | `site.api.espn.com/apis/site/v2/sports/basketball/nba/scoreboard` | 500 |
| 3 | CBS Sports | `cbssports.com/nba/schedule/` | 100 |

**Notes:**

- Basketball-Reference is the most reliable source for historical data
- The ESPN API provides real-time updates but may have rate limits
- CBS Sports serves as an emergency fallback

### MLB (Major League Baseball)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | MLB Stats API | `statsapi.mlb.com/api/v1/schedule` | 1,000 |
| 2 | Baseball-Reference | `baseball-reference.com/leagues/majors/{year}-schedule.shtml` | 500 |
| 3 | ESPN API | `site.api.espn.com/apis/site/v2/sports/baseball/mlb/scoreboard` | 500 |

**Notes:**

- The MLB Stats API is official and the most complete
- Baseball-Reference is good for historical seasons
- Rate limit: 1 request/second for all sources

### NHL (National Hockey League)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | Hockey-Reference | `hockey-reference.com/leagues/NHL_{year}_games.html` | 500 |
| 2 | ESPN API | `site.api.espn.com/apis/site/v2/sports/hockey/nhl/scoreboard` | 500 |
| 3 | NHL API | `api-web.nhle.com/v1/schedule/{date}` | 100 |

**Notes:**

- Hockey-Reference uses a season format like "2025" for the 2024-25 season
- The NHL API is official, but its documentation is limited

### NFL (National Football League)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/football/nfl/scoreboard` | 200 |
| 2 | Pro-Football-Reference | `pro-football-reference.com/years/{year}/games.htm` | 200 |
| 3 | CBS Sports | `cbssports.com/nfl/schedule/` | 100 |

**Notes:**

- ESPN provides week-by-week schedule data
- PFR has complete historical archives
- The season runs September-February (crosses calendar years)

### WNBA (Women's National Basketball Association)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/basketball/wnba/scoreboard` | 100 |
| 2 | Basketball-Reference | `basketball-reference.com/wnba/years/{year}_games.html` | 100 |
| 3 | CBS Sports | `cbssports.com/wnba/schedule/` | 50 |

**Notes:**

- The WNBA season runs May-September
- Fewer games than the NBA (12 teams, 40-game season)

### MLS (Major League Soccer)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/soccer/usa.1/scoreboard` | 200 |
| 2 | FBref | `fbref.com/en/comps/22/{year}/schedule/` | 100 |
| 3 | MLSSoccer.com | `mlssoccer.com/schedule/scores` | 100 |

**Notes:**

- ESPN's league ID for MLS is `usa.1`
- FBref may block automated requests (403 errors)
- The season runs February-November

### NWSL (National Women's Soccer League)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/soccer/usa.nwsl/scoreboard` | 100 |
| 2 | FBref | `fbref.com/en/comps/182/{year}/schedule/` | 50 |
| 3 | NWSL.com | `nwslsoccer.com/schedule` | 50 |

**Notes:**

- ESPN's league ID for the NWSL is `usa.nwsl`
- 14 teams, ~180 regular-season games

## Fallback Architecture

### ScraperSource Configuration

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ScraperSource:
    name: str                            # Display name (e.g., "ESPN")
    scraper_func: Callable[[int], list]  # Function taking the season year
    priority: int = 1                    # Lower = higher priority
    min_games: int = 10                  # Minimum to consider success
```

### Fallback Logic

```python
def scrape_with_fallback(sport, season, sources):
    sources = sorted(sources, key=lambda s: s.priority)
    for source in sources:
        try:
            games = source.scraper_func(season)
            if len(games) >= source.min_games:
                return games  # Success!
        except Exception:
            continue  # Try the next source
    return []  # All sources failed
```

### Example Output

```
SCRAPING NBA 2026
============================================================
[1/3] Trying Basketball-Reference...
✓ Basketball-Reference returned 1230 games

SCRAPING MLB 2026
============================================================
[1/3] Trying MLB Stats API...
✗ MLB Stats API failed: Connection timeout
[2/3] Trying Baseball-Reference...
✓ Baseball-Reference returned 2430 games
```

## Usage

### Command Line Interface

```bash
# Scrape all sports for the 2026 season
python scrape_schedules.py --sport all --season 2026

# Scrape a specific sport
python scrape_schedules.py --sport nba --season 2026
python scrape_schedules.py --sport mlb --season 2026

# Scrape only stadiums (legacy method)
python scrape_schedules.py --stadiums-only

# Scrape comprehensive stadium data for all 7 sports
python scrape_schedules.py --stadiums-update

# Custom output directory
python scrape_schedules.py --sport all --season 2026 --output ./custom_data
```

### Available Options

| Option | Values | Default | Description |
|--------|--------|---------|-------------|
| `--sport` | `nba`, `mlb`, `nhl`, `nfl`, `wnba`, `mls`, `nwsl`, `all` | `all` | Sport(s) to scrape |
| `--season` | Year (int) | `2026` | Season ending year |
| `--stadiums-only` | Flag | False | Only scrape stadium data (legacy method) |
| `--stadiums-update` | Flag | False | Scrape ALL stadium data for all 7 sports |
| `--output` | Path | `./data` | Output directory |

## Output Format

### Directory Structure

```
data/
├── games.json      # All games from all sports
├── stadiums.json   # All stadium/venue data
└── teams.json      # Team metadata (generated)
```

### Game JSON Schema

```json
{
  "id": "NBA-2025-26-LAL-BOS-20251225",
  "sport": "NBA",
  "homeTeam": "Los Angeles Lakers",
  "awayTeam": "Boston Celtics",
  "homeTeamId": "LAL",
  "awayTeamId": "BOS",
  "date": "2025-12-25T20:00:00Z",
  "venue": "Crypto.com Arena",
  "city": "Los Angeles",
  "state": "CA"
}
```

### Stadium JSON Schema

```json
{
  "id": "crypto-com-arena",
  "name": "Crypto.com Arena",
  "city": "Los Angeles",
  "state": "CA",
  "latitude": 34.0430,
  "longitude": -118.2673,
  "sports": ["NBA", "NHL"],
  "teams": ["Los Angeles Lakers", "Los Angeles Kings", "Los Angeles Clippers"]
}
```

## Stable Game IDs

Games are assigned stable IDs using the pattern:

```
{SPORT}-{SEASON}-{AWAY}-{HOME}-{DATE}
```

Example: `NBA-2025-26-LAL-BOS-20251225`

This ensures:

- The same game gets the same ID across scraper runs
- IDs survive if the scraper source changes
- CloudKit records can be updated (not duplicated)

## Rate Limiting

All scrapers implement rate limiting to avoid being blocked:

| Source Type | Rate Limit | Implementation |
|-------------|------------|----------------|
| Sports-Reference family | 1 req/sec | `time.sleep(1)` between requests |
| ESPN API | 2 req/sec | `time.sleep(0.5)` between date ranges |
| Official APIs (MLB, NHL) | 1 req/sec | `time.sleep(1)` between requests |
| CBS Sports | 1 req/sec | `time.sleep(1)` between pages |

## Error Handling

### Common Errors

| Error | Cause | Resolution |
|-------|-------|------------|
| `403 Forbidden` | Rate limited or blocked | Wait 5 min, reduce request rate |
| `Connection timeout` | Network issue | Retry, check connectivity |
| `0 games returned` | Off-season or parsing error | Check if the season has started |
| `KeyError` in parsing | Website structure changed | Update scraper selectors |

### Fallback Behavior

1. If the primary source fails → try source #2
2. If source #2 fails → try source #3
3. If all sources fail → log a warning, return an empty list
4. The script continues to the next sport (it doesn't abort)

## Adding New Sources

### 1. Create Scraper Function

```python
def scrape_newsport_newsource(season: int) -> list[Game]:
    """Scrape the NewSport schedule from NewSource."""
    games = []
    url = f"https://newsource.com/schedule/{season}"
    # requests and HEADERS are module-level in scrape_schedules.py
    response = requests.get(url, headers=HEADERS)
    # Parse the response into Game objects...
    return games
```

### 2. Register in main()

```python
if args.sport in ['newsport', 'all']:
    sources = [
        ScraperSource('Primary', scrape_newsport_primary, priority=1, min_games=100),
        ScraperSource('NewSource', scrape_newsport_newsource, priority=2, min_games=50),
        ScraperSource('Backup', scrape_newsport_backup, priority=3, min_games=25),
    ]
    games = scrape_with_fallback('NEWSPORT', args.season, sources)
```

### 3. Add to CLI choices

```python
parser.add_argument('--sport', choices=[..., 'newsport', 'all'])
```

## Maintenance

### Monthly Tasks

- Run a full scrape to update schedules
- Check for 403 errors indicating blocked sources
- Verify that game counts match expected totals

### Seasonal Tasks

- Update the season year in scripts
- Check for website structure changes
- Verify that new teams/venues are included

### When Sources Break

1. Check whether the website changed structure (inspect the HTML)
2. Update CSS selectors or JSON paths
3. If the source is permanently broken, add a new backup source
4. Update `min_games` thresholds if needed

## Dependencies

```
requests>=2.28.0
beautifulsoup4>=4.11.0
lxml>=4.9.0
```

Install with:

```bash
cd Scripts && pip install -r requirements.txt
```

## CloudKit Integration

After scraping, data is uploaded to CloudKit via:

```bash
python cloudkit_import.py
```

This syncs:

- Games → `CanonicalGame` records
- Stadiums → `CanonicalStadium` records
- Teams → `CanonicalTeam` records

The iOS app then syncs from CloudKit to local SwiftData storage.
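`cloudkit_import.py` itself is not shown here. As a rough sketch of its mapping step, assuming records are assembled as CloudKit Web Services-style payloads before upload (the function name and exact field layout are hypothetical, not taken from the script):

```python
# Hypothetical sketch: map a scraped game dict onto a CloudKit Web
# Services record payload. The stable game ID becomes the recordName,
# so re-importing updates existing records instead of duplicating them.
def game_to_record(game: dict) -> dict:
    return {
        "recordType": "CanonicalGame",
        "recordName": game["id"],
        "fields": {k: {"value": v} for k, v in game.items() if k != "id"},
    }

record = game_to_record({
    "id": "NBA-2025-26-LAL-BOS-20251225",
    "sport": "NBA",
    "homeTeam": "Los Angeles Lakers",
})
```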