# Data Scraping System

This document describes the SportsTime schedule scraping system, including all data sources, the fallback architecture, and operational procedures.

## Overview

The scraping system (`Scripts/scrape_schedules.py`) fetches game schedules for 7 sports leagues from multiple data sources. It uses a **multi-source fallback architecture** for reliability: if one source fails or returns insufficient data, the system automatically tries backup sources.

## Supported Sports

| Sport | League | Season Format | Typical Games |
|-------|--------|---------------|---------------|
| NBA | National Basketball Association | 2024-25 | ~1,230 |
| MLB | Major League Baseball | 2025 | ~2,430 |
| NHL | National Hockey League | 2024-25 | ~1,312 |
| NFL | National Football League | 2025-26 | ~272 |
| WNBA | Women's National Basketball Association | 2025 | ~200 |
| MLS | Major League Soccer | 2025 | ~500 |
| NWSL | National Women's Soccer League | 2025 | ~180 |

## Data Sources by Sport

Each sport has 3 data sources configured in priority order. The scraper tries sources sequentially until one returns at least its `Min Games` threshold.

### NBA (National Basketball Association)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | Basketball-Reference | `basketball-reference.com/leagues/NBA_{year}_games-{month}.html` | 500 |
| 2 | ESPN API | `site.api.espn.com/apis/site/v2/sports/basketball/nba/scoreboard` | 500 |
| 3 | CBS Sports | `cbssports.com/nba/schedule/` | 100 |

**Notes:**
- Basketball-Reference is the most reliable source for historical data
- ESPN API provides real-time updates but may have rate limits
- CBS Sports serves as an emergency fallback

### MLB (Major League Baseball)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | MLB Stats API | `statsapi.mlb.com/api/v1/schedule` | 1,000 |
| 2 | Baseball-Reference | `baseball-reference.com/leagues/majors/{year}-schedule.shtml` | 500 |
| 3 | ESPN API | `site.api.espn.com/apis/site/v2/sports/baseball/mlb/scoreboard` | 500 |

**Notes:**
- MLB Stats API is the official source and the most complete
- Baseball-Reference is good for historical seasons
- Rate limit: 1 request/second for all sources

### NHL (National Hockey League)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | Hockey-Reference | `hockey-reference.com/leagues/NHL_{year}_games.html` | 500 |
| 2 | ESPN API | `site.api.espn.com/apis/site/v2/sports/hockey/nhl/scoreboard` | 500 |
| 3 | NHL API | `api-web.nhle.com/v1/schedule/{date}` | 100 |

**Notes:**
- Hockey-Reference identifies a season by its ending year (for example, `2025` for the 2024-25 season)
- NHL API is official but documentation is limited

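
The ending-year convention noted above can be converted to the `2024-25` style label used elsewhere in this document. The following is a minimal sketch; `season_label` is a hypothetical helper name and is not taken from `scrape_schedules.py`.

```python
def season_label(end_year: int) -> str:
    """Return a 'YYYY-YY' label for a season that ends in end_year."""
    start_year = end_year - 1
    # Two-digit suffix of the ending year, zero-padded (2000 -> "00")
    return f"{start_year}-{end_year % 100:02d}"


print(season_label(2025))  # 2024-25
```

Single-year leagues (MLB, WNBA, MLS, NWSL) would not need this conversion, since their season format is just the calendar year.
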
### NFL (National Football League)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/football/nfl/scoreboard` | 200 |
| 2 | Pro-Football-Reference | `pro-football-reference.com/years/{year}/games.htm` | 200 |
| 3 | CBS Sports | `cbssports.com/nfl/schedule/` | 100 |

**Notes:**
- ESPN provides week-by-week schedule data
- Pro-Football-Reference (PFR) has complete historical archives
- Season runs September-February (crosses calendar years)

### WNBA (Women's National Basketball Association)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/basketball/wnba/scoreboard` | 100 |
| 2 | Basketball-Reference | `basketball-reference.com/wnba/years/{year}_games.html` | 100 |
| 3 | CBS Sports | `cbssports.com/wnba/schedule/` | 50 |

**Notes:**
- WNBA season runs May-September
- Fewer games than NBA (12 teams, 40-game season)

### MLS (Major League Soccer)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/soccer/usa.1/scoreboard` | 200 |
| 2 | FBref | `fbref.com/en/comps/22/{year}/schedule/` | 100 |
| 3 | MLSSoccer.com | `mlssoccer.com/schedule/scores` | 100 |

**Notes:**
- ESPN's league ID for MLS is `usa.1`
- FBref may block automated requests (403 errors)
- Season runs February-November

### NWSL (National Women's Soccer League)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/soccer/usa.nwsl/scoreboard` | 100 |
| 2 | FBref | `fbref.com/en/comps/182/{year}/schedule/` | 50 |
| 3 | NWSL.com | `nwslsoccer.com/schedule` | 50 |

**Notes:**
- ESPN's league ID for NWSL is `usa.nwsl`
- 14 teams, ~180 regular season games

## Fallback Architecture

### ScraperSource Configuration

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ScraperSource:
    name: str                            # Display name (e.g., "ESPN")
    scraper_func: Callable[[int], list]  # Function taking season year
    priority: int = 1                    # Lower = higher priority
    min_games: int = 10                  # Minimum to consider success
```

### Fallback Logic

```python
def scrape_with_fallback(sport, season, sources):
    # Try sources in priority order (lowest number first)
    sources = sorted(sources, key=lambda s: s.priority)

    for source in sources:
        try:
            games = source.scraper_func(season)
            if len(games) >= source.min_games:
                return games  # Success!
        except Exception:
            continue  # Try next source

    return []  # All sources failed
```

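
To see how the fallback behaves, it helps to run it against stand-in scrapers. The sketch below repeats `ScraperSource` and `scrape_with_fallback` so that it is self-contained; the three stub functions (`failing_source`, `sparse_source`, `healthy_source`) are hypothetical and exist only for illustration. The primary source raises, the secondary returns too few games, and the third is accepted.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ScraperSource:
    name: str
    scraper_func: Callable[[int], list]
    priority: int = 1
    min_games: int = 10


def scrape_with_fallback(sport, season, sources):
    for source in sorted(sources, key=lambda s: s.priority):
        try:
            games = source.scraper_func(season)
            if len(games) >= source.min_games:
                return games
        except Exception:
            continue
    return []


# Hypothetical stand-ins for real scrapers
def failing_source(season: int) -> list:
    raise ConnectionError("Connection timeout")


def sparse_source(season: int) -> list:
    return [f"game-{i}" for i in range(3)]  # below the min_games threshold


def healthy_source(season: int) -> list:
    return [f"game-{i}" for i in range(12)]


# Deliberately listed out of order: sorting by priority fixes the sequence
sources = [
    ScraperSource("Backup", healthy_source, priority=3, min_games=10),
    ScraperSource("Primary", failing_source, priority=1, min_games=10),
    ScraperSource("Secondary", sparse_source, priority=2, min_games=10),
]

games = scrape_with_fallback("DEMO", 2026, sources)
print(len(games))  # 12, taken from the third source
```

Note that a source returning fewer than `min_games` is treated the same as a source that raised: its partial result is discarded rather than merged.
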
### Example Output

```
SCRAPING NBA 2026
============================================================
[1/3] Trying Basketball-Reference...
✓ Basketball-Reference returned 1230 games

SCRAPING MLB 2026
============================================================
[1/3] Trying MLB Stats API...
✗ MLB Stats API failed: Connection timeout
[2/3] Trying Baseball-Reference...
✓ Baseball-Reference returned 2430 games
```

## Usage

### Command Line Interface

```bash
# Scrape all sports for 2026 season
python scrape_schedules.py --sport all --season 2026

# Scrape specific sport
python scrape_schedules.py --sport nba --season 2026
python scrape_schedules.py --sport mlb --season 2026

# Scrape only stadiums (legacy method)
python scrape_schedules.py --stadiums-only

# Scrape comprehensive stadium data for all 7 sports
python scrape_schedules.py --stadiums-update

# Custom output directory
python scrape_schedules.py --sport all --season 2026 --output ./custom_data
```

### Available Options

| Option | Values | Default | Description |
|--------|--------|---------|-------------|
| `--sport` | `nba`, `mlb`, `nhl`, `nfl`, `wnba`, `mls`, `nwsl`, `all` | `all` | Sport(s) to scrape |
| `--season` | Year (int) | `2026` | Season ending year |
| `--stadiums-only` | Flag | False | Only scrape stadium data (legacy method) |
| `--stadiums-update` | Flag | False | Scrape comprehensive stadium data for all 7 sports |
| `--output` | Path | `./data` | Output directory |

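
The options table above maps directly onto `argparse`. The following is a sketch of how such a parser could be declared, not the script's actual code; `build_parser` and the `SPORTS` constant are assumed names, and the help strings are paraphrased from the table.

```python
import argparse

SPORTS = ["nba", "mlb", "nhl", "nfl", "wnba", "mls", "nwsl", "all"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Scrape sports schedules")
    parser.add_argument("--sport", choices=SPORTS, default="all",
                        help="Sport(s) to scrape")
    parser.add_argument("--season", type=int, default=2026,
                        help="Season ending year")
    parser.add_argument("--stadiums-only", action="store_true",
                        help="Only scrape stadium data (legacy method)")
    parser.add_argument("--stadiums-update", action="store_true",
                        help="Scrape stadium data for all sports")
    parser.add_argument("--output", default="./data",
                        help="Output directory")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration
args = build_parser().parse_args(["--sport", "nba", "--season", "2026"])
print(args.sport, args.season, args.stadiums_only, args.output)
```

`argparse` converts dashes in flag names to underscores, so `--stadiums-only` is read as `args.stadiums_only`.
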
## Output Format

### Directory Structure

```
data/
├── games.json      # All games from all sports
├── stadiums.json   # All stadium/venue data
└── teams.json      # Team metadata (generated)
```

### Game JSON Schema

```json
{
  "id": "NBA-2025-26-BOS-LAL-20251225",
  "sport": "NBA",
  "homeTeam": "Los Angeles Lakers",
  "awayTeam": "Boston Celtics",
  "homeTeamId": "LAL",
  "awayTeamId": "BOS",
  "date": "2025-12-25T20:00:00Z",
  "venue": "Crypto.com Arena",
  "city": "Los Angeles",
  "state": "CA"
}
```

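
A record with the fields shown above can be modeled as a dataclass and serialized with the standard library. This is a sketch only: the real `Game` type in `scrape_schedules.py` is not shown in this document and may differ, and the camelCase attribute names are chosen here simply so that `asdict` produces the JSON keys unchanged.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Game:
    # Field names mirror the JSON keys of the schema (assumed layout)
    id: str
    sport: str
    homeTeam: str
    awayTeam: str
    homeTeamId: str
    awayTeamId: str
    date: str   # ISO 8601 timestamp in UTC
    venue: str
    city: str
    state: str


game = Game(
    id="NBA-2025-26-BOS-LAL-20251225",
    sport="NBA",
    homeTeam="Los Angeles Lakers",
    awayTeam="Boston Celtics",
    homeTeamId="LAL",
    awayTeamId="BOS",
    date="2025-12-25T20:00:00Z",
    venue="Crypto.com Arena",
    city="Los Angeles",
    state="CA",
)

# asdict() keeps declaration order, so the output matches the schema layout
print(json.dumps(asdict(game), indent=2))
```
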
### Stadium JSON Schema

```json
{
  "id": "crypto-com-arena",
  "name": "Crypto.com Arena",
  "city": "Los Angeles",
  "state": "CA",
  "latitude": 34.0430,
  "longitude": -118.2673,
  "sports": ["NBA", "NHL"],
  "teams": ["Los Angeles Lakers", "Los Angeles Kings", "Los Angeles Clippers"]
}
```

## Stable Game IDs

Games are assigned stable IDs using the pattern:
```
{SPORT}-{SEASON}-{AWAY}-{HOME}-{DATE}
```

Example: `NBA-2025-26-BOS-LAL-20251225` (Boston at Los Angeles on 2025-12-25)

This ensures:
- The same game gets the same ID across scraper runs
- IDs survive a change of scraper source
- CloudKit records can be updated (not duplicated)

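
Because the ID is built only from facts about the game itself, any scraper that knows the sport, season, teams, and date can derive it. A minimal sketch of the pattern follows; `make_game_id` is a hypothetical name, and the real script may normalize team abbreviations differently.

```python
from datetime import date


def make_game_id(sport: str, season: str, away: str, home: str, game_date: date) -> str:
    """Build an ID of the form {SPORT}-{SEASON}-{AWAY}-{HOME}-{DATE}."""
    return (
        f"{sport.upper()}-{season}-{away.upper()}-{home.upper()}"
        f"-{game_date:%Y%m%d}"
    )


# Celtics at Lakers on 25 December 2025
game_id = make_game_id("nba", "2025-26", "BOS", "LAL", date(2025, 12, 25))
print(game_id)  # NBA-2025-26-BOS-LAL-20251225
```

Nothing source-specific (such as an ESPN event ID) enters the string, which is what lets the ID stay the same when a different source supplies the game.
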
## Rate Limiting

All scrapers implement rate limiting to avoid being blocked:

| Source Type | Rate Limit | Implementation |
|-------------|------------|----------------|
| Sports-Reference family | 1 req/sec | `time.sleep(1)` between requests |
| ESPN API | 2 req/sec | `time.sleep(0.5)` between date ranges |
| Official APIs (MLB, NHL) | 1 req/sec | `time.sleep(1)` between requests |
| CBS Sports | 1 req/sec | `time.sleep(1)` between pages |

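
The table describes fixed `time.sleep()` calls between requests. As an alternative sketch (not what the script is documented to do), a small limiter can enforce a minimum interval while subtracting the time the previous request already took. `RateLimiter` is a hypothetical class; the `clock` and `sleep` parameters exist only so the behavior can be tested without real delays.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Short interval so the demo finishes quickly; use 1.0 for 1 req/sec
limiter = RateLimiter(min_interval=0.05)
start = time.monotonic()
for page in range(3):
    limiter.wait()  # the first call returns immediately
    # a request such as requests.get(...) would go here
elapsed = time.monotonic() - start
print(f"3 requests took at least {elapsed:.2f}s")
```

One limiter per host keeps the limits in the table independent of each other, for example 1.0 seconds for the Sports-Reference sites and 0.5 seconds for ESPN.
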
## Error Handling

### Common Errors

| Error | Cause | Resolution |
|-------|-------|------------|
| `403 Forbidden` | Rate limited or blocked | Wait 5 min, reduce request rate |
| `Connection timeout` | Network issue | Retry, check connectivity |
| `0 games returned` | Off-season or parsing error | Check if season has started |
| `KeyError` in parsing | Website structure changed | Update scraper selectors |

### Fallback Behavior

1. If primary source fails → Try source #2
2. If source #2 fails → Try source #3
3. If all sources fail → Log warning, return empty list
4. Script continues to next sport (doesn't abort)

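
For transient failures such as connection timeouts, a source can be retried a few times before the fallback moves on to the next one. The sketch below shows one way to do that with exponential backoff; it is an illustration rather than documented behavior of the script. `with_retries` and `flaky_fetch` are hypothetical names, and the built-in `ConnectionError` and `TimeoutError` stand in for the exceptions a real fetcher raises (with `requests`, these would be `requests.exceptions.ConnectionError` and `requests.exceptions.Timeout`).

```python
import time


def with_retries(fetch, attempts: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: let the fallback logic take over
            sleep(base_delay * 2 ** attempt)


calls = {"count": 0}


def flaky_fetch():
    # Hypothetical fetcher that times out twice, then succeeds
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("Connection timeout")
    return ["game-1", "game-2"]


delays = []
# Record the delays instead of sleeping so the demo runs instantly
result = with_retries(flaky_fetch, sleep=delays.append)
print(result, delays)  # ['game-1', 'game-2'] [1.0, 2.0]
```

Because the final failure is re-raised, wrapping a scraper this way still lets `scrape_with_fallback` catch the exception and try the next source.
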
## Adding New Sources

### 1. Create Scraper Function

```python
def scrape_newsport_newsource(season: int) -> list[Game]:
    """Scrape NewSport schedule from NewSource."""
    games = []
    url = f"https://newsource.com/schedule/{season}"

    # A timeout keeps one stalled source from hanging the whole run
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()  # HTTP errors raise, so the fallback moves on
    # Parse response...

    return games
```

### 2. Register in main()

```python
if args.sport in ['newsport', 'all']:
    sources = [
        ScraperSource('Primary', scrape_newsport_primary, priority=1, min_games=100),
        ScraperSource('NewSource', scrape_newsport_newsource, priority=2, min_games=50),
        ScraperSource('Backup', scrape_newsport_backup, priority=3, min_games=25),
    ]
    games = scrape_with_fallback('NEWSPORT', args.season, sources)
```

### 3. Add to CLI choices

```python
parser.add_argument('--sport', choices=[..., 'newsport', 'all'])
```

## Maintenance

### Monthly Tasks
- Run full scrape to update schedules
- Check for 403 errors indicating blocked sources
- Verify game counts match expected totals

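
The game-count check in the list above can be automated. The sketch below assumes `games.json` is a JSON array of game objects with a `sport` key, as in the Game JSON Schema; the `EXPECTED` totals are copied from the Supported Sports table, while `check_game_counts` and the 10% tolerance are hypothetical choices.

```python
from collections import Counter

# Approximate regular-season totals from the Supported Sports table
EXPECTED = {"NBA": 1230, "MLB": 2430, "NHL": 1312, "NFL": 272,
            "WNBA": 200, "MLS": 500, "NWSL": 180}


def check_game_counts(games: list[dict], tolerance: float = 0.10) -> dict[str, str]:
    """Flag sports whose game count deviates from EXPECTED by more than tolerance."""
    counts = Counter(game["sport"] for game in games)
    report = {}
    for sport, expected in EXPECTED.items():
        actual = counts.get(sport, 0)
        ok = abs(actual - expected) <= tolerance * expected
        report[sport] = f"{'OK' if ok else 'CHECK'} ({actual}/{expected})"
    return report


# In-memory sample; in practice, load data/games.json with json.load()
sample = [{"sport": "NFL"}] * 272 + [{"sport": "NBA"}] * 600
report = check_game_counts(sample)
print(report["NFL"], report["NBA"])  # OK (272/272) CHECK (600/1230)
```

A `CHECK` result during the off-season or early in a season is expected, so the report is a prompt for inspection rather than a hard failure.
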
### Seasonal Tasks
- Update season year in scripts
- Check for website structure changes
- Verify new teams/venues are included

### When Sources Break
1. Check if website changed structure (inspect HTML)
2. Update CSS selectors or JSON paths
3. If permanently broken, add new backup source
4. Update `min_games` thresholds if needed

## Dependencies

```
requests>=2.28.0
beautifulsoup4>=4.11.0
lxml>=4.9.0
```

Install with:
```bash
cd Scripts && pip install -r requirements.txt
```

## CloudKit Integration

After scraping, data is uploaded to CloudKit via:
```bash
python cloudkit_import.py
```

This syncs:
- Games → `CanonicalGame` records
- Stadiums → `CanonicalStadium` records
- Teams → `CanonicalTeam` records

The iOS app then syncs from CloudKit to local SwiftData storage.