Remove CFB/NASCAR/PGA and streamline to 8 supported sports
- Remove College Football, NASCAR, and PGA from scraper and app
- Clean all data files (stadiums, games, pipeline reports)
- Update Sport.swift enum and all UI components
- Add sportstime.py CLI tool for pipeline management
- Add DATA_SCRAPING.md documentation
- Add WNBA/MLS/NWSL implementation documentation
- Scraper now supports: NBA, MLB, NHL, NFL, WNBA, MLS, NWSL, CBB

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
368
docs/DATA_SCRAPING.md
Normal file
@@ -0,0 +1,368 @@
# Data Scraping System

This document describes the SportsTime schedule scraping system, including all data sources, the fallback architecture, and operational procedures.

## Overview

The scraping system (`Scripts/scrape_schedules.py`) fetches game schedules for 8 sports leagues from multiple data sources. It uses a **multi-source fallback architecture** for reliability: if one source fails or returns insufficient data, the system automatically tries backup sources.

## Supported Sports

| Sport | League | Season Format | Typical Games |
|-------|--------|---------------|---------------|
| NBA | National Basketball Association | 2024-25 | ~1,230 |
| MLB | Major League Baseball | 2025 | ~2,430 |
| NHL | National Hockey League | 2024-25 | ~1,312 |
| NFL | National Football League | 2025-26 | ~272 |
| WNBA | Women's National Basketball Association | 2025 | ~200 |
| MLS | Major League Soccer | 2025 | ~500 |
| NWSL | National Women's Soccer League | 2025 | ~180 |
| CBB | NCAA Division I Basketball | 2025-26 | ~5,000+ |

## Data Sources by Sport

Each sport has three data sources configured in priority order. The scraper tries them sequentially until one returns at least its minimum number of games (the "Min Games" column in the tables below).

### NBA (National Basketball Association)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | Basketball-Reference | `basketball-reference.com/leagues/NBA_{year}_games-{month}.html` | 500 |
| 2 | ESPN API | `site.api.espn.com/apis/site/v2/sports/basketball/nba/scoreboard` | 500 |
| 3 | CBS Sports | `cbssports.com/nba/schedule/` | 100 |

**Notes:**
- Basketball-Reference is the most reliable source for historical data
- ESPN API provides real-time updates but may have rate limits
- CBS Sports serves as an emergency fallback

### MLB (Major League Baseball)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | MLB Stats API | `statsapi.mlb.com/api/v1/schedule` | 1,000 |
| 2 | Baseball-Reference | `baseball-reference.com/leagues/majors/{year}-schedule.shtml` | 500 |
| 3 | ESPN API | `site.api.espn.com/apis/site/v2/sports/baseball/mlb/scoreboard` | 500 |

**Notes:**
- MLB Stats API is the official source and the most complete
- Baseball-Reference is good for historical seasons
- Rate limit: 1 request/second for all sources

### NHL (National Hockey League)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | Hockey-Reference | `hockey-reference.com/leagues/NHL_{year}_games.html` | 500 |
| 2 | ESPN API | `site.api.espn.com/apis/site/v2/sports/hockey/nhl/scoreboard` | 500 |
| 3 | NHL API | `api-web.nhle.com/v1/schedule/{date}` | 100 |

**Notes:**
- Hockey-Reference identifies a season by its ending year (e.g., "2025" for the 2024-25 season)
- NHL API is official but its documentation is limited

### NFL (National Football League)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/football/nfl/scoreboard` | 200 |
| 2 | Pro-Football-Reference | `pro-football-reference.com/years/{year}/games.htm` | 200 |
| 3 | CBS Sports | `cbssports.com/nfl/schedule/` | 100 |

**Notes:**
- ESPN provides week-by-week schedule data
- Pro-Football-Reference (PFR) has complete historical archives
- Season runs September-February (crosses calendar years)

### WNBA (Women's National Basketball Association)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/basketball/wnba/scoreboard` | 100 |
| 2 | Basketball-Reference | `basketball-reference.com/wnba/years/{year}_games.html` | 100 |
| 3 | CBS Sports | `cbssports.com/wnba/schedule/` | 50 |

**Notes:**
- WNBA season runs May-September
- Fewer games than NBA (12 teams, 40-game season)

### MLS (Major League Soccer)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/soccer/usa.1/scoreboard` | 200 |
| 2 | FBref | `fbref.com/en/comps/22/{year}/schedule/` | 100 |
| 3 | MLSSoccer.com | `mlssoccer.com/schedule/scores` | 100 |

**Notes:**
- ESPN's league ID for MLS is `usa.1`
- FBref may block automated requests (403 errors)
- Season runs February-November

### NWSL (National Women's Soccer League)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/soccer/usa.nwsl/scoreboard` | 100 |
| 2 | FBref | `fbref.com/en/comps/182/{year}/schedule/` | 50 |
| 3 | NWSL.com | `nwslsoccer.com/schedule` | 50 |

**Notes:**
- ESPN's league ID for NWSL is `usa.nwsl`
- 14 teams, ~180 regular-season games

### CBB (College Basketball - Division I)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/basketball/mens-college-basketball/scoreboard` | 1,000 |
| 2 | Sports-Reference | `sports-reference.com/cbb/seasons/{year}-schedule.html` | 500 |
| 3 | CBS Sports | `cbssports.com/college-basketball/schedule/` | 300 |

**Notes:**
- ~360 Division I teams = 5,000+ games per season
- ESPN provides group filtering (D1 = group 50)
- Season runs November-April (ending with March Madness)

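
The ESPN scoreboard endpoints listed in the tables above share one URL shape, so a single helper can build them. The sketch below is illustrative only: `espn_scoreboard_url` and `ESPN_PATHS` are hypothetical names, and the `dates` and `groups` query parameters are assumptions about ESPN's undocumented API (the document itself only confirms that Division I is group 50).

```python
from datetime import date
from urllib.parse import urlencode

ESPN_BASE = "https://site.api.espn.com/apis/site/v2/sports"

# Sport/league path segments copied from the URL patterns in the tables above.
ESPN_PATHS = {
    "NBA": "basketball/nba",
    "MLB": "baseball/mlb",
    "NHL": "hockey/nhl",
    "NFL": "football/nfl",
    "WNBA": "basketball/wnba",
    "MLS": "soccer/usa.1",
    "NWSL": "soccer/usa.nwsl",
    "CBB": "basketball/mens-college-basketball",
}


def espn_scoreboard_url(sport, day, group=None):
    """Build a scoreboard URL for one sport and one calendar day.

    `dates` (YYYYMMDD) and `groups` are assumed parameter names.
    """
    params = {"dates": day.strftime("%Y%m%d")}
    if group is not None:
        params["groups"] = str(group)
    return f"{ESPN_BASE}/{ESPN_PATHS[sport]}/scoreboard?{urlencode(params)}"


# Division I college basketball on a single day (group 50 per the CBB notes).
cbb_url = espn_scoreboard_url("CBB", date(2025, 11, 4), group=50)
```

Keeping the path segments in one dictionary means a league ID such as `usa.1` is defined in exactly one place.
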
## Fallback Architecture

### ScraperSource Configuration

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ScraperSource:
    name: str                            # Display name (e.g., "ESPN")
    scraper_func: Callable[[int], list]  # Function taking season year
    priority: int = 1                    # Lower = higher priority
    min_games: int = 10                  # Minimum to consider success
```

### Fallback Logic

```python
def scrape_with_fallback(sport, season, sources):
    # Try sources in priority order (lower number first)
    sources = sorted(sources, key=lambda s: s.priority)

    for source in sources:
        try:
            games = source.scraper_func(season)
            if len(games) >= source.min_games:
                return games  # Success!
        except Exception:
            continue  # Try next source

    return []  # All sources failed
```

### Example Output

```
SCRAPING NBA 2026
============================================================
[1/3] Trying Basketball-Reference...
✓ Basketball-Reference returned 1230 games

SCRAPING MLB 2026
============================================================
[1/3] Trying MLB Stats API...
✗ MLB Stats API failed: Connection timeout
[2/3] Trying Baseball-Reference...
✓ Baseball-Reference returned 2430 games
```

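
The fallback loop and the log lines above can be exercised without any network access. The following is a minimal sketch, not the script's actual code: the `log` parameter, the stub scrapers `failing_source` and `working_source`, and the "returned only" message for an under-threshold result are all assumptions added for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ScraperSource:
    name: str
    scraper_func: Callable[[int], list]
    priority: int = 1
    min_games: int = 10


def scrape_with_fallback(sport, season, sources, log=print):
    """Try each source in priority order, logging progress along the way."""
    log(f"SCRAPING {sport} {season}")
    log("=" * 60)
    ordered = sorted(sources, key=lambda s: s.priority)
    for index, source in enumerate(ordered, start=1):
        log(f"[{index}/{len(ordered)}] Trying {source.name}...")
        try:
            games = source.scraper_func(season)
        except Exception as exc:
            log(f"✗ {source.name} failed: {exc}")
            continue
        if len(games) >= source.min_games:
            log(f"✓ {source.name} returned {len(games)} games")
            return games
        # Hypothetical message: too few games counts as a failure.
        log(f"✗ {source.name} returned only {len(games)} games")
    return []


# Hypothetical stand-ins for real scraper functions.
def failing_source(season):
    raise TimeoutError("Connection timeout")


def working_source(season):
    return [{"id": n} for n in range(2430)]


demo_sources = [
    ScraperSource("Baseball-Reference", working_source, priority=2, min_games=500),
    ScraperSource("MLB Stats API", failing_source, priority=1, min_games=1000),
]
lines = []
games = scrape_with_fallback("MLB", 2026, demo_sources, log=lines.append)
```

Passing `lines.append` as `log` collects the messages in a list, which makes the source ordering easy to check in a test.
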
## Usage

### Command Line Interface

```bash
# Scrape all sports for 2026 season
python scrape_schedules.py --sport all --season 2026

# Scrape specific sport
python scrape_schedules.py --sport nba --season 2026
python scrape_schedules.py --sport mlb --season 2026

# Scrape only stadiums (legacy method)
python scrape_schedules.py --stadiums-only

# Scrape comprehensive stadium data for all 8 sports
python scrape_schedules.py --stadiums-update

# Custom output directory
python scrape_schedules.py --sport all --season 2026 --output ./custom_data
```

### Available Options

| Option | Values | Default | Description |
|--------|--------|---------|-------------|
| `--sport` | `nba`, `mlb`, `nhl`, `nfl`, `wnba`, `mls`, `nwsl`, `cbb`, `all` | `all` | Sport(s) to scrape |
| `--season` | Year (int) | `2026` | Season ending year |
| `--stadiums-only` | Flag | False | Only scrape stadium data (legacy method) |
| `--stadiums-update` | Flag | False | Scrape ALL stadium data for all 8 sports |
| `--output` | Path | `./data` | Output directory |

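
For reference, the options table maps onto `argparse` roughly as follows. This is a sketch reconstructed from the table alone, not the script's actual parser; `build_parser` and `SPORT_CHOICES` are hypothetical names.

```python
import argparse

SPORT_CHOICES = ["nba", "mlb", "nhl", "nfl", "wnba", "mls", "nwsl", "cbb", "all"]


def build_parser():
    """Build a parser mirroring the options table (names and defaults only)."""
    parser = argparse.ArgumentParser(description="Scrape sports schedules")
    parser.add_argument("--sport", choices=SPORT_CHOICES, default="all",
                        help="Sport(s) to scrape")
    parser.add_argument("--season", type=int, default=2026,
                        help="Season ending year")
    parser.add_argument("--stadiums-only", action="store_true",
                        help="Only scrape stadium data (legacy method)")
    parser.add_argument("--stadiums-update", action="store_true",
                        help="Scrape stadium data for all 8 sports")
    parser.add_argument("--output", default="./data",
                        help="Output directory")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--sport", "nba", "--season", "2026"])
```
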
## Output Format

### Directory Structure

```
data/
├── games.json       # All games from all sports
├── stadiums.json    # All stadium/venue data
└── teams.json       # Team metadata (generated)
```

### Game JSON Schema

```json
{
  "id": "NBA-2025-26-LAL-BOS-20251225",
  "sport": "NBA",
  "homeTeam": "Los Angeles Lakers",
  "awayTeam": "Boston Celtics",
  "homeTeamId": "LAL",
  "awayTeamId": "BOS",
  "date": "2025-12-25T20:00:00Z",
  "venue": "Crypto.com Arena",
  "city": "Los Angeles",
  "state": "CA"
}
```

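
A scraped record can be checked against this shape before it is written to `games.json`. The sketch below assumes every field in the example is required and that `date` is always an ISO 8601 UTC timestamp; neither is stated in the document, and `validate_game` is a hypothetical helper.

```python
from datetime import datetime

# Assumption: every key shown in the example record is mandatory.
REQUIRED_GAME_KEYS = {
    "id", "sport", "homeTeam", "awayTeam", "homeTeamId",
    "awayTeamId", "date", "venue", "city", "state",
}


def validate_game(record):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    missing = REQUIRED_GAME_KEYS - record.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if "date" in record:
        try:
            # Replace the trailing "Z" so older Python versions can parse it.
            datetime.fromisoformat(record["date"].replace("Z", "+00:00"))
        except ValueError:
            problems.append(f"invalid date: {record['date']!r}")
    return problems


sample = {
    "id": "NBA-2025-26-LAL-BOS-20251225",
    "sport": "NBA",
    "homeTeam": "Los Angeles Lakers",
    "awayTeam": "Boston Celtics",
    "homeTeamId": "LAL",
    "awayTeamId": "BOS",
    "date": "2025-12-25T20:00:00Z",
    "venue": "Crypto.com Arena",
    "city": "Los Angeles",
    "state": "CA",
}
sample_problems = validate_game(sample)
```
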
### Stadium JSON Schema

```json
{
  "id": "crypto-com-arena",
  "name": "Crypto.com Arena",
  "city": "Los Angeles",
  "state": "CA",
  "latitude": 34.0430,
  "longitude": -118.2673,
  "sports": ["NBA", "NHL"],
  "teams": ["Los Angeles Lakers", "Los Angeles Kings", "Los Angeles Clippers"]
}
```

## Stable Game IDs

Games are assigned stable IDs using the pattern:
```
{SPORT}-{SEASON}-{AWAY}-{HOME}-{DATE}
```

Example: `NBA-2025-26-LAL-BOS-20251225`

This ensures:
- The same game gets the same ID across scraper runs
- IDs stay the same when the scraper source changes
- CloudKit records can be updated (not duplicated)

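
The ID pattern can be sketched as a small pure function. This follows the pattern exactly as written, with the away team first. The example ID above lists `LAL` before `BOS` even though the sample game record marks the Lakers as the home team, so the team order should be verified against the actual scraper. `season_label`, `make_game_id`, and the set of split-year sports are assumptions inferred from the Season Format column.

```python
from datetime import date

# Assumption: these leagues use a two-year season label such as "2025-26".
SPLIT_YEAR_SPORTS = {"NBA", "NHL", "NFL", "CBB"}


def season_label(sport, ending_year):
    """Turn a season ending year into the label used inside game IDs."""
    if sport in SPLIT_YEAR_SPORTS:
        return f"{ending_year - 1}-{ending_year % 100:02d}"
    return str(ending_year)


def make_game_id(sport, ending_year, away_id, home_id, game_date):
    """Build {SPORT}-{SEASON}-{AWAY}-{HOME}-{DATE} from its parts."""
    return "-".join([
        sport,
        season_label(sport, ending_year),
        away_id,
        home_id,
        game_date.strftime("%Y%m%d"),
    ])


nba_id = make_game_id("NBA", 2026, "BOS", "LAL", date(2025, 12, 25))
mlb_id = make_game_id("MLB", 2025, "NYY", "BOS", date(2025, 4, 1))
```

Because the ID depends only on the sport, season, teams, and date, two sources that report the same game produce the same ID, which is what lets CloudKit update a record instead of duplicating it.
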
## Rate Limiting

All scrapers implement rate limiting to avoid being blocked:

| Source Type | Rate Limit | Implementation |
|-------------|------------|----------------|
| Sports-Reference family | 1 req/sec | `time.sleep(1)` between requests |
| ESPN API | 2 req/sec | `time.sleep(0.5)` between date ranges |
| Official APIs (MLB, NHL) | 1 req/sec | `time.sleep(1)` between requests |
| CBS Sports | 1 req/sec | `time.sleep(1)` between pages |

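
A fixed `time.sleep()` after every request always waits the full interval, even when parsing has already taken most of it. One alternative, sketched below under the assumption that each source type gets its own limiter, sleeps only for the time remaining. `RateLimiter` is a hypothetical class, not part of the script; the injectable `clock` and `sleep` arguments exist so it can be tested without real delays.

```python
import time


class RateLimiter:
    """Enforce a minimum interval (in seconds) between consecutive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # Time of the previous call, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# One limiter per source type, matching the intervals in the table above.
sports_reference_limiter = RateLimiter(1.0)
espn_limiter = RateLimiter(0.5)
```

A scraper would call `espn_limiter.wait()` immediately before each request.
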
## Error Handling

### Common Errors

| Error | Cause | Resolution |
|-------|-------|------------|
| `403 Forbidden` | Rate limited or blocked | Wait 5 min, reduce request rate |
| `Connection timeout` | Network issue | Retry, check connectivity |
| `0 games returned` | Off-season or parsing error | Check whether the season has started; if it has, inspect the parser |
| `KeyError` in parsing | Website structure changed | Update scraper selectors |

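
The "Retry" resolution for transient network errors could be automated with exponential backoff. The sketch below is an assumption about how that might look, not a description of what the scraper currently does; `retry` and its parameters are hypothetical, and it wraps any zero-argument callable so it stays independent of the HTTP library.

```python
import time


def retry(func, attempts=3, base_delay=1.0,
          retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call func(), retrying on transient errors with doubling delays."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: let the fallback logic take over
            sleep(base_delay * 2 ** (attempt - 1))


# Demonstration with a stub that fails twice, then succeeds.
calls = {"count": 0}
delays = []


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("Connection timeout")
    return "ok"


result = retry(flaky_fetch, sleep=delays.append)
```

Errors outside `retry_on`, such as a `KeyError` from a changed page layout, propagate immediately, since retrying would not help.
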
### Fallback Behavior

1. If primary source fails → Try source #2
2. If source #2 fails → Try source #3
3. If all sources fail → Log warning, return empty list
4. Script continues to next sport (doesn't abort)

## Adding New Sources

### 1. Create Scraper Function

```python
def scrape_newsport_newsource(season: int) -> list[Game]:
    """Scrape NewSport schedule from NewSource."""
    games = []
    url = f"https://newsource.com/schedule/{season}"

    # A timeout keeps a hung request from stalling the fallback chain
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()  # HTTP errors raise, so the next source is tried
    # Parse response...

    return games
```

### 2. Register in main()

```python
if args.sport in ['newsport', 'all']:
    sources = [
        ScraperSource('Primary', scrape_newsport_primary, priority=1, min_games=100),
        ScraperSource('NewSource', scrape_newsport_newsource, priority=2, min_games=50),
        ScraperSource('Backup', scrape_newsport_backup, priority=3, min_games=25),
    ]
    games = scrape_with_fallback('NEWSPORT', args.season, sources)
```

### 3. Add to CLI choices

```python
parser.add_argument('--sport', choices=[..., 'newsport', 'all'])
```

## Maintenance

### Monthly Tasks
- Run a full scrape to update schedules
- Check for 403 errors indicating blocked sources
- Verify game counts match expected totals

### Seasonal Tasks
- Update season year in scripts
- Check for website structure changes
- Verify new teams/venues are included

### When Sources Break
1. Check whether the website changed structure (inspect the HTML)
2. Update CSS selectors or JSON paths
3. If the source is permanently broken, add a new backup source
4. Update `min_games` thresholds if needed

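
The monthly "verify game counts" task can be partly automated. The sketch below assumes `games.json` is a JSON array of game objects with a `sport` field, as in the Game JSON Schema, and uses the approximate totals from the Supported Sports table. `load_games`, `count_shortfalls`, and the 90% tolerance are hypothetical choices.

```python
import json
from collections import Counter

# Approximate regular-season totals from the Supported Sports table.
EXPECTED_GAMES = {
    "NBA": 1230, "MLB": 2430, "NHL": 1312, "NFL": 272,
    "WNBA": 200, "MLS": 500, "NWSL": 180, "CBB": 5000,
}


def load_games(path):
    """Load the scraped games; assumes the file holds a JSON array."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def count_shortfalls(games, tolerance=0.9):
    """Return {sport: actual_count} for sports below tolerance * expected."""
    counts = Counter(game["sport"] for game in games)
    return {
        sport: counts[sport]
        for sport, expected in EXPECTED_GAMES.items()
        if counts[sport] < expected * tolerance
    }


# Demonstration with synthetic records instead of a real games.json.
demo_games = [{"sport": "NFL"}] * 272 + [{"sport": "NBA"}] * 10
shortfalls = count_shortfalls(demo_games)
```

Against real data the call would be `count_shortfalls(load_games("data/games.json"))`; early in a season, low counts are expected for leagues whose schedules are not yet published.
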
## Dependencies

```
requests>=2.28.0
beautifulsoup4>=4.11.0
lxml>=4.9.0
```

Install with:
```bash
cd Scripts && pip install -r requirements.txt
```

## CloudKit Integration

After scraping, data is uploaded to CloudKit via:
```bash
python cloudkit_import.py
```

This syncs:
- Games → `CanonicalGame` records
- Stadiums → `CanonicalStadium` records
- Teams → `CanonicalTeam` records

The iOS app then syncs from CloudKit to local SwiftData storage.