Sportstime/Scripts/sportstime_parser/SOURCES.md
Trey t eeaf900e5a feat(scripts): rewrite parser as modular Python CLI
Replace monolithic scraping scripts with sportstime_parser package:

- Multi-source scrapers with automatic fallback for 7 sports
- Canonical ID generation for games, teams, and stadiums
- Fuzzy matching with configurable thresholds for name resolution
- CloudKit Web Services uploader with JWT auth, diff-based updates
- Resumable uploads with checkpoint state persistence
- Validation reports with manual review items and suggested matches
- Comprehensive test suite (249 tests)

CLI: sportstime-parser scrape|validate|upload|status|retry|clear

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 21:06:12 -06:00


# Data Sources
This document lists all data sources used by the SportsTime parser, including URLs, rate limits, and data freshness expectations.
## Source Priority
Each sport has one or more sources configured in priority order. The scraper tries the sources in that order and uses the first one that succeeds. If a source fails (network error, parsing error, etc.), the scraper falls back to the next source.
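
The priority-and-fallback behavior described above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: the function name `scrape_with_fallback`, the `(name, fetch)` pair shape, and the `Game` alias are all hypothetical.

```python
from typing import Callable, Iterable

Game = dict  # hypothetical record type, for illustration only


def scrape_with_fallback(
    sources: Iterable[tuple[str, Callable[[], list[Game]]]],
) -> tuple[str, list[Game]]:
    """Try each (name, fetch) pair in priority order; return the first success."""
    errors = []
    for name, fetch in sources:
        try:
            games = fetch()
        except Exception as exc:  # network error, parsing error, etc.
            errors.append(f"{name}: {exc}")
            continue  # fall back to the next source
        return name, games
    raise RuntimeError("all sources failed: " + "; ".join(errors))
```

Passing the sources as an ordered list keeps the priority explicit in one place, and collecting the per-source errors makes the final failure message useful for debugging.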
---
## NBA (National Basketball Association)
**Teams**: 30
**Expected Games**: ~1,230 per season
**Season**: October - June (spans two calendar years)
### Sources
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | Basketball-Reference | `basketball-reference.com/leagues/NBA_{YEAR}_games-{month}.html` | HTML |
| 2 | ESPN API | `site.api.espn.com/apis/site/v2/sports/basketball/nba/scoreboard` | JSON |
| 3 | CBS Sports | `cbssports.com/nba/schedule/` | HTML |
### Rate Limits
- **Basketball-Reference**: ~1 request/second recommended
- **ESPN API**: No published limit; use 1 request/second to be safe
- **CBS Sports**: ~1 request/second recommended
### Notes
- Basketball-Reference is the most reliable source with complete historical data
- ESPN API is good for current/future seasons
- Basketball-Reference organizes games by month, so a full season requires one request per month
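
Because the Basketball-Reference pattern takes a `{month}` segment, a season scrape has to build one URL per month. A minimal sketch, assuming that `{month}` is a lowercase English month name, that `{YEAR}` is the calendar year in which the season ends, and that the host uses `https://www.`; the helper name `nba_month_urls` is hypothetical:

```python
# Months of an October - June season, in schedule order.
SEASON_MONTHS = [
    "october", "november", "december", "january", "february",
    "march", "april", "may", "june",
]


def nba_month_urls(year: int) -> list[str]:
    """Build one schedule URL per month for the season labelled `year`."""
    base = "https://www.basketball-reference.com/leagues"  # assumed scheme and host
    return [f"{base}/NBA_{year}_games-{month}.html" for month in SEASON_MONTHS]
```

For example, `nba_month_urls(2026)` yields nine URLs, one for each month from October through June.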
---
## MLB (Major League Baseball)
**Teams**: 30
**Expected Games**: ~2,430 per season
**Season**: March/April - October/November (single calendar year)
### Sources
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | Baseball-Reference | `baseball-reference.com/leagues/majors/{YEAR}-schedule.shtml` | HTML |
| 2 | MLB Stats API | `statsapi.mlb.com/api/v1/schedule` | JSON |
| 3 | ESPN API | `site.api.espn.com/apis/site/v2/sports/baseball/mlb/scoreboard` | JSON |
### Rate Limits
- **Baseball-Reference**: ~1 request/second recommended
- **MLB Stats API**: No published limit; use 0.5 requests/second (one request every 2 seconds)
- **ESPN API**: ~1 request/second
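
A requests-per-second figure translates into a minimum interval between calls (0.5 requests/second means at least 2 seconds apart). The sketch below shows one way such a limit could be enforced. The `RateLimiter` class and its injectable `clock` and `sleep` parameters are assumptions for illustration, not the parser's API.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, requests_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A scraper would call `wait()` immediately before each request, with one limiter per source so that a slower limit on one site does not throttle the others.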
### Notes
- MLB schedules doubleheaders; the games of a doubleheader are suffixed with `_1`, `_2`
- Single schedule page per season on Baseball-Reference
- MLB Stats API allows date range queries for efficiency
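
The doubleheader suffixing mentioned above could look something like this. It is a sketch under assumptions: the base ID format shown in the usage note is invented, the input is assumed to be in chronological order, and leaving single games unsuffixed is a guess about the parser's behavior.

```python
from collections import Counter


def suffix_doubleheaders(base_ids: list[str]) -> list[str]:
    """Append _1, _2, ... to IDs that occur more than once (doubleheaders)."""
    totals = Counter(base_ids)  # how many games share each base ID
    seen = Counter()            # how many of each we have emitted so far
    result = []
    for base in base_ids:
        if totals[base] > 1:
            seen[base] += 1
            result.append(f"{base}_{seen[base]}")
        else:
            result.append(base)  # assumption: single games keep the bare ID
    return result
```

With hypothetical IDs, `suffix_doubleheaders(["a", "b", "a"])` returns `["a_1", "b", "a_2"]`.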
---
## NFL (National Football League)
**Teams**: 32
**Expected Games**: ~272 per season (regular season only)
**Season**: September - February (spans two calendar years)
### Sources
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/football/nfl/scoreboard` | JSON |
| 2 | Pro-Football-Reference | `pro-football-reference.com/years/{YEAR}/games.htm` | HTML |
| 3 | CBS Sports | `cbssports.com/nfl/schedule/` | HTML |
### Rate Limits
- **ESPN API**: ~1 request/second
- **Pro-Football-Reference**: ~1 request/second
- **CBS Sports**: ~1 request/second
### Notes
- ESPN API uses week numbers instead of dates
- International games (London, Mexico City, Frankfurt, etc.) are filtered out
- Scraped data includes preseason, regular season, and playoff games; the expected game count above covers the regular season only
---
## NHL (National Hockey League)
**Teams**: 32 (including Utah Hockey Club)
**Expected Games**: ~1,312 per season
**Season**: October - June (spans two calendar years)
### Sources
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | Hockey-Reference | `hockey-reference.com/leagues/NHL_{YEAR}_games.html` | HTML |
| 2 | NHL API | `api-web.nhle.com/v1/schedule/{date}` | JSON |
| 3 | ESPN API | `site.api.espn.com/apis/site/v2/sports/hockey/nhl/scoreboard` | JSON |
### Rate Limits
- **Hockey-Reference**: ~1 request/second
- **NHL API**: No published limit; use 0.5 requests/second (one request every 2 seconds)
- **ESPN API**: ~1 request/second
### Notes
- International games (Prague, Stockholm, Helsinki, etc.) are filtered out
- Single schedule page per season on Hockey-Reference
---
## MLS (Major League Soccer)
**Teams**: 30 (including San Diego FC)
**Expected Games**: ~493 per season
**Season**: February/March - October/November (single calendar year)
### Sources
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/soccer/usa.1/scoreboard` | JSON |
| 2 | FBref | `fbref.com/en/comps/22/{YEAR}/schedule/` | HTML |
### Rate Limits
- **ESPN API**: ~1 request/second
- **FBref**: ~1 request/second
### Notes
- MLS runs within a single calendar year
- Some teams share stadiums with NFL teams
---
## WNBA (Women's National Basketball Association)
**Teams**: 13 (including Golden State Valkyries)
**Expected Games**: ~220 per season
**Season**: May - October (single calendar year)
### Sources
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/basketball/wnba/scoreboard` | JSON |
### Rate Limits
- **ESPN API**: ~1 request/second
### Notes
- Many WNBA teams share arenas with NBA teams
- Teams and stadiums are hardcoded (smaller league)
---
## NWSL (National Women's Soccer League)
**Teams**: 14
**Expected Games**: ~182 per season
**Season**: March - November (single calendar year)
### Sources
| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/soccer/usa.nwsl/scoreboard` | JSON |
### Rate Limits
- **ESPN API**: ~1 request/second
### Notes
- Many NWSL teams share stadiums with MLS teams
- Teams and stadiums are hardcoded (smaller league)
---
## Stadium Data Sources
Stadium coordinates and metadata come from multiple sources:
| Sport | Sources |
|-------|---------|
| MLB | MLBScoreBot GitHub, cageyjames GeoJSON, hardcoded |
| NFL | NFLScoreBot GitHub, brianhatchl GeoJSON, hardcoded |
| NBA | Hardcoded |
| NHL | Hardcoded |
| MLS | gavinr GeoJSON, hardcoded |
| WNBA | Hardcoded (shared with NBA) |
| NWSL | Hardcoded (shared with MLS) |
---
## General Guidelines
### Rate Limiting
All scrapers implement:
1. **Default delay**: 1 second between requests
2. **Auto-detection**: Detects HTTP 429 (Too Many Requests) responses
3. **Exponential backoff**: The delay starts at 1 second and doubles after each retry, up to 3 retries
4. **Connection pooling**: Reuses HTTP connections for efficiency
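
The 429 detection and exponential backoff listed above can be sketched as below. This is an illustration only: `fetch_with_backoff` and the `send` callable (which performs one request and returns the HTTP status code) are hypothetical, and `sleep` is injectable so the logic can be exercised without real waiting.

```python
import time
from typing import Callable


def fetch_with_backoff(
    send: Callable[[], int],  # hypothetical: performs the request, returns status
    max_retries: int = 3,
    initial_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry on HTTP 429, doubling the delay each time, up to max_retries."""
    delay = initial_delay
    for attempt in range(max_retries + 1):  # first try plus the retries
        status = send()
        if status != 429:
            return status
        if attempt == max_retries:
            break  # retries exhausted
        sleep(delay)
        delay *= 2  # 1s, 2s, 4s with the defaults
    raise RuntimeError(f"still rate limited after {max_retries} retries")
```

With the defaults, a source that keeps returning 429 is tried four times in total, with waits of 1, 2, and 4 seconds between attempts, before the error is raised.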
### Error Handling
- **Partial data**: If a source fails mid-scrape, partial data is discarded
- **Source fallback**: Automatically tries the next source on failure
- **Logging**: All errors are logged for debugging
### Data Freshness
| Data Type | Freshness |
|-----------|-----------|
| Games (future) | Check weekly during season |
| Games (past) | Final scores available within hours |
| Teams | Update at start of each season |
| Stadiums | Update when venues change |
### Geographic Filter
Games at venues outside the USA, Canada, and Mexico are automatically filtered out:
- **NFL**: London, Frankfurt, Munich, Mexico City, São Paulo
- **NHL**: Prague, Stockholm, Helsinki, Tampere, Gothenburg
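
As a rough illustration, the filter could be expressed as a per-sport lookup over the cities listed above. Whether the parser matches on city names, country codes, or coordinates is not stated here, so the city-based approach and the names `EXCLUDED_CITIES` and `keep_game` are assumptions.

```python
# Cities taken from the lists above; the lookup structure itself is hypothetical.
EXCLUDED_CITIES = {
    "nfl": {"London", "Frankfurt", "Munich", "Mexico City", "São Paulo"},
    "nhl": {"Prague", "Stockholm", "Helsinki", "Tampere", "Gothenburg"},
}


def keep_game(sport: str, venue_city: str) -> bool:
    """Return False for games whose venue city is on the sport's exclusion list."""
    return venue_city not in EXCLUDED_CITIES.get(sport.lower(), set())
```

A scrape result could then be filtered with a comprehension such as `[g for g in games if keep_game(g["sport"], g["city"])]`, where the `sport` and `city` keys are likewise illustrative.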
---
## Legal Considerations
This tool is designed for personal/educational use. When using these sources:
1. Respect robots.txt files
2. Don't make excessive requests
3. Cache responses when possible
4. Check each source's Terms of Service
5. Consider that schedule data may be copyrighted
The ESPN API is undocumented but publicly accessible. Sports-Reference sites request reasonable rate limiting; check their current bot-traffic policy and Terms of Service before scraping.