# Data Sources

This document lists all data sources used by the SportsTime parser, including URLs, rate limits, and data freshness expectations.

## Source Priority

Each sport has one or more sources configured in priority order. The scraper tries each source in order and uses the first one that succeeds. If a source fails (network error, parsing error, etc.), it falls back to the next source.

---

## NBA (National Basketball Association)

**Teams**: 30
**Expected Games**: ~1,230 per season
**Season**: October - June (spans two calendar years)

### Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | Basketball-Reference | `basketball-reference.com/leagues/NBA_{YEAR}_games-{month}.html` | HTML |
| 2 | ESPN API | `site.api.espn.com/apis/site/v2/sports/basketball/nba/scoreboard` | JSON |
| 3 | CBS Sports | `cbssports.com/nba/schedule/` | HTML |
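
Because the Basketball-Reference pattern contains a `{month}` placeholder, a scraper has to expand it once per month of the season. The sketch below shows one way to do that. It assumes that `{YEAR}` is the calendar year in which the season ends, that `{month}` is the lowercase English month name, and that the site is served over `https://www.`; the function and constant names are illustrative and are not taken from the package.

```python
# Hypothetical helper: expands the Basketball-Reference URL pattern per month.
# Assumption: October through June covers the whole season (see "Season" above).
NBA_SEASON_MONTHS = [
    "october", "november", "december", "january", "february",
    "march", "april", "may", "june",
]

# Assumption: scheme and "www." prefix are added here; the table lists only the bare host.
URL_PATTERN = "https://www.basketball-reference.com/leagues/NBA_{year}_games-{month}.html"


def nba_schedule_urls(season_end_year: int) -> list[str]:
    """Return one schedule URL per month for the season ending in season_end_year."""
    return [
        URL_PATTERN.format(year=season_end_year, month=month)
        for month in NBA_SEASON_MONTHS
    ]


if __name__ == "__main__":
    for url in nba_schedule_urls(2025):
        print(url)
```

A month page that does not exist for a given season would simply be a failed request, so a caller would likely want to tolerate missing months rather than treat them as a source failure.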

### Rate Limits

- **Basketball-Reference**: ~1 request/second recommended
- **ESPN API**: No published limit, use 1 request/second to be safe
- **CBS Sports**: ~1 request/second recommended

### Notes

- Basketball-Reference is the most reliable source with complete historical data
- ESPN API is good for current/future seasons
- Games organized by month on Basketball-Reference

---

## MLB (Major League Baseball)

**Teams**: 30
**Expected Games**: ~2,430 per season
**Season**: March/April - October/November (single calendar year)

### Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | Baseball-Reference | `baseball-reference.com/leagues/majors/{YEAR}-schedule.shtml` | HTML |
| 2 | MLB Stats API | `statsapi.mlb.com/api/v1/schedule` | JSON |
| 3 | ESPN API | `site.api.espn.com/apis/site/v2/sports/baseball/mlb/scoreboard` | JSON |

### Rate Limits

- **Baseball-Reference**: ~1 request/second recommended
- **MLB Stats API**: No published limit, use 0.5 requests/second
- **ESPN API**: ~1 request/second

### Notes

- MLB has doubleheaders; games are suffixed with `_1`, `_2`
- Single schedule page per season on Baseball-Reference
- MLB Stats API allows date range queries for efficiency
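
The doubleheader suffixing mentioned in the notes can be illustrated with a small sketch. It assumes that both games of a doubleheader would otherwise produce the same base ID, that the input is already ordered by start time, and that only colliding IDs receive a suffix; the function name and the ID format in the example are hypothetical and may differ from what the package generates.

```python
from collections import Counter, defaultdict


def suffix_doubleheaders(base_ids: list[str]) -> list[str]:
    """Append _1, _2, ... to IDs that occur more than once.

    IDs that occur only once are returned unchanged. Input order is preserved,
    so the first occurrence of a repeated ID becomes _1, the second _2, etc.
    """
    totals = Counter(base_ids)                  # how often each base ID appears
    seen: dict[str, int] = defaultdict(int)     # occurrences counted so far
    result = []
    for base in base_ids:
        if totals[base] > 1:
            seen[base] += 1
            result.append(f"{base}_{seen[base]}")
        else:
            result.append(base)
    return result


if __name__ == "__main__":
    # Hypothetical ID format: sport, date, away team, home team.
    ids = [
        "mlb_2025-07-04_nyy_bos",
        "mlb_2025-07-04_lad_sf",
        "mlb_2025-07-04_nyy_bos",
    ]
    print(suffix_doubleheaders(ids))
```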

---

## NFL (National Football League)

**Teams**: 32
**Expected Games**: ~272 per season (regular season only)
**Season**: September - February (spans two calendar years)

### Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/football/nfl/scoreboard` | JSON |
| 2 | Pro-Football-Reference | `pro-football-reference.com/years/{YEAR}/games.htm` | HTML |
| 3 | CBS Sports | `cbssports.com/nfl/schedule/` | HTML |

### Rate Limits

- **ESPN API**: ~1 request/second
- **Pro-Football-Reference**: ~1 request/second
- **CBS Sports**: ~1 request/second

### Notes

- ESPN API uses week numbers instead of dates
- International games (London, Mexico City, Frankfurt, etc.) are filtered out
- Includes preseason, regular season, and playoffs

---

## NHL (National Hockey League)

**Teams**: 32 (including Utah Hockey Club)
**Expected Games**: ~1,312 per season
**Season**: October - June (spans two calendar years)

### Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | Hockey-Reference | `hockey-reference.com/leagues/NHL_{YEAR}_games.html` | HTML |
| 2 | NHL API | `api-web.nhle.com/v1/schedule/{date}` | JSON |
| 3 | ESPN API | `site.api.espn.com/apis/site/v2/sports/hockey/nhl/scoreboard` | JSON |

### Rate Limits

- **Hockey-Reference**: ~1 request/second
- **NHL API**: No published limit, use 0.5 requests/second
- **ESPN API**: ~1 request/second

### Notes

- International games (Prague, Stockholm, Helsinki, etc.) are filtered out
- Single schedule page per season on Hockey-Reference

---

## MLS (Major League Soccer)

**Teams**: 30 (including San Diego FC)
**Expected Games**: ~493 per season
**Season**: February/March - October/November (single calendar year)

### Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/soccer/usa.1/scoreboard` | JSON |
| 2 | FBref | `fbref.com/en/comps/22/{YEAR}/schedule/` | HTML |

### Rate Limits

- **ESPN API**: ~1 request/second
- **FBref**: ~1 request/second

### Notes

- MLS runs within a single calendar year
- Some teams share stadiums with NFL teams

---

## WNBA (Women's National Basketball Association)

**Teams**: 13 (including Golden State Valkyries)
**Expected Games**: ~220 per season
**Season**: May - October (single calendar year)

### Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/basketball/wnba/scoreboard` | JSON |

### Rate Limits

- **ESPN API**: ~1 request/second

### Notes

- Many WNBA teams share arenas with NBA teams
- Teams and stadiums are hardcoded (smaller league)

---

## NWSL (National Women's Soccer League)

**Teams**: 14
**Expected Games**: ~182 per season
**Season**: March - November (single calendar year)

### Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/soccer/usa.nwsl/scoreboard` | JSON |

### Rate Limits

- **ESPN API**: ~1 request/second

### Notes

- Many NWSL teams share stadiums with MLS teams
- Teams and stadiums are hardcoded (smaller league)

---

## Stadium Data Sources

Stadium coordinates and metadata come from multiple sources:

| Sport | Sources |
|-------|---------|
| MLB | MLBScoreBot GitHub, cageyjames GeoJSON, hardcoded |
| NFL | NFLScoreBot GitHub, brianhatchl GeoJSON, hardcoded |
| NBA | Hardcoded |
| NHL | Hardcoded |
| MLS | gavinr GeoJSON, hardcoded |
| WNBA | Hardcoded (shared with NBA) |
| NWSL | Hardcoded (shared with MLS) |

---

## General Guidelines

### Rate Limiting

All scrapers implement:

1. **Default delay**: 1 second between requests
2. **Auto-detection**: Detects HTTP 429 (Too Many Requests) responses
3. **Exponential backoff**: Starts at 1 second and doubles after each failed attempt, up to 3 retries
4. **Connection pooling**: Reuses HTTP connections for efficiency
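
Items 2 and 3 of the list above (429 detection and exponential backoff) can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: `fetch_with_backoff` and `RateLimitedError` are hypothetical names, and the HTTP call is passed in as a plain callable returning `(status_code, body)` so the example stays independent of any particular HTTP library.

```python
import time
from typing import Callable


class RateLimitedError(Exception):
    """Raised when the server still answers HTTP 429 after all retries."""


def fetch_with_backoff(
    fetch: Callable[[str], tuple[int, str]],
    url: str,
    max_retries: int = 3,
    initial_backoff: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url); on HTTP 429, wait, double the wait, and retry.

    Any non-429 response is returned as-is; other error statuses are left
    to the caller in this sketch.
    """
    backoff = initial_backoff
    for attempt in range(max_retries + 1):   # one initial try plus max_retries retries
        status, body = fetch(url)
        if status != 429:
            return body
        if attempt < max_retries:
            sleep(backoff)                   # 1s, 2s, 4s with the defaults
            backoff *= 2
    raise RateLimitedError(f"{url} still rate limited after {max_retries} retries")
```

Injecting `sleep` makes the backoff schedule easy to unit-test without real waiting; in production use the default `time.sleep` applies.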

### Error Handling

- **Partial data**: If a source fails mid-scrape, partial data is discarded
- **Source fallback**: Automatically tries the next source on failure
- **Logging**: All errors are logged for debugging
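
The three rules above, combined with the priority order described under Source Priority, can be sketched as a single loop. This is a minimal illustration rather than the package's actual code: `scrape_with_fallback`, `AllSourcesFailed`, and the logger name are hypothetical, and each source is modeled as a callable that returns (or yields) game rows.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger("scraper_sketch")  # hypothetical logger name


class AllSourcesFailed(Exception):
    """Raised when every configured source raised an error."""


def scrape_with_fallback(
    sources: Iterable[tuple[str, Callable[[], Iterable[dict]]]],
) -> tuple[str, list[dict]]:
    """Try sources in priority order and return (source_name, games) from the first success."""
    tried = []
    for name, scrape in sources:
        tried.append(name)
        try:
            # list() consumes the whole result first; if the source fails
            # mid-scrape, the exception propagates and the partial rows are dropped.
            games = list(scrape())
        except Exception as exc:
            logger.warning("source %s failed: %s", name, exc)
            continue                      # fall back to the next source
        return name, games
    raise AllSourcesFailed(f"all sources failed: {tried}")
```

Collecting the full result before returning is what keeps a half-scraped season from being mixed with rows from a different source.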

### Data Freshness

| Data Type | Freshness |
|-----------|-----------|
| Games (future) | Check weekly during season |
| Games (past) | Final scores available within hours |
| Teams | Update at start of each season |
| Stadiums | Update when venues change |

### Geographic Filter

Games at venues outside USA, Canada, and Mexico are automatically filtered out:

- **NFL**: London, Frankfurt, Munich, Mexico City, São Paulo
- **NHL**: Prague, Stockholm, Helsinki, Tampere, Gothenburg
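
A filter of this kind could look like the sketch below, which applies the country rule stated at the top of this section. The `venue` field, the venue-to-country lookup, and the function name are assumptions made for the example; how the package actually resolves a venue's country is not described here.

```python
# Countries whose venues are kept, per the rule stated above.
ALLOWED_COUNTRIES = frozenset({"USA", "Canada", "Mexico"})


def filter_games_by_country(
    games: list[dict],
    venue_country: dict[str, str],
) -> list[dict]:
    """Keep only games whose venue maps to an allowed country.

    Venues missing from the lookup are dropped too. That is an assumption of
    this sketch; a real pipeline might flag them for manual review instead.
    """
    return [
        game for game in games
        if venue_country.get(game["venue"]) in ALLOWED_COUNTRIES
    ]


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    lookup = {"Lambeau Field": "USA", "Tottenham Hotspur Stadium": "England"}
    sample = [
        {"id": "game_a", "venue": "Lambeau Field"},
        {"id": "game_b", "venue": "Tottenham Hotspur Stadium"},
    ]
    print(filter_games_by_country(sample, lookup))
```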

---

## Legal Considerations

This tool is designed for personal/educational use. When using these sources:

1. Respect robots.txt files
2. Don't make excessive requests
3. Cache responses when possible
4. Check each source's Terms of Service
5. Consider that schedule data may be copyrighted

The ESPN API is undocumented but publicly accessible. Sports-Reference sites allow scraping but request reasonable rate limiting.
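
For the first guideline, Python's standard library already includes a robots.txt parser. The sketch below checks a URL against robots.txt content that has already been downloaded; the helper name, the user agent string, and the sample rules are made up for illustration and do not reflect any real site's policy.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())   # parse text we already have; no network call
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Made-up rules for illustration only.
    sample_rules = "User-agent: *\nDisallow: /private/\n"
    print(is_allowed(sample_rules, "example-bot", "https://example.com/leagues/schedule.html"))
    print(is_allowed(sample_rules, "example-bot", "https://example.com/private/data.html"))
```

Parsing cached robots.txt text rather than refetching it on every request also fits the guideline about caching responses.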