Data Sources
This document lists all data sources used by the SportsTime parser, including URLs, rate limits, and data freshness expectations.
Source Priority
Each sport has multiple sources configured in priority order. The scraper tries each source in order and uses the first one that succeeds. If a source fails (network error, parsing error, etc.), it falls back to the next source.
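The fallback behavior described above can be sketched as follows. This is a minimal illustration, not the parser's actual code: the function name `scrape_with_fallback`, the `Game` record type, and the two demo sources are hypothetical, and treating an empty result as a failure is an assumption.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Game = dict  # hypothetical record type for one scheduled game


def scrape_with_fallback(
    sources: Iterable[Tuple[str, Callable[[], List[Game]]]],
) -> Tuple[Optional[str], List[Game]]:
    """Try each (name, fetch) pair in priority order; return the first success."""
    for name, fetch in sources:
        try:
            games = fetch()
        except Exception as exc:  # network error, parsing error, etc.
            print(f"{name} failed: {exc}")
            continue
        if games:  # assumption: an empty result also counts as a failure
            return name, games
    return None, []


def failing_source() -> List[Game]:
    # Stands in for a source that is down or returns unparseable data.
    raise ConnectionError("simulated network error")


def working_source() -> List[Game]:
    # Stands in for a lower-priority source that succeeds.
    return [{"home": "BOS", "away": "NYK"}]


used, games = scrape_with_fallback(
    [("primary", failing_source), ("secondary", working_source)]
)
print(used, len(games))
```

Because each source returns a complete list or raises, nothing from a failed source leaks into the result, which matches the "partial data is discarded" rule under Error Handling.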
NBA (National Basketball Association)
- Teams: 30
- Expected Games: ~1,230 per season
- Season: October - June (spans two calendar years)
Sources
| Priority | Source | URL Pattern | Data Type |
|---|---|---|---|
| 1 | Basketball-Reference | basketball-reference.com/leagues/NBA_{YEAR}_games-{month}.html | HTML |
| 2 | ESPN API | site.api.espn.com/apis/site/v2/sports/basketball/nba/scoreboard | JSON |
| 3 | CBS Sports | cbssports.com/nba/schedule/ | HTML |
Rate Limits
- Basketball-Reference: ~1 request/second recommended
- ESPN API: No published limit; use 1 request/second to be safe
- CBS Sports: ~1 request/second recommended
Notes
- Basketball-Reference is the most reliable source with complete historical data
- ESPN API is good for current/future seasons
- Games organized by month on Basketball-Reference
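Since Basketball-Reference splits the schedule by month, one season means one request per month page. A minimal sketch of expanding the URL pattern from the table above, assuming `{YEAR}` is the calendar year in which the season ends, that `{month}` is the lowercase English month name, and that the pages are served over `https://www.`; the helper name `nba_month_urls` is hypothetical.

```python
from typing import List

# Months covered by an October - June season, in schedule order.
SEASON_MONTHS = [
    "october", "november", "december", "january", "february",
    "march", "april", "may", "june",
]


def nba_month_urls(year: int) -> List[str]:
    """Build one Basketball-Reference schedule URL per month of a season.

    Assumption: `year` is the year the season ends (e.g. 2026 for 2025-26).
    """
    base = "https://www.basketball-reference.com/leagues"
    return [f"{base}/NBA_{year}_games-{month}.html" for month in SEASON_MONTHS]


urls = nba_month_urls(2026)
for url in urls:
    print(url)
```

At the recommended ~1 request/second, fetching all nine pages takes roughly nine seconds per season.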
MLB (Major League Baseball)
- Teams: 30
- Expected Games: ~2,430 per season
- Season: March/April - October/November (single calendar year)
Sources
| Priority | Source | URL Pattern | Data Type |
|---|---|---|---|
| 1 | Baseball-Reference | baseball-reference.com/leagues/majors/{YEAR}-schedule.shtml | HTML |
| 2 | MLB Stats API | statsapi.mlb.com/api/v1/schedule | JSON |
| 3 | ESPN API | site.api.espn.com/apis/site/v2/sports/baseball/mlb/scoreboard | JSON |
Rate Limits
- Baseball-Reference: ~1 request/second recommended
- MLB Stats API: No published limit; use 0.5 requests/second (one request every 2 seconds)
- ESPN API: ~1 request/second
Notes
- MLB has doubleheaders; games are suffixed with _1, _2
- Single schedule page per season on Baseball-Reference
- MLB Stats API allows date range queries for efficiency
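The doubleheader suffixing noted above could look roughly like this. It is a sketch under stated assumptions: the ID layout `mlb_{date}_{away}_{home}`, the field names, and the choice to leave single games unsuffixed are all hypothetical, and the sample games are made up for illustration.

```python
from collections import defaultdict


def assign_game_ids(games):
    """Map a game ID to each game, suffixing doubleheader games with _1, _2, ...

    Each game is a dict with "date", "away", "home", and "start" keys
    (hypothetical field names).
    """
    by_matchup = defaultdict(list)
    for game in games:
        by_matchup[(game["date"], game["away"], game["home"])].append(game)

    ids = {}
    for (date, away, home), group in by_matchup.items():
        base = f"mlb_{date}_{away}_{home}"  # hypothetical ID layout
        if len(group) == 1:
            ids[base] = group[0]  # assumption: single games get no suffix
            continue
        # Number the games of a doubleheader by start time.
        ordered = sorted(group, key=lambda g: g["start"])
        for number, game in enumerate(ordered, start=1):
            ids[f"{base}_{number}"] = game
    return ids


# Made-up sample data: two games on the same date, one on the next day.
games = [
    {"date": "2025-07-04", "away": "nyy", "home": "bos", "start": "18:10"},
    {"date": "2025-07-04", "away": "nyy", "home": "bos", "start": "13:05"},
    {"date": "2025-07-05", "away": "nyy", "home": "bos", "start": "19:10"},
]
ids = assign_game_ids(games)
for game_id in sorted(ids):
    print(game_id, ids[game_id]["start"])
```

Ordering by start time keeps the suffixes stable across re-scrapes as long as the scheduled times do not change.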
NFL (National Football League)
- Teams: 32
- Expected Games: ~272 per season (regular season only)
- Season: September - February (spans two calendar years)
Sources
| Priority | Source | URL Pattern | Data Type |
|---|---|---|---|
| 1 | ESPN API | site.api.espn.com/apis/site/v2/sports/football/nfl/scoreboard | JSON |
| 2 | Pro-Football-Reference | pro-football-reference.com/years/{YEAR}/games.htm | HTML |
| 3 | CBS Sports | cbssports.com/nfl/schedule/ | HTML |
Rate Limits
- ESPN API: ~1 request/second
- Pro-Football-Reference: ~1 request/second
- CBS Sports: ~1 request/second
Notes
- ESPN API uses week numbers instead of dates
- International games (London, Mexico City, Frankfurt, etc.) are filtered out
- Includes preseason, regular season, and playoffs
NHL (National Hockey League)
- Teams: 32 (including Utah Hockey Club)
- Expected Games: ~1,312 per season
- Season: October - June (spans two calendar years)
Sources
| Priority | Source | URL Pattern | Data Type |
|---|---|---|---|
| 1 | Hockey-Reference | hockey-reference.com/leagues/NHL_{YEAR}_games.html | HTML |
| 2 | NHL API | api-web.nhle.com/v1/schedule/{date} | JSON |
| 3 | ESPN API | site.api.espn.com/apis/site/v2/sports/hockey/nhl/scoreboard | JSON |
Rate Limits
- Hockey-Reference: ~1 request/second
- NHL API: No published limit; use 0.5 requests/second (one request every 2 seconds)
- ESPN API: ~1 request/second
Notes
- International games (Prague, Stockholm, Helsinki, etc.) are filtered out
- Single schedule page per season on Hockey-Reference
MLS (Major League Soccer)
- Teams: 30 (including San Diego FC)
- Expected Games: ~493 per season
- Season: February/March - October/November (single calendar year)
Sources
| Priority | Source | URL Pattern | Data Type |
|---|---|---|---|
| 1 | ESPN API | site.api.espn.com/apis/site/v2/sports/soccer/usa.1/scoreboard | JSON |
| 2 | FBref | fbref.com/en/comps/22/{YEAR}/schedule/ | HTML |
Rate Limits
- ESPN API: ~1 request/second
- FBref: ~1 request/second
Notes
- MLS runs within a single calendar year
- Some teams share stadiums with NFL teams
WNBA (Women's National Basketball Association)
- Teams: 13 (including Golden State Valkyries)
- Expected Games: ~220 per season
- Season: May - October (single calendar year)
Sources
| Priority | Source | URL Pattern | Data Type |
|---|---|---|---|
| 1 | ESPN API | site.api.espn.com/apis/site/v2/sports/basketball/wnba/scoreboard | JSON |
Rate Limits
- ESPN API: ~1 request/second
Notes
- Many WNBA teams share arenas with NBA teams
- Teams and stadiums are hardcoded (smaller league)
NWSL (National Women's Soccer League)
- Teams: 14
- Expected Games: ~182 per season
- Season: March - November (single calendar year)
Sources
| Priority | Source | URL Pattern | Data Type |
|---|---|---|---|
| 1 | ESPN API | site.api.espn.com/apis/site/v2/sports/soccer/usa.nwsl/scoreboard | JSON |
Rate Limits
- ESPN API: ~1 request/second
Notes
- Many NWSL teams share stadiums with MLS teams
- Teams and stadiums are hardcoded (smaller league)
Stadium Data Sources
Stadium coordinates and metadata come from multiple sources:
| Sport | Sources |
|---|---|
| MLB | MLBScoreBot GitHub, cageyjames GeoJSON, hardcoded |
| NFL | NFLScoreBot GitHub, brianhatchl GeoJSON, hardcoded |
| NBA | Hardcoded |
| NHL | Hardcoded |
| MLS | gavinr GeoJSON, hardcoded |
| WNBA | Hardcoded (shared with NBA) |
| NWSL | Hardcoded (shared with MLS) |
General Guidelines
Rate Limiting
All scrapers implement:
- Default delay: 1 second between requests
- Auto-detection: Detects HTTP 429 (Too Many Requests) responses
- Exponential backoff: Starts at 1 second and doubles after each failed attempt, up to 3 retries
- Connection pooling: Reuses HTTP connections for efficiency
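The 429 detection and exponential backoff listed above can be sketched as follows. This is an illustration under assumptions, not the scrapers' actual code: `fetch_with_backoff` and its parameters are hypothetical, and the HTTP call is abstracted as a `fetch` callable returning `(status, body)` so the example runs without network access.

```python
import time
from typing import Callable, Tuple


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    max_retries: int = 3,
    initial_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `fetch`, retrying on HTTP 429 with a delay that doubles each time."""
    delay = initial_delay
    for attempt in range(max_retries + 1):  # first try plus up to 3 retries
        status, body = fetch()
        if status != 429:
            return body
        if attempt == max_retries:
            break  # out of retries; do not sleep again
        sleep(delay)
        delay *= 2  # 1s, 2s, 4s
    raise RuntimeError(f"still rate limited after {max_retries} retries")


# Simulated server: rate-limits twice, then answers.
responses = iter([(429, ""), (429, ""), (200, "ok")])
waits = []  # records the delays instead of actually sleeping
result = fetch_with_backoff(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

In a real scraper, `fetch` would wrap a request on a shared HTTP session (which is also how connections get reused), and the default 1-second delay between ordinary requests would be applied separately from this retry delay.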
Error Handling
- Partial data: If a source fails mid-scrape, partial data is discarded
- Source fallback: Automatically tries the next source on failure
- Logging: All errors are logged for debugging
Data Freshness
| Data Type | Freshness |
|---|---|
| Games (future) | Check weekly during season |
| Games (past) | Final scores available within hours |
| Teams | Update at start of each season |
| Stadiums | Update when venues change |
Geographic Filter
Games at venues outside USA, Canada, and Mexico are automatically filtered out:
- NFL: London, Frankfurt, Munich, Mexico City, São Paulo
- NHL: Prague, Stockholm, Helsinki, Tampere, Gothenburg
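One simple way to express this filter is a per-sport set of excluded venue cities, built from the lists above. This is a sketch only: the names `EXCLUDED_CITIES` and `filter_games` and the `city` field are hypothetical, and the real filter may instead check the venue's country or coordinates.

```python
from typing import Dict, List, Set

# Cities taken from the lists above; other sports have no exclusions here.
EXCLUDED_CITIES: Dict[str, Set[str]] = {
    "nfl": {"London", "Frankfurt", "Munich", "Mexico City", "São Paulo"},
    "nhl": {"Prague", "Stockholm", "Helsinki", "Tampere", "Gothenburg"},
}


def filter_games(sport: str, games: List[dict]) -> List[dict]:
    """Drop games whose venue city is on the sport's exclusion list."""
    excluded = EXCLUDED_CITIES.get(sport, set())
    return [game for game in games if game["city"] not in excluded]


# Made-up sample data for illustration.
nfl_games = [
    {"id": "game_a", "city": "Kansas City"},
    {"id": "game_b", "city": "London"},
    {"id": "game_c", "city": "São Paulo"},
]
kept = filter_games("nfl", nfl_games)
print([game["id"] for game in kept])
```

A city list is easy to audit but has to be extended whenever a league schedules a game in a new international venue.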
Legal Considerations
This tool is designed for personal/educational use. When using these sources:
- Respect robots.txt files
- Don't make excessive requests
- Cache responses when possible
- Check each source's Terms of Service
- Consider that schedule data may be copyrighted
The ESPN API is undocumented but publicly accessible. Sports-Reference sites allow scraping but request reasonable rate limiting.