Sportstime/Scripts/sportstime_parser/SOURCES.md
Trey t eeaf900e5a feat(scripts): rewrite parser as modular Python CLI
Replace monolithic scraping scripts with sportstime_parser package:

- Multi-source scrapers with automatic fallback for 7 sports
- Canonical ID generation for games, teams, and stadiums
- Fuzzy matching with configurable thresholds for name resolution
- CloudKit Web Services uploader with JWT auth, diff-based updates
- Resumable uploads with checkpoint state persistence
- Validation reports with manual review items and suggested matches
- Comprehensive test suite (249 tests)

CLI: sportstime-parser scrape|validate|upload|status|retry|clear

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 21:06:12 -06:00


Data Sources

This document lists all data sources used by the SportsTime parser, including URLs, rate limits, and data freshness expectations.

Source Priority

Each sport has multiple sources configured in priority order. The scraper tries each source in order and uses the first one that succeeds. If a source fails (network error, parsing error, etc.), it falls back to the next source.
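The fallback behavior described above can be sketched roughly as follows. This is an illustration only, not the package's actual code: `scrape_with_fallback` and the `(name, fetch)` pair shape are hypothetical names chosen for the example.

```python
import logging
from typing import Callable, Iterable, Optional

logger = logging.getLogger("sources_sketch")  # hypothetical logger name

Game = dict  # placeholder record type for this sketch


def scrape_with_fallback(
    sources: Iterable[tuple[str, Callable[[], list[Game]]]],
) -> tuple[Optional[str], list[Game]]:
    """Try each (name, fetch) pair in priority order and return the first success.

    Returns (None, []) when every source fails.
    """
    for name, fetch in sources:
        try:
            games = fetch()
        except Exception as exc:  # network error, parsing error, etc.
            logger.warning("%s failed (%s); trying next source", name, exc)
            continue
        return name, games
    return None, []
```

Passing the sources as an ordered list keeps the priority explicit: reordering the list is all it takes to change which source is tried first.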


NBA (National Basketball Association)

Teams: 30
Expected Games: ~1,230 per season
Season: October - June (spans two calendar years)

Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | Basketball-Reference | `basketball-reference.com/leagues/NBA_{YEAR}_games-{month}.html` | HTML |
| 2 | ESPN API | `site.api.espn.com/apis/site/v2/sports/basketball/nba/scoreboard` | JSON |
| 3 | CBS Sports | `cbssports.com/nba/schedule/` | HTML |

Rate Limits

  • Basketball-Reference: ~1 request/second recommended
  • ESPN API: No published limit; use 1 request/second to be safe
  • CBS Sports: ~1 request/second recommended

Notes

  • Basketball-Reference is the most reliable source with complete historical data
  • ESPN API is good for current/future seasons
  • Games organized by month on Basketball-Reference
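Because the schedule is split by month, one season means one request per month page. The sketch below builds those URLs from the pattern in the table. It makes several assumptions: `{YEAR}` is the year the season ends in, `{month}` is a lowercase English month name, the site is served from `https://www.`, and the season covers October through June as listed above. `nba_month_urls` is a hypothetical helper name.

```python
# Months of an NBA season in schedule order (October through June).
NBA_MONTHS = [
    "october", "november", "december", "january", "february",
    "march", "april", "may", "june",
]

# Assumed base URL; the table above lists the host without a scheme.
BASE_URL = "https://www.basketball-reference.com/leagues"


def nba_month_urls(season_end_year: int) -> list[str]:
    """Return one schedule URL per month for the season ending in season_end_year."""
    return [
        f"{BASE_URL}/NBA_{season_end_year}_games-{month}.html"
        for month in NBA_MONTHS
    ]
```

A month page may not exist for every season (for example, shortened seasons), so a caller would likely treat a missing page as "no games" rather than as a source failure.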

MLB (Major League Baseball)

Teams: 30
Expected Games: ~2,430 per season
Season: March/April - October/November (single calendar year)

Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | Baseball-Reference | `baseball-reference.com/leagues/majors/{YEAR}-schedule.shtml` | HTML |
| 2 | MLB Stats API | `statsapi.mlb.com/api/v1/schedule` | JSON |
| 3 | ESPN API | `site.api.espn.com/apis/site/v2/sports/baseball/mlb/scoreboard` | JSON |

Rate Limits

  • Baseball-Reference: ~1 request/second recommended
  • MLB Stats API: No published limit; use 0.5 requests/second (one request every 2 seconds)
  • ESPN API: ~1 request/second

Notes

  • MLB has doubleheaders; games are suffixed with _1, _2
  • Single schedule page per season on Baseball-Reference
  • MLB Stats API allows date range queries for efficiency
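The doubleheader suffixing can be illustrated with a small sketch. The ID layout (`mlb_{date}_{away}_{home}`), the record fields, and the choice to suffix only games that are part of a doubleheader are all assumptions made for this example; the parser's real canonical ID format may differ.

```python
from collections import defaultdict


def assign_game_ids(games: list[dict]) -> list[str]:
    """Return one ID per game, adding _1, _2, ... when a matchup repeats on a date.

    Each game dict is assumed to have "date", "away", "home", and "start" keys.
    """
    # Group game indexes by (date, away, home) to find doubleheaders.
    groups: dict[tuple, list[int]] = defaultdict(list)
    for idx, game in enumerate(games):
        groups[(game["date"], game["away"], game["home"])].append(idx)

    ids = [""] * len(games)
    for (date, away, home), idxs in groups.items():
        base = f"mlb_{date}_{away}_{home}"
        if len(idxs) == 1:
            ids[idxs[0]] = base  # single game: no suffix in this sketch
            continue
        # Order by start time so the earlier game gets _1.
        idxs.sort(key=lambda i: games[i]["start"])
        for number, i in enumerate(idxs, start=1):
            ids[i] = f"{base}_{number}"
    return ids
```

Sorting by start time keeps the suffixes stable regardless of the order in which a source lists the two games.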

NFL (National Football League)

Teams: 32
Expected Games: ~272 per season (regular season only)
Season: September - February (spans two calendar years)

Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/football/nfl/scoreboard` | JSON |
| 2 | Pro-Football-Reference | `pro-football-reference.com/years/{YEAR}/games.htm` | HTML |
| 3 | CBS Sports | `cbssports.com/nfl/schedule/` | HTML |

Rate Limits

  • ESPN API: ~1 request/second
  • Pro-Football-Reference: ~1 request/second
  • CBS Sports: ~1 request/second

Notes

  • ESPN API uses week numbers instead of dates
  • International games (London, Mexico City, Frankfurt, etc.) are filtered out
  • Scraped schedules include preseason, regular-season, and playoff games; the expected game count above covers the regular season only

NHL (National Hockey League)

Teams: 32 (including Utah Hockey Club)
Expected Games: ~1,312 per season
Season: October - June (spans two calendar years)

Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | Hockey-Reference | `hockey-reference.com/leagues/NHL_{YEAR}_games.html` | HTML |
| 2 | NHL API | `api-web.nhle.com/v1/schedule/{date}` | JSON |
| 3 | ESPN API | `site.api.espn.com/apis/site/v2/sports/hockey/nhl/scoreboard` | JSON |

Rate Limits

  • Hockey-Reference: ~1 request/second
  • NHL API: No published limit; use 0.5 requests/second (one request every 2 seconds)
  • ESPN API: ~1 request/second

Notes

  • International games (Prague, Stockholm, Helsinki, etc.) are filtered out
  • Single schedule page per season on Hockey-Reference

MLS (Major League Soccer)

Teams: 30 (including San Diego FC)
Expected Games: ~493 per season
Season: February/March - October/November (single calendar year)

Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/soccer/usa.1/scoreboard` | JSON |
| 2 | FBref | `fbref.com/en/comps/22/{YEAR}/schedule/` | HTML |

Rate Limits

  • ESPN API: ~1 request/second
  • FBref: ~1 request/second

Notes

  • MLS runs within a single calendar year
  • Some teams share stadiums with NFL teams

WNBA (Women's National Basketball Association)

Teams: 13 (including Golden State Valkyries)
Expected Games: ~220 per season
Season: May - October (single calendar year)

Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/basketball/wnba/scoreboard` | JSON |

Rate Limits

  • ESPN API: ~1 request/second

Notes

  • Many WNBA teams share arenas with NBA teams
  • Teams and stadiums are hardcoded (smaller league)

NWSL (National Women's Soccer League)

Teams: 14
Expected Games: ~182 per season
Season: March - November (single calendar year)

Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/soccer/usa.nwsl/scoreboard` | JSON |

Rate Limits

  • ESPN API: ~1 request/second

Notes

  • Many NWSL teams share stadiums with MLS teams
  • Teams and stadiums are hardcoded (smaller league)

Stadium Data Sources

Stadium coordinates and metadata come from multiple sources:

| Sport | Sources |
|-------|---------|
| MLB | MLBScoreBot GitHub, cageyjames GeoJSON, hardcoded |
| NFL | NFLScoreBot GitHub, brianhatchl GeoJSON, hardcoded |
| NBA | Hardcoded |
| NHL | Hardcoded |
| MLS | gavinr GeoJSON, hardcoded |
| WNBA | Hardcoded (shared with NBA) |
| NWSL | Hardcoded (shared with MLS) |

General Guidelines

Rate Limiting

All scrapers implement:

  1. Default delay: 1 second between requests
  2. Auto-detection: Detects HTTP 429 (Too Many Requests) responses
  3. Exponential backoff: Waits 1 second before the first retry and doubles the wait on each subsequent retry, up to 3 retries
  4. Connection pooling: Reuses HTTP connections for efficiency
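The 429 detection and backoff steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the scrapers' actual code: `fetch_with_backoff` is a hypothetical name, and the HTTP call is abstracted as a `send` callable returning `(status_code, body)` so the retry logic stays independent of any particular HTTP library.

```python
import time
from typing import Callable


def fetch_with_backoff(
    send: Callable[[], tuple[int, str]],
    max_retries: int = 3,
    initial_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call send() and retry on HTTP 429, doubling the wait before each retry.

    Any non-429 response is returned as-is; handling other error statuses
    is left to the caller in this sketch.
    """
    delay = initial_delay
    for attempt in range(max_retries + 1):  # first attempt plus max_retries retries
        status, body = send()
        if status != 429:
            return body
        if attempt == max_retries:
            break  # out of retries; do not sleep again
        sleep(delay)
        delay *= 2  # 1s, 2s, 4s with the defaults
    raise RuntimeError(f"still rate limited after {max_retries} retries")
```

Injecting `sleep` as a parameter makes the waits easy to observe or skip in tests without changing the retry logic.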

Error Handling

  • Partial data: If a source fails mid-scrape, partial data is discarded
  • Source fallback: Automatically tries the next source on failure
  • Logging: All errors are logged for debugging

Data Freshness

| Data Type | Freshness |
|-----------|-----------|
| Games (future) | Check weekly during season |
| Games (past) | Final scores available within hours |
| Teams | Update at start of each season |
| Stadiums | Update when venues change |

Geographic Filter

Games at venues outside USA, Canada, and Mexico are automatically filtered out:

  • NFL: London, Frankfurt, Munich, Mexico City, São Paulo
  • NHL: Prague, Stockholm, Helsinki, Tampere, Gothenburg
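A minimal sketch of a country-based filter follows. It assumes each game record carries a `venue_country` ISO code, which is a hypothetical field; the parser may decide venue location differently (for example, from a stadium lookup). The allowed set mirrors the countries named in the rule above.

```python
# Countries whose venues are kept, per the rule stated above (assumed ISO codes).
ALLOWED_COUNTRIES = frozenset({"US", "CA", "MX"})


def split_by_geography(
    games: list[dict],
    allowed: frozenset = ALLOWED_COUNTRIES,
) -> tuple[list[dict], list[dict]]:
    """Return (kept, filtered_out) based on each game's venue_country.

    Games with a missing venue_country are filtered out in this sketch.
    """
    kept, filtered_out = [], []
    for game in games:
        if game.get("venue_country") in allowed:
            kept.append(game)
        else:
            filtered_out.append(game)
    return kept, filtered_out
```

Returning the filtered-out games alongside the kept ones makes it straightforward to log or report what was dropped.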

This tool is designed for personal/educational use. When using these sources:

  1. Respect robots.txt files
  2. Don't make excessive requests
  3. Cache responses when possible
  4. Check each source's Terms of Service
  5. Consider that schedule data may be copyrighted
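As one way to act on the robots.txt guideline, the sketch below checks a URL against robots.txt rules using Python's standard `urllib.robotparser`. The `is_allowed` helper and the user-agent string in the usage note are hypothetical; the robots.txt text is passed in directly so the check itself needs no network access.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from already-downloaded text
    return parser.can_fetch(user_agent, url)
```

For example, given rules that disallow `/private/` for all agents, `is_allowed(rules, "sportstime-parser", "https://example.com/private/page")` would return `False`. Caching the downloaded robots.txt per host avoids re-fetching it before every request.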

The ESPN API is undocumented but publicly accessible. Sports-Reference sites allow scraping but request reasonable rate limiting.