Data Scraping System
This document describes the SportsTime schedule scraping system, including all data sources, the fallback architecture, and operational procedures.
Overview
The scraping system (Scripts/scrape_schedules.py) fetches game schedules for 7 sports leagues from multiple data sources. It uses a multi-source fallback architecture for reliability: if one source fails or returns fewer games than its configured minimum, the system automatically tries the next source in priority order.
Supported Sports
| Sport | League | Season Format | Typical Games |
|---|---|---|---|
| NBA | National Basketball Association | 2024-25 | ~1,230 |
| MLB | Major League Baseball | 2025 | ~2,430 |
| NHL | National Hockey League | 2024-25 | ~1,312 |
| NFL | National Football League | 2025-26 | ~272 |
| WNBA | Women's National Basketball Association | 2025 | ~200 |
| MLS | Major League Soccer | 2025 | ~500 |
| NWSL | National Women's Soccer League | 2025 | ~180 |
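The "Typical Games" figures for the four largest leagues follow from simple arithmetic: each game involves two teams, so the league total is teams × games per team ÷ 2. The sketch below reproduces those four numbers. The team counts and per-team schedule lengths are not stated in this document and are supplied here as assumptions; `regular_season_games` is a hypothetical helper, not part of the scraper.

```python
def regular_season_games(teams: int, games_per_team: int) -> int:
    """Total league games when every team plays `games_per_team` games."""
    total_team_games = teams * games_per_team
    # Each game is counted once for each of its two participants.
    return total_team_games // 2


# (teams, games per team): assumed values, not taken from the scraper.
LEAGUE_SHAPES = {
    "NBA": (30, 82),
    "MLB": (30, 162),
    "NHL": (32, 82),
    "NFL": (32, 17),
}

for league, (teams, per_team) in LEAGUE_SHAPES.items():
    print(league, regular_season_games(teams, per_team))
```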
Data Sources by Sport
Each sport has 3 data sources configured in priority order. The scraper tries sources sequentially until one returns sufficient data.
NBA (National Basketball Association)
| Priority | Source | URL Pattern | Min Games |
|---|---|---|---|
| 1 | Basketball-Reference | basketball-reference.com/leagues/NBA_{year}_games-{month}.html | 500 |
| 2 | ESPN API | site.api.espn.com/apis/site/v2/sports/basketball/nba/scoreboard | 500 |
| 3 | CBS Sports | cbssports.com/nba/schedule/ | 100 |
Notes:
- Basketball-Reference is most reliable for historical data
- ESPN API provides real-time updates but may have rate limits
- CBS Sports as emergency fallback
MLB (Major League Baseball)
| Priority | Source | URL Pattern | Min Games |
|---|---|---|---|
| 1 | MLB Stats API | statsapi.mlb.com/api/v1/schedule | 1,000 |
| 2 | Baseball-Reference | baseball-reference.com/leagues/majors/{year}-schedule.shtml | 500 |
| 3 | ESPN API | site.api.espn.com/apis/site/v2/sports/baseball/mlb/scoreboard | 500 |
Notes:
- MLB Stats API is official and most complete
- Baseball-Reference good for historical seasons
- Rate limit: 1 request/second for all sources
NHL (National Hockey League)
| Priority | Source | URL Pattern | Min Games |
|---|---|---|---|
| 1 | Hockey-Reference | hockey-reference.com/leagues/NHL_{year}_games.html | 500 |
| 2 | ESPN API | site.api.espn.com/apis/site/v2/sports/hockey/nhl/scoreboard | 500 |
| 3 | NHL API | api-web.nhle.com/v1/schedule/{date} | 100 |
Notes:
- Hockey-Reference uses season format like "2025" for 2024-25 season
- NHL API is official but documentation is limited
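Because Hockey-Reference identifies a season by its ending year while the Supported Sports table shows split-year labels such as 2024-25, a small conversion helper can be handy. A minimal sketch; `season_label` is a hypothetical name, not a function from scrape_schedules.py.

```python
def season_label(end_year: int) -> str:
    """Format a season ending year as a split-year label, e.g. 2025 -> '2024-25'."""
    start_year = end_year - 1
    # Two-digit suffix of the ending year, zero-padded (2000 -> '00').
    return f"{start_year}-{end_year % 100:02d}"


print(season_label(2025))  # 2024-25
```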
NFL (National Football League)
| Priority | Source | URL Pattern | Min Games |
|---|---|---|---|
| 1 | ESPN API | site.api.espn.com/apis/site/v2/sports/football/nfl/scoreboard | 200 |
| 2 | Pro-Football-Reference | pro-football-reference.com/years/{year}/games.htm | 200 |
| 3 | CBS Sports | cbssports.com/nfl/schedule/ | 100 |
Notes:
- ESPN provides week-by-week schedule data
- PFR has complete historical archives
- Season runs September-February (crosses calendar years)
WNBA (Women's National Basketball Association)
| Priority | Source | URL Pattern | Min Games |
|---|---|---|---|
| 1 | ESPN API | site.api.espn.com/apis/site/v2/sports/basketball/wnba/scoreboard | 100 |
| 2 | Basketball-Reference | basketball-reference.com/wnba/years/{year}_games.html | 100 |
| 3 | CBS Sports | cbssports.com/wnba/schedule/ | 50 |
Notes:
- WNBA season runs May-September
- Fewer games than NBA (12 teams, 40-game season)
MLS (Major League Soccer)
| Priority | Source | URL Pattern | Min Games |
|---|---|---|---|
| 1 | ESPN API | site.api.espn.com/apis/site/v2/sports/soccer/usa.1/scoreboard | 200 |
| 2 | FBref | fbref.com/en/comps/22/{year}/schedule/ | 100 |
| 3 | MLSSoccer.com | mlssoccer.com/schedule/scores | 100 |
Notes:
- ESPN's league ID for MLS is usa.1
- FBref may block automated requests (403 errors)
- Season runs February-November
NWSL (National Women's Soccer League)
| Priority | Source | URL Pattern | Min Games |
|---|---|---|---|
| 1 | ESPN API | site.api.espn.com/apis/site/v2/sports/soccer/usa.nwsl/scoreboard | 100 |
| 2 | FBref | fbref.com/en/comps/182/{year}/schedule/ | 50 |
| 3 | NWSL.com | nwslsoccer.com/schedule | 50 |
Notes:
- ESPN's league ID for NWSL is usa.nwsl
- 14 teams, ~180 regular season games
Fallback Architecture
ScraperSource Configuration
from dataclasses import dataclass
from typing import Callable

@dataclass
class ScraperSource:
    name: str                            # Display name (e.g., "ESPN")
    scraper_func: Callable[[int], list]  # Function taking season year
    priority: int = 1                    # Lower = higher priority
    min_games: int = 10                  # Minimum to consider success
Fallback Logic
def scrape_with_fallback(sport, season, sources):
    sources = sorted(sources, key=lambda s: s.priority)
    for source in sources:
        try:
            games = source.scraper_func(season)
            if len(games) >= source.min_games:
                return games  # Success!
        except Exception:
            continue  # Try next source
    return []  # All sources failed
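The listing above swallows failures silently, whereas a real run reports each attempt. The following self-contained sketch adds per-attempt logging and distinguishes "raised an exception" from "returned too few games". The message wording and the two stand-in scraper functions are illustrative assumptions, not the script's actual code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ScraperSource:
    name: str
    scraper_func: Callable[[int], list]
    priority: int = 1
    min_games: int = 10


def scrape_with_fallback(sport: str, season: int, sources: list) -> list:
    """Try each source in priority order; return the first sufficient result."""
    ordered = sorted(sources, key=lambda s: s.priority)
    print(f"SCRAPING {sport} {season}")
    for index, source in enumerate(ordered, start=1):
        print(f"[{index}/{len(ordered)}] Trying {source.name}...")
        try:
            games = source.scraper_func(season)
        except Exception as exc:
            print(f"FAILED: {source.name}: {exc}")
            continue  # Try next source
        if len(games) >= source.min_games:
            print(f"OK: {source.name} returned {len(games)} games")
            return games
        # Too few games counts as a failure as well.
        print(f"FAILED: {source.name} returned only {len(games)} games "
              f"(need {source.min_games})")
    return []  # All sources failed


# Stand-in scrapers for the demo (hypothetical, no network access).
def failing_source(season: int) -> list:
    raise TimeoutError("Connection timeout")


def working_source(season: int) -> list:
    return [f"game-{n}" for n in range(2430)]


demo_sources = [
    ScraperSource("Baseball-Reference", working_source, priority=2, min_games=500),
    ScraperSource("MLB Stats API", failing_source, priority=1, min_games=1000),
]
demo_games = scrape_with_fallback("MLB", 2026, demo_sources)
```

The sources are deliberately listed out of order to show that priority, not list position, decides which one is tried first.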
Example Output
SCRAPING NBA 2026
============================================================
[1/3] Trying Basketball-Reference...
✓ Basketball-Reference returned 1230 games
SCRAPING MLB 2026
============================================================
[1/3] Trying MLB Stats API...
✗ MLB Stats API failed: Connection timeout
[2/3] Trying Baseball-Reference...
✓ Baseball-Reference returned 2430 games
Usage
Command Line Interface
# Scrape all sports for 2026 season
python scrape_schedules.py --sport all --season 2026
# Scrape specific sport
python scrape_schedules.py --sport nba --season 2026
python scrape_schedules.py --sport mlb --season 2026
# Scrape only stadiums (legacy method)
python scrape_schedules.py --stadiums-only
# Scrape comprehensive stadium data for all 7 sports
python scrape_schedules.py --stadiums-update
# Custom output directory
python scrape_schedules.py --sport all --season 2026 --output ./custom_data
Available Options
| Option | Values | Default | Description |
|---|---|---|---|
| --sport | nba, mlb, nhl, nfl, wnba, mls, nwsl, all | all | Sport(s) to scrape |
| --season | Year (int) | 2026 | Season ending year |
| --stadiums-only | Flag | False | Only scrape stadium data (legacy method) |
| --stadiums-update | Flag | False | Scrape ALL stadium data for all 7 sports |
| --output | Path | ./data | Output directory |
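The options table maps directly onto an argparse parser. The sketch below is one plausible way to declare it, reconstructed from the table alone; the actual parser in scrape_schedules.py may differ in details such as help text or the description string, and `build_parser` is a hypothetical name.

```python
import argparse

SPORT_CHOICES = ["nba", "mlb", "nhl", "nfl", "wnba", "mls", "nwsl", "all"]


def build_parser() -> argparse.ArgumentParser:
    """Declare the CLI options listed in the table above."""
    parser = argparse.ArgumentParser(description="Scrape sports schedules")
    parser.add_argument("--sport", choices=SPORT_CHOICES, default="all",
                        help="Sport(s) to scrape")
    parser.add_argument("--season", type=int, default=2026,
                        help="Season ending year")
    # Flags default to False and become True when present.
    parser.add_argument("--stadiums-only", action="store_true",
                        help="Only scrape stadium data (legacy method)")
    parser.add_argument("--stadiums-update", action="store_true",
                        help="Scrape ALL stadium data for all 7 sports")
    parser.add_argument("--output", default="./data",
                        help="Output directory")
    return parser


# Parse an explicit argument list instead of sys.argv for the demo.
args = build_parser().parse_args(["--sport", "nba", "--season", "2026"])
print(args.sport, args.season, args.stadiums_only, args.output)
```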
Output Format
Directory Structure
data/
├── games.json # All games from all sports
├── stadiums.json # All stadium/venue data
└── teams.json # Team metadata (generated)
Game JSON Schema
{
  "id": "NBA-2025-26-BOS-LAL-20251225",
  "sport": "NBA",
  "homeTeam": "Los Angeles Lakers",
  "awayTeam": "Boston Celtics",
  "homeTeamId": "LAL",
  "awayTeamId": "BOS",
  "date": "2025-12-25T20:00:00Z",
  "venue": "Crypto.com Arena",
  "city": "Los Angeles",
  "state": "CA"
}
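A quick structural check on scraped records can catch parsing regressions before upload. This is a minimal sketch that assumes every field in the example record is required and that timestamps are always UTC with a trailing Z; neither assumption is confirmed by the scraper itself, and `game_record_problems` is a hypothetical helper.

```python
from datetime import datetime

# Field names taken from the example record above.
REQUIRED_GAME_FIELDS = (
    "id", "sport", "homeTeam", "awayTeam", "homeTeamId",
    "awayTeamId", "date", "venue", "city", "state",
)


def game_record_problems(record: dict) -> list:
    """Return human-readable problems; an empty list means the record looks valid."""
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_GAME_FIELDS
        if name not in record
    ]
    if "date" in record:
        try:
            # Assumed timestamp format: YYYY-MM-DDTHH:MM:SSZ (UTC).
            datetime.strptime(record["date"], "%Y-%m-%dT%H:%M:%SZ")
        except (TypeError, ValueError):
            problems.append(f"bad date: {record['date']!r}")
    return problems
```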
Stadium JSON Schema
{
  "id": "crypto-com-arena",
  "name": "Crypto.com Arena",
  "city": "Los Angeles",
  "state": "CA",
  "latitude": 34.0430,
  "longitude": -118.2673,
  "sports": ["NBA", "NHL"],
  "teams": ["Los Angeles Lakers", "Los Angeles Kings"]
}
Stable Game IDs
Games are assigned stable IDs using the pattern:
{SPORT}-{SEASON}-{AWAY}-{HOME}-{DATE}
Example: NBA-2025-26-LAL-BOS-20251225
This ensures:
- Same game gets same ID across scraper runs
- IDs survive if scraper source changes
- CloudKit records can be updated (not duplicated)
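Because the ID is built only from properties of the game itself, any source that reports the same matchup and date yields the same string. A minimal sketch of the pattern; `make_game_id` is a hypothetical helper, and passing the season as a preformatted label is an assumption about how the script represents it.

```python
from datetime import date


def make_game_id(sport: str, season: str, away: str, home: str, game_date: date) -> str:
    """Build an ID of the form {SPORT}-{SEASON}-{AWAY}-{HOME}-{DATE}."""
    return "-".join([
        sport.upper(),
        season,
        away.upper(),
        home.upper(),
        game_date.strftime("%Y%m%d"),  # Date as YYYYMMDD
    ])


print(make_game_id("NBA", "2025-26", "LAL", "BOS", date(2025, 12, 25)))
# NBA-2025-26-LAL-BOS-20251225
```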
Rate Limiting
All scrapers implement rate limiting to avoid being blocked:
| Source Type | Rate Limit | Implementation |
|---|---|---|
| Sports-Reference family | 1 req/sec | time.sleep(1) between requests |
| ESPN API | 2 req/sec | time.sleep(0.5) between date ranges |
| Official APIs (MLB, NHL) | 1 req/sec | time.sleep(1) between requests |
| CBS Sports | 1 req/sec | time.sleep(1) between pages |
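The table describes fixed time.sleep() calls between requests. A slightly different technique, shown here as an alternative rather than as what the scrapers do, is to sleep only for the time still remaining in the interval, so slow responses do not add extra delay. `RateLimiter` is a hypothetical class; the injectable clock and sleep arguments exist only to make it testable.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # Time of the previous request, if any

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# Typical use: one limiter per source, called before every request.
# limiter = RateLimiter(1.0)
# for url in urls:
#     limiter.wait()
#     response = fetch(url)
```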
Error Handling
Common Errors
| Error | Cause | Resolution |
|---|---|---|
| 403 Forbidden | Rate limited or blocked | Wait 5 min, reduce request rate |
| Connection timeout | Network issue | Retry, check connectivity |
| 0 games returned | Off-season or parsing error | Check if season has started |
| KeyError in parsing | Website structure changed | Update scraper selectors |
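For the "Connection timeout" row, the retry can be automated with a small wrapper. This sketch is not taken from the script: `fetch_with_retry` is a hypothetical helper, and it retries on Python's built-in TimeoutError and ConnectionError so that it has no third-party dependency. With the requests library one would pass requests.exceptions.Timeout and requests.exceptions.ConnectionError as `retry_on` instead.

```python
import time


def fetch_with_retry(fetch, attempts: int = 3, delay: float = 2.0,
                     retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    Only exceptions listed in retry_on are retried; anything else propagates.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on as exc:
            last_error = exc
            if attempt < attempts:
                # Linear backoff: wait longer after each failed attempt.
                sleep(delay * attempt)
    raise last_error
```

If every attempt fails, the last exception is re-raised, which lets the fallback logic move on to the next source.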
Fallback Behavior
- If primary source fails → Try source #2
- If source #2 fails → Try source #3
- If all sources fail → Log warning, return empty list
- Script continues to next sport (doesn't abort)
Adding New Sources
1. Create Scraper Function
import requests

def scrape_newsport_newsource(season: int) -> list[Game]:
    """Scrape NewSport schedule from NewSource."""
    games = []
    url = f"https://newsource.com/schedule/{season}"
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()  # Raise on HTTP errors so the fallback can try the next source
    # Parse response...
    return games
2. Register in main()
if args.sport in ['newsport', 'all']:
    sources = [
        ScraperSource('Primary', scrape_newsport_primary, priority=1, min_games=100),
        ScraperSource('NewSource', scrape_newsport_newsource, priority=2, min_games=50),
        ScraperSource('Backup', scrape_newsport_backup, priority=3, min_games=25),
    ]
    games = scrape_with_fallback('NEWSPORT', args.season, sources)
3. Add to CLI choices
parser.add_argument('--sport', choices=[..., 'newsport', 'all'])
Maintenance
Monthly Tasks
- Run full scrape to update schedules
- Check for 403 errors indicating blocked sources
- Verify game counts match expected totals
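The game-count check can be scripted. The sketch below compares per-sport counts against the approximate totals from the Supported Sports table. It assumes games.json is a JSON array of game records, each with a "sport" field as in the Game JSON Schema; the 10% tolerance and the name `count_mismatches` are arbitrary illustrative choices.

```python
from collections import Counter

# Approximate regular-season totals from the Supported Sports table.
EXPECTED_GAMES = {
    "NBA": 1230, "MLB": 2430, "NHL": 1312, "NFL": 272,
    "WNBA": 200, "MLS": 500, "NWSL": 180,
}


def count_mismatches(games: list, tolerance: float = 0.10) -> dict:
    """Return {sport: (actual, expected)} for sports outside the tolerance band."""
    actual = Counter(game["sport"] for game in games)
    mismatches = {}
    for sport, expected in EXPECTED_GAMES.items():
        # Counter returns 0 for sports that are missing entirely.
        if abs(actual[sport] - expected) > expected * tolerance:
            mismatches[sport] = (actual[sport], expected)
    return mismatches


# Example use against the scraper output (assumed to be a JSON array):
# import json
# with open("data/games.json") as f:
#     print(count_mismatches(json.load(f)))
```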
Seasonal Tasks
- Update season year in scripts
- Check for website structure changes
- Verify new teams/venues are included
When Sources Break
- Check if website changed structure (inspect HTML)
- Update CSS selectors or JSON paths
- If permanently broken, add new backup source
- Update min_games thresholds if needed
Dependencies
requests>=2.28.0
beautifulsoup4>=4.11.0
lxml>=4.9.0
Install with:
cd Scripts && pip install -r requirements.txt
CloudKit Integration
After scraping, data is uploaded to CloudKit via:
python cloudkit_import.py
This syncs:
- Games → CanonicalGame records
- Stadiums → CanonicalStadium records
- Teams → CanonicalTeam records
The iOS app then syncs from CloudKit to local SwiftData storage.