Data Scraping System

This document describes the SportsTime schedule scraping system, including all data sources, the fallback architecture, and operational procedures.

Overview

The scraping system (Scripts/scrape_schedules.py) fetches game schedules for 7 sports leagues from multiple data sources. It uses a multi-source fallback architecture for reliability: if one source fails or returns fewer games than its minimum threshold, the system automatically tries the next backup source.

Supported Sports

| Sport | League | Season Format | Typical Games |
|-------|--------|---------------|---------------|
| NBA | National Basketball Association | 2024-25 | ~1,230 |
| MLB | Major League Baseball | 2025 | ~2,430 |
| NHL | National Hockey League | 2024-25 | ~1,312 |
| NFL | National Football League | 2025-26 | ~272 |
| WNBA | Women's National Basketball Association | 2025 | ~200 |
| MLS | Major League Soccer | 2025 | ~500 |
| NWSL | National Women's Soccer League | 2025 | ~180 |

Data Sources by Sport

Each sport has 3 data sources configured in priority order. The scraper tries sources sequentially until one returns sufficient data.

NBA (National Basketball Association)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | Basketball-Reference | basketball-reference.com/leagues/NBA_{year}_games-{month}.html | 500 |
| 2 | ESPN API | site.api.espn.com/apis/site/v2/sports/basketball/nba/scoreboard | 500 |
| 3 | CBS Sports | cbssports.com/nba/schedule/ | 100 |

Notes:

  • Basketball-Reference is most reliable for historical data
  • ESPN API provides real-time updates but may have rate limits
  • CBS Sports serves as an emergency fallback
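
The Basketball-Reference pattern in the table is split by month, so a scraper has to expand both {year} and {month} before fetching. The sketch below shows one way to do that. The helper name, the https://www. prefix, the lowercase month names, and the October-to-April month list are all assumptions for illustration, not details taken from scrape_schedules.py.

```python
# Hypothetical helper: expand the Basketball-Reference URL pattern from the table.
# Assumptions: the site is served from https://www., {month} is a lowercase
# English month name, and the regular season spans October through April.
NBA_URL_PATTERN = (
    "https://www.basketball-reference.com/leagues/NBA_{year}_games-{month}.html"
)
NBA_MONTHS = [
    "october", "november", "december", "january", "february", "march", "april",
]


def nba_month_urls(season: int) -> list[str]:
    """Return one schedule URL per month for the season ending in `season`."""
    return [NBA_URL_PATTERN.format(year=season, month=m) for m in NBA_MONTHS]
```

Calling nba_month_urls(2026) yields seven URLs, one per month, which the scraper would then fetch with the rate limit described later in this document.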

MLB (Major League Baseball)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | MLB Stats API | statsapi.mlb.com/api/v1/schedule | 1,000 |
| 2 | Baseball-Reference | baseball-reference.com/leagues/majors/{year}-schedule.shtml | 500 |
| 3 | ESPN API | site.api.espn.com/apis/site/v2/sports/baseball/mlb/scoreboard | 500 |

Notes:

  • MLB Stats API is official and most complete
  • Baseball-Reference is good for historical seasons
  • Rate limit: 1 request/second for the MLB Stats API and Baseball-Reference; the ESPN fallback uses the delay listed under Rate Limiting

NHL (National Hockey League)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | Hockey-Reference | hockey-reference.com/leagues/NHL_{year}_games.html | 500 |
| 2 | ESPN API | site.api.espn.com/apis/site/v2/sports/hockey/nhl/scoreboard | 500 |
| 3 | NHL API | api-web.nhle.com/v1/schedule/{date} | 100 |

Notes:

  • Hockey-Reference identifies a season by its ending year (e.g., "2025" for the 2024-25 season)
  • NHL API is official but documentation is limited
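
Because Hockey-Reference (like the --season option) is keyed by the ending year while this document displays seasons as "2024-25", a small conversion helper can keep the two forms consistent. This is a minimal sketch; the function name and the spans_two_years flag are hypothetical.

```python
def season_label(end_year: int, spans_two_years: bool = True) -> str:
    """Format a season's ending year as a display label.

    Leagues whose season crosses a calendar-year boundary get "YYYY-YY";
    single-year leagues (e.g. MLB) just get the year itself.
    """
    if not spans_two_years:
        return str(end_year)
    return f"{end_year - 1}-{end_year % 100:02d}"
```

For example, season_label(2025) gives "2024-25", and season_label(2025, spans_two_years=False) gives "2025".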

NFL (National Football League)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | site.api.espn.com/apis/site/v2/sports/football/nfl/scoreboard | 200 |
| 2 | Pro-Football-Reference | pro-football-reference.com/years/{year}/games.htm | 200 |
| 3 | CBS Sports | cbssports.com/nfl/schedule/ | 100 |

Notes:

  • ESPN provides week-by-week schedule data
  • PFR has complete historical archives
  • Season runs September-February (crosses calendar years)

WNBA (Women's National Basketball Association)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | site.api.espn.com/apis/site/v2/sports/basketball/wnba/scoreboard | 100 |
| 2 | Basketball-Reference | basketball-reference.com/wnba/years/{year}_games.html | 100 |
| 3 | CBS Sports | cbssports.com/wnba/schedule/ | 50 |

Notes:

  • WNBA season runs May-September
  • Fewer games than NBA (13 teams and a 44-game regular season in 2025)

MLS (Major League Soccer)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | site.api.espn.com/apis/site/v2/sports/soccer/usa.1/scoreboard | 200 |
| 2 | FBref | fbref.com/en/comps/22/{year}/schedule/ | 100 |
| 3 | MLSSoccer.com | mlssoccer.com/schedule/scores | 100 |

Notes:

  • ESPN's league ID for MLS is usa.1
  • FBref may block automated requests (403 errors)
  • Season runs February-November

NWSL (National Women's Soccer League)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | site.api.espn.com/apis/site/v2/sports/soccer/usa.nwsl/scoreboard | 100 |
| 2 | FBref | fbref.com/en/comps/182/{year}/schedule/ | 50 |
| 3 | NWSL.com | nwslsoccer.com/schedule | 50 |

Notes:

  • ESPN's league ID for NWSL is usa.nwsl
  • 14 teams, ~180 regular season games

Fallback Architecture

ScraperSource Configuration

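from dataclasses import dataclass
from typing import Callable

@dataclass
class ScraperSource:
    """One schedule source and the threshold for accepting its result."""
    name: str                              # Display name (e.g., "ESPN")
    scraper_func: Callable[[int], list]    # Function taking season year
    priority: int = 1                      # Lower = higher priority
    min_games: int = 10                    # Minimum to consider success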

Fallback Logic

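def scrape_with_fallback(sport, season, sources):
    """Try each source in priority order; return the first sufficient result."""
    sources = sorted(sources, key=lambda s: s.priority)
    print(f"SCRAPING {sport} {season}")
    print("=" * 60)

    for i, source in enumerate(sources, start=1):
        print(f"  [{i}/{len(sources)}] Trying {source.name}...")
        try:
            games = source.scraper_func(season)
        except Exception as exc:
            print(f"  ✗ {source.name} failed: {exc}")
            continue  # Try next source
        if len(games) >= source.min_games:
            print(f"  ✓ {source.name} returned {len(games)} games")
            return games  # Success!
        # Too few games counts as a failure, so fall through to the next source
        print(f"  ✗ {source.name} returned only {len(games)} games")

    return []  # All sources failed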
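
To make the fallback behavior concrete, the sketch below runs the same logic against three stand-in scrapers: one that raises, one that returns too few games, and one that succeeds. The stub functions and the "DEMO" sport are hypothetical. ScraperSource and scrape_with_fallback are restated in compact form only so the snippet runs on its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ScraperSource:  # restated from above so the snippet is self-contained
    name: str
    scraper_func: Callable[[int], list]
    priority: int = 1
    min_games: int = 10


def scrape_with_fallback(sport, season, sources):  # same logic as above
    for source in sorted(sources, key=lambda s: s.priority):
        try:
            games = source.scraper_func(season)
            if len(games) >= source.min_games:
                return games
        except Exception:
            continue
    return []


# Hypothetical stand-in scrapers; real ones would fetch and parse a schedule.
def failing_source(season):
    raise TimeoutError("Connection timeout")


def sparse_source(season):
    return ["game"] * 3  # below min_games, so it is treated as a failure


def healthy_source(season):
    return [f"game-{i}" for i in range(50)]


# Deliberately listed out of order: priority, not list position, decides.
sources = [
    ScraperSource("Healthy", healthy_source, priority=3, min_games=25),
    ScraperSource("Failing", failing_source, priority=1, min_games=25),
    ScraperSource("Sparse", sparse_source, priority=2, min_games=25),
]

games = scrape_with_fallback("DEMO", 2026, sources)
print(len(games))  # prints 50: the first two sources are skipped
```

Note that a source returning fewer than min_games is handled the same way as one that raises, which is why the thresholds in the per-sport tables matter.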

Example Output

SCRAPING NBA 2026
============================================================
  [1/3] Trying Basketball-Reference...
  ✓ Basketball-Reference returned 1230 games

SCRAPING MLB 2026
============================================================
  [1/3] Trying MLB Stats API...
  ✗ MLB Stats API failed: Connection timeout
  [2/3] Trying Baseball-Reference...
  ✓ Baseball-Reference returned 2430 games

Usage

Command Line Interface

# Scrape all sports for 2026 season
python scrape_schedules.py --sport all --season 2026

# Scrape specific sport
python scrape_schedules.py --sport nba --season 2026
python scrape_schedules.py --sport mlb --season 2026

# Scrape only stadiums (legacy method)
python scrape_schedules.py --stadiums-only

# Scrape comprehensive stadium data for all 7 sports
python scrape_schedules.py --stadiums-update

# Custom output directory
python scrape_schedules.py --sport all --season 2026 --output ./custom_data

Available Options

| Option | Values | Default | Description |
|--------|--------|---------|-------------|
| --sport | nba, mlb, nhl, nfl, wnba, mls, nwsl, all | all | Sport(s) to scrape |
| --season | Year (int) | 2026 | Season ending year |
| --stadiums-only | Flag | False | Only scrape stadium data (legacy method) |
| --stadiums-update | Flag | False | Scrape ALL stadium data for all 7 sports |
| --output | Path | ./data | Output directory |
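
The options table maps directly onto an argparse parser. The following is a minimal sketch of what such a parser could look like. It is reconstructed from the table alone, so the function name, help strings, and description are assumptions and the actual parser in scrape_schedules.py may differ.

```python
import argparse

SPORTS = ["nba", "mlb", "nhl", "nfl", "wnba", "mls", "nwsl", "all"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(description="Scrape sports schedules")
    parser.add_argument("--sport", choices=SPORTS, default="all",
                        help="Sport(s) to scrape")
    parser.add_argument("--season", type=int, default=2026,
                        help="Season ending year")
    # store_true flags default to False, matching the table
    parser.add_argument("--stadiums-only", action="store_true",
                        help="Only scrape stadium data (legacy method)")
    parser.add_argument("--stadiums-update", action="store_true",
                        help="Scrape stadium data for all sports")
    parser.add_argument("--output", default="./data",
                        help="Output directory")
    return parser
```

Using choices for --sport means an unsupported value is rejected by argparse before any scraping starts.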

Output Format

Directory Structure

data/
├── games.json          # All games from all sports
├── stadiums.json       # All stadium/venue data
└── teams.json          # Team metadata (generated)

Game JSON Schema

{
  "id": "NBA-2025-26-BOS-LAL-20251225",
  "sport": "NBA",
  "homeTeam": "Los Angeles Lakers",
  "awayTeam": "Boston Celtics",
  "homeTeamId": "LAL",
  "awayTeamId": "BOS",
  "date": "2025-12-25T20:00:00Z",
  "venue": "Crypto.com Arena",
  "city": "Los Angeles",
  "state": "CA"
}
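
Consumers of games.json may want to validate records before using them. The sketch below checks that every documented key is present and that the date is in the UTC format shown in the schema. GameRecord, parse_game, and dump_game are hypothetical names; the script's own Game type is not shown in this document and may be structured differently.

```python
import json
from dataclasses import asdict, dataclass, fields
from datetime import datetime


@dataclass
class GameRecord:
    """Hypothetical container whose fields mirror the Game JSON keys."""
    id: str
    sport: str
    homeTeam: str
    awayTeam: str
    homeTeamId: str
    awayTeamId: str
    date: str  # UTC timestamp such as "2025-12-25T20:00:00Z"
    venue: str
    city: str
    state: str


def parse_game(record: dict) -> GameRecord:
    """Validate one decoded JSON object and return it as a GameRecord."""
    expected = {f.name for f in fields(GameRecord)}
    missing = expected - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    # Raises ValueError if the timestamp is not in the documented format.
    datetime.strptime(record["date"], "%Y-%m-%dT%H:%M:%SZ")
    return GameRecord(**{name: record[name] for name in expected})


def dump_game(game: GameRecord) -> str:
    """Serialize a record back to JSON using the same key names."""
    return json.dumps(asdict(game))
```

Extra keys in the input are ignored rather than rejected, which keeps the check tolerant of fields added later.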

Stadium JSON Schema

{
  "id": "crypto-com-arena",
  "name": "Crypto.com Arena",
  "city": "Los Angeles",
  "state": "CA",
  "latitude": 34.0430,
  "longitude": -118.2673,
  "sports": ["NBA", "NHL"],
  "teams": ["Los Angeles Lakers", "Los Angeles Kings"]
}

Stable Game IDs

Games are assigned stable IDs using the pattern:

{SPORT}-{SEASON}-{AWAY}-{HOME}-{DATE}

Example: NBA-2025-26-BOS-LAL-20251225 (Boston Celtics at Los Angeles Lakers)

This ensures:

  • Same game gets same ID across scraper runs
  • IDs survive if scraper source changes
  • CloudKit records can be updated (not duplicated)
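
A minimal sketch of how the ID pattern and the "updated, not duplicated" property fit together is shown below. Both function names are hypothetical, and the argument order follows the {AWAY}-{HOME} order stated in the pattern.

```python
from datetime import date


def make_game_id(sport: str, season: str, away: str, home: str,
                 game_date: date) -> str:
    """Build an ID following {SPORT}-{SEASON}-{AWAY}-{HOME}-{DATE}."""
    return (
        f"{sport.upper()}-{season}-{away.upper()}-{home.upper()}"
        f"-{game_date:%Y%m%d}"
    )


def merge_games(existing: dict, scraped: list) -> dict:
    """Upsert scraped games (dicts with an "id" key) into a mapping keyed by ID.

    A game seen in an earlier run is overwritten in place, so re-running the
    scraper, or switching sources, does not create a second record.
    """
    merged = dict(existing)
    for game in scraped:
        merged[game["id"]] = game
    return merged
```

Because the ID depends only on sport, season, teams, and date, two games between the same teams on the same day (such as a doubleheader) would share an ID under this sketch. How the actual script handles that case is not covered in this document.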

Rate Limiting

All scrapers implement rate limiting to avoid being blocked:

| Source Type | Rate Limit | Implementation |
|-------------|------------|----------------|
| Sports-Reference family | 1 req/sec | time.sleep(1) between requests |
| ESPN API | up to 2 req/sec | time.sleep(0.5) between date ranges |
| Official APIs (MLB, NHL) | 1 req/sec | time.sleep(1) between requests |
| CBS Sports | 1 req/sec | time.sleep(1) between pages |
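
The table describes a fixed time.sleep call between requests. As an alternative illustration, the sketch below enforces a minimum interval and sleeps only for the remaining time, so slow responses are not penalized twice. This is a swapped-in variant rather than the script's implementation, and the RateLimiter class is hypothetical.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval: float, clock=time.monotonic,
                 sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable for testing
        self._sleep = sleep      # injectable for testing
        self._last = None        # time of the previous call, if any

    def wait(self) -> float:
        """Sleep if the last call was too recent; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay:
                self._sleep(delay)
        self._last = self._clock()
        return delay
```

A scraper would create one limiter per source, for example RateLimiter(1.0) for the Sports-Reference family, and call wait() immediately before each request.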

Error Handling

Common Errors

| Error | Cause | Resolution |
|-------|-------|------------|
| 403 Forbidden | Rate limited or blocked | Wait 5 min, reduce request rate |
| Connection timeout | Network issue | Retry, check connectivity |
| 0 games returned | Off-season or parsing error | Check if season has started |
| KeyError in parsing | Website structure changed | Update scraper selectors |
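
The "Retry" resolution for timeouts can be automated with a small wrapper. The sketch below retries a callable with a growing delay and re-raises after the final attempt. The function and its parameters are hypothetical and are not part of scrape_schedules.py.

```python
import time


def retry(func, attempts: int = 3, delay: float = 1.0, backoff: float = 2.0,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(); on a listed exception, wait and try again.

    The delay is multiplied by `backoff` after each failure. The exception
    from the last attempt is re-raised so the caller (for example the
    fallback loop) can move on to the next source.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= backoff
```

When wrapping calls made with the requests library, pass retry_on=(requests.exceptions.RequestException,), since its timeout and connection errors do not derive from the built-in ConnectionError or TimeoutError.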

Fallback Behavior

  1. If the primary source raises an error or returns fewer than min_games games → Try source #2
  2. If source #2 also fails or falls short → Try source #3
  3. If all sources fail → Log warning, return empty list
  4. Script continues to next sport (doesn't abort)

Adding New Sources

1. Create Scraper Function

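def scrape_newsport_newsource(season: int) -> list[Game]:
    """Scrape NewSport schedule from NewSource."""
    games = []
    url = f"https://newsource.com/schedule/{season}"

    # A timeout keeps a stalled source from blocking the fallback chain
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()  # Raise on 4xx/5xx so the next source is tried
    # Parse response and append Game objects to games...

    return games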

2. Register in main()

if args.sport in ['newsport', 'all']:
    sources = [
        ScraperSource('Primary', scrape_newsport_primary, priority=1, min_games=100),
        ScraperSource('NewSource', scrape_newsport_newsource, priority=2, min_games=50),
        ScraperSource('Backup', scrape_newsport_backup, priority=3, min_games=25),
    ]
    games = scrape_with_fallback('NEWSPORT', args.season, sources)

3. Add to CLI choices

parser.add_argument('--sport', choices=[..., 'newsport', 'all'])

Maintenance

Monthly Tasks

  • Run full scrape to update schedules
  • Check for 403 errors indicating blocked sources
  • Verify game counts match expected totals
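
The count check in the last task can be scripted. The sketch below tallies games per sport and flags any league whose total deviates from the expected figure by more than a tolerance. It assumes games.json is a JSON array of objects following the Game JSON Schema. The function names and the 10% tolerance are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def load_games(path: str) -> list:
    """Load games, assuming the file holds a JSON array of game objects."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def count_games_by_sport(games: list) -> Counter:
    """Tally games using each record's "sport" field."""
    return Counter(game["sport"] for game in games)


def check_counts(counts, expected: dict, tolerance: float = 0.1) -> list:
    """Return one warning per sport whose count is off by more than tolerance."""
    problems = []
    for sport, target in expected.items():
        actual = counts.get(sport, 0)
        if abs(actual - target) > tolerance * target:
            problems.append(f"{sport}: expected ~{target}, found {actual}")
    return problems


# Example usage (requires a scraped data/games.json):
#   counts = count_games_by_sport(load_games("data/games.json"))
#   for line in check_counts(counts, {"NBA": 1230, "NFL": 272}):
#       print(line)
```

The expected totals would come from the Supported Sports table. A sport missing from the file entirely is reported with a count of 0.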

Seasonal Tasks

  • Update season year in scripts
  • Check for website structure changes
  • Verify new teams/venues are included

When Sources Break

  1. Check if website changed structure (inspect HTML)
  2. Update CSS selectors or JSON paths
  3. If permanently broken, add new backup source
  4. Update min_games thresholds if needed

Dependencies

requests>=2.28.0
beautifulsoup4>=4.11.0
lxml>=4.9.0

Install with:

cd Scripts && pip install -r requirements.txt

CloudKit Integration

After scraping, data is uploaded to CloudKit via:

python cloudkit_import.py

This syncs:

  • Games → CanonicalGame records
  • Stadiums → CanonicalStadium records
  • Teams → CanonicalTeam records

The iOS app then syncs from CloudKit to local SwiftData storage.