# Data Scraping System

This document describes the SportsTime schedule scraping system, including all data sources, the fallback architecture, and operational procedures.

## Overview

The scraping system (`Scripts/scrape_schedules.py`) fetches game schedules for 7 sports leagues from multiple data sources. It uses a **multi-source fallback architecture** to ensure reliability: if one source fails or returns insufficient data, the system automatically tries backup sources.

## Supported Sports

| Sport | League | Season Format | Typical Games |
|-------|--------|---------------|---------------|
| NBA | National Basketball Association | 2024-25 | ~1,230 |
| MLB | Major League Baseball | 2025 | ~2,430 |
| NHL | National Hockey League | 2024-25 | ~1,312 |
| NFL | National Football League | 2025-26 | ~272 |
| WNBA | Women's National Basketball Association | 2025 | ~200 |
| MLS | Major League Soccer | 2025 | ~500 |
| NWSL | National Women's Soccer League | 2025 | ~180 |

## Data Sources by Sport

Each sport has 3 data sources configured in priority order. The scraper tries sources sequentially until one returns sufficient data.
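The Season Format column above mixes cross-year labels (NBA, NHL, NFL) with single-year labels. A minimal sketch of a normalization helper, assuming seasons are identified by their ending year (as the `--season` option does); the function and constant names are hypothetical, not part of `scrape_schedules.py`:

```python
# Hypothetical helper: build a display label from a sport and the
# season's ending year, matching the "Season Format" column above.
CROSS_YEAR_SPORTS = {"NBA", "NHL", "NFL"}  # seasons span two calendar years

def season_label(sport: str, end_year: int) -> str:
    if sport.upper() in CROSS_YEAR_SPORTS:
        # e.g. end_year 2025 -> "2024-25"
        return f"{end_year - 1}-{end_year % 100:02d}"
    return str(end_year)
```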
### NBA (National Basketball Association)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | Basketball-Reference | `basketball-reference.com/leagues/NBA_{year}_games-{month}.html` | 500 |
| 2 | ESPN API | `site.api.espn.com/apis/site/v2/sports/basketball/nba/scoreboard` | 500 |
| 3 | CBS Sports | `cbssports.com/nba/schedule/` | 100 |

**Notes:**

- Basketball-Reference is the most reliable source for historical data
- The ESPN API provides real-time updates but may have rate limits
- CBS Sports serves as an emergency fallback

### MLB (Major League Baseball)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | MLB Stats API | `statsapi.mlb.com/api/v1/schedule` | 1,000 |
| 2 | Baseball-Reference | `baseball-reference.com/leagues/majors/{year}-schedule.shtml` | 500 |
| 3 | ESPN API | `site.api.espn.com/apis/site/v2/sports/baseball/mlb/scoreboard` | 500 |

**Notes:**

- The MLB Stats API is official and the most complete
- Baseball-Reference is good for historical seasons
- Rate limit: 1 request/second for all sources

### NHL (National Hockey League)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | Hockey-Reference | `hockey-reference.com/leagues/NHL_{year}_games.html` | 500 |
| 2 | ESPN API | `site.api.espn.com/apis/site/v2/sports/hockey/nhl/scoreboard` | 500 |
| 3 | NHL API | `api-web.nhle.com/v1/schedule/{date}` | 100 |

**Notes:**

- Hockey-Reference uses a season format like "2025" for the 2024-25 season
- The NHL API is official, but its documentation is limited

### NFL (National Football League)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/football/nfl/scoreboard` | 200 |
| 2 | Pro-Football-Reference | `pro-football-reference.com/years/{year}/games.htm` | 200 |
| 3 | CBS Sports | `cbssports.com/nfl/schedule/` | 100 |

**Notes:**

- ESPN provides week-by-week schedule data
- PFR has complete historical archives
- The season runs September-February (crosses calendar years)

### WNBA (Women's National Basketball Association)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/basketball/wnba/scoreboard` | 100 |
| 2 | Basketball-Reference | `basketball-reference.com/wnba/years/{year}_games.html` | 100 |
| 3 | CBS Sports | `cbssports.com/wnba/schedule/` | 50 |

**Notes:**

- The WNBA season runs May-September
- Fewer games than the NBA (12 teams, 40-game season)

### MLS (Major League Soccer)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/soccer/usa.1/scoreboard` | 200 |
| 2 | FBref | `fbref.com/en/comps/22/{year}/schedule/` | 100 |
| 3 | MLSSoccer.com | `mlssoccer.com/schedule/scores` | 100 |

**Notes:**

- ESPN's league ID for MLS is `usa.1`
- FBref may block automated requests (403 errors)
- The season runs February-November

### NWSL (National Women's Soccer League)

| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/soccer/usa.nwsl/scoreboard` | 100 |
| 2 | FBref | `fbref.com/en/comps/182/{year}/schedule/` | 50 |
| 3 | NWSL.com | `nwslsoccer.com/schedule` | 50 |

**Notes:**

- ESPN's league ID for the NWSL is `usa.nwsl`
- 14 teams, ~180 regular-season games

## Fallback Architecture

### ScraperSource Configuration

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ScraperSource:
    name: str                            # Display name (e.g., "ESPN")
    scraper_func: Callable[[int], list]  # Function taking the season year
    priority: int = 1                    # Lower = higher priority
    min_games: int = 10                  # Minimum to consider success
```

### Fallback Logic

```python
def scrape_with_fallback(sport, season, sources):
    sources = sorted(sources, key=lambda s: s.priority)
    for source in sources:
        try:
            games = source.scraper_func(season)
            if len(games) >= source.min_games:
                return games  # Success!
        except Exception:
            continue  # Try the next source
    return []  # All sources failed
```

### Example Output

```
SCRAPING NBA 2026
============================================================
[1/3] Trying Basketball-Reference...
✓ Basketball-Reference returned 1230 games

SCRAPING MLB 2026
============================================================
[1/3] Trying MLB Stats API...
✗ MLB Stats API failed: Connection timeout
[2/3] Trying Baseball-Reference...
✓ Baseball-Reference returned 2430 games
```

## Usage

### Command Line Interface

```bash
# Scrape all sports for the 2026 season
python scrape_schedules.py --sport all --season 2026

# Scrape a specific sport
python scrape_schedules.py --sport nba --season 2026
python scrape_schedules.py --sport mlb --season 2026

# Scrape only stadiums (legacy method)
python scrape_schedules.py --stadiums-only

# Scrape comprehensive stadium data for all 7 sports
python scrape_schedules.py --stadiums-update

# Custom output directory
python scrape_schedules.py --sport all --season 2026 --output ./custom_data
```

### Available Options

| Option | Values | Default | Description |
|--------|--------|---------|-------------|
| `--sport` | `nba`, `mlb`, `nhl`, `nfl`, `wnba`, `mls`, `nwsl`, `all` | `all` | Sport(s) to scrape |
| `--season` | Year (int) | `2026` | Season ending year |
| `--stadiums-only` | Flag | False | Only scrape stadium data (legacy method) |
| `--stadiums-update` | Flag | False | Scrape ALL stadium data for all 7 sports |
| `--output` | Path | `./data` | Output directory |

## Output Format

### Directory Structure

```
data/
├── games.json      # All games from all sports
├── stadiums.json   # All stadium/venue data
└── teams.json      # Team metadata (generated)
```

### Game JSON Schema

```json
{
  "id": "NBA-2025-26-LAL-BOS-20251225",
  "sport": "NBA",
  "homeTeam": "Los Angeles Lakers",
  "awayTeam": "Boston Celtics",
  "homeTeamId": "LAL",
  "awayTeamId": "BOS",
  "date": "2025-12-25T20:00:00Z",
  "venue": "Crypto.com Arena",
  "city": "Los Angeles",
  "state": "CA"
}
```

### Stadium JSON Schema

```json
{
  "id": "crypto-com-arena",
  "name": "Crypto.com Arena",
  "city": "Los Angeles",
  "state": "CA",
  "latitude": 34.0430,
  "longitude": -118.2673,
  "sports": ["NBA", "NHL"],
  "teams": ["Los Angeles Lakers", "Los Angeles Kings", "Los Angeles Clippers"]
}
```

## Stable Game IDs

Games are assigned stable IDs using the pattern:

```
{SPORT}-{SEASON}-{AWAY}-{HOME}-{DATE}
```

Example: `NBA-2025-26-LAL-BOS-20251225`

This ensures:

- The same game gets the same ID across scraper runs
- IDs survive if the scraper source changes
- CloudKit records can be updated (not duplicated)

## Rate Limiting

All scrapers implement rate limiting to avoid being blocked:

| Source Type | Rate Limit | Implementation |
|-------------|------------|----------------|
| Sports-Reference family | 1 req/sec | `time.sleep(1)` between requests |
| ESPN API | 2 req/sec | `time.sleep(0.5)` between date ranges |
| Official APIs (MLB, NHL) | 1 req/sec | `time.sleep(1)` between requests |
| CBS Sports | 1 req/sec | `time.sleep(1)` between pages |

## Error Handling

### Common Errors

| Error | Cause | Resolution |
|-------|-------|------------|
| `403 Forbidden` | Rate limited or blocked | Wait 5 min, reduce request rate |
| `Connection timeout` | Network issue | Retry, check connectivity |
| `0 games returned` | Off-season or parsing error | Check if the season has started |
| `KeyError` in parsing | Website structure changed | Update scraper selectors |

### Fallback Behavior

1. If the primary source fails → try source #2
2. If source #2 fails → try source #3
3. If all sources fail → log a warning, return an empty list
4. The script continues to the next sport (it doesn't abort)

## Adding New Sources

### 1. Create Scraper Function

```python
def scrape_newsport_newsource(season: int) -> list[Game]:
    """Scrape the NewSport schedule from NewSource."""
    games = []
    url = f"https://newsource.com/schedule/{season}"
    # requests and HEADERS are module-level in scrape_schedules.py
    response = requests.get(url, headers=HEADERS)
    # Parse the response into Game objects...
    return games
```

### 2. Register in main()

```python
if args.sport in ['newsport', 'all']:
    sources = [
        ScraperSource('Primary', scrape_newsport_primary, priority=1, min_games=100),
        ScraperSource('NewSource', scrape_newsport_newsource, priority=2, min_games=50),
        ScraperSource('Backup', scrape_newsport_backup, priority=3, min_games=25),
    ]
    games = scrape_with_fallback('NEWSPORT', args.season, sources)
```

### 3. Add to CLI choices

```python
parser.add_argument('--sport', choices=[..., 'newsport', 'all'])
```

## Maintenance

### Monthly Tasks

- Run a full scrape to update schedules
- Check for 403 errors indicating blocked sources
- Verify that game counts match expected totals

### Seasonal Tasks

- Update the season year in scripts
- Check for website structure changes
- Verify that new teams/venues are included

### When Sources Break

1. Check whether the website changed structure (inspect the HTML)
2. Update CSS selectors or JSON paths
3. If the source is permanently broken, add a new backup source
4. Update `min_games` thresholds if needed

## Dependencies

```
requests>=2.28.0
beautifulsoup4>=4.11.0
lxml>=4.9.0
```

Install with:

```bash
cd Scripts && pip install -r requirements.txt
```

## CloudKit Integration

After scraping, data is uploaded to CloudKit via:

```bash
python cloudkit_import.py
```

This syncs:

- Games → `CanonicalGame` records
- Stadiums → `CanonicalStadium` records
- Teams → `CanonicalTeam` records

The iOS app then syncs from CloudKit to local SwiftData storage.
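`cloudkit_import.py` itself is not shown here. As a rough sketch of its mapping step, assuming records are assembled as CloudKit Web Services-style payloads before upload (the function name and exact field layout are hypothetical, not taken from the script):

```python
# Hypothetical sketch: map a scraped game dict onto a CloudKit Web
# Services record payload. The stable game ID becomes the recordName,
# so re-importing updates existing records instead of duplicating them.
def game_to_record(game: dict) -> dict:
    return {
        "recordType": "CanonicalGame",
        "recordName": game["id"],
        "fields": {k: {"value": v} for k, v in game.items() if k != "id"},
    }

record = game_to_record({
    "id": "NBA-2025-26-LAL-BOS-20251225",
    "sport": "NBA",
    "homeTeam": "Los Angeles Lakers",
})
```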