Remove CFB/NASCAR/PGA and streamline to 8 supported sports

- Remove College Football, NASCAR, and PGA from scraper and app
- Clean all data files (stadiums, games, pipeline reports)
- Update Sport.swift enum and all UI components
- Add sportstime.py CLI tool for pipeline management
- Add DATA_SCRAPING.md documentation
- Add WNBA/MLS/NWSL implementation documentation
- Scraper now supports: NBA, MLB, NHL, NFL, WNBA, MLS, NWSL, CBB

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Trey t
2026-01-09 23:22:13 -06:00
parent f5e509a9ae
commit 8790d2ad73
35 changed files with 117819 additions and 65871 deletions

docs/DATA_SCRAPING.md Normal file

@@ -0,0 +1,368 @@
# Data Scraping System
This document describes the SportsTime schedule scraping system, including all data sources, the fallback architecture, and operational procedures.
## Overview
The scraping system (`Scripts/scrape_schedules.py`) fetches game schedules for 8 sports leagues from multiple data sources. It uses a **multi-source fallback architecture** to ensure reliability—if one source fails or returns insufficient data, the system automatically tries backup sources.
## Supported Sports
| Sport | League | Season Format | Typical Games |
|-------|--------|---------------|---------------|
| NBA | National Basketball Association | 2024-25 | ~1,230 |
| MLB | Major League Baseball | 2025 | ~2,430 |
| NHL | National Hockey League | 2024-25 | ~1,312 |
| NFL | National Football League | 2025-26 | ~272 |
| WNBA | Women's National Basketball Association | 2025 | ~286 |
| MLS | Major League Soccer | 2025 | ~500 |
| NWSL | National Women's Soccer League | 2025 | ~180 |
| CBB | NCAA Division I Basketball | 2025-26 | ~5,000+ |
## Data Sources by Sport
Each sport has 3 data sources configured in priority order. The scraper tries sources sequentially until one returns sufficient data.
### NBA (National Basketball Association)
| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | Basketball-Reference | `basketball-reference.com/leagues/NBA_{year}_games-{month}.html` | 500 |
| 2 | ESPN API | `site.api.espn.com/apis/site/v2/sports/basketball/nba/scoreboard` | 500 |
| 3 | CBS Sports | `cbssports.com/nba/schedule/` | 100 |
**Notes:**
- Basketball-Reference is most reliable for historical data
- ESPN API provides real-time updates but may have rate limits
- CBS Sports serves as an emergency fallback
### MLB (Major League Baseball)
| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | MLB Stats API | `statsapi.mlb.com/api/v1/schedule` | 1,000 |
| 2 | Baseball-Reference | `baseball-reference.com/leagues/majors/{year}-schedule.shtml` | 500 |
| 3 | ESPN API | `site.api.espn.com/apis/site/v2/sports/baseball/mlb/scoreboard` | 500 |
**Notes:**
- MLB Stats API is official and most complete
- Baseball-Reference good for historical seasons
- Rate limit: 1 request/second for all sources
### NHL (National Hockey League)
| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | Hockey-Reference | `hockey-reference.com/leagues/NHL_{year}_games.html` | 500 |
| 2 | ESPN API | `site.api.espn.com/apis/site/v2/sports/hockey/nhl/scoreboard` | 500 |
| 3 | NHL API | `api-web.nhle.com/v1/schedule/{date}` | 100 |
**Notes:**
- Hockey-Reference uses season format like "2025" for 2024-25 season
- NHL API is official but documentation is limited
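
The season-year convention noted above (a single ending year standing for a two-year season) can be made explicit with a small helper. This is a minimal sketch: `season_label` and its `spans_two_years` flag are hypothetical names, not functions taken from `scrape_schedules.py`.

```python
def season_label(end_year: int, spans_two_years: bool = True) -> str:
    """Format a season-ending year in the style of the Supported Sports table.

    Hypothetical helper: 2025 becomes "2024-25" for leagues whose season
    crosses calendar years, and stays "2025" for single-year leagues.
    """
    if not spans_two_years:
        return str(end_year)
    # Two-digit suffix of the ending year, zero-padded (e.g. 2000 -> "00").
    return f"{end_year - 1}-{end_year % 100:02d}"


print(season_label(2025))         # NHL-style label for the season ending in 2025
print(season_label(2025, False))  # MLB-style label
```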
### NFL (National Football League)
| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/football/nfl/scoreboard` | 200 |
| 2 | Pro-Football-Reference | `pro-football-reference.com/years/{year}/games.htm` | 200 |
| 3 | CBS Sports | `cbssports.com/nfl/schedule/` | 100 |
**Notes:**
- ESPN provides week-by-week schedule data
- PFR has complete historical archives
- Season runs September-February (crosses calendar years)
### WNBA (Women's National Basketball Association)
| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/basketball/wnba/scoreboard` | 100 |
| 2 | Basketball-Reference | `basketball-reference.com/wnba/years/{year}_games.html` | 100 |
| 3 | CBS Sports | `cbssports.com/wnba/schedule/` | 50 |
**Notes:**
- WNBA season runs May-September
- Fewer games than NBA (13 teams, 44-game season in 2025)
### MLS (Major League Soccer)
| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/soccer/usa.1/scoreboard` | 200 |
| 2 | FBref | `fbref.com/en/comps/22/{year}/schedule/` | 100 |
| 3 | MLSSoccer.com | `mlssoccer.com/schedule/scores` | 100 |
**Notes:**
- ESPN's league ID for MLS is `usa.1`
- FBref may block automated requests (403 errors)
- Season runs February-November
### NWSL (National Women's Soccer League)
| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/soccer/usa.nwsl/scoreboard` | 100 |
| 2 | FBref | `fbref.com/en/comps/182/{year}/schedule/` | 50 |
| 3 | NWSL.com | `nwslsoccer.com/schedule` | 50 |
**Notes:**
- ESPN's league ID for NWSL is `usa.nwsl`
- 14 teams, ~180 regular season games
### CBB (College Basketball - Division I)
| Priority | Source | URL Pattern | Min Games |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/basketball/mens-college-basketball/scoreboard` | 1,000 |
| 2 | Sports-Reference | `sports-reference.com/cbb/seasons/{year}-schedule.html` | 500 |
| 3 | CBS Sports | `cbssports.com/college-basketball/schedule/` | 300 |
**Notes:**
- ~360 Division I teams = 5,000+ games per season
- ESPN provides group filtering (D1 = group 50)
- Season runs November-April (March Madness)
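
Every ESPN endpoint listed in the tables above shares one base path, so the per-sport URLs can be built from a lookup table. The sketch below is illustrative only: `espn_scoreboard_url` is a hypothetical helper, and the `dates` and `groups` query parameter names are assumptions about ESPN's undocumented API that should be verified against live responses.

```python
from urllib.parse import urlencode

ESPN_BASE = "https://site.api.espn.com/apis/site/v2/sports"

# Path segments copied from the per-sport source tables above.
ESPN_PATHS = {
    "nba": "basketball/nba",
    "mlb": "baseball/mlb",
    "nhl": "hockey/nhl",
    "nfl": "football/nfl",
    "wnba": "basketball/wnba",
    "mls": "soccer/usa.1",
    "nwsl": "soccer/usa.nwsl",
    "cbb": "basketball/mens-college-basketball",
}


def espn_scoreboard_url(sport, date_yyyymmdd=None, group=None):
    """Build a scoreboard URL; query parameter names are assumed, not confirmed."""
    params = {}
    if date_yyyymmdd is not None:
        params["dates"] = date_yyyymmdd   # assumed parameter name
    if group is not None:
        params["groups"] = group          # assumed parameter name (D1 = 50)
    url = f"{ESPN_BASE}/{ESPN_PATHS[sport]}/scoreboard"
    return f"{url}?{urlencode(params)}" if params else url


print(espn_scoreboard_url("mls"))
print(espn_scoreboard_url("cbb", "20251225", group=50))
```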
## Fallback Architecture
### ScraperSource Configuration
```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ScraperSource:
    name: str                            # Display name (e.g., "ESPN")
    scraper_func: Callable[[int], list]  # Function taking season year
    priority: int = 1                    # Lower = higher priority
    min_games: int = 10                  # Minimum to consider success
```
### Fallback Logic
```python
def scrape_with_fallback(sport, season, sources):
    sources = sorted(sources, key=lambda s: s.priority)
    for source in sources:
        try:
            games = source.scraper_func(season)
            if len(games) >= source.min_games:
                return games  # Success!
        except Exception:
            continue  # Try next source
    return []  # All sources failed
```
### Example Output
```
SCRAPING NBA 2026
============================================================
[1/3] Trying Basketball-Reference...
✓ Basketball-Reference returned 1230 games
SCRAPING MLB 2026
============================================================
[1/3] Trying MLB Stats API...
✗ MLB Stats API failed: Connection timeout
[2/3] Trying Baseball-Reference...
✓ Baseball-Reference returned 2430 games
```
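
Log lines of the shape shown above could be produced by a variant of the fallback loop that reports each attempt. The following is a self-contained sketch, not the script's actual implementation: the stand-in scrapers `flaky_primary` and `working_backup` are hypothetical, and the wording of the "returned only" message is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ScraperSource:
    name: str
    scraper_func: Callable[[int], list]
    priority: int = 1
    min_games: int = 10


def scrape_with_fallback(sport: str, season: int, sources: list) -> list:
    """Try each source in priority order, logging every attempt."""
    print(f"SCRAPING {sport} {season}")
    print("=" * 60)
    ordered = sorted(sources, key=lambda s: s.priority)
    for i, source in enumerate(ordered, start=1):
        print(f"[{i}/{len(ordered)}] Trying {source.name}...")
        try:
            games = source.scraper_func(season)
        except Exception as exc:
            print(f"✗ {source.name} failed: {exc}")
            continue
        if len(games) >= source.min_games:
            print(f"✓ {source.name} returned {len(games)} games")
            return games
        # Too few games counts as a failure, so fall through to the next source.
        print(f"✗ {source.name} returned only {len(games)} games")
    return []


# Hypothetical stand-ins for real scraper functions.
def flaky_primary(season: int) -> list:
    raise TimeoutError("Connection timeout")


def working_backup(season: int) -> list:
    return [f"game-{n}" for n in range(2430)]


games = scrape_with_fallback("MLB", 2026, [
    ScraperSource("Baseball-Reference", working_backup, priority=2, min_games=500),
    ScraperSource("MLB Stats API", flaky_primary, priority=1, min_games=1000),
])
```

Sorting by `priority` inside the function means callers can list sources in any order, as the usage above does.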
## Usage
### Command Line Interface
```bash
# Scrape all sports for 2026 season
python scrape_schedules.py --sport all --season 2026
# Scrape specific sport
python scrape_schedules.py --sport nba --season 2026
python scrape_schedules.py --sport mlb --season 2026
# Scrape only stadiums (legacy method)
python scrape_schedules.py --stadiums-only
# Scrape comprehensive stadium data for all 8 sports
python scrape_schedules.py --stadiums-update
# Custom output directory
python scrape_schedules.py --sport all --season 2026 --output ./custom_data
```
### Available Options
| Option | Values | Default | Description |
|--------|--------|---------|-------------|
| `--sport` | `nba`, `mlb`, `nhl`, `nfl`, `wnba`, `mls`, `nwsl`, `cbb`, `all` | `all` | Sport(s) to scrape |
| `--season` | Year (int) | `2026` | Season ending year |
| `--stadiums-only` | Flag | False | Only scrape stadium data (legacy method) |
| `--stadiums-update` | Flag | False | Scrape ALL stadium data for all 8 sports |
| `--output` | Path | `./data` | Output directory |
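
For reference, the options in this table map directly onto `argparse`. The sketch below mirrors the table's values and defaults; `build_parser` is a hypothetical function name and the help strings are paraphrased, so the real script may be organized differently.

```python
import argparse

SPORTS = ["nba", "mlb", "nhl", "nfl", "wnba", "mls", "nwsl", "cbb", "all"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the options table (illustrative only)."""
    parser = argparse.ArgumentParser(description="Scrape game schedules")
    parser.add_argument("--sport", choices=SPORTS, default="all",
                        help="Sport(s) to scrape")
    parser.add_argument("--season", type=int, default=2026,
                        help="Season ending year")
    parser.add_argument("--stadiums-only", action="store_true",
                        help="Only scrape stadium data (legacy method)")
    parser.add_argument("--stadiums-update", action="store_true",
                        help="Scrape stadium data for all supported sports")
    parser.add_argument("--output", default="./data",
                        help="Output directory")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(["--sport", "nba", "--season", "2026"])
print(args.sport, args.season, args.stadiums_only, args.output)
```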
## Output Format
### Directory Structure
```
data/
├── games.json # All games from all sports
├── stadiums.json # All stadium/venue data
└── teams.json # Team metadata (generated)
```
### Game JSON Schema
```json
{
  "id": "NBA-2025-26-BOS-LAL-20251225",
  "sport": "NBA",
  "homeTeam": "Los Angeles Lakers",
  "awayTeam": "Boston Celtics",
  "homeTeamId": "LAL",
  "awayTeamId": "BOS",
  "date": "2025-12-25T20:00:00Z",
  "venue": "Crypto.com Arena",
  "city": "Los Angeles",
  "state": "CA"
}
```
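
A consumer of `games.json` might load each object into a typed record. This is a minimal sketch under stated assumptions: `GameRecord` and `parse_game` are hypothetical names, and the sample ID is written in the `{AWAY}-{HOME}` order described under Stable Game IDs.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class GameRecord:
    id: str
    sport: str
    home_team: str
    away_team: str
    home_team_id: str
    away_team_id: str
    date: datetime
    venue: str
    city: str
    state: str


def parse_game(raw: dict) -> GameRecord:
    """Convert one decoded JSON object into a typed record."""
    return GameRecord(
        id=raw["id"],
        sport=raw["sport"],
        home_team=raw["homeTeam"],
        away_team=raw["awayTeam"],
        home_team_id=raw["homeTeamId"],
        away_team_id=raw["awayTeamId"],
        # Python versions before 3.11 reject a trailing "Z", so normalize it.
        date=datetime.fromisoformat(raw["date"].replace("Z", "+00:00")),
        venue=raw["venue"],
        city=raw["city"],
        state=raw["state"],
    )


sample = json.loads("""
{
  "id": "NBA-2025-26-BOS-LAL-20251225",
  "sport": "NBA",
  "homeTeam": "Los Angeles Lakers",
  "awayTeam": "Boston Celtics",
  "homeTeamId": "LAL",
  "awayTeamId": "BOS",
  "date": "2025-12-25T20:00:00Z",
  "venue": "Crypto.com Arena",
  "city": "Los Angeles",
  "state": "CA"
}
""")
game = parse_game(sample)
print(game.away_team_id, "at", game.home_team_id, game.date.isoformat())
```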
### Stadium JSON Schema
```json
{
  "id": "crypto-com-arena",
  "name": "Crypto.com Arena",
  "city": "Los Angeles",
  "state": "CA",
  "latitude": 34.0430,
  "longitude": -118.2673,
  "sports": ["NBA", "NHL"],
  "teams": ["Los Angeles Lakers", "Los Angeles Kings"]
}
```
## Stable Game IDs
Games are assigned stable IDs using the pattern:
```
{SPORT}-{SEASON}-{AWAY}-{HOME}-{DATE}
```
Example: `NBA-2025-26-BOS-LAL-20251225` (Boston at Los Angeles on December 25, 2025)
This ensures:
- Same game gets same ID across scraper runs
- IDs survive if scraper source changes
- CloudKit records can be updated (not duplicated)
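
The ID pattern and the "updated, not duplicated" property can be sketched in a few lines. Both `make_game_id` and `merge_games` are hypothetical helpers written for illustration; the actual scraper may build and merge IDs differently.

```python
from datetime import date


def make_game_id(sport: str, season: str, away_id: str, home_id: str,
                 game_date: date) -> str:
    """Build an ID following {SPORT}-{SEASON}-{AWAY}-{HOME}-{DATE}."""
    return (f"{sport.upper()}-{season}-{away_id.upper()}-"
            f"{home_id.upper()}-{game_date:%Y%m%d}")


def merge_games(existing: dict, scraped: list) -> dict:
    """Upsert scraped games by ID so a rerun overwrites instead of appending."""
    merged = dict(existing)
    for game in scraped:
        merged[game["id"]] = game
    return merged


game_id = make_game_id("nba", "2025-26", "bos", "lal", date(2025, 12, 25))
first_run = merge_games({}, [{"id": game_id, "venue": "TBD"}])
second_run = merge_games(first_run, [{"id": game_id, "venue": "Crypto.com Arena"}])
print(game_id, len(second_run), second_run[game_id]["venue"])
```

Because the ID depends only on the matchup and date, the second run replaces the first record even if the data came from a different source.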
## Rate Limiting
All scrapers implement rate limiting to avoid being blocked:
| Source Type | Rate Limit | Implementation |
|-------------|------------|----------------|
| Sports-Reference family | 1 req/sec | `time.sleep(1)` between requests |
| ESPN API | 2 req/sec | `time.sleep(0.5)` between date ranges |
| Official APIs (MLB, NHL) | 1 req/sec | `time.sleep(1)` between requests |
| CBS Sports | 1 req/sec | `time.sleep(1)` between pages |
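
A fixed `time.sleep()` after every request always waits the full interval, even when parsing already took longer than that. One alternative is a limiter that sleeps only for the remaining time. This is a sketch of that idea, not what the script does; `RateLimiter` is a hypothetical class, and the `clock` and `sleep` parameters exist only to make it testable.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # Time of the previous permitted request

    def wait(self) -> float:
        """Sleep only as long as needed; return the number of seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay:
                self._sleep(delay)
        self._last = now + delay
        return delay


# A short interval keeps the demo fast; the table above would use 1.0 or 0.5.
limiter = RateLimiter(min_interval=0.05)
for page in range(3):
    limiter.wait()
    # A real scraper would issue its HTTP request here.
```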
## Error Handling
### Common Errors
| Error | Cause | Resolution |
|-------|-------|------------|
| `403 Forbidden` | Rate limited or blocked | Wait 5 min, reduce request rate |
| `Connection timeout` | Network issue | Retry, check connectivity |
| `0 games returned` | Off-season or parsing error | Check if season has started; if it has, inspect the parser |
| `KeyError` in parsing | Website structure changed | Update scraper selectors |
### Fallback Behavior
1. If primary source fails → Try source #2
2. If source #2 fails → Try source #3
3. If all sources fail → Log warning, return empty list
4. Script continues to next sport (doesn't abort)
## Adding New Sources
### 1. Create Scraper Function
```python
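import requests

def scrape_newsport_newsource(season: int) -> list[Game]:
    """Scrape NewSport schedule from NewSource."""
    games = []
    url = f"https://newsource.com/schedule/{season}"
    # Game and HEADERS are assumed to be defined elsewhere in the script
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()  # Raise on HTTP errors so the fallback can move on
    # Parse response...
    return games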
```
### 2. Register in main()
```python
if args.sport in ['newsport', 'all']:
    sources = [
        ScraperSource('Primary', scrape_newsport_primary, priority=1, min_games=100),
        ScraperSource('NewSource', scrape_newsport_newsource, priority=2, min_games=50),
        ScraperSource('Backup', scrape_newsport_backup, priority=3, min_games=25),
    ]
    games = scrape_with_fallback('NEWSPORT', args.season, sources)
```
### 3. Add to CLI choices
```python
parser.add_argument('--sport', choices=[..., 'newsport', 'all'])
```
## Maintenance
### Monthly Tasks
- Run full scrape to update schedules
- Check for 403 errors indicating blocked sources
- Verify game counts match expected totals
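
The game-count check can be automated. The sketch below assumes `games.json` decodes to a flat list of objects with a `sport` field, as in the Game JSON Schema; `check_counts`, the 10% tolerance, and the two-sport subset of expected totals are illustrative choices, not part of the existing scripts.

```python
from collections import Counter

# Subset of the Supported Sports table; extend with the remaining leagues.
EXPECTED_GAMES = {"NBA": 1230, "NFL": 272}


def check_counts(games: list, expected: dict, tolerance: float = 0.10) -> dict:
    """Flag sports whose game count differs from the expected total by more than tolerance."""
    counts = Counter(game["sport"] for game in games)
    report = {}
    for sport, target in expected.items():
        actual = counts.get(sport, 0)
        ok = abs(actual - target) <= tolerance * target
        status = "OK" if ok else "CHECK"
        report[sport] = f"{status}: {actual} games (expected ~{target})"
    return report


# In practice the list would come from json.load() on data/games.json.
games = [{"sport": "NFL"}] * 272 + [{"sport": "NBA"}] * 600
for sport, line in check_counts(games, EXPECTED_GAMES).items():
    print(sport, line)
```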
### Seasonal Tasks
- Update season year in scripts
- Check for website structure changes
- Verify new teams/venues are included
### When Sources Break
1. Check if website changed structure (inspect HTML)
2. Update CSS selectors or JSON paths
3. If permanently broken, add new backup source
4. Update min_games thresholds if needed
## Dependencies
```
requests>=2.28.0
beautifulsoup4>=4.11.0
lxml>=4.9.0
```
Install with:
```bash
cd Scripts && pip install -r requirements.txt
```
## CloudKit Integration
After scraping, data is uploaded to CloudKit via:
```bash
python cloudkit_import.py
```
This syncs:
- Games → `CanonicalGame` records
- Stadiums → `CanonicalStadium` records
- Teams → `CanonicalTeam` records
The iOS app then syncs from CloudKit to local SwiftData storage.