docs(scripts): add comprehensive README for data scraping pipeline

Documents the complete sportstime_parser package including architecture, multi-source scraping, name normalization with aliases, CloudKit uploads, and workflows for manual review and adding new sports.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Scripts/README.md (new file, 833 lines)
# SportsTime Parser

A Python package for scraping, normalizing, and uploading sports schedule data to CloudKit for the SportsTime iOS app.

## Table of Contents

- [Overview](#overview)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Architecture](#architecture)
- [Directory Structure](#directory-structure)
- [Configuration](#configuration)
- [Data Models](#data-models)
- [Normalizers](#normalizers)
- [Scrapers](#scrapers)
- [Uploaders](#uploaders)
- [Utilities](#utilities)
- [Manual Review Workflow](#manual-review-workflow)
- [Adding a New Sport](#adding-a-new-sport)
- [Troubleshooting](#troubleshooting)

## Overview

The `sportstime_parser` package provides a complete pipeline for:

1. **Scraping** game schedules from multiple sources (Basketball-Reference, ESPN, MLB API, etc.)
2. **Normalizing** raw data to canonical identifiers (teams, stadiums, games)
3. **Resolving** team/stadium names using exact matching, historical aliases, and fuzzy matching
4. **Uploading** data to CloudKit with diff-based sync and resumable uploads

### Supported Sports

| Sport | Code | Sources | Season Format |
|-------|------|---------|---------------|
| NBA | `nba` | Basketball-Reference, ESPN, CBS | Oct-Jun (split year) |
| MLB | `mlb` | Baseball-Reference, MLB API, ESPN | Mar-Nov (single year) |
| NFL | `nfl` | ESPN, Pro-Football-Reference, CBS | Sep-Feb (split year) |
| NHL | `nhl` | Hockey-Reference, NHL API, ESPN | Oct-Jun (split year) |
| MLS | `mls` | ESPN, FBref | Feb-Nov (single year) |
| WNBA | `wnba` | ESPN | May-Oct (single year) |
| NWSL | `nwsl` | ESPN | Mar-Nov (single year) |
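
For the split-year leagues in the table, a game's calendar date alone doesn't identify its season. A minimal sketch of the mapping (the function name is illustrative, not from the package):

```python
from datetime import date

def season_start_year(start_month: int, game_date: date) -> int:
    """Map a game date to its season's start year for split-year leagues.

    Games played before the league's start month (e.g., an NBA game in
    February) belong to the season that began the previous calendar year.
    """
    return game_date.year if game_date.month >= start_month else game_date.year - 1

# NBA (starts in October): a February 2026 game is part of the 2025-26 season
season_start_year(10, date(2026, 2, 14))  # → 2025
```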

## Installation

```bash
cd Scripts
pip install -r requirements.txt
```

### Dependencies

- `requests` - HTTP requests with session management
- `beautifulsoup4` + `lxml` - HTML parsing
- `rapidfuzz` - Fuzzy string matching
- `pyjwt` + `cryptography` - CloudKit JWT authentication
- `rich` - Terminal UI (progress bars, logging)
- `pytz` / `timezonefinder` - Timezone detection
## Quick Start

### Scrape a Single Sport

```python
from sportstime_parser.scrapers import create_nba_scraper

scraper = create_nba_scraper(season=2025)
result = scraper.scrape_all()

print(f"Games: {result.game_count}")
print(f"Teams: {result.team_count}")
print(f"Stadiums: {result.stadium_count}")
print(f"Needs review: {result.review_count}")
```

### Upload to CloudKit

```python
from sportstime_parser.uploaders import CloudKitClient, RecordDiffer

client = CloudKitClient(environment="development")
differ = RecordDiffer()

# Compare local vs remote
diff = differ.diff_games(local_games, remote_records)

# Upload changes
records = diff.get_records_to_upload()
result = await client.save_records(records)
```

## Architecture

```
┌─────────────────────────────────────────────────────────────────────────┐
│                              DATA SOURCES                               │
│   Basketball-Reference │ ESPN API │ MLB API │ Hockey-Reference │ etc.   │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                                SCRAPERS                                 │
│  NBAScraper │ MLBScraper │ NFLScraper │ NHLScraper │ MLSScraper │ etc.  │
│                                                                         │
│  Features:                                                              │
│  • Multi-source fallback (try sources in priority order)                │
│  • Automatic rate limiting with exponential backoff                     │
│  • Doubleheader detection                                               │
│  • International game filtering (NFL London, NHL Global Series)         │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                               NORMALIZERS                               │
│   TeamResolver │ StadiumResolver │ CanonicalIdGenerator │ AliasLoader   │
│                                                                         │
│  Resolution Strategy (in order):                                        │
│  1. Exact match against canonical mappings                              │
│  2. Date-aware alias lookup (handles renames/relocations)               │
│  3. Fuzzy matching with confidence threshold (85%)                      │
│  4. Flag for manual review if unresolved or low confidence              │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                               DATA MODELS                               │
│                Game │ Team │ Stadium │ ManualReviewItem                 │
│                                                                         │
│  All models use canonical IDs:                                          │
│  • team_nba_lal (Los Angeles Lakers)                                    │
│  • stadium_nba_los_angeles_lakers (Crypto.com Arena)                    │
│  • game_nba_2025_20251022_bos_lal (specific game)                       │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                                UPLOADERS                                │
│              CloudKitClient │ RecordDiffer │ StateManager               │
│                                                                         │
│  Features:                                                              │
│  • JWT authentication with Apple's CloudKit Web Services                │
│  • Batch operations (up to 200 records per request)                     │
│  • Diff-based sync (only upload changes)                                │
│  • Resumable uploads with persistent state                              │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                                CLOUDKIT                                 │
│            Public Database: Games, Teams, Stadiums, Aliases             │
└─────────────────────────────────────────────────────────────────────────┘
```

## Directory Structure

```
Scripts/
├── README.md                      # This file
├── requirements.txt               # Python dependencies
├── pyproject.toml                 # Package configuration
├── league_structure.json          # League hierarchy (conferences, divisions)
├── team_aliases.json              # Historical team name mappings
├── stadium_aliases.json           # Historical stadium name mappings
├── logs/                          # Runtime logs (auto-created)
├── output/                        # Scrape output files (auto-created)
└── sportstime_parser/             # Main package
    ├── __init__.py
    ├── config.py                  # Configuration constants
    ├── SOURCES.md                 # Data source documentation
    ├── models/                    # Data classes
    │   ├── game.py                # Game model
    │   ├── team.py                # Team model
    │   ├── stadium.py             # Stadium model
    │   └── aliases.py             # Alias and ManualReviewItem models
    ├── normalizers/               # Name resolution
    │   ├── canonical_id.py        # ID generation
    │   ├── alias_loader.py        # Alias loading and resolution
    │   ├── fuzzy.py               # Fuzzy string matching
    │   ├── timezone.py            # Timezone detection
    │   ├── team_resolver.py       # Team name resolution
    │   └── stadium_resolver.py    # Stadium name resolution
    ├── scrapers/                  # Sport-specific scrapers
    │   ├── base.py                # Abstract base scraper
    │   ├── nba.py                 # NBA scraper
    │   ├── mlb.py                 # MLB scraper
    │   ├── nfl.py                 # NFL scraper
    │   ├── nhl.py                 # NHL scraper
    │   ├── mls.py                 # MLS scraper
    │   ├── wnba.py                # WNBA scraper
    │   └── nwsl.py                # NWSL scraper
    ├── uploaders/                 # CloudKit integration
    │   ├── cloudkit.py            # CloudKit Web Services client
    │   ├── diff.py                # Record diffing
    │   └── state.py               # Resumable upload state
    └── utils/                     # Shared utilities
        ├── logging.py             # Rich-based logging
        ├── http.py                # Rate-limited HTTP client
        └── progress.py            # Progress tracking
```

## Configuration

### config.py

Key configuration constants:

```python
# Directories
SCRIPTS_DIR = Path(__file__).parent.parent   # Scripts/
OUTPUT_DIR = SCRIPTS_DIR / "output"          # JSON output
STATE_DIR = SCRIPTS_DIR / ".parser_state"    # Upload state

# CloudKit
CLOUDKIT_CONTAINER = "iCloud.com.sportstime.app"
CLOUDKIT_ENVIRONMENT = "development"  # or "production"

# Rate Limiting
DEFAULT_REQUEST_DELAY = 3.0   # seconds between requests
MAX_RETRIES = 3               # retry attempts
BACKOFF_FACTOR = 2.0          # exponential backoff multiplier
INITIAL_BACKOFF = 5.0         # initial backoff duration

# Fuzzy Matching
FUZZY_THRESHOLD = 85  # minimum match confidence (0-100)

# Expected game counts (for validation)
EXPECTED_GAME_COUNTS = {
    "nba": 1230,   # 30 teams × 82 games ÷ 2
    "mlb": 2430,   # 30 teams × 162 games ÷ 2
    "nfl": 272,    # Regular season only
    "nhl": 1312,   # 32 teams × 82 games ÷ 2
    "mls": 544,    # 29 teams × ~34 games ÷ 2
    "wnba": 228,   # 12 teams × 38 games ÷ 2
    "nwsl": 182,   # 14 teams × 26 games ÷ 2
}

# Geography (for filtering international games)
ALLOWED_COUNTRIES = {"USA", "Canada"}
```
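
The expected counts can back a simple sanity check after a scrape. A sketch of the idea (the helper name and tolerance are illustrative, not the package's actual validation logic):

```python
EXPECTED_GAME_COUNTS = {"nba": 1230, "mlb": 2430, "nfl": 272}

def validate_game_count(sport: str, actual: int, tolerance: float = 0.02) -> bool:
    """Flag scrapes whose game count deviates more than `tolerance` from expectation."""
    expected = EXPECTED_GAME_COUNTS.get(sport)
    if expected is None:
        return True  # no expectation recorded for this sport
    return abs(actual - expected) <= expected * tolerance

validate_game_count("nba", 1230)  # complete scrape
validate_game_count("nba", 900)   # incomplete - worth re-running
```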

### league_structure.json

Defines the hierarchical structure of each league:

```json
{
  "nba": {
    "name": "National Basketball Association",
    "conferences": {
      "Eastern": {
        "divisions": {
          "Atlantic": ["BOS", "BKN", "NYK", "PHI", "TOR"],
          "Central": ["CHI", "CLE", "DET", "IND", "MIL"],
          "Southeast": ["ATL", "CHA", "MIA", "ORL", "WAS"]
        }
      },
      "Western": { ... }
    }
  },
  "mlb": { ... },
  ...
}
```

### team_aliases.json / stadium_aliases.json

Historical name mappings with validity dates:

```json
{
  "team_mlb_athletics": [
    {
      "alias": "Oakland Athletics",
      "alias_type": "full_name",
      "valid_from": "1968-01-01",
      "valid_until": "2024-12-31"
    },
    {
      "alias": "Las Vegas Athletics",
      "alias_type": "full_name",
      "valid_from": "2028-01-01",
      "valid_until": null
    }
  ]
}
```
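
The validity windows are what make alias lookup date-aware. A minimal sketch of consulting such data (inlined and keyed by alias for illustration; the real loader reads the JSON files above):

```python
from datetime import date

# Inlined sample of team_aliases.json, re-keyed by alias for lookup
ALIASES = {
    "Oakland Athletics": [
        {"id": "team_mlb_athletics",
         "valid_from": date(1968, 1, 1), "valid_until": date(2024, 12, 31)},
    ],
}

def resolve_alias(name: str, check_date: date):
    """Return the canonical ID whose validity window contains check_date."""
    for entry in ALIASES.get(name, []):
        after_start = entry["valid_from"] is None or check_date >= entry["valid_from"]
        before_end = entry["valid_until"] is None or check_date <= entry["valid_until"]
        if after_start and before_end:
            return entry["id"]
    return None

resolve_alias("Oakland Athletics", date(2023, 7, 4))  # → "team_mlb_athletics"
resolve_alias("Oakland Athletics", date(2026, 7, 4))  # → None (alias expired)
```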

## Data Models

### Game

```python
@dataclass
class Game:
    id: str                        # Canonical ID: game_{sport}_{season}_{date}_{away}_{home}
    sport: str                     # Sport code (nba, mlb, etc.)
    season: int                    # Season start year
    home_team_id: str              # Canonical team ID
    away_team_id: str              # Canonical team ID
    stadium_id: str                # Canonical stadium ID
    game_date: datetime            # UTC datetime
    game_number: Optional[int]     # 1 or 2 for doubleheaders
    home_score: Optional[int]      # None if not played
    away_score: Optional[int]
    status: str                    # scheduled, final, postponed, cancelled
    source_url: Optional[str]      # For manual review
    raw_home_team: Optional[str]   # Original scraped value
    raw_away_team: Optional[str]
    raw_stadium: Optional[str]
```

### Team

```python
@dataclass
class Team:
    id: str                        # Canonical ID: team_{sport}_{abbrev}
    sport: str
    city: str                      # e.g., "Los Angeles"
    name: str                      # e.g., "Lakers"
    full_name: str                 # e.g., "Los Angeles Lakers"
    abbreviation: str              # e.g., "LAL"
    conference: Optional[str]      # e.g., "Western"
    division: Optional[str]        # e.g., "Pacific"
    stadium_id: Optional[str]      # Home stadium
    primary_color: Optional[str]
    secondary_color: Optional[str]
    logo_url: Optional[str]
```

### Stadium

```python
@dataclass
class Stadium:
    id: str                        # Canonical ID: stadium_{sport}_{city_team}
    sport: str
    name: str                      # Current name (e.g., "Crypto.com Arena")
    city: str
    state: Optional[str]
    country: str
    latitude: Optional[float]
    longitude: Optional[float]
    capacity: Optional[int]
    surface: Optional[str]         # grass, turf, ice, hardwood
    roof_type: Optional[str]       # dome, retractable, open
    opened_year: Optional[int]
    image_url: Optional[str]
    timezone: Optional[str]
```

### ManualReviewItem

```python
@dataclass
class ManualReviewItem:
    item_type: str                 # "team" or "stadium"
    raw_value: str                 # Original scraped value
    suggested_id: Optional[str]    # Best fuzzy match (if any)
    confidence: float              # 0.0 - 1.0
    reason: str                    # Why review is needed
    source_url: Optional[str]      # Where it came from
    sport: str
    check_date: Optional[date]     # For date-aware alias lookup
```

## Normalizers

### Canonical ID Generation

IDs are deterministic and immutable:

```python
# Team ID
generate_team_id("nba", "LAL")
# → "team_nba_lal"

# Stadium ID
generate_stadium_id("nba", "Los Angeles", "Lakers")
# → "stadium_nba_los_angeles_lakers"

# Game ID
generate_game_id(
    sport="nba",
    season=2025,
    away_abbrev="BOS",
    home_abbrev="LAL",
    game_date=datetime(2025, 10, 22),
    game_number=None,
)
# → "game_nba_2025_20251022_bos_lal"

# Doubleheader game ID
generate_game_id(..., game_number=2)
# → "game_nba_2025_20251022_bos_lal_2"
```
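
A possible implementation of the game-ID rule, consistent with the examples above (a sketch, not necessarily the exact code in `canonical_id.py`):

```python
from datetime import datetime
from typing import Optional

def generate_game_id(sport: str, season: int, away_abbrev: str, home_abbrev: str,
                     game_date: datetime, game_number: Optional[int] = None) -> str:
    """game_{sport}_{season}_{YYYYMMDD}_{away}_{home}[_{game_number}]"""
    base = (f"game_{sport}_{season}_{game_date:%Y%m%d}"
            f"_{away_abbrev.lower()}_{home_abbrev.lower()}")
    return f"{base}_{game_number}" if game_number else base

generate_game_id("nba", 2025, "BOS", "LAL", datetime(2025, 10, 22))
# → "game_nba_2025_20251022_bos_lal"
```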

### Team Resolution

The `TeamResolver` uses a three-stage strategy:

```python
resolver = get_team_resolver("nba")
result = resolver.resolve(
    "Los Angeles Lakers",
    check_date=date(2025, 10, 22),
    source_url="https://..."
)

# Result:
# - canonical_id: "team_nba_lal"
# - confidence: 1.0 (exact match)
# - review_item: None
```

**Resolution stages:**

1. **Exact match**: check against canonical team mappings
   - Full name: "Los Angeles Lakers"
   - City + name: "Los Angeles" + "Lakers"
   - Abbreviation: "LAL"

2. **Alias lookup**: check historical aliases with date awareness
   - "Oakland Athletics" → "team_mlb_athletics" (valid until 2024-12-31)
   - Handles relocations such as the Oakland → Las Vegas transition

3. **Fuzzy match**: use rapidfuzz with an 85% threshold
   - "LA Lakers" → "Los Angeles Lakers" (92% match)
   - Low-confidence matches are flagged for review
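
Stage 3 can be illustrated with a dependency-free sketch using the stdlib `difflib` in place of `rapidfuzz` (the scoring differs somewhat, but the threshold logic is the same; the names here are illustrative):

```python
from difflib import SequenceMatcher

CANONICAL = {
    "Los Angeles Lakers": "team_nba_lal",
    "Los Angeles Clippers": "team_nba_lac",
    "Boston Celtics": "team_nba_bos",
}

def fuzzy_resolve(raw: str, threshold: float = 0.85):
    """Return (canonical_id, score) for the best match above threshold,
    else (None, best_score) so the caller can build a ManualReviewItem."""
    best_name, best_score = None, 0.0
    for name in CANONICAL:
        score = SequenceMatcher(None, raw.lower(), name.lower()).ratio()
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return CANONICAL[best_name], best_score
    return None, best_score
```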

### Stadium Resolution

Stadiums use a similar three-stage strategy, with additional location awareness:

```python
resolver = get_stadium_resolver("nba")
result = resolver.resolve(
    "Crypto.com Arena",
    check_date=date(2025, 10, 22)
)
```

**Key features:**
- Handles naming-rights changes (Staples Center → Crypto.com Arena)
- Date-aware: "Staples Center" resolves correctly for historical games
- Location-based fallback using latitude/longitude

## Scrapers

### Base Scraper

All scrapers extend `BaseScraper` with these features:

```python
class BaseScraper(ABC):
    def __init__(self, sport: str, season: int): ...

    # Required implementations
    def _get_sources(self) -> list[str]: ...
    def _scrape_games_from_source(self, source: str) -> list[RawGameData]: ...
    def _normalize_games(self, raw_games) -> tuple[list[Game], list[ManualReviewItem]]: ...
    def scrape_teams(self) -> list[Team]: ...
    def scrape_stadiums(self) -> list[Stadium]: ...

    # Built-in features
    def scrape_games(self) -> ScrapeResult:
        """Multi-source fallback - tries each source in order."""
        ...

    def scrape_all(self) -> ScrapeResult:
        """Scrapes games, teams, and stadiums with progress tracking."""
        ...
```
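
The multi-source fallback idea can be sketched in a few lines (illustrative shape only; `scrape_fn` stands in for `_scrape_games_from_source`):

```python
def scrape_with_fallback(sources, scrape_fn):
    """Try each source in priority order; return the first non-empty result."""
    failures = {}
    for source in sources:
        try:
            games = scrape_fn(source)
            if games:
                return source, games
        except Exception as exc:  # a real scraper would log this
            failures[source] = exc
    raise RuntimeError(f"all sources failed: {list(failures)}")
```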

### NBA Scraper

```python
class NBAScraper(BaseScraper):
    """
    Sources (in priority order):
    1. Basketball-Reference - HTML tables, monthly pages
    2. ESPN API - JSON, per-date queries
    3. CBS Sports - Backup (not implemented)

    Season: October to June (split year, e.g., 2025-26)
    """
```

**Basketball-Reference parsing:**
- URL: `https://www.basketball-reference.com/leagues/NBA_{year}_games-{month}.html`
- Table columns: date_game, visitor_team_name, home_team_name, visitor_pts, home_pts, arena_name

### MLB Scraper

```python
class MLBScraper(BaseScraper):
    """
    Sources:
    1. Baseball-Reference - Single page per season
    2. MLB Stats API - Official API with date range queries
    3. ESPN API - Backup

    Season: March to November (single year)
    Handles: Doubleheaders with game_number
    """
```

### NFL Scraper

```python
class NFLScraper(BaseScraper):
    """
    Sources:
    1. ESPN API - Week-based queries
    2. Pro-Football-Reference - Single page per season

    Season: September to February (split year)
    Filters: International games (London, Mexico City, Frankfurt)
    Scrapes: Preseason (4 weeks), Regular (18 weeks), Postseason (4 rounds)
    """
```

### NHL Scraper

```python
class NHLScraper(BaseScraper):
    """
    Sources:
    1. Hockey-Reference - Single page per season
    2. NHL API - New API (api-web.nhle.com)
    3. ESPN API - Backup

    Season: October to June (split year)
    Filters: International games (Prague, Stockholm, Helsinki)
    """
```

### MLS / WNBA / NWSL Scrapers

All three use the ESPN API as their primary source and share a similar structure:
- Single calendar-year seasons
- Conference-based organization (MLS) or a single table (WNBA, NWSL)

## Uploaders

### CloudKit Client

```python
class CloudKitClient:
    """CloudKit Web Services API client with JWT authentication."""

    def __init__(
        self,
        container_id: str = CLOUDKIT_CONTAINER,
        environment: str = "development",  # or "production"
        key_id: str = None,        # From CloudKit Dashboard
        private_key: str = None,   # EC P-256 private key
    ): ...

    async def fetch_records(
        self,
        record_type: RecordType,
        filter_by: Optional[dict] = None,
        sort_by: Optional[str] = None,
    ) -> list[dict]: ...

    async def save_records(
        self,
        records: list[CloudKitRecord],
        batch_size: int = 200,
    ) -> BatchResult: ...

    async def delete_records(
        self,
        record_names: list[str],
        record_type: RecordType,
    ) -> BatchResult: ...
```

**Authentication:**
- Uses an EC P-256 key pair
- JWT tokens signed with the private key
- Tokens valid for 30 minutes
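
Token creation with `pyjwt` might look like the following sketch. The claim set here is illustrative only; Apple's server-to-server interface has its own exact requirements, so treat this as a shape rather than the client's actual implementation:

```python
import time
import jwt  # pyjwt, with the cryptography backend for ES256

def make_token(key_id: str, private_key_pem: str, ttl_seconds: int = 30 * 60) -> str:
    """Sign a short-lived ES256 token with the CloudKit key pair."""
    now = int(time.time())
    payload = {"iss": key_id, "iat": now, "exp": now + ttl_seconds}
    return jwt.encode(payload, private_key_pem, algorithm="ES256",
                      headers={"kid": key_id})
```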

### Record Differ

```python
class RecordDiffer:
    """Compares local records with CloudKit records."""

    def diff_games(self, local: list[Game], remote: list[dict]) -> DiffResult: ...
    def diff_teams(self, local: list[Team], remote: list[dict]) -> DiffResult: ...
    def diff_stadiums(self, local: list[Stadium], remote: list[dict]) -> DiffResult: ...
```

**DiffResult:**

```python
@dataclass
class DiffResult:
    creates: list[RecordDiff]     # New records to create
    updates: list[RecordDiff]     # Changed records to update
    deletes: list[RecordDiff]     # Remote records to delete
    unchanged: list[RecordDiff]   # Records with no changes

    def get_records_to_upload(self) -> list[CloudKitRecord]:
        """Returns creates + updates ready for upload."""
```
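
At its core, diff-based sync compares stable hashes of field dictionaries keyed by record name. A self-contained sketch of that idea (not the differ's actual field handling):

```python
import hashlib
import json

def record_hash(fields: dict) -> str:
    """Stable content hash: key order normalized, datetimes stringified."""
    return hashlib.sha256(
        json.dumps(fields, sort_keys=True, default=str).encode()
    ).hexdigest()

def diff_records(local: dict, remote: dict):
    """Split record names into creates / updates / deletes / unchanged."""
    creates = [k for k in local if k not in remote]
    deletes = [k for k in remote if k not in local]
    shared = [k for k in local if k in remote]
    updates = [k for k in shared if record_hash(local[k]) != record_hash(remote[k])]
    unchanged = [k for k in shared if record_hash(local[k]) == record_hash(remote[k])]
    return creates, updates, deletes, unchanged
```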

### State Manager

```python
class StateManager:
    """Manages resumable upload state."""

    def load_session(self, sport, season, environment) -> Optional[UploadSession]: ...
    def save_session(self, session: UploadSession) -> None: ...
    def get_session_or_create(
        self,
        sport, season, environment,
        record_names: list[tuple[str, str]],
        resume: bool = False,
    ) -> UploadSession: ...
```

**State persistence:**
- Stored in `.parser_state/upload_state_{sport}_{season}_{env}.json`
- Tracks: pending, uploaded, failed records
- Supports retry with backoff
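
The state file can be as simple as a JSON snapshot of the three record sets; a minimal sketch (field names are illustrative, not the actual `UploadSession` schema):

```python
import json
from pathlib import Path

def save_state(path: Path, pending: list, uploaded: list, failed: list) -> None:
    """Persist upload progress so an interrupted run can resume."""
    path.write_text(json.dumps(
        {"pending": pending, "uploaded": uploaded, "failed": failed}, indent=2))

def load_state(path: Path) -> dict:
    """Load a previous session, or start fresh if none exists."""
    if not path.exists():
        return {"pending": [], "uploaded": [], "failed": []}
    return json.loads(path.read_text())
```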

## Utilities

### HTTP Client

```python
class RateLimitedSession:
    """HTTP session with rate limiting and exponential backoff."""

    def __init__(
        self,
        delay: float = 3.0,   # Seconds between requests
        max_retries: int = 3,
        backoff_factor: float = 2.0,
    ): ...

    def get(self, url, **kwargs) -> Response: ...
    def get_json(self, url, **kwargs) -> dict: ...
    def get_html(self, url, **kwargs) -> str: ...
```

**Features:**
- User-agent rotation (5 different Chrome/Firefox/Safari agents)
- Per-domain rate limiting
- Automatic 429 handling with exponential backoff + jitter
- Connection pooling
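
The backoff schedule follows the usual `initial * factor ** attempt` progression plus random jitter. A sketch using the documented defaults (the generator itself is illustrative, not the client's code):

```python
import random

def backoff_delays(max_retries: int = 3, initial: float = 5.0,
                   factor: float = 2.0, jitter: float = 1.0):
    """Yield the sleep duration before each retry: exponential plus jitter."""
    for attempt in range(max_retries):
        yield initial * (factor ** attempt) + random.uniform(0.0, jitter)
```

With jitter disabled, the schedule is 5s, 10s, 20s; the jitter spreads retries out so many clients hitting a 429 at once don't retry in lockstep.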

### Logging

```python
from sportstime_parser.utils import get_logger, log_success, log_error

logger = get_logger()  # Rich-formatted logger
logger.info("Starting scrape")

log_success("Scraped 1230 games")  # Green checkmark
log_error("Failed to parse")       # Red X
```

**Log output:**
- Console: Rich-formatted with colors
- File: `logs/parser_{timestamp}.log`

### Progress Tracking

```python
from sportstime_parser.utils import ScrapeProgress, track_progress

# Specialized scrape tracking
progress = ScrapeProgress("nba", 2025)
progress.start()

with progress.scraping_schedule(total_months=9) as advance:
    for month in months:
        fetch(month)
        advance()

progress.finish()  # Prints summary

# Generic progress bar
for game in track_progress(games, "Processing games"):
    process(game)
```

## Manual Review Workflow

When the system cannot confidently resolve a team or stadium, it emits a review item:

1. **Low-confidence fuzzy match** (< 85%):
   ```
   ManualReviewItem(
       item_type="team",
       raw_value="LA Lakers",
       suggested_id="team_nba_lal",
       confidence=0.82,
       reason="Fuzzy match below threshold"
   )
   ```

2. **No match found**:
   ```
   ManualReviewItem(
       raw_value="Unknown Team FC",
       suggested_id=None,
       confidence=0.0,
       reason="No match found in canonical mappings"
   )
   ```

3. **Ambiguous match** (multiple candidates):
   ```
   ManualReviewItem(
       raw_value="LA",
       suggested_id="team_nba_lac",
       confidence=0.5,
       reason="Ambiguous: could be Lakers or Clippers"
   )
   ```

**Resolution:**
- Review items are exported to JSON
- Manually verify each item and add it to `team_aliases.json` or `stadium_aliases.json`
- Re-run the scrape; the new aliases will be used for resolution

## Adding a New Sport

1. **Create a scraper** in `scrapers/{sport}.py`:
   ```python
   class NewSportScraper(BaseScraper):
       def __init__(self, season: int, **kwargs):
           super().__init__("newsport", season, **kwargs)
           self._team_resolver = get_team_resolver("newsport")
           self._stadium_resolver = get_stadium_resolver("newsport")

       def _get_sources(self) -> list[str]:
           return ["primary_source", "backup_source"]

       def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
           # Implement source-specific scraping
           ...

       def _normalize_games(self, raw_games) -> tuple[list[Game], list[ManualReviewItem]]:
           # Use resolvers to normalize
           ...

       def scrape_teams(self) -> list[Team]:
           # Return canonical team list
           ...

       def scrape_stadiums(self) -> list[Stadium]:
           # Return canonical stadium list
           ...
   ```

2. **Add team mappings** in `normalizers/team_resolver.py`:
   ```python
   TEAM_MAPPINGS["newsport"] = {
       "ABC": ("team_newsport_abc", "Full Team Name", "City"),
       ...
   }
   ```

3. **Add stadium mappings** in `normalizers/stadium_resolver.py`:
   ```python
   STADIUM_MAPPINGS["newsport"] = {
       "stadium_newsport_venue": StadiumInfo(
           name="Venue Name",
           city="City",
           state="State",
           country="USA",
           latitude=40.0,
           longitude=-74.0,
       ),
       ...
   }
   ```

4. **Add the league to `league_structure.json`** (if hierarchical)

5. **Update `config.py`**:
   ```python
   EXPECTED_GAME_COUNTS["newsport"] = 500
   ```

6. **Export the scraper from `__init__.py`**

## Troubleshooting

### Rate Limiting (429 errors)

The system handles these automatically with exponential backoff. If they persist:
- Increase `DEFAULT_REQUEST_DELAY` in config.py
- Check whether the source has changed its rate limits

### Missing Teams/Stadiums

1. Check the scraper logs for the raw values
2. Add them to `team_aliases.json` or `stadium_aliases.json`
3. Or add them to the canonical mappings if it's a genuinely new team/stadium

### CloudKit Authentication Errors

1. Verify that `key_id` matches the CloudKit Dashboard
2. Check the private key format (EC P-256, PEM)
3. Ensure the container identifier is correct

### Incomplete Scrapes

The system discards partial data on errors. Check:
- `logs/` for error details
- Network connectivity
- Source website availability

### International Games Appearing

NFL and NHL scrapers filter these automatically. If new locations emerge:
- Add them to `INTERNATIONAL_LOCATIONS` in the scraper
- Or add filtering logic for neutral-site games

## Contributing

1. Follow existing patterns for new scrapers
2. Always use canonical IDs
3. Add aliases for historical names
4. Include source URLs for traceability
5. Test with multiple seasons