docs(scripts): add comprehensive README for data scraping pipeline

Documents the complete sportstime_parser package including architecture, multi-source scraping, name normalization with aliases, CloudKit uploads, and workflows for manual review and adding new sports.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

New file: Scripts/README.md (833 lines)
# SportsTime Parser

A Python package for scraping, normalizing, and uploading sports schedule data to CloudKit for the SportsTime iOS app.

## Table of Contents

- [Overview](#overview)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Architecture](#architecture)
- [Directory Structure](#directory-structure)
- [Configuration](#configuration)
- [Data Models](#data-models)
- [Normalizers](#normalizers)
- [Scrapers](#scrapers)
- [Uploaders](#uploaders)
- [Utilities](#utilities)
- [Manual Review Workflow](#manual-review-workflow)
- [Adding a New Sport](#adding-a-new-sport)
- [Troubleshooting](#troubleshooting)

## Overview

The `sportstime_parser` package provides a complete pipeline for:

1. **Scraping** game schedules from multiple sources (Basketball-Reference, ESPN, MLB API, etc.)
2. **Normalizing** raw data to canonical identifiers (teams, stadiums, games)
3. **Resolving** team/stadium names using exact matching, historical aliases, and fuzzy matching
4. **Uploading** data to CloudKit with diff-based sync and resumable uploads

### Supported Sports

| Sport | Code | Sources | Season Format |
|-------|------|---------|---------------|
| NBA | `nba` | Basketball-Reference, ESPN, CBS | Oct-Jun (split year) |
| MLB | `mlb` | Baseball-Reference, MLB API, ESPN | Mar-Nov (single year) |
| NFL | `nfl` | ESPN, Pro-Football-Reference, CBS | Sep-Feb (split year) |
| NHL | `nhl` | Hockey-Reference, NHL API, ESPN | Oct-Jun (split year) |
| MLS | `mls` | ESPN, FBref | Feb-Nov (single year) |
| WNBA | `wnba` | ESPN | May-Oct (single year) |
| NWSL | `nwsl` | ESPN | Mar-Nov (single year) |
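
The split-year vs. single-year convention in the table above can be made concrete with a tiny helper. This is an illustrative sketch only; `season_label` and `SPLIT_YEAR_SPORTS` are not part of the package.

```python
# Illustrative helper for the "Season Format" column above.
# Split-year leagues label a season by its start year plus the
# following year's last two digits; single-year leagues use the
# calendar year as-is.

SPLIT_YEAR_SPORTS = {"nba", "nfl", "nhl"}

def season_label(sport: str, season: int) -> str:
    """Render a human-readable label for a sport's season."""
    if sport in SPLIT_YEAR_SPORTS:
        return f"{season}-{(season + 1) % 100:02d}"
    return str(season)
```

So `season_label("nba", 2025)` yields `"2025-26"`, while `season_label("mlb", 2025)` is just `"2025"`.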
## Installation

```bash
cd Scripts
pip install -r requirements.txt
```

### Dependencies

- `requests` - HTTP requests with session management
- `beautifulsoup4` + `lxml` - HTML parsing
- `rapidfuzz` - Fuzzy string matching
- `pyjwt` + `cryptography` - CloudKit JWT authentication
- `rich` - Terminal UI (progress bars, logging)
- `pytz` / `timezonefinder` - Timezone detection

## Quick Start

### Scrape a Single Sport

```python
from sportstime_parser.scrapers import create_nba_scraper

scraper = create_nba_scraper(season=2025)
result = scraper.scrape_all()

print(f"Games: {result.game_count}")
print(f"Teams: {result.team_count}")
print(f"Stadiums: {result.stadium_count}")
print(f"Needs review: {result.review_count}")
```

### Upload to CloudKit

```python
from sportstime_parser.uploaders import CloudKitClient, RecordDiffer

client = CloudKitClient(environment="development")
differ = RecordDiffer()

# Compare local vs remote
diff = differ.diff_games(local_games, remote_records)

# Upload changes
records = diff.get_records_to_upload()
result = await client.save_records(records)
```

## Architecture

```
┌─────────────────────────────────────────────────────────────────────────┐
│                             DATA SOURCES                                │
│   Basketball-Reference │ ESPN API │ MLB API │ Hockey-Reference │ etc.   │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                               SCRAPERS                                  │
│  NBAScraper │ MLBScraper │ NFLScraper │ NHLScraper │ MLSScraper │ etc.  │
│                                                                         │
│ Features:                                                               │
│ • Multi-source fallback (try sources in priority order)                 │
│ • Automatic rate limiting with exponential backoff                      │
│ • Doubleheader detection                                                │
│ • International game filtering (NFL London, NHL Global Series)          │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                              NORMALIZERS                                │
│   TeamResolver │ StadiumResolver │ CanonicalIdGenerator │ AliasLoader   │
│                                                                         │
│ Resolution Strategy (in order):                                         │
│ 1. Exact match against canonical mappings                               │
│ 2. Date-aware alias lookup (handles renames/relocations)                │
│ 3. Fuzzy matching with confidence threshold (85%)                       │
│ 4. Flag for manual review if unresolved or low confidence               │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                              DATA MODELS                                │
│                Game │ Team │ Stadium │ ManualReviewItem                 │
│                                                                         │
│ All models use canonical IDs:                                           │
│ • team_nba_lal (Los Angeles Lakers)                                     │
│ • stadium_nba_los_angeles_lakers (Crypto.com Arena)                     │
│ • game_nba_2025_20251022_bos_lal (specific game)                        │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                               UPLOADERS                                 │
│              CloudKitClient │ RecordDiffer │ StateManager               │
│                                                                         │
│ Features:                                                               │
│ • JWT authentication with Apple's CloudKit Web Services                 │
│ • Batch operations (up to 200 records per request)                      │
│ • Diff-based sync (only upload changes)                                 │
│ • Resumable uploads with persistent state                               │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                               CLOUDKIT                                  │
│            Public Database: Games, Teams, Stadiums, Aliases             │
└─────────────────────────────────────────────────────────────────────────┘
```

## Directory Structure

```
Scripts/
├── README.md                   # This file
├── requirements.txt            # Python dependencies
├── pyproject.toml              # Package configuration
├── league_structure.json       # League hierarchy (conferences, divisions)
├── team_aliases.json           # Historical team name mappings
├── stadium_aliases.json        # Historical stadium name mappings
├── logs/                       # Runtime logs (auto-created)
├── output/                     # Scrape output files (auto-created)
└── sportstime_parser/          # Main package
    ├── __init__.py
    ├── config.py               # Configuration constants
    ├── SOURCES.md              # Data source documentation
    ├── models/                 # Data classes
    │   ├── game.py             # Game model
    │   ├── team.py             # Team model
    │   ├── stadium.py          # Stadium model
    │   └── aliases.py          # Alias and ManualReviewItem models
    ├── normalizers/            # Name resolution
    │   ├── canonical_id.py     # ID generation
    │   ├── alias_loader.py     # Alias loading and resolution
    │   ├── fuzzy.py            # Fuzzy string matching
    │   ├── timezone.py         # Timezone detection
    │   ├── team_resolver.py    # Team name resolution
    │   └── stadium_resolver.py # Stadium name resolution
    ├── scrapers/               # Sport-specific scrapers
    │   ├── base.py             # Abstract base scraper
    │   ├── nba.py              # NBA scraper
    │   ├── mlb.py              # MLB scraper
    │   ├── nfl.py              # NFL scraper
    │   ├── nhl.py              # NHL scraper
    │   ├── mls.py              # MLS scraper
    │   ├── wnba.py             # WNBA scraper
    │   └── nwsl.py             # NWSL scraper
    ├── uploaders/              # CloudKit integration
    │   ├── cloudkit.py         # CloudKit Web Services client
    │   ├── diff.py             # Record diffing
    │   └── state.py            # Resumable upload state
    └── utils/                  # Shared utilities
        ├── logging.py          # Rich-based logging
        ├── http.py             # Rate-limited HTTP client
        └── progress.py         # Progress tracking
```

## Configuration

### config.py

Key configuration constants:

```python
# Directories
SCRIPTS_DIR = Path(__file__).parent.parent   # Scripts/
OUTPUT_DIR = SCRIPTS_DIR / "output"          # JSON output
STATE_DIR = SCRIPTS_DIR / ".parser_state"    # Upload state

# CloudKit
CLOUDKIT_CONTAINER = "iCloud.com.sportstime.app"
CLOUDKIT_ENVIRONMENT = "development"  # or "production"

# Rate Limiting
DEFAULT_REQUEST_DELAY = 3.0  # seconds between requests
MAX_RETRIES = 3              # retry attempts
BACKOFF_FACTOR = 2.0         # exponential backoff multiplier
INITIAL_BACKOFF = 5.0        # initial backoff duration

# Fuzzy Matching
FUZZY_THRESHOLD = 85  # minimum match confidence (0-100)

# Expected game counts (for validation)
EXPECTED_GAME_COUNTS = {
    "nba": 1230,   # 30 teams × 82 games ÷ 2
    "mlb": 2430,   # 30 teams × 162 games ÷ 2
    "nfl": 272,    # Regular season only
    "nhl": 1312,   # 32 teams × 82 games ÷ 2
    "mls": 544,    # 32 teams × 34 games ÷ 2
    "wnba": 228,   # 12 teams × 38 games ÷ 2
    "nwsl": 182,   # 14 teams × 26 games ÷ 2
}

# Geography (for filtering international games)
ALLOWED_COUNTRIES = {"USA", "Canada"}
```
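
The expected counts above lend themselves to a post-scrape sanity check. A minimal sketch, assuming a small tolerance for postponed or cancelled games; `validate_game_count` and the 2% tolerance are assumptions, not package API.

```python
# Sanity-check a scraped game count against the expected total.
# The tolerance is an assumed allowance for postponements/cancellations.
EXPECTED_GAME_COUNTS = {"nba": 1230, "mlb": 2430, "nfl": 272}

def validate_game_count(sport: str, scraped: int, tolerance: float = 0.02) -> bool:
    """Return True when `scraped` is within ±tolerance of the expected count."""
    expected = EXPECTED_GAME_COUNTS[sport]
    return abs(scraped - expected) <= expected * tolerance
```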
### league_structure.json

Defines the hierarchical structure of each league:

```json
{
  "nba": {
    "name": "National Basketball Association",
    "conferences": {
      "Eastern": {
        "divisions": {
          "Atlantic": ["BOS", "BKN", "NYK", "PHI", "TOR"],
          "Central": ["CHI", "CLE", "DET", "IND", "MIL"],
          "Southeast": ["ATL", "CHA", "MIA", "ORL", "WAS"]
        }
      },
      "Western": { ... }
    }
  },
  "mlb": { ... },
  ...
}
```

### team_aliases.json / stadium_aliases.json

Historical name mappings with validity dates:

```json
{
  "team_mlb_athletics": [
    {
      "alias": "Oakland Athletics",
      "alias_type": "full_name",
      "valid_from": "1968-01-01",
      "valid_until": "2024-12-31"
    },
    {
      "alias": "Las Vegas Athletics",
      "alias_type": "full_name",
      "valid_from": "2028-01-01",
      "valid_until": null
    }
  ]
}
```
## Data Models

### Game

```python
@dataclass
class Game:
    id: str                       # Canonical ID: game_{sport}_{season}_{date}_{away}_{home}
    sport: str                    # Sport code (nba, mlb, etc.)
    season: int                   # Season start year
    home_team_id: str             # Canonical team ID
    away_team_id: str             # Canonical team ID
    stadium_id: str               # Canonical stadium ID
    game_date: datetime           # UTC datetime
    game_number: Optional[int]    # 1 or 2 for doubleheaders
    home_score: Optional[int]     # None if not played
    away_score: Optional[int]
    status: str                   # scheduled, final, postponed, cancelled
    source_url: Optional[str]     # For manual review
    raw_home_team: Optional[str]  # Original scraped value
    raw_away_team: Optional[str]
    raw_stadium: Optional[str]
```

### Team

```python
@dataclass
class Team:
    id: str                        # Canonical ID: team_{sport}_{abbrev}
    sport: str
    city: str                      # e.g., "Los Angeles"
    name: str                      # e.g., "Lakers"
    full_name: str                 # e.g., "Los Angeles Lakers"
    abbreviation: str              # e.g., "LAL"
    conference: Optional[str]      # e.g., "Western"
    division: Optional[str]        # e.g., "Pacific"
    stadium_id: Optional[str]      # Home stadium
    primary_color: Optional[str]
    secondary_color: Optional[str]
    logo_url: Optional[str]
```

### Stadium

```python
@dataclass
class Stadium:
    id: str                    # Canonical ID: stadium_{sport}_{city_team}
    sport: str
    name: str                  # Current name (e.g., "Crypto.com Arena")
    city: str
    state: Optional[str]
    country: str
    latitude: Optional[float]
    longitude: Optional[float]
    capacity: Optional[int]
    surface: Optional[str]     # grass, turf, ice, hardwood
    roof_type: Optional[str]   # dome, retractable, open
    opened_year: Optional[int]
    image_url: Optional[str]
    timezone: Optional[str]
```

### ManualReviewItem

```python
@dataclass
class ManualReviewItem:
    item_type: str               # "team" or "stadium"
    raw_value: str               # Original scraped value
    suggested_id: Optional[str]  # Best fuzzy match (if any)
    confidence: float            # 0.0 - 1.0
    reason: str                  # Why review is needed
    source_url: Optional[str]    # Where it came from
    sport: str
    check_date: Optional[date]   # For date-aware alias lookup
```

## Normalizers

### Canonical ID Generation

IDs are deterministic and immutable:

```python
# Team ID
generate_team_id("nba", "LAL")
# → "team_nba_lal"

# Stadium ID
generate_stadium_id("nba", "Los Angeles", "Lakers")
# → "stadium_nba_los_angeles_lakers"

# Game ID
generate_game_id(
    sport="nba",
    season=2025,
    away_abbrev="BOS",
    home_abbrev="LAL",
    game_date=datetime(2025, 10, 22),
    game_number=None
)
# → "game_nba_2025_20251022_bos_lal"

# Doubleheader Game ID
generate_game_id(..., game_number=2)
# → "game_nba_2025_20251022_bos_lal_2"
```
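
The determinism comes down to a simple slugging scheme. A sketch consistent with the examples above, assuming the generators lowercase and underscore-join their inputs; the real implementations may differ in edge cases.

```python
import re
from datetime import datetime
from typing import Optional

def _slug(text: str) -> str:
    """Lowercase and collapse runs of non-alphanumerics to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")

def generate_team_id(sport: str, abbrev: str) -> str:
    return f"team_{sport}_{_slug(abbrev)}"

def generate_game_id(sport: str, season: int, away_abbrev: str, home_abbrev: str,
                     game_date: datetime, game_number: Optional[int] = None) -> str:
    base = (f"game_{sport}_{season}_{game_date:%Y%m%d}"
            f"_{_slug(away_abbrev)}_{_slug(home_abbrev)}")
    return f"{base}_{game_number}" if game_number else base
```

Because the inputs are already canonical (sport code, abbreviation, date), re-running a scrape regenerates identical IDs, which is what makes diff-based sync possible.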
### Team Resolution

The `TeamResolver` uses a three-stage strategy:

```python
resolver = get_team_resolver("nba")
result = resolver.resolve(
    "Los Angeles Lakers",
    check_date=date(2025, 10, 22),
    source_url="https://..."
)

# Result:
# - canonical_id: "team_nba_lal"
# - confidence: 1.0 (exact match)
# - review_item: None
```

**Resolution stages:**

1. **Exact Match**: Check against canonical team mappings
   - Full name: "Los Angeles Lakers"
   - City + Name: "Los Angeles" + "Lakers"
   - Abbreviation: "LAL"

2. **Alias Lookup**: Check historical aliases with date awareness
   - "Oakland Athletics" → "team_mlb_athletics" (valid until 2024-12-31)
   - Handles relocations: "Oakland" → "Las Vegas" transition

3. **Fuzzy Match**: Use rapidfuzz with 85% threshold
   - "LA Lakers" → "Los Angeles Lakers" (92% match)
   - Low-confidence matches flagged for review
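
The fuzzy stage can be sketched with the standard library's `difflib` standing in for rapidfuzz. The package itself uses rapidfuzz, whose scorers produce different (generally more forgiving) numbers, so treat this purely as an illustration of the threshold logic.

```python
from difflib import SequenceMatcher

# Sketch of stage 3: score the raw name against every canonical name
# and accept the best match only if it clears the threshold.
# difflib stands in for rapidfuzz here; scores are not comparable.
CANONICAL = {"Los Angeles Lakers": "team_nba_lal", "Boston Celtics": "team_nba_bos"}

def fuzzy_resolve(raw: str, threshold: float = 0.85):
    """Return (canonical_id, score) for the best match, or (None, score)."""
    best_id, best_score = None, 0.0
    for name, canonical_id in CANONICAL.items():
        score = SequenceMatcher(None, raw.lower(), name.lower()).ratio()
        if score > best_score:
            best_id, best_score = canonical_id, score
    return (best_id, best_score) if best_score >= threshold else (None, best_score)
```

A below-threshold result is exactly what becomes a `ManualReviewItem` with a `suggested_id` and a `confidence`.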
### Stadium Resolution

The `StadiumResolver` follows the same three-stage strategy, with additional location awareness:

```python
resolver = get_stadium_resolver("nba")
result = resolver.resolve(
    "Crypto.com Arena",
    check_date=date(2025, 10, 22)
)
```

**Key features:**

- Handles naming-rights changes (Staples Center → Crypto.com Arena)
- Date-aware: "Staples Center" resolves correctly for historical games
- Location-based fallback using latitude/longitude

## Scrapers

### Base Scraper

All scrapers extend `BaseScraper`, which provides these features:

```python
class BaseScraper(ABC):
    def __init__(self, sport: str, season: int): ...

    # Required implementations
    def _get_sources(self) -> list[str]: ...
    def _scrape_games_from_source(self, source: str) -> list[RawGameData]: ...
    def _normalize_games(self, raw_games) -> tuple[list[Game], list[ManualReviewItem]]: ...
    def scrape_teams(self) -> list[Team]: ...
    def scrape_stadiums(self) -> list[Stadium]: ...

    # Built-in features
    def scrape_games(self) -> ScrapeResult:
        """Multi-source fallback - tries each source in order."""
        ...

    def scrape_all(self) -> ScrapeResult:
        """Scrapes games, teams, and stadiums with progress tracking."""
        ...
```
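
The multi-source fallback in `scrape_games` boils down to a loop of this shape. Illustrative only; the real method also logs, rate-limits, and validates game counts.

```python
# Try each source in priority order and return the first that succeeds;
# failures are collected so the final error explains every attempt.
def scrape_with_fallback(sources, scrape_fn):
    """Return (source, games) from the first source that yields data."""
    errors = {}
    for source in sources:
        try:
            games = scrape_fn(source)
        except Exception as exc:  # record the failure and try the next source
            errors[source] = exc
            continue
        if games:
            return source, games
    raise RuntimeError(f"All sources failed: {errors}")
```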
### NBA Scraper

```python
class NBAScraper(BaseScraper):
    """
    Sources (in priority order):
    1. Basketball-Reference - HTML tables, monthly pages
    2. ESPN API - JSON, per-date queries
    3. CBS Sports - Backup (not implemented)

    Season: October to June (split year, e.g., 2025-26)
    """
```

**Basketball-Reference parsing:**

- URL: `https://www.basketball-reference.com/leagues/NBA_{year}_games-{month}.html`
- Table columns: date_game, visitor_team_name, home_team_name, visitor_pts, home_pts, arena_name

### MLB Scraper

```python
class MLBScraper(BaseScraper):
    """
    Sources:
    1. Baseball-Reference - Single page per season
    2. MLB Stats API - Official API with date range queries
    3. ESPN API - Backup

    Season: March to November (single year)
    Handles: Doubleheaders with game_number
    """
```

### NFL Scraper

```python
class NFLScraper(BaseScraper):
    """
    Sources:
    1. ESPN API - Week-based queries
    2. Pro-Football-Reference - Single page per season

    Season: September to February (split year)
    Filters: International games (London, Mexico City, Frankfurt)
    Scrapes: Preseason (4 weeks), Regular (18 weeks), Postseason (4 rounds)
    """
```

### NHL Scraper

```python
class NHLScraper(BaseScraper):
    """
    Sources:
    1. Hockey-Reference - Single page per season
    2. NHL API - New API (api-web.nhle.com)
    3. ESPN API - Backup

    Season: October to June (split year)
    Filters: International games (Prague, Stockholm, Helsinki)
    """
```

### MLS / WNBA / NWSL Scrapers

All three use the ESPN API as their primary source, with a similar structure:

- Single calendar-year seasons
- Conference-based organization (MLS) or a single table (WNBA, NWSL)

## Uploaders

### CloudKit Client

```python
class CloudKitClient:
    """CloudKit Web Services API client with JWT authentication."""

    def __init__(
        self,
        container_id: str = CLOUDKIT_CONTAINER,
        environment: str = "development",  # or "production"
        key_id: str = None,                # From CloudKit Dashboard
        private_key: str = None,           # EC P-256 private key
    ): ...

    async def fetch_records(
        self,
        record_type: RecordType,
        filter_by: Optional[dict] = None,
        sort_by: Optional[str] = None,
    ) -> list[dict]: ...

    async def save_records(
        self,
        records: list[CloudKitRecord],
        batch_size: int = 200,
    ) -> BatchResult: ...

    async def delete_records(
        self,
        record_names: list[str],
        record_type: RecordType,
    ) -> BatchResult: ...
```

**Authentication:**

- Uses an EC P-256 key pair
- JWT tokens signed with the private key
- Tokens valid for 30 minutes

### Record Differ

```python
class RecordDiffer:
    """Compares local records with CloudKit records."""

    def diff_games(self, local: list[Game], remote: list[dict]) -> DiffResult: ...
    def diff_teams(self, local: list[Team], remote: list[dict]) -> DiffResult: ...
    def diff_stadiums(self, local: list[Stadium], remote: list[dict]) -> DiffResult: ...
```

**DiffResult:**

```python
@dataclass
class DiffResult:
    creates: list[RecordDiff]    # New records to create
    updates: list[RecordDiff]    # Changed records to update
    deletes: list[RecordDiff]    # Remote records to delete
    unchanged: list[RecordDiff]  # Records with no changes

    def get_records_to_upload(self) -> list[CloudKitRecord]:
        """Returns creates + updates ready for upload."""
```
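
The diffing reduces to keying both sides by record ID and comparing fields. A minimal sketch, with plain dicts standing in for the models and CloudKit payloads:

```python
# Partition record IDs into the four DiffResult buckets.
# `local` and `remote` map record ID -> field dict; real code compares
# typed models against CloudKit record payloads instead.
def diff_records(local: dict, remote: dict) -> dict:
    """Return creates/updates/deletes/unchanged ID lists."""
    return {
        "creates":   [k for k in local if k not in remote],
        "updates":   [k for k in local if k in remote and local[k] != remote[k]],
        "deletes":   [k for k in remote if k not in local],
        "unchanged": [k for k in local if k in remote and local[k] == remote[k]],
    }
```

Only the `creates` and `updates` buckets generate network traffic, which is what keeps re-uploads of a mostly unchanged season cheap.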
### State Manager

```python
class StateManager:
    """Manages resumable upload state."""

    def load_session(self, sport, season, environment) -> Optional[UploadSession]: ...
    def save_session(self, session: UploadSession) -> None: ...
    def get_session_or_create(
        self,
        sport, season, environment,
        record_names: list[tuple[str, str]],
        resume: bool = False,
    ) -> UploadSession: ...
```

**State persistence:**

- Stored in `.parser_state/upload_state_{sport}_{season}_{env}.json`
- Tracks: pending, uploaded, failed records
- Supports retry with backoff
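
The state file can be written atomically so an interrupted run never leaves a corrupt file. A sketch following the naming convention above; the write-then-rename pattern is an implementation assumption, not confirmed package behavior.

```python
import json
import os
import tempfile
from pathlib import Path

def save_state(state_dir: Path, sport: str, season: int, env: str, state: dict) -> Path:
    """Atomically persist upload state as upload_state_{sport}_{season}_{env}.json."""
    state_dir.mkdir(parents=True, exist_ok=True)
    path = state_dir / f"upload_state_{sport}_{season}_{env}.json"
    fd, tmp = tempfile.mkstemp(dir=state_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename: readers see old or new file, never half of one
    return path
```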
## Utilities

### HTTP Client

```python
class RateLimitedSession:
    """HTTP session with rate limiting and exponential backoff."""

    def __init__(
        self,
        delay: float = 3.0,  # Seconds between requests
        max_retries: int = 3,
        backoff_factor: float = 2.0,
    ): ...

    def get(self, url, **kwargs) -> Response: ...
    def get_json(self, url, **kwargs) -> dict: ...
    def get_html(self, url, **kwargs) -> str: ...
```

**Features:**

- User-agent rotation (5 different Chrome/Firefox/Safari agents)
- Per-domain rate limiting
- Automatic 429 handling with exponential backoff + jitter
- Connection pooling
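
The retry schedule follows the config constants shown earlier (`INITIAL_BACKOFF = 5.0`, `BACKOFF_FACTOR = 2.0`). A sketch of the delay computation; the exact jitter formula is an assumption:

```python
import random

INITIAL_BACKOFF = 5.0  # seconds, mirrors config.py
BACKOFF_FACTOR = 2.0   # mirrors config.py

def backoff_delay(attempt: int, jitter: float = 0.1) -> float:
    """Delay before retry `attempt` (0-based): exponential base plus up to 10% random jitter."""
    base = INITIAL_BACKOFF * (BACKOFF_FACTOR ** attempt)
    return base + random.uniform(0, base * jitter)
```

So retries wait roughly 5s, 10s, 20s, with jitter added so concurrent scrapers don't hammer a source in lockstep.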
### Logging

```python
from sportstime_parser.utils import get_logger, log_success, log_error

logger = get_logger()  # Rich-formatted logger
logger.info("Starting scrape")

log_success("Scraped 1230 games")  # Green checkmark
log_error("Failed to parse")       # Red X
```

**Log output:**

- Console: Rich-formatted with colors
- File: `logs/parser_{timestamp}.log`

### Progress Tracking

```python
from sportstime_parser.utils import ScrapeProgress, track_progress

# Specialized scrape tracking
progress = ScrapeProgress("nba", 2025)
progress.start()

with progress.scraping_schedule(total_months=9) as advance:
    for month in months:
        fetch(month)
        advance()

progress.finish()  # Prints summary

# Generic progress bar
for game in track_progress(games, "Processing games"):
    process(game)
```

## Manual Review Workflow

When the system can't confidently resolve a team or stadium name, it emits a `ManualReviewItem`:

1. **Low-confidence fuzzy match** (< 85%):

   ```
   ManualReviewItem(
       item_type="team",
       raw_value="LA Lakers",
       suggested_id="team_nba_lal",
       confidence=0.82,
       reason="Fuzzy match below threshold"
   )
   ```

2. **No match found**:

   ```
   ManualReviewItem(
       raw_value="Unknown Team FC",
       suggested_id=None,
       confidence=0.0,
       reason="No match found in canonical mappings"
   )
   ```

3. **Ambiguous match** (multiple candidates):

   ```
   ManualReviewItem(
       raw_value="LA",
       suggested_id="team_nba_lac",
       confidence=0.5,
       reason="Ambiguous: could be Lakers or Clippers"
   )
   ```

**Resolution:**

- Review items are exported to JSON
- Manually verify each item and add it to `team_aliases.json` or `stadium_aliases.json`
- Re-run the scrape; the new aliases will resolve automatically

## Adding a New Sport

1. **Create a scraper** in `scrapers/{sport}.py`:

   ```python
   class NewSportScraper(BaseScraper):
       def __init__(self, season: int, **kwargs):
           super().__init__("newsport", season, **kwargs)
           self._team_resolver = get_team_resolver("newsport")
           self._stadium_resolver = get_stadium_resolver("newsport")

       def _get_sources(self) -> list[str]:
           return ["primary_source", "backup_source"]

       def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
           # Implement source-specific scraping
           ...

       def _normalize_games(self, raw_games) -> tuple[list[Game], list[ManualReviewItem]]:
           # Use resolvers to normalize
           ...

       def scrape_teams(self) -> list[Team]:
           # Return canonical team list
           ...

       def scrape_stadiums(self) -> list[Stadium]:
           # Return canonical stadium list
           ...
   ```

2. **Add team mappings** in `normalizers/team_resolver.py`:

   ```python
   TEAM_MAPPINGS["newsport"] = {
       "ABC": ("team_newsport_abc", "Full Team Name", "City"),
       ...
   }
   ```

3. **Add stadium mappings** in `normalizers/stadium_resolver.py`:

   ```python
   STADIUM_MAPPINGS["newsport"] = {
       "stadium_newsport_venue": StadiumInfo(
           name="Venue Name",
           city="City",
           state="State",
           country="USA",
           latitude=40.0,
           longitude=-74.0,
       ),
       ...
   }
   ```

4. **Add to `league_structure.json`** (if the league is hierarchical)

5. **Update `config.py`**:

   ```python
   EXPECTED_GAME_COUNTS["newsport"] = 500
   ```

6. **Export the scraper from `__init__.py`**

## Troubleshooting

### Rate Limiting (429 errors)

The system handles these automatically with exponential backoff. If the errors persist:

- Increase `DEFAULT_REQUEST_DELAY` in `config.py`
- Check whether the source has tightened its rate limits

### Missing Teams/Stadiums

1. Check the scraper logs for the raw values
2. Add them to `team_aliases.json` or `stadium_aliases.json`
3. Or add them to the canonical mappings if it's a genuinely new team/stadium

### CloudKit Authentication Errors

1. Verify `key_id` matches the CloudKit Dashboard
2. Check the private key format (EC P-256, PEM)
3. Ensure the container identifier is correct

### Incomplete Scrapes

The system discards partial data on errors. Check:

- `logs/` for error details
- Network connectivity
- Source website availability

### International Games Appearing

NFL and NHL scrapers filter these automatically. If new locations emerge:

- Add them to `INTERNATIONAL_LOCATIONS` in the scraper
- Or add filtering logic for neutral-site games
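
The country-based filter relies on `ALLOWED_COUNTRIES` from `config.py`; a sketch, assuming each raw game record carries a venue country (the field and function names are illustrative):

```python
ALLOWED_COUNTRIES = {"USA", "Canada"}  # mirrors config.py

def filter_international(games: list[dict]) -> list[dict]:
    """Keep only games whose venue country is in ALLOWED_COUNTRIES."""
    return [g for g in games if g.get("country") in ALLOWED_COUNTRIES]
```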
## Contributing

1. Follow existing patterns for new scrapers
2. Always use canonical IDs
3. Add aliases for historical names
4. Include source URLs for traceability
5. Test with multiple seasons