feat(scripts): add sportstime-parser data pipeline

Complete Python package for scraping, normalizing, and uploading
sports schedule data to CloudKit.

Includes:
- Multi-source scrapers for NBA, MLB, NFL, NHL, MLS, WNBA, NWSL
- Canonical ID system for teams, stadiums, and games
- Fuzzy matching with manual alias support
- CloudKit uploader with batch operations and deduplication
- Comprehensive test suite with fixtures
- WNBA abbreviation aliases for improved team resolution
- Alias validation script to detect orphan references

All 5 phases of data remediation plan completed:
- Phase 1: Alias fixes (team/stadium alias additions)
- Phase 2: NHL stadium coordinate fixes
- Phase 3: Re-scrape validation
- Phase 4: iOS bundle update
- Phase 5: Code quality improvements (WNBA aliases)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
688
sportstime_parser/README.md
Normal file
@@ -0,0 +1,688 @@
# SportsTime Parser

A Python CLI tool for scraping sports schedules, normalizing data with canonical IDs, and uploading to CloudKit.

## Features

- Scrapes game schedules from multiple sources with automatic fallback
- Supports 7 major sports leagues: NBA, MLB, NFL, NHL, MLS, WNBA, NWSL
- Generates deterministic canonical IDs for games, teams, and stadiums
- Produces validation reports with manual review lists
- Uploads to CloudKit with resumable, diff-based updates

## Requirements

- Python 3.11+
- CloudKit credentials (for upload functionality)

## Installation

```bash
# From the Scripts directory
cd Scripts

# Install in development mode
pip install -e ".[dev]"

# Or install dependencies only
pip install -r requirements.txt
```

## Quick Start

```bash
# Scrape NBA 2025-26 season
sportstime-parser scrape nba --season 2025

# Scrape all sports
sportstime-parser scrape all --season 2025

# Validate existing scraped data
sportstime-parser validate nba --season 2025

# Check status
sportstime-parser status

# Upload to CloudKit (development)
sportstime-parser upload nba --season 2025

# Upload to CloudKit (production)
sportstime-parser upload nba --season 2025 --environment production
```

## CLI Reference

### scrape

Scrape game schedules, teams, and stadiums from web sources.

```bash
sportstime-parser scrape <sport> [options]

Arguments:
  sport              Sport to scrape: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT   Season start year (default: 2025)
  --dry-run          Parse and validate only, don't write output files
  --verbose, -v      Enable verbose output
```

**Examples:**

```bash
# Scrape NBA 2025-26 season
sportstime-parser scrape nba --season 2025

# Scrape all sports with verbose output
sportstime-parser scrape all --season 2025 --verbose

# Dry run to test without writing files
sportstime-parser scrape mlb --season 2026 --dry-run
```

### validate

Run validation on existing scraped data and regenerate reports. Validation performs these checks:

1. **Game Coverage**: Compares scraped game count against expected totals per league (e.g., ~1,230 for NBA, ~2,430 for MLB)
2. **Team Resolution**: Identifies team names that couldn't be matched to canonical IDs using fuzzy matching
3. **Stadium Resolution**: Identifies venue names that couldn't be matched to canonical stadium IDs
4. **Duplicate Detection**: Finds games with the same home/away teams on the same date (potential doubleheader issues or data errors)
5. **Missing Data**: Flags games missing required fields (stadium_id, team IDs, valid dates)

The output is a Markdown report with:
- Summary statistics (total games, valid games, coverage percentage)
- Manual review items grouped by type (unresolved teams, unresolved stadiums, duplicates)
- Fuzzy match suggestions with confidence scores to help resolve unmatched names

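As a rough illustration of checks 1, 4, and 5, the sketch below computes a coverage percentage, finds same-day duplicate matchups, and lists missing required fields. The field names `home_team_id`, `away_team_id`, and `stadium_id` follow the cross-reference diagram in this README; the `date` key, the function names, and the plain-dict game records are assumptions made for the example and are not taken from the validator's code.

```python
from collections import Counter

# Hypothetical required-field list; "date" is an assumed key name.
REQUIRED_FIELDS = ("home_team_id", "away_team_id", "stadium_id", "date")


def coverage_percent(scraped: int, expected: int) -> float:
    """Share of the expected schedule that was scraped, as a percentage."""
    return round(100.0 * scraped / expected, 1) if expected else 0.0


def find_duplicates(games: list[dict]) -> list[tuple]:
    """Return (home, away, date) keys that occur more than once."""
    counts = Counter(
        (g.get("home_team_id"), g.get("away_team_id"), g.get("date")) for g in games
    )
    return [key for key, n in counts.items() if n > 1]


def missing_fields(game: dict) -> list[str]:
    """List required fields that are absent or empty on one game."""
    return [field for field in REQUIRED_FIELDS if not game.get(field)]


games = [
    {"home_team_id": "nba_oklahoma_city_thunder", "away_team_id": "nba_houston_rockets",
     "stadium_id": "nba_paycom_center", "date": "2025-10-21"},
    {"home_team_id": "nba_oklahoma_city_thunder", "away_team_id": "nba_houston_rockets",
     "stadium_id": None, "date": "2025-10-21"},
]
print(coverage_percent(1200, 1230))   # 97.6
print(find_duplicates(games))         # one duplicated (home, away, date) key
print(missing_fields(games[1]))       # ['stadium_id']
```

A duplicate key is only a hint: a genuine doubleheader produces the same key, which is why such entries are routed to manual review rather than dropped.
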
```bash
sportstime-parser validate <sport> [options]

Arguments:
  sport              Sport to validate: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT   Season start year (default: 2025)
```

**Examples:**

```bash
# Validate NBA data
sportstime-parser validate nba --season 2025

# Validate all sports
sportstime-parser validate all
```

### upload

Upload scraped data to CloudKit with diff-based updates.

```bash
sportstime-parser upload <sport> [options]

Arguments:
  sport               Sport to upload: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT    Season start year (default: 2025)
  --environment, -e   CloudKit environment: development or production (default: development)
  --resume            Resume interrupted upload from last checkpoint
```

**Examples:**

```bash
# Upload NBA to development
sportstime-parser upload nba --season 2025

# Upload to production
sportstime-parser upload nba --season 2025 --environment production

# Resume interrupted upload
sportstime-parser upload mlb --season 2026 --resume
```

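One way to picture a diff-based update is to compare local records with the records already in CloudKit and send only what changed. The sketch below is a conceptual illustration under that assumption: `diff_records`, the dict-of-dicts record shape, and the three buckets are hypothetical and do not describe the actual logic in `uploaders/diff.py`.

```python
def diff_records(local: dict[str, dict], remote: dict[str, dict]) -> dict[str, list[str]]:
    """Classify record IDs by the action an upload would need.

    Both arguments map a canonical ID to that record's fields.
    """
    shared = local.keys() & remote.keys()
    return {
        # Present locally but not remotely: needs to be created.
        "create": sorted(local.keys() - remote.keys()),
        # Present on both sides with different fields: needs an update.
        "update": sorted(k for k in shared if local[k] != remote[k]),
        # Identical on both sides: can be skipped.
        "unchanged": sorted(k for k in shared if local[k] == remote[k]),
    }


local = {
    "nba_2025_hou_okc_1021": {"stadium_id": "nba_paycom_center"},
    "nba_toyota_center": {"name": "Toyota Center"},
}
remote = {
    "nba_2025_hou_okc_1021": {"stadium_id": "nba_paycom_center"},
}
print(diff_records(local, remote))
```

Skipping unchanged records keeps repeated uploads cheap. How records that exist only remotely are handled is not covered by this sketch.
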
### status

Show current scrape and upload status.

```bash
sportstime-parser status
```

### retry

Retry failed uploads from previous attempts.

```bash
sportstime-parser retry <sport> [options]

Arguments:
  sport               Sport to retry: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT    Season start year (default: 2025)
  --environment, -e   CloudKit environment (default: development)
  --max-retries INT   Maximum retry attempts per record (default: 3)
```

### clear

Clear upload session state to start fresh.

```bash
sportstime-parser clear <sport> [options]

Arguments:
  sport               Sport to clear: nba, mlb, nfl, nhl, mls, wnba, nwsl, or "all"

Options:
  --season, -s INT    Season start year (default: 2025)
  --environment, -e   CloudKit environment (default: development)
```

## CloudKit Configuration

To upload data to CloudKit, you need to configure authentication credentials.

### 1. Get Credentials from Apple Developer Portal

1. Go to [Apple Developer Portal](https://developer.apple.com)
2. Navigate to **Certificates, Identifiers & Profiles** > **Keys**
3. Create a new key with **CloudKit** capability
4. Download the private key file (.p8)
5. Note the Key ID

### 2. Set Environment Variables

```bash
# Key ID from Apple Developer Portal
export CLOUDKIT_KEY_ID="your_key_id_here"

# Path to private key file
export CLOUDKIT_PRIVATE_KEY_PATH="/path/to/AuthKey_XXXXXX.p8"

# Or provide key content directly (useful for CI/CD)
export CLOUDKIT_PRIVATE_KEY="-----BEGIN EC PRIVATE KEY-----
...key content...
-----END EC PRIVATE KEY-----"
```

### 3. Verify Configuration

```bash
sportstime-parser status
```

The status output will show whether CloudKit is configured correctly.

## Output Files

Scraped data is saved to the `output/` directory:

```
output/
  games_nba_2025.json       # Game schedules
  teams_nba.json            # Team data
  stadiums_nba.json         # Stadium data
  validation_nba_2025.md    # Validation report
```

## Validation Reports

Validation reports are generated in Markdown format at `output/validation_{sport}_{season}.md`.

### Report Sections

**Summary Table**

| Metric | Description |
|--------|-------------|
| Total Games | Number of games scraped |
| Valid Games | Games with all required fields resolved |
| Coverage | Percentage of expected games found (based on league schedule) |
| Unresolved Teams | Team names that couldn't be matched |
| Unresolved Stadiums | Venue names that couldn't be matched |
| Duplicates | Potential duplicate game entries |

**Manual Review Items**

Items are grouped by type and include the raw value, source URL, and suggested fixes:

- **Unresolved Teams**: Team names not in the alias mapping. Add to `team_aliases.json` to resolve.
- **Unresolved Stadiums**: Venue names not recognized. Common for renamed arenas (naming rights changes). Add to `stadium_aliases.json`.
- **Duplicate Games**: Same matchup on same date. May indicate doubleheader parsing issues or duplicate entries from different sources.
- **Missing Data**: Games missing stadium coordinates or other required fields.

**Fuzzy Match Suggestions**

For each unresolved name, the validator provides the top fuzzy matches with confidence scores (0-100). High-confidence matches (>80) are likely correct; lower scores need manual verification.

## Canonical IDs

Canonical IDs are stable, deterministic identifiers that enable cross-referencing between games, teams, and stadiums across different data sources.

### ID Formats

**Games**
```
{sport}_{season}_{away}_{home}_{MMDD}[_{game_number}]
```
Examples:
- `nba_2025_hou_okc_1021` - NBA 2025-26, Houston @ OKC, Oct 21
- `mlb_2026_nyy_bos_0401_1` - MLB 2026, Yankees @ Red Sox, Apr 1, Game 1 (doubleheader)

**Teams**
```
{sport}_{city}_{name}
```
Examples:
- `nba_la_lakers`
- `mlb_new_york_yankees`
- `nfl_new_york_giants`

**Stadiums**
```
{sport}_{normalized_name}
```
Examples:
- `mlb_yankee_stadium`
- `nba_crypto_com_arena`
- `nfl_sofi_stadium`

### Generated vs Matched IDs

| Entity | Generated | Matched |
|--------|-----------|---------|
| **Teams** | Pre-defined in `team_resolver.py` mappings | Resolved from raw scraped names via aliases + fuzzy matching |
| **Stadiums** | Pre-defined in `stadium_resolver.py` mappings | Resolved from raw venue names via aliases + fuzzy matching |
| **Games** | Generated at scrape time from resolved team IDs + date | N/A (always generated, never matched) |

**Resolution Flow:**
```
Raw Name (from scraper)
    ↓
Exact Match (alias lookup in team_aliases.json / stadium_aliases.json)
    ↓ (if no match)
Fuzzy Match (Levenshtein distance against known names)
    ↓ (if confidence > threshold)
Canonical ID assigned
    ↓ (if no match)
Manual Review Item created
```

### Cross-References

Entities reference each other via canonical IDs:

```
┌─────────────────────────────────────────────────────────────┐
│                            Game                             │
│  id: nba_2025_hou_okc_1021                                  │
│  home_team_id: nba_oklahoma_city_thunder ───────────────┐   │
│  away_team_id: nba_houston_rockets ─────────────────┐   │   │
│  stadium_id: nba_paycom_center ─────────────────┐   │   │   │
└─────────────────────────────────────────────────│───│───│───┘
                                                  │   │   │
┌─────────────────────────────────────────────────│───│───│───┐
│                     Stadium                     │   │   │   │
│  id: nba_paycom_center ◄────────────────────────┘   │   │   │
│  name: "Paycom Center"                              │   │   │
│  city: "Oklahoma City"                              │   │   │
│  latitude: 35.4634                                  │   │   │
│  longitude: -97.5151                                │   │   │
└─────────────────────────────────────────────────────│───│───┘
                                                      │   │
┌─────────────────────────────────────────────────────│───│───┐
│                        Team                         │   │   │
│  id: nba_houston_rockets ◄──────────────────────────┘   │   │
│  name: "Rockets"                                        │   │
│  city: "Houston"                                        │   │
│  stadium_id: nba_toyota_center                          │   │
└─────────────────────────────────────────────────────────│───┘
                                                          │
┌─────────────────────────────────────────────────────────│───┐
│                          Team                           │   │
│  id: nba_oklahoma_city_thunder ◄────────────────────────┘   │
│  name: "Thunder"                                            │
│  city: "Oklahoma City"                                      │
│  stadium_id: nba_paycom_center                              │
└─────────────────────────────────────────────────────────────┘
```

### Alias Files

Aliases map variant names to canonical IDs:

**`team_aliases.json`**
```json
{
  "nba": {
    "LA Lakers": "nba_la_lakers",
    "Los Angeles Lakers": "nba_la_lakers",
    "LAL": "nba_la_lakers"
  }
}
```

**`stadium_aliases.json`**
```json
{
  "nba": {
    "Crypto.com Arena": "nba_crypto_com_arena",
    "Staples Center": "nba_crypto_com_arena",
    "STAPLES Center": "nba_crypto_com_arena"
  }
}
```

When a scraper returns a raw name like "LA Lakers", the resolver:
1. Checks `team_aliases.json` for an exact match → finds `nba_la_lakers`
2. If no exact match, runs fuzzy matching against all known team names
3. If fuzzy match confidence > 80%, uses that canonical ID
4. Otherwise, creates a manual review item for human resolution

## Adding a New Sport

To add support for a new sport (e.g., `cfb` for college football), update these files:

### 1. Configuration (`config.py`)

Add the sport to `SUPPORTED_SPORTS` and `EXPECTED_GAME_COUNTS`:

```python
SUPPORTED_SPORTS: list[str] = [
    "nba", "mlb", "nfl", "nhl", "mls", "wnba", "nwsl",
    "cfb",  # ← Add new sport
]

EXPECTED_GAME_COUNTS: dict[str, int] = {
    # ... existing sports ...
    "cfb": 900,  # ← Add expected game count for validation
}
```

### 2. Team Mappings (`normalizers/team_resolver.py`)

Add team definitions to `TEAM_MAPPINGS`. Each entry maps an abbreviation to `(canonical_id, full_name, city)`:

```python
TEAM_MAPPINGS: dict[str, dict[str, tuple[str, str, str]]] = {
    # ... existing sports ...
    "cfb": {
        "ALA": ("team_cfb_ala", "Alabama Crimson Tide", "Tuscaloosa"),
        "OSU": ("team_cfb_osu", "Ohio State Buckeyes", "Columbus"),
        # ... all teams ...
    },
}
```

### 3. Stadium Mappings (`normalizers/stadium_resolver.py`)

Add stadium definitions to `STADIUM_MAPPINGS`. Each entry is a `StadiumInfo` with coordinates:

```python
STADIUM_MAPPINGS: dict[str, dict[str, StadiumInfo]] = {
    # ... existing sports ...
    "cfb": {
        "stadium_cfb_bryant_denny": StadiumInfo(
            id="stadium_cfb_bryant_denny",
            name="Bryant-Denny Stadium",
            city="Tuscaloosa",
            state="AL",
            country="USA",
            sport="cfb",
            latitude=33.2083,
            longitude=-87.5503,
        ),
        # ... all stadiums ...
    },
}
```

### 4. Scraper Implementation (`scrapers/cfb.py`)

Create a new scraper class extending `BaseScraper`:

```python
# Import paths other than .base are assumed from the Project Structure section;
# adjust them if the modules differ.
from ..models.game import Game
from ..models.stadium import Stadium
from ..models.team import Team
from ..normalizers.stadium_resolver import get_stadium_resolver
from ..normalizers.team_resolver import get_team_resolver
from .base import BaseScraper, RawGameData, ScrapeResult
# ManualReviewItem is also needed; import it from the module that defines it.


class CFBScraper(BaseScraper):
    def __init__(self, season: int, **kwargs):
        super().__init__("cfb", season, **kwargs)
        self._team_resolver = get_team_resolver("cfb")
        self._stadium_resolver = get_stadium_resolver("cfb")

    def _get_sources(self) -> list[str]:
        return ["espn", "sports_reference"]  # Priority order

    def _get_source_url(self, source: str, **kwargs) -> str:
        # Return URL for each source
        ...

    def _scrape_games_from_source(self, source: str) -> list[RawGameData]:
        # Implement scraping logic
        ...

    def _normalize_games(self, raw_games: list[RawGameData]) -> tuple[list[Game], list[ManualReviewItem]]:
        # Convert raw data to Game objects using resolvers
        ...

    def scrape_teams(self) -> list[Team]:
        # Return Team objects from TEAM_MAPPINGS
        ...

    def scrape_stadiums(self) -> list[Stadium]:
        # Return Stadium objects from STADIUM_MAPPINGS
        ...


def create_cfb_scraper(season: int) -> CFBScraper:
    return CFBScraper(season=season)
```

### 5. Register Scraper (`scrapers/__init__.py`)

Export the new scraper:

```python
from .cfb import CFBScraper, create_cfb_scraper

__all__ = [
    # ... existing exports ...
    "CFBScraper",
    "create_cfb_scraper",
]
```

### 6. CLI Registration (`cli.py`)

Add the sport to `get_scraper()`:

```python
def get_scraper(sport: str, season: int):
    # ... existing sports ...
    elif sport == "cfb":
        from .scrapers.cfb import create_cfb_scraper
        return create_cfb_scraper(season)
```

### 7. Alias Files (`team_aliases.json`, `stadium_aliases.json`)

Add initial aliases for common name variants:

**`team_aliases.json`**
```json
{
  "cfb": {
    "Alabama": "team_cfb_ala",
    "Bama": "team_cfb_ala",
    "Roll Tide": "team_cfb_ala"
  }
}
```

**`stadium_aliases.json`**
```json
{
  "cfb": {
    "Bryant Denny Stadium": "stadium_cfb_bryant_denny",
    "Bryant-Denny": "stadium_cfb_bryant_denny"
  }
}
```

### 8. Documentation (`SOURCES.md`)

Document data sources with URLs, rate limits, and notes:

```markdown
## CFB (College Football)

**Teams**: 134 (FBS)
**Expected Games**: ~900 per season
**Season**: August - January

### Sources

| Priority | Source | URL Pattern | Data Type |
|----------|--------|-------------|-----------|
| 1 | ESPN API | `site.api.espn.com/apis/site/v2/sports/football/college-football/scoreboard` | JSON |
| 2 | Sports-Reference | `sports-reference.com/cfb/years/{YEAR}-schedule.html` | HTML |
```

### 9. Tests (`tests/test_scrapers/test_cfb.py`)

Create tests for the new scraper:

```python
import pytest
from sportstime_parser.scrapers.cfb import CFBScraper, create_cfb_scraper


class TestCFBScraper:
    def test_factory_creates_scraper(self):
        scraper = create_cfb_scraper(season=2025)
        assert scraper.sport == "cfb"
        assert scraper.season == 2025

    def test_get_sources_returns_priority_list(self):
        scraper = CFBScraper(season=2025)
        sources = scraper._get_sources()
        assert "espn" in sources

    # ... more tests ...
```

### Checklist

- [ ] Add to `SUPPORTED_SPORTS` in `config.py`
- [ ] Add to `EXPECTED_GAME_COUNTS` in `config.py`
- [ ] Add team mappings to `team_resolver.py`
- [ ] Add stadium mappings to `stadium_resolver.py`
- [ ] Create `scrapers/{sport}.py` with scraper class
- [ ] Export in `scrapers/__init__.py`
- [ ] Register in `cli.py` `get_scraper()`
- [ ] Add aliases to `team_aliases.json`
- [ ] Add aliases to `stadium_aliases.json`
- [ ] Document sources in `SOURCES.md`
- [ ] Create tests in `tests/test_scrapers/`
- [ ] Run `pytest` to verify all tests pass
- [ ] Run dry-run scrape: `sportstime-parser scrape {sport} --season 2025 --dry-run`

## Development

### Running Tests

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=sportstime_parser --cov-report=html

# Run specific test file
pytest tests/test_scrapers/test_nba.py

# Run with verbose output
pytest -v
```

### Project Structure

```
sportstime_parser/
    __init__.py
    __main__.py              # CLI entry point
    cli.py                   # Subcommand definitions
    config.py                # Constants, defaults

    models/
        game.py              # Game dataclass
        team.py              # Team dataclass
        stadium.py           # Stadium dataclass
        aliases.py           # Alias dataclasses

    scrapers/
        base.py              # BaseScraper abstract class
        nba.py               # NBA scrapers
        mlb.py               # MLB scrapers
        nfl.py               # NFL scrapers
        nhl.py               # NHL scrapers
        mls.py               # MLS scrapers
        wnba.py              # WNBA scrapers
        nwsl.py              # NWSL scrapers

    normalizers/
        canonical_id.py      # ID generation
        team_resolver.py     # Team name resolution
        stadium_resolver.py  # Stadium name resolution
        timezone.py          # Timezone conversion
        fuzzy.py             # Fuzzy matching

    validators/
        report.py            # Validation report generator

    uploaders/
        cloudkit.py          # CloudKit Web Services client
        state.py             # Resumable upload state
        diff.py              # Record comparison

    utils/
        http.py              # Rate-limited HTTP client
        logging.py           # Verbose logger
        progress.py          # Progress bars
```

## Troubleshooting

### "No games file found"

Run the scrape command first:
```bash
sportstime-parser scrape nba --season 2025
```

### "CloudKit not configured"

Set the required environment variables:
```bash
export CLOUDKIT_KEY_ID="your_key_id"
export CLOUDKIT_PRIVATE_KEY_PATH="/path/to/key.p8"
```

### Rate limit errors

The scraper includes automatic rate limiting and exponential backoff. If you encounter persistent rate limit errors:

1. Wait a few minutes before retrying
2. Try scraping one sport at a time instead of "all"
3. Check that you're not running multiple instances

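For readers unfamiliar with the term, exponential backoff means doubling the wait after each rate-limited attempt. The sketch below shows the general technique only; `RateLimitError`, `with_backoff`, and the delay values are hypothetical and are not taken from `utils/http.py`.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error raised when a source answers with a rate-limit response."""


def with_backoff(fetch, max_attempts: int = 5, base_delay: float = 1.0,
                 max_delay: float = 60.0, sleep=time.sleep):
    """Call fetch(); on RateLimitError wait 1s, 2s, 4s, ... and try again."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))


# Usage: a fake fetch that is rate limited twice, with waits recorded instead of slept.
calls = {"count": 0}


def flaky_fetch() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimitError("too many requests")
    return "ok"


waits: list[float] = []
print(with_backoff(flaky_fetch, sleep=waits.append))  # ok
print(len(waits))                                     # 2
```

Running several instances at once defeats this kind of scheme, since each process backs off independently while the source counts all of their requests together.
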
### Scrape fails with no data

1. Check your internet connection
2. Run with `--verbose` to see detailed error messages
3. The scraper will try multiple sources - if all fail, the source websites may be temporarily unavailable

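The multi-source behaviour can be pictured as a loop over sources in priority order that stops at the first one returning games. This is a conceptual sketch: `scrape_with_fallback` and the `scrape_one` callback are hypothetical names, and `BaseScraper` may implement fallback differently.

```python
from collections.abc import Callable


def scrape_with_fallback(sources: list[str],
                         scrape_one: Callable[[str], list[dict]]) -> tuple[str, list[dict]]:
    """Try each source in order; return (source, games) for the first success."""
    errors: dict[str, str] = {}
    for source in sources:
        try:
            games = scrape_one(source)
        except Exception as exc:  # broad on purpose: any failure moves to the next source
            errors[source] = str(exc)
            continue
        if games:
            return source, games
        errors[source] = "no games returned"
    raise RuntimeError(f"all sources failed: {errors}")


def fake_scrape(source: str) -> list[dict]:
    # Stand-in for real scraping: the first source is down, the second works.
    if source == "espn":
        raise ConnectionError("timed out")
    return [{"id": "nba_2025_hou_okc_1021"}]


print(scrape_with_fallback(["espn", "sports_reference"], fake_scrape))
```

Collecting the per-source errors is what makes a `--verbose` style message useful when every source fails.
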
## License

MIT