# SportsTime Data Pipeline

A Django-based sports data pipeline that scrapes game schedules from official sources, normalizes the data, and syncs it to CloudKit for iOS app consumption.

## Features

- **Multi-sport support**: NBA, MLB, NFL, NHL, MLS, WNBA, NWSL
- **Automated scraping**: Scheduled data collection from ESPN and league APIs
- **Smart name resolution**: Team/stadium aliases with date validity support
- **CloudKit sync**: Push data to iCloud for iOS app consumption
- **Admin dashboard**: Monitor scrapers, review items, manage data
- **Import/Export**: Bulk data management via JSON, CSV, XLSX
- **Audit history**: Track all changes with django-simple-history

## Quick Start

### Prerequisites

- Docker and Docker Compose
- (Optional) CloudKit credentials for sync

### Setup

1. Clone the repository:

   ```bash
   git clone <repo-url>
   cd SportsTimeScripts
   ```

2. Copy the environment template:

   ```bash
   cp .env.example .env
   ```

3. Start the containers:

   ```bash
   docker-compose up -d
   ```

4. Run migrations:

   ```bash
   docker-compose exec web python manage.py migrate
   ```

5. Create a superuser:

   ```bash
   docker-compose exec web python manage.py createsuperuser
   ```

6. Access the admin at http://localhost:8000/admin/
7. Access the dashboard at http://localhost:8000/dashboard/

## Architecture

```
┌─────────────────┐     ┌──────────────┐     ┌─────────────┐     ┌──────────┐
│ Data Sources    │ ──▶ │ Scrapers     │ ──▶ │ PostgreSQL  │ ──▶ │ CloudKit │
│ (ESPN, leagues) │     │ (sportstime_ │     │ (Django)    │     │ (iOS)    │
└─────────────────┘     │ parser)      │     └─────────────┘     └──────────┘
                        └──────────────┘
```

### Components

| Component | Description |
|-----------|-------------|
| **Django** | Web framework, ORM, admin interface |
| **PostgreSQL** | Primary database |
| **Redis** | Celery message broker |
| **Celery** | Async task queue (scraping, syncing) |
| **Celery Beat** | Scheduled task runner |
| **sportstime_parser** | Standalone scraper library |

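Celery Beat drives the scheduled scraping. A minimal sketch of what a beat schedule entry could look like in `sportstime/settings.py` — the entry name and task path here are assumptions for illustration, not the project's actual configuration:

```
# Sketch only — the schedule name and task path below are assumptions.
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    "scrape-all-enabled-nightly": {
        "task": "scraper.tasks.run_all_enabled_scrapers",  # hypothetical task name
        "schedule": crontab(hour=4, minute=0),             # every day at 04:00 UTC
    },
}
```

The actual schedule is whatever Celery Beat is configured with in `sportstime/celery.py` and the settings module.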
## Usage

### Dashboard

Visit http://localhost:8000/dashboard/ (staff login required) to:

- View scraper status and run scrapers
- Monitor CloudKit sync status
- Review items needing manual attention
- See statistics across all sports

### Running Scrapers

**Via Dashboard:**

1. Go to Dashboard → Scraper Status
2. Click "Run Now" for a specific sport, or "Run All Enabled"

**Via Command Line:**

Open a Django shell:

```bash
docker-compose exec web python manage.py shell
```

Then queue the task:

```python
>>> from scraper.tasks import run_scraper_task
>>> from scraper.models import ScraperConfig
>>> config = ScraperConfig.objects.get(sport__code='nba', season=2025)
>>> run_scraper_task.delay(config.id)
```

### Managing Aliases

When scrapers encounter unknown team or stadium names:

1. A **Review Item** is created for manual resolution
2. Add an alias via Admin → Team Aliases or Stadium Aliases
3. Re-run the scraper to pick up the new mapping

Aliases support **validity dates**, which is useful for:

- Historical team names (e.g., "Washington Redskins" valid until 2020)
- Stadium naming-rights changes (e.g., "Staples Center" valid until 2021)

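The validity window amounts to a simple date-range check. An illustrative stand-in (the real `TeamAlias`/`StadiumAlias` field names and the canonical ID below are assumptions):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Alias:
    """Illustrative stand-in for a TeamAlias/StadiumAlias row."""
    name: str
    canonical_id: str                   # e.g. "team_nfl_wsh" (hypothetical ID)
    valid_from: Optional[date] = None   # None = valid since the beginning
    valid_until: Optional[date] = None  # None = still valid

    def is_valid_on(self, d: date) -> bool:
        # An alias matches only when the game date falls inside its window.
        after_start = self.valid_from is None or self.valid_from <= d
        before_end = self.valid_until is None or d <= self.valid_until
        return after_start and before_end

# "Washington Redskins" should resolve only for games dated up to mid-2020
redskins = Alias("Washington Redskins", "team_nfl_wsh", valid_until=date(2020, 7, 12))
```

With this shape, a 2019 game resolves through the alias while a 2021 game does not, which is why re-scraping historical seasons stays correct after a rename.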
### Import/Export

All admin models support bulk import/export:

1. Go to any admin list page (e.g., Teams)
2. Click **Export** → select a format (JSON recommended) → Submit
3. Modify the data as needed (e.g., ask Claude to update it)
4. Click **Import** → upload the file → Preview → Confirm

Imports update existing records and create new ones.

## Project Structure

```
SportsTimeScripts/
├── core/                       # Core Django models
│   ├── models/                 # Sport, Team, Stadium, Game, Aliases
│   ├── admin/                  # Admin configuration with import/export
│   └── resources.py            # Import/export resource definitions
├── scraper/                    # Scraper orchestration
│   ├── engine/                 # Adapter, DB alias loaders
│   │   ├── adapter.py          # Bridges sportstime_parser to Django
│   │   └── db_alias_loader.py  # Database alias resolution
│   ├── models.py               # ScraperConfig, ScrapeJob, ManualReviewItem
│   └── tasks.py                # Celery tasks
├── sportstime_parser/          # Standalone scraper library
│   ├── scrapers/               # Per-sport scrapers (NBA, MLB, etc.)
│   ├── normalizers/            # Team/stadium name resolution
│   ├── models/                 # Data classes
│   └── uploaders/              # CloudKit client (legacy)
├── cloudkit/                   # CloudKit sync
│   ├── client.py               # CloudKit API client
│   ├── models.py               # CloudKitConfiguration, SyncState, SyncJob
│   └── tasks.py                # Sync tasks
├── dashboard/                  # Staff dashboard
│   ├── views.py                # Dashboard views
│   └── urls.py                 # Dashboard URLs
├── templates/                  # Django templates
│   ├── base.html               # Base template
│   └── dashboard/              # Dashboard templates
├── sportstime/                 # Django project config
│   ├── settings.py             # Django settings
│   ├── urls.py                 # URL routing
│   └── celery.py               # Celery configuration
├── docker-compose.yml          # Container orchestration
├── Dockerfile                  # Container image
├── requirements.txt            # Python dependencies
├── CLAUDE.md                   # Claude Code context
└── README.md                   # This file
```

## Data Models

### Model Hierarchy

```
Sport
├── Conference
│   └── Division
│       └── Team (has TeamAliases)
├── Stadium (has StadiumAliases)
└── Game (references Team, Stadium)
```

### Key Models

| Model | Description |
|-------|-------------|
| **Sport** | Sports with season configuration |
| **Team** | Teams with division, colors, logos |
| **Stadium** | Venues with location, capacity |
| **Game** | Games with scores, status, teams |
| **TeamAlias** | Historical team names with validity dates |
| **StadiumAlias** | Historical stadium names with validity dates |
| **ScraperConfig** | Scraper settings per sport/season |
| **ScrapeJob** | Scrape execution logs |
| **ManualReviewItem** | Items needing human review |
| **CloudKitSyncState** | Per-record sync status |

## Configuration

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `DEBUG` | Debug mode | `False` |
| `SECRET_KEY` | Django secret key | (required in prod) |
| `DATABASE_URL` | PostgreSQL connection string | `postgresql://...` |
| `REDIS_URL` | Redis connection string | `redis://localhost:6379/0` |
| `CLOUDKIT_CONTAINER` | CloudKit container ID | - |
| `CLOUDKIT_KEY_ID` | CloudKit key ID | - |
| `CLOUDKIT_PRIVATE_KEY_PATH` | Path to CloudKit private key | - |

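A minimal `.env` for local development might look like the following sketch. All values are placeholders; the `db` and `redis` host names assume the docker-compose service names, and the CloudKit values are invented examples:

```
DEBUG=True
SECRET_KEY=dev-only-secret-key
DATABASE_URL=postgresql://sportstime:sportstime@db:5432/sportstime
REDIS_URL=redis://redis:6379/0
# Optional CloudKit credentials (placeholder values):
# CLOUDKIT_CONTAINER=iCloud.com.example.SportsTime
# CLOUDKIT_KEY_ID=...
# CLOUDKIT_PRIVATE_KEY_PATH=/run/secrets/cloudkit_key.pem
```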
### Scraper Settings

| Setting | Description | Default |
|---------|-------------|---------|
| `SCRAPER_REQUEST_DELAY` | Delay between requests (seconds) | `3.0` |
| `SCRAPER_MAX_RETRIES` | Max retry attempts | `3` |
| `SCRAPER_FUZZY_THRESHOLD` | Fuzzy match confidence threshold | `85` |

## Supported Sports

| Code | League | Season Type | Games/Season | Data Sources |
|------|--------|-------------|--------------|--------------|
| nba | NBA | Oct-Jun (split) | ~1,230 | ESPN, NBA.com |
| mlb | MLB | Mar-Nov (calendar) | ~2,430 | ESPN, MLB.com |
| nfl | NFL | Sep-Feb (split) | ~272 | ESPN, NFL.com |
| nhl | NHL | Oct-Jun (split) | ~1,312 | ESPN, NHL.com |
| mls | MLS | Feb-Nov (calendar) | ~544 | ESPN |
| wnba | WNBA | May-Oct (calendar) | ~228 | ESPN |
| nwsl | NWSL | Mar-Nov (calendar) | ~182 | ESPN |

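The regular-season game counts follow from teams × games-per-team ÷ 2, since each game involves two teams:

```python
def season_games(teams: int, games_per_team: int) -> int:
    # Every game is shared by two teams, so halve the team-game total.
    return teams * games_per_team // 2

nba = season_games(30, 82)   # 30 NBA teams x 82 games each -> 1,230
mlb = season_games(30, 162)  # 30 MLB teams x 162 games each -> 2,430
nfl = season_games(32, 17)   # 32 NFL teams x 17 games each -> 272
```

Playoffs, preseason, and cup competitions add to these figures, which is why the table values are approximate.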
## Development

### Useful Commands

```bash
# Start containers
docker-compose up -d

# Stop containers
docker-compose down

# Restart containers
docker-compose restart

# Rebuild after requirements change
docker-compose down && docker-compose up -d --build

# View logs
docker-compose logs -f web
docker-compose logs -f celery-worker

# Django shell
docker-compose exec web python manage.py shell

# Database shell
docker-compose exec db psql -U sportstime -d sportstime

# Run migrations
docker-compose exec web python manage.py migrate

# Create superuser
docker-compose exec web python manage.py createsuperuser
```

### Running Tests

```bash
docker-compose exec web pytest
```

### Adding a New Sport

1. Create a scraper in `sportstime_parser/scrapers/{sport}.py`
2. Add team mappings in `sportstime_parser/normalizers/team_resolver.py`
3. Add stadium mappings in `sportstime_parser/normalizers/stadium_resolver.py`
4. Register the scraper in `scraper/engine/adapter.py`
5. Add a Sport record via the Django admin
6. Create a ScraperConfig for the sport/season

## sportstime_parser Library

The `sportstime_parser` package is a standalone library that handles:

- **Scraping** from multiple sources (ESPN, league APIs)
- **Normalizing** team/stadium names to canonical IDs
- **Resolving** names using exact match, aliases, and fuzzy matching

### Resolution Strategy

1. **Exact match** against canonical mappings
2. **Alias lookup** with date-aware validity
3. **Fuzzy match** with an 85% confidence threshold
4. **Manual review** if unresolved

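The four steps above can be sketched in plain Python. This is a simplified illustration using stdlib fuzzy matching with toy lookup tables; the library's actual mappings, matcher, and scoring differ:

```python
from datetime import date
from difflib import SequenceMatcher

# Toy canonical and alias tables — illustrative only.
CANONICAL = {"Los Angeles Lakers": "team_nba_lal", "Boston Celtics": "team_nba_bos"}
# alias -> (canonical_id, valid_from, valid_until); None means unbounded
ALIASES = {"LA Lakers": ("team_nba_lal", None, None)}
FUZZY_THRESHOLD = 0.85  # mirrors SCRAPER_FUZZY_THRESHOLD = 85

def resolve(name: str, game_date: date):
    # 1. Exact match against canonical mappings
    if name in CANONICAL:
        return CANONICAL[name]
    # 2. Alias lookup with date-aware validity
    if name in ALIASES:
        cid, start, end = ALIASES[name]
        if (start is None or start <= game_date) and (end is None or game_date <= end):
            return cid
    # 3. Fuzzy match above the confidence threshold
    def score(candidate: str) -> float:
        return SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
    best = max(CANONICAL, key=score)
    if score(best) >= FUZZY_THRESHOLD:
        return CANONICAL[best]
    # 4. Unresolved: return None so the caller can queue a manual review item
    return None
```

A near-miss such as "Los Angles Lakers" clears the fuzzy threshold, while a wholly unknown name falls through to manual review.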
### Canonical ID Format

```
team_nba_lal                      # Team: Los Angeles Lakers
stadium_nba_los_angeles_lakers    # Stadium: Crypto.com Arena
game_nba_2025_20251022_bos_lal    # Game: BOS @ LAL on Oct 22, 2025
```

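Assembling a game ID from its parts is mechanical. A hypothetical helper (not the library's actual API) that reproduces the documented pattern:

```python
from datetime import date

def make_game_id(sport: str, season: int, game_date: date, away: str, home: str) -> str:
    # Hypothetical helper — mirrors the documented pattern, not a library function.
    return f"game_{sport}_{season}_{game_date:%Y%m%d}_{away}_{home}"

# -> "game_nba_2025_20251022_bos_lal"
make_game_id("nba", 2025, date(2025, 10, 22), "bos", "lal")
```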
## Troubleshooting

### Scraper fails with rate limiting

The system handles 429 errors automatically. If rate limiting persists, increase `SCRAPER_REQUEST_DELAY`.

### Unknown team/stadium names

1. Check ManualReviewItem entries in the admin
2. Add an alias via Team Aliases or Stadium Aliases
3. Re-run the scraper

### CloudKit sync errors

1. Verify the credentials in CloudKitConfiguration
2. Check CloudKitSyncState for failed records
3. Use the "Retry failed syncs" action in the admin

### Docker volume issues

If template changes don't appear, rebuild the containers:

```bash
docker-compose down && docker-compose up -d --build
```

## License

Private - All rights reserved.