# SportsTime Data Pipeline
A Django-based sports data pipeline that scrapes game schedules from official sources, normalizes data, and syncs to CloudKit for iOS app consumption.
## Features
- **Multi-sport support**: NBA, MLB, NFL, NHL, MLS, WNBA, NWSL
- **Automated scraping**: Scheduled data collection from ESPN and league APIs
- **Smart name resolution**: Team/stadium aliases with date validity support
- **CloudKit sync**: Push data to iCloud for iOS app consumption
- **Admin dashboard**: Monitor scrapers, review items, manage data
- **Import/Export**: Bulk data management via JSON, CSV, XLSX
- **Audit history**: Track all changes with django-simple-history
## Quick Start
### Prerequisites
- Docker and Docker Compose
- (Optional) CloudKit credentials for sync
### Setup
1. Clone the repository:
```bash
git clone <repo-url>
cd SportsTimeScripts
```
2. Copy environment template:
```bash
cp .env.example .env
```
3. Start the containers:
```bash
docker-compose up -d
```
4. Run migrations:
```bash
docker-compose exec web python manage.py migrate
```
5. Create a superuser:
```bash
docker-compose exec web python manage.py createsuperuser
```
6. Access the admin at http://localhost:8000/admin/
7. Access the dashboard at http://localhost:8000/dashboard/
## Architecture
```
┌─────────────────┐     ┌──────────────┐     ┌─────────────┐     ┌──────────┐
│  Data Sources   │ ──▶ │   Scrapers   │ ──▶ │ PostgreSQL  │ ──▶ │ CloudKit │
│ (ESPN, leagues) │     │ (sportstime_ │     │  (Django)   │     │  (iOS)   │
└─────────────────┘     │   parser)    │     └─────────────┘     └──────────┘
                        └──────────────┘
### Components
| Component | Description |
|-----------|-------------|
| **Django** | Web framework, ORM, admin interface |
| **PostgreSQL** | Primary database |
| **Redis** | Celery message broker |
| **Celery** | Async task queue (scraping, syncing) |
| **Celery Beat** | Scheduled task runner |
| **sportstime_parser** | Standalone scraper library |
## Usage
### Dashboard
Visit http://localhost:8000/dashboard/ (staff login required) to:
- View scraper status and run scrapers
- Monitor CloudKit sync status
- Review items needing manual attention
- See statistics across all sports
### Running Scrapers
**Via Dashboard:**
1. Go to Dashboard → Scraper Status
2. Click "Run Now" for a specific sport or "Run All Enabled"
**Via Command Line:**
```bash
docker-compose exec web python manage.py shell
```
```python
from scraper.tasks import run_scraper_task
from scraper.models import ScraperConfig

config = ScraperConfig.objects.get(sport__code='nba', season=2025)
run_scraper_task.delay(config.id)
```
### Managing Aliases
When scrapers encounter unknown team or stadium names:
1. A **Review Item** is created for manual resolution
2. Add an alias via Admin → Team Aliases or Stadium Aliases
3. Re-run the scraper to pick up the new mapping
Aliases support **validity dates** - useful for:
- Historical team names (e.g., "Washington Redskins" valid until 2020)
- Stadium naming rights changes (e.g., "Staples Center" valid until 2021)
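Date-aware alias resolution amounts to filtering candidate aliases by the game date. A minimal sketch (the real lookup lives in `scraper/engine/db_alias_loader.py`; the `Alias` shape and field names here are assumptions for illustration):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Alias:
    name: str
    canonical_id: str
    valid_from: Optional[date] = None   # None = no lower bound
    valid_until: Optional[date] = None  # None = still valid today

def resolve_alias(name: str, game_date: date, aliases: list[Alias]) -> Optional[str]:
    """Return the canonical ID for a name that is valid on the given date."""
    for a in aliases:
        if a.name.lower() != name.lower():
            continue
        if a.valid_from and game_date < a.valid_from:
            continue
        if a.valid_until and game_date > a.valid_until:
            continue
        return a.canonical_id
    return None  # unresolved -> Review Item
```

With an alias like `Alias("Washington Redskins", "team_nfl_was", valid_until=date(2020, 12, 31))` (illustrative date), a 2019 game resolves while a 2021 game falls through to review.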
### Import/Export
All admin models support bulk import/export:
1. Go to any admin list page (e.g., Teams)
2. Click **Export** → Select format (JSON recommended) → Submit
3. Modify the data as needed (e.g., ask Claude to update it)
4. Click **Import** → Upload file → Preview → Confirm
Imports will update existing records and create new ones.
## Project Structure
```
SportsTimeScripts/
├── core/                    # Core Django models
│   ├── models/              # Sport, Team, Stadium, Game, Aliases
│   ├── admin/               # Admin configuration with import/export
│   └── resources.py         # Import/export resource definitions
├── scraper/                 # Scraper orchestration
│   ├── engine/              # Adapter, DB alias loaders
│   │   ├── adapter.py       # Bridges sportstime_parser to Django
│   │   └── db_alias_loader.py  # Database alias resolution
│   ├── models.py            # ScraperConfig, ScrapeJob, ManualReviewItem
│   └── tasks.py             # Celery tasks
├── sportstime_parser/       # Standalone scraper library
│   ├── scrapers/            # Per-sport scrapers (NBA, MLB, etc.)
│   ├── normalizers/         # Team/stadium name resolution
│   ├── models/              # Data classes
│   └── uploaders/           # CloudKit client (legacy)
├── cloudkit/                # CloudKit sync
│   ├── client.py            # CloudKit API client
│   ├── models.py            # CloudKitConfiguration, SyncState, SyncJob
│   └── tasks.py             # Sync tasks
├── dashboard/               # Staff dashboard
│   ├── views.py             # Dashboard views
│   └── urls.py              # Dashboard URLs
├── templates/               # Django templates
│   ├── base.html            # Base template
│   └── dashboard/           # Dashboard templates
├── sportstime/              # Django project config
│   ├── settings.py          # Django settings
│   ├── urls.py              # URL routing
│   └── celery.py            # Celery configuration
├── docker-compose.yml       # Container orchestration
├── Dockerfile               # Container image
├── requirements.txt         # Python dependencies
├── CLAUDE.md                # Claude Code context
└── README.md                # This file
```
## Data Models
### Model Hierarchy
```
Sport
├── Conference
│   └── Division
│       └── Team (has TeamAliases)
├── Stadium (has StadiumAliases)
└── Game (references Team, Stadium)
```
### Key Models
| Model | Description |
|-------|-------------|
| **Sport** | Sports with season configuration |
| **Team** | Teams with division, colors, logos |
| **Stadium** | Venues with location, capacity |
| **Game** | Games with scores, status, teams |
| **TeamAlias** | Historical team names with validity dates |
| **StadiumAlias** | Historical stadium names with validity dates |
| **ScraperConfig** | Scraper settings per sport/season |
| **ScrapeJob** | Scrape execution logs |
| **ManualReviewItem** | Items needing human review |
| **CloudKitSyncState** | Per-record sync status |
## Configuration
### Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| `DEBUG` | Debug mode | `False` |
| `SECRET_KEY` | Django secret key | (required in prod) |
| `DATABASE_URL` | PostgreSQL connection | `postgresql://...` |
| `REDIS_URL` | Redis connection | `redis://localhost:6379/0` |
| `CLOUDKIT_CONTAINER` | CloudKit container ID | - |
| `CLOUDKIT_KEY_ID` | CloudKit key ID | - |
| `CLOUDKIT_PRIVATE_KEY_PATH` | Path to CloudKit private key | - |
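A minimal `.env` based on the table above (all values are placeholders, not working credentials; the service hostnames assume the docker-compose setup):

```bash
DEBUG=False
SECRET_KEY=change-me
DATABASE_URL=postgresql://sportstime:sportstime@db:5432/sportstime
REDIS_URL=redis://redis:6379/0
CLOUDKIT_CONTAINER=iCloud.com.example.SportsTime
CLOUDKIT_KEY_ID=ABCDEF1234
CLOUDKIT_PRIVATE_KEY_PATH=/run/secrets/cloudkit_eckey.pem
```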
### Scraper Settings
| Setting | Description | Default |
|---------|-------------|---------|
| `SCRAPER_REQUEST_DELAY` | Delay between requests (seconds) | `3.0` |
| `SCRAPER_MAX_RETRIES` | Max retry attempts | `3` |
| `SCRAPER_FUZZY_THRESHOLD` | Fuzzy match confidence threshold | `85` |
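`SCRAPER_FUZZY_THRESHOLD` is a 0–100 similarity score. As a rough illustration of the idea using the standard library (the actual resolver may use a different similarity function):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity score on a 0-100 scale."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100

# With a threshold of 85, punctuation variants of the same name pass,
# while unrelated team names fall through to manual review.
close = similarity("LA Galaxy", "L.A. Galaxy")      # scores ~90
far = similarity("LA Galaxy", "Inter Miami CF")      # scores far below 85
```

Raising the threshold reduces false matches at the cost of more Review Items; lowering it does the opposite.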
## Supported Sports
| Code | League | Season Type | Games/Season | Data Sources |
|------|--------|-------------|--------------|--------------|
| nba | NBA | Oct-Jun (split) | ~1,230 | ESPN, NBA.com |
| mlb | MLB | Mar-Nov (calendar) | ~2,430 | ESPN, MLB.com |
| nfl | NFL | Sep-Feb (split) | ~272 | ESPN, NFL.com |
| nhl | NHL | Oct-Jun (split) | ~1,312 | ESPN, NHL.com |
| mls | MLS | Feb-Nov (calendar) | ~544 | ESPN |
| wnba | WNBA | May-Oct (calendar) | ~228 | ESPN |
| nwsl | NWSL | Mar-Nov (calendar) | ~182 | ESPN |
## Development
### Useful Commands
```bash
# Start containers
docker-compose up -d
# Stop containers
docker-compose down
# Restart containers
docker-compose restart
# Rebuild after requirements change
docker-compose down && docker-compose up -d --build
# View logs
docker-compose logs -f web
docker-compose logs -f celery-worker
# Django shell
docker-compose exec web python manage.py shell
# Database shell
docker-compose exec db psql -U sportstime -d sportstime
# Run migrations
docker-compose exec web python manage.py migrate
# Create superuser
docker-compose exec web python manage.py createsuperuser
```
### Running Tests
```bash
docker-compose exec web pytest
```
### Adding a New Sport
1. Create scraper in `sportstime_parser/scrapers/{sport}.py`
2. Add team mappings in `sportstime_parser/normalizers/team_resolver.py`
3. Add stadium mappings in `sportstime_parser/normalizers/stadium_resolver.py`
4. Register scraper in `scraper/engine/adapter.py`
5. Add Sport record via Django admin
6. Create ScraperConfig for the sport/season
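A new scraper module (step 1) generally mirrors the existing per-sport ones. As a purely hypothetical skeleton (the class name, `sport_code` attribute, and method signature are assumptions; copy the real interface from an existing scraper in `sportstime_parser/scrapers/`):

```python
# Hypothetical skeleton only; mirror an existing per-sport scraper
# for the real base interface and return types.
class NewSportScraper:
    sport_code = "xyz"  # placeholder sport code

    def scrape_season(self, season: int) -> list[dict]:
        """Fetch the season schedule and return raw game dicts
        for normalization by sportstime_parser."""
        games: list[dict] = []
        # fetch + parse schedule pages here, respecting SCRAPER_REQUEST_DELAY
        return games
```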
## sportstime_parser Library
The `sportstime_parser` package is a standalone library that handles:
- **Scraping** from multiple sources (ESPN, league APIs)
- **Normalizing** team/stadium names to canonical IDs
- **Resolving** names using exact match, aliases, and fuzzy matching
### Resolution Strategy
1. **Exact match** against canonical mappings
2. **Alias lookup** with date-aware validity
3. **Fuzzy match** with 85% confidence threshold
4. **Manual review** if unresolved
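The cascade above can be sketched as follows (illustrative only; the dict-based lookup tables and function shape are assumptions, not the library's API, and alias date validity is omitted for brevity):

```python
from difflib import SequenceMatcher

CANONICAL = {"los angeles lakers": "team_nba_lal"}
ALIASES = {"la lakers": "team_nba_lal"}
FUZZY_THRESHOLD = 85

def resolve(name: str):
    key = name.lower().strip()
    # 1. Exact match against canonical mappings
    if key in CANONICAL:
        return CANONICAL[key]
    # 2. Alias lookup (the real lookup is date-aware)
    if key in ALIASES:
        return ALIASES[key]
    # 3. Fuzzy match against canonical names
    best = max(CANONICAL, key=lambda c: SequenceMatcher(None, key, c).ratio())
    if SequenceMatcher(None, key, best).ratio() * 100 >= FUZZY_THRESHOLD:
        return CANONICAL[best]
    # 4. Unresolved: the engine creates a ManualReviewItem
    return None
```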
### Canonical ID Format
```
team_nba_lal                     # Team: Los Angeles Lakers
stadium_nba_los_angeles_lakers   # Stadium: Crypto.com Arena
game_nba_2025_20251022_bos_lal   # Game: BOS @ LAL on Oct 22, 2025
```
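A helper that assembles game IDs in this format might look like the following (a sketch; the actual generator lives inside sportstime_parser):

```python
from datetime import date

def game_id(sport: str, season: int, game_date: date, away: str, home: str) -> str:
    """Build a canonical game ID, e.g. game_nba_2025_20251022_bos_lal."""
    return f"game_{sport}_{season}_{game_date:%Y%m%d}_{away.lower()}_{home.lower()}"

gid = game_id("nba", 2025, date(2025, 10, 22), "BOS", "LAL")
# gid == "game_nba_2025_20251022_bos_lal"
```

Because the ID embeds sport, season, date, and both team codes, re-scraping the same game yields the same ID, which keeps CloudKit syncs idempotent.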
## Troubleshooting
### Scraper fails with rate limiting
The system handles 429 errors automatically. If persistent, increase `SCRAPER_REQUEST_DELAY`.
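Conceptually, that automatic handling is just a retry loop with exponential backoff (a simplified sketch, not the engine's actual code; `fetch` here is a stand-in for any request function returning a status code and body):

```python
import time

def fetch_with_backoff(fetch, max_retries=3, base_delay=3.0):
    """Call fetch(); on a 429 response, wait and retry with exponential backoff."""
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status != 429:
            return body
        if attempt < max_retries:
            time.sleep(base_delay * (2 ** attempt))  # 3s, 6s, 12s, ...
    raise RuntimeError("rate limited after retries")
```

Raising `SCRAPER_REQUEST_DELAY` widens the gap between requests so the backoff path is hit less often in the first place.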
### Unknown team/stadium names
1. Check ManualReviewItem in admin
2. Add alias via Team Aliases or Stadium Aliases
3. Re-run scraper
### CloudKit sync errors
1. Verify credentials in CloudKitConfiguration
2. Check CloudKitSyncState for failed records
3. Use "Retry failed syncs" action in admin
### Docker volume issues
If template changes don't appear:
```bash
docker-compose down && docker-compose up -d --build
```
## License
Private - All rights reserved.