# SportsTime Data Pipeline

A Django-based sports data pipeline that scrapes game schedules from official sources, normalizes the data, and syncs it to CloudKit for iOS app consumption.

## Features

- **Multi-sport support**: NBA, MLB, NFL, NHL, MLS, WNBA, NWSL
- **Automated scraping**: Scheduled data collection from ESPN and league APIs
- **Smart name resolution**: Team/stadium aliases with date validity support
- **CloudKit sync**: Push data to iCloud for iOS app consumption
- **Admin dashboard**: Monitor scrapers, review items, manage data
- **Import/Export**: Bulk data management via JSON, CSV, XLSX
- **Audit history**: Track all changes with django-simple-history

## Quick Start

### Prerequisites

- Docker and Docker Compose
- (Optional) CloudKit credentials for sync

### Setup

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd SportsTimeScripts
   ```

2. Copy the environment template:

   ```bash
   cp .env.example .env
   ```

3. Start the containers:

   ```bash
   docker-compose up -d
   ```

4. Run migrations:

   ```bash
   docker-compose exec web python manage.py migrate
   ```

5. Create a superuser:

   ```bash
   docker-compose exec web python manage.py createsuperuser
   ```

6. Access the admin at http://localhost:8000/admin/
7. Access the dashboard at http://localhost:8000/dashboard/

## Architecture

```
┌─────────────────┐     ┌──────────────┐     ┌─────────────┐     ┌──────────┐
│ Data Sources    │ ──▶ │ Scrapers     │ ──▶ │ PostgreSQL  │ ──▶ │ CloudKit │
│ (ESPN, leagues) │     │ (sportstime_ │     │ (Django)    │     │ (iOS)    │
└─────────────────┘     │ parser)      │     └─────────────┘     └──────────┘
                        └──────────────┘
```

### Components

| Component | Description |
|-----------|-------------|
| **Django** | Web framework, ORM, admin interface |
| **PostgreSQL** | Primary database |
| **Redis** | Celery message broker |
| **Celery** | Async task queue (scraping, syncing) |
| **Celery Beat** | Scheduled task runner |
| **sportstime_parser** | Standalone scraper library |

## Usage

### Dashboard

Visit http://localhost:8000/dashboard/ (staff login required) to:

- View scraper status and run scrapers
- Monitor CloudKit sync status
- Review items needing manual attention
- See statistics across all sports

### Running Scrapers

**Via Dashboard:**

1. Go to Dashboard → Scraper Status
2. Click "Run Now" for a specific sport or "Run All Enabled"

**Via Command Line:**

```bash
docker-compose exec web python manage.py shell
>>> from scraper.tasks import run_scraper_task
>>> from scraper.models import ScraperConfig
>>> config = ScraperConfig.objects.get(sport__code='nba', season=2025)
>>> run_scraper_task.delay(config.id)
```

### Managing Aliases

When scrapers encounter unknown team or stadium names:

1. A **Review Item** is created for manual resolution
2. Add an alias via Admin → Team Aliases or Stadium Aliases
3. Re-run the scraper to pick up the new mapping

Aliases support **validity dates** - useful for:

- Historical team names (e.g., "Washington Redskins" valid until 2020)
- Stadium naming rights changes (e.g., "Staples Center" valid until 2021)

### Import/Export

All admin models support bulk import/export:

1. Go to any admin list page (e.g., Teams)
2. Click **Export** → Select format (JSON recommended) → Submit
3. Modify the data as needed (e.g., ask Claude to update it)
4. Click **Import** → Upload file → Preview → Confirm

Imports will update existing records and create new ones.

## Project Structure

```
SportsTimeScripts/
├── core/                       # Core Django models
│   ├── models/                 # Sport, Team, Stadium, Game, Aliases
│   ├── admin/                  # Admin configuration with import/export
│   └── resources.py            # Import/export resource definitions
├── scraper/                    # Scraper orchestration
│   ├── engine/                 # Adapter, DB alias loaders
│   │   ├── adapter.py          # Bridges sportstime_parser to Django
│   │   └── db_alias_loader.py  # Database alias resolution
│   ├── models.py               # ScraperConfig, ScrapeJob, ManualReviewItem
│   └── tasks.py                # Celery tasks
├── sportstime_parser/          # Standalone scraper library
│   ├── scrapers/               # Per-sport scrapers (NBA, MLB, etc.)
│   ├── normalizers/            # Team/stadium name resolution
│   ├── models/                 # Data classes
│   └── uploaders/              # CloudKit client (legacy)
├── cloudkit/                   # CloudKit sync
│   ├── client.py               # CloudKit API client
│   ├── models.py               # CloudKitConfiguration, SyncState, SyncJob
│   └── tasks.py                # Sync tasks
├── dashboard/                  # Staff dashboard
│   ├── views.py                # Dashboard views
│   └── urls.py                 # Dashboard URLs
├── templates/                  # Django templates
│   ├── base.html               # Base template
│   └── dashboard/              # Dashboard templates
├── sportstime/                 # Django project config
│   ├── settings.py             # Django settings
│   ├── urls.py                 # URL routing
│   └── celery.py               # Celery configuration
├── docker-compose.yml          # Container orchestration
├── Dockerfile                  # Container image
├── requirements.txt            # Python dependencies
├── CLAUDE.md                   # Claude Code context
└── README.md                   # This file
```

## Data Models

### Model Hierarchy

```
Sport
├── Conference
│   └── Division
│       └── Team (has TeamAliases)
├── Stadium (has StadiumAliases)
└── Game (references Team, Stadium)
```

### Key Models

| Model | Description |
|-------|-------------|
| **Sport** | Sports with season configuration |
| **Team** | Teams with division, colors, logos |
| **Stadium** | Venues with location, capacity |
| **Game** | Games with scores, status, teams |
| **TeamAlias** | Historical team names with validity dates |
| **StadiumAlias** | Historical stadium names with validity dates |
| **ScraperConfig** | Scraper settings per sport/season |
| **ScrapeJob** | Scrape execution logs |
| **ManualReviewItem** | Items needing human review |
| **CloudKitSyncState** | Per-record sync status |

## Configuration

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `DEBUG` | Debug mode | `False` |
| `SECRET_KEY` | Django secret key | (required in prod) |
| `DATABASE_URL` | PostgreSQL connection | `postgresql://...` |
| `REDIS_URL` | Redis connection | `redis://localhost:6379/0` |
| `CLOUDKIT_CONTAINER` | CloudKit container ID | - |
| `CLOUDKIT_KEY_ID` | CloudKit key ID | - |
| `CLOUDKIT_PRIVATE_KEY_PATH` | Path to CloudKit private key | - |

### Scraper Settings

| Setting | Description | Default |
|---------|-------------|---------|
| `SCRAPER_REQUEST_DELAY` | Delay between requests (seconds) | `3.0` |
| `SCRAPER_MAX_RETRIES` | Max retry attempts | `3` |
| `SCRAPER_FUZZY_THRESHOLD` | Fuzzy match confidence threshold | `85` |

## Supported Sports

| Code | League | Season Type | Games/Season | Data Sources |
|------|--------|-------------|--------------|--------------|
| nba | NBA | Oct-Jun (split) | ~1,230 | ESPN, NBA.com |
| mlb | MLB | Mar-Nov (calendar) | ~2,430 | ESPN, MLB.com |
| nfl | NFL | Sep-Feb (split) | ~272 | ESPN, NFL.com |
| nhl | NHL | Oct-Jun (split) | ~1,312 | ESPN, NHL.com |
| mls | MLS | Feb-Nov (calendar) | ~544 | ESPN |
| wnba | WNBA | May-Oct (calendar) | ~228 | ESPN |
| nwsl | NWSL | Mar-Nov (calendar) | ~182 | ESPN |

## Development

### Useful Commands

```bash
# Start containers
docker-compose up -d

# Stop containers
docker-compose down

# Restart containers
docker-compose restart

# Rebuild after requirements change
docker-compose down && docker-compose up -d --build

# View logs
docker-compose logs -f web
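# A few extra inspection commands that can help when debugging
# (standard docker-compose and Django management commands, shown as examples)
docker-compose ps                                        # list services and their status
docker-compose exec web python manage.py check           # validate Django configuration
docker-compose exec web python manage.py showmigrations  # show applied/pending migrations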
docker-compose logs -f celery-worker

# Django shell
docker-compose exec web python manage.py shell

# Database shell
docker-compose exec db psql -U sportstime -d sportstime

# Run migrations
docker-compose exec web python manage.py migrate

# Create superuser
docker-compose exec web python manage.py createsuperuser
```

### Running Tests

```bash
docker-compose exec web pytest
```

### Adding a New Sport

1. Create a scraper in `sportstime_parser/scrapers/{sport}.py`
2. Add team mappings in `sportstime_parser/normalizers/team_resolver.py`
3. Add stadium mappings in `sportstime_parser/normalizers/stadium_resolver.py`
4. Register the scraper in `scraper/engine/adapter.py`
5. Add a Sport record via the Django admin
6. Create a ScraperConfig for the sport/season

## sportstime_parser Library

The `sportstime_parser` package is a standalone library that handles:

- **Scraping** from multiple sources (ESPN, league APIs)
- **Normalizing** team/stadium names to canonical IDs
- **Resolving** names using exact match, aliases, and fuzzy matching

### Resolution Strategy

1. **Exact match** against canonical mappings
2. **Alias lookup** with date-aware validity
3. **Fuzzy match** with 85% confidence threshold
4. **Manual review** if unresolved

### Canonical ID Format

```
team_nba_lal                      # Team: Los Angeles Lakers
stadium_nba_los_angeles_lakers    # Stadium: Crypto.com Arena
game_nba_2025_20251022_bos_lal    # Game: BOS @ LAL on Oct 22, 2025
```

## Troubleshooting

### Scraper fails with rate limiting

The system handles 429 errors automatically. If errors persist, increase `SCRAPER_REQUEST_DELAY`.

### Unknown team/stadium names

1. Check ManualReviewItem in admin
2. Add an alias via Team Aliases or Stadium Aliases
3. Re-run the scraper

### CloudKit sync errors

1. Verify credentials in CloudKitConfiguration
2. Check CloudKitSyncState for failed records
3. Use the "Retry failed syncs" action in admin

### Docker volume issues

If template changes don't appear:

```bash
docker-compose down && docker-compose up -d --build
```

## License

Private - All rights reserved.