# SportsTime Data Pipeline

A Django-based sports data pipeline that scrapes game schedules from official sources, normalizes the data, and syncs it to CloudKit for iOS app consumption.

## Features

- **Multi-sport support**: NBA, MLB, NFL, NHL, MLS, WNBA, NWSL
- **Automated scraping**: Scheduled data collection from ESPN and league APIs
- **Smart name resolution**: Team/stadium aliases with date validity support
- **CloudKit sync**: Push data to iCloud for iOS app consumption
- **Admin dashboard**: Monitor scrapers, review items, manage data
- **Import/Export**: Bulk data management via JSON, CSV, XLSX
- **Audit history**: Track all changes with django-simple-history

## Quick Start

### Prerequisites

- Docker and Docker Compose
- (Optional) CloudKit credentials for sync

### Setup

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd SportsTimeScripts
   ```

2. Copy the environment template:

   ```bash
   cp .env.example .env
   ```

3. Start the containers:

   ```bash
   docker-compose up -d
   ```

4. Run migrations:

   ```bash
   docker-compose exec web python manage.py migrate
   ```

5. Create a superuser:

   ```bash
   docker-compose exec web python manage.py createsuperuser
   ```

6. Access the admin at http://localhost:8000/admin/
7. Access the dashboard at http://localhost:8000/dashboard/

## Architecture

```
┌─────────────────┐     ┌──────────────┐     ┌─────────────┐     ┌──────────┐
│ Data Sources    │ ──▶ │ Scrapers     │ ──▶ │ PostgreSQL  │ ──▶ │ CloudKit │
│ (ESPN, leagues) │     │ (sportstime_ │     │ (Django)    │     │ (iOS)    │
└─────────────────┘     │ parser)      │     └─────────────┘     └──────────┘
                        └──────────────┘
```

### Components

| Component | Description |
|-----------|-------------|
| **Django** | Web framework, ORM, admin interface |
| **PostgreSQL** | Primary database |
| **Redis** | Celery message broker |
| **Celery** | Async task queue (scraping, syncing) |
| **Celery Beat** | Scheduled task runner |
| **sportstime_parser** | Standalone scraper library |

## Usage

### Dashboard

Visit http://localhost:8000/dashboard/ (staff login required) to:

- View scraper status and run scrapers
- Monitor CloudKit sync status
- Review items needing manual attention
- See statistics across all sports

### Running Scrapers

**Via Dashboard:**

1. Go to Dashboard → Scraper Status
2. Click "Run Now" for a specific sport or "Run All Enabled"

**Via Command Line:**

```bash
docker-compose exec web python manage.py shell
>>> from scraper.tasks import run_scraper_task
>>> from scraper.models import ScraperConfig
>>> config = ScraperConfig.objects.get(sport__code='nba', season=2025)
>>> run_scraper_task.delay(config.id)
```

### Managing Aliases

When scrapers encounter unknown team or stadium names:

1. A **Review Item** is created for manual resolution
2. Add an alias via Admin → Team Aliases or Stadium Aliases
3. Re-run the scraper to pick up the new mapping

Aliases support **validity dates** - useful for:

- Historical team names (e.g., "Washington Redskins" valid until 2020)
- Stadium naming rights changes (e.g., "Staples Center" valid until 2021)

### Import/Export

All admin models support bulk import/export:

1. Go to any admin list page (e.g., Teams)
2. Click **Export** → Select format (JSON recommended) → Submit
3. Modify the data as needed (e.g., ask Claude to update it)
4. Click **Import** → Upload file → Preview → Confirm

Imports will update existing records and create new ones.

## Project Structure

```
SportsTimeScripts/
├── core/                       # Core Django models
│   ├── models/                 # Sport, Team, Stadium, Game, Aliases
│   ├── admin/                  # Admin configuration with import/export
│   └── resources.py            # Import/export resource definitions
├── scraper/                    # Scraper orchestration
│   ├── engine/                 # Adapter, DB alias loaders
│   │   ├── adapter.py          # Bridges sportstime_parser to Django
│   │   └── db_alias_loader.py  # Database alias resolution
│   ├── models.py               # ScraperConfig, ScrapeJob, ManualReviewItem
│   └── tasks.py                # Celery tasks
├── sportstime_parser/          # Standalone scraper library
│   ├── scrapers/               # Per-sport scrapers (NBA, MLB, etc.)
│   ├── normalizers/            # Team/stadium name resolution
│   ├── models/                 # Data classes
│   └── uploaders/              # CloudKit client (legacy)
├── cloudkit/                   # CloudKit sync
│   ├── client.py               # CloudKit API client
│   ├── models.py               # CloudKitConfiguration, SyncState, SyncJob
│   └── tasks.py                # Sync tasks
├── dashboard/                  # Staff dashboard
│   ├── views.py                # Dashboard views
│   └── urls.py                 # Dashboard URLs
├── templates/                  # Django templates
│   ├── base.html               # Base template
│   └── dashboard/              # Dashboard templates
├── sportstime/                 # Django project config
│   ├── settings.py             # Django settings
│   ├── urls.py                 # URL routing
│   └── celery.py               # Celery configuration
├── docker-compose.yml          # Container orchestration
├── Dockerfile                  # Container image
├── requirements.txt            # Python dependencies
├── CLAUDE.md                   # Claude Code context
└── README.md                   # This file
```

## Data Models

### Model Hierarchy

```
Sport
├── Conference
│   └── Division
│       └── Team (has TeamAliases)
├── Stadium (has StadiumAliases)
└── Game (references Team, Stadium)
```

### Key Models

| Model | Description |
|-------|-------------|
| **Sport** | Sports with season configuration |
| **Team** | Teams with division, colors, logos |
| **Stadium** | Venues with location, capacity |
| **Game** | Games with scores, status, teams |
| **TeamAlias** | Historical team names with validity dates |
| **StadiumAlias** | Historical stadium names with validity dates |
| **ScraperConfig** | Scraper settings per sport/season |
| **ScrapeJob** | Scrape execution logs |
| **ManualReviewItem** | Items needing human review |
| **CloudKitSyncState** | Per-record sync status |

## Configuration

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `DEBUG` | Debug mode | `False` |
| `SECRET_KEY` | Django secret key | (required in prod) |
| `DATABASE_URL` | PostgreSQL connection | `postgresql://...` |
| `REDIS_URL` | Redis connection | `redis://localhost:6379/0` |
| `CLOUDKIT_CONTAINER` | CloudKit container ID | - |
| `CLOUDKIT_KEY_ID` | CloudKit key ID | - |
| `CLOUDKIT_PRIVATE_KEY_PATH` | Path to CloudKit private key | - |

### Scraper Settings

| Setting | Description | Default |
|---------|-------------|---------|
| `SCRAPER_REQUEST_DELAY` | Delay between requests (seconds) | `3.0` |
| `SCRAPER_MAX_RETRIES` | Max retry attempts | `3` |
| `SCRAPER_FUZZY_THRESHOLD` | Fuzzy match confidence threshold | `85` |

## Supported Sports

| Code | League | Season Type | Games/Season | Data Sources |
|------|--------|-------------|--------------|--------------|
| nba | NBA | Oct-Jun (split) | ~1,230 | ESPN, NBA.com |
| mlb | MLB | Mar-Nov (calendar) | ~2,430 | ESPN, MLB.com |
| nfl | NFL | Sep-Feb (split) | ~272 | ESPN, NFL.com |
| nhl | NHL | Oct-Jun (split) | ~1,312 | ESPN, NHL.com |
| mls | MLS | Feb-Nov (calendar) | ~544 | ESPN |
| wnba | WNBA | May-Oct (calendar) | ~228 | ESPN |
| nwsl | NWSL | Mar-Nov (calendar) | ~182 | ESPN |

## Development

### Useful Commands

```bash
# Start containers
docker-compose up -d

# Stop containers
docker-compose down

# Restart containers
docker-compose restart

# Rebuild after requirements change
docker-compose down && docker-compose up -d --build

# View logs
docker-compose logs -f web
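# A few extra inspection commands that can help when debugging
# (standard docker-compose and Django management commands, shown as examples)
docker-compose ps                                        # list services and their status
docker-compose exec web python manage.py check           # validate Django configuration
docker-compose exec web python manage.py showmigrations  # show applied/pending migrations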
docker-compose logs -f celery-worker

# Django shell
docker-compose exec web python manage.py shell

# Database shell
docker-compose exec db psql -U sportstime -d sportstime

# Run migrations
docker-compose exec web python manage.py migrate

# Create superuser
docker-compose exec web python manage.py createsuperuser
```

### Running Tests

```bash
docker-compose exec web pytest
```

### Adding a New Sport

1. Create a scraper in `sportstime_parser/scrapers/{sport}.py`
2. Add team mappings in `sportstime_parser/normalizers/team_resolver.py`
3. Add stadium mappings in `sportstime_parser/normalizers/stadium_resolver.py`
4. Register the scraper in `scraper/engine/adapter.py`
5. Add a Sport record via the Django admin
6. Create a ScraperConfig for the sport/season

## sportstime_parser Library

The `sportstime_parser` package is a standalone library that handles:

- **Scraping** from multiple sources (ESPN, league APIs)
- **Normalizing** team/stadium names to canonical IDs
- **Resolving** names using exact match, aliases, and fuzzy matching

### Resolution Strategy

1. **Exact match** against canonical mappings
2. **Alias lookup** with date-aware validity
3. **Fuzzy match** with 85% confidence threshold
4. **Manual review** if unresolved

### Canonical ID Format

```
team_nba_lal                      # Team: Los Angeles Lakers
stadium_nba_los_angeles_lakers    # Stadium: Crypto.com Arena
game_nba_2025_20251022_bos_lal    # Game: BOS @ LAL on Oct 22, 2025
```

## Troubleshooting

### Scraper fails with rate limiting

The system handles 429 errors automatically. If errors persist, increase `SCRAPER_REQUEST_DELAY`.

### Unknown team/stadium names

1. Check ManualReviewItem in admin
2. Add an alias via Team Aliases or Stadium Aliases
3. Re-run the scraper

### CloudKit sync errors

1. Verify credentials in CloudKitConfiguration
2. Check CloudKitSyncState for failed records
3. Use the "Retry failed syncs" action in admin

### Docker volume issues

If template changes don't appear:

```bash
docker-compose down && docker-compose up -d --build
```

## License

Private - All rights reserved.