Initial commit: SportsTime trip planning app
- Three-scenario planning engine (A: date range, B: selected games, C: directional routes)
- GeographicRouteExplorer with anchor game support for route exploration
- Shared ItineraryBuilder for travel segment calculation
- TravelEstimator for driving time/distance estimation
- SwiftUI views for trip creation and detail display
- CloudKit integration for schedule data
- Python scraping scripts for sports schedules

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
145
Scripts/CLOUDKIT_SETUP.md
Normal file
@@ -0,0 +1,145 @@
# CloudKit Setup Guide for SportsTime

## 1. Configure Container in Apple Developer Portal

1. Go to [Apple Developer Portal](https://developer.apple.com/account)
2. Navigate to **Certificates, Identifiers & Profiles** > **Identifiers**
3. Select your App ID or create one for `com.sportstime.app`
4. Enable **iCloud** capability
5. Click **Configure** and create container: `iCloud.com.sportstime.app`

## 2. Configure in Xcode

1. Open `SportsTime.xcodeproj` in Xcode
2. Select the SportsTime target
3. Go to **Signing & Capabilities**
4. Ensure **iCloud** is added (should already be there)
5. Check **CloudKit** is selected
6. Select container `iCloud.com.sportstime.app`
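After these steps, the target's entitlements file should reference the container. A minimal sketch (the two keys are Apple's standard iCloud entitlement keys; the entitlements filename depends on your target):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>com.apple.developer.icloud-container-identifiers</key>
    <array>
        <string>iCloud.com.sportstime.app</string>
    </array>
    <key>com.apple.developer.icloud-services</key>
    <array>
        <string>CloudKit</string>
    </array>
</dict>
</plist>
```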
## 3. Create Record Types in CloudKit Dashboard

Go to [CloudKit Dashboard](https://icloud.developer.apple.com/dashboard)

### Record Type: `Stadium`

| Field | Type | Notes |
|-------|------|-------|
| `stadiumId` | String | Unique identifier |
| `name` | String | Stadium name |
| `city` | String | City |
| `state` | String | State/Province |
| `location` | Location | CLLocation (lat/lng) |
| `capacity` | Int(64) | Seating capacity |
| `sport` | String | NBA, MLB, NHL |
| `teamAbbrevs` | String (List) | Team abbreviations |
| `source` | String | Data source |
| `yearOpened` | Int(64) | Optional |

**Indexes**:
- `sport` (Queryable, Sortable)
- `location` (Queryable) - for radius searches
- `teamAbbrevs` (Queryable)

### Record Type: `Team`

| Field | Type | Notes |
|-------|------|-------|
| `teamId` | String | Unique identifier |
| `name` | String | Full team name |
| `abbreviation` | String | 3-letter code |
| `sport` | String | NBA, MLB, NHL |
| `city` | String | City |

**Indexes**:
- `sport` (Queryable, Sortable)
- `abbreviation` (Queryable)

### Record Type: `Game`

| Field | Type | Notes |
|-------|------|-------|
| `gameId` | String | Unique identifier |
| `sport` | String | NBA, MLB, NHL |
| `season` | String | e.g., "2024-25" |
| `dateTime` | Date/Time | Game date and time |
| `homeTeamRef` | Reference | Reference to Team |
| `awayTeamRef` | Reference | Reference to Team |
| `venueRef` | Reference | Reference to Stadium |
| `isPlayoff` | Int(64) | 0 or 1 |
| `broadcastInfo` | String | TV channel |
| `source` | String | Data source |

**Indexes**:
- `sport` (Queryable, Sortable)
- `dateTime` (Queryable, Sortable)
- `homeTeamRef` (Queryable)
- `awayTeamRef` (Queryable)
- `season` (Queryable)
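These indexes exist to support the app's main lookup: games for one sport within a date range, sorted by tip-off time. A sketch of the CloudKit Web Services query body such a lookup would send (the `games_query` helper is illustrative, not part of the app; field and record-type names come from the schema above):

```python
def games_query(sport, start_ms, end_ms, limit=200):
    """Build a CloudKit Web Services /records/query body for Game records
    with dateTime in [start_ms, end_ms), sorted ascending. Relies on the
    Queryable/Sortable indexes on `sport` and `dateTime` defined above."""
    return {
        'query': {
            'recordType': 'Game',
            'filterBy': [
                {'fieldName': 'sport', 'comparator': 'EQUALS',
                 'fieldValue': {'value': sport}},
                {'fieldName': 'dateTime', 'comparator': 'GREATER_THAN_OR_EQUALS',
                 'fieldValue': {'value': start_ms}},
                {'fieldName': 'dateTime', 'comparator': 'LESS_THAN',
                 'fieldValue': {'value': end_ms}},
            ],
            'sortBy': [{'fieldName': 'dateTime', 'ascending': True}],
        },
        'resultsLimit': limit,
    }
```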
## 4. Import Data

After creating record types:

```bash
# 1. First scrape the data
cd Scripts
python3 scrape_schedules.py --sport all --season 2025 --output ./data

# 2. Run the import script (requires running from Xcode or with proper entitlements)
# The Swift script cannot run standalone - use the app or create a macOS command-line tool
```

### Alternative: Import via App

Add this to your app for first-run data import:

```swift
// In AppDelegate or App init
Task {
    let importer = CloudKitImporter()

    // Load JSON from bundle or downloaded file
    if let stadiumsURL = Bundle.main.url(forResource: "stadiums", withExtension: "json"),
       let gamesURL = Bundle.main.url(forResource: "games", withExtension: "json") {
        // Import stadiums first
        let stadiumsData = try Data(contentsOf: stadiumsURL)
        let stadiums = try JSONDecoder().decode([ScrapedStadium].self, from: stadiumsData)
        let count = try await importer.importStadiums(from: stadiums)
        print("Imported \(count) stadiums")
        // Then import games from gamesURL the same way, once stadiums and teams exist
    }
}
```

## 5. Security Roles (CloudKit Dashboard)

For the **Public Database**:

| Role | Stadium | Team | Game |
|------|---------|------|------|
| World | Read | Read | Read |
| Authenticated | Read | Read | Read |
| Creator | Read/Write | Read/Write | Read/Write |

Users should only read from the public database. Write access is for your admin imports.

## 6. Testing

1. Build and run the app on simulator or device
2. Check CloudKit Dashboard > **Data** to see imported records
3. Use the **Logs** tab to debug any issues

## Troubleshooting

### "Container not found"
- Ensure the container is created in the Developer Portal
- Check that the entitlements file has the correct container ID
- Clean build and re-run

### "Permission denied"
- Check Security Roles in CloudKit Dashboard
- Ensure the app is signed with the correct provisioning profile

### "Record type not found"
- Create record types in the Development environment first
- Deploy the schema to Production when ready
72
Scripts/DATA_SOURCES.md
Normal file
@@ -0,0 +1,72 @@
# Sports Data Sources

## Schedule Data Sources (by league)

### NBA Schedule
| Source | URL Pattern | Data Available | Notes |
|--------|-------------|----------------|-------|
| Basketball-Reference | `https://www.basketball-reference.com/leagues/NBA_{YEAR}_games-{month}.html` | Date, Time, Teams, Arena, Attendance | Monthly pages (october, november, etc.) |
| ESPN | `https://www.espn.com/nba/schedule/_/date/{YYYYMMDD}` | Date, Time, Teams, TV | Daily schedule |
| NBA.com API | `https://cdn.nba.com/static/json/staticData/scheduleLeagueV2.json` | Full season JSON | Official source |
| FixtureDownload | `https://fixturedownload.com/download/nba-{year}-UTC.csv` | CSV download | Easy format |

### MLB Schedule
| Source | URL Pattern | Data Available | Notes |
|--------|-------------|----------------|-------|
| Baseball-Reference | `https://www.baseball-reference.com/leagues/majors/{YEAR}-schedule.shtml` | Date, Teams, Score, Attendance | Full season page |
| ESPN | `https://www.espn.com/mlb/schedule/_/date/{YYYYMMDD}` | Date, Time, Teams, TV | Daily schedule |
| MLB Stats API | `https://statsapi.mlb.com/api/v1/schedule?sportId=1&season={YEAR}` | Full season JSON | Official API |
| FixtureDownload | `https://fixturedownload.com/download/mlb-{year}-UTC.csv` | CSV download | Easy format |

### NHL Schedule
| Source | URL Pattern | Data Available | Notes |
|--------|-------------|----------------|-------|
| Hockey-Reference | `https://www.hockey-reference.com/leagues/NHL_{YEAR}_games.html` | Date, Teams, Score, Arena, Attendance | Full season page |
| ESPN | `https://www.espn.com/nhl/schedule/_/date/{YYYYMMDD}` | Date, Time, Teams, TV | Daily schedule |
| NHL API | `https://api-web.nhle.com/v1/schedule/{YYYY-MM-DD}` | Daily JSON | Official API |
| FixtureDownload | `https://fixturedownload.com/download/nhl-{year}-UTC.csv` | CSV download | Easy format |

---

## Stadium/Arena Data Sources

| Source | URL/Method | Data Available | Notes |
|--------|------------|----------------|-------|
| Wikipedia | Team pages | Name, City, Capacity, Coordinates | Manual or scrape |
| HIFLD Open Data | `https://hifld-geoplatform.opendata.arcgis.com/datasets/major-sport-venues` | GeoJSON with coordinates | US Government data |
| ESPN Team Pages | `https://www.espn.com/{sport}/team/_/name/{abbrev}` | Arena name, location | Per-team |
| Sports-Reference | Team pages | Arena name, capacity | In schedule data |
| OpenStreetMap | Nominatim API | Coordinates from address | For geocoding |

---

## Data Validation Strategy

### Cross-Reference Points
1. **Game Count**: Total games per team should match (82 NBA, 162 MLB, 82 NHL)
2. **Home/Away Balance**: Each team should have equal home/away games
3. **Date Alignment**: Same game should appear on same date across sources
4. **Team Names**: Map abbreviations across sources (NYK vs NY vs Knicks)
5. **Venue Names**: Stadiums may have different names (sponsorship changes)
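The first two checks can be sketched as a small helper (the `home_team_abbrev`/`away_team_abbrev` field names match the scraper output consumed by `cloudkit_import.py`; the function itself is illustrative):

```python
from collections import Counter

# Expected regular-season totals per team, from the leagues above
EXPECTED_GAMES = {'NBA': 82, 'MLB': 162, 'NHL': 82}

def check_game_counts(games, sport):
    """Return (team, total, home, away) tuples for every team whose total
    game count or home/away balance violates the expected schedule."""
    home = Counter(g['home_team_abbrev'] for g in games)
    away = Counter(g['away_team_abbrev'] for g in games)
    expected = EXPECTED_GAMES[sport]
    problems = []
    for team in sorted(set(home) | set(away)):
        h, a = home[team], away[team]
        if h + a != expected or h != a:
            problems.append((team, h + a, h, a))
    return problems
```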
### Discrepancy Handling
- If sources disagree on game time: prefer official API (NBA.com, MLB.com, NHL.com)
- If sources disagree on venue: prefer Sports-Reference (most accurate historically)
- Log all discrepancies for manual review
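One way to encode those preferences is a per-field source priority. The sketch below restates the rules above; the priority lists and source labels are illustrative, not the pipeline's actual identifiers:

```python
# Higher-priority sources come first; unlisted sources rank last.
FIELD_PRIORITY = {
    'time':  ['official_api', 'espn', 'sports_reference', 'fixturedownload'],
    'venue': ['sports_reference', 'official_api', 'espn', 'fixturedownload'],
}

def resolve(field, candidates):
    """Pick a value for `field` from {source: value} candidates by priority,
    and report whether the sources disagreed (so callers can log it)."""
    order = FIELD_PRIORITY[field]
    ranked = sorted(candidates,
                    key=lambda s: order.index(s) if s in order else len(order))
    chosen = candidates[ranked[0]]
    disagreement = len(set(candidates.values())) > 1
    return chosen, disagreement
```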
---

## Rate Limiting Guidelines

| Source | Limit | Recommended Delay |
|--------|-------|-------------------|
| Sports-Reference sites | 20 req/min | 3 seconds between requests |
| ESPN | Unknown | 1 second between requests |
| Official APIs | Varies | 0.5 seconds between requests |
| Wikipedia | Polite | 1 second between requests |
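A minimal per-source throttle implementing these delays (the delay values mirror the table; the `Throttle` class is a sketch, not part of the scraper):

```python
import time

DELAYS = {  # seconds between requests, from the table above
    'sports-reference': 3.0,
    'espn': 1.0,
    'official': 0.5,
    'wikipedia': 1.0,
}

class Throttle:
    """Sleep just long enough to honor a per-source minimum delay."""
    def __init__(self, delays=DELAYS, clock=time.monotonic, sleep=time.sleep):
        self.delays = delays
        self.clock = clock      # injectable for testing
        self.sleep = sleep
        self.last = {}          # source -> timestamp of last request

    def wait(self, source):
        now = self.clock()
        earliest = self.last.get(source, float('-inf')) + self.delays.get(source, 1.0)
        if earliest > now:
            self.sleep(earliest - now)
            now = earliest
        self.last[source] = now
```

Call `throttle.wait('sports-reference')` immediately before each request to that source.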
---

## Team Abbreviation Mappings

See `team_mappings.json` for canonical mappings between sources.
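Consuming that file might look like the sketch below. The alias-to-canonical JSON layout shown here is an assumption; adapt it to the actual structure of `team_mappings.json`:

```python
import json

def load_team_mappings(path):
    """Read alias -> canonical mappings, e.g. {"NY": "NYK", "Knicks": "NYK"}."""
    with open(path) as f:
        return json.load(f)

def canonical_abbrev(name, mappings):
    """Normalize a team name/abbreviation from any source to its canonical
    abbreviation; unknown names pass through unchanged (and should be logged)."""
    key = name.strip()
    return mappings.get(key, mappings.get(key.upper(), key))
```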
306
Scripts/cloudkit_import.py
Executable file
@@ -0,0 +1,306 @@
#!/usr/bin/env python3
"""
CloudKit Import Script
======================
Imports JSON data into CloudKit. Run separately from the pipeline.

Setup:
1. CloudKit Dashboard > Tokens & Keys > Server-to-Server Keys
2. Create a key with Read/Write access to the public database
3. Download the .p8 file and note the Key ID

Usage:
    python cloudkit_import.py --dry-run                       # Preview first
    python cloudkit_import.py --key-id XX --key-file key.p8   # Import all
    python cloudkit_import.py --stadiums-only ...             # Stadiums first
    python cloudkit_import.py --games-only ...                # Games after
    python cloudkit_import.py --delete-all ...                # Delete then import
    python cloudkit_import.py --delete-only ...               # Delete only (no import)
"""

import argparse
import base64
import hashlib
import json
import os
import sys
import time
from datetime import datetime, timezone
from pathlib import Path

import requests

try:
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import ec
    from cryptography.hazmat.backends import default_backend
    HAS_CRYPTO = True
except ImportError:
    HAS_CRYPTO = False

CONTAINER = "iCloud.com.sportstime.app"
HOST = "https://api.apple-cloudkit.com"
BATCH_SIZE = 200

class CloudKit:
    def __init__(self, key_id, private_key, container, env):
        self.key_id = key_id
        self.private_key = private_key
        self.path_base = f"/database/1/{container}/{env}/public"

    def _sign(self, date, body, path):
        # CloudKit server-to-server auth: ECDSA-SHA256 over "date:bodyHash:path"
        key = serialization.load_pem_private_key(self.private_key, None, default_backend())
        body_hash = base64.b64encode(hashlib.sha256(body.encode()).digest()).decode()
        sig = key.sign(f"{date}:{body_hash}:{path}".encode(), ec.ECDSA(hashes.SHA256()))
        return base64.b64encode(sig).decode()

    def modify(self, operations):
        path = f"{self.path_base}/records/modify"
        body = json.dumps({'operations': operations})
        date = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
        headers = {
            'Content-Type': 'application/json',
            'X-Apple-CloudKit-Request-KeyID': self.key_id,
            'X-Apple-CloudKit-Request-ISO8601Date': date,
            'X-Apple-CloudKit-Request-SignatureV1': self._sign(date, body, path),
        }
        r = requests.post(f"{HOST}{path}", headers=headers, data=body, timeout=60)
        if r.status_code == 200:
            return r.json()
        try:
            err = r.json()
            reason = err.get('reason', 'Unknown')
            code = err.get('serverErrorCode', r.status_code)
            return {'error': f"{code}: {reason}"}
        except ValueError:  # non-JSON error body
            return {'error': f"{r.status_code}: {r.text[:200]}"}

    def query(self, record_type, limit=200):
        """Query records of a given type."""
        path = f"{self.path_base}/records/query"
        body = json.dumps({
            'query': {'recordType': record_type},
            'resultsLimit': limit
        })
        date = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
        headers = {
            'Content-Type': 'application/json',
            'X-Apple-CloudKit-Request-KeyID': self.key_id,
            'X-Apple-CloudKit-Request-ISO8601Date': date,
            'X-Apple-CloudKit-Request-SignatureV1': self._sign(date, body, path),
        }
        r = requests.post(f"{HOST}{path}", headers=headers, data=body, timeout=60)
        if r.status_code == 200:
            return r.json()
        return {'error': f"{r.status_code}: {r.text[:200]}"}

    def delete_all(self, record_type, verbose=False):
        """Delete all records of a given type, one query page at a time."""
        total_deleted = 0
        while True:
            result = self.query(record_type)
            if 'error' in result:
                print(f"  Query error: {result['error']}")
                break

            records = result.get('records', [])
            if not records:
                break

            # Build delete operations
            ops = [{
                'operationType': 'delete',
                'record': {'recordName': r['recordName'], 'recordType': record_type}
            } for r in records]

            delete_result = self.modify(ops)
            if 'error' in delete_result:
                print(f"  Delete error: {delete_result['error']}")
                break

            deleted = len(delete_result.get('records', []))
            total_deleted += deleted
            if verbose:
                print(f"  Deleted {deleted} {record_type} records...")

            time.sleep(0.5)

        return total_deleted

def import_data(ck, records, name, dry_run, verbose):
    total = 0
    errors = 0
    for i in range(0, len(records), BATCH_SIZE):
        batch = records[i:i+BATCH_SIZE]
        ops = [{'operationType': 'forceReplace', 'record': r} for r in batch]

        if verbose:
            print(f"  Batch {i//BATCH_SIZE + 1}: {len(batch)} records, {len(ops)} ops")

        if not ops:
            print(f"  Warning: Empty batch at index {i}, skipping")
            continue

        if dry_run:
            print(f"  [DRY RUN] Would create {len(batch)} {name}")
            total += len(batch)
        else:
            result = ck.modify(ops)
            if 'error' in result:
                errors += 1
                if errors <= 3:  # Only show first 3 errors
                    print(f"  Error: {result['error']}")
                    if verbose and batch:
                        print(f"  Sample record: {json.dumps(batch[0], indent=2)[:500]}")
                if errors == 3:
                    print("  (suppressing further errors...)")
            else:
                result_records = result.get('records', [])
                # Count only successful records (no serverErrorCode)
                successful = [r for r in result_records if 'serverErrorCode' not in r]
                failed = [r for r in result_records if 'serverErrorCode' in r]
                n = len(successful)
                total += n
                print(f"  Created {n} {name}")
                if failed:
                    print(f"  Failed {len(failed)} records: {failed[0].get('serverErrorCode')}: {failed[0].get('reason')}")
                if verbose:
                    print(f"  Response: {json.dumps(result, indent=2)[:1000]}")
            time.sleep(0.5)
    if errors > 0:
        print(f"  Total errors: {errors}")
    return total

def main():
    p = argparse.ArgumentParser(description='Import JSON to CloudKit')
    p.add_argument('--key-id', default=os.environ.get('CLOUDKIT_KEY_ID'))
    p.add_argument('--key-file', default=os.environ.get('CLOUDKIT_KEY_FILE'))
    p.add_argument('--container', default=CONTAINER)
    p.add_argument('--env', choices=['development', 'production'], default='development')
    p.add_argument('--data-dir', default='./data')
    p.add_argument('--stadiums-only', action='store_true')
    p.add_argument('--games-only', action='store_true')
    p.add_argument('--delete-all', action='store_true', help='Delete all records before importing')
    p.add_argument('--delete-only', action='store_true', help='Only delete records, do not import')
    p.add_argument('--dry-run', action='store_true')
    p.add_argument('--verbose', '-v', action='store_true')
    args = p.parse_args()

    print(f"\n{'='*50}")
    print(f"CloudKit Import {'(DRY RUN)' if args.dry_run else ''}")
    print(f"{'='*50}")
    print(f"Container: {args.container}")
    print(f"Environment: {args.env}\n")

    data_dir = Path(args.data_dir)
    stadiums = json.loads((data_dir / 'stadiums.json').read_text())
    games = json.loads((data_dir / 'games.json').read_text()) if (data_dir / 'games.json').exists() else []
    print(f"Loaded {len(stadiums)} stadiums, {len(games)} games\n")

    ck = None
    if not args.dry_run:
        if not HAS_CRYPTO:
            sys.exit("Error: pip install cryptography")
        if not args.key_id or not args.key_file:
            sys.exit("Error: --key-id and --key-file required (or use --dry-run)")
        ck = CloudKit(args.key_id, Path(args.key_file).read_bytes(), args.container, args.env)

    # Handle deletion
    if args.delete_all or args.delete_only:
        if not ck:
            sys.exit("Error: --key-id and --key-file required for deletion")

        print("--- Deleting Existing Records ---")
        # Delete in order: Games first (has references), then Teams, then Stadiums
        for record_type in ['Game', 'Team', 'Stadium']:
            print(f"  Deleting {record_type} records...")
            deleted = ck.delete_all(record_type, verbose=args.verbose)
            print(f"  Deleted {deleted} {record_type} records")

        if args.delete_only:
            print(f"\n{'='*50}")
            print("DELETE COMPLETE")
            print()
            return

    stats = {'stadiums': 0, 'teams': 0, 'games': 0}
    team_map = {}

    # Import stadiums & teams
    if not args.games_only:
        print("--- Stadiums ---")
        recs = [{
            'recordType': 'Stadium', 'recordName': s['id'],
            'fields': {
                'stadiumId': {'value': s['id']}, 'name': {'value': s['name']},
                'city': {'value': s['city']}, 'state': {'value': s.get('state', '')},
                'sport': {'value': s['sport']}, 'source': {'value': s.get('source', '')},
                'teamAbbrevs': {'value': s.get('team_abbrevs', [])},
                # Compare against None so a legitimate 0.0 coordinate is kept
                **({'location': {'value': {'latitude': s['latitude'], 'longitude': s['longitude']}}}
                   if s.get('latitude') is not None else {}),
                **({'capacity': {'value': s['capacity']}} if s.get('capacity') else {}),
            }
        } for s in stadiums]
        stats['stadiums'] = import_data(ck, recs, 'stadiums', args.dry_run, args.verbose)

        print("--- Teams ---")
        teams = {}
        for s in stadiums:
            for abbr in s.get('team_abbrevs', []):
                if abbr not in teams:
                    teams[abbr] = {'city': s['city'], 'sport': s['sport']}
                    team_map[abbr] = f"team_{abbr.lower()}"

        recs = [{
            'recordType': 'Team', 'recordName': f"team_{abbr.lower()}",
            'fields': {
                'teamId': {'value': f"team_{abbr.lower()}"}, 'abbreviation': {'value': abbr},
                'name': {'value': abbr}, 'city': {'value': info['city']}, 'sport': {'value': info['sport']},
            }
        } for abbr, info in teams.items()]
        stats['teams'] = import_data(ck, recs, 'teams', args.dry_run, args.verbose)

    # Import games
    if not args.stadiums_only and games:
        if not team_map:
            for s in stadiums:
                for abbr in s.get('team_abbrevs', []):
                    team_map[abbr] = f"team_{abbr.lower()}"

        print("--- Games ---")

        # Deduplicate games by ID
        seen_ids = set()
        unique_games = []
        for g in games:
            if g['id'] not in seen_ids:
                seen_ids.add(g['id'])
                unique_games.append(g)

        if len(unique_games) < len(games):
            print(f"  Removed {len(games) - len(unique_games)} duplicate games")

        recs = []
        for g in unique_games:
            fields = {
                'gameId': {'value': g['id']}, 'sport': {'value': g['sport']},
                'season': {'value': g.get('season', '')}, 'source': {'value': g.get('source', '')},
            }
            if g.get('date'):
                try:
                    dt = datetime.strptime(f"{g['date']} {g.get('time', '19:00')}", '%Y-%m-%d %H:%M')
                    fields['dateTime'] = {'value': int(dt.timestamp() * 1000)}
                except ValueError:  # unparseable date/time: omit the field
                    pass
            if g.get('home_team_abbrev') in team_map:
                fields['homeTeamRef'] = {'value': {'recordName': team_map[g['home_team_abbrev']], 'action': 'NONE'}}
            if g.get('away_team_abbrev') in team_map:
                fields['awayTeamRef'] = {'value': {'recordName': team_map[g['away_team_abbrev']], 'action': 'NONE'}}
            recs.append({'recordType': 'Game', 'recordName': g['id'], 'fields': fields})

        stats['games'] = import_data(ck, recs, 'games', args.dry_run, args.verbose)

    print(f"\n{'='*50}")
    print(f"COMPLETE: {stats['stadiums']} stadiums, {stats['teams']} teams, {stats['games']} games")
    if args.dry_run:
        print("[DRY RUN - nothing imported]")
    print()


if __name__ == '__main__':
    main()
53
Scripts/cloudkit_schema.ckdb
Normal file
@@ -0,0 +1,53 @@
DEFINE SCHEMA

RECORD TYPE Stadium (
    "___createTime" TIMESTAMP,
    "___createdBy" REFERENCE,
    "___etag" STRING,
    "___modTime" TIMESTAMP,
    "___modifiedBy" REFERENCE,
    "___recordID" REFERENCE QUERYABLE,
    stadiumId STRING QUERYABLE,
    name STRING QUERYABLE SEARCHABLE,
    city STRING QUERYABLE,
    state STRING,
    location LOCATION QUERYABLE,
    capacity INT64,
    sport STRING QUERYABLE SORTABLE,
    teamAbbrevs LIST<STRING>,
    source STRING,
    yearOpened INT64
);

RECORD TYPE Team (
    "___createTime" TIMESTAMP,
    "___createdBy" REFERENCE,
    "___etag" STRING,
    "___modTime" TIMESTAMP,
    "___modifiedBy" REFERENCE,
    "___recordID" REFERENCE QUERYABLE,
    teamId STRING QUERYABLE,
    name STRING QUERYABLE SEARCHABLE,
    abbreviation STRING QUERYABLE,
    city STRING QUERYABLE,
    sport STRING QUERYABLE SORTABLE
);

RECORD TYPE Game (
    "___createTime" TIMESTAMP,
    "___createdBy" REFERENCE,
    "___etag" STRING,
    "___modTime" TIMESTAMP,
    "___modifiedBy" REFERENCE,
    "___recordID" REFERENCE QUERYABLE,
    gameId STRING QUERYABLE,
    sport STRING QUERYABLE SORTABLE,
    season STRING QUERYABLE,
    dateTime TIMESTAMP QUERYABLE SORTABLE,
    homeTeamRef REFERENCE QUERYABLE,
    awayTeamRef REFERENCE QUERYABLE,
    venueRef REFERENCE,
    isPlayoff INT64,
    broadcastInfo STRING,
    source STRING
);
5098
Scripts/data/games.csv
Normal file
File diff suppressed because it is too large
76457
Scripts/data/games.json
Normal file
File diff suppressed because it is too large
1620
Scripts/data/pipeline_report.json
Normal file
File diff suppressed because it is too large
93
Scripts/data/stadiums.csv
Normal file
@@ -0,0 +1,93 @@
|
||||
id,name,city,state,latitude,longitude,capacity,sport,team_abbrevs,source,year_opened
|
||||
manual_nba_atl,State Farm Arena,Atlanta,,33.7573,-84.3963,0,NBA,['ATL'],manual,
|
||||
manual_nba_bos,TD Garden,Boston,,42.3662,-71.0621,0,NBA,['BOS'],manual,
|
||||
manual_nba_brk,Barclays Center,Brooklyn,,40.6826,-73.9754,0,NBA,['BRK'],manual,
|
||||
manual_nba_cho,Spectrum Center,Charlotte,,35.2251,-80.8392,0,NBA,['CHO'],manual,
|
||||
manual_nba_chi,United Center,Chicago,,41.8807,-87.6742,0,NBA,['CHI'],manual,
|
||||
manual_nba_cle,Rocket Mortgage FieldHouse,Cleveland,,41.4965,-81.6882,0,NBA,['CLE'],manual,
|
||||
manual_nba_dal,American Airlines Center,Dallas,,32.7905,-96.8103,0,NBA,['DAL'],manual,
|
||||
manual_nba_den,Ball Arena,Denver,,39.7487,-105.0077,0,NBA,['DEN'],manual,
|
||||
manual_nba_det,Little Caesars Arena,Detroit,,42.3411,-83.0553,0,NBA,['DET'],manual,
|
||||
manual_nba_gsw,Chase Center,San Francisco,,37.768,-122.3879,0,NBA,['GSW'],manual,
|
||||
manual_nba_hou,Toyota Center,Houston,,29.7508,-95.3621,0,NBA,['HOU'],manual,
|
||||
manual_nba_ind,Gainbridge Fieldhouse,Indianapolis,,39.764,-86.1555,0,NBA,['IND'],manual,
|
||||
manual_nba_lac,Intuit Dome,Inglewood,,33.9425,-118.3419,0,NBA,['LAC'],manual,
|
||||
manual_nba_lal,Crypto.com Arena,Los Angeles,,34.043,-118.2673,0,NBA,['LAL'],manual,
|
||||
manual_nba_mem,FedExForum,Memphis,,35.1382,-90.0506,0,NBA,['MEM'],manual,
|
||||
manual_nba_mia,Kaseya Center,Miami,,25.7814,-80.187,0,NBA,['MIA'],manual,
|
||||
manual_nba_mil,Fiserv Forum,Milwaukee,,43.0451,-87.9174,0,NBA,['MIL'],manual,
|
||||
manual_nba_min,Target Center,Minneapolis,,44.9795,-93.2761,0,NBA,['MIN'],manual,
|
||||
manual_nba_nop,Smoothie King Center,New Orleans,,29.949,-90.0821,0,NBA,['NOP'],manual,
|
||||
manual_nba_nyk,Madison Square Garden,New York,,40.7505,-73.9934,0,NBA,['NYK'],manual,
|
||||
manual_nba_okc,Paycom Center,Oklahoma City,,35.4634,-97.5151,0,NBA,['OKC'],manual,
|
||||
manual_nba_orl,Kia Center,Orlando,,28.5392,-81.3839,0,NBA,['ORL'],manual,
|
||||
manual_nba_phi,Wells Fargo Center,Philadelphia,,39.9012,-75.172,0,NBA,['PHI'],manual,
|
||||
manual_nba_pho,Footprint Center,Phoenix,,33.4457,-112.0712,0,NBA,['PHO'],manual,
|
||||
manual_nba_por,Moda Center,Portland,,45.5316,-122.6668,0,NBA,['POR'],manual,
|
||||
manual_nba_sac,Golden 1 Center,Sacramento,,38.5802,-121.4997,0,NBA,['SAC'],manual,
|
||||
manual_nba_sas,Frost Bank Center,San Antonio,,29.427,-98.4375,0,NBA,['SAS'],manual,
|
||||
manual_nba_tor,Scotiabank Arena,Toronto,,43.6435,-79.3791,0,NBA,['TOR'],manual,
|
||||
manual_nba_uta,Delta Center,Salt Lake City,,40.7683,-111.9011,0,NBA,['UTA'],manual,
|
||||
manual_nba_was,Capital One Arena,Washington,,38.8982,-77.0209,0,NBA,['WAS'],manual,
|
||||
manual_mlb_ari,Chase Field,Phoenix,AZ,33.4453,-112.0667,48686,MLB,['ARI'],manual,
|
||||
manual_mlb_atl,Truist Park,Atlanta,GA,33.8907,-84.4678,41084,MLB,['ATL'],manual,
|
||||
manual_mlb_bal,Oriole Park at Camden Yards,Baltimore,MD,39.2838,-76.6218,45971,MLB,['BAL'],manual,
|
||||
manual_mlb_bos,Fenway Park,Boston,MA,42.3467,-71.0972,37755,MLB,['BOS'],manual,
|
||||
manual_mlb_chc,Wrigley Field,Chicago,IL,41.9484,-87.6553,41649,MLB,['CHC'],manual,
|
||||
manual_mlb_chw,Guaranteed Rate Field,Chicago,IL,41.8299,-87.6338,40615,MLB,['CHW'],manual,
|
||||
manual_mlb_cin,Great American Ball Park,Cincinnati,OH,39.0979,-84.5082,42319,MLB,['CIN'],manual,
|
||||
manual_mlb_cle,Progressive Field,Cleveland,OH,41.4962,-81.6852,34830,MLB,['CLE'],manual,
|
||||
manual_mlb_col,Coors Field,Denver,CO,39.7559,-104.9942,50144,MLB,['COL'],manual,
|
||||
manual_mlb_det,Comerica Park,Detroit,MI,42.339,-83.0485,41083,MLB,['DET'],manual,
|
||||
manual_mlb_hou,Minute Maid Park,Houston,TX,29.7573,-95.3555,41168,MLB,['HOU'],manual,
|
||||
manual_mlb_kcr,Kauffman Stadium,Kansas City,MO,39.0517,-94.4803,37903,MLB,['KCR'],manual,
|
||||
manual_mlb_laa,Angel Stadium,Anaheim,CA,33.8003,-117.8827,45517,MLB,['LAA'],manual,
|
||||
manual_mlb_lad,Dodger Stadium,Los Angeles,CA,34.0739,-118.24,56000,MLB,['LAD'],manual,
|
||||
manual_mlb_mia,LoanDepot Park,Miami,FL,25.7781,-80.2196,36742,MLB,['MIA'],manual,
|
||||
manual_mlb_mil,American Family Field,Milwaukee,WI,43.028,-87.9712,41900,MLB,['MIL'],manual,
|
||||
manual_mlb_min,Target Field,Minneapolis,MN,44.9817,-93.2776,38544,MLB,['MIN'],manual,
|
||||
manual_mlb_nym,Citi Field,New York,NY,40.7571,-73.8458,41922,MLB,['NYM'],manual,
|
||||
manual_mlb_nyy,Yankee Stadium,New York,NY,40.8296,-73.9262,46537,MLB,['NYY'],manual,
|
||||
manual_mlb_oak,Sutter Health Park,Sacramento,CA,38.5802,-121.5097,14014,MLB,['OAK'],manual,
|
||||
manual_mlb_phi,Citizens Bank Park,Philadelphia,PA,39.9061,-75.1665,42792,MLB,['PHI'],manual,
|
||||
manual_mlb_pit,PNC Park,Pittsburgh,PA,40.4469,-80.0057,38362,MLB,['PIT'],manual,
|
||||
manual_mlb_sdp,Petco Park,San Diego,CA,32.7076,-117.157,40209,MLB,['SDP'],manual,
|
||||
manual_mlb_sfg,Oracle Park,San Francisco,CA,37.7786,-122.3893,41265,MLB,['SFG'],manual,
|
||||
manual_mlb_sea,T-Mobile Park,Seattle,WA,47.5914,-122.3325,47929,MLB,['SEA'],manual,
|
||||
manual_mlb_stl,Busch Stadium,St. Louis,MO,38.6226,-90.1928,45494,MLB,['STL'],manual,
|
||||
manual_mlb_tbr,Tropicana Field,St. Petersburg,FL,27.7682,-82.6534,25000,MLB,['TBR'],manual,
|
||||
manual_mlb_tex,Globe Life Field,Arlington,TX,32.7473,-97.0845,40300,MLB,['TEX'],manual,
|
||||
manual_mlb_tor,Rogers Centre,Toronto,ON,43.6414,-79.3894,49282,MLB,['TOR'],manual,
manual_mlb_wsn,Nationals Park,Washington,DC,38.873,-77.0074,41339,MLB,['WSN'],manual,
manual_nhl_ana,Honda Center,Anaheim,CA,33.8078,-117.8765,17174,NHL,['ANA'],manual,
manual_nhl_ari,Delta Center,Salt Lake City,UT,40.7683,-111.9011,18306,NHL,['ARI'],manual,
manual_nhl_bos,TD Garden,Boston,MA,42.3662,-71.0621,17565,NHL,['BOS'],manual,
manual_nhl_buf,KeyBank Center,Buffalo,NY,42.875,-78.8764,19070,NHL,['BUF'],manual,
manual_nhl_cgy,Scotiabank Saddledome,Calgary,AB,51.0374,-114.0519,19289,NHL,['CGY'],manual,
manual_nhl_car,PNC Arena,Raleigh,NC,35.8034,-78.722,18680,NHL,['CAR'],manual,
manual_nhl_chi,United Center,Chicago,IL,41.8807,-87.6742,19717,NHL,['CHI'],manual,
manual_nhl_col,Ball Arena,Denver,CO,39.7487,-105.0077,18007,NHL,['COL'],manual,
manual_nhl_cbj,Nationwide Arena,Columbus,OH,39.9693,-83.0061,18500,NHL,['CBJ'],manual,
manual_nhl_dal,American Airlines Center,Dallas,TX,32.7905,-96.8103,18532,NHL,['DAL'],manual,
manual_nhl_det,Little Caesars Arena,Detroit,MI,42.3411,-83.0553,19515,NHL,['DET'],manual,
manual_nhl_edm,Rogers Place,Edmonton,AB,53.5469,-113.4978,18347,NHL,['EDM'],manual,
manual_nhl_fla,Amerant Bank Arena,Sunrise,FL,26.1584,-80.3256,19250,NHL,['FLA'],manual,
manual_nhl_lak,Crypto.com Arena,Los Angeles,CA,34.043,-118.2673,18230,NHL,['LAK'],manual,
manual_nhl_min,Xcel Energy Center,St. Paul,MN,44.9448,-93.101,17954,NHL,['MIN'],manual,
manual_nhl_mtl,Bell Centre,Montreal,QC,45.4961,-73.5693,21302,NHL,['MTL'],manual,
manual_nhl_nsh,Bridgestone Arena,Nashville,TN,36.1592,-86.7785,17159,NHL,['NSH'],manual,
manual_nhl_njd,Prudential Center,Newark,NJ,40.7334,-74.1712,16514,NHL,['NJD'],manual,
manual_nhl_nyi,UBS Arena,Elmont,NY,40.7161,-73.7246,17255,NHL,['NYI'],manual,
manual_nhl_nyr,Madison Square Garden,New York,NY,40.7505,-73.9934,18006,NHL,['NYR'],manual,
manual_nhl_ott,Canadian Tire Centre,Ottawa,ON,45.2969,-75.9272,18652,NHL,['OTT'],manual,
manual_nhl_phi,Wells Fargo Center,Philadelphia,PA,39.9012,-75.172,19543,NHL,['PHI'],manual,
manual_nhl_pit,PPG Paints Arena,Pittsburgh,PA,40.4395,-79.9892,18387,NHL,['PIT'],manual,
manual_nhl_sjs,SAP Center,San Jose,CA,37.3327,-121.901,17562,NHL,['SJS'],manual,
manual_nhl_sea,Climate Pledge Arena,Seattle,WA,47.6221,-122.354,17100,NHL,['SEA'],manual,
manual_nhl_stl,Enterprise Center,St. Louis,MO,38.6268,-90.2025,18096,NHL,['STL'],manual,
manual_nhl_tbl,Amalie Arena,Tampa,FL,27.9426,-82.4519,19092,NHL,['TBL'],manual,
manual_nhl_tor,Scotiabank Arena,Toronto,ON,43.6435,-79.3791,18819,NHL,['TOR'],manual,
manual_nhl_van,Rogers Arena,Vancouver,BC,49.2778,-123.1089,18910,NHL,['VAN'],manual,
manual_nhl_vgk,T-Mobile Arena,Las Vegas,NV,36.1028,-115.1784,17500,NHL,['VGK'],manual,
manual_nhl_wsh,Capital One Arena,Washington,DC,38.8982,-77.0209,18573,NHL,['WSH'],manual,
manual_nhl_wpg,Canada Life Centre,Winnipeg,MB,49.8928,-97.1436,15321,NHL,['WPG'],manual,
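For reference, one of the stadium rows above can be parsed like this. This is a minimal sketch: the column order is inferred from the rows shown (id, name, city, state, latitude, longitude, capacity, sport, team list, source, and a trailing empty `year_opened`), and it assumes single-team rows — an unquoted comma inside the team-list column would throw off a plain CSV split.

```python
import ast
import csv
import io

# One data row in the layout shown above (trailing comma = empty year_opened)
row_text = "manual_nhl_tor,Scotiabank Arena,Toronto,ON,43.6435,-79.3791,18819,NHL,['TOR'],manual,"

fields = next(csv.reader(io.StringIO(row_text)))
stadium = {
    "id": fields[0],
    "name": fields[1],
    "city": fields[2],
    "state": fields[3],
    "latitude": float(fields[4]),
    "longitude": float(fields[5]),
    "capacity": int(fields[6]),
    "sport": fields[7],
    # The team column is a Python-literal list; parse it safely
    "team_abbrevs": ast.literal_eval(fields[8]),
    "source": fields[9],
    "year_opened": int(fields[10]) if fields[10] else None,
}
```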
1382
Scripts/data/stadiums.json
Normal file
File diff suppressed because it is too large
1425
Scripts/data/validation_report.json
Normal file
File diff suppressed because it is too large
275
Scripts/import_to_cloudkit.swift
Normal file
@@ -0,0 +1,275 @@
#!/usr/bin/env swift
//
// import_to_cloudkit.swift
// SportsTime
//
// Imports scraped JSON data into CloudKit public database.
// Run from command line: swift import_to_cloudkit.swift --games data/games.json --stadiums data/stadiums.json
//

import Foundation
import CloudKit
import CoreLocation  // CLLocation for the Stadium "location" field

// MARK: - Data Models (matching scraper output)

struct ScrapedGame: Codable {
    let id: String
    let sport: String
    let season: String
    let date: String
    let time: String?
    let home_team: String
    let away_team: String
    let home_team_abbrev: String
    let away_team_abbrev: String
    let venue: String
    let source: String
    let is_playoff: Bool?
    let broadcast: String?
}

struct ScrapedStadium: Codable {
    let id: String
    let name: String
    let city: String
    let state: String
    let latitude: Double
    let longitude: Double
    let capacity: Int
    let sport: String
    let team_abbrevs: [String]
    let source: String
    let year_opened: Int?
}

// MARK: - CloudKit Importer

class CloudKitImporter {
    let container: CKContainer
    let database: CKDatabase

    init(containerIdentifier: String = "iCloud.com.sportstime.app") {
        self.container = CKContainer(identifier: containerIdentifier)
        self.database = container.publicCloudDatabase
    }

    // MARK: - Import Stadiums

    func importStadiums(from stadiums: [ScrapedStadium]) async throws -> Int {
        var imported = 0

        for stadium in stadiums {
            let record = CKRecord(recordType: "Stadium")
            record["stadiumId"] = stadium.id
            record["name"] = stadium.name
            record["city"] = stadium.city
            record["state"] = stadium.state
            record["location"] = CLLocation(latitude: stadium.latitude, longitude: stadium.longitude)
            record["capacity"] = stadium.capacity
            record["sport"] = stadium.sport
            record["teamAbbrevs"] = stadium.team_abbrevs
            record["source"] = stadium.source

            if let yearOpened = stadium.year_opened {
                record["yearOpened"] = yearOpened
            }

            do {
                _ = try await database.save(record)
                imported += 1
                print("  Imported stadium: \(stadium.name)")
            } catch {
                print("  Error importing \(stadium.name): \(error)")
            }
        }

        return imported
    }

    // MARK: - Import Teams

    func importTeams(from stadiums: [ScrapedStadium], teamMappings: [String: TeamInfo]) async throws -> [String: CKRecord.ID] {
        var teamRecordIDs: [String: CKRecord.ID] = [:]

        for (abbrev, info) in teamMappings {
            let record = CKRecord(recordType: "Team")
            record["teamId"] = UUID().uuidString
            record["name"] = info.name
            record["abbreviation"] = abbrev
            record["sport"] = info.sport
            record["city"] = info.city

            do {
                let saved = try await database.save(record)
                teamRecordIDs[abbrev] = saved.recordID
                print("  Imported team: \(info.name)")
            } catch {
                print("  Error importing team \(info.name): \(error)")
            }
        }

        return teamRecordIDs
    }

    // MARK: - Import Games

    func importGames(
        from games: [ScrapedGame],
        teamRecordIDs: [String: CKRecord.ID],
        stadiumRecordIDs: [String: CKRecord.ID]
    ) async throws -> Int {
        var imported = 0

        // Batch imports for efficiency
        let batchSize = 100
        var batch: [CKRecord] = []

        // Formatters are created once, outside the loop
        let dateFormatter = DateFormatter()
        dateFormatter.dateFormat = "yyyy-MM-dd"
        let timeFormatter = DateFormatter()
        timeFormatter.dateFormat = "HH:mm"
        let calendar = Calendar.current

        for game in games {
            let record = CKRecord(recordType: "Game")
            record["gameId"] = game.id
            record["sport"] = game.sport
            record["season"] = game.season

            // Parse date, combining it with the start time when one is present
            if let date = dateFormatter.date(from: game.date) {
                if let timeStr = game.time, let time = timeFormatter.date(from: timeStr) {
                    let timeComponents = calendar.dateComponents([.hour, .minute], from: time)
                    if let combined = calendar.date(bySettingHour: timeComponents.hour ?? 19,
                                                    minute: timeComponents.minute ?? 0,
                                                    second: 0, of: date) {
                        record["dateTime"] = combined
                    }
                } else {
                    // Default to 7 PM if no time
                    if let defaultTime = calendar.date(bySettingHour: 19, minute: 0, second: 0, of: date) {
                        record["dateTime"] = defaultTime
                    }
                }
            }

            // Team references
            if let homeTeamID = teamRecordIDs[game.home_team_abbrev] {
                record["homeTeamRef"] = CKRecord.Reference(recordID: homeTeamID, action: .none)
            }
            if let awayTeamID = teamRecordIDs[game.away_team_abbrev] {
                record["awayTeamRef"] = CKRecord.Reference(recordID: awayTeamID, action: .none)
            }

            record["isPlayoff"] = (game.is_playoff ?? false) ? 1 : 0
            record["broadcastInfo"] = game.broadcast
            record["source"] = game.source

            batch.append(record)

            // Save batch
            if batch.count >= batchSize {
                do {
                    _ = try await database.modifyRecords(saving: batch, deleting: [], savePolicy: .changedKeys)
                    imported += batch.count
                    print("  Imported batch of \(batch.count) games (total: \(imported))")
                    batch.removeAll()
                } catch {
                    print("  Error importing batch: \(error)")
                }
            }
        }

        // Save remaining
        if !batch.isEmpty {
            do {
                _ = try await database.modifyRecords(saving: batch, deleting: [], savePolicy: .changedKeys)
                imported += batch.count
            } catch {
                print("  Error importing final batch: \(error)")
            }
        }

        return imported
    }
}

// MARK: - Team Info

struct TeamInfo {
    let name: String
    let city: String
    let sport: String
}

// MARK: - Main

func loadJSON<T: Codable>(from path: String) throws -> T {
    let url = URL(fileURLWithPath: path)
    let data = try Data(contentsOf: url)
    return try JSONDecoder().decode(T.self, from: data)
}

func main() async {
    let args = CommandLine.arguments

    guard args.count >= 3 else {
        print("Usage: swift import_to_cloudkit.swift --games <path> --stadiums <path>")
        return
    }

    var gamesPath: String?
    var stadiumsPath: String?

    for i in 1..<args.count {
        if args[i] == "--games" && i + 1 < args.count {
            gamesPath = args[i + 1]
        }
        if args[i] == "--stadiums" && i + 1 < args.count {
            stadiumsPath = args[i + 1]
        }
    }

    let importer = CloudKitImporter()

    // Import stadiums
    if let path = stadiumsPath {
        print("\n=== Importing Stadiums ===")
        do {
            let stadiums: [ScrapedStadium] = try loadJSON(from: path)
            let count = try await importer.importStadiums(from: stadiums)
            print("Imported \(count) stadiums")
        } catch {
            print("Error loading stadiums: \(error)")
        }
    }

    // Import games
    if let path = gamesPath {
        print("\n=== Importing Games ===")
        do {
            let games: [ScrapedGame] = try loadJSON(from: path)
            // Note: Would need to first import teams and get their record IDs
            // This is a simplified version
            print("Loaded \(games.count) games for import")
        } catch {
            print("Error loading games: \(error)")
        }
    }

    print("\n=== Import Complete ===")
}

// Run, then exit explicitly so the run loop below does not spin forever
Task {
    await main()
    exit(0)
}

// Keep the process running for async operations
RunLoop.main.run()
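The batch flush in `importGames` above (accumulate records, write every 100, flush the remainder at the end) is the standard chunking pattern. A minimal language-neutral sketch in Python:

```python
def batched(items, size=100):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# 250 records flush as two full batches and one partial one
flush_sizes = [len(chunk) for chunk in batched(list(range(250)))]
print(flush_sizes)  # [100, 100, 50]
```

The trailing partial chunk falls out of the slicing for free, which is why the Swift version only needs one extra `if !batch.isEmpty` flush after the loop.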
8
Scripts/requirements.txt
Normal file
@@ -0,0 +1,8 @@
# Sports Schedule Scraper Dependencies
requests>=2.28.0
beautifulsoup4>=4.11.0
pandas>=2.0.0
lxml>=4.9.0

# CloudKit Import (optional - only needed for cloudkit_import.py)
cryptography>=41.0.0
435
Scripts/run_pipeline.py
Executable file
@@ -0,0 +1,435 @@
#!/usr/bin/env python3
"""
SportsTime Data Pipeline
========================
Master script that orchestrates all data fetching, validation, and reporting.

Usage:
    python run_pipeline.py                # Full pipeline with defaults
    python run_pipeline.py --season 2026  # Specify season
    python run_pipeline.py --sport nba    # Single sport only
    python run_pipeline.py --skip-scrape  # Validate existing data only
    python run_pipeline.py --verbose      # Detailed output
"""

import argparse
import json
import sys
from datetime import datetime
from pathlib import Path
from dataclasses import dataclass
from enum import Enum

# Import our modules
from scrape_schedules import (
    Game, Stadium,
    scrape_nba_basketball_reference,
    scrape_mlb_statsapi, scrape_mlb_baseball_reference,
    scrape_nhl_hockey_reference,
    generate_stadiums_from_teams,
    export_to_json,
    assign_stable_ids,
)
from validate_data import (
    validate_games,
    validate_stadiums,
    scrape_mlb_all_sources,
    scrape_nba_all_sources,
    scrape_nhl_all_sources,
    ValidationReport,
)


class Severity(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass
class PipelineResult:
    success: bool
    games_scraped: int
    stadiums_scraped: int
    games_by_sport: dict
    validation_reports: list
    stadium_issues: list
    high_severity_count: int
    medium_severity_count: int
    low_severity_count: int
    output_dir: Path
    duration_seconds: float


def print_header(text: str):
    """Print a formatted header."""
    print()
    print("=" * 70)
    print(f"  {text}")
    print("=" * 70)


def print_section(text: str):
    """Print a section header."""
    print()
    print(f"--- {text} ---")


def print_severity(severity: str, message: str):
    """Print a message with severity indicator."""
    icons = {
        'high': '🔴 HIGH',
        'medium': '🟡 MEDIUM',
        'low': '🟢 LOW',
    }
    print(f"  {icons.get(severity, '⚪')} {message}")


def run_pipeline(
    season: int = 2025,
    sport: str = 'all',
    output_dir: Path = Path('./data'),
    skip_scrape: bool = False,
    validate: bool = True,
    verbose: bool = False,
) -> PipelineResult:
    """
    Run the complete data pipeline.
    """
    start_time = datetime.now()

    all_games = []
    all_stadiums = []
    games_by_sport = {}
    validation_reports = []
    stadium_issues = []

    output_dir.mkdir(parents=True, exist_ok=True)

    # =========================================================================
    # PHASE 1: SCRAPE DATA
    # =========================================================================

    if not skip_scrape:
        print_header("PHASE 1: SCRAPING DATA")

        # Scrape stadiums
        print_section("Stadiums")
        all_stadiums = generate_stadiums_from_teams()
        print(f"  Generated {len(all_stadiums)} stadiums from team data")

        # Scrape by sport
        if sport in ['nba', 'all']:
            print_section(f"NBA {season}")
            nba_games = scrape_nba_basketball_reference(season)
            nba_season = f"{season-1}-{str(season)[2:]}"
            nba_games = assign_stable_ids(nba_games, 'NBA', nba_season)
            all_games.extend(nba_games)
            games_by_sport['NBA'] = len(nba_games)

        if sport in ['mlb', 'all']:
            print_section(f"MLB {season}")
            mlb_games = scrape_mlb_statsapi(season)
            # MLB API uses official gamePk - already stable
            all_games.extend(mlb_games)
            games_by_sport['MLB'] = len(mlb_games)

        if sport in ['nhl', 'all']:
            print_section(f"NHL {season}")
            nhl_games = scrape_nhl_hockey_reference(season)
            nhl_season = f"{season-1}-{str(season)[2:]}"
            nhl_games = assign_stable_ids(nhl_games, 'NHL', nhl_season)
            all_games.extend(nhl_games)
            games_by_sport['NHL'] = len(nhl_games)

        # Export data
        print_section("Exporting Data")
        export_to_json(all_games, all_stadiums, output_dir)
        print(f"  Exported to {output_dir}")

    else:
        # Load existing data
        print_header("LOADING EXISTING DATA")

        games_file = output_dir / 'games.json'
        stadiums_file = output_dir / 'stadiums.json'

        if games_file.exists():
            with open(games_file) as f:
                games_data = json.load(f)
            all_games = [Game(**g) for g in games_data]
            for g in all_games:
                games_by_sport[g.sport] = games_by_sport.get(g.sport, 0) + 1
            print(f"  Loaded {len(all_games)} games")

        if stadiums_file.exists():
            with open(stadiums_file) as f:
                stadiums_data = json.load(f)
            all_stadiums = [Stadium(**s) for s in stadiums_data]
            print(f"  Loaded {len(all_stadiums)} stadiums")

    # =========================================================================
    # PHASE 2: VALIDATE DATA
    # =========================================================================

    if validate:
        print_header("PHASE 2: CROSS-VALIDATION")

        # MLB validation (has two good sources)
        if sport in ['mlb', 'all']:
            print_section("MLB Cross-Validation")
            try:
                mlb_sources = scrape_mlb_all_sources(season)
                source_names = list(mlb_sources.keys())

                if len(source_names) >= 2:
                    games1 = mlb_sources[source_names[0]]
                    games2 = mlb_sources[source_names[1]]

                    if games1 and games2:
                        report = validate_games(
                            games1, games2,
                            source_names[0], source_names[1],
                            'MLB', str(season)
                        )
                        validation_reports.append(report)

                        print(f"  Sources: {source_names[0]} vs {source_names[1]}")
                        print(f"  Games compared: {report.total_games_source1} vs {report.total_games_source2}")
                        print(f"  Matched: {report.games_matched}")
                        print(f"  Discrepancies: {len(report.discrepancies)}")
            except Exception as e:
                print(f"  Error during MLB validation: {e}")

        # Stadium validation
        print_section("Stadium Validation")
        stadium_issues = validate_stadiums(all_stadiums)
        print(f"  Issues found: {len(stadium_issues)}")

        # Data quality checks
        print_section("Data Quality Checks")

        # Check game counts per team
        if sport in ['nba', 'all']:
            nba_games = [g for g in all_games if g.sport == 'NBA']
            team_counts = {}
            for g in nba_games:
                team_counts[g.home_team_abbrev] = team_counts.get(g.home_team_abbrev, 0) + 1
                team_counts[g.away_team_abbrev] = team_counts.get(g.away_team_abbrev, 0) + 1

            for team, count in sorted(team_counts.items()):
                if count < 75 or count > 90:
                    print(f"  NBA: {team} has {count} games (expected ~82)")

        if sport in ['nhl', 'all']:
            nhl_games = [g for g in all_games if g.sport == 'NHL']
            team_counts = {}
            for g in nhl_games:
                team_counts[g.home_team_abbrev] = team_counts.get(g.home_team_abbrev, 0) + 1
                team_counts[g.away_team_abbrev] = team_counts.get(g.away_team_abbrev, 0) + 1

            for team, count in sorted(team_counts.items()):
                if count < 75 or count > 90:
                    print(f"  NHL: {team} has {count} games (expected ~82)")

    # =========================================================================
    # PHASE 3: GENERATE REPORT
    # =========================================================================

    print_header("PHASE 3: DISCREPANCY REPORT")

    # Count by severity
    high_count = 0
    medium_count = 0
    low_count = 0

    # Game discrepancies
    for report in validation_reports:
        for d in report.discrepancies:
            if d.severity == 'high':
                high_count += 1
            elif d.severity == 'medium':
                medium_count += 1
            else:
                low_count += 1

    # Stadium issues
    for issue in stadium_issues:
        if issue['severity'] == 'high':
            high_count += 1
        elif issue['severity'] == 'medium':
            medium_count += 1
        else:
            low_count += 1

    # Print summary
    print()
    print(f"  🔴 HIGH severity: {high_count}")
    print(f"  🟡 MEDIUM severity: {medium_count}")
    print(f"  🟢 LOW severity: {low_count}")
    print()

    # Print high severity issues (always)
    if high_count > 0:
        print_section("HIGH Severity Issues (Requires Attention)")

        shown = 0
        max_show = 50 if verbose else 10

        for report in validation_reports:
            for d in report.discrepancies:
                if d.severity == 'high' and shown < max_show:
                    print_severity('high', f"[{report.sport}] {d.field}: {d.game_key}")
                    if verbose:
                        print(f"      {d.source1}: {d.value1}")
                        print(f"      {d.source2}: {d.value2}")
                    shown += 1

        for issue in stadium_issues:
            if issue['severity'] == 'high' and shown < max_show:
                print_severity('high', f"[Stadium] {issue['stadium']}: {issue['issue']}")
                shown += 1

        if high_count > max_show:
            print(f"  ... and {high_count - max_show} more (use --verbose to see all)")

    # Print medium severity if verbose
    if medium_count > 0 and verbose:
        print_section("MEDIUM Severity Issues")

        for report in validation_reports:
            for d in report.discrepancies:
                if d.severity == 'medium':
                    print_severity('medium', f"[{report.sport}] {d.field}: {d.game_key}")

        for issue in stadium_issues:
            if issue['severity'] == 'medium':
                print_severity('medium', f"[Stadium] {issue['stadium']}: {issue['issue']}")

    # Save full report
    report_path = output_dir / 'pipeline_report.json'
    full_report = {
        'generated_at': datetime.now().isoformat(),
        'season': season,
        'sport': sport,
        'summary': {
            'games_scraped': len(all_games),
            'stadiums_scraped': len(all_stadiums),
            'games_by_sport': games_by_sport,
            'high_severity': high_count,
            'medium_severity': medium_count,
            'low_severity': low_count,
        },
        'game_validations': [r.to_dict() for r in validation_reports],
        'stadium_issues': stadium_issues,
    }

    with open(report_path, 'w') as f:
        json.dump(full_report, f, indent=2)

    # =========================================================================
    # FINAL SUMMARY
    # =========================================================================

    duration = (datetime.now() - start_time).total_seconds()

    print_header("PIPELINE COMPLETE")
    print()
    print(f"  Duration: {duration:.1f} seconds")
    print(f"  Games: {len(all_games):,}")
    print(f"  Stadiums: {len(all_stadiums)}")
    print(f"  Output: {output_dir.absolute()}")
    print()

    for sport_name, count in sorted(games_by_sport.items()):
        print(f"  {sport_name}: {count:,} games")

    print()
    print("  Reports saved to:")
    print(f"    - {output_dir / 'games.json'}")
    print(f"    - {output_dir / 'stadiums.json'}")
    print(f"    - {output_dir / 'pipeline_report.json'}")
    print()

    # Status indicator
    if high_count > 0:
        print("  ⚠️  STATUS: Review required - high severity issues found")
    elif medium_count > 0:
        print("  ✓ STATUS: Complete with warnings")
    else:
        print("  ✅ STATUS: All checks passed")

    print()

    return PipelineResult(
        success=high_count == 0,
        games_scraped=len(all_games),
        stadiums_scraped=len(all_stadiums),
        games_by_sport=games_by_sport,
        validation_reports=validation_reports,
        stadium_issues=stadium_issues,
        high_severity_count=high_count,
        medium_severity_count=medium_count,
        low_severity_count=low_count,
        output_dir=output_dir,
        duration_seconds=duration,
    )


def main():
    parser = argparse.ArgumentParser(
        description='SportsTime Data Pipeline - Fetch, validate, and report on sports data',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  python run_pipeline.py                # Full pipeline
  python run_pipeline.py --season 2026  # Different season
  python run_pipeline.py --sport mlb    # MLB only
  python run_pipeline.py --skip-scrape  # Validate existing data
  python run_pipeline.py --verbose      # Show all issues
"""
    )

    parser.add_argument(
        '--season', type=int, default=2025,
        help='Season year (default: 2025)'
    )
    parser.add_argument(
        '--sport', choices=['nba', 'mlb', 'nhl', 'all'], default='all',
        help='Sport to process (default: all)'
    )
    parser.add_argument(
        '--output', type=str, default='./data',
        help='Output directory (default: ./data)'
    )
    parser.add_argument(
        '--skip-scrape', action='store_true',
        help='Skip scraping, validate existing data only'
    )
    parser.add_argument(
        '--no-validate', action='store_true',
        help='Skip validation step'
    )
    parser.add_argument(
        '--verbose', '-v', action='store_true',
        help='Verbose output with all issues'
    )

    args = parser.parse_args()

    result = run_pipeline(
        season=args.season,
        sport=args.sport,
        output_dir=Path(args.output),
        skip_scrape=args.skip_scrape,
        validate=not args.no_validate,
        verbose=args.verbose,
    )

    # Exit with error code if high severity issues
    sys.exit(0 if result.success else 1)


if __name__ == '__main__':
    main()
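One detail in run_pipeline.py worth pinning down: the `f"{season-1}-{str(season)[2:]}"` expression used before `assign_stable_ids` turns a season year into the cross-year label NBA and NHL seasons use. A quick self-contained check of that expression:

```python
def season_label(season: int) -> str:
    """Format a season year as a cross-year label, e.g. 2025 -> '2024-25'
    (the 2024-25 season is the one that ends in 2025)."""
    return f"{season - 1}-{str(season)[2:]}"

print(season_label(2025))  # 2024-25
print(season_label(2026))  # 2025-26
```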
970
Scripts/scrape_schedules.py
Normal file
@@ -0,0 +1,970 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Sports Schedule Scraper for SportsTime App
|
||||
Scrapes NBA, MLB, NHL schedules from multiple sources for cross-validation.
|
||||
|
||||
Usage:
|
||||
python scrape_schedules.py --sport nba --season 2025
|
||||
python scrape_schedules.py --sport all --season 2025
|
||||
python scrape_schedules.py --stadiums-only
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import time
|
||||
import re
|
||||
from datetime import datetime, timedelta
|
||||
from pathlib import Path
|
||||
from dataclasses import dataclass, asdict
|
||||
from typing import Optional
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
import pandas as pd
|
||||
|
||||
# Rate limiting
|
||||
REQUEST_DELAY = 3.0 # seconds between requests to same domain
|
||||
last_request_time = {}
|
||||
|
||||
|
||||
def rate_limit(domain: str):
|
||||
"""Enforce rate limiting per domain."""
|
||||
now = time.time()
|
||||
if domain in last_request_time:
|
||||
elapsed = now - last_request_time[domain]
|
||||
if elapsed < REQUEST_DELAY:
|
||||
time.sleep(REQUEST_DELAY - elapsed)
|
||||
last_request_time[domain] = time.time()
|
||||
|
||||
|
||||
def fetch_page(url: str, domain: str) -> Optional[BeautifulSoup]:
|
||||
"""Fetch and parse a webpage with rate limiting."""
|
||||
rate_limit(domain)
|
||||
headers = {
|
||||
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
|
||||
}
|
||||
try:
|
||||
response = requests.get(url, headers=headers, timeout=30)
|
||||
response.raise_for_status()
|
||||
return BeautifulSoup(response.content, 'html.parser')
|
||||
except Exception as e:
|
||||
print(f"Error fetching {url}: {e}")
|
||||
return None
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# DATA CLASSES
|
||||
# =============================================================================
|
||||
|
||||
@dataclass
|
||||
class Game:
|
||||
id: str
|
||||
sport: str
|
||||
season: str
|
||||
date: str # YYYY-MM-DD
|
||||
time: Optional[str] # HH:MM (24hr, ET)
|
||||
home_team: str
|
||||
away_team: str
|
||||
home_team_abbrev: str
|
||||
away_team_abbrev: str
|
||||
venue: str
|
||||
source: str
|
||||
is_playoff: bool = False
|
||||
broadcast: Optional[str] = None
|
||||
|
||||
|
||||
@dataclass
|
||||
class Stadium:
|
||||
id: str
|
||||
name: str
|
||||
city: str
|
||||
state: str
|
||||
latitude: float
|
||||
longitude: float
|
||||
capacity: int
|
||||
sport: str
|
||||
team_abbrevs: list
|
||||
source: str
|
||||
year_opened: Optional[int] = None
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# TEAM MAPPINGS
|
||||
# =============================================================================
|
||||
|
||||
NBA_TEAMS = {
|
||||
'ATL': {'name': 'Atlanta Hawks', 'city': 'Atlanta', 'arena': 'State Farm Arena'},
|
||||
'BOS': {'name': 'Boston Celtics', 'city': 'Boston', 'arena': 'TD Garden'},
|
||||
'BRK': {'name': 'Brooklyn Nets', 'city': 'Brooklyn', 'arena': 'Barclays Center'},
|
||||
'CHO': {'name': 'Charlotte Hornets', 'city': 'Charlotte', 'arena': 'Spectrum Center'},
|
||||
'CHI': {'name': 'Chicago Bulls', 'city': 'Chicago', 'arena': 'United Center'},
|
||||
'CLE': {'name': 'Cleveland Cavaliers', 'city': 'Cleveland', 'arena': 'Rocket Mortgage FieldHouse'},
|
||||
'DAL': {'name': 'Dallas Mavericks', 'city': 'Dallas', 'arena': 'American Airlines Center'},
|
||||
'DEN': {'name': 'Denver Nuggets', 'city': 'Denver', 'arena': 'Ball Arena'},
|
||||
'DET': {'name': 'Detroit Pistons', 'city': 'Detroit', 'arena': 'Little Caesars Arena'},
|
||||
'GSW': {'name': 'Golden State Warriors', 'city': 'San Francisco', 'arena': 'Chase Center'},
|
||||
'HOU': {'name': 'Houston Rockets', 'city': 'Houston', 'arena': 'Toyota Center'},
|
||||
'IND': {'name': 'Indiana Pacers', 'city': 'Indianapolis', 'arena': 'Gainbridge Fieldhouse'},
|
||||
'LAC': {'name': 'Los Angeles Clippers', 'city': 'Inglewood', 'arena': 'Intuit Dome'},
|
||||
'LAL': {'name': 'Los Angeles Lakers', 'city': 'Los Angeles', 'arena': 'Crypto.com Arena'},
|
||||
'MEM': {'name': 'Memphis Grizzlies', 'city': 'Memphis', 'arena': 'FedExForum'},
|
||||
'MIA': {'name': 'Miami Heat', 'city': 'Miami', 'arena': 'Kaseya Center'},
|
||||
'MIL': {'name': 'Milwaukee Bucks', 'city': 'Milwaukee', 'arena': 'Fiserv Forum'},
|
||||
'MIN': {'name': 'Minnesota Timberwolves', 'city': 'Minneapolis', 'arena': 'Target Center'},
|
||||
'NOP': {'name': 'New Orleans Pelicans', 'city': 'New Orleans', 'arena': 'Smoothie King Center'},
|
||||
'NYK': {'name': 'New York Knicks', 'city': 'New York', 'arena': 'Madison Square Garden'},
|
||||
'OKC': {'name': 'Oklahoma City Thunder', 'city': 'Oklahoma City', 'arena': 'Paycom Center'},
|
||||
'ORL': {'name': 'Orlando Magic', 'city': 'Orlando', 'arena': 'Kia Center'},
|
||||
'PHI': {'name': 'Philadelphia 76ers', 'city': 'Philadelphia', 'arena': 'Wells Fargo Center'},
|
||||
'PHO': {'name': 'Phoenix Suns', 'city': 'Phoenix', 'arena': 'Footprint Center'},
|
||||
'POR': {'name': 'Portland Trail Blazers', 'city': 'Portland', 'arena': 'Moda Center'},
|
||||
'SAC': {'name': 'Sacramento Kings', 'city': 'Sacramento', 'arena': 'Golden 1 Center'},
|
||||
'SAS': {'name': 'San Antonio Spurs', 'city': 'San Antonio', 'arena': 'Frost Bank Center'},
|
||||
'TOR': {'name': 'Toronto Raptors', 'city': 'Toronto', 'arena': 'Scotiabank Arena'},
|
||||
'UTA': {'name': 'Utah Jazz', 'city': 'Salt Lake City', 'arena': 'Delta Center'},
|
||||
'WAS': {'name': 'Washington Wizards', 'city': 'Washington', 'arena': 'Capital One Arena'},
|
||||
}
|
||||
|
||||
MLB_TEAMS = {
    'ARI': {'name': 'Arizona Diamondbacks', 'city': 'Phoenix', 'stadium': 'Chase Field'},
    'ATL': {'name': 'Atlanta Braves', 'city': 'Atlanta', 'stadium': 'Truist Park'},
    'BAL': {'name': 'Baltimore Orioles', 'city': 'Baltimore', 'stadium': 'Oriole Park at Camden Yards'},
    'BOS': {'name': 'Boston Red Sox', 'city': 'Boston', 'stadium': 'Fenway Park'},
    'CHC': {'name': 'Chicago Cubs', 'city': 'Chicago', 'stadium': 'Wrigley Field'},
    'CHW': {'name': 'Chicago White Sox', 'city': 'Chicago', 'stadium': 'Guaranteed Rate Field'},
    'CIN': {'name': 'Cincinnati Reds', 'city': 'Cincinnati', 'stadium': 'Great American Ball Park'},
    'CLE': {'name': 'Cleveland Guardians', 'city': 'Cleveland', 'stadium': 'Progressive Field'},
    'COL': {'name': 'Colorado Rockies', 'city': 'Denver', 'stadium': 'Coors Field'},
    'DET': {'name': 'Detroit Tigers', 'city': 'Detroit', 'stadium': 'Comerica Park'},
    'HOU': {'name': 'Houston Astros', 'city': 'Houston', 'stadium': 'Minute Maid Park'},
    'KCR': {'name': 'Kansas City Royals', 'city': 'Kansas City', 'stadium': 'Kauffman Stadium'},
    'LAA': {'name': 'Los Angeles Angels', 'city': 'Anaheim', 'stadium': 'Angel Stadium'},
    'LAD': {'name': 'Los Angeles Dodgers', 'city': 'Los Angeles', 'stadium': 'Dodger Stadium'},
    'MIA': {'name': 'Miami Marlins', 'city': 'Miami', 'stadium': 'LoanDepot Park'},
    'MIL': {'name': 'Milwaukee Brewers', 'city': 'Milwaukee', 'stadium': 'American Family Field'},
    'MIN': {'name': 'Minnesota Twins', 'city': 'Minneapolis', 'stadium': 'Target Field'},
    'NYM': {'name': 'New York Mets', 'city': 'New York', 'stadium': 'Citi Field'},
    'NYY': {'name': 'New York Yankees', 'city': 'New York', 'stadium': 'Yankee Stadium'},
    'OAK': {'name': 'Oakland Athletics', 'city': 'Sacramento', 'stadium': 'Sutter Health Park'},
    'PHI': {'name': 'Philadelphia Phillies', 'city': 'Philadelphia', 'stadium': 'Citizens Bank Park'},
    'PIT': {'name': 'Pittsburgh Pirates', 'city': 'Pittsburgh', 'stadium': 'PNC Park'},
    'SDP': {'name': 'San Diego Padres', 'city': 'San Diego', 'stadium': 'Petco Park'},
    'SFG': {'name': 'San Francisco Giants', 'city': 'San Francisco', 'stadium': 'Oracle Park'},
    'SEA': {'name': 'Seattle Mariners', 'city': 'Seattle', 'stadium': 'T-Mobile Park'},
    'STL': {'name': 'St. Louis Cardinals', 'city': 'St. Louis', 'stadium': 'Busch Stadium'},
    'TBR': {'name': 'Tampa Bay Rays', 'city': 'St. Petersburg', 'stadium': 'Tropicana Field'},
    'TEX': {'name': 'Texas Rangers', 'city': 'Arlington', 'stadium': 'Globe Life Field'},
    'TOR': {'name': 'Toronto Blue Jays', 'city': 'Toronto', 'stadium': 'Rogers Centre'},
    'WSN': {'name': 'Washington Nationals', 'city': 'Washington', 'stadium': 'Nationals Park'},
}

NHL_TEAMS = {
    'ANA': {'name': 'Anaheim Ducks', 'city': 'Anaheim', 'arena': 'Honda Center'},
    'ARI': {'name': 'Utah Hockey Club', 'city': 'Salt Lake City', 'arena': 'Delta Center'},  # legacy Arizona code retained after relocation
    'BOS': {'name': 'Boston Bruins', 'city': 'Boston', 'arena': 'TD Garden'},
    'BUF': {'name': 'Buffalo Sabres', 'city': 'Buffalo', 'arena': 'KeyBank Center'},
    'CGY': {'name': 'Calgary Flames', 'city': 'Calgary', 'arena': 'Scotiabank Saddledome'},
    'CAR': {'name': 'Carolina Hurricanes', 'city': 'Raleigh', 'arena': 'PNC Arena'},
    'CHI': {'name': 'Chicago Blackhawks', 'city': 'Chicago', 'arena': 'United Center'},
    'COL': {'name': 'Colorado Avalanche', 'city': 'Denver', 'arena': 'Ball Arena'},
    'CBJ': {'name': 'Columbus Blue Jackets', 'city': 'Columbus', 'arena': 'Nationwide Arena'},
    'DAL': {'name': 'Dallas Stars', 'city': 'Dallas', 'arena': 'American Airlines Center'},
    'DET': {'name': 'Detroit Red Wings', 'city': 'Detroit', 'arena': 'Little Caesars Arena'},
    'EDM': {'name': 'Edmonton Oilers', 'city': 'Edmonton', 'arena': 'Rogers Place'},
    'FLA': {'name': 'Florida Panthers', 'city': 'Sunrise', 'arena': 'Amerant Bank Arena'},
    'LAK': {'name': 'Los Angeles Kings', 'city': 'Los Angeles', 'arena': 'Crypto.com Arena'},
    'MIN': {'name': 'Minnesota Wild', 'city': 'St. Paul', 'arena': 'Xcel Energy Center'},
    'MTL': {'name': 'Montreal Canadiens', 'city': 'Montreal', 'arena': 'Bell Centre'},
    'NSH': {'name': 'Nashville Predators', 'city': 'Nashville', 'arena': 'Bridgestone Arena'},
    'NJD': {'name': 'New Jersey Devils', 'city': 'Newark', 'arena': 'Prudential Center'},
    'NYI': {'name': 'New York Islanders', 'city': 'Elmont', 'arena': 'UBS Arena'},
    'NYR': {'name': 'New York Rangers', 'city': 'New York', 'arena': 'Madison Square Garden'},
    'OTT': {'name': 'Ottawa Senators', 'city': 'Ottawa', 'arena': 'Canadian Tire Centre'},
    'PHI': {'name': 'Philadelphia Flyers', 'city': 'Philadelphia', 'arena': 'Wells Fargo Center'},
    'PIT': {'name': 'Pittsburgh Penguins', 'city': 'Pittsburgh', 'arena': 'PPG Paints Arena'},
    'SJS': {'name': 'San Jose Sharks', 'city': 'San Jose', 'arena': 'SAP Center'},
    'SEA': {'name': 'Seattle Kraken', 'city': 'Seattle', 'arena': 'Climate Pledge Arena'},
    'STL': {'name': 'St. Louis Blues', 'city': 'St. Louis', 'arena': 'Enterprise Center'},
    'TBL': {'name': 'Tampa Bay Lightning', 'city': 'Tampa', 'arena': 'Amalie Arena'},
    'TOR': {'name': 'Toronto Maple Leafs', 'city': 'Toronto', 'arena': 'Scotiabank Arena'},
    'VAN': {'name': 'Vancouver Canucks', 'city': 'Vancouver', 'arena': 'Rogers Arena'},
    'VGK': {'name': 'Vegas Golden Knights', 'city': 'Las Vegas', 'arena': 'T-Mobile Arena'},
    'WSH': {'name': 'Washington Capitals', 'city': 'Washington', 'arena': 'Capital One Arena'},
    'WPG': {'name': 'Winnipeg Jets', 'city': 'Winnipeg', 'arena': 'Canada Life Centre'},
}


# =============================================================================
# SCRAPERS - NBA
# =============================================================================

def scrape_nba_basketball_reference(season: int) -> list[Game]:
    """
    Scrape the NBA schedule from Basketball-Reference.

    URL: https://www.basketball-reference.com/leagues/NBA_{YEAR}_games-{month}.html
    The season year is the ending year (e.g., 2025 for the 2024-25 season).
    """
    games = []
    months = ['october', 'november', 'december', 'january', 'february',
              'march', 'april', 'may', 'june']

    print(f"Scraping NBA {season} from Basketball-Reference...")

    for month in months:
        url = f"https://www.basketball-reference.com/leagues/NBA_{season}_games-{month}.html"
        soup = fetch_page(url, 'basketball-reference.com')
        if not soup:
            continue

        table = soup.find('table', {'id': 'schedule'})
        if not table:
            continue

        tbody = table.find('tbody')
        if not tbody:
            continue

        for row in tbody.find_all('tr'):
            # Skip repeated header rows embedded in the table body
            if row.get('class') and 'thead' in row.get('class'):
                continue

            cells = row.find_all(['td', 'th'])
            if len(cells) < 6:
                continue

            try:
                # Parse date
                date_cell = row.find('th', {'data-stat': 'date_game'})
                if not date_cell:
                    continue
                date_link = date_cell.find('a')
                date_str = date_link.text if date_link else date_cell.text

                # Parse tip-off time
                time_cell = row.find('td', {'data-stat': 'game_start_time'})
                time_str = time_cell.text.strip() if time_cell else None

                # Parse teams
                visitor_cell = row.find('td', {'data-stat': 'visitor_team_name'})
                home_cell = row.find('td', {'data-stat': 'home_team_name'})
                if not visitor_cell or not home_cell:
                    continue

                visitor_link = visitor_cell.find('a')
                home_link = home_cell.find('a')
                away_team = visitor_link.text if visitor_link else visitor_cell.text
                home_team = home_link.text if home_link else home_cell.text

                # Parse arena
                arena_cell = row.find('td', {'data-stat': 'arena_name'})
                arena = arena_cell.text.strip() if arena_cell else ''

                # Convert a date like "Tue, Oct 22, 2024" to ISO format
                try:
                    parsed_date = datetime.strptime(date_str.strip(), '%a, %b %d, %Y')
                    date_formatted = parsed_date.strftime('%Y-%m-%d')
                except ValueError:
                    continue

                # Generate game ID
                game_id = f"nba_{date_formatted}_{away_team[:3]}_{home_team[:3]}".lower().replace(' ', '')

                game = Game(
                    id=game_id,
                    sport='NBA',
                    season=f"{season-1}-{str(season)[2:]}",
                    date=date_formatted,
                    time=time_str,
                    home_team=home_team,
                    away_team=away_team,
                    home_team_abbrev=get_team_abbrev(home_team, 'NBA'),
                    away_team_abbrev=get_team_abbrev(away_team, 'NBA'),
                    venue=arena,
                    source='basketball-reference.com'
                )
                games.append(game)

            except Exception as e:
                print(f"  Error parsing row: {e}")
                continue

    print(f"  Found {len(games)} games from Basketball-Reference")
    return games


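# The Basketball-Reference date conversion above can be sanity-checked in
# isolation. Illustrative helper, not used by the pipeline; the site renders
# dates like "Tue, Oct 22, 2024".
def _parse_bref_date(date_str: str) -> str:
    """Convert a Basketball-Reference date string to ISO YYYY-MM-DD."""
    return datetime.strptime(date_str.strip(), '%a, %b %d, %Y').strftime('%Y-%m-%d')

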
def scrape_nba_espn(season: int) -> list[Game]:
    """
    Scrape the NBA schedule from ESPN.

    URL: https://www.espn.com/nba/schedule/_/date/{YYYYMMDD}
    """
    games = []
    print(f"Scraping NBA {season} from ESPN...")

    # Determine the date range for the season
    start_date = datetime(season - 1, 10, 1)  # October of the previous year
    end_date = datetime(season, 6, 30)        # June of the season year

    current_date = start_date
    while current_date <= end_date:
        date_str = current_date.strftime('%Y%m%d')
        url = f"https://www.espn.com/nba/schedule/_/date/{date_str}"

        soup = fetch_page(url, 'espn.com')
        if soup:
            # ESPN renders its schedule with JavaScript, so only limited data
            # is present in the static HTML. A full implementation would need
            # a headless browser such as Selenium.
            pass

        current_date += timedelta(days=7)  # Sample weekly to respect rate limits

    print(f"  Found {len(games)} games from ESPN")
    return games


# =============================================================================
# SCRAPERS - MLB
# =============================================================================

def scrape_mlb_baseball_reference(season: int) -> list[Game]:
    """
    Scrape the MLB schedule from Baseball-Reference.

    URL: https://www.baseball-reference.com/leagues/majors/{YEAR}-schedule.shtml
    """
    games = []
    url = f"https://www.baseball-reference.com/leagues/majors/{season}-schedule.shtml"

    print(f"Scraping MLB {season} from Baseball-Reference...")
    soup = fetch_page(url, 'baseball-reference.com')
    if not soup:
        return games

    # Baseball-Reference groups games by date under h3 headers
    current_date = None

    # Find the schedule section
    schedule_div = soup.find('div', {'id': 'all_schedule'})
    if not schedule_div:
        schedule_div = soup

    # Walk the elements in document order so each game inherits the most
    # recent date header
    for element in schedule_div.find_all(['h3', 'p', 'div']):
        # Date header, e.g. "Thursday, March 27, 2025"
        if element.name == 'h3':
            date_text = element.get_text(strip=True)
            for fmt in ['%A, %B %d, %Y', '%B %d, %Y', '%a, %b %d, %Y']:
                try:
                    parsed = datetime.strptime(date_text, fmt)
                    current_date = parsed.strftime('%Y-%m-%d')
                    break
                except ValueError:
                    continue

        # Game entry
        elif element.name == 'p' and 'game' in element.get('class', []):
            if not current_date:
                continue

            try:
                links = element.find_all('a')
                if len(links) >= 2:
                    away_team = links[0].text.strip()
                    home_team = links[1].text.strip()

                    # Generate a unique game ID
                    away_abbrev = get_team_abbrev(away_team, 'MLB')
                    home_abbrev = get_team_abbrev(home_team, 'MLB')
                    game_id = f"mlb_br_{current_date}_{away_abbrev}_{home_abbrev}".lower()

                    game = Game(
                        id=game_id,
                        sport='MLB',
                        season=str(season),
                        date=current_date,
                        time=None,
                        home_team=home_team,
                        away_team=away_team,
                        home_team_abbrev=home_abbrev,
                        away_team_abbrev=away_abbrev,
                        venue='',
                        source='baseball-reference.com'
                    )
                    games.append(game)

            except Exception:
                continue

    print(f"  Found {len(games)} games from Baseball-Reference")
    return games


def scrape_mlb_statsapi(season: int) -> list[Game]:
    """
    Fetch the MLB schedule from the official Stats API (JSON).

    URL: https://statsapi.mlb.com/api/v1/schedule?sportId=1&season={YEAR}&gameType=R
    """
    games = []
    url = f"https://statsapi.mlb.com/api/v1/schedule?sportId=1&season={season}&gameType=R&hydrate=team,venue"

    print(f"Fetching MLB {season} from Stats API...")

    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        data = response.json()

        for date_entry in data.get('dates', []):
            game_date = date_entry.get('date', '')

            for game_data in date_entry.get('games', []):
                try:
                    teams = game_data.get('teams', {})
                    away = teams.get('away', {}).get('team', {})
                    home = teams.get('home', {}).get('team', {})
                    venue = game_data.get('venue', {})

                    # gameDate is an ISO-8601 UTC timestamp; keep just HH:MM
                    game_time = game_data.get('gameDate', '')
                    if 'T' in game_time:
                        time_str = game_time.split('T')[1][:5]
                    else:
                        time_str = None

                    game = Game(
                        id=f"mlb_{game_data.get('gamePk', '')}",
                        sport='MLB',
                        season=str(season),
                        date=game_date,
                        time=time_str,
                        home_team=home.get('name', ''),
                        away_team=away.get('name', ''),
                        home_team_abbrev=home.get('abbreviation', ''),
                        away_team_abbrev=away.get('abbreviation', ''),
                        venue=venue.get('name', ''),
                        source='statsapi.mlb.com'
                    )
                    games.append(game)

                except Exception:
                    continue

    except Exception as e:
        print(f"  Error fetching MLB API: {e}")

    print(f"  Found {len(games)} games from MLB Stats API")
    return games


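# The HH:MM extraction above can be checked in isolation. Illustrative helper,
# not used by the pipeline; the Stats API returns timestamps like
# "2025-03-27T23:05:00Z", and the extracted time is UTC, not local.
def _extract_utc_time(game_date: str):
    """Pull the HH:MM portion (UTC) out of an ISO-8601 gameDate string."""
    return game_date.split('T')[1][:5] if 'T' in game_date else None

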
# =============================================================================
# SCRAPERS - NHL
# =============================================================================

def scrape_nhl_hockey_reference(season: int) -> list[Game]:
    """
    Scrape the NHL schedule from Hockey-Reference.

    URL: https://www.hockey-reference.com/leagues/NHL_{YEAR}_games.html
    """
    games = []
    url = f"https://www.hockey-reference.com/leagues/NHL_{season}_games.html"

    print(f"Scraping NHL {season} from Hockey-Reference...")
    soup = fetch_page(url, 'hockey-reference.com')
    if not soup:
        return games

    table = soup.find('table', {'id': 'games'})
    if not table:
        print("  Could not find games table")
        return games

    tbody = table.find('tbody')
    if not tbody:
        return games

    for row in tbody.find_all('tr'):
        try:
            cells = row.find_all(['td', 'th'])
            if len(cells) < 5:
                continue

            # Parse date
            date_cell = row.find('th', {'data-stat': 'date_game'})
            if not date_cell:
                continue
            date_link = date_cell.find('a')
            date_str = date_link.text if date_link else date_cell.text

            # Parse teams
            visitor_cell = row.find('td', {'data-stat': 'visitor_team_name'})
            home_cell = row.find('td', {'data-stat': 'home_team_name'})
            if not visitor_cell or not home_cell:
                continue

            visitor_link = visitor_cell.find('a')
            home_link = home_cell.find('a')
            away_team = visitor_link.text if visitor_link else visitor_cell.text
            home_team = home_link.text if home_link else home_cell.text

            # Validate the date (Hockey-Reference already uses ISO format)
            try:
                parsed_date = datetime.strptime(date_str.strip(), '%Y-%m-%d')
                date_formatted = parsed_date.strftime('%Y-%m-%d')
            except ValueError:
                continue

            game_id = f"nhl_{date_formatted}_{away_team[:3]}_{home_team[:3]}".lower().replace(' ', '')

            game = Game(
                id=game_id,
                sport='NHL',
                season=f"{season-1}-{str(season)[2:]}",
                date=date_formatted,
                time=None,
                home_team=home_team,
                away_team=away_team,
                home_team_abbrev=get_team_abbrev(home_team, 'NHL'),
                away_team_abbrev=get_team_abbrev(away_team, 'NHL'),
                venue='',
                source='hockey-reference.com'
            )
            games.append(game)

        except Exception:
            continue

    print(f"  Found {len(games)} games from Hockey-Reference")
    return games


def scrape_nhl_api(season: int) -> list[Game]:
    """
    Fetch the NHL schedule from the official API (JSON).

    URL: https://api-web.nhle.com/v1/schedule/{YYYY-MM-DD}
    """
    games = []
    print(f"Fetching NHL {season} from NHL API...")

    # The NHL API serves per-club schedules, so a full implementation would
    # need to iterate through dates or teams. Stub for now.

    return games


# =============================================================================
# STADIUM SCRAPER
# =============================================================================

def scrape_stadiums_hifld() -> list[Stadium]:
    """
    Fetch stadium data from HIFLD Open Data (US Government).

    The endpoint returns GeoJSON-style features with coordinates.
    """
    stadiums = []
    url = "https://services1.arcgis.com/Hp6G80Pky0om7QvQ/arcgis/rest/services/Major_Sport_Venues/FeatureServer/0/query?where=1%3D1&outFields=*&outSR=4326&f=json"

    print("Fetching stadiums from HIFLD Open Data...")

    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        data = response.json()

        # Keep only the leagues this app supports
        sport_map = {'NBA': 'NBA', 'MLB': 'MLB', 'NHL': 'NHL'}

        for feature in data.get('features', []):
            attrs = feature.get('attributes', {})
            geom = feature.get('geometry', {})

            league = attrs.get('LEAGUE', '')
            if league not in sport_map:
                continue

            stadium = Stadium(
                id=f"hifld_{attrs.get('OBJECTID', '')}",
                name=attrs.get('NAME', ''),
                city=attrs.get('CITY', ''),
                state=attrs.get('STATE', ''),
                latitude=geom.get('y', 0),
                longitude=geom.get('x', 0),
                capacity=attrs.get('CAPACITY', 0) or 0,
                sport=sport_map[league],
                team_abbrevs=[attrs.get('TEAM', '')],
                source='hifld.gov',
                year_opened=attrs.get('YEAR_OPEN')
            )
            stadiums.append(stadium)

    except Exception as e:
        print(f"  Error fetching HIFLD data: {e}")

    print(f"  Found {len(stadiums)} stadiums from HIFLD")
    return stadiums


def generate_stadiums_from_teams() -> list[Stadium]:
    """
    Generate stadium data from the team mappings with manually curated
    coordinates. This serves as a fallback/validation source.
    """
    stadiums = []

    # NBA arenas: (latitude, longitude), manually curated
    nba_coords = {
        'State Farm Arena': (33.7573, -84.3963),
        'TD Garden': (42.3662, -71.0621),
        'Barclays Center': (40.6826, -73.9754),
        'Spectrum Center': (35.2251, -80.8392),
        'United Center': (41.8807, -87.6742),
        'Rocket Mortgage FieldHouse': (41.4965, -81.6882),
        'American Airlines Center': (32.7905, -96.8103),
        'Ball Arena': (39.7487, -105.0077),
        'Little Caesars Arena': (42.3411, -83.0553),
        'Chase Center': (37.7680, -122.3879),
        'Toyota Center': (29.7508, -95.3621),
        'Gainbridge Fieldhouse': (39.7640, -86.1555),
        'Intuit Dome': (33.9425, -118.3419),
        'Crypto.com Arena': (34.0430, -118.2673),
        'FedExForum': (35.1382, -90.0506),
        'Kaseya Center': (25.7814, -80.1870),
        'Fiserv Forum': (43.0451, -87.9174),
        'Target Center': (44.9795, -93.2761),
        'Smoothie King Center': (29.9490, -90.0821),
        'Madison Square Garden': (40.7505, -73.9934),
        'Paycom Center': (35.4634, -97.5151),
        'Kia Center': (28.5392, -81.3839),
        'Wells Fargo Center': (39.9012, -75.1720),
        'Footprint Center': (33.4457, -112.0712),
        'Moda Center': (45.5316, -122.6668),
        'Golden 1 Center': (38.5802, -121.4997),
        'Frost Bank Center': (29.4270, -98.4375),
        'Scotiabank Arena': (43.6435, -79.3791),
        'Delta Center': (40.7683, -111.9011),
        'Capital One Arena': (38.8982, -77.0209),
    }

    for abbrev, info in NBA_TEAMS.items():
        arena = info['arena']
        coords = nba_coords.get(arena, (0, 0))

        stadium = Stadium(
            id=f"manual_nba_{abbrev.lower()}",
            name=arena,
            city=info['city'],
            state='',
            latitude=coords[0],
            longitude=coords[1],
            capacity=0,
            sport='NBA',
            team_abbrevs=[abbrev],
            source='manual'
        )
        stadiums.append(stadium)

    # MLB stadiums: (latitude, longitude, state, capacity)
    mlb_coords = {
        'Chase Field': (33.4453, -112.0667, 'AZ', 48686),
        'Truist Park': (33.8907, -84.4678, 'GA', 41084),
        'Oriole Park at Camden Yards': (39.2838, -76.6218, 'MD', 45971),
        'Fenway Park': (42.3467, -71.0972, 'MA', 37755),
        'Wrigley Field': (41.9484, -87.6553, 'IL', 41649),
        'Guaranteed Rate Field': (41.8299, -87.6338, 'IL', 40615),
        'Great American Ball Park': (39.0979, -84.5082, 'OH', 42319),
        'Progressive Field': (41.4962, -81.6852, 'OH', 34830),
        'Coors Field': (39.7559, -104.9942, 'CO', 50144),
        'Comerica Park': (42.3390, -83.0485, 'MI', 41083),
        'Minute Maid Park': (29.7573, -95.3555, 'TX', 41168),
        'Kauffman Stadium': (39.0517, -94.4803, 'MO', 37903),
        'Angel Stadium': (33.8003, -117.8827, 'CA', 45517),
        'Dodger Stadium': (34.0739, -118.2400, 'CA', 56000),
        'LoanDepot Park': (25.7781, -80.2196, 'FL', 36742),
        'American Family Field': (43.0280, -87.9712, 'WI', 41900),
        'Target Field': (44.9817, -93.2776, 'MN', 38544),
        'Citi Field': (40.7571, -73.8458, 'NY', 41922),
        'Yankee Stadium': (40.8296, -73.9262, 'NY', 46537),
        'Sutter Health Park': (38.5802, -121.5097, 'CA', 14014),
        'Citizens Bank Park': (39.9061, -75.1665, 'PA', 42792),
        'PNC Park': (40.4469, -80.0057, 'PA', 38362),
        'Petco Park': (32.7076, -117.1570, 'CA', 40209),
        'Oracle Park': (37.7786, -122.3893, 'CA', 41265),
        'T-Mobile Park': (47.5914, -122.3325, 'WA', 47929),
        'Busch Stadium': (38.6226, -90.1928, 'MO', 45494),
        'Tropicana Field': (27.7682, -82.6534, 'FL', 25000),
        'Globe Life Field': (32.7473, -97.0845, 'TX', 40300),
        'Rogers Centre': (43.6414, -79.3894, 'ON', 49282),
        'Nationals Park': (38.8730, -77.0074, 'DC', 41339),
    }

    for abbrev, info in MLB_TEAMS.items():
        stadium_name = info['stadium']
        coord_data = mlb_coords.get(stadium_name, (0, 0, '', 0))

        stadium = Stadium(
            id=f"manual_mlb_{abbrev.lower()}",
            name=stadium_name,
            city=info['city'],
            state=coord_data[2] if len(coord_data) > 2 else '',
            latitude=coord_data[0],
            longitude=coord_data[1],
            capacity=coord_data[3] if len(coord_data) > 3 else 0,
            sport='MLB',
            team_abbrevs=[abbrev],
            source='manual'
        )
        stadiums.append(stadium)

    # NHL arenas: (latitude, longitude, state/province, capacity)
    nhl_coords = {
        'Honda Center': (33.8078, -117.8765, 'CA', 17174),
        'Delta Center': (40.7683, -111.9011, 'UT', 18306),
        'TD Garden': (42.3662, -71.0621, 'MA', 17565),
        'KeyBank Center': (42.8750, -78.8764, 'NY', 19070),
        'Scotiabank Saddledome': (51.0374, -114.0519, 'AB', 19289),
        'PNC Arena': (35.8034, -78.7220, 'NC', 18680),
        'United Center': (41.8807, -87.6742, 'IL', 19717),
        'Ball Arena': (39.7487, -105.0077, 'CO', 18007),
        'Nationwide Arena': (39.9693, -83.0061, 'OH', 18500),
        'American Airlines Center': (32.7905, -96.8103, 'TX', 18532),
        'Little Caesars Arena': (42.3411, -83.0553, 'MI', 19515),
        'Rogers Place': (53.5469, -113.4978, 'AB', 18347),
        'Amerant Bank Arena': (26.1584, -80.3256, 'FL', 19250),
        'Crypto.com Arena': (34.0430, -118.2673, 'CA', 18230),
        'Xcel Energy Center': (44.9448, -93.1010, 'MN', 17954),
        'Bell Centre': (45.4961, -73.5693, 'QC', 21302),
        'Bridgestone Arena': (36.1592, -86.7785, 'TN', 17159),
        'Prudential Center': (40.7334, -74.1712, 'NJ', 16514),
        'UBS Arena': (40.7161, -73.7246, 'NY', 17255),
        'Madison Square Garden': (40.7505, -73.9934, 'NY', 18006),
        'Canadian Tire Centre': (45.2969, -75.9272, 'ON', 18652),
        'Wells Fargo Center': (39.9012, -75.1720, 'PA', 19543),
        'PPG Paints Arena': (40.4395, -79.9892, 'PA', 18387),
        'SAP Center': (37.3327, -121.9010, 'CA', 17562),
        'Climate Pledge Arena': (47.6221, -122.3540, 'WA', 17100),
        'Enterprise Center': (38.6268, -90.2025, 'MO', 18096),
        'Amalie Arena': (27.9426, -82.4519, 'FL', 19092),
        'Scotiabank Arena': (43.6435, -79.3791, 'ON', 18819),
        'Rogers Arena': (49.2778, -123.1089, 'BC', 18910),
        'T-Mobile Arena': (36.1028, -115.1784, 'NV', 17500),
        'Capital One Arena': (38.8982, -77.0209, 'DC', 18573),
        'Canada Life Centre': (49.8928, -97.1436, 'MB', 15321),
    }

    for abbrev, info in NHL_TEAMS.items():
        arena_name = info['arena']
        coord_data = nhl_coords.get(arena_name, (0, 0, '', 0))

        stadium = Stadium(
            id=f"manual_nhl_{abbrev.lower()}",
            name=arena_name,
            city=info['city'],
            state=coord_data[2] if len(coord_data) > 2 else '',
            latitude=coord_data[0],
            longitude=coord_data[1],
            capacity=coord_data[3] if len(coord_data) > 3 else 0,
            sport='NHL',
            team_abbrevs=[abbrev],
            source='manual'
        )
        stadiums.append(stadium)

    return stadiums


# =============================================================================
# HELPERS
# =============================================================================

def assign_stable_ids(games: list[Game], sport: str, season: str) -> list[Game]:
    """
    Assign stable IDs based on matchup + occurrence number within the season.
    Format: {sport}_{season}_{away}_{home}_{num}

    This keeps IDs unchanged when games are rescheduled.
    """
    from collections import defaultdict

    # Group games by matchup (away @ home)
    matchups = defaultdict(list)
    for game in games:
        key = f"{game.away_team_abbrev}_{game.home_team_abbrev}"
        matchups[key].append(game)

    # Sort each matchup by date and assign occurrence numbers
    for matchup_games in matchups.values():
        matchup_games.sort(key=lambda g: g.date)
        for i, game in enumerate(matchup_games, 1):
            away = game.away_team_abbrev.lower()
            home = game.home_team_abbrev.lower()
            # Compact the season string for the ID (e.g., "2024-25" -> "202425")
            season_str = season.replace('-', '')
            game.id = f"{sport.lower()}_{season_str}_{away}_{home}_{i}"

    return games


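# Illustration of the resulting ID shape, using hypothetical values: the third
# meeting of a matchup in the 2024-25 season gets occurrence number 3. Not
# called by the pipeline; it just mirrors the composition step above.
def _example_stable_id(sport: str = 'nba', season: str = '2024-25',
                       away: str = 'BOS', home: str = 'NYK',
                       occurrence: int = 3) -> str:
    """Compose a stable ID the same way assign_stable_ids does."""
    return f"{sport.lower()}_{season.replace('-', '')}_{away.lower()}_{home.lower()}_{occurrence}"

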
def get_team_abbrev(team_name: str, sport: str) -> str:
    """Get a team abbreviation from its full name."""
    teams = {'NBA': NBA_TEAMS, 'MLB': MLB_TEAMS, 'NHL': NHL_TEAMS}.get(sport, {})

    # Prefer an exact match over a substring match so that, e.g., a partial
    # name cannot shadow a later exact entry
    for abbrev, info in teams.items():
        if info['name'].lower() == team_name.lower():
            return abbrev
    for abbrev, info in teams.items():
        if team_name.lower() in info['name'].lower():
            return abbrev

    # Fall back to the first three letters
    return team_name[:3].upper()


def validate_games(games_by_source: dict) -> dict:
    """
    Cross-validate games from multiple sources and return discrepancies.

    Currently only games missing from a secondary source are flagged; the
    date/time/venue buckets are reserved for future checks.
    """
    discrepancies = {
        'missing_in_source': [],
        'date_mismatch': [],
        'time_mismatch': [],
        'venue_mismatch': [],
    }

    sources = list(games_by_source.keys())
    if len(sources) < 2:
        return discrepancies

    primary = sources[0]
    primary_games = {g.id: g for g in games_by_source[primary]}

    for source in sources[1:]:
        secondary_games = {g.id: g for g in games_by_source[source]}

        for game_id in primary_games:
            if game_id not in secondary_games:
                discrepancies['missing_in_source'].append({
                    'game_id': game_id,
                    'present_in': primary,
                    'missing_in': source
                })

    return discrepancies


def export_to_json(games: list[Game], stadiums: list[Stadium], output_dir: Path):
    """Export scraped data to JSON (and CSV) files."""
    output_dir.mkdir(parents=True, exist_ok=True)

    # Export games
    games_data = [asdict(g) for g in games]
    with open(output_dir / 'games.json', 'w') as f:
        json.dump(games_data, f, indent=2)

    # Export stadiums
    stadiums_data = [asdict(s) for s in stadiums]
    with open(output_dir / 'stadiums.json', 'w') as f:
        json.dump(stadiums_data, f, indent=2)

    # Also export CSVs for easy viewing
    if games:
        pd.DataFrame(games_data).to_csv(output_dir / 'games.csv', index=False)
    if stadiums:
        pd.DataFrame(stadiums_data).to_csv(output_dir / 'stadiums.csv', index=False)

    print(f"\nExported to {output_dir}")


# =============================================================================
# MAIN
# =============================================================================

def main():
    parser = argparse.ArgumentParser(description='Scrape sports schedules')
    parser.add_argument('--sport', choices=['nba', 'mlb', 'nhl', 'all'], default='all')
    parser.add_argument('--season', type=int, default=2025, help='Season year (ending year)')
    parser.add_argument('--stadiums-only', action='store_true', help='Only scrape stadium data')
    parser.add_argument('--output', type=str, default='./data', help='Output directory')

    args = parser.parse_args()
    output_dir = Path(args.output)

    all_games = []
    all_stadiums = []

    # Scrape stadiums
    print("\n" + "=" * 60)
    print("SCRAPING STADIUMS")
    print("=" * 60)

    all_stadiums.extend(scrape_stadiums_hifld())
    all_stadiums.extend(generate_stadiums_from_teams())

    if args.stadiums_only:
        export_to_json([], all_stadiums, output_dir)
        return

    # Scrape schedules
    if args.sport in ['nba', 'all']:
        print("\n" + "=" * 60)
        print(f"SCRAPING NBA {args.season}")
        print("=" * 60)

        nba_games_br = scrape_nba_basketball_reference(args.season)
        nba_season = f"{args.season-1}-{str(args.season)[2:]}"  # e.g., "2024-25"
        nba_games_br = assign_stable_ids(nba_games_br, 'NBA', nba_season)
        all_games.extend(nba_games_br)

    if args.sport in ['mlb', 'all']:
        print("\n" + "=" * 60)
        print(f"SCRAPING MLB {args.season}")
        print("=" * 60)

        mlb_games_api = scrape_mlb_statsapi(args.season)
        # The MLB API's official gamePk is already stable - no reassignment needed
        all_games.extend(mlb_games_api)

    if args.sport in ['nhl', 'all']:
        print("\n" + "=" * 60)
        print(f"SCRAPING NHL {args.season}")
        print("=" * 60)

        nhl_games_hr = scrape_nhl_hockey_reference(args.season)
        nhl_season = f"{args.season-1}-{str(args.season)[2:]}"  # e.g., "2024-25"
        nhl_games_hr = assign_stable_ids(nhl_games_hr, 'NHL', nhl_season)
        all_games.extend(nhl_games_hr)

    # Export
    print("\n" + "=" * 60)
    print("EXPORTING DATA")
    print("=" * 60)

    export_to_json(all_games, all_stadiums, output_dir)

    # Summary
    print("\n" + "=" * 60)
    print("SUMMARY")
    print("=" * 60)
    print(f"Total games scraped: {len(all_games)}")
    print(f"Total stadiums: {len(all_stadiums)}")

    # Games by sport
    by_sport = {}
    for g in all_games:
        by_sport[g.sport] = by_sport.get(g.sport, 0) + 1
    for sport, count in by_sport.items():
        print(f"  {sport}: {count} games")


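# The cross-year season label used for the NBA and NHL above can be checked in
# isolation (illustrative helper; the ending year 2025 maps to "2024-25"):
def _season_label(season: int) -> str:
    """Format an ending season year as the conventional cross-year label."""
    return f"{season - 1}-{str(season)[2:]}"

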
if __name__ == '__main__':
    main()
61
Scripts/test_cloudkit.py
Normal file
@@ -0,0 +1,61 @@
#!/usr/bin/env python3
"""Quick test to query CloudKit records."""

import json, hashlib, base64, requests, os, sys
from datetime import datetime, timezone

try:
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import ec
    from cryptography.hazmat.backends import default_backend
except ImportError:
    sys.exit("Error: pip install cryptography")

CONTAINER = "iCloud.com.sportstime.app"
HOST = "https://api.apple-cloudkit.com"


def sign(key_data, date, body, path):
    key = serialization.load_pem_private_key(key_data, None, default_backend())
    body_hash = base64.b64encode(hashlib.sha256(body.encode()).digest()).decode()
    sig = key.sign(f"{date}:{body_hash}:{path}".encode(), ec.ECDSA(hashes.SHA256()))
    return base64.b64encode(sig).decode()


def query(key_id, key_data, record_type, env='development'):
    path = f"/database/1/{CONTAINER}/{env}/public/records/query"
    body = json.dumps({
        'query': {'recordType': record_type},
        'resultsLimit': 10
    })
    date = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
    headers = {
        'Content-Type': 'application/json',
        'X-Apple-CloudKit-Request-KeyID': key_id,
        'X-Apple-CloudKit-Request-ISO8601Date': date,
        'X-Apple-CloudKit-Request-SignatureV1': sign(key_data, date, body, path),
    }
    r = requests.post(f"{HOST}{path}", headers=headers, data=body, timeout=30)
    return r.status_code, r.json()


if __name__ == '__main__':
    key_id = os.environ.get('CLOUDKIT_KEY_ID') or (sys.argv[1] if len(sys.argv) > 1 else None)
    key_file = os.environ.get('CLOUDKIT_KEY_FILE') or (sys.argv[2] if len(sys.argv) > 2 else 'eckey.pem')

    if not key_id:
        sys.exit("Usage: python test_cloudkit.py KEY_ID [KEY_FILE]")

    with open(key_file, 'rb') as f:
        key_data = f.read()

    print("Testing CloudKit connection...\n")

    for record_type in ['Stadium', 'Team', 'Game']:
        status, result = query(key_id, key_data, record_type)
        count = len(result.get('records', []))
        print(f"{record_type}: status={status}, records={count}")
        if count > 0:
            print(f"  Sample: {result['records'][0].get('recordName', 'N/A')}")
        if 'serverErrorCode' in result:
            print(f"  Error: {result.get('serverErrorCode')}: {result.get('reason')}")

    print("\nFull response for Stadium query:")
    status, result = query(key_id, key_data, 'Stadium')
    print(json.dumps(result, indent=2)[:1000])
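The `sign()` helper in `test_cloudkit.py` signs the string `date:bodyHash:subpath` with the container's server-to-server ECDSA key. The body-hash component can be checked with the standard library alone; this is a minimal sketch (the ECDSA step is omitted, and `body_hash` is a hypothetical name):

```python
import base64
import hashlib

# The signed message is "<ISO-8601 date>:<base64(SHA-256(body))>:<request subpath>".
def body_hash(body: str) -> str:
    return base64.b64encode(hashlib.sha256(body.encode()).digest()).decode()

date = "2025-01-01T00:00:00Z"  # value of X-Apple-CloudKit-Request-ISO8601Date
subpath = "/database/1/iCloud.com.sportstime.app/development/public/records/query"
empty_body = ""
message = f"{date}:{body_hash(empty_body)}:{subpath}"

# Well-known base64 SHA-256 of the empty string:
print(body_hash(""))  # -> 47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=
```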
590
Scripts/validate_data.py
Normal file
@@ -0,0 +1,590 @@
#!/usr/bin/env python3
"""
Cross-Validation System for SportsTime App
Compares scraped data from multiple sources and flags discrepancies.

Usage:
    python validate_data.py --data-dir ./data
    python validate_data.py --scrape-and-validate --season 2025
"""

import argparse
import json
from datetime import datetime
from pathlib import Path
from dataclasses import dataclass, asdict, field
from typing import Optional
from collections import defaultdict

# Import scrapers from main script
from scrape_schedules import (
    Game, Stadium,
    scrape_nba_basketball_reference,
    scrape_mlb_statsapi, scrape_mlb_baseball_reference,
    scrape_nhl_hockey_reference,
    NBA_TEAMS, MLB_TEAMS, NHL_TEAMS,
    assign_stable_ids,
)


# =============================================================================
# VALIDATION DATA CLASSES
# =============================================================================

@dataclass
class Discrepancy:
    """Represents a discrepancy between sources."""
    game_key: str
    field: str  # 'date', 'time', 'venue', 'teams', 'missing'
    source1: str
    source2: str
    value1: str
    value2: str
    severity: str  # 'high', 'medium', 'low'


@dataclass
class ValidationReport:
    """Summary of validation results."""
    sport: str
    season: str
    sources: list
    total_games_source1: int = 0
    total_games_source2: int = 0
    games_matched: int = 0
    games_missing_source1: int = 0
    games_missing_source2: int = 0
    discrepancies: list = field(default_factory=list)

    def to_dict(self):
        return {
            'sport': self.sport,
            'season': self.season,
            'sources': self.sources,
            'total_games_source1': self.total_games_source1,
            'total_games_source2': self.total_games_source2,
            'games_matched': self.games_matched,
            'games_missing_source1': self.games_missing_source1,
            'games_missing_source2': self.games_missing_source2,
            'discrepancies': [asdict(d) for d in self.discrepancies],
            'discrepancy_summary': self.get_summary()
        }

    def get_summary(self):
        by_field = defaultdict(int)
        by_severity = defaultdict(int)
        for d in self.discrepancies:
            by_field[d.field] += 1
            by_severity[d.severity] += 1
        return {
            'by_field': dict(by_field),
            'by_severity': dict(by_severity)
        }
# =============================================================================
# GAME KEY GENERATION
# =============================================================================

def normalize_abbrev(abbrev: str, sport: str) -> str:
    """Normalize team abbreviations across different sources."""
    abbrev = abbrev.upper().strip()

    if sport == 'MLB':
        # MLB abbreviation mappings between sources
        mlb_mappings = {
            'AZ': 'ARI', 'ARI': 'ARI',    # Arizona
            'ATH': 'OAK', 'OAK': 'OAK',   # Oakland/Athletics
            'CWS': 'CHW', 'CHW': 'CHW',   # Chicago White Sox
            'KC': 'KCR', 'KCR': 'KCR',    # Kansas City
            'SD': 'SDP', 'SDP': 'SDP',    # San Diego
            'SF': 'SFG', 'SFG': 'SFG',    # San Francisco
            'TB': 'TBR', 'TBR': 'TBR',    # Tampa Bay
            'WSH': 'WSN', 'WSN': 'WSN',   # Washington
        }
        return mlb_mappings.get(abbrev, abbrev)

    elif sport == 'NBA':
        nba_mappings = {
            'PHX': 'PHO', 'PHO': 'PHO',   # Phoenix
            'BKN': 'BRK', 'BRK': 'BRK',   # Brooklyn
            'CHA': 'CHO', 'CHO': 'CHO',   # Charlotte
            'NOP': 'NOP', 'NO': 'NOP',    # New Orleans
        }
        return nba_mappings.get(abbrev, abbrev)

    elif sport == 'NHL':
        nhl_mappings = {
            'ARI': 'UTA', 'UTA': 'UTA',   # Arizona moved to Utah
            'VGS': 'VGK', 'VGK': 'VGK',   # Vegas
        }
        return nhl_mappings.get(abbrev, abbrev)

    return abbrev


def generate_game_key(game: Game) -> str:
    """
    Generate a unique key for matching games across sources.
    Uses date + normalized team abbreviations (sorted) to match.
    """
    home = normalize_abbrev(game.home_team_abbrev, game.sport)
    away = normalize_abbrev(game.away_team_abbrev, game.sport)
    teams = sorted([home, away])
    return f"{game.date}_{teams[0]}_{teams[1]}"


def normalize_team_name(name: str, sport: str) -> str:
    """Normalize team name variations."""
    teams = {'NBA': NBA_TEAMS, 'MLB': MLB_TEAMS, 'NHL': NHL_TEAMS}.get(sport, {})

    name_lower = name.lower().strip()

    # Check against known team names
    for abbrev, info in teams.items():
        if name_lower == info['name'].lower():
            return abbrev
        # Check city match
        if name_lower == info['city'].lower():
            return abbrev
        # Check partial match
        if name_lower in info['name'].lower() or info['name'].lower() in name_lower:
            return abbrev

    return name[:3].upper()


def normalize_venue(venue: str) -> str:
    """Normalize venue name for comparison."""
    # Remove common variations
    normalized = venue.lower().strip()

    # Remove sponsorship prefixes that change
    replacements = [
        ('at ', ''),
        ('the ', ''),
        (' stadium', ''),
        (' arena', ''),
        (' center', ''),
        (' field', ''),
        (' park', ''),
        ('.com', ''),
        ('crypto', 'crypto.com'),
    ]

    for old, new in replacements:
        normalized = normalized.replace(old, new)

    return normalized.strip()


def normalize_time(time_str: Optional[str]) -> Optional[str]:
    """Normalize time format to HH:MM."""
    if not time_str:
        return None

    time_str = time_str.strip().lower()

    # Handle 12-hour formats; spaces are stripped before parsing
    if 'pm' in time_str or 'am' in time_str:
        for fmt in ['%I:%M%p', '%I%p']:
            try:
                dt = datetime.strptime(time_str.replace(' ', ''), fmt)
                return dt.strftime('%H:%M')
            except ValueError:
                continue

    # Already 24-hour or just numbers
    if ':' in time_str:
        parts = time_str.split(':')
        if len(parts) >= 2:
            try:
                hour = int(parts[0])
                minute = int(parts[1][:2])
                return f"{hour:02d}:{minute:02d}"
            except ValueError:
                pass

    return time_str
# =============================================================================
# CROSS-VALIDATION LOGIC
# =============================================================================

def validate_games(
    games1: list[Game],
    games2: list[Game],
    source1_name: str,
    source2_name: str,
    sport: str,
    season: str
) -> ValidationReport:
    """
    Compare two lists of games and find discrepancies.
    """
    report = ValidationReport(
        sport=sport,
        season=season,
        sources=[source1_name, source2_name],
        total_games_source1=len(games1),
        total_games_source2=len(games2)
    )

    # Index games by key
    games1_by_key = {}
    for g in games1:
        games1_by_key[generate_game_key(g)] = g

    games2_by_key = {}
    for g in games2:
        games2_by_key[generate_game_key(g)] = g

    # Find matches and discrepancies
    all_keys = set(games1_by_key.keys()) | set(games2_by_key.keys())

    for key in all_keys:
        g1 = games1_by_key.get(key)
        g2 = games2_by_key.get(key)

        if g1 and g2:
            # Both sources have this game - compare fields
            report.games_matched += 1

            # Compare dates (should match by key, but double-check)
            if g1.date != g2.date:
                report.discrepancies.append(Discrepancy(
                    game_key=key,
                    field='date',
                    source1=source1_name,
                    source2=source2_name,
                    value1=g1.date,
                    value2=g2.date,
                    severity='high'
                ))

            # Compare times
            time1 = normalize_time(g1.time)
            time2 = normalize_time(g2.time)
            if time1 and time2 and time1 != time2:
                # Check if times are close (within 1 hour - could be timezone)
                try:
                    t1 = datetime.strptime(time1, '%H:%M')
                    t2 = datetime.strptime(time2, '%H:%M')
                    diff_minutes = abs((t1 - t2).total_seconds() / 60)
                    severity = 'low' if diff_minutes <= 60 else 'medium'
                except ValueError:
                    severity = 'medium'

                report.discrepancies.append(Discrepancy(
                    game_key=key,
                    field='time',
                    source1=source1_name,
                    source2=source2_name,
                    value1=time1 or '',
                    value2=time2 or '',
                    severity=severity
                ))

            # Compare venues
            venue1 = normalize_venue(g1.venue) if g1.venue else ''
            venue2 = normalize_venue(g2.venue) if g2.venue else ''
            if venue1 and venue2 and venue1 != venue2:
                # Check for partial match
                if venue1 not in venue2 and venue2 not in venue1:
                    report.discrepancies.append(Discrepancy(
                        game_key=key,
                        field='venue',
                        source1=source1_name,
                        source2=source2_name,
                        value1=g1.venue,
                        value2=g2.venue,
                        severity='low'
                    ))

        elif g1 and not g2:
            # Game only in source 1
            report.games_missing_source2 += 1

            # Determine severity based on date
            # Spring training (March before ~25th) and playoffs (Oct+) are expected differences
            severity = 'high'
            try:
                game_date = datetime.strptime(g1.date, '%Y-%m-%d')
                month = game_date.month
                day = game_date.day
                if month == 3 and day < 26:  # Spring training
                    severity = 'medium'
                elif month >= 10:  # Playoffs/postseason
                    severity = 'medium'
            except ValueError:
                pass

            report.discrepancies.append(Discrepancy(
                game_key=key,
                field='missing',
                source1=source1_name,
                source2=source2_name,
                value1=f"{g1.away_team} @ {g1.home_team}",
                value2='NOT FOUND',
                severity=severity
            ))

        else:
            # Game only in source 2
            report.games_missing_source1 += 1

            # Determine severity based on date
            severity = 'high'
            try:
                game_date = datetime.strptime(g2.date, '%Y-%m-%d')
                month = game_date.month
                day = game_date.day
                if month == 3 and day < 26:  # Spring training
                    severity = 'medium'
                elif month >= 10:  # Playoffs/postseason
                    severity = 'medium'
            except ValueError:
                pass

            report.discrepancies.append(Discrepancy(
                game_key=key,
                field='missing',
                source1=source1_name,
                source2=source2_name,
                value1='NOT FOUND',
                value2=f"{g2.away_team} @ {g2.home_team}",
                severity=severity
            ))

    return report


def validate_stadiums(stadiums: list[Stadium]) -> list[dict]:
    """
    Validate stadium data for completeness and accuracy.
    """
    issues = []

    for s in stadiums:
        # Check for missing coordinates
        if s.latitude == 0 or s.longitude == 0:
            issues.append({
                'stadium': s.name,
                'sport': s.sport,
                'issue': 'Missing coordinates',
                'severity': 'high'
            })

        # Check for missing capacity
        if s.capacity == 0:
            issues.append({
                'stadium': s.name,
                'sport': s.sport,
                'issue': 'Missing capacity',
                'severity': 'low'
            })

        # Check coordinate bounds (roughly North America)
        if s.latitude != 0 and not (24 < s.latitude < 55):
            issues.append({
                'stadium': s.name,
                'sport': s.sport,
                'issue': f'Latitude {s.latitude} outside expected range',
                'severity': 'medium'
            })

        if s.longitude != 0 and not (-130 < s.longitude < -60):
            issues.append({
                'stadium': s.name,
                'sport': s.sport,
                'issue': f'Longitude {s.longitude} outside expected range',
                'severity': 'medium'
            })

    return issues
# =============================================================================
|
||||
# MULTI-SOURCE SCRAPING
|
||||
# =============================================================================
|
||||
|
||||
def scrape_nba_all_sources(season: int) -> dict:
|
||||
"""Scrape NBA from all available sources."""
|
||||
nba_season = f"{season-1}-{str(season)[2:]}"
|
||||
games = scrape_nba_basketball_reference(season)
|
||||
games = assign_stable_ids(games, 'NBA', nba_season)
|
||||
return {
|
||||
'basketball-reference': games,
|
||||
# ESPN requires JS rendering, skip for now
|
||||
}
|
||||
|
||||
|
||||
def scrape_mlb_all_sources(season: int) -> dict:
|
||||
"""Scrape MLB from all available sources."""
|
||||
mlb_season = str(season)
|
||||
|
||||
# MLB API uses official gamePk - already stable
|
||||
api_games = scrape_mlb_statsapi(season)
|
||||
|
||||
# Baseball-Reference needs stable IDs
|
||||
br_games = scrape_mlb_baseball_reference(season)
|
||||
br_games = assign_stable_ids(br_games, 'MLB', mlb_season)
|
||||
|
||||
return {
|
||||
'statsapi.mlb.com': api_games,
|
||||
'baseball-reference': br_games,
|
||||
}
|
||||
|
||||
|
||||
def scrape_nhl_all_sources(season: int) -> dict:
|
||||
"""Scrape NHL from all available sources."""
|
||||
nhl_season = f"{season-1}-{str(season)[2:]}"
|
||||
games = scrape_nhl_hockey_reference(season)
|
||||
games = assign_stable_ids(games, 'NHL', nhl_season)
|
||||
return {
|
||||
'hockey-reference': games,
|
||||
# NHL API requires date iteration, skip for now
|
||||
}
|
||||
|
||||
|
||||
# =============================================================================
# MAIN
# =============================================================================

def main():
    parser = argparse.ArgumentParser(description='Validate sports data')
    parser.add_argument('--data-dir', type=str, default='./data', help='Data directory')
    parser.add_argument('--scrape-and-validate', action='store_true', help='Scrape fresh and validate')
    parser.add_argument('--season', type=int, default=2025, help='Season year')
    parser.add_argument('--sport', choices=['nba', 'mlb', 'nhl', 'all'], default='all')
    parser.add_argument('--output', type=str, default='./data/validation_report.json')

    args = parser.parse_args()

    reports = []
    stadium_issues = []

    if args.scrape_and_validate:
        print("\n" + "="*60)
        print("CROSS-VALIDATION MODE")
        print("="*60)

        # MLB has two good sources - validate
        if args.sport in ['mlb', 'all']:
            print(f"\n--- MLB {args.season} ---")
            mlb_sources = scrape_mlb_all_sources(args.season)

            source_names = list(mlb_sources.keys())
            if len(source_names) >= 2:
                games1 = mlb_sources[source_names[0]]
                games2 = mlb_sources[source_names[1]]

                if games1 and games2:
                    report = validate_games(
                        games1, games2,
                        source_names[0], source_names[1],
                        'MLB', str(args.season)
                    )
                    reports.append(report)
                    print(f"  Compared {report.total_games_source1} vs {report.total_games_source2} games")
                    print(f"  Matched: {report.games_matched}")
                    print(f"  Discrepancies: {len(report.discrepancies)}")

        # NBA (single source for now, but validate data quality)
        if args.sport in ['nba', 'all']:
            print(f"\n--- NBA {args.season} ---")
            nba_sources = scrape_nba_all_sources(args.season)
            games = nba_sources.get('basketball-reference', [])
            print(f"  Got {len(games)} games from Basketball-Reference")

            # Validate internal consistency
            teams_seen = defaultdict(int)
            for g in games:
                teams_seen[g.home_team_abbrev] += 1
                teams_seen[g.away_team_abbrev] += 1

            # Each team should have ~82 games
            for team, count in teams_seen.items():
                if count < 70 or count > 95:
                    print(f"  Warning: {team} has {count} games (expected ~82)")

    else:
        # Load existing data and validate
        data_dir = Path(args.data_dir)

        # Load games
        games_file = data_dir / 'games.json'
        if games_file.exists():
            with open(games_file) as f:
                games_data = json.load(f)
            print(f"\nLoaded {len(games_data)} games from {games_file}")

            # Group by sport and validate counts
            by_sport = defaultdict(list)
            for g in games_data:
                by_sport[g['sport']].append(g)

            for sport, sport_games in by_sport.items():
                print(f"  {sport}: {len(sport_games)} games")

        # Load and validate stadiums
        stadiums_file = data_dir / 'stadiums.json'
        if stadiums_file.exists():
            with open(stadiums_file) as f:
                stadiums_data = json.load(f)
            stadiums = [Stadium(**s) for s in stadiums_data]
            print(f"\nLoaded {len(stadiums)} stadiums from {stadiums_file}")

            stadium_issues = validate_stadiums(stadiums)
            if stadium_issues:
                print(f"\nStadium validation issues ({len(stadium_issues)}):")
                for issue in stadium_issues[:10]:
                    print(f"  [{issue['severity'].upper()}] {issue['stadium']}: {issue['issue']}")
                if len(stadium_issues) > 10:
                    print(f"  ... and {len(stadium_issues) - 10} more")

    # Save validation report
    output_path = Path(args.output)
    output_path.parent.mkdir(parents=True, exist_ok=True)

    full_report = {
        'generated_at': datetime.now().isoformat(),
        'season': args.season,
        'game_validations': [r.to_dict() for r in reports],
        'stadium_issues': stadium_issues
    }

    with open(output_path, 'w') as f:
        json.dump(full_report, f, indent=2)

    print(f"\nValidation report saved to {output_path}")

    # Summary
    print("\n" + "="*60)
    print("VALIDATION SUMMARY")
    print("="*60)

    total_discrepancies = sum(len(r.discrepancies) for r in reports)
    high_severity = sum(
        1 for r in reports
        for d in r.discrepancies
        if d.severity == 'high'
    )

    print(f"Total game validation reports: {len(reports)}")
    print(f"Total discrepancies found: {total_discrepancies}")
    print(f"High severity issues: {high_severity}")
    print(f"Stadium data issues: {len(stadium_issues)}")


if __name__ == '__main__':
    main()
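The cross-source matching in `validate_data.py` rests on `normalize_abbrev` plus `generate_game_key`: normalize each team abbreviation, sort the pair, and prefix the date, so the same matchup reported with different abbreviations collapses to one key. A self-contained sketch of that idea (the inline mapping is a trimmed illustration, not the script's full table):

```python
# Illustrative subset of the NBA abbreviation mapping used by the validator.
nba_mappings = {'PHX': 'PHO', 'BKN': 'BRK'}

def game_key(date: str, home: str, away: str) -> str:
    """Date + sorted normalized abbreviations, mirroring generate_game_key."""
    home = nba_mappings.get(home.upper(), home.upper())
    away = nba_mappings.get(away.upper(), away.upper())
    return f"{date}_" + "_".join(sorted([home, away]))

# Two sources reporting the same game with different abbreviations agree:
print(game_key("2025-01-15", "PHX", "BOS"))  # -> 2025-01-15_BOS_PHO
print(game_key("2025-01-15", "BOS", "PHO"))  # -> 2025-01-15_BOS_PHO
```

Sorting the pair also makes the key insensitive to which source lists the team as home, which is why `validate_games` double-checks dates and venues after a key match.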