Initial commit: SportsTime trip planning app

- Three-scenario planning engine (A: date range, B: selected games, C: directional routes)
- GeographicRouteExplorer with anchor game support for route exploration
- Shared ItineraryBuilder for travel segment calculation
- TravelEstimator for driving time/distance estimation
- SwiftUI views for trip creation and detail display
- CloudKit integration for schedule data
- Python scraping scripts for sports schedules

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Author: Trey t
Date: 2026-01-07 00:46:40 -06:00
commit 9088b46563
84 changed files with 180371 additions and 0 deletions

Scripts/CLOUDKIT_SETUP.md Normal file

@@ -0,0 +1,145 @@
# CloudKit Setup Guide for SportsTime
## 1. Configure Container in Apple Developer Portal
1. Go to [Apple Developer Portal](https://developer.apple.com/account)
2. Navigate to **Certificates, Identifiers & Profiles** > **Identifiers**
3. Select your App ID or create one for `com.sportstime.app`
4. Enable **iCloud** capability
5. Click **Configure** and create container: `iCloud.com.sportstime.app`
## 2. Configure in Xcode
1. Open `SportsTime.xcodeproj` in Xcode
2. Select the SportsTime target
3. Go to **Signing & Capabilities**
4. Ensure **iCloud** is added (should already be there)
5. Check **CloudKit** is selected
6. Select container `iCloud.com.sportstime.app`
## 3. Create Record Types in CloudKit Dashboard
Go to [CloudKit Dashboard](https://icloud.developer.apple.com/dashboard)
### Record Type: `Stadium`
| Field | Type | Notes |
|-------|------|-------|
| `stadiumId` | String | Unique identifier |
| `name` | String | Stadium name |
| `city` | String | City |
| `state` | String | State/Province |
| `location` | Location | CLLocation (lat/lng) |
| `capacity` | Int(64) | Seating capacity |
| `sport` | String | NBA, MLB, NHL |
| `teamAbbrevs` | String (List) | Team abbreviations |
| `source` | String | Data source |
| `yearOpened` | Int(64) | Optional |
**Indexes**:
- `sport` (Queryable, Sortable)
- `location` (Queryable) - for radius searches
- `teamAbbrevs` (Queryable)
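For reference, a `Stadium` record in the JSON form the CloudKit Web Services API expects (the same shape `Scripts/cloudkit_import.py` uploads) looks roughly like this; the field names follow the table above, and the values are illustrative sample data:

```python
# Illustrative Stadium record in CloudKit Web Services JSON form.
# Field names match the table above; values are sample data only.
stadium_record = {
    "recordType": "Stadium",
    "recordName": "manual_nba_bos",
    "fields": {
        "stadiumId": {"value": "manual_nba_bos"},
        "name": {"value": "TD Garden"},
        "city": {"value": "Boston"},
        "state": {"value": "MA"},
        # Location fields upload as a latitude/longitude pair
        "location": {"value": {"latitude": 42.3662, "longitude": -71.0621}},
        "capacity": {"value": 19156},
        "sport": {"value": "NBA"},
        "teamAbbrevs": {"value": ["BOS"]},
        "source": {"value": "manual"},
    },
}
```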
### Record Type: `Team`
| Field | Type | Notes |
|-------|------|-------|
| `teamId` | String | Unique identifier |
| `name` | String | Full team name |
| `abbreviation` | String | 3-letter code |
| `sport` | String | NBA, MLB, NHL |
| `city` | String | City |
**Indexes**:
- `sport` (Queryable, Sortable)
- `abbreviation` (Queryable)
### Record Type: `Game`
| Field | Type | Notes |
|-------|------|-------|
| `gameId` | String | Unique identifier |
| `sport` | String | NBA, MLB, NHL |
| `season` | String | e.g., "2024-25" |
| `dateTime` | Date/Time | Game date and time |
| `homeTeamRef` | Reference | Reference to Team |
| `awayTeamRef` | Reference | Reference to Team |
| `venueRef` | Reference | Reference to Stadium |
| `isPlayoff` | Int(64) | 0 or 1 |
| `broadcastInfo` | String | TV channel |
| `source` | String | Data source |
**Indexes**:
- `sport` (Queryable, Sortable)
- `dateTime` (Queryable, Sortable)
- `homeTeamRef` (Queryable)
- `awayTeamRef` (Queryable)
- `season` (Queryable)
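The queryable/sortable indexes above are what make filtered Game queries possible through the Web Services `records/query` endpoint. A sketch of a query body using them (the sport value and timestamp are illustrative; CloudKit expects dates as milliseconds since the epoch):

```python
import json

# Query body for Game records, using the queryable fields above.
# Sample filter: NHL games on or after 2025-01-01 UTC, soonest first.
query_body = {
    "query": {
        "recordType": "Game",
        "filterBy": [
            {"fieldName": "sport", "comparator": "EQUALS",
             "fieldValue": {"value": "NHL"}},
            {"fieldName": "dateTime", "comparator": "GREATER_THAN_OR_EQUALS",
             "fieldValue": {"value": 1735689600000}},  # ms since epoch
        ],
        "sortBy": [{"fieldName": "dateTime", "ascending": True}],
    },
    "resultsLimit": 50,
}
body = json.dumps(query_body)  # POST this to .../public/records/query
```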
## 4. Import Data
After creating record types:
```bash
# 1. First scrape the data
cd Scripts
python3 scrape_schedules.py --sport all --season 2025 --output ./data
# 2. Run the import script (requires running from Xcode or with proper entitlements)
# The Swift script cannot run standalone - use the app or create a macOS command-line tool
```
### Alternative: Import via App
Add this to your app for first-run data import:
```swift
// In AppDelegate or App init
Task {
    let importer = CloudKitImporter()
    // Load JSON from bundle or downloaded file
    if let stadiumsURL = Bundle.main.url(forResource: "stadiums", withExtension: "json"),
       let gamesURL = Bundle.main.url(forResource: "games", withExtension: "json") {
        // Import stadiums first
        let stadiumsData = try Data(contentsOf: stadiumsURL)
        let stadiums = try JSONDecoder().decode([ScrapedStadium].self, from: stadiumsData)
        let count = try await importer.importStadiums(from: stadiums)
        print("Imported \(count) stadiums")
    }
}
```
## 5. Security Roles (CloudKit Dashboard)
For the **Public Database**:
| Role | Stadium | Team | Game |
|------|---------|------|------|
| World | Read | Read | Read |
| Authenticated | Read | Read | Read |
| Creator | Read/Write | Read/Write | Read/Write |
Users should only read from the public database. Write access is reserved for your admin imports.
## 6. Testing
1. Build and run the app on simulator or device
2. Check CloudKit Dashboard > **Data** to see imported records
3. Use **Logs** tab to debug any issues
## Troubleshooting
### "Container not found"
- Ensure container is created in Developer Portal
- Check entitlements file has correct container ID
- Clean build and re-run
### "Permission denied"
- Check Security Roles in CloudKit Dashboard
- Ensure app is signed with correct provisioning profile
### "Record type not found"
- Create record types in Development environment first
- Deploy schema to Production when ready

Scripts/DATA_SOURCES.md Normal file

@@ -0,0 +1,72 @@
# Sports Data Sources
## Schedule Data Sources (by league)
### NBA Schedule
| Source | URL Pattern | Data Available | Notes |
|--------|-------------|----------------|-------|
| Basketball-Reference | `https://www.basketball-reference.com/leagues/NBA_{YEAR}_games-{month}.html` | Date, Time, Teams, Arena, Attendance | Monthly pages (october, november, etc.) |
| ESPN | `https://www.espn.com/nba/schedule/_/date/{YYYYMMDD}` | Date, Time, Teams, TV | Daily schedule |
| NBA.com API | `https://cdn.nba.com/static/json/staticData/scheduleLeagueV2.json` | Full season JSON | Official source |
| FixtureDownload | `https://fixturedownload.com/download/nba-{year}-UTC.csv` | CSV download | Easy format |
### MLB Schedule
| Source | URL Pattern | Data Available | Notes |
|--------|-------------|----------------|-------|
| Baseball-Reference | `https://www.baseball-reference.com/leagues/majors/{YEAR}-schedule.shtml` | Date, Teams, Score, Attendance | Full season page |
| ESPN | `https://www.espn.com/mlb/schedule/_/date/{YYYYMMDD}` | Date, Time, Teams, TV | Daily schedule |
| MLB Stats API | `https://statsapi.mlb.com/api/v1/schedule?sportId=1&season={YEAR}` | Full season JSON | Official API |
| FixtureDownload | `https://fixturedownload.com/download/mlb-{year}-UTC.csv` | CSV download | Easy format |
### NHL Schedule
| Source | URL Pattern | Data Available | Notes |
|--------|-------------|----------------|-------|
| Hockey-Reference | `https://www.hockey-reference.com/leagues/NHL_{YEAR}_games.html` | Date, Teams, Score, Arena, Attendance | Full season page |
| ESPN | `https://www.espn.com/nhl/schedule/_/date/{YYYYMMDD}` | Date, Time, Teams, TV | Daily schedule |
| NHL API | `https://api-web.nhle.com/v1/schedule/{YYYY-MM-DD}` | Daily JSON | Official API |
| FixtureDownload | `https://fixturedownload.com/download/nhl-{year}-UTC.csv` | CSV download | Easy format |
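The official-API patterns above can be filled in programmatically. A small sketch (no network calls; the URL templates are copied straight from the tables):

```python
from datetime import date

# URL builders for the official APIs, filled in from the patterns above.
def mlb_schedule_url(year: int) -> str:
    """MLB Stats API: full season in one JSON response."""
    return f"https://statsapi.mlb.com/api/v1/schedule?sportId=1&season={year}"

def nhl_schedule_url(day: date) -> str:
    """NHL API: one day of games per request."""
    return f"https://api-web.nhle.com/v1/schedule/{day.isoformat()}"

def nba_schedule_url() -> str:
    """NBA publishes the full season as one static JSON file."""
    return "https://cdn.nba.com/static/json/staticData/scheduleLeagueV2.json"
```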
---
## Stadium/Arena Data Sources
| Source | URL/Method | Data Available | Notes |
|--------|------------|----------------|-------|
| Wikipedia | Team pages | Name, City, Capacity, Coordinates | Manual or scrape |
| HIFLD Open Data | `https://hifld-geoplatform.opendata.arcgis.com/datasets/major-sport-venues` | GeoJSON with coordinates | US Government data |
| ESPN Team Pages | `https://www.espn.com/{sport}/team/_/name/{abbrev}` | Arena name, location | Per-team |
| Sports-Reference | Team pages | Arena name, capacity | In schedule data |
| OpenStreetMap | Nominatim API | Coordinates from address | For geocoding |
---
## Data Validation Strategy
### Cross-Reference Points
1. **Game Count**: Total games per team should match (82 NBA, 162 MLB, 82 NHL)
2. **Home/Away Balance**: Each team should have equal home/away games
3. **Date Alignment**: Same game should appear on same date across sources
4. **Team Names**: Map abbreviations across sources (NYK vs NY vs Knicks)
5. **Venue Names**: Stadiums may have different names (sponsorship changes)
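Checks 1 and 2 are mechanical. A minimal sketch, assuming the scraped-game dicts carry the `home_team_abbrev`/`away_team_abbrev` keys used by `cloudkit_import.py`:

```python
from collections import Counter

# Expected regular-season game counts per team (from point 1 above).
EXPECTED_GAMES = {"NBA": 82, "MLB": 162, "NHL": 82}

def validate_game_counts(games, sport):
    """Return (team, count) pairs that deviate from the expected total."""
    counts = Counter()
    for g in games:
        counts[g["home_team_abbrev"]] += 1
        counts[g["away_team_abbrev"]] += 1
    expected = EXPECTED_GAMES[sport]
    return [(team, n) for team, n in counts.items() if n != expected]

def validate_home_away_balance(games):
    """Flag teams whose home and away game counts differ (point 2)."""
    home, away = Counter(), Counter()
    for g in games:
        home[g["home_team_abbrev"]] += 1
        away[g["away_team_abbrev"]] += 1
    return [t for t in set(home) | set(away) if home[t] != away[t]]
```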
### Discrepancy Handling
- If sources disagree on game time: prefer official API (NBA.com, MLB.com, NHL.com)
- If sources disagree on venue: prefer Sports-Reference (most accurate historically)
- Log all discrepancies for manual review
---
## Rate Limiting Guidelines
| Source | Limit | Recommended Delay |
|--------|-------|-------------------|
| Sports-Reference sites | 20 req/min | 3 seconds between requests |
| ESPN | Unknown | 1 second between requests |
| Official APIs | Varies | 0.5 seconds between requests |
| Wikipedia | Polite | 1 second between requests |
---
## Team Abbreviation Mappings
See `team_mappings.json` for canonical mappings between sources.
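The lookup itself is a flat alias-to-canonical map. A sketch with a few hypothetical entries (the canonical codes follow the stadium data, e.g. `BRK` for Brooklyn and `CHO` for Charlotte; `team_mappings.json` remains the source of truth):

```python
# Hypothetical slice of team_mappings.json: source-specific aliases
# mapped to the canonical abbreviation used across the pipeline.
TEAM_ALIASES = {
    "NY": "NYK",
    "Knicks": "NYK",
    "New York Knicks": "NYK",
    "BKN": "BRK",
    "CHA": "CHO",
}

def canonical_abbrev(name: str) -> str:
    """Map a source-specific team name or abbreviation to the canonical code."""
    return TEAM_ALIASES.get(name, name)
```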

Scripts/cloudkit_import.py Executable file

@@ -0,0 +1,306 @@
#!/usr/bin/env python3
"""
CloudKit Import Script
======================
Imports JSON data into CloudKit. Run separately from the scraping pipeline.
Setup:
1. CloudKit Dashboard > Tokens & Keys > Server-to-Server Keys
2. Create key with Read/Write access to public database
3. Download .p8 file and note Key ID
Usage:
python cloudkit_import.py --dry-run # Preview first
python cloudkit_import.py --key-id XX --key-file key.p8 # Import all
python cloudkit_import.py --stadiums-only ... # Stadiums first
python cloudkit_import.py --games-only ... # Games after
python cloudkit_import.py --delete-all ... # Delete then import
python cloudkit_import.py --delete-only ... # Delete only (no import)
"""
import argparse, json, time, os, sys, hashlib, base64, requests
from datetime import datetime, timezone
from pathlib import Path
try:
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import ec
    from cryptography.hazmat.backends import default_backend
    HAS_CRYPTO = True
except ImportError:
    HAS_CRYPTO = False
CONTAINER = "iCloud.com.sportstime.app"
HOST = "https://api.apple-cloudkit.com"
BATCH_SIZE = 200
class CloudKit:
    def __init__(self, key_id, private_key, container, env):
        self.key_id = key_id
        self.private_key = private_key
        self.path_base = f"/database/1/{container}/{env}/public"

    def _sign(self, date, body, path):
        # CloudKit server-to-server auth: ECDSA P-256 signature over
        # "<ISO date>:<base64 SHA-256 of body>:<request subpath>"
        key = serialization.load_pem_private_key(self.private_key, None, default_backend())
        body_hash = base64.b64encode(hashlib.sha256(body.encode()).digest()).decode()
        sig = key.sign(f"{date}:{body_hash}:{path}".encode(), ec.ECDSA(hashes.SHA256()))
        return base64.b64encode(sig).decode()

    def modify(self, operations):
        path = f"{self.path_base}/records/modify"
        body = json.dumps({'operations': operations})
        date = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
        headers = {
            'Content-Type': 'application/json',
            'X-Apple-CloudKit-Request-KeyID': self.key_id,
            'X-Apple-CloudKit-Request-ISO8601Date': date,
            'X-Apple-CloudKit-Request-SignatureV1': self._sign(date, body, path),
        }
        r = requests.post(f"{HOST}{path}", headers=headers, data=body, timeout=60)
        if r.status_code == 200:
            return r.json()
        try:
            err = r.json()
            reason = err.get('reason', 'Unknown')
            code = err.get('serverErrorCode', r.status_code)
            return {'error': f"{code}: {reason}"}
        except ValueError:
            return {'error': f"{r.status_code}: {r.text[:200]}"}

    def query(self, record_type, limit=200):
        """Query records of a given type."""
        path = f"{self.path_base}/records/query"
        body = json.dumps({
            'query': {'recordType': record_type},
            'resultsLimit': limit
        })
        date = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
        headers = {
            'Content-Type': 'application/json',
            'X-Apple-CloudKit-Request-KeyID': self.key_id,
            'X-Apple-CloudKit-Request-ISO8601Date': date,
            'X-Apple-CloudKit-Request-SignatureV1': self._sign(date, body, path),
        }
        r = requests.post(f"{HOST}{path}", headers=headers, data=body, timeout=60)
        if r.status_code == 200:
            return r.json()
        return {'error': f"{r.status_code}: {r.text[:200]}"}

    def delete_all(self, record_type, verbose=False):
        """Delete all records of a given type."""
        total_deleted = 0
        while True:
            result = self.query(record_type)
            if 'error' in result:
                print(f" Query error: {result['error']}")
                break
            records = result.get('records', [])
            if not records:
                break
            # Build delete operations
            ops = [{
                'operationType': 'delete',
                'record': {'recordName': r['recordName'], 'recordType': record_type}
            } for r in records]
            delete_result = self.modify(ops)
            if 'error' in delete_result:
                print(f" Delete error: {delete_result['error']}")
                break
            deleted = len(delete_result.get('records', []))
            total_deleted += deleted
            if verbose:
                print(f" Deleted {deleted} {record_type} records...")
            time.sleep(0.5)
        return total_deleted
def import_data(ck, records, name, dry_run, verbose):
    total = 0
    errors = 0
    for i in range(0, len(records), BATCH_SIZE):
        batch = records[i:i+BATCH_SIZE]
        ops = [{'operationType': 'forceReplace', 'record': r} for r in batch]
        if verbose:
            print(f" Batch {i//BATCH_SIZE + 1}: {len(batch)} records, {len(ops)} ops")
        if not ops:
            print(f" Warning: Empty batch at index {i}, skipping")
            continue
        if dry_run:
            print(f" [DRY RUN] Would create {len(batch)} {name}")
            total += len(batch)
        else:
            result = ck.modify(ops)
            if 'error' in result:
                errors += 1
                if errors <= 3:  # Only show first 3 errors
                    print(f" Error: {result['error']}")
                    if verbose and batch:
                        print(f" Sample record: {json.dumps(batch[0], indent=2)[:500]}")
                if errors == 3:
                    print(" (suppressing further errors...)")
            else:
                result_records = result.get('records', [])
                # Count only successful records (no serverErrorCode)
                successful = [r for r in result_records if 'serverErrorCode' not in r]
                failed = [r for r in result_records if 'serverErrorCode' in r]
                n = len(successful)
                total += n
                print(f" Created {n} {name}")
                if failed:
                    print(f" Failed {len(failed)} records: {failed[0].get('serverErrorCode')}: {failed[0].get('reason')}")
                if verbose:
                    print(f" Response: {json.dumps(result, indent=2)[:1000]}")
            time.sleep(0.5)
    if errors > 0:
        print(f" Total errors: {errors}")
    return total
def main():
    p = argparse.ArgumentParser(description='Import JSON to CloudKit')
    p.add_argument('--key-id', default=os.environ.get('CLOUDKIT_KEY_ID'))
    p.add_argument('--key-file', default=os.environ.get('CLOUDKIT_KEY_FILE'))
    p.add_argument('--container', default=CONTAINER)
    p.add_argument('--env', choices=['development', 'production'], default='development')
    p.add_argument('--data-dir', default='./data')
    p.add_argument('--stadiums-only', action='store_true')
    p.add_argument('--games-only', action='store_true')
    p.add_argument('--delete-all', action='store_true', help='Delete all records before importing')
    p.add_argument('--delete-only', action='store_true', help='Only delete records, do not import')
    p.add_argument('--dry-run', action='store_true')
    p.add_argument('--verbose', '-v', action='store_true')
    args = p.parse_args()

    print(f"\n{'='*50}")
    print(f"CloudKit Import {'(DRY RUN)' if args.dry_run else ''}")
    print(f"{'='*50}")
    print(f"Container: {args.container}")
    print(f"Environment: {args.env}\n")

    data_dir = Path(args.data_dir)
    stadiums = json.load(open(data_dir / 'stadiums.json'))
    games = json.load(open(data_dir / 'games.json')) if (data_dir / 'games.json').exists() else []
    print(f"Loaded {len(stadiums)} stadiums, {len(games)} games\n")

    ck = None
    if not args.dry_run:
        if not HAS_CRYPTO:
            sys.exit("Error: pip install cryptography")
        if not args.key_id or not args.key_file:
            sys.exit("Error: --key-id and --key-file required (or use --dry-run)")
        ck = CloudKit(args.key_id, open(args.key_file, 'rb').read(), args.container, args.env)

    # Handle deletion
    if args.delete_all or args.delete_only:
        if not ck:
            sys.exit("Error: --key-id and --key-file required for deletion")
        print("--- Deleting Existing Records ---")
        # Delete in order: Games first (has references), then Teams, then Stadiums
        for record_type in ['Game', 'Team', 'Stadium']:
            print(f" Deleting {record_type} records...")
            deleted = ck.delete_all(record_type, verbose=args.verbose)
            print(f" Deleted {deleted} {record_type} records")
        if args.delete_only:
            print(f"\n{'='*50}")
            print("DELETE COMPLETE")
            print()
            return

    stats = {'stadiums': 0, 'teams': 0, 'games': 0}
    team_map = {}

    # Import stadiums & teams
    if not args.games_only:
        print("--- Stadiums ---")
        recs = [{
            'recordType': 'Stadium', 'recordName': s['id'],
            'fields': {
                'stadiumId': {'value': s['id']}, 'name': {'value': s['name']},
                'city': {'value': s['city']}, 'state': {'value': s.get('state', '')},
                'sport': {'value': s['sport']}, 'source': {'value': s.get('source', '')},
                'teamAbbrevs': {'value': s.get('team_abbrevs', [])},
                **({'location': {'value': {'latitude': s['latitude'], 'longitude': s['longitude']}}}
                   if s.get('latitude') else {}),
                **({'capacity': {'value': s['capacity']}} if s.get('capacity') else {}),
            }
        } for s in stadiums]
        stats['stadiums'] = import_data(ck, recs, 'stadiums', args.dry_run, args.verbose)

        print("--- Teams ---")
        teams = {}
        for s in stadiums:
            for abbr in s.get('team_abbrevs', []):
                if abbr not in teams:
                    teams[abbr] = {'city': s['city'], 'sport': s['sport']}
                    team_map[abbr] = f"team_{abbr.lower()}"
        recs = [{
            'recordType': 'Team', 'recordName': f"team_{abbr.lower()}",
            'fields': {
                'teamId': {'value': f"team_{abbr.lower()}"}, 'abbreviation': {'value': abbr},
                'name': {'value': abbr}, 'city': {'value': info['city']}, 'sport': {'value': info['sport']},
            }
        } for abbr, info in teams.items()]
        stats['teams'] = import_data(ck, recs, 'teams', args.dry_run, args.verbose)

    # Import games
    if not args.stadiums_only and games:
        if not team_map:
            for s in stadiums:
                for abbr in s.get('team_abbrevs', []):
                    team_map[abbr] = f"team_{abbr.lower()}"
        print("--- Games ---")
        # Deduplicate games by ID
        seen_ids = set()
        unique_games = []
        for g in games:
            if g['id'] not in seen_ids:
                seen_ids.add(g['id'])
                unique_games.append(g)
        if len(unique_games) < len(games):
            print(f" Removed {len(games) - len(unique_games)} duplicate games")
        recs = []
        for g in unique_games:
            fields = {
                'gameId': {'value': g['id']}, 'sport': {'value': g['sport']},
                'season': {'value': g.get('season', '')}, 'source': {'value': g.get('source', '')},
            }
            if g.get('date'):
                try:
                    dt = datetime.strptime(f"{g['date']} {g.get('time', '19:00')}", '%Y-%m-%d %H:%M')
                    fields['dateTime'] = {'value': int(dt.timestamp() * 1000)}
                except ValueError:
                    pass
            if g.get('home_team_abbrev') in team_map:
                fields['homeTeamRef'] = {'value': {'recordName': team_map[g['home_team_abbrev']], 'action': 'NONE'}}
            if g.get('away_team_abbrev') in team_map:
                fields['awayTeamRef'] = {'value': {'recordName': team_map[g['away_team_abbrev']], 'action': 'NONE'}}
            recs.append({'recordType': 'Game', 'recordName': g['id'], 'fields': fields})
        stats['games'] = import_data(ck, recs, 'games', args.dry_run, args.verbose)

    print(f"\n{'='*50}")
    print(f"COMPLETE: {stats['stadiums']} stadiums, {stats['teams']} teams, {stats['games']} games")
    if args.dry_run:
        print("[DRY RUN - nothing imported]")
    print()

if __name__ == '__main__':
    main()


@@ -0,0 +1,53 @@
DEFINE SCHEMA

    RECORD TYPE Stadium (
        "___createTime" TIMESTAMP,
        "___createdBy"  REFERENCE,
        "___etag"       STRING,
        "___modTime"    TIMESTAMP,
        "___modifiedBy" REFERENCE,
        "___recordID"   REFERENCE QUERYABLE,
        stadiumId       STRING QUERYABLE,
        name            STRING QUERYABLE SEARCHABLE,
        city            STRING QUERYABLE,
        state           STRING,
        location        LOCATION QUERYABLE,
        capacity        INT64,
        sport           STRING QUERYABLE SORTABLE,
        teamAbbrevs     LIST<STRING>,
        source          STRING,
        yearOpened      INT64
    );

    RECORD TYPE Team (
        "___createTime" TIMESTAMP,
        "___createdBy"  REFERENCE,
        "___etag"       STRING,
        "___modTime"    TIMESTAMP,
        "___modifiedBy" REFERENCE,
        "___recordID"   REFERENCE QUERYABLE,
        teamId          STRING QUERYABLE,
        name            STRING QUERYABLE SEARCHABLE,
        abbreviation    STRING QUERYABLE,
        city            STRING QUERYABLE,
        sport           STRING QUERYABLE SORTABLE
    );

    RECORD TYPE Game (
        "___createTime" TIMESTAMP,
        "___createdBy"  REFERENCE,
        "___etag"       STRING,
        "___modTime"    TIMESTAMP,
        "___modifiedBy" REFERENCE,
        "___recordID"   REFERENCE QUERYABLE,
        gameId          STRING QUERYABLE,
        sport           STRING QUERYABLE SORTABLE,
        season          STRING QUERYABLE,
        dateTime        TIMESTAMP QUERYABLE SORTABLE,
        homeTeamRef     REFERENCE QUERYABLE,
        awayTeamRef     REFERENCE QUERYABLE,
        venueRef        REFERENCE,
        isPlayoff       INT64,
        broadcastInfo   STRING,
        source          STRING
    );

Scripts/data/games.csv Normal file

File diff suppressed because it is too large

Scripts/data/games.json Normal file

File diff suppressed because it is too large


Scripts/data/stadiums.csv Normal file

@@ -0,0 +1,93 @@
id,name,city,state,latitude,longitude,capacity,sport,team_abbrevs,source,year_opened
manual_nba_atl,State Farm Arena,Atlanta,,33.7573,-84.3963,0,NBA,['ATL'],manual,
manual_nba_bos,TD Garden,Boston,,42.3662,-71.0621,0,NBA,['BOS'],manual,
manual_nba_brk,Barclays Center,Brooklyn,,40.6826,-73.9754,0,NBA,['BRK'],manual,
manual_nba_cho,Spectrum Center,Charlotte,,35.2251,-80.8392,0,NBA,['CHO'],manual,
manual_nba_chi,United Center,Chicago,,41.8807,-87.6742,0,NBA,['CHI'],manual,
manual_nba_cle,Rocket Mortgage FieldHouse,Cleveland,,41.4965,-81.6882,0,NBA,['CLE'],manual,
manual_nba_dal,American Airlines Center,Dallas,,32.7905,-96.8103,0,NBA,['DAL'],manual,
manual_nba_den,Ball Arena,Denver,,39.7487,-105.0077,0,NBA,['DEN'],manual,
manual_nba_det,Little Caesars Arena,Detroit,,42.3411,-83.0553,0,NBA,['DET'],manual,
manual_nba_gsw,Chase Center,San Francisco,,37.768,-122.3879,0,NBA,['GSW'],manual,
manual_nba_hou,Toyota Center,Houston,,29.7508,-95.3621,0,NBA,['HOU'],manual,
manual_nba_ind,Gainbridge Fieldhouse,Indianapolis,,39.764,-86.1555,0,NBA,['IND'],manual,
manual_nba_lac,Intuit Dome,Inglewood,,33.9425,-118.3419,0,NBA,['LAC'],manual,
manual_nba_lal,Crypto.com Arena,Los Angeles,,34.043,-118.2673,0,NBA,['LAL'],manual,
manual_nba_mem,FedExForum,Memphis,,35.1382,-90.0506,0,NBA,['MEM'],manual,
manual_nba_mia,Kaseya Center,Miami,,25.7814,-80.187,0,NBA,['MIA'],manual,
manual_nba_mil,Fiserv Forum,Milwaukee,,43.0451,-87.9174,0,NBA,['MIL'],manual,
manual_nba_min,Target Center,Minneapolis,,44.9795,-93.2761,0,NBA,['MIN'],manual,
manual_nba_nop,Smoothie King Center,New Orleans,,29.949,-90.0821,0,NBA,['NOP'],manual,
manual_nba_nyk,Madison Square Garden,New York,,40.7505,-73.9934,0,NBA,['NYK'],manual,
manual_nba_okc,Paycom Center,Oklahoma City,,35.4634,-97.5151,0,NBA,['OKC'],manual,
manual_nba_orl,Kia Center,Orlando,,28.5392,-81.3839,0,NBA,['ORL'],manual,
manual_nba_phi,Wells Fargo Center,Philadelphia,,39.9012,-75.172,0,NBA,['PHI'],manual,
manual_nba_pho,Footprint Center,Phoenix,,33.4457,-112.0712,0,NBA,['PHO'],manual,
manual_nba_por,Moda Center,Portland,,45.5316,-122.6668,0,NBA,['POR'],manual,
manual_nba_sac,Golden 1 Center,Sacramento,,38.5802,-121.4997,0,NBA,['SAC'],manual,
manual_nba_sas,Frost Bank Center,San Antonio,,29.427,-98.4375,0,NBA,['SAS'],manual,
manual_nba_tor,Scotiabank Arena,Toronto,,43.6435,-79.3791,0,NBA,['TOR'],manual,
manual_nba_uta,Delta Center,Salt Lake City,,40.7683,-111.9011,0,NBA,['UTA'],manual,
manual_nba_was,Capital One Arena,Washington,,38.8982,-77.0209,0,NBA,['WAS'],manual,
manual_mlb_ari,Chase Field,Phoenix,AZ,33.4453,-112.0667,48686,MLB,['ARI'],manual,
manual_mlb_atl,Truist Park,Atlanta,GA,33.8907,-84.4678,41084,MLB,['ATL'],manual,
manual_mlb_bal,Oriole Park at Camden Yards,Baltimore,MD,39.2838,-76.6218,45971,MLB,['BAL'],manual,
manual_mlb_bos,Fenway Park,Boston,MA,42.3467,-71.0972,37755,MLB,['BOS'],manual,
manual_mlb_chc,Wrigley Field,Chicago,IL,41.9484,-87.6553,41649,MLB,['CHC'],manual,
manual_mlb_chw,Guaranteed Rate Field,Chicago,IL,41.8299,-87.6338,40615,MLB,['CHW'],manual,
manual_mlb_cin,Great American Ball Park,Cincinnati,OH,39.0979,-84.5082,42319,MLB,['CIN'],manual,
manual_mlb_cle,Progressive Field,Cleveland,OH,41.4962,-81.6852,34830,MLB,['CLE'],manual,
manual_mlb_col,Coors Field,Denver,CO,39.7559,-104.9942,50144,MLB,['COL'],manual,
manual_mlb_det,Comerica Park,Detroit,MI,42.339,-83.0485,41083,MLB,['DET'],manual,
manual_mlb_hou,Minute Maid Park,Houston,TX,29.7573,-95.3555,41168,MLB,['HOU'],manual,
manual_mlb_kcr,Kauffman Stadium,Kansas City,MO,39.0517,-94.4803,37903,MLB,['KCR'],manual,
manual_mlb_laa,Angel Stadium,Anaheim,CA,33.8003,-117.8827,45517,MLB,['LAA'],manual,
manual_mlb_lad,Dodger Stadium,Los Angeles,CA,34.0739,-118.24,56000,MLB,['LAD'],manual,
manual_mlb_mia,LoanDepot Park,Miami,FL,25.7781,-80.2196,36742,MLB,['MIA'],manual,
manual_mlb_mil,American Family Field,Milwaukee,WI,43.028,-87.9712,41900,MLB,['MIL'],manual,
manual_mlb_min,Target Field,Minneapolis,MN,44.9817,-93.2776,38544,MLB,['MIN'],manual,
manual_mlb_nym,Citi Field,New York,NY,40.7571,-73.8458,41922,MLB,['NYM'],manual,
manual_mlb_nyy,Yankee Stadium,New York,NY,40.8296,-73.9262,46537,MLB,['NYY'],manual,
manual_mlb_oak,Sutter Health Park,Sacramento,CA,38.5802,-121.5097,14014,MLB,['OAK'],manual,
manual_mlb_phi,Citizens Bank Park,Philadelphia,PA,39.9061,-75.1665,42792,MLB,['PHI'],manual,
manual_mlb_pit,PNC Park,Pittsburgh,PA,40.4469,-80.0057,38362,MLB,['PIT'],manual,
manual_mlb_sdp,Petco Park,San Diego,CA,32.7076,-117.157,40209,MLB,['SDP'],manual,
manual_mlb_sfg,Oracle Park,San Francisco,CA,37.7786,-122.3893,41265,MLB,['SFG'],manual,
manual_mlb_sea,T-Mobile Park,Seattle,WA,47.5914,-122.3325,47929,MLB,['SEA'],manual,
manual_mlb_stl,Busch Stadium,St. Louis,MO,38.6226,-90.1928,45494,MLB,['STL'],manual,
manual_mlb_tbr,Tropicana Field,St. Petersburg,FL,27.7682,-82.6534,25000,MLB,['TBR'],manual,
manual_mlb_tex,Globe Life Field,Arlington,TX,32.7473,-97.0845,40300,MLB,['TEX'],manual,
manual_mlb_tor,Rogers Centre,Toronto,ON,43.6414,-79.3894,49282,MLB,['TOR'],manual,
manual_mlb_wsn,Nationals Park,Washington,DC,38.873,-77.0074,41339,MLB,['WSN'],manual,
manual_nhl_ana,Honda Center,Anaheim,CA,33.8078,-117.8765,17174,NHL,['ANA'],manual,
manual_nhl_ari,Delta Center,Salt Lake City,UT,40.7683,-111.9011,18306,NHL,['ARI'],manual,
manual_nhl_bos,TD Garden,Boston,MA,42.3662,-71.0621,17565,NHL,['BOS'],manual,
manual_nhl_buf,KeyBank Center,Buffalo,NY,42.875,-78.8764,19070,NHL,['BUF'],manual,
manual_nhl_cgy,Scotiabank Saddledome,Calgary,AB,51.0374,-114.0519,19289,NHL,['CGY'],manual,
manual_nhl_car,PNC Arena,Raleigh,NC,35.8034,-78.722,18680,NHL,['CAR'],manual,
manual_nhl_chi,United Center,Chicago,IL,41.8807,-87.6742,19717,NHL,['CHI'],manual,
manual_nhl_col,Ball Arena,Denver,CO,39.7487,-105.0077,18007,NHL,['COL'],manual,
manual_nhl_cbj,Nationwide Arena,Columbus,OH,39.9693,-83.0061,18500,NHL,['CBJ'],manual,
manual_nhl_dal,American Airlines Center,Dallas,TX,32.7905,-96.8103,18532,NHL,['DAL'],manual,
manual_nhl_det,Little Caesars Arena,Detroit,MI,42.3411,-83.0553,19515,NHL,['DET'],manual,
manual_nhl_edm,Rogers Place,Edmonton,AB,53.5469,-113.4978,18347,NHL,['EDM'],manual,
manual_nhl_fla,Amerant Bank Arena,Sunrise,FL,26.1584,-80.3256,19250,NHL,['FLA'],manual,
manual_nhl_lak,Crypto.com Arena,Los Angeles,CA,34.043,-118.2673,18230,NHL,['LAK'],manual,
manual_nhl_min,Xcel Energy Center,St. Paul,MN,44.9448,-93.101,17954,NHL,['MIN'],manual,
manual_nhl_mtl,Bell Centre,Montreal,QC,45.4961,-73.5693,21302,NHL,['MTL'],manual,
manual_nhl_nsh,Bridgestone Arena,Nashville,TN,36.1592,-86.7785,17159,NHL,['NSH'],manual,
manual_nhl_njd,Prudential Center,Newark,NJ,40.7334,-74.1712,16514,NHL,['NJD'],manual,
manual_nhl_nyi,UBS Arena,Elmont,NY,40.7161,-73.7246,17255,NHL,['NYI'],manual,
manual_nhl_nyr,Madison Square Garden,New York,NY,40.7505,-73.9934,18006,NHL,['NYR'],manual,
manual_nhl_ott,Canadian Tire Centre,Ottawa,ON,45.2969,-75.9272,18652,NHL,['OTT'],manual,
manual_nhl_phi,Wells Fargo Center,Philadelphia,PA,39.9012,-75.172,19543,NHL,['PHI'],manual,
manual_nhl_pit,PPG Paints Arena,Pittsburgh,PA,40.4395,-79.9892,18387,NHL,['PIT'],manual,
manual_nhl_sjs,SAP Center,San Jose,CA,37.3327,-121.901,17562,NHL,['SJS'],manual,
manual_nhl_sea,Climate Pledge Arena,Seattle,WA,47.6221,-122.354,17100,NHL,['SEA'],manual,
manual_nhl_stl,Enterprise Center,St. Louis,MO,38.6268,-90.2025,18096,NHL,['STL'],manual,
manual_nhl_tbl,Amalie Arena,Tampa,FL,27.9426,-82.4519,19092,NHL,['TBL'],manual,
manual_nhl_tor,Scotiabank Arena,Toronto,ON,43.6435,-79.3791,18819,NHL,['TOR'],manual,
manual_nhl_van,Rogers Arena,Vancouver,BC,49.2778,-123.1089,18910,NHL,['VAN'],manual,
manual_nhl_vgk,T-Mobile Arena,Las Vegas,NV,36.1028,-115.1784,17500,NHL,['VGK'],manual,
manual_nhl_wsh,Capital One Arena,Washington,DC,38.8982,-77.0209,18573,NHL,['WSH'],manual,
manual_nhl_wpg,Canada Life Centre,Winnipeg,MB,49.8928,-97.1436,15321,NHL,['WPG'],manual,
70 manual_nhl_cbj Nationwide Arena Columbus OH 39.9693 -83.0061 18500 NHL ['CBJ'] manual
71 manual_nhl_dal American Airlines Center Dallas TX 32.7905 -96.8103 18532 NHL ['DAL'] manual
72 manual_nhl_det Little Caesars Arena Detroit MI 42.3411 -83.0553 19515 NHL ['DET'] manual
73 manual_nhl_edm Rogers Place Edmonton AB 53.5469 -113.4978 18347 NHL ['EDM'] manual
74 manual_nhl_fla Amerant Bank Arena Sunrise FL 26.1584 -80.3256 19250 NHL ['FLA'] manual
75 manual_nhl_lak Crypto.com Arena Los Angeles CA 34.043 -118.2673 18230 NHL ['LAK'] manual
76 manual_nhl_min Xcel Energy Center St. Paul MN 44.9448 -93.101 17954 NHL ['MIN'] manual
77 manual_nhl_mtl Bell Centre Montreal QC 45.4961 -73.5693 21302 NHL ['MTL'] manual
78 manual_nhl_nsh Bridgestone Arena Nashville TN 36.1592 -86.7785 17159 NHL ['NSH'] manual
79 manual_nhl_njd Prudential Center Newark NJ 40.7334 -74.1712 16514 NHL ['NJD'] manual
80 manual_nhl_nyi UBS Arena Elmont NY 40.7161 -73.7246 17255 NHL ['NYI'] manual
81 manual_nhl_nyr Madison Square Garden New York NY 40.7505 -73.9934 18006 NHL ['NYR'] manual
82 manual_nhl_ott Canadian Tire Centre Ottawa ON 45.2969 -75.9272 18652 NHL ['OTT'] manual
83 manual_nhl_phi Wells Fargo Center Philadelphia PA 39.9012 -75.172 19543 NHL ['PHI'] manual
84 manual_nhl_pit PPG Paints Arena Pittsburgh PA 40.4395 -79.9892 18387 NHL ['PIT'] manual
85 manual_nhl_sjs SAP Center San Jose CA 37.3327 -121.901 17562 NHL ['SJS'] manual
86 manual_nhl_sea Climate Pledge Arena Seattle WA 47.6221 -122.354 17100 NHL ['SEA'] manual
87 manual_nhl_stl Enterprise Center St. Louis MO 38.6268 -90.2025 18096 NHL ['STL'] manual
88 manual_nhl_tbl Amalie Arena Tampa FL 27.9426 -82.4519 19092 NHL ['TBL'] manual
89 manual_nhl_tor Scotiabank Arena Toronto ON 43.6435 -79.3791 18819 NHL ['TOR'] manual
90 manual_nhl_van Rogers Arena Vancouver BC 49.2778 -123.1089 18910 NHL ['VAN'] manual
91 manual_nhl_vgk T-Mobile Arena Las Vegas NV 36.1028 -115.1784 17500 NHL ['VGK'] manual
92 manual_nhl_wsh Capital One Arena Washington DC 38.8982 -77.0209 18573 NHL ['WSH'] manual
93 manual_nhl_wpg Canada Life Centre Winnipeg MB 49.8928 -97.1436 15321 NHL ['WPG'] manual
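The latitude/longitude columns above are what the app's travel estimation works from. As a point of reference, straight-line distance between two stadiums can be computed with the standard haversine formula (an illustrative sketch, not code from this commit):

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Wrigley Field to Guaranteed Rate Field (rows 36-37 above): a few miles apart.
d = haversine_miles(41.9484, -87.6553, 41.8299, -87.6338)
```

Driving distance will of course exceed this straight-line figure; the haversine result is only a lower bound for route planning.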

Scripts/data/stadiums.json Normal file

File diff suppressed because it is too large

@@ -0,0 +1,275 @@
#!/usr/bin/env swift
//
// import_to_cloudkit.swift
// SportsTime
//
// Imports scraped JSON data into CloudKit public database.
// Run from command line: swift import_to_cloudkit.swift --games data/games.json --stadiums data/stadiums.json
//
import Foundation
import CloudKit
import CoreLocation  // CLLocation, used for the Stadium "location" field
// MARK: - Data Models (matching scraper output)
struct ScrapedGame: Codable {
let id: String
let sport: String
let season: String
let date: String
let time: String?
let home_team: String
let away_team: String
let home_team_abbrev: String
let away_team_abbrev: String
let venue: String
let source: String
let is_playoff: Bool?
let broadcast: String?
}
struct ScrapedStadium: Codable {
let id: String
let name: String
let city: String
let state: String
let latitude: Double
let longitude: Double
let capacity: Int
let sport: String
let team_abbrevs: [String]
let source: String
let year_opened: Int?
}
// MARK: - CloudKit Importer
class CloudKitImporter {
let container: CKContainer
let database: CKDatabase
init(containerIdentifier: String = "iCloud.com.sportstime.app") {
self.container = CKContainer(identifier: containerIdentifier)
self.database = container.publicCloudDatabase
}
// MARK: - Import Stadiums
func importStadiums(from stadiums: [ScrapedStadium]) async throws -> Int {
var imported = 0
for stadium in stadiums {
let record = CKRecord(recordType: "Stadium")
record["stadiumId"] = stadium.id
record["name"] = stadium.name
record["city"] = stadium.city
record["state"] = stadium.state
record["location"] = CLLocation(latitude: stadium.latitude, longitude: stadium.longitude)
record["capacity"] = stadium.capacity
record["sport"] = stadium.sport
record["teamAbbrevs"] = stadium.team_abbrevs
record["source"] = stadium.source
if let yearOpened = stadium.year_opened {
record["yearOpened"] = yearOpened
}
do {
_ = try await database.save(record)
imported += 1
print(" Imported stadium: \(stadium.name)")
} catch {
print(" Error importing \(stadium.name): \(error)")
}
}
return imported
}
// MARK: - Import Teams
func importTeams(from stadiums: [ScrapedStadium], teamMappings: [String: TeamInfo]) async throws -> [String: CKRecord.ID] {
var teamRecordIDs: [String: CKRecord.ID] = [:]
for (abbrev, info) in teamMappings {
let record = CKRecord(recordType: "Team")
record["teamId"] = UUID().uuidString
record["name"] = info.name
record["abbreviation"] = abbrev
record["sport"] = info.sport
record["city"] = info.city
do {
let saved = try await database.save(record)
teamRecordIDs[abbrev] = saved.recordID
print(" Imported team: \(info.name)")
} catch {
print(" Error importing team \(info.name): \(error)")
}
}
return teamRecordIDs
}
// MARK: - Import Games
func importGames(
from games: [ScrapedGame],
teamRecordIDs: [String: CKRecord.ID],
stadiumRecordIDs: [String: CKRecord.ID]
) async throws -> Int {
var imported = 0
// Batch imports for efficiency
let batchSize = 100
var batch: [CKRecord] = []
for game in games {
let record = CKRecord(recordType: "Game")
record["gameId"] = game.id
record["sport"] = game.sport
record["season"] = game.season
// Parse date
let dateFormatter = DateFormatter()
dateFormatter.dateFormat = "yyyy-MM-dd"
if let date = dateFormatter.date(from: game.date) {
if let timeStr = game.time {
// Combine date and time
let timeFormatter = DateFormatter()
timeFormatter.dateFormat = "HH:mm"
if let time = timeFormatter.date(from: timeStr) {
let calendar = Calendar.current
let timeComponents = calendar.dateComponents([.hour, .minute], from: time)
if let combined = calendar.date(bySettingHour: timeComponents.hour ?? 19,
minute: timeComponents.minute ?? 0,
second: 0, of: date) {
record["dateTime"] = combined
}
}
} else {
// Default to 7 PM if no time
let calendar = Calendar.current
if let defaultTime = calendar.date(bySettingHour: 19, minute: 0, second: 0, of: date) {
record["dateTime"] = defaultTime
}
}
}
// Team references
if let homeTeamID = teamRecordIDs[game.home_team_abbrev] {
record["homeTeamRef"] = CKRecord.Reference(recordID: homeTeamID, action: .none)
}
if let awayTeamID = teamRecordIDs[game.away_team_abbrev] {
record["awayTeamRef"] = CKRecord.Reference(recordID: awayTeamID, action: .none)
}
record["isPlayoff"] = (game.is_playoff ?? false) ? 1 : 0
record["broadcastInfo"] = game.broadcast
record["source"] = game.source
batch.append(record)
// Save batch
if batch.count >= batchSize {
do {
                    // Use the async batch API directly; .changedKeys overwrites fields on re-import
                    try await database.modifyRecords(saving: batch, deleting: [], savePolicy: .changedKeys)
imported += batch.count
print(" Imported batch of \(batch.count) games (total: \(imported))")
batch.removeAll()
} catch {
print(" Error importing batch: \(error)")
}
}
}
// Save remaining
if !batch.isEmpty {
do {
            try await database.modifyRecords(saving: batch, deleting: [], savePolicy: .changedKeys)
imported += batch.count
} catch {
print(" Error importing final batch: \(error)")
}
}
return imported
}
}
// MARK: - Team Info
struct TeamInfo {
let name: String
let city: String
let sport: String
}
// MARK: - Main
func loadJSON<T: Codable>(from path: String) throws -> T {
let url = URL(fileURLWithPath: path)
let data = try Data(contentsOf: url)
return try JSONDecoder().decode(T.self, from: data)
}
func main() async {
let args = CommandLine.arguments
guard args.count >= 3 else {
print("Usage: swift import_to_cloudkit.swift --games <path> --stadiums <path>")
return
}
var gamesPath: String?
var stadiumsPath: String?
for i in 1..<args.count {
if args[i] == "--games" && i + 1 < args.count {
gamesPath = args[i + 1]
}
if args[i] == "--stadiums" && i + 1 < args.count {
stadiumsPath = args[i + 1]
}
}
let importer = CloudKitImporter()
// Import stadiums
if let path = stadiumsPath {
print("\n=== Importing Stadiums ===")
do {
let stadiums: [ScrapedStadium] = try loadJSON(from: path)
let count = try await importer.importStadiums(from: stadiums)
print("Imported \(count) stadiums")
} catch {
print("Error loading stadiums: \(error)")
}
}
// Import games
if let path = gamesPath {
print("\n=== Importing Games ===")
do {
let games: [ScrapedGame] = try loadJSON(from: path)
// Note: Would need to first import teams and get their record IDs
// This is a simplified version
print("Loaded \(games.count) games for import")
} catch {
print("Error loading games: \(error)")
}
}
print("\n=== Import Complete ===")
}
// Run the import, then exit once the async work finishes
Task {
    await main()
    exit(0)
}
// Keep the main thread alive while the Task runs
RunLoop.main.run()
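The Swift importer above combines a `YYYY-MM-DD` date with an optional `HH:MM` time, falling back to 7 PM when no start time is known. The same logic can be sketched in Python for reference; `combine_date_time` is an illustrative helper, not part of the repo:

```python
from datetime import datetime

def combine_date_time(date_str, time_str=None):
    """Combine a YYYY-MM-DD date with an optional HH:MM time (default 19:00)."""
    date = datetime.strptime(date_str, "%Y-%m-%d")
    if time_str:
        t = datetime.strptime(time_str, "%H:%M")
        return date.replace(hour=t.hour, minute=t.minute)
    return date.replace(hour=19, minute=0)  # default to 7 PM when time is unknown
```

Note that, like the Swift version, this carries no time zone; the scraper emits Eastern-time strings, so a production importer would want to attach an explicit zone.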

Scripts/requirements.txt Normal file

@@ -0,0 +1,8 @@
# Sports Schedule Scraper Dependencies
requests>=2.28.0
beautifulsoup4>=4.11.0
pandas>=2.0.0
lxml>=4.9.0
# CloudKit Import (optional - only needed for cloudkit_import.py)
cryptography>=41.0.0

Scripts/run_pipeline.py Executable file

@@ -0,0 +1,435 @@
#!/usr/bin/env python3
"""
SportsTime Data Pipeline
========================
Master script that orchestrates all data fetching, validation, and reporting.
Usage:
python run_pipeline.py # Full pipeline with defaults
python run_pipeline.py --season 2026 # Specify season
python run_pipeline.py --sport nba # Single sport only
python run_pipeline.py --skip-scrape # Validate existing data only
python run_pipeline.py --verbose # Detailed output
"""
import argparse
import json
import sys
from datetime import datetime
from pathlib import Path
from dataclasses import dataclass
from typing import Optional
from enum import Enum
# Import our modules
from scrape_schedules import (
Game, Stadium,
scrape_nba_basketball_reference,
scrape_mlb_statsapi, scrape_mlb_baseball_reference,
scrape_nhl_hockey_reference,
generate_stadiums_from_teams,
export_to_json,
assign_stable_ids,
)
from validate_data import (
validate_games,
validate_stadiums,
scrape_mlb_all_sources,
scrape_nba_all_sources,
scrape_nhl_all_sources,
ValidationReport,
)
class Severity(Enum):
HIGH = "high"
MEDIUM = "medium"
LOW = "low"
@dataclass
class PipelineResult:
success: bool
games_scraped: int
stadiums_scraped: int
games_by_sport: dict
validation_reports: list
stadium_issues: list
high_severity_count: int
medium_severity_count: int
low_severity_count: int
output_dir: Path
duration_seconds: float
def print_header(text: str):
"""Print a formatted header."""
print()
print("=" * 70)
print(f" {text}")
print("=" * 70)
def print_section(text: str):
"""Print a section header."""
print()
print(f"--- {text} ---")
def print_severity(severity: str, message: str):
"""Print a message with severity indicator."""
icons = {
'high': '🔴 HIGH',
'medium': '🟡 MEDIUM',
'low': '🟢 LOW',
}
print(f" {icons.get(severity, '')} {message}")
def run_pipeline(
season: int = 2025,
sport: str = 'all',
output_dir: Path = Path('./data'),
skip_scrape: bool = False,
validate: bool = True,
verbose: bool = False,
) -> PipelineResult:
"""
Run the complete data pipeline.
"""
start_time = datetime.now()
all_games = []
all_stadiums = []
games_by_sport = {}
validation_reports = []
stadium_issues = []
output_dir.mkdir(parents=True, exist_ok=True)
# =========================================================================
# PHASE 1: SCRAPE DATA
# =========================================================================
if not skip_scrape:
print_header("PHASE 1: SCRAPING DATA")
# Scrape stadiums
print_section("Stadiums")
all_stadiums = generate_stadiums_from_teams()
print(f" Generated {len(all_stadiums)} stadiums from team data")
# Scrape by sport
if sport in ['nba', 'all']:
print_section(f"NBA {season}")
nba_games = scrape_nba_basketball_reference(season)
nba_season = f"{season-1}-{str(season)[2:]}"
nba_games = assign_stable_ids(nba_games, 'NBA', nba_season)
all_games.extend(nba_games)
games_by_sport['NBA'] = len(nba_games)
if sport in ['mlb', 'all']:
print_section(f"MLB {season}")
mlb_games = scrape_mlb_statsapi(season)
# MLB API uses official gamePk - already stable
all_games.extend(mlb_games)
games_by_sport['MLB'] = len(mlb_games)
if sport in ['nhl', 'all']:
print_section(f"NHL {season}")
nhl_games = scrape_nhl_hockey_reference(season)
nhl_season = f"{season-1}-{str(season)[2:]}"
nhl_games = assign_stable_ids(nhl_games, 'NHL', nhl_season)
all_games.extend(nhl_games)
games_by_sport['NHL'] = len(nhl_games)
# Export data
print_section("Exporting Data")
export_to_json(all_games, all_stadiums, output_dir)
print(f" Exported to {output_dir}")
else:
# Load existing data
print_header("LOADING EXISTING DATA")
games_file = output_dir / 'games.json'
stadiums_file = output_dir / 'stadiums.json'
if games_file.exists():
with open(games_file) as f:
games_data = json.load(f)
all_games = [Game(**g) for g in games_data]
for g in all_games:
games_by_sport[g.sport] = games_by_sport.get(g.sport, 0) + 1
print(f" Loaded {len(all_games)} games")
if stadiums_file.exists():
with open(stadiums_file) as f:
stadiums_data = json.load(f)
all_stadiums = [Stadium(**s) for s in stadiums_data]
print(f" Loaded {len(all_stadiums)} stadiums")
# =========================================================================
# PHASE 2: VALIDATE DATA
# =========================================================================
if validate:
print_header("PHASE 2: CROSS-VALIDATION")
# MLB validation (has two good sources)
if sport in ['mlb', 'all']:
print_section("MLB Cross-Validation")
try:
mlb_sources = scrape_mlb_all_sources(season)
source_names = list(mlb_sources.keys())
if len(source_names) >= 2:
games1 = mlb_sources[source_names[0]]
games2 = mlb_sources[source_names[1]]
if games1 and games2:
report = validate_games(
games1, games2,
source_names[0], source_names[1],
'MLB', str(season)
)
validation_reports.append(report)
print(f" Sources: {source_names[0]} vs {source_names[1]}")
print(f" Games compared: {report.total_games_source1} vs {report.total_games_source2}")
print(f" Matched: {report.games_matched}")
print(f" Discrepancies: {len(report.discrepancies)}")
except Exception as e:
print(f" Error during MLB validation: {e}")
# Stadium validation
print_section("Stadium Validation")
stadium_issues = validate_stadiums(all_stadiums)
print(f" Issues found: {len(stadium_issues)}")
# Data quality checks
print_section("Data Quality Checks")
# Check game counts per team
if sport in ['nba', 'all']:
nba_games = [g for g in all_games if g.sport == 'NBA']
team_counts = {}
for g in nba_games:
team_counts[g.home_team_abbrev] = team_counts.get(g.home_team_abbrev, 0) + 1
team_counts[g.away_team_abbrev] = team_counts.get(g.away_team_abbrev, 0) + 1
for team, count in sorted(team_counts.items()):
if count < 75 or count > 90:
print(f" NBA: {team} has {count} games (expected ~82)")
if sport in ['nhl', 'all']:
nhl_games = [g for g in all_games if g.sport == 'NHL']
team_counts = {}
for g in nhl_games:
team_counts[g.home_team_abbrev] = team_counts.get(g.home_team_abbrev, 0) + 1
team_counts[g.away_team_abbrev] = team_counts.get(g.away_team_abbrev, 0) + 1
for team, count in sorted(team_counts.items()):
if count < 75 or count > 90:
print(f" NHL: {team} has {count} games (expected ~82)")
# =========================================================================
# PHASE 3: GENERATE REPORT
# =========================================================================
print_header("PHASE 3: DISCREPANCY REPORT")
# Count by severity
high_count = 0
medium_count = 0
low_count = 0
# Game discrepancies
for report in validation_reports:
for d in report.discrepancies:
if d.severity == 'high':
high_count += 1
elif d.severity == 'medium':
medium_count += 1
else:
low_count += 1
# Stadium issues
for issue in stadium_issues:
if issue['severity'] == 'high':
high_count += 1
elif issue['severity'] == 'medium':
medium_count += 1
else:
low_count += 1
# Print summary
print()
print(f" 🔴 HIGH severity: {high_count}")
print(f" 🟡 MEDIUM severity: {medium_count}")
print(f" 🟢 LOW severity: {low_count}")
print()
# Print high severity issues (always)
if high_count > 0:
print_section("HIGH Severity Issues (Requires Attention)")
shown = 0
max_show = 10 if not verbose else 50
for report in validation_reports:
for d in report.discrepancies:
if d.severity == 'high' and shown < max_show:
print_severity('high', f"[{report.sport}] {d.field}: {d.game_key}")
if verbose:
print(f" {d.source1}: {d.value1}")
print(f" {d.source2}: {d.value2}")
shown += 1
for issue in stadium_issues:
if issue['severity'] == 'high' and shown < max_show:
print_severity('high', f"[Stadium] {issue['stadium']}: {issue['issue']}")
shown += 1
if high_count > max_show:
print(f" ... and {high_count - max_show} more (use --verbose to see all)")
# Print medium severity if verbose
if medium_count > 0 and verbose:
print_section("MEDIUM Severity Issues")
for report in validation_reports:
for d in report.discrepancies:
if d.severity == 'medium':
print_severity('medium', f"[{report.sport}] {d.field}: {d.game_key}")
for issue in stadium_issues:
if issue['severity'] == 'medium':
print_severity('medium', f"[Stadium] {issue['stadium']}: {issue['issue']}")
# Save full report
report_path = output_dir / 'pipeline_report.json'
full_report = {
'generated_at': datetime.now().isoformat(),
'season': season,
'sport': sport,
'summary': {
'games_scraped': len(all_games),
'stadiums_scraped': len(all_stadiums),
'games_by_sport': games_by_sport,
'high_severity': high_count,
'medium_severity': medium_count,
'low_severity': low_count,
},
'game_validations': [r.to_dict() for r in validation_reports],
'stadium_issues': stadium_issues,
}
with open(report_path, 'w') as f:
json.dump(full_report, f, indent=2)
# =========================================================================
# FINAL SUMMARY
# =========================================================================
duration = (datetime.now() - start_time).total_seconds()
print_header("PIPELINE COMPLETE")
print()
print(f" Duration: {duration:.1f} seconds")
print(f" Games: {len(all_games):,}")
print(f" Stadiums: {len(all_stadiums)}")
print(f" Output: {output_dir.absolute()}")
print()
for sport_name, count in sorted(games_by_sport.items()):
print(f" {sport_name}: {count:,} games")
print()
print(f" Reports saved to:")
print(f" - {output_dir / 'games.json'}")
print(f" - {output_dir / 'stadiums.json'}")
print(f" - {output_dir / 'pipeline_report.json'}")
print()
# Status indicator
if high_count > 0:
print(" ⚠️ STATUS: Review required - high severity issues found")
elif medium_count > 0:
print(" ✓ STATUS: Complete with warnings")
else:
print(" ✅ STATUS: All checks passed")
print()
return PipelineResult(
success=high_count == 0,
games_scraped=len(all_games),
stadiums_scraped=len(all_stadiums),
games_by_sport=games_by_sport,
validation_reports=validation_reports,
stadium_issues=stadium_issues,
high_severity_count=high_count,
medium_severity_count=medium_count,
low_severity_count=low_count,
output_dir=output_dir,
duration_seconds=duration,
)
def main():
parser = argparse.ArgumentParser(
description='SportsTime Data Pipeline - Fetch, validate, and report on sports data',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
python run_pipeline.py # Full pipeline
python run_pipeline.py --season 2026 # Different season
python run_pipeline.py --sport mlb # MLB only
python run_pipeline.py --skip-scrape # Validate existing data
python run_pipeline.py --verbose # Show all issues
"""
)
parser.add_argument(
'--season', type=int, default=2025,
help='Season year (default: 2025)'
)
parser.add_argument(
'--sport', choices=['nba', 'mlb', 'nhl', 'all'], default='all',
help='Sport to process (default: all)'
)
parser.add_argument(
'--output', type=str, default='./data',
help='Output directory (default: ./data)'
)
parser.add_argument(
'--skip-scrape', action='store_true',
help='Skip scraping, validate existing data only'
)
parser.add_argument(
'--no-validate', action='store_true',
help='Skip validation step'
)
parser.add_argument(
'--verbose', '-v', action='store_true',
help='Verbose output with all issues'
)
args = parser.parse_args()
result = run_pipeline(
season=args.season,
sport=args.sport,
output_dir=Path(args.output),
skip_scrape=args.skip_scrape,
validate=not args.no_validate,
verbose=args.verbose,
)
# Exit with error code if high severity issues
sys.exit(0 if result.success else 1)
if __name__ == '__main__':
main()
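run_pipeline.py's data-quality phase flags teams whose game counts fall outside 75-90 (a full NBA/NHL regular season is 82 games per team, counting home and away). The tally logic can be exercised standalone; `count_team_games` is an illustrative helper, not in the repo:

```python
def count_team_games(games):
    """Tally appearances per team from (home_abbrev, away_abbrev) pairs."""
    counts = {}
    for home, away in games:
        counts[home] = counts.get(home, 0) + 1
        counts[away] = counts.get(away, 0) + 1
    return counts

# Every team in a tiny sample schedule falls outside the 75-90 band and is flagged.
schedule = [("BOS", "NYK"), ("NYK", "BOS"), ("BOS", "TOR")]
flagged = [t for t, c in count_team_games(schedule).items() if c < 75 or c > 90]
```

The loose 75-90 band tolerates scraping gaps and in-progress seasons while still catching grossly incomplete schedules.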

Scripts/scrape_schedules.py Normal file

@@ -0,0 +1,970 @@
#!/usr/bin/env python3
"""
Sports Schedule Scraper for SportsTime App
Scrapes NBA, MLB, NHL schedules from multiple sources for cross-validation.
Usage:
python scrape_schedules.py --sport nba --season 2025
python scrape_schedules.py --sport all --season 2025
python scrape_schedules.py --stadiums-only
"""
import argparse
import json
import time
import re
from datetime import datetime, timedelta
from pathlib import Path
from dataclasses import dataclass, asdict
from typing import Optional
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Rate limiting
REQUEST_DELAY = 3.0 # seconds between requests to same domain
last_request_time = {}
def rate_limit(domain: str):
"""Enforce rate limiting per domain."""
now = time.time()
if domain in last_request_time:
elapsed = now - last_request_time[domain]
if elapsed < REQUEST_DELAY:
time.sleep(REQUEST_DELAY - elapsed)
last_request_time[domain] = time.time()
def fetch_page(url: str, domain: str) -> Optional[BeautifulSoup]:
"""Fetch and parse a webpage with rate limiting."""
rate_limit(domain)
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
}
try:
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()
return BeautifulSoup(response.content, 'html.parser')
except Exception as e:
print(f"Error fetching {url}: {e}")
return None
# =============================================================================
# DATA CLASSES
# =============================================================================
@dataclass
class Game:
id: str
sport: str
season: str
date: str # YYYY-MM-DD
time: Optional[str] # HH:MM (24hr, ET)
home_team: str
away_team: str
home_team_abbrev: str
away_team_abbrev: str
venue: str
source: str
is_playoff: bool = False
broadcast: Optional[str] = None
@dataclass
class Stadium:
id: str
name: str
city: str
state: str
latitude: float
longitude: float
capacity: int
sport: str
team_abbrevs: list
source: str
year_opened: Optional[int] = None
# =============================================================================
# TEAM MAPPINGS
# =============================================================================
NBA_TEAMS = {
'ATL': {'name': 'Atlanta Hawks', 'city': 'Atlanta', 'arena': 'State Farm Arena'},
'BOS': {'name': 'Boston Celtics', 'city': 'Boston', 'arena': 'TD Garden'},
'BRK': {'name': 'Brooklyn Nets', 'city': 'Brooklyn', 'arena': 'Barclays Center'},
'CHO': {'name': 'Charlotte Hornets', 'city': 'Charlotte', 'arena': 'Spectrum Center'},
'CHI': {'name': 'Chicago Bulls', 'city': 'Chicago', 'arena': 'United Center'},
'CLE': {'name': 'Cleveland Cavaliers', 'city': 'Cleveland', 'arena': 'Rocket Mortgage FieldHouse'},
'DAL': {'name': 'Dallas Mavericks', 'city': 'Dallas', 'arena': 'American Airlines Center'},
'DEN': {'name': 'Denver Nuggets', 'city': 'Denver', 'arena': 'Ball Arena'},
'DET': {'name': 'Detroit Pistons', 'city': 'Detroit', 'arena': 'Little Caesars Arena'},
'GSW': {'name': 'Golden State Warriors', 'city': 'San Francisco', 'arena': 'Chase Center'},
'HOU': {'name': 'Houston Rockets', 'city': 'Houston', 'arena': 'Toyota Center'},
'IND': {'name': 'Indiana Pacers', 'city': 'Indianapolis', 'arena': 'Gainbridge Fieldhouse'},
'LAC': {'name': 'Los Angeles Clippers', 'city': 'Inglewood', 'arena': 'Intuit Dome'},
'LAL': {'name': 'Los Angeles Lakers', 'city': 'Los Angeles', 'arena': 'Crypto.com Arena'},
'MEM': {'name': 'Memphis Grizzlies', 'city': 'Memphis', 'arena': 'FedExForum'},
'MIA': {'name': 'Miami Heat', 'city': 'Miami', 'arena': 'Kaseya Center'},
'MIL': {'name': 'Milwaukee Bucks', 'city': 'Milwaukee', 'arena': 'Fiserv Forum'},
'MIN': {'name': 'Minnesota Timberwolves', 'city': 'Minneapolis', 'arena': 'Target Center'},
'NOP': {'name': 'New Orleans Pelicans', 'city': 'New Orleans', 'arena': 'Smoothie King Center'},
'NYK': {'name': 'New York Knicks', 'city': 'New York', 'arena': 'Madison Square Garden'},
'OKC': {'name': 'Oklahoma City Thunder', 'city': 'Oklahoma City', 'arena': 'Paycom Center'},
'ORL': {'name': 'Orlando Magic', 'city': 'Orlando', 'arena': 'Kia Center'},
'PHI': {'name': 'Philadelphia 76ers', 'city': 'Philadelphia', 'arena': 'Wells Fargo Center'},
'PHO': {'name': 'Phoenix Suns', 'city': 'Phoenix', 'arena': 'Footprint Center'},
'POR': {'name': 'Portland Trail Blazers', 'city': 'Portland', 'arena': 'Moda Center'},
'SAC': {'name': 'Sacramento Kings', 'city': 'Sacramento', 'arena': 'Golden 1 Center'},
'SAS': {'name': 'San Antonio Spurs', 'city': 'San Antonio', 'arena': 'Frost Bank Center'},
'TOR': {'name': 'Toronto Raptors', 'city': 'Toronto', 'arena': 'Scotiabank Arena'},
'UTA': {'name': 'Utah Jazz', 'city': 'Salt Lake City', 'arena': 'Delta Center'},
'WAS': {'name': 'Washington Wizards', 'city': 'Washington', 'arena': 'Capital One Arena'},
}
MLB_TEAMS = {
'ARI': {'name': 'Arizona Diamondbacks', 'city': 'Phoenix', 'stadium': 'Chase Field'},
'ATL': {'name': 'Atlanta Braves', 'city': 'Atlanta', 'stadium': 'Truist Park'},
'BAL': {'name': 'Baltimore Orioles', 'city': 'Baltimore', 'stadium': 'Oriole Park at Camden Yards'},
'BOS': {'name': 'Boston Red Sox', 'city': 'Boston', 'stadium': 'Fenway Park'},
'CHC': {'name': 'Chicago Cubs', 'city': 'Chicago', 'stadium': 'Wrigley Field'},
'CHW': {'name': 'Chicago White Sox', 'city': 'Chicago', 'stadium': 'Guaranteed Rate Field'},
'CIN': {'name': 'Cincinnati Reds', 'city': 'Cincinnati', 'stadium': 'Great American Ball Park'},
'CLE': {'name': 'Cleveland Guardians', 'city': 'Cleveland', 'stadium': 'Progressive Field'},
'COL': {'name': 'Colorado Rockies', 'city': 'Denver', 'stadium': 'Coors Field'},
'DET': {'name': 'Detroit Tigers', 'city': 'Detroit', 'stadium': 'Comerica Park'},
'HOU': {'name': 'Houston Astros', 'city': 'Houston', 'stadium': 'Minute Maid Park'},
'KCR': {'name': 'Kansas City Royals', 'city': 'Kansas City', 'stadium': 'Kauffman Stadium'},
'LAA': {'name': 'Los Angeles Angels', 'city': 'Anaheim', 'stadium': 'Angel Stadium'},
'LAD': {'name': 'Los Angeles Dodgers', 'city': 'Los Angeles', 'stadium': 'Dodger Stadium'},
'MIA': {'name': 'Miami Marlins', 'city': 'Miami', 'stadium': 'LoanDepot Park'},
'MIL': {'name': 'Milwaukee Brewers', 'city': 'Milwaukee', 'stadium': 'American Family Field'},
'MIN': {'name': 'Minnesota Twins', 'city': 'Minneapolis', 'stadium': 'Target Field'},
'NYM': {'name': 'New York Mets', 'city': 'New York', 'stadium': 'Citi Field'},
'NYY': {'name': 'New York Yankees', 'city': 'New York', 'stadium': 'Yankee Stadium'},
'OAK': {'name': 'Oakland Athletics', 'city': 'Sacramento', 'stadium': 'Sutter Health Park'},
'PHI': {'name': 'Philadelphia Phillies', 'city': 'Philadelphia', 'stadium': 'Citizens Bank Park'},
'PIT': {'name': 'Pittsburgh Pirates', 'city': 'Pittsburgh', 'stadium': 'PNC Park'},
'SDP': {'name': 'San Diego Padres', 'city': 'San Diego', 'stadium': 'Petco Park'},
'SFG': {'name': 'San Francisco Giants', 'city': 'San Francisco', 'stadium': 'Oracle Park'},
'SEA': {'name': 'Seattle Mariners', 'city': 'Seattle', 'stadium': 'T-Mobile Park'},
'STL': {'name': 'St. Louis Cardinals', 'city': 'St. Louis', 'stadium': 'Busch Stadium'},
'TBR': {'name': 'Tampa Bay Rays', 'city': 'St. Petersburg', 'stadium': 'Tropicana Field'},
'TEX': {'name': 'Texas Rangers', 'city': 'Arlington', 'stadium': 'Globe Life Field'},
'TOR': {'name': 'Toronto Blue Jays', 'city': 'Toronto', 'stadium': 'Rogers Centre'},
'WSN': {'name': 'Washington Nationals', 'city': 'Washington', 'stadium': 'Nationals Park'},
}
NHL_TEAMS = {
'ANA': {'name': 'Anaheim Ducks', 'city': 'Anaheim', 'arena': 'Honda Center'},
'ARI': {'name': 'Utah Hockey Club', 'city': 'Salt Lake City', 'arena': 'Delta Center'},
'BOS': {'name': 'Boston Bruins', 'city': 'Boston', 'arena': 'TD Garden'},
'BUF': {'name': 'Buffalo Sabres', 'city': 'Buffalo', 'arena': 'KeyBank Center'},
'CGY': {'name': 'Calgary Flames', 'city': 'Calgary', 'arena': 'Scotiabank Saddledome'},
'CAR': {'name': 'Carolina Hurricanes', 'city': 'Raleigh', 'arena': 'PNC Arena'},
'CHI': {'name': 'Chicago Blackhawks', 'city': 'Chicago', 'arena': 'United Center'},
'COL': {'name': 'Colorado Avalanche', 'city': 'Denver', 'arena': 'Ball Arena'},
'CBJ': {'name': 'Columbus Blue Jackets', 'city': 'Columbus', 'arena': 'Nationwide Arena'},
'DAL': {'name': 'Dallas Stars', 'city': 'Dallas', 'arena': 'American Airlines Center'},
'DET': {'name': 'Detroit Red Wings', 'city': 'Detroit', 'arena': 'Little Caesars Arena'},
'EDM': {'name': 'Edmonton Oilers', 'city': 'Edmonton', 'arena': 'Rogers Place'},
'FLA': {'name': 'Florida Panthers', 'city': 'Sunrise', 'arena': 'Amerant Bank Arena'},
'LAK': {'name': 'Los Angeles Kings', 'city': 'Los Angeles', 'arena': 'Crypto.com Arena'},
'MIN': {'name': 'Minnesota Wild', 'city': 'St. Paul', 'arena': 'Xcel Energy Center'},
'MTL': {'name': 'Montreal Canadiens', 'city': 'Montreal', 'arena': 'Bell Centre'},
'NSH': {'name': 'Nashville Predators', 'city': 'Nashville', 'arena': 'Bridgestone Arena'},
'NJD': {'name': 'New Jersey Devils', 'city': 'Newark', 'arena': 'Prudential Center'},
'NYI': {'name': 'New York Islanders', 'city': 'Elmont', 'arena': 'UBS Arena'},
'NYR': {'name': 'New York Rangers', 'city': 'New York', 'arena': 'Madison Square Garden'},
'OTT': {'name': 'Ottawa Senators', 'city': 'Ottawa', 'arena': 'Canadian Tire Centre'},
'PHI': {'name': 'Philadelphia Flyers', 'city': 'Philadelphia', 'arena': 'Wells Fargo Center'},
'PIT': {'name': 'Pittsburgh Penguins', 'city': 'Pittsburgh', 'arena': 'PPG Paints Arena'},
'SJS': {'name': 'San Jose Sharks', 'city': 'San Jose', 'arena': 'SAP Center'},
'SEA': {'name': 'Seattle Kraken', 'city': 'Seattle', 'arena': 'Climate Pledge Arena'},
'STL': {'name': 'St. Louis Blues', 'city': 'St. Louis', 'arena': 'Enterprise Center'},
'TBL': {'name': 'Tampa Bay Lightning', 'city': 'Tampa', 'arena': 'Amalie Arena'},
'TOR': {'name': 'Toronto Maple Leafs', 'city': 'Toronto', 'arena': 'Scotiabank Arena'},
'VAN': {'name': 'Vancouver Canucks', 'city': 'Vancouver', 'arena': 'Rogers Arena'},
'VGK': {'name': 'Vegas Golden Knights', 'city': 'Las Vegas', 'arena': 'T-Mobile Arena'},
'WSH': {'name': 'Washington Capitals', 'city': 'Washington', 'arena': 'Capital One Arena'},
'WPG': {'name': 'Winnipeg Jets', 'city': 'Winnipeg', 'arena': 'Canada Life Centre'},
}
# =============================================================================
# SCRAPERS - NBA
# =============================================================================
def scrape_nba_basketball_reference(season: int) -> list[Game]:
"""
Scrape NBA schedule from Basketball-Reference.
URL: https://www.basketball-reference.com/leagues/NBA_{YEAR}_games-{month}.html
Season year is the ending year (e.g., 2025 for 2024-25 season)
"""
games = []
months = ['october', 'november', 'december', 'january', 'february', 'march', 'april', 'may', 'june']
print(f"Scraping NBA {season} from Basketball-Reference...")
for month in months:
url = f"https://www.basketball-reference.com/leagues/NBA_{season}_games-{month}.html"
soup = fetch_page(url, 'basketball-reference.com')
if not soup:
continue
table = soup.find('table', {'id': 'schedule'})
if not table:
continue
tbody = table.find('tbody')
if not tbody:
continue
for row in tbody.find_all('tr'):
if row.get('class') and 'thead' in row.get('class'):
continue
cells = row.find_all(['td', 'th'])
if len(cells) < 6:
continue
try:
# Parse date
date_cell = row.find('th', {'data-stat': 'date_game'})
if not date_cell:
continue
date_link = date_cell.find('a')
date_str = date_link.text if date_link else date_cell.text
# Parse time
time_cell = row.find('td', {'data-stat': 'game_start_time'})
time_str = time_cell.text.strip() if time_cell else None
# Parse teams
visitor_cell = row.find('td', {'data-stat': 'visitor_team_name'})
home_cell = row.find('td', {'data-stat': 'home_team_name'})
if not visitor_cell or not home_cell:
continue
visitor_link = visitor_cell.find('a')
home_link = home_cell.find('a')
away_team = visitor_link.text if visitor_link else visitor_cell.text
home_team = home_link.text if home_link else home_cell.text
# Parse arena
arena_cell = row.find('td', {'data-stat': 'arena_name'})
arena = arena_cell.text.strip() if arena_cell else ''
# Convert date
try:
parsed_date = datetime.strptime(date_str.strip(), '%a, %b %d, %Y')
date_formatted = parsed_date.strftime('%Y-%m-%d')
                except ValueError:
continue
                # Provisional ID; overwritten later by assign_stable_ids()
                game_id = f"nba_{date_formatted}_{away_team[:3]}_{home_team[:3]}".lower().replace(' ', '')
game = Game(
id=game_id,
sport='NBA',
season=f"{season-1}-{str(season)[2:]}",
date=date_formatted,
time=time_str,
home_team=home_team,
away_team=away_team,
home_team_abbrev=get_team_abbrev(home_team, 'NBA'),
away_team_abbrev=get_team_abbrev(away_team, 'NBA'),
venue=arena,
source='basketball-reference.com'
)
games.append(game)
except Exception as e:
print(f" Error parsing row: {e}")
continue
print(f" Found {len(games)} games from Basketball-Reference")
return games
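Basketball-Reference formats schedule dates like `Tue, Oct 22, 2024`; the conversion used above can be sanity-checked in isolation (a minimal standalone sketch, not part of the scraper):

```python
from datetime import datetime

def br_date_to_iso(date_str):
    """Convert a Basketball-Reference date cell to ISO YYYY-MM-DD."""
    try:
        return datetime.strptime(date_str.strip(), '%a, %b %d, %Y').strftime('%Y-%m-%d')
    except ValueError:
        return None  # malformed rows are skipped, mirroring the scraper

print(br_date_to_iso('Tue, Oct 22, 2024'))  # 2024-10-22
print(br_date_to_iso('Playoffs'))           # None (section-header rows)
```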
def scrape_nba_espn(season: int) -> list[Game]:
"""
Scrape NBA schedule from ESPN.
URL: https://www.espn.com/nba/schedule/_/date/{YYYYMMDD}
"""
games = []
print(f"Scraping NBA {season} from ESPN...")
# Determine date range for season
start_date = datetime(season - 1, 10, 1) # October of previous year
end_date = datetime(season, 6, 30) # June of season year
current_date = start_date
while current_date <= end_date:
date_str = current_date.strftime('%Y%m%d')
url = f"https://www.espn.com/nba/schedule/_/date/{date_str}"
soup = fetch_page(url, 'espn.com')
if soup:
# ESPN uses JavaScript rendering, so we need to parse what's available
# This is a simplified version - full implementation would need Selenium
pass
current_date += timedelta(days=7) # Sample weekly to respect rate limits
print(f" Found {len(games)} games from ESPN")
return games
# =============================================================================
# SCRAPERS - MLB
# =============================================================================
def scrape_mlb_baseball_reference(season: int) -> list[Game]:
"""
Scrape MLB schedule from Baseball-Reference.
URL: https://www.baseball-reference.com/leagues/majors/{YEAR}-schedule.shtml
"""
games = []
url = f"https://www.baseball-reference.com/leagues/majors/{season}-schedule.shtml"
print(f"Scraping MLB {season} from Baseball-Reference...")
soup = fetch_page(url, 'baseball-reference.com')
if not soup:
return games
# Baseball-Reference groups games by date in h3 headers
current_date = None
# Find the schedule section
schedule_div = soup.find('div', {'id': 'all_schedule'})
if not schedule_div:
schedule_div = soup
# Process all elements to track date context
for element in schedule_div.find_all(['h3', 'p', 'div']):
# Check for date header
if element.name == 'h3':
date_text = element.get_text(strip=True)
            # Parse date like "Thursday, March 27, 2025"
            for fmt in ['%A, %B %d, %Y', '%B %d, %Y', '%a, %b %d, %Y']:
                try:
                    parsed = datetime.strptime(date_text, fmt)
                    current_date = parsed.strftime('%Y-%m-%d')
                    break
                except ValueError:
                    continue
# Check for game entries
elif element.name == 'p' and 'game' in element.get('class', []):
if not current_date:
continue
try:
links = element.find_all('a')
if len(links) >= 2:
away_team = links[0].text.strip()
home_team = links[1].text.strip()
# Generate unique game ID
away_abbrev = get_team_abbrev(away_team, 'MLB')
home_abbrev = get_team_abbrev(home_team, 'MLB')
game_id = f"mlb_br_{current_date}_{away_abbrev}_{home_abbrev}".lower()
game = Game(
id=game_id,
sport='MLB',
season=str(season),
date=current_date,
time=None,
home_team=home_team,
away_team=away_team,
home_team_abbrev=home_abbrev,
away_team_abbrev=away_abbrev,
venue='',
source='baseball-reference.com'
)
games.append(game)
            except Exception:
                continue
print(f" Found {len(games)} games from Baseball-Reference")
return games
def scrape_mlb_statsapi(season: int) -> list[Game]:
"""
Fetch MLB schedule from official Stats API (JSON).
URL: https://statsapi.mlb.com/api/v1/schedule?sportId=1&season={YEAR}&gameType=R
"""
games = []
url = f"https://statsapi.mlb.com/api/v1/schedule?sportId=1&season={season}&gameType=R&hydrate=team,venue"
print(f"Fetching MLB {season} from Stats API...")
try:
response = requests.get(url, timeout=30)
response.raise_for_status()
data = response.json()
for date_entry in data.get('dates', []):
game_date = date_entry.get('date', '')
for game_data in date_entry.get('games', []):
try:
teams = game_data.get('teams', {})
away = teams.get('away', {}).get('team', {})
home = teams.get('home', {}).get('team', {})
venue = game_data.get('venue', {})
game_time = game_data.get('gameDate', '')
if 'T' in game_time:
time_str = game_time.split('T')[1][:5]
else:
time_str = None
game = Game(
id=f"mlb_{game_data.get('gamePk', '')}",
sport='MLB',
season=str(season),
date=game_date,
time=time_str,
home_team=home.get('name', ''),
away_team=away.get('name', ''),
home_team_abbrev=home.get('abbreviation', ''),
away_team_abbrev=away.get('abbreviation', ''),
venue=venue.get('name', ''),
source='statsapi.mlb.com'
)
games.append(game)
                except Exception:
                    continue
except Exception as e:
print(f" Error fetching MLB API: {e}")
print(f" Found {len(games)} games from MLB Stats API")
return games
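The `gameDate` field from statsapi is an ISO-8601 UTC timestamp (e.g. `2025-04-01T23:10:00Z`), so the slice above yields the UTC start time, not local arena time. A standalone sketch of that extraction:

```python
def extract_utc_time(game_date):
    """Pull HH:MM from a statsapi 'gameDate' like '2025-04-01T23:10:00Z'.
    Note: this is UTC, not the arena's local time."""
    if 'T' in game_date:
        return game_date.split('T')[1][:5]
    return None

print(extract_utc_time('2025-04-01T23:10:00Z'))  # 23:10
print(extract_utc_time(''))                      # None
```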
# =============================================================================
# SCRAPERS - NHL
# =============================================================================
def scrape_nhl_hockey_reference(season: int) -> list[Game]:
"""
Scrape NHL schedule from Hockey-Reference.
URL: https://www.hockey-reference.com/leagues/NHL_{YEAR}_games.html
"""
games = []
url = f"https://www.hockey-reference.com/leagues/NHL_{season}_games.html"
print(f"Scraping NHL {season} from Hockey-Reference...")
soup = fetch_page(url, 'hockey-reference.com')
if not soup:
return games
table = soup.find('table', {'id': 'games'})
if not table:
print(" Could not find games table")
return games
tbody = table.find('tbody')
if not tbody:
return games
for row in tbody.find_all('tr'):
try:
cells = row.find_all(['td', 'th'])
if len(cells) < 5:
continue
# Parse date
date_cell = row.find('th', {'data-stat': 'date_game'})
if not date_cell:
continue
date_link = date_cell.find('a')
date_str = date_link.text if date_link else date_cell.text
# Parse teams
visitor_cell = row.find('td', {'data-stat': 'visitor_team_name'})
home_cell = row.find('td', {'data-stat': 'home_team_name'})
if not visitor_cell or not home_cell:
continue
visitor_link = visitor_cell.find('a')
home_link = home_cell.find('a')
away_team = visitor_link.text if visitor_link else visitor_cell.text
home_team = home_link.text if home_link else home_cell.text
# Convert date
try:
parsed_date = datetime.strptime(date_str.strip(), '%Y-%m-%d')
date_formatted = parsed_date.strftime('%Y-%m-%d')
            except ValueError:
continue
game_id = f"nhl_{date_formatted}_{away_team[:3]}_{home_team[:3]}".lower().replace(' ', '')
game = Game(
id=game_id,
sport='NHL',
season=f"{season-1}-{str(season)[2:]}",
date=date_formatted,
time=None,
home_team=home_team,
away_team=away_team,
home_team_abbrev=get_team_abbrev(home_team, 'NHL'),
away_team_abbrev=get_team_abbrev(away_team, 'NHL'),
venue='',
source='hockey-reference.com'
)
games.append(game)
        except Exception:
            continue
print(f" Found {len(games)} games from Hockey-Reference")
return games
def scrape_nhl_api(season: int) -> list[Game]:
"""
Fetch NHL schedule from official API (JSON).
URL: https://api-web.nhle.com/v1/schedule/{YYYY-MM-DD}
"""
games = []
print(f"Fetching NHL {season} from NHL API...")
# NHL API provides club schedules
# We'd need to iterate through dates or teams
# Simplified implementation here
return games
# =============================================================================
# STADIUM SCRAPER
# =============================================================================
def scrape_stadiums_hifld() -> list[Stadium]:
"""
Fetch stadium data from HIFLD Open Data (US Government).
Returns GeoJSON with coordinates.
"""
stadiums = []
url = "https://services1.arcgis.com/Hp6G80Pky0om7QvQ/arcgis/rest/services/Major_Sport_Venues/FeatureServer/0/query?where=1%3D1&outFields=*&outSR=4326&f=json"
print("Fetching stadiums from HIFLD Open Data...")
try:
response = requests.get(url, timeout=30)
response.raise_for_status()
data = response.json()
for feature in data.get('features', []):
attrs = feature.get('attributes', {})
geom = feature.get('geometry', {})
            # Keep only NBA, MLB, NHL venues
            sport_map = {'NBA': 'NBA', 'MLB': 'MLB', 'NHL': 'NHL'}
            league = attrs.get('LEAGUE', '')
            if league not in sport_map:
                continue
stadium = Stadium(
id=f"hifld_{attrs.get('OBJECTID', '')}",
name=attrs.get('NAME', ''),
city=attrs.get('CITY', ''),
state=attrs.get('STATE', ''),
latitude=geom.get('y', 0),
longitude=geom.get('x', 0),
capacity=attrs.get('CAPACITY', 0) or 0,
sport=sport_map.get(league, ''),
team_abbrevs=[attrs.get('TEAM', '')],
source='hifld.gov',
year_opened=attrs.get('YEAR_OPEN')
)
stadiums.append(stadium)
except Exception as e:
print(f" Error fetching HIFLD data: {e}")
print(f" Found {len(stadiums)} stadiums from HIFLD")
return stadiums
def generate_stadiums_from_teams() -> list[Stadium]:
"""
Generate stadium data from team mappings with manual coordinates.
This serves as a fallback/validation source.
"""
stadiums = []
# NBA Arenas with coordinates (manually curated)
nba_coords = {
'State Farm Arena': (33.7573, -84.3963),
'TD Garden': (42.3662, -71.0621),
'Barclays Center': (40.6826, -73.9754),
'Spectrum Center': (35.2251, -80.8392),
'United Center': (41.8807, -87.6742),
'Rocket Mortgage FieldHouse': (41.4965, -81.6882),
'American Airlines Center': (32.7905, -96.8103),
'Ball Arena': (39.7487, -105.0077),
'Little Caesars Arena': (42.3411, -83.0553),
'Chase Center': (37.7680, -122.3879),
'Toyota Center': (29.7508, -95.3621),
'Gainbridge Fieldhouse': (39.7640, -86.1555),
'Intuit Dome': (33.9425, -118.3419),
'Crypto.com Arena': (34.0430, -118.2673),
'FedExForum': (35.1382, -90.0506),
'Kaseya Center': (25.7814, -80.1870),
'Fiserv Forum': (43.0451, -87.9174),
'Target Center': (44.9795, -93.2761),
'Smoothie King Center': (29.9490, -90.0821),
'Madison Square Garden': (40.7505, -73.9934),
'Paycom Center': (35.4634, -97.5151),
'Kia Center': (28.5392, -81.3839),
'Wells Fargo Center': (39.9012, -75.1720),
'Footprint Center': (33.4457, -112.0712),
'Moda Center': (45.5316, -122.6668),
'Golden 1 Center': (38.5802, -121.4997),
'Frost Bank Center': (29.4270, -98.4375),
'Scotiabank Arena': (43.6435, -79.3791),
'Delta Center': (40.7683, -111.9011),
'Capital One Arena': (38.8982, -77.0209),
}
for abbrev, info in NBA_TEAMS.items():
arena = info['arena']
coords = nba_coords.get(arena, (0, 0))
stadium = Stadium(
id=f"manual_nba_{abbrev.lower()}",
name=arena,
city=info['city'],
state='',
latitude=coords[0],
longitude=coords[1],
capacity=0,
sport='NBA',
team_abbrevs=[abbrev],
source='manual'
)
stadiums.append(stadium)
# MLB Stadiums with coordinates
mlb_coords = {
'Chase Field': (33.4453, -112.0667, 'AZ', 48686),
'Truist Park': (33.8907, -84.4678, 'GA', 41084),
'Oriole Park at Camden Yards': (39.2838, -76.6218, 'MD', 45971),
'Fenway Park': (42.3467, -71.0972, 'MA', 37755),
'Wrigley Field': (41.9484, -87.6553, 'IL', 41649),
'Guaranteed Rate Field': (41.8299, -87.6338, 'IL', 40615),
'Great American Ball Park': (39.0979, -84.5082, 'OH', 42319),
'Progressive Field': (41.4962, -81.6852, 'OH', 34830),
'Coors Field': (39.7559, -104.9942, 'CO', 50144),
'Comerica Park': (42.3390, -83.0485, 'MI', 41083),
'Minute Maid Park': (29.7573, -95.3555, 'TX', 41168),
'Kauffman Stadium': (39.0517, -94.4803, 'MO', 37903),
'Angel Stadium': (33.8003, -117.8827, 'CA', 45517),
'Dodger Stadium': (34.0739, -118.2400, 'CA', 56000),
'LoanDepot Park': (25.7781, -80.2196, 'FL', 36742),
'American Family Field': (43.0280, -87.9712, 'WI', 41900),
'Target Field': (44.9817, -93.2776, 'MN', 38544),
'Citi Field': (40.7571, -73.8458, 'NY', 41922),
'Yankee Stadium': (40.8296, -73.9262, 'NY', 46537),
'Sutter Health Park': (38.5802, -121.5097, 'CA', 14014),
'Citizens Bank Park': (39.9061, -75.1665, 'PA', 42792),
'PNC Park': (40.4469, -80.0057, 'PA', 38362),
'Petco Park': (32.7076, -117.1570, 'CA', 40209),
'Oracle Park': (37.7786, -122.3893, 'CA', 41265),
'T-Mobile Park': (47.5914, -122.3325, 'WA', 47929),
'Busch Stadium': (38.6226, -90.1928, 'MO', 45494),
'Tropicana Field': (27.7682, -82.6534, 'FL', 25000),
'Globe Life Field': (32.7473, -97.0845, 'TX', 40300),
'Rogers Centre': (43.6414, -79.3894, 'ON', 49282),
'Nationals Park': (38.8730, -77.0074, 'DC', 41339),
}
for abbrev, info in MLB_TEAMS.items():
stadium_name = info['stadium']
coord_data = mlb_coords.get(stadium_name, (0, 0, '', 0))
stadium = Stadium(
id=f"manual_mlb_{abbrev.lower()}",
name=stadium_name,
city=info['city'],
state=coord_data[2] if len(coord_data) > 2 else '',
latitude=coord_data[0],
longitude=coord_data[1],
capacity=coord_data[3] if len(coord_data) > 3 else 0,
sport='MLB',
team_abbrevs=[abbrev],
source='manual'
)
stadiums.append(stadium)
# NHL Arenas with coordinates
nhl_coords = {
'Honda Center': (33.8078, -117.8765, 'CA', 17174),
'Delta Center': (40.7683, -111.9011, 'UT', 18306),
'TD Garden': (42.3662, -71.0621, 'MA', 17565),
'KeyBank Center': (42.8750, -78.8764, 'NY', 19070),
'Scotiabank Saddledome': (51.0374, -114.0519, 'AB', 19289),
'PNC Arena': (35.8034, -78.7220, 'NC', 18680),
'United Center': (41.8807, -87.6742, 'IL', 19717),
'Ball Arena': (39.7487, -105.0077, 'CO', 18007),
'Nationwide Arena': (39.9693, -83.0061, 'OH', 18500),
'American Airlines Center': (32.7905, -96.8103, 'TX', 18532),
'Little Caesars Arena': (42.3411, -83.0553, 'MI', 19515),
'Rogers Place': (53.5469, -113.4978, 'AB', 18347),
'Amerant Bank Arena': (26.1584, -80.3256, 'FL', 19250),
'Crypto.com Arena': (34.0430, -118.2673, 'CA', 18230),
'Xcel Energy Center': (44.9448, -93.1010, 'MN', 17954),
'Bell Centre': (45.4961, -73.5693, 'QC', 21302),
'Bridgestone Arena': (36.1592, -86.7785, 'TN', 17159),
'Prudential Center': (40.7334, -74.1712, 'NJ', 16514),
'UBS Arena': (40.7161, -73.7246, 'NY', 17255),
'Madison Square Garden': (40.7505, -73.9934, 'NY', 18006),
'Canadian Tire Centre': (45.2969, -75.9272, 'ON', 18652),
'Wells Fargo Center': (39.9012, -75.1720, 'PA', 19543),
'PPG Paints Arena': (40.4395, -79.9892, 'PA', 18387),
'SAP Center': (37.3327, -121.9010, 'CA', 17562),
'Climate Pledge Arena': (47.6221, -122.3540, 'WA', 17100),
'Enterprise Center': (38.6268, -90.2025, 'MO', 18096),
'Amalie Arena': (27.9426, -82.4519, 'FL', 19092),
'Scotiabank Arena': (43.6435, -79.3791, 'ON', 18819),
'Rogers Arena': (49.2778, -123.1089, 'BC', 18910),
'T-Mobile Arena': (36.1028, -115.1784, 'NV', 17500),
'Capital One Arena': (38.8982, -77.0209, 'DC', 18573),
'Canada Life Centre': (49.8928, -97.1436, 'MB', 15321),
}
for abbrev, info in NHL_TEAMS.items():
arena_name = info['arena']
coord_data = nhl_coords.get(arena_name, (0, 0, '', 0))
stadium = Stadium(
id=f"manual_nhl_{abbrev.lower()}",
name=arena_name,
city=info['city'],
state=coord_data[2] if len(coord_data) > 2 else '',
latitude=coord_data[0],
longitude=coord_data[1],
capacity=coord_data[3] if len(coord_data) > 3 else 0,
sport='NHL',
team_abbrevs=[abbrev],
source='manual'
)
stadiums.append(stadium)
return stadiums
# =============================================================================
# HELPERS
# =============================================================================
def assign_stable_ids(games: list[Game], sport: str, season: str) -> list[Game]:
"""
Assign stable IDs based on matchup + occurrence number within season.
Format: {sport}_{season}_{away}_{home}_{num}
This ensures IDs don't change when games are rescheduled.
"""
from collections import defaultdict
# Group games by matchup (away @ home)
matchups = defaultdict(list)
for game in games:
key = f"{game.away_team_abbrev}_{game.home_team_abbrev}"
matchups[key].append(game)
# Sort each matchup by date and assign occurrence number
for key, matchup_games in matchups.items():
matchup_games.sort(key=lambda g: g.date)
for i, game in enumerate(matchup_games, 1):
away = game.away_team_abbrev.lower()
home = game.home_team_abbrev.lower()
            # Strip the hyphen from the season for compact IDs (e.g., "2024-25" -> "202425")
            season_str = season.replace('-', '')
game.id = f"{sport.lower()}_{season_str}_{away}_{home}_{i}"
return games
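The scheme above keys occurrence numbers to the matchup rather than the calendar slot, so a rescheduled game keeps its ID. A self-contained sketch using a simplified `MiniGame` stand-in (hypothetical, with just enough fields to show the format):

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class MiniGame:  # hypothetical stand-in for the full Game dataclass
    date: str
    away_team_abbrev: str
    home_team_abbrev: str
    id: str = ''

def mini_stable_ids(games, sport, season):
    matchups = defaultdict(list)
    for g in games:
        matchups[(g.away_team_abbrev, g.home_team_abbrev)].append(g)
    for (away, home), same_matchup in matchups.items():
        same_matchup.sort(key=lambda g: g.date)  # occurrence number follows date order
        for i, g in enumerate(same_matchup, 1):
            g.id = f"{sport.lower()}_{season.replace('-', '')}_{away.lower()}_{home.lower()}_{i}"
    return games

games = [MiniGame('2025-01-10', 'BOS', 'LAL'),
         MiniGame('2024-11-02', 'BOS', 'LAL')]
mini_stable_ids(games, 'NBA', '2024-25')
print(games[1].id)  # nba_202425_bos_lal_1 -- the earlier game, despite list order
print(games[0].id)  # nba_202425_bos_lal_2
```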
def get_team_abbrev(team_name: str, sport: str) -> str:
"""Get team abbreviation from full name."""
teams = {'NBA': NBA_TEAMS, 'MLB': MLB_TEAMS, 'NHL': NHL_TEAMS}.get(sport, {})
for abbrev, info in teams.items():
if info['name'].lower() == team_name.lower():
return abbrev
if team_name.lower() in info['name'].lower():
return abbrev
# Return first 3 letters as fallback
return team_name[:3].upper()
def validate_games(games_by_source: dict) -> dict:
"""
Cross-validate games from multiple sources.
Returns discrepancies.
"""
discrepancies = {
'missing_in_source': [],
'date_mismatch': [],
'time_mismatch': [],
'venue_mismatch': [],
}
sources = list(games_by_source.keys())
if len(sources) < 2:
return discrepancies
primary = sources[0]
primary_games = {g.id: g for g in games_by_source[primary]}
for source in sources[1:]:
secondary_games = {g.id: g for g in games_by_source[source]}
for game_id, game in primary_games.items():
if game_id not in secondary_games:
discrepancies['missing_in_source'].append({
'game_id': game_id,
'present_in': primary,
'missing_in': source
})
return discrepancies
def export_to_json(games: list[Game], stadiums: list[Stadium], output_dir: Path):
"""Export scraped data to JSON files."""
output_dir.mkdir(parents=True, exist_ok=True)
# Export games
games_data = [asdict(g) for g in games]
with open(output_dir / 'games.json', 'w') as f:
json.dump(games_data, f, indent=2)
# Export stadiums
stadiums_data = [asdict(s) for s in stadiums]
with open(output_dir / 'stadiums.json', 'w') as f:
json.dump(stadiums_data, f, indent=2)
# Export as CSV for easy viewing
if games:
df_games = pd.DataFrame(games_data)
df_games.to_csv(output_dir / 'games.csv', index=False)
if stadiums:
df_stadiums = pd.DataFrame(stadiums_data)
df_stadiums.to_csv(output_dir / 'stadiums.csv', index=False)
print(f"\nExported to {output_dir}")
# =============================================================================
# MAIN
# =============================================================================
def main():
parser = argparse.ArgumentParser(description='Scrape sports schedules')
parser.add_argument('--sport', choices=['nba', 'mlb', 'nhl', 'all'], default='all')
parser.add_argument('--season', type=int, default=2025, help='Season year (ending year)')
parser.add_argument('--stadiums-only', action='store_true', help='Only scrape stadium data')
parser.add_argument('--output', type=str, default='./data', help='Output directory')
args = parser.parse_args()
output_dir = Path(args.output)
all_games = []
all_stadiums = []
# Scrape stadiums
print("\n" + "="*60)
print("SCRAPING STADIUMS")
print("="*60)
all_stadiums.extend(scrape_stadiums_hifld())
all_stadiums.extend(generate_stadiums_from_teams())
if args.stadiums_only:
export_to_json([], all_stadiums, output_dir)
return
# Scrape schedules
if args.sport in ['nba', 'all']:
print("\n" + "="*60)
print(f"SCRAPING NBA {args.season}")
print("="*60)
nba_games_br = scrape_nba_basketball_reference(args.season)
nba_season = f"{args.season-1}-{str(args.season)[2:]}" # e.g., "2024-25"
nba_games_br = assign_stable_ids(nba_games_br, 'NBA', nba_season)
all_games.extend(nba_games_br)
if args.sport in ['mlb', 'all']:
print("\n" + "="*60)
print(f"SCRAPING MLB {args.season}")
print("="*60)
mlb_games_api = scrape_mlb_statsapi(args.season)
# MLB API uses official gamePk which is already stable - no reassignment needed
all_games.extend(mlb_games_api)
if args.sport in ['nhl', 'all']:
print("\n" + "="*60)
print(f"SCRAPING NHL {args.season}")
print("="*60)
nhl_games_hr = scrape_nhl_hockey_reference(args.season)
nhl_season = f"{args.season-1}-{str(args.season)[2:]}" # e.g., "2024-25"
nhl_games_hr = assign_stable_ids(nhl_games_hr, 'NHL', nhl_season)
all_games.extend(nhl_games_hr)
# Export
print("\n" + "="*60)
print("EXPORTING DATA")
print("="*60)
export_to_json(all_games, all_stadiums, output_dir)
# Summary
print("\n" + "="*60)
print("SUMMARY")
print("="*60)
print(f"Total games scraped: {len(all_games)}")
print(f"Total stadiums: {len(all_stadiums)}")
# Games by sport
by_sport = {}
for g in all_games:
by_sport[g.sport] = by_sport.get(g.sport, 0) + 1
for sport, count in by_sport.items():
print(f" {sport}: {count} games")
if __name__ == '__main__':
main()

Scripts/test_cloudkit.py Normal file
#!/usr/bin/env python3
"""Quick test to query CloudKit records."""
import json, hashlib, base64, requests, os, sys
from datetime import datetime, timezone
try:
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.hazmat.backends import default_backend
except ImportError:
sys.exit("Error: pip install cryptography")
CONTAINER = "iCloud.com.sportstime.app"
HOST = "https://api.apple-cloudkit.com"
def sign(key_data, date, body, path):
key = serialization.load_pem_private_key(key_data, None, default_backend())
body_hash = base64.b64encode(hashlib.sha256(body.encode()).digest()).decode()
sig = key.sign(f"{date}:{body_hash}:{path}".encode(), ec.ECDSA(hashes.SHA256()))
return base64.b64encode(sig).decode()
def query(key_id, key_data, record_type, env='development'):
path = f"/database/1/{CONTAINER}/{env}/public/records/query"
body = json.dumps({
'query': {'recordType': record_type},
'resultsLimit': 10
})
date = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
headers = {
'Content-Type': 'application/json',
'X-Apple-CloudKit-Request-KeyID': key_id,
'X-Apple-CloudKit-Request-ISO8601Date': date,
'X-Apple-CloudKit-Request-SignatureV1': sign(key_data, date, body, path),
}
r = requests.post(f"{HOST}{path}", headers=headers, data=body, timeout=30)
return r.status_code, r.json()
if __name__ == '__main__':
key_id = os.environ.get('CLOUDKIT_KEY_ID') or (sys.argv[1] if len(sys.argv) > 1 else None)
key_file = os.environ.get('CLOUDKIT_KEY_FILE') or (sys.argv[2] if len(sys.argv) > 2 else 'eckey.pem')
if not key_id:
sys.exit("Usage: python test_cloudkit.py KEY_ID [KEY_FILE]")
key_data = open(key_file, 'rb').read()
print("Testing CloudKit connection...\n")
for record_type in ['Stadium', 'Team', 'Game']:
status, result = query(key_id, key_data, record_type)
count = len(result.get('records', []))
print(f"{record_type}: status={status}, records={count}")
if count > 0:
print(f" Sample: {result['records'][0].get('recordName', 'N/A')}")
if 'serverErrorCode' in result:
print(f" Error: {result.get('serverErrorCode')}: {result.get('reason')}")
print("\nFull response for Stadium query:")
status, result = query(key_id, key_data, 'Stadium')
print(json.dumps(result, indent=2)[:1000])
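CloudKit server-to-server auth signs the string `date:base64(sha256(body)):subpath` with the container's EC private key, which is what `sign()` above builds. The message-construction half needs no key material and can be checked with the stdlib alone:

```python
import base64
import hashlib

def signing_message(date, body, subpath):
    """Build the string that sign() passes to ECDSA (the key-free half of the flow)."""
    body_hash = base64.b64encode(hashlib.sha256(body.encode()).digest()).decode()
    return f"{date}:{body_hash}:{subpath}"

# The middle component is the base64 SHA-256 of the request body:
print(signing_message('d', '{}', 'p'))
# d:RBNvo1WzZ4oRRq0W9+hknpT7T8If536DEMBg9hyq/4o=:p
```

A mismatch in any of the three components (stale date, re-serialized body, or a path that includes the host) produces CloudKit's `AUTHENTICATION_FAILED` response, so this is the first thing to check when the script returns 401s.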

Scripts/validate_data.py Normal file
#!/usr/bin/env python3
"""
Cross-Validation System for SportsTime App
Compares scraped data from multiple sources and flags discrepancies.
Usage:
python validate_data.py --data-dir ./data
python validate_data.py --scrape-and-validate --season 2025
"""
import argparse
import json
from datetime import datetime
from pathlib import Path
from dataclasses import dataclass, asdict, field
from typing import Optional
from collections import defaultdict
# Import scrapers from main script
from scrape_schedules import (
Game, Stadium,
scrape_nba_basketball_reference,
scrape_mlb_statsapi, scrape_mlb_baseball_reference,
scrape_nhl_hockey_reference,
NBA_TEAMS, MLB_TEAMS, NHL_TEAMS,
assign_stable_ids,
)
# =============================================================================
# VALIDATION DATA CLASSES
# =============================================================================
@dataclass
class Discrepancy:
"""Represents a discrepancy between sources."""
game_key: str
field: str # 'date', 'time', 'venue', 'teams', 'missing'
source1: str
source2: str
value1: str
value2: str
severity: str # 'high', 'medium', 'low'
@dataclass
class ValidationReport:
"""Summary of validation results."""
sport: str
season: str
sources: list
total_games_source1: int = 0
total_games_source2: int = 0
games_matched: int = 0
games_missing_source1: int = 0
games_missing_source2: int = 0
discrepancies: list = field(default_factory=list)
def to_dict(self):
return {
'sport': self.sport,
'season': self.season,
'sources': self.sources,
'total_games_source1': self.total_games_source1,
'total_games_source2': self.total_games_source2,
'games_matched': self.games_matched,
'games_missing_source1': self.games_missing_source1,
'games_missing_source2': self.games_missing_source2,
'discrepancies': [asdict(d) for d in self.discrepancies],
'discrepancy_summary': self.get_summary()
}
def get_summary(self):
by_field = defaultdict(int)
by_severity = defaultdict(int)
for d in self.discrepancies:
by_field[d.field] += 1
by_severity[d.severity] += 1
return {
'by_field': dict(by_field),
'by_severity': dict(by_severity)
}
# =============================================================================
# GAME KEY GENERATION
# =============================================================================
def normalize_abbrev(abbrev: str, sport: str) -> str:
"""Normalize team abbreviations across different sources."""
abbrev = abbrev.upper().strip()
if sport == 'MLB':
# MLB abbreviation mappings between sources
mlb_mappings = {
'AZ': 'ARI', 'ARI': 'ARI', # Arizona
'ATH': 'OAK', 'OAK': 'OAK', # Oakland/Athletics
'CWS': 'CHW', 'CHW': 'CHW', # Chicago White Sox
'KC': 'KCR', 'KCR': 'KCR', # Kansas City
'SD': 'SDP', 'SDP': 'SDP', # San Diego
'SF': 'SFG', 'SFG': 'SFG', # San Francisco
'TB': 'TBR', 'TBR': 'TBR', # Tampa Bay
'WSH': 'WSN', 'WSN': 'WSN', # Washington
}
return mlb_mappings.get(abbrev, abbrev)
elif sport == 'NBA':
nba_mappings = {
'PHX': 'PHO', 'PHO': 'PHO', # Phoenix
'BKN': 'BRK', 'BRK': 'BRK', # Brooklyn
'CHA': 'CHO', 'CHO': 'CHO', # Charlotte
'NOP': 'NOP', 'NO': 'NOP', # New Orleans
}
return nba_mappings.get(abbrev, abbrev)
elif sport == 'NHL':
nhl_mappings = {
'ARI': 'UTA', 'UTA': 'UTA', # Arizona moved to Utah
'VGS': 'VGK', 'VGK': 'VGK', # Vegas
}
return nhl_mappings.get(abbrev, abbrev)
return abbrev
def generate_game_key(game: Game) -> str:
"""
Generate a unique key for matching games across sources.
Uses date + normalized team abbreviations (sorted) to match.
"""
home = normalize_abbrev(game.home_team_abbrev, game.sport)
away = normalize_abbrev(game.away_team_abbrev, game.sport)
teams = sorted([home, away])
return f"{game.date}_{teams[0]}_{teams[1]}"
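With the normalization above, the same fixture reported under different abbreviation conventions collides on a single key. A minimal sketch using a small subset of the MLB aliases (illustrative, not the full table):

```python
MLB_ALIASES = {'AZ': 'ARI', 'SD': 'SDP', 'CWS': 'CHW'}  # subset for illustration

def mini_game_key(date, home_abbrev, away_abbrev):
    norm = lambda a: MLB_ALIASES.get(a.upper().strip(), a.upper().strip())
    teams = sorted([norm(home_abbrev), norm(away_abbrev)])
    return f"{date}_{teams[0]}_{teams[1]}"

# statsapi says 'AZ'/'SD'; Baseball-Reference says 'ARI'/'SDP' -- same key:
k1 = mini_game_key('2025-04-01', 'AZ', 'SD')
k2 = mini_game_key('2025-04-01', 'ARI', 'SDP')
print(k1 == k2)  # True
print(k1)        # 2025-04-01_ARI_SDP
```

Sorting the two abbreviations also makes the key insensitive to which team a source lists as home, at the cost of not distinguishing a home/away swap on the same date.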
def normalize_team_name(name: str, sport: str) -> str:
"""Normalize team name variations."""
teams = {'NBA': NBA_TEAMS, 'MLB': MLB_TEAMS, 'NHL': NHL_TEAMS}.get(sport, {})
name_lower = name.lower().strip()
# Check against known team names
for abbrev, info in teams.items():
if name_lower == info['name'].lower():
return abbrev
# Check city match
if name_lower == info['city'].lower():
return abbrev
# Check partial match
if name_lower in info['name'].lower() or info['name'].lower() in name_lower:
return abbrev
return name[:3].upper()
def normalize_venue(venue: str) -> str:
"""Normalize venue name for comparison."""
    # Lowercase and strip generic suffixes so sponsor renames still match.
    # This runs on both sides of every comparison, so consistency matters more
    # than producing a clean display name.
    normalized = venue.lower().strip()
    replacements = [
('at ', ''),
('the ', ''),
(' stadium', ''),
(' arena', ''),
(' center', ''),
(' field', ''),
(' park', ''),
('.com', ''),
('crypto', 'crypto.com'),
]
for old, new in replacements:
normalized = normalized.replace(old, new)
return normalized.strip()
def normalize_time(time_str: Optional[str]) -> Optional[str]:
"""Normalize time format to HH:MM."""
if not time_str:
return None
    time_str = time_str.strip().lower()
    # Basketball-Reference writes times like '7:30p'; pad the trailing meridiem
    if time_str.endswith(('a', 'p')):
        time_str += 'm'
    # Handle 12-hour formats
    if 'pm' in time_str or 'am' in time_str:
        for fmt in ['%I:%M%p', '%I%p']:
            try:
                dt = datetime.strptime(time_str.replace(' ', ''), fmt)
                return dt.strftime('%H:%M')
            except ValueError:
                continue
# Already 24-hour or just numbers
if ':' in time_str:
parts = time_str.split(':')
if len(parts) >= 2:
try:
hour = int(parts[0])
minute = int(parts[1][:2])
return f"{hour:02d}:{minute:02d}"
            except ValueError:
                pass
return time_str
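Time normalization can be exercised in isolation; this standalone sketch pads a bare trailing meridiem (Basketball-Reference's `7:30p` style) before parsing:

```python
from datetime import datetime

def to_24h(time_str):
    """Normalize strings like '7:30p', '7:30 PM', or '10:00am' to HH:MM."""
    s = time_str.strip().lower().replace(' ', '')
    if s.endswith(('a', 'p')):
        s += 'm'  # Basketball-Reference omits the trailing 'm'
    for fmt in ('%I:%M%p', '%I%p'):
        try:
            return datetime.strptime(s, fmt).strftime('%H:%M')
        except ValueError:
            continue
    return time_str  # leave 24-hour or unparseable values untouched

print(to_24h('7:30p'))     # 19:30
print(to_24h('10:00 AM'))  # 10:00
print(to_24h('19:00'))     # 19:00
```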
# =============================================================================
# CROSS-VALIDATION LOGIC
# =============================================================================
def validate_games(
games1: list[Game],
games2: list[Game],
source1_name: str,
source2_name: str,
sport: str,
season: str
) -> ValidationReport:
"""
Compare two lists of games and find discrepancies.
"""
report = ValidationReport(
sport=sport,
season=season,
sources=[source1_name, source2_name],
total_games_source1=len(games1),
total_games_source2=len(games2)
)
# Index games by key
games1_by_key = {}
for g in games1:
key = generate_game_key(g)
games1_by_key[key] = g
games2_by_key = {}
for g in games2:
key = generate_game_key(g)
games2_by_key[key] = g
# Find matches and discrepancies
all_keys = set(games1_by_key.keys()) | set(games2_by_key.keys())
for key in all_keys:
g1 = games1_by_key.get(key)
g2 = games2_by_key.get(key)
if g1 and g2:
# Both sources have this game - compare fields
report.games_matched += 1
# Compare dates (should match by key, but double-check)
if g1.date != g2.date:
report.discrepancies.append(Discrepancy(
game_key=key,
field='date',
source1=source1_name,
source2=source2_name,
value1=g1.date,
value2=g2.date,
severity='high'
))
# Compare times
time1 = normalize_time(g1.time)
time2 = normalize_time(g2.time)
if time1 and time2 and time1 != time2:
# Check if times are close (within 1 hour - could be timezone)
try:
t1 = datetime.strptime(time1, '%H:%M')
t2 = datetime.strptime(time2, '%H:%M')
diff_minutes = abs((t1 - t2).total_seconds() / 60)
severity = 'low' if diff_minutes <= 60 else 'medium'
                except ValueError:
severity = 'medium'
report.discrepancies.append(Discrepancy(
game_key=key,
field='time',
source1=source1_name,
source2=source2_name,
value1=time1 or '',
value2=time2 or '',
severity=severity
))
# Compare venues
venue1 = normalize_venue(g1.venue) if g1.venue else ''
venue2 = normalize_venue(g2.venue) if g2.venue else ''
if venue1 and venue2 and venue1 != venue2:
# Check for partial match
if venue1 not in venue2 and venue2 not in venue1:
report.discrepancies.append(Discrepancy(
game_key=key,
field='venue',
source1=source1_name,
source2=source2_name,
value1=g1.venue,
value2=g2.venue,
severity='low'
))
elif g1 and not g2:
# Game only in source 1
report.games_missing_source2 += 1
# Determine severity based on date
# Spring training (March before ~25th) and playoffs (Oct+) are expected differences
severity = 'high'
try:
game_date = datetime.strptime(g1.date, '%Y-%m-%d')
month = game_date.month
day = game_date.day
if month == 3 and day < 26: # Spring training
severity = 'medium'
elif month >= 10: # Playoffs/postseason
severity = 'medium'
            except ValueError:
pass
report.discrepancies.append(Discrepancy(
game_key=key,
field='missing',
source1=source1_name,
source2=source2_name,
value1=f"{g1.away_team} @ {g1.home_team}",
value2='NOT FOUND',
severity=severity
))
else:
# Game only in source 2
report.games_missing_source1 += 1
# Determine severity based on date
severity = 'high'
try:
game_date = datetime.strptime(g2.date, '%Y-%m-%d')
month = game_date.month
day = game_date.day
if month == 3 and day < 26: # Spring training
severity = 'medium'
elif month >= 10: # Playoffs/postseason
severity = 'medium'
            except ValueError:
pass
report.discrepancies.append(Discrepancy(
game_key=key,
field='missing',
source1=source1_name,
source2=source2_name,
value1='NOT FOUND',
value2=f"{g2.away_team} @ {g2.home_team}",
severity=severity
))
return report
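The matching logic above reduces to dict lookups over the union of keys; a minimal sketch with hypothetical game keys:

```python
# Minimal sketch of the key-matching in validate_games (hypothetical data).
games1 = {'2025-04-01:NYY:BOS': {'time': '19:05'}}
games2 = {'2025-04-01:NYY:BOS': {'time': '19:05'},
          '2025-04-02:NYY:BOS': {'time': '13:05'}}

matched, missing_in_1, missing_in_2 = [], [], []
for key in set(games1) | set(games2):
    if key in games1 and key in games2:
        matched.append(key)       # compare fields here
    elif key in games1:
        missing_in_2.append(key)  # only source 1 has it
    else:
        missing_in_1.append(key)  # only source 2 has it

print(len(matched), len(missing_in_1), len(missing_in_2))  # 1 1 0
```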
def validate_stadiums(stadiums: list[Stadium]) -> list[dict]:
"""
Validate stadium data for completeness and accuracy.
"""
issues = []
for s in stadiums:
# Check for missing coordinates
if s.latitude == 0 or s.longitude == 0:
issues.append({
'stadium': s.name,
'sport': s.sport,
'issue': 'Missing coordinates',
'severity': 'high'
})
# Check for missing capacity
if s.capacity == 0:
issues.append({
'stadium': s.name,
'sport': s.sport,
'issue': 'Missing capacity',
'severity': 'low'
})
# Check coordinate bounds (roughly North America)
if s.latitude != 0:
if not (24 < s.latitude < 55):
issues.append({
'stadium': s.name,
'sport': s.sport,
'issue': f'Latitude {s.latitude} outside expected range',
'severity': 'medium'
})
if s.longitude != 0:
if not (-130 < s.longitude < -60):
issues.append({
'stadium': s.name,
'sport': s.sport,
'issue': f'Longitude {s.longitude} outside expected range',
'severity': 'medium'
})
return issues
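The coordinate checks above amount to a simple continental bounding box; as a sketch:

```python
def in_north_america(lat: float, lon: float) -> bool:
    """Rough bounding box used to sanity-check stadium coordinates."""
    return 24 < lat < 55 and -130 < lon < -60

print(in_north_america(40.8296, -73.9262))  # Bronx, NY -> True
print(in_north_america(51.5074, -0.1278))   # London -> False
```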
# =============================================================================
# MULTI-SOURCE SCRAPING
# =============================================================================
def scrape_nba_all_sources(season: int) -> dict:
"""Scrape NBA from all available sources."""
nba_season = f"{season-1}-{str(season)[2:]}"
games = scrape_nba_basketball_reference(season)
games = assign_stable_ids(games, 'NBA', nba_season)
return {
'basketball-reference': games,
# ESPN requires JS rendering, skip for now
}
def scrape_mlb_all_sources(season: int) -> dict:
"""Scrape MLB from all available sources."""
mlb_season = str(season)
# MLB API uses official gamePk - already stable
api_games = scrape_mlb_statsapi(season)
# Baseball-Reference needs stable IDs
br_games = scrape_mlb_baseball_reference(season)
br_games = assign_stable_ids(br_games, 'MLB', mlb_season)
return {
'statsapi.mlb.com': api_games,
'baseball-reference': br_games,
}
def scrape_nhl_all_sources(season: int) -> dict:
"""Scrape NHL from all available sources."""
nhl_season = f"{season-1}-{str(season)[2:]}"
games = scrape_nhl_hockey_reference(season)
games = assign_stable_ids(games, 'NHL', nhl_season)
return {
'hockey-reference': games,
# NHL API requires date iteration, skip for now
}
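The season-label convention used for NBA and NHL above maps a single ending year to a cross-year label:

```python
def cross_year_label(season: int) -> str:
    """Label for leagues whose season spans two calendar years (NBA/NHL)."""
    return f"{season - 1}-{str(season)[2:]}"

print(cross_year_label(2025))  # 2024-25
```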
# =============================================================================
# MAIN
# =============================================================================
def main():
parser = argparse.ArgumentParser(description='Validate sports data')
parser.add_argument('--data-dir', type=str, default='./data', help='Data directory')
parser.add_argument('--scrape-and-validate', action='store_true', help='Scrape fresh and validate')
parser.add_argument('--season', type=int, default=2025, help='Season year')
parser.add_argument('--sport', choices=['nba', 'mlb', 'nhl', 'all'], default='all')
parser.add_argument('--output', type=str, default='./data/validation_report.json')
args = parser.parse_args()
reports = []
stadium_issues = []
if args.scrape_and_validate:
print("\n" + "="*60)
print("CROSS-VALIDATION MODE")
print("="*60)
# MLB has two good sources - validate
if args.sport in ['mlb', 'all']:
print(f"\n--- MLB {args.season} ---")
mlb_sources = scrape_mlb_all_sources(args.season)
source_names = list(mlb_sources.keys())
if len(source_names) >= 2:
games1 = mlb_sources[source_names[0]]
games2 = mlb_sources[source_names[1]]
if games1 and games2:
report = validate_games(
games1, games2,
source_names[0], source_names[1],
'MLB', str(args.season)
)
reports.append(report)
print(f" Compared {report.total_games_source1} vs {report.total_games_source2} games")
print(f" Matched: {report.games_matched}")
print(f" Discrepancies: {len(report.discrepancies)}")
# NBA (single source for now, but validate data quality)
if args.sport in ['nba', 'all']:
print(f"\n--- NBA {args.season} ---")
nba_sources = scrape_nba_all_sources(args.season)
games = nba_sources.get('basketball-reference', [])
print(f" Got {len(games)} games from Basketball-Reference")
# Validate internal consistency
teams_seen = defaultdict(int)
for g in games:
teams_seen[g.home_team_abbrev] += 1
teams_seen[g.away_team_abbrev] += 1
# Each team should have ~82 games
for team, count in teams_seen.items():
if count < 70 or count > 95:
print(f" Warning: {team} has {count} games (expected ~82)")
else:
# Load existing data and validate
data_dir = Path(args.data_dir)
# Load games
games_file = data_dir / 'games.json'
if games_file.exists():
with open(games_file) as f:
games_data = json.load(f)
print(f"\nLoaded {len(games_data)} games from {games_file}")
# Group by sport and validate counts
by_sport = defaultdict(list)
for g in games_data:
by_sport[g['sport']].append(g)
for sport, sport_games in by_sport.items():
print(f" {sport}: {len(sport_games)} games")
# Load and validate stadiums
stadiums_file = data_dir / 'stadiums.json'
if stadiums_file.exists():
with open(stadiums_file) as f:
stadiums_data = json.load(f)
stadiums = [Stadium(**s) for s in stadiums_data]
print(f"\nLoaded {len(stadiums)} stadiums from {stadiums_file}")
stadium_issues = validate_stadiums(stadiums)
if stadium_issues:
print(f"\nStadium validation issues ({len(stadium_issues)}):")
for issue in stadium_issues[:10]:
print(f" [{issue['severity'].upper()}] {issue['stadium']}: {issue['issue']}")
if len(stadium_issues) > 10:
print(f" ... and {len(stadium_issues) - 10} more")
# Save validation report
output_path = Path(args.output)
output_path.parent.mkdir(parents=True, exist_ok=True)
full_report = {
'generated_at': datetime.now().isoformat(),
'season': args.season,
'game_validations': [r.to_dict() for r in reports],
'stadium_issues': stadium_issues
}
with open(output_path, 'w') as f:
json.dump(full_report, f, indent=2)
print(f"\n Validation report saved to {output_path}")
# Summary
print("\n" + "="*60)
print("VALIDATION SUMMARY")
print("="*60)
total_discrepancies = sum(len(r.discrepancies) for r in reports)
high_severity = sum(
1 for r in reports
for d in r.discrepancies
if d.severity == 'high'
)
print(f"Total game validation reports: {len(reports)}")
print(f"Total discrepancies found: {total_discrepancies}")
print(f"High severity issues: {high_severity}")
print(f"Stadium data issues: {len(stadium_issues)}")
if __name__ == '__main__':
main()