docs: add Phase 1 plans and codebase documentation

- 01-01-PLAN.md: core.py + mlb.py (executed)
- 01-02-PLAN.md: nba.py + nhl.py
- 01-03-PLAN.md: nfl.py + orchestrator refactor
- Codebase documentation for planning context

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 00:00:45 -06:00
parent 504187059f
commit 60b450d869
10 changed files with 1436 additions and 0 deletions
@@ -0,0 +1,127 @@
+---
+phase: 01-script-architecture
+plan: 01
+type: execute
+---
+
+<objective>
+Create shared core module and extract MLB scrapers as the first sport module.
+
+Purpose: Establish the modular pattern that subsequent sports will follow.
+Output: `Scripts/core.py` with shared utilities, `Scripts/mlb.py` with MLB scrapers.
+</objective>
+
+<execution_context>
+@~/.claude/get-shit-done/workflows/execute-phase.md
+@~/.claude/get-shit-done/templates/summary.md
+</execution_context>
+
+<context>
+@.planning/PROJECT.md
+@.planning/ROADMAP.md
+@.planning/STATE.md
+
+**Source file:**
+@Scripts/scrape_schedules.py
+
+**Codebase context:**
+@.planning/codebase/CONVENTIONS.md
+
+**Tech stack:** Python 3, requests, beautifulsoup4, pandas, lxml
+**Established patterns:** dataclasses, type hints, docstrings
+</context>
+
+<tasks>
+
+<task type="auto">
+  <name>Task 1: Create core.py shared module</name>
+  <files>Scripts/core.py</files>
+  <action>
+Create `Scripts/core.py` containing:
+
+1. Imports: argparse, json, time, re, datetime, timedelta, pathlib, dataclasses, typing, requests, BeautifulSoup, pandas
+
+2. Rate limiting utilities:
+   - `REQUEST_DELAY` constant (3.0)
+   - `last_request_time` dict
+   - `rate_limit(domain: str)` function
+   - `fetch_page(url: str, domain: str) -> Optional[BeautifulSoup]` function
+
+3. Data classes:
+   - `@dataclass Game` with all fields (id, sport, season, date, time, home_team, away_team, etc.)
+   - `@dataclass Stadium` with all fields (id, name, city, state, latitude, longitude, etc.)
+
+4. Multi-source fallback system:
+   - `@dataclass ScraperSource`
+   - `scrape_with_fallback(sport, season, sources, verbose)` function
+   - `@dataclass StadiumScraperSource`
+   - `scrape_stadiums_with_fallback(sport, sources, verbose)` function
+
+5. ID generation:
+   - `assign_stable_ids(games, sport, season)` function
+
+6. Export utilities:
+   - `export_to_json(games, stadiums, output_dir)` function
+   - `cross_validate_sources(games_by_source)` function
+
+Keep exact function signatures and logic from scrape_schedules.py. Use `__all__` to explicitly export public API.
+  </action>
+  <verify>python3 -c "from Scripts.core import Game, Stadium, ScraperSource, rate_limit, fetch_page, scrape_with_fallback, assign_stable_ids, export_to_json; print('OK')"</verify>
+  <done>core.py exists, imports successfully, exports all shared utilities</done>
+</task>
+
+<task type="auto">
+  <name>Task 2: Create mlb.py sport module</name>
+  <files>Scripts/mlb.py</files>
+  <action>
+Create `Scripts/mlb.py` containing:
+
+1. Import from core:
+   ```python
+   from core import Game, Stadium, ScraperSource, StadiumScraperSource, fetch_page, scrape_with_fallback, scrape_stadiums_with_fallback
+   ```
+
+2. MLB game scrapers (copy exact logic):
+   - `scrape_mlb_baseball_reference(season: int) -> list[Game]`
+   - `scrape_mlb_statsapi(season: int) -> list[Game]`
+   - `scrape_mlb_espn(season: int) -> list[Game]`
+
+3. MLB stadium scrapers:
+   - `scrape_mlb_stadiums_scorebot() -> list[Stadium]`
+   - `scrape_mlb_stadiums_geojson() -> list[Stadium]`
+   - `scrape_mlb_stadiums_hardcoded() -> list[Stadium]`
+   - `scrape_mlb_stadiums() -> list[Stadium]` (combines above with fallback)
+
+4. Source configurations:
+   - `MLB_GAME_SOURCES` list of ScraperSource
+   - `MLB_STADIUM_SOURCES` list of StadiumScraperSource
+
+5. Convenience function:
+   - `scrape_mlb_games(season: int) -> list[Game]` - uses fallback system
+
+Use `__all__` to export public API. Keep all team abbreviation mappings, venue name normalizations, and parsing logic intact.
+  </action>
+  <verify>cd Scripts && python3 -c "from mlb import scrape_mlb_games, scrape_mlb_stadiums, MLB_GAME_SOURCES; print('OK')"</verify>
+  <done>mlb.py exists, imports from core.py, exports MLB scrapers and source configs</done>
+</task>
+
+</tasks>
+
+<verification>
+Before declaring plan complete:
+- [ ] `Scripts/core.py` exists and imports cleanly
+- [ ] `Scripts/mlb.py` exists and imports from core
+- [ ] No syntax errors: `python3 -m py_compile Scripts/core.py Scripts/mlb.py`
+- [ ] Type hints present on all public functions
+</verification>
+
+<success_criteria>
+- core.py contains all shared utilities extracted from scrape_schedules.py
+- mlb.py contains all MLB-specific scrapers
+- Both files import without errors
+- Original scrape_schedules.py unchanged (we're creating new files first)
+</success_criteria>
+
+<output>
+After completion, create `.planning/phases/01-script-architecture/01-01-SUMMARY.md`
+</output>
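
Note on Task 1 of this plan: the rate-limiting utilities and the multi-source fallback listed for `core.py` could look roughly like the sketch below. It is a minimal illustration, not the code in `scrape_schedules.py`. The `ScraperSource` fields (`name`, `scraper`, `min_games`), the use of `time.monotonic`, and the demo scrapers are assumptions, because the plan names the classes and functions without giving their bodies.

```python
import time
from dataclasses import dataclass
from typing import Callable

REQUEST_DELAY = 3.0  # seconds between requests to one domain (value from the plan)
last_request_time: dict[str, float] = {}


def rate_limit(domain: str) -> None:
    """Sleep until REQUEST_DELAY seconds have passed since the last request to `domain`."""
    previous = last_request_time.get(domain)
    if previous is not None:
        remaining = REQUEST_DELAY - (time.monotonic() - previous)
        if remaining > 0:
            time.sleep(remaining)
    last_request_time[domain] = time.monotonic()


@dataclass
class ScraperSource:
    # Field names are assumptions; the plan only names the class.
    name: str
    scraper: Callable[[int], list]
    min_games: int = 1  # hypothetical threshold for accepting a result


def scrape_with_fallback(
    sport: str, season: int, sources: list[ScraperSource], verbose: bool = False
) -> list:
    """Try each source in order and return the first result that looks complete."""
    for source in sources:
        try:
            games = source.scraper(season)
        except Exception as exc:  # one failing source should not abort the run
            if verbose:
                print(f"[{sport}] {source.name} failed: {exc}")
            continue
        if len(games) >= source.min_games:
            if verbose:
                print(f"[{sport}] {source.name} returned {len(games)} games")
            return games
    return []  # every source failed or returned too little data


# Hypothetical stand-ins for real scrapers, so the sketch runs offline.
def _unavailable(season: int) -> list[str]:
    raise RuntimeError("site unavailable")


def _working(season: int) -> list[str]:
    return [f"mlb-{season}-game-1", f"mlb-{season}-game-2"]


DEMO_SOURCES = [
    ScraperSource("primary", _unavailable),
    ScraperSource("backup", _working),
]
```

Calling `scrape_with_fallback("mlb", 2026, DEMO_SOURCES)` skips the failing primary source and returns the two games from the backup. In the real module, `fetch_page` would presumably call `rate_limit(domain)` before each request.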
@@ -0,0 +1,119 @@
+---
+phase: 01-script-architecture
+plan: 02
+type: execute
+---
+
+<objective>
+Extract NBA and NHL scrapers to dedicated sport modules.
+
+Purpose: Continue the modular pattern established in Plan 01.
+Output: `Scripts/nba.py` and `Scripts/nhl.py` with respective scrapers.
+</objective>
+
+<execution_context>
+@~/.claude/get-shit-done/workflows/execute-phase.md
+@~/.claude/get-shit-done/templates/summary.md
+</execution_context>
+
+<context>
+@.planning/PROJECT.md
+@.planning/ROADMAP.md
+@.planning/STATE.md
+
+**Prior work:**
+@.planning/phases/01-script-architecture/01-01-SUMMARY.md
+
+**Source files:**
+@Scripts/core.py
+@Scripts/scrape_schedules.py
+</context>
+
+<tasks>
+
+<task type="auto">
+  <name>Task 1: Create nba.py sport module</name>
+  <files>Scripts/nba.py</files>
+  <action>
+Create `Scripts/nba.py` following the mlb.py pattern:
+
+1. Import from core:
+   ```python
+   from core import Game, Stadium, ScraperSource, StadiumScraperSource, fetch_page, scrape_with_fallback, scrape_stadiums_with_fallback
+   ```
+
+2. NBA game scrapers:
+   - `scrape_nba_basketball_reference(season: int) -> list[Game]`
+   - `scrape_nba_espn(season: int) -> list[Game]`
+   - `scrape_nba_cbssports(season: int) -> list[Game]`
+
+3. NBA stadium scrapers:
+   - `scrape_nba_stadiums() -> list[Stadium]` (built from `generate_stadiums_from_teams` or a hardcoded list)
+
+4. Source configurations:
+   - `NBA_GAME_SOURCES` list of ScraperSource
+   - `NBA_STADIUM_SOURCES` list of StadiumScraperSource
+
+5. Convenience functions:
+   - `scrape_nba_games(season: int) -> list[Game]`
+   - `get_nba_season_string(season: int) -> str` - returns "2024-25" format
+
+Copy exact parsing logic including team abbreviations and venue mappings from scrape_schedules.py.
+  </action>
+  <verify>cd Scripts && python3 -c "from nba import scrape_nba_games, NBA_GAME_SOURCES; print('OK')"</verify>
+  <done>nba.py exists, imports from core.py, exports NBA scrapers</done>
+</task>
+
+<task type="auto">
+  <name>Task 2: Create nhl.py sport module</name>
+  <files>Scripts/nhl.py</files>
+  <action>
+Create `Scripts/nhl.py` following the same pattern:
+
+1. Import from core:
+   ```python
+   from core import Game, Stadium, ScraperSource, StadiumScraperSource, fetch_page, scrape_with_fallback, scrape_stadiums_with_fallback
+   ```
+
+2. NHL game scrapers:
+   - `scrape_nhl_hockey_reference(season: int) -> list[Game]`
+   - `scrape_nhl_api(season: int) -> list[Game]`
+   - `scrape_nhl_espn(season: int) -> list[Game]`
+
+3. NHL stadium scrapers:
+   - `scrape_nhl_stadiums() -> list[Stadium]`
+
+4. Source configurations:
+   - `NHL_GAME_SOURCES` list of ScraperSource
+   - `NHL_STADIUM_SOURCES` list of StadiumScraperSource
+
+5. Convenience functions:
+   - `scrape_nhl_games(season: int) -> list[Game]`
+   - `get_nhl_season_string(season: int) -> str` - returns "2024-25" format
+
+Copy exact parsing logic from scrape_schedules.py.
+  </action>
+  <verify>cd Scripts && python3 -c "from nhl import scrape_nhl_games, NHL_GAME_SOURCES; print('OK')"</verify>
+  <done>nhl.py exists, imports from core.py, exports NHL scrapers</done>
+</task>
+
+</tasks>
+
+<verification>
+Before declaring plan complete:
+- [ ] `Scripts/nba.py` exists and imports cleanly
+- [ ] `Scripts/nhl.py` exists and imports cleanly
+- [ ] No syntax errors: `python3 -m py_compile Scripts/nba.py Scripts/nhl.py`
+- [ ] Both import from core.py (not duplicating shared utilities)
+</verification>
+
+<success_criteria>
+- nba.py contains all NBA-specific scrapers
+- nhl.py contains all NHL-specific scrapers
+- Both follow the pattern established in mlb.py
+- All files import without errors
+</success_criteria>
+
+<output>
+After completion, create `.planning/phases/01-script-architecture/01-02-SUMMARY.md`
+</output>
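
Note on the season-string helpers in this plan: `get_nba_season_string` and `get_nhl_season_string` are specified only by their output format ("2024-25"). A possible shared implementation is sketched below. It assumes that `season` is the year in which the season ends, which the plan does not state, and the function name `format_season_string` is hypothetical.

```python
def format_season_string(season: int) -> str:
    """Format a season as 'YYYY-YY', e.g. 2025 -> '2024-25'.

    Assumption: `season` is the calendar year in which the season ends.
    """
    start_year = season - 1
    # Two-digit, zero-padded end year so that 2000 becomes '1999-00'.
    return f"{start_year}-{season % 100:02d}"


def get_nba_season_string(season: int) -> str:
    return format_season_string(season)


def get_nhl_season_string(season: int) -> str:
    return format_season_string(season)
```

If the existing code treats `season` as the starting year instead, the arithmetic shifts by one and the helper should follow `scrape_schedules.py`.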
@@ -0,0 +1,147 @@
+---
+phase: 01-script-architecture
+plan: 03
+type: execute
+---
+
+<objective>
+Extract NFL scrapers and refactor scrape_schedules.py to be a thin orchestrator.
+
+Purpose: Complete the modular architecture and update the main entry point.
+Output: `Scripts/nfl.py` and refactored `Scripts/scrape_schedules.py`.
+</objective>
+
+<execution_context>
+@~/.claude/get-shit-done/workflows/execute-phase.md
+@~/.claude/get-shit-done/templates/summary.md
+</execution_context>
+
+<context>
+@.planning/PROJECT.md
+@.planning/ROADMAP.md
+@.planning/STATE.md
+
+**Prior work:**
+@.planning/phases/01-script-architecture/01-01-SUMMARY.md
+@.planning/phases/01-script-architecture/01-02-SUMMARY.md
+
+**Source files:**
+@Scripts/core.py
+@Scripts/mlb.py
+@Scripts/nba.py
+@Scripts/nhl.py
+@Scripts/scrape_schedules.py
+</context>
+
+<tasks>
+
+<task type="auto">
+  <name>Task 1: Create nfl.py sport module</name>
+  <files>Scripts/nfl.py</files>
+  <action>
+Create `Scripts/nfl.py` following the established pattern:
+
+1. Import from core:
+   ```python
+   from core import Game, Stadium, ScraperSource, StadiumScraperSource, fetch_page, scrape_with_fallback, scrape_stadiums_with_fallback
+   ```
+
+2. NFL game scrapers:
+   - `scrape_nfl_espn(season: int) -> list[Game]`
+   - `scrape_nfl_pro_football_reference(season: int) -> list[Game]`
+   - `scrape_nfl_cbssports(season: int) -> list[Game]`
+
+3. NFL stadium scrapers:
+   - `scrape_nfl_stadiums_scorebot() -> list[Stadium]`
+   - `scrape_nfl_stadiums_geojson() -> list[Stadium]`
+   - `scrape_nfl_stadiums_hardcoded() -> list[Stadium]`
+   - `scrape_nfl_stadiums() -> list[Stadium]`
+
+4. Source configurations:
+   - `NFL_GAME_SOURCES` list of ScraperSource
+   - `NFL_STADIUM_SOURCES` list of StadiumScraperSource
+
+5. Convenience functions:
+   - `scrape_nfl_games(season: int) -> list[Game]`
+   - `get_nfl_season_string(season: int) -> str` - returns "2025-26" format
+
+Copy exact parsing logic from scrape_schedules.py.
+  </action>
+  <verify>cd Scripts && python3 -c "from nfl import scrape_nfl_games, NFL_GAME_SOURCES; print('OK')"</verify>
+  <done>nfl.py exists, imports from core.py, exports NFL scrapers</done>
+</task>
+
+<task type="auto">
+  <name>Task 2: Refactor scrape_schedules.py to orchestrator</name>
+  <files>Scripts/scrape_schedules.py</files>
+  <action>
+Rewrite `Scripts/scrape_schedules.py` as a thin orchestrator:
+
+1. Replace inline scrapers with imports:
+   ```python
+   from core import Game, Stadium, assign_stable_ids, export_to_json
+   from mlb import scrape_mlb_games, scrape_mlb_stadiums, MLB_GAME_SOURCES
+   from nba import scrape_nba_games, scrape_nba_stadiums, NBA_GAME_SOURCES, get_nba_season_string
+   from nhl import scrape_nhl_games, scrape_nhl_stadiums, NHL_GAME_SOURCES, get_nhl_season_string
+   from nfl import scrape_nfl_games, scrape_nfl_stadiums, NFL_GAME_SOURCES, get_nfl_season_string
+   ```
+
+2. Keep the main() function with argparse for CLI
+
+3. Update sport scraping blocks to use new imports:
+   - `if args.sport in ['nba', 'all']:` uses `scrape_nba_games(season)`
+   - `if args.sport in ['mlb', 'all']:` uses `scrape_mlb_games(season)`
+   - etc.
+
+4. Keep stadium scraping with the new module imports
+
+5. For non-core sports (WNBA, MLS, NWSL, CBB), keep them inline for now with a `# TODO: Extract to separate modules` comment
+
+6. Update file header docstring to explain the modular structure:
+   ```python
+   """
+   Sports Schedule Scraper Orchestrator
+
+   This script coordinates scraping across sport-specific modules:
+   - core.py: Shared utilities, data classes, fallback system
+   - mlb.py: MLB scrapers
+   - nba.py: NBA scrapers
+   - nhl.py: NHL scrapers
+   - nfl.py: NFL scrapers
+
+   Usage:
+       python scrape_schedules.py --sport nba --season 2026
+       python scrape_schedules.py --sport all --season 2026
+   """
+   ```
+
+Target: ~500 lines (down from 3359) for the orchestrator, with sport logic in modules.
+  </action>
+  <verify>cd Scripts && python3 scrape_schedules.py --help</verify>
+  <done>scrape_schedules.py is thin orchestrator, imports from sport modules, --help works</done>
+</task>
+
+</tasks>
+
+<verification>
+Before declaring phase complete:
+- [ ] All modules exist: core.py (shared) plus the sport modules mlb.py, nba.py, nhl.py, nfl.py
+- [ ] `python3 -m py_compile Scripts/*.py` passes for all files
+- [ ] `cd Scripts && python3 scrape_schedules.py --help` shows usage
+- [ ] scrape_schedules.py is significantly smaller (~500 lines vs 3359)
+- [ ] No circular imports between modules
+</verification>
+
+<success_criteria>
+- Phase 1: Script Architecture complete
+- All 4 core sports have dedicated modules
+- Shared utilities in core.py
+- scrape_schedules.py is thin orchestrator
+- CLI unchanged (backward compatible)
+</success_criteria>
+
+<output>
+After completion, create `.planning/phases/01-script-architecture/01-03-SUMMARY.md` with:
+- Phase 1 complete
+- Ready for Phase 2: Stadium Foundation
+</output>
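
Note on Task 2 of this plan: the thin orchestrator could be sketched as below. The plan describes a chain of `if args.sport in [...]` blocks. This sketch swaps in a dictionary lookup, which has the same effect for the four core sports. The stub scrapers stand in for the real imports (`from mlb import scrape_mlb_games`, and so on), and the argument defaults are assumptions. Only the `--sport` and `--season` flags come from the usage text in the plan.

```python
import argparse
from typing import Callable, Optional


def _stub(sport: str) -> Callable[[int], list[str]]:
    """Hypothetical stand-in for a sport module's scrape_<sport>_games function."""
    def scrape(season: int) -> list[str]:
        return [f"{sport}-{season}-game"]
    return scrape


# In the real orchestrator these values would be the imported scrape_*_games functions.
SCRAPERS: dict[str, Callable[[int], list[str]]] = {
    sport: _stub(sport) for sport in ("mlb", "nba", "nhl", "nfl")
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sports Schedule Scraper Orchestrator")
    # Defaults below are assumptions, not taken from the plan.
    parser.add_argument("--sport", choices=[*SCRAPERS, "all"], default="all")
    parser.add_argument("--season", type=int, default=2026)
    return parser


def main(argv: Optional[list[str]] = None) -> dict[str, list[str]]:
    args = build_parser().parse_args(argv)
    selected = list(SCRAPERS) if args.sport == "all" else [args.sport]
    # Dispatch to each selected sport's scraper and collect results per sport.
    return {sport: SCRAPERS[sport](args.season) for sport in selected}
```

With this layout, `main(["--sport", "nba", "--season", "2026"])` runs only the NBA scraper, while `--sport all` runs every registered one. The non-core sports that stay inline (WNBA, MLS, NWSL, CBB) would need their own entries or separate handling.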