# Phase 1: Knowledge Base Creation - Implementation Plan

## Overview

**Goal:** Build structured plant knowledge from `data/houseplants_list.json`, enriching with taxonomy and characteristics.

**Input:** `data/houseplants_list.json` (2,278 plants, 11 categories, 50 families)

**Output:** Enriched plant knowledge base (JSON + SQLite) with ~500-2000 validated entries

---

## Current Data Assessment

| Attribute | Current State | Required Enhancement |
|-----------|---------------|---------------------|
| Total Plants | 2,278 | Validate, deduplicate |
| Scientific Names | Present | Validate binomial nomenclature |
| Common Names | Array per plant | Normalize, cross-reference |
| Family | 50 families | Validate against taxonomy |
| Category | 11 categories | Map to target types |
| Physical Characteristics | **Missing** | **Must add** |
| Regional/Seasonal Info | **Missing** | **Must add** |

---

## Task Breakdown

### Task 1.1: Load and Validate Plant List

**Objective:** Parse JSON and validate data integrity

**Actions:**

- [ ] Create Python script `scripts/validate_plant_list.py`
- [ ] Load `data/houseplants_list.json`
- [ ] Validate JSON schema:
  - Each plant has `scientific_name` (required, string)
  - Each plant has `common_names` (required, array of strings)
  - Each plant has `family` (required, string)
  - Each plant has `category` (required, string)
- [ ] Identify malformed entries (missing fields, wrong types)
- [ ] Generate validation report: `output/validation_report.json`

**Validation Criteria:**

- 0 malformed entries
- All required fields present
- No null/empty scientific names

**Output File:** `scripts/validate_plant_list.py`

---

### Task 1.2: Normalize and Standardize Plant Names

**Objective:** Ensure consistent naming conventions

**Actions:**

- [ ] Create `scripts/normalize_names.py`
- [ ] Scientific name normalization:
  - Capitalize genus, lowercase species (e.g., "Philodendron hederaceum")
  - Handle cultivar notation: 'Cultivar Name' in single quotes
  - Validate binomial/trinomial format
- [ ] Common name normalization:
  - Title case standardization
  - Remove leading/trailing whitespace
  - Standardize punctuation
- [ ] Handle hybrid notation (×) consistently
- [ ] Flag names that don't match expected patterns

**Validation Criteria:**

- 100% of scientific names follow binomial nomenclature pattern
- No leading/trailing whitespace in any names
- Consistent cultivar notation

**Output File:** `data/normalized_plants.json`

---

### Task 1.3: Create Deduplicated Master List

**Objective:** Remove duplicates while preserving unique cultivars

**Actions:**

- [ ] Create `scripts/deduplicate_plants.py`
- [ ] Define deduplication rules:
  - Exact scientific name match = duplicate
  - Different cultivars of same species = keep both
  - Same plant, different common names = merge common names
- [ ] Identify potential duplicates using fuzzy matching on:
  - Scientific names (Levenshtein distance < 3)
  - Common names that are identical
- [ ] Generate duplicate candidates report for manual review
- [ ] Merge duplicates: combine common names arrays
- [ ] Assign unique plant IDs (`plant_001`, `plant_002`, etc.)

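The deduplication rules listed above could be sketched roughly as follows. This is a minimal illustration, not the project's actual `scripts/deduplicate_plants.py`: the function names `levenshtein` and `deduplicate` are hypothetical, each record is assumed to be a dict with `scientific_name` and `common_names` already normalized by Task 1.2, and the edit distance is a plain standard-library implementation rather than any particular fuzzy-matching package.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def deduplicate(plants, max_distance=3):
    """Merge exact duplicates, assign IDs, and list near matches for review.

    Returns (master_list, candidates); candidates are pairs of plant IDs
    whose scientific names differ by fewer than max_distance edits.
    """
    merged = {}  # scientific_name -> record, in first-seen order
    for plant in plants:
        name = plant["scientific_name"]
        if name not in merged:
            merged[name] = {**plant, "common_names": []}
        known = merged[name]["common_names"]
        for common_name in plant["common_names"]:
            if common_name not in known:
                known.append(common_name)

    master = list(merged.values())
    for index, plant in enumerate(master, start=1):
        plant["id"] = f"plant_{index:03d}"

    # Near matches are only reported for manual review, never auto-merged.
    candidates = []
    for a, b in combinations(master, 2):
        name_a, name_b = a["scientific_name"], b["scientific_name"]
        # Cheap length check first: the distance is at least the length gap.
        if abs(len(name_a) - len(name_b)) >= max_distance:
            continue
        if levenshtein(name_a, name_b) < max_distance:
            candidates.append((a["id"], b["id"]))
    return master, candidates
```

Because a cultivar suffix such as `'Brasil'` adds well over three characters, a cultivar and its parent species are not flagged under this threshold, which matches the "keep both" rule. The pairwise comparison is quadratic in the number of plants, so for the full list a faster edit-distance routine or blocking by genus may be worth considering.
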
**Validation Criteria:**

- No exact scientific name duplicates
- All plants have unique IDs
- Merge log documenting all deduplication decisions

**Output Files:**

- `data/master_plant_list.json`
- `output/deduplication_report.json`

---

### Task 1.4: Enrich with Physical Characteristics

**Objective:** Add visual and physical attributes for each plant

**Actions:**

- [ ] Create `scripts/enrich_characteristics.py`
- [ ] Define characteristic schema:

  ```json
  {
    "characteristics": {
      "leaf_shape": ["heart", "oval", "linear", "palmate", "lobed", "needle", "rosette"],
      "leaf_color": ["green", "variegated", "red", "purple", "silver", "yellow"],
      "leaf_texture": ["glossy", "matte", "fuzzy", "waxy", "smooth", "rough"],
      "growth_habit": ["upright", "trailing", "climbing", "rosette", "bushy", "tree-form"],
      "mature_height_cm": [0-500],
      "mature_width_cm": [0-300],
      "flowering": true/false,
      "flower_colors": ["white", "pink", "red", "yellow", "orange", "purple", "blue"],
      "bloom_season": ["spring", "summer", "fall", "winter", "year-round"]
    }
  }
  ```

- [ ] Source characteristics data:
  - **Primary:** Web scraping from botanical databases (RHS, Missouri Botanical Garden)
  - **Secondary:** Wikipedia API for plant descriptions
  - **Fallback:** Family/genus-level defaults
- [ ] Implement web fetching with rate limiting
- [ ] Parse and extract characteristics from HTML/JSON responses
- [ ] Store enrichment sources for traceability

**Validation Criteria:**

- ≥80% of plants have leaf_shape populated
- ≥80% of plants have growth_habit populated
- ≥60% of plants have height/width estimates
- 100% of plants have flowering boolean

**Output Files:**

- `data/enriched_plants.json`
- `output/enrichment_coverage_report.json`

---

### Task 1.5: Categorize Plants by Type

**Objective:** Map existing categories to target classification system

**Actions:**

- [ ] Create `scripts/categorize_plants.py`
- [ ] Define target categories (per plan):

  ```
  - Flowering Plant
  - Tree / Palm
  - Shrub / Bush
  - Succulent / Cactus
  - Fern
  - Vine / Trailing
  - Herb
  - Orchid
  - Bromeliad
  - Air Plant
  ```

- [ ] Create mapping from current 11 categories:

  ```
  Current              → Target
  ─────────────────────────────
  Air Plant            → Air Plant
  Bromeliad            → Bromeliad
  Cactus               → Succulent / Cactus
  Fern                 → Fern
  Flowering Houseplant → Flowering Plant
  Herb                 → Herb
  Orchid               → Orchid
  Palm                 → Tree / Palm
  Succulent            → Succulent / Cactus
  Trailing/Climbing    → Vine / Trailing
  Tropical Foliage     → [Requires secondary classification]
  ```

- [ ] Handle "Tropical Foliage" (largest category):
  - Use growth_habit from Task 1.4 to sub-classify
  - Cross-reference family for tree-form species (Ficus → Tree)
- [ ] Add `primary_category` and `secondary_categories` fields

**Validation Criteria:**

- 100% of plants have primary_category assigned
- No plants remain as "Tropical Foliage" (all reclassified)
- Category distribution documented

**Output File:** `data/categorized_plants.json`

---

### Task 1.6: Map Common Names to Scientific Names

**Objective:** Create bidirectional lookup for name resolution

**Actions:**

- [ ] Create `scripts/build_name_index.py`
- [ ] Build scientific → common names map (already exists, validate)
- [ ] Build common → scientific names map (reverse lookup)
- [ ] Handle ambiguous common names (multiple plants share same common name):
  - Flag conflicts
  - Add disambiguation notes
- [ ] Validate against external taxonomy:
  - World Flora Online (WFO) API
  - GBIF (Global Biodiversity Information Facility)
- [ ] Add `verified` boolean for taxonomically confirmed names
- [ ] Store alternative/deprecated scientific names as synonyms

**Validation Criteria:**

- Reverse lookup resolves ≥95% of common names unambiguously
- ≥70% of scientific names verified against WFO/GBIF
- Synonym list for deprecated names

**Output Files:**

- `data/name_index.json`
- `output/name_ambiguity_report.json`

---

### Task 1.7: Add Regional/Seasonal Information

**Objective:** Add native regions, hardiness zones, and seasonal behaviors

**Actions:**

- [ ] Create `scripts/add_regional_data.py`
- [ ] Define regional schema:

  ```json
  {
    "regional_info": {
      "native_regions": ["South America", "Southeast Asia", "Africa", ...],
      "native_countries": ["Brazil", "Thailand", ...],
      "usda_hardiness_zones": ["9a", "9b", "10a", ...],
      "indoor_outdoor": "indoor_only" | "outdoor_temperate" | "outdoor_tropical",
      "seasonal_behavior": "evergreen" | "deciduous" | "dormant_winter"
    }
  }
  ```

- [ ] Source regional data:
  - USDA Plants Database API
  - Wikipedia (native range sections)
  - Existing botanical databases
- [ ] Map families to typical native regions as fallback
- [ ] Add care-relevant seasonality (dormancy periods, bloom times)

**Validation Criteria:**

- ≥70% of plants have native_regions populated
- ≥60% of plants have hardiness zones
- 100% of plants have indoor_outdoor classification

**Output File:** `data/final_knowledge_base.json`

---

## Final Knowledge Base Schema

```json
{
  "version": "1.0.0",
  "generated_date": "YYYY-MM-DD",
  "total_plants": 2000,
  "plants": [
    {
      "id": "plant_001",
      "scientific_name": "Philodendron hederaceum",
      "common_names": ["Heartleaf Philodendron", "Sweetheart Plant"],
      "synonyms": [],
      "family": "Araceae",
      "genus": "Philodendron",
      "species": "hederaceum",
      "cultivar": null,
      "primary_category": "Vine / Trailing",
      "secondary_categories": ["Tropical Foliage"],
      "characteristics": {
        "leaf_shape": "heart",
        "leaf_color": ["green"],
        "leaf_texture": "glossy",
        "growth_habit": "trailing",
        "mature_height_cm": 120,
        "mature_width_cm": 60,
        "flowering": true,
        "flower_colors": ["white", "green"],
        "bloom_season": "rarely indoors"
      },
      "regional_info": {
        "native_regions": ["Central America", "South America"],
        "native_countries": ["Mexico", "Brazil"],
        "usda_hardiness_zones": ["10b", "11", "12"],
        "indoor_outdoor": "indoor_only",
        "seasonal_behavior": "evergreen"
      },
      "taxonomy_verified": true,
      "data_sources": ["RHS", "Missouri Botanical Garden"],
      "last_updated": "YYYY-MM-DD"
    }
  ]
}
```

---

## Output File Structure

```
PlantGuide/
├── data/
│   ├── houseplants_list.json          # Original input (unchanged)
│   ├── normalized_plants.json         # Task 1.2 output
│   ├── master_plant_list.json         # Task 1.3 output
│   ├── enriched_plants.json           # Task 1.4 output
│   ├── categorized_plants.json        # Task 1.5 output
│   ├── name_index.json                # Task 1.6 output
│   └── final_knowledge_base.json      # Task 1.7 output (FINAL)
├── scripts/
│   ├── validate_plant_list.py         # Task 1.1
│   ├── normalize_names.py             # Task 1.2
│   ├── deduplicate_plants.py          # Task 1.3
│   ├── enrich_characteristics.py      # Task 1.4
│   ├── categorize_plants.py           # Task 1.5
│   ├── build_name_index.py            # Task 1.6
│   └── add_regional_data.py           # Task 1.7
├── output/
│   ├── validation_report.json
│   ├── deduplication_report.json
│   ├── enrichment_coverage_report.json
│   └── name_ambiguity_report.json
└── knowledge_base/
    ├── plants.db                      # SQLite database
    └── schema.sql                     # Database schema
```

---

## SQLite Database Schema

```sql
-- Task: Create SQLite database alongside JSON

CREATE TABLE plants (
    id TEXT PRIMARY KEY,
    scientific_name TEXT NOT NULL UNIQUE,
    family TEXT NOT NULL,
    genus TEXT,
    species TEXT,
    cultivar TEXT,
    primary_category TEXT NOT NULL,
    taxonomy_verified BOOLEAN DEFAULT FALSE,
    last_updated DATE
);

CREATE TABLE common_names (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    plant_id TEXT REFERENCES plants(id),
    common_name TEXT NOT NULL,
    is_primary BOOLEAN DEFAULT FALSE
);

CREATE TABLE characteristics (
    plant_id TEXT PRIMARY KEY REFERENCES plants(id),
    leaf_shape TEXT,
    leaf_color TEXT,  -- JSON array
    leaf_texture TEXT,
    growth_habit TEXT,
    mature_height_cm INTEGER,
    mature_width_cm INTEGER,
    flowering BOOLEAN,
    flower_colors TEXT,  -- JSON array
    bloom_season TEXT
);

CREATE TABLE regional_info (
    plant_id TEXT PRIMARY KEY REFERENCES plants(id),
    native_regions TEXT,  -- JSON array
    native_countries TEXT,  -- JSON array
    usda_hardiness_zones TEXT,  -- JSON array
    indoor_outdoor TEXT,
    seasonal_behavior TEXT
);

CREATE TABLE synonyms (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    plant_id TEXT REFERENCES plants(id),
    synonym TEXT NOT NULL
);

-- Indexes for common queries
CREATE INDEX idx_plants_family ON plants(family);
CREATE INDEX idx_plants_category ON plants(primary_category);
CREATE INDEX idx_common_names_name ON common_names(common_name);
CREATE INDEX idx_characteristics_habit ON characteristics(growth_habit);
```

---

## End Phase Validation Checklist

### Data Quality Gates

| Metric | Target | Validation Method |
|--------|--------|-------------------|
| Total validated plants | ≥1,500 | Count after deduplication |
| Schema compliance | 100% | JSON schema validation |
| Scientific name format | 100% valid | Regex: `^[A-Z][a-z]+ [a-z]+` |
| Plants with characteristics | ≥80% | Field coverage check |
| Plants with regional data | ≥70% | Field coverage check |
| Category coverage | 100% | No "Unknown" categories |
| Name disambiguation | ≥95% | Ambiguity report review |
| Taxonomy verification | ≥70% | WFO/GBIF cross-reference |

### Functional Validation

- [ ] **Query Test 1:** Lookup by scientific name returns full plant record
- [ ] **Query Test 2:** Lookup by common name returns correct plant(s)
- [ ] **Query Test 3:** Filter by category returns expected results
- [ ] **Query Test 4:** Filter by characteristics (leaf_shape=heart) works
- [ ] **Query Test 5:** Regional filter (hardiness_zone=10a) works

### Deliverable Checklist

- [ ] `data/final_knowledge_base.json` exists and passes schema validation
- [ ] `knowledge_base/plants.db` SQLite database is populated
- [ ] All scripts in `scripts/` directory are functional
- [ ] All reports in `output/` directory are generated
- [ ] Data coverage meets minimum thresholds
- [ ] No critical validation errors in reports

### Phase Exit Criteria

**Phase 1 is COMPLETE when:**

1. ✅ Final knowledge base contains ≥1,500 validated plant entries
2. ✅ ≥80% of plants have physical characteristics populated
3. ✅ ≥70% of plants have regional information
4. ✅ 100% of plants have valid categories (no "Unknown")
5. ✅ SQLite database mirrors JSON knowledge base
6. ✅ All validation tests pass
7. ✅ Documentation updated with final counts and coverage metrics

---

## Execution Order

```
Task 1.1 (Validate)
    ↓
Task 1.2 (Normalize)
    ↓
Task 1.3 (Deduplicate)
    ↓
    ├─→ Task 1.4 (Characteristics) ─┐
    │                               │
    ├─→ Task 1.6 (Name Index) ──────┤
    │                               │
    └─→ Task 1.7 (Regional) ────────┤
                                    ↓
Task 1.5 (Categorize) [Depends on 1.4 for Tropical Foliage]
    ↓
Final Assembly (JSON + SQLite)
    ↓
Validation Suite
```

**Note:** Tasks 1.4, 1.6, and 1.7 can run in parallel after Task 1.3 completes. Task 1.5 depends on Task 1.4 output for sub-categorizing Tropical Foliage plants.

---

## Risk Mitigation

| Risk | Mitigation |
|------|------------|
| External API rate limits | Implement caching, request throttling |
| Incomplete enrichment data | Use family-level defaults, document gaps |
| Ambiguous common names | Flag for manual review, prioritize top plants |
| Taxonomy database mismatches | Trust WFO as primary source |
| Large dataset processing | Process in batches, checkpoint progress |
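
The caching, throttling, and checkpointing mitigations in the table above could be combined in one small wrapper. The sketch below is an assumption-laden illustration, not part of the plan itself: the class name `ThrottledCache` is hypothetical, `fetch` stands in for whatever function eventually calls an external source (WFO, GBIF, Wikipedia), and fetched results are assumed to be JSON-serializable. The cache file doubles as a checkpoint, so an interrupted enrichment run can resume without repeating requests.

```python
import json
import time
from pathlib import Path


class ThrottledCache:
    """Wrap a fetch function with a minimum call interval and a JSON cache on disk."""

    def __init__(self, fetch, cache_path, min_interval=1.0,
                 sleep=time.sleep, clock=time.monotonic):
        self.fetch = fetch                # hypothetical: callable(key) -> JSON-serializable
        self.cache_path = Path(cache_path)
        self.min_interval = min_interval  # seconds between real fetches
        self._sleep = sleep               # injectable so tests need not wait
        self._clock = clock
        self._last_call = None
        # Resume from an earlier run if a cache file already exists.
        if self.cache_path.exists():
            self.cache = json.loads(self.cache_path.read_text(encoding="utf-8"))
        else:
            self.cache = {}

    def get(self, key):
        """Return the cached value for key, fetching (with throttling) on a miss."""
        if key in self.cache:
            return self.cache[key]
        if self._last_call is not None:
            wait = self.min_interval - (self._clock() - self._last_call)
            if wait > 0:
                self._sleep(wait)
        result = self.fetch(key)
        self._last_call = self._clock()
        self.cache[key] = result
        # Persist after every fetch so progress survives an interruption.
        self.cache_path.write_text(json.dumps(self.cache, indent=2), encoding="utf-8")
        return result
```

Rewriting the whole cache file on each fetch is simple but grows costly as the cache fills; for a couple of thousand plants that is probably acceptable, and a batch-wise flush would be the obvious refinement if it is not.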