# Phase 1: Knowledge Base Creation - Implementation Plan

## Overview

**Goal:** Build structured plant knowledge from `data/houseplants_list.json`, enriching with taxonomy and characteristics.

**Input:** `data/houseplants_list.json` (2,278 plants, 11 categories, 50 families)

**Output:** Enriched plant knowledge base (JSON + SQLite) with roughly 1,500-2,000 validated entries

---

## Current Data Assessment

| Attribute | Current State | Required Enhancement |
|-----------|---------------|---------------------|
| Total Plants | 2,278 | Validate, deduplicate |
| Scientific Names | Present | Validate binomial nomenclature |
| Common Names | Array per plant | Normalize, cross-reference |
| Family | 50 families | Validate against taxonomy |
| Category | 11 categories | Map to target types |
| Physical Characteristics | **Missing** | **Must add** |
| Regional/Seasonal Info | **Missing** | **Must add** |

---

## Task Breakdown

### Task 1.1: Load and Validate Plant List

**Objective:** Parse JSON and validate data integrity

**Actions:**
- [ ] Create Python script `scripts/validate_plant_list.py`
- [ ] Load `data/houseplants_list.json`
- [ ] Validate JSON schema:
  - Each plant has `scientific_name` (required, string)
  - Each plant has `common_names` (required, array of strings)
  - Each plant has `family` (required, string)
  - Each plant has `category` (required, string)
- [ ] Identify malformed entries (missing fields, wrong types)
- [ ] Generate validation report: `output/validation_report.json`

**Validation Criteria:**
- 0 malformed entries
- All required fields present
- No null/empty scientific names

**Output Files:**
- `scripts/validate_plant_list.py`
- `output/validation_report.json`
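
The checks above could be sketched roughly as follows. This is a minimal sketch, not the final script: it assumes the plant entries have already been loaded as a list of objects (the top-level layout of `data/houseplants_list.json` is not specified here), and the names `validate_plant` and `build_report` are hypothetical.

```python
# Required fields and their expected types, as listed in the schema checks above.
REQUIRED_FIELDS = {
    "scientific_name": str,
    "common_names": list,
    "family": str,
    "category": str,
}


def validate_plant(plant) -> list[str]:
    """Return a list of human-readable problems for one plant entry."""
    if not isinstance(plant, dict):
        return ["entry is not an object"]
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        value = plant.get(field)
        if value is None:
            problems.append(f"missing field: {field}")
        elif not isinstance(value, expected_type):
            problems.append(f"wrong type for {field}: {type(value).__name__}")
        elif expected_type is str and not value.strip():
            problems.append(f"empty field: {field}")
    names = plant.get("common_names")
    if isinstance(names, list) and not all(isinstance(n, str) for n in names):
        problems.append("common_names must contain only strings")
    return problems


def build_report(plants: list) -> dict:
    """Summarize malformed entries by their index in the input list."""
    malformed = []
    for index, plant in enumerate(plants):
        problems = validate_plant(plant)
        if problems:
            malformed.append({"index": index, "problems": problems})
    return {
        "total": len(plants),
        "malformed_count": len(malformed),
        "malformed": malformed,
    }
```

The dictionary returned by `build_report` could be written to `output/validation_report.json` with `json.dump`; a `malformed_count` of 0 corresponds to the first validation criterion.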

---

### Task 1.2: Normalize and Standardize Plant Names

**Objective:** Ensure consistent naming conventions

**Actions:**
- [ ] Create `scripts/normalize_names.py`
- [ ] Scientific name normalization:
  - Capitalize genus, lowercase species (e.g., "Philodendron hederaceum")
  - Handle cultivar notation: 'Cultivar Name' in single quotes
  - Validate binomial/trinomial format
- [ ] Common name normalization:
  - Title case standardization
  - Remove leading/trailing whitespace
  - Standardize punctuation
- [ ] Handle hybrid notation (×) consistently
- [ ] Flag names that don't match expected patterns

**Validation Criteria:**
- 100% of scientific names follow the binomial/trinomial pattern, allowing for cultivar and hybrid notation
- No leading/trailing whitespace in any names
- Consistent cultivar notation

**Output File:** `data/normalized_plants.json`
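
One possible shape for the scientific-name rules is sketched below. The regular expression is an assumption about what "expected pattern" means (genus, optional ×, epithet, optional `subsp.`/`var.`/`f.` rank, optional quoted cultivar); intergeneric hybrids written with a leading × are not handled and would simply be flagged. Function names are hypothetical.

```python
import re

# Assumed pattern: Genus, optional hybrid sign, species epithet, optional
# ranked infraspecific epithet, optional 'Cultivar Name' in single quotes.
SCIENTIFIC_NAME = re.compile(
    r"[A-Z][a-z]+ (?:× )?[a-z-]+"
    r"(?: (?:subsp\.|var\.|f\.) [a-z-]+)?"
    r"(?: '[^']+')?"
)


def normalize_scientific_name(raw: str) -> str:
    """Collapse whitespace, fix case, and standardize hybrid and cultivar notation."""
    name = " ".join(raw.split())
    # Convert curly single quotes to straight ones.
    name = name.replace("\u2018", "'").replace("\u2019", "'")
    cultivar = ""
    match = re.search(r"\s*'([^']+)'$", name)
    if match:  # keep the cultivar's own capitalization
        cultivar = f" '{match.group(1).strip()}'"
        name = name[: match.start()]
    # Treat a standalone "x" or "X" as the hybrid sign.
    parts = ["×" if part in ("x", "X") else part for part in name.split(" ")]
    parts = [parts[0].capitalize()] + [part.lower() for part in parts[1:]]
    return " ".join(parts) + cultivar


def matches_expected_pattern(name: str) -> bool:
    """True if a normalized name fits the assumed pattern; otherwise flag it."""
    return SCIENTIFIC_NAME.fullmatch(name) is not None
```

Names for which `matches_expected_pattern` returns `False` after normalization would go into the flagged list rather than being rewritten automatically.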

---

### Task 1.3: Create Deduplicated Master List

**Objective:** Remove duplicates while preserving unique cultivars

**Actions:**
- [ ] Create `scripts/deduplicate_plants.py`
- [ ] Define deduplication rules:
  - Exact scientific name match = duplicate
  - Different cultivars of same species = keep both
  - Same plant, different common names = merge common names
- [ ] Identify potential duplicates using fuzzy matching on:
  - Scientific names (Levenshtein distance < 3)
  - Common names that are identical
- [ ] Generate duplicate candidates report for manual review
- [ ] Merge duplicates: combine common names arrays
- [ ] Assign unique plant IDs (`plant_001`, `plant_002`, etc.)

**Validation Criteria:**
- No exact scientific name duplicates
- All plants have unique IDs
- Merge log documenting all deduplication decisions

**Output Files:**
- `data/master_plant_list.json`
- `output/deduplication_report.json`
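
The rules above might be implemented along these lines. This is a dependency-free sketch with hypothetical function names; the pairwise comparison is quadratic in the number of names, so a dedicated edit-distance library may be preferable for the full list. Note that cultivar names differing by one or two characters will also show up as candidates, which is why the candidates go to manual review instead of being merged.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard row-by-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def merge_exact_duplicates(plants: list[dict]) -> list[dict]:
    """Merge entries sharing a scientific name, combining their common names."""
    merged: dict[str, dict] = {}
    for plant in plants:
        key = plant["scientific_name"]
        if key not in merged:
            merged[key] = {**plant, "common_names": list(plant["common_names"])}
            continue
        for name in plant["common_names"]:
            if name not in merged[key]["common_names"]:
                merged[key]["common_names"].append(name)
    return list(merged.values())


def fuzzy_candidates(plants: list[dict], max_distance: int = 2) -> list[tuple[str, str]]:
    """Pairs of distinct scientific names with Levenshtein distance < 3."""
    names = sorted({plant["scientific_name"] for plant in plants})
    return [
        (a, b) for a, b in combinations(names, 2)
        if levenshtein(a, b) <= max_distance
    ]


def assign_ids(plants: list[dict]) -> list[dict]:
    """Attach sequential IDs in the plant_001 style."""
    return [{"id": f"plant_{i:03d}", **plant} for i, plant in enumerate(plants, start=1)]
```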

---

### Task 1.4: Enrich with Physical Characteristics

**Objective:** Add visual and physical attributes for each plant

**Actions:**
- [ ] Create `scripts/enrich_characteristics.py`
- [ ] Define characteristic schema:
  ```json
  {
    "characteristics": {
      "leaf_shape": ["heart", "oval", "linear", "palmate", "lobed", "needle", "rosette"],
      "leaf_color": ["green", "variegated", "red", "purple", "silver", "yellow"],
      "leaf_texture": ["glossy", "matte", "fuzzy", "waxy", "smooth", "rough"],
      "growth_habit": ["upright", "trailing", "climbing", "rosette", "bushy", "tree-form"],
      "mature_height_cm": "integer, 0-500",
      "mature_width_cm": "integer, 0-300",
      "flowering": "boolean",
      "flower_colors": ["white", "pink", "red", "yellow", "orange", "purple", "blue"],
      "bloom_season": ["spring", "summer", "fall", "winter", "year-round"]
    }
  }
  ```
- [ ] Source characteristics data:
  - **Primary:** Web scraping from botanical databases (RHS, Missouri Botanical Garden)
  - **Secondary:** Wikipedia API for plant descriptions
  - **Fallback:** Family/genus-level defaults
- [ ] Implement web fetching with rate limiting
- [ ] Parse and extract characteristics from HTML/JSON responses
- [ ] Store enrichment sources for traceability

**Validation Criteria:**
- ≥80% of plants have leaf_shape populated
- ≥80% of plants have growth_habit populated
- ≥60% of plants have height/width estimates
- 100% of plants have flowering boolean

**Output Files:**
- `data/enriched_plants.json`
- `output/enrichment_coverage_report.json`
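
For the rate-limited fetching step, a small wrapper like the one below is one option. It is a sketch under assumptions: the actual `fetch` function (for example a thin wrapper around an HTTP client) is supplied by the caller, the one-second default interval is illustrative rather than a documented limit of any source, and `ThrottledFetcher` is a hypothetical name.

```python
import time
from typing import Callable, Optional


class ThrottledFetcher:
    """Wrap a fetch function with a minimum delay between calls and a cache."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        min_interval_s: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._fetch = fetch
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None
        self._cache: dict[str, str] = {}

    def get(self, url: str) -> str:
        """Return the cached body for url, fetching it (after any wait) if needed."""
        if url in self._cache:
            return self._cache[url]
        if self._last_call is not None:
            wait = self._min_interval_s - (self._clock() - self._last_call)
            if wait > 0:
                self._sleep(wait)
        self._last_call = self._clock()
        self._cache[url] = self._fetch(url)
        return self._cache[url]
```

The `clock` and `sleep` parameters exist so the throttling logic can be tested without real delays. The in-memory cache only avoids repeat requests within one run; persisting it to disk would be a separate decision.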

---

### Task 1.5: Categorize Plants by Type

**Objective:** Map existing categories to target classification system

**Actions:**
- [ ] Create `scripts/categorize_plants.py`
- [ ] Define target categories (per plan):
  ```
  - Flowering Plant
  - Tree / Palm
  - Shrub / Bush
  - Succulent / Cactus
  - Fern
  - Vine / Trailing
  - Herb
  - Orchid
  - Bromeliad
  - Air Plant
  ```
- [ ] Create mapping from current 11 categories:
  ```
  Current → Target
  ─────────────────────────────
  Air Plant → Air Plant
  Bromeliad → Bromeliad
  Cactus → Succulent / Cactus
  Fern → Fern
  Flowering Houseplant → Flowering Plant
  Herb → Herb
  Orchid → Orchid
  Palm → Tree / Palm
  Succulent → Succulent / Cactus
  Trailing/Climbing → Vine / Trailing
  Tropical Foliage → [Requires secondary classification]
  ```
- [ ] Handle "Tropical Foliage" (largest category):
  - Use growth_habit from Task 1.4 to sub-classify
  - Cross-reference genus and family for tree-form species (e.g., genus *Ficus* → Tree / Palm)
- [ ] Add `primary_category` and `secondary_categories` fields

**Validation Criteria:**
- 100% of plants have primary_category assigned
- No plants remain as "Tropical Foliage" (all reclassified)
- Category distribution documented

**Output File:** `data/categorized_plants.json`
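
The mapping table translates directly into a lookup. In the sketch below, `CATEGORY_MAP` comes from the table above, while the habit-to-category rules and the `TREE_GENERA` set for "Tropical Foliage" are assumptions made for illustration; plants that match no rule return `None` and would need manual review.

```python
from typing import Optional

# Direct mappings from the current categories to the target categories.
CATEGORY_MAP = {
    "Air Plant": "Air Plant",
    "Bromeliad": "Bromeliad",
    "Cactus": "Succulent / Cactus",
    "Fern": "Fern",
    "Flowering Houseplant": "Flowering Plant",
    "Herb": "Herb",
    "Orchid": "Orchid",
    "Palm": "Tree / Palm",
    "Succulent": "Succulent / Cactus",
    "Trailing/Climbing": "Vine / Trailing",
}

# Assumed sub-classification rules for "Tropical Foliage" (not from the plan).
HABIT_TO_CATEGORY = {
    "trailing": "Vine / Trailing",
    "climbing": "Vine / Trailing",
    "tree-form": "Tree / Palm",
    "bushy": "Shrub / Bush",
}
TREE_GENERA = {"Ficus"}


def primary_category(plant: dict) -> Optional[str]:
    """Return the target category, or None if the plant needs manual review."""
    current = plant["category"]
    if current != "Tropical Foliage":
        return CATEGORY_MAP.get(current)
    genus = plant["scientific_name"].split(" ", 1)[0]
    if genus in TREE_GENERA:
        return "Tree / Palm"
    habit = (plant.get("characteristics") or {}).get("growth_habit")
    return HABIT_TO_CATEGORY.get(habit)
```

Counting the `None` results gives a quick check on the "no plants remain as Tropical Foliage" criterion.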

---

### Task 1.6: Map Common Names to Scientific Names

**Objective:** Create bidirectional lookup for name resolution

**Actions:**
- [ ] Create `scripts/build_name_index.py`
- [ ] Build scientific → common names map (already exists, validate)
- [ ] Build common → scientific names map (reverse lookup)
- [ ] Handle ambiguous common names (multiple plants share same common name):
  - Flag conflicts
  - Add disambiguation notes
- [ ] Validate against external taxonomy:
  - World Flora Online (WFO) API
  - GBIF (Global Biodiversity Information Facility)
- [ ] Add `taxonomy_verified` boolean for taxonomically confirmed names
- [ ] Store alternative/deprecated scientific names as synonyms

**Validation Criteria:**
- Reverse lookup resolves ≥95% of common names unambiguously
- ≥70% of scientific names verified against WFO/GBIF
- Synonym list for deprecated names

**Output Files:**
- `data/name_index.json`
- `output/name_ambiguity_report.json`
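
The reverse lookup and ambiguity check could be sketched as below. Case-insensitive matching of common names is an assumption, and the function names are hypothetical; external verification against WFO or GBIF is not covered.

```python
from collections import defaultdict


def build_reverse_index(plants: list[dict]) -> dict[str, list[str]]:
    """Map each common name (case-folded) to the scientific names that use it."""
    index: dict[str, set[str]] = defaultdict(set)
    for plant in plants:
        for name in plant["common_names"]:
            index[name.strip().casefold()].add(plant["scientific_name"])
    return {name: sorted(scientific) for name, scientific in index.items()}


def ambiguity_report(index: dict[str, list[str]]) -> dict:
    """List common names shared by several plants and the unambiguous share."""
    ambiguous = {name: sci for name, sci in index.items() if len(sci) > 1}
    total = len(index)
    rate = (total - len(ambiguous)) / total if total else 1.0
    return {
        "total_common_names": total,
        "ambiguous": ambiguous,
        "unambiguous_rate": rate,
    }
```

An `unambiguous_rate` of at least 0.95 corresponds to the first validation criterion above.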

---

### Task 1.7: Add Regional/Seasonal Information

**Objective:** Add native regions, hardiness zones, and seasonal behaviors

**Actions:**
- [ ] Create `scripts/add_regional_data.py`
- [ ] Define regional schema:
  ```json
  {
    "regional_info": {
      "native_regions": ["South America", "Southeast Asia", "Africa", ...],
      "native_countries": ["Brazil", "Thailand", ...],
      "usda_hardiness_zones": ["9a", "9b", "10a", ...],
      "indoor_outdoor": "indoor_only" | "outdoor_temperate" | "outdoor_tropical",
      "seasonal_behavior": "evergreen" | "deciduous" | "dormant_winter"
    }
  }
  ```
- [ ] Source regional data:
  - USDA Plants Database API
  - Wikipedia (native range sections)
  - Existing botanical databases
- [ ] Map families to typical native regions as fallback
- [ ] Add care-relevant seasonality (dormancy periods, bloom times)

**Validation Criteria:**
- ≥70% of plants have native_regions populated
- ≥60% of plants have hardiness zones
- 100% of plants have indoor_outdoor classification

**Output File:** `data/final_knowledge_base.json` (assembled after Task 1.5 adds categories; see Execution Order)
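
A record-level check for the regional schema might look like this. The allowed values are taken from the schema above; the zone pattern assumes USDA zones 1 through 13 with an optional `a`/`b` half-zone suffix, and `validate_regional_info` is a hypothetical name.

```python
import re

INDOOR_OUTDOOR = {"indoor_only", "outdoor_temperate", "outdoor_tropical"}
SEASONAL_BEHAVIOR = {"evergreen", "deciduous", "dormant_winter"}
# Assumed format: zone number 1-13, optionally followed by "a" or "b".
ZONE = re.compile(r"(?:[1-9]|1[0-3])[ab]?")


def validate_regional_info(info: dict) -> list[str]:
    """Return a list of problems found in one regional_info object."""
    problems = []
    # indoor_outdoor is required for every plant.
    if info.get("indoor_outdoor") not in INDOOR_OUTDOOR:
        problems.append("indoor_outdoor missing or not an allowed value")
    # seasonal_behavior is optional, but must be an allowed value if present.
    behavior = info.get("seasonal_behavior")
    if behavior is not None and behavior not in SEASONAL_BEHAVIOR:
        problems.append(f"unknown seasonal_behavior: {behavior}")
    for zone in info.get("usda_hardiness_zones") or []:
        if not (isinstance(zone, str) and ZONE.fullmatch(zone)):
            problems.append(f"invalid hardiness zone: {zone}")
    return problems
```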

---

## Final Knowledge Base Schema

```json
{
  "version": "1.0.0",
  "generated_date": "YYYY-MM-DD",
  "total_plants": 2000,
  "plants": [
    {
      "id": "plant_001",
      "scientific_name": "Philodendron hederaceum",
      "common_names": ["Heartleaf Philodendron", "Sweetheart Plant"],
      "synonyms": [],
      "family": "Araceae",
      "genus": "Philodendron",
      "species": "hederaceum",
      "cultivar": null,
      "primary_category": "Vine / Trailing",
      "secondary_categories": ["Tropical Foliage"],
      "characteristics": {
        "leaf_shape": "heart",
        "leaf_color": ["green"],
        "leaf_texture": "glossy",
        "growth_habit": "trailing",
        "mature_height_cm": 120,
        "mature_width_cm": 60,
        "flowering": true,
        "flower_colors": ["white", "green"],
        "bloom_season": "rarely indoors"
      },
      "regional_info": {
        "native_regions": ["Central America", "South America"],
        "native_countries": ["Mexico", "Brazil"],
        "usda_hardiness_zones": ["10b", "11", "12"],
        "indoor_outdoor": "indoor_only",
        "seasonal_behavior": "evergreen"
      },
      "taxonomy_verified": true,
      "data_sources": ["RHS", "Missouri Botanical Garden"],
      "last_updated": "YYYY-MM-DD"
    }
  ]
}
```

---

## Output File Structure

```
PlantGuide/
├── data/
│   ├── houseplants_list.json          # Original input (unchanged)
│   ├── normalized_plants.json         # Task 1.2 output
│   ├── master_plant_list.json         # Task 1.3 output
│   ├── enriched_plants.json           # Task 1.4 output
│   ├── categorized_plants.json        # Task 1.5 output
│   ├── name_index.json                # Task 1.6 output
│   └── final_knowledge_base.json      # Task 1.7 output (FINAL)
├── scripts/
│   ├── validate_plant_list.py         # Task 1.1
│   ├── normalize_names.py             # Task 1.2
│   ├── deduplicate_plants.py          # Task 1.3
│   ├── enrich_characteristics.py      # Task 1.4
│   ├── categorize_plants.py           # Task 1.5
│   ├── build_name_index.py            # Task 1.6
│   └── add_regional_data.py           # Task 1.7
├── output/
│   ├── validation_report.json
│   ├── deduplication_report.json
│   ├── enrichment_coverage_report.json
│   └── name_ambiguity_report.json
└── knowledge_base/
    ├── plants.db                       # SQLite database
    └── schema.sql                      # Database schema
```

---

## SQLite Database Schema

```sql
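-- Task: Create SQLite database alongside JSON
-- Note: SQLite enforces REFERENCES constraints only after
-- `PRAGMA foreign_keys = ON;` is run on each connection.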

CREATE TABLE plants (
    id TEXT PRIMARY KEY,
    scientific_name TEXT NOT NULL UNIQUE,  -- full name incl. any cultivar, so cultivars stay distinct
    family TEXT NOT NULL,
    genus TEXT,
    species TEXT,
    cultivar TEXT,
    primary_category TEXT NOT NULL,
    taxonomy_verified BOOLEAN DEFAULT FALSE,
    last_updated DATE
);

CREATE TABLE common_names (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    plant_id TEXT REFERENCES plants(id),
    common_name TEXT NOT NULL,
    is_primary BOOLEAN DEFAULT FALSE
);

CREATE TABLE characteristics (
    plant_id TEXT PRIMARY KEY REFERENCES plants(id),
    leaf_shape TEXT,
    leaf_color TEXT,  -- JSON array
    leaf_texture TEXT,
    growth_habit TEXT,
    mature_height_cm INTEGER,
    mature_width_cm INTEGER,
    flowering BOOLEAN,
    flower_colors TEXT,  -- JSON array
    bloom_season TEXT
);

CREATE TABLE regional_info (
    plant_id TEXT PRIMARY KEY REFERENCES plants(id),
    native_regions TEXT,  -- JSON array
    native_countries TEXT,  -- JSON array
    usda_hardiness_zones TEXT,  -- JSON array
    indoor_outdoor TEXT,
    seasonal_behavior TEXT
);

CREATE TABLE synonyms (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    plant_id TEXT REFERENCES plants(id),
    synonym TEXT NOT NULL
);

-- Indexes for common queries
CREATE INDEX idx_plants_family ON plants(family);
CREATE INDEX idx_plants_category ON plants(primary_category);
CREATE INDEX idx_common_names_name ON common_names(common_name);
CREATE INDEX idx_characteristics_habit ON characteristics(growth_habit);
```
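
To illustrate how a JSON record might be loaded and queried, here is a sketch using Python's built-in `sqlite3` module against a trimmed copy of the schema (three tables, fewer columns). The helper names are hypothetical, and matching a hardiness zone with `LIKE` on the stored JSON text is one simple option, not necessarily the approach the final scripts will take.

```python
import json
import sqlite3

# Trimmed version of the schema above, enough to exercise two query tests.
SCHEMA = """
CREATE TABLE plants (
    id TEXT PRIMARY KEY,
    scientific_name TEXT NOT NULL UNIQUE,
    family TEXT NOT NULL,
    primary_category TEXT NOT NULL
);
CREATE TABLE common_names (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    plant_id TEXT REFERENCES plants(id),
    common_name TEXT NOT NULL
);
CREATE TABLE regional_info (
    plant_id TEXT PRIMARY KEY REFERENCES plants(id),
    usda_hardiness_zones TEXT  -- JSON array
);
"""


def create_database(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
    conn.executescript(SCHEMA)
    return conn


def load_plant(conn: sqlite3.Connection, plant: dict) -> None:
    """Insert one knowledge-base record, storing array fields as JSON text."""
    conn.execute(
        "INSERT INTO plants (id, scientific_name, family, primary_category) "
        "VALUES (?, ?, ?, ?)",
        (plant["id"], plant["scientific_name"], plant["family"],
         plant["primary_category"]),
    )
    conn.executemany(
        "INSERT INTO common_names (plant_id, common_name) VALUES (?, ?)",
        [(plant["id"], name) for name in plant["common_names"]],
    )
    zones = plant["regional_info"]["usda_hardiness_zones"]
    conn.execute(
        "INSERT INTO regional_info (plant_id, usda_hardiness_zones) VALUES (?, ?)",
        (plant["id"], json.dumps(zones)),
    )
    conn.commit()


def find_by_common_name(conn: sqlite3.Connection, name: str) -> list[str]:
    rows = conn.execute(
        "SELECT p.scientific_name FROM plants p "
        "JOIN common_names c ON c.plant_id = p.id "
        "WHERE c.common_name = ? COLLATE NOCASE ORDER BY p.scientific_name",
        (name,),
    )
    return [row[0] for row in rows]


def find_by_zone(conn: sqlite3.Connection, zone: str) -> list[str]:
    # Match the quoted token inside the stored JSON array, e.g. "10b".
    rows = conn.execute(
        "SELECT p.scientific_name FROM plants p "
        "JOIN regional_info r ON r.plant_id = p.id "
        "WHERE r.usda_hardiness_zones LIKE ? ORDER BY p.scientific_name",
        (f"%{json.dumps(zone)}%",),
    )
    return [row[0] for row in rows]
```

Searching for the quoted token keeps a query for zone `10` from matching `10b`. If zone filtering becomes a frequent query, a separate zones table with one row per plant and zone would be easier to index.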

---

## End Phase Validation Checklist

### Data Quality Gates

| Metric | Target | Validation Method |
|--------|--------|-------------------|
| Total validated plants | ≥1,500 | Count after deduplication |
| Schema compliance | 100% | JSON schema validation |
| Scientific name format | 100% valid | Regex: `^[A-Z][a-z]+ (× )?[a-z-]+` (prefix match; an infraspecific rank or cultivar may follow) |
| Plants with characteristics | ≥80% | Field coverage check |
| Plants with regional data | ≥70% | Field coverage check |
| Category coverage | 100% | No "Unknown" categories |
| Name disambiguation | ≥95% | Ambiguity report review |
| Taxonomy verification | ≥70% | WFO/GBIF cross-reference |
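
The field coverage checks could be automated with something like the sketch below. The per-field thresholds are the ones stated in Tasks 1.4 and 1.7; treating `None`, empty strings, and empty lists as "not populated" is an assumption, and the function names are hypothetical.

```python
# (section, field) -> minimum fraction of plants that must have the field populated.
THRESHOLDS = {
    ("characteristics", "leaf_shape"): 0.80,
    ("characteristics", "growth_habit"): 0.80,
    ("characteristics", "flowering"): 1.00,
    ("regional_info", "native_regions"): 0.70,
    ("regional_info", "usda_hardiness_zones"): 0.60,
    ("regional_info", "indoor_outdoor"): 1.00,
}


def field_coverage(plants: list[dict], section: str, field: str) -> float:
    """Fraction of plants whose section.field is populated."""
    if not plants:
        return 0.0
    populated = sum(
        1 for plant in plants
        # False counts as populated, so "flowering": false is not a gap.
        if (plant.get(section) or {}).get(field) not in (None, "", [])
    )
    return populated / len(plants)


def coverage_report(plants: list[dict]) -> dict:
    """Compare each field's coverage with its target."""
    report = {}
    for (section, field), target in THRESHOLDS.items():
        coverage = field_coverage(plants, section, field)
        report[f"{section}.{field}"] = {
            "coverage": round(coverage, 3),
            "target": target,
            "passed": coverage >= target,
        }
    return report
```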

### Functional Validation

- [ ] **Query Test 1:** Lookup by scientific name returns full plant record
- [ ] **Query Test 2:** Lookup by common name returns correct plant(s)
- [ ] **Query Test 3:** Filter by category returns expected results
- [ ] **Query Test 4:** Filter by characteristics (leaf_shape=heart) works
- [ ] **Query Test 5:** Regional filter (hardiness_zone=10a) works

### Deliverable Checklist

- [ ] `data/final_knowledge_base.json` exists and passes schema validation
- [ ] `knowledge_base/plants.db` SQLite database is populated
- [ ] All scripts in `scripts/` directory are functional
- [ ] All reports in `output/` directory are generated
- [ ] Data coverage meets minimum thresholds
- [ ] No critical validation errors in reports

### Phase Exit Criteria

**Phase 1 is COMPLETE when:**

1. ✅ Final knowledge base contains ≥1,500 validated plant entries
2. ✅ ≥80% of plants have physical characteristics populated
3. ✅ ≥70% of plants have regional information
4. ✅ 100% of plants have valid categories (no "Unknown")
5. ✅ SQLite database mirrors JSON knowledge base
6. ✅ All validation tests pass
7. ✅ Documentation updated with final counts and coverage metrics

---

## Execution Order

```
Task 1.1 (Validate)
    ↓
Task 1.2 (Normalize)
    ↓
Task 1.3 (Deduplicate)
    ↓
    ├─→ Task 1.4 (Characteristics) ─┐
    │                               │
    ├─→ Task 1.6 (Name Index) ──────┤
    │                               │
    └─→ Task 1.7 (Regional) ────────┤
                                    ↓
                            Task 1.5 (Categorize)
                            [Depends on 1.4 for Tropical Foliage]
                                    ↓
                            Final Assembly
                            (JSON + SQLite)
                                    ↓
                            Validation Suite
```

**Note:** Tasks 1.4, 1.6, and 1.7 can run in parallel after Task 1.3 completes. Task 1.5 depends on Task 1.4 output for sub-categorizing Tropical Foliage plants.
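
The ordering can also be derived mechanically from the dependencies. The sketch below uses the standard-library `graphlib` module (Python 3.9+) and follows the diagram in assuming Task 1.5 waits for all three parallel tasks, although only Task 1.4 is a hard data dependency. `execution_batches` is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on, following the diagram.
DEPENDENCIES = {
    "1.1": set(),
    "1.2": {"1.1"},
    "1.3": {"1.2"},
    "1.4": {"1.3"},
    "1.6": {"1.3"},
    "1.7": {"1.3"},
    "1.5": {"1.4", "1.6", "1.7"},
}


def execution_batches(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into batches; tasks within a batch can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    for batch in execution_batches(DEPENDENCIES):
        print(" + ".join(f"Task {task}" for task in batch))
```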

---

## Risk Mitigation

| Risk | Mitigation |
|------|------------|
| External API rate limits | Implement caching, request throttling |
| Incomplete enrichment data | Use family-level defaults, document gaps |
| Ambiguous common names | Flag for manual review, prioritize top plants |
| Taxonomy database mismatches | Treat WFO as the primary source when WFO and GBIF disagree |
| Large dataset processing | Process in batches, checkpoint progress |
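
Batch processing with checkpoints, as mentioned in the last row, might be sketched like this. The checkpoint file layout, the default batch size, and the function name are all assumptions; results must be JSON-serializable for this approach to work.

```python
import json
from pathlib import Path
from typing import Callable


def process_in_batches(
    items: list,
    process: Callable,
    checkpoint_path: str,
    batch_size: int = 100,
) -> list:
    """Process items in batches, saving progress so a rerun resumes after a failure."""
    checkpoint = Path(checkpoint_path)
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())
    else:
        state = {"next_index": 0, "results": []}

    for begin in range(state["next_index"], len(items), batch_size):
        batch = items[begin:begin + batch_size]
        # Finish the whole batch before touching the saved state.
        batch_results = [process(item) for item in batch]
        state["results"].extend(batch_results)
        state["next_index"] = begin + len(batch)
        checkpoint.write_text(json.dumps(state))
    return state["results"]
```

If `process` raises partway through a batch, the checkpoint still points at the start of that batch, so the next run repeats only that batch.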