PlantGuide/Docs/phase-1-implementation-plan.md
Trey t 136dfbae33 Add PlantGuide iOS app with plant identification and care management
2026-01-23 12:18:01 -06:00

# Phase 1: Knowledge Base Creation - Implementation Plan
## Overview
**Goal:** Build structured plant knowledge from `data/houseplants_list.json`, enriching with taxonomy and characteristics.
**Input:** `data/houseplants_list.json` (2,278 plants, 11 categories, 50 families)
**Output:** Enriched plant knowledge base (JSON + SQLite); the Phase Exit Criteria require ≥1,500 validated entries
---
## Current Data Assessment
| Attribute | Current State | Required Enhancement |
|-----------|---------------|---------------------|
| Total Plants | 2,278 | Validate, deduplicate |
| Scientific Names | Present | Validate binomial nomenclature |
| Common Names | Array per plant | Normalize, cross-reference |
| Family | 50 families | Validate against taxonomy |
| Category | 11 categories | Map to target types |
| Physical Characteristics | **Missing** | **Must add** |
| Regional/Seasonal Info | **Missing** | **Must add** |
---
## Task Breakdown
### Task 1.1: Load and Validate Plant List
**Objective:** Parse JSON and validate data integrity
**Actions:**
- [ ] Create Python script `scripts/validate_plant_list.py`
- [ ] Load `data/houseplants_list.json`
- [ ] Validate JSON schema:
  - Each plant has `scientific_name` (required, string)
  - Each plant has `common_names` (required, array of strings)
  - Each plant has `family` (required, string)
  - Each plant has `category` (required, string)
- [ ] Identify malformed entries (missing fields, wrong types)
- [ ] Generate validation report: `output/validation_report.json`
**Validation Criteria:**
- 0 malformed entries
- All required fields present
- No null/empty scientific names
**Output Files:**
- `scripts/validate_plant_list.py`
- `output/validation_report.json`
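
The schema checks listed above could be sketched roughly as follows. This is a minimal illustration, not the project's actual `validate_plant_list.py`: the names `validate_entry` and `build_report` and the report layout are assumptions, and it assumes the input has already been parsed into a list of plant objects (the top-level layout of `data/houseplants_list.json` is not specified in this plan).

```python
from typing import Any

# Required fields and their expected JSON types, as listed in Task 1.1.
REQUIRED_FIELDS = {
    "scientific_name": str,
    "common_names": list,
    "family": str,
    "category": str,
}


def validate_entry(entry: Any) -> list[str]:
    """Return the problems found in one plant entry (empty list if valid)."""
    if not isinstance(entry, dict):
        return ["entry is not an object"]
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected_type):
            problems.append(f"wrong type for {field}")
        elif expected_type is str and not entry[field].strip():
            problems.append(f"empty value for {field}")
    names = entry.get("common_names")
    if isinstance(names, list) and not all(isinstance(n, str) for n in names):
        problems.append("common_names must contain only strings")
    return problems


def build_report(plants: list[Any]) -> dict:
    """Summarize malformed entries in a shape that can be dumped to JSON."""
    malformed = []
    for index, entry in enumerate(plants):
        problems = validate_entry(entry)
        if problems:
            malformed.append({"index": index, "problems": problems})
    return {
        "total": len(plants),
        "malformed_count": len(malformed),
        "malformed": malformed,
    }
```

Under these assumptions, the "0 malformed entries" criterion corresponds to `build_report(plants)["malformed_count"] == 0`.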
---
### Task 1.2: Normalize and Standardize Plant Names
**Objective:** Ensure consistent naming conventions
**Actions:**
- [ ] Create `scripts/normalize_names.py`
- [ ] Scientific name normalization:
  - Capitalize genus, lowercase species (e.g., "Philodendron hederaceum")
  - Handle cultivar notation: 'Cultivar Name' in single quotes
  - Validate binomial/trinomial format
- [ ] Common name normalization:
  - Title case standardization
  - Remove leading/trailing whitespace
  - Standardize punctuation
- [ ] Handle hybrid notation (×) consistently
- [ ] Flag names that don't match expected patterns
**Validation Criteria:**
- 100% of scientific names follow the binomial/trinomial pattern (cultivar and hybrid notation allowed)
- No leading/trailing whitespace in any names
- Consistent cultivar notation
**Output File:** `data/normalized_plants.json`
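
A rough sketch of the scientific-name rules above, assuming names arrive as plain strings. The function names and the `EXPECTED` pattern are illustrative assumptions (the plan does not define the exact "expected pattern"), and edge cases such as cultivar names containing apostrophes or intergeneric hybrids written with a leading × are not handled.

```python
import re

# Hybrid marker written as a standalone "x", "X", or "×" between words.
HYBRID_MARKER = re.compile(r"\s+[xX×]\s+")
# Trailing cultivar name in straight or curly quotes.
CULTIVAR = re.compile(r"[\"'‘’“”]([^\"'‘’“”]+)[\"'‘’“”]$")
# ASSUMPTION: one plausible "expected pattern"; the plan does not define it.
EXPECTED = re.compile(
    r"^[A-Z][a-z-]+ (× )?[a-z-]+( (subsp\.|var\.|f\.) [a-z-]+)?( '[^']+')?$"
)


def normalize_scientific_name(raw: str) -> str:
    """Normalize whitespace, case, hybrid sign, and cultivar quoting."""
    name = " ".join(raw.split())  # trim and collapse runs of whitespace
    cultivar = None
    match = CULTIVAR.search(name)
    if match:
        cultivar = match.group(1).strip()
        name = name[: match.start()].strip()
    name = HYBRID_MARKER.sub(" × ", name)
    words = name.split(" ")
    # Genus capitalized, every later word (epithets, ranks) lowercased.
    words = [words[0].capitalize()] + [w.lower() for w in words[1:]]
    normalized = " ".join(words)
    if cultivar:
        normalized += f" '{cultivar}'"
    return normalized


def matches_expected_pattern(name: str) -> bool:
    """True if a normalized name fits the assumed pattern; otherwise flag it."""
    return bool(EXPECTED.match(name))
```

Names for which `matches_expected_pattern` returns `False` after normalization would be the ones to flag for review rather than silently rewrite.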
---
### Task 1.3: Create Deduplicated Master List
**Objective:** Remove duplicates while preserving unique cultivars
**Actions:**
- [ ] Create `scripts/deduplicate_plants.py`
- [ ] Define deduplication rules:
  - Exact scientific name match = duplicate
  - Different cultivars of same species = keep both
  - Same plant, different common names = merge common names
- [ ] Identify potential duplicates using fuzzy matching on:
  - Scientific names (Levenshtein distance < 3)
  - Common names that are identical
- [ ] Generate duplicate candidates report for manual review
- [ ] Merge duplicates: combine common names arrays
- [ ] Assign unique plant IDs (`plant_001`, `plant_002`, etc.)
**Validation Criteria:**
- No exact scientific name duplicates
- All plants have unique IDs
- Merge log documenting all deduplication decisions
**Output Files:**
- `data/master_plant_list.json`
- `output/deduplication_report.json`
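
The deduplication rules above might look something like this in code. It is a sketch under assumptions: the function names are hypothetical, the edit distance is a plain dynamic-programming implementation rather than any particular library, and exact-match merging is applied to already-normalized names.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def merge_exact_duplicates(plants: list[dict]) -> list[dict]:
    """Merge entries with identical scientific names and assign plant IDs."""
    merged: dict[str, dict] = {}
    for plant in plants:
        key = plant["scientific_name"]
        if key not in merged:
            merged[key] = {**plant, "common_names": list(plant["common_names"])}
            continue
        for name in plant["common_names"]:
            if name not in merged[key]["common_names"]:
                merged[key]["common_names"].append(name)
    result = list(merged.values())
    for number, plant in enumerate(result, start=1):
        plant["id"] = f"plant_{number:03d}"
    return result


def fuzzy_candidates(plants: list[dict], max_distance: int = 2) -> list[tuple]:
    """Pairs of distinct names with distance < 3, for manual review only."""
    names = [p["scientific_name"] for p in plants]
    candidates = []
    for first, second in combinations(names, 2):
        distance = levenshtein(first.lower(), second.lower())
        if 0 < distance <= max_distance:
            candidates.append((first, second, distance))
    return candidates
```

Two caveats on this sketch: the pairwise comparison is quadratic, so on the full list it may be worth grouping by genus first; and the three-digit `plant_001` format still produces unique IDs past 999 entries but no longer sorts lexicographically, so wider padding could be considered.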
---
### Task 1.4: Enrich with Physical Characteristics
**Objective:** Add visual and physical attributes for each plant
**Actions:**
- [ ] Create `scripts/enrich_characteristics.py`
- [ ] Define characteristic schema:
```json
{
  "characteristics": {
    "leaf_shape": ["heart", "oval", "linear", "palmate", "lobed", "needle", "rosette"],
    "leaf_color": ["green", "variegated", "red", "purple", "silver", "yellow"],
    "leaf_texture": ["glossy", "matte", "fuzzy", "waxy", "smooth", "rough"],
    "growth_habit": ["upright", "trailing", "climbing", "rosette", "bushy", "tree-form"],
    "mature_height_cm": [0-500],
    "mature_width_cm": [0-300],
    "flowering": true/false,
    "flower_colors": ["white", "pink", "red", "yellow", "orange", "purple", "blue"],
    "bloom_season": ["spring", "summer", "fall", "winter", "year-round"]
  }
}
```
- [ ] Source characteristics data:
  - **Primary:** Web scraping from botanical databases (RHS, Missouri Botanical Garden)
  - **Secondary:** Wikipedia API for plant descriptions
  - **Fallback:** Family/genus-level defaults
- [ ] Implement web fetching with rate limiting
- [ ] Parse and extract characteristics from HTML/JSON responses
- [ ] Store enrichment sources for traceability
**Validation Criteria:**
- ≥80% of plants have leaf_shape populated
- ≥80% of plants have growth_habit populated
- ≥60% of plants have height/width estimates
- 100% of plants have flowering boolean
**Output Files:**
- `data/enriched_plants.json`
- `output/enrichment_coverage_report.json`
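
The coverage criteria above can be checked mechanically. The following is a minimal sketch of what a coverage report might compute; `coverage_report` and `is_populated` are hypothetical names, and it assumes the height and width thresholds apply to each field separately (the plan does not say).

```python
# Minimum coverage per field, taken from the Task 1.4 validation criteria.
THRESHOLDS = {
    "leaf_shape": 0.80,
    "growth_habit": 0.80,
    "mature_height_cm": 0.60,
    "mature_width_cm": 0.60,
    "flowering": 1.00,
}


def is_populated(value) -> bool:
    """None, empty strings, and empty lists count as missing; False does not."""
    return value is not None and value != "" and value != []


def coverage_report(plants: list[dict]) -> dict:
    """Compute per-field coverage and compare it with the thresholds."""
    total = len(plants)
    report = {}
    for field, minimum in THRESHOLDS.items():
        filled = sum(
            1
            for plant in plants
            if is_populated((plant.get("characteristics") or {}).get(field))
        )
        ratio = filled / total if total else 0.0
        report[field] = {
            "coverage": round(ratio, 4),
            "minimum": minimum,
            "passed": ratio >= minimum,
        }
    return report
```

Treating `False` as populated matters for `flowering`: a non-flowering plant still has a valid boolean and should count toward the 100% target.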
---
### Task 1.5: Categorize Plants by Type
**Objective:** Map existing categories to target classification system
**Actions:**
- [ ] Create `scripts/categorize_plants.py`
- [ ] Define target categories (per plan):
```
- Flowering Plant
- Tree / Palm
- Shrub / Bush
- Succulent / Cactus
- Fern
- Vine / Trailing
- Herb
- Orchid
- Bromeliad
- Air Plant
```
- [ ] Create mapping from current 11 categories:
```
Current              → Target
──────────────────────────────────────────
Air Plant            → Air Plant
Bromeliad            → Bromeliad
Cactus               → Succulent / Cactus
Fern                 → Fern
Flowering Houseplant → Flowering Plant
Herb                 → Herb
Orchid               → Orchid
Palm                 → Tree / Palm
Succulent            → Succulent / Cactus
Trailing/Climbing    → Vine / Trailing
Tropical Foliage     → [Requires secondary classification]
```
- [ ] Handle "Tropical Foliage" (largest category):
  - Use growth_habit from Task 1.4 to sub-classify
  - Cross-reference genus and family for tree-form species (e.g., genus Ficus → Tree / Palm)
- [ ] Add `primary_category` and `secondary_categories` fields
**Validation Criteria:**
- 100% of plants have primary_category assigned
- No plants remain as "Tropical Foliage" (all reclassified)
- Category distribution documented
**Output File:** `data/categorized_plants.json`
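
As an illustration of the mapping and the Tropical Foliage sub-classification, here is a small sketch. `CATEGORY_MAP` mirrors the table above; `HABIT_TO_CATEGORY` is purely an assumption, since the plan does not specify how each growth habit maps to a target category, and `categorize` is a hypothetical name.

```python
# Direct mappings from the current categories, as given in the plan.
CATEGORY_MAP = {
    "Air Plant": "Air Plant",
    "Bromeliad": "Bromeliad",
    "Cactus": "Succulent / Cactus",
    "Fern": "Fern",
    "Flowering Houseplant": "Flowering Plant",
    "Herb": "Herb",
    "Orchid": "Orchid",
    "Palm": "Tree / Palm",
    "Succulent": "Succulent / Cactus",
    "Trailing/Climbing": "Vine / Trailing",
}

# ASSUMPTION: illustrative habit-to-category rules, not defined by the plan.
HABIT_TO_CATEGORY = {
    "trailing": "Vine / Trailing",
    "climbing": "Vine / Trailing",
    "tree-form": "Tree / Palm",
    "bushy": "Shrub / Bush",
    "upright": "Shrub / Bush",
}


def categorize(plant: dict) -> tuple:
    """Return (primary_category, secondary_categories) for one plant.

    A primary category of None means the plant needs manual review.
    """
    current = plant["category"]
    if current in CATEGORY_MAP:
        return CATEGORY_MAP[current], []
    if current == "Tropical Foliage":
        habit = (plant.get("characteristics") or {}).get("growth_habit")
        target = HABIT_TO_CATEGORY.get(habit)
        if target:
            # Keep the original label as a secondary category.
            return target, ["Tropical Foliage"]
    return None, []
```

Returning `None` instead of a default keeps unclassified plants visible, which makes the "100% of plants have primary_category assigned" criterion easy to verify.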
---
### Task 1.6: Map Common Names to Scientific Names
**Objective:** Create bidirectional lookup for name resolution
**Actions:**
- [ ] Create `scripts/build_name_index.py`
- [ ] Build scientific → common names map (already exists, validate)
- [ ] Build common → scientific names map (reverse lookup)
- [ ] Handle ambiguous common names (multiple plants share same common name):
  - Flag conflicts
  - Add disambiguation notes
- [ ] Validate against external taxonomy:
  - World Flora Online (WFO) API
  - GBIF (Global Biodiversity Information Facility)
- [ ] Add `taxonomy_verified` boolean for taxonomically confirmed names
- [ ] Store alternative/deprecated scientific names as synonyms
**Validation Criteria:**
- Reverse lookup resolves ≥95% of common names unambiguously
- ≥70% of scientific names verified against WFO/GBIF
- Synonym list for deprecated names
**Output Files:**
- `data/name_index.json`
- `output/name_ambiguity_report.json`
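
The bidirectional lookup and ambiguity flagging could be sketched as below. The function name, the returned layout, and the choice to compare common names case-insensitively are assumptions made for illustration.

```python
from collections import defaultdict


def build_name_index(plants: list[dict]) -> dict:
    """Build forward and reverse name maps and flag ambiguous common names."""
    forward = {}
    reverse = defaultdict(list)
    for plant in plants:
        scientific = plant["scientific_name"]
        forward[scientific] = list(plant["common_names"])
        for common in plant["common_names"]:
            # ASSUMPTION: common names are matched case-insensitively.
            key = common.strip().casefold()
            if scientific not in reverse[key]:
                reverse[key].append(scientific)
    ambiguous = {
        name: targets for name, targets in reverse.items() if len(targets) > 1
    }
    unambiguous_ratio = 1 - len(ambiguous) / len(reverse) if reverse else 1.0
    return {
        "forward": forward,
        "reverse": dict(reverse),
        "ambiguous": ambiguous,
        "unambiguous_ratio": unambiguous_ratio,
    }
```

For example, if both *Crassula ovata* and *Epipremnum aureum* list "Money Plant", that key lands in `ambiguous`, and `unambiguous_ratio` is the figure to compare with the ≥95% criterion.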
---
### Task 1.7: Add Regional/Seasonal Information
**Objective:** Add native regions, hardiness zones, and seasonal behaviors
**Actions:**
- [ ] Create `scripts/add_regional_data.py`
- [ ] Define regional schema:
```json
{
  "regional_info": {
    "native_regions": ["South America", "Southeast Asia", "Africa", ...],
    "native_countries": ["Brazil", "Thailand", ...],
    "usda_hardiness_zones": ["9a", "9b", "10a", ...],
    "indoor_outdoor": "indoor_only" | "outdoor_temperate" | "outdoor_tropical",
    "seasonal_behavior": "evergreen" | "deciduous" | "dormant_winter"
  }
}
```
- [ ] Source regional data:
  - USDA Plants Database API
  - Wikipedia (native range sections)
  - Existing botanical databases
- [ ] Map families to typical native regions as fallback
- [ ] Add care-relevant seasonality (dormancy periods, bloom times)
**Validation Criteria:**
- ≥70% of plants have native_regions populated
- ≥60% of plants have hardiness zones
- 100% of plants have indoor_outdoor classification
**Output File:** `data/final_knowledge_base.json`
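
A small sketch of how the regional schema's enumerated values might be validated. The function name and messages are hypothetical. The zone pattern assumes USDA zones 1 through 13 with an optional `a`/`b` half-zone suffix, which also accepts the unsuffixed zones (`"11"`, `"12"`) used in the final schema example.

```python
import re

# Allowed values taken from the regional schema in Task 1.7.
INDOOR_OUTDOOR = {"indoor_only", "outdoor_temperate", "outdoor_tropical"}
SEASONAL_BEHAVIOR = {"evergreen", "deciduous", "dormant_winter"}
# ASSUMPTION: zones 1-13, optionally followed by "a" or "b".
ZONE_PATTERN = re.compile(r"^(1[0-3]|[1-9])[ab]?$")


def validate_regional_info(info: dict) -> list[str]:
    """Return the problems found in one regional_info object."""
    problems = []
    # indoor_outdoor is required for every plant (100% criterion).
    if info.get("indoor_outdoor") not in INDOOR_OUTDOOR:
        problems.append("indoor_outdoor missing or not an allowed value")
    # seasonal_behavior is optional, but must be valid when present.
    behavior = info.get("seasonal_behavior")
    if behavior is not None and behavior not in SEASONAL_BEHAVIOR:
        problems.append(f"invalid seasonal_behavior: {behavior}")
    for zone in info.get("usda_hardiness_zones") or []:
        if not ZONE_PATTERN.match(str(zone)):
            problems.append(f"invalid hardiness zone: {zone}")
    return problems
```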
---
## Final Knowledge Base Schema
```json
{
  "version": "1.0.0",
  "generated_date": "YYYY-MM-DD",
  "total_plants": 2000,
  "plants": [
    {
      "id": "plant_001",
      "scientific_name": "Philodendron hederaceum",
      "common_names": ["Heartleaf Philodendron", "Sweetheart Plant"],
      "synonyms": [],
      "family": "Araceae",
      "genus": "Philodendron",
      "species": "hederaceum",
      "cultivar": null,
      "primary_category": "Vine / Trailing",
      "secondary_categories": ["Tropical Foliage"],
      "characteristics": {
        "leaf_shape": "heart",
        "leaf_color": ["green"],
        "leaf_texture": "glossy",
        "growth_habit": "trailing",
        "mature_height_cm": 120,
        "mature_width_cm": 60,
        "flowering": true,
        "flower_colors": ["white", "green"],
        "bloom_season": "rarely indoors"
      },
      "regional_info": {
        "native_regions": ["Central America", "South America"],
        "native_countries": ["Mexico", "Brazil"],
        "usda_hardiness_zones": ["10b", "11", "12"],
        "indoor_outdoor": "indoor_only",
        "seasonal_behavior": "evergreen"
      },
      "taxonomy_verified": true,
      "data_sources": ["RHS", "Missouri Botanical Garden"],
      "last_updated": "YYYY-MM-DD"
    }
  ]
}
```
---
## Output File Structure
```
PlantGuide/
├── data/
│   ├── houseplants_list.json        # Original input (unchanged)
│   ├── normalized_plants.json       # Task 1.2 output
│   ├── master_plant_list.json       # Task 1.3 output
│   ├── enriched_plants.json         # Task 1.4 output
│   ├── categorized_plants.json      # Task 1.5 output
│   ├── name_index.json              # Task 1.6 output
│   └── final_knowledge_base.json    # Task 1.7 output (FINAL)
├── scripts/
│   ├── validate_plant_list.py       # Task 1.1
│   ├── normalize_names.py           # Task 1.2
│   ├── deduplicate_plants.py        # Task 1.3
│   ├── enrich_characteristics.py    # Task 1.4
│   ├── categorize_plants.py         # Task 1.5
│   ├── build_name_index.py          # Task 1.6
│   └── add_regional_data.py         # Task 1.7
├── output/
│   ├── validation_report.json
│   ├── deduplication_report.json
│   ├── enrichment_coverage_report.json
│   └── name_ambiguity_report.json
└── knowledge_base/
    ├── plants.db                    # SQLite database
    └── schema.sql                   # Database schema
```
---
## SQLite Database Schema
```sql
-- Task: Create SQLite database alongside JSON
CREATE TABLE plants (
    id TEXT PRIMARY KEY,
    scientific_name TEXT NOT NULL UNIQUE,
    family TEXT NOT NULL,
    genus TEXT,
    species TEXT,
    cultivar TEXT,
    primary_category TEXT NOT NULL,
    taxonomy_verified BOOLEAN DEFAULT FALSE,
    last_updated DATE
);

CREATE TABLE common_names (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    plant_id TEXT REFERENCES plants(id),
    common_name TEXT NOT NULL,
    is_primary BOOLEAN DEFAULT FALSE
);

CREATE TABLE characteristics (
    plant_id TEXT PRIMARY KEY REFERENCES plants(id),
    leaf_shape TEXT,
    leaf_color TEXT,            -- JSON array
    leaf_texture TEXT,
    growth_habit TEXT,
    mature_height_cm INTEGER,
    mature_width_cm INTEGER,
    flowering BOOLEAN,
    flower_colors TEXT,         -- JSON array
    bloom_season TEXT
);

CREATE TABLE regional_info (
    plant_id TEXT PRIMARY KEY REFERENCES plants(id),
    native_regions TEXT,        -- JSON array
    native_countries TEXT,      -- JSON array
    usda_hardiness_zones TEXT,  -- JSON array
    indoor_outdoor TEXT,
    seasonal_behavior TEXT
);

CREATE TABLE synonyms (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    plant_id TEXT REFERENCES plants(id),
    synonym TEXT NOT NULL
);

-- Indexes for common queries
CREATE INDEX idx_plants_family ON plants(family);
CREATE INDEX idx_plants_category ON plants(primary_category);
CREATE INDEX idx_common_names_name ON common_names(common_name);
CREATE INDEX idx_characteristics_habit ON characteristics(growth_habit);
```
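
To show how the JSON knowledge base might be mirrored into this schema, here is a hedged sketch using Python's standard `sqlite3` module. It uses an abridged copy of two tables, and the helper names (`create_database`, `load_plants`, `plants_in_zone`) are hypothetical. Because hardiness zones are stored as a JSON array in a TEXT column, the zone filter matches the quoted token with `LIKE`; where SQLite's JSON functions are available, `json_each` would be a cleaner alternative.

```python
import json
import sqlite3

# Abridged from the schema above: only the columns this example needs.
SCHEMA = """
CREATE TABLE plants (
    id TEXT PRIMARY KEY,
    scientific_name TEXT NOT NULL UNIQUE,
    family TEXT NOT NULL,
    primary_category TEXT NOT NULL
);
CREATE TABLE regional_info (
    plant_id TEXT PRIMARY KEY REFERENCES plants(id),
    usda_hardiness_zones TEXT  -- JSON array
);
"""


def create_database(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database and create the tables."""
    conn = sqlite3.connect(path)
    # SQLite only enforces REFERENCES clauses when this pragma is on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def load_plants(conn: sqlite3.Connection, plants: list[dict]) -> None:
    """Insert plant records; the whole batch commits or rolls back together."""
    with conn:
        for plant in plants:
            conn.execute(
                "INSERT INTO plants (id, scientific_name, family, primary_category)"
                " VALUES (?, ?, ?, ?)",
                (plant["id"], plant["scientific_name"], plant["family"],
                 plant["primary_category"]),
            )
            zones = plant.get("regional_info", {}).get("usda_hardiness_zones", [])
            conn.execute(
                "INSERT INTO regional_info (plant_id, usda_hardiness_zones)"
                " VALUES (?, ?)",
                (plant["id"], json.dumps(zones)),
            )


def plants_in_zone(conn: sqlite3.Connection, zone: str) -> list[str]:
    """Scientific names of plants whose zone list contains the given zone."""
    rows = conn.execute(
        "SELECT p.scientific_name FROM plants p"
        " JOIN regional_info r ON r.plant_id = p.id"
        " WHERE r.usda_hardiness_zones LIKE ?"
        " ORDER BY p.scientific_name",
        (f'%"{zone}"%',),  # quoted token, so "10" does not match "10b"
    )
    return [row[0] for row in rows]
```

A helper like `plants_in_zone` is roughly what a regional filter test would exercise against the populated database.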
---
## End Phase Validation Checklist
### Data Quality Gates
| Metric | Target | Validation Method |
|--------|--------|-------------------|
| Total validated plants | ≥1,500 | Count after deduplication |
| Schema compliance | 100% | JSON schema validation |
| Scientific name format | 100% valid | Regex: `^[A-Z][a-z]+ (× )?[a-z-]+` (prefix match; allows the hybrid sign from Task 1.2) |
| Plants with characteristics | ≥80% | Field coverage check |
| Plants with regional data | ≥70% | Field coverage check |
| Category coverage | 100% | No "Unknown" categories |
| Name disambiguation | ≥95% | Ambiguity report review |
| Taxonomy verification | ≥70% | WFO/GBIF cross-reference |
### Functional Validation
- [ ] **Query Test 1:** Lookup by scientific name returns full plant record
- [ ] **Query Test 2:** Lookup by common name returns correct plant(s)
- [ ] **Query Test 3:** Filter by category returns expected results
- [ ] **Query Test 4:** Filter by characteristics (leaf_shape=heart) works
- [ ] **Query Test 5:** Regional filter (hardiness_zone=10a) works
### Deliverable Checklist
- [ ] `data/final_knowledge_base.json` exists and passes schema validation
- [ ] `knowledge_base/plants.db` SQLite database is populated
- [ ] All scripts in `scripts/` directory are functional
- [ ] All reports in `output/` directory are generated
- [ ] Data coverage meets minimum thresholds
- [ ] No critical validation errors in reports
### Phase Exit Criteria
**Phase 1 is COMPLETE when:**
1. ✅ Final knowledge base contains ≥1,500 validated plant entries
2. ✅ ≥80% of plants have physical characteristics populated
3. ✅ ≥70% of plants have regional information
4. ✅ 100% of plants have valid categories (no "Unknown")
5. ✅ SQLite database mirrors JSON knowledge base
6. ✅ All validation tests pass
7. ✅ Documentation updated with final counts and coverage metrics
---
## Execution Order
```
Task 1.1 (Validate)
        │
        ▼
Task 1.2 (Normalize)
        │
        ▼
Task 1.3 (Deduplicate)
        │
        ├─→ Task 1.4 (Characteristics) ─┐
        │                               │
        ├─→ Task 1.6 (Name Index) ──────┤
        │                               │
        └─→ Task 1.7 (Regional) ────────┤
                                        │
                                        ▼
                              Task 1.5 (Categorize)
                    [Depends on 1.4 for Tropical Foliage]
                                        │
                                        ▼
                                 Final Assembly
                                 (JSON + SQLite)
                                        │
                                        ▼
                                Validation Suite
```
**Note:** Tasks 1.4, 1.6, and 1.7 can run in parallel after Task 1.3 completes. Task 1.5 depends on Task 1.4 output for sub-categorizing Tropical Foliage plants.
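
The ordering in the note can be expressed as a dependency graph and resolved with the standard library's `graphlib` (Python 3.9+). This sketch encodes only the dependencies stated in the note; `DEPENDENCIES` and `execution_batches` are illustrative names, and tasks in the same batch are the ones that could run in parallel.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
DEPENDENCIES = {
    "1.1": set(),
    "1.2": {"1.1"},
    "1.3": {"1.2"},
    "1.4": {"1.3"},
    "1.6": {"1.3"},
    "1.7": {"1.3"},
    "1.5": {"1.4"},
}


def execution_batches(dependencies: dict) -> list[list[str]]:
    """Group tasks into batches; every task in a batch can run concurrently."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

With the graph above, the batches come out as 1.1, then 1.2, then 1.3, then 1.4/1.6/1.7 together, then 1.5.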
---
## Risk Mitigation
| Risk | Mitigation |
|------|------------|
| External API rate limits | Implement caching, request throttling |
| Incomplete enrichment data | Use family-level defaults, document gaps |
| Ambiguous common names | Flag for manual review, prioritize top plants |
| Taxonomy database mismatches | Trust WFO as primary source |
| Large dataset processing | Process in batches, checkpoint progress |
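
As one possible shape for the "caching, request throttling" mitigation, the sketch below wraps any fetch function with an in-memory cache and a minimum delay between uncached calls. `ThrottledCache` is a hypothetical helper, not part of the plan; the `fetch` callable stands in for whatever HTTP client the enrichment scripts end up using, and the clock and sleep functions are injectable so the behavior can be tested without waiting.

```python
import time
from typing import Callable


class ThrottledCache:
    """Cache results by key and space out calls to the underlying fetcher."""

    def __init__(
        self,
        fetch: Callable[[str], object],
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._fetch = fetch
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._cache: dict = {}
        self._last_call = None  # time of the most recent uncached fetch

    def get(self, key: str):
        """Return the cached value, fetching (and throttling) only on a miss."""
        if key in self._cache:
            return self._cache[key]
        if self._last_call is not None:
            wait = self._min_interval - (self._clock() - self._last_call)
            if wait > 0:
                self._sleep(wait)
        self._last_call = self._clock()
        value = self._fetch(key)
        self._cache[key] = value
        return value
```

Cache hits return immediately and do not count against the rate limit. For long runs, the in-memory dictionary could be swapped for an on-disk store so that an interrupted enrichment job can resume without refetching.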