Phase 1: Knowledge Base Creation - Implementation Plan
Overview
Goal: Build structured plant knowledge from data/houseplants_list.json, enriching with taxonomy and characteristics.
Input: data/houseplants_list.json (2,278 plants, 11 categories, 50 families)
Output: Enriched plant knowledge base (JSON + SQLite) with ~500-2000 validated entries
Current Data Assessment
| Attribute | Current State | Required Enhancement |
|---|---|---|
| Total Plants | 2,278 | Validate, deduplicate |
| Scientific Names | Present | Validate binomial nomenclature |
| Common Names | Array per plant | Normalize, cross-reference |
| Family | 50 families | Validate against taxonomy |
| Category | 11 categories | Map to target types |
| Physical Characteristics | Missing | Must add |
| Regional/Seasonal Info | Missing | Must add |
Task Breakdown
Task 1.1: Load and Validate Plant List
Objective: Parse JSON and validate data integrity
Actions:
- Create Python script scripts/validate_plant_list.py
- Load data/houseplants_list.json
- Validate JSON schema:
  - Each plant has scientific_name (required, string)
  - Each plant has common_names (required, array of strings)
  - Each plant has family (required, string)
  - Each plant has category (required, string)
- Identify malformed entries (missing fields, wrong types)
- Generate validation report: output/validation_report.json
Validation Criteria:
- 0 malformed entries
- All required fields present
- No null/empty scientific names
Output File: scripts/validate_plant_list.py
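The checks above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of scripts/validate_plant_list.py: the function names and the report layout are assumptions, and it takes an already-parsed list of plant objects because the top-level structure of data/houseplants_list.json is not described here.

```python
# Required fields and their expected Python types, taken from the task list.
REQUIRED_FIELDS = {
    "scientific_name": str,
    "common_names": list,
    "family": str,
    "category": str,
}


def validate_plant(plant):
    """Return a list of problems found in one plant entry (empty if valid)."""
    if not isinstance(plant, dict):
        return ["entry is not an object"]
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in plant:
            problems.append(f"missing field: {field}")
        elif not isinstance(plant[field], expected_type):
            problems.append(f"wrong type for {field}")
    name = plant.get("scientific_name")
    if isinstance(name, str) and not name.strip():
        problems.append("empty scientific_name")
    names = plant.get("common_names")
    if isinstance(names, list) and not all(isinstance(n, str) for n in names):
        problems.append("common_names must contain only strings")
    return problems


def build_report(plants):
    """Collect malformed entries into a report dict (hypothetical layout)."""
    malformed = []
    for index, plant in enumerate(plants):
        problems = validate_plant(plant)
        if problems:
            malformed.append({"index": index, "problems": problems})
    return {
        "total": len(plants),
        "malformed_count": len(malformed),
        "malformed": malformed,
    }
```

The report dict could then be written to output/validation_report.json with json.dump; the "0 malformed entries" criterion corresponds to malformed_count being 0.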
Task 1.2: Normalize and Standardize Plant Names
Objective: Ensure consistent naming conventions
Actions:
- Create scripts/normalize_names.py
- Scientific name normalization:
  - Capitalize genus, lowercase species (e.g., "Philodendron hederaceum")
  - Handle cultivar notation: 'Cultivar Name' in single quotes
  - Validate binomial/trinomial format
- Common name normalization:
  - Title case standardization
  - Remove leading/trailing whitespace
  - Standardize punctuation
- Handle hybrid notation (×) consistently
- Flag names that don't match expected patterns
Validation Criteria:
- 100% of scientific names follow binomial nomenclature pattern
- No leading/trailing whitespace in any names
- Consistent cultivar notation
Output File: data/normalized_plants.json
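One possible shape for the scientific-name rules is sketched below. The function names and the exact pattern are assumptions made for illustration; in particular, treating a standalone "x" as the hybrid sign and accepting only the ranks subsp., var. and f. are simplifications that may not match the real data.

```python
import re

# Assumed pattern: Genus [×] species [rank epithet] ['Cultivar'].
NAME_PATTERN = re.compile(
    r"^[A-Z][a-z]+(?: ×)? [a-z-]+"
    r"(?: (?:(?:subsp|var|f)\. )?[a-z-]+)?"
    r"(?: '[^']+')?$"
)


def normalize_scientific_name(raw):
    """Capitalize the genus, lowercase epithets, keep cultivar in single quotes."""
    # Straighten curly quotes and collapse runs of whitespace.
    text = " ".join(raw.replace("\u2018", "'").replace("\u2019", "'").split())
    # Pull out a quoted cultivar so its capitalization is left untouched.
    cultivar = None
    match = re.search(r"'([^']+)'", text)
    if match:
        cultivar = match.group(1).strip()
        text = (text[:match.start()] + text[match.end():]).strip()
    words = text.split()
    if not words:
        return ""
    normalized = [words[0].capitalize()]
    for word in words[1:]:
        # A standalone x/X is assumed to be a hybrid marker.
        normalized.append("×" if word in {"x", "X", "×"} else word.lower())
    result = " ".join(normalized)
    if cultivar:
        result += f" '{cultivar}'"
    return result


def matches_expected_pattern(name):
    """True if a normalized name fits the assumed binomial/trinomial pattern."""
    return bool(NAME_PATTERN.match(name))
```

Names for which matches_expected_pattern returns False (for example a genus with only a cultivar, such as "Philodendron 'Birkin'") would be flagged rather than rejected outright.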
Task 1.3: Create Deduplicated Master List
Objective: Remove duplicates while preserving unique cultivars
Actions:
- Create scripts/deduplicate_plants.py
- Define deduplication rules:
  - Exact scientific name match = duplicate
  - Different cultivars of same species = keep both
  - Same plant, different common names = merge common names
- Identify potential duplicates using fuzzy matching on:
  - Scientific names (Levenshtein distance < 3)
  - Common names that are identical
- Generate duplicate candidates report for manual review
- Merge duplicates: combine common names arrays
- Assign unique plant IDs (plant_001, plant_002, etc.)
Validation Criteria:
- No exact scientific name duplicates
- All plants have unique IDs
- Merge log documenting all deduplication decisions
Output Files:
- data/master_plant_list.json
- output/deduplication_report.json
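The merge and fuzzy-match rules could look roughly like this. It is a sketch under assumptions: the function names are hypothetical, the Levenshtein distance is a plain dynamic-programming version rather than a library call, and the full scientific name (including any cultivar) is used as the exact-match key so that cultivars stay separate.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def deduplicate(plants):
    """Merge exact scientific-name duplicates and assign sequential IDs."""
    merged = {}
    for plant in plants:
        key = plant["scientific_name"]
        if key in merged:
            # Same plant seen again: union the common names, keep order.
            for name in plant["common_names"]:
                if name not in merged[key]["common_names"]:
                    merged[key]["common_names"].append(name)
        else:
            merged[key] = {**plant, "common_names": list(plant["common_names"])}
    master = list(merged.values())
    for number, plant in enumerate(master, start=1):
        plant["id"] = f"plant_{number:03d}"
    return master


def duplicate_candidates(master, max_distance=3):
    """Pairs of distinct names closer than max_distance, for manual review."""
    names = [plant["scientific_name"] for plant in master]
    return [(a, b) for a, b in combinations(names, 2)
            if levenshtein(a.lower(), b.lower()) < max_distance]
```

The pairwise comparison is quadratic in the number of plants, which should still be workable for a list of about 2,278 entries but would need blocking (for example by genus) on a much larger dataset. Note that three-digit padding yields four-digit IDs past plant_999.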
Task 1.4: Enrich with Physical Characteristics
Objective: Add visual and physical attributes for each plant
Actions:
- Create scripts/enrich_characteristics.py
- Define characteristic schema:
  {
    "characteristics": {
      "leaf_shape": ["heart", "oval", "linear", "palmate", "lobed", "needle", "rosette"],
      "leaf_color": ["green", "variegated", "red", "purple", "silver", "yellow"],
      "leaf_texture": ["glossy", "matte", "fuzzy", "waxy", "smooth", "rough"],
      "growth_habit": ["upright", "trailing", "climbing", "rosette", "bushy", "tree-form"],
      "mature_height_cm": [0-500],
      "mature_width_cm": [0-300],
      "flowering": true/false,
      "flower_colors": ["white", "pink", "red", "yellow", "orange", "purple", "blue"],
      "bloom_season": ["spring", "summer", "fall", "winter", "year-round"]
    }
  }
- Source characteristics data:
  - Primary: Web scraping from botanical databases (RHS, Missouri Botanical Garden)
  - Secondary: Wikipedia API for plant descriptions
  - Fallback: Family/genus-level defaults
- Implement web fetching with rate limiting
- Parse and extract characteristics from HTML/JSON responses
- Store enrichment sources for traceability
Validation Criteria:
- ≥80% of plants have leaf_shape populated
- ≥80% of plants have growth_habit populated
- ≥60% of plants have height/width estimates
- 100% of plants have flowering boolean
Output Files:
- data/enriched_plants.json
- output/enrichment_coverage_report.json
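The coverage thresholds above lend themselves to a small check like the one below. This is only a sketch: the report layout and function names are assumptions, and the "height/width" criterion is interpreted here as applying to mature_height_cm and mature_width_cm separately.

```python
# Minimum share of plants that must have each field populated (from the criteria).
THRESHOLDS = {
    "leaf_shape": 0.80,
    "growth_habit": 0.80,
    "mature_height_cm": 0.60,
    "mature_width_cm": 0.60,
    "flowering": 1.00,
}


def is_populated(value):
    """A field counts as populated unless it is None, "" or an empty list.

    False and 0 are treated as real values, so flowering=False is populated.
    """
    return value is not None and value != "" and value != []


def coverage_report(plants):
    """Compute per-field coverage and compare it with the target."""
    total = len(plants)
    report = {}
    for field, target in THRESHOLDS.items():
        populated = sum(
            1 for plant in plants
            if is_populated((plant.get("characteristics") or {}).get(field))
        )
        ratio = populated / total if total else 0.0
        report[field] = {
            "coverage": round(ratio, 4),
            "target": target,
            "passed": ratio >= target,
        }
    return report
```

A dict in this shape could be serialized directly as output/enrichment_coverage_report.json.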
Task 1.5: Categorize Plants by Type
Objective: Map existing categories to target classification system
Actions:
- Create scripts/categorize_plants.py
- Define target categories (per plan):
  - Flowering Plant
  - Tree / Palm
  - Shrub / Bush
  - Succulent / Cactus
  - Fern
  - Vine / Trailing
  - Herb
  - Orchid
  - Bromeliad
  - Air Plant
- Create mapping from current 11 categories:
  Current              → Target
  ─────────────────────────────
  Air Plant            → Air Plant
  Bromeliad            → Bromeliad
  Cactus               → Succulent / Cactus
  Fern                 → Fern
  Flowering Houseplant → Flowering Plant
  Herb                 → Herb
  Orchid               → Orchid
  Palm                 → Tree / Palm
  Succulent            → Succulent / Cactus
  Trailing/Climbing    → Vine / Trailing
  Tropical Foliage     → [Requires secondary classification]
- Handle "Tropical Foliage" (largest category):
  - Use growth_habit from Task 1.4 to sub-classify
  - Cross-reference family and genus for tree-form species (e.g., Ficus → Tree / Palm)
- Add primary_category and secondary_categories fields
Validation Criteria:
- 100% of plants have primary_category assigned
- No plants remain as "Tropical Foliage" (all reclassified)
- Category distribution documented
Output File: data/categorized_plants.json
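A sketch of the mapping step, assuming the category table above is applied verbatim. The sub-classification of "Tropical Foliage" is not fully specified in this plan, so HABIT_TO_CATEGORY and TREE_FORM_GENERA below are illustrative assumptions; habits with no obvious target (such as "upright" or "rosette") return None so they can be reviewed manually.

```python
# Direct mapping from the current categories to the target categories.
CATEGORY_MAP = {
    "Air Plant": "Air Plant",
    "Bromeliad": "Bromeliad",
    "Cactus": "Succulent / Cactus",
    "Fern": "Fern",
    "Flowering Houseplant": "Flowering Plant",
    "Herb": "Herb",
    "Orchid": "Orchid",
    "Palm": "Tree / Palm",
    "Succulent": "Succulent / Cactus",
    "Trailing/Climbing": "Vine / Trailing",
}

# Assumption: how growth_habit values might resolve Tropical Foliage plants.
HABIT_TO_CATEGORY = {
    "trailing": "Vine / Trailing",
    "climbing": "Vine / Trailing",
    "tree-form": "Tree / Palm",
    "bushy": "Shrub / Bush",
}

# Assumption: genera treated as tree-form regardless of recorded habit.
TREE_FORM_GENERA = {"Ficus"}


def categorize(plant):
    """Return the target primary_category, or None if it needs manual review."""
    current = plant["category"]
    if current != "Tropical Foliage":
        return CATEGORY_MAP.get(current)
    genus = plant["scientific_name"].split()[0]
    if genus in TREE_FORM_GENERA:
        return "Tree / Palm"
    habit = (plant.get("characteristics") or {}).get("growth_habit")
    return HABIT_TO_CATEGORY.get(habit)
```

Any plant for which categorize returns None would block the "100% of plants have primary_category assigned" criterion until a rule or manual decision covers it.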
Task 1.6: Map Common Names to Scientific Names
Objective: Create bidirectional lookup for name resolution
Actions:
- Create scripts/build_name_index.py
- Build scientific → common names map (already exists, validate)
- Build common → scientific names map (reverse lookup)
- Handle ambiguous common names (multiple plants share same common name):
  - Flag conflicts
  - Add disambiguation notes
- Validate against external taxonomy:
  - World Flora Online (WFO) API
  - GBIF (Global Biodiversity Information Facility)
- Add verified boolean for taxonomically confirmed names
- Store alternative/deprecated scientific names as synonyms
Validation Criteria:
- Reverse lookup resolves ≥95% of common names unambiguously
- ≥70% of scientific names verified against WFO/GBIF
- Synonym list for deprecated names
Output Files:
- data/name_index.json
- output/name_ambiguity_report.json
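The bidirectional lookup and the ambiguity check might be built along these lines. The function names are hypothetical, and folding case on common names is an assumption about how lookups should behave; the external WFO/GBIF verification is left out because it depends on network access.

```python
from collections import defaultdict


def build_name_index(plants):
    """Build forward and reverse name maps plus the ambiguous common names."""
    scientific_to_common = {}
    common_to_scientific = defaultdict(list)
    for plant in plants:
        scientific = plant["scientific_name"]
        scientific_to_common[scientific] = list(plant["common_names"])
        for common in plant["common_names"]:
            key = common.casefold()  # case-insensitive lookup key (assumption)
            if scientific not in common_to_scientific[key]:
                common_to_scientific[key].append(scientific)
    # A common name is ambiguous when it points at more than one plant.
    ambiguous = {name: targets
                 for name, targets in common_to_scientific.items()
                 if len(targets) > 1}
    return scientific_to_common, dict(common_to_scientific), ambiguous


def unambiguous_rate(common_to_scientific):
    """Share of common names that resolve to exactly one scientific name."""
    if not common_to_scientific:
        return 0.0
    unique = sum(1 for targets in common_to_scientific.values()
                 if len(targets) == 1)
    return unique / len(common_to_scientific)
```

The value returned by unambiguous_rate is what the "≥95% of common names" criterion would be measured against, and the ambiguous dict is a natural starting point for output/name_ambiguity_report.json.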
Task 1.7: Add Regional/Seasonal Information
Objective: Add native regions, hardiness zones, and seasonal behaviors
Actions:
- Create scripts/add_regional_data.py
- Define regional schema:
  {
    "regional_info": {
      "native_regions": ["South America", "Southeast Asia", "Africa", ...],
      "native_countries": ["Brazil", "Thailand", ...],
      "usda_hardiness_zones": ["9a", "9b", "10a", ...],
      "indoor_outdoor": "indoor_only" | "outdoor_temperate" | "outdoor_tropical",
      "seasonal_behavior": "evergreen" | "deciduous" | "dormant_winter"
    }
  }
- Source regional data:
  - USDA Plants Database API
  - Wikipedia (native range sections)
  - Existing botanical databases
- Map families to typical native regions as fallback
- Add care-relevant seasonality (dormancy periods, bloom times)
Validation Criteria:
- ≥70% of plants have native_regions populated
- ≥60% of plants have hardiness zones
- 100% of plants have indoor_outdoor classification
Output File: data/final_knowledge_base.json
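Before a regional_info record is written, it could be checked against the enumerations in the schema above. The sketch below is an assumption about how that check might look; the zone pattern accepts zones 1 through 13 with an optional a/b half-zone suffix, which covers values such as "10b" and "11" used in this plan.

```python
import re

# Allowed values, copied from the regional schema.
INDOOR_OUTDOOR = {"indoor_only", "outdoor_temperate", "outdoor_tropical"}
SEASONAL_BEHAVIOR = {"evergreen", "deciduous", "dormant_winter"}

# Zones 1-13, optionally followed by a or b (assumed format).
ZONE_PATTERN = re.compile(r"^(?:1[0-3]|[1-9])[ab]?$")


def regional_problems(info):
    """Return a list of schema problems for one regional_info dict."""
    problems = []
    if info.get("indoor_outdoor") not in INDOOR_OUTDOOR:
        problems.append("indoor_outdoor missing or not an allowed value")
    behavior = info.get("seasonal_behavior")
    if behavior is not None and behavior not in SEASONAL_BEHAVIOR:
        problems.append("seasonal_behavior is not an allowed value")
    for zone in info.get("usda_hardiness_zones") or []:
        if not ZONE_PATTERN.match(zone):
            problems.append(f"invalid hardiness zone: {zone}")
    return problems
```

indoor_outdoor is treated as mandatory here because the criteria require it for 100% of plants, while the other fields are only validated when present.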
Final Knowledge Base Schema
{
  "version": "1.0.0",
  "generated_date": "YYYY-MM-DD",
  "total_plants": 2000,
  "plants": [
    {
      "id": "plant_001",
      "scientific_name": "Philodendron hederaceum",
      "common_names": ["Heartleaf Philodendron", "Sweetheart Plant"],
      "synonyms": [],
      "family": "Araceae",
      "genus": "Philodendron",
      "species": "hederaceum",
      "cultivar": null,
      "primary_category": "Vine / Trailing",
      "secondary_categories": ["Tropical Foliage"],
      "characteristics": {
        "leaf_shape": "heart",
        "leaf_color": ["green"],
        "leaf_texture": "glossy",
        "growth_habit": "trailing",
        "mature_height_cm": 120,
        "mature_width_cm": 60,
        "flowering": true,
        "flower_colors": ["white", "green"],
        "bloom_season": "rarely indoors"
      },
      "regional_info": {
        "native_regions": ["Central America", "South America"],
        "native_countries": ["Mexico", "Brazil"],
        "usda_hardiness_zones": ["10b", "11", "12"],
        "indoor_outdoor": "indoor_only",
        "seasonal_behavior": "evergreen"
      },
      "taxonomy_verified": true,
      "data_sources": ["RHS", "Missouri Botanical Garden"],
      "last_updated": "YYYY-MM-DD"
    }
  ]
}
Output File Structure
PlantGuide/
├── data/
│ ├── houseplants_list.json # Original input (unchanged)
│ ├── normalized_plants.json # Task 1.2 output
│ ├── master_plant_list.json # Task 1.3 output
│ ├── enriched_plants.json # Task 1.4 output
│ ├── categorized_plants.json # Task 1.5 output
│ ├── name_index.json # Task 1.6 output
│ └── final_knowledge_base.json # Task 1.7 output (FINAL)
├── scripts/
│ ├── validate_plant_list.py # Task 1.1
│ ├── normalize_names.py # Task 1.2
│ ├── deduplicate_plants.py # Task 1.3
│ ├── enrich_characteristics.py # Task 1.4
│ ├── categorize_plants.py # Task 1.5
│ ├── build_name_index.py # Task 1.6
│ └── add_regional_data.py # Task 1.7
├── output/
│ ├── validation_report.json
│ ├── deduplication_report.json
│ ├── enrichment_coverage_report.json
│ └── name_ambiguity_report.json
└── knowledge_base/
├── plants.db # SQLite database
└── schema.sql # Database schema
SQLite Database Schema
-- Task: Create SQLite database alongside JSON
CREATE TABLE plants (
id TEXT PRIMARY KEY,
scientific_name TEXT NOT NULL UNIQUE,
family TEXT NOT NULL,
genus TEXT,
species TEXT,
cultivar TEXT,
primary_category TEXT NOT NULL,
taxonomy_verified BOOLEAN DEFAULT FALSE,
last_updated DATE
);
CREATE TABLE common_names (
id INTEGER PRIMARY KEY AUTOINCREMENT,
plant_id TEXT REFERENCES plants(id),
common_name TEXT NOT NULL,
is_primary BOOLEAN DEFAULT FALSE
);
CREATE TABLE characteristics (
plant_id TEXT PRIMARY KEY REFERENCES plants(id),
leaf_shape TEXT,
leaf_color TEXT, -- JSON array
leaf_texture TEXT,
growth_habit TEXT,
mature_height_cm INTEGER,
mature_width_cm INTEGER,
flowering BOOLEAN,
flower_colors TEXT, -- JSON array
bloom_season TEXT
);
CREATE TABLE regional_info (
plant_id TEXT PRIMARY KEY REFERENCES plants(id),
native_regions TEXT, -- JSON array
native_countries TEXT, -- JSON array
usda_hardiness_zones TEXT, -- JSON array
indoor_outdoor TEXT,
seasonal_behavior TEXT
);
CREATE TABLE synonyms (
id INTEGER PRIMARY KEY AUTOINCREMENT,
plant_id TEXT REFERENCES plants(id),
synonym TEXT NOT NULL
);
-- Indexes for common queries
CREATE INDEX idx_plants_family ON plants(family);
CREATE INDEX idx_plants_category ON plants(primary_category);
CREATE INDEX idx_common_names_name ON common_names(common_name);
CREATE INDEX idx_characteristics_habit ON characteristics(growth_habit);
End Phase Validation Checklist
Data Quality Gates
| Metric | Target | Validation Method |
|---|---|---|
| Total validated plants | ≥1,500 | Count after deduplication |
| Schema compliance | 100% | JSON schema validation |
| Scientific name format | 100% valid | Regex: ^[A-Z][a-z]+ [a-z]+ |
| Plants with characteristics | ≥80% | Field coverage check |
| Plants with regional data | ≥70% | Field coverage check |
| Category coverage | 100% | No "Unknown" categories |
| Name disambiguation | ≥95% | Ambiguity report review |
| Taxonomy verification | ≥70% | WFO/GBIF cross-reference |
Functional Validation
- Query Test 1: Lookup by scientific name returns full plant record
- Query Test 2: Lookup by common name returns correct plant(s)
- Query Test 3: Filter by category returns expected results
- Query Test 4: Filter by characteristics (leaf_shape=heart) works
- Query Test 5: Regional filter (hardiness_zone=10a) works
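Some of these query tests could be exercised against an in-memory database as sketched below. The schema string is a trimmed subset of the SQLite schema in this plan (only the columns the queries touch), and the helper names are hypothetical. The zone filter decodes the JSON array in Python rather than relying on SQLite's JSON functions, which may not be available in every build.

```python
import json
import sqlite3

# Trimmed subset of the plan's schema, for illustration only.
SCHEMA = """
CREATE TABLE plants (
    id TEXT PRIMARY KEY,
    scientific_name TEXT NOT NULL UNIQUE,
    family TEXT NOT NULL,
    primary_category TEXT NOT NULL
);
CREATE TABLE common_names (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    plant_id TEXT REFERENCES plants(id),
    common_name TEXT NOT NULL
);
CREATE TABLE characteristics (
    plant_id TEXT PRIMARY KEY REFERENCES plants(id),
    leaf_shape TEXT
);
CREATE TABLE regional_info (
    plant_id TEXT PRIMARY KEY REFERENCES plants(id),
    usda_hardiness_zones TEXT  -- JSON array
);
"""


def load_plant(conn, plant):
    """Insert one knowledge-base record into the trimmed tables."""
    conn.execute("INSERT INTO plants VALUES (?, ?, ?, ?)",
                 (plant["id"], plant["scientific_name"],
                  plant["family"], plant["primary_category"]))
    conn.executemany(
        "INSERT INTO common_names (plant_id, common_name) VALUES (?, ?)",
        [(plant["id"], name) for name in plant["common_names"]])
    conn.execute("INSERT INTO characteristics VALUES (?, ?)",
                 (plant["id"], plant["characteristics"]["leaf_shape"]))
    conn.execute("INSERT INTO regional_info VALUES (?, ?)",
                 (plant["id"],
                  json.dumps(plant["regional_info"]["usda_hardiness_zones"])))


def ids_by_common_name(conn, name):
    """Query Test 2: case-insensitive lookup by common name."""
    rows = conn.execute(
        "SELECT plant_id FROM common_names "
        "WHERE common_name = ? COLLATE NOCASE", (name,))
    return [row[0] for row in rows]


def ids_by_leaf_shape(conn, shape):
    """Query Test 4: filter by a characteristic."""
    rows = conn.execute(
        "SELECT plant_id FROM characteristics WHERE leaf_shape = ?", (shape,))
    return [row[0] for row in rows]


def ids_in_zone(conn, zone):
    """Query Test 5: filter by hardiness zone, decoding the JSON array."""
    rows = conn.execute(
        "SELECT plant_id, usda_hardiness_zones FROM regional_info "
        "ORDER BY plant_id")
    return [plant_id for plant_id, zones in rows
            if zone in json.loads(zones or "[]")]
```

A test run would create the tables with conn.executescript(SCHEMA) on sqlite3.connect(":memory:"), load a few records, and assert on the returned IDs.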
Deliverable Checklist
- data/final_knowledge_base.json exists and passes schema validation
- knowledge_base/plants.db SQLite database is populated
- All scripts in scripts/ directory are functional
- All reports in output/ directory are generated
- Data coverage meets minimum thresholds
- No critical validation errors in reports
Phase Exit Criteria
Phase 1 is COMPLETE when:
- ✅ Final knowledge base contains ≥1,500 validated plant entries
- ✅ ≥80% of plants have physical characteristics populated
- ✅ ≥70% of plants have regional information
- ✅ 100% of plants have valid categories (no "Unknown")
- ✅ SQLite database mirrors JSON knowledge base
- ✅ All validation tests pass
- ✅ Documentation updated with final counts and coverage metrics
Execution Order
Task 1.1 (Validate)
↓
Task 1.2 (Normalize)
↓
Task 1.3 (Deduplicate)
↓
├─→ Task 1.4 (Characteristics) ─┐
│ │
├─→ Task 1.6 (Name Index) ──────┤
│ │
└─→ Task 1.7 (Regional) ────────┤
↓
Task 1.5 (Categorize)
[Depends on 1.4 for Tropical Foliage]
↓
Final Assembly
(JSON + SQLite)
↓
Validation Suite
Note: Tasks 1.4, 1.6, and 1.7 can run in parallel after Task 1.3 completes. Task 1.5 depends on Task 1.4 output for sub-categorizing Tropical Foliage plants.
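The ordering in the note can be expressed as a dependency graph and resolved into parallel "waves" with the standard library's graphlib module (Python 3.9+). The DEPENDENCIES dict is one reading of the diagram and note, with Task 1.5 depending only on Task 1.4; the "assembly" node and the function name are assumptions for illustration.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
DEPENDENCIES = {
    "1.1": set(),
    "1.2": {"1.1"},
    "1.3": {"1.2"},
    "1.4": {"1.3"},
    "1.6": {"1.3"},
    "1.7": {"1.3"},
    "1.5": {"1.4"},
    "assembly": {"1.4", "1.5", "1.6", "1.7"},
}


def execution_waves(dependencies):
    """Group tasks into waves; tasks within a wave can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

With the graph above, Tasks 1.4, 1.6 and 1.7 land in the same wave, followed by 1.5 and then final assembly.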
Risk Mitigation
| Risk | Mitigation |
|---|---|
| External API rate limits | Implement caching, request throttling |
| Incomplete enrichment data | Use family-level defaults, document gaps |
| Ambiguous common names | Flag for manual review, prioritize top plants |
| Taxonomy database mismatches | Trust WFO as primary source |
| Large dataset processing | Process in batches, checkpoint progress |
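The first mitigation (caching plus request throttling) could take a shape like the following. This is a sketch, not a prescribed design: the class name is hypothetical, the fetch callable stands in for whatever HTTP client the scripts end up using, and the cache lives only in memory, so it would need to be persisted to disk to survive restarts or support checkpointing.

```python
import time


class ThrottledCache:
    """Wrap a fetch function with an in-memory cache and a minimum call gap."""

    def __init__(self, fetch, min_interval=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch                # callable: key -> value
        self._min_interval = min_interval  # seconds between real requests
        self._clock = clock                # injectable for testing
        self._sleep = sleep                # injectable for testing
        self._cache = {}
        self._last_call = None

    def get(self, key):
        # Serve repeated lookups from the cache without touching the network.
        if key in self._cache:
            return self._cache[key]
        # Wait out the remainder of the interval since the last real request.
        if self._last_call is not None:
            wait = self._min_interval - (self._clock() - self._last_call)
            if wait > 0:
                self._sleep(wait)
        value = self._fetch(key)
        self._last_call = self._clock()
        self._cache[key] = value
        return value
```

Injecting clock and sleep keeps the throttling logic testable without real delays; in production the defaults apply.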