# Plant Identification Core ML Model - Development Plan

## Overview

Build a plant knowledge base from a curated plant list, then source/create an image dataset to train the Core ML model for visual plant identification.

---

## Phase 1: Knowledge Base Creation from Plant List

**Goal:** Build structured plant knowledge from a curated plant list (CSV/JSON), enriching it with taxonomy and characteristics.

| Task | Description |
|------|-------------|
| 1.1 | Load and validate plant list file (CSV/JSON) |
| 1.2 | Normalize and standardize plant names |
| 1.3 | Create a master plant list with deduplicated entries |
| 1.4 | Enrich with physical characteristics (leaf shape, flower color, height, etc.) |
| 1.5 | Categorize plants by type (flower, tree, shrub, vegetable, herb, succulent) |
| 1.6 | Map common names to scientific names (binomial nomenclature) |
| 1.7 | Add regional/seasonal information from external sources |

**Deliverable:** Structured plant knowledge base (JSON/SQLite) with ~500-2000 plant entries
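
Tasks 1.1-1.3 can be sketched as below; the CSV columns (`common_name`, `scientific_name`, `type`) are hypothetical stand-ins for whatever schema the curated list actually uses:

```python
import csv
from io import StringIO

# Hypothetical plant-list columns; the real schema depends on the curated file.
SAMPLE = """common_name,scientific_name,type
Peace Lily ,Spathiphyllum wallisii,houseplant
peace  lily,Spathiphyllum wallisii,houseplant
Snake Plant,Dracaena trifasciata,succulent
"""

def load_plants(fileobj):
    """Load, normalize, and deduplicate a plant list (tasks 1.1-1.3)."""
    seen, plants = set(), []
    for row in csv.DictReader(fileobj):
        # Collapse whitespace and case so "Peace Lily " == "peace lily"
        common = " ".join(row["common_name"].strip().lower().split())
        sci = " ".join(row["scientific_name"].strip().split())
        key = sci.lower()  # dedupe on the scientific (binomial) name
        if key in seen:
            continue
        seen.add(key)
        plants.append({"common_name": common,
                       "scientific_name": sci,
                       "type": row["type"].strip()})
    return plants

plants = load_plants(StringIO(SAMPLE))
print([p["common_name"] for p in plants])  # ['peace lily', 'snake plant']
```

Deduplicating on the scientific name keeps one master entry per species; task 1.6 then maps any remaining common-name aliases onto these binomials.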

---

## Phase 2: Image Dataset Acquisition

**Goal:** Gather labeled plant images matching our knowledge base.

| Task | Description |
|------|-------------|
| 2.1 | Research public plant image datasets (PlantCLEF, iNaturalist, Pl@ntNet) |
| 2.2 | Cross-reference available datasets with Phase 1 plant list |
| 2.3 | Download and organize images by species/category |
| 2.4 | Establish minimum image count per class (target: 100+ images per plant) |
| 2.5 | Identify gaps - plants in our knowledge base without sufficient images |
| 2.6 | Source supplementary images for gap plants (Flickr API, Wikimedia Commons) |
| 2.7 | Verify image quality and label accuracy (remove mislabeled/low-quality) |
| 2.8 | Split dataset: 70% training, 15% validation, 15% test |

**Deliverable:** Labeled image dataset with 50,000-200,000 images across target plant classes
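
The 70/15/15 split in task 2.8 can be sketched with a seeded shuffle; a stratified variant (running this per class so every species keeps the same ratio) would simply wrap it in a per-label loop:

```python
import random

def split_dataset(items, train=0.70, val=0.15, seed=42):
    """Shuffle and split labeled items into train/val/test sets (task 2.8)."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so splits are reproducible
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# Stand-in for a list of (image_path, label) records
train_set, val_set, test_set = split_dataset(range(1000))
print(len(train_set), len(val_set), len(test_set))  # 700 150 150
```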

---

## Phase 3: Dataset Preprocessing & Augmentation

**Goal:** Prepare images for training with consistent formatting and augmentation.

| Task | Description |
|------|-------------|
| 3.1 | Standardize image dimensions (e.g., 224x224 or 299x299) |
| 3.2 | Normalize color channels and handle various image formats |
| 3.3 | Implement data augmentation pipeline (rotation, flip, brightness, crop) |
| 3.4 | Create augmented variants to balance underrepresented classes |
| 3.5 | Generate image manifest files mapping paths to labels |
| 3.6 | Validate dataset integrity (no corrupted files, correct labels) |

**Deliverable:** Training-ready dataset with augmentation pipeline
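
The manifest generation in task 3.5 might look like this, assuming a class-per-directory layout (`root/<label>/<image>`); the layout is an assumption for the sketch, not a requirement of the plan:

```python
import json
import tempfile
from pathlib import Path

def build_manifest(root):
    """Map image paths to class labels (task 3.5), assuming a
    class-per-directory layout: root/<label>/<image>."""
    return [{"path": str(p), "label": p.parent.name}
            for p in sorted(Path(root).glob("*/*"))
            if p.suffix.lower() in {".jpg", ".jpeg", ".png"}]

# Tiny demo with a throwaway directory standing in for the real dataset root.
root = Path(tempfile.mkdtemp())
(root / "rosa_canina").mkdir()
(root / "rosa_canina" / "img_001.jpg").touch()
manifest = build_manifest(root)
print(json.dumps(manifest))  # one entry labeled "rosa_canina"
```

Writing the manifest out as JSON keeps the label mapping auditable, which helps with the integrity checks in task 3.6.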

---

## Phase 4: Model Architecture Selection

**Goal:** Choose and configure the optimal model architecture for on-device inference.

| Task | Description |
|------|-------------|
| 4.1 | Evaluate architectures: MobileNetV3, EfficientNet-Lite, ResNet50, Vision Transformer |
| 4.2 | Benchmark model size vs accuracy tradeoffs for mobile deployment |
| 4.3 | Select base architecture (recommend: MobileNetV3 or EfficientNet-Lite for iOS) |
| 4.4 | Configure transfer learning from ImageNet pretrained weights |
| 4.5 | Design classification head for our plant class count |
| 4.6 | Define target metrics: accuracy >85%, model size <50MB, inference <100ms |

**Deliverable:** Model architecture specification document
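
For task 4.2, a back-of-the-envelope size check already separates the candidates; the parameter counts below are approximate figures from the original papers and should be treated as ballpark numbers:

```python
def estimated_size_mb(num_params, bytes_per_weight=2):
    """Rough on-disk size after Float16 quantization (2 bytes/weight)."""
    return num_params * bytes_per_weight / 1e6

# Approximate backbone parameter counts (ballpark, from the original papers).
candidates = {"MobileNetV3-Large": 5.4e6,
              "EfficientNet-Lite0": 4.7e6,
              "ResNet50": 25.6e6}
for name, params in candidates.items():
    print(f"{name}: ~{estimated_size_mb(params):.1f} MB at Float16")
```

At Float16, ResNet50 alone (~51 MB) already exceeds the <50MB budget before the classification head is added, which supports the MobileNetV3/EfficientNet-Lite recommendation in task 4.3.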

---

## Phase 5: Initial Training Run

**Goal:** Train baseline model and establish performance benchmarks.

| Task | Description |
|------|-------------|
| 5.1 | Set up training environment (PyTorch/TensorFlow with GPU) |
| 5.2 | Implement training loop with learning rate scheduling |
| 5.3 | Train baseline model for 50 epochs |
| 5.4 | Log training/validation loss and accuracy curves |
| 5.5 | Evaluate on test set - document per-class accuracy |
| 5.6 | Identify problematic classes (low accuracy, high confusion) |
| 5.7 | Generate confusion matrix to find commonly confused plant pairs |

**Deliverable:** Baseline model with documented accuracy metrics
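
The learning-rate scheduling in task 5.2 could use a standard cosine decay with linear warmup; the hyperparameters shown (base LR, warmup steps) are illustrative, not tuned values:

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-3, warmup=500):
    """Cosine learning-rate schedule with linear warmup (task 5.2)."""
    if step < warmup:
        return base_lr * step / warmup  # linear ramp-up
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

# LR ramps up during warmup, peaks at base_lr, then decays smoothly to ~0.
print(cosine_lr(250, 10_000), cosine_lr(500, 10_000), cosine_lr(10_000, 10_000))
```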

---

## Phase 6: Model Refinement & Iteration

**Goal:** Improve model through iterative refinement cycles.

| Task | Description |
|------|-------------|
| 6.1 | Address class imbalance with weighted loss or oversampling |
| 6.2 | Fine-tune hyperparameters (learning rate, batch size, dropout) |
| 6.3 | Experiment with different augmentation strategies |
| 6.4 | Add more training data for underperforming classes |
| 6.5 | Consider hierarchical classification (family -> genus -> species) |
| 6.6 | Implement hard negative mining for confused pairs |
| 6.7 | Re-train and evaluate until target accuracy achieved |
| 6.8 | Perform k-fold cross-validation for robust metrics |

**Deliverable:** Refined model meeting accuracy targets (>85% top-1, >95% top-5)
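
Task 6.1's weighted loss needs per-class weights; a common inverse-frequency scheme, sketched here, scales them so the average weight over all training samples is 1.0:

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights for a weighted loss (task 6.1).

    Scaled so the average weight over all training samples is 1.0.
    """
    counts = Counter(labels)
    total, n_classes = len(labels), len(counts)
    return {c: total / (n_classes * k) for c, k in counts.items()}

labels = ["rose"] * 90 + ["fern"] * 10
w = class_weights(labels)
print(w)  # the rare "fern" class is weighted 9x higher than "rose"
```

Ordered by class index, these weights plug into e.g. PyTorch's `CrossEntropyLoss(weight=...)`; oversampling the rare classes is the alternative named in the same task.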

---

## Phase 7: Core ML Conversion & Optimization

**Goal:** Convert trained model to Core ML format optimized for iOS.

| Task | Description |
|------|-------------|
| 7.1 | Export trained model to ONNX or saved model format |
| 7.2 | Convert to Core ML using coremltools |
| 7.3 | Apply quantization (Float16 or Int8) to reduce model size |
| 7.4 | Configure model metadata (class labels, input/output specs) |
| 7.5 | Test converted model accuracy matches original |
| 7.6 | Optimize for Neural Engine execution |
| 7.7 | Benchmark inference speed on target devices (iPhone 12+) |

**Deliverable:** Optimized `.mlmodel` or `.mlpackage` file
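
Tasks 7.2-7.4 with coremltools might look like the following sketch. It assumes a traced PyTorch model (`traced_model`) and the Phase 1 label list (`class_labels`) already exist, and the input name, shape, and scale are placeholders to match however Phase 3 preprocessed the images:

```python
import coremltools as ct

mlmodel = ct.convert(
    traced_model,  # torch.jit.trace output from the Phase 5/6 model
    inputs=[ct.ImageType(name="image", shape=(1, 3, 224, 224),
                         scale=1 / 255.0)],
    classifier_config=ct.ClassifierConfig(class_labels),  # Phase 1 labels
    compute_precision=ct.precision.FLOAT16,  # task 7.3: halves weight size
    convert_to="mlprogram",  # modern .mlpackage format
)
mlmodel.short_description = "Plant species classifier"
mlmodel.save("PlantClassifier.mlpackage")
```

Task 7.5 would then run the same held-out images through both the original and converted models and compare top-1 agreement before shipping the package.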

---

## Phase 8: iOS Integration Testing

**Goal:** Validate model performance in real iOS environment.

| Task | Description |
|------|-------------|
| 8.1 | Create test iOS app with camera capture |
| 8.2 | Integrate Core ML model with Vision framework |
| 8.3 | Test with real-world plant photos (not from training set) |
| 8.4 | Measure on-device inference latency |
| 8.5 | Test edge cases (partial plants, multiple plants, poor lighting) |
| 8.6 | Gather user feedback on identification accuracy |
| 8.7 | Document failure modes and edge cases |

**Deliverable:** Validated model with real-world accuracy report

---

## Phase 9: Knowledge Integration

**Goal:** Combine visual model with plant knowledge base for rich results.

| Task | Description |
|------|-------------|
| 9.1 | Link model class predictions to Phase 1 knowledge base |
| 9.2 | Design result payload (name, description, care tips, characteristics) |
| 9.3 | Add confidence thresholds and "unknown plant" handling |
| 9.4 | Implement top-N predictions with confidence scores |
| 9.5 | Create fallback for low-confidence identifications |

**Deliverable:** Complete plant identification system with rich metadata
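
Tasks 9.1 and 9.3-9.5 can be sketched together; the threshold value, payload fields, and knowledge-base entries are illustrative placeholders:

```python
def interpret(predictions, knowledge_base, threshold=0.60, top_n=3):
    """Map model output to knowledge-base entries (tasks 9.1, 9.3, 9.4).

    `predictions` maps class label -> softmax confidence;
    `knowledge_base` maps labels to Phase 1 metadata.
    """
    ranked = sorted(predictions.items(), key=lambda kv: kv[1], reverse=True)
    ranked = ranked[:top_n]
    best_label, best_conf = ranked[0]
    if best_conf < threshold:
        # Low confidence: "unknown plant" fallback with suggestions (task 9.5)
        return {"status": "unknown", "suggestions": ranked}
    return {"status": "identified", "label": best_label,
            "confidence": best_conf,
            "info": knowledge_base.get(best_label, {}),
            "alternatives": ranked[1:]}

kb = {"rosa_canina": {"common_name": "dog rose", "care": "full sun"}}
result = interpret({"rosa_canina": 0.91, "rosa_rugosa": 0.06}, kb)
print(result["status"], result["info"]["common_name"])  # identified dog rose
```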

---

## Phase 10: Final Validation & Documentation

**Goal:** Comprehensive testing and production readiness.

| Task | Description |
|------|-------------|
| 10.1 | Run full test suite across diverse plant images |
| 10.2 | Document supported plant list with accuracy per species |
| 10.3 | Create model card (training data, limitations, biases) |
| 10.4 | Write iOS integration guide |
| 10.5 | Package final `.mlmodel` with metadata and labels |
| 10.6 | Establish model versioning and update strategy |

**Deliverable:** Production-ready Core ML model with documentation
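
The model card in task 10.3 can start as a simple structured document; every value below is a placeholder to be filled in from the actual training runs:

```python
import json

# All values are placeholders, to be filled in from the real training runs.
model_card = {
    "name": "PlantClassifier",
    "version": "1.0.0",
    "architecture": "MobileNetV3-Large (transfer learning from ImageNet)",
    "num_classes": None,  # 200-500 species per the target spec
    "training_data": ["PlantCLEF", "iNaturalist", "supplementary images"],
    "metrics": {"top1_accuracy": None, "top5_accuracy": None},
    "limitations": [
        "accuracy degrades in poor lighting and with partial plants",
        "species outside the supported list fall back to unknown",
    ],
}
print(json.dumps(model_card, indent=2))
```

Keeping the card as machine-readable JSON lets the versioning strategy in task 10.6 diff it between model releases.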

---

## Summary

| Phase | Focus | Key Deliverable |
|-------|-------|-----------------|
| 1 | Knowledge Base Creation | Plant knowledge base from plant list |
| 2 | Image Acquisition | Labeled dataset (50K-200K images) |
| 3 | Preprocessing | Training-ready augmented dataset |
| 4 | Architecture | Model design specification |
| 5 | Initial Training | Baseline model + benchmarks |
| 6 | Refinement | Optimized model (>85% accuracy) |
| 7 | Core ML Conversion | Quantized `.mlmodel` file |
| 8 | iOS Testing | Real-world validation report |
| 9 | Knowledge Integration | Rich identification results |
| 10 | Final Validation | Production-ready package |

---

## Key Insights

The plant list provides **structured plant data** (names, characteristics), but visual identification requires image training data. The plan therefore combines the plant knowledge base with external image datasets to create a complete plant identification system.

## Target Specifications

| Metric | Target |
|--------|--------|
| Plant Classes | 200-500 species |
| Top-1 Accuracy | >85% |
| Top-5 Accuracy | >95% |
| Model Size | <50MB |
| Inference Time | <100ms on iPhone 12+ |

## Recommended Datasets

- **PlantCLEF** - Annual plant identification challenge dataset
- **iNaturalist** - Community-sourced plant observations
- **Pl@ntNet** - Botanical research dataset
- **Oxford Flowers** - 102 flower categories
- **Wikimedia Commons** - Supplementary images

## Recommended Architecture

**MobileNetV3-Large** or **EfficientNet-Lite** for the optimal balance of:

- On-device performance
- Model size constraints
- Classification accuracy
- Neural Engine compatibility