PlantGuide/Docs/phase2-implementation-plan.md

# Phase 2: Image Dataset Acquisition - Implementation Plan
## Overview
**Goal:** Gather labeled plant images matching our 2,064-plant knowledge base from Phase 1.
**Target Deliverable:** Labeled image dataset with 50,000-200,000 images across target plant classes, split into training (70%), validation (15%), and test (15%) sets.
---
## Prerequisites
- [x] Phase 1 complete: `data/final_knowledge_base.json` (2,064 plants)
- [x] SQLite database: `knowledge_base/plants.db`
- [ ] Python environment with required packages
- [ ] API keys for image sources (iNaturalist, Flickr, etc.)
- [ ] Storage space: ~50-100GB for raw images
---
## Task Breakdown
### Task 2.1: Research Public Plant Image Datasets
**Objective:** Evaluate available datasets for compatibility with our plant list.
**Actions:**
1. Research and document each dataset:
- **PlantCLEF** - Download links, species coverage, image format, license
- **iNaturalist** - API access, species coverage, observation quality filters
- **PlantNet (Pl@ntNet)** - API documentation, rate limits, attribution requirements
- **Oxford Flowers 102** - Direct download, category mapping
- **Wikimedia Commons** - API access for botanical images
2. Create `scripts/phase2/research_datasets.py` to:
- Query each API for available species counts
- Document download procedures and authentication
- Estimate total available images per source
**Output:** `output/dataset_research_report.json`
**Validation:**
- [ ] Report contains at least 4 dataset sources
- [ ] Each source has documented: URL, license, estimated image count, access method
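The availability check in `research_datasets.py` could start from iNaturalist's public taxa endpoint, which needs no authentication. A minimal sketch — the `/v1/taxa` endpoint and `q`/`rank`/`per_page` parameters are the real public API, but the response-field handling (`results`, `observations_count`) and the helper names are illustrative assumptions, not the script's final form:

```python
import json
import urllib.parse
import urllib.request

INAT_TAXA_URL = "https://api.inaturalist.org/v1/taxa"  # public, no API key required


def build_taxa_query(scientific_name: str, per_page: int = 1) -> str:
    """Build a taxa-search URL for one scientific name."""
    params = urllib.parse.urlencode({
        "q": scientific_name,
        "rank": "species",
        "per_page": per_page,
    })
    return f"{INAT_TAXA_URL}?{params}"


def estimate_observation_count(scientific_name: str) -> int:
    """Fetch the top taxon match and return its observation count (0 if none)."""
    with urllib.request.urlopen(build_taxa_query(scientific_name), timeout=30) as resp:
        payload = json.load(resp)
    results = payload.get("results", [])
    return results[0].get("observations_count", 0) if results else 0
```

Summing `estimate_observation_count` over the plant list gives a rough per-source image-volume estimate for the research report.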
---
### Task 2.2: Cross-Reference Datasets with Plant List
**Objective:** Identify which plants from our knowledge base have images in public datasets.
**Actions:**
1. Create `scripts/phase2/cross_reference_plants.py` to:
- Load plant list from `data/final_knowledge_base.json`
- Query each dataset API for matching scientific names
- Handle synonyms using `data/synonyms.json`
- Track exact matches, synonym matches, and genus-level matches
2. Generate coverage matrix: plants × datasets
**Output:**
- `output/dataset_coverage_matrix.json` - Per-plant availability
- `output/cross_reference_report.json` - Summary statistics
**Validation:**
- [ ] Coverage matrix includes all 2,064 plants
- [ ] Report shows percentage coverage per dataset
- [ ] Total number of unique plants with at least one dataset match is reported
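The matching logic in `cross_reference_plants.py` reduces to a tiered lookup: exact name, then synonym, then genus. A sketch under assumed data shapes (a set of dataset species names and a `synonyms` dict mapping accepted name to synonym list — the actual structure of `data/synonyms.json` may differ):

```python
def match_plants(plants: list[str], dataset_species: set[str],
                 synonyms: dict[str, list[str]]) -> dict[str, str]:
    """Classify each plant's best match against one dataset's species list.

    Returns plant name -> "exact", "synonym", "genus", or "none".
    """
    # Genus is the first word of a binomial scientific name.
    dataset_genera = {name.split()[0] for name in dataset_species}
    matches = {}
    for name in plants:
        if name in dataset_species:
            matches[name] = "exact"
        elif any(s in dataset_species for s in synonyms.get(name, [])):
            matches[name] = "synonym"
        elif name.split()[0] in dataset_genera:
            matches[name] = "genus"
        else:
            matches[name] = "none"
    return matches
```

Running this once per dataset and stacking the results yields the plants × datasets coverage matrix.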
---
### Task 2.3: Download and Organize Images
**Objective:** Download images from selected sources and organize by species.
**Actions:**
1. Create directory structure:
```
datasets/
├── raw/
│   ├── inaturalist/
│   ├── plantclef/
│   ├── wikimedia/
│   └── flickr/
└── organized/
    └── {scientific_name}/
        ├── img_001.jpg
        └── metadata.json
```
2. Create `scripts/phase2/download_inaturalist.py`:
- Use iNaturalist API with research-grade filter
- Download max 500 images per species
- Include metadata (observer, date, location, license)
- Handle rate limiting with exponential backoff
3. Create `scripts/phase2/download_plantclef.py`:
- Download from PlantCLEF challenge archives
- Extract and organize by species
4. Create `scripts/phase2/download_wikimedia.py`:
- Query Wikimedia Commons API for botanical images
- Filter by license (CC-BY, CC-BY-SA, public domain)
5. Create `scripts/phase2/organize_images.py`:
- Consolidate images from all sources
- Rename with consistent naming: `{plant_id}_{source}_{index}.jpg`
- Generate per-species `metadata.json`
**Output:**
- `datasets/organized/` - Organized image directory
- `output/download_progress.json` - Download status per species
**Validation:**
- [ ] Images organized in consistent directory structure
- [ ] Each image has source attribution in metadata
- [ ] Progress tracking shows download status for all plants
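All three download scripts share the rate-limiting concern, so the exponential-backoff wrapper is worth writing once. A minimal sketch using only the standard library; the retryable status codes and delay schedule are reasonable defaults, not values any of the APIs mandate:

```python
import time
import urllib.error
import urllib.request


def backoff_delays(max_retries: int = 5, base_delay: float = 1.0) -> list[float]:
    """Delays between attempts: base, 2*base, 4*base, ..."""
    return [base_delay * 2 ** i for i in range(max_retries - 1)]


def fetch_with_backoff(url: str, max_retries: int = 5, base_delay: float = 1.0) -> bytes:
    """GET a URL, retrying on 429/5xx responses with exponential backoff."""
    delays = backoff_delays(max_retries, base_delay)
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url, timeout=60) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            retryable = err.code == 429 or err.code >= 500
            if not retryable or attempt == max_retries - 1:
                raise  # client errors and final failures propagate
            time.sleep(delays[attempt])
    raise AssertionError("unreachable")
```

Each downloader can then cap requests per species (e.g. the 500-image iNaturalist limit) and record outcomes into `output/download_progress.json`.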
---
### Task 2.4: Establish Minimum Image Count per Class
**Objective:** Define and track image count thresholds.
**Actions:**
1. Create `scripts/phase2/count_images.py` to:
- Count images per species in `datasets/organized/`
- Classify plants into coverage tiers:
- **Excellent:** 200+ images
- **Good:** 100-199 images (target minimum)
- **Marginal:** 50-99 images
- **Insufficient:** 10-49 images
- **Critical:** <10 images
2. Generate coverage report with distribution histogram
**Output:**
- `output/image_count_report.json`
- `output/coverage_histogram.png`
**Validation:**
- [ ] Target: At least 60% of plants have 100+ images
- [ ] Report identifies all plants below minimum threshold
- [ ] Total image count within target range (50K-200K)
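The tier classification in `count_images.py` is a straightforward threshold walk over the counts. A sketch — the tier cutoffs come from the list above; the `.jpg`-only glob is a simplifying assumption (real image sets will mix extensions):

```python
from pathlib import Path

# (minimum count, tier name), checked top-down; cutoffs match the plan above.
TIERS = [
    (200, "excellent"),
    (100, "good"),
    (50, "marginal"),
    (10, "insufficient"),
    (0, "critical"),
]


def tier_for(count: int) -> str:
    """Map an image count to its coverage tier."""
    for minimum, name in TIERS:
        if count >= minimum:
            return name
    return "critical"


def count_images(organized_dir: str) -> dict:
    """Map each species directory to its image count and coverage tier."""
    counts = {}
    for species_dir in Path(organized_dir).iterdir():
        if species_dir.is_dir():
            n = sum(1 for _ in species_dir.glob("*.jpg"))
            counts[species_dir.name] = {"count": n, "tier": tier_for(n)}
    return counts
```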
---
### Task 2.5: Identify Gap Plants
**Objective:** Find plants needing supplementary images.
**Actions:**
1. Create `scripts/phase2/identify_gaps.py` to:
- List plants with <100 images
- Prioritize gaps by:
- Plant popularity/commonality
- Category importance (user-facing plants first)
- Ease of sourcing (common names available)
2. Generate prioritized gap list with recommended sources
**Output:**
- `output/gap_plants.json` - Prioritized list with current counts
- `output/gap_analysis_report.md` - Human-readable analysis
**Validation:**
- [ ] Gap list includes all plants under 100-image threshold
- [ ] Each gap plant has recommended supplementary sources
- [ ] Priority scores assigned based on criteria
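One way `identify_gaps.py` could turn the three criteria into a single priority score is a weighted sum over normalized signals. The weights, field names, and 0-1 normalization below are illustrative assumptions, not settled design:

```python
def gap_priority(plant: dict, weights=(0.5, 0.3, 0.2)) -> float:
    """Score a gap plant; higher means fill first.

    Assumes each plant dict carries three 0-1 signals:
      popularity    - how commonly users encounter it
      category      - importance of its category (user-facing first)
      sourceability - how easy supplementary images are to find
    """
    w_pop, w_cat, w_src = weights
    return (w_pop * plant["popularity"]
            + w_cat * plant["category"]
            + w_src * plant["sourceability"])


def prioritize_gaps(gap_plants: list[dict]) -> list[dict]:
    """Return gap plants sorted by descending priority score."""
    return sorted(gap_plants, key=gap_priority, reverse=True)
```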
---
### Task 2.6: Source Supplementary Images
**Objective:** Fill gaps using additional image sources.
**Actions:**
1. Create `scripts/phase2/download_flickr.py`:
- Use Flickr API with botanical/plant tags
- Filter by license (CC-BY, CC-BY-SA)
- Search by scientific name AND common names
2. Create `scripts/phase2/download_google_images.py`:
- Use Google Custom Search API (paid tier)
- Apply strict botanical filters
- Download only high-resolution images
3. Create `scripts/phase2/manual_curation_list.py`:
- Generate list of gap plants requiring manual sourcing
- Create curation checklist for human review
4. Update `organize_images.py` to incorporate supplementary sources
**Output:**
- Updated `datasets/organized/` with supplementary images
- `output/supplementary_download_report.json`
- `output/manual_curation_checklist.md` (if needed)
**Validation:**
- [ ] Gap plants have improved coverage
- [ ] All supplementary images have proper licensing
- [ ] Re-run Task 2.4 shows improved coverage metrics
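For `download_flickr.py`, the license filter lives in the search request itself. A sketch of building a `flickr.photos.search` call — the method and parameter names are Flickr's real REST API, but the numeric license IDs should be verified against `flickr.photos.licenses.getInfo` before relying on them:

```python
import urllib.parse

FLICKR_REST = "https://api.flickr.com/services/rest/"
CC_BY, CC_BY_SA = "4", "5"  # assumed Flickr license IDs; verify via licenses.getInfo


def build_flickr_search(api_key: str, query: str, per_page: int = 100) -> str:
    """Build a flickr.photos.search URL restricted to CC-BY / CC-BY-SA photos."""
    params = urllib.parse.urlencode({
        "method": "flickr.photos.search",
        "api_key": api_key,
        "text": query,                      # scientific name or common name
        "license": f"{CC_BY},{CC_BY_SA}",
        "extras": "license,url_o,owner_name",  # needed for attribution metadata
        "format": "json",
        "nojsoncallback": 1,
        "per_page": per_page,
    })
    return f"{FLICKR_REST}?{params}"
```

Issuing one search per scientific name and one per common name, then de-duplicating by photo ID, covers the "scientific name AND common names" requirement.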
---
### Task 2.7: Verify Image Quality and Labels
**Objective:** Remove mislabeled and low-quality images.
**Actions:**
1. Create `scripts/phase2/quality_filter.py` to:
- Detect corrupt/truncated images
- Enforce a minimum resolution of 224×224 pixels
- Detect duplicates using perceptual hashing (pHash)
- Flag images with text overlays/watermarks
2. Create `scripts/phase2/label_verification.py` to:
- Use pretrained plant classifier for sanity check
- Flag images where model confidence is very low
- Generate review queue for human verification
3. Create `scripts/phase2/human_review_tool.py`:
- Simple CLI tool for reviewing flagged images
- Accept/reject/relabel options
- Track reviewer decisions
**Output:**
- `datasets/verified/` - Cleaned image directory
- `output/quality_report.json` - Filtering statistics
- `output/removed_images.json` - Log of removed images with reasons
**Validation:**
- [ ] All images pass minimum resolution check
- [ ] No duplicate images (no pair exceeds 95% perceptual similarity)
- [ ] Flagged images reviewed and resolved
- [ ] Removal rate documented (<20% expected)
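The duplicate detection in `quality_filter.py` would use the `imagehash` package's pHash on PIL images; to show the underlying idea without those dependencies, here is a pure-stdlib difference hash over an already-decoded grayscale grid (the 8×9 grid shape and the hypothetical `dhash`/`hamming` helpers are illustrative, not the production code):

```python
def dhash(pixels: list[list[int]]) -> int:
    """Difference hash of a decoded grayscale grid.

    pixels: 8 rows x 9 columns of 0-255 values, i.e. the image already
    resized. Each bit records whether brightness rises left-to-right.
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left < right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing hash bits; a small distance means near-duplicate."""
    return bin(a ^ b).count("1")
```

In the real pipeline, two images whose 64-bit hashes differ in only a few bits (roughly the ≥95%-similarity band) would be flagged as near-duplicates and one copy removed.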
---
### Task 2.8: Split Dataset
**Objective:** Create reproducible train/validation/test splits.
**Actions:**
1. Create `scripts/phase2/split_dataset.py` to:
- Stratified split maintaining class distribution
- 70% training, 15% validation, 15% test
- Prevent data leakage (the same photo, or near-duplicate photos of one plant, must not span splits)
- Handle class imbalance (minimum samples per class in each split)
2. Create manifest files:
```
datasets/
├── train/
│   ├── images/
│   └── manifest.csv   (columns: path, label, scientific_name, plant_id)
├── val/
│   ├── images/
│   └── manifest.csv
└── test/
    ├── images/
    └── manifest.csv
```
3. Generate split statistics report
**Output:**
- `datasets/train/`, `datasets/val/`, `datasets/test/` directories
- `output/split_statistics.json`
- `output/class_distribution.png` (per-split histogram)
**Validation:**
- [ ] Split ratios within 1% of target (70/15/15)
- [ ] Each class has minimum 5 samples in val and test sets
- [ ] No image appears in multiple splits
- [ ] Manifest files are complete and valid
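The core of `split_dataset.py` is a per-class shuffle-and-slice with a fixed seed, which gives stratification and reproducibility at once. A sketch — the floor-based slicing and the `max(1, …)` guard are simplifications; the real script must also enforce the 5-sample minimum for val/test:

```python
import random


def stratified_split(by_class: dict[str, list[str]],
                     ratios=(0.70, 0.15, 0.15), seed=42):
    """Split image paths into train/val/test per class, reproducibly.

    by_class: class label -> list of unique image paths.
    Returns three dicts keyed by the same class labels.
    """
    rng = random.Random(seed)          # fresh seeded RNG => same split every run
    train, val, test = {}, {}, {}
    for label, paths in by_class.items():
        paths = sorted(paths)          # deterministic base order before shuffle
        rng.shuffle(paths)
        n = len(paths)
        n_train = int(n * ratios[0])
        n_val = max(1, int(n * ratios[1]))  # keep at least one val sample
        train[label] = paths[:n_train]
        val[label] = paths[n_train:n_train + n_val]
        test[label] = paths[n_train + n_val:]
    return train, val, test
```

Because every path lands in exactly one slice per class, the no-leakage check reduces to verifying the three sets are pairwise disjoint.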
---
## End-Phase Validation Checklist
Run `scripts/phase2/validate_phase2.py` to verify:
| # | Validation Criterion | Target | Pass/Fail |
|---|---------------------|--------|-----------|
| 1 | Total image count | 50,000 - 200,000 | [ ] |
| 2 | Plant coverage | ≥80% of 2,064 plants have images | [ ] |
| 3 | Minimum images per included plant | ≥50 images (relaxed from 100 for rare plants) | [ ] |
| 4 | Image quality | 100% pass resolution check | [ ] |
| 5 | No duplicates | 0 exact duplicates, <1% near-duplicates | [ ] |
| 6 | License compliance | 100% images have documented license | [ ] |
| 7 | Train/val/test split exists | All three directories with manifests | [ ] |
| 8 | Split ratio accuracy | Within 1% of 70/15/15 | [ ] |
| 9 | Stratification verified | Chi-square test p > 0.05 | [ ] |
| 10 | Metadata completeness | 100% images have source + license | [ ] |
**Phase 2 Complete When:** All 10 validation criteria pass.
---
## Scripts Summary
| Script | Task | Input | Output |
|--------|------|-------|--------|
| `research_datasets.py` | 2.1 | None | `dataset_research_report.json` |
| `cross_reference_plants.py` | 2.2 | Knowledge base | `cross_reference_report.json` |
| `download_inaturalist.py` | 2.3 | Plant list | Images + metadata |
| `download_plantclef.py` | 2.3 | Plant list | Images + metadata |
| `download_wikimedia.py` | 2.3 | Plant list | Images + metadata |
| `organize_images.py` | 2.3 | Raw images | `datasets/organized/` |
| `count_images.py` | 2.4 | Organized images | `image_count_report.json` |
| `identify_gaps.py` | 2.5 | Image counts | `gap_plants.json` |
| `download_flickr.py` | 2.6 | Gap plants | Supplementary images |
| `download_google_images.py` | 2.6 | Gap plants | Supplementary images |
| `manual_curation_list.py` | 2.6 | Gap plants | `manual_curation_checklist.md` |
| `quality_filter.py` | 2.7 | All images | `datasets/verified/` |
| `label_verification.py` | 2.7 | Verified images | Review queue |
| `human_review_tool.py` | 2.7 | Review queue | Reviewer decisions |
| `split_dataset.py` | 2.8 | Verified images | Train/val/test splits |
| `validate_phase2.py` | Final | All outputs | Validation report |
---
## Dependencies
```
# requirements-phase2.txt
requests>=2.28.0
Pillow>=9.0.0
imagehash>=4.3.0
pandas>=1.5.0
tqdm>=4.64.0
python-dotenv>=1.0.0
matplotlib>=3.6.0
scipy>=1.9.0
```
---
## Environment Variables
```
# .env.phase2
INATURALIST_APP_ID=your_app_id
INATURALIST_APP_SECRET=your_secret
FLICKR_API_KEY=your_key
FLICKR_API_SECRET=your_secret
GOOGLE_CSE_API_KEY=your_key
GOOGLE_CSE_CX=your_cx
```
---
## Estimated Timeline
| Task | Effort | Notes |
|------|--------|-------|
| 2.1 Research | 1 day | Documentation and API testing |
| 2.2 Cross-reference | 1 day | API queries, matching logic |
| 2.3 Download | 3-5 days | Rate-limited by APIs |
| 2.4 Count | 0.5 day | Quick analysis |
| 2.5 Gap analysis | 0.5 day | Based on counts |
| 2.6 Supplementary | 2-3 days | Depends on gap size |
| 2.7 Quality verification | 2 days | Includes manual review |
| 2.8 Split | 0.5 day | Automated |
| Validation | 0.5 day | Final checks |
| **Total** | ~11-14 days | Sum of per-task estimates |
---
## Risk Mitigation
| Risk | Mitigation |
|------|------------|
| API rate limits | Implement backoff, cache responses, spread over time |
| Low coverage for rare plants | Accept lower threshold (50 images) with augmentation in Phase 3 |
| License issues | Track all sources, prefer CC-licensed content |
| Storage limits | Implement progressive download, compress as needed |
| Label noise | Use pretrained model for sanity check, human review queue |
---
## Next Steps After Phase 2
1. Review `output/image_count_report.json` for Phase 3 augmentation priorities
2. Ensure `datasets/train/manifest.csv` format is compatible with training framework
3. Document any plants excluded due to insufficient images