Add PlantGuide iOS app with plant identification and care management
- Implement camera capture and plant identification workflow
- Add Core Data persistence for plants, care schedules, and cached API data
- Create collection view with grid/list layouts and filtering
- Build plant detail views with care information display
- Integrate Trefle botanical API for plant care data
- Add local image storage for captured plant photos
- Implement dependency injection container for testability
- Include accessibility support throughout the app

Bug fixes in this commit:

- Fix Trefle API decoding by removing duplicate CodingKeys
- Fix LocalCachedImage to load from correct PlantImages directory
- Set dateAdded when saving plants for proper collection sorting

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit adds `Docs/phase2-implementation-plan.md` (new file, 383 lines):
# Phase 2: Image Dataset Acquisition - Implementation Plan

## Overview

**Goal:** Gather labeled plant images matching our 2,064-plant knowledge base from Phase 1.

**Target Deliverable:** Labeled image dataset with 50,000-200,000 images across target plant classes, split into training (70%), validation (15%), and test (15%) sets.
---

## Prerequisites

- [x] Phase 1 complete: `data/final_knowledge_base.json` (2,064 plants)
- [x] SQLite database: `knowledge_base/plants.db`
- [ ] Python environment with required packages
- [ ] API keys for image sources (iNaturalist, Flickr, etc.)
- [ ] Storage space: ~50-100 GB for raw images
---

## Task Breakdown

### Task 2.1: Research Public Plant Image Datasets

**Objective:** Evaluate available datasets for compatibility with our plant list.

**Actions:**

1. Research and document each dataset:
   - **PlantCLEF** - Download links, species coverage, image format, license
   - **iNaturalist** - API access, species coverage, observation quality filters
   - **PlantNet (Pl@ntNet)** - API documentation, rate limits, attribution requirements
   - **Oxford Flowers 102** - Direct download, category mapping
   - **Wikimedia Commons** - API access for botanical images

2. Create `scripts/phase2/research_datasets.py` to:
   - Query each API for available species counts
   - Document download procedures and authentication
   - Estimate total available images per source
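The species-count query in `research_datasets.py` could be sketched as below. This is a hedged illustration: the endpoint URL and the `total_results` field follow the public iNaturalist v1 API, but the helper names are ours, and the parsing is shown on a canned payload rather than a live call.

```python
# Hypothetical helpers for research_datasets.py (iNaturalist source).
INAT_TAXA_ENDPOINT = "https://api.inaturalist.org/v1/taxa"

def build_taxa_query(scientific_name: str) -> dict:
    """Query parameters for counting species-rank taxa matching a name.

    per_page=0 asks the API for only the match count, not the records.
    """
    return {"q": scientific_name, "rank": "species", "per_page": 0}

def parse_total_results(response_json: dict) -> int:
    """iNaturalist v1 responses report the match count in `total_results`."""
    return int(response_json.get("total_results", 0))

# Usage with a live call (not run here):
#   resp = requests.get(INAT_TAXA_ENDPOINT, params=build_taxa_query("Monstera deliciosa"))
#   count = parse_total_results(resp.json())
```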
**Output:** `output/dataset_research_report.json`

**Validation:**

- [ ] Report contains at least 4 dataset sources
- [ ] Each source has documented: URL, license, estimated image count, access method
---

### Task 2.2: Cross-Reference Datasets with Plant List

**Objective:** Identify which plants from our knowledge base have images in public datasets.

**Actions:**

1. Create `scripts/phase2/cross_reference_plants.py` to:
   - Load plant list from `data/final_knowledge_base.json`
   - Query each dataset API for matching scientific names
   - Handle synonyms using `data/synonyms.json`
   - Track exact matches, synonym matches, and genus-level matches

2. Generate coverage matrix: plants × datasets
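The matching logic above can be sketched as a single pure function. The tier names mirror the three match types tracked by `cross_reference_plants.py`; the in-memory `synonyms` mapping is an assumption standing in for the contents of `data/synonyms.json`.

```python
def match_plant(scientific_name: str, dataset_names: set,
                synonyms: dict) -> str:
    """Classify how one plant matches a dataset's species list.

    Returns "exact", "synonym", "genus", or "none" -- checked in
    decreasing order of specificity.
    """
    if scientific_name in dataset_names:
        return "exact"
    for syn in synonyms.get(scientific_name, []):
        if syn in dataset_names:
            return "synonym"
    # Fall back to genus-level: compare the first word of the binomial.
    genus = scientific_name.split()[0]
    if any(name.split()[0] == genus for name in dataset_names):
        return "genus"
    return "none"
```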
**Output:**

- `output/dataset_coverage_matrix.json` - Per-plant availability
- `output/cross_reference_report.json` - Summary statistics

**Validation:**

- [ ] Coverage matrix includes all 2,064 plants
- [ ] Report shows percentage coverage per dataset
- [ ] Total number of unique plants with at least one dataset match is identified

---
### Task 2.3: Download and Organize Images

**Objective:** Download images from selected sources and organize by species.

**Actions:**

1. Create directory structure:

```
datasets/
├── raw/
│   ├── inaturalist/
│   ├── plantclef/
│   ├── wikimedia/
│   └── flickr/
└── organized/
    └── {scientific_name}/
        ├── img_001.jpg
        └── metadata.json
```

2. Create `scripts/phase2/download_inaturalist.py`:
   - Use iNaturalist API with research-grade filter
   - Download max 500 images per species
   - Include metadata (observer, date, location, license)
   - Handle rate limiting with exponential backoff
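The exponential-backoff handling could be a small generic retry helper like this sketch. The names are illustrative; the real downloader would catch the specific rate-limit error (HTTP 429) raised by its HTTP client rather than bare `Exception`, and `sleep` is injectable only so the behavior is testable without real waits.

```python
import random
import time

def with_backoff(fetch, max_retries: int = 5, base_delay: float = 1.0,
                 sleep=time.sleep):
    """Call `fetch`, retrying with exponential backoff plus jitter.

    Delays grow as base_delay * 2**attempt (1s, 2s, 4s, ...) with up to
    one extra second of random jitter to avoid synchronized retries.
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            sleep(base_delay * (2 ** attempt) + random.random())
```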
3. Create `scripts/phase2/download_plantclef.py`:
   - Download from PlantCLEF challenge archives
   - Extract and organize by species

4. Create `scripts/phase2/download_wikimedia.py`:
   - Query Wikimedia Commons API for botanical images
   - Filter by license (CC-BY, CC-BY-SA, public domain)

5. Create `scripts/phase2/organize_images.py`:
   - Consolidate images from all sources
   - Rename with consistent naming: `{plant_id}_{source}_{index}.jpg`
   - Generate per-species `metadata.json`
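The naming convention in step 5 can be pinned down with a one-line helper; the three-digit zero padding is an assumption (the plan's example `img_001.jpg` suggests it) chosen so filenames sort lexically.

```python
def image_filename(plant_id: str, source: str, index: int) -> str:
    """Consistent image name: {plant_id}_{source}_{index}.jpg.

    Zero-padded index keeps files in order under lexical sorting.
    """
    return f"{plant_id}_{source}_{index:03d}.jpg"
```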
**Output:**

- `datasets/organized/` - Organized image directory
- `output/download_progress.json` - Download status per species

**Validation:**

- [ ] Images organized in consistent directory structure
- [ ] Each image has source attribution in metadata
- [ ] Progress tracking shows download status for all plants

---
### Task 2.4: Establish Minimum Image Count per Class

**Objective:** Define and track image count thresholds.

**Actions:**

1. Create `scripts/phase2/count_images.py` to:
   - Count images per species in `datasets/organized/`
   - Classify plants into coverage tiers:
     - **Excellent:** 200+ images
     - **Good:** 100-199 images (target minimum)
     - **Marginal:** 50-99 images
     - **Insufficient:** 10-49 images
     - **Critical:** <10 images

2. Generate coverage report with distribution histogram
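The tier boundaries above translate directly into a classification function, a sketch of what `count_images.py` might contain (the lowercase tier labels are our choice):

```python
def coverage_tier(image_count: int) -> str:
    """Map a per-species image count onto the coverage tiers above."""
    if image_count >= 200:
        return "excellent"
    if image_count >= 100:
        return "good"          # target minimum
    if image_count >= 50:
        return "marginal"
    if image_count >= 10:
        return "insufficient"
    return "critical"
```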
**Output:**

- `output/image_count_report.json`
- `output/coverage_histogram.png`

**Validation:**

- [ ] Target: At least 60% of plants have 100+ images
- [ ] Report identifies all plants below minimum threshold
- [ ] Total image count within target range (50K-200K)

---
### Task 2.5: Identify Gap Plants

**Objective:** Find plants needing supplementary images.

**Actions:**

1. Create `scripts/phase2/identify_gaps.py` to:
   - List plants with <100 images
   - Prioritize gaps by:
     - Plant popularity/commonality
     - Category importance (user-facing plants first)
     - Ease of sourcing (common names available)

2. Generate prioritized gap list with recommended sources
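The three prioritization criteria could combine into a weighted score, as in this sketch; the specific weights (and treating popularity as a normalized 0-1 value) are assumptions to be tuned, not part of the plan.

```python
def gap_priority(popularity: float, user_facing: bool,
                 has_common_names: bool,
                 weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted priority score in [0, 1] for a gap plant.

    weights = (popularity, category importance, ease of sourcing);
    these defaults are illustrative only.
    """
    w_pop, w_cat, w_ease = weights
    return (w_pop * popularity
            + w_cat * (1.0 if user_facing else 0.0)
            + w_ease * (1.0 if has_common_names else 0.0))
```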
**Output:**

- `output/gap_plants.json` - Prioritized list with current counts
- `output/gap_analysis_report.md` - Human-readable analysis

**Validation:**

- [ ] Gap list includes all plants under 100-image threshold
- [ ] Each gap plant has recommended supplementary sources
- [ ] Priority scores assigned based on criteria

---
### Task 2.6: Source Supplementary Images

**Objective:** Fill gaps using additional image sources.

**Actions:**

1. Create `scripts/phase2/download_flickr.py`:
   - Use Flickr API with botanical/plant tags
   - Filter by license (CC-BY, CC-BY-SA)
   - Search by scientific name AND common names

2. Create `scripts/phase2/download_google_images.py`:
   - Use Google Custom Search API (paid tier)
   - Apply strict botanical filters
   - Download only high-resolution images

3. Create `scripts/phase2/manual_curation_list.py`:
   - Generate list of gap plants requiring manual sourcing
   - Create curation checklist for human review

4. Update `organize_images.py` to incorporate supplementary sources
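License filtering is shared by all the downloaders, so it could live in one allow-list check. A sketch, assuming each downloader first maps its source's license identifiers (Flickr license IDs, Wikimedia license strings) onto the canonical names below:

```python
# Licenses the plan accepts: CC-BY, CC-BY-SA, public domain.
ALLOWED_LICENSES = {"cc-by", "cc-by-sa", "public-domain"}

def license_allowed(license_name: str) -> bool:
    """True if a canonicalized license name is on the allow-list."""
    return license_name.strip().lower() in ALLOWED_LICENSES

def filter_by_license(records: list) -> list:
    """Keep only image records whose `license` field is allowed."""
    return [r for r in records if license_allowed(r.get("license", ""))]
```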
**Output:**

- Updated `datasets/organized/` with supplementary images
- `output/supplementary_download_report.json`
- `output/manual_curation_checklist.md` (if needed)

**Validation:**

- [ ] Gap plants have improved coverage
- [ ] All supplementary images have proper licensing
- [ ] Re-running Task 2.4 shows improved coverage metrics

---
### Task 2.7: Verify Image Quality and Labels

**Objective:** Remove mislabeled and low-quality images.

**Actions:**

1. Create `scripts/phase2/quality_filter.py` to:
   - Detect corrupt/truncated images
   - Enforce a minimum resolution of 224x224
   - Detect duplicates using perceptual hashing (pHash)
   - Flag images with text overlays/watermarks

2. Create `scripts/phase2/label_verification.py` to:
   - Use a pretrained plant classifier as a sanity check
   - Flag images where model confidence is very low
   - Generate review queue for human verification

3. Create `scripts/phase2/human_review_tool.py`:
   - Simple CLI tool for reviewing flagged images
   - Accept/reject/relabel options
   - Track reviewer decisions
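The duplicate-detection idea can be illustrated without any imaging libraries. The production script would use the `imagehash` package's `phash` on Pillow images (both are in the phase's requirements); the difference-hash below over a plain grayscale matrix just shows the mechanic, and the 4-bit threshold is an assumption.

```python
def dhash_bits(gray: list) -> list:
    """Difference hash: 1 where a pixel is brighter than its right neighbor.

    `gray` is a list of rows of grayscale values; real code would first
    resize the image to a fixed small size (e.g. 8x9) before hashing.
    """
    return [int(row[i] > row[i + 1]) for row in gray for i in range(len(row) - 1)]

def hamming(a: list, b: list) -> int:
    """Number of differing bits between two equal-length hashes."""
    return sum(x != y for x, y in zip(a, b))

def near_duplicate(img_a: list, img_b: list, threshold: int = 4) -> bool:
    """Treat images whose hashes differ by <= threshold bits as duplicates."""
    return hamming(dhash_bits(img_a), dhash_bits(img_b)) <= threshold
```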
**Output:**

- `datasets/verified/` - Cleaned image directory
- `output/quality_report.json` - Filtering statistics
- `output/removed_images.json` - Log of removed images with reasons

**Validation:**

- [ ] All images pass minimum resolution check
- [ ] No duplicate images (within 95% perceptual similarity)
- [ ] Flagged images reviewed and resolved
- [ ] Removal rate documented (<20% expected)

---
### Task 2.8: Split Dataset

**Objective:** Create reproducible train/validation/test splits.

**Actions:**

1. Create `scripts/phase2/split_dataset.py` to:
   - Perform a stratified split maintaining class distribution
   - 70% training, 15% validation, 15% test
   - Ensure no data leakage (no image appears in multiple splits)
   - Handle class imbalance (minimum samples per class in each split)

2. Create manifest files:

```
datasets/
├── train/
│   ├── images/
│   └── manifest.csv (path, label, scientific_name, plant_id)
├── val/
│   ├── images/
│   └── manifest.csv
└── test/
    ├── images/
    └── manifest.csv
```

3. Generate split statistics report
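The stratified split in step 1 can be sketched as a per-class shuffle-and-slice; splitting within each class is what keeps the class distribution similar across the three sets, and seeding the RNG is what makes the split reproducible. Function and argument names here are illustrative.

```python
import random

def stratified_split(items_by_class: dict,
                     ratios=(0.70, 0.15, 0.15), seed: int = 42):
    """Per-class 70/15/15 split; returns (train, val, test) dicts.

    Each class's items are shuffled with a fixed seed, then sliced,
    so no item can land in more than one split (no leakage).
    """
    rng = random.Random(seed)
    train, val, test = {}, {}, {}
    for cls, items in items_by_class.items():
        shuffled = items[:]
        rng.shuffle(shuffled)
        n_train = int(len(shuffled) * ratios[0])
        n_val = int(len(shuffled) * ratios[1])
        train[cls] = shuffled[:n_train]
        val[cls] = shuffled[n_train:n_train + n_val]
        test[cls] = shuffled[n_train + n_val:]   # remainder
    return train, val, test
```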
**Output:**

- `datasets/train/`, `datasets/val/`, `datasets/test/` directories
- `output/split_statistics.json`
- `output/class_distribution.png` (per-split histogram)

**Validation:**

- [ ] Split ratios within 1% of target (70/15/15)
- [ ] Each class has minimum 5 samples in val and test sets
- [ ] No image appears in multiple splits
- [ ] Manifest files are complete and valid

---
## End-Phase Validation Checklist

Run `scripts/phase2/validate_phase2.py` to verify:

| # | Validation Criterion | Target | Pass/Fail |
|---|---------------------|--------|-----------|
| 1 | Total image count | 50,000 - 200,000 | [ ] |
| 2 | Plant coverage | ≥80% of 2,064 plants have images | [ ] |
| 3 | Minimum images per included plant | ≥50 images (relaxed from 100 for rare plants) | [ ] |
| 4 | Image quality | 100% pass resolution check | [ ] |
| 5 | No duplicates | 0 exact duplicates, <1% near-duplicates | [ ] |
| 6 | License compliance | 100% of images have documented license | [ ] |
| 7 | Train/val/test split exists | All three directories with manifests | [ ] |
| 8 | Split ratio accuracy | Within 1% of 70/15/15 | [ ] |
| 9 | Stratification verified | Chi-square test p > 0.05 | [ ] |
| 10 | Metadata completeness | 100% of images have source + license | [ ] |

**Phase 2 Complete When:** All 10 validation criteria pass.

---
## Scripts Summary

| Script | Task | Input | Output |
|--------|------|-------|--------|
| `research_datasets.py` | 2.1 | None | `dataset_research_report.json` |
| `cross_reference_plants.py` | 2.2 | Knowledge base | `cross_reference_report.json` |
| `download_inaturalist.py` | 2.3 | Plant list | Images + metadata |
| `download_plantclef.py` | 2.3 | Plant list | Images + metadata |
| `download_wikimedia.py` | 2.3 | Plant list | Images + metadata |
| `organize_images.py` | 2.3 | Raw images | `datasets/organized/` |
| `count_images.py` | 2.4 | Organized images | `image_count_report.json` |
| `identify_gaps.py` | 2.5 | Image counts | `gap_plants.json` |
| `download_flickr.py` | 2.6 | Gap plants | Supplementary images |
| `quality_filter.py` | 2.7 | All images | `datasets/verified/` |
| `label_verification.py` | 2.7 | Verified images | Review queue |
| `split_dataset.py` | 2.8 | Verified images | Train/val/test splits |
| `validate_phase2.py` | Final | All outputs | Validation report |

---
## Dependencies

```
# requirements-phase2.txt
requests>=2.28.0
Pillow>=9.0.0
imagehash>=4.3.0
pandas>=1.5.0
tqdm>=4.64.0
python-dotenv>=1.0.0
matplotlib>=3.6.0
scipy>=1.9.0
```

---
## Environment Variables

```
# .env.phase2
INATURALIST_APP_ID=your_app_id
INATURALIST_APP_SECRET=your_secret
FLICKR_API_KEY=your_key
FLICKR_API_SECRET=your_secret
GOOGLE_CSE_API_KEY=your_key
GOOGLE_CSE_CX=your_cx
```

---
## Estimated Timeline

| Task | Effort | Notes |
|------|--------|-------|
| 2.1 Research | 1 day | Documentation and API testing |
| 2.2 Cross-reference | 1 day | API queries, matching logic |
| 2.3 Download | 3-5 days | Rate-limited by APIs |
| 2.4 Count | 0.5 day | Quick analysis |
| 2.5 Gap analysis | 0.5 day | Based on counts |
| 2.6 Supplementary | 2-3 days | Depends on gap size |
| 2.7 Quality verification | 2 days | Includes manual review |
| 2.8 Split | 0.5 day | Automated |
| Validation | 0.5 day | Final checks |
| **Total** | **11-14 days** | |

---
|
||||
|
||||
## Risk Mitigation
|
||||
|
||||
| Risk | Mitigation |
|
||||
|------|------------|
|
||||
| API rate limits | Implement backoff, cache responses, spread over time |
|
||||
| Low coverage for rare plants | Accept lower threshold (50 images) with augmentation in Phase 3 |
|
||||
| License issues | Track all sources, prefer CC-licensed content |
|
||||
| Storage limits | Implement progressive download, compress as needed |
|
||||
| Label noise | Use pretrained model for sanity check, human review queue |
|
||||
|
||||
---
|
||||
|
||||
## Next Steps After Phase 2

1. Review `output/image_count_report.json` for Phase 3 augmentation priorities
2. Ensure the `datasets/train/manifest.csv` format is compatible with the training framework
3. Document any plants excluded due to insufficient images