# Phase 2: Image Dataset Acquisition - Implementation Plan

## Overview

**Goal:** Gather labeled plant images matching our 2,064-plant knowledge base from Phase 1.

**Target Deliverable:** A labeled image dataset of 50,000-200,000 images across the target plant classes, split into training (70%), validation (15%), and test (15%) sets.
## Prerequisites

- Phase 1 complete: `data/final_knowledge_base.json` (2,064 plants)
- SQLite database: `knowledge_base/plants.db`
- Python environment with required packages
- API keys for image sources (iNaturalist, Flickr, etc.)
- Storage space: ~50-100 GB for raw images
## Task Breakdown

### Task 2.1: Research Public Plant Image Datasets

**Objective:** Evaluate available datasets for compatibility with our plant list.

**Actions:**

1. Research and document each dataset:
   - PlantCLEF - download links, species coverage, image format, license
   - iNaturalist - API access, species coverage, observation quality filters
   - PlantNet (Pl@ntNet) - API documentation, rate limits, attribution requirements
   - Oxford Flowers 102 - direct download, category mapping
   - Wikimedia Commons - API access for botanical images
2. Create `scripts/phase2/research_datasets.py` to:
   - Query each API for available species counts
   - Document download procedures and authentication
   - Estimate total available images per source

**Output:** `output/dataset_research_report.json`

**Validation:**

- Report contains at least 4 dataset sources
- Each source has documented: URL, license, estimated image count, access method
### Task 2.2: Cross-Reference Datasets with Plant List

**Objective:** Identify which plants from our knowledge base have images in public datasets.

**Actions:**

1. Create `scripts/phase2/cross_reference_plants.py` to:
   - Load plant list from `data/final_knowledge_base.json`
   - Query each dataset API for matching scientific names
   - Handle synonyms using `data/synonyms.json`
   - Track exact matches, synonym matches, and genus-level matches
2. Generate coverage matrix: plants × datasets

**Output:**

- `output/dataset_coverage_matrix.json` - Per-plant availability
- `output/cross_reference_report.json` - Summary statistics

**Validation:**

- Coverage matrix includes all 2,064 plants
- Report shows percentage coverage per dataset
- Total unique plants with at least one dataset match identified
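The matching logic in `cross_reference_plants.py` can be sketched as below. The record shapes for `final_knowledge_base.json` (a list of `{"plant_id", "scientific_name"}` objects) and `synonyms.json` (accepted name → list of synonyms) are assumptions for illustration.

```python
# Synonym-aware name matching sketch: exact > synonym > genus-level fallback.
# Data shapes for the knowledge base and synonyms file are assumed.

def build_lookup(plants, synonyms):
    """Map every known name (lowercased) to (plant_id, match_type)."""
    lookup = {}
    for p in plants:
        name = p["scientific_name"].lower()
        lookup[name] = (p["plant_id"], "exact")
        for syn in synonyms.get(p["scientific_name"], []):
            lookup.setdefault(syn.lower(), (p["plant_id"], "synonym"))
    return lookup

def match_name(dataset_name, lookup):
    """Return (plant_id, match_type); fall back to genus-level match."""
    key = dataset_name.lower().strip()
    if key in lookup:
        return lookup[key]
    genus = key.split()[0]
    for name, (pid, _) in lookup.items():
        if name.split()[0] == genus:
            return (pid, "genus")
    return (None, "none")
```

Tracking the match type per hit is what lets the coverage matrix distinguish exact, synonym, and genus-level coverage.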
### Task 2.3: Download and Organize Images

**Objective:** Download images from selected sources and organize by species.

**Actions:**

1. Create directory structure:

   ```
   datasets/
   ├── raw/
   │   ├── inaturalist/
   │   ├── plantclef/
   │   ├── wikimedia/
   │   └── flickr/
   └── organized/
       └── {scientific_name}/
           ├── img_001.jpg
           └── metadata.json
   ```

2. Create `scripts/phase2/download_inaturalist.py`:
   - Use iNaturalist API with research-grade filter
   - Download max 500 images per species
   - Include metadata (observer, date, location, license)
   - Handle rate limiting with exponential backoff
3. Create `scripts/phase2/download_plantclef.py`:
   - Download from PlantCLEF challenge archives
   - Extract and organize by species
4. Create `scripts/phase2/download_wikimedia.py`:
   - Query Wikimedia Commons API for botanical images
   - Filter by license (CC-BY, CC-BY-SA, public domain)
5. Create `scripts/phase2/organize_images.py`:
   - Consolidate images from all sources
   - Rename with consistent naming: `{plant_id}_{source}_{index}.jpg`
   - Generate per-species `metadata.json`

**Output:**

- `datasets/organized/` - Organized image directory
- `output/download_progress.json` - Download status per species

**Validation:**

- Images organized in consistent directory structure
- Each image has source attribution in metadata
- Progress tracking shows download status for all plants
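The research-grade query plus exponential backoff in `download_inaturalist.py` might look like the following sketch. The endpoint and query parameters match the public iNaturalist v1 API; the retry policy and page size are assumptions.

```python
# Sketch: fetch research-grade observations with exponential backoff.
# Retry counts, delays, and page size are illustrative choices.
import time
import requests

API_URL = "https://api.inaturalist.org/v1/observations"

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff: 1 s, 2 s, 4 s, ... capped at 60 s."""
    return min(base * (2 ** attempt), cap)

def fetch_observations(scientific_name, page=1, per_page=200, max_retries=5):
    """Fetch one page of research-grade observations, retrying on errors."""
    params = {
        "taxon_name": scientific_name,
        "quality_grade": "research",  # community-verified IDs only
        "photos": "true",
        "page": page,
        "per_page": per_page,
    }
    for attempt in range(max_retries):
        resp = requests.get(API_URL, params=params, timeout=30)
        if resp.status_code == 200:
            return resp.json()["results"]
        time.sleep(backoff_delay(attempt))  # back off on 429 / 5xx
    raise RuntimeError(f"giving up on {scientific_name} page {page}")
```

The 500-images-per-species cap would be enforced by the caller, stopping pagination once enough photos have been collected.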
### Task 2.4: Establish Minimum Image Count per Class

**Objective:** Define and track image count thresholds.

**Actions:**

1. Create `scripts/phase2/count_images.py` to:
   - Count images per species in `datasets/organized/`
   - Classify plants into coverage tiers:
     - Excellent: 200+ images
     - Good: 100-199 images (target minimum)
     - Marginal: 50-99 images
     - Insufficient: 10-49 images
     - Critical: <10 images
2. Generate coverage report with distribution histogram

**Output:**

- `output/image_count_report.json`
- `output/coverage_histogram.png`

**Validation:**

- Target: At least 60% of plants have 100+ images
- Report identifies all plants below minimum threshold
- Total image count within target range (50K-200K)
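The tier classification in `count_images.py` reduces to a threshold table; a minimal sketch, assuming the `datasets/organized/<scientific_name>/` layout from Task 2.3:

```python
# Sketch: classify per-species image counts into the coverage tiers above.
from collections import Counter
from pathlib import Path

TIERS = [
    (200, "excellent"),
    (100, "good"),         # target minimum
    (50,  "marginal"),
    (10,  "insufficient"),
]

def tier_for(count):
    """Map an image count to its coverage tier."""
    for threshold, name in TIERS:
        if count >= threshold:
            return name
    return "critical"      # fewer than 10 images

def count_images(root="datasets/organized"):
    """Return {scientific_name: image_count} per species directory."""
    return {
        d.name: sum(1 for _ in d.glob("*.jpg"))
        for d in Path(root).iterdir() if d.is_dir()
    }

def tier_distribution(counts):
    """Histogram of tiers, ready for the coverage report."""
    return Counter(tier_for(c) for c in counts.values())
```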
### Task 2.5: Identify Gap Plants

**Objective:** Find plants needing supplementary images.

**Actions:**

1. Create `scripts/phase2/identify_gaps.py` to:
   - List plants with <100 images
   - Prioritize gaps by:
     - Plant popularity/commonality
     - Category importance (user-facing plants first)
     - Ease of sourcing (common names available)
2. Generate prioritized gap list with recommended sources

**Output:**

- `output/gap_plants.json` - Prioritized list with current counts
- `output/gap_analysis_report.md` - Human-readable analysis

**Validation:**

- Gap list includes all plants under 100-image threshold
- Each gap plant has recommended supplementary sources
- Priority scores assigned based on criteria
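One way to combine the three prioritization criteria is a multiplicative score; the weights and the `popularity`/`category`/`common_names` fields below are assumptions, not fields the knowledge base is known to have.

```python
# Sketch: priority scoring for gap plants. Field names and weights are
# illustrative assumptions; the 100-image threshold is the plan's minimum.

THRESHOLD = 100

def priority_score(plant, image_count):
    """Higher score = fill this gap first."""
    deficit = max(0, THRESHOLD - image_count) / THRESHOLD    # 0..1
    popularity = plant.get("popularity", 0.5)                # assumed field
    user_facing = 1.0 if plant.get("category") == "houseplant" else 0.5
    has_common_name = 1.0 if plant.get("common_names") else 0.7
    return round(deficit * popularity * user_facing * has_common_name, 3)

def gap_list(plants, counts):
    """Plants under the threshold, sorted by descending priority."""
    gaps = [
        {**p, "count": counts.get(p["scientific_name"], 0)}
        for p in plants
        if counts.get(p["scientific_name"], 0) < THRESHOLD
    ]
    return sorted(gaps, key=lambda p: priority_score(p, p["count"]),
                  reverse=True)
```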
### Task 2.6: Source Supplementary Images

**Objective:** Fill gaps using additional image sources.

**Actions:**

1. Create `scripts/phase2/download_flickr.py`:
   - Use Flickr API with botanical/plant tags
   - Filter by license (CC-BY, CC-BY-SA)
   - Search by scientific name AND common names
2. Create `scripts/phase2/download_google_images.py`:
   - Use Google Custom Search API (paid tier)
   - Apply strict botanical filters
   - Download only high-resolution images
3. Create `scripts/phase2/manual_curation_list.py`:
   - Generate list of gap plants requiring manual sourcing
   - Create curation checklist for human review
4. Update `organize_images.py` to incorporate supplementary sources

**Output:**

- Updated `datasets/organized/` with supplementary images
- `output/supplementary_download_report.json`
- `output/manual_curation_checklist.md` (if needed)

**Validation:**

- Gap plants have improved coverage
- All supplementary images have proper licensing
- Re-running Task 2.4 shows improved coverage metrics
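The license filter in `download_flickr.py` could be a simple allow-list over Flickr's numeric license IDs. The IDs below (4 = CC-BY, 5 = CC-BY-SA) follow Flickr's published license table but should be verified against `flickr.photos.licenses.getInfo` before use.

```python
# Sketch: keep only permissively licensed Flickr results.
# License IDs 4/5 are assumptions to confirm against the live API.

ALLOWED_LICENSES = {"4": "CC-BY", "5": "CC-BY-SA"}

def filter_photos(photos):
    """Keep photos whose license is in the allow-list, tagging each
    with a human-readable license name for the metadata record."""
    kept = []
    for p in photos:
        lic = ALLOWED_LICENSES.get(str(p.get("license")))
        if lic:
            kept.append({**p, "license_name": lic})
    return kept
```

Recording the resolved license name per image is what makes the later license-compliance check (criterion 6) auditable.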
### Task 2.7: Verify Image Quality and Labels

**Objective:** Remove mislabeled and low-quality images.

**Actions:**

1. Create `scripts/phase2/quality_filter.py` to:
   - Detect corrupt/truncated images
   - Filter by minimum resolution (224x224)
   - Detect duplicates using perceptual hashing (pHash)
   - Flag images with text overlays/watermarks
2. Create `scripts/phase2/label_verification.py` to:
   - Use pretrained plant classifier for sanity check
   - Flag images where model confidence is very low
   - Generate review queue for human verification
3. Create `scripts/phase2/human_review_tool.py`:
   - Simple CLI tool for reviewing flagged images
   - Accept/reject/relabel options
   - Track reviewer decisions

**Output:**

- `datasets/verified/` - Cleaned image directory
- `output/quality_report.json` - Filtering statistics
- `output/removed_images.json` - Log of removed images with reasons

**Validation:**

- All images pass minimum resolution check
- No duplicate images (within 95% perceptual similarity)
- Flagged images reviewed and resolved
- Removal rate documented (<20% expected)
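The 95% perceptual-similarity bar translates to a Hamming-distance threshold over the hashes. A sketch of the duplicate check, assuming 64-bit perceptual hashes have already been computed (e.g. with `imagehash.phash`):

```python
# Sketch: near-duplicate detection over precomputed 64-bit perceptual hashes.
# 95% similarity over 64 bits means at most 3 differing bits.

MAX_HAMMING = 3  # int(64 * (1 - 0.95))

def hamming(a, b):
    """Number of differing bits between two hash integers."""
    return bin(a ^ b).count("1")

def find_near_duplicates(hashes):
    """hashes: {path: 64-bit int}. Return (path_a, path_b) pairs to review."""
    items = sorted(hashes.items())
    dupes = []
    for i, (pa, ha) in enumerate(items):
        for pb, hb in items[i + 1:]:
            if hamming(ha, hb) <= MAX_HAMMING:
                dupes.append((pa, pb))
    return dupes
```

The pairwise loop is quadratic; at this dataset size (50K-200K images) a BK-tree or bucketing by hash prefix would be the practical variant, but the threshold logic is the same.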
### Task 2.8: Split Dataset

**Objective:** Create reproducible train/validation/test splits.

**Actions:**

1. Create `scripts/phase2/split_dataset.py` to:
   - Stratified split maintaining class distribution
   - 70% training, 15% validation, 15% test
   - Ensure no data leakage (no plant photo appears in multiple splits)
   - Handle class imbalance (minimum samples per class in each split)
2. Create manifest files:

   ```
   datasets/
   ├── train/
   │   ├── images/
   │   └── manifest.csv (path, label, scientific_name, plant_id)
   ├── val/
   │   ├── images/
   │   └── manifest.csv
   └── test/
       ├── images/
       └── manifest.csv
   ```

3. Generate split statistics report

**Output:**

- `datasets/train/`, `datasets/val/`, `datasets/test/` directories
- `output/split_statistics.json`
- `output/class_distribution.png` (per-split histogram)

**Validation:**

- Split ratios within 1% of target (70/15/15)
- Each class has minimum 5 samples in val and test sets
- No image appears in multiple splits
- Manifest files are complete and valid
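The core of `split_dataset.py` can be sketched as a per-class shuffle with a fixed seed, which gives both stratification and reproducibility:

```python
# Sketch: reproducible stratified 70/15/15 split. Shuffling within each
# class with a fixed seed keeps the class distribution in every split.
import random

def stratified_split(items_by_class, seed=42, ratios=(0.70, 0.15, 0.15)):
    """items_by_class: {label: [image_path, ...]}.
    Returns (train, val, test) lists of (path, label) pairs."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label, paths in sorted(items_by_class.items()):
        paths = sorted(paths)          # stable input order before shuffle
        rng.shuffle(paths)
        n = len(paths)
        n_train = int(n * ratios[0])
        n_val = int(n * ratios[1])
        train += [(p, label) for p in paths[:n_train]]
        val += [(p, label) for p in paths[n_train:n_train + n_val]]
        test += [(p, label) for p in paths[n_train + n_val:]]
    return train, val, test
```

Because each image path lands in exactly one slice of its class list, the no-leakage property holds by construction; the minimum-samples-per-class check still needs a separate guard for very small classes.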
## End-Phase Validation Checklist

Run `scripts/phase2/validate_phase2.py` to verify:
| # | Validation Criterion | Target | Pass/Fail |
|---|---|---|---|
| 1 | Total image count | 50,000 - 200,000 | [ ] |
| 2 | Plant coverage | ≥80% of 2,064 plants have images | [ ] |
| 3 | Minimum images per included plant | ≥50 images (relaxed from 100 for rare plants) | [ ] |
| 4 | Image quality | 100% pass resolution check | [ ] |
| 5 | No duplicates | 0 exact duplicates, <1% near-duplicates | [ ] |
| 6 | License compliance | 100% images have documented license | [ ] |
| 7 | Train/val/test split exists | All three directories with manifests | [ ] |
| 8 | Split ratio accuracy | Within 1% of 70/15/15 | [ ] |
| 9 | Stratification verified | Chi-square test p > 0.05 | [ ] |
| 10 | Metadata completeness | 100% images have source + license | [ ] |
**Phase 2 Complete When:** All 10 validation criteria pass.
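Criterion 9's stratification check can be implemented as a chi-square test of independence between split membership and class label, using the `scipy` dependency listed below; a p-value above 0.05 means the class distributions across train/val/test do not differ significantly.

```python
# Sketch: chi-square stratification check for the validation script.
# Rows are classes, columns are (train, val, test) image counts.
from scipy.stats import chi2_contingency

def stratification_p_value(counts):
    """Return the p-value of a chi-square independence test over the
    class-by-split contingency table; > 0.05 passes criterion 9."""
    chi2, p, dof, _ = chi2_contingency(counts)
    return p
```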
## Scripts Summary

| Script | Task | Input | Output |
|---|---|---|---|
| `research_datasets.py` | 2.1 | None | `dataset_research_report.json` |
| `cross_reference_plants.py` | 2.2 | Knowledge base | `cross_reference_report.json` |
| `download_inaturalist.py` | 2.3 | Plant list | Images + metadata |
| `download_plantclef.py` | 2.3 | Plant list | Images + metadata |
| `download_wikimedia.py` | 2.3 | Plant list | Images + metadata |
| `organize_images.py` | 2.3 | Raw images | `datasets/organized/` |
| `count_images.py` | 2.4 | Organized images | `image_count_report.json` |
| `identify_gaps.py` | 2.5 | Image counts | `gap_plants.json` |
| `download_flickr.py` | 2.6 | Gap plants | Supplementary images |
| `quality_filter.py` | 2.7 | All images | `datasets/verified/` |
| `label_verification.py` | 2.7 | Verified images | Review queue |
| `split_dataset.py` | 2.8 | Verified images | Train/val/test splits |
| `validate_phase2.py` | Final | All outputs | Validation report |
## Dependencies

```text
# requirements-phase2.txt
requests>=2.28.0
Pillow>=9.0.0
imagehash>=4.3.0
pandas>=1.5.0
tqdm>=4.64.0
python-dotenv>=1.0.0
matplotlib>=3.6.0
scipy>=1.9.0
```
## Environment Variables

```bash
# .env.phase2
INATURALIST_APP_ID=your_app_id
INATURALIST_APP_SECRET=your_secret
FLICKR_API_KEY=your_key
FLICKR_API_SECRET=your_secret
GOOGLE_CSE_API_KEY=your_key
GOOGLE_CSE_CX=your_cx
```
## Estimated Timeline
| Task | Effort | Notes |
|---|---|---|
| 2.1 Research | 1 day | Documentation and API testing |
| 2.2 Cross-reference | 1 day | API queries, matching logic |
| 2.3 Download | 3-5 days | Rate-limited by APIs |
| 2.4 Count | 0.5 day | Quick analysis |
| 2.5 Gap analysis | 0.5 day | Based on counts |
| 2.6 Supplementary | 2-3 days | Depends on gap size |
| 2.7 Quality verification | 2 days | Includes manual review |
| 2.8 Split | 0.5 day | Automated |
| Validation | 0.5 day | Final checks |
## Risk Mitigation
| Risk | Mitigation |
|---|---|
| API rate limits | Implement backoff, cache responses, spread over time |
| Low coverage for rare plants | Accept lower threshold (50 images) with augmentation in Phase 3 |
| License issues | Track all sources, prefer CC-licensed content |
| Storage limits | Implement progressive download, compress as needed |
| Label noise | Use pretrained model for sanity check, human review queue |
## Next Steps After Phase 2

- Review `output/image_count_report.json` for Phase 3 augmentation priorities
- Ensure `datasets/train/manifest.csv` format is compatible with the training framework
- Document any plants excluded due to insufficient images