# Phase 2: Image Dataset Acquisition - Implementation Plan

## Overview

**Goal:** Gather labeled plant images matching our 2,064-plant knowledge base from Phase 1.

**Target Deliverable:** A labeled image dataset of 50,000-200,000 images across the target plant classes, split into training (70%), validation (15%), and test (15%) sets.
## Prerequisites

- Phase 1 complete: `data/final_knowledge_base.json` (2,064 plants)
- SQLite database: `knowledge_base/plants.db`
- Python environment with required packages
- API keys for image sources (iNaturalist, Flickr, etc.)
- Storage space: ~50-100 GB for raw images
## Task Breakdown

### Task 2.1: Research Public Plant Image Datasets

**Objective:** Evaluate available datasets for compatibility with our plant list.

**Actions:**

1. Research and document each dataset:
   - PlantCLEF - download links, species coverage, image format, license
   - iNaturalist - API access, species coverage, observation quality filters
   - PlantNet (Pl@ntNet) - API documentation, rate limits, attribution requirements
   - Oxford Flowers 102 - direct download, category mapping
   - Wikimedia Commons - API access for botanical images
2. Create `scripts/phase2/research_datasets.py` to:
   - Query each API for available species counts
   - Document download procedures and authentication
   - Estimate total available images per source

**Output:** `output/dataset_research_report.json`

**Validation:**

- Report contains at least 4 dataset sources
- Each source has documented: URL, license, estimated image count, access method
### Task 2.2: Cross-Reference Datasets with Plant List

**Objective:** Identify which plants from our knowledge base have images in public datasets.

**Actions:**

1. Create `scripts/phase2/cross_reference_plants.py` to:
   - Load plant list from `data/final_knowledge_base.json`
   - Query each dataset API for matching scientific names
   - Handle synonyms using `data/synonyms.json`
   - Track exact matches, synonym matches, and genus-level matches
2. Generate coverage matrix: plants × datasets

**Output:**

- `output/dataset_coverage_matrix.json` - Per-plant availability
- `output/cross_reference_report.json` - Summary statistics

**Validation:**

- Coverage matrix includes all 2,064 plants
- Report shows percentage coverage per dataset
- Total unique plants with at least one dataset match identified
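The matching logic in `cross_reference_plants.py` can be sketched as below. The record shapes for `final_knowledge_base.json` (a list of `{"plant_id", "scientific_name"}` objects) and `synonyms.json` (accepted name → list of synonyms) are assumptions for illustration.

```python
# Synonym-aware name matching sketch: exact > synonym > genus-level fallback.
# Data shapes for the knowledge base and synonyms file are assumed.

def build_lookup(plants, synonyms):
    """Map every known name (lowercased) to (plant_id, match_type)."""
    lookup = {}
    for p in plants:
        name = p["scientific_name"].lower()
        lookup[name] = (p["plant_id"], "exact")
        for syn in synonyms.get(p["scientific_name"], []):
            lookup.setdefault(syn.lower(), (p["plant_id"], "synonym"))
    return lookup

def match_name(dataset_name, lookup):
    """Return (plant_id, match_type); fall back to genus-level match."""
    key = dataset_name.lower().strip()
    if key in lookup:
        return lookup[key]
    genus = key.split()[0]
    for name, (pid, _) in lookup.items():
        if name.split()[0] == genus:
            return (pid, "genus")
    return (None, "none")
```

Tracking the match type per hit is what lets the coverage matrix distinguish exact, synonym, and genus-level coverage.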
### Task 2.3: Download and Organize Images

**Objective:** Download images from selected sources and organize by species.

**Actions:**

1. Create directory structure:

   ```
   datasets/
   ├── raw/
   │   ├── inaturalist/
   │   ├── plantclef/
   │   ├── wikimedia/
   │   └── flickr/
   └── organized/
       └── {scientific_name}/
           ├── img_001.jpg
           └── metadata.json
   ```

2. Create `scripts/phase2/download_inaturalist.py`:
   - Use iNaturalist API with research-grade filter
   - Download max 500 images per species
   - Include metadata (observer, date, location, license)
   - Handle rate limiting with exponential backoff
3. Create `scripts/phase2/download_plantclef.py`:
   - Download from PlantCLEF challenge archives
   - Extract and organize by species
4. Create `scripts/phase2/download_wikimedia.py`:
   - Query Wikimedia Commons API for botanical images
   - Filter by license (CC-BY, CC-BY-SA, public domain)
5. Create `scripts/phase2/organize_images.py`:
   - Consolidate images from all sources
   - Rename with consistent naming: `{plant_id}_{source}_{index}.jpg`
   - Generate per-species `metadata.json`

**Output:**

- `datasets/organized/` - Organized image directory
- `output/download_progress.json` - Download status per species

**Validation:**

- Images organized in consistent directory structure
- Each image has source attribution in metadata
- Progress tracking shows download status for all plants
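The research-grade query plus exponential backoff in `download_inaturalist.py` might look like the following sketch. The endpoint and query parameters match the public iNaturalist v1 API; the retry policy and page size are assumptions.

```python
# Sketch: fetch research-grade observations with exponential backoff.
# Retry counts, delays, and page size are illustrative choices.
import time
import requests

API_URL = "https://api.inaturalist.org/v1/observations"

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff: 1 s, 2 s, 4 s, ... capped at 60 s."""
    return min(base * (2 ** attempt), cap)

def fetch_observations(scientific_name, page=1, per_page=200, max_retries=5):
    """Fetch one page of research-grade observations, retrying on errors."""
    params = {
        "taxon_name": scientific_name,
        "quality_grade": "research",  # community-verified IDs only
        "photos": "true",
        "page": page,
        "per_page": per_page,
    }
    for attempt in range(max_retries):
        resp = requests.get(API_URL, params=params, timeout=30)
        if resp.status_code == 200:
            return resp.json()["results"]
        time.sleep(backoff_delay(attempt))  # back off on 429 / 5xx
    raise RuntimeError(f"giving up on {scientific_name} page {page}")
```

The 500-images-per-species cap would be enforced by the caller, stopping pagination once enough photos have been collected.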
### Task 2.4: Establish Minimum Image Count per Class

**Objective:** Define and track image count thresholds.

**Actions:**

1. Create `scripts/phase2/count_images.py` to:
   - Count images per species in `datasets/organized/`
   - Classify plants into coverage tiers:
     - Excellent: 200+ images
     - Good: 100-199 images (target minimum)
     - Marginal: 50-99 images
     - Insufficient: 10-49 images
     - Critical: <10 images
2. Generate coverage report with distribution histogram

**Output:**

- `output/image_count_report.json`
- `output/coverage_histogram.png`

**Validation:**

- Target: At least 60% of plants have 100+ images
- Report identifies all plants below minimum threshold
- Total image count within target range (50K-200K)
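The tier classification in `count_images.py` reduces to a threshold table; a minimal sketch, assuming the `datasets/organized/<scientific_name>/` layout from Task 2.3:

```python
# Sketch: classify per-species image counts into the coverage tiers above.
from collections import Counter
from pathlib import Path

TIERS = [
    (200, "excellent"),
    (100, "good"),         # target minimum
    (50,  "marginal"),
    (10,  "insufficient"),
]

def tier_for(count):
    """Map an image count to its coverage tier."""
    for threshold, name in TIERS:
        if count >= threshold:
            return name
    return "critical"      # fewer than 10 images

def count_images(root="datasets/organized"):
    """Return {scientific_name: image_count} per species directory."""
    return {
        d.name: sum(1 for _ in d.glob("*.jpg"))
        for d in Path(root).iterdir() if d.is_dir()
    }

def tier_distribution(counts):
    """Histogram of tiers, ready for the coverage report."""
    return Counter(tier_for(c) for c in counts.values())
```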
### Task 2.5: Identify Gap Plants

**Objective:** Find plants needing supplementary images.

**Actions:**

1. Create `scripts/phase2/identify_gaps.py` to:
   - List plants with <100 images
   - Prioritize gaps by:
     - Plant popularity/commonality
     - Category importance (user-facing plants first)
     - Ease of sourcing (common names available)
2. Generate prioritized gap list with recommended sources

**Output:**

- `output/gap_plants.json` - Prioritized list with current counts
- `output/gap_analysis_report.md` - Human-readable analysis

**Validation:**

- Gap list includes all plants under 100-image threshold
- Each gap plant has recommended supplementary sources
- Priority scores assigned based on criteria
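One way to combine the three prioritization criteria is a multiplicative score; the weights and the `popularity`/`category`/`common_names` fields below are assumptions, not fields the knowledge base is known to have.

```python
# Sketch: priority scoring for gap plants. Field names and weights are
# illustrative assumptions; the 100-image threshold is the plan's minimum.

THRESHOLD = 100

def priority_score(plant, image_count):
    """Higher score = fill this gap first."""
    deficit = max(0, THRESHOLD - image_count) / THRESHOLD    # 0..1
    popularity = plant.get("popularity", 0.5)                # assumed field
    user_facing = 1.0 if plant.get("category") == "houseplant" else 0.5
    has_common_name = 1.0 if plant.get("common_names") else 0.7
    return round(deficit * popularity * user_facing * has_common_name, 3)

def gap_list(plants, counts):
    """Plants under the threshold, sorted by descending priority."""
    gaps = [
        {**p, "count": counts.get(p["scientific_name"], 0)}
        for p in plants
        if counts.get(p["scientific_name"], 0) < THRESHOLD
    ]
    return sorted(gaps, key=lambda p: priority_score(p, p["count"]),
                  reverse=True)
```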
### Task 2.6: Source Supplementary Images

**Objective:** Fill gaps using additional image sources.

**Actions:**

1. Create `scripts/phase2/download_flickr.py`:
   - Use Flickr API with botanical/plant tags
   - Filter by license (CC-BY, CC-BY-SA)
   - Search by scientific name AND common names
2. Create `scripts/phase2/download_google_images.py`:
   - Use Google Custom Search API (paid tier)
   - Apply strict botanical filters
   - Download only high-resolution images
3. Create `scripts/phase2/manual_curation_list.py`:
   - Generate list of gap plants requiring manual sourcing
   - Create curation checklist for human review
4. Update `organize_images.py` to incorporate supplementary sources

**Output:**

- Updated `datasets/organized/` with supplementary images
- `output/supplementary_download_report.json`
- `output/manual_curation_checklist.md` (if needed)

**Validation:**

- Gap plants have improved coverage
- All supplementary images have proper licensing
- Re-running Task 2.4 shows improved coverage metrics
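The license filter in `download_flickr.py` could be a simple allow-list over Flickr's numeric license IDs. The IDs below (4 = CC-BY, 5 = CC-BY-SA) follow Flickr's published license table but should be verified against `flickr.photos.licenses.getInfo` before use.

```python
# Sketch: keep only permissively licensed Flickr results.
# License IDs 4/5 are assumptions to confirm against the live API.

ALLOWED_LICENSES = {"4": "CC-BY", "5": "CC-BY-SA"}

def filter_photos(photos):
    """Keep photos whose license is in the allow-list, tagging each
    with a human-readable license name for the metadata record."""
    kept = []
    for p in photos:
        lic = ALLOWED_LICENSES.get(str(p.get("license")))
        if lic:
            kept.append({**p, "license_name": lic})
    return kept
```

Recording the resolved license name per image is what makes the later license-compliance check (criterion 6) auditable.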
### Task 2.7: Verify Image Quality and Labels

**Objective:** Remove mislabeled and low-quality images.

**Actions:**

1. Create `scripts/phase2/quality_filter.py` to:
   - Detect corrupt/truncated images
   - Filter by minimum resolution (224x224)
   - Detect duplicates using perceptual hashing (pHash)
   - Flag images with text overlays/watermarks
2. Create `scripts/phase2/label_verification.py` to:
   - Use pretrained plant classifier for sanity check
   - Flag images where model confidence is very low
   - Generate review queue for human verification
3. Create `scripts/phase2/human_review_tool.py`:
   - Simple CLI tool for reviewing flagged images
   - Accept/reject/relabel options
   - Track reviewer decisions

**Output:**

- `datasets/verified/` - Cleaned image directory
- `output/quality_report.json` - Filtering statistics
- `output/removed_images.json` - Log of removed images with reasons

**Validation:**

- All images pass minimum resolution check
- No duplicate images (within 95% perceptual similarity)
- Flagged images reviewed and resolved
- Removal rate documented (<20% expected)
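The 95% perceptual-similarity bar translates to a Hamming-distance threshold over the hashes. A sketch of the duplicate check, assuming 64-bit perceptual hashes have already been computed (e.g. with `imagehash.phash`):

```python
# Sketch: near-duplicate detection over precomputed 64-bit perceptual hashes.
# 95% similarity over 64 bits means at most 3 differing bits.

MAX_HAMMING = 3  # int(64 * (1 - 0.95))

def hamming(a, b):
    """Number of differing bits between two hash integers."""
    return bin(a ^ b).count("1")

def find_near_duplicates(hashes):
    """hashes: {path: 64-bit int}. Return (path_a, path_b) pairs to review."""
    items = sorted(hashes.items())
    dupes = []
    for i, (pa, ha) in enumerate(items):
        for pb, hb in items[i + 1:]:
            if hamming(ha, hb) <= MAX_HAMMING:
                dupes.append((pa, pb))
    return dupes
```

The pairwise loop is quadratic; at this dataset size (50K-200K images) a BK-tree or bucketing by hash prefix would be the practical variant, but the threshold logic is the same.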
### Task 2.8: Split Dataset

**Objective:** Create reproducible train/validation/test splits.

**Actions:**

1. Create `scripts/phase2/split_dataset.py` to:
   - Stratified split maintaining class distribution
   - 70% training, 15% validation, 15% test
   - Ensure no data leakage (no plant photo appears in multiple splits)
   - Handle class imbalance (minimum samples per class in each split)
2. Create manifest files:

   ```
   datasets/
   ├── train/
   │   ├── images/
   │   └── manifest.csv (path, label, scientific_name, plant_id)
   ├── val/
   │   ├── images/
   │   └── manifest.csv
   └── test/
       ├── images/
       └── manifest.csv
   ```

3. Generate split statistics report

**Output:**

- `datasets/train/`, `datasets/val/`, `datasets/test/` directories
- `output/split_statistics.json`
- `output/class_distribution.png` (per-split histogram)

**Validation:**

- Split ratios within 1% of target (70/15/15)
- Each class has minimum 5 samples in val and test sets
- No image appears in multiple splits
- Manifest files are complete and valid
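The core of `split_dataset.py` can be sketched as a per-class shuffle with a fixed seed, which gives both stratification and reproducibility:

```python
# Sketch: reproducible stratified 70/15/15 split. Shuffling within each
# class with a fixed seed keeps the class distribution in every split.
import random

def stratified_split(items_by_class, seed=42, ratios=(0.70, 0.15, 0.15)):
    """items_by_class: {label: [image_path, ...]}.
    Returns (train, val, test) lists of (path, label) pairs."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label, paths in sorted(items_by_class.items()):
        paths = sorted(paths)          # stable input order before shuffle
        rng.shuffle(paths)
        n = len(paths)
        n_train = int(n * ratios[0])
        n_val = int(n * ratios[1])
        train += [(p, label) for p in paths[:n_train]]
        val += [(p, label) for p in paths[n_train:n_train + n_val]]
        test += [(p, label) for p in paths[n_train + n_val:]]
    return train, val, test
```

Because each image path lands in exactly one slice of its class list, the no-leakage property holds by construction; the minimum-samples-per-class check still needs a separate guard for very small classes.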
## End-Phase Validation Checklist

Run `scripts/phase2/validate_phase2.py` to verify:
| # | Validation Criterion | Target | Pass/Fail |
|---|---|---|---|
| 1 | Total image count | 50,000 - 200,000 | [ ] |
| 2 | Plant coverage | ≥80% of 2,064 plants have images | [ ] |
| 3 | Minimum images per included plant | ≥50 images (relaxed from 100 for rare plants) | [ ] |
| 4 | Image quality | 100% pass resolution check | [ ] |
| 5 | No duplicates | 0 exact duplicates, <1% near-duplicates | [ ] |
| 6 | License compliance | 100% images have documented license | [ ] |
| 7 | Train/val/test split exists | All three directories with manifests | [ ] |
| 8 | Split ratio accuracy | Within 1% of 70/15/15 | [ ] |
| 9 | Stratification verified | Chi-square test p > 0.05 | [ ] |
| 10 | Metadata completeness | 100% images have source + license | [ ] |
**Phase 2 Complete When:** All 10 validation criteria pass.
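Criterion 9's stratification check can be implemented as a chi-square test of independence between split membership and class label, using the `scipy` dependency listed below; a p-value above 0.05 means the class distributions across train/val/test do not differ significantly.

```python
# Sketch: chi-square stratification check for the validation script.
# Rows are classes, columns are (train, val, test) image counts.
from scipy.stats import chi2_contingency

def stratification_p_value(counts):
    """Return the p-value of a chi-square independence test over the
    class-by-split contingency table; > 0.05 passes criterion 9."""
    chi2, p, dof, _ = chi2_contingency(counts)
    return p
```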
## Scripts Summary

| Script | Task | Input | Output |
|---|---|---|---|
| `research_datasets.py` | 2.1 | None | `dataset_research_report.json` |
| `cross_reference_plants.py` | 2.2 | Knowledge base | `cross_reference_report.json` |
| `download_inaturalist.py` | 2.3 | Plant list | Images + metadata |
| `download_plantclef.py` | 2.3 | Plant list | Images + metadata |
| `download_wikimedia.py` | 2.3 | Plant list | Images + metadata |
| `organize_images.py` | 2.3 | Raw images | `datasets/organized/` |
| `count_images.py` | 2.4 | Organized images | `image_count_report.json` |
| `identify_gaps.py` | 2.5 | Image counts | `gap_plants.json` |
| `download_flickr.py` | 2.6 | Gap plants | Supplementary images |
| `quality_filter.py` | 2.7 | All images | `datasets/verified/` |
| `label_verification.py` | 2.7 | Verified images | Review queue |
| `split_dataset.py` | 2.8 | Verified images | Train/val/test splits |
| `validate_phase2.py` | Final | All outputs | Validation report |
## Dependencies

```text
# requirements-phase2.txt
requests>=2.28.0
Pillow>=9.0.0
imagehash>=4.3.0
pandas>=1.5.0
tqdm>=4.64.0
python-dotenv>=1.0.0
matplotlib>=3.6.0
scipy>=1.9.0
```
## Environment Variables

```bash
# .env.phase2
INATURALIST_APP_ID=your_app_id
INATURALIST_APP_SECRET=your_secret
FLICKR_API_KEY=your_key
FLICKR_API_SECRET=your_secret
GOOGLE_CSE_API_KEY=your_key
GOOGLE_CSE_CX=your_cx
```
## Estimated Timeline
| Task | Effort | Notes |
|---|---|---|
| 2.1 Research | 1 day | Documentation and API testing |
| 2.2 Cross-reference | 1 day | API queries, matching logic |
| 2.3 Download | 3-5 days | Rate-limited by APIs |
| 2.4 Count | 0.5 day | Quick analysis |
| 2.5 Gap analysis | 0.5 day | Based on counts |
| 2.6 Supplementary | 2-3 days | Depends on gap size |
| 2.7 Quality verification | 2 days | Includes manual review |
| 2.8 Split | 0.5 day | Automated |
| Validation | 0.5 day | Final checks |
## Risk Mitigation
| Risk | Mitigation |
|---|---|
| API rate limits | Implement backoff, cache responses, spread over time |
| Low coverage for rare plants | Accept lower threshold (50 images) with augmentation in Phase 3 |
| License issues | Track all sources, prefer CC-licensed content |
| Storage limits | Implement progressive download, compress as needed |
| Label noise | Use pretrained model for sanity check, human review queue |
## Next Steps After Phase 2

- Review `output/image_count_report.json` for Phase 3 augmentation priorities
- Ensure `datasets/train/manifest.csv` format is compatible with the training framework
- Document any plants excluded due to insufficient images