# Phase 2: Image Dataset Acquisition - Implementation Plan

## Overview

**Goal:** Gather labeled plant images matching our 2,064-plant knowledge base from Phase 1.

**Target Deliverable:** Labeled image dataset with 50,000-200,000 images across target plant classes, split into training (70%), validation (15%), and test (15%) sets.

---

## Prerequisites

- [x] Phase 1 complete: `data/final_knowledge_base.json` (2,064 plants)
- [x] SQLite database: `knowledge_base/plants.db`
- [ ] Python environment with required packages
- [ ] API keys for image sources (iNaturalist, Flickr, etc.)
- [ ] Storage space: ~50-100GB for raw images

---

## Task Breakdown

### Task 2.1: Research Public Plant Image Datasets

**Objective:** Evaluate available datasets for compatibility with our plant list.

**Actions:**

1. Research and document each dataset:
   - **PlantCLEF** - Download links, species coverage, image format, license
   - **iNaturalist** - API access, species coverage, observation quality filters
   - **PlantNet (Pl@ntNet)** - API documentation, rate limits, attribution requirements
   - **Oxford Flowers 102** - Direct download, category mapping
   - **Wikimedia Commons** - API access for botanical images
2. Create `scripts/phase2/research_datasets.py` to:
   - Query each API for available species counts
   - Document download procedures and authentication
   - Estimate total available images per source

**Output:** `output/dataset_research_report.json`

**Validation:**

- [ ] Report contains at least 4 dataset sources
- [ ] Each source has documented: URL, license, estimated image count, access method

---

### Task 2.2: Cross-Reference Datasets with Plant List

**Objective:** Identify which plants from our knowledge base have images in public datasets.

**Actions:**

1. Create `scripts/phase2/cross_reference_plants.py` to:
   - Load plant list from `data/final_knowledge_base.json`
   - Query each dataset API for matching scientific names
   - Handle synonyms using `data/synonyms.json`
   - Track exact matches, synonym matches, and genus-level matches
2. Generate coverage matrix: plants × datasets

**Output:**

- `output/dataset_coverage_matrix.json` - Per-plant availability
- `output/cross_reference_report.json` - Summary statistics

**Validation:**

- [ ] Coverage matrix includes all 2,064 plants
- [ ] Report shows percentage coverage per dataset
- [ ] Total number of unique plants with at least one dataset match is identified

---

### Task 2.3: Download and Organize Images

**Objective:** Download images from selected sources and organize by species.

**Actions:**

1. Create directory structure:

   ```
   datasets/
   ├── raw/
   │   ├── inaturalist/
   │   ├── plantclef/
   │   ├── wikimedia/
   │   └── flickr/
   └── organized/
       └── {scientific_name}/
           ├── img_001.jpg
           └── metadata.json
   ```

2. Create `scripts/phase2/download_inaturalist.py`:
   - Use iNaturalist API with research-grade filter
   - Download at most 500 images per species
   - Include metadata (observer, date, location, license)
   - Handle rate limiting with exponential backoff
3. Create `scripts/phase2/download_plantclef.py`:
   - Download from PlantCLEF challenge archives
   - Extract and organize by species
4. Create `scripts/phase2/download_wikimedia.py`:
   - Query Wikimedia Commons API for botanical images
   - Filter by license (CC-BY, CC-BY-SA, public domain)
5. Create `scripts/phase2/organize_images.py`:
   - Consolidate images from all sources
   - Rename with consistent naming: `{plant_id}_{source}_{index}.jpg`
   - Generate per-species `metadata.json`

**Output:**

- `datasets/organized/` - Organized image directory
- `output/download_progress.json` - Download status per species

**Validation:**

- [ ] Images organized in consistent directory structure
- [ ] Each image has source attribution in metadata
- [ ] Progress tracking shows download status for all plants

---

### Task 2.4: Establish Minimum Image Count per Class

**Objective:** Define and track image count thresholds.

**Actions:**

1. Create `scripts/phase2/count_images.py` to:
   - Count images per species in `datasets/organized/`
   - Classify plants into coverage tiers:
     - **Excellent:** 200+ images
     - **Good:** 100-199 images (target minimum)
     - **Marginal:** 50-99 images
     - **Insufficient:** 10-49 images
     - **Critical:** <10 images
2. Generate coverage report with distribution histogram

**Output:**

- `output/image_count_report.json`
- `output/coverage_histogram.png`

**Validation:**

- [ ] Target: At least 60% of plants have 100+ images
- [ ] Report identifies all plants below minimum threshold
- [ ] Total image count within target range (50K-200K)

---

### Task 2.5: Identify Gap Plants

**Objective:** Find plants needing supplementary images.

**Actions:**

1. Create `scripts/phase2/identify_gaps.py` to:
   - List plants with <100 images
   - Prioritize gaps by:
     - Plant popularity/commonality
     - Category importance (user-facing plants first)
     - Ease of sourcing (common names available)
2. Generate prioritized gap list with recommended sources

**Output:**

- `output/gap_plants.json` - Prioritized list with current counts
- `output/gap_analysis_report.md` - Human-readable analysis

**Validation:**

- [ ] Gap list includes all plants under 100-image threshold
- [ ] Each gap plant has recommended supplementary sources
- [ ] Priority scores assigned based on criteria

---

### Task 2.6: Source Supplementary Images

**Objective:** Fill gaps using additional image sources.

**Actions:**

1. Create `scripts/phase2/download_flickr.py`:
   - Use Flickr API with botanical/plant tags
   - Filter by license (CC-BY, CC-BY-SA)
   - Search by scientific name AND common names
2. Create `scripts/phase2/download_google_images.py`:
   - Use Google Custom Search API (paid tier)
   - Apply strict botanical filters
   - Download only high-resolution images
3. Create `scripts/phase2/manual_curation_list.py`:
   - Generate list of gap plants requiring manual sourcing
   - Create curation checklist for human review
4. Update `organize_images.py` to incorporate supplementary sources

**Output:**

- Updated `datasets/organized/` with supplementary images
- `output/supplementary_download_report.json`
- `output/manual_curation_checklist.md` (if needed)

**Validation:**

- [ ] Gap plants have improved coverage
- [ ] All supplementary images have proper licensing
- [ ] Re-run of Task 2.4 shows improved coverage metrics

---

### Task 2.7: Verify Image Quality and Labels

**Objective:** Remove mislabeled and low-quality images.

**Actions:**

1. Create `scripts/phase2/quality_filter.py` to:
   - Detect corrupt/truncated images
   - Filter by minimum resolution (at least 224×224 pixels)
   - Detect duplicates using perceptual hashing (pHash)
   - Flag images with text overlays/watermarks
2. Create `scripts/phase2/label_verification.py` to:
   - Use a pretrained plant classifier for a sanity check
   - Flag images where model confidence is very low
   - Generate review queue for human verification
3. Create `scripts/phase2/human_review_tool.py`:
   - Simple CLI tool for reviewing flagged images
   - Accept/reject/relabel options
   - Track reviewer decisions

**Output:**

- `datasets/verified/` - Cleaned image directory
- `output/quality_report.json` - Filtering statistics
- `output/removed_images.json` - Log of removed images with reasons

**Validation:**

- [ ] All images pass minimum resolution check
- [ ] No duplicate images (within 95% perceptual similarity)
- [ ] Flagged images reviewed and resolved
- [ ] Removal rate documented (<20% expected)

---

### Task 2.8: Split Dataset

**Objective:** Create reproducible train/validation/test splits.

**Actions:**

1. Create `scripts/phase2/split_dataset.py` to:
   - Perform a stratified split maintaining class distribution
   - Use 70% training, 15% validation, 15% test
   - Ensure no data leakage (the same photo must not appear in more than one split)
   - Handle class imbalance (enforce minimum samples per class in each split)
2. Create manifest files:

   ```
   datasets/
   ├── train/
   │   ├── images/
   │   └── manifest.csv  (path, label, scientific_name, plant_id)
   ├── val/
   │   ├── images/
   │   └── manifest.csv
   └── test/
       ├── images/
       └── manifest.csv
   ```

3. Generate split statistics report

**Output:**

- `datasets/train/`, `datasets/val/`, `datasets/test/` directories
- `output/split_statistics.json`
- `output/class_distribution.png` (per-split histogram)

**Validation:**

- [ ] Split ratios within 1% of target (70/15/15)
- [ ] Each class has minimum 5 samples in val and test sets
- [ ] No image appears in multiple splits
- [ ] Manifest files are complete and valid

---

## End-Phase Validation Checklist

Run `scripts/phase2/validate_phase2.py` to verify:

| # | Validation Criterion | Target | Pass/Fail |
|---|---------------------|--------|-----------|
| 1 | Total image count | 50,000 - 200,000 | [ ] |
| 2 | Plant coverage | ≥80% of 2,064 plants have images | [ ] |
| 3 | Minimum images per included plant | ≥50 images (relaxed from 100 for rare plants) | [ ] |
| 4 | Image quality | 100% pass resolution check | [ ] |
| 5 | No duplicates | 0 exact duplicates, <1% near-duplicates | [ ] |
| 6 | License compliance | 100% of images have documented license | [ ] |
| 7 | Train/val/test split exists | All three directories with manifests | [ ] |
| 8 | Split ratio accuracy | Within 1% of 70/15/15 | [ ] |
| 9 | Stratification verified | Chi-square test p > 0.05 | [ ] |
| 10 | Metadata completeness | 100% of images have source + license | [ ] |

**Phase 2 Complete When:** All 10 validation criteria pass.
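The leakage and ratio criteria above (and the Task 2.8 split itself) can be implemented with the standard library alone. The following is a minimal sketch, not the final `split_dataset.py`: the `stratified_split` name, the `label_fn` callback, and the item shape are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_split(items, label_fn, ratios=(0.70, 0.15, 0.15), seed=42):
    """Split items into train/val/test, stratified by class label.

    Shuffling happens within each class, so every split preserves the
    overall class distribution, and each item lands in exactly one
    split (no leakage by construction).
    """
    rng = random.Random(seed)  # fixed seed -> reproducible splits
    by_class = defaultdict(list)
    for item in items:
        by_class[label_fn(item)].append(item)

    splits = {"train": [], "val": [], "test": []}
    # sorted() keeps iteration order deterministic across runs
    for label, members in sorted(by_class.items()):
        rng.shuffle(members)
        n_train = int(len(members) * ratios[0])
        n_val = int(len(members) * ratios[1])
        splits["train"].extend(members[:n_train])
        splits["val"].extend(members[n_train:n_train + n_val])
        splits["test"].extend(members[n_train + n_val:])  # remainder
    return splits
```

Because rounding happens per class, small classes may fall short of the 5-sample minimum in val/test; the real script would need to detect and report those classes rather than silently under-filling them.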
---

## Scripts Summary

| Script | Task | Input | Output |
|--------|------|-------|--------|
| `research_datasets.py` | 2.1 | None | `dataset_research_report.json` |
| `cross_reference_plants.py` | 2.2 | Knowledge base | `cross_reference_report.json` |
| `download_inaturalist.py` | 2.3 | Plant list | Images + metadata |
| `download_plantclef.py` | 2.3 | Plant list | Images + metadata |
| `download_wikimedia.py` | 2.3 | Plant list | Images + metadata |
| `organize_images.py` | 2.3 | Raw images | `datasets/organized/` |
| `count_images.py` | 2.4 | Organized images | `image_count_report.json` |
| `identify_gaps.py` | 2.5 | Image counts | `gap_plants.json` |
| `download_flickr.py` | 2.6 | Gap plants | Supplementary images |
| `download_google_images.py` | 2.6 | Gap plants | Supplementary images |
| `manual_curation_list.py` | 2.6 | Gap plants | `manual_curation_checklist.md` |
| `quality_filter.py` | 2.7 | All images | `datasets/verified/` |
| `label_verification.py` | 2.7 | Verified images | Review queue |
| `human_review_tool.py` | 2.7 | Review queue | Reviewer decisions |
| `split_dataset.py` | 2.8 | Verified images | Train/val/test splits |
| `validate_phase2.py` | Final | All outputs | Validation report |

---

## Dependencies

```
# requirements-phase2.txt
requests>=2.28.0
Pillow>=9.0.0
imagehash>=4.3.0
pandas>=1.5.0
tqdm>=4.64.0
python-dotenv>=1.0.0
matplotlib>=3.6.0
scipy>=1.9.0
```

---

## Environment Variables

```
# .env.phase2
INATURALIST_APP_ID=your_app_id
INATURALIST_APP_SECRET=your_secret
FLICKR_API_KEY=your_key
FLICKR_API_SECRET=your_secret
GOOGLE_CSE_API_KEY=your_key
GOOGLE_CSE_CX=your_cx
```

---

## Estimated Timeline

| Task | Effort | Notes |
|------|--------|-------|
| 2.1 Research | 1 day | Documentation and API testing |
| 2.2 Cross-reference | 1 day | API queries, matching logic |
| 2.3 Download | 3-5 days | Rate-limited by APIs |
| 2.4 Count | 0.5 day | Quick analysis |
| 2.5 Gap analysis | 0.5 day | Based on counts |
| 2.6 Supplementary | 2-3 days | Depends on gap size |
| 2.7 Quality verification | 2 days | Includes manual review |
| 2.8 Split | 0.5 day | Automated |
| Validation | 0.5 day | Final checks |

---

## Risk Mitigation

| Risk | Mitigation |
|------|------------|
| API rate limits | Implement backoff, cache responses, spread over time |
| Low coverage for rare plants | Accept lower threshold (50 images) with augmentation in Phase 3 |
| License issues | Track all sources, prefer CC-licensed content |
| Storage limits | Implement progressive download, compress as needed |
| Label noise | Use pretrained model for sanity check, human review queue |

---

## Next Steps After Phase 2

1. Review `output/image_count_report.json` for Phase 3 augmentation priorities
2. Ensure `datasets/train/manifest.csv` format is compatible with training framework
3. Document any plants excluded due to insufficient images
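
The exponential backoff named in Task 2.3 and in the rate-limit row of the risk table might look like the sketch below. It is illustrative only: `RateLimitError` and the zero-argument `fetch` callable stand in for whichever HTTP client and error type the download scripts actually use, and `sleep` is injectable so the retry logic can be tested without waiting.

```python
import time

class RateLimitError(Exception):
    """Placeholder for the API client's rate-limit error (e.g. HTTP 429)."""

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Retry fetch() with exponentially growing delays on rate-limit errors.

    Delays are base_delay * 2**attempt (1s, 2s, 4s, ...); the final
    failure is re-raised so callers can log and skip the species.
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            sleep(base_delay * (2 ** attempt))
```

A production version would typically also honor the `Retry-After` response header when the API provides one, rather than relying on the fixed doubling schedule alone.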