

Phase 2: Image Dataset Acquisition - Implementation Plan

Overview

Goal: Gather labeled plant images matching our 2,064-plant knowledge base from Phase 1.

Target Deliverable: Labeled image dataset with 50,000-200,000 images across target plant classes, split into training (70%), validation (15%), and test (15%) sets.


Prerequisites

  • Phase 1 complete: data/final_knowledge_base.json (2,064 plants)
  • SQLite database: knowledge_base/plants.db
  • Python environment with required packages
  • API keys for image sources (iNaturalist, Flickr, etc.)
  • Storage space: ~50-100GB for raw images

Task Breakdown

Task 2.1: Research Public Plant Image Datasets

Objective: Evaluate available datasets for compatibility with our plant list.

Actions:

  1. Research and document each dataset:

    • PlantCLEF - Download links, species coverage, image format, license
    • iNaturalist - API access, species coverage, observation quality filters
    • PlantNet (Pl@ntNet) - API documentation, rate limits, attribution requirements
    • Oxford Flowers 102 - Direct download, category mapping
    • Wikimedia Commons - API access for botanical images
  2. Create scripts/phase2/research_datasets.py to:

    • Query each API for available species counts
    • Document download procedures and authentication
    • Estimate total available images per source
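
As a sketch of the per-source species-count query, the iNaturalist `/v1/taxa` endpoint can be queried by scientific name (the endpoint, `q`/`rank` parameters, and `total_results` response field follow the public v1 API; the helper names are ours):

```python
import requests

INAT_TAXA = "https://api.inaturalist.org/v1/taxa"

def species_count_params(scientific_name):
    """Query params for iNaturalist's /v1/taxa endpoint; the response's
    `total_results` field reports how many matching taxa exist."""
    return {"q": scientific_name, "rank": "species", "per_page": 1}

def count_matching_taxa(scientific_name):
    """One network call per plant; callers should rate-limit (see Task 2.3)."""
    resp = requests.get(INAT_TAXA, params=species_count_params(scientific_name),
                        timeout=30)
    resp.raise_for_status()
    return resp.json()["total_results"]
```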

Output: output/dataset_research_report.json

Validation:

  • Report contains at least 4 dataset sources
  • Each source has documented: URL, license, estimated image count, access method

Task 2.2: Cross-Reference Datasets with Plant List

Objective: Identify which plants from our knowledge base have images in public datasets.

Actions:

  1. Create scripts/phase2/cross_reference_plants.py to:

    • Load plant list from data/final_knowledge_base.json
    • Query each dataset API for matching scientific names
    • Handle synonyms using data/synonyms.json
    • Track exact matches, synonym matches, and genus-level matches
  2. Generate coverage matrix: plants × datasets
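
The exact/synonym/genus matching in step 1 might be sketched as follows (the shape of `data/synonyms.json` — a map from accepted scientific name to a list of alternative names — is an assumption):

```python
def match_plant(name, dataset_species, synonyms):
    """Classify how `name` matches one dataset's species list.

    dataset_species: set of scientific names offered by the dataset.
    synonyms: accepted name -> list of alternative names (assumed shape
    of data/synonyms.json).
    """
    if name in dataset_species:
        return "exact"
    if any(s in dataset_species for s in synonyms.get(name, [])):
        return "synonym"
    genus = name.split()[0]  # genus is the first word of a binomial name
    if any(s.split()[0] == genus for s in dataset_species):
        return "genus"
    return "none"
```

The coverage matrix is then a mapping of plant → {dataset: match type}, built by calling this once per plant per dataset.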

Output:

  • output/dataset_coverage_matrix.json - Per-plant availability
  • output/cross_reference_report.json - Summary statistics

Validation:

  • Coverage matrix includes all 2,064 plants
  • Report shows percentage coverage per dataset
  • Report identifies the total number of unique plants with at least one dataset match

Task 2.3: Download and Organize Images

Objective: Download images from selected sources and organize by species.

Actions:

  1. Create directory structure:

    datasets/
    ├── raw/
    │   ├── inaturalist/
    │   ├── plantclef/
    │   ├── wikimedia/
    │   └── flickr/
    └── organized/
        └── {scientific_name}/
            ├── img_001.jpg
            └── metadata.json
    
  2. Create scripts/phase2/download_inaturalist.py:

    • Use iNaturalist API with research-grade filter
    • Download max 500 images per species
    • Include metadata (observer, date, location, license)
    • Handle rate limiting with exponential backoff
  3. Create scripts/phase2/download_plantclef.py:

    • Download from PlantCLEF challenge archives
    • Extract and organize by species
  4. Create scripts/phase2/download_wikimedia.py:

    • Query Wikimedia Commons API for botanical images
    • Filter by license (CC-BY, CC-BY-SA, public domain)
  5. Create scripts/phase2/organize_images.py:

    • Consolidate images from all sources
    • Rename with consistent naming: {plant_id}_{source}_{index}.jpg
    • Generate per-species metadata.json
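
A minimal sketch of the rate-limited iNaturalist fetch (the `/v1/observations` endpoint and the `taxon_name`/`quality_grade` parameters follow the public v1 API; the retry policy and limits here are illustrative):

```python
import time

import requests

INAT_OBS = "https://api.inaturalist.org/v1/observations"

def backoff_delays(retries, base=1.0, cap=60.0):
    """Exponential backoff schedule: base, 2*base, 4*base, ... capped at `cap` seconds."""
    return [min(base * 2 ** i, cap) for i in range(retries)]

def fetch_observations(taxon_name, page=1, per_page=200, retries=5):
    """One page of research-grade observations with photos for a species."""
    params = {
        "taxon_name": taxon_name,
        "quality_grade": "research",  # research-grade filter from the plan
        "photos": "true",
        "page": page,
        "per_page": per_page,
    }
    for delay in backoff_delays(retries):
        resp = requests.get(INAT_OBS, params=params, timeout=30)
        if resp.status_code == 200:
            return resp.json()["results"]
        time.sleep(delay)  # back off on 429/5xx, then retry
    raise RuntimeError(f"gave up fetching observations for {taxon_name}")
```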

Output:

  • datasets/organized/ - Organized image directory
  • output/download_progress.json - Download status per species

Validation:

  • Images organized in consistent directory structure
  • Each image has source attribution in metadata
  • Progress tracking shows download status for all plants

Task 2.4: Establish Minimum Image Count per Class

Objective: Define and track image count thresholds.

Actions:

  1. Create scripts/phase2/count_images.py to:

    • Count images per species in datasets/organized/
    • Classify plants into coverage tiers:
      • Excellent: 200+ images
      • Good: 100-199 images (target minimum)
      • Marginal: 50-99 images
      • Insufficient: 10-49 images
      • Critical: <10 images
  2. Generate coverage report with distribution histogram
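
The tier thresholds above translate directly into a small helper (a sketch; the directory layout is assumed to follow Task 2.3's `datasets/organized/{scientific_name}/` structure):

```python
from pathlib import Path

def coverage_tier(n):
    """Map a per-species image count to the tiers defined above."""
    if n >= 200:
        return "excellent"
    if n >= 100:
        return "good"
    if n >= 50:
        return "marginal"
    if n >= 10:
        return "insufficient"
    return "critical"

def count_per_species(root="datasets/organized"):
    """Image count per species directory."""
    return {d.name: sum(1 for _ in d.glob("*.jpg"))
            for d in Path(root).iterdir() if d.is_dir()}
```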

Output:

  • output/image_count_report.json
  • output/coverage_histogram.png

Validation:

  • Target: At least 60% of plants have 100+ images
  • Report identifies all plants below minimum threshold
  • Total image count within target range (50K-200K)

Task 2.5: Identify Gap Plants

Objective: Find plants needing supplementary images.

Actions:

  1. Create scripts/phase2/identify_gaps.py to:

    • List plants with <100 images
    • Prioritize gaps by:
      • Plant popularity/commonality
      • Category importance (user-facing plants first)
      • Ease of sourcing (common names available)
  2. Generate prioritized gap list with recommended sources
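
One way to combine those criteria into a single score (the 0.5/0.3/0.2 weights and the 0–1 input normalization are purely illustrative, not part of the plan):

```python
def priority_score(popularity, category_weight, has_common_name):
    """Higher score = fill this gap first.

    popularity and category_weight are assumed normalized to 0..1;
    a common name counts toward ease of sourcing. Weights are illustrative.
    """
    ease = 1.0 if has_common_name else 0.0
    return 0.5 * popularity + 0.3 * category_weight + 0.2 * ease
```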

Output:

  • output/gap_plants.json - Prioritized list with current counts
  • output/gap_analysis_report.md - Human-readable analysis

Validation:

  • Gap list includes all plants under 100-image threshold
  • Each gap plant has recommended supplementary sources
  • Priority scores assigned based on criteria

Task 2.6: Source Supplementary Images

Objective: Fill gaps using additional image sources.

Actions:

  1. Create scripts/phase2/download_flickr.py:

    • Use Flickr API with botanical/plant tags
    • Filter by license (CC-BY, CC-BY-SA)
    • Search by scientific name AND common names
  2. Create scripts/phase2/download_google_images.py:

    • Use Google Custom Search API (paid tier)
    • Apply strict botanical filters
    • Download only high-resolution images
  3. Create scripts/phase2/manual_curation_list.py:

    • Generate list of gap plants requiring manual sourcing
    • Create curation checklist for human review
  4. Update organize_images.py to incorporate supplementary sources
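
For download_flickr.py, the license filter maps onto Flickr's numeric license IDs (4 = CC BY and 5 = CC BY-SA in Flickr's published license list); a sketch of the `flickr.photos.search` parameters, to be issued once for the scientific name and once per common name:

```python
FLICKR_REST = "https://www.flickr.com/services/rest/"

def flickr_search_params(api_key, query, page=1):
    """Params for flickr.photos.search, restricted to CC BY / CC BY-SA."""
    return {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "text": query,
        "license": "4,5",        # 4 = CC BY, 5 = CC BY-SA
        "content_type": 1,       # photos only
        "format": "json",
        "nojsoncallback": 1,
        "per_page": 250,
        "page": page,
    }
```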

Output:

  • Updated datasets/organized/ with supplementary images
  • output/supplementary_download_report.json
  • output/manual_curation_checklist.md (if needed)

Validation:

  • Gap plants have improved coverage
  • All supplementary images have proper licensing
  • Re-run Task 2.4 shows improved coverage metrics

Task 2.7: Verify Image Quality and Labels

Objective: Remove mislabeled and low-quality images.

Actions:

  1. Create scripts/phase2/quality_filter.py to:

    • Detect corrupt/truncated images
    • Enforce the minimum resolution (at least 224×224)
    • Detect duplicates using perceptual hashing (pHash)
    • Flag images with text overlays/watermarks
  2. Create scripts/phase2/label_verification.py to:

    • Use pretrained plant classifier for sanity check
    • Flag images where model confidence is very low
    • Generate review queue for human verification
  3. Create scripts/phase2/human_review_tool.py:

    • Simple CLI tool for reviewing flagged images
    • Accept/reject/relabel options
    • Track reviewer decisions

Output:

  • datasets/verified/ - Cleaned image directory
  • output/quality_report.json - Filtering statistics
  • output/removed_images.json - Log of removed images with reasons

Validation:

  • All images pass minimum resolution check
  • No duplicate images (within 95% perceptual similarity)
  • Flagged images reviewed and resolved
  • Removal rate documented (<20% expected)

Task 2.8: Split Dataset

Objective: Create reproducible train/validation/test splits.

Actions:

  1. Create scripts/phase2/split_dataset.py to:

    • Stratified split maintaining class distribution
    • 70% training, 15% validation, 15% test
    • Ensure no data leakage (the same photo must never appear in more than one split)
    • Handle class imbalance (minimum samples per class in each split)
  2. Create manifest files:

    datasets/
    ├── train/
    │   ├── images/
    │   └── manifest.csv  (path, label, scientific_name, plant_id)
    ├── val/
    │   ├── images/
    │   └── manifest.csv
    └── test/
        ├── images/
        └── manifest.csv
    
  3. Generate split statistics report
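
A sketch of the per-class (stratified) split: each class is shuffled with a fixed seed for reproducibility, and the `min_eval` floor mirrors the minimum-5-samples-per-split criterion; classes too small to satisfy it are excluded (exclusions to be documented per Next Steps):

```python
import random

def stratified_split(items_by_class, ratios=(0.70, 0.15, 0.15),
                     min_eval=5, seed=42):
    """Slice each class into train/val/test, preserving class distribution."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label, items in items_by_class.items():
        items = list(items)
        rng.shuffle(items)
        n = len(items)
        n_val = max(min_eval, int(n * ratios[1]))
        n_test = max(min_eval, int(n * ratios[2]))
        if n - n_val - n_test < 1:
            continue  # too few samples to split; exclude this class
        splits["test"].extend((label, x) for x in items[:n_test])
        splits["val"].extend((label, x) for x in items[n_test:n_test + n_val])
        splits["train"].extend((label, x) for x in items[n_test + n_val:])
    return splits
```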

Output:

  • datasets/train/, datasets/val/, datasets/test/ directories
  • output/split_statistics.json
  • output/class_distribution.png (per-split histogram)

Validation:

  • Split ratios within 1% of target (70/15/15)
  • Each class has minimum 5 samples in val and test sets
  • No image appears in multiple splits
  • Manifest files are complete and valid

End-Phase Validation Checklist

Run scripts/phase2/validate_phase2.py to verify:

| # | Validation Criterion | Target | Pass/Fail |
|---|---|---|---|
| 1 | Total image count | 50,000 - 200,000 | [ ] |
| 2 | Plant coverage | ≥80% of 2,064 plants have images | [ ] |
| 3 | Minimum images per included plant | ≥50 images (relaxed from 100 for rare plants) | [ ] |
| 4 | Image quality | 100% pass resolution check | [ ] |
| 5 | No duplicates | 0 exact duplicates, <1% near-duplicates | [ ] |
| 6 | License compliance | 100% of images have documented license | [ ] |
| 7 | Train/val/test split exists | All three directories with manifests | [ ] |
| 8 | Split ratio accuracy | Within 1% of 70/15/15 | [ ] |
| 9 | Stratification verified | Chi-square test p > 0.05 | [ ] |
| 10 | Metadata completeness | 100% of images have source + license | [ ] |

Phase 2 Complete When: All 10 validation criteria pass.
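Criterion 9 can be checked with `scipy.stats.chi2_contingency` over a splits × classes count table (a sketch; scipy is already a Phase 2 dependency, and the counts shape is our assumption):

```python
from scipy.stats import chi2_contingency

def stratification_ok(counts, alpha=0.05):
    """counts: {split: {class: n}}. Returns True when the chi-square test
    cannot distinguish the per-split class distributions (p > alpha)."""
    classes = sorted({c for per_split in counts.values() for c in per_split})
    table = [[counts[s].get(c, 0) for c in classes]
             for s in ("train", "val", "test")]
    _, p, _, _ = chi2_contingency(table)
    return p > alpha
```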


Scripts Summary

| Script | Task | Input | Output |
|---|---|---|---|
| research_datasets.py | 2.1 | None | dataset_research_report.json |
| cross_reference_plants.py | 2.2 | Knowledge base | cross_reference_report.json |
| download_inaturalist.py | 2.3 | Plant list | Images + metadata |
| download_plantclef.py | 2.3 | Plant list | Images + metadata |
| download_wikimedia.py | 2.3 | Plant list | Images + metadata |
| organize_images.py | 2.3 | Raw images | datasets/organized/ |
| count_images.py | 2.4 | Organized images | image_count_report.json |
| identify_gaps.py | 2.5 | Image counts | gap_plants.json |
| download_flickr.py | 2.6 | Gap plants | Supplementary images |
| download_google_images.py | 2.6 | Gap plants | Supplementary images |
| manual_curation_list.py | 2.6 | Gap plants | manual_curation_checklist.md |
| quality_filter.py | 2.7 | All images | datasets/verified/ |
| label_verification.py | 2.7 | Verified images | Review queue |
| human_review_tool.py | 2.7 | Review queue | Reviewer decisions |
| split_dataset.py | 2.8 | Verified images | Train/val/test splits |
| validate_phase2.py | Final | All outputs | Validation report |

Dependencies

```
# requirements-phase2.txt
requests>=2.28.0
Pillow>=9.0.0
imagehash>=4.3.0
pandas>=1.5.0
tqdm>=4.64.0
python-dotenv>=1.0.0
matplotlib>=3.6.0
scipy>=1.9.0
```

Environment Variables

```
# .env.phase2
INATURALIST_APP_ID=your_app_id
INATURALIST_APP_SECRET=your_secret
FLICKR_API_KEY=your_key
FLICKR_API_SECRET=your_secret
GOOGLE_CSE_API_KEY=your_key
GOOGLE_CSE_CX=your_cx
```

Estimated Timeline

| Task | Effort | Notes |
|---|---|---|
| 2.1 Research | 1 day | Documentation and API testing |
| 2.2 Cross-reference | 1 day | API queries, matching logic |
| 2.3 Download | 3-5 days | Rate-limited by APIs |
| 2.4 Count | 0.5 day | Quick analysis |
| 2.5 Gap analysis | 0.5 day | Based on counts |
| 2.6 Supplementary | 2-3 days | Depends on gap size |
| 2.7 Quality verification | 2 days | Includes manual review |
| 2.8 Split | 0.5 day | Automated |
| Validation | 0.5 day | Final checks |

Risk Mitigation

| Risk | Mitigation |
|---|---|
| API rate limits | Implement backoff, cache responses, spread downloads over time |
| Low coverage for rare plants | Accept lower threshold (50 images) with augmentation in Phase 3 |
| License issues | Track all sources, prefer CC-licensed content |
| Storage limits | Implement progressive download, compress as needed |
| Label noise | Use pretrained model for sanity check, human review queue |

Next Steps After Phase 2

  1. Review output/image_count_report.json for Phase 3 augmentation priorities
  2. Ensure datasets/train/manifest.csv format is compatible with training framework
  3. Document any plants excluded due to insufficient images