PlantGuideScraper/accum_images.md
2026-04-12 09:54:27 -05:00

Houseplant Image Dataset Accumulation Plan

Overview

Build a custom CoreML model for houseplant identification by accumulating a large dataset of houseplant images with proper licensing for commercial use.


Requirements Summary

| Parameter | Value |
|---|---|
| Target species | 5,000-10,000 (realistic houseplant ceiling) |
| Images per species | 200-500 (recommended) |
| Total images | ~1-5 million |
| Budget | Free preferred, paid as reference |
| Compute | M1 Max Mac (training) + Unraid server (data pipeline) |
| Curation | Automated pipeline |
| Timeline | Weeks-months |
| Licensing | Must allow training + commercial model distribution |

Hardware Assessment

| Machine | Role | Capability |
|---|---|---|
| M1 Max Mac | Training | Create ML can train 5-10K class models; 32+ GB unified memory is ideal |
| Unraid Server | Data pipeline | Scraping, downloading, preprocessing, storage |

M1 Max is legitimately viable for this task via Create ML or PyTorch+MPS. No cloud GPU required.


Data Sources Analysis

Tier 1: Primary Sources

| Source | License | Commercial-Safe | Volume | Houseplant Coverage | Access Method |
|---|---|---|---|---|---|
| iNaturalist via GBIF | CC-BY, CC0 (filter) | Yes (filtered) | 100M+ observations | Good (has "captive/cultivated" flag) | Bulk export + API |
| Flickr | CC-BY, CC0 (filter) | Yes (filtered) | Millions | Moderate | API |
| Wikimedia Commons | CC-BY, CC-BY-SA, Public Domain | Mostly | Thousands | Moderate | API |

Tier 2: Supplemental Sources

| Source | License | Commercial-Safe | Notes |
|---|---|---|---|
| USDA PLANTS | Public Domain | Yes | US-focused, limited images |
| Encyclopedia of Life | Mixed | Check each | Aggregator, good metadata |
| Pl@ntNet-300K Dataset | CC-BY-SA | Share-alike | Good for research/prototyping only |

Tier 3: Paid Options (Reference)

| Source | Estimated Cost | Notes |
|---|---|---|
| iNaturalist AWS Open Data | Free | Bulk image export; requires S3 costs for transfer |
| Custom scraping infrastructure | $50-200/mo | Proxies, storage, bandwidth |
| Commercial botanical databases | $1000s+ | Getty, Alamy — not recommended |

Licensing Decision Matrix

Want commercial model distribution?
├─ YES → Use ONLY: CC0, CC-BY, Public Domain
│        Filter OUT: CC-BY-NC, CC-BY-SA, All Rights Reserved
│
└─ NO (research only) → Can use CC-BY-NC, CC-BY-SA
                        Pl@ntNet-300K dataset becomes viable

Recommendation: Filter for commercial-safe licenses from day 1. This avoids having to re-scrape everything later.
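The day-1 filter can be a small predicate applied at download time. A minimal sketch (the helper name is illustrative; the short license codes follow the convention iNaturalist/GBIF metadata uses, e.g. "cc-by", "cc0"):

```python
# Licenses that permit both training and commercial model distribution.
COMMERCIAL_SAFE = {"cc0", "cc-by", "pd", "public domain"}

def is_commercial_safe(license_str: str) -> bool:
    """Return True only for CC0, CC-BY, and public-domain licenses."""
    norm = license_str.strip().lower().replace("_", "-")
    # Explicitly reject non-commercial (NC), share-alike (SA), and
    # no-derivatives (ND) variants, which all fail the decision matrix.
    if "nc" in norm or "sa" in norm or "nd" in norm:
        return False
    return norm in COMMERCIAL_SAFE
```

Applying this before the image is ever written to disk also keeps the rejected-license metadata out of the dataset entirely.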


Houseplant Species Taxonomy

Problem: No canonical "houseplant" species list exists. Must construct one.

Approach:

  1. Start with Wikipedia "List of houseplants" (~500 species)
  2. Expand via genus crawl (all Philodendron, all Hoya, etc.)
  3. Cross-reference with RHS, ASPCA, nursery catalogs
  4. Target: 1,000-3,000 species is realistic for a quality dataset
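The merge step in the approach above can be sketched as a small helper (the function name and the (scientific name, common name) tuple format are assumptions; the real inputs would be the scraped Wikipedia/RHS/nursery lists):

```python
def merge_species_lists(*sources):
    """Merge species lists from multiple sources, deduplicating by
    normalized scientific name. Keeps the first common name seen,
    so pass higher-trust sources (e.g. Wikipedia) first."""
    seen = {}
    for source in sources:
        for scientific, common in source:
            key = scientific.strip().lower()
            if key not in seen:
                seen[key] = (scientific.strip(), common)
    return sorted(seen.values())
```

The genus crawl then reduces to feeding each genus's species list through the same function.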

Key Genera (prioritize these — cover 80% of common houseplants):

Philodendron, Monstera, Pothos/Epipremnum, Ficus, Dracaena,
Sansevieria, Calathea, Maranta, Alocasia, Anthurium,
Peperomia, Hoya, Begonia, Tradescantia, Pilea,
Aglaonema, Dieffenbachia, Spathiphyllum, Zamioculcas, Crassula

Data Quality Requirements

| Parameter | Minimum | Target | Rationale |
|---|---|---|---|
| Images per species | 100 | 300-500 | Below 100 = unreliable classification |
| Resolution | 256x256 | 512x512+ | Downsample to 224x224 for training |
| Variety | Single angle | Multi-angle, growth stages, lighting | Generalization |
| Label accuracy | 80% | 95%+ | iNaturalist "Research Grade" = community verified |
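A rough quality gate for the minimum-resolution and sharpness requirements might look like this. The variance-of-Laplacian is a standard blur heuristic; the threshold of 100 is an assumed starting point to tune against real houseplant photos:

```python
import numpy as np

MIN_SIDE = 256          # minimum resolution from the table above
BLUR_THRESHOLD = 100.0  # assumed cutoff; tune on real data

def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of the 4-neighbor Laplacian; low values suggest blur."""
    lap = (-4 * gray
           + np.roll(gray, 1, axis=0) + np.roll(gray, -1, axis=0)
           + np.roll(gray, 1, axis=1) + np.roll(gray, -1, axis=1))
    return float(lap.var())

def passes_quality(gray: np.ndarray) -> bool:
    """Reject images that are too small or too blurry."""
    h, w = gray.shape
    if min(h, w) < MIN_SIDE:
        return False
    return laplacian_variance(gray) >= BLUR_THRESHOLD
```

In the real pipeline this would run on a grayscale decode of each download, before the image is filed under its species folder.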

Training Approach Options

Option A: Create ML

| Pros | Cons |
|---|---|
| Native Apple Silicon optimization | Limited hyperparameter control |
| Outputs CoreML directly | Practical limit of ~10K classes |
| No Python/ML expertise needed | Less flexible augmentation |
| Fast iteration | |

Best for: This use case exactly.

Option B: PyTorch + MPS Transfer Learning

| Pros | Cons |
|---|---|
| Full control over architecture | Steeper learning curve |
| State-of-the-art augmentation (albumentations) | Manual CoreML conversion |
| Can use EfficientNet, ConvNeXt, etc. | Slower iteration |

Best for: If Create ML hits limits or you need custom architecture.

Option C: Cloud GPU (Google Colab / AWS Spot)

| Pros | Cons |
|---|---|
| Faster training for large models | Cost |
| No local resource constraints | Network transfer overhead |

Best for: If dataset exceeds M1 Max memory or you want transformer-based vision models.

Recommendation: Start with Create ML. Pivot to Option B only if needed.
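Whichever option wins, Create ML's image classifier template consumes a folder-per-class layout, so a train/validation split can be prepared on the Unraid side. A sketch under that assumption (paths, split fraction, and function name are illustrative):

```python
import random
import shutil
from pathlib import Path

def split_dataset(src: Path, dst: Path, valid_frac: float = 0.1, seed: int = 42):
    """Copy /species_name/*.jpg class folders into dst/train and
    dst/valid trees, holding out valid_frac of each species."""
    rng = random.Random(seed)  # fixed seed -> reproducible split
    for class_dir in sorted(p for p in src.iterdir() if p.is_dir()):
        images = sorted(class_dir.glob("*.jpg"))
        rng.shuffle(images)
        n_valid = max(1, int(len(images) * valid_frac))
        for split, subset in (("valid", images[:n_valid]),
                              ("train", images[n_valid:])):
            out = dst / split / class_dir.name
            out.mkdir(parents=True, exist_ok=True)
            for img in subset:
                shutil.copy2(img, out / img.name)
```

Holding the validation set out explicitly (rather than relying on an automatic split) makes the step-6 numbers comparable across training runs.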


Pipeline Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     UNRAID SERVER                                │
├─────────────────────────────────────────────────────────────────┤
│  1. Species List Generator                                       │
│     └─ Scrape Wikipedia, RHS, expand by genus                   │
│                                                                  │
│  2. Image Downloader                                             │
│     ├─ iNaturalist/GBIF bulk export (primary)                   │
│     ├─ Flickr API (supplemental)                                │
│     └─ License filter (CC-BY, CC0 only)                         │
│                                                                  │
│  3. Preprocessing Pipeline                                       │
│     ├─ Resize to 512x512                                        │
│     ├─ Remove duplicates (perceptual hash)                      │
│     ├─ Remove low-quality (blur detection, size filter)         │
│     └─ Organize: /species_name/image_001.jpg                    │
│                                                                  │
│  4. Dataset Statistics                                           │
│     └─ Report per-species counts, flag under-represented        │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼ (rsync/SMB)
┌─────────────────────────────────────────────────────────────────┐
│                      M1 MAX MAC                                  │
├─────────────────────────────────────────────────────────────────┤
│  5. Create ML Training                                           │
│     ├─ Import dataset folder                                    │
│     ├─ Train image classifier                                   │
│     └─ Export .mlmodel                                          │
│                                                                  │
│  6. Validation                                                   │
│     ├─ Test on held-out images                                  │
│     └─ Test on real-world photos (your phone)                   │
│                                                                  │
│  7. Integration                                                  │
│     └─ Replace Pl@ntNet-300K in PlantGuide                     │
└─────────────────────────────────────────────────────────────────┘
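Step 3's perceptual-hash deduplication can be approximated with an average hash. A dependency-free sketch (a production pipeline would more likely use the imagehash library with Pillow resizing; this version assumes image dimensions are croppable to multiples of the hash size):

```python
import numpy as np

def average_hash(gray: np.ndarray, hash_size: int = 8) -> int:
    """Mean-pool the image to hash_size x hash_size, threshold at the
    mean, and pack the bits into an integer fingerprint."""
    h, w = gray.shape
    cropped = gray[: h - h % hash_size, : w - w % hash_size]
    pooled = cropped.reshape(hash_size, cropped.shape[0] // hash_size,
                             hash_size, cropped.shape[1] // hash_size
                             ).mean(axis=(1, 3))
    bits = (pooled > pooled.mean()).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)

def is_duplicate(h1: int, h2: int, max_dist: int = 5) -> bool:
    """Near-duplicate if the hashes differ in at most max_dist bits."""
    return bin(h1 ^ h2).count("1") <= max_dist
```

Because the hash is brightness-shift invariant (thresholding at the mean), mild re-exposures of the same photo collapse to the same fingerprint, which is exactly the re-upload case Flickr and iNaturalist produce.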

Timeline

| Phase | Duration | Output |
|---|---|---|
| 1. Species list curation | 1 week | 1,000-3,000 target species with scientific + common names |
| 2. Pipeline development | 1-2 weeks | Automated scraper on Unraid |
| 3. Data collection | 2-4 weeks | Running 24/7, rate-limited by APIs |
| 4. Preprocessing + QA | 1 week | Clean dataset, statistics report |
| 5. Initial training | 2-3 days | First model with subset (500 species) |
| 6. Full training | 1 week | Full model, iteration |
| 7. Validation + tuning | 1 week | Production-ready model |

Total: 6-10 weeks


Risk Analysis

| Risk | Likelihood | Mitigation |
|---|---|---|
| Insufficient images for rare species | High | Accept lower coverage OR merge rare species to genus level |
| API rate limits slow collection | High | Parallelize sources, use bulk exports, patience |
| Noisy labels degrade accuracy | Medium | Use only "Research Grade" iNaturalist records; implement confidence thresholds |
| Create ML memory limits | Low | M1 Max should handle it; fall back to PyTorch |
| License ambiguity | Low | Strict filter at download; keep metadata |

Next Steps

  1. Build species master list — Python script to scrape/merge sources
  2. Set up GBIF bulk download — Filter: Plantae, captive/cultivated, CC-BY/CC0, has images
  3. Build Flickr supplemental scraper — Target under-represented species
  4. Docker container on Unraid — Orchestrate pipeline
  5. Create ML project setup — Folder structure, initial test with 50 species
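For step 2, a per-species query against the public GBIF v1 occurrence search API might be assembled like this. Parameter names and license enum values follow GBIF's documented API; note that captive/cultivated is not a search parameter here, so that filter likely has to be applied afterwards from the returned record metadata:

```python
from urllib.parse import urlencode

GBIF_API = "https://api.gbif.org/v1/occurrence/search"

def gbif_image_query(scientific_name: str, limit: int = 300) -> str:
    """Build a GBIF occurrence-search URL restricted to records that
    have still images under commercial-safe licenses."""
    params = [
        ("scientificName", scientific_name),
        ("mediaType", "StillImage"),   # only records with images
        ("license", "CC0_1_0"),        # commercial-safe licenses only
        ("license", "CC_BY_4_0"),
        ("limit", str(limit)),         # 300 is GBIF's page-size maximum
    ]
    return f"{GBIF_API}?{urlencode(params)}"
```

The downloader would page through results with the `offset` parameter and pull image URLs from each record's `media` array.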

Open Questions

  • Prioritize speed (start with 500 species, fast iteration) or completeness (build full 3K species list first)?
  • Any specific houseplant species that must be included?
  • Docker running on Unraid already?