# Houseplant Image Dataset Accumulation Plan

## Overview

Build a custom CoreML model for houseplant identification by accumulating a large dataset of houseplant images with proper licensing for commercial use.

---

## Requirements Summary

| Parameter | Value |
|-----------|-------|
| Target species | 5,000-10,000 (realistic houseplant ceiling) |
| Images per species | 200-500 (recommended) |
| Total images | ~1-5 million |
| Budget | Free preferred, paid as reference |
| Compute | M1 Max Mac (training) + Unraid server (data pipeline) |
| Curation | Automated pipeline |
| Timeline | Weeks to months |
| Licensing | Must allow training + commercial model distribution |

---

## Hardware Assessment

| Machine | Role | Capability |
|---------|------|------------|
| M1 Max Mac | **Training** | Create ML can train 5-10K-class models; 32+ GB unified memory is ideal |
| Unraid Server | **Data pipeline** | Scraping, downloading, preprocessing, storage |

The M1 Max is legitimately viable for this task via Create ML or PyTorch + MPS. No cloud GPU is required.
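The totals in the requirements table follow directly from species count times images per species. A quick sizing sketch in Python; note that the 150 KB-per-image average for a 512x512 JPEG is an assumption for illustration, not a measured figure:

```python
def dataset_size(species: int, images_per_species: int,
                 avg_kb_per_image: int = 150) -> tuple[int, float]:
    """Estimate image count and on-disk size for the planned dataset.

    avg_kb_per_image is a rough guess for a 512x512 JPEG; measure a
    sample batch before provisioning Unraid storage.
    """
    total_images = species * images_per_species
    storage_gb = total_images * avg_kb_per_image / 1_048_576  # KB -> GB
    return total_images, storage_gb

# Realistic mid-range scenario: 3,000 species x 300 images
images, gb = dataset_size(3_000, 300)
print(f"{images:,} images, ~{gb:.0f} GB")  # 900,000 images, ~129 GB
```

At the upper bound (10,000 species x 500 images) the same arithmetic gives ~5 million images, matching the "~1-5 million" figure above.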
---

## Data Sources Analysis

### Tier 1: Primary Sources (Recommended)

| Source | License | Commercial-Safe | Volume | Houseplant Coverage | Access Method |
|--------|---------|-----------------|--------|---------------------|---------------|
| **iNaturalist via GBIF** | CC-BY, CC0 (filter) | Yes (filtered) | 100M+ observations | Good (has "captive/cultivated" flag) | Bulk export + API |
| **Flickr** | CC-BY, CC0 (filter) | Yes (filtered) | Millions | Moderate | API |
| **Wikimedia Commons** | CC-BY, CC-BY-SA, Public Domain | Mostly | Thousands | Moderate | API |

### Tier 2: Supplemental Sources

| Source | License | Commercial-Safe | Notes |
|--------|---------|-----------------|-------|
| **USDA PLANTS** | Public Domain | Yes | US-focused, limited images |
| **Encyclopedia of Life** | Mixed | Check each | Aggregator, good metadata |
| **Pl@ntNet-300K Dataset** | CC-BY-SA | Share-alike | Good for research/prototyping only |

### Tier 3: Paid Options (Reference)

| Source | Estimated Cost | Notes |
|--------|----------------|-------|
| iNaturalist AWS Open Data | Free | Bulk image export; S3 transfer costs apply |
| Custom scraping infrastructure | $50-200/mo | Proxies, storage, bandwidth |
| Commercial botanical databases | $1000s+ | Getty, Alamy — not recommended |

---

## Licensing Decision Matrix

```
Want commercial model distribution?
├─ YES → Use ONLY: CC0, CC-BY, Public Domain
│        Filter OUT: CC-BY-NC, CC-BY-SA, All Rights Reserved
│
└─ NO (research only) → Can use CC-BY-NC, CC-BY-SA
                        Pl@ntNet-300K dataset becomes viable
```

**Recommendation**: Filter for commercial-safe licenses from day 1. This avoids re-scraping later.

---

## Houseplant Species Taxonomy

**Problem**: No canonical "houseplant" species list exists. One must be constructed.

**Approach**:

1. Start with Wikipedia's "List of houseplants" (~500 species)
2. Expand via genus crawl (all Philodendron, all Hoya, etc.)
3. Cross-reference with RHS, ASPCA, and nursery catalogs
4. Target **1,000-3,000 species**, which is realistic for a quality dataset

**Key Genera** (prioritize these — cover ~80% of common houseplants):

```
Philodendron, Monstera, Pothos/Epipremnum, Ficus, Dracaena,
Sansevieria, Calathea, Maranta, Alocasia, Anthurium,
Peperomia, Hoya, Begonia, Tradescantia, Pilea,
Aglaonema, Dieffenbachia, Spathiphyllum, Zamioculcas, Crassula
```

---

## Data Quality Requirements

| Parameter | Minimum | Target | Rationale |
|-----------|---------|--------|-----------|
| Images per species | 100 | 300-500 | Below 100 = unreliable classification |
| Resolution | 256x256 | 512x512+ | Downsample to 224x224 for training |
| Variety | Single angle | Multi-angle, growth stages, lighting | Generalization |
| Label accuracy | 80% | 95%+ | iNaturalist "Research Grade" = community verified |

---

## Training Approach Options

### Option A: Create ML (Recommended)

| Pros | Cons |
|------|------|
| Native Apple Silicon optimization | Limited hyperparameter control |
| Outputs CoreML directly | Max ~10K classes practical limit |
| No Python/ML expertise needed | Less flexible augmentation |
| Fast iteration | |

**Best for**: This use case exactly.

### Option B: PyTorch + MPS Transfer Learning

| Pros | Cons |
|------|------|
| Full control over architecture | Steeper learning curve |
| State-of-the-art augmentation (albumentations) | Manual CoreML conversion |
| Can use EfficientNet, ConvNeXt, etc. | Slower iteration |

**Best for**: If Create ML hits limits or you need a custom architecture.

### Option C: Cloud GPU (Google Colab / AWS Spot)

| Pros | Cons |
|------|------|
| Faster training for large models | Cost |
| No local resource constraints | Network transfer overhead |

**Best for**: If the dataset exceeds M1 Max memory or you want transformer-based vision models.

**Recommendation**: Start with Create ML. Pivot to Option B only if needed.
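The licensing decision matrix above can be enforced mechanically at download time. A minimal filter sketch, assuming license values arrive as the Creative Commons URL strings commonly seen in GBIF/iNaturalist metadata; anything unrecognized is rejected (fail closed):

```python
# Commercial-safe license filter: accept only CC0, CC-BY, and public
# domain. Assumes Creative Commons URL-style license strings, as
# commonly supplied in GBIF/iNaturalist metadata (an assumption —
# verify against the actual export fields); unknown values fail closed.

BLOCKED_TOKENS = ("-nc", "-sa", "-nd")   # non-commercial, share-alike, no-derivatives
ALLOWED_TOKENS = ("publicdomain", "/zero/", "licenses/by/")

def is_commercial_safe(license_str: str) -> bool:
    s = license_str.strip().lower()
    if any(tok in s for tok in BLOCKED_TOKENS):
        return False
    return any(tok in s for tok in ALLOWED_TOKENS)

assert is_commercial_safe("http://creativecommons.org/licenses/by/4.0/")
assert is_commercial_safe("http://creativecommons.org/publicdomain/zero/1.0/")
assert not is_commercial_safe("http://creativecommons.org/licenses/by-nc/4.0/")
assert not is_commercial_safe("All Rights Reserved")
```

Failing closed means some usable images with nonstandard license strings will be dropped, which is the right trade-off when the goal is commercial distribution. Keep the raw license metadata alongside each image so decisions can be audited later.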
---

## Pipeline Architecture

```
┌────────────────────────────────────────────────────────────┐
│                       UNRAID SERVER                        │
├────────────────────────────────────────────────────────────┤
│ 1. Species List Generator                                  │
│    └─ Scrape Wikipedia, RHS; expand by genus               │
│                                                            │
│ 2. Image Downloader                                        │
│    ├─ iNaturalist/GBIF bulk export (primary)               │
│    ├─ Flickr API (supplemental)                            │
│    └─ License filter (CC-BY, CC0 only)                     │
│                                                            │
│ 3. Preprocessing Pipeline                                  │
│    ├─ Resize to 512x512                                    │
│    ├─ Remove duplicates (perceptual hash)                  │
│    ├─ Remove low-quality (blur detection, size filter)     │
│    └─ Organize: /species_name/image_001.jpg                │
│                                                            │
│ 4. Dataset Statistics                                      │
│    └─ Report per-species counts, flag under-represented    │
└────────────────────────────────────────────────────────────┘
                              │
                              ▼ (rsync/SMB)
┌────────────────────────────────────────────────────────────┐
│                        M1 MAX MAC                          │
├────────────────────────────────────────────────────────────┤
│ 5. Create ML Training                                      │
│    ├─ Import dataset folder                                │
│    ├─ Train image classifier                               │
│    └─ Export .mlmodel                                      │
│                                                            │
│ 6. Validation                                              │
│    ├─ Test on held-out images                              │
│    └─ Test on real-world photos (your phone)               │
│                                                            │
│ 7. Integration                                             │
│    └─ Replace PlantNet-300K in PlantGuide                  │
└────────────────────────────────────────────────────────────┘
```

---

## Timeline

| Phase | Duration | Output |
|-------|----------|--------|
| 1. Species list curation | 1 week | 1,000-3,000 target species with scientific + common names |
| 2. Pipeline development | 1-2 weeks | Automated scraper on Unraid |
| 3. Data collection | 2-4 weeks | Running 24/7, rate-limited by APIs |
| 4. Preprocessing + QA | 1 week | Clean dataset, statistics report |
| 5. Initial training | 2-3 days | First model with subset (500 species) |
| 6. Full training | 1 week | Full model, iteration |
| 7. Validation + tuning | 1 week | Production-ready model |

**Total: 6-10 weeks**

---

## Risk Analysis

| Risk | Likelihood | Mitigation |
|------|------------|------------|
| Insufficient images for rare species | High | Accept lower coverage OR merge rare species to genus level |
| API rate limits slow collection | High | Parallelize sources, use bulk exports, patience |
| Noisy labels degrade accuracy | Medium | Use only "Research Grade" iNaturalist observations; apply confidence thresholds |
| Create ML memory limits | Low | M1 Max should handle it; fall back to PyTorch |
| License ambiguity | Low | Filter strictly at download time; keep metadata |

---

## Next Steps

1. **Build species master list** — Python script to scrape/merge sources
2. **Set up GBIF bulk download** — Filter: Plantae, captive/cultivated, CC-BY/CC0, has images
3. **Build Flickr supplemental scraper** — Target under-represented species
4. **Docker container on Unraid** — Orchestrate pipeline
5. **Create ML project setup** — Folder structure, initial test with 50 species

---

## Open Questions

- Prioritize **speed** (start with 500 species, fast iteration) or **completeness** (build the full 3K-species list first)?
- Are there specific houseplant species that must be included?
- Is Docker already running on Unraid?
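The duplicate-removal step in the preprocessing pipeline (perceptual hash) can be prototyped without heavy dependencies. The sketch below implements a difference hash (dHash) over an already-resized grayscale grid; a real pipeline would do the resizing with Pillow (or use the `imagehash` package directly), but the hashing and near-duplicate comparison logic are the same. All names here are illustrative.

```python
# dHash sketch for the pipeline's duplicate-removal step. A production
# version would resize each image to a 9x8 grayscale grid with Pillow;
# here the resized grid is passed in directly to stay dependency-free.

def dhash(pixels: list[list[int]]) -> int:
    """Hash an h x (w+1) grayscale grid by comparing horizontal neighbors.

    Each pixel pair contributes one bit (1 if brightness increases
    left-to-right), so visually similar images yield similar hashes.
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (left < right)
    return bits

def is_near_duplicate(h1: int, h2: int, max_distance: int = 4) -> bool:
    """Hamming distance between two hashes; a small distance means the
    images are likely crops/re-encodes of the same photo."""
    return bin(h1 ^ h2).count("1") <= max_distance

# Identical grids always hash identically, so exact re-uploads dedupe
# for free; the Hamming threshold catches minor re-encodes.
grid = [[10, 200, 30], [40, 40, 90]]
assert dhash(grid) == dhash([row[:] for row in grid])
```

The `max_distance` threshold of 4 (on a 64-bit hash from a standard 9x8 grid) is a conventional starting point, not a tuned value; calibrate it on a labeled sample of known duplicates before running it over the full collection.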