# Houseplant Image Dataset Accumulation Plan

## Overview
Build a custom CoreML model for houseplant identification by accumulating a large dataset of houseplant images with proper licensing for commercial use.
## Requirements Summary
| Parameter | Value |
|---|---|
| Target species | 5,000-10,000 (realistic houseplant ceiling) |
| Images per species | 200-500 (recommended) |
| Total images | ~1-5 million |
| Budget | Free preferred, paid as reference |
| Compute | M1 Max Mac (training) + Unraid server (data pipeline) |
| Curation | Automated pipeline |
| Timeline | Weeks-months |
| Licensing | Must allow training + commercial model distribution |
## Hardware Assessment
| Machine | Role | Capability |
|---|---|---|
| M1 Max Mac | Training | Create ML can train 5-10K class models; 32+ GB unified memory is ideal |
| Unraid Server | Data pipeline | Scraping, downloading, preprocessing, storage |
M1 Max is legitimately viable for this task via Create ML or PyTorch+MPS. No cloud GPU required.
## Data Sources Analysis

### Tier 1: Primary Sources (Recommended)
| Source | License | Commercial-Safe | Volume | Houseplant Coverage | Access Method |
|---|---|---|---|---|---|
| iNaturalist via GBIF | CC-BY, CC0 (filter) | Yes (filtered) | 100M+ observations | Good (has "captive/cultivated" flag) | Bulk export + API |
| Flickr | CC-BY, CC0 (filter) | Yes (filtered) | Millions | Moderate | API |
| Wikimedia Commons | CC-BY, CC-BY-SA, Public Domain | Mostly | Thousands | Moderate | API |
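Since iNaturalist via GBIF is the primary source, a minimal sketch of the per-species query, assuming the public GBIF v1 occurrence-search endpoint. The license enum values (`CC0_1_0`, `CC_BY_4_0`) and the 300-record page cap should be verified against the GBIF API docs before a long run:

```python
# Sketch: build a GBIF occurrence-search URL filtered to image-bearing,
# commercially licensed records for one species. Endpoint and parameter
# names are from the public GBIF v1 API; verify license enum values.
from urllib.parse import urlencode

GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"
SAFE_LICENSES = ["CC0_1_0", "CC_BY_4_0"]  # commercial-safe only


def build_search_url(scientific_name: str, offset: int = 0, limit: int = 300) -> str:
    """One page of image-bearing, commercially licensed occurrences."""
    params = [
        ("scientificName", scientific_name),
        ("mediaType", "StillImage"),
        ("offset", offset),
        ("limit", limit),  # GBIF caps page size at 300
    ] + [("license", lic) for lic in SAFE_LICENSES]
    return f"{GBIF_SEARCH}?{urlencode(params)}"
```

Paginate by bumping `offset`; at million-image scale, prefer the GBIF asynchronous download API or the AWS Open Data bucket over paging the search endpoint.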
### Tier 2: Supplemental Sources
| Source | License | Commercial-Safe | Notes |
|---|---|---|---|
| USDA PLANTS | Public Domain | Yes | US-focused, limited images |
| Encyclopedia of Life | Mixed | Check each | Aggregator, good metadata |
| Pl@ntNet-300K Dataset | CC-BY-SA | Share-alike | Good for research/prototyping only |
### Tier 3: Paid Options (Reference)
| Source | Estimated Cost | Notes |
|---|---|---|
| iNaturalist AWS Open Data | Free (S3 transfer fees may apply) | Bulk image export via the AWS Open Data program |
| Custom scraping infrastructure | $50-200/mo | Proxies, storage, bandwidth |
| Commercial botanical databases | $1000s+ | Getty, Alamy — not recommended |
## Licensing Decision Matrix

```
Want commercial model distribution?
├─ YES → Use ONLY: CC0, CC-BY, Public Domain
│        Filter OUT: CC-BY-NC, CC-BY-SA, All Rights Reserved
│
└─ NO (research only) → Can use CC-BY-NC, CC-BY-SA
         Pl@ntNet-300K dataset becomes viable
```
Recommendation: Filter for commercial-safe licenses from day 1. Avoids re-scraping later.
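The day-1 license gate might look like the sketch below. The token rules are illustrative — license metadata formats differ between GBIF, Flickr, and Wikimedia — and anything unrecognized fails closed:

```python
# Sketch: normalize a license string and keep only commercial-safe
# families (CC0, CC-BY, public domain). Unknown licenses are rejected.
import re

DENY_TOKENS = {"nc", "sa", "nd"}  # NonCommercial, ShareAlike, NoDerivs


def is_commercial_safe(license_str: str) -> bool:
    """True only for CC0 / CC-BY / public-domain style strings."""
    tokens = [t for t in re.split(r"[^a-z0-9]+", license_str.lower()) if t]
    if any(t in DENY_TOKENS for t in tokens):
        return False
    if "cc0" in tokens or ("cc" in tokens and "by" in tokens):
        return True
    # Catch "Public Domain", "Public Domain Mark", "PDM" variants
    return "publicdomain" in "".join(tokens) or "pdm" in tokens
```

Run this at download time and store the raw license string alongside each image so the decision is auditable later.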
## Houseplant Species Taxonomy
Problem: No canonical "houseplant" species list exists. Must construct one.
Approach:
- Start with Wikipedia "List of houseplants" (~500 species)
- Expand via genus crawl (all Philodendron, all Hoya, etc.)
- Cross-reference with RHS, ASPCA, nursery catalogs
- Target: 1,000-3,000 species is realistic for quality dataset
Key Genera (prioritize these — they cover roughly 80% of common houseplants):
Philodendron, Monstera, Pothos/Epipremnum, Ficus, Dracaena,
Sansevieria, Calathea, Maranta, Alocasia, Anthurium,
Peperomia, Hoya, Begonia, Tradescantia, Pilea,
Aglaonema, Dieffenbachia, Spathiphyllum, Zamioculcas, Crassula
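The merge step of the list-building approach can be sketched as follows; the per-source scrapers (Wikipedia, RHS, nursery catalogs) are assumed to yield `(scientific_name, common_name)` pairs, and the names below are illustrative:

```python
# Sketch: merge per-source species lists into one master list, keyed by
# a normalized "Genus species" scientific name so spelling-case
# variants collapse into one entry.

def normalize_name(name: str) -> str:
    """Collapse whitespace and standardize 'Genus species' casing."""
    parts = name.split()
    if not parts:
        return ""
    return " ".join([parts[0].capitalize()] + [p.lower() for p in parts[1:]])


def merge_species_lists(*sources):
    """Union of sources; the first source to supply a common name wins."""
    master = {}
    for source in sources:
        for scientific, common in source:
            key = normalize_name(scientific)
            if key and key not in master:
                master[key] = common
    return master
```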
## Data Quality Requirements
| Parameter | Minimum | Target | Rationale |
|---|---|---|---|
| Images per species | 100 | 300-500 | Below 100 = unreliable classification |
| Resolution | 256x256 | 512x512+ | Downsample to 224x224 for training |
| Variety | Single angle | Multi-angle, growth stages, lighting | Generalization |
| Label accuracy | 80% | 95%+ | iNaturalist "Research Grade" = community verified |
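The low-quality filter in the table can use variance of the Laplacian as a sharpness proxy (low variance = few edges = likely blurry). A library-free sketch of the measure on a plain 2-D grayscale list; the real pipeline would compute the same thing on decoded images, e.g. via OpenCV's `cv2.Laplacian`:

```python
# Sketch: blur detection via variance of the 4-neighbour Laplacian.
# The threshold is a placeholder to tune against your own data.

def laplacian_variance(gray):
    """gray: 2-D list of pixel intensities; variance over interior pixels."""
    h, w = len(gray), len(gray[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            responses.append(lap)
    if not responses:
        return 0.0
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def is_sharp(gray, threshold=100.0):
    return laplacian_variance(gray) >= threshold
```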
## Training Approach Options

### Option A: Create ML (Recommended)
| Pros | Cons |
|---|---|
| Native Apple Silicon optimization | Limited hyperparameter control |
| Outputs CoreML directly | Max ~10K classes practical limit |
| No Python/ML expertise needed | Less flexible augmentation |
| Fast iteration | |
Best for: This use case exactly.
### Option B: PyTorch + MPS Transfer Learning
| Pros | Cons |
|---|---|
| Full control over architecture | Steeper learning curve |
| State-of-art augmentation (albumentations) | Manual CoreML conversion |
| Can use EfficientNet, ConvNeXt, etc. | Slower iteration |
Best for: If Create ML hits limits or you need custom architecture.
### Option C: Cloud GPU (Google Colab / AWS Spot)
| Pros | Cons |
|---|---|
| Faster training for large models | Cost |
| No local resource constraints | Network transfer overhead |
Best for: If dataset exceeds M1 Max memory or you want transformer-based vision models.
Recommendation: Start with Create ML. Pivot to Option B only if needed.
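Create ML consumes a folder-per-class layout and can auto-split validation data, but a fixed, deterministic held-out set makes successive training runs comparable. A sketch, assuming the preprocessing output layout `/species_name/image_001.jpg`:

```python
# Sketch: deterministic ~10% validation split by hashing filenames, so
# the same image always lands in the same bucket across re-runs.
import hashlib


def split_bucket(filename: str, val_fraction: float = 0.1) -> str:
    """Return 'val' or 'train' for a given image filename."""
    digest = hashlib.sha256(filename.encode()).digest()
    return "val" if digest[0] / 256 < val_fraction else "train"


def split_dataset(files_by_species):
    """files_by_species: {species: [filenames]} -> {'train': ..., 'val': ...}."""
    out = {"train": {}, "val": {}}
    for species, files in files_by_species.items():
        for f in files:
            out[split_bucket(f)].setdefault(species, []).append(f)
    return out
```

Copy each bucket into its own folder tree before importing the training folder into Create ML.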
## Pipeline Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                          UNRAID SERVER                          │
├─────────────────────────────────────────────────────────────────┤
│ 1. Species List Generator                                       │
│    └─ Scrape Wikipedia, RHS, expand by genus                    │
│                                                                 │
│ 2. Image Downloader                                             │
│    ├─ iNaturalist/GBIF bulk export (primary)                    │
│    ├─ Flickr API (supplemental)                                 │
│    └─ License filter (CC-BY, CC0 only)                          │
│                                                                 │
│ 3. Preprocessing Pipeline                                       │
│    ├─ Resize to 512x512                                         │
│    ├─ Remove duplicates (perceptual hash)                       │
│    ├─ Remove low-quality (blur detection, size filter)          │
│    └─ Organize: /species_name/image_001.jpg                     │
│                                                                 │
│ 4. Dataset Statistics                                           │
│    └─ Report per-species counts, flag under-represented         │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼ (rsync/SMB)
┌─────────────────────────────────────────────────────────────────┐
│                           M1 MAX MAC                            │
├─────────────────────────────────────────────────────────────────┤
│ 5. Create ML Training                                           │
│    ├─ Import dataset folder                                     │
│    ├─ Train image classifier                                    │
│    └─ Export .mlmodel                                           │
│                                                                 │
│ 6. Validation                                                   │
│    ├─ Test on held-out images                                   │
│    └─ Test on real-world photos (your phone)                    │
│                                                                 │
│ 7. Integration                                                  │
│    └─ Replace Pl@ntNet-300K in PlantGuide                       │
└─────────────────────────────────────────────────────────────────┘
```
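The duplicate-removal step (stage 3) typically uses a perceptual hash; in practice you would use the `imagehash` package on PIL images, but the mechanism fits in a few lines. A pure-Python average-hash ("aHash") sketch, operating on an already-downscaled 8x8 grayscale grid:

```python
# Sketch: 64-bit average-hash fingerprints plus a greedy near-duplicate
# filter. The Hamming-distance threshold is a placeholder to tune.

def average_hash(gray8x8):
    """64-bit fingerprint: 1 where a pixel is above the grid mean."""
    pixels = [p for row in gray8x8 for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(h1, h2):
    return bin(h1 ^ h2).count("1")


def dedupe(hashes, max_distance=5):
    """Indexes of images whose hash is not near an earlier keeper."""
    kept = []
    for i, h in enumerate(hashes):
        if all(hamming(h, hashes[j]) > max_distance for j in kept):
            kept.append(i)
    return kept
```

At million-image scale, replace the pairwise loop with bucketing on hash prefixes so each new image is only compared against its bucket.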
## Timeline
| Phase | Duration | Output |
|---|---|---|
| 1. Species list curation | 1 week | 1,000-3,000 target species with scientific + common names |
| 2. Pipeline development | 1-2 weeks | Automated scraper on Unraid |
| 3. Data collection | 2-4 weeks | Running 24/7, rate-limited by APIs |
| 4. Preprocessing + QA | 1 week | Clean dataset, statistics report |
| 5. Initial training | 2-3 days | First model with subset (500 species) |
| 6. Full training | 1 week | Full model, iteration |
| 7. Validation + tuning | 1 week | Production-ready model |
Total: 6-10 weeks
## Risk Analysis
| Risk | Likelihood | Mitigation |
|---|---|---|
| Insufficient images for rare species | High | Accept lower coverage OR merge to genus-level for rare species |
| API rate limits slow collection | High | Parallelize sources, use bulk exports, patience |
| Noisy labels degrade accuracy | Medium | Use only "Research Grade" iNaturalist, implement confidence thresholds |
| Create ML memory limits | Low | M1 Max should handle; fallback to PyTorch |
| License ambiguity | Low | Strict filter on download, keep metadata |
## Next Steps
- Build species master list — Python script to scrape/merge sources
- Set up GBIF bulk download — Filter: Plantae, captive/cultivated, CC-BY/CC0, has images
- Build Flickr supplemental scraper — Target under-represented species
- Docker container on Unraid — Orchestrate pipeline
- Create ML project setup — Folder structure, initial test with 50 species
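For the Flickr supplemental scraper, a query-building sketch against `flickr.photos.search` with its numeric license filter. License codes 4 (CC BY 2.0), 9 (CC0), and 10 (Public Domain Mark) are believed commercial-safe, but confirm them via `flickr.photos.licenses.getInfo` before a long run; `API_KEY` is a placeholder:

```python
# Sketch: build a Flickr REST search URL restricted to commercial-safe
# license codes for one under-represented species.
from urllib.parse import urlencode

FLICKR_REST = "https://www.flickr.com/services/rest/"
SAFE_LICENSE_CODES = "4,9,10"  # CC BY 2.0, CC0, Public Domain Mark


def build_flickr_search(api_key: str, species: str, page: int = 1) -> str:
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "text": species,
        "license": SAFE_LICENSE_CODES,
        "content_type": 1,   # photos only
        "per_page": 500,     # Flickr's maximum page size
        "page": page,
        "format": "json",
        "nojsoncallback": 1,
    }
    return f"{FLICKR_REST}?{urlencode(params)}"
```

Each returned photo record still needs its license field checked on download, since search-time filters and per-photo metadata can drift.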
## Open Questions
- Prioritize speed (start with 500 species, fast iteration) or completeness (build full 3K species list first)?
- Any specific houseplant species that must be included?
- Docker running on Unraid already?