# Houseplant Image Dataset Accumulation Plan

## Overview
Build a custom CoreML model for houseplant identification by accumulating a large dataset of houseplant images with proper licensing for commercial use.
## Requirements Summary
| Parameter | Value |
|---|---|
| Target species | 5,000-10,000 (realistic houseplant ceiling) |
| Images per species | 200-500 (recommended) |
| Total images | ~1-5 million |
| Budget | Free preferred, paid as reference |
| Compute | M1 Max Mac (training) + Unraid server (data pipeline) |
| Curation | Automated pipeline |
| Timeline | Weeks-months |
| Licensing | Must allow training + commercial model distribution |
## Hardware Assessment
| Machine | Role | Capability |
|---|---|---|
| M1 Max Mac | Training | Create ML can train 5-10K class models; 32+ GB unified memory is ideal |
| Unraid Server | Data pipeline | Scraping, downloading, preprocessing, storage |
M1 Max is legitimately viable for this task via Create ML or PyTorch+MPS. No cloud GPU required.
## Data Sources Analysis

### Tier 1: Primary Sources (Recommended)
| Source | License | Commercial-Safe | Volume | Houseplant Coverage | Access Method |
|---|---|---|---|---|---|
| iNaturalist via GBIF | CC-BY, CC0 (filter) | Yes (filtered) | 100M+ observations | Good (has "captive/cultivated" flag) | Bulk export + API |
| Flickr | CC-BY, CC0 (filter) | Yes (filtered) | Millions | Moderate | API |
| Wikimedia Commons | CC-BY, CC-BY-SA, Public Domain | Mostly | Thousands | Moderate | API |
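Since iNaturalist via GBIF is the primary source, a minimal sketch of the per-species query, assuming the public GBIF v1 occurrence-search endpoint. The license enum values (`CC0_1_0`, `CC_BY_4_0`) and the 300-record page cap should be verified against the GBIF API docs before a long run:

```python
# Sketch: build a GBIF occurrence-search URL filtered to image-bearing,
# commercially licensed records for one species. Endpoint and parameter
# names are from the public GBIF v1 API; verify license enum values.
from urllib.parse import urlencode

GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"
SAFE_LICENSES = ["CC0_1_0", "CC_BY_4_0"]  # commercial-safe only


def build_search_url(scientific_name: str, offset: int = 0, limit: int = 300) -> str:
    """One page of image-bearing, commercially licensed occurrences."""
    params = [
        ("scientificName", scientific_name),
        ("mediaType", "StillImage"),
        ("offset", offset),
        ("limit", limit),  # GBIF caps page size at 300
    ] + [("license", lic) for lic in SAFE_LICENSES]
    return f"{GBIF_SEARCH}?{urlencode(params)}"
```

Paginate by bumping `offset`; at million-image scale, prefer the GBIF asynchronous download API or the AWS Open Data bucket over paging the search endpoint.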
### Tier 2: Supplemental Sources
| Source | License | Commercial-Safe | Notes |
|---|---|---|---|
| USDA PLANTS | Public Domain | Yes | US-focused, limited images |
| Encyclopedia of Life | Mixed | Check each | Aggregator, good metadata |
| Pl@ntNet-300K Dataset | CC-BY-SA | Share-alike | Good for research/prototyping only |
### Tier 3: Paid Options (Reference)
| Source | Estimated Cost | Notes |
|---|---|---|
| iNaturalist AWS Open Data | Free (S3 transfer fees may apply) | Bulk image export via the AWS Open Data program |
| Custom scraping infrastructure | $50-200/mo | Proxies, storage, bandwidth |
| Commercial botanical databases | $1000s+ | Getty, Alamy — not recommended |
## Licensing Decision Matrix

```
Want commercial model distribution?
├─ YES → Use ONLY: CC0, CC-BY, Public Domain
│        Filter OUT: CC-BY-NC, CC-BY-SA, All Rights Reserved
│
└─ NO (research only) → Can use CC-BY-NC, CC-BY-SA
         Pl@ntNet-300K dataset becomes viable
```
Recommendation: Filter for commercial-safe licenses from day 1. Avoids re-scraping later.
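The day-1 license gate might look like the sketch below. The token rules are illustrative — license metadata formats differ between GBIF, Flickr, and Wikimedia — and anything unrecognized fails closed:

```python
# Sketch: normalize a license string and keep only commercial-safe
# families (CC0, CC-BY, public domain). Unknown licenses are rejected.
import re

DENY_TOKENS = {"nc", "sa", "nd"}  # NonCommercial, ShareAlike, NoDerivs


def is_commercial_safe(license_str: str) -> bool:
    """True only for CC0 / CC-BY / public-domain style strings."""
    tokens = [t for t in re.split(r"[^a-z0-9]+", license_str.lower()) if t]
    if any(t in DENY_TOKENS for t in tokens):
        return False
    if "cc0" in tokens or ("cc" in tokens and "by" in tokens):
        return True
    # Catch "Public Domain", "Public Domain Mark", "PDM" variants
    return "publicdomain" in "".join(tokens) or "pdm" in tokens
```

Run this at download time and store the raw license string alongside each image so the decision is auditable later.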
## Houseplant Species Taxonomy
Problem: No canonical "houseplant" species list exists. Must construct one.
Approach:
- Start with Wikipedia "List of houseplants" (~500 species)
- Expand via genus crawl (all Philodendron, all Hoya, etc.)
- Cross-reference with RHS, ASPCA, nursery catalogs
- Target: 1,000-3,000 species is realistic for quality dataset
Key Genera (prioritize these — they cover roughly 80% of common houseplants):
Philodendron, Monstera, Pothos/Epipremnum, Ficus, Dracaena,
Sansevieria, Calathea, Maranta, Alocasia, Anthurium,
Peperomia, Hoya, Begonia, Tradescantia, Pilea,
Aglaonema, Dieffenbachia, Spathiphyllum, Zamioculcas, Crassula
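The merge step of the list-building approach can be sketched as follows; the per-source scrapers (Wikipedia, RHS, nursery catalogs) are assumed to yield `(scientific_name, common_name)` pairs, and the names below are illustrative:

```python
# Sketch: merge per-source species lists into one master list, keyed by
# a normalized "Genus species" scientific name so spelling-case
# variants collapse into one entry.

def normalize_name(name: str) -> str:
    """Collapse whitespace and standardize 'Genus species' casing."""
    parts = name.split()
    if not parts:
        return ""
    return " ".join([parts[0].capitalize()] + [p.lower() for p in parts[1:]])


def merge_species_lists(*sources):
    """Union of sources; the first source to supply a common name wins."""
    master = {}
    for source in sources:
        for scientific, common in source:
            key = normalize_name(scientific)
            if key and key not in master:
                master[key] = common
    return master
```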
## Data Quality Requirements
| Parameter | Minimum | Target | Rationale |
|---|---|---|---|
| Images per species | 100 | 300-500 | Below 100 = unreliable classification |
| Resolution | 256x256 | 512x512+ | Downsample to 224x224 for training |
| Variety | Single angle | Multi-angle, growth stages, lighting | Generalization |
| Label accuracy | 80% | 95%+ | iNaturalist "Research Grade" = community verified |
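The low-quality filter in the table can use variance of the Laplacian as a sharpness proxy (low variance = few edges = likely blurry). A library-free sketch of the measure on a plain 2-D grayscale list; the real pipeline would compute the same thing on decoded images, e.g. via OpenCV's `cv2.Laplacian`:

```python
# Sketch: blur detection via variance of the 4-neighbour Laplacian.
# The threshold is a placeholder to tune against your own data.

def laplacian_variance(gray):
    """gray: 2-D list of pixel intensities; variance over interior pixels."""
    h, w = len(gray), len(gray[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            responses.append(lap)
    if not responses:
        return 0.0
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def is_sharp(gray, threshold=100.0):
    return laplacian_variance(gray) >= threshold
```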
## Training Approach Options

### Option A: Create ML (Recommended)
| Pros | Cons |
|---|---|
| Native Apple Silicon optimization | Limited hyperparameter control |
| Outputs CoreML directly | Max ~10K classes practical limit |
| No Python/ML expertise needed | Less flexible augmentation |
| Fast iteration | |
Best for: This use case exactly.
### Option B: PyTorch + MPS Transfer Learning
| Pros | Cons |
|---|---|
| Full control over architecture | Steeper learning curve |
| State-of-art augmentation (albumentations) | Manual CoreML conversion |
| Can use EfficientNet, ConvNeXt, etc. | Slower iteration |
Best for: If Create ML hits limits or you need custom architecture.
### Option C: Cloud GPU (Google Colab / AWS Spot)
| Pros | Cons |
|---|---|
| Faster training for large models | Cost |
| No local resource constraints | Network transfer overhead |
Best for: If dataset exceeds M1 Max memory or you want transformer-based vision models.
Recommendation: Start with Create ML. Pivot to Option B only if needed.
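Create ML consumes a folder-per-class layout and can auto-split validation data, but a fixed, deterministic held-out set makes successive training runs comparable. A sketch, assuming the preprocessing output layout `/species_name/image_001.jpg`:

```python
# Sketch: deterministic ~10% validation split by hashing filenames, so
# the same image always lands in the same bucket across re-runs.
import hashlib


def split_bucket(filename: str, val_fraction: float = 0.1) -> str:
    """Return 'val' or 'train' for a given image filename."""
    digest = hashlib.sha256(filename.encode()).digest()
    return "val" if digest[0] / 256 < val_fraction else "train"


def split_dataset(files_by_species):
    """files_by_species: {species: [filenames]} -> {'train': ..., 'val': ...}."""
    out = {"train": {}, "val": {}}
    for species, files in files_by_species.items():
        for f in files:
            out[split_bucket(f)].setdefault(species, []).append(f)
    return out
```

Copy each bucket into its own folder tree before importing the training folder into Create ML.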
## Pipeline Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                          UNRAID SERVER                          │
├─────────────────────────────────────────────────────────────────┤
│ 1. Species List Generator                                       │
│    └─ Scrape Wikipedia, RHS, expand by genus                    │
│                                                                 │
│ 2. Image Downloader                                             │
│    ├─ iNaturalist/GBIF bulk export (primary)                    │
│    ├─ Flickr API (supplemental)                                 │
│    └─ License filter (CC-BY, CC0 only)                          │
│                                                                 │
│ 3. Preprocessing Pipeline                                       │
│    ├─ Resize to 512x512                                         │
│    ├─ Remove duplicates (perceptual hash)                       │
│    ├─ Remove low-quality (blur detection, size filter)          │
│    └─ Organize: /species_name/image_001.jpg                     │
│                                                                 │
│ 4. Dataset Statistics                                           │
│    └─ Report per-species counts, flag under-represented         │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼ (rsync/SMB)
┌─────────────────────────────────────────────────────────────────┐
│                           M1 MAX MAC                            │
├─────────────────────────────────────────────────────────────────┤
│ 5. Create ML Training                                           │
│    ├─ Import dataset folder                                     │
│    ├─ Train image classifier                                    │
│    └─ Export .mlmodel                                           │
│                                                                 │
│ 6. Validation                                                   │
│    ├─ Test on held-out images                                   │
│    └─ Test on real-world photos (your phone)                    │
│                                                                 │
│ 7. Integration                                                  │
│    └─ Replace Pl@ntNet-300K in PlantGuide                       │
└─────────────────────────────────────────────────────────────────┘
```
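The duplicate-removal step (stage 3) typically uses a perceptual hash; in practice you would use the `imagehash` package on PIL images, but the mechanism fits in a few lines. A pure-Python average-hash ("aHash") sketch, operating on an already-downscaled 8x8 grayscale grid:

```python
# Sketch: 64-bit average-hash fingerprints plus a greedy near-duplicate
# filter. The Hamming-distance threshold is a placeholder to tune.

def average_hash(gray8x8):
    """64-bit fingerprint: 1 where a pixel is above the grid mean."""
    pixels = [p for row in gray8x8 for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(h1, h2):
    return bin(h1 ^ h2).count("1")


def dedupe(hashes, max_distance=5):
    """Indexes of images whose hash is not near an earlier keeper."""
    kept = []
    for i, h in enumerate(hashes):
        if all(hamming(h, hashes[j]) > max_distance for j in kept):
            kept.append(i)
    return kept
```

At million-image scale, replace the pairwise loop with bucketing on hash prefixes so each new image is only compared against its bucket.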
## Timeline
| Phase | Duration | Output |
|---|---|---|
| 1. Species list curation | 1 week | 1,000-3,000 target species with scientific + common names |
| 2. Pipeline development | 1-2 weeks | Automated scraper on Unraid |
| 3. Data collection | 2-4 weeks | Running 24/7, rate-limited by APIs |
| 4. Preprocessing + QA | 1 week | Clean dataset, statistics report |
| 5. Initial training | 2-3 days | First model with subset (500 species) |
| 6. Full training | 1 week | Full model, iteration |
| 7. Validation + tuning | 1 week | Production-ready model |
Total: 6-10 weeks
## Risk Analysis
| Risk | Likelihood | Mitigation |
|---|---|---|
| Insufficient images for rare species | High | Accept lower coverage OR merge to genus-level for rare species |
| API rate limits slow collection | High | Parallelize sources, use bulk exports, patience |
| Noisy labels degrade accuracy | Medium | Use only "Research Grade" iNaturalist, implement confidence thresholds |
| Create ML memory limits | Low | M1 Max should handle; fallback to PyTorch |
| License ambiguity | Low | Strict filter on download, keep metadata |
## Next Steps
- Build species master list — Python script to scrape/merge sources
- Set up GBIF bulk download — Filter: Plantae, captive/cultivated, CC-BY/CC0, has images
- Build Flickr supplemental scraper — Target under-represented species
- Docker container on Unraid — Orchestrate pipeline
- Create ML project setup — Folder structure, initial test with 50 species
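For the Flickr supplemental scraper, a query-building sketch against `flickr.photos.search` with its numeric license filter. License codes 4 (CC BY 2.0), 9 (CC0), and 10 (Public Domain Mark) are believed commercial-safe, but confirm them via `flickr.photos.licenses.getInfo` before a long run; `API_KEY` is a placeholder:

```python
# Sketch: build a Flickr REST search URL restricted to commercial-safe
# license codes for one under-represented species.
from urllib.parse import urlencode

FLICKR_REST = "https://www.flickr.com/services/rest/"
SAFE_LICENSE_CODES = "4,9,10"  # CC BY 2.0, CC0, Public Domain Mark


def build_flickr_search(api_key: str, species: str, page: int = 1) -> str:
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "text": species,
        "license": SAFE_LICENSE_CODES,
        "content_type": 1,   # photos only
        "per_page": 500,     # Flickr's maximum page size
        "page": page,
        "format": "json",
        "nojsoncallback": 1,
    }
    return f"{FLICKR_REST}?{urlencode(params)}"
```

Each returned photo record still needs its license field checked on download, since search-time filters and per-photo metadata can drift.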
## Open Questions
- Prioritize speed (start with 500 species, fast iteration) or completeness (build full 3K species list first)?
- Any specific houseplant species that must be included?
- Docker running on Unraid already?