# Houseplant Image Dataset Accumulation Plan

## Overview

Build a custom CoreML model for houseplant identification by accumulating a large dataset of houseplant images with proper licensing for commercial use.

---

## Requirements Summary

| Parameter | Value |
|-----------|-------|
| Target species | 5,000-10,000 (realistic houseplant ceiling) |
| Images per species | 200-500 (recommended) |
| Total images | ~1-5 million |
| Budget | Free preferred, paid as reference |
| Compute | M1 Max Mac (training) + Unraid server (data pipeline) |
| Curation | Automated pipeline |
| Timeline | Weeks to months |
| Licensing | Must allow training + commercial model distribution |

---

## Hardware Assessment

| Machine | Role | Capability |
|---------|------|------------|
| M1 Max Mac | **Training** | Create ML can train 5-10K-class models; 32+ GB unified memory is ideal |
| Unraid Server | **Data pipeline** | Scraping, downloading, preprocessing, storage |

The M1 Max is legitimately viable for this task via Create ML or PyTorch + MPS. No cloud GPU is required.
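The totals in the requirements table follow directly from species count times images per species. A quick sizing sketch in Python; note that the 150 KB-per-image average for a 512x512 JPEG is an assumption for illustration, not a measured figure:

```python
def dataset_size(species: int, images_per_species: int,
                 avg_kb_per_image: int = 150) -> tuple[int, float]:
    """Estimate image count and on-disk size for the planned dataset.

    avg_kb_per_image is a rough guess for a 512x512 JPEG; measure a
    sample batch before provisioning Unraid storage.
    """
    total_images = species * images_per_species
    storage_gb = total_images * avg_kb_per_image / 1_048_576  # KB -> GB
    return total_images, storage_gb

# Realistic mid-range scenario: 3,000 species x 300 images
images, gb = dataset_size(3_000, 300)
print(f"{images:,} images, ~{gb:.0f} GB")  # 900,000 images, ~129 GB
```

At the upper bound (10,000 species x 500 images) the same arithmetic gives ~5 million images, matching the "~1-5 million" figure above.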
---

## Data Sources Analysis

### Tier 1: Primary Sources (Recommended)

| Source | License | Commercial-Safe | Volume | Houseplant Coverage | Access Method |
|--------|---------|-----------------|--------|---------------------|---------------|
| **iNaturalist via GBIF** | CC-BY, CC0 (filter) | Yes (filtered) | 100M+ observations | Good (has "captive/cultivated" flag) | Bulk export + API |
| **Flickr** | CC-BY, CC0 (filter) | Yes (filtered) | Millions | Moderate | API |
| **Wikimedia Commons** | CC-BY, CC-BY-SA, Public Domain | Mostly | Thousands | Moderate | API |

### Tier 2: Supplemental Sources

| Source | License | Commercial-Safe | Notes |
|--------|---------|-----------------|-------|
| **USDA PLANTS** | Public Domain | Yes | US-focused, limited images |
| **Encyclopedia of Life** | Mixed | Check each | Aggregator, good metadata |
| **Pl@ntNet-300K Dataset** | CC-BY-SA | Share-alike | Good for research/prototyping only |

### Tier 3: Paid Options (Reference)

| Source | Estimated Cost | Notes |
|--------|----------------|-------|
| iNaturalist AWS Open Data | Free | Bulk image export; S3 transfer costs apply |
| Custom scraping infrastructure | $50-200/mo | Proxies, storage, bandwidth |
| Commercial botanical databases | $1000s+ | Getty, Alamy — not recommended |

---

## Licensing Decision Matrix

```
Want commercial model distribution?
├─ YES → Use ONLY: CC0, CC-BY, Public Domain
│        Filter OUT: CC-BY-NC, CC-BY-SA, All Rights Reserved
│
└─ NO (research only) → Can use CC-BY-NC, CC-BY-SA
                        Pl@ntNet-300K dataset becomes viable
```

**Recommendation**: Filter for commercial-safe licenses from day 1. This avoids re-scraping later.

---

## Houseplant Species Taxonomy

**Problem**: No canonical "houseplant" species list exists. One must be constructed.

**Approach**:

1. Start with Wikipedia's "List of houseplants" (~500 species)
2. Expand via genus crawl (all Philodendron, all Hoya, etc.)
3. Cross-reference with RHS, ASPCA, and nursery catalogs
4. Target **1,000-3,000 species**, which is realistic for a quality dataset

**Key Genera** (prioritize these — cover ~80% of common houseplants):

```
Philodendron, Monstera, Pothos/Epipremnum, Ficus, Dracaena,
Sansevieria, Calathea, Maranta, Alocasia, Anthurium,
Peperomia, Hoya, Begonia, Tradescantia, Pilea,
Aglaonema, Dieffenbachia, Spathiphyllum, Zamioculcas, Crassula
```

---

## Data Quality Requirements

| Parameter | Minimum | Target | Rationale |
|-----------|---------|--------|-----------|
| Images per species | 100 | 300-500 | Below 100 = unreliable classification |
| Resolution | 256x256 | 512x512+ | Downsample to 224x224 for training |
| Variety | Single angle | Multi-angle, growth stages, lighting | Generalization |
| Label accuracy | 80% | 95%+ | iNaturalist "Research Grade" = community verified |

---

## Training Approach Options

### Option A: Create ML (Recommended)

| Pros | Cons |
|------|------|
| Native Apple Silicon optimization | Limited hyperparameter control |
| Outputs CoreML directly | Max ~10K classes practical limit |
| No Python/ML expertise needed | Less flexible augmentation |
| Fast iteration | |

**Best for**: This use case exactly.

### Option B: PyTorch + MPS Transfer Learning

| Pros | Cons |
|------|------|
| Full control over architecture | Steeper learning curve |
| State-of-the-art augmentation (albumentations) | Manual CoreML conversion |
| Can use EfficientNet, ConvNeXt, etc. | Slower iteration |

**Best for**: If Create ML hits limits or you need a custom architecture.

### Option C: Cloud GPU (Google Colab / AWS Spot)

| Pros | Cons |
|------|------|
| Faster training for large models | Cost |
| No local resource constraints | Network transfer overhead |

**Best for**: If the dataset exceeds M1 Max memory or you want transformer-based vision models.

**Recommendation**: Start with Create ML. Pivot to Option B only if needed.
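The licensing decision matrix above can be enforced mechanically at download time. A minimal filter sketch, assuming license values arrive as the Creative Commons URL strings commonly seen in GBIF/iNaturalist metadata; anything unrecognized is rejected (fail closed):

```python
# Commercial-safe license filter: accept only CC0, CC-BY, and public
# domain. Assumes Creative Commons URL-style license strings, as
# commonly supplied in GBIF/iNaturalist metadata (an assumption —
# verify against the actual export fields); unknown values fail closed.

BLOCKED_TOKENS = ("-nc", "-sa", "-nd")   # non-commercial, share-alike, no-derivatives
ALLOWED_TOKENS = ("publicdomain", "/zero/", "licenses/by/")

def is_commercial_safe(license_str: str) -> bool:
    s = license_str.strip().lower()
    if any(tok in s for tok in BLOCKED_TOKENS):
        return False
    return any(tok in s for tok in ALLOWED_TOKENS)

assert is_commercial_safe("http://creativecommons.org/licenses/by/4.0/")
assert is_commercial_safe("http://creativecommons.org/publicdomain/zero/1.0/")
assert not is_commercial_safe("http://creativecommons.org/licenses/by-nc/4.0/")
assert not is_commercial_safe("All Rights Reserved")
```

Failing closed means some usable images with nonstandard license strings will be dropped, which is the right trade-off when the goal is commercial distribution. Keep the raw license metadata alongside each image so decisions can be audited later.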
---

## Pipeline Architecture

```
┌────────────────────────────────────────────────────────────┐
│                       UNRAID SERVER                        │
├────────────────────────────────────────────────────────────┤
│ 1. Species List Generator                                  │
│    └─ Scrape Wikipedia, RHS; expand by genus               │
│                                                            │
│ 2. Image Downloader                                        │
│    ├─ iNaturalist/GBIF bulk export (primary)               │
│    ├─ Flickr API (supplemental)                            │
│    └─ License filter (CC-BY, CC0 only)                     │
│                                                            │
│ 3. Preprocessing Pipeline                                  │
│    ├─ Resize to 512x512                                    │
│    ├─ Remove duplicates (perceptual hash)                  │
│    ├─ Remove low-quality (blur detection, size filter)     │
│    └─ Organize: /species_name/image_001.jpg                │
│                                                            │
│ 4. Dataset Statistics                                      │
│    └─ Report per-species counts, flag under-represented    │
└────────────────────────────────────────────────────────────┘
                              │
                              ▼ (rsync/SMB)
┌────────────────────────────────────────────────────────────┐
│                        M1 MAX MAC                          │
├────────────────────────────────────────────────────────────┤
│ 5. Create ML Training                                      │
│    ├─ Import dataset folder                                │
│    ├─ Train image classifier                               │
│    └─ Export .mlmodel                                      │
│                                                            │
│ 6. Validation                                              │
│    ├─ Test on held-out images                              │
│    └─ Test on real-world photos (your phone)               │
│                                                            │
│ 7. Integration                                             │
│    └─ Replace PlantNet-300K in PlantGuide                  │
└────────────────────────────────────────────────────────────┘
```

---

## Timeline

| Phase | Duration | Output |
|-------|----------|--------|
| 1. Species list curation | 1 week | 1,000-3,000 target species with scientific + common names |
| 2. Pipeline development | 1-2 weeks | Automated scraper on Unraid |
| 3. Data collection | 2-4 weeks | Running 24/7, rate-limited by APIs |
| 4. Preprocessing + QA | 1 week | Clean dataset, statistics report |
| 5. Initial training | 2-3 days | First model with subset (500 species) |
| 6. Full training | 1 week | Full model, iteration |
| 7. Validation + tuning | 1 week | Production-ready model |

**Total: 6-10 weeks**

---

## Risk Analysis

| Risk | Likelihood | Mitigation |
|------|------------|------------|
| Insufficient images for rare species | High | Accept lower coverage OR merge rare species to genus level |
| API rate limits slow collection | High | Parallelize sources, use bulk exports, patience |
| Noisy labels degrade accuracy | Medium | Use only "Research Grade" iNaturalist observations; apply confidence thresholds |
| Create ML memory limits | Low | M1 Max should handle it; fall back to PyTorch |
| License ambiguity | Low | Filter strictly at download time; keep metadata |

---

## Next Steps

1. **Build species master list** — Python script to scrape/merge sources
2. **Set up GBIF bulk download** — Filter: Plantae, captive/cultivated, CC-BY/CC0, has images
3. **Build Flickr supplemental scraper** — Target under-represented species
4. **Docker container on Unraid** — Orchestrate pipeline
5. **Create ML project setup** — Folder structure, initial test with 50 species

---

## Open Questions

- Prioritize **speed** (start with 500 species, fast iteration) or **completeness** (build the full 3K-species list first)?
- Are there specific houseplant species that must be included?
- Is Docker already running on Unraid?
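The duplicate-removal step in the preprocessing pipeline (perceptual hash) can be prototyped without heavy dependencies. The sketch below implements a difference hash (dHash) over an already-resized grayscale grid; a real pipeline would do the resizing with Pillow (or use the `imagehash` package directly), but the hashing and near-duplicate comparison logic are the same. All names here are illustrative.

```python
# dHash sketch for the pipeline's duplicate-removal step. A production
# version would resize each image to a 9x8 grayscale grid with Pillow;
# here the resized grid is passed in directly to stay dependency-free.

def dhash(pixels: list[list[int]]) -> int:
    """Hash an h x (w+1) grayscale grid by comparing horizontal neighbors.

    Each pixel pair contributes one bit (1 if brightness increases
    left-to-right), so visually similar images yield similar hashes.
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (left < right)
    return bits

def is_near_duplicate(h1: int, h2: int, max_distance: int = 4) -> bool:
    """Hamming distance between two hashes; a small distance means the
    images are likely crops/re-encodes of the same photo."""
    return bin(h1 ^ h2).count("1") <= max_distance

# Identical grids always hash identically, so exact re-uploads dedupe
# for free; the Hamming threshold catches minor re-encodes.
grid = [[10, 200, 30], [40, 40, 90]]
assert dhash(grid) == dhash([row[:] for row in grid])
```

The `max_distance` threshold of 4 (on a 64-bit hash from a standard 9x8 grid) is a conventional starting point, not a tuned value; calibrate it on a labeled sample of known duplicates before running it over the full collection.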