PlantGuideScraper/docs/master_plan.md
2026-04-12 09:54:27 -05:00

# Houseplant Image Scraper - Master Plan
## Overview
Web-based interface for managing a multi-source image scraping pipeline targeting 5-10K houseplant species with 1-5M total images. Runs on Unraid via Docker, exports datasets for CoreML training.
---
## Requirements Summary
| Requirement | Value |
|-------------|-------|
| Platform | Web app in Docker on Unraid |
| Sources | iNaturalist/GBIF, Flickr, Wikimedia Commons, Trefle, USDA PLANTS, EOL |
| API keys | Configurable per service |
| Species list | Manual import (CSV/paste) |
| Grouping | Species, genus, source, license (faceted) |
| Search/filter | Yes |
| Quality filter | Automatic (hash dedup, blur, size) |
| Progress | Real-time dashboard |
| Storage | `/species_name/image.jpg` + SQLite DB |
| Export | Filtered zip for CoreML, downloadable anytime |
| Auth | None (single user) |
| Deployment | Docker Compose |
---
## Create ML Export Requirements
Per [Apple's documentation](https://developer.apple.com/documentation/createml/creating-an-image-classifier-model):
- **Folder structure**: `/SpeciesName/image001.jpg` (folder name = class label)
- **Train/Test split**: 80/20 recommended, separate folders
- **Balance**: Roughly equal images per class (avoid bias)
- **No metadata needed**: Create ML uses folder names as labels
### Export Format
```
dataset_export/
├── Training/
│   ├── Monstera_deliciosa/
│   │   ├── img001.jpg
│   │   └── ...
│   ├── Philodendron_hederaceum/
│   └── ...
└── Testing/
    ├── Monstera_deliciosa/
    └── ...
```
---
## Data Sources
| Source | API/Method | License Filter | Rate Limits | Notes |
|--------|------------|----------------|-------------|-------|
| **iNaturalist/GBIF** | Bulk DwC-A export + API | CC0, CC-BY | 1 req/sec, 10k/day, 5GB/hr media | Best source: Research Grade = verified |
| **Flickr** | flickr.photos.search | license=4,9 (CC-BY, CC0) | 3600 req/hr | Good supplemental |
| **Wikimedia Commons** | MediaWiki API + pyWikiCommons | CC-BY, CC-BY-SA, PD | Generous | Category-based search |
| **Trefle.io** | REST API | Open source | Free tier | Species metadata + some images |
| **USDA PLANTS** | REST API | Public Domain | Generous | US-focused, limited images |
| **Plant.id** | REST API | Commercial | Paid | For validation, not scraping |
| **Encyclopedia of Life** | API | Mixed | Check each | Aggregator |
### Source References
- iNaturalist: https://www.inaturalist.org/pages/developers
- iNaturalist bulk download: https://forum.inaturalist.org/t/one-time-bulk-download-dataset/18741
- Flickr API: https://www.flickr.com/services/api/flickr.photos.search.html
- Wikimedia Commons API: https://commons.wikimedia.org/wiki/Commons:API
- pyWikiCommons: https://pypi.org/project/pyWikiCommons/
- Trefle.io: https://trefle.io/
- USDA PLANTS: https://data.nal.usda.gov/dataset/usda-plants-database-api-r
### Flickr License IDs
| ID | License |
|----|---------|
| 0 | All Rights Reserved |
| 1 | CC BY-NC-SA 2.0 |
| 2 | CC BY-NC 2.0 |
| 3 | CC BY-NC-ND 2.0 |
| 4 | CC BY 2.0 (Commercial OK) |
| 5 | CC BY-SA 2.0 |
| 6 | CC BY-ND 2.0 |
| 7 | No known copyright restrictions |
| 8 | United States Government Work |
| 9 | Public Domain (CC0) |
**For commercial use**: Filter to license IDs 4, 7, 8, 9 only.
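A request against `flickr.photos.search` with that license filter might be built like this (no network call shown; the helper name is made up, and the exact `extras` fields to request are a choice, not a requirement):

```python
from urllib.parse import urlencode

FLICKR_ENDPOINT = "https://api.flickr.com/services/rest/"
COMMERCIAL_LICENSES = "4,7,8,9"  # CC BY, no known restrictions, US Gov work, CC0/PD

def flickr_search_url(api_key: str, species: str, page: int = 1) -> str:
    """Build a flickr.photos.search URL restricted to commercial-friendly licenses."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "text": species,
        "license": COMMERCIAL_LICENSES,        # comma-separated license IDs
        "extras": "license,owner_name,url_o",  # license + attribution + original URL
        "per_page": 100,
        "page": page,
        "format": "json",
        "nojsoncallback": 1,
    }
    return f"{FLICKR_ENDPOINT}?{urlencode(params)}"
```

Requesting `license` and `owner_name` in `extras` lets the scraper fill the `license` and `attribution` columns in one pass instead of fetching per-photo info.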
---
## Image Quality Pipeline
| Stage | Library | Purpose |
|-------|---------|---------|
| **Deduplication** | imagededup | Perceptual hash (CNN + hash methods) |
| **Blur detection** | scipy + Sobel variance | Reject blurry images |
| **Size filter** | Pillow | Min 256x256 |
| **Resize** | Pillow | Normalize to 512x512 |
### Library References
- imagededup: https://github.com/idealo/imagededup
- imagehash: https://github.com/JohannesBuchner/imagehash
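The Sobel-variance blur check can be sketched as below. This version uses plain numpy so it is self-contained; in the pipeline, `scipy.ndimage.sobel` would replace the hand-rolled convolution. The rejection threshold is not fixed here because it has to be tuned empirically against the dataset.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

def _convolve2d(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Minimal valid-mode 2-D convolution (stand-in for scipy.ndimage.sobel)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def sharpness_score(gray: np.ndarray) -> float:
    """Variance of the Sobel gradient magnitude; low values suggest blur."""
    gx = _convolve2d(gray, SOBEL_X)      # horizontal edges
    gy = _convolve2d(gray, SOBEL_X.T)    # vertical edges
    return float(np.var(np.hypot(gx, gy)))
```

Blurring an image spreads its edges out, which flattens the gradient-magnitude distribution and drives the variance down; that is the quantity the `quality_score` column would store.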
---
## Technology Stack
| Component | Choice | Rationale |
|-----------|--------|-----------|
| **Backend** | FastAPI (Python) | Async, fast, ML ecosystem, great docs |
| **Frontend** | React + Tailwind | Fast dev, good component libraries |
| **Database** | SQLite (+ FTS5) | Simple, no separate container, sufficient for single-user |
| **Task Queue** | Celery + Redis | Long-running scrape jobs, good monitoring |
| **Containers** | Docker Compose | Multi-service orchestration |
Reference: https://github.com/fastapi/full-stack-fastapi-template
---
## Architecture
```
                     DOCKER COMPOSE ON UNRAID

┌─────────────┐      ┌──────────────────────────────────────────┐
│    NGINX    │      │             FASTAPI BACKEND              │
│     :80     │─────▶│  /api/species - CRUD species list        │
└──────┬──────┘      │  /api/sources - API key management       │
       │             │  /api/jobs    - Scrape job control       │
       ▼             │  /api/images  - Search, filter, browse   │
┌─────────────┐      │  /api/export  - Generate zip for CoreML  │
│    REACT    │      │  /api/stats   - Dashboard metrics        │
│     SPA     │      └─────────────────────┬────────────────────┘
│    :3000    │                            │
└─────────────┘                            ▼
                     ┌──────────────────────────────────────────┐
┌─────────────┐      │              CELERY WORKERS              │
│    REDIS    │◀─────│  - iNaturalist scraper                   │
│    :6379    │      │  - Flickr scraper                        │
└─────────────┘      │  - Wikimedia scraper                     │
                     │  - Quality filter pipeline               │
                     │  - Export generator                      │
                     └─────────────────────┬────────────────────┘
                                           │
                                           ▼
┌──────────────────────────────────────────────────────────────┐
│                    STORAGE (Bind Mounts)                     │
│  /data/db/plants.sqlite  - Species, images metadata, jobs    │
│  /data/images/{species}/ - Downloaded images                 │
│  /data/exports/          - Generated zip files               │
└──────────────────────────────────────────────────────────────┘
```
---
## Database Schema
```sql
-- Species master list (imported from CSV)
CREATE TABLE species (
    id INTEGER PRIMARY KEY,
    scientific_name TEXT UNIQUE NOT NULL,
    common_name TEXT,
    genus TEXT,
    family TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Full-text search index
CREATE VIRTUAL TABLE species_fts USING fts5(
    scientific_name,
    common_name,
    genus,
    content='species',
    content_rowid='id'
);

-- API credentials
CREATE TABLE api_keys (
    id INTEGER PRIMARY KEY,
    source TEXT UNIQUE NOT NULL,        -- 'flickr', 'inaturalist', 'wikimedia', 'trefle'
    api_key TEXT NOT NULL,
    api_secret TEXT,
    rate_limit_per_sec REAL DEFAULT 1.0,
    enabled BOOLEAN DEFAULT TRUE
);

-- Downloaded images
CREATE TABLE images (
    id INTEGER PRIMARY KEY,
    species_id INTEGER REFERENCES species(id),
    source TEXT NOT NULL,
    source_id TEXT,                     -- Original ID from source
    url TEXT NOT NULL,
    local_path TEXT,
    license TEXT NOT NULL,
    attribution TEXT,
    width INTEGER,
    height INTEGER,
    phash TEXT,                         -- Perceptual hash for dedup
    quality_score REAL,                 -- Blur/quality metric
    status TEXT DEFAULT 'pending',      -- pending, downloaded, rejected, deleted
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(source, source_id)
);

-- Indexes for common queries
CREATE INDEX idx_images_species ON images(species_id);
CREATE INDEX idx_images_status ON images(status);
CREATE INDEX idx_images_source ON images(source);
CREATE INDEX idx_images_phash ON images(phash);

-- Scrape jobs
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    source TEXT NOT NULL,
    species_filter TEXT,                -- JSON array of species IDs or NULL for all
    status TEXT DEFAULT 'pending',      -- pending, running, paused, completed, failed
    progress_current INTEGER DEFAULT 0,
    progress_total INTEGER DEFAULT 0,
    images_downloaded INTEGER DEFAULT 0,
    images_rejected INTEGER DEFAULT 0,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    error_message TEXT
);

-- Export jobs
CREATE TABLE exports (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    filter_criteria TEXT,               -- JSON: min_images, licenses, min_quality, species_ids
    train_split REAL DEFAULT 0.8,
    status TEXT DEFAULT 'pending',      -- pending, generating, completed, failed
    file_path TEXT,
    file_size INTEGER,
    species_count INTEGER,
    image_count INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    completed_at TIMESTAMP
);
```
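The `UNIQUE(source, source_id)` constraint is what makes re-running a scrape job idempotent: the same photo seen twice is silently skipped. A sketch of how a worker might use it with the stdlib `sqlite3` module (the helper name and column subset are illustrative):

```python
import sqlite3

def record_image(conn: sqlite3.Connection, species_id: int, source: str,
                 source_id: str, url: str, license_: str) -> bool:
    """Insert an image row; returns False if (source, source_id) was already seen."""
    cur = conn.execute(
        """INSERT OR IGNORE INTO images (species_id, source, source_id, url, license)
           VALUES (?, ?, ?, ?, ?)""",
        (species_id, source, source_id, url, license_),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 when the UNIQUE constraint suppressed the insert
```

The boolean return lets the job loop count new downloads versus skips without a separate lookup query.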
---
## API Endpoints
### Species
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/species` | List species (paginated, searchable) |
| POST | `/api/species` | Create single species |
| POST | `/api/species/import` | Bulk import from CSV |
| GET | `/api/species/{id}` | Get species details |
| PUT | `/api/species/{id}` | Update species |
| DELETE | `/api/species/{id}` | Delete species |
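The `/api/species/import` handler reduces to a CSV-to-rows transform. A sketch under the assumption that the upload carries `scientific_name` and `common_name` columns (the `genus` column is derived from the first word of the binomial; the function name is made up):

```python
import csv
import io

def parse_species_csv(text: str) -> list[dict]:
    """Parse a species CSV into rows ready for bulk insert.

    Assumes scientific_name / common_name headers; derives genus
    from the first word of the scientific name.
    """
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        name = (rec.get("scientific_name") or "").strip()
        if not name:
            continue  # skip blank or malformed lines rather than failing the import
        rows.append({
            "scientific_name": name,
            "common_name": (rec.get("common_name") or "").strip() or None,
            "genus": name.split()[0],
        })
    return rows
```

Skipping bad lines instead of raising keeps a single stray row from aborting a multi-thousand-species import; the endpoint could report the skipped count in its response.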
### API Keys
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/sources` | List configured sources |
| PUT | `/api/sources/{source}` | Update source config (key, rate limit) |
### Jobs
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/jobs` | List jobs |
| POST | `/api/jobs` | Create scrape job |
| GET | `/api/jobs/{id}` | Get job status |
| POST | `/api/jobs/{id}/pause` | Pause job |
| POST | `/api/jobs/{id}/resume` | Resume job |
| POST | `/api/jobs/{id}/cancel` | Cancel job |
### Images
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/images` | List images (paginated, filterable) |
| GET | `/api/images/{id}` | Get image details |
| DELETE | `/api/images/{id}` | Delete image |
| POST | `/api/images/bulk-delete` | Bulk delete |
### Export
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/exports` | List exports |
| POST | `/api/exports` | Create export job |
| GET | `/api/exports/{id}` | Get export status |
| GET | `/api/exports/{id}/download` | Download zip file |
### Stats
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/stats` | Dashboard statistics |
| GET | `/api/stats/sources` | Per-source breakdown |
| GET | `/api/stats/species` | Per-species image counts |
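The per-species stats endpoint maps to a single `GROUP BY` over the schema above. A sketch (the `LEFT JOIN` keeps zero-image species visible, which the dashboard needs to prioritize scraping):

```python
import sqlite3

PER_SPECIES_COUNTS = """
SELECT s.scientific_name,
       COUNT(i.id) AS image_count
FROM species s
LEFT JOIN images i
       ON i.species_id = s.id AND i.status = 'downloaded'
GROUP BY s.id
ORDER BY image_count DESC
"""

def species_counts(conn: sqlite3.Connection) -> list[tuple[str, int]]:
    """Image count per species, including species with zero downloaded images."""
    return conn.execute(PER_SPECIES_COUNTS).fetchall()
```

Filtering on `status = 'downloaded'` inside the join condition (not a `WHERE` clause) is what preserves the zero rows.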
---
## UI Screens
### 1. Dashboard
- Total species, images by source, images by license
- Active jobs with progress bars
- Quick stats: images/sec, disk usage
- Recent activity feed
### 2. Species Management
- Table: scientific name, common name, genus, image count
- Import CSV button (drag-and-drop)
- Search/filter by name, genus
- Bulk select → "Start Scrape" button
- Inline editing
### 3. API Keys
- Card per source with:
- API key input (masked)
- API secret input (if applicable)
- Rate limit slider
- Enable/disable toggle
- Test connection button
### 4. Image Browser
- Grid view with thumbnails (lazy-loaded)
- Filters sidebar:
- Species (autocomplete)
- Source (checkboxes)
- License (checkboxes)
- Quality score (range slider)
- Status (tabs: all, pending, downloaded, rejected)
- Sort by: date, quality, species
- Bulk select → actions (delete, re-queue)
- Click to view full-size + metadata
### 5. Jobs
- Table: name, source, status, progress, dates
- Real-time progress updates (WebSocket)
- Actions: pause, resume, cancel, view logs
### 6. Export
- Filter builder:
- Min images per species
- License whitelist
- Min quality score
- Species selection (all or specific)
- Train/test split slider (default 80/20)
- Preview: estimated species count, image count, file size
- "Generate Zip" button
- Download history with re-download links
---
## Tradeoffs
| Decision | Alternative | Why This Choice |
|----------|-------------|-----------------|
| SQLite | PostgreSQL | Single-user, simpler Docker setup, sufficient for millions of rows |
| Celery+Redis | RQ, Dramatiq | Battle-tested, good monitoring (Flower) |
| React | Vue, Svelte | Largest ecosystem, more component libraries |
| Separate workers | Threads in FastAPI | Better isolation, can scale workers independently |
| Nginx reverse proxy | Traefik | Simpler config for single-app deployment |
---
## Risks & Mitigations
| Risk | Likelihood | Mitigation |
|------|------------|------------|
| iNaturalist rate limits (5GB/hr) | High | Throttle downloads, prioritize species with low counts |
| Disk fills up | Medium | Dashboard shows disk usage, configurable storage limits |
| Scrape jobs crash mid-run | Medium | Job state in DB, resume from last checkpoint |
| Perceptual hash collisions | Low | Store hash, allow manual review of flagged duplicates |
| API keys exposed | Low | Environment variables, not stored in code |
| SQLite write contention | Low | WAL mode, single writer pattern via Celery |
---
## Implementation Phases
### Phase 1: Foundation
- [ ] Docker Compose setup (FastAPI, React, Redis, Nginx)
- [ ] Database schema + migrations (Alembic)
- [ ] Basic FastAPI skeleton with health checks
- [ ] React app scaffolding with Tailwind
### Phase 2: Core Data Management
- [ ] Species CRUD API
- [ ] CSV import endpoint
- [ ] Species list UI with search/filter
- [ ] API keys management UI
### Phase 3: iNaturalist Scraper
- [ ] Celery worker setup
- [ ] iNaturalist/GBIF scraper task
- [ ] Job management API
- [ ] Real-time progress (WebSocket or polling)
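The scraper task has to honor the per-source limits from the `api_keys` table (`rate_limit_per_sec`, e.g. 1 req/sec for iNaturalist). A minimal blocking limiter the Celery task could call before each request (class name is illustrative):

```python
import time

class RateLimiter:
    """Blocks so consecutive calls are at least 1/rate_per_sec seconds apart."""

    def __init__(self, rate_per_sec: float):
        self.min_interval = 1.0 / rate_per_sec
        self._last = 0.0

    def wait(self) -> None:
        now = time.monotonic()
        sleep_for = self._last + self.min_interval - now
        if sleep_for > 0:
            time.sleep(sleep_for)  # pace the next request
        self._last = time.monotonic()
```

Using `time.monotonic()` avoids surprises from wall-clock adjustments; one limiter instance per source per worker keeps each scraper within its own budget.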
### Phase 4: Quality Pipeline
- [ ] Image download worker
- [ ] Perceptual hash deduplication
- [ ] Blur detection + quality scoring
- [ ] Resize to 512x512
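imagededup/imagehash produce the perceptual hashes stored in `images.phash`; the dedup step itself is then a Hamming-distance comparison over those hex strings. A sketch (the distance threshold of 6 is a starting guess to tune, and the pairwise scan is fine at per-species scale):

```python
def hamming(phash_a: str, phash_b: str) -> int:
    """Hamming distance between two equal-length hex perceptual hashes."""
    return bin(int(phash_a, 16) ^ int(phash_b, 16)).count("1")

def find_duplicates(phashes: dict[int, str],
                    max_distance: int = 6) -> list[tuple[int, int]]:
    """Pairwise scan for near-duplicate image IDs within one species."""
    items = sorted(phashes.items())
    pairs = []
    for i, (id_a, ha) in enumerate(items):
        for id_b, hb in items[i + 1:]:
            if hamming(ha, hb) <= max_distance:
                pairs.append((id_a, id_b))
    return pairs
```

Flagged pairs would be marked `rejected` in the `images.status` column rather than deleted outright, which leaves room for the manual review mentioned under Risks.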
### Phase 5: Image Browser
- [ ] Image listing API with filters
- [ ] Thumbnail generation
- [ ] Grid view UI
- [ ] Bulk operations
### Phase 6: Additional Scrapers
- [ ] Flickr scraper
- [ ] Wikimedia Commons scraper
- [ ] Trefle scraper (metadata + images)
- [ ] USDA PLANTS scraper
### Phase 7: Export
- [ ] Export job API
- [ ] Train/test split logic
- [ ] Zip generation worker
- [ ] Download endpoint
- [ ] Export UI with filters
### Phase 8: Dashboard & Polish
- [ ] Stats API
- [ ] Dashboard UI with charts
- [ ] Job monitoring UI
- [ ] Error handling + logging
- [ ] Documentation
---
## File Structure
```
PlantGuideScraper/
├── docker-compose.yml
├── .env.example
├── docs/
│   └── master_plan.md
├── backend/
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── alembic/
│   │   └── versions/
│   ├── app/
│   │   ├── __init__.py
│   │   ├── main.py
│   │   ├── config.py
│   │   ├── database.py
│   │   ├── models/
│   │   │   ├── species.py
│   │   │   ├── image.py
│   │   │   ├── job.py
│   │   │   └── export.py
│   │   ├── schemas/
│   │   │   └── ...
│   │   ├── api/
│   │   │   ├── species.py
│   │   │   ├── images.py
│   │   │   ├── jobs.py
│   │   │   ├── exports.py
│   │   │   └── stats.py
│   │   ├── scrapers/
│   │   │   ├── base.py
│   │   │   ├── inaturalist.py
│   │   │   ├── flickr.py
│   │   │   ├── wikimedia.py
│   │   │   └── trefle.py
│   │   ├── workers/
│   │   │   ├── celery_app.py
│   │   │   ├── scrape_tasks.py
│   │   │   ├── quality_tasks.py
│   │   │   └── export_tasks.py
│   │   └── utils/
│   │       ├── image_quality.py
│   │       └── dedup.py
│   └── tests/
├── frontend/
│   ├── Dockerfile
│   ├── package.json
│   ├── src/
│   │   ├── App.tsx
│   │   ├── components/
│   │   ├── pages/
│   │   │   ├── Dashboard.tsx
│   │   │   ├── Species.tsx
│   │   │   ├── Images.tsx
│   │   │   ├── Jobs.tsx
│   │   │   ├── Export.tsx
│   │   │   └── Settings.tsx
│   │   ├── hooks/
│   │   └── api/
│   └── public/
├── nginx/
│   └── nginx.conf
└── data/                    # Bind mount (not in repo)
    ├── db/
    ├── images/
    └── exports/
```
---
## Environment Variables
```bash
# Backend
DATABASE_URL=sqlite:///data/db/plants.sqlite
REDIS_URL=redis://redis:6379/0
IMAGES_PATH=/data/images
EXPORTS_PATH=/data/exports
# API Keys (user-provided)
FLICKR_API_KEY=
FLICKR_API_SECRET=
INATURALIST_APP_ID=
INATURALIST_APP_SECRET=
TREFLE_API_KEY=
# Optional
LOG_LEVEL=INFO
CELERY_CONCURRENCY=4
```
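A sketch of how `config.py` might surface these variables (a plain dataclass over `os.environ`; the real backend could equally use pydantic settings — only the variable names and defaults above are fixed by this plan):

```python
import os
from dataclasses import dataclass, field

def _env(name: str, default: str) -> str:
    return os.environ.get(name, default)

@dataclass(frozen=True)
class Settings:
    """Backend configuration sourced from the environment, with safe defaults."""
    database_url: str = field(
        default_factory=lambda: _env("DATABASE_URL", "sqlite:///data/db/plants.sqlite"))
    redis_url: str = field(
        default_factory=lambda: _env("REDIS_URL", "redis://redis:6379/0"))
    images_path: str = field(default_factory=lambda: _env("IMAGES_PATH", "/data/images"))
    exports_path: str = field(default_factory=lambda: _env("EXPORTS_PATH", "/data/exports"))
    log_level: str = field(default_factory=lambda: _env("LOG_LEVEL", "INFO"))
    celery_concurrency: int = field(
        default_factory=lambda: int(_env("CELERY_CONCURRENCY", "4")))
```

API keys are deliberately left out of this object: per the schema, they live in the `api_keys` table so they can be edited from the Settings UI, with the env vars serving only as initial seed values.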
---
## Commands
```bash
# Development
docker-compose up --build
# Production
docker-compose -f docker-compose.yml -f docker-compose.prod.yml up -d
# Run migrations
docker-compose exec backend alembic upgrade head
# View Celery logs
docker-compose logs -f celery
# Access Redis CLI
docker-compose exec redis redis-cli
```