# Houseplant Image Scraper - Master Plan

## Overview

Web-based interface for managing a multi-source image scraping pipeline targeting 5-10K houseplant species with 1-5M total images. Runs on Unraid via Docker, exports datasets for CoreML training.
## Requirements Summary

| Requirement | Value |
|---|---|
| Platform | Web app in Docker on Unraid |
| Sources | iNaturalist/GBIF, Flickr, Wikimedia Commons, Trefle, USDA PLANTS, EOL |
| API keys | Configurable per service |
| Species list | Manual import (CSV/paste) |
| Grouping | Species, genus, source, license (faceted) |
| Search/filter | Yes |
| Quality filter | Automatic (hash dedup, blur, size) |
| Progress | Real-time dashboard |
| Storage | /species_name/image.jpg + SQLite DB |
| Export | Filtered zip for CoreML, downloadable anytime |
| Auth | None (single user) |
| Deployment | Docker Compose |
## Create ML Export Requirements

Per Apple's documentation:

- Folder structure: /SpeciesName/image001.jpg (folder name = class label)
- Train/test split: 80/20 recommended, in separate folders
- Balance: roughly equal images per class (avoid bias)
- No metadata needed: Create ML uses folder names as labels

### Export Format
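The split logic can be sketched as follows. The function name, paths, and the `Train`/`Test` folder names are our conventions, not a final implementation; Create ML simply takes the two class-labeled folders as separate training and testing datasets.

```python
import random
import shutil
from pathlib import Path

def export_createml(src: Path, dst: Path, train_frac: float = 0.8, seed: int = 42) -> None:
    """Copy /species_name/*.jpg into Train/<Class>/ and Test/<Class>/ folders,
    where the class folder name becomes the label Create ML uses."""
    rng = random.Random(seed)  # deterministic split for reproducible exports
    for species_dir in sorted(p for p in src.iterdir() if p.is_dir()):
        images = sorted(species_dir.glob("*.jpg"))
        rng.shuffle(images)
        cut = int(len(images) * train_frac)  # 80/20 by default
        for split, subset in (("Train", images[:cut]), ("Test", images[cut:])):
            out = dst / split / species_dir.name  # folder name = class label
            out.mkdir(parents=True, exist_ok=True)
            for img in subset:
                shutil.copy2(img, out / img.name)
```

The deterministic seed means re-running an export with the same filter criteria reproduces the same split.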
## Data Sources

| Source | API/Method | License Filter | Rate Limits | Notes |
|---|---|---|---|---|
| iNaturalist/GBIF | Bulk DwC-A export + API | CC0, CC-BY | 1 req/sec, 10k/day, 5GB/hr media | Best source: Research Grade = verified |
| Flickr | flickr.photos.search | license=4,9 (CC-BY, CC0) | 3600 req/hr | Good supplemental |
| Wikimedia Commons | MediaWiki API + pyWikiCommons | CC-BY, CC-BY-SA, PD | Generous | Category-based search |
| Trefle.io | REST API | Open source | Free tier | Species metadata + some images |
| USDA PLANTS | REST API | Public Domain | Generous | US-focused, limited images |
| Plant.id | REST API | Commercial | Paid | For validation, not scraping |
| Encyclopedia of Life | API | Mixed | Check each | Aggregator |
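The per-source limits above can be enforced client-side with a simple throttle; a minimal sketch (class and method names are illustrative, and each source would get its own instance configured from `rate_limit_per_sec`):

```python
import time

class Throttle:
    """Blocks so successive calls never exceed `per_sec` requests per second."""

    def __init__(self, per_sec: float):
        self.min_interval = 1.0 / per_sec
        self.last = 0.0

    def wait(self) -> None:
        # Sleep just long enough to honor the minimum interval between calls.
        delay = self.min_interval - (time.monotonic() - self.last)
        if delay > 0:
            time.sleep(delay)
        self.last = time.monotonic()

# e.g. inat = Throttle(1.0)  # iNaturalist: 1 req/sec
#      inat.wait(); fetch_next_page(...)
```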
### Source References

#### Flickr License IDs

| ID | License |
|---|---|
| 0 | All Rights Reserved |
| 1 | CC BY-NC-SA 2.0 |
| 2 | CC BY-NC 2.0 |
| 3 | CC BY-NC-ND 2.0 |
| 4 | CC BY 2.0 (commercial OK) |
| 5 | CC BY-SA 2.0 |
| 6 | CC BY-ND 2.0 |
| 7 | No known copyright restrictions |
| 8 | United States Government Work |
| 9 | Public Domain (CC0) |

For commercial use, filter to license IDs 4, 7, 8, and 9 only.
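As an example, a search URL restricted to the commercial-safe IDs might be built like this. The helper name and the `per_page` choice are ours; the request parameters themselves are standard `flickr.photos.search` parameters:

```python
from urllib.parse import urlencode

FLICKR_API = "https://api.flickr.com/services/rest/"

def flickr_search_url(api_key: str, species: str, page: int = 1) -> str:
    """Build a flickr.photos.search URL limited to commercial-safe licenses."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "text": species,
        "license": "4,7,8,9",  # CC BY, no known restrictions, US Gov work, PD/CC0
        "extras": "license,url_o,o_dims,owner_name",  # license + original size + attribution
        "format": "json",
        "nojsoncallback": 1,
        "per_page": 500,  # Flickr's maximum page size
        "page": page,
    }
    return FLICKR_API + "?" + urlencode(params)
```

Recording `owner_name` and the license per photo at scrape time is what makes the attribution column in the database fillable later.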
## Image Quality Pipeline

| Stage | Library | Purpose |
|---|---|---|
| Deduplication | imagededup | Perceptual hash (CNN + hash methods) |
| Blur detection | scipy + Sobel variance | Reject blurry images |
| Size filter | Pillow | Minimum 256x256 |
| Resize | Pillow | Normalize to 512x512 |
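The blur stage can be sketched as a Sobel-variance score: sharp images have high-variance gradients, blurry ones do not. The function name and any rejection threshold are ours to tune, not fixed values:

```python
import numpy as np
from PIL import Image
from scipy import ndimage

def sobel_sharpness(path: str) -> float:
    """Variance of the Sobel gradient magnitude; low values suggest blur."""
    gray = np.asarray(Image.open(path).convert("L"), dtype=float)
    gx = ndimage.sobel(gray, axis=1)  # horizontal edges
    gy = ndimage.sobel(gray, axis=0)  # vertical edges
    return float(np.hypot(gx, gy).var())

# The score feeds images.quality_score; a cutoff is picked empirically,
# e.g. reject if sobel_sharpness(p) < THRESHOLD for your image sizes.
```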
### Library References
## Technology Stack

| Component | Choice | Rationale |
|---|---|---|
| Backend | FastAPI (Python) | Async, fast, ML ecosystem, great docs |
| Frontend | React + Tailwind | Fast dev, good component libraries |
| Database | SQLite (+ FTS5) | Simple, no separate container, sufficient for single user |
| Task queue | Celery + Redis | Long-running scrape jobs, good monitoring |
| Containers | Docker Compose | Multi-service orchestration |

Reference: https://github.com/fastapi/full-stack-fastapi-template
## Architecture

### Database Schema

```sql
-- Species master list (imported from CSV)
CREATE TABLE species (
    id INTEGER PRIMARY KEY,
    scientific_name TEXT UNIQUE NOT NULL,
    common_name TEXT,
    genus TEXT,
    family TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Full-text search index
CREATE VIRTUAL TABLE species_fts USING fts5(
    scientific_name,
    common_name,
    genus,
    content='species',
    content_rowid='id'
);

-- API credentials
CREATE TABLE api_keys (
    id INTEGER PRIMARY KEY,
    source TEXT UNIQUE NOT NULL, -- 'flickr', 'inaturalist', 'wikimedia', 'trefle'
    api_key TEXT NOT NULL,
    api_secret TEXT,
    rate_limit_per_sec REAL DEFAULT 1.0,
    enabled BOOLEAN DEFAULT TRUE
);

-- Downloaded images
CREATE TABLE images (
    id INTEGER PRIMARY KEY,
    species_id INTEGER REFERENCES species(id),
    source TEXT NOT NULL,
    source_id TEXT, -- Original ID from source
    url TEXT NOT NULL,
    local_path TEXT,
    license TEXT NOT NULL,
    attribution TEXT,
    width INTEGER,
    height INTEGER,
    phash TEXT, -- Perceptual hash for dedup
    quality_score REAL, -- Blur/quality metric
    status TEXT DEFAULT 'pending', -- pending, downloaded, rejected, deleted
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(source, source_id)
);

-- Indexes for common queries
CREATE INDEX idx_images_species ON images(species_id);
CREATE INDEX idx_images_status ON images(status);
CREATE INDEX idx_images_source ON images(source);
CREATE INDEX idx_images_phash ON images(phash);

-- Scrape jobs
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    source TEXT NOT NULL,
    species_filter TEXT, -- JSON array of species IDs, or NULL for all
    status TEXT DEFAULT 'pending', -- pending, running, paused, completed, failed
    progress_current INTEGER DEFAULT 0,
    progress_total INTEGER DEFAULT 0,
    images_downloaded INTEGER DEFAULT 0,
    images_rejected INTEGER DEFAULT 0,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    error_message TEXT
);

-- Export jobs
CREATE TABLE exports (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    filter_criteria TEXT, -- JSON: min_images, licenses, min_quality, species_ids
    train_split REAL DEFAULT 0.8,
    status TEXT DEFAULT 'pending', -- pending, generating, completed, failed
    file_path TEXT,
    file_size INTEGER,
    species_count INTEGER,
    image_count INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    completed_at TIMESTAMP
);
```
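One caveat on the schema: an external-content FTS5 table (`content='species'`) is not updated automatically when the base table changes, so the app needs sync triggers. A sketch covering insert and delete (an update trigger follows the same pattern; the helper name is ours):

```python
import sqlite3

# Standard FTS5 external-content sync pattern: the 'delete' command row
# removes the old index entry before/after base-table changes.
SYNC_TRIGGERS = """
CREATE TRIGGER IF NOT EXISTS species_ai AFTER INSERT ON species BEGIN
  INSERT INTO species_fts(rowid, scientific_name, common_name, genus)
  VALUES (new.id, new.scientific_name, new.common_name, new.genus);
END;
CREATE TRIGGER IF NOT EXISTS species_ad AFTER DELETE ON species BEGIN
  INSERT INTO species_fts(species_fts, rowid, scientific_name, common_name, genus)
  VALUES ('delete', old.id, old.scientific_name, old.common_name, old.genus);
END;
"""

def search_species(conn: sqlite3.Connection, query: str) -> list:
    """Full-text search via the FTS index, joining back to the base table."""
    return conn.execute(
        "SELECT species.* FROM species_fts "
        "JOIN species ON species.id = species_fts.rowid "
        "WHERE species_fts MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
```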
## API Endpoints

### Species

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/species | List species (paginated, searchable) |
| POST | /api/species | Create single species |
| POST | /api/species/import | Bulk import from CSV |
| GET | /api/species/{id} | Get species details |
| PUT | /api/species/{id} | Update species |
| DELETE | /api/species/{id} | Delete species |

### API Keys

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/sources | List configured sources |
| PUT | /api/sources/{source} | Update source config (key, rate limit) |

### Jobs

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/jobs | List jobs |
| POST | /api/jobs | Create scrape job |
| GET | /api/jobs/{id} | Get job status |
| POST | /api/jobs/{id}/pause | Pause job |
| POST | /api/jobs/{id}/resume | Resume job |
| POST | /api/jobs/{id}/cancel | Cancel job |

### Images

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/images | List images (paginated, filterable) |
| GET | /api/images/{id} | Get image details |
| DELETE | /api/images/{id} | Delete image |
| POST | /api/images/bulk-delete | Bulk delete |

### Export

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/exports | List exports |
| POST | /api/exports | Create export job |
| GET | /api/exports/{id} | Get export status |
| GET | /api/exports/{id}/download | Download zip file |

### Stats

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/stats | Dashboard statistics |
| GET | /api/stats/sources | Per-source breakdown |
| GET | /api/stats/species | Per-species image counts |
## UI Screens

### 1. Dashboard

- Total species, images by source, images by license
- Active jobs with progress bars
- Quick stats: images/sec, disk usage
- Recent activity feed

### 2. Species Management

- Table: scientific name, common name, genus, image count
- Import CSV button (drag-and-drop)
- Search/filter by name, genus
- Bulk select → "Start Scrape" button
- Inline editing

### 3. API Keys

- Card per source with:
  - API key input (masked)
  - API secret input (if applicable)
  - Rate limit slider
  - Enable/disable toggle
  - Test connection button

### 4. Image Browser

- Grid view with thumbnails (lazy-loaded)
- Filters sidebar:
  - Species (autocomplete)
  - Source (checkboxes)
  - License (checkboxes)
  - Quality score (range slider)
  - Status (tabs: all, pending, downloaded, rejected)
- Sort by: date, quality, species
- Bulk select → actions (delete, re-queue)
- Click to view full size + metadata

### 5. Jobs

- Table: name, source, status, progress, dates
- Real-time progress updates (WebSocket)
- Actions: pause, resume, cancel, view logs

### 6. Export

- Filter builder:
  - Min images per species
  - License whitelist
  - Min quality score
  - Species selection (all or specific)
- Train/test split slider (default 80/20)
- Preview: estimated species count, image count, file size
- "Generate Zip" button
- Download history with re-download links
## Tradeoffs

| Decision | Alternative | Why This Choice |
|---|---|---|
| SQLite | PostgreSQL | Single user, simpler Docker setup, sufficient for millions of rows |
| Celery + Redis | RQ, Dramatiq | Battle-tested, good monitoring (Flower) |
| React | Vue, Svelte | Largest ecosystem, more component libraries |
| Separate workers | Threads in FastAPI | Better isolation; workers can scale independently |
| Nginx reverse proxy | Traefik | Simpler config for single-app deployment |
## Risks & Mitigations

| Risk | Likelihood | Mitigation |
|---|---|---|
| iNaturalist rate limits (5GB/hr media) | High | Throttle downloads; prioritize species with low image counts |
| Disk fills up | Medium | Dashboard shows disk usage; configurable storage limits |
| Scrape jobs crash mid-run | Medium | Job state in DB; resume from last checkpoint |
| Perceptual hash collisions | Low | Store hash; allow manual review of flagged duplicates |
| API keys exposed | Low | Environment variables, not stored in code |
| SQLite write contention | Low | WAL mode; single-writer pattern via Celery |
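The WAL mitigation amounts to a pragma at connection time, so dashboard reads don't block the single Celery writer. A sketch (the helper name is ours):

```python
import sqlite3

def connect(db_path: str) -> sqlite3.Connection:
    """Open SQLite tuned for one writer (Celery) plus concurrent readers (API)."""
    conn = sqlite3.connect(db_path, timeout=30)   # wait out brief write locks
    conn.execute("PRAGMA journal_mode=WAL")       # persistent; stored in the db file
    conn.execute("PRAGMA synchronous=NORMAL")     # safe with WAL, fewer fsyncs
    conn.execute("PRAGMA foreign_keys=ON")        # enforce species_id references
    return conn
```

Note that WAL mode is a property of the database file, so setting it once from any connection covers both the API and worker containers, as long as they share the same volume.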
## Implementation Phases

1. Foundation
2. Core Data Management
3. iNaturalist Scraper
4. Quality Pipeline
5. Image Browser
6. Additional Scrapers
7. Export
8. Dashboard & Polish
## File Structure

## Environment Variables

## Commands