# Houseplant Image Scraper - Master Plan

## Overview

Web-based interface for managing a multi-source image scraping pipeline targeting 5-10K houseplant species with 1-5M total images. Runs on Unraid via Docker, exports datasets for CoreML training.

---

## Requirements Summary

| Requirement | Value |
|-------------|-------|
| Platform | Web app in Docker on Unraid |
| Sources | iNaturalist/GBIF, Flickr, Wikimedia Commons, Trefle, USDA PLANTS, EOL |
| API keys | Configurable per service |
| Species list | Manual import (CSV/paste) |
| Grouping | Species, genus, source, license (faceted) |
| Search/filter | Yes |
| Quality filter | Automatic (hash dedup, blur, size) |
| Progress | Real-time dashboard |
| Storage | `/species_name/image.jpg` + SQLite DB |
| Export | Filtered zip for CoreML, downloadable anytime |
| Auth | None (single user) |
| Deployment | Docker Compose |

---

## Create ML Export Requirements

Per [Apple's documentation](https://developer.apple.com/documentation/createml/creating-an-image-classifier-model):

- **Folder structure**: `/SpeciesName/image001.jpg` (folder name = class label)
- **Train/Test split**: 80/20 recommended, separate folders
- **Balance**: Roughly equal images per class (avoid bias)
- **No metadata needed**: Create ML uses folder names as labels

### Export Format

```
dataset_export/
├── Training/
│   ├── Monstera_deliciosa/
│   │   ├── img001.jpg
│   │   └── ...
│   ├── Philodendron_hederaceum/
│   └── ...
└── Testing/
    ├── Monstera_deliciosa/
    └── ...
```

---

## Data Sources

| Source | API/Method | License Filter | Rate Limits | Notes |
|--------|------------|----------------|-------------|-------|
| **iNaturalist/GBIF** | Bulk DwC-A export + API | CC0, CC-BY | 1 req/sec, 10k/day, 5GB/hr media | Best source: Research Grade = verified |
| **Flickr** | flickr.photos.search | license=4,9 (CC-BY, CC0) | 3600 req/hr | Good supplemental |
| **Wikimedia Commons** | MediaWiki API + pyWikiCommons | CC-BY, CC-BY-SA, PD | Generous | Category-based search |
| **Trefle.io** | REST API | Open source | Free tier | Species metadata + some images |
| **USDA PLANTS** | REST API | Public Domain | Generous | US-focused, limited images |
| **Plant.id** | REST API | Commercial | Paid | For validation, not scraping |
| **Encyclopedia of Life** | API | Mixed | Check each | Aggregator |

### Source References

- iNaturalist: https://www.inaturalist.org/pages/developers
- iNaturalist bulk download: https://forum.inaturalist.org/t/one-time-bulk-download-dataset/18741
- Flickr API: https://www.flickr.com/services/api/flickr.photos.search.html
- Wikimedia Commons API: https://commons.wikimedia.org/wiki/Commons:API
- pyWikiCommons: https://pypi.org/project/pyWikiCommons/
- Trefle.io: https://trefle.io/
- USDA PLANTS: https://data.nal.usda.gov/dataset/usda-plants-database-api-r

### Flickr License IDs

| ID | License |
|----|---------|
| 0 | All Rights Reserved |
| 1 | CC BY-NC-SA 2.0 |
| 2 | CC BY-NC 2.0 |
| 3 | CC BY-NC-ND 2.0 |
| 4 | CC BY 2.0 (Commercial OK) |
| 5 | CC BY-SA 2.0 |
| 6 | CC BY-ND 2.0 |
| 7 | No known copyright restrictions |
| 8 | United States Government Work |
| 9 | Public Domain (CC0) |

**For commercial use**: Filter to license IDs 4, 7, 8, 9 only.
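The license filter above can be wired directly into the Flickr query. A minimal stdlib-only sketch of building a `flickr.photos.search` request restricted to the commercially usable IDs (4, 7, 8, 9); the function names are illustrative and the response field handling should be verified against the Flickr API docs before use:

```python
# Sketch: one page of permissively licensed Flickr photos for a species.
# build_search_params() is a hypothetical helper, not part of the plan.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

FLICKR_REST = "https://api.flickr.com/services/rest/"
COMMERCIAL_OK = (4, 7, 8, 9)  # CC BY, no known restrictions, US Gov work, CC0


def build_search_params(api_key: str, species: str, page: int = 1) -> dict:
    """Query params for flickr.photos.search limited to permissive licenses."""
    return {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "text": species,
        "license": ",".join(map(str, COMMERCIAL_OK)),
        "extras": "license,owner_name,url_c",  # url_c: medium-size download URL
        "per_page": 500,                       # Flickr's documented maximum
        "page": page,
        "format": "json",
        "nojsoncallback": 1,
    }


def search_photos(api_key: str, species: str, page: int = 1) -> list[dict]:
    """Fetch one page of results; each dict carries license + attribution."""
    url = FLICKR_REST + "?" + urlencode(build_search_params(api_key, species, page))
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)["photos"]["photo"]
```

Requesting `license` and `owner_name` in `extras` lets the scraper fill the `license` and `attribution` columns of the `images` table in the same pass, with no follow-up metadata calls.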
---

## Image Quality Pipeline

| Stage | Library | Purpose |
|-------|---------|---------|
| **Deduplication** | imagededup | Perceptual hash (CNN + hash methods) |
| **Blur detection** | scipy + Sobel variance | Reject blurry images |
| **Size filter** | Pillow | Min 256x256 |
| **Resize** | Pillow | Normalize to 512x512 |

### Library References

- imagededup: https://github.com/idealo/imagededup
- imagehash: https://github.com/JohannesBuchner/imagehash

---

## Technology Stack

| Component | Choice | Rationale |
|-----------|--------|-----------|
| **Backend** | FastAPI (Python) | Async, fast, ML ecosystem, great docs |
| **Frontend** | React + Tailwind | Fast dev, good component libraries |
| **Database** | SQLite (+ FTS5) | Simple, no separate container, sufficient for single-user |
| **Task Queue** | Celery + Redis | Long-running scrape jobs, good monitoring |
| **Containers** | Docker Compose | Multi-service orchestration |

Reference: https://github.com/fastapi/full-stack-fastapi-template

---

## Architecture

```
┌─────────────────────────────────────────────────────────────────────────┐
│                        DOCKER COMPOSE ON UNRAID                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌─────────────┐    ┌─────────────────────────────────────────────────┐ │
│  │    NGINX    │    │               FASTAPI BACKEND                   │ │
│  │     :80     │───▶│  /api/species  - CRUD species list              │ │
│  │             │    │  /api/sources  - API key management             │ │
│  └──────┬──────┘    │  /api/jobs     - Scrape job control             │ │
│         │           │  /api/images   - Search, filter, browse         │ │
│         ▼           │  /api/export   - Generate zip for CoreML        │ │
│  ┌─────────────┐    │  /api/stats    - Dashboard metrics              │ │
│  │    REACT    │    └─────────────────────────────────────────────────┘ │
│  │     SPA     │                          │                             │
│  │    :3000    │                          ▼                             │
│  └─────────────┘    ┌─────────────────────────────────────────────────┐ │
│                     │               CELERY WORKERS                    │ │
│  ┌─────────────┐    │  - iNaturalist scraper                          │ │
│  │    REDIS    │◀───│  - Flickr scraper                               │ │
│  │    :6379    │    │  - Wikimedia scraper                            │ │
│  └─────────────┘    │  - Quality filter pipeline                      │ │
│                     │  - Export generator                             │ │
│                     └─────────────────────────────────────────────────┘ │
│                                          │                              │
│                                          ▼                              │
│  ┌─────────────────────────────────────────────────────────────────────┐│
│  │                      STORAGE (Bind Mounts)                          ││
│  │  /data/db/plants.sqlite   - Species, images metadata, jobs          ││
│  │  /data/images/{species}/  - Downloaded images                       ││
│  │  /data/exports/           - Generated zip files                     ││
│  └─────────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────┘
```

---

## Database Schema

```sql
-- Species master list (imported from CSV)
CREATE TABLE species (
    id INTEGER PRIMARY KEY,
    scientific_name TEXT UNIQUE NOT NULL,
    common_name TEXT,
    genus TEXT,
    family TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Full-text search index
CREATE VIRTUAL TABLE species_fts USING fts5(
    scientific_name, common_name, genus,
    content='species', content_rowid='id'
);

-- API credentials
CREATE TABLE api_keys (
    id INTEGER PRIMARY KEY,
    source TEXT UNIQUE NOT NULL,  -- 'flickr', 'inaturalist', 'wikimedia', 'trefle'
    api_key TEXT NOT NULL,
    api_secret TEXT,
    rate_limit_per_sec REAL DEFAULT 1.0,
    enabled BOOLEAN DEFAULT TRUE
);

-- Downloaded images
CREATE TABLE images (
    id INTEGER PRIMARY KEY,
    species_id INTEGER REFERENCES species(id),
    source TEXT NOT NULL,
    source_id TEXT,                 -- Original ID from source
    url TEXT NOT NULL,
    local_path TEXT,
    license TEXT NOT NULL,
    attribution TEXT,
    width INTEGER,
    height INTEGER,
    phash TEXT,                     -- Perceptual hash for dedup
    quality_score REAL,             -- Blur/quality metric
    status TEXT DEFAULT 'pending',  -- pending, downloaded, rejected, deleted
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(source, source_id)
);

-- Indexes for common queries
CREATE INDEX idx_images_species ON images(species_id);
CREATE INDEX idx_images_status ON images(status);
CREATE INDEX idx_images_source ON images(source);
CREATE INDEX idx_images_phash ON images(phash);

-- Scrape jobs
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    source TEXT NOT NULL,
    species_filter TEXT,            -- JSON array of species IDs or NULL for all
    status TEXT DEFAULT 'pending',  -- pending, running, paused, completed, failed
    progress_current INTEGER DEFAULT 0,
    progress_total INTEGER DEFAULT 0,
    images_downloaded INTEGER DEFAULT 0,
    images_rejected INTEGER DEFAULT 0,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    error_message TEXT
);

-- Export jobs
CREATE TABLE exports (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    filter_criteria TEXT,           -- JSON: min_images, licenses, min_quality, species_ids
    train_split REAL DEFAULT 0.8,
    status TEXT DEFAULT 'pending',  -- pending, generating, completed, failed
    file_path TEXT,
    file_size INTEGER,
    species_count INTEGER,
    image_count INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    completed_at TIMESTAMP
);
```

---

## API Endpoints

### Species

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/species` | List species (paginated, searchable) |
| POST | `/api/species` | Create single species |
| POST | `/api/species/import` | Bulk import from CSV |
| GET | `/api/species/{id}` | Get species details |
| PUT | `/api/species/{id}` | Update species |
| DELETE | `/api/species/{id}` | Delete species |

### API Keys

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/sources` | List configured sources |
| PUT | `/api/sources/{source}` | Update source config (key, rate limit) |

### Jobs

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/jobs` | List jobs |
| POST | `/api/jobs` | Create scrape job |
| GET | `/api/jobs/{id}` | Get job status |
| POST | `/api/jobs/{id}/pause` | Pause job |
| POST | `/api/jobs/{id}/resume` | Resume job |
| POST | `/api/jobs/{id}/cancel` | Cancel job |

### Images

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/images` | List images (paginated, filterable) |
| GET | `/api/images/{id}` | Get image details |
| DELETE | `/api/images/{id}` | Delete image |
| POST | `/api/images/bulk-delete` | Bulk delete |

### Export

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/exports` | List exports |
| POST | `/api/exports` | Create export job |
| GET | `/api/exports/{id}` | Get export status |
| GET | `/api/exports/{id}/download` | Download zip file |

### Stats

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/stats` | Dashboard statistics |
| GET | `/api/stats/sources` | Per-source breakdown |
| GET | `/api/stats/species` | Per-species image counts |

---

## UI Screens

### 1. Dashboard

- Total species, images by source, images by license
- Active jobs with progress bars
- Quick stats: images/sec, disk usage
- Recent activity feed

### 2. Species Management

- Table: scientific name, common name, genus, image count
- Import CSV button (drag-and-drop)
- Search/filter by name, genus
- Bulk select → "Start Scrape" button
- Inline editing

### 3. API Keys

- Card per source with:
  - API key input (masked)
  - API secret input (if applicable)
  - Rate limit slider
  - Enable/disable toggle
  - Test connection button

### 4. Image Browser

- Grid view with thumbnails (lazy-loaded)
- Filters sidebar:
  - Species (autocomplete)
  - Source (checkboxes)
  - License (checkboxes)
  - Quality score (range slider)
  - Status (tabs: all, pending, downloaded, rejected)
- Sort by: date, quality, species
- Bulk select → actions (delete, re-queue)
- Click to view full-size + metadata

### 5. Jobs

- Table: name, source, status, progress, dates
- Real-time progress updates (WebSocket)
- Actions: pause, resume, cancel, view logs

### 6. Export

- Filter builder:
  - Min images per species
  - License whitelist
  - Min quality score
  - Species selection (all or specific)
- Train/test split slider (default 80/20)
- Preview: estimated species count, image count, file size
- "Generate Zip" button
- Download history with re-download links

---

## Tradeoffs

| Decision | Alternative | Why This Choice |
|----------|-------------|-----------------|
| SQLite | PostgreSQL | Single-user, simpler Docker setup, sufficient for millions of rows |
| Celery+Redis | RQ, Dramatiq | Battle-tested, good monitoring (Flower) |
| React | Vue, Svelte | Largest ecosystem, more component libraries |
| Separate workers | Threads in FastAPI | Better isolation, can scale workers independently |
| Nginx reverse proxy | Traefik | Simpler config for single-app deployment |

---

## Risks & Mitigations

| Risk | Likelihood | Mitigation |
|------|------------|------------|
| iNaturalist rate limits (5GB/hr) | High | Throttle downloads, prioritize species with low counts |
| Disk fills up | Medium | Dashboard shows disk usage, configurable storage limits |
| Scrape jobs crash mid-run | Medium | Job state in DB, resume from last checkpoint |
| Perceptual hash collisions | Low | Store hash, allow manual review of flagged duplicates |
| API keys exposed | Low | Environment variables, not stored in code |
| SQLite write contention | Low | WAL mode, single writer pattern via Celery |

---

## Implementation Phases

### Phase 1: Foundation

- [ ] Docker Compose setup (FastAPI, React, Redis, Nginx)
- [ ] Database schema + migrations (Alembic)
- [ ] Basic FastAPI skeleton with health checks
- [ ] React app scaffolding with Tailwind

### Phase 2: Core Data Management

- [ ] Species CRUD API
- [ ] CSV import endpoint
- [ ] Species list UI with search/filter
- [ ] API keys management UI

### Phase 3: iNaturalist Scraper

- [ ] Celery worker setup
- [ ] iNaturalist/GBIF scraper task
- [ ] Job management API
- [ ] Real-time progress (WebSocket or polling)

### Phase 4: Quality Pipeline

- [ ] Image download worker
- [ ] Perceptual hash deduplication
- [ ] Blur detection + quality scoring
- [ ] Resize to 512x512

### Phase 5: Image Browser

- [ ] Image listing API with filters
- [ ] Thumbnail generation
- [ ] Grid view UI
- [ ] Bulk operations

### Phase 6: Additional Scrapers

- [ ] Flickr scraper
- [ ] Wikimedia Commons scraper
- [ ] Trefle scraper (metadata + images)
- [ ] USDA PLANTS scraper

### Phase 7: Export

- [ ] Export job API
- [ ] Train/test split logic
- [ ] Zip generation worker
- [ ] Download endpoint
- [ ] Export UI with filters

### Phase 8: Dashboard & Polish

- [ ] Stats API
- [ ] Dashboard UI with charts
- [ ] Job monitoring UI
- [ ] Error handling + logging
- [ ] Documentation

---

## File Structure

```
PlantGuideScraper/
├── docker-compose.yml
├── .env.example
├── docs/
│   └── master_plan.md
├── backend/
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── alembic/
│   │   └── versions/
│   ├── app/
│   │   ├── __init__.py
│   │   ├── main.py
│   │   ├── config.py
│   │   ├── database.py
│   │   ├── models/
│   │   │   ├── species.py
│   │   │   ├── image.py
│   │   │   ├── job.py
│   │   │   └── export.py
│   │   ├── schemas/
│   │   │   └── ...
│   │   ├── api/
│   │   │   ├── species.py
│   │   │   ├── images.py
│   │   │   ├── jobs.py
│   │   │   ├── exports.py
│   │   │   └── stats.py
│   │   ├── scrapers/
│   │   │   ├── base.py
│   │   │   ├── inaturalist.py
│   │   │   ├── flickr.py
│   │   │   ├── wikimedia.py
│   │   │   └── trefle.py
│   │   ├── workers/
│   │   │   ├── celery_app.py
│   │   │   ├── scrape_tasks.py
│   │   │   ├── quality_tasks.py
│   │   │   └── export_tasks.py
│   │   └── utils/
│   │       ├── image_quality.py
│   │       └── dedup.py
│   └── tests/
├── frontend/
│   ├── Dockerfile
│   ├── package.json
│   ├── src/
│   │   ├── App.tsx
│   │   ├── components/
│   │   ├── pages/
│   │   │   ├── Dashboard.tsx
│   │   │   ├── Species.tsx
│   │   │   ├── Images.tsx
│   │   │   ├── Jobs.tsx
│   │   │   ├── Export.tsx
│   │   │   └── Settings.tsx
│   │   ├── hooks/
│   │   └── api/
│   └── public/
├── nginx/
│   └── nginx.conf
└── data/                 # Bind mount (not in repo)
    ├── db/
    ├── images/
    └── exports/
```

---

## Environment Variables

```bash
# Backend
DATABASE_URL=sqlite:///data/db/plants.sqlite
REDIS_URL=redis://redis:6379/0
IMAGES_PATH=/data/images
EXPORTS_PATH=/data/exports

# API Keys (user-provided)
FLICKR_API_KEY=
FLICKR_API_SECRET=
INATURALIST_APP_ID=
INATURALIST_APP_SECRET=
TREFLE_API_KEY=

# Optional
LOG_LEVEL=INFO
CELERY_CONCURRENCY=4
```

---

## Commands

```bash
# Development
docker-compose up --build

# Production
docker-compose -f docker-compose.yml -f docker-compose.prod.yml up -d

# Run migrations
docker-compose exec backend alembic upgrade head

# View Celery logs
docker-compose logs -f celery

# Access Redis CLI
docker-compose exec redis redis-cli
```
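
---

## Appendix: Train/Test Split Sketch

The Phase 7 train/test split logic can be sketched as a small stdlib-only function: walk `/data/images/{species}/`, shuffle each species' files with a fixed seed, and copy them into the `Training/` and `Testing/` folders of the Create ML layout. The function name and seed are illustrative, not part of the plan.

```python
# Sketch: copy images into the Create ML folder layout with an 80/20 split.
import random
import shutil
from pathlib import Path


def export_split(images_root: Path, export_root: Path,
                 train_ratio: float = 0.8, seed: int = 42) -> dict:
    """Copy images into Training/ and Testing/; return per-subset counts."""
    rng = random.Random(seed)  # fixed seed so re-exports are reproducible
    counts = {"Training": 0, "Testing": 0}
    for species_dir in sorted(p for p in Path(images_root).iterdir() if p.is_dir()):
        files = sorted(species_dir.glob("*.jpg"))
        rng.shuffle(files)  # shuffle per species so the split is unbiased
        cut = int(len(files) * train_ratio)
        for subset, chunk in (("Training", files[:cut]), ("Testing", files[cut:])):
            dest = Path(export_root) / subset / species_dir.name
            dest.mkdir(parents=True, exist_ok=True)
            for f in chunk:
                shutil.copy2(f, dest / f.name)
            counts[subset] += len(chunk)
    return counts
```

Splitting within each species (rather than globally) keeps every class represented in both `Training/` and `Testing/`, matching the balance requirement from the Create ML section. The export worker would then zip `export_root` and record the counts in the `exports` table.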