PlantGuideScraper/docs/master_plan.md
2026-04-12 09:54:27 -05:00

# Houseplant Image Scraper - Master Plan
## Overview
Web-based interface for managing a multi-source image scraping pipeline targeting 5-10K houseplant species with 1-5M total images. Runs on Unraid via Docker, exports datasets for CoreML training.
---
## Requirements Summary
| Requirement | Value |
|-------------|-------|
| Platform | Web app in Docker on Unraid |
| Sources | iNaturalist/GBIF, Flickr, Wikimedia Commons, Trefle, USDA PLANTS, EOL |
| API keys | Configurable per service |
| Species list | Manual import (CSV/paste) |
| Grouping | Species, genus, source, license (faceted) |
| Search/filter | Yes |
| Quality filter | Automatic (hash dedup, blur, size) |
| Progress | Real-time dashboard |
| Storage | `/species_name/image.jpg` + SQLite DB |
| Export | Filtered zip for CoreML, downloadable anytime |
| Auth | None (single user) |
| Deployment | Docker Compose |
---
## Create ML Export Requirements
Per [Apple's documentation](https://developer.apple.com/documentation/createml/creating-an-image-classifier-model):
- **Folder structure**: `/SpeciesName/image001.jpg` (folder name = class label)
- **Train/Test split**: 80/20 recommended, separate folders
- **Balance**: Roughly equal images per class (avoid bias)
- **No metadata needed**: Create ML uses folder names as labels
### Export Format
```
dataset_export/
├── Training/
│   ├── Monstera_deliciosa/
│   │   ├── img001.jpg
│   │   └── ...
│   ├── Philodendron_hederaceum/
│   └── ...
└── Testing/
    ├── Monstera_deliciosa/
    └── ...
```
---
## Data Sources
| Source | API/Method | License Filter | Rate Limits | Notes |
|--------|------------|----------------|-------------|-------|
| **iNaturalist/GBIF** | Bulk DwC-A export + API | CC0, CC-BY | 1 req/sec, 10k/day, 5GB/hr media | Best source: Research Grade = verified |
| **Flickr** | flickr.photos.search | license=4,9 (CC-BY, CC0) | 3600 req/hr | Good supplemental |
| **Wikimedia Commons** | MediaWiki API + pyWikiCommons | CC-BY, CC-BY-SA, PD | Generous | Category-based search |
| **Trefle.io** | REST API | Open source | Free tier | Species metadata + some images |
| **USDA PLANTS** | REST API | Public Domain | Generous | US-focused, limited images |
| **Plant.id** | REST API | Commercial | Paid | For validation, not scraping |
| **Encyclopedia of Life** | API | Mixed | Check each | Aggregator |
### Source References
- iNaturalist: https://www.inaturalist.org/pages/developers
- iNaturalist bulk download: https://forum.inaturalist.org/t/one-time-bulk-download-dataset/18741
- Flickr API: https://www.flickr.com/services/api/flickr.photos.search.html
- Wikimedia Commons API: https://commons.wikimedia.org/wiki/Commons:API
- pyWikiCommons: https://pypi.org/project/pyWikiCommons/
- Trefle.io: https://trefle.io/
- USDA PLANTS: https://data.nal.usda.gov/dataset/usda-plants-database-api-r
### Flickr License IDs
| ID | License |
|----|---------|
| 0 | All Rights Reserved |
| 1 | CC BY-NC-SA 2.0 |
| 2 | CC BY-NC 2.0 |
| 3 | CC BY-NC-ND 2.0 |
| 4 | CC BY 2.0 (Commercial OK) |
| 5 | CC BY-SA 2.0 |
| 6 | CC BY-ND 2.0 |
| 7 | No known copyright restrictions |
| 8 | United States Government Work |
| 9 | Public Domain (CC0) |
**For commercial use**: Filter to license IDs 4, 7, 8, 9 only.
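A request against `flickr.photos.search` with that license filter might be built like this (no network call shown; the helper name is made up, and the exact `extras` fields to request are a choice, not a requirement):

```python
from urllib.parse import urlencode

FLICKR_ENDPOINT = "https://api.flickr.com/services/rest/"
COMMERCIAL_LICENSES = "4,7,8,9"  # CC BY, no known restrictions, US Gov work, CC0/PD

def flickr_search_url(api_key: str, species: str, page: int = 1) -> str:
    """Build a flickr.photos.search URL restricted to commercial-friendly licenses."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "text": species,
        "license": COMMERCIAL_LICENSES,        # comma-separated license IDs
        "extras": "license,owner_name,url_o",  # license + attribution + original URL
        "per_page": 100,
        "page": page,
        "format": "json",
        "nojsoncallback": 1,
    }
    return f"{FLICKR_ENDPOINT}?{urlencode(params)}"
```

Requesting `license` and `owner_name` in `extras` lets the scraper fill the `license` and `attribution` columns in one pass instead of fetching per-photo info.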
---
## Image Quality Pipeline
| Stage | Library | Purpose |
|-------|---------|---------|
| **Deduplication** | imagededup | Perceptual hash (CNN + hash methods) |
| **Blur detection** | scipy + Sobel variance | Reject blurry images |
| **Size filter** | Pillow | Min 256x256 |
| **Resize** | Pillow | Normalize to 512x512 |
### Library References
- imagededup: https://github.com/idealo/imagededup
- imagehash: https://github.com/JohannesBuchner/imagehash
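The Sobel-variance blur check can be sketched as below. This version uses plain numpy so it is self-contained; in the pipeline, `scipy.ndimage.sobel` would replace the hand-rolled convolution. The rejection threshold is not fixed here because it has to be tuned empirically against the dataset.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

def _convolve2d(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Minimal valid-mode 2-D convolution (stand-in for scipy.ndimage.sobel)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def sharpness_score(gray: np.ndarray) -> float:
    """Variance of the Sobel gradient magnitude; low values suggest blur."""
    gx = _convolve2d(gray, SOBEL_X)      # horizontal edges
    gy = _convolve2d(gray, SOBEL_X.T)    # vertical edges
    return float(np.var(np.hypot(gx, gy)))
```

Blurring an image spreads its edges out, which flattens the gradient-magnitude distribution and drives the variance down; that is the quantity the `quality_score` column would store.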
---
## Technology Stack
| Component | Choice | Rationale |
|-----------|--------|-----------|
| **Backend** | FastAPI (Python) | Async, fast, ML ecosystem, great docs |
| **Frontend** | React + Tailwind | Fast dev, good component libraries |
| **Database** | SQLite (+ FTS5) | Simple, no separate container, sufficient for single-user |
| **Task Queue** | Celery + Redis | Long-running scrape jobs, good monitoring |
| **Containers** | Docker Compose | Multi-service orchestration |
Reference: https://github.com/fastapi/full-stack-fastapi-template
---
## Architecture
```
                     DOCKER COMPOSE ON UNRAID

┌─────────────┐      ┌──────────────────────────────────────────┐
│    NGINX    │      │             FASTAPI BACKEND              │
│     :80     │─────▶│  /api/species - CRUD species list        │
└──────┬──────┘      │  /api/sources - API key management       │
       │             │  /api/jobs    - Scrape job control       │
       ▼             │  /api/images  - Search, filter, browse   │
┌─────────────┐      │  /api/export  - Generate zip for CoreML  │
│    REACT    │      │  /api/stats   - Dashboard metrics        │
│     SPA     │      └─────────────────────┬────────────────────┘
│    :3000    │                            │
└─────────────┘                            ▼
                     ┌──────────────────────────────────────────┐
┌─────────────┐      │              CELERY WORKERS              │
│    REDIS    │◀─────│  - iNaturalist scraper                   │
│    :6379    │      │  - Flickr scraper                        │
└─────────────┘      │  - Wikimedia scraper                     │
                     │  - Quality filter pipeline               │
                     │  - Export generator                      │
                     └─────────────────────┬────────────────────┘
                                           │
                                           ▼
┌──────────────────────────────────────────────────────────────┐
│                    STORAGE (Bind Mounts)                     │
│  /data/db/plants.sqlite  - Species, images metadata, jobs    │
│  /data/images/{species}/ - Downloaded images                 │
│  /data/exports/          - Generated zip files               │
└──────────────────────────────────────────────────────────────┘
```
---
## Database Schema
```sql
-- Species master list (imported from CSV)
CREATE TABLE species (
    id INTEGER PRIMARY KEY,
    scientific_name TEXT UNIQUE NOT NULL,
    common_name TEXT,
    genus TEXT,
    family TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Full-text search index
CREATE VIRTUAL TABLE species_fts USING fts5(
    scientific_name,
    common_name,
    genus,
    content='species',
    content_rowid='id'
);

-- API credentials
CREATE TABLE api_keys (
    id INTEGER PRIMARY KEY,
    source TEXT UNIQUE NOT NULL,        -- 'flickr', 'inaturalist', 'wikimedia', 'trefle'
    api_key TEXT NOT NULL,
    api_secret TEXT,
    rate_limit_per_sec REAL DEFAULT 1.0,
    enabled BOOLEAN DEFAULT TRUE
);

-- Downloaded images
CREATE TABLE images (
    id INTEGER PRIMARY KEY,
    species_id INTEGER REFERENCES species(id),
    source TEXT NOT NULL,
    source_id TEXT,                     -- Original ID from source
    url TEXT NOT NULL,
    local_path TEXT,
    license TEXT NOT NULL,
    attribution TEXT,
    width INTEGER,
    height INTEGER,
    phash TEXT,                         -- Perceptual hash for dedup
    quality_score REAL,                 -- Blur/quality metric
    status TEXT DEFAULT 'pending',      -- pending, downloaded, rejected, deleted
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(source, source_id)
);

-- Indexes for common queries
CREATE INDEX idx_images_species ON images(species_id);
CREATE INDEX idx_images_status ON images(status);
CREATE INDEX idx_images_source ON images(source);
CREATE INDEX idx_images_phash ON images(phash);

-- Scrape jobs
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    source TEXT NOT NULL,
    species_filter TEXT,                -- JSON array of species IDs or NULL for all
    status TEXT DEFAULT 'pending',      -- pending, running, paused, completed, failed
    progress_current INTEGER DEFAULT 0,
    progress_total INTEGER DEFAULT 0,
    images_downloaded INTEGER DEFAULT 0,
    images_rejected INTEGER DEFAULT 0,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    error_message TEXT
);

-- Export jobs
CREATE TABLE exports (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    filter_criteria TEXT,               -- JSON: min_images, licenses, min_quality, species_ids
    train_split REAL DEFAULT 0.8,
    status TEXT DEFAULT 'pending',      -- pending, generating, completed, failed
    file_path TEXT,
    file_size INTEGER,
    species_count INTEGER,
    image_count INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    completed_at TIMESTAMP
);
```
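The `UNIQUE(source, source_id)` constraint is what makes re-running a scrape job idempotent: the same photo seen twice is silently skipped. A sketch of how a worker might use it with the stdlib `sqlite3` module (the helper name and column subset are illustrative):

```python
import sqlite3

def record_image(conn: sqlite3.Connection, species_id: int, source: str,
                 source_id: str, url: str, license_: str) -> bool:
    """Insert an image row; returns False if (source, source_id) was already seen."""
    cur = conn.execute(
        """INSERT OR IGNORE INTO images (species_id, source, source_id, url, license)
           VALUES (?, ?, ?, ?, ?)""",
        (species_id, source, source_id, url, license_),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 when the UNIQUE constraint suppressed the insert
```

The boolean return lets the job loop count new downloads versus skips without a separate lookup query.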
---
## API Endpoints
### Species
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/species` | List species (paginated, searchable) |
| POST | `/api/species` | Create single species |
| POST | `/api/species/import` | Bulk import from CSV |
| GET | `/api/species/{id}` | Get species details |
| PUT | `/api/species/{id}` | Update species |
| DELETE | `/api/species/{id}` | Delete species |
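The `/api/species/import` handler reduces to a CSV-to-rows transform. A sketch under the assumption that the upload carries `scientific_name` and `common_name` columns (the `genus` column is derived from the first word of the binomial; the function name is made up):

```python
import csv
import io

def parse_species_csv(text: str) -> list[dict]:
    """Parse a species CSV into rows ready for bulk insert.

    Assumes scientific_name / common_name headers; derives genus
    from the first word of the scientific name.
    """
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        name = (rec.get("scientific_name") or "").strip()
        if not name:
            continue  # skip blank or malformed lines rather than failing the import
        rows.append({
            "scientific_name": name,
            "common_name": (rec.get("common_name") or "").strip() or None,
            "genus": name.split()[0],
        })
    return rows
```

Skipping bad lines instead of raising keeps a single stray row from aborting a multi-thousand-species import; the endpoint could report the skipped count in its response.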
### API Keys
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/sources` | List configured sources |
| PUT | `/api/sources/{source}` | Update source config (key, rate limit) |
### Jobs
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/jobs` | List jobs |
| POST | `/api/jobs` | Create scrape job |
| GET | `/api/jobs/{id}` | Get job status |
| POST | `/api/jobs/{id}/pause` | Pause job |
| POST | `/api/jobs/{id}/resume` | Resume job |
| POST | `/api/jobs/{id}/cancel` | Cancel job |
### Images
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/images` | List images (paginated, filterable) |
| GET | `/api/images/{id}` | Get image details |
| DELETE | `/api/images/{id}` | Delete image |
| POST | `/api/images/bulk-delete` | Bulk delete |
### Export
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/exports` | List exports |
| POST | `/api/exports` | Create export job |
| GET | `/api/exports/{id}` | Get export status |
| GET | `/api/exports/{id}/download` | Download zip file |
### Stats
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/stats` | Dashboard statistics |
| GET | `/api/stats/sources` | Per-source breakdown |
| GET | `/api/stats/species` | Per-species image counts |
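The per-species stats endpoint maps to a single `GROUP BY` over the schema above. A sketch (the `LEFT JOIN` keeps zero-image species visible, which the dashboard needs to prioritize scraping):

```python
import sqlite3

PER_SPECIES_COUNTS = """
SELECT s.scientific_name,
       COUNT(i.id) AS image_count
FROM species s
LEFT JOIN images i
       ON i.species_id = s.id AND i.status = 'downloaded'
GROUP BY s.id
ORDER BY image_count DESC
"""

def species_counts(conn: sqlite3.Connection) -> list[tuple[str, int]]:
    """Image count per species, including species with zero downloaded images."""
    return conn.execute(PER_SPECIES_COUNTS).fetchall()
```

Filtering on `status = 'downloaded'` inside the join condition (not a `WHERE` clause) is what preserves the zero rows.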
---
## UI Screens
### 1. Dashboard
- Total species, images by source, images by license
- Active jobs with progress bars
- Quick stats: images/sec, disk usage
- Recent activity feed
### 2. Species Management
- Table: scientific name, common name, genus, image count
- Import CSV button (drag-and-drop)
- Search/filter by name, genus
- Bulk select → "Start Scrape" button
- Inline editing
### 3. API Keys
- Card per source with:
- API key input (masked)
- API secret input (if applicable)
- Rate limit slider
- Enable/disable toggle
- Test connection button
### 4. Image Browser
- Grid view with thumbnails (lazy-loaded)
- Filters sidebar:
- Species (autocomplete)
- Source (checkboxes)
- License (checkboxes)
- Quality score (range slider)
- Status (tabs: all, pending, downloaded, rejected)
- Sort by: date, quality, species
- Bulk select → actions (delete, re-queue)
- Click to view full-size + metadata
### 5. Jobs
- Table: name, source, status, progress, dates
- Real-time progress updates (WebSocket)
- Actions: pause, resume, cancel, view logs
### 6. Export
- Filter builder:
- Min images per species
- License whitelist
- Min quality score
- Species selection (all or specific)
- Train/test split slider (default 80/20)
- Preview: estimated species count, image count, file size
- "Generate Zip" button
- Download history with re-download links
---
## Tradeoffs
| Decision | Alternative | Why This Choice |
|----------|-------------|-----------------|
| SQLite | PostgreSQL | Single-user, simpler Docker setup, sufficient for millions of rows |
| Celery+Redis | RQ, Dramatiq | Battle-tested, good monitoring (Flower) |
| React | Vue, Svelte | Largest ecosystem, more component libraries |
| Separate workers | Threads in FastAPI | Better isolation, can scale workers independently |
| Nginx reverse proxy | Traefik | Simpler config for single-app deployment |
---
## Risks & Mitigations
| Risk | Likelihood | Mitigation |
|------|------------|------------|
| iNaturalist rate limits (5GB/hr) | High | Throttle downloads, prioritize species with low counts |
| Disk fills up | Medium | Dashboard shows disk usage, configurable storage limits |
| Scrape jobs crash mid-run | Medium | Job state in DB, resume from last checkpoint |
| Perceptual hash collisions | Low | Store hash, allow manual review of flagged duplicates |
| API keys exposed | Low | Environment variables, not stored in code |
| SQLite write contention | Low | WAL mode, single writer pattern via Celery |
---
## Implementation Phases
### Phase 1: Foundation
- [ ] Docker Compose setup (FastAPI, React, Redis, Nginx)
- [ ] Database schema + migrations (Alembic)
- [ ] Basic FastAPI skeleton with health checks
- [ ] React app scaffolding with Tailwind
### Phase 2: Core Data Management
- [ ] Species CRUD API
- [ ] CSV import endpoint
- [ ] Species list UI with search/filter
- [ ] API keys management UI
### Phase 3: iNaturalist Scraper
- [ ] Celery worker setup
- [ ] iNaturalist/GBIF scraper task
- [ ] Job management API
- [ ] Real-time progress (WebSocket or polling)
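The scraper task has to honor the per-source limits from the `api_keys` table (`rate_limit_per_sec`, e.g. 1 req/sec for iNaturalist). A minimal blocking limiter the Celery task could call before each request (class name is illustrative):

```python
import time

class RateLimiter:
    """Blocks so consecutive calls are at least 1/rate_per_sec seconds apart."""

    def __init__(self, rate_per_sec: float):
        self.min_interval = 1.0 / rate_per_sec
        self._last = 0.0

    def wait(self) -> None:
        now = time.monotonic()
        sleep_for = self._last + self.min_interval - now
        if sleep_for > 0:
            time.sleep(sleep_for)  # pace the next request
        self._last = time.monotonic()
```

Using `time.monotonic()` avoids surprises from wall-clock adjustments; one limiter instance per source per worker keeps each scraper within its own budget.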
### Phase 4: Quality Pipeline
- [ ] Image download worker
- [ ] Perceptual hash deduplication
- [ ] Blur detection + quality scoring
- [ ] Resize to 512x512
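imagededup/imagehash produce the perceptual hashes stored in `images.phash`; the dedup step itself is then a Hamming-distance comparison over those hex strings. A sketch (the distance threshold of 6 is a starting guess to tune, and the pairwise scan is fine at per-species scale):

```python
def hamming(phash_a: str, phash_b: str) -> int:
    """Hamming distance between two equal-length hex perceptual hashes."""
    return bin(int(phash_a, 16) ^ int(phash_b, 16)).count("1")

def find_duplicates(phashes: dict[int, str],
                    max_distance: int = 6) -> list[tuple[int, int]]:
    """Pairwise scan for near-duplicate image IDs within one species."""
    items = sorted(phashes.items())
    pairs = []
    for i, (id_a, ha) in enumerate(items):
        for id_b, hb in items[i + 1:]:
            if hamming(ha, hb) <= max_distance:
                pairs.append((id_a, id_b))
    return pairs
```

Flagged pairs would be marked `rejected` in the `images.status` column rather than deleted outright, which leaves room for the manual review mentioned under Risks.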
### Phase 5: Image Browser
- [ ] Image listing API with filters
- [ ] Thumbnail generation
- [ ] Grid view UI
- [ ] Bulk operations
### Phase 6: Additional Scrapers
- [ ] Flickr scraper
- [ ] Wikimedia Commons scraper
- [ ] Trefle scraper (metadata + images)
- [ ] USDA PLANTS scraper
### Phase 7: Export
- [ ] Export job API
- [ ] Train/test split logic
- [ ] Zip generation worker
- [ ] Download endpoint
- [ ] Export UI with filters
### Phase 8: Dashboard & Polish
- [ ] Stats API
- [ ] Dashboard UI with charts
- [ ] Job monitoring UI
- [ ] Error handling + logging
- [ ] Documentation
---
## File Structure
```
PlantGuideScraper/
├── docker-compose.yml
├── .env.example
├── docs/
│   └── master_plan.md
├── backend/
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── alembic/
│   │   └── versions/
│   ├── app/
│   │   ├── __init__.py
│   │   ├── main.py
│   │   ├── config.py
│   │   ├── database.py
│   │   ├── models/
│   │   │   ├── species.py
│   │   │   ├── image.py
│   │   │   ├── job.py
│   │   │   └── export.py
│   │   ├── schemas/
│   │   │   └── ...
│   │   ├── api/
│   │   │   ├── species.py
│   │   │   ├── images.py
│   │   │   ├── jobs.py
│   │   │   ├── exports.py
│   │   │   └── stats.py
│   │   ├── scrapers/
│   │   │   ├── base.py
│   │   │   ├── inaturalist.py
│   │   │   ├── flickr.py
│   │   │   ├── wikimedia.py
│   │   │   └── trefle.py
│   │   ├── workers/
│   │   │   ├── celery_app.py
│   │   │   ├── scrape_tasks.py
│   │   │   ├── quality_tasks.py
│   │   │   └── export_tasks.py
│   │   └── utils/
│   │       ├── image_quality.py
│   │       └── dedup.py
│   └── tests/
├── frontend/
│   ├── Dockerfile
│   ├── package.json
│   ├── src/
│   │   ├── App.tsx
│   │   ├── components/
│   │   ├── pages/
│   │   │   ├── Dashboard.tsx
│   │   │   ├── Species.tsx
│   │   │   ├── Images.tsx
│   │   │   ├── Jobs.tsx
│   │   │   ├── Export.tsx
│   │   │   └── Settings.tsx
│   │   ├── hooks/
│   │   └── api/
│   └── public/
├── nginx/
│   └── nginx.conf
└── data/                    # Bind mount (not in repo)
    ├── db/
    ├── images/
    └── exports/
```
---
## Environment Variables
```bash
# Backend
DATABASE_URL=sqlite:///data/db/plants.sqlite
REDIS_URL=redis://redis:6379/0
IMAGES_PATH=/data/images
EXPORTS_PATH=/data/exports
# API Keys (user-provided)
FLICKR_API_KEY=
FLICKR_API_SECRET=
INATURALIST_APP_ID=
INATURALIST_APP_SECRET=
TREFLE_API_KEY=
# Optional
LOG_LEVEL=INFO
CELERY_CONCURRENCY=4
```
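A sketch of how `config.py` might surface these variables (a plain dataclass over `os.environ`; the real backend could equally use pydantic settings — only the variable names and defaults above are fixed by this plan):

```python
import os
from dataclasses import dataclass, field

def _env(name: str, default: str) -> str:
    return os.environ.get(name, default)

@dataclass(frozen=True)
class Settings:
    """Backend configuration sourced from the environment, with safe defaults."""
    database_url: str = field(
        default_factory=lambda: _env("DATABASE_URL", "sqlite:///data/db/plants.sqlite"))
    redis_url: str = field(
        default_factory=lambda: _env("REDIS_URL", "redis://redis:6379/0"))
    images_path: str = field(default_factory=lambda: _env("IMAGES_PATH", "/data/images"))
    exports_path: str = field(default_factory=lambda: _env("EXPORTS_PATH", "/data/exports"))
    log_level: str = field(default_factory=lambda: _env("LOG_LEVEL", "INFO"))
    celery_concurrency: int = field(
        default_factory=lambda: int(_env("CELERY_CONCURRENCY", "4")))
```

API keys are deliberately left out of this object: per the schema, they live in the `api_keys` table so they can be edited from the Settings UI, with the env vars serving only as initial seed values.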
---
## Commands
```bash
# Development
docker-compose up --build
# Production
docker-compose -f docker-compose.yml -f docker-compose.prod.yml up -d
# Run migrations
docker-compose exec backend alembic upgrade head
# View Celery logs
docker-compose logs -f celery
# Access Redis CLI
docker-compose exec redis redis-cli
```