# Houseplant Image Scraper - Master Plan
## Overview
Web-based interface for managing a multi-source image scraping pipeline targeting 5-10K houseplant species with 1-5M total images. Runs on Unraid via Docker, exports datasets for CoreML training.
---
## Requirements Summary

| Requirement | Value |
|-------------|-------|
| Platform | Web app in Docker on Unraid |
| Sources | iNaturalist/GBIF, Flickr, Wikimedia Commons, Trefle, USDA PLANTS, EOL |
| API keys | Configurable per service |
| Species list | Manual import (CSV/paste) |
| Grouping | Species, genus, source, license (faceted) |
| Search/filter | Yes |
| Quality filter | Automatic (hash dedup, blur, size) |
| Progress | Real-time dashboard |
| Storage | `/species_name/image.jpg` + SQLite DB |
| Export | Filtered zip for CoreML, downloadable anytime |
| Auth | None (single user) |
| Deployment | Docker Compose |

---
## Create ML Export Requirements
Per [Apple's documentation](https://developer.apple.com/documentation/createml/creating-an-image-classifier-model):

- **Folder structure**: `/SpeciesName/image001.jpg` (folder name = class label)
- **Train/Test split**: 80/20 recommended, in separate folders
- **Balance**: Roughly equal images per class (avoid bias)
- **No metadata needed**: Create ML uses folder names as labels

### Export Format

```
dataset_export/
├── Training/
│   ├── Monstera_deliciosa/
│   │   ├── img001.jpg
│   │   └── ...
│   ├── Philodendron_hederaceum/
│   └── ...
└── Testing/
    ├── Monstera_deliciosa/
    └── ...
```

---
## Data Sources

| Source | API/Method | License Filter | Rate Limits | Notes |
|--------|------------|----------------|-------------|-------|
| **iNaturalist/GBIF** | Bulk DwC-A export + API | CC0, CC-BY | 1 req/sec, 10k/day, 5GB/hr media | Best source: Research Grade = verified |
| **Flickr** | flickr.photos.search | license=4,9 (CC-BY, CC0) | 3600 req/hr | Good supplemental |
| **Wikimedia Commons** | MediaWiki API + pyWikiCommons | CC-BY, CC-BY-SA, PD | Generous | Category-based search |
| **Trefle.io** | REST API | Open source | Free tier | Species metadata + some images |
| **USDA PLANTS** | REST API | Public Domain | Generous | US-focused, limited images |
| **Plant.id** | REST API | Commercial | Paid | For validation, not scraping |
| **Encyclopedia of Life** | API | Mixed | Check each | Aggregator |

### Source References

- iNaturalist: https://www.inaturalist.org/pages/developers
- iNaturalist bulk download: https://forum.inaturalist.org/t/one-time-bulk-download-dataset/18741
- Flickr API: https://www.flickr.com/services/api/flickr.photos.search.html
- Wikimedia Commons API: https://commons.wikimedia.org/wiki/Commons:API
- pyWikiCommons: https://pypi.org/project/pyWikiCommons/
- Trefle.io: https://trefle.io/
- USDA PLANTS: https://data.nal.usda.gov/dataset/usda-plants-database-api-r

### Flickr License IDs

| ID | License |
|----|---------|
| 0 | All Rights Reserved |
| 1 | CC BY-NC-SA 2.0 |
| 2 | CC BY-NC 2.0 |
| 3 | CC BY-NC-ND 2.0 |
| 4 | CC BY 2.0 (Commercial OK) |
| 5 | CC BY-SA 2.0 |
| 6 | CC BY-ND 2.0 |
| 7 | No known copyright restrictions |
| 8 | United States Government Work |
| 9 | Public Domain (CC0) |

**For commercial use**: Filter to license IDs 4, 7, 8, 9 only.
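
That filter plugs straight into `flickr.photos.search` (parameters sent as a query string to `https://www.flickr.com/services/rest/`). A minimal sketch of the request builder; the helper name and defaults are illustrative, not the final implementation:

```python
def flickr_search_params(api_key: str, species: str, page: int = 1,
                         commercial_only: bool = True) -> dict:
    """Build query params for a flickr.photos.search REST call.

    License IDs 4, 7, 8, 9 are the commercially usable ones from the
    table above; drop the restriction to accept any CC license.
    """
    licenses = "4,7,8,9" if commercial_only else "1,2,3,4,5,6,7,8,9"
    return {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "text": species,                      # e.g. "Monstera deliciosa"
        "license": licenses,                  # comma-separated license IDs
        "content_type": 1,                    # photos only
        "extras": "license,url_c,owner_name", # needed for attribution records
        "per_page": 500,                      # Flickr's maximum page size
        "page": page,
        "format": "json",
        "nojsoncallback": 1,
    }
```

Requesting `license` and `owner_name` in `extras` lets the scraper fill the `license` and `attribution` columns without a second API call per photo.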
---
## Image Quality Pipeline

| Stage | Library | Purpose |
|-------|---------|---------|
| **Deduplication** | imagededup | Perceptual hash (CNN + hash methods) |
| **Blur detection** | scipy + Sobel variance | Reject blurry images |
| **Size filter** | Pillow | Min 256x256 |
| **Resize** | Pillow | Normalize to 512x512 |

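
The blur stage can be sketched as the variance of the Sobel gradient magnitude: sharp images have strong, varied edges, blurred ones do not. Grayscale conversion via Pillow (e.g. `np.asarray(Image.open(p).convert("L"), dtype=float)`) is assumed upstream, and the reject threshold is a tuning knob per source rather than a fixed value:

```python
import numpy as np
from scipy import ndimage

def sharpness_score(gray: np.ndarray) -> float:
    """Variance of the Sobel gradient magnitude; higher = sharper.

    `gray` is a 2-D float array (a grayscale image).
    """
    gx = ndimage.sobel(gray, axis=0)  # vertical edges
    gy = ndimage.sobel(gray, axis=1)  # horizontal edges
    return float(np.hypot(gx, gy).var())
```

The score would land in the `quality_score` column, so the threshold can be re-applied later without re-downloading anything.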
### Library References

- imagededup: https://github.com/idealo/imagededup
- imagehash: https://github.com/JohannesBuchner/imagehash

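
The production pipeline would lean on the libraries above, but the core idea behind the dedup stage is small enough to sketch with NumPy alone: compute a compact perceptual hash, then treat images whose hashes differ in only a few bits as duplicates. This average-hash sketch is illustrative of the technique, not what imagededup implements internally:

```python
import numpy as np

def average_hash(gray: np.ndarray, hash_size: int = 8) -> int:
    """Minimal average hash: block-average down to hash_size x hash_size,
    then set one bit per cell that is brighter than the mean."""
    h, w = gray.shape
    bh, bw = h // hash_size, w // hash_size
    small = (gray[: bh * hash_size, : bw * hash_size]
             .reshape(hash_size, bh, hash_size, bw)
             .mean(axis=(1, 3)))
    bits = (small > small.mean()).ravel()
    return int("".join("1" if b else "0" for b in bits), 2)

def hamming(a: int, b: int) -> int:
    """Number of differing bits; a small distance means likely duplicates."""
    return bin(a ^ b).count("1")
```

In the pipeline, the hex-encoded hash would be stored in `images.phash` and compared against existing rows (distance below a small, tuned threshold) before accepting a download.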
---
## Technology Stack

| Component | Choice | Rationale |
|-----------|--------|-----------|
| **Backend** | FastAPI (Python) | Async, fast, ML ecosystem, great docs |
| **Frontend** | React + Tailwind | Fast dev, good component libraries |
| **Database** | SQLite (+ FTS5) | Simple, no separate container, sufficient for single-user |
| **Task Queue** | Celery + Redis | Long-running scrape jobs, good monitoring |
| **Containers** | Docker Compose | Multi-service orchestration |

Reference: https://github.com/fastapi/full-stack-fastapi-template
---
## Architecture

```
┌─────────────────────────────────────────────────────────────────────────┐
│                        DOCKER COMPOSE ON UNRAID                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌─────────────┐    ┌─────────────────────────────────────────────────┐ │
│  │    NGINX    │    │                 FASTAPI BACKEND                 │ │
│  │     :80     │───▶│  /api/species  - CRUD species list              │ │
│  │             │    │  /api/sources  - API key management             │ │
│  └──────┬──────┘    │  /api/jobs     - Scrape job control             │ │
│         │           │  /api/images   - Search, filter, browse        │ │
│         ▼           │  /api/export   - Generate zip for CoreML       │ │
│  ┌─────────────┐    │  /api/stats    - Dashboard metrics             │ │
│  │    REACT    │    └─────────────────────────────────────────────────┘ │
│  │     SPA     │                             │                          │
│  │    :3000    │                             ▼                          │
│  └─────────────┘    ┌─────────────────────────────────────────────────┐ │
│                     │                 CELERY WORKERS                  │ │
│  ┌─────────────┐    │  - iNaturalist scraper                          │ │
│  │    REDIS    │◀───│  - Flickr scraper                               │ │
│  │    :6379    │    │  - Wikimedia scraper                            │ │
│  └─────────────┘    │  - Quality filter pipeline                      │ │
│                     │  - Export generator                             │ │
│                     └─────────────────────────────────────────────────┘ │
│                                             │                           │
│                                             ▼                           │
│  ┌─────────────────────────────────────────────────────────────────────┐│
│  │                        STORAGE (Bind Mounts)                        ││
│  │  /data/db/plants.sqlite   - Species, image metadata, jobs           ││
│  │  /data/images/{species}/  - Downloaded images                       ││
│  │  /data/exports/           - Generated zip files                     ││
│  └─────────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────┘
```

---
## Database Schema

```sql
-- Species master list (imported from CSV)
CREATE TABLE species (
    id INTEGER PRIMARY KEY,
    scientific_name TEXT UNIQUE NOT NULL,
    common_name TEXT,
    genus TEXT,
    family TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Full-text search index
CREATE VIRTUAL TABLE species_fts USING fts5(
    scientific_name,
    common_name,
    genus,
    content='species',
    content_rowid='id'
);

-- API credentials
CREATE TABLE api_keys (
    id INTEGER PRIMARY KEY,
    source TEXT UNIQUE NOT NULL,       -- 'flickr', 'inaturalist', 'wikimedia', 'trefle'
    api_key TEXT NOT NULL,
    api_secret TEXT,
    rate_limit_per_sec REAL DEFAULT 1.0,
    enabled BOOLEAN DEFAULT TRUE
);

-- Downloaded images
CREATE TABLE images (
    id INTEGER PRIMARY KEY,
    species_id INTEGER REFERENCES species(id),
    source TEXT NOT NULL,
    source_id TEXT,                    -- Original ID from source
    url TEXT NOT NULL,
    local_path TEXT,
    license TEXT NOT NULL,
    attribution TEXT,
    width INTEGER,
    height INTEGER,
    phash TEXT,                        -- Perceptual hash for dedup
    quality_score REAL,                -- Blur/quality metric
    status TEXT DEFAULT 'pending',     -- pending, downloaded, rejected, deleted
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(source, source_id)
);

-- Indexes for common queries
CREATE INDEX idx_images_species ON images(species_id);
CREATE INDEX idx_images_status ON images(status);
CREATE INDEX idx_images_source ON images(source);
CREATE INDEX idx_images_phash ON images(phash);

-- Scrape jobs
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    source TEXT NOT NULL,
    species_filter TEXT,               -- JSON array of species IDs, or NULL for all
    status TEXT DEFAULT 'pending',     -- pending, running, paused, completed, failed
    progress_current INTEGER DEFAULT 0,
    progress_total INTEGER DEFAULT 0,
    images_downloaded INTEGER DEFAULT 0,
    images_rejected INTEGER DEFAULT 0,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    error_message TEXT
);

-- Export jobs
CREATE TABLE exports (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    filter_criteria TEXT,              -- JSON: min_images, licenses, min_quality, species_ids
    train_split REAL DEFAULT 0.8,
    status TEXT DEFAULT 'pending',     -- pending, generating, completed, failed
    file_path TEXT,
    file_size INTEGER,
    species_count INTEGER,
    image_count INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    completed_at TIMESTAMP
);
```

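
One subtlety in this schema: `species_fts` is an external-content table (`content='species'`), and SQLite does not keep such an index in sync automatically; `INSERT`/`UPDATE`/`DELETE` triggers are the usual fix. A sketch using the standard `sqlite3` module — the trigger names are illustrative, while the `'delete'` insert is FTS5's documented way to remove a row from an external-content index:

```python
import sqlite3

# FTS5 external-content tables must be kept in sync by hand; these
# triggers mirror every change to `species` into `species_fts`.
SYNC_TRIGGERS = """
CREATE TRIGGER IF NOT EXISTS species_ai AFTER INSERT ON species BEGIN
    INSERT INTO species_fts(rowid, scientific_name, common_name, genus)
    VALUES (new.id, new.scientific_name, new.common_name, new.genus);
END;
CREATE TRIGGER IF NOT EXISTS species_ad AFTER DELETE ON species BEGIN
    INSERT INTO species_fts(species_fts, rowid, scientific_name, common_name, genus)
    VALUES ('delete', old.id, old.scientific_name, old.common_name, old.genus);
END;
CREATE TRIGGER IF NOT EXISTS species_au AFTER UPDATE ON species BEGIN
    INSERT INTO species_fts(species_fts, rowid, scientific_name, common_name, genus)
    VALUES ('delete', old.id, old.scientific_name, old.common_name, old.genus);
    INSERT INTO species_fts(rowid, scientific_name, common_name, genus)
    VALUES (new.id, new.scientific_name, new.common_name, new.genus);
END;
"""

def connect(path: str = ":memory:") -> sqlite3.Connection:
    """Open the DB with WAL enabled (a no-op for in-memory databases)."""
    con = sqlite3.connect(path)
    con.execute("PRAGMA journal_mode=WAL")  # concurrent reads; see Risks table
    con.executescript("""
CREATE TABLE IF NOT EXISTS species (
    id INTEGER PRIMARY KEY,
    scientific_name TEXT UNIQUE NOT NULL,
    common_name TEXT,
    genus TEXT,
    family TEXT
);
CREATE VIRTUAL TABLE IF NOT EXISTS species_fts USING fts5(
    scientific_name, common_name, genus,
    content='species', content_rowid='id'
);
""" + SYNC_TRIGGERS)
    return con
```

With the triggers in place, the species search endpoint can query `species_fts ... MATCH ?` without any application-side index maintenance.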
---
## API Endpoints
### Species

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/species` | List species (paginated, searchable) |
| POST | `/api/species` | Create single species |
| POST | `/api/species/import` | Bulk import from CSV |
| GET | `/api/species/{id}` | Get species details |
| PUT | `/api/species/{id}` | Update species |
| DELETE | `/api/species/{id}` | Delete species |

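
The bulk import endpoint can be sketched with the stdlib `csv` module. The expected header row and the fallback of deriving `genus` from the first word of the scientific name are assumptions about the final format, not a fixed contract:

```python
import csv
import io

def parse_species_csv(text: str) -> list[dict]:
    """Parse 'scientific_name,common_name[,genus]' rows into insertable dicts.

    Blank or malformed lines are skipped; genus defaults to the first
    word of the binomial when the column is absent or empty.
    """
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        name = (rec.get("scientific_name") or "").strip()
        if not name:
            continue  # skip rows without a scientific name
        rows.append({
            "scientific_name": name,
            "common_name": (rec.get("common_name") or "").strip() or None,
            "genus": (rec.get("genus") or "").strip() or name.split()[0],
        })
    return rows
```

The same parser serves both the CSV upload and the paste box, since the UI can submit pasted text as the request body.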
### API Keys

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/sources` | List configured sources |
| PUT | `/api/sources/{source}` | Update source config (key, rate limit) |

### Jobs

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/jobs` | List jobs |
| POST | `/api/jobs` | Create scrape job |
| GET | `/api/jobs/{id}` | Get job status |
| POST | `/api/jobs/{id}/pause` | Pause job |
| POST | `/api/jobs/{id}/resume` | Resume job |
| POST | `/api/jobs/{id}/cancel` | Cancel job |

### Images

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/images` | List images (paginated, filterable) |
| GET | `/api/images/{id}` | Get image details |
| DELETE | `/api/images/{id}` | Delete image |
| POST | `/api/images/bulk-delete` | Bulk delete |

### Export

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/exports` | List exports |
| POST | `/api/exports` | Create export job |
| GET | `/api/exports/{id}` | Get export status |
| GET | `/api/exports/{id}/download` | Download zip file |

### Stats

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/stats` | Dashboard statistics |
| GET | `/api/stats/sources` | Per-source breakdown |
| GET | `/api/stats/species` | Per-species image counts |

---
## UI Screens
### 1. Dashboard

- Total species, images by source, images by license
- Active jobs with progress bars
- Quick stats: images/sec, disk usage
- Recent activity feed

### 2. Species Management

- Table: scientific name, common name, genus, image count
- Import CSV button (drag-and-drop)
- Search/filter by name, genus
- Bulk select → "Start Scrape" button
- Inline editing

### 3. API Keys

- Card per source with:
  - API key input (masked)
  - API secret input (if applicable)
  - Rate limit slider
  - Enable/disable toggle
  - Test connection button

### 4. Image Browser

- Grid view with thumbnails (lazy-loaded)
- Filters sidebar:
  - Species (autocomplete)
  - Source (checkboxes)
  - License (checkboxes)
  - Quality score (range slider)
  - Status (tabs: all, pending, downloaded, rejected)
- Sort by: date, quality, species
- Bulk select → actions (delete, re-queue)
- Click to view full-size + metadata

### 5. Jobs

- Table: name, source, status, progress, dates
- Real-time progress updates (WebSocket)
- Actions: pause, resume, cancel, view logs

### 6. Export

- Filter builder:
  - Min images per species
  - License whitelist
  - Min quality score
  - Species selection (all or specific)
- Train/test split slider (default 80/20)
- Preview: estimated species count, image count, file size
- "Generate Zip" button
- Download history with re-download links

---
## Tradeoffs

| Decision | Alternative | Why This Choice |
|----------|-------------|-----------------|
| SQLite | PostgreSQL | Single-user, simpler Docker setup, sufficient for millions of rows |
| Celery + Redis | RQ, Dramatiq | Battle-tested, good monitoring (Flower) |
| React | Vue, Svelte | Largest ecosystem, more component libraries |
| Separate workers | Threads in FastAPI | Better isolation, can scale workers independently |
| Nginx reverse proxy | Traefik | Simpler config for single-app deployment |

---
## Risks & Mitigations

| Risk | Likelihood | Mitigation |
|------|------------|------------|
| iNaturalist rate limits (5GB/hr) | High | Throttle downloads, prioritize species with low counts |
| Disk fills up | Medium | Dashboard shows disk usage, configurable storage limits |
| Scrape jobs crash mid-run | Medium | Job state in DB, resume from last checkpoint |
| Perceptual hash collisions | Low | Store hash, allow manual review of flagged duplicates |
| API keys exposed | Low | Environment variables, not stored in code |
| SQLite write contention | Low | WAL mode, single writer pattern via Celery |

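
The throttling mitigation can be sketched as a minimal per-source limiter; in the real app this would live in the Celery scrape tasks, and the class name is illustrative. The clock and sleep functions are injectable so the limiter can be exercised deterministically:

```python
import time

class RateLimiter:
    """Enforce a minimum interval between requests to one source,
    e.g. RateLimiter(1.0) for iNaturalist's 1 req/sec limit."""

    def __init__(self, max_per_sec: float,
                 clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / max_per_sec
        self._clock, self._sleep = clock, sleep
        self._next_ok = clock()  # earliest time the next request may fire

    def wait(self) -> None:
        """Block until a request is allowed, then reserve the next slot."""
        now = self._clock()
        if now < self._next_ok:
            self._sleep(self._next_ok - now)
            now = self._next_ok
        self._next_ok = now + self.interval
```

Calling `limiter.wait()` before each HTTP request keeps the scraper inside the per-second limit; the daily and per-hour media caps would need a separate counter checked against the job state in the DB.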
---
## Implementation Phases

### Phase 1: Foundation

- [ ] Docker Compose setup (FastAPI, React, Redis, Nginx)
- [ ] Database schema + migrations (Alembic)
- [ ] Basic FastAPI skeleton with health checks
- [ ] React app scaffolding with Tailwind

### Phase 2: Core Data Management

- [ ] Species CRUD API
- [ ] CSV import endpoint
- [ ] Species list UI with search/filter
- [ ] API key management UI

### Phase 3: iNaturalist Scraper

- [ ] Celery worker setup
- [ ] iNaturalist/GBIF scraper task
- [ ] Job management API
- [ ] Real-time progress (WebSocket or polling)

### Phase 4: Quality Pipeline

- [ ] Image download worker
- [ ] Perceptual hash deduplication
- [ ] Blur detection + quality scoring
- [ ] Resize to 512x512

### Phase 5: Image Browser

- [ ] Image listing API with filters
- [ ] Thumbnail generation
- [ ] Grid view UI
- [ ] Bulk operations

### Phase 6: Additional Scrapers

- [ ] Flickr scraper
- [ ] Wikimedia Commons scraper
- [ ] Trefle scraper (metadata + images)
- [ ] USDA PLANTS scraper

### Phase 7: Export

- [ ] Export job API
- [ ] Train/test split logic
- [ ] Zip generation worker
- [ ] Download endpoint
- [ ] Export UI with filters

### Phase 8: Dashboard & Polish

- [ ] Stats API
- [ ] Dashboard UI with charts
- [ ] Job monitoring UI
- [ ] Error handling + logging
- [ ] Documentation

---
## File Structure

```
PlantGuideScraper/
├── docker-compose.yml
├── .env.example
├── docs/
│   └── master_plan.md
├── backend/
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── alembic/
│   │   └── versions/
│   ├── app/
│   │   ├── __init__.py
│   │   ├── main.py
│   │   ├── config.py
│   │   ├── database.py
│   │   ├── models/
│   │   │   ├── species.py
│   │   │   ├── image.py
│   │   │   ├── job.py
│   │   │   └── export.py
│   │   ├── schemas/
│   │   │   └── ...
│   │   ├── api/
│   │   │   ├── species.py
│   │   │   ├── images.py
│   │   │   ├── jobs.py
│   │   │   ├── exports.py
│   │   │   └── stats.py
│   │   ├── scrapers/
│   │   │   ├── base.py
│   │   │   ├── inaturalist.py
│   │   │   ├── flickr.py
│   │   │   ├── wikimedia.py
│   │   │   └── trefle.py
│   │   ├── workers/
│   │   │   ├── celery_app.py
│   │   │   ├── scrape_tasks.py
│   │   │   ├── quality_tasks.py
│   │   │   └── export_tasks.py
│   │   └── utils/
│   │       ├── image_quality.py
│   │       └── dedup.py
│   └── tests/
├── frontend/
│   ├── Dockerfile
│   ├── package.json
│   ├── src/
│   │   ├── App.tsx
│   │   ├── components/
│   │   ├── pages/
│   │   │   ├── Dashboard.tsx
│   │   │   ├── Species.tsx
│   │   │   ├── Images.tsx
│   │   │   ├── Jobs.tsx
│   │   │   ├── Export.tsx
│   │   │   └── Settings.tsx
│   │   ├── hooks/
│   │   └── api/
│   └── public/
├── nginx/
│   └── nginx.conf
└── data/                 # Bind mount (not in repo)
    ├── db/
    ├── images/
    └── exports/
```

---
## Environment Variables

```bash
# Backend
DATABASE_URL=sqlite:///data/db/plants.sqlite
REDIS_URL=redis://redis:6379/0
IMAGES_PATH=/data/images
EXPORTS_PATH=/data/exports

# API keys (user-provided)
FLICKR_API_KEY=
FLICKR_API_SECRET=
INATURALIST_APP_ID=
INATURALIST_APP_SECRET=
TREFLE_API_KEY=

# Optional
LOG_LEVEL=INFO
CELERY_CONCURRENCY=4
```

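
A minimal settings loader matching these variables; the real FastAPI app would more likely use pydantic settings, so treat this stdlib sketch as illustrative of the defaults only:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    database_url: str
    redis_url: str
    images_path: str
    exports_path: str
    log_level: str
    celery_concurrency: int

def load_settings(env=os.environ) -> Settings:
    """Read config from the environment, falling back to the documented defaults."""
    return Settings(
        database_url=env.get("DATABASE_URL", "sqlite:///data/db/plants.sqlite"),
        redis_url=env.get("REDIS_URL", "redis://redis:6379/0"),
        images_path=env.get("IMAGES_PATH", "/data/images"),
        exports_path=env.get("EXPORTS_PATH", "/data/exports"),
        log_level=env.get("LOG_LEVEL", "INFO"),
        celery_concurrency=int(env.get("CELERY_CONCURRENCY", "4")),
    )
```

Passing `env` explicitly keeps the loader easy to test and keeps API keys out of the codebase, as the Risks table requires.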
---
## Commands

```bash
# Development
docker-compose up --build

# Production
docker-compose -f docker-compose.yml -f docker-compose.prod.yml up -d

# Run migrations
docker-compose exec backend alembic upgrade head

# View Celery logs
docker-compose logs -f celery

# Access Redis CLI
docker-compose exec redis redis-cli
```