Initial commit — PlantGuideScraper project

2026-04-12 09:54:27 -05:00
commit 6926f502c5
87 changed files with 29120 additions and 0 deletions
@@ -0,0 +1,20 @@
+# Database
+DATABASE_URL=sqlite:////data/db/plants.sqlite
+
+# Redis
+REDIS_URL=redis://redis:6379/0
+
+# Storage paths
+IMAGES_PATH=/data/images
+EXPORTS_PATH=/data/exports
+
+# API Keys (user-provided)
+FLICKR_API_KEY=
+FLICKR_API_SECRET=
+INATURALIST_APP_ID=
+INATURALIST_APP_SECRET=
+TREFLE_API_KEY=
+
+# Optional settings
+LOG_LEVEL=INFO
+CELERY_CONCURRENCY=4
@@ -0,0 +1,39 @@
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+venv/
+.venv/
+ENV/
+env/
+.eggs/
+*.egg-info/
+*.egg
+
+# Node
+node_modules/
+npm-debug.log
+yarn-error.log
+
+# IDE
+.idea/
+.vscode/
+*.swp
+*.swo
+*~
+
+# OS
+.DS_Store
+Thumbs.db
+
+# Project specific
+data/
+*.sqlite
+*.db
+.env
+*.zip
+
+# Docker
+docker-compose.override.yml
@@ -0,0 +1,209 @@
+# PlantGuideScraper
+
+Web-based interface for managing a multi-source houseplant image scraping pipeline. Collects images from iNaturalist, Flickr, Wikimedia Commons, and Trefle.io to build datasets for CoreML training.
+
+## Features
+
+- **Species Management**: Import species lists via CSV or JSON, search and filter by genus or image status
+- **Multi-Source Scraping**: iNaturalist/GBIF, Flickr, Wikimedia Commons, Trefle.io
+- **Image Quality Pipeline**: Automatic deduplication, blur detection, resizing
+- **License Filtering**: Only collect commercially-safe CC0/CC-BY licensed images
+- **Export for CoreML**: Train/test split, Create ML-compatible folder structure
+- **Real-time Dashboard**: Progress tracking, statistics, job monitoring
+
+## Quick Start
+
+```bash
+# After cloning the repository, start from the project directory
+cd PlantGuideScraper
+docker-compose up --build
+
+# Access the UI
+open http://localhost
+```
+
+## Unraid Deployment
+
+### Setup
+
+1. Copy the project to your Unraid server:
+   ```bash
+   scp -r PlantGuideScraper root@YOUR_UNRAID_IP:/mnt/user/appdata/PlantGuideScraper
+   ```
+
+2. SSH into Unraid and create data directories:
+   ```bash
+   ssh root@YOUR_UNRAID_IP
+   mkdir -p /mnt/user/appdata/PlantGuideScraper/{database,images,exports,redis}
+   ```
+
+3. Install **Docker Compose Manager** from Community Applications
+
+4. In Unraid: **Docker → Compose → Add New Stack**
+   - Path: `/mnt/user/appdata/PlantGuideScraper/docker-compose.unraid.yml`
+   - Click **Compose Up**
+
+5. Access at `http://YOUR_UNRAID_IP:8580`
+
+### Configurable Paths
+
+Edit `docker-compose.unraid.yml` to customize where data is stored. Look for these lines in both `backend` and `celery` services:
+
+```yaml
+# === CONFIGURABLE DATA PATHS ===
+- /mnt/user/appdata/PlantGuideScraper/database:/data/db    # DATABASE_PATH
+- /mnt/user/appdata/PlantGuideScraper/images:/data/images  # IMAGES_PATH
+- /mnt/user/appdata/PlantGuideScraper/exports:/data/exports # EXPORTS_PATH
+```
+
+| Path | Description | Default |
+|------|-------------|---------|
+| DATABASE_PATH | SQLite database file | `/mnt/user/appdata/PlantGuideScraper/database` |
+| IMAGES_PATH | Downloaded images (can be 100GB+) | `/mnt/user/appdata/PlantGuideScraper/images` |
+| EXPORTS_PATH | Generated export zip files | `/mnt/user/appdata/PlantGuideScraper/exports` |
+
+**Example: Store images on a separate share:**
+```yaml
+- /mnt/user/data/PlantImages:/data/images  # IMAGES_PATH
+```
+
+**Important:** Keep paths identical in both `backend` and `celery` services.
+
+## Configuration
+
+1. Configure API keys in Settings:
+   - **Flickr**: Get key at https://www.flickr.com/services/api/
+   - **Trefle**: Get key at https://trefle.io/
+   - iNaturalist and Wikimedia don't require keys
+
+2. Import species list (see Import Documentation below)
+
+3. Select species and start scraping
+
+## Import Documentation
+
+### CSV Import
+
+Import species from a CSV file with the following columns:
+
+| Column | Required | Description |
+|--------|----------|-------------|
+| `scientific_name` | Yes | Binomial name (e.g., "Monstera deliciosa") |
+| `common_name` | No | Common name (e.g., "Swiss Cheese Plant") |
+| `genus` | No | Auto-extracted from scientific_name if not provided |
+| `family` | No | Plant family (e.g., "Araceae") |
+
+**Example CSV:**
+```csv
+scientific_name,common_name,genus,family
+Monstera deliciosa,Swiss Cheese Plant,Monstera,Araceae
+Philodendron hederaceum,Heartleaf Philodendron,Philodendron,Araceae
+Epipremnum aureum,Golden Pothos,Epipremnum,Araceae
+```
+
+### JSON Import
+
+Import species from a JSON file with the following structure:
+
+```json
+{
+  "plants": [
+    {
+      "scientific_name": "Monstera deliciosa",
+      "common_names": ["Swiss Cheese Plant", "Split-leaf Philodendron"],
+      "family": "Araceae"
+    },
+    {
+      "scientific_name": "Philodendron hederaceum",
+      "common_names": ["Heartleaf Philodendron"],
+      "family": "Araceae"
+    }
+  ]
+}
+```
+
+| Field | Required | Description |
+|-------|----------|-------------|
+| `scientific_name` | Yes | Binomial name |
+| `common_names` | No | Array of common names (first one is used) |
+| `family` | No | Plant family |
+
+**Notes:**
+- Genus is automatically extracted from the first word of `scientific_name`
+- Duplicate species (by scientific_name) are skipped
+- The included `houseplants_list.json` contains 2,278 houseplant species
+
+### API Endpoints
+
+```bash
+# Import CSV
+curl -X POST http://localhost/api/species/import \
+  -F "file=@species.csv"
+
+# Import JSON
+curl -X POST http://localhost/api/species/import-json \
+  -F "file=@plants.json"
+```
+
+**Response:**
+```json
+{
+  "imported": 150,
+  "skipped": 5,
+  "errors": []
+}
+```
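The import rules documented above (required `scientific_name`, genus taken from the first word when absent, duplicates skipped, and an `imported`/`skipped`/`errors` summary) can be sketched as follows. This is a minimal illustration and not the endpoint's actual implementation, which is not shown in this part of the commit. The function name `import_species_csv` and the error message wording are assumptions.

```python
import csv
import io


def import_species_csv(text, existing_names=()):
    """Parse species CSV text following the documented column rules.

    Hypothetical helper; assumes no multi-line quoted fields, so that
    data row N sits on file line N + 1.
    """
    seen = set(existing_names)
    rows, skipped, errors = [], 0, []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        name = (row.get("scientific_name") or "").strip()
        if not name:
            errors.append(f"line {line_no}: missing scientific_name")
            continue
        if name in seen:
            skipped += 1  # duplicates by scientific_name are skipped
            continue
        seen.add(name)
        rows.append({
            "scientific_name": name,
            "common_name": (row.get("common_name") or "").strip() or None,
            # Genus falls back to the first word of the scientific name.
            "genus": (row.get("genus") or "").strip() or name.split()[0],
            "family": (row.get("family") or "").strip() or None,
        })
    return rows, {"imported": len(rows), "skipped": skipped, "errors": errors}
```

Passing the names already in the database as `existing_names` lets the same duplicate check cover both repeated rows in one file and re-imports of an earlier file.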
+
+## Architecture
+
+```
+┌─────────────┐     ┌─────────────────┐     ┌─────────────┐
+│   React     │────▶│  FastAPI        │────▶│   Celery    │
+│   Frontend  │     │  Backend        │     │   Workers   │
+└─────────────┘     └─────────────────┘     └─────────────┘
+                           │                       │
+                           ▼                       ▼
+                   ┌─────────────┐         ┌─────────────┐
+                   │   SQLite    │         │   Redis     │
+                   │   Database  │         │   Queue     │
+                   └─────────────┘         └─────────────┘
+```
+
+## Export Format
+
+Exports are Create ML-compatible:
+
+```
+export.zip/
+├── Training/
+│   ├── Monstera_deliciosa/
+│   │   ├── img_00001.jpg
+│   │   └── ...
+│   └── ...
+└── Testing/
+    ├── Monstera_deliciosa/
+    └── ...
+```
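As a rough sketch of how images might be assigned to the `Training/` and `Testing/` folders shown above: the function below only plans destination paths and copies nothing. The name `plan_export`, the seeded shuffle, and the continuous `img_NNNNN` numbering across both splits are assumptions. The 0.8 default mirrors the `train_split` column default in the initial migration.

```python
import random
from pathlib import PurePosixPath


def plan_export(images_by_species, train_split=0.8, seed=0):
    """Map each source image to a path inside the Training/ or Testing/ tree."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    plan = {}
    for species, images in sorted(images_by_species.items()):
        folder = species.replace(" ", "_")  # "Monstera deliciosa" -> "Monstera_deliciosa"
        shuffled = list(images)
        rng.shuffle(shuffled)
        n_train = int(len(shuffled) * train_split)
        for i, src in enumerate(shuffled, start=1):
            split = "Training" if i <= n_train else "Testing"
            plan[src] = str(PurePosixPath(split) / folder / f"img_{i:05d}.jpg")
    return plan
```

Splitting per species rather than over the whole pool keeps every class represented in both folders, provided each species has enough images.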
+
+## Data Storage
+
+All data is stored in the `./data` directory:
+
+```
+data/
+├── db/
+│   └── plants.sqlite    # SQLite database
+├── images/              # Downloaded images
+│   └── {species_id}/
+│       └── {image_id}.jpg
+└── exports/             # Generated export archives
+    └── {export_id}.zip
+```
+
+## API Documentation
+
+Full API docs available at http://localhost/api/docs
+
+## License
+
+MIT
@@ -0,0 +1,231 @@
+# Houseplant Image Dataset Accumulation Plan
+
+## Overview
+
+Build a custom CoreML model for houseplant identification by accumulating a large dataset of houseplant images with proper licensing for commercial use.
+
+---
+
+## Requirements Summary
+
+| Parameter | Value |
+|-----------|-------|
+| Target species | 5,000-10,000 (realistic houseplant ceiling) |
+| Images per species | 200-500 (recommended) |
+| Total images | ~1-5 million |
+| Budget | Free preferred, paid as reference |
+| Compute | M1 Max Mac (training) + Unraid server (data pipeline) |
+| Curation | Automated pipeline |
+| Timeline | Weeks-months |
+| Licensing | Must allow training + commercial model distribution |
+
+---
+
+## Hardware Assessment
+
+| Machine | Role | Capability |
+|---------|------|------------|
+| M1 Max Mac | **Training** | Create ML can train 5-10K class models; 32+ GB unified memory is ideal |
+| Unraid Server | **Data pipeline** | Scraping, downloading, preprocessing, storage |
+
+M1 Max is legitimately viable for this task via Create ML or PyTorch+MPS. No cloud GPU required.
+
+---
+
+## Data Sources Analysis
+
+### Tier 1: Primary Sources (Recommended)
+
+| Source | License | Commercial-Safe | Volume | Houseplant Coverage | Access Method |
+|--------|---------|-----------------|--------|---------------------|---------------|
+| **iNaturalist via GBIF** | CC-BY, CC0 (filter) | Yes (filtered) | 100M+ observations | Good (has "captive/cultivated" flag) | Bulk export + API |
+| **Flickr** | CC-BY, CC0 (filter) | Yes (filtered) | Millions | Moderate | API |
+| **Wikimedia Commons** | CC-BY, CC-BY-SA, Public Domain | Mostly | Thousands | Moderate | API |
+
+### Tier 2: Supplemental Sources
+
+| Source | License | Commercial-Safe | Notes |
+|--------|---------|-----------------|-------|
+| **USDA PLANTS** | Public Domain | Yes | US-focused, limited images |
+| **Encyclopedia of Life** | Mixed | Check each | Aggregator, good metadata |
+| **Pl@ntNet-300K Dataset** | CC-BY-SA | Share-alike | Good for research/prototyping only |
+
+### Tier 3: Paid Options (Reference)
+
+| Source | Estimated Cost | Notes |
+|--------|----------------|-------|
+| iNaturalist AWS Open Data | Free | Bulk image export; S3 transfer costs may apply |
+| Custom scraping infrastructure | $50-200/mo | Proxies, storage, bandwidth |
+| Commercial botanical databases | $1000s+ | Getty, Alamy — not recommended |
+
+---
+
+## Licensing Decision Matrix
+
+```
+Want commercial model distribution?
+├─ YES → Use ONLY: CC0, CC-BY, Public Domain
+│        Filter OUT: CC-BY-NC, CC-BY-SA, All Rights Reserved
+│
+└─ NO (research only) → Can use CC-BY-NC, CC-BY-SA
+                        Pl@ntNet-300K dataset becomes viable
+```
+
+**Recommendation**: Filter for commercial-safe licenses from day 1. Avoids re-scraping later.
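The decision matrix above reduces to an allow-list check. The sketch below assumes license strings have already been reduced to short codes such as `CC-BY`. Each source reports licenses in its own format (versions, URLs, numeric IDs), so a real filter would need a per-source mapping that is not shown here. The names `COMMERCIAL_SAFE` and `is_commercial_safe` are hypothetical.

```python
# Licenses the matrix allows for commercial model distribution.
COMMERCIAL_SAFE = {"CC0", "CC-BY", "PUBLIC DOMAIN"}


def is_commercial_safe(license_name):
    """Return True only for licenses on the commercial allow-list."""
    normalized = license_name.strip().upper().replace("_", "-")
    return normalized in COMMERCIAL_SAFE
```

Using an allow-list rather than a block-list means an unrecognized license string is rejected by default, which matches the "filter from day 1" recommendation.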
+
+---
+
+## Houseplant Species Taxonomy
+
+**Problem**: No canonical "houseplant" species list exists. Must construct one.
+
+**Approach**:
+1. Start with Wikipedia "List of houseplants" (~500 species)
+2. Expand via genus crawl (all Philodendron, all Hoya, etc.)
+3. Cross-reference with RHS, ASPCA, nursery catalogs
+4. Target: **1,000-3,000 species** is realistic for quality dataset
+
+**Key Genera** (prioritize these — cover 80% of common houseplants):
+```
+Philodendron, Monstera, Pothos/Epipremnum, Ficus, Dracaena,
+Sansevieria, Calathea, Maranta, Alocasia, Anthurium,
+Peperomia, Hoya, Begonia, Tradescantia, Pilea,
+Aglaonema, Dieffenbachia, Spathiphyllum, Zamioculcas, Crassula
+```
+
+---
+
+## Data Quality Requirements
+
+| Parameter | Minimum | Target | Rationale |
+|-----------|---------|--------|-----------|
+| Images per species | 100 | 300-500 | Below 100 = unreliable classification |
+| Resolution | 256x256 | 512x512+ | Downsample to 224x224 for training |
+| Variety | Single angle | Multi-angle, growth stages, lighting | Generalization |
+| Label accuracy | 80% | 95%+ | iNaturalist "Research Grade" = community verified |
+
+---
+
+## Training Approach Options
+
+### Option A: Create ML (Recommended)
+
+| Pros | Cons |
+|------|------|
+| Native Apple Silicon optimization | Limited hyperparameter control |
+| Outputs CoreML directly | Max ~10K classes practical limit |
+| No Python/ML expertise needed | Less flexible augmentation |
+| Fast iteration | |
+
+**Best for**: This use case exactly.
+
+### Option B: PyTorch + MPS Transfer Learning
+
+| Pros | Cons |
+|------|------|
+| Full control over architecture | Steeper learning curve |
+| State-of-art augmentation (albumentations) | Manual CoreML conversion |
+| Can use EfficientNet, ConvNeXt, etc. | Slower iteration |
+
+**Best for**: If Create ML hits limits or you need custom architecture.
+
+### Option C: Cloud GPU (Google Colab / AWS Spot)
+
+| Pros | Cons |
+|------|------|
+| Faster training for large models | Cost |
+| No local resource constraints | Network transfer overhead |
+
+**Best for**: If dataset exceeds M1 Max memory or you want transformer-based vision models.
+
+**Recommendation**: Start with Create ML. Pivot to Option B only if needed.
+
+---
+
+## Pipeline Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                     UNRAID SERVER                                │
+├─────────────────────────────────────────────────────────────────┤
+│  1. Species List Generator                                       │
+│     └─ Scrape Wikipedia, RHS, expand by genus                   │
+│                                                                  │
+│  2. Image Downloader                                             │
+│     ├─ iNaturalist/GBIF bulk export (primary)                   │
+│     ├─ Flickr API (supplemental)                                │
+│     └─ License filter (CC-BY, CC0 only)                         │
+│                                                                  │
+│  3. Preprocessing Pipeline                                       │
+│     ├─ Resize to 512x512                                        │
+│     ├─ Remove duplicates (perceptual hash)                      │
+│     ├─ Remove low-quality (blur detection, size filter)         │
+│     └─ Organize: /species_name/image_001.jpg                    │
+│                                                                  │
+│  4. Dataset Statistics                                           │
+│     └─ Report per-species counts, flag under-represented        │
+└─────────────────────────────────────────────────────────────────┘
+                              │
+                              ▼ (rsync/SMB)
+┌─────────────────────────────────────────────────────────────────┐
+│                      M1 MAX MAC                                  │
+├─────────────────────────────────────────────────────────────────┤
+│  5. Create ML Training                                           │
+│     ├─ Import dataset folder                                    │
+│     ├─ Train image classifier                                   │
+│     └─ Export .mlmodel                                          │
+│                                                                  │
+│  6. Validation                                                   │
+│     ├─ Test on held-out images                                  │
+│     └─ Test on real-world photos (your phone)                   │
+│                                                                  │
+│  7. Integration                                                  │
+│     └─ Replace Pl@ntNet-300K in PlantGuide                      │
+└─────────────────────────────────────────────────────────────────┘
+```
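The "Remove duplicates (perceptual hash)" step in the diagram does not say which hash is used. As an illustration only, the sketch below uses an average hash, one of the simplest perceptual hashes, on a small grayscale grid. It assumes the image has already been resized and converted to grayscale elsewhere (for example to 8x8). The function names and the `max_distance=5` threshold are assumptions.

```python
def average_hash(pixels):
    """Hash a small grayscale grid: one bit per pixel, set if above the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def is_duplicate(hash_a, hash_b, max_distance=5):
    """Treat two images as duplicates when their hashes differ in few bits."""
    return hamming(hash_a, hash_b) <= max_distance
```

Comparing by Hamming distance rather than equality lets near-identical copies (recompressed or slightly brightened) match, at the cost of occasional false positives when the threshold is too loose.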
+
+---
+
+## Timeline
+
+| Phase | Duration | Output |
+|-------|----------|--------|
+| 1. Species list curation | 1 week | 1,000-3,000 target species with scientific + common names |
+| 2. Pipeline development | 1-2 weeks | Automated scraper on Unraid |
+| 3. Data collection | 2-4 weeks | Running 24/7, rate-limited by APIs |
+| 4. Preprocessing + QA | 1 week | Clean dataset, statistics report |
+| 5. Initial training | 2-3 days | First model with subset (500 species) |
+| 6. Full training | 1 week | Full model, iteration |
+| 7. Validation + tuning | 1 week | Production-ready model |
+
+**Total: 6-10 weeks**
+
+---
+
+## Risk Analysis
+
+| Risk | Likelihood | Mitigation |
+|------|------------|------------|
+| Insufficient images for rare species | High | Accept lower coverage OR merge to genus-level for rare species |
+| API rate limits slow collection | High | Parallelize sources, use bulk exports, patience |
+| Noisy labels degrade accuracy | Medium | Use only "Research Grade" iNaturalist, implement confidence thresholds |
+| Create ML memory limits | Low | M1 Max should handle; fallback to PyTorch |
+| License ambiguity | Low | Strict filter on download, keep metadata |
+
+---
+
+## Next Steps
+
+1. **Build species master list** — Python script to scrape/merge sources
+2. **Set up GBIF bulk download** — Filter: Plantae, captive/cultivated, CC-BY/CC0, has images
+3. **Build Flickr supplemental scraper** — Target under-represented species
+4. **Docker container on Unraid** — Orchestrate pipeline
+5. **Create ML project setup** — Folder structure, initial test with 50 species
+
+---
+
+## Open Questions
+
+- Prioritize **speed** (start with 500 species, fast iteration) or **completeness** (build full 3K species list first)?
+- Any specific houseplant species that must be included?
+- Docker running on Unraid already?
@@ -0,0 +1,24 @@
+FROM python:3.11-slim
+
+WORKDIR /app
+
+# Install system dependencies
+RUN apt-get update && apt-get install -y \
+    gcc \
+    g++ \
+    libffi-dev \
+    && rm -rf /var/lib/apt/lists/*
+
+# Install Python dependencies
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Copy application code
+COPY . .
+
+# Create data directories
+RUN mkdir -p /data/db /data/images /data/exports
+
+EXPOSE 8000
+
+CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
@@ -0,0 +1,19 @@
+#!/usr/bin/env python
+"""Add missing database indexes."""
+from sqlalchemy import text
+from app.database import engine
+
+with engine.connect() as conn:
+    # Single column indexes
+    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_license ON images(license)'))
+    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_status ON images(status)'))
+    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_source ON images(source)'))
+    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_species_id ON images(species_id)'))
+    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_phash ON images(phash)'))
+
+    # Composite indexes for common query patterns
+    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_species_status ON images(species_id, status)'))
+    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_status_created ON images(status, created_at)'))
+
+    conn.commit()
+    print('All indexes created successfully')
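The script above relies on `CREATE INDEX IF NOT EXISTS` being safe to run more than once. A small standalone check of that behavior, using the standard-library `sqlite3` module and a throwaway in-memory table instead of the project's `engine`, might look like this (the reduced `images` schema is only for the demonstration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE images (id INTEGER PRIMARY KEY, species_id INTEGER, status TEXT)"
)

statement = (
    "CREATE INDEX IF NOT EXISTS ix_images_species_status "
    "ON images(species_id, status)"
)
conn.execute(statement)
conn.execute(statement)  # second run is a no-op rather than an error

# PRAGMA index_list yields one row per index; column 1 is the index name.
index_names = [row[1] for row in conn.execute("PRAGMA index_list('images')")]
print(index_names)
```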
@@ -0,0 +1,42 @@
+[alembic]
+script_location = alembic
+prepend_sys_path = .
+version_path_separator = os
+
+sqlalchemy.url = sqlite:////data/db/plants.sqlite
+
+[post_write_hooks]
+
+[loggers]
+keys = root,sqlalchemy,alembic
+
+[handlers]
+keys = console
+
+[formatters]
+keys = generic
+
+[logger_root]
+level = WARN
+handlers = console
+qualname =
+
+[logger_sqlalchemy]
+level = WARN
+handlers =
+qualname = sqlalchemy.engine
+
+[logger_alembic]
+level = INFO
+handlers =
+qualname = alembic
+
+[handler_console]
+class = StreamHandler
+args = (sys.stderr,)
+level = NOTSET
+formatter = generic
+
+[formatter_generic]
+format = %(levelname)-5.5s [%(name)s] %(message)s
+datefmt = %H:%M:%S
@@ -0,0 +1,54 @@
+from logging.config import fileConfig
+
+from sqlalchemy import engine_from_config
+from sqlalchemy import pool
+
+from alembic import context
+
+# Import models for autogenerate
+from app.database import Base
+from app.models import Species, Image, Job, ApiKey, Export
+
+config = context.config
+
+if config.config_file_name is not None:
+    fileConfig(config.config_file_name)
+
+target_metadata = Base.metadata
+
+
+def run_migrations_offline() -> None:
+    """Run migrations in 'offline' mode."""
+    url = config.get_main_option("sqlalchemy.url")
+    context.configure(
+        url=url,
+        target_metadata=target_metadata,
+        literal_binds=True,
+        dialect_opts={"paramstyle": "named"},
+    )
+
+    with context.begin_transaction():
+        context.run_migrations()
+
+
+def run_migrations_online() -> None:
+    """Run migrations in 'online' mode."""
+    connectable = engine_from_config(
+        config.get_section(config.config_ini_section, {}),
+        prefix="sqlalchemy.",
+        poolclass=pool.NullPool,
+    )
+
+    with connectable.connect() as connection:
+        context.configure(
+            connection=connection, target_metadata=target_metadata
+        )
+
+        with context.begin_transaction():
+            context.run_migrations()
+
+
+if context.is_offline_mode():
+    run_migrations_offline()
+else:
+    run_migrations_online()
@@ -0,0 +1,26 @@
+"""${message}
+
+Revision ID: ${up_revision}
+Revises: ${down_revision | comma,n}
+Create Date: ${create_date}
+
+"""
+from typing import Sequence, Union
+
+from alembic import op
+import sqlalchemy as sa
+${imports if imports else ""}
+
+# revision identifiers, used by Alembic.
+revision: str = ${repr(up_revision)}
+down_revision: Union[str, None] = ${repr(down_revision)}
+branch_labels: Union[str, Sequence[str], None] = ${repr(branch_labels)}
+depends_on: Union[str, Sequence[str], None] = ${repr(depends_on)}
+
+
+def upgrade() -> None:
+    ${upgrades if upgrades else "pass"}
+
+
+def downgrade() -> None:
+    ${downgrades if downgrades else "pass"}
@@ -0,0 +1,112 @@
+"""Initial migration
+
+Revision ID: 001
+Revises:
+Create Date: 2024-01-01
+
+"""
+from typing import Sequence, Union
+
+from alembic import op
+import sqlalchemy as sa
+
+revision: str = '001'
+down_revision: Union[str, None] = None
+branch_labels: Union[str, Sequence[str], None] = None
+depends_on: Union[str, Sequence[str], None] = None
+
+
+def upgrade() -> None:
+    # Species table
+    op.create_table(
+        'species',
+        sa.Column('id', sa.Integer(), primary_key=True),
+        sa.Column('scientific_name', sa.String(), nullable=False, unique=True),
+        sa.Column('common_name', sa.String(), nullable=True),
+        sa.Column('genus', sa.String(), nullable=True),
+        sa.Column('family', sa.String(), nullable=True),
+        sa.Column('created_at', sa.DateTime(), server_default=sa.func.now()),
+    )
+    op.create_index('ix_species_scientific_name', 'species', ['scientific_name'])
+    op.create_index('ix_species_genus', 'species', ['genus'])
+
+    # API Keys table
+    op.create_table(
+        'api_keys',
+        sa.Column('id', sa.Integer(), primary_key=True),
+        sa.Column('source', sa.String(), nullable=False, unique=True),
+        sa.Column('api_key', sa.String(), nullable=False),
+        sa.Column('api_secret', sa.String(), nullable=True),
+        sa.Column('rate_limit_per_sec', sa.Float(), default=1.0),
+        sa.Column('enabled', sa.Boolean(), default=True),
+    )
+
+    # Images table
+    op.create_table(
+        'images',
+        sa.Column('id', sa.Integer(), primary_key=True),
+        sa.Column('species_id', sa.Integer(), sa.ForeignKey('species.id'), nullable=False),
+        sa.Column('source', sa.String(), nullable=False),
+        sa.Column('source_id', sa.String(), nullable=True),
+        sa.Column('url', sa.String(), nullable=False),
+        sa.Column('local_path', sa.String(), nullable=True),
+        sa.Column('license', sa.String(), nullable=False),
+        sa.Column('attribution', sa.String(), nullable=True),
+        sa.Column('width', sa.Integer(), nullable=True),
+        sa.Column('height', sa.Integer(), nullable=True),
+        sa.Column('phash', sa.String(), nullable=True),
+        sa.Column('quality_score', sa.Float(), nullable=True),
+        sa.Column('status', sa.String(), default='pending'),
+        sa.Column('created_at', sa.DateTime(), server_default=sa.func.now()),
+        sa.UniqueConstraint('source', 'source_id', name='uq_source_source_id'),  # inline: SQLite cannot ALTER TABLE ADD CONSTRAINT
+    )
+    op.create_index('ix_images_species_id', 'images', ['species_id'])
+    op.create_index('ix_images_source', 'images', ['source'])
+    op.create_index('ix_images_status', 'images', ['status'])
+    op.create_index('ix_images_phash', 'images', ['phash'])
+
+    # Jobs table
+    op.create_table(
+        'jobs',
+        sa.Column('id', sa.Integer(), primary_key=True),
+        sa.Column('name', sa.String(), nullable=False),
+        sa.Column('source', sa.String(), nullable=False),
+        sa.Column('species_filter', sa.Text(), nullable=True),
+        sa.Column('status', sa.String(), default='pending'),
+        sa.Column('progress_current', sa.Integer(), default=0),
+        sa.Column('progress_total', sa.Integer(), default=0),
+        sa.Column('images_downloaded', sa.Integer(), default=0),
+        sa.Column('images_rejected', sa.Integer(), default=0),
+        sa.Column('celery_task_id', sa.String(), nullable=True),
+        sa.Column('started_at', sa.DateTime(), nullable=True),
+        sa.Column('completed_at', sa.DateTime(), nullable=True),
+        sa.Column('error_message', sa.Text(), nullable=True),
+        sa.Column('created_at', sa.DateTime(), server_default=sa.func.now()),
+    )
+    op.create_index('ix_jobs_status', 'jobs', ['status'])
+
+    # Exports table
+    op.create_table(
+        'exports',
+        sa.Column('id', sa.Integer(), primary_key=True),
+        sa.Column('name', sa.String(), nullable=False),
+        sa.Column('filter_criteria', sa.Text(), nullable=True),
+        sa.Column('train_split', sa.Float(), default=0.8),
+        sa.Column('status', sa.String(), default='pending'),
+        sa.Column('file_path', sa.String(), nullable=True),
+        sa.Column('file_size', sa.Integer(), nullable=True),
+        sa.Column('species_count', sa.Integer(), nullable=True),
+        sa.Column('image_count', sa.Integer(), nullable=True),
+        sa.Column('celery_task_id', sa.String(), nullable=True),
+        sa.Column('created_at', sa.DateTime(), server_default=sa.func.now()),
+        sa.Column('completed_at', sa.DateTime(), nullable=True),
+        sa.Column('error_message', sa.Text(), nullable=True),
+    )
+
+
+def downgrade() -> None:
+    op.drop_table('exports')
+    op.drop_table('jobs')
+    op.drop_table('images')
+    op.drop_table('api_keys')
+    op.drop_table('species')
@@ -0,0 +1,53 @@
+"""Add cached_stats table and license index
+
+Revision ID: 002
+Revises: 001
+Create Date: 2025-01-25
+
+"""
+from typing import Sequence, Union
+
+from alembic import op
+import sqlalchemy as sa
+
+revision: str = '002'
+down_revision: Union[str, None] = '001'
+branch_labels: Union[str, Sequence[str], None] = None
+depends_on: Union[str, Sequence[str], None] = None
+
+
+def upgrade() -> None:
+    # Cached stats table for pre-calculated dashboard statistics
+    op.create_table(
+        'cached_stats',
+        sa.Column('id', sa.Integer(), primary_key=True),
+        sa.Column('key', sa.String(50), nullable=False, unique=True),
+        sa.Column('value', sa.Text(), nullable=False),
+        sa.Column('updated_at', sa.DateTime(), server_default=sa.func.now()),
+    )
+    op.create_index('ix_cached_stats_key', 'cached_stats', ['key'])
+
+    # Add license index to images table (if not exists)
+    # Errors are swallowed so the migration can be re-run when the index already exists
+    try:
+        op.create_index('ix_images_license', 'images', ['license'])
+    except Exception:
+        pass  # Index may already exist
+
+    # Add only_without_images column to jobs if it doesn't exist
+    try:
+        op.add_column('jobs', sa.Column('only_without_images', sa.Boolean(), default=False))
+    except Exception:
+        pass  # Column may already exist
+
+
+def downgrade() -> None:
+    try:
+        op.drop_index('ix_images_license', 'images')
+    except Exception:
+        pass
+    try:
+        op.drop_column('jobs', 'only_without_images')
+    except Exception:
+        pass
+    op.drop_table('cached_stats')
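The migration above treats any exception from `add_column` as "column already exists". One alternative would be to check for the column explicitly before altering the table. The sketch below shows the idea with the standard-library `sqlite3` module rather than Alembic's API, so it is an illustration of the check and not a drop-in migration. `column_exists` is a hypothetical helper.

```python
import sqlite3


def column_exists(conn, table, column):
    """Look the column up in PRAGMA table_info (column 1 is the name)."""
    return any(
        row[1] == column for row in conn.execute(f"PRAGMA table_info({table})")
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, name TEXT)")

# Only alter the table when the column is actually missing.
if not column_exists(conn, "jobs", "only_without_images"):
    conn.execute("ALTER TABLE jobs ADD COLUMN only_without_images BOOLEAN DEFAULT 0")
```

An explicit check keeps unrelated failures (a locked database, a typo in the table name) visible instead of silently skipping the change.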
@@ -0,0 +1,31 @@
+"""Add max_images column to jobs table
+
+Revision ID: 003
+Revises: 002
+Create Date: 2025-01-25
+
+"""
+from typing import Sequence, Union
+
+from alembic import op
+import sqlalchemy as sa
+
+revision: str = '003'
+down_revision: Union[str, None] = '002'
+branch_labels: Union[str, Sequence[str], None] = None
+depends_on: Union[str, Sequence[str], None] = None
+
+
+def upgrade() -> None:
+    # Add max_images column to jobs table
+    try:
+        op.add_column('jobs', sa.Column('max_images', sa.Integer(), nullable=True))
+    except Exception:
+        pass  # Column may already exist
+
+
+def downgrade() -> None:
+    try:
+        op.drop_column('jobs', 'max_images')
+    except Exception:
+        pass
@@ -0,0 +1 @@
+# PlantGuideScraper Backend
@@ -0,0 +1 @@
+# API routes
@@ -0,0 +1,175 @@
+import os
+
+from fastapi import APIRouter, Depends, HTTPException, Query
+from fastapi.responses import FileResponse
+from sqlalchemy.orm import Session
+from sqlalchemy import func
+
+from app.database import get_db
+from app.models import Export, Image, Species
+from app.schemas.export import (
+    ExportCreate,
+    ExportResponse,
+    ExportListResponse,
+    ExportPreview,
+)
+from app.workers.export_tasks import generate_export
+
+router = APIRouter()
+
+
+@router.get("", response_model=ExportListResponse)
+def list_exports(
+    limit: int = Query(50, ge=1, le=200),
+    db: Session = Depends(get_db),
+):
+    """List the most recent exports, newest first (at most `limit`)."""
+    total = db.query(Export).count()
+    exports = db.query(Export).order_by(Export.created_at.desc()).limit(limit).all()
+
+    return ExportListResponse(
+        items=[ExportResponse.model_validate(e) for e in exports],
+        total=total,
+    )
+
+
+@router.post("/preview", response_model=ExportPreview)
+def preview_export(export: ExportCreate, db: Session = Depends(get_db)):
+    """Preview export without creating it."""
+    criteria = export.filter_criteria
+    min_images = criteria.min_images_per_species
+
+    # Count downloaded images per species, applying the same filters the export will use
+    species_counts = db.query(
+        Image.species_id,
+        func.count(Image.id).label("count")
+    ).filter(Image.status == "downloaded")
+
+    if criteria.licenses:
+        species_counts = species_counts.filter(Image.license.in_(criteria.licenses))
+    if criteria.min_quality:
+        species_counts = species_counts.filter(Image.quality_score >= criteria.min_quality)
+    if criteria.species_ids:
+        species_counts = species_counts.filter(Image.species_id.in_(criteria.species_ids))
+
+    species_counts = species_counts.group_by(Image.species_id).all()
+
+    valid_species = [s for s in species_counts if s.count >= min_images]
+    total_images = sum(s.count for s in valid_species)
+
+    # Estimate file size (rough: 50KB per image)
+    estimated_size_mb = (total_images * 50) / 1024
+
+    return ExportPreview(
+        species_count=len(valid_species),
+        image_count=total_images,
+        estimated_size_mb=estimated_size_mb,
+    )
+
+
+@router.post("", response_model=ExportResponse)
+def create_export(export: ExportCreate, db: Session = Depends(get_db)):
+    """Create and start a new export job."""
+    db_export = Export(
+        name=export.name,
+        filter_criteria=export.filter_criteria.model_dump_json(),
+        train_split=export.train_split,
+        status="pending",
+    )
+    db.add(db_export)
+    db.commit()
+    db.refresh(db_export)
+
+    # Start Celery task
+    task = generate_export.delay(db_export.id)
+    db_export.celery_task_id = task.id
+    db.commit()
+
+    return ExportResponse.model_validate(db_export)
+
+
+@router.get("/{export_id}", response_model=ExportResponse)
+def get_export(export_id: int, db: Session = Depends(get_db)):
+    """Get export status."""
+    export = db.query(Export).filter(Export.id == export_id).first()
+    if not export:
+        raise HTTPException(status_code=404, detail="Export not found")
+
+    return ExportResponse.model_validate(export)
+
+
+@router.get("/{export_id}/progress")
+def get_export_progress(export_id: int, db: Session = Depends(get_db)):
+    """Get real-time export progress."""
+    from app.workers.celery_app import celery_app
+
+    export = db.query(Export).filter(Export.id == export_id).first()
+    if not export:
+        raise HTTPException(status_code=404, detail="Export not found")
+
+    if not export.celery_task_id:
+        return {"status": export.status}
+
+    result = celery_app.AsyncResult(export.celery_task_id)
+
+    if result.state == "PROGRESS":
+        meta = result.info if isinstance(result.info, dict) else {}
+        return {
+            "status": "generating",
+            "current": meta.get("current", 0),
+            "total": meta.get("total", 0),
+            "current_species": meta.get("species", ""),
+        }
+
+    return {"status": export.status}
+
+
+@router.get("/{export_id}/download")
+def download_export(export_id: int, db: Session = Depends(get_db)):
+    """Download export zip file."""
+    export = db.query(Export).filter(Export.id == export_id).first()
+    if not export:
+        raise HTTPException(status_code=404, detail="Export not found")
+
+    if export.status != "completed":
+        raise HTTPException(status_code=400, detail="Export not ready")
+
+    if not export.file_path or not os.path.exists(export.file_path):
+        raise HTTPException(status_code=404, detail="Export file not found")
+
+    return FileResponse(
+        export.file_path,
+        media_type="application/zip",
+        filename=f"{export.name}.zip",
+    )
+
+
+@router.delete("/{export_id}")
+def delete_export(export_id: int, db: Session = Depends(get_db)):
+    """Delete an export and its file."""
+    export = db.query(Export).filter(Export.id == export_id).first()
+    if not export:
+        raise HTTPException(status_code=404, detail="Export not found")
+
+    # Delete file if exists
+    if export.file_path and os.path.exists(export.file_path):
+        os.remove(export.file_path)
+
+    db.delete(export)
+    db.commit()
+
+    return {"status": "deleted"}
@@ -0,0 +1,441 @@
+import os
+import shutil
+import uuid
+from pathlib import Path
+from typing import Optional, List
+
+from fastapi import APIRouter, Depends, HTTPException, Query
+from fastapi.responses import FileResponse
+from sqlalchemy.orm import Session
+from sqlalchemy import func
+from PIL import Image as PILImage
+
+from app.database import get_db
+from app.models import Image, Species
+from app.schemas.image import ImageResponse, ImageListResponse
+from app.config import get_settings
+
+router = APIRouter()
+settings = get_settings()
+
+
+@router.get("", response_model=ImageListResponse)
+def list_images(
+    page: int = Query(1, ge=1),
+    page_size: int = Query(50, ge=1, le=200),
+    species_id: Optional[int] = None,
+    source: Optional[str] = None,
+    license: Optional[str] = None,
+    status: Optional[str] = None,
+    min_quality: Optional[float] = None,
+    search: Optional[str] = None,
+    db: Session = Depends(get_db),
+):
+    """List images with pagination and filters."""
+    # Use joinedload to fetch species in single query
+    from sqlalchemy.orm import joinedload
+    query = db.query(Image).options(joinedload(Image.species))
+
+    if species_id:
+        query = query.filter(Image.species_id == species_id)
+
+    if source:
+        query = query.filter(Image.source == source)
+
+    if license:
+        query = query.filter(Image.license == license)
+
+    if status:
+        query = query.filter(Image.status == status)
+
+    if min_quality:
+        query = query.filter(Image.quality_score >= min_quality)
+
+    if search:
+        search_term = f"%{search}%"
+        query = query.join(Species).filter(
+            (Species.scientific_name.ilike(search_term)) |
+            (Species.common_name.ilike(search_term))
+        )
+
+    # Use faster count for simple queries
+    if not search:
+        # Build count query without join for better performance
+        count_query = db.query(func.count(Image.id))
+        if species_id:
+            count_query = count_query.filter(Image.species_id == species_id)
+        if source:
+            count_query = count_query.filter(Image.source == source)
+        if license:
+            count_query = count_query.filter(Image.license == license)
+        if status:
+            count_query = count_query.filter(Image.status == status)
+        if min_quality:
+            count_query = count_query.filter(Image.quality_score >= min_quality)
+        total = count_query.scalar()
+    else:
+        total = query.count()
+
+    pages = (total + page_size - 1) // page_size
+
+    images = query.order_by(Image.created_at.desc()).offset(
+        (page - 1) * page_size
+    ).limit(page_size).all()
+
+    items = [
+        ImageResponse(
+            id=img.id,
+            species_id=img.species_id,
+            species_name=img.species.scientific_name if img.species else None,
+            source=img.source,
+            source_id=img.source_id,
+            url=img.url,
+            local_path=img.local_path,
+            license=img.license,
+            attribution=img.attribution,
+            width=img.width,
+            height=img.height,
+            quality_score=img.quality_score,
+            status=img.status,
+            created_at=img.created_at,
+        )
+        for img in images
+    ]
+
+    return ImageListResponse(
+        items=items,
+        total=total,
+        page=page,
+        page_size=page_size,
+        pages=pages,
+    )
+
+
+@router.get("/sources")
+def list_sources(db: Session = Depends(get_db)):
+    """List all unique image sources."""
+    sources = db.query(Image.source).distinct().all()
+    return [s[0] for s in sources]
+
+
+@router.get("/licenses")
+def list_licenses(db: Session = Depends(get_db)):
+    """List all unique licenses."""
+    licenses = db.query(Image.license).distinct().all()
+    return [row[0] for row in licenses]
+
+
+@router.post("/process-pending")
+def process_pending_images(
+    source: Optional[str] = None,
+    db: Session = Depends(get_db),
+):
+    """Queue all pending images for download and processing."""
+    from app.workers.quality_tasks import batch_process_pending_images
+
+    query = db.query(func.count(Image.id)).filter(Image.status == "pending")
+    if source:
+        query = query.filter(Image.source == source)
+    pending_count = query.scalar()
+
+    task = batch_process_pending_images.delay(source=source)
+
+    return {
+        "pending_count": pending_count,
+        "task_id": task.id,
+    }
+
+
+@router.get("/process-pending/status/{task_id}")
+def process_pending_status(task_id: str):
+    """Check status of a batch processing task."""
+    from app.workers.celery_app import celery_app
+
+    result = celery_app.AsyncResult(task_id)
+    state = result.state  # PENDING, STARTED, PROGRESS, SUCCESS, FAILURE
+
+    response = {"task_id": task_id, "state": state}
+
+    if state == "PROGRESS" and isinstance(result.info, dict):
+        response["queued"] = result.info.get("queued", 0)
+        response["total"] = result.info.get("total", 0)
+    elif state == "SUCCESS" and isinstance(result.result, dict):
+        response["queued"] = result.result.get("queued", 0)
+        response["total"] = result.result.get("total", 0)
+
+    return response
+
+
+@router.get("/{image_id}", response_model=ImageResponse)
+def get_image(image_id: int, db: Session = Depends(get_db)):
+    """Get an image by ID."""
+    image = db.query(Image).filter(Image.id == image_id).first()
+    if not image:
+        raise HTTPException(status_code=404, detail="Image not found")
+
+    return ImageResponse(
+        id=image.id,
+        species_id=image.species_id,
+        species_name=image.species.scientific_name if image.species else None,
+        source=image.source,
+        source_id=image.source_id,
+        url=image.url,
+        local_path=image.local_path,
+        license=image.license,
+        attribution=image.attribution,
+        width=image.width,
+        height=image.height,
+        quality_score=image.quality_score,
+        status=image.status,
+        created_at=image.created_at,
+    )
+
+
+@router.get("/{image_id}/file")
+def get_image_file(image_id: int, db: Session = Depends(get_db)):
+    """Get the actual image file."""
+    image = db.query(Image).filter(Image.id == image_id).first()
+    if not image:
+        raise HTTPException(status_code=404, detail="Image not found")
+
+    if not image.local_path or not os.path.exists(image.local_path):
+        raise HTTPException(status_code=404, detail="Image file not available")
+    # Imports may be PNG, so let FileResponse infer the media type from the path.
+    return FileResponse(image.local_path)
+
+
+@router.delete("/{image_id}")
+def delete_image(image_id: int, db: Session = Depends(get_db)):
+    """Delete an image."""
+    image = db.query(Image).filter(Image.id == image_id).first()
+    if not image:
+        raise HTTPException(status_code=404, detail="Image not found")
+
+    # Delete file if exists
+    if image.local_path:
+        import os
+        if os.path.exists(image.local_path):
+            os.remove(image.local_path)
+
+    db.delete(image)
+    db.commit()
+
+    return {"status": "deleted"}
+
+
+@router.post("/bulk-delete")
+def bulk_delete_images(
+    image_ids: List[int],
+    db: Session = Depends(get_db),
+):
+    """Delete multiple images."""
+    import os
+
+    images = db.query(Image).filter(Image.id.in_(image_ids)).all()
+
+    deleted = 0
+    for image in images:
+        if image.local_path and os.path.exists(image.local_path):
+            os.remove(image.local_path)
+        db.delete(image)
+        deleted += 1
+
+    db.commit()
+
+    return {"deleted": deleted}
+
+
+@router.get("/import/scan")
+def scan_imports(db: Session = Depends(get_db)):
+    """Scan the imports folder and return what can be imported.
+
+    Expected structure: imports/{source}/{species_name}/*.jpg
+    """
+    imports_path = Path(settings.imports_path)
+
+    if not imports_path.exists():
+        return {
+            "available": False,
+            "message": f"Imports folder not found: {imports_path}",
+            "sources": [],
+            "total_images": 0,
+            "matched_species": 0,
+            "unmatched_species": [],
+        }
+
+    results = {
+        "available": True,
+        "sources": [],
+        "total_images": 0,
+        "matched_species": 0,
+        "unmatched_species": [],
+    }
+
+    # Get all species for matching
+    species_map = {}
+    for species in db.query(Species).all():
+        # Map by scientific name with underscores and spaces
+        species_map[species.scientific_name.lower()] = species
+        species_map[species.scientific_name.replace(" ", "_").lower()] = species
+
+    seen_unmatched = set()
+
+    # Scan source folders
+    for source_dir in imports_path.iterdir():
+        if not source_dir.is_dir():
+            continue
+
+        source_name = source_dir.name
+        source_info = {
+            "name": source_name,
+            "species_count": 0,
+            "image_count": 0,
+        }
+
+        # Scan species folders within source
+        for species_dir in source_dir.iterdir():
+            if not species_dir.is_dir():
+                continue
+
+            species_name = species_dir.name.replace("_", " ")
+            species_key = species_name.lower()
+
+            # Count images
+            image_files = list(species_dir.glob("*.jpg")) + \
+                         list(species_dir.glob("*.jpeg")) + \
+                         list(species_dir.glob("*.png"))
+
+            if not image_files:
+                continue
+
+            source_info["image_count"] += len(image_files)
+            results["total_images"] += len(image_files)
+
+            if species_key in species_map or species_dir.name.lower() in species_map:
+                source_info["species_count"] += 1
+                results["matched_species"] += 1
+            else:
+                if species_name not in seen_unmatched:
+                    seen_unmatched.add(species_name)
+                    results["unmatched_species"].append(species_name)
+
+        if source_info["image_count"] > 0:
+            results["sources"].append(source_info)
+
+    return results
+
+
+@router.post("/import/run")
+def run_import(
+    move_files: bool = Query(False, description="Move files instead of copy"),
+    db: Session = Depends(get_db),
+):
+    """Import images from the imports folder.
+
+    Expected structure: imports/{source}/{species_name}/*.jpg
+    Images are copied/moved to: images/{species_name}/{source}_{filename}
+    """
+    imports_path = Path(settings.imports_path)
+    images_path = Path(settings.images_path)
+
+    if not imports_path.exists():
+        raise HTTPException(status_code=400, detail="Imports folder not found")
+
+    # Get all species for matching
+    species_map = {}
+    for species in db.query(Species).all():
+        species_map[species.scientific_name.lower()] = species
+        species_map[species.scientific_name.replace(" ", "_").lower()] = species
+
+    imported = 0
+    skipped = 0
+    errors = []
+
+    # Scan source folders
+    for source_dir in imports_path.iterdir():
+        if not source_dir.is_dir():
+            continue
+
+        source_name = source_dir.name
+
+        # Scan species folders within source
+        for species_dir in source_dir.iterdir():
+            if not species_dir.is_dir():
+                continue
+
+            species_name = species_dir.name.replace("_", " ")
+            species_key = species_name.lower()
+
+            # Find matching species
+            species = species_map.get(species_key) or species_map.get(species_dir.name.lower())
+            if not species:
+                continue
+
+            # Create target directory
+            target_dir = images_path / species.scientific_name.replace(" ", "_")
+            target_dir.mkdir(parents=True, exist_ok=True)
+
+            # Process images
+            image_files = list(species_dir.glob("*.jpg")) + \
+                         list(species_dir.glob("*.jpeg")) + \
+                         list(species_dir.glob("*.png"))
+
+            for img_file in image_files:
+                try:
+                    # Generate unique filename
+                    ext = img_file.suffix.lower()
+                    if ext == ".jpeg":
+                        ext = ".jpg"
+                    new_filename = f"{source_name}_{img_file.stem}_{uuid.uuid4().hex[:8]}{ext}"
+                    target_path = target_dir / new_filename
+
+                    # Skip files already imported (same species, source, and original file stem)
+                    existing = db.query(Image).filter(
+                        Image.species_id == species.id,
+                        Image.source == source_name,
+                        Image.source_id == img_file.stem,
+                    ).first()
+
+                    if existing:
+                        skipped += 1
+                        continue
+
+                    # Get image dimensions
+                    try:
+                        with PILImage.open(img_file) as pil_img:
+                            width, height = pil_img.size
+                    except Exception:
+                        width, height = None, None
+
+                    # Copy or move file
+                    if move_files:
+                        shutil.move(str(img_file), str(target_path))
+                    else:
+                        shutil.copy2(str(img_file), str(target_path))
+
+                    # Create database record
+                    image = Image(
+                        species_id=species.id,
+                        source=source_name,
+                        source_id=img_file.stem,
+                        url=f"file://{img_file}",
+                        local_path=str(target_path),
+                        license="unknown",
+                        width=width,
+                        height=height,
+                        status="downloaded",
+                    )
+                    db.add(image)
+                    imported += 1
+
+                except Exception as e:
+                    errors.append(f"{img_file}: {str(e)}")
+
+            # Commit after each species to avoid large transactions
+            db.commit()
+
+    return {
+        "imported": imported,
+        "skipped": skipped,
+        "errors": errors[:20],
+    }
@@ -0,0 +1,173 @@
+import json
+from typing import Optional
+
+from fastapi import APIRouter, Depends, HTTPException, Query
+from sqlalchemy.orm import Session
+
+from app.database import get_db
+from app.models import Job
+from app.schemas.job import JobCreate, JobResponse, JobListResponse
+from app.workers.scrape_tasks import run_scrape_job
+
+router = APIRouter()
+
+
+@router.get("", response_model=JobListResponse)
+def list_jobs(
+    status: Optional[str] = None,
+    source: Optional[str] = None,
+    limit: int = Query(50, ge=1, le=200),
+    db: Session = Depends(get_db),
+):
+    """List all jobs."""
+    query = db.query(Job)
+
+    if status:
+        query = query.filter(Job.status == status)
+
+    if source:
+        query = query.filter(Job.source == source)
+
+    total = query.count()
+    jobs = query.order_by(Job.created_at.desc()).limit(limit).all()
+
+    return JobListResponse(
+        items=[JobResponse.model_validate(j) for j in jobs],
+        total=total,
+    )
+
+
+@router.post("", response_model=JobResponse)
+def create_job(job: JobCreate, db: Session = Depends(get_db)):
+    """Create and start a new scrape job."""
+    species_filter = None
+    if job.species_ids:
+        species_filter = json.dumps(job.species_ids)
+
+    db_job = Job(
+        name=job.name,
+        source=job.source,
+        species_filter=species_filter,
+        only_without_images=job.only_without_images,
+        max_images=job.max_images,
+        status="pending",
+    )
+    db.add(db_job)
+    db.commit()
+    db.refresh(db_job)
+
+    # Start the Celery task
+    task = run_scrape_job.delay(db_job.id)
+    db_job.celery_task_id = task.id
+    db.commit()
+
+    return JobResponse.model_validate(db_job)
+
+
+@router.get("/{job_id}", response_model=JobResponse)
+def get_job(job_id: int, db: Session = Depends(get_db)):
+    """Get job status."""
+    job = db.query(Job).filter(Job.id == job_id).first()
+    if not job:
+        raise HTTPException(status_code=404, detail="Job not found")
+
+    return JobResponse.model_validate(job)
+
+
+@router.get("/{job_id}/progress")
+def get_job_progress(job_id: int, db: Session = Depends(get_db)):
+    """Get real-time job progress from Celery."""
+    from app.workers.celery_app import celery_app
+
+    job = db.query(Job).filter(Job.id == job_id).first()
+    if not job:
+        raise HTTPException(status_code=404, detail="Job not found")
+
+    if not job.celery_task_id:
+        return {
+            "status": job.status,
+            "progress_current": job.progress_current,
+            "progress_total": job.progress_total,
+        }
+
+    # Get Celery task state
+    result = celery_app.AsyncResult(job.celery_task_id)
+
+    if result.state == "PROGRESS":
+        meta = result.info if isinstance(result.info, dict) else {}
+        return {
+            "status": "running",
+            "progress_current": meta.get("current", 0),
+            "progress_total": meta.get("total", 0),
+            "current_species": meta.get("species", ""),
+        }
+
+    return {
+        "status": job.status,
+        "progress_current": job.progress_current,
+        "progress_total": job.progress_total,
+    }
+
+
+@router.post("/{job_id}/pause")
+def pause_job(job_id: int, db: Session = Depends(get_db)):
+    """Pause a running job."""
+    from app.workers.celery_app import celery_app
+
+    job = db.query(Job).filter(Job.id == job_id).first()
+    if not job:
+        raise HTTPException(status_code=404, detail="Job not found")
+
+    if job.status != "running":
+        raise HTTPException(status_code=400, detail="Job is not running")
+
+    # Revoke Celery task
+    if job.celery_task_id:
+        celery_app.control.revoke(job.celery_task_id, terminate=True)
+
+    job.status = "paused"
+    db.commit()
+
+    return {"status": "paused"}
+
+
+@router.post("/{job_id}/resume")
+def resume_job(job_id: int, db: Session = Depends(get_db)):
+    """Resume a paused job."""
+    job = db.query(Job).filter(Job.id == job_id).first()
+    if not job:
+        raise HTTPException(status_code=404, detail="Job not found")
+
+    if job.status != "paused":
+        raise HTTPException(status_code=400, detail="Job is not paused")
+
+    # Start new Celery task
+    task = run_scrape_job.delay(job.id)
+    job.celery_task_id = task.id
+    job.status = "pending"
+    db.commit()
+
+    return {"status": "resumed"}
+
+
+@router.post("/{job_id}/cancel")
+def cancel_job(job_id: int, db: Session = Depends(get_db)):
+    """Cancel a job."""
+    from app.workers.celery_app import celery_app
+
+    job = db.query(Job).filter(Job.id == job_id).first()
+    if not job:
+        raise HTTPException(status_code=404, detail="Job not found")
+
+    if job.status in ["completed", "failed"]:
+        raise HTTPException(status_code=400, detail="Job already finished")
+
+    # Revoke Celery task
+    if job.celery_task_id:
+        celery_app.control.revoke(job.celery_task_id, terminate=True)
+
+    job.status = "failed"
+    job.error_message = "Cancelled by user"
+    db.commit()
+
+    return {"status": "cancelled"}
@@ -0,0 +1,198 @@
+from fastapi import APIRouter, Depends, HTTPException
+from sqlalchemy.orm import Session
+
+from app.database import get_db
+from app.models import ApiKey
+from app.schemas.api_key import ApiKeyCreate, ApiKeyUpdate, ApiKeyResponse
+
+router = APIRouter()
+
+# Available sources
+# auth_type: "none" (no auth), "api_key" (single key), "api_key_secret" (key + secret), "oauth" (client_id + client_secret + access_token)
+# default_rate: safe default requests per second for each API
+AVAILABLE_SOURCES = [
+    {"name": "gbif", "label": "GBIF", "requires_secret": False, "auth_type": "none", "default_rate": 1.0},  # Free, no auth required
+    {"name": "inaturalist", "label": "iNaturalist", "requires_secret": True, "auth_type": "api_key_secret", "default_rate": 1.0},  # 60/min limit
+    {"name": "flickr", "label": "Flickr", "requires_secret": True, "auth_type": "api_key_secret", "default_rate": 0.5},  # 3600/hr shared limit
+    {"name": "wikimedia", "label": "Wikimedia Commons", "requires_secret": True, "auth_type": "oauth", "default_rate": 1.0},  # generous limits
+    {"name": "trefle", "label": "Trefle.io", "requires_secret": False, "auth_type": "api_key", "default_rate": 1.0},  # 120/min limit
+    {"name": "duckduckgo", "label": "DuckDuckGo", "requires_secret": False, "auth_type": "none", "default_rate": 0.5},  # Web search, no API key
+    {"name": "bing", "label": "Bing Image Search", "requires_secret": False, "auth_type": "api_key", "default_rate": 3.0},  # Azure Cognitive Services
+]
+
+
+def mask_api_key(key: str) -> str:
+    """Mask API key, showing only last 4 characters."""
+    if not key or len(key) <= 4:
+        return "****"
+    return "*" * (len(key) - 4) + key[-4:]
+
+
+@router.get("")
+def list_sources(db: Session = Depends(get_db)):
+    """List all available sources with their configuration status."""
+    api_keys = {k.source: k for k in db.query(ApiKey).all()}
+
+    result = []
+    for source in AVAILABLE_SOURCES:
+        api_key = api_keys.get(source["name"])
+        default_rate = source.get("default_rate", 1.0)
+        result.append({
+            "name": source["name"],
+            "label": source["label"],
+            "requires_secret": source["requires_secret"],
+            "auth_type": source.get("auth_type", "api_key"),
+            "configured": api_key is not None,
+            "enabled": api_key.enabled if api_key else False,
+            "api_key_masked": mask_api_key(api_key.api_key) if api_key and source.get("auth_type") != "none" else None,
+            "has_secret": bool(api_key.api_secret) if api_key else False,
+            "has_access_token": bool(getattr(api_key, 'access_token', None)) if api_key else False,
+            "rate_limit_per_sec": api_key.rate_limit_per_sec if api_key else default_rate,
+            "default_rate": default_rate,
+        })
+
+    return result
+
+
+@router.get("/{source}")
+def get_source(source: str, db: Session = Depends(get_db)):
+    """Get source configuration."""
+    source_info = next((s for s in AVAILABLE_SOURCES if s["name"] == source), None)
+    if not source_info:
+        raise HTTPException(status_code=404, detail="Unknown source")
+
+    api_key = db.query(ApiKey).filter(ApiKey.source == source).first()
+    default_rate = source_info.get("default_rate", 1.0)
+
+    return {
+        "name": source_info["name"],
+        "label": source_info["label"],
+        "requires_secret": source_info["requires_secret"],
+        "auth_type": source_info.get("auth_type", "api_key"),
+        "configured": api_key is not None,
+        "enabled": api_key.enabled if api_key else False,
+        "api_key_masked": mask_api_key(api_key.api_key) if api_key and source_info.get("auth_type") != "none" else None,
+        "has_secret": bool(api_key.api_secret) if api_key else False,
+        "has_access_token": bool(getattr(api_key, 'access_token', None)) if api_key else False,
+        "rate_limit_per_sec": api_key.rate_limit_per_sec if api_key else default_rate,
+        "default_rate": default_rate,
+    }
+
+
+@router.put("/{source}")
+def update_source(
+    source: str,
+    config: ApiKeyCreate,
+    db: Session = Depends(get_db),
+):
+    """Create or update source configuration."""
+    source_info = next((s for s in AVAILABLE_SOURCES if s["name"] == source), None)
+    if not source_info:
+        raise HTTPException(status_code=404, detail="Unknown source")
+
+    # For sources that require auth, validate api_key is provided
+    auth_type = source_info.get("auth_type", "api_key")
+    if auth_type != "none" and not config.api_key:
+        raise HTTPException(status_code=400, detail="API key is required for this source")
+
+    api_key = db.query(ApiKey).filter(ApiKey.source == source).first()
+
+    # Use placeholder for no-auth sources
+    api_key_value = config.api_key or "no-auth"
+
+    if api_key:
+        # Update existing
+        api_key.api_key = api_key_value
+        if config.api_secret:
+            api_key.api_secret = config.api_secret
+        if config.access_token:
+            api_key.access_token = config.access_token
+        api_key.rate_limit_per_sec = config.rate_limit_per_sec
+        api_key.enabled = config.enabled
+    else:
+        # Create new
+        api_key = ApiKey(
+            source=source,
+            api_key=api_key_value,
+            api_secret=config.api_secret,
+            access_token=config.access_token,
+            rate_limit_per_sec=config.rate_limit_per_sec,
+            enabled=config.enabled,
+        )
+        db.add(api_key)
+
+    db.commit()
+    db.refresh(api_key)
+
+    return {
+        "name": source,
+        "configured": True,
+        "enabled": api_key.enabled,
+        "api_key_masked": mask_api_key(api_key.api_key) if auth_type != "none" else None,
+        "has_secret": bool(api_key.api_secret),
+        "has_access_token": bool(api_key.access_token),
+        "rate_limit_per_sec": api_key.rate_limit_per_sec,
+    }
+
+
+@router.patch("/{source}")
+def patch_source(
+    source: str,
+    config: ApiKeyUpdate,
+    db: Session = Depends(get_db),
+):
+    """Partially update source configuration."""
+    api_key = db.query(ApiKey).filter(ApiKey.source == source).first()
+    if not api_key:
+        raise HTTPException(status_code=404, detail="Source not configured")
+
+    update_data = config.model_dump(exclude_unset=True)
+    for field, value in update_data.items():
+        setattr(api_key, field, value)
+
+    db.commit()
+    db.refresh(api_key)
+
+    return {
+        "name": source,
+        "configured": True,
+        "enabled": api_key.enabled,
+        "api_key_masked": mask_api_key(api_key.api_key),
+        "has_secret": bool(api_key.api_secret),
+        "has_access_token": bool(api_key.access_token),
+        "rate_limit_per_sec": api_key.rate_limit_per_sec,
+    }
+
+
+@router.delete("/{source}")
+def delete_source(source: str, db: Session = Depends(get_db)):
+    """Delete source configuration."""
+    api_key = db.query(ApiKey).filter(ApiKey.source == source).first()
+    if not api_key:
+        raise HTTPException(status_code=404, detail="Source not configured")
+
+    db.delete(api_key)
+    db.commit()
+
+    return {"status": "deleted"}
+
+
+@router.post("/{source}/test")
+def test_source(source: str, db: Session = Depends(get_db)):
+    """Test source API connection."""
+    api_key = db.query(ApiKey).filter(ApiKey.source == source).first()
+    if not api_key:
+        raise HTTPException(status_code=404, detail="Source not configured")
+
+    # Import and test the scraper
+    from app.scrapers import get_scraper
+
+    scraper = get_scraper(source)
+    if not scraper:
+        raise HTTPException(status_code=400, detail="No scraper for this source")
+
+    try:
+        result = scraper.test_connection(api_key)
+        return {"status": "success", "message": result}
+    except Exception as e:
+        return {"status": "error", "message": str(e)}
@@ -0,0 +1,366 @@
+import csv
+import io
+import json
+from typing import Optional
+
+from fastapi import APIRouter, Depends, HTTPException, Query, UploadFile, File
+from sqlalchemy.orm import Session
+from sqlalchemy import func, text
+
+from app.database import get_db
+from app.models import Species, Image
+from app.schemas.species import (
+    SpeciesCreate,
+    SpeciesUpdate,
+    SpeciesResponse,
+    SpeciesListResponse,
+    SpeciesImportResponse,
+)
+
+router = APIRouter()
+
+
+def get_species_with_count(db: Session, species: Species) -> SpeciesResponse:
+    """Get species response with image count."""
+    image_count = db.query(func.count(Image.id)).filter(
+        Image.species_id == species.id,
+        Image.status == "downloaded"
+    ).scalar()
+
+    return SpeciesResponse(
+        id=species.id,
+        scientific_name=species.scientific_name,
+        common_name=species.common_name,
+        genus=species.genus,
+        family=species.family,
+        created_at=species.created_at,
+        image_count=image_count or 0,
+    )
+
+
+@router.get("", response_model=SpeciesListResponse)
+def list_species(
+    page: int = Query(1, ge=1),
+    page_size: int = Query(50, ge=1, le=500),
+    search: Optional[str] = None,
+    genus: Optional[str] = None,
+    has_images: Optional[bool] = None,
+    max_images: Optional[int] = Query(None, description="Filter species with fewer than N images"),
+    min_images: Optional[int] = Query(None, description="Filter species with at least N images"),
+    db: Session = Depends(get_db),
+):
+    """List species with pagination and filters.
+
+    Filters:
+    - search: Search by scientific or common name
+    - genus: Filter by genus
+    - has_images: True for species with images, False for species without
+    - max_images: Filter species with fewer than N downloaded images
+    - min_images: Filter species with at least N downloaded images
+    """
+    # If filtering by image count, we need to use a subquery approach
+    if max_images is not None or min_images is not None:
+        # Build a subquery with image counts per species
+        image_counts = (
+            db.query(
+                Species.id.label("species_id"),
+                func.count(Image.id).label("img_count")
+            )
+            .outerjoin(Image, (Image.species_id == Species.id) & (Image.status == "downloaded"))
+            .group_by(Species.id)
+            .subquery()
+        )
+
+        # Join species with their counts
+        query = db.query(Species).join(
+            image_counts, Species.id == image_counts.c.species_id
+        )
+
+        if max_images is not None:
+            query = query.filter(image_counts.c.img_count < max_images)
+
+        if min_images is not None:
+            query = query.filter(image_counts.c.img_count >= min_images)
+    else:
+        query = db.query(Species)
+
+    if search:
+        search_term = f"%{search}%"
+        query = query.filter(
+            (Species.scientific_name.ilike(search_term)) |
+            (Species.common_name.ilike(search_term))
+        )
+
+    if genus:
+        query = query.filter(Species.genus == genus)
+
+    # Filter by whether species has downloaded images (only if not using min/max filters)
+    if has_images is not None and max_images is None and min_images is None:
+        # Get IDs of species that have at least one downloaded image
+        species_with_images = (
+            db.query(Image.species_id)
+            .filter(Image.status == "downloaded")
+            .distinct()
+            .subquery()
+        )
+        if has_images:
+            query = query.filter(Species.id.in_(db.query(species_with_images.c.species_id)))
+        else:
+            query = query.filter(~Species.id.in_(db.query(species_with_images.c.species_id)))
+
+    total = query.count()
+    pages = (total + page_size - 1) // page_size
+
+    species_list = query.order_by(Species.scientific_name).offset(
+        (page - 1) * page_size
+    ).limit(page_size).all()
+
+    # Fetch image counts in bulk for all species on this page
+    species_ids = [s.id for s in species_list]
+    if species_ids:
+        count_query = db.query(
+            Image.species_id,
+            func.count(Image.id)
+        ).filter(
+            Image.species_id.in_(species_ids),
+            Image.status == "downloaded"
+        ).group_by(Image.species_id).all()
+        count_map = {species_id: count for species_id, count in count_query}
+    else:
+        count_map = {}
+
+    items = [
+        SpeciesResponse(
+            id=s.id,
+            scientific_name=s.scientific_name,
+            common_name=s.common_name,
+            genus=s.genus,
+            family=s.family,
+            created_at=s.created_at,
+            image_count=count_map.get(s.id, 0),
+        )
+        for s in species_list
+    ]
+
+    return SpeciesListResponse(
+        items=items,
+        total=total,
+        page=page,
+        page_size=page_size,
+        pages=pages,
+    )
+
+
+@router.post("", response_model=SpeciesResponse)
+def create_species(species: SpeciesCreate, db: Session = Depends(get_db)):
+    """Create a new species."""
+    existing = db.query(Species).filter(
+        Species.scientific_name == species.scientific_name
+    ).first()
+
+    if existing:
+        raise HTTPException(status_code=400, detail="Species already exists")
+
+    # Auto-extract genus from scientific name if not provided
+    genus = species.genus
+    if not genus and " " in species.scientific_name:
+        genus = species.scientific_name.split()[0]
+
+    db_species = Species(
+        scientific_name=species.scientific_name,
+        common_name=species.common_name,
+        genus=genus,
+        family=species.family,
+    )
+    db.add(db_species)
+    db.commit()
+    db.refresh(db_species)
+
+    return get_species_with_count(db, db_species)
+
+
+@router.post("/import", response_model=SpeciesImportResponse)
+async def import_species(
+    file: UploadFile = File(...),
+    db: Session = Depends(get_db),
+):
+    """Import species from CSV file.
+
+    Expected columns: scientific_name, common_name (optional), genus (optional), family (optional)
+    """
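+    # filename may be None and extensions vary in case, so normalize before checking
+    if not (file.filename or "").lower().endswith(".csv"):
+        raise HTTPException(status_code=400, detail="File must be a CSV")
+
+    content = await file.read()
+    try:
+        # utf-8-sig also strips a leading BOM, which would otherwise corrupt the first header name
+        text = content.decode("utf-8-sig")
+    except UnicodeDecodeError:
+        raise HTTPException(status_code=400, detail="File must be UTF-8 encoded")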
+
+    reader = csv.DictReader(io.StringIO(text))
+
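+    imported = 0
+    skipped = 0
+    errors = []
+    seen_names = set()  # names queued in this upload; unflushed rows are invisible to the query below
+
+    for row_num, row in enumerate(reader, start=2):
+        # DictReader fills missing trailing fields with None, so guard before strip()
+        scientific_name = (row.get("scientific_name") or "").strip()
+        if not scientific_name:
+            errors.append(f"Row {row_num}: Missing scientific_name")
+            continue
+
+        # Skip names already in the database or repeated earlier in this file
+        existing = db.query(Species).filter(
+            Species.scientific_name == scientific_name
+        ).first()
+
+        if existing or scientific_name in seen_names:
+            skipped += 1
+            continue
+        seen_names.add(scientific_name)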
+
+        # Auto-extract genus if not provided
+        genus = (row.get("genus") or "").strip()  # value is None when the row is short
+        if not genus and " " in scientific_name:
+            genus = scientific_name.split()[0]
+
+        try:
+            species = Species(
+                scientific_name=scientific_name,
+                common_name=row.get("common_name", "").strip() or None,
+                genus=genus or None,
+                family=row.get("family", "").strip() or None,
+            )
+            db.add(species)
+            imported += 1
+        except Exception as e:
+            errors.append(f"Row {row_num}: {str(e)}")
+
+    db.commit()
+
+    return SpeciesImportResponse(
+        imported=imported,
+        skipped=skipped,
+        errors=errors[:10],  # Limit error messages
+    )
+
+
+@router.post("/import-json", response_model=SpeciesImportResponse)
+async def import_species_json(
+    file: UploadFile = File(...),
+    db: Session = Depends(get_db),
+):
+    """Import species from JSON file.
+
+    Expected format: {"plants": [{"scientific_name": "...", "common_names": [...], "family": "..."}]}
+    """
+    # filename may be None and extensions vary in case, so normalize before checking
+    if not (file.filename or "").lower().endswith(".json"):
+        raise HTTPException(status_code=400, detail="File must be a JSON file")
+
+    content = await file.read()
+    try:
+        data = json.loads(content.decode("utf-8"))
+    except (UnicodeDecodeError, json.JSONDecodeError) as e:
+        raise HTTPException(status_code=400, detail=f"Invalid JSON: {e}")
+
+    plants = data.get("plants", []) if isinstance(data, dict) else []  # top-level JSON may not be an object
+    if not plants:
+        raise HTTPException(status_code=400, detail="No plants found in JSON")
+
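+    imported = 0
+    skipped = 0
+    errors = []
+    seen_names = set()  # names queued in this upload; unflushed rows are invisible to the query below
+
+    for idx, plant in enumerate(plants):
+        # Tolerate an explicit null for scientific_name
+        scientific_name = (plant.get("scientific_name") or "").strip()
+        if not scientific_name:
+            errors.append(f"Plant {idx}: Missing scientific_name")
+            continue
+
+        # Skip names already in the database or repeated earlier in this file
+        existing = db.query(Species).filter(
+            Species.scientific_name == scientific_name
+        ).first()
+
+        if existing or scientific_name in seen_names:
+            skipped += 1
+            continue
+        seen_names.add(scientific_name)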
+
+        # Auto-extract genus from scientific name
+        genus = None
+        if " " in scientific_name:
+            genus = scientific_name.split()[0]
+
+        # Get first common name if array provided
+        common_names = plant.get("common_names", [])
+        common_name = common_names[0] if common_names else None
+
+        try:
+            species = Species(
+                scientific_name=scientific_name,
+                common_name=common_name,
+                genus=genus,
+                family=plant.get("family"),
+            )
+            db.add(species)
+            imported += 1
+        except Exception as e:
+            errors.append(f"Plant {idx}: {str(e)}")
+
+    db.commit()
+
+    return SpeciesImportResponse(
+        imported=imported,
+        skipped=skipped,
+        errors=errors[:10],
+    )
+
+
+@router.get("/{species_id}", response_model=SpeciesResponse)
+def get_species(species_id: int, db: Session = Depends(get_db)):
+    """Get a species by ID."""
+    species = db.query(Species).filter(Species.id == species_id).first()
+    if not species:
+        raise HTTPException(status_code=404, detail="Species not found")
+
+    return get_species_with_count(db, species)
+
+
+@router.put("/{species_id}", response_model=SpeciesResponse)
+def update_species(
+    species_id: int,
+    species_update: SpeciesUpdate,
+    db: Session = Depends(get_db),
+):
+    """Update a species."""
+    species = db.query(Species).filter(Species.id == species_id).first()
+    if not species:
+        raise HTTPException(status_code=404, detail="Species not found")
+
+    update_data = species_update.model_dump(exclude_unset=True)
+    for field, value in update_data.items():
+        setattr(species, field, value)
+
+    db.commit()
+    db.refresh(species)
+
+    return get_species_with_count(db, species)
+
+
+@router.delete("/{species_id}")
+def delete_species(species_id: int, db: Session = Depends(get_db)):
+    """Delete a species and all its images."""
+    species = db.query(Species).filter(Species.id == species_id).first()
+    if not species:
+        raise HTTPException(status_code=404, detail="Species not found")
+
+    db.delete(species)
+    db.commit()
+
+    return {"status": "deleted"}
+
+
+@router.get("/genera/list")
+def list_genera(db: Session = Depends(get_db)):
+    """List all unique genera."""
+    genera = db.query(Species.genus).filter(
+        Species.genus.isnot(None)
+    ).distinct().order_by(Species.genus).all()
+
+    return [g[0] for g in genera]
@@ -0,0 +1,190 @@
+import json
+
+from fastapi import APIRouter, Depends, HTTPException
+from sqlalchemy.orm import Session
+from sqlalchemy import func, case
+
+from app.database import get_db
+from app.models import Species, Image, Job
+from app.models.cached_stats import CachedStats
+from app.schemas.stats import StatsResponse, SourceStats, LicenseStats, SpeciesStats, JobStats
+
+router = APIRouter()
+
+
+@router.get("", response_model=StatsResponse)
+def get_stats(db: Session = Depends(get_db)):
+    """Get dashboard statistics from cache (updated every 60s by Celery)."""
+    # Try to get cached stats
+    cached = db.query(CachedStats).filter(CachedStats.key == "dashboard_stats").first()
+
+    if cached:
+        data = json.loads(cached.value)
+        return StatsResponse(
+            total_species=data["total_species"],
+            total_images=data["total_images"],
+            images_downloaded=data["images_downloaded"],
+            images_pending=data["images_pending"],
+            images_rejected=data["images_rejected"],
+            disk_usage_mb=data["disk_usage_mb"],
+            sources=[SourceStats(**s) for s in data["sources"]],
+            licenses=[LicenseStats(**l) for l in data["licenses"]],
+            jobs=JobStats(**data["jobs"]),
+            top_species=[SpeciesStats(**s) for s in data["top_species"]],
+            under_represented=[SpeciesStats(**s) for s in data["under_represented"]],
+        )
+
+    # No cache yet - return empty stats (Celery will populate soon)
+    # This only happens on first startup before Celery runs
+    return StatsResponse(
+        total_species=0,
+        total_images=0,
+        images_downloaded=0,
+        images_pending=0,
+        images_rejected=0,
+        disk_usage_mb=0.0,
+        sources=[],
+        licenses=[],
+        jobs=JobStats(running=0, pending=0, completed=0, failed=0),
+        top_species=[],
+        under_represented=[],
+    )
+
+
+@router.post("/refresh")
+def refresh_stats_now():
+    """Manually trigger a stats refresh."""
+    from app.workers.stats_tasks import refresh_stats
+    refresh_stats.delay()
+    return {"status": "refresh_queued"}
+
+
+@router.get("/sources")
+def get_source_stats(db: Session = Depends(get_db)):
+    """Get per-source breakdown."""
+    stats = db.query(
+        Image.source,
+        func.count(Image.id).label("total"),
+        func.sum(case((Image.status == "downloaded", 1), else_=0)).label("downloaded"),
+        func.sum(case((Image.status == "pending", 1), else_=0)).label("pending"),
+        func.sum(case((Image.status == "rejected", 1), else_=0)).label("rejected"),
+    ).group_by(Image.source).all()
+
+    return [
+        {
+            "source": s.source,
+            "total": s.total,
+            "downloaded": s.downloaded or 0,
+            "pending": s.pending or 0,
+            "rejected": s.rejected or 0,
+        }
+        for s in stats
+    ]
+
+
+@router.get("/species")
+def get_species_stats(
+    min_count: int = 0,
+    max_count: int | None = None,
+    db: Session = Depends(get_db),
+):
+    """Get per-species image counts."""
+    query = db.query(
+        Species.id,
+        Species.scientific_name,
+        Species.common_name,
+        Species.genus,
+        func.count(Image.id).label("image_count")
+    ).outerjoin(Image, (Image.species_id == Species.id) & (Image.status == "downloaded")
+    ).group_by(Species.id)
+
+    if min_count > 0:
+        query = query.having(func.count(Image.id) >= min_count)
+
+    if max_count is not None:
+        query = query.having(func.count(Image.id) <= max_count)
+
+    stats = query.order_by(func.count(Image.id).desc()).all()
+
+    return [
+        {
+            "id": s.id,
+            "scientific_name": s.scientific_name,
+            "common_name": s.common_name,
+            "genus": s.genus,
+            "image_count": s.image_count,
+        }
+        for s in stats
+    ]
+
+
+@router.get("/distribution")
+def get_image_distribution(db: Session = Depends(get_db)):
+    """Get distribution of images per species for ML training assessment.
+
+    Returns counts of species at various image thresholds to help
+    determine dataset quality for training image classifiers.
+    """
+    from sqlalchemy import text
+
+    # Count downloaded images per species, then bucket the counts in a single aggregate query
+    distribution_sql = text("""
+        WITH species_counts AS (
+            SELECT
+                s.id,
+                COUNT(i.id) as cnt
+            FROM species s
+            LEFT JOIN images i ON i.species_id = s.id AND i.status = 'downloaded'
+            GROUP BY s.id
+        )
+        SELECT
+            COUNT(*) as total_species,
+            SUM(CASE WHEN cnt = 0 THEN 1 ELSE 0 END) as with_0,
+            SUM(CASE WHEN cnt >= 1 AND cnt < 10 THEN 1 ELSE 0 END) as with_1_9,
+            SUM(CASE WHEN cnt >= 10 AND cnt < 25 THEN 1 ELSE 0 END) as with_10_24,
+            SUM(CASE WHEN cnt >= 25 AND cnt < 50 THEN 1 ELSE 0 END) as with_25_49,
+            SUM(CASE WHEN cnt >= 50 AND cnt < 100 THEN 1 ELSE 0 END) as with_50_99,
+            SUM(CASE WHEN cnt >= 100 AND cnt < 200 THEN 1 ELSE 0 END) as with_100_199,
+            SUM(CASE WHEN cnt >= 200 THEN 1 ELSE 0 END) as with_200_plus,
+            SUM(CASE WHEN cnt >= 10 THEN 1 ELSE 0 END) as trainable_10,
+            SUM(CASE WHEN cnt >= 25 THEN 1 ELSE 0 END) as trainable_25,
+            SUM(CASE WHEN cnt >= 50 THEN 1 ELSE 0 END) as trainable_50,
+            SUM(CASE WHEN cnt >= 100 THEN 1 ELSE 0 END) as trainable_100,
+            AVG(cnt) as avg_images,
+            MAX(cnt) as max_images,
+            MIN(cnt) as min_images,
+            SUM(cnt) as total_images
+        FROM species_counts
+    """)
+
+    result = db.execute(distribution_sql).fetchone()
+
+    return {
+        "total_species": result[0] or 0,
+        "distribution": {
+            "0_images": result[1] or 0,
+            "1_to_9": result[2] or 0,
+            "10_to_24": result[3] or 0,
+            "25_to_49": result[4] or 0,
+            "50_to_99": result[5] or 0,
+            "100_to_199": result[6] or 0,
+            "200_plus": result[7] or 0,
+        },
+        "trainable_species": {
+            "min_10_images": result[8] or 0,
+            "min_25_images": result[9] or 0,
+            "min_50_images": result[10] or 0,
+            "min_100_images": result[11] or 0,
+        },
+        "summary": {
+            "avg_images_per_species": round(result[12] or 0, 1),
+            "max_images": result[13] or 0,
+            "min_images": result[14] or 0,
+            "total_downloaded_images": result[15] or 0,
+        },
+        "recommendations": {
+            "for_basic_model": f"{result[8] or 0} species with 10+ images",
+            "for_good_model": f"{result[10] or 0} species with 50+ images",
+            "for_excellent_model": f"{result[11] or 0} species with 100+ images",
+        }
+    }
@@ -0,0 +1,38 @@
+from pydantic_settings import BaseSettings
+from functools import lru_cache
+
+
+class Settings(BaseSettings):
+    # Database
+    database_url: str = "sqlite:////data/db/plants.sqlite"
+
+    # Redis
+    redis_url: str = "redis://redis:6379/0"
+
+    # Storage paths
+    images_path: str = "/data/images"
+    exports_path: str = "/data/exports"
+    imports_path: str = "/data/imports"
+    logs_path: str = "/data/logs"
+
+    # API Keys
+    flickr_api_key: str = ""
+    flickr_api_secret: str = ""
+    inaturalist_app_id: str = ""
+    inaturalist_app_secret: str = ""
+    trefle_api_key: str = ""
+
+    # Logging
+    log_level: str = "INFO"
+
+    # Celery
+    celery_concurrency: int = 4
+
+    class Config:
+        env_file = ".env"
+        extra = "ignore"
+
+
+@lru_cache()
+def get_settings() -> Settings:
+    return Settings()
@@ -0,0 +1,44 @@
+from sqlalchemy import create_engine, event
+from sqlalchemy.orm import sessionmaker, declarative_base
+from sqlalchemy.pool import StaticPool
+
+from app.config import get_settings
+
+settings = get_settings()
+
+# SQLite-specific configuration
+connect_args = {"check_same_thread": False}
+
+engine = create_engine(
+    settings.database_url,
+    connect_args=connect_args,
+    poolclass=StaticPool,  # one connection shared by every thread and session
+    echo=False,
+)
+
+# Per-connection pragmas: WAL for concurrent readers, relaxed fsync, foreign-key enforcement
+@event.listens_for(engine, "connect")
+def set_sqlite_pragma(dbapi_connection, connection_record):
+    cursor = dbapi_connection.cursor()
+    cursor.execute("PRAGMA journal_mode=WAL")
+    cursor.execute("PRAGMA synchronous=NORMAL")
+    cursor.execute("PRAGMA foreign_keys=ON")
+    cursor.close()
+
+SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
+
+Base = declarative_base()
+
+
+def get_db():
+    db = SessionLocal()
+    try:
+        yield db
+    finally:
+        db.close()
+
+
+def init_db():
+    """Create all tables."""
+    from app.models import species, image, job, api_key, export, cached_stats  # noqa
+    Base.metadata.create_all(bind=engine)
@@ -0,0 +1,95 @@
+from fastapi import FastAPI
+from fastapi.middleware.cors import CORSMiddleware
+
+from app.config import get_settings
+from app.database import init_db
+from app.api import species, images, jobs, exports, stats, sources
+
+settings = get_settings()
+
+app = FastAPI(
+    title="PlantGuideScraper API",
+    description="Web scraper interface for houseplant image collection",
+    version="1.0.0",
+)
+
+# CORS middleware
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_credentials=True,
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+
+# Include routers
+app.include_router(species.router, prefix="/api/species", tags=["Species"])
+app.include_router(images.router, prefix="/api/images", tags=["Images"])
+app.include_router(jobs.router, prefix="/api/jobs", tags=["Jobs"])
+app.include_router(exports.router, prefix="/api/exports", tags=["Exports"])
+app.include_router(stats.router, prefix="/api/stats", tags=["Stats"])
+app.include_router(sources.router, prefix="/api/sources", tags=["Sources"])
+
+
+@app.on_event("startup")
+async def startup_event():
+    """Initialize database on startup."""
+    init_db()
+
+
+@app.get("/health")
+async def health_check():
+    """Health check endpoint."""
+    return {"status": "healthy", "service": "plant-scraper"}
+
+
+@app.get("/api/debug")
+async def debug_check():
+    """Debug endpoint - checks database connection."""
+    import time
+    from app.database import SessionLocal
+    from app.models import Species, Image
+
+    results = {"status": "checking", "checks": {}}
+
+    # Check 1: Can we create a session?
+    try:
+        start = time.time()
+        db = SessionLocal()
+        results["checks"]["session_create"] = {"ok": True, "ms": int((time.time() - start) * 1000)}
+    except Exception as e:
+        results["checks"]["session_create"] = {"ok": False, "error": str(e)}
+        results["status"] = "error"
+        return results
+
+    # Check 2: Simple query - count species
+    try:
+        start = time.time()
+        count = db.query(Species).count()
+        results["checks"]["species_count"] = {"ok": True, "count": count, "ms": int((time.time() - start) * 1000)}
+    except Exception as e:
+        results["checks"]["species_count"] = {"ok": False, "error": str(e)}
+        results["status"] = "error"
+        db.close()
+        return results
+
+    # Check 3: Count images
+    try:
+        start = time.time()
+        count = db.query(Image).count()
+        results["checks"]["image_count"] = {"ok": True, "count": count, "ms": int((time.time() - start) * 1000)}
+    except Exception as e:
+        results["checks"]["image_count"] = {"ok": False, "error": str(e)}
+        results["status"] = "error"
+        db.close()
+        return results
+
+    db.close()
+    results["status"] = "healthy"
+    return results
+
+
+@app.get("/")
+async def root():
+    """Root endpoint."""
+    return {"message": "PlantGuideScraper API", "docs": "/docs"}
@@ -0,0 +1,8 @@
+from app.models.species import Species
+from app.models.image import Image
+from app.models.job import Job
+from app.models.api_key import ApiKey
+from app.models.export import Export
+from app.models.cached_stats import CachedStats
+
+__all__ = ["Species", "Image", "Job", "ApiKey", "Export", "CachedStats"]
@@ -0,0 +1,18 @@
+from sqlalchemy import Column, Integer, String, Float, Boolean
+
+from app.database import Base
+
+
+class ApiKey(Base):
+    __tablename__ = "api_keys"
+
+    id = Column(Integer, primary_key=True, index=True)
+    source = Column(String, unique=True, nullable=False)  # 'flickr', 'inaturalist', 'wikimedia', 'trefle'
+    api_key = Column(String, nullable=False)  # Also used as Client ID for OAuth sources
+    api_secret = Column(String, nullable=True)  # Also used as Client Secret for OAuth sources
+    access_token = Column(String, nullable=True)  # For OAuth sources like Wikimedia
+    rate_limit_per_sec = Column(Float, default=1.0)
+    enabled = Column(Boolean, default=True)
+
+    def __repr__(self):
+        return f"<ApiKey(id={self.id}, source='{self.source}', enabled={self.enabled})>"
@@ -0,0 +1,14 @@
+from datetime import datetime
+from sqlalchemy import Column, Integer, String, Text, DateTime
+
+from app.database import Base
+
+
+class CachedStats(Base):
+    """Stores pre-calculated statistics updated by Celery beat."""
+    __tablename__ = "cached_stats"
+
+    id = Column(Integer, primary_key=True, index=True)
+    key = Column(String(50), unique=True, nullable=False, index=True)
+    value = Column(Text, nullable=False)  # JSON-encoded stats
+    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
@@ -0,0 +1,24 @@
+from sqlalchemy import Column, Integer, String, Float, DateTime, Text, func
+
+from app.database import Base
+
+
+class Export(Base):
+    __tablename__ = "exports"
+
+    id = Column(Integer, primary_key=True, index=True)
+    name = Column(String, nullable=False)
+    filter_criteria = Column(Text, nullable=True)  # JSON: min_images, licenses, min_quality, species_ids
+    train_split = Column(Float, default=0.8)
+    status = Column(String, default="pending")  # pending, generating, completed, failed
+    file_path = Column(String, nullable=True)
+    file_size = Column(Integer, nullable=True)
+    species_count = Column(Integer, nullable=True)
+    image_count = Column(Integer, nullable=True)
+    celery_task_id = Column(String, nullable=True)
+    created_at = Column(DateTime, server_default=func.now())
+    completed_at = Column(DateTime, nullable=True)
+    error_message = Column(Text, nullable=True)
+
+    def __repr__(self):
+        return f"<Export(id={self.id}, name='{self.name}', status='{self.status}')>"
@@ -0,0 +1,36 @@
+from sqlalchemy import Column, Integer, String, Float, DateTime, ForeignKey, func, UniqueConstraint, Index
+from sqlalchemy.orm import relationship
+
+from app.database import Base
+
+
+class Image(Base):
+    __tablename__ = "images"
+
+    id = Column(Integer, primary_key=True, index=True)
+    species_id = Column(Integer, ForeignKey("species.id"), nullable=False, index=True)
+    source = Column(String, nullable=False, index=True)
+    source_id = Column(String, nullable=True)
+    url = Column(String, nullable=False)
+    local_path = Column(String, nullable=True)
+    license = Column(String, nullable=False, index=True)
+    attribution = Column(String, nullable=True)
+    width = Column(Integer, nullable=True)
+    height = Column(Integer, nullable=True)
+    phash = Column(String, nullable=True, index=True)
+    quality_score = Column(Float, nullable=True)
+    status = Column(String, default="pending", index=True)  # pending, downloaded, rejected, deleted
+    created_at = Column(DateTime, server_default=func.now())
+
+    # Composite indexes for common query patterns
+    __table_args__ = (
+        UniqueConstraint("source", "source_id", name="uq_source_source_id"),
+        Index("ix_images_species_status", "species_id", "status"),  # For counting images per species by status
+        Index("ix_images_status_created", "status", "created_at"),  # For listing images by status
+    )
+
+    # Relationships
+    species = relationship("Species", back_populates="images")
+
+    def __repr__(self):
+        return f"<Image(id={self.id}, source='{self.source}', status='{self.status}')>"
@@ -0,0 +1,27 @@
+from sqlalchemy import Column, Integer, String, DateTime, Text, Boolean, func
+
+from app.database import Base
+
+
+class Job(Base):
+    __tablename__ = "jobs"
+
+    id = Column(Integer, primary_key=True, index=True)
+    name = Column(String, nullable=False)
+    source = Column(String, nullable=False)
+    species_filter = Column(Text, nullable=True)  # JSON array of species IDs or NULL for all
+    only_without_images = Column(Boolean, default=False)  # If True, only scrape species with 0 images
+    max_images = Column(Integer, nullable=True)  # If set, only scrape species with fewer than N images
+    status = Column(String, default="pending", index=True)  # pending, running, paused, completed, failed
+    progress_current = Column(Integer, default=0)
+    progress_total = Column(Integer, default=0)
+    images_downloaded = Column(Integer, default=0)
+    images_rejected = Column(Integer, default=0)
+    celery_task_id = Column(String, nullable=True)
+    started_at = Column(DateTime, nullable=True)
+    completed_at = Column(DateTime, nullable=True)
+    error_message = Column(Text, nullable=True)
+    created_at = Column(DateTime, server_default=func.now())
+
+    def __repr__(self):
+        return f"<Job(id={self.id}, name='{self.name}', status='{self.status}')>"
@@ -0,0 +1,21 @@
+from sqlalchemy import Column, Integer, String, DateTime, func
+from sqlalchemy.orm import relationship
+
+from app.database import Base
+
+
+class Species(Base):
+    __tablename__ = "species"
+
+    id = Column(Integer, primary_key=True, index=True)
+    scientific_name = Column(String, unique=True, nullable=False, index=True)
+    common_name = Column(String, nullable=True)
+    genus = Column(String, nullable=True, index=True)
+    family = Column(String, nullable=True)
+    created_at = Column(DateTime, server_default=func.now())
+
+    # Relationships
+    images = relationship("Image", back_populates="species", cascade="all, delete-orphan")
+
+    def __repr__(self):
+        return f"<Species(id={self.id}, scientific_name='{self.scientific_name}')>"
@@ -0,0 +1,15 @@
+from app.schemas.species import SpeciesCreate, SpeciesUpdate, SpeciesResponse, SpeciesListResponse
+from app.schemas.image import ImageResponse, ImageListResponse, ImageFilter
+from app.schemas.job import JobCreate, JobResponse, JobListResponse
+from app.schemas.api_key import ApiKeyCreate, ApiKeyUpdate, ApiKeyResponse
+from app.schemas.export import ExportCreate, ExportResponse, ExportListResponse
+from app.schemas.stats import StatsResponse, SourceStats, SpeciesStats
+
+__all__ = [
+    "SpeciesCreate", "SpeciesUpdate", "SpeciesResponse", "SpeciesListResponse",
+    "ImageResponse", "ImageListResponse", "ImageFilter",
+    "JobCreate", "JobResponse", "JobListResponse",
+    "ApiKeyCreate", "ApiKeyUpdate", "ApiKeyResponse",
+    "ExportCreate", "ExportResponse", "ExportListResponse",
+    "StatsResponse", "SourceStats", "SpeciesStats",
+]
@@ -0,0 +1,36 @@
+from pydantic import BaseModel
+from typing import Optional
+
+
+class ApiKeyBase(BaseModel):
+    source: str
+    api_key: Optional[str] = None  # Optional for no-auth sources, used as Client ID for OAuth
+    api_secret: Optional[str] = None  # Also used as Client Secret for OAuth sources
+    access_token: Optional[str] = None  # For OAuth sources like Wikimedia
+    rate_limit_per_sec: float = 1.0
+    enabled: bool = True
+
+
+class ApiKeyCreate(ApiKeyBase):
+    pass
+
+
+class ApiKeyUpdate(BaseModel):
+    api_key: Optional[str] = None
+    api_secret: Optional[str] = None
+    access_token: Optional[str] = None
+    rate_limit_per_sec: Optional[float] = None
+    enabled: Optional[bool] = None
+
+
+class ApiKeyResponse(BaseModel):
+    id: int
+    source: str
+    api_key_masked: str  # Show only last 4 chars
+    has_secret: bool
+    has_access_token: bool
+    rate_limit_per_sec: float
+    enabled: bool
+
+    class Config:
+        from_attributes = True
@@ -0,0 +1,45 @@
+from pydantic import BaseModel
+from datetime import datetime
+from typing import Optional, List
+
+
+class ExportFilter(BaseModel):
+    min_images_per_species: int = 100
+    licenses: Optional[List[str]] = None  # None means all
+    min_quality: Optional[float] = None
+    species_ids: Optional[List[int]] = None  # None means all
+
+
+class ExportCreate(BaseModel):
+    name: str
+    filter_criteria: ExportFilter
+    train_split: float = 0.8
+
+
+class ExportResponse(BaseModel):
+    id: int
+    name: str
+    filter_criteria: Optional[str] = None
+    train_split: float
+    status: str
+    file_path: Optional[str] = None
+    file_size: Optional[int] = None
+    species_count: Optional[int] = None
+    image_count: Optional[int] = None
+    created_at: datetime
+    completed_at: Optional[datetime] = None
+    error_message: Optional[str] = None
+
+    class Config:
+        from_attributes = True
+
+
+class ExportListResponse(BaseModel):
+    items: List[ExportResponse]
+    total: int
+
+
+class ExportPreview(BaseModel):
+    species_count: int
+    image_count: int
+    estimated_size_mb: float
@@ -0,0 +1,47 @@
+from pydantic import BaseModel
+from datetime import datetime
+from typing import Optional, List
+
+
+class ImageBase(BaseModel):
+    species_id: int
+    source: str
+    url: str
+    license: str
+
+
+class ImageResponse(BaseModel):
+    id: int
+    species_id: int
+    species_name: Optional[str] = None
+    source: str
+    source_id: Optional[str] = None
+    url: str
+    local_path: Optional[str] = None
+    license: str
+    attribution: Optional[str] = None
+    width: Optional[int] = None
+    height: Optional[int] = None
+    quality_score: Optional[float] = None
+    status: str
+    created_at: datetime
+
+    class Config:
+        from_attributes = True
+
+
+class ImageListResponse(BaseModel):
+    items: List[ImageResponse]
+    total: int
+    page: int
+    page_size: int
+    pages: int
+
+
+class ImageFilter(BaseModel):
+    species_id: Optional[int] = None
+    source: Optional[str] = None
+    license: Optional[str] = None
+    status: Optional[str] = None
+    min_quality: Optional[float] = None
+    search: Optional[str] = None
@@ -0,0 +1,35 @@
+from pydantic import BaseModel
+from datetime import datetime
+from typing import Optional, List
+
+
+class JobCreate(BaseModel):
+    name: str
+    source: str
+    species_ids: Optional[List[int]] = None  # None means all species
+    only_without_images: bool = False  # If True, only scrape species with 0 images
+    max_images: Optional[int] = None  # If set, only scrape species with fewer than N images
+
+
+class JobResponse(BaseModel):
+    id: int
+    name: str
+    source: str
+    species_filter: Optional[str] = None
+    status: str
+    progress_current: int
+    progress_total: int
+    images_downloaded: int
+    images_rejected: int
+    started_at: Optional[datetime] = None
+    completed_at: Optional[datetime] = None
+    error_message: Optional[str] = None
+    created_at: datetime
+
+    class Config:
+        from_attributes = True
+
+
+class JobListResponse(BaseModel):
+    items: List[JobResponse]
+    total: int
@@ -0,0 +1,44 @@
+from pydantic import BaseModel
+from datetime import datetime
+from typing import Optional, List
+
+
+class SpeciesBase(BaseModel):
+    scientific_name: str
+    common_name: Optional[str] = None
+    genus: Optional[str] = None
+    family: Optional[str] = None
+
+
+class SpeciesCreate(SpeciesBase):
+    pass
+
+
+class SpeciesUpdate(BaseModel):
+    scientific_name: Optional[str] = None
+    common_name: Optional[str] = None
+    genus: Optional[str] = None
+    family: Optional[str] = None
+
+
+class SpeciesResponse(SpeciesBase):
+    id: int
+    created_at: datetime
+    image_count: int = 0
+
+    class Config:
+        from_attributes = True
+
+
+class SpeciesListResponse(BaseModel):
+    items: List[SpeciesResponse]
+    total: int
+    page: int
+    page_size: int
+    pages: int
+
+
+class SpeciesImportResponse(BaseModel):
+    imported: int
+    skipped: int
+    errors: List[str]
@@ -0,0 +1,43 @@
+from pydantic import BaseModel
+from typing import List, Dict
+
+
+class SourceStats(BaseModel):
+    source: str
+    image_count: int
+    downloaded: int
+    pending: int
+    rejected: int
+
+
+class LicenseStats(BaseModel):
+    license: str
+    count: int
+
+
+class SpeciesStats(BaseModel):
+    id: int
+    scientific_name: str
+    common_name: str | None
+    image_count: int
+
+
+class JobStats(BaseModel):
+    running: int
+    pending: int
+    completed: int
+    failed: int
+
+
+class StatsResponse(BaseModel):
+    total_species: int
+    total_images: int
+    images_downloaded: int
+    images_pending: int
+    images_rejected: int
+    disk_usage_mb: float
+    sources: List[SourceStats]
+    licenses: List[LicenseStats]
+    jobs: JobStats
+    top_species: List[SpeciesStats]
+    under_represented: List[SpeciesStats]  # Species with < 100 images
@@ -0,0 +1,41 @@
+from typing import Optional
+
+from app.scrapers.base import BaseScraper
+from app.scrapers.inaturalist import INaturalistScraper
+from app.scrapers.flickr import FlickrScraper
+from app.scrapers.wikimedia import WikimediaScraper
+from app.scrapers.trefle import TrefleScraper
+from app.scrapers.gbif import GBIFScraper
+from app.scrapers.duckduckgo import DuckDuckGoScraper
+from app.scrapers.bing import BingScraper
+
+
+def get_scraper(source: str) -> Optional[BaseScraper]:
+    """Get scraper instance for a source."""
+    scrapers = {
+        "inaturalist": INaturalistScraper,
+        "flickr": FlickrScraper,
+        "wikimedia": WikimediaScraper,
+        "trefle": TrefleScraper,
+        "gbif": GBIFScraper,
+        "duckduckgo": DuckDuckGoScraper,
+        "bing": BingScraper,
+    }
+
+    scraper_class = scrapers.get(source)
+    if scraper_class:
+        return scraper_class()
+    return None
+
+
+__all__ = [
+    "get_scraper",
+    "BaseScraper",
+    "INaturalistScraper",
+    "FlickrScraper",
+    "WikimediaScraper",
+    "TrefleScraper",
+    "GBIFScraper",
+    "DuckDuckGoScraper",
+    "BingScraper",
+]
@@ -0,0 +1,57 @@
+from abc import ABC, abstractmethod
+from typing import Dict, Optional
+import logging
+
+from sqlalchemy.orm import Session
+
+from app.models import Species, ApiKey
+
+
+class BaseScraper(ABC):
+    """Base class for all image scrapers."""
+
+    name: str = "base"
+    requires_api_key: bool = True
+
+    @abstractmethod
+    def scrape_species(
+        self,
+        species: Species,
+        db: Session,
+        logger: Optional[logging.Logger] = None
+    ) -> Dict[str, int]:
+        """
+        Scrape images for a species.
+
+        Args:
+            species: The species to scrape images for
+            db: Database session
+            logger: Optional logger for debugging
+
+        Returns:
+            Dict with 'downloaded' and 'rejected' counts
+        """
+        pass
+
+    @abstractmethod
+    def test_connection(self, api_key: ApiKey) -> str:
+        """
+        Test API connection.
+
+        Args:
+            api_key: The API key configuration
+
+        Returns:
+            Success message
+
+        Raises:
+            Exception if connection fails
+        """
+        pass
+
+    def get_api_key(self, db: Session) -> Optional[ApiKey]:
+        """Return the enabled API key for this scraper, or None if none is configured."""
+        return db.query(ApiKey).filter(
+            ApiKey.source == self.name,
+            ApiKey.enabled.is_(True)
+        ).first()
@@ -0,0 +1,228 @@
+import time
+import logging
+from typing import Dict, Optional
+
+import httpx
+from sqlalchemy.orm import Session
+
+from app.scrapers.base import BaseScraper
+from app.models import Species, Image, ApiKey
+from app.workers.quality_tasks import download_and_process_image
+
+
+class BHLScraper(BaseScraper):
+    """Scraper for Biodiversity Heritage Library (BHL) images.
+
+    BHL provides access to digitized biodiversity literature and illustrations.
+    Most content is public domain (pre-1927) or CC-licensed.
+
+    Note: BHL images are primarily historical botanical illustrations,
+    which may differ from photographs but are valuable for training.
+    """
+
+    name = "bhl"
+    requires_api_key = True  # BHL requires a (free) API key
+
+    BASE_URL = "https://www.biodiversitylibrary.org/api3"
+
+    HEADERS = {
+        "User-Agent": "PlantGuideScraper/1.0 (Plant image collection for ML training)",
+        "Accept": "application/json",
+    }
+
+    # BHL content is mostly public domain
+    ALLOWED_LICENSES = {"CC0", "CC-BY", "CC-BY-SA", "PD"}
+
+    def scrape_species(
+        self,
+        species: Species,
+        db: Session,
+        logger: Optional[logging.Logger] = None
+    ) -> Dict[str, int]:
+        """Scrape images from BHL for a species."""
+        api_key = self.get_api_key(db)
+        if not api_key:
+            return {"downloaded": 0, "rejected": 0, "error": "No API key configured"}
+
+        rate_limit = api_key.rate_limit_per_sec or 0.5  # api_key is non-None here; guard against an unset/zero rate
+
+        downloaded = 0
+        rejected = 0
+
+        def log(level: str, msg: str):
+            if logger:
+                getattr(logger, level)(msg)
+
+        try:
+            # Disable SSL verification - some Docker environments lack proper CA certificates
+            with httpx.Client(timeout=30, headers=self.HEADERS, verify=False) as client:
+                # Search for name in BHL
+                search_response = client.get(
+                    f"{self.BASE_URL}",
+                    params={
+                        "op": "NameSearch",
+                        "name": species.scientific_name,
+                        "format": "json",
+                        "apikey": api_key.api_key,
+                    },
+                )
+                search_response.raise_for_status()
+                search_data = search_response.json()
+
+                results = search_data.get("Result", [])
+                if not results:
+                    log("info", f"  Species not found in BHL: {species.scientific_name}")
+                    return {"downloaded": 0, "rejected": 0}
+
+                time.sleep(1.0 / rate_limit)
+
+                # Get pages with illustrations for each name result
+                for name_result in results[:5]:  # Limit to top 5 matches
+                    name_bank_id = name_result.get("NameBankID")
+                    if not name_bank_id:
+                        continue
+
+                    # Get publications with this name
+                    pub_response = client.get(
+                        f"{self.BASE_URL}",
+                        params={
+                            "op": "NameGetDetail",
+                            "namebankid": name_bank_id,
+                            "format": "json",
+                            "apikey": api_key.api_key,
+                        },
+                    )
+                    pub_response.raise_for_status()
+                    pub_data = pub_response.json()
+
+                    time.sleep(1.0 / rate_limit)
+
+                    # Extract titles and get page images
+                    for title in pub_data.get("Result", []):
+                        title_id = title.get("TitleID")
+                        if not title_id:
+                            continue
+
+                        # Get pages for this title
+                        pages_response = client.get(
+                            f"{self.BASE_URL}",
+                            params={
+                                "op": "GetPageMetadata",
+                                "titleid": title_id,
+                                "format": "json",
+                                "apikey": api_key.api_key,
+                                "ocr": "false",
+                                "names": "false",
+                            },
+                        )
+
+                        if pages_response.status_code != 200:
+                            continue
+
+                        pages_data = pages_response.json()
+                        pages = pages_data.get("Result", [])
+
+                        time.sleep(1.0 / rate_limit)
+
+                        # Look for pages that are likely illustrations
+                        for page in pages[:100]:  # Limit pages per title
+                            page_types = page.get("PageTypes", [])
+
+                            # Only get illustration/plate pages
+                            is_illustration = any(
+                                pt.get("PageTypeName", "").lower() in ["illustration", "plate", "figure", "map"]
+                                for pt in page_types
+                            ) if page_types else False
+
+                            if not is_illustration and page_types:
+                                continue
+
+                            page_id = page.get("PageID")
+                            if not page_id:
+                                continue
+
+                            # Construct image URL
+                            # BHL provides multiple image sizes
+                            image_url = f"https://www.biodiversitylibrary.org/pageimage/{page_id}"
+
+                            # Check if already exists
+                            source_id = str(page_id)
+                            existing = db.query(Image).filter(
+                                Image.source == self.name,
+                                Image.source_id == source_id,
+                            ).first()
+
+                            if existing:
+                                continue
+
+                            # Determine license - BHL content is usually public domain
+                            item_url = page.get("ItemUrl", "")
+                            year = None
+                            try:
+                                # Try to extract year from ItemUrl or other fields
+                                if "Year" in page:
+                                    year = int(page.get("Year", 0))
+                            except (ValueError, TypeError):
+                                pass
+
+                            # Works published before 1927 are public domain in the US
+                            if year and year < 1927:
+                                license_code = "PD"
+                            else:
+                                license_code = "CC0"  # Assumed default when the year is missing or 1927+; not verified per item
+
+                            # Build attribution
+                            title_name = title.get("ShortTitle", title.get("FullTitle", "Unknown"))
+                            attribution = f"From '{title_name}' via Biodiversity Heritage Library ({license_code})"
+
+                            # Create image record
+                            image = Image(
+                                species_id=species.id,
+                                source=self.name,
+                                source_id=source_id,
+                                url=image_url,
+                                license=license_code,
+                                attribution=attribution,
+                                status="pending",
+                            )
+                            db.add(image)
+                            db.commit()
+
+                            # Queue for download
+                            download_and_process_image.delay(image.id)
+                            downloaded += 1
+
+                            # Limit total per species
+                            if downloaded >= 50:
+                                break
+
+                        if downloaded >= 50:
+                            break
+
+                    if downloaded >= 50:
+                        break
+
+        except httpx.HTTPStatusError as e:
+            log("error", f"  HTTP error for {species.scientific_name}: {e.response.status_code}")
+        except Exception as e:
+            log("error", f"  Error scraping BHL for {species.scientific_name}: {e}")
+
+        return {"downloaded": downloaded, "rejected": rejected}
+
+    def test_connection(self, api_key: ApiKey) -> str:
+        """Test BHL API connection."""
+        with httpx.Client(timeout=10, headers=self.HEADERS, verify=False) as client:
+            response = client.get(
+                f"{self.BASE_URL}",
+                params={
+                    "op": "NameSearch",
+                    "name": "Rosa",
+                    "format": "json",
+                    "apikey": api_key.api_key,
+                },
+            )
+            response.raise_for_status()
+            data = response.json()
+
+        results = data.get("Result", [])
+        return f"BHL API connection successful ({len(results)} results for 'Rosa')"
@@ -0,0 +1,135 @@
+import hashlib
+import time
+import logging
+from typing import Dict, Optional
+
+import httpx
+from sqlalchemy.orm import Session
+
+from app.scrapers.base import BaseScraper
+from app.models import Species, Image, ApiKey
+from app.workers.quality_tasks import download_and_process_image
+
+
+class BingScraper(BaseScraper):
+    """Scraper for Bing Image Search v7 API (Azure Cognitive Services)."""
+
+    name = "bing"
+    requires_api_key = True
+
+    BASE_URL = "https://api.bing.microsoft.com/v7.0/images/search"
+
+    NEGATIVE_TERMS = "-herbarium -specimen -illustration -drawing -diagram -dried -pressed"
+
+    LICENSE_MAP = {
+        "Public": "CC0",
+        "Share": "CC-BY-SA",
+        "ShareCommercially": "CC-BY",
+        "Modify": "CC-BY-SA",
+        "ModifyCommercially": "CC-BY",
+    }
+
+    def _build_queries(self, species: Species) -> list[str]:
+        queries = [f'"{species.scientific_name}" plant photo {self.NEGATIVE_TERMS}']
+        if species.common_name:
+            queries.append(f'"{species.common_name}" houseplant photo {self.NEGATIVE_TERMS}')
+        return queries
+
+    def scrape_species(
+        self,
+        species: Species,
+        db: Session,
+        logger: Optional[logging.Logger] = None,
+    ) -> Dict[str, int]:
+        api_key = self.get_api_key(db)
+        if not api_key:
+            return {"downloaded": 0, "rejected": 0}
+
+        rate_limit = api_key.rate_limit_per_sec or 3.0
+        downloaded = 0
+        rejected = 0
+        seen_urls = set()
+
+        headers = {
+            "Ocp-Apim-Subscription-Key": api_key.api_key,
+        }
+
+        try:
+            queries = self._build_queries(species)
+
+            with httpx.Client(timeout=30, headers=headers) as client:
+                for query in queries:
+                    params = {
+                        "q": query,
+                        "imageType": "Photo",
+                        "license": "ShareCommercially",
+                        "count": 50,
+                    }
+
+                    response = client.get(self.BASE_URL, params=params)
+                    response.raise_for_status()
+                    data = response.json()
+
+                    for result in data.get("value", []):
+                        url = result.get("contentUrl")
+                        if not url or url in seen_urls:
+                            continue
+                        seen_urls.add(url)
+
+                        # Use Bing's imageId, fall back to md5 hash
+                        source_id = result.get("imageId") or hashlib.md5(url.encode()).hexdigest()[:16]
+
+                        existing = db.query(Image).filter(
+                            Image.source == self.name,
+                            Image.source_id == source_id,
+                        ).first()
+
+                        if existing:
+                            continue
+
+                        # Map license
+                        bing_license = result.get("license", "")
+                        license_code = self.LICENSE_MAP.get(bing_license, "UNKNOWN")
+
+                        host = result.get("hostPageDisplayUrl", "")
+                        attribution = f"via Bing ({host})" if host else "via Bing Image Search"
+
+                        image = Image(
+                            species_id=species.id,
+                            source=self.name,
+                            source_id=source_id,
+                            url=url,
+                            width=result.get("width"),
+                            height=result.get("height"),
+                            license=license_code,
+                            attribution=attribution,
+                            status="pending",
+                        )
+                        db.add(image)
+                        db.commit()
+
+                        download_and_process_image.delay(image.id)
+                        downloaded += 1
+
+                    time.sleep(1.0 / rate_limit)
+
+        except Exception as e:
+            if logger:
+                logger.error(f"Error scraping Bing for {species.scientific_name}: {e}")
+            else:
+                print(f"Error scraping Bing for {species.scientific_name}: {e}")
+
+        return {"downloaded": downloaded, "rejected": rejected}
+
+    def test_connection(self, api_key: ApiKey) -> str:
+        headers = {"Ocp-Apim-Subscription-Key": api_key.api_key}
+        with httpx.Client(timeout=10, headers=headers) as client:
+            response = client.get(
+                self.BASE_URL,
+                params={"q": "Monstera deliciosa plant", "count": 1},
+            )
+            response.raise_for_status()
+            data = response.json()
+
+        count = data.get("totalEstimatedMatches", 0)
+        return f"Bing Image Search working ({count:,} estimated matches)"
@@ -0,0 +1,101 @@
+import hashlib
+import time
+import logging
+from typing import Dict, Optional
+
+from duckduckgo_search import DDGS
+from sqlalchemy.orm import Session
+
+from app.scrapers.base import BaseScraper
+from app.models import Species, Image, ApiKey
+from app.workers.quality_tasks import download_and_process_image
+
+
+class DuckDuckGoScraper(BaseScraper):
+    """Scraper for DuckDuckGo image search. No API key required."""
+
+    name = "duckduckgo"
+    requires_api_key = False
+
+    NEGATIVE_TERMS = "-herbarium -specimen -illustration -drawing -diagram -dried -pressed"
+
+    def _build_queries(self, species: Species) -> list[str]:
+        queries = [f'"{species.scientific_name}" plant photo {self.NEGATIVE_TERMS}']
+        if species.common_name:
+            queries.append(f'"{species.common_name}" houseplant photo {self.NEGATIVE_TERMS}')
+        return queries
+
+    def scrape_species(
+        self,
+        species: Species,
+        db: Session,
+        logger: Optional[logging.Logger] = None,
+    ) -> Dict[str, int]:
+        api_key = self.get_api_key(db)
+        rate_limit = (api_key.rate_limit_per_sec if api_key else None) or 0.5
+
+        downloaded = 0
+        rejected = 0
+        seen_urls = set()
+
+        try:
+            queries = self._build_queries(species)
+
+            with DDGS() as ddgs:
+                for query in queries:
+                    results = ddgs.images(
+                        keywords=query,
+                        type_image="photo",
+                        max_results=50,
+                    )
+
+                    for result in results:
+                        url = result.get("image")
+                        if not url or url in seen_urls:
+                            continue
+                        seen_urls.add(url)
+
+                        source_id = hashlib.md5(url.encode()).hexdigest()[:16]
+
+                        # Check if already exists
+                        existing = db.query(Image).filter(
+                            Image.source == self.name,
+                            Image.source_id == source_id,
+                        ).first()
+
+                        if existing:
+                            continue
+
+                        title = result.get("title", "")
+                        attribution = f"{title} via DuckDuckGo" if title else "via DuckDuckGo"
+
+                        image = Image(
+                            species_id=species.id,
+                            source=self.name,
+                            source_id=source_id,
+                            url=url,
+                            license="UNKNOWN",
+                            attribution=attribution,
+                            status="pending",
+                        )
+                        db.add(image)
+                        db.commit()
+
+                        download_and_process_image.delay(image.id)
+                        downloaded += 1
+
+                    time.sleep(1.0 / rate_limit)
+
+        except Exception as e:
+            if logger:
+                logger.error(f"Error scraping DuckDuckGo for {species.scientific_name}: {e}")
+            else:
+                print(f"Error scraping DuckDuckGo for {species.scientific_name}: {e}")
+
+        return {"downloaded": downloaded, "rejected": rejected}
+
+    def test_connection(self, api_key: ApiKey) -> str:
+        with DDGS() as ddgs:
+            results = ddgs.images(keywords="Monstera deliciosa plant", max_results=1)
+            count = len(list(results))
+        return f"DuckDuckGo search working ({count} test result)"
@@ -0,0 +1,226 @@
+import time
+import logging
+from typing import Dict, Optional
+
+import httpx
+from sqlalchemy.orm import Session
+
+from app.scrapers.base import BaseScraper
+from app.models import Species, Image, ApiKey
+from app.workers.quality_tasks import download_and_process_image
+
+
+class EOLScraper(BaseScraper):
+    """Scraper for Encyclopedia of Life (EOL) images.
+
+    EOL aggregates biodiversity data from many sources and provides
+    a free API with no authentication required.
+    """
+
+    name = "eol"
+    requires_api_key = False
+
+    BASE_URL = "https://eol.org/api"
+
+    HEADERS = {
+        "User-Agent": "PlantGuideScraper/1.0 (Plant image collection for ML training)",
+        "Accept": "application/json",
+    }
+
+    # Map EOL license URLs to short codes
+    LICENSE_MAP = {
+        "http://creativecommons.org/publicdomain/zero/1.0/": "CC0",
+        "http://creativecommons.org/publicdomain/mark/1.0/": "CC0",
+        "http://creativecommons.org/licenses/by/2.0/": "CC-BY",
+        "http://creativecommons.org/licenses/by/3.0/": "CC-BY",
+        "http://creativecommons.org/licenses/by/4.0/": "CC-BY",
+        "http://creativecommons.org/licenses/by-sa/2.0/": "CC-BY-SA",
+        "http://creativecommons.org/licenses/by-sa/3.0/": "CC-BY-SA",
+        "http://creativecommons.org/licenses/by-sa/4.0/": "CC-BY-SA",
+        "https://creativecommons.org/publicdomain/zero/1.0/": "CC0",
+        "https://creativecommons.org/publicdomain/mark/1.0/": "CC0",
+        "https://creativecommons.org/licenses/by/2.0/": "CC-BY",
+        "https://creativecommons.org/licenses/by/3.0/": "CC-BY",
+        "https://creativecommons.org/licenses/by/4.0/": "CC-BY",
+        "https://creativecommons.org/licenses/by-sa/2.0/": "CC-BY-SA",
+        "https://creativecommons.org/licenses/by-sa/3.0/": "CC-BY-SA",
+        "https://creativecommons.org/licenses/by-sa/4.0/": "CC-BY-SA",
+        "pd": "CC0",  # Public domain
+        "public domain": "CC0",
+    }
+
+    # Commercial-safe licenses
+    ALLOWED_LICENSES = {"CC0", "CC-BY", "CC-BY-SA"}
+
+    def scrape_species(
+        self,
+        species: Species,
+        db: Session,
+        logger: Optional[logging.Logger] = None
+    ) -> Dict[str, int]:
+        """Scrape images from EOL for a species."""
+        api_key = self.get_api_key(db)
+        rate_limit = (api_key.rate_limit_per_sec if api_key else None) or 0.5
+
+        downloaded = 0
+        rejected = 0
+
+        def log(level: str, msg: str):
+            if logger:
+                getattr(logger, level)(msg)
+
+        try:
+            # Disable SSL verification - EOL is a trusted source and some Docker
+            # environments lack proper CA certificates
+            with httpx.Client(timeout=30, headers=self.HEADERS, verify=False) as client:
+                # Step 1: Search for the species
+                search_response = client.get(
+                    f"{self.BASE_URL}/search/1.0.json",
+                    params={
+                        "q": species.scientific_name,
+                        "page": 1,
+                        "exact": "true",
+                    },
+                )
+                search_response.raise_for_status()
+                search_data = search_response.json()
+
+                results = search_data.get("results", [])
+                if not results:
+                    log("info", f"  Species not found in EOL: {species.scientific_name}")
+                    return {"downloaded": 0, "rejected": 0}
+
+                # Get the EOL page ID
+                eol_page_id = results[0].get("id")
+                if not eol_page_id:
+                    return {"downloaded": 0, "rejected": 0}
+
+                time.sleep(1.0 / rate_limit)
+
+                # Step 2: Get page details with images
+                page_response = client.get(
+                    f"{self.BASE_URL}/pages/1.0/{eol_page_id}.json",
+                    params={
+                        "images_per_page": 75,
+                        "images_page": 1,
+                        "videos_per_page": 0,
+                        "sounds_per_page": 0,
+                        "maps_per_page": 0,
+                        "texts_per_page": 0,
+                        "details": "true",
+                        "licenses": "cc-by|cc-by-sa|pd",  # NC licenses are rejected below, so do not request them
+                    },
+                )
+                page_response.raise_for_status()
+                page_data = page_response.json()
+
+                data_objects = page_data.get("dataObjects", [])
+                log("debug", f"  Found {len(data_objects)} media objects")
+
+                for obj in data_objects:
+                    # Only process images ("stillimage" also contains "image", so one check covers both)
+                    media_type = obj.get("dataType", "")
+                    if "image" not in media_type.lower():
+                        continue
+
+                    # Get image URL
+                    image_url = obj.get("eolMediaURL") or obj.get("mediaURL")
+                    if not image_url:
+                        rejected += 1
+                        continue
+
+                    # Check license
+                    license_url = (obj.get("license") or "").lower()
+                    license_code = None
+
+                    # Try to match license URL
+                    for pattern, code in self.LICENSE_MAP.items():
+                        if pattern in license_url:
+                            license_code = code
+                            break
+
+                    if not license_code:
+                        # Check for NC licenses which we reject
+                        if "-nc" in license_url:
+                            rejected += 1
+                            continue
+                        # Unknown license, skip
+                        log("debug", f"  Rejected: unknown license {license_url}")
+                        rejected += 1
+                        continue
+
+                    if license_code not in self.ALLOWED_LICENSES:
+                        rejected += 1
+                        continue
+
+                    # Create unique source ID; fall back to the URL itself, since hash() of a str is salted per process
+                    source_id = str(obj.get("dataObjectVersionID") or obj.get("identifier") or image_url)
+
+                    # Check if already exists
+                    existing = db.query(Image).filter(
+                        Image.source == self.name,
+                        Image.source_id == source_id,
+                    ).first()
+
+                    if existing:
+                        continue
+
+                    # Build attribution
+                    agents = obj.get("agents", [])
+                    photographer = None
+                    rights_holder = None
+
+                    for agent in agents:
+                        role = agent.get("role", "").lower()
+                        name = agent.get("full_name", "")
+                        if role == "photographer":
+                            photographer = name
+                        elif role == "owner" or role == "rights holder":
+                            rights_holder = name
+
+                    attribution_parts = []
+                    if photographer:
+                        attribution_parts.append(f"Photo by {photographer}")
+                    if rights_holder and rights_holder != photographer:
+                        attribution_parts.append(f"Rights: {rights_holder}")
+                    attribution_parts.append(f"via EOL ({license_code})")
+                    attribution = " | ".join(attribution_parts)
+
+                    # Create image record
+                    image = Image(
+                        species_id=species.id,
+                        source=self.name,
+                        source_id=source_id,
+                        url=image_url,
+                        license=license_code,
+                        attribution=attribution,
+                        status="pending",
+                    )
+                    db.add(image)
+                    db.commit()
+
+                    # Queue for download
+                    download_and_process_image.delay(image.id)
+                    downloaded += 1
+
+                time.sleep(1.0 / rate_limit)
+
+        except httpx.HTTPStatusError as e:
+            log("error", f"  HTTP error for {species.scientific_name}: {e.response.status_code}")
+        except Exception as e:
+            log("error", f"  Error scraping EOL for {species.scientific_name}: {e}")
+
+        return {"downloaded": downloaded, "rejected": rejected}
+
+    def test_connection(self, api_key: ApiKey) -> str:
+        """Test EOL API connection."""
+        with httpx.Client(timeout=10, headers=self.HEADERS, verify=False) as client:
+            response = client.get(
+                f"{self.BASE_URL}/search/1.0.json",
+                params={"q": "Rosa", "page": 1},
+            )
+            response.raise_for_status()
+            data = response.json()
+
+        total = data.get("totalResults", 0)
+        return f"EOL API connection successful ({total} results for 'Rosa')"
@@ -0,0 +1,146 @@
+import time
+import logging
+from typing import Dict, Optional
+
+import httpx
+from sqlalchemy.orm import Session
+
+from app.scrapers.base import BaseScraper
+from app.models import Species, Image, ApiKey
+from app.workers.quality_tasks import download_and_process_image
+
+
+class FlickrScraper(BaseScraper):
+    """Scraper for Flickr images via their API."""
+
+    name = "flickr"
+    requires_api_key = True
+
+    BASE_URL = "https://api.flickr.com/services/rest/"
+
+    HEADERS = {
+        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
+    }
+
+    # Commercial-safe license IDs
+    # 4 = CC BY 2.0, 7 = No known copyright restrictions, 8 = United States Government Work, 9 = CC0
+    ALLOWED_LICENSES = "4,7,8,9"
+
+    LICENSE_MAP = {
+        "4": "CC-BY",
+        "7": "NO-KNOWN-COPYRIGHT",
+        "8": "US-GOV",
+        "9": "CC0",
+    }
+
+    def scrape_species(
+        self,
+        species: Species,
+        db: Session,
+        logger: Optional[logging.Logger] = None
+    ) -> Dict[str, int]:
+        """Scrape images from Flickr for a species."""
+        api_key = self.get_api_key(db)
+        if not api_key:
+            return {"downloaded": 0, "rejected": 0, "error": "No API key configured"}
+
+        rate_limit = api_key.rate_limit_per_sec
+
+        downloaded = 0
+        rejected = 0
+
+        try:
+            params = {
+                "method": "flickr.photos.search",
+                "api_key": api_key.api_key,
+                "text": species.scientific_name,
+                "license": self.ALLOWED_LICENSES,
+                "content_type": 1,  # Photos only
+                "media": "photos",
+                "extras": "license,url_l,url_o,owner_name",
+                "per_page": 100,
+                "format": "json",
+                "nojsoncallback": 1,
+            }
+
+            with httpx.Client(timeout=30, headers=self.HEADERS) as client:
+                response = client.get(self.BASE_URL, params=params)
+                response.raise_for_status()
+                data = response.json()
+
+            if data.get("stat") != "ok":
+                return {"downloaded": 0, "rejected": 0, "error": data.get("message")}
+
+            photos = data.get("photos", {}).get("photo", [])
+
+            for photo in photos:
+                # Get best URL (original or large)
+                url = photo.get("url_o") or photo.get("url_l")
+                if not url:
+                    rejected += 1
+                    continue
+
+                # Get license
+                license_id = str(photo.get("license", ""))
+                license_code = self.LICENSE_MAP.get(license_id, "UNKNOWN")
+                if license_code == "UNKNOWN":
+                    rejected += 1
+                    continue
+
+                # Check if already exists
+                source_id = str(photo.get("id"))
+                existing = db.query(Image).filter(
+                    Image.source == self.name,
+                    Image.source_id == source_id,
+                ).first()
+
+                if existing:
+                    continue
+
+                # Build attribution
+                owner = photo.get("ownername", "Unknown")
+                attribution = f"Photo by {owner} on Flickr ({license_code})"
+
+                # Create image record
+                image = Image(
+                    species_id=species.id,
+                    source=self.name,
+                    source_id=source_id,
+                    url=url,
+                    license=license_code,
+                    attribution=attribution,
+                    status="pending",
+                )
+                db.add(image)
+                db.commit()
+
+                # Queue for download
+                download_and_process_image.delay(image.id)
+                downloaded += 1
+
+            # Rate limiting
+            time.sleep(1.0 / rate_limit)
+
+        except Exception as e:
+            (logger or logging.getLogger(__name__)).error(f"Error scraping Flickr for {species.scientific_name}: {e}")
+
+        return {"downloaded": downloaded, "rejected": rejected}
+
+    def test_connection(self, api_key: ApiKey) -> str:
+        """Test Flickr API connection."""
+        params = {
+            "method": "flickr.test.echo",
+            "api_key": api_key.api_key,
+            "format": "json",
+            "nojsoncallback": 1,
+        }
+
+        with httpx.Client(timeout=10, headers=self.HEADERS) as client:
+            response = client.get(self.BASE_URL, params=params)
+            response.raise_for_status()
+            data = response.json()
+
+        if data.get("stat") != "ok":
+            raise Exception(data.get("message", "API test failed"))
+
+        return "Flickr API connection successful"
@@ -0,0 +1,159 @@
+import time
+import logging
+from typing import Dict, Optional
+
+import httpx
+from sqlalchemy.orm import Session
+
+from app.scrapers.base import BaseScraper
+from app.models import Species, Image, ApiKey
+from app.workers.quality_tasks import download_and_process_image
+
+
+class GBIFScraper(BaseScraper):
+    """Scraper for GBIF (Global Biodiversity Information Facility) images."""
+
+    name = "gbif"
+    requires_api_key = False  # GBIF is free to use
+
+    BASE_URL = "https://api.gbif.org/v1"
+
+    HEADERS = {
+        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
+    }
+
+    # Map GBIF license URLs to short codes
+    LICENSE_MAP = {
+        "http://creativecommons.org/publicdomain/zero/1.0/legalcode": "CC0",
+        "http://creativecommons.org/licenses/by/4.0/legalcode": "CC-BY",
+        "http://creativecommons.org/licenses/by-nc/4.0/legalcode": "CC-BY-NC",
+        "http://creativecommons.org/publicdomain/zero/1.0/": "CC0",
+        "http://creativecommons.org/licenses/by/4.0/": "CC-BY",
+        "http://creativecommons.org/licenses/by-nc/4.0/": "CC-BY-NC",
+        "https://creativecommons.org/publicdomain/zero/1.0/legalcode": "CC0",
+        "https://creativecommons.org/licenses/by/4.0/legalcode": "CC-BY",
+        "https://creativecommons.org/licenses/by-nc/4.0/legalcode": "CC-BY-NC",
+        "https://creativecommons.org/publicdomain/zero/1.0/": "CC0",
+        "https://creativecommons.org/licenses/by/4.0/": "CC-BY",
+        "https://creativecommons.org/licenses/by-nc/4.0/": "CC-BY-NC",
+    }
+
+    # Only allow commercial-safe licenses
+    ALLOWED_LICENSES = {"CC0", "CC-BY"}
+
+    def scrape_species(
+        self,
+        species: Species,
+        db: Session,
+        logger: Optional[logging.Logger] = None
+    ) -> Dict[str, int]:
+        """Scrape images from GBIF for a species."""
+        # GBIF doesn't require API key, but we still respect rate limits
+        api_key = self.get_api_key(db)
+        rate_limit = api_key.rate_limit_per_sec if api_key else 1.0
+
+        downloaded = 0
+        rejected = 0
+
+        try:
+            params = {
+                "scientificName": species.scientific_name,
+                "mediaType": "StillImage",
+                "limit": 100,
+            }
+
+            with httpx.Client(timeout=30, headers=self.HEADERS) as client:
+                response = client.get(
+                    f"{self.BASE_URL}/occurrence/search",
+                    params=params,
+                )
+                response.raise_for_status()
+                data = response.json()
+
+                results = data.get("results", [])
+
+                for occurrence in results:
+                    media_list = occurrence.get("media", [])
+
+                    for media in media_list:
+                        # Only process still images
+                        if media.get("type") != "StillImage":
+                            continue
+
+                        url = media.get("identifier")
+                        if not url:
+                            rejected += 1
+                            continue
+
+                        # Check license
+                        license_url = media.get("license", "")
+                        license_code = self.LICENSE_MAP.get(license_url)
+
+                        if not license_code or license_code not in self.ALLOWED_LICENSES:
+                            rejected += 1
+                            continue
+
+                        # Create a unique source ID from the occurrence key and a stable digest of
+                        # the media URL; built-in hash() is salted per process, which breaks dedup
+                        import hashlib  # local import keeps this fix self-contained
+                        url_hash = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
+                        source_id = f"{occurrence.get('key', '')}_{url_hash}"
+
+                        # Check if already exists
+                        existing = db.query(Image).filter(
+                            Image.source == self.name,
+                            Image.source_id == source_id,
+                        ).first()
+
+                        if existing:
+                            continue
+
+                        # Build attribution
+                        creator = media.get("creator", "")
+                        rights_holder = media.get("rightsHolder", "")
+                        attribution_parts = []
+                        if creator:
+                            attribution_parts.append(f"Photo by {creator}")
+                        if rights_holder and rights_holder != creator:
+                            attribution_parts.append(f"Rights: {rights_holder}")
+                        attribution_parts.append(f"via GBIF ({license_code})")
+                        attribution = " | ".join(attribution_parts)
+
+                        # Create image record
+                        image = Image(
+                            species_id=species.id,
+                            source=self.name,
+                            source_id=source_id,
+                            url=url,
+                            license=license_code,
+                            attribution=attribution,
+                            status="pending",
+                        )
+                        db.add(image)
+                        db.commit()
+
+                        # Queue for download
+                        download_and_process_image.delay(image.id)
+                        downloaded += 1
+
+                # Rate limiting
+                time.sleep(1.0 / rate_limit)
+
+        except Exception as e:
+            (logger or logging.getLogger(__name__)).error(f"Error scraping GBIF for {species.scientific_name}: {e}")
+
+        return {"downloaded": downloaded, "rejected": rejected}
+
+    def test_connection(self, api_key: ApiKey) -> str:
+        """Test GBIF API connection."""
+        # GBIF doesn't require authentication, just test the endpoint
+        with httpx.Client(timeout=10, headers=self.HEADERS) as client:
+            response = client.get(
+                f"{self.BASE_URL}/occurrence/search",
+                params={"limit": 1},
+            )
+            response.raise_for_status()
+            data = response.json()
+
+        count = data.get("count", 0)
+        return f"GBIF API connection successful ({count:,} total occurrences available)"
@@ -0,0 +1,144 @@
+import time
+import logging
+from typing import Dict, Optional
+
+import httpx
+from sqlalchemy.orm import Session
+
+from app.scrapers.base import BaseScraper
+from app.models import Species, Image, ApiKey
+from app.workers.quality_tasks import download_and_process_image
+
+
+class INaturalistScraper(BaseScraper):
+    """Scraper for iNaturalist observations via their API."""
+
+    name = "inaturalist"
+    requires_api_key = False  # Public API, but rate limited
+
+    BASE_URL = "https://api.inaturalist.org/v1"
+
+    HEADERS = {
+        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
+    }
+
+    # Commercial-safe licenses (CC0, CC-BY)
+    ALLOWED_LICENSES = ["cc0", "cc-by"]
+
+    def scrape_species(
+        self,
+        species: Species,
+        db: Session,
+        logger: Optional[logging.Logger] = None
+    ) -> Dict[str, int]:
+        """Scrape images from iNaturalist for a species."""
+        api_key = self.get_api_key(db)
+        rate_limit = api_key.rate_limit_per_sec if api_key else 1.0
+
+        downloaded = 0
+        rejected = 0
+
+        def log(level: str, msg: str):
+            if logger:
+                getattr(logger, level)(msg)
+
+        try:
+            # Search for observations of this species
+            params = {
+                "taxon_name": species.scientific_name,
+                "quality_grade": "research",  # Only research-grade
+                "photos": True,
+                "per_page": 200,
+                "order_by": "votes",
+                "license": ",".join(self.ALLOWED_LICENSES),
+            }
+
+            log("debug", f"  API request params: {params}")
+
+            with httpx.Client(timeout=30, headers=self.HEADERS) as client:
+                response = client.get(
+                    f"{self.BASE_URL}/observations",
+                    params=params,
+                )
+                log("debug", f"  API response status: {response.status_code}")
+                response.raise_for_status()
+                data = response.json()
+
+            observations = data.get("results", [])
+            total_results = data.get("total_results", 0)
+            log("debug", f"  Found {len(observations)} observations (total: {total_results})")
+
+            if not observations:
+                log("info", f"  No observations found for {species.scientific_name}")
+                return {"downloaded": 0, "rejected": 0}
+
+            for obs in observations:
+                photos = obs.get("photos", [])
+                for photo in photos:
+                    # Check license
+                    license_code = (photo.get("license_code") or "").lower()
+                    if license_code not in self.ALLOWED_LICENSES:
+                        log("debug", f"  Rejected photo {photo.get('id')}: license={license_code}")
+                        rejected += 1
+                        continue
+
+                    # Get image URL (the API returns the square thumbnail URL)
+                    url = photo.get("url", "")
+                    if not url:
+                        log("debug", f"  Skipped photo {photo.get('id')}: no URL")
+                        continue
+
+                    # Convert to larger size
+                    url = url.replace("square", "large")
+
+                    # Check if already exists
+                    source_id = str(photo.get("id"))
+                    existing = db.query(Image).filter(
+                        Image.source == self.name,
+                        Image.source_id == source_id,
+                    ).first()
+
+                    if existing:
+                        log("debug", f"  Skipped photo {source_id}: already exists")
+                        continue
+
+                    # Create image record
+                    image = Image(
+                        species_id=species.id,
+                        source=self.name,
+                        source_id=source_id,
+                        url=url,
+                        license=license_code.upper(),
+                        attribution=photo.get("attribution", ""),
+                        status="pending",
+                    )
+                    db.add(image)
+                    db.commit()
+
+                    # Queue for download
+                    download_and_process_image.delay(image.id)
+                    downloaded += 1
+                    log("debug", f"  Queued photo {source_id} for download")
+
+            # Rate limiting (only one API request is made per call, so pause once after the loop)
+            time.sleep(1.0 / rate_limit)
+
+        except httpx.HTTPStatusError as e:
+            log("error", f"  HTTP error for {species.scientific_name}: {e.response.status_code} - {e.response.text}")
+        except httpx.RequestError as e:
+            log("error", f"  Request error for {species.scientific_name}: {e}")
+        except Exception as e:
+            log("error", f"  Error scraping iNaturalist for {species.scientific_name}: {e}")
+
+        return {"downloaded": downloaded, "rejected": rejected}
+
+    def test_connection(self, api_key: ApiKey) -> str:
+        """Test iNaturalist API connection."""
+        with httpx.Client(timeout=10, headers=self.HEADERS) as client:
+            response = client.get(
+                f"{self.BASE_URL}/observations",
+                params={"per_page": 1},
+            )
+            response.raise_for_status()
+
+        return "iNaturalist API connection successful"
@@ -0,0 +1,154 @@
+import time
+import logging
+from typing import Dict, Optional
+
+import httpx
+from sqlalchemy.orm import Session
+
+from app.scrapers.base import BaseScraper
+from app.models import Species, Image, ApiKey
+from app.workers.quality_tasks import download_and_process_image
+
+
+class TrefleScraper(BaseScraper):
+    """Scraper for Trefle.io plant database."""
+
+    name = "trefle"
+    requires_api_key = True
+
+    BASE_URL = "https://trefle.io/api/v1"
+
+    HEADERS = {
+        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
+    }
+
+    def scrape_species(
+        self,
+        species: Species,
+        db: Session,
+        logger: Optional[logging.Logger] = None
+    ) -> Dict[str, int]:
+        """Scrape images from Trefle for a species."""
+        api_key = self.get_api_key(db)
+        if not api_key:
+            return {"downloaded": 0, "rejected": 0, "error": "No API key configured"}
+
+        rate_limit = api_key.rate_limit_per_sec
+
+        downloaded = 0
+        rejected = 0
+
+        try:
+            # Search for the species
+            params = {
+                "token": api_key.api_key,
+                "q": species.scientific_name,
+            }
+
+            with httpx.Client(timeout=30, headers=self.HEADERS) as client:
+                response = client.get(
+                    f"{self.BASE_URL}/plants/search",
+                    params=params,
+                )
+                response.raise_for_status()
+                data = response.json()
+
+                plants = data.get("data", [])
+
+                for plant in plants:
+                    # Get plant details for more images
+                    plant_id = plant.get("id")
+                    if not plant_id:
+                        continue
+
+                    detail_response = client.get(
+                        f"{self.BASE_URL}/plants/{plant_id}",
+                        params={"token": api_key.api_key},
+                    )
+
+                    if detail_response.status_code != 200:
+                        continue
+
+                    plant_detail = detail_response.json().get("data", {})
+
+                    # Get main image
+                    main_image = plant_detail.get("image_url")
+                    if main_image:
+                        source_id = f"main_{plant_id}"
+                        existing = db.query(Image).filter(
+                            Image.source == self.name,
+                            Image.source_id == source_id,
+                        ).first()
+
+                        if not existing:
+                            image = Image(
+                                species_id=species.id,
+                                source=self.name,
+                                source_id=source_id,
+                                url=main_image,
+                                license="TREFLE",  # Trefle's own license
+                                attribution="Trefle.io Plant Database",
+                                status="pending",
+                            )
+                            db.add(image)
+                            db.commit()
+                            download_and_process_image.delay(image.id)
+                            downloaded += 1
+
+                    # Get additional images from species detail
+                    images = plant_detail.get("images", {})
+                    for image_type, image_list in images.items():
+                        if not isinstance(image_list, list):
+                            continue
+
+                        for img in image_list:
+                            url = img.get("image_url")
+                            if not url:
+                                continue
+
+                            img_id = img.get("id", url.split("/")[-1])
+                            source_id = f"{image_type}_{img_id}"
+
+                            existing = db.query(Image).filter(
+                                Image.source == self.name,
+                                Image.source_id == source_id,
+                            ).first()
+
+                            if existing:
+                                continue
+
+                            copyright_info = img.get("copyright", "")
+                            image = Image(
+                                species_id=species.id,
+                                source=self.name,
+                                source_id=source_id,
+                                url=url,
+                                license="TREFLE",
+                                attribution=copyright_info or "Trefle.io",
+                                status="pending",
+                            )
+                            db.add(image)
+                            db.commit()
+                            download_and_process_image.delay(image.id)
+                            downloaded += 1
+
+                    # Rate limiting
+                    time.sleep(1.0 / rate_limit)
+
+        except Exception as e:
+            (logger or logging.getLogger(__name__)).error(f"Error scraping Trefle for {species.scientific_name}: {e}")
+
+        return {"downloaded": downloaded, "rejected": rejected}
+
+    def test_connection(self, api_key: ApiKey) -> str:
+        """Test Trefle API connection."""
+        params = {"token": api_key.api_key}
+
+        with httpx.Client(timeout=10, headers=self.HEADERS) as client:
+            response = client.get(
+                f"{self.BASE_URL}/plants",
+                params=params,
+            )
+            response.raise_for_status()
+
+        return "Trefle API connection successful"
@@ -0,0 +1,146 @@
+import time
+import logging
+from typing import Dict, Optional
+
+import httpx
+from sqlalchemy.orm import Session
+
+from app.scrapers.base import BaseScraper
+from app.models import Species, Image, ApiKey
+from app.workers.quality_tasks import download_and_process_image
+
+
+class WikimediaScraper(BaseScraper):
+    """Scraper for Wikimedia Commons images."""
+
+    name = "wikimedia"
+    requires_api_key = False
+
+    BASE_URL = "https://commons.wikimedia.org/w/api.php"
+
+    HEADERS = {
+        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
+    }
+
+    def scrape_species(
+        self,
+        species: Species,
+        db: Session,
+        logger: Optional[logging.Logger] = None
+    ) -> Dict[str, int]:
+        """Scrape images from Wikimedia Commons for a species."""
+        api_key = self.get_api_key(db)
+        rate_limit = api_key.rate_limit_per_sec if api_key else 1.0
+
+        downloaded = 0
+        rejected = 0
+
+        try:
+            # Search for images in the species category
+            search_term = species.scientific_name
+
+            params = {
+                "action": "query",
+                "format": "json",
+                "generator": "search",
+                "gsrsearch": f"filetype:bitmap {search_term}",
+                "gsrnamespace": 6,  # File namespace
+                "gsrlimit": 50,
+                "prop": "imageinfo",
+                "iiprop": "url|extmetadata|size",
+            }
+
+            with httpx.Client(timeout=30, headers=self.HEADERS) as client:
+                response = client.get(self.BASE_URL, params=params)
+                response.raise_for_status()
+                data = response.json()
+
+            pages = data.get("query", {}).get("pages", {})
+
+            for page_id, page in pages.items():
+                if int(page_id) < 0:
+                    continue
+
+                imageinfo = page.get("imageinfo", [{}])[0]
+                url = imageinfo.get("url", "")
+                if not url:
+                    continue
+
+                # Check size
+                width = imageinfo.get("width", 0)
+                height = imageinfo.get("height", 0)
+                if width < 256 or height < 256:
+                    rejected += 1
+                    continue
+
+                # Get license from metadata
+                metadata = imageinfo.get("extmetadata", {})
+                license_info = metadata.get("LicenseShortName", {}).get("value", "")
+
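+                # Filter for commercial-safe licenses ("CC BY" alone would also match BY-NC and BY-ND)
+                license_upper = license_info.upper()
+                is_cc_by = "CC BY" in license_upper and "-NC" not in license_upper and "-ND" not in license_upper
+                if is_cc_by or "CC0" in license_upper or "PUBLIC DOMAIN" in license_upper: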
+                    license_code = license_info
+                else:
+                    rejected += 1
+                    continue
+
+                # Check if already exists
+                source_id = str(page_id)
+                existing = db.query(Image).filter(
+                    Image.source == self.name,
+                    Image.source_id == source_id,
+                ).first()
+
+                if existing:
+                    continue
+
+                # Get attribution
+                artist = metadata.get("Artist", {}).get("value", "Unknown")
+                # Clean HTML from artist
+                if "<" in artist:
+                    import re
+                    artist = re.sub(r"<[^>]+>", "", artist).strip()
+
+                attribution = f"{artist} via Wikimedia Commons ({license_code})"
+
+                # Create image record
+                image = Image(
+                    species_id=species.id,
+                    source=self.name,
+                    source_id=source_id,
+                    url=url,
+                    license=license_code,
+                    attribution=attribution,
+                    width=width,
+                    height=height,
+                    status="pending",
+                )
+                db.add(image)
+                db.commit()
+
+                # Queue for download
+                download_and_process_image.delay(image.id)
+                downloaded += 1
+
+            # Rate limiting
+            time.sleep(1.0 / rate_limit)
+
+        except Exception as e:
+            (logger or logging.getLogger(__name__)).error(f"Error scraping Wikimedia for {species.scientific_name}: {e}")
+
+        return {"downloaded": downloaded, "rejected": rejected}
+
+    def test_connection(self, api_key: ApiKey) -> str:
+        """Test Wikimedia API connection."""
+        params = {
+            "action": "query",
+            "format": "json",
+            "meta": "siteinfo",
+        }
+
+        with httpx.Client(timeout=10, headers=self.HEADERS) as client:
+            response = client.get(self.BASE_URL, params=params)
+            response.raise_for_status()
+
+        return "Wikimedia Commons API connection successful"
@@ -0,0 +1 @@
+# Utility functions
@@ -0,0 +1,80 @@
+"""Image deduplication utilities using perceptual hashing."""
+
+from typing import Optional
+
+import imagehash
+from PIL import Image as PILImage
+
+
+def calculate_phash(image_path: str) -> Optional[str]:
+    """
+    Calculate perceptual hash for an image.
+
+    Args:
+        image_path: Path to image file
+
+    Returns:
+        Hex string of perceptual hash, or None if failed
+    """
+    try:
+        with PILImage.open(image_path) as img:
+            return str(imagehash.phash(img))
+    except Exception:
+        return None
+
+
+def calculate_dhash(image_path: str) -> Optional[str]:
+    """
+    Calculate difference hash for an image.
+    Faster but less accurate than phash.
+
+    Args:
+        image_path: Path to image file
+
+    Returns:
+        Hex string of difference hash, or None if failed
+    """
+    try:
+        with PILImage.open(image_path) as img:
+            return str(imagehash.dhash(img))
+    except Exception:
+        return None
+
+
+def hashes_are_similar(hash1: str, hash2: str, threshold: int = 10) -> bool:
+    """
+    Check if two hashes are similar (potential duplicates).
+
+    Args:
+        hash1: First hash string
+        hash2: Second hash string
+        threshold: Maximum Hamming distance (default 10)
+
+    Returns:
+        True if hashes are similar
+    """
+    try:
+        h1 = imagehash.hex_to_hash(hash1)
+        h2 = imagehash.hex_to_hash(hash2)
+        return (h1 - h2) <= threshold
+    except Exception:
+        return False
+
+
+def hamming_distance(hash1: str, hash2: str) -> int:
+    """
+    Calculate Hamming distance between two hashes.
+
+    Args:
+        hash1: First hash string
+        hash2: Second hash string
+
+    Returns:
+        Hamming distance (0 = identical, higher = more different)
+    """
+    try:
+        h1 = imagehash.hex_to_hash(hash1)
+        h2 = imagehash.hex_to_hash(hash2)
+        return int(h1 - h2)
+    except Exception:
+        return 64  # Maximum distance
@@ -0,0 +1,109 @@
+"""Image quality assessment utilities."""
+
+import numpy as np
+from PIL import Image as PILImage
+from scipy import ndimage
+
+
+def calculate_blur_score(image_path: str) -> float:
+    """
+    Calculate blur score using Laplacian variance.
+    Higher score = sharper image.
+
+    Args:
+        image_path: Path to image file
+
+    Returns:
+        Variance of Laplacian (higher = sharper)
+    """
+    try:
+        with PILImage.open(image_path) as img:
+            # Use floats: the Laplacian has negative values, which wrap around in uint8
+            img_array = np.asarray(img.convert("L"), dtype=np.float64)
+        return float(np.var(ndimage.laplace(img_array)))
+    except Exception:
+        return 0.0
+
+
+def is_too_blurry(image_path: str, threshold: float = 100.0) -> bool:
+    """
+    Check if image is too blurry for training.
+
+    Args:
+        image_path: Path to image file
+        threshold: Minimum acceptable blur score (default 100)
+
+    Returns:
+        True if image is too blurry
+    """
+    score = calculate_blur_score(image_path)
+    return score < threshold
+
+
+def get_image_dimensions(image_path: str) -> tuple[int, int]:
+    """
+    Get image dimensions.
+
+    Args:
+        image_path: Path to image file
+
+    Returns:
+        Tuple of (width, height)
+    """
+    try:
+        with PILImage.open(image_path) as img:
+            return img.size
+    except Exception:
+        return (0, 0)
+
+
+def is_too_small(image_path: str, min_size: int = 256) -> bool:
+    """
+    Check if image is too small for training.
+
+    Args:
+        image_path: Path to image file
+        min_size: Minimum dimension size (default 256)
+
+    Returns:
+        True if image is too small
+    """
+    width, height = get_image_dimensions(image_path)
+    return width < min_size or height < min_size
+
+
+def resize_image(
+    image_path: str,
+    output_path: str = None,
+    max_size: int = 512,
+    quality: int = 95,
+) -> bool:
+    """
+    Resize image to max dimension while preserving aspect ratio.
+
+    Args:
+        image_path: Path to input image
+        output_path: Path for output (defaults to overwriting input)
+        max_size: Maximum dimension size (default 512)
+        quality: JPEG quality (default 95)
+
+    Returns:
+        True if successful
+    """
+    try:
+        output_path = output_path or image_path
+
+        with PILImage.open(image_path) as img:
+            # Only resize if larger than max_size
+            if max(img.size) > max_size:
+                img.thumbnail((max_size, max_size), PILImage.Resampling.LANCZOS)
+
+            # Convert to a JPEG-compatible mode if necessary (e.g. RGBA, P, LA)
+            if img.mode not in ("RGB", "L"):
+                img = img.convert("RGB")
+
+            img.save(output_path, "JPEG", quality=quality)
+
+        return True
+    except Exception:
+        return False
@@ -0,0 +1,92 @@
+import logging
+import os
+from datetime import datetime
+from pathlib import Path
+
+from app.config import get_settings
+
+settings = get_settings()
+
+
+def setup_logging():
+    """Configure file and console logging."""
+    logs_path = Path(settings.logs_path)
+    logs_path.mkdir(parents=True, exist_ok=True)
+
+    # Create a dated log file
+    log_file = logs_path / f"scraper_{datetime.now().strftime('%Y-%m-%d')}.log"
+
+    # Configure root logger
+    logging.basicConfig(
+        level=logging.INFO,
+        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
+        handlers=[
+            logging.FileHandler(log_file),
+            logging.StreamHandler()
+        ]
+    )
+
+    return logging.getLogger("plant_scraper")
+
+
+def get_logger(name: str = "plant_scraper"):
+    """Get a logger instance."""
+    logs_path = Path(settings.logs_path)
+    logs_path.mkdir(parents=True, exist_ok=True)
+
+    logger = logging.getLogger(name)
+
+    if not logger.handlers:
+        logger.setLevel(logging.INFO)
+
+        # File handler writing to a log file dated at handler creation (not rotated)
+        log_file = logs_path / f"scraper_{datetime.now().strftime('%Y-%m-%d')}.log"
+        file_handler = logging.FileHandler(log_file)
+        file_handler.setLevel(logging.INFO)
+        file_handler.setFormatter(logging.Formatter(
+            '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
+        ))
+
+        # Console handler
+        console_handler = logging.StreamHandler()
+        console_handler.setLevel(logging.INFO)
+        console_handler.setFormatter(logging.Formatter(
+            '%(asctime)s - %(levelname)s - %(message)s'
+        ))
+
+        logger.addHandler(file_handler)
+        logger.addHandler(console_handler)
+
+    return logger
+
+
+def get_job_logger(job_id: int):
+    """Get a logger specific to a job, writing to a job-specific file."""
+    logs_path = Path(settings.logs_path)
+    logs_path.mkdir(parents=True, exist_ok=True)
+
+    logger = logging.getLogger(f"job_{job_id}")
+
+    if not logger.handlers:
+        logger.setLevel(logging.DEBUG)
+
+        # Job-specific log file
+        job_log_file = logs_path / f"job_{job_id}.log"
+        file_handler = logging.FileHandler(job_log_file)
+        file_handler.setLevel(logging.DEBUG)
+        file_handler.setFormatter(logging.Formatter(
+            '%(asctime)s - %(levelname)s - %(message)s'
+        ))
+
+        # Also log to daily file
+        daily_log_file = logs_path / f"scraper_{datetime.now().strftime('%Y-%m-%d')}.log"
+        daily_handler = logging.FileHandler(daily_log_file)
+        daily_handler.setLevel(logging.INFO)
+        daily_handler.setFormatter(logging.Formatter(
+            '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
+        ))
+
+        logger.addHandler(file_handler)
+        logger.addHandler(daily_handler)
+
+    return logger
@@ -0,0 +1 @@
+# Celery workers
@@ -0,0 +1,36 @@
+from celery import Celery
+
+from app.config import get_settings
+
+settings = get_settings()
+
+celery_app = Celery(
+    "plant_scraper",
+    broker=settings.redis_url,
+    backend=settings.redis_url,
+    include=[
+        "app.workers.scrape_tasks",
+        "app.workers.quality_tasks",
+        "app.workers.export_tasks",
+        "app.workers.stats_tasks",
+    ],
+)
+
+celery_app.conf.update(
+    task_serializer="json",
+    accept_content=["json"],
+    result_serializer="json",
+    timezone="UTC",
+    enable_utc=True,
+    task_track_started=True,
+    task_time_limit=3600 * 24,  # 24 hour max per task
+    worker_prefetch_multiplier=1,
+    task_acks_late=True,
+    beat_schedule={
+        "refresh-stats-every-5min": {
+            "task": "app.workers.stats_tasks.refresh_stats",
+            "schedule": 300.0,  # Every 5 minutes
+        },
+    },
+    beat_schedule_filename="/tmp/celerybeat-schedule",
+)
@@ -0,0 +1,170 @@
+import json
+import os
+import random
+import shutil
+import zipfile
+from datetime import datetime
+from pathlib import Path
+
+from app.workers.celery_app import celery_app
+from app.database import SessionLocal
+from app.models import Export, Image, Species
+from app.config import get_settings
+
+settings = get_settings()
+
+
+@celery_app.task(bind=True)
+def generate_export(self, export_id: int):
+    """Generate a zip export for CoreML training."""
+    db = SessionLocal()
+    export = None  # defined up front so the except block can test it safely
+    try:
+        export = db.query(Export).filter(Export.id == export_id).first()
+        if not export:
+            return {"error": "Export not found"}
+
+        # Update status
+        export.status = "generating"
+        export.celery_task_id = self.request.id
+        db.commit()
+
+        # Parse filter criteria
+        criteria = json.loads(export.filter_criteria) if export.filter_criteria else {}
+        min_images = criteria.get("min_images_per_species", 100)
+        licenses = criteria.get("licenses")
+        min_quality = criteria.get("min_quality")
+        species_ids = criteria.get("species_ids")
+
+        # Build query for images
+        query = db.query(Image).filter(Image.status == "downloaded")
+
+        if licenses:
+            query = query.filter(Image.license.in_(licenses))
+
+        if min_quality:
+            query = query.filter(Image.quality_score >= min_quality)
+
+        if species_ids:
+            query = query.filter(Image.species_id.in_(species_ids))
+
+        # Group by species and filter by min count
+        from sqlalchemy import func
+        species_counts = db.query(
+            Image.species_id,
+            func.count(Image.id).label("count")
+        ).filter(Image.status == "downloaded").group_by(Image.species_id).all()
+
+        valid_species_ids = [s.species_id for s in species_counts if s.count >= min_images]
+
+        if species_ids:
+            valid_species_ids = [s for s in valid_species_ids if s in species_ids]
+
+        if not valid_species_ids:
+            export.status = "failed"
+            export.error_message = "No species meet the criteria"
+            export.completed_at = datetime.utcnow()
+            db.commit()
+            return {"error": "No species meet the criteria"}
+
+        # Create export directory
+        export_dir = Path(settings.exports_path) / f"export_{export_id}"
+        train_dir = export_dir / "Training"
+        test_dir = export_dir / "Testing"
+        train_dir.mkdir(parents=True, exist_ok=True)
+        test_dir.mkdir(parents=True, exist_ok=True)
+
+        total_images = 0
+        species_count = 0
+
+        # Process each valid species
+        for i, species_id in enumerate(valid_species_ids):
+            species = db.query(Species).filter(Species.id == species_id).first()
+            if not species:
+                continue
+
+            # Get images for this species
+            images_query = query.filter(Image.species_id == species_id)
+            if licenses:
+                images_query = images_query.filter(Image.license.in_(licenses))
+            if min_quality:
+                images_query = images_query.filter(Image.quality_score >= min_quality)
+
+            images = images_query.all()
+            if len(images) < min_images:
+                continue
+
+            species_count += 1
+
+            # Create species folders
+            species_name = species.scientific_name.replace(" ", "_")
+            (train_dir / species_name).mkdir(exist_ok=True)
+            (test_dir / species_name).mkdir(exist_ok=True)
+
+            # Shuffle and split
+            random.shuffle(images)
+            split_idx = int(len(images) * export.train_split)
+            train_images = images[:split_idx]
+            test_images = images[split_idx:]
+
+            # Copy images
+            for j, img in enumerate(train_images):
+                if img.local_path and os.path.exists(img.local_path):
+                    ext = Path(img.local_path).suffix or ".jpg"
+                    dest = train_dir / species_name / f"img_{j:05d}{ext}"
+                    shutil.copy2(img.local_path, dest)
+                    total_images += 1
+
+            for j, img in enumerate(test_images):
+                if img.local_path and os.path.exists(img.local_path):
+                    ext = Path(img.local_path).suffix or ".jpg"
+                    dest = test_dir / species_name / f"img_{j:05d}{ext}"
+                    shutil.copy2(img.local_path, dest)
+                    total_images += 1
+
+            # Update progress
+            self.update_state(
+                state="PROGRESS",
+                meta={
+                    "current": i + 1,
+                    "total": len(valid_species_ids),
+                    "species": species.scientific_name,
+                }
+            )
+
+        # Create zip file
+        zip_path = Path(settings.exports_path) / f"export_{export_id}.zip"
+        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zipf:
+            for root, dirs, files in os.walk(export_dir):
+                for file in files:
+                    file_path = Path(root) / file
+                    arcname = file_path.relative_to(export_dir)
+                    zipf.write(file_path, arcname)
+
+        # Clean up directory
+        shutil.rmtree(export_dir)
+
+        # Update export record
+        export.status = "completed"
+        export.file_path = str(zip_path)
+        export.file_size = zip_path.stat().st_size
+        export.species_count = species_count
+        export.image_count = total_images
+        export.completed_at = datetime.utcnow()
+        db.commit()
+
+        return {
+            "status": "completed",
+            "species_count": species_count,
+            "image_count": total_images,
+            "file_size": export.file_size,
+        }
+
+    except Exception as e:
+        if export:
+            export.status = "failed"
+            export.error_message = str(e)
+            export.completed_at = datetime.utcnow()
+            db.commit()
+        raise
+    finally:
+        db.close()
@@ -0,0 +1,224 @@
+import os
+from pathlib import Path
+
+import httpx
+from PIL import Image as PILImage
+import imagehash
+import numpy as np
+from scipy import ndimage
+
+from app.workers.celery_app import celery_app
+from app.database import SessionLocal
+from app.models import Image
+from app.config import get_settings
+
+settings = get_settings()
+
+
+def calculate_blur_score(image_path: str) -> float:
+    """Calculate blur score using Laplacian variance. Higher = sharper."""
+    try:
+        with PILImage.open(image_path) as img:
+            img_array = np.array(img.convert("L"), dtype=np.float64)  # float avoids uint8 wraparound in the Laplacian
+        laplacian = ndimage.laplace(img_array)
+        return float(np.var(laplacian))
+    except Exception:
+        return 0.0
+
+
+def calculate_phash(image_path: str) -> str:
+    """Calculate perceptual hash for deduplication."""
+    try:
+        with PILImage.open(image_path) as img:
+            return str(imagehash.phash(img))
+    except Exception:
+        return ""
+
+
+def check_color_distribution(image_path: str) -> tuple[bool, str]:
+    """Check if image has healthy color distribution for a plant photo.
+
+    Returns (passed, reason) tuple.
+    Rejects:
+    - Low color variance (mean channel std < 25): herbarium specimens (brown on white)
+    - No green + low variance (green ratio < 5% AND mean std < 40): monochrome illustrations
+    """
+    try:
+        with PILImage.open(image_path) as img:
+            arr = np.array(img.convert("RGB"), dtype=np.float64)
+
+        # Per-channel standard deviation
+        channel_stds = arr.std(axis=(0, 1))  # [R_std, G_std, B_std]
+        mean_std = float(channel_stds.mean())
+
+        if mean_std < 25:
+            return False, f"Low color variance ({mean_std:.1f})"
+
+        # Check green ratio
+        channel_means = arr.mean(axis=(0, 1))
+        total = channel_means.sum()
+        green_ratio = channel_means[1] / total if total > 0 else 0
+
+        if green_ratio < 0.05 and mean_std < 40:
+            return False, f"No green ({green_ratio:.2%}) + low variance ({mean_std:.1f})"
+
+        return True, ""
+    except Exception:
+        return True, ""  # Don't reject on error
+
+
+def resize_image(image_path: str, target_size: int = 512) -> bool:
+    """Resize image to target size while maintaining aspect ratio."""
+    try:
+        with PILImage.open(image_path) as img:
+            img.thumbnail((target_size, target_size), PILImage.Resampling.LANCZOS)
+            img.convert("RGB").save(image_path, "JPEG", quality=95)  # RGBA/P sources cannot be written as JPEG directly
+        return True
+    except Exception:
+        return False
+
+
+@celery_app.task
+def download_and_process_image(image_id: int):
+    """Download image, check quality, dedupe, and resize."""
+    db = SessionLocal()
+    image = None  # defined up front so the except block can test it safely
+    try:
+        image = db.query(Image).filter(Image.id == image_id).first()
+        if not image:
+            return {"error": "Image not found"}
+
+        # Create directory for species
+        species = image.species
+        species_dir = Path(settings.images_path) / species.scientific_name.replace(" ", "_")
+        species_dir.mkdir(parents=True, exist_ok=True)
+
+        # Download image
+        filename = f"{image.source}_{image.source_id or image.id}.jpg"
+        local_path = species_dir / filename
+
+        try:
+            headers = {
+                "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
+            }
+            with httpx.Client(timeout=30, headers=headers, follow_redirects=True) as client:
+                response = client.get(image.url)
+                response.raise_for_status()
+
+                with open(local_path, "wb") as f:
+                    f.write(response.content)
+        except Exception as e:
+            image.status = "rejected"
+            db.commit()
+            return {"error": f"Download failed: {e}"}
+
+        # Check minimum size
+        try:
+            with PILImage.open(local_path) as img:
+                width, height = img.size
+                if width < 256 or height < 256:
+                    os.remove(local_path)
+                    image.status = "rejected"
+                    db.commit()
+                    return {"error": "Image too small"}
+                image.width = width
+                image.height = height
+        except Exception as e:
+            if local_path.exists():
+                os.remove(local_path)
+            image.status = "rejected"
+            db.commit()
+            return {"error": f"Invalid image: {e}"}
+
+        # Calculate perceptual hash for deduplication
+        phash = calculate_phash(str(local_path))
+        if phash:
+            # Check for duplicates
+            existing = db.query(Image).filter(
+                Image.phash == phash,
+                Image.id != image.id,
+                Image.status == "downloaded"
+            ).first()
+
+            if existing:
+                os.remove(local_path)
+                image.status = "rejected"
+                image.phash = phash
+                db.commit()
+                return {"error": "Duplicate image"}
+
+            image.phash = phash
+
+        # Calculate blur score
+        quality_score = calculate_blur_score(str(local_path))
+        image.quality_score = quality_score
+
+        # Reject very blurry images (threshold can be tuned)
+        if quality_score < 100:  # Low variance = blurry
+            os.remove(local_path)
+            image.status = "rejected"
+            db.commit()
+            return {"error": "Image too blurry"}
+
+        # Check color distribution (reject herbarium specimens, illustrations)
+        color_ok, color_reason = check_color_distribution(str(local_path))
+        if not color_ok:
+            os.remove(local_path)
+            image.status = "rejected"
+            db.commit()
+            return {"error": f"Non-photo content: {color_reason}"}
+
+        # Resize to 512x512 max
+        resize_image(str(local_path))
+
+        # Update image record
+        image.local_path = str(local_path)
+        image.status = "downloaded"
+        db.commit()
+
+        return {
+            "status": "success",
+            "path": str(local_path),
+            "quality_score": quality_score,
+        }
+
+    except Exception as e:
+        if image:
+            image.status = "rejected"
+            db.commit()
+        return {"error": str(e)}
+    finally:
+        db.close()
+
+
+@celery_app.task(bind=True)
+def batch_process_pending_images(self, source: str = None, chunk_size: int = 500):
+    """Process ALL pending images in chunks, with progress tracking."""
+    db = SessionLocal()
+    try:
+        query = db.query(Image).filter(Image.status == "pending")
+        if source:
+            query = query.filter(Image.source == source)
+
+        total = query.count()
+        queued = 0
+        last_id = 0
+
+        while True:
+            # Keyset pagination: workers shrink the pending set as we go, so OFFSET would skip rows
+            chunk = query.filter(Image.id > last_id).order_by(Image.id).limit(chunk_size).all()
+            if not chunk:
+                break
+            for image in chunk:
+                download_and_process_image.delay(image.id)
+                queued += 1
+
+            last_id = chunk[-1].id
+
+            self.update_state(
+                state="PROGRESS",
+                meta={"queued": queued, "total": total},
+            )
+
+        return {"queued": queued, "total": total}
+    finally:
+        db.close()
@@ -0,0 +1,164 @@
+import json
+from datetime import datetime
+
+from app.workers.celery_app import celery_app
+from app.database import SessionLocal
+from app.models import Job, Species, Image
+from app.utils.logging import get_job_logger
+
+
+@celery_app.task(bind=True)
+def run_scrape_job(self, job_id: int):
+    """Main scrape task that dispatches to source-specific scrapers."""
+    logger = get_job_logger(job_id)
+    logger.info(f"Starting scrape job {job_id}")
+
+    db = SessionLocal()
+    job = None
+    try:
+        job = db.query(Job).filter(Job.id == job_id).first()
+        if not job:
+            logger.error(f"Job {job_id} not found")
+            return {"error": "Job not found"}
+
+        logger.info(f"Job: {job.name}, Source: {job.source}")
+
+        # Update job status
+        job.status = "running"
+        job.started_at = datetime.utcnow()
+        job.celery_task_id = self.request.id
+        db.commit()
+
+        # Get species to scrape
+        if job.species_filter:
+            species_ids = json.loads(job.species_filter)
+            query = db.query(Species).filter(Species.id.in_(species_ids))
+            logger.info(f"Filtered to species IDs: {species_ids}")
+        else:
+            query = db.query(Species)
+            logger.info("Scraping all species")
+
+        # Filter by image count if requested
+        if job.only_without_images or job.max_images:
+            from sqlalchemy import func
+            # Subquery to count downloaded images per species
+            image_count_subquery = (
+                db.query(Image.species_id, func.count(Image.id).label("count"))
+                .filter(Image.status == "downloaded")
+                .group_by(Image.species_id)
+                .subquery()
+            )
+            # Left join with the count subquery
+            query = query.outerjoin(
+                image_count_subquery,
+                Species.id == image_count_subquery.c.species_id
+            )
+
+            if job.only_without_images:
+                # Filter where count is NULL or 0
+                query = query.filter(
+                    (image_count_subquery.c.count.is_(None)) | (image_count_subquery.c.count == 0)
+                )
+                logger.info("Filtering to species without images")
+            elif job.max_images:
+                # Filter where count is NULL or less than max_images
+                query = query.filter(
+                    (image_count_subquery.c.count.is_(None)) | (image_count_subquery.c.count < job.max_images)
+                )
+                logger.info(f"Filtering to species with fewer than {job.max_images} images")
+
+        species_list = query.all()
+        logger.info(f"Total species to scrape: {len(species_list)}")
+
+        job.progress_total = len(species_list)
+        db.commit()
+
+        # Import scraper based on source
+        from app.scrapers import get_scraper
+        scraper = get_scraper(job.source)
+
+        if not scraper:
+            error_msg = f"Unknown source: {job.source}"
+            logger.error(error_msg)
+            job.status = "failed"
+            job.error_message = error_msg
+            job.completed_at = datetime.utcnow()
+            db.commit()
+            return {"error": error_msg}
+
+        logger.info(f"Using scraper: {scraper.name}")
+
+        # Scrape each species
+        for i, species in enumerate(species_list):
+            try:
+                # Update progress
+                job.progress_current = i + 1
+                db.commit()
+
+                logger.info(f"[{i+1}/{len(species_list)}] Scraping: {species.scientific_name}")
+
+                # Update task state for real-time monitoring
+                self.update_state(
+                    state="PROGRESS",
+                    meta={
+                        "current": i + 1,
+                        "total": len(species_list),
+                        "species": species.scientific_name,
+                    }
+                )
+
+                # Run scraper for this species
+                results = scraper.scrape_species(species, db, logger)
+                downloaded = results.get("downloaded", 0)
+                rejected = results.get("rejected", 0)
+                job.images_downloaded += downloaded
+                job.images_rejected += rejected
+                db.commit()
+
+                logger.info(f"  -> Downloaded: {downloaded}, Rejected: {rejected}")
+
+            except Exception as e:
+                # Log error but continue with other species
+                logger.error(f"Error scraping {species.scientific_name}: {e}", exc_info=True)
+                continue
+
+        # Mark job complete
+        job.status = "completed"
+        job.completed_at = datetime.utcnow()
+        db.commit()
+
+        logger.info(f"Job {job_id} completed. Total downloaded: {job.images_downloaded}, rejected: {job.images_rejected}")
+
+        return {
+            "status": "completed",
+            "downloaded": job.images_downloaded,
+            "rejected": job.images_rejected,
+        }
+
+    except Exception as e:
+        logger.error(f"Job {job_id} failed with error: {e}", exc_info=True)
+        if job:
+            job.status = "failed"
+            job.error_message = str(e)
+            job.completed_at = datetime.utcnow()
+            db.commit()
+        raise
+    finally:
+        db.close()
+
+
+@celery_app.task
+def pause_scrape_job(job_id: int):
+    """Pause a running scrape job."""
+    db = SessionLocal()
+    try:
+        job = db.query(Job).filter(Job.id == job_id).first()
+        if job and job.status == "running":
+            job.status = "paused"
+            db.commit()
+            # Revoke the Celery task
+            if job.celery_task_id:
+                celery_app.control.revoke(job.celery_task_id, terminate=True)
+        return {"status": job.status if job else "not_found"}
+    finally:
+        db.close()
@@ -0,0 +1,193 @@
+import json
+import os
+from datetime import datetime
+from pathlib import Path
+
+from sqlalchemy import func, case, text
+
+from app.workers.celery_app import celery_app
+from app.database import SessionLocal
+from app.models import Species, Image, Job
+from app.models.cached_stats import CachedStats
+from app.config import get_settings
+
+
+def get_directory_size_fast(path: str) -> int:
+    """Get directory size in bytes using fast os.scandir."""
+    total = 0
+    try:
+        with os.scandir(path) as it:
+            for entry in it:
+                try:
+                    if entry.is_file(follow_symlinks=False):
+                        total += entry.stat(follow_symlinks=False).st_size
+                    elif entry.is_dir(follow_symlinks=False):
+                        total += get_directory_size_fast(entry.path)
+                except (OSError, PermissionError):
+                    pass
+    except (OSError, PermissionError):
+        pass
+    return total
+
+
+@celery_app.task
+def refresh_stats():
+    """Calculate and cache dashboard statistics."""
+    print("=== STATS TASK: Starting refresh ===", flush=True)
+
+    db = SessionLocal()
+    try:
+        # Use raw SQL for maximum performance on SQLite
+        # All counts in a single query
+        counts_sql = text("""
+            SELECT
+                (SELECT COUNT(*) FROM species) as total_species,
+                (SELECT COUNT(*) FROM images) as total_images,
+                (SELECT COUNT(*) FROM images WHERE status = 'downloaded') as images_downloaded,
+                (SELECT COUNT(*) FROM images WHERE status = 'pending') as images_pending,
+                (SELECT COUNT(*) FROM images WHERE status = 'rejected') as images_rejected
+        """)
+        counts = db.execute(counts_sql).fetchone()
+        total_species = counts[0] or 0
+        total_images = counts[1] or 0
+        images_downloaded = counts[2] or 0
+        images_pending = counts[3] or 0
+        images_rejected = counts[4] or 0
+
+        # Per-source stats - single query with GROUP BY
+        source_sql = text("""
+            SELECT
+                source,
+                COUNT(*) as total,
+                SUM(CASE WHEN status = 'downloaded' THEN 1 ELSE 0 END) as downloaded,
+                SUM(CASE WHEN status = 'pending' THEN 1 ELSE 0 END) as pending,
+                SUM(CASE WHEN status = 'rejected' THEN 1 ELSE 0 END) as rejected
+            FROM images
+            GROUP BY source
+        """)
+        source_stats_raw = db.execute(source_sql).fetchall()
+        sources = [
+            {
+                "source": s[0],
+                "image_count": s[1],
+                "downloaded": s[2] or 0,
+                "pending": s[3] or 0,
+                "rejected": s[4] or 0,
+            }
+            for s in source_stats_raw
+        ]
+
+        # Per-license stats - single indexed query
+        license_sql = text("""
+            SELECT license, COUNT(*) as count
+            FROM images
+            WHERE status = 'downloaded'
+            GROUP BY license
+        """)
+        license_stats_raw = db.execute(license_sql).fetchall()
+        licenses = [
+            {"license": row[0], "count": row[1]}
+            for row in license_stats_raw
+        ]
+
+        # Job stats - single query
+        job_sql = text("""
+            SELECT
+                SUM(CASE WHEN status = 'running' THEN 1 ELSE 0 END) as running,
+                SUM(CASE WHEN status = 'pending' THEN 1 ELSE 0 END) as pending,
+                SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) as completed,
+                SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) as failed
+            FROM jobs
+        """)
+        job_counts = db.execute(job_sql).fetchone()
+        jobs = {
+            "running": job_counts[0] or 0,
+            "pending": job_counts[1] or 0,
+            "completed": job_counts[2] or 0,
+            "failed": job_counts[3] or 0,
+        }
+
+        # Top species by image count - optimized with index
+        top_sql = text("""
+            SELECT s.id, s.scientific_name, s.common_name, COUNT(i.id) as image_count
+            FROM species s
+            INNER JOIN images i ON i.species_id = s.id AND i.status = 'downloaded'
+            GROUP BY s.id
+            ORDER BY image_count DESC
+            LIMIT 10
+        """)
+        top_species_raw = db.execute(top_sql).fetchall()
+        top_species = [
+            {
+                "id": s[0],
+                "scientific_name": s[1],
+                "common_name": s[2],
+                "image_count": s[3],
+            }
+            for s in top_species_raw
+        ]
+
+        # Under-represented species - use pre-computed counts
+        under_sql = text("""
+            SELECT s.id, s.scientific_name, s.common_name, COALESCE(img_counts.cnt, 0) as image_count
+            FROM species s
+            LEFT JOIN (
+                SELECT species_id, COUNT(*) as cnt
+                FROM images
+                WHERE status = 'downloaded'
+                GROUP BY species_id
+            ) img_counts ON img_counts.species_id = s.id
+            WHERE COALESCE(img_counts.cnt, 0) < 100
+            ORDER BY image_count ASC
+            LIMIT 10
+        """)
+        under_rep_raw = db.execute(under_sql).fetchall()
+        under_represented = [
+            {
+                "id": s[0],
+                "scientific_name": s[1],
+                "common_name": s[2],
+                "image_count": s[3],
+            }
+            for s in under_rep_raw
+        ]
+
+        # Calculate disk usage (fast recursive scan)
+        settings = get_settings()
+        disk_usage_bytes = get_directory_size_fast(settings.images_path)
+        disk_usage_mb = round(disk_usage_bytes / (1024 * 1024), 2)
+
+        # Build the stats object
+        stats = {
+            "total_species": total_species,
+            "total_images": total_images,
+            "images_downloaded": images_downloaded,
+            "images_pending": images_pending,
+            "images_rejected": images_rejected,
+            "disk_usage_mb": disk_usage_mb,
+            "sources": sources,
+            "licenses": licenses,
+            "jobs": jobs,
+            "top_species": top_species,
+            "under_represented": under_represented,
+        }
+
+        # Store in database
+        cached = db.query(CachedStats).filter(CachedStats.key == "dashboard_stats").first()
+        if cached:
+            cached.value = json.dumps(stats)
+            cached.updated_at = datetime.utcnow()
+        else:
+            cached = CachedStats(key="dashboard_stats", value=json.dumps(stats))
+            db.add(cached)
+
+        db.commit()
+        print(f"=== STATS TASK: Refreshed (species={total_species}, images={total_images}) ===", flush=True)
+
+        return {"status": "success", "total_species": total_species, "total_images": total_images}
+
+    except Exception as e:
+        print(f"=== STATS TASK ERROR: {e} ===", flush=True)
+        raise
+    finally:
+        db.close()
@@ -0,0 +1,34 @@
+# Web framework
+fastapi==0.109.0
+uvicorn[standard]==0.27.0
+python-multipart==0.0.6
+
+# Database
+sqlalchemy==2.0.25
+alembic==1.13.1
+aiosqlite==0.19.0
+
+# Task queue
+celery==5.3.6
+redis==5.0.1
+
+# Image processing
+Pillow==10.2.0
+imagehash==4.3.1
+imagededup==0.3.3.post2
+
+# HTTP clients
+httpx==0.26.0
+aiohttp==3.9.3
+
+# Search
+duckduckgo-search
+
+# Utilities
+python-dotenv==1.0.0
+pydantic==2.5.3
+pydantic-settings==2.1.0
+
+# Testing
+pytest==7.4.4
+pytest-asyncio==0.23.3
@@ -0,0 +1 @@
+# Tests
@@ -0,0 +1,114 @@
+# Docker Compose for Unraid
+#
+# Access at http://YOUR_UNRAID_IP:8580
+#
+# ============================================
+# CONFIGURE THESE PATHS FOR YOUR UNRAID SETUP
+# ============================================
+# Edit the left side of the colon (:) for each volume mount
+#
+# DATABASE_PATH: Where to store the SQLite database
+# IMAGES_PATH:   Where to store downloaded images (can be large, 100GB+)
+# EXPORTS_PATH:  Where to store generated export zip files
+# IMPORTS_PATH:  Where to place images for bulk import (source/species/images)
+# LOGS_PATH:     Where to store scraper log files for debugging
+
+services:
+  backend:
+    build:
+      context: /mnt/user/appdata/PlantGuideScraper/backend
+      dockerfile: Dockerfile
+    container_name: plant-scraper-backend
+    restart: unless-stopped
+    volumes:
+      - /mnt/user/appdata/PlantGuideScraper/backend:/app:ro
+      # === CONFIGURABLE DATA PATHS ===
+      - /mnt/user/downloads/PlantGuideDocker/database:/data/db          # DATABASE_PATH
+      - /mnt/user/downloads/PlantGuideDocker/images:/data/images        # IMAGES_PATH
+      - /mnt/user/downloads/PlantGuideDocker/exports:/data/exports      # EXPORTS_PATH
+      - /mnt/user/downloads/PlantGuideDocker/imports:/data/imports      # IMPORTS_PATH
+      - /mnt/user/downloads/PlantGuideDocker/logs:/data/logs            # LOGS_PATH
+    environment:
+      - DATABASE_URL=sqlite:////data/db/plants.sqlite
+      - REDIS_URL=redis://plant-scraper-redis:6379/0
+      - IMAGES_PATH=/data/images
+      - EXPORTS_PATH=/data/exports
+      - IMPORTS_PATH=/data/imports
+      - LOGS_PATH=/data/logs
+    depends_on:
+      - redis
+    command: uvicorn app.main:app --host 0.0.0.0 --port 8000
+    networks:
+      - plant-scraper
+
+  celery:
+    build:
+      context: /mnt/user/appdata/PlantGuideScraper/backend
+      dockerfile: Dockerfile
+    container_name: plant-scraper-celery
+    restart: unless-stopped
+    volumes:
+      - /mnt/user/appdata/PlantGuideScraper/backend:/app:ro
+      # === CONFIGURABLE DATA PATHS (must match backend) ===
+      - /mnt/user/downloads/PlantGuideDocker/database:/data/db          # DATABASE_PATH
+      - /mnt/user/downloads/PlantGuideDocker/images:/data/images        # IMAGES_PATH
+      - /mnt/user/downloads/PlantGuideDocker/exports:/data/exports      # EXPORTS_PATH
+      - /mnt/user/downloads/PlantGuideDocker/imports:/data/imports      # IMPORTS_PATH
+      - /mnt/user/downloads/PlantGuideDocker/logs:/data/logs            # LOGS_PATH
+    environment:
+      - DATABASE_URL=sqlite:////data/db/plants.sqlite
+      - REDIS_URL=redis://plant-scraper-redis:6379/0
+      - IMAGES_PATH=/data/images
+      - EXPORTS_PATH=/data/exports
+      - IMPORTS_PATH=/data/imports
+      - LOGS_PATH=/data/logs
+    depends_on:
+      - redis
+    command: celery -A app.workers.celery_app worker --beat --loglevel=info --concurrency=4
+    networks:
+      - plant-scraper
+
+  redis:
+    image: redis:7-alpine
+    container_name: plant-scraper-redis
+    restart: unless-stopped
+    volumes:
+      - /mnt/user/appdata/PlantGuideScraper/redis:/data
+    networks:
+      - plant-scraper
+
+  frontend:
+    build:
+      context: /mnt/user/appdata/PlantGuideScraper/frontend
+      dockerfile: Dockerfile
+    container_name: plant-scraper-frontend
+    restart: unless-stopped
+    volumes:
+      - /mnt/user/appdata/PlantGuideScraper/frontend:/app
+      - plant-scraper-node-modules:/app/node_modules
+    environment:
+      - VITE_API_URL=
+    command: npm run dev -- --host
+    networks:
+      - plant-scraper
+
+  nginx:
+    image: nginx:alpine
+    container_name: plant-scraper-nginx
+    restart: unless-stopped
+    ports:
+      - "8580:80"
+    volumes:
+      - /mnt/user/appdata/PlantGuideScraper/nginx/nginx.conf:/etc/nginx/nginx.conf:ro
+    depends_on:
+      - backend
+      - frontend
+    networks:
+      - plant-scraper
+
+networks:
+  plant-scraper:
+    name: plant-scraper
+
+volumes:
+  plant-scraper-node-modules:
@@ -0,0 +1,73 @@
+services:
+  backend:
+    build:
+      context: ./backend
+      dockerfile: Dockerfile
+    container_name: plant-scraper-backend
+    # Port exposed only internally, nginx proxies to it
+    volumes:
+      - ./backend:/app
+      - ./data:/data
+    environment:
+      - DATABASE_URL=sqlite:////data/db/plants.sqlite
+      - REDIS_URL=redis://redis:6379/0
+      - IMAGES_PATH=/data/images
+      - EXPORTS_PATH=/data/exports
+      - IMPORTS_PATH=/data/imports
+      - LOGS_PATH=/data/logs
+    depends_on:
+      - redis
+    command: uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
+
+  celery:
+    build:
+      context: ./backend
+      dockerfile: Dockerfile
+    container_name: plant-scraper-celery
+    volumes:
+      - ./backend:/app
+      - ./data:/data
+    environment:
+      - DATABASE_URL=sqlite:////data/db/plants.sqlite
+      - REDIS_URL=redis://redis:6379/0
+      - IMAGES_PATH=/data/images
+      - EXPORTS_PATH=/data/exports
+      - IMPORTS_PATH=/data/imports
+      - LOGS_PATH=/data/logs
+    depends_on:
+      - redis
+    command: celery -A app.workers.celery_app worker --beat --loglevel=info --concurrency=4
+
+  redis:
+    image: redis:7-alpine
+    container_name: plant-scraper-redis
+    # Port exposed only internally, not to host (avoid conflicts)
+    volumes:
+      - redis_data:/data
+
+  frontend:
+    build:
+      context: ./frontend
+      dockerfile: Dockerfile
+    container_name: plant-scraper-frontend
+    # Port exposed only internally, nginx proxies to it
+    volumes:
+      - ./frontend:/app
+      - /app/node_modules
+    environment:
+      - VITE_API_URL=
+    command: npm run dev -- --host
+
+  nginx:
+    image: nginx:alpine
+    container_name: plant-scraper-nginx
+    ports:
+      - "80:80"
+    volumes:
+      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
+    depends_on:
+      - backend
+      - frontend
+
+volumes:
+  redis_data:
@@ -0,0 +1,564 @@
+# Houseplant Image Scraper - Master Plan
+
+## Overview
+
+Web-based interface for managing a multi-source image scraping pipeline targeting 5-10K houseplant species with 1-5M total images. Runs on Unraid via Docker, exports datasets for CoreML training.
+
+---
+
+## Requirements Summary
+
+| Requirement | Value |
+|-------------|-------|
+| Platform | Web app in Docker on Unraid |
+| Sources | iNaturalist/GBIF, Flickr, Wikimedia Commons, Trefle, USDA PLANTS, EOL |
+| API keys | Configurable per service |
+| Species list | Manual import (CSV/paste) |
+| Grouping | Species, genus, source, license (faceted) |
+| Search/filter | Yes |
+| Quality filter | Automatic (hash dedup, blur, size) |
+| Progress | Real-time dashboard |
+| Storage | `/species_name/image.jpg` + SQLite DB |
+| Export | Filtered zip for CoreML, downloadable anytime |
+| Auth | None (single user) |
+| Deployment | Docker Compose |
+
+---
+
+## Create ML Export Requirements
+
+Per [Apple's documentation](https://developer.apple.com/documentation/createml/creating-an-image-classifier-model):
+
+- **Folder structure**: `/SpeciesName/image001.jpg` (folder name = class label)
+- **Train/Test split**: 80/20 recommended, separate folders
+- **Balance**: Roughly equal images per class (avoid bias)
+- **No metadata needed**: Create ML uses folder names as labels
+
+### Export Format
+
+```
+dataset_export/
+├── Training/
+│   ├── Monstera_deliciosa/
+│   │   ├── img001.jpg
+│   │   └── ...
+│   ├── Philodendron_hederaceum/
+│   └── ...
+└── Testing/
+    ├── Monstera_deliciosa/
+    └── ...
+```
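A minimal sketch of how the per-species train/test split into the `Training/` and `Testing/` folders shown above could be planned. The function name `plan_export`, its parameters, and the fixed seed are assumptions made for illustration; this is not the project's actual export worker.

```python
import random
from pathlib import PurePosixPath


def plan_export(images_by_species, train_split=0.8, seed=0):
    """Map each source image path to a destination under Training/ or Testing/.

    images_by_species: dict of species folder name -> list of image paths.
    Hypothetical helper: the split is done per species so that every class
    with at least two images appears in both folders.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    plan = {}
    for species, paths in sorted(images_by_species.items()):
        shuffled = sorted(paths)
        rng.shuffle(shuffled)
        n_train = round(len(shuffled) * train_split)
        # Keep at least one image on each side when there are two or more.
        if len(shuffled) >= 2:
            n_train = min(max(n_train, 1), len(shuffled) - 1)
        for i, src in enumerate(shuffled):
            subset = "Training" if i < n_train else "Testing"
            name = PurePosixPath(src).name
            dest = PurePosixPath("dataset_export") / subset / species / name
            plan[src] = str(dest)
    return plan
```

Splitting per species rather than over the whole pool keeps the class balance of the two folders roughly aligned, which fits the balance requirement listed above.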
+
+---
+
+## Data Sources
+
+| Source | API/Method | License Filter | Rate Limits | Notes |
+|--------|------------|----------------|-------------|-------|
+| **iNaturalist/GBIF** | Bulk DwC-A export + API | CC0, CC-BY | 1 req/sec, 10k/day, 5GB/hr media | Best source: Research Grade = verified |
+| **Flickr** | flickr.photos.search | license=4,9 (CC-BY, CC0) | 3600 req/hr | Good supplemental |
+| **Wikimedia Commons** | MediaWiki API + pyWikiCommons | CC-BY, CC-BY-SA, PD | Generous | Category-based search |
+| **Trefle.io** | REST API | Open source | Free tier | Species metadata + some images |
+| **USDA PLANTS** | REST API | Public Domain | Generous | US-focused, limited images |
+| **Plant.id** | REST API | Commercial | Paid | For validation, not scraping |
+| **Encyclopedia of Life** | API | Mixed | Check each | Aggregator |
+
+### Source References
+
+- iNaturalist: https://www.inaturalist.org/pages/developers
+- iNaturalist bulk download: https://forum.inaturalist.org/t/one-time-bulk-download-dataset/18741
+- Flickr API: https://www.flickr.com/services/api/flickr.photos.search.html
+- Wikimedia Commons API: https://commons.wikimedia.org/wiki/Commons:API
+- pyWikiCommons: https://pypi.org/project/pyWikiCommons/
+- Trefle.io: https://trefle.io/
+- USDA PLANTS: https://data.nal.usda.gov/dataset/usda-plants-database-api-r
+
+### Flickr License IDs
+
+| ID | License |
+|----|---------|
+| 0 | All Rights Reserved |
+| 1 | CC BY-NC-SA 2.0 |
+| 2 | CC BY-NC 2.0 |
+| 3 | CC BY-NC-ND 2.0 |
+| 4 | CC BY 2.0 (Commercial OK) |
+| 5 | CC BY-SA 2.0 |
+| 6 | CC BY-ND 2.0 |
+| 7 | No known copyright restrictions |
+| 8 | United States Government Work |
+| 9 | Public Domain (CC0) |
+
+**For commercial use**: Filter to license IDs 4, 7, 8, 9 only.
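As a rough illustration of the rule above, the allowed IDs can be kept in one place and reused both for the `license` argument of `flickr.photos.search` (which accepts a comma-separated list) and for re-checking results afterwards. The helper names below are hypothetical and are not taken from the project's Flickr scraper.

```python
# License IDs taken from the table above; the helper names are hypothetical.
COMMERCIAL_LICENSE_IDS = (4, 7, 8, 9)


def license_param(ids=COMMERCIAL_LICENSE_IDS):
    """Build the comma-separated value for the `license` search argument."""
    return ",".join(str(i) for i in ids)


def is_commercial_ok(license_id):
    """Return True when a photo's license ID is in the allowed set."""
    return int(license_id) in COMMERCIAL_LICENSE_IDS
```

Checking each returned photo again with `is_commercial_ok` is a cheap guard in case a response includes a license the query did not ask for.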
+
+---
+
+## Image Quality Pipeline
+
+| Stage | Library | Purpose |
+|-------|---------|---------|
+| **Deduplication** | imagededup | Find near-duplicates (CNN and perceptual-hash methods) |
+| **Blur detection** | scipy + Sobel variance | Reject blurry images |
+| **Size filter** | Pillow | Min 256x256 |
+| **Resize** | Pillow | Normalize to 512x512 |
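The deduplication stage in the table relies on imagededup and imagehash. Purely to illustrate the idea behind perceptual hashing, here is a pure-Python average hash over a grid that is assumed to be already downscaled to 8x8 grayscale. The function names and the `max_distance` threshold are assumptions, not values from the project.

```python
def average_hash(gray):
    """Return an integer hash for a small grayscale grid (rows of 0-255 ints).

    Each bit is 1 when the pixel is brighter than the grid's mean.
    """
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a, b):
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")


def is_near_duplicate(gray_a, gray_b, max_distance=5):
    """Treat two grids as duplicates when their hashes differ in few bits."""
    return hamming(average_hash(gray_a), average_hash(gray_b)) <= max_distance
```

Because the hash only records whether each pixel is above the mean, a uniform brightness shift leaves it unchanged, which is why re-encoded or slightly adjusted copies of the same photo tend to land within a small Hamming distance.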
+
+### Library References
+
+- imagededup: https://github.com/idealo/imagededup
+- imagehash: https://github.com/JohannesBuchner/imagehash
+
+---
+
+## Technology Stack
+
+| Component | Choice | Rationale |
+|-----------|--------|-----------|
+| **Backend** | FastAPI (Python) | Async, fast, ML ecosystem, great docs |
+| **Frontend** | React + Tailwind | Fast dev, good component libraries |
+| **Database** | SQLite (+ FTS5) | Simple, no separate container, sufficient for single-user |
+| **Task Queue** | Celery + Redis | Long-running scrape jobs, good monitoring |
+| **Containers** | Docker Compose | Multi-service orchestration |
+
+Reference: https://github.com/fastapi/full-stack-fastapi-template
+
+---
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│                         DOCKER COMPOSE ON UNRAID                         │
+├─────────────────────────────────────────────────────────────────────────┤
+│                                                                          │
+│  ┌─────────────┐    ┌─────────────────────────────────────────────────┐ │
+│  │   NGINX     │    │              FASTAPI BACKEND                     │ │
+│  │   :80       │───▶│  /api/species     - CRUD species list           │ │
+│  │             │    │  /api/sources     - API key management          │ │
+│  └──────┬──────┘    │  /api/jobs        - Scrape job control          │ │
+│         │           │  /api/images      - Search, filter, browse      │ │
+│         ▼           │  /api/export      - Generate zip for CoreML     │ │
+│  ┌─────────────┐    │  /api/stats       - Dashboard metrics           │ │
+│  │   REACT     │    └─────────────────────────────────────────────────┘ │
+│  │   SPA       │                         │                              │
+│  │   :3000     │                         ▼                              │
+│  └─────────────┘    ┌─────────────────────────────────────────────────┐ │
+│                     │              CELERY WORKERS                      │ │
+│  ┌─────────────┐    │  - iNaturalist scraper                          │ │
+│  │   REDIS     │◀───│  - Flickr scraper                               │ │
+│  │   :6379     │    │  - Wikimedia scraper                            │ │
+│  └─────────────┘    │  - Quality filter pipeline                      │ │
+│                     │  - Export generator                              │ │
+│                     └─────────────────────────────────────────────────┘ │
+│                                          │                              │
+│                                          ▼                              │
+│  ┌─────────────────────────────────────────────────────────────────────┐│
+│  │                         STORAGE (Bind Mounts)                        ││
+│  │  /data/db/plants.sqlite     - Species, images metadata, jobs        ││
+│  │  /data/images/{species}/    - Downloaded images                     ││
+│  │  /data/exports/             - Generated zip files                   ││
+│  └─────────────────────────────────────────────────────────────────────┘│
+└─────────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## Database Schema
+
+```sql
+-- Species master list (imported from CSV)
+CREATE TABLE species (
+    id INTEGER PRIMARY KEY,
+    scientific_name TEXT UNIQUE NOT NULL,
+    common_name TEXT,
+    genus TEXT,
+    family TEXT,
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
+);
+
+-- Full-text search index
+CREATE VIRTUAL TABLE species_fts USING fts5(
+    scientific_name,
+    common_name,
+    genus,
+    content='species',
+    content_rowid='id'
+);
+
+-- API credentials
+CREATE TABLE api_keys (
+    id INTEGER PRIMARY KEY,
+    source TEXT UNIQUE NOT NULL,  -- 'flickr', 'inaturalist', 'wikimedia', 'trefle'
+    api_key TEXT NOT NULL,
+    api_secret TEXT,
+    rate_limit_per_sec REAL DEFAULT 1.0,
+    enabled BOOLEAN DEFAULT TRUE
+);
+
+-- Downloaded images
+CREATE TABLE images (
+    id INTEGER PRIMARY KEY,
+    species_id INTEGER REFERENCES species(id),
+    source TEXT NOT NULL,
+    source_id TEXT,  -- Original ID from source
+    url TEXT NOT NULL,
+    local_path TEXT,
+    license TEXT NOT NULL,
+    attribution TEXT,
+    width INTEGER,
+    height INTEGER,
+    phash TEXT,  -- Perceptual hash for dedup
+    quality_score REAL,  -- Blur/quality metric
+    status TEXT DEFAULT 'pending',  -- pending, downloaded, rejected, deleted
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+    UNIQUE(source, source_id)
+);
+
+-- Index for common queries
+CREATE INDEX idx_images_species ON images(species_id);
+CREATE INDEX idx_images_status ON images(status);
+CREATE INDEX idx_images_source ON images(source);
+CREATE INDEX idx_images_phash ON images(phash);
+
+-- Scrape jobs
+CREATE TABLE jobs (
+    id INTEGER PRIMARY KEY,
+    name TEXT NOT NULL,
+    source TEXT NOT NULL,
+    species_filter TEXT,  -- JSON array of species IDs or NULL for all
+    status TEXT DEFAULT 'pending',  -- pending, running, paused, completed, failed
+    progress_current INTEGER DEFAULT 0,
+    progress_total INTEGER DEFAULT 0,
+    images_downloaded INTEGER DEFAULT 0,
+    images_rejected INTEGER DEFAULT 0,
+    started_at TIMESTAMP,
+    completed_at TIMESTAMP,
+    error_message TEXT
+);
+
+-- Export jobs
+CREATE TABLE exports (
+    id INTEGER PRIMARY KEY,
+    name TEXT NOT NULL,
+    filter_criteria TEXT,  -- JSON: min_images, licenses, min_quality, species_ids
+    train_split REAL DEFAULT 0.8,
+    status TEXT DEFAULT 'pending',  -- pending, generating, completed, failed
+    file_path TEXT,
+    file_size INTEGER,
+    species_count INTEGER,
+    image_count INTEGER,
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+    completed_at TIMESTAMP
+);
+```
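One way a scraper could lean on the `UNIQUE(source, source_id)` constraint from the schema above is to make inserts idempotent, so re-running a job does not create duplicate rows. This is a sketch under assumptions: the table is a trimmed copy of `images`, and `record_image` is a hypothetical helper rather than code from the project.

```python
import sqlite3

# Trimmed copy of the images table above, keeping only what the example needs.
SCHEMA = """
CREATE TABLE images (
    id INTEGER PRIMARY KEY,
    source TEXT NOT NULL,
    source_id TEXT,
    url TEXT NOT NULL,
    license TEXT NOT NULL,
    status TEXT DEFAULT 'pending',
    UNIQUE(source, source_id)
);
"""


def record_image(conn, source, source_id, url, license_name):
    """Insert an image row; return True if it was new, False if already known."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO images (source, source_id, url, license) "
        "VALUES (?, ?, ?, ?)",
        (source, source_id, url, license_name),
    )
    return cur.rowcount == 1


# In-memory database so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

Note that SQLite treats NULL values as distinct in a UNIQUE constraint, so rows with a NULL `source_id` would not be deduplicated this way.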
+
+---
+
+## API Endpoints
+
+### Species
+
+| Method | Endpoint | Description |
+|--------|----------|-------------|
+| GET | `/api/species` | List species (paginated, searchable) |
+| POST | `/api/species` | Create single species |
+| POST | `/api/species/import` | Bulk import from CSV |
+| GET | `/api/species/{id}` | Get species details |
+| PUT | `/api/species/{id}` | Update species |
+| DELETE | `/api/species/{id}` | Delete species |
+
+### API Keys
+
+| Method | Endpoint | Description |
+|--------|----------|-------------|
+| GET | `/api/sources` | List configured sources |
+| PUT | `/api/sources/{source}` | Update source config (key, rate limit) |
+
+### Jobs
+
+| Method | Endpoint | Description |
+|--------|----------|-------------|
+| GET | `/api/jobs` | List jobs |
+| POST | `/api/jobs` | Create scrape job |
+| GET | `/api/jobs/{id}` | Get job status |
+| POST | `/api/jobs/{id}/pause` | Pause job |
+| POST | `/api/jobs/{id}/resume` | Resume job |
+| POST | `/api/jobs/{id}/cancel` | Cancel job |
+
+### Images
+
+| Method | Endpoint | Description |
+|--------|----------|-------------|
+| GET | `/api/images` | List images (paginated, filterable) |
+| GET | `/api/images/{id}` | Get image details |
+| DELETE | `/api/images/{id}` | Delete image |
+| POST | `/api/images/bulk-delete` | Bulk delete |
+
+### Export
+
+| Method | Endpoint | Description |
+|--------|----------|-------------|
+| GET | `/api/exports` | List exports |
+| POST | `/api/exports` | Create export job |
+| GET | `/api/exports/{id}` | Get export status |
+| GET | `/api/exports/{id}/download` | Download zip file |
+
+### Stats
+
+| Method | Endpoint | Description |
+|--------|----------|-------------|
+| GET | `/api/stats` | Dashboard statistics |
+| GET | `/api/stats/sources` | Per-source breakdown |
+| GET | `/api/stats/species` | Per-species image counts |
+
+---
+
+## UI Screens
+
+### 1. Dashboard
+
+- Total species, images by source, images by license
+- Active jobs with progress bars
+- Quick stats: images/sec, disk usage
+- Recent activity feed
+
+### 2. Species Management
+
+- Table: scientific name, common name, genus, image count
+- Import CSV button (drag-and-drop)
+- Search/filter by name, genus
+- Bulk select → "Start Scrape" button
+- Inline editing
+
+### 3. API Keys
+
+- Card per source with:
+  - API key input (masked)
+  - API secret input (if applicable)
+  - Rate limit slider
+  - Enable/disable toggle
+  - Test connection button
+
+### 4. Image Browser
+
+- Grid view with thumbnails (lazy-loaded)
+- Filters sidebar:
+  - Species (autocomplete)
+  - Source (checkboxes)
+  - License (checkboxes)
+  - Quality score (range slider)
+  - Status (tabs: all, pending, downloaded, rejected)
+- Sort by: date, quality, species
+- Bulk select → actions (delete, re-queue)
+- Click to view full-size + metadata
+
+### 5. Jobs
+
+- Table: name, source, status, progress, dates
+- Real-time progress updates (WebSocket)
+- Actions: pause, resume, cancel, view logs
+
+### 6. Export
+
+- Filter builder:
+  - Min images per species
+  - License whitelist
+  - Min quality score
+  - Species selection (all or specific)
+- Train/test split slider (default 80/20)
+- Preview: estimated species count, image count, file size
+- "Generate Zip" button
+- Download history with re-download links
+
+---
+
+## Tradeoffs
+
+| Decision | Alternative | Why This Choice |
+|----------|-------------|-----------------|
+| SQLite | PostgreSQL | Single-user, simpler Docker setup, sufficient for millions of rows |
+| Celery+Redis | RQ, Dramatiq | Battle-tested, good monitoring (Flower) |
+| React | Vue, Svelte | Largest ecosystem, more component libraries |
+| Separate workers | Threads in FastAPI | Better isolation, can scale workers independently |
+| Nginx reverse proxy | Traefik | Simpler config for single-app deployment |
+
+---
+
+## Risks & Mitigations
+
+| Risk | Likelihood | Mitigation |
+|------|------------|------------|
+| iNaturalist rate limits (5GB/hr) | High | Throttle downloads, prioritize species with low counts |
+| Disk fills up | Medium | Dashboard shows disk usage, configurable storage limits |
+| Scrape jobs crash mid-run | Medium | Job state in DB, resume from last checkpoint |
+| Perceptual hash collisions | Low | Store hash, allow manual review of flagged duplicates |
+| API keys exposed | Low | Environment variables, not stored in code |
+| SQLite write contention | Low | WAL mode, single writer pattern via Celery |
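The WAL mitigation in the last row can be sketched with the standard library `sqlite3` module. The helper name `open_wal_connection`, the timeout value, and the temporary path are assumptions for illustration; the project itself connects through SQLAlchemy, which this sketch does not cover.

```python
import os
import sqlite3
import tempfile


def open_wal_connection(db_path, busy_timeout_ms=5000):
    """Open a SQLite connection with WAL journaling and a busy timeout.

    Hypothetical helper: the timeout is an illustrative default.
    """
    conn = sqlite3.connect(db_path)
    # WAL lets readers proceed while a single writer appends to the log.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    # Wait for a competing writer instead of failing immediately.
    conn.execute(f"PRAGMA busy_timeout={int(busy_timeout_ms)}")
    return conn, mode


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conn, mode = open_wal_connection(os.path.join(tmp, "plants.sqlite"))
        print(mode)
        conn.close()
```

WAL still allows only one writer at a time, so the busy timeout matters when several Celery worker processes write concurrently.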
+
+---
+
+## Implementation Phases
+
+### Phase 1: Foundation
+- [ ] Docker Compose setup (FastAPI, React, Redis, Nginx)
+- [ ] Database schema + migrations (Alembic)
+- [ ] Basic FastAPI skeleton with health checks
+- [ ] React app scaffolding with Tailwind
+
+### Phase 2: Core Data Management
+- [ ] Species CRUD API
+- [ ] CSV import endpoint
+- [ ] Species list UI with search/filter
+- [ ] API keys management UI
+
+### Phase 3: iNaturalist Scraper
+- [ ] Celery worker setup
+- [ ] iNaturalist/GBIF scraper task
+- [ ] Job management API
+- [ ] Real-time progress (WebSocket or polling)
+
+### Phase 4: Quality Pipeline
+- [ ] Image download worker
+- [ ] Perceptual hash deduplication
+- [ ] Blur detection + quality scoring
+- [ ] Resize to 512x512
+
+### Phase 5: Image Browser
+- [ ] Image listing API with filters
+- [ ] Thumbnail generation
+- [ ] Grid view UI
+- [ ] Bulk operations
+
+### Phase 6: Additional Scrapers
+- [ ] Flickr scraper
+- [ ] Wikimedia Commons scraper
+- [ ] Trefle scraper (metadata + images)
+- [ ] USDA PLANTS scraper
+
+### Phase 7: Export
+- [ ] Export job API
+- [ ] Train/test split logic
+- [ ] Zip generation worker
+- [ ] Download endpoint
+- [ ] Export UI with filters
+
+### Phase 8: Dashboard & Polish
+- [ ] Stats API
+- [ ] Dashboard UI with charts
+- [ ] Job monitoring UI
+- [ ] Error handling + logging
+- [ ] Documentation
+
+---
+
+## File Structure
+
+```
+PlantGuideScraper/
+├── docker-compose.yml
+├── .env.example
+├── docs/
+│   └── master_plan.md
+├── backend/
+│   ├── Dockerfile
+│   ├── requirements.txt
+│   ├── alembic/
+│   │   └── versions/
+│   ├── app/
+│   │   ├── __init__.py
+│   │   ├── main.py
+│   │   ├── config.py
+│   │   ├── database.py
+│   │   ├── models/
+│   │   │   ├── species.py
+│   │   │   ├── image.py
+│   │   │   ├── job.py
+│   │   │   └── export.py
+│   │   ├── schemas/
+│   │   │   └── ...
+│   │   ├── api/
+│   │   │   ├── species.py
+│   │   │   ├── images.py
+│   │   │   ├── jobs.py
+│   │   │   ├── exports.py
+│   │   │   └── stats.py
+│   │   ├── scrapers/
+│   │   │   ├── base.py
+│   │   │   ├── inaturalist.py
+│   │   │   ├── flickr.py
+│   │   │   ├── wikimedia.py
+│   │   │   └── trefle.py
+│   │   ├── workers/
+│   │   │   ├── celery_app.py
+│   │   │   ├── scrape_tasks.py
+│   │   │   ├── quality_tasks.py
+│   │   │   └── export_tasks.py
+│   │   └── utils/
+│   │       ├── image_quality.py
+│   │       └── dedup.py
+│   └── tests/
+├── frontend/
+│   ├── Dockerfile
+│   ├── package.json
+│   ├── src/
+│   │   ├── App.tsx
+│   │   ├── components/
+│   │   ├── pages/
+│   │   │   ├── Dashboard.tsx
+│   │   │   ├── Species.tsx
+│   │   │   ├── Images.tsx
+│   │   │   ├── Jobs.tsx
+│   │   │   ├── Export.tsx
+│   │   │   └── Settings.tsx
+│   │   ├── hooks/
+│   │   └── api/
+│   └── public/
+├── nginx/
+│   └── nginx.conf
+└── data/                  # Bind mount (not in repo)
+    ├── db/
+    ├── images/
+    └── exports/
+```
+
+---
+
+## Environment Variables
+
+```bash
+# Backend
+DATABASE_URL=sqlite:////data/db/plants.sqlite
+REDIS_URL=redis://redis:6379/0
+IMAGES_PATH=/data/images
+EXPORTS_PATH=/data/exports
+
+# API Keys (user-provided)
+FLICKR_API_KEY=
+FLICKR_API_SECRET=
+INATURALIST_APP_ID=
+INATURALIST_APP_SECRET=
+TREFLE_API_KEY=
+
+# Optional
+LOG_LEVEL=INFO
+CELERY_CONCURRENCY=4
+```
+
+---
+
+## Commands
+
+```bash
+# Development
+docker-compose up --build
+
+# Production
+docker-compose -f docker-compose.yml -f docker-compose.prod.yml up -d
+
+# Run migrations
+docker-compose exec backend alembic upgrade head
+
+# View Celery logs
+docker-compose logs -f celery
+
+# Access Redis CLI
+docker-compose exec redis redis-cli
+```
@@ -0,0 +1,14 @@
+FROM node:20-alpine
+
+WORKDIR /app
+
+# Install dependencies
+COPY package*.json ./
+RUN npm install
+
+# Copy source
+COPY . .
+
+EXPOSE 3000
+
+CMD ["npm", "run", "dev", "--", "--host"]
@@ -0,0 +1,14 @@
+<!DOCTYPE html>
+<html lang="en">
+  <head>
+    <meta charset="UTF-8" />
+    <link rel="icon" type="image/svg+xml" href="/vite.svg" />
+    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+    <title>PlantGuideScraper</title>
+    <script type="module" crossorigin src="/assets/index-BXIq8BNP.js"></script>
+    <link rel="stylesheet" crossorigin href="/assets/index-uHzGA3u6.css">
+  </head>
+  <body>
+    <div id="root"></div>
+  </body>
+</html>
@@ -0,0 +1,13 @@
+<!DOCTYPE html>
+<html lang="en">
+  <head>
+    <meta charset="UTF-8" />
+    <link rel="icon" type="image/svg+xml" href="/vite.svg" />
+    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+    <title>PlantGuideScraper</title>
+  </head>
+  <body>
+    <div id="root"></div>
+    <script type="module" src="/src/main.tsx"></script>
+  </body>
+</html>
@@ -0,0 +1,31 @@
+{
+  "name": "plant-scraper-frontend",
+  "private": true,
+  "version": "1.0.0",
+  "type": "module",
+  "scripts": {
+    "dev": "vite",
+    "build": "tsc && vite build",
+    "preview": "vite preview"
+  },
+  "dependencies": {
+    "react": "^18.2.0",
+    "react-dom": "^18.2.0",
+    "react-router-dom": "^6.21.0",
+    "@tanstack/react-query": "^5.17.0",
+    "axios": "^1.6.0",
+    "lucide-react": "^0.303.0",
+    "recharts": "^2.10.0",
+    "clsx": "^2.1.0"
+  },
+  "devDependencies": {
+    "@types/react": "^18.2.0",
+    "@types/react-dom": "^18.2.0",
+    "@vitejs/plugin-react": "^4.2.0",
+    "autoprefixer": "^10.4.16",
+    "postcss": "^8.4.32",
+    "tailwindcss": "^3.4.0",
+    "typescript": "^5.3.0",
+    "vite": "^5.0.0"
+  }
+}
@@ -0,0 +1,6 @@
+export default {
+  plugins: {
+    tailwindcss: {},
+    autoprefixer: {},
+  },
+}
@@ -0,0 +1,81 @@
+import { BrowserRouter, Routes, Route, NavLink } from 'react-router-dom'
+import {
+  LayoutDashboard,
+  Leaf,
+  Image,
+  Play,
+  Download,
+  Settings,
+} from 'lucide-react'
+import { clsx } from 'clsx'
+
+import Dashboard from './pages/Dashboard'
+import Species from './pages/Species'
+import Images from './pages/Images'
+import Jobs from './pages/Jobs'
+import Export from './pages/Export'
+import SettingsPage from './pages/Settings'
+
+const navItems = [
+  { to: '/', icon: LayoutDashboard, label: 'Dashboard' },
+  { to: '/species', icon: Leaf, label: 'Species' },
+  { to: '/images', icon: Image, label: 'Images' },
+  { to: '/jobs', icon: Play, label: 'Jobs' },
+  { to: '/export', icon: Download, label: 'Export' },
+  { to: '/settings', icon: Settings, label: 'Settings' },
+]
+
+function Sidebar() {
+  return (
+    <aside className="w-64 bg-white border-r border-gray-200 min-h-screen">
+      <div className="p-4 border-b border-gray-200">
+        <h1 className="text-xl font-bold text-green-600 flex items-center gap-2">
+          <Leaf className="w-6 h-6" />
+          PlantScraper
+        </h1>
+      </div>
+      <nav className="p-4">
+        <ul className="space-y-2">
+          {navItems.map((item) => (
+            <li key={item.to}>
+              <NavLink
+                to={item.to}
+                className={({ isActive }) =>
+                  clsx(
+                    'flex items-center gap-3 px-3 py-2 rounded-lg transition-colors',
+                    isActive
+                      ? 'bg-green-50 text-green-700'
+                      : 'text-gray-600 hover:bg-gray-100'
+                  )
+                }
+              >
+                <item.icon className="w-5 h-5" />
+                {item.label}
+              </NavLink>
+            </li>
+          ))}
+        </ul>
+      </nav>
+    </aside>
+  )
+}
+
+export default function App() {
+  return (
+    <BrowserRouter>
+      <div className="flex min-h-screen">
+        <Sidebar />
+        <main className="flex-1 p-8">
+          <Routes>
+            <Route path="/" element={<Dashboard />} />
+            <Route path="/species" element={<Species />} />
+            <Route path="/images" element={<Images />} />
+            <Route path="/jobs" element={<Jobs />} />
+            <Route path="/export" element={<Export />} />
+            <Route path="/settings" element={<SettingsPage />} />
+          </Routes>
+        </main>
+      </div>
+    </BrowserRouter>
+  )
+}
@@ -0,0 +1,275 @@
+import axios from 'axios'
+
+const API_URL = import.meta.env.VITE_API_URL || ''
+
+export const api = axios.create({
+  baseURL: `${API_URL}/api`,
+  headers: {
+    'Content-Type': 'application/json',
+  },
+})
+
+// Types
+export interface Species {
+  id: number
+  scientific_name: string
+  common_name: string | null
+  genus: string | null
+  family: string | null
+  created_at: string
+  image_count: number
+}
+
+export interface SpeciesListResponse {
+  items: Species[]
+  total: number
+  page: number
+  page_size: number
+  pages: number
+}
+
+export interface Image {
+  id: number
+  species_id: number
+  species_name: string | null
+  source: string
+  source_id: string | null
+  url: string
+  local_path: string | null
+  license: string
+  attribution: string | null
+  width: number | null
+  height: number | null
+  quality_score: number | null
+  status: string
+  created_at: string
+}
+
+export interface ImageListResponse {
+  items: Image[]
+  total: number
+  page: number
+  page_size: number
+  pages: number
+}
+
+export interface Job {
+  id: number
+  name: string
+  source: string
+  species_filter: string | null
+  status: string
+  progress_current: number
+  progress_total: number
+  images_downloaded: number
+  images_rejected: number
+  started_at: string | null
+  completed_at: string | null
+  error_message: string | null
+  created_at: string
+}
+
+export interface JobListResponse {
+  items: Job[]
+  total: number
+}
+
+export interface JobProgress {
+  status: string
+  progress_current: number
+  progress_total: number
+  current_species?: string
+}
+
+export interface Export {
+  id: number
+  name: string
+  filter_criteria: string | null
+  train_split: number
+  status: string
+  file_path: string | null
+  file_size: number | null
+  species_count: number | null
+  image_count: number | null
+  created_at: string
+  completed_at: string | null
+  error_message: string | null
+}
+
+export interface SourceConfig {
+  name: string
+  label: string
+  requires_secret: boolean
+  auth_type: 'none' | 'api_key' | 'api_key_secret' | 'oauth'
+  configured: boolean
+  enabled: boolean
+  api_key_masked: string | null
+  has_secret: boolean
+  has_access_token: boolean
+  rate_limit_per_sec: number
+  default_rate: number
+}
+
+export interface Stats {
+  total_species: number
+  total_images: number
+  images_downloaded: number
+  images_pending: number
+  images_rejected: number
+  disk_usage_mb: number
+  sources: Array<{
+    source: string
+    image_count: number
+    downloaded: number
+    pending: number
+    rejected: number
+  }>
+  licenses: Array<{
+    license: string
+    count: number
+  }>
+  jobs: {
+    running: number
+    pending: number
+    completed: number
+    failed: number
+  }
+  top_species: Array<{
+    id: number
+    scientific_name: string
+    common_name: string | null
+    image_count: number
+  }>
+  under_represented: Array<{
+    id: number
+    scientific_name: string
+    common_name: string | null
+    image_count: number
+  }>
+}
+
+// API functions
+export const speciesApi = {
+  list: (params?: { page?: number; page_size?: number; search?: string; genus?: string; has_images?: boolean; max_images?: number; min_images?: number }) =>
+    api.get<SpeciesListResponse>('/species', { params }),
+  get: (id: number) => api.get<Species>(`/species/${id}`),
+  create: (data: { scientific_name: string; common_name?: string; genus?: string; family?: string }) =>
+    api.post<Species>('/species', data),
+  update: (id: number, data: Partial<Species>) => api.put<Species>(`/species/${id}`, data),
+  delete: (id: number) => api.delete(`/species/${id}`),
+  import: (file: File) => {
+    const formData = new FormData()
+    formData.append('file', file)
+    return api.post('/species/import', formData, {
+      headers: { 'Content-Type': 'multipart/form-data' },
+    })
+  },
+  importJson: (file: File) => {
+    const formData = new FormData()
+    formData.append('file', file)
+    return api.post('/species/import-json', formData, {
+      headers: { 'Content-Type': 'multipart/form-data' },
+    })
+  },
+  genera: () => api.get<string[]>('/species/genera/list'),
+}
+
+export interface ImportScanResult {
+  available: boolean
+  message?: string
+  sources: Array<{
+    name: string
+    species_count: number
+    image_count: number
+  }>
+  total_images: number
+  matched_species: number
+  unmatched_species: string[]
+}
+
+export interface ImportResult {
+  imported: number
+  skipped: number
+  errors: string[]
+}
+
+export const imagesApi = {
+  list: (params?: {
+    page?: number
+    page_size?: number
+    species_id?: number
+    source?: string
+    license?: string
+    status?: string
+    min_quality?: number
+    search?: string
+  }) => api.get<ImageListResponse>('/images', { params }),
+  get: (id: number) => api.get<Image>(`/images/${id}`),
+  delete: (id: number) => api.delete(`/images/${id}`),
+  bulkDelete: (ids: number[]) => api.post('/images/bulk-delete', ids),
+  sources: () => api.get<string[]>('/images/sources'),
+  licenses: () => api.get<string[]>('/images/licenses'),
+  processPending: (source?: string) =>
+    api.post<{ pending_count: number; task_id: string }>('/images/process-pending', null, {
+      params: source ? { source } : undefined,
+    }),
+  processPendingStatus: (taskId: string) =>
+    api.get<{ task_id: string; state: string; queued?: number; total?: number }>(
+      `/images/process-pending/status/${taskId}`
+    ),
+  scanImports: () => api.get<ImportScanResult>('/images/import/scan'),
+  runImport: (moveFiles: boolean = false) =>
+    api.post<ImportResult>('/images/import/run', null, { params: { move_files: moveFiles } }),
+}
+
+export const jobsApi = {
+  list: (params?: { status?: string; source?: string; limit?: number }) =>
+    api.get<JobListResponse>('/jobs', { params }),
+  get: (id: number) => api.get<Job>(`/jobs/${id}`),
+  create: (data: { name: string; source: string; species_ids?: number[]; only_without_images?: boolean; max_images?: number }) =>
+    api.post<Job>('/jobs', data),
+  progress: (id: number) => api.get<JobProgress>(`/jobs/${id}/progress`),
+  pause: (id: number) => api.post(`/jobs/${id}/pause`),
+  resume: (id: number) => api.post(`/jobs/${id}/resume`),
+  cancel: (id: number) => api.post(`/jobs/${id}/cancel`),
+}
+
+export const exportsApi = {
+  list: (params?: { limit?: number }) => api.get('/exports', { params }),
+  get: (id: number) => api.get<Export>(`/exports/${id}`),
+  create: (data: {
+    name: string
+    filter_criteria: {
+      min_images_per_species: number
+      licenses?: string[]
+      min_quality?: number
+      species_ids?: number[]
+    }
+    train_split: number
+  }) => api.post<Export>('/exports', data),
+  preview: (data: any) => api.post('/exports/preview', data),
+  progress: (id: number) => api.get(`/exports/${id}/progress`),
+  download: (id: number) => `${API_URL}/api/exports/${id}/download`,
+  delete: (id: number) => api.delete(`/exports/${id}`),
+}
+
+export const sourcesApi = {
+  list: () => api.get<SourceConfig[]>('/sources'),
+  get: (source: string) => api.get<SourceConfig>(`/sources/${source}`),
+  update: (source: string, data: {
+    api_key?: string
+    api_secret?: string
+    access_token?: string
+    rate_limit_per_sec?: number
+    enabled?: boolean
+  }) => api.put(`/sources/${source}`, { source, ...data }),
+  test: (source: string) => api.post(`/sources/${source}/test`),
+  delete: (source: string) => api.delete(`/sources/${source}`),
+}
+
+export const statsApi = {
+  get: () => api.get<Stats>('/stats'),
+  sources: () => api.get('/stats/sources'),
+  species: (params?: { min_count?: number; max_count?: number }) =>
+    api.get('/stats/species', { params }),
+}
@@ -0,0 +1,7 @@
+@tailwind base;
+@tailwind components;
+@tailwind utilities;
+
+body {
+  @apply bg-gray-50 text-gray-900;
+}
@@ -0,0 +1,22 @@
+import React from 'react'
+import ReactDOM from 'react-dom/client'
+import { QueryClient, QueryClientProvider } from '@tanstack/react-query'
+import App from './App'
+import './index.css'
+
+const queryClient = new QueryClient({
+  defaultOptions: {
+    queries: {
+      refetchOnWindowFocus: false,
+      retry: 1,
+    },
+  },
+})
+
+ReactDOM.createRoot(document.getElementById('root')!).render(
+  <React.StrictMode>
+    <QueryClientProvider client={queryClient}>
+      <App />
+    </QueryClientProvider>
+  </React.StrictMode>,
+)
@@ -0,0 +1,413 @@
+import { useState } from 'react'
+import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
+import {
+  Leaf,
+  Image,
+  HardDrive,
+  Clock,
+  CheckCircle,
+  XCircle,
+  AlertCircle,
+} from 'lucide-react'
+import {
+  BarChart,
+  Bar,
+  XAxis,
+  YAxis,
+  Tooltip,
+  ResponsiveContainer,
+  PieChart,
+  Pie,
+  Cell,
+} from 'recharts'
+import { statsApi, imagesApi } from '../api/client'
+
+const COLORS = ['#22c55e', '#3b82f6', '#f59e0b', '#ef4444', '#8b5cf6', '#ec4899']
+
+function StatCard({
+  title,
+  value,
+  icon: Icon,
+  color,
+}: {
+  title: string
+  value: string | number
+  icon: React.ElementType
+  color: string
+}) {
+  return (
+    <div className="bg-white rounded-lg shadow p-6">
+      <div className="flex items-center justify-between">
+        <div>
+          <p className="text-sm text-gray-500">{title}</p>
+          <p className="text-2xl font-bold mt-1">{value}</p>
+        </div>
+        <div className={`p-3 rounded-full ${color}`}>
+          <Icon className="w-6 h-6 text-white" />
+        </div>
+      </div>
+    </div>
+  )
+}
+
+export default function Dashboard() {
+  const queryClient = useQueryClient()
+
+  const [processingTaskId, setProcessingTaskId] = useState<string | null>(null)
+
+  const processPendingMutation = useMutation({
+    mutationFn: () => imagesApi.processPending(),
+    onSuccess: (res) => {
+      setProcessingTaskId(res.data.task_id)
+    },
+  })
+
+  // Poll task status while processing
+  const { data: taskStatus } = useQuery({
+    queryKey: ['process-pending-status', processingTaskId],
+    queryFn: async () => {
+      const res = await imagesApi.processPendingStatus(processingTaskId!)
+      if (res.data.state === 'SUCCESS' || res.data.state === 'FAILURE') {
+        // Task finished - clear tracking and refresh stats
+        setTimeout(() => {
+          setProcessingTaskId(null)
+          queryClient.invalidateQueries({ queryKey: ['stats'] })
+        }, 0)
+      }
+      return res.data
+    },
+    enabled: !!processingTaskId,
+    refetchInterval: (query) => {
+      const state = query.state.data?.state
+      if (state === 'SUCCESS' || state === 'FAILURE') return false
+      return 2000
+    },
+  })
+
+  const isProcessing = !!processingTaskId && taskStatus?.state !== 'SUCCESS' && taskStatus?.state !== 'FAILURE'
+
+  const { data: stats, isLoading, error, failureCount } = useQuery({
+    queryKey: ['stats'],
+    queryFn: async () => {
+      const startTime = Date.now()
+      console.log('[Dashboard] Fetching stats...')
+
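+      // Reject after 10 s. statsApi.get() takes no AbortSignal, so the request is abandoned, not cancelled.
+      let timeoutId: ReturnType<typeof setTimeout> | undefined
+      const timedOut = () => Object.assign(new Error('Request timed out'), { name: 'AbortError' })
+      const timeout = new Promise<never>((_, reject) => { timeoutId = setTimeout(() => reject(timedOut()), 10000) })
+      try {
+        const res = await Promise.race([statsApi.get(), timeout])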
+        clearTimeout(timeoutId)
+        console.log(`[Dashboard] Stats loaded in ${Date.now() - startTime}ms`)
+        return res.data
+      } catch (err: any) {
+        clearTimeout(timeoutId)
+        if (err.name === 'AbortError' || err.code === 'ECONNABORTED') {
+          console.error('[Dashboard] Request timed out after 10 seconds')
+          throw new Error('Request timed out after 10 seconds - backend may be unresponsive')
+        }
+        console.error('[Dashboard] Stats fetch failed:', err)
+        console.error('[Dashboard] Error details:', {
+          message: err.message,
+          status: err.response?.status,
+          statusText: err.response?.statusText,
+          data: err.response?.data,
+        })
+        throw err
+      }
+    },
+    refetchInterval: 30000,  // 30 seconds - matches backend cache
+    retry: 1,
+    staleTime: 25000,
+  })
+
+  // Debug panel to test backend
+  const { data: debugData, refetch: refetchDebug, isFetching: isDebugFetching } = useQuery({
+    queryKey: ['debug'],
+    queryFn: async () => {
+      const res = await fetch('/api/debug')
+      return res.json()
+    },
+    enabled: false, // Only fetch when manually triggered
+  })
+
+  if (isLoading) {
+    return (
+      <div className="flex items-center justify-center h-64">
+        <div className="text-center">
+          <div className="animate-spin rounded-full h-8 w-8 border-b-2 border-green-600 mx-auto"></div>
+          <p className="mt-2 text-gray-500">Loading stats...</p>
+        </div>
+      </div>
+    )
+  }
+
+  if (error) {
+    const err = error as any
+    return (
+      <div className="space-y-4 m-4">
+        <div className="bg-red-50 border border-red-200 rounded-lg p-6">
+          <h2 className="text-lg font-bold text-red-700 mb-2">Failed to load dashboard</h2>
+          <div className="space-y-2 text-sm">
+            <p><strong>Error:</strong> {err.message}</p>
+            {err.response && (
+              <>
+                <p><strong>Status:</strong> {err.response.status} {err.response.statusText}</p>
+                {err.response.data && (
+                  <p><strong>Response:</strong> {JSON.stringify(err.response.data)}</p>
+                )}
+              </>
+            )}
+            <p><strong>Failed attempts:</strong> {failureCount}</p>
+          </div>
+        </div>
+
+        <div className="bg-blue-50 border border-blue-200 rounded-lg p-6">
+          <h3 className="font-bold text-blue-700 mb-2">Debug Backend Connection</h3>
+          <button
+            onClick={() => refetchDebug()}
+            disabled={isDebugFetching}
+            className="px-4 py-2 bg-blue-600 text-white rounded hover:bg-blue-700 disabled:opacity-50"
+          >
+            {isDebugFetching ? 'Testing...' : 'Test Backend'}
+          </button>
+          {debugData && (
+            <pre className="mt-4 p-4 bg-white rounded text-xs overflow-auto">
+              {JSON.stringify(debugData, null, 2)}
+            </pre>
+          )}
+        </div>
+      </div>
+    )
+  }
+
+  if (!stats) {
+    return <div>Failed to load stats</div>
+  }
+
+  const sourceData = stats.sources.map((s) => ({
+    name: s.source,
+    downloaded: s.downloaded,
+    pending: s.pending,
+    rejected: s.rejected,
+  }))
+
+  const licenseData = stats.licenses.map((l, i) => ({
+    name: l.license,
+    value: l.count,
+    color: COLORS[i % COLORS.length],
+  }))
+
+  return (
+    <div className="space-y-6">
+      <h1 className="text-2xl font-bold">Dashboard</h1>
+
+      {/* Stats Grid */}
+      <div className="grid grid-cols-1 md:grid-cols-2 lg:grid-cols-4 gap-4">
+        <StatCard
+          title="Total Species"
+          value={stats.total_species.toLocaleString()}
+          icon={Leaf}
+          color="bg-green-500"
+        />
+        <StatCard
+          title="Downloaded Images"
+          value={stats.images_downloaded.toLocaleString()}
+          icon={Image}
+          color="bg-blue-500"
+        />
+        <StatCard
+          title="Pending Images"
+          value={stats.images_pending.toLocaleString()}
+          icon={Clock}
+          color="bg-yellow-500"
+        />
+        <StatCard
+          title="Disk Usage"
+          value={`${stats.disk_usage_mb.toFixed(1)} MB`}
+          icon={HardDrive}
+          color="bg-purple-500"
+        />
+      </div>
+
+      {/* Process Pending Banner */}
+      {(stats.images_pending > 0 || isProcessing) && (
+        <div className="bg-yellow-50 border border-yellow-200 rounded-lg p-4 flex items-center justify-between">
+          <div>
+            <p className="font-semibold text-yellow-800">
+              {isProcessing
+                ? 'Processing pending images...'
+                : `${stats.images_pending.toLocaleString()} pending images`}
+            </p>
+            <p className="text-sm text-yellow-700">
+              {isProcessing && taskStatus?.queued != null && taskStatus?.total != null
+                ? `Queued ${taskStatus.queued.toLocaleString()} of ${taskStatus.total.toLocaleString()} for download`
+                : isProcessing
+                ? 'Queueing images for download...'
+                : 'These images have been scraped but not yet downloaded and processed.'}
+            </p>
+          </div>
+          <button
+            onClick={() => processPendingMutation.mutate()}
+            disabled={isProcessing || processPendingMutation.isPending}
+            className="px-4 py-2 bg-yellow-600 text-white rounded-lg hover:bg-yellow-700 disabled:opacity-50 whitespace-nowrap"
+          >
+            {isProcessing ? 'Processing...' : processPendingMutation.isPending ? 'Starting...' : 'Process All Pending'}
+          </button>
+        </div>
+      )}
+
+      {/* Jobs Status */}
+      <div className="bg-white rounded-lg shadow p-6">
+        <h2 className="text-lg font-semibold mb-4">Jobs Status</h2>
+        <div className="flex gap-6">
+          <div className="flex items-center gap-2">
+            <div className="w-3 h-3 rounded-full bg-blue-500 animate-pulse"></div>
+            <span>Running: {stats.jobs.running}</span>
+          </div>
+          <div className="flex items-center gap-2">
+            <Clock className="w-4 h-4 text-yellow-500" />
+            <span>Pending: {stats.jobs.pending}</span>
+          </div>
+          <div className="flex items-center gap-2">
+            <CheckCircle className="w-4 h-4 text-green-500" />
+            <span>Completed: {stats.jobs.completed}</span>
+          </div>
+          <div className="flex items-center gap-2">
+            <XCircle className="w-4 h-4 text-red-500" />
+            <span>Failed: {stats.jobs.failed}</span>
+          </div>
+        </div>
+      </div>
+
+      {/* Charts */}
+      <div className="grid grid-cols-1 lg:grid-cols-2 gap-6">
+        {/* Source Chart */}
+        <div className="bg-white rounded-lg shadow p-6">
+          <h2 className="text-lg font-semibold mb-4">Images by Source</h2>
+          {sourceData.length > 0 ? (
+            <ResponsiveContainer width="100%" height={300}>
+              <BarChart data={sourceData}>
+                <XAxis dataKey="name" />
+                <YAxis />
+                <Tooltip />
+                <Bar dataKey="downloaded" fill="#22c55e" name="Downloaded" />
+                <Bar dataKey="pending" fill="#f59e0b" name="Pending" />
+                <Bar dataKey="rejected" fill="#ef4444" name="Rejected" />
+              </BarChart>
+            </ResponsiveContainer>
+          ) : (
+            <div className="h-[300px] flex items-center justify-center text-gray-400">
+              No data yet
+            </div>
+          )}
+        </div>
+
+        {/* License Chart */}
+        <div className="bg-white rounded-lg shadow p-6">
+          <h2 className="text-lg font-semibold mb-4">Images by License</h2>
+          {licenseData.length > 0 ? (
+            <ResponsiveContainer width="100%" height={300}>
+              <PieChart>
+                <Pie
+                  data={licenseData}
+                  dataKey="value"
+                  nameKey="name"
+                  cx="50%"
+                  cy="50%"
+                  outerRadius={100}
+                  label={({ name, percent }) =>
+                    `${name} (${(percent * 100).toFixed(0)}%)`
+                  }
+                >
+                  {licenseData.map((entry, index) => (
+                    <Cell key={index} fill={entry.color} />
+                  ))}
+                </Pie>
+                <Tooltip />
+              </PieChart>
+            </ResponsiveContainer>
+          ) : (
+            <div className="h-[300px] flex items-center justify-center text-gray-400">
+              No data yet
+            </div>
+          )}
+        </div>
+      </div>
+
+      {/* Species Tables */}
+      <div className="grid grid-cols-1 lg:grid-cols-2 gap-6">
+        {/* Top Species */}
+        <div className="bg-white rounded-lg shadow p-6">
+          <h2 className="text-lg font-semibold mb-4">Top Species</h2>
+          <table className="w-full">
+            <thead>
+              <tr className="text-left text-sm text-gray-500">
+                <th className="pb-2">Species</th>
+                <th className="pb-2 text-right">Images</th>
+              </tr>
+            </thead>
+            <tbody>
+              {stats.top_species.map((s) => (
+                <tr key={s.id} className="border-t">
+                  <td className="py-2">
+                    <div className="font-medium">{s.scientific_name}</div>
+                    {s.common_name && (
+                      <div className="text-sm text-gray-500">{s.common_name}</div>
+                    )}
+                  </td>
+                  <td className="py-2 text-right">{s.image_count}</td>
+                </tr>
+              ))}
+              {stats.top_species.length === 0 && (
+                <tr>
+                  <td colSpan={2} className="py-4 text-center text-gray-400">
+                    No species yet
+                  </td>
+                </tr>
+              )}
+            </tbody>
+          </table>
+        </div>
+
+        {/* Under-represented Species */}
+        <div className="bg-white rounded-lg shadow p-6">
+          <h2 className="text-lg font-semibold mb-4 flex items-center gap-2">
+            <AlertCircle className="w-5 h-5 text-yellow-500" />
+            Under-represented Species
+          </h2>
+          <p className="text-sm text-gray-500 mb-4">Species with fewer than 100 images</p>
+          <table className="w-full">
+            <thead>
+              <tr className="text-left text-sm text-gray-500">
+                <th className="pb-2">Species</th>
+                <th className="pb-2 text-right">Images</th>
+              </tr>
+            </thead>
+            <tbody>
+              {stats.under_represented.map((s) => (
+                <tr key={s.id} className="border-t">
+                  <td className="py-2">
+                    <div className="font-medium">{s.scientific_name}</div>
+                    {s.common_name && (
+                      <div className="text-sm text-gray-500">{s.common_name}</div>
+                    )}
+                  </td>
+                  <td className="py-2 text-right text-yellow-600">{s.image_count}</td>
+                </tr>
+              ))}
+              {stats.under_represented.length === 0 && (
+                <tr>
+                  <td colSpan={2} className="py-4 text-center text-gray-400">
+                    All species have 100+ images
+                  </td>
+                </tr>
+              )}
+            </tbody>
+          </table>
+        </div>
+      </div>
+    </div>
+  )
+}
@@ -0,0 +1,346 @@
+import { useState } from 'react'
+import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
+import {
+  Download,
+  Trash2,
+  CheckCircle,
+  Clock,
+  AlertCircle,
+  Package,
+} from 'lucide-react'
+import { exportsApi, imagesApi, Export as ExportType } from '../api/client'
+
+export default function Export() {
+  const queryClient = useQueryClient()
+  const [showCreateModal, setShowCreateModal] = useState(false)
+
+  const { data: exports, isLoading } = useQuery({
+    queryKey: ['exports'],
+    queryFn: () => exportsApi.list({ limit: 50 }).then((res) => res.data),
+    refetchInterval: 5000,
+  })
+
+  const deleteMutation = useMutation({
+    mutationFn: (id: number) => exportsApi.delete(id),
+    onSuccess: () => queryClient.invalidateQueries({ queryKey: ['exports'] }),
+  })
+
+  const getStatusIcon = (status: string) => {
+    switch (status) {
+      case 'generating':
+        return <Clock className="w-4 h-4 text-blue-500 animate-pulse" />
+      case 'completed':
+        return <CheckCircle className="w-4 h-4 text-green-500" />
+      case 'failed':
+        return <AlertCircle className="w-4 h-4 text-red-500" />
+      default:
+        return <Clock className="w-4 h-4 text-gray-400" />
+    }
+  }
+
+  const formatBytes = (bytes: number | null) => {
+    if (!bytes) return 'N/A'
+    if (bytes < 1024) return `${bytes} B`
+    if (bytes < 1024 * 1024) return `${(bytes / 1024).toFixed(1)} KB`
+    if (bytes < 1024 * 1024 * 1024) return `${(bytes / 1024 / 1024).toFixed(1)} MB`
+    return `${(bytes / 1024 / 1024 / 1024).toFixed(1)} GB`
+  }
+
+  return (
+    <div className="space-y-6">
+      <div className="flex items-center justify-between">
+        <h1 className="text-2xl font-bold">Export Dataset</h1>
+        <button
+          onClick={() => setShowCreateModal(true)}
+          className="flex items-center gap-2 px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700"
+        >
+          <Package className="w-4 h-4" />
+          Create Export
+        </button>
+      </div>
+
+      {/* Info Card */}
+      <div className="bg-blue-50 border border-blue-200 rounded-lg p-4">
+        <h3 className="font-medium text-blue-800">Export Format</h3>
+        <p className="text-sm text-blue-700 mt-1">
+          Exports are created in Create ML-compatible format with Training and Testing
+          folders. Each species has its own subfolder with images.
+        </p>
+      </div>
+
+      {/* Exports List */}
+      {isLoading ? (
+        <div className="flex items-center justify-center h-64">
+          <div className="animate-spin rounded-full h-8 w-8 border-b-2 border-green-600"></div>
+        </div>
+      ) : exports?.items.length === 0 ? (
+        <div className="bg-white rounded-lg shadow p-8 text-center text-gray-400">
+          <Package className="w-12 h-12 mx-auto mb-4" />
+          <p>No exports yet</p>
+          <p className="text-sm mt-2">
+            Create an export to download your dataset for CoreML training
+          </p>
+        </div>
+      ) : (
+        <div className="space-y-4">
+          {exports?.items.map((exp: ExportType) => (
+            <div
+              key={exp.id}
+              className="bg-white rounded-lg shadow p-6"
+            >
+              <div className="flex items-start justify-between">
+                <div className="flex-1">
+                  <div className="flex items-center gap-3">
+                    {getStatusIcon(exp.status)}
+                    <h3 className="font-semibold">{exp.name}</h3>
+                  </div>
+                  <div className="mt-2 grid grid-cols-4 gap-4 text-sm">
+                    <div>
+                      <span className="text-gray-500">Species:</span>{' '}
+                      {exp.species_count ?? 'N/A'}
+                    </div>
+                    <div>
+                      <span className="text-gray-500">Images:</span>{' '}
+                      {exp.image_count ?? 'N/A'}
+                    </div>
+                    <div>
+                      <span className="text-gray-500">Size:</span>{' '}
+                      {formatBytes(exp.file_size)}
+                    </div>
+                    <div>
+                      <span className="text-gray-500">Split:</span>{' '}
+                      {Math.round(exp.train_split * 100)}% / {Math.round((1 - exp.train_split) * 100)}%
+                    </div>
+                  </div>
+                  {exp.error_message && (
+                    <div className="mt-2 text-sm text-red-600">
+                      Error: {exp.error_message}
+                    </div>
+                  )}
+                  <div className="mt-2 text-xs text-gray-400">
+                    Created: {new Date(exp.created_at).toLocaleString()}
+                    {exp.completed_at && (
+                      <span className="ml-4">
+                        Completed: {new Date(exp.completed_at).toLocaleString()}
+                      </span>
+                    )}
+                  </div>
+                </div>
+                <div className="flex gap-2 ml-4">
+                  {exp.status === 'completed' && (
+                    <a
+                      href={exportsApi.download(exp.id)}
+                      className="flex items-center gap-2 px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700"
+                    >
+                      <Download className="w-4 h-4" />
+                      Download
+                    </a>
+                  )}
+                  <button
+                    onClick={() => deleteMutation.mutate(exp.id)}
+                    className="p-2 text-red-600 hover:bg-red-50 rounded"
+                    title="Delete"
+                  >
+                    <Trash2 className="w-5 h-5" />
+                  </button>
+                </div>
+              </div>
+            </div>
+          ))}
+        </div>
+      )}
+
+      {/* Create Modal */}
+      {showCreateModal && (
+        <CreateExportModal onClose={() => setShowCreateModal(false)} />
+      )}
+    </div>
+  )
+}
+
+function CreateExportModal({ onClose }: { onClose: () => void }) {
+  const queryClient = useQueryClient()
+  const [form, setForm] = useState({
+    name: `Export ${new Date().toLocaleDateString()}`,
+    min_images: 100,
+    train_split: 0.8,
+    licenses: [] as string[],
+    min_quality: undefined as number | undefined,
+  })
+
+  const { data: licenses } = useQuery({
+    queryKey: ['image-licenses'],
+    queryFn: () => imagesApi.licenses().then((res) => res.data),
+  })
+
+  const previewMutation = useMutation({
+    mutationFn: () =>
+      exportsApi.preview({
+        name: form.name,
+        filter_criteria: {
+          min_images_per_species: form.min_images,
+          licenses: form.licenses.length > 0 ? form.licenses : undefined,
+          min_quality: form.min_quality,
+        },
+        train_split: form.train_split,
+      }),
+  })
+
+  const createMutation = useMutation({
+    mutationFn: () =>
+      exportsApi.create({
+        name: form.name,
+        filter_criteria: {
+          min_images_per_species: form.min_images,
+          licenses: form.licenses.length > 0 ? form.licenses : undefined,
+          min_quality: form.min_quality,
+        },
+        train_split: form.train_split,
+      }),
+    onSuccess: () => {
+      queryClient.invalidateQueries({ queryKey: ['exports'] })
+      onClose()
+    },
+  })
+
+  const toggleLicense = (license: string) => {
+    setForm((f) => ({
+      ...f,
+      licenses: f.licenses.includes(license)
+        ? f.licenses.filter((l) => l !== license)
+        : [...f.licenses, license],
+    }))
+  }
+
+  return (
+    <div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50">
+      <div className="bg-white rounded-lg p-6 w-full max-w-lg">
+        <h2 className="text-xl font-bold mb-4">Create Export</h2>
+
+        <div className="space-y-4">
+          <div>
+            <label className="block text-sm font-medium mb-1">Export Name</label>
+            <input
+              type="text"
+              value={form.name}
+              onChange={(e) => setForm({ ...form, name: e.target.value })}
+              className="w-full px-3 py-2 border rounded-lg"
+            />
+          </div>
+
+          <div>
+            <label className="block text-sm font-medium mb-1">
+              Minimum Images per Species
+            </label>
+            <input
+              type="number"
+              value={form.min_images}
+              onChange={(e) =>
+                setForm({ ...form, min_images: parseInt(e.target.value, 10) || 0 })
+              }
+              className="w-full px-3 py-2 border rounded-lg"
+              min={1}
+            />
+            <p className="text-xs text-gray-500 mt-1">
+              Species with fewer images will be excluded
+            </p>
+          </div>
+
+          <div>
+            <label className="block text-sm font-medium mb-1">
+              Train/Test Split
+            </label>
+            <div className="flex items-center gap-4">
+              <input
+                type="range"
+                value={form.train_split}
+                onChange={(e) =>
+                  setForm({ ...form, train_split: parseFloat(e.target.value) })
+                }
+                min={0.5}
+                max={0.95}
+                step={0.05}
+                className="flex-1"
+              />
+              <span className="text-sm w-20 text-right">
+                {Math.round(form.train_split * 100)}% /{' '}
+                {Math.round((1 - form.train_split) * 100)}%
+              </span>
+            </div>
+          </div>
+
+          <div>
+            <label className="block text-sm font-medium mb-2">
+              Filter by License (optional)
+            </label>
+            <div className="flex flex-wrap gap-2">
+              {licenses?.map((license) => (
+                <button
+                  key={license}
+                  onClick={() => toggleLicense(license)}
+                  className={`px-3 py-1 rounded-full text-sm ${
+                    form.licenses.includes(license)
+                      ? 'bg-green-100 text-green-700 border-green-300'
+                      : 'bg-gray-100 text-gray-600'
+                  } border`}
+                >
+                  {license}
+                </button>
+              ))}
+            </div>
+            {form.licenses.length === 0 && (
+              <p className="text-xs text-gray-500 mt-1">
+                All licenses will be included
+              </p>
+            )}
+          </div>
+
+          {/* Preview */}
+          {previewMutation.data && (
+            <div className="bg-gray-50 rounded-lg p-4">
+              <h4 className="font-medium mb-2">Preview</h4>
+              <div className="grid grid-cols-3 gap-4 text-sm">
+                <div>
+                  <span className="text-gray-500">Species:</span>{' '}
+                  {previewMutation.data.data.species_count}
+                </div>
+                <div>
+                  <span className="text-gray-500">Images:</span>{' '}
+                  {previewMutation.data.data.image_count}
+                </div>
+                <div>
+                  <span className="text-gray-500">Est. Size:</span>{' '}
+                  {previewMutation.data.data.estimated_size_mb.toFixed(0)} MB
+                </div>
+              </div>
+            </div>
+          )}
+        </div>
+
+        <div className="flex justify-between mt-6">
+          <button
+            onClick={() => previewMutation.mutate()}
+            className="px-4 py-2 border rounded-lg hover:bg-gray-50"
+          >
+            Preview
+          </button>
+          <div className="flex gap-2">
+            <button
+              onClick={onClose}
+              className="px-4 py-2 border rounded-lg hover:bg-gray-50"
+            >
+              Cancel
+            </button>
+            <button
+              onClick={() => createMutation.mutate()}
+              disabled={!form.name}
+              className="px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700 disabled:opacity-50"
+            >
+              Create Export
+            </button>
+          </div>
+        </div>
+      </div>
+    </div>
+  )
+}
@@ -0,0 +1,331 @@
+import { useState } from 'react'
+import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
+import {
+  Search,
+  Filter,
+  Trash2,
+  ChevronLeft,
+  ChevronRight,
+  X,
+  ExternalLink,
+} from 'lucide-react'
+import { imagesApi } from '../api/client'
+
+export default function Images() {
+  const queryClient = useQueryClient()
+  const [page, setPage] = useState(1)
+  const [search, setSearch] = useState('')
+  const [filters, setFilters] = useState({
+    source: '',
+    license: '',
+    status: 'downloaded',
+    min_quality: undefined as number | undefined,
+  })
+  const [selectedIds, setSelectedIds] = useState<number[]>([])
+  const [selectedImage, setSelectedImage] = useState<number | null>(null)
+
+  const { data, isLoading } = useQuery({
+    queryKey: ['images', page, search, filters],
+    queryFn: () =>
+      imagesApi
+        .list({
+          page,
+          page_size: 48,
+          search: search || undefined,
+          source: filters.source || undefined,
+          license: filters.license || undefined,
+          status: filters.status || undefined,
+          min_quality: filters.min_quality,
+        })
+        .then((res) => res.data),
+  })
+
+  const { data: sources } = useQuery({
+    queryKey: ['image-sources'],
+    queryFn: () => imagesApi.sources().then((res) => res.data),
+  })
+
+  const { data: licenses } = useQuery({
+    queryKey: ['image-licenses'],
+    queryFn: () => imagesApi.licenses().then((res) => res.data),
+  })
+
+  const { data: imageDetail } = useQuery({
+    queryKey: ['image', selectedImage],
+    queryFn: () => imagesApi.get(selectedImage!).then((res) => res.data),
+    enabled: selectedImage !== null,
+  })
+
+  const deleteMutation = useMutation({
+    mutationFn: (id: number) => imagesApi.delete(id),
+    onSuccess: () => {
+      queryClient.invalidateQueries({ queryKey: ['images'] })
+      setSelectedImage(null)
+    },
+  })
+
+  const bulkDeleteMutation = useMutation({
+    mutationFn: (ids: number[]) => imagesApi.bulkDelete(ids),
+    onSuccess: () => {
+      queryClient.invalidateQueries({ queryKey: ['images'] })
+      setSelectedIds([])
+    },
+  })
+
+  const handleSelect = (id: number) => {
+    setSelectedIds((prev) =>
+      prev.includes(id) ? prev.filter((i) => i !== id) : [...prev, id]
+    )
+  }
+
+  return (
+    <div className="space-y-6">
+      <div className="flex items-center justify-between">
+        <h1 className="text-2xl font-bold">Images</h1>
+        {selectedIds.length > 0 && (
+          <button
+            onClick={() => bulkDeleteMutation.mutate(selectedIds)}
+            className="flex items-center gap-2 px-4 py-2 bg-red-600 text-white rounded-lg hover:bg-red-700"
+          >
+            <Trash2 className="w-4 h-4" />
+            Delete {selectedIds.length} {selectedIds.length === 1 ? 'image' : 'images'}
+          </button>
+        )}
+      </div>
+
+      {/* Filters */}
+      <div className="flex flex-wrap gap-4">
+        <div className="relative">
+          <Search className="absolute left-3 top-1/2 -translate-y-1/2 w-4 h-4 text-gray-400" />
+          <input
+            type="text"
+            placeholder="Search species..."
+            value={search}
+            onChange={(e) => {
+              setSearch(e.target.value)
+              setPage(1)
+            }}
+            className="pl-10 pr-4 py-2 border rounded-lg w-64"
+          />
+        </div>
+
+        <select
+          value={filters.source}
+          onChange={(e) => { setFilters({ ...filters, source: e.target.value }); setPage(1) }}
+          className="px-3 py-2 border rounded-lg"
+        >
+          <option value="">All Sources</option>
+          {sources?.map((s) => (
+            <option key={s} value={s}>
+              {s}
+            </option>
+          ))}
+        </select>
+
+        <select
+          value={filters.license}
+          onChange={(e) => { setFilters({ ...filters, license: e.target.value }); setPage(1) }}
+          className="px-3 py-2 border rounded-lg"
+        >
+          <option value="">All Licenses</option>
+          {licenses?.map((l) => (
+            <option key={l} value={l}>
+              {l}
+            </option>
+          ))}
+        </select>
+
+        <select
+          value={filters.status}
+          onChange={(e) => { setFilters({ ...filters, status: e.target.value }); setPage(1) }}
+          className="px-3 py-2 border rounded-lg"
+        >
+          <option value="">All Statuses</option>
+          <option value="downloaded">Downloaded</option>
+          <option value="pending">Pending</option>
+          <option value="rejected">Rejected</option>
+        </select>
+      </div>
+
+      {/* Image Grid */}
+      {isLoading ? (
+        <div className="flex items-center justify-center h-64">
+          <div className="animate-spin rounded-full h-8 w-8 border-b-2 border-green-600"></div>
+        </div>
+      ) : data?.items.length === 0 ? (
+        <div className="flex flex-col items-center justify-center h-64 text-gray-400">
+          <Filter className="w-12 h-12 mb-4" />
+          <p>No images found</p>
+        </div>
+      ) : (
+        <div className="grid grid-cols-2 sm:grid-cols-4 md:grid-cols-6 lg:grid-cols-8 gap-2">
+          {data?.items.map((image) => (
+            <div
+              key={image.id}
+              className={`relative aspect-square bg-gray-100 rounded-lg overflow-hidden cursor-pointer group ${
+                selectedIds.includes(image.id) ? 'ring-2 ring-green-500' : ''
+              }`}
+              onClick={() => setSelectedImage(image.id)}
+            >
+              {image.local_path ? (
+                <img
+                  src={`/api/images/${image.id}/file`}
+                  alt={image.species_name || ''}
+                  className="w-full h-full object-cover"
+                  loading="lazy"
+                />
+              ) : (
+                <div className="flex items-center justify-center h-full text-gray-400 text-xs">
+                  Pending
+                </div>
+              )}
+              <div className="absolute inset-0 bg-black/0 group-hover:bg-black/20 transition-colors" />
+              <div className="absolute top-1 left-1">
+                <input
+                  type="checkbox"
+                  checked={selectedIds.includes(image.id)}
+                  onClick={(e) => e.stopPropagation()}
+                  onChange={() => {
+                    handleSelect(image.id)
+                  }}
+                  className="rounded opacity-0 group-hover:opacity-100 checked:opacity-100"
+                />
+              </div>
+              <div className="absolute bottom-0 left-0 right-0 bg-gradient-to-t from-black/60 to-transparent p-1 opacity-0 group-hover:opacity-100 transition-opacity">
+                <p className="text-white text-xs truncate">
+                  {image.species_name}
+                </p>
+              </div>
+            </div>
+          ))}
+        </div>
+      )}
+
+      {/* Pagination */}
+      {data && data.pages > 1 && (
+        <div className="flex items-center justify-between">
+          <span className="text-sm text-gray-600">
+            {data.total} images
+          </span>
+          <div className="flex gap-2">
+            <button
+              onClick={() => setPage((p) => Math.max(1, p - 1))}
+              disabled={page === 1}
+              className="p-2 rounded border disabled:opacity-50"
+            >
+              <ChevronLeft className="w-4 h-4" />
+            </button>
+            <span className="px-4 py-2">
+              Page {page} of {data.pages}
+            </span>
+            <button
+              onClick={() => setPage((p) => Math.min(data.pages, p + 1))}
+              disabled={page === data.pages}
+              className="p-2 rounded border disabled:opacity-50"
+            >
+              <ChevronRight className="w-4 h-4" />
+            </button>
+          </div>
+        </div>
+      )}
+
+      {/* Image Detail Modal */}
+      {selectedImage !== null && imageDetail && (
+        <div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50 p-8">
+          <div className="bg-white rounded-lg w-full max-w-4xl max-h-full overflow-auto">
+            <div className="flex justify-between items-center p-4 border-b">
+              <h2 className="text-lg font-semibold">Image Details</h2>
+              <button
+                onClick={() => setSelectedImage(null)}
+                className="p-1 hover:bg-gray-100 rounded"
+              >
+                <X className="w-5 h-5" />
+              </button>
+            </div>
+            <div className="grid grid-cols-2 gap-6 p-6">
+              <div className="aspect-square bg-gray-100 rounded-lg overflow-hidden">
+                {imageDetail.local_path ? (
+                  <img
+                    src={`/api/images/${imageDetail.id}/file`}
+                    alt={imageDetail.species_name || ''}
+                    className="w-full h-full object-contain"
+                  />
+                ) : (
+                  <div className="flex items-center justify-center h-full text-gray-400">
+                    Not downloaded
+                  </div>
+                )}
+              </div>
+              <div className="space-y-4">
+                <div>
+                  <label className="text-sm text-gray-500">Species</label>
+                  <p className="font-medium">{imageDetail.species_name}</p>
+                </div>
+                <div>
+                  <label className="text-sm text-gray-500">Source</label>
+                  <p>{imageDetail.source}</p>
+                </div>
+                <div>
+                  <label className="text-sm text-gray-500">License</label>
+                  <p>{imageDetail.license}</p>
+                </div>
+                {imageDetail.attribution && (
+                  <div>
+                    <label className="text-sm text-gray-500">Attribution</label>
+                    <p className="text-sm">{imageDetail.attribution}</p>
+                  </div>
+                )}
+                <div className="grid grid-cols-2 gap-4">
+                  <div>
+                    <label className="text-sm text-gray-500">Dimensions</label>
+                    <p>
+                      {imageDetail.width || '?'} x {imageDetail.height || '?'}
+                    </p>
+                  </div>
+                  <div>
+                    <label className="text-sm text-gray-500">Quality Score</label>
+                    <p>{imageDetail.quality_score?.toFixed(1) || 'N/A'}</p>
+                  </div>
+                </div>
+                <div>
+                  <label className="text-sm text-gray-500">Status</label>
+                  <p>
+                    <span
+                      className={`inline-block px-2 py-1 rounded text-sm ${
+                        imageDetail.status === 'downloaded'
+                          ? 'bg-green-100 text-green-700'
+                          : imageDetail.status === 'pending'
+                          ? 'bg-yellow-100 text-yellow-700'
+                          : 'bg-red-100 text-red-700'
+                      }`}
+                    >
+                      {imageDetail.status}
+                    </span>
+                  </p>
+                </div>
+                <div className="flex gap-2 pt-4">
+                  <a
+                    href={imageDetail.url}
+                    target="_blank"
+                    rel="noopener noreferrer"
+                    className="flex items-center gap-2 px-4 py-2 border rounded-lg hover:bg-gray-50"
+                  >
+                    <ExternalLink className="w-4 h-4" />
+                    View Original
+                  </a>
+                  <button
+                    onClick={() => deleteMutation.mutate(imageDetail.id)}
+                    className="flex items-center gap-2 px-4 py-2 bg-red-600 text-white rounded-lg hover:bg-red-700"
+                  >
+                    <Trash2 className="w-4 h-4" />
+                    Delete
+                  </button>
+                </div>
+              </div>
+            </div>
+          </div>
+        </div>
+      )}
+    </div>
+  )
+}
@@ -0,0 +1,354 @@
+import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
+import {
+  Play,
+  Pause,
+  XCircle,
+  CheckCircle,
+  Clock,
+  AlertCircle,
+  RefreshCw,
+  Leaf,
+  Download,
+  XOctagon,
+} from 'lucide-react'
+import { jobsApi, Job } from '../api/client'
+
+export default function Jobs() {
+  const queryClient = useQueryClient()
+
+  const { data, isLoading, refetch } = useQuery({
+    queryKey: ['jobs'],
+    queryFn: () => jobsApi.list({ limit: 100 }).then((res) => res.data),
+    refetchInterval: 1000, // Faster refresh for live updates
+  })
+
+  const pauseMutation = useMutation({
+    mutationFn: (id: number) => jobsApi.pause(id),
+    onSuccess: () => queryClient.invalidateQueries({ queryKey: ['jobs'] }),
+  })
+
+  const resumeMutation = useMutation({
+    mutationFn: (id: number) => jobsApi.resume(id),
+    onSuccess: () => queryClient.invalidateQueries({ queryKey: ['jobs'] }),
+  })
+
+  const cancelMutation = useMutation({
+    mutationFn: (id: number) => jobsApi.cancel(id),
+    onSuccess: () => queryClient.invalidateQueries({ queryKey: ['jobs'] }),
+  })
+
+  const getStatusIcon = (status: string) => {
+    switch (status) {
+      case 'running':
+        return <RefreshCw className="w-4 h-4 text-blue-500 animate-spin" />
+      case 'pending':
+        return <Clock className="w-4 h-4 text-yellow-500" />
+      case 'paused':
+        return <Pause className="w-4 h-4 text-gray-500" />
+      case 'completed':
+        return <CheckCircle className="w-4 h-4 text-green-500" />
+      case 'failed':
+        return <AlertCircle className="w-4 h-4 text-red-500" />
+      default:
+        return null
+    }
+  }
+
+  const getStatusClass = (status: string) => {
+    switch (status) {
+      case 'running':
+        return 'bg-blue-100 text-blue-700'
+      case 'pending':
+        return 'bg-yellow-100 text-yellow-700'
+      case 'paused':
+        return 'bg-gray-100 text-gray-700'
+      case 'completed':
+        return 'bg-green-100 text-green-700'
+      case 'failed':
+        return 'bg-red-100 text-red-700'
+      default:
+        return 'bg-gray-100 text-gray-700'
+    }
+  }
+
+  // Separate running jobs from others
+  const runningJobs = data?.items.filter((j) => j.status === 'running') || []
+  const otherJobs = data?.items.filter((j) => j.status !== 'running') || []
+
+  return (
+    <div className="space-y-6">
+      <div className="flex items-center justify-between">
+        <h1 className="text-2xl font-bold">Jobs</h1>
+        <button
+          onClick={() => refetch()}
+          className="flex items-center gap-2 px-4 py-2 border rounded-lg hover:bg-gray-50"
+        >
+          <RefreshCw className="w-4 h-4" />
+          Refresh
+        </button>
+      </div>
+
+      {isLoading ? (
+        <div className="flex items-center justify-center h-64">
+          <div className="animate-spin rounded-full h-8 w-8 border-b-2 border-green-600"></div>
+        </div>
+      ) : data?.items.length === 0 ? (
+        <div className="bg-white rounded-lg shadow p-8 text-center text-gray-400">
+          <Clock className="w-12 h-12 mx-auto mb-4" />
+          <p>No jobs yet</p>
+          <p className="text-sm mt-2">
+            Select species and start a scrape job to get started
+          </p>
+        </div>
+      ) : (
+        <div className="space-y-6">
+          {/* Running Jobs - More prominent display */}
+          {runningJobs.length > 0 && (
+            <div className="space-y-4">
+              <h2 className="text-lg font-semibold flex items-center gap-2">
+                <RefreshCw className="w-5 h-5 animate-spin text-blue-500" />
+                Active Jobs ({runningJobs.length})
+              </h2>
+              {runningJobs.map((job) => (
+                <RunningJobCard
+                  key={job.id}
+                  job={job}
+                  onPause={() => pauseMutation.mutate(job.id)}
+                  onCancel={() => cancelMutation.mutate(job.id)}
+                />
+              ))}
+            </div>
+          )}
+
+          {/* Other Jobs */}
+          {otherJobs.length > 0 && (
+            <div className="space-y-4">
+              {runningJobs.length > 0 && (
+                <h2 className="text-lg font-semibold text-gray-600">Other Jobs</h2>
+              )}
+              {otherJobs.map((job) => (
+                <div
+                  key={job.id}
+                  className="bg-white rounded-lg shadow p-6"
+                >
+                  <div className="flex items-start justify-between">
+                    <div className="flex-1">
+                      <div className="flex items-center gap-3">
+                        {getStatusIcon(job.status)}
+                        <h3 className="font-semibold">{job.name}</h3>
+                        <span
+                          className={`px-2 py-0.5 rounded text-xs ${getStatusClass(
+                            job.status
+                          )}`}
+                        >
+                          {job.status}
+                        </span>
+                      </div>
+                      <div className="mt-2 text-sm text-gray-600">
+                        <span className="mr-4">Source: {job.source}</span>
+                        <span className="mr-4">
+                          Downloaded: {job.images_downloaded}
+                        </span>
+                        <span>Rejected: {job.images_rejected}</span>
+                      </div>
+
+                      {/* Progress bar for paused jobs */}
+                      {job.status === 'paused' && job.progress_total > 0 && (
+                        <div className="mt-4">
+                          <div className="flex justify-between text-sm text-gray-600 mb-1">
+                            <span>
+                              {job.progress_current} / {job.progress_total} species
+                            </span>
+                            <span>
+                              {Math.round(
+                                (job.progress_current / job.progress_total) * 100
+                              )}
+                              %
+                            </span>
+                          </div>
+                          <div className="h-2 bg-gray-200 rounded-full overflow-hidden">
+                            <div
+                              className="h-full rounded-full bg-gray-400"
+                              style={{
+                                width: `${
+                                  (job.progress_current / job.progress_total) * 100
+                                }%`,
+                              }}
+                            />
+                          </div>
+                        </div>
+                      )}
+
+                      {job.error_message && (
+                        <div className="mt-2 text-sm text-red-600">
+                          Error: {job.error_message}
+                        </div>
+                      )}
+
+                      <div className="mt-2 text-xs text-gray-400">
+                        {job.started_at && (
+                          <span className="mr-4">
+                            Started: {new Date(job.started_at).toLocaleString()}
+                          </span>
+                        )}
+                        {job.completed_at && (
+                          <span>
+                            Completed: {new Date(job.completed_at).toLocaleString()}
+                          </span>
+                        )}
+                      </div>
+                    </div>
+
+                    {/* Actions */}
+                    <div className="flex gap-2 ml-4">
+                      {job.status === 'paused' && (
+                        <button
+                          onClick={() => resumeMutation.mutate(job.id)}
+                          className="p-2 text-blue-600 hover:bg-blue-50 rounded"
+                          title="Resume"
+                        >
+                          <Play className="w-5 h-5" />
+                        </button>
+                      )}
+                      {(job.status === 'paused' || job.status === 'pending') && (
+                        <button
+                          onClick={() => cancelMutation.mutate(job.id)}
+                          className="p-2 text-red-600 hover:bg-red-50 rounded"
+                          title="Cancel"
+                        >
+                          <XCircle className="w-5 h-5" />
+                        </button>
+                      )}
+                    </div>
+                  </div>
+                </div>
+              ))}
+            </div>
+          )}
+        </div>
+      )}
+    </div>
+  )
+}
+
+function RunningJobCard({
+  job,
+  onPause,
+  onCancel,
+}: {
+  job: Job
+  onPause: () => void
+  onCancel: () => void
+}) {
+  // Fetch real-time progress for this job
+  const { data: progress } = useQuery({
+    queryKey: ['job-progress', job.id],
+    queryFn: () => jobsApi.progress(job.id).then((res) => res.data),
+    refetchInterval: 500, // Very fast updates for live feel
+    enabled: job.status === 'running',
+  })
+
+  const currentSpecies = progress?.current_species || ''
+  const progressCurrent = progress?.progress_current ?? job.progress_current
+  const progressTotal = progress?.progress_total ?? job.progress_total
+  const percentage = progressTotal > 0 ? Math.round((progressCurrent / progressTotal) * 100) : 0
+
+  return (
+    <div className="bg-gradient-to-r from-blue-50 to-white rounded-lg shadow-lg border-2 border-blue-200 p-6">
+      <div className="flex items-start justify-between">
+        <div className="flex-1">
+          <div className="flex items-center gap-3">
+            <RefreshCw className="w-5 h-5 text-blue-500 animate-spin" />
+            <h3 className="font-semibold text-lg">{job.name}</h3>
+            <span className="px-2 py-0.5 rounded text-xs bg-blue-100 text-blue-700 animate-pulse">
+              running
+            </span>
+          </div>
+
+          {/* Live Stats */}
+          <div className="mt-4 grid grid-cols-3 gap-4">
+            <div className="bg-white rounded-lg p-3 border">
+              <div className="flex items-center gap-2 text-gray-500 text-sm">
+                <Leaf className="w-4 h-4" />
+                Species Progress
+              </div>
+              <div className="text-2xl font-bold text-blue-600 mt-1">
+                {progressCurrent} / {progressTotal}
+              </div>
+            </div>
+            <div className="bg-white rounded-lg p-3 border">
+              <div className="flex items-center gap-2 text-gray-500 text-sm">
+                <Download className="w-4 h-4" />
+                Downloaded
+              </div>
+              <div className="text-2xl font-bold text-green-600 mt-1">
+                {job.images_downloaded}
+              </div>
+            </div>
+            <div className="bg-white rounded-lg p-3 border">
+              <div className="flex items-center gap-2 text-gray-500 text-sm">
+                <XOctagon className="w-4 h-4" />
+                Rejected
+              </div>
+              <div className="text-2xl font-bold text-red-600 mt-1">
+                {job.images_rejected}
+              </div>
+            </div>
+          </div>
+
+          {/* Current Species */}
+          {currentSpecies && (
+            <div className="mt-4 bg-white rounded-lg p-3 border">
+              <div className="text-sm text-gray-500 mb-1">Currently scraping:</div>
+              <div className="flex items-center gap-2">
+                <span className="relative flex h-3 w-3">
+                  <span className="animate-ping absolute inline-flex h-full w-full rounded-full bg-blue-400 opacity-75"></span>
+                  <span className="relative inline-flex rounded-full h-3 w-3 bg-blue-500"></span>
+                </span>
+                <span className="font-medium text-blue-800 italic">{currentSpecies}</span>
+              </div>
+            </div>
+          )}
+
+          {/* Progress bar */}
+          {progressTotal > 0 && (
+            <div className="mt-4">
+              <div className="flex justify-between text-sm text-gray-600 mb-1">
+                <span>Progress</span>
+                <span className="font-medium">{percentage}%</span>
+              </div>
+              <div className="h-3 bg-gray-200 rounded-full overflow-hidden">
+                <div
+                  className="h-full rounded-full bg-gradient-to-r from-blue-500 to-blue-600 transition-all duration-500"
+                  style={{ width: `${percentage}%` }}
+                />
+              </div>
+            </div>
+          )}
+
+          <div className="mt-3 text-xs text-gray-400">
+            Source: {job.source} • Started: {job.started_at ? new Date(job.started_at).toLocaleString() : 'N/A'}
+          </div>
+        </div>
+
+        {/* Actions */}
+        <div className="flex gap-2 ml-4">
+          <button
+            onClick={onPause}
+            className="p-2 text-gray-600 hover:bg-gray-100 rounded"
+            title="Pause"
+          >
+            <Pause className="w-5 h-5" />
+          </button>
+          <button
+            onClick={onCancel}
+            className="p-2 text-red-600 hover:bg-red-50 rounded"
+            title="Cancel"
+          >
+            <XCircle className="w-5 h-5" />
+          </button>
+        </div>
+      </div>
+    </div>
+  )
+}
@@ -0,0 +1,543 @@
+import { useState } from 'react'
+import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
+import {
+  Key,
+  CheckCircle,
+  XCircle,
+  Eye,
+  EyeOff,
+  RefreshCw,
+  FolderInput,
+  AlertTriangle,
+} from 'lucide-react'
+import { sourcesApi, imagesApi, SourceConfig, ImportScanResult } from '../api/client'
+
+export default function Settings() {
+  const [editingSource, setEditingSource] = useState<string | null>(null)
+
+  const { data: sources, isLoading, error } = useQuery({
+    queryKey: ['sources'],
+    queryFn: () => sourcesApi.list().then((res) => res.data),
+  })
+
+  return (
+    <div className="space-y-6">
+      <h1 className="text-2xl font-bold">Settings</h1>
+
+      {/* API Keys Section */}
+      <div className="bg-white rounded-lg shadow">
+        <div className="px-6 py-4 border-b">
+          <h2 className="text-lg font-semibold flex items-center gap-2">
+            <Key className="w-5 h-5" />
+            API Keys
+          </h2>
+          <p className="text-sm text-gray-500 mt-1">
+            Configure API keys for each data source
+          </p>
+        </div>
+
+        {isLoading ? (
+          <div className="p-6 text-center">
+            <RefreshCw className="w-6 h-6 animate-spin mx-auto text-gray-400" />
+          </div>
+        ) : error ? (
+          <div className="p-6 text-center text-red-600">
+            Error loading sources: {(error as Error).message}
+          </div>
+        ) : !sources || sources.length === 0 ? (
+          <div className="p-6 text-center text-gray-500">
+            No sources available
+          </div>
+        ) : (
+          <div className="divide-y">
+            {sources.map((source) => (
+              <SourceRow
+                key={source.name}
+                source={source}
+                isEditing={editingSource === source.name}
+                onEdit={() => setEditingSource(source.name)}
+                onClose={() => setEditingSource(null)}
+              />
+            ))}
+          </div>
+        )}
+      </div>
+
+      {/* Import Scanner Section */}
+      <ImportScanner />
+
+      {/* Rate Limits Info */}
+      <div className="bg-yellow-50 border border-yellow-200 rounded-lg p-4">
+        <h3 className="font-medium text-yellow-800">Rate Limits (recommended settings)</h3>
+        <ul className="text-sm text-yellow-700 mt-2 space-y-1 list-disc list-inside">
+          <li>GBIF: 1 req/sec safe (free, no authentication required)</li>
+          <li>iNaturalist: 1 req/sec max (60/min limit), 10k/day, 5GB/hr media</li>
+          <li>Flickr: 0.5 req/sec recommended (3600/hr limit shared across all users)</li>
+          <li>Wikimedia: 1 req/sec safe (requires OAuth credentials)</li>
+          <li>Trefle: 1 req/sec safe (120/min limit)</li>
+        </ul>
+      </div>
+    </div>
+  )
+}
+
+function SourceRow({
+  source,
+  isEditing,
+  onEdit,
+  onClose,
+}: {
+  source: SourceConfig
+  isEditing: boolean
+  onEdit: () => void
+  onClose: () => void
+}) {
+  const queryClient = useQueryClient()
+  const [showKey, setShowKey] = useState(false)
+  const [form, setForm] = useState({
+    api_key: '',
+    api_secret: '',
+    access_token: '',
+    rate_limit_per_sec: source.configured ? source.rate_limit_per_sec : (source.default_rate || 1.0),
+    enabled: source.enabled,
+  })
+
+  // Get field labels based on auth type
+  const isNoAuth = source.auth_type === 'none'
+  const isOAuth = source.auth_type === 'oauth'
+  const keyLabel = isOAuth ? 'Client ID' : 'API Key'
+  const secretLabel = isOAuth ? 'Client Secret' : 'API Secret'
+  const [testResult, setTestResult] = useState<{
+    status: 'success' | 'error'
+    message: string
+  } | null>(null)
+
+  const updateMutation = useMutation({
+    mutationFn: () =>
+      sourcesApi.update(source.name, {
+        api_key: isNoAuth ? undefined : form.api_key || undefined,
+        api_secret: form.api_secret || undefined,
+        access_token: form.access_token || undefined,
+        rate_limit_per_sec: form.rate_limit_per_sec,
+        enabled: form.enabled,
+      }),
+    onSuccess: () => {
+      queryClient.invalidateQueries({ queryKey: ['sources'] })
+      onClose()
+    },
+  })
+
+  const testMutation = useMutation({
+    mutationFn: () => sourcesApi.test(source.name),
+    onSuccess: (res) => {
+      setTestResult({ status: res.data.status, message: res.data.message })
+    },
+    onError: (err: any) => {
+      setTestResult({
+        status: 'error',
+        message: err.response?.data?.message || 'Connection failed',
+      })
+    },
+  })
+
+  if (isEditing) {
+    return (
+      <div className="p-6 bg-gray-50">
+        <div className="flex items-center justify-between mb-4">
+          <h3 className="font-medium">{source.label}</h3>
+          <button
+            onClick={onClose}
+            className="text-gray-500 hover:text-gray-700"
+          >
+            Cancel
+          </button>
+        </div>
+
+        <div className="space-y-4">
+          {isNoAuth ? (
+            <div className="bg-green-50 border border-green-200 rounded-lg p-3 text-green-700 text-sm">
+              This source doesn't require authentication. Just enable it to start scraping.
+            </div>
+          ) : (
+            <>
+              <div>
+                <label className="block text-sm font-medium mb-1">{keyLabel}</label>
+                <div className="relative">
+                  <input
+                    type={showKey ? 'text' : 'password'}
+                    value={form.api_key}
+                    onChange={(e) => setForm({ ...form, api_key: e.target.value })}
+                    placeholder={source.api_key_masked || `Enter ${keyLabel}`}
+                    className="w-full px-3 py-2 border rounded-lg pr-10"
+                  />
+                  <button
+                    type="button"
+                    onClick={() => setShowKey(!showKey)}
+                    className="absolute right-2 top-1/2 -translate-y-1/2 text-gray-400"
+                  >
+                    {showKey ? (
+                      <EyeOff className="w-4 h-4" />
+                    ) : (
+                      <Eye className="w-4 h-4" />
+                    )}
+                  </button>
+                </div>
+              </div>
+
+              {source.requires_secret && (
+                <div>
+                  <label className="block text-sm font-medium mb-1">
+                    {secretLabel}
+                  </label>
+                  <input
+                    type="password"
+                    value={form.api_secret}
+                    onChange={(e) =>
+                      setForm({ ...form, api_secret: e.target.value })
+                    }
+                    placeholder={source.has_secret ? '••••••••' : `Enter ${secretLabel}`}
+                    className="w-full px-3 py-2 border rounded-lg"
+                  />
+                </div>
+              )}
+
+              {isOAuth && (
+                <div>
+                  <label className="block text-sm font-medium mb-1">
+                    Access Token
+                  </label>
+                  <input
+                    type="password"
+                    value={form.access_token}
+                    onChange={(e) =>
+                      setForm({ ...form, access_token: e.target.value })
+                    }
+                    placeholder={source.has_access_token ? '••••••••' : 'Enter Access Token'}
+                    className="w-full px-3 py-2 border rounded-lg"
+                  />
+                </div>
+              )}
+            </>
+          )}
+
+          <div>
+            <label className="block text-sm font-medium mb-1">
+              Rate Limit (requests/sec)
+            </label>
+            <input
+              type="number"
+              value={form.rate_limit_per_sec}
+              onChange={(e) =>
+                setForm({
+                  ...form,
+                  rate_limit_per_sec: parseFloat(e.target.value) || 1,
+                })
+              }
+              className="w-full px-3 py-2 border rounded-lg"
+              min={0.1}
+              max={10}
+              step={0.1}
+            />
+          </div>
+
+          <div className="flex items-center gap-2">
+            <input
+              type="checkbox"
+              id="enabled"
+              checked={form.enabled}
+              onChange={(e) => setForm({ ...form, enabled: e.target.checked })}
+              className="rounded"
+            />
+            <label htmlFor="enabled" className="text-sm">
+              Enable this source
+            </label>
+          </div>
+
+          {testResult && (
+            <div
+              className={`p-3 rounded-lg ${
+                testResult.status === 'success'
+                  ? 'bg-green-50 text-green-700'
+                  : 'bg-red-50 text-red-700'
+              }`}
+            >
+              {testResult.message}
+            </div>
+          )}
+
+          <div className="flex justify-between">
+            {source.configured && (
+              <button
+                onClick={() => testMutation.mutate()}
+                disabled={testMutation.isPending}
+                className="px-4 py-2 border rounded-lg hover:bg-white"
+              >
+                {testMutation.isPending ? 'Testing...' : 'Test Connection'}
+              </button>
+            )}
+            <button
+              onClick={() => updateMutation.mutate()}
+              disabled={!isNoAuth && !form.api_key && !source.configured}
+              className="px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700 disabled:opacity-50 ml-auto"
+            >
+              Save
+            </button>
+          </div>
+        </div>
+      </div>
+    )
+  }
+
+  const isNoAuthRow = source.auth_type === 'none'
+
+  return (
+    <div className="px-6 py-4 flex items-center justify-between">
+      <div className="flex items-center gap-4">
+        <div
+          className={`w-2 h-2 rounded-full ${
+            (isNoAuthRow || source.configured) && source.enabled
+              ? 'bg-green-500'
+              : source.configured
+              ? 'bg-yellow-500'
+              : 'bg-gray-300'
+          }`}
+        />
+        <div>
+          <h3 className="font-medium">{source.label}</h3>
+          <p className="text-sm text-gray-500">
+            {isNoAuthRow
+              ? 'No authentication required'
+              : source.configured
+              ? `Key: ${source.api_key_masked}`
+              : 'Not configured'}
+          </p>
+        </div>
+      </div>
+      <div className="flex items-center gap-4">
+        {(isNoAuthRow || source.configured) && (
+          <span
+            className={`flex items-center gap-1 text-sm ${
+              source.enabled ? 'text-green-600' : 'text-gray-400'
+            }`}
+          >
+            {source.enabled ? (
+              <>
+                <CheckCircle className="w-4 h-4" />
+                Enabled
+              </>
+            ) : (
+              <>
+                <XCircle className="w-4 h-4" />
+                Disabled
+              </>
+            )}
+          </span>
+        )}
+        <button
+          onClick={onEdit}
+          className="px-3 py-1 text-sm border rounded hover:bg-gray-50"
+        >
+          {isNoAuthRow || source.configured ? 'Edit' : 'Configure'}
+        </button>
+      </div>
+    </div>
+  )
+}
+
+function ImportScanner() {
+  const [scanResult, setScanResult] = useState<ImportScanResult | null>(null)
+  const [moveFiles, setMoveFiles] = useState(false)
+  const [importResult, setImportResult] = useState<{
+    imported: number
+    skipped: number
+    errors: string[]
+  } | null>(null)
+
+  const scanMutation = useMutation({
+    mutationFn: () => imagesApi.scanImports().then((res) => res.data),
+    onSuccess: (data) => {
+      setScanResult(data)
+      setImportResult(null)
+    },
+  })
+
+  const importMutation = useMutation({
+    mutationFn: () => imagesApi.runImport(moveFiles).then((res) => res.data),
+    onSuccess: (data) => {
+      setImportResult(data)
+      setScanResult(null)
+    },
+  })
+
+  return (
+    <div className="bg-white rounded-lg shadow">
+      <div className="px-6 py-4 border-b">
+        <h2 className="text-lg font-semibold flex items-center gap-2">
+          <FolderInput className="w-5 h-5" />
+          Import Images
+        </h2>
+        <p className="text-sm text-gray-500 mt-1">
+          Bulk import images from the imports folder
+        </p>
+      </div>
+
+      <div className="p-6 space-y-4">
+        <div className="bg-gray-50 rounded-lg p-4">
+          <h3 className="font-medium text-sm mb-2">Expected folder structure:</h3>
+          <code className="text-sm text-gray-600 block">
+            imports/{'{source}'}/{'{species_name}'}/*.jpg
+          </code>
+          <p className="text-sm text-gray-500 mt-2">
+            Example: imports/inaturalist/Monstera_deliciosa/image1.jpg
+          </p>
+        </div>
+
+        <div className="flex items-center gap-4">
+          <button
+            onClick={() => scanMutation.mutate()}
+            disabled={scanMutation.isPending}
+            className="px-4 py-2 bg-blue-600 text-white rounded-lg hover:bg-blue-700 disabled:opacity-50 flex items-center gap-2"
+          >
+            {scanMutation.isPending ? (
+              <>
+                <RefreshCw className="w-4 h-4 animate-spin" />
+                Scanning...
+              </>
+            ) : (
+              'Scan Imports Folder'
+            )}
+          </button>
+        </div>
+
+        {scanMutation.isError && (
+          <div className="bg-red-50 border border-red-200 rounded-lg p-4 text-red-700">
+            Error scanning: {(scanMutation.error as Error).message}
+          </div>
+        )}
+
+        {scanResult && (
+          <div className="space-y-4">
+            {!scanResult.available ? (
+              <div className="bg-yellow-50 border border-yellow-200 rounded-lg p-4">
+                <p className="text-yellow-700">{scanResult.message}</p>
+              </div>
+            ) : scanResult.total_images === 0 ? (
+              <div className="bg-gray-50 border border-gray-200 rounded-lg p-4">
+                <p className="text-gray-600">No images found in the imports folder.</p>
+              </div>
+            ) : (
+              <>
+                <div className="bg-green-50 border border-green-200 rounded-lg p-4">
+                  <h3 className="font-medium text-green-800 mb-2">Scan Results</h3>
+                  <div className="grid grid-cols-2 gap-4 text-sm">
+                    <div>
+                      <span className="text-gray-600">Total Images:</span>
+                      <span className="ml-2 font-medium">{scanResult.total_images}</span>
+                    </div>
+                    <div>
+                      <span className="text-gray-600">Matched Species:</span>
+                      <span className="ml-2 font-medium">{scanResult.matched_species}</span>
+                    </div>
+                  </div>
+
+                  {scanResult.sources.length > 0 && (
+                    <div className="mt-4">
+                      <h4 className="text-sm font-medium text-green-800 mb-2">Sources Found:</h4>
+                      <div className="space-y-1">
+                        {scanResult.sources.map((source) => (
+                          <div key={source.name} className="text-sm flex justify-between">
+                            <span>{source.name}</span>
+                            <span className="text-gray-600">
+                              {source.species_count} species, {source.image_count} images
+                            </span>
+                          </div>
+                        ))}
+                      </div>
+                    </div>
+                  )}
+                </div>
+
+                {scanResult.unmatched_species.length > 0 && (
+                  <div className="bg-yellow-50 border border-yellow-200 rounded-lg p-4">
+                    <h3 className="font-medium text-yellow-800 flex items-center gap-2 mb-2">
+                      <AlertTriangle className="w-4 h-4" />
+                      Unmatched Species ({scanResult.unmatched_species.length})
+                    </h3>
+                    <p className="text-sm text-yellow-700 mb-2">
+                      These species folders don't match any species in the database and will be skipped:
+                    </p>
+                    <div className="text-sm text-yellow-600 max-h-32 overflow-y-auto">
+                      {scanResult.unmatched_species.slice(0, 20).map((name) => (
+                        <div key={name}>{name}</div>
+                      ))}
+                      {scanResult.unmatched_species.length > 20 && (
+                        <div className="text-yellow-500 mt-1">
+                          ...and {scanResult.unmatched_species.length - 20} more
+                        </div>
+                      )}
+                    </div>
+                  </div>
+                )}
+
+                <div className="border-t pt-4">
+                  <div className="flex items-center gap-4 mb-4">
+                    <label className="flex items-center gap-2 text-sm">
+                      <input
+                        type="checkbox"
+                        checked={moveFiles}
+                        onChange={(e) => setMoveFiles(e.target.checked)}
+                        className="rounded"
+                      />
+                      Move files instead of copying (removes originals)
+                    </label>
+                  </div>
+
+                  <button
+                    onClick={() => importMutation.mutate()}
+                    disabled={importMutation.isPending || scanResult.matched_species === 0}
+                    className="px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700 disabled:opacity-50 flex items-center gap-2"
+                  >
+                    {importMutation.isPending ? (
+                      <>
+                        <RefreshCw className="w-4 h-4 animate-spin" />
+                        Importing...
+                      </>
+                    ) : (
+                      `Import ${scanResult.total_images} Images`
+                    )}
+                  </button>
+                </div>
+              </>
+            )}
+          </div>
+        )}
+
+        {importResult && (
+          <div className="bg-green-50 border border-green-200 rounded-lg p-4">
+            <h3 className="font-medium text-green-800 mb-2">Import Complete</h3>
+            <div className="text-sm space-y-1">
+              <div>
+                <span className="text-gray-600">Imported:</span>
+                <span className="ml-2 font-medium text-green-700">{importResult.imported}</span>
+              </div>
+              <div>
+                <span className="text-gray-600">Skipped (already exists):</span>
+                <span className="ml-2 font-medium">{importResult.skipped}</span>
+              </div>
+              {importResult.errors.length > 0 && (
+                <div className="mt-2">
+                  <span className="text-red-600">Errors ({importResult.errors.length}):</span>
+                  <div className="text-red-500 mt-1 max-h-24 overflow-y-auto">
+                    {importResult.errors.map((err, i) => (
+                      <div key={i} className="text-xs">{err}</div>
+                    ))}
+                  </div>
+                </div>
+              )}
+            </div>
+          </div>
+        )}
+      </div>
+    </div>
+  )
+}
@@ -0,0 +1,997 @@
+import { useState, useRef } from 'react'
+import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
+import {
+  Plus,
+  Upload,
+  Search,
+  Trash2,
+  Play,
+  ChevronLeft,
+  ChevronRight,
+  Filter,
+  X,
+  Image as ImageIcon,
+  ExternalLink,
+} from 'lucide-react'
+import { speciesApi, jobsApi, imagesApi, Species as SpeciesType } from '../api/client'
+
+export default function Species() {
+  const queryClient = useQueryClient()
+  const csvInputRef = useRef<HTMLInputElement>(null)
+  const jsonInputRef = useRef<HTMLInputElement>(null)
+
+  const [page, setPage] = useState(1)
+  const [search, setSearch] = useState('')
+  const [genus, setGenus] = useState<string>('')
+  const [hasImages, setHasImages] = useState<string>('')
+  const [maxImages, setMaxImages] = useState<string>('')
+  const [selectedIds, setSelectedIds] = useState<number[]>([])
+  const [showAddModal, setShowAddModal] = useState(false)
+  const [showScrapeModal, setShowScrapeModal] = useState(false)
+  const [showScrapeAllModal, setShowScrapeAllModal] = useState(false)
+  const [showScrapeFilteredModal, setShowScrapeFilteredModal] = useState(false)
+  const [viewSpecies, setViewSpecies] = useState<SpeciesType | null>(null)
+
+  const { data: genera } = useQuery({
+    queryKey: ['genera'],
+    queryFn: () => speciesApi.genera().then((res) => res.data),
+  })
+
+  const { data, isLoading } = useQuery({
+    queryKey: ['species', page, search, genus, hasImages, maxImages],
+    queryFn: () =>
+      speciesApi.list({
+        page,
+        page_size: 50,
+        search: search || undefined,
+        genus: genus || undefined,
+        has_images: hasImages === '' ? undefined : hasImages === 'true',
+        max_images: maxImages ? parseInt(maxImages, 10) : undefined,
+      }).then((res) => res.data),
+  })
+
+  const importCsvMutation = useMutation({
+    mutationFn: (file: File) => speciesApi.import(file),
+    onSuccess: (res) => {
+      queryClient.invalidateQueries({ queryKey: ['species'] })
+      queryClient.invalidateQueries({ queryKey: ['genera'] })
+      alert(`Imported ${res.data.imported} species, skipped ${res.data.skipped}`)
+    },
+  })
+
+  const importJsonMutation = useMutation({
+    mutationFn: (file: File) => speciesApi.importJson(file),
+    onSuccess: (res) => {
+      queryClient.invalidateQueries({ queryKey: ['species'] })
+      queryClient.invalidateQueries({ queryKey: ['genera'] })
+      alert(`Imported ${res.data.imported} species, skipped ${res.data.skipped}`)
+    },
+  })
+
+  const deleteMutation = useMutation({
+    mutationFn: (id: number) => speciesApi.delete(id),
+    onSuccess: () => {
+      queryClient.invalidateQueries({ queryKey: ['species'] })
+    },
+  })
+
+  const createJobMutation = useMutation({
+    mutationFn: (data: { name: string; source: string; species_ids?: number[] }) =>
+      jobsApi.create(data),
+    onSuccess: () => {
+      setShowScrapeModal(false)
+      setSelectedIds([])
+      alert('Scrape job created!')
+    },
+  })
+
+  const handleCsvImport = (e: React.ChangeEvent<HTMLInputElement>) => {
+    const file = e.target.files?.[0]
+    if (file) {
+      importCsvMutation.mutate(file)
+      e.target.value = ''
+    }
+  }
+
+  const handleJsonImport = (e: React.ChangeEvent<HTMLInputElement>) => {
+    const file = e.target.files?.[0]
+    if (file) {
+      importJsonMutation.mutate(file)
+      e.target.value = ''
+    }
+  }
+
+  const handleSelectAll = () => {
+    if (!data) return
+    if (selectedIds.length === data.items.length) {
+      setSelectedIds([])
+    } else {
+      setSelectedIds(data.items.map((s) => s.id))
+    }
+  }
+
+  const handleSelect = (id: number) => {
+    setSelectedIds((prev) =>
+      prev.includes(id) ? prev.filter((i) => i !== id) : [...prev, id]
+    )
+  }
+
+  return (
+    <div className="space-y-6">
+      <div className="flex items-center justify-between">
+        <h1 className="text-2xl font-bold">Species</h1>
+        <div className="flex gap-2">
+          <button
+            onClick={() => csvInputRef.current?.click()}
+            disabled={importCsvMutation.isPending}
+            className="flex items-center gap-2 px-4 py-2 bg-gray-100 rounded-lg hover:bg-gray-200 disabled:opacity-50"
+          >
+            <Upload className="w-4 h-4" />
+            {importCsvMutation.isPending ? 'Importing...' : 'Import CSV'}
+          </button>
+          <input
+            ref={csvInputRef}
+            type="file"
+            accept=".csv"
+            onChange={handleCsvImport}
+            className="hidden"
+          />
+          <button
+            onClick={() => jsonInputRef.current?.click()}
+            disabled={importJsonMutation.isPending}
+            className="flex items-center gap-2 px-4 py-2 bg-gray-100 rounded-lg hover:bg-gray-200 disabled:opacity-50"
+          >
+            <Upload className="w-4 h-4" />
+            {importJsonMutation.isPending ? 'Importing...' : 'Import JSON'}
+          </button>
+          <input
+            ref={jsonInputRef}
+            type="file"
+            accept=".json"
+            onChange={handleJsonImport}
+            className="hidden"
+          />
+          <button
+            onClick={() => setShowAddModal(true)}
+            className="flex items-center gap-2 px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700"
+          >
+            <Plus className="w-4 h-4" />
+            Add Species
+          </button>
+        </div>
+      </div>
+
+      {/* Search and Filters */}
+      <div className="flex items-center gap-4 flex-wrap">
+        <div className="relative">
+          <Search className="absolute left-3 top-1/2 -translate-y-1/2 w-4 h-4 text-gray-400" />
+          <input
+            type="text"
+            placeholder="Search species..."
+            value={search}
+            onChange={(e) => {
+              setSearch(e.target.value)
+              setPage(1)
+            }}
+            className="pl-10 pr-4 py-2 border rounded-lg w-64"
+          />
+        </div>
+
+        <div className="flex items-center gap-2">
+          <Filter className="w-4 h-4 text-gray-400" />
+          <select
+            value={genus}
+            onChange={(e) => {
+              setGenus(e.target.value)
+              setPage(1)
+            }}
+            className="px-3 py-2 border rounded-lg bg-white"
+          >
+            <option value="">All Genera</option>
+            {genera?.map((g) => (
+              <option key={g} value={g}>
+                {g}
+              </option>
+            ))}
+          </select>
+
+          <select
+            value={hasImages}
+            onChange={(e) => {
+              setHasImages(e.target.value)
+              setMaxImages('')
+              setPage(1)
+            }}
+            className="px-3 py-2 border rounded-lg bg-white"
+          >
+            <option value="">All Species</option>
+            <option value="true">Has Images</option>
+            <option value="false">No Images</option>
+          </select>
+
+          <select
+            value={maxImages}
+            onChange={(e) => {
+              setMaxImages(e.target.value)
+              setHasImages('')
+              setPage(1)
+            }}
+            className="px-3 py-2 border rounded-lg bg-white"
+          >
+            <option value="">Any Image Count</option>
+            <option value="25">Fewer than 25 images</option>
+            <option value="50">Fewer than 50 images</option>
+            <option value="100">Fewer than 100 images</option>
+            <option value="250">Fewer than 250 images</option>
+            <option value="500">Fewer than 500 images</option>
+          </select>
+
+          {(genus || hasImages || maxImages) && (
+            <button
+              onClick={() => {
+                setGenus('')
+                setHasImages('')
+                setMaxImages('')
+                setPage(1)
+              }}
+              className="flex items-center gap-1 px-2 py-1 text-sm text-gray-500 hover:text-gray-700"
+            >
+              <X className="w-3 h-3" />
+              Clear
+            </button>
+          )}
+        </div>
+
+        <div className="ml-auto flex items-center gap-4">
+          {maxImages && data && data.total > 0 && (
+            <button
+              onClick={() => setShowScrapeFilteredModal(true)}
+              className="flex items-center gap-2 px-4 py-2 bg-purple-600 text-white rounded-lg hover:bg-purple-700"
+            >
+              <Play className="w-4 h-4" />
+              Scrape All {data.total} Filtered
+            </button>
+          )}
+          <button
+            onClick={() => setShowScrapeAllModal(true)}
+            className="flex items-center gap-2 px-4 py-2 bg-orange-600 text-white rounded-lg hover:bg-orange-700"
+          >
+            <Play className="w-4 h-4" />
+            Scrape All Without Images
+          </button>
+          {selectedIds.length > 0 && (
+            <div className="flex items-center gap-4">
+              <span className="text-sm text-gray-600">
+                {selectedIds.length} selected
+              </span>
+              <button
+                onClick={() => setShowScrapeModal(true)}
+                className="flex items-center gap-2 px-4 py-2 bg-blue-600 text-white rounded-lg hover:bg-blue-700"
+              >
+                <Play className="w-4 h-4" />
+                Start Scrape
+              </button>
+            </div>
+          )}
+        </div>
+      </div>
+
+      {/* Table */}
+      <div className="bg-white rounded-lg shadow overflow-hidden">
+        <table className="w-full">
+          <thead className="bg-gray-50">
+            <tr>
+              <th className="px-4 py-3 text-left">
+                <input
+                  type="checkbox"
+                  checked={(data?.items?.length ?? 0) > 0 && selectedIds.length === (data?.items?.length ?? 0)}
+                  onChange={handleSelectAll}
+                  className="rounded"
+                />
+              </th>
+              <th className="px-4 py-3 text-left text-sm font-medium text-gray-600">
+                Scientific Name
+              </th>
+              <th className="px-4 py-3 text-left text-sm font-medium text-gray-600">
+                Common Name
+              </th>
+              <th className="px-4 py-3 text-left text-sm font-medium text-gray-600">
+                Genus
+              </th>
+              <th className="px-4 py-3 text-right text-sm font-medium text-gray-600">
+                Images
+              </th>
+              <th className="px-4 py-3 text-right text-sm font-medium text-gray-600">
+                Actions
+              </th>
+            </tr>
+          </thead>
+          <tbody>
+            {isLoading ? (
+              <tr>
+                <td colSpan={6} className="px-4 py-8 text-center text-gray-400">
+                  Loading...
+                </td>
+              </tr>
+            ) : data?.items.length === 0 ? (
+              <tr>
+                <td colSpan={6} className="px-4 py-8 text-center text-gray-400">
+                  No species found. Import a CSV or JSON file, or adjust your filters.
+                </td>
+              </tr>
+            ) : (
+              data?.items.map((species) => (
+                <tr
+                  key={species.id}
+                  className="border-t hover:bg-gray-50 cursor-pointer"
+                  onClick={() => setViewSpecies(species)}
+                >
+                  <td className="px-4 py-3" onClick={(e) => e.stopPropagation()}>
+                    <input
+                      type="checkbox"
+                      checked={selectedIds.includes(species.id)}
+                      onChange={() => handleSelect(species.id)}
+                      className="rounded"
+                    />
+                  </td>
+                  <td className="px-4 py-3 font-medium">{species.scientific_name}</td>
+                  <td className="px-4 py-3 text-gray-600">
+                    {species.common_name || '-'}
+                  </td>
+                  <td className="px-4 py-3 text-gray-600">{species.genus || '-'}</td>
+                  <td className="px-4 py-3 text-right">
+                    <span
+                      className={`inline-block px-2 py-1 rounded text-sm ${
+                        species.image_count >= 100
+                          ? 'bg-green-100 text-green-700'
+                          : species.image_count > 0
+                          ? 'bg-yellow-100 text-yellow-700'
+                          : 'bg-gray-100 text-gray-600'
+                      }`}
+                    >
+                      {species.image_count}
+                    </span>
+                  </td>
+                  <td className="px-4 py-3 text-right" onClick={(e) => e.stopPropagation()}>
+                    <button
+                      onClick={() => { if (window.confirm(`Delete ${species.scientific_name}?`)) deleteMutation.mutate(species.id) }}
+                      className="p-1 text-red-500 hover:bg-red-50 rounded"
+                    >
+                      <Trash2 className="w-4 h-4" />
+                    </button>
+                  </td>
+                </tr>
+              ))
+            )}
+          </tbody>
+        </table>
+      </div>
+
+      {/* Pagination */}
+      {data && data.pages > 1 && (
+        <div className="flex items-center justify-between">
+          <span className="text-sm text-gray-600">
+            Showing {(page - 1) * 50 + 1} to {Math.min(page * 50, data.total)} of{' '}
+            {data.total}
+          </span>
+          <div className="flex gap-2">
+            <button
+              onClick={() => setPage((p) => Math.max(1, p - 1))}
+              disabled={page === 1}
+              className="p-2 rounded border disabled:opacity-50"
+            >
+              <ChevronLeft className="w-4 h-4" />
+            </button>
+            <span className="px-4 py-2">
+              Page {page} of {data.pages}
+            </span>
+            <button
+              onClick={() => setPage((p) => Math.min(data.pages, p + 1))}
+              disabled={page === data.pages}
+              className="p-2 rounded border disabled:opacity-50"
+            >
+              <ChevronRight className="w-4 h-4" />
+            </button>
+          </div>
+        </div>
+      )}
+
+      {/* Add Species Modal */}
+      {showAddModal && (
+        <AddSpeciesModal onClose={() => setShowAddModal(false)} />
+      )}
+
+      {/* Scrape Modal */}
+      {showScrapeModal && (
+        <ScrapeModal
+          selectedIds={selectedIds}
+          onClose={() => setShowScrapeModal(false)}
+          onSubmit={(source) => {
+            createJobMutation.mutate({
+              name: `Scrape ${selectedIds.length} species from ${source}`,
+              source,
+              species_ids: selectedIds,
+            })
+          }}
+        />
+      )}
+
+      {/* Species Detail Modal */}
+      {viewSpecies && (
+        <SpeciesDetailModal
+          species={viewSpecies}
+          onClose={() => setViewSpecies(null)}
+        />
+      )}
+
+      {/* Scrape All Without Images Modal */}
+      {showScrapeAllModal && (
+        <ScrapeAllModal
+          onClose={() => setShowScrapeAllModal(false)}
+        />
+      )}
+
+      {/* Scrape All Filtered Modal */}
+      {showScrapeFilteredModal && (
+        <ScrapeFilteredModal
+          maxImages={parseInt(maxImages, 10)}
+          speciesCount={data?.total ?? 0}
+          onClose={() => setShowScrapeFilteredModal(false)}
+        />
+      )}
+    </div>
+  )
+}
+
+function AddSpeciesModal({ onClose }: { onClose: () => void }) {
+  const queryClient = useQueryClient()
+  const [form, setForm] = useState({
+    scientific_name: '',
+    common_name: '',
+    genus: '',
+    family: '',
+  })
+
+  const mutation = useMutation({
+    mutationFn: () => speciesApi.create(form),
+    onSuccess: () => {
+      queryClient.invalidateQueries({ queryKey: ['species'] })
+      onClose()
+    },
+  })
+
+  return (
+    <div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50">
+      <div className="bg-white rounded-lg p-6 w-full max-w-md">
+        <h2 className="text-xl font-bold mb-4">Add Species</h2>
+        <div className="space-y-4">
+          <div>
+            <label className="block text-sm font-medium mb-1">
+              Scientific Name *
+            </label>
+            <input
+              type="text"
+              value={form.scientific_name}
+              onChange={(e) =>
+                setForm({ ...form, scientific_name: e.target.value })
+              }
+              className="w-full px-3 py-2 border rounded-lg"
+              placeholder="e.g. Monstera deliciosa"
+            />
+          </div>
+          <div>
+            <label className="block text-sm font-medium mb-1">Common Name</label>
+            <input
+              type="text"
+              value={form.common_name}
+              onChange={(e) => setForm({ ...form, common_name: e.target.value })}
+              className="w-full px-3 py-2 border rounded-lg"
+              placeholder="e.g. Swiss Cheese Plant"
+            />
+          </div>
+          <div className="grid grid-cols-2 gap-4">
+            <div>
+              <label className="block text-sm font-medium mb-1">Genus</label>
+              <input
+                type="text"
+                value={form.genus}
+                onChange={(e) => setForm({ ...form, genus: e.target.value })}
+                className="w-full px-3 py-2 border rounded-lg"
+                placeholder="e.g. Monstera"
+              />
+            </div>
+            <div>
+              <label className="block text-sm font-medium mb-1">Family</label>
+              <input
+                type="text"
+                value={form.family}
+                onChange={(e) => setForm({ ...form, family: e.target.value })}
+                className="w-full px-3 py-2 border rounded-lg"
+                placeholder="e.g. Araceae"
+              />
+            </div>
+          </div>
+        </div>
+        <div className="flex justify-end gap-2 mt-6">
+          <button
+            onClick={onClose}
+            className="px-4 py-2 border rounded-lg hover:bg-gray-50"
+          >
+            Cancel
+          </button>
+          <button
+            onClick={() => mutation.mutate()}
+            disabled={!form.scientific_name}
+            className="px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700 disabled:opacity-50"
+          >
+            Add Species
+          </button>
+        </div>
+      </div>
+    </div>
+  )
+}
+
+function ScrapeModal({
+  selectedIds,
+  onClose,
+  onSubmit,
+}: {
+  selectedIds: number[]
+  onClose: () => void
+  onSubmit: (source: string) => void
+}) {
+  const [source, setSource] = useState('inaturalist')
+
+  const sources = [
+    { value: 'gbif', label: 'GBIF' },
+    { value: 'inaturalist', label: 'iNaturalist' },
+    { value: 'flickr', label: 'Flickr' },
+    { value: 'wikimedia', label: 'Wikimedia Commons' },
+    { value: 'trefle', label: 'Trefle.io' },
+    { value: 'duckduckgo', label: 'DuckDuckGo' },
+    { value: 'bing', label: 'Bing Image Search' },
+  ]
+
+  return (
+    <div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50">
+      <div className="bg-white rounded-lg p-6 w-full max-w-md">
+        <h2 className="text-xl font-bold mb-4">Start Scrape Job</h2>
+        <p className="text-gray-600 mb-4">
+          Scrape images for {selectedIds.length} selected species
+        </p>
+        <div>
+          <label className="block text-sm font-medium mb-2">Select Source</label>
+          <div className="space-y-2">
+            {sources.map((s) => (
+              <label
+                key={s.value}
+                className={`flex items-center p-3 border rounded-lg cursor-pointer ${
+                  source === s.value ? 'border-green-500 bg-green-50' : ''
+                }`}
+              >
+                <input
+                  type="radio"
+                  value={s.value}
+                  checked={source === s.value}
+                  onChange={(e) => setSource(e.target.value)}
+                  className="mr-3"
+                />
+                {s.label}
+              </label>
+            ))}
+          </div>
+        </div>
+        <div className="flex justify-end gap-2 mt-6">
+          <button
+            onClick={onClose}
+            className="px-4 py-2 border rounded-lg hover:bg-gray-50"
+          >
+            Cancel
+          </button>
+          <button
+            onClick={() => onSubmit(source)}
+            className="px-4 py-2 bg-blue-600 text-white rounded-lg hover:bg-blue-700"
+          >
+            Start Scrape
+          </button>
+        </div>
+      </div>
+    </div>
+  )
+}
+
+function SpeciesDetailModal({
+  species,
+  onClose,
+}: {
+  species: SpeciesType
+  onClose: () => void
+}) {
+  const [page, setPage] = useState(1)
+  const pageSize = 20
+
+  const { data, isLoading } = useQuery({
+    queryKey: ['species-images', species.id, page],
+    queryFn: () =>
+      imagesApi.list({
+        species_id: species.id,
+        status: 'downloaded',
+        page,
+        page_size: pageSize,
+      }).then((res) => res.data),
+  })
+
+  return (
+    <div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50 p-4">
+      <div className="bg-white rounded-lg w-full max-w-5xl max-h-[90vh] flex flex-col">
+        {/* Header */}
+        <div className="px-6 py-4 border-b flex items-start justify-between">
+          <div>
+            <h2 className="text-xl font-bold">{species.scientific_name}</h2>
+            {species.common_name && (
+              <p className="text-gray-600">{species.common_name}</p>
+            )}
+            <div className="flex gap-4 mt-2 text-sm text-gray-500">
+              {species.genus && <span>Genus: {species.genus}</span>}
+              {species.family && <span>Family: {species.family}</span>}
+              <span>{species.image_count} images</span>
+            </div>
+          </div>
+          <button
+            onClick={onClose}
+            className="p-2 hover:bg-gray-100 rounded-lg"
+          >
+            <X className="w-5 h-5" />
+          </button>
+        </div>
+
+        {/* Images Grid */}
+        <div className="flex-1 overflow-y-auto p-6">
+          {isLoading ? (
+            <div className="flex items-center justify-center h-64">
+              <div className="animate-spin rounded-full h-8 w-8 border-b-2 border-green-600"></div>
+            </div>
+          ) : !data || data.items.length === 0 ? (
+            <div className="flex flex-col items-center justify-center h-64 text-gray-400">
+              <ImageIcon className="w-12 h-12 mb-4" />
+              <p>No images yet</p>
+              <p className="text-sm mt-2">
+                Start a scrape job to download images for this species
+              </p>
+            </div>
+          ) : (
+            <div className="grid grid-cols-2 sm:grid-cols-3 md:grid-cols-4 lg:grid-cols-5 gap-4">
+              {data.items.map((image) => (
+                <div
+                  key={image.id}
+                  className="group relative aspect-square bg-gray-100 rounded-lg overflow-hidden"
+                >
+                  {image.local_path ? (
+                    <img
+                      src={`/api/images/${image.id}/file`}
+                      alt={species.scientific_name}
+                      className="w-full h-full object-cover"
+                      loading="lazy"
+                    />
+                  ) : (
+                    <div className="w-full h-full flex items-center justify-center text-gray-400">
+                      <ImageIcon className="w-8 h-8" />
+                    </div>
+                  )}
+                  {/* Overlay with info */}
+                  <div className="absolute inset-0 bg-black/60 opacity-0 group-hover:opacity-100 transition-opacity flex flex-col justify-end p-2">
+                    <div className="text-white text-xs">
+                      <div className="flex items-center justify-between">
+                        <span className="bg-white/20 px-1.5 py-0.5 rounded">
+                          {image.source}
+                        </span>
+                        <span className="bg-white/20 px-1.5 py-0.5 rounded">
+                          {image.license}
+                        </span>
+                      </div>
+                      {image.width && image.height && (
+                        <div className="mt-1 text-white/70">
+                          {image.width} × {image.height}
+                        </div>
+                      )}
+                    </div>
+                    {image.url && (
+                      <a
+                        href={image.url}
+                        target="_blank"
+                        rel="noopener noreferrer"
+                        className="absolute top-2 right-2 p-1 bg-white/20 rounded hover:bg-white/40"
+                        onClick={(e) => e.stopPropagation()}
+                      >
+                        <ExternalLink className="w-4 h-4 text-white" />
+                      </a>
+                    )}
+                  </div>
+                </div>
+              ))}
+            </div>
+          )}
+        </div>
+
+        {/* Pagination */}
+        {data && data.pages > 1 && (
+          <div className="px-6 py-4 border-t flex items-center justify-between">
+            <span className="text-sm text-gray-600">
+              Showing {(page - 1) * pageSize + 1} to{' '}
+              {Math.min(page * pageSize, data.total)} of {data.total}
+            </span>
+            <div className="flex gap-2">
+              <button
+                onClick={() => setPage((p) => Math.max(1, p - 1))}
+                disabled={page === 1}
+                className="p-2 rounded border disabled:opacity-50"
+              >
+                <ChevronLeft className="w-4 h-4" />
+              </button>
+              <span className="px-4 py-2">
+                Page {page} of {data.pages}
+              </span>
+              <button
+                onClick={() => setPage((p) => Math.min(data.pages, p + 1))}
+                disabled={page === data.pages}
+                className="p-2 rounded border disabled:opacity-50"
+              >
+                <ChevronRight className="w-4 h-4" />
+              </button>
+            </div>
+          </div>
+        )}
+      </div>
+    </div>
+  )
+}
+
+function ScrapeAllModal({ onClose }: { onClose: () => void }) {
+  const [selectedSources, setSelectedSources] = useState<string[]>([])
+  const [isSubmitting, setIsSubmitting] = useState(false)
+
+  // Fetch only the count of species without images (page_size: 1, just `total` is read)
+  const { data: speciesData, isLoading } = useQuery({
+    queryKey: ['species-no-images'],
+    queryFn: () =>
+      speciesApi.list({
+        page: 1,
+        page_size: 1,
+        has_images: false,
+      }).then((res) => res.data),
+  })
+
+  const sources = [
+    { value: 'gbif', label: 'GBIF', description: 'Free biodiversity database, no API key needed' },
+    { value: 'inaturalist', label: 'iNaturalist', description: 'Research-grade observations with CC licenses' },
+    { value: 'wikimedia', label: 'Wikimedia Commons', description: 'Free media repository, requires OAuth' },
+    { value: 'flickr', label: 'Flickr', description: 'Requires API key, CC-licensed photos' },
+    { value: 'trefle', label: 'Trefle.io', description: 'Plant database, requires API key' },
+    { value: 'duckduckgo', label: 'DuckDuckGo', description: 'Web image search, no API key needed' },
+    { value: 'bing', label: 'Bing Image Search', description: 'Azure Cognitive Services, requires API key' },
+  ]
+
+  const toggleSource = (source: string) => {
+    setSelectedSources((prev) =>
+      prev.includes(source)
+        ? prev.filter((s) => s !== source)
+        : [...prev, source]
+    )
+  }
+
+  const handleSubmit = async () => {
+    if (selectedSources.length === 0) return
+
+    setIsSubmitting(true)
+    try {
+      // Create one job per selected source, one request at a time
+      for (const source of selectedSources) {
+        await jobsApi.create({
+          name: `Scrape all species without images from ${source}`,
+          source,
+          only_without_images: true,
+        })
+      }
+      alert(`Created ${selectedSources.length} scrape job(s)!`)
+      onClose()
+    } catch (error) {
+      alert('Failed to create jobs')
+    } finally {
+      setIsSubmitting(false)
+    }
+  }
+
+  const speciesCount = speciesData?.total ?? 0
+
+  return (
+    <div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50">
+      <div className="bg-white rounded-lg p-6 w-full max-w-lg">
+        <h2 className="text-xl font-bold mb-2">Scrape All Species Without Images</h2>
+        {isLoading ? (
+          <p className="text-gray-600 mb-4">Loading...</p>
+        ) : (
+          <p className="text-gray-600 mb-4">
+            {speciesCount === 0 ? (
+              'All species already have images!'
+            ) : (
+              <>
+                <span className="font-semibold text-orange-600">{speciesCount}</span> species
+                don't have any images yet. Select sources to scrape from:
+              </>
+            )}
+          </p>
+        )}
+
+        {speciesCount > 0 && (
+          <>
+            <div className="space-y-2 mb-6">
+              {sources.map((s) => (
+                <label
+                  key={s.value}
+                  className={`flex items-start p-3 border rounded-lg cursor-pointer transition-colors ${
+                    selectedSources.includes(s.value)
+                      ? 'border-orange-500 bg-orange-50'
+                      : 'hover:bg-gray-50'
+                  }`}
+                >
+                  <input
+                    type="checkbox"
+                    checked={selectedSources.includes(s.value)}
+                    onChange={() => toggleSource(s.value)}
+                    className="mt-1 mr-3 rounded"
+                  />
+                  <div>
+                    <div className="font-medium">{s.label}</div>
+                    <div className="text-sm text-gray-500">{s.description}</div>
+                  </div>
+                </label>
+              ))}
+            </div>
+
+            {selectedSources.length > 1 && (
+              <div className="bg-blue-50 border border-blue-200 rounded-lg p-3 mb-4 text-sm text-blue-700">
+                <strong>{selectedSources.length} jobs</strong> will be created and run in parallel,
+                one for each selected source.
+              </div>
+            )}
+          </>
+        )}
+
+        <div className="flex justify-end gap-2">
+          <button
+            onClick={onClose}
+            className="px-4 py-2 border rounded-lg hover:bg-gray-50"
+          >
+            Cancel
+          </button>
+          {speciesCount > 0 && (
+            <button
+              onClick={handleSubmit}
+              disabled={selectedSources.length === 0 || isSubmitting}
+              className="px-4 py-2 bg-orange-600 text-white rounded-lg hover:bg-orange-700 disabled:opacity-50"
+            >
+              {isSubmitting
+                ? 'Creating Jobs...'
+                : `Start ${selectedSources.length || ''} Scrape Job${selectedSources.length !== 1 ? 's' : ''}`}
+            </button>
+          )}
+        </div>
+      </div>
+    </div>
+  )
+}
+
+function ScrapeFilteredModal({
+  maxImages,
+  speciesCount,
+  onClose,
+}: {
+  maxImages: number
+  speciesCount: number
+  onClose: () => void
+}) {
+  const [selectedSources, setSelectedSources] = useState<string[]>([])
+  const [isSubmitting, setIsSubmitting] = useState(false)
+
+  const sources = [
+    { value: 'gbif', label: 'GBIF', description: 'Free biodiversity database, no API key needed' },
+    { value: 'inaturalist', label: 'iNaturalist', description: 'Research-grade observations with CC licenses' },
+    { value: 'wikimedia', label: 'Wikimedia Commons', description: 'Free media repository, requires OAuth' },
+    { value: 'flickr', label: 'Flickr', description: 'Requires API key, CC-licensed photos' },
+    { value: 'trefle', label: 'Trefle.io', description: 'Plant database, requires API key' },
+    { value: 'duckduckgo', label: 'DuckDuckGo', description: 'Web image search, no API key needed' },
+    { value: 'bing', label: 'Bing Image Search', description: 'Azure Cognitive Services, requires API key' },
+  ]
+
+  const toggleSource = (source: string) => {
+    setSelectedSources((prev) =>
+      prev.includes(source)
+        ? prev.filter((s) => s !== source)
+        : [...prev, source]
+    )
+  }
+
+  const handleSubmit = async () => {
+    if (selectedSources.length === 0) return
+
+    setIsSubmitting(true)
+    try {
+      for (const source of selectedSources) {
+        await jobsApi.create({
+          name: `Scrape species with <${maxImages} images from ${source}`,
+          source,
+          max_images: maxImages,
+        })
+      }
+      alert(`Created ${selectedSources.length} scrape job(s)!`)
+      onClose()
+    } catch (error) {
+      alert('Failed to create jobs')
+    } finally {
+      setIsSubmitting(false)
+    }
+  }
+
+  return (
+    <div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50">
+      <div className="bg-white rounded-lg p-6 w-full max-w-lg">
+        <h2 className="text-xl font-bold mb-2">Scrape All Filtered Species</h2>
+        <p className="text-gray-600 mb-4">
+          <span className="font-semibold text-purple-600">{speciesCount}</span> species
+          have fewer than <span className="font-semibold">{maxImages}</span> images.
+          Select sources to scrape from:
+        </p>
+
+        <div className="space-y-2 mb-6">
+          {sources.map((s) => (
+            <label
+              key={s.value}
+              className={`flex items-start p-3 border rounded-lg cursor-pointer transition-colors ${
+                selectedSources.includes(s.value)
+                  ? 'border-purple-500 bg-purple-50'
+                  : 'hover:bg-gray-50'
+              }`}
+            >
+              <input
+                type="checkbox"
+                checked={selectedSources.includes(s.value)}
+                onChange={() => toggleSource(s.value)}
+                className="mt-1 mr-3 rounded"
+              />
+              <div>
+                <div className="font-medium">{s.label}</div>
+                <div className="text-sm text-gray-500">{s.description}</div>
+              </div>
+            </label>
+          ))}
+        </div>
+
+        {selectedSources.length > 1 && (
+          <div className="bg-blue-50 border border-blue-200 rounded-lg p-3 mb-4 text-sm text-blue-700">
+            <strong>{selectedSources.length} jobs</strong> will be created and run in parallel,
+            one for each selected source.
+          </div>
+        )}
+
+        <div className="flex justify-end gap-2">
+          <button
+            onClick={onClose}
+            className="px-4 py-2 border rounded-lg hover:bg-gray-50"
+          >
+            Cancel
+          </button>
+          <button
+            onClick={handleSubmit}
+            disabled={selectedSources.length === 0 || isSubmitting}
+            className="px-4 py-2 bg-purple-600 text-white rounded-lg hover:bg-purple-700 disabled:opacity-50"
+          >
+            {isSubmitting
+              ? 'Creating Jobs...'
+              : `Start ${selectedSources.length || ''} Scrape Job${selectedSources.length !== 1 ? 's' : ''}`}
+          </button>
+        </div>
+      </div>
+    </div>
+  )
+}
@@ -0,0 +1,9 @@
+/// <reference types="vite/client" />
+
+interface ImportMetaEnv {
+  readonly VITE_API_URL: string
+}
+
+interface ImportMeta {
+  readonly env: ImportMetaEnv
+}
@@ -0,0 +1,11 @@
+/** @type {import('tailwindcss').Config} */
+export default {
+  content: [
+    "./index.html",
+    "./src/**/*.{js,ts,jsx,tsx}",
+  ],
+  theme: {
+    extend: {},
+  },
+  plugins: [],
+}
@@ -0,0 +1,21 @@
+{
+  "compilerOptions": {
+    "target": "ES2020",
+    "useDefineForClassFields": true,
+    "lib": ["ES2020", "DOM", "DOM.Iterable"],
+    "module": "ESNext",
+    "skipLibCheck": true,
+    "moduleResolution": "bundler",
+    "allowImportingTsExtensions": true,
+    "resolveJsonModule": true,
+    "isolatedModules": true,
+    "noEmit": true,
+    "jsx": "react-jsx",
+    "strict": true,
+    "noUnusedLocals": true,
+    "noUnusedParameters": true,
+    "noFallthroughCasesInSwitch": true
+  },
+  "include": ["src"],
+  "references": [{ "path": "./tsconfig.node.json" }]
+}
@@ -0,0 +1,10 @@
+{
+  "compilerOptions": {
+    "composite": true,
+    "skipLibCheck": true,
+    "module": "ESNext",
+    "moduleResolution": "bundler",
+    "allowSyntheticDefaultImports": true
+  },
+  "include": ["vite.config.ts"]
+}
@@ -0,0 +1,18 @@
+import { defineConfig } from 'vite'
+import react from '@vitejs/plugin-react'
+
+export default defineConfig({
+  plugins: [react()],
+  server: {
+    port: 3000,
+    host: true,
+    proxy: {
+      '/api': {
+        target: 'http://backend:8000',
+        changeOrigin: true,
+      },
+    },
+    // Disable HMR - not useful in Docker deployments
+    hmr: false,
+  },
+})
@@ -0,0 +1,58 @@
+events {
+    worker_connections 1024;
+}
+
+http {
+    include /etc/nginx/mime.types;
+    default_type application/octet-stream;
+
+    upstream backend {
+        server backend:8000;
+    }
+
+    upstream frontend {
+        server frontend:3000;
+    }
+
+    server {
+        listen 80;
+        server_name localhost;
+
+        # API routes
+        location /api {
+            proxy_pass http://backend;
+            proxy_set_header Host $host;
+            proxy_set_header X-Real-IP $remote_addr;
+            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+            proxy_set_header X-Forwarded-Proto $scheme;
+
+            # Proxy timeouts (60s matches the nginx defaults; raise these for slow API calls)
+            proxy_connect_timeout 60s;
+            proxy_send_timeout 60s;
+            proxy_read_timeout 60s;
+        }
+
+        # Health check
+        location /health {
+            proxy_pass http://backend;
+        }
+
+        # WebSocket upgrade for the frontend dev server (note: HMR is disabled in vite.config.ts)
+        location /ws {
+            proxy_pass http://frontend;
+            proxy_http_version 1.1;
+            proxy_set_header Upgrade $http_upgrade;
+            proxy_set_header Connection "upgrade";
+        }
+
+        # Frontend
+        location / {
+            proxy_pass http://frontend;
+            proxy_set_header Host $host;
+            proxy_set_header X-Real-IP $remote_addr;
+            proxy_http_version 1.1;
+            proxy_set_header Upgrade $http_upgrade;
+            proxy_set_header Connection "upgrade";
+        }
+    }
+}