Initial commit — PlantGuideScraper project

2026-04-12 09:54:27 -05:00
commit 6926f502c5
87 changed files with 29120 additions and 0 deletions
@@ -0,0 +1,20 @@
 # Database
 DATABASE_URL=sqlite:////data/db/plants.sqlite
 # Redis
 REDIS_URL=redis://redis:6379/0
 # Storage paths
 IMAGES_PATH=/data/images
 EXPORTS_PATH=/data/exports
 # API Keys (user-provided)
 FLICKR_API_KEY=
 FLICKR_API_SECRET=
 INATURALIST_APP_ID=
 INATURALIST_APP_SECRET=
 TREFLE_API_KEY=
 # Optional settings
 LOG_LEVEL=INFO
 CELERY_CONCURRENCY=4
@@ -0,0 +1,39 @@
 # Python
 __pycache__/
 *.py[cod]
 *$py.class
 *.so
 .Python
 venv/
 .venv/
 ENV/
 env/
 .eggs/
 *.egg-info/
 *.egg
 # Node
 node_modules/
 npm-debug.log
 yarn-error.log
 # IDE
 .idea/
 .vscode/
 *.swp
 *.swo
 *~
 # OS
 .DS_Store
 Thumbs.db
 # Project specific
 data/
 *.sqlite
 *.db
 .env
 *.zip
 # Docker
 docker-compose.override.yml
@@ -0,0 +1,209 @@
 # PlantGuideScraper
 Web-based interface for managing a multi-source houseplant image scraping pipeline. Collects images from iNaturalist, Flickr, Wikimedia Commons, and Trefle.io to build datasets for CoreML training.
 ## Features
 - **Species Management**: Import species lists via CSV or JSON, search and filter by genus or image status
 - **Multi-Source Scraping**: iNaturalist/GBIF, Flickr, Wikimedia Commons, Trefle.io
 - **Image Quality Pipeline**: Automatic deduplication, blur detection, resizing
 - **License Filtering**: Only collect commercially-safe CC0/CC-BY licensed images
 - **Export for CoreML**: Train/test split, Create ML-compatible folder structure
 - **Real-time Dashboard**: Progress tracking, statistics, job monitoring
 ## Quick Start
 ```bash
 # From the cloned project directory, build and start the stack
 cd PlantGuideScraper
 docker-compose up --build
 # Access the UI
 open http://localhost
 ```
 ## Unraid Deployment
 ### Setup
 1. Copy the project to your Unraid server:
   ```bash
   scp -r PlantGuideScraper root@YOUR_UNRAID_IP:/mnt/user/appdata/PlantGuideScraper
   ```
 2. SSH into Unraid and create data directories:
   ```bash
   ssh root@YOUR_UNRAID_IP
   mkdir -p /mnt/user/appdata/PlantGuideScraper/{database,images,exports,redis}
   ```
 3. Install **Docker Compose Manager** from Community Applications
 4. In Unraid: **Docker → Compose → Add New Stack**
   - Path: `/mnt/user/appdata/PlantGuideScraper/docker-compose.unraid.yml`
   - Click **Compose Up**
 5. Access at `http://YOUR_UNRAID_IP:8580`
 ### Configurable Paths
 Edit `docker-compose.unraid.yml` to customize where data is stored. Look for these lines in both `backend` and `celery` services:
 ```yaml
 # === CONFIGURABLE DATA PATHS ===
 - /mnt/user/appdata/PlantGuideScraper/database:/data/db    # DATABASE_PATH
 - /mnt/user/appdata/PlantGuideScraper/images:/data/images  # IMAGES_PATH
 - /mnt/user/appdata/PlantGuideScraper/exports:/data/exports # EXPORTS_PATH
 ```
 | Path | Description | Default |
 |------|-------------|---------|
 | DATABASE_PATH | SQLite database file | `/mnt/user/appdata/PlantGuideScraper/database` |
 | IMAGES_PATH | Downloaded images (can be 100GB+) | `/mnt/user/appdata/PlantGuideScraper/images` |
 | EXPORTS_PATH | Generated export zip files | `/mnt/user/appdata/PlantGuideScraper/exports` |
 **Example: Store images on a separate share:**
 ```yaml
 - /mnt/user/data/PlantImages:/data/images  # IMAGES_PATH
 ```
 **Important:** Keep paths identical in both `backend` and `celery` services.
 ## Configuration
 1. Configure API keys in Settings:
   - **Flickr**: Get key at https://www.flickr.com/services/api/
   - **Trefle**: Get key at https://trefle.io/
   - iNaturalist and Wikimedia don't require keys
 2. Import species list (see Import Documentation below)
 3. Select species and start scraping
 ## Import Documentation
 ### CSV Import
 Import species from a CSV file with the following columns:
 | Column | Required | Description |
 |--------|----------|-------------|
 | `scientific_name` | Yes | Binomial name (e.g., "Monstera deliciosa") |
 | `common_name` | No | Common name (e.g., "Swiss Cheese Plant") |
 | `genus` | No | Auto-extracted from scientific_name if not provided |
 | `family` | No | Plant family (e.g., "Araceae") |
 **Example CSV:**
 ```csv
 scientific_name,common_name,genus,family
 Monstera deliciosa,Swiss Cheese Plant,Monstera,Araceae
 Philodendron hederaceum,Heartleaf Philodendron,Philodendron,Araceae
 Epipremnum aureum,Golden Pothos,Epipremnum,Araceae
 ```
 ### JSON Import
 Import species from a JSON file with the following structure:
 ```json
 {
  "plants": [
    {
      "scientific_name": "Monstera deliciosa",
      "common_names": ["Swiss Cheese Plant", "Split-leaf Philodendron"],
      "family": "Araceae"
    },
    {
      "scientific_name": "Philodendron hederaceum",
      "common_names": ["Heartleaf Philodendron"],
      "family": "Araceae"
    }
  ]
 }
 ```
 | Field | Required | Description |
 |-------|----------|-------------|
 | `scientific_name` | Yes | Binomial name |
 | `common_names` | No | Array of common names (first one is used) |
 | `family` | No | Plant family |
 **Notes:**
 - Genus is automatically extracted from the first word of `scientific_name`
 - Duplicate species (by scientific_name) are skipped
 - The included `houseplants_list.json` contains 2,278 houseplant species
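
The import rules in the notes above can be sketched in a few lines of Python. This is an illustration only, not the project's importer: `normalize_species` and `import_rows` are hypothetical names, and the returned counters simply mirror the `imported`, `skipped` and `errors` fields of the API response.

```python
from typing import Iterable


def normalize_species(row: dict) -> dict:
    """Return a cleaned record; genus defaults to the first word of the name."""
    name = " ".join(row["scientific_name"].split())
    common_names = row.get("common_names") or []
    # CSV rows carry `common_name`; JSON rows carry a list and the first entry is used.
    common = row.get("common_name") or (common_names[0] if common_names else None)
    return {
        "scientific_name": name,
        "common_name": common,
        "genus": row.get("genus") or name.split(" ")[0],
        "family": row.get("family"),
    }


def import_rows(rows: Iterable[dict], existing: set[str]) -> dict:
    """Add new species names to `existing`, skipping duplicates by scientific_name."""
    imported, skipped = 0, 0
    for row in rows:
        record = normalize_species(row)
        if record["scientific_name"] in existing:
            skipped += 1
            continue
        existing.add(record["scientific_name"])
        imported += 1
    return {"imported": imported, "skipped": skipped, "errors": []}
```

Here `existing` stands in for the set of scientific names already stored in the database; a real importer would query the `species` table instead.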
 ### API Endpoints
 ```bash
 # Import CSV
 curl -X POST http://localhost/api/species/import \
  -F "file=@species.csv"
 # Import JSON
 curl -X POST http://localhost/api/species/import-json \
  -F "file=@plants.json"
 ```
 **Response:**
 ```json
 {
  "imported": 150,
  "skipped": 5,
  "errors": []
 }
 ```
 ## Architecture
 ```
 ┌─────────────┐     ┌─────────────────┐     ┌─────────────┐
 │   React     │────▶│  FastAPI        │────▶│   Celery    │
 │   Frontend  │     │  Backend        │     │   Workers   │
 └─────────────┘     └─────────────────┘     └─────────────┘
                           │                       │
                           ▼                       ▼
                   ┌─────────────┐         ┌─────────────┐
                   │   SQLite    │         │   Redis     │
                   │   Database  │         │   Queue     │
                   └─────────────┘         └─────────────┘
 ```
 ## Export Format
 Exports are Create ML-compatible:
 ```
 export.zip/
 ├── Training/
 │   ├── Monstera_deliciosa/
 │   │   ├── img_00001.jpg
 │   │   └── ...
 │   └── ...
 └── Testing/
    ├── Monstera_deliciosa/
    └── ...
 ```
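
As a rough sketch of how images could be assigned to this layout, the function below shuffles each species' images with a fixed seed and splits them by `train_split` (0.8 is the default stored on the `exports` table). `plan_export` is a hypothetical helper, not the project's export task, and the per-species numbering scheme is an assumption.

```python
import random


def plan_export(images_by_species: dict[str, list[str]],
                train_split: float = 0.8,
                seed: int = 0) -> dict[str, str]:
    """Map each source image path to its destination path inside the archive."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    plan: dict[str, str] = {}
    for species, paths in images_by_species.items():
        folder = species.replace(" ", "_")
        shuffled = sorted(paths)
        rng.shuffle(shuffled)
        n_train = int(len(shuffled) * train_split)
        for i, src in enumerate(shuffled):
            subset = "Training" if i < n_train else "Testing"
            plan[src] = f"{subset}/{folder}/img_{i + 1:05d}.jpg"
    return plan
```

Splitting per species rather than over the whole pool keeps every class represented in both `Training/` and `Testing/`, provided the species has enough images.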
 ## Data Storage
 All data is stored in the `./data` directory:
 ```
 data/
 ├── db/
 │   └── plants.sqlite    # SQLite database
 ├── images/              # Downloaded images
 │   └── {species_id}/
 │       └── {image_id}.jpg
 └── exports/             # Generated export archives
    └── {export_id}.zip
 ```
 ## API Documentation
 Full API docs available at http://localhost/api/docs
 ## License
 MIT
@@ -0,0 +1,231 @@
 # Houseplant Image Dataset Accumulation Plan
 ## Overview
 Build a custom CoreML model for houseplant identification by accumulating a large dataset of houseplant images with proper licensing for commercial use.
 ---
 ## Requirements Summary
 | Parameter | Value |
 |-----------|-------|
 | Target species | 5,000-10,000 (realistic houseplant ceiling) |
 | Images per species | 200-500 (recommended) |
 | Total images | ~1-5 million |
 | Budget | Free preferred, paid as reference |
 | Compute | M1 Max Mac (training) + Unraid server (data pipeline) |
 | Curation | Automated pipeline |
 | Timeline | Weeks-months |
 | Licensing | Must allow training + commercial model distribution |
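
The total in the table is the product of its two ranges. The sketch below reproduces that arithmetic and adds a storage estimate that assumes about 50 KB per image, the same rough figure the export preview endpoint uses. Stored originals may be considerably larger, so treat the size as a lower bound. `dataset_size` is a hypothetical helper.

```python
def dataset_size(species: int, images_per_species: int,
                 kb_per_image: float = 50.0) -> tuple[int, float]:
    """Return (total_images, approximate_size_gb) for a planned dataset."""
    total = species * images_per_species
    size_gb = total * kb_per_image / 1024 / 1024  # KB -> MB -> GB
    return total, size_gb


low = dataset_size(5_000, 200)     # lower bound of both ranges
high = dataset_size(10_000, 500)   # upper bound of both ranges
print(f"{low[0]:,} images, ~{low[1]:.0f} GB")
print(f"{high[0]:,} images, ~{high[1]:.0f} GB")
```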
 ---
 ## Hardware Assessment
 | Machine | Role | Capability |
 |---------|------|------------|
 | M1 Max Mac | **Training** | Create ML can train 5-10K class models; 32+ GB unified memory is ideal |
 | Unraid Server | **Data pipeline** | Scraping, downloading, preprocessing, storage |
 M1 Max is legitimately viable for this task via Create ML or PyTorch+MPS. No cloud GPU required.
 ---
 ## Data Sources Analysis
 ### Tier 1: Primary Sources (Recommended)
 | Source | License | Commercial-Safe | Volume | Houseplant Coverage | Access Method |
 |--------|---------|-----------------|--------|---------------------|---------------|
 | **iNaturalist via GBIF** | CC-BY, CC0 (filter) | Yes (filtered) | 100M+ observations | Good (has "captive/cultivated" flag) | Bulk export + API |
 | **Flickr** | CC-BY, CC0 (filter) | Yes (filtered) | Millions | Moderate | API |
 | **Wikimedia Commons** | CC-BY, CC-BY-SA, Public Domain | Mostly | Thousands | Moderate | API |
 ### Tier 2: Supplemental Sources
 | Source | License | Commercial-Safe | Notes |
 |--------|---------|-----------------|-------|
 | **USDA PLANTS** | Public Domain | Yes | US-focused, limited images |
 | **Encyclopedia of Life** | Mixed | Check each | Aggregator, good metadata |
 | **Pl@ntNet-300K Dataset** | CC-BY-SA | Share-alike | Good for research/prototyping only |
 ### Tier 3: Paid Options (Reference)
 | Source | Estimated Cost | Notes |
 |--------|----------------|-------|
 | iNaturalist AWS Open Data | Free | Bulk image export, requires S3 costs for transfer |
 | Custom scraping infrastructure | $50-200/mo | Proxies, storage, bandwidth |
 | Commercial botanical databases | $1000s+ | Getty, Alamy — not recommended |
 ---
 ## Licensing Decision Matrix
 ```
 Want commercial model distribution?
 ├─ YES → Use ONLY: CC0, CC-BY, Public Domain
 │        Filter OUT: CC-BY-NC, CC-BY-SA, All Rights Reserved
 │
 └─ NO (research only) → Can use CC-BY-NC, CC-BY-SA
                        Pl@ntNet-300K dataset becomes viable
 ```
 **Recommendation**: Filter for commercial-safe licenses from day 1. Avoids re-scraping later.
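
A minimal sketch of such a filter is shown below. It assumes licenses arrive as free-form strings such as `"CC-BY 4.0"`; real sources report licenses in their own formats (Flickr, for example, uses numeric license IDs), so the normalization here is illustrative. `normalize_license` and `is_commercial_safe` are hypothetical names.

```python
COMMERCIAL_SAFE = {"cc0", "cc-by", "public domain"}


def normalize_license(raw: str) -> str:
    """Lowercase, unify separators, and drop version suffixes such as '4.0'."""
    text = raw.strip().lower().replace("_", "-").replace(" ", "-")
    parts = [p for p in text.split("-") if not p.replace(".", "").isdigit()]
    normalized = "-".join(parts)
    return "public domain" if normalized in {"pd", "public-domain"} else normalized


def is_commercial_safe(raw: str) -> bool:
    """Accept only allow-listed licenses; anything unrecognized is rejected."""
    return normalize_license(raw) in COMMERCIAL_SAFE
```

The allow-list approach fails closed: a license string the filter does not recognize is treated as unsafe, which matches the strict-filter mitigation listed in the risk analysis.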
 ---
 ## Houseplant Species Taxonomy
 **Problem**: No canonical "houseplant" species list exists. Must construct one.
 **Approach**:
 1. Start with Wikipedia "List of houseplants" (~500 species)
 2. Expand via genus crawl (all Philodendron, all Hoya, etc.)
 3. Cross-reference with RHS, ASPCA, nursery catalogs
 4. Target: **1,000-3,000 species** is realistic for quality dataset
 **Key Genera** (prioritize these — cover 80% of common houseplants):
 ```
 Philodendron, Monstera, Pothos/Epipremnum, Ficus, Dracaena,
 Sansevieria, Calathea, Maranta, Alocasia, Anthurium,
 Peperomia, Hoya, Begonia, Tradescantia, Pilea,
 Aglaonema, Dieffenbachia, Spathiphyllum, Zamioculcas, Crassula
 ```
 ---
 ## Data Quality Requirements
 | Parameter | Minimum | Target | Rationale |
 |-----------|---------|--------|-----------|
 | Images per species | 100 | 300-500 | Below 100 = unreliable classification |
 | Resolution | 256x256 | 512x512+ | Downsample to 224x224 for training |
 | Variety | Single angle | Multi-angle, growth stages, lighting | Generalization |
 | Label accuracy | 80% | 95%+ | iNaturalist "Research Grade" = community verified |
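
The resolution and per-species thresholds from the table could be checked along these lines. This is a sketch under the table's own numbers (256 px minimum side, 100 minimum and 300 target images per species); the function names are hypothetical.

```python
def meets_minimum_resolution(width: int, height: int, minimum: int = 256) -> bool:
    """True when the shorter side is at least `minimum` pixels."""
    return min(width, height) >= minimum


def coverage_report(counts: dict[str, int],
                    minimum: int = 100,
                    target: int = 300) -> dict[str, list[str]]:
    """Group species by how their image count compares with the thresholds."""
    report: dict[str, list[str]] = {"below_minimum": [], "below_target": [], "ok": []}
    for species, n in sorted(counts.items()):
        if n < minimum:
            report["below_minimum"].append(species)
        elif n < target:
            report["below_target"].append(species)
        else:
            report["ok"].append(species)
    return report
```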
 ---
 ## Training Approach Options
 ### Option A: Create ML (Recommended)
 | Pros | Cons |
 |------|------|
 | Native Apple Silicon optimization | Limited hyperparameter control |
 | Outputs CoreML directly | Max ~10K classes practical limit |
 | No Python/ML expertise needed | Less flexible augmentation |
 | Fast iteration | |
 **Best for**: This use case exactly.
 ### Option B: PyTorch + MPS Transfer Learning
 | Pros | Cons |
 |------|------|
 | Full control over architecture | Steeper learning curve |
 | State-of-art augmentation (albumentations) | Manual CoreML conversion |
 | Can use EfficientNet, ConvNeXt, etc. | Slower iteration |
 **Best for**: If Create ML hits limits or you need custom architecture.
 ### Option C: Cloud GPU (Google Colab / AWS Spot)
 | Pros | Cons |
 |------|------|
 | Faster training for large models | Cost |
 | No local resource constraints | Network transfer overhead |
 **Best for**: If dataset exceeds M1 Max memory or you want transformer-based vision models.
 **Recommendation**: Start with Create ML. Pivot to Option B only if needed.
 ---
 ## Pipeline Architecture
 ```
 ┌─────────────────────────────────────────────────────────────────┐
 │                     UNRAID SERVER                                │
 ├─────────────────────────────────────────────────────────────────┤
 │  1. Species List Generator                                       │
 │     └─ Scrape Wikipedia, RHS, expand by genus                   │
 │                                                                  │
 │  2. Image Downloader                                             │
 │     ├─ iNaturalist/GBIF bulk export (primary)                   │
 │     ├─ Flickr API (supplemental)                                │
 │     └─ License filter (CC-BY, CC0 only)                         │
 │                                                                  │
 │  3. Preprocessing Pipeline                                       │
 │     ├─ Resize to 512x512                                        │
 │     ├─ Remove duplicates (perceptual hash)                      │
 │     ├─ Remove low-quality (blur detection, size filter)         │
 │     └─ Organize: /species_name/image_001.jpg                    │
 │                                                                  │
 │  4. Dataset Statistics                                           │
 │     └─ Report per-species counts, flag under-represented        │
 └─────────────────────────────────────────────────────────────────┘
                              │
                              ▼ (rsync/SMB)
 ┌─────────────────────────────────────────────────────────────────┐
 │                      M1 MAX MAC                                  │
 ├─────────────────────────────────────────────────────────────────┤
 │  5. Create ML Training                                           │
 │     ├─ Import dataset folder                                    │
 │     ├─ Train image classifier                                   │
 │     └─ Export .mlmodel                                          │
 │                                                                  │
 │  6. Validation                                                   │
 │     ├─ Test on held-out images                                  │
 │     └─ Test on real-world photos (your phone)                   │
 │                                                                  │
 │  7. Integration                                                  │
 │     └─ Replace Pl@ntNet-300K in PlantGuide                      │
 └─────────────────────────────────────────────────────────────────┘
 ```
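
The duplicate-removal step in stage 3 relies on perceptual hashing. The sketch below shows one simple variant, an average hash, computed over an already downscaled grayscale grid so that it needs no image library. Which hash the pipeline actually uses is not specified here, and the distance threshold of 5 bits is an assumption.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Average hash of a small grayscale grid (for example 8x8 values in 0..255)."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        # One bit per pixel: 1 if brighter than the mean, else 0.
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def is_duplicate(a: int, b: int, max_distance: int = 5) -> bool:
    """Treat two images as duplicates when their hashes differ in few bits."""
    return hamming_distance(a, b) <= max_distance
```

Because each bit only records whether a pixel is above the image's own mean, a uniformly brightened copy of an image produces the same hash, while unrelated images tend to differ in many bits.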
 ---
 ## Timeline
 | Phase | Duration | Output |
 |-------|----------|--------|
 | 1. Species list curation | 1 week | 1,000-3,000 target species with scientific + common names |
 | 2. Pipeline development | 1-2 weeks | Automated scraper on Unraid |
 | 3. Data collection | 2-4 weeks | Running 24/7, rate-limited by APIs |
 | 4. Preprocessing + QA | 1 week | Clean dataset, statistics report |
 | 5. Initial training | 2-3 days | First model with subset (500 species) |
 | 6. Full training | 1 week | Full model, iteration |
 | 7. Validation + tuning | 1 week | Production-ready model |
 **Total: 6-10 weeks**
 ---
 ## Risk Analysis
 | Risk | Likelihood | Mitigation |
 |------|------------|------------|
 | Insufficient images for rare species | High | Accept lower coverage OR merge to genus-level for rare species |
 | API rate limits slow collection | High | Parallelize sources, use bulk exports, patience |
 | Noisy labels degrade accuracy | Medium | Use only "Research Grade" iNaturalist, implement confidence thresholds |
 | Create ML memory limits | Low | M1 Max should handle; fallback to PyTorch |
 | License ambiguity | Low | Strict filter on download, keep metadata |
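
The genus-level fallback for rare species could look like the following sketch. The `"<Genus> sp."` label format and the threshold of 100 images are assumptions for illustration, and both function names are hypothetical.

```python
from collections import Counter


def merge_rare_species(counts: dict[str, int], minimum: int = 100) -> dict[str, str]:
    """Map each species to a training label; rare species share a genus-level label."""
    labels: dict[str, str] = {}
    for species, n in counts.items():
        genus = species.split()[0]
        labels[species] = species if n >= minimum else f"{genus} sp."
    return labels


def label_totals(counts: dict[str, int], labels: dict[str, str]) -> dict[str, int]:
    """Total image count per training label after merging."""
    totals: Counter = Counter()
    for species, n in counts.items():
        totals[labels[species]] += n
    return dict(totals)
```

A merged genus class can still fall below the minimum, so the totals should be checked again after merging.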
 ---
 ## Next Steps
 1. **Build species master list** — Python script to scrape/merge sources
 2. **Set up GBIF bulk download** — Filter: Plantae, captive/cultivated, CC-BY/CC0, has images
 3. **Build Flickr supplemental scraper** — Target under-represented species
 4. **Docker container on Unraid** — Orchestrate pipeline
 5. **Create ML project setup** — Folder structure, initial test with 50 species
 ---
 ## Open Questions
 - Prioritize **speed** (start with 500 species, fast iteration) or **completeness** (build full 3K species list first)?
 - Any specific houseplant species that must be included?
 - Docker running on Unraid already?
@@ -0,0 +1,24 @@
 FROM python:3.11-slim
 WORKDIR /app
 # Install system dependencies
 RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    libffi-dev \
    && rm -rf /var/lib/apt/lists/*
 # Install Python dependencies
 COPY requirements.txt .
 RUN pip install --no-cache-dir -r requirements.txt
 # Copy application code
 COPY . .
 # Create data directories
 RUN mkdir -p /data/db /data/images /data/exports
 EXPOSE 8000
 CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
@@ -0,0 +1,19 @@
 #!/usr/bin/env python
 """Add missing database indexes."""
 from sqlalchemy import text
 from app.database import engine
 with engine.connect() as conn:
    # Single column indexes
    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_license ON images(license)'))
    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_status ON images(status)'))
    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_source ON images(source)'))
    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_species_id ON images(species_id)'))
    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_phash ON images(phash)'))
    # Composite indexes for common query patterns
    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_species_status ON images(species_id, status)'))
    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_status_created ON images(status, created_at)'))
    conn.commit()
    print('All indexes created successfully')
@@ -0,0 +1,42 @@
 [alembic]
 script_location = alembic
 prepend_sys_path = .
 version_path_separator = os
 sqlalchemy.url = sqlite:////data/db/plants.sqlite
 [post_write_hooks]
 [loggers]
 keys = root,sqlalchemy,alembic
 [handlers]
 keys = console
 [formatters]
 keys = generic
 [logger_root]
 level = WARN
 handlers = console
 qualname =
 [logger_sqlalchemy]
 level = WARN
 handlers =
 qualname = sqlalchemy.engine
 [logger_alembic]
 level = INFO
 handlers =
 qualname = alembic
 [handler_console]
 class = StreamHandler
 args = (sys.stderr,)
 level = NOTSET
 formatter = generic
 [formatter_generic]
 format = %(levelname)-5.5s [%(name)s] %(message)s
 datefmt = %H:%M:%S
@@ -0,0 +1,54 @@
 from logging.config import fileConfig
 from sqlalchemy import engine_from_config
 from sqlalchemy import pool
 from alembic import context
 # Import models for autogenerate
 from app.database import Base
 from app.models import Species, Image, Job, ApiKey, Export
 config = context.config
 if config.config_file_name is not None:
    fileConfig(config.config_file_name)
 target_metadata = Base.metadata
 def run_migrations_offline() -> None:
    """Run migrations in 'offline' mode."""
    url = config.get_main_option("sqlalchemy.url")
    context.configure(
        url=url,
        target_metadata=target_metadata,
        literal_binds=True,
        dialect_opts={"paramstyle": "named"},
    )
    with context.begin_transaction():
        context.run_migrations()
 def run_migrations_online() -> None:
    """Run migrations in 'online' mode."""
    connectable = engine_from_config(
        config.get_section(config.config_ini_section, {}),
        prefix="sqlalchemy.",
        poolclass=pool.NullPool,
    )
    with connectable.connect() as connection:
        context.configure(
            connection=connection, target_metadata=target_metadata
        )
        with context.begin_transaction():
            context.run_migrations()
 if context.is_offline_mode():
    run_migrations_offline()
 else:
    run_migrations_online()
@@ -0,0 +1,26 @@
 """${message}
 Revision ID: ${up_revision}
 Revises: ${down_revision | comma,n}
 Create Date: ${create_date}
 """
 from typing import Sequence, Union
 from alembic import op
 import sqlalchemy as sa
 ${imports if imports else ""}
 # revision identifiers, used by Alembic.
 revision: str = ${repr(up_revision)}
 down_revision: Union[str, None] = ${repr(down_revision)}
 branch_labels: Union[str, Sequence[str], None] = ${repr(branch_labels)}
 depends_on: Union[str, Sequence[str], None] = ${repr(depends_on)}
 def upgrade() -> None:
    ${upgrades if upgrades else "pass"}
 def downgrade() -> None:
    ${downgrades if downgrades else "pass"}
@@ -0,0 +1,112 @@
 """Initial migration
 Revision ID: 001
 Revises:
 Create Date: 2024-01-01
 """
 from typing import Sequence, Union
 from alembic import op
 import sqlalchemy as sa
 revision: str = '001'
 down_revision: Union[str, None] = None
 branch_labels: Union[str, Sequence[str], None] = None
 depends_on: Union[str, Sequence[str], None] = None
 def upgrade() -> None:
    # Species table
    op.create_table(
        'species',
        sa.Column('id', sa.Integer(), primary_key=True),
        sa.Column('scientific_name', sa.String(), nullable=False, unique=True),
        sa.Column('common_name', sa.String(), nullable=True),
        sa.Column('genus', sa.String(), nullable=True),
        sa.Column('family', sa.String(), nullable=True),
        sa.Column('created_at', sa.DateTime(), server_default=sa.func.now()),
    )
    op.create_index('ix_species_scientific_name', 'species', ['scientific_name'])
    op.create_index('ix_species_genus', 'species', ['genus'])
    # API Keys table
    op.create_table(
        'api_keys',
        sa.Column('id', sa.Integer(), primary_key=True),
        sa.Column('source', sa.String(), nullable=False, unique=True),
        sa.Column('api_key', sa.String(), nullable=False),
        sa.Column('api_secret', sa.String(), nullable=True),
        sa.Column('rate_limit_per_sec', sa.Float(), default=1.0),
        sa.Column('enabled', sa.Boolean(), default=True),
    )
    # Images table
    op.create_table(
        'images',
        sa.Column('id', sa.Integer(), primary_key=True),
        sa.Column('species_id', sa.Integer(), sa.ForeignKey('species.id'), nullable=False),
        sa.Column('source', sa.String(), nullable=False),
        sa.Column('source_id', sa.String(), nullable=True),
        sa.Column('url', sa.String(), nullable=False),
        sa.Column('local_path', sa.String(), nullable=True),
        sa.Column('license', sa.String(), nullable=False),
        sa.Column('attribution', sa.String(), nullable=True),
        sa.Column('width', sa.Integer(), nullable=True),
        sa.Column('height', sa.Integer(), nullable=True),
        sa.Column('phash', sa.String(), nullable=True),
        sa.Column('quality_score', sa.Float(), nullable=True),
        sa.Column('status', sa.String(), default='pending'),
        sa.Column('created_at', sa.DateTime(), server_default=sa.func.now()),
    )
    op.create_index('ix_images_species_id', 'images', ['species_id'])
    op.create_index('ix_images_source', 'images', ['source'])
    op.create_index('ix_images_status', 'images', ['status'])
    op.create_index('ix_images_phash', 'images', ['phash'])
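    # SQLite cannot add a constraint with ALTER TABLE, so use Alembic batch mode
    with op.batch_alter_table('images') as batch_op:
        batch_op.create_unique_constraint('uq_source_source_id', ['source', 'source_id'])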
    # Jobs table
    op.create_table(
        'jobs',
        sa.Column('id', sa.Integer(), primary_key=True),
        sa.Column('name', sa.String(), nullable=False),
        sa.Column('source', sa.String(), nullable=False),
        sa.Column('species_filter', sa.Text(), nullable=True),
        sa.Column('status', sa.String(), default='pending'),
        sa.Column('progress_current', sa.Integer(), default=0),
        sa.Column('progress_total', sa.Integer(), default=0),
        sa.Column('images_downloaded', sa.Integer(), default=0),
        sa.Column('images_rejected', sa.Integer(), default=0),
        sa.Column('celery_task_id', sa.String(), nullable=True),
        sa.Column('started_at', sa.DateTime(), nullable=True),
        sa.Column('completed_at', sa.DateTime(), nullable=True),
        sa.Column('error_message', sa.Text(), nullable=True),
        sa.Column('created_at', sa.DateTime(), server_default=sa.func.now()),
    )
    op.create_index('ix_jobs_status', 'jobs', ['status'])
    # Exports table
    op.create_table(
        'exports',
        sa.Column('id', sa.Integer(), primary_key=True),
        sa.Column('name', sa.String(), nullable=False),
        sa.Column('filter_criteria', sa.Text(), nullable=True),
        sa.Column('train_split', sa.Float(), default=0.8),
        sa.Column('status', sa.String(), default='pending'),
        sa.Column('file_path', sa.String(), nullable=True),
        sa.Column('file_size', sa.Integer(), nullable=True),
        sa.Column('species_count', sa.Integer(), nullable=True),
        sa.Column('image_count', sa.Integer(), nullable=True),
        sa.Column('celery_task_id', sa.String(), nullable=True),
        sa.Column('created_at', sa.DateTime(), server_default=sa.func.now()),
        sa.Column('completed_at', sa.DateTime(), nullable=True),
        sa.Column('error_message', sa.Text(), nullable=True),
    )
 def downgrade() -> None:
    op.drop_table('exports')
    op.drop_table('jobs')
    op.drop_table('images')
    op.drop_table('api_keys')
    op.drop_table('species')
@@ -0,0 +1,53 @@
 """Add cached_stats table and license index
 Revision ID: 002
 Revises: 001
 Create Date: 2025-01-25
 """
 from typing import Sequence, Union
 from alembic import op
 import sqlalchemy as sa
 revision: str = '002'
 down_revision: Union[str, None] = '001'
 branch_labels: Union[str, Sequence[str], None] = None
 depends_on: Union[str, Sequence[str], None] = None
 def upgrade() -> None:
    # Cached stats table for pre-calculated dashboard statistics
    op.create_table(
        'cached_stats',
        sa.Column('id', sa.Integer(), primary_key=True),
        sa.Column('key', sa.String(50), nullable=False, unique=True),
        sa.Column('value', sa.Text(), nullable=False),
        sa.Column('updated_at', sa.DateTime(), server_default=sa.func.now()),
    )
    op.create_index('ix_cached_stats_key', 'cached_stats', ['key'])
    # Add license index to images table; ignore the error if it already exists
    try:
        op.create_index('ix_images_license', 'images', ['license'])
    except Exception:
        pass  # Index may already exist
    # Add only_without_images column to jobs if it doesn't exist
    try:
        op.add_column('jobs', sa.Column('only_without_images', sa.Boolean(), default=False))
    except Exception:
        pass  # Column may already exist
 def downgrade() -> None:
    try:
        op.drop_index('ix_images_license', 'images')
    except Exception:
        pass
    try:
        op.drop_column('jobs', 'only_without_images')
    except Exception:
        pass
    op.drop_table('cached_stats')
@@ -0,0 +1,31 @@
 """Add max_images column to jobs table
 Revision ID: 003
 Revises: 002
 Create Date: 2025-01-25
 """
 from typing import Sequence, Union
 from alembic import op
 import sqlalchemy as sa
 revision: str = '003'
 down_revision: Union[str, None] = '002'
 branch_labels: Union[str, Sequence[str], None] = None
 depends_on: Union[str, Sequence[str], None] = None
 def upgrade() -> None:
    # Add max_images column to jobs table
    try:
        op.add_column('jobs', sa.Column('max_images', sa.Integer(), nullable=True))
    except Exception:
        pass  # Column may already exist
 def downgrade() -> None:
    try:
        op.drop_column('jobs', 'max_images')
    except Exception:
        pass
@@ -0,0 +1 @@
 # PlantGuideScraper Backend
@@ -0,0 +1 @@
 # API routes
@@ -0,0 +1,175 @@
 import json
 import os
 from typing import Optional
 from fastapi import APIRouter, Depends, HTTPException, Query
 from fastapi.responses import FileResponse
 from sqlalchemy.orm import Session
 from sqlalchemy import func
 from app.database import get_db
 from app.models import Export, Image, Species
 from app.schemas.export import (
    ExportCreate,
    ExportResponse,
    ExportListResponse,
    ExportPreview,
 )
 from app.workers.export_tasks import generate_export
 router = APIRouter()
@router.get("", response_model=ExportListResponse)
 def list_exports(
    limit: int = Query(50, ge=1, le=200),
    db: Session = Depends(get_db),
 ):
    """List all exports."""
    total = db.query(Export).count()
    exports = db.query(Export).order_by(Export.created_at.desc()).limit(limit).all()
    return ExportListResponse(
        items=[ExportResponse.model_validate(e) for e in exports],
        total=total,
    )
@router.post("/preview", response_model=ExportPreview)
 def preview_export(export: ExportCreate, db: Session = Depends(get_db)):
    """Preview export without creating it."""
    criteria = export.filter_criteria
    min_images = criteria.min_images_per_species
    # Count downloaded images per species, applying the optional filters
    species_counts = db.query(
        Image.species_id,
        func.count(Image.id).label("count")
    ).filter(Image.status == "downloaded")
    if criteria.licenses:
        species_counts = species_counts.filter(Image.license.in_(criteria.licenses))
    if criteria.min_quality:
        species_counts = species_counts.filter(Image.quality_score >= criteria.min_quality)
    if criteria.species_ids:
        species_counts = species_counts.filter(Image.species_id.in_(criteria.species_ids))
    species_counts = species_counts.group_by(Image.species_id).all()
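    # Unpack rows positionally: on SQLAlchemy 1.4+ `row.count` can resolve to the
    # sequence count() method instead of the labelled column.
    valid_species = [(species_id, n) for species_id, n in species_counts if n >= min_images]
    total_images = sum(n for _, n in valid_species)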
    # Estimate file size (rough: 50KB per image)
    estimated_size_mb = (total_images * 50) / 1024
    return ExportPreview(
        species_count=len(valid_species),
        image_count=total_images,
        estimated_size_mb=estimated_size_mb,
    )
@router.post("", response_model=ExportResponse)
 def create_export(export: ExportCreate, db: Session = Depends(get_db)):
    """Create and start a new export job."""
    db_export = Export(
        name=export.name,
        filter_criteria=export.filter_criteria.model_dump_json(),
        train_split=export.train_split,
        status="pending",
    )
    db.add(db_export)
    db.commit()
    db.refresh(db_export)
    # Start Celery task
    task = generate_export.delay(db_export.id)
    db_export.celery_task_id = task.id
    db.commit()
    return ExportResponse.model_validate(db_export)
@router.get("/{export_id}", response_model=ExportResponse)
 def get_export(export_id: int, db: Session = Depends(get_db)):
    """Get export status."""
    export = db.query(Export).filter(Export.id == export_id).first()
    if not export:
        raise HTTPException(status_code=404, detail="Export not found")
    return ExportResponse.model_validate(export)
@router.get("/{export_id}/progress")
 def get_export_progress(export_id: int, db: Session = Depends(get_db)):
    """Get real-time export progress."""
    from app.workers.celery_app import celery_app
    export = db.query(Export).filter(Export.id == export_id).first()
    if not export:
        raise HTTPException(status_code=404, detail="Export not found")
    if not export.celery_task_id:
        return {"status": export.status}
    result = celery_app.AsyncResult(export.celery_task_id)
    if result.state == "PROGRESS":
        meta = result.info
        return {
            "status": "generating",
            "current": meta.get("current", 0),
            "total": meta.get("total", 0),
            "current_species": meta.get("species", ""),
        }
    return {"status": export.status}
@router.get("/{export_id}/download")
 def download_export(export_id: int, db: Session = Depends(get_db)):
    """Download export zip file."""
    export = db.query(Export).filter(Export.id == export_id).first()
    if not export:
        raise HTTPException(status_code=404, detail="Export not found")
    if export.status != "completed":
        raise HTTPException(status_code=400, detail="Export not ready")
    if not export.file_path or not os.path.exists(export.file_path):
        raise HTTPException(status_code=404, detail="Export file not found")
    return FileResponse(
        export.file_path,
        media_type="application/zip",
        filename=f"{export.name}.zip",
    )
@router.delete("/{export_id}")
 def delete_export(export_id: int, db: Session = Depends(get_db)):
    """Delete an export and its file."""
    export = db.query(Export).filter(Export.id == export_id).first()
    if not export:
        raise HTTPException(status_code=404, detail="Export not found")
    # Delete file if exists
    if export.file_path and os.path.exists(export.file_path):
        os.remove(export.file_path)
    db.delete(export)
    db.commit()
    return {"status": "deleted"}
@@ -0,0 +1,441 @@
 import os
 import shutil
 import uuid
 from pathlib import Path
 from typing import Optional, List
 from fastapi import APIRouter, Depends, HTTPException, Query
 from fastapi.responses import FileResponse
 from sqlalchemy.orm import Session
 from sqlalchemy import func
 from PIL import Image as PILImage
 from app.database import get_db
 from app.models import Image, Species
 from app.schemas.image import ImageResponse, ImageListResponse
 from app.config import get_settings
 router = APIRouter()
 settings = get_settings()
@router.get("", response_model=ImageListResponse)
 def list_images(
    page: int = Query(1, ge=1),
    page_size: int = Query(50, ge=1, le=200),
    species_id: Optional[int] = None,
    source: Optional[str] = None,
    license: Optional[str] = None,
    status: Optional[str] = None,
    min_quality: Optional[float] = None,
    search: Optional[str] = None,
    db: Session = Depends(get_db),
 ):
    """List images with pagination and filters."""
    # Use joinedload to fetch species in single query
    from sqlalchemy.orm import joinedload
    query = db.query(Image).options(joinedload(Image.species))
    # Collect column filters once so the page query and the count query stay in sync
    filters = []
    if species_id is not None:
        filters.append(Image.species_id == species_id)
    if source:
        filters.append(Image.source == source)
    if license:
        filters.append(Image.license == license)
    if status:
        filters.append(Image.status == status)
    # Compare against None so that min_quality=0.0 is still applied
    if min_quality is not None:
        filters.append(Image.quality_score >= min_quality)
    query = query.filter(*filters)
    if search:
        search_term = f"%{search}%"
        query = query.join(Species).filter(
            (Species.scientific_name.ilike(search_term)) |
            (Species.common_name.ilike(search_term))
        )
        total = query.count()
    else:
        # Count without the species join for better performance
        total = db.query(func.count(Image.id)).filter(*filters).scalar()
    pages = (total + page_size - 1) // page_size
    images = query.order_by(Image.created_at.desc()).offset(
        (page - 1) * page_size
    ).limit(page_size).all()
    items = [
        ImageResponse(
            id=img.id,
            species_id=img.species_id,
            species_name=img.species.scientific_name if img.species else None,
            source=img.source,
            source_id=img.source_id,
            url=img.url,
            local_path=img.local_path,
            license=img.license,
            attribution=img.attribution,
            width=img.width,
            height=img.height,
            quality_score=img.quality_score,
            status=img.status,
            created_at=img.created_at,
        )
        for img in images
    ]
    return ImageListResponse(
        items=items,
        total=total,
        page=page,
        page_size=page_size,
        pages=pages,
    )
@router.get("/sources")
 def list_sources(db: Session = Depends(get_db)):
    """List all unique image sources."""
    sources = db.query(Image.source).distinct().all()
    return [s[0] for s in sources]
@router.get("/licenses")
 def list_licenses(db: Session = Depends(get_db)):
    """List all unique licenses."""
    licenses = db.query(Image.license).distinct().all()
    return [row[0] for row in licenses]
@router.post("/process-pending")
 def process_pending_images(
    source: Optional[str] = None,
    db: Session = Depends(get_db),
 ):
    """Queue all pending images for download and processing."""
    from app.workers.quality_tasks import batch_process_pending_images
    query = db.query(func.count(Image.id)).filter(Image.status == "pending")
    if source:
        query = query.filter(Image.source == source)
    pending_count = query.scalar()
    task = batch_process_pending_images.delay(source=source)
    return {
        "pending_count": pending_count,
        "task_id": task.id,
    }
@router.get("/process-pending/status/{task_id}")
 def process_pending_status(task_id: str):
    """Check status of a batch processing task."""
    from app.workers.celery_app import celery_app
    result = celery_app.AsyncResult(task_id)
    state = result.state  # PENDING, STARTED, PROGRESS, SUCCESS, FAILURE
    response = {"task_id": task_id, "state": state}
    if state == "PROGRESS" and isinstance(result.info, dict):
        response["queued"] = result.info.get("queued", 0)
        response["total"] = result.info.get("total", 0)
    elif state == "SUCCESS" and isinstance(result.result, dict):
        response["queued"] = result.result.get("queued", 0)
        response["total"] = result.result.get("total", 0)
    return response
@router.get("/{image_id}", response_model=ImageResponse)
 def get_image(image_id: int, db: Session = Depends(get_db)):
    """Get an image by ID."""
    image = db.query(Image).filter(Image.id == image_id).first()
    if not image:
        raise HTTPException(status_code=404, detail="Image not found")
    return ImageResponse(
        id=image.id,
        species_id=image.species_id,
        species_name=image.species.scientific_name if image.species else None,
        source=image.source,
        source_id=image.source_id,
        url=image.url,
        local_path=image.local_path,
        license=image.license,
        attribution=image.attribution,
        width=image.width,
        height=image.height,
        quality_score=image.quality_score,
        status=image.status,
        created_at=image.created_at,
    )
@router.get("/{image_id}/file")
 def get_image_file(image_id: int, db: Session = Depends(get_db)):
    """Get the actual image file."""
    image = db.query(Image).filter(Image.id == image_id).first()
    if not image:
        raise HTTPException(status_code=404, detail="Image not found")
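    # Return 404 (not a server error) when the record exists but the file is gone
    if not image.local_path or not os.path.exists(image.local_path):
        raise HTTPException(status_code=404, detail="Image file not available")
    # Imported files may be PNG, so derive the media type from the file extension
    import mimetypes
    media_type = mimetypes.guess_type(image.local_path)[0] or "image/jpeg"
    return FileResponse(image.local_path, media_type=media_type)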
@router.delete("/{image_id}")
 def delete_image(image_id: int, db: Session = Depends(get_db)):
    """Delete an image."""
    image = db.query(Image).filter(Image.id == image_id).first()
    if not image:
        raise HTTPException(status_code=404, detail="Image not found")
    # Delete file if exists (os is already imported at module level)
    if image.local_path and os.path.exists(image.local_path):
        os.remove(image.local_path)
    db.delete(image)
    db.commit()
    return {"status": "deleted"}
@router.post("/bulk-delete")
 def bulk_delete_images(
    image_ids: List[int],
    db: Session = Depends(get_db),
 ):
    """Delete multiple images."""
    images = db.query(Image).filter(Image.id.in_(image_ids)).all()
    deleted = 0
    for image in images:
        if image.local_path and os.path.exists(image.local_path):
            os.remove(image.local_path)
        db.delete(image)
        deleted += 1
    db.commit()
    return {"deleted": deleted}
@router.get("/import/scan")
 def scan_imports(db: Session = Depends(get_db)):
    """Scan the imports folder and return what can be imported.
    Expected structure: imports/{source}/{species_name}/*.jpg (also *.jpeg, *.png)
    """
    imports_path = Path(settings.imports_path)
    if not imports_path.exists():
        return {
            "available": False,
            "message": f"Imports folder not found: {imports_path}",
            "sources": [],
            "total_images": 0,
            "matched_species": 0,
            "unmatched_species": [],
        }
    results = {
        "available": True,
        "sources": [],
        "total_images": 0,
        "matched_species": 0,
        "unmatched_species": [],
    }
    # Get all species for matching
    species_map = {}
    for species in db.query(Species).all():
        # Map by scientific name with underscores and spaces
        species_map[species.scientific_name.lower()] = species
        species_map[species.scientific_name.replace(" ", "_").lower()] = species
    seen_unmatched = set()
    # Scan source folders
    for source_dir in imports_path.iterdir():
        if not source_dir.is_dir():
            continue
        source_name = source_dir.name
        source_info = {
            "name": source_name,
            "species_count": 0,
            "image_count": 0,
        }
        # Scan species folders within source
        for species_dir in source_dir.iterdir():
            if not species_dir.is_dir():
                continue
            species_name = species_dir.name.replace("_", " ")
            species_key = species_name.lower()
            # Count images
            image_files = list(species_dir.glob("*.jpg")) + \
                         list(species_dir.glob("*.jpeg")) + \
                         list(species_dir.glob("*.png"))
            if not image_files:
                continue
            source_info["image_count"] += len(image_files)
            results["total_images"] += len(image_files)
            if species_key in species_map or species_dir.name.lower() in species_map:
                source_info["species_count"] += 1
                results["matched_species"] += 1
            else:
                if species_name not in seen_unmatched:
                    seen_unmatched.add(species_name)
                    results["unmatched_species"].append(species_name)
        if source_info["image_count"] > 0:
            results["sources"].append(source_info)
    return results
@router.post("/import/run")
 def run_import(
    move_files: bool = Query(False, description="Move files instead of copy"),
    db: Session = Depends(get_db),
 ):
    """Import images from the imports folder.
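    Expected structure: imports/{source}/{species_name}/*.jpg (also *.jpeg, *.png)
    Images are copied/moved to: images/{species_name}/{source}_{stem}_{uuid8}{ext}
    where uuid8 is the first 8 hex characters of a random UUID and .jpeg is normalized to .jpg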
    """
    imports_path = Path(settings.imports_path)
    images_path = Path(settings.images_path)
    if not imports_path.exists():
        raise HTTPException(status_code=400, detail="Imports folder not found")
    # Get all species for matching
    species_map = {}
    for species in db.query(Species).all():
        species_map[species.scientific_name.lower()] = species
        species_map[species.scientific_name.replace(" ", "_").lower()] = species
    imported = 0
    skipped = 0
    errors = []
    # Scan source folders
    for source_dir in imports_path.iterdir():
        if not source_dir.is_dir():
            continue
        source_name = source_dir.name
        # Scan species folders within source
        for species_dir in source_dir.iterdir():
            if not species_dir.is_dir():
                continue
            species_name = species_dir.name.replace("_", " ")
            species_key = species_name.lower()
            # Find matching species
            species = species_map.get(species_key) or species_map.get(species_dir.name.lower())
            if not species:
                continue
            # Create target directory
            target_dir = images_path / species.scientific_name.replace(" ", "_")
            target_dir.mkdir(parents=True, exist_ok=True)
            # Process images
            image_files = list(species_dir.glob("*.jpg")) + \
                         list(species_dir.glob("*.jpeg")) + \
                         list(species_dir.glob("*.png"))
            for img_file in image_files:
                try:
                    # Generate unique filename
                    ext = img_file.suffix.lower()
                    if ext == ".jpeg":
                        ext = ".jpg"
                    new_filename = f"{source_name}_{img_file.stem}_{uuid.uuid4().hex[:8]}{ext}"
                    target_path = target_dir / new_filename
                    # Check if already imported (by original filename pattern)
                    existing = db.query(Image).filter(
                        Image.species_id == species.id,
                        Image.source == source_name,
                        Image.source_id == img_file.stem,
                    ).first()
                    if existing:
                        skipped += 1
                        continue
                    # Get image dimensions
                    try:
                        with PILImage.open(img_file) as pil_img:
                            width, height = pil_img.size
                    except Exception:
                        width, height = None, None
                    # Copy or move file
                    if move_files:
                        shutil.move(str(img_file), str(target_path))
                    else:
                        shutil.copy2(str(img_file), str(target_path))
                    # Create database record
                    image = Image(
                        species_id=species.id,
                        source=source_name,
                        source_id=img_file.stem,
                        url=f"file://{img_file}",
                        local_path=str(target_path),
                        license="unknown",
                        width=width,
                        height=height,
                        status="downloaded",
                    )
                    db.add(image)
                    imported += 1
                except Exception as e:
                    errors.append(f"{img_file}: {str(e)}")
            # Commit after each species to avoid large transactions
            db.commit()
    return {
        "imported": imported,
        "skipped": skipped,
        "errors": errors[:20],
    }
@@ -0,0 +1,173 @@
 import json
 from typing import Optional
 from fastapi import APIRouter, Depends, HTTPException, Query
 from sqlalchemy.orm import Session
 from app.database import get_db
 from app.models import Job
 from app.schemas.job import JobCreate, JobResponse, JobListResponse
 from app.workers.scrape_tasks import run_scrape_job
 router = APIRouter()
@router.get("", response_model=JobListResponse)
 def list_jobs(
    status: Optional[str] = None,
    source: Optional[str] = None,
    limit: int = Query(50, ge=1, le=200),
    db: Session = Depends(get_db),
 ):
    """List all jobs."""
    query = db.query(Job)
    if status:
        query = query.filter(Job.status == status)
    if source:
        query = query.filter(Job.source == source)
    total = query.count()
    jobs = query.order_by(Job.created_at.desc()).limit(limit).all()
    return JobListResponse(
        items=[JobResponse.model_validate(j) for j in jobs],
        total=total,
    )
@router.post("", response_model=JobResponse)
 def create_job(job: JobCreate, db: Session = Depends(get_db)):
    """Create and start a new scrape job."""
    species_filter = None
    if job.species_ids:
        species_filter = json.dumps(job.species_ids)
    db_job = Job(
        name=job.name,
        source=job.source,
        species_filter=species_filter,
        only_without_images=job.only_without_images,
        max_images=job.max_images,
        status="pending",
    )
    db.add(db_job)
    db.commit()
    db.refresh(db_job)
    # Start the Celery task
    task = run_scrape_job.delay(db_job.id)
    db_job.celery_task_id = task.id
    db.commit()
    return JobResponse.model_validate(db_job)
@router.get("/{job_id}", response_model=JobResponse)
 def get_job(job_id: int, db: Session = Depends(get_db)):
    """Get job status."""
    job = db.query(Job).filter(Job.id == job_id).first()
    if not job:
        raise HTTPException(status_code=404, detail="Job not found")
    return JobResponse.model_validate(job)
@router.get("/{job_id}/progress")
 def get_job_progress(job_id: int, db: Session = Depends(get_db)):
    """Get real-time job progress from Celery."""
    from app.workers.celery_app import celery_app
    job = db.query(Job).filter(Job.id == job_id).first()
    if not job:
        raise HTTPException(status_code=404, detail="Job not found")
    if not job.celery_task_id:
        return {
            "status": job.status,
            "progress_current": job.progress_current,
            "progress_total": job.progress_total,
        }
    # Get Celery task state
    result = celery_app.AsyncResult(job.celery_task_id)
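    if result.state == "PROGRESS":
        # result.info carries the task's custom meta; guard against non-dict values
        meta = result.info if isinstance(result.info, dict) else {}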
        return {
            "status": "running",
            "progress_current": meta.get("current", 0),
            "progress_total": meta.get("total", 0),
            "current_species": meta.get("species", ""),
        }
    return {
        "status": job.status,
        "progress_current": job.progress_current,
        "progress_total": job.progress_total,
    }
@router.post("/{job_id}/pause")
 def pause_job(job_id: int, db: Session = Depends(get_db)):
    """Pause a running job."""
    from app.workers.celery_app import celery_app
    job = db.query(Job).filter(Job.id == job_id).first()
    if not job:
        raise HTTPException(status_code=404, detail="Job not found")
    if job.status != "running":
        raise HTTPException(status_code=400, detail="Job is not running")
    # Revoke Celery task
    if job.celery_task_id:
        celery_app.control.revoke(job.celery_task_id, terminate=True)
    job.status = "paused"
    db.commit()
    return {"status": "paused"}
@router.post("/{job_id}/resume")
 def resume_job(job_id: int, db: Session = Depends(get_db)):
    """Resume a paused job."""
    job = db.query(Job).filter(Job.id == job_id).first()
    if not job:
        raise HTTPException(status_code=404, detail="Job not found")
    if job.status != "paused":
        raise HTTPException(status_code=400, detail="Job is not paused")
    # Start new Celery task
    task = run_scrape_job.delay(job.id)
    job.celery_task_id = task.id
    job.status = "pending"
    db.commit()
    return {"status": "resumed"}
@router.post("/{job_id}/cancel")
 def cancel_job(job_id: int, db: Session = Depends(get_db)):
    """Cancel a job."""
    from app.workers.celery_app import celery_app
    job = db.query(Job).filter(Job.id == job_id).first()
    if not job:
        raise HTTPException(status_code=404, detail="Job not found")
    if job.status in ["completed", "failed"]:
        raise HTTPException(status_code=400, detail="Job already finished")
    # Revoke Celery task
    if job.celery_task_id:
        celery_app.control.revoke(job.celery_task_id, terminate=True)
    job.status = "failed"
    job.error_message = "Cancelled by user"
    db.commit()
    return {"status": "cancelled"}
@@ -0,0 +1,198 @@
 from fastapi import APIRouter, Depends, HTTPException
 from sqlalchemy.orm import Session
 from app.database import get_db
 from app.models import ApiKey
 from app.schemas.api_key import ApiKeyCreate, ApiKeyUpdate, ApiKeyResponse
 router = APIRouter()
 # Available sources
 # auth_type: "none" (no auth), "api_key" (single key), "api_key_secret" (key + secret), "oauth" (client_id + client_secret + access_token)
 # default_rate: safe default requests per second for each API
 AVAILABLE_SOURCES = [
    {"name": "gbif", "label": "GBIF", "requires_secret": False, "auth_type": "none", "default_rate": 1.0},  # Free, no auth required
    {"name": "inaturalist", "label": "iNaturalist", "requires_secret": True, "auth_type": "api_key_secret", "default_rate": 1.0},  # 60/min limit
    {"name": "flickr", "label": "Flickr", "requires_secret": True, "auth_type": "api_key_secret", "default_rate": 0.5},  # 3600/hr shared limit
    {"name": "wikimedia", "label": "Wikimedia Commons", "requires_secret": True, "auth_type": "oauth", "default_rate": 1.0},  # generous limits
    {"name": "trefle", "label": "Trefle.io", "requires_secret": False, "auth_type": "api_key", "default_rate": 1.0},  # 120/min limit
    {"name": "duckduckgo", "label": "DuckDuckGo", "requires_secret": False, "auth_type": "none", "default_rate": 0.5},  # Web search, no API key
    {"name": "bing", "label": "Bing Image Search", "requires_secret": False, "auth_type": "api_key", "default_rate": 3.0},  # Azure Cognitive Services
 ]
 def mask_api_key(key: str) -> str:
    """Mask API key, showing only last 4 characters."""
    if not key or len(key) <= 4:
        return "****"
    return "*" * (len(key) - 4) + key[-4:]
@router.get("")
 def list_sources(db: Session = Depends(get_db)):
    """List all available sources with their configuration status."""
    api_keys = {k.source: k for k in db.query(ApiKey).all()}
    result = []
    for source in AVAILABLE_SOURCES:
        api_key = api_keys.get(source["name"])
        default_rate = source.get("default_rate", 1.0)
        result.append({
            "name": source["name"],
            "label": source["label"],
            "requires_secret": source["requires_secret"],
            "auth_type": source.get("auth_type", "api_key"),
            "configured": api_key is not None,
            "enabled": api_key.enabled if api_key else False,
            "api_key_masked": mask_api_key(api_key.api_key) if api_key else None,
            "has_secret": bool(api_key.api_secret) if api_key else False,
            "has_access_token": bool(getattr(api_key, 'access_token', None)) if api_key else False,
            "rate_limit_per_sec": api_key.rate_limit_per_sec if api_key else default_rate,
            "default_rate": default_rate,
        })
    return result
@router.get("/{source}")
 def get_source(source: str, db: Session = Depends(get_db)):
    """Get source configuration."""
    source_info = next((s for s in AVAILABLE_SOURCES if s["name"] == source), None)
    if not source_info:
        raise HTTPException(status_code=404, detail="Unknown source")
    api_key = db.query(ApiKey).filter(ApiKey.source == source).first()
    default_rate = source_info.get("default_rate", 1.0)
    return {
        "name": source_info["name"],
        "label": source_info["label"],
        "requires_secret": source_info["requires_secret"],
        "auth_type": source_info.get("auth_type", "api_key"),
        "configured": api_key is not None,
        "enabled": api_key.enabled if api_key else False,
        "api_key_masked": mask_api_key(api_key.api_key) if api_key else None,
        "has_secret": bool(api_key.api_secret) if api_key else False,
        "has_access_token": bool(getattr(api_key, 'access_token', None)) if api_key else False,
        "rate_limit_per_sec": api_key.rate_limit_per_sec if api_key else default_rate,
        "default_rate": default_rate,
    }
@router.put("/{source}")
 def update_source(
    source: str,
    config: ApiKeyCreate,
    db: Session = Depends(get_db),
 ):
    """Create or update source configuration."""
    source_info = next((s for s in AVAILABLE_SOURCES if s["name"] == source), None)
    if not source_info:
        raise HTTPException(status_code=404, detail="Unknown source")
    # For sources that require auth, validate api_key is provided
    auth_type = source_info.get("auth_type", "api_key")
    if auth_type != "none" and not config.api_key:
        raise HTTPException(status_code=400, detail="API key is required for this source")
    api_key = db.query(ApiKey).filter(ApiKey.source == source).first()
    # Use placeholder for no-auth sources
    api_key_value = config.api_key or "no-auth"
    if api_key:
        # Update existing
        api_key.api_key = api_key_value
        if config.api_secret:
            api_key.api_secret = config.api_secret
        if config.access_token:
            api_key.access_token = config.access_token
        api_key.rate_limit_per_sec = config.rate_limit_per_sec
        api_key.enabled = config.enabled
    else:
        # Create new
        api_key = ApiKey(
            source=source,
            api_key=api_key_value,
            api_secret=config.api_secret,
            access_token=config.access_token,
            rate_limit_per_sec=config.rate_limit_per_sec,
            enabled=config.enabled,
        )
        db.add(api_key)
    db.commit()
    db.refresh(api_key)
    return {
        "name": source,
        "configured": True,
        "enabled": api_key.enabled,
        "api_key_masked": mask_api_key(api_key.api_key) if auth_type != "none" else None,
        "has_secret": bool(api_key.api_secret),
        "has_access_token": bool(api_key.access_token),
        "rate_limit_per_sec": api_key.rate_limit_per_sec,
    }
@router.patch("/{source}")
 def patch_source(
    source: str,
    config: ApiKeyUpdate,
    db: Session = Depends(get_db),
 ):
    """Partially update source configuration."""
    api_key = db.query(ApiKey).filter(ApiKey.source == source).first()
    if not api_key:
        raise HTTPException(status_code=404, detail="Source not configured")
    update_data = config.model_dump(exclude_unset=True)
    for field, value in update_data.items():
        setattr(api_key, field, value)
    db.commit()
    db.refresh(api_key)
    return {
        "name": source,
        "configured": True,
        "enabled": api_key.enabled,
        "api_key_masked": mask_api_key(api_key.api_key),
        "has_secret": bool(api_key.api_secret),
        "has_access_token": bool(api_key.access_token),
        "rate_limit_per_sec": api_key.rate_limit_per_sec,
    }
@router.delete("/{source}")
 def delete_source(source: str, db: Session = Depends(get_db)):
    """Delete source configuration."""
    api_key = db.query(ApiKey).filter(ApiKey.source == source).first()
    if not api_key:
        raise HTTPException(status_code=404, detail="Source not configured")
    db.delete(api_key)
    db.commit()
    return {"status": "deleted"}
@router.post("/{source}/test")
 def test_source(source: str, db: Session = Depends(get_db)):
    """Test source API connection."""
    api_key = db.query(ApiKey).filter(ApiKey.source == source).first()
    if not api_key:
        raise HTTPException(status_code=404, detail="Source not configured")
    # Import and test the scraper
    from app.scrapers import get_scraper
    scraper = get_scraper(source)
    if not scraper:
        raise HTTPException(status_code=400, detail="No scraper for this source")
    try:
        result = scraper.test_connection(api_key)
        return {"status": "success", "message": result}
    except Exception as e:
        return {"status": "error", "message": str(e)}
@@ -0,0 +1,366 @@
 import csv
 import io
 import json
 from typing import Optional
 from fastapi import APIRouter, Depends, HTTPException, Query, UploadFile, File
 from sqlalchemy.orm import Session
 from sqlalchemy import func, text
 from app.database import get_db
 from app.models import Species, Image
 from app.schemas.species import (
    SpeciesCreate,
    SpeciesUpdate,
    SpeciesResponse,
    SpeciesListResponse,
    SpeciesImportResponse,
 )
 router = APIRouter()
 def get_species_with_count(db: Session, species: Species) -> SpeciesResponse:
    """Get species response with image count."""
    image_count = db.query(func.count(Image.id)).filter(
        Image.species_id == species.id,
        Image.status == "downloaded"
    ).scalar()
    return SpeciesResponse(
        id=species.id,
        scientific_name=species.scientific_name,
        common_name=species.common_name,
        genus=species.genus,
        family=species.family,
        created_at=species.created_at,
        image_count=image_count or 0,
    )
@router.get("", response_model=SpeciesListResponse)
 def list_species(
    page: int = Query(1, ge=1),
    page_size: int = Query(50, ge=1, le=500),
    search: Optional[str] = None,
    genus: Optional[str] = None,
    has_images: Optional[bool] = None,
    max_images: Optional[int] = Query(None, description="Filter species with fewer than N images"),
    min_images: Optional[int] = Query(None, description="Filter species with at least N images"),
    db: Session = Depends(get_db),
 ):
    """List species with pagination and filters.
    Filters:
    - search: Search by scientific or common name
    - genus: Filter by genus
    - has_images: True for species with images, False for species without
    - max_images: Filter species with fewer than N downloaded images
    - min_images: Filter species with at least N downloaded images
    """
    # If filtering by image count, we need to use a subquery approach
    if max_images is not None or min_images is not None:
        # Build a subquery with image counts per species
        image_counts = (
            db.query(
                Species.id.label("species_id"),
                func.count(Image.id).label("img_count")
            )
            .outerjoin(Image, (Image.species_id == Species.id) & (Image.status == "downloaded"))
            .group_by(Species.id)
            .subquery()
        )
        # Join species with their counts
        query = db.query(Species).join(
            image_counts, Species.id == image_counts.c.species_id
        )
        if max_images is not None:
            query = query.filter(image_counts.c.img_count < max_images)
        if min_images is not None:
            query = query.filter(image_counts.c.img_count >= min_images)
    else:
        query = db.query(Species)
    if search:
        search_term = f"%{search}%"
        query = query.filter(
            (Species.scientific_name.ilike(search_term)) |
            (Species.common_name.ilike(search_term))
        )
    if genus:
        query = query.filter(Species.genus == genus)
    # Filter by whether species has downloaded images (only if not using min/max filters)
    if has_images is not None and max_images is None and min_images is None:
        # Get IDs of species that have at least one downloaded image
        species_with_images = (
            db.query(Image.species_id)
            .filter(Image.status == "downloaded")
            .distinct()
            .subquery()
        )
        if has_images:
            query = query.filter(Species.id.in_(db.query(species_with_images.c.species_id)))
        else:
            query = query.filter(~Species.id.in_(db.query(species_with_images.c.species_id)))
    total = query.count()
    pages = (total + page_size - 1) // page_size
    species_list = query.order_by(Species.scientific_name).offset(
        (page - 1) * page_size
    ).limit(page_size).all()
    # Fetch image counts in bulk for all species on this page
    species_ids = [s.id for s in species_list]
    if species_ids:
        count_query = db.query(
            Image.species_id,
            func.count(Image.id)
        ).filter(
            Image.species_id.in_(species_ids),
            Image.status == "downloaded"
        ).group_by(Image.species_id).all()
        count_map = {species_id: count for species_id, count in count_query}
    else:
        count_map = {}
    items = [
        SpeciesResponse(
            id=s.id,
            scientific_name=s.scientific_name,
            common_name=s.common_name,
            genus=s.genus,
            family=s.family,
            created_at=s.created_at,
            image_count=count_map.get(s.id, 0),
        )
        for s in species_list
    ]
    return SpeciesListResponse(
        items=items,
        total=total,
        page=page,
        page_size=page_size,
        pages=pages,
    )
@router.post("", response_model=SpeciesResponse)
 def create_species(species: SpeciesCreate, db: Session = Depends(get_db)):
    """Create a new species."""
    existing = db.query(Species).filter(
        Species.scientific_name == species.scientific_name
    ).first()
    if existing:
        raise HTTPException(status_code=400, detail="Species already exists")
    # Auto-extract genus from scientific name if not provided
    genus = species.genus
    if not genus and " " in species.scientific_name:
        genus = species.scientific_name.split()[0]
    db_species = Species(
        scientific_name=species.scientific_name,
        common_name=species.common_name,
        genus=genus,
        family=species.family,
    )
    db.add(db_species)
    db.commit()
    db.refresh(db_species)
    return get_species_with_count(db, db_species)
@router.post("/import", response_model=SpeciesImportResponse)
 async def import_species(
    file: UploadFile = File(...),
    db: Session = Depends(get_db),
 ):
    """Import species from CSV file.
    Expected columns: scientific_name, common_name (optional), genus (optional), family (optional)
    """
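    # filename may be None for some clients; compare case-insensitively
    if not (file.filename or "").lower().endswith(".csv"):
        raise HTTPException(status_code=400, detail="File must be a CSV")
    content = await file.read()
    try:
        # utf-8-sig also strips the BOM that spreadsheet exports often prepend
        text = content.decode("utf-8-sig")
    except UnicodeDecodeError:
        raise HTTPException(status_code=400, detail="File must be UTF-8 encoded")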
    reader = csv.DictReader(io.StringIO(text))
    imported = 0
    skipped = 0
    errors = []
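    seen = set()  # names already queued from this upload
    for row_num, row in enumerate(reader, start=2):
        # DictReader yields None for cells missing from short rows
        scientific_name = (row.get("scientific_name") or "").strip()
        if not scientific_name:
            errors.append(f"Row {row_num}: Missing scientific_name")
            continue
        # Skip names already in the database or repeated earlier in this file.
        # The session does not autoflush, so the query alone would miss repeats
        # and the final commit would fail on the unique constraint.
        existing = scientific_name in seen or db.query(Species).filter(
            Species.scientific_name == scientific_name
        ).first()
        if existing:
            skipped += 1
            continue
        seen.add(scientific_name)
        # Auto-extract genus if not provided
        genus = (row.get("genus") or "").strip()
        if not genus and " " in scientific_name:
            genus = scientific_name.split()[0]
        try:
            species = Species(
                scientific_name=scientific_name,
                common_name=(row.get("common_name") or "").strip() or None,
                genus=genus or None,
                family=(row.get("family") or "").strip() or None,
            )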
            db.add(species)
            imported += 1
        except Exception as e:
            errors.append(f"Row {row_num}: {str(e)}")
    db.commit()
    return SpeciesImportResponse(
        imported=imported,
        skipped=skipped,
        errors=errors[:10],  # Limit error messages
    )
@router.post("/import-json", response_model=SpeciesImportResponse)
 async def import_species_json(
    file: UploadFile = File(...),
    db: Session = Depends(get_db),
 ):
    """Import species from JSON file.
    Expected format: {"plants": [{"scientific_name": "...", "common_names": [...], "family": "..."}]}
    """
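    # filename may be None for some clients; compare case-insensitively
    if not (file.filename or "").lower().endswith(".json"):
        raise HTTPException(status_code=400, detail="File must be a JSON")
    content = await file.read()
    try:
        data = json.loads(content.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as e:
        raise HTTPException(status_code=400, detail=f"Invalid JSON: {e}")
    # A top-level list or scalar has no .get(); treat it as "no plants"
    plants = data.get("plants", []) if isinstance(data, dict) else []
    if not plants:
        raise HTTPException(status_code=400, detail="No plants found in JSON")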
    imported = 0
    skipped = 0
    errors = []
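    seen = set()  # names already queued from this upload
    for idx, plant in enumerate(plants):
        # An explicit null in the JSON arrives as None, which has no .strip()
        scientific_name = (plant.get("scientific_name") or "").strip()
        if not scientific_name:
            errors.append(f"Plant {idx}: Missing scientific_name")
            continue
        # Skip names already in the database or repeated earlier in this file.
        # The session does not autoflush, so the query alone would miss repeats.
        existing = scientific_name in seen or db.query(Species).filter(
            Species.scientific_name == scientific_name
        ).first()
        if existing:
            skipped += 1
            continue
        seen.add(scientific_name)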
        # Auto-extract genus from scientific name
        genus = None
        if " " in scientific_name:
            genus = scientific_name.split()[0]
        # Get first common name if array provided
        common_names = plant.get("common_names", [])
        common_name = common_names[0] if common_names else None
        try:
            species = Species(
                scientific_name=scientific_name,
                common_name=common_name,
                genus=genus,
                family=plant.get("family"),
            )
            db.add(species)
            imported += 1
        except Exception as e:
            errors.append(f"Plant {idx}: {str(e)}")
    db.commit()
    return SpeciesImportResponse(
        imported=imported,
        skipped=skipped,
        errors=errors[:10],
    )
@router.get("/{species_id}", response_model=SpeciesResponse)
 def get_species(species_id: int, db: Session = Depends(get_db)):
    """Get a species by ID."""
    species = db.query(Species).filter(Species.id == species_id).first()
    if not species:
        raise HTTPException(status_code=404, detail="Species not found")
    return get_species_with_count(db, species)
@router.put("/{species_id}", response_model=SpeciesResponse)
 def update_species(
    species_id: int,
    species_update: SpeciesUpdate,
    db: Session = Depends(get_db),
 ):
    """Update a species."""
    species = db.query(Species).filter(Species.id == species_id).first()
    if not species:
        raise HTTPException(status_code=404, detail="Species not found")
    update_data = species_update.model_dump(exclude_unset=True)
    for field, value in update_data.items():
        setattr(species, field, value)
    db.commit()
    db.refresh(species)
    return get_species_with_count(db, species)
@router.delete("/{species_id}")
 def delete_species(species_id: int, db: Session = Depends(get_db)):
    """Delete a species and all its images."""
    species = db.query(Species).filter(Species.id == species_id).first()
    if not species:
        raise HTTPException(status_code=404, detail="Species not found")
    db.delete(species)
    db.commit()
    return {"status": "deleted"}
@router.get("/genera/list")
 def list_genera(db: Session = Depends(get_db)):
    """List all unique genera."""
    genera = db.query(Species.genus).filter(
        Species.genus.isnot(None)
    ).distinct().order_by(Species.genus).all()
    return [g[0] for g in genera]
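Review note on the CSV import above: the row handling can be tried without a database. The following is a minimal sketch using only the standard library. The helper name `parse_species_rows` and the sample data are hypothetical and not part of the project. It mirrors the genus auto-extraction and row numbering from `import_species`, and it assumes that names repeated within one file should be skipped.

```python
import csv
import io


def parse_species_rows(text):
    """Hypothetical helper: turn CSV text into species dicts plus error strings."""
    reader = csv.DictReader(io.StringIO(text))
    seen = set()
    rows, errors = [], []
    # Data starts on line 2 because line 1 holds the header.
    for row_num, row in enumerate(reader, start=2):
        name = (row.get("scientific_name") or "").strip()
        if not name:
            errors.append(f"Row {row_num}: Missing scientific_name")
            continue
        if name in seen:
            # Assumption: a repeated name in the same file is silently skipped.
            continue
        seen.add(name)
        genus = (row.get("genus") or "").strip()
        if not genus and " " in name:
            # The first word of a binomial name is the genus.
            genus = name.split()[0]
        rows.append({
            "scientific_name": name,
            "common_name": (row.get("common_name") or "").strip() or None,
            "genus": genus or None,
        })
    return rows, errors


sample = (
    "scientific_name,common_name\n"
    "Monstera deliciosa,Swiss cheese plant\n"
    ",Unnamed\n"
    "Monstera deliciosa,Duplicate\n"
)
rows, errors = parse_species_rows(sample)
```

Keeping the parsing separate from the session calls like this would also make the endpoint easier to unit test, though that is a design suggestion rather than something the commit does.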
@@ -0,0 +1,190 @@
 import json
 from fastapi import APIRouter, Depends, HTTPException
 from sqlalchemy.orm import Session
 from sqlalchemy import func, case
 from app.database import get_db
 from app.models import Species, Image, Job
 from app.models.cached_stats import CachedStats
 from app.schemas.stats import StatsResponse, SourceStats, LicenseStats, SpeciesStats, JobStats
 router = APIRouter()
@router.get("", response_model=StatsResponse)
 def get_stats(db: Session = Depends(get_db)):
    """Get dashboard statistics from cache (updated every 60s by Celery)."""
    # Try to get cached stats
    cached = db.query(CachedStats).filter(CachedStats.key == "dashboard_stats").first()
    if cached:
        data = json.loads(cached.value)
        return StatsResponse(
            total_species=data["total_species"],
            total_images=data["total_images"],
            images_downloaded=data["images_downloaded"],
            images_pending=data["images_pending"],
            images_rejected=data["images_rejected"],
            disk_usage_mb=data["disk_usage_mb"],
            sources=[SourceStats(**s) for s in data["sources"]],
            licenses=[LicenseStats(**lic) for lic in data["licenses"]],
            jobs=JobStats(**data["jobs"]),
            top_species=[SpeciesStats(**s) for s in data["top_species"]],
            under_represented=[SpeciesStats(**s) for s in data["under_represented"]],
        )
    # No cache yet - return empty stats (Celery will populate soon)
    # This only happens on first startup before Celery runs
    return StatsResponse(
        total_species=0,
        total_images=0,
        images_downloaded=0,
        images_pending=0,
        images_rejected=0,
        disk_usage_mb=0.0,
        sources=[],
        licenses=[],
        jobs=JobStats(running=0, pending=0, completed=0, failed=0),
        top_species=[],
        under_represented=[],
    )
@router.post("/refresh")
 def refresh_stats_now():
    """Manually trigger a stats refresh."""
    from app.workers.stats_tasks import refresh_stats
    refresh_stats.delay()
    return {"status": "refresh_queued"}
@router.get("/sources")
 def get_source_stats(db: Session = Depends(get_db)):
    """Get per-source breakdown."""
    stats = db.query(
        Image.source,
        func.count(Image.id).label("total"),
        func.sum(case((Image.status == "downloaded", 1), else_=0)).label("downloaded"),
        func.sum(case((Image.status == "pending", 1), else_=0)).label("pending"),
        func.sum(case((Image.status == "rejected", 1), else_=0)).label("rejected"),
    ).group_by(Image.source).all()
    return [
        {
            "source": s.source,
            "total": s.total,
            "downloaded": s.downloaded or 0,
            "pending": s.pending or 0,
            "rejected": s.rejected or 0,
        }
        for s in stats
    ]
@router.get("/species")
 def get_species_stats(
    min_count: int = 0,
    max_count: int | None = None,
    db: Session = Depends(get_db),
 ):
    """Get per-species image counts."""
    query = db.query(
        Species.id,
        Species.scientific_name,
        Species.common_name,
        Species.genus,
        func.count(Image.id).label("image_count")
    ).outerjoin(Image, (Image.species_id == Species.id) & (Image.status == "downloaded")
    ).group_by(Species.id)
    if min_count > 0:
        query = query.having(func.count(Image.id) >= min_count)
    if max_count is not None:
        query = query.having(func.count(Image.id) <= max_count)
    stats = query.order_by(func.count(Image.id).desc()).all()
    return [
        {
            "id": s.id,
            "scientific_name": s.scientific_name,
            "common_name": s.common_name,
            "genus": s.genus,
            "image_count": s.image_count,
        }
        for s in stats
    ]
@router.get("/distribution")
 def get_image_distribution(db: Session = Depends(get_db)):
    """Get distribution of images per species for ML training assessment.
    Returns counts of species at various image thresholds to help
    determine dataset quality for training image classifiers.
    """
    from sqlalchemy import text
    # Get image counts per species using optimized raw SQL
    distribution_sql = text("""
        WITH species_counts AS (
            SELECT
                s.id,
                COUNT(i.id) as cnt
            FROM species s
            LEFT JOIN images i ON i.species_id = s.id AND i.status = 'downloaded'
            GROUP BY s.id
        )
        SELECT
            COUNT(*) as total_species,
            SUM(CASE WHEN cnt = 0 THEN 1 ELSE 0 END) as with_0,
            SUM(CASE WHEN cnt >= 1 AND cnt < 10 THEN 1 ELSE 0 END) as with_1_9,
            SUM(CASE WHEN cnt >= 10 AND cnt < 25 THEN 1 ELSE 0 END) as with_10_24,
            SUM(CASE WHEN cnt >= 25 AND cnt < 50 THEN 1 ELSE 0 END) as with_25_49,
            SUM(CASE WHEN cnt >= 50 AND cnt < 100 THEN 1 ELSE 0 END) as with_50_99,
            SUM(CASE WHEN cnt >= 100 AND cnt < 200 THEN 1 ELSE 0 END) as with_100_199,
            SUM(CASE WHEN cnt >= 200 THEN 1 ELSE 0 END) as with_200_plus,
            SUM(CASE WHEN cnt >= 10 THEN 1 ELSE 0 END) as trainable_10,
            SUM(CASE WHEN cnt >= 25 THEN 1 ELSE 0 END) as trainable_25,
            SUM(CASE WHEN cnt >= 50 THEN 1 ELSE 0 END) as trainable_50,
            SUM(CASE WHEN cnt >= 100 THEN 1 ELSE 0 END) as trainable_100,
            AVG(cnt) as avg_images,
            MAX(cnt) as max_images,
            MIN(cnt) as min_images,
            SUM(cnt) as total_images
        FROM species_counts
    """)
    result = db.execute(distribution_sql).fetchone()
    return {
        "total_species": result[0] or 0,
        "distribution": {
            "0_images": result[1] or 0,
            "1_to_9": result[2] or 0,
            "10_to_24": result[3] or 0,
            "25_to_49": result[4] or 0,
            "50_to_99": result[5] or 0,
            "100_to_199": result[6] or 0,
            "200_plus": result[7] or 0,
        },
        "trainable_species": {
            "min_10_images": result[8] or 0,
            "min_25_images": result[9] or 0,
            "min_50_images": result[10] or 0,
            "min_100_images": result[11] or 0,
        },
        "summary": {
            "avg_images_per_species": round(result[12] or 0, 1),
            "max_images": result[13] or 0,
            "min_images": result[14] or 0,
            "total_downloaded_images": result[15] or 0,
        },
        "recommendations": {
            "for_basic_model": f"{result[8] or 0} species with 10+ images",
            "for_good_model": f"{result[10] or 0} species with 50+ images",
            "for_excellent_model": f"{result[11] or 0} species with 100+ images",
        }
    }
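Review note on `/distribution` above: the bucket boundaries in the SQL `CASE` expressions can be checked against a plain-Python equivalent. This is a sketch for sanity-checking only, not a replacement for the query. `bucket_distribution`, `trainable`, and the example counts are hypothetical names and data; the bucket edges are copied from the SQL.

```python
# (label, inclusive lower bound, exclusive upper bound); None means open-ended.
BUCKETS = [
    ("0_images", 0, 1),
    ("1_to_9", 1, 10),
    ("10_to_24", 10, 25),
    ("25_to_49", 25, 50),
    ("50_to_99", 50, 100),
    ("100_to_199", 100, 200),
    ("200_plus", 200, None),
]


def bucket_distribution(counts):
    """Count how many species fall into each image-count bucket."""
    dist = {label: 0 for label, _, _ in BUCKETS}
    for cnt in counts:
        for label, low, high in BUCKETS:
            if cnt >= low and (high is None or cnt < high):
                dist[label] += 1
                break
    return dist


def trainable(counts, threshold):
    """Number of species with at least `threshold` downloaded images."""
    return sum(1 for cnt in counts if cnt >= threshold)


# Made-up per-species image counts, chosen to hit the bucket edges.
example_counts = [0, 3, 10, 24, 25, 120, 200]
dist = bucket_distribution(example_counts)
```

Unlike the buckets, the `trainable_*` figures are cumulative, so a species with 120 images counts toward every threshold up to 100.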
@@ -0,0 +1,38 @@
 from pydantic_settings import BaseSettings
 from functools import lru_cache
 class Settings(BaseSettings):
    # Database
    database_url: str = "sqlite:////data/db/plants.sqlite"
    # Redis
    redis_url: str = "redis://redis:6379/0"
    # Storage paths
    images_path: str = "/data/images"
    exports_path: str = "/data/exports"
    imports_path: str = "/data/imports"
    logs_path: str = "/data/logs"
    # API Keys
    flickr_api_key: str = ""
    flickr_api_secret: str = ""
    inaturalist_app_id: str = ""
    inaturalist_app_secret: str = ""
    trefle_api_key: str = ""
    # Logging
    log_level: str = "INFO"
    # Celery
    celery_concurrency: int = 4
    class Config:
        env_file = ".env"
        extra = "ignore"
@lru_cache()
 def get_settings() -> Settings:
    return Settings()
@@ -0,0 +1,44 @@
 from sqlalchemy import create_engine, event
 from sqlalchemy.orm import sessionmaker, declarative_base
 from sqlalchemy.pool import StaticPool
 from app.config import get_settings
 settings = get_settings()
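 # SQLite-specific configuration: StaticPool (below) hands the same single
 # connection to every thread, so sqlite3's same-thread check must be disabled
 connect_args = {"check_same_thread": False}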
 engine = create_engine(
    settings.database_url,
    connect_args=connect_args,
    poolclass=StaticPool,
    echo=False,
 )
 # Enable WAL mode for better concurrent access
@event.listens_for(engine, "connect")
 def set_sqlite_pragma(dbapi_connection, connection_record):
    cursor = dbapi_connection.cursor()
    cursor.execute("PRAGMA journal_mode=WAL")
    cursor.execute("PRAGMA synchronous=NORMAL")
    cursor.execute("PRAGMA foreign_keys=ON")
    cursor.close()
 SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
 Base = declarative_base()
 def get_db():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()
 def init_db():
    """Create all tables."""
    from app.models import species, image, job, api_key, export, cached_stats  # noqa
    Base.metadata.create_all(bind=engine)
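Review note on the pragma listener above: the effect of these pragmas can be observed with the standard-library `sqlite3` module alone. The sketch below assumes a throwaway database file in a temporary directory (the file name is arbitrary) and a local filesystem, since WAL is not available for in-memory databases and may not work on network shares.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(os.path.join(tmp, "example.sqlite"))
    # PRAGMA journal_mode returns the mode that is actually in effect.
    journal_mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    # Foreign-key enforcement is per connection and off by default in SQLite.
    conn.execute("PRAGMA foreign_keys=ON")
    foreign_keys = conn.execute("PRAGMA foreign_keys").fetchone()[0]
    conn.close()
```

Because `foreign_keys` applies per connection, setting it in a `connect` listener, as the commit does, covers every connection the engine opens.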
@@ -0,0 +1,95 @@
 from fastapi import FastAPI
 from fastapi.middleware.cors import CORSMiddleware
 from app.config import get_settings
 from app.database import init_db
 from app.api import species, images, jobs, exports, stats, sources
 settings = get_settings()
 app = FastAPI(
    title="PlantGuideScraper API",
    description="Web scraper interface for houseplant image collection",
    version="1.0.0",
 )
 # CORS middleware
 app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
 )
 # Include routers
 app.include_router(species.router, prefix="/api/species", tags=["Species"])
 app.include_router(images.router, prefix="/api/images", tags=["Images"])
 app.include_router(jobs.router, prefix="/api/jobs", tags=["Jobs"])
 app.include_router(exports.router, prefix="/api/exports", tags=["Exports"])
 app.include_router(stats.router, prefix="/api/stats", tags=["Stats"])
 app.include_router(sources.router, prefix="/api/sources", tags=["Sources"])
@app.on_event("startup")
 async def startup_event():
    """Initialize database on startup."""
    init_db()
@app.get("/health")
 async def health_check():
    """Health check endpoint."""
    return {"status": "healthy", "service": "plant-scraper"}
@app.get("/api/debug")
 async def debug_check():
    """Debug endpoint - checks database connection."""
    import time
    from app.database import SessionLocal
    from app.models import Species, Image
    results = {"status": "checking", "checks": {}}
    # Check 1: Can we create a session?
    try:
        start = time.time()
        db = SessionLocal()
        results["checks"]["session_create"] = {"ok": True, "ms": int((time.time() - start) * 1000)}
    except Exception as e:
        results["checks"]["session_create"] = {"ok": False, "error": str(e)}
        results["status"] = "error"
        return results
    # Check 2: Simple query - count species
    try:
        start = time.time()
        count = db.query(Species).count()
        results["checks"]["species_count"] = {"ok": True, "count": count, "ms": int((time.time() - start) * 1000)}
    except Exception as e:
        results["checks"]["species_count"] = {"ok": False, "error": str(e)}
        results["status"] = "error"
        db.close()
        return results
    # Check 3: Count images
    try:
        start = time.time()
        count = db.query(Image).count()
        results["checks"]["image_count"] = {"ok": True, "count": count, "ms": int((time.time() - start) * 1000)}
    except Exception as e:
        results["checks"]["image_count"] = {"ok": False, "error": str(e)}
        results["status"] = "error"
        db.close()
        return results
    db.close()
    results["status"] = "healthy"
    return results
@app.get("/")
 async def root():
    """Root endpoint."""
    return {"message": "PlantGuideScraper API", "docs": "/docs"}
@@ -0,0 +1,8 @@
 from app.models.species import Species
 from app.models.image import Image
 from app.models.job import Job
 from app.models.api_key import ApiKey
 from app.models.export import Export
 from app.models.cached_stats import CachedStats
 __all__ = ["Species", "Image", "Job", "ApiKey", "Export", "CachedStats"]
@@ -0,0 +1,18 @@
 from sqlalchemy import Column, Integer, String, Float, Boolean
 from app.database import Base
 class ApiKey(Base):
    __tablename__ = "api_keys"
    id = Column(Integer, primary_key=True, index=True)
    source = Column(String, unique=True, nullable=False)  # 'flickr', 'inaturalist', 'wikimedia', 'trefle'
    api_key = Column(String, nullable=False)  # Also used as Client ID for OAuth sources
    api_secret = Column(String, nullable=True)  # Also used as Client Secret for OAuth sources
    access_token = Column(String, nullable=True)  # For OAuth sources like Wikimedia
    rate_limit_per_sec = Column(Float, default=1.0)
    enabled = Column(Boolean, default=True)
    def __repr__(self):
        return f"<ApiKey(id={self.id}, source='{self.source}', enabled={self.enabled})>"
@@ -0,0 +1,14 @@
 from datetime import datetime
 from sqlalchemy import Column, Integer, String, Text, DateTime
 from app.database import Base
 class CachedStats(Base):
    """Stores pre-calculated statistics updated by Celery beat."""
    __tablename__ = "cached_stats"
    id = Column(Integer, primary_key=True, index=True)
    key = Column(String(50), unique=True, nullable=False, index=True)
    value = Column(Text, nullable=False)  # JSON-encoded stats
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
@@ -0,0 +1,24 @@
 from sqlalchemy import Column, Integer, String, Float, DateTime, Text, func
 from app.database import Base
 class Export(Base):
    __tablename__ = "exports"
    id = Column(Integer, primary_key=True, index=True)
    name = Column(String, nullable=False)
    filter_criteria = Column(Text, nullable=True)  # JSON: min_images, licenses, min_quality, species_ids
    train_split = Column(Float, default=0.8)
    status = Column(String, default="pending")  # pending, generating, completed, failed
    file_path = Column(String, nullable=True)
    file_size = Column(Integer, nullable=True)
    species_count = Column(Integer, nullable=True)
    image_count = Column(Integer, nullable=True)
    celery_task_id = Column(String, nullable=True)
    created_at = Column(DateTime, server_default=func.now())
    completed_at = Column(DateTime, nullable=True)
    error_message = Column(Text, nullable=True)
    def __repr__(self):
        return f"<Export(id={self.id}, name='{self.name}', status='{self.status}')>"
@@ -0,0 +1,36 @@
 from sqlalchemy import Column, Integer, String, Float, DateTime, ForeignKey, func, UniqueConstraint, Index
 from sqlalchemy.orm import relationship
 from app.database import Base
 class Image(Base):
    __tablename__ = "images"
    id = Column(Integer, primary_key=True, index=True)
    species_id = Column(Integer, ForeignKey("species.id"), nullable=False, index=True)
    source = Column(String, nullable=False, index=True)
    source_id = Column(String, nullable=True)
    url = Column(String, nullable=False)
    local_path = Column(String, nullable=True)
    license = Column(String, nullable=False, index=True)
    attribution = Column(String, nullable=True)
    width = Column(Integer, nullable=True)
    height = Column(Integer, nullable=True)
    phash = Column(String, nullable=True, index=True)
    quality_score = Column(Float, nullable=True)
    status = Column(String, default="pending", index=True)  # pending, downloaded, rejected, deleted
    created_at = Column(DateTime, server_default=func.now())
    # Composite indexes for common query patterns
    __table_args__ = (
        UniqueConstraint("source", "source_id", name="uq_source_source_id"),
        Index("ix_images_species_status", "species_id", "status"),  # For counting images per species by status
        Index("ix_images_status_created", "status", "created_at"),  # For listing images by status
    )
    # Relationships
    species = relationship("Species", back_populates="images")
    def __repr__(self):
        return f"<Image(id={self.id}, source='{self.source}', status='{self.status}')>"
@@ -0,0 +1,27 @@
 from sqlalchemy import Column, Integer, String, DateTime, Text, Boolean, func
 from app.database import Base
 class Job(Base):
    __tablename__ = "jobs"
    id = Column(Integer, primary_key=True, index=True)
    name = Column(String, nullable=False)
    source = Column(String, nullable=False)
    species_filter = Column(Text, nullable=True)  # JSON array of species IDs or NULL for all
    only_without_images = Column(Boolean, default=False)  # If True, only scrape species with 0 images
    max_images = Column(Integer, nullable=True)  # If set, only scrape species with fewer than N images
    status = Column(String, default="pending", index=True)  # pending, running, paused, completed, failed
    progress_current = Column(Integer, default=0)
    progress_total = Column(Integer, default=0)
    images_downloaded = Column(Integer, default=0)
    images_rejected = Column(Integer, default=0)
    celery_task_id = Column(String, nullable=True)
    started_at = Column(DateTime, nullable=True)
    completed_at = Column(DateTime, nullable=True)
    error_message = Column(Text, nullable=True)
    created_at = Column(DateTime, server_default=func.now())
    def __repr__(self):
        return f"<Job(id={self.id}, name='{self.name}', status='{self.status}')>"
@@ -0,0 +1,21 @@
 from sqlalchemy import Column, Integer, String, DateTime, func
 from sqlalchemy.orm import relationship
 from app.database import Base
 class Species(Base):
    __tablename__ = "species"
    id = Column(Integer, primary_key=True, index=True)
    scientific_name = Column(String, unique=True, nullable=False, index=True)
    common_name = Column(String, nullable=True)
    genus = Column(String, nullable=True, index=True)
    family = Column(String, nullable=True)
    created_at = Column(DateTime, server_default=func.now())
    # Relationships
    images = relationship("Image", back_populates="species", cascade="all, delete-orphan")
    def __repr__(self):
        return f"<Species(id={self.id}, scientific_name='{self.scientific_name}')>"
@@ -0,0 +1,15 @@
 from app.schemas.species import SpeciesCreate, SpeciesUpdate, SpeciesResponse, SpeciesListResponse
 from app.schemas.image import ImageResponse, ImageListResponse, ImageFilter
 from app.schemas.job import JobCreate, JobResponse, JobListResponse
 from app.schemas.api_key import ApiKeyCreate, ApiKeyUpdate, ApiKeyResponse
 from app.schemas.export import ExportCreate, ExportResponse, ExportListResponse
 from app.schemas.stats import StatsResponse, SourceStats, SpeciesStats
 __all__ = [
    "SpeciesCreate", "SpeciesUpdate", "SpeciesResponse", "SpeciesListResponse",
    "ImageResponse", "ImageListResponse", "ImageFilter",
    "JobCreate", "JobResponse", "JobListResponse",
    "ApiKeyCreate", "ApiKeyUpdate", "ApiKeyResponse",
    "ExportCreate", "ExportResponse", "ExportListResponse",
    "StatsResponse", "SourceStats", "SpeciesStats",
 ]
@@ -0,0 +1,36 @@
 from pydantic import BaseModel
 from typing import Optional
 class ApiKeyBase(BaseModel):
    source: str
    api_key: Optional[str] = None  # Optional for no-auth sources, used as Client ID for OAuth
    api_secret: Optional[str] = None  # Also used as Client Secret for OAuth sources
    access_token: Optional[str] = None  # For OAuth sources like Wikimedia
    rate_limit_per_sec: float = 1.0
    enabled: bool = True
 class ApiKeyCreate(ApiKeyBase):
    pass
 class ApiKeyUpdate(BaseModel):
    api_key: Optional[str] = None
    api_secret: Optional[str] = None
    access_token: Optional[str] = None
    rate_limit_per_sec: Optional[float] = None
    enabled: Optional[bool] = None
 class ApiKeyResponse(BaseModel):
    id: int
    source: str
    api_key_masked: str  # Show only last 4 chars
    has_secret: bool
    has_access_token: bool
    rate_limit_per_sec: float
    enabled: bool
    class Config:
        from_attributes = True
@@ -0,0 +1,45 @@
 from pydantic import BaseModel
 from datetime import datetime
 from typing import Optional, List
 class ExportFilter(BaseModel):
    min_images_per_species: int = 100
    licenses: Optional[List[str]] = None  # None means all
    min_quality: Optional[float] = None
    species_ids: Optional[List[int]] = None  # None means all
 class ExportCreate(BaseModel):
    name: str
    filter_criteria: ExportFilter
    train_split: float = 0.8
 class ExportResponse(BaseModel):
    id: int
    name: str
    filter_criteria: Optional[str] = None
    train_split: float
    status: str
    file_path: Optional[str] = None
    file_size: Optional[int] = None
    species_count: Optional[int] = None
    image_count: Optional[int] = None
    created_at: datetime
    completed_at: Optional[datetime] = None
    error_message: Optional[str] = None
    class Config:
        from_attributes = True
 class ExportListResponse(BaseModel):
    items: List[ExportResponse]
    total: int
 class ExportPreview(BaseModel):
    species_count: int
    image_count: int
    estimated_size_mb: float
@@ -0,0 +1,47 @@
 from pydantic import BaseModel
 from datetime import datetime
 from typing import Optional, List
 class ImageBase(BaseModel):
    species_id: int
    source: str
    url: str
    license: str
 class ImageResponse(BaseModel):
    id: int
    species_id: int
    species_name: Optional[str] = None
    source: str
    source_id: Optional[str] = None
    url: str
    local_path: Optional[str] = None
    license: str
    attribution: Optional[str] = None
    width: Optional[int] = None
    height: Optional[int] = None
    quality_score: Optional[float] = None
    status: str
    created_at: datetime
    class Config:
        from_attributes = True
 class ImageListResponse(BaseModel):
    items: List[ImageResponse]
    total: int
    page: int
    page_size: int
    pages: int
 class ImageFilter(BaseModel):
    species_id: Optional[int] = None
    source: Optional[str] = None
    license: Optional[str] = None
    status: Optional[str] = None
    min_quality: Optional[float] = None
    search: Optional[str] = None
@@ -0,0 +1,35 @@
 from pydantic import BaseModel
 from datetime import datetime
 from typing import Optional, List
 class JobCreate(BaseModel):
    name: str
    source: str
    species_ids: Optional[List[int]] = None  # None means all species
    only_without_images: bool = False  # If True, only scrape species with 0 images
    max_images: Optional[int] = None  # If set, only scrape species with fewer than N images
 class JobResponse(BaseModel):
    id: int
    name: str
    source: str
    species_filter: Optional[str] = None
    status: str
    progress_current: int
    progress_total: int
    images_downloaded: int
    images_rejected: int
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None
    error_message: Optional[str] = None
    created_at: datetime
    class Config:
        from_attributes = True
 class JobListResponse(BaseModel):
    items: List[JobResponse]
    total: int
@@ -0,0 +1,44 @@
 from pydantic import BaseModel
 from datetime import datetime
 from typing import Optional, List
 class SpeciesBase(BaseModel):
    scientific_name: str
    common_name: Optional[str] = None
    genus: Optional[str] = None
    family: Optional[str] = None
 class SpeciesCreate(SpeciesBase):
    pass
 class SpeciesUpdate(BaseModel):
    scientific_name: Optional[str] = None
    common_name: Optional[str] = None
    genus: Optional[str] = None
    family: Optional[str] = None
 class SpeciesResponse(SpeciesBase):
    id: int
    created_at: datetime
    image_count: int = 0
    class Config:
        from_attributes = True
 class SpeciesListResponse(BaseModel):
    items: List[SpeciesResponse]
    total: int
    page: int
    page_size: int
    pages: int
 class SpeciesImportResponse(BaseModel):
    imported: int
    skipped: int
    errors: List[str]
@@ -0,0 +1,43 @@
 from pydantic import BaseModel
 from typing import List
 class SourceStats(BaseModel):
    source: str
    image_count: int
    downloaded: int
    pending: int
    rejected: int
 class LicenseStats(BaseModel):
    license: str
    count: int
 class SpeciesStats(BaseModel):
    id: int
    scientific_name: str
    common_name: str | None
    image_count: int
 class JobStats(BaseModel):
    running: int
    pending: int
    completed: int
    failed: int
 class StatsResponse(BaseModel):
    total_species: int
    total_images: int
    images_downloaded: int
    images_pending: int
    images_rejected: int
    disk_usage_mb: float
    sources: List[SourceStats]
    licenses: List[LicenseStats]
    jobs: JobStats
    top_species: List[SpeciesStats]
    under_represented: List[SpeciesStats]  # Species with < 100 images
@@ -0,0 +1,41 @@
 from typing import Optional
 from app.scrapers.base import BaseScraper
 from app.scrapers.inaturalist import INaturalistScraper
 from app.scrapers.flickr import FlickrScraper
 from app.scrapers.wikimedia import WikimediaScraper
 from app.scrapers.trefle import TrefleScraper
 from app.scrapers.gbif import GBIFScraper
 from app.scrapers.duckduckgo import DuckDuckGoScraper
 from app.scrapers.bing import BingScraper
 def get_scraper(source: str) -> Optional[BaseScraper]:
    """Get scraper instance for a source."""
    scrapers = {
        "inaturalist": INaturalistScraper,
        "flickr": FlickrScraper,
        "wikimedia": WikimediaScraper,
        "trefle": TrefleScraper,
        "gbif": GBIFScraper,
        "duckduckgo": DuckDuckGoScraper,
        "bing": BingScraper,
    }
    scraper_class = scrapers.get(source)
    if scraper_class:
        return scraper_class()
    return None
 __all__ = [
    "get_scraper",
    "BaseScraper",
    "INaturalistScraper",
    "FlickrScraper",
    "WikimediaScraper",
    "TrefleScraper",
    "GBIFScraper",
    "DuckDuckGoScraper",
    "BingScraper",
 ]
@@ -0,0 +1,57 @@
 from abc import ABC, abstractmethod
 from typing import Dict, Any, Optional
 import logging
 from sqlalchemy.orm import Session
 from app.models import Species, ApiKey
 class BaseScraper(ABC):
    """Base class for all image scrapers."""
    name: str = "base"
    requires_api_key: bool = True
    @abstractmethod
    def scrape_species(
        self,
        species: Species,
        db: Session,
        logger: Optional[logging.Logger] = None
    ) -> Dict[str, int]:
        """
        Scrape images for a species.
        Args:
            species: The species to scrape images for
            db: Database session
            logger: Optional logger for debugging
        Returns:
            Dict with 'downloaded' and 'rejected' counts; some scrapers also add an 'error' message
        """
        pass
    @abstractmethod
    def test_connection(self, api_key: ApiKey) -> str:
        """
        Test API connection.
        Args:
            api_key: The API key configuration
        Returns:
            Success message
        Raises:
            Exception if connection fails
        """
        pass
    def get_api_key(self, db: Session) -> Optional[ApiKey]:
        """Get the enabled API key for this scraper, or None if none is configured."""
        return db.query(ApiKey).filter(
            ApiKey.source == self.name,
            ApiKey.enabled == True  # noqa: E712 (SQLAlchemy column comparison)
        ).first()
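The scrapers in this package pause between requests with `time.sleep(1.0 / rate_limit)`, which raises `ZeroDivisionError` or `TypeError` when the stored rate is zero or missing. The following is a minimal sketch of a shared guard, not part of the original code; the names `seconds_between_requests` and `throttle` and the `default_rate` parameter are hypothetical.

```python
import time
from typing import Optional


def seconds_between_requests(
    rate_limit_per_sec: Optional[float],
    default_rate: float = 0.5,
) -> float:
    """Return the pause, in seconds, between two requests.

    Falls back to default_rate when the configured rate is missing,
    zero, or negative, so the division below can never fail.
    """
    if rate_limit_per_sec and rate_limit_per_sec > 0:
        rate = rate_limit_per_sec
    else:
        rate = default_rate
    return 1.0 / rate


def throttle(rate_limit_per_sec: Optional[float], default_rate: float = 0.5) -> None:
    """Sleep long enough to respect the given requests-per-second rate."""
    time.sleep(seconds_between_requests(rate_limit_per_sec, default_rate))
```

A scraper could then call `throttle(api_key.rate_limit_per_sec if api_key else None)` wherever it currently divides by the rate directly.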
@@ -0,0 +1,228 @@
 import time
 import logging
 from typing import Dict, Optional
 import httpx
 from sqlalchemy.orm import Session
 from app.scrapers.base import BaseScraper
 from app.models import Species, Image, ApiKey
 from app.workers.quality_tasks import download_and_process_image
 class BHLScraper(BaseScraper):
    """Scraper for Biodiversity Heritage Library (BHL) images.
    BHL provides access to digitized biodiversity literature and illustrations.
    Most content is public domain (pre-1927) or CC-licensed.
    Note: BHL images are primarily historical botanical illustrations,
    which may differ from photographs but are valuable for training.
    """
    name = "bhl"
    requires_api_key = True  # BHL requires free API key
    BASE_URL = "https://www.biodiversitylibrary.org/api3"
    HEADERS = {
        "User-Agent": "PlantGuideScraper/1.0 (Plant image collection for ML training)",
        "Accept": "application/json",
    }
    # BHL content is mostly public domain
    ALLOWED_LICENSES = {"CC0", "CC-BY", "CC-BY-SA", "PD"}
    def scrape_species(
        self,
        species: Species,
        db: Session,
        logger: Optional[logging.Logger] = None
    ) -> Dict[str, int]:
        """Scrape images from BHL for a species."""
        api_key = self.get_api_key(db)
        if not api_key:
            return {"downloaded": 0, "rejected": 0, "error": "No API key configured"}
        # api_key is known to be set here; guard against a missing or zero rate instead
        rate_limit = api_key.rate_limit_per_sec or 0.5
        downloaded = 0
        rejected = 0
        def log(level: str, msg: str):
            if logger:
                getattr(logger, level)(msg)
        try:
            # Disable SSL verification - some Docker environments lack proper CA certificates
            with httpx.Client(timeout=30, headers=self.HEADERS, verify=False) as client:
                # Search for name in BHL
                search_response = client.get(
                    f"{self.BASE_URL}",
                    params={
                        "op": "NameSearch",
                        "name": species.scientific_name,
                        "format": "json",
                        "apikey": api_key.api_key,
                    },
                )
                search_response.raise_for_status()
                search_data = search_response.json()
                results = search_data.get("Result", [])
                if not results:
                    log("info", f"  Species not found in BHL: {species.scientific_name}")
                    return {"downloaded": 0, "rejected": 0}
                time.sleep(1.0 / rate_limit)
                # Get pages with illustrations for each name result
                for name_result in results[:5]:  # Limit to top 5 matches
                    name_bank_id = name_result.get("NameBankID")
                    if not name_bank_id:
                        continue
                    # Get publications with this name
                    pub_response = client.get(
                        f"{self.BASE_URL}",
                        params={
                            "op": "NameGetDetail",
                            "namebankid": name_bank_id,
                            "format": "json",
                            "apikey": api_key.api_key,
                        },
                    )
                    pub_response.raise_for_status()
                    pub_data = pub_response.json()
                    time.sleep(1.0 / rate_limit)
                    # Extract titles and get page images
                    for title in pub_data.get("Result", []):
                        title_id = title.get("TitleID")
                        if not title_id:
                            continue
                        # Get pages for this title
                        pages_response = client.get(
                            f"{self.BASE_URL}",
                            params={
                                "op": "GetPageMetadata",
                                "titleid": title_id,
                                "format": "json",
                                "apikey": api_key.api_key,
                                "ocr": "false",
                                "names": "false",
                            },
                        )
                        if pages_response.status_code != 200:
                            continue
                        pages_data = pages_response.json()
                        pages = pages_data.get("Result", [])
                        time.sleep(1.0 / rate_limit)
                        # Look for pages that are likely illustrations
                        for page in pages[:100]:  # Limit pages per title
                            page_types = page.get("PageTypes", [])
                            # Keep illustration-like pages; pages without type metadata are kept too
                            is_illustration = any(
                                pt.get("PageTypeName", "").lower() in ["illustration", "plate", "figure", "map"]
                                for pt in page_types
                            ) if page_types else False
                            if not is_illustration and page_types:
                                continue
                            page_id = page.get("PageID")
                            if not page_id:
                                continue
                            # Construct image URL
                            # BHL provides multiple image sizes
                            image_url = f"https://www.biodiversitylibrary.org/pageimage/{page_id}"
                            # Check if already exists
                            source_id = str(page_id)
                            existing = db.query(Image).filter(
                                Image.source == self.name,
                                Image.source_id == source_id,
                            ).first()
                            if existing:
                                continue
                            # Determine license from the publication year when the page carries one
                            year = None
                            try:
                                if "Year" in page:
                                    year = int(page.get("Year", 0))
                            except (ValueError, TypeError):
                                pass
                            # Content before 1927 is public domain in US
                            if year and year < 1927:
                                license_code = "PD"
                            else:
                                # Assumption: pages without a usable year are recorded as CC0
                                license_code = "CC0"
                            # Build attribution
                            title_name = title.get("ShortTitle", title.get("FullTitle", "Unknown"))
                            attribution = f"From '{title_name}' via Biodiversity Heritage Library ({license_code})"
                            # Create image record
                            image = Image(
                                species_id=species.id,
                                source=self.name,
                                source_id=source_id,
                                url=image_url,
                                license=license_code,
                                attribution=attribution,
                                status="pending",
                            )
                            db.add(image)
                            db.commit()
                            # Queue for download
                            download_and_process_image.delay(image.id)
                            downloaded += 1
                            # Limit total per species
                            if downloaded >= 50:
                                break
                        if downloaded >= 50:
                            break
                    if downloaded >= 50:
                        break
        except httpx.HTTPStatusError as e:
            log("error", f"  HTTP error for {species.scientific_name}: {e.response.status_code}")
        except Exception as e:
            log("error", f"  Error scraping BHL for {species.scientific_name}: {e}")
        return {"downloaded": downloaded, "rejected": rejected}
    def test_connection(self, api_key: ApiKey) -> str:
        """Test BHL API connection."""
        with httpx.Client(timeout=10, headers=self.HEADERS, verify=False) as client:
            response = client.get(
                f"{self.BASE_URL}",
                params={
                    "op": "NameSearch",
                    "name": "Rosa",
                    "format": "json",
                    "apikey": api_key.api_key,
                },
            )
            response.raise_for_status()
            data = response.json()
        results = data.get("Result", [])
        return f"BHL API connection successful ({len(results)} results for 'Rosa')"
@@ -0,0 +1,135 @@
 import hashlib
 import time
 import logging
 from typing import Dict, Optional
 import httpx
 from sqlalchemy.orm import Session
 from app.scrapers.base import BaseScraper
 from app.models import Species, Image, ApiKey
 from app.workers.quality_tasks import download_and_process_image
 class BingScraper(BaseScraper):
    """Scraper for Bing Image Search v7 API (Azure Cognitive Services)."""
    name = "bing"
    requires_api_key = True
    BASE_URL = "https://api.bing.microsoft.com/v7.0/images/search"
    NEGATIVE_TERMS = "-herbarium -specimen -illustration -drawing -diagram -dried -pressed"
    LICENSE_MAP = {
        "Public": "CC0",
        "Share": "CC-BY-SA",
        "ShareCommercially": "CC-BY",
        "Modify": "CC-BY-SA",
        "ModifyCommercially": "CC-BY",
    }
    def _build_queries(self, species: Species) -> list[str]:
        queries = [f'"{species.scientific_name}" plant photo {self.NEGATIVE_TERMS}']
        if species.common_name:
            queries.append(f'"{species.common_name}" houseplant photo {self.NEGATIVE_TERMS}')
        return queries
    def scrape_species(
        self,
        species: Species,
        db: Session,
        logger: Optional[logging.Logger] = None,
    ) -> Dict[str, int]:
        api_key = self.get_api_key(db)
        if not api_key:
            return {"downloaded": 0, "rejected": 0}
        rate_limit = api_key.rate_limit_per_sec or 3.0
        downloaded = 0
        rejected = 0
        seen_urls = set()
        headers = {
            "Ocp-Apim-Subscription-Key": api_key.api_key,
        }
        try:
            queries = self._build_queries(species)
            with httpx.Client(timeout=30, headers=headers) as client:
                for query in queries:
                    params = {
                        "q": query,
                        "imageType": "Photo",
                        "license": "ShareCommercially",
                        "count": 50,
                    }
                    response = client.get(self.BASE_URL, params=params)
                    response.raise_for_status()
                    data = response.json()
                    for result in data.get("value", []):
                        url = result.get("contentUrl")
                        if not url or url in seen_urls:
                            continue
                        seen_urls.add(url)
                        # Use Bing's imageId, fall back to md5 hash
                        source_id = result.get("imageId") or hashlib.md5(url.encode()).hexdigest()[:16]
                        existing = db.query(Image).filter(
                            Image.source == self.name,
                            Image.source_id == source_id,
                        ).first()
                        if existing:
                            continue
                        # Map license
                        bing_license = result.get("license", "")
                        license_code = self.LICENSE_MAP.get(bing_license, "UNKNOWN")
                        host = result.get("hostPageDisplayUrl", "")
                        attribution = f"via Bing ({host})" if host else "via Bing Image Search"
                        image = Image(
                            species_id=species.id,
                            source=self.name,
                            source_id=source_id,
                            url=url,
                            width=result.get("width"),
                            height=result.get("height"),
                            license=license_code,
                            attribution=attribution,
                            status="pending",
                        )
                        db.add(image)
                        db.commit()
                        download_and_process_image.delay(image.id)
                        downloaded += 1
                    time.sleep(1.0 / rate_limit)
        except Exception as e:
            if logger:
                logger.error(f"Error scraping Bing for {species.scientific_name}: {e}")
            else:
                print(f"Error scraping Bing for {species.scientific_name}: {e}")
        return {"downloaded": downloaded, "rejected": rejected}
    def test_connection(self, api_key: ApiKey) -> str:
        headers = {"Ocp-Apim-Subscription-Key": api_key.api_key}
        with httpx.Client(timeout=10, headers=headers) as client:
            response = client.get(
                self.BASE_URL,
                params={"q": "Monstera deliciosa plant", "count": 1},
            )
            response.raise_for_status()
            data = response.json()
        count = data.get("totalEstimatedMatches", 0)
        return f"Bing Image Search working ({count:,} estimated matches)"
@@ -0,0 +1,101 @@
 import hashlib
 import time
 import logging
 from typing import Dict, Optional
 from duckduckgo_search import DDGS
 from sqlalchemy.orm import Session
 from app.scrapers.base import BaseScraper
 from app.models import Species, Image, ApiKey
 from app.workers.quality_tasks import download_and_process_image
 class DuckDuckGoScraper(BaseScraper):
    """Scraper for DuckDuckGo image search. No API key required."""
    name = "duckduckgo"
    requires_api_key = False
    NEGATIVE_TERMS = "-herbarium -specimen -illustration -drawing -diagram -dried -pressed"
    def _build_queries(self, species: Species) -> list[str]:
        queries = [f'"{species.scientific_name}" plant photo {self.NEGATIVE_TERMS}']
        if species.common_name:
            queries.append(f'"{species.common_name}" houseplant photo {self.NEGATIVE_TERMS}')
        return queries
    def scrape_species(
        self,
        species: Species,
        db: Session,
        logger: Optional[logging.Logger] = None,
    ) -> Dict[str, int]:
        api_key = self.get_api_key(db)
        # Fall back to 0.5 req/s when no key row exists or its rate is unset/zero
        rate_limit = (api_key.rate_limit_per_sec if api_key else None) or 0.5
        downloaded = 0
        rejected = 0
        seen_urls = set()
        try:
            queries = self._build_queries(species)
            with DDGS() as ddgs:
                for query in queries:
                    results = ddgs.images(
                        keywords=query,
                        type_image="photo",
                        max_results=50,
                    )
                    for result in results:
                        url = result.get("image")
                        if not url or url in seen_urls:
                            continue
                        seen_urls.add(url)
                        source_id = hashlib.md5(url.encode()).hexdigest()[:16]
                        # Check if already exists
                        existing = db.query(Image).filter(
                            Image.source == self.name,
                            Image.source_id == source_id,
                        ).first()
                        if existing:
                            continue
                        title = result.get("title", "")
                        attribution = f"{title} via DuckDuckGo" if title else "via DuckDuckGo"
                        image = Image(
                            species_id=species.id,
                            source=self.name,
                            source_id=source_id,
                            url=url,
                            license="UNKNOWN",
                            attribution=attribution,
                            status="pending",
                        )
                        db.add(image)
                        db.commit()
                        download_and_process_image.delay(image.id)
                        downloaded += 1
                    time.sleep(1.0 / rate_limit)
        except Exception as e:
            if logger:
                logger.error(f"Error scraping DuckDuckGo for {species.scientific_name}: {e}")
            else:
                print(f"Error scraping DuckDuckGo for {species.scientific_name}: {e}")
        return {"downloaded": downloaded, "rejected": rejected}
    def test_connection(self, api_key: ApiKey) -> str:
        with DDGS() as ddgs:
            results = ddgs.images(keywords="Monstera deliciosa plant", max_results=1)
            count = len(list(results))
        return f"DuckDuckGo search working ({count} test result)"
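The Bing and DuckDuckGo scrapers derive `source_id` from the image URL with a truncated MD5 digest, so the same URL maps to the same ID on every run and the duplicate check keeps working. A minimal standalone sketch of that idea follows; the helper name `stable_source_id` is hypothetical and does not exist in the project.

```python
import hashlib


def stable_source_id(url: str, length: int = 16) -> str:
    """Return a short, deterministic identifier for an image URL.

    MD5 is used only as a fingerprint here, not for security. Unlike the
    built-in hash(), whose result for strings changes between interpreter
    runs, the digest is identical across processes and machines.
    """
    return hashlib.md5(url.encode("utf-8")).hexdigest()[:length]


first = stable_source_id("https://example.com/monstera.jpg")
second = stable_source_id("https://example.com/monstera.jpg")
other = stable_source_id("https://example.com/pothos.jpg")
```

Truncating to 16 hex characters keeps the ID compact; collisions are possible in principle but unlikely for datasets of this size.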
@@ -0,0 +1,226 @@
 import time
 import logging
 from typing import Dict, Optional
 import httpx
 from sqlalchemy.orm import Session
 from app.scrapers.base import BaseScraper
 from app.models import Species, Image, ApiKey
 from app.workers.quality_tasks import download_and_process_image
 class EOLScraper(BaseScraper):
    """Scraper for Encyclopedia of Life (EOL) images.
    EOL aggregates biodiversity data from many sources and provides
    a free API with no authentication required.
    """
    name = "eol"
    requires_api_key = False
    BASE_URL = "https://eol.org/api"
    HEADERS = {
        "User-Agent": "PlantGuideScraper/1.0 (Plant image collection for ML training)",
        "Accept": "application/json",
    }
    # Map EOL license URLs to short codes
    LICENSE_MAP = {
        "http://creativecommons.org/publicdomain/zero/1.0/": "CC0",
        "http://creativecommons.org/publicdomain/mark/1.0/": "CC0",
        "http://creativecommons.org/licenses/by/2.0/": "CC-BY",
        "http://creativecommons.org/licenses/by/3.0/": "CC-BY",
        "http://creativecommons.org/licenses/by/4.0/": "CC-BY",
        "http://creativecommons.org/licenses/by-sa/2.0/": "CC-BY-SA",
        "http://creativecommons.org/licenses/by-sa/3.0/": "CC-BY-SA",
        "http://creativecommons.org/licenses/by-sa/4.0/": "CC-BY-SA",
        "https://creativecommons.org/publicdomain/zero/1.0/": "CC0",
        "https://creativecommons.org/publicdomain/mark/1.0/": "CC0",
        "https://creativecommons.org/licenses/by/2.0/": "CC-BY",
        "https://creativecommons.org/licenses/by/3.0/": "CC-BY",
        "https://creativecommons.org/licenses/by/4.0/": "CC-BY",
        "https://creativecommons.org/licenses/by-sa/2.0/": "CC-BY-SA",
        "https://creativecommons.org/licenses/by-sa/3.0/": "CC-BY-SA",
        "https://creativecommons.org/licenses/by-sa/4.0/": "CC-BY-SA",
        "pd": "CC0",  # Public domain
        "public domain": "CC0",
    }
    # Commercial-safe licenses
    ALLOWED_LICENSES = {"CC0", "CC-BY", "CC-BY-SA"}
    def scrape_species(
        self,
        species: Species,
        db: Session,
        logger: Optional[logging.Logger] = None
    ) -> Dict[str, int]:
        """Scrape images from EOL for a species."""
        api_key = self.get_api_key(db)
        # Fall back to 0.5 req/s when no key row exists or its rate is unset/zero
        rate_limit = (api_key.rate_limit_per_sec if api_key else None) or 0.5
        downloaded = 0
        rejected = 0
        def log(level: str, msg: str):
            if logger:
                getattr(logger, level)(msg)
        try:
            # Disable SSL verification - EOL is a trusted source and some Docker
            # environments lack proper CA certificates
            with httpx.Client(timeout=30, headers=self.HEADERS, verify=False) as client:
                # Step 1: Search for the species
                search_response = client.get(
                    f"{self.BASE_URL}/search/1.0.json",
                    params={
                        "q": species.scientific_name,
                        "page": 1,
                        "exact": "true",
                    },
                )
                search_response.raise_for_status()
                search_data = search_response.json()
                results = search_data.get("results", [])
                if not results:
                    log("info", f"  Species not found in EOL: {species.scientific_name}")
                    return {"downloaded": 0, "rejected": 0}
                # Get the EOL page ID
                eol_page_id = results[0].get("id")
                if not eol_page_id:
                    return {"downloaded": 0, "rejected": 0}
                time.sleep(1.0 / rate_limit)
                # Step 2: Get page details with images
                page_response = client.get(
                    f"{self.BASE_URL}/pages/1.0/{eol_page_id}.json",
                    params={
                        "images_per_page": 75,
                        "images_page": 1,
                        "videos_per_page": 0,
                        "sounds_per_page": 0,
                        "maps_per_page": 0,
                        "texts_per_page": 0,
                        "details": "true",
                        # NC licenses are rejected below, so do not request them
                        "licenses": "cc-by|cc-by-sa|pd",
                    },
                )
                page_response.raise_for_status()
                page_data = page_response.json()
                data_objects = page_data.get("dataObjects", [])
                log("debug", f"  Found {len(data_objects)} media objects")
                for obj in data_objects:
                    # Only process images
                    media_type = obj.get("dataType", "")
                    if "image" not in media_type.lower() and "stillimage" not in media_type.lower():
                        continue
                    # Get image URL
                    image_url = obj.get("eolMediaURL") or obj.get("mediaURL")
                    if not image_url:
                        rejected += 1
                        continue
                    # Check license
                    # "license" may be present but null, so coalesce before lowercasing
                    license_url = (obj.get("license") or "").lower()
                    license_code = None
                    # Try to match license URL
                    for pattern, code in self.LICENSE_MAP.items():
                        if pattern in license_url:
                            license_code = code
                            break
                    if not license_code:
                        # Check for NC licenses which we reject
                        if "-nc" in license_url:
                            rejected += 1
                            continue
                        # Unknown license, skip
                        log("debug", f"  Rejected: unknown license {license_url}")
                        rejected += 1
                        continue
                    if license_code not in self.ALLOWED_LICENSES:
                        rejected += 1
                        continue
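                    # Create unique source ID. Built-in hash() differs between interpreter
                    # runs, so fall back to a stable digest of the URL instead.
                    import hashlib  # local import keeps this fix self-contained
                    source_id = str(
                        obj.get("dataObjectVersionID")
                        or obj.get("identifier")
                        or hashlib.md5(image_url.encode()).hexdigest()[:16]
                    )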
                    # Check if already exists
                    existing = db.query(Image).filter(
                        Image.source == self.name,
                        Image.source_id == source_id,
                    ).first()
                    if existing:
                        continue
                    # Build attribution
                    agents = obj.get("agents", [])
                    photographer = None
                    rights_holder = None
                    for agent in agents:
                        role = agent.get("role", "").lower()
                        name = agent.get("full_name", "")
                        if role == "photographer":
                            photographer = name
                        elif role == "owner" or role == "rights holder":
                            rights_holder = name
                    attribution_parts = []
                    if photographer:
                        attribution_parts.append(f"Photo by {photographer}")
                    if rights_holder and rights_holder != photographer:
                        attribution_parts.append(f"Rights: {rights_holder}")
                    attribution_parts.append(f"via EOL ({license_code})")
                    attribution = " | ".join(attribution_parts)
                    # Create image record
                    image = Image(
                        species_id=species.id,
                        source=self.name,
                        source_id=source_id,
                        url=image_url,
                        license=license_code,
                        attribution=attribution,
                        status="pending",
                    )
                    db.add(image)
                    db.commit()
                    # Queue for download
                    download_and_process_image.delay(image.id)
                    downloaded += 1
                time.sleep(1.0 / rate_limit)
        except httpx.HTTPStatusError as e:
            log("error", f"  HTTP error for {species.scientific_name}: {e.response.status_code}")
        except Exception as e:
            log("error", f"  Error scraping EOL for {species.scientific_name}: {e}")
        return {"downloaded": downloaded, "rejected": rejected}
    def test_connection(self, api_key: ApiKey) -> str:
        """Test EOL API connection."""
        with httpx.Client(timeout=10, headers=self.HEADERS, verify=False) as client:
            response = client.get(
                f"{self.BASE_URL}/search/1.0.json",
                params={"q": "Rosa", "page": 1},
            )
            response.raise_for_status()
            data = response.json()
        total = data.get("totalResults", 0)
        return f"EOL API connection successful ({total} results for 'Rosa')"
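The EOL scraper maps license URLs to short codes by substring matching against a table that lists every scheme and version separately. As an alternative, the sketch below parses the URL and looks at only the first two path segments, so `http`/`https`, version numbers, and a trailing `legalcode` no longer need their own entries. This is an illustrative assumption, not the project's implementation; `license_code_from_url` and `_CC_PATH_CODES` are hypothetical names.

```python
from typing import Optional
from urllib.parse import urlparse

# First two path segments of a Creative Commons URL mapped to the short code.
_CC_PATH_CODES = {
    "/publicdomain/zero": "CC0",
    "/publicdomain/mark": "CC0",
    "/licenses/by": "CC-BY",
    "/licenses/by-sa": "CC-BY-SA",
}


def license_code_from_url(value: Optional[str]) -> Optional[str]:
    """Map a license string to CC0, CC-BY, or CC-BY-SA, or None if unrecognised."""
    if not value:
        return None
    text = value.strip().lower()
    if text in {"pd", "public domain"}:
        return "CC0"
    parsed = urlparse(text)
    if parsed.netloc not in {"creativecommons.org", "www.creativecommons.org"}:
        return None
    segments = [part for part in parsed.path.split("/") if part]
    if len(segments) < 2:
        return None
    # e.g. "/licenses/by-sa" from "/licenses/by-sa/3.0/legalcode"
    return _CC_PATH_CODES.get("/" + "/".join(segments[:2]))
```

Because the lookup is exact on the license family, a non-commercial URL such as `.../licenses/by-nc/4.0/` returns `None` and can be rejected without a separate `-nc` check.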
@@ -0,0 +1,146 @@
 import time
 import logging
 from typing import Dict, Optional
 import httpx
 from sqlalchemy.orm import Session
 from app.scrapers.base import BaseScraper
 from app.models import Species, Image, ApiKey
 from app.workers.quality_tasks import download_and_process_image
 class FlickrScraper(BaseScraper):
    """Scraper for Flickr images via their API."""
    name = "flickr"
    requires_api_key = True
    BASE_URL = "https://api.flickr.com/services/rest/"
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
    }
    # Commercial-safe license IDs
    # 4 = CC BY 2.0, 7 = No known copyright, 8 = US Gov, 9 = CC0
    ALLOWED_LICENSES = "4,7,8,9"
    LICENSE_MAP = {
        "4": "CC-BY",
        "7": "NO-KNOWN-COPYRIGHT",
        "8": "US-GOV",
        "9": "CC0",
    }
    def scrape_species(
        self,
        species: Species,
        db: Session,
        logger: Optional[logging.Logger] = None
    ) -> Dict[str, int]:
        """Scrape images from Flickr for a species."""
        api_key = self.get_api_key(db)
        if not api_key:
            return {"downloaded": 0, "rejected": 0, "error": "No API key configured"}
        # Guard against an unset or zero rate; the 1.0 req/s fallback is an assumption
        rate_limit = api_key.rate_limit_per_sec or 1.0
        downloaded = 0
        rejected = 0
        try:
            params = {
                "method": "flickr.photos.search",
                "api_key": api_key.api_key,
                "text": species.scientific_name,
                "license": self.ALLOWED_LICENSES,
                "content_type": 1,  # Photos only
                "media": "photos",
                "extras": "license,url_l,url_o,owner_name",
                "per_page": 100,
                "format": "json",
                "nojsoncallback": 1,
            }
            with httpx.Client(timeout=30, headers=self.HEADERS) as client:
                response = client.get(self.BASE_URL, params=params)
                response.raise_for_status()
                data = response.json()
            if data.get("stat") != "ok":
                return {"downloaded": 0, "rejected": 0, "error": data.get("message")}
            photos = data.get("photos", {}).get("photo", [])
            for photo in photos:
                # Get best URL (original or large)
                url = photo.get("url_o") or photo.get("url_l")
                if not url:
                    rejected += 1
                    continue
                # Get license
                license_id = str(photo.get("license", ""))
                license_code = self.LICENSE_MAP.get(license_id, "UNKNOWN")
                if license_code == "UNKNOWN":
                    rejected += 1
                    continue
                # Check if already exists
                source_id = str(photo.get("id"))
                existing = db.query(Image).filter(
                    Image.source == self.name,
                    Image.source_id == source_id,
                ).first()
                if existing:
                    continue
                # Build attribution
                owner = photo.get("ownername", "Unknown")
                attribution = f"Photo by {owner} on Flickr ({license_code})"
                # Create image record
                image = Image(
                    species_id=species.id,
                    source=self.name,
                    source_id=source_id,
                    url=url,
                    license=license_code,
                    attribution=attribution,
                    status="pending",
                )
                db.add(image)
                db.commit()
                # Queue for download
                download_and_process_image.delay(image.id)
                downloaded += 1
            # Rate limiting
            time.sleep(1.0 / rate_limit)
        except Exception as e:
            print(f"Error scraping Flickr for {species.scientific_name}: {e}")
        return {"downloaded": downloaded, "rejected": rejected}
    def test_connection(self, api_key: ApiKey) -> str:
        """Test Flickr API connection."""
        params = {
            "method": "flickr.test.echo",
            "api_key": api_key.api_key,
            "format": "json",
            "nojsoncallback": 1,
        }
        with httpx.Client(timeout=10, headers=self.HEADERS) as client:
            response = client.get(self.BASE_URL, params=params)
            response.raise_for_status()
            data = response.json()
        if data.get("stat") != "ok":
            raise Exception(data.get("message", "API test failed"))
        return "Flickr API connection successful"
@@ -0,0 +1,159 @@
 import time
 import logging
 from typing import Dict, Optional
 import httpx
 from sqlalchemy.orm import Session
 from app.scrapers.base import BaseScraper
 from app.models import Species, Image, ApiKey
 from app.workers.quality_tasks import download_and_process_image
 class GBIFScraper(BaseScraper):
    """Scraper for GBIF (Global Biodiversity Information Facility) images."""
    name = "gbif"
    requires_api_key = False  # GBIF is free to use
    BASE_URL = "https://api.gbif.org/v1"
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
    }
    # Map GBIF license URLs to short codes
    LICENSE_MAP = {
        "http://creativecommons.org/publicdomain/zero/1.0/legalcode": "CC0",
        "http://creativecommons.org/licenses/by/4.0/legalcode": "CC-BY",
        "http://creativecommons.org/licenses/by-nc/4.0/legalcode": "CC-BY-NC",
        "http://creativecommons.org/publicdomain/zero/1.0/": "CC0",
        "http://creativecommons.org/licenses/by/4.0/": "CC-BY",
        "http://creativecommons.org/licenses/by-nc/4.0/": "CC-BY-NC",
        "https://creativecommons.org/publicdomain/zero/1.0/legalcode": "CC0",
        "https://creativecommons.org/licenses/by/4.0/legalcode": "CC-BY",
        "https://creativecommons.org/licenses/by-nc/4.0/legalcode": "CC-BY-NC",
        "https://creativecommons.org/publicdomain/zero/1.0/": "CC0",
        "https://creativecommons.org/licenses/by/4.0/": "CC-BY",
        "https://creativecommons.org/licenses/by-nc/4.0/": "CC-BY-NC",
    }
    # Only allow commercial-safe licenses
    ALLOWED_LICENSES = {"CC0", "CC-BY"}
    def scrape_species(
        self,
        species: Species,
        db: Session,
        logger: Optional[logging.Logger] = None
    ) -> Dict[str, int]:
        """Scrape images from GBIF for a species."""
        # GBIF doesn't require API key, but we still respect rate limits
        api_key = self.get_api_key(db)
        rate_limit = api_key.rate_limit_per_sec if api_key else 1.0
        downloaded = 0
        rejected = 0
        try:
            params = {
                "scientificName": species.scientific_name,
                "mediaType": "StillImage",
                "limit": 100,
            }
            with httpx.Client(timeout=30, headers=self.HEADERS) as client:
                response = client.get(
                    f"{self.BASE_URL}/occurrence/search",
                    params=params,
                )
                response.raise_for_status()
                data = response.json()
                results = data.get("results", [])
                for occurrence in results:
                    media_list = occurrence.get("media", [])
                    for media in media_list:
                        # Only process still images
                        if media.get("type") != "StillImage":
                            continue
                        url = media.get("identifier")
                        if not url:
                            rejected += 1
                            continue
                        # Check license
                        license_url = media.get("license", "")
                        license_code = self.LICENSE_MAP.get(license_url)
                        if not license_code or license_code not in self.ALLOWED_LICENSES:
                            rejected += 1
                            continue
                        # Create unique source ID from occurrence key and media URL
                        occurrence_key = occurrence.get("key", "")
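                        # Use a stable digest of the URL for uniqueness within the occurrence.
                        # The built-in hash() is salted per process for strings, so it would
                        # yield a different source_id on every run and defeat the
                        # duplicate check below.
                        import hashlib  # stdlib; only needed for this digest
                        url_hash = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]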
                        source_id = f"{occurrence_key}_{url_hash}"
                        # Check if already exists
                        existing = db.query(Image).filter(
                            Image.source == self.name,
                            Image.source_id == source_id,
                        ).first()
                        if existing:
                            continue
                        # Build attribution
                        creator = media.get("creator", "")
                        rights_holder = media.get("rightsHolder", "")
                        attribution_parts = []
                        if creator:
                            attribution_parts.append(f"Photo by {creator}")
                        if rights_holder and rights_holder != creator:
                            attribution_parts.append(f"Rights: {rights_holder}")
                        attribution_parts.append(f"via GBIF ({license_code})")
                        attribution = " | ".join(attribution_parts) if attribution_parts else f"GBIF ({license_code})"
                        # Create image record
                        image = Image(
                            species_id=species.id,
                            source=self.name,
                            source_id=source_id,
                            url=url,
                            license=license_code,
                            attribution=attribution,
                            status="pending",
                        )
                        db.add(image)
                        db.commit()
                        # Queue for download
                        download_and_process_image.delay(image.id)
                        downloaded += 1
                # Rate limiting
                time.sleep(1.0 / rate_limit)
        except Exception as e:
            # Report through the job logger when one was passed in
            (logger or logging.getLogger(__name__)).error(
                f"Error scraping GBIF for {species.scientific_name}: {e}"
            )
        return {"downloaded": downloaded, "rejected": rejected}
    def test_connection(self, api_key: ApiKey) -> str:
        """Test GBIF API connection."""
        # GBIF doesn't require authentication, just test the endpoint
        with httpx.Client(timeout=10, headers=self.HEADERS) as client:
            response = client.get(
                f"{self.BASE_URL}/occurrence/search",
                params={"limit": 1},
            )
            response.raise_for_status()
            data = response.json()
        count = data.get("count", 0)
        return f"GBIF API connection successful ({count:,} total occurrences available)"
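
The GBIF scraper builds its `source_id` from the occurrence key plus a short hash of the media URL, and the duplicate check only works if that hash is the same on every run. A minimal sketch of one way to get a run-independent ID, assuming a SHA-1 digest truncated to 8 hex characters is unique enough within a single occurrence (`stable_source_id` is a hypothetical helper, not part of this commit):

```python
import hashlib


def stable_source_id(occurrence_key: object, url: str, digest_chars: int = 8) -> str:
    """Build a source ID that is identical across interpreter runs.

    hashlib digests depend only on the input bytes, unlike the built-in
    hash(), which is salted per process for str values.
    """
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"{occurrence_key}_{digest[:digest_chars]}"


if __name__ == "__main__":
    # Same inputs always give the same ID, so a re-run can find the old row.
    print(stable_source_id(12345, "https://example.org/photo.jpg"))
```

The truncation length trades ID size against collision risk; a longer prefix is safer if one occurrence can carry many media items.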
@@ -0,0 +1,144 @@
 import time
 import logging
 from typing import Dict, Optional
 import httpx
 from sqlalchemy.orm import Session
 from app.scrapers.base import BaseScraper
 from app.models import Species, Image, ApiKey
 from app.workers.quality_tasks import download_and_process_image
 class INaturalistScraper(BaseScraper):
    """Scraper for iNaturalist observations via their API."""
    name = "inaturalist"
    requires_api_key = False  # Public API, but rate limited
    BASE_URL = "https://api.inaturalist.org/v1"
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
    }
    # Commercial-safe licenses (CC0, CC-BY)
    ALLOWED_LICENSES = ["cc0", "cc-by"]
    def scrape_species(
        self,
        species: Species,
        db: Session,
        logger: Optional[logging.Logger] = None
    ) -> Dict[str, int]:
        """Scrape images from iNaturalist for a species."""
        api_key = self.get_api_key(db)
        rate_limit = api_key.rate_limit_per_sec if api_key else 1.0
        downloaded = 0
        rejected = 0
        def log(level: str, msg: str):
            if logger:
                getattr(logger, level)(msg)
        try:
            # Search for observations of this species
            params = {
                "taxon_name": species.scientific_name,
                "quality_grade": "research",  # Only research-grade
                "photos": True,
                "per_page": 200,
                "order_by": "votes",
                "license": ",".join(self.ALLOWED_LICENSES),
            }
            log("debug", f"  API request params: {params}")
            with httpx.Client(timeout=30, headers=self.HEADERS) as client:
                response = client.get(
                    f"{self.BASE_URL}/observations",
                    params=params,
                )
                log("debug", f"  API response status: {response.status_code}")
                response.raise_for_status()
                data = response.json()
            observations = data.get("results", [])
            total_results = data.get("total_results", 0)
            log("debug", f"  Found {len(observations)} observations (total: {total_results})")
            if not observations:
                log("info", f"  No observations found for {species.scientific_name}")
                return {"downloaded": 0, "rejected": 0}
            for obs in observations:
                photos = obs.get("photos", [])
                for photo in photos:
                    # Check license
                    license_code = (photo.get("license_code") or "").lower()
                    if license_code not in self.ALLOWED_LICENSES:
                        log("debug", f"  Rejected photo {photo.get('id')}: license={license_code}")
                        rejected += 1
                        continue
                    # Get image URL (the rewrite below assumes the API returns the "square" thumbnail URL)
                    url = photo.get("url", "")
                    if not url:
                        log("debug", f"  Skipped photo {photo.get('id')}: no URL")
                        continue
                    # Convert to larger size
                    url = url.replace("square", "large")
                    # Check if already exists
                    source_id = str(photo.get("id"))
                    existing = db.query(Image).filter(
                        Image.source == self.name,
                        Image.source_id == source_id,
                    ).first()
                    if existing:
                        log("debug", f"  Skipped photo {source_id}: already exists")
                        continue
                    # Create image record
                    image = Image(
                        species_id=species.id,
                        source=self.name,
                        source_id=source_id,
                        url=url,
                        license=license_code.upper(),
                        attribution=photo.get("attribution", ""),
                        status="pending",
                    )
                    db.add(image)
                    db.commit()
                    # Queue for download
                    download_and_process_image.delay(image.id)
                    downloaded += 1
                    log("debug", f"  Queued photo {source_id} for download")
            # Rate limiting: pause once per API request, not once per observation
            time.sleep(1.0 / rate_limit)
        except httpx.HTTPStatusError as e:
            log("error", f"  HTTP error for {species.scientific_name}: {e.response.status_code} - {e.response.text}")
        except httpx.RequestError as e:
            log("error", f"  Request error for {species.scientific_name}: {e}")
        except Exception as e:
            log("error", f"  Error scraping iNaturalist for {species.scientific_name}: {e}")
        return {"downloaded": downloaded, "rejected": rejected}
    def test_connection(self, api_key: ApiKey) -> str:
        """Test iNaturalist API connection."""
        with httpx.Client(timeout=10, headers=self.HEADERS) as client:
            response = client.get(
                f"{self.BASE_URL}/observations",
                params={"per_page": 1},
            )
            response.raise_for_status()
        return "iNaturalist API connection successful"
@@ -0,0 +1,154 @@
 import time
 import logging
 from typing import Dict, Optional
 import httpx
 from sqlalchemy.orm import Session
 from app.scrapers.base import BaseScraper
 from app.models import Species, Image, ApiKey
 from app.workers.quality_tasks import download_and_process_image
 class TrefleScraper(BaseScraper):
    """Scraper for Trefle.io plant database."""
    name = "trefle"
    requires_api_key = True
    BASE_URL = "https://trefle.io/api/v1"
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
    }
    def scrape_species(
        self,
        species: Species,
        db: Session,
        logger: Optional[logging.Logger] = None
    ) -> Dict[str, int]:
        """Scrape images from Trefle for a species."""
        api_key = self.get_api_key(db)
        if not api_key:
            return {"downloaded": 0, "rejected": 0, "error": "No API key configured"}
        rate_limit = api_key.rate_limit_per_sec
        downloaded = 0
        rejected = 0
        try:
            # Search for the species
            params = {
                "token": api_key.api_key,
                "q": species.scientific_name,
            }
            with httpx.Client(timeout=30, headers=self.HEADERS) as client:
                response = client.get(
                    f"{self.BASE_URL}/plants/search",
                    params=params,
                )
                response.raise_for_status()
                data = response.json()
                plants = data.get("data", [])
                for plant in plants:
                    # Get plant details for more images
                    plant_id = plant.get("id")
                    if not plant_id:
                        continue
                    detail_response = client.get(
                        f"{self.BASE_URL}/plants/{plant_id}",
                        params={"token": api_key.api_key},
                    )
                    if detail_response.status_code != 200:
                        continue
                    plant_detail = detail_response.json().get("data", {})
                    # Get main image
                    main_image = plant_detail.get("image_url")
                    if main_image:
                        source_id = f"main_{plant_id}"
                        existing = db.query(Image).filter(
                            Image.source == self.name,
                            Image.source_id == source_id,
                        ).first()
                        if not existing:
                            image = Image(
                                species_id=species.id,
                                source=self.name,
                                source_id=source_id,
                                url=main_image,
                                license="TREFLE",  # Trefle's own license
                                attribution="Trefle.io Plant Database",
                                status="pending",
                            )
                            db.add(image)
                            db.commit()
                            download_and_process_image.delay(image.id)
                            downloaded += 1
                    # Get additional images from species detail
                    images = plant_detail.get("images", {})
                    for image_type, image_list in images.items():
                        if not isinstance(image_list, list):
                            continue
                        for img in image_list:
                            url = img.get("image_url")
                            if not url:
                                continue
                            img_id = img.get("id", url.split("/")[-1])
                            source_id = f"{image_type}_{img_id}"
                            existing = db.query(Image).filter(
                                Image.source == self.name,
                                Image.source_id == source_id,
                            ).first()
                            if existing:
                                continue
                            copyright_info = img.get("copyright", "")
                            image = Image(
                                species_id=species.id,
                                source=self.name,
                                source_id=source_id,
                                url=url,
                                license="TREFLE",
                                attribution=copyright_info or "Trefle.io",
                                status="pending",
                            )
                            db.add(image)
                            db.commit()
                            download_and_process_image.delay(image.id)
                            downloaded += 1
                    # Rate limiting
                    time.sleep(1.0 / rate_limit)
        except Exception as e:
            # Report through the job logger when one was passed in
            (logger or logging.getLogger(__name__)).error(
                f"Error scraping Trefle for {species.scientific_name}: {e}"
            )
        return {"downloaded": downloaded, "rejected": rejected}
    def test_connection(self, api_key: ApiKey) -> str:
        """Test Trefle API connection."""
        params = {"token": api_key.api_key}
        with httpx.Client(timeout=10, headers=self.HEADERS) as client:
            response = client.get(
                f"{self.BASE_URL}/plants",
                params=params,
            )
            response.raise_for_status()
        return "Trefle API connection successful"
@@ -0,0 +1,146 @@
 import time
 import logging
 from typing import Dict, Optional
 import httpx
 from sqlalchemy.orm import Session
 from app.scrapers.base import BaseScraper
 from app.models import Species, Image, ApiKey
 from app.workers.quality_tasks import download_and_process_image
 class WikimediaScraper(BaseScraper):
    """Scraper for Wikimedia Commons images."""
    name = "wikimedia"
    requires_api_key = False
    BASE_URL = "https://commons.wikimedia.org/w/api.php"
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
    }
    def scrape_species(
        self,
        species: Species,
        db: Session,
        logger: Optional[logging.Logger] = None
    ) -> Dict[str, int]:
        """Scrape images from Wikimedia Commons for a species."""
        api_key = self.get_api_key(db)
        rate_limit = api_key.rate_limit_per_sec if api_key else 1.0
        downloaded = 0
        rejected = 0
        try:
            # Search for images in the species category
            search_term = species.scientific_name
            params = {
                "action": "query",
                "format": "json",
                "generator": "search",
                "gsrsearch": f"filetype:bitmap {search_term}",
                "gsrnamespace": 6,  # File namespace
                "gsrlimit": 50,
                "prop": "imageinfo",
                "iiprop": "url|extmetadata|size",
            }
            with httpx.Client(timeout=30, headers=self.HEADERS) as client:
                response = client.get(self.BASE_URL, params=params)
                response.raise_for_status()
                data = response.json()
            pages = data.get("query", {}).get("pages", {})
            for page_id, page in pages.items():
                if int(page_id) < 0:
                    continue
                imageinfo = page.get("imageinfo", [{}])[0]
                url = imageinfo.get("url", "")
                if not url:
                    continue
                # Check size
                width = imageinfo.get("width", 0)
                height = imageinfo.get("height", 0)
                if width < 256 or height < 256:
                    rejected += 1
                    continue
                # Get license from metadata
                metadata = imageinfo.get("extmetadata", {})
                license_info = metadata.get("LicenseShortName", {}).get("value", "")
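                # Filter for commercial-safe licenses. A plain substring test for
                # "CC BY" would also accept NonCommercial (NC) and NoDerivatives (ND)
                # variants, so those are rejected explicitly. ShareAlike (SA) is
                # still accepted, as before.
                license_upper = license_info.upper()
                is_restricted = "-NC" in license_upper or "-ND" in license_upper
                is_free = "CC BY" in license_upper or "CC0" in license_upper or "PUBLIC DOMAIN" in license_upper
                if is_free and not is_restricted:
                    license_code = license_info
                else:
                    rejected += 1
                    continue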
                # Check if already exists
                source_id = str(page_id)
                existing = db.query(Image).filter(
                    Image.source == self.name,
                    Image.source_id == source_id,
                ).first()
                if existing:
                    continue
                # Get attribution
                artist = metadata.get("Artist", {}).get("value", "Unknown")
                # Clean HTML from artist
                if "<" in artist:
                    import re
                    artist = re.sub(r"<[^>]+>", "", artist).strip()
                attribution = f"{artist} via Wikimedia Commons ({license_code})"
                # Create image record
                image = Image(
                    species_id=species.id,
                    source=self.name,
                    source_id=source_id,
                    url=url,
                    license=license_code,
                    attribution=attribution,
                    width=width,
                    height=height,
                    status="pending",
                )
                db.add(image)
                db.commit()
                # Queue for download
                download_and_process_image.delay(image.id)
                downloaded += 1
            # Rate limiting
            time.sleep(1.0 / rate_limit)
        except Exception as e:
            # Report through the job logger when one was passed in
            (logger or logging.getLogger(__name__)).error(
                f"Error scraping Wikimedia for {species.scientific_name}: {e}"
            )
        return {"downloaded": downloaded, "rejected": rejected}
    def test_connection(self, api_key: ApiKey) -> str:
        """Test Wikimedia API connection."""
        params = {
            "action": "query",
            "format": "json",
            "meta": "siteinfo",
        }
        with httpx.Client(timeout=10, headers=self.HEADERS) as client:
            response = client.get(self.BASE_URL, params=params)
            response.raise_for_status()
        return "Wikimedia Commons API connection successful"
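
The Wikimedia scraper decides whether an image is commercially usable from the free-text `LicenseShortName` value. The sketch below shows one way to make that decision explicit and testable. It assumes short names look like "CC BY 4.0", "CC BY-SA 3.0", "CC0" or "Public domain"; `is_commercial_safe` is a hypothetical helper, and treating ShareAlike as acceptable is a policy assumption that a project may want to tighten.

```python
import re


def is_commercial_safe(short_name: str) -> bool:
    """Return True for CC0, public domain, and CC BY variants without NC/ND."""
    name = short_name.upper().strip()
    if not name:
        return False
    if "PUBLIC DOMAIN" in name or name.startswith("CC0"):
        return True
    if not name.startswith("CC BY"):
        return False
    # Split "CC BY-NC-SA 4.0" into ["CC", "BY", "NC", "SA", "4.0"] and
    # reject NonCommercial / NoDerivatives clauses.
    clauses = re.split(r"[\s-]+", name)
    return "NC" not in clauses and "ND" not in clauses


if __name__ == "__main__":
    for label in ["CC BY 4.0", "CC BY-SA 3.0", "CC BY-NC 2.0", "CC0", "GFDL"]:
        print(label, is_commercial_safe(label))
```

Splitting into clauses avoids false matches that a plain substring test for "NC" or "ND" could produce on unrelated text.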
@@ -0,0 +1 @@
 # Utility functions
@@ -0,0 +1,80 @@
 """Image deduplication utilities using perceptual hashing."""
 from typing import Optional
 import imagehash
 from PIL import Image as PILImage
 def calculate_phash(image_path: str) -> Optional[str]:
    """
    Calculate perceptual hash for an image.
    Args:
        image_path: Path to image file
    Returns:
        Hex string of perceptual hash, or None if failed
    """
    try:
        with PILImage.open(image_path) as img:
            return str(imagehash.phash(img))
    except Exception:
        return None
 def calculate_dhash(image_path: str) -> Optional[str]:
    """
    Calculate difference hash for an image.
    Faster but less accurate than phash.
    Args:
        image_path: Path to image file
    Returns:
        Hex string of difference hash, or None if failed
    """
    try:
        with PILImage.open(image_path) as img:
            return str(imagehash.dhash(img))
    except Exception:
        return None
 def hashes_are_similar(hash1: str, hash2: str, threshold: int = 10) -> bool:
    """
    Check if two hashes are similar (potential duplicates).
    Args:
        hash1: First hash string
        hash2: Second hash string
        threshold: Maximum Hamming distance (default 10)
    Returns:
        True if hashes are similar
    """
    try:
        h1 = imagehash.hex_to_hash(hash1)
        h2 = imagehash.hex_to_hash(hash2)
        return (h1 - h2) <= threshold
    except Exception:
        return False
 def hamming_distance(hash1: str, hash2: str) -> int:
    """
    Calculate Hamming distance between two hashes.
    Args:
        hash1: First hash string
        hash2: Second hash string
    Returns:
        Hamming distance (0 = identical, higher = more different)
    """
    try:
        h1 = imagehash.hex_to_hash(hash1)
        h2 = imagehash.hex_to_hash(hash2)
        return int(h1 - h2)
    except Exception:
        return 64  # Maximum distance
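
The helpers above delegate the comparison to `imagehash`, which subtracts two hash objects to get their Hamming distance. For hex strings of equal length the same quantity can be computed with the standard library alone, which may help when reasoning about the `threshold` value. A minimal sketch (`hex_hamming_distance` and `is_near_duplicate` are illustrative names, not part of this commit):

```python
def hex_hamming_distance(hash1: str, hash2: str) -> int:
    """Count the differing bits between two equal-length hex hash strings."""
    if len(hash1) != len(hash2):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(int(hash1, 16) ^ int(hash2, 16)).count("1")


def is_near_duplicate(hash1: str, hash2: str, threshold: int = 10) -> bool:
    """Treat two hashes as duplicates when at most `threshold` bits differ."""
    return hex_hamming_distance(hash1, hash2) <= threshold


if __name__ == "__main__":
    a = "ffffffffffffffff"
    b = "fffffffffffffc00"  # lowest 10 bits cleared
    print(hex_hamming_distance(a, b), is_near_duplicate(a, b))
```

With a 64-bit hash, a threshold of 10 tolerates roughly 15% differing bits, so it is worth checking against real near-duplicates before relying on the default.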
@@ -0,0 +1,109 @@
 """Image quality assessment utilities."""
 import numpy as np
 from PIL import Image as PILImage
 from scipy import ndimage
 def calculate_blur_score(image_path: str) -> float:
    """
    Calculate blur score using Laplacian variance.
    Higher score = sharper image.
    Args:
        image_path: Path to image file
    Returns:
        Variance of Laplacian (higher = sharper)
    """
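    try:
        with PILImage.open(image_path) as img:
            # Use a signed float array: on uint8 input the Laplacian's negative
            # responses would wrap around and distort the variance.
            img_array = np.asarray(img.convert("L"), dtype=np.float64)
        laplacian = ndimage.laplace(img_array)
        return float(np.var(laplacian))
    except Exception:
        return 0.0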
 def is_too_blurry(image_path: str, threshold: float = 100.0) -> bool:
    """
    Check if image is too blurry for training.
    Args:
        image_path: Path to image file
        threshold: Minimum acceptable blur score (default 100)
    Returns:
        True if image is too blurry
    """
    score = calculate_blur_score(image_path)
    return score < threshold
 def get_image_dimensions(image_path: str) -> tuple[int, int]:
    """
    Get image dimensions.
    Args:
        image_path: Path to image file
    Returns:
        Tuple of (width, height)
    """
    try:
        with PILImage.open(image_path) as img:
            return img.size
    except Exception:
        return (0, 0)
 def is_too_small(image_path: str, min_size: int = 256) -> bool:
    """
    Check if image is too small for training.
    Args:
        image_path: Path to image file
        min_size: Minimum dimension size (default 256)
    Returns:
        True if image is too small
    """
    width, height = get_image_dimensions(image_path)
    return width < min_size or height < min_size
 def resize_image(
    image_path: str,
    output_path: str = None,
    max_size: int = 512,
    quality: int = 95,
 ) -> bool:
    """
    Resize image to max dimension while preserving aspect ratio.
    Args:
        image_path: Path to input image
        output_path: Path for output (defaults to overwriting input)
        max_size: Maximum dimension size (default 512)
        quality: JPEG quality (default 95)
    Returns:
        True if successful
    """
    try:
        output_path = output_path or image_path
        with PILImage.open(image_path) as img:
            # Only resize if larger than max_size
            if max(img.size) > max_size:
                img.thumbnail((max_size, max_size), PILImage.Resampling.LANCZOS)
            # Convert to RGB if necessary (JPEG cannot store alpha, palette,
            # or high-bit-depth modes such as LA, RGBA, P, or I;16)
            if img.mode not in ("RGB", "L"):
                img = img.convert("RGB")
            img.save(output_path, "JPEG", quality=quality)
        return True
    except Exception:
        return False
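
The blur score above is the variance of the image's Laplacian: a sharp image has strong positive and negative edge responses, a blurry one has responses close to zero. The sketch below reproduces the idea in pure Python on a nested list of grey values. It is a simplification under stated assumptions: it uses the 4-neighbour kernel and skips border pixels, whereas `scipy.ndimage.laplace` also handles the borders, so the numbers will not match exactly. `laplacian_variance` is a hypothetical name.

```python
from statistics import pvariance
from typing import List


def laplacian_variance(gray: List[List[float]]) -> float:
    """Population variance of the 4-neighbour Laplacian over interior pixels."""
    if len(gray) < 3 or len(gray[0]) < 3:
        return 0.0  # no interior pixels to evaluate
    responses = []
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[0]) - 1):
            # Signed arithmetic matters here: responses can be negative.
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    return float(pvariance(responses))


if __name__ == "__main__":
    flat = [[128.0] * 4 for _ in range(4)]
    checker = [[255.0 * ((x + y) % 2) for x in range(4)] for y in range(4)]
    print(laplacian_variance(flat), laplacian_variance(checker))
```

A uniform patch scores zero while a high-contrast pattern scores very high, which is the behaviour the `threshold` in `is_too_blurry` relies on.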
@@ -0,0 +1,92 @@
 import logging
 import os
 from datetime import datetime
 from pathlib import Path
 from app.config import get_settings
 settings = get_settings()
 def setup_logging():
    """Configure file and console logging."""
    logs_path = Path(settings.logs_path)
    logs_path.mkdir(parents=True, exist_ok=True)
    # Create a dated log file
    log_file = logs_path / f"scraper_{datetime.now().strftime('%Y-%m-%d')}.log"
    # Configure root logger
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        handlers=[
            logging.FileHandler(log_file),
            logging.StreamHandler()
        ]
    )
    return logging.getLogger("plant_scraper")
 def get_logger(name: str = "plant_scraper"):
    """Get a logger instance."""
    logs_path = Path(settings.logs_path)
    logs_path.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    if not logger.handlers:
        logger.setLevel(logging.INFO)
        # File handler writing to a dated log file (the date is fixed when the handler is created)
        log_file = logs_path / f"scraper_{datetime.now().strftime('%Y-%m-%d')}.log"
        file_handler = logging.FileHandler(log_file)
        file_handler.setLevel(logging.INFO)
        file_handler.setFormatter(logging.Formatter(
            '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
        ))
        # Console handler
        console_handler = logging.StreamHandler()
        console_handler.setLevel(logging.INFO)
        console_handler.setFormatter(logging.Formatter(
            '%(asctime)s - %(levelname)s - %(message)s'
        ))
        logger.addHandler(file_handler)
        logger.addHandler(console_handler)
    return logger
 def get_job_logger(job_id: int):
    """Get a logger specific to a job, writing to a job-specific file."""
    logs_path = Path(settings.logs_path)
    logs_path.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(f"job_{job_id}")
    if not logger.handlers:
        logger.setLevel(logging.DEBUG)
        # Job-specific log file
        job_log_file = logs_path / f"job_{job_id}.log"
        file_handler = logging.FileHandler(job_log_file)
        file_handler.setLevel(logging.DEBUG)
        file_handler.setFormatter(logging.Formatter(
            '%(asctime)s - %(levelname)s - %(message)s'
        ))
        # Also log to daily file
        daily_log_file = logs_path / f"scraper_{datetime.now().strftime('%Y-%m-%d')}.log"
        daily_handler = logging.FileHandler(daily_log_file)
        daily_handler.setLevel(logging.INFO)
        daily_handler.setFormatter(logging.Formatter(
            '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
        ))
        logger.addHandler(file_handler)
        logger.addHandler(daily_handler)
    return logger
@@ -0,0 +1 @@
 # Celery workers
@@ -0,0 +1,36 @@
 from celery import Celery
 from app.config import get_settings
 settings = get_settings()
 celery_app = Celery(
    "plant_scraper",
    broker=settings.redis_url,
    backend=settings.redis_url,
    include=[
        "app.workers.scrape_tasks",
        "app.workers.quality_tasks",
        "app.workers.export_tasks",
        "app.workers.stats_tasks",
    ],
 )
 celery_app.conf.update(
    task_serializer="json",
    accept_content=["json"],
    result_serializer="json",
    timezone="UTC",
    enable_utc=True,
    task_track_started=True,
    task_time_limit=3600 * 24,  # 24 hour max per task
    worker_prefetch_multiplier=1,
    task_acks_late=True,
    beat_schedule={
        "refresh-stats-every-5min": {
            "task": "app.workers.stats_tasks.refresh_stats",
            "schedule": 300.0,  # Every 5 minutes
        },
    },
    beat_schedule_filename="/tmp/celerybeat-schedule",
 )
@@ -0,0 +1,170 @@
 import json
 import os
 import random
 import shutil
 import zipfile
 from datetime import datetime
 from pathlib import Path
 from app.workers.celery_app import celery_app
 from app.database import SessionLocal
 from app.models import Export, Image, Species
 from app.config import get_settings
 settings = get_settings()
@celery_app.task(bind=True)
 def generate_export(self, export_id: int):
    """Generate a zip export for CoreML training."""
    db = SessionLocal()
    try:
        export = db.query(Export).filter(Export.id == export_id).first()
        if not export:
            return {"error": "Export not found"}
        # Update status
        export.status = "generating"
        export.celery_task_id = self.request.id
        db.commit()
        # Parse filter criteria
        criteria = json.loads(export.filter_criteria) if export.filter_criteria else {}
        min_images = criteria.get("min_images_per_species", 100)
        licenses = criteria.get("licenses")
        min_quality = criteria.get("min_quality")
        species_ids = criteria.get("species_ids")
        # Build query for images
        query = db.query(Image).filter(Image.status == "downloaded")
        if licenses:
            query = query.filter(Image.license.in_(licenses))
        if min_quality:
            query = query.filter(Image.quality_score >= min_quality)
        if species_ids:
            query = query.filter(Image.species_id.in_(species_ids))
        # Group by species and filter by min count
        from sqlalchemy import func
        species_counts = db.query(
            Image.species_id,
            func.count(Image.id).label("count")
        ).filter(Image.status == "downloaded").group_by(Image.species_id).all()
        valid_species_ids = [s.species_id for s in species_counts if s.count >= min_images]
        if species_ids:
            valid_species_ids = [s for s in valid_species_ids if s in species_ids]
        if not valid_species_ids:
            export.status = "failed"
            export.error_message = "No species meet the criteria"
            export.completed_at = datetime.utcnow()
            db.commit()
            return {"error": "No species meet the criteria"}
        # Create export directory
        export_dir = Path(settings.exports_path) / f"export_{export_id}"
        train_dir = export_dir / "Training"
        test_dir = export_dir / "Testing"
        train_dir.mkdir(parents=True, exist_ok=True)
        test_dir.mkdir(parents=True, exist_ok=True)
        total_images = 0
        species_count = 0
        # Process each valid species
        for i, species_id in enumerate(valid_species_ids):
            species = db.query(Species).filter(Species.id == species_id).first()
            if not species:
                continue
            # Get images for this species. `query` already carries the license,
            # quality, and species filters applied above, so they are not repeated.
            images = query.filter(Image.species_id == species_id).all()
            if len(images) < min_images:
                continue
            species_count += 1
            # Create species folders
            species_name = species.scientific_name.replace(" ", "_")
            (train_dir / species_name).mkdir(exist_ok=True)
            (test_dir / species_name).mkdir(exist_ok=True)
            # Shuffle and split
            random.shuffle(images)
            split_idx = int(len(images) * export.train_split)
            train_images = images[:split_idx]
            test_images = images[split_idx:]
            # Copy images
            for j, img in enumerate(train_images):
                if img.local_path and os.path.exists(img.local_path):
                    ext = Path(img.local_path).suffix or ".jpg"
                    dest = train_dir / species_name / f"img_{j:05d}{ext}"
                    shutil.copy2(img.local_path, dest)
                    total_images += 1
            for j, img in enumerate(test_images):
                if img.local_path and os.path.exists(img.local_path):
                    ext = Path(img.local_path).suffix or ".jpg"
                    dest = test_dir / species_name / f"img_{j:05d}{ext}"
                    shutil.copy2(img.local_path, dest)
                    total_images += 1
            # Update progress
            self.update_state(
                state="PROGRESS",
                meta={
                    "current": i + 1,
                    "total": len(valid_species_ids),
                    "species": species.scientific_name,
                }
            )
        # Create zip file
        zip_path = Path(settings.exports_path) / f"export_{export_id}.zip"
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zipf:
            for root, dirs, files in os.walk(export_dir):
                for file in files:
                    file_path = Path(root) / file
                    arcname = file_path.relative_to(export_dir)
                    zipf.write(file_path, arcname)
        # Clean up directory
        shutil.rmtree(export_dir)
        # Update export record
        export.status = "completed"
        export.file_path = str(zip_path)
        export.file_size = zip_path.stat().st_size
        export.species_count = species_count
        export.image_count = total_images
        export.completed_at = datetime.utcnow()
        db.commit()
        return {
            "status": "completed",
            "species_count": species_count,
            "image_count": total_images,
            "file_size": export.file_size,
        }
    except Exception as e:
        if export:
            export.status = "failed"
            export.error_message = str(e)
            export.completed_at = datetime.utcnow()
            db.commit()
        raise
    finally:
        db.close()
@@ -0,0 +1,224 @@
 import os
 from pathlib import Path
 import httpx
 from PIL import Image as PILImage
 import imagehash
 import numpy as np
 from scipy import ndimage
 from app.workers.celery_app import celery_app
 from app.database import SessionLocal
 from app.models import Image
 from app.config import get_settings
 settings = get_settings()
 def calculate_blur_score(image_path: str) -> float:
    """Calculate blur score using Laplacian variance. Higher = sharper."""
    try:
        img = PILImage.open(image_path).convert("L")
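        # Use floats: ndimage.laplace returns the input dtype by default, and
        # negative Laplacian values would wrap around in a uint8 array.
        img_array = np.array(img, dtype=np.float64)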
        laplacian = ndimage.laplace(img_array)
        return float(np.var(laplacian))
    except Exception:
        return 0.0
 def calculate_phash(image_path: str) -> str:
    """Calculate perceptual hash for deduplication."""
    try:
        img = PILImage.open(image_path)
        return str(imagehash.phash(img))
    except Exception:
        return ""
 def check_color_distribution(image_path: str) -> tuple[bool, str]:
    """Check if image has healthy color distribution for a plant photo.
    Returns (passed, reason) tuple.
    Rejects:
    - Low color variance (mean channel std < 25): herbarium specimens (brown on white)
    - No green + low variance (green ratio < 5% AND mean std < 40): monochrome illustrations
    """
    try:
        img = PILImage.open(image_path).convert("RGB")
        arr = np.array(img, dtype=np.float64)
        # Per-channel standard deviation
        channel_stds = arr.std(axis=(0, 1))  # [R_std, G_std, B_std]
        mean_std = float(channel_stds.mean())
        if mean_std < 25:
            return False, f"Low color variance ({mean_std:.1f})"
        # Check green ratio
        channel_means = arr.mean(axis=(0, 1))
        total = channel_means.sum()
        green_ratio = channel_means[1] / total if total > 0 else 0
        if green_ratio < 0.05 and mean_std < 40:
            return False, f"No green ({green_ratio:.2%}) + low variance ({mean_std:.1f})"
        return True, ""
    except Exception:
        return True, ""  # Don't reject on error
 def resize_image(image_path: str, target_size: int = 512) -> bool:
    """Resize image to target size while maintaining aspect ratio."""
    try:
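        img = PILImage.open(image_path)
        img.thumbnail((target_size, target_size), PILImage.Resampling.LANCZOS)
        # Downloads are saved with a .jpg name; JPEG cannot store alpha or palette modes
        if img.mode not in ("RGB", "L"):
            img = img.convert("RGB")
        img.save(image_path, quality=95)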
        return True
    except Exception:
        return False
@celery_app.task
 def download_and_process_image(image_id: int):
    """Download image, check quality, dedupe, and resize."""
    db = SessionLocal()
    image = None  # Defined up front so the except block can test it safely
    try:
        image = db.query(Image).filter(Image.id == image_id).first()
        if not image:
            return {"error": "Image not found"}
        # Create directory for species
        species = image.species
        species_dir = Path(settings.images_path) / species.scientific_name.replace(" ", "_")
        species_dir.mkdir(parents=True, exist_ok=True)
        # Download image
        filename = f"{image.source}_{image.source_id or image.id}.jpg"
        local_path = species_dir / filename
        try:
            headers = {
                "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
            }
            with httpx.Client(timeout=30, headers=headers, follow_redirects=True) as client:
                response = client.get(image.url)
                response.raise_for_status()
                with open(local_path, "wb") as f:
                    f.write(response.content)
        except Exception as e:
            image.status = "rejected"
            db.commit()
            return {"error": f"Download failed: {e}"}
        # Check minimum size
        try:
            with PILImage.open(local_path) as img:
                width, height = img.size
                if width < 256 or height < 256:
                    os.remove(local_path)
                    image.status = "rejected"
                    db.commit()
                    return {"error": "Image too small"}
                image.width = width
                image.height = height
        except Exception as e:
            if local_path.exists():
                os.remove(local_path)
            image.status = "rejected"
            db.commit()
            return {"error": f"Invalid image: {e}"}
        # Calculate perceptual hash for deduplication
        phash = calculate_phash(str(local_path))
        if phash:
            # Check for duplicates
            existing = db.query(Image).filter(
                Image.phash == phash,
                Image.id != image.id,
                Image.status == "downloaded"
            ).first()
            if existing:
                os.remove(local_path)
                image.status = "rejected"
                image.phash = phash
                db.commit()
                return {"error": "Duplicate image"}
            image.phash = phash
        # Calculate blur score
        quality_score = calculate_blur_score(str(local_path))
        image.quality_score = quality_score
        # Reject very blurry images (threshold can be tuned)
        if quality_score < 100:  # Low variance = blurry
            os.remove(local_path)
            image.status = "rejected"
            db.commit()
            return {"error": "Image too blurry"}
        # Check color distribution (reject herbarium specimens, illustrations)
        color_ok, color_reason = check_color_distribution(str(local_path))
        if not color_ok:
            os.remove(local_path)
            image.status = "rejected"
            db.commit()
            return {"error": f"Non-photo content: {color_reason}"}
        # Resize to 512x512 max
        resize_image(str(local_path))
        # Update image record
        image.local_path = str(local_path)
        image.status = "downloaded"
        db.commit()
        return {
            "status": "success",
            "path": str(local_path),
            "quality_score": quality_score,
        }
    except Exception as e:
        if image:
            image.status = "rejected"
            db.commit()
        return {"error": str(e)}
    finally:
        db.close()
@celery_app.task(bind=True)
 def batch_process_pending_images(self, source: str = None, chunk_size: int = 500):
    """Process ALL pending images in chunks, with progress tracking."""
    db = SessionLocal()
    try:
        query = db.query(Image).filter(Image.status == "pending")
        if source:
            query = query.filter(Image.source == source)
        total = query.count()
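        queued = 0
        last_id = 0
        # Keyset pagination: workers move rows out of "pending" while this loop
        # runs, so OFFSET paging would skip rows as the result set shrinks.
        while True:
            chunk = (
                query.filter(Image.id > last_id)
                .order_by(Image.id)
                .limit(chunk_size)
                .all()
            )
            if not chunk:
                break
            for image in chunk:
                download_and_process_image.delay(image.id)
                queued += 1
            last_id = chunk[-1].id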
            self.update_state(
                state="PROGRESS",
                meta={"queued": queued, "total": total},
            )
        return {"queued": queued, "total": total}
    finally:
        db.close()
@@ -0,0 +1,164 @@
 import json
 from datetime import datetime
 from app.workers.celery_app import celery_app
 from app.database import SessionLocal
 from app.models import Job, Species, Image
 from app.utils.logging import get_job_logger
@celery_app.task(bind=True)
 def run_scrape_job(self, job_id: int):
    """Main scrape task that dispatches to source-specific scrapers."""
    logger = get_job_logger(job_id)
    logger.info(f"Starting scrape job {job_id}")
    db = SessionLocal()
    job = None
    try:
        job = db.query(Job).filter(Job.id == job_id).first()
        if not job:
            logger.error(f"Job {job_id} not found")
            return {"error": "Job not found"}
        logger.info(f"Job: {job.name}, Source: {job.source}")
        # Update job status
        job.status = "running"
        job.started_at = datetime.utcnow()
        job.celery_task_id = self.request.id
        db.commit()
        # Get species to scrape
        if job.species_filter:
            species_ids = json.loads(job.species_filter)
            query = db.query(Species).filter(Species.id.in_(species_ids))
            logger.info(f"Filtered to species IDs: {species_ids}")
        else:
            query = db.query(Species)
            logger.info("Scraping all species")
        # Filter by image count if requested
        if job.only_without_images or job.max_images:
            from sqlalchemy import func
            # Subquery to count downloaded images per species
            image_count_subquery = (
                db.query(Image.species_id, func.count(Image.id).label("count"))
                .filter(Image.status == "downloaded")
                .group_by(Image.species_id)
                .subquery()
            )
            # Left join with the count subquery
            query = query.outerjoin(
                image_count_subquery,
                Species.id == image_count_subquery.c.species_id
            )
            if job.only_without_images:
                # Filter where count is NULL or 0
                query = query.filter(
                    image_count_subquery.c.count.is_(None) | (image_count_subquery.c.count == 0)
                )
                logger.info("Filtering to species without images")
            elif job.max_images:
                # Filter where count is NULL or less than max_images
                query = query.filter(
                    image_count_subquery.c.count.is_(None) | (image_count_subquery.c.count < job.max_images)
                )
                logger.info(f"Filtering to species with fewer than {job.max_images} images")
        species_list = query.all()
        logger.info(f"Total species to scrape: {len(species_list)}")
        job.progress_total = len(species_list)
        db.commit()
        # Import scraper based on source
        from app.scrapers import get_scraper
        scraper = get_scraper(job.source)
        if not scraper:
            error_msg = f"Unknown source: {job.source}"
            logger.error(error_msg)
            job.status = "failed"
            job.error_message = error_msg
            job.completed_at = datetime.utcnow()
            db.commit()
            return {"error": error_msg}
        logger.info(f"Using scraper: {scraper.name}")
        # Scrape each species
        for i, species in enumerate(species_list):
            try:
                # Update progress
                job.progress_current = i + 1
                db.commit()
                logger.info(f"[{i+1}/{len(species_list)}] Scraping: {species.scientific_name}")
                # Update task state for real-time monitoring
                self.update_state(
                    state="PROGRESS",
                    meta={
                        "current": i + 1,
                        "total": len(species_list),
                        "species": species.scientific_name,
                    }
                )
                # Run scraper for this species
                results = scraper.scrape_species(species, db, logger)
                downloaded = results.get("downloaded", 0)
                rejected = results.get("rejected", 0)
                job.images_downloaded += downloaded
                job.images_rejected += rejected
                db.commit()
                logger.info(f"  -> Downloaded: {downloaded}, Rejected: {rejected}")
            except Exception as e:
                # Roll back first so a failed transaction does not poison the
                # session, then log the error and continue with other species
                db.rollback()
                logger.error(f"Error scraping {species.scientific_name}: {e}", exc_info=True)
                continue
        # Mark job complete
        job.status = "completed"
        job.completed_at = datetime.utcnow()
        db.commit()
        logger.info(f"Job {job_id} completed. Total downloaded: {job.images_downloaded}, rejected: {job.images_rejected}")
        return {
            "status": "completed",
            "downloaded": job.images_downloaded,
            "rejected": job.images_rejected,
        }
    except Exception as e:
        logger.error(f"Job {job_id} failed with error: {e}", exc_info=True)
        if job:
            job.status = "failed"
            job.error_message = str(e)
            job.completed_at = datetime.utcnow()
            db.commit()
        raise
    finally:
        db.close()
@celery_app.task
 def pause_scrape_job(job_id: int):
    """Pause a running scrape job."""
    db = SessionLocal()
    try:
        job = db.query(Job).filter(Job.id == job_id).first()
        if job and job.status == "running":
            job.status = "paused"
            db.commit()
            # Revoke the Celery task
            if job.celery_task_id:
                celery_app.control.revoke(job.celery_task_id, terminate=True)
        return {"status": "paused"}
    finally:
        db.close()
@@ -0,0 +1,193 @@
 import json
 import os
 from datetime import datetime
 from pathlib import Path
 from sqlalchemy import func, case, text
 from app.workers.celery_app import celery_app
 from app.database import SessionLocal
 from app.models import Species, Image, Job
 from app.models.cached_stats import CachedStats
 from app.config import get_settings
 def get_directory_size_fast(path: str) -> int:
    """Get directory size in bytes using fast os.scandir."""
    total = 0
    try:
        with os.scandir(path) as it:
            for entry in it:
                try:
                    if entry.is_file(follow_symlinks=False):
                        total += entry.stat(follow_symlinks=False).st_size
                    elif entry.is_dir(follow_symlinks=False):
                        total += get_directory_size_fast(entry.path)
                except (OSError, PermissionError):
                    pass
    except (OSError, PermissionError):
        pass
    return total
@celery_app.task
 def refresh_stats():
    """Calculate and cache dashboard statistics."""
    print("=== STATS TASK: Starting refresh ===", flush=True)
    db = SessionLocal()
    try:
        # Use raw SQL for maximum performance on SQLite
        # All counts in a single query
        counts_sql = text("""
            SELECT
                (SELECT COUNT(*) FROM species) as total_species,
                (SELECT COUNT(*) FROM images) as total_images,
                (SELECT COUNT(*) FROM images WHERE status = 'downloaded') as images_downloaded,
                (SELECT COUNT(*) FROM images WHERE status = 'pending') as images_pending,
                (SELECT COUNT(*) FROM images WHERE status = 'rejected') as images_rejected
        """)
        counts = db.execute(counts_sql).fetchone()
        total_species = counts[0] or 0
        total_images = counts[1] or 0
        images_downloaded = counts[2] or 0
        images_pending = counts[3] or 0
        images_rejected = counts[4] or 0
        # Per-source stats - single query with GROUP BY
        source_sql = text("""
            SELECT
                source,
                COUNT(*) as total,
                SUM(CASE WHEN status = 'downloaded' THEN 1 ELSE 0 END) as downloaded,
                SUM(CASE WHEN status = 'pending' THEN 1 ELSE 0 END) as pending,
                SUM(CASE WHEN status = 'rejected' THEN 1 ELSE 0 END) as rejected
            FROM images
            GROUP BY source
        """)
        source_stats_raw = db.execute(source_sql).fetchall()
        sources = [
            {
                "source": s[0],
                "image_count": s[1],
                "downloaded": s[2] or 0,
                "pending": s[3] or 0,
                "rejected": s[4] or 0,
            }
            for s in source_stats_raw
        ]
        # Per-license stats - single indexed query
        license_sql = text("""
            SELECT license, COUNT(*) as count
            FROM images
            WHERE status = 'downloaded'
            GROUP BY license
        """)
        license_stats_raw = db.execute(license_sql).fetchall()
        licenses = [
            {"license": row[0], "count": row[1]}
            for row in license_stats_raw
        ]
        # Job stats - single query
        job_sql = text("""
            SELECT
                SUM(CASE WHEN status = 'running' THEN 1 ELSE 0 END) as running,
                SUM(CASE WHEN status = 'pending' THEN 1 ELSE 0 END) as pending,
                SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) as completed,
                SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) as failed
            FROM jobs
        """)
        job_counts = db.execute(job_sql).fetchone()
        jobs = {
            "running": job_counts[0] or 0,
            "pending": job_counts[1] or 0,
            "completed": job_counts[2] or 0,
            "failed": job_counts[3] or 0,
        }
        # Top species by image count - optimized with index
        top_sql = text("""
            SELECT s.id, s.scientific_name, s.common_name, COUNT(i.id) as image_count
            FROM species s
            INNER JOIN images i ON i.species_id = s.id AND i.status = 'downloaded'
            GROUP BY s.id
            ORDER BY image_count DESC
            LIMIT 10
        """)
        top_species_raw = db.execute(top_sql).fetchall()
        top_species = [
            {
                "id": s[0],
                "scientific_name": s[1],
                "common_name": s[2],
                "image_count": s[3],
            }
            for s in top_species_raw
        ]
        # Under-represented species - use pre-computed counts
        under_sql = text("""
            SELECT s.id, s.scientific_name, s.common_name, COALESCE(img_counts.cnt, 0) as image_count
            FROM species s
            LEFT JOIN (
                SELECT species_id, COUNT(*) as cnt
                FROM images
                WHERE status = 'downloaded'
                GROUP BY species_id
            ) img_counts ON img_counts.species_id = s.id
            WHERE COALESCE(img_counts.cnt, 0) < 100
            ORDER BY image_count ASC
            LIMIT 10
        """)
        under_rep_raw = db.execute(under_sql).fetchall()
        under_represented = [
            {
                "id": s[0],
                "scientific_name": s[1],
                "common_name": s[2],
                "image_count": s[3],
            }
            for s in under_rep_raw
        ]
        # Calculate disk usage (fast recursive scan)
        settings = get_settings()
        disk_usage_bytes = get_directory_size_fast(settings.images_path)
        disk_usage_mb = round(disk_usage_bytes / (1024 * 1024), 2)
        # Build the stats object
        stats = {
            "total_species": total_species,
            "total_images": total_images,
            "images_downloaded": images_downloaded,
            "images_pending": images_pending,
            "images_rejected": images_rejected,
            "disk_usage_mb": disk_usage_mb,
            "sources": sources,
            "licenses": licenses,
            "jobs": jobs,
            "top_species": top_species,
            "under_represented": under_represented,
        }
        # Store in database
        cached = db.query(CachedStats).filter(CachedStats.key == "dashboard_stats").first()
        if cached:
            cached.value = json.dumps(stats)
            cached.updated_at = datetime.utcnow()
        else:
            cached = CachedStats(key="dashboard_stats", value=json.dumps(stats))
            db.add(cached)
        db.commit()
        print(f"=== STATS TASK: Refreshed (species={total_species}, images={total_images}) ===", flush=True)
        return {"status": "success", "total_species": total_species, "total_images": total_images}
    except Exception as e:
        print(f"=== STATS TASK ERROR: {e} ===", flush=True)
        raise
    finally:
        db.close()
@@ -0,0 +1,34 @@
 # Web framework
 fastapi==0.109.0
 uvicorn[standard]==0.27.0
 python-multipart==0.0.6
 # Database
 sqlalchemy==2.0.25
 alembic==1.13.1
 aiosqlite==0.19.0
 # Task queue
 celery==5.3.6
 redis==5.0.1
 # Image processing
 Pillow==10.2.0
 imagehash==4.3.1
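 imagededup==0.3.3.post2
 numpy  # imported directly by quality_tasks.py; left unpinned
 scipy  # imported directly by quality_tasks.py (ndimage.laplace); left unpinned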
 # HTTP clients
 httpx==0.26.0
 aiohttp==3.9.3
 # Search
 duckduckgo-search
 # Utilities
 python-dotenv==1.0.0
 pydantic==2.5.3
 pydantic-settings==2.1.0
 # Testing
 pytest==7.4.4
 pytest-asyncio==0.23.3
@@ -0,0 +1 @@
 # Tests
@@ -0,0 +1,114 @@
 # Docker Compose for Unraid
 #
 # Access at http://YOUR_UNRAID_IP:8580
 #
 # ============================================
 # CONFIGURE THESE PATHS FOR YOUR UNRAID SETUP
 # ============================================
 # Edit the left side of the colon (:) for each volume mount
 #
 # DATABASE_PATH: Where to store the SQLite database
 # IMAGES_PATH:   Where to store downloaded images (can be large, 100GB+)
 # EXPORTS_PATH:  Where to store generated export zip files
 # IMPORTS_PATH:  Where to place images for bulk import (source/species/images)
 # LOGS_PATH:     Where to store scraper log files for debugging
 services:
  backend:
    build:
      context: /mnt/user/appdata/PlantGuideScraper/backend
      dockerfile: Dockerfile
    container_name: plant-scraper-backend
    restart: unless-stopped
    volumes:
      - /mnt/user/appdata/PlantGuideScraper/backend:/app:ro
      # === CONFIGURABLE DATA PATHS ===
      - /mnt/user/downloads/PlantGuideDocker/database:/data/db          # DATABASE_PATH
      - /mnt/user/downloads/PlantGuideDocker/images:/data/images        # IMAGES_PATH
      - /mnt/user/downloads/PlantGuideDocker/exports:/data/exports      # EXPORTS_PATH
      - /mnt/user/downloads/PlantGuideDocker/imports:/data/imports      # IMPORTS_PATH
      - /mnt/user/downloads/PlantGuideDocker/logs:/data/logs            # LOGS_PATH
    environment:
      - DATABASE_URL=sqlite:////data/db/plants.sqlite
      - REDIS_URL=redis://plant-scraper-redis:6379/0
      - IMAGES_PATH=/data/images
      - EXPORTS_PATH=/data/exports
      - IMPORTS_PATH=/data/imports
      - LOGS_PATH=/data/logs
    depends_on:
      - redis
    command: uvicorn app.main:app --host 0.0.0.0 --port 8000
    networks:
      - plant-scraper
  celery:
    build:
      context: /mnt/user/appdata/PlantGuideScraper/backend
      dockerfile: Dockerfile
    container_name: plant-scraper-celery
    restart: unless-stopped
    volumes:
      - /mnt/user/appdata/PlantGuideScraper/backend:/app:ro
      # === CONFIGURABLE DATA PATHS (must match backend) ===
      - /mnt/user/downloads/PlantGuideDocker/database:/data/db          # DATABASE_PATH
      - /mnt/user/downloads/PlantGuideDocker/images:/data/images        # IMAGES_PATH
      - /mnt/user/downloads/PlantGuideDocker/exports:/data/exports      # EXPORTS_PATH
      - /mnt/user/downloads/PlantGuideDocker/imports:/data/imports      # IMPORTS_PATH
      - /mnt/user/downloads/PlantGuideDocker/logs:/data/logs            # LOGS_PATH
    environment:
      - DATABASE_URL=sqlite:////data/db/plants.sqlite
      - REDIS_URL=redis://plant-scraper-redis:6379/0
      - IMAGES_PATH=/data/images
      - EXPORTS_PATH=/data/exports
      - IMPORTS_PATH=/data/imports
      - LOGS_PATH=/data/logs
    depends_on:
      - redis
    command: celery -A app.workers.celery_app worker --beat --loglevel=info --concurrency=4
    networks:
      - plant-scraper
  redis:
    image: redis:7-alpine
    container_name: plant-scraper-redis
    restart: unless-stopped
    volumes:
      - /mnt/user/appdata/PlantGuideScraper/redis:/data
    networks:
      - plant-scraper
  frontend:
    build:
      context: /mnt/user/appdata/PlantGuideScraper/frontend
      dockerfile: Dockerfile
    container_name: plant-scraper-frontend
    restart: unless-stopped
    volumes:
      - /mnt/user/appdata/PlantGuideScraper/frontend:/app
      - plant-scraper-node-modules:/app/node_modules
    environment:
      - VITE_API_URL=
    command: npm run dev -- --host
    networks:
      - plant-scraper
  nginx:
    image: nginx:alpine
    container_name: plant-scraper-nginx
    restart: unless-stopped
    ports:
      - "8580:80"
    volumes:
      - /mnt/user/appdata/PlantGuideScraper/nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - backend
      - frontend
    networks:
      - plant-scraper
 networks:
  plant-scraper:
    name: plant-scraper
 volumes:
  plant-scraper-node-modules:
@@ -0,0 +1,73 @@
 services:
  backend:
    build:
      context: ./backend
      dockerfile: Dockerfile
    container_name: plant-scraper-backend
    # Port exposed only internally, nginx proxies to it
    volumes:
      - ./backend:/app
      - ./data:/data
    environment:
      - DATABASE_URL=sqlite:////data/db/plants.sqlite
      - REDIS_URL=redis://redis:6379/0
      - IMAGES_PATH=/data/images
      - EXPORTS_PATH=/data/exports
      - IMPORTS_PATH=/data/imports
      - LOGS_PATH=/data/logs
    depends_on:
      - redis
    command: uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
  celery:
    build:
      context: ./backend
      dockerfile: Dockerfile
    container_name: plant-scraper-celery
    volumes:
      - ./backend:/app
      - ./data:/data
    environment:
      - DATABASE_URL=sqlite:////data/db/plants.sqlite
      - REDIS_URL=redis://redis:6379/0
      - IMAGES_PATH=/data/images
      - EXPORTS_PATH=/data/exports
      - IMPORTS_PATH=/data/imports
      - LOGS_PATH=/data/logs
    depends_on:
      - redis
    command: celery -A app.workers.celery_app worker --beat --loglevel=info --concurrency=4
  redis:
    image: redis:7-alpine
    container_name: plant-scraper-redis
    # Port exposed only internally, not to host (avoid conflicts)
    volumes:
      - redis_data:/data
  frontend:
    build:
      context: ./frontend
      dockerfile: Dockerfile
    container_name: plant-scraper-frontend
    # Port exposed only internally, nginx proxies to it
    volumes:
      - ./frontend:/app
      - /app/node_modules
    environment:
      - VITE_API_URL=
    command: npm run dev -- --host
  nginx:
    image: nginx:alpine
    container_name: plant-scraper-nginx
    ports:
      - "80:80"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - backend
      - frontend
 volumes:
  redis_data:
@@ -0,0 +1,564 @@
 # Houseplant Image Scraper - Master Plan
 ## Overview
 Web-based interface for managing a multi-source image scraping pipeline targeting 5-10K houseplant species with 1-5M total images. Runs on Unraid via Docker, exports datasets for CoreML training.
 ---
 ## Requirements Summary
 | Requirement | Value |
 |-------------|-------|
 | Platform | Web app in Docker on Unraid |
 | Sources | iNaturalist/GBIF, Flickr, Wikimedia Commons, Trefle, USDA PLANTS, EOL |
 | API keys | Configurable per service |
 | Species list | Manual import (CSV/paste) |
 | Grouping | Species, genus, source, license (faceted) |
 | Search/filter | Yes |
 | Quality filter | Automatic (hash dedup, blur, size) |
 | Progress | Real-time dashboard |
 | Storage | `/species_name/image.jpg` + SQLite DB |
 | Export | Filtered zip for CoreML, downloadable anytime |
 | Auth | None (single user) |
 | Deployment | Docker Compose |
 ---
 ## Create ML Export Requirements
 Per [Apple's documentation](https://developer.apple.com/documentation/createml/creating-an-image-classifier-model):
 - **Folder structure**: `/SpeciesName/image001.jpg` (folder name = class label)
 - **Train/Test split**: 80/20 recommended, separate folders
 - **Balance**: Roughly equal images per class (avoid bias)
 - **No metadata needed**: Create ML uses folder names as labels
 ### Export Format
 ```
 dataset_export/
 ├── Training/
 │   ├── Monstera_deliciosa/
 │   │   ├── img001.jpg
 │   │   └── ...
 │   ├── Philodendron_hederaceum/
 │   └── ...
 └── Testing/
    ├── Monstera_deliciosa/
    └── ...
 ```
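
The layout above could be produced by a split step along the following lines. This is a minimal sketch, not the project's export worker: the helper name `split_for_createml`, its parameters, and the fixed random seed are assumptions made for illustration. It copies each species folder into `Training/` and `Testing/` using the 80/20 ratio mentioned above, and shuffles with a seeded generator so that repeated exports give the same split.

```python
import random
import shutil
from pathlib import Path


def split_for_createml(images_root: Path, export_root: Path,
                       train_split: float = 0.8, seed: int = 42) -> dict[str, tuple[int, int]]:
    """Copy images_root/<species>/* into export_root/Training|Testing/<species>/.

    Hypothetical helper. Returns {species: (train_count, test_count)}.
    """
    rng = random.Random(seed)  # seeded so the split is reproducible
    counts: dict[str, tuple[int, int]] = {}
    for species_dir in sorted(p for p in images_root.iterdir() if p.is_dir()):
        files = sorted(f for f in species_dir.iterdir() if f.is_file())
        rng.shuffle(files)
        cut = int(len(files) * train_split)  # first `cut` files go to Training
        for subset, chunk in (("Training", files[:cut]), ("Testing", files[cut:])):
            dest = export_root / subset / species_dir.name  # folder name = class label
            dest.mkdir(parents=True, exist_ok=True)
            for f in chunk:
                shutil.copy2(f, dest / f.name)
        counts[species_dir.name] = (cut, len(files) - cut)
    return counts
```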
 ---
 ## Data Sources
 | Source | API/Method | License Filter | Rate Limits | Notes |
 |--------|------------|----------------|-------------|-------|
 | **iNaturalist/GBIF** | Bulk DwC-A export + API | CC0, CC-BY | 1 req/sec, 10k/day, 5GB/hr media | Best source; Research Grade observations are community-verified |
 | **Flickr** | flickr.photos.search | license=4,9 (CC-BY, CC0) | 3600 req/hr | Good supplemental |
 | **Wikimedia Commons** | MediaWiki API + pyWikiCommons | CC-BY, CC-BY-SA, PD | Generous | Category-based search |
 | **Trefle.io** | REST API | Open source | Free tier | Species metadata + some images |
 | **USDA PLANTS** | REST API | Public Domain | Generous | US-focused, limited images |
 | **Plant.id** | REST API | Commercial | Paid | For validation, not scraping |
 | **Encyclopedia of Life** | API | Mixed | Check each | Aggregator |
 ### Source References
 - iNaturalist: https://www.inaturalist.org/pages/developers
 - iNaturalist bulk download: https://forum.inaturalist.org/t/one-time-bulk-download-dataset/18741
 - Flickr API: https://www.flickr.com/services/api/flickr.photos.search.html
 - Wikimedia Commons API: https://commons.wikimedia.org/wiki/Commons:API
 - pyWikiCommons: https://pypi.org/project/pyWikiCommons/
 - Trefle.io: https://trefle.io/
 - USDA PLANTS: https://data.nal.usda.gov/dataset/usda-plants-database-api-r
 ### Flickr License IDs
 | ID | License |
 |----|---------|
 | 0 | All Rights Reserved |
 | 1 | CC BY-NC-SA 2.0 |
 | 2 | CC BY-NC 2.0 |
 | 3 | CC BY-NC-ND 2.0 |
 | 4 | CC BY 2.0 (Commercial OK) |
 | 5 | CC BY-SA 2.0 |
 | 6 | CC BY-ND 2.0 |
 | 7 | No known copyright restrictions |
 | 8 | United States Government Work |
 | 9 | Public Domain (CC0) |
 **For commercial use**: Filter to license IDs 4, 7, 8, and 9 only. IDs 5 and 6 also permit commercial use, but add share-alike or no-derivatives conditions.
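
As a rough illustration of the filter above, the allowed IDs can be kept in one constant and reused both when building the `license` argument of `flickr.photos.search` (which accepts a comma-separated list of IDs) and when checking results. The constant and function names here are assumptions, not part of the project code.

```python
# Flickr license IDs from the table above that this plan treats as commercially safe.
COMMERCIAL_OK_LICENSE_IDS = frozenset({4, 7, 8, 9})


def flickr_license_param(ids=COMMERCIAL_OK_LICENSE_IDS) -> str:
    """Build the comma-separated value for the `license` search argument."""
    return ",".join(str(i) for i in sorted(ids))


def is_commercial_ok(license_id: int) -> bool:
    """True if a photo's license ID is in the allowed set."""
    return license_id in COMMERCIAL_OK_LICENSE_IDS
```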
 ---
 ## Image Quality Pipeline
 | Stage | Library | Purpose |
 |-------|---------|---------|
 | **Deduplication** | imagededup | Perceptual hash (CNN + hash methods) |
 | **Blur detection** | scipy + Sobel variance | Reject blurry images |
 | **Size filter** | Pillow | Min 256x256 |
 | **Resize** | Pillow | Normalize to 512x512 |
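
The size and deduplication stages can be sketched without any third-party library. This is not the imagededup or Pillow API: it assumes the perceptual hashes have already been computed and stored as hex strings (as in the `phash` column), and the `max_distance` threshold of 4 bits is an assumed value that would need tuning.

```python
MIN_SIDE = 256  # minimum width/height from the table above


def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two equal-length hex hash strings."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def passes_size_filter(width: int, height: int) -> bool:
    """Reject images smaller than MIN_SIDE on either side."""
    return width >= MIN_SIDE and height >= MIN_SIDE


def drop_near_duplicates(hashes: dict[str, str], max_distance: int = 4) -> list[str]:
    """Return the image IDs to keep, in input order.

    An image is dropped when its hash is within max_distance bits of an
    already kept image. Pairwise comparison is O(n^2), which is only
    reasonable for per-species batches.
    """
    kept: list[str] = []
    for image_id, phash in hashes.items():
        if all(hamming_distance(phash, hashes[k]) > max_distance for k in kept):
            kept.append(image_id)
    return kept
```

Keeping the first image seen and dropping later near-matches is a simplification; a real pipeline might prefer the copy with the higher quality score.
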
 ### Library References
 - imagededup: https://github.com/idealo/imagededup
 - imagehash: https://github.com/JohannesBuchner/imagehash
 ---
 ## Technology Stack
 | Component | Choice | Rationale |
 |-----------|--------|-----------|
 | **Backend** | FastAPI (Python) | Async, fast, ML ecosystem, great docs |
 | **Frontend** | React + Tailwind | Fast dev, good component libraries |
 | **Database** | SQLite (+ FTS5) | Simple, no separate container, sufficient for single-user |
 | **Task Queue** | Celery + Redis | Long-running scrape jobs, good monitoring |
 | **Containers** | Docker Compose | Multi-service orchestration |
 Reference: https://github.com/fastapi/full-stack-fastapi-template
 ---
 ## Architecture
 ```
 ┌─────────────────────────────────────────────────────────────────────────┐
 │                         DOCKER COMPOSE ON UNRAID                         │
 ├─────────────────────────────────────────────────────────────────────────┤
 │                                                                          │
 │  ┌─────────────┐    ┌─────────────────────────────────────────────────┐ │
 │  │   NGINX     │    │              FASTAPI BACKEND                     │ │
 │  │   :80       │───▶│  /api/species     - CRUD species list           │ │
 │  │             │    │  /api/sources     - API key management          │ │
 │  └──────┬──────┘    │  /api/jobs        - Scrape job control          │ │
 │         │           │  /api/images      - Search, filter, browse      │ │
 │         ▼           │  /api/export      - Generate zip for CoreML     │ │
 │  ┌─────────────┐    │  /api/stats       - Dashboard metrics           │ │
 │  │   REACT     │    └─────────────────────────────────────────────────┘ │
 │  │   SPA       │                         │                              │
 │  │   :3000     │                         ▼                              │
 │  └─────────────┘    ┌─────────────────────────────────────────────────┐ │
 │                     │              CELERY WORKERS                      │ │
 │  ┌─────────────┐    │  - iNaturalist scraper                          │ │
 │  │   REDIS     │◀───│  - Flickr scraper                               │ │
 │  │   :6379     │    │  - Wikimedia scraper                            │ │
 │  └─────────────┘    │  - Quality filter pipeline                      │ │
 │                     │  - Export generator                              │ │
 │                     └─────────────────────────────────────────────────┘ │
 │                                          │                              │
 │                                          ▼                              │
 │  ┌─────────────────────────────────────────────────────────────────────┐│
 │  │                         STORAGE (Bind Mounts)                        ││
 │  │  /data/db/plants.sqlite     - Species, images metadata, jobs        ││
 │  │  /data/images/{species}/    - Downloaded images                     ││
 │  │  /data/exports/             - Generated zip files                   ││
 │  └─────────────────────────────────────────────────────────────────────┘│
 └─────────────────────────────────────────────────────────────────────────┘
 ```
 ---
 ## Database Schema
 ```sql
 -- Species master list (imported from CSV)
 CREATE TABLE species (
    id INTEGER PRIMARY KEY,
    scientific_name TEXT UNIQUE NOT NULL,
    common_name TEXT,
    genus TEXT,
    family TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
 );
 -- Full-text search index
 CREATE VIRTUAL TABLE species_fts USING fts5(
    scientific_name,
    common_name,
    genus,
    content='species',
    content_rowid='id'
 );
 -- API credentials
 CREATE TABLE api_keys (
    id INTEGER PRIMARY KEY,
    source TEXT UNIQUE NOT NULL,  -- 'flickr', 'inaturalist', 'wikimedia', 'trefle'
    api_key TEXT NOT NULL,
    api_secret TEXT,
    rate_limit_per_sec REAL DEFAULT 1.0,
    enabled BOOLEAN DEFAULT TRUE
 );
 -- Downloaded images
 CREATE TABLE images (
    id INTEGER PRIMARY KEY,
    species_id INTEGER REFERENCES species(id),
    source TEXT NOT NULL,
    source_id TEXT,  -- Original ID from source
    url TEXT NOT NULL,
    local_path TEXT,
    license TEXT NOT NULL,
    attribution TEXT,
    width INTEGER,
    height INTEGER,
    phash TEXT,  -- Perceptual hash for dedup
    quality_score REAL,  -- Blur/quality metric
    status TEXT DEFAULT 'pending',  -- pending, downloaded, rejected, deleted
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(source, source_id)
 );
 -- Index for common queries
 CREATE INDEX idx_images_species ON images(species_id);
 CREATE INDEX idx_images_status ON images(status);
 CREATE INDEX idx_images_source ON images(source);
 CREATE INDEX idx_images_phash ON images(phash);
 -- Scrape jobs
 CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    source TEXT NOT NULL,
    species_filter TEXT,  -- JSON array of species IDs or NULL for all
    status TEXT DEFAULT 'pending',  -- pending, running, paused, completed, failed
    progress_current INTEGER DEFAULT 0,
    progress_total INTEGER DEFAULT 0,
    images_downloaded INTEGER DEFAULT 0,
    images_rejected INTEGER DEFAULT 0,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    error_message TEXT
 );
 -- Export jobs
 CREATE TABLE exports (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    filter_criteria TEXT,  -- JSON: min_images, licenses, min_quality, species_ids
    train_split REAL DEFAULT 0.8,
    status TEXT DEFAULT 'pending',  -- pending, generating, completed, failed
    file_path TEXT,
    file_size INTEGER,
    species_count INTEGER,
    image_count INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    completed_at TIMESTAMP
 );
 ```
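
One consequence of the `UNIQUE(source, source_id)` constraint is that a scraper can re-run safely: inserting the same source record twice can be made a no-op. The sketch below uses a trimmed copy of the schema in an in-memory database; the helper name `record_image` is an assumption. Note that SQLite treats NULLs as distinct in UNIQUE constraints, so rows with a NULL `source_id` are not deduplicated this way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed version of the schema above, enough to show the constraint.
conn.executescript("""
CREATE TABLE species (
    id INTEGER PRIMARY KEY,
    scientific_name TEXT UNIQUE NOT NULL
);
CREATE TABLE images (
    id INTEGER PRIMARY KEY,
    species_id INTEGER REFERENCES species(id),
    source TEXT NOT NULL,
    source_id TEXT,
    url TEXT NOT NULL,
    license TEXT NOT NULL,
    status TEXT DEFAULT 'pending',
    UNIQUE(source, source_id)
);
""")
conn.execute("INSERT INTO species (scientific_name) VALUES ('Monstera deliciosa')")


def record_image(conn, species_id, source, source_id, url, license_):
    """Insert an image row; return True if new, False if already recorded."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO images (species_id, source, source_id, url, license) "
        "VALUES (?, ?, ?, ?, ?)",
        (species_id, source, source_id, url, license_),
    )
    return cur.rowcount == 1  # 0 when the UNIQUE constraint caused the row to be ignored


first = record_image(conn, 1, "flickr", "12345", "https://example.com/a.jpg", "CC-BY")
again = record_image(conn, 1, "flickr", "12345", "https://example.com/a.jpg", "CC-BY")
```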
 ---
 ## API Endpoints
 ### Species
 | Method | Endpoint | Description |
 |--------|----------|-------------|
 | GET | `/api/species` | List species (paginated, searchable) |
 | POST | `/api/species` | Create single species |
 | POST | `/api/species/import` | Bulk import from CSV |
 | GET | `/api/species/{id}` | Get species details |
 | PUT | `/api/species/{id}` | Update species |
 | DELETE | `/api/species/{id}` | Delete species |
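
For the paginated list endpoint, the response envelope could be computed as in the sketch below. The field names (`items`, `total`, `page`, `page_size`, `pages`) follow the `SpeciesListResponse` type in the frontend client; the function itself and the default page size of 50 are assumptions. It slices an in-memory sequence for clarity, whereas a real endpoint would more likely push the window into the SQL query.

```python
import math
from typing import Any, Sequence


def paginate(rows: Sequence[Any], page: int = 1, page_size: int = 50) -> dict[str, Any]:
    """Return one page of rows plus the counters the UI needs."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    total = len(rows)
    start = (page - 1) * page_size  # pages are 1-indexed
    return {
        "items": list(rows[start:start + page_size]),
        "total": total,
        "page": page,
        "page_size": page_size,
        "pages": math.ceil(total / page_size),
    }
```

Requesting a page past the end returns an empty `items` list rather than an error, which keeps the client logic simple.
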
 ### API Keys
 | Method | Endpoint | Description |
 |--------|----------|-------------|
 | GET | `/api/sources` | List configured sources |
 | PUT | `/api/sources/{source}` | Update source config (key, rate limit) |
 ### Jobs
 | Method | Endpoint | Description |
 |--------|----------|-------------|
 | GET | `/api/jobs` | List jobs |
 | POST | `/api/jobs` | Create scrape job |
 | GET | `/api/jobs/{id}` | Get job status |
 | POST | `/api/jobs/{id}/pause` | Pause job |
 | POST | `/api/jobs/{id}/resume` | Resume job |
 | POST | `/api/jobs/{id}/cancel` | Cancel job |
 ### Images
 | Method | Endpoint | Description |
 |--------|----------|-------------|
 | GET | `/api/images` | List images (paginated, filterable) |
 | GET | `/api/images/{id}` | Get image details |
 | DELETE | `/api/images/{id}` | Delete image |
 | POST | `/api/images/bulk-delete` | Bulk delete |
 ### Export
 | Method | Endpoint | Description |
 |--------|----------|-------------|
 | GET | `/api/exports` | List exports |
 | POST | `/api/exports` | Create export job |
 | GET | `/api/exports/{id}` | Get export status |
 | GET | `/api/exports/{id}/download` | Download zip file |
 ### Stats
 | Method | Endpoint | Description |
 |--------|----------|-------------|
 | GET | `/api/stats` | Dashboard statistics |
 | GET | `/api/stats/sources` | Per-source breakdown |
 | GET | `/api/stats/species` | Per-species image counts |
 ---
 ## UI Screens
 ### 1. Dashboard
 - Total species, images by source, images by license
 - Active jobs with progress bars
 - Quick stats: images/sec, disk usage
 - Recent activity feed
 ### 2. Species Management
 - Table: scientific name, common name, genus, image count
 - Import CSV button (drag-and-drop)
 - Search/filter by name, genus
 - Bulk select → "Start Scrape" button
 - Inline editing
 ### 3. API Keys
 - Card per source with:
  - API key input (masked)
  - API secret input (if applicable)
  - Rate limit slider
  - Enable/disable toggle
  - Test connection button
 ### 4. Image Browser
 - Grid view with thumbnails (lazy-loaded)
 - Filters sidebar:
  - Species (autocomplete)
  - Source (checkboxes)
  - License (checkboxes)
  - Quality score (range slider)
  - Status (tabs: all, pending, downloaded, rejected)
 - Sort by: date, quality, species
 - Bulk select → actions (delete, re-queue)
 - Click to view full-size + metadata
 ### 5. Jobs
 - Table: name, source, status, progress, dates
 - Real-time progress updates (WebSocket)
 - Actions: pause, resume, cancel, view logs
 ### 6. Export
 - Filter builder:
  - Min images per species
  - License whitelist
  - Min quality score
  - Species selection (all or specific)
 - Train/test split slider (default 80/20)
 - Preview: estimated species count, image count, file size
 - "Generate Zip" button
 - Download history with re-download links
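
The filter builder above maps naturally onto a selection function. The sketch below is one possible reading, under stated assumptions: images are plain dicts with the column names from the schema, only images with status `downloaded` are eligible, and a species is dropped entirely when it has fewer than the minimum number of surviving images. The name `select_for_export` is hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Optional


def select_for_export(images: Iterable[dict], min_images_per_species: int,
                      licenses: Optional[set[str]] = None,
                      min_quality: Optional[float] = None) -> dict[int, list[dict]]:
    """Group eligible images by species_id and drop under-represented species."""
    by_species: dict[int, list[dict]] = defaultdict(list)
    for img in images:
        if img["status"] != "downloaded":
            continue  # assumption: pending/rejected images are never exported
        if licenses is not None and img["license"] not in licenses:
            continue  # license whitelist
        score = img["quality_score"]
        if min_quality is not None and (score is None or score < min_quality):
            continue  # unscored images fail a quality threshold
        by_species[img["species_id"]].append(img)
    return {sid: imgs for sid, imgs in by_species.items()
            if len(imgs) >= min_images_per_species}
```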
 ---
 ## Tradeoffs
 | Decision | Alternative | Why This Choice |
 |----------|-------------|-----------------|
 | SQLite | PostgreSQL | Single-user, simpler Docker setup, sufficient for millions of rows |
 | Celery+Redis | RQ, Dramatiq | Battle-tested, good monitoring (Flower) |
 | React | Vue, Svelte | Largest ecosystem, more component libraries |
 | Separate workers | Threads in FastAPI | Better isolation, can scale workers independently |
 | Nginx reverse proxy | Traefik | Simpler config for single-app deployment |
 ---
 ## Risks & Mitigations
 | Risk | Likelihood | Mitigation |
 |------|------------|------------|
 | iNaturalist rate limits (5GB/hr) | High | Throttle downloads, prioritize species with low counts |
 | Disk fills up | Medium | Dashboard shows disk usage, configurable storage limits |
 | Scrape jobs crash mid-run | Medium | Job state in DB, resume from last checkpoint |
 | Perceptual hash collisions | Low | Store hash, allow manual review of flagged duplicates |
 | API keys exposed | Low | Environment variables, not stored in code |
 | SQLite write contention | Low | WAL mode, single writer pattern via Celery |
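
The "throttle downloads" mitigation can be illustrated with a minimum-interval limiter driven by the `rate_limit_per_sec` value stored per source. This is a sketch under assumptions: the class name `Throttle` is hypothetical, it is single-process only (several Celery workers would need a shared limiter, for example in Redis), and the clock and sleep functions are injectable so the behaviour can be checked without waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval of 1 / rate_limit_per_sec between calls."""

    def __init__(self, rate_limit_per_sec: float = 1.0,
                 clock=time.monotonic, sleep=time.sleep):
        if rate_limit_per_sec <= 0:
            raise ValueError("rate_limit_per_sec must be positive")
        self.min_interval = 1.0 / rate_limit_per_sec
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0  # earliest time the next call may proceed

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        # Schedule the following slot relative to when this call proceeds.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay
```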
 ---
 ## Implementation Phases
 ### Phase 1: Foundation
 - [ ] Docker Compose setup (FastAPI, React, Redis, Nginx)
 - [ ] Database schema + migrations (Alembic)
 - [ ] Basic FastAPI skeleton with health checks
 - [ ] React app scaffolding with Tailwind
 ### Phase 2: Core Data Management
 - [ ] Species CRUD API
 - [ ] CSV import endpoint
 - [ ] Species list UI with search/filter
 - [ ] API keys management UI
 ### Phase 3: iNaturalist Scraper
 - [ ] Celery worker setup
 - [ ] iNaturalist/GBIF scraper task
 - [ ] Job management API
 - [ ] Real-time progress (WebSocket or polling)
 ### Phase 4: Quality Pipeline
 - [ ] Image download worker
 - [ ] Perceptual hash deduplication
 - [ ] Blur detection + quality scoring
 - [ ] Resize to 512x512
 ### Phase 5: Image Browser
 - [ ] Image listing API with filters
 - [ ] Thumbnail generation
 - [ ] Grid view UI
 - [ ] Bulk operations
 ### Phase 6: Additional Scrapers
 - [ ] Flickr scraper
 - [ ] Wikimedia Commons scraper
 - [ ] Trefle scraper (metadata + images)
 - [ ] USDA PLANTS scraper
 ### Phase 7: Export
 - [ ] Export job API
 - [ ] Train/test split logic
 - [ ] Zip generation worker
 - [ ] Download endpoint
 - [ ] Export UI with filters
 ### Phase 8: Dashboard & Polish
 - [ ] Stats API
 - [ ] Dashboard UI with charts
 - [ ] Job monitoring UI
 - [ ] Error handling + logging
 - [ ] Documentation
 ---
 ## File Structure
 ```
 PlantGuideScraper/
 ├── docker-compose.yml
 ├── .env.example
 ├── docs/
 │   └── master_plan.md
 ├── backend/
 │   ├── Dockerfile
 │   ├── requirements.txt
 │   ├── alembic/
 │   │   └── versions/
 │   ├── app/
 │   │   ├── __init__.py
 │   │   ├── main.py
 │   │   ├── config.py
 │   │   ├── database.py
 │   │   ├── models/
 │   │   │   ├── species.py
 │   │   │   ├── image.py
 │   │   │   ├── job.py
 │   │   │   └── export.py
 │   │   ├── schemas/
 │   │   │   └── ...
 │   │   ├── api/
 │   │   │   ├── species.py
 │   │   │   ├── images.py
 │   │   │   ├── jobs.py
 │   │   │   ├── exports.py
 │   │   │   └── stats.py
 │   │   ├── scrapers/
 │   │   │   ├── base.py
 │   │   │   ├── inaturalist.py
 │   │   │   ├── flickr.py
 │   │   │   ├── wikimedia.py
 │   │   │   └── trefle.py
 │   │   ├── workers/
 │   │   │   ├── celery_app.py
 │   │   │   ├── scrape_tasks.py
 │   │   │   ├── quality_tasks.py
 │   │   │   └── export_tasks.py
 │   │   └── utils/
 │   │       ├── image_quality.py
 │   │       └── dedup.py
 │   └── tests/
 ├── frontend/
 │   ├── Dockerfile
 │   ├── package.json
 │   ├── src/
 │   │   ├── App.tsx
 │   │   ├── components/
 │   │   ├── pages/
 │   │   │   ├── Dashboard.tsx
 │   │   │   ├── Species.tsx
 │   │   │   ├── Images.tsx
 │   │   │   ├── Jobs.tsx
 │   │   │   ├── Export.tsx
 │   │   │   └── Settings.tsx
 │   │   ├── hooks/
 │   │   └── api/
 │   └── public/
 ├── nginx/
 │   └── nginx.conf
 └── data/                  # Bind mount (not in repo)
    ├── db/
    ├── images/
    └── exports/
 ```
 ---
 ## Environment Variables
 ```bash
 # Backend
 DATABASE_URL=sqlite:////data/db/plants.sqlite
 REDIS_URL=redis://redis:6379/0
 IMAGES_PATH=/data/images
 EXPORTS_PATH=/data/exports
 # API Keys (user-provided)
 FLICKR_API_KEY=
 FLICKR_API_SECRET=
 INATURALIST_APP_ID=
 INATURALIST_APP_SECRET=
 TREFLE_API_KEY=
 # Optional
 LOG_LEVEL=INFO
 CELERY_CONCURRENCY=4
 ```
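
A settings loader for these variables might look like the sketch below. The `Settings` dataclass and `load_settings` function are assumptions for illustration, not the contents of `config.py`. Defaults mirror the values used in `docker-compose.yml`, including the four-slash SQLite URL, which denotes the absolute path `/data/db/plants.sqlite` in SQLAlchemy-style URLs. The environment is passed in as a mapping so the loader can be exercised without touching the real process environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    database_url: str
    redis_url: str
    images_path: str
    exports_path: str
    log_level: str
    celery_concurrency: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping, falling back to the documented defaults."""
    return Settings(
        database_url=env.get("DATABASE_URL", "sqlite:////data/db/plants.sqlite"),
        redis_url=env.get("REDIS_URL", "redis://redis:6379/0"),
        images_path=env.get("IMAGES_PATH", "/data/images"),
        exports_path=env.get("EXPORTS_PATH", "/data/exports"),
        log_level=env.get("LOG_LEVEL", "INFO"),
        celery_concurrency=int(env.get("CELERY_CONCURRENCY", "4")),  # env values are strings
    )
```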
 ---
 ## Commands
 ```bash
 # Development
 docker-compose up --build
 # Production
 docker-compose -f docker-compose.yml -f docker-compose.prod.yml up -d
 # Run migrations
 docker-compose exec backend alembic upgrade head
 # View Celery logs
 docker-compose logs -f celery
 # Access Redis CLI
 docker-compose exec redis redis-cli
 ```
@@ -0,0 +1,14 @@
 FROM node:20-alpine
 WORKDIR /app
 # Install dependencies
 COPY package*.json ./
 RUN npm install
 # Copy source
 COPY . .
 EXPOSE 3000
 CMD ["npm", "run", "dev", "--", "--host"]
@@ -0,0 +1,14 @@
 <!DOCTYPE html>
 <html lang="en">
  <head>
    <meta charset="UTF-8" />
    <link rel="icon" type="image/svg+xml" href="/vite.svg" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>PlantGuideScraper</title>
    <script type="module" crossorigin src="/assets/index-BXIq8BNP.js"></script>
    <link rel="stylesheet" crossorigin href="/assets/index-uHzGA3u6.css">
  </head>
  <body>
    <div id="root"></div>
  </body>
 </html>
@@ -0,0 +1,13 @@
 <!DOCTYPE html>
 <html lang="en">
  <head>
    <meta charset="UTF-8" />
    <link rel="icon" type="image/svg+xml" href="/vite.svg" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>PlantGuideScraper</title>
  </head>
  <body>
    <div id="root"></div>
    <script type="module" src="/src/main.tsx"></script>
  </body>
 </html>
@@ -0,0 +1,31 @@
 {
  "name": "plant-scraper-frontend",
  "private": true,
  "version": "1.0.0",
  "type": "module",
  "scripts": {
    "dev": "vite",
    "build": "tsc && vite build",
    "preview": "vite preview"
  },
  "dependencies": {
    "react": "^18.2.0",
    "react-dom": "^18.2.0",
    "react-router-dom": "^6.21.0",
    "@tanstack/react-query": "^5.17.0",
    "axios": "^1.6.0",
    "lucide-react": "^0.303.0",
    "recharts": "^2.10.0",
    "clsx": "^2.1.0"
  },
  "devDependencies": {
    "@types/react": "^18.2.0",
    "@types/react-dom": "^18.2.0",
    "@vitejs/plugin-react": "^4.2.0",
    "autoprefixer": "^10.4.16",
    "postcss": "^8.4.32",
    "tailwindcss": "^3.4.0",
    "typescript": "^5.3.0",
    "vite": "^5.0.0"
  }
 }
@@ -0,0 +1,6 @@
 export default {
  plugins: {
    tailwindcss: {},
    autoprefixer: {},
  },
 }
@@ -0,0 +1,81 @@
 import { BrowserRouter, Routes, Route, NavLink } from 'react-router-dom'
 import {
  LayoutDashboard,
  Leaf,
  Image,
  Play,
  Download,
  Settings,
 } from 'lucide-react'
 import { clsx } from 'clsx'
 import Dashboard from './pages/Dashboard'
 import Species from './pages/Species'
 import Images from './pages/Images'
 import Jobs from './pages/Jobs'
 import Export from './pages/Export'
 import SettingsPage from './pages/Settings'
 const navItems = [
  { to: '/', icon: LayoutDashboard, label: 'Dashboard' },
  { to: '/species', icon: Leaf, label: 'Species' },
  { to: '/images', icon: Image, label: 'Images' },
  { to: '/jobs', icon: Play, label: 'Jobs' },
  { to: '/export', icon: Download, label: 'Export' },
  { to: '/settings', icon: Settings, label: 'Settings' },
 ]
 function Sidebar() {
  return (
    <aside className="w-64 bg-white border-r border-gray-200 min-h-screen">
      <div className="p-4 border-b border-gray-200">
        <h1 className="text-xl font-bold text-green-600 flex items-center gap-2">
          <Leaf className="w-6 h-6" />
          PlantScraper
        </h1>
      </div>
      <nav className="p-4">
        <ul className="space-y-2">
          {navItems.map((item) => (
            <li key={item.to}>
              <NavLink
                to={item.to}
                className={({ isActive }) =>
                  clsx(
                    'flex items-center gap-3 px-3 py-2 rounded-lg transition-colors',
                    isActive
                      ? 'bg-green-50 text-green-700'
                      : 'text-gray-600 hover:bg-gray-100'
                  )
                }
              >
                <item.icon className="w-5 h-5" />
                {item.label}
              </NavLink>
            </li>
          ))}
        </ul>
      </nav>
    </aside>
  )
 }
 export default function App() {
  return (
    <BrowserRouter>
      <div className="flex min-h-screen">
        <Sidebar />
        <main className="flex-1 p-8">
          <Routes>
            <Route path="/" element={<Dashboard />} />
            <Route path="/species" element={<Species />} />
            <Route path="/images" element={<Images />} />
            <Route path="/jobs" element={<Jobs />} />
            <Route path="/export" element={<Export />} />
            <Route path="/settings" element={<SettingsPage />} />
          </Routes>
        </main>
      </div>
    </BrowserRouter>
  )
 }
@@ -0,0 +1,275 @@
 import axios from 'axios'
 const API_URL = import.meta.env.VITE_API_URL || ''
 export const api = axios.create({
  baseURL: `${API_URL}/api`,
  headers: {
    'Content-Type': 'application/json',
  },
 })
 // Types
 export interface Species {
  id: number
  scientific_name: string
  common_name: string | null
  genus: string | null
  family: string | null
  created_at: string
  image_count: number
 }
 export interface SpeciesListResponse {
  items: Species[]
  total: number
  page: number
  page_size: number
  pages: number
 }
 export interface Image {
  id: number
  species_id: number
  species_name: string | null
  source: string
  source_id: string | null
  url: string
  local_path: string | null
  license: string
  attribution: string | null
  width: number | null
  height: number | null
  quality_score: number | null
  status: string
  created_at: string
 }
 export interface ImageListResponse {
  items: Image[]
  total: number
  page: number
  page_size: number
  pages: number
 }
 export interface Job {
  id: number
  name: string
  source: string
  species_filter: string | null
  status: string
  progress_current: number
  progress_total: number
  images_downloaded: number
  images_rejected: number
  started_at: string | null
  completed_at: string | null
  error_message: string | null
  created_at: string
 }
 export interface JobListResponse {
  items: Job[]
  total: number
 }
 export interface JobProgress {
  status: string
  progress_current: number
  progress_total: number
  current_species?: string
 }
 export interface Export {
  id: number
  name: string
  filter_criteria: string | null
  train_split: number
  status: string
  file_path: string | null
  file_size: number | null
  species_count: number | null
  image_count: number | null
  created_at: string
  completed_at: string | null
  error_message: string | null
 }
 export interface SourceConfig {
  name: string
  label: string
  requires_secret: boolean
  auth_type: 'none' | 'api_key' | 'api_key_secret' | 'oauth'
  configured: boolean
  enabled: boolean
  api_key_masked: string | null
  has_secret: boolean
  has_access_token: boolean
  rate_limit_per_sec: number
  default_rate: number
 }
 export interface Stats {
  total_species: number
  total_images: number
  images_downloaded: number
  images_pending: number
  images_rejected: number
  disk_usage_mb: number
  sources: Array<{
    source: string
    image_count: number
    downloaded: number
    pending: number
    rejected: number
  }>
  licenses: Array<{
    license: string
    count: number
  }>
  jobs: {
    running: number
    pending: number
    completed: number
    failed: number
  }
  top_species: Array<{
    id: number
    scientific_name: string
    common_name: string | null
    image_count: number
  }>
  under_represented: Array<{
    id: number
    scientific_name: string
    common_name: string | null
    image_count: number
  }>
 }
 // API functions
 export const speciesApi = {
  list: (params?: { page?: number; page_size?: number; search?: string; genus?: string; has_images?: boolean; max_images?: number; min_images?: number }) =>
    api.get<SpeciesListResponse>('/species', { params }),
  get: (id: number) => api.get<Species>(`/species/${id}`),
  create: (data: { scientific_name: string; common_name?: string; genus?: string; family?: string }) =>
    api.post<Species>('/species', data),
  update: (id: number, data: Partial<Species>) => api.put<Species>(`/species/${id}`, data),
  delete: (id: number) => api.delete(`/species/${id}`),
  import: (file: File) => {
    const formData = new FormData()
    formData.append('file', file)
    return api.post('/species/import', formData, {
      headers: { 'Content-Type': 'multipart/form-data' },
    })
  },
  importJson: (file: File) => {
    const formData = new FormData()
    formData.append('file', file)
    return api.post('/species/import-json', formData, {
      headers: { 'Content-Type': 'multipart/form-data' },
    })
  },
  genera: () => api.get<string[]>('/species/genera/list'),
 }
 export interface ImportScanResult {
  available: boolean
  message?: string
  sources: Array<{
    name: string
    species_count: number
    image_count: number
  }>
  total_images: number
  matched_species: number
  unmatched_species: string[]
 }
 export interface ImportResult {
  imported: number
  skipped: number
  errors: string[]
 }
 export const imagesApi = {
  list: (params?: {
    page?: number
    page_size?: number
    species_id?: number
    source?: string
    license?: string
    status?: string
    min_quality?: number
    search?: string
  }) => api.get<ImageListResponse>('/images', { params }),
  get: (id: number) => api.get<Image>(`/images/${id}`),
  delete: (id: number) => api.delete(`/images/${id}`),
  bulkDelete: (ids: number[]) => api.post('/images/bulk-delete', ids),
  sources: () => api.get<string[]>('/images/sources'),
  licenses: () => api.get<string[]>('/images/licenses'),
  processPending: (source?: string) =>
    api.post<{ pending_count: number; task_id: string }>('/images/process-pending', null, {
      params: source ? { source } : undefined,
    }),
  processPendingStatus: (taskId: string) =>
    api.get<{ task_id: string; state: string; queued?: number; total?: number }>(
      `/images/process-pending/status/${taskId}`
    ),
  scanImports: () => api.get<ImportScanResult>('/images/import/scan'),
  runImport: (moveFiles: boolean = false) =>
    api.post<ImportResult>('/images/import/run', null, { params: { move_files: moveFiles } }),
 }
 export const jobsApi = {
  list: (params?: { status?: string; source?: string; limit?: number }) =>
    api.get<JobListResponse>('/jobs', { params }),
  get: (id: number) => api.get<Job>(`/jobs/${id}`),
  create: (data: { name: string; source: string; species_ids?: number[]; only_without_images?: boolean; max_images?: number }) =>
    api.post<Job>('/jobs', data),
  progress: (id: number) => api.get<JobProgress>(`/jobs/${id}/progress`),
  pause: (id: number) => api.post(`/jobs/${id}/pause`),
  resume: (id: number) => api.post(`/jobs/${id}/resume`),
  cancel: (id: number) => api.post(`/jobs/${id}/cancel`),
 }
 export const exportsApi = {
  list: (params?: { limit?: number }) => api.get('/exports', { params }),
  get: (id: number) => api.get<Export>(`/exports/${id}`),
  create: (data: {
    name: string
    filter_criteria: {
      min_images_per_species: number
      licenses?: string[]
      min_quality?: number
      species_ids?: number[]
    }
    train_split: number
  }) => api.post<Export>('/exports', data),
  preview: (data: any) => api.post('/exports/preview', data),
  progress: (id: number) => api.get(`/exports/${id}/progress`),
  download: (id: number) => `${API_URL}/api/exports/${id}/download`,
  delete: (id: number) => api.delete(`/exports/${id}`),
 }
 export const sourcesApi = {
  list: () => api.get<SourceConfig[]>('/sources'),
  get: (source: string) => api.get<SourceConfig>(`/sources/${source}`),
  update: (source: string, data: {
    api_key?: string
    api_secret?: string
    access_token?: string
    rate_limit_per_sec?: number
    enabled?: boolean
  }) => api.put(`/sources/${source}`, { source, ...data }),
  test: (source: string) => api.post(`/sources/${source}/test`),
  delete: (source: string) => api.delete(`/sources/${source}`),
 }
 export const statsApi = {
  get: () => api.get<Stats>('/stats'),
  sources: () => api.get('/stats/sources'),
  species: (params?: { min_count?: number; max_count?: number }) =>
    api.get('/stats/species', { params }),
 }
@@ -0,0 +1,7 @@
@tailwind base;
@tailwind components;
@tailwind utilities;
 body {
  @apply bg-gray-50 text-gray-900;
 }
@@ -0,0 +1,22 @@
 import React from 'react'
 import ReactDOM from 'react-dom/client'
 import { QueryClient, QueryClientProvider } from '@tanstack/react-query'
 import App from './App'
 import './index.css'
 const queryClient = new QueryClient({
  defaultOptions: {
    queries: {
      refetchOnWindowFocus: false,
      retry: 1,
    },
  },
 })
 ReactDOM.createRoot(document.getElementById('root')!).render(
  <React.StrictMode>
    <QueryClientProvider client={queryClient}>
      <App />
    </QueryClientProvider>
  </React.StrictMode>,
 )
@@ -0,0 +1,413 @@
 import { useState } from 'react'
 import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
 import {
  Leaf,
  Image,
  HardDrive,
  Clock,
  CheckCircle,
  XCircle,
  AlertCircle,
 } from 'lucide-react'
 import {
  BarChart,
  Bar,
  XAxis,
  YAxis,
  Tooltip,
  ResponsiveContainer,
  PieChart,
  Pie,
  Cell,
 } from 'recharts'
 import { statsApi, imagesApi } from '../api/client'
 const COLORS = ['#22c55e', '#3b82f6', '#f59e0b', '#ef4444', '#8b5cf6', '#ec4899']
 function StatCard({
  title,
  value,
  icon: Icon,
  color,
 }: {
  title: string
  value: string | number
  icon: React.ElementType
  color: string
 }) {
  return (
    <div className="bg-white rounded-lg shadow p-6">
      <div className="flex items-center justify-between">
        <div>
          <p className="text-sm text-gray-500">{title}</p>
          <p className="text-2xl font-bold mt-1">{value}</p>
        </div>
        <div className={`p-3 rounded-full ${color}`}>
          <Icon className="w-6 h-6 text-white" />
        </div>
      </div>
    </div>
  )
 }
 export default function Dashboard() {
  const queryClient = useQueryClient()
  const [processingTaskId, setProcessingTaskId] = useState<string | null>(null)
  const processPendingMutation = useMutation({
    mutationFn: () => imagesApi.processPending(),
    onSuccess: (res) => {
      setProcessingTaskId(res.data.task_id)
    },
  })
  // Poll task status while processing
  const { data: taskStatus } = useQuery({
    queryKey: ['process-pending-status', processingTaskId],
    queryFn: async () => {
      const res = await imagesApi.processPendingStatus(processingTaskId!)
      if (res.data.state === 'SUCCESS' || res.data.state === 'FAILURE') {
        // Task finished - clear tracking and refresh stats
        setTimeout(() => {
          setProcessingTaskId(null)
          queryClient.invalidateQueries({ queryKey: ['stats'] })
        }, 0)
      }
      return res.data
    },
    enabled: !!processingTaskId,
    refetchInterval: (query) => {
      const state = query.state.data?.state
      if (state === 'SUCCESS' || state === 'FAILURE') return false
      return 2000
    },
  })
  const isProcessing = !!processingTaskId && taskStatus?.state !== 'SUCCESS' && taskStatus?.state !== 'FAILURE'
  const { data: stats, isLoading, error, failureCount, isFetching } = useQuery({
    queryKey: ['stats'],
    queryFn: async () => {
      const startTime = Date.now()
      console.log('[Dashboard] Fetching stats...')
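      // Timeout guard. NOTE: controller.signal is not passed to statsApi.get(),
      // so abort() does not cancel the request as written; a timeout is only
      // reported if the HTTP client enforces its own (see ECONNABORTED below).
      const controller = new AbortController()
      const timeoutId = setTimeout(() => controller.abort(), 10000) // 10 second timeout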
      try {
        const res = await statsApi.get()
        clearTimeout(timeoutId)
        console.log(`[Dashboard] Stats loaded in ${Date.now() - startTime}ms`)
        return res.data
      } catch (err: any) {
        clearTimeout(timeoutId)
        if (err.name === 'AbortError' || err.code === 'ECONNABORTED') {
          console.error('[Dashboard] Request timed out after 10 seconds')
          throw new Error('Request timed out after 10 seconds - backend may be unresponsive')
        }
        console.error('[Dashboard] Stats fetch failed:', err)
        console.error('[Dashboard] Error details:', {
          message: err.message,
          status: err.response?.status,
          statusText: err.response?.statusText,
          data: err.response?.data,
        })
        throw err
      }
    },
    refetchInterval: 30000,  // 30 seconds - matches backend cache
    retry: 1,
    staleTime: 25000,
  })
  // Debug panel to test backend
  const { data: debugData, refetch: refetchDebug, isFetching: isDebugFetching } = useQuery({
    queryKey: ['debug'],
    queryFn: async () => {
      const res = await fetch('/api/debug')
      return res.json()
    },
    enabled: false, // Only fetch when manually triggered
  })
  if (isLoading) {
    return (
      <div className="flex items-center justify-center h-64">
        <div className="text-center">
          <div className="animate-spin rounded-full h-8 w-8 border-b-2 border-green-600 mx-auto"></div>
          <p className="mt-2 text-gray-500">Loading stats...</p>
        </div>
      </div>
    )
  }
  if (error) {
    const err = error as any
    return (
      <div className="space-y-4 m-4">
        <div className="bg-red-50 border border-red-200 rounded-lg p-6">
          <h2 className="text-lg font-bold text-red-700 mb-2">Failed to load dashboard</h2>
          <div className="space-y-2 text-sm">
            <p><strong>Error:</strong> {err.message}</p>
            {err.response && (
              <>
                <p><strong>Status:</strong> {err.response.status} {err.response.statusText}</p>
                {err.response.data && (
                  <p><strong>Response:</strong> {JSON.stringify(err.response.data)}</p>
                )}
              </>
            )}
            <p><strong>Failed attempts:</strong> {failureCount}</p>
          </div>
        </div>
        <div className="bg-blue-50 border border-blue-200 rounded-lg p-6">
          <h3 className="font-bold text-blue-700 mb-2">Debug Backend Connection</h3>
          <button
            onClick={() => refetchDebug()}
            disabled={isDebugFetching}
            className="px-4 py-2 bg-blue-600 text-white rounded hover:bg-blue-700 disabled:opacity-50"
          >
            {isDebugFetching ? 'Testing...' : 'Test Backend'}
          </button>
          {debugData && (
            <pre className="mt-4 p-4 bg-white rounded text-xs overflow-auto">
              {JSON.stringify(debugData, null, 2)}
            </pre>
          )}
        </div>
      </div>
    )
  }
  if (!stats) {
    return <div>Failed to load stats</div>
  }
  const sourceData = stats.sources.map((s) => ({
    name: s.source,
    downloaded: s.downloaded,
    pending: s.pending,
    rejected: s.rejected,
  }))
  const licenseData = stats.licenses.map((l, i) => ({
    name: l.license,
    value: l.count,
    color: COLORS[i % COLORS.length],
  }))
  return (
    <div className="space-y-6">
      <h1 className="text-2xl font-bold">Dashboard</h1>
      {/* Stats Grid */}
      <div className="grid grid-cols-1 md:grid-cols-2 lg:grid-cols-4 gap-4">
        <StatCard
          title="Total Species"
          value={stats.total_species.toLocaleString()}
          icon={Leaf}
          color="bg-green-500"
        />
        <StatCard
          title="Downloaded Images"
          value={stats.images_downloaded.toLocaleString()}
          icon={Image}
          color="bg-blue-500"
        />
        <StatCard
          title="Pending Images"
          value={stats.images_pending.toLocaleString()}
          icon={Clock}
          color="bg-yellow-500"
        />
        <StatCard
          title="Disk Usage"
          value={`${stats.disk_usage_mb.toFixed(1)} MB`}
          icon={HardDrive}
          color="bg-purple-500"
        />
      </div>
      {/* Process Pending Banner */}
      {(stats.images_pending > 0 || isProcessing) && (
        <div className="bg-yellow-50 border border-yellow-200 rounded-lg p-4 flex items-center justify-between">
          <div>
            <p className="font-semibold text-yellow-800">
              {isProcessing
                ? `Processing pending images...`
                : `${stats.images_pending.toLocaleString()} pending images`}
            </p>
            <p className="text-sm text-yellow-700">
              {isProcessing && taskStatus?.queued != null && taskStatus?.total != null
                ? `Queued ${taskStatus.queued.toLocaleString()} of ${taskStatus.total.toLocaleString()} for download`
                : isProcessing
                ? 'Queueing images for download...'
                : 'These images have been scraped but not yet downloaded and processed.'}
            </p>
          </div>
          <button
            onClick={() => processPendingMutation.mutate()}
            disabled={isProcessing || processPendingMutation.isPending}
            className="px-4 py-2 bg-yellow-600 text-white rounded-lg hover:bg-yellow-700 disabled:opacity-50 whitespace-nowrap"
          >
            {isProcessing ? 'Processing...' : processPendingMutation.isPending ? 'Starting...' : 'Process All Pending'}
          </button>
        </div>
      )}
      {/* Jobs Status */}
      <div className="bg-white rounded-lg shadow p-6">
        <h2 className="text-lg font-semibold mb-4">Jobs Status</h2>
        <div className="flex gap-6">
          <div className="flex items-center gap-2">
            <div className="w-3 h-3 rounded-full bg-blue-500 animate-pulse"></div>
            <span>Running: {stats.jobs.running}</span>
          </div>
          <div className="flex items-center gap-2">
            <Clock className="w-4 h-4 text-yellow-500" />
            <span>Pending: {stats.jobs.pending}</span>
          </div>
          <div className="flex items-center gap-2">
            <CheckCircle className="w-4 h-4 text-green-500" />
            <span>Completed: {stats.jobs.completed}</span>
          </div>
          <div className="flex items-center gap-2">
            <XCircle className="w-4 h-4 text-red-500" />
            <span>Failed: {stats.jobs.failed}</span>
          </div>
        </div>
      </div>
      {/* Charts */}
      <div className="grid grid-cols-1 lg:grid-cols-2 gap-6">
        {/* Source Chart */}
        <div className="bg-white rounded-lg shadow p-6">
          <h2 className="text-lg font-semibold mb-4">Images by Source</h2>
          {sourceData.length > 0 ? (
            <ResponsiveContainer width="100%" height={300}>
              <BarChart data={sourceData}>
                <XAxis dataKey="name" />
                <YAxis />
                <Tooltip />
                <Bar dataKey="downloaded" fill="#22c55e" name="Downloaded" />
                <Bar dataKey="pending" fill="#f59e0b" name="Pending" />
                <Bar dataKey="rejected" fill="#ef4444" name="Rejected" />
              </BarChart>
            </ResponsiveContainer>
          ) : (
            <div className="h-[300px] flex items-center justify-center text-gray-400">
              No data yet
            </div>
          )}
        </div>
        {/* License Chart */}
        <div className="bg-white rounded-lg shadow p-6">
          <h2 className="text-lg font-semibold mb-4">Images by License</h2>
          {licenseData.length > 0 ? (
            <ResponsiveContainer width="100%" height={300}>
              <PieChart>
                <Pie
                  data={licenseData}
                  dataKey="value"
                  nameKey="name"
                  cx="50%"
                  cy="50%"
                  outerRadius={100}
                  label={({ name, percent }) =>
                    `${name} (${(percent * 100).toFixed(0)}%)`
                  }
                >
                  {licenseData.map((entry, index) => (
                    <Cell key={index} fill={entry.color} />
                  ))}
                </Pie>
                <Tooltip />
              </PieChart>
            </ResponsiveContainer>
          ) : (
            <div className="h-[300px] flex items-center justify-center text-gray-400">
              No data yet
            </div>
          )}
        </div>
      </div>
      {/* Species Tables */}
      <div className="grid grid-cols-1 lg:grid-cols-2 gap-6">
        {/* Top Species */}
        <div className="bg-white rounded-lg shadow p-6">
          <h2 className="text-lg font-semibold mb-4">Top Species</h2>
          <table className="w-full">
            <thead>
              <tr className="text-left text-sm text-gray-500">
                <th className="pb-2">Species</th>
                <th className="pb-2 text-right">Images</th>
              </tr>
            </thead>
            <tbody>
              {stats.top_species.map((s) => (
                <tr key={s.id} className="border-t">
                  <td className="py-2">
                    <div className="font-medium">{s.scientific_name}</div>
                    {s.common_name && (
                      <div className="text-sm text-gray-500">{s.common_name}</div>
                    )}
                  </td>
                  <td className="py-2 text-right">{s.image_count}</td>
                </tr>
              ))}
              {stats.top_species.length === 0 && (
                <tr>
                  <td colSpan={2} className="py-4 text-center text-gray-400">
                    No species yet
                  </td>
                </tr>
              )}
            </tbody>
          </table>
        </div>
        {/* Under-represented Species */}
        <div className="bg-white rounded-lg shadow p-6">
          <h2 className="text-lg font-semibold mb-4 flex items-center gap-2">
            <AlertCircle className="w-5 h-5 text-yellow-500" />
            Under-represented Species
          </h2>
          <p className="text-sm text-gray-500 mb-4">Species with fewer than 100 images</p>
          <table className="w-full">
            <thead>
              <tr className="text-left text-sm text-gray-500">
                <th className="pb-2">Species</th>
                <th className="pb-2 text-right">Images</th>
              </tr>
            </thead>
            <tbody>
              {stats.under_represented.map((s) => (
                <tr key={s.id} className="border-t">
                  <td className="py-2">
                    <div className="font-medium">{s.scientific_name}</div>
                    {s.common_name && (
                      <div className="text-sm text-gray-500">{s.common_name}</div>
                    )}
                  </td>
                  <td className="py-2 text-right text-yellow-600">{s.image_count}</td>
                </tr>
              ))}
              {stats.under_represented.length === 0 && (
                <tr>
                  <td colSpan={2} className="py-4 text-center text-gray-400">
                    All species have 100+ images
                  </td>
                </tr>
              )}
            </tbody>
          </table>
        </div>
      </div>
    </div>
  )
 }
@@ -0,0 +1,346 @@
 import { useState } from 'react'
 import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
 import {
  Download,
  Trash2,
  CheckCircle,
  Clock,
  AlertCircle,
  Package,
 } from 'lucide-react'
 import { exportsApi, imagesApi, Export as ExportType } from '../api/client'
 export default function Export() {
  const queryClient = useQueryClient()
  const [showCreateModal, setShowCreateModal] = useState(false)
  const { data: exports, isLoading } = useQuery({
    queryKey: ['exports'],
    queryFn: () => exportsApi.list({ limit: 50 }).then((res) => res.data),
    refetchInterval: 5000,
  })
  const deleteMutation = useMutation({
    mutationFn: (id: number) => exportsApi.delete(id),
    onSuccess: () => queryClient.invalidateQueries({ queryKey: ['exports'] }),
  })
  const getStatusIcon = (status: string) => {
    switch (status) {
      case 'generating':
        return <Clock className="w-4 h-4 text-blue-500 animate-pulse" />
      case 'completed':
        return <CheckCircle className="w-4 h-4 text-green-500" />
      case 'failed':
        return <AlertCircle className="w-4 h-4 text-red-500" />
      default:
        return <Clock className="w-4 h-4 text-gray-400" />
    }
  }
  const formatBytes = (bytes: number | null) => {
    if (!bytes) return 'N/A'
    if (bytes < 1024) return `${bytes} B`
    if (bytes < 1024 * 1024) return `${(bytes / 1024).toFixed(1)} KB`
    if (bytes < 1024 * 1024 * 1024) return `${(bytes / 1024 / 1024).toFixed(1)} MB`
    return `${(bytes / 1024 / 1024 / 1024).toFixed(1)} GB`
  }
  return (
    <div className="space-y-6">
      <div className="flex items-center justify-between">
        <h1 className="text-2xl font-bold">Export Dataset</h1>
        <button
          onClick={() => setShowCreateModal(true)}
          className="flex items-center gap-2 px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700"
        >
          <Package className="w-4 h-4" />
          Create Export
        </button>
      </div>
      {/* Info Card */}
      <div className="bg-blue-50 border border-blue-200 rounded-lg p-4">
        <h3 className="font-medium text-blue-800">Export Format</h3>
        <p className="text-sm text-blue-700 mt-1">
          Exports are created in Create ML-compatible format with Training and Testing
          folders. Each species has its own subfolder with images.
        </p>
      </div>
      {/* Exports List */}
      {isLoading ? (
        <div className="flex items-center justify-center h-64">
          <div className="animate-spin rounded-full h-8 w-8 border-b-2 border-green-600"></div>
        </div>
      ) : exports?.items.length === 0 ? (
        <div className="bg-white rounded-lg shadow p-8 text-center text-gray-400">
          <Package className="w-12 h-12 mx-auto mb-4" />
          <p>No exports yet</p>
          <p className="text-sm mt-2">
            Create an export to download your dataset for CoreML training
          </p>
        </div>
      ) : (
        <div className="space-y-4">
          {exports?.items.map((exp: ExportType) => (
            <div
              key={exp.id}
              className="bg-white rounded-lg shadow p-6"
            >
              <div className="flex items-start justify-between">
                <div className="flex-1">
                  <div className="flex items-center gap-3">
                    {getStatusIcon(exp.status)}
                    <h3 className="font-semibold">{exp.name}</h3>
                  </div>
                  <div className="mt-2 grid grid-cols-4 gap-4 text-sm">
                    <div>
                      <span className="text-gray-500">Species:</span>{' '}
                      {exp.species_count ?? 'N/A'}
                    </div>
                    <div>
                      <span className="text-gray-500">Images:</span>{' '}
                      {exp.image_count ?? 'N/A'}
                    </div>
                    <div>
                      <span className="text-gray-500">Size:</span>{' '}
                      {formatBytes(exp.file_size)}
                    </div>
                    <div>
                      <span className="text-gray-500">Split:</span>{' '}
                      {Math.round(exp.train_split * 100)}% / {Math.round((1 - exp.train_split) * 100)}%
                    </div>
                  </div>
                  {exp.error_message && (
                    <div className="mt-2 text-sm text-red-600">
                      Error: {exp.error_message}
                    </div>
                  )}
                  <div className="mt-2 text-xs text-gray-400">
                    Created: {new Date(exp.created_at).toLocaleString()}
                    {exp.completed_at && (
                      <span className="ml-4">
                        Completed: {new Date(exp.completed_at).toLocaleString()}
                      </span>
                    )}
                  </div>
                </div>
                <div className="flex gap-2 ml-4">
                  {exp.status === 'completed' && (
                    <a
                      href={exportsApi.download(exp.id)}
                      className="flex items-center gap-2 px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700"
                    >
                      <Download className="w-4 h-4" />
                      Download
                    </a>
                  )}
                  <button
                    onClick={() => deleteMutation.mutate(exp.id)}
                    className="p-2 text-red-600 hover:bg-red-50 rounded"
                    title="Delete"
                  >
                    <Trash2 className="w-5 h-5" />
                  </button>
                </div>
              </div>
            </div>
          ))}
        </div>
      )}
      {/* Create Modal */}
      {showCreateModal && (
        <CreateExportModal onClose={() => setShowCreateModal(false)} />
      )}
    </div>
  )
 }
 function CreateExportModal({ onClose }: { onClose: () => void }) {
  const queryClient = useQueryClient()
  const [form, setForm] = useState({
    name: `Export ${new Date().toLocaleDateString()}`,
    min_images: 100,
    train_split: 0.8,
    licenses: [] as string[],
    min_quality: undefined as number | undefined,
  })
  const { data: licenses } = useQuery({
    queryKey: ['image-licenses'],
    queryFn: () => imagesApi.licenses().then((res) => res.data),
  })
  // Request body shared by the preview and create calls
  const exportRequest = {
    name: form.name,
    filter_criteria: {
      min_images_per_species: form.min_images,
      licenses: form.licenses.length > 0 ? form.licenses : undefined,
      min_quality: form.min_quality,
    },
    train_split: form.train_split,
  }
  const previewMutation = useMutation({
    mutationFn: () => exportsApi.preview(exportRequest),
  })
  const createMutation = useMutation({
    mutationFn: () => exportsApi.create(exportRequest),
    onSuccess: () => {
      queryClient.invalidateQueries({ queryKey: ['exports'] })
      onClose()
    },
  })
  const toggleLicense = (license: string) => {
    setForm((f) => ({
      ...f,
      licenses: f.licenses.includes(license)
        ? f.licenses.filter((l) => l !== license)
        : [...f.licenses, license],
    }))
  }
  return (
    <div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50">
      <div className="bg-white rounded-lg p-6 w-full max-w-lg">
        <h2 className="text-xl font-bold mb-4">Create Export</h2>
        <div className="space-y-4">
          <div>
            <label className="block text-sm font-medium mb-1">Export Name</label>
            <input
              type="text"
              value={form.name}
              onChange={(e) => setForm({ ...form, name: e.target.value })}
              className="w-full px-3 py-2 border rounded-lg"
            />
          </div>
          <div>
            <label className="block text-sm font-medium mb-1">
              Minimum Images per Species
            </label>
            <input
              type="number"
              value={form.min_images}
              onChange={(e) =>
                setForm({ ...form, min_images: parseInt(e.target.value, 10) || 0 })
              }
              className="w-full px-3 py-2 border rounded-lg"
              min={1}
            />
            <p className="text-xs text-gray-500 mt-1">
              Species with fewer images will be excluded
            </p>
          </div>
          <div>
            <label className="block text-sm font-medium mb-1">
              Train/Test Split
            </label>
            <div className="flex items-center gap-4">
              <input
                type="range"
                value={form.train_split}
                onChange={(e) =>
                  setForm({ ...form, train_split: parseFloat(e.target.value) })
                }
                min={0.5}
                max={0.95}
                step={0.05}
                className="flex-1"
              />
              <span className="text-sm w-20 text-right">
                {Math.round(form.train_split * 100)}% /{' '}
                {Math.round((1 - form.train_split) * 100)}%
              </span>
            </div>
          </div>
          <div>
            <label className="block text-sm font-medium mb-2">
              Filter by License (optional)
            </label>
            <div className="flex flex-wrap gap-2">
              {licenses?.map((license) => (
                <button
                  key={license}
                  onClick={() => toggleLicense(license)}
                  className={`px-3 py-1 rounded-full text-sm ${
                    form.licenses.includes(license)
                      ? 'bg-green-100 text-green-700 border-green-300'
                      : 'bg-gray-100 text-gray-600'
                  } border`}
                >
                  {license}
                </button>
              ))}
            </div>
            {form.licenses.length === 0 && (
              <p className="text-xs text-gray-500 mt-1">
                All licenses will be included
              </p>
            )}
          </div>
          {/* Preview */}
          {previewMutation.data && (
            <div className="bg-gray-50 rounded-lg p-4">
              <h4 className="font-medium mb-2">Preview</h4>
              <div className="grid grid-cols-3 gap-4 text-sm">
                <div>
                  <span className="text-gray-500">Species:</span>{' '}
                  {previewMutation.data.data.species_count}
                </div>
                <div>
                  <span className="text-gray-500">Images:</span>{' '}
                  {previewMutation.data.data.image_count}
                </div>
                <div>
                  <span className="text-gray-500">Est. Size:</span>{' '}
                  {previewMutation.data.data.estimated_size_mb.toFixed(0)} MB
                </div>
              </div>
            </div>
          )}
        </div>
        <div className="flex justify-between mt-6">
          <button
            onClick={() => previewMutation.mutate()}
            disabled={previewMutation.isPending}
            className="px-4 py-2 border rounded-lg hover:bg-gray-50 disabled:opacity-50"
          >
            Preview
          </button>
          <div className="flex gap-2">
            <button
              onClick={onClose}
              className="px-4 py-2 border rounded-lg hover:bg-gray-50"
            >
              Cancel
            </button>
            <button
              onClick={() => createMutation.mutate()}
              disabled={!form.name || createMutation.isPending}
              className="px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700 disabled:opacity-50"
            >
              Create Export
            </button>
          </div>
        </div>
      </div>
    </div>
  )
 }
@@ -0,0 +1,331 @@
 import { useState } from 'react'
 import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
 import {
  Search,
  Filter,
  Trash2,
  ChevronLeft,
  ChevronRight,
  X,
  ExternalLink,
 } from 'lucide-react'
 import { imagesApi } from '../api/client'
 export default function Images() {
  const queryClient = useQueryClient()
  const [page, setPage] = useState(1)
  const [search, setSearch] = useState('')
  const [filters, setFilters] = useState({
    source: '',
    license: '',
    status: 'downloaded',
    min_quality: undefined as number | undefined,
  })
  const [selectedIds, setSelectedIds] = useState<number[]>([])
  const [selectedImage, setSelectedImage] = useState<number | null>(null)
  const { data, isLoading } = useQuery({
    queryKey: ['images', page, search, filters],
    queryFn: () =>
      imagesApi
        .list({
          page,
          page_size: 48,
          search: search || undefined,
          source: filters.source || undefined,
          license: filters.license || undefined,
          status: filters.status || undefined,
          min_quality: filters.min_quality,
        })
        .then((res) => res.data),
  })
  const { data: sources } = useQuery({
    queryKey: ['image-sources'],
    queryFn: () => imagesApi.sources().then((res) => res.data),
  })
  const { data: licenses } = useQuery({
    queryKey: ['image-licenses'],
    queryFn: () => imagesApi.licenses().then((res) => res.data),
  })
  const { data: imageDetail } = useQuery({
    queryKey: ['image', selectedImage],
    queryFn: () => imagesApi.get(selectedImage!).then((res) => res.data),
    enabled: selectedImage !== null,
  })
  const deleteMutation = useMutation({
    mutationFn: (id: number) => imagesApi.delete(id),
    onSuccess: () => {
      queryClient.invalidateQueries({ queryKey: ['images'] })
      setSelectedImage(null)
    },
  })
  const bulkDeleteMutation = useMutation({
    mutationFn: (ids: number[]) => imagesApi.bulkDelete(ids),
    onSuccess: () => {
      queryClient.invalidateQueries({ queryKey: ['images'] })
      setSelectedIds([])
    },
  })
  const handleSelect = (id: number) => {
    setSelectedIds((prev) =>
      prev.includes(id) ? prev.filter((i) => i !== id) : [...prev, id]
    )
  }
  return (
    <div className="space-y-6">
      <div className="flex items-center justify-between">
        <h1 className="text-2xl font-bold">Images</h1>
        {selectedIds.length > 0 && (
          <button
            onClick={() => bulkDeleteMutation.mutate(selectedIds)}
            className="flex items-center gap-2 px-4 py-2 bg-red-600 text-white rounded-lg hover:bg-red-700"
          >
            <Trash2 className="w-4 h-4" />
            Delete {selectedIds.length} images
          </button>
        )}
      </div>
      {/* Filters */}
      <div className="flex flex-wrap gap-4">
        <div className="relative">
          <Search className="absolute left-3 top-1/2 -translate-y-1/2 w-4 h-4 text-gray-400" />
          <input
            type="text"
            placeholder="Search species..."
            value={search}
            onChange={(e) => {
              setSearch(e.target.value)
              setPage(1)
            }}
            className="pl-10 pr-4 py-2 border rounded-lg w-64"
          />
        </div>
        <select
          value={filters.source}
          onChange={(e) => {
            setFilters({ ...filters, source: e.target.value })
            setPage(1) // a new filter can shrink the result set; return to the first page
          }}
          className="px-3 py-2 border rounded-lg"
        >
          <option value="">All Sources</option>
          {sources?.map((s) => (
            <option key={s} value={s}>
              {s}
            </option>
          ))}
        </select>
        <select
          value={filters.license}
          onChange={(e) => {
            setFilters({ ...filters, license: e.target.value })
            setPage(1) // a new filter can shrink the result set; return to the first page
          }}
          className="px-3 py-2 border rounded-lg"
        >
          <option value="">All Licenses</option>
          {licenses?.map((l) => (
            <option key={l} value={l}>
              {l}
            </option>
          ))}
        </select>
        <select
          value={filters.status}
          onChange={(e) => {
            setFilters({ ...filters, status: e.target.value })
            setPage(1) // a new filter can shrink the result set; return to the first page
          }}
          className="px-3 py-2 border rounded-lg"
        >
          <option value="">All Status</option>
          <option value="downloaded">Downloaded</option>
          <option value="pending">Pending</option>
          <option value="rejected">Rejected</option>
        </select>
      </div>
      {/* Image Grid */}
      {isLoading ? (
        <div className="flex items-center justify-center h-64">
          <div className="animate-spin rounded-full h-8 w-8 border-b-2 border-green-600"></div>
        </div>
      ) : data?.items.length === 0 ? (
        <div className="flex flex-col items-center justify-center h-64 text-gray-400">
          <Filter className="w-12 h-12 mb-4" />
          <p>No images found</p>
        </div>
      ) : (
        <div className="grid grid-cols-2 sm:grid-cols-4 md:grid-cols-6 lg:grid-cols-8 gap-2">
          {data?.items.map((image) => (
            <div
              key={image.id}
              className={`relative aspect-square bg-gray-100 rounded-lg overflow-hidden cursor-pointer group ${
                selectedIds.includes(image.id) ? 'ring-2 ring-green-500' : ''
              }`}
              onClick={() => setSelectedImage(image.id)}
            >
              {image.local_path ? (
                <img
                  src={`/api/images/${image.id}/file`}
                  alt={image.species_name || ''}
                  className="w-full h-full object-cover"
                  loading="lazy"
                />
              ) : (
                <div className="flex items-center justify-center h-full text-gray-400 text-xs">
                  Pending
                </div>
              )}
              <div className="absolute inset-0 bg-black/0 group-hover:bg-black/20 transition-colors" />
              <div className="absolute top-1 left-1">
                <input
                  type="checkbox"
                  checked={selectedIds.includes(image.id)}
                  /* Stop the click from bubbling to the tile's onClick, which opens the detail modal */
                  onClick={(e) => e.stopPropagation()}
                  onChange={() => handleSelect(image.id)}
                  className="rounded opacity-0 group-hover:opacity-100 checked:opacity-100"
                />
              </div>
              <div className="absolute bottom-0 left-0 right-0 bg-gradient-to-t from-black/60 to-transparent p-1 opacity-0 group-hover:opacity-100 transition-opacity">
                <p className="text-white text-xs truncate">
                  {image.species_name}
                </p>
              </div>
            </div>
          ))}
        </div>
      )}
      {/* Pagination */}
      {data && data.pages > 1 && (
        <div className="flex items-center justify-between">
          <span className="text-sm text-gray-600">
            {data.total} images
          </span>
          <div className="flex gap-2">
            <button
              onClick={() => setPage((p) => Math.max(1, p - 1))}
              disabled={page === 1}
              className="p-2 rounded border disabled:opacity-50"
            >
              <ChevronLeft className="w-4 h-4" />
            </button>
            <span className="px-4 py-2">
              Page {page} of {data.pages}
            </span>
            <button
              onClick={() => setPage((p) => Math.min(data.pages, p + 1))}
              disabled={page === data.pages}
              className="p-2 rounded border disabled:opacity-50"
            >
              <ChevronRight className="w-4 h-4" />
            </button>
          </div>
        </div>
      )}
      {/* Image Detail Modal */}
      {selectedImage !== null && imageDetail && (
        <div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50 p-8">
          <div className="bg-white rounded-lg w-full max-w-4xl max-h-full overflow-auto">
            <div className="flex justify-between items-center p-4 border-b">
              <h2 className="text-lg font-semibold">Image Details</h2>
              <button
                onClick={() => setSelectedImage(null)}
                className="p-1 hover:bg-gray-100 rounded"
              >
                <X className="w-5 h-5" />
              </button>
            </div>
            <div className="grid grid-cols-2 gap-6 p-6">
              <div className="aspect-square bg-gray-100 rounded-lg overflow-hidden">
                {imageDetail.local_path ? (
                  <img
                    src={`/api/images/${imageDetail.id}/file`}
                    alt={imageDetail.species_name || ''}
                    className="w-full h-full object-contain"
                  />
                ) : (
                  <div className="flex items-center justify-center h-full text-gray-400">
                    Not downloaded
                  </div>
                )}
              </div>
              <div className="space-y-4">
                <div>
                  <label className="text-sm text-gray-500">Species</label>
                  <p className="font-medium">{imageDetail.species_name}</p>
                </div>
                <div>
                  <label className="text-sm text-gray-500">Source</label>
                  <p>{imageDetail.source}</p>
                </div>
                <div>
                  <label className="text-sm text-gray-500">License</label>
                  <p>{imageDetail.license}</p>
                </div>
                {imageDetail.attribution && (
                  <div>
                    <label className="text-sm text-gray-500">Attribution</label>
                    <p className="text-sm">{imageDetail.attribution}</p>
                  </div>
                )}
                <div className="grid grid-cols-2 gap-4">
                  <div>
                    <label className="text-sm text-gray-500">Dimensions</label>
                    <p>
                      {imageDetail.width || '?'} x {imageDetail.height || '?'}
                    </p>
                  </div>
                  <div>
                    <label className="text-sm text-gray-500">Quality Score</label>
                    <p>{imageDetail.quality_score?.toFixed(1) || 'N/A'}</p>
                  </div>
                </div>
                <div>
                  <label className="text-sm text-gray-500">Status</label>
                  <p>
                    <span
                      className={`inline-block px-2 py-1 rounded text-sm ${
                        imageDetail.status === 'downloaded'
                          ? 'bg-green-100 text-green-700'
                          : imageDetail.status === 'pending'
                          ? 'bg-yellow-100 text-yellow-700'
                          : 'bg-red-100 text-red-700'
                      }`}
                    >
                      {imageDetail.status}
                    </span>
                  </p>
                </div>
                <div className="flex gap-2 pt-4">
                  <a
                    href={imageDetail.url}
                    target="_blank"
                    rel="noopener noreferrer"
                    className="flex items-center gap-2 px-4 py-2 border rounded-lg hover:bg-gray-50"
                  >
                    <ExternalLink className="w-4 h-4" />
                    View Original
                  </a>
                  <button
                    onClick={() => deleteMutation.mutate(imageDetail.id)}
                    disabled={deleteMutation.isPending}
                    className="flex items-center gap-2 px-4 py-2 bg-red-600 text-white rounded-lg hover:bg-red-700 disabled:opacity-50"
                  >
                    <Trash2 className="w-4 h-4" />
                    Delete
                  </button>
                </div>
              </div>
            </div>
          </div>
        </div>
      )}
    </div>
  )
 }
@@ -0,0 +1,354 @@
 import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
 import {
  Play,
  Pause,
  XCircle,
  CheckCircle,
  Clock,
  AlertCircle,
  RefreshCw,
  Leaf,
  Download,
  XOctagon,
 } from 'lucide-react'
 import { jobsApi, Job } from '../api/client'
 export default function Jobs() {
  const queryClient = useQueryClient()
  const { data, isLoading, refetch } = useQuery({
    queryKey: ['jobs'],
    queryFn: () => jobsApi.list({ limit: 100 }).then((res) => res.data),
    refetchInterval: 1000, // Faster refresh for live updates
  })
  const pauseMutation = useMutation({
    mutationFn: (id: number) => jobsApi.pause(id),
    onSuccess: () => queryClient.invalidateQueries({ queryKey: ['jobs'] }),
  })
  const resumeMutation = useMutation({
    mutationFn: (id: number) => jobsApi.resume(id),
    onSuccess: () => queryClient.invalidateQueries({ queryKey: ['jobs'] }),
  })
  const cancelMutation = useMutation({
    mutationFn: (id: number) => jobsApi.cancel(id),
    onSuccess: () => queryClient.invalidateQueries({ queryKey: ['jobs'] }),
  })
  const getStatusIcon = (status: string) => {
    switch (status) {
      case 'running':
        return <RefreshCw className="w-4 h-4 text-blue-500 animate-spin" />
      case 'pending':
        return <Clock className="w-4 h-4 text-yellow-500" />
      case 'paused':
        return <Pause className="w-4 h-4 text-gray-500" />
      case 'completed':
        return <CheckCircle className="w-4 h-4 text-green-500" />
      case 'failed':
        return <AlertCircle className="w-4 h-4 text-red-500" />
      default:
        return null
    }
  }
  const getStatusClass = (status: string) => {
    switch (status) {
      case 'running':
        return 'bg-blue-100 text-blue-700'
      case 'pending':
        return 'bg-yellow-100 text-yellow-700'
      case 'paused':
        return 'bg-gray-100 text-gray-700'
      case 'completed':
        return 'bg-green-100 text-green-700'
      case 'failed':
        return 'bg-red-100 text-red-700'
      default:
        return 'bg-gray-100 text-gray-700'
    }
  }
  // Separate running jobs from others
  const runningJobs = data?.items.filter((j) => j.status === 'running') || []
  const otherJobs = data?.items.filter((j) => j.status !== 'running') || []
  return (
    <div className="space-y-6">
      <div className="flex items-center justify-between">
        <h1 className="text-2xl font-bold">Jobs</h1>
        <button
          onClick={() => refetch()}
          className="flex items-center gap-2 px-4 py-2 border rounded-lg hover:bg-gray-50"
        >
          <RefreshCw className="w-4 h-4" />
          Refresh
        </button>
      </div>
      {isLoading ? (
        <div className="flex items-center justify-center h-64">
          <div className="animate-spin rounded-full h-8 w-8 border-b-2 border-green-600"></div>
        </div>
      ) : data?.items.length === 0 ? (
        <div className="bg-white rounded-lg shadow p-8 text-center text-gray-400">
          <Clock className="w-12 h-12 mx-auto mb-4" />
          <p>No jobs yet</p>
          <p className="text-sm mt-2">
            Select species and start a scrape job to get started
          </p>
        </div>
      ) : (
        <div className="space-y-6">
          {/* Running Jobs - More prominent display */}
          {runningJobs.length > 0 && (
            <div className="space-y-4">
              <h2 className="text-lg font-semibold flex items-center gap-2">
                <RefreshCw className="w-5 h-5 animate-spin text-blue-500" />
                Active Jobs ({runningJobs.length})
              </h2>
              {runningJobs.map((job) => (
                <RunningJobCard
                  key={job.id}
                  job={job}
                  onPause={() => pauseMutation.mutate(job.id)}
                  onCancel={() => cancelMutation.mutate(job.id)}
                />
              ))}
            </div>
          )}
          {/* Other Jobs */}
          {otherJobs.length > 0 && (
            <div className="space-y-4">
              {runningJobs.length > 0 && (
                <h2 className="text-lg font-semibold text-gray-600">Other Jobs</h2>
              )}
              {otherJobs.map((job) => (
                <div
                  key={job.id}
                  className="bg-white rounded-lg shadow p-6"
                >
                  <div className="flex items-start justify-between">
                    <div className="flex-1">
                      <div className="flex items-center gap-3">
                        {getStatusIcon(job.status)}
                        <h3 className="font-semibold">{job.name}</h3>
                        <span
                          className={`px-2 py-0.5 rounded text-xs ${getStatusClass(
                            job.status
                          )}`}
                        >
                          {job.status}
                        </span>
                      </div>
                      <div className="mt-2 text-sm text-gray-600">
                        <span className="mr-4">Source: {job.source}</span>
                        <span className="mr-4">
                          Downloaded: {job.images_downloaded}
                        </span>
                        <span>Rejected: {job.images_rejected}</span>
                      </div>
                      {/* Progress bar for paused jobs */}
                      {job.status === 'paused' && job.progress_total > 0 && (
                        <div className="mt-4">
                          <div className="flex justify-between text-sm text-gray-600 mb-1">
                            <span>
                              {job.progress_current} / {job.progress_total} species
                            </span>
                            <span>
                              {Math.round(
                                (job.progress_current / job.progress_total) * 100
                              )}
                              %
                            </span>
                          </div>
                          <div className="h-2 bg-gray-200 rounded-full overflow-hidden">
                            <div
                              className="h-full rounded-full bg-gray-400"
                              style={{
                                width: `${
                                  (job.progress_current / job.progress_total) * 100
                                }%`,
                              }}
                            />
                          </div>
                        </div>
                      )}
                      {job.error_message && (
                        <div className="mt-2 text-sm text-red-600">
                          Error: {job.error_message}
                        </div>
                      )}
                      <div className="mt-2 text-xs text-gray-400">
                        {job.started_at && (
                          <span className="mr-4">
                            Started: {new Date(job.started_at).toLocaleString()}
                          </span>
                        )}
                        {job.completed_at && (
                          <span>
                            Completed: {new Date(job.completed_at).toLocaleString()}
                          </span>
                        )}
                      </div>
                    </div>
                    {/* Actions */}
                    <div className="flex gap-2 ml-4">
                      {job.status === 'paused' && (
                        <button
                          onClick={() => resumeMutation.mutate(job.id)}
                          className="p-2 text-blue-600 hover:bg-blue-50 rounded"
                          title="Resume"
                        >
                          <Play className="w-5 h-5" />
                        </button>
                      )}
                      {(job.status === 'paused' || job.status === 'pending') && (
                        <button
                          onClick={() => cancelMutation.mutate(job.id)}
                          className="p-2 text-red-600 hover:bg-red-50 rounded"
                          title="Cancel"
                        >
                          <XCircle className="w-5 h-5" />
                        </button>
                      )}
                    </div>
                  </div>
                </div>
              ))}
            </div>
          )}
        </div>
      )}
    </div>
  )
 }
 function RunningJobCard({
  job,
  onPause,
  onCancel,
 }: {
  job: Job
  onPause: () => void
  onCancel: () => void
 }) {
  // Fetch real-time progress for this job
  const { data: progress } = useQuery({
    queryKey: ['job-progress', job.id],
    queryFn: () => jobsApi.progress(job.id).then((res) => res.data),
    refetchInterval: 500, // Very fast updates for live feel
    enabled: job.status === 'running',
  })
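  const currentSpecies = progress?.current_species || ''
  // Prefer the live progress endpoint; fall back to the job list values until it responds
  const progressCurrent = progress?.progress_current ?? job.progress_current
  const progressTotal = progress?.progress_total ?? job.progress_total
  // Clamp to 100 so a stale total cannot push the bar past its container
  const percentage =
    progressTotal > 0
      ? Math.min(100, Math.round((progressCurrent / progressTotal) * 100))
      : 0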
  return (
    <div className="bg-gradient-to-r from-blue-50 to-white rounded-lg shadow-lg border-2 border-blue-200 p-6">
      <div className="flex items-start justify-between">
        <div className="flex-1">
          <div className="flex items-center gap-3">
            <RefreshCw className="w-5 h-5 text-blue-500 animate-spin" />
            <h3 className="font-semibold text-lg">{job.name}</h3>
            <span className="px-2 py-0.5 rounded text-xs bg-blue-100 text-blue-700 animate-pulse">
              running
            </span>
          </div>
          {/* Live Stats */}
          <div className="mt-4 grid grid-cols-3 gap-4">
            <div className="bg-white rounded-lg p-3 border">
              <div className="flex items-center gap-2 text-gray-500 text-sm">
                <Leaf className="w-4 h-4" />
                Species Progress
              </div>
              <div className="text-2xl font-bold text-blue-600 mt-1">
                {progressCurrent} / {progressTotal}
              </div>
            </div>
            <div className="bg-white rounded-lg p-3 border">
              <div className="flex items-center gap-2 text-gray-500 text-sm">
                <Download className="w-4 h-4" />
                Downloaded
              </div>
              <div className="text-2xl font-bold text-green-600 mt-1">
                {job.images_downloaded}
              </div>
            </div>
            <div className="bg-white rounded-lg p-3 border">
              <div className="flex items-center gap-2 text-gray-500 text-sm">
                <XOctagon className="w-4 h-4" />
                Rejected
              </div>
              <div className="text-2xl font-bold text-red-600 mt-1">
                {job.images_rejected}
              </div>
            </div>
          </div>
          {/* Current Species */}
          {currentSpecies && (
            <div className="mt-4 bg-white rounded-lg p-3 border">
              <div className="text-sm text-gray-500 mb-1">Currently scraping:</div>
              <div className="flex items-center gap-2">
                <span className="relative flex h-3 w-3">
                  <span className="animate-ping absolute inline-flex h-full w-full rounded-full bg-blue-400 opacity-75"></span>
                  <span className="relative inline-flex rounded-full h-3 w-3 bg-blue-500"></span>
                </span>
                <span className="font-medium text-blue-800 italic">{currentSpecies}</span>
              </div>
            </div>
          )}
          {/* Progress bar */}
          {progressTotal > 0 && (
            <div className="mt-4">
              <div className="flex justify-between text-sm text-gray-600 mb-1">
                <span>Progress</span>
                <span className="font-medium">{percentage}%</span>
              </div>
              <div className="h-3 bg-gray-200 rounded-full overflow-hidden">
                <div
                  className="h-full rounded-full bg-gradient-to-r from-blue-500 to-blue-600 transition-all duration-500"
                  style={{ width: `${percentage}%` }}
                />
              </div>
            </div>
          )}
          <div className="mt-3 text-xs text-gray-400">
            Source: {job.source} • Started: {job.started_at ? new Date(job.started_at).toLocaleString() : 'N/A'}
          </div>
        </div>
        {/* Actions */}
        <div className="flex gap-2 ml-4">
          <button
            onClick={onPause}
            className="p-2 text-gray-600 hover:bg-gray-100 rounded"
            title="Pause"
          >
            <Pause className="w-5 h-5" />
          </button>
          <button
            onClick={onCancel}
            className="p-2 text-red-600 hover:bg-red-50 rounded"
            title="Cancel"
          >
            <XCircle className="w-5 h-5" />
          </button>
        </div>
      </div>
    </div>
  )
 }
@@ -0,0 +1,543 @@
 import { useState } from 'react'
 import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
 import {
  Key,
  CheckCircle,
  XCircle,
  Eye,
  EyeOff,
  RefreshCw,
  FolderInput,
  AlertTriangle,
 } from 'lucide-react'
 import { sourcesApi, imagesApi, SourceConfig, ImportScanResult } from '../api/client'
 export default function Settings() {
  const [editingSource, setEditingSource] = useState<string | null>(null)
  const { data: sources, isLoading, error } = useQuery({
    queryKey: ['sources'],
    queryFn: () => sourcesApi.list().then((res) => res.data),
  })
  return (
    <div className="space-y-6">
      <h1 className="text-2xl font-bold">Settings</h1>
      {/* API Keys Section */}
      <div className="bg-white rounded-lg shadow">
        <div className="px-6 py-4 border-b">
          <h2 className="text-lg font-semibold flex items-center gap-2">
            <Key className="w-5 h-5" />
            API Keys
          </h2>
          <p className="text-sm text-gray-500 mt-1">
            Configure API keys for each data source
          </p>
        </div>
        {isLoading ? (
          <div className="p-6 text-center">
            <RefreshCw className="w-6 h-6 animate-spin mx-auto text-gray-400" />
          </div>
        ) : error ? (
          <div className="p-6 text-center text-red-600">
            Error loading sources: {(error as Error).message}
          </div>
        ) : !sources || sources.length === 0 ? (
          <div className="p-6 text-center text-gray-500">
            No sources available
          </div>
        ) : (
          <div className="divide-y">
            {sources.map((source) => (
              <SourceRow
                key={source.name}
                source={source}
                isEditing={editingSource === source.name}
                onEdit={() => setEditingSource(source.name)}
                onClose={() => setEditingSource(null)}
              />
            ))}
          </div>
        )}
      </div>
      {/* Import Scanner Section */}
      <ImportScanner />
      {/* Rate Limits Info */}
      <div className="bg-yellow-50 border border-yellow-200 rounded-lg p-4">
        <h3 className="font-medium text-yellow-800">Rate Limits (recommended settings)</h3>
        <ul className="text-sm text-yellow-700 mt-2 space-y-1 list-disc list-inside">
          <li>GBIF: 1 req/sec safe (free, no authentication required)</li>
          <li>iNaturalist: 1 req/sec recommended (60/min recommended, 100/min hard cap), 10k/day, 5GB/hr media</li>
          <li>Flickr: 0.5 req/sec recommended (3600/hr limit per API key, shared across all users of that key)</li>
          <li>Wikimedia: 1 req/sec safe (requires OAuth credentials)</li>
          <li>Trefle: 1 req/sec safe (120/min limit)</li>
        </ul>
      </div>
    </div>
  )
 }
 function SourceRow({
  source,
  isEditing,
  onEdit,
  onClose,
 }: {
  source: SourceConfig
  isEditing: boolean
  onEdit: () => void
  onClose: () => void
 }) {
  const queryClient = useQueryClient()
  const [showKey, setShowKey] = useState(false)
  const [form, setForm] = useState({
    api_key: '',
    api_secret: '',
    access_token: '',
    rate_limit_per_sec: source.configured ? source.rate_limit_per_sec : (source.default_rate || 1.0),
    enabled: source.enabled,
  })
  // Get field labels based on auth type
  const isNoAuth = source.auth_type === 'none'
  const isOAuth = source.auth_type === 'oauth'
  const keyLabel = isOAuth ? 'Client ID' : 'API Key'
  const secretLabel = isOAuth ? 'Client Secret' : 'API Secret'
  const [testResult, setTestResult] = useState<{
    status: 'success' | 'error'
    message: string
  } | null>(null)
  const updateMutation = useMutation({
    mutationFn: () =>
      sourcesApi.update(source.name, {
        api_key: isNoAuth ? undefined : form.api_key || undefined,
        api_secret: form.api_secret || undefined,
        access_token: form.access_token || undefined,
        rate_limit_per_sec: form.rate_limit_per_sec,
        enabled: form.enabled,
      }),
    onSuccess: () => {
      queryClient.invalidateQueries({ queryKey: ['sources'] })
      onClose()
    },
  })
  const testMutation = useMutation({
    mutationFn: () => sourcesApi.test(source.name),
    onSuccess: (res) => {
      setTestResult({ status: res.data.status, message: res.data.message })
    },
    onError: (err: any) => {
      setTestResult({
        status: 'error',
        message: err.response?.data?.message || 'Connection failed',
      })
    },
  })
  if (isEditing) {
    return (
      <div className="p-6 bg-gray-50">
        <div className="flex items-center justify-between mb-4">
          <h3 className="font-medium">{source.label}</h3>
          <button
            onClick={onClose}
            className="text-gray-500 hover:text-gray-700"
          >
            Cancel
          </button>
        </div>
        <div className="space-y-4">
          {isNoAuth ? (
            <div className="bg-green-50 border border-green-200 rounded-lg p-3 text-green-700 text-sm">
              This source doesn't require authentication. Just enable it to start scraping.
            </div>
          ) : (
            <>
              <div>
                <label className="block text-sm font-medium mb-1">{keyLabel}</label>
                <div className="relative">
                  <input
                    type={showKey ? 'text' : 'password'}
                    value={form.api_key}
                    onChange={(e) => setForm({ ...form, api_key: e.target.value })}
                    placeholder={source.api_key_masked || `Enter ${keyLabel}`}
                    className="w-full px-3 py-2 border rounded-lg pr-10"
                  />
                  <button
                    type="button"
                    onClick={() => setShowKey(!showKey)}
                    className="absolute right-2 top-1/2 -translate-y-1/2 text-gray-400"
                  >
                    {showKey ? (
                      <EyeOff className="w-4 h-4" />
                    ) : (
                      <Eye className="w-4 h-4" />
                    )}
                  </button>
                </div>
              </div>
              {source.requires_secret && (
                <div>
                  <label className="block text-sm font-medium mb-1">
                    {secretLabel}
                  </label>
                  <input
                    type="password"
                    value={form.api_secret}
                    onChange={(e) =>
                      setForm({ ...form, api_secret: e.target.value })
                    }
                    placeholder={source.has_secret ? '••••••••' : `Enter ${secretLabel}`}
                    className="w-full px-3 py-2 border rounded-lg"
                  />
                </div>
              )}
              {isOAuth && (
                <div>
                  <label className="block text-sm font-medium mb-1">
                    Access Token
                  </label>
                  <input
                    type="password"
                    value={form.access_token}
                    onChange={(e) =>
                      setForm({ ...form, access_token: e.target.value })
                    }
                    placeholder={source.has_access_token ? '••••••••' : 'Enter Access Token'}
                    className="w-full px-3 py-2 border rounded-lg"
                  />
                </div>
              )}
            </>
          )}
          <div>
            <label className="block text-sm font-medium mb-1">
              Rate Limit (requests/sec)
            </label>
            <input
              type="number"
              value={form.rate_limit_per_sec}
              onChange={(e) =>
                setForm({
                  ...form,
                  rate_limit_per_sec: parseFloat(e.target.value) || 1,
                })
              }
              className="w-full px-3 py-2 border rounded-lg"
              min={0.1}
              max={10}
              step={0.1}
            />
          </div>
          <div className="flex items-center gap-2">
            <input
              type="checkbox"
              id={`enabled-${source.name}`}
              checked={form.enabled}
              onChange={(e) => setForm({ ...form, enabled: e.target.checked })}
              className="rounded"
            />
            <label htmlFor={`enabled-${source.name}`} className="text-sm">
              Enable this source
            </label>
          </div>
          {testResult && (
            <div
              className={`p-3 rounded-lg ${
                testResult.status === 'success'
                  ? 'bg-green-50 text-green-700'
                  : 'bg-red-50 text-red-700'
              }`}
            >
              {testResult.message}
            </div>
          )}
          <div className="flex justify-between">
            {source.configured && (
              <button
                onClick={() => testMutation.mutate()}
                disabled={testMutation.isPending}
                className="px-4 py-2 border rounded-lg hover:bg-white"
              >
                {testMutation.isPending ? 'Testing...' : 'Test Connection'}
              </button>
            )}
            <button
              onClick={() => updateMutation.mutate()}
              disabled={
                updateMutation.isPending ||
                (!isNoAuth && !form.api_key && !source.configured)
              }
              className="px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700 disabled:opacity-50 ml-auto"
            >
              {updateMutation.isPending ? 'Saving...' : 'Save'}
            </button>
          </div>
        </div>
      </div>
    )
  }
  // Same check as isNoAuth above; reused for the read-only row
  const isNoAuthRow = isNoAuth
  return (
    <div className="px-6 py-4 flex items-center justify-between">
      <div className="flex items-center gap-4">
        <div
          className={`w-2 h-2 rounded-full ${
            (isNoAuthRow || source.configured) && source.enabled
              ? 'bg-green-500'
              : source.configured
              ? 'bg-yellow-500'
              : 'bg-gray-300'
          }`}
        />
        <div>
          <h3 className="font-medium">{source.label}</h3>
          <p className="text-sm text-gray-500">
            {isNoAuthRow
              ? 'No authentication required'
              : source.configured
              ? `Key: ${source.api_key_masked}`
              : 'Not configured'}
          </p>
        </div>
      </div>
      <div className="flex items-center gap-4">
        {(isNoAuthRow || source.configured) && (
          <span
            className={`flex items-center gap-1 text-sm ${
              source.enabled ? 'text-green-600' : 'text-gray-400'
            }`}
          >
            {source.enabled ? (
              <>
                <CheckCircle className="w-4 h-4" />
                Enabled
              </>
            ) : (
              <>
                <XCircle className="w-4 h-4" />
                Disabled
              </>
            )}
          </span>
        )}
        <button
          onClick={onEdit}
          className="px-3 py-1 text-sm border rounded hover:bg-gray-50"
        >
          {isNoAuthRow || source.configured ? 'Edit' : 'Configure'}
        </button>
      </div>
    </div>
  )
 }
 function ImportScanner() {
  const [scanResult, setScanResult] = useState<ImportScanResult | null>(null)
  const [moveFiles, setMoveFiles] = useState(false)
  const [importResult, setImportResult] = useState<{
    imported: number
    skipped: number
    errors: string[]
  } | null>(null)
  const scanMutation = useMutation({
    mutationFn: () => imagesApi.scanImports().then((res) => res.data),
    onSuccess: (data) => {
      setScanResult(data)
      setImportResult(null)
    },
  })
  const importMutation = useMutation({
    mutationFn: () => imagesApi.runImport(moveFiles).then((res) => res.data),
    onSuccess: (data) => {
      setImportResult(data)
      setScanResult(null)
    },
  })
  return (
    <div className="bg-white rounded-lg shadow">
      <div className="px-6 py-4 border-b">
        <h2 className="text-lg font-semibold flex items-center gap-2">
          <FolderInput className="w-5 h-5" />
          Import Images
        </h2>
        <p className="text-sm text-gray-500 mt-1">
          Bulk import images from the imports folder
        </p>
      </div>
      <div className="p-6 space-y-4">
        <div className="bg-gray-50 rounded-lg p-4">
          <h3 className="font-medium text-sm mb-2">Expected folder structure:</h3>
          <code className="text-sm text-gray-600 block">
            imports/{'{source}'}/{'{species_name}'}/*.jpg
          </code>
          <p className="text-sm text-gray-500 mt-2">
            Example: imports/inaturalist/Monstera_deliciosa/image1.jpg
          </p>
        </div>
        <div className="flex items-center gap-4">
          <button
            onClick={() => scanMutation.mutate()}
            disabled={scanMutation.isPending}
            className="px-4 py-2 bg-blue-600 text-white rounded-lg hover:bg-blue-700 disabled:opacity-50 flex items-center gap-2"
          >
            {scanMutation.isPending ? (
              <>
                <RefreshCw className="w-4 h-4 animate-spin" />
                Scanning...
              </>
            ) : (
              'Scan Imports Folder'
            )}
          </button>
        </div>
        {scanMutation.isError && (
          <div className="bg-red-50 border border-red-200 rounded-lg p-4 text-red-700">
            Error scanning: {(scanMutation.error as Error).message}
          </div>
        )}
        {scanResult && (
          <div className="space-y-4">
            {!scanResult.available ? (
              <div className="bg-yellow-50 border border-yellow-200 rounded-lg p-4">
                <p className="text-yellow-700">{scanResult.message}</p>
              </div>
            ) : scanResult.total_images === 0 ? (
              <div className="bg-gray-50 border border-gray-200 rounded-lg p-4">
                <p className="text-gray-600">No images found in the imports folder.</p>
              </div>
            ) : (
              <>
                <div className="bg-green-50 border border-green-200 rounded-lg p-4">
                  <h3 className="font-medium text-green-800 mb-2">Scan Results</h3>
                  <div className="grid grid-cols-2 gap-4 text-sm">
                    <div>
                      <span className="text-gray-600">Total Images:</span>
                      <span className="ml-2 font-medium">{scanResult.total_images}</span>
                    </div>
                    <div>
                      <span className="text-gray-600">Matched Species:</span>
                      <span className="ml-2 font-medium">{scanResult.matched_species}</span>
                    </div>
                  </div>
                  {scanResult.sources.length > 0 && (
                    <div className="mt-4">
                      <h4 className="text-sm font-medium text-green-800 mb-2">Sources Found:</h4>
                      <div className="space-y-1">
                        {scanResult.sources.map((source) => (
                          <div key={source.name} className="text-sm flex justify-between">
                            <span>{source.name}</span>
                            <span className="text-gray-600">
                              {source.species_count} species, {source.image_count} images
                            </span>
                          </div>
                        ))}
                      </div>
                    </div>
                  )}
                </div>
                {scanResult.unmatched_species.length > 0 && (
                  <div className="bg-yellow-50 border border-yellow-200 rounded-lg p-4">
                    <h3 className="font-medium text-yellow-800 flex items-center gap-2 mb-2">
                      <AlertTriangle className="w-4 h-4" />
                      Unmatched Species ({scanResult.unmatched_species.length})
                    </h3>
                    <p className="text-sm text-yellow-700 mb-2">
                      These species folders don't match any species in the database and will be skipped:
                    </p>
                    <div className="text-sm text-yellow-600 max-h-32 overflow-y-auto">
                      {scanResult.unmatched_species.slice(0, 20).map((name) => (
                        <div key={name}>{name}</div>
                      ))}
                      {scanResult.unmatched_species.length > 20 && (
                        <div className="text-yellow-500 mt-1">
                          ...and {scanResult.unmatched_species.length - 20} more
                        </div>
                      )}
                    </div>
                  </div>
                )}
                <div className="border-t pt-4">
                  <div className="flex items-center gap-4 mb-4">
                    <label className="flex items-center gap-2 text-sm">
                      <input
                        type="checkbox"
                        checked={moveFiles}
                        onChange={(e) => setMoveFiles(e.target.checked)}
                        className="rounded"
                      />
                      Move files instead of copying (removes originals)
                    </label>
                  </div>
                  <button
                    onClick={() => importMutation.mutate()}
                    disabled={importMutation.isPending || scanResult.matched_species === 0}
                    className="px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700 disabled:opacity-50 flex items-center gap-2"
                  >
                    {importMutation.isPending ? (
                      <>
                        <RefreshCw className="w-4 h-4 animate-spin" />
                        Importing...
                      </>
                    ) : (
                      `Import ${scanResult.total_images} Images`
                    )}
                  </button>
                </div>
              </>
            )}
          </div>
        )}
        {importResult && (
          <div className="bg-green-50 border border-green-200 rounded-lg p-4">
            <h3 className="font-medium text-green-800 mb-2">Import Complete</h3>
            <div className="text-sm space-y-1">
              <div>
                <span className="text-gray-600">Imported:</span>
                <span className="ml-2 font-medium text-green-700">{importResult.imported}</span>
              </div>
              <div>
                <span className="text-gray-600">Skipped (already exists):</span>
                <span className="ml-2 font-medium">{importResult.skipped}</span>
              </div>
              {importResult.errors.length > 0 && (
                <div className="mt-2">
                  <span className="text-red-600">Errors ({importResult.errors.length}):</span>
                  <div className="text-red-500 mt-1 max-h-24 overflow-y-auto">
                    {importResult.errors.map((err, i) => (
                      <div key={i} className="text-xs">{err}</div>
                    ))}
                  </div>
                </div>
              )}
            </div>
          </div>
        )}
      </div>
    </div>
  )
 }
@@ -0,0 +1,997 @@
 import { useState, useRef } from 'react'
 import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
 import {
  Plus,
  Upload,
  Search,
  Trash2,
  Play,
  ChevronLeft,
  ChevronRight,
  Filter,
  X,
  Image as ImageIcon,
  ExternalLink,
 } from 'lucide-react'
 import { speciesApi, jobsApi, imagesApi, Species as SpeciesType } from '../api/client'
 export default function Species() {
  const queryClient = useQueryClient()
  const csvInputRef = useRef<HTMLInputElement>(null)
  const jsonInputRef = useRef<HTMLInputElement>(null)
  const [page, setPage] = useState(1)
  const [search, setSearch] = useState('')
  const [genus, setGenus] = useState<string>('')
  const [hasImages, setHasImages] = useState<string>('')
  const [maxImages, setMaxImages] = useState<string>('')
  const [selectedIds, setSelectedIds] = useState<number[]>([])
  const [showAddModal, setShowAddModal] = useState(false)
  const [showScrapeModal, setShowScrapeModal] = useState(false)
  const [showScrapeAllModal, setShowScrapeAllModal] = useState(false)
  const [showScrapeFilteredModal, setShowScrapeFilteredModal] = useState(false)
  const [viewSpecies, setViewSpecies] = useState<SpeciesType | null>(null)
  const { data: genera } = useQuery({
    queryKey: ['genera'],
    queryFn: () => speciesApi.genera().then((res) => res.data),
  })
  const { data, isLoading } = useQuery({
    queryKey: ['species', page, search, genus, hasImages, maxImages],
    queryFn: () =>
      speciesApi.list({
        page,
        page_size: 50,
        search: search || undefined,
        genus: genus || undefined,
        has_images: hasImages === '' ? undefined : hasImages === 'true',
        max_images: maxImages ? parseInt(maxImages, 10) : undefined,
      }).then((res) => res.data),
  })
  const importCsvMutation = useMutation({
    mutationFn: (file: File) => speciesApi.import(file),
    onSuccess: (res) => {
      queryClient.invalidateQueries({ queryKey: ['species'] })
      queryClient.invalidateQueries({ queryKey: ['genera'] })
      alert(`Imported ${res.data.imported} species, skipped ${res.data.skipped}`)
    },
  })
  const importJsonMutation = useMutation({
    mutationFn: (file: File) => speciesApi.importJson(file),
    onSuccess: (res) => {
      queryClient.invalidateQueries({ queryKey: ['species'] })
      queryClient.invalidateQueries({ queryKey: ['genera'] })
      alert(`Imported ${res.data.imported} species, skipped ${res.data.skipped}`)
    },
  })
  const deleteMutation = useMutation({
    mutationFn: (id: number) => speciesApi.delete(id),
    onSuccess: () => {
      queryClient.invalidateQueries({ queryKey: ['species'] })
    },
  })
  const createJobMutation = useMutation({
    mutationFn: (data: { name: string; source: string; species_ids?: number[] }) =>
      jobsApi.create(data),
    onSuccess: () => {
      setShowScrapeModal(false)
      setSelectedIds([])
      alert('Scrape job created!')
    },
  })
  const handleCsvImport = (e: React.ChangeEvent<HTMLInputElement>) => {
    const file = e.target.files?.[0]
    if (file) {
      importCsvMutation.mutate(file)
      e.target.value = ''
    }
  }
  const handleJsonImport = (e: React.ChangeEvent<HTMLInputElement>) => {
    const file = e.target.files?.[0]
    if (file) {
      importJsonMutation.mutate(file)
      e.target.value = ''
    }
  }
  const handleSelectAll = () => {
    if (!data) return
    if (selectedIds.length === data.items.length) {
      setSelectedIds([])
    } else {
      setSelectedIds(data.items.map((s) => s.id))
    }
  }
  const handleSelect = (id: number) => {
    setSelectedIds((prev) =>
      prev.includes(id) ? prev.filter((i) => i !== id) : [...prev, id]
    )
  }
  return (
    <div className="space-y-6">
      <div className="flex items-center justify-between">
        <h1 className="text-2xl font-bold">Species</h1>
        <div className="flex gap-2">
          <button
            onClick={() => csvInputRef.current?.click()}
            disabled={importCsvMutation.isPending}
            className="flex items-center gap-2 px-4 py-2 bg-gray-100 rounded-lg hover:bg-gray-200 disabled:opacity-50"
          >
            <Upload className="w-4 h-4" />
            {importCsvMutation.isPending ? 'Importing...' : 'Import CSV'}
          </button>
          <input
            ref={csvInputRef}
            type="file"
            accept=".csv"
            onChange={handleCsvImport}
            className="hidden"
          />
          <button
            onClick={() => jsonInputRef.current?.click()}
            disabled={importJsonMutation.isPending}
            className="flex items-center gap-2 px-4 py-2 bg-gray-100 rounded-lg hover:bg-gray-200 disabled:opacity-50"
          >
            <Upload className="w-4 h-4" />
            {importJsonMutation.isPending ? 'Importing...' : 'Import JSON'}
          </button>
          <input
            ref={jsonInputRef}
            type="file"
            accept=".json"
            onChange={handleJsonImport}
            className="hidden"
          />
          <button
            onClick={() => setShowAddModal(true)}
            className="flex items-center gap-2 px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700"
          >
            <Plus className="w-4 h-4" />
            Add Species
          </button>
        </div>
      </div>
      {/* Search and Filters */}
      <div className="flex items-center gap-4 flex-wrap">
        <div className="relative">
          <Search className="absolute left-3 top-1/2 -translate-y-1/2 w-4 h-4 text-gray-400" />
          <input
            type="text"
            placeholder="Search species..."
            value={search}
            onChange={(e) => {
              setSearch(e.target.value)
              setPage(1)
            }}
            className="pl-10 pr-4 py-2 border rounded-lg w-64"
          />
        </div>
        <div className="flex items-center gap-2">
          <Filter className="w-4 h-4 text-gray-400" />
          <select
            value={genus}
            onChange={(e) => {
              setGenus(e.target.value)
              setPage(1)
            }}
            className="px-3 py-2 border rounded-lg bg-white"
          >
            <option value="">All Genera</option>
            {genera?.map((g) => (
              <option key={g} value={g}>
                {g}
              </option>
            ))}
          </select>
          <select
            value={hasImages}
            onChange={(e) => {
              setHasImages(e.target.value)
              setMaxImages('')
              setPage(1)
            }}
            className="px-3 py-2 border rounded-lg bg-white"
          >
            <option value="">All Species</option>
            <option value="true">Has Images</option>
            <option value="false">No Images</option>
          </select>
          <select
            value={maxImages}
            onChange={(e) => {
              setMaxImages(e.target.value)
              setHasImages('')
              setPage(1)
            }}
            className="px-3 py-2 border rounded-lg bg-white"
          >
            <option value="">Any Image Count</option>
            <option value="25">Fewer than 25 images</option>
            <option value="50">Fewer than 50 images</option>
            <option value="100">Fewer than 100 images</option>
            <option value="250">Fewer than 250 images</option>
            <option value="500">Fewer than 500 images</option>
          </select>
          {(genus || hasImages || maxImages) && (
            <button
              onClick={() => {
                setGenus('')
                setHasImages('')
                setMaxImages('')
                setPage(1)
              }}
              className="flex items-center gap-1 px-2 py-1 text-sm text-gray-500 hover:text-gray-700"
            >
              <X className="w-3 h-3" />
              Clear
            </button>
          )}
        </div>
        <div className="ml-auto flex items-center gap-4">
          {maxImages && data && data.total > 0 && (
            <button
              onClick={() => setShowScrapeFilteredModal(true)}
              className="flex items-center gap-2 px-4 py-2 bg-purple-600 text-white rounded-lg hover:bg-purple-700"
            >
              <Play className="w-4 h-4" />
              Scrape All {data.total} Filtered
            </button>
          )}
          <button
            onClick={() => setShowScrapeAllModal(true)}
            className="flex items-center gap-2 px-4 py-2 bg-orange-600 text-white rounded-lg hover:bg-orange-700"
          >
            <Play className="w-4 h-4" />
            Scrape All Without Images
          </button>
          {selectedIds.length > 0 && (
            <div className="flex items-center gap-4">
              <span className="text-sm text-gray-600">
                {selectedIds.length} selected
              </span>
              <button
                onClick={() => setShowScrapeModal(true)}
                className="flex items-center gap-2 px-4 py-2 bg-blue-600 text-white rounded-lg hover:bg-blue-700"
              >
                <Play className="w-4 h-4" />
                Start Scrape
              </button>
            </div>
          )}
        </div>
      </div>
      {/* Table */}
      <div className="bg-white rounded-lg shadow overflow-hidden">
        <table className="w-full">
          <thead className="bg-gray-50">
            <tr>
              <th className="px-4 py-3 text-left">
                <input
                  type="checkbox"
                  checked={(data?.items?.length ?? 0) > 0 && selectedIds.length === (data?.items?.length ?? 0)}
                  onChange={handleSelectAll}
                  className="rounded"
                />
              </th>
              <th className="px-4 py-3 text-left text-sm font-medium text-gray-600">
                Scientific Name
              </th>
              <th className="px-4 py-3 text-left text-sm font-medium text-gray-600">
                Common Name
              </th>
              <th className="px-4 py-3 text-left text-sm font-medium text-gray-600">
                Genus
              </th>
              <th className="px-4 py-3 text-right text-sm font-medium text-gray-600">
                Images
              </th>
              <th className="px-4 py-3 text-right text-sm font-medium text-gray-600">
                Actions
              </th>
            </tr>
          </thead>
          <tbody>
            {isLoading ? (
              <tr>
                <td colSpan={6} className="px-4 py-8 text-center text-gray-400">
                  Loading...
                </td>
              </tr>
            ) : data?.items.length === 0 ? (
              <tr>
                <td colSpan={6} className="px-4 py-8 text-center text-gray-400">
                  No species found. Import a CSV or JSON file to get started.
                </td>
              </tr>
            ) : (
              data?.items.map((species) => (
                <tr
                  key={species.id}
                  className="border-t hover:bg-gray-50 cursor-pointer"
                  onClick={() => setViewSpecies(species)}
                >
                  <td className="px-4 py-3" onClick={(e) => e.stopPropagation()}>
                    <input
                      type="checkbox"
                      checked={selectedIds.includes(species.id)}
                      onChange={() => handleSelect(species.id)}
                      className="rounded"
                    />
                  </td>
                  <td className="px-4 py-3 font-medium">{species.scientific_name}</td>
                  <td className="px-4 py-3 text-gray-600">
                    {species.common_name || '-'}
                  </td>
                  <td className="px-4 py-3 text-gray-600">{species.genus || '-'}</td>
                  <td className="px-4 py-3 text-right">
                    <span
                      className={`inline-block px-2 py-1 rounded text-sm ${
                        species.image_count >= 100
                          ? 'bg-green-100 text-green-700'
                          : species.image_count > 0
                          ? 'bg-yellow-100 text-yellow-700'
                          : 'bg-gray-100 text-gray-600'
                      }`}
                    >
                      {species.image_count}
                    </span>
                  </td>
                  <td className="px-4 py-3 text-right" onClick={(e) => e.stopPropagation()}>
                    <button
                      onClick={() => deleteMutation.mutate(species.id)}
                      className="p-1 text-red-500 hover:bg-red-50 rounded"
                    >
                      <Trash2 className="w-4 h-4" />
                    </button>
                  </td>
                </tr>
              ))
            )}
          </tbody>
        </table>
      </div>
      {/* Pagination */}
      {data && data.pages > 1 && (
        <div className="flex items-center justify-between">
          <span className="text-sm text-gray-600">
            Showing {(page - 1) * 50 + 1} to {Math.min(page * 50, data.total)} of{' '}
            {data.total}
          </span>
          <div className="flex gap-2">
            <button
              onClick={() => setPage((p) => Math.max(1, p - 1))}
              disabled={page === 1}
              className="p-2 rounded border disabled:opacity-50"
            >
              <ChevronLeft className="w-4 h-4" />
            </button>
            <span className="px-4 py-2">
              Page {page} of {data.pages}
            </span>
            <button
              onClick={() => setPage((p) => Math.min(data.pages, p + 1))}
              disabled={page === data.pages}
              className="p-2 rounded border disabled:opacity-50"
            >
              <ChevronRight className="w-4 h-4" />
            </button>
          </div>
        </div>
      )}
      {/* Add Species Modal */}
      {showAddModal && (
        <AddSpeciesModal onClose={() => setShowAddModal(false)} />
      )}
      {/* Scrape Modal */}
      {showScrapeModal && (
        <ScrapeModal
          selectedIds={selectedIds}
          onClose={() => setShowScrapeModal(false)}
          onSubmit={(source) => {
            createJobMutation.mutate({
              name: `Scrape ${selectedIds.length} species from ${source}`,
              source,
              species_ids: selectedIds,
            })
          }}
        />
      )}
      {/* Species Detail Modal */}
      {viewSpecies && (
        <SpeciesDetailModal
          species={viewSpecies}
          onClose={() => setViewSpecies(null)}
        />
      )}
      {/* Scrape All Without Images Modal */}
      {showScrapeAllModal && (
        <ScrapeAllModal
          onClose={() => setShowScrapeAllModal(false)}
        />
      )}
      {/* Scrape All Filtered Modal */}
      {showScrapeFilteredModal && (
        <ScrapeFilteredModal
          maxImages={parseInt(maxImages, 10)}
          speciesCount={data?.total ?? 0}
          onClose={() => setShowScrapeFilteredModal(false)}
        />
      )}
    </div>
  )
 }
 function AddSpeciesModal({ onClose }: { onClose: () => void }) {
  const queryClient = useQueryClient()
  const [form, setForm] = useState({
    scientific_name: '',
    common_name: '',
    genus: '',
    family: '',
  })
  const mutation = useMutation({
    mutationFn: () => speciesApi.create(form),
    onSuccess: () => {
      queryClient.invalidateQueries({ queryKey: ['species'] })
      onClose()
    },
  })
  return (
    <div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50">
      <div className="bg-white rounded-lg p-6 w-full max-w-md">
        <h2 className="text-xl font-bold mb-4">Add Species</h2>
        <div className="space-y-4">
          <div>
            <label className="block text-sm font-medium mb-1">
              Scientific Name *
            </label>
            <input
              type="text"
              value={form.scientific_name}
              onChange={(e) =>
                setForm({ ...form, scientific_name: e.target.value })
              }
              className="w-full px-3 py-2 border rounded-lg"
              placeholder="e.g. Monstera deliciosa"
            />
          </div>
          <div>
            <label className="block text-sm font-medium mb-1">Common Name</label>
            <input
              type="text"
              value={form.common_name}
              onChange={(e) => setForm({ ...form, common_name: e.target.value })}
              className="w-full px-3 py-2 border rounded-lg"
              placeholder="e.g. Swiss Cheese Plant"
            />
          </div>
          <div className="grid grid-cols-2 gap-4">
            <div>
              <label className="block text-sm font-medium mb-1">Genus</label>
              <input
                type="text"
                value={form.genus}
                onChange={(e) => setForm({ ...form, genus: e.target.value })}
                className="w-full px-3 py-2 border rounded-lg"
                placeholder="e.g. Monstera"
              />
            </div>
            <div>
              <label className="block text-sm font-medium mb-1">Family</label>
              <input
                type="text"
                value={form.family}
                onChange={(e) => setForm({ ...form, family: e.target.value })}
                className="w-full px-3 py-2 border rounded-lg"
                placeholder="e.g. Araceae"
              />
            </div>
          </div>
        </div>
        <div className="flex justify-end gap-2 mt-6">
          <button
            onClick={onClose}
            className="px-4 py-2 border rounded-lg hover:bg-gray-50"
          >
            Cancel
          </button>
          <button
            onClick={() => mutation.mutate()}
            disabled={!form.scientific_name.trim() || mutation.isPending}
            className="px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700 disabled:opacity-50"
          >
            Add Species
          </button>
        </div>
      </div>
    </div>
  )
 }
 function ScrapeModal({
  selectedIds,
  onClose,
  onSubmit,
 }: {
  selectedIds: number[]
  onClose: () => void
  onSubmit: (source: string) => void
 }) {
  const [source, setSource] = useState('inaturalist')
  const sources = [
    { value: 'gbif', label: 'GBIF' },
    { value: 'inaturalist', label: 'iNaturalist' },
    { value: 'flickr', label: 'Flickr' },
    { value: 'wikimedia', label: 'Wikimedia Commons' },
    { value: 'trefle', label: 'Trefle.io' },
    { value: 'duckduckgo', label: 'DuckDuckGo' },
    { value: 'bing', label: 'Bing Image Search' },
  ]
  return (
    <div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50">
      <div className="bg-white rounded-lg p-6 w-full max-w-md">
        <h2 className="text-xl font-bold mb-4">Start Scrape Job</h2>
        <p className="text-gray-600 mb-4">
          Scrape images for {selectedIds.length} selected species
        </p>
        <div>
          <label className="block text-sm font-medium mb-2">Select Source</label>
          <div className="space-y-2">
            {sources.map((s) => (
              <label
                key={s.value}
                className={`flex items-center p-3 border rounded-lg cursor-pointer ${
                  source === s.value ? 'border-green-500 bg-green-50' : ''
                }`}
              >
                <input
                  type="radio"
                  value={s.value}
                  checked={source === s.value}
                  onChange={(e) => setSource(e.target.value)}
                  className="mr-3"
                />
                {s.label}
              </label>
            ))}
          </div>
        </div>
        <div className="flex justify-end gap-2 mt-6">
          <button
            onClick={onClose}
            className="px-4 py-2 border rounded-lg hover:bg-gray-50"
          >
            Cancel
          </button>
          <button
            onClick={() => onSubmit(source)}
            className="px-4 py-2 bg-blue-600 text-white rounded-lg hover:bg-blue-700"
          >
            Start Scrape
          </button>
        </div>
      </div>
    </div>
  )
 }
 function SpeciesDetailModal({
  species,
  onClose,
 }: {
  species: SpeciesType
  onClose: () => void
 }) {
  const [page, setPage] = useState(1)
  const pageSize = 20
  const { data, isLoading } = useQuery({
    queryKey: ['species-images', species.id, page],
    queryFn: () =>
      imagesApi.list({
        species_id: species.id,
        status: 'downloaded',
        page,
        page_size: pageSize,
      }).then((res) => res.data),
  })
  return (
    <div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50 p-4">
      <div className="bg-white rounded-lg w-full max-w-5xl max-h-[90vh] flex flex-col">
        {/* Header */}
        <div className="px-6 py-4 border-b flex items-start justify-between">
          <div>
            <h2 className="text-xl font-bold">{species.scientific_name}</h2>
            {species.common_name && (
              <p className="text-gray-600">{species.common_name}</p>
            )}
            <div className="flex gap-4 mt-2 text-sm text-gray-500">
              {species.genus && <span>Genus: {species.genus}</span>}
              {species.family && <span>Family: {species.family}</span>}
              <span>{species.image_count} images</span>
            </div>
          </div>
          <button
            onClick={onClose}
            className="p-2 hover:bg-gray-100 rounded-lg"
          >
            <X className="w-5 h-5" />
          </button>
        </div>
        {/* Images Grid */}
        <div className="flex-1 overflow-y-auto p-6">
          {isLoading ? (
            <div className="flex items-center justify-center h-64">
              <div className="animate-spin rounded-full h-8 w-8 border-b-2 border-green-600"></div>
            </div>
          ) : !data || data.items.length === 0 ? (
            <div className="flex flex-col items-center justify-center h-64 text-gray-400">
              <ImageIcon className="w-12 h-12 mb-4" />
              <p>No images yet</p>
              <p className="text-sm mt-2">
                Start a scrape job to download images for this species
              </p>
            </div>
          ) : (
            <div className="grid grid-cols-2 sm:grid-cols-3 md:grid-cols-4 lg:grid-cols-5 gap-4">
              {data.items.map((image) => (
                <div
                  key={image.id}
                  className="group relative aspect-square bg-gray-100 rounded-lg overflow-hidden"
                >
                  {image.local_path ? (
                    <img
                      src={`/api/images/${image.id}/file`}
                      alt={species.scientific_name}
                      className="w-full h-full object-cover"
                      loading="lazy"
                    />
                  ) : (
                    <div className="w-full h-full flex items-center justify-center text-gray-400">
                      <ImageIcon className="w-8 h-8" />
                    </div>
                  )}
                  {/* Overlay with info */}
                  <div className="absolute inset-0 bg-black/60 opacity-0 group-hover:opacity-100 transition-opacity flex flex-col justify-end p-2">
                    <div className="text-white text-xs">
                      <div className="flex items-center justify-between">
                        <span className="bg-white/20 px-1.5 py-0.5 rounded">
                          {image.source}
                        </span>
                        <span className="bg-white/20 px-1.5 py-0.5 rounded">
                          {image.license}
                        </span>
                      </div>
                      {image.width && image.height && (
                        <div className="mt-1 text-white/70">
                          {image.width} × {image.height}
                        </div>
                      )}
                    </div>
                    {image.url && (
                      <a
                        href={image.url}
                        target="_blank"
                        rel="noopener noreferrer"
                        className="absolute top-2 right-2 p-1 bg-white/20 rounded hover:bg-white/40"
                        onClick={(e) => e.stopPropagation()}
                      >
                        <ExternalLink className="w-4 h-4 text-white" />
                      </a>
                    )}
                  </div>
                </div>
              ))}
            </div>
          )}
        </div>
        {/* Pagination */}
        {data && data.pages > 1 && (
          <div className="px-6 py-4 border-t flex items-center justify-between">
            <span className="text-sm text-gray-600">
              Showing {(page - 1) * pageSize + 1} to{' '}
              {Math.min(page * pageSize, data.total)} of {data.total}
            </span>
            <div className="flex gap-2">
              <button
                onClick={() => setPage((p) => Math.max(1, p - 1))}
                disabled={page === 1}
                className="p-2 rounded border disabled:opacity-50"
              >
                <ChevronLeft className="w-4 h-4" />
              </button>
              <span className="px-4 py-2">
                Page {page} of {data.pages}
              </span>
              <button
                onClick={() => setPage((p) => Math.min(data.pages, p + 1))}
                disabled={page === data.pages}
                className="p-2 rounded border disabled:opacity-50"
              >
                <ChevronRight className="w-4 h-4" />
              </button>
            </div>
          </div>
        )}
      </div>
    </div>
  )
 }
 function ScrapeAllModal({ onClose }: { onClose: () => void }) {
  const [selectedSources, setSelectedSources] = useState<string[]>([])
  const [isSubmitting, setIsSubmitting] = useState(false)
  // Fetch count of species without images
  const { data: speciesData, isLoading } = useQuery({
    queryKey: ['species-no-images'],
    queryFn: () =>
      speciesApi.list({
        page: 1,
        page_size: 1,
        has_images: false,
      }).then((res) => res.data),
  })
  const sources = [
    { value: 'gbif', label: 'GBIF', description: 'Free biodiversity database, no API key needed' },
    { value: 'inaturalist', label: 'iNaturalist', description: 'Research-grade observations with CC licenses' },
    { value: 'wikimedia', label: 'Wikimedia Commons', description: 'Free media repository, requires OAuth' },
    { value: 'flickr', label: 'Flickr', description: 'Requires API key, CC-licensed photos' },
    { value: 'trefle', label: 'Trefle.io', description: 'Plant database, requires API key' },
    { value: 'duckduckgo', label: 'DuckDuckGo', description: 'Web image search, no API key needed' },
    { value: 'bing', label: 'Bing Image Search', description: 'Azure Cognitive Services, requires API key' },
  ]
  const toggleSource = (source: string) => {
    setSelectedSources((prev) =>
      prev.includes(source)
        ? prev.filter((s) => s !== source)
        : [...prev, source]
    )
  }
  const handleSubmit = async () => {
    if (selectedSources.length === 0) return
    setIsSubmitting(true)
    // Count successes so a failure partway through the loop is reported accurately
    let created = 0
    try {
      // Create a job for each selected source, one request at a time
      for (const source of selectedSources) {
        await jobsApi.create({
          name: `Scrape all species without images from ${source}`,
          source,
          only_without_images: true,
        })
        created += 1
      }
      alert(`Created ${created} scrape job(s)!`)
      onClose()
    } catch (error) {
      console.error('Failed to create scrape job', error)
      alert(`Failed to create jobs (${created} of ${selectedSources.length} created)`)
    } finally {
      setIsSubmitting(false)
    }
  }
  const speciesCount = speciesData?.total ?? 0
  return (
    <div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50">
      <div className="bg-white rounded-lg p-6 w-full max-w-lg">
        <h2 className="text-xl font-bold mb-2">Scrape All Species Without Images</h2>
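        {isLoading ? (
          <p className="text-gray-600 mb-4">Loading...</p>
        ) : !speciesData ? (
          // Query finished without data (e.g. request failed): do not claim every species has images
          <p className="text-gray-600 mb-4">Could not load the species count.</p>
        ) : (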
          <p className="text-gray-600 mb-4">
            {speciesCount === 0 ? (
              'All species already have images!'
            ) : (
              <>
                <span className="font-semibold text-orange-600">{speciesCount}</span> species
                don't have any images yet. Select sources to scrape from:
              </>
            )}
          </p>
        )}
        {speciesCount > 0 && (
          <>
            <div className="space-y-2 mb-6">
              {sources.map((s) => (
                <label
                  key={s.value}
                  className={`flex items-start p-3 border rounded-lg cursor-pointer transition-colors ${
                    selectedSources.includes(s.value)
                      ? 'border-orange-500 bg-orange-50'
                      : 'hover:bg-gray-50'
                  }`}
                >
                  <input
                    type="checkbox"
                    checked={selectedSources.includes(s.value)}
                    onChange={() => toggleSource(s.value)}
                    className="mt-1 mr-3 rounded"
                  />
                  <div>
                    <div className="font-medium">{s.label}</div>
                    <div className="text-sm text-gray-500">{s.description}</div>
                  </div>
                </label>
              ))}
            </div>
            {selectedSources.length > 1 && (
              <div className="bg-blue-50 border border-blue-200 rounded-lg p-3 mb-4 text-sm text-blue-700">
                <strong>{selectedSources.length} jobs</strong> will be created and run in parallel,
                one for each selected source.
              </div>
            )}
          </>
        )}
        <div className="flex justify-end gap-2">
          <button
            onClick={onClose}
            className="px-4 py-2 border rounded-lg hover:bg-gray-50"
          >
            Cancel
          </button>
          {speciesCount > 0 && (
            <button
              onClick={handleSubmit}
              disabled={selectedSources.length === 0 || isSubmitting}
              className="px-4 py-2 bg-orange-600 text-white rounded-lg hover:bg-orange-700 disabled:opacity-50"
            >
              {isSubmitting
                ? 'Creating Jobs...'
                : `Start ${selectedSources.length || ''} Scrape Job${selectedSources.length !== 1 ? 's' : ''}`}
            </button>
          )}
        </div>
      </div>
    </div>
  )
 }
 function ScrapeFilteredModal({
  maxImages,
  speciesCount,
  onClose,
 }: {
  maxImages: number
  speciesCount: number
  onClose: () => void
 }) {
  const [selectedSources, setSelectedSources] = useState<string[]>([])
  const [isSubmitting, setIsSubmitting] = useState(false)
  const sources = [
    { value: 'gbif', label: 'GBIF', description: 'Free biodiversity database, no API key needed' },
    { value: 'inaturalist', label: 'iNaturalist', description: 'Research-grade observations with CC licenses' },
    { value: 'wikimedia', label: 'Wikimedia Commons', description: 'Free media repository, requires OAuth' },
    { value: 'flickr', label: 'Flickr', description: 'Requires API key, CC-licensed photos' },
    { value: 'trefle', label: 'Trefle.io', description: 'Plant database, requires API key' },
    { value: 'duckduckgo', label: 'DuckDuckGo', description: 'Web image search, no API key needed' },
    { value: 'bing', label: 'Bing Image Search', description: 'Azure Cognitive Services, requires API key' },
  ]
  const toggleSource = (source: string) => {
    setSelectedSources((prev) =>
      prev.includes(source)
        ? prev.filter((s) => s !== source)
        : [...prev, source]
    )
  }
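  const handleSubmit = async () => {
    if (selectedSources.length === 0) return
    setIsSubmitting(true)
    // Count successes so a failure partway through the loop is reported accurately
    let created = 0
    try {
      // Create a job for each selected source, one request at a time
      for (const source of selectedSources) {
        await jobsApi.create({
          name: `Scrape species with <${maxImages} images from ${source}`,
          source,
          max_images: maxImages,
        })
        created += 1
      }
      alert(`Created ${created} scrape job(s)!`)
      onClose()
    } catch (error) {
      console.error('Failed to create scrape job', error)
      alert(`Failed to create jobs (${created} of ${selectedSources.length} created)`)
    } finally {
      setIsSubmitting(false)
    }
  }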
  return (
    <div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50">
      <div className="bg-white rounded-lg p-6 w-full max-w-lg">
        <h2 className="text-xl font-bold mb-2">Scrape All Filtered Species</h2>
        <p className="text-gray-600 mb-4">
          <span className="font-semibold text-purple-600">{speciesCount}</span> species
          have fewer than <span className="font-semibold">{maxImages}</span> images.
          Select sources to scrape from:
        </p>
        <div className="space-y-2 mb-6">
          {sources.map((s) => (
            <label
              key={s.value}
              className={`flex items-start p-3 border rounded-lg cursor-pointer transition-colors ${
                selectedSources.includes(s.value)
                  ? 'border-purple-500 bg-purple-50'
                  : 'hover:bg-gray-50'
              }`}
            >
              <input
                type="checkbox"
                checked={selectedSources.includes(s.value)}
                onChange={() => toggleSource(s.value)}
                className="mt-1 mr-3 rounded"
              />
              <div>
                <div className="font-medium">{s.label}</div>
                <div className="text-sm text-gray-500">{s.description}</div>
              </div>
            </label>
          ))}
        </div>
        {selectedSources.length > 1 && (
          <div className="bg-blue-50 border border-blue-200 rounded-lg p-3 mb-4 text-sm text-blue-700">
            <strong>{selectedSources.length} jobs</strong> will be created and run in parallel,
            one for each selected source.
          </div>
        )}
        <div className="flex justify-end gap-2">
          <button
            onClick={onClose}
            className="px-4 py-2 border rounded-lg hover:bg-gray-50"
          >
            Cancel
          </button>
          <button
            onClick={handleSubmit}
            disabled={selectedSources.length === 0 || isSubmitting}
            className="px-4 py-2 bg-purple-600 text-white rounded-lg hover:bg-purple-700 disabled:opacity-50"
          >
            {isSubmitting
              ? 'Creating Jobs...'
              : `Start ${selectedSources.length || ''} Scrape Job${selectedSources.length !== 1 ? 's' : ''}`}
          </button>
        </div>
      </div>
    </div>
  )
 }
@@ -0,0 +1,9 @@
 /// <reference types="vite/client" />
 interface ImportMetaEnv {
  readonly VITE_API_URL: string
 }
 interface ImportMeta {
  readonly env: ImportMetaEnv
 }
@@ -0,0 +1,11 @@
 /** @type {import('tailwindcss').Config} */
 export default {
  content: [
    "./index.html",
    "./src/**/*.{js,ts,jsx,tsx}",
  ],
  theme: {
    extend: {},
  },
  plugins: [],
 }
@@ -0,0 +1,21 @@
 {
  "compilerOptions": {
    "target": "ES2020",
    "useDefineForClassFields": true,
    "lib": ["ES2020", "DOM", "DOM.Iterable"],
    "module": "ESNext",
    "skipLibCheck": true,
    "moduleResolution": "bundler",
    "allowImportingTsExtensions": true,
    "resolveJsonModule": true,
    "isolatedModules": true,
    "noEmit": true,
    "jsx": "react-jsx",
    "strict": true,
    "noUnusedLocals": true,
    "noUnusedParameters": true,
    "noFallthroughCasesInSwitch": true
  },
  "include": ["src"],
  "references": [{ "path": "./tsconfig.node.json" }]
 }
@@ -0,0 +1,10 @@
 {
  "compilerOptions": {
    "composite": true,
    "skipLibCheck": true,
    "module": "ESNext",
    "moduleResolution": "bundler",
    "allowSyntheticDefaultImports": true
  },
  "include": ["vite.config.ts"]
 }
@@ -0,0 +1,18 @@
 import { defineConfig } from 'vite'
 import react from '@vitejs/plugin-react'
 export default defineConfig({
  plugins: [react()],
  server: {
    port: 3000,
    host: true,
    proxy: {
      '/api': {
        target: 'http://backend:8000',
        changeOrigin: true,
      },
    },
    // Disable HMR - not useful in Docker deployments
    hmr: false,
  },
 })
@@ -0,0 +1,58 @@
 events {
    worker_connections 1024;
 }
 http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    upstream backend {
        server backend:8000;
    }
    upstream frontend {
        server frontend:3000;
    }
    server {
        listen 80;
        server_name localhost;
        # API routes
        location /api {
            proxy_pass http://backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            # Explicit proxy timeouts (60s matches the nginx defaults; raise these for slow API calls)
            proxy_connect_timeout 60s;
            proxy_send_timeout 60s;
            proxy_read_timeout 60s;
        }
        # Health check
        location /health {
            proxy_pass http://backend;
        }
        # WebSocket proxy for frontend hot reload (inactive while vite.config.ts sets hmr: false)
        location /ws {
            proxy_pass http://frontend;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
        }
        # Frontend
        location / {
            proxy_pass http://frontend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
        }
    }
 }