Initial commit — PlantGuideScraper project
20
.env.example
Normal file
@@ -0,0 +1,20 @@
# Database
DATABASE_URL=sqlite:////data/db/plants.sqlite

# Redis
REDIS_URL=redis://redis:6379/0

# Storage paths
IMAGES_PATH=/data/images
EXPORTS_PATH=/data/exports

# API Keys (user-provided)
FLICKR_API_KEY=
FLICKR_API_SECRET=
INATURALIST_APP_ID=
INATURALIST_APP_SECRET=
TREFLE_API_KEY=

# Optional settings
LOG_LEVEL=INFO
CELERY_CONCURRENCY=4
39
.gitignore
vendored
Normal file
@@ -0,0 +1,39 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
venv/
.venv/
ENV/
env/
.eggs/
*.egg-info/
*.egg

# Node
node_modules/
npm-debug.log
yarn-error.log

# IDE
.idea/
.vscode/
*.swp
*.swo
*~

# OS
.DS_Store
Thumbs.db

# Project specific
data/
*.sqlite
*.db
.env
*.zip

# Docker
docker-compose.override.yml
209
README.md
Normal file
@@ -0,0 +1,209 @@
# PlantGuideScraper

Web-based interface for managing a multi-source houseplant image scraping pipeline. Collects images from iNaturalist, Flickr, Wikimedia Commons, and Trefle.io to build datasets for CoreML training.

## Features

- **Species Management**: Import species lists via CSV or JSON, search and filter by genus or image status
- **Multi-Source Scraping**: iNaturalist/GBIF, Flickr, Wikimedia Commons, Trefle.io
- **Image Quality Pipeline**: Automatic deduplication, blur detection, resizing
- **License Filtering**: Only collect commercially-safe CC0/CC-BY licensed images
- **Export for CoreML**: Train/test split, Create ML-compatible folder structure
- **Real-time Dashboard**: Progress tracking, statistics, job monitoring

## Quick Start

```bash
# Clone and start
cd PlantGuideScraper
docker-compose up --build

# Access the UI
open http://localhost
```

## Unraid Deployment

### Setup

1. Copy the project to your Unraid server:
```bash
scp -r PlantGuideScraper root@YOUR_UNRAID_IP:/mnt/user/appdata/PlantGuideScraper
```

2. SSH into Unraid and create the data directories:
```bash
ssh root@YOUR_UNRAID_IP
mkdir -p /mnt/user/appdata/PlantGuideScraper/{database,images,exports,redis}
```

3. Install **Docker Compose Manager** from Community Applications

4. In Unraid: **Docker → Compose → Add New Stack**
   - Path: `/mnt/user/appdata/PlantGuideScraper/docker-compose.unraid.yml`
   - Click **Compose Up**

5. Access at `http://YOUR_UNRAID_IP:8580`

### Configurable Paths

Edit `docker-compose.unraid.yml` to customize where data is stored. Look for these lines in both the `backend` and `celery` services:

```yaml
# === CONFIGURABLE DATA PATHS ===
- /mnt/user/appdata/PlantGuideScraper/database:/data/db      # DATABASE_PATH
- /mnt/user/appdata/PlantGuideScraper/images:/data/images    # IMAGES_PATH
- /mnt/user/appdata/PlantGuideScraper/exports:/data/exports  # EXPORTS_PATH
```

| Path | Description | Default |
|------|-------------|---------|
| DATABASE_PATH | SQLite database file | `/mnt/user/appdata/PlantGuideScraper/database` |
| IMAGES_PATH | Downloaded images (can be 100GB+) | `/mnt/user/appdata/PlantGuideScraper/images` |
| EXPORTS_PATH | Generated export zip files | `/mnt/user/appdata/PlantGuideScraper/exports` |

**Example: Store images on a separate share:**
```yaml
- /mnt/user/data/PlantImages:/data/images  # IMAGES_PATH
```

**Important:** Keep paths identical in both the `backend` and `celery` services.

## Configuration

1. Configure API keys in Settings:
   - **Flickr**: Get a key at https://www.flickr.com/services/api/
   - **Trefle**: Get a key at https://trefle.io/
   - iNaturalist and Wikimedia Commons don't require keys

2. Import a species list (see Import Documentation below)

3. Select species and start scraping

## Import Documentation

### CSV Import

Import species from a CSV file with the following columns:

| Column | Required | Description |
|--------|----------|-------------|
| `scientific_name` | Yes | Binomial name (e.g., "Monstera deliciosa") |
| `common_name` | No | Common name (e.g., "Swiss Cheese Plant") |
| `genus` | No | Auto-extracted from scientific_name if not provided |
| `family` | No | Plant family (e.g., "Araceae") |

**Example CSV:**
```csv
scientific_name,common_name,genus,family
Monstera deliciosa,Swiss Cheese Plant,Monstera,Araceae
Philodendron hederaceum,Heartleaf Philodendron,Philodendron,Araceae
Epipremnum aureum,Golden Pothos,Epipremnum,Araceae
```

### JSON Import

Import species from a JSON file with the following structure:

```json
{
  "plants": [
    {
      "scientific_name": "Monstera deliciosa",
      "common_names": ["Swiss Cheese Plant", "Split-leaf Philodendron"],
      "family": "Araceae"
    },
    {
      "scientific_name": "Philodendron hederaceum",
      "common_names": ["Heartleaf Philodendron"],
      "family": "Araceae"
    }
  ]
}
```

| Field | Required | Description |
|-------|----------|-------------|
| `scientific_name` | Yes | Binomial name |
| `common_names` | No | Array of common names (the first one is used) |
| `family` | No | Plant family |

**Notes:**
- Genus is automatically extracted from the first word of `scientific_name`
- Duplicate species (by scientific_name) are skipped
- The included `houseplants_list.json` contains 2,278 houseplant species
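The genus-extraction and duplicate-skip rules above can be sketched as follows. This is a minimal in-memory illustration only; `import_species` is a hypothetical helper name, and the real import writes to the database rather than a `seen` set:

```python
def extract_genus(scientific_name: str) -> str:
    """Genus is the first word of the binomial name."""
    return scientific_name.strip().split()[0]

def import_species(records: list[dict]) -> dict:
    """Skip duplicates by scientific_name; auto-fill genus when missing."""
    seen: set[str] = set()
    result = {"imported": 0, "skipped": 0, "errors": []}
    for rec in records:
        name = rec.get("scientific_name", "").strip()
        if not name:
            result["errors"].append("missing scientific_name")
            continue
        if name in seen:
            result["skipped"] += 1
            continue
        seen.add(name)
        rec.setdefault("genus", extract_genus(name))
        result["imported"] += 1
    return result
```

The returned counts mirror the `imported`/`skipped`/`errors` shape of the API response shown below under API Endpoints.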

### API Endpoints

```bash
# Import CSV
curl -X POST http://localhost/api/species/import \
  -F "file=@species.csv"

# Import JSON
curl -X POST http://localhost/api/species/import-json \
  -F "file=@plants.json"
```

**Response:**
```json
{
  "imported": 150,
  "skipped": 5,
  "errors": []
}
```

## Architecture

```
┌─────────────┐     ┌─────────────────┐     ┌─────────────┐
│   React     │────▶│    FastAPI      │────▶│   Celery    │
│  Frontend   │     │    Backend      │     │   Workers   │
└─────────────┘     └─────────────────┘     └─────────────┘
                             │                      │
                             ▼                      ▼
                    ┌─────────────┐         ┌─────────────┐
                    │   SQLite    │         │    Redis    │
                    │  Database   │         │    Queue    │
                    └─────────────┘         └─────────────┘
```

## Export Format

Exports are Create ML-compatible:

```
export.zip/
├── Training/
│   ├── Monstera_deliciosa/
│   │   ├── img_00001.jpg
│   │   └── ...
│   └── ...
└── Testing/
    ├── Monstera_deliciosa/
    └── ...
```
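A minimal sketch of how such a Training/Testing split could be produced per species folder (illustrative only; the fixed seed and the `split_images` helper are assumptions, not the scraper's actual export code):

```python
import random

def split_images(image_names: list[str], train_split: float = 0.8,
                 seed: int = 42) -> tuple[list[str], list[str]]:
    """Shuffle deterministically, then cut into Training/Testing sets."""
    names = sorted(image_names)            # stable order before shuffling
    random.Random(seed).shuffle(names)     # reproducible across runs
    cut = int(len(names) * train_split)
    return names[:cut], names[cut:]
```

Splitting each species folder independently keeps every class represented in both Training/ and Testing/.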

## Data Storage

All data is stored in the `./data` directory:

```
data/
├── db/
│   └── plants.sqlite      # SQLite database
├── images/                # Downloaded images
│   └── {species_id}/
│       └── {image_id}.jpg
└── exports/               # Generated export archives
    └── {export_id}.zip
```

## API Documentation

Full API docs are available at http://localhost/api/docs

## License

MIT
231
accum_images.md
Normal file
@@ -0,0 +1,231 @@
# Houseplant Image Dataset Accumulation Plan

## Overview

Build a custom CoreML model for houseplant identification by accumulating a large dataset of houseplant images with proper licensing for commercial use.

---

## Requirements Summary

| Parameter | Value |
|-----------|-------|
| Target species | 5,000-10,000 (realistic houseplant ceiling) |
| Images per species | 200-500 (recommended) |
| Total images | ~1-5 million |
| Budget | Free preferred, paid as reference |
| Compute | M1 Max Mac (training) + Unraid server (data pipeline) |
| Curation | Automated pipeline |
| Timeline | Weeks to months |
| Licensing | Must allow training + commercial model distribution |

---

## Hardware Assessment

| Machine | Role | Capability |
|---------|------|------------|
| M1 Max Mac | **Training** | Create ML can train 5-10K class models; 32+ GB unified memory is ideal |
| Unraid Server | **Data pipeline** | Scraping, downloading, preprocessing, storage |

The M1 Max is legitimately viable for this task via Create ML or PyTorch+MPS. No cloud GPU is required.

---

## Data Sources Analysis

### Tier 1: Primary Sources (Recommended)

| Source | License | Commercial-Safe | Volume | Houseplant Coverage | Access Method |
|--------|---------|-----------------|--------|---------------------|---------------|
| **iNaturalist via GBIF** | CC-BY, CC0 (filter) | Yes (filtered) | 100M+ observations | Good (has "captive/cultivated" flag) | Bulk export + API |
| **Flickr** | CC-BY, CC0 (filter) | Yes (filtered) | Millions | Moderate | API |
| **Wikimedia Commons** | CC-BY, CC-BY-SA, Public Domain | Mostly | Thousands | Moderate | API |

### Tier 2: Supplemental Sources

| Source | License | Commercial-Safe | Notes |
|--------|---------|-----------------|-------|
| **USDA PLANTS** | Public Domain | Yes | US-focused, limited images |
| **Encyclopedia of Life** | Mixed | Check each | Aggregator, good metadata |
| **Pl@ntNet-300K Dataset** | CC-BY-SA | Share-alike | Good for research/prototyping only |

### Tier 3: Paid Options (Reference)

| Source | Estimated Cost | Notes |
|--------|----------------|-------|
| iNaturalist AWS Open Data | Free | Bulk image export; S3 transfer costs apply |
| Custom scraping infrastructure | $50-200/mo | Proxies, storage, bandwidth |
| Commercial botanical databases | $1000s+ | Getty, Alamy — not recommended |

---

## Licensing Decision Matrix

```
Want commercial model distribution?
├─ YES → Use ONLY: CC0, CC-BY, Public Domain
│        Filter OUT: CC-BY-NC, CC-BY-SA, All Rights Reserved
│
└─ NO (research only) → Can use CC-BY-NC, CC-BY-SA
                        Pl@ntNet-300K dataset becomes viable
```

**Recommendation**: Filter for commercial-safe licenses from day 1. This avoids re-scraping later.
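That filter can be as simple as an allowlist check applied at download time. A sketch, with the caveat that the normalized license codes shown are assumptions about how each API reports licenses:

```python
# Licenses that permit training + commercial model distribution
COMMERCIAL_SAFE = {"cc0", "cc-by", "pd"}  # assumed normalized codes

def is_commercial_safe(license_code: str) -> bool:
    """Allowlist, not blocklist: unknown or empty licenses are rejected too."""
    return license_code.strip().lower() in COMMERCIAL_SAFE
```

Using an allowlist means CC-BY-NC, CC-BY-SA, All Rights Reserved, and any license string the pipeline has never seen all fail closed.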

---

## Houseplant Species Taxonomy

**Problem**: No canonical "houseplant" species list exists. One must be constructed.

**Approach**:
1. Start with Wikipedia's "List of houseplants" (~500 species)
2. Expand via genus crawl (all Philodendron, all Hoya, etc.)
3. Cross-reference with RHS, ASPCA, and nursery catalogs
4. Target: **1,000-3,000 species** is realistic for a quality dataset

**Key Genera** (prioritize these — cover 80% of common houseplants):
```
Philodendron, Monstera, Pothos/Epipremnum, Ficus, Dracaena,
Sansevieria, Calathea, Maranta, Alocasia, Anthurium,
Peperomia, Hoya, Begonia, Tradescantia, Pilea,
Aglaonema, Dieffenbachia, Spathiphyllum, Zamioculcas, Crassula
```
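One way to apply that prioritization when ordering scrape jobs (a sketch; the `prioritize` helper is an assumption, and splitting dual names like Pothos/Epipremnum on "/" is a convention invented here):

```python
KEY_GENERA_RAW = (
    "Philodendron, Monstera, Pothos/Epipremnum, Ficus, Dracaena, "
    "Sansevieria, Calathea, Maranta, Alocasia, Anthurium, "
    "Peperomia, Hoya, Begonia, Tradescantia, Pilea, "
    "Aglaonema, Dieffenbachia, Spathiphyllum, Zamioculcas, Crassula"
)
# Treat "Pothos/Epipremnum" as two genus names
KEY_GENERA = {g.strip() for part in KEY_GENERA_RAW.split(",") for g in part.split("/")}

def prioritize(species: list[str]) -> list[str]:
    """Key-genus species first; alphabetical within each group."""
    return sorted(species, key=lambda s: (s.split()[0] not in KEY_GENERA, s))
```

Scraping key genera first front-loads the 80% of common houseplants, so an early model checkpoint is already useful.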

---

## Data Quality Requirements

| Parameter | Minimum | Target | Rationale |
|-----------|---------|--------|-----------|
| Images per species | 100 | 300-500 | Below 100 = unreliable classification |
| Resolution | 256x256 | 512x512+ | Downsample to 224x224 for training |
| Variety | Single angle | Multi-angle, growth stages, lighting | Generalization |
| Label accuracy | 80% | 95%+ | iNaturalist "Research Grade" = community verified |
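The minimum-resolution and blur checks implied by the table can be sketched in pure Python (variance-of-Laplacian on a grayscale pixel grid; the 100.0 threshold is an assumption to be tuned, and a real pipeline would use OpenCV on full images):

```python
def sharp_enough(pixels: list[list[float]], threshold: float = 100.0) -> bool:
    """Variance of the Laplacian: low variance = few edges = likely blurry."""
    h, w = len(pixels), len(pixels[0])
    lap = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # 4-neighbor discrete Laplacian at (y, x)
            lap.append(pixels[y - 1][x] + pixels[y + 1][x]
                       + pixels[y][x - 1] + pixels[y][x + 1]
                       - 4 * pixels[y][x])
    mean = sum(lap) / len(lap)
    return sum((v - mean) ** 2 for v in lap) / len(lap) >= threshold

def meets_resolution(width: int, height: int, minimum: int = 256) -> bool:
    return width >= minimum and height >= minimum
```

A high-contrast image yields a high Laplacian variance; a flat or defocused one yields a variance near zero and is rejected.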

---

## Training Approach Options

### Option A: Create ML (Recommended)

| Pros | Cons |
|------|------|
| Native Apple Silicon optimization | Limited hyperparameter control |
| Outputs CoreML directly | Max ~10K classes practical limit |
| No Python/ML expertise needed | Less flexible augmentation |
| Fast iteration | |

**Best for**: This use case exactly.

### Option B: PyTorch + MPS Transfer Learning

| Pros | Cons |
|------|------|
| Full control over architecture | Steeper learning curve |
| State-of-the-art augmentation (albumentations) | Manual CoreML conversion |
| Can use EfficientNet, ConvNeXt, etc. | Slower iteration |

**Best for**: If Create ML hits limits or you need a custom architecture.

### Option C: Cloud GPU (Google Colab / AWS Spot)

| Pros | Cons |
|------|------|
| Faster training for large models | Cost |
| No local resource constraints | Network transfer overhead |

**Best for**: If the dataset exceeds M1 Max memory or you want transformer-based vision models.

**Recommendation**: Start with Create ML. Pivot to Option B only if needed.

---

## Pipeline Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                          UNRAID SERVER                          │
├─────────────────────────────────────────────────────────────────┤
│  1. Species List Generator                                      │
│     └─ Scrape Wikipedia, RHS, expand by genus                   │
│                                                                 │
│  2. Image Downloader                                            │
│     ├─ iNaturalist/GBIF bulk export (primary)                   │
│     ├─ Flickr API (supplemental)                                │
│     └─ License filter (CC-BY, CC0 only)                         │
│                                                                 │
│  3. Preprocessing Pipeline                                      │
│     ├─ Resize to 512x512                                        │
│     ├─ Remove duplicates (perceptual hash)                      │
│     ├─ Remove low-quality (blur detection, size filter)         │
│     └─ Organize: /species_name/image_001.jpg                    │
│                                                                 │
│  4. Dataset Statistics                                          │
│     └─ Report per-species counts, flag under-represented        │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼  (rsync/SMB)
┌─────────────────────────────────────────────────────────────────┐
│                           M1 MAX MAC                            │
├─────────────────────────────────────────────────────────────────┤
│  5. Create ML Training                                          │
│     ├─ Import dataset folder                                    │
│     ├─ Train image classifier                                   │
│     └─ Export .mlmodel                                          │
│                                                                 │
│  6. Validation                                                  │
│     ├─ Test on held-out images                                  │
│     └─ Test on real-world photos (your phone)                   │
│                                                                 │
│  7. Integration                                                 │
│     └─ Replace PlantNet-300K in PlantGuide                      │
└─────────────────────────────────────────────────────────────────┘
```
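The perceptual-hash dedup in step 3 can be illustrated with a pure-Python difference hash (dHash). This is a sketch of the idea on a pre-resized grayscale grid, not the pipeline's implementation, which would typically use the `imagehash` library on full images:

```python
def dhash(gray: list[list[int]]) -> int:
    """Difference hash: one bit per horizontal-neighbor comparison.
    `gray` is an N x (N+1) grid of grayscale values, e.g. 8x9 after resize."""
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left < right else 0)
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def is_duplicate(h1: int, h2: int, max_distance: int = 5) -> bool:
    """Near-identical images differ in only a few hash bits."""
    return hamming(h1, h2) <= max_distance
```

Because dHash encodes relative brightness gradients, a re-encoded or uniformly brightened copy hashes identically, while a different photo lands many bits away.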

---

## Timeline

| Phase | Duration | Output |
|-------|----------|--------|
| 1. Species list curation | 1 week | 1,000-3,000 target species with scientific + common names |
| 2. Pipeline development | 1-2 weeks | Automated scraper on Unraid |
| 3. Data collection | 2-4 weeks | Running 24/7, rate-limited by APIs |
| 4. Preprocessing + QA | 1 week | Clean dataset, statistics report |
| 5. Initial training | 2-3 days | First model with a subset (500 species) |
| 6. Full training | 1 week | Full model, iteration |
| 7. Validation + tuning | 1 week | Production-ready model |

**Total: 6-10 weeks**

---

## Risk Analysis

| Risk | Likelihood | Mitigation |
|------|------------|------------|
| Insufficient images for rare species | High | Accept lower coverage OR merge rare species to genus level |
| API rate limits slow collection | High | Parallelize sources, use bulk exports, patience |
| Noisy labels degrade accuracy | Medium | Use only "Research Grade" iNaturalist observations; implement confidence thresholds |
| Create ML memory limits | Low | M1 Max should handle it; fall back to PyTorch |
| License ambiguity | Low | Strict filter on download; keep metadata |

---

## Next Steps

1. **Build species master list** — Python script to scrape/merge sources
2. **Set up GBIF bulk download** — Filter: Plantae, captive/cultivated, CC-BY/CC0, has images
3. **Build Flickr supplemental scraper** — Target under-represented species
4. **Docker container on Unraid** — Orchestrate pipeline
5. **Create ML project setup** — Folder structure, initial test with 50 species

---

## Open Questions

- Prioritize **speed** (start with 500 species, fast iteration) or **completeness** (build the full 3K species list first)?
- Any specific houseplant species that must be included?
- Is Docker already running on Unraid?
24
backend/Dockerfile
Normal file
@@ -0,0 +1,24 @@
FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    libffi-dev \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Create data directories
RUN mkdir -p /data/db /data/images /data/exports

EXPOSE 8000

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
19
backend/add_indexes.py
Normal file
@@ -0,0 +1,19 @@
#!/usr/bin/env python
"""Add missing database indexes."""
from sqlalchemy import text
from app.database import engine

with engine.connect() as conn:
    # Single-column indexes
    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_license ON images(license)'))
    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_status ON images(status)'))
    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_source ON images(source)'))
    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_species_id ON images(species_id)'))
    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_phash ON images(phash)'))

    # Composite indexes for common query patterns
    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_species_status ON images(species_id, status)'))
    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_status_created ON images(status, created_at)'))

    conn.commit()
    print('All indexes created successfully')
42
backend/alembic.ini
Normal file
@@ -0,0 +1,42 @@
[alembic]
script_location = alembic
prepend_sys_path = .
version_path_separator = os

sqlalchemy.url = sqlite:////data/db/plants.sqlite

[post_write_hooks]

[loggers]
keys = root,sqlalchemy,alembic

[handlers]
keys = console

[formatters]
keys = generic

[logger_root]
level = WARN
handlers = console
qualname =

[logger_sqlalchemy]
level = WARN
handlers =
qualname = sqlalchemy.engine

[logger_alembic]
level = INFO
handlers =
qualname = alembic

[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = NOTSET
formatter = generic

[formatter_generic]
format = %(levelname)-5.5s [%(name)s] %(message)s
datefmt = %H:%M:%S
54
backend/alembic/env.py
Normal file
@@ -0,0 +1,54 @@
from logging.config import fileConfig

from sqlalchemy import engine_from_config
from sqlalchemy import pool

from alembic import context

# Import models for autogenerate
from app.database import Base
from app.models import Species, Image, Job, ApiKey, Export

config = context.config

if config.config_file_name is not None:
    fileConfig(config.config_file_name)

target_metadata = Base.metadata


def run_migrations_offline() -> None:
    """Run migrations in 'offline' mode."""
    url = config.get_main_option("sqlalchemy.url")
    context.configure(
        url=url,
        target_metadata=target_metadata,
        literal_binds=True,
        dialect_opts={"paramstyle": "named"},
    )

    with context.begin_transaction():
        context.run_migrations()


def run_migrations_online() -> None:
    """Run migrations in 'online' mode."""
    connectable = engine_from_config(
        config.get_section(config.config_ini_section, {}),
        prefix="sqlalchemy.",
        poolclass=pool.NullPool,
    )

    with connectable.connect() as connection:
        context.configure(
            connection=connection, target_metadata=target_metadata
        )

        with context.begin_transaction():
            context.run_migrations()


if context.is_offline_mode():
    run_migrations_offline()
else:
    run_migrations_online()
26
backend/alembic/script.py.mako
Normal file
@@ -0,0 +1,26 @@
"""${message}

Revision ID: ${up_revision}
Revises: ${down_revision | comma,n}
Create Date: ${create_date}

"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa
${imports if imports else ""}

# revision identifiers, used by Alembic.
revision: str = ${repr(up_revision)}
down_revision: Union[str, None] = ${repr(down_revision)}
branch_labels: Union[str, Sequence[str], None] = ${repr(branch_labels)}
depends_on: Union[str, Sequence[str], None] = ${repr(depends_on)}


def upgrade() -> None:
    ${upgrades if upgrades else "pass"}


def downgrade() -> None:
    ${downgrades if downgrades else "pass"}
112
backend/alembic/versions/001_initial.py
Normal file
@@ -0,0 +1,112 @@
"""Initial migration

Revision ID: 001
Revises:
Create Date: 2024-01-01

"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa

revision: str = '001'
down_revision: Union[str, None] = None
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


def upgrade() -> None:
    # Species table
    op.create_table(
        'species',
        sa.Column('id', sa.Integer(), primary_key=True),
        sa.Column('scientific_name', sa.String(), nullable=False, unique=True),
        sa.Column('common_name', sa.String(), nullable=True),
        sa.Column('genus', sa.String(), nullable=True),
        sa.Column('family', sa.String(), nullable=True),
        sa.Column('created_at', sa.DateTime(), server_default=sa.func.now()),
    )
    op.create_index('ix_species_scientific_name', 'species', ['scientific_name'])
    op.create_index('ix_species_genus', 'species', ['genus'])

    # API Keys table
    op.create_table(
        'api_keys',
        sa.Column('id', sa.Integer(), primary_key=True),
        sa.Column('source', sa.String(), nullable=False, unique=True),
        sa.Column('api_key', sa.String(), nullable=False),
        sa.Column('api_secret', sa.String(), nullable=True),
        sa.Column('rate_limit_per_sec', sa.Float(), default=1.0),
        sa.Column('enabled', sa.Boolean(), default=True),
    )

    # Images table
    op.create_table(
        'images',
        sa.Column('id', sa.Integer(), primary_key=True),
        sa.Column('species_id', sa.Integer(), sa.ForeignKey('species.id'), nullable=False),
        sa.Column('source', sa.String(), nullable=False),
        sa.Column('source_id', sa.String(), nullable=True),
        sa.Column('url', sa.String(), nullable=False),
        sa.Column('local_path', sa.String(), nullable=True),
        sa.Column('license', sa.String(), nullable=False),
        sa.Column('attribution', sa.String(), nullable=True),
        sa.Column('width', sa.Integer(), nullable=True),
        sa.Column('height', sa.Integer(), nullable=True),
        sa.Column('phash', sa.String(), nullable=True),
        sa.Column('quality_score', sa.Float(), nullable=True),
        sa.Column('status', sa.String(), default='pending'),
        sa.Column('created_at', sa.DateTime(), server_default=sa.func.now()),
    )
    op.create_index('ix_images_species_id', 'images', ['species_id'])
    op.create_index('ix_images_source', 'images', ['source'])
    op.create_index('ix_images_status', 'images', ['status'])
    op.create_index('ix_images_phash', 'images', ['phash'])
    op.create_unique_constraint('uq_source_source_id', 'images', ['source', 'source_id'])

    # Jobs table
    op.create_table(
        'jobs',
        sa.Column('id', sa.Integer(), primary_key=True),
        sa.Column('name', sa.String(), nullable=False),
        sa.Column('source', sa.String(), nullable=False),
        sa.Column('species_filter', sa.Text(), nullable=True),
        sa.Column('status', sa.String(), default='pending'),
        sa.Column('progress_current', sa.Integer(), default=0),
        sa.Column('progress_total', sa.Integer(), default=0),
        sa.Column('images_downloaded', sa.Integer(), default=0),
        sa.Column('images_rejected', sa.Integer(), default=0),
        sa.Column('celery_task_id', sa.String(), nullable=True),
        sa.Column('started_at', sa.DateTime(), nullable=True),
        sa.Column('completed_at', sa.DateTime(), nullable=True),
        sa.Column('error_message', sa.Text(), nullable=True),
        sa.Column('created_at', sa.DateTime(), server_default=sa.func.now()),
    )
    op.create_index('ix_jobs_status', 'jobs', ['status'])

    # Exports table
    op.create_table(
        'exports',
        sa.Column('id', sa.Integer(), primary_key=True),
        sa.Column('name', sa.String(), nullable=False),
        sa.Column('filter_criteria', sa.Text(), nullable=True),
        sa.Column('train_split', sa.Float(), default=0.8),
        sa.Column('status', sa.String(), default='pending'),
        sa.Column('file_path', sa.String(), nullable=True),
        sa.Column('file_size', sa.Integer(), nullable=True),
        sa.Column('species_count', sa.Integer(), nullable=True),
        sa.Column('image_count', sa.Integer(), nullable=True),
        sa.Column('celery_task_id', sa.String(), nullable=True),
        sa.Column('created_at', sa.DateTime(), server_default=sa.func.now()),
        sa.Column('completed_at', sa.DateTime(), nullable=True),
        sa.Column('error_message', sa.Text(), nullable=True),
    )


def downgrade() -> None:
    op.drop_table('exports')
    op.drop_table('jobs')
    op.drop_table('images')
    op.drop_table('api_keys')
    op.drop_table('species')
53
backend/alembic/versions/002_add_cached_stats_and_indexes.py
Normal file
@@ -0,0 +1,53 @@
"""Add cached_stats table and license index

Revision ID: 002
Revises: 001
Create Date: 2025-01-25

"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa

revision: str = '002'
down_revision: Union[str, None] = '001'
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


def upgrade() -> None:
    # Cached stats table for pre-calculated dashboard statistics
    op.create_table(
        'cached_stats',
        sa.Column('id', sa.Integer(), primary_key=True),
        sa.Column('key', sa.String(50), nullable=False, unique=True),
        sa.Column('value', sa.Text(), nullable=False),
        sa.Column('updated_at', sa.DateTime(), server_default=sa.func.now()),
    )
    op.create_index('ix_cached_stats_key', 'cached_stats', ['key'])

    # Add license index to images table (if not exists)
    try:
        op.create_index('ix_images_license', 'images', ['license'])
    except Exception:
        pass  # Index may already exist

    # Add only_without_images column to jobs if it doesn't exist
    try:
        op.add_column('jobs', sa.Column('only_without_images', sa.Boolean(), default=False))
    except Exception:
        pass  # Column may already exist


def downgrade() -> None:
    try:
        op.drop_index('ix_images_license', 'images')
    except Exception:
        pass
    try:
        op.drop_column('jobs', 'only_without_images')
    except Exception:
        pass
    op.drop_table('cached_stats')
31
backend/alembic/versions/003_add_job_max_images.py
Normal file
@@ -0,0 +1,31 @@
"""Add max_images column to jobs table

Revision ID: 003
Revises: 002
Create Date: 2025-01-25

"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa

revision: str = '003'
down_revision: Union[str, None] = '002'
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


def upgrade() -> None:
    # Add max_images column to jobs table
    try:
        op.add_column('jobs', sa.Column('max_images', sa.Integer(), nullable=True))
    except Exception:
        pass  # Column may already exist


def downgrade() -> None:
    try:
        op.drop_column('jobs', 'max_images')
    except Exception:
        pass
1
backend/app/__init__.py
Normal file
@@ -0,0 +1 @@
# PlantGuideScraper Backend
1
backend/app/api/__init__.py
Normal file
@@ -0,0 +1 @@
# API routes
175
backend/app/api/exports.py
Normal file
@@ -0,0 +1,175 @@
import json
import os
from typing import Optional

from fastapi import APIRouter, Depends, HTTPException, Query
from fastapi.responses import FileResponse
from sqlalchemy.orm import Session
from sqlalchemy import func

from app.database import get_db
from app.models import Export, Image, Species
from app.schemas.export import (
    ExportCreate,
    ExportResponse,
    ExportListResponse,
    ExportPreview,
)
from app.workers.export_tasks import generate_export

router = APIRouter()


@router.get("", response_model=ExportListResponse)
def list_exports(
    limit: int = Query(50, ge=1, le=200),
    db: Session = Depends(get_db),
):
    """List all exports."""
    total = db.query(Export).count()
    exports = db.query(Export).order_by(Export.created_at.desc()).limit(limit).all()

    return ExportListResponse(
        items=[ExportResponse.model_validate(e) for e in exports],
        total=total,
    )


@router.post("/preview", response_model=ExportPreview)
def preview_export(export: ExportCreate, db: Session = Depends(get_db)):
    """Preview export without creating it."""
    criteria = export.filter_criteria
    min_images = criteria.min_images_per_species

    # Build query
    query = db.query(Image).filter(Image.status == "downloaded")

    if criteria.licenses:
        query = query.filter(Image.license.in_(criteria.licenses))

    if criteria.min_quality:
        query = query.filter(Image.quality_score >= criteria.min_quality)

    if criteria.species_ids:
        query = query.filter(Image.species_id.in_(criteria.species_ids))

    # Count images per species
    species_counts = db.query(
        Image.species_id,
        func.count(Image.id).label("count")
    ).filter(Image.status == "downloaded")

    if criteria.licenses:
        species_counts = species_counts.filter(Image.license.in_(criteria.licenses))
    if criteria.min_quality:
        species_counts = species_counts.filter(Image.quality_score >= criteria.min_quality)
    if criteria.species_ids:
        species_counts = species_counts.filter(Image.species_id.in_(criteria.species_ids))

    species_counts = species_counts.group_by(Image.species_id).all()

    valid_species = [s for s in species_counts if s.count >= min_images]
    total_images = sum(s.count for s in valid_species)

    # Estimate file size (rough: 50KB per image)
    estimated_size_mb = (total_images * 50) / 1024

    return ExportPreview(
        species_count=len(valid_species),
        image_count=total_images,
        estimated_size_mb=estimated_size_mb,
    )


@router.post("", response_model=ExportResponse)
def create_export(export: ExportCreate, db: Session = Depends(get_db)):
    """Create and start a new export job."""
    db_export = Export(
        name=export.name,
        filter_criteria=export.filter_criteria.model_dump_json(),
        train_split=export.train_split,
        status="pending",
    )
    db.add(db_export)
    db.commit()
    db.refresh(db_export)

    # Start Celery task
    task = generate_export.delay(db_export.id)
    db_export.celery_task_id = task.id
    db.commit()

    return ExportResponse.model_validate(db_export)


@router.get("/{export_id}", response_model=ExportResponse)
def get_export(export_id: int, db: Session = Depends(get_db)):
    """Get export status."""
    export = db.query(Export).filter(Export.id == export_id).first()
    if not export:
        raise HTTPException(status_code=404, detail="Export not found")

    return ExportResponse.model_validate(export)


@router.get("/{export_id}/progress")
def get_export_progress(export_id: int, db: Session = Depends(get_db)):
    """Get real-time export progress."""
    from app.workers.celery_app import celery_app

    export = db.query(Export).filter(Export.id == export_id).first()
    if not export:
        raise HTTPException(status_code=404, detail="Export not found")

    if not export.celery_task_id:
        return {"status": export.status}

    result = celery_app.AsyncResult(export.celery_task_id)

    if result.state == "PROGRESS":
        meta = result.info
        return {
            "status": "generating",
            "current": meta.get("current", 0),
            "total": meta.get("total", 0),
            "current_species": meta.get("species", ""),
        }

    return {"status": export.status}


@router.get("/{export_id}/download")
def download_export(export_id: int, db: Session = Depends(get_db)):
    """Download export zip file."""
    export = db.query(Export).filter(Export.id == export_id).first()
    if not export:
        raise HTTPException(status_code=404, detail="Export not found")

    if export.status != "completed":
        raise HTTPException(status_code=400, detail="Export not ready")

    if not export.file_path or not os.path.exists(export.file_path):
        raise HTTPException(status_code=404, detail="Export file not found")

    return FileResponse(
        export.file_path,
        media_type="application/zip",
        filename=f"{export.name}.zip",
    )


@router.delete("/{export_id}")
def delete_export(export_id: int, db: Session = Depends(get_db)):
    """Delete an export and its file."""
    export = db.query(Export).filter(Export.id == export_id).first()
    if not export:
        raise HTTPException(status_code=404, detail="Export not found")

    # Delete file if exists
    if export.file_path and os.path.exists(export.file_path):
        os.remove(export.file_path)

    db.delete(export)
    db.commit()

    return {"status": "deleted"}
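The preview endpoint's core math (count downloaded images per species, keep species meeting the minimum, estimate zip size at roughly 50 KB per image) can be sketched in isolation with plain dicts standing in for the ORM rows. The function name and data shapes here are illustrative, not part of the project's API:

```python
from collections import Counter

def preview_counts(images, min_images_per_species, kb_per_image=50):
    """Count downloaded images per species, keep species meeting the
    threshold, and estimate export size in MB."""
    counts = Counter(img["species_id"] for img in images
                     if img["status"] == "downloaded")
    valid = {sid: n for sid, n in counts.items() if n >= min_images_per_species}
    total = sum(valid.values())
    return {
        "species_count": len(valid),
        "image_count": total,
        "estimated_size_mb": (total * kb_per_image) / 1024,
    }

images = (
    [{"species_id": 1, "status": "downloaded"}] * 30
    + [{"species_id": 2, "status": "downloaded"}] * 5
    + [{"species_id": 2, "status": "pending"}] * 10
)
preview_counts(images, min_images_per_species=10)
# species 2 has only 5 downloaded images, so only species 1 qualifies
```

Note that pending images never count toward the threshold, matching the `status == "downloaded"` filter in the endpoint.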
441
backend/app/api/images.py
Normal file
@@ -0,0 +1,441 @@
import os
import shutil
import uuid
from pathlib import Path
from typing import Optional, List

from fastapi import APIRouter, Depends, HTTPException, Query
from fastapi.responses import FileResponse
from sqlalchemy.orm import Session
from sqlalchemy import func
from PIL import Image as PILImage

from app.database import get_db
from app.models import Image, Species
from app.schemas.image import ImageResponse, ImageListResponse
from app.config import get_settings

router = APIRouter()
settings = get_settings()


@router.get("", response_model=ImageListResponse)
def list_images(
    page: int = Query(1, ge=1),
    page_size: int = Query(50, ge=1, le=200),
    species_id: Optional[int] = None,
    source: Optional[str] = None,
    license: Optional[str] = None,
    status: Optional[str] = None,
    min_quality: Optional[float] = None,
    search: Optional[str] = None,
    db: Session = Depends(get_db),
):
    """List images with pagination and filters."""
    # Use joinedload to fetch species in a single query
    from sqlalchemy.orm import joinedload
    query = db.query(Image).options(joinedload(Image.species))

    if species_id:
        query = query.filter(Image.species_id == species_id)

    if source:
        query = query.filter(Image.source == source)

    if license:
        query = query.filter(Image.license == license)

    if status:
        query = query.filter(Image.status == status)

    if min_quality:
        query = query.filter(Image.quality_score >= min_quality)

    if search:
        search_term = f"%{search}%"
        query = query.join(Species).filter(
            (Species.scientific_name.ilike(search_term)) |
            (Species.common_name.ilike(search_term))
        )

    # Use faster count for simple queries
    if not search:
        # Build count query without join for better performance
        count_query = db.query(func.count(Image.id))
        if species_id:
            count_query = count_query.filter(Image.species_id == species_id)
        if source:
            count_query = count_query.filter(Image.source == source)
        if license:
            count_query = count_query.filter(Image.license == license)
        if status:
            count_query = count_query.filter(Image.status == status)
        if min_quality:
            count_query = count_query.filter(Image.quality_score >= min_quality)
        total = count_query.scalar()
    else:
        total = query.count()

    pages = (total + page_size - 1) // page_size

    images = query.order_by(Image.created_at.desc()).offset(
        (page - 1) * page_size
    ).limit(page_size).all()

    items = [
        ImageResponse(
            id=img.id,
            species_id=img.species_id,
            species_name=img.species.scientific_name if img.species else None,
            source=img.source,
            source_id=img.source_id,
            url=img.url,
            local_path=img.local_path,
            license=img.license,
            attribution=img.attribution,
            width=img.width,
            height=img.height,
            quality_score=img.quality_score,
            status=img.status,
            created_at=img.created_at,
        )
        for img in images
    ]

    return ImageListResponse(
        items=items,
        total=total,
        page=page,
        page_size=page_size,
        pages=pages,
    )

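The `pages` value in `list_images` is computed with integer ceiling division, which avoids floats entirely. A standalone sketch of the same expression:

```python
def page_count(total: int, page_size: int) -> int:
    """Ceiling division without floats, as used for the `pages` field."""
    return (total + page_size - 1) // page_size

page_count(101, 50)  # → 3: two full pages of 50 plus one page of 1
page_count(100, 50)  # → 2
page_count(0, 50)    # → 0
```

Adding `page_size - 1` before flooring rounds any partial page up to a whole page, while an exact multiple is unaffected.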
@router.get("/sources")
def list_sources(db: Session = Depends(get_db)):
    """List all unique image sources."""
    sources = db.query(Image.source).distinct().all()
    return [s[0] for s in sources]


@router.get("/licenses")
def list_licenses(db: Session = Depends(get_db)):
    """List all unique licenses."""
    licenses = db.query(Image.license).distinct().all()
    return [l[0] for l in licenses]


@router.post("/process-pending")
def process_pending_images(
    source: Optional[str] = None,
    db: Session = Depends(get_db),
):
    """Queue all pending images for download and processing."""
    from app.workers.quality_tasks import batch_process_pending_images

    query = db.query(func.count(Image.id)).filter(Image.status == "pending")
    if source:
        query = query.filter(Image.source == source)
    pending_count = query.scalar()

    task = batch_process_pending_images.delay(source=source)

    return {
        "pending_count": pending_count,
        "task_id": task.id,
    }


@router.get("/process-pending/status/{task_id}")
def process_pending_status(task_id: str):
    """Check status of a batch processing task."""
    from app.workers.celery_app import celery_app

    result = celery_app.AsyncResult(task_id)
    state = result.state  # PENDING, STARTED, PROGRESS, SUCCESS, FAILURE

    response = {"task_id": task_id, "state": state}

    if state == "PROGRESS" and isinstance(result.info, dict):
        response["queued"] = result.info.get("queued", 0)
        response["total"] = result.info.get("total", 0)
    elif state == "SUCCESS" and isinstance(result.result, dict):
        response["queued"] = result.result.get("queued", 0)
        response["total"] = result.result.get("total", 0)

    return response


@router.get("/{image_id}", response_model=ImageResponse)
def get_image(image_id: int, db: Session = Depends(get_db)):
    """Get an image by ID."""
    image = db.query(Image).filter(Image.id == image_id).first()
    if not image:
        raise HTTPException(status_code=404, detail="Image not found")

    return ImageResponse(
        id=image.id,
        species_id=image.species_id,
        species_name=image.species.scientific_name if image.species else None,
        source=image.source,
        source_id=image.source_id,
        url=image.url,
        local_path=image.local_path,
        license=image.license,
        attribution=image.attribution,
        width=image.width,
        height=image.height,
        quality_score=image.quality_score,
        status=image.status,
        created_at=image.created_at,
    )


@router.get("/{image_id}/file")
def get_image_file(image_id: int, db: Session = Depends(get_db)):
    """Get the actual image file."""
    image = db.query(Image).filter(Image.id == image_id).first()
    if not image:
        raise HTTPException(status_code=404, detail="Image not found")

    if not image.local_path:
        raise HTTPException(status_code=404, detail="Image file not available")

    return FileResponse(image.local_path, media_type="image/jpeg")


@router.delete("/{image_id}")
def delete_image(image_id: int, db: Session = Depends(get_db)):
    """Delete an image."""
    image = db.query(Image).filter(Image.id == image_id).first()
    if not image:
        raise HTTPException(status_code=404, detail="Image not found")

    # Delete file if exists
    if image.local_path and os.path.exists(image.local_path):
        os.remove(image.local_path)

    db.delete(image)
    db.commit()

    return {"status": "deleted"}


@router.post("/bulk-delete")
def bulk_delete_images(
    image_ids: List[int],
    db: Session = Depends(get_db),
):
    """Delete multiple images."""
    images = db.query(Image).filter(Image.id.in_(image_ids)).all()

    deleted = 0
    for image in images:
        if image.local_path and os.path.exists(image.local_path):
            os.remove(image.local_path)
        db.delete(image)
        deleted += 1

    db.commit()

    return {"deleted": deleted}


@router.get("/import/scan")
def scan_imports(db: Session = Depends(get_db)):
    """Scan the imports folder and return what can be imported.

    Expected structure: imports/{source}/{species_name}/*.jpg
    """
    imports_path = Path(settings.imports_path)

    if not imports_path.exists():
        return {
            "available": False,
            "message": f"Imports folder not found: {imports_path}",
            "sources": [],
            "total_images": 0,
            "matched_species": 0,
            "unmatched_species": [],
        }

    results = {
        "available": True,
        "sources": [],
        "total_images": 0,
        "matched_species": 0,
        "unmatched_species": [],
    }

    # Get all species for matching
    species_map = {}
    for species in db.query(Species).all():
        # Map by scientific name with underscores and spaces
        species_map[species.scientific_name.lower()] = species
        species_map[species.scientific_name.replace(" ", "_").lower()] = species

    seen_unmatched = set()

    # Scan source folders
    for source_dir in imports_path.iterdir():
        if not source_dir.is_dir():
            continue

        source_name = source_dir.name
        source_info = {
            "name": source_name,
            "species_count": 0,
            "image_count": 0,
        }

        # Scan species folders within source
        for species_dir in source_dir.iterdir():
            if not species_dir.is_dir():
                continue

            species_name = species_dir.name.replace("_", " ")
            species_key = species_name.lower()

            # Count images
            image_files = list(species_dir.glob("*.jpg")) + \
                list(species_dir.glob("*.jpeg")) + \
                list(species_dir.glob("*.png"))

            if not image_files:
                continue

            source_info["image_count"] += len(image_files)
            results["total_images"] += len(image_files)

            if species_key in species_map or species_dir.name.lower() in species_map:
                source_info["species_count"] += 1
                results["matched_species"] += 1
            else:
                if species_name not in seen_unmatched:
                    seen_unmatched.add(species_name)
                    results["unmatched_species"].append(species_name)

        if source_info["image_count"] > 0:
            results["sources"].append(source_info)

    return results

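The scan matches folder names against species by lowercasing and indexing each scientific name under both its space- and underscore-separated forms. That lookup can be sketched in isolation (the species name here is only an example):

```python
def build_species_map(scientific_names):
    """Index each name under both 'monstera deliciosa' and 'monstera_deliciosa'."""
    species_map = {}
    for name in scientific_names:
        species_map[name.lower()] = name
        species_map[name.replace(" ", "_").lower()] = name
    return species_map

def match_folder(folder_name, species_map):
    """Try the folder name with underscores replaced by spaces, then as-is."""
    key = folder_name.replace("_", " ").lower()
    return species_map.get(key) or species_map.get(folder_name.lower())

m = build_species_map(["Monstera deliciosa"])
match_folder("Monstera_deliciosa", m)  # → "Monstera deliciosa"
match_folder("Ficus_lyrata", m)        # → None (unmatched folder)
```

Indexing both forms means folders named with either convention resolve to the same species row.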
@router.post("/import/run")
def run_import(
    move_files: bool = Query(False, description="Move files instead of copy"),
    db: Session = Depends(get_db),
):
    """Import images from the imports folder.

    Expected structure: imports/{source}/{species_name}/*.jpg
    Images are copied/moved to: images/{species_name}/{source}_(unknown)
    """
    imports_path = Path(settings.imports_path)
    images_path = Path(settings.images_path)

    if not imports_path.exists():
        raise HTTPException(status_code=400, detail="Imports folder not found")

    # Get all species for matching
    species_map = {}
    for species in db.query(Species).all():
        species_map[species.scientific_name.lower()] = species
        species_map[species.scientific_name.replace(" ", "_").lower()] = species

    imported = 0
    skipped = 0
    errors = []

    # Scan source folders
    for source_dir in imports_path.iterdir():
        if not source_dir.is_dir():
            continue

        source_name = source_dir.name

        # Scan species folders within source
        for species_dir in source_dir.iterdir():
            if not species_dir.is_dir():
                continue

            species_name = species_dir.name.replace("_", " ")
            species_key = species_name.lower()

            # Find matching species
            species = species_map.get(species_key) or species_map.get(species_dir.name.lower())
            if not species:
                continue

            # Create target directory
            target_dir = images_path / species.scientific_name.replace(" ", "_")
            target_dir.mkdir(parents=True, exist_ok=True)

            # Process images
            image_files = list(species_dir.glob("*.jpg")) + \
                list(species_dir.glob("*.jpeg")) + \
                list(species_dir.glob("*.png"))

            for img_file in image_files:
                try:
                    # Generate unique filename
                    ext = img_file.suffix.lower()
                    if ext == ".jpeg":
                        ext = ".jpg"
                    new_filename = f"{source_name}_{img_file.stem}_{uuid.uuid4().hex[:8]}{ext}"
                    target_path = target_dir / new_filename

                    # Check if already imported (by original filename pattern)
                    existing = db.query(Image).filter(
                        Image.species_id == species.id,
                        Image.source == source_name,
                        Image.source_id == img_file.stem,
                    ).first()

                    if existing:
                        skipped += 1
                        continue

                    # Get image dimensions
                    try:
                        with PILImage.open(img_file) as pil_img:
                            width, height = pil_img.size
                    except Exception:
                        width, height = None, None

                    # Copy or move file
                    if move_files:
                        shutil.move(str(img_file), str(target_path))
                    else:
                        shutil.copy2(str(img_file), str(target_path))

                    # Create database record
                    image = Image(
                        species_id=species.id,
                        source=source_name,
                        source_id=img_file.stem,
                        url=f"file://{img_file}",
                        local_path=str(target_path),
                        license="unknown",
                        width=width,
                        height=height,
                        status="downloaded",
                    )
                    db.add(image)
                    imported += 1

                except Exception as e:
                    errors.append(f"{img_file}: {str(e)}")

            # Commit after each species to avoid large transactions
            db.commit()

    return {
        "imported": imported,
        "skipped": skipped,
        "errors": errors[:20],
    }
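`run_import` builds each target filename from the source name, the original stem, and an 8-character random hex suffix, folding `.jpeg` into `.jpg`. That naming logic, extracted into a small helper (the function name is an assumption for illustration):

```python
import uuid
from pathlib import Path

def import_filename(source: str, original: Path) -> str:
    """Source prefix + original stem + 8-hex random suffix; .jpeg folds into .jpg."""
    ext = original.suffix.lower()
    if ext == ".jpeg":
        ext = ".jpg"
    return f"{source}_{original.stem}_{uuid.uuid4().hex[:8]}{ext}"

import_filename("flickr", Path("IMG_0042.JPEG"))
# e.g. "flickr_IMG_0042_3fa85f64.jpg" (random suffix varies per call)
```

The random suffix prevents collisions when two sources, or two runs, contribute files with the same original name.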
173
backend/app/api/jobs.py
Normal file
@@ -0,0 +1,173 @@
import json
from typing import Optional

from fastapi import APIRouter, Depends, HTTPException, Query
from sqlalchemy.orm import Session

from app.database import get_db
from app.models import Job
from app.schemas.job import JobCreate, JobResponse, JobListResponse
from app.workers.scrape_tasks import run_scrape_job

router = APIRouter()


@router.get("", response_model=JobListResponse)
def list_jobs(
    status: Optional[str] = None,
    source: Optional[str] = None,
    limit: int = Query(50, ge=1, le=200),
    db: Session = Depends(get_db),
):
    """List all jobs."""
    query = db.query(Job)

    if status:
        query = query.filter(Job.status == status)

    if source:
        query = query.filter(Job.source == source)

    total = query.count()
    jobs = query.order_by(Job.created_at.desc()).limit(limit).all()

    return JobListResponse(
        items=[JobResponse.model_validate(j) for j in jobs],
        total=total,
    )


@router.post("", response_model=JobResponse)
def create_job(job: JobCreate, db: Session = Depends(get_db)):
    """Create and start a new scrape job."""
    species_filter = None
    if job.species_ids:
        species_filter = json.dumps(job.species_ids)

    db_job = Job(
        name=job.name,
        source=job.source,
        species_filter=species_filter,
        only_without_images=job.only_without_images,
        max_images=job.max_images,
        status="pending",
    )
    db.add(db_job)
    db.commit()
    db.refresh(db_job)

    # Start the Celery task
    task = run_scrape_job.delay(db_job.id)
    db_job.celery_task_id = task.id
    db.commit()

    return JobResponse.model_validate(db_job)


@router.get("/{job_id}", response_model=JobResponse)
def get_job(job_id: int, db: Session = Depends(get_db)):
    """Get job status."""
    job = db.query(Job).filter(Job.id == job_id).first()
    if not job:
        raise HTTPException(status_code=404, detail="Job not found")

    return JobResponse.model_validate(job)


@router.get("/{job_id}/progress")
def get_job_progress(job_id: int, db: Session = Depends(get_db)):
    """Get real-time job progress from Celery."""
    from app.workers.celery_app import celery_app

    job = db.query(Job).filter(Job.id == job_id).first()
    if not job:
        raise HTTPException(status_code=404, detail="Job not found")

    if not job.celery_task_id:
        return {
            "status": job.status,
            "progress_current": job.progress_current,
            "progress_total": job.progress_total,
        }

    # Get Celery task state
    result = celery_app.AsyncResult(job.celery_task_id)

    if result.state == "PROGRESS":
        meta = result.info
        return {
            "status": "running",
            "progress_current": meta.get("current", 0),
            "progress_total": meta.get("total", 0),
            "current_species": meta.get("species", ""),
        }

    return {
        "status": job.status,
        "progress_current": job.progress_current,
        "progress_total": job.progress_total,
    }


@router.post("/{job_id}/pause")
def pause_job(job_id: int, db: Session = Depends(get_db)):
    """Pause a running job."""
    from app.workers.celery_app import celery_app

    job = db.query(Job).filter(Job.id == job_id).first()
    if not job:
        raise HTTPException(status_code=404, detail="Job not found")

    if job.status != "running":
        raise HTTPException(status_code=400, detail="Job is not running")

    # Revoke Celery task
    if job.celery_task_id:
        celery_app.control.revoke(job.celery_task_id, terminate=True)

    job.status = "paused"
    db.commit()

    return {"status": "paused"}


@router.post("/{job_id}/resume")
def resume_job(job_id: int, db: Session = Depends(get_db)):
    """Resume a paused job."""
    job = db.query(Job).filter(Job.id == job_id).first()
    if not job:
        raise HTTPException(status_code=404, detail="Job not found")

    if job.status != "paused":
        raise HTTPException(status_code=400, detail="Job is not paused")

    # Start a new Celery task
    task = run_scrape_job.delay(job.id)
    job.celery_task_id = task.id
    job.status = "pending"
    db.commit()

    return {"status": "resumed"}


@router.post("/{job_id}/cancel")
def cancel_job(job_id: int, db: Session = Depends(get_db)):
    """Cancel a job."""
    from app.workers.celery_app import celery_app

    job = db.query(Job).filter(Job.id == job_id).first()
    if not job:
        raise HTTPException(status_code=404, detail="Job not found")

    if job.status in ["completed", "failed"]:
        raise HTTPException(status_code=400, detail="Job already finished")

    # Revoke Celery task
    if job.celery_task_id:
        celery_app.control.revoke(job.celery_task_id, terminate=True)

    job.status = "failed"
    job.error_message = "Cancelled by user"
    db.commit()

    return {"status": "cancelled"}
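The pause/resume/cancel handlers together enforce a small state machine over `job.status`: pause requires "running", resume requires "paused", and cancel rejects anything already "completed" or "failed". A sketch of those transitions as a lookup table (the statuses come from the handlers; the table itself is illustrative, not project code):

```python
# Transitions implied by the handlers above.
ALLOWED = {
    "pause": {"running"},
    "resume": {"paused"},
    "cancel": {"pending", "running", "paused"},
}

def can_transition(action: str, status: str) -> bool:
    """Return True if the given action is valid for a job in this status."""
    return status in ALLOWED.get(action, set())

can_transition("pause", "running")     # True
can_transition("resume", "running")    # False: only paused jobs resume
can_transition("cancel", "completed")  # False: already finished
```

Centralizing the table like this would let all three endpoints share one validation path instead of repeating the checks.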
198
backend/app/api/sources.py
Normal file
@@ -0,0 +1,198 @@
|
||||
from fastapi import APIRouter, Depends, HTTPException
|
||||
from sqlalchemy.orm import Session
|
||||
|
||||
from app.database import get_db
|
||||
from app.models import ApiKey
|
||||
from app.schemas.api_key import ApiKeyCreate, ApiKeyUpdate, ApiKeyResponse
|
||||
|
||||
router = APIRouter()
|
||||
|
||||
# Available sources
|
||||
# auth_type: "none" (no auth), "api_key" (single key), "api_key_secret" (key + secret), "oauth" (client_id + client_secret + access_token)
|
||||
# default_rate: safe default requests per second for each API
|
||||
AVAILABLE_SOURCES = [
|
||||
{"name": "gbif", "label": "GBIF", "requires_secret": False, "auth_type": "none", "default_rate": 1.0}, # Free, no auth required
|
||||
{"name": "inaturalist", "label": "iNaturalist", "requires_secret": True, "auth_type": "api_key_secret", "default_rate": 1.0}, # 60/min limit
|
||||
{"name": "flickr", "label": "Flickr", "requires_secret": True, "auth_type": "api_key_secret", "default_rate": 0.5}, # 3600/hr shared limit
|
||||
{"name": "wikimedia", "label": "Wikimedia Commons", "requires_secret": True, "auth_type": "oauth", "default_rate": 1.0}, # generous limits
|
||||
{"name": "trefle", "label": "Trefle.io", "requires_secret": False, "auth_type": "api_key", "default_rate": 1.0}, # 120/min limit
|
||||
{"name": "duckduckgo", "label": "DuckDuckGo", "requires_secret": False, "auth_type": "none", "default_rate": 0.5}, # Web search, no API key
|
||||
{"name": "bing", "label": "Bing Image Search", "requires_secret": False, "auth_type": "api_key", "default_rate": 3.0}, # Azure Cognitive Services
|
||||
]
|
||||
|
||||
|
||||
def mask_api_key(key: str) -> str:
|
||||
"""Mask API key, showing only last 4 characters."""
|
||||
if not key or len(key) <= 4:
|
||||
return "****"
|
||||
return "*" * (len(key) - 4) + key[-4:]
|
||||
|
||||
|
||||
@router.get("")
|
||||
def list_sources(db: Session = Depends(get_db)):
|
||||
"""List all available sources with their configuration status."""
|
||||
api_keys = {k.source: k for k in db.query(ApiKey).all()}
|
||||
|
||||
result = []
|
||||
for source in AVAILABLE_SOURCES:
|
||||
api_key = api_keys.get(source["name"])
|
||||
default_rate = source.get("default_rate", 1.0)
|
||||
result.append({
|
||||
"name": source["name"],
|
||||
"label": source["label"],
|
||||
"requires_secret": source["requires_secret"],
|
||||
"auth_type": source.get("auth_type", "api_key"),
|
||||
"configured": api_key is not None,
|
||||
"enabled": api_key.enabled if api_key else False,
|
||||
"api_key_masked": mask_api_key(api_key.api_key) if api_key else None,
|
||||
"has_secret": bool(api_key.api_secret) if api_key else False,
|
||||
"has_access_token": bool(getattr(api_key, 'access_token', None)) if api_key else False,
|
||||
"rate_limit_per_sec": api_key.rate_limit_per_sec if api_key else default_rate,
|
||||
"default_rate": default_rate,
|
||||
})
|
||||
|
||||
return result
|
||||
|
||||
|
||||
@router.get("/{source}")
|
||||
def get_source(source: str, db: Session = Depends(get_db)):
|
||||
"""Get source configuration."""
|
||||
source_info = next((s for s in AVAILABLE_SOURCES if s["name"] == source), None)
|
||||
if not source_info:
|
||||
raise HTTPException(status_code=404, detail="Unknown source")
|
||||
|
||||
api_key = db.query(ApiKey).filter(ApiKey.source == source).first()
|
||||
default_rate = source_info.get("default_rate", 1.0)
|
||||
|
||||
return {
|
||||
"name": source_info["name"],
|
||||
"label": source_info["label"],
|
||||
"requires_secret": source_info["requires_secret"],
|
||||
"auth_type": source_info.get("auth_type", "api_key"),
|
||||
"configured": api_key is not None,
|
||||
"enabled": api_key.enabled if api_key else False,
|
||||
"api_key_masked": mask_api_key(api_key.api_key) if api_key else None,
|
||||
"has_secret": bool(api_key.api_secret) if api_key else False,
|
||||
"has_access_token": bool(getattr(api_key, 'access_token', None)) if api_key else False,
|
||||
"rate_limit_per_sec": api_key.rate_limit_per_sec if api_key else default_rate,
|
||||
"default_rate": default_rate,
|
||||
}
|
||||
|
||||
|
||||
@router.put("/{source}")
|
||||
def update_source(
|
||||
source: str,
|
||||
config: ApiKeyCreate,
|
||||
db: Session = Depends(get_db),
|
||||
):
|
||||
"""Create or update source configuration."""
|
||||
source_info = next((s for s in AVAILABLE_SOURCES if s["name"] == source), None)
|
||||
if not source_info:
|
||||
raise HTTPException(status_code=404, detail="Unknown source")
|
||||
|
||||
    # For sources that require auth, validate api_key is provided
    auth_type = source_info.get("auth_type", "api_key")
    if auth_type != "none" and not config.api_key:
        raise HTTPException(status_code=400, detail="API key is required for this source")

    api_key = db.query(ApiKey).filter(ApiKey.source == source).first()

    # Use placeholder for no-auth sources
    api_key_value = config.api_key or "no-auth"

    if api_key:
        # Update existing
        api_key.api_key = api_key_value
        if config.api_secret:
            api_key.api_secret = config.api_secret
        if config.access_token:
            api_key.access_token = config.access_token
        api_key.rate_limit_per_sec = config.rate_limit_per_sec
        api_key.enabled = config.enabled
    else:
        # Create new
        api_key = ApiKey(
            source=source,
            api_key=api_key_value,
            api_secret=config.api_secret,
            access_token=config.access_token,
            rate_limit_per_sec=config.rate_limit_per_sec,
            enabled=config.enabled,
        )
        db.add(api_key)

    db.commit()
    db.refresh(api_key)

    return {
        "name": source,
        "configured": True,
        "enabled": api_key.enabled,
        "api_key_masked": mask_api_key(api_key.api_key) if auth_type != "none" else None,
        "has_secret": bool(api_key.api_secret),
        "has_access_token": bool(api_key.access_token),
        "rate_limit_per_sec": api_key.rate_limit_per_sec,
    }


@router.patch("/{source}")
def patch_source(
    source: str,
    config: ApiKeyUpdate,
    db: Session = Depends(get_db),
):
    """Partially update source configuration."""
    api_key = db.query(ApiKey).filter(ApiKey.source == source).first()
    if not api_key:
        raise HTTPException(status_code=404, detail="Source not configured")

    update_data = config.model_dump(exclude_unset=True)
    for field, value in update_data.items():
        setattr(api_key, field, value)

    db.commit()
    db.refresh(api_key)

    return {
        "name": source,
        "configured": True,
        "enabled": api_key.enabled,
        "api_key_masked": mask_api_key(api_key.api_key),
        "has_secret": bool(api_key.api_secret),
        "has_access_token": bool(api_key.access_token),
        "rate_limit_per_sec": api_key.rate_limit_per_sec,
    }


@router.delete("/{source}")
def delete_source(source: str, db: Session = Depends(get_db)):
    """Delete source configuration."""
    api_key = db.query(ApiKey).filter(ApiKey.source == source).first()
    if not api_key:
        raise HTTPException(status_code=404, detail="Source not configured")

    db.delete(api_key)
    db.commit()

    return {"status": "deleted"}


@router.post("/{source}/test")
def test_source(source: str, db: Session = Depends(get_db)):
    """Test source API connection."""
    api_key = db.query(ApiKey).filter(ApiKey.source == source).first()
    if not api_key:
        raise HTTPException(status_code=404, detail="Source not configured")

    # Import and test the scraper
    from app.scrapers import get_scraper

    scraper = get_scraper(source)
    if not scraper:
        raise HTTPException(status_code=400, detail="No scraper for this source")

    try:
        result = scraper.test_connection(api_key)
        return {"status": "success", "message": result}
    except Exception as e:
        return {"status": "error", "message": str(e)}
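The masked responses above rely on a `mask_api_key` helper defined earlier in this module. A minimal sketch of what such a masker might do (the exact rule here, keeping only the last four characters, is an assumption, not the project's actual implementation):

```python
def mask_api_key(key: str) -> str:
    """Hypothetical masker: hide all but the last 4 characters."""
    if not key or len(key) <= 4:
        return "****"
    return "*" * (len(key) - 4) + key[-4:]

masked = mask_api_key("abcd1234efgh")  # "********efgh"
```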
366
backend/app/api/species.py
Normal file
@@ -0,0 +1,366 @@
import csv
import io
import json
from typing import Optional

from fastapi import APIRouter, Depends, HTTPException, Query, UploadFile, File
from sqlalchemy.orm import Session
from sqlalchemy import func

from app.database import get_db
from app.models import Species, Image
from app.schemas.species import (
    SpeciesCreate,
    SpeciesUpdate,
    SpeciesResponse,
    SpeciesListResponse,
    SpeciesImportResponse,
)

router = APIRouter()


def get_species_with_count(db: Session, species: Species) -> SpeciesResponse:
    """Get species response with image count."""
    image_count = db.query(func.count(Image.id)).filter(
        Image.species_id == species.id,
        Image.status == "downloaded"
    ).scalar()

    return SpeciesResponse(
        id=species.id,
        scientific_name=species.scientific_name,
        common_name=species.common_name,
        genus=species.genus,
        family=species.family,
        created_at=species.created_at,
        image_count=image_count or 0,
    )


@router.get("", response_model=SpeciesListResponse)
def list_species(
    page: int = Query(1, ge=1),
    page_size: int = Query(50, ge=1, le=500),
    search: Optional[str] = None,
    genus: Optional[str] = None,
    has_images: Optional[bool] = None,
    max_images: Optional[int] = Query(None, description="Filter species with fewer than N images"),
    min_images: Optional[int] = Query(None, description="Filter species with at least N images"),
    db: Session = Depends(get_db),
):
    """List species with pagination and filters.

    Filters:
    - search: Search by scientific or common name
    - genus: Filter by genus
    - has_images: True for species with images, False for species without
    - max_images: Filter species with fewer than N downloaded images
    - min_images: Filter species with at least N downloaded images
    """
    # If filtering by image count, we need a subquery approach
    if max_images is not None or min_images is not None:
        # Build a subquery with image counts per species
        image_counts = (
            db.query(
                Species.id.label("species_id"),
                func.count(Image.id).label("img_count")
            )
            .outerjoin(Image, (Image.species_id == Species.id) & (Image.status == "downloaded"))
            .group_by(Species.id)
            .subquery()
        )

        # Join species with their counts
        query = db.query(Species).join(
            image_counts, Species.id == image_counts.c.species_id
        )

        if max_images is not None:
            query = query.filter(image_counts.c.img_count < max_images)

        if min_images is not None:
            query = query.filter(image_counts.c.img_count >= min_images)
    else:
        query = db.query(Species)

    if search:
        search_term = f"%{search}%"
        query = query.filter(
            (Species.scientific_name.ilike(search_term)) |
            (Species.common_name.ilike(search_term))
        )

    if genus:
        query = query.filter(Species.genus == genus)

    # Filter by whether species has downloaded images (only if not using min/max filters)
    if has_images is not None and max_images is None and min_images is None:
        # Get IDs of species that have at least one downloaded image
        species_with_images = (
            db.query(Image.species_id)
            .filter(Image.status == "downloaded")
            .distinct()
            .subquery()
        )
        if has_images:
            query = query.filter(Species.id.in_(db.query(species_with_images.c.species_id)))
        else:
            query = query.filter(~Species.id.in_(db.query(species_with_images.c.species_id)))

    total = query.count()
    pages = (total + page_size - 1) // page_size

    species_list = query.order_by(Species.scientific_name).offset(
        (page - 1) * page_size
    ).limit(page_size).all()

    # Fetch image counts in bulk for all species on this page
    species_ids = [s.id for s in species_list]
    if species_ids:
        count_query = db.query(
            Image.species_id,
            func.count(Image.id)
        ).filter(
            Image.species_id.in_(species_ids),
            Image.status == "downloaded"
        ).group_by(Image.species_id).all()
        count_map = {species_id: count for species_id, count in count_query}
    else:
        count_map = {}

    items = [
        SpeciesResponse(
            id=s.id,
            scientific_name=s.scientific_name,
            common_name=s.common_name,
            genus=s.genus,
            family=s.family,
            created_at=s.created_at,
            image_count=count_map.get(s.id, 0),
        )
        for s in species_list
    ]

    return SpeciesListResponse(
        items=items,
        total=total,
        page=page,
        page_size=page_size,
        pages=pages,
    )


@router.post("", response_model=SpeciesResponse)
def create_species(species: SpeciesCreate, db: Session = Depends(get_db)):
    """Create a new species."""
    existing = db.query(Species).filter(
        Species.scientific_name == species.scientific_name
    ).first()

    if existing:
        raise HTTPException(status_code=400, detail="Species already exists")

    # Auto-extract genus from scientific name if not provided
    genus = species.genus
    if not genus and " " in species.scientific_name:
        genus = species.scientific_name.split()[0]

    db_species = Species(
        scientific_name=species.scientific_name,
        common_name=species.common_name,
        genus=genus,
        family=species.family,
    )
    db.add(db_species)
    db.commit()
    db.refresh(db_species)

    return get_species_with_count(db, db_species)


@router.post("/import", response_model=SpeciesImportResponse)
async def import_species(
    file: UploadFile = File(...),
    db: Session = Depends(get_db),
):
    """Import species from CSV file.

    Expected columns: scientific_name, common_name (optional), genus (optional), family (optional)
    """
    if not file.filename or not file.filename.lower().endswith(".csv"):
        raise HTTPException(status_code=400, detail="File must be a CSV")

    content = await file.read()
    csv_text = content.decode("utf-8")

    reader = csv.DictReader(io.StringIO(csv_text))

    imported = 0
    skipped = 0
    errors = []

    for row_num, row in enumerate(reader, start=2):
        scientific_name = row.get("scientific_name", "").strip()
        if not scientific_name:
            errors.append(f"Row {row_num}: Missing scientific_name")
            continue

        # Check if already exists
        existing = db.query(Species).filter(
            Species.scientific_name == scientific_name
        ).first()

        if existing:
            skipped += 1
            continue

        # Auto-extract genus if not provided
        genus = row.get("genus", "").strip()
        if not genus and " " in scientific_name:
            genus = scientific_name.split()[0]

        try:
            species = Species(
                scientific_name=scientific_name,
                common_name=row.get("common_name", "").strip() or None,
                genus=genus or None,
                family=row.get("family", "").strip() or None,
            )
            db.add(species)
            imported += 1
        except Exception as e:
            errors.append(f"Row {row_num}: {str(e)}")

    db.commit()

    return SpeciesImportResponse(
        imported=imported,
        skipped=skipped,
        errors=errors[:10],  # Limit error messages
    )


@router.post("/import-json", response_model=SpeciesImportResponse)
async def import_species_json(
    file: UploadFile = File(...),
    db: Session = Depends(get_db),
):
    """Import species from JSON file.

    Expected format: {"plants": [{"scientific_name": "...", "common_names": [...], "family": "..."}]}
    """
    if not file.filename or not file.filename.lower().endswith(".json"):
        raise HTTPException(status_code=400, detail="File must be JSON")

    content = await file.read()
    try:
        data = json.loads(content.decode("utf-8"))
    except json.JSONDecodeError as e:
        raise HTTPException(status_code=400, detail=f"Invalid JSON: {e}")

    plants = data.get("plants", [])
    if not plants:
        raise HTTPException(status_code=400, detail="No plants found in JSON")

    imported = 0
    skipped = 0
    errors = []

    for idx, plant in enumerate(plants):
        scientific_name = plant.get("scientific_name", "").strip()
        if not scientific_name:
            errors.append(f"Plant {idx}: Missing scientific_name")
            continue

        # Check if already exists
        existing = db.query(Species).filter(
            Species.scientific_name == scientific_name
        ).first()

        if existing:
            skipped += 1
            continue

        # Auto-extract genus from scientific name
        genus = None
        if " " in scientific_name:
            genus = scientific_name.split()[0]

        # Get first common name if array provided
        common_names = plant.get("common_names", [])
        common_name = common_names[0] if common_names else None

        try:
            species = Species(
                scientific_name=scientific_name,
                common_name=common_name,
                genus=genus,
                family=plant.get("family"),
            )
            db.add(species)
            imported += 1
        except Exception as e:
            errors.append(f"Plant {idx}: {str(e)}")

    db.commit()

    return SpeciesImportResponse(
        imported=imported,
        skipped=skipped,
        errors=errors[:10],
    )


@router.get("/{species_id}", response_model=SpeciesResponse)
def get_species(species_id: int, db: Session = Depends(get_db)):
    """Get a species by ID."""
    species = db.query(Species).filter(Species.id == species_id).first()
    if not species:
        raise HTTPException(status_code=404, detail="Species not found")

    return get_species_with_count(db, species)


@router.put("/{species_id}", response_model=SpeciesResponse)
def update_species(
    species_id: int,
    species_update: SpeciesUpdate,
    db: Session = Depends(get_db),
):
    """Update a species."""
    species = db.query(Species).filter(Species.id == species_id).first()
    if not species:
        raise HTTPException(status_code=404, detail="Species not found")

    update_data = species_update.model_dump(exclude_unset=True)
    for field, value in update_data.items():
        setattr(species, field, value)

    db.commit()
    db.refresh(species)

    return get_species_with_count(db, species)


@router.delete("/{species_id}")
def delete_species(species_id: int, db: Session = Depends(get_db)):
    """Delete a species and all its images."""
    species = db.query(Species).filter(Species.id == species_id).first()
    if not species:
        raise HTTPException(status_code=404, detail="Species not found")

    db.delete(species)
    db.commit()

    return {"status": "deleted"}


@router.get("/genera/list")
def list_genera(db: Session = Depends(get_db)):
    """List all unique genera."""
    genera = db.query(Species.genus).filter(
        Species.genus.isnot(None)
    ).distinct().order_by(Species.genus).all()

    return [g[0] for g in genera]
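The CSV import endpoint above reduces to a small amount of row handling. The genus auto-extraction and skip-on-duplicate logic can be exercised without a database; this standalone sketch (names are illustrative, not the endpoint itself) mirrors that loop:

```python
import csv
import io

def parse_species_rows(csv_text: str, existing_names: set):
    """Sketch of the import loop: extract genus, skip blanks and duplicates."""
    imported, skipped, errors = [], 0, []
    # start=2 because row 1 of the file is the header
    for row_num, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=2):
        name = (row.get("scientific_name") or "").strip()
        if not name:
            errors.append(f"Row {row_num}: Missing scientific_name")
            continue
        if name in existing_names:
            skipped += 1
            continue
        genus = (row.get("genus") or "").strip()
        if not genus and " " in name:
            genus = name.split()[0]  # first word of the binomial
        imported.append({"scientific_name": name, "genus": genus or None})
        existing_names.add(name)
    return imported, skipped, errors

sample = "scientific_name,genus\nMonstera deliciosa,\nMonstera deliciosa,\n,\n"
rows, skipped, errs = parse_species_rows(sample, set())
```

The duplicate row is counted as skipped and the blank row produces an error message keyed to its file line, just as the endpoint reports.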
190
backend/app/api/stats.py
Normal file
@@ -0,0 +1,190 @@
import json
from typing import Optional

from fastapi import APIRouter, Depends
from sqlalchemy.orm import Session
from sqlalchemy import func, case

from app.database import get_db
from app.models import Species, Image
from app.models.cached_stats import CachedStats
from app.schemas.stats import StatsResponse, SourceStats, LicenseStats, SpeciesStats, JobStats

router = APIRouter()


@router.get("", response_model=StatsResponse)
def get_stats(db: Session = Depends(get_db)):
    """Get dashboard statistics from cache (updated every 60s by Celery)."""
    # Try to get cached stats
    cached = db.query(CachedStats).filter(CachedStats.key == "dashboard_stats").first()

    if cached:
        data = json.loads(cached.value)
        return StatsResponse(
            total_species=data["total_species"],
            total_images=data["total_images"],
            images_downloaded=data["images_downloaded"],
            images_pending=data["images_pending"],
            images_rejected=data["images_rejected"],
            disk_usage_mb=data["disk_usage_mb"],
            sources=[SourceStats(**s) for s in data["sources"]],
            licenses=[LicenseStats(**l) for l in data["licenses"]],
            jobs=JobStats(**data["jobs"]),
            top_species=[SpeciesStats(**s) for s in data["top_species"]],
            under_represented=[SpeciesStats(**s) for s in data["under_represented"]],
        )

    # No cache yet - return empty stats (Celery will populate soon).
    # This only happens on first startup before Celery runs.
    return StatsResponse(
        total_species=0,
        total_images=0,
        images_downloaded=0,
        images_pending=0,
        images_rejected=0,
        disk_usage_mb=0.0,
        sources=[],
        licenses=[],
        jobs=JobStats(running=0, pending=0, completed=0, failed=0),
        top_species=[],
        under_represented=[],
    )


@router.post("/refresh")
def refresh_stats_now(db: Session = Depends(get_db)):
    """Manually trigger a stats refresh."""
    from app.workers.stats_tasks import refresh_stats
    refresh_stats.delay()
    return {"status": "refresh_queued"}


@router.get("/sources")
def get_source_stats(db: Session = Depends(get_db)):
    """Get per-source breakdown."""
    stats = db.query(
        Image.source,
        func.count(Image.id).label("total"),
        func.sum(case((Image.status == "downloaded", 1), else_=0)).label("downloaded"),
        func.sum(case((Image.status == "pending", 1), else_=0)).label("pending"),
        func.sum(case((Image.status == "rejected", 1), else_=0)).label("rejected"),
    ).group_by(Image.source).all()

    return [
        {
            "source": s.source,
            "total": s.total,
            "downloaded": s.downloaded or 0,
            "pending": s.pending or 0,
            "rejected": s.rejected or 0,
        }
        for s in stats
    ]


@router.get("/species")
def get_species_stats(
    min_count: int = 0,
    max_count: Optional[int] = None,
    db: Session = Depends(get_db),
):
    """Get per-species image counts."""
    query = db.query(
        Species.id,
        Species.scientific_name,
        Species.common_name,
        Species.genus,
        func.count(Image.id).label("image_count")
    ).outerjoin(
        Image, (Image.species_id == Species.id) & (Image.status == "downloaded")
    ).group_by(Species.id)

    if min_count > 0:
        query = query.having(func.count(Image.id) >= min_count)

    if max_count is not None:
        query = query.having(func.count(Image.id) <= max_count)

    stats = query.order_by(func.count(Image.id).desc()).all()

    return [
        {
            "id": s.id,
            "scientific_name": s.scientific_name,
            "common_name": s.common_name,
            "genus": s.genus,
            "image_count": s.image_count,
        }
        for s in stats
    ]


@router.get("/distribution")
def get_image_distribution(db: Session = Depends(get_db)):
    """Get distribution of images per species for ML training assessment.

    Returns counts of species at various image thresholds to help
    determine dataset quality for training image classifiers.
    """
    from sqlalchemy import text

    # Get image counts per species using a single raw SQL query
    distribution_sql = text("""
        WITH species_counts AS (
            SELECT
                s.id,
                COUNT(i.id) AS cnt
            FROM species s
            LEFT JOIN images i ON i.species_id = s.id AND i.status = 'downloaded'
            GROUP BY s.id
        )
        SELECT
            COUNT(*) AS total_species,
            SUM(CASE WHEN cnt = 0 THEN 1 ELSE 0 END) AS with_0,
            SUM(CASE WHEN cnt >= 1 AND cnt < 10 THEN 1 ELSE 0 END) AS with_1_9,
            SUM(CASE WHEN cnt >= 10 AND cnt < 25 THEN 1 ELSE 0 END) AS with_10_24,
            SUM(CASE WHEN cnt >= 25 AND cnt < 50 THEN 1 ELSE 0 END) AS with_25_49,
            SUM(CASE WHEN cnt >= 50 AND cnt < 100 THEN 1 ELSE 0 END) AS with_50_99,
            SUM(CASE WHEN cnt >= 100 AND cnt < 200 THEN 1 ELSE 0 END) AS with_100_199,
            SUM(CASE WHEN cnt >= 200 THEN 1 ELSE 0 END) AS with_200_plus,
            SUM(CASE WHEN cnt >= 10 THEN 1 ELSE 0 END) AS trainable_10,
            SUM(CASE WHEN cnt >= 25 THEN 1 ELSE 0 END) AS trainable_25,
            SUM(CASE WHEN cnt >= 50 THEN 1 ELSE 0 END) AS trainable_50,
            SUM(CASE WHEN cnt >= 100 THEN 1 ELSE 0 END) AS trainable_100,
            AVG(cnt) AS avg_images,
            MAX(cnt) AS max_images,
            MIN(cnt) AS min_images,
            SUM(cnt) AS total_images
        FROM species_counts
    """)

    result = db.execute(distribution_sql).fetchone()

    return {
        "total_species": result[0] or 0,
        "distribution": {
            "0_images": result[1] or 0,
            "1_to_9": result[2] or 0,
            "10_to_24": result[3] or 0,
            "25_to_49": result[4] or 0,
            "50_to_99": result[5] or 0,
            "100_to_199": result[6] or 0,
            "200_plus": result[7] or 0,
        },
        "trainable_species": {
            "min_10_images": result[8] or 0,
            "min_25_images": result[9] or 0,
            "min_50_images": result[10] or 0,
            "min_100_images": result[11] or 0,
        },
        "summary": {
            "avg_images_per_species": round(result[12] or 0, 1),
            "max_images": result[13] or 0,
            "min_images": result[14] or 0,
            "total_downloaded_images": result[15] or 0,
        },
        "recommendations": {
            "for_basic_model": f"{result[8] or 0} species with 10+ images",
            "for_good_model": f"{result[10] or 0} species with 50+ images",
            "for_excellent_model": f"{result[11] or 0} species with 100+ images",
        },
    }
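The raw-SQL CTE above just buckets per-species image counts into half-open ranges plus cumulative "trainable" thresholds. A pure-Python sketch of the same bucketing (useful for sanity-checking the SQL against a known list of counts):

```python
def bucket_distribution(counts):
    """Mirror the SQL buckets: counts is a list of per-species image counts."""
    edges = [
        (0, 1, "0_images"), (1, 10, "1_to_9"), (10, 25, "10_to_24"),
        (25, 50, "25_to_49"), (50, 100, "50_to_99"),
        (100, 200, "100_to_199"), (200, float("inf"), "200_plus"),
    ]
    # each species falls into exactly one half-open [lo, hi) bucket
    dist = {label: sum(1 for c in counts if lo <= c < hi) for lo, hi, label in edges}
    # trainable thresholds are cumulative, so a species can count toward several
    trainable = {f"min_{t}_images": sum(1 for c in counts if c >= t)
                 for t in (10, 25, 50, 100)}
    return dist, trainable

dist, trainable = bucket_distribution([0, 5, 12, 60, 250])
```

With these five species the buckets sum to the species total, while the threshold counts overlap by design.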
38
backend/app/config.py
Normal file
@@ -0,0 +1,38 @@
from pydantic_settings import BaseSettings
from functools import lru_cache


class Settings(BaseSettings):
    # Database
    database_url: str = "sqlite:////data/db/plants.sqlite"

    # Redis
    redis_url: str = "redis://redis:6379/0"

    # Storage paths
    images_path: str = "/data/images"
    exports_path: str = "/data/exports"
    imports_path: str = "/data/imports"
    logs_path: str = "/data/logs"

    # API keys
    flickr_api_key: str = ""
    flickr_api_secret: str = ""
    inaturalist_app_id: str = ""
    inaturalist_app_secret: str = ""
    trefle_api_key: str = ""

    # Logging
    log_level: str = "INFO"

    # Celery
    celery_concurrency: int = 4

    class Config:
        env_file = ".env"
        extra = "ignore"


@lru_cache()
def get_settings() -> Settings:
    return Settings()
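Each `Settings` field default is overridden by a matching environment variable (or an entry in the `.env` file shown in `.env.example`); pydantic-settings matches field names to env vars case-insensitively. The precedence can be sketched with the stdlib alone (the helper name is illustrative, not part of the project):

```python
import os

def resolve_setting(name: str, default: str) -> str:
    """Env var (upper-cased field name) beats the coded default."""
    return os.environ.get(name.upper(), default)

# Simulate DATABASE_URL being set in the environment / .env file
os.environ["DATABASE_URL"] = "sqlite:///./test.sqlite"
url = resolve_setting("database_url", "sqlite:////data/db/plants.sqlite")

# An unset variable falls back to the coded default
fallback = resolve_setting("plantguide_missing_setting", "fallback-value")
```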
44
backend/app/database.py
Normal file
@@ -0,0 +1,44 @@
from sqlalchemy import create_engine, event
from sqlalchemy.orm import sessionmaker, declarative_base
from sqlalchemy.pool import StaticPool

from app.config import get_settings

settings = get_settings()

# SQLite-specific configuration
connect_args = {"check_same_thread": False}

engine = create_engine(
    settings.database_url,
    connect_args=connect_args,
    poolclass=StaticPool,
    echo=False,
)


# Enable WAL mode for better concurrent access
@event.listens_for(engine, "connect")
def set_sqlite_pragma(dbapi_connection, connection_record):
    cursor = dbapi_connection.cursor()
    cursor.execute("PRAGMA journal_mode=WAL")
    cursor.execute("PRAGMA synchronous=NORMAL")
    cursor.execute("PRAGMA foreign_keys=ON")
    cursor.close()


SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)

Base = declarative_base()


def get_db():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()


def init_db():
    """Create all tables."""
    from app.models import species, image, job, api_key, export, cached_stats  # noqa
    Base.metadata.create_all(bind=engine)
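The `connect` listener above runs the three pragmas on every new DBAPI connection. The same pragmas can be checked directly with the stdlib `sqlite3` driver (note WAL only takes effect for on-disk databases, so an in-memory connection is used here just to verify the foreign-key switch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys=ON")     # FK enforcement is OFF by default in SQLite
conn.execute("PRAGMA synchronous=NORMAL")  # fewer fsyncs; a common pairing with WAL

fk_on = conn.execute("PRAGMA foreign_keys").fetchone()[0]  # 1 when enforcement is active
conn.close()
```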
95
backend/app/main.py
Normal file
@@ -0,0 +1,95 @@
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

from app.config import get_settings
from app.database import init_db
from app.api import species, images, jobs, exports, stats, sources

settings = get_settings()

app = FastAPI(
    title="PlantGuideScraper API",
    description="Web scraper interface for houseplant image collection",
    version="1.0.0",
)

# CORS middleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Include routers
app.include_router(species.router, prefix="/api/species", tags=["Species"])
app.include_router(images.router, prefix="/api/images", tags=["Images"])
app.include_router(jobs.router, prefix="/api/jobs", tags=["Jobs"])
app.include_router(exports.router, prefix="/api/exports", tags=["Exports"])
app.include_router(stats.router, prefix="/api/stats", tags=["Stats"])
app.include_router(sources.router, prefix="/api/sources", tags=["Sources"])


@app.on_event("startup")
async def startup_event():
    """Initialize database on startup."""
    init_db()


@app.get("/health")
async def health_check():
    """Health check endpoint."""
    return {"status": "healthy", "service": "plant-scraper"}


@app.get("/api/debug")
async def debug_check():
    """Debug endpoint - checks database connection."""
    import time
    from app.database import SessionLocal
    from app.models import Species, Image

    results = {"status": "checking", "checks": {}}

    # Check 1: Can we create a session?
    try:
        start = time.time()
        db = SessionLocal()
        results["checks"]["session_create"] = {"ok": True, "ms": int((time.time() - start) * 1000)}
    except Exception as e:
        results["checks"]["session_create"] = {"ok": False, "error": str(e)}
        results["status"] = "error"
        return results

    # Check 2: Simple query - count species
    try:
        start = time.time()
        count = db.query(Species).count()
        results["checks"]["species_count"] = {"ok": True, "count": count, "ms": int((time.time() - start) * 1000)}
    except Exception as e:
        results["checks"]["species_count"] = {"ok": False, "error": str(e)}
        results["status"] = "error"
        db.close()
        return results

    # Check 3: Count images
    try:
        start = time.time()
        count = db.query(Image).count()
        results["checks"]["image_count"] = {"ok": True, "count": count, "ms": int((time.time() - start) * 1000)}
    except Exception as e:
        results["checks"]["image_count"] = {"ok": False, "error": str(e)}
        results["status"] = "error"
        db.close()
        return results

    db.close()
    results["status"] = "healthy"
    return results


@app.get("/")
async def root():
    """Root endpoint."""
    return {"message": "PlantGuideScraper API", "docs": "/docs"}
8
backend/app/models/__init__.py
Normal file
@@ -0,0 +1,8 @@
from app.models.species import Species
from app.models.image import Image
from app.models.job import Job
from app.models.api_key import ApiKey
from app.models.export import Export
from app.models.cached_stats import CachedStats

__all__ = ["Species", "Image", "Job", "ApiKey", "Export", "CachedStats"]
18
backend/app/models/api_key.py
Normal file
@@ -0,0 +1,18 @@
from sqlalchemy import Column, Integer, String, Float, Boolean

from app.database import Base


class ApiKey(Base):
    __tablename__ = "api_keys"

    id = Column(Integer, primary_key=True, index=True)
    source = Column(String, unique=True, nullable=False)  # 'flickr', 'inaturalist', 'wikimedia', 'trefle'
    api_key = Column(String, nullable=False)  # Also used as Client ID for OAuth sources
    api_secret = Column(String, nullable=True)  # Also used as Client Secret for OAuth sources
    access_token = Column(String, nullable=True)  # For OAuth sources like Wikimedia
    rate_limit_per_sec = Column(Float, default=1.0)
    enabled = Column(Boolean, default=True)

    def __repr__(self):
        return f"<ApiKey(id={self.id}, source='{self.source}', enabled={self.enabled})>"
14
backend/app/models/cached_stats.py
Normal file
@@ -0,0 +1,14 @@
from datetime import datetime
from sqlalchemy import Column, Integer, String, Text, DateTime

from app.database import Base


class CachedStats(Base):
    """Stores pre-calculated statistics updated by Celery beat."""
    __tablename__ = "cached_stats"

    id = Column(Integer, primary_key=True, index=True)
    key = Column(String(50), unique=True, nullable=False, index=True)
    value = Column(Text, nullable=False)  # JSON-encoded stats
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
24
backend/app/models/export.py
Normal file
@@ -0,0 +1,24 @@
from sqlalchemy import Column, Integer, String, Float, DateTime, Text, func

from app.database import Base


class Export(Base):
    __tablename__ = "exports"

    id = Column(Integer, primary_key=True, index=True)
    name = Column(String, nullable=False)
    filter_criteria = Column(Text, nullable=True)  # JSON: min_images, licenses, min_quality, species_ids
    train_split = Column(Float, default=0.8)
    status = Column(String, default="pending")  # pending, generating, completed, failed
    file_path = Column(String, nullable=True)
    file_size = Column(Integer, nullable=True)
    species_count = Column(Integer, nullable=True)
    image_count = Column(Integer, nullable=True)
    celery_task_id = Column(String, nullable=True)
    created_at = Column(DateTime, server_default=func.now())
    completed_at = Column(DateTime, nullable=True)
    error_message = Column(Text, nullable=True)

    def __repr__(self):
        return f"<Export(id={self.id}, name='{self.name}', status='{self.status}')>"
36
backend/app/models/image.py
Normal file
@@ -0,0 +1,36 @@
from sqlalchemy import Column, Integer, String, Float, DateTime, ForeignKey, func, UniqueConstraint, Index
from sqlalchemy.orm import relationship

from app.database import Base


class Image(Base):
    __tablename__ = "images"

    id = Column(Integer, primary_key=True, index=True)
    species_id = Column(Integer, ForeignKey("species.id"), nullable=False, index=True)
    source = Column(String, nullable=False, index=True)
    source_id = Column(String, nullable=True)
    url = Column(String, nullable=False)
    local_path = Column(String, nullable=True)
    license = Column(String, nullable=False, index=True)
    attribution = Column(String, nullable=True)
    width = Column(Integer, nullable=True)
    height = Column(Integer, nullable=True)
    phash = Column(String, nullable=True, index=True)
    quality_score = Column(Float, nullable=True)
    status = Column(String, default="pending", index=True)  # pending, downloaded, rejected, deleted
    created_at = Column(DateTime, server_default=func.now())

    # Composite indexes for common query patterns
    __table_args__ = (
        UniqueConstraint("source", "source_id", name="uq_source_source_id"),
        Index("ix_images_species_status", "species_id", "status"),  # For counting images per species by status
        Index("ix_images_status_created", "status", "created_at"),  # For listing images by status
    )

    # Relationships
    species = relationship("Species", back_populates="images")

    def __repr__(self):
        return f"<Image(id={self.id}, source='{self.source}', status='{self.status}')>"
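The `phash` column above stores a perceptual hash to support the deduplication step of the image quality pipeline. A minimal sketch of how two hex-encoded hashes might be compared by Hamming distance (the helper names and the threshold value are assumptions for illustration; the project's actual dedup logic lives in the pipeline, not in this model):

```python
def hamming_distance(phash_a: str, phash_b: str) -> int:
    """Count differing bits between two equal-length hex hash strings."""
    return bin(int(phash_a, 16) ^ int(phash_b, 16)).count("1")

def is_near_duplicate(phash_a: str, phash_b: str, threshold: int = 8) -> bool:
    """Hypothetical rule: hashes within `threshold` bits are near-duplicates."""
    return hamming_distance(phash_a, phash_b) <= threshold

d = hamming_distance("ffffffffffffffff", "fffffffffffffffe")  # differ in 1 bit
```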
27
backend/app/models/job.py
Normal file
@@ -0,0 +1,27 @@
from sqlalchemy import Column, Integer, String, DateTime, Text, Boolean, func

from app.database import Base


class Job(Base):
    __tablename__ = "jobs"

    id = Column(Integer, primary_key=True, index=True)
    name = Column(String, nullable=False)
    source = Column(String, nullable=False)
    species_filter = Column(Text, nullable=True)  # JSON array of species IDs, or NULL for all
    only_without_images = Column(Boolean, default=False)  # If True, only scrape species with 0 images
    max_images = Column(Integer, nullable=True)  # If set, only scrape species with fewer than N images
    status = Column(String, default="pending", index=True)  # pending, running, paused, completed, failed
    progress_current = Column(Integer, default=0)
    progress_total = Column(Integer, default=0)
    images_downloaded = Column(Integer, default=0)
    images_rejected = Column(Integer, default=0)
    celery_task_id = Column(String, nullable=True)
    started_at = Column(DateTime, nullable=True)
    completed_at = Column(DateTime, nullable=True)
    error_message = Column(Text, nullable=True)
    created_at = Column(DateTime, server_default=func.now())

    def __repr__(self):
        return f"<Job(id={self.id}, name='{self.name}', status='{self.status}')>"
21
backend/app/models/species.py
Normal file
@@ -0,0 +1,21 @@
from sqlalchemy import Column, Integer, String, DateTime, func
from sqlalchemy.orm import relationship

from app.database import Base


class Species(Base):
    __tablename__ = "species"

    id = Column(Integer, primary_key=True, index=True)
    scientific_name = Column(String, unique=True, nullable=False, index=True)
    common_name = Column(String, nullable=True)
    genus = Column(String, nullable=True, index=True)
    family = Column(String, nullable=True)
    created_at = Column(DateTime, server_default=func.now())

    # Relationships
    images = relationship("Image", back_populates="species", cascade="all, delete-orphan")

    def __repr__(self):
        return f"<Species(id={self.id}, scientific_name='{self.scientific_name}')>"
15
backend/app/schemas/__init__.py
Normal file
@@ -0,0 +1,15 @@
from app.schemas.species import SpeciesCreate, SpeciesUpdate, SpeciesResponse, SpeciesListResponse
from app.schemas.image import ImageResponse, ImageListResponse, ImageFilter
from app.schemas.job import JobCreate, JobResponse, JobListResponse
from app.schemas.api_key import ApiKeyCreate, ApiKeyUpdate, ApiKeyResponse
from app.schemas.export import ExportCreate, ExportResponse, ExportListResponse
from app.schemas.stats import StatsResponse, SourceStats, SpeciesStats

__all__ = [
    "SpeciesCreate", "SpeciesUpdate", "SpeciesResponse", "SpeciesListResponse",
    "ImageResponse", "ImageListResponse", "ImageFilter",
    "JobCreate", "JobResponse", "JobListResponse",
    "ApiKeyCreate", "ApiKeyUpdate", "ApiKeyResponse",
    "ExportCreate", "ExportResponse", "ExportListResponse",
    "StatsResponse", "SourceStats", "SpeciesStats",
]
36
backend/app/schemas/api_key.py
Normal file
@@ -0,0 +1,36 @@
from pydantic import BaseModel
from typing import Optional


class ApiKeyBase(BaseModel):
    source: str
    api_key: Optional[str] = None  # Optional for no-auth sources; used as Client ID for OAuth
    api_secret: Optional[str] = None  # Also used as Client Secret for OAuth sources
    access_token: Optional[str] = None  # For OAuth sources like Wikimedia
    rate_limit_per_sec: float = 1.0
    enabled: bool = True


class ApiKeyCreate(ApiKeyBase):
    pass


class ApiKeyUpdate(BaseModel):
    api_key: Optional[str] = None
    api_secret: Optional[str] = None
    access_token: Optional[str] = None
    rate_limit_per_sec: Optional[float] = None
    enabled: Optional[bool] = None


class ApiKeyResponse(BaseModel):
    id: int
    source: str
    api_key_masked: str  # Show only the last 4 characters
    has_secret: bool
    has_access_token: bool
    rate_limit_per_sec: float
    enabled: bool

    class Config:
        from_attributes = True
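`api_key_masked` is described as showing only the last four characters. A minimal sketch of such a masking helper (hypothetical; not the app's actual implementation):

```python
def mask_key(key: str, visible: int = 4) -> str:
    """Replace all but the last `visible` characters with asterisks."""
    if len(key) <= visible:
        # Too short to reveal anything safely: mask it entirely
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]

print(mask_key("1a2b3c4d5e6f"))  # ********5e6f
print(mask_key("abc"))           # ***
```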
45
backend/app/schemas/export.py
Normal file
@@ -0,0 +1,45 @@
from pydantic import BaseModel
from datetime import datetime
from typing import Optional, List


class ExportFilter(BaseModel):
    min_images_per_species: int = 100
    licenses: Optional[List[str]] = None  # None means all
    min_quality: Optional[float] = None
    species_ids: Optional[List[int]] = None  # None means all


class ExportCreate(BaseModel):
    name: str
    filter_criteria: ExportFilter
    train_split: float = 0.8


class ExportResponse(BaseModel):
    id: int
    name: str
    filter_criteria: Optional[str] = None
    train_split: float
    status: str
    file_path: Optional[str] = None
    file_size: Optional[int] = None
    species_count: Optional[int] = None
    image_count: Optional[int] = None
    created_at: datetime
    completed_at: Optional[datetime] = None
    error_message: Optional[str] = None

    class Config:
        from_attributes = True


class ExportListResponse(BaseModel):
    items: List[ExportResponse]
    total: int


class ExportPreview(BaseModel):
    species_count: int
    image_count: int
    estimated_size_mb: float
47
backend/app/schemas/image.py
Normal file
@@ -0,0 +1,47 @@
from pydantic import BaseModel
from datetime import datetime
from typing import Optional, List


class ImageBase(BaseModel):
    species_id: int
    source: str
    url: str
    license: str


class ImageResponse(BaseModel):
    id: int
    species_id: int
    species_name: Optional[str] = None
    source: str
    source_id: Optional[str] = None
    url: str
    local_path: Optional[str] = None
    license: str
    attribution: Optional[str] = None
    width: Optional[int] = None
    height: Optional[int] = None
    quality_score: Optional[float] = None
    status: str
    created_at: datetime

    class Config:
        from_attributes = True


class ImageListResponse(BaseModel):
    items: List[ImageResponse]
    total: int
    page: int
    page_size: int
    pages: int


class ImageFilter(BaseModel):
    species_id: Optional[int] = None
    source: Optional[str] = None
    license: Optional[str] = None
    status: Optional[str] = None
    min_quality: Optional[float] = None
    search: Optional[str] = None
35
backend/app/schemas/job.py
Normal file
@@ -0,0 +1,35 @@
from pydantic import BaseModel
from datetime import datetime
from typing import Optional, List


class JobCreate(BaseModel):
    name: str
    source: str
    species_ids: Optional[List[int]] = None  # None means all species
    only_without_images: bool = False  # If True, only scrape species with 0 images
    max_images: Optional[int] = None  # If set, only scrape species with fewer than N images


class JobResponse(BaseModel):
    id: int
    name: str
    source: str
    species_filter: Optional[str] = None
    status: str
    progress_current: int
    progress_total: int
    images_downloaded: int
    images_rejected: int
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None
    error_message: Optional[str] = None
    created_at: datetime

    class Config:
        from_attributes = True


class JobListResponse(BaseModel):
    items: List[JobResponse]
    total: int
44
backend/app/schemas/species.py
Normal file
@@ -0,0 +1,44 @@
from pydantic import BaseModel
from datetime import datetime
from typing import Optional, List


class SpeciesBase(BaseModel):
    scientific_name: str
    common_name: Optional[str] = None
    genus: Optional[str] = None
    family: Optional[str] = None


class SpeciesCreate(SpeciesBase):
    pass


class SpeciesUpdate(BaseModel):
    scientific_name: Optional[str] = None
    common_name: Optional[str] = None
    genus: Optional[str] = None
    family: Optional[str] = None


class SpeciesResponse(SpeciesBase):
    id: int
    created_at: datetime
    image_count: int = 0

    class Config:
        from_attributes = True


class SpeciesListResponse(BaseModel):
    items: List[SpeciesResponse]
    total: int
    page: int
    page_size: int
    pages: int


class SpeciesImportResponse(BaseModel):
    imported: int
    skipped: int
    errors: List[str]
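`from_attributes = True` lets these response models be built straight from ORM rows. A minimal sketch using a plain object in place of a real SQLAlchemy row (assumes Pydantic v2's `model_validate`):

```python
from datetime import datetime
from typing import Optional
from pydantic import BaseModel

class SpeciesResponse(BaseModel):
    id: int
    scientific_name: str
    common_name: Optional[str] = None
    created_at: datetime
    image_count: int = 0

    class Config:
        from_attributes = True  # read fields from object attributes, not just dicts

class FakeRow:
    # Stands in for a SQLAlchemy Species row in this sketch
    id = 1
    scientific_name = "Monstera deliciosa"
    common_name = "Swiss cheese plant"
    created_at = datetime(2024, 1, 1)

resp = SpeciesResponse.model_validate(FakeRow())
print(resp.scientific_name)  # Monstera deliciosa
```

`image_count` falls back to its default of 0 because `FakeRow` has no such attribute.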
43
backend/app/schemas/stats.py
Normal file
@@ -0,0 +1,43 @@
from pydantic import BaseModel
from typing import List, Optional


class SourceStats(BaseModel):
    source: str
    image_count: int
    downloaded: int
    pending: int
    rejected: int


class LicenseStats(BaseModel):
    license: str
    count: int


class SpeciesStats(BaseModel):
    id: int
    scientific_name: str
    common_name: Optional[str]
    image_count: int


class JobStats(BaseModel):
    running: int
    pending: int
    completed: int
    failed: int


class StatsResponse(BaseModel):
    total_species: int
    total_images: int
    images_downloaded: int
    images_pending: int
    images_rejected: int
    disk_usage_mb: float
    sources: List[SourceStats]
    licenses: List[LicenseStats]
    jobs: JobStats
    top_species: List[SpeciesStats]
    under_represented: List[SpeciesStats]  # Species with < 100 images
41
backend/app/scrapers/__init__.py
Normal file
@@ -0,0 +1,41 @@
from typing import Optional

from app.scrapers.base import BaseScraper
from app.scrapers.inaturalist import INaturalistScraper
from app.scrapers.flickr import FlickrScraper
from app.scrapers.wikimedia import WikimediaScraper
from app.scrapers.trefle import TrefleScraper
from app.scrapers.gbif import GBIFScraper
from app.scrapers.duckduckgo import DuckDuckGoScraper
from app.scrapers.bing import BingScraper


def get_scraper(source: str) -> Optional[BaseScraper]:
    """Return a scraper instance for the given source name, or None if unknown."""
    scrapers = {
        "inaturalist": INaturalistScraper,
        "flickr": FlickrScraper,
        "wikimedia": WikimediaScraper,
        "trefle": TrefleScraper,
        "gbif": GBIFScraper,
        "duckduckgo": DuckDuckGoScraper,
        "bing": BingScraper,
    }

    scraper_class = scrapers.get(source)
    if scraper_class:
        return scraper_class()
    return None


__all__ = [
    "get_scraper",
    "BaseScraper",
    "INaturalistScraper",
    "FlickrScraper",
    "WikimediaScraper",
    "TrefleScraper",
    "GBIFScraper",
    "DuckDuckGoScraper",
    "BingScraper",
]
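`get_scraper` is a plain registry lookup. The same pattern in isolation, with hypothetical scraper classes standing in for the app's real ones:

```python
from typing import Dict, Optional, Type

class BaseScraper:
    name: str = "base"

class FlickrScraper(BaseScraper):
    name = "flickr"

class GBIFScraper(BaseScraper):
    name = "gbif"

# Build the registry from each class's own `name` attribute, so adding
# a new scraper only requires adding it to this tuple.
_REGISTRY: Dict[str, Type[BaseScraper]] = {
    cls.name: cls for cls in (FlickrScraper, GBIFScraper)
}

def get_scraper(source: str) -> Optional[BaseScraper]:
    cls = _REGISTRY.get(source)
    return cls() if cls else None

print(type(get_scraper("flickr")).__name__)  # FlickrScraper
print(get_scraper("unknown"))                # None
```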
57
backend/app/scrapers/base.py
Normal file
@@ -0,0 +1,57 @@
from abc import ABC, abstractmethod
from typing import Dict, Optional
import logging

from sqlalchemy.orm import Session

from app.models import Species, ApiKey


class BaseScraper(ABC):
    """Base class for all image scrapers."""

    name: str = "base"
    requires_api_key: bool = True

    @abstractmethod
    def scrape_species(
        self,
        species: Species,
        db: Session,
        logger: Optional[logging.Logger] = None
    ) -> Dict[str, int]:
        """
        Scrape images for a species.

        Args:
            species: The species to scrape images for
            db: Database session
            logger: Optional logger for debugging

        Returns:
            Dict with 'downloaded' and 'rejected' counts
        """
        pass

    @abstractmethod
    def test_connection(self, api_key: ApiKey) -> str:
        """
        Test the API connection.

        Args:
            api_key: The API key configuration

        Returns:
            A success message

        Raises:
            Exception: If the connection fails
        """
        pass

    def get_api_key(self, db: Session) -> Optional[ApiKey]:
        """Get the enabled API key for this scraper, or None if not configured."""
        return db.query(ApiKey).filter(
            ApiKey.source == self.name,
            ApiKey.enabled == True,
        ).first()
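The abstract contract above can be exercised with a toy subclass. A standalone sketch (simplified signatures, no database) of what implementers must provide:

```python
from abc import ABC, abstractmethod
from typing import Dict

class BaseScraper(ABC):
    name: str = "base"

    @abstractmethod
    def scrape_species(self, species_name: str) -> Dict[str, int]:
        """Return 'downloaded' and 'rejected' counts."""

class DummyScraper(BaseScraper):
    name = "dummy"

    def scrape_species(self, species_name: str) -> Dict[str, int]:
        # Pretend everything was rejected by the quality pipeline
        return {"downloaded": 0, "rejected": 3}

result = DummyScraper().scrape_species("Monstera deliciosa")
print(result)  # {'downloaded': 0, 'rejected': 3}
```

Instantiating `BaseScraper` directly raises `TypeError`, which is how the ABC enforces that every concrete scraper implements `scrape_species`.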
228
backend/app/scrapers/bhl.py
Normal file
@@ -0,0 +1,228 @@
import time
import logging
from typing import Dict, Optional

import httpx
from sqlalchemy.orm import Session

from app.scrapers.base import BaseScraper
from app.models import Species, Image, ApiKey
from app.workers.quality_tasks import download_and_process_image


class BHLScraper(BaseScraper):
    """Scraper for Biodiversity Heritage Library (BHL) images.

    BHL provides access to digitized biodiversity literature and illustrations.
    Most content is public domain (pre-1927) or CC-licensed.

    Note: BHL images are primarily historical botanical illustrations,
    which may differ from photographs but are valuable for training.
    """

    name = "bhl"
    requires_api_key = True  # BHL requires a free API key

    BASE_URL = "https://www.biodiversitylibrary.org/api3"

    HEADERS = {
        "User-Agent": "PlantGuideScraper/1.0 (Plant image collection for ML training)",
        "Accept": "application/json",
    }

    # BHL content is mostly public domain
    ALLOWED_LICENSES = {"CC0", "CC-BY", "CC-BY-SA", "PD"}

    def scrape_species(
        self,
        species: Species,
        db: Session,
        logger: Optional[logging.Logger] = None
    ) -> Dict[str, int]:
        """Scrape images from BHL for a species."""
        api_key = self.get_api_key(db)
        if not api_key:
            return {"downloaded": 0, "rejected": 0, "error": "No API key configured"}

        rate_limit = api_key.rate_limit_per_sec or 0.5

        downloaded = 0
        rejected = 0

        def log(level: str, msg: str):
            if logger:
                getattr(logger, level)(msg)

        try:
            # Disable SSL verification - some Docker environments lack proper CA certificates
            with httpx.Client(timeout=30, headers=self.HEADERS, verify=False) as client:
                # Search for the name in BHL
                search_response = client.get(
                    self.BASE_URL,
                    params={
                        "op": "NameSearch",
                        "name": species.scientific_name,
                        "format": "json",
                        "apikey": api_key.api_key,
                    },
                )
                search_response.raise_for_status()
                search_data = search_response.json()

                results = search_data.get("Result", [])
                if not results:
                    log("info", f"  Species not found in BHL: {species.scientific_name}")
                    return {"downloaded": 0, "rejected": 0}

                time.sleep(1.0 / rate_limit)

                # Get pages with illustrations for each name result
                for name_result in results[:5]:  # Limit to top 5 matches
                    name_bank_id = name_result.get("NameBankID")
                    if not name_bank_id:
                        continue

                    # Get publications with this name
                    pub_response = client.get(
                        self.BASE_URL,
                        params={
                            "op": "NameGetDetail",
                            "namebankid": name_bank_id,
                            "format": "json",
                            "apikey": api_key.api_key,
                        },
                    )
                    pub_response.raise_for_status()
                    pub_data = pub_response.json()

                    time.sleep(1.0 / rate_limit)

                    # Extract titles and get page images
                    for title in pub_data.get("Result", []):
                        title_id = title.get("TitleID")
                        if not title_id:
                            continue

                        # Get pages for this title
                        pages_response = client.get(
                            self.BASE_URL,
                            params={
                                "op": "GetPageMetadata",
                                "titleid": title_id,
                                "format": "json",
                                "apikey": api_key.api_key,
                                "ocr": "false",
                                "names": "false",
                            },
                        )

                        if pages_response.status_code != 200:
                            continue

                        pages_data = pages_response.json()
                        pages = pages_data.get("Result", [])

                        time.sleep(1.0 / rate_limit)

                        # Look for pages that are likely illustrations
                        for page in pages[:100]:  # Limit pages per title
                            page_types = page.get("PageTypes", [])

                            # Only keep illustration/plate pages; pages with no
                            # type metadata are kept and filtered downstream
                            is_illustration = any(
                                pt.get("PageTypeName", "").lower() in ("illustration", "plate", "figure", "map")
                                for pt in page_types
                            ) if page_types else False

                            if not is_illustration and page_types:
                                continue

                            page_id = page.get("PageID")
                            if not page_id:
                                continue

                            # Construct the image URL; BHL provides multiple image sizes
                            image_url = f"https://www.biodiversitylibrary.org/pageimage/{page_id}"

                            # Check if this page is already in the database
                            source_id = str(page_id)
                            existing = db.query(Image).filter(
                                Image.source == self.name,
                                Image.source_id == source_id,
                            ).first()

                            if existing:
                                continue

                            # Determine the license - BHL content is usually public domain
                            year = None
                            try:
                                # Try to extract the publication year
                                if "Year" in page:
                                    year = int(page.get("Year", 0))
                            except (ValueError, TypeError):
                                pass

                            # Content published before 1927 is public domain in the US
                            if year and year < 1927:
                                license_code = "PD"
                            else:
                                license_code = "CC0"  # BHL default for older works

                            # Build attribution
                            title_name = title.get("ShortTitle", title.get("FullTitle", "Unknown"))
                            attribution = f"From '{title_name}' via Biodiversity Heritage Library ({license_code})"

                            # Create the image record
                            image = Image(
                                species_id=species.id,
                                source=self.name,
                                source_id=source_id,
                                url=image_url,
                                license=license_code,
                                attribution=attribution,
                                status="pending",
                            )
                            db.add(image)
                            db.commit()

                            # Queue for download
                            download_and_process_image.delay(image.id)
                            downloaded += 1

                            # Limit the total per species
                            if downloaded >= 50:
                                break

                        if downloaded >= 50:
                            break

                    if downloaded >= 50:
                        break

        except httpx.HTTPStatusError as e:
            log("error", f"  HTTP error for {species.scientific_name}: {e.response.status_code}")
        except Exception as e:
            log("error", f"  Error scraping BHL for {species.scientific_name}: {e}")

        return {"downloaded": downloaded, "rejected": rejected}

    def test_connection(self, api_key: ApiKey) -> str:
        """Test the BHL API connection."""
        with httpx.Client(timeout=10, headers=self.HEADERS, verify=False) as client:
            response = client.get(
                self.BASE_URL,
                params={
                    "op": "NameSearch",
                    "name": "Rosa",
                    "format": "json",
                    "apikey": api_key.api_key,
                },
            )
            response.raise_for_status()
            data = response.json()

        results = data.get("Result", [])
        return f"BHL API connection successful ({len(results)} results for 'Rosa')"
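The license heuristic above keys off the publication year. Isolated as a tiny function (mirroring the cutoff the scraper uses):

```python
from typing import Optional

def bhl_license(year: Optional[int]) -> str:
    """Works published before 1927 are treated as US public domain;
    everything else falls back to BHL's CC0 default."""
    if year and year < 1927:
        return "PD"
    return "CC0"

print(bhl_license(1890), bhl_license(1950), bhl_license(None))  # PD CC0 CC0
```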
135
backend/app/scrapers/bing.py
Normal file
@@ -0,0 +1,135 @@
import hashlib
import time
import logging
from typing import Dict, Optional

import httpx
from sqlalchemy.orm import Session

from app.scrapers.base import BaseScraper
from app.models import Species, Image, ApiKey
from app.workers.quality_tasks import download_and_process_image


class BingScraper(BaseScraper):
    """Scraper for the Bing Image Search v7 API (Azure Cognitive Services)."""

    name = "bing"
    requires_api_key = True

    BASE_URL = "https://api.bing.microsoft.com/v7.0/images/search"

    NEGATIVE_TERMS = "-herbarium -specimen -illustration -drawing -diagram -dried -pressed"

    LICENSE_MAP = {
        "Public": "CC0",
        "Share": "CC-BY-SA",
        "ShareCommercially": "CC-BY",
        "Modify": "CC-BY-SA",
        "ModifyCommercially": "CC-BY",
    }

    def _build_queries(self, species: Species) -> list[str]:
        queries = [f'"{species.scientific_name}" plant photo {self.NEGATIVE_TERMS}']
        if species.common_name:
            queries.append(f'"{species.common_name}" houseplant photo {self.NEGATIVE_TERMS}')
        return queries

    def scrape_species(
        self,
        species: Species,
        db: Session,
        logger: Optional[logging.Logger] = None,
    ) -> Dict[str, int]:
        api_key = self.get_api_key(db)
        if not api_key:
            return {"downloaded": 0, "rejected": 0}

        rate_limit = api_key.rate_limit_per_sec or 3.0
        downloaded = 0
        rejected = 0
        seen_urls = set()

        headers = {
            "Ocp-Apim-Subscription-Key": api_key.api_key,
        }

        try:
            queries = self._build_queries(species)

            with httpx.Client(timeout=30, headers=headers) as client:
                for query in queries:
                    params = {
                        "q": query,
                        "imageType": "Photo",
                        "license": "ShareCommercially",
                        "count": 50,
                    }

                    response = client.get(self.BASE_URL, params=params)
                    response.raise_for_status()
                    data = response.json()

                    for result in data.get("value", []):
                        url = result.get("contentUrl")
                        if not url or url in seen_urls:
                            continue
                        seen_urls.add(url)

                        # Use Bing's imageId, falling back to an MD5 hash of the URL
                        source_id = result.get("imageId") or hashlib.md5(url.encode()).hexdigest()[:16]

                        existing = db.query(Image).filter(
                            Image.source == self.name,
                            Image.source_id == source_id,
                        ).first()

                        if existing:
                            continue

                        # Map the Bing license value to a short code
                        bing_license = result.get("license", "")
                        license_code = self.LICENSE_MAP.get(bing_license, "UNKNOWN")

                        host = result.get("hostPageDisplayUrl", "")
                        attribution = f"via Bing ({host})" if host else "via Bing Image Search"

                        image = Image(
                            species_id=species.id,
                            source=self.name,
                            source_id=source_id,
                            url=url,
                            width=result.get("width"),
                            height=result.get("height"),
                            license=license_code,
                            attribution=attribution,
                            status="pending",
                        )
                        db.add(image)
                        db.commit()

                        download_and_process_image.delay(image.id)
                        downloaded += 1

                    # Throttle between search requests
                    time.sleep(1.0 / rate_limit)

        except Exception as e:
            if logger:
                logger.error(f"Error scraping Bing for {species.scientific_name}: {e}")
            else:
                print(f"Error scraping Bing for {species.scientific_name}: {e}")

        return {"downloaded": downloaded, "rejected": rejected}

    def test_connection(self, api_key: ApiKey) -> str:
        headers = {"Ocp-Apim-Subscription-Key": api_key.api_key}
        with httpx.Client(timeout=10, headers=headers) as client:
            response = client.get(
                self.BASE_URL,
                params={"q": "Monstera deliciosa plant", "count": 1},
            )
            response.raise_for_status()
            data = response.json()

        count = data.get("totalEstimatedMatches", 0)
        return f"Bing Image Search working ({count:,} estimated matches)"
101
backend/app/scrapers/duckduckgo.py
Normal file
@@ -0,0 +1,101 @@
import hashlib
import time
import logging
from typing import Dict, Optional

from duckduckgo_search import DDGS
from sqlalchemy.orm import Session

from app.scrapers.base import BaseScraper
from app.models import Species, Image, ApiKey
from app.workers.quality_tasks import download_and_process_image


class DuckDuckGoScraper(BaseScraper):
    """Scraper for DuckDuckGo image search. No API key is required."""

    name = "duckduckgo"
    requires_api_key = False

    NEGATIVE_TERMS = "-herbarium -specimen -illustration -drawing -diagram -dried -pressed"

    def _build_queries(self, species: Species) -> list[str]:
        queries = [f'"{species.scientific_name}" plant photo {self.NEGATIVE_TERMS}']
        if species.common_name:
            queries.append(f'"{species.common_name}" houseplant photo {self.NEGATIVE_TERMS}')
        return queries

    def scrape_species(
        self,
        species: Species,
        db: Session,
        logger: Optional[logging.Logger] = None,
    ) -> Dict[str, int]:
        api_key = self.get_api_key(db)
        rate_limit = api_key.rate_limit_per_sec if api_key else 0.5

        downloaded = 0
        rejected = 0
        seen_urls = set()

        try:
            queries = self._build_queries(species)

            with DDGS() as ddgs:
                for query in queries:
                    results = ddgs.images(
                        keywords=query,
                        type_image="photo",
                        max_results=50,
                    )

                    for result in results:
                        url = result.get("image")
                        if not url or url in seen_urls:
                            continue
                        seen_urls.add(url)

                        source_id = hashlib.md5(url.encode()).hexdigest()[:16]

                        # Check if already exists
                        existing = db.query(Image).filter(
                            Image.source == self.name,
                            Image.source_id == source_id,
                        ).first()

                        if existing:
                            continue

                        title = result.get("title", "")
                        attribution = f"{title} via DuckDuckGo" if title else "via DuckDuckGo"

                        image = Image(
                            species_id=species.id,
                            source=self.name,
                            source_id=source_id,
                            url=url,
                            license="UNKNOWN",
                            attribution=attribution,
                            status="pending",
                        )
                        db.add(image)
                        db.commit()

                        download_and_process_image.delay(image.id)
                        downloaded += 1

                    # Throttle between search requests
                    time.sleep(1.0 / rate_limit)

        except Exception as e:
            if logger:
                logger.error(f"Error scraping DuckDuckGo for {species.scientific_name}: {e}")
            else:
                print(f"Error scraping DuckDuckGo for {species.scientific_name}: {e}")

        return {"downloaded": downloaded, "rejected": rejected}

    def test_connection(self, api_key: ApiKey) -> str:
        with DDGS() as ddgs:
            results = ddgs.images(keywords="Monstera deliciosa plant", max_results=1)
            count = len(list(results))
        return f"DuckDuckGo search working ({count} test result)"
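DuckDuckGo results carry no stable identifier, so the scraper derives `source_id` from an MD5 hash of the image URL. The same idea in isolation:

```python
import hashlib

def url_source_id(url: str) -> str:
    """Stable 16-character identifier derived from the image URL,
    used for deduplication when the source provides no native ID."""
    return hashlib.md5(url.encode()).hexdigest()[:16]

sid = url_source_id("https://example.com/plant.jpg")
print(len(sid))  # 16
# Deterministic: the same URL always maps to the same source_id
assert url_source_id("https://example.com/plant.jpg") == sid
```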
226
backend/app/scrapers/eol.py
Normal file
@@ -0,0 +1,226 @@
import time
import logging
from typing import Dict, Optional

import httpx
from sqlalchemy.orm import Session

from app.scrapers.base import BaseScraper
from app.models import Species, Image, ApiKey
from app.workers.quality_tasks import download_and_process_image


class EOLScraper(BaseScraper):
    """Scraper for Encyclopedia of Life (EOL) images.

    EOL aggregates biodiversity data from many sources and provides
    a free API with no authentication required.
    """

    name = "eol"
    requires_api_key = False

    BASE_URL = "https://eol.org/api"

    HEADERS = {
        "User-Agent": "PlantGuideScraper/1.0 (Plant image collection for ML training)",
        "Accept": "application/json",
    }

    # Map EOL license URLs to short codes
    LICENSE_MAP = {
        "http://creativecommons.org/publicdomain/zero/1.0/": "CC0",
        "http://creativecommons.org/publicdomain/mark/1.0/": "CC0",
        "http://creativecommons.org/licenses/by/2.0/": "CC-BY",
        "http://creativecommons.org/licenses/by/3.0/": "CC-BY",
        "http://creativecommons.org/licenses/by/4.0/": "CC-BY",
        "http://creativecommons.org/licenses/by-sa/2.0/": "CC-BY-SA",
        "http://creativecommons.org/licenses/by-sa/3.0/": "CC-BY-SA",
        "http://creativecommons.org/licenses/by-sa/4.0/": "CC-BY-SA",
        "https://creativecommons.org/publicdomain/zero/1.0/": "CC0",
        "https://creativecommons.org/publicdomain/mark/1.0/": "CC0",
        "https://creativecommons.org/licenses/by/2.0/": "CC-BY",
        "https://creativecommons.org/licenses/by/3.0/": "CC-BY",
        "https://creativecommons.org/licenses/by/4.0/": "CC-BY",
        "https://creativecommons.org/licenses/by-sa/2.0/": "CC-BY-SA",
        "https://creativecommons.org/licenses/by-sa/3.0/": "CC-BY-SA",
        "https://creativecommons.org/licenses/by-sa/4.0/": "CC-BY-SA",
        "pd": "CC0",  # Public domain
        "public domain": "CC0",
    }

    # Commercial-safe licenses
    ALLOWED_LICENSES = {"CC0", "CC-BY", "CC-BY-SA"}

    def scrape_species(
        self,
        species: Species,
        db: Session,
        logger: Optional[logging.Logger] = None
    ) -> Dict[str, int]:
        """Scrape images from EOL for a species."""
        api_key = self.get_api_key(db)
        rate_limit = api_key.rate_limit_per_sec if api_key else 0.5

        downloaded = 0
        rejected = 0

        def log(level: str, msg: str):
            if logger:
                getattr(logger, level)(msg)

        try:
            # Disable SSL verification - EOL is a trusted source and some Docker
            # environments lack proper CA certificates
            with httpx.Client(timeout=30, headers=self.HEADERS, verify=False) as client:
                # Step 1: Search for the species
                search_response = client.get(
                    f"{self.BASE_URL}/search/1.0.json",
                    params={
                        "q": species.scientific_name,
                        "page": 1,
                        "exact": "true",
                    },
                )
                search_response.raise_for_status()
                search_data = search_response.json()

                results = search_data.get("results", [])
                if not results:
                    log("info", f"  Species not found in EOL: {species.scientific_name}")
                    return {"downloaded": 0, "rejected": 0}

                # Get the EOL page ID
                eol_page_id = results[0].get("id")
                if not eol_page_id:
                    return {"downloaded": 0, "rejected": 0}

                time.sleep(1.0 / rate_limit)

                # Step 2: Get page details with images
                page_response = client.get(
                    f"{self.BASE_URL}/pages/1.0/{eol_page_id}.json",
                    params={
                        "images_per_page": 75,
                        "images_page": 1,
                        "videos_per_page": 0,
                        "sounds_per_page": 0,
                        "maps_per_page": 0,
                        "texts_per_page": 0,
                        "details": "true",
                        "licenses": "cc-by|cc-by-sa|pd|cc-by-nc",
                    },
                )
                page_response.raise_for_status()
                page_data = page_response.json()

                data_objects = page_data.get("dataObjects", [])
                log("debug", f"  Found {len(data_objects)} media objects")

                for obj in data_objects:
                    # Only process images ("image" also matches "StillImage")
                    media_type = obj.get("dataType", "")
                    if "image" not in media_type.lower():
                        continue

                    # Get the image URL
                    image_url = obj.get("eolMediaURL") or obj.get("mediaURL")
                    if not image_url:
                        rejected += 1
                        continue

                    # Check the license
                    license_url = obj.get("license", "").lower()
                    license_code = None

                    # Try to match the license URL
                    for pattern, code in self.LICENSE_MAP.items():
                        if pattern in license_url:
                            license_code = code
                            break

                    if not license_code:
                        # Reject non-commercial (NC) licenses
                        if "-nc" in license_url:
                            rejected += 1
                            continue
                        # Unknown license, skip
                        log("debug", f"  Rejected: unknown license {license_url}")
                        rejected += 1
                        continue

                    if license_code not in self.ALLOWED_LICENSES:
                        rejected += 1
                        continue

                    # Create a unique source ID
                    source_id = str(obj.get("dataObjectVersionID") or obj.get("identifier") or hash(image_url))

                    # Check if already exists
                    existing = db.query(Image).filter(
                        Image.source == self.name,
                        Image.source_id == source_id,
                    ).first()

                    if existing:
                        continue

                    # Build attribution
                    agents = obj.get("agents", [])
                    photographer = None
                    rights_holder = None

                    for agent in agents:
                        role = agent.get("role", "").lower()
                        name = agent.get("full_name", "")
                        if role == "photographer":
                            photographer = name
                        elif role == "owner" or role == "rights holder":
|
||||
rights_holder = name
|
||||
|
||||
attribution_parts = []
|
||||
if photographer:
|
||||
attribution_parts.append(f"Photo by {photographer}")
|
||||
if rights_holder and rights_holder != photographer:
|
||||
attribution_parts.append(f"Rights: {rights_holder}")
|
||||
attribution_parts.append(f"via EOL ({license_code})")
|
||||
attribution = " | ".join(attribution_parts)
|
||||
|
||||
# Create image record
|
||||
image = Image(
|
||||
species_id=species.id,
|
||||
source=self.name,
|
||||
source_id=source_id,
|
||||
url=image_url,
|
||||
license=license_code,
|
||||
attribution=attribution,
|
||||
status="pending",
|
||||
)
|
||||
db.add(image)
|
||||
db.commit()
|
||||
|
||||
# Queue for download
|
||||
download_and_process_image.delay(image.id)
|
||||
downloaded += 1
|
||||
|
||||
time.sleep(1.0 / rate_limit)
|
||||
|
||||
except httpx.HTTPStatusError as e:
|
||||
log("error", f" HTTP error for {species.scientific_name}: {e.response.status_code}")
|
||||
except Exception as e:
|
||||
log("error", f" Error scraping EOL for {species.scientific_name}: {e}")
|
||||
|
||||
return {"downloaded": downloaded, "rejected": rejected}
|
||||
|
||||
def test_connection(self, api_key: ApiKey) -> str:
|
||||
"""Test EOL API connection."""
|
||||
with httpx.Client(timeout=10, headers=self.HEADERS, verify=False) as client:
|
||||
response = client.get(
|
||||
f"{self.BASE_URL}/search/1.0.json",
|
||||
params={"q": "Rosa", "page": 1},
|
||||
)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
|
||||
total = data.get("totalResults", 0)
|
||||
return f"EOL API connection successful ({total} results for 'Rosa')"
|
||||
146
backend/app/scrapers/flickr.py
Normal file
@@ -0,0 +1,146 @@
import time
import logging
from typing import Dict, Optional

import httpx
from sqlalchemy.orm import Session

from app.scrapers.base import BaseScraper
from app.models import Species, Image, ApiKey
from app.workers.quality_tasks import download_and_process_image


class FlickrScraper(BaseScraper):
    """Scraper for Flickr images via their API."""

    name = "flickr"
    requires_api_key = True

    BASE_URL = "https://api.flickr.com/services/rest/"

    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
    }

    # Commercial-safe license IDs:
    # 4 = CC BY 2.0, 7 = No known copyright, 8 = US Gov, 9 = CC0
    ALLOWED_LICENSES = "4,7,8,9"

    LICENSE_MAP = {
        "4": "CC-BY",
        "7": "NO-KNOWN-COPYRIGHT",
        "8": "US-GOV",
        "9": "CC0",
    }

    def scrape_species(
        self,
        species: Species,
        db: Session,
        logger: Optional[logging.Logger] = None,
    ) -> Dict[str, int]:
        """Scrape images from Flickr for a species."""
        api_key = self.get_api_key(db)
        if not api_key:
            return {"downloaded": 0, "rejected": 0, "error": "No API key configured"}

        rate_limit = api_key.rate_limit_per_sec

        downloaded = 0
        rejected = 0

        try:
            params = {
                "method": "flickr.photos.search",
                "api_key": api_key.api_key,
                "text": species.scientific_name,
                "license": self.ALLOWED_LICENSES,
                "content_type": 1,  # Photos only
                "media": "photos",
                "extras": "license,url_l,url_o,owner_name",
                "per_page": 100,
                "format": "json",
                "nojsoncallback": 1,
            }

            with httpx.Client(timeout=30, headers=self.HEADERS) as client:
                response = client.get(self.BASE_URL, params=params)
                response.raise_for_status()
                data = response.json()

                if data.get("stat") != "ok":
                    return {"downloaded": 0, "rejected": 0, "error": data.get("message")}

                photos = data.get("photos", {}).get("photo", [])

                for photo in photos:
                    # Get best URL (original or large)
                    url = photo.get("url_o") or photo.get("url_l")
                    if not url:
                        rejected += 1
                        continue

                    # Get license
                    license_id = str(photo.get("license", ""))
                    license_code = self.LICENSE_MAP.get(license_id, "UNKNOWN")
                    if license_code == "UNKNOWN":
                        rejected += 1
                        continue

                    # Check if already exists
                    source_id = str(photo.get("id"))
                    existing = db.query(Image).filter(
                        Image.source == self.name,
                        Image.source_id == source_id,
                    ).first()

                    if existing:
                        continue

                    # Build attribution
                    owner = photo.get("ownername", "Unknown")
                    attribution = f"Photo by {owner} on Flickr ({license_code})"

                    # Create image record
                    image = Image(
                        species_id=species.id,
                        source=self.name,
                        source_id=source_id,
                        url=url,
                        license=license_code,
                        attribution=attribution,
                        status="pending",
                    )
                    db.add(image)
                    db.commit()

                    # Queue for download
                    download_and_process_image.delay(image.id)
                    downloaded += 1

                    # Rate limiting
                    time.sleep(1.0 / rate_limit)

        except Exception as e:
            # Report through the provided logger rather than print(), so errors
            # reach the task log files
            if logger:
                logger.error(f"Error scraping Flickr for {species.scientific_name}: {e}")

        return {"downloaded": downloaded, "rejected": rejected}

    def test_connection(self, api_key: ApiKey) -> str:
        """Test Flickr API connection."""
        params = {
            "method": "flickr.test.echo",
            "api_key": api_key.api_key,
            "format": "json",
            "nojsoncallback": 1,
        }

        with httpx.Client(timeout=10, headers=self.HEADERS) as client:
            response = client.get(self.BASE_URL, params=params)
            response.raise_for_status()
            data = response.json()

            if data.get("stat") != "ok":
                raise Exception(data.get("message", "API test failed"))

            return "Flickr API connection successful"
159
backend/app/scrapers/gbif.py
Normal file
@@ -0,0 +1,159 @@
import time
import logging
import hashlib
from typing import Dict, Optional

import httpx
from sqlalchemy.orm import Session

from app.scrapers.base import BaseScraper
from app.models import Species, Image, ApiKey
from app.workers.quality_tasks import download_and_process_image


class GBIFScraper(BaseScraper):
    """Scraper for GBIF (Global Biodiversity Information Facility) images."""

    name = "gbif"
    requires_api_key = False  # GBIF is free to use

    BASE_URL = "https://api.gbif.org/v1"

    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
    }

    # Map GBIF license URLs (http/https, with and without /legalcode) to short codes
    LICENSE_MAP = {
        "http://creativecommons.org/publicdomain/zero/1.0/legalcode": "CC0",
        "http://creativecommons.org/licenses/by/4.0/legalcode": "CC-BY",
        "http://creativecommons.org/licenses/by-nc/4.0/legalcode": "CC-BY-NC",
        "http://creativecommons.org/publicdomain/zero/1.0/": "CC0",
        "http://creativecommons.org/licenses/by/4.0/": "CC-BY",
        "http://creativecommons.org/licenses/by-nc/4.0/": "CC-BY-NC",
        "https://creativecommons.org/publicdomain/zero/1.0/legalcode": "CC0",
        "https://creativecommons.org/licenses/by/4.0/legalcode": "CC-BY",
        "https://creativecommons.org/licenses/by-nc/4.0/legalcode": "CC-BY-NC",
        "https://creativecommons.org/publicdomain/zero/1.0/": "CC0",
        "https://creativecommons.org/licenses/by/4.0/": "CC-BY",
        "https://creativecommons.org/licenses/by-nc/4.0/": "CC-BY-NC",
    }

    # Only allow commercial-safe licenses
    ALLOWED_LICENSES = {"CC0", "CC-BY"}

    def scrape_species(
        self,
        species: Species,
        db: Session,
        logger: Optional[logging.Logger] = None,
    ) -> Dict[str, int]:
        """Scrape images from GBIF for a species."""
        # GBIF doesn't require an API key, but we still respect rate limits
        api_key = self.get_api_key(db)
        rate_limit = api_key.rate_limit_per_sec if api_key else 1.0

        downloaded = 0
        rejected = 0

        try:
            params = {
                "scientificName": species.scientific_name,
                "mediaType": "StillImage",
                "limit": 100,
            }

            with httpx.Client(timeout=30, headers=self.HEADERS) as client:
                response = client.get(
                    f"{self.BASE_URL}/occurrence/search",
                    params=params,
                )
                response.raise_for_status()
                data = response.json()

                results = data.get("results", [])

                for occurrence in results:
                    media_list = occurrence.get("media", [])

                    for media in media_list:
                        # Only process still images
                        if media.get("type") != "StillImage":
                            continue

                        url = media.get("identifier")
                        if not url:
                            rejected += 1
                            continue

                        # Check license
                        license_url = media.get("license", "")
                        license_code = self.LICENSE_MAP.get(license_url)

                        if not license_code or license_code not in self.ALLOWED_LICENSES:
                            rejected += 1
                            continue

                        # Create unique source ID from occurrence key and media URL.
                        # hashlib gives a deterministic digest; the built-in hash()
                        # is salted per process and would break dedup across runs.
                        occurrence_key = occurrence.get("key", "")
                        url_hash = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
                        source_id = f"{occurrence_key}_{url_hash}"

                        # Check if already exists
                        existing = db.query(Image).filter(
                            Image.source == self.name,
                            Image.source_id == source_id,
                        ).first()

                        if existing:
                            continue

                        # Build attribution (the "via GBIF" part is always present)
                        creator = media.get("creator", "")
                        rights_holder = media.get("rightsHolder", "")
                        attribution_parts = []
                        if creator:
                            attribution_parts.append(f"Photo by {creator}")
                        if rights_holder and rights_holder != creator:
                            attribution_parts.append(f"Rights: {rights_holder}")
                        attribution_parts.append(f"via GBIF ({license_code})")
                        attribution = " | ".join(attribution_parts)

                        # Create image record
                        image = Image(
                            species_id=species.id,
                            source=self.name,
                            source_id=source_id,
                            url=url,
                            license=license_code,
                            attribution=attribution,
                            status="pending",
                        )
                        db.add(image)
                        db.commit()

                        # Queue for download
                        download_and_process_image.delay(image.id)
                        downloaded += 1

                        # Rate limiting
                        time.sleep(1.0 / rate_limit)

        except Exception as e:
            # Report through the provided logger rather than print(), so errors
            # reach the task log files
            if logger:
                logger.error(f"Error scraping GBIF for {species.scientific_name}: {e}")

        return {"downloaded": downloaded, "rejected": rejected}

    def test_connection(self, api_key: ApiKey) -> str:
        """Test GBIF API connection."""
        # GBIF doesn't require authentication, just test the endpoint
        with httpx.Client(timeout=10, headers=self.HEADERS) as client:
            response = client.get(
                f"{self.BASE_URL}/occurrence/search",
                params={"limit": 1},
            )
            response.raise_for_status()
            data = response.json()

            count = data.get("count", 0)
            return f"GBIF API connection successful ({count:,} total occurrences available)"
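One subtlety in the source IDs above is worth spelling out: Python's built-in `hash()` on strings is salted per interpreter process (`PYTHONHASHSEED`), so it cannot produce IDs that survive a worker restart, whereas a `hashlib` digest is deterministic. A minimal sketch of the pattern; the `stable_media_id` helper name is illustrative, not part of the scraper API:

```python
import hashlib


def stable_media_id(occurrence_key: str, url: str) -> str:
    """Build a media ID that is identical across processes and runs."""
    # sha1 of the URL bytes is deterministic; hash(url) would differ
    # between interpreter processes due to string-hash randomization.
    url_hash = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return f"{occurrence_key}_{url_hash}"


# The same inputs always map to the same ID:
a = stable_media_id("123", "https://example.org/img.jpg")
b = stable_media_id("123", "https://example.org/img.jpg")
assert a == b
```

Because the dedup query matches on `source_id`, any nondeterminism there silently re-imports the same image on every run.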
144
backend/app/scrapers/inaturalist.py
Normal file
@@ -0,0 +1,144 @@
import time
import logging
from typing import Dict, Optional

import httpx
from sqlalchemy.orm import Session

from app.scrapers.base import BaseScraper
from app.models import Species, Image, ApiKey
from app.workers.quality_tasks import download_and_process_image


class INaturalistScraper(BaseScraper):
    """Scraper for iNaturalist observations via their API."""

    name = "inaturalist"
    requires_api_key = False  # Public API, but rate limited

    BASE_URL = "https://api.inaturalist.org/v1"

    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
    }

    # Commercial-safe licenses (CC0, CC-BY)
    ALLOWED_LICENSES = ["cc0", "cc-by"]

    def scrape_species(
        self,
        species: Species,
        db: Session,
        logger: Optional[logging.Logger] = None,
    ) -> Dict[str, int]:
        """Scrape images from iNaturalist for a species."""
        api_key = self.get_api_key(db)
        rate_limit = api_key.rate_limit_per_sec if api_key else 1.0

        downloaded = 0
        rejected = 0

        def log(level: str, msg: str):
            if logger:
                getattr(logger, level)(msg)

        try:
            # Search for observations of this species
            params = {
                "taxon_name": species.scientific_name,
                "quality_grade": "research",  # Only research-grade observations
                "photos": True,
                "per_page": 200,
                "order_by": "votes",
                "license": ",".join(self.ALLOWED_LICENSES),
            }

            log("debug", f"  API request params: {params}")

            with httpx.Client(timeout=30, headers=self.HEADERS) as client:
                response = client.get(
                    f"{self.BASE_URL}/observations",
                    params=params,
                )
                log("debug", f"  API response status: {response.status_code}")
                response.raise_for_status()
                data = response.json()

                observations = data.get("results", [])
                total_results = data.get("total_results", 0)
                log("debug", f"  Found {len(observations)} observations (total: {total_results})")

                if not observations:
                    log("info", f"  No observations found for {species.scientific_name}")
                    return {"downloaded": 0, "rejected": 0}

                for obs in observations:
                    photos = obs.get("photos", [])
                    for photo in photos:
                        # Check license
                        license_code = (photo.get("license_code") or "").lower()
                        if license_code not in self.ALLOWED_LICENSES:
                            log("debug", f"  Rejected photo {photo.get('id')}: license={license_code}")
                            rejected += 1
                            continue

                        # Get image URL (medium size for initial download)
                        url = photo.get("url", "")
                        if not url:
                            log("debug", f"  Skipped photo {photo.get('id')}: no URL")
                            continue

                        # Convert to larger size
                        url = url.replace("square", "large")

                        # Check if already exists
                        source_id = str(photo.get("id"))
                        existing = db.query(Image).filter(
                            Image.source == self.name,
                            Image.source_id == source_id,
                        ).first()

                        if existing:
                            log("debug", f"  Skipped photo {source_id}: already exists")
                            continue

                        # Create image record
                        image = Image(
                            species_id=species.id,
                            source=self.name,
                            source_id=source_id,
                            url=url,
                            license=license_code.upper(),
                            attribution=photo.get("attribution", ""),
                            status="pending",
                        )
                        db.add(image)
                        db.commit()

                        # Queue for download
                        download_and_process_image.delay(image.id)
                        downloaded += 1
                        log("debug", f"  Queued photo {source_id} for download")

                        # Rate limiting
                        time.sleep(1.0 / rate_limit)

        except httpx.HTTPStatusError as e:
            log("error", f"  HTTP error for {species.scientific_name}: {e.response.status_code} - {e.response.text}")
        except httpx.RequestError as e:
            log("error", f"  Request error for {species.scientific_name}: {e}")
        except Exception as e:
            log("error", f"  Error scraping iNaturalist for {species.scientific_name}: {e}")

        return {"downloaded": downloaded, "rejected": rejected}

    def test_connection(self, api_key: ApiKey) -> str:
        """Test iNaturalist API connection."""
        with httpx.Client(timeout=10, headers=self.HEADERS) as client:
            response = client.get(
                f"{self.BASE_URL}/observations",
                params={"per_page": 1},
            )
            response.raise_for_status()

        return "iNaturalist API connection successful"
154
backend/app/scrapers/trefle.py
Normal file
@@ -0,0 +1,154 @@
import time
import logging
from typing import Dict, Optional

import httpx
from sqlalchemy.orm import Session

from app.scrapers.base import BaseScraper
from app.models import Species, Image, ApiKey
from app.workers.quality_tasks import download_and_process_image


class TrefleScraper(BaseScraper):
    """Scraper for Trefle.io plant database."""

    name = "trefle"
    requires_api_key = True

    BASE_URL = "https://trefle.io/api/v1"

    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
    }

    def scrape_species(
        self,
        species: Species,
        db: Session,
        logger: Optional[logging.Logger] = None,
    ) -> Dict[str, int]:
        """Scrape images from Trefle for a species."""
        api_key = self.get_api_key(db)
        if not api_key:
            return {"downloaded": 0, "rejected": 0, "error": "No API key configured"}

        rate_limit = api_key.rate_limit_per_sec

        downloaded = 0
        rejected = 0

        try:
            # Search for the species
            params = {
                "token": api_key.api_key,
                "q": species.scientific_name,
            }

            with httpx.Client(timeout=30, headers=self.HEADERS) as client:
                response = client.get(
                    f"{self.BASE_URL}/plants/search",
                    params=params,
                )
                response.raise_for_status()
                data = response.json()

                plants = data.get("data", [])

                for plant in plants:
                    # Get plant details for more images
                    plant_id = plant.get("id")
                    if not plant_id:
                        continue

                    detail_response = client.get(
                        f"{self.BASE_URL}/plants/{plant_id}",
                        params={"token": api_key.api_key},
                    )

                    if detail_response.status_code != 200:
                        continue

                    plant_detail = detail_response.json().get("data", {})

                    # Get main image
                    main_image = plant_detail.get("image_url")
                    if main_image:
                        source_id = f"main_{plant_id}"
                        existing = db.query(Image).filter(
                            Image.source == self.name,
                            Image.source_id == source_id,
                        ).first()

                        if not existing:
                            image = Image(
                                species_id=species.id,
                                source=self.name,
                                source_id=source_id,
                                url=main_image,
                                license="TREFLE",  # Trefle's own license
                                attribution="Trefle.io Plant Database",
                                status="pending",
                            )
                            db.add(image)
                            db.commit()
                            download_and_process_image.delay(image.id)
                            downloaded += 1

                    # Get additional images from species detail
                    images = plant_detail.get("images", {})
                    for image_type, image_list in images.items():
                        if not isinstance(image_list, list):
                            continue

                        for img in image_list:
                            url = img.get("image_url")
                            if not url:
                                continue

                            img_id = img.get("id", url.split("/")[-1])
                            source_id = f"{image_type}_{img_id}"

                            existing = db.query(Image).filter(
                                Image.source == self.name,
                                Image.source_id == source_id,
                            ).first()

                            if existing:
                                continue

                            copyright_info = img.get("copyright", "")
                            image = Image(
                                species_id=species.id,
                                source=self.name,
                                source_id=source_id,
                                url=url,
                                license="TREFLE",
                                attribution=copyright_info or "Trefle.io",
                                status="pending",
                            )
                            db.add(image)
                            db.commit()
                            download_and_process_image.delay(image.id)
                            downloaded += 1

                    # Rate limiting
                    time.sleep(1.0 / rate_limit)

        except Exception as e:
            # Report through the provided logger rather than print(), so errors
            # reach the task log files
            if logger:
                logger.error(f"Error scraping Trefle for {species.scientific_name}: {e}")

        return {"downloaded": downloaded, "rejected": rejected}

    def test_connection(self, api_key: ApiKey) -> str:
        """Test Trefle API connection."""
        params = {"token": api_key.api_key}

        with httpx.Client(timeout=10, headers=self.HEADERS) as client:
            response = client.get(
                f"{self.BASE_URL}/plants",
                params=params,
            )
            response.raise_for_status()

        return "Trefle API connection successful"
146
backend/app/scrapers/wikimedia.py
Normal file
@@ -0,0 +1,146 @@
import re
import time
import logging
from typing import Dict, Optional

import httpx
from sqlalchemy.orm import Session

from app.scrapers.base import BaseScraper
from app.models import Species, Image, ApiKey
from app.workers.quality_tasks import download_and_process_image


class WikimediaScraper(BaseScraper):
    """Scraper for Wikimedia Commons images."""

    name = "wikimedia"
    requires_api_key = False

    BASE_URL = "https://commons.wikimedia.org/w/api.php"

    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
    }

    def scrape_species(
        self,
        species: Species,
        db: Session,
        logger: Optional[logging.Logger] = None,
    ) -> Dict[str, int]:
        """Scrape images from Wikimedia Commons for a species."""
        api_key = self.get_api_key(db)
        rate_limit = api_key.rate_limit_per_sec if api_key else 1.0

        downloaded = 0
        rejected = 0

        try:
            # Search for images matching the species name
            search_term = species.scientific_name

            params = {
                "action": "query",
                "format": "json",
                "generator": "search",
                "gsrsearch": f"filetype:bitmap {search_term}",
                "gsrnamespace": 6,  # File namespace
                "gsrlimit": 50,
                "prop": "imageinfo",
                "iiprop": "url|extmetadata|size",
            }

            with httpx.Client(timeout=30, headers=self.HEADERS) as client:
                response = client.get(self.BASE_URL, params=params)
                response.raise_for_status()
                data = response.json()

                pages = data.get("query", {}).get("pages", {})

                for page_id, page in pages.items():
                    # Negative page IDs are placeholders, not real files
                    if int(page_id) < 0:
                        continue

                    imageinfo = page.get("imageinfo", [{}])[0]
                    url = imageinfo.get("url", "")
                    if not url:
                        continue

                    # Check size
                    width = imageinfo.get("width", 0)
                    height = imageinfo.get("height", 0)
                    if width < 256 or height < 256:
                        rejected += 1
                        continue

                    # Get license from metadata
                    metadata = imageinfo.get("extmetadata", {})
                    license_info = metadata.get("LicenseShortName", {}).get("value", "")

                    # Filter for commercial-safe licenses
                    license_upper = license_info.upper()
                    if "CC BY" in license_upper or "CC0" in license_upper or "PUBLIC DOMAIN" in license_upper:
                        license_code = license_info
                    else:
                        rejected += 1
                        continue

                    # Check if already exists
                    source_id = str(page_id)
                    existing = db.query(Image).filter(
                        Image.source == self.name,
                        Image.source_id == source_id,
                    ).first()

                    if existing:
                        continue

                    # Get attribution; strip any HTML markup from the artist field
                    artist = metadata.get("Artist", {}).get("value", "Unknown")
                    if "<" in artist:
                        artist = re.sub(r"<[^>]+>", "", artist).strip()

                    attribution = f"{artist} via Wikimedia Commons ({license_code})"

                    # Create image record
                    image = Image(
                        species_id=species.id,
                        source=self.name,
                        source_id=source_id,
                        url=url,
                        license=license_code,
                        attribution=attribution,
                        width=width,
                        height=height,
                        status="pending",
                    )
                    db.add(image)
                    db.commit()

                    # Queue for download
                    download_and_process_image.delay(image.id)
                    downloaded += 1

                    # Rate limiting
                    time.sleep(1.0 / rate_limit)

        except Exception as e:
            # Report through the provided logger rather than print(), so errors
            # reach the task log files
            if logger:
                logger.error(f"Error scraping Wikimedia for {species.scientific_name}: {e}")

        return {"downloaded": downloaded, "rejected": rejected}

    def test_connection(self, api_key: ApiKey) -> str:
        """Test Wikimedia API connection."""
        params = {
            "action": "query",
            "format": "json",
            "meta": "siteinfo",
        }

        with httpx.Client(timeout=10, headers=self.HEADERS) as client:
            response = client.get(self.BASE_URL, params=params)
            response.raise_for_status()

        return "Wikimedia Commons API connection successful"
1
backend/app/utils/__init__.py
Normal file
@@ -0,0 +1 @@
# Utility functions
80
backend/app/utils/dedup.py
Normal file
@@ -0,0 +1,80 @@
"""Image deduplication utilities using perceptual hashing."""

from typing import Optional

import imagehash
from PIL import Image as PILImage


def calculate_phash(image_path: str) -> Optional[str]:
    """
    Calculate perceptual hash for an image.

    Args:
        image_path: Path to image file

    Returns:
        Hex string of perceptual hash, or None if failed
    """
    try:
        with PILImage.open(image_path) as img:
            return str(imagehash.phash(img))
    except Exception:
        return None


def calculate_dhash(image_path: str) -> Optional[str]:
    """
    Calculate difference hash for an image.
    Faster but less accurate than phash.

    Args:
        image_path: Path to image file

    Returns:
        Hex string of difference hash, or None if failed
    """
    try:
        with PILImage.open(image_path) as img:
            return str(imagehash.dhash(img))
    except Exception:
        return None


def hashes_are_similar(hash1: str, hash2: str, threshold: int = 10) -> bool:
    """
    Check if two hashes are similar (potential duplicates).

    Args:
        hash1: First hash string
        hash2: Second hash string
        threshold: Maximum Hamming distance (default 10)

    Returns:
        True if hashes are similar
    """
    try:
        h1 = imagehash.hex_to_hash(hash1)
        h2 = imagehash.hex_to_hash(hash2)
        return (h1 - h2) <= threshold
    except Exception:
        return False


def hamming_distance(hash1: str, hash2: str) -> int:
    """
    Calculate Hamming distance between two hashes.

    Args:
        hash1: First hash string
        hash2: Second hash string

    Returns:
        Hamming distance (0 = identical, higher = more different)
    """
    try:
        h1 = imagehash.hex_to_hash(hash1)
        h2 = imagehash.hex_to_hash(hash2)
        return int(h1 - h2)
    except Exception:
        return 64  # Maximum distance
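The near-duplicate check above reduces to a Hamming distance between two 64-bit hashes rendered as hex strings. A dependency-free sketch of that comparison (the helper names here are illustrative; the real module delegates to `imagehash`):

```python
def hex_hamming(hash1: str, hash2: str) -> int:
    """Hamming distance between two equal-length hex hash strings."""
    # XOR the integer values; each set bit is one position where they differ.
    xor = int(hash1, 16) ^ int(hash2, 16)
    return bin(xor).count("1")


def similar(hash1: str, hash2: str, threshold: int = 10) -> bool:
    """Treat hashes within `threshold` differing bits as duplicates."""
    return hex_hamming(hash1, hash2) <= threshold


assert hex_hamming("ff00", "ff00") == 0       # identical hashes
assert hex_hamming("ff00", "ff01") == 1       # one bit differs
assert similar("ff00", "ff0f", threshold=10)  # 4 bits apart -> duplicate
```

A threshold of 10 out of 64 bits is fairly permissive; tightening it trades fewer false duplicates for more missed ones.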
109
backend/app/utils/image_quality.py
Normal file
@@ -0,0 +1,109 @@
|
||||
"""Image quality assessment utilities."""
|
||||
|
||||
import numpy as np
|
||||
from PIL import Image as PILImage
|
||||
from scipy import ndimage
|
||||
|
||||
|
||||
def calculate_blur_score(image_path: str) -> float:
|
||||
"""
|
||||
Calculate blur score using Laplacian variance.
|
||||
Higher score = sharper image.
|
||||
|
||||
Args:
|
||||
image_path: Path to image file
|
||||
|
||||
Returns:
|
||||
Variance of Laplacian (higher = sharper)
|
||||
"""
|
||||
try:
|
||||
img = PILImage.open(image_path).convert("L")
|
||||
img_array = np.array(img)
|
||||
laplacian = ndimage.laplace(img_array)
|
||||
return float(np.var(laplacian))
|
||||
except Exception:
|
||||
return 0.0
|
||||
|
||||
|
||||
def is_too_blurry(image_path: str, threshold: float = 100.0) -> bool:
|
||||
"""
|
||||
Check if image is too blurry for training.
|
||||
|
||||
Args:
|
||||
image_path: Path to image file
|
||||
threshold: Minimum acceptable blur score (default 100)
|
||||
|
||||
Returns:
|
||||
True if image is too blurry
|
||||
"""
|
||||
score = calculate_blur_score(image_path)
|
||||
return score < threshold
|
||||
|
||||
|
||||
def get_image_dimensions(image_path: str) -> tuple[int, int]:
|
||||
"""
|
||||
Get image dimensions.
|
||||
|
||||
Args:
|
||||
image_path: Path to image file
|
||||
|
||||
Returns:
|
||||
Tuple of (width, height)
|
||||
"""
|
||||
try:
|
||||
with PILImage.open(image_path) as img:
|
||||
return img.size
|
||||
except Exception:
|
||||
return (0, 0)
|
||||
|
||||
|
||||
def is_too_small(image_path: str, min_size: int = 256) -> bool:
|
||||
"""
|
||||
Check if image is too small for training.
|
||||
|
||||
Args:
|
||||
image_path: Path to image file
|
||||
min_size: Minimum dimension size (default 256)
|
||||
|
||||
Returns:
|
||||
True if image is too small
|
||||
"""
|
||||
width, height = get_image_dimensions(image_path)
|
||||
return width < min_size or height < min_size
|
||||
|
||||
|
||||
def resize_image(
|
||||
image_path: str,
|
||||
output_path: str = None,
|
||||
max_size: int = 512,
|
||||
quality: int = 95,
|
||||
) -> bool:
|
||||
"""
|
||||
Resize image to max dimension while preserving aspect ratio.
|
||||
|
||||
Args:
|
||||
image_path: Path to input image
|
||||
output_path: Path for output (defaults to overwriting input)
|
||||
max_size: Maximum dimension size (default 512)
|
||||
quality: JPEG quality (default 95)
|
||||
|
||||
Returns:
|
||||
True if successful
|
||||
"""
|
||||
try:
|
||||
output_path = output_path or image_path
|
||||
|
||||
with PILImage.open(image_path) as img:
|
||||
# Only resize if larger than max_size
|
||||
if max(img.size) > max_size:
|
||||
img.thumbnail((max_size, max_size), PILImage.Resampling.LANCZOS)
|
||||
|
||||
# Convert to RGB if necessary (for JPEG)
|
||||
if img.mode in ("RGBA", "P"):
|
||||
img = img.convert("RGB")
|
||||
|
||||
img.save(output_path, "JPEG", quality=quality)
|
||||
|
||||
return True
|
||||
except Exception:
|
||||
return False
|
||||
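The Laplacian-variance blur check can be sketched without scipy: a hand-rolled 4-neighbour Laplacian kernel in NumPy stands in for `ndimage.laplace`, and the arrays and threshold below are illustrative, not part of the project.

```python
import numpy as np

def blur_score(gray: np.ndarray) -> float:
    """Variance of a 4-neighbour Laplacian; higher means sharper."""
    g = gray.astype(np.float64)
    lap = (-4 * g[1:-1, 1:-1]
           + g[:-2, 1:-1] + g[2:, 1:-1]
           + g[1:-1, :-2] + g[1:-1, 2:])
    return float(lap.var())

flat = np.full((64, 64), 128)   # featureless image: Laplacian is zero everywhere
noisy = np.random.default_rng(0).integers(0, 256, (64, 64))  # high-frequency detail

print(blur_score(flat))          # 0.0
print(blur_score(noisy) > 100)   # True
```

A uniform image scores exactly 0, so it falls well under the default threshold of 100, while anything with real edge detail scores orders of magnitude higher.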
92
backend/app/utils/logging.py
Normal file
@@ -0,0 +1,92 @@
import logging
import os
from datetime import datetime
from pathlib import Path

from app.config import get_settings

settings = get_settings()


def setup_logging():
    """Configure file and console logging."""
    logs_path = Path(settings.logs_path)
    logs_path.mkdir(parents=True, exist_ok=True)

    # Create a dated log file
    log_file = logs_path / f"scraper_{datetime.now().strftime('%Y-%m-%d')}.log"

    # Configure root logger
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        handlers=[
            logging.FileHandler(log_file),
            logging.StreamHandler()
        ]
    )

    return logging.getLogger("plant_scraper")


def get_logger(name: str = "plant_scraper"):
    """Get a logger instance."""
    logs_path = Path(settings.logs_path)
    logs_path.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(name)

    if not logger.handlers:
        logger.setLevel(logging.INFO)

        # File handler writing to a dated daily log file
        log_file = logs_path / f"scraper_{datetime.now().strftime('%Y-%m-%d')}.log"
        file_handler = logging.FileHandler(log_file)
        file_handler.setLevel(logging.INFO)
        file_handler.setFormatter(logging.Formatter(
            '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
        ))

        # Console handler
        console_handler = logging.StreamHandler()
        console_handler.setLevel(logging.INFO)
        console_handler.setFormatter(logging.Formatter(
            '%(asctime)s - %(levelname)s - %(message)s'
        ))

        logger.addHandler(file_handler)
        logger.addHandler(console_handler)

    return logger


def get_job_logger(job_id: int):
    """Get a logger specific to a job, writing to a job-specific file."""
    logs_path = Path(settings.logs_path)
    logs_path.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(f"job_{job_id}")

    if not logger.handlers:
        logger.setLevel(logging.DEBUG)

        # Job-specific log file
        job_log_file = logs_path / f"job_{job_id}.log"
        file_handler = logging.FileHandler(job_log_file)
        file_handler.setLevel(logging.DEBUG)
        file_handler.setFormatter(logging.Formatter(
            '%(asctime)s - %(levelname)s - %(message)s'
        ))

        # Also log to daily file
        daily_log_file = logs_path / f"scraper_{datetime.now().strftime('%Y-%m-%d')}.log"
        daily_handler = logging.FileHandler(daily_log_file)
        daily_handler.setLevel(logging.INFO)
        daily_handler.setFormatter(logging.Formatter(
            '%(asctime)s - job_%(name)s - %(levelname)s - %(message)s'
        ))

        logger.addHandler(file_handler)
        logger.addHandler(daily_handler)

    return logger
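The `if not logger.handlers` guard is what keeps repeated calls for the same job from attaching duplicate handlers. A minimal stdlib-only sketch of that per-job pattern (the names and temp directory here are illustrative):

```python
import logging
import tempfile
from pathlib import Path

def make_job_logger(job_id: int, logs_path: Path) -> logging.Logger:
    """Per-job logger writing to its own file; handlers attached only once."""
    logger = logging.getLogger(f"demo_job_{job_id}")
    if not logger.handlers:
        logger.setLevel(logging.DEBUG)
        handler = logging.FileHandler(logs_path / f"job_{job_id}.log")
        handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
        logger.addHandler(handler)
    return logger

logs_dir = Path(tempfile.mkdtemp())
log = make_job_logger(42, logs_dir)
log.info("scrape started")
for h in log.handlers:
    h.flush()
print((logs_dir / "job_42.log").read_text())  # INFO - scrape started
```

Calling `make_job_logger(42, logs_dir)` again returns the same logger object with the same single handler, so log lines are not duplicated.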
1
backend/app/workers/__init__.py
Normal file
@@ -0,0 +1 @@
# Celery workers
36
backend/app/workers/celery_app.py
Normal file
@@ -0,0 +1,36 @@
from celery import Celery

from app.config import get_settings

settings = get_settings()

celery_app = Celery(
    "plant_scraper",
    broker=settings.redis_url,
    backend=settings.redis_url,
    include=[
        "app.workers.scrape_tasks",
        "app.workers.quality_tasks",
        "app.workers.export_tasks",
        "app.workers.stats_tasks",
    ],
)

celery_app.conf.update(
    task_serializer="json",
    accept_content=["json"],
    result_serializer="json",
    timezone="UTC",
    enable_utc=True,
    task_track_started=True,
    task_time_limit=3600 * 24,  # 24 hour max per task
    worker_prefetch_multiplier=1,
    task_acks_late=True,
    beat_schedule={
        "refresh-stats-every-5min": {
            "task": "app.workers.stats_tasks.refresh_stats",
            "schedule": 300.0,  # Every 5 minutes
        },
    },
    beat_schedule_filename="/tmp/celerybeat-schedule",
)
170
backend/app/workers/export_tasks.py
Normal file
@@ -0,0 +1,170 @@
import json
import os
import random
import shutil
import zipfile
from datetime import datetime
from pathlib import Path

from sqlalchemy import func

from app.workers.celery_app import celery_app
from app.database import SessionLocal
from app.models import Export, Image, Species
from app.config import get_settings

settings = get_settings()


@celery_app.task(bind=True)
def generate_export(self, export_id: int):
    """Generate a zip export for CoreML training."""
    db = SessionLocal()
    export = None
    try:
        export = db.query(Export).filter(Export.id == export_id).first()
        if not export:
            return {"error": "Export not found"}

        # Update status
        export.status = "generating"
        export.celery_task_id = self.request.id
        db.commit()

        # Parse filter criteria
        criteria = json.loads(export.filter_criteria) if export.filter_criteria else {}
        min_images = criteria.get("min_images_per_species", 100)
        licenses = criteria.get("licenses")
        min_quality = criteria.get("min_quality")
        species_ids = criteria.get("species_ids")

        # Build query for images
        query = db.query(Image).filter(Image.status == "downloaded")

        if licenses:
            query = query.filter(Image.license.in_(licenses))

        if min_quality:
            query = query.filter(Image.quality_score >= min_quality)

        if species_ids:
            query = query.filter(Image.species_id.in_(species_ids))

        # Group by species and filter by min count
        species_counts = db.query(
            Image.species_id,
            func.count(Image.id).label("count")
        ).filter(Image.status == "downloaded").group_by(Image.species_id).all()

        valid_species_ids = [s.species_id for s in species_counts if s.count >= min_images]

        if species_ids:
            valid_species_ids = [s for s in valid_species_ids if s in species_ids]

        if not valid_species_ids:
            export.status = "failed"
            export.error_message = "No species meet the criteria"
            export.completed_at = datetime.utcnow()
            db.commit()
            return {"error": "No species meet the criteria"}

        # Create export directory
        export_dir = Path(settings.exports_path) / f"export_{export_id}"
        train_dir = export_dir / "Training"
        test_dir = export_dir / "Testing"
        train_dir.mkdir(parents=True, exist_ok=True)
        test_dir.mkdir(parents=True, exist_ok=True)

        total_images = 0
        species_count = 0

        # Process each valid species
        for i, species_id in enumerate(valid_species_ids):
            species = db.query(Species).filter(Species.id == species_id).first()
            if not species:
                continue

            # Get images for this species
            images_query = query.filter(Image.species_id == species_id)
            if licenses:
                images_query = images_query.filter(Image.license.in_(licenses))
            if min_quality:
                images_query = images_query.filter(Image.quality_score >= min_quality)

            images = images_query.all()
            if len(images) < min_images:
                continue

            species_count += 1

            # Create species folders
            species_name = species.scientific_name.replace(" ", "_")
            (train_dir / species_name).mkdir(exist_ok=True)
            (test_dir / species_name).mkdir(exist_ok=True)

            # Shuffle and split
            random.shuffle(images)
            split_idx = int(len(images) * export.train_split)
            train_images = images[:split_idx]
            test_images = images[split_idx:]

            # Copy images
            for j, img in enumerate(train_images):
                if img.local_path and os.path.exists(img.local_path):
                    ext = Path(img.local_path).suffix or ".jpg"
                    dest = train_dir / species_name / f"img_{j:05d}{ext}"
                    shutil.copy2(img.local_path, dest)
                    total_images += 1

            for j, img in enumerate(test_images):
                if img.local_path and os.path.exists(img.local_path):
                    ext = Path(img.local_path).suffix or ".jpg"
                    dest = test_dir / species_name / f"img_{j:05d}{ext}"
                    shutil.copy2(img.local_path, dest)
                    total_images += 1

            # Update progress
            self.update_state(
                state="PROGRESS",
                meta={
                    "current": i + 1,
                    "total": len(valid_species_ids),
                    "species": species.scientific_name,
                }
            )

        # Create zip file
        zip_path = Path(settings.exports_path) / f"export_{export_id}.zip"
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zipf:
            for root, dirs, files in os.walk(export_dir):
                for file in files:
                    file_path = Path(root) / file
                    arcname = file_path.relative_to(export_dir)
                    zipf.write(file_path, arcname)

        # Clean up directory
        shutil.rmtree(export_dir)

        # Update export record
        export.status = "completed"
        export.file_path = str(zip_path)
        export.file_size = zip_path.stat().st_size
        export.species_count = species_count
        export.image_count = total_images
        export.completed_at = datetime.utcnow()
        db.commit()

        return {
            "status": "completed",
            "species_count": species_count,
            "image_count": total_images,
            "file_size": export.file_size,
        }

    except Exception as e:
        if export:
            export.status = "failed"
            export.error_message = str(e)
            export.completed_at = datetime.utcnow()
            db.commit()
        raise
    finally:
        db.close()
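The shuffle-and-split step is the core of the Training/Testing layout: shuffle in place, cut at `int(len * train_split)`, and everything after the cut becomes the test set. A standalone sketch with a seeded shuffle (the 0.8 split and integer items are just example values):

```python
import random

def train_test_split(items: list, train_split: float, seed: int = 0) -> tuple[list, list]:
    """Shuffle a copy, then cut at int(len * train_split); the rest is the test set."""
    items = items[:]                       # don't mutate the caller's list
    random.Random(seed).shuffle(items)
    split_idx = int(len(items) * train_split)
    return items[:split_idx], items[split_idx:]

train, test = train_test_split(list(range(10)), train_split=0.8)
print(len(train), len(test))  # 8 2
```

Because `int()` truncates, a species with 105 images and a 0.8 split yields 84 training and 21 testing images; every image lands in exactly one of the two sets.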
224
backend/app/workers/quality_tasks.py
Normal file
@@ -0,0 +1,224 @@
import os
from pathlib import Path

import httpx
from PIL import Image as PILImage
import imagehash
import numpy as np
from scipy import ndimage

from app.workers.celery_app import celery_app
from app.database import SessionLocal
from app.models import Image
from app.config import get_settings

settings = get_settings()


def calculate_blur_score(image_path: str) -> float:
    """Calculate blur score using Laplacian variance. Higher = sharper."""
    try:
        img = PILImage.open(image_path).convert("L")
        img_array = np.array(img)
        laplacian = ndimage.laplace(img_array)
        return float(np.var(laplacian))
    except Exception:
        return 0.0


def calculate_phash(image_path: str) -> str:
    """Calculate perceptual hash for deduplication."""
    try:
        img = PILImage.open(image_path)
        return str(imagehash.phash(img))
    except Exception:
        return ""


def check_color_distribution(image_path: str) -> tuple[bool, str]:
    """Check if image has healthy color distribution for a plant photo.

    Returns (passed, reason) tuple.
    Rejects:
    - Low color variance (mean channel std < 25): herbarium specimens (brown on white)
    - No green + low variance (green ratio < 5% AND mean std < 40): monochrome illustrations
    """
    try:
        img = PILImage.open(image_path).convert("RGB")
        arr = np.array(img, dtype=np.float64)

        # Per-channel standard deviation
        channel_stds = arr.std(axis=(0, 1))  # [R_std, G_std, B_std]
        mean_std = float(channel_stds.mean())

        if mean_std < 25:
            return False, f"Low color variance ({mean_std:.1f})"

        # Check green ratio
        channel_means = arr.mean(axis=(0, 1))
        total = channel_means.sum()
        green_ratio = channel_means[1] / total if total > 0 else 0

        if green_ratio < 0.05 and mean_std < 40:
            return False, f"No green ({green_ratio:.2%}) + low variance ({mean_std:.1f})"

        return True, ""
    except Exception:
        return True, ""  # Don't reject on error


def resize_image(image_path: str, target_size: int = 512) -> bool:
    """Resize image to target size while maintaining aspect ratio."""
    try:
        img = PILImage.open(image_path)
        img.thumbnail((target_size, target_size), PILImage.Resampling.LANCZOS)
        img.save(image_path, quality=95)
        return True
    except Exception:
        return False


@celery_app.task
def download_and_process_image(image_id: int):
    """Download image, check quality, dedupe, and resize."""
    db = SessionLocal()
    image = None
    try:
        image = db.query(Image).filter(Image.id == image_id).first()
        if not image:
            return {"error": "Image not found"}

        # Create directory for species
        species = image.species
        species_dir = Path(settings.images_path) / species.scientific_name.replace(" ", "_")
        species_dir.mkdir(parents=True, exist_ok=True)

        # Download image
        filename = f"{image.source}_{image.source_id or image.id}.jpg"
        local_path = species_dir / filename

        try:
            headers = {
                "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
            }
            with httpx.Client(timeout=30, headers=headers, follow_redirects=True) as client:
                response = client.get(image.url)
                response.raise_for_status()

            with open(local_path, "wb") as f:
                f.write(response.content)
        except Exception as e:
            image.status = "rejected"
            db.commit()
            return {"error": f"Download failed: {e}"}

        # Check minimum size
        try:
            with PILImage.open(local_path) as img:
                width, height = img.size
                if width < 256 or height < 256:
                    os.remove(local_path)
                    image.status = "rejected"
                    db.commit()
                    return {"error": "Image too small"}
                image.width = width
                image.height = height
        except Exception as e:
            if local_path.exists():
                os.remove(local_path)
            image.status = "rejected"
            db.commit()
            return {"error": f"Invalid image: {e}"}

        # Calculate perceptual hash for deduplication
        phash = calculate_phash(str(local_path))
        if phash:
            # Check for duplicates
            existing = db.query(Image).filter(
                Image.phash == phash,
                Image.id != image.id,
                Image.status == "downloaded"
            ).first()

            if existing:
                os.remove(local_path)
                image.status = "rejected"
                image.phash = phash
                db.commit()
                return {"error": "Duplicate image"}

            image.phash = phash

        # Calculate blur score
        quality_score = calculate_blur_score(str(local_path))
        image.quality_score = quality_score

        # Reject very blurry images (threshold can be tuned)
        if quality_score < 100:  # Low variance = blurry
            os.remove(local_path)
            image.status = "rejected"
            db.commit()
            return {"error": "Image too blurry"}

        # Check color distribution (reject herbarium specimens, illustrations)
        color_ok, color_reason = check_color_distribution(str(local_path))
        if not color_ok:
            os.remove(local_path)
            image.status = "rejected"
            db.commit()
            return {"error": f"Non-photo content: {color_reason}"}

        # Resize to 512x512 max
        resize_image(str(local_path))

        # Update image record
        image.local_path = str(local_path)
        image.status = "downloaded"
        db.commit()

        return {
            "status": "success",
            "path": str(local_path),
            "quality_score": quality_score,
        }

    except Exception as e:
        if image:
            image.status = "rejected"
            db.commit()
        return {"error": str(e)}
    finally:
        db.close()


@celery_app.task(bind=True)
def batch_process_pending_images(self, source: str | None = None, chunk_size: int = 500):
    """Process ALL pending images in chunks, with progress tracking."""
    db = SessionLocal()
    try:
        query = db.query(Image).filter(Image.status == "pending")
        if source:
            query = query.filter(Image.source == source)

        total = query.count()
        queued = 0
        offset = 0

        while offset < total:
            chunk = query.order_by(Image.id).offset(offset).limit(chunk_size).all()
            if not chunk:
                break

            for image in chunk:
                download_and_process_image.delay(image.id)
                queued += 1

            offset += len(chunk)

            self.update_state(
                state="PROGRESS",
                meta={"queued": queued, "total": total},
            )

        return {"queued": queued, "total": total}
    finally:
        db.close()
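The `check_color_distribution` heuristic reduces each image to two numbers: the mean per-channel standard deviation and the green channel's share of total brightness. A stdlib-only sketch of the same math on raw (R, G, B) pixel tuples (thresholds copied from the task above; the pixel lists are made up to illustrate the two branches):

```python
from statistics import pstdev

def color_check(pixels: list[tuple[int, int, int]]) -> tuple[bool, str]:
    """Mirror the scraper's herbarium/illustration rejection thresholds."""
    # Mean of the per-channel population standard deviations
    stds = [pstdev([p[c] for p in pixels]) for c in range(3)]
    mean_std = sum(stds) / 3

    if mean_std < 25:
        return False, f"Low color variance ({mean_std:.1f})"

    # Green channel's share of total mean brightness
    means = [sum(p[c] for p in pixels) / len(pixels) for c in range(3)]
    total = sum(means)
    green_ratio = means[1] / total if total > 0 else 0
    if green_ratio < 0.05 and mean_std < 40:
        return False, f"No green ({green_ratio:.2%}) + low variance ({mean_std:.1f})"
    return True, ""

beige_scan = [(200, 190, 170)] * 100   # near-uniform herbarium background
print(color_check(beige_scan)[0])      # False
leafy = [(30, 180, 40), (220, 240, 210), (10, 90, 20), (160, 200, 120)] * 25
print(color_check(leafy)[0])           # True
```

A flat beige scan has almost zero per-channel variance and fails the first test, while a varied, green-heavy photo clears both thresholds easily.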
164
backend/app/workers/scrape_tasks.py
Normal file
@@ -0,0 +1,164 @@
import json
from datetime import datetime

from app.workers.celery_app import celery_app
from app.database import SessionLocal
from app.models import Job, Species, Image
from app.utils.logging import get_job_logger


@celery_app.task(bind=True)
def run_scrape_job(self, job_id: int):
    """Main scrape task that dispatches to source-specific scrapers."""
    logger = get_job_logger(job_id)
    logger.info(f"Starting scrape job {job_id}")

    db = SessionLocal()
    job = None
    try:
        job = db.query(Job).filter(Job.id == job_id).first()
        if not job:
            logger.error(f"Job {job_id} not found")
            return {"error": "Job not found"}

        logger.info(f"Job: {job.name}, Source: {job.source}")

        # Update job status
        job.status = "running"
        job.started_at = datetime.utcnow()
        job.celery_task_id = self.request.id
        db.commit()

        # Get species to scrape
        if job.species_filter:
            species_ids = json.loads(job.species_filter)
            query = db.query(Species).filter(Species.id.in_(species_ids))
            logger.info(f"Filtered to species IDs: {species_ids}")
        else:
            query = db.query(Species)
            logger.info("Scraping all species")

        # Filter by image count if requested
        if job.only_without_images or job.max_images:
            from sqlalchemy import func
            # Subquery to count downloaded images per species
            image_count_subquery = (
                db.query(Image.species_id, func.count(Image.id).label("count"))
                .filter(Image.status == "downloaded")
                .group_by(Image.species_id)
                .subquery()
            )
            # Left join with the count subquery
            query = query.outerjoin(
                image_count_subquery,
                Species.id == image_count_subquery.c.species_id
            )

            if job.only_without_images:
                # Filter where count is NULL or 0
                query = query.filter(
                    (image_count_subquery.c.count == None) | (image_count_subquery.c.count == 0)
                )
                logger.info("Filtering to species without images")
            elif job.max_images:
                # Filter where count is NULL or less than max_images
                query = query.filter(
                    (image_count_subquery.c.count == None) | (image_count_subquery.c.count < job.max_images)
                )
                logger.info(f"Filtering to species with fewer than {job.max_images} images")

        species_list = query.all()
        logger.info(f"Total species to scrape: {len(species_list)}")

        job.progress_total = len(species_list)
        db.commit()

        # Import scraper based on source
        from app.scrapers import get_scraper
        scraper = get_scraper(job.source)

        if not scraper:
            error_msg = f"Unknown source: {job.source}"
            logger.error(error_msg)
            job.status = "failed"
            job.error_message = error_msg
            job.completed_at = datetime.utcnow()
            db.commit()
            return {"error": error_msg}

        logger.info(f"Using scraper: {scraper.name}")

        # Scrape each species
        for i, species in enumerate(species_list):
            try:
                # Update progress
                job.progress_current = i + 1
                db.commit()

                logger.info(f"[{i+1}/{len(species_list)}] Scraping: {species.scientific_name}")

                # Update task state for real-time monitoring
                self.update_state(
                    state="PROGRESS",
                    meta={
                        "current": i + 1,
                        "total": len(species_list),
                        "species": species.scientific_name,
                    }
                )

                # Run scraper for this species
                results = scraper.scrape_species(species, db, logger)
                downloaded = results.get("downloaded", 0)
                rejected = results.get("rejected", 0)
                job.images_downloaded += downloaded
                job.images_rejected += rejected
                db.commit()

                logger.info(f"  -> Downloaded: {downloaded}, Rejected: {rejected}")

            except Exception as e:
                # Log error but continue with other species
                logger.error(f"Error scraping {species.scientific_name}: {e}", exc_info=True)
                continue

        # Mark job complete
        job.status = "completed"
        job.completed_at = datetime.utcnow()
        db.commit()

        logger.info(f"Job {job_id} completed. Total downloaded: {job.images_downloaded}, rejected: {job.images_rejected}")

        return {
            "status": "completed",
            "downloaded": job.images_downloaded,
            "rejected": job.images_rejected,
        }

    except Exception as e:
        logger.error(f"Job {job_id} failed with error: {e}", exc_info=True)
        if job:
            job.status = "failed"
            job.error_message = str(e)
            job.completed_at = datetime.utcnow()
            db.commit()
        raise
    finally:
        db.close()


@celery_app.task
def pause_scrape_job(job_id: int):
    """Pause a running scrape job."""
    db = SessionLocal()
    try:
        job = db.query(Job).filter(Job.id == job_id).first()
        if job and job.status == "running":
            job.status = "paused"
            db.commit()
            # Revoke the Celery task
            if job.celery_task_id:
                celery_app.control.revoke(job.celery_task_id, terminate=True)
        return {"status": "paused"}
    finally:
        db.close()
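The outer-join image-count filter is easiest to see in plain SQL: species with no downloaded images produce a NULL count from the LEFT JOIN, so the WHERE clause must accept NULL as well as small counts. A self-contained sqlite3 sketch of the `max_images` branch (the table layout mirrors the models; the rows are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE species (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE images (id INTEGER PRIMARY KEY, species_id INTEGER, status TEXT);
    INSERT INTO species VALUES
        (1, 'Monstera deliciosa'), (2, 'Ficus lyrata'), (3, 'Pilea peperomioides');
    INSERT INTO images VALUES
        (1, 1, 'downloaded'), (2, 1, 'downloaded'), (3, 2, 'downloaded');
""")

max_images = 2
rows = con.execute("""
    SELECT s.name
    FROM species s
    LEFT JOIN (
        SELECT species_id, COUNT(*) AS cnt
        FROM images WHERE status = 'downloaded'
        GROUP BY species_id
    ) c ON c.species_id = s.id
    WHERE c.cnt IS NULL OR c.cnt < ?   -- NULL means the species has no images yet
    ORDER BY s.name
""", (max_images,)).fetchall()

print([r[0] for r in rows])  # ['Ficus lyrata', 'Pilea peperomioides']
```

Monstera already has 2 downloaded images and is skipped; Ficus (1 image) and Pilea (none, NULL count) are still eligible for scraping.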
193
backend/app/workers/stats_tasks.py
Normal file
@@ -0,0 +1,193 @@
import json
import os
from datetime import datetime
from pathlib import Path

from sqlalchemy import func, case, text

from app.workers.celery_app import celery_app
from app.database import SessionLocal
from app.models import Species, Image, Job
from app.models.cached_stats import CachedStats
from app.config import get_settings


def get_directory_size_fast(path: str) -> int:
    """Get directory size in bytes using fast os.scandir."""
    total = 0
    try:
        with os.scandir(path) as it:
            for entry in it:
                try:
                    if entry.is_file(follow_symlinks=False):
                        total += entry.stat(follow_symlinks=False).st_size
                    elif entry.is_dir(follow_symlinks=False):
                        total += get_directory_size_fast(entry.path)
                except (OSError, PermissionError):
                    pass
    except (OSError, PermissionError):
        pass
    return total


@celery_app.task
def refresh_stats():
    """Calculate and cache dashboard statistics."""
    print("=== STATS TASK: Starting refresh ===", flush=True)

    db = SessionLocal()
    try:
        # Use raw SQL for maximum performance on SQLite
        # All counts in a single query
        counts_sql = text("""
            SELECT
                (SELECT COUNT(*) FROM species) as total_species,
                (SELECT COUNT(*) FROM images) as total_images,
                (SELECT COUNT(*) FROM images WHERE status = 'downloaded') as images_downloaded,
                (SELECT COUNT(*) FROM images WHERE status = 'pending') as images_pending,
                (SELECT COUNT(*) FROM images WHERE status = 'rejected') as images_rejected
        """)
        counts = db.execute(counts_sql).fetchone()
        total_species = counts[0] or 0
        total_images = counts[1] or 0
        images_downloaded = counts[2] or 0
        images_pending = counts[3] or 0
        images_rejected = counts[4] or 0

        # Per-source stats - single query with GROUP BY
        source_sql = text("""
            SELECT
                source,
                COUNT(*) as total,
                SUM(CASE WHEN status = 'downloaded' THEN 1 ELSE 0 END) as downloaded,
                SUM(CASE WHEN status = 'pending' THEN 1 ELSE 0 END) as pending,
                SUM(CASE WHEN status = 'rejected' THEN 1 ELSE 0 END) as rejected
            FROM images
            GROUP BY source
        """)
        source_stats_raw = db.execute(source_sql).fetchall()
        sources = [
            {
                "source": s[0],
                "image_count": s[1],
                "downloaded": s[2] or 0,
                "pending": s[3] or 0,
                "rejected": s[4] or 0,
            }
            for s in source_stats_raw
        ]

        # Per-license stats - single indexed query
        license_sql = text("""
            SELECT license, COUNT(*) as count
            FROM images
            WHERE status = 'downloaded'
            GROUP BY license
        """)
        license_stats_raw = db.execute(license_sql).fetchall()
        licenses = [
            {"license": l[0], "count": l[1]}
            for l in license_stats_raw
        ]

        # Job stats - single query
        job_sql = text("""
            SELECT
                SUM(CASE WHEN status = 'running' THEN 1 ELSE 0 END) as running,
                SUM(CASE WHEN status = 'pending' THEN 1 ELSE 0 END) as pending,
                SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) as completed,
                SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) as failed
            FROM jobs
        """)
        job_counts = db.execute(job_sql).fetchone()
        jobs = {
            "running": job_counts[0] or 0,
            "pending": job_counts[1] or 0,
            "completed": job_counts[2] or 0,
            "failed": job_counts[3] or 0,
        }

        # Top species by image count - optimized with index
        top_sql = text("""
            SELECT s.id, s.scientific_name, s.common_name, COUNT(i.id) as image_count
            FROM species s
            INNER JOIN images i ON i.species_id = s.id AND i.status = 'downloaded'
            GROUP BY s.id
            ORDER BY image_count DESC
            LIMIT 10
        """)
        top_species_raw = db.execute(top_sql).fetchall()
        top_species = [
            {
                "id": s[0],
                "scientific_name": s[1],
                "common_name": s[2],
                "image_count": s[3],
            }
            for s in top_species_raw
        ]

        # Under-represented species - use pre-computed counts
        under_sql = text("""
            SELECT s.id, s.scientific_name, s.common_name, COALESCE(img_counts.cnt, 0) as image_count
            FROM species s
            LEFT JOIN (
                SELECT species_id, COUNT(*) as cnt
                FROM images
                WHERE status = 'downloaded'
                GROUP BY species_id
            ) img_counts ON img_counts.species_id = s.id
            WHERE COALESCE(img_counts.cnt, 0) < 100
            ORDER BY image_count ASC
            LIMIT 10
        """)
        under_rep_raw = db.execute(under_sql).fetchall()
        under_represented = [
            {
                "id": s[0],
                "scientific_name": s[1],
                "common_name": s[2],
                "image_count": s[3],
            }
            for s in under_rep_raw
        ]

        # Calculate disk usage (fast recursive scan)
        settings = get_settings()
        disk_usage_bytes = get_directory_size_fast(settings.images_path)
        disk_usage_mb = round(disk_usage_bytes / (1024 * 1024), 2)

        # Build the stats object
        stats = {
            "total_species": total_species,
            "total_images": total_images,
            "images_downloaded": images_downloaded,
            "images_pending": images_pending,
            "images_rejected": images_rejected,
            "disk_usage_mb": disk_usage_mb,
            "sources": sources,
            "licenses": licenses,
            "jobs": jobs,
            "top_species": top_species,
            "under_represented": under_represented,
        }

        # Store in database
        cached = db.query(CachedStats).filter(CachedStats.key == "dashboard_stats").first()
        if cached:
            cached.value = json.dumps(stats)
            cached.updated_at = datetime.utcnow()
        else:
            cached = CachedStats(key="dashboard_stats", value=json.dumps(stats))
            db.add(cached)

        db.commit()
        print(f"=== STATS TASK: Refreshed (species={total_species}, images={total_images}) ===", flush=True)

        return {"status": "success", "total_species": total_species, "total_images": total_images}

    except Exception as e:
        print(f"=== STATS TASK ERROR: {e} ===", flush=True)
        raise
    finally:
        db.close()
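The per-source query relies on the `SUM(CASE WHEN …)` idiom to pivot one status column into several counters in a single table scan. A small sqlite3 demonstration of the same query shape (the rows are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE images (source TEXT, status TEXT);
    INSERT INTO images VALUES
        ('flickr', 'downloaded'), ('flickr', 'rejected'),
        ('inaturalist', 'downloaded'), ('inaturalist', 'downloaded'), ('inaturalist', 'pending');
""")

rows = con.execute("""
    SELECT source,
           COUNT(*)                                               AS total,
           SUM(CASE WHEN status = 'downloaded' THEN 1 ELSE 0 END) AS downloaded,
           SUM(CASE WHEN status = 'pending'    THEN 1 ELSE 0 END) AS pending,
           SUM(CASE WHEN status = 'rejected'   THEN 1 ELSE 0 END) AS rejected
    FROM images
    GROUP BY source
    ORDER BY source
""").fetchall()

for r in rows:
    print(r)
# ('flickr', 2, 1, 0, 1)
# ('inaturalist', 3, 2, 1, 0)
```

One GROUP BY pass replaces four separate filtered COUNT queries per source, which is why the stats task can refresh cheaply even on a large images table.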
34
backend/requirements.txt
Normal file
@@ -0,0 +1,34 @@
# Web framework
fastapi==0.109.0
uvicorn[standard]==0.27.0
python-multipart==0.0.6

# Database
sqlalchemy==2.0.25
alembic==1.13.1
aiosqlite==0.19.0

# Task queue
celery==5.3.6
redis==5.0.1

# Image processing
Pillow==10.2.0
imagehash==4.3.1
imagededup==0.3.3.post2
numpy
scipy

# HTTP clients
httpx==0.26.0
aiohttp==3.9.3

# Search
duckduckgo-search

# Utilities
python-dotenv==1.0.0
pydantic==2.5.3
pydantic-settings==2.1.0

# Testing
pytest==7.4.4
pytest-asyncio==0.23.3
1
backend/tests/__init__.py
Normal file
@@ -0,0 +1 @@
# Tests
114
docker-compose.unraid.yml
Normal file
114
docker-compose.unraid.yml
Normal file
@@ -0,0 +1,114 @@
# Docker Compose for Unraid
#
# Access at http://YOUR_UNRAID_IP:8580
#
# ============================================
# CONFIGURE THESE PATHS FOR YOUR UNRAID SETUP
# ============================================
# Edit the left side of the colon (:) for each volume mount
#
# DATABASE_PATH: Where to store the SQLite database
# IMAGES_PATH:   Where to store downloaded images (can be large, 100GB+)
# EXPORTS_PATH:  Where to store generated export zip files
# IMPORTS_PATH:  Where to place images for bulk import (source/species/images)
# LOGS_PATH:     Where to store scraper log files for debugging

services:
  backend:
    build:
      context: /mnt/user/appdata/PlantGuideScraper/backend
      dockerfile: Dockerfile
    container_name: plant-scraper-backend
    restart: unless-stopped
    volumes:
      - /mnt/user/appdata/PlantGuideScraper/backend:/app:ro
      # === CONFIGURABLE DATA PATHS ===
      - /mnt/user/downloads/PlantGuideDocker/database:/data/db      # DATABASE_PATH
      - /mnt/user/downloads/PlantGuideDocker/images:/data/images    # IMAGES_PATH
      - /mnt/user/downloads/PlantGuideDocker/exports:/data/exports  # EXPORTS_PATH
      - /mnt/user/downloads/PlantGuideDocker/imports:/data/imports  # IMPORTS_PATH
      - /mnt/user/downloads/PlantGuideDocker/logs:/data/logs        # LOGS_PATH
    environment:
      - DATABASE_URL=sqlite:////data/db/plants.sqlite
      - REDIS_URL=redis://plant-scraper-redis:6379/0
      - IMAGES_PATH=/data/images
      - EXPORTS_PATH=/data/exports
      - IMPORTS_PATH=/data/imports
      - LOGS_PATH=/data/logs
    depends_on:
      - redis
    command: uvicorn app.main:app --host 0.0.0.0 --port 8000
    networks:
      - plant-scraper

  celery:
    build:
      context: /mnt/user/appdata/PlantGuideScraper/backend
      dockerfile: Dockerfile
    container_name: plant-scraper-celery
    restart: unless-stopped
    volumes:
      - /mnt/user/appdata/PlantGuideScraper/backend:/app:ro
      # === CONFIGURABLE DATA PATHS (must match backend) ===
      - /mnt/user/downloads/PlantGuideDocker/database:/data/db      # DATABASE_PATH
      - /mnt/user/downloads/PlantGuideDocker/images:/data/images    # IMAGES_PATH
      - /mnt/user/downloads/PlantGuideDocker/exports:/data/exports  # EXPORTS_PATH
      - /mnt/user/downloads/PlantGuideDocker/imports:/data/imports  # IMPORTS_PATH
      - /mnt/user/downloads/PlantGuideDocker/logs:/data/logs        # LOGS_PATH
    environment:
      - DATABASE_URL=sqlite:////data/db/plants.sqlite
      - REDIS_URL=redis://plant-scraper-redis:6379/0
      - IMAGES_PATH=/data/images
      - EXPORTS_PATH=/data/exports
      - IMPORTS_PATH=/data/imports
      - LOGS_PATH=/data/logs
    depends_on:
      - redis
    command: celery -A app.workers.celery_app worker --beat --loglevel=info --concurrency=4
    networks:
      - plant-scraper

  redis:
    image: redis:7-alpine
    container_name: plant-scraper-redis
    restart: unless-stopped
    volumes:
      - /mnt/user/appdata/PlantGuideScraper/redis:/data
    networks:
      - plant-scraper

  frontend:
    build:
      context: /mnt/user/appdata/PlantGuideScraper/frontend
      dockerfile: Dockerfile
    container_name: plant-scraper-frontend
    restart: unless-stopped
    volumes:
      - /mnt/user/appdata/PlantGuideScraper/frontend:/app
      - plant-scraper-node-modules:/app/node_modules
    environment:
      - VITE_API_URL=
    command: npm run dev -- --host
    networks:
      - plant-scraper

  nginx:
    image: nginx:alpine
    container_name: plant-scraper-nginx
    restart: unless-stopped
    ports:
      - "8580:80"
    volumes:
      - /mnt/user/appdata/PlantGuideScraper/nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - backend
      - frontend
    networks:
      - plant-scraper

networks:
  plant-scraper:
    name: plant-scraper

volumes:
  plant-scraper-node-modules:
73
docker-compose.yml
Normal file
@@ -0,0 +1,73 @@
services:
  backend:
    build:
      context: ./backend
      dockerfile: Dockerfile
    container_name: plant-scraper-backend
    # Port exposed only internally, nginx proxies to it
    volumes:
      - ./backend:/app
      - ./data:/data
    environment:
      - DATABASE_URL=sqlite:////data/db/plants.sqlite
      - REDIS_URL=redis://redis:6379/0
      - IMAGES_PATH=/data/images
      - EXPORTS_PATH=/data/exports
      - IMPORTS_PATH=/data/imports
      - LOGS_PATH=/data/logs
    depends_on:
      - redis
    command: uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

  celery:
    build:
      context: ./backend
      dockerfile: Dockerfile
    container_name: plant-scraper-celery
    volumes:
      - ./backend:/app
      - ./data:/data
    environment:
      - DATABASE_URL=sqlite:////data/db/plants.sqlite
      - REDIS_URL=redis://redis:6379/0
      - IMAGES_PATH=/data/images
      - EXPORTS_PATH=/data/exports
      - IMPORTS_PATH=/data/imports
      - LOGS_PATH=/data/logs
    depends_on:
      - redis
    command: celery -A app.workers.celery_app worker --beat --loglevel=info --concurrency=4

  redis:
    image: redis:7-alpine
    container_name: plant-scraper-redis
    # Port exposed only internally, not to host (avoid conflicts)
    volumes:
      - redis_data:/data

  frontend:
    build:
      context: ./frontend
      dockerfile: Dockerfile
    container_name: plant-scraper-frontend
    # Port exposed only internally, nginx proxies to it
    volumes:
      - ./frontend:/app
      - /app/node_modules
    environment:
      - VITE_API_URL=
    command: npm run dev -- --host

  nginx:
    image: nginx:alpine
    container_name: plant-scraper-nginx
    ports:
      - "80:80"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - backend
      - frontend

volumes:
  redis_data:
564
docs/master_plan.md
Normal file
@@ -0,0 +1,564 @@
# Houseplant Image Scraper - Master Plan

## Overview

Web-based interface for managing a multi-source image scraping pipeline targeting 5-10K houseplant species with 1-5M total images. Runs on Unraid via Docker, exports datasets for CoreML training.

---

## Requirements Summary

| Requirement | Value |
|-------------|-------|
| Platform | Web app in Docker on Unraid |
| Sources | iNaturalist/GBIF, Flickr, Wikimedia Commons, Trefle, USDA PLANTS, EOL |
| API keys | Configurable per service |
| Species list | Manual import (CSV/paste) |
| Grouping | Species, genus, source, license (faceted) |
| Search/filter | Yes |
| Quality filter | Automatic (hash dedup, blur, size) |
| Progress | Real-time dashboard |
| Storage | `/species_name/image.jpg` + SQLite DB |
| Export | Filtered zip for CoreML, downloadable anytime |
| Auth | None (single user) |
| Deployment | Docker Compose |

---
## Create ML Export Requirements

Per [Apple's documentation](https://developer.apple.com/documentation/createml/creating-an-image-classifier-model):

- **Folder structure**: `/SpeciesName/image001.jpg` (folder name = class label)
- **Train/Test split**: 80/20 recommended, separate folders
- **Balance**: Roughly equal images per class (avoid bias)
- **No metadata needed**: Create ML uses folder names as labels

### Export Format

```
dataset_export/
├── Training/
│   ├── Monstera_deliciosa/
│   │   ├── img001.jpg
│   │   └── ...
│   ├── Philodendron_hederaceum/
│   └── ...
└── Testing/
    ├── Monstera_deliciosa/
    └── ...
```
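The per-class 80/20 split can be sketched as a small helper (hypothetical code, not the project's actual export worker; `images_by_species` maps a class label to its image paths):

```python
import random

def split_dataset(images_by_species, train_ratio=0.8, seed=42):
    """Split each species' images into train/test lists. Splitting per
    class keeps the Training/ and Testing/ folders balanced."""
    rng = random.Random(seed)  # fixed seed makes exports reproducible
    train, test = {}, {}
    for species, images in images_by_species.items():
        shuffled = list(images)
        rng.shuffle(shuffled)
        cut = max(1, int(len(shuffled) * train_ratio))  # keep >=1 training image
        train[species] = shuffled[:cut]
        test[species] = shuffled[cut:]
    return train, test
```

The export worker would then write `train[species]` under `Training/{species}/` and `test[species]` under `Testing/{species}/`.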
---
## Data Sources

| Source | API/Method | License Filter | Rate Limits | Notes |
|--------|------------|----------------|-------------|-------|
| **iNaturalist/GBIF** | Bulk DwC-A export + API | CC0, CC-BY | 1 req/sec, 10k/day, 5GB/hr media | Best source: Research Grade = verified |
| **Flickr** | flickr.photos.search | license=4,9 (CC-BY, CC0) | 3600 req/hr | Good supplemental |
| **Wikimedia Commons** | MediaWiki API + pyWikiCommons | CC-BY, CC-BY-SA, PD | Generous | Category-based search |
| **Trefle.io** | REST API | Open source | Free tier | Species metadata + some images |
| **USDA PLANTS** | REST API | Public Domain | Generous | US-focused, limited images |
| **Plant.id** | REST API | Commercial | Paid | For validation, not scraping |
| **Encyclopedia of Life** | API | Mixed | Check each | Aggregator |

### Source References

- iNaturalist: https://www.inaturalist.org/pages/developers
- iNaturalist bulk download: https://forum.inaturalist.org/t/one-time-bulk-download-dataset/18741
- Flickr API: https://www.flickr.com/services/api/flickr.photos.search.html
- Wikimedia Commons API: https://commons.wikimedia.org/wiki/Commons:API
- pyWikiCommons: https://pypi.org/project/pyWikiCommons/
- Trefle.io: https://trefle.io/
- USDA PLANTS: https://data.nal.usda.gov/dataset/usda-plants-database-api-r

### Flickr License IDs

| ID | License |
|----|---------|
| 0 | All Rights Reserved |
| 1 | CC BY-NC-SA 2.0 |
| 2 | CC BY-NC 2.0 |
| 3 | CC BY-NC-ND 2.0 |
| 4 | CC BY 2.0 (Commercial OK) |
| 5 | CC BY-SA 2.0 |
| 6 | CC BY-ND 2.0 |
| 7 | No known copyright restrictions |
| 8 | United States Government Work |
| 9 | Public Domain (CC0) |

**For commercial use**: Filter to license IDs 4, 7, 8, 9 only.
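Applying that filter server-side keeps the scraper from downloading unusable images at all. A sketch of the `flickr.photos.search` query parameters (hypothetical helper; the `api_key` comes from the user-provided settings):

```python
COMMERCIAL_LICENSES = "4,7,8,9"  # CC BY, no known restrictions, US Gov work, CC0

def flickr_search_params(api_key: str, species: str, per_page: int = 500) -> dict:
    """Build query params for flickr.photos.search restricted to
    commercially usable licenses (illustrative sketch)."""
    return {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "text": species,
        "license": COMMERCIAL_LICENSES,  # filter at the API, not client-side
        "extras": "license,url_l,owner_name",  # return license + attribution info
        "content_type": 1,               # photos only
        "format": "json",
        "nojsoncallback": 1,
        "per_page": per_page,
    }
```

The scraper would pass this dict to an `httpx` GET against the Flickr REST endpoint and record `license` and `owner_name` per photo for the attribution column.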
---
## Image Quality Pipeline

| Stage | Library | Purpose |
|-------|---------|---------|
| **Deduplication** | imagededup | Perceptual hash (CNN + hash methods) |
| **Blur detection** | scipy + Sobel variance | Reject blurry images |
| **Size filter** | Pillow | Min 256x256 |
| **Resize** | Pillow | Normalize to 512x512 |

### Library References

- imagededup: https://github.com/idealo/imagededup
- imagehash: https://github.com/JohannesBuchner/imagehash
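A minimal sketch of the blur check, assuming a grayscale array as input (it uses `np.gradient` for brevity where the plan specifies a scipy Sobel filter, which behaves similarly; the threshold is a placeholder to be tuned on sample images):

```python
import numpy as np

def sharpness_score(gray: np.ndarray) -> float:
    """Variance of the gradient magnitude: low values indicate blur."""
    gy, gx = np.gradient(gray.astype(float))
    return float((gx**2 + gy**2).var())

def is_blurry(gray: np.ndarray, threshold: float = 100.0) -> bool:
    # threshold is dataset-dependent; calibrate against known-sharp images
    return sharpness_score(gray) < threshold
```

In the pipeline this would run after the size filter, with the score persisted to the `quality_score` column so the threshold can be re-applied later without re-downloading.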
---
## Technology Stack

| Component | Choice | Rationale |
|-----------|--------|-----------|
| **Backend** | FastAPI (Python) | Async, fast, ML ecosystem, great docs |
| **Frontend** | React + Tailwind | Fast dev, good component libraries |
| **Database** | SQLite (+ FTS5) | Simple, no separate container, sufficient for single-user |
| **Task Queue** | Celery + Redis | Long-running scrape jobs, good monitoring |
| **Containers** | Docker Compose | Multi-service orchestration |

Reference: https://github.com/fastapi/full-stack-fastapi-template

---
## Architecture

```
┌──────────────────────────────────────────────────────────────────────────┐
│                         DOCKER COMPOSE ON UNRAID                         │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌─────────────┐   ┌─────────────────────────────────────────────────┐  │
│  │    NGINX    │   │                FASTAPI BACKEND                  │  │
│  │     :80     │──▶│  /api/species  - CRUD species list              │  │
│  │             │   │  /api/sources  - API key management             │  │
│  └──────┬──────┘   │  /api/jobs     - Scrape job control             │  │
│         │          │  /api/images   - Search, filter, browse         │  │
│         ▼          │  /api/export   - Generate zip for CoreML        │  │
│  ┌─────────────┐   │  /api/stats    - Dashboard metrics              │  │
│  │    REACT    │   └─────────────────────────────────────────────────┘  │
│  │     SPA     │                          │                             │
│  │    :3000    │                          ▼                             │
│  └─────────────┘   ┌─────────────────────────────────────────────────┐  │
│                    │                 CELERY WORKERS                  │  │
│  ┌─────────────┐   │  - iNaturalist scraper                          │  │
│  │    REDIS    │◀──│  - Flickr scraper                               │  │
│  │    :6379    │   │  - Wikimedia scraper                            │  │
│  └─────────────┘   │  - Quality filter pipeline                      │  │
│                    │  - Export generator                             │  │
│                    └─────────────────────────────────────────────────┘  │
│                                          │                              │
│                                          ▼                              │
│  ┌────────────────────────────────────────────────────────────────────┐ │
│  │                        STORAGE (Bind Mounts)                       │ │
│  │  /data/db/plants.sqlite   - Species, images metadata, jobs        │ │
│  │  /data/images/{species}/  - Downloaded images                     │ │
│  │  /data/exports/           - Generated zip files                   │ │
│  └────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
```

---
## Database Schema

```sql
-- Species master list (imported from CSV)
CREATE TABLE species (
    id INTEGER PRIMARY KEY,
    scientific_name TEXT UNIQUE NOT NULL,
    common_name TEXT,
    genus TEXT,
    family TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Full-text search index
CREATE VIRTUAL TABLE species_fts USING fts5(
    scientific_name,
    common_name,
    genus,
    content='species',
    content_rowid='id'
);

-- API credentials
CREATE TABLE api_keys (
    id INTEGER PRIMARY KEY,
    source TEXT UNIQUE NOT NULL,      -- 'flickr', 'inaturalist', 'wikimedia', 'trefle'
    api_key TEXT NOT NULL,
    api_secret TEXT,
    rate_limit_per_sec REAL DEFAULT 1.0,
    enabled BOOLEAN DEFAULT TRUE
);

-- Downloaded images
CREATE TABLE images (
    id INTEGER PRIMARY KEY,
    species_id INTEGER REFERENCES species(id),
    source TEXT NOT NULL,
    source_id TEXT,                   -- Original ID from source
    url TEXT NOT NULL,
    local_path TEXT,
    license TEXT NOT NULL,
    attribution TEXT,
    width INTEGER,
    height INTEGER,
    phash TEXT,                       -- Perceptual hash for dedup
    quality_score REAL,               -- Blur/quality metric
    status TEXT DEFAULT 'pending',    -- pending, downloaded, rejected, deleted
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(source, source_id)
);

-- Indexes for common queries
CREATE INDEX idx_images_species ON images(species_id);
CREATE INDEX idx_images_status ON images(status);
CREATE INDEX idx_images_source ON images(source);
CREATE INDEX idx_images_phash ON images(phash);

-- Scrape jobs
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    source TEXT NOT NULL,
    species_filter TEXT,              -- JSON array of species IDs or NULL for all
    status TEXT DEFAULT 'pending',    -- pending, running, paused, completed, failed
    progress_current INTEGER DEFAULT 0,
    progress_total INTEGER DEFAULT 0,
    images_downloaded INTEGER DEFAULT 0,
    images_rejected INTEGER DEFAULT 0,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    error_message TEXT
);

-- Export jobs
CREATE TABLE exports (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    filter_criteria TEXT,             -- JSON: min_images, licenses, min_quality, species_ids
    train_split REAL DEFAULT 0.8,
    status TEXT DEFAULT 'pending',    -- pending, generating, completed, failed
    file_path TEXT,
    file_size INTEGER,
    species_count INTEGER,
    image_count INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    completed_at TIMESTAMP
);
```
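A quick in-memory demonstration of querying the external-content FTS5 table defined above (the index is populated with an explicit insert here; the app could instead keep it in sync with triggers):

```python
import sqlite3

# In-memory demo of the species_fts index (trimmed to three columns)
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE species (
    id INTEGER PRIMARY KEY,
    scientific_name TEXT UNIQUE NOT NULL,
    common_name TEXT,
    genus TEXT
);
CREATE VIRTUAL TABLE species_fts USING fts5(
    scientific_name, common_name, genus,
    content='species', content_rowid='id'
);
""")
db.execute("INSERT INTO species VALUES (1, 'Monstera deliciosa', 'Swiss cheese plant', 'Monstera')")
# External-content tables are not populated automatically; mirror the row in
db.execute("INSERT INTO species_fts(rowid, scientific_name, common_name, genus) "
           "SELECT id, scientific_name, common_name, genus FROM species")
rows = db.execute(
    "SELECT scientific_name FROM species_fts WHERE species_fts MATCH 'cheese'"
).fetchall()
# rows == [('Monstera deliciosa',)]
```

The `content='species'` option means FTS5 stores only the index, not a second copy of the text, which keeps the database small.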
---

## API Endpoints

### Species

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/species` | List species (paginated, searchable) |
| POST | `/api/species` | Create single species |
| POST | `/api/species/import` | Bulk import from CSV |
| GET | `/api/species/{id}` | Get species details |
| PUT | `/api/species/{id}` | Update species |
| DELETE | `/api/species/{id}` | Delete species |
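The bulk-import endpoint's row parsing could be sketched like this (hypothetical helper, stdlib only; column names are assumed to match the species table):

```python
import csv
import io

def parse_species_csv(text: str) -> list[dict]:
    """Parse a species-import CSV into rows ready for insertion."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for row in reader:
        name = (row.get("scientific_name") or "").strip()
        if not name:
            continue  # skip blank or malformed lines rather than failing the import
        rows.append({
            "scientific_name": name,
            "common_name": (row.get("common_name") or "").strip() or None,
            "genus": name.split()[0],  # derive genus from the binomial name
        })
    return rows
```

Skipping bad rows instead of aborting lets a large paste-in import succeed partially, with the skipped count reported back to the UI.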
### API Keys

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/sources` | List configured sources |
| PUT | `/api/sources/{source}` | Update source config (key, rate limit) |

### Jobs

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/jobs` | List jobs |
| POST | `/api/jobs` | Create scrape job |
| GET | `/api/jobs/{id}` | Get job status |
| POST | `/api/jobs/{id}/pause` | Pause job |
| POST | `/api/jobs/{id}/resume` | Resume job |
| POST | `/api/jobs/{id}/cancel` | Cancel job |

### Images

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/images` | List images (paginated, filterable) |
| GET | `/api/images/{id}` | Get image details |
| DELETE | `/api/images/{id}` | Delete image |
| POST | `/api/images/bulk-delete` | Bulk delete |

### Export

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/exports` | List exports |
| POST | `/api/exports` | Create export job |
| GET | `/api/exports/{id}` | Get export status |
| GET | `/api/exports/{id}/download` | Download zip file |

### Stats

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/stats` | Dashboard statistics |
| GET | `/api/stats/sources` | Per-source breakdown |
| GET | `/api/stats/species` | Per-species image counts |
---
## UI Screens

### 1. Dashboard

- Total species, images by source, images by license
- Active jobs with progress bars
- Quick stats: images/sec, disk usage
- Recent activity feed

### 2. Species Management

- Table: scientific name, common name, genus, image count
- Import CSV button (drag-and-drop)
- Search/filter by name, genus
- Bulk select → "Start Scrape" button
- Inline editing

### 3. API Keys

- Card per source with:
  - API key input (masked)
  - API secret input (if applicable)
  - Rate limit slider
  - Enable/disable toggle
  - Test connection button

### 4. Image Browser

- Grid view with thumbnails (lazy-loaded)
- Filters sidebar:
  - Species (autocomplete)
  - Source (checkboxes)
  - License (checkboxes)
  - Quality score (range slider)
  - Status (tabs: all, pending, downloaded, rejected)
- Sort by: date, quality, species
- Bulk select → actions (delete, re-queue)
- Click to view full-size + metadata

### 5. Jobs

- Table: name, source, status, progress, dates
- Real-time progress updates (WebSocket)
- Actions: pause, resume, cancel, view logs

### 6. Export

- Filter builder:
  - Min images per species
  - License whitelist
  - Min quality score
  - Species selection (all or specific)
- Train/test split slider (default 80/20)
- Preview: estimated species count, image count, file size
- "Generate Zip" button
- Download history with re-download links

---
## Tradeoffs

| Decision | Alternative | Why This Choice |
|----------|-------------|-----------------|
| SQLite | PostgreSQL | Single-user, simpler Docker setup, sufficient for millions of rows |
| Celery+Redis | RQ, Dramatiq | Battle-tested, good monitoring (Flower) |
| React | Vue, Svelte | Largest ecosystem, more component libraries |
| Separate workers | Threads in FastAPI | Better isolation, can scale workers independently |
| Nginx reverse proxy | Traefik | Simpler config for single-app deployment |

---
## Risks & Mitigations

| Risk | Likelihood | Mitigation |
|------|------------|------------|
| iNaturalist rate limits (5GB/hr) | High | Throttle downloads, prioritize species with low counts |
| Disk fills up | Medium | Dashboard shows disk usage, configurable storage limits |
| Scrape jobs crash mid-run | Medium | Job state in DB, resume from last checkpoint |
| Perceptual hash collisions | Low | Store hash, allow manual review of flagged duplicates |
| API keys exposed | Low | Environment variables, not stored in code |
| SQLite write contention | Low | WAL mode, single writer pattern via Celery |
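The WAL mitigation in the table above amounts to a couple of pragmas at startup (sketch; in the real app SQLAlchemy would issue these on each new connection):

```python
import os
import sqlite3
import tempfile

# WAL needs a file-backed database (it does not apply to :memory:)
path = os.path.join(tempfile.mkdtemp(), "plants.sqlite")
conn = sqlite3.connect(path)

# Write-ahead logging lets readers (API requests) proceed while the single
# Celery writer commits; busy_timeout makes writers wait instead of erroring.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
conn.execute("PRAGMA busy_timeout=5000")
# mode == "wal"
```

Funneling all writes through Celery tasks means there is effectively one writer, which is the access pattern SQLite handles best.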
---
## Implementation Phases

### Phase 1: Foundation
- [ ] Docker Compose setup (FastAPI, React, Redis, Nginx)
- [ ] Database schema + migrations (Alembic)
- [ ] Basic FastAPI skeleton with health checks
- [ ] React app scaffolding with Tailwind

### Phase 2: Core Data Management
- [ ] Species CRUD API
- [ ] CSV import endpoint
- [ ] Species list UI with search/filter
- [ ] API keys management UI

### Phase 3: iNaturalist Scraper
- [ ] Celery worker setup
- [ ] iNaturalist/GBIF scraper task
- [ ] Job management API
- [ ] Real-time progress (WebSocket or polling)

### Phase 4: Quality Pipeline
- [ ] Image download worker
- [ ] Perceptual hash deduplication
- [ ] Blur detection + quality scoring
- [ ] Resize to 512x512
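The Phase 4 dedup step relies on perceptual hashes (imagehash/imagededup in the plan); a toy average-hash over an already-resized 8x8 grayscale grid shows the idea behind the `phash` column:

```python
def average_hash(pixels_8x8) -> int:
    """64-bit average hash of an 8x8 grayscale grid (toy stand-in for
    imagehash/imagededup's perceptual hashes)."""
    flat = [p for row in pixels_8x8 for p in row]
    mean = sum(flat) / 64
    bits = 0
    for p in flat:
        # one bit per pixel: brighter than the mean or not
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing bits; small distances flag near-duplicates."""
    return bin(a ^ b).count("1")
```

Storing the hash as text in `images.phash` (with its index) lets the pipeline look up exact-hash duplicates cheaply and only run pairwise Hamming comparisons on candidates.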
### Phase 5: Image Browser
- [ ] Image listing API with filters
- [ ] Thumbnail generation
- [ ] Grid view UI
- [ ] Bulk operations

### Phase 6: Additional Scrapers
- [ ] Flickr scraper
- [ ] Wikimedia Commons scraper
- [ ] Trefle scraper (metadata + images)
- [ ] USDA PLANTS scraper

### Phase 7: Export
- [ ] Export job API
- [ ] Train/test split logic
- [ ] Zip generation worker
- [ ] Download endpoint
- [ ] Export UI with filters

### Phase 8: Dashboard & Polish
- [ ] Stats API
- [ ] Dashboard UI with charts
- [ ] Job monitoring UI
- [ ] Error handling + logging
- [ ] Documentation

---
## File Structure

```
PlantGuideScraper/
├── docker-compose.yml
├── .env.example
├── docs/
│   └── master_plan.md
├── backend/
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── alembic/
│   │   └── versions/
│   ├── app/
│   │   ├── __init__.py
│   │   ├── main.py
│   │   ├── config.py
│   │   ├── database.py
│   │   ├── models/
│   │   │   ├── species.py
│   │   │   ├── image.py
│   │   │   ├── job.py
│   │   │   └── export.py
│   │   ├── schemas/
│   │   │   └── ...
│   │   ├── api/
│   │   │   ├── species.py
│   │   │   ├── images.py
│   │   │   ├── jobs.py
│   │   │   ├── exports.py
│   │   │   └── stats.py
│   │   ├── scrapers/
│   │   │   ├── base.py
│   │   │   ├── inaturalist.py
│   │   │   ├── flickr.py
│   │   │   ├── wikimedia.py
│   │   │   └── trefle.py
│   │   ├── workers/
│   │   │   ├── celery_app.py
│   │   │   ├── scrape_tasks.py
│   │   │   ├── quality_tasks.py
│   │   │   └── export_tasks.py
│   │   └── utils/
│   │       ├── image_quality.py
│   │       └── dedup.py
│   └── tests/
├── frontend/
│   ├── Dockerfile
│   ├── package.json
│   ├── src/
│   │   ├── App.tsx
│   │   ├── components/
│   │   ├── pages/
│   │   │   ├── Dashboard.tsx
│   │   │   ├── Species.tsx
│   │   │   ├── Images.tsx
│   │   │   ├── Jobs.tsx
│   │   │   ├── Export.tsx
│   │   │   └── Settings.tsx
│   │   ├── hooks/
│   │   └── api/
│   └── public/
├── nginx/
│   └── nginx.conf
└── data/                 # Bind mount (not in repo)
    ├── db/
    ├── images/
    └── exports/
```

---
## Environment Variables

```bash
# Backend
DATABASE_URL=sqlite:////data/db/plants.sqlite
REDIS_URL=redis://redis:6379/0
IMAGES_PATH=/data/images
EXPORTS_PATH=/data/exports

# API Keys (user-provided)
FLICKR_API_KEY=
FLICKR_API_SECRET=
INATURALIST_APP_ID=
INATURALIST_APP_SECRET=
TREFLE_API_KEY=

# Optional
LOG_LEVEL=INFO
CELERY_CONCURRENCY=4
```

---
## Commands

```bash
# Development
docker-compose up --build

# Production
docker-compose -f docker-compose.yml -f docker-compose.prod.yml up -d

# Run migrations
docker-compose exec backend alembic upgrade head

# View Celery logs
docker-compose logs -f celery

# Access Redis CLI
docker-compose exec redis redis-cli
```
14
frontend/Dockerfile
Normal file
@@ -0,0 +1,14 @@
FROM node:20-alpine

WORKDIR /app

# Install dependencies
COPY package*.json ./
RUN npm install

# Copy source
COPY . .

EXPOSE 3000

CMD ["npm", "run", "dev", "--", "--host"]
283
frontend/dist/assets/index-BXIq8BNP.js
vendored
Normal file
File diff suppressed because one or more lines are too long
1
frontend/dist/assets/index-uHzGA3u6.css
vendored
Normal file
File diff suppressed because one or more lines are too long
14
frontend/dist/index.html
vendored
Normal file
@@ -0,0 +1,14 @@
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <link rel="icon" type="image/svg+xml" href="/vite.svg" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>PlantGuideScraper</title>
    <script type="module" crossorigin src="/assets/index-BXIq8BNP.js"></script>
    <link rel="stylesheet" crossorigin href="/assets/index-uHzGA3u6.css">
  </head>
  <body>
    <div id="root"></div>
  </body>
</html>
13
frontend/index.html
Normal file
@@ -0,0 +1,13 @@
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <link rel="icon" type="image/svg+xml" href="/vite.svg" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>PlantGuideScraper</title>
  </head>
  <body>
    <div id="root"></div>
    <script type="module" src="/src/main.tsx"></script>
  </body>
</html>
31
frontend/package.json
Normal file
@@ -0,0 +1,31 @@
{
  "name": "plant-scraper-frontend",
  "private": true,
  "version": "1.0.0",
  "type": "module",
  "scripts": {
    "dev": "vite",
    "build": "tsc && vite build",
    "preview": "vite preview"
  },
  "dependencies": {
    "react": "^18.2.0",
    "react-dom": "^18.2.0",
    "react-router-dom": "^6.21.0",
    "@tanstack/react-query": "^5.17.0",
    "axios": "^1.6.0",
    "lucide-react": "^0.303.0",
    "recharts": "^2.10.0",
    "clsx": "^2.1.0"
  },
  "devDependencies": {
    "@types/react": "^18.2.0",
    "@types/react-dom": "^18.2.0",
    "@vitejs/plugin-react": "^4.2.0",
    "autoprefixer": "^10.4.16",
    "postcss": "^8.4.32",
    "tailwindcss": "^3.4.0",
    "typescript": "^5.3.0",
    "vite": "^5.0.0"
  }
}
6
frontend/postcss.config.js
Normal file
@@ -0,0 +1,6 @@
export default {
  plugins: {
    tailwindcss: {},
    autoprefixer: {},
  },
}
81
frontend/src/App.tsx
Normal file
@@ -0,0 +1,81 @@
import { BrowserRouter, Routes, Route, NavLink } from 'react-router-dom'
import {
  LayoutDashboard,
  Leaf,
  Image,
  Play,
  Download,
  Settings,
} from 'lucide-react'
import { clsx } from 'clsx'

import Dashboard from './pages/Dashboard'
import Species from './pages/Species'
import Images from './pages/Images'
import Jobs from './pages/Jobs'
import Export from './pages/Export'
import SettingsPage from './pages/Settings'

const navItems = [
  { to: '/', icon: LayoutDashboard, label: 'Dashboard' },
  { to: '/species', icon: Leaf, label: 'Species' },
  { to: '/images', icon: Image, label: 'Images' },
  { to: '/jobs', icon: Play, label: 'Jobs' },
  { to: '/export', icon: Download, label: 'Export' },
  { to: '/settings', icon: Settings, label: 'Settings' },
]

function Sidebar() {
  return (
    <aside className="w-64 bg-white border-r border-gray-200 min-h-screen">
      <div className="p-4 border-b border-gray-200">
        <h1 className="text-xl font-bold text-green-600 flex items-center gap-2">
          <Leaf className="w-6 h-6" />
          PlantScraper
        </h1>
      </div>
      <nav className="p-4">
        <ul className="space-y-2">
          {navItems.map((item) => (
            <li key={item.to}>
              <NavLink
                to={item.to}
                className={({ isActive }) =>
                  clsx(
                    'flex items-center gap-3 px-3 py-2 rounded-lg transition-colors',
                    isActive
                      ? 'bg-green-50 text-green-700'
                      : 'text-gray-600 hover:bg-gray-100'
                  )
                }
              >
                <item.icon className="w-5 h-5" />
                {item.label}
              </NavLink>
            </li>
          ))}
        </ul>
      </nav>
    </aside>
  )
}

export default function App() {
  return (
    <BrowserRouter>
      <div className="flex min-h-screen">
        <Sidebar />
        <main className="flex-1 p-8">
          <Routes>
            <Route path="/" element={<Dashboard />} />
            <Route path="/species" element={<Species />} />
            <Route path="/images" element={<Images />} />
            <Route path="/jobs" element={<Jobs />} />
            <Route path="/export" element={<Export />} />
            <Route path="/settings" element={<SettingsPage />} />
          </Routes>
        </main>
      </div>
    </BrowserRouter>
  )
}
275
frontend/src/api/client.ts
Normal file
@@ -0,0 +1,275 @@
import axios from 'axios'

const API_URL = import.meta.env.VITE_API_URL || ''

export const api = axios.create({
  baseURL: `${API_URL}/api`,
  headers: {
    'Content-Type': 'application/json',
  },
})

// Types
export interface Species {
  id: number
  scientific_name: string
  common_name: string | null
  genus: string | null
  family: string | null
  created_at: string
  image_count: number
}

export interface SpeciesListResponse {
  items: Species[]
  total: number
  page: number
  page_size: number
  pages: number
}

export interface Image {
  id: number
  species_id: number
  species_name: string | null
  source: string
  source_id: string | null
  url: string
  local_path: string | null
  license: string
  attribution: string | null
  width: number | null
  height: number | null
  quality_score: number | null
  status: string
  created_at: string
}

export interface ImageListResponse {
  items: Image[]
  total: number
  page: number
  page_size: number
  pages: number
}

export interface Job {
  id: number
  name: string
  source: string
  species_filter: string | null
  status: string
  progress_current: number
  progress_total: number
  images_downloaded: number
  images_rejected: number
  started_at: string | null
  completed_at: string | null
  error_message: string | null
  created_at: string
}

export interface JobListResponse {
  items: Job[]
  total: number
}

export interface JobProgress {
  status: string
  progress_current: number
  progress_total: number
  current_species?: string
}

export interface Export {
  id: number
  name: string
  filter_criteria: string | null
  train_split: number
  status: string
  file_path: string | null
  file_size: number | null
  species_count: number | null
  image_count: number | null
  created_at: string
  completed_at: string | null
  error_message: string | null
}

export interface SourceConfig {
  name: string
  label: string
  requires_secret: boolean
  auth_type: 'none' | 'api_key' | 'api_key_secret' | 'oauth'
  configured: boolean
  enabled: boolean
  api_key_masked: string | null
  has_secret: boolean
  has_access_token: boolean
  rate_limit_per_sec: number
  default_rate: number
}

export interface Stats {
  total_species: number
  total_images: number
  images_downloaded: number
  images_pending: number
  images_rejected: number
  disk_usage_mb: number
  sources: Array<{
    source: string
    image_count: number
    downloaded: number
    pending: number
    rejected: number
  }>
  licenses: Array<{
    license: string
    count: number
  }>
  jobs: {
    running: number
    pending: number
    completed: number
    failed: number
  }
  top_species: Array<{
    id: number
    scientific_name: string
    common_name: string | null
    image_count: number
  }>
  under_represented: Array<{
    id: number
    scientific_name: string
    common_name: string | null
    image_count: number
  }>
}

// API functions
export const speciesApi = {
  list: (params?: { page?: number; page_size?: number; search?: string; genus?: string; has_images?: boolean; max_images?: number; min_images?: number }) =>
    api.get<SpeciesListResponse>('/species', { params }),
  get: (id: number) => api.get<Species>(`/species/${id}`),
  create: (data: { scientific_name: string; common_name?: string; genus?: string; family?: string }) =>
    api.post<Species>('/species', data),
  update: (id: number, data: Partial<Species>) => api.put<Species>(`/species/${id}`, data),
  delete: (id: number) => api.delete(`/species/${id}`),
  import: (file: File) => {
    const formData = new FormData()
    formData.append('file', file)
    return api.post('/species/import', formData, {
      headers: { 'Content-Type': 'multipart/form-data' },
    })
  },
  importJson: (file: File) => {
    const formData = new FormData()
    formData.append('file', file)
    return api.post('/species/import-json', formData, {
      headers: { 'Content-Type': 'multipart/form-data' },
    })
  },
  genera: () => api.get<string[]>('/species/genera/list'),
}

export interface ImportScanResult {
  available: boolean
  message?: string
  sources: Array<{
    name: string
    species_count: number
    image_count: number
  }>
  total_images: number
  matched_species: number
  unmatched_species: string[]
}

export interface ImportResult {
  imported: number
  skipped: number
  errors: string[]
}

export const imagesApi = {
  list: (params?: {
    page?: number
    page_size?: number
    species_id?: number
    source?: string
    license?: string
    status?: string
    min_quality?: number
    search?: string
  }) => api.get<ImageListResponse>('/images', { params }),
  get: (id: number) => api.get<Image>(`/images/${id}`),
  delete: (id: number) => api.delete(`/images/${id}`),
  bulkDelete: (ids: number[]) => api.post('/images/bulk-delete', ids),
  sources: () => api.get<string[]>('/images/sources'),
  licenses: () => api.get<string[]>('/images/licenses'),
  processPending: (source?: string) =>
    api.post<{ pending_count: number; task_id: string }>('/images/process-pending', null, {
      params: source ? { source } : undefined,
    }),
  processPendingStatus: (taskId: string) =>
    api.get<{ task_id: string; state: string; queued?: number; total?: number }>(
      `/images/process-pending/status/${taskId}`
    ),
  scanImports: () => api.get<ImportScanResult>('/images/import/scan'),
  runImport: (moveFiles: boolean = false) =>
    api.post<ImportResult>('/images/import/run', null, { params: { move_files: moveFiles } }),
}

export const jobsApi = {
  list: (params?: { status?: string; source?: string; limit?: number }) =>
    api.get<JobListResponse>('/jobs', { params }),
  get: (id: number) => api.get<Job>(`/jobs/${id}`),
  create: (data: { name: string; source: string; species_ids?: number[]; only_without_images?: boolean; max_images?: number }) =>
    api.post<Job>('/jobs', data),
  progress: (id: number) => api.get<JobProgress>(`/jobs/${id}/progress`),
  pause: (id: number) => api.post(`/jobs/${id}/pause`),
  resume: (id: number) => api.post(`/jobs/${id}/resume`),
  cancel: (id: number) => api.post(`/jobs/${id}/cancel`),
}

export const exportsApi = {
  list: (params?: { limit?: number }) => api.get('/exports', { params }),
  get: (id: number) => api.get<Export>(`/exports/${id}`),
  create: (data: {
    name: string
    filter_criteria: {
      min_images_per_species: number
      licenses?: string[]
      min_quality?: number
      species_ids?: number[]
    }
    train_split: number
  }) => api.post<Export>('/exports', data),
  preview: (data: any) => api.post('/exports/preview', data),
  progress: (id: number) => api.get(`/exports/${id}/progress`),
  download: (id: number) => `${API_URL}/api/exports/${id}/download`,
  delete: (id: number) => api.delete(`/exports/${id}`),
}

export const sourcesApi = {
  list: () => api.get<SourceConfig[]>('/sources'),
  get: (source: string) => api.get<SourceConfig>(`/sources/${source}`),
  update: (source: string, data: {
    api_key?: string
    api_secret?: string
    access_token?: string
    rate_limit_per_sec?: number
    enabled?: boolean
  }) => api.put(`/sources/${source}`, { source, ...data }),
  test: (source: string) => api.post(`/sources/${source}/test`),
  delete: (source: string) => api.delete(`/sources/${source}`),
}

export const statsApi = {
  get: () => api.get<Stats>('/stats'),
  sources: () => api.get('/stats/sources'),
  species: (params?: { min_count?: number; max_count?: number }) =>
    api.get('/stats/species', { params }),
}
7
frontend/src/index.css
Normal file
@@ -0,0 +1,7 @@
@tailwind base;
@tailwind components;
@tailwind utilities;

body {
  @apply bg-gray-50 text-gray-900;
}
22
frontend/src/main.tsx
Normal file
@@ -0,0 +1,22 @@
import React from 'react'
import ReactDOM from 'react-dom/client'
import { QueryClient, QueryClientProvider } from '@tanstack/react-query'
import App from './App'
import './index.css'

const queryClient = new QueryClient({
  defaultOptions: {
    queries: {
      refetchOnWindowFocus: false,
      retry: 1,
    },
  },
})

ReactDOM.createRoot(document.getElementById('root')!).render(
  <React.StrictMode>
    <QueryClientProvider client={queryClient}>
      <App />
    </QueryClientProvider>
  </React.StrictMode>,
)
413
frontend/src/pages/Dashboard.tsx
Normal file
@@ -0,0 +1,413 @@
import { useState } from 'react'
import type { ElementType } from 'react'
import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
import {
  Leaf,
  Image,
  HardDrive,
  Clock,
  CheckCircle,
  XCircle,
  AlertCircle,
} from 'lucide-react'
import {
  BarChart,
  Bar,
  XAxis,
  YAxis,
  Tooltip,
  ResponsiveContainer,
  PieChart,
  Pie,
  Cell,
} from 'recharts'
import { api, imagesApi } from '../api/client'
import type { Stats } from '../api/client'

const COLORS = ['#22c55e', '#3b82f6', '#f59e0b', '#ef4444', '#8b5cf6', '#ec4899']

function StatCard({
  title,
  value,
  icon: Icon,
  color,
}: {
  title: string
  value: string | number
  icon: ElementType
  color: string
}) {
  return (
    <div className="bg-white rounded-lg shadow p-6">
      <div className="flex items-center justify-between">
        <div>
          <p className="text-sm text-gray-500">{title}</p>
          <p className="text-2xl font-bold mt-1">{value}</p>
        </div>
        <div className={`p-3 rounded-full ${color}`}>
          <Icon className="w-6 h-6 text-white" />
        </div>
      </div>
    </div>
  )
}

export default function Dashboard() {
  const queryClient = useQueryClient()

  const [processingTaskId, setProcessingTaskId] = useState<string | null>(null)

  const processPendingMutation = useMutation({
    mutationFn: () => imagesApi.processPending(),
    onSuccess: (res) => {
      setProcessingTaskId(res.data.task_id)
    },
  })

  // Poll task status while processing
  const { data: taskStatus } = useQuery({
    queryKey: ['process-pending-status', processingTaskId],
    queryFn: async () => {
      const res = await imagesApi.processPendingStatus(processingTaskId!)
      if (res.data.state === 'SUCCESS' || res.data.state === 'FAILURE') {
        // Task finished - clear tracking and refresh stats
        setTimeout(() => {
          setProcessingTaskId(null)
          queryClient.invalidateQueries({ queryKey: ['stats'] })
        }, 0)
      }
      return res.data
    },
    enabled: !!processingTaskId,
    refetchInterval: (query) => {
      const state = query.state.data?.state
      if (state === 'SUCCESS' || state === 'FAILURE') return false
      return 2000
    },
  })

  const isProcessing = !!processingTaskId && taskStatus?.state !== 'SUCCESS' && taskStatus?.state !== 'FAILURE'

  const { data: stats, isLoading, error, failureCount } = useQuery({
    queryKey: ['stats'],
    queryFn: async () => {
      const startTime = Date.now()
      console.log('[Dashboard] Fetching stats...')

      // Abort the request if it takes longer than 10 seconds
      const controller = new AbortController()
      const timeoutId = setTimeout(() => controller.abort(), 10000)

      try {
        // The signal must be attached to the request or the timeout never cancels it
        const res = await api.get<Stats>('/stats', { signal: controller.signal })
        clearTimeout(timeoutId)
        console.log(`[Dashboard] Stats loaded in ${Date.now() - startTime}ms`)
        return res.data
      } catch (err: any) {
        clearTimeout(timeoutId)
        // axios reports an aborted request as ERR_CANCELED; ECONNABORTED covers its own timeout
        if (err.code === 'ERR_CANCELED' || err.code === 'ECONNABORTED') {
          console.error('[Dashboard] Request timed out after 10 seconds')
          throw new Error('Request timed out after 10 seconds - backend may be unresponsive')
        }
        console.error('[Dashboard] Stats fetch failed:', err)
        console.error('[Dashboard] Error details:', {
          message: err.message,
          status: err.response?.status,
          statusText: err.response?.statusText,
          data: err.response?.data,
        })
        throw err
      }
    },
    refetchInterval: 30000, // 30 seconds - matches backend cache
    retry: 1,
    staleTime: 25000,
  })

  // Debug panel to test backend
  const { data: debugData, refetch: refetchDebug, isFetching: isDebugFetching } = useQuery({
    queryKey: ['debug'],
    queryFn: async () => {
      const res = await fetch('/api/debug')
      return res.json()
    },
    enabled: false, // Only fetch when manually triggered
  })

  if (isLoading) {
    return (
      <div className="flex items-center justify-center h-64">
        <div className="text-center">
          <div className="animate-spin rounded-full h-8 w-8 border-b-2 border-green-600 mx-auto"></div>
          <p className="mt-2 text-gray-500">Loading stats...</p>
        </div>
      </div>
    )
  }

  if (error) {
    const err = error as any
    return (
      <div className="space-y-4 m-4">
        <div className="bg-red-50 border border-red-200 rounded-lg p-6">
          <h2 className="text-lg font-bold text-red-700 mb-2">Failed to load dashboard</h2>
          <div className="space-y-2 text-sm">
            <p><strong>Error:</strong> {err.message}</p>
            {err.response && (
              <>
                <p><strong>Status:</strong> {err.response.status} {err.response.statusText}</p>
                {err.response.data && (
                  <p><strong>Response:</strong> {JSON.stringify(err.response.data)}</p>
                )}
              </>
            )}
            <p><strong>Retry count:</strong> {failureCount}</p>
          </div>
        </div>

        <div className="bg-blue-50 border border-blue-200 rounded-lg p-6">
          <h3 className="font-bold text-blue-700 mb-2">Debug Backend Connection</h3>
          <button
            onClick={() => refetchDebug()}
            disabled={isDebugFetching}
            className="px-4 py-2 bg-blue-600 text-white rounded hover:bg-blue-700 disabled:opacity-50"
          >
            {isDebugFetching ? 'Testing...' : 'Test Backend'}
          </button>
          {debugData && (
            <pre className="mt-4 p-4 bg-white rounded text-xs overflow-auto">
              {JSON.stringify(debugData, null, 2)}
            </pre>
          )}
        </div>
      </div>
    )
  }

  if (!stats) {
    return <div>Failed to load stats</div>
  }

  const sourceData = stats.sources.map((s) => ({
    name: s.source,
    downloaded: s.downloaded,
    pending: s.pending,
    rejected: s.rejected,
  }))

  const licenseData = stats.licenses.map((l, i) => ({
    name: l.license,
    value: l.count,
    color: COLORS[i % COLORS.length],
  }))

  return (
    <div className="space-y-6">
      <h1 className="text-2xl font-bold">Dashboard</h1>

      {/* Stats Grid */}
      <div className="grid grid-cols-1 md:grid-cols-2 lg:grid-cols-4 gap-4">
        <StatCard
          title="Total Species"
          value={stats.total_species.toLocaleString()}
          icon={Leaf}
          color="bg-green-500"
        />
        <StatCard
          title="Downloaded Images"
          value={stats.images_downloaded.toLocaleString()}
          icon={Image}
          color="bg-blue-500"
        />
        <StatCard
          title="Pending Images"
          value={stats.images_pending.toLocaleString()}
          icon={Clock}
          color="bg-yellow-500"
        />
        <StatCard
          title="Disk Usage"
          value={`${stats.disk_usage_mb.toFixed(1)} MB`}
          icon={HardDrive}
          color="bg-purple-500"
        />
      </div>

      {/* Process Pending Banner */}
      {(stats.images_pending > 0 || isProcessing) && (
        <div className="bg-yellow-50 border border-yellow-200 rounded-lg p-4 flex items-center justify-between">
          <div>
            <p className="font-semibold text-yellow-800">
              {isProcessing
                ? `Processing pending images...`
                : `${stats.images_pending.toLocaleString()} pending images`}
            </p>
            <p className="text-sm text-yellow-700">
              {isProcessing && taskStatus?.queued != null && taskStatus?.total != null
                ? `Queued ${taskStatus.queued.toLocaleString()} of ${taskStatus.total.toLocaleString()} for download`
                : isProcessing
                ? 'Queueing images for download...'
                : 'These images have been scraped but not yet downloaded and processed.'}
            </p>
          </div>
          <button
            onClick={() => processPendingMutation.mutate()}
            disabled={isProcessing || processPendingMutation.isPending}
            className="px-4 py-2 bg-yellow-600 text-white rounded-lg hover:bg-yellow-700 disabled:opacity-50 whitespace-nowrap"
          >
            {isProcessing ? 'Processing...' : processPendingMutation.isPending ? 'Starting...' : 'Process All Pending'}
          </button>
        </div>
      )}

      {/* Jobs Status */}
      <div className="bg-white rounded-lg shadow p-6">
        <h2 className="text-lg font-semibold mb-4">Jobs Status</h2>
        <div className="flex gap-6">
          <div className="flex items-center gap-2">
            <div className="w-3 h-3 rounded-full bg-blue-500 animate-pulse"></div>
            <span>Running: {stats.jobs.running}</span>
          </div>
          <div className="flex items-center gap-2">
            <Clock className="w-4 h-4 text-yellow-500" />
            <span>Pending: {stats.jobs.pending}</span>
          </div>
          <div className="flex items-center gap-2">
            <CheckCircle className="w-4 h-4 text-green-500" />
            <span>Completed: {stats.jobs.completed}</span>
          </div>
          <div className="flex items-center gap-2">
            <XCircle className="w-4 h-4 text-red-500" />
            <span>Failed: {stats.jobs.failed}</span>
          </div>
        </div>
      </div>

      {/* Charts */}
      <div className="grid grid-cols-1 lg:grid-cols-2 gap-6">
        {/* Source Chart */}
        <div className="bg-white rounded-lg shadow p-6">
          <h2 className="text-lg font-semibold mb-4">Images by Source</h2>
          {sourceData.length > 0 ? (
            <ResponsiveContainer width="100%" height={300}>
              <BarChart data={sourceData}>
                <XAxis dataKey="name" />
                <YAxis />
                <Tooltip />
                <Bar dataKey="downloaded" fill="#22c55e" name="Downloaded" />
                <Bar dataKey="pending" fill="#f59e0b" name="Pending" />
                <Bar dataKey="rejected" fill="#ef4444" name="Rejected" />
              </BarChart>
            </ResponsiveContainer>
          ) : (
            <div className="h-[300px] flex items-center justify-center text-gray-400">
              No data yet
            </div>
          )}
        </div>

        {/* License Chart */}
        <div className="bg-white rounded-lg shadow p-6">
          <h2 className="text-lg font-semibold mb-4">Images by License</h2>
          {licenseData.length > 0 ? (
            <ResponsiveContainer width="100%" height={300}>
              <PieChart>
                <Pie
                  data={licenseData}
                  dataKey="value"
                  nameKey="name"
                  cx="50%"
                  cy="50%"
                  outerRadius={100}
                  label={({ name, percent }) =>
                    `${name} (${(percent * 100).toFixed(0)}%)`
                  }
                >
                  {licenseData.map((entry, index) => (
                    <Cell key={index} fill={entry.color} />
                  ))}
                </Pie>
                <Tooltip />
              </PieChart>
            </ResponsiveContainer>
          ) : (
            <div className="h-[300px] flex items-center justify-center text-gray-400">
              No data yet
            </div>
          )}
        </div>
      </div>

      {/* Species Tables */}
      <div className="grid grid-cols-1 lg:grid-cols-2 gap-6">
        {/* Top Species */}
        <div className="bg-white rounded-lg shadow p-6">
          <h2 className="text-lg font-semibold mb-4">Top Species</h2>
          <table className="w-full">
            <thead>
              <tr className="text-left text-sm text-gray-500">
                <th className="pb-2">Species</th>
                <th className="pb-2 text-right">Images</th>
              </tr>
            </thead>
            <tbody>
              {stats.top_species.map((s) => (
                <tr key={s.id} className="border-t">
                  <td className="py-2">
                    <div className="font-medium">{s.scientific_name}</div>
                    {s.common_name && (
                      <div className="text-sm text-gray-500">{s.common_name}</div>
                    )}
                  </td>
                  <td className="py-2 text-right">{s.image_count}</td>
                </tr>
              ))}
              {stats.top_species.length === 0 && (
                <tr>
                  <td colSpan={2} className="py-4 text-center text-gray-400">
                    No species yet
                  </td>
                </tr>
              )}
            </tbody>
          </table>
        </div>

        {/* Under-represented Species */}
        <div className="bg-white rounded-lg shadow p-6">
          <h2 className="text-lg font-semibold mb-4 flex items-center gap-2">
            <AlertCircle className="w-5 h-5 text-yellow-500" />
            Under-represented Species
          </h2>
          <p className="text-sm text-gray-500 mb-4">Species with fewer than 100 images</p>
          <table className="w-full">
            <thead>
              <tr className="text-left text-sm text-gray-500">
                <th className="pb-2">Species</th>
                <th className="pb-2 text-right">Images</th>
              </tr>
            </thead>
            <tbody>
              {stats.under_represented.map((s) => (
                <tr key={s.id} className="border-t">
                  <td className="py-2">
                    <div className="font-medium">{s.scientific_name}</div>
                    {s.common_name && (
                      <div className="text-sm text-gray-500">{s.common_name}</div>
                    )}
                  </td>
                  <td className="py-2 text-right text-yellow-600">{s.image_count}</td>
                </tr>
              ))}
              {stats.under_represented.length === 0 && (
                <tr>
                  <td colSpan={2} className="py-4 text-center text-gray-400">
                    All species have 100+ images
                  </td>
                </tr>
              )}
            </tbody>
          </table>
        </div>
      </div>
    </div>
  )
}
346
frontend/src/pages/Export.tsx
Normal file
@@ -0,0 +1,346 @@
import { useState } from 'react'
import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
import {
  Download,
  Trash2,
  CheckCircle,
  Clock,
  AlertCircle,
  Package,
} from 'lucide-react'
import { exportsApi, imagesApi, Export as ExportType } from '../api/client'

export default function Export() {
  const queryClient = useQueryClient()
  const [showCreateModal, setShowCreateModal] = useState(false)

  const { data: exports, isLoading } = useQuery({
    queryKey: ['exports'],
    queryFn: () => exportsApi.list({ limit: 50 }).then((res) => res.data),
    refetchInterval: 5000,
  })

  const deleteMutation = useMutation({
    mutationFn: (id: number) => exportsApi.delete(id),
    onSuccess: () => queryClient.invalidateQueries({ queryKey: ['exports'] }),
  })

  const getStatusIcon = (status: string) => {
    switch (status) {
      case 'generating':
        return <Clock className="w-4 h-4 text-blue-500 animate-pulse" />
      case 'completed':
        return <CheckCircle className="w-4 h-4 text-green-500" />
      case 'failed':
        return <AlertCircle className="w-4 h-4 text-red-500" />
      default:
        return <Clock className="w-4 h-4 text-gray-400" />
    }
  }

  const formatBytes = (bytes: number | null) => {
    if (bytes == null) return 'N/A'
    if (bytes < 1024) return `${bytes} B`
    if (bytes < 1024 * 1024) return `${(bytes / 1024).toFixed(1)} KB`
    if (bytes < 1024 * 1024 * 1024) return `${(bytes / 1024 / 1024).toFixed(1)} MB`
    return `${(bytes / 1024 / 1024 / 1024).toFixed(1)} GB`
  }

  return (
    <div className="space-y-6">
      <div className="flex items-center justify-between">
        <h1 className="text-2xl font-bold">Export Dataset</h1>
        <button
          onClick={() => setShowCreateModal(true)}
          className="flex items-center gap-2 px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700"
        >
          <Package className="w-4 h-4" />
          Create Export
        </button>
      </div>

      {/* Info Card */}
      <div className="bg-blue-50 border border-blue-200 rounded-lg p-4">
        <h3 className="font-medium text-blue-800">Export Format</h3>
        <p className="text-sm text-blue-700 mt-1">
          Exports are created in Create ML-compatible format with Training and Testing
          folders. Each species has its own subfolder with images.
        </p>
      </div>

      {/* Exports List */}
      {isLoading ? (
        <div className="flex items-center justify-center h-64">
          <div className="animate-spin rounded-full h-8 w-8 border-b-2 border-green-600"></div>
        </div>
      ) : exports?.items.length === 0 ? (
        <div className="bg-white rounded-lg shadow p-8 text-center text-gray-400">
          <Package className="w-12 h-12 mx-auto mb-4" />
          <p>No exports yet</p>
          <p className="text-sm mt-2">
            Create an export to download your dataset for CoreML training
          </p>
        </div>
      ) : (
        <div className="space-y-4">
          {exports?.items.map((exp: ExportType) => (
            <div
              key={exp.id}
              className="bg-white rounded-lg shadow p-6"
            >
              <div className="flex items-start justify-between">
                <div className="flex-1">
                  <div className="flex items-center gap-3">
                    {getStatusIcon(exp.status)}
                    <h3 className="font-semibold">{exp.name}</h3>
                  </div>
                  <div className="mt-2 grid grid-cols-4 gap-4 text-sm">
                    <div>
                      <span className="text-gray-500">Species:</span>{' '}
                      {exp.species_count ?? 'N/A'}
                    </div>
                    <div>
                      <span className="text-gray-500">Images:</span>{' '}
                      {exp.image_count ?? 'N/A'}
                    </div>
                    <div>
                      <span className="text-gray-500">Size:</span>{' '}
                      {formatBytes(exp.file_size)}
                    </div>
                    <div>
                      <span className="text-gray-500">Split:</span>{' '}
                      {Math.round(exp.train_split * 100)}% / {Math.round((1 - exp.train_split) * 100)}%
                    </div>
                  </div>
                  {exp.error_message && (
                    <div className="mt-2 text-sm text-red-600">
                      Error: {exp.error_message}
                    </div>
                  )}
                  <div className="mt-2 text-xs text-gray-400">
                    Created: {new Date(exp.created_at).toLocaleString()}
                    {exp.completed_at && (
                      <span className="ml-4">
                        Completed: {new Date(exp.completed_at).toLocaleString()}
                      </span>
                    )}
                  </div>
                </div>
                <div className="flex gap-2 ml-4">
                  {exp.status === 'completed' && (
                    <a
                      href={exportsApi.download(exp.id)}
                      className="flex items-center gap-2 px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700"
                    >
                      <Download className="w-4 h-4" />
                      Download
                    </a>
                  )}
                  <button
                    onClick={() => deleteMutation.mutate(exp.id)}
                    className="p-2 text-red-600 hover:bg-red-50 rounded"
                    title="Delete"
                  >
                    <Trash2 className="w-5 h-5" />
                  </button>
                </div>
              </div>
            </div>
          ))}
        </div>
      )}

      {/* Create Modal */}
      {showCreateModal && (
        <CreateExportModal onClose={() => setShowCreateModal(false)} />
      )}
    </div>
  )
}

function CreateExportModal({ onClose }: { onClose: () => void }) {
  const queryClient = useQueryClient()
  const [form, setForm] = useState({
    name: `Export ${new Date().toLocaleDateString()}`,
    min_images: 100,
    train_split: 0.8,
    licenses: [] as string[],
    min_quality: undefined as number | undefined,
  })

  const { data: licenses } = useQuery({
    queryKey: ['image-licenses'],
    queryFn: () => imagesApi.licenses().then((res) => res.data),
  })

  const previewMutation = useMutation({
    mutationFn: () =>
      exportsApi.preview({
        name: form.name,
        filter_criteria: {
          min_images_per_species: form.min_images,
          licenses: form.licenses.length > 0 ? form.licenses : undefined,
          min_quality: form.min_quality,
        },
        train_split: form.train_split,
      }),
  })

  const createMutation = useMutation({
    mutationFn: () =>
      exportsApi.create({
        name: form.name,
        filter_criteria: {
          min_images_per_species: form.min_images,
          licenses: form.licenses.length > 0 ? form.licenses : undefined,
          min_quality: form.min_quality,
        },
        train_split: form.train_split,
      }),
    onSuccess: () => {
      queryClient.invalidateQueries({ queryKey: ['exports'] })
      onClose()
    },
  })

  const toggleLicense = (license: string) => {
    setForm((f) => ({
      ...f,
      licenses: f.licenses.includes(license)
        ? f.licenses.filter((l) => l !== license)
        : [...f.licenses, license],
    }))
  }

  return (
    <div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50">
      <div className="bg-white rounded-lg p-6 w-full max-w-lg">
        <h2 className="text-xl font-bold mb-4">Create Export</h2>

        <div className="space-y-4">
          <div>
            <label className="block text-sm font-medium mb-1">Export Name</label>
            <input
              type="text"
              value={form.name}
              onChange={(e) => setForm({ ...form, name: e.target.value })}
              className="w-full px-3 py-2 border rounded-lg"
            />
          </div>

          <div>
            <label className="block text-sm font-medium mb-1">
              Minimum Images per Species
            </label>
            <input
              type="number"
              value={form.min_images}
              onChange={(e) =>
                setForm({ ...form, min_images: parseInt(e.target.value) || 0 })
              }
              className="w-full px-3 py-2 border rounded-lg"
              min={1}
            />
            <p className="text-xs text-gray-500 mt-1">
              Species with fewer images will be excluded
            </p>
          </div>

          <div>
            <label className="block text-sm font-medium mb-1">
              Train/Test Split
            </label>
            <div className="flex items-center gap-4">
              <input
                type="range"
                value={form.train_split}
                onChange={(e) =>
                  setForm({ ...form, train_split: parseFloat(e.target.value) })
                }
                min={0.5}
                max={0.95}
                step={0.05}
                className="flex-1"
              />
              <span className="text-sm w-20 text-right">
                {Math.round(form.train_split * 100)}% /{' '}
                {Math.round((1 - form.train_split) * 100)}%
              </span>
            </div>
          </div>

          <div>
            <label className="block text-sm font-medium mb-2">
              Filter by License (optional)
            </label>
            <div className="flex flex-wrap gap-2">
              {licenses?.map((license) => (
                <button
                  key={license}
                  onClick={() => toggleLicense(license)}
                  className={`px-3 py-1 rounded-full text-sm ${
                    form.licenses.includes(license)
                      ? 'bg-green-100 text-green-700 border-green-300'
                      : 'bg-gray-100 text-gray-600'
                  } border`}
                >
                  {license}
                </button>
              ))}
            </div>
            {form.licenses.length === 0 && (
              <p className="text-xs text-gray-500 mt-1">
                All licenses will be included
              </p>
            )}
          </div>

          {/* Preview */}
          {previewMutation.data && (
            <div className="bg-gray-50 rounded-lg p-4">
              <h4 className="font-medium mb-2">Preview</h4>
              <div className="grid grid-cols-3 gap-4 text-sm">
                <div>
                  <span className="text-gray-500">Species:</span>{' '}
                  {previewMutation.data.data.species_count}
                </div>
                <div>
                  <span className="text-gray-500">Images:</span>{' '}
                  {previewMutation.data.data.image_count}
                </div>
                <div>
                  <span className="text-gray-500">Est. Size:</span>{' '}
                  {previewMutation.data.data.estimated_size_mb.toFixed(0)} MB
                </div>
              </div>
            </div>
          )}
        </div>

        <div className="flex justify-between mt-6">
          <button
            onClick={() => previewMutation.mutate()}
            className="px-4 py-2 border rounded-lg hover:bg-gray-50"
          >
            Preview
          </button>
          <div className="flex gap-2">
            <button
              onClick={onClose}
              className="px-4 py-2 border rounded-lg hover:bg-gray-50"
            >
              Cancel
            </button>
            <button
              onClick={() => createMutation.mutate()}
              disabled={!form.name}
              className="px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700 disabled:opacity-50"
            >
              Create Export
            </button>
          </div>
        </div>
      </div>
    </div>
  )
}
331
frontend/src/pages/Images.tsx
Normal file
@@ -0,0 +1,331 @@
import { useState } from 'react'
import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
import {
  Search,
  Filter,
  Trash2,
  ChevronLeft,
  ChevronRight,
  X,
  ExternalLink,
} from 'lucide-react'
import { imagesApi } from '../api/client'

export default function Images() {
  const queryClient = useQueryClient()
  const [page, setPage] = useState(1)
  const [search, setSearch] = useState('')
  const [filters, setFilters] = useState({
    source: '',
    license: '',
    status: 'downloaded',
    min_quality: undefined as number | undefined,
  })
  const [selectedIds, setSelectedIds] = useState<number[]>([])
  const [selectedImage, setSelectedImage] = useState<number | null>(null)

  const { data, isLoading } = useQuery({
    queryKey: ['images', page, search, filters],
    queryFn: () =>
      imagesApi
        .list({
          page,
          page_size: 48,
          search: search || undefined,
          source: filters.source || undefined,
          license: filters.license || undefined,
          status: filters.status || undefined,
          min_quality: filters.min_quality,
        })
        .then((res) => res.data),
  })

  const { data: sources } = useQuery({
    queryKey: ['image-sources'],
    queryFn: () => imagesApi.sources().then((res) => res.data),
  })

  const { data: licenses } = useQuery({
    queryKey: ['image-licenses'],
    queryFn: () => imagesApi.licenses().then((res) => res.data),
  })

  const { data: imageDetail } = useQuery({
    queryKey: ['image', selectedImage],
    queryFn: () => imagesApi.get(selectedImage!).then((res) => res.data),
    enabled: !!selectedImage,
  })

  const deleteMutation = useMutation({
    mutationFn: (id: number) => imagesApi.delete(id),
    onSuccess: () => {
      queryClient.invalidateQueries({ queryKey: ['images'] })
      setSelectedImage(null)
    },
  })

  const bulkDeleteMutation = useMutation({
    mutationFn: (ids: number[]) => imagesApi.bulkDelete(ids),
    onSuccess: () => {
      queryClient.invalidateQueries({ queryKey: ['images'] })
      setSelectedIds([])
    },
  })

  const handleSelect = (id: number) => {
    setSelectedIds((prev) =>
      prev.includes(id) ? prev.filter((i) => i !== id) : [...prev, id]
    )
  }

  return (
    <div className="space-y-6">
      <div className="flex items-center justify-between">
        <h1 className="text-2xl font-bold">Images</h1>
        {selectedIds.length > 0 && (
          <button
            onClick={() => bulkDeleteMutation.mutate(selectedIds)}
            className="flex items-center gap-2 px-4 py-2 bg-red-600 text-white rounded-lg hover:bg-red-700"
          >
            <Trash2 className="w-4 h-4" />
            Delete {selectedIds.length} images
          </button>
        )}
      </div>

      {/* Filters */}
      <div className="flex flex-wrap gap-4">
        <div className="relative">
          <Search className="absolute left-3 top-1/2 -translate-y-1/2 w-4 h-4 text-gray-400" />
          <input
            type="text"
            placeholder="Search species..."
            value={search}
            onChange={(e) => {
              setSearch(e.target.value)
              setPage(1)
            }}
            className="pl-10 pr-4 py-2 border rounded-lg w-64"
          />
        </div>

        <select
          value={filters.source}
          onChange={(e) => setFilters({ ...filters, source: e.target.value })}
          className="px-3 py-2 border rounded-lg"
        >
          <option value="">All Sources</option>
          {sources?.map((s) => (
            <option key={s} value={s}>
              {s}
            </option>
          ))}
        </select>

        <select
          value={filters.license}
          onChange={(e) => setFilters({ ...filters, license: e.target.value })}
          className="px-3 py-2 border rounded-lg"
        >
          <option value="">All Licenses</option>
          {licenses?.map((l) => (
            <option key={l} value={l}>
              {l}
            </option>
          ))}
        </select>

        <select
          value={filters.status}
          onChange={(e) => setFilters({ ...filters, status: e.target.value })}
          className="px-3 py-2 border rounded-lg"
        >
          <option value="">All Status</option>
          <option value="downloaded">Downloaded</option>
          <option value="pending">Pending</option>
          <option value="rejected">Rejected</option>
        </select>
      </div>

      {/* Image Grid */}
      {isLoading ? (
        <div className="flex items-center justify-center h-64">
          <div className="animate-spin rounded-full h-8 w-8 border-b-2 border-green-600"></div>
        </div>
      ) : data?.items.length === 0 ? (
        <div className="flex flex-col items-center justify-center h-64 text-gray-400">
          <Filter className="w-12 h-12 mb-4" />
          <p>No images found</p>
        </div>
      ) : (
        <div className="grid grid-cols-2 sm:grid-cols-4 md:grid-cols-6 lg:grid-cols-8 gap-2">
          {data?.items.map((image) => (
            <div
              key={image.id}
              className={`relative aspect-square bg-gray-100 rounded-lg overflow-hidden cursor-pointer group ${
                selectedIds.includes(image.id) ? 'ring-2 ring-green-500' : ''
              }`}
              onClick={() => setSelectedImage(image.id)}
            >
              {image.local_path ? (
                <img
                  src={`/api/images/${image.id}/file`}
                  alt={image.species_name || ''}
                  className="w-full h-full object-cover"
                  loading="lazy"
                />
              ) : (
                <div className="flex items-center justify-center h-full text-gray-400 text-xs">
                  Pending
                </div>
              )}
              <div className="absolute inset-0 bg-black/0 group-hover:bg-black/20 transition-colors" />
              <div className="absolute top-1 left-1">
                <input
                  type="checkbox"
                  checked={selectedIds.includes(image.id)}
                  onChange={(e) => {
                    e.stopPropagation()
                    handleSelect(image.id)
                  }}
                  className="rounded opacity-0 group-hover:opacity-100 checked:opacity-100"
                />
              </div>
              <div className="absolute bottom-0 left-0 right-0 bg-gradient-to-t from-black/60 to-transparent p-1 opacity-0 group-hover:opacity-100 transition-opacity">
                <p className="text-white text-xs truncate">
                  {image.species_name}
                </p>
              </div>
            </div>
          ))}
        </div>
      )}

      {/* Pagination */}
      {data && data.pages > 1 && (
        <div className="flex items-center justify-between">
          <span className="text-sm text-gray-600">
            {data.total} images
          </span>
          <div className="flex gap-2">
            <button
              onClick={() => setPage((p) => Math.max(1, p - 1))}
              disabled={page === 1}
              className="p-2 rounded border disabled:opacity-50"
            >
              <ChevronLeft className="w-4 h-4" />
            </button>
            <span className="px-4 py-2">
              Page {page} of {data.pages}
            </span>
            <button
              onClick={() => setPage((p) => Math.min(data.pages, p + 1))}
              disabled={page === data.pages}
              className="p-2 rounded border disabled:opacity-50"
            >
              <ChevronRight className="w-4 h-4" />
            </button>
          </div>
        </div>
      )}

      {/* Image Detail Modal */}
      {selectedImage && imageDetail && (
        <div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50 p-8">
          <div className="bg-white rounded-lg w-full max-w-4xl max-h-full overflow-auto">
            <div className="flex justify-between items-center p-4 border-b">
              <h2 className="text-lg font-semibold">Image Details</h2>
              <button
                onClick={() => setSelectedImage(null)}
                className="p-1 hover:bg-gray-100 rounded"
              >
                <X className="w-5 h-5" />
              </button>
            </div>
            <div className="grid grid-cols-2 gap-6 p-6">
              <div className="aspect-square bg-gray-100 rounded-lg overflow-hidden">
                {imageDetail.local_path ? (
                  <img
                    src={`/api/images/${imageDetail.id}/file`}
                    alt={imageDetail.species_name || ''}
                    className="w-full h-full object-contain"
                  />
                ) : (
                  <div className="flex items-center justify-center h-full text-gray-400">
                    Not downloaded
                  </div>
                )}
              </div>
              <div className="space-y-4">
                <div>
                  <label className="text-sm text-gray-500">Species</label>
                  <p className="font-medium">{imageDetail.species_name}</p>
                </div>
                <div>
                  <label className="text-sm text-gray-500">Source</label>
                  <p>{imageDetail.source}</p>
                </div>
                <div>
                  <label className="text-sm text-gray-500">License</label>
                  <p>{imageDetail.license}</p>
                </div>
                {imageDetail.attribution && (
                  <div>
                    <label className="text-sm text-gray-500">Attribution</label>
                    <p className="text-sm">{imageDetail.attribution}</p>
                  </div>
                )}
                <div className="grid grid-cols-2 gap-4">
                  <div>
                    <label className="text-sm text-gray-500">Dimensions</label>
                    <p>
                      {imageDetail.width || '?'} x {imageDetail.height || '?'}
                    </p>
                  </div>
                  <div>
                    <label className="text-sm text-gray-500">Quality Score</label>
                    <p>{imageDetail.quality_score?.toFixed(1) || 'N/A'}</p>
                  </div>
                </div>
                <div>
                  <label className="text-sm text-gray-500">Status</label>
                  <p>
                    <span
                      className={`inline-block px-2 py-1 rounded text-sm ${
                        imageDetail.status === 'downloaded'
                          ? 'bg-green-100 text-green-700'
                          : imageDetail.status === 'pending'
                          ? 'bg-yellow-100 text-yellow-700'
                          : 'bg-red-100 text-red-700'
                      }`}
                    >
                      {imageDetail.status}
                    </span>
                  </p>
                </div>
                <div className="flex gap-2 pt-4">
                  <a
                    href={imageDetail.url}
                    target="_blank"
                    rel="noopener noreferrer"
                    className="flex items-center gap-2 px-4 py-2 border rounded-lg hover:bg-gray-50"
                  >
                    <ExternalLink className="w-4 h-4" />
                    View Original
                  </a>
                  <button
                    onClick={() => deleteMutation.mutate(imageDetail.id)}
                    className="flex items-center gap-2 px-4 py-2 bg-red-600 text-white rounded-lg hover:bg-red-700"
                  >
                    <Trash2 className="w-4 h-4" />
                    Delete
                  </button>
                </div>
              </div>
            </div>
          </div>
        </div>
      )}
    </div>
  )
}
354
frontend/src/pages/Jobs.tsx
Normal file
@@ -0,0 +1,354 @@
import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
import {
  Play,
  Pause,
  XCircle,
  CheckCircle,
  Clock,
  AlertCircle,
  RefreshCw,
  Leaf,
  Download,
  XOctagon,
} from 'lucide-react'
import { jobsApi, Job } from '../api/client'

export default function Jobs() {
  const queryClient = useQueryClient()

  const { data, isLoading, refetch } = useQuery({
    queryKey: ['jobs'],
    queryFn: () => jobsApi.list({ limit: 100 }).then((res) => res.data),
    refetchInterval: 1000, // Faster refresh for live updates
  })

  const pauseMutation = useMutation({
    mutationFn: (id: number) => jobsApi.pause(id),
    onSuccess: () => queryClient.invalidateQueries({ queryKey: ['jobs'] }),
  })

  const resumeMutation = useMutation({
    mutationFn: (id: number) => jobsApi.resume(id),
    onSuccess: () => queryClient.invalidateQueries({ queryKey: ['jobs'] }),
  })

  const cancelMutation = useMutation({
    mutationFn: (id: number) => jobsApi.cancel(id),
    onSuccess: () => queryClient.invalidateQueries({ queryKey: ['jobs'] }),
  })

  const getStatusIcon = (status: string) => {
    switch (status) {
      case 'running':
        return <RefreshCw className="w-4 h-4 text-blue-500 animate-spin" />
      case 'pending':
        return <Clock className="w-4 h-4 text-yellow-500" />
      case 'paused':
        return <Pause className="w-4 h-4 text-gray-500" />
      case 'completed':
        return <CheckCircle className="w-4 h-4 text-green-500" />
      case 'failed':
        return <AlertCircle className="w-4 h-4 text-red-500" />
      default:
        return null
    }
  }

  const getStatusClass = (status: string) => {
    switch (status) {
      case 'running':
        return 'bg-blue-100 text-blue-700'
      case 'pending':
        return 'bg-yellow-100 text-yellow-700'
      case 'paused':
        return 'bg-gray-100 text-gray-700'
      case 'completed':
        return 'bg-green-100 text-green-700'
      case 'failed':
        return 'bg-red-100 text-red-700'
      default:
        return 'bg-gray-100 text-gray-700'
    }
  }

  // Separate running jobs from others
  const runningJobs = data?.items.filter((j) => j.status === 'running') || []
  const otherJobs = data?.items.filter((j) => j.status !== 'running') || []

  return (
    <div className="space-y-6">
      <div className="flex items-center justify-between">
        <h1 className="text-2xl font-bold">Jobs</h1>
        <button
          onClick={() => refetch()}
          className="flex items-center gap-2 px-4 py-2 border rounded-lg hover:bg-gray-50"
        >
          <RefreshCw className="w-4 h-4" />
          Refresh
        </button>
      </div>

      {isLoading ? (
        <div className="flex items-center justify-center h-64">
          <div className="animate-spin rounded-full h-8 w-8 border-b-2 border-green-600"></div>
        </div>
      ) : data?.items.length === 0 ? (
        <div className="bg-white rounded-lg shadow p-8 text-center text-gray-400">
          <Clock className="w-12 h-12 mx-auto mb-4" />
          <p>No jobs yet</p>
          <p className="text-sm mt-2">
            Select species and start a scrape job to get started
          </p>
        </div>
      ) : (
        <div className="space-y-6">
          {/* Running Jobs - More prominent display */}
          {runningJobs.length > 0 && (
            <div className="space-y-4">
              <h2 className="text-lg font-semibold flex items-center gap-2">
                <RefreshCw className="w-5 h-5 animate-spin text-blue-500" />
                Active Jobs ({runningJobs.length})
              </h2>
              {runningJobs.map((job) => (
                <RunningJobCard
                  key={job.id}
                  job={job}
                  onPause={() => pauseMutation.mutate(job.id)}
                  onCancel={() => cancelMutation.mutate(job.id)}
                />
              ))}
            </div>
          )}

          {/* Other Jobs */}
          {otherJobs.length > 0 && (
            <div className="space-y-4">
              {runningJobs.length > 0 && (
                <h2 className="text-lg font-semibold text-gray-600">Other Jobs</h2>
              )}
              {otherJobs.map((job) => (
                <div
                  key={job.id}
                  className="bg-white rounded-lg shadow p-6"
                >
                  <div className="flex items-start justify-between">
                    <div className="flex-1">
                      <div className="flex items-center gap-3">
                        {getStatusIcon(job.status)}
                        <h3 className="font-semibold">{job.name}</h3>
                        <span
                          className={`px-2 py-0.5 rounded text-xs ${getStatusClass(
                            job.status
                          )}`}
                        >
                          {job.status}
                        </span>
                      </div>
                      <div className="mt-2 text-sm text-gray-600">
                        <span className="mr-4">Source: {job.source}</span>
                        <span className="mr-4">
                          Downloaded: {job.images_downloaded}
                        </span>
                        <span>Rejected: {job.images_rejected}</span>
                      </div>

                      {/* Progress bar for paused jobs */}
                      {job.status === 'paused' && job.progress_total > 0 && (
                        <div className="mt-4">
                          <div className="flex justify-between text-sm text-gray-600 mb-1">
                            <span>
                              {job.progress_current} / {job.progress_total} species
                            </span>
                            <span>
                              {Math.round(
                                (job.progress_current / job.progress_total) * 100
                              )}
                              %
                            </span>
                          </div>
                          <div className="h-2 bg-gray-200 rounded-full overflow-hidden">
                            <div
                              className="h-full rounded-full bg-gray-400"
                              style={{
                                width: `${
                                  (job.progress_current / job.progress_total) * 100
                                }%`,
                              }}
                            />
                          </div>
                        </div>
                      )}

                      {job.error_message && (
                        <div className="mt-2 text-sm text-red-600">
                          Error: {job.error_message}
                        </div>
                      )}

                      <div className="mt-2 text-xs text-gray-400">
                        {job.started_at && (
                          <span className="mr-4">
                            Started: {new Date(job.started_at).toLocaleString()}
                          </span>
                        )}
                        {job.completed_at && (
                          <span>
                            Completed: {new Date(job.completed_at).toLocaleString()}
                          </span>
                        )}
                      </div>
                    </div>

                    {/* Actions */}
                    <div className="flex gap-2 ml-4">
                      {job.status === 'paused' && (
                        <button
                          onClick={() => resumeMutation.mutate(job.id)}
                          className="p-2 text-blue-600 hover:bg-blue-50 rounded"
                          title="Resume"
                        >
                          <Play className="w-5 h-5" />
                        </button>
                      )}
                      {(job.status === 'paused' || job.status === 'pending') && (
                        <button
                          onClick={() => cancelMutation.mutate(job.id)}
                          className="p-2 text-red-600 hover:bg-red-50 rounded"
                          title="Cancel"
                        >
                          <XCircle className="w-5 h-5" />
                        </button>
                      )}
                    </div>
                  </div>
                </div>
              ))}
            </div>
          )}
        </div>
      )}
    </div>
  )
}

function RunningJobCard({
  job,
  onPause,
  onCancel,
}: {
  job: Job
  onPause: () => void
  onCancel: () => void
}) {
  // Fetch real-time progress for this job
  const { data: progress } = useQuery({
    queryKey: ['job-progress', job.id],
    queryFn: () => jobsApi.progress(job.id).then((res) => res.data),
    refetchInterval: 500, // Very fast updates for live feel
    enabled: job.status === 'running',
  })

  const currentSpecies = progress?.current_species || ''
  const progressCurrent = progress?.progress_current ?? job.progress_current
  const progressTotal = progress?.progress_total ?? job.progress_total
  const percentage =
    progressTotal > 0 ? Math.round((progressCurrent / progressTotal) * 100) : 0

  return (
    <div className="bg-gradient-to-r from-blue-50 to-white rounded-lg shadow-lg border-2 border-blue-200 p-6">
      <div className="flex items-start justify-between">
        <div className="flex-1">
          <div className="flex items-center gap-3">
            <RefreshCw className="w-5 h-5 text-blue-500 animate-spin" />
            <h3 className="font-semibold text-lg">{job.name}</h3>
            <span className="px-2 py-0.5 rounded text-xs bg-blue-100 text-blue-700 animate-pulse">
              running
            </span>
          </div>

          {/* Live Stats */}
          <div className="mt-4 grid grid-cols-3 gap-4">
            <div className="bg-white rounded-lg p-3 border">
              <div className="flex items-center gap-2 text-gray-500 text-sm">
                <Leaf className="w-4 h-4" />
                Species Progress
              </div>
              <div className="text-2xl font-bold text-blue-600 mt-1">
                {progressCurrent} / {progressTotal}
              </div>
            </div>
            <div className="bg-white rounded-lg p-3 border">
              <div className="flex items-center gap-2 text-gray-500 text-sm">
                <Download className="w-4 h-4" />
                Downloaded
              </div>
              <div className="text-2xl font-bold text-green-600 mt-1">
                {job.images_downloaded}
              </div>
            </div>
            <div className="bg-white rounded-lg p-3 border">
              <div className="flex items-center gap-2 text-gray-500 text-sm">
                <XOctagon className="w-4 h-4" />
                Rejected
              </div>
              <div className="text-2xl font-bold text-red-600 mt-1">
                {job.images_rejected}
              </div>
            </div>
          </div>

          {/* Current Species */}
          {currentSpecies && (
            <div className="mt-4 bg-white rounded-lg p-3 border">
              <div className="text-sm text-gray-500 mb-1">Currently scraping:</div>
              <div className="flex items-center gap-2">
                <span className="relative flex h-3 w-3">
                  <span className="animate-ping absolute inline-flex h-full w-full rounded-full bg-blue-400 opacity-75"></span>
                  <span className="relative inline-flex rounded-full h-3 w-3 bg-blue-500"></span>
                </span>
                <span className="font-medium text-blue-800 italic">{currentSpecies}</span>
              </div>
            </div>
          )}

          {/* Progress bar */}
          {progressTotal > 0 && (
            <div className="mt-4">
              <div className="flex justify-between text-sm text-gray-600 mb-1">
                <span>Progress</span>
                <span className="font-medium">{percentage}%</span>
              </div>
              <div className="h-3 bg-gray-200 rounded-full overflow-hidden">
                <div
                  className="h-full rounded-full bg-gradient-to-r from-blue-500 to-blue-600 transition-all duration-500"
                  style={{ width: `${percentage}%` }}
                />
              </div>
            </div>
          )}

          <div className="mt-3 text-xs text-gray-400">
            Source: {job.source} • Started: {job.started_at ? new Date(job.started_at).toLocaleString() : 'N/A'}
          </div>
        </div>

        {/* Actions */}
        <div className="flex gap-2 ml-4">
          <button
            onClick={onPause}
            className="p-2 text-gray-600 hover:bg-gray-100 rounded"
            title="Pause"
          >
            <Pause className="w-5 h-5" />
          </button>
          <button
            onClick={onCancel}
            className="p-2 text-red-600 hover:bg-red-50 rounded"
            title="Cancel"
          >
            <XCircle className="w-5 h-5" />
          </button>
        </div>
      </div>
    </div>
  )
}
543
frontend/src/pages/Settings.tsx
Normal file
@@ -0,0 +1,543 @@
import { useState } from 'react'
import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
import {
  Key,
  CheckCircle,
  XCircle,
  Eye,
  EyeOff,
  RefreshCw,
  FolderInput,
  AlertTriangle,
} from 'lucide-react'
import { sourcesApi, imagesApi, SourceConfig, ImportScanResult } from '../api/client'

export default function Settings() {
  const [editingSource, setEditingSource] = useState<string | null>(null)

  const { data: sources, isLoading, error } = useQuery({
    queryKey: ['sources'],
    queryFn: () => sourcesApi.list().then((res) => res.data),
  })

  return (
    <div className="space-y-6">
      <h1 className="text-2xl font-bold">Settings</h1>

      {/* API Keys Section */}
      <div className="bg-white rounded-lg shadow">
        <div className="px-6 py-4 border-b">
          <h2 className="text-lg font-semibold flex items-center gap-2">
            <Key className="w-5 h-5" />
            API Keys
          </h2>
          <p className="text-sm text-gray-500 mt-1">
            Configure API keys for each data source
          </p>
        </div>

        {isLoading ? (
          <div className="p-6 text-center">
            <RefreshCw className="w-6 h-6 animate-spin mx-auto text-gray-400" />
          </div>
        ) : error ? (
          <div className="p-6 text-center text-red-600">
            Error loading sources: {(error as Error).message}
          </div>
        ) : !sources || sources.length === 0 ? (
          <div className="p-6 text-center text-gray-500">
            No sources available
          </div>
        ) : (
          <div className="divide-y">
            {sources.map((source) => (
              <SourceRow
                key={source.name}
                source={source}
                isEditing={editingSource === source.name}
                onEdit={() => setEditingSource(source.name)}
                onClose={() => setEditingSource(null)}
              />
            ))}
          </div>
        )}
      </div>

      {/* Import Scanner Section */}
      <ImportScanner />

      {/* Rate Limits Info */}
      <div className="bg-yellow-50 border border-yellow-200 rounded-lg p-4">
        <h3 className="font-medium text-yellow-800">Rate Limits (recommended settings)</h3>
        <ul className="text-sm text-yellow-700 mt-2 space-y-1 list-disc list-inside">
          <li>GBIF: 1 req/sec safe (free, no authentication required)</li>
          <li>iNaturalist: 1 req/sec max (60/min limit), 10k/day, 5GB/hr media</li>
          <li>Flickr: 0.5 req/sec recommended (3600/hr limit shared across all users)</li>
          <li>Wikimedia: 1 req/sec safe (requires OAuth credentials)</li>
          <li>Trefle: 1 req/sec safe (120/min limit)</li>
        </ul>
      </div>
    </div>
  )
}

function SourceRow({
  source,
  isEditing,
  onEdit,
  onClose,
}: {
  source: SourceConfig
  isEditing: boolean
  onEdit: () => void
  onClose: () => void
}) {
  const queryClient = useQueryClient()
  const [showKey, setShowKey] = useState(false)
  const [form, setForm] = useState({
    api_key: '',
    api_secret: '',
    access_token: '',
    rate_limit_per_sec: source.configured
      ? source.rate_limit_per_sec
      : source.default_rate || 1.0,
    enabled: source.enabled,
  })

  // Get field labels based on auth type
  const isNoAuth = source.auth_type === 'none'
  const isOAuth = source.auth_type === 'oauth'
  const keyLabel = isOAuth ? 'Client ID' : 'API Key'
  const secretLabel = isOAuth ? 'Client Secret' : 'API Secret'
  const [testResult, setTestResult] = useState<{
    status: 'success' | 'error'
    message: string
  } | null>(null)

  const updateMutation = useMutation({
    mutationFn: () =>
      sourcesApi.update(source.name, {
        api_key: isNoAuth ? undefined : form.api_key || undefined,
        api_secret: form.api_secret || undefined,
        access_token: form.access_token || undefined,
        rate_limit_per_sec: form.rate_limit_per_sec,
        enabled: form.enabled,
      }),
    onSuccess: () => {
      queryClient.invalidateQueries({ queryKey: ['sources'] })
      onClose()
    },
  })

  const testMutation = useMutation({
    mutationFn: () => sourcesApi.test(source.name),
    onSuccess: (res) => {
      setTestResult({ status: res.data.status, message: res.data.message })
    },
    onError: (err: any) => {
      setTestResult({
        status: 'error',
        message: err.response?.data?.message || 'Connection failed',
      })
    },
  })

  if (isEditing) {
    return (
      <div className="p-6 bg-gray-50">
        <div className="flex items-center justify-between mb-4">
          <h3 className="font-medium">{source.label}</h3>
          <button
            onClick={onClose}
            className="text-gray-500 hover:text-gray-700"
          >
            Cancel
          </button>
        </div>

        <div className="space-y-4">
          {isNoAuth ? (
            <div className="bg-green-50 border border-green-200 rounded-lg p-3 text-green-700 text-sm">
              This source doesn't require authentication. Just enable it to start scraping.
            </div>
          ) : (
            <>
              <div>
                <label className="block text-sm font-medium mb-1">{keyLabel}</label>
                <div className="relative">
                  <input
                    type={showKey ? 'text' : 'password'}
                    value={form.api_key}
                    onChange={(e) => setForm({ ...form, api_key: e.target.value })}
                    placeholder={source.api_key_masked || `Enter ${keyLabel}`}
                    className="w-full px-3 py-2 border rounded-lg pr-10"
                  />
                  <button
                    type="button"
                    onClick={() => setShowKey(!showKey)}
                    className="absolute right-2 top-1/2 -translate-y-1/2 text-gray-400"
                  >
                    {showKey ? (
                      <EyeOff className="w-4 h-4" />
                    ) : (
                      <Eye className="w-4 h-4" />
                    )}
                  </button>
                </div>
              </div>

              {source.requires_secret && (
                <div>
                  <label className="block text-sm font-medium mb-1">
                    {secretLabel}
                  </label>
                  <input
                    type="password"
                    value={form.api_secret}
                    onChange={(e) =>
                      setForm({ ...form, api_secret: e.target.value })
                    }
                    placeholder={source.has_secret ? '••••••••' : `Enter ${secretLabel}`}
                    className="w-full px-3 py-2 border rounded-lg"
                  />
                </div>
              )}

              {isOAuth && (
                <div>
                  <label className="block text-sm font-medium mb-1">
                    Access Token
                  </label>
                  <input
                    type="password"
                    value={form.access_token}
                    onChange={(e) =>
                      setForm({ ...form, access_token: e.target.value })
                    }
                    placeholder={source.has_access_token ? '••••••••' : 'Enter Access Token'}
                    className="w-full px-3 py-2 border rounded-lg"
                  />
                </div>
              )}
            </>
          )}

          <div>
            <label className="block text-sm font-medium mb-1">
              Rate Limit (requests/sec)
            </label>
            <input
              type="number"
              value={form.rate_limit_per_sec}
              onChange={(e) =>
                setForm({
                  ...form,
                  rate_limit_per_sec: parseFloat(e.target.value) || 1,
                })
              }
              className="w-full px-3 py-2 border rounded-lg"
              min={0.1}
              max={10}
              step={0.1}
            />
          </div>

          <div className="flex items-center gap-2">
            <input
              type="checkbox"
              id="enabled"
              checked={form.enabled}
              onChange={(e) => setForm({ ...form, enabled: e.target.checked })}
              className="rounded"
            />
            <label htmlFor="enabled" className="text-sm">
              Enable this source
            </label>
          </div>

          {testResult && (
            <div
              className={`p-3 rounded-lg ${
                testResult.status === 'success'
                  ? 'bg-green-50 text-green-700'
                  : 'bg-red-50 text-red-700'
              }`}
            >
              {testResult.message}
            </div>
          )}

          <div className="flex justify-between">
            {source.configured && (
              <button
                onClick={() => testMutation.mutate()}
                disabled={testMutation.isPending}
                className="px-4 py-2 border rounded-lg hover:bg-white"
              >
                {testMutation.isPending ? 'Testing...' : 'Test Connection'}
              </button>
            )}
            <button
              onClick={() => updateMutation.mutate()}
              disabled={!isNoAuth && !form.api_key && !source.configured}
              className="px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700 disabled:opacity-50 ml-auto"
            >
              Save
            </button>
          </div>
        </div>
      </div>
    )
  }

  const isNoAuthRow = source.auth_type === 'none'

  return (
    <div className="px-6 py-4 flex items-center justify-between">
      <div className="flex items-center gap-4">
        <div
          className={`w-2 h-2 rounded-full ${
            (isNoAuthRow || source.configured) && source.enabled
              ? 'bg-green-500'
              : source.configured
              ? 'bg-yellow-500'
              : 'bg-gray-300'
          }`}
        />
        <div>
          <h3 className="font-medium">{source.label}</h3>
          <p className="text-sm text-gray-500">
            {isNoAuthRow
              ? 'No authentication required'
              : source.configured
              ? `Key: ${source.api_key_masked}`
              : 'Not configured'}
          </p>
        </div>
      </div>
      <div className="flex items-center gap-4">
        {(isNoAuthRow || source.configured) && (
          <span
            className={`flex items-center gap-1 text-sm ${
|
||||
source.enabled ? 'text-green-600' : 'text-gray-400'
|
||||
}`}
|
||||
>
|
||||
{source.enabled ? (
|
||||
<>
|
||||
<CheckCircle className="w-4 h-4" />
|
||||
Enabled
|
||||
</>
|
||||
) : (
|
||||
<>
|
||||
<XCircle className="w-4 h-4" />
|
||||
Disabled
|
||||
</>
|
||||
)}
|
||||
</span>
|
||||
)}
|
||||
<button
|
||||
onClick={onEdit}
|
||||
className="px-3 py-1 text-sm border rounded hover:bg-gray-50"
|
||||
>
|
||||
{isNoAuthRow || source.configured ? 'Edit' : 'Configure'}
|
||||
</button>
|
||||
</div>
|
||||
</div>
|
||||
)
|
||||
}
|
||||
|
||||
function ImportScanner() {
|
||||
const [scanResult, setScanResult] = useState<ImportScanResult | null>(null)
|
||||
const [moveFiles, setMoveFiles] = useState(false)
|
||||
const [importResult, setImportResult] = useState<{
|
||||
imported: number
|
||||
skipped: number
|
||||
errors: string[]
|
||||
} | null>(null)
|
||||
|
||||
const scanMutation = useMutation({
|
||||
mutationFn: () => imagesApi.scanImports().then((res) => res.data),
|
||||
onSuccess: (data) => {
|
||||
setScanResult(data)
|
||||
setImportResult(null)
|
||||
},
|
||||
})
|
||||
|
||||
const importMutation = useMutation({
|
||||
mutationFn: () => imagesApi.runImport(moveFiles).then((res) => res.data),
|
||||
onSuccess: (data) => {
|
||||
setImportResult(data)
|
||||
setScanResult(null)
|
||||
},
|
||||
})
|
||||
|
||||
return (
|
||||
<div className="bg-white rounded-lg shadow">
|
||||
<div className="px-6 py-4 border-b">
|
||||
<h2 className="text-lg font-semibold flex items-center gap-2">
|
||||
<FolderInput className="w-5 h-5" />
|
||||
Import Images
|
||||
</h2>
|
||||
<p className="text-sm text-gray-500 mt-1">
|
||||
Bulk import images from the imports folder
|
||||
</p>
|
||||
</div>
|
||||
|
||||
<div className="p-6 space-y-4">
|
||||
<div className="bg-gray-50 rounded-lg p-4">
|
||||
<h3 className="font-medium text-sm mb-2">Expected folder structure:</h3>
|
||||
<code className="text-sm text-gray-600 block">
|
||||
imports/{'{source}'}/{'{species_name}'}/*.jpg
|
||||
</code>
|
||||
<p className="text-sm text-gray-500 mt-2">
|
||||
Example: imports/inaturalist/Monstera_deliciosa/image1.jpg
|
||||
</p>
|
||||
</div>
|
||||
|
||||
<div className="flex items-center gap-4">
|
||||
<button
|
||||
onClick={() => scanMutation.mutate()}
|
||||
disabled={scanMutation.isPending}
|
||||
className="px-4 py-2 bg-blue-600 text-white rounded-lg hover:bg-blue-700 disabled:opacity-50 flex items-center gap-2"
|
||||
>
|
||||
{scanMutation.isPending ? (
|
||||
<>
|
||||
<RefreshCw className="w-4 h-4 animate-spin" />
|
||||
Scanning...
|
||||
</>
|
||||
) : (
|
||||
'Scan Imports Folder'
|
||||
)}
|
||||
</button>
|
||||
</div>
|
||||
|
||||
{scanMutation.isError && (
|
||||
<div className="bg-red-50 border border-red-200 rounded-lg p-4 text-red-700">
|
||||
Error scanning: {(scanMutation.error as Error).message}
|
||||
</div>
|
||||
)}
|
||||
|
||||
{scanResult && (
|
||||
<div className="space-y-4">
|
||||
{!scanResult.available ? (
|
||||
<div className="bg-yellow-50 border border-yellow-200 rounded-lg p-4">
|
||||
<p className="text-yellow-700">{scanResult.message}</p>
|
||||
</div>
|
||||
) : scanResult.total_images === 0 ? (
|
||||
<div className="bg-gray-50 border border-gray-200 rounded-lg p-4">
|
||||
<p className="text-gray-600">No images found in the imports folder.</p>
|
||||
</div>
|
||||
) : (
|
||||
<>
|
||||
<div className="bg-green-50 border border-green-200 rounded-lg p-4">
|
||||
<h3 className="font-medium text-green-800 mb-2">Scan Results</h3>
|
||||
<div className="grid grid-cols-2 gap-4 text-sm">
|
||||
<div>
|
||||
<span className="text-gray-600">Total Images:</span>
|
||||
<span className="ml-2 font-medium">{scanResult.total_images}</span>
|
||||
</div>
|
||||
<div>
|
||||
<span className="text-gray-600">Matched Species:</span>
|
||||
<span className="ml-2 font-medium">{scanResult.matched_species}</span>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
{scanResult.sources.length > 0 && (
|
||||
<div className="mt-4">
|
||||
<h4 className="text-sm font-medium text-green-800 mb-2">Sources Found:</h4>
|
||||
<div className="space-y-1">
|
||||
{scanResult.sources.map((source) => (
|
||||
<div key={source.name} className="text-sm flex justify-between">
|
||||
<span>{source.name}</span>
|
||||
<span className="text-gray-600">
|
||||
{source.species_count} species, {source.image_count} images
|
||||
</span>
|
||||
</div>
|
||||
))}
|
||||
</div>
|
||||
</div>
|
||||
)}
|
||||
</div>
|
||||
|
||||
{scanResult.unmatched_species.length > 0 && (
|
||||
<div className="bg-yellow-50 border border-yellow-200 rounded-lg p-4">
|
||||
<h3 className="font-medium text-yellow-800 flex items-center gap-2 mb-2">
|
||||
<AlertTriangle className="w-4 h-4" />
|
||||
Unmatched Species ({scanResult.unmatched_species.length})
|
||||
</h3>
|
||||
<p className="text-sm text-yellow-700 mb-2">
|
||||
These species folders don't match any species in the database and will be skipped:
|
||||
</p>
|
||||
<div className="text-sm text-yellow-600 max-h-32 overflow-y-auto">
|
||||
{scanResult.unmatched_species.slice(0, 20).map((name) => (
|
||||
<div key={name}>{name}</div>
|
||||
))}
|
||||
{scanResult.unmatched_species.length > 20 && (
|
||||
<div className="text-yellow-500 mt-1">
|
||||
...and {scanResult.unmatched_species.length - 20} more
|
||||
</div>
|
||||
)}
|
||||
</div>
|
||||
</div>
|
||||
)}
|
||||
|
||||
<div className="border-t pt-4">
|
||||
<div className="flex items-center gap-4 mb-4">
|
||||
<label className="flex items-center gap-2 text-sm">
|
||||
<input
|
||||
type="checkbox"
|
||||
checked={moveFiles}
|
||||
onChange={(e) => setMoveFiles(e.target.checked)}
|
||||
className="rounded"
|
||||
/>
|
||||
Move files instead of copy (removes originals)
|
||||
</label>
|
||||
</div>
|
||||
|
||||
<button
|
||||
onClick={() => importMutation.mutate()}
|
||||
disabled={importMutation.isPending || scanResult.matched_species === 0}
|
||||
className="px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700 disabled:opacity-50 flex items-center gap-2"
|
||||
>
|
||||
{importMutation.isPending ? (
|
||||
<>
|
||||
<RefreshCw className="w-4 h-4 animate-spin" />
|
||||
Importing...
|
||||
</>
|
||||
) : (
|
||||
`Import ${scanResult.total_images} Images`
|
||||
)}
|
||||
</button>
|
||||
</div>
|
||||
</>
|
||||
)}
|
||||
</div>
|
||||
)}
|
||||
|
||||
{importResult && (
|
||||
<div className="bg-green-50 border border-green-200 rounded-lg p-4">
|
||||
<h3 className="font-medium text-green-800 mb-2">Import Complete</h3>
|
||||
<div className="text-sm space-y-1">
|
||||
<div>
|
||||
<span className="text-gray-600">Imported:</span>
|
||||
<span className="ml-2 font-medium text-green-700">{importResult.imported}</span>
|
||||
</div>
|
||||
<div>
|
||||
<span className="text-gray-600">Skipped (already exists):</span>
|
||||
<span className="ml-2 font-medium">{importResult.skipped}</span>
|
||||
</div>
|
||||
{importResult.errors.length > 0 && (
|
||||
<div className="mt-2">
|
||||
<span className="text-red-600">Errors ({importResult.errors.length}):</span>
|
||||
<div className="text-red-500 mt-1 max-h-24 overflow-y-auto">
|
||||
{importResult.errors.map((err, i) => (
|
||||
<div key={i} className="text-xs">{err}</div>
|
||||
))}
|
||||
</div>
|
||||
</div>
|
||||
)}
|
||||
</div>
|
||||
</div>
|
||||
)}
|
||||
</div>
|
||||
</div>
|
||||
)
|
||||
}
|
||||
997
frontend/src/pages/Species.tsx
Normal file
@@ -0,0 +1,997 @@
|
||||
import { useState, useRef } from 'react'
|
||||
import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
|
||||
import {
|
||||
Plus,
|
||||
Upload,
|
||||
Search,
|
||||
Trash2,
|
||||
Play,
|
||||
ChevronLeft,
|
||||
ChevronRight,
|
||||
Filter,
|
||||
X,
|
||||
Image as ImageIcon,
|
||||
ExternalLink,
|
||||
} from 'lucide-react'
|
||||
import { speciesApi, jobsApi, imagesApi, Species as SpeciesType } from '../api/client'
|
||||
|
||||
export default function Species() {
|
||||
const queryClient = useQueryClient()
|
||||
const csvInputRef = useRef<HTMLInputElement>(null)
|
||||
const jsonInputRef = useRef<HTMLInputElement>(null)
|
||||
|
||||
const [page, setPage] = useState(1)
|
||||
const [search, setSearch] = useState('')
|
||||
const [genus, setGenus] = useState<string>('')
|
||||
const [hasImages, setHasImages] = useState<string>('')
|
||||
const [maxImages, setMaxImages] = useState<string>('')
|
||||
const [selectedIds, setSelectedIds] = useState<number[]>([])
|
||||
const [showAddModal, setShowAddModal] = useState(false)
|
||||
const [showScrapeModal, setShowScrapeModal] = useState(false)
|
||||
const [showScrapeAllModal, setShowScrapeAllModal] = useState(false)
|
||||
const [showScrapeFilteredModal, setShowScrapeFilteredModal] = useState(false)
|
||||
const [viewSpecies, setViewSpecies] = useState<SpeciesType | null>(null)
|
||||
|
||||
const { data: genera } = useQuery({
|
||||
queryKey: ['genera'],
|
||||
queryFn: () => speciesApi.genera().then((res) => res.data),
|
||||
})
|
||||
|
||||
const { data, isLoading } = useQuery({
|
||||
queryKey: ['species', page, search, genus, hasImages, maxImages],
|
||||
queryFn: () =>
|
||||
speciesApi.list({
|
||||
page,
|
||||
page_size: 50,
|
||||
search: search || undefined,
|
||||
genus: genus || undefined,
|
||||
has_images: hasImages === '' ? undefined : hasImages === 'true',
|
||||
max_images: maxImages ? parseInt(maxImages) : undefined,
|
||||
}).then((res) => res.data),
|
||||
})
|
||||
|
||||
const importCsvMutation = useMutation({
|
||||
mutationFn: (file: File) => speciesApi.import(file),
|
||||
onSuccess: (res) => {
|
||||
queryClient.invalidateQueries({ queryKey: ['species'] })
|
||||
queryClient.invalidateQueries({ queryKey: ['genera'] })
|
||||
alert(`Imported ${res.data.imported} species, skipped ${res.data.skipped}`)
|
||||
},
|
||||
})
|
||||
|
||||
const importJsonMutation = useMutation({
|
||||
mutationFn: (file: File) => speciesApi.importJson(file),
|
||||
onSuccess: (res) => {
|
||||
queryClient.invalidateQueries({ queryKey: ['species'] })
|
||||
queryClient.invalidateQueries({ queryKey: ['genera'] })
|
||||
alert(`Imported ${res.data.imported} species, skipped ${res.data.skipped}`)
|
||||
},
|
||||
})
|
||||
|
||||
const deleteMutation = useMutation({
|
||||
mutationFn: (id: number) => speciesApi.delete(id),
|
||||
onSuccess: () => {
|
||||
queryClient.invalidateQueries({ queryKey: ['species'] })
|
||||
},
|
||||
})
|
||||
|
||||
const createJobMutation = useMutation({
|
||||
mutationFn: (data: { name: string; source: string; species_ids?: number[] }) =>
|
||||
jobsApi.create(data),
|
||||
onSuccess: () => {
|
||||
setShowScrapeModal(false)
|
||||
setSelectedIds([])
|
||||
alert('Scrape job created!')
|
||||
},
|
||||
})
|
||||
|
||||
const handleCsvImport = (e: React.ChangeEvent<HTMLInputElement>) => {
|
||||
const file = e.target.files?.[0]
|
||||
if (file) {
|
||||
importCsvMutation.mutate(file)
|
||||
e.target.value = ''
|
||||
}
|
||||
}
|
||||
|
||||
const handleJsonImport = (e: React.ChangeEvent<HTMLInputElement>) => {
|
||||
const file = e.target.files?.[0]
|
||||
if (file) {
|
||||
importJsonMutation.mutate(file)
|
||||
e.target.value = ''
|
||||
}
|
||||
}
|
||||
|
||||
const handleSelectAll = () => {
|
||||
if (!data) return
|
||||
if (selectedIds.length === data.items.length) {
|
||||
setSelectedIds([])
|
||||
} else {
|
||||
setSelectedIds(data.items.map((s) => s.id))
|
||||
}
|
||||
}
|
||||
|
||||
const handleSelect = (id: number) => {
|
||||
setSelectedIds((prev) =>
|
||||
prev.includes(id) ? prev.filter((i) => i !== id) : [...prev, id]
|
||||
)
|
||||
}
|
||||
|
||||
return (
|
||||
<div className="space-y-6">
|
||||
<div className="flex items-center justify-between">
|
||||
<h1 className="text-2xl font-bold">Species</h1>
|
||||
<div className="flex gap-2">
|
||||
<button
|
||||
onClick={() => csvInputRef.current?.click()}
|
||||
disabled={importCsvMutation.isPending}
|
||||
className="flex items-center gap-2 px-4 py-2 bg-gray-100 rounded-lg hover:bg-gray-200 disabled:opacity-50"
|
||||
>
|
||||
<Upload className="w-4 h-4" />
|
||||
{importCsvMutation.isPending ? 'Importing...' : 'Import CSV'}
|
||||
</button>
|
||||
<input
|
||||
ref={csvInputRef}
|
||||
type="file"
|
||||
accept=".csv"
|
||||
onChange={handleCsvImport}
|
||||
className="hidden"
|
||||
/>
|
||||
<button
|
||||
onClick={() => jsonInputRef.current?.click()}
|
||||
disabled={importJsonMutation.isPending}
|
||||
className="flex items-center gap-2 px-4 py-2 bg-gray-100 rounded-lg hover:bg-gray-200 disabled:opacity-50"
|
||||
>
|
||||
<Upload className="w-4 h-4" />
|
||||
{importJsonMutation.isPending ? 'Importing...' : 'Import JSON'}
|
||||
</button>
|
||||
<input
|
||||
ref={jsonInputRef}
|
||||
type="file"
|
||||
accept=".json"
|
||||
onChange={handleJsonImport}
|
||||
className="hidden"
|
||||
/>
|
||||
<button
|
||||
onClick={() => setShowAddModal(true)}
|
||||
className="flex items-center gap-2 px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700"
|
||||
>
|
||||
<Plus className="w-4 h-4" />
|
||||
Add Species
|
||||
</button>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
{/* Search and Filters */}
|
||||
<div className="flex items-center gap-4 flex-wrap">
|
||||
<div className="relative">
|
||||
<Search className="absolute left-3 top-1/2 -translate-y-1/2 w-4 h-4 text-gray-400" />
|
||||
<input
|
||||
type="text"
|
||||
placeholder="Search species..."
|
||||
value={search}
|
||||
onChange={(e) => {
|
||||
setSearch(e.target.value)
|
||||
setPage(1)
|
||||
}}
|
||||
className="pl-10 pr-4 py-2 border rounded-lg w-64"
|
||||
/>
|
||||
</div>
|
||||
|
||||
<div className="flex items-center gap-2">
|
||||
<Filter className="w-4 h-4 text-gray-400" />
|
||||
<select
|
||||
value={genus}
|
||||
onChange={(e) => {
|
||||
setGenus(e.target.value)
|
||||
setPage(1)
|
||||
}}
|
||||
className="px-3 py-2 border rounded-lg bg-white"
|
||||
>
|
||||
<option value="">All Genera</option>
|
||||
{genera?.map((g) => (
|
||||
<option key={g} value={g}>
|
||||
{g}
|
||||
</option>
|
||||
))}
|
||||
</select>
|
||||
|
||||
<select
|
||||
value={hasImages}
|
||||
onChange={(e) => {
|
||||
setHasImages(e.target.value)
|
||||
setMaxImages('')
|
||||
setPage(1)
|
||||
}}
|
||||
className="px-3 py-2 border rounded-lg bg-white"
|
||||
>
|
||||
<option value="">All Species</option>
|
||||
<option value="true">Has Images</option>
|
||||
<option value="false">No Images</option>
|
||||
</select>
|
||||
|
||||
<select
|
||||
value={maxImages}
|
||||
onChange={(e) => {
|
||||
setMaxImages(e.target.value)
|
||||
setHasImages('')
|
||||
setPage(1)
|
||||
}}
|
||||
className="px-3 py-2 border rounded-lg bg-white"
|
||||
>
|
||||
<option value="">Any Image Count</option>
|
||||
<option value="25">Less than 25 images</option>
|
||||
<option value="50">Less than 50 images</option>
|
||||
<option value="100">Less than 100 images</option>
|
||||
<option value="250">Less than 250 images</option>
|
||||
<option value="500">Less than 500 images</option>
|
||||
</select>
|
||||
|
||||
{(genus || hasImages || maxImages) && (
|
||||
<button
|
||||
onClick={() => {
|
||||
setGenus('')
|
||||
setHasImages('')
|
||||
setMaxImages('')
|
||||
setPage(1)
|
||||
}}
|
||||
className="flex items-center gap-1 px-2 py-1 text-sm text-gray-500 hover:text-gray-700"
|
||||
>
|
||||
<X className="w-3 h-3" />
|
||||
Clear
|
||||
</button>
|
||||
)}
|
||||
</div>
|
||||
|
||||
<div className="ml-auto flex items-center gap-4">
|
||||
{maxImages && data && data.total > 0 && (
|
||||
<button
|
||||
onClick={() => setShowScrapeFilteredModal(true)}
|
||||
className="flex items-center gap-2 px-4 py-2 bg-purple-600 text-white rounded-lg hover:bg-purple-700"
|
||||
>
|
||||
<Play className="w-4 h-4" />
|
||||
Scrape All {data.total} Filtered
|
||||
</button>
|
||||
)}
|
||||
<button
|
||||
onClick={() => setShowScrapeAllModal(true)}
|
||||
className="flex items-center gap-2 px-4 py-2 bg-orange-600 text-white rounded-lg hover:bg-orange-700"
|
||||
>
|
||||
<Play className="w-4 h-4" />
|
||||
Scrape All Without Images
|
||||
</button>
|
||||
{selectedIds.length > 0 && (
|
||||
<div className="flex items-center gap-4">
|
||||
<span className="text-sm text-gray-600">
|
||||
{selectedIds.length} selected
|
||||
</span>
|
||||
<button
|
||||
onClick={() => setShowScrapeModal(true)}
|
||||
className="flex items-center gap-2 px-4 py-2 bg-blue-600 text-white rounded-lg hover:bg-blue-700"
|
||||
>
|
||||
<Play className="w-4 h-4" />
|
||||
Start Scrape
|
||||
</button>
|
||||
</div>
|
||||
)}
|
||||
</div>
|
||||
</div>
|
||||
|
||||
{/* Table */}
|
||||
<div className="bg-white rounded-lg shadow overflow-hidden">
|
||||
<table className="w-full">
|
||||
<thead className="bg-gray-50">
|
||||
<tr>
|
||||
<th className="px-4 py-3 text-left">
|
||||
<input
|
||||
type="checkbox"
|
||||
checked={(data?.items?.length ?? 0) > 0 && selectedIds.length === (data?.items?.length ?? 0)}
|
||||
onChange={handleSelectAll}
|
||||
className="rounded"
|
||||
/>
|
||||
</th>
|
||||
<th className="px-4 py-3 text-left text-sm font-medium text-gray-600">
|
||||
Scientific Name
|
||||
</th>
|
||||
<th className="px-4 py-3 text-left text-sm font-medium text-gray-600">
|
||||
Common Name
|
||||
</th>
|
||||
<th className="px-4 py-3 text-left text-sm font-medium text-gray-600">
|
||||
Genus
|
||||
</th>
|
||||
<th className="px-4 py-3 text-right text-sm font-medium text-gray-600">
|
||||
Images
|
||||
</th>
|
||||
<th className="px-4 py-3 text-right text-sm font-medium text-gray-600">
|
||||
Actions
|
||||
</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
{isLoading ? (
|
||||
<tr>
|
||||
<td colSpan={6} className="px-4 py-8 text-center text-gray-400">
|
||||
Loading...
|
||||
</td>
|
||||
</tr>
|
||||
) : data?.items.length === 0 ? (
|
||||
<tr>
|
||||
<td colSpan={6} className="px-4 py-8 text-center text-gray-400">
|
||||
No species found. Import a CSV to get started.
|
||||
</td>
|
||||
</tr>
|
||||
) : (
|
||||
data?.items.map((species) => (
|
||||
<tr
|
||||
key={species.id}
|
||||
className="border-t hover:bg-gray-50 cursor-pointer"
|
||||
onClick={() => setViewSpecies(species)}
|
||||
>
|
||||
<td className="px-4 py-3" onClick={(e) => e.stopPropagation()}>
|
||||
<input
|
||||
type="checkbox"
|
||||
checked={selectedIds.includes(species.id)}
|
||||
onChange={() => handleSelect(species.id)}
|
||||
className="rounded"
|
||||
/>
|
||||
</td>
|
||||
<td className="px-4 py-3 font-medium">{species.scientific_name}</td>
|
||||
<td className="px-4 py-3 text-gray-600">
|
||||
{species.common_name || '-'}
|
||||
</td>
|
||||
<td className="px-4 py-3 text-gray-600">{species.genus || '-'}</td>
|
||||
<td className="px-4 py-3 text-right">
|
||||
<span
|
||||
className={`inline-block px-2 py-1 rounded text-sm ${
|
||||
species.image_count >= 100
|
||||
? 'bg-green-100 text-green-700'
|
||||
: species.image_count > 0
|
||||
? 'bg-yellow-100 text-yellow-700'
|
||||
: 'bg-gray-100 text-gray-600'
|
||||
}`}
|
||||
>
|
||||
{species.image_count}
|
||||
</span>
|
||||
</td>
|
||||
<td className="px-4 py-3 text-right" onClick={(e) => e.stopPropagation()}>
|
||||
<button
|
||||
onClick={() => deleteMutation.mutate(species.id)}
|
||||
className="p-1 text-red-500 hover:bg-red-50 rounded"
|
||||
>
|
||||
<Trash2 className="w-4 h-4" />
|
||||
</button>
|
||||
</td>
|
||||
</tr>
|
||||
))
|
||||
)}
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
|
||||
{/* Pagination */}
|
||||
{data && data.pages > 1 && (
|
||||
<div className="flex items-center justify-between">
|
||||
<span className="text-sm text-gray-600">
|
||||
Showing {(page - 1) * 50 + 1} to {Math.min(page * 50, data.total)} of{' '}
|
||||
{data.total}
|
||||
</span>
|
||||
<div className="flex gap-2">
|
||||
<button
|
||||
onClick={() => setPage((p) => Math.max(1, p - 1))}
|
||||
disabled={page === 1}
|
||||
className="p-2 rounded border disabled:opacity-50"
|
||||
>
|
||||
<ChevronLeft className="w-4 h-4" />
|
||||
</button>
|
||||
<span className="px-4 py-2">
|
||||
Page {page} of {data.pages}
|
||||
</span>
|
||||
<button
|
||||
onClick={() => setPage((p) => Math.min(data.pages, p + 1))}
|
||||
disabled={page === data.pages}
|
||||
className="p-2 rounded border disabled:opacity-50"
|
||||
>
|
||||
<ChevronRight className="w-4 h-4" />
|
||||
</button>
|
||||
</div>
|
||||
</div>
|
||||
)}
|
||||
|
||||
{/* Add Species Modal */}
|
||||
{showAddModal && (
|
||||
<AddSpeciesModal onClose={() => setShowAddModal(false)} />
|
||||
)}
|
||||
|
||||
{/* Scrape Modal */}
|
||||
{showScrapeModal && (
|
||||
<ScrapeModal
|
||||
selectedIds={selectedIds}
|
||||
onClose={() => setShowScrapeModal(false)}
|
||||
onSubmit={(source) => {
|
||||
createJobMutation.mutate({
|
||||
name: `Scrape ${selectedIds.length} species from ${source}`,
|
||||
source,
|
||||
species_ids: selectedIds,
|
||||
})
|
||||
}}
|
||||
/>
|
||||
)}
|
||||
|
||||
{/* Species Detail Modal */}
|
||||
{viewSpecies && (
|
||||
<SpeciesDetailModal
|
||||
species={viewSpecies}
|
||||
onClose={() => setViewSpecies(null)}
|
||||
/>
|
||||
)}
|
||||
|
||||
{/* Scrape All Without Images Modal */}
|
||||
{showScrapeAllModal && (
|
||||
<ScrapeAllModal
|
||||
onClose={() => setShowScrapeAllModal(false)}
|
||||
/>
|
||||
)}
|
||||
|
||||
{/* Scrape All Filtered Modal */}
|
||||
{showScrapeFilteredModal && (
|
||||
<ScrapeFilteredModal
|
||||
maxImages={parseInt(maxImages)}
|
||||
speciesCount={data?.total ?? 0}
|
||||
onClose={() => setShowScrapeFilteredModal(false)}
|
||||
/>
|
||||
)}
|
||||
</div>
|
||||
)
|
||||
}
|
||||
|
||||
function AddSpeciesModal({ onClose }: { onClose: () => void }) {
|
||||
const queryClient = useQueryClient()
|
||||
const [form, setForm] = useState({
|
||||
scientific_name: '',
|
||||
common_name: '',
|
||||
genus: '',
|
||||
family: '',
|
||||
})
|
||||
|
||||
const mutation = useMutation({
|
||||
mutationFn: () => speciesApi.create(form),
|
||||
onSuccess: () => {
|
||||
queryClient.invalidateQueries({ queryKey: ['species'] })
|
||||
onClose()
|
||||
},
|
||||
})
|
||||
|
||||
return (
|
||||
<div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50">
|
||||
<div className="bg-white rounded-lg p-6 w-full max-w-md">
|
||||
<h2 className="text-xl font-bold mb-4">Add Species</h2>
|
||||
<div className="space-y-4">
|
||||
<div>
|
||||
<label className="block text-sm font-medium mb-1">
|
||||
Scientific Name *
|
||||
</label>
|
||||
<input
|
||||
type="text"
|
||||
value={form.scientific_name}
|
||||
onChange={(e) =>
|
||||
setForm({ ...form, scientific_name: e.target.value })
|
||||
}
|
||||
className="w-full px-3 py-2 border rounded-lg"
|
||||
placeholder="e.g. Monstera deliciosa"
|
||||
/>
|
||||
</div>
|
||||
<div>
|
||||
<label className="block text-sm font-medium mb-1">Common Name</label>
|
||||
<input
|
||||
type="text"
|
||||
value={form.common_name}
|
||||
onChange={(e) => setForm({ ...form, common_name: e.target.value })}
|
||||
className="w-full px-3 py-2 border rounded-lg"
|
||||
placeholder="e.g. Swiss Cheese Plant"
|
||||
/>
|
||||
</div>
|
||||
<div className="grid grid-cols-2 gap-4">
|
||||
<div>
|
||||
<label className="block text-sm font-medium mb-1">Genus</label>
|
||||
<input
|
||||
type="text"
|
||||
value={form.genus}
|
||||
onChange={(e) => setForm({ ...form, genus: e.target.value })}
|
||||
className="w-full px-3 py-2 border rounded-lg"
|
||||
placeholder="e.g. Monstera"
|
||||
/>
|
||||
</div>
|
||||
<div>
|
||||
<label className="block text-sm font-medium mb-1">Family</label>
|
||||
<input
|
||||
type="text"
|
||||
value={form.family}
|
||||
onChange={(e) => setForm({ ...form, family: e.target.value })}
|
||||
className="w-full px-3 py-2 border rounded-lg"
|
||||
placeholder="e.g. Araceae"
|
||||
/>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<div className="flex justify-end gap-2 mt-6">
|
||||
<button
|
||||
onClick={onClose}
|
||||
className="px-4 py-2 border rounded-lg hover:bg-gray-50"
|
||||
>
|
||||
Cancel
|
||||
</button>
|
||||
<button
|
||||
onClick={() => mutation.mutate()}
|
||||
disabled={!form.scientific_name}
|
||||
className="px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700 disabled:opacity-50"
|
||||
>
|
||||
Add Species
|
||||
</button>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
)
|
||||
}
|
||||
|
||||
function ScrapeModal({
|
||||
selectedIds,
|
||||
onClose,
|
||||
onSubmit,
|
||||
}: {
|
||||
selectedIds: number[]
|
||||
onClose: () => void
|
||||
onSubmit: (source: string) => void
|
||||
}) {
|
||||
const [source, setSource] = useState('inaturalist')
|
||||
|
||||
const sources = [
|
||||
{ value: 'gbif', label: 'GBIF' },
|
||||
{ value: 'inaturalist', label: 'iNaturalist' },
|
||||
{ value: 'flickr', label: 'Flickr' },
|
||||
{ value: 'wikimedia', label: 'Wikimedia Commons' },
|
||||
{ value: 'trefle', label: 'Trefle.io' },
|
||||
{ value: 'duckduckgo', label: 'DuckDuckGo' },
|
||||
{ value: 'bing', label: 'Bing Image Search' },
|
||||
]
|
||||
|
||||
return (
|
||||
<div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50">
|
||||
<div className="bg-white rounded-lg p-6 w-full max-w-md">
|
||||
<h2 className="text-xl font-bold mb-4">Start Scrape Job</h2>
|
||||
<p className="text-gray-600 mb-4">
|
||||
Scrape images for {selectedIds.length} selected species
|
||||
</p>
|
||||
<div>
|
||||
<label className="block text-sm font-medium mb-2">Select Source</label>
|
||||
<div className="space-y-2">
|
||||
{sources.map((s) => (
|
||||
<label
|
||||
key={s.value}
|
||||
className={`flex items-center p-3 border rounded-lg cursor-pointer ${
|
||||
source === s.value ? 'border-green-500 bg-green-50' : ''
|
||||
}`}
|
||||
>
|
||||
<input
|
||||
type="radio"
|
||||
value={s.value}
|
||||
checked={source === s.value}
|
||||
onChange={(e) => setSource(e.target.value)}
|
||||
className="mr-3"
|
||||
/>
|
||||
{s.label}
|
||||
</label>
|
||||
))}
|
||||
</div>
|
||||
</div>
|
||||
<div className="flex justify-end gap-2 mt-6">
|
||||
<button
|
||||
onClick={onClose}
|
||||
className="px-4 py-2 border rounded-lg hover:bg-gray-50"
|
||||
>
|
||||
Cancel
|
||||
</button>
|
||||
<button
|
||||
onClick={() => onSubmit(source)}
|
||||
className="px-4 py-2 bg-blue-600 text-white rounded-lg hover:bg-blue-700"
|
||||
>
|
||||
Start Scrape
|
||||
</button>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
)
|
||||
}
|
||||
|
||||
function SpeciesDetailModal({
|
||||
species,
|
||||
onClose,
|
||||
}: {
|
||||
species: SpeciesType
|
||||
onClose: () => void
|
||||
}) {
|
||||
const [page, setPage] = useState(1)
|
||||
const pageSize = 20
|
||||
|
||||
const { data, isLoading } = useQuery({
|
||||
queryKey: ['species-images', species.id, page],
|
||||
queryFn: () =>
|
||||
imagesApi.list({
|
||||
species_id: species.id,
|
||||
status: 'downloaded',
|
||||
page,
|
||||
page_size: pageSize,
|
||||
}).then((res) => res.data),
|
||||
})
|
||||
|
||||
return (
|
||||
<div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50 p-4">
|
||||
<div className="bg-white rounded-lg w-full max-w-5xl max-h-[90vh] flex flex-col">
|
||||
{/* Header */}
|
||||
<div className="px-6 py-4 border-b flex items-start justify-between">
|
||||
<div>
|
||||
<h2 className="text-xl font-bold">{species.scientific_name}</h2>
|
||||
{species.common_name && (
|
||||
<p className="text-gray-600">{species.common_name}</p>
|
||||
)}
|
||||
<div className="flex gap-4 mt-2 text-sm text-gray-500">
|
||||
{species.genus && <span>Genus: {species.genus}</span>}
|
||||
{species.family && <span>Family: {species.family}</span>}
|
||||
<span>{species.image_count} images</span>
|
||||
</div>
|
||||
</div>
|
||||
<button
|
||||
onClick={onClose}
|
||||
className="p-2 hover:bg-gray-100 rounded-lg"
|
||||
>
|
||||
<X className="w-5 h-5" />
|
||||
</button>
|
||||
</div>
|
||||
|
||||
{/* Images Grid */}
|
||||
<div className="flex-1 overflow-y-auto p-6">
|
||||
{isLoading ? (
|
||||
<div className="flex items-center justify-center h-64">
|
||||
            <div className="animate-spin rounded-full h-8 w-8 border-b-2 border-green-600"></div>
          </div>
        ) : !data || data.items.length === 0 ? (
          <div className="flex flex-col items-center justify-center h-64 text-gray-400">
            <ImageIcon className="w-12 h-12 mb-4" />
            <p>No images yet</p>
            <p className="text-sm mt-2">
              Start a scrape job to download images for this species
            </p>
          </div>
        ) : (
          <div className="grid grid-cols-2 sm:grid-cols-3 md:grid-cols-4 lg:grid-cols-5 gap-4">
            {data.items.map((image) => (
              <div
                key={image.id}
                className="group relative aspect-square bg-gray-100 rounded-lg overflow-hidden"
              >
                {image.local_path ? (
                  <img
                    src={`/api/images/${image.id}/file`}
                    alt={species.scientific_name}
                    className="w-full h-full object-cover"
                    loading="lazy"
                  />
                ) : (
                  <div className="w-full h-full flex items-center justify-center text-gray-400">
                    <ImageIcon className="w-8 h-8" />
                  </div>
                )}
                {/* Overlay with info */}
                <div className="absolute inset-0 bg-black/60 opacity-0 group-hover:opacity-100 transition-opacity flex flex-col justify-end p-2">
                  <div className="text-white text-xs">
                    <div className="flex items-center justify-between">
                      <span className="bg-white/20 px-1.5 py-0.5 rounded">
                        {image.source}
                      </span>
                      <span className="bg-white/20 px-1.5 py-0.5 rounded">
                        {image.license}
                      </span>
                    </div>
                    {image.width && image.height && (
                      <div className="mt-1 text-white/70">
                        {image.width} × {image.height}
                      </div>
                    )}
                  </div>
                  {image.url && (
                    <a
                      href={image.url}
                      target="_blank"
                      rel="noopener noreferrer"
                      className="absolute top-2 right-2 p-1 bg-white/20 rounded hover:bg-white/40"
                      onClick={(e) => e.stopPropagation()}
                    >
                      <ExternalLink className="w-4 h-4 text-white" />
                    </a>
                  )}
                </div>
              </div>
            ))}
          </div>
        )}
      </div>

      {/* Pagination */}
      {data && data.pages > 1 && (
        <div className="px-6 py-4 border-t flex items-center justify-between">
          <span className="text-sm text-gray-600">
            Showing {(page - 1) * pageSize + 1} to{' '}
            {Math.min(page * pageSize, data.total)} of {data.total}
          </span>
          <div className="flex gap-2">
            <button
              onClick={() => setPage((p) => Math.max(1, p - 1))}
              disabled={page === 1}
              className="p-2 rounded border disabled:opacity-50"
            >
              <ChevronLeft className="w-4 h-4" />
            </button>
            <span className="px-4 py-2">
              Page {page} of {data.pages}
            </span>
            <button
              onClick={() => setPage((p) => Math.min(data.pages, p + 1))}
              disabled={page === data.pages}
              className="p-2 rounded border disabled:opacity-50"
            >
              <ChevronRight className="w-4 h-4" />
            </button>
          </div>
        </div>
      )}
    </div>
  </div>
)
}

function ScrapeAllModal({ onClose }: { onClose: () => void }) {
  const [selectedSources, setSelectedSources] = useState<string[]>([])
  const [isSubmitting, setIsSubmitting] = useState(false)

  // Fetch count of species without images
  const { data: speciesData, isLoading } = useQuery({
    queryKey: ['species-no-images'],
    queryFn: () =>
      speciesApi.list({
        page: 1,
        page_size: 1,
        has_images: false,
      }).then((res) => res.data),
  })

  const sources = [
    { value: 'gbif', label: 'GBIF', description: 'Free biodiversity database, no API key needed' },
    { value: 'inaturalist', label: 'iNaturalist', description: 'Research-grade observations with CC licenses' },
    { value: 'wikimedia', label: 'Wikimedia Commons', description: 'Free media repository, requires OAuth' },
    { value: 'flickr', label: 'Flickr', description: 'Requires API key, CC-licensed photos' },
    { value: 'trefle', label: 'Trefle.io', description: 'Plant database, requires API key' },
    { value: 'duckduckgo', label: 'DuckDuckGo', description: 'Web image search, no API key needed' },
    { value: 'bing', label: 'Bing Image Search', description: 'Azure Cognitive Services, requires API key' },
  ]

  const toggleSource = (source: string) => {
    setSelectedSources((prev) =>
      prev.includes(source)
        ? prev.filter((s) => s !== source)
        : [...prev, source]
    )
  }

  const handleSubmit = async () => {
    if (selectedSources.length === 0) return

    setIsSubmitting(true)
    try {
      // Create a job for each selected source
      for (const source of selectedSources) {
        await jobsApi.create({
          name: `Scrape all species without images from ${source}`,
          source,
          only_without_images: true,
        })
      }
      alert(`Created ${selectedSources.length} scrape job(s)!`)
      onClose()
    } catch (error) {
      alert('Failed to create jobs')
    } finally {
      setIsSubmitting(false)
    }
  }

  const speciesCount = speciesData?.total ?? 0

  return (
    <div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50">
      <div className="bg-white rounded-lg p-6 w-full max-w-lg">
        <h2 className="text-xl font-bold mb-2">Scrape All Species Without Images</h2>
        {isLoading ? (
          <p className="text-gray-600 mb-4">Loading...</p>
        ) : (
          <p className="text-gray-600 mb-4">
            {speciesCount === 0 ? (
              'All species already have images!'
            ) : (
              <>
                <span className="font-semibold text-orange-600">{speciesCount}</span> species
                don't have any images yet. Select sources to scrape from:
              </>
            )}
          </p>
        )}

        {speciesCount > 0 && (
          <>
            <div className="space-y-2 mb-6">
              {sources.map((s) => (
                <label
                  key={s.value}
                  className={`flex items-start p-3 border rounded-lg cursor-pointer transition-colors ${
                    selectedSources.includes(s.value)
                      ? 'border-orange-500 bg-orange-50'
                      : 'hover:bg-gray-50'
                  }`}
                >
                  <input
                    type="checkbox"
                    checked={selectedSources.includes(s.value)}
                    onChange={() => toggleSource(s.value)}
                    className="mt-1 mr-3 rounded"
                  />
                  <div>
                    <div className="font-medium">{s.label}</div>
                    <div className="text-sm text-gray-500">{s.description}</div>
                  </div>
                </label>
              ))}
            </div>

            {selectedSources.length > 1 && (
              <div className="bg-blue-50 border border-blue-200 rounded-lg p-3 mb-4 text-sm text-blue-700">
                <strong>{selectedSources.length} jobs</strong> will be created and run in parallel,
                one for each selected source.
              </div>
            )}
          </>
        )}

        <div className="flex justify-end gap-2">
          <button
            onClick={onClose}
            className="px-4 py-2 border rounded-lg hover:bg-gray-50"
          >
            Cancel
          </button>
          {speciesCount > 0 && (
            <button
              onClick={handleSubmit}
              disabled={selectedSources.length === 0 || isSubmitting}
              className="px-4 py-2 bg-orange-600 text-white rounded-lg hover:bg-orange-700 disabled:opacity-50"
            >
              {isSubmitting
                ? 'Creating Jobs...'
                : `Start ${selectedSources.length || ''} Scrape Job${selectedSources.length !== 1 ? 's' : ''}`}
            </button>
          )}
        </div>
      </div>
    </div>
  )
}

function ScrapeFilteredModal({
  maxImages,
  speciesCount,
  onClose,
}: {
  maxImages: number
  speciesCount: number
  onClose: () => void
}) {
  const [selectedSources, setSelectedSources] = useState<string[]>([])
  const [isSubmitting, setIsSubmitting] = useState(false)

  const sources = [
    { value: 'gbif', label: 'GBIF', description: 'Free biodiversity database, no API key needed' },
    { value: 'inaturalist', label: 'iNaturalist', description: 'Research-grade observations with CC licenses' },
    { value: 'wikimedia', label: 'Wikimedia Commons', description: 'Free media repository, requires OAuth' },
    { value: 'flickr', label: 'Flickr', description: 'Requires API key, CC-licensed photos' },
    { value: 'trefle', label: 'Trefle.io', description: 'Plant database, requires API key' },
    { value: 'duckduckgo', label: 'DuckDuckGo', description: 'Web image search, no API key needed' },
    { value: 'bing', label: 'Bing Image Search', description: 'Azure Cognitive Services, requires API key' },
  ]

  const toggleSource = (source: string) => {
    setSelectedSources((prev) =>
      prev.includes(source)
        ? prev.filter((s) => s !== source)
        : [...prev, source]
    )
  }

  const handleSubmit = async () => {
    if (selectedSources.length === 0) return

    setIsSubmitting(true)
    try {
      for (const source of selectedSources) {
        await jobsApi.create({
          name: `Scrape species with <${maxImages} images from ${source}`,
          source,
          max_images: maxImages,
        })
      }
      alert(`Created ${selectedSources.length} scrape job(s)!`)
      onClose()
    } catch (error) {
      alert('Failed to create jobs')
    } finally {
      setIsSubmitting(false)
    }
  }

  return (
    <div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50">
      <div className="bg-white rounded-lg p-6 w-full max-w-lg">
        <h2 className="text-xl font-bold mb-2">Scrape All Filtered Species</h2>
        <p className="text-gray-600 mb-4">
          <span className="font-semibold text-purple-600">{speciesCount}</span> species
          have fewer than <span className="font-semibold">{maxImages}</span> images.
          Select sources to scrape from:
        </p>

        <div className="space-y-2 mb-6">
          {sources.map((s) => (
            <label
              key={s.value}
              className={`flex items-start p-3 border rounded-lg cursor-pointer transition-colors ${
                selectedSources.includes(s.value)
                  ? 'border-purple-500 bg-purple-50'
                  : 'hover:bg-gray-50'
              }`}
            >
              <input
                type="checkbox"
                checked={selectedSources.includes(s.value)}
                onChange={() => toggleSource(s.value)}
                className="mt-1 mr-3 rounded"
              />
              <div>
                <div className="font-medium">{s.label}</div>
                <div className="text-sm text-gray-500">{s.description}</div>
              </div>
            </label>
          ))}
        </div>

        {selectedSources.length > 1 && (
          <div className="bg-blue-50 border border-blue-200 rounded-lg p-3 mb-4 text-sm text-blue-700">
            <strong>{selectedSources.length} jobs</strong> will be created and run in parallel,
            one for each selected source.
          </div>
        )}

        <div className="flex justify-end gap-2">
          <button
            onClick={onClose}
            className="px-4 py-2 border rounded-lg hover:bg-gray-50"
          >
            Cancel
          </button>
          <button
            onClick={handleSubmit}
            disabled={selectedSources.length === 0 || isSubmitting}
            className="px-4 py-2 bg-purple-600 text-white rounded-lg hover:bg-purple-700 disabled:opacity-50"
          >
            {isSubmitting
              ? 'Creating Jobs...'
              : `Start ${selectedSources.length || ''} Scrape Job${selectedSources.length !== 1 ? 's' : ''}`}
          </button>
        </div>
      </div>
    </div>
  )
}
9
frontend/src/vite-env.d.ts
vendored
Normal file
@@ -0,0 +1,9 @@
/// <reference types="vite/client" />

interface ImportMetaEnv {
  readonly VITE_API_URL: string
}

interface ImportMeta {
  readonly env: ImportMetaEnv
}
11
frontend/tailwind.config.js
Normal file
@@ -0,0 +1,11 @@
/** @type {import('tailwindcss').Config} */
export default {
  content: [
    "./index.html",
    "./src/**/*.{js,ts,jsx,tsx}",
  ],
  theme: {
    extend: {},
  },
  plugins: [],
}
21
frontend/tsconfig.json
Normal file
@@ -0,0 +1,21 @@
{
  "compilerOptions": {
    "target": "ES2020",
    "useDefineForClassFields": true,
    "lib": ["ES2020", "DOM", "DOM.Iterable"],
    "module": "ESNext",
    "skipLibCheck": true,
    "moduleResolution": "bundler",
    "allowImportingTsExtensions": true,
    "resolveJsonModule": true,
    "isolatedModules": true,
    "noEmit": true,
    "jsx": "react-jsx",
    "strict": true,
    "noUnusedLocals": true,
    "noUnusedParameters": true,
    "noFallthroughCasesInSwitch": true
  },
  "include": ["src"],
  "references": [{ "path": "./tsconfig.node.json" }]
}
10
frontend/tsconfig.node.json
Normal file
@@ -0,0 +1,10 @@
{
  "compilerOptions": {
    "composite": true,
    "skipLibCheck": true,
    "module": "ESNext",
    "moduleResolution": "bundler",
    "allowSyntheticDefaultImports": true
  },
  "include": ["vite.config.ts"]
}
18
frontend/vite.config.ts
Normal file
@@ -0,0 +1,18 @@
import { defineConfig } from 'vite'
import react from '@vitejs/plugin-react'

export default defineConfig({
  plugins: [react()],
  server: {
    port: 3000,
    host: true,
    proxy: {
      '/api': {
        target: 'http://backend:8000',
        changeOrigin: true,
      },
    },
    // Disable HMR - not useful in Docker deployments
    hmr: false,
  },
})
18874
houseplants_list.json
Executable file
File diff suppressed because it is too large
58
nginx/nginx.conf
Normal file
@@ -0,0 +1,58 @@
events {
    worker_connections 1024;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    upstream backend {
        server backend:8000;
    }

    upstream frontend {
        server frontend:3000;
    }

    server {
        listen 80;
        server_name localhost;

        # API routes
        location /api {
            proxy_pass http://backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            # Increase timeouts for slow API calls
            proxy_connect_timeout 60s;
            proxy_send_timeout 60s;
            proxy_read_timeout 60s;
        }

        # Health check
        location /health {
            proxy_pass http://backend;
        }

        # WebSocket support for hot reload
        location /ws {
            proxy_pass http://frontend;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
        }

        # Frontend
        location / {
            proxy_pass http://frontend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
        }
    }
}