Initial commit — PlantGuideScraper project
20
.env.example
Normal file
@@ -0,0 +1,20 @@
# Database
DATABASE_URL=sqlite:////data/db/plants.sqlite

# Redis
REDIS_URL=redis://redis:6379/0

# Storage paths
IMAGES_PATH=/data/images
EXPORTS_PATH=/data/exports

# API Keys (user-provided)
FLICKR_API_KEY=
FLICKR_API_SECRET=
INATURALIST_APP_ID=
INATURALIST_APP_SECRET=
TREFLE_API_KEY=

# Optional settings
LOG_LEVEL=INFO
CELERY_CONCURRENCY=4
39
.gitignore
vendored
Normal file
@@ -0,0 +1,39 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
venv/
.venv/
ENV/
env/
.eggs/
*.egg-info/
*.egg

# Node
node_modules/
npm-debug.log
yarn-error.log

# IDE
.idea/
.vscode/
*.swp
*.swo
*~

# OS
.DS_Store
Thumbs.db

# Project specific
data/
*.sqlite
*.db
.env
*.zip

# Docker
docker-compose.override.yml
209
README.md
Normal file
@@ -0,0 +1,209 @@
# PlantGuideScraper

Web-based interface for managing a multi-source houseplant image scraping pipeline. Collects images from iNaturalist, Flickr, Wikimedia Commons, and Trefle.io to build datasets for CoreML training.

## Features

- **Species Management**: Import species lists via CSV or JSON, search and filter by genus or image status
- **Multi-Source Scraping**: iNaturalist/GBIF, Flickr, Wikimedia Commons, Trefle.io
- **Image Quality Pipeline**: Automatic deduplication, blur detection, resizing
- **License Filtering**: Only collect commercially-safe CC0/CC-BY licensed images
- **Export for CoreML**: Train/test split, Create ML-compatible folder structure
- **Real-time Dashboard**: Progress tracking, statistics, job monitoring

## Quick Start

```bash
# After cloning the repository, start the stack from the project root
cd PlantGuideScraper
docker-compose up --build

# Access the UI
open http://localhost
```

## Unraid Deployment

### Setup

1. Copy the project to your Unraid server:
   ```bash
   scp -r PlantGuideScraper root@YOUR_UNRAID_IP:/mnt/user/appdata/PlantGuideScraper
   ```

2. SSH into Unraid and create data directories:
   ```bash
   ssh root@YOUR_UNRAID_IP
   mkdir -p /mnt/user/appdata/PlantGuideScraper/{database,images,exports,redis}
   ```

3. Install **Docker Compose Manager** from Community Applications

4. In Unraid: **Docker → Compose → Add New Stack**
   - Path: `/mnt/user/appdata/PlantGuideScraper/docker-compose.unraid.yml`
   - Click **Compose Up**

5. Access at `http://YOUR_UNRAID_IP:8580`

### Configurable Paths

Edit `docker-compose.unraid.yml` to customize where data is stored. Look for these lines in both `backend` and `celery` services:

```yaml
# === CONFIGURABLE DATA PATHS ===
- /mnt/user/appdata/PlantGuideScraper/database:/data/db      # DATABASE_PATH
- /mnt/user/appdata/PlantGuideScraper/images:/data/images    # IMAGES_PATH
- /mnt/user/appdata/PlantGuideScraper/exports:/data/exports  # EXPORTS_PATH
```

| Path | Description | Default |
|------|-------------|---------|
| DATABASE_PATH | SQLite database file | `/mnt/user/appdata/PlantGuideScraper/database` |
| IMAGES_PATH | Downloaded images (can be 100GB+) | `/mnt/user/appdata/PlantGuideScraper/images` |
| EXPORTS_PATH | Generated export zip files | `/mnt/user/appdata/PlantGuideScraper/exports` |

**Example: Store images on a separate share:**
```yaml
- /mnt/user/data/PlantImages:/data/images  # IMAGES_PATH
```

**Important:** Keep paths identical in both `backend` and `celery` services.

## Configuration

1. Configure API keys in Settings:
   - **Flickr**: Get key at https://www.flickr.com/services/api/
   - **Trefle**: Get key at https://trefle.io/
   - iNaturalist and Wikimedia don't require keys

2. Import species list (see Import Documentation below)

3. Select species and start scraping

## Import Documentation

### CSV Import

Import species from a CSV file with the following columns:

| Column | Required | Description |
|--------|----------|-------------|
| `scientific_name` | Yes | Binomial name (e.g., "Monstera deliciosa") |
| `common_name` | No | Common name (e.g., "Swiss Cheese Plant") |
| `genus` | No | Auto-extracted from scientific_name if not provided |
| `family` | No | Plant family (e.g., "Araceae") |

**Example CSV:**
```csv
scientific_name,common_name,genus,family
Monstera deliciosa,Swiss Cheese Plant,Monstera,Araceae
Philodendron hederaceum,Heartleaf Philodendron,Philodendron,Araceae
Epipremnum aureum,Golden Pothos,Epipremnum,Araceae
```
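
As a rough illustration of the column rules above, the sketch below parses CSV text with Python's standard `csv` module and fills in `genus` from the first word of `scientific_name` when that column is empty. The helper name `parse_species_csv` and the returned dictionary shape are assumptions made for this example; the backend's real importer may behave differently.

```python
import csv
import io


def parse_species_csv(csv_text):
    """Hypothetical helper: turn CSV text into species dictionaries."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        name = (record.get("scientific_name") or "").strip()
        if not name:
            continue  # scientific_name is the only required column
        # Fall back to the first word of the binomial name when genus is empty.
        genus = (record.get("genus") or "").strip() or name.split()[0]
        rows.append({
            "scientific_name": name,
            "common_name": (record.get("common_name") or "").strip() or None,
            "genus": genus,
            "family": (record.get("family") or "").strip() or None,
        })
    return rows


sample = (
    "scientific_name,common_name,genus,family\n"
    "Monstera deliciosa,Swiss Cheese Plant,,Araceae\n"
    "Epipremnum aureum,Golden Pothos,Epipremnum,Araceae\n"
)
for row in parse_species_csv(sample):
    print(row["scientific_name"], "->", row["genus"])
```

Rows without a `scientific_name` are silently dropped here; a real importer would more likely report them in the `errors` list of the import response.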

### JSON Import

Import species from a JSON file with the following structure:

```json
{
  "plants": [
    {
      "scientific_name": "Monstera deliciosa",
      "common_names": ["Swiss Cheese Plant", "Split-leaf Philodendron"],
      "family": "Araceae"
    },
    {
      "scientific_name": "Philodendron hederaceum",
      "common_names": ["Heartleaf Philodendron"],
      "family": "Araceae"
    }
  ]
}
```

| Field | Required | Description |
|-------|----------|-------------|
| `scientific_name` | Yes | Binomial name |
| `common_names` | No | Array of common names (first one is used) |
| `family` | No | Plant family |

**Notes:**
- Genus is automatically extracted from the first word of `scientific_name`
- Duplicate species (by scientific_name) are skipped
- The included `houseplants_list.json` contains 2,278 houseplant species
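
The notes above can be sketched in a few lines of Python. This is a minimal illustration, not the backend's code: the function name `import_species_json`, the `existing_names` parameter, and the wording of the error strings are all assumptions. Only the three rules (first common name wins, genus is the first word, duplicates are skipped) and the `imported`/`skipped`/`errors` summary keys come from this README.

```python
import json


def import_species_json(json_text, existing_names=()):
    """Hypothetical sketch of the JSON import rules described above."""
    seen = set(existing_names)
    imported, skipped, errors = [], 0, []
    for index, plant in enumerate(json.loads(json_text).get("plants", [])):
        name = (plant.get("scientific_name") or "").strip()
        if not name:
            errors.append(f"plants[{index}]: missing scientific_name")
            continue
        if name in seen:
            skipped += 1  # duplicates (by scientific_name) are skipped
            continue
        seen.add(name)
        common_names = plant.get("common_names") or []
        imported.append({
            "scientific_name": name,
            "common_name": common_names[0] if common_names else None,
            "genus": name.split()[0],  # first word of scientific_name
            "family": plant.get("family"),
        })
    summary = {"imported": len(imported), "skipped": skipped, "errors": errors}
    return summary, imported


payload = json.dumps({"plants": [
    {"scientific_name": "Monstera deliciosa",
     "common_names": ["Swiss Cheese Plant", "Split-leaf Philodendron"],
     "family": "Araceae"},
    {"scientific_name": "Monstera deliciosa"},
]})
summary, species = import_species_json(payload)
print(summary)
```

Passing the names already stored in the database as `existing_names` would make a repeated import of the same file count every entry as skipped.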

### API Endpoints

```bash
# Import CSV
curl -X POST http://localhost/api/species/import \
  -F "file=@species.csv"

# Import JSON
curl -X POST http://localhost/api/species/import-json \
  -F "file=@plants.json"
```

**Response:**
```json
{
  "imported": 150,
  "skipped": 5,
  "errors": []
}
```

## Architecture

```
┌─────────────┐     ┌─────────────────┐     ┌─────────────┐
│   React     │────▶│    FastAPI      │────▶│   Celery    │
│  Frontend   │     │    Backend      │     │   Workers   │
└─────────────┘     └─────────────────┘     └─────────────┘
                             │                     │
                             ▼                     ▼
                      ┌─────────────┐       ┌─────────────┐
                      │   SQLite    │       │    Redis    │
                      │  Database   │       │    Queue    │
                      └─────────────┘       └─────────────┘
```

## Export Format

Exports are Create ML-compatible:

```
export.zip/
├── Training/
│   ├── Monstera_deliciosa/
│   │   ├── img_00001.jpg
│   │   └── ...
│   └── ...
└── Testing/
    ├── Monstera_deliciosa/
    └── ...
```
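
One way to produce this layout is to shuffle each species' images and cut the list at the train ratio. The sketch below only plans the target paths and does not copy files. The function name `plan_export` and the fixed random seed are assumptions for illustration; the 0.8 default mirrors the `train_split` default in the `exports` table of the initial migration.

```python
import random
from pathlib import PurePosixPath


def plan_export(images_by_species, train_split=0.8, seed=0):
    """Hypothetical sketch: map each source image to a Training/ or Testing/ path."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    plan = {}
    for species, images in sorted(images_by_species.items()):
        folder = species.replace(" ", "_")
        shuffled = sorted(images)
        rng.shuffle(shuffled)
        cutoff = int(len(shuffled) * train_split)
        for position, image in enumerate(shuffled):
            subset = "Training" if position < cutoff else "Testing"
            target = PurePosixPath(subset) / folder / f"img_{position + 1:05d}.jpg"
            plan[image] = str(target)
    return plan


demo_images = {"Monstera deliciosa": [f"{i}.jpg" for i in range(10)]}
demo_plan = plan_export(demo_images)
for source, target in sorted(demo_plan.items()):
    print(source, "->", target)
```

Splitting per species rather than over the whole dataset keeps every class represented in both folders, which matters when some species have far fewer images than others.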

## Data Storage

All data is stored in the `./data` directory:

```
data/
├── db/
│   └── plants.sqlite       # SQLite database
├── images/                 # Downloaded images
│   └── {species_id}/
│       └── {image_id}.jpg
└── exports/                # Generated export archives
    └── {export_id}.zip
```

## API Documentation

Full API docs available at http://localhost/api/docs

## License

MIT
231
accum_images.md
Normal file
@@ -0,0 +1,231 @@
# Houseplant Image Dataset Accumulation Plan

## Overview

Build a custom CoreML model for houseplant identification by accumulating a large dataset of houseplant images with proper licensing for commercial use.

---

## Requirements Summary

| Parameter | Value |
|-----------|-------|
| Target species | 5,000-10,000 (realistic houseplant ceiling) |
| Images per species | 200-500 (recommended) |
| Total images | ~1-5 million |
| Budget | Free preferred, paid as reference |
| Compute | M1 Max Mac (training) + Unraid server (data pipeline) |
| Curation | Automated pipeline |
| Timeline | Weeks-months |
| Licensing | Must allow training + commercial model distribution |
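
The "Total images" row follows from multiplying the two ranges above it. A quick check, using only the numbers in the table (the variable names are illustrative):

```python
species_range = (5_000, 10_000)   # "Target species" row
images_per_species = (200, 500)   # "Images per species" row

# Smallest case: fewest species with fewest images; largest case: the opposite.
low = species_range[0] * images_per_species[0]
high = species_range[1] * images_per_species[1]
print(f"{low:,} to {high:,} images")
```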

---

## Hardware Assessment

| Machine | Role | Capability |
|---------|------|------------|
| M1 Max Mac | **Training** | Create ML can train 5-10K class models; 32+ GB unified memory is ideal |
| Unraid Server | **Data pipeline** | Scraping, downloading, preprocessing, storage |

M1 Max is legitimately viable for this task via Create ML or PyTorch+MPS. No cloud GPU required.

---

## Data Sources Analysis

### Tier 1: Primary Sources (Recommended)

| Source | License | Commercial-Safe | Volume | Houseplant Coverage | Access Method |
|--------|---------|-----------------|--------|---------------------|---------------|
| **iNaturalist via GBIF** | CC-BY, CC0 (filter) | Yes (filtered) | 100M+ observations | Good (has "captive/cultivated" flag) | Bulk export + API |
| **Flickr** | CC-BY, CC0 (filter) | Yes (filtered) | Millions | Moderate | API |
| **Wikimedia Commons** | CC-BY, CC-BY-SA, Public Domain | Mostly | Thousands | Moderate | API |

### Tier 2: Supplemental Sources

| Source | License | Commercial-Safe | Notes |
|--------|---------|-----------------|-------|
| **USDA PLANTS** | Public Domain | Yes | US-focused, limited images |
| **Encyclopedia of Life** | Mixed | Check each | Aggregator, good metadata |
| **Pl@ntNet-300K Dataset** | CC-BY-SA | Share-alike | Good for research/prototyping only |

### Tier 3: Paid Options (Reference)

| Source | Estimated Cost | Notes |
|--------|----------------|-------|
| iNaturalist AWS Open Data | Free | Bulk image export, requires S3 costs for transfer |
| Custom scraping infrastructure | $50-200/mo | Proxies, storage, bandwidth |
| Commercial botanical databases | $1000s+ | Getty, Alamy; not recommended |

---

## Licensing Decision Matrix

```
Want commercial model distribution?
├─ YES → Use ONLY: CC0, CC-BY, Public Domain
│        Filter OUT: CC-BY-NC, CC-BY-SA, All Rights Reserved
│
└─ NO (research only) → Can use CC-BY-NC, CC-BY-SA
                        Pl@ntNet-300K dataset becomes viable
```

**Recommendation**: Filter for commercial-safe licenses from day 1. Avoids re-scraping later.
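
A download-time filter for the "YES" branch of the matrix could look like the sketch below. The allow-list comes straight from the matrix; the normalization rules (lowercasing, treating `_` as `-`, dropping version numbers such as `4.0`) are assumptions about how license strings might arrive, and each source's real format should be checked before relying on them.

```python
# Licenses the matrix allows for commercial model distribution.
COMMERCIAL_SAFE = {"cc0", "cc-by", "public domain"}


def normalize_license(raw):
    """Lowercase, unify separators, and drop version tokens (assumed input formats)."""
    text = raw.strip().lower().replace("_", "-")
    parts = [p for p in text.split() if not p.replace(".", "").isdigit()]
    return " ".join(parts)


def is_commercial_safe(raw):
    """True only for an exact match against the allow-list after normalization."""
    return normalize_license(raw) in COMMERCIAL_SAFE


for label in ["CC-BY 4.0", "CC0 1.0", "CC-BY-NC", "CC-BY-SA", "All Rights Reserved"]:
    print(f"{label}: {'keep' if is_commercial_safe(label) else 'reject'}")
```

An allow-list rejects anything unrecognized, so an unexpected license string causes a missed image rather than a licensing problem.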

---

## Houseplant Species Taxonomy

**Problem**: No canonical "houseplant" species list exists. Must construct one.

**Approach**:
1. Start with Wikipedia "List of houseplants" (~500 species)
2. Expand via genus crawl (all Philodendron, all Hoya, etc.)
3. Cross-reference with RHS, ASPCA, nursery catalogs
4. Target: **1,000-3,000 species** is realistic for quality dataset

**Key Genera** (prioritize these; they cover 80% of common houseplants):
```
Philodendron, Monstera, Pothos/Epipremnum, Ficus, Dracaena,
Sansevieria, Calathea, Maranta, Alocasia, Anthurium,
Peperomia, Hoya, Begonia, Tradescantia, Pilea,
Aglaonema, Dieffenbachia, Spathiphyllum, Zamioculcas, Crassula
```

---

## Data Quality Requirements

| Parameter | Minimum | Target | Rationale |
|-----------|---------|--------|-----------|
| Images per species | 100 | 300-500 | Below 100 = unreliable classification |
| Resolution | 256x256 | 512x512+ | Downsample to 224x224 for training |
| Variety | Single angle | Multi-angle, growth stages, lighting | Generalization |
| Label accuracy | 80% | 95%+ | iNaturalist "Research Grade" = community verified |
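
The "Images per species" thresholds lend themselves to an automated coverage report. The following is a minimal sketch, assuming the pipeline can produce one species label per accepted image; the function name `coverage_report` and the three status strings are invented for this example.

```python
from collections import Counter

MIN_IMAGES = 100     # "Minimum" column above
TARGET_IMAGES = 300  # lower end of the "Target" column above


def coverage_report(image_species_labels, all_species):
    """Hypothetical sketch: classify each species by how many images it has."""
    counts = Counter(image_species_labels)
    report = {}
    for species in all_species:
        n = counts.get(species, 0)
        if n < MIN_IMAGES:
            status = "under-represented"
        elif n < TARGET_IMAGES:
            status = "usable"
        else:
            status = "on target"
        report[species] = (n, status)
    return report


labels = ["Monstera deliciosa"] * 320 + ["Hoya carnosa"] * 120 + ["Pilea glauca"] * 5
wanted = ["Monstera deliciosa", "Hoya carnosa", "Pilea glauca", "Ficus lyrata"]
for species, (n, status) in coverage_report(labels, wanted).items():
    print(f"{species}: {n} images, {status}")
```

Iterating over the full species list rather than over the counted labels means species with zero images still show up in the report.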

---

## Training Approach Options

### Option A: Create ML (Recommended)

| Pros | Cons |
|------|------|
| Native Apple Silicon optimization | Limited hyperparameter control |
| Outputs CoreML directly | Max ~10K classes practical limit |
| No Python/ML expertise needed | Less flexible augmentation |
| Fast iteration | |

**Best for**: This use case exactly.

### Option B: PyTorch + MPS Transfer Learning

| Pros | Cons |
|------|------|
| Full control over architecture | Steeper learning curve |
| State-of-the-art augmentation (albumentations) | Manual CoreML conversion |
| Can use EfficientNet, ConvNeXt, etc. | Slower iteration |

**Best for**: If Create ML hits limits or you need custom architecture.

### Option C: Cloud GPU (Google Colab / AWS Spot)

| Pros | Cons |
|------|------|
| Faster training for large models | Cost |
| No local resource constraints | Network transfer overhead |

**Best for**: If dataset exceeds M1 Max memory or you want transformer-based vision models.

**Recommendation**: Start with Create ML. Pivot to Option B only if needed.

---

## Pipeline Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                          UNRAID SERVER                          │
├─────────────────────────────────────────────────────────────────┤
│  1. Species List Generator                                      │
│     └─ Scrape Wikipedia, RHS, expand by genus                   │
│                                                                 │
│  2. Image Downloader                                            │
│     ├─ iNaturalist/GBIF bulk export (primary)                   │
│     ├─ Flickr API (supplemental)                                │
│     └─ License filter (CC-BY, CC0 only)                         │
│                                                                 │
│  3. Preprocessing Pipeline                                      │
│     ├─ Resize to 512x512                                        │
│     ├─ Remove duplicates (perceptual hash)                      │
│     ├─ Remove low-quality (blur detection, size filter)         │
│     └─ Organize: /species_name/image_001.jpg                    │
│                                                                 │
│  4. Dataset Statistics                                          │
│     └─ Report per-species counts, flag under-represented        │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼ (rsync/SMB)
┌─────────────────────────────────────────────────────────────────┐
│                           M1 MAX MAC                            │
├─────────────────────────────────────────────────────────────────┤
│  5. Create ML Training                                          │
│     ├─ Import dataset folder                                    │
│     ├─ Train image classifier                                   │
│     └─ Export .mlmodel                                          │
│                                                                 │
│  6. Validation                                                  │
│     ├─ Test on held-out images                                  │
│     └─ Test on real-world photos (your phone)                   │
│                                                                 │
│  7. Integration                                                 │
│     └─ Replace PlantNet-300K in PlantGuide                      │
└─────────────────────────────────────────────────────────────────┘
```
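
The "Remove duplicates (perceptual hash)" step in stage 3 can be illustrated with a simplified average hash. This sketch is pure Python and assumes each image has already been reduced to a small grayscale grid (for example 8x8); the function names and the `max_distance` threshold are illustrative, and a production pipeline would more likely use an imaging library and a more robust hash.

```python
def average_hash(gray):
    """One bit per pixel: set when the pixel is brighter than the grid's mean."""
    flat = [value for row in gray for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def drop_near_duplicates(named_grids, max_distance=4):
    """Keep the first image of every group whose hashes differ by <= max_distance bits."""
    kept = []  # list of (name, hash) pairs
    for name, grid in named_grids:
        h = average_hash(grid)
        if all(hamming_distance(h, other) > max_distance for _, other in kept):
            kept.append((name, h))
    return [name for name, _ in kept]


gradient = [[r * 8 + c for c in range(8)] for r in range(8)]
brighter = [[v + 10 for v in row] for row in gradient]  # same picture, lighter exposure
inverted = [[63 - v for v in row] for row in gradient]  # a genuinely different picture
print(drop_near_duplicates([("a", gradient), ("b", brighter), ("c", inverted)]))
```

Because the hash compares each pixel with the image's own mean, a uniform brightness shift leaves it unchanged, which is why `b` is treated as a duplicate of `a`.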

---

## Timeline

| Phase | Duration | Output |
|-------|----------|--------|
| 1. Species list curation | 1 week | 1,000-3,000 target species with scientific + common names |
| 2. Pipeline development | 1-2 weeks | Automated scraper on Unraid |
| 3. Data collection | 2-4 weeks | Running 24/7, rate-limited by APIs |
| 4. Preprocessing + QA | 1 week | Clean dataset, statistics report |
| 5. Initial training | 2-3 days | First model with subset (500 species) |
| 6. Full training | 1 week | Full model, iteration |
| 7. Validation + tuning | 1 week | Production-ready model |

**Total: 6-10 weeks**

---

## Risk Analysis

| Risk | Likelihood | Mitigation |
|------|------------|------------|
| Insufficient images for rare species | High | Accept lower coverage OR merge to genus-level for rare species |
| API rate limits slow collection | High | Parallelize sources, use bulk exports, patience |
| Noisy labels degrade accuracy | Medium | Use only "Research Grade" iNaturalist, implement confidence thresholds |
| Create ML memory limits | Low | M1 Max should handle; fallback to PyTorch |
| License ambiguity | Low | Strict filter on download, keep metadata |
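
For the rate-limit risk, each source needs its requests spaced out. The sketch below is a minimal per-source limiter, assuming a requests-per-second setting like the `rate_limit_per_sec` column in the `api_keys` table. The class name and the injectable `clock` and `sleep` parameters are choices made for this example, not the project's implementation.

```python
import time


class RateLimiter:
    """Hypothetical sketch: space calls at least 1 / rate_per_sec seconds apart."""

    def __init__(self, rate_per_sec=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate_per_sec
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = clock()

    def wait(self):
        """Block until the next call is allowed; return the seconds slept."""
        now = self.clock()
        delay = max(0.0, self.next_allowed - now)
        if delay > 0:
            self.sleep(delay)
        # Schedule the following slot relative to when this call was allowed.
        self.next_allowed = max(now, self.next_allowed) + self.min_interval
        return delay


limiter = RateLimiter(rate_per_sec=50.0)
for request_number in range(3):
    waited = limiter.wait()
    print(f"request {request_number}: waited {waited:.3f}s")
```

Running one limiter per source lets the sources be scraped in parallel, as the mitigation column suggests, without any single API seeing more than its configured rate.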

---

## Next Steps

1. **Build species master list**: Python script to scrape/merge sources
2. **Set up GBIF bulk download**: Filter: Plantae, captive/cultivated, CC-BY/CC0, has images
3. **Build Flickr supplemental scraper**: Target under-represented species
4. **Docker container on Unraid**: Orchestrate pipeline
5. **Create ML project setup**: Folder structure, initial test with 50 species

---

## Open Questions

- Prioritize **speed** (start with 500 species, fast iteration) or **completeness** (build full 3K species list first)?
- Any specific houseplant species that must be included?
- Docker running on Unraid already?
24
backend/Dockerfile
Normal file
@@ -0,0 +1,24 @@
FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    libffi-dev \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Create data directories
RUN mkdir -p /data/db /data/images /data/exports

EXPOSE 8000

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
19
backend/add_indexes.py
Normal file
@@ -0,0 +1,19 @@
#!/usr/bin/env python
"""Add missing database indexes."""
from sqlalchemy import text
from app.database import engine

with engine.connect() as conn:
    # Single column indexes
    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_license ON images(license)'))
    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_status ON images(status)'))
    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_source ON images(source)'))
    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_species_id ON images(species_id)'))
    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_phash ON images(phash)'))

    # Composite indexes for common query patterns
    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_species_status ON images(species_id, status)'))
    conn.execute(text('CREATE INDEX IF NOT EXISTS ix_images_status_created ON images(status, created_at)'))

    conn.commit()
    print('All indexes created successfully')
42
backend/alembic.ini
Normal file
@@ -0,0 +1,42 @@
[alembic]
script_location = alembic
prepend_sys_path = .
version_path_separator = os

sqlalchemy.url = sqlite:////data/db/plants.sqlite

[post_write_hooks]

[loggers]
keys = root,sqlalchemy,alembic

[handlers]
keys = console

[formatters]
keys = generic

[logger_root]
level = WARN
handlers = console
qualname =

[logger_sqlalchemy]
level = WARN
handlers =
qualname = sqlalchemy.engine

[logger_alembic]
level = INFO
handlers =
qualname = alembic

[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = NOTSET
formatter = generic

[formatter_generic]
format = %(levelname)-5.5s [%(name)s] %(message)s
datefmt = %H:%M:%S
54
backend/alembic/env.py
Normal file
@@ -0,0 +1,54 @@
from logging.config import fileConfig

from sqlalchemy import engine_from_config
from sqlalchemy import pool

from alembic import context

# Import models for autogenerate
from app.database import Base
from app.models import Species, Image, Job, ApiKey, Export

config = context.config

if config.config_file_name is not None:
    fileConfig(config.config_file_name)

target_metadata = Base.metadata


def run_migrations_offline() -> None:
    """Run migrations in 'offline' mode."""
    url = config.get_main_option("sqlalchemy.url")
    context.configure(
        url=url,
        target_metadata=target_metadata,
        literal_binds=True,
        dialect_opts={"paramstyle": "named"},
    )

    with context.begin_transaction():
        context.run_migrations()


def run_migrations_online() -> None:
    """Run migrations in 'online' mode."""
    connectable = engine_from_config(
        config.get_section(config.config_ini_section, {}),
        prefix="sqlalchemy.",
        poolclass=pool.NullPool,
    )

    with connectable.connect() as connection:
        context.configure(
            connection=connection, target_metadata=target_metadata
        )

        with context.begin_transaction():
            context.run_migrations()


if context.is_offline_mode():
    run_migrations_offline()
else:
    run_migrations_online()
26
backend/alembic/script.py.mako
Normal file
@@ -0,0 +1,26 @@
"""${message}

Revision ID: ${up_revision}
Revises: ${down_revision | comma,n}
Create Date: ${create_date}

"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa
${imports if imports else ""}

# revision identifiers, used by Alembic.
revision: str = ${repr(up_revision)}
down_revision: Union[str, None] = ${repr(down_revision)}
branch_labels: Union[str, Sequence[str], None] = ${repr(branch_labels)}
depends_on: Union[str, Sequence[str], None] = ${repr(depends_on)}


def upgrade() -> None:
    ${upgrades if upgrades else "pass"}


def downgrade() -> None:
    ${downgrades if downgrades else "pass"}
112
backend/alembic/versions/001_initial.py
Normal file
@@ -0,0 +1,112 @@
"""Initial migration

Revision ID: 001
Revises:
Create Date: 2024-01-01

"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa

revision: str = '001'
down_revision: Union[str, None] = None
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


def upgrade() -> None:
    # Species table
    op.create_table(
        'species',
        sa.Column('id', sa.Integer(), primary_key=True),
        sa.Column('scientific_name', sa.String(), nullable=False, unique=True),
        sa.Column('common_name', sa.String(), nullable=True),
        sa.Column('genus', sa.String(), nullable=True),
        sa.Column('family', sa.String(), nullable=True),
        sa.Column('created_at', sa.DateTime(), server_default=sa.func.now()),
    )
    op.create_index('ix_species_scientific_name', 'species', ['scientific_name'])
    op.create_index('ix_species_genus', 'species', ['genus'])

    # API Keys table
    op.create_table(
        'api_keys',
        sa.Column('id', sa.Integer(), primary_key=True),
        sa.Column('source', sa.String(), nullable=False, unique=True),
        sa.Column('api_key', sa.String(), nullable=False),
        sa.Column('api_secret', sa.String(), nullable=True),
        sa.Column('rate_limit_per_sec', sa.Float(), default=1.0),
        sa.Column('enabled', sa.Boolean(), default=True),
    )

    # Images table
    op.create_table(
        'images',
        sa.Column('id', sa.Integer(), primary_key=True),
        sa.Column('species_id', sa.Integer(), sa.ForeignKey('species.id'), nullable=False),
        sa.Column('source', sa.String(), nullable=False),
        sa.Column('source_id', sa.String(), nullable=True),
        sa.Column('url', sa.String(), nullable=False),
        sa.Column('local_path', sa.String(), nullable=True),
        sa.Column('license', sa.String(), nullable=False),
        sa.Column('attribution', sa.String(), nullable=True),
        sa.Column('width', sa.Integer(), nullable=True),
        sa.Column('height', sa.Integer(), nullable=True),
        sa.Column('phash', sa.String(), nullable=True),
        sa.Column('quality_score', sa.Float(), nullable=True),
        sa.Column('status', sa.String(), default='pending'),
        sa.Column('created_at', sa.DateTime(), server_default=sa.func.now()),
        sa.UniqueConstraint('source', 'source_id', name='uq_source_source_id'),  # inline: SQLite cannot ALTER TABLE ADD CONSTRAINT
    )
    op.create_index('ix_images_species_id', 'images', ['species_id'])
    op.create_index('ix_images_source', 'images', ['source'])
    op.create_index('ix_images_status', 'images', ['status'])
    op.create_index('ix_images_phash', 'images', ['phash'])

    # Jobs table
    op.create_table(
        'jobs',
        sa.Column('id', sa.Integer(), primary_key=True),
        sa.Column('name', sa.String(), nullable=False),
        sa.Column('source', sa.String(), nullable=False),
        sa.Column('species_filter', sa.Text(), nullable=True),
        sa.Column('status', sa.String(), default='pending'),
        sa.Column('progress_current', sa.Integer(), default=0),
        sa.Column('progress_total', sa.Integer(), default=0),
        sa.Column('images_downloaded', sa.Integer(), default=0),
        sa.Column('images_rejected', sa.Integer(), default=0),
        sa.Column('celery_task_id', sa.String(), nullable=True),
        sa.Column('started_at', sa.DateTime(), nullable=True),
        sa.Column('completed_at', sa.DateTime(), nullable=True),
        sa.Column('error_message', sa.Text(), nullable=True),
        sa.Column('created_at', sa.DateTime(), server_default=sa.func.now()),
    )
    op.create_index('ix_jobs_status', 'jobs', ['status'])

    # Exports table
    op.create_table(
        'exports',
        sa.Column('id', sa.Integer(), primary_key=True),
        sa.Column('name', sa.String(), nullable=False),
        sa.Column('filter_criteria', sa.Text(), nullable=True),
        sa.Column('train_split', sa.Float(), default=0.8),
        sa.Column('status', sa.String(), default='pending'),
        sa.Column('file_path', sa.String(), nullable=True),
        sa.Column('file_size', sa.Integer(), nullable=True),
        sa.Column('species_count', sa.Integer(), nullable=True),
        sa.Column('image_count', sa.Integer(), nullable=True),
        sa.Column('celery_task_id', sa.String(), nullable=True),
        sa.Column('created_at', sa.DateTime(), server_default=sa.func.now()),
        sa.Column('completed_at', sa.DateTime(), nullable=True),
        sa.Column('error_message', sa.Text(), nullable=True),
    )


def downgrade() -> None:
    op.drop_table('exports')
    op.drop_table('jobs')
    op.drop_table('images')
    op.drop_table('api_keys')
    op.drop_table('species')
53
backend/alembic/versions/002_add_cached_stats_and_indexes.py
Normal file
@@ -0,0 +1,53 @@
"""Add cached_stats table, license index, and jobs.only_without_images column

Revision ID: 002
Revises: 001
Create Date: 2025-01-25

"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa

revision: str = '002'
down_revision: Union[str, None] = '001'
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


def upgrade() -> None:
    # Cached stats table for pre-calculated dashboard statistics
    op.create_table(
        'cached_stats',
        sa.Column('id', sa.Integer(), primary_key=True),
        sa.Column('key', sa.String(50), nullable=False, unique=True),
        sa.Column('value', sa.Text(), nullable=False),
        sa.Column('updated_at', sa.DateTime(), server_default=sa.func.now()),
    )
    op.create_index('ix_cached_stats_key', 'cached_stats', ['key'])

    # Add license index to images table (if not exists)
    # Best-effort: any error (e.g. the index already exists) is swallowed below
    try:
        op.create_index('ix_images_license', 'images', ['license'])
    except Exception:
        pass  # Index may already exist

    # Add only_without_images column to jobs if it doesn't exist
    try:
        op.add_column('jobs', sa.Column('only_without_images', sa.Boolean(), default=False))
    except Exception:
        pass  # Column may already exist


def downgrade() -> None:
    try:
        op.drop_index('ix_images_license', 'images')
    except Exception:
        pass
    try:
        op.drop_column('jobs', 'only_without_images')
    except Exception:
        pass
    op.drop_table('cached_stats')
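The try/except blocks in this migration treat "already exists" and every other failure the same way. A minimal sketch of an explicit existence check is shown below, assuming a plain SQLite connection and the standard-library `sqlite3` module rather than Alembic; the helper names `column_exists` and `index_exists` are hypothetical and are not part of this commit.

```python
import sqlite3


def column_exists(conn: sqlite3.Connection, table: str, column: str) -> bool:
    """Return True if `table` already has a column named `column`."""
    # PRAGMA table_info yields one row per column; position 1 is the column name.
    rows = conn.execute(f"PRAGMA table_info('{table}')").fetchall()
    return any(row[1] == column for row in rows)


def index_exists(conn: sqlite3.Connection, table: str, index: str) -> bool:
    """Return True if `table` already has an index named `index`."""
    # PRAGMA index_list yields one row per index; position 1 is the index name.
    rows = conn.execute(f"PRAGMA index_list('{table}')").fetchall()
    return any(row[1] == index for row in rows)


# Illustrative in-memory schema (an assumption, not the project's real tables).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("CREATE INDEX ix_jobs_status ON jobs (status)")

# Only alter the table when the column is really missing.
if not column_exists(conn, "jobs", "max_images"):
    conn.execute("ALTER TABLE jobs ADD COLUMN max_images INTEGER")
```

Checking first keeps genuine errors (a typo in a table name, a locked database) visible instead of hiding them behind a bare `pass`.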
31
backend/alembic/versions/003_add_job_max_images.py
Normal file
@@ -0,0 +1,31 @@
"""Add max_images column to jobs table

Revision ID: 003
Revises: 002
Create Date: 2025-01-25

"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa

revision: str = '003'
down_revision: Union[str, None] = '002'
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


def upgrade() -> None:
    # Add max_images column to jobs table
    try:
        op.add_column('jobs', sa.Column('max_images', sa.Integer(), nullable=True))
    except Exception:
        pass  # Column may already exist


def downgrade() -> None:
    try:
        op.drop_column('jobs', 'max_images')
    except Exception:
        pass
1
backend/app/__init__.py
Normal file
@@ -0,0 +1 @@
# PlantGuideScraper Backend
1
backend/app/api/__init__.py
Normal file
@@ -0,0 +1 @@
# API routes
175
backend/app/api/exports.py
Normal file
@@ -0,0 +1,175 @@
import json
import os
from typing import Optional

from fastapi import APIRouter, Depends, HTTPException, Query
from fastapi.responses import FileResponse
from sqlalchemy.orm import Session
from sqlalchemy import func

from app.database import get_db
from app.models import Export, Image, Species
from app.schemas.export import (
    ExportCreate,
    ExportResponse,
    ExportListResponse,
    ExportPreview,
)
from app.workers.export_tasks import generate_export

router = APIRouter()


@router.get("", response_model=ExportListResponse)
def list_exports(
    limit: int = Query(50, ge=1, le=200),
    db: Session = Depends(get_db),
):
    """List all exports."""
    total = db.query(Export).count()
    exports = db.query(Export).order_by(Export.created_at.desc()).limit(limit).all()

    return ExportListResponse(
        items=[ExportResponse.model_validate(e) for e in exports],
        total=total,
    )


@router.post("/preview", response_model=ExportPreview)
def preview_export(export: ExportCreate, db: Session = Depends(get_db)):
    """Preview export without creating it."""
    criteria = export.filter_criteria
    min_images = criteria.min_images_per_species

    # Build query
    query = db.query(Image).filter(Image.status == "downloaded")

    if criteria.licenses:
        query = query.filter(Image.license.in_(criteria.licenses))

    if criteria.min_quality:
        query = query.filter(Image.quality_score >= criteria.min_quality)

    if criteria.species_ids:
        query = query.filter(Image.species_id.in_(criteria.species_ids))

    # Count images per species
    species_counts = db.query(
        Image.species_id,
        func.count(Image.id).label("count")
    ).filter(Image.status == "downloaded")

    if criteria.licenses:
        species_counts = species_counts.filter(Image.license.in_(criteria.licenses))
    if criteria.min_quality:
        species_counts = species_counts.filter(Image.quality_score >= criteria.min_quality)
    if criteria.species_ids:
        species_counts = species_counts.filter(Image.species_id.in_(criteria.species_ids))

    species_counts = species_counts.group_by(Image.species_id).all()

    valid_species = [s for s in species_counts if s.count >= min_images]
    total_images = sum(s.count for s in valid_species)

    # Estimate file size (rough: 50KB per image)
    estimated_size_mb = (total_images * 50) / 1024

    return ExportPreview(
        species_count=len(valid_species),
        image_count=total_images,
        estimated_size_mb=estimated_size_mb,
    )


@router.post("", response_model=ExportResponse)
def create_export(export: ExportCreate, db: Session = Depends(get_db)):
    """Create and start a new export job."""
    db_export = Export(
        name=export.name,
        filter_criteria=export.filter_criteria.model_dump_json(),
        train_split=export.train_split,
        status="pending",
    )
    db.add(db_export)
    db.commit()
    db.refresh(db_export)

    # Start Celery task
    task = generate_export.delay(db_export.id)
    db_export.celery_task_id = task.id
    db.commit()

    return ExportResponse.model_validate(db_export)


@router.get("/{export_id}", response_model=ExportResponse)
def get_export(export_id: int, db: Session = Depends(get_db)):
    """Get export status."""
    export = db.query(Export).filter(Export.id == export_id).first()
    if not export:
        raise HTTPException(status_code=404, detail="Export not found")

    return ExportResponse.model_validate(export)


@router.get("/{export_id}/progress")
def get_export_progress(export_id: int, db: Session = Depends(get_db)):
    """Get real-time export progress."""
    from app.workers.celery_app import celery_app

    export = db.query(Export).filter(Export.id == export_id).first()
    if not export:
        raise HTTPException(status_code=404, detail="Export not found")

    if not export.celery_task_id:
        return {"status": export.status}

    result = celery_app.AsyncResult(export.celery_task_id)

    if result.state == "PROGRESS":
        meta = result.info
        return {
            "status": "generating",
            "current": meta.get("current", 0),
            "total": meta.get("total", 0),
            "current_species": meta.get("species", ""),
        }

    return {"status": export.status}


@router.get("/{export_id}/download")
def download_export(export_id: int, db: Session = Depends(get_db)):
    """Download export zip file."""
    export = db.query(Export).filter(Export.id == export_id).first()
    if not export:
        raise HTTPException(status_code=404, detail="Export not found")

    if export.status != "completed":
        raise HTTPException(status_code=400, detail="Export not ready")

    if not export.file_path or not os.path.exists(export.file_path):
        raise HTTPException(status_code=404, detail="Export file not found")

    return FileResponse(
        export.file_path,
        media_type="application/zip",
        filename=f"{export.name}.zip",
    )


@router.delete("/{export_id}")
def delete_export(export_id: int, db: Session = Depends(get_db)):
    """Delete an export and its file."""
    export = db.query(Export).filter(Export.id == export_id).first()
    if not export:
        raise HTTPException(status_code=404, detail="Export not found")

    # Delete file if exists
    if export.file_path and os.path.exists(export.file_path):
        os.remove(export.file_path)

    db.delete(export)
    db.commit()

    return {"status": "deleted"}
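The arithmetic inside `preview_export` can be checked without a database. The sketch below reproduces only the filtering and size estimate, assuming the per-species counts have already been fetched into a plain dict; `summarize_preview` and `KB_PER_IMAGE` are hypothetical names introduced for illustration, and the 50 KB figure is the same rough estimate used in the endpoint.

```python
from typing import Dict, Tuple

KB_PER_IMAGE = 50  # rough per-image size, mirroring the estimate in preview_export


def summarize_preview(counts: Dict[int, int], min_images: int) -> Tuple[int, int, float]:
    """Return (species_count, image_count, estimated_size_mb).

    `counts` maps a species id to its number of downloaded images.
    Species with fewer than `min_images` images are excluded entirely.
    """
    valid = {sid: n for sid, n in counts.items() if n >= min_images}
    total_images = sum(valid.values())
    estimated_size_mb = (total_images * KB_PER_IMAGE) / 1024
    return len(valid), total_images, estimated_size_mb


# Species 2 has only 30 images, so it is dropped when min_images is 50.
species_count, image_count, size_mb = summarize_preview(
    {1: 120, 2: 30, 3: 80}, min_images=50
)
```

Note that images belonging to excluded species do not count toward the total, which is why the preview can report fewer images than the filters alone would match.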
441
backend/app/api/images.py
Normal file
@@ -0,0 +1,441 @@
import os
import shutil
import uuid
from pathlib import Path
from typing import Optional, List

from fastapi import APIRouter, Depends, HTTPException, Query
from fastapi.responses import FileResponse
from sqlalchemy.orm import Session
from sqlalchemy import func
from PIL import Image as PILImage

from app.database import get_db
from app.models import Image, Species
from app.schemas.image import ImageResponse, ImageListResponse
from app.config import get_settings

router = APIRouter()
settings = get_settings()


@router.get("", response_model=ImageListResponse)
def list_images(
    page: int = Query(1, ge=1),
    page_size: int = Query(50, ge=1, le=200),
    species_id: Optional[int] = None,
    source: Optional[str] = None,
    license: Optional[str] = None,
    status: Optional[str] = None,
    min_quality: Optional[float] = None,
    search: Optional[str] = None,
    db: Session = Depends(get_db),
):
    """List images with pagination and filters."""
    # Use joinedload to fetch species in single query
    from sqlalchemy.orm import joinedload
    query = db.query(Image).options(joinedload(Image.species))

    if species_id:
        query = query.filter(Image.species_id == species_id)

    if source:
        query = query.filter(Image.source == source)

    if license:
        query = query.filter(Image.license == license)

    if status:
        query = query.filter(Image.status == status)

    if min_quality:
        query = query.filter(Image.quality_score >= min_quality)

    if search:
        search_term = f"%{search}%"
        query = query.join(Species).filter(
            (Species.scientific_name.ilike(search_term)) |
            (Species.common_name.ilike(search_term))
        )

    # Use faster count for simple queries
    if not search:
        # Build count query without join for better performance
        count_query = db.query(func.count(Image.id))
        if species_id:
            count_query = count_query.filter(Image.species_id == species_id)
        if source:
            count_query = count_query.filter(Image.source == source)
        if license:
            count_query = count_query.filter(Image.license == license)
        if status:
            count_query = count_query.filter(Image.status == status)
        if min_quality:
            count_query = count_query.filter(Image.quality_score >= min_quality)
        total = count_query.scalar()
    else:
        total = query.count()

    pages = (total + page_size - 1) // page_size

    images = query.order_by(Image.created_at.desc()).offset(
        (page - 1) * page_size
    ).limit(page_size).all()

    items = [
        ImageResponse(
            id=img.id,
            species_id=img.species_id,
            species_name=img.species.scientific_name if img.species else None,
            source=img.source,
            source_id=img.source_id,
            url=img.url,
            local_path=img.local_path,
            license=img.license,
            attribution=img.attribution,
            width=img.width,
            height=img.height,
            quality_score=img.quality_score,
            status=img.status,
            created_at=img.created_at,
        )
        for img in images
    ]

    return ImageListResponse(
        items=items,
        total=total,
        page=page,
        page_size=page_size,
        pages=pages,
    )


@router.get("/sources")
def list_sources(db: Session = Depends(get_db)):
    """List all unique image sources."""
    sources = db.query(Image.source).distinct().all()
    return [s[0] for s in sources]


@router.get("/licenses")
def list_licenses(db: Session = Depends(get_db)):
    """List all unique licenses."""
    licenses = db.query(Image.license).distinct().all()
    return [l[0] for l in licenses]


@router.post("/process-pending")
def process_pending_images(
    source: Optional[str] = None,
    db: Session = Depends(get_db),
):
    """Queue all pending images for download and processing."""
    from app.workers.quality_tasks import batch_process_pending_images

    query = db.query(func.count(Image.id)).filter(Image.status == "pending")
    if source:
        query = query.filter(Image.source == source)
    pending_count = query.scalar()

    task = batch_process_pending_images.delay(source=source)

    return {
        "pending_count": pending_count,
        "task_id": task.id,
    }


@router.get("/process-pending/status/{task_id}")
def process_pending_status(task_id: str):
    """Check status of a batch processing task."""
    from app.workers.celery_app import celery_app

    result = celery_app.AsyncResult(task_id)
    state = result.state  # PENDING, STARTED, PROGRESS, SUCCESS, FAILURE

    response = {"task_id": task_id, "state": state}

    if state == "PROGRESS" and isinstance(result.info, dict):
        response["queued"] = result.info.get("queued", 0)
        response["total"] = result.info.get("total", 0)
    elif state == "SUCCESS" and isinstance(result.result, dict):
        response["queued"] = result.result.get("queued", 0)
        response["total"] = result.result.get("total", 0)

    return response


@router.get("/{image_id}", response_model=ImageResponse)
def get_image(image_id: int, db: Session = Depends(get_db)):
    """Get an image by ID."""
    image = db.query(Image).filter(Image.id == image_id).first()
    if not image:
        raise HTTPException(status_code=404, detail="Image not found")

    return ImageResponse(
        id=image.id,
        species_id=image.species_id,
        species_name=image.species.scientific_name if image.species else None,
        source=image.source,
        source_id=image.source_id,
        url=image.url,
        local_path=image.local_path,
        license=image.license,
        attribution=image.attribution,
        width=image.width,
        height=image.height,
        quality_score=image.quality_score,
        status=image.status,
        created_at=image.created_at,
    )


@router.get("/{image_id}/file")
def get_image_file(image_id: int, db: Session = Depends(get_db)):
    """Get the actual image file."""
    image = db.query(Image).filter(Image.id == image_id).first()
    if not image:
        raise HTTPException(status_code=404, detail="Image not found")

    if not image.local_path or not os.path.exists(image.local_path):
        raise HTTPException(status_code=404, detail="Image file not available")

    return FileResponse(image.local_path)  # media type is inferred from the extension (imports may be PNG)


@router.delete("/{image_id}")
def delete_image(image_id: int, db: Session = Depends(get_db)):
    """Delete an image."""
    image = db.query(Image).filter(Image.id == image_id).first()
    if not image:
        raise HTTPException(status_code=404, detail="Image not found")

    # Delete file if exists
    if image.local_path:
        import os
        if os.path.exists(image.local_path):
            os.remove(image.local_path)

    db.delete(image)
    db.commit()

    return {"status": "deleted"}


@router.post("/bulk-delete")
def bulk_delete_images(
    image_ids: List[int],
    db: Session = Depends(get_db),
):
    """Delete multiple images."""
    import os

    images = db.query(Image).filter(Image.id.in_(image_ids)).all()

    deleted = 0
    for image in images:
        if image.local_path and os.path.exists(image.local_path):
            os.remove(image.local_path)
        db.delete(image)
        deleted += 1

    db.commit()

    return {"deleted": deleted}


@router.get("/import/scan")
def scan_imports(db: Session = Depends(get_db)):
    """Scan the imports folder and return what can be imported.

    Expected structure: imports/{source}/{species_name}/*.jpg
    """
    imports_path = Path(settings.imports_path)

    if not imports_path.exists():
        return {
            "available": False,
            "message": f"Imports folder not found: {imports_path}",
            "sources": [],
            "total_images": 0,
            "matched_species": 0,
            "unmatched_species": [],
        }

    results = {
        "available": True,
        "sources": [],
        "total_images": 0,
        "matched_species": 0,
        "unmatched_species": [],
    }

    # Get all species for matching
    species_map = {}
    for species in db.query(Species).all():
        # Map by scientific name with underscores and spaces
        species_map[species.scientific_name.lower()] = species
        species_map[species.scientific_name.replace(" ", "_").lower()] = species

    seen_unmatched = set()

    # Scan source folders
    for source_dir in imports_path.iterdir():
        if not source_dir.is_dir():
            continue

        source_name = source_dir.name
        source_info = {
            "name": source_name,
            "species_count": 0,
            "image_count": 0,
        }

        # Scan species folders within source
        for species_dir in source_dir.iterdir():
            if not species_dir.is_dir():
                continue

            species_name = species_dir.name.replace("_", " ")
            species_key = species_name.lower()

            # Count images
            image_files = list(species_dir.glob("*.jpg")) + \
                          list(species_dir.glob("*.jpeg")) + \
                          list(species_dir.glob("*.png"))

            if not image_files:
                continue

            source_info["image_count"] += len(image_files)
            results["total_images"] += len(image_files)

            if species_key in species_map or species_dir.name.lower() in species_map:
                source_info["species_count"] += 1
                results["matched_species"] += 1
            else:
                if species_name not in seen_unmatched:
                    seen_unmatched.add(species_name)
                    results["unmatched_species"].append(species_name)

        if source_info["image_count"] > 0:
            results["sources"].append(source_info)

    return results


@router.post("/import/run")
def run_import(
    move_files: bool = Query(False, description="Move files instead of copy"),
    db: Session = Depends(get_db),
):
    """Import images from the imports folder.

    Expected structure: imports/{source}/{species_name}/*.jpg
    Images are copied/moved to: images/{species_name}/{source}_{filename}
    """
    imports_path = Path(settings.imports_path)
    images_path = Path(settings.images_path)

    if not imports_path.exists():
        raise HTTPException(status_code=400, detail="Imports folder not found")

    # Get all species for matching
    species_map = {}
    for species in db.query(Species).all():
        species_map[species.scientific_name.lower()] = species
        species_map[species.scientific_name.replace(" ", "_").lower()] = species

    imported = 0
    skipped = 0
    errors = []

    # Scan source folders
    for source_dir in imports_path.iterdir():
        if not source_dir.is_dir():
            continue

        source_name = source_dir.name

        # Scan species folders within source
        for species_dir in source_dir.iterdir():
            if not species_dir.is_dir():
                continue

            species_name = species_dir.name.replace("_", " ")
            species_key = species_name.lower()

            # Find matching species
            species = species_map.get(species_key) or species_map.get(species_dir.name.lower())
            if not species:
                continue

            # Create target directory
            target_dir = images_path / species.scientific_name.replace(" ", "_")
            target_dir.mkdir(parents=True, exist_ok=True)

            # Process images
            image_files = list(species_dir.glob("*.jpg")) + \
                          list(species_dir.glob("*.jpeg")) + \
                          list(species_dir.glob("*.png"))

            for img_file in image_files:
                try:
                    # Generate unique filename
                    ext = img_file.suffix.lower()
                    if ext == ".jpeg":
                        ext = ".jpg"
                    new_filename = f"{source_name}_{img_file.stem}_{uuid.uuid4().hex[:8]}{ext}"
                    target_path = target_dir / new_filename

                    # Check if already imported (by original filename pattern)
                    existing = db.query(Image).filter(
                        Image.species_id == species.id,
                        Image.source == source_name,
                        Image.source_id == img_file.stem,
                    ).first()

                    if existing:
                        skipped += 1
                        continue

                    # Get image dimensions
                    try:
                        with PILImage.open(img_file) as pil_img:
                            width, height = pil_img.size
                    except Exception:
                        width, height = None, None

                    # Copy or move file
                    if move_files:
                        shutil.move(str(img_file), str(target_path))
                    else:
                        shutil.copy2(str(img_file), str(target_path))

                    # Create database record
                    image = Image(
                        species_id=species.id,
                        source=source_name,
                        source_id=img_file.stem,
                        url=f"file://{img_file}",
                        local_path=str(target_path),
                        license="unknown",
                        width=width,
                        height=height,
                        status="downloaded",
                    )
                    db.add(image)
                    imported += 1

                except Exception as e:
                    errors.append(f"{img_file}: {str(e)}")

            # Commit after each species to avoid large transactions
            db.commit()

    return {
        "imported": imported,
        "skipped": skipped,
        "errors": errors[:20],
    }
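Both `scan_imports` and `run_import` match a folder name against species by trying the name with spaces and with underscores, case-insensitively. A minimal standalone sketch of that lookup is shown below, assuming species are represented by their scientific names as plain strings instead of ORM rows; `build_species_lookup` and `match_folder` are hypothetical helper names, not functions from this commit.

```python
from typing import Dict, Iterable, Optional


def build_species_lookup(scientific_names: Iterable[str]) -> Dict[str, str]:
    """Map lower-cased 'genus species' and 'genus_species' keys to the canonical name."""
    lookup: Dict[str, str] = {}
    for name in scientific_names:
        lookup[name.lower()] = name
        lookup[name.replace(" ", "_").lower()] = name
    return lookup


def match_folder(folder_name: str, lookup: Dict[str, str]) -> Optional[str]:
    """Return the canonical species name for an import folder, or None if unmatched."""
    # Try the folder name with underscores turned into spaces, then as written.
    spaced_key = folder_name.replace("_", " ").lower()
    return lookup.get(spaced_key) or lookup.get(folder_name.lower())


lookup = build_species_lookup(["Monstera deliciosa", "Ficus lyrata"])
matched = match_folder("Monstera_deliciosa", lookup)
unmatched = match_folder("Unknown_plant", lookup)
```

Factoring the lookup out this way would also let the scan and run endpoints share one implementation instead of building the same map twice.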
173
backend/app/api/jobs.py
Normal file
@@ -0,0 +1,173 @@
|
|||||||
|
import json
|
||||||
|
from typing import Optional
|
||||||
|
|
||||||
|
from fastapi import APIRouter, Depends, HTTPException, Query
|
||||||
|
from sqlalchemy.orm import Session
|
||||||
|
|
||||||
|
from app.database import get_db
|
||||||
|
from app.models import Job
|
||||||
|
from app.schemas.job import JobCreate, JobResponse, JobListResponse
|
||||||
|
from app.workers.scrape_tasks import run_scrape_job
|
||||||
|
|
||||||
|
router = APIRouter()
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("", response_model=JobListResponse)
|
||||||
|
def list_jobs(
|
||||||
|
status: Optional[str] = None,
|
||||||
|
source: Optional[str] = None,
|
||||||
|
limit: int = Query(50, ge=1, le=200),
|
||||||
|
db: Session = Depends(get_db),
|
||||||
|
):
|
||||||
|
"""List all jobs."""
|
||||||
|
query = db.query(Job)
|
||||||
|
|
||||||
|
if status:
|
||||||
|
query = query.filter(Job.status == status)
|
||||||
|
|
||||||
|
if source:
|
||||||
|
query = query.filter(Job.source == source)
|
||||||
|
|
||||||
|
total = query.count()
|
||||||
|
jobs = query.order_by(Job.created_at.desc()).limit(limit).all()
|
||||||
|
|
||||||
|
return JobListResponse(
|
||||||
|
items=[JobResponse.model_validate(j) for j in jobs],
|
||||||
|
total=total,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@router.post("", response_model=JobResponse)
|
||||||
|
def create_job(job: JobCreate, db: Session = Depends(get_db)):
|
||||||
|
"""Create and start a new scrape job."""
|
||||||
|
species_filter = None
|
||||||
|
if job.species_ids:
|
||||||
|
species_filter = json.dumps(job.species_ids)
|
||||||
|
|
||||||
|
db_job = Job(
|
||||||
|
name=job.name,
|
||||||
|
source=job.source,
|
||||||
|
species_filter=species_filter,
|
||||||
|
only_without_images=job.only_without_images,
|
||||||
|
max_images=job.max_images,
|
||||||
|
status="pending",
|
||||||
|
)
|
||||||
|
db.add(db_job)
|
||||||
|
db.commit()
|
||||||
|
db.refresh(db_job)
|
||||||
|
|
||||||
|
# Start the Celery task
|
||||||
|
task = run_scrape_job.delay(db_job.id)
|
||||||
|
db_job.celery_task_id = task.id
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
return JobResponse.model_validate(db_job)
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/{job_id}", response_model=JobResponse)
|
||||||
|
def get_job(job_id: int, db: Session = Depends(get_db)):
|
||||||
|
"""Get job status."""
|
||||||
|
job = db.query(Job).filter(Job.id == job_id).first()
|
||||||
|
if not job:
|
||||||
|
raise HTTPException(status_code=404, detail="Job not found")
|
||||||
|
|
||||||
|
return JobResponse.model_validate(job)
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/{job_id}/progress")
|
||||||
|
def get_job_progress(job_id: int, db: Session = Depends(get_db)):
|
||||||
|
"""Get real-time job progress from Celery."""
|
||||||
|
from app.workers.celery_app import celery_app
|
||||||
|
|
||||||
|
job = db.query(Job).filter(Job.id == job_id).first()
|
||||||
|
if not job:
|
||||||
|
raise HTTPException(status_code=404, detail="Job not found")
|
||||||
|
|
||||||
|
if not job.celery_task_id:
|
||||||
|
return {
|
||||||
|
"status": job.status,
|
||||||
|
"progress_current": job.progress_current,
|
||||||
|
"progress_total": job.progress_total,
|
||||||
|
}
|
||||||
|
|
||||||
|
# Get Celery task state
|
||||||
|
result = celery_app.AsyncResult(job.celery_task_id)
|
||||||
|
|
||||||
|
if result.state == "PROGRESS":
|
||||||
|
meta = result.info
|
||||||
|
return {
|
||||||
|
"status": "running",
|
||||||
|
"progress_current": meta.get("current", 0),
|
||||||
|
"progress_total": meta.get("total", 0),
|
||||||
|
"current_species": meta.get("species", ""),
|
||||||
|
}
|
||||||
|
|
||||||
|
return {
|
||||||
|
"status": job.status,
|
||||||
|
"progress_current": job.progress_current,
|
||||||
|
"progress_total": job.progress_total,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@router.post("/{job_id}/pause")
|
||||||
|
def pause_job(job_id: int, db: Session = Depends(get_db)):
|
||||||
|
"""Pause a running job."""
|
||||||
|
from app.workers.celery_app import celery_app
|
||||||
|
|
||||||
|
job = db.query(Job).filter(Job.id == job_id).first()
|
||||||
|
if not job:
|
||||||
|
raise HTTPException(status_code=404, detail="Job not found")
|
||||||
|
|
||||||
|
if job.status != "running":
|
||||||
|
raise HTTPException(status_code=400, detail="Job is not running")
|
||||||
|
|
||||||
|
# Revoke Celery task
|
||||||
|
if job.celery_task_id:
|
||||||
|
celery_app.control.revoke(job.celery_task_id, terminate=True)
|
||||||
|
|
||||||
|
job.status = "paused"
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
return {"status": "paused"}
|
||||||
|
|
||||||
|
|
||||||
|
@router.post("/{job_id}/resume")
|
||||||
|
def resume_job(job_id: int, db: Session = Depends(get_db)):
|
||||||
|
"""Resume a paused job."""
|
||||||
|
job = db.query(Job).filter(Job.id == job_id).first()
|
||||||
|
if not job:
|
||||||
|
raise HTTPException(status_code=404, detail="Job not found")
|
||||||
|
|
||||||
|
if job.status != "paused":
|
||||||
|
raise HTTPException(status_code=400, detail="Job is not paused")
|
||||||
|
|
||||||
|
# Start new Celery task
|
||||||
|
task = run_scrape_job.delay(job.id)
|
||||||
|
job.celery_task_id = task.id
|
||||||
|
job.status = "pending"
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
return {"status": "resumed"}
|
||||||
|
|
||||||
|
|
||||||
|
@router.post("/{job_id}/cancel")
|
||||||
|
def cancel_job(job_id: int, db: Session = Depends(get_db)):
|
||||||
|
"""Cancel a job."""
|
||||||
|
from app.workers.celery_app import celery_app
|
||||||
|
|
||||||
|
job = db.query(Job).filter(Job.id == job_id).first()
|
||||||
|
if not job:
|
||||||
|
raise HTTPException(status_code=404, detail="Job not found")
|
||||||
|
|
||||||
|
if job.status in ["completed", "failed"]:
|
||||||
|
raise HTTPException(status_code=400, detail="Job already finished")
|
||||||
|
|
||||||
|
# Revoke Celery task
|
||||||
|
if job.celery_task_id:
|
||||||
|
celery_app.control.revoke(job.celery_task_id, terminate=True)
|
||||||
|
|
||||||
|
job.status = "failed"
|
||||||
|
job.error_message = "Cancelled by user"
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
return {"status": "cancelled"}
|
||||||
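The pause, resume, and cancel endpoints above each check the job's current status before changing it. A minimal sketch of those checks as a pure function, with no database or Celery involved; `FINISHED` and `transition` are hypothetical names introduced here for illustration and are not part of the project:

```python
# Hypothetical helper: mirrors the status checks in pause_job, resume_job and cancel_job.
FINISHED = {"completed", "failed"}


def transition(status: str, action: str) -> str:
    """Return the new status, or raise ValueError where the endpoint would return HTTP 400."""
    if action == "pause":
        if status != "running":
            raise ValueError("Job is not running")
        return "paused"
    if action == "resume":
        if status != "paused":
            raise ValueError("Job is not paused")
        return "pending"  # the resumed job is re-queued as a new Celery task
    if action == "cancel":
        if status in FINISHED:
            raise ValueError("Job already finished")
        return "failed"  # cancellation is stored as a failure with an error message
    raise ValueError(f"Unknown action: {action}")
```

Note that a cancelled job ends up with the same status as a job that crashed; only `error_message` ("Cancelled by user") tells them apart.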
198
backend/app/api/sources.py
Normal file
@@ -0,0 +1,198 @@
|
|||||||
|
from fastapi import APIRouter, Depends, HTTPException
|
||||||
|
from sqlalchemy.orm import Session
|
||||||
|
|
||||||
|
from app.database import get_db
|
||||||
|
from app.models import ApiKey
|
||||||
|
from app.schemas.api_key import ApiKeyCreate, ApiKeyUpdate, ApiKeyResponse
|
||||||
|
|
||||||
|
router = APIRouter()
|
||||||
|
|
||||||
|
# Available sources
|
||||||
|
# auth_type: "none" (no auth), "api_key" (single key), "api_key_secret" (key + secret), "oauth" (client_id + client_secret + access_token)
|
||||||
|
# default_rate: safe default requests per second for each API
|
||||||
|
AVAILABLE_SOURCES = [
|
||||||
|
{"name": "gbif", "label": "GBIF", "requires_secret": False, "auth_type": "none", "default_rate": 1.0}, # Free, no auth required
|
||||||
|
{"name": "inaturalist", "label": "iNaturalist", "requires_secret": True, "auth_type": "api_key_secret", "default_rate": 1.0}, # 60/min limit
|
||||||
|
{"name": "flickr", "label": "Flickr", "requires_secret": True, "auth_type": "api_key_secret", "default_rate": 0.5}, # 3600/hr shared limit
|
||||||
|
{"name": "wikimedia", "label": "Wikimedia Commons", "requires_secret": True, "auth_type": "oauth", "default_rate": 1.0}, # generous limits
|
||||||
|
{"name": "trefle", "label": "Trefle.io", "requires_secret": False, "auth_type": "api_key", "default_rate": 1.0}, # 120/min limit
|
||||||
|
{"name": "duckduckgo", "label": "DuckDuckGo", "requires_secret": False, "auth_type": "none", "default_rate": 0.5}, # Web search, no API key
|
||||||
|
{"name": "bing", "label": "Bing Image Search", "requires_secret": False, "auth_type": "api_key", "default_rate": 3.0}, # Azure Cognitive Services
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def mask_api_key(key: str) -> str:
    """Mask API key, showing only last 4 characters."""
    if not key or len(key) <= 4:
        return "****"
    return "*" * (len(key) - 4) + key[-4:]
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("")
|
||||||
|
def list_sources(db: Session = Depends(get_db)):
|
||||||
|
"""List all available sources with their configuration status."""
|
||||||
|
api_keys = {k.source: k for k in db.query(ApiKey).all()}
|
||||||
|
|
||||||
|
result = []
|
||||||
|
for source in AVAILABLE_SOURCES:
|
||||||
|
api_key = api_keys.get(source["name"])
|
||||||
|
default_rate = source.get("default_rate", 1.0)
|
||||||
|
result.append({
|
||||||
|
"name": source["name"],
|
||||||
|
"label": source["label"],
|
||||||
|
"requires_secret": source["requires_secret"],
|
||||||
|
"auth_type": source.get("auth_type", "api_key"),
|
||||||
|
"configured": api_key is not None,
|
||||||
|
"enabled": api_key.enabled if api_key else False,
|
||||||
|
"api_key_masked": mask_api_key(api_key.api_key) if api_key else None,
|
||||||
|
"has_secret": bool(api_key.api_secret) if api_key else False,
|
||||||
|
"has_access_token": bool(getattr(api_key, 'access_token', None)) if api_key else False,
|
||||||
|
"rate_limit_per_sec": api_key.rate_limit_per_sec if api_key else default_rate,
|
||||||
|
"default_rate": default_rate,
|
||||||
|
})
|
||||||
|
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/{source}")
|
||||||
|
def get_source(source: str, db: Session = Depends(get_db)):
|
||||||
|
"""Get source configuration."""
|
||||||
|
source_info = next((s for s in AVAILABLE_SOURCES if s["name"] == source), None)
|
||||||
|
if not source_info:
|
||||||
|
raise HTTPException(status_code=404, detail="Unknown source")
|
||||||
|
|
||||||
|
api_key = db.query(ApiKey).filter(ApiKey.source == source).first()
|
||||||
|
default_rate = source_info.get("default_rate", 1.0)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"name": source_info["name"],
|
||||||
|
"label": source_info["label"],
|
||||||
|
"requires_secret": source_info["requires_secret"],
|
||||||
|
"auth_type": source_info.get("auth_type", "api_key"),
|
||||||
|
"configured": api_key is not None,
|
||||||
|
"enabled": api_key.enabled if api_key else False,
|
||||||
|
"api_key_masked": mask_api_key(api_key.api_key) if api_key else None,
|
||||||
|
"has_secret": bool(api_key.api_secret) if api_key else False,
|
||||||
|
"has_access_token": bool(getattr(api_key, 'access_token', None)) if api_key else False,
|
||||||
|
"rate_limit_per_sec": api_key.rate_limit_per_sec if api_key else default_rate,
|
||||||
|
"default_rate": default_rate,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@router.put("/{source}")
|
||||||
|
def update_source(
|
||||||
|
source: str,
|
||||||
|
config: ApiKeyCreate,
|
||||||
|
db: Session = Depends(get_db),
|
||||||
|
):
|
||||||
|
"""Create or update source configuration."""
|
||||||
|
source_info = next((s for s in AVAILABLE_SOURCES if s["name"] == source), None)
|
||||||
|
if not source_info:
|
||||||
|
raise HTTPException(status_code=404, detail="Unknown source")
|
||||||
|
|
||||||
|
# For sources that require auth, validate api_key is provided
|
||||||
|
auth_type = source_info.get("auth_type", "api_key")
|
||||||
|
if auth_type != "none" and not config.api_key:
|
||||||
|
raise HTTPException(status_code=400, detail="API key is required for this source")
|
||||||
|
|
||||||
|
api_key = db.query(ApiKey).filter(ApiKey.source == source).first()
|
||||||
|
|
||||||
|
# Use placeholder for no-auth sources
|
||||||
|
api_key_value = config.api_key or "no-auth"
|
||||||
|
|
||||||
|
if api_key:
|
||||||
|
# Update existing
|
||||||
|
api_key.api_key = api_key_value
|
||||||
|
if config.api_secret:
|
||||||
|
api_key.api_secret = config.api_secret
|
||||||
|
if config.access_token:
|
||||||
|
api_key.access_token = config.access_token
|
||||||
|
api_key.rate_limit_per_sec = config.rate_limit_per_sec
|
||||||
|
api_key.enabled = config.enabled
|
||||||
|
else:
|
||||||
|
# Create new
|
||||||
|
api_key = ApiKey(
|
||||||
|
source=source,
|
||||||
|
api_key=api_key_value,
|
||||||
|
api_secret=config.api_secret,
|
||||||
|
access_token=config.access_token,
|
||||||
|
rate_limit_per_sec=config.rate_limit_per_sec,
|
||||||
|
enabled=config.enabled,
|
||||||
|
)
|
||||||
|
db.add(api_key)
|
||||||
|
|
||||||
|
db.commit()
|
||||||
|
db.refresh(api_key)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"name": source,
|
||||||
|
"configured": True,
|
||||||
|
"enabled": api_key.enabled,
|
||||||
|
"api_key_masked": mask_api_key(api_key.api_key) if auth_type != "none" else None,
|
||||||
|
"has_secret": bool(api_key.api_secret),
|
||||||
|
"has_access_token": bool(api_key.access_token),
|
||||||
|
"rate_limit_per_sec": api_key.rate_limit_per_sec,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@router.patch("/{source}")
|
||||||
|
def patch_source(
|
||||||
|
source: str,
|
||||||
|
config: ApiKeyUpdate,
|
||||||
|
db: Session = Depends(get_db),
|
||||||
|
):
|
||||||
|
"""Partially update source configuration."""
|
||||||
|
api_key = db.query(ApiKey).filter(ApiKey.source == source).first()
|
||||||
|
if not api_key:
|
||||||
|
raise HTTPException(status_code=404, detail="Source not configured")
|
||||||
|
|
||||||
|
update_data = config.model_dump(exclude_unset=True)
|
||||||
|
for field, value in update_data.items():
|
||||||
|
setattr(api_key, field, value)
|
||||||
|
|
||||||
|
db.commit()
|
||||||
|
db.refresh(api_key)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"name": source,
|
||||||
|
"configured": True,
|
||||||
|
"enabled": api_key.enabled,
|
||||||
|
"api_key_masked": mask_api_key(api_key.api_key),
|
||||||
|
"has_secret": bool(api_key.api_secret),
|
||||||
|
"has_access_token": bool(api_key.access_token),
|
||||||
|
"rate_limit_per_sec": api_key.rate_limit_per_sec,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@router.delete("/{source}")
|
||||||
|
def delete_source(source: str, db: Session = Depends(get_db)):
|
||||||
|
"""Delete source configuration."""
|
||||||
|
api_key = db.query(ApiKey).filter(ApiKey.source == source).first()
|
||||||
|
if not api_key:
|
||||||
|
raise HTTPException(status_code=404, detail="Source not configured")
|
||||||
|
|
||||||
|
db.delete(api_key)
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
return {"status": "deleted"}
|
||||||
|
|
||||||
|
|
||||||
|
@router.post("/{source}/test")
|
||||||
|
def test_source(source: str, db: Session = Depends(get_db)):
|
||||||
|
"""Test source API connection."""
|
||||||
|
api_key = db.query(ApiKey).filter(ApiKey.source == source).first()
|
||||||
|
if not api_key:
|
||||||
|
raise HTTPException(status_code=404, detail="Source not configured")
|
||||||
|
|
||||||
|
# Import and test the scraper
|
||||||
|
from app.scrapers import get_scraper
|
||||||
|
|
||||||
|
scraper = get_scraper(source)
|
||||||
|
if not scraper:
|
||||||
|
raise HTTPException(status_code=400, detail="No scraper for this source")
|
||||||
|
|
||||||
|
try:
|
||||||
|
result = scraper.test_connection(api_key)
|
||||||
|
return {"status": "success", "message": result}
|
||||||
|
except Exception as e:
|
||||||
|
return {"status": "error", "message": str(e)}
|
||||||
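A short usage sketch of `mask_api_key` from this file, repeated here so the snippet runs on its own. The key values are made up for illustration:

```python
def mask_api_key(key: str) -> str:
    """Mask API key, showing only last 4 characters."""
    if not key or len(key) <= 4:
        return "****"
    return "*" * (len(key) - 4) + key[-4:]


print(mask_api_key("abcd1234efgh5678"))  # 12 asterisks followed by "5678"
print(mask_api_key("abc"))               # keys of 4 characters or fewer are fully masked
print(mask_api_key("no-auth"))           # the no-auth placeholder is masked like any other value
```

Because the masked string has the same length as the key, it still reveals how long the key is; a fixed-width mask would avoid that if it matters.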
366
backend/app/api/species.py
Normal file
@@ -0,0 +1,366 @@
|
|||||||
|
import csv
|
||||||
|
import io
|
||||||
|
import json
|
||||||
|
from typing import Optional
|
||||||
|
|
||||||
|
from fastapi import APIRouter, Depends, HTTPException, Query, UploadFile, File
|
||||||
|
from sqlalchemy.orm import Session
|
||||||
|
from sqlalchemy import func, text
|
||||||
|
|
||||||
|
from app.database import get_db
|
||||||
|
from app.models import Species, Image
|
||||||
|
from app.schemas.species import (
|
||||||
|
SpeciesCreate,
|
||||||
|
SpeciesUpdate,
|
||||||
|
SpeciesResponse,
|
||||||
|
SpeciesListResponse,
|
||||||
|
SpeciesImportResponse,
|
||||||
|
)
|
||||||
|
|
||||||
|
router = APIRouter()
|
||||||
|
|
||||||
|
|
||||||
|
def get_species_with_count(db: Session, species: Species) -> SpeciesResponse:
|
||||||
|
"""Get species response with image count."""
|
||||||
|
image_count = db.query(func.count(Image.id)).filter(
|
||||||
|
Image.species_id == species.id,
|
||||||
|
Image.status == "downloaded"
|
||||||
|
).scalar()
|
||||||
|
|
||||||
|
return SpeciesResponse(
|
||||||
|
id=species.id,
|
||||||
|
scientific_name=species.scientific_name,
|
||||||
|
common_name=species.common_name,
|
||||||
|
genus=species.genus,
|
||||||
|
family=species.family,
|
||||||
|
created_at=species.created_at,
|
||||||
|
image_count=image_count or 0,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("", response_model=SpeciesListResponse)
|
||||||
|
def list_species(
|
||||||
|
page: int = Query(1, ge=1),
|
||||||
|
page_size: int = Query(50, ge=1, le=500),
|
||||||
|
search: Optional[str] = None,
|
||||||
|
genus: Optional[str] = None,
|
||||||
|
has_images: Optional[bool] = None,
|
||||||
|
max_images: Optional[int] = Query(None, description="Filter species with fewer than N downloaded images"),
|
||||||
|
min_images: Optional[int] = Query(None, description="Filter species with at least N downloaded images"),
|
||||||
|
db: Session = Depends(get_db),
|
||||||
|
):
|
||||||
|
"""List species with pagination and filters.
|
||||||
|
|
||||||
|
Filters:
|
||||||
|
- search: Search by scientific or common name
|
||||||
|
- genus: Filter by genus
|
||||||
|
- has_images: True for species with images, False for species without
|
||||||
|
- max_images: Filter species with fewer than N downloaded images
|
||||||
|
- min_images: Filter species with at least N downloaded images
|
||||||
|
"""
|
||||||
|
# If filtering by image count, we need to use a subquery approach
|
||||||
|
if max_images is not None or min_images is not None:
|
||||||
|
# Build a subquery with image counts per species
|
||||||
|
image_counts = (
|
||||||
|
db.query(
|
||||||
|
Species.id.label("species_id"),
|
||||||
|
func.count(Image.id).label("img_count")
|
||||||
|
)
|
||||||
|
.outerjoin(Image, (Image.species_id == Species.id) & (Image.status == "downloaded"))
|
||||||
|
.group_by(Species.id)
|
||||||
|
.subquery()
|
||||||
|
)
|
||||||
|
|
||||||
|
# Join species with their counts
|
||||||
|
query = db.query(Species).join(
|
||||||
|
image_counts, Species.id == image_counts.c.species_id
|
||||||
|
)
|
||||||
|
|
||||||
|
if max_images is not None:
|
||||||
|
query = query.filter(image_counts.c.img_count < max_images)
|
||||||
|
|
||||||
|
if min_images is not None:
|
||||||
|
query = query.filter(image_counts.c.img_count >= min_images)
|
||||||
|
else:
|
||||||
|
query = db.query(Species)
|
||||||
|
|
||||||
|
if search:
|
||||||
|
search_term = f"%{search}%"
|
||||||
|
query = query.filter(
|
||||||
|
(Species.scientific_name.ilike(search_term)) |
|
||||||
|
(Species.common_name.ilike(search_term))
|
||||||
|
)
|
||||||
|
|
||||||
|
if genus:
|
||||||
|
query = query.filter(Species.genus == genus)
|
||||||
|
|
||||||
|
# Filter by whether species has downloaded images (only if not using min/max filters)
|
||||||
|
if has_images is not None and max_images is None and min_images is None:
|
||||||
|
# Get IDs of species that have at least one downloaded image
|
||||||
|
species_with_images = (
|
||||||
|
db.query(Image.species_id)
|
||||||
|
.filter(Image.status == "downloaded")
|
||||||
|
.distinct()
|
||||||
|
.subquery()
|
||||||
|
)
|
||||||
|
if has_images:
|
||||||
|
query = query.filter(Species.id.in_(db.query(species_with_images.c.species_id)))
|
||||||
|
else:
|
||||||
|
query = query.filter(~Species.id.in_(db.query(species_with_images.c.species_id)))
|
||||||
|
|
||||||
|
total = query.count()
|
||||||
|
pages = (total + page_size - 1) // page_size
|
||||||
|
|
||||||
|
species_list = query.order_by(Species.scientific_name).offset(
|
||||||
|
(page - 1) * page_size
|
||||||
|
).limit(page_size).all()
|
||||||
|
|
||||||
|
# Fetch image counts in bulk for all species on this page
|
||||||
|
species_ids = [s.id for s in species_list]
|
||||||
|
if species_ids:
|
||||||
|
count_query = db.query(
|
||||||
|
Image.species_id,
|
||||||
|
func.count(Image.id)
|
||||||
|
).filter(
|
||||||
|
Image.species_id.in_(species_ids),
|
||||||
|
Image.status == "downloaded"
|
||||||
|
).group_by(Image.species_id).all()
|
||||||
|
count_map = {species_id: count for species_id, count in count_query}
|
||||||
|
else:
|
||||||
|
count_map = {}
|
||||||
|
|
||||||
|
items = [
|
||||||
|
SpeciesResponse(
|
||||||
|
id=s.id,
|
||||||
|
scientific_name=s.scientific_name,
|
||||||
|
common_name=s.common_name,
|
||||||
|
genus=s.genus,
|
||||||
|
family=s.family,
|
||||||
|
created_at=s.created_at,
|
||||||
|
image_count=count_map.get(s.id, 0),
|
||||||
|
)
|
||||||
|
for s in species_list
|
||||||
|
]
|
||||||
|
|
||||||
|
return SpeciesListResponse(
|
||||||
|
items=items,
|
||||||
|
total=total,
|
||||||
|
page=page,
|
||||||
|
page_size=page_size,
|
||||||
|
pages=pages,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@router.post("", response_model=SpeciesResponse)
|
||||||
|
def create_species(species: SpeciesCreate, db: Session = Depends(get_db)):
|
||||||
|
"""Create a new species."""
|
||||||
|
existing = db.query(Species).filter(
|
||||||
|
Species.scientific_name == species.scientific_name
|
||||||
|
).first()
|
||||||
|
|
||||||
|
if existing:
|
||||||
|
raise HTTPException(status_code=400, detail="Species already exists")
|
||||||
|
|
||||||
|
# Auto-extract genus from scientific name if not provided
|
||||||
|
genus = species.genus
|
||||||
|
if not genus and " " in species.scientific_name:
|
||||||
|
genus = species.scientific_name.split()[0]
|
||||||
|
|
||||||
|
db_species = Species(
|
||||||
|
scientific_name=species.scientific_name,
|
||||||
|
common_name=species.common_name,
|
||||||
|
genus=genus,
|
||||||
|
family=species.family,
|
||||||
|
)
|
||||||
|
db.add(db_species)
|
||||||
|
db.commit()
|
||||||
|
db.refresh(db_species)
|
||||||
|
|
||||||
|
return get_species_with_count(db, db_species)
|
||||||
|
|
||||||
|
|
||||||
|
@router.post("/import", response_model=SpeciesImportResponse)
|
||||||
|
async def import_species(
|
||||||
|
file: UploadFile = File(...),
|
||||||
|
db: Session = Depends(get_db),
|
||||||
|
):
|
||||||
|
"""Import species from CSV file.
|
||||||
|
|
||||||
|
Expected columns: scientific_name, common_name (optional), genus (optional), family (optional)
|
||||||
|
"""
|
||||||
|
if not file.filename.endswith(".csv"):
|
||||||
|
raise HTTPException(status_code=400, detail="File must be a CSV file")
|
||||||
|
|
||||||
|
content = await file.read()
|
||||||
|
# Named csv_text so it does not shadow sqlalchemy's `text` imported at module level
csv_text = content.decode("utf-8")

reader = csv.DictReader(io.StringIO(csv_text))
|
||||||
|
|
||||||
|
imported = 0
|
||||||
|
skipped = 0
|
||||||
|
errors = []
|
||||||
|
|
||||||
|
for row_num, row in enumerate(reader, start=2):
|
||||||
|
scientific_name = row.get("scientific_name", "").strip()
|
||||||
|
if not scientific_name:
|
||||||
|
errors.append(f"Row {row_num}: Missing scientific_name")
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Check if already exists
|
||||||
|
existing = db.query(Species).filter(
|
||||||
|
Species.scientific_name == scientific_name
|
||||||
|
).first()
|
||||||
|
|
||||||
|
if existing:
|
||||||
|
skipped += 1
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Auto-extract genus if not provided
|
||||||
|
genus = row.get("genus", "").strip()
|
||||||
|
if not genus and " " in scientific_name:
|
||||||
|
genus = scientific_name.split()[0]
|
||||||
|
|
||||||
|
try:
|
||||||
|
species = Species(
|
||||||
|
scientific_name=scientific_name,
|
||||||
|
common_name=row.get("common_name", "").strip() or None,
|
||||||
|
genus=genus or None,
|
||||||
|
family=row.get("family", "").strip() or None,
|
||||||
|
)
|
||||||
|
db.add(species)
|
||||||
|
imported += 1
|
||||||
|
except Exception as e:
|
||||||
|
errors.append(f"Row {row_num}: {str(e)}")
|
||||||
|
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
return SpeciesImportResponse(
|
||||||
|
imported=imported,
|
||||||
|
skipped=skipped,
|
||||||
|
errors=errors[:10], # Limit error messages
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@router.post("/import-json", response_model=SpeciesImportResponse)
|
||||||
|
async def import_species_json(
|
||||||
|
file: UploadFile = File(...),
|
||||||
|
db: Session = Depends(get_db),
|
||||||
|
):
|
||||||
|
"""Import species from JSON file.
|
||||||
|
|
||||||
|
Expected format: {"plants": [{"scientific_name": "...", "common_names": [...], "family": "..."}]}
|
||||||
|
"""
|
||||||
|
if not file.filename.endswith(".json"):
|
||||||
|
raise HTTPException(status_code=400, detail="File must be a JSON file")
|
||||||
|
|
||||||
|
content = await file.read()
|
||||||
|
try:
|
||||||
|
data = json.loads(content.decode("utf-8"))
|
||||||
|
except json.JSONDecodeError as e:
|
||||||
|
raise HTTPException(status_code=400, detail=f"Invalid JSON: {e}")
|
||||||
|
|
||||||
|
plants = data.get("plants", [])
|
||||||
|
if not plants:
|
||||||
|
raise HTTPException(status_code=400, detail="No plants found in JSON")
|
||||||
|
|
||||||
|
imported = 0
|
||||||
|
skipped = 0
|
||||||
|
errors = []
|
||||||
|
|
||||||
|
for idx, plant in enumerate(plants):
|
||||||
|
scientific_name = plant.get("scientific_name", "").strip()
|
||||||
|
if not scientific_name:
|
||||||
|
errors.append(f"Plant {idx}: Missing scientific_name")
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Check if already exists
|
||||||
|
existing = db.query(Species).filter(
|
||||||
|
Species.scientific_name == scientific_name
|
||||||
|
).first()
|
||||||
|
|
||||||
|
if existing:
|
||||||
|
skipped += 1
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Auto-extract genus from scientific name
|
||||||
|
genus = None
|
||||||
|
if " " in scientific_name:
|
||||||
|
genus = scientific_name.split()[0]
|
||||||
|
|
||||||
|
# Get first common name if array provided
|
||||||
|
common_names = plant.get("common_names", [])
|
||||||
|
common_name = common_names[0] if common_names else None
|
||||||
|
|
||||||
|
try:
|
||||||
|
species = Species(
|
||||||
|
scientific_name=scientific_name,
|
||||||
|
common_name=common_name,
|
||||||
|
genus=genus,
|
||||||
|
family=plant.get("family"),
|
||||||
|
)
|
||||||
|
db.add(species)
|
||||||
|
imported += 1
|
||||||
|
except Exception as e:
|
||||||
|
errors.append(f"Plant {idx}: {str(e)}")
|
||||||
|
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
return SpeciesImportResponse(
|
||||||
|
imported=imported,
|
||||||
|
skipped=skipped,
|
||||||
|
errors=errors[:10],
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/{species_id}", response_model=SpeciesResponse)
|
||||||
|
def get_species(species_id: int, db: Session = Depends(get_db)):
|
||||||
|
"""Get a species by ID."""
|
||||||
|
species = db.query(Species).filter(Species.id == species_id).first()
|
||||||
|
if not species:
|
||||||
|
raise HTTPException(status_code=404, detail="Species not found")
|
||||||
|
|
||||||
|
return get_species_with_count(db, species)
|
||||||
|
|
||||||
|
|
||||||
|
@router.put("/{species_id}", response_model=SpeciesResponse)
|
||||||
|
def update_species(
|
||||||
|
species_id: int,
|
||||||
|
species_update: SpeciesUpdate,
|
||||||
|
db: Session = Depends(get_db),
|
||||||
|
):
|
||||||
|
"""Update a species."""
|
||||||
|
species = db.query(Species).filter(Species.id == species_id).first()
|
||||||
|
if not species:
|
||||||
|
raise HTTPException(status_code=404, detail="Species not found")
|
||||||
|
|
||||||
|
update_data = species_update.model_dump(exclude_unset=True)
|
||||||
|
for field, value in update_data.items():
|
||||||
|
setattr(species, field, value)
|
||||||
|
|
||||||
|
db.commit()
|
||||||
|
db.refresh(species)
|
||||||
|
|
||||||
|
return get_species_with_count(db, species)
|
||||||
|
|
||||||
|
|
||||||
|
@router.delete("/{species_id}")
|
||||||
|
def delete_species(species_id: int, db: Session = Depends(get_db)):
|
||||||
|
"""Delete a species and all its images."""
|
||||||
|
species = db.query(Species).filter(Species.id == species_id).first()
|
||||||
|
if not species:
|
||||||
|
raise HTTPException(status_code=404, detail="Species not found")
|
||||||
|
|
||||||
|
db.delete(species)
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
return {"status": "deleted"}
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/genera/list")
|
||||||
|
def list_genera(db: Session = Depends(get_db)):
|
||||||
|
"""List all unique genera."""
|
||||||
|
genera = db.query(Species.genus).filter(
|
||||||
|
Species.genus.isnot(None)
|
||||||
|
).distinct().order_by(Species.genus).all()
|
||||||
|
|
||||||
|
return [g[0] for g in genera]
|
||||||
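Both import endpoints check each name against the database before adding it, but rows added earlier in the same upload are not committed until the loop ends. Since the session is created with `autoflush=False`, that check may not see them, so a name repeated within one file could be added twice. A minimal sketch of one way to guard against this; `split_new_names` is a hypothetical helper, not part of the project, and it takes the already-stored names as a plain set instead of querying the database:

```python
from typing import Iterable, List, Set, Tuple


def split_new_names(names: Iterable[str], existing: Set[str]) -> Tuple[List[str], int]:
    """Return (names to import, number skipped), treating repeats within the batch as skips."""
    seen = set(existing)
    to_import: List[str] = []
    skipped = 0
    for raw in names:
        name = raw.strip()
        if not name:
            continue  # missing names are reported as errors by the endpoint, not counted here
        if name in seen:
            skipped += 1
            continue
        seen.add(name)
        to_import.append(name)
    return to_import, skipped
```

For example, `split_new_names(["Monstera deliciosa", "Ficus lyrata", "Monstera deliciosa "], {"Ficus lyrata"})` keeps one `Monstera deliciosa` and counts the other two entries as skipped.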
190
backend/app/api/stats.py
Normal file
@@ -0,0 +1,190 @@
|
|||||||
|
import json
|
||||||
|
|
||||||
|
from fastapi import APIRouter, Depends, HTTPException
|
||||||
|
from sqlalchemy.orm import Session
|
||||||
|
from sqlalchemy import func, case
|
||||||
|
|
||||||
|
from app.database import get_db
|
||||||
|
from app.models import Species, Image, Job
|
||||||
|
from app.models.cached_stats import CachedStats
|
||||||
|
from app.schemas.stats import StatsResponse, SourceStats, LicenseStats, SpeciesStats, JobStats
|
||||||
|
|
||||||
|
router = APIRouter()
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("", response_model=StatsResponse)
|
||||||
|
def get_stats(db: Session = Depends(get_db)):
|
||||||
|
"""Get dashboard statistics from cache (updated every 60s by Celery)."""
|
||||||
|
# Try to get cached stats
|
||||||
|
cached = db.query(CachedStats).filter(CachedStats.key == "dashboard_stats").first()
|
||||||
|
|
||||||
|
if cached:
|
||||||
|
data = json.loads(cached.value)
|
||||||
|
return StatsResponse(
|
||||||
|
total_species=data["total_species"],
|
||||||
|
total_images=data["total_images"],
|
||||||
|
images_downloaded=data["images_downloaded"],
|
||||||
|
images_pending=data["images_pending"],
|
||||||
|
images_rejected=data["images_rejected"],
|
||||||
|
disk_usage_mb=data["disk_usage_mb"],
|
||||||
|
sources=[SourceStats(**s) for s in data["sources"]],
|
||||||
|
licenses=[LicenseStats(**l) for l in data["licenses"]],
|
||||||
|
jobs=JobStats(**data["jobs"]),
|
||||||
|
top_species=[SpeciesStats(**s) for s in data["top_species"]],
|
||||||
|
under_represented=[SpeciesStats(**s) for s in data["under_represented"]],
|
||||||
|
)
|
||||||
|
|
||||||
|
# No cache yet - return empty stats (Celery will populate soon)
|
||||||
|
# This only happens on first startup before Celery runs
|
||||||
|
return StatsResponse(
|
||||||
|
total_species=0,
|
||||||
|
total_images=0,
|
||||||
|
images_downloaded=0,
|
||||||
|
images_pending=0,
|
||||||
|
images_rejected=0,
|
||||||
|
disk_usage_mb=0.0,
|
||||||
|
sources=[],
|
||||||
|
licenses=[],
|
||||||
|
jobs=JobStats(running=0, pending=0, completed=0, failed=0),
|
||||||
|
top_species=[],
|
||||||
|
under_represented=[],
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@router.post("/refresh")
|
||||||
|
def refresh_stats_now(db: Session = Depends(get_db)):
|
||||||
|
"""Manually trigger a stats refresh."""
|
||||||
|
from app.workers.stats_tasks import refresh_stats
|
||||||
|
refresh_stats.delay()
|
||||||
|
return {"status": "refresh_queued"}
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/sources")
|
||||||
|
def get_source_stats(db: Session = Depends(get_db)):
|
||||||
|
"""Get per-source breakdown."""
|
||||||
|
stats = db.query(
|
||||||
|
Image.source,
|
||||||
|
func.count(Image.id).label("total"),
|
||||||
|
func.sum(case((Image.status == "downloaded", 1), else_=0)).label("downloaded"),
|
||||||
|
func.sum(case((Image.status == "pending", 1), else_=0)).label("pending"),
|
||||||
|
func.sum(case((Image.status == "rejected", 1), else_=0)).label("rejected"),
|
||||||
|
).group_by(Image.source).all()
|
||||||
|
|
||||||
|
return [
|
||||||
|
{
|
||||||
|
"source": s.source,
|
||||||
|
"total": s.total,
|
||||||
|
"downloaded": s.downloaded or 0,
|
||||||
|
"pending": s.pending or 0,
|
||||||
|
"rejected": s.rejected or 0,
|
||||||
|
}
|
||||||
|
for s in stats
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/species")
|
||||||
|
def get_species_stats(
|
||||||
|
min_count: int = 0,
|
||||||
|
max_count: int = None,
|
||||||
|
db: Session = Depends(get_db),
|
||||||
|
):
|
||||||
|
"""Get per-species image counts."""
|
||||||
|
query = db.query(
|
||||||
|
Species.id,
|
||||||
|
Species.scientific_name,
|
||||||
|
Species.common_name,
|
||||||
|
Species.genus,
|
||||||
|
func.count(Image.id).label("image_count")
|
||||||
|
).outerjoin(Image, (Image.species_id == Species.id) & (Image.status == "downloaded")
|
||||||
|
).group_by(Species.id)
|
||||||
|
|
||||||
|
if min_count > 0:
|
||||||
|
query = query.having(func.count(Image.id) >= min_count)
|
||||||
|
|
||||||
|
if max_count is not None:
|
||||||
|
query = query.having(func.count(Image.id) <= max_count)
|
||||||
|
|
||||||
|
stats = query.order_by(func.count(Image.id).desc()).all()
|
||||||
|
|
||||||
|
return [
|
||||||
|
{
|
||||||
|
"id": s.id,
|
||||||
|
"scientific_name": s.scientific_name,
|
||||||
|
"common_name": s.common_name,
|
||||||
|
"genus": s.genus,
|
||||||
|
"image_count": s.image_count,
|
||||||
|
}
|
||||||
|
for s in stats
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/distribution")
|
||||||
|
def get_image_distribution(db: Session = Depends(get_db)):
|
||||||
|
"""Get distribution of images per species for ML training assessment.
|
||||||
|
|
||||||
|
Returns counts of species at various image thresholds to help
|
||||||
|
determine dataset quality for training image classifiers.
|
||||||
|
"""
|
||||||
|
from sqlalchemy import text
|
||||||
|
|
||||||
|
# Get image counts per species using optimized raw SQL
|
||||||
|
distribution_sql = text("""
|
||||||
|
WITH species_counts AS (
|
||||||
|
SELECT
|
||||||
|
s.id,
|
||||||
|
COUNT(i.id) as cnt
|
||||||
|
FROM species s
|
||||||
|
LEFT JOIN images i ON i.species_id = s.id AND i.status = 'downloaded'
|
||||||
|
GROUP BY s.id
|
||||||
|
)
|
||||||
|
SELECT
|
||||||
|
COUNT(*) as total_species,
|
||||||
|
SUM(CASE WHEN cnt = 0 THEN 1 ELSE 0 END) as with_0,
|
||||||
|
SUM(CASE WHEN cnt >= 1 AND cnt < 10 THEN 1 ELSE 0 END) as with_1_9,
|
||||||
|
SUM(CASE WHEN cnt >= 10 AND cnt < 25 THEN 1 ELSE 0 END) as with_10_24,
|
||||||
|
SUM(CASE WHEN cnt >= 25 AND cnt < 50 THEN 1 ELSE 0 END) as with_25_49,
|
||||||
|
SUM(CASE WHEN cnt >= 50 AND cnt < 100 THEN 1 ELSE 0 END) as with_50_99,
|
||||||
|
SUM(CASE WHEN cnt >= 100 AND cnt < 200 THEN 1 ELSE 0 END) as with_100_199,
|
||||||
|
SUM(CASE WHEN cnt >= 200 THEN 1 ELSE 0 END) as with_200_plus,
|
||||||
|
SUM(CASE WHEN cnt >= 10 THEN 1 ELSE 0 END) as trainable_10,
|
||||||
|
SUM(CASE WHEN cnt >= 25 THEN 1 ELSE 0 END) as trainable_25,
|
||||||
|
SUM(CASE WHEN cnt >= 50 THEN 1 ELSE 0 END) as trainable_50,
|
||||||
|
SUM(CASE WHEN cnt >= 100 THEN 1 ELSE 0 END) as trainable_100,
|
||||||
|
AVG(cnt) as avg_images,
|
||||||
|
MAX(cnt) as max_images,
|
||||||
|
MIN(cnt) as min_images,
|
||||||
|
SUM(cnt) as total_images
|
||||||
|
FROM species_counts
|
||||||
|
""")
|
||||||
|
|
||||||
|
result = db.execute(distribution_sql).fetchone()
|
||||||
|
|
||||||
|
return {
|
||||||
|
"total_species": result[0] or 0,
|
||||||
|
"distribution": {
|
||||||
|
"0_images": result[1] or 0,
|
||||||
|
"1_to_9": result[2] or 0,
|
||||||
|
"10_to_24": result[3] or 0,
|
||||||
|
"25_to_49": result[4] or 0,
|
||||||
|
"50_to_99": result[5] or 0,
|
||||||
|
"100_to_199": result[6] or 0,
|
||||||
|
"200_plus": result[7] or 0,
|
||||||
|
},
|
||||||
|
"trainable_species": {
|
||||||
|
"min_10_images": result[8] or 0,
|
||||||
|
"min_25_images": result[9] or 0,
|
||||||
|
"min_50_images": result[10] or 0,
|
||||||
|
"min_100_images": result[11] or 0,
|
||||||
|
},
|
||||||
|
"summary": {
|
||||||
|
"avg_images_per_species": round(result[12] or 0, 1),
|
||||||
|
"max_images": result[13] or 0,
|
||||||
|
"min_images": result[14] or 0,
|
||||||
|
"total_downloaded_images": result[15] or 0,
|
||||||
|
},
|
||||||
|
"recommendations": {
|
||||||
|
"for_basic_model": f"{result[8] or 0} species with 10+ images",
|
||||||
|
"for_good_model": f"{result[10] or 0} species with 50+ images",
|
||||||
|
"for_excellent_model": f"{result[11] or 0} species with 100+ images",
|
||||||
|
}
|
||||||
|
}
|
||||||
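The SQL in `get_image_distribution` places each species in exactly one bucket according to its downloaded-image count. The same thresholds in plain Python, as a sketch; `BUCKETS`, `bucket_for`, and `distribution` are hypothetical names, and any counts passed in are the caller's own data:

```python
from collections import Counter
from typing import Dict, Iterable

# (label, inclusive lower bound, exclusive upper bound); None means no upper bound.
BUCKETS = [
    ("0_images", 0, 1),
    ("1_to_9", 1, 10),
    ("10_to_24", 10, 25),
    ("25_to_49", 25, 50),
    ("50_to_99", 50, 100),
    ("100_to_199", 100, 200),
    ("200_plus", 200, None),
]


def bucket_for(count: int) -> str:
    """Return the bucket label for one species' image count."""
    for label, low, high in BUCKETS:
        if count >= low and (high is None or count < high):
            return label
    raise ValueError("count must be non-negative")


def distribution(counts: Iterable[int]) -> Dict[str, int]:
    """Count how many species fall into each bucket, including empty buckets."""
    tally = Counter(bucket_for(c) for c in counts)
    return {label: tally.get(label, 0) for label, _, _ in BUCKETS}
```

The `trainable_*` figures in the endpoint are cumulative (for instance, every species with 10 or more images), so they correspond to sums over several of these buckets.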
38
backend/app/config.py
Normal file
@@ -0,0 +1,38 @@
|
|||||||
|
from pydantic_settings import BaseSettings
|
||||||
|
from functools import lru_cache
|
||||||
|
|
||||||
|
|
||||||
|
class Settings(BaseSettings):
|
||||||
|
# Database
|
||||||
|
database_url: str = "sqlite:////data/db/plants.sqlite"
|
||||||
|
|
||||||
|
# Redis
|
||||||
|
redis_url: str = "redis://redis:6379/0"
|
||||||
|
|
||||||
|
# Storage paths
|
||||||
|
images_path: str = "/data/images"
|
||||||
|
exports_path: str = "/data/exports"
|
||||||
|
imports_path: str = "/data/imports"
|
||||||
|
logs_path: str = "/data/logs"
|
||||||
|
|
||||||
|
# API Keys
|
||||||
|
flickr_api_key: str = ""
|
||||||
|
flickr_api_secret: str = ""
|
||||||
|
inaturalist_app_id: str = ""
|
||||||
|
inaturalist_app_secret: str = ""
|
||||||
|
trefle_api_key: str = ""
|
||||||
|
|
||||||
|
# Logging
|
||||||
|
log_level: str = "INFO"
|
||||||
|
|
||||||
|
# Celery
|
||||||
|
celery_concurrency: int = 4
|
||||||
|
|
||||||
|
class Config:
|
||||||
|
env_file = ".env"
|
||||||
|
extra = "ignore"
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache()
|
||||||
|
def get_settings() -> Settings:
|
||||||
|
return Settings()
|
||||||
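`get_settings` is wrapped in `lru_cache`, so the settings object is built once and then reused on every call. A stdlib-only sketch of that pattern; `DemoSettings` and `get_demo_settings` are hypothetical stand-ins for the pydantic `Settings` class and its accessor, and they read just one variable from the environment:

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class DemoSettings:
    """Hypothetical stand-in for Settings; not the project's class."""
    log_level: str


@lru_cache()
def get_demo_settings() -> DemoSettings:
    # Read the environment once; later calls return the cached object.
    return DemoSettings(log_level=os.environ.get("LOG_LEVEL", "INFO"))


first = get_demo_settings()
os.environ["LOG_LEVEL"] = "DEBUG"   # changing the environment afterwards...
second = get_demo_settings()        # ...does not affect the cached object
print(first is second)

get_demo_settings.cache_clear()     # clearing the cache forces a re-read
print(get_demo_settings().log_level)
```

One consequence of this design is that tests which change environment variables would need to clear the cache before the new values take effect.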
44
backend/app/database.py
Normal file
@@ -0,0 +1,44 @@
from sqlalchemy import create_engine, event
from sqlalchemy.orm import sessionmaker, declarative_base
from sqlalchemy.pool import StaticPool

from app.config import get_settings

settings = get_settings()

# SQLite-specific configuration
connect_args = {"check_same_thread": False}

engine = create_engine(
    settings.database_url,
    connect_args=connect_args,
    poolclass=StaticPool,
    echo=False,
)

# Enable WAL mode for better concurrent access
@event.listens_for(engine, "connect")
def set_sqlite_pragma(dbapi_connection, connection_record):
    cursor = dbapi_connection.cursor()
    cursor.execute("PRAGMA journal_mode=WAL")
    cursor.execute("PRAGMA synchronous=NORMAL")
    cursor.execute("PRAGMA foreign_keys=ON")
    cursor.close()

SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)

Base = declarative_base()


def get_db():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()


def init_db():
    """Create all tables."""
    from app.models import species, image, job, api_key, export, cached_stats  # noqa
    Base.metadata.create_all(bind=engine)
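The three PRAGMA statements issued by `set_sqlite_pragma` can be checked in isolation. This is a small sketch using only the standard `sqlite3` module against a throwaway file; the file name `demo.sqlite` is an assumption for illustration, and a file is used because an in-memory database does not switch to WAL.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(os.path.join(tmp, "demo.sqlite"))

    # Setting the journal mode returns the mode that is now in effect.
    journal_mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]

    conn.execute("PRAGMA synchronous=NORMAL")
    conn.execute("PRAGMA foreign_keys=ON")

    # Read the values back to confirm they were applied on this connection.
    synchronous = conn.execute("PRAGMA synchronous").fetchone()[0]
    foreign_keys = conn.execute("PRAGMA foreign_keys").fetchone()[0]

    conn.close()
```

`foreign_keys` and `synchronous` are per-connection settings, which is presumably why the project applies them in a `connect` listener instead of once at startup.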
95
backend/app/main.py
Normal file
@@ -0,0 +1,95 @@
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

from app.config import get_settings
from app.database import init_db
from app.api import species, images, jobs, exports, stats, sources

settings = get_settings()

app = FastAPI(
    title="PlantGuideScraper API",
    description="Web scraper interface for houseplant image collection",
    version="1.0.0",
)

# CORS middleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Include routers
app.include_router(species.router, prefix="/api/species", tags=["Species"])
app.include_router(images.router, prefix="/api/images", tags=["Images"])
app.include_router(jobs.router, prefix="/api/jobs", tags=["Jobs"])
app.include_router(exports.router, prefix="/api/exports", tags=["Exports"])
app.include_router(stats.router, prefix="/api/stats", tags=["Stats"])
app.include_router(sources.router, prefix="/api/sources", tags=["Sources"])


@app.on_event("startup")
async def startup_event():
    """Initialize database on startup."""
    init_db()


@app.get("/health")
async def health_check():
    """Health check endpoint."""
    return {"status": "healthy", "service": "plant-scraper"}


@app.get("/api/debug")
async def debug_check():
    """Debug endpoint - checks database connection."""
    import time
    from app.database import SessionLocal
    from app.models import Species, Image

    results = {"status": "checking", "checks": {}}

    # Check 1: Can we create a session?
    try:
        start = time.time()
        db = SessionLocal()
        results["checks"]["session_create"] = {"ok": True, "ms": int((time.time() - start) * 1000)}
    except Exception as e:
        results["checks"]["session_create"] = {"ok": False, "error": str(e)}
        results["status"] = "error"
        return results

    # Check 2: Simple query - count species
    try:
        start = time.time()
        count = db.query(Species).count()
        results["checks"]["species_count"] = {"ok": True, "count": count, "ms": int((time.time() - start) * 1000)}
    except Exception as e:
        results["checks"]["species_count"] = {"ok": False, "error": str(e)}
        results["status"] = "error"
        db.close()
        return results

    # Check 3: Count images
    try:
        start = time.time()
        count = db.query(Image).count()
        results["checks"]["image_count"] = {"ok": True, "count": count, "ms": int((time.time() - start) * 1000)}
    except Exception as e:
        results["checks"]["image_count"] = {"ok": False, "error": str(e)}
        results["status"] = "error"
        db.close()
        return results

    db.close()
    results["status"] = "healthy"
    return results


@app.get("/")
async def root():
    """Root endpoint."""
    return {"message": "PlantGuideScraper API", "docs": "/docs"}
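The debug endpoint above repeats the same "time it, record ok or error" pattern for each check. One possible way to factor that pattern out is sketched below; `timed_check` is a hypothetical helper, not something the project defines, and it uses `time.perf_counter` where the endpoint uses `time.time`.

```python
import time
from typing import Any, Callable, Dict


def timed_check(fn: Callable[[], Any]) -> Dict[str, Any]:
    """Run one check and report success, result, and elapsed milliseconds."""
    start = time.perf_counter()
    try:
        value = fn()
    except Exception as exc:
        # Mirror the endpoint's failure shape: ok flag plus the error text.
        return {"ok": False, "error": str(exc)}
    elapsed_ms = int((time.perf_counter() - start) * 1000)
    return {"ok": True, "value": value, "ms": elapsed_ms}


ok_result = timed_check(lambda: 42)
bad_result = timed_check(lambda: 1 / 0)
```

With a helper of this shape, each database check could be passed in as a small callable, and the session could be closed once in a `finally` block.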
8
backend/app/models/__init__.py
Normal file
@@ -0,0 +1,8 @@
from app.models.species import Species
from app.models.image import Image
from app.models.job import Job
from app.models.api_key import ApiKey
from app.models.export import Export
from app.models.cached_stats import CachedStats

__all__ = ["Species", "Image", "Job", "ApiKey", "Export", "CachedStats"]
18
backend/app/models/api_key.py
Normal file
@@ -0,0 +1,18 @@
from sqlalchemy import Column, Integer, String, Float, Boolean

from app.database import Base


class ApiKey(Base):
    __tablename__ = "api_keys"

    id = Column(Integer, primary_key=True, index=True)
    source = Column(String, unique=True, nullable=False)  # 'flickr', 'inaturalist', 'wikimedia', 'trefle'
    api_key = Column(String, nullable=False)  # Also used as Client ID for OAuth sources
    api_secret = Column(String, nullable=True)  # Also used as Client Secret for OAuth sources
    access_token = Column(String, nullable=True)  # For OAuth sources like Wikimedia
    rate_limit_per_sec = Column(Float, default=1.0)
    enabled = Column(Boolean, default=True)

    def __repr__(self):
        return f"<ApiKey(id={self.id}, source='{self.source}', enabled={self.enabled})>"
14
backend/app/models/cached_stats.py
Normal file
@@ -0,0 +1,14 @@
from datetime import datetime
from sqlalchemy import Column, Integer, String, Text, DateTime

from app.database import Base


class CachedStats(Base):
    """Stores pre-calculated statistics updated by Celery beat."""
    __tablename__ = "cached_stats"

    id = Column(Integer, primary_key=True, index=True)
    key = Column(String(50), unique=True, nullable=False, index=True)
    value = Column(Text, nullable=False)  # JSON-encoded stats
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)
24
backend/app/models/export.py
Normal file
@@ -0,0 +1,24 @@
from sqlalchemy import Column, Integer, String, Float, DateTime, Text, func

from app.database import Base


class Export(Base):
    __tablename__ = "exports"

    id = Column(Integer, primary_key=True, index=True)
    name = Column(String, nullable=False)
    filter_criteria = Column(Text, nullable=True)  # JSON: min_images, licenses, min_quality, species_ids
    train_split = Column(Float, default=0.8)
    status = Column(String, default="pending")  # pending, generating, completed, failed
    file_path = Column(String, nullable=True)
    file_size = Column(Integer, nullable=True)
    species_count = Column(Integer, nullable=True)
    image_count = Column(Integer, nullable=True)
    celery_task_id = Column(String, nullable=True)
    created_at = Column(DateTime, server_default=func.now())
    completed_at = Column(DateTime, nullable=True)
    error_message = Column(Text, nullable=True)

    def __repr__(self):
        return f"<Export(id={self.id}, name='{self.name}', status='{self.status}')>"
36
backend/app/models/image.py
Normal file
@@ -0,0 +1,36 @@
from sqlalchemy import Column, Integer, String, Float, DateTime, ForeignKey, func, UniqueConstraint, Index
from sqlalchemy.orm import relationship

from app.database import Base


class Image(Base):
    __tablename__ = "images"

    id = Column(Integer, primary_key=True, index=True)
    species_id = Column(Integer, ForeignKey("species.id"), nullable=False, index=True)
    source = Column(String, nullable=False, index=True)
    source_id = Column(String, nullable=True)
    url = Column(String, nullable=False)
    local_path = Column(String, nullable=True)
    license = Column(String, nullable=False, index=True)
    attribution = Column(String, nullable=True)
    width = Column(Integer, nullable=True)
    height = Column(Integer, nullable=True)
    phash = Column(String, nullable=True, index=True)
    quality_score = Column(Float, nullable=True)
    status = Column(String, default="pending", index=True)  # pending, downloaded, rejected, deleted
    created_at = Column(DateTime, server_default=func.now())

    # Composite indexes for common query patterns
    __table_args__ = (
        UniqueConstraint("source", "source_id", name="uq_source_source_id"),
        Index("ix_images_species_status", "species_id", "status"),  # For counting images per species by status
        Index("ix_images_status_created", "status", "created_at"),  # For listing images by status
    )

    # Relationships
    species = relationship("Species", back_populates="images")

    def __repr__(self):
        return f"<Image(id={self.id}, source='{self.source}', status='{self.status}')>"
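The `uq_source_source_id` constraint is what keeps the same upstream item from being stored twice. A rough sketch of how such a constraint behaves in SQLite follows, using the standard `sqlite3` module and a cut-down table; the table layout and the `try_insert` helper are simplifications introduced here for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE images ("
    " id INTEGER PRIMARY KEY,"
    " source TEXT NOT NULL,"
    " source_id TEXT,"
    " CONSTRAINT uq_source_source_id UNIQUE (source, source_id))"
)


def try_insert(source, source_id):
    """Return True if the row was stored, False if the constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO images (source, source_id) VALUES (?, ?)",
                (source, source_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    try_insert("bhl", "123"),     # first occurrence
    try_insert("bhl", "123"),     # same (source, source_id) pair again
    try_insert("flickr", "123"),  # same id under a different source
    try_insert("bhl", None),      # no source_id
    try_insert("bhl", None),      # NULLs count as distinct in a UNIQUE constraint
]
```

Since `source_id` is nullable in the model, rows without a source ID are not deduplicated by this constraint alone.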
27
backend/app/models/job.py
Normal file
@@ -0,0 +1,27 @@
from sqlalchemy import Column, Integer, String, DateTime, Text, Boolean, func

from app.database import Base


class Job(Base):
    __tablename__ = "jobs"

    id = Column(Integer, primary_key=True, index=True)
    name = Column(String, nullable=False)
    source = Column(String, nullable=False)
    species_filter = Column(Text, nullable=True)  # JSON array of species IDs or NULL for all
    only_without_images = Column(Boolean, default=False)  # If True, only scrape species with 0 images
    max_images = Column(Integer, nullable=True)  # If set, only scrape species with fewer than N images
    status = Column(String, default="pending", index=True)  # pending, running, paused, completed, failed
    progress_current = Column(Integer, default=0)
    progress_total = Column(Integer, default=0)
    images_downloaded = Column(Integer, default=0)
    images_rejected = Column(Integer, default=0)
    celery_task_id = Column(String, nullable=True)
    started_at = Column(DateTime, nullable=True)
    completed_at = Column(DateTime, nullable=True)
    error_message = Column(Text, nullable=True)
    created_at = Column(DateTime, server_default=func.now())

    def __repr__(self):
        return f"<Job(id={self.id}, name='{self.name}', status='{self.status}')>"
21
backend/app/models/species.py
Normal file
@@ -0,0 +1,21 @@
from sqlalchemy import Column, Integer, String, DateTime, func
from sqlalchemy.orm import relationship

from app.database import Base


class Species(Base):
    __tablename__ = "species"

    id = Column(Integer, primary_key=True, index=True)
    scientific_name = Column(String, unique=True, nullable=False, index=True)
    common_name = Column(String, nullable=True)
    genus = Column(String, nullable=True, index=True)
    family = Column(String, nullable=True)
    created_at = Column(DateTime, server_default=func.now())

    # Relationships
    images = relationship("Image", back_populates="species", cascade="all, delete-orphan")

    def __repr__(self):
        return f"<Species(id={self.id}, scientific_name='{self.scientific_name}')>"
15
backend/app/schemas/__init__.py
Normal file
@@ -0,0 +1,15 @@
from app.schemas.species import SpeciesCreate, SpeciesUpdate, SpeciesResponse, SpeciesListResponse
from app.schemas.image import ImageResponse, ImageListResponse, ImageFilter
from app.schemas.job import JobCreate, JobResponse, JobListResponse
from app.schemas.api_key import ApiKeyCreate, ApiKeyUpdate, ApiKeyResponse
from app.schemas.export import ExportCreate, ExportResponse, ExportListResponse
from app.schemas.stats import StatsResponse, SourceStats, SpeciesStats

__all__ = [
    "SpeciesCreate", "SpeciesUpdate", "SpeciesResponse", "SpeciesListResponse",
    "ImageResponse", "ImageListResponse", "ImageFilter",
    "JobCreate", "JobResponse", "JobListResponse",
    "ApiKeyCreate", "ApiKeyUpdate", "ApiKeyResponse",
    "ExportCreate", "ExportResponse", "ExportListResponse",
    "StatsResponse", "SourceStats", "SpeciesStats",
]
36
backend/app/schemas/api_key.py
Normal file
@@ -0,0 +1,36 @@
from pydantic import BaseModel
from typing import Optional


class ApiKeyBase(BaseModel):
    source: str
    api_key: Optional[str] = None  # Optional for no-auth sources, used as Client ID for OAuth
    api_secret: Optional[str] = None  # Also used as Client Secret for OAuth sources
    access_token: Optional[str] = None  # For OAuth sources like Wikimedia
    rate_limit_per_sec: float = 1.0
    enabled: bool = True


class ApiKeyCreate(ApiKeyBase):
    pass


class ApiKeyUpdate(BaseModel):
    api_key: Optional[str] = None
    api_secret: Optional[str] = None
    access_token: Optional[str] = None
    rate_limit_per_sec: Optional[float] = None
    enabled: Optional[bool] = None


class ApiKeyResponse(BaseModel):
    id: int
    source: str
    api_key_masked: str  # Show only last 4 chars
    has_secret: bool
    has_access_token: bool
    rate_limit_per_sec: float
    enabled: bool

    class Config:
        from_attributes = True
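`ApiKeyResponse.api_key_masked` is documented as showing only the last four characters, but the masking code itself is not in this file. One way that masking might be done is sketched here; `mask_key` and its parameters are hypothetical names, and the handling of short or empty keys is an assumption.

```python
from typing import Optional


def mask_key(key: Optional[str], visible: int = 4, fill: str = "*") -> str:
    """Hide all but the last `visible` characters of a secret."""
    if not key:
        return ""
    if len(key) <= visible:
        # Too short to reveal anything safely; mask the whole value.
        return fill * len(key)
    return fill * (len(key) - visible) + key[-visible:]


masked = mask_key("abcdef123456")
```

Masking short keys entirely avoids exposing most of a key that is only a few characters long.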
45
backend/app/schemas/export.py
Normal file
@@ -0,0 +1,45 @@
from pydantic import BaseModel
from datetime import datetime
from typing import Optional, List


class ExportFilter(BaseModel):
    min_images_per_species: int = 100
    licenses: Optional[List[str]] = None  # None means all
    min_quality: Optional[float] = None
    species_ids: Optional[List[int]] = None  # None means all


class ExportCreate(BaseModel):
    name: str
    filter_criteria: ExportFilter
    train_split: float = 0.8


class ExportResponse(BaseModel):
    id: int
    name: str
    filter_criteria: Optional[str] = None
    train_split: float
    status: str
    file_path: Optional[str] = None
    file_size: Optional[int] = None
    species_count: Optional[int] = None
    image_count: Optional[int] = None
    created_at: datetime
    completed_at: Optional[datetime] = None
    error_message: Optional[str] = None

    class Config:
        from_attributes = True


class ExportListResponse(BaseModel):
    items: List[ExportResponse]
    total: int


class ExportPreview(BaseModel):
    species_count: int
    image_count: int
    estimated_size_mb: float
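`ExportCreate.train_split` defaults to 0.8, but how the export task applies it is not shown in this file. A minimal sketch of one common interpretation, a seeded shuffle followed by a cut at the split fraction, is given below; `split_train_test` and its `seed` parameter are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_train_test(
    items: Sequence[int], train_split: float = 0.8, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Shuffle deterministically, then cut at the train fraction."""
    if not 0.0 <= train_split <= 1.0:
        raise ValueError("train_split must be between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_split)
    return shuffled[:cut], shuffled[cut:]


train_items, test_items = split_train_test(range(10), 0.8)
```

For per-species datasets the split would likely be applied within each species so that every class appears in both parts; that is a design choice this sketch leaves open.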
47
backend/app/schemas/image.py
Normal file
@@ -0,0 +1,47 @@
from pydantic import BaseModel
from datetime import datetime
from typing import Optional, List


class ImageBase(BaseModel):
    species_id: int
    source: str
    url: str
    license: str


class ImageResponse(BaseModel):
    id: int
    species_id: int
    species_name: Optional[str] = None
    source: str
    source_id: Optional[str] = None
    url: str
    local_path: Optional[str] = None
    license: str
    attribution: Optional[str] = None
    width: Optional[int] = None
    height: Optional[int] = None
    quality_score: Optional[float] = None
    status: str
    created_at: datetime

    class Config:
        from_attributes = True


class ImageListResponse(BaseModel):
    items: List[ImageResponse]
    total: int
    page: int
    page_size: int
    pages: int


class ImageFilter(BaseModel):
    species_id: Optional[int] = None
    source: Optional[str] = None
    license: Optional[str] = None
    status: Optional[str] = None
    min_quality: Optional[float] = None
    search: Optional[str] = None
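`ImageListResponse` carries `total`, `page`, `page_size`, and `pages` together. The arithmetic relating them can be sketched as follows, assuming 1-based page numbers (the schema does not say which base the API uses); `page_count` and `page_bounds` are hypothetical helper names.

```python
import math
from typing import Tuple


def page_count(total: int, page_size: int) -> int:
    """Number of pages needed to show `total` items."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return math.ceil(total / page_size)


def page_bounds(page: int, page_size: int) -> Tuple[int, int]:
    """Translate a 1-based page number into (offset, limit) for a query."""
    if page < 1:
        raise ValueError("page must be 1 or greater")
    return (page - 1) * page_size, page_size
```

For example, 101 images at 50 per page need three pages, and the third page starts at offset 100.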
35
backend/app/schemas/job.py
Normal file
@@ -0,0 +1,35 @@
from pydantic import BaseModel
from datetime import datetime
from typing import Optional, List


class JobCreate(BaseModel):
    name: str
    source: str
    species_ids: Optional[List[int]] = None  # None means all species
    only_without_images: bool = False  # If True, only scrape species with 0 images
    max_images: Optional[int] = None  # If set, only scrape species with fewer than N images


class JobResponse(BaseModel):
    id: int
    name: str
    source: str
    species_filter: Optional[str] = None
    status: str
    progress_current: int
    progress_total: int
    images_downloaded: int
    images_rejected: int
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None
    error_message: Optional[str] = None
    created_at: datetime

    class Config:
        from_attributes = True


class JobListResponse(BaseModel):
    items: List[JobResponse]
    total: int
44
backend/app/schemas/species.py
Normal file
@@ -0,0 +1,44 @@
from pydantic import BaseModel
from datetime import datetime
from typing import Optional, List


class SpeciesBase(BaseModel):
    scientific_name: str
    common_name: Optional[str] = None
    genus: Optional[str] = None
    family: Optional[str] = None


class SpeciesCreate(SpeciesBase):
    pass


class SpeciesUpdate(BaseModel):
    scientific_name: Optional[str] = None
    common_name: Optional[str] = None
    genus: Optional[str] = None
    family: Optional[str] = None


class SpeciesResponse(SpeciesBase):
    id: int
    created_at: datetime
    image_count: int = 0

    class Config:
        from_attributes = True


class SpeciesListResponse(BaseModel):
    items: List[SpeciesResponse]
    total: int
    page: int
    page_size: int
    pages: int


class SpeciesImportResponse(BaseModel):
    imported: int
    skipped: int
    errors: List[str]
43
backend/app/schemas/stats.py
Normal file
@@ -0,0 +1,43 @@
from pydantic import BaseModel
from typing import List, Dict, Optional


class SourceStats(BaseModel):
    source: str
    image_count: int
    downloaded: int
    pending: int
    rejected: int


class LicenseStats(BaseModel):
    license: str
    count: int


class SpeciesStats(BaseModel):
    id: int
    scientific_name: str
    common_name: Optional[str]  # Required but nullable; Optional[...] matches the other schema modules
    image_count: int


class JobStats(BaseModel):
    running: int
    pending: int
    completed: int
    failed: int


class StatsResponse(BaseModel):
    total_species: int
    total_images: int
    images_downloaded: int
    images_pending: int
    images_rejected: int
    disk_usage_mb: float
    sources: List[SourceStats]
    licenses: List[LicenseStats]
    jobs: JobStats
    top_species: List[SpeciesStats]
    under_represented: List[SpeciesStats]  # Species with < 100 images
41
backend/app/scrapers/__init__.py
Normal file
@@ -0,0 +1,41 @@
from typing import Optional

from app.scrapers.base import BaseScraper
from app.scrapers.inaturalist import INaturalistScraper
from app.scrapers.flickr import FlickrScraper
from app.scrapers.wikimedia import WikimediaScraper
from app.scrapers.trefle import TrefleScraper
from app.scrapers.gbif import GBIFScraper
from app.scrapers.duckduckgo import DuckDuckGoScraper
from app.scrapers.bing import BingScraper


def get_scraper(source: str) -> Optional[BaseScraper]:
    """Get scraper instance for a source."""
    scrapers = {
        "inaturalist": INaturalistScraper,
        "flickr": FlickrScraper,
        "wikimedia": WikimediaScraper,
        "trefle": TrefleScraper,
        "gbif": GBIFScraper,
        "duckduckgo": DuckDuckGoScraper,
        "bing": BingScraper,
    }

    scraper_class = scrapers.get(source)
    if scraper_class:
        return scraper_class()
    return None


__all__ = [
    "get_scraper",
    "BaseScraper",
    "INaturalistScraper",
    "FlickrScraper",
    "WikimediaScraper",
    "TrefleScraper",
    "GBIFScraper",
    "DuckDuckGoScraper",
    "BingScraper",
]
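`get_scraper` is a name-to-class lookup that returns `None` for unknown sources, so callers have to handle the missing case themselves. The sketch below shows the same lookup pattern without any project imports; `DemoScraper`, `_REGISTRY`, and `get_demo_scraper` are hypothetical names used only here.

```python
from typing import Dict, Optional, Type


class DemoScraper:
    """Hypothetical stand-in for a BaseScraper subclass."""
    name = "demo"


# Maps a source name to the class that handles it.
_REGISTRY: Dict[str, Type[DemoScraper]] = {"demo": DemoScraper}


def get_demo_scraper(source: str) -> Optional[DemoScraper]:
    """Return a fresh scraper instance, or None when the source is unknown."""
    scraper_class = _REGISTRY.get(source)
    return scraper_class() if scraper_class else None


known = get_demo_scraper("demo")
unknown = get_demo_scraper("bhl")  # not registered in this sketch
```

Note that the mapping shown in this commit does not list `bhl`, so a lookup for that source would take the `None` branch unless the scraper is registered elsewhere.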
57
backend/app/scrapers/base.py
Normal file
@@ -0,0 +1,57 @@
from abc import ABC, abstractmethod
from typing import Dict, Any, Optional
import logging

from sqlalchemy.orm import Session

from app.models import Species, ApiKey


class BaseScraper(ABC):
    """Base class for all image scrapers."""

    name: str = "base"
    requires_api_key: bool = True

    @abstractmethod
    def scrape_species(
        self,
        species: Species,
        db: Session,
        logger: Optional[logging.Logger] = None
    ) -> Dict[str, int]:
        """
        Scrape images for a species.

        Args:
            species: The species to scrape images for
            db: Database session
            logger: Optional logger for debugging

        Returns:
            Dict with 'downloaded' and 'rejected' counts
        """
        pass

    @abstractmethod
    def test_connection(self, api_key: ApiKey) -> str:
        """
        Test API connection.

        Args:
            api_key: The API key configuration

        Returns:
            Success message

        Raises:
            Exception if connection fails
        """
        pass

    def get_api_key(self, db: Session) -> Optional[ApiKey]:
        """Get the enabled API key for this scraper, or None if there is none."""
        return db.query(ApiKey).filter(
            ApiKey.source == self.name,
            ApiKey.enabled.is_(True)
        ).first()
228
backend/app/scrapers/bhl.py
Normal file
@@ -0,0 +1,228 @@
|
|||||||
|
import time
|
||||||
|
import logging
|
||||||
|
from typing import Dict, Optional
|
||||||
|
|
||||||
|
import httpx
|
||||||
|
from sqlalchemy.orm import Session
|
||||||
|
|
||||||
|
from app.scrapers.base import BaseScraper
|
||||||
|
from app.models import Species, Image, ApiKey
|
||||||
|
from app.workers.quality_tasks import download_and_process_image
|
||||||
|
|
||||||
|
|
||||||
|
class BHLScraper(BaseScraper):
|
||||||
|
"""Scraper for Biodiversity Heritage Library (BHL) images.
|
||||||
|
|
||||||
|
BHL provides access to digitized biodiversity literature and illustrations.
|
||||||
|
Most content is public domain (pre-1927) or CC-licensed.
|
||||||
|
|
||||||
|
Note: BHL images are primarily historical botanical illustrations,
|
||||||
|
which may differ from photographs but are valuable for training.
|
||||||
|
"""
|
||||||
|
|
||||||
|
name = "bhl"
|
||||||
|
requires_api_key = True # BHL requires free API key
|
||||||
|
|
||||||
|
BASE_URL = "https://www.biodiversitylibrary.org/api3"
|
||||||
|
|
||||||
|
HEADERS = {
|
||||||
|
"User-Agent": "PlantGuideScraper/1.0 (Plant image collection for ML training)",
|
||||||
|
"Accept": "application/json",
|
||||||
|
}
|
||||||
|
|
||||||
|
# BHL content is mostly public domain
|
||||||
|
ALLOWED_LICENSES = {"CC0", "CC-BY", "CC-BY-SA", "PD"}
|
||||||
|
|
||||||
|
def scrape_species(
|
||||||
|
self,
|
||||||
|
species: Species,
|
||||||
|
db: Session,
|
||||||
|
logger: Optional[logging.Logger] = None
|
||||||
|
) -> Dict[str, int]:
|
||||||
|
"""Scrape images from BHL for a species."""
|
||||||
|
api_key = self.get_api_key(db)
|
||||||
|
if not api_key:
|
||||||
|
return {"downloaded": 0, "rejected": 0, "error": "No API key configured"}
|
||||||
|
|
||||||
|
rate_limit = api_key.rate_limit_per_sec if api_key else 0.5
|
||||||
|
|
||||||
|
downloaded = 0
|
||||||
|
rejected = 0
|
||||||
|
|
||||||
|
def log(level: str, msg: str):
|
||||||
|
if logger:
|
||||||
|
getattr(logger, level)(msg)
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Disable SSL verification - some Docker environments lack proper CA certificates
|
||||||
|
with httpx.Client(timeout=30, headers=self.HEADERS, verify=False) as client:
|
||||||
|
# Search for name in BHL
|
||||||
|
search_response = client.get(
|
||||||
|
f"{self.BASE_URL}",
|
||||||
|
params={
|
||||||
|
"op": "NameSearch",
|
||||||
|
"name": species.scientific_name,
|
||||||
|
"format": "json",
|
||||||
|
"apikey": api_key.api_key,
|
||||||
|
},
|
||||||
|
)
|
||||||
|
search_response.raise_for_status()
|
||||||
|
search_data = search_response.json()
|
||||||
|
|
||||||
|
results = search_data.get("Result", [])
|
||||||
|
if not results:
|
||||||
|
log("info", f" Species not found in BHL: {species.scientific_name}")
|
||||||
|
return {"downloaded": 0, "rejected": 0}
|
||||||
|
|
||||||
|
time.sleep(1.0 / rate_limit)
|
||||||
|
|
||||||
|
# Get pages with illustrations for each name result
|
||||||
|
for name_result in results[:5]: # Limit to top 5 matches
|
||||||
|
name_bank_id = name_result.get("NameBankID")
|
||||||
|
if not name_bank_id:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Get publications with this name
|
||||||
|
pub_response = client.get(
|
||||||
|
f"{self.BASE_URL}",
|
||||||
|
params={
|
||||||
|
"op": "NameGetDetail",
|
||||||
|
"namebankid": name_bank_id,
|
||||||
|
"format": "json",
|
||||||
|
"apikey": api_key.api_key,
|
||||||
|
},
|
||||||
|
)
|
||||||
|
pub_response.raise_for_status()
|
||||||
|
pub_data = pub_response.json()
|
||||||
|
|
||||||
|
time.sleep(1.0 / rate_limit)
|
||||||
|
|
||||||
|
# Extract titles and get page images
|
||||||
|
for title in pub_data.get("Result", []):
|
||||||
|
title_id = title.get("TitleID")
|
||||||
|
if not title_id:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Get pages for this title
|
||||||
|
pages_response = client.get(
|
||||||
|
f"{self.BASE_URL}",
|
||||||
|
params={
|
||||||
|
"op": "GetPageMetadata",
|
||||||
|
"titleid": title_id,
|
||||||
|
"format": "json",
|
||||||
|
"apikey": api_key.api_key,
|
||||||
|
"ocr": "false",
|
||||||
|
"names": "false",
|
||||||
|
},
|
||||||
|
)
|
||||||
|
|
||||||
|
if pages_response.status_code != 200:
|
||||||
|
continue
|
||||||
|
|
||||||
|
pages_data = pages_response.json()
|
||||||
|
pages = pages_data.get("Result", [])
|
||||||
|
|
||||||
|
time.sleep(1.0 / rate_limit)
|
||||||
|
|
||||||
|
# Look for pages that are likely illustrations
|
||||||
|
for page in pages[:100]: # Limit pages per title
|
||||||
|
page_types = page.get("PageTypes", [])
|
||||||
|
|
||||||
|
# Only get illustration/plate pages
|
||||||
|
is_illustration = any(
|
||||||
|
pt.get("PageTypeName", "").lower() in ["illustration", "plate", "figure", "map"]
|
||||||
|
for pt in page_types
|
||||||
|
) if page_types else False
|
||||||
|
|
||||||
|
if not is_illustration and page_types:
|
||||||
|
continue
|
||||||
|
|
||||||
|
page_id = page.get("PageID")
|
||||||
|
if not page_id:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Construct image URL
|
||||||
|
# BHL provides multiple image sizes
|
||||||
|
image_url = f"https://www.biodiversitylibrary.org/pageimage/{page_id}"
|
||||||
|
|
||||||
|
# Check if already exists
|
||||||
|
source_id = str(page_id)
|
||||||
|
existing = db.query(Image).filter(
|
||||||
|
Image.source == self.name,
|
||||||
|
Image.source_id == source_id,
|
||||||
|
).first()
|
||||||
|
|
||||||
|
if existing:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Determine license - BHL content is usually public domain
|
||||||
|
item_url = page.get("ItemUrl", "")
|
||||||
|
year = None
|
||||||
|
try:
|
||||||
|
# Try to extract year from ItemUrl or other fields
|
||||||
|
if "Year" in page:
|
||||||
|
year = int(page.get("Year", 0))
|
||||||
|
except (ValueError, TypeError):
|
||||||
|
pass
|
||||||
|
|
||||||
|
# Content before 1927 is public domain in US
|
||||||
|
if year and year < 1927:
|
||||||
|
license_code = "PD"
|
||||||
|
else:
|
||||||
|
license_code = "CC0" # BHL default for older works
|
||||||
|
|
||||||
|
# Build attribution
|
||||||
|
title_name = title.get("ShortTitle", title.get("FullTitle", "Unknown"))
|
||||||
|
attribution = f"From '{title_name}' via Biodiversity Heritage Library ({license_code})"
|
||||||
|
|
||||||
|
# Create image record
|
||||||
|
image = Image(
|
||||||
|
species_id=species.id,
|
||||||
|
source=self.name,
|
||||||
|
source_id=source_id,
|
||||||
|
url=image_url,
|
||||||
|
license=license_code,
|
||||||
|
attribution=attribution,
|
||||||
|
status="pending",
|
||||||
|
)
|
||||||
|
db.add(image)
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
# Queue for download
|
||||||
|
download_and_process_image.delay(image.id)
|
||||||
|
downloaded += 1
|
||||||
|
|
||||||
|
# Limit total per species
|
||||||
|
if downloaded >= 50:
|
||||||
|
break
|
||||||
|
|
||||||
|
if downloaded >= 50:
|
||||||
|
break
|
||||||
|
|
||||||
|
if downloaded >= 50:
|
||||||
|
break
|
||||||
|
|
||||||
|
except httpx.HTTPStatusError as e:
|
||||||
|
log("error", f" HTTP error for {species.scientific_name}: {e.response.status_code}")
|
||||||
|
except Exception as e:
|
||||||
|
log("error", f" Error scraping BHL for {species.scientific_name}: {e}")
|
||||||
|
|
||||||
|
return {"downloaded": downloaded, "rejected": rejected}
|
||||||
|
|
||||||
|
def test_connection(self, api_key: ApiKey) -> str:
|
||||||
|
"""Test BHL API connection."""
|
||||||
|
with httpx.Client(timeout=10, headers=self.HEADERS, verify=False) as client:
|
||||||
|
response = client.get(
|
||||||
|
f"{self.BASE_URL}",
|
||||||
|
params={
|
||||||
|
"op": "NameSearch",
|
||||||
|
"name": "Rosa",
|
||||||
|
"format": "json",
|
||||||
|
"apikey": api_key.api_key,
|
||||||
|
},
|
||||||
|
)
|
||||||
|
response.raise_for_status()
|
||||||
|
data = response.json()
|
||||||
|
|
||||||
|
results = data.get("Result", [])
|
||||||
|
return f"BHL API connection successful ({len(results)} results for 'Rosa')"
|
||||||
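The fixed 1927 cutoff in the BHL scraper above corresponds to the US rule that works enter the public domain 95 years after publication. A minimal sketch of a rolling cutoff, assuming that rule is the only one applied. `us_public_domain_cutoff` and `bhl_license_for_year` are hypothetical helpers that are not part of this commit, and the `"UNKNOWN"` fallback is an assumption borrowed from the convention the Bing scraper uses:

```python
from datetime import date
from typing import Optional


def us_public_domain_cutoff(today: Optional[date] = None) -> int:
    """Return the first publication year that is not yet public domain in the US.

    Assumes the 95-year publication rule: a work published in year Y
    enters the public domain on 1 January of year Y + 96.
    """
    today = today or date.today()
    return today.year - 95


def bhl_license_for_year(year: Optional[int], today: Optional[date] = None) -> str:
    """Hypothetical helper: 'PD' for pre-cutoff years, 'UNKNOWN' otherwise."""
    if year and year < us_public_domain_cutoff(today):
        return "PD"
    # Assumption: a missing or recent year is treated as unknown rather than CC0.
    return "UNKNOWN"
```

With a date in 2022 the cutoff evaluates to 1927, which matches the constant used in the scraper.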
135
backend/app/scrapers/bing.py
Normal file
@@ -0,0 +1,135 @@
|
|||||||
|
import hashlib
|
||||||
|
import time
|
||||||
|
import logging
|
||||||
|
from typing import Dict, Optional
|
||||||
|
|
||||||
|
import httpx
|
||||||
|
from sqlalchemy.orm import Session
|
||||||
|
|
||||||
|
from app.scrapers.base import BaseScraper
|
||||||
|
from app.models import Species, Image, ApiKey
|
||||||
|
from app.workers.quality_tasks import download_and_process_image
|
||||||
|
|
||||||
|
|
||||||
|
class BingScraper(BaseScraper):
|
||||||
|
"""Scraper for Bing Image Search v7 API (Azure Cognitive Services)."""
|
||||||
|
|
||||||
|
name = "bing"
|
||||||
|
requires_api_key = True
|
||||||
|
|
||||||
|
BASE_URL = "https://api.bing.microsoft.com/v7.0/images/search"
|
||||||
|
|
||||||
|
NEGATIVE_TERMS = "-herbarium -specimen -illustration -drawing -diagram -dried -pressed"
|
||||||
|
|
||||||
|
LICENSE_MAP = {
|
||||||
|
"Public": "CC0",
|
||||||
|
"Share": "CC-BY-SA",
|
||||||
|
"ShareCommercially": "CC-BY",
|
||||||
|
"Modify": "CC-BY-SA",
|
||||||
|
"ModifyCommercially": "CC-BY",
|
||||||
|
}
|
||||||
|
|
||||||
|
def _build_queries(self, species: Species) -> list[str]:
|
||||||
|
queries = [f'"{species.scientific_name}" plant photo {self.NEGATIVE_TERMS}']
|
||||||
|
if species.common_name:
|
||||||
|
queries.append(f'"{species.common_name}" houseplant photo {self.NEGATIVE_TERMS}')
|
||||||
|
return queries
|
||||||
|
|
||||||
|
def scrape_species(
|
||||||
|
self,
|
||||||
|
species: Species,
|
||||||
|
db: Session,
|
||||||
|
logger: Optional[logging.Logger] = None,
|
||||||
|
) -> Dict[str, int]:
|
||||||
|
api_key = self.get_api_key(db)
|
||||||
|
if not api_key:
|
||||||
|
return {"downloaded": 0, "rejected": 0}
|
||||||
|
|
||||||
|
rate_limit = api_key.rate_limit_per_sec or 3.0
|
||||||
|
downloaded = 0
|
||||||
|
rejected = 0
|
||||||
|
seen_urls = set()
|
||||||
|
|
||||||
|
headers = {
|
||||||
|
"Ocp-Apim-Subscription-Key": api_key.api_key,
|
||||||
|
}
|
||||||
|
|
||||||
|
try:
|
||||||
|
queries = self._build_queries(species)
|
||||||
|
|
||||||
|
with httpx.Client(timeout=30, headers=headers) as client:
|
||||||
|
for query in queries:
|
||||||
|
params = {
|
||||||
|
"q": query,
|
||||||
|
"imageType": "Photo",
|
||||||
|
"license": "ShareCommercially",
|
||||||
|
"count": 50,
|
||||||
|
}
|
||||||
|
|
||||||
|
response = client.get(self.BASE_URL, params=params)
|
||||||
|
response.raise_for_status()
|
||||||
|
data = response.json()
|
||||||
|
|
||||||
|
for result in data.get("value", []):
|
||||||
|
url = result.get("contentUrl")
|
||||||
|
if not url or url in seen_urls:
|
||||||
|
continue
|
||||||
|
seen_urls.add(url)
|
||||||
|
|
||||||
|
# Use Bing's imageId, fall back to md5 hash
|
||||||
|
source_id = result.get("imageId") or hashlib.md5(url.encode()).hexdigest()[:16]
|
||||||
|
|
||||||
|
existing = db.query(Image).filter(
|
||||||
|
Image.source == self.name,
|
||||||
|
Image.source_id == source_id,
|
||||||
|
).first()
|
||||||
|
|
||||||
|
if existing:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Map license
|
||||||
|
bing_license = result.get("license", "")
|
||||||
|
license_code = self.LICENSE_MAP.get(bing_license, "UNKNOWN")
|
||||||
|
|
||||||
|
host = result.get("hostPageDisplayUrl", "")
|
||||||
|
attribution = f"via Bing ({host})" if host else "via Bing Image Search"
|
||||||
|
|
||||||
|
image = Image(
|
||||||
|
species_id=species.id,
|
||||||
|
source=self.name,
|
||||||
|
source_id=source_id,
|
||||||
|
url=url,
|
||||||
|
width=result.get("width"),
|
||||||
|
height=result.get("height"),
|
||||||
|
license=license_code,
|
||||||
|
attribution=attribution,
|
||||||
|
status="pending",
|
||||||
|
)
|
||||||
|
db.add(image)
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
download_and_process_image.delay(image.id)
|
||||||
|
downloaded += 1
|
||||||
|
|
||||||
|
time.sleep(1.0 / rate_limit)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
if logger:
|
||||||
|
logger.error(f"Error scraping Bing for {species.scientific_name}: {e}")
|
||||||
|
else:
|
||||||
|
print(f"Error scraping Bing for {species.scientific_name}: {e}")
|
||||||
|
|
||||||
|
return {"downloaded": downloaded, "rejected": rejected}
|
||||||
|
|
||||||
|
def test_connection(self, api_key: ApiKey) -> str:
|
||||||
|
headers = {"Ocp-Apim-Subscription-Key": api_key.api_key}
|
||||||
|
with httpx.Client(timeout=10, headers=headers) as client:
|
||||||
|
response = client.get(
|
||||||
|
self.BASE_URL,
|
||||||
|
params={"q": "Monstera deliciosa plant", "count": 1},
|
||||||
|
)
|
||||||
|
response.raise_for_status()
|
||||||
|
data = response.json()
|
||||||
|
|
||||||
|
count = data.get("totalEstimatedMatches", 0)
|
||||||
|
return f"Bing Image Search working ({count:,} estimated matches)"
|
||||||
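The Bing scraper above falls back to an MD5 digest of the image URL when no `imageId` is returned. A minimal sketch of that fallback as a standalone function; `stable_source_id` is a hypothetical name, not something defined in this commit. A digest is used rather than the built-in `hash()` because string hashes are salted per process, so they would not match records stored by an earlier run:

```python
import hashlib
from typing import Optional


def stable_source_id(url: str, image_id: Optional[str] = None) -> str:
    """Prefer the provider's ID; otherwise derive a stable 16-hex-char ID from the URL."""
    if image_id:
        return image_id
    # MD5 serves only as a deterministic fingerprint here, not for security.
    return hashlib.md5(url.encode("utf-8")).hexdigest()[:16]
```

The same URL always yields the same ID, which is what the `Image.source_id` lookup relies on for de-duplication.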
101
backend/app/scrapers/duckduckgo.py
Normal file
@@ -0,0 +1,101 @@
|
|||||||
|
import hashlib
|
||||||
|
import time
|
||||||
|
import logging
|
||||||
|
from typing import Dict, Optional
|
||||||
|
|
||||||
|
from duckduckgo_search import DDGS
|
||||||
|
from sqlalchemy.orm import Session
|
||||||
|
|
||||||
|
from app.scrapers.base import BaseScraper
|
||||||
|
from app.models import Species, Image, ApiKey
|
||||||
|
from app.workers.quality_tasks import download_and_process_image
|
||||||
|
|
||||||
|
|
||||||
|
class DuckDuckGoScraper(BaseScraper):
|
||||||
|
"""Scraper for DuckDuckGo image search. No API key required."""
|
||||||
|
|
||||||
|
name = "duckduckgo"
|
||||||
|
requires_api_key = False
|
||||||
|
|
||||||
|
NEGATIVE_TERMS = "-herbarium -specimen -illustration -drawing -diagram -dried -pressed"
|
||||||
|
|
||||||
|
def _build_queries(self, species: Species) -> list[str]:
|
||||||
|
queries = [f'"{species.scientific_name}" plant photo {self.NEGATIVE_TERMS}']
|
||||||
|
if species.common_name:
|
||||||
|
queries.append(f'"{species.common_name}" houseplant photo {self.NEGATIVE_TERMS}')
|
||||||
|
return queries
|
||||||
|
|
||||||
|
def scrape_species(
|
||||||
|
self,
|
||||||
|
species: Species,
|
||||||
|
db: Session,
|
||||||
|
logger: Optional[logging.Logger] = None,
|
||||||
|
) -> Dict[str, int]:
|
||||||
|
api_key = self.get_api_key(db)
|
||||||
|
rate_limit = api_key.rate_limit_per_sec if api_key else 0.5
|
||||||
|
|
||||||
|
downloaded = 0
|
||||||
|
rejected = 0
|
||||||
|
seen_urls = set()
|
||||||
|
|
||||||
|
try:
|
||||||
|
queries = self._build_queries(species)
|
||||||
|
|
||||||
|
with DDGS() as ddgs:
|
||||||
|
for query in queries:
|
||||||
|
results = ddgs.images(
|
||||||
|
keywords=query,
|
||||||
|
type_image="photo",
|
||||||
|
max_results=50,
|
||||||
|
)
|
||||||
|
|
||||||
|
for result in results:
|
||||||
|
url = result.get("image")
|
||||||
|
if not url or url in seen_urls:
|
||||||
|
continue
|
||||||
|
seen_urls.add(url)
|
||||||
|
|
||||||
|
source_id = hashlib.md5(url.encode()).hexdigest()[:16]
|
||||||
|
|
||||||
|
# Check if already exists
|
||||||
|
existing = db.query(Image).filter(
|
||||||
|
Image.source == self.name,
|
||||||
|
Image.source_id == source_id,
|
||||||
|
).first()
|
||||||
|
|
||||||
|
if existing:
|
||||||
|
continue
|
||||||
|
|
||||||
|
title = result.get("title", "")
|
||||||
|
attribution = f"{title} via DuckDuckGo" if title else "via DuckDuckGo"
|
||||||
|
|
||||||
|
image = Image(
|
||||||
|
species_id=species.id,
|
||||||
|
source=self.name,
|
||||||
|
source_id=source_id,
|
||||||
|
url=url,
|
||||||
|
license="UNKNOWN",
|
||||||
|
attribution=attribution,
|
||||||
|
status="pending",
|
||||||
|
)
|
||||||
|
db.add(image)
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
download_and_process_image.delay(image.id)
|
||||||
|
downloaded += 1
|
||||||
|
|
||||||
|
time.sleep(1.0 / rate_limit)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
if logger:
|
||||||
|
logger.error(f"Error scraping DuckDuckGo for {species.scientific_name}: {e}")
|
||||||
|
else:
|
||||||
|
print(f"Error scraping DuckDuckGo for {species.scientific_name}: {e}")
|
||||||
|
|
||||||
|
return {"downloaded": downloaded, "rejected": rejected}
|
||||||
|
|
||||||
|
def test_connection(self, api_key: ApiKey) -> str:
|
||||||
|
with DDGS() as ddgs:
|
||||||
|
results = ddgs.images(keywords="Monstera deliciosa plant", max_results=1)
|
||||||
|
count = len(list(results))
|
||||||
|
return f"DuckDuckGo search working ({count} test result)"
|
||||||
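Each scraper above paces itself with `time.sleep(1.0 / rate_limit)`. A minimal sketch of the same idea as a reusable object, assuming the rate is given in requests per second; `Throttle` is a hypothetical class that does not exist in this commit. The clock and sleep functions are injectable so the behavior can be checked without real waiting:

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between calls, given a rate in requests per second."""

    def __init__(
        self,
        rate_per_sec: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if rate_per_sec <= 0:
            raise ValueError("rate_per_sec must be positive")
        self._interval = 1.0 / rate_per_sec
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Sleep just long enough to honor the interval; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        # Record when the caller is actually released.
        self._last = now + delay
        return delay
```

Unlike a fixed sleep, this only waits for the part of the interval that has not already elapsed, and it rejects a zero or negative rate up front instead of failing on division.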
226
backend/app/scrapers/eol.py
Normal file
@@ -0,0 +1,226 @@
|
|||||||
|
import time
|
||||||
|
import logging
|
||||||
|
from typing import Dict, Optional
|
||||||
|
|
||||||
|
import httpx
|
||||||
|
from sqlalchemy.orm import Session
|
||||||
|
|
||||||
|
from app.scrapers.base import BaseScraper
|
||||||
|
from app.models import Species, Image, ApiKey
|
||||||
|
from app.workers.quality_tasks import download_and_process_image
|
||||||
|
|
||||||
|
|
||||||
|
class EOLScraper(BaseScraper):
|
||||||
|
"""Scraper for Encyclopedia of Life (EOL) images.
|
||||||
|
|
||||||
|
EOL aggregates biodiversity data from many sources and provides
|
||||||
|
a free API with no authentication required.
|
||||||
|
"""
|
||||||
|
|
||||||
|
name = "eol"
|
||||||
|
requires_api_key = False
|
||||||
|
|
||||||
|
BASE_URL = "https://eol.org/api"
|
||||||
|
|
||||||
|
HEADERS = {
|
||||||
|
"User-Agent": "PlantGuideScraper/1.0 (Plant image collection for ML training)",
|
||||||
|
"Accept": "application/json",
|
||||||
|
}
|
||||||
|
|
||||||
|
# Map EOL license URLs to short codes
|
||||||
|
LICENSE_MAP = {
|
||||||
|
"http://creativecommons.org/publicdomain/zero/1.0/": "CC0",
|
||||||
|
"http://creativecommons.org/publicdomain/mark/1.0/": "CC0",
|
||||||
|
"http://creativecommons.org/licenses/by/2.0/": "CC-BY",
|
||||||
|
"http://creativecommons.org/licenses/by/3.0/": "CC-BY",
|
||||||
|
"http://creativecommons.org/licenses/by/4.0/": "CC-BY",
|
||||||
|
"http://creativecommons.org/licenses/by-sa/2.0/": "CC-BY-SA",
|
||||||
|
"http://creativecommons.org/licenses/by-sa/3.0/": "CC-BY-SA",
|
||||||
|
"http://creativecommons.org/licenses/by-sa/4.0/": "CC-BY-SA",
|
||||||
|
"https://creativecommons.org/publicdomain/zero/1.0/": "CC0",
|
||||||
|
"https://creativecommons.org/publicdomain/mark/1.0/": "CC0",
|
||||||
|
"https://creativecommons.org/licenses/by/2.0/": "CC-BY",
|
||||||
|
"https://creativecommons.org/licenses/by/3.0/": "CC-BY",
|
||||||
|
"https://creativecommons.org/licenses/by/4.0/": "CC-BY",
|
||||||
|
"https://creativecommons.org/licenses/by-sa/2.0/": "CC-BY-SA",
|
||||||
|
"https://creativecommons.org/licenses/by-sa/3.0/": "CC-BY-SA",
|
||||||
|
"https://creativecommons.org/licenses/by-sa/4.0/": "CC-BY-SA",
|
||||||
|
"pd": "CC0", # Public domain
|
||||||
|
"public domain": "CC0",
|
||||||
|
}
|
||||||
|
|
||||||
|
# Commercial-safe licenses
|
||||||
|
ALLOWED_LICENSES = {"CC0", "CC-BY", "CC-BY-SA"}
|
||||||
|
|
||||||
|
def scrape_species(
|
||||||
|
self,
|
||||||
|
species: Species,
|
||||||
|
db: Session,
|
||||||
|
logger: Optional[logging.Logger] = None
|
||||||
|
) -> Dict[str, int]:
|
||||||
|
"""Scrape images from EOL for a species."""
|
||||||
|
api_key = self.get_api_key(db)
|
||||||
|
rate_limit = api_key.rate_limit_per_sec if api_key else 0.5
|
||||||
|
|
||||||
|
downloaded = 0
|
||||||
|
rejected = 0
|
||||||
|
|
||||||
|
def log(level: str, msg: str):
|
||||||
|
if logger:
|
||||||
|
getattr(logger, level)(msg)
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Disable SSL verification - EOL is a trusted source and some Docker
|
||||||
|
# environments lack proper CA certificates
|
||||||
|
with httpx.Client(timeout=30, headers=self.HEADERS, verify=False) as client:
|
||||||
|
# Step 1: Search for the species
|
||||||
|
search_response = client.get(
|
||||||
|
f"{self.BASE_URL}/search/1.0.json",
|
||||||
|
params={
|
||||||
|
"q": species.scientific_name,
|
||||||
|
"page": 1,
|
||||||
|
"exact": "true",
|
||||||
|
},
|
||||||
|
)
|
||||||
|
search_response.raise_for_status()
|
||||||
|
search_data = search_response.json()
|
||||||
|
|
||||||
|
results = search_data.get("results", [])
|
||||||
|
if not results:
|
||||||
|
log("info", f" Species not found in EOL: {species.scientific_name}")
|
||||||
|
return {"downloaded": 0, "rejected": 0}
|
||||||
|
|
||||||
|
# Get the EOL page ID
|
||||||
|
eol_page_id = results[0].get("id")
|
||||||
|
if not eol_page_id:
|
||||||
|
return {"downloaded": 0, "rejected": 0}
|
||||||
|
|
||||||
|
time.sleep(1.0 / rate_limit)
|
||||||
|
|
||||||
|
# Step 2: Get page details with images
|
||||||
|
page_response = client.get(
|
||||||
|
f"{self.BASE_URL}/pages/1.0/{eol_page_id}.json",
|
||||||
|
params={
|
||||||
|
"images_per_page": 75,
|
||||||
|
"images_page": 1,
|
||||||
|
"videos_per_page": 0,
|
||||||
|
"sounds_per_page": 0,
|
||||||
|
"maps_per_page": 0,
|
||||||
|
"texts_per_page": 0,
|
||||||
|
"details": "true",
|
||||||
|
"licenses": "cc-by|cc-by-sa|pd",  # request only licenses that ALLOWED_LICENSES accepts
|
||||||
|
},
|
||||||
|
)
|
||||||
|
page_response.raise_for_status()
|
||||||
|
page_data = page_response.json()
|
||||||
|
|
||||||
|
data_objects = page_data.get("dataObjects", [])
|
||||||
|
log("debug", f" Found {len(data_objects)} media objects")
|
||||||
|
|
||||||
|
for obj in data_objects:
|
||||||
|
# Only process images
|
||||||
|
media_type = obj.get("dataType", "")
|
||||||
|
if "image" not in media_type.lower():  # also covers "StillImage"
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Get image URL
|
||||||
|
image_url = obj.get("eolMediaURL") or obj.get("mediaURL")
|
||||||
|
if not image_url:
|
||||||
|
rejected += 1
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Check license
|
||||||
|
license_url = obj.get("license", "").lower()
|
||||||
|
license_code = None
|
||||||
|
|
||||||
|
# Try to match license URL
|
||||||
|
for pattern, code in self.LICENSE_MAP.items():
|
||||||
|
if pattern in license_url:
|
||||||
|
license_code = code
|
||||||
|
break
|
||||||
|
|
||||||
|
if not license_code:
|
||||||
|
# Check for NC licenses which we reject
|
||||||
|
if "-nc" in license_url:
|
||||||
|
rejected += 1
|
||||||
|
continue
|
||||||
|
# Unknown license, skip
|
||||||
|
log("debug", f" Rejected: unknown license {license_url}")
|
||||||
|
rejected += 1
|
||||||
|
continue
|
||||||
|
|
||||||
|
if license_code not in self.ALLOWED_LICENSES:
|
||||||
|
rejected += 1
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Create unique source ID
|
||||||
|
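import hashlib  # local import; built-in hash() is salted per process, so it cannot give a stable ID
source_id = str(obj.get("dataObjectVersionID") or obj.get("identifier") or hashlib.md5(image_url.encode()).hexdigest()[:16])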
|
||||||
|
|
||||||
|
# Check if already exists
|
||||||
|
existing = db.query(Image).filter(
|
||||||
|
Image.source == self.name,
|
||||||
|
Image.source_id == source_id,
|
||||||
|
).first()
|
||||||
|
|
||||||
|
if existing:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Build attribution
|
||||||
|
agents = obj.get("agents", [])
|
||||||
|
photographer = None
|
||||||
|
rights_holder = None
|
||||||
|
|
||||||
|
for agent in agents:
|
||||||
|
role = agent.get("role", "").lower()
|
||||||
|
name = agent.get("full_name", "")
|
||||||
|
if role == "photographer":
|
||||||
|
photographer = name
|
||||||
|
elif role == "owner" or role == "rights holder":
|
||||||
|
rights_holder = name
|
||||||
|
|
||||||
|
attribution_parts = []
|
||||||
|
if photographer:
|
||||||
|
attribution_parts.append(f"Photo by {photographer}")
|
||||||
|
if rights_holder and rights_holder != photographer:
|
||||||
|
attribution_parts.append(f"Rights: {rights_holder}")
|
||||||
|
attribution_parts.append(f"via EOL ({license_code})")
|
||||||
|
attribution = " | ".join(attribution_parts)
|
||||||
|
|
||||||
|
# Create image record
|
||||||
|
image = Image(
|
||||||
|
species_id=species.id,
|
||||||
|
source=self.name,
|
||||||
|
source_id=source_id,
|
||||||
|
url=image_url,
|
||||||
|
license=license_code,
|
||||||
|
attribution=attribution,
|
||||||
|
status="pending",
|
||||||
|
)
|
||||||
|
db.add(image)
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
# Queue for download
|
||||||
|
download_and_process_image.delay(image.id)
|
||||||
|
downloaded += 1
|
||||||
|
|
||||||
|
time.sleep(1.0 / rate_limit)
|
||||||
|
|
||||||
|
except httpx.HTTPStatusError as e:
|
||||||
|
log("error", f" HTTP error for {species.scientific_name}: {e.response.status_code}")
|
||||||
|
except Exception as e:
|
||||||
|
log("error", f" Error scraping EOL for {species.scientific_name}: {e}")
|
||||||
|
|
||||||
|
return {"downloaded": downloaded, "rejected": rejected}
|
||||||
|
|
||||||
|
def test_connection(self, api_key: ApiKey) -> str:
|
||||||
|
"""Test EOL API connection."""
|
||||||
|
with httpx.Client(timeout=10, headers=self.HEADERS, verify=False) as client:
|
||||||
|
response = client.get(
|
||||||
|
f"{self.BASE_URL}/search/1.0.json",
|
||||||
|
params={"q": "Rosa", "page": 1},
|
||||||
|
)
|
||||||
|
response.raise_for_status()
|
||||||
|
data = response.json()
|
||||||
|
|
||||||
|
total = data.get("totalResults", 0)
|
||||||
|
return f"EOL API connection successful ({total} results for 'Rosa')"
|
||||||
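The EOL `LICENSE_MAP` above lists every Creative Commons URL twice, once per scheme. A minimal sketch of parsing the URL instead, assuming licenses follow the standard `creativecommons.org` path layout; `normalize_cc_license` is a hypothetical helper, not part of this commit. Unlike the table in the scraper, it reports the Public Domain Mark as `PDM` rather than `CC0`, which is this sketch's own choice:

```python
import re
from typing import Optional

# Matches http or https, with or without a trailing "legalcode" segment.
_CC_URL = re.compile(
    r"^https?://creativecommons\.org/"
    r"(?:licenses/(?P<kind>[a-z-]+)/\d+\.\d+"
    r"|publicdomain/(?P<pd>zero|mark)/\d+\.\d+)"
)


def normalize_cc_license(url: str) -> Optional[str]:
    """Map a Creative Commons URL to a short code, or None if it is not recognized."""
    match = _CC_URL.match(url.strip().lower())
    if not match:
        return None
    if match.group("pd") == "zero":
        return "CC0"
    if match.group("pd") == "mark":
        return "PDM"
    return "CC-" + match.group("kind").upper()
```

The result can then be checked against an allow-list such as `{"CC0", "CC-BY", "CC-BY-SA"}`, so non-commercial variants like `CC-BY-NC` are rejected by name rather than by substring.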
146
backend/app/scrapers/flickr.py
Normal file
@@ -0,0 +1,146 @@
|
|||||||
|
import time
|
||||||
|
import logging
|
||||||
|
from typing import Dict, Optional
|
||||||
|
|
||||||
|
import httpx
|
||||||
|
from sqlalchemy.orm import Session
|
||||||
|
|
||||||
|
from app.scrapers.base import BaseScraper
|
||||||
|
from app.models import Species, Image, ApiKey
|
||||||
|
from app.workers.quality_tasks import download_and_process_image
|
||||||
|
|
||||||
|
|
||||||
|
class FlickrScraper(BaseScraper):
|
||||||
|
"""Scraper for Flickr images via their API."""
|
||||||
|
|
||||||
|
name = "flickr"
|
||||||
|
requires_api_key = True
|
||||||
|
|
||||||
|
BASE_URL = "https://api.flickr.com/services/rest/"
|
||||||
|
|
||||||
|
HEADERS = {
|
||||||
|
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Commercial-safe license IDs
|
||||||
|
# 4 = CC BY 2.0, 7 = No known copyright, 8 = US Gov, 9 = CC0
|
||||||
|
ALLOWED_LICENSES = "4,7,8,9"
|
||||||
|
|
||||||
|
LICENSE_MAP = {
|
||||||
|
"4": "CC-BY",
|
||||||
|
"7": "NO-KNOWN-COPYRIGHT",
|
||||||
|
"8": "US-GOV",
|
||||||
|
"9": "CC0",
|
||||||
|
}
|
||||||
|
|
||||||
|
def scrape_species(
|
||||||
|
self,
|
||||||
|
species: Species,
|
||||||
|
db: Session,
|
||||||
|
logger: Optional[logging.Logger] = None
|
||||||
|
) -> Dict[str, int]:
|
||||||
|
"""Scrape images from Flickr for a species."""
|
||||||
|
api_key = self.get_api_key(db)
|
||||||
|
if not api_key:
|
||||||
|
return {"downloaded": 0, "rejected": 0, "error": "No API key configured"}
|
||||||
|
|
||||||
|
rate_limit = api_key.rate_limit_per_sec or 1.0  # assumed fallback of 1 request/sec; avoids dividing by None or 0
|
||||||
|
|
||||||
|
downloaded = 0
|
||||||
|
rejected = 0
|
||||||
|
|
||||||
|
try:
|
||||||
|
params = {
|
||||||
|
"method": "flickr.photos.search",
|
||||||
|
"api_key": api_key.api_key,
|
||||||
|
"text": species.scientific_name,
|
||||||
|
"license": self.ALLOWED_LICENSES,
|
||||||
|
"content_type": 1, # Photos only
|
||||||
|
"media": "photos",
|
||||||
|
"extras": "license,url_l,url_o,owner_name",
|
||||||
|
"per_page": 100,
|
||||||
|
"format": "json",
|
||||||
|
"nojsoncallback": 1,
|
||||||
|
}
|
||||||
|
|
||||||
|
with httpx.Client(timeout=30, headers=self.HEADERS) as client:
|
||||||
|
response = client.get(self.BASE_URL, params=params)
|
||||||
|
response.raise_for_status()
|
||||||
|
data = response.json()
|
||||||
|
|
||||||
|
if data.get("stat") != "ok":
|
||||||
|
return {"downloaded": 0, "rejected": 0, "error": data.get("message")}
|
||||||
|
|
||||||
|
photos = data.get("photos", {}).get("photo", [])
|
||||||
|
|
||||||
|
for photo in photos:
|
||||||
|
# Get best URL (original or large)
|
||||||
|
url = photo.get("url_o") or photo.get("url_l")
|
||||||
|
if not url:
|
||||||
|
rejected += 1
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Get license
|
||||||
|
license_id = str(photo.get("license", ""))
|
||||||
|
license_code = self.LICENSE_MAP.get(license_id, "UNKNOWN")
|
||||||
|
if license_code == "UNKNOWN":
|
||||||
|
rejected += 1
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Check if already exists
|
||||||
|
source_id = str(photo.get("id"))
|
||||||
|
existing = db.query(Image).filter(
|
||||||
|
Image.source == self.name,
|
||||||
|
Image.source_id == source_id,
|
||||||
|
).first()
|
||||||
|
|
||||||
|
if existing:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Build attribution
|
||||||
|
owner = photo.get("ownername", "Unknown")
|
||||||
|
attribution = f"Photo by {owner} on Flickr ({license_code})"
|
||||||
|
|
||||||
|
# Create image record
|
||||||
|
image = Image(
|
||||||
|
species_id=species.id,
|
||||||
|
source=self.name,
|
||||||
|
source_id=source_id,
|
||||||
|
url=url,
|
||||||
|
license=license_code,
|
||||||
|
attribution=attribution,
|
||||||
|
status="pending",
|
||||||
|
)
|
||||||
|
db.add(image)
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
# Queue for download
|
||||||
|
download_and_process_image.delay(image.id)
|
||||||
|
downloaded += 1
|
||||||
|
|
||||||
|
# Rate limiting
|
||||||
|
time.sleep(1.0 / rate_limit)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
(logger.error if logger else print)(f"Error scraping Flickr for {species.scientific_name}: {e}")  # honor the logger argument when provided
|
||||||
|
|
||||||
|
return {"downloaded": downloaded, "rejected": rejected}
|
||||||
|
|
||||||
|
def test_connection(self, api_key: ApiKey) -> str:
|
||||||
|
"""Test Flickr API connection."""
|
||||||
|
params = {
|
||||||
|
"method": "flickr.test.echo",
|
||||||
|
"api_key": api_key.api_key,
|
||||||
|
"format": "json",
|
||||||
|
"nojsoncallback": 1,
|
||||||
|
}
|
||||||
|
|
||||||
|
with httpx.Client(timeout=10, headers=self.HEADERS) as client:
|
||||||
|
response = client.get(self.BASE_URL, params=params)
|
||||||
|
response.raise_for_status()
|
||||||
|
data = response.json()
|
||||||
|
|
||||||
|
if data.get("stat") != "ok":
|
||||||
|
raise Exception(data.get("message", "API test failed"))
|
||||||
|
|
||||||
|
return "Flickr API connection successful"
|
||||||
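The per-photo checks in the Flickr scraper above (pick the best URL, map the license ID, reject otherwise) can be sketched as a pure function, which makes them testable without a database or network. `select_flickr_photo` is a hypothetical name, and the sample dictionaries in the usage below are made up for illustration rather than real API output:

```python
from typing import Dict, Optional, Tuple

# Same commercial-safe license IDs as the scraper's LICENSE_MAP.
FLICKR_LICENSES: Dict[str, str] = {
    "4": "CC-BY",
    "7": "NO-KNOWN-COPYRIGHT",
    "8": "US-GOV",
    "9": "CC0",
}


def select_flickr_photo(photo: dict) -> Optional[Tuple[str, str]]:
    """Return (url, license_code) for an acceptable photo, or None to reject it."""
    # Prefer the original size, then fall back to the large size.
    url = photo.get("url_o") or photo.get("url_l")
    license_code = FLICKR_LICENSES.get(str(photo.get("license", "")))
    if not url or license_code is None:
        return None
    return url, license_code
```

A photo dictionary with `license` set to `"4"` and only `url_l` present would be accepted with the large URL, while one carrying an unlisted license ID would be rejected.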
159
backend/app/scrapers/gbif.py
Normal file
@@ -0,0 +1,159 @@
|
|||||||
|
import time
|
||||||
|
import logging
|
||||||
|
from typing import Dict, Optional
|
||||||
|
|
||||||
|
import httpx
|
||||||
|
from sqlalchemy.orm import Session
|
||||||
|
|
||||||
|
from app.scrapers.base import BaseScraper
|
||||||
|
from app.models import Species, Image, ApiKey
|
||||||
|
from app.workers.quality_tasks import download_and_process_image
|
||||||
|
|
||||||
|
|
||||||
|
class GBIFScraper(BaseScraper):
|
||||||
|
"""Scraper for GBIF (Global Biodiversity Information Facility) images."""
|
||||||
|
|
||||||
|
name = "gbif"
|
||||||
|
requires_api_key = False # GBIF is free to use
|
||||||
|
|
||||||
|
BASE_URL = "https://api.gbif.org/v1"
|
||||||
|
|
||||||
|
HEADERS = {
|
||||||
|
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Map GBIF license URLs to short codes
|
||||||
|
LICENSE_MAP = {
|
||||||
|
"http://creativecommons.org/publicdomain/zero/1.0/legalcode": "CC0",
|
||||||
|
"http://creativecommons.org/licenses/by/4.0/legalcode": "CC-BY",
|
||||||
|
"http://creativecommons.org/licenses/by-nc/4.0/legalcode": "CC-BY-NC",
|
||||||
|
"http://creativecommons.org/publicdomain/zero/1.0/": "CC0",
|
||||||
|
"http://creativecommons.org/licenses/by/4.0/": "CC-BY",
|
||||||
|
"http://creativecommons.org/licenses/by-nc/4.0/": "CC-BY-NC",
|
||||||
|
"https://creativecommons.org/publicdomain/zero/1.0/legalcode": "CC0",
|
||||||
|
"https://creativecommons.org/licenses/by/4.0/legalcode": "CC-BY",
|
||||||
|
"https://creativecommons.org/licenses/by-nc/4.0/legalcode": "CC-BY-NC",
|
||||||
|
"https://creativecommons.org/publicdomain/zero/1.0/": "CC0",
|
||||||
|
"https://creativecommons.org/licenses/by/4.0/": "CC-BY",
|
||||||
|
"https://creativecommons.org/licenses/by-nc/4.0/": "CC-BY-NC",
|
||||||
|
}
|
||||||
|
|
||||||
|
# Only allow commercial-safe licenses
|
||||||
|
ALLOWED_LICENSES = {"CC0", "CC-BY"}
|
||||||
|
|
||||||
|
def scrape_species(
|
||||||
|
self,
|
||||||
|
species: Species,
|
||||||
|
db: Session,
|
||||||
|
logger: Optional[logging.Logger] = None
|
||||||
|
) -> Dict[str, int]:
|
||||||
|
"""Scrape images from GBIF for a species."""
|
||||||
|
# GBIF doesn't require API key, but we still respect rate limits
|
||||||
|
api_key = self.get_api_key(db)
|
||||||
|
rate_limit = api_key.rate_limit_per_sec if api_key else 1.0
|
||||||
|
|
||||||
|
downloaded = 0
|
||||||
|
rejected = 0
|
||||||
|
|
||||||
|
try:
|
||||||
|
params = {
|
||||||
|
"scientificName": species.scientific_name,
|
||||||
|
"mediaType": "StillImage",
|
||||||
|
"limit": 100,
|
||||||
|
}
|
||||||
|
|
||||||
|
with httpx.Client(timeout=30, headers=self.HEADERS) as client:
|
||||||
|
response = client.get(
|
||||||
|
f"{self.BASE_URL}/occurrence/search",
|
||||||
|
params=params,
|
||||||
|
)
|
||||||
|
response.raise_for_status()
|
||||||
|
data = response.json()
|
||||||
|
|
||||||
|
results = data.get("results", [])
|
||||||
|
|
||||||
|
for occurrence in results:
|
||||||
|
media_list = occurrence.get("media", [])
|
||||||
|
|
||||||
|
for media in media_list:
|
||||||
|
# Only process still images
|
||||||
|
if media.get("type") != "StillImage":
|
||||||
|
continue
|
||||||
|
|
||||||
|
url = media.get("identifier")
|
||||||
|
if not url:
|
||||||
|
rejected += 1
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Check license
|
||||||
|
license_url = media.get("license", "")
|
||||||
|
license_code = self.LICENSE_MAP.get(license_url)
|
||||||
|
|
||||||
|
if not license_code or license_code not in self.ALLOWED_LICENSES:
|
||||||
|
rejected += 1
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Create unique source ID from occurrence key and media URL
|
||||||
|
occurrence_key = occurrence.get("key", "")
|
||||||
|
# Use hash of URL for uniqueness within occurrence
|
||||||
|
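import hashlib  # local import; built-in hash() is salted per process, so IDs would change between runs
url_hash = hashlib.md5(url.encode()).hexdigest()[:8]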
|
||||||
|
source_id = f"{occurrence_key}_{url_hash}"
|
||||||
|
|
||||||
|
# Check if already exists
|
||||||
|
existing = db.query(Image).filter(
|
||||||
|
Image.source == self.name,
|
||||||
|
Image.source_id == source_id,
|
||||||
|
).first()
|
||||||
|
|
||||||
|
if existing:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Build attribution
|
||||||
|
creator = media.get("creator", "")
|
||||||
|
rights_holder = media.get("rightsHolder", "")
|
||||||
|
attribution_parts = []
|
||||||
|
if creator:
|
||||||
|
attribution_parts.append(f"Photo by {creator}")
|
||||||
|
if rights_holder and rights_holder != creator:
|
||||||
|
attribution_parts.append(f"Rights: {rights_holder}")
|
||||||
|
attribution_parts.append(f"via GBIF ({license_code})")
|
||||||
|
attribution = " | ".join(attribution_parts)  # never empty: the "via GBIF" part is always appended
|
||||||
|
|
||||||
|
# Create image record
|
||||||
|
image = Image(
|
||||||
|
species_id=species.id,
|
||||||
|
source=self.name,
|
||||||
|
source_id=source_id,
|
||||||
|
url=url,
|
||||||
|
license=license_code,
|
||||||
|
attribution=attribution,
|
||||||
|
status="pending",
|
||||||
|
)
|
||||||
|
db.add(image)
|
||||||
|
db.commit()
|
||||||
|
|
||||||
|
# Queue for download
|
||||||
|
download_and_process_image.delay(image.id)
|
||||||
|
downloaded += 1
|
||||||
|
|
||||||
|
# Rate limiting
|
||||||
|
time.sleep(1.0 / rate_limit)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
(logger.error if logger else print)(f"Error scraping GBIF for {species.scientific_name}: {e}")  # honor the logger argument when provided
|
||||||
|
|
||||||
|
return {"downloaded": downloaded, "rejected": rejected}
|
||||||
|
|
||||||
|
def test_connection(self, api_key: ApiKey) -> str:
|
||||||
|
"""Test GBIF API connection."""
|
||||||
|
# GBIF doesn't require authentication, just test the endpoint
|
||||||
|
with httpx.Client(timeout=10, headers=self.HEADERS) as client:
|
||||||
|
response = client.get(
|
||||||
|
f"{self.BASE_URL}/occurrence/search",
|
||||||
|
params={"limit": 1},
|
||||||
|
)
|
||||||
|
response.raise_for_status()
|
||||||
|
data = response.json()
|
||||||
|
|
||||||
|
count = data.get("count", 0)
|
||||||
|
return f"GBIF API connection successful ({count:,} total occurrences available)"
|
||||||
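The EOL and GBIF scrapers above build their attribution strings the same way: an optional photographer, an optional rights holder when it differs, then the source and license. A minimal sketch of that shared logic; `build_attribution` is a hypothetical helper that does not exist in this commit:

```python
from typing import List, Optional


def build_attribution(
    source: str,
    license_code: str,
    creator: Optional[str] = None,
    rights_holder: Optional[str] = None,
) -> str:
    """Join the available credit parts with ' | ', always ending with the source."""
    parts: List[str] = []
    if creator:
        parts.append(f"Photo by {creator}")
    # Skip the rights holder when it merely repeats the creator.
    if rights_holder and rights_holder != creator:
        parts.append(f"Rights: {rights_holder}")
    parts.append(f"via {source} ({license_code})")
    return " | ".join(parts)
```

Because the source part is always appended, the result is never empty, so no separate fallback string is needed.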
144
backend/app/scrapers/inaturalist.py
Normal file
@@ -0,0 +1,144 @@
import time
import logging
from typing import Dict, Optional

import httpx
from sqlalchemy.orm import Session

from app.scrapers.base import BaseScraper
from app.models import Species, Image, ApiKey
from app.workers.quality_tasks import download_and_process_image


class INaturalistScraper(BaseScraper):
    """Scraper for iNaturalist observations via their API."""

    name = "inaturalist"
    requires_api_key = False  # Public API, but rate limited

    BASE_URL = "https://api.inaturalist.org/v1"

    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
    }

    # Commercial-safe licenses (CC0, CC-BY)
    ALLOWED_LICENSES = ["cc0", "cc-by"]

    def scrape_species(
        self,
        species: Species,
        db: Session,
        logger: Optional[logging.Logger] = None
    ) -> Dict[str, int]:
        """Scrape images from iNaturalist for a species."""
        api_key = self.get_api_key(db)
        rate_limit = api_key.rate_limit_per_sec if api_key else 1.0

        downloaded = 0
        rejected = 0

        def log(level: str, msg: str):
            # Logging is optional; do nothing when no logger was passed in
            if logger:
                getattr(logger, level)(msg)

        try:
            # Search for observations of this species
            params = {
                "taxon_name": species.scientific_name,
                "quality_grade": "research",  # Only research-grade
                "photos": True,
                "per_page": 200,
                "order_by": "votes",
                "license": ",".join(self.ALLOWED_LICENSES),
            }

            log("debug", f" API request params: {params}")

            with httpx.Client(timeout=30, headers=self.HEADERS) as client:
                response = client.get(
                    f"{self.BASE_URL}/observations",
                    params=params,
                )
                log("debug", f" API response status: {response.status_code}")
                response.raise_for_status()
                data = response.json()

                observations = data.get("results", [])
                total_results = data.get("total_results", 0)
                log("debug", f" Found {len(observations)} observations (total: {total_results})")

                if not observations:
                    log("info", f" No observations found for {species.scientific_name}")
                    return {"downloaded": 0, "rejected": 0}

                for obs in observations:
                    photos = obs.get("photos", [])
                    for photo in photos:
                        # Check license (a missing or null code counts as unlicensed)
                        license_code = (photo.get("license_code") or "").lower()
                        if license_code not in self.ALLOWED_LICENSES:
                            log("debug", f" Rejected photo {photo.get('id')}: license={license_code}")
                            rejected += 1
                            continue

                        # Get image URL (thumbnail-sized as returned by the API)
                        url = photo.get("url", "")
                        if not url:
                            log("debug", f" Skipped photo {photo.get('id')}: no URL")
                            continue

                        # Convert to larger size
                        url = url.replace("square", "large")

                        # Check if already exists
                        source_id = str(photo.get("id"))
                        existing = db.query(Image).filter(
                            Image.source == self.name,
                            Image.source_id == source_id,
                        ).first()

                        if existing:
                            log("debug", f" Skipped photo {source_id}: already exists")
                            continue

                        # Create image record
                        image = Image(
                            species_id=species.id,
                            source=self.name,
                            source_id=source_id,
                            url=url,
                            license=license_code.upper(),
                            attribution=photo.get("attribution", ""),
                            status="pending",
                        )
                        db.add(image)
                        db.commit()

                        # Queue for download
                        download_and_process_image.delay(image.id)
                        downloaded += 1
                        log("debug", f" Queued photo {source_id} for download")

                        # Rate limiting
                        time.sleep(1.0 / rate_limit)

        except httpx.HTTPStatusError as e:
            log("error", f" HTTP error for {species.scientific_name}: {e.response.status_code} - {e.response.text}")
        except httpx.RequestError as e:
            log("error", f" Request error for {species.scientific_name}: {e}")
        except Exception as e:
            log("error", f" Error scraping iNaturalist for {species.scientific_name}: {e}")

        return {"downloaded": downloaded, "rejected": rejected}

    def test_connection(self, api_key: ApiKey) -> str:
        """Test iNaturalist API connection."""
        with httpx.Client(timeout=10, headers=self.HEADERS) as client:
            response = client.get(
                f"{self.BASE_URL}/observations",
                params={"per_page": 1},
            )
            response.raise_for_status()

        return "iNaturalist API connection successful"
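The license check and thumbnail-to-large URL rewrite in `scrape_species` can be isolated into small pure functions, which makes them easy to test without touching the network. The following is a minimal sketch, not part of the commit: `normalize_license`, `is_allowed` and `upgrade_photo_url` are hypothetical helper names, and the example URLs are placeholders rather than real iNaturalist photo URLs.

```python
from typing import Optional

# Mirrors INaturalistScraper.ALLOWED_LICENSES (copied here so the sketch is self-contained)
ALLOWED_LICENSES = ("cc0", "cc-by")


def normalize_license(raw: Optional[str]) -> str:
    """Lower-case a license code, treating None or an empty string as unlicensed."""
    return (raw or "").lower()


def is_allowed(raw: Optional[str]) -> bool:
    """Return True only for licenses on the commercial-safe allow list."""
    return normalize_license(raw) in ALLOWED_LICENSES


def upgrade_photo_url(url: str, size: str = "large") -> str:
    """Swap a leading 'square' in the final path segment for another size.

    Hypothetical helper: only the last segment is rewritten, so a 'square'
    that appears earlier in the URL is left untouched.
    """
    head, sep, tail = url.rpartition("/")
    if tail.startswith("square"):
        tail = size + tail[len("square"):]
    return head + sep + tail
```

Compared with a plain `str.replace("square", "large")`, restricting the rewrite to the last path segment avoids accidentally changing other parts of the URL; whether that matters depends on the URL shapes the API actually returns.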
154
backend/app/scrapers/trefle.py
Normal file
@@ -0,0 +1,154 @@
import time
import logging
from typing import Dict, Optional

import httpx
from sqlalchemy.orm import Session

from app.scrapers.base import BaseScraper
from app.models import Species, Image, ApiKey
from app.workers.quality_tasks import download_and_process_image


class TrefleScraper(BaseScraper):
    """Scraper for Trefle.io plant database."""

    name = "trefle"
    requires_api_key = True

    BASE_URL = "https://trefle.io/api/v1"

    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
    }

    def scrape_species(
        self,
        species: Species,
        db: Session,
        logger: Optional[logging.Logger] = None
    ) -> Dict[str, int]:
        """Scrape images from Trefle for a species."""
        api_key = self.get_api_key(db)
        if not api_key:
            return {"downloaded": 0, "rejected": 0, "error": "No API key configured"}

        rate_limit = api_key.rate_limit_per_sec

        downloaded = 0
        rejected = 0

        try:
            # Search for the species
            params = {
                "token": api_key.api_key,
                "q": species.scientific_name,
            }

            with httpx.Client(timeout=30, headers=self.HEADERS) as client:
                response = client.get(
                    f"{self.BASE_URL}/plants/search",
                    params=params,
                )
                response.raise_for_status()
                data = response.json()

                plants = data.get("data", [])

                for plant in plants:
                    # Get plant details for more images
                    plant_id = plant.get("id")
                    if not plant_id:
                        continue

                    detail_response = client.get(
                        f"{self.BASE_URL}/plants/{plant_id}",
                        params={"token": api_key.api_key},
                    )

                    if detail_response.status_code != 200:
                        continue

                    plant_detail = detail_response.json().get("data", {})

                    # Get main image
                    main_image = plant_detail.get("image_url")
                    if main_image:
                        source_id = f"main_{plant_id}"
                        existing = db.query(Image).filter(
                            Image.source == self.name,
                            Image.source_id == source_id,
                        ).first()

                        if not existing:
                            image = Image(
                                species_id=species.id,
                                source=self.name,
                                source_id=source_id,
                                url=main_image,
                                license="TREFLE",  # Trefle's own license
                                attribution="Trefle.io Plant Database",
                                status="pending",
                            )
                            db.add(image)
                            db.commit()
                            download_and_process_image.delay(image.id)
                            downloaded += 1

                    # Get additional images from species detail
                    # ("images" may be present but null, so fall back to an empty dict)
                    images = plant_detail.get("images") or {}
                    for image_type, image_list in images.items():
                        if not isinstance(image_list, list):
                            continue

                        for img in image_list:
                            url = img.get("image_url")
                            if not url:
                                continue

                            img_id = img.get("id", url.split("/")[-1])
                            source_id = f"{image_type}_{img_id}"

                            existing = db.query(Image).filter(
                                Image.source == self.name,
                                Image.source_id == source_id,
                            ).first()

                            if existing:
                                continue

                            copyright_info = img.get("copyright", "")
                            image = Image(
                                species_id=species.id,
                                source=self.name,
                                source_id=source_id,
                                url=url,
                                license="TREFLE",
                                attribution=copyright_info or "Trefle.io",
                                status="pending",
                            )
                            db.add(image)
                            db.commit()
                            download_and_process_image.delay(image.id)
                            downloaded += 1

                    # Rate limiting
                    time.sleep(1.0 / rate_limit)

        except Exception as e:
            # Prefer the job logger when one was supplied; fall back to stdout
            message = f"Error scraping Trefle for {species.scientific_name}: {e}"
            if logger:
                logger.error(message)
            else:
                print(message)

        return {"downloaded": downloaded, "rejected": rejected}

    def test_connection(self, api_key: ApiKey) -> str:
        """Test Trefle API connection."""
        params = {"token": api_key.api_key}

        with httpx.Client(timeout=10, headers=self.HEADERS) as client:
            response = client.get(
                f"{self.BASE_URL}/plants",
                params=params,
            )
            response.raise_for_status()

        return "Trefle API connection successful"
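All three scrapers throttle with `time.sleep(1.0 / rate_limit)`, which always sleeps the full interval and divides by zero if a key is stored with a rate of 0. One possible alternative is a small limiter that only sleeps for the time still remaining since the previous call. This is a sketch under assumptions: `RateLimiter` is a hypothetical class that does not exist in the project, and the injectable `clock` and `sleep` parameters are there purely to make it testable.

```python
import time
from typing import Callable, Optional


class RateLimiter:
    """Enforce a minimum interval between calls (hypothetical helper)."""

    def __init__(
        self,
        per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if per_second <= 0:
            raise ValueError("per_second must be positive")
        self._interval = 1.0 / per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        # The next call may happen one interval after this one
        self._next_allowed = now + self._interval
        return delay
```

A scraper would create one limiter per run and call `limiter.wait()` before each HTTP request, so time already spent on database work counts towards the interval.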
146
backend/app/scrapers/wikimedia.py
Normal file
@@ -0,0 +1,146 @@
import re
import time
import logging
from typing import Dict, Optional

import httpx
from sqlalchemy.orm import Session

from app.scrapers.base import BaseScraper
from app.models import Species, Image, ApiKey
from app.workers.quality_tasks import download_and_process_image


class WikimediaScraper(BaseScraper):
    """Scraper for Wikimedia Commons images."""

    name = "wikimedia"
    requires_api_key = False

    BASE_URL = "https://commons.wikimedia.org/w/api.php"

    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
    }

    def scrape_species(
        self,
        species: Species,
        db: Session,
        logger: Optional[logging.Logger] = None
    ) -> Dict[str, int]:
        """Scrape images from Wikimedia Commons for a species."""
        api_key = self.get_api_key(db)
        rate_limit = api_key.rate_limit_per_sec if api_key else 1.0

        downloaded = 0
        rejected = 0

        try:
            # Full-text search for image files matching the species name
            search_term = species.scientific_name

            params = {
                "action": "query",
                "format": "json",
                "generator": "search",
                "gsrsearch": f"filetype:bitmap {search_term}",
                "gsrnamespace": 6,  # File namespace
                "gsrlimit": 50,
                "prop": "imageinfo",
                "iiprop": "url|extmetadata|size",
            }

            with httpx.Client(timeout=30, headers=self.HEADERS) as client:
                response = client.get(self.BASE_URL, params=params)
                response.raise_for_status()
                data = response.json()

                pages = data.get("query", {}).get("pages", {})

                for page_id, page in pages.items():
                    # Negative page IDs mark missing or invalid pages
                    if int(page_id) < 0:
                        continue

                    imageinfo = page.get("imageinfo", [{}])[0]
                    url = imageinfo.get("url", "")
                    if not url:
                        continue

                    # Check size
                    width = imageinfo.get("width", 0)
                    height = imageinfo.get("height", 0)
                    if width < 256 or height < 256:
                        rejected += 1
                        continue

                    # Get license from metadata
                    metadata = imageinfo.get("extmetadata", {})
                    license_info = metadata.get("LicenseShortName", {}).get("value", "")

                    # Filter for commercial-safe licenses
                    license_upper = license_info.upper()
                    if "CC BY" in license_upper or "CC0" in license_upper or "PUBLIC DOMAIN" in license_upper:
                        license_code = license_info
                    else:
                        rejected += 1
                        continue

                    # Check if already exists
                    source_id = str(page_id)
                    existing = db.query(Image).filter(
                        Image.source == self.name,
                        Image.source_id == source_id,
                    ).first()

                    if existing:
                        continue

                    # Get attribution
                    artist = metadata.get("Artist", {}).get("value", "Unknown")
                    # Clean HTML from artist
                    if "<" in artist:
                        artist = re.sub(r"<[^>]+>", "", artist).strip()

                    attribution = f"{artist} via Wikimedia Commons ({license_code})"

                    # Create image record
                    image = Image(
                        species_id=species.id,
                        source=self.name,
                        source_id=source_id,
                        url=url,
                        license=license_code,
                        attribution=attribution,
                        width=width,
                        height=height,
                        status="pending",
                    )
                    db.add(image)
                    db.commit()

                    # Queue for download
                    download_and_process_image.delay(image.id)
                    downloaded += 1

                    # Rate limiting
                    time.sleep(1.0 / rate_limit)

        except Exception as e:
            # Prefer the job logger when one was supplied; fall back to stdout
            message = f"Error scraping Wikimedia for {species.scientific_name}: {e}"
            if logger:
                logger.error(message)
            else:
                print(message)

        return {"downloaded": downloaded, "rejected": rejected}

    def test_connection(self, api_key: ApiKey) -> str:
        """Test Wikimedia API connection."""
        params = {
            "action": "query",
            "format": "json",
            "meta": "siteinfo",
        }

        with httpx.Client(timeout=10, headers=self.HEADERS) as client:
            response = client.get(self.BASE_URL, params=params)
            response.raise_for_status()

        return "Wikimedia Commons API connection successful"
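The attribution and license handling in the Wikimedia scraper can be sketched as two standalone functions. This is an illustration under assumptions, not the project's code: `clean_artist` and `is_commercial_safe` are hypothetical names, and the `NC` exclusion is a stricter variant than the substring test used in the scraper, added here because a plain `"CC BY" in ...` check would also match a string such as `CC BY-NC 4.0`.

```python
import html
import re


def clean_artist(raw: str) -> str:
    """Strip HTML tags and entities from an extmetadata Artist value."""
    text = re.sub(r"<[^>]+>", "", raw)
    return html.unescape(text).strip() or "Unknown"


def is_commercial_safe(license_short_name: str) -> bool:
    """Accept CC BY variants, CC0 and public domain; reject NonCommercial terms."""
    name = license_short_name.upper()
    # Split "CC BY-NC 4.0" into tokens so "NC" is matched as a whole word
    if "NC" in name.replace("-", " ").split():
        return False
    return "CC BY" in name or "CC0" in name or "PUBLIC DOMAIN" in name
```

Usage would look like `attribution = f"{clean_artist(artist)} via Wikimedia Commons ({license_info})"` once `is_commercial_safe(license_info)` has passed.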
1
backend/app/utils/__init__.py
Normal file
@@ -0,0 +1 @@
# Utility functions
80
backend/app/utils/dedup.py
Normal file
@@ -0,0 +1,80 @@
"""Image deduplication utilities using perceptual hashing."""

from typing import Optional

import imagehash
from PIL import Image as PILImage


def calculate_phash(image_path: str) -> Optional[str]:
    """
    Calculate perceptual hash for an image.

    Args:
        image_path: Path to image file

    Returns:
        Hex string of perceptual hash, or None if failed
    """
    try:
        with PILImage.open(image_path) as img:
            return str(imagehash.phash(img))
    except Exception:
        return None


def calculate_dhash(image_path: str) -> Optional[str]:
    """
    Calculate difference hash for an image.
    Faster but less accurate than phash.

    Args:
        image_path: Path to image file

    Returns:
        Hex string of difference hash, or None if failed
    """
    try:
        with PILImage.open(image_path) as img:
            return str(imagehash.dhash(img))
    except Exception:
        return None


def hashes_are_similar(hash1: str, hash2: str, threshold: int = 10) -> bool:
    """
    Check if two hashes are similar (potential duplicates).

    Args:
        hash1: First hash string
        hash2: Second hash string
        threshold: Maximum Hamming distance (default 10)

    Returns:
        True if hashes are similar
    """
    try:
        h1 = imagehash.hex_to_hash(hash1)
        h2 = imagehash.hex_to_hash(hash2)
        return (h1 - h2) <= threshold
    except Exception:
        return False


def hamming_distance(hash1: str, hash2: str) -> int:
    """
    Calculate Hamming distance between two hashes.

    Args:
        hash1: First hash string
        hash2: Second hash string

    Returns:
        Hamming distance (0 = identical, higher = more different)
    """
    try:
        h1 = imagehash.hex_to_hash(hash1)
        h2 = imagehash.hex_to_hash(hash2)
        return int(h1 - h2)
    except Exception:
        return 64  # Maximum distance
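The Hamming distance that `imagehash` computes when two hashes are subtracted can be reproduced on the stored hex strings with plain integer arithmetic, which is handy for checking thresholds without loading any images. The sketch below assumes 64-bit hashes stored as 16 hex characters (the size implied by the "maximum distance" of 64 above); `hamming_distance_hex` and `find_duplicates` are hypothetical names, and the pairwise scan is quadratic, so it only suits small batches.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def hamming_distance_hex(hash1: str, hash2: str) -> int:
    """Count differing bits between two equal-length hex hash strings."""
    if len(hash1) != len(hash2):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the hashes disagree
    return bin(int(hash1, 16) ^ int(hash2, 16)).count("1")


def find_duplicates(
    hashes: Dict[str, str], threshold: int = 10
) -> List[Tuple[str, str, int]]:
    """Return (id_a, id_b, distance) for every pair within the threshold."""
    pairs = []
    for id_a, id_b in combinations(sorted(hashes), 2):
        distance = hamming_distance_hex(hashes[id_a], hashes[id_b])
        if distance <= threshold:
            pairs.append((id_a, id_b, distance))
    return pairs
```

With the default threshold of 10, two hashes that differ in a single bit are reported as a duplicate pair, while a hash and its bitwise inverse (distance 64) are not.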
109
backend/app/utils/image_quality.py
Normal file
@@ -0,0 +1,109 @@
"""Image quality assessment utilities."""

from typing import Optional

import numpy as np
from PIL import Image as PILImage
from scipy import ndimage


def calculate_blur_score(image_path: str) -> float:
    """
    Calculate blur score using Laplacian variance.
    Higher score = sharper image.

    Args:
        image_path: Path to image file

    Returns:
        Variance of Laplacian (higher = sharper)
    """
    try:
        with PILImage.open(image_path) as img:
            # Use floats: on a uint8 array the Laplacian would wrap around
            # because ndimage.laplace keeps the input dtype by default
            img_array = np.array(img.convert("L"), dtype=np.float64)
        laplacian = ndimage.laplace(img_array)
        return float(np.var(laplacian))
    except Exception:
        return 0.0


def is_too_blurry(image_path: str, threshold: float = 100.0) -> bool:
    """
    Check if image is too blurry for training.

    Args:
        image_path: Path to image file
        threshold: Minimum acceptable blur score (default 100)

    Returns:
        True if image is too blurry
    """
    score = calculate_blur_score(image_path)
    return score < threshold


def get_image_dimensions(image_path: str) -> tuple[int, int]:
    """
    Get image dimensions.

    Args:
        image_path: Path to image file

    Returns:
        Tuple of (width, height), or (0, 0) if the image cannot be read
    """
    try:
        with PILImage.open(image_path) as img:
            return img.size
    except Exception:
        return (0, 0)


def is_too_small(image_path: str, min_size: int = 256) -> bool:
    """
    Check if image is too small for training.

    Args:
        image_path: Path to image file
        min_size: Minimum dimension size (default 256)

    Returns:
        True if image is too small
    """
    width, height = get_image_dimensions(image_path)
    return width < min_size or height < min_size


def resize_image(
    image_path: str,
    output_path: Optional[str] = None,
    max_size: int = 512,
    quality: int = 95,
) -> bool:
    """
    Resize image to max dimension while preserving aspect ratio.

    Args:
        image_path: Path to input image
        output_path: Path for output (defaults to overwriting input)
        max_size: Maximum dimension size (default 512)
        quality: JPEG quality (default 95)

    Returns:
        True if successful
    """
    try:
        output_path = output_path or image_path

        with PILImage.open(image_path) as img:
            # Only resize if larger than max_size
            if max(img.size) > max_size:
                img.thumbnail((max_size, max_size), PILImage.Resampling.LANCZOS)

            # Convert to RGB if necessary (for JPEG)
            if img.mode in ("RGBA", "P"):
                img = img.convert("RGB")

            img.save(output_path, "JPEG", quality=quality)

        return True
    except Exception:
        return False
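To see why the variance of the Laplacian works as a sharpness score, it helps to compute it by hand on a tiny grid. The following is a dependency-free sketch, assuming a 4-neighbour Laplacian evaluated on interior pixels only; `laplacian_variance` is a hypothetical name, and because `scipy.ndimage.laplace` also processes border pixels, absolute values will differ from `calculate_blur_score`.

```python
from statistics import pvariance
from typing import List, Sequence


def laplacian_variance(pixels: Sequence[Sequence[float]]) -> float:
    """Variance of the 4-neighbour Laplacian over interior pixels."""
    rows = len(pixels)
    cols = len(pixels[0]) if rows else 0
    responses: List[float] = []
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            # Sum of the four neighbours minus four times the centre pixel
            responses.append(
                pixels[y - 1][x]
                + pixels[y + 1][x]
                + pixels[y][x - 1]
                + pixels[y][x + 1]
                - 4 * pixels[y][x]
            )
    # Grids smaller than 3x3 have no interior pixels to measure
    return float(pvariance(responses)) if responses else 0.0


# A smooth horizontal ramp has zero second derivative everywhere
gradient = [[10.0 * x for x in range(5)] for _ in range(5)]
# A checkerboard alternates sharply between black and white
checkerboard = [[255.0 if (x + y) % 2 == 0 else 0.0 for x in range(5)] for y in range(5)]
```

The ramp scores zero because a linear gradient has no edges, while the checkerboard scores very high; blurring an image moves it towards the first case, which is what the threshold in `is_too_blurry` relies on.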
92
backend/app/utils/logging.py
Normal file
@@ -0,0 +1,92 @@
import logging
from datetime import datetime
from pathlib import Path

from app.config import get_settings

settings = get_settings()


def setup_logging():
    """Configure file and console logging."""
    logs_path = Path(settings.logs_path)
    logs_path.mkdir(parents=True, exist_ok=True)

    # Create a dated log file
    log_file = logs_path / f"scraper_{datetime.now().strftime('%Y-%m-%d')}.log"

    # Configure root logger
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        handlers=[
            logging.FileHandler(log_file),
            logging.StreamHandler()
        ]
    )

    return logging.getLogger("plant_scraper")


def get_logger(name: str = "plant_scraper"):
    """Get a logger instance."""
    logs_path = Path(settings.logs_path)
    logs_path.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(name)

    # Attach handlers only once per logger name to avoid duplicate lines
    if not logger.handlers:
        logger.setLevel(logging.INFO)

        # File handler for a dated log file (the date is fixed when the
        # handler is created; the file does not rotate at midnight)
        log_file = logs_path / f"scraper_{datetime.now().strftime('%Y-%m-%d')}.log"
        file_handler = logging.FileHandler(log_file)
        file_handler.setLevel(logging.INFO)
        file_handler.setFormatter(logging.Formatter(
            '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
        ))

        # Console handler
        console_handler = logging.StreamHandler()
        console_handler.setLevel(logging.INFO)
        console_handler.setFormatter(logging.Formatter(
            '%(asctime)s - %(levelname)s - %(message)s'
        ))

        logger.addHandler(file_handler)
        logger.addHandler(console_handler)

    return logger


def get_job_logger(job_id: int):
    """Get a logger specific to a job, writing to a job-specific file."""
    logs_path = Path(settings.logs_path)
    logs_path.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(f"job_{job_id}")

    if not logger.handlers:
        logger.setLevel(logging.DEBUG)

        # Job-specific log file
        job_log_file = logs_path / f"job_{job_id}.log"
        file_handler = logging.FileHandler(job_log_file)
        file_handler.setLevel(logging.DEBUG)
        file_handler.setFormatter(logging.Formatter(
            '%(asctime)s - %(levelname)s - %(message)s'
        ))

        # Also log to daily file (the logger name already reads "job_<id>",
        # so the format uses %(name)s without an extra "job_" prefix)
        daily_log_file = logs_path / f"scraper_{datetime.now().strftime('%Y-%m-%d')}.log"
        daily_handler = logging.FileHandler(daily_log_file)
        daily_handler.setLevel(logging.INFO)
        daily_handler.setFormatter(logging.Formatter(
            '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
        ))

        logger.addHandler(file_handler)
        logger.addHandler(daily_handler)

    return logger
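The `if not logger.handlers` guard is what keeps repeated calls from attaching a second file handler and writing every message twice. The sketch below demonstrates that behaviour in isolation, assuming a temporary directory in place of `settings.logs_path`; `get_file_logger` and `demo` are hypothetical names, and `propagate` is switched off here only so the demo does not echo through the root logger.

```python
import logging
import tempfile
from pathlib import Path
from typing import List


def get_file_logger(name: str, log_dir: Path) -> logging.Logger:
    """Return a named logger, attaching its file handler only once."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        logger.setLevel(logging.INFO)
        handler = logging.FileHandler(log_dir / f"{name}.log")
        handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
        logger.addHandler(handler)
        logger.propagate = False
    return logger


def demo() -> List[str]:
    """Request the same logger twice and return the lines written to its file."""
    with tempfile.TemporaryDirectory() as tmp:
        log_dir = Path(tmp)
        first = get_file_logger("demo_job_1", log_dir)
        second = get_file_logger("demo_job_1", log_dir)  # same object, no new handler
        second.info("queued 3 images")
        lines = (log_dir / "demo_job_1.log").read_text().splitlines()
        # Close and detach handlers so the temporary directory can be removed
        for handler in list(first.handlers):
            handler.close()
            first.removeHandler(handler)
        return lines
```

Because `logging.getLogger` returns the same object for the same name, the second call finds the existing handler and the message appears in the file exactly once.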
1
backend/app/workers/__init__.py
Normal file
@@ -0,0 +1 @@
# Celery workers
36
backend/app/workers/celery_app.py
Normal file
@@ -0,0 +1,36 @@
from celery import Celery

from app.config import get_settings

settings = get_settings()

celery_app = Celery(
    "plant_scraper",
    broker=settings.redis_url,
    backend=settings.redis_url,
    include=[
        "app.workers.scrape_tasks",
        "app.workers.quality_tasks",
        "app.workers.export_tasks",
        "app.workers.stats_tasks",
    ],
)

celery_app.conf.update(
    task_serializer="json",
    accept_content=["json"],
    result_serializer="json",
    timezone="UTC",
    enable_utc=True,
    task_track_started=True,
    task_time_limit=3600 * 24,  # 24 hour max per task
    worker_prefetch_multiplier=1,
    task_acks_late=True,
    beat_schedule={
        "refresh-stats-every-5min": {
            "task": "app.workers.stats_tasks.refresh_stats",
            "schedule": 300.0,  # Every 5 minutes
        },
    },
    beat_schedule_filename="/tmp/celerybeat-schedule",
)
170
backend/app/workers/export_tasks.py
Normal file
@@ -0,0 +1,170 @@
import json
import os
import random
import shutil
import zipfile
from datetime import datetime
from pathlib import Path

from sqlalchemy import func

from app.workers.celery_app import celery_app
from app.database import SessionLocal
from app.models import Export, Image, Species
from app.config import get_settings

settings = get_settings()


@celery_app.task(bind=True)
def generate_export(self, export_id: int):
    """Generate a zip export for CoreML training."""
    db = SessionLocal()
    export = None  # Defined up front so the except block can test it safely
    try:
        export = db.query(Export).filter(Export.id == export_id).first()
        if not export:
            return {"error": "Export not found"}

        # Update status
        export.status = "generating"
        export.celery_task_id = self.request.id
        db.commit()

        # Parse filter criteria
        criteria = json.loads(export.filter_criteria) if export.filter_criteria else {}
        min_images = criteria.get("min_images_per_species", 100)
        licenses = criteria.get("licenses")
        min_quality = criteria.get("min_quality")
        species_ids = criteria.get("species_ids")

        # Build query for images
        query = db.query(Image).filter(Image.status == "downloaded")

        if licenses:
            query = query.filter(Image.license.in_(licenses))

        # Compare against None so a threshold of 0 is still applied
        if min_quality is not None:
            query = query.filter(Image.quality_score >= min_quality)

        if species_ids:
            query = query.filter(Image.species_id.in_(species_ids))

        # Group by species and filter by min count
        species_counts = db.query(
            Image.species_id,
            func.count(Image.id).label("count")
        ).filter(Image.status == "downloaded").group_by(Image.species_id).all()

        valid_species_ids = [s.species_id for s in species_counts if s.count >= min_images]

        if species_ids:
            valid_species_ids = [s for s in valid_species_ids if s in species_ids]

        if not valid_species_ids:
            export.status = "failed"
            export.error_message = "No species meet the criteria"
            export.completed_at = datetime.utcnow()
            db.commit()
            return {"error": "No species meet the criteria"}

        # Create export directory
        export_dir = Path(settings.exports_path) / f"export_{export_id}"
        train_dir = export_dir / "Training"
        test_dir = export_dir / "Testing"
        train_dir.mkdir(parents=True, exist_ok=True)
        test_dir.mkdir(parents=True, exist_ok=True)

        total_images = 0
        species_count = 0

        # Process each valid species
        for i, species_id in enumerate(valid_species_ids):
            species = db.query(Species).filter(Species.id == species_id).first()
            if not species:
                continue

            # Get images for this species; `query` already carries the
            # license and quality filters, so they are not re-applied here
            images = query.filter(Image.species_id == species_id).all()

            # Re-check the minimum after filtering
            if len(images) < min_images:
                continue

            species_count += 1

            # Create species folders
            species_name = species.scientific_name.replace(" ", "_")
            (train_dir / species_name).mkdir(exist_ok=True)
            (test_dir / species_name).mkdir(exist_ok=True)

            # Shuffle and split
            random.shuffle(images)
            split_idx = int(len(images) * export.train_split)
            train_images = images[:split_idx]
            test_images = images[split_idx:]

            # Copy images
            for j, img in enumerate(train_images):
                if img.local_path and os.path.exists(img.local_path):
                    ext = Path(img.local_path).suffix or ".jpg"
                    dest = train_dir / species_name / f"img_{j:05d}{ext}"
                    shutil.copy2(img.local_path, dest)
                    total_images += 1

            for j, img in enumerate(test_images):
                if img.local_path and os.path.exists(img.local_path):
                    ext = Path(img.local_path).suffix or ".jpg"
                    dest = test_dir / species_name / f"img_{j:05d}{ext}"
                    shutil.copy2(img.local_path, dest)
                    total_images += 1

            # Update progress
            self.update_state(
                state="PROGRESS",
                meta={
                    "current": i + 1,
                    "total": len(valid_species_ids),
                    "species": species.scientific_name,
                }
            )

        # Create zip file
        zip_path = Path(settings.exports_path) / f"export_{export_id}.zip"
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zipf:
            for root, dirs, files in os.walk(export_dir):
                for file in files:
                    file_path = Path(root) / file
                    arcname = file_path.relative_to(export_dir)
                    zipf.write(file_path, arcname)

        # Clean up directory
        shutil.rmtree(export_dir)

        # Update export record
        export.status = "completed"
        export.file_path = str(zip_path)
        export.file_size = zip_path.stat().st_size
        export.species_count = species_count
        export.image_count = total_images
        export.completed_at = datetime.utcnow()
        db.commit()

        return {
            "status": "completed",
            "species_count": species_count,
            "image_count": total_images,
            "file_size": export.file_size,
        }

    except Exception as e:
        if export:
            export.status = "failed"
            export.error_message = str(e)
            export.completed_at = datetime.utcnow()
            db.commit()
        raise
    finally:
        db.close()
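The shuffle-and-split step in `generate_export` uses the global `random` state, so two exports of the same data can put different images in `Training` and `Testing`. A seeded variant makes the split reproducible. This is a minimal sketch, not the project's implementation: `split_train_test` and its `seed` parameter are hypothetical additions.

```python
import random
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(
    items: Sequence[T], train_split: float, seed: Optional[int] = None
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of items and split it into (train, test) lists."""
    if not 0.0 <= train_split <= 1.0:
        raise ValueError("train_split must be between 0 and 1")
    shuffled = list(items)
    # A private Random instance leaves the global random state untouched
    random.Random(seed).shuffle(shuffled)
    split_idx = int(len(shuffled) * train_split)
    return shuffled[:split_idx], shuffled[split_idx:]
```

Calling `split_train_test(images, export.train_split, seed=export_id)` would, under these assumptions, give every rerun of the same export the same partition.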
224
backend/app/workers/quality_tasks.py
Normal file
@@ -0,0 +1,224 @@
import os
from pathlib import Path

import httpx
from PIL import Image as PILImage
import imagehash
import numpy as np
from scipy import ndimage

from app.workers.celery_app import celery_app
from app.database import SessionLocal
from app.models import Image
from app.config import get_settings

settings = get_settings()


def calculate_blur_score(image_path: str) -> float:
    """Calculate blur score using Laplacian variance. Higher = sharper."""
    try:
        img = PILImage.open(image_path).convert("L")
        # Use a float array: on uint8 input, ndimage.laplace keeps the uint8
        # dtype, so negative responses wrap around and distort the variance.
        img_array = np.array(img, dtype=np.float64)
        laplacian = ndimage.laplace(img_array)
        return float(np.var(laplacian))
    except Exception:
        return 0.0


def calculate_phash(image_path: str) -> str:
    """Calculate perceptual hash for deduplication."""
    try:
        img = PILImage.open(image_path)
        return str(imagehash.phash(img))
    except Exception:
        return ""


def check_color_distribution(image_path: str) -> tuple[bool, str]:
    """Check if image has healthy color distribution for a plant photo.

    Returns (passed, reason) tuple.
    Rejects:
    - Low color variance (mean channel std < 25): herbarium specimens (brown on white)
    - No green + low variance (green ratio < 5% AND mean std < 40): monochrome illustrations
    """
    try:
        img = PILImage.open(image_path).convert("RGB")
        arr = np.array(img, dtype=np.float64)

        # Per-channel standard deviation
        channel_stds = arr.std(axis=(0, 1))  # [R_std, G_std, B_std]
        mean_std = float(channel_stds.mean())

        if mean_std < 25:
            return False, f"Low color variance ({mean_std:.1f})"

        # Check green ratio
        channel_means = arr.mean(axis=(0, 1))
        total = channel_means.sum()
        green_ratio = channel_means[1] / total if total > 0 else 0

        if green_ratio < 0.05 and mean_std < 40:
            return False, f"No green ({green_ratio:.2%}) + low variance ({mean_std:.1f})"

        return True, ""
    except Exception:
        return True, ""  # Don't reject on error


def resize_image(image_path: str, target_size: int = 512) -> bool:
    """Resize image to target size while maintaining aspect ratio."""
    try:
        img = PILImage.open(image_path)
        # Files are stored under a .jpg name, and JPEG cannot hold alpha or
        # palette modes, so normalize to RGB before saving.
        if img.mode != "RGB":
            img = img.convert("RGB")
        img.thumbnail((target_size, target_size), PILImage.Resampling.LANCZOS)
        img.save(image_path, quality=95)
        return True
    except Exception:
        return False


@celery_app.task
def download_and_process_image(image_id: int):
    """Download image, check quality, dedupe, and resize."""
    db = SessionLocal()
    image = None  # bound up front so the outer except handler can test it safely
    try:
        image = db.query(Image).filter(Image.id == image_id).first()
        if not image:
            return {"error": "Image not found"}

        # Create directory for species
        species = image.species
        species_dir = Path(settings.images_path) / species.scientific_name.replace(" ", "_")
        species_dir.mkdir(parents=True, exist_ok=True)

        # Download image
        filename = f"{image.source}_{image.source_id or image.id}.jpg"
        local_path = species_dir / filename

        try:
            headers = {
                "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
            }
            with httpx.Client(timeout=30, headers=headers, follow_redirects=True) as client:
                response = client.get(image.url)
                response.raise_for_status()

            with open(local_path, "wb") as f:
                f.write(response.content)
        except Exception as e:
            image.status = "rejected"
            db.commit()
            return {"error": f"Download failed: {e}"}

        # Check minimum size
        try:
            with PILImage.open(local_path) as img:
                width, height = img.size
                if width < 256 or height < 256:
                    os.remove(local_path)
                    image.status = "rejected"
                    db.commit()
                    return {"error": "Image too small"}
                image.width = width
                image.height = height
        except Exception as e:
            if local_path.exists():
                os.remove(local_path)
            image.status = "rejected"
            db.commit()
            return {"error": f"Invalid image: {e}"}

        # Calculate perceptual hash for deduplication
        phash = calculate_phash(str(local_path))
        if phash:
            # Check for duplicates
            existing = db.query(Image).filter(
                Image.phash == phash,
                Image.id != image.id,
                Image.status == "downloaded"
            ).first()

            if existing:
                os.remove(local_path)
                image.status = "rejected"
                image.phash = phash
                db.commit()
                return {"error": "Duplicate image"}

            image.phash = phash

        # Calculate blur score
        quality_score = calculate_blur_score(str(local_path))
        image.quality_score = quality_score

        # Reject very blurry images (threshold can be tuned)
        if quality_score < 100:  # Low variance = blurry
            os.remove(local_path)
            image.status = "rejected"
            db.commit()
            return {"error": "Image too blurry"}

        # Check color distribution (reject herbarium specimens, illustrations)
        color_ok, color_reason = check_color_distribution(str(local_path))
        if not color_ok:
            os.remove(local_path)
            image.status = "rejected"
            db.commit()
            return {"error": f"Non-photo content: {color_reason}"}

        # Resize to 512x512 max
        resize_image(str(local_path))

        # Update image record
        image.local_path = str(local_path)
        image.status = "downloaded"
        db.commit()

        return {
            "status": "success",
            "path": str(local_path),
            "quality_score": quality_score,
        }

    except Exception as e:
        if image:
            image.status = "rejected"
            db.commit()
        return {"error": str(e)}
    finally:
        db.close()


@celery_app.task(bind=True)
def batch_process_pending_images(self, source: str = None, chunk_size: int = 500):
    """Process ALL pending images in chunks, with progress tracking."""
    db = SessionLocal()
    try:
        query = db.query(Image).filter(Image.status == "pending")
        if source:
            query = query.filter(Image.source == source)

        total = query.count()
        queued = 0
        last_id = 0

        # Page by primary key rather than OFFSET: workers change image status
        # while this loop runs, so an OFFSET over the shrinking "pending" set
        # would skip rows.
        while True:
            chunk = (
                query.filter(Image.id > last_id)
                .order_by(Image.id)
                .limit(chunk_size)
                .all()
            )
            if not chunk:
                break

            for image in chunk:
                download_and_process_image.delay(image.id)
                queued += 1

            last_id = chunk[-1].id

            self.update_state(
                state="PROGRESS",
                meta={"queued": queued, "total": total},
            )

        return {"queued": queued, "total": total}
    finally:
        db.close()
164
backend/app/workers/scrape_tasks.py
Normal file
@@ -0,0 +1,164 @@
import json
from datetime import datetime

from app.workers.celery_app import celery_app
from app.database import SessionLocal
from app.models import Job, Species, Image
from app.utils.logging import get_job_logger


@celery_app.task(bind=True)
def run_scrape_job(self, job_id: int):
    """Main scrape task that dispatches to source-specific scrapers."""
    logger = get_job_logger(job_id)
    logger.info(f"Starting scrape job {job_id}")

    db = SessionLocal()
    job = None
    try:
        job = db.query(Job).filter(Job.id == job_id).first()
        if not job:
            logger.error(f"Job {job_id} not found")
            return {"error": "Job not found"}

        logger.info(f"Job: {job.name}, Source: {job.source}")

        # Update job status
        job.status = "running"
        job.started_at = datetime.utcnow()
        job.celery_task_id = self.request.id
        db.commit()

        # Get species to scrape
        if job.species_filter:
            species_ids = json.loads(job.species_filter)
            query = db.query(Species).filter(Species.id.in_(species_ids))
            logger.info(f"Filtered to species IDs: {species_ids}")
        else:
            query = db.query(Species)
            logger.info("Scraping all species")

        # Filter by image count if requested
        if job.only_without_images or job.max_images:
            from sqlalchemy import func
            # Subquery to count downloaded images per species
            image_count_subquery = (
                db.query(Image.species_id, func.count(Image.id).label("count"))
                .filter(Image.status == "downloaded")
                .group_by(Image.species_id)
                .subquery()
            )
            # Left join with the count subquery
            query = query.outerjoin(
                image_count_subquery,
                Species.id == image_count_subquery.c.species_id
            )

            if job.only_without_images:
                # Filter where count is NULL or 0
                query = query.filter(
                    (image_count_subquery.c.count.is_(None)) | (image_count_subquery.c.count == 0)
                )
                logger.info("Filtering to species without images")
            elif job.max_images:
                # Filter where count is NULL or less than max_images
                query = query.filter(
                    (image_count_subquery.c.count.is_(None)) | (image_count_subquery.c.count < job.max_images)
                )
                logger.info(f"Filtering to species with fewer than {job.max_images} images")

        species_list = query.all()
        logger.info(f"Total species to scrape: {len(species_list)}")

        job.progress_total = len(species_list)
        db.commit()

        # Import scraper based on source
        from app.scrapers import get_scraper
        scraper = get_scraper(job.source)

        if not scraper:
            error_msg = f"Unknown source: {job.source}"
            logger.error(error_msg)
            job.status = "failed"
            job.error_message = error_msg
            job.completed_at = datetime.utcnow()
            db.commit()
            return {"error": error_msg}

        logger.info(f"Using scraper: {scraper.name}")

        # Scrape each species
        for i, species in enumerate(species_list):
            try:
                # Update progress
                job.progress_current = i + 1
                db.commit()

                logger.info(f"[{i+1}/{len(species_list)}] Scraping: {species.scientific_name}")

                # Update task state for real-time monitoring
                self.update_state(
                    state="PROGRESS",
                    meta={
                        "current": i + 1,
                        "total": len(species_list),
                        "species": species.scientific_name,
                    }
                )

                # Run scraper for this species
                results = scraper.scrape_species(species, db, logger)
                downloaded = results.get("downloaded", 0)
                rejected = results.get("rejected", 0)
                job.images_downloaded += downloaded
                job.images_rejected += rejected
                db.commit()

                logger.info(f" -> Downloaded: {downloaded}, Rejected: {rejected}")

            except Exception as e:
                # Log error but continue with other species
                logger.error(f"Error scraping {species.scientific_name}: {e}", exc_info=True)
                continue

        # Mark job complete
        job.status = "completed"
        job.completed_at = datetime.utcnow()
        db.commit()

        logger.info(f"Job {job_id} completed. Total downloaded: {job.images_downloaded}, rejected: {job.images_rejected}")

        return {
            "status": "completed",
            "downloaded": job.images_downloaded,
            "rejected": job.images_rejected,
        }

    except Exception as e:
        logger.error(f"Job {job_id} failed with error: {e}", exc_info=True)
        if job:
            job.status = "failed"
            job.error_message = str(e)
            job.completed_at = datetime.utcnow()
            db.commit()
        raise
    finally:
        db.close()


@celery_app.task
def pause_scrape_job(job_id: int):
    """Pause a running scrape job."""
    db = SessionLocal()
    try:
        job = db.query(Job).filter(Job.id == job_id).first()
        if job and job.status == "running":
            job.status = "paused"
            db.commit()
            # Revoke the Celery task
            if job.celery_task_id:
                celery_app.control.revoke(job.celery_task_id, terminate=True)
        return {"status": "paused"}
    finally:
        db.close()
193
backend/app/workers/stats_tasks.py
Normal file
@@ -0,0 +1,193 @@
import json
import os
from datetime import datetime
from pathlib import Path

from sqlalchemy import func, case, text

from app.workers.celery_app import celery_app
from app.database import SessionLocal
from app.models import Species, Image, Job
from app.models.cached_stats import CachedStats
from app.config import get_settings


def get_directory_size_fast(path: str) -> int:
    """Get directory size in bytes using fast os.scandir."""
    total = 0
    try:
        with os.scandir(path) as it:
            for entry in it:
                try:
                    if entry.is_file(follow_symlinks=False):
                        total += entry.stat(follow_symlinks=False).st_size
                    elif entry.is_dir(follow_symlinks=False):
                        total += get_directory_size_fast(entry.path)
                except (OSError, PermissionError):
                    pass
    except (OSError, PermissionError):
        pass
    return total


@celery_app.task
def refresh_stats():
    """Calculate and cache dashboard statistics."""
    print("=== STATS TASK: Starting refresh ===", flush=True)

    db = SessionLocal()
    try:
        # Use raw SQL for maximum performance on SQLite
        # All counts in a single query
        counts_sql = text("""
            SELECT
                (SELECT COUNT(*) FROM species) as total_species,
                (SELECT COUNT(*) FROM images) as total_images,
                (SELECT COUNT(*) FROM images WHERE status = 'downloaded') as images_downloaded,
                (SELECT COUNT(*) FROM images WHERE status = 'pending') as images_pending,
                (SELECT COUNT(*) FROM images WHERE status = 'rejected') as images_rejected
        """)
        counts = db.execute(counts_sql).fetchone()
        total_species = counts[0] or 0
        total_images = counts[1] or 0
        images_downloaded = counts[2] or 0
        images_pending = counts[3] or 0
        images_rejected = counts[4] or 0

        # Per-source stats - single query with GROUP BY
        source_sql = text("""
            SELECT
                source,
                COUNT(*) as total,
                SUM(CASE WHEN status = 'downloaded' THEN 1 ELSE 0 END) as downloaded,
                SUM(CASE WHEN status = 'pending' THEN 1 ELSE 0 END) as pending,
                SUM(CASE WHEN status = 'rejected' THEN 1 ELSE 0 END) as rejected
            FROM images
            GROUP BY source
        """)
        source_stats_raw = db.execute(source_sql).fetchall()
        sources = [
            {
                "source": s[0],
                "image_count": s[1],
                "downloaded": s[2] or 0,
                "pending": s[3] or 0,
                "rejected": s[4] or 0,
            }
            for s in source_stats_raw
        ]

        # Per-license stats - single indexed query
        license_sql = text("""
            SELECT license, COUNT(*) as count
            FROM images
            WHERE status = 'downloaded'
            GROUP BY license
        """)
        license_stats_raw = db.execute(license_sql).fetchall()
        licenses = [
            {"license": row[0], "count": row[1]}
            for row in license_stats_raw
        ]

        # Job stats - single query
        job_sql = text("""
            SELECT
                SUM(CASE WHEN status = 'running' THEN 1 ELSE 0 END) as running,
                SUM(CASE WHEN status = 'pending' THEN 1 ELSE 0 END) as pending,
                SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) as completed,
                SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) as failed
            FROM jobs
        """)
        job_counts = db.execute(job_sql).fetchone()
        jobs = {
            "running": job_counts[0] or 0,
            "pending": job_counts[1] or 0,
            "completed": job_counts[2] or 0,
            "failed": job_counts[3] or 0,
        }

        # Top species by image count - optimized with index
        top_sql = text("""
            SELECT s.id, s.scientific_name, s.common_name, COUNT(i.id) as image_count
            FROM species s
            INNER JOIN images i ON i.species_id = s.id AND i.status = 'downloaded'
            GROUP BY s.id
            ORDER BY image_count DESC
            LIMIT 10
        """)
        top_species_raw = db.execute(top_sql).fetchall()
        top_species = [
            {
                "id": s[0],
                "scientific_name": s[1],
                "common_name": s[2],
                "image_count": s[3],
            }
            for s in top_species_raw
        ]

        # Under-represented species - use pre-computed counts
        under_sql = text("""
            SELECT s.id, s.scientific_name, s.common_name, COALESCE(img_counts.cnt, 0) as image_count
            FROM species s
            LEFT JOIN (
                SELECT species_id, COUNT(*) as cnt
                FROM images
                WHERE status = 'downloaded'
                GROUP BY species_id
            ) img_counts ON img_counts.species_id = s.id
            WHERE COALESCE(img_counts.cnt, 0) < 100
            ORDER BY image_count ASC
            LIMIT 10
        """)
        under_rep_raw = db.execute(under_sql).fetchall()
        under_represented = [
            {
                "id": s[0],
                "scientific_name": s[1],
                "common_name": s[2],
                "image_count": s[3],
            }
            for s in under_rep_raw
        ]

        # Calculate disk usage (fast recursive scan)
        settings = get_settings()
        disk_usage_bytes = get_directory_size_fast(settings.images_path)
        disk_usage_mb = round(disk_usage_bytes / (1024 * 1024), 2)

        # Build the stats object
        stats = {
            "total_species": total_species,
            "total_images": total_images,
            "images_downloaded": images_downloaded,
            "images_pending": images_pending,
            "images_rejected": images_rejected,
            "disk_usage_mb": disk_usage_mb,
            "sources": sources,
            "licenses": licenses,
            "jobs": jobs,
            "top_species": top_species,
            "under_represented": under_represented,
        }

        # Store in database
        cached = db.query(CachedStats).filter(CachedStats.key == "dashboard_stats").first()
        if cached:
            cached.value = json.dumps(stats)
            cached.updated_at = datetime.utcnow()
        else:
            cached = CachedStats(key="dashboard_stats", value=json.dumps(stats))
            db.add(cached)

        db.commit()
        print(f"=== STATS TASK: Refreshed (species={total_species}, images={total_images}) ===", flush=True)

        return {"status": "success", "total_species": total_species, "total_images": total_images}

    except Exception as e:
        print(f"=== STATS TASK ERROR: {e} ===", flush=True)
        raise
    finally:
        db.close()
34
backend/requirements.txt
Normal file
@@ -0,0 +1,34 @@
# Web framework
fastapi==0.109.0
uvicorn[standard]==0.27.0
python-multipart==0.0.6

# Database
sqlalchemy==2.0.25
alembic==1.13.1
aiosqlite==0.19.0

# Task queue
celery==5.3.6
redis==5.0.1

# Image processing
Pillow==10.2.0
imagehash==4.3.1
imagededup==0.3.3.post2

# HTTP clients
httpx==0.26.0
aiohttp==3.9.3

# Search
duckduckgo-search

# Utilities
python-dotenv==1.0.0
pydantic==2.5.3
pydantic-settings==2.1.0

# Testing
pytest==7.4.4
pytest-asyncio==0.23.3
1
backend/tests/__init__.py
Normal file
@@ -0,0 +1 @@
# Tests
114
docker-compose.unraid.yml
Normal file
@@ -0,0 +1,114 @@
# Docker Compose for Unraid
#
# Access at http://YOUR_UNRAID_IP:8580
#
# ============================================
# CONFIGURE THESE PATHS FOR YOUR UNRAID SETUP
# ============================================
# Edit the left side of the colon (:) for each volume mount
#
# DATABASE_PATH: Where to store the SQLite database
# IMAGES_PATH: Where to store downloaded images (can be large, 100GB+)
# EXPORTS_PATH: Where to store generated export zip files
# IMPORTS_PATH: Where to place images for bulk import (source/species/images)
# LOGS_PATH: Where to store scraper log files for debugging

services:
  backend:
    build:
      context: /mnt/user/appdata/PlantGuideScraper/backend
      dockerfile: Dockerfile
    container_name: plant-scraper-backend
    restart: unless-stopped
    volumes:
      - /mnt/user/appdata/PlantGuideScraper/backend:/app:ro
      # === CONFIGURABLE DATA PATHS ===
      - /mnt/user/downloads/PlantGuideDocker/database:/data/db # DATABASE_PATH
      - /mnt/user/downloads/PlantGuideDocker/images:/data/images # IMAGES_PATH
      - /mnt/user/downloads/PlantGuideDocker/exports:/data/exports # EXPORTS_PATH
      - /mnt/user/downloads/PlantGuideDocker/imports:/data/imports # IMPORTS_PATH
      - /mnt/user/downloads/PlantGuideDocker/logs:/data/logs # LOGS_PATH
    environment:
      - DATABASE_URL=sqlite:////data/db/plants.sqlite
      - REDIS_URL=redis://plant-scraper-redis:6379/0
      - IMAGES_PATH=/data/images
      - EXPORTS_PATH=/data/exports
      - IMPORTS_PATH=/data/imports
      - LOGS_PATH=/data/logs
    depends_on:
      - redis
    command: uvicorn app.main:app --host 0.0.0.0 --port 8000
    networks:
      - plant-scraper

  celery:
    build:
      context: /mnt/user/appdata/PlantGuideScraper/backend
      dockerfile: Dockerfile
    container_name: plant-scraper-celery
    restart: unless-stopped
    volumes:
      - /mnt/user/appdata/PlantGuideScraper/backend:/app:ro
      # === CONFIGURABLE DATA PATHS (must match backend) ===
      - /mnt/user/downloads/PlantGuideDocker/database:/data/db # DATABASE_PATH
      - /mnt/user/downloads/PlantGuideDocker/images:/data/images # IMAGES_PATH
      - /mnt/user/downloads/PlantGuideDocker/exports:/data/exports # EXPORTS_PATH
      - /mnt/user/downloads/PlantGuideDocker/imports:/data/imports # IMPORTS_PATH
      - /mnt/user/downloads/PlantGuideDocker/logs:/data/logs # LOGS_PATH
    environment:
      - DATABASE_URL=sqlite:////data/db/plants.sqlite
      - REDIS_URL=redis://plant-scraper-redis:6379/0
      - IMAGES_PATH=/data/images
      - EXPORTS_PATH=/data/exports
      - IMPORTS_PATH=/data/imports
      - LOGS_PATH=/data/logs
    depends_on:
      - redis
    command: celery -A app.workers.celery_app worker --beat --loglevel=info --concurrency=4
    networks:
      - plant-scraper

  redis:
    image: redis:7-alpine
    container_name: plant-scraper-redis
    restart: unless-stopped
    volumes:
      - /mnt/user/appdata/PlantGuideScraper/redis:/data
    networks:
      - plant-scraper

  frontend:
    build:
      context: /mnt/user/appdata/PlantGuideScraper/frontend
      dockerfile: Dockerfile
    container_name: plant-scraper-frontend
    restart: unless-stopped
    volumes:
      - /mnt/user/appdata/PlantGuideScraper/frontend:/app
      - plant-scraper-node-modules:/app/node_modules
    environment:
      - VITE_API_URL=
    command: npm run dev -- --host
    networks:
      - plant-scraper

  nginx:
    image: nginx:alpine
    container_name: plant-scraper-nginx
    restart: unless-stopped
    ports:
      - "8580:80"
    volumes:
      - /mnt/user/appdata/PlantGuideScraper/nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - backend
      - frontend
    networks:
      - plant-scraper

networks:
  plant-scraper:
    name: plant-scraper

volumes:
  plant-scraper-node-modules:
73
docker-compose.yml
Normal file
@@ -0,0 +1,73 @@
services:
  backend:
    build:
      context: ./backend
      dockerfile: Dockerfile
    container_name: plant-scraper-backend
    # Port exposed only internally, nginx proxies to it
    volumes:
      - ./backend:/app
      - ./data:/data
    environment:
      - DATABASE_URL=sqlite:////data/db/plants.sqlite
      - REDIS_URL=redis://redis:6379/0
      - IMAGES_PATH=/data/images
      - EXPORTS_PATH=/data/exports
      - IMPORTS_PATH=/data/imports
      - LOGS_PATH=/data/logs
    depends_on:
      - redis
    command: uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

  celery:
    build:
      context: ./backend
      dockerfile: Dockerfile
    container_name: plant-scraper-celery
    volumes:
      - ./backend:/app
      - ./data:/data
    environment:
      - DATABASE_URL=sqlite:////data/db/plants.sqlite
      - REDIS_URL=redis://redis:6379/0
      - IMAGES_PATH=/data/images
      - EXPORTS_PATH=/data/exports
      - IMPORTS_PATH=/data/imports
      - LOGS_PATH=/data/logs
    depends_on:
      - redis
    command: celery -A app.workers.celery_app worker --beat --loglevel=info --concurrency=4

  redis:
    image: redis:7-alpine
    container_name: plant-scraper-redis
    # Port exposed only internally, not to host (avoid conflicts)
    volumes:
      - redis_data:/data

  frontend:
    build:
      context: ./frontend
      dockerfile: Dockerfile
    container_name: plant-scraper-frontend
    # Port exposed only internally, nginx proxies to it
    volumes:
      - ./frontend:/app
      - /app/node_modules
    environment:
      - VITE_API_URL=
    command: npm run dev -- --host

  nginx:
    image: nginx:alpine
    container_name: plant-scraper-nginx
    ports:
      - "80:80"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - backend
      - frontend

volumes:
  redis_data:
564
docs/master_plan.md
Normal file
@@ -0,0 +1,564 @@
# Houseplant Image Scraper - Master Plan

## Overview

Web-based interface for managing a multi-source image scraping pipeline targeting 5-10K houseplant species with 1-5M total images. Runs on Unraid via Docker, exports datasets for CoreML training.

---

## Requirements Summary

| Requirement | Value |
|-------------|-------|
| Platform | Web app in Docker on Unraid |
| Sources | iNaturalist/GBIF, Flickr, Wikimedia Commons, Trefle, USDA PLANTS, EOL |
| API keys | Configurable per service |
| Species list | Manual import (CSV/paste) |
| Grouping | Species, genus, source, license (faceted) |
| Search/filter | Yes |
| Quality filter | Automatic (hash dedup, blur, size) |
| Progress | Real-time dashboard |
| Storage | `/species_name/image.jpg` + SQLite DB |
| Export | Filtered zip for CoreML, downloadable anytime |
| Auth | None (single user) |
| Deployment | Docker Compose |

---

## Create ML Export Requirements

Per [Apple's documentation](https://developer.apple.com/documentation/createml/creating-an-image-classifier-model):

- **Folder structure**: `/SpeciesName/image001.jpg` (folder name = class label)
- **Train/Test split**: 80/20 recommended, separate folders
- **Balance**: Roughly equal images per class (avoid bias)
- **No metadata needed**: Create ML uses folder names as labels

### Export Format

```
dataset_export/
├── Training/
│   ├── Monstera_deliciosa/
│   │   ├── img001.jpg
│   │   └── ...
│   ├── Philodendron_hederaceum/
│   └── ...
└── Testing/
    ├── Monstera_deliciosa/
    └── ...
```
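
As a rough illustration of the layout above, the following sketch copies one species folder into `Training/` and `Testing/` subfolders using the recommended 80/20 split. The function name `split_species_dir`, its parameters, and the seeded shuffle are assumptions made for this example; the project's actual export task may split images differently.

```python
import random
import shutil
from pathlib import Path


def split_species_dir(
    species_dir: Path,
    out_root: Path,
    test_fraction: float = 0.2,
    seed: int = 0,
) -> tuple[int, int]:
    """Copy one species folder into Training/ and Testing/ under out_root.

    Hypothetical helper: returns (training_count, testing_count).
    """
    images = sorted(p for p in species_dir.iterdir() if p.is_file())
    # A seeded shuffle keeps the split reproducible between exports.
    random.Random(seed).shuffle(images)
    n_test = int(len(images) * test_fraction)
    splits = {"Testing": images[:n_test], "Training": images[n_test:]}
    for split_name, files in splits.items():
        # The folder name doubles as the class label for Create ML.
        dest = out_root / split_name / species_dir.name
        dest.mkdir(parents=True, exist_ok=True)
        for src in files:
            shutil.copy2(src, dest / src.name)
    return len(splits["Training"]), len(splits["Testing"])
```

Calling it once per species directory under the images path would produce the `dataset_export/` tree shown above. Note that species with fewer than five images end up with an empty `Testing/` folder under this rounding, so a real export would probably want a minimum image count per class.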

---

## Data Sources

| Source | API/Method | License Filter | Rate Limits | Notes |
|--------|------------|----------------|-------------|-------|
| **iNaturalist/GBIF** | Bulk DwC-A export + API | CC0, CC-BY | 1 req/sec, 10k/day, 5GB/hr media | Best source: Research Grade = verified |
| **Flickr** | flickr.photos.search | license=4,9 (CC-BY, CC0) | 3600 req/hr | Good supplemental |
| **Wikimedia Commons** | MediaWiki API + pyWikiCommons | CC-BY, CC-BY-SA, PD | Generous | Category-based search |
| **Trefle.io** | REST API | Open source | Free tier | Species metadata + some images |
| **USDA PLANTS** | REST API | Public Domain | Generous | US-focused, limited images |
| **Plant.id** | REST API | Commercial | Paid | For validation, not scraping |
| **Encyclopedia of Life** | API | Mixed | Check each | Aggregator |
|
||||||
|
|
||||||
|
### Source References
|
||||||
|
|
||||||
|
- iNaturalist: https://www.inaturalist.org/pages/developers
|
||||||
|
- iNaturalist bulk download: https://forum.inaturalist.org/t/one-time-bulk-download-dataset/18741
|
||||||
|
- Flickr API: https://www.flickr.com/services/api/flickr.photos.search.html
|
||||||
|
- Wikimedia Commons API: https://commons.wikimedia.org/wiki/Commons:API
|
||||||
|
- pyWikiCommons: https://pypi.org/project/pyWikiCommons/
|
||||||
|
- Trefle.io: https://trefle.io/
|
||||||
|
- USDA PLANTS: https://data.nal.usda.gov/dataset/usda-plants-database-api-r
|
||||||
|
|
||||||
|
### Flickr License IDs
|
||||||
|
|
||||||
|
| ID | License |
|
||||||
|
|----|---------|
|
||||||
|
| 0 | All Rights Reserved |
|
||||||
|
| 1 | CC BY-NC-SA 2.0 |
|
||||||
|
| 2 | CC BY-NC 2.0 |
|
||||||
|
| 3 | CC BY-NC-ND 2.0 |
|
||||||
|
| 4 | CC BY 2.0 (Commercial OK) |
|
||||||
|
| 5 | CC BY-SA 2.0 |
|
||||||
|
| 6 | CC BY-ND 2.0 |
|
||||||
|
| 7 | No known copyright restrictions |
|
||||||
|
| 8 | United States Government Work |
|
||||||
|
| 9 | Public Domain (CC0) |
|
||||||
|
|
||||||
|
**For commercial use**: Filter to license IDs 4, 7, 8, 9 only.
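
As a rough illustration of that filter, the sketch below builds a `flickr.photos.search` request URL restricted to the commercial-use license IDs. The helper name `flickr_search_url`, the choice of `extras`, and the page size are assumptions made for the example; it only constructs the URL and does not call the API.

```python
from urllib.parse import urlencode

FLICKR_REST_URL = "https://www.flickr.com/services/rest/"
COMMERCIAL_LICENSE_IDS = (4, 7, 8, 9)  # IDs taken from the table above


def flickr_search_url(api_key, species_name, page=1, per_page=100):
    """Build a flickr.photos.search URL limited to commercial-use licenses."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "text": species_name,
        # Flickr accepts a comma-separated list of license IDs.
        "license": ",".join(str(i) for i in COMMERCIAL_LICENSE_IDS),
        "media": "photos",
        # Ask for license and owner so attribution can be stored per image.
        "extras": "license,owner_name,url_o",
        "per_page": per_page,
        "page": page,
        "format": "json",
        "nojsoncallback": 1,
    }
    return f"{FLICKR_REST_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(flickr_search_url("YOUR_API_KEY", "Monstera deliciosa"))
```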

---

## Image Quality Pipeline

| Stage | Library | Purpose |
|-------|---------|---------|
| **Deduplication** | imagededup | Perceptual hash (CNN + hash methods) |
| **Blur detection** | scipy + Sobel variance | Reject blurry images |
| **Size filter** | Pillow | Min 256x256 |
| **Resize** | Pillow | Normalize to 512x512 |
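
To make the deduplication stage concrete, here is a small sketch of how stored perceptual hashes could be compared. It does not use imagededup; it assumes hashes are already available as equal-length hex strings (as the `phash` column suggests), and the function names and the threshold of 4 bits are illustrative assumptions. The pairwise scan is quadratic, so it is only meant for per-species batches.

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two equal-length hex hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def find_near_duplicates(hashes: dict[int, str], max_distance: int = 4):
    """Return (kept_id, duplicate_id) pairs; the lowest image id is kept."""
    kept = []
    duplicates = []
    for image_id in sorted(hashes):
        # Compare against images already accepted as unique.
        match = next(
            (
                k
                for k in kept
                if hamming_distance(hashes[k], hashes[image_id]) <= max_distance
            ),
            None,
        )
        if match is None:
            kept.append(image_id)
        else:
            duplicates.append((match, image_id))
    return duplicates


if __name__ == "__main__":
    sample = {
        1: "ffff0000ffff0000",
        2: "ffff0000ffff0001",  # one bit away from image 1
        3: "0000ffff0000ffff",
    }
    print(find_near_duplicates(sample))
```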

### Library References

- imagededup: https://github.com/idealo/imagededup
- imagehash: https://github.com/JohannesBuchner/imagehash

---

## Technology Stack

| Component | Choice | Rationale |
|-----------|--------|-----------|
| **Backend** | FastAPI (Python) | Async, fast, ML ecosystem, great docs |
| **Frontend** | React + Tailwind | Fast dev, good component libraries |
| **Database** | SQLite (+ FTS5) | Simple, no separate container, sufficient for single-user |
| **Task Queue** | Celery + Redis | Long-running scrape jobs, good monitoring |
| **Containers** | Docker Compose | Multi-service orchestration |

Reference: https://github.com/fastapi/full-stack-fastapi-template

---

## Architecture

```
┌─────────────────────────────────────────────────────────────────────────┐
│                        DOCKER COMPOSE ON UNRAID                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌─────────────┐    ┌─────────────────────────────────────────────────┐ │
│  │    NGINX    │    │                 FASTAPI BACKEND                 │ │
│  │     :80     │───▶│  /api/species  - CRUD species list              │ │
│  │             │    │  /api/sources  - API key management             │ │
│  └──────┬──────┘    │  /api/jobs     - Scrape job control             │ │
│         │           │  /api/images   - Search, filter, browse         │ │
│         ▼           │  /api/export   - Generate zip for CoreML        │ │
│  ┌─────────────┐    │  /api/stats    - Dashboard metrics              │ │
│  │    REACT    │    └─────────────────────────────────────────────────┘ │
│  │     SPA     │                             │                          │
│  │    :3000    │                             ▼                          │
│  └─────────────┘    ┌─────────────────────────────────────────────────┐ │
│                     │                 CELERY WORKERS                  │ │
│  ┌─────────────┐    │  - iNaturalist scraper                          │ │
│  │    REDIS    │◀───│  - Flickr scraper                               │ │
│  │    :6379    │    │  - Wikimedia scraper                            │ │
│  └─────────────┘    │  - Quality filter pipeline                      │ │
│                     │  - Export generator                             │ │
│                     └─────────────────────────────────────────────────┘ │
│                                              │                          │
│                                              ▼                          │
│  ┌─────────────────────────────────────────────────────────────────────┐│
│  │                        STORAGE (Bind Mounts)                        ││
│  │  /data/db/plants.sqlite   - Species, images metadata, jobs          ││
│  │  /data/images/{species}/  - Downloaded images                       ││
│  │  /data/exports/           - Generated zip files                     ││
│  └─────────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────┘
```

---

## Database Schema

```sql
-- Species master list (imported from CSV)
CREATE TABLE species (
    id INTEGER PRIMARY KEY,
    scientific_name TEXT UNIQUE NOT NULL,
    common_name TEXT,
    genus TEXT,
    family TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Full-text search index
CREATE VIRTUAL TABLE species_fts USING fts5(
    scientific_name,
    common_name,
    genus,
    content='species',
    content_rowid='id'
);

-- API credentials
CREATE TABLE api_keys (
    id INTEGER PRIMARY KEY,
    source TEXT UNIQUE NOT NULL,  -- 'flickr', 'inaturalist', 'wikimedia', 'trefle'
    api_key TEXT NOT NULL,
    api_secret TEXT,
    rate_limit_per_sec REAL DEFAULT 1.0,
    enabled BOOLEAN DEFAULT TRUE
);

-- Downloaded images
CREATE TABLE images (
    id INTEGER PRIMARY KEY,
    species_id INTEGER REFERENCES species(id),
    source TEXT NOT NULL,
    source_id TEXT,  -- Original ID from source
    url TEXT NOT NULL,
    local_path TEXT,
    license TEXT NOT NULL,
    attribution TEXT,
    width INTEGER,
    height INTEGER,
    phash TEXT,  -- Perceptual hash for dedup
    quality_score REAL,  -- Blur/quality metric
    status TEXT DEFAULT 'pending',  -- pending, downloaded, rejected, deleted
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(source, source_id)
);

-- Index for common queries
CREATE INDEX idx_images_species ON images(species_id);
CREATE INDEX idx_images_status ON images(status);
CREATE INDEX idx_images_source ON images(source);
CREATE INDEX idx_images_phash ON images(phash);

-- Scrape jobs
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    source TEXT NOT NULL,
    species_filter TEXT,  -- JSON array of species IDs or NULL for all
    status TEXT DEFAULT 'pending',  -- pending, running, paused, completed, failed
    progress_current INTEGER DEFAULT 0,
    progress_total INTEGER DEFAULT 0,
    images_downloaded INTEGER DEFAULT 0,
    images_rejected INTEGER DEFAULT 0,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    error_message TEXT
);

-- Export jobs
CREATE TABLE exports (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    filter_criteria TEXT,  -- JSON: min_images, licenses, min_quality, species_ids
    train_split REAL DEFAULT 0.8,
    status TEXT DEFAULT 'pending',  -- pending, generating, completed, failed
    file_path TEXT,
    file_size INTEGER,
    species_count INTEGER,
    image_count INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    completed_at TIMESTAMP
);
```
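
One detail worth illustrating: because `species_fts` is declared with `content='species'`, it is an external-content FTS5 table, and SQLite does not update such an index automatically when the content table changes. The sketch below shows one way this could be handled with triggers. The trigger names and the `search_species` helper are assumptions for the example, the schema is trimmed to the indexed columns, and it requires an SQLite build with FTS5 enabled.

```python
import sqlite3

SCHEMA = """
CREATE TABLE species (
    id INTEGER PRIMARY KEY,
    scientific_name TEXT UNIQUE NOT NULL,
    common_name TEXT,
    genus TEXT
);
CREATE VIRTUAL TABLE species_fts USING fts5(
    scientific_name, common_name, genus,
    content='species', content_rowid='id'
);
-- Assumed triggers: mirror every change to species into the index.
CREATE TRIGGER species_ai AFTER INSERT ON species BEGIN
    INSERT INTO species_fts(rowid, scientific_name, common_name, genus)
    VALUES (new.id, new.scientific_name, new.common_name, new.genus);
END;
CREATE TRIGGER species_ad AFTER DELETE ON species BEGIN
    INSERT INTO species_fts(species_fts, rowid, scientific_name, common_name, genus)
    VALUES ('delete', old.id, old.scientific_name, old.common_name, old.genus);
END;
CREATE TRIGGER species_au AFTER UPDATE ON species BEGIN
    INSERT INTO species_fts(species_fts, rowid, scientific_name, common_name, genus)
    VALUES ('delete', old.id, old.scientific_name, old.common_name, old.genus);
    INSERT INTO species_fts(rowid, scientific_name, common_name, genus)
    VALUES (new.id, new.scientific_name, new.common_name, new.genus);
END;
"""


def search_species(conn, query):
    """Return scientific names whose indexed columns match the FTS5 query."""
    rows = conn.execute(
        "SELECT scientific_name FROM species "
        "WHERE id IN (SELECT rowid FROM species_fts WHERE species_fts MATCH ?) "
        "ORDER BY scientific_name",
        (query,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO species (scientific_name, common_name, genus) VALUES (?, ?, ?)",
        ("Monstera deliciosa", "Swiss cheese plant", "Monstera"),
    )
    print(search_species(conn, "monstera"))
```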

---

## API Endpoints

### Species

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/species` | List species (paginated, searchable) |
| POST | `/api/species` | Create single species |
| POST | `/api/species/import` | Bulk import from CSV |
| GET | `/api/species/{id}` | Get species details |
| PUT | `/api/species/{id}` | Update species |
| DELETE | `/api/species/{id}` | Delete species |

### API Keys

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/sources` | List configured sources |
| PUT | `/api/sources/{source}` | Update source config (key, rate limit) |

### Jobs

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/jobs` | List jobs |
| POST | `/api/jobs` | Create scrape job |
| GET | `/api/jobs/{id}` | Get job status |
| POST | `/api/jobs/{id}/pause` | Pause job |
| POST | `/api/jobs/{id}/resume` | Resume job |
| POST | `/api/jobs/{id}/cancel` | Cancel job |

### Images

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/images` | List images (paginated, filterable) |
| GET | `/api/images/{id}` | Get image details |
| DELETE | `/api/images/{id}` | Delete image |
| POST | `/api/images/bulk-delete` | Bulk delete |

### Export

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/exports` | List exports |
| POST | `/api/exports` | Create export job |
| GET | `/api/exports/{id}` | Get export status |
| GET | `/api/exports/{id}/download` | Download zip file |

### Stats

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/stats` | Dashboard statistics |
| GET | `/api/stats/sources` | Per-source breakdown |
| GET | `/api/stats/species` | Per-species image counts |

---

## UI Screens

### 1. Dashboard

- Total species, images by source, images by license
- Active jobs with progress bars
- Quick stats: images/sec, disk usage
- Recent activity feed

### 2. Species Management

- Table: scientific name, common name, genus, image count
- Import CSV button (drag-and-drop)
- Search/filter by name, genus
- Bulk select → "Start Scrape" button
- Inline editing

### 3. API Keys

- Card per source with:
  - API key input (masked)
  - API secret input (if applicable)
  - Rate limit slider
  - Enable/disable toggle
  - Test connection button

### 4. Image Browser

- Grid view with thumbnails (lazy-loaded)
- Filters sidebar:
  - Species (autocomplete)
  - Source (checkboxes)
  - License (checkboxes)
  - Quality score (range slider)
  - Status (tabs: all, pending, downloaded, rejected)
- Sort by: date, quality, species
- Bulk select → actions (delete, re-queue)
- Click to view full-size + metadata

### 5. Jobs

- Table: name, source, status, progress, dates
- Real-time progress updates (WebSocket)
- Actions: pause, resume, cancel, view logs

### 6. Export

- Filter builder:
  - Min images per species
  - License whitelist
  - Min quality score
  - Species selection (all or specific)
- Train/test split slider (default 80/20)
- Preview: estimated species count, image count, file size
- "Generate Zip" button
- Download history with re-download links

---

## Tradeoffs

| Decision | Alternative | Why This Choice |
|----------|-------------|-----------------|
| SQLite | PostgreSQL | Single-user, simpler Docker setup, sufficient for millions of rows |
| Celery+Redis | RQ, Dramatiq | Battle-tested, good monitoring (Flower) |
| React | Vue, Svelte | Largest ecosystem, more component libraries |
| Separate workers | Threads in FastAPI | Better isolation, can scale workers independently |
| Nginx reverse proxy | Traefik | Simpler config for single-app deployment |

---

## Risks & Mitigations

| Risk | Likelihood | Mitigation |
|------|------------|------------|
| iNaturalist rate limits (5GB/hr) | High | Throttle downloads, prioritize species with low counts |
| Disk fills up | Medium | Dashboard shows disk usage, configurable storage limits |
| Scrape jobs crash mid-run | Medium | Job state in DB, resume from last checkpoint |
| Perceptual hash collisions | Low | Store hash, allow manual review of flagged duplicates |
| API keys exposed | Low | Environment variables, not stored in code |
| SQLite write contention | Low | WAL mode, single writer pattern via Celery |
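
The "throttle downloads" mitigation can be sketched as a simple minimum-interval limiter driven by the `rate_limit_per_sec` value stored per source. This is only an illustration under that assumption: the `Throttle` class is hypothetical, it is per-process (it does not coordinate several Celery workers), and the clock and sleep functions are injectable so the behavior can be checked without waiting.

```python
import time


class Throttle:
    """Allow at most `rate_per_sec` calls per second by spacing them out."""

    def __init__(self, rate_per_sec, clock=time.monotonic, sleep=time.sleep):
        if rate_per_sec <= 0:
            raise ValueError("rate_per_sec must be positive")
        self.min_interval = 1.0 / rate_per_sec
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # time at which the next call may proceed

    def wait(self):
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
        return delay


if __name__ == "__main__":
    throttle = Throttle(rate_per_sec=1.0)  # e.g. 1 req/sec for iNaturalist
    for n in range(3):
        throttle.wait()
        print("request", n)
```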

---

## Implementation Phases

### Phase 1: Foundation
- [ ] Docker Compose setup (FastAPI, React, Redis, Nginx)
- [ ] Database schema + migrations (Alembic)
- [ ] Basic FastAPI skeleton with health checks
- [ ] React app scaffolding with Tailwind

### Phase 2: Core Data Management
- [ ] Species CRUD API
- [ ] CSV import endpoint
- [ ] Species list UI with search/filter
- [ ] API keys management UI

### Phase 3: iNaturalist Scraper
- [ ] Celery worker setup
- [ ] iNaturalist/GBIF scraper task
- [ ] Job management API
- [ ] Real-time progress (WebSocket or polling)

### Phase 4: Quality Pipeline
- [ ] Image download worker
- [ ] Perceptual hash deduplication
- [ ] Blur detection + quality scoring
- [ ] Resize to 512x512

### Phase 5: Image Browser
- [ ] Image listing API with filters
- [ ] Thumbnail generation
- [ ] Grid view UI
- [ ] Bulk operations

### Phase 6: Additional Scrapers
- [ ] Flickr scraper
- [ ] Wikimedia Commons scraper
- [ ] Trefle scraper (metadata + images)
- [ ] USDA PLANTS scraper

### Phase 7: Export
- [ ] Export job API
- [ ] Train/test split logic
- [ ] Zip generation worker
- [ ] Download endpoint
- [ ] Export UI with filters

### Phase 8: Dashboard & Polish
- [ ] Stats API
- [ ] Dashboard UI with charts
- [ ] Job monitoring UI
- [ ] Error handling + logging
- [ ] Documentation

---

## File Structure

```
PlantGuideScraper/
├── docker-compose.yml
├── .env.example
├── docs/
│   └── master_plan.md
├── backend/
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── alembic/
│   │   └── versions/
│   ├── app/
│   │   ├── __init__.py
│   │   ├── main.py
│   │   ├── config.py
│   │   ├── database.py
│   │   ├── models/
│   │   │   ├── species.py
│   │   │   ├── image.py
│   │   │   ├── job.py
│   │   │   └── export.py
│   │   ├── schemas/
│   │   │   └── ...
│   │   ├── api/
│   │   │   ├── species.py
│   │   │   ├── images.py
│   │   │   ├── jobs.py
│   │   │   ├── exports.py
│   │   │   └── stats.py
│   │   ├── scrapers/
│   │   │   ├── base.py
│   │   │   ├── inaturalist.py
│   │   │   ├── flickr.py
│   │   │   ├── wikimedia.py
│   │   │   └── trefle.py
│   │   ├── workers/
│   │   │   ├── celery_app.py
│   │   │   ├── scrape_tasks.py
│   │   │   ├── quality_tasks.py
│   │   │   └── export_tasks.py
│   │   └── utils/
│   │       ├── image_quality.py
│   │       └── dedup.py
│   └── tests/
├── frontend/
│   ├── Dockerfile
│   ├── package.json
│   ├── src/
│   │   ├── App.tsx
│   │   ├── components/
│   │   ├── pages/
│   │   │   ├── Dashboard.tsx
│   │   │   ├── Species.tsx
│   │   │   ├── Images.tsx
│   │   │   ├── Jobs.tsx
│   │   │   ├── Export.tsx
│   │   │   └── Settings.tsx
│   │   ├── hooks/
│   │   └── api/
│   └── public/
├── nginx/
│   └── nginx.conf
└── data/                    # Bind mount (not in repo)
    ├── db/
    ├── images/
    └── exports/
```

---

## Environment Variables

```bash
# Backend
DATABASE_URL=sqlite:////data/db/plants.sqlite
REDIS_URL=redis://redis:6379/0
IMAGES_PATH=/data/images
EXPORTS_PATH=/data/exports

# API Keys (user-provided)
FLICKR_API_KEY=
FLICKR_API_SECRET=
INATURALIST_APP_ID=
INATURALIST_APP_SECRET=
TREFLE_API_KEY=

# Optional
LOG_LEVEL=INFO
CELERY_CONCURRENCY=4
```
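
For illustration, these variables could be read into a typed settings object as sketched below. This is an assumption about how `app/config.py` might look, not its actual contents: the `Settings` dataclass and `load_settings` helper are hypothetical, and only the standard library is used. Note that in SQLAlchemy-style SQLite URLs an absolute path such as `/data/db/plants.sqlite` is written with four slashes.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    redis_url: str
    images_path: str
    exports_path: str
    log_level: str
    celery_concurrency: int


def load_settings(env=None):
    """Build Settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    return Settings(
        # sqlite:/// + /data/db/plants.sqlite -> four slashes for an absolute path
        database_url=env.get("DATABASE_URL", "sqlite:////data/db/plants.sqlite"),
        redis_url=env.get("REDIS_URL", "redis://redis:6379/0"),
        images_path=env.get("IMAGES_PATH", "/data/images"),
        exports_path=env.get("EXPORTS_PATH", "/data/exports"),
        log_level=env.get("LOG_LEVEL", "INFO"),
        # Environment values are strings, so convert explicitly.
        celery_concurrency=int(env.get("CELERY_CONCURRENCY", "4")),
    )


if __name__ == "__main__":
    print(load_settings())
```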

---

## Commands

```bash
# Development
docker-compose up --build

# Production
docker-compose -f docker-compose.yml -f docker-compose.prod.yml up -d

# Run migrations
docker-compose exec backend alembic upgrade head

# View Celery logs
docker-compose logs -f celery

# Access Redis CLI
docker-compose exec redis redis-cli
```
14
frontend/Dockerfile
Normal file
@@ -0,0 +1,14 @@
FROM node:20-alpine

WORKDIR /app

# Install dependencies
COPY package*.json ./
RUN npm install

# Copy source
COPY . .

EXPOSE 3000

CMD ["npm", "run", "dev", "--", "--host"]
283
frontend/dist/assets/index-BXIq8BNP.js
vendored
Normal file
File diff suppressed because one or more lines are too long
1
frontend/dist/assets/index-uHzGA3u6.css
vendored
Normal file
File diff suppressed because one or more lines are too long
14
frontend/dist/index.html
vendored
Normal file
@@ -0,0 +1,14 @@
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <link rel="icon" type="image/svg+xml" href="/vite.svg" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>PlantGuideScraper</title>
    <script type="module" crossorigin src="/assets/index-BXIq8BNP.js"></script>
    <link rel="stylesheet" crossorigin href="/assets/index-uHzGA3u6.css">
  </head>
  <body>
    <div id="root"></div>
  </body>
</html>
13
frontend/index.html
Normal file
@@ -0,0 +1,13 @@
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <link rel="icon" type="image/svg+xml" href="/vite.svg" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>PlantGuideScraper</title>
  </head>
  <body>
    <div id="root"></div>
    <script type="module" src="/src/main.tsx"></script>
  </body>
</html>
31
frontend/package.json
Normal file
@@ -0,0 +1,31 @@
{
  "name": "plant-scraper-frontend",
  "private": true,
  "version": "1.0.0",
  "type": "module",
  "scripts": {
    "dev": "vite",
    "build": "tsc && vite build",
    "preview": "vite preview"
  },
  "dependencies": {
    "react": "^18.2.0",
    "react-dom": "^18.2.0",
    "react-router-dom": "^6.21.0",
    "@tanstack/react-query": "^5.17.0",
    "axios": "^1.6.0",
    "lucide-react": "^0.303.0",
    "recharts": "^2.10.0",
    "clsx": "^2.1.0"
  },
  "devDependencies": {
    "@types/react": "^18.2.0",
    "@types/react-dom": "^18.2.0",
    "@vitejs/plugin-react": "^4.2.0",
    "autoprefixer": "^10.4.16",
    "postcss": "^8.4.32",
    "tailwindcss": "^3.4.0",
    "typescript": "^5.3.0",
    "vite": "^5.0.0"
  }
}
6
frontend/postcss.config.js
Normal file
@@ -0,0 +1,6 @@
export default {
  plugins: {
    tailwindcss: {},
    autoprefixer: {},
  },
}
81
frontend/src/App.tsx
Normal file
@@ -0,0 +1,81 @@
import { BrowserRouter, Routes, Route, NavLink } from 'react-router-dom'
import {
  LayoutDashboard,
  Leaf,
  Image,
  Play,
  Download,
  Settings,
} from 'lucide-react'
import { clsx } from 'clsx'

import Dashboard from './pages/Dashboard'
import Species from './pages/Species'
import Images from './pages/Images'
import Jobs from './pages/Jobs'
import Export from './pages/Export'
import SettingsPage from './pages/Settings'

const navItems = [
  { to: '/', icon: LayoutDashboard, label: 'Dashboard' },
  { to: '/species', icon: Leaf, label: 'Species' },
  { to: '/images', icon: Image, label: 'Images' },
  { to: '/jobs', icon: Play, label: 'Jobs' },
  { to: '/export', icon: Download, label: 'Export' },
  { to: '/settings', icon: Settings, label: 'Settings' },
]

function Sidebar() {
  return (
    <aside className="w-64 bg-white border-r border-gray-200 min-h-screen">
      <div className="p-4 border-b border-gray-200">
        <h1 className="text-xl font-bold text-green-600 flex items-center gap-2">
          <Leaf className="w-6 h-6" />
          PlantScraper
        </h1>
      </div>
      <nav className="p-4">
        <ul className="space-y-2">
          {navItems.map((item) => (
            <li key={item.to}>
              <NavLink
                to={item.to}
                className={({ isActive }) =>
                  clsx(
                    'flex items-center gap-3 px-3 py-2 rounded-lg transition-colors',
                    isActive
                      ? 'bg-green-50 text-green-700'
                      : 'text-gray-600 hover:bg-gray-100'
                  )
                }
              >
                <item.icon className="w-5 h-5" />
                {item.label}
              </NavLink>
            </li>
          ))}
        </ul>
      </nav>
    </aside>
  )
}

export default function App() {
  return (
    <BrowserRouter>
      <div className="flex min-h-screen">
        <Sidebar />
        <main className="flex-1 p-8">
          <Routes>
            <Route path="/" element={<Dashboard />} />
            <Route path="/species" element={<Species />} />
            <Route path="/images" element={<Images />} />
            <Route path="/jobs" element={<Jobs />} />
            <Route path="/export" element={<Export />} />
            <Route path="/settings" element={<SettingsPage />} />
          </Routes>
        </main>
      </div>
    </BrowserRouter>
  )
}
275
frontend/src/api/client.ts
Normal file
@@ -0,0 +1,275 @@
import axios from 'axios'

const API_URL = import.meta.env.VITE_API_URL || ''

export const api = axios.create({
  baseURL: `${API_URL}/api`,
  headers: {
    'Content-Type': 'application/json',
  },
})

// Types
export interface Species {
  id: number
  scientific_name: string
  common_name: string | null
  genus: string | null
  family: string | null
  created_at: string
  image_count: number
}

export interface SpeciesListResponse {
  items: Species[]
  total: number
  page: number
  page_size: number
  pages: number
}

export interface Image {
  id: number
  species_id: number
  species_name: string | null
  source: string
  source_id: string | null
  url: string
  local_path: string | null
  license: string
  attribution: string | null
  width: number | null
  height: number | null
  quality_score: number | null
  status: string
  created_at: string
}

export interface ImageListResponse {
  items: Image[]
  total: number
  page: number
  page_size: number
  pages: number
}

export interface Job {
  id: number
  name: string
  source: string
  species_filter: string | null
  status: string
  progress_current: number
  progress_total: number
  images_downloaded: number
  images_rejected: number
  started_at: string | null
  completed_at: string | null
  error_message: string | null
  created_at: string
}

export interface JobListResponse {
  items: Job[]
  total: number
}

export interface JobProgress {
  status: string
  progress_current: number
  progress_total: number
  current_species?: string
}

export interface Export {
  id: number
  name: string
  filter_criteria: string | null
  train_split: number
  status: string
  file_path: string | null
  file_size: number | null
  species_count: number | null
  image_count: number | null
  created_at: string
  completed_at: string | null
  error_message: string | null
}

export interface SourceConfig {
  name: string
  label: string
  requires_secret: boolean
  auth_type: 'none' | 'api_key' | 'api_key_secret' | 'oauth'
  configured: boolean
  enabled: boolean
  api_key_masked: string | null
  has_secret: boolean
  has_access_token: boolean
  rate_limit_per_sec: number
  default_rate: number
}

export interface Stats {
  total_species: number
  total_images: number
  images_downloaded: number
  images_pending: number
  images_rejected: number
  disk_usage_mb: number
  sources: Array<{
    source: string
    image_count: number
    downloaded: number
    pending: number
    rejected: number
  }>
  licenses: Array<{
    license: string
    count: number
  }>
  jobs: {
    running: number
    pending: number
    completed: number
    failed: number
  }
  top_species: Array<{
    id: number
    scientific_name: string
    common_name: string | null
    image_count: number
  }>
  under_represented: Array<{
    id: number
    scientific_name: string
    common_name: string | null
    image_count: number
  }>
}

// API functions
|
||||||
|
export const speciesApi = {
|
||||||
|
list: (params?: { page?: number; page_size?: number; search?: string; genus?: string; has_images?: boolean; max_images?: number; min_images?: number }) =>
|
||||||
|
api.get<SpeciesListResponse>('/species', { params }),
|
||||||
|
get: (id: number) => api.get<Species>(`/species/${id}`),
|
||||||
|
create: (data: { scientific_name: string; common_name?: string; genus?: string; family?: string }) =>
|
||||||
|
api.post<Species>('/species', data),
|
||||||
|
update: (id: number, data: Partial<Species>) => api.put<Species>(`/species/${id}`, data),
|
||||||
|
delete: (id: number) => api.delete(`/species/${id}`),
|
||||||
|
import: (file: File) => {
|
||||||
|
const formData = new FormData()
|
||||||
|
formData.append('file', file)
|
||||||
|
return api.post('/species/import', formData, {
|
||||||
|
headers: { 'Content-Type': 'multipart/form-data' },
|
||||||
|
})
|
||||||
|
},
|
||||||
|
importJson: (file: File) => {
|
||||||
|
const formData = new FormData()
|
||||||
|
formData.append('file', file)
|
||||||
|
return api.post('/species/import-json', formData, {
|
||||||
|
headers: { 'Content-Type': 'multipart/form-data' },
|
||||||
|
})
|
||||||
|
},
|
||||||
|
genera: () => api.get<string[]>('/species/genera/list'),
|
||||||
|
}
|
||||||
|
|
||||||
|
export interface ImportScanResult {
|
||||||
|
available: boolean
|
||||||
|
message?: string
|
||||||
|
sources: Array<{
|
||||||
|
name: string
|
||||||
|
species_count: number
|
||||||
|
image_count: number
|
||||||
|
}>
|
||||||
|
total_images: number
|
||||||
|
matched_species: number
|
||||||
|
unmatched_species: string[]
|
||||||
|
}
|
||||||
|
|
||||||
|
export interface ImportResult {
|
||||||
|
imported: number
|
||||||
|
skipped: number
|
||||||
|
errors: string[]
|
||||||
|
}
|
||||||
|
|
||||||
|
export const imagesApi = {
  list: (params?: {
    page?: number
    page_size?: number
    species_id?: number
    source?: string
    license?: string
    status?: string
    min_quality?: number
    search?: string
  }) => api.get<ImageListResponse>('/images', { params }),
  get: (id: number) => api.get<Image>(`/images/${id}`),
  delete: (id: number) => api.delete(`/images/${id}`),
  bulkDelete: (ids: number[]) => api.post('/images/bulk-delete', ids),
  sources: () => api.get<string[]>('/images/sources'),
  licenses: () => api.get<string[]>('/images/licenses'),
  processPending: (source?: string) =>
    api.post<{ pending_count: number; task_id: string }>('/images/process-pending', null, {
      params: source ? { source } : undefined,
    }),
  processPendingStatus: (taskId: string) =>
    api.get<{ task_id: string; state: string; queued?: number; total?: number }>(
      `/images/process-pending/status/${taskId}`
    ),
  scanImports: () => api.get<ImportScanResult>('/images/import/scan'),
  runImport: (moveFiles: boolean = false) =>
    api.post<ImportResult>('/images/import/run', null, { params: { move_files: moveFiles } }),
}

export const jobsApi = {
  list: (params?: { status?: string; source?: string; limit?: number }) =>
    api.get<JobListResponse>('/jobs', { params }),
  get: (id: number) => api.get<Job>(`/jobs/${id}`),
  create: (data: { name: string; source: string; species_ids?: number[]; only_without_images?: boolean; max_images?: number }) =>
    api.post<Job>('/jobs', data),
  progress: (id: number) => api.get<JobProgress>(`/jobs/${id}/progress`),
  pause: (id: number) => api.post(`/jobs/${id}/pause`),
  resume: (id: number) => api.post(`/jobs/${id}/resume`),
  cancel: (id: number) => api.post(`/jobs/${id}/cancel`),
}

export const exportsApi = {
  list: (params?: { limit?: number }) => api.get('/exports', { params }),
  get: (id: number) => api.get<Export>(`/exports/${id}`),
  create: (data: {
    name: string
    filter_criteria: {
      min_images_per_species: number
      licenses?: string[]
      min_quality?: number
      species_ids?: number[]
    }
    train_split: number
  }) => api.post<Export>('/exports', data),
  // Response fields typed from how the Export page reads the preview result
  preview: (data: any) =>
    api.post<{ species_count: number; image_count: number; estimated_size_mb: number }>('/exports/preview', data),
  progress: (id: number) => api.get(`/exports/${id}/progress`),
  // Returns the download URL (used as a link target); it does not issue a request
  download: (id: number) => `${API_URL}/api/exports/${id}/download`,
  delete: (id: number) => api.delete(`/exports/${id}`),
}

export const sourcesApi = {
  list: () => api.get<SourceConfig[]>('/sources'),
  get: (source: string) => api.get<SourceConfig>(`/sources/${source}`),
  update: (source: string, data: {
    api_key?: string
    api_secret?: string
    access_token?: string
    rate_limit_per_sec?: number
    enabled?: boolean
  }) => api.put(`/sources/${source}`, { source, ...data }),
  test: (source: string) => api.post(`/sources/${source}/test`),
  delete: (source: string) => api.delete(`/sources/${source}`),
}

export const statsApi = {
  get: () => api.get<Stats>('/stats'),
  sources: () => api.get('/stats/sources'),
  species: (params?: { min_count?: number; max_count?: number }) =>
    api.get('/stats/species', { params }),
}
7
frontend/src/index.css
Normal file
@@ -0,0 +1,7 @@
@tailwind base;
@tailwind components;
@tailwind utilities;

body {
  @apply bg-gray-50 text-gray-900;
}
22
frontend/src/main.tsx
Normal file
@@ -0,0 +1,22 @@
import React from 'react'
import ReactDOM from 'react-dom/client'
import { QueryClient, QueryClientProvider } from '@tanstack/react-query'
import App from './App'
import './index.css'

const queryClient = new QueryClient({
  defaultOptions: {
    queries: {
      refetchOnWindowFocus: false,
      retry: 1,
    },
  },
})

ReactDOM.createRoot(document.getElementById('root')!).render(
  <React.StrictMode>
    <QueryClientProvider client={queryClient}>
      <App />
    </QueryClientProvider>
  </React.StrictMode>,
)
413
frontend/src/pages/Dashboard.tsx
Normal file
@@ -0,0 +1,413 @@
import { useState } from 'react'
import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
import {
  Leaf,
  Image,
  HardDrive,
  Clock,
  CheckCircle,
  XCircle,
  AlertCircle,
} from 'lucide-react'
import {
  BarChart,
  Bar,
  XAxis,
  YAxis,
  Tooltip,
  ResponsiveContainer,
  PieChart,
  Pie,
  Cell,
} from 'recharts'
import { statsApi, imagesApi } from '../api/client'

const COLORS = ['#22c55e', '#3b82f6', '#f59e0b', '#ef4444', '#8b5cf6', '#ec4899']

function StatCard({
  title,
  value,
  icon: Icon,
  color,
}: {
  title: string
  value: string | number
  icon: React.ElementType
  color: string
}) {
  return (
    <div className="bg-white rounded-lg shadow p-6">
      <div className="flex items-center justify-between">
        <div>
          <p className="text-sm text-gray-500">{title}</p>
          <p className="text-2xl font-bold mt-1">{value}</p>
        </div>
        <div className={`p-3 rounded-full ${color}`}>
          <Icon className="w-6 h-6 text-white" />
        </div>
      </div>
    </div>
  )
}

export default function Dashboard() {
  const queryClient = useQueryClient()

  const [processingTaskId, setProcessingTaskId] = useState<string | null>(null)

  const processPendingMutation = useMutation({
    mutationFn: () => imagesApi.processPending(),
    onSuccess: (res) => {
      setProcessingTaskId(res.data.task_id)
    },
  })

  // Poll task status while processing
  const { data: taskStatus } = useQuery({
    queryKey: ['process-pending-status', processingTaskId],
    queryFn: async () => {
      const res = await imagesApi.processPendingStatus(processingTaskId!)
      if (res.data.state === 'SUCCESS' || res.data.state === 'FAILURE') {
        // Task finished - clear tracking and refresh stats
        setTimeout(() => {
          setProcessingTaskId(null)
          queryClient.invalidateQueries({ queryKey: ['stats'] })
        }, 0)
      }
      return res.data
    },
    enabled: !!processingTaskId,
    refetchInterval: (query) => {
      const state = query.state.data?.state
      if (state === 'SUCCESS' || state === 'FAILURE') return false
      return 2000
    },
  })

  const isProcessing = !!processingTaskId && taskStatus?.state !== 'SUCCESS' && taskStatus?.state !== 'FAILURE'

  const { data: stats, isLoading, error, failureCount, isFetching } = useQuery({
    queryKey: ['stats'],
    queryFn: async () => {
      const startTime = Date.now()
      console.log('[Dashboard] Fetching stats...')

      // Timeout guard. NOTE: controller.signal is never passed to statsApi.get(),
      // so abort() does not cancel the in-flight request.
      const controller = new AbortController()
      const timeoutId = setTimeout(() => controller.abort(), 10000) // 10 second timeout

      try {
        const res = await statsApi.get()
        clearTimeout(timeoutId)
        console.log(`[Dashboard] Stats loaded in ${Date.now() - startTime}ms`)
        return res.data
      } catch (err: any) {
        clearTimeout(timeoutId)
        if (err.name === 'AbortError' || err.code === 'ECONNABORTED') {
          console.error('[Dashboard] Request timed out after 10 seconds')
          throw new Error('Request timed out after 10 seconds - backend may be unresponsive')
        }
        console.error('[Dashboard] Stats fetch failed:', err)
        console.error('[Dashboard] Error details:', {
          message: err.message,
          status: err.response?.status,
          statusText: err.response?.statusText,
          data: err.response?.data,
        })
        throw err
      }
    },
    refetchInterval: 30000, // 30 seconds - matches backend cache
    retry: 1,
    staleTime: 25000,
  })

  // Debug panel to test backend
  const { data: debugData, refetch: refetchDebug, isFetching: isDebugFetching } = useQuery({
    queryKey: ['debug'],
    queryFn: async () => {
      const res = await fetch('/api/debug')
      return res.json()
    },
    enabled: false, // Only fetch when manually triggered
  })

  if (isLoading) {
    return (
      <div className="flex items-center justify-center h-64">
        <div className="text-center">
          <div className="animate-spin rounded-full h-8 w-8 border-b-2 border-green-600 mx-auto"></div>
          <p className="mt-2 text-gray-500">Loading stats...</p>
        </div>
      </div>
    )
  }

  if (error) {
    const err = error as any
    return (
      <div className="space-y-4 m-4">
        <div className="bg-red-50 border border-red-200 rounded-lg p-6">
          <h2 className="text-lg font-bold text-red-700 mb-2">Failed to load dashboard</h2>
          <div className="space-y-2 text-sm">
            <p><strong>Error:</strong> {err.message}</p>
            {err.response && (
              <>
                <p><strong>Status:</strong> {err.response.status} {err.response.statusText}</p>
                {err.response.data && (
                  <p><strong>Response:</strong> {JSON.stringify(err.response.data)}</p>
                )}
              </>
            )}
            <p><strong>Retry count:</strong> {failureCount}</p>
          </div>
        </div>

        <div className="bg-blue-50 border border-blue-200 rounded-lg p-6">
          <h3 className="font-bold text-blue-700 mb-2">Debug Backend Connection</h3>
          <button
            onClick={() => refetchDebug()}
            disabled={isDebugFetching}
            className="px-4 py-2 bg-blue-600 text-white rounded hover:bg-blue-700 disabled:opacity-50"
          >
            {isDebugFetching ? 'Testing...' : 'Test Backend'}
          </button>
          {debugData && (
            <pre className="mt-4 p-4 bg-white rounded text-xs overflow-auto">
              {JSON.stringify(debugData, null, 2)}
            </pre>
          )}
        </div>
      </div>
    )
  }

  if (!stats) {
    return <div>Failed to load stats</div>
  }

  const sourceData = stats.sources.map((s) => ({
    name: s.source,
    downloaded: s.downloaded,
    pending: s.pending,
    rejected: s.rejected,
  }))

  const licenseData = stats.licenses.map((l, i) => ({
    name: l.license,
    value: l.count,
    color: COLORS[i % COLORS.length],
  }))

  return (
    <div className="space-y-6">
      <h1 className="text-2xl font-bold">Dashboard</h1>

      {/* Stats Grid */}
      <div className="grid grid-cols-1 md:grid-cols-2 lg:grid-cols-4 gap-4">
        <StatCard
          title="Total Species"
          value={stats.total_species.toLocaleString()}
          icon={Leaf}
          color="bg-green-500"
        />
        <StatCard
          title="Downloaded Images"
          value={stats.images_downloaded.toLocaleString()}
          icon={Image}
          color="bg-blue-500"
        />
        <StatCard
          title="Pending Images"
          value={stats.images_pending.toLocaleString()}
          icon={Clock}
          color="bg-yellow-500"
        />
        <StatCard
          title="Disk Usage"
          value={`${stats.disk_usage_mb.toFixed(1)} MB`}
          icon={HardDrive}
          color="bg-purple-500"
        />
      </div>

      {/* Process Pending Banner */}
      {(stats.images_pending > 0 || isProcessing) && (
        <div className="bg-yellow-50 border border-yellow-200 rounded-lg p-4 flex items-center justify-between">
          <div>
            <p className="font-semibold text-yellow-800">
              {isProcessing
                ? `Processing pending images...`
                : `${stats.images_pending.toLocaleString()} pending images`}
            </p>
            <p className="text-sm text-yellow-700">
              {isProcessing && taskStatus?.queued != null && taskStatus?.total != null
                ? `Queued ${taskStatus.queued.toLocaleString()} of ${taskStatus.total.toLocaleString()} for download`
                : isProcessing
                  ? 'Queueing images for download...'
                  : 'These images have been scraped but not yet downloaded and processed.'}
            </p>
          </div>
          <button
            onClick={() => processPendingMutation.mutate()}
            disabled={isProcessing || processPendingMutation.isPending}
            className="px-4 py-2 bg-yellow-600 text-white rounded-lg hover:bg-yellow-700 disabled:opacity-50 whitespace-nowrap"
          >
            {isProcessing ? 'Processing...' : processPendingMutation.isPending ? 'Starting...' : 'Process All Pending'}
          </button>
        </div>
      )}

      {/* Jobs Status */}
      <div className="bg-white rounded-lg shadow p-6">
        <h2 className="text-lg font-semibold mb-4">Jobs Status</h2>
        <div className="flex gap-6">
          <div className="flex items-center gap-2">
            <div className="w-3 h-3 rounded-full bg-blue-500 animate-pulse"></div>
            <span>Running: {stats.jobs.running}</span>
          </div>
          <div className="flex items-center gap-2">
            <Clock className="w-4 h-4 text-yellow-500" />
            <span>Pending: {stats.jobs.pending}</span>
          </div>
          <div className="flex items-center gap-2">
            <CheckCircle className="w-4 h-4 text-green-500" />
            <span>Completed: {stats.jobs.completed}</span>
          </div>
          <div className="flex items-center gap-2">
            <XCircle className="w-4 h-4 text-red-500" />
            <span>Failed: {stats.jobs.failed}</span>
          </div>
        </div>
      </div>

      {/* Charts */}
      <div className="grid grid-cols-1 lg:grid-cols-2 gap-6">
        {/* Source Chart */}
        <div className="bg-white rounded-lg shadow p-6">
          <h2 className="text-lg font-semibold mb-4">Images by Source</h2>
          {sourceData.length > 0 ? (
            <ResponsiveContainer width="100%" height={300}>
              <BarChart data={sourceData}>
                <XAxis dataKey="name" />
                <YAxis />
                <Tooltip />
                <Bar dataKey="downloaded" fill="#22c55e" name="Downloaded" />
                <Bar dataKey="pending" fill="#f59e0b" name="Pending" />
                <Bar dataKey="rejected" fill="#ef4444" name="Rejected" />
              </BarChart>
            </ResponsiveContainer>
          ) : (
            <div className="h-[300px] flex items-center justify-center text-gray-400">
              No data yet
            </div>
          )}
        </div>

        {/* License Chart */}
        <div className="bg-white rounded-lg shadow p-6">
          <h2 className="text-lg font-semibold mb-4">Images by License</h2>
          {licenseData.length > 0 ? (
            <ResponsiveContainer width="100%" height={300}>
              <PieChart>
                <Pie
                  data={licenseData}
                  dataKey="value"
                  nameKey="name"
                  cx="50%"
                  cy="50%"
                  outerRadius={100}
                  label={({ name, percent }) =>
                    `${name} (${(percent * 100).toFixed(0)}%)`
                  }
                >
                  {licenseData.map((entry, index) => (
                    <Cell key={index} fill={entry.color} />
                  ))}
                </Pie>
                <Tooltip />
              </PieChart>
            </ResponsiveContainer>
          ) : (
            <div className="h-[300px] flex items-center justify-center text-gray-400">
              No data yet
            </div>
          )}
        </div>
      </div>

      {/* Species Tables */}
      <div className="grid grid-cols-1 lg:grid-cols-2 gap-6">
        {/* Top Species */}
        <div className="bg-white rounded-lg shadow p-6">
          <h2 className="text-lg font-semibold mb-4">Top Species</h2>
          <table className="w-full">
            <thead>
              <tr className="text-left text-sm text-gray-500">
                <th className="pb-2">Species</th>
                <th className="pb-2 text-right">Images</th>
              </tr>
            </thead>
            <tbody>
              {stats.top_species.map((s) => (
                <tr key={s.id} className="border-t">
                  <td className="py-2">
                    <div className="font-medium">{s.scientific_name}</div>
                    {s.common_name && (
                      <div className="text-sm text-gray-500">{s.common_name}</div>
                    )}
                  </td>
                  <td className="py-2 text-right">{s.image_count}</td>
                </tr>
              ))}
              {stats.top_species.length === 0 && (
                <tr>
                  <td colSpan={2} className="py-4 text-center text-gray-400">
                    No species yet
                  </td>
                </tr>
              )}
            </tbody>
          </table>
        </div>

        {/* Under-represented Species */}
        <div className="bg-white rounded-lg shadow p-6">
          <h2 className="text-lg font-semibold mb-4 flex items-center gap-2">
            <AlertCircle className="w-5 h-5 text-yellow-500" />
            Under-represented Species
          </h2>
          <p className="text-sm text-gray-500 mb-4">Species with fewer than 100 images</p>
          <table className="w-full">
            <thead>
              <tr className="text-left text-sm text-gray-500">
                <th className="pb-2">Species</th>
                <th className="pb-2 text-right">Images</th>
              </tr>
            </thead>
            <tbody>
              {stats.under_represented.map((s) => (
                <tr key={s.id} className="border-t">
                  <td className="py-2">
                    <div className="font-medium">{s.scientific_name}</div>
                    {s.common_name && (
                      <div className="text-sm text-gray-500">{s.common_name}</div>
                    )}
                  </td>
                  <td className="py-2 text-right text-yellow-600">{s.image_count}</td>
                </tr>
              ))}
              {stats.under_represented.length === 0 && (
                <tr>
                  <td colSpan={2} className="py-4 text-center text-gray-400">
                    All species have 100+ images
                  </td>
                </tr>
              )}
            </tbody>
          </table>
        </div>
      </div>
    </div>
  )
}
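The polling rule used by `Dashboard` above (stop refetching once the task state is `SUCCESS` or `FAILURE`, otherwise poll every 2000 ms) can be pulled out into pure functions so it can be checked without React Query. This is a minimal sketch, not code from the commit: `isTerminalState` and `nextPollInterval` are hypothetical helper names, and the only states assumed are the two terminal ones the component already tests for.

```typescript
// Hypothetical helpers (not part of the commit) mirroring the refetchInterval
// callback in Dashboard.tsx.

// True once the background task has finished, successfully or not.
function isTerminalState(state: string | undefined): boolean {
  return state === 'SUCCESS' || state === 'FAILURE'
}

// Returns false to stop polling, or the delay in milliseconds before the next poll.
// An undefined state (no response yet) keeps polling, as in the component.
function nextPollInterval(state: string | undefined, intervalMs: number = 2000): number | false {
  return isTerminalState(state) ? false : intervalMs
}
```

With helpers like these, the component's `refetchInterval` could be written as `(query) => nextPollInterval(query.state.data?.state)`.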
346
frontend/src/pages/Export.tsx
Normal file
@@ -0,0 +1,346 @@
import { useState } from 'react'
import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
import {
  Download,
  Trash2,
  CheckCircle,
  Clock,
  AlertCircle,
  Package,
} from 'lucide-react'
import { exportsApi, imagesApi, Export as ExportType } from '../api/client'

export default function Export() {
  const queryClient = useQueryClient()
  const [showCreateModal, setShowCreateModal] = useState(false)

  const { data: exports, isLoading } = useQuery({
    queryKey: ['exports'],
    queryFn: () => exportsApi.list({ limit: 50 }).then((res) => res.data),
    refetchInterval: 5000,
  })

  const deleteMutation = useMutation({
    mutationFn: (id: number) => exportsApi.delete(id),
    onSuccess: () => queryClient.invalidateQueries({ queryKey: ['exports'] }),
  })

  const getStatusIcon = (status: string) => {
    switch (status) {
      case 'generating':
        return <Clock className="w-4 h-4 text-blue-500 animate-pulse" />
      case 'completed':
        return <CheckCircle className="w-4 h-4 text-green-500" />
      case 'failed':
        return <AlertCircle className="w-4 h-4 text-red-500" />
      default:
        return <Clock className="w-4 h-4 text-gray-400" />
    }
  }

  const formatBytes = (bytes: number | null) => {
    if (!bytes) return 'N/A'
    if (bytes < 1024) return `${bytes} B`
    if (bytes < 1024 * 1024) return `${(bytes / 1024).toFixed(1)} KB`
    if (bytes < 1024 * 1024 * 1024) return `${(bytes / 1024 / 1024).toFixed(1)} MB`
    return `${(bytes / 1024 / 1024 / 1024).toFixed(1)} GB`
  }

  return (
    <div className="space-y-6">
      <div className="flex items-center justify-between">
        <h1 className="text-2xl font-bold">Export Dataset</h1>
        <button
          onClick={() => setShowCreateModal(true)}
          className="flex items-center gap-2 px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700"
        >
          <Package className="w-4 h-4" />
          Create Export
        </button>
      </div>

      {/* Info Card */}
      <div className="bg-blue-50 border border-blue-200 rounded-lg p-4">
        <h3 className="font-medium text-blue-800">Export Format</h3>
        <p className="text-sm text-blue-700 mt-1">
          Exports are created in Create ML-compatible format with Training and Testing
          folders. Each species has its own subfolder with images.
        </p>
      </div>

      {/* Exports List */}
      {isLoading ? (
        <div className="flex items-center justify-center h-64">
          <div className="animate-spin rounded-full h-8 w-8 border-b-2 border-green-600"></div>
        </div>
      ) : exports?.items.length === 0 ? (
        <div className="bg-white rounded-lg shadow p-8 text-center text-gray-400">
          <Package className="w-12 h-12 mx-auto mb-4" />
          <p>No exports yet</p>
          <p className="text-sm mt-2">
            Create an export to download your dataset for CoreML training
          </p>
        </div>
      ) : (
        <div className="space-y-4">
          {exports?.items.map((exp: ExportType) => (
            <div
              key={exp.id}
              className="bg-white rounded-lg shadow p-6"
            >
              <div className="flex items-start justify-between">
                <div className="flex-1">
                  <div className="flex items-center gap-3">
                    {getStatusIcon(exp.status)}
                    <h3 className="font-semibold">{exp.name}</h3>
                  </div>
                  <div className="mt-2 grid grid-cols-4 gap-4 text-sm">
                    <div>
                      <span className="text-gray-500">Species:</span>{' '}
                      {exp.species_count ?? 'N/A'}
                    </div>
                    <div>
                      <span className="text-gray-500">Images:</span>{' '}
                      {exp.image_count ?? 'N/A'}
                    </div>
                    <div>
                      <span className="text-gray-500">Size:</span>{' '}
                      {formatBytes(exp.file_size)}
                    </div>
                    <div>
                      <span className="text-gray-500">Split:</span>{' '}
                      {Math.round(exp.train_split * 100)}% / {Math.round((1 - exp.train_split) * 100)}%
                    </div>
                  </div>
                  {exp.error_message && (
                    <div className="mt-2 text-sm text-red-600">
                      Error: {exp.error_message}
                    </div>
                  )}
                  <div className="mt-2 text-xs text-gray-400">
                    Created: {new Date(exp.created_at).toLocaleString()}
                    {exp.completed_at && (
                      <span className="ml-4">
                        Completed: {new Date(exp.completed_at).toLocaleString()}
                      </span>
                    )}
                  </div>
                </div>
                <div className="flex gap-2 ml-4">
                  {exp.status === 'completed' && (
                    <a
                      href={exportsApi.download(exp.id)}
                      className="flex items-center gap-2 px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700"
                    >
                      <Download className="w-4 h-4" />
                      Download
                    </a>
                  )}
                  <button
                    onClick={() => deleteMutation.mutate(exp.id)}
                    className="p-2 text-red-600 hover:bg-red-50 rounded"
                    title="Delete"
                  >
                    <Trash2 className="w-5 h-5" />
                  </button>
                </div>
              </div>
            </div>
          ))}
        </div>
      )}

      {/* Create Modal */}
      {showCreateModal && (
        <CreateExportModal onClose={() => setShowCreateModal(false)} />
      )}
    </div>
  )
}

function CreateExportModal({ onClose }: { onClose: () => void }) {
  const queryClient = useQueryClient()
  const [form, setForm] = useState({
    name: `Export ${new Date().toLocaleDateString()}`,
    min_images: 100,
    train_split: 0.8,
    licenses: [] as string[],
    min_quality: undefined as number | undefined,
  })

  const { data: licenses } = useQuery({
    queryKey: ['image-licenses'],
    queryFn: () => imagesApi.licenses().then((res) => res.data),
  })

  const previewMutation = useMutation({
    mutationFn: () =>
      exportsApi.preview({
        name: form.name,
        filter_criteria: {
          min_images_per_species: form.min_images,
          licenses: form.licenses.length > 0 ? form.licenses : undefined,
          min_quality: form.min_quality,
        },
        train_split: form.train_split,
      }),
  })

  const createMutation = useMutation({
    mutationFn: () =>
      exportsApi.create({
        name: form.name,
        filter_criteria: {
          min_images_per_species: form.min_images,
          licenses: form.licenses.length > 0 ? form.licenses : undefined,
          min_quality: form.min_quality,
        },
        train_split: form.train_split,
      }),
    onSuccess: () => {
      queryClient.invalidateQueries({ queryKey: ['exports'] })
      onClose()
    },
  })

  const toggleLicense = (license: string) => {
    setForm((f) => ({
      ...f,
      licenses: f.licenses.includes(license)
        ? f.licenses.filter((l) => l !== license)
        : [...f.licenses, license],
    }))
  }

  return (
    <div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50">
      <div className="bg-white rounded-lg p-6 w-full max-w-lg">
        <h2 className="text-xl font-bold mb-4">Create Export</h2>

        <div className="space-y-4">
          <div>
            <label className="block text-sm font-medium mb-1">Export Name</label>
            <input
              type="text"
              value={form.name}
              onChange={(e) => setForm({ ...form, name: e.target.value })}
              className="w-full px-3 py-2 border rounded-lg"
            />
          </div>

          <div>
            <label className="block text-sm font-medium mb-1">
              Minimum Images per Species
            </label>
            <input
              type="number"
              value={form.min_images}
              onChange={(e) =>
                setForm({ ...form, min_images: parseInt(e.target.value, 10) || 0 })
              }
              className="w-full px-3 py-2 border rounded-lg"
              min={1}
            />
            <p className="text-xs text-gray-500 mt-1">
              Species with fewer images will be excluded
            </p>
          </div>

          <div>
            <label className="block text-sm font-medium mb-1">
              Train/Test Split
            </label>
            <div className="flex items-center gap-4">
              <input
                type="range"
                value={form.train_split}
                onChange={(e) =>
                  setForm({ ...form, train_split: parseFloat(e.target.value) })
                }
                min={0.5}
                max={0.95}
                step={0.05}
                className="flex-1"
              />
              <span className="text-sm w-20 text-right">
                {Math.round(form.train_split * 100)}% /{' '}
                {Math.round((1 - form.train_split) * 100)}%
              </span>
            </div>
          </div>

          <div>
            <label className="block text-sm font-medium mb-2">
              Filter by License (optional)
            </label>
            <div className="flex flex-wrap gap-2">
              {licenses?.map((license) => (
                <button
                  key={license}
                  onClick={() => toggleLicense(license)}
                  className={`px-3 py-1 rounded-full text-sm ${
                    form.licenses.includes(license)
                      ? 'bg-green-100 text-green-700 border-green-300'
                      : 'bg-gray-100 text-gray-600'
                  } border`}
                >
                  {license}
                </button>
              ))}
            </div>
            {form.licenses.length === 0 && (
              <p className="text-xs text-gray-500 mt-1">
                All licenses will be included
              </p>
            )}
          </div>

          {/* Preview */}
          {previewMutation.data && (
            <div className="bg-gray-50 rounded-lg p-4">
              <h4 className="font-medium mb-2">Preview</h4>
              <div className="grid grid-cols-3 gap-4 text-sm">
                <div>
                  <span className="text-gray-500">Species:</span>{' '}
                  {previewMutation.data.data.species_count}
                </div>
                <div>
                  <span className="text-gray-500">Images:</span>{' '}
                  {previewMutation.data.data.image_count}
                </div>
                <div>
                  <span className="text-gray-500">Est. Size:</span>{' '}
                  {previewMutation.data.data.estimated_size_mb.toFixed(0)} MB
                </div>
              </div>
            </div>
          )}
        </div>

        <div className="flex justify-between mt-6">
          <button
            onClick={() => previewMutation.mutate()}
            className="px-4 py-2 border rounded-lg hover:bg-gray-50"
          >
            Preview
          </button>
          <div className="flex gap-2">
            <button
              onClick={onClose}
              className="px-4 py-2 border rounded-lg hover:bg-gray-50"
            >
              Cancel
            </button>
            <button
              onClick={() => createMutation.mutate()}
              disabled={!form.name}
              className="px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700 disabled:opacity-50"
            >
              Create Export
            </button>
          </div>
        </div>
      </div>
    </div>
  )
}
331
frontend/src/pages/Images.tsx
Normal file
@@ -0,0 +1,331 @@
import { useState } from 'react'
import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
import {
  Search,
  Filter,
  Trash2,
  ChevronLeft,
  ChevronRight,
  X,
  ExternalLink,
} from 'lucide-react'
import { imagesApi } from '../api/client'

export default function Images() {
  const queryClient = useQueryClient()
  const [page, setPage] = useState(1)
  const [search, setSearch] = useState('')
  const [filters, setFilters] = useState({
    source: '',
    license: '',
    status: 'downloaded',
    min_quality: undefined as number | undefined,
  })
  const [selectedIds, setSelectedIds] = useState<number[]>([])
  const [selectedImage, setSelectedImage] = useState<number | null>(null)

  const { data, isLoading } = useQuery({
    queryKey: ['images', page, search, filters],
    queryFn: () =>
      imagesApi
        .list({
          page,
          page_size: 48,
          search: search || undefined,
          source: filters.source || undefined,
          license: filters.license || undefined,
          status: filters.status || undefined,
          min_quality: filters.min_quality,
        })
        .then((res) => res.data),
  })

  const { data: sources } = useQuery({
    queryKey: ['image-sources'],
    queryFn: () => imagesApi.sources().then((res) => res.data),
  })

  const { data: licenses } = useQuery({
    queryKey: ['image-licenses'],
    queryFn: () => imagesApi.licenses().then((res) => res.data),
  })

  const { data: imageDetail } = useQuery({
    queryKey: ['image', selectedImage],
    queryFn: () => imagesApi.get(selectedImage!).then((res) => res.data),
    enabled: !!selectedImage,
  })

  const deleteMutation = useMutation({
    mutationFn: (id: number) => imagesApi.delete(id),
    onSuccess: () => {
      queryClient.invalidateQueries({ queryKey: ['images'] })
      setSelectedImage(null)
    },
  })

  const bulkDeleteMutation = useMutation({
    mutationFn: (ids: number[]) => imagesApi.bulkDelete(ids),
    onSuccess: () => {
      queryClient.invalidateQueries({ queryKey: ['images'] })
      setSelectedIds([])
    },
  })

  const handleSelect = (id: number) => {
    setSelectedIds((prev) =>
      prev.includes(id) ? prev.filter((i) => i !== id) : [...prev, id]
    )
  }

  return (
    <div className="space-y-6">
      <div className="flex items-center justify-between">
        <h1 className="text-2xl font-bold">Images</h1>
        {selectedIds.length > 0 && (
          <button
            onClick={() => bulkDeleteMutation.mutate(selectedIds)}
            className="flex items-center gap-2 px-4 py-2 bg-red-600 text-white rounded-lg hover:bg-red-700"
          >
            <Trash2 className="w-4 h-4" />
            Delete {selectedIds.length} images
          </button>
        )}
      </div>

      {/* Filters: every change resets to page 1 so a narrower result set cannot leave us on an empty page */}
      <div className="flex flex-wrap gap-4">
        <div className="relative">
          <Search className="absolute left-3 top-1/2 -translate-y-1/2 w-4 h-4 text-gray-400" />
          <input
            type="text"
            placeholder="Search species..."
            value={search}
            onChange={(e) => {
              setSearch(e.target.value)
              setPage(1)
            }}
            className="pl-10 pr-4 py-2 border rounded-lg w-64"
          />
        </div>

        <select
          value={filters.source}
          onChange={(e) => {
            setFilters({ ...filters, source: e.target.value })
            setPage(1)
          }}
          className="px-3 py-2 border rounded-lg"
        >
          <option value="">All Sources</option>
          {sources?.map((s) => (
            <option key={s} value={s}>
              {s}
            </option>
          ))}
        </select>

        <select
          value={filters.license}
          onChange={(e) => {
            setFilters({ ...filters, license: e.target.value })
            setPage(1)
          }}
          className="px-3 py-2 border rounded-lg"
        >
          <option value="">All Licenses</option>
          {licenses?.map((l) => (
            <option key={l} value={l}>
              {l}
            </option>
          ))}
        </select>

        <select
          value={filters.status}
          onChange={(e) => {
            setFilters({ ...filters, status: e.target.value })
            setPage(1)
          }}
          className="px-3 py-2 border rounded-lg"
        >
          <option value="">All Status</option>
          <option value="downloaded">Downloaded</option>
          <option value="pending">Pending</option>
          <option value="rejected">Rejected</option>
        </select>
      </div>

      {/* Image Grid */}
      {isLoading ? (
        <div className="flex items-center justify-center h-64">
          <div className="animate-spin rounded-full h-8 w-8 border-b-2 border-green-600"></div>
        </div>
      ) : data?.items.length === 0 ? (
        <div className="flex flex-col items-center justify-center h-64 text-gray-400">
          <Filter className="w-12 h-12 mb-4" />
          <p>No images found</p>
        </div>
      ) : (
        <div className="grid grid-cols-2 sm:grid-cols-4 md:grid-cols-6 lg:grid-cols-8 gap-2">
          {data?.items.map((image) => (
            <div
              key={image.id}
              className={`relative aspect-square bg-gray-100 rounded-lg overflow-hidden cursor-pointer group ${
                selectedIds.includes(image.id) ? 'ring-2 ring-green-500' : ''
              }`}
              onClick={() => setSelectedImage(image.id)}
            >
              {image.local_path ? (
                <img
                  src={`/api/images/${image.id}/file`}
                  alt={image.species_name || ''}
                  className="w-full h-full object-cover"
                  loading="lazy"
                />
              ) : (
                <div className="flex items-center justify-center h-full text-gray-400 text-xs">
                  Pending
                </div>
              )}
              <div className="absolute inset-0 bg-black/0 group-hover:bg-black/20 transition-colors" />
              <div className="absolute top-1 left-1">
                {/* Stop the click itself: the tile's onClick reacts to click events, not change events */}
                <input
                  type="checkbox"
                  checked={selectedIds.includes(image.id)}
                  onClick={(e) => e.stopPropagation()}
                  onChange={() => handleSelect(image.id)}
                  className="rounded opacity-0 group-hover:opacity-100 checked:opacity-100"
                />
              </div>
              <div className="absolute bottom-0 left-0 right-0 bg-gradient-to-t from-black/60 to-transparent p-1 opacity-0 group-hover:opacity-100 transition-opacity">
                <p className="text-white text-xs truncate">
                  {image.species_name}
                </p>
              </div>
            </div>
          ))}
        </div>
      )}

      {/* Pagination */}
      {data && data.pages > 1 && (
        <div className="flex items-center justify-between">
          <span className="text-sm text-gray-600">
            {data.total} images
          </span>
          <div className="flex gap-2">
            <button
              onClick={() => setPage((p) => Math.max(1, p - 1))}
              disabled={page === 1}
              className="p-2 rounded border disabled:opacity-50"
            >
              <ChevronLeft className="w-4 h-4" />
            </button>
            <span className="px-4 py-2">
              Page {page} of {data.pages}
            </span>
            <button
              onClick={() => setPage((p) => Math.min(data.pages, p + 1))}
              disabled={page === data.pages}
              className="p-2 rounded border disabled:opacity-50"
            >
              <ChevronRight className="w-4 h-4" />
            </button>
          </div>
        </div>
      )}

      {/* Image Detail Modal */}
      {selectedImage && imageDetail && (
        <div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50 p-8">
          <div className="bg-white rounded-lg w-full max-w-4xl max-h-full overflow-auto">
            <div className="flex justify-between items-center p-4 border-b">
              <h2 className="text-lg font-semibold">Image Details</h2>
              <button
                onClick={() => setSelectedImage(null)}
                className="p-1 hover:bg-gray-100 rounded"
              >
                <X className="w-5 h-5" />
              </button>
            </div>
            <div className="grid grid-cols-2 gap-6 p-6">
              <div className="aspect-square bg-gray-100 rounded-lg overflow-hidden">
                {imageDetail.local_path ? (
                  <img
                    src={`/api/images/${imageDetail.id}/file`}
                    alt={imageDetail.species_name || ''}
                    className="w-full h-full object-contain"
                  />
                ) : (
                  <div className="flex items-center justify-center h-full text-gray-400">
                    Not downloaded
                  </div>
                )}
              </div>
              <div className="space-y-4">
                <div>
                  <label className="text-sm text-gray-500">Species</label>
                  <p className="font-medium">{imageDetail.species_name}</p>
                </div>
                <div>
                  <label className="text-sm text-gray-500">Source</label>
                  <p>{imageDetail.source}</p>
                </div>
                <div>
                  <label className="text-sm text-gray-500">License</label>
                  <p>{imageDetail.license}</p>
                </div>
                {imageDetail.attribution && (
                  <div>
                    <label className="text-sm text-gray-500">Attribution</label>
                    <p className="text-sm">{imageDetail.attribution}</p>
                  </div>
                )}
                <div className="grid grid-cols-2 gap-4">
                  <div>
                    <label className="text-sm text-gray-500">Dimensions</label>
                    <p>
                      {imageDetail.width || '?'} x {imageDetail.height || '?'}
                    </p>
                  </div>
                  <div>
                    <label className="text-sm text-gray-500">Quality Score</label>
                    <p>{imageDetail.quality_score?.toFixed(1) || 'N/A'}</p>
                  </div>
                </div>
                <div>
                  <label className="text-sm text-gray-500">Status</label>
                  <p>
                    <span
                      className={`inline-block px-2 py-1 rounded text-sm ${
                        imageDetail.status === 'downloaded'
                          ? 'bg-green-100 text-green-700'
                          : imageDetail.status === 'pending'
                          ? 'bg-yellow-100 text-yellow-700'
                          : 'bg-red-100 text-red-700'
                      }`}
                    >
                      {imageDetail.status}
                    </span>
                  </p>
                </div>
                <div className="flex gap-2 pt-4">
                  <a
                    href={imageDetail.url}
                    target="_blank"
                    rel="noopener noreferrer"
                    className="flex items-center gap-2 px-4 py-2 border rounded-lg hover:bg-gray-50"
                  >
                    <ExternalLink className="w-4 h-4" />
                    View Original
                  </a>
                  <button
                    onClick={() => deleteMutation.mutate(imageDetail.id)}
                    className="flex items-center gap-2 px-4 py-2 bg-red-600 text-white rounded-lg hover:bg-red-700"
                  >
                    <Trash2 className="w-4 h-4" />
                    Delete
                  </button>
                </div>
              </div>
            </div>
          </div>
        </div>
      )}
    </div>
  )
}
354
frontend/src/pages/Jobs.tsx
Normal file
@@ -0,0 +1,354 @@
import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
import {
  Play,
  Pause,
  XCircle,
  CheckCircle,
  Clock,
  AlertCircle,
  RefreshCw,
  Leaf,
  Download,
  XOctagon,
} from 'lucide-react'
import { jobsApi, Job } from '../api/client'

export default function Jobs() {
  const queryClient = useQueryClient()

  const { data, isLoading, refetch } = useQuery({
    queryKey: ['jobs'],
    queryFn: () => jobsApi.list({ limit: 100 }).then((res) => res.data),
    refetchInterval: 1000, // Faster refresh for live updates
  })

  const pauseMutation = useMutation({
    mutationFn: (id: number) => jobsApi.pause(id),
    onSuccess: () => queryClient.invalidateQueries({ queryKey: ['jobs'] }),
  })

  const resumeMutation = useMutation({
    mutationFn: (id: number) => jobsApi.resume(id),
    onSuccess: () => queryClient.invalidateQueries({ queryKey: ['jobs'] }),
  })

  const cancelMutation = useMutation({
    mutationFn: (id: number) => jobsApi.cancel(id),
    onSuccess: () => queryClient.invalidateQueries({ queryKey: ['jobs'] }),
  })

  const getStatusIcon = (status: string) => {
    switch (status) {
      case 'running':
        return <RefreshCw className="w-4 h-4 text-blue-500 animate-spin" />
      case 'pending':
        return <Clock className="w-4 h-4 text-yellow-500" />
      case 'paused':
        return <Pause className="w-4 h-4 text-gray-500" />
      case 'completed':
        return <CheckCircle className="w-4 h-4 text-green-500" />
      case 'failed':
        return <AlertCircle className="w-4 h-4 text-red-500" />
      default:
        return null
    }
  }

  const getStatusClass = (status: string) => {
    switch (status) {
      case 'running':
        return 'bg-blue-100 text-blue-700'
      case 'pending':
        return 'bg-yellow-100 text-yellow-700'
      case 'paused':
        return 'bg-gray-100 text-gray-700'
      case 'completed':
        return 'bg-green-100 text-green-700'
      case 'failed':
        return 'bg-red-100 text-red-700'
      default:
        return 'bg-gray-100 text-gray-700'
    }
  }

  // Separate running jobs from others
  const runningJobs = data?.items.filter((j) => j.status === 'running') || []
  const otherJobs = data?.items.filter((j) => j.status !== 'running') || []

  return (
    <div className="space-y-6">
      <div className="flex items-center justify-between">
        <h1 className="text-2xl font-bold">Jobs</h1>
        <button
          onClick={() => refetch()}
          className="flex items-center gap-2 px-4 py-2 border rounded-lg hover:bg-gray-50"
        >
          <RefreshCw className="w-4 h-4" />
          Refresh
        </button>
      </div>

      {isLoading ? (
        <div className="flex items-center justify-center h-64">
          <div className="animate-spin rounded-full h-8 w-8 border-b-2 border-green-600"></div>
        </div>
      ) : data?.items.length === 0 ? (
        <div className="bg-white rounded-lg shadow p-8 text-center text-gray-400">
          <Clock className="w-12 h-12 mx-auto mb-4" />
          <p>No jobs yet</p>
          <p className="text-sm mt-2">
            Select species and start a scrape job to get started
          </p>
        </div>
      ) : (
        <div className="space-y-6">
          {/* Running Jobs - More prominent display */}
          {runningJobs.length > 0 && (
            <div className="space-y-4">
              <h2 className="text-lg font-semibold flex items-center gap-2">
                <RefreshCw className="w-5 h-5 animate-spin text-blue-500" />
                Active Jobs ({runningJobs.length})
              </h2>
              {runningJobs.map((job) => (
                <RunningJobCard
                  key={job.id}
                  job={job}
                  onPause={() => pauseMutation.mutate(job.id)}
                  onCancel={() => cancelMutation.mutate(job.id)}
                />
              ))}
            </div>
          )}

          {/* Other Jobs */}
          {otherJobs.length > 0 && (
            <div className="space-y-4">
              {runningJobs.length > 0 && (
                <h2 className="text-lg font-semibold text-gray-600">Other Jobs</h2>
              )}
              {otherJobs.map((job) => (
                <div
                  key={job.id}
                  className="bg-white rounded-lg shadow p-6"
                >
                  <div className="flex items-start justify-between">
                    <div className="flex-1">
                      <div className="flex items-center gap-3">
                        {getStatusIcon(job.status)}
                        <h3 className="font-semibold">{job.name}</h3>
                        <span
                          className={`px-2 py-0.5 rounded text-xs ${getStatusClass(
                            job.status
                          )}`}
                        >
                          {job.status}
                        </span>
                      </div>
                      <div className="mt-2 text-sm text-gray-600">
                        <span className="mr-4">Source: {job.source}</span>
                        <span className="mr-4">
                          Downloaded: {job.images_downloaded}
                        </span>
                        <span>Rejected: {job.images_rejected}</span>
                      </div>

                      {/* Progress bar for paused jobs */}
                      {job.status === 'paused' && job.progress_total > 0 && (
                        <div className="mt-4">
                          <div className="flex justify-between text-sm text-gray-600 mb-1">
                            <span>
                              {job.progress_current} / {job.progress_total} species
                            </span>
                            <span>
                              {Math.round(
                                (job.progress_current / job.progress_total) * 100
                              )}
                              %
                            </span>
                          </div>
                          <div className="h-2 bg-gray-200 rounded-full overflow-hidden">
                            <div
                              className="h-full rounded-full bg-gray-400"
                              style={{
                                width: `${
                                  (job.progress_current / job.progress_total) * 100
                                }%`,
                              }}
                            />
                          </div>
                        </div>
                      )}

                      {job.error_message && (
                        <div className="mt-2 text-sm text-red-600">
                          Error: {job.error_message}
                        </div>
                      )}

                      <div className="mt-2 text-xs text-gray-400">
                        {job.started_at && (
                          <span className="mr-4">
                            Started: {new Date(job.started_at).toLocaleString()}
                          </span>
                        )}
                        {job.completed_at && (
                          <span>
                            Completed: {new Date(job.completed_at).toLocaleString()}
                          </span>
                        )}
                      </div>
                    </div>

                    {/* Actions */}
                    <div className="flex gap-2 ml-4">
                      {job.status === 'paused' && (
                        <button
                          onClick={() => resumeMutation.mutate(job.id)}
                          className="p-2 text-blue-600 hover:bg-blue-50 rounded"
                          title="Resume"
                        >
                          <Play className="w-5 h-5" />
                        </button>
                      )}
                      {(job.status === 'paused' || job.status === 'pending') && (
                        <button
                          onClick={() => cancelMutation.mutate(job.id)}
                          className="p-2 text-red-600 hover:bg-red-50 rounded"
                          title="Cancel"
                        >
                          <XCircle className="w-5 h-5" />
                        </button>
                      )}
                    </div>
                  </div>
                </div>
              ))}
            </div>
          )}
        </div>
      )}
    </div>
  )
}

function RunningJobCard({
  job,
  onPause,
  onCancel,
}: {
  job: Job
  onPause: () => void
  onCancel: () => void
}) {
  // Fetch real-time progress for this job
  const { data: progress } = useQuery({
    queryKey: ['job-progress', job.id],
    queryFn: () => jobsApi.progress(job.id).then((res) => res.data),
    refetchInterval: 500, // Very fast updates for live feel
    enabled: job.status === 'running',
  })

  const currentSpecies = progress?.current_species || ''
  const progressCurrent = progress?.progress_current ?? job.progress_current
  const progressTotal = progress?.progress_total ?? job.progress_total
  const percentage = progressTotal > 0 ? Math.round((progressCurrent / progressTotal) * 100) : 0

  return (
    <div className="bg-gradient-to-r from-blue-50 to-white rounded-lg shadow-lg border-2 border-blue-200 p-6">
      <div className="flex items-start justify-between">
        <div className="flex-1">
          <div className="flex items-center gap-3">
            <RefreshCw className="w-5 h-5 text-blue-500 animate-spin" />
            <h3 className="font-semibold text-lg">{job.name}</h3>
            <span className="px-2 py-0.5 rounded text-xs bg-blue-100 text-blue-700 animate-pulse">
              running
            </span>
          </div>

          {/* Live Stats */}
          <div className="mt-4 grid grid-cols-3 gap-4">
            <div className="bg-white rounded-lg p-3 border">
              <div className="flex items-center gap-2 text-gray-500 text-sm">
                <Leaf className="w-4 h-4" />
                Species Progress
              </div>
              <div className="text-2xl font-bold text-blue-600 mt-1">
                {progressCurrent} / {progressTotal}
              </div>
            </div>
            <div className="bg-white rounded-lg p-3 border">
              <div className="flex items-center gap-2 text-gray-500 text-sm">
                <Download className="w-4 h-4" />
                Downloaded
              </div>
              <div className="text-2xl font-bold text-green-600 mt-1">
                {job.images_downloaded}
              </div>
            </div>
            <div className="bg-white rounded-lg p-3 border">
              <div className="flex items-center gap-2 text-gray-500 text-sm">
                <XOctagon className="w-4 h-4" />
                Rejected
              </div>
              <div className="text-2xl font-bold text-red-600 mt-1">
                {job.images_rejected}
              </div>
            </div>
          </div>

          {/* Current Species */}
          {currentSpecies && (
            <div className="mt-4 bg-white rounded-lg p-3 border">
              <div className="text-sm text-gray-500 mb-1">Currently scraping:</div>
              <div className="flex items-center gap-2">
                <span className="relative flex h-3 w-3">
                  <span className="animate-ping absolute inline-flex h-full w-full rounded-full bg-blue-400 opacity-75"></span>
                  <span className="relative inline-flex rounded-full h-3 w-3 bg-blue-500"></span>
                </span>
                <span className="font-medium text-blue-800 italic">{currentSpecies}</span>
              </div>
            </div>
          )}

          {/* Progress bar */}
          {progressTotal > 0 && (
            <div className="mt-4">
              <div className="flex justify-between text-sm text-gray-600 mb-1">
                <span>Progress</span>
                <span className="font-medium">{percentage}%</span>
              </div>
              <div className="h-3 bg-gray-200 rounded-full overflow-hidden">
                <div
                  className="h-full rounded-full bg-gradient-to-r from-blue-500 to-blue-600 transition-all duration-500"
                  style={{ width: `${percentage}%` }}
                />
              </div>
            </div>
          )}

          <div className="mt-3 text-xs text-gray-400">
            Source: {job.source} • Started: {job.started_at ? new Date(job.started_at).toLocaleString() : 'N/A'}
          </div>
        </div>

        {/* Actions */}
        <div className="flex gap-2 ml-4">
          <button
            onClick={onPause}
            className="p-2 text-gray-600 hover:bg-gray-100 rounded"
            title="Pause"
          >
            <Pause className="w-5 h-5" />
          </button>
          <button
            onClick={onCancel}
            className="p-2 text-red-600 hover:bg-red-50 rounded"
            title="Cancel"
          >
            <XCircle className="w-5 h-5" />
          </button>
        </div>
      </div>
    </div>
  )
}
543
frontend/src/pages/Settings.tsx
Normal file
@@ -0,0 +1,543 @@
import { useState } from 'react'
import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
import {
  Key,
  CheckCircle,
  XCircle,
  Eye,
  EyeOff,
  RefreshCw,
  FolderInput,
  AlertTriangle,
} from 'lucide-react'
import { sourcesApi, imagesApi, SourceConfig, ImportScanResult } from '../api/client'

export default function Settings() {
  const [editingSource, setEditingSource] = useState<string | null>(null)

  const { data: sources, isLoading, error } = useQuery({
    queryKey: ['sources'],
    queryFn: () => sourcesApi.list().then((res) => res.data),
  })

  return (
    <div className="space-y-6">
      <h1 className="text-2xl font-bold">Settings</h1>

      {/* API Keys Section */}
      <div className="bg-white rounded-lg shadow">
        <div className="px-6 py-4 border-b">
          <h2 className="text-lg font-semibold flex items-center gap-2">
            <Key className="w-5 h-5" />
            API Keys
          </h2>
          <p className="text-sm text-gray-500 mt-1">
            Configure API keys for each data source
          </p>
        </div>

        {isLoading ? (
          <div className="p-6 text-center">
            <RefreshCw className="w-6 h-6 animate-spin mx-auto text-gray-400" />
          </div>
        ) : error ? (
          <div className="p-6 text-center text-red-600">
            Error loading sources: {(error as Error).message}
          </div>
        ) : !sources || sources.length === 0 ? (
          <div className="p-6 text-center text-gray-500">
            No sources available
          </div>
        ) : (
          <div className="divide-y">
            {sources.map((source) => (
              <SourceRow
                key={source.name}
                source={source}
                isEditing={editingSource === source.name}
                onEdit={() => setEditingSource(source.name)}
                onClose={() => setEditingSource(null)}
              />
            ))}
          </div>
        )}
      </div>

      {/* Import Scanner Section */}
      <ImportScanner />

      {/* Rate Limits Info */}
      <div className="bg-yellow-50 border border-yellow-200 rounded-lg p-4">
        <h3 className="font-medium text-yellow-800">Rate Limits (recommended settings)</h3>
        <ul className="text-sm text-yellow-700 mt-2 space-y-1 list-disc list-inside">
          <li>GBIF: 1 req/sec safe (free, no authentication required)</li>
          <li>iNaturalist: 1 req/sec max (60/min limit), 10k/day, 5GB/hr media</li>
          <li>Flickr: 0.5 req/sec recommended (3600/hr limit shared across all users)</li>
          <li>Wikimedia: 1 req/sec safe (requires OAuth credentials)</li>
          <li>Trefle: 1 req/sec safe (120/min limit)</li>
        </ul>
      </div>
    </div>
  )
}

function SourceRow({
  source,
  isEditing,
  onEdit,
  onClose,
}: {
  source: SourceConfig
  isEditing: boolean
  onEdit: () => void
  onClose: () => void
}) {
  const queryClient = useQueryClient()
  const [showKey, setShowKey] = useState(false)
  const [form, setForm] = useState({
    api_key: '',
    api_secret: '',
    access_token: '',
    rate_limit_per_sec: source.configured ? source.rate_limit_per_sec : (source.default_rate || 1.0),
    enabled: source.enabled,
  })

  // Get field labels based on auth type
  const isNoAuth = source.auth_type === 'none'
  const isOAuth = source.auth_type === 'oauth'
  const keyLabel = isOAuth ? 'Client ID' : 'API Key'
  const secretLabel = isOAuth ? 'Client Secret' : 'API Secret'
  const [testResult, setTestResult] = useState<{
    status: 'success' | 'error'
    message: string
  } | null>(null)

  const updateMutation = useMutation({
    mutationFn: () =>
      sourcesApi.update(source.name, {
        api_key: isNoAuth ? undefined : form.api_key || undefined,
        api_secret: form.api_secret || undefined,
        access_token: form.access_token || undefined,
        rate_limit_per_sec: form.rate_limit_per_sec,
        enabled: form.enabled,
      }),
    onSuccess: () => {
      queryClient.invalidateQueries({ queryKey: ['sources'] })
      onClose()
    },
  })

  const testMutation = useMutation({
    mutationFn: () => sourcesApi.test(source.name),
    onSuccess: (res) => {
      setTestResult({ status: res.data.status, message: res.data.message })
    },
    onError: (err: any) => {
      setTestResult({
        status: 'error',
        message: err.response?.data?.message || 'Connection failed',
      })
    },
  })

  if (isEditing) {
    return (
      <div className="p-6 bg-gray-50">
        <div className="flex items-center justify-between mb-4">
          <h3 className="font-medium">{source.label}</h3>
          <button
            onClick={onClose}
            className="text-gray-500 hover:text-gray-700"
          >
            Cancel
          </button>
        </div>

        <div className="space-y-4">
          {isNoAuth ? (
            <div className="bg-green-50 border border-green-200 rounded-lg p-3 text-green-700 text-sm">
              This source doesn't require authentication. Just enable it to start scraping.
            </div>
          ) : (
            <>
              <div>
                <label className="block text-sm font-medium mb-1">{keyLabel}</label>
                <div className="relative">
                  <input
                    type={showKey ? 'text' : 'password'}
                    value={form.api_key}
                    onChange={(e) => setForm({ ...form, api_key: e.target.value })}
                    placeholder={source.api_key_masked || `Enter ${keyLabel}`}
                    className="w-full px-3 py-2 border rounded-lg pr-10"
                  />
                  <button
                    type="button"
                    onClick={() => setShowKey(!showKey)}
                    className="absolute right-2 top-1/2 -translate-y-1/2 text-gray-400"
                  >
                    {showKey ? (
                      <EyeOff className="w-4 h-4" />
                    ) : (
                      <Eye className="w-4 h-4" />
                    )}
                  </button>
                </div>
              </div>

              {source.requires_secret && (
                <div>
                  <label className="block text-sm font-medium mb-1">
                    {secretLabel}
                  </label>
                  <input
                    type="password"
                    value={form.api_secret}
                    onChange={(e) =>
                      setForm({ ...form, api_secret: e.target.value })
                    }
                    placeholder={source.has_secret ? '••••••••' : `Enter ${secretLabel}`}
                    className="w-full px-3 py-2 border rounded-lg"
                  />
                </div>
              )}

              {isOAuth && (
                <div>
                  <label className="block text-sm font-medium mb-1">
                    Access Token
                  </label>
                  <input
                    type="password"
                    value={form.access_token}
                    onChange={(e) =>
                      setForm({ ...form, access_token: e.target.value })
                    }
                    placeholder={source.has_access_token ? '••••••••' : 'Enter Access Token'}
                    className="w-full px-3 py-2 border rounded-lg"
                  />
                </div>
              )}
            </>
          )}

          <div>
            <label className="block text-sm font-medium mb-1">
              Rate Limit (requests/sec)
            </label>
            <input
              type="number"
              value={form.rate_limit_per_sec}
              onChange={(e) =>
                setForm({
                  ...form,
                  rate_limit_per_sec: parseFloat(e.target.value) || 1,
                })
              }
              className="w-full px-3 py-2 border rounded-lg"
|
||||||
|
min={0.1}
|
||||||
|
max={10}
|
||||||
|
step={0.1}
|
||||||
|
/>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div className="flex items-center gap-2">
|
||||||
|
<input
|
||||||
|
type="checkbox"
|
||||||
|
id="enabled"
|
||||||
|
checked={form.enabled}
|
||||||
|
onChange={(e) => setForm({ ...form, enabled: e.target.checked })}
|
||||||
|
className="rounded"
|
||||||
|
/>
|
||||||
|
<label htmlFor="enabled" className="text-sm">
|
||||||
|
Enable this source
|
||||||
|
</label>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
{testResult && (
|
||||||
|
<div
|
||||||
|
className={`p-3 rounded-lg ${
|
||||||
|
testResult.status === 'success'
|
||||||
|
? 'bg-green-50 text-green-700'
|
||||||
|
: 'bg-red-50 text-red-700'
|
||||||
|
}`}
|
||||||
|
>
|
||||||
|
{testResult.message}
|
||||||
|
</div>
|
||||||
|
)}
|
||||||
|
|
||||||
|
<div className="flex justify-between">
|
||||||
|
{source.configured && (
|
||||||
|
<button
|
||||||
|
onClick={() => testMutation.mutate()}
|
||||||
|
disabled={testMutation.isPending}
|
||||||
|
className="px-4 py-2 border rounded-lg hover:bg-white"
|
||||||
|
>
|
||||||
|
{testMutation.isPending ? 'Testing...' : 'Test Connection'}
|
||||||
|
</button>
|
||||||
|
)}
|
||||||
|
<button
|
||||||
|
onClick={() => updateMutation.mutate()}
|
||||||
|
disabled={!isNoAuth && !form.api_key && !source.configured}
|
||||||
|
className="px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700 disabled:opacity-50 ml-auto"
|
||||||
|
>
|
||||||
|
Save
|
||||||
|
</button>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
)
|
||||||
|
}
|
||||||
|
|
||||||
|
  const isNoAuthRow = source.auth_type === 'none'

  return (
    <div className="px-6 py-4 flex items-center justify-between">
      <div className="flex items-center gap-4">
        <div
          className={`w-2 h-2 rounded-full ${
            (isNoAuthRow || source.configured) && source.enabled
              ? 'bg-green-500'
              : source.configured
              ? 'bg-yellow-500'
              : 'bg-gray-300'
          }`}
        />
        <div>
          <h3 className="font-medium">{source.label}</h3>
          <p className="text-sm text-gray-500">
            {isNoAuthRow
              ? 'No authentication required'
              : source.configured
              ? `Key: ${source.api_key_masked}`
              : 'Not configured'}
          </p>
        </div>
      </div>
      <div className="flex items-center gap-4">
        {(isNoAuthRow || source.configured) && (
          <span
            className={`flex items-center gap-1 text-sm ${
              source.enabled ? 'text-green-600' : 'text-gray-400'
            }`}
          >
            {source.enabled ? (
              <>
                <CheckCircle className="w-4 h-4" />
                Enabled
              </>
            ) : (
              <>
                <XCircle className="w-4 h-4" />
                Disabled
              </>
            )}
          </span>
        )}
        <button
          onClick={onEdit}
          className="px-3 py-1 text-sm border rounded hover:bg-gray-50"
        >
          {isNoAuthRow || source.configured ? 'Edit' : 'Configure'}
        </button>
      </div>
    </div>
  )
}

function ImportScanner() {
  const [scanResult, setScanResult] = useState<ImportScanResult | null>(null)
  const [moveFiles, setMoveFiles] = useState(false)
  const [importResult, setImportResult] = useState<{
    imported: number
    skipped: number
    errors: string[]
  } | null>(null)

  const scanMutation = useMutation({
    mutationFn: () => imagesApi.scanImports().then((res) => res.data),
    onSuccess: (data) => {
      setScanResult(data)
      setImportResult(null)
    },
  })

  const importMutation = useMutation({
    mutationFn: () => imagesApi.runImport(moveFiles).then((res) => res.data),
    onSuccess: (data) => {
      setImportResult(data)
      setScanResult(null)
    },
  })

  return (
    <div className="bg-white rounded-lg shadow">
      <div className="px-6 py-4 border-b">
        <h2 className="text-lg font-semibold flex items-center gap-2">
          <FolderInput className="w-5 h-5" />
          Import Images
        </h2>
        <p className="text-sm text-gray-500 mt-1">
          Bulk import images from the imports folder
        </p>
      </div>

      <div className="p-6 space-y-4">
        <div className="bg-gray-50 rounded-lg p-4">
          <h3 className="font-medium text-sm mb-2">Expected folder structure:</h3>
          <code className="text-sm text-gray-600 block">
            imports/{'{source}'}/{'{species_name}'}/*.jpg
          </code>
          <p className="text-sm text-gray-500 mt-2">
            Example: imports/inaturalist/Monstera_deliciosa/image1.jpg
          </p>
        </div>

        <div className="flex items-center gap-4">
          <button
            onClick={() => scanMutation.mutate()}
            disabled={scanMutation.isPending}
            className="px-4 py-2 bg-blue-600 text-white rounded-lg hover:bg-blue-700 disabled:opacity-50 flex items-center gap-2"
          >
            {scanMutation.isPending ? (
              <>
                <RefreshCw className="w-4 h-4 animate-spin" />
                Scanning...
              </>
            ) : (
              'Scan Imports Folder'
            )}
          </button>
        </div>

        {scanMutation.isError && (
          <div className="bg-red-50 border border-red-200 rounded-lg p-4 text-red-700">
            Error scanning: {(scanMutation.error as Error).message}
          </div>
        )}

        {scanResult && (
          <div className="space-y-4">
            {!scanResult.available ? (
              <div className="bg-yellow-50 border border-yellow-200 rounded-lg p-4">
                <p className="text-yellow-700">{scanResult.message}</p>
              </div>
            ) : scanResult.total_images === 0 ? (
              <div className="bg-gray-50 border border-gray-200 rounded-lg p-4">
                <p className="text-gray-600">No images found in the imports folder.</p>
              </div>
            ) : (
              <>
                <div className="bg-green-50 border border-green-200 rounded-lg p-4">
                  <h3 className="font-medium text-green-800 mb-2">Scan Results</h3>
                  <div className="grid grid-cols-2 gap-4 text-sm">
                    <div>
                      <span className="text-gray-600">Total Images:</span>
                      <span className="ml-2 font-medium">{scanResult.total_images}</span>
                    </div>
                    <div>
                      <span className="text-gray-600">Matched Species:</span>
                      <span className="ml-2 font-medium">{scanResult.matched_species}</span>
                    </div>
                  </div>

                  {scanResult.sources.length > 0 && (
                    <div className="mt-4">
                      <h4 className="text-sm font-medium text-green-800 mb-2">Sources Found:</h4>
                      <div className="space-y-1">
                        {scanResult.sources.map((source) => (
                          <div key={source.name} className="text-sm flex justify-between">
                            <span>{source.name}</span>
                            <span className="text-gray-600">
                              {source.species_count} species, {source.image_count} images
                            </span>
                          </div>
                        ))}
                      </div>
                    </div>
                  )}
                </div>

                {scanResult.unmatched_species.length > 0 && (
                  <div className="bg-yellow-50 border border-yellow-200 rounded-lg p-4">
                    <h3 className="font-medium text-yellow-800 flex items-center gap-2 mb-2">
                      <AlertTriangle className="w-4 h-4" />
                      Unmatched Species ({scanResult.unmatched_species.length})
                    </h3>
                    <p className="text-sm text-yellow-700 mb-2">
                      These species folders don't match any species in the database and will be skipped:
                    </p>
                    <div className="text-sm text-yellow-600 max-h-32 overflow-y-auto">
                      {scanResult.unmatched_species.slice(0, 20).map((name) => (
                        <div key={name}>{name}</div>
                      ))}
                      {scanResult.unmatched_species.length > 20 && (
                        <div className="text-yellow-500 mt-1">
                          ...and {scanResult.unmatched_species.length - 20} more
                        </div>
                      )}
                    </div>
                  </div>
                )}

                <div className="border-t pt-4">
                  <div className="flex items-center gap-4 mb-4">
                    <label className="flex items-center gap-2 text-sm">
                      <input
                        type="checkbox"
                        checked={moveFiles}
                        onChange={(e) => setMoveFiles(e.target.checked)}
                        className="rounded"
                      />
                      Move files instead of copy (removes originals)
                    </label>
                  </div>

                  <button
                    onClick={() => importMutation.mutate()}
                    disabled={importMutation.isPending || scanResult.matched_species === 0}
                    className="px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700 disabled:opacity-50 flex items-center gap-2"
                  >
                    {importMutation.isPending ? (
                      <>
                        <RefreshCw className="w-4 h-4 animate-spin" />
                        Importing...
                      </>
                    ) : (
                      `Import ${scanResult.total_images} Images`
                    )}
                  </button>
                </div>
              </>
            )}
          </div>
        )}

        {importResult && (
          <div className="bg-green-50 border border-green-200 rounded-lg p-4">
            <h3 className="font-medium text-green-800 mb-2">Import Complete</h3>
            <div className="text-sm space-y-1">
              <div>
                <span className="text-gray-600">Imported:</span>
                <span className="ml-2 font-medium text-green-700">{importResult.imported}</span>
              </div>
              <div>
                <span className="text-gray-600">Skipped (already exists):</span>
                <span className="ml-2 font-medium">{importResult.skipped}</span>
              </div>
              {importResult.errors.length > 0 && (
                <div className="mt-2">
                  <span className="text-red-600">Errors ({importResult.errors.length}):</span>
                  <div className="text-red-500 mt-1 max-h-24 overflow-y-auto">
                    {importResult.errors.map((err, i) => (
                      <div key={i} className="text-xs">{err}</div>
                    ))}
                  </div>
                </div>
              )}
            </div>
          </div>
        )}
      </div>
    </div>
  )
}
997
frontend/src/pages/Species.tsx
Normal file
@@ -0,0 +1,997 @@
import { useState, useRef } from 'react'
import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query'
import {
  Plus,
  Upload,
  Search,
  Trash2,
  Play,
  ChevronLeft,
  ChevronRight,
  Filter,
  X,
  Image as ImageIcon,
  ExternalLink,
} from 'lucide-react'
import { speciesApi, jobsApi, imagesApi, Species as SpeciesType } from '../api/client'

export default function Species() {
  const queryClient = useQueryClient()
  const csvInputRef = useRef<HTMLInputElement>(null)
  const jsonInputRef = useRef<HTMLInputElement>(null)

  const [page, setPage] = useState(1)
  const [search, setSearch] = useState('')
  const [genus, setGenus] = useState<string>('')
  const [hasImages, setHasImages] = useState<string>('')
  const [maxImages, setMaxImages] = useState<string>('')
  const [selectedIds, setSelectedIds] = useState<number[]>([])
  const [showAddModal, setShowAddModal] = useState(false)
  const [showScrapeModal, setShowScrapeModal] = useState(false)
  const [showScrapeAllModal, setShowScrapeAllModal] = useState(false)
  const [showScrapeFilteredModal, setShowScrapeFilteredModal] = useState(false)
  const [viewSpecies, setViewSpecies] = useState<SpeciesType | null>(null)

  const { data: genera } = useQuery({
    queryKey: ['genera'],
    queryFn: () => speciesApi.genera().then((res) => res.data),
  })

  const { data, isLoading } = useQuery({
    queryKey: ['species', page, search, genus, hasImages, maxImages],
    queryFn: () =>
      speciesApi.list({
        page,
        page_size: 50,
        search: search || undefined,
        genus: genus || undefined,
        has_images: hasImages === '' ? undefined : hasImages === 'true',
        max_images: maxImages ? parseInt(maxImages, 10) : undefined,
      }).then((res) => res.data),
  })

  const importCsvMutation = useMutation({
    mutationFn: (file: File) => speciesApi.import(file),
    onSuccess: (res) => {
      queryClient.invalidateQueries({ queryKey: ['species'] })
      queryClient.invalidateQueries({ queryKey: ['genera'] })
      alert(`Imported ${res.data.imported} species, skipped ${res.data.skipped}`)
    },
  })

  const importJsonMutation = useMutation({
    mutationFn: (file: File) => speciesApi.importJson(file),
    onSuccess: (res) => {
      queryClient.invalidateQueries({ queryKey: ['species'] })
      queryClient.invalidateQueries({ queryKey: ['genera'] })
      alert(`Imported ${res.data.imported} species, skipped ${res.data.skipped}`)
    },
  })

  const deleteMutation = useMutation({
    mutationFn: (id: number) => speciesApi.delete(id),
    onSuccess: () => {
      queryClient.invalidateQueries({ queryKey: ['species'] })
    },
  })

  const createJobMutation = useMutation({
    mutationFn: (data: { name: string; source: string; species_ids?: number[] }) =>
      jobsApi.create(data),
    onSuccess: () => {
      setShowScrapeModal(false)
      setSelectedIds([])
      alert('Scrape job created!')
    },
  })

  const handleCsvImport = (e: React.ChangeEvent<HTMLInputElement>) => {
    const file = e.target.files?.[0]
    if (file) {
      importCsvMutation.mutate(file)
      e.target.value = ''
    }
  }

  const handleJsonImport = (e: React.ChangeEvent<HTMLInputElement>) => {
    const file = e.target.files?.[0]
    if (file) {
      importJsonMutation.mutate(file)
      e.target.value = ''
    }
  }

  const handleSelectAll = () => {
    if (!data) return
    if (selectedIds.length === data.items.length) {
      setSelectedIds([])
    } else {
      setSelectedIds(data.items.map((s) => s.id))
    }
  }

  const handleSelect = (id: number) => {
    setSelectedIds((prev) =>
      prev.includes(id) ? prev.filter((i) => i !== id) : [...prev, id]
    )
  }

  return (
    <div className="space-y-6">
      <div className="flex items-center justify-between">
        <h1 className="text-2xl font-bold">Species</h1>
        <div className="flex gap-2">
          <button
            onClick={() => csvInputRef.current?.click()}
            disabled={importCsvMutation.isPending}
            className="flex items-center gap-2 px-4 py-2 bg-gray-100 rounded-lg hover:bg-gray-200 disabled:opacity-50"
          >
            <Upload className="w-4 h-4" />
            {importCsvMutation.isPending ? 'Importing...' : 'Import CSV'}
          </button>
          <input
            ref={csvInputRef}
            type="file"
            accept=".csv"
            onChange={handleCsvImport}
            className="hidden"
          />
          <button
            onClick={() => jsonInputRef.current?.click()}
            disabled={importJsonMutation.isPending}
            className="flex items-center gap-2 px-4 py-2 bg-gray-100 rounded-lg hover:bg-gray-200 disabled:opacity-50"
          >
            <Upload className="w-4 h-4" />
            {importJsonMutation.isPending ? 'Importing...' : 'Import JSON'}
          </button>
          <input
            ref={jsonInputRef}
            type="file"
            accept=".json"
            onChange={handleJsonImport}
            className="hidden"
          />
          <button
            onClick={() => setShowAddModal(true)}
            className="flex items-center gap-2 px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700"
          >
            <Plus className="w-4 h-4" />
            Add Species
          </button>
        </div>
      </div>

      {/* Search and Filters */}
      <div className="flex items-center gap-4 flex-wrap">
        <div className="relative">
          <Search className="absolute left-3 top-1/2 -translate-y-1/2 w-4 h-4 text-gray-400" />
          <input
            type="text"
            placeholder="Search species..."
            value={search}
            onChange={(e) => {
              setSearch(e.target.value)
              setPage(1)
            }}
            className="pl-10 pr-4 py-2 border rounded-lg w-64"
          />
        </div>

        <div className="flex items-center gap-2">
          <Filter className="w-4 h-4 text-gray-400" />
          <select
            value={genus}
            onChange={(e) => {
              setGenus(e.target.value)
              setPage(1)
            }}
            className="px-3 py-2 border rounded-lg bg-white"
          >
            <option value="">All Genera</option>
            {genera?.map((g) => (
              <option key={g} value={g}>
                {g}
              </option>
            ))}
          </select>

          <select
            value={hasImages}
            onChange={(e) => {
              setHasImages(e.target.value)
              setMaxImages('')
              setPage(1)
            }}
            className="px-3 py-2 border rounded-lg bg-white"
          >
            <option value="">All Species</option>
            <option value="true">Has Images</option>
            <option value="false">No Images</option>
          </select>

          <select
            value={maxImages}
            onChange={(e) => {
              setMaxImages(e.target.value)
              setHasImages('')
              setPage(1)
            }}
            className="px-3 py-2 border rounded-lg bg-white"
          >
            <option value="">Any Image Count</option>
            <option value="25">Less than 25 images</option>
            <option value="50">Less than 50 images</option>
            <option value="100">Less than 100 images</option>
            <option value="250">Less than 250 images</option>
            <option value="500">Less than 500 images</option>
          </select>

          {(genus || hasImages || maxImages) && (
            <button
              onClick={() => {
                setGenus('')
                setHasImages('')
                setMaxImages('')
                setPage(1)
              }}
              className="flex items-center gap-1 px-2 py-1 text-sm text-gray-500 hover:text-gray-700"
            >
              <X className="w-3 h-3" />
              Clear
            </button>
          )}
        </div>

        <div className="ml-auto flex items-center gap-4">
          {maxImages && data && data.total > 0 && (
            <button
              onClick={() => setShowScrapeFilteredModal(true)}
              className="flex items-center gap-2 px-4 py-2 bg-purple-600 text-white rounded-lg hover:bg-purple-700"
            >
              <Play className="w-4 h-4" />
              Scrape All {data.total} Filtered
            </button>
          )}
          <button
            onClick={() => setShowScrapeAllModal(true)}
            className="flex items-center gap-2 px-4 py-2 bg-orange-600 text-white rounded-lg hover:bg-orange-700"
          >
            <Play className="w-4 h-4" />
            Scrape All Without Images
          </button>
          {selectedIds.length > 0 && (
            <div className="flex items-center gap-4">
              <span className="text-sm text-gray-600">
                {selectedIds.length} selected
              </span>
              <button
                onClick={() => setShowScrapeModal(true)}
                className="flex items-center gap-2 px-4 py-2 bg-blue-600 text-white rounded-lg hover:bg-blue-700"
              >
                <Play className="w-4 h-4" />
                Start Scrape
              </button>
            </div>
          )}
        </div>
      </div>

      {/* Table */}
      <div className="bg-white rounded-lg shadow overflow-hidden">
        <table className="w-full">
          <thead className="bg-gray-50">
            <tr>
              <th className="px-4 py-3 text-left">
                <input
                  type="checkbox"
                  checked={(data?.items?.length ?? 0) > 0 && selectedIds.length === (data?.items?.length ?? 0)}
                  onChange={handleSelectAll}
                  className="rounded"
                />
              </th>
              <th className="px-4 py-3 text-left text-sm font-medium text-gray-600">
                Scientific Name
              </th>
              <th className="px-4 py-3 text-left text-sm font-medium text-gray-600">
                Common Name
              </th>
              <th className="px-4 py-3 text-left text-sm font-medium text-gray-600">
                Genus
              </th>
              <th className="px-4 py-3 text-right text-sm font-medium text-gray-600">
                Images
              </th>
              <th className="px-4 py-3 text-right text-sm font-medium text-gray-600">
                Actions
              </th>
            </tr>
          </thead>
          <tbody>
            {isLoading ? (
              <tr>
                <td colSpan={6} className="px-4 py-8 text-center text-gray-400">
                  Loading...
                </td>
              </tr>
            ) : data?.items.length === 0 ? (
              <tr>
                <td colSpan={6} className="px-4 py-8 text-center text-gray-400">
                  No species found. Import a CSV or JSON file to get started.
                </td>
              </tr>
            ) : (
              data?.items.map((species) => (
                <tr
                  key={species.id}
                  className="border-t hover:bg-gray-50 cursor-pointer"
                  onClick={() => setViewSpecies(species)}
                >
                  <td className="px-4 py-3" onClick={(e) => e.stopPropagation()}>
                    <input
                      type="checkbox"
                      checked={selectedIds.includes(species.id)}
                      onChange={() => handleSelect(species.id)}
                      className="rounded"
                    />
                  </td>
                  <td className="px-4 py-3 font-medium">{species.scientific_name}</td>
                  <td className="px-4 py-3 text-gray-600">
                    {species.common_name || '-'}
                  </td>
                  <td className="px-4 py-3 text-gray-600">{species.genus || '-'}</td>
                  <td className="px-4 py-3 text-right">
                    <span
                      className={`inline-block px-2 py-1 rounded text-sm ${
                        species.image_count >= 100
                          ? 'bg-green-100 text-green-700'
                          : species.image_count > 0
                          ? 'bg-yellow-100 text-yellow-700'
                          : 'bg-gray-100 text-gray-600'
                      }`}
                    >
                      {species.image_count}
                    </span>
                  </td>
                  <td className="px-4 py-3 text-right" onClick={(e) => e.stopPropagation()}>
                    <button
                      onClick={() => deleteMutation.mutate(species.id)}
                      className="p-1 text-red-500 hover:bg-red-50 rounded"
                    >
                      <Trash2 className="w-4 h-4" />
                    </button>
                  </td>
                </tr>
              ))
            )}
          </tbody>
        </table>
      </div>

      {/* Pagination */}
      {data && data.pages > 1 && (
        <div className="flex items-center justify-between">
          <span className="text-sm text-gray-600">
            Showing {(page - 1) * 50 + 1} to {Math.min(page * 50, data.total)} of{' '}
            {data.total}
          </span>
          <div className="flex gap-2">
            <button
              onClick={() => setPage((p) => Math.max(1, p - 1))}
              disabled={page === 1}
              className="p-2 rounded border disabled:opacity-50"
            >
              <ChevronLeft className="w-4 h-4" />
            </button>
            <span className="px-4 py-2">
              Page {page} of {data.pages}
            </span>
            <button
              onClick={() => setPage((p) => Math.min(data.pages, p + 1))}
              disabled={page === data.pages}
              className="p-2 rounded border disabled:opacity-50"
            >
              <ChevronRight className="w-4 h-4" />
            </button>
          </div>
        </div>
      )}

      {/* Add Species Modal */}
      {showAddModal && (
        <AddSpeciesModal onClose={() => setShowAddModal(false)} />
      )}

      {/* Scrape Modal */}
      {showScrapeModal && (
        <ScrapeModal
          selectedIds={selectedIds}
          onClose={() => setShowScrapeModal(false)}
          onSubmit={(source) => {
            createJobMutation.mutate({
              name: `Scrape ${selectedIds.length} species from ${source}`,
              source,
              species_ids: selectedIds,
            })
          }}
        />
      )}

      {/* Species Detail Modal */}
      {viewSpecies && (
        <SpeciesDetailModal
          species={viewSpecies}
          onClose={() => setViewSpecies(null)}
        />
      )}

      {/* Scrape All Without Images Modal */}
      {showScrapeAllModal && (
        <ScrapeAllModal
          onClose={() => setShowScrapeAllModal(false)}
        />
      )}

      {/* Scrape All Filtered Modal */}
      {showScrapeFilteredModal && (
        <ScrapeFilteredModal
          maxImages={parseInt(maxImages, 10)}
          speciesCount={data?.total ?? 0}
          onClose={() => setShowScrapeFilteredModal(false)}
        />
      )}
    </div>
  )
}

function AddSpeciesModal({ onClose }: { onClose: () => void }) {
  const queryClient = useQueryClient()
  const [form, setForm] = useState({
    scientific_name: '',
    common_name: '',
    genus: '',
    family: '',
  })

  const mutation = useMutation({
    mutationFn: () => speciesApi.create(form),
    onSuccess: () => {
      queryClient.invalidateQueries({ queryKey: ['species'] })
      onClose()
    },
  })

  return (
    <div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50">
      <div className="bg-white rounded-lg p-6 w-full max-w-md">
        <h2 className="text-xl font-bold mb-4">Add Species</h2>
        <div className="space-y-4">
          <div>
            <label className="block text-sm font-medium mb-1">
              Scientific Name *
            </label>
            <input
              type="text"
              value={form.scientific_name}
              onChange={(e) =>
                setForm({ ...form, scientific_name: e.target.value })
              }
              className="w-full px-3 py-2 border rounded-lg"
              placeholder="e.g. Monstera deliciosa"
            />
          </div>
          <div>
            <label className="block text-sm font-medium mb-1">Common Name</label>
            <input
              type="text"
              value={form.common_name}
              onChange={(e) => setForm({ ...form, common_name: e.target.value })}
              className="w-full px-3 py-2 border rounded-lg"
              placeholder="e.g. Swiss Cheese Plant"
            />
          </div>
          <div className="grid grid-cols-2 gap-4">
            <div>
              <label className="block text-sm font-medium mb-1">Genus</label>
              <input
                type="text"
                value={form.genus}
                onChange={(e) => setForm({ ...form, genus: e.target.value })}
                className="w-full px-3 py-2 border rounded-lg"
                placeholder="e.g. Monstera"
              />
            </div>
            <div>
              <label className="block text-sm font-medium mb-1">Family</label>
              <input
                type="text"
                value={form.family}
                onChange={(e) => setForm({ ...form, family: e.target.value })}
                className="w-full px-3 py-2 border rounded-lg"
                placeholder="e.g. Araceae"
              />
            </div>
          </div>
        </div>
        <div className="flex justify-end gap-2 mt-6">
          <button
            onClick={onClose}
            className="px-4 py-2 border rounded-lg hover:bg-gray-50"
          >
            Cancel
          </button>
          <button
            onClick={() => mutation.mutate()}
            disabled={!form.scientific_name}
            className="px-4 py-2 bg-green-600 text-white rounded-lg hover:bg-green-700 disabled:opacity-50"
          >
            Add Species
          </button>
        </div>
      </div>
    </div>
  )
}

function ScrapeModal({
  selectedIds,
  onClose,
  onSubmit,
}: {
  selectedIds: number[]
  onClose: () => void
  onSubmit: (source: string) => void
}) {
  const [source, setSource] = useState('inaturalist')

  const sources = [
    { value: 'gbif', label: 'GBIF' },
    { value: 'inaturalist', label: 'iNaturalist' },
    { value: 'flickr', label: 'Flickr' },
    { value: 'wikimedia', label: 'Wikimedia Commons' },
    { value: 'trefle', label: 'Trefle.io' },
    { value: 'duckduckgo', label: 'DuckDuckGo' },
    { value: 'bing', label: 'Bing Image Search' },
  ]

  return (
    <div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50">
      <div className="bg-white rounded-lg p-6 w-full max-w-md">
        <h2 className="text-xl font-bold mb-4">Start Scrape Job</h2>
        <p className="text-gray-600 mb-4">
          Scrape images for {selectedIds.length} selected species
        </p>
        <div>
          <label className="block text-sm font-medium mb-2">Select Source</label>
          <div className="space-y-2">
            {sources.map((s) => (
              <label
                key={s.value}
                className={`flex items-center p-3 border rounded-lg cursor-pointer ${
                  source === s.value ? 'border-green-500 bg-green-50' : ''
                }`}
              >
                <input
                  type="radio"
                  value={s.value}
                  checked={source === s.value}
                  onChange={(e) => setSource(e.target.value)}
                  className="mr-3"
                />
                {s.label}
              </label>
            ))}
          </div>
        </div>
        <div className="flex justify-end gap-2 mt-6">
          <button
            onClick={onClose}
            className="px-4 py-2 border rounded-lg hover:bg-gray-50"
          >
            Cancel
          </button>
          <button
            onClick={() => onSubmit(source)}
            className="px-4 py-2 bg-blue-600 text-white rounded-lg hover:bg-blue-700"
          >
            Start Scrape
          </button>
        </div>
      </div>
    </div>
  )
}

function SpeciesDetailModal({
  species,
  onClose,
}: {
  species: SpeciesType
  onClose: () => void
}) {
  const [page, setPage] = useState(1)
  const pageSize = 20

  const { data, isLoading } = useQuery({
    queryKey: ['species-images', species.id, page],
    queryFn: () =>
      imagesApi.list({
        species_id: species.id,
        status: 'downloaded',
        page,
        page_size: pageSize,
      }).then((res) => res.data),
  })

  return (
    <div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50 p-4">
      <div className="bg-white rounded-lg w-full max-w-5xl max-h-[90vh] flex flex-col">
        {/* Header */}
        <div className="px-6 py-4 border-b flex items-start justify-between">
          <div>
            <h2 className="text-xl font-bold">{species.scientific_name}</h2>
            {species.common_name && (
              <p className="text-gray-600">{species.common_name}</p>
            )}
            <div className="flex gap-4 mt-2 text-sm text-gray-500">
              {species.genus && <span>Genus: {species.genus}</span>}
              {species.family && <span>Family: {species.family}</span>}
              <span>{species.image_count} images</span>
            </div>
          </div>
          <button
            onClick={onClose}
            className="p-2 hover:bg-gray-100 rounded-lg"
          >
            <X className="w-5 h-5" />
          </button>
        </div>

        {/* Images Grid */}
        <div className="flex-1 overflow-y-auto p-6">
          {isLoading ? (
            <div className="flex items-center justify-center h-64">
              <div className="animate-spin rounded-full h-8 w-8 border-b-2 border-green-600"></div>
            </div>
          ) : !data || data.items.length === 0 ? (
            <div className="flex flex-col items-center justify-center h-64 text-gray-400">
              <ImageIcon className="w-12 h-12 mb-4" />
              <p>No images yet</p>
              <p className="text-sm mt-2">
                Start a scrape job to download images for this species
              </p>
            </div>
          ) : (
            <div className="grid grid-cols-2 sm:grid-cols-3 md:grid-cols-4 lg:grid-cols-5 gap-4">
              {data.items.map((image) => (
                <div
                  key={image.id}
                  className="group relative aspect-square bg-gray-100 rounded-lg overflow-hidden"
                >
                  {image.local_path ? (
                    <img
                      src={`/api/images/${image.id}/file`}
                      alt={species.scientific_name}
                      className="w-full h-full object-cover"
                      loading="lazy"
                    />
                  ) : (
                    <div className="w-full h-full flex items-center justify-center text-gray-400">
                      <ImageIcon className="w-8 h-8" />
                    </div>
                  )}
                  {/* Overlay with info */}
                  <div className="absolute inset-0 bg-black/60 opacity-0 group-hover:opacity-100 transition-opacity flex flex-col justify-end p-2">
                    <div className="text-white text-xs">
                      <div className="flex items-center justify-between">
                        <span className="bg-white/20 px-1.5 py-0.5 rounded">
                          {image.source}
                        </span>
                        <span className="bg-white/20 px-1.5 py-0.5 rounded">
                          {image.license}
                        </span>
                      </div>
                      {image.width && image.height && (
                        <div className="mt-1 text-white/70">
                          {image.width} × {image.height}
                        </div>
                      )}
                    </div>
                    {image.url && (
                      <a
                        href={image.url}
                        target="_blank"
                        rel="noopener noreferrer"
                        className="absolute top-2 right-2 p-1 bg-white/20 rounded hover:bg-white/40"
                        onClick={(e) => e.stopPropagation()}
                      >
                        <ExternalLink className="w-4 h-4 text-white" />
                      </a>
                    )}
                  </div>
                </div>
              ))}
            </div>
          )}
        </div>

        {/* Pagination */}
        {data && data.pages > 1 && (
          <div className="px-6 py-4 border-t flex items-center justify-between">
            <span className="text-sm text-gray-600">
              Showing {(page - 1) * pageSize + 1} to{' '}
              {Math.min(page * pageSize, data.total)} of {data.total}
            </span>
            <div className="flex gap-2">
              <button
                onClick={() => setPage((p) => Math.max(1, p - 1))}
                disabled={page === 1}
                className="p-2 rounded border disabled:opacity-50"
              >
                <ChevronLeft className="w-4 h-4" />
              </button>
              <span className="px-4 py-2">
                Page {page} of {data.pages}
              </span>
              <button
                onClick={() => setPage((p) => Math.min(data.pages, p + 1))}
                disabled={page === data.pages}
                className="p-2 rounded border disabled:opacity-50"
              >
                <ChevronRight className="w-4 h-4" />
              </button>
            </div>
          </div>
        )}
      </div>
    </div>
  )
}

function ScrapeAllModal({ onClose }: { onClose: () => void }) {
  const [selectedSources, setSelectedSources] = useState<string[]>([])
  const [isSubmitting, setIsSubmitting] = useState(false)

  // Fetch count of species without images
  const { data: speciesData, isLoading } = useQuery({
    queryKey: ['species-no-images'],
    queryFn: () =>
      speciesApi.list({
        page: 1,
        page_size: 1,
        has_images: false,
      }).then((res) => res.data),
  })

  const sources = [
    { value: 'gbif', label: 'GBIF', description: 'Free biodiversity database, no API key needed' },
    { value: 'inaturalist', label: 'iNaturalist', description: 'Research-grade observations with CC licenses' },
    { value: 'wikimedia', label: 'Wikimedia Commons', description: 'Free media repository, requires OAuth' },
    { value: 'flickr', label: 'Flickr', description: 'Requires API key, CC-licensed photos' },
    { value: 'trefle', label: 'Trefle.io', description: 'Plant database, requires API key' },
    { value: 'duckduckgo', label: 'DuckDuckGo', description: 'Web image search, no API key needed' },
    { value: 'bing', label: 'Bing Image Search', description: 'Azure Cognitive Services, requires API key' },
  ]

  const toggleSource = (source: string) => {
    setSelectedSources((prev) =>
      prev.includes(source)
        ? prev.filter((s) => s !== source)
        : [...prev, source]
    )
  }

  const handleSubmit = async () => {
    if (selectedSources.length === 0) return

    setIsSubmitting(true)
    try {
      // Create a job for each selected source
      for (const source of selectedSources) {
        await jobsApi.create({
          name: `Scrape all species without images from ${source}`,
          source,
          only_without_images: true,
        })
      }
      alert(`Created ${selectedSources.length} scrape job(s)!`)
      onClose()
    } catch (error) {
      alert('Failed to create jobs')
    } finally {
      setIsSubmitting(false)
    }
  }

  const speciesCount = speciesData?.total ?? 0

  return (
    <div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50">
      <div className="bg-white rounded-lg p-6 w-full max-w-lg">
        <h2 className="text-xl font-bold mb-2">Scrape All Species Without Images</h2>
        {isLoading ? (
          <p className="text-gray-600 mb-4">Loading...</p>
        ) : (
          <p className="text-gray-600 mb-4">
            {speciesCount === 0 ? (
              'All species already have images!'
            ) : (
              <>
                <span className="font-semibold text-orange-600">{speciesCount}</span> species
                don't have any images yet. Select sources to scrape from:
              </>
            )}
          </p>
        )}

        {speciesCount > 0 && (
          <>
            <div className="space-y-2 mb-6">
              {sources.map((s) => (
                <label
                  key={s.value}
                  className={`flex items-start p-3 border rounded-lg cursor-pointer transition-colors ${
                    selectedSources.includes(s.value)
                      ? 'border-orange-500 bg-orange-50'
                      : 'hover:bg-gray-50'
                  }`}
                >
                  <input
                    type="checkbox"
                    checked={selectedSources.includes(s.value)}
                    onChange={() => toggleSource(s.value)}
                    className="mt-1 mr-3 rounded"
                  />
                  <div>
                    <div className="font-medium">{s.label}</div>
                    <div className="text-sm text-gray-500">{s.description}</div>
                  </div>
                </label>
              ))}
            </div>

            {selectedSources.length > 1 && (
              <div className="bg-blue-50 border border-blue-200 rounded-lg p-3 mb-4 text-sm text-blue-700">
                <strong>{selectedSources.length} jobs</strong> will be created and run in parallel,
                one for each selected source.
              </div>
            )}
          </>
        )}

        <div className="flex justify-end gap-2">
          <button
            onClick={onClose}
            className="px-4 py-2 border rounded-lg hover:bg-gray-50"
          >
            Cancel
          </button>
          {speciesCount > 0 && (
            <button
              onClick={handleSubmit}
              disabled={selectedSources.length === 0 || isSubmitting}
              className="px-4 py-2 bg-orange-600 text-white rounded-lg hover:bg-orange-700 disabled:opacity-50"
            >
              {isSubmitting
                ? 'Creating Jobs...'
                : `Start ${selectedSources.length || ''} Scrape Job${selectedSources.length !== 1 ? 's' : ''}`}
            </button>
          )}
        </div>
      </div>
    </div>
  )
}

function ScrapeFilteredModal({
  maxImages,
  speciesCount,
  onClose,
}: {
  maxImages: number
  speciesCount: number
  onClose: () => void
}) {
  const [selectedSources, setSelectedSources] = useState<string[]>([])
  const [isSubmitting, setIsSubmitting] = useState(false)

  const sources = [
    { value: 'gbif', label: 'GBIF', description: 'Free biodiversity database, no API key needed' },
    { value: 'inaturalist', label: 'iNaturalist', description: 'Research-grade observations with CC licenses' },
    { value: 'wikimedia', label: 'Wikimedia Commons', description: 'Free media repository, requires OAuth' },
    { value: 'flickr', label: 'Flickr', description: 'Requires API key, CC-licensed photos' },
    { value: 'trefle', label: 'Trefle.io', description: 'Plant database, requires API key' },
    { value: 'duckduckgo', label: 'DuckDuckGo', description: 'Web image search, no API key needed' },
    { value: 'bing', label: 'Bing Image Search', description: 'Azure Cognitive Services, requires API key' },
  ]

  const toggleSource = (source: string) => {
    setSelectedSources((prev) =>
      prev.includes(source)
        ? prev.filter((s) => s !== source)
        : [...prev, source]
    )
  }

  const handleSubmit = async () => {
    if (selectedSources.length === 0) return

    setIsSubmitting(true)
    try {
      for (const source of selectedSources) {
        await jobsApi.create({
          name: `Scrape species with <${maxImages} images from ${source}`,
          source,
          max_images: maxImages,
        })
      }
      alert(`Created ${selectedSources.length} scrape job(s)!`)
      onClose()
    } catch (error) {
      alert('Failed to create jobs')
    } finally {
      setIsSubmitting(false)
    }
  }

  return (
    <div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50">
      <div className="bg-white rounded-lg p-6 w-full max-w-lg">
        <h2 className="text-xl font-bold mb-2">Scrape All Filtered Species</h2>
        <p className="text-gray-600 mb-4">
          <span className="font-semibold text-purple-600">{speciesCount}</span> species
          have fewer than <span className="font-semibold">{maxImages}</span> images.
          Select sources to scrape from:
        </p>

        <div className="space-y-2 mb-6">
          {sources.map((s) => (
            <label
              key={s.value}
              className={`flex items-start p-3 border rounded-lg cursor-pointer transition-colors ${
                selectedSources.includes(s.value)
                  ? 'border-purple-500 bg-purple-50'
                  : 'hover:bg-gray-50'
              }`}
            >
              <input
                type="checkbox"
                checked={selectedSources.includes(s.value)}
                onChange={() => toggleSource(s.value)}
                className="mt-1 mr-3 rounded"
              />
              <div>
                <div className="font-medium">{s.label}</div>
                <div className="text-sm text-gray-500">{s.description}</div>
              </div>
            </label>
          ))}
        </div>

        {selectedSources.length > 1 && (
          <div className="bg-blue-50 border border-blue-200 rounded-lg p-3 mb-4 text-sm text-blue-700">
            <strong>{selectedSources.length} jobs</strong> will be created and run in parallel,
            one for each selected source.
          </div>
        )}

        <div className="flex justify-end gap-2">
          <button
            onClick={onClose}
            className="px-4 py-2 border rounded-lg hover:bg-gray-50"
          >
            Cancel
          </button>
          <button
            onClick={handleSubmit}
            disabled={selectedSources.length === 0 || isSubmitting}
            className="px-4 py-2 bg-purple-600 text-white rounded-lg hover:bg-purple-700 disabled:opacity-50"
          >
            {isSubmitting
              ? 'Creating Jobs...'
              : `Start ${selectedSources.length || ''} Scrape Job${selectedSources.length !== 1 ? 's' : ''}`}
          </button>
        </div>
      </div>
    </div>
  )
}
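Review note: `ScrapeAllModal` and `ScrapeFilteredModal` each declare the same `sources` array, the same `toggleSource` handler, and the same submit-button label template. One possible way to share them is sketched below. The names `ScrapeSource`, `SCRAPE_SOURCES`, `toggleInList`, and `submitLabel` are hypothetical and do not appear in this commit; only the data and the logic are copied from the two modals.

```typescript
// Hypothetical shared helpers; the names are illustrative, not part of the commit.
interface ScrapeSource {
  value: string
  label: string
  description: string
}

// Same entries, in the same order, as the modal-local `sources` arrays.
const SCRAPE_SOURCES: ScrapeSource[] = [
  { value: 'gbif', label: 'GBIF', description: 'Free biodiversity database, no API key needed' },
  { value: 'inaturalist', label: 'iNaturalist', description: 'Research-grade observations with CC licenses' },
  { value: 'wikimedia', label: 'Wikimedia Commons', description: 'Free media repository, requires OAuth' },
  { value: 'flickr', label: 'Flickr', description: 'Requires API key, CC-licensed photos' },
  { value: 'trefle', label: 'Trefle.io', description: 'Plant database, requires API key' },
  { value: 'duckduckgo', label: 'DuckDuckGo', description: 'Web image search, no API key needed' },
  { value: 'bing', label: 'Bing Image Search', description: 'Azure Cognitive Services, requires API key' },
]

// Pure form of `toggleSource`: drop the value if present, otherwise append it.
// The input array is never mutated, so it is safe inside a React state updater.
function toggleInList(list: string[], value: string): string[] {
  return list.includes(value) ? list.filter((v) => v !== value) : [...list, value]
}

// Same template literal both modals use for the submit button.
function submitLabel(count: number): string {
  return `Start ${count || ''} Scrape Job${count !== 1 ? 's' : ''}`
}
```

Inside either modal the handler would then reduce to `setSelectedSources((prev) => toggleInList(prev, source))`, and the list would be rendered from `SCRAPE_SOURCES`.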
9
frontend/src/vite-env.d.ts
vendored
Normal file
@@ -0,0 +1,9 @@
/// <reference types="vite/client" />

interface ImportMetaEnv {
  readonly VITE_API_URL: string
}

interface ImportMeta {
  readonly env: ImportMetaEnv
}
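Review note: the declaration above types `import.meta.env.VITE_API_URL` as a required string, but at runtime the variable is only present if it is set at build time. A minimal sketch of a defensive lookup follows. `ApiEnv` and `resolveApiBase` are hypothetical names, and the `/api` fallback is an assumption based on the `/api` proxy rules in `vite.config.ts` and `nginx/nginx.conf`; the commit's actual API client may do this differently.

```typescript
// Mirrors the field declared in vite-env.d.ts; optional here so the
// fallback branch can be exercised.
interface ApiEnv {
  readonly VITE_API_URL?: string
}

// Hypothetical helper: prefer VITE_API_URL when set, otherwise fall back to
// the relative '/api' prefix (assumed from the proxy configuration).
function resolveApiBase(env: ApiEnv): string {
  const configured = env.VITE_API_URL?.trim()
  // Strip trailing slashes so callers can append paths without doubling them.
  return configured ? configured.replace(/\/+$/, '') : '/api'
}
```

In application code this would be called as `resolveApiBase(import.meta.env)`.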
11
frontend/tailwind.config.js
Normal file
@@ -0,0 +1,11 @@
/** @type {import('tailwindcss').Config} */
export default {
  content: [
    "./index.html",
    "./src/**/*.{js,ts,jsx,tsx}",
  ],
  theme: {
    extend: {},
  },
  plugins: [],
}
21
frontend/tsconfig.json
Normal file
@@ -0,0 +1,21 @@
{
  "compilerOptions": {
    "target": "ES2020",
    "useDefineForClassFields": true,
    "lib": ["ES2020", "DOM", "DOM.Iterable"],
    "module": "ESNext",
    "skipLibCheck": true,
    "moduleResolution": "bundler",
    "allowImportingTsExtensions": true,
    "resolveJsonModule": true,
    "isolatedModules": true,
    "noEmit": true,
    "jsx": "react-jsx",
    "strict": true,
    "noUnusedLocals": true,
    "noUnusedParameters": true,
    "noFallthroughCasesInSwitch": true
  },
  "include": ["src"],
  "references": [{ "path": "./tsconfig.node.json" }]
}
10
frontend/tsconfig.node.json
Normal file
@@ -0,0 +1,10 @@
{
  "compilerOptions": {
    "composite": true,
    "skipLibCheck": true,
    "module": "ESNext",
    "moduleResolution": "bundler",
    "allowSyntheticDefaultImports": true
  },
  "include": ["vite.config.ts"]
}
18
frontend/vite.config.ts
Normal file
@@ -0,0 +1,18 @@
import { defineConfig } from 'vite'
import react from '@vitejs/plugin-react'

export default defineConfig({
  plugins: [react()],
  server: {
    port: 3000,
    host: true,
    proxy: {
      '/api': {
        target: 'http://backend:8000',
        changeOrigin: true,
      },
    },
    // Disable HMR - not useful in Docker deployments
    hmr: false,
  },
})
18874
houseplants_list.json
Executable file
File diff suppressed because it is too large
58
nginx/nginx.conf
Normal file
@@ -0,0 +1,58 @@
events {
    worker_connections 1024;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    upstream backend {
        server backend:8000;
    }

    upstream frontend {
        server frontend:3000;
    }

    server {
        listen 80;
        server_name localhost;

        # API routes
        location /api {
            proxy_pass http://backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            # Timeouts for slow API calls (60s equals the nginx default; raise these to actually extend them)
            proxy_connect_timeout 60s;
            proxy_send_timeout 60s;
            proxy_read_timeout 60s;
        }

        # Health check
        location /health {
            proxy_pass http://backend;
        }

        # WebSocket support for hot reload
        location /ws {
            proxy_pass http://frontend;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
        }

        # Frontend
        location / {
            proxy_pass http://frontend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
        }
    }
}