# PlantGuideScraper

Web-based interface for managing a multi-source houseplant image scraping pipeline. Collects images from iNaturalist, Flickr, Wikimedia Commons, and Trefle.io to build datasets for CoreML training.

## Features

- **Species Management**: Import species lists via CSV or JSON, search and filter by genus or image status
- **Multi-Source Scraping**: iNaturalist/GBIF, Flickr, Wikimedia Commons, Trefle.io
- **Image Quality Pipeline**: Automatic deduplication, blur detection, resizing
- **License Filtering**: Only collect commercially-safe CC0/CC-BY licensed images
- **Export for CoreML**: Train/test split, Create ML-compatible folder structure
- **Real-time Dashboard**: Progress tracking, statistics, job monitoring

## Quick Start

```bash
# Clone and start
cd PlantGuideScraper
docker-compose up --build

# Access the UI
open http://localhost
```

## Unraid Deployment

### Setup

1. Copy the project to your Unraid server:

   ```bash
   scp -r PlantGuideScraper root@YOUR_UNRAID_IP:/mnt/user/appdata/PlantGuideScraper
   ```

2. SSH into Unraid and create data directories:

   ```bash
   ssh root@YOUR_UNRAID_IP
   mkdir -p /mnt/user/appdata/PlantGuideScraper/{database,images,exports,redis}
   ```

3. Install **Docker Compose Manager** from Community Applications

4. In Unraid: **Docker → Compose → Add New Stack**
   - Path: `/mnt/user/appdata/PlantGuideScraper/docker-compose.unraid.yml`
   - Click **Compose Up**

5. Access at `http://YOUR_UNRAID_IP:8580`

### Configurable Paths

Edit `docker-compose.unraid.yml` to customize where data is stored.
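An easy mistake when customizing paths is editing the `backend` volumes but not the matching `celery` ones, even though both services must mount the same host directory at each container path. A minimal Python sketch of such a consistency check (the compose excerpt and helper names here are illustrative, not part of the project):

```python
# Illustrative consistency check: backend and celery must mount the same
# host directory at each container path. The volume lists below are an
# excerpt in docker-compose "host:container" form, not the full file.
SERVICES = {
    "backend": [
        "/mnt/user/appdata/PlantGuideScraper/database:/data/db",
        "/mnt/user/appdata/PlantGuideScraper/images:/data/images",
        "/mnt/user/appdata/PlantGuideScraper/exports:/data/exports",
    ],
    "celery": [
        "/mnt/user/appdata/PlantGuideScraper/database:/data/db",
        "/mnt/user/appdata/PlantGuideScraper/images:/data/images",
        "/mnt/user/appdata/PlantGuideScraper/exports:/data/exports",
    ],
}


def mounts(volumes):
    """Map container path -> host path for 'host:container' volume strings."""
    pairs = (v.rsplit(":", 1) for v in volumes)
    return {container: host for host, container in pairs}


def mismatched(services):
    """Container paths whose host directory differs between the two services."""
    backend = mounts(services["backend"])
    celery = mounts(services["celery"])
    return sorted(c for c in backend if celery.get(c) != backend[c])


print(mismatched(SERVICES))  # an empty list means the paths agree
```

If, for example, only the backend's IMAGES_PATH were changed to `/mnt/user/data/PlantImages`, the check would report `/data/images` as mismatched.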
Look for these lines in both the `backend` and `celery` services:

```yaml
# === CONFIGURABLE DATA PATHS ===
- /mnt/user/appdata/PlantGuideScraper/database:/data/db      # DATABASE_PATH
- /mnt/user/appdata/PlantGuideScraper/images:/data/images    # IMAGES_PATH
- /mnt/user/appdata/PlantGuideScraper/exports:/data/exports  # EXPORTS_PATH
```

| Path | Description | Default |
|------|-------------|---------|
| DATABASE_PATH | SQLite database directory | `/mnt/user/appdata/PlantGuideScraper/database` |
| IMAGES_PATH | Downloaded images (can be 100GB+) | `/mnt/user/appdata/PlantGuideScraper/images` |
| EXPORTS_PATH | Generated export zip files | `/mnt/user/appdata/PlantGuideScraper/exports` |

**Example: Store images on a separate share:**

```yaml
- /mnt/user/data/PlantImages:/data/images  # IMAGES_PATH
```

**Important:** Keep paths identical in both the `backend` and `celery` services.

## Configuration

1. Configure API keys in Settings:
   - **Flickr**: Get a key at https://www.flickr.com/services/api/
   - **Trefle**: Get a key at https://trefle.io/
   - iNaturalist and Wikimedia don't require keys
2. Import a species list (see Import Documentation below)
3. Select species and start scraping

## Import Documentation

### CSV Import

Import species from a CSV file with the following columns:

| Column | Required | Description |
|--------|----------|-------------|
| `scientific_name` | Yes | Binomial name (e.g., "Monstera deliciosa") |
| `common_name` | No | Common name (e.g., "Swiss Cheese Plant") |
| `genus` | No | Auto-extracted from `scientific_name` if not provided |
| `family` | No | Plant family (e.g., "Araceae") |

**Example CSV:**

```csv
scientific_name,common_name,genus,family
Monstera deliciosa,Swiss Cheese Plant,Monstera,Araceae
Philodendron hederaceum,Heartleaf Philodendron,Philodendron,Araceae
Epipremnum aureum,Golden Pothos,Epipremnum,Araceae
```

### JSON Import

Import species from a JSON file with the following structure:

```json
{
  "plants": [
    {
      "scientific_name": "Monstera deliciosa",
      "common_names": ["Swiss Cheese Plant", "Split-leaf Philodendron"],
      "family": "Araceae"
    },
    {
      "scientific_name": "Philodendron hederaceum",
      "common_names": ["Heartleaf Philodendron"],
      "family": "Araceae"
    }
  ]
}
```

| Field | Required | Description |
|-------|----------|-------------|
| `scientific_name` | Yes | Binomial name |
| `common_names` | No | Array of common names (the first one is used) |
| `family` | No | Plant family |

**Notes:**

- Genus is automatically extracted from the first word of `scientific_name`
- Duplicate species (by `scientific_name`) are skipped
- The included `houseplants_list.json` contains 2,278 houseplant species

### API Endpoints

```bash
# Import CSV
curl -X POST http://localhost/api/species/import \
  -F "file=@species.csv"

# Import JSON
curl -X POST http://localhost/api/species/import-json \
  -F "file=@plants.json"
```

**Response:**

```json
{
  "imported": 150,
  "skipped": 5,
  "errors": []
}
```

## Architecture

```
┌─────────────┐     ┌─────────────────┐     ┌─────────────┐
│   React     │────▶│    FastAPI      │────▶│   Celery    │
│  Frontend   │     │    Backend      │     │   Workers   │
└─────────────┘     └─────────────────┘     └─────────────┘
                            │                      │
                            ▼                      ▼
                    ┌─────────────┐     ┌─────────────┐
                    │   SQLite    │     │    Redis    │
                    │  Database   │     │    Queue    │
                    └─────────────┘     └─────────────┘
```

## Export Format

Exports are Create ML-compatible:

```
export.zip/
├── Training/
│   ├── Monstera_deliciosa/
│   │   ├── img_00001.jpg
│   │   └── ...
│   └── ...
└── Testing/
    ├── Monstera_deliciosa/
    └── ...
```

## Data Storage

All data is stored in the `./data` directory:

```
data/
├── db/
│   └── plants.sqlite      # SQLite database
├── images/                # Downloaded images
│   └── {species_id}/
│       └── {image_id}.jpg
└── exports/               # Generated export archives
    └── {export_id}.zip
```

## API Documentation

Full API docs available at http://localhost/api/docs

## License

MIT