# PlantGuideScraper

Web-based interface for managing a multi-source houseplant image scraping pipeline. Collects images from iNaturalist, Flickr, Wikimedia Commons, and Trefle.io to build datasets for Core ML training.
## Features

- **Species Management**: Import species lists via CSV or JSON, search and filter by genus or image status
- **Multi-Source Scraping**: iNaturalist/GBIF, Flickr, Wikimedia Commons, Trefle.io
- **Image Quality Pipeline**: Automatic deduplication, blur detection, resizing
- **License Filtering**: Only collect commercially safe CC0/CC-BY licensed images
- **Export for Core ML**: Train/test split, Create ML-compatible folder structure
- **Real-time Dashboard**: Progress tracking, statistics, job monitoring
## Quick Start

```bash
# Clone and start
cd PlantGuideScraper
docker-compose up --build

# Access the UI
open http://localhost
```
## Unraid Deployment

### Setup

1. Copy the project to your Unraid server:

   ```bash
   scp -r PlantGuideScraper root@YOUR_UNRAID_IP:/mnt/user/appdata/PlantGuideScraper
   ```

2. SSH into Unraid and create the data directories:

   ```bash
   ssh root@YOUR_UNRAID_IP
   mkdir -p /mnt/user/appdata/PlantGuideScraper/{database,images,exports,redis}
   ```

3. Install Docker Compose Manager from Community Applications.

4. In Unraid: Docker → Compose → Add New Stack
   - Path: `/mnt/user/appdata/PlantGuideScraper/docker-compose.unraid.yml`
   - Click Compose Up

5. Access the UI at `http://YOUR_UNRAID_IP:8580`
### Configurable Paths

Edit `docker-compose.unraid.yml` to customize where data is stored. Look for these lines in both the `backend` and `celery` services:

```yaml
# === CONFIGURABLE DATA PATHS ===
- /mnt/user/appdata/PlantGuideScraper/database:/data/db       # DATABASE_PATH
- /mnt/user/appdata/PlantGuideScraper/images:/data/images     # IMAGES_PATH
- /mnt/user/appdata/PlantGuideScraper/exports:/data/exports   # EXPORTS_PATH
```
| Path | Description | Default |
|---|---|---|
| `DATABASE_PATH` | SQLite database file | `/mnt/user/appdata/PlantGuideScraper/database` |
| `IMAGES_PATH` | Downloaded images (can be 100 GB+) | `/mnt/user/appdata/PlantGuideScraper/images` |
| `EXPORTS_PATH` | Generated export zip files | `/mnt/user/appdata/PlantGuideScraper/exports` |
Example: store images on a separate share:

```yaml
- /mnt/user/data/PlantImages:/data/images  # IMAGES_PATH
```

**Important:** Keep the paths identical in both the `backend` and `celery` services.
## Configuration

1. Configure API keys in Settings:
   - Flickr: get a key at https://www.flickr.com/services/api/
   - Trefle: get a key at https://trefle.io/
   - iNaturalist and Wikimedia Commons don't require keys
2. Import a species list (see Import Documentation below)
3. Select species and start scraping
## Import Documentation

### CSV Import

Import species from a CSV file with the following columns:

| Column | Required | Description |
|---|---|---|
| `scientific_name` | Yes | Binomial name (e.g., "Monstera deliciosa") |
| `common_name` | No | Common name (e.g., "Swiss Cheese Plant") |
| `genus` | No | Auto-extracted from `scientific_name` if not provided |
| `family` | No | Plant family (e.g., "Araceae") |
Example CSV:

```csv
scientific_name,common_name,genus,family
Monstera deliciosa,Swiss Cheese Plant,Monstera,Araceae
Philodendron hederaceum,Heartleaf Philodendron,Philodendron,Araceae
Epipremnum aureum,Golden Pothos,Epipremnum,Araceae
```
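A file in this format can also be generated programmatically. This is a minimal sketch using Python's standard `csv` module; the `build_species_csv` helper and the in-memory rows are illustrative, not part of the scraper:

```python
import csv
import io

# Example rows matching the CSV above: (scientific_name, common_name, genus, family)
SPECIES = [
    ("Monstera deliciosa", "Swiss Cheese Plant", "Monstera", "Araceae"),
    ("Philodendron hederaceum", "Heartleaf Philodendron", "Philodendron", "Araceae"),
    ("Epipremnum aureum", "Golden Pothos", "Epipremnum", "Araceae"),
]

def build_species_csv(rows):
    """Serialize species rows into the column layout the importer expects."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["scientific_name", "common_name", "genus", "family"])
    writer.writerows(rows)
    return buf.getvalue()
```

Using `csv.writer` (rather than string concatenation) keeps names containing commas correctly quoted.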
### JSON Import

Import species from a JSON file with the following structure:

```json
{
  "plants": [
    {
      "scientific_name": "Monstera deliciosa",
      "common_names": ["Swiss Cheese Plant", "Split-leaf Philodendron"],
      "family": "Araceae"
    },
    {
      "scientific_name": "Philodendron hederaceum",
      "common_names": ["Heartleaf Philodendron"],
      "family": "Araceae"
    }
  ]
}
```
| Field | Required | Description |
|---|---|---|
| `scientific_name` | Yes | Binomial name |
| `common_names` | No | Array of common names (the first one is used) |
| `family` | No | Plant family |
Notes:

- Genus is automatically extracted from the first word of `scientific_name`
- Duplicate species (by `scientific_name`) are skipped
- The included `houseplants_list.json` contains 2,278 houseplant species
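The import rules above (genus extraction, duplicate skipping, first common name used) can be sketched in a few lines of Python. This is a simplified illustration of the documented behavior, not the actual importer code:

```python
def import_species(entries, existing_names=None):
    """Apply the documented import rules to a list of JSON plant entries.

    - Genus comes from the first word of scientific_name when not provided.
    - Entries whose scientific_name was already seen are skipped.
    - Only the first common name is kept.
    Returns (imported_records, skipped_count).
    """
    seen = set(existing_names or [])
    imported, skipped = [], 0
    for entry in entries:
        name = entry["scientific_name"]
        if name in seen:
            skipped += 1
            continue
        seen.add(name)
        imported.append({
            "scientific_name": name,
            "genus": entry.get("genus") or name.split()[0],
            "common_name": (entry.get("common_names") or [None])[0],
            "family": entry.get("family"),
        })
    return imported, skipped
```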
### API Endpoints

```bash
# Import CSV
curl -X POST http://localhost/api/species/import \
  -F "file=@species.csv"

# Import JSON
curl -X POST http://localhost/api/species/import-json \
  -F "file=@plants.json"
```
Response:

```json
{
  "imported": 150,
  "skipped": 5,
  "errors": []
}
```
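A script driving these endpoints can check the response for failures before moving on. A minimal sketch of parsing the response body shown above (the `summarize_import` helper is hypothetical):

```python
import json

def summarize_import(response_text):
    """Condense an import response into a one-line summary string."""
    r = json.loads(response_text)
    status = "ok" if not r["errors"] else f"{len(r['errors'])} error(s)"
    return f"imported {r['imported']}, skipped {r['skipped']} ({status})"
```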
## Architecture

```
┌─────────────┐     ┌─────────────────┐     ┌─────────────┐
│   React     │────▶│    FastAPI      │────▶│   Celery    │
│  Frontend   │     │    Backend      │     │   Workers   │
└─────────────┘     └─────────────────┘     └─────────────┘
                             │                       │
                             ▼                       ▼
                      ┌─────────────┐       ┌─────────────┐
                      │   SQLite    │       │    Redis    │
                      │  Database   │       │    Queue    │
                      └─────────────┘       └─────────────┘
```
## Export Format

Exports are Create ML-compatible:

```
export.zip/
├── Training/
│   ├── Monstera_deliciosa/
│   │   ├── img_00001.jpg
│   │   └── ...
│   └── ...
└── Testing/
    ├── Monstera_deliciosa/
    └── ...
```
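The Training/Testing layout above implies a per-species train/test split. A minimal sketch of how such a split could work, assuming an illustrative 80/20 ratio and a hypothetical `split_export` helper (the scraper's actual ratio and code may differ):

```python
import random

def split_export(image_paths, test_fraction=0.2, seed=42):
    """Assign one species' images to the Testing or Training folder.

    Shuffles deterministically (seeded) so repeated exports are stable,
    and reserves at least one image for Testing.
    """
    rng = random.Random(seed)
    shuffled = list(image_paths)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return {"Testing": shuffled[:n_test], "Training": shuffled[n_test:]}
```

Keeping the split per species (rather than over the whole pool) ensures every class appears in both Training and Testing.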
## Data Storage

All data is stored in the `./data` directory:

```
data/
├── db/
│   └── plants.sqlite      # SQLite database
├── images/                # Downloaded images
│   └── {species_id}/
│       └── {image_id}.jpg
└── exports/               # Generated export archives
    └── {export_id}.zip
```
## API Documentation

Full API docs are available at http://localhost/api/docs

## License

MIT