PlantGuideScraper

Web-based interface for managing a multi-source houseplant image scraping pipeline. Collects images from iNaturalist, Flickr, Wikimedia Commons, and Trefle.io to build datasets for CoreML training.

Features

  • Species Management: Import species lists via CSV or JSON, search and filter by genus or image status
  • Multi-Source Scraping: iNaturalist/GBIF, Flickr, Wikimedia Commons, Trefle.io
  • Image Quality Pipeline: Automatic deduplication, blur detection, resizing
  • License Filtering: Only collects commercially safe CC0/CC-BY licensed images
  • Export for CoreML: Train/test split, Create ML-compatible folder structure
  • Real-time Dashboard: Progress tracking, statistics, job monitoring
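The license-filtering rule above (keep only CC0 and CC-BY images) can be sketched as a simple allowlist check. This is an illustration of the rule, not the scraper's actual code; the normalization and function name are assumptions.

```python
# Commercially safe licenses per the feature list above (illustrative).
ALLOWED_LICENSES = {"cc0", "cc-by"}

def is_license_allowed(license_code: str) -> bool:
    """Normalize a license code and check it against the allowlist.

    Rejects non-commercial variants such as CC-BY-NC.
    """
    normalized = license_code.strip().lower().replace("_", "-")
    return normalized in ALLOWED_LICENSES
```

Note that stricter variants like CC-BY-NC or CC-BY-SA fail the check because the whole normalized code must match exactly.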

Quick Start

# Clone the repository, then build and start the stack
cd PlantGuideScraper
docker-compose up --build

# Access the UI
open http://localhost

Unraid Deployment

Setup

  1. Copy the project to your Unraid server:

    scp -r PlantGuideScraper root@YOUR_UNRAID_IP:/mnt/user/appdata/PlantGuideScraper
    
  2. SSH into Unraid and create data directories:

    ssh root@YOUR_UNRAID_IP
    mkdir -p /mnt/user/appdata/PlantGuideScraper/{database,images,exports,redis}
    
  3. Install Docker Compose Manager from Community Applications

  4. In Unraid: Docker → Compose → Add New Stack

    • Path: /mnt/user/appdata/PlantGuideScraper/docker-compose.unraid.yml
    • Click Compose Up
  5. Access at http://YOUR_UNRAID_IP:8580

Configurable Paths

Edit docker-compose.unraid.yml to customize where data is stored. Look for these lines in both backend and celery services:

# === CONFIGURABLE DATA PATHS ===
- /mnt/user/appdata/PlantGuideScraper/database:/data/db    # DATABASE_PATH
- /mnt/user/appdata/PlantGuideScraper/images:/data/images  # IMAGES_PATH
- /mnt/user/appdata/PlantGuideScraper/exports:/data/exports # EXPORTS_PATH

Path           Description                         Default
-------------  ----------------------------------  --------------------------------------------
DATABASE_PATH  SQLite database file                /mnt/user/appdata/PlantGuideScraper/database
IMAGES_PATH    Downloaded images (can be 100 GB+)  /mnt/user/appdata/PlantGuideScraper/images
EXPORTS_PATH   Generated export zip files          /mnt/user/appdata/PlantGuideScraper/exports

Example: Store images on a separate share:

- /mnt/user/data/PlantImages:/data/images  # IMAGES_PATH

Important: Keep paths identical in both backend and celery services.

Configuration

  1. Configure API keys in Settings

  2. Import species list (see Import Documentation below)

  3. Select species and start scraping

Import Documentation

CSV Import

Import species from a CSV file with the following columns:

Column           Required  Description
---------------  --------  ---------------------------------------------------
scientific_name  Yes       Binomial name (e.g., "Monstera deliciosa")
common_name      No        Common name (e.g., "Swiss Cheese Plant")
genus            No        Auto-extracted from scientific_name if not provided
family           No        Plant family (e.g., "Araceae")

Example CSV:

scientific_name,common_name,genus,family
Monstera deliciosa,Swiss Cheese Plant,Monstera,Araceae
Philodendron hederaceum,Heartleaf Philodendron,Philodendron,Araceae
Epipremnum aureum,Golden Pothos,Epipremnum,Araceae
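
A pre-flight check of a CSV file against the columns above can be sketched as follows. The column names come from the table; the validator itself is illustrative and is not the app's actual import code.

```python
import csv
import io

REQUIRED_COLUMNS = {"scientific_name"}  # per the column table above

def validate_species_csv(text: str) -> list:
    """Check required columns and return cleaned row dicts.

    Blank rows are dropped; genus defaults to the first word of the
    binomial name when the column is absent or empty.
    """
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    rows = []
    for row in reader:
        name = (row.get("scientific_name") or "").strip()
        if not name:
            continue  # skip rows without a scientific name
        row["genus"] = (row.get("genus") or "").strip() or name.split()[0]
        rows.append(row)
    return rows
```

Running this against the example CSV above would yield three rows, each with genus filled in.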

JSON Import

Import species from a JSON file with the following structure:

{
  "plants": [
    {
      "scientific_name": "Monstera deliciosa",
      "common_names": ["Swiss Cheese Plant", "Split-leaf Philodendron"],
      "family": "Araceae"
    },
    {
      "scientific_name": "Philodendron hederaceum",
      "common_names": ["Heartleaf Philodendron"],
      "family": "Araceae"
    }
  ]
}

Field            Required  Description
---------------  --------  ------------------------------------------
scientific_name  Yes       Binomial name
common_names     No        Array of common names (first one is used)
family           No        Plant family

Notes:

  • Genus is automatically extracted from the first word of scientific_name
  • Duplicate species (by scientific_name) are skipped
  • The included houseplants_list.json contains 2,278 houseplant species
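
The import rules in the notes above (genus from the first word, duplicates skipped by scientific_name, first common name used) can be sketched as a pure function. This is an illustration of the documented behavior, not the app's implementation; all names here are hypothetical.

```python
def import_plants(plants: list, existing_names: set) -> tuple:
    """Apply the documented JSON import rules.

    Returns (imported_records, skipped_count). Duplicates are detected
    by scientific_name, both against existing species and within the
    input itself.
    """
    records, skipped = [], 0
    seen = set(existing_names)
    for plant in plants:
        name = plant["scientific_name"].strip()
        if name in seen:
            skipped += 1  # duplicate species are skipped
            continue
        seen.add(name)
        common_names = plant.get("common_names") or []
        records.append({
            "scientific_name": name,
            "genus": name.split()[0],  # first word of the binomial name
            "common_name": common_names[0] if common_names else None,
            "family": plant.get("family"),
        })
    return records, skipped
```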

API Endpoints

# Import CSV
curl -X POST http://localhost/api/species/import \
  -F "file=@species.csv"

# Import JSON
curl -X POST http://localhost/api/species/import-json \
  -F "file=@plants.json"

Response:

{
  "imported": 150,
  "skipped": 5,
  "errors": []
}
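
A script driving these endpoints can consume the response shape shown above; here is a minimal sketch assuming only the three documented fields (the helper function is hypothetical).

```python
def summarize_import(resp: dict) -> str:
    """Turn the import endpoint's JSON response into a one-line summary."""
    imported = resp.get("imported", 0)
    skipped = resp.get("skipped", 0)
    errors = resp.get("errors", [])
    status = "ok" if not errors else f"{len(errors)} error(s)"
    return f"imported {imported}, skipped {skipped}, {status}"
```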

Architecture

┌─────────────┐     ┌─────────────────┐     ┌─────────────┐
│   React     │────▶│  FastAPI        │────▶│   Celery    │
│   Frontend  │     │  Backend        │     │   Workers   │
└─────────────┘     └─────────────────┘     └─────────────┘
                           │                       │
                           ▼                       ▼
                   ┌─────────────┐         ┌─────────────┐
                   │   SQLite    │         │   Redis     │
                   │   Database  │         │   Queue     │
                   └─────────────┘         └─────────────┘

Export Format

Exports are Create ML-compatible:

export.zip/
├── Training/
│   ├── Monstera_deliciosa/
│   │   ├── img_00001.jpg
│   │   └── ...
│   └── ...
└── Testing/
    ├── Monstera_deliciosa/
    └── ...
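
A deterministic per-species split into the Training/ and Testing/ folders above might look like the sketch below. The 80/20 ratio, seed, and function name are assumptions for illustration; the app's actual split settings may differ.

```python
import random

def train_test_split(image_ids: list, test_fraction: float = 0.2,
                     seed: int = 42) -> tuple:
    """Shuffle one species' image ids and split them into
    (training, testing) lists. A fixed seed keeps repeated exports
    reproducible; at least one image always lands in the test set.
    """
    rng = random.Random(seed)
    shuffled = list(image_ids)
    rng.shuffle(shuffled)
    cut = max(1, int(len(shuffled) * test_fraction))
    return shuffled[cut:], shuffled[:cut]
```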

Data Storage

With the default docker-compose setup, all data is stored in the ./data directory:

data/
├── db/
│   └── plants.sqlite    # SQLite database
├── images/              # Downloaded images
│   └── {species_id}/
│       └── {image_id}.jpg
└── exports/             # Generated export archives
    └── {export_id}.zip

API Documentation

Full API docs available at http://localhost/api/docs

License

MIT
