PlantGuideScraper

Web-based interface for managing a multi-source houseplant image scraping pipeline. Collects images from iNaturalist, Flickr, Wikimedia Commons, and Trefle.io to build datasets for CoreML training.

Features

  • Species Management: Import species lists via CSV or JSON, search and filter by genus or image status
  • Multi-Source Scraping: iNaturalist/GBIF, Flickr, Wikimedia Commons, Trefle.io
  • Image Quality Pipeline: Automatic deduplication, blur detection, resizing
  • License Filtering: Only collects commercially safe CC0/CC-BY licensed images
  • Export for CoreML: Train/test split, Create ML-compatible folder structure
  • Real-time Dashboard: Progress tracking, statistics, job monitoring
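The license-filtering rule above (keep only CC0 and CC-BY images) can be sketched as a simple allowlist check. This is an illustration of the rule, not the scraper's actual code; the normalization and function name are assumptions.

```python
# Commercially safe licenses per the feature list above (illustrative).
ALLOWED_LICENSES = {"cc0", "cc-by"}

def is_license_allowed(license_code: str) -> bool:
    """Normalize a license code and check it against the allowlist.

    Rejects non-commercial variants such as CC-BY-NC.
    """
    normalized = license_code.strip().lower().replace("_", "-")
    return normalized in ALLOWED_LICENSES
```

Note that stricter variants like CC-BY-NC or CC-BY-SA fail the check because the whole normalized code must match exactly.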

Quick Start

# Clone the repository, then build and start the stack
cd PlantGuideScraper
docker-compose up --build

# Access the UI
open http://localhost

Unraid Deployment

Setup

  1. Copy the project to your Unraid server:

    scp -r PlantGuideScraper root@YOUR_UNRAID_IP:/mnt/user/appdata/PlantGuideScraper
    
  2. SSH into Unraid and create data directories:

    ssh root@YOUR_UNRAID_IP
    mkdir -p /mnt/user/appdata/PlantGuideScraper/{database,images,exports,redis}
    
  3. Install Docker Compose Manager from Community Applications

  4. In Unraid: Docker → Compose → Add New Stack

    • Path: /mnt/user/appdata/PlantGuideScraper/docker-compose.unraid.yml
    • Click Compose Up
  5. Access at http://YOUR_UNRAID_IP:8580

Configurable Paths

Edit docker-compose.unraid.yml to customize where data is stored. Look for these lines in both backend and celery services:

# === CONFIGURABLE DATA PATHS ===
- /mnt/user/appdata/PlantGuideScraper/database:/data/db    # DATABASE_PATH
- /mnt/user/appdata/PlantGuideScraper/images:/data/images  # IMAGES_PATH
- /mnt/user/appdata/PlantGuideScraper/exports:/data/exports # EXPORTS_PATH

Path           Description                         Default
-------------  ----------------------------------  --------------------------------------------
DATABASE_PATH  SQLite database file                /mnt/user/appdata/PlantGuideScraper/database
IMAGES_PATH    Downloaded images (can be 100 GB+)  /mnt/user/appdata/PlantGuideScraper/images
EXPORTS_PATH   Generated export zip files          /mnt/user/appdata/PlantGuideScraper/exports

Example: Store images on a separate share:

- /mnt/user/data/PlantImages:/data/images  # IMAGES_PATH

Important: Keep paths identical in both backend and celery services.

Configuration

  1. Configure API keys in Settings

  2. Import species list (see Import Documentation below)

  3. Select species and start scraping

Import Documentation

CSV Import

Import species from a CSV file with the following columns:

Column           Required  Description
---------------  --------  ---------------------------------------------------
scientific_name  Yes       Binomial name (e.g., "Monstera deliciosa")
common_name      No        Common name (e.g., "Swiss Cheese Plant")
genus            No        Auto-extracted from scientific_name if not provided
family           No        Plant family (e.g., "Araceae")

Example CSV:

scientific_name,common_name,genus,family
Monstera deliciosa,Swiss Cheese Plant,Monstera,Araceae
Philodendron hederaceum,Heartleaf Philodendron,Philodendron,Araceae
Epipremnum aureum,Golden Pothos,Epipremnum,Araceae
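
A pre-flight check of a CSV file against the columns above can be sketched as follows. The column names come from the table; the validator itself is illustrative and is not the app's actual import code.

```python
import csv
import io

REQUIRED_COLUMNS = {"scientific_name"}  # per the column table above

def validate_species_csv(text: str) -> list:
    """Check required columns and return cleaned row dicts.

    Blank rows are dropped; genus defaults to the first word of the
    binomial name when the column is absent or empty.
    """
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    rows = []
    for row in reader:
        name = (row.get("scientific_name") or "").strip()
        if not name:
            continue  # skip rows without a scientific name
        row["genus"] = (row.get("genus") or "").strip() or name.split()[0]
        rows.append(row)
    return rows
```

Running this against the example CSV above would yield three rows, each with genus filled in.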

JSON Import

Import species from a JSON file with the following structure:

{
  "plants": [
    {
      "scientific_name": "Monstera deliciosa",
      "common_names": ["Swiss Cheese Plant", "Split-leaf Philodendron"],
      "family": "Araceae"
    },
    {
      "scientific_name": "Philodendron hederaceum",
      "common_names": ["Heartleaf Philodendron"],
      "family": "Araceae"
    }
  ]
}

Field            Required  Description
---------------  --------  ------------------------------------------
scientific_name  Yes       Binomial name
common_names     No        Array of common names (first one is used)
family           No        Plant family

Notes:

  • Genus is automatically extracted from the first word of scientific_name
  • Duplicate species (by scientific_name) are skipped
  • The included houseplants_list.json contains 2,278 houseplant species
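
The import rules in the notes above (genus from the first word, duplicates skipped by scientific_name, first common name used) can be sketched as a pure function. This is an illustration of the documented behavior, not the app's implementation; all names here are hypothetical.

```python
def import_plants(plants: list, existing_names: set) -> tuple:
    """Apply the documented JSON import rules.

    Returns (imported_records, skipped_count). Duplicates are detected
    by scientific_name, both against existing species and within the
    input itself.
    """
    records, skipped = [], 0
    seen = set(existing_names)
    for plant in plants:
        name = plant["scientific_name"].strip()
        if name in seen:
            skipped += 1  # duplicate species are skipped
            continue
        seen.add(name)
        common_names = plant.get("common_names") or []
        records.append({
            "scientific_name": name,
            "genus": name.split()[0],  # first word of the binomial name
            "common_name": common_names[0] if common_names else None,
            "family": plant.get("family"),
        })
    return records, skipped
```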

API Endpoints

# Import CSV
curl -X POST http://localhost/api/species/import \
  -F "file=@species.csv"

# Import JSON
curl -X POST http://localhost/api/species/import-json \
  -F "file=@plants.json"

Response:

{
  "imported": 150,
  "skipped": 5,
  "errors": []
}
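
A script driving these endpoints can consume the response shape shown above; here is a minimal sketch assuming only the three documented fields (the helper function is hypothetical).

```python
def summarize_import(resp: dict) -> str:
    """Turn the import endpoint's JSON response into a one-line summary."""
    imported = resp.get("imported", 0)
    skipped = resp.get("skipped", 0)
    errors = resp.get("errors", [])
    status = "ok" if not errors else f"{len(errors)} error(s)"
    return f"imported {imported}, skipped {skipped}, {status}"
```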

Architecture

┌─────────────┐     ┌─────────────────┐     ┌─────────────┐
│   React     │────▶│  FastAPI        │────▶│   Celery    │
│   Frontend  │     │  Backend        │     │   Workers   │
└─────────────┘     └─────────────────┘     └─────────────┘
                           │                       │
                           ▼                       ▼
                   ┌─────────────┐         ┌─────────────┐
                   │   SQLite    │         │   Redis     │
                   │   Database  │         │   Queue     │
                   └─────────────┘         └─────────────┘

Export Format

Exports are Create ML-compatible:

export.zip/
├── Training/
│   ├── Monstera_deliciosa/
│   │   ├── img_00001.jpg
│   │   └── ...
│   └── ...
└── Testing/
    ├── Monstera_deliciosa/
    └── ...
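
A deterministic per-species split into the Training/ and Testing/ folders above might look like the sketch below. The 80/20 ratio, seed, and function name are assumptions for illustration; the app's actual split settings may differ.

```python
import random

def train_test_split(image_ids: list, test_fraction: float = 0.2,
                     seed: int = 42) -> tuple:
    """Shuffle one species' image ids and split them into
    (training, testing) lists. A fixed seed keeps repeated exports
    reproducible; at least one image always lands in the test set.
    """
    rng = random.Random(seed)
    shuffled = list(image_ids)
    rng.shuffle(shuffled)
    cut = max(1, int(len(shuffled) * test_fraction))
    return shuffled[cut:], shuffled[:cut]
```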

Data Storage

With the default docker-compose setup, all data is stored in the ./data directory:

data/
├── db/
│   └── plants.sqlite    # SQLite database
├── images/              # Downloaded images
│   └── {species_id}/
│       └── {image_id}.jpg
└── exports/             # Generated export archives
    └── {export_id}.zip

API Documentation

Full API docs available at http://localhost/api/docs

License

MIT
