# PlantGuideScraper
Web-based interface for managing a multi-source houseplant image scraping pipeline. Collects images from iNaturalist, Flickr, Wikimedia Commons, and Trefle.io to build datasets for CoreML training.
## Features
- **Species Management**: Import species lists via CSV or JSON, search and filter by genus or image status
- **Multi-Source Scraping**: iNaturalist/GBIF, Flickr, Wikimedia Commons, Trefle.io
- **Image Quality Pipeline**: Automatic deduplication, blur detection, resizing
- **License Filtering**: Collects only commercially safe CC0/CC-BY licensed images
- **Export for CoreML**: Train/test split, Create ML-compatible folder structure
- **Real-time Dashboard**: Progress tracking, statistics, job monitoring
## Quick Start
```bash
# Clone and start
cd PlantGuideScraper
docker-compose up --build
# Access the UI
open http://localhost
```
## Unraid Deployment
### Setup
1. Copy the project to your Unraid server:
   ```bash
   scp -r PlantGuideScraper root@YOUR_UNRAID_IP:/mnt/user/appdata/PlantGuideScraper
   ```
2. SSH into Unraid and create the data directories:
   ```bash
   ssh root@YOUR_UNRAID_IP
   mkdir -p /mnt/user/appdata/PlantGuideScraper/{database,images,exports,redis}
   ```
3. Install **Docker Compose Manager** from Community Applications
4. In Unraid: **Docker → Compose → Add New Stack**
   - Path: `/mnt/user/appdata/PlantGuideScraper/docker-compose.unraid.yml`
   - Click **Compose Up**
5. Access at `http://YOUR_UNRAID_IP:8580`
### Configurable Paths
Edit `docker-compose.unraid.yml` to customize where data is stored. Look for these lines in both `backend` and `celery` services:
```yaml
# === CONFIGURABLE DATA PATHS ===
- /mnt/user/appdata/PlantGuideScraper/database:/data/db # DATABASE_PATH
- /mnt/user/appdata/PlantGuideScraper/images:/data/images # IMAGES_PATH
- /mnt/user/appdata/PlantGuideScraper/exports:/data/exports # EXPORTS_PATH
```
| Variable | Description | Default host path |
|----------|-------------|-------------------|
| `DATABASE_PATH` | SQLite database directory | `/mnt/user/appdata/PlantGuideScraper/database` |
| `IMAGES_PATH` | Downloaded images (can exceed 100 GB) | `/mnt/user/appdata/PlantGuideScraper/images` |
| `EXPORTS_PATH` | Generated export zip files | `/mnt/user/appdata/PlantGuideScraper/exports` |

**Example: Store images on a separate share:**
```yaml
- /mnt/user/data/PlantImages:/data/images # IMAGES_PATH
```
**Important:** Keep paths identical in both `backend` and `celery` services.
## Configuration
1. Configure API keys in Settings:
   - **Flickr**: Get a key at https://www.flickr.com/services/api/
   - **Trefle**: Get a key at https://trefle.io/
   - **iNaturalist** and **Wikimedia Commons**: No key required
2. Import a species list (see Import Documentation below)
3. Select species and start scraping
## Import Documentation
### CSV Import
Import species from a CSV file with the following columns:
| Column | Required | Description |
|--------|----------|-------------|
| `scientific_name` | Yes | Binomial name (e.g., "Monstera deliciosa") |
| `common_name` | No | Common name (e.g., "Swiss Cheese Plant") |
| `genus` | No | Auto-extracted from scientific_name if not provided |
| `family` | No | Plant family (e.g., "Araceae") |

**Example CSV:**
```csv
scientific_name,common_name,genus,family
Monstera deliciosa,Swiss Cheese Plant,Monstera,Araceae
Philodendron hederaceum,Heartleaf Philodendron,Philodendron,Araceae
Epipremnum aureum,Golden Pothos,Epipremnum,Araceae
```
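Before importing a large list, it can help to sanity-check the file locally. A minimal Python sketch (not part of the project; it only mirrors the column rules documented above, including genus extraction from the binomial):

```python
import csv
import io


def check_species_csv(text: str) -> list[dict]:
    """Validate a species CSV against the documented columns.

    Only `scientific_name` is required; `genus` is filled from the
    first word of the binomial when missing or empty.
    """
    reader = csv.DictReader(io.StringIO(text))
    if "scientific_name" not in (reader.fieldnames or []):
        raise ValueError("CSV must have a scientific_name column")
    rows = []
    for row in reader:
        name = (row.get("scientific_name") or "").strip()
        if not name:
            continue  # skip blank lines
        if not row.get("genus"):
            row["genus"] = name.split()[0]
        rows.append(row)
    return rows


sample = """scientific_name,common_name
Monstera deliciosa,Swiss Cheese Plant
Epipremnum aureum,Golden Pothos
"""
print([r["genus"] for r in check_species_csv(sample)])  # → ['Monstera', 'Epipremnum']
```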
### JSON Import
Import species from a JSON file with the following structure:
```json
{
  "plants": [
    {
      "scientific_name": "Monstera deliciosa",
      "common_names": ["Swiss Cheese Plant", "Split-leaf Philodendron"],
      "family": "Araceae"
    },
    {
      "scientific_name": "Philodendron hederaceum",
      "common_names": ["Heartleaf Philodendron"],
      "family": "Araceae"
    }
  ]
}
```
| Field | Required | Description |
|-------|----------|-------------|
| `scientific_name` | Yes | Binomial name |
| `common_names` | No | Array of common names (the first one is used) |
| `family` | No | Plant family |

**Notes:**
- Genus is automatically extracted from the first word of `scientific_name`
- Duplicate species (by scientific_name) are skipped
- The included `houseplants_list.json` contains 2,278 houseplant species
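The duplicate-skipping rule can be illustrated with a short sketch (illustrative only; it assumes a case-insensitive match on `scientific_name`, which may differ from the importer's exact comparison):

```python
def dedupe_species(plants: list[dict]) -> tuple[list[dict], int]:
    """Keep the first entry per scientific_name; count the rest as skipped."""
    seen: set[str] = set()
    kept, skipped = [], 0
    for plant in plants:
        key = plant["scientific_name"].strip().lower()
        if key in seen:
            skipped += 1
        else:
            seen.add(key)
            kept.append(plant)
    return kept, skipped


plants = [
    {"scientific_name": "Monstera deliciosa"},
    {"scientific_name": "monstera deliciosa"},  # duplicate, skipped
    {"scientific_name": "Epipremnum aureum"},
]
kept, skipped = dedupe_species(plants)
print(len(kept), skipped)  # → 2 1
```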
### API Endpoints
```bash
# Import CSV
curl -X POST http://localhost/api/species/import \
  -F "file=@species.csv"

# Import JSON
curl -X POST http://localhost/api/species/import-json \
  -F "file=@plants.json"
```
**Response:**
```json
{
  "imported": 150,
  "skipped": 5,
  "errors": []
}
```
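If you prefer calling the import endpoints from Python rather than curl, the multipart upload can be built with the standard library alone. A sketch (the helper names here are illustrative, not part of the project):

```python
import io
import json
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Build a multipart/form-data body equivalent to curl's -F "file=@...".

    Returns (body, content_type).
    """
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    buf.write(f"--{boundary}\r\n".encode())
    buf.write(
        f'Content-Disposition: form-data; name="{field}"; '
        f'filename="{filename}"\r\n\r\n'.encode()
    )
    buf.write(data)
    buf.write(f"\r\n--{boundary}--\r\n".encode())
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"


def import_species_csv(base_url: str, csv_path: str) -> dict:
    """POST a CSV to /api/species/import and return the parsed JSON response."""
    with open(csv_path, "rb") as f:
        body, content_type = build_multipart("file", csv_path, f.read())
    req = urllib.request.Request(
        f"{base_url}/api/species/import",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```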
## Architecture
```
┌─────────────┐     ┌─────────────────┐     ┌─────────────┐
│   React     │────▶│    FastAPI      │────▶│   Celery    │
│  Frontend   │     │    Backend      │     │   Workers   │
└─────────────┘     └─────────────────┘     └─────────────┘
                             │                     │
                             ▼                     ▼
                     ┌─────────────┐       ┌─────────────┐
                     │   SQLite    │       │    Redis    │
                     │  Database   │       │    Queue    │
                     └─────────────┘       └─────────────┘
```
## Export Format
Exports are Create ML-compatible:
```
export.zip/
├── Training/
│   ├── Monstera_deliciosa/
│   │   ├── img_00001.jpg
│   │   └── ...
│   └── ...
└── Testing/
    ├── Monstera_deliciosa/
    └── ...
```
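To sanity-check an export before feeding it to Create ML, you can count images per split and class. A small sketch assuming exactly the `Training`/`Testing` layout shown above:

```python
import zipfile
from collections import Counter


def summarize_export(zip_path: str) -> Counter:
    """Return image counts keyed by (split, class) for a Create ML-style export."""
    counts: Counter = Counter()
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            parts = name.split("/")
            # e.g. "Training/Monstera_deliciosa/img_00001.jpg"
            if len(parts) == 3 and parts[0] in ("Training", "Testing") and parts[2]:
                counts[(parts[0], parts[1])] += 1
    return counts
```

Classes with zero testing images, or a badly skewed train/test ratio, are easy to spot in the resulting counter before training.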
## Data Storage
In the default `docker-compose` setup, all data is stored in the `./data` directory:
```
data/
├── db/
│   └── plants.sqlite        # SQLite database
├── images/                  # Downloaded images
│   └── {species_id}/
│       └── {image_id}.jpg
└── exports/                 # Generated export archives
    └── {export_id}.zip
```
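Since the image store can grow past 100 GB, a quick per-species tally of the `images/` tree is handy. A sketch assuming the `{species_id}/{image_id}.jpg` layout above:

```python
from pathlib import Path


def image_counts(images_root: str) -> dict[str, int]:
    """Map each species directory name to its number of .jpg files."""
    counts: dict[str, int] = {}
    for species_dir in sorted(Path(images_root).iterdir()):
        if species_dir.is_dir():
            counts[species_dir.name] = sum(
                1 for p in species_dir.iterdir() if p.suffix == ".jpg"
            )
    return counts
```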
## API Documentation
Full API docs available at http://localhost/api/docs
## License
MIT