# PlantGuideScraper
Web-based interface for managing a multi-source houseplant image scraping pipeline. Collects images from iNaturalist, Flickr, Wikimedia Commons, and Trefle.io to build datasets for CoreML training.
## Features
- **Species Management**: Import species lists via CSV or JSON, search and filter by genus or image status
- **Multi-Source Scraping**: iNaturalist/GBIF, Flickr, Wikimedia Commons, Trefle.io
- **Image Quality Pipeline**: Automatic deduplication, blur detection, resizing
- **License Filtering**: Collects only commercially safe CC0/CC-BY licensed images
- **Export for CoreML**: Train/test split, Create ML-compatible folder structure
- **Real-time Dashboard**: Progress tracking, statistics, job monitoring
## Quick Start
```bash
# Clone and start
cd PlantGuideScraper
docker-compose up --build
# Access the UI
open http://localhost
```
## Unraid Deployment
### Setup
1. Copy the project to your Unraid server:
   ```bash
   scp -r PlantGuideScraper root@YOUR_UNRAID_IP:/mnt/user/appdata/PlantGuideScraper
   ```
2. SSH into Unraid and create the data directories:
   ```bash
   ssh root@YOUR_UNRAID_IP
   mkdir -p /mnt/user/appdata/PlantGuideScraper/{database,images,exports,redis}
   ```
3. Install **Docker Compose Manager** from Community Applications
4. In Unraid: **Docker → Compose → Add New Stack**
   - Path: `/mnt/user/appdata/PlantGuideScraper/docker-compose.unraid.yml`
   - Click **Compose Up**
5. Access at `http://YOUR_UNRAID_IP:8580`
### Configurable Paths
Edit `docker-compose.unraid.yml` to customize where data is stored. Look for these lines in both `backend` and `celery` services:
```yaml
# === CONFIGURABLE DATA PATHS ===
- /mnt/user/appdata/PlantGuideScraper/database:/data/db # DATABASE_PATH
- /mnt/user/appdata/PlantGuideScraper/images:/data/images # IMAGES_PATH
- /mnt/user/appdata/PlantGuideScraper/exports:/data/exports # EXPORTS_PATH
```
| Variable | Description | Default host path |
|----------|-------------|-------------------|
| `DATABASE_PATH` | SQLite database directory | `/mnt/user/appdata/PlantGuideScraper/database` |
| `IMAGES_PATH` | Downloaded images (can exceed 100 GB) | `/mnt/user/appdata/PlantGuideScraper/images` |
| `EXPORTS_PATH` | Generated export zip files | `/mnt/user/appdata/PlantGuideScraper/exports` |

**Example: Store images on a separate share:**
```yaml
- /mnt/user/data/PlantImages:/data/images # IMAGES_PATH
```
**Important:** Keep paths identical in both `backend` and `celery` services.
## Configuration
1. Configure API keys in Settings:
   - **Flickr**: Get a key at https://www.flickr.com/services/api/
   - **Trefle**: Get a key at https://trefle.io/
   - **iNaturalist** and **Wikimedia Commons**: No key required
2. Import a species list (see Import Documentation below)
3. Select species and start scraping
## Import Documentation
### CSV Import
Import species from a CSV file with the following columns:
| Column | Required | Description |
|--------|----------|-------------|
| `scientific_name` | Yes | Binomial name (e.g., "Monstera deliciosa") |
| `common_name` | No | Common name (e.g., "Swiss Cheese Plant") |
| `genus` | No | Auto-extracted from scientific_name if not provided |
| `family` | No | Plant family (e.g., "Araceae") |

**Example CSV:**
```csv
scientific_name,common_name,genus,family
Monstera deliciosa,Swiss Cheese Plant,Monstera,Araceae
Philodendron hederaceum,Heartleaf Philodendron,Philodendron,Araceae
Epipremnum aureum,Golden Pothos,Epipremnum,Araceae
```
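Before importing a large list, it can help to sanity-check the file locally. A minimal Python sketch (not part of the project; it only mirrors the column rules documented above, including genus extraction from the binomial):

```python
import csv
import io


def check_species_csv(text: str) -> list[dict]:
    """Validate a species CSV against the documented columns.

    Only `scientific_name` is required; `genus` is filled from the
    first word of the binomial when missing or empty.
    """
    reader = csv.DictReader(io.StringIO(text))
    if "scientific_name" not in (reader.fieldnames or []):
        raise ValueError("CSV must have a scientific_name column")
    rows = []
    for row in reader:
        name = (row.get("scientific_name") or "").strip()
        if not name:
            continue  # skip blank lines
        if not row.get("genus"):
            row["genus"] = name.split()[0]
        rows.append(row)
    return rows


sample = """scientific_name,common_name
Monstera deliciosa,Swiss Cheese Plant
Epipremnum aureum,Golden Pothos
"""
print([r["genus"] for r in check_species_csv(sample)])  # → ['Monstera', 'Epipremnum']
```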
### JSON Import
Import species from a JSON file with the following structure:
```json
{
  "plants": [
    {
      "scientific_name": "Monstera deliciosa",
      "common_names": ["Swiss Cheese Plant", "Split-leaf Philodendron"],
      "family": "Araceae"
    },
    {
      "scientific_name": "Philodendron hederaceum",
      "common_names": ["Heartleaf Philodendron"],
      "family": "Araceae"
    }
  ]
}
```
| Field | Required | Description |
|-------|----------|-------------|
| `scientific_name` | Yes | Binomial name |
| `common_names` | No | Array of common names (the first one is used) |
| `family` | No | Plant family |

**Notes:**
- Genus is automatically extracted from the first word of `scientific_name`
- Duplicate species (by scientific_name) are skipped
- The included `houseplants_list.json` contains 2,278 houseplant species
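The duplicate-skipping rule can be illustrated with a short sketch (illustrative only; it assumes a case-insensitive match on `scientific_name`, which may differ from the importer's exact comparison):

```python
def dedupe_species(plants: list[dict]) -> tuple[list[dict], int]:
    """Keep the first entry per scientific_name; count the rest as skipped."""
    seen: set[str] = set()
    kept, skipped = [], 0
    for plant in plants:
        key = plant["scientific_name"].strip().lower()
        if key in seen:
            skipped += 1
        else:
            seen.add(key)
            kept.append(plant)
    return kept, skipped


plants = [
    {"scientific_name": "Monstera deliciosa"},
    {"scientific_name": "monstera deliciosa"},  # duplicate, skipped
    {"scientific_name": "Epipremnum aureum"},
]
kept, skipped = dedupe_species(plants)
print(len(kept), skipped)  # → 2 1
```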
### API Endpoints
```bash
# Import CSV
curl -X POST http://localhost/api/species/import \
  -F "file=@species.csv"

# Import JSON
curl -X POST http://localhost/api/species/import-json \
  -F "file=@plants.json"
```
**Response:**
```json
{
  "imported": 150,
  "skipped": 5,
  "errors": []
}
```
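If you prefer calling the import endpoints from Python rather than curl, the multipart upload can be built with the standard library alone. A sketch (the helper names here are illustrative, not part of the project):

```python
import io
import json
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Build a multipart/form-data body equivalent to curl's -F "file=@...".

    Returns (body, content_type).
    """
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    buf.write(f"--{boundary}\r\n".encode())
    buf.write(
        f'Content-Disposition: form-data; name="{field}"; '
        f'filename="{filename}"\r\n\r\n'.encode()
    )
    buf.write(data)
    buf.write(f"\r\n--{boundary}--\r\n".encode())
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"


def import_species_csv(base_url: str, csv_path: str) -> dict:
    """POST a CSV to /api/species/import and return the parsed JSON response."""
    with open(csv_path, "rb") as f:
        body, content_type = build_multipart("file", csv_path, f.read())
    req = urllib.request.Request(
        f"{base_url}/api/species/import",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```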
## Architecture
```
┌─────────────┐     ┌─────────────────┐     ┌─────────────┐
│   React     │────▶│    FastAPI      │────▶│   Celery    │
│  Frontend   │     │    Backend      │     │   Workers   │
└─────────────┘     └─────────────────┘     └─────────────┘
                             │                     │
                             ▼                     ▼
                     ┌─────────────┐       ┌─────────────┐
                     │   SQLite    │       │    Redis    │
                     │  Database   │       │    Queue    │
                     └─────────────┘       └─────────────┘
```
## Export Format
Exports are Create ML-compatible:
```
export.zip/
├── Training/
│   ├── Monstera_deliciosa/
│   │   ├── img_00001.jpg
│   │   └── ...
│   └── ...
└── Testing/
    ├── Monstera_deliciosa/
    └── ...
```
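To sanity-check an export before feeding it to Create ML, you can count images per split and class. A small sketch assuming exactly the `Training`/`Testing` layout shown above:

```python
import zipfile
from collections import Counter


def summarize_export(zip_path: str) -> Counter:
    """Return image counts keyed by (split, class) for a Create ML-style export."""
    counts: Counter = Counter()
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            parts = name.split("/")
            # e.g. "Training/Monstera_deliciosa/img_00001.jpg"
            if len(parts) == 3 and parts[0] in ("Training", "Testing") and parts[2]:
                counts[(parts[0], parts[1])] += 1
    return counts
```

Classes with zero testing images, or a badly skewed train/test ratio, are easy to spot in the resulting counter before training.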
## Data Storage
In the default `docker-compose` setup, all data is stored in the `./data` directory:
```
data/
├── db/
│   └── plants.sqlite        # SQLite database
├── images/                  # Downloaded images
│   └── {species_id}/
│       └── {image_id}.jpg
└── exports/                 # Generated export archives
    └── {export_id}.zip
```
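Since the image store can grow past 100 GB, a quick per-species tally of the `images/` tree is handy. A sketch assuming the `{species_id}/{image_id}.jpg` layout above:

```python
from pathlib import Path


def image_counts(images_root: str) -> dict[str, int]:
    """Map each species directory name to its number of .jpg files."""
    counts: dict[str, int] = {}
    for species_dir in sorted(Path(images_root).iterdir()):
        if species_dir.is_dir():
            counts[species_dir.name] = sum(
                1 for p in species_dir.iterdir() if p.suffix == ".jpg"
            )
    return counts
```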
## API Documentation
Full API docs available at http://localhost/api/docs
## License
MIT