# PlantGuideScraper

Web-based interface for managing a multi-source houseplant image scraping pipeline. Collects images from iNaturalist, Flickr, Wikimedia Commons, and Trefle.io to build datasets for Core ML training.

## Features

- **Species Management**: Import species lists via CSV or JSON; search and filter by genus or image status
- **Multi-Source Scraping**: iNaturalist/GBIF, Flickr, Wikimedia Commons, Trefle.io
- **Image Quality Pipeline**: Automatic deduplication, blur detection, and resizing
- **License Filtering**: Collects only commercially safe CC0/CC-BY licensed images
- **Export for Core ML**: Train/test split and a Create ML-compatible folder structure
- **Real-time Dashboard**: Progress tracking, statistics, and job monitoring

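The deduplication stage of the quality pipeline can be sketched roughly as follows. This is a minimal illustration using exact content hashes; the actual pipeline may use perceptual hashing to also catch resized or re-encoded near-duplicates, and the function name here is illustrative:

```python
import hashlib
from pathlib import Path

def dedupe_images(image_dir: str) -> list[Path]:
    """Remove byte-identical duplicate images, keeping the first copy seen.

    Illustrative sketch only: the real pipeline may use perceptual hashing
    rather than exact SHA-256 digests.
    """
    seen: dict[str, Path] = {}
    removed: list[Path] = []
    for path in sorted(Path(image_dir).glob("**/*.jpg")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            path.unlink()          # duplicate of an earlier file
            removed.append(path)
        else:
            seen[digest] = path
    return removed
```
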
## Quick Start

```bash
# Clone and start
cd PlantGuideScraper
docker-compose up --build

# Access the UI
open http://localhost
```

## Unraid Deployment

### Setup

1. Copy the project to your Unraid server:

   ```bash
   scp -r PlantGuideScraper root@YOUR_UNRAID_IP:/mnt/user/appdata/PlantGuideScraper
   ```

2. SSH into Unraid and create the data directories:

   ```bash
   ssh root@YOUR_UNRAID_IP
   mkdir -p /mnt/user/appdata/PlantGuideScraper/{database,images,exports,redis}
   ```

3. Install **Docker Compose Manager** from Community Applications.

4. In Unraid, go to **Docker → Compose → Add New Stack**:
   - Path: `/mnt/user/appdata/PlantGuideScraper/docker-compose.unraid.yml`
   - Click **Compose Up**

5. Access the UI at `http://YOUR_UNRAID_IP:8580`.

### Configurable Paths

Edit `docker-compose.unraid.yml` to customize where data is stored. Look for these lines in both the `backend` and `celery` services:

```yaml
# === CONFIGURABLE DATA PATHS ===
- /mnt/user/appdata/PlantGuideScraper/database:/data/db       # DATABASE_PATH
- /mnt/user/appdata/PlantGuideScraper/images:/data/images     # IMAGES_PATH
- /mnt/user/appdata/PlantGuideScraper/exports:/data/exports   # EXPORTS_PATH
```

| Path | Description | Default |
|------|-------------|---------|
| DATABASE_PATH | SQLite database file | `/mnt/user/appdata/PlantGuideScraper/database` |
| IMAGES_PATH | Downloaded images (can be 100 GB+) | `/mnt/user/appdata/PlantGuideScraper/images` |
| EXPORTS_PATH | Generated export zip files | `/mnt/user/appdata/PlantGuideScraper/exports` |

**Example: store images on a separate share:**

```yaml
- /mnt/user/data/PlantImages:/data/images  # IMAGES_PATH
```

**Important:** Keep the paths identical in both the `backend` and `celery` services.

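One way to keep the two volume lists from drifting apart is a YAML anchor combined with a Compose extension field. This is a sketch, not the shipped compose file; the `x-data-volumes` key name is our choice, while the service names match the file above:

```yaml
# Define the volume list once with a YAML anchor…
x-data-volumes: &data-volumes
  - /mnt/user/appdata/PlantGuideScraper/database:/data/db       # DATABASE_PATH
  - /mnt/user/appdata/PlantGuideScraper/images:/data/images     # IMAGES_PATH
  - /mnt/user/appdata/PlantGuideScraper/exports:/data/exports   # EXPORTS_PATH

# …then reference it from both services so they can never disagree.
services:
  backend:
    volumes: *data-volumes
  celery:
    volumes: *data-volumes
```
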
## Configuration

1. Configure API keys in Settings:
   - **Flickr**: Get a key at https://www.flickr.com/services/api/
   - **Trefle**: Get a key at https://trefle.io/
   - iNaturalist and Wikimedia Commons do not require keys

2. Import a species list (see Import Documentation below)

3. Select species and start scraping

## Import Documentation

### CSV Import

Import species from a CSV file with the following columns:

| Column | Required | Description |
|--------|----------|-------------|
| `scientific_name` | Yes | Binomial name (e.g., "Monstera deliciosa") |
| `common_name` | No | Common name (e.g., "Swiss Cheese Plant") |
| `genus` | No | Auto-extracted from `scientific_name` if not provided |
| `family` | No | Plant family (e.g., "Araceae") |

**Example CSV:**

```csv
scientific_name,common_name,genus,family
Monstera deliciosa,Swiss Cheese Plant,Monstera,Araceae
Philodendron hederaceum,Heartleaf Philodendron,Philodendron,Araceae
Epipremnum aureum,Golden Pothos,Epipremnum,Araceae
```

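The CSV rules above (required `scientific_name`, optional columns, genus falling back to the first word of the binomial) can be sketched like this. The function name is illustrative, not the app's actual API:

```python
import csv
import io

def parse_species_csv(text: str) -> list[dict]:
    """Sketch of the CSV import rules described above (name illustrative)."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        name = row["scientific_name"].strip()
        rows.append({
            "scientific_name": name,
            "common_name": row.get("common_name") or None,
            # genus falls back to the first word of the binomial
            "genus": row.get("genus") or name.split()[0],
            "family": row.get("family") or None,
        })
    return rows
```
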
### JSON Import

Import species from a JSON file with the following structure:

```json
{
  "plants": [
    {
      "scientific_name": "Monstera deliciosa",
      "common_names": ["Swiss Cheese Plant", "Split-leaf Philodendron"],
      "family": "Araceae"
    },
    {
      "scientific_name": "Philodendron hederaceum",
      "common_names": ["Heartleaf Philodendron"],
      "family": "Araceae"
    }
  ]
}
```

| Field | Required | Description |
|-------|----------|-------------|
| `scientific_name` | Yes | Binomial name |
| `common_names` | No | Array of common names (the first one is used) |
| `family` | No | Plant family |

**Notes:**

- Genus is automatically extracted from the first word of `scientific_name`
- Duplicate species (matched by `scientific_name`) are skipped
- The included `houseplants_list.json` contains 2,278 houseplant species

### API Endpoints

```bash
# Import CSV
curl -X POST http://localhost/api/species/import \
  -F "file=@species.csv"

# Import JSON
curl -X POST http://localhost/api/species/import-json \
  -F "file=@plants.json"
```

**Response:**

```json
{
  "imported": 150,
  "skipped": 5,
  "errors": []
}
```

## Architecture

```
┌─────────────┐     ┌─────────────────┐     ┌─────────────┐
│   React     │────▶│     FastAPI     │────▶│   Celery    │
│  Frontend   │     │     Backend     │     │   Workers   │
└─────────────┘     └─────────────────┘     └─────────────┘
                             │                      │
                             ▼                      ▼
                      ┌─────────────┐        ┌─────────────┐
                      │   SQLite    │        │    Redis    │
                      │  Database   │        │    Queue    │
                      └─────────────┘        └─────────────┘
```

## Export Format

Exports use a Create ML-compatible layout:

```
export.zip/
├── Training/
│   ├── Monstera_deliciosa/
│   │   ├── img_00001.jpg
│   │   └── ...
│   └── ...
└── Testing/
    ├── Monstera_deliciosa/
    └── ...
```

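Building that layout amounts to shuffling each species' images and copying them into per-class `Training/` and `Testing/` folders. The sketch below shows the idea; the function name, the 20% test fraction, and the fixed seed are assumptions, not the app's actual defaults:

```python
import random
import shutil
from pathlib import Path

def export_split(images_root: str, out_root: str,
                 test_fraction: float = 0.2, seed: int = 42) -> None:
    """Sketch of producing the Create ML folder layout shown above.

    Illustrative only: split ratio and function name are assumptions.
    """
    rng = random.Random(seed)            # deterministic split
    out = Path(out_root)
    for species_dir in sorted(Path(images_root).iterdir()):
        if not species_dir.is_dir():
            continue
        images = sorted(species_dir.glob("*.jpg"))
        rng.shuffle(images)
        # reserve at least one image per species for testing
        n_test = max(1, int(len(images) * test_fraction)) if images else 0
        for i, img in enumerate(images):
            split = "Testing" if i < n_test else "Training"
            dest = out / split / species_dir.name
            dest.mkdir(parents=True, exist_ok=True)
            shutil.copy2(img, dest / img.name)
```
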
## Data Storage

All data is stored in the `./data` directory:

```
data/
├── db/
│   └── plants.sqlite        # SQLite database
├── images/                  # Downloaded images
│   └── {species_id}/
│       └── {image_id}.jpg
└── exports/                 # Generated export archives
    └── {export_id}.zip
```

## API Documentation

Full API docs are available at http://localhost/api/docs

## License

MIT