PlantGuideScraper/docs/master_plan.md
2026-04-12 09:54:27 -05:00

Houseplant Image Scraper - Master Plan

Overview

Web-based interface for managing a multi-source image scraping pipeline targeting 5-10K houseplant species with 1-5M total images. Runs on Unraid via Docker, exports datasets for CoreML training.


Requirements Summary

| Requirement | Value |
| --- | --- |
| Platform | Web app in Docker on Unraid |
| Sources | iNaturalist/GBIF, Flickr, Wikimedia Commons, Trefle, USDA PLANTS, EOL |
| API keys | Configurable per service |
| Species list | Manual import (CSV/paste) |
| Grouping | Species, genus, source, license (faceted) |
| Search/filter | Yes |
| Quality filter | Automatic (hash dedup, blur, size) |
| Progress | Real-time dashboard |
| Storage | /species_name/image.jpg + SQLite DB |
| Export | Filtered zip for CoreML, downloadable anytime |
| Auth | None (single user) |
| Deployment | Docker Compose |

Create ML Export Requirements

Per Apple's documentation:

  • Folder structure: /SpeciesName/image001.jpg (folder name = class label)
  • Train/Test split: 80/20 recommended, separate folders
  • Balance: Roughly equal images per class (avoid bias)
  • No metadata needed: Create ML uses folder names as labels

Export Format

```
dataset_export/
├── Training/
│   ├── Monstera_deliciosa/
│   │   ├── img001.jpg
│   │   └── ...
│   ├── Philodendron_hederaceum/
│   └── ...
└── Testing/
    ├── Monstera_deliciosa/
    └── ...
```
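The split itself is straightforward: shuffle each species folder deterministically and copy the first 80% into `Training/`. A minimal sketch (function name, `.jpg`-only glob, and fixed seed are illustrative choices, not the final implementation):

```python
import random
import shutil
from pathlib import Path

def split_dataset(source_dir: Path, dest_dir: Path,
                  train_ratio: float = 0.8, seed: int = 42) -> None:
    """Copy each /SpeciesName/ folder into Training/ and Testing/ subtrees."""
    rng = random.Random(seed)  # fixed seed -> reproducible splits across exports
    for species_dir in sorted(p for p in source_dir.iterdir() if p.is_dir()):
        images = sorted(species_dir.glob("*.jpg"))
        rng.shuffle(images)
        cut = int(len(images) * train_ratio)
        for subset, files in (("Training", images[:cut]), ("Testing", images[cut:])):
            out = dest_dir / subset / species_dir.name
            out.mkdir(parents=True, exist_ok=True)
            for f in files:
                shutil.copy2(f, out / f.name)  # preserves timestamps
```

Because the folder name is the class label, the species directory name is copied verbatim into both subtrees.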

Data Sources

| Source | API/Method | License Filter | Rate Limits | Notes |
| --- | --- | --- | --- | --- |
| iNaturalist/GBIF | Bulk DwC-A export + API | CC0, CC-BY | 1 req/sec, 10k/day, 5 GB/hr media | Best source: Research Grade = verified |
| Flickr | flickr.photos.search | license=4,9 (CC-BY, CC0) | 3600 req/hr | Good supplemental |
| Wikimedia Commons | MediaWiki API + pyWikiCommons | CC-BY, CC-BY-SA, PD | Generous | Category-based search |
| Trefle.io | REST API | Open source | Free tier | Species metadata + some images |
| USDA PLANTS | REST API | Public Domain | Generous | US-focused, limited images |
| Plant.id | REST API | Commercial | Paid | For validation, not scraping |
| Encyclopedia of Life | API | Mixed | Check each | Aggregator |
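A simple way to honor per-source limits like iNaturalist's 1 req/sec is a small blocking limiter shared by each scraper task. This sketch (class name and the injectable clock/sleep hooks are illustrative; they make the limiter testable without real waiting):

```python
import time

class RateLimiter:
    """Blocking limiter: allows at most `rate` calls per second."""

    def __init__(self, rate: float, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self._next_allowed = 0.0  # earliest monotonic time the next call may run

    def acquire(self) -> float:
        """Block until a request slot is free; return the wait that was applied."""
        now = self.clock()
        wait = max(0.0, self._next_allowed - now)
        if wait:
            self.sleep(wait)
        self._next_allowed = max(now, self._next_allowed) + self.interval
        return wait
```

Each source's `rate_limit_per_sec` from the `api_keys` table could seed one limiter per worker.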

Source References

Flickr License IDs

| ID | License |
| --- | --- |
| 0 | All Rights Reserved |
| 1 | CC BY-NC-SA 2.0 |
| 2 | CC BY-NC 2.0 |
| 3 | CC BY-NC-ND 2.0 |
| 4 | CC BY 2.0 (Commercial OK) |
| 5 | CC BY-SA 2.0 |
| 6 | CC BY-ND 2.0 |
| 7 | No known copyright restrictions |
| 8 | United States Government Work |
| 9 | Public Domain (CC0) |

For commercial use: Filter to license IDs 4, 7, 8, 9 only.
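As a sketch, a commercial-safe search request could be built like this. The parameter set is a plausible subset of the `flickr.photos.search` REST API; the exact `extras` the scraper ends up needing may differ:

```python
from urllib.parse import urlencode

FLICKR_REST = "https://api.flickr.com/services/rest/"
COMMERCIAL_LICENSES = "4,7,8,9"  # CC BY, no known restrictions, US Gov, CC0

def build_flickr_search(api_key: str, species: str, per_page: int = 500) -> str:
    """Build a flickr.photos.search URL restricted to commercial-friendly licenses."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "text": species,
        "license": COMMERCIAL_LICENSES,
        "content_type": 1,                      # photos only, no screenshots
        "extras": "license,url_o,owner_name",   # license + original URL + attribution
        "per_page": per_page,
        "format": "json",
        "nojsoncallback": 1,
    }
    return FLICKR_REST + "?" + urlencode(params)
```

Requesting `license` and `owner_name` in `extras` lets the worker fill the `license` and `attribution` columns in one pass.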


Image Quality Pipeline

| Stage | Library | Purpose |
| --- | --- | --- |
| Deduplication | imagededup | Perceptual hash (CNN + hash methods) |
| Blur detection | scipy + Sobel variance | Reject blurry images |
| Size filter | Pillow | Min 256x256 |
| Resize | Pillow | Normalize to 512x512 |
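Blur scoring can be illustrated without any dependencies: a blurry image has weak, uniform gradients, so the variance of its Sobel response is low. This pure-Python toy mirrors what the planned scipy version would compute on a NumPy array; the threshold is a placeholder to be tuned on real images:

```python
def sobel_variance(gray):
    """Variance of the squared Sobel gradient magnitude over a 2-D grayscale
    image given as a list of rows. Low variance suggests a blurry image."""
    h, w = len(gray), len(gray[0])
    mags = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # 3x3 Sobel kernels, written out explicitly
            gx = (gray[y-1][x+1] + 2*gray[y][x+1] + gray[y+1][x+1]
                  - gray[y-1][x-1] - 2*gray[y][x-1] - gray[y+1][x-1])
            gy = (gray[y+1][x-1] + 2*gray[y+1][x] + gray[y+1][x+1]
                  - gray[y-1][x-1] - 2*gray[y-1][x] - gray[y-1][x+1])
            mags.append(gx * gx + gy * gy)
    mean = sum(mags) / len(mags)
    return sum((m - mean) ** 2 for m in mags) / len(mags)

def is_blurry(gray, threshold=100.0):
    """Placeholder threshold; calibrate against known-sharp samples."""
    return sobel_variance(gray) < threshold
```

In production the same score would feed the `quality_score` column, computed with `scipy.ndimage.sobel` for speed.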

Technology Stack

| Component | Choice | Rationale |
| --- | --- | --- |
| Backend | FastAPI (Python) | Async, fast, ML ecosystem, great docs |
| Frontend | React + Tailwind | Fast dev, good component libraries |
| Database | SQLite (+ FTS5) | Simple, no separate container, sufficient for single user |
| Task Queue | Celery + Redis | Long-running scrape jobs, good monitoring |
| Containers | Docker Compose | Multi-service orchestration |

Reference: https://github.com/fastapi/full-stack-fastapi-template


Architecture

```
┌─────────────────────────────────────────────────────────────────────────┐
│                         DOCKER COMPOSE ON UNRAID                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌─────────────┐    ┌─────────────────────────────────────────────────┐ │
│  │   NGINX     │    │              FASTAPI BACKEND                     │ │
│  │   :80       │───▶│  /api/species     - CRUD species list           │ │
│  │             │    │  /api/sources     - API key management          │ │
│  └──────┬──────┘    │  /api/jobs        - Scrape job control          │ │
│         │           │  /api/images      - Search, filter, browse      │ │
│         ▼           │  /api/export      - Generate zip for CoreML     │ │
│  ┌─────────────┐    │  /api/stats       - Dashboard metrics           │ │
│  │   REACT     │    └─────────────────────────────────────────────────┘ │
│  │   SPA       │                         │                              │
│  │   :3000     │                         ▼                              │
│  └─────────────┘    ┌─────────────────────────────────────────────────┐ │
│                     │              CELERY WORKERS                      │ │
│  ┌─────────────┐    │  - iNaturalist scraper                          │ │
│  │   REDIS     │◀───│  - Flickr scraper                               │ │
│  │   :6379     │    │  - Wikimedia scraper                            │ │
│  └─────────────┘    │  - Quality filter pipeline                      │ │
│                     │  - Export generator                              │ │
│                     └─────────────────────────────────────────────────┘ │
│                                          │                              │
│                                          ▼                              │
│  ┌─────────────────────────────────────────────────────────────────────┐│
│  │                         STORAGE (Bind Mounts)                        ││
│  │  /data/db/plants.sqlite     - Species, images metadata, jobs        ││
│  │  /data/images/{species}/    - Downloaded images                     ││
│  │  /data/exports/             - Generated zip files                   ││
│  └─────────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────┘
```

Database Schema

```sql
-- Species master list (imported from CSV)
CREATE TABLE species (
    id INTEGER PRIMARY KEY,
    scientific_name TEXT UNIQUE NOT NULL,
    common_name TEXT,
    genus TEXT,
    family TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Full-text search index
CREATE VIRTUAL TABLE species_fts USING fts5(
    scientific_name,
    common_name,
    genus,
    content='species',
    content_rowid='id'
);

-- API credentials
CREATE TABLE api_keys (
    id INTEGER PRIMARY KEY,
    source TEXT UNIQUE NOT NULL,  -- 'flickr', 'inaturalist', 'wikimedia', 'trefle'
    api_key TEXT NOT NULL,
    api_secret TEXT,
    rate_limit_per_sec REAL DEFAULT 1.0,
    enabled BOOLEAN DEFAULT TRUE
);

-- Downloaded images
CREATE TABLE images (
    id INTEGER PRIMARY KEY,
    species_id INTEGER REFERENCES species(id),
    source TEXT NOT NULL,
    source_id TEXT,  -- Original ID from source
    url TEXT NOT NULL,
    local_path TEXT,
    license TEXT NOT NULL,
    attribution TEXT,
    width INTEGER,
    height INTEGER,
    phash TEXT,  -- Perceptual hash for dedup
    quality_score REAL,  -- Blur/quality metric
    status TEXT DEFAULT 'pending',  -- pending, downloaded, rejected, deleted
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(source, source_id)
);

-- Indexes for common queries
CREATE INDEX idx_images_species ON images(species_id);
CREATE INDEX idx_images_status ON images(status);
CREATE INDEX idx_images_source ON images(source);
CREATE INDEX idx_images_phash ON images(phash);

-- Scrape jobs
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    source TEXT NOT NULL,
    species_filter TEXT,  -- JSON array of species IDs or NULL for all
    status TEXT DEFAULT 'pending',  -- pending, running, paused, completed, failed
    progress_current INTEGER DEFAULT 0,
    progress_total INTEGER DEFAULT 0,
    images_downloaded INTEGER DEFAULT 0,
    images_rejected INTEGER DEFAULT 0,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    error_message TEXT
);

-- Export jobs
CREATE TABLE exports (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    filter_criteria TEXT,  -- JSON: min_images, licenses, min_quality, species_ids
    train_split REAL DEFAULT 0.8,
    status TEXT DEFAULT 'pending',  -- pending, generating, completed, failed
    file_path TEXT,
    file_size INTEGER,
    species_count INTEGER,
    image_count INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    completed_at TIMESTAMP
);
```
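The `phash` column and its index support duplicate lookups. imagededup is the planned library; this pure-Python average hash merely illustrates what gets stored and how near-duplicates compare by Hamming distance (the distance threshold of 8 is a placeholder):

```python
def average_hash(pixels) -> str:
    """64-bit average hash from an 8x8 grayscale thumbnail
    (a list of 8 rows of 8 ints), returned as a 16-char hex string."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p >= mean else 0)  # 1 = brighter than average
    return f"{bits:016x}"

def hamming(h1: str, h2: str) -> int:
    """Number of differing bits between two stored hex hashes."""
    return bin(int(h1, 16) ^ int(h2, 16)).count("1")

def is_duplicate(h1: str, h2: str, max_distance: int = 8) -> bool:
    return hamming(h1, h2) <= max_distance
```

The indexed column allows exact-hash candidates to be fetched with a plain `WHERE phash = ?`, with Hamming distance applied only to that small candidate set.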

API Endpoints

Species

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/species | List species (paginated, searchable) |
| POST | /api/species | Create single species |
| POST | /api/species/import | Bulk import from CSV |
| GET | /api/species/{id} | Get species details |
| PUT | /api/species/{id} | Update species |
| DELETE | /api/species/{id} | Delete species |
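The bulk-import endpoint could parse a CSV shaped like the `species` table. Column names and the genus-from-binomial rule below are assumptions to confirm against the actual import format:

```python
import csv
import io

def parse_species_csv(text: str) -> list[dict]:
    """Parse a species CSV with scientific_name, common_name, family columns.
    Genus is derived from the first word of the binomial name."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        name = row["scientific_name"].strip()
        if not name:
            continue  # skip blank lines rather than failing the import
        rows.append({
            "scientific_name": name,
            "common_name": (row.get("common_name") or "").strip() or None,
            "genus": name.split()[0],
            "family": (row.get("family") or "").strip() or None,
        })
    return rows
```

The endpoint would then upsert these dicts against the `scientific_name UNIQUE` constraint.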

API Keys

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/sources | List configured sources |
| PUT | /api/sources/{source} | Update source config (key, rate limit) |

Jobs

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/jobs | List jobs |
| POST | /api/jobs | Create scrape job |
| GET | /api/jobs/{id} | Get job status |
| POST | /api/jobs/{id}/pause | Pause job |
| POST | /api/jobs/{id}/resume | Resume job |
| POST | /api/jobs/{id}/cancel | Cancel job |

Images

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/images | List images (paginated, filterable) |
| GET | /api/images/{id} | Get image details |
| DELETE | /api/images/{id} | Delete image |
| POST | /api/images/bulk-delete | Bulk delete |

Export

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/exports | List exports |
| POST | /api/exports | Create export job |
| GET | /api/exports/{id} | Get export status |
| GET | /api/exports/{id}/download | Download zip file |
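Behind `POST /api/exports`, the worker mainly needs a zip writer that streams files from disk rather than loading images into memory. A sketch (function and path names are illustrative; `ZIP_STORED` is chosen because JPEGs are already compressed):

```python
import zipfile
from pathlib import Path

def write_export_zip(dataset_dir: Path, zip_path: Path) -> int:
    """Zip a dataset_export/ tree (Training/..., Testing/...) preserving
    relative paths; returns the number of files written."""
    count = 0
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as zf:
        for f in sorted(dataset_dir.rglob("*")):
            if f.is_file():
                zf.write(f, f.relative_to(dataset_dir))  # arcname keeps class folders
                count += 1
    return count
```

The returned count and the resulting file size would populate the `image_count` and `file_size` columns of the `exports` table.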

Stats

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/stats | Dashboard statistics |
| GET | /api/stats/sources | Per-source breakdown |
| GET | /api/stats/species | Per-species image counts |

UI Screens

1. Dashboard

  • Total species, images by source, images by license
  • Active jobs with progress bars
  • Quick stats: images/sec, disk usage
  • Recent activity feed

2. Species Management

  • Table: scientific name, common name, genus, image count
  • Import CSV button (drag-and-drop)
  • Search/filter by name, genus
  • Bulk select → "Start Scrape" button
  • Inline editing

3. API Keys

  • Card per source with:
    • API key input (masked)
    • API secret input (if applicable)
    • Rate limit slider
    • Enable/disable toggle
    • Test connection button

4. Image Browser

  • Grid view with thumbnails (lazy-loaded)
  • Filters sidebar:
    • Species (autocomplete)
    • Source (checkboxes)
    • License (checkboxes)
    • Quality score (range slider)
    • Status (tabs: all, pending, downloaded, rejected)
  • Sort by: date, quality, species
  • Bulk select → actions (delete, re-queue)
  • Click to view full-size + metadata

5. Jobs

  • Table: name, source, status, progress, dates
  • Real-time progress updates (WebSocket)
  • Actions: pause, resume, cancel, view logs

6. Export

  • Filter builder:
    • Min images per species
    • License whitelist
    • Min quality score
    • Species selection (all or specific)
  • Train/test split slider (default 80/20)
  • Preview: estimated species count, image count, file size
  • "Generate Zip" button
  • Download history with re-download links

Tradeoffs

| Decision | Alternative | Why This Choice |
| --- | --- | --- |
| SQLite | PostgreSQL | Single-user, simpler Docker setup, sufficient for millions of rows |
| Celery + Redis | RQ, Dramatiq | Battle-tested, good monitoring (Flower) |
| React | Vue, Svelte | Largest ecosystem, more component libraries |
| Separate workers | Threads in FastAPI | Better isolation, can scale workers independently |
| Nginx reverse proxy | Traefik | Simpler config for single-app deployment |

Risks & Mitigations

| Risk | Likelihood | Mitigation |
| --- | --- | --- |
| iNaturalist rate limits (5 GB/hr) | High | Throttle downloads, prioritize species with low counts |
| Disk fills up | Medium | Dashboard shows disk usage, configurable storage limits |
| Scrape jobs crash mid-run | Medium | Job state in DB, resume from last checkpoint |
| Perceptual hash collisions | Low | Store hash, allow manual review of flagged duplicates |
| API keys exposed | Low | Environment variables, not stored in code |
| SQLite write contention | Low | WAL mode, single writer pattern via Celery |
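The WAL mitigation is a one-time pragma at connection setup, so readers (API requests) no longer block on the single Celery writer. A sketch of the connection helper (name and timeout value are illustrative):

```python
import sqlite3

def connect(db_path: str) -> sqlite3.Connection:
    """Open the SQLite DB in WAL mode so concurrent readers don't block
    the single writer."""
    conn = sqlite3.connect(db_path, timeout=30)   # wait on a busy writer instead of failing
    conn.execute("PRAGMA journal_mode=WAL")       # persists in the database file
    conn.execute("PRAGMA synchronous=NORMAL")     # common, safe pairing with WAL
    return conn
```

WAL mode is stored in the database file itself, so it only needs to be set once, but running the pragma on every connect is harmless.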

Implementation Phases

Phase 1: Foundation

  • Docker Compose setup (FastAPI, React, Redis, Nginx)
  • Database schema + migrations (Alembic)
  • Basic FastAPI skeleton with health checks
  • React app scaffolding with Tailwind

Phase 2: Core Data Management

  • Species CRUD API
  • CSV import endpoint
  • Species list UI with search/filter
  • API keys management UI

Phase 3: iNaturalist Scraper

  • Celery worker setup
  • iNaturalist/GBIF scraper task
  • Job management API
  • Real-time progress (WebSocket or polling)

Phase 4: Quality Pipeline

  • Image download worker
  • Perceptual hash deduplication
  • Blur detection + quality scoring
  • Resize to 512x512

Phase 5: Image Browser

  • Image listing API with filters
  • Thumbnail generation
  • Grid view UI
  • Bulk operations

Phase 6: Additional Scrapers

  • Flickr scraper
  • Wikimedia Commons scraper
  • Trefle scraper (metadata + images)
  • USDA PLANTS scraper

Phase 7: Export

  • Export job API
  • Train/test split logic
  • Zip generation worker
  • Download endpoint
  • Export UI with filters

Phase 8: Dashboard & Polish

  • Stats API
  • Dashboard UI with charts
  • Job monitoring UI
  • Error handling + logging
  • Documentation

File Structure

```
PlantGuideScraper/
├── docker-compose.yml
├── .env.example
├── docs/
│   └── master_plan.md
├── backend/
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── alembic/
│   │   └── versions/
│   ├── app/
│   │   ├── __init__.py
│   │   ├── main.py
│   │   ├── config.py
│   │   ├── database.py
│   │   ├── models/
│   │   │   ├── species.py
│   │   │   ├── image.py
│   │   │   ├── job.py
│   │   │   └── export.py
│   │   ├── schemas/
│   │   │   └── ...
│   │   ├── api/
│   │   │   ├── species.py
│   │   │   ├── images.py
│   │   │   ├── jobs.py
│   │   │   ├── exports.py
│   │   │   └── stats.py
│   │   ├── scrapers/
│   │   │   ├── base.py
│   │   │   ├── inaturalist.py
│   │   │   ├── flickr.py
│   │   │   ├── wikimedia.py
│   │   │   └── trefle.py
│   │   ├── workers/
│   │   │   ├── celery_app.py
│   │   │   ├── scrape_tasks.py
│   │   │   ├── quality_tasks.py
│   │   │   └── export_tasks.py
│   │   └── utils/
│   │       ├── image_quality.py
│   │       └── dedup.py
│   └── tests/
├── frontend/
│   ├── Dockerfile
│   ├── package.json
│   ├── src/
│   │   ├── App.tsx
│   │   ├── components/
│   │   ├── pages/
│   │   │   ├── Dashboard.tsx
│   │   │   ├── Species.tsx
│   │   │   ├── Images.tsx
│   │   │   ├── Jobs.tsx
│   │   │   ├── Export.tsx
│   │   │   └── Settings.tsx
│   │   ├── hooks/
│   │   └── api/
│   └── public/
├── nginx/
│   └── nginx.conf
└── data/                  # Bind mount (not in repo)
    ├── db/
    ├── images/
    └── exports/
```

Environment Variables

```bash
# Backend
DATABASE_URL=sqlite:///data/db/plants.sqlite
REDIS_URL=redis://redis:6379/0
IMAGES_PATH=/data/images
EXPORTS_PATH=/data/exports

# API Keys (user-provided)
FLICKR_API_KEY=
FLICKR_API_SECRET=
INATURALIST_APP_ID=
INATURALIST_APP_SECRET=
TREFLE_API_KEY=

# Optional
LOG_LEVEL=INFO
CELERY_CONCURRENCY=4
```
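On the backend these would typically be read once at startup. The FastAPI app might well use pydantic settings for this; a minimal `os.environ` sketch showing the defaults above:

```python
import os

def load_settings() -> dict:
    """Read backend settings from the environment, falling back to the
    documented defaults."""
    return {
        "database_url": os.environ.get("DATABASE_URL", "sqlite:///data/db/plants.sqlite"),
        "redis_url": os.environ.get("REDIS_URL", "redis://redis:6379/0"),
        "images_path": os.environ.get("IMAGES_PATH", "/data/images"),
        "exports_path": os.environ.get("EXPORTS_PATH", "/data/exports"),
        "log_level": os.environ.get("LOG_LEVEL", "INFO"),
        "celery_concurrency": int(os.environ.get("CELERY_CONCURRENCY", "4")),
    }
```

Keeping API keys in the environment (rather than code) is also the mitigation listed in the risks table; the `api_keys` table can then be seeded from these values on first boot.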

Commands

```bash
# Development
docker-compose up --build

# Production
docker-compose -f docker-compose.yml -f docker-compose.prod.yml up -d

# Run migrations
docker-compose exec backend alembic upgrade head

# View Celery logs
docker-compose logs -f celery

# Access Redis CLI
docker-compose exec redis redis-cli
```