PlantGuideScraper/docs/master_plan.md
2026-04-12 09:54:27 -05:00

Houseplant Image Scraper - Master Plan

Overview

Web-based interface for managing a multi-source image scraping pipeline targeting 5-10K houseplant species with 1-5M total images. Runs on Unraid via Docker, exports datasets for CoreML training.


Requirements Summary

| Requirement | Value |
| --- | --- |
| Platform | Web app in Docker on Unraid |
| Sources | iNaturalist/GBIF, Flickr, Wikimedia Commons, Trefle, USDA PLANTS, EOL |
| API keys | Configurable per service |
| Species list | Manual import (CSV/paste) |
| Grouping | Species, genus, source, license (faceted) |
| Search/filter | Yes |
| Quality filter | Automatic (hash dedup, blur, size) |
| Progress | Real-time dashboard |
| Storage | /species_name/image.jpg + SQLite DB |
| Export | Filtered zip for CoreML, downloadable anytime |
| Auth | None (single user) |
| Deployment | Docker Compose |

Create ML Export Requirements

Per Apple's documentation:

  • Folder structure: /SpeciesName/image001.jpg (folder name = class label)
  • Train/Test split: 80/20 recommended, separate folders
  • Balance: Roughly equal images per class (avoid bias)
  • No metadata needed: Create ML uses folder names as labels

Export Format

```
dataset_export/
├── Training/
│   ├── Monstera_deliciosa/
│   │   ├── img001.jpg
│   │   └── ...
│   ├── Philodendron_hederaceum/
│   └── ...
└── Testing/
    ├── Monstera_deliciosa/
    └── ...
```
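The split itself is straightforward: shuffle each species folder deterministically and copy the first 80% into `Training/`. A minimal sketch (function name, `.jpg`-only glob, and fixed seed are illustrative choices, not the final implementation):

```python
import random
import shutil
from pathlib import Path

def split_dataset(source_dir: Path, dest_dir: Path,
                  train_ratio: float = 0.8, seed: int = 42) -> None:
    """Copy each /SpeciesName/ folder into Training/ and Testing/ subtrees."""
    rng = random.Random(seed)  # fixed seed -> reproducible splits across exports
    for species_dir in sorted(p for p in source_dir.iterdir() if p.is_dir()):
        images = sorted(species_dir.glob("*.jpg"))
        rng.shuffle(images)
        cut = int(len(images) * train_ratio)
        for subset, files in (("Training", images[:cut]), ("Testing", images[cut:])):
            out = dest_dir / subset / species_dir.name
            out.mkdir(parents=True, exist_ok=True)
            for f in files:
                shutil.copy2(f, out / f.name)  # preserves timestamps
```

Because the folder name is the class label, the species directory name is copied verbatim into both subtrees.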

Data Sources

| Source | API/Method | License Filter | Rate Limits | Notes |
| --- | --- | --- | --- | --- |
| iNaturalist/GBIF | Bulk DwC-A export + API | CC0, CC-BY | 1 req/sec, 10k/day, 5 GB/hr media | Best source: Research Grade = verified |
| Flickr | flickr.photos.search | license=4,9 (CC-BY, CC0) | 3600 req/hr | Good supplemental |
| Wikimedia Commons | MediaWiki API + pyWikiCommons | CC-BY, CC-BY-SA, PD | Generous | Category-based search |
| Trefle.io | REST API | Open source | Free tier | Species metadata + some images |
| USDA PLANTS | REST API | Public Domain | Generous | US-focused, limited images |
| Plant.id | REST API | Commercial | Paid | For validation, not scraping |
| Encyclopedia of Life | API | Mixed | Check each | Aggregator |
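A simple way to honor per-source limits like iNaturalist's 1 req/sec is a small blocking limiter shared by each scraper task. This sketch (class name and the injectable clock/sleep hooks are illustrative; they make the limiter testable without real waiting):

```python
import time

class RateLimiter:
    """Blocking limiter: allows at most `rate` calls per second."""

    def __init__(self, rate: float, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self._next_allowed = 0.0  # earliest monotonic time the next call may run

    def acquire(self) -> float:
        """Block until a request slot is free; return the wait that was applied."""
        now = self.clock()
        wait = max(0.0, self._next_allowed - now)
        if wait:
            self.sleep(wait)
        self._next_allowed = max(now, self._next_allowed) + self.interval
        return wait
```

Each source's `rate_limit_per_sec` from the `api_keys` table could seed one limiter per worker.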

Source References

Flickr License IDs

| ID | License |
| --- | --- |
| 0 | All Rights Reserved |
| 1 | CC BY-NC-SA 2.0 |
| 2 | CC BY-NC 2.0 |
| 3 | CC BY-NC-ND 2.0 |
| 4 | CC BY 2.0 (Commercial OK) |
| 5 | CC BY-SA 2.0 |
| 6 | CC BY-ND 2.0 |
| 7 | No known copyright restrictions |
| 8 | United States Government Work |
| 9 | Public Domain (CC0) |

For commercial use: Filter to license IDs 4, 7, 8, 9 only.
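As a sketch, a commercial-safe search request could be built like this. The parameter set is a plausible subset of the `flickr.photos.search` REST API; the exact `extras` the scraper ends up needing may differ:

```python
from urllib.parse import urlencode

FLICKR_REST = "https://api.flickr.com/services/rest/"
COMMERCIAL_LICENSES = "4,7,8,9"  # CC BY, no known restrictions, US Gov, CC0

def build_flickr_search(api_key: str, species: str, per_page: int = 500) -> str:
    """Build a flickr.photos.search URL restricted to commercial-friendly licenses."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "text": species,
        "license": COMMERCIAL_LICENSES,
        "content_type": 1,                      # photos only, no screenshots
        "extras": "license,url_o,owner_name",   # license + original URL + attribution
        "per_page": per_page,
        "format": "json",
        "nojsoncallback": 1,
    }
    return FLICKR_REST + "?" + urlencode(params)
```

Requesting `license` and `owner_name` in `extras` lets the worker fill the `license` and `attribution` columns in one pass.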


Image Quality Pipeline

| Stage | Library | Purpose |
| --- | --- | --- |
| Deduplication | imagededup | Perceptual hash (CNN + hash methods) |
| Blur detection | scipy + Sobel variance | Reject blurry images |
| Size filter | Pillow | Min 256x256 |
| Resize | Pillow | Normalize to 512x512 |
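Blur scoring can be illustrated without any dependencies: a blurry image has weak, uniform gradients, so the variance of its Sobel response is low. This pure-Python toy mirrors what the planned scipy version would compute on a NumPy array; the threshold is a placeholder to be tuned on real images:

```python
def sobel_variance(gray):
    """Variance of the squared Sobel gradient magnitude over a 2-D grayscale
    image given as a list of rows. Low variance suggests a blurry image."""
    h, w = len(gray), len(gray[0])
    mags = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # 3x3 Sobel kernels, written out explicitly
            gx = (gray[y-1][x+1] + 2*gray[y][x+1] + gray[y+1][x+1]
                  - gray[y-1][x-1] - 2*gray[y][x-1] - gray[y+1][x-1])
            gy = (gray[y+1][x-1] + 2*gray[y+1][x] + gray[y+1][x+1]
                  - gray[y-1][x-1] - 2*gray[y-1][x] - gray[y-1][x+1])
            mags.append(gx * gx + gy * gy)
    mean = sum(mags) / len(mags)
    return sum((m - mean) ** 2 for m in mags) / len(mags)

def is_blurry(gray, threshold=100.0):
    """Placeholder threshold; calibrate against known-sharp samples."""
    return sobel_variance(gray) < threshold
```

In production the same score would feed the `quality_score` column, computed with `scipy.ndimage.sobel` for speed.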

Technology Stack

| Component | Choice | Rationale |
| --- | --- | --- |
| Backend | FastAPI (Python) | Async, fast, ML ecosystem, great docs |
| Frontend | React + Tailwind | Fast dev, good component libraries |
| Database | SQLite (+ FTS5) | Simple, no separate container, sufficient for single user |
| Task Queue | Celery + Redis | Long-running scrape jobs, good monitoring |
| Containers | Docker Compose | Multi-service orchestration |

Reference: https://github.com/fastapi/full-stack-fastapi-template


Architecture

```
┌─────────────────────────────────────────────────────────────────────────┐
│                         DOCKER COMPOSE ON UNRAID                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌─────────────┐    ┌─────────────────────────────────────────────────┐ │
│  │   NGINX     │    │              FASTAPI BACKEND                     │ │
│  │   :80       │───▶│  /api/species     - CRUD species list           │ │
│  │             │    │  /api/sources     - API key management          │ │
│  └──────┬──────┘    │  /api/jobs        - Scrape job control          │ │
│         │           │  /api/images      - Search, filter, browse      │ │
│         ▼           │  /api/export      - Generate zip for CoreML     │ │
│  ┌─────────────┐    │  /api/stats       - Dashboard metrics           │ │
│  │   REACT     │    └─────────────────────────────────────────────────┘ │
│  │   SPA       │                         │                              │
│  │   :3000     │                         ▼                              │
│  └─────────────┘    ┌─────────────────────────────────────────────────┐ │
│                     │              CELERY WORKERS                      │ │
│  ┌─────────────┐    │  - iNaturalist scraper                          │ │
│  │   REDIS     │◀───│  - Flickr scraper                               │ │
│  │   :6379     │    │  - Wikimedia scraper                            │ │
│  └─────────────┘    │  - Quality filter pipeline                      │ │
│                     │  - Export generator                              │ │
│                     └─────────────────────────────────────────────────┘ │
│                                          │                              │
│                                          ▼                              │
│  ┌─────────────────────────────────────────────────────────────────────┐│
│  │                         STORAGE (Bind Mounts)                        ││
│  │  /data/db/plants.sqlite     - Species, images metadata, jobs        ││
│  │  /data/images/{species}/    - Downloaded images                     ││
│  │  /data/exports/             - Generated zip files                   ││
│  └─────────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────┘
```

Database Schema

```sql
-- Species master list (imported from CSV)
CREATE TABLE species (
    id INTEGER PRIMARY KEY,
    scientific_name TEXT UNIQUE NOT NULL,
    common_name TEXT,
    genus TEXT,
    family TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Full-text search index
CREATE VIRTUAL TABLE species_fts USING fts5(
    scientific_name,
    common_name,
    genus,
    content='species',
    content_rowid='id'
);

-- API credentials
CREATE TABLE api_keys (
    id INTEGER PRIMARY KEY,
    source TEXT UNIQUE NOT NULL,  -- 'flickr', 'inaturalist', 'wikimedia', 'trefle'
    api_key TEXT NOT NULL,
    api_secret TEXT,
    rate_limit_per_sec REAL DEFAULT 1.0,
    enabled BOOLEAN DEFAULT TRUE
);

-- Downloaded images
CREATE TABLE images (
    id INTEGER PRIMARY KEY,
    species_id INTEGER REFERENCES species(id),
    source TEXT NOT NULL,
    source_id TEXT,  -- Original ID from source
    url TEXT NOT NULL,
    local_path TEXT,
    license TEXT NOT NULL,
    attribution TEXT,
    width INTEGER,
    height INTEGER,
    phash TEXT,  -- Perceptual hash for dedup
    quality_score REAL,  -- Blur/quality metric
    status TEXT DEFAULT 'pending',  -- pending, downloaded, rejected, deleted
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(source, source_id)
);

-- Indexes for common queries
CREATE INDEX idx_images_species ON images(species_id);
CREATE INDEX idx_images_status ON images(status);
CREATE INDEX idx_images_source ON images(source);
CREATE INDEX idx_images_phash ON images(phash);

-- Scrape jobs
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    source TEXT NOT NULL,
    species_filter TEXT,  -- JSON array of species IDs or NULL for all
    status TEXT DEFAULT 'pending',  -- pending, running, paused, completed, failed
    progress_current INTEGER DEFAULT 0,
    progress_total INTEGER DEFAULT 0,
    images_downloaded INTEGER DEFAULT 0,
    images_rejected INTEGER DEFAULT 0,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    error_message TEXT
);

-- Export jobs
CREATE TABLE exports (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    filter_criteria TEXT,  -- JSON: min_images, licenses, min_quality, species_ids
    train_split REAL DEFAULT 0.8,
    status TEXT DEFAULT 'pending',  -- pending, generating, completed, failed
    file_path TEXT,
    file_size INTEGER,
    species_count INTEGER,
    image_count INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    completed_at TIMESTAMP
);
```
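The `phash` column and its index support duplicate lookups. imagededup is the planned library; this pure-Python average hash merely illustrates what gets stored and how near-duplicates compare by Hamming distance (the distance threshold of 8 is a placeholder):

```python
def average_hash(pixels) -> str:
    """64-bit average hash from an 8x8 grayscale thumbnail
    (a list of 8 rows of 8 ints), returned as a 16-char hex string."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p >= mean else 0)  # 1 = brighter than average
    return f"{bits:016x}"

def hamming(h1: str, h2: str) -> int:
    """Number of differing bits between two stored hex hashes."""
    return bin(int(h1, 16) ^ int(h2, 16)).count("1")

def is_duplicate(h1: str, h2: str, max_distance: int = 8) -> bool:
    return hamming(h1, h2) <= max_distance
```

The indexed column allows exact-hash candidates to be fetched with a plain `WHERE phash = ?`, with Hamming distance applied only to that small candidate set.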

API Endpoints

Species

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/species | List species (paginated, searchable) |
| POST | /api/species | Create single species |
| POST | /api/species/import | Bulk import from CSV |
| GET | /api/species/{id} | Get species details |
| PUT | /api/species/{id} | Update species |
| DELETE | /api/species/{id} | Delete species |
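The bulk-import endpoint could parse a CSV shaped like the `species` table. Column names and the genus-from-binomial rule below are assumptions to confirm against the actual import format:

```python
import csv
import io

def parse_species_csv(text: str) -> list[dict]:
    """Parse a species CSV with scientific_name, common_name, family columns.
    Genus is derived from the first word of the binomial name."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        name = row["scientific_name"].strip()
        if not name:
            continue  # skip blank lines rather than failing the import
        rows.append({
            "scientific_name": name,
            "common_name": (row.get("common_name") or "").strip() or None,
            "genus": name.split()[0],
            "family": (row.get("family") or "").strip() or None,
        })
    return rows
```

The endpoint would then upsert these dicts against the `scientific_name UNIQUE` constraint.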

API Keys

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/sources | List configured sources |
| PUT | /api/sources/{source} | Update source config (key, rate limit) |

Jobs

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/jobs | List jobs |
| POST | /api/jobs | Create scrape job |
| GET | /api/jobs/{id} | Get job status |
| POST | /api/jobs/{id}/pause | Pause job |
| POST | /api/jobs/{id}/resume | Resume job |
| POST | /api/jobs/{id}/cancel | Cancel job |

Images

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/images | List images (paginated, filterable) |
| GET | /api/images/{id} | Get image details |
| DELETE | /api/images/{id} | Delete image |
| POST | /api/images/bulk-delete | Bulk delete |

Export

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/exports | List exports |
| POST | /api/exports | Create export job |
| GET | /api/exports/{id} | Get export status |
| GET | /api/exports/{id}/download | Download zip file |
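Behind `POST /api/exports`, the worker mainly needs a zip writer that streams files from disk rather than loading images into memory. A sketch (function and path names are illustrative; `ZIP_STORED` is chosen because JPEGs are already compressed):

```python
import zipfile
from pathlib import Path

def write_export_zip(dataset_dir: Path, zip_path: Path) -> int:
    """Zip a dataset_export/ tree (Training/..., Testing/...) preserving
    relative paths; returns the number of files written."""
    count = 0
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as zf:
        for f in sorted(dataset_dir.rglob("*")):
            if f.is_file():
                zf.write(f, f.relative_to(dataset_dir))  # arcname keeps class folders
                count += 1
    return count
```

The returned count and the resulting file size would populate the `image_count` and `file_size` columns of the `exports` table.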

Stats

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/stats | Dashboard statistics |
| GET | /api/stats/sources | Per-source breakdown |
| GET | /api/stats/species | Per-species image counts |

UI Screens

1. Dashboard

  • Total species, images by source, images by license
  • Active jobs with progress bars
  • Quick stats: images/sec, disk usage
  • Recent activity feed

2. Species Management

  • Table: scientific name, common name, genus, image count
  • Import CSV button (drag-and-drop)
  • Search/filter by name, genus
  • Bulk select → "Start Scrape" button
  • Inline editing

3. API Keys

  • Card per source with:
    • API key input (masked)
    • API secret input (if applicable)
    • Rate limit slider
    • Enable/disable toggle
    • Test connection button

4. Image Browser

  • Grid view with thumbnails (lazy-loaded)
  • Filters sidebar:
    • Species (autocomplete)
    • Source (checkboxes)
    • License (checkboxes)
    • Quality score (range slider)
    • Status (tabs: all, pending, downloaded, rejected)
  • Sort by: date, quality, species
  • Bulk select → actions (delete, re-queue)
  • Click to view full-size + metadata

5. Jobs

  • Table: name, source, status, progress, dates
  • Real-time progress updates (WebSocket)
  • Actions: pause, resume, cancel, view logs

6. Export

  • Filter builder:
    • Min images per species
    • License whitelist
    • Min quality score
    • Species selection (all or specific)
  • Train/test split slider (default 80/20)
  • Preview: estimated species count, image count, file size
  • "Generate Zip" button
  • Download history with re-download links

Tradeoffs

| Decision | Alternative | Why This Choice |
| --- | --- | --- |
| SQLite | PostgreSQL | Single-user, simpler Docker setup, sufficient for millions of rows |
| Celery + Redis | RQ, Dramatiq | Battle-tested, good monitoring (Flower) |
| React | Vue, Svelte | Largest ecosystem, more component libraries |
| Separate workers | Threads in FastAPI | Better isolation, can scale workers independently |
| Nginx reverse proxy | Traefik | Simpler config for single-app deployment |

Risks & Mitigations

| Risk | Likelihood | Mitigation |
| --- | --- | --- |
| iNaturalist rate limits (5 GB/hr) | High | Throttle downloads, prioritize species with low counts |
| Disk fills up | Medium | Dashboard shows disk usage, configurable storage limits |
| Scrape jobs crash mid-run | Medium | Job state in DB, resume from last checkpoint |
| Perceptual hash collisions | Low | Store hash, allow manual review of flagged duplicates |
| API keys exposed | Low | Environment variables, not stored in code |
| SQLite write contention | Low | WAL mode, single writer pattern via Celery |
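The WAL mitigation is a one-time pragma at connection setup, so readers (API requests) no longer block on the single Celery writer. A sketch of the connection helper (name and timeout value are illustrative):

```python
import sqlite3

def connect(db_path: str) -> sqlite3.Connection:
    """Open the SQLite DB in WAL mode so concurrent readers don't block
    the single writer."""
    conn = sqlite3.connect(db_path, timeout=30)   # wait on a busy writer instead of failing
    conn.execute("PRAGMA journal_mode=WAL")       # persists in the database file
    conn.execute("PRAGMA synchronous=NORMAL")     # common, safe pairing with WAL
    return conn
```

WAL mode is stored in the database file itself, so it only needs to be set once, but running the pragma on every connect is harmless.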

Implementation Phases

Phase 1: Foundation

  • Docker Compose setup (FastAPI, React, Redis, Nginx)
  • Database schema + migrations (Alembic)
  • Basic FastAPI skeleton with health checks
  • React app scaffolding with Tailwind

Phase 2: Core Data Management

  • Species CRUD API
  • CSV import endpoint
  • Species list UI with search/filter
  • API keys management UI

Phase 3: iNaturalist Scraper

  • Celery worker setup
  • iNaturalist/GBIF scraper task
  • Job management API
  • Real-time progress (WebSocket or polling)

Phase 4: Quality Pipeline

  • Image download worker
  • Perceptual hash deduplication
  • Blur detection + quality scoring
  • Resize to 512x512

Phase 5: Image Browser

  • Image listing API with filters
  • Thumbnail generation
  • Grid view UI
  • Bulk operations

Phase 6: Additional Scrapers

  • Flickr scraper
  • Wikimedia Commons scraper
  • Trefle scraper (metadata + images)
  • USDA PLANTS scraper

Phase 7: Export

  • Export job API
  • Train/test split logic
  • Zip generation worker
  • Download endpoint
  • Export UI with filters

Phase 8: Dashboard & Polish

  • Stats API
  • Dashboard UI with charts
  • Job monitoring UI
  • Error handling + logging
  • Documentation

File Structure

```
PlantGuideScraper/
├── docker-compose.yml
├── .env.example
├── docs/
│   └── master_plan.md
├── backend/
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── alembic/
│   │   └── versions/
│   ├── app/
│   │   ├── __init__.py
│   │   ├── main.py
│   │   ├── config.py
│   │   ├── database.py
│   │   ├── models/
│   │   │   ├── species.py
│   │   │   ├── image.py
│   │   │   ├── job.py
│   │   │   └── export.py
│   │   ├── schemas/
│   │   │   └── ...
│   │   ├── api/
│   │   │   ├── species.py
│   │   │   ├── images.py
│   │   │   ├── jobs.py
│   │   │   ├── exports.py
│   │   │   └── stats.py
│   │   ├── scrapers/
│   │   │   ├── base.py
│   │   │   ├── inaturalist.py
│   │   │   ├── flickr.py
│   │   │   ├── wikimedia.py
│   │   │   └── trefle.py
│   │   ├── workers/
│   │   │   ├── celery_app.py
│   │   │   ├── scrape_tasks.py
│   │   │   ├── quality_tasks.py
│   │   │   └── export_tasks.py
│   │   └── utils/
│   │       ├── image_quality.py
│   │       └── dedup.py
│   └── tests/
├── frontend/
│   ├── Dockerfile
│   ├── package.json
│   ├── src/
│   │   ├── App.tsx
│   │   ├── components/
│   │   ├── pages/
│   │   │   ├── Dashboard.tsx
│   │   │   ├── Species.tsx
│   │   │   ├── Images.tsx
│   │   │   ├── Jobs.tsx
│   │   │   ├── Export.tsx
│   │   │   └── Settings.tsx
│   │   ├── hooks/
│   │   └── api/
│   └── public/
├── nginx/
│   └── nginx.conf
└── data/                  # Bind mount (not in repo)
    ├── db/
    ├── images/
    └── exports/
```

Environment Variables

```bash
# Backend
DATABASE_URL=sqlite:///data/db/plants.sqlite
REDIS_URL=redis://redis:6379/0
IMAGES_PATH=/data/images
EXPORTS_PATH=/data/exports

# API Keys (user-provided)
FLICKR_API_KEY=
FLICKR_API_SECRET=
INATURALIST_APP_ID=
INATURALIST_APP_SECRET=
TREFLE_API_KEY=

# Optional
LOG_LEVEL=INFO
CELERY_CONCURRENCY=4
```
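On the backend these would typically be read once at startup. The FastAPI app might well use pydantic settings for this; a minimal `os.environ` sketch showing the defaults above:

```python
import os

def load_settings() -> dict:
    """Read backend settings from the environment, falling back to the
    documented defaults."""
    return {
        "database_url": os.environ.get("DATABASE_URL", "sqlite:///data/db/plants.sqlite"),
        "redis_url": os.environ.get("REDIS_URL", "redis://redis:6379/0"),
        "images_path": os.environ.get("IMAGES_PATH", "/data/images"),
        "exports_path": os.environ.get("EXPORTS_PATH", "/data/exports"),
        "log_level": os.environ.get("LOG_LEVEL", "INFO"),
        "celery_concurrency": int(os.environ.get("CELERY_CONCURRENCY", "4")),
    }
```

Keeping API keys in the environment (rather than code) is also the mitigation listed in the risks table; the `api_keys` table can then be seeded from these values on first boot.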

Commands

```bash
# Development
docker-compose up --build

# Production
docker-compose -f docker-compose.yml -f docker-compose.prod.yml up -d

# Run migrations
docker-compose exec backend alembic upgrade head

# View Celery logs
docker-compose logs -f celery

# Access Redis CLI
docker-compose exec redis redis-cli
```