PlantGuide/Docs/phase3-implementation-plan.md
Trey t 136dfbae33 Add PlantGuide iOS app with plant identification and care management
- Implement camera capture and plant identification workflow
- Add Core Data persistence for plants, care schedules, and cached API data
- Create collection view with grid/list layouts and filtering
- Build plant detail views with care information display
- Integrate Trefle botanical API for plant care data
- Add local image storage for captured plant photos
- Implement dependency injection container for testability
- Include accessibility support throughout the app

Bug fixes in this commit:
- Fix Trefle API decoding by removing duplicate CodingKeys
- Fix LocalCachedImage to load from correct PlantImages directory
- Set dateAdded when saving plants for proper collection sorting

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 12:18:01 -06:00

# Phase 3: Dataset Preprocessing & Augmentation - Implementation Plan
## Overview
**Goal:** Prepare images for training with consistent formatting and augmentation pipeline.
**Prerequisites:** Phase 2 complete - `datasets/train/`, `datasets/val/`, `datasets/test/` directories with manifests
**Target Deliverable:** Training-ready dataset with standardized dimensions, normalized values, and augmentation pipeline
---
## Task Breakdown
### Task 3.1: Standardize Image Dimensions
**Objective:** Resize all images to consistent dimensions for model input.
**Actions:**
1. Create `scripts/phase3/standardize_dimensions.py` to:
- Load images from train/val/test directories
- Resize to target dimension (224x224 for MobileNetV3, 299x299 for EfficientNet)
- Preserve aspect ratio with center crop or letterboxing
- Save resized images to new directory structure
2. Support multiple output sizes:
```python
TARGET_SIZES = {
    "mobilenet": (224, 224),
    "efficientnet": (299, 299),
    "vit": (384, 384),
}
```
3. Implement resize strategies:
- **center_crop:** Crop to square, then resize (preserves detail)
- **letterbox:** Pad to square, then resize (preserves full image)
- **stretch:** Direct resize (fastest, may distort)
4. Output directory structure:
```
datasets/
├── processed/
│   └── 224x224/
│       ├── train/
│       ├── val/
│       └── test/
```
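For reference, the three strategies can be sketched with Pillow. This is a minimal illustration, not the planned script; the function name and the black letterbox padding are assumptions:

```python
from PIL import Image

def resize_image(path: str, size: tuple, strategy: str = "center_crop") -> Image.Image:
    """Resize an image to `size` using one of the three strategies."""
    img = Image.open(path).convert("RGB")
    if strategy == "stretch":
        # Direct resize: fastest, but distorts non-square images
        return img.resize(size, Image.LANCZOS)
    if strategy == "center_crop":
        # Crop the largest centered square, then resize
        side = min(img.size)
        left = (img.width - side) // 2
        top = (img.height - side) // 2
        img = img.crop((left, top, left + side, top + side))
        return img.resize(size, Image.LANCZOS)
    if strategy == "letterbox":
        # Fit inside the target box, then pad to square
        img.thumbnail(size, Image.LANCZOS)
        canvas = Image.new("RGB", size, (0, 0, 0))
        canvas.paste(img, ((size[0] - img.width) // 2, (size[1] - img.height) // 2))
        return canvas
    raise ValueError(f"unknown strategy: {strategy}")
```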
**Output:**
- `datasets/processed/{size}/` directories
- `output/phase3/dimension_report.json` - Processing statistics
**Validation:**
- [ ] All images in processed directory are exactly target dimensions
- [ ] No corrupt images (all readable by PIL)
- [ ] Image count matches source (no images lost)
- [ ] Processing time logged for performance baseline
---
### Task 3.2: Normalize Color Channels
**Objective:** Standardize pixel values and handle format variations.
**Actions:**
1. Create `scripts/phase3/normalize_images.py` to:
- Convert all images to RGB (handle RGBA, grayscale, CMYK)
- Apply ImageNet normalization (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
- Handle various input formats (JPEG, PNG, WebP, HEIC)
- Save as consistent format (JPEG with quality 95, or PNG for lossless)
2. Implement color normalization:
```python
import numpy as np

def normalize_image(image: np.ndarray) -> np.ndarray:
    """Normalize image for model input (ImageNet mean/std)."""
    image = image.astype(np.float32) / 255.0
    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    return (image - mean) / std
```
3. Create preprocessing pipeline class:
```python
from PIL import Image

class ImagePreprocessor:
    def __init__(self, target_size, normalize=True):
        self.target_size = target_size
        self.normalize = normalize

    def __call__(self, image_path: str) -> np.ndarray:
        # Load, resize, convert; normalize_image() as defined above
        image = Image.open(image_path).convert("RGB").resize(self.target_size, Image.LANCZOS)
        array = np.asarray(image)
        return normalize_image(array) if self.normalize else array
```
4. Handle edge cases:
- Grayscale → convert to RGB by duplicating channels
- RGBA → remove alpha channel, composite on white
- CMYK → convert to RGB color space
- 16-bit images → convert to 8-bit
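Most of these edge cases reduce to Pillow's `convert`, with RGBA and 16-bit grayscale handled explicitly; a hedged sketch (the helper name is illustrative, not part of the planned script):

```python
import numpy as np
from PIL import Image

def to_rgb(img: Image.Image) -> Image.Image:
    """Convert any supported mode to 8-bit RGB."""
    if img.mode == "RGBA":
        # Composite on white so transparent areas don't render as black
        background = Image.new("RGB", img.size, (255, 255, 255))
        background.paste(img, mask=img.split()[3])
        return background
    if img.mode == "I;16":
        # 16-bit grayscale: scale down to 8-bit, then duplicate channels
        arr = (np.asarray(img, dtype=np.uint32) >> 8).astype(np.uint8)
        return Image.fromarray(arr, "L").convert("RGB")
    # Covers L (grayscale), P (palette), CMYK, etc.
    return img.convert("RGB")
```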
**Output:**
- Updated processed images with consistent color handling
- `output/phase3/color_conversion_log.json` - Format conversion statistics
**Validation:**
- [ ] All images have exactly 3 color channels (RGB)
- [ ] Pixel values in expected range after normalization
- [ ] No format conversion errors
- [ ] Color fidelity maintained (visual spot check on 50 random images)
---
### Task 3.3: Implement Data Augmentation Pipeline
**Objective:** Create augmentation transforms to increase training data variety.
**Actions:**
1. Create `scripts/phase3/augmentation_pipeline.py` with transforms:
**Geometric Transforms:**
- Random rotation: -30° to +30°
- Random horizontal flip: 50% probability
- Random vertical flip: 10% probability (hanging and trailing plants can appear inverted)
- Random crop: 80-100% of image, then resize back
- Random perspective: slight perspective distortion
**Color Transforms:**
- Random brightness: ±20%
- Random contrast: ±20%
- Random saturation: ±30%
- Random hue shift: ±10%
- Color jitter (combined)
**Blur/Noise Transforms:**
- Gaussian blur: kernel 3-7, 30% probability
- Motion blur: 10% probability
- Gaussian noise: σ=0.01-0.05, 20% probability
**Occlusion Transforms:**
- Random erasing (cutout): 10-30% area, 20% probability
- Grid dropout: 10% probability
2. Implement using PyTorch or Albumentations:
```python
import albumentations as A
from albumentations.pytorch import ToTensorV2

train_transform = A.Compose([
    A.RandomResizedCrop(224, 224, scale=(0.8, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.Rotate(limit=30, p=0.5),
    A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.3, hue=0.1),
    A.GaussianBlur(blur_limit=(3, 7), p=0.3),
    A.CoarseDropout(max_holes=8, max_height=16, max_width=16, p=0.2),
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ToTensorV2(),
])

val_transform = A.Compose([
    A.Resize(256, 256),
    A.CenterCrop(224, 224),
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ToTensorV2(),
])
```
3. Create visualization tool for augmentation preview:
```python
import matplotlib.pyplot as plt

def visualize_augmentations(image_path, transform, n_samples=9):
    """Show a grid of augmented versions of the same image."""
    image = np.array(Image.open(image_path).convert("RGB"))  # np/PIL as imported above
    _, axes = plt.subplots(3, 3, figsize=(8, 8))
    for ax in axes.flat[:n_samples]:
        ax.imshow(transform(image=image)["image"])  # use a transform without Normalize
        ax.axis("off")
    plt.show()
```
4. Save augmentation configuration to JSON for reproducibility
**Output:**
- `scripts/phase3/augmentation_pipeline.py` - Reusable transform classes
- `output/phase3/augmentation_config.json` - Transform parameters
- `output/phase3/augmentation_samples/` - Visual examples
**Validation:**
- [ ] All augmentations produce valid images (no NaN, no corruption)
- [ ] Augmented images visually reasonable (not over-augmented)
- [ ] Transforms are deterministic when seeded
- [ ] Pipeline runs at >100 images/second on CPU
---
### Task 3.4: Balance Underrepresented Classes
**Objective:** Create augmented variants to address class imbalance.
**Actions:**
1. Create `scripts/phase3/analyze_class_balance.py` to:
- Count images per class in training set
- Calculate imbalance ratio (max_class / min_class)
- Identify underrepresented classes (below median - 1 std)
- Visualize class distribution
2. Create `scripts/phase3/oversample_minority.py` to:
- Define target samples per class (e.g., median count)
- Generate augmented copies for minority classes
- Apply stronger augmentation for synthetic samples
- Track original vs augmented counts
3. Implement oversampling strategies:
```python
import numpy as np

class BalancingStrategy:
    """Strategies for handling class imbalance."""

    @staticmethod
    def oversample_to_median(class_counts: dict) -> dict:
        """Oversample minority classes to median count."""
        median = np.median(list(class_counts.values()))
        targets = {}
        for cls, count in class_counts.items():
            targets[cls] = max(int(median), count)
        return targets

    @staticmethod
    def oversample_to_max(class_counts: dict, cap_ratio=5) -> dict:
        """Oversample to max, capped at ratio times original."""
        max_count = max(class_counts.values())
        targets = {}
        for cls, count in class_counts.items():
            targets[cls] = min(max_count, count * cap_ratio)
        return targets
```
4. Generate balanced training manifest:
- Include original images
- Add paths to augmented copies
- Mark augmented images in manifest (for analysis)
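Given targets from either strategy, the number of augmented copies to generate per class is just the shortfall; a small sketch with made-up species names and counts:

```python
import numpy as np

class_counts = {"Quercus robur": 300, "Acer palmatum": 40, "Acer rubrum": 120}

# Mirror oversample_to_median: bring minority classes up to the median count
median = int(np.median(list(class_counts.values())))
targets = {cls: max(median, n) for cls, n in class_counts.items()}

# Shortfall per class = number of augmented copies to generate
to_generate = {cls: targets[cls] - n for cls, n in class_counts.items()}
# to_generate == {"Quercus robur": 0, "Acer palmatum": 80, "Acer rubrum": 0}
```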
**Output:**
- `datasets/processed/balanced/train/` - Balanced training set
- `output/phase3/class_balance_before.json` - Original distribution
- `output/phase3/class_balance_after.json` - Balanced distribution
- `output/phase3/balance_histogram.png` - Visual comparison
**Validation:**
- [ ] Imbalance ratio reduced to < 10:1 (max:min)
- [ ] No class has fewer than 50 training samples
- [ ] Augmented images are visually distinct from originals
- [ ] Total training set size documented
---
### Task 3.5: Generate Image Manifest Files
**Objective:** Create mapping files for training pipeline.
**Actions:**
1. Create `scripts/phase3/generate_manifests.py` to produce:
**CSV Format (PyTorch ImageFolder compatible):**
```csv
path,label,scientific_name,plant_id,source,is_augmented
train/images/quercus_robur_001.jpg,42,Quercus robur,QR001,inaturalist,false
train/images/quercus_robur_002_aug.jpg,42,Quercus robur,QR001,augmented,true
```
**JSON Format (detailed metadata):**
```json
{
  "train": [
    {
      "path": "train/images/quercus_robur_001.jpg",
      "label": 42,
      "scientific_name": "Quercus robur",
      "common_name": "English Oak",
      "plant_id": "QR001",
      "source": "inaturalist",
      "is_augmented": false,
      "original_path": null
    }
  ]
}
```
2. Generate label mapping file:
```json
{
  "label_to_name": {
    "0": "Acer palmatum",
    "1": "Acer rubrum",
    ...
  },
  "name_to_label": {
    "Acer palmatum": 0,
    "Acer rubrum": 1,
    ...
  },
  "label_to_common": {
    "0": "Japanese Maple",
    ...
  }
}
```
3. Create split statistics:
- Total images per split
- Classes per split
- Images per class per split
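The label mapping in step 2 falls out of sorting the class names once and enumerating; sorting keeps labels consecutive from 0 and reproducible across runs (species names here are illustrative):

```python
import json

# Sorted set of scientific names -> stable integer labels
class_names = sorted({"Quercus robur", "Acer palmatum", "Acer rubrum"})
label_to_name = {str(i): name for i, name in enumerate(class_names)}
name_to_label = {name: i for i, name in enumerate(class_names)}

mapping = {"label_to_name": label_to_name, "name_to_label": name_to_label}
with open("label_mapping.json", "w") as f:
    json.dump(mapping, f, indent=2, ensure_ascii=False)
```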
**Output:**
- `datasets/processed/train_manifest.csv`
- `datasets/processed/val_manifest.csv`
- `datasets/processed/test_manifest.csv`
- `datasets/processed/label_mapping.json`
- `output/phase3/manifest_statistics.json`
**Validation:**
- [ ] All image paths in manifests exist on disk
- [ ] Labels are consecutive integers starting from 0
- [ ] No duplicate entries in manifests
- [ ] Split sizes match expected counts
- [ ] Label mapping covers all classes
---
### Task 3.6: Validate Dataset Integrity
**Objective:** Final verification of processed dataset.
**Actions:**
1. Create `scripts/phase3/validate_dataset.py` to run comprehensive checks:
**File Integrity:**
- All manifest paths exist
- All images load without error
- All images have correct dimensions
- File permissions allow read access
**Label Consistency:**
- Labels match between manifest and directory structure
- All labels have corresponding class names
- No orphaned images (in directory but not manifest)
- No missing images (in manifest but not directory)
**Dataset Statistics:**
- Per-class image counts
- Train/val/test split ratios
- Augmented vs original ratio
- File size distribution
**Sample Verification:**
- Random sample of 100 images per split
- Verify image content matches label (using pretrained model)
- Flag potential mislabels for review
2. Create `scripts/phase3/repair_dataset.py` for common fixes:
- Remove entries with missing files
- Fix incorrect labels (with confirmation)
- Regenerate corrupted augmentations
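The file-integrity and duplicate checks can be sketched in a few lines; the function and CSV columns mirror the manifest format above, but this is an illustration, not the planned script:

```python
import csv
from pathlib import Path
from PIL import Image

def validate_manifest(manifest_csv: str, root: str, expected_size=(224, 224)) -> list:
    """Return a list of human-readable problems found in a manifest."""
    problems = []
    seen = set()
    with open(manifest_csv, newline="") as f:
        for row in csv.DictReader(f):
            if row["path"] in seen:
                problems.append(f"duplicate entry: {row['path']}")
            seen.add(row["path"])
            path = Path(root) / row["path"]
            if not path.exists():
                problems.append(f"missing file: {path}")
                continue
            try:
                with Image.open(path) as img:
                    if img.size != expected_size:
                        problems.append(f"wrong size {img.size}: {path}")
            except OSError:
                problems.append(f"unreadable image: {path}")
    return problems
```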
**Output:**
- `output/phase3/validation_report.json` - Full validation results
- `output/phase3/validation_summary.md` - Human-readable summary
- `output/phase3/flagged_for_review.json` - Potential issues
**Validation:**
- [ ] 0 missing files
- [ ] 0 corrupted images
- [ ] 0 dimension mismatches
- [ ] <1% potential mislabels flagged
- [ ] All metadata fields populated
---
## End-of-Phase Validation Checklist
Run `scripts/phase3/validate_phase3.py` to verify all criteria:
### Image Processing Validation
| # | Criterion | Target | Status |
|---|-----------|--------|--------|
| 1 | All images standardized to target size | 100% at 224x224 (or configured size) | [ ] |
| 2 | All images in RGB format | 100% RGB, 3 channels | [ ] |
| 3 | No corrupted images | 0 unreadable files | [ ] |
| 4 | Normalization applied correctly | Values in expected range | [ ] |
### Augmentation Validation
| # | Criterion | Target | Status |
|---|-----------|--------|--------|
| 5 | Augmentation pipeline functional | All transforms produce valid output | [ ] |
| 6 | Augmentation reproducible | Same seed = same output | [ ] |
| 7 | Augmentation performance | >100 images/sec on CPU | [ ] |
| 8 | Visual quality | Spot check passes (50 random samples) | [ ] |
### Class Balance Validation
| # | Criterion | Target | Status |
|---|-----------|--------|--------|
| 9 | Class imbalance ratio | < 10:1 (max:min) | [ ] |
| 10 | Minimum class size | ≥50 images per class in train | [ ] |
| 11 | Augmentation ratio | Augmented ≤ 4x original per class | [ ] |
### Manifest Validation
| # | Criterion | Target | Status |
|---|-----------|--------|--------|
| 12 | Manifest completeness | 100% images have manifest entries | [ ] |
| 13 | Path validity | 100% manifest paths exist | [ ] |
| 14 | Label consistency | Labels match directory structure | [ ] |
| 15 | No duplicates | 0 duplicate entries | [ ] |
| 16 | Label mapping complete | All labels have names | [ ] |
### Dataset Statistics
| Metric | Expected | Actual | Status |
|--------|----------|--------|--------|
| Total processed images | 50,000 - 200,000 | | [ ] |
| Training set size | ~70% of total | | [ ] |
| Validation set size | ~15% of total | | [ ] |
| Test set size | ~15% of total | | [ ] |
| Number of classes | 200 - 500 | | [ ] |
| Avg images per class (train) | 100 - 400 | | [ ] |
| Image file size (avg) | 30-100 KB | | [ ] |
| Total dataset size | 10-50 GB | | [ ] |
---
## Phase 3 Completion Checklist
- [ ] Task 3.1: Images standardized to target dimensions
- [ ] Task 3.2: Color channels normalized and formats unified
- [ ] Task 3.3: Augmentation pipeline implemented and tested
- [ ] Task 3.4: Class imbalance addressed through oversampling
- [ ] Task 3.5: Manifest files generated for all splits
- [ ] Task 3.6: Dataset integrity validated
- [ ] All 16 validation criteria pass
- [ ] Dataset statistics documented
- [ ] Augmentation config saved for reproducibility
- [ ] Ready for Phase 4 (Model Architecture Selection)
---
## Scripts Summary
| Script | Task | Input | Output |
|--------|------|-------|--------|
| `standardize_dimensions.py` | 3.1 | Raw images | Resized images |
| `normalize_images.py` | 3.2 | Resized images | Normalized images |
| `augmentation_pipeline.py` | 3.3 | Images | Transform classes |
| `analyze_class_balance.py` | 3.4 | Train manifest | Balance report |
| `oversample_minority.py` | 3.4 | Imbalanced set | Balanced set |
| `generate_manifests.py` | 3.5 | Processed images | CSV/JSON manifests |
| `validate_dataset.py` | 3.6 | Full dataset | Validation report |
| `validate_phase3.py` | Final | All outputs | Pass/Fail report |
---
## Dependencies
```
# requirements-phase3.txt
Pillow>=9.0.0
numpy>=1.24.0
albumentations>=1.3.0
torch>=2.0.0
torchvision>=0.15.0
opencv-python>=4.7.0
pandas>=2.0.0
tqdm>=4.65.0
matplotlib>=3.7.0
scikit-learn>=1.2.0
imagehash>=4.3.0
```
---
## Directory Structure After Phase 3
```
datasets/
├── raw/                      # Original downloaded images (Phase 2)
├── organized/                # Organized by species (Phase 2)
├── verified/                 # Quality-checked (Phase 2)
├── train/                    # Train split (Phase 2)
├── val/                      # Validation split (Phase 2)
├── test/                     # Test split (Phase 2)
└── processed/                # Phase 3 output
    ├── 224x224/              # Standardized size
    │   ├── train/
    │   │   └── images/
    │   ├── val/
    │   │   └── images/
    │   └── test/
    │       └── images/
    ├── balanced/             # Class-balanced training
    │   └── train/
    │       └── images/
    ├── train_manifest.csv
    ├── val_manifest.csv
    ├── test_manifest.csv
    ├── label_mapping.json
    └── augmentation_config.json
output/phase3/
├── dimension_report.json
├── color_conversion_log.json
├── augmentation_config.json
├── augmentation_samples/
├── class_balance_before.json
├── class_balance_after.json
├── balance_histogram.png
├── manifest_statistics.json
├── validation_report.json
├── validation_summary.md
└── flagged_for_review.json
```
---
## Risk Mitigation
| Risk | Mitigation |
|------|------------|
| Disk space exhaustion | Monitor disk usage, compress images, delete raw after processing |
| Memory errors with large batches | Process in batches of 1000, use memory-mapped files |
| Augmentation too aggressive | Visual review, conservative defaults, configurable parameters |
| Class imbalance persists | Multiple oversampling strategies, weighted loss in training |
| Slow processing | Multiprocessing, GPU acceleration for transforms |
| Reproducibility issues | Save all configs, use fixed random seeds, version control |
---
## Performance Optimization Tips
1. **Batch Processing:** Process images in parallel using multiprocessing
2. **Memory Efficiency:** Use generators, don't load all images at once
3. **Disk I/O:** Use SSD, batch writes, memory-mapped files
4. **Image Loading:** Use Pillow-SIMD or OpenCV for faster decoding and resizing
5. **Augmentation:** Apply on-the-fly during training (save disk space)
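Tip 1 can be sketched with the standard library; `process_one` below is a stand-in for the real per-image work, and the names are illustrative:

```python
from multiprocessing import Pool
from pathlib import Path

def process_one(path: str) -> str:
    # Stand-in for the real per-image work (resize, normalize, save)
    return path

def process_all(image_dir: str, workers: int = 8) -> list:
    """Process every .jpg under image_dir across a pool of workers."""
    paths = [str(p) for p in Path(image_dir).rglob("*.jpg")]
    with Pool(processes=workers) as pool:
        # imap_unordered streams results as workers finish, keeping memory flat
        return list(pool.imap_unordered(process_one, paths, chunksize=64))
```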
---
## Notes
- Consider saving augmentation config separately from applying augmentations
- On-the-fly augmentation during training is often preferred over pre-generating
- Keep original unaugmented test set for fair evaluation
- Document any images excluded and reasons
- Save random seeds for all operations
- Phase 4 will select model architecture based on processed dataset size