Add PlantGuide iOS app with plant identification and care management

- Implement camera capture and plant identification workflow
- Add Core Data persistence for plants, care schedules, and cached API data
- Create collection view with grid/list layouts and filtering
- Build plant detail views with care information display
- Integrate Trefle botanical API for plant care data
- Add local image storage for captured plant photos
- Implement dependency injection container for testability
- Include accessibility support throughout the app

Bug fixes in this commit:
- Fix Trefle API decoding by removing duplicate CodingKeys
- Fix LocalCachedImage to load from correct PlantImages directory
- Set dateAdded when saving plants for proper collection sorting

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Commit 136dfbae33 (parent d3ab29eb84) by Trey t, 2026-01-23 12:18:01 -06:00. 187 changed files with 69001 additions and 0 deletions.

# Phase 3: Dataset Preprocessing & Augmentation - Implementation Plan
## Overview
**Goal:** Prepare images for training with consistent formatting and augmentation pipeline.
**Prerequisites:** Phase 2 complete - `datasets/train/`, `datasets/val/`, `datasets/test/` directories with manifests
**Target Deliverable:** Training-ready dataset with standardized dimensions, normalized values, and augmentation pipeline
---
## Task Breakdown
### Task 3.1: Standardize Image Dimensions
**Objective:** Resize all images to consistent dimensions for model input.
**Actions:**
1. Create `scripts/phase3/standardize_dimensions.py` to:
- Load images from train/val/test directories
- Resize to target dimension (224x224 for MobileNetV3, 299x299 for EfficientNet)
- Preserve aspect ratio with center crop or letterboxing
- Save resized images to new directory structure
2. Support multiple output sizes:
```python
TARGET_SIZES = {
"mobilenet": (224, 224),
"efficientnet": (299, 299),
"vit": (384, 384)
}
```
3. Implement resize strategies:
- **center_crop:** Crop to square, then resize (preserves detail)
- **letterbox:** Pad to square, then resize (preserves full image)
- **stretch:** Direct resize (fastest, may distort)
4. Output directory structure:
```
datasets/
├── processed/
│ └── 224x224/
│ ├── train/
│ ├── val/
│ └── test/
```
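The three resize strategies in step 3 map directly onto Pillow's `ImageOps` helpers. A minimal sketch (`resize_with_strategy` is a name assumed here, not part of the plan's API):

```python
from PIL import Image, ImageOps

def resize_with_strategy(img: Image.Image, size: tuple, strategy: str = "center_crop") -> Image.Image:
    """Apply one of the three resize strategies described above."""
    if strategy == "stretch":
        return img.resize(size, Image.BILINEAR)          # direct resize, may distort
    if strategy == "center_crop":
        return ImageOps.fit(img, size, Image.BILINEAR)   # crop to target aspect, then resize
    if strategy == "letterbox":
        return ImageOps.pad(img, size, Image.BILINEAR, color=(0, 0, 0))  # pad to aspect, then resize
    raise ValueError(f"unknown strategy: {strategy!r}")
```

`ImageOps.fit` and `ImageOps.pad` handle the aspect-ratio bookkeeping, so `standardize_dimensions.py` only needs to choose a strategy per run.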
**Output:**
- `datasets/processed/{size}/` directories
- `output/phase3/dimension_report.json` - Processing statistics
**Validation:**
- [ ] All images in processed directory are exactly target dimensions
- [ ] No corrupt images (all readable by PIL)
- [ ] Image count matches source (no images lost)
- [ ] Processing time logged for performance baseline
---
### Task 3.2: Normalize Color Channels
**Objective:** Standardize pixel values and handle format variations.
**Actions:**
1. Create `scripts/phase3/normalize_images.py` to:
- Convert all images to RGB (handle RGBA, grayscale, CMYK)
- Apply ImageNet normalization (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) at load time in the training pipeline; normalized float values are not baked into the saved 8-bit image files
- Handle various input formats (JPEG, PNG, WebP, HEIC)
- Save as consistent format (JPEG with quality 95, or PNG for lossless)
2. Implement color normalization:
```python
import numpy as np

def normalize_image(image: np.ndarray) -> np.ndarray:
    """Normalize an RGB uint8 image for model input (ImageNet statistics)."""
    image = image.astype(np.float32) / 255.0
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    return (image - mean) / std
```
3. Create preprocessing pipeline class:
```python
from PIL import Image

class ImagePreprocessor:
    def __init__(self, target_size, normalize=True):
        self.target_size = target_size
        self.normalize = normalize

    def __call__(self, image_path: str) -> np.ndarray:
        # Load, resize to target, convert to RGB, then optionally normalize
        image = Image.open(image_path).convert("RGB")
        image = image.resize(self.target_size, Image.BILINEAR)
        array = np.asarray(image)
        return normalize_image(array) if self.normalize else array
```
4. Handle edge cases:
- Grayscale → convert to RGB by duplicating channels
- RGBA → remove alpha channel, composite on white
- CMYK → convert to RGB color space
- 16-bit images → convert to 8-bit
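These edge cases collapse into one small helper. A sketch (`to_rgb` is a name assumed here); RGBA is the only case needing special handling, since a plain `convert("RGB")` would composite transparent pixels on black:

```python
from PIL import Image

def to_rgb(img: Image.Image) -> Image.Image:
    """Convert any input mode to 3-channel RGB per the edge cases above."""
    if img.mode == "RGBA":
        background = Image.new("RGB", img.size, (255, 255, 255))
        background.paste(img, mask=img.getchannel("A"))  # composite on white using alpha
        return background
    # Grayscale, palette, CMYK, and 16-bit modes are handled by Pillow's convert
    return img if img.mode == "RGB" else img.convert("RGB")
```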
**Output:**
- Updated processed images with consistent color handling
- `output/phase3/color_conversion_log.json` - Format conversion statistics
**Validation:**
- [ ] All images have exactly 3 color channels (RGB)
- [ ] Pixel values in expected range after normalization
- [ ] No format conversion errors
- [ ] Color fidelity maintained (visual spot check on 50 random images)
---
### Task 3.3: Implement Data Augmentation Pipeline
**Objective:** Create augmentation transforms to increase training data variety.
**Actions:**
1. Create `scripts/phase3/augmentation_pipeline.py` with transforms:
**Geometric Transforms:**
- Random rotation: -30° to +30°
- Random horizontal flip: 50% probability
- Random vertical flip: 10% probability (hanging or trailing plants make inverted orientations plausible)
- Random crop: 80-100% of image, then resize back
- Random perspective: slight perspective distortion
**Color Transforms:**
- Random brightness: ±20%
- Random contrast: ±20%
- Random saturation: ±30%
- Random hue shift: ±10%
- Color jitter (combined)
**Blur/Noise Transforms:**
- Gaussian blur: kernel 3-7, 30% probability
- Motion blur: 10% probability
- Gaussian noise: σ=0.01-0.05, 20% probability
**Occlusion Transforms:**
- Random erasing (cutout): 10-30% area, 20% probability
- Grid dropout: 10% probability
2. Implement using PyTorch or Albumentations:
```python
import albumentations as A
train_transform = A.Compose([
A.RandomResizedCrop(224, 224, scale=(0.8, 1.0)),
A.HorizontalFlip(p=0.5),
A.Rotate(limit=30, p=0.5),
A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.3, hue=0.1),
A.GaussianBlur(blur_limit=(3, 7), p=0.3),
A.CoarseDropout(max_holes=8, max_height=16, max_width=16, p=0.2),
A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
ToTensorV2(),
])
val_transform = A.Compose([
A.Resize(256, 256),
A.CenterCrop(224, 224),
A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
ToTensorV2(),
])
```
3. Create visualization tool for augmentation preview:
```python
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

def visualize_augmentations(image_path, transform, n_samples=9):
    """Show grid of augmented versions of same image (use a transform without Normalize/ToTensorV2)."""
    image = np.array(Image.open(image_path).convert("RGB"))
    _, axes = plt.subplots(3, 3)  # 3x3 grid assumes n_samples <= 9
    for ax in axes.flat[:n_samples]:
        ax.imshow(transform(image=image)["image"])
        ax.axis("off")
    plt.show()
```
4. Save augmentation configuration to JSON for reproducibility
**Output:**
- `scripts/phase3/augmentation_pipeline.py` - Reusable transform classes
- `output/phase3/augmentation_config.json` - Transform parameters
- `output/phase3/augmentation_samples/` - Visual examples
**Validation:**
- [ ] All augmentations produce valid images (no NaN, no corruption)
- [ ] Augmented images visually reasonable (not over-augmented)
- [ ] Transforms are deterministic when seeded
- [ ] Pipeline runs at >100 images/second on CPU
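For the determinism criterion: Albumentations draws its randomness from Python's `random` module and NumPy's global RNG, so seeding both before applying the pipeline makes runs repeatable. A minimal sketch (`seed_pipeline` is a name assumed here):

```python
import random
import numpy as np

def seed_pipeline(seed: int = 42) -> None:
    """Seed the RNGs the augmentation pipeline draws from, for reproducible runs."""
    random.seed(seed)     # Python stdlib RNG
    np.random.seed(seed)  # NumPy global RNG
```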
---
### Task 3.4: Balance Underrepresented Classes
**Objective:** Create augmented variants to address class imbalance.
**Actions:**
1. Create `scripts/phase3/analyze_class_balance.py` to:
- Count images per class in training set
- Calculate imbalance ratio (max_class / min_class)
- Identify underrepresented classes (below median - 1 std)
- Visualize class distribution
2. Create `scripts/phase3/oversample_minority.py` to:
- Define target samples per class (e.g., median count)
- Generate augmented copies for minority classes
- Apply stronger augmentation for synthetic samples
- Track original vs augmented counts
3. Implement oversampling strategies:
```python
import numpy as np

class BalancingStrategy:
    """Strategies for handling class imbalance."""

    @staticmethod
    def oversample_to_median(class_counts: dict) -> dict:
        """Oversample minority classes to median count."""
        median = np.median(list(class_counts.values()))
        targets = {}
        for cls, count in class_counts.items():
            targets[cls] = max(int(median), count)
        return targets

    @staticmethod
    def oversample_to_max(class_counts: dict, cap_ratio=5) -> dict:
        """Oversample to max, capped at ratio times original."""
        max_count = max(class_counts.values())
        targets = {}
        for cls, count in class_counts.items():
            targets[cls] = min(max_count, count * cap_ratio)
        return targets
```
4. Generate balanced training manifest:
- Include original images
- Add paths to augmented copies
- Mark augmented images in manifest (for analysis)
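Given targets from one of the strategies above, the number of augmented copies to generate per class is just the shortfall. A minimal sketch (`augmentation_plan` is a hypothetical helper, not part of the plan's API):

```python
def augmentation_plan(class_counts: dict, targets: dict) -> dict:
    """Augmented copies to generate per class: target minus existing count, floored at 0."""
    return {cls: max(targets[cls] - count, 0) for cls, count in class_counts.items()}

# e.g. with 40 acer_rubrum images and a target of 100, the plan asks for 60 augmented copies
```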
**Output:**
- `datasets/processed/balanced/train/` - Balanced training set
- `output/phase3/class_balance_before.json` - Original distribution
- `output/phase3/class_balance_after.json` - Balanced distribution
- `output/phase3/balance_histogram.png` - Visual comparison
**Validation:**
- [ ] Imbalance ratio reduced to < 10:1 (max:min)
- [ ] No class has fewer than 50 training samples
- [ ] Augmented images are visually distinct from originals
- [ ] Total training set size documented
---
### Task 3.5: Generate Image Manifest Files
**Objective:** Create mapping files for training pipeline.
**Actions:**
1. Create `scripts/phase3/generate_manifests.py` to produce:
**CSV Format (PyTorch ImageFolder compatible):**
```csv
path,label,scientific_name,plant_id,source,is_augmented
train/images/quercus_robur_001.jpg,42,Quercus robur,QR001,inaturalist,false
train/images/quercus_robur_002_aug.jpg,42,Quercus robur,QR001,augmented,true
```
**JSON Format (detailed metadata):**
```json
{
"train": [
{
"path": "train/images/quercus_robur_001.jpg",
"label": 42,
"scientific_name": "Quercus robur",
"common_name": "English Oak",
"plant_id": "QR001",
"source": "inaturalist",
"is_augmented": false,
"original_path": null
}
]
}
```
2. Generate label mapping file:
```json
{
"label_to_name": {
"0": "Acer palmatum",
"1": "Acer rubrum",
...
},
"name_to_label": {
"Acer palmatum": 0,
"Acer rubrum": 1,
...
},
"label_to_common": {
"0": "Japanese Maple",
...
}
}
```
3. Create split statistics:
- Total images per split
- Classes per split
- Images per class per split
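The label mapping in step 2 must assign consecutive integers starting at 0 (a manifest validation criterion below). One way to derive it from sorted class names, as a sketch (`build_label_mapping` is a name assumed here):

```python
def build_label_mapping(scientific_names: list) -> dict:
    """Build label<->name maps with consecutive integer labels starting at 0."""
    names = sorted(set(scientific_names))  # deterministic ordering across runs
    return {
        "label_to_name": {str(i): name for i, name in enumerate(names)},
        "name_to_label": {name: i for i, name in enumerate(names)},
    }
```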
**Output:**
- `datasets/processed/train_manifest.csv`
- `datasets/processed/val_manifest.csv`
- `datasets/processed/test_manifest.csv`
- `datasets/processed/label_mapping.json`
- `output/phase3/manifest_statistics.json`
**Validation:**
- [ ] All image paths in manifests exist on disk
- [ ] Labels are consecutive integers starting from 0
- [ ] No duplicate entries in manifests
- [ ] Split sizes match expected counts
- [ ] Label mapping covers all classes
---
### Task 3.6: Validate Dataset Integrity
**Objective:** Final verification of processed dataset.
**Actions:**
1. Create `scripts/phase3/validate_dataset.py` to run comprehensive checks:
**File Integrity:**
- All manifest paths exist
- All images load without error
- All images have correct dimensions
- File permissions allow read access
**Label Consistency:**
- Labels match between manifest and directory structure
- All labels have corresponding class names
- No orphaned images (in directory but not manifest)
- No missing images (in manifest but not directory)
**Dataset Statistics:**
- Per-class image counts
- Train/val/test split ratios
- Augmented vs original ratio
- File size distribution
**Sample Verification:**
- Random sample of 100 images per split
- Verify image content matches label (using pretrained model)
- Flag potential mislabels for review
2. Create `scripts/phase3/repair_dataset.py` for common fixes:
- Remove entries with missing files
- Fix incorrect labels (with confirmation)
- Regenerate corrupted augmentations
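The first file-integrity check (all manifest paths exist) reduces to a few lines against the CSV manifest format from Task 3.5. A sketch (`missing_manifest_paths` is a hypothetical helper):

```python
import csv
from pathlib import Path

def missing_manifest_paths(manifest_csv: str, root: str) -> list:
    """Return manifest entries whose image file does not exist under root."""
    with open(manifest_csv, newline="") as f:
        return [row["path"] for row in csv.DictReader(f)
                if not (Path(root) / row["path"]).is_file()]
```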
**Output:**
- `output/phase3/validation_report.json` - Full validation results
- `output/phase3/validation_summary.md` - Human-readable summary
- `output/phase3/flagged_for_review.json` - Potential issues
**Validation:**
- [ ] 0 missing files
- [ ] 0 corrupted images
- [ ] 0 dimension mismatches
- [ ] <1% potential mislabels flagged
- [ ] All metadata fields populated
---
## End-of-Phase Validation Checklist
Run `scripts/phase3/validate_phase3.py` to verify all criteria:
### Image Processing Validation
| # | Criterion | Target | Status |
|---|-----------|--------|--------|
| 1 | All images standardized to target size | 100% at 224x224 (or configured size) | [ ] |
| 2 | All images in RGB format | 100% RGB, 3 channels | [ ] |
| 3 | No corrupted images | 0 unreadable files | [ ] |
| 4 | Normalization applied correctly | Values in expected range | [ ] |
### Augmentation Validation
| # | Criterion | Target | Status |
|---|-----------|--------|--------|
| 5 | Augmentation pipeline functional | All transforms produce valid output | [ ] |
| 6 | Augmentation reproducible | Same seed = same output | [ ] |
| 7 | Augmentation performance | >100 images/sec on CPU | [ ] |
| 8 | Visual quality | Spot check passes (50 random samples) | [ ] |
### Class Balance Validation
| # | Criterion | Target | Status |
|---|-----------|--------|--------|
| 9 | Class imbalance ratio | < 10:1 (max:min) | [ ] |
| 10 | Minimum class size | ≥50 images per class in train | [ ] |
| 11 | Augmentation ratio | Augmented ≤ 4x original per class | [ ] |
### Manifest Validation
| # | Criterion | Target | Status |
|---|-----------|--------|--------|
| 12 | Manifest completeness | 100% images have manifest entries | [ ] |
| 13 | Path validity | 100% manifest paths exist | [ ] |
| 14 | Label consistency | Labels match directory structure | [ ] |
| 15 | No duplicates | 0 duplicate entries | [ ] |
| 16 | Label mapping complete | All labels have names | [ ] |
### Dataset Statistics
| Metric | Expected | Actual | Status |
|--------|----------|--------|--------|
| Total processed images | 50,000 - 200,000 | | [ ] |
| Training set size | ~70% of total | | [ ] |
| Validation set size | ~15% of total | | [ ] |
| Test set size | ~15% of total | | [ ] |
| Number of classes | 200 - 500 | | [ ] |
| Avg images per class (train) | 100 - 400 | | [ ] |
| Image file size (avg) | 30-100 KB | | [ ] |
| Total dataset size | 10-50 GB | | [ ] |
---
## Phase 3 Completion Checklist
- [ ] Task 3.1: Images standardized to target dimensions
- [ ] Task 3.2: Color channels normalized and formats unified
- [ ] Task 3.3: Augmentation pipeline implemented and tested
- [ ] Task 3.4: Class imbalance addressed through oversampling
- [ ] Task 3.5: Manifest files generated for all splits
- [ ] Task 3.6: Dataset integrity validated
- [ ] All 16 validation criteria pass
- [ ] Dataset statistics documented
- [ ] Augmentation config saved for reproducibility
- [ ] Ready for Phase 4 (Model Architecture Selection)
---
## Scripts Summary
| Script | Task | Input | Output |
|--------|------|-------|--------|
| `standardize_dimensions.py` | 3.1 | Raw images | Resized images |
| `normalize_images.py` | 3.2 | Resized images | Normalized images |
| `augmentation_pipeline.py` | 3.3 | Images | Transform classes |
| `analyze_class_balance.py` | 3.4 | Train manifest | Balance report |
| `oversample_minority.py` | 3.4 | Imbalanced set | Balanced set |
| `generate_manifests.py` | 3.5 | Processed images | CSV/JSON manifests |
| `validate_dataset.py` | 3.6 | Full dataset | Validation report |
| `validate_phase3.py` | Final | All outputs | Pass/Fail report |
---
## Dependencies
```
# requirements-phase3.txt
Pillow>=9.0.0
numpy>=1.24.0
albumentations>=1.3.0
torch>=2.0.0
torchvision>=0.15.0
opencv-python>=4.7.0
pandas>=2.0.0
tqdm>=4.65.0
matplotlib>=3.7.0
scikit-learn>=1.2.0
imagehash>=4.3.0
```
---
## Directory Structure After Phase 3
```
datasets/
├── raw/ # Original downloaded images (Phase 2)
├── organized/ # Organized by species (Phase 2)
├── verified/ # Quality-checked (Phase 2)
├── train/ # Train split (Phase 2)
├── val/ # Validation split (Phase 2)
├── test/ # Test split (Phase 2)
└── processed/ # Phase 3 output
├── 224x224/ # Standardized size
│ ├── train/
│ │ └── images/
│ ├── val/
│ │ └── images/
│ └── test/
│ └── images/
├── balanced/ # Class-balanced training
│ └── train/
│ └── images/
├── train_manifest.csv
├── val_manifest.csv
├── test_manifest.csv
├── label_mapping.json
└── augmentation_config.json
output/phase3/
├── dimension_report.json
├── color_conversion_log.json
├── augmentation_config.json
├── augmentation_samples/
├── class_balance_before.json
├── class_balance_after.json
├── balance_histogram.png
├── manifest_statistics.json
├── validation_report.json
├── validation_summary.md
└── flagged_for_review.json
```
---
## Risk Mitigation
| Risk | Mitigation |
|------|------------|
| Disk space exhaustion | Monitor disk usage, compress images, delete raw after processing |
| Memory errors with large batches | Process in batches of 1000, use memory-mapped files |
| Augmentation too aggressive | Visual review, conservative defaults, configurable parameters |
| Class imbalance persists | Multiple oversampling strategies, weighted loss in training |
| Slow processing | Multiprocessing, GPU acceleration for transforms |
| Reproducibility issues | Save all configs, use fixed random seeds, version control |
---
## Performance Optimization Tips
1. **Batch Processing:** Process images in parallel using multiprocessing
2. **Memory Efficiency:** Use generators, don't load all images at once
3. **Disk I/O:** Use SSD, batch writes, memory-mapped files
4. **Image Loading:** Use Pillow-SIMD or OpenCV for faster decoding and resizing
5. **Augmentation:** Apply on-the-fly during training (saves disk space)
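Tip 1 as a standard-library sketch; `fn` must be a picklable, top-level function, and `process_in_parallel` is a name assumed here:

```python
from multiprocessing import Pool

def process_in_parallel(paths, fn, workers=8, chunksize=64):
    """Apply a per-image function across many files using a worker pool."""
    with Pool(workers) as pool:
        # chunksize batches work items, reducing inter-process overhead
        return pool.map(fn, paths, chunksize=chunksize)
```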
---
## Notes
- Consider saving augmentation config separately from applying augmentations
- On-the-fly augmentation during training is often preferred over pre-generating
- Keep original unaugmented test set for fair evaluation
- Document any images excluded and reasons
- Save random seeds for all operations
- Phase 4 will select model architecture based on processed dataset size