Add PlantGuide iOS app with plant identification and care management
- Implement camera capture and plant identification workflow
- Add Core Data persistence for plants, care schedules, and cached API data
- Create collection view with grid/list layouts and filtering
- Build plant detail views with care information display
- Integrate Trefle botanical API for plant care data
- Add local image storage for captured plant photos
- Implement dependency injection container for testability
- Include accessibility support throughout the app

Bug fixes in this commit:
- Fix Trefle API decoding by removing duplicate CodingKeys
- Fix LocalCachedImage to load from correct PlantImages directory
- Set dateAdded when saving plants for proper collection sorting

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# Phase 3: Dataset Preprocessing & Augmentation - Implementation Plan

## Overview

**Goal:** Prepare images for training with consistent formatting and an augmentation pipeline.

**Prerequisites:** Phase 2 complete - `datasets/train/`, `datasets/val/`, `datasets/test/` directories with manifests

**Target Deliverable:** Training-ready dataset with standardized dimensions, normalized values, and augmentation pipeline

---
## Task Breakdown

### Task 3.1: Standardize Image Dimensions

**Objective:** Resize all images to consistent dimensions for model input.

**Actions:**

1. Create `scripts/phase3/standardize_dimensions.py` to:
   - Load images from train/val/test directories
   - Resize to target dimension (224x224 for MobileNetV3, 299x299 for EfficientNet)
   - Preserve aspect ratio with center crop or letterboxing
   - Save resized images to new directory structure

2. Support multiple output sizes:

   ```python
   TARGET_SIZES = {
       "mobilenet": (224, 224),
       "efficientnet": (299, 299),
       "vit": (384, 384)
   }
   ```

3. Implement resize strategies:
   - **center_crop:** Crop to square, then resize (preserves detail)
   - **letterbox:** Pad to square, then resize (preserves full image)
   - **stretch:** Direct resize (fastest, may distort)

4. Output directory structure:

   ```
   datasets/
   ├── processed/
   │   └── 224x224/
   │       ├── train/
   │       ├── val/
   │       └── test/
   ```
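The first two strategies come down to a few lines of Pillow each; a minimal sketch (function names and the black `fill` default are illustrative, not part of the planned script):

```python
from PIL import Image

def resize_center_crop(img: Image.Image, size: int) -> Image.Image:
    """Crop the largest centered square, then resize (preserves detail)."""
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    square = img.crop((left, top, left + side, top + side))
    return square.resize((size, size))

def resize_letterbox(img: Image.Image, size: int, fill=(0, 0, 0)) -> Image.Image:
    """Scale to fit inside a square, then pad (preserves the full image)."""
    w, h = img.size
    scale = size / max(w, h)
    new_w, new_h = max(1, round(w * scale)), max(1, round(h * scale))
    scaled = img.resize((new_w, new_h))
    canvas = Image.new("RGB", (size, size), fill)
    canvas.paste(scaled, ((size - new_w) // 2, (size - new_h) // 2))
    return canvas
```

The **stretch** strategy is simply `img.resize((size, size))` with no cropping or padding.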
**Output:**
- `datasets/processed/{size}/` directories
- `output/phase3/dimension_report.json` - Processing statistics

**Validation:**
- [ ] All images in processed directory are exactly target dimensions
- [ ] No corrupt images (all readable by PIL)
- [ ] Image count matches source (no images lost)
- [ ] Processing time logged for performance baseline
---
### Task 3.2: Normalize Color Channels

**Objective:** Standardize pixel values and handle format variations.

**Actions:**

1. Create `scripts/phase3/normalize_images.py` to:
   - Convert all images to RGB (handle RGBA, grayscale, CMYK)
   - Apply ImageNet normalization (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
   - Handle various input formats (JPEG, PNG, WebP, HEIC)
   - Save as a consistent format (JPEG at quality 95, or PNG for lossless)

2. Implement color normalization:

   ```python
   import numpy as np

   def normalize_image(image: np.ndarray) -> np.ndarray:
       """Normalize image for model input."""
       image = image.astype(np.float32) / 255.0
       mean = np.array([0.485, 0.456, 0.406])
       std = np.array([0.229, 0.224, 0.225])
       return (image - mean) / std
   ```

3. Create a preprocessing pipeline class:

   ```python
   import numpy as np
   from PIL import Image

   IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
   IMAGENET_STD = np.array([0.229, 0.224, 0.225])

   class ImagePreprocessor:
       def __init__(self, target_size, normalize=True):
           self.target_size = target_size
           self.normalize = normalize

       def __call__(self, image_path: str) -> np.ndarray:
           # Load, resize to target, convert to RGB, normalize
           image = Image.open(image_path).convert("RGB").resize(self.target_size)
           array = np.asarray(image, dtype=np.float32) / 255.0
           if self.normalize:
               array = (array - IMAGENET_MEAN) / IMAGENET_STD
           return array
   ```

4. Handle edge cases:
   - Grayscale → convert to RGB by duplicating channels
   - RGBA → remove alpha channel, composite on white
   - CMYK → convert to RGB color space
   - 16-bit images → convert to 8-bit
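These edge cases fit in one small helper; a sketch using Pillow (the `to_rgb` name is illustrative). Compositing on white goes through an explicit paste so fully transparent pixels end up white rather than black:

```python
from PIL import Image

def to_rgb(img: Image.Image) -> Image.Image:
    """Force any input mode down to 3-channel, 8-bit RGB."""
    if img.mode == "P":
        img = img.convert("RGBA")  # palette images may carry transparency
    if img.mode in ("RGBA", "LA"):
        # Composite on a white background using the alpha channel as mask
        background = Image.new("RGB", img.size, (255, 255, 255))
        background.paste(img, mask=img.getchannel("A"))
        return background
    # Grayscale and CMYK reduce cleanly via convert(); true 16-bit
    # inputs may need an explicit point() scale to 8-bit first
    return img.convert("RGB")
```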
**Output:**
- Updated processed images with consistent color handling
- `output/phase3/color_conversion_log.json` - Format conversion statistics

**Validation:**
- [ ] All images have exactly 3 color channels (RGB)
- [ ] Pixel values in expected range after normalization
- [ ] No format conversion errors
- [ ] Color fidelity maintained (visual spot check on 50 random images)
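The "expected range" follows directly from the normalization formula: for pixels in [0, 1], channel c spans [(0 − mean_c)/std_c, (1 − mean_c)/std_c], roughly −2.12 to 2.64 across the ImageNet channels:

```python
import numpy as np

mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

low = (0.0 - mean) / std   # per-channel minimum after normalization
high = (1.0 - mean) / std  # per-channel maximum after normalization
# low  ≈ [-2.118, -2.036, -1.804]
# high ≈ [ 2.249,  2.429,  2.640]
```

A range check in `normalize_images.py` can then flag any output falling outside these bounds as a processing error.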
---
### Task 3.3: Implement Data Augmentation Pipeline

**Objective:** Create augmentation transforms to increase training data variety.

**Actions:**

1. Create `scripts/phase3/augmentation_pipeline.py` with transforms:

   **Geometric Transforms:**
   - Random rotation: -30° to +30°
   - Random horizontal flip: 50% probability
   - Random vertical flip: 10% probability (some plants are naturally upside-down)
   - Random crop: 80-100% of image, then resize back
   - Random perspective: slight perspective distortion

   **Color Transforms:**
   - Random brightness: ±20%
   - Random contrast: ±20%
   - Random saturation: ±30%
   - Random hue shift: ±10%
   - Color jitter (combined)

   **Blur/Noise Transforms:**
   - Gaussian blur: kernel 3-7, 30% probability
   - Motion blur: 10% probability
   - Gaussian noise: σ=0.01-0.05, 20% probability

   **Occlusion Transforms:**
   - Random erasing (cutout): 10-30% area, 20% probability
   - Grid dropout: 10% probability

2. Implement using PyTorch or Albumentations:

   ```python
   import albumentations as A
   from albumentations.pytorch import ToTensorV2

   train_transform = A.Compose([
       A.RandomResizedCrop(224, 224, scale=(0.8, 1.0)),
       A.HorizontalFlip(p=0.5),
       A.Rotate(limit=30, p=0.5),
       A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.3, hue=0.1),
       A.GaussianBlur(blur_limit=(3, 7), p=0.3),
       A.CoarseDropout(max_holes=8, max_height=16, max_width=16, p=0.2),
       A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
       ToTensorV2(),
   ])

   val_transform = A.Compose([
       A.Resize(256, 256),
       A.CenterCrop(224, 224),
       A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
       ToTensorV2(),
   ])
   ```

3. Create a visualization tool for augmentation preview:

   ```python
   def visualize_augmentations(image_path, transform, n_samples=9):
       """Show grid of augmented versions of same image."""
       pass
   ```

4. Save the augmentation configuration to JSON for reproducibility

**Output:**
- `scripts/phase3/augmentation_pipeline.py` - Reusable transform classes
- `output/phase3/augmentation_config.json` - Transform parameters
- `output/phase3/augmentation_samples/` - Visual examples

**Validation:**
- [ ] All augmentations produce valid images (no NaN, no corruption)
- [ ] Augmented images visually reasonable (not over-augmented)
- [ ] Transforms are deterministic when seeded
- [ ] Pipeline runs at >100 images/second on CPU
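The "deterministic when seeded" check can be exercised with a toy stand-in for the pipeline; the same idea applies to the real transforms (Albumentations typically draws from the global `random`/NumPy state, so seed those before each run). The `toy_augment` function below is illustrative only:

```python
import numpy as np

def toy_augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Stand-in transform: random horizontal flip plus ±20% brightness."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]            # horizontal flip
    out = out * rng.uniform(0.8, 1.2)  # brightness jitter
    return np.clip(out, 0.0, 1.0)

img = np.random.default_rng(0).random((8, 8, 3))
run_a = toy_augment(img, np.random.default_rng(seed=123))
run_b = toy_augment(img, np.random.default_rng(seed=123))
deterministic = np.array_equal(run_a, run_b)  # same seed, same output
```

Passing an explicit generator (rather than relying on hidden global state) is what makes the reproducibility check trivial to write.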
---
### Task 3.4: Balance Underrepresented Classes

**Objective:** Create augmented variants to address class imbalance.

**Actions:**

1. Create `scripts/phase3/analyze_class_balance.py` to:
   - Count images per class in the training set
   - Calculate imbalance ratio (max_class / min_class)
   - Identify underrepresented classes (below median - 1 std)
   - Visualize class distribution

2. Create `scripts/phase3/oversample_minority.py` to:
   - Define target samples per class (e.g., median count)
   - Generate augmented copies for minority classes
   - Apply stronger augmentation for synthetic samples
   - Track original vs. augmented counts

3. Implement oversampling strategies:

   ```python
   import numpy as np

   class BalancingStrategy:
       """Strategies for handling class imbalance."""

       @staticmethod
       def oversample_to_median(class_counts: dict) -> dict:
           """Oversample minority classes to median count."""
           median = np.median(list(class_counts.values()))
           targets = {}
           for cls, count in class_counts.items():
               targets[cls] = max(int(median), count)
           return targets

       @staticmethod
       def oversample_to_max(class_counts: dict, cap_ratio=5) -> dict:
           """Oversample to max, capped at cap_ratio times original."""
           max_count = max(class_counts.values())
           targets = {}
           for cls, count in class_counts.items():
               targets[cls] = min(max_count, count * cap_ratio)
           return targets
   ```

4. Generate a balanced training manifest:
   - Include original images
   - Add paths to augmented copies
   - Mark augmented images in manifest (for analysis)

**Output:**
- `datasets/processed/balanced/train/` - Balanced training set
- `output/phase3/class_balance_before.json` - Original distribution
- `output/phase3/class_balance_after.json` - Balanced distribution
- `output/phase3/balance_histogram.png` - Visual comparison

**Validation:**
- [ ] Imbalance ratio reduced to < 10:1 (max:min)
- [ ] No class has fewer than 50 training samples
- [ ] Augmented images are visually distinct from originals
- [ ] Total training set size documented
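A quick worked example of the median strategy (class names and counts are illustrative): the per-class targets translate into a count of augmented copies to generate for each class:

```python
import numpy as np

def copies_needed(class_counts: dict, targets: dict) -> dict:
    """Number of augmented images to generate per class."""
    return {cls: max(0, targets[cls] - n) for cls, n in class_counts.items()}

counts = {"quercus_robur": 40, "acer_palmatum": 120, "acer_rubrum": 300}
median = int(np.median(list(counts.values())))            # median is 120
targets = {cls: max(median, n) for cls, n in counts.items()}
extra = copies_needed(counts, targets)
# extra == {"quercus_robur": 80, "acer_palmatum": 0, "acer_rubrum": 0}
```

Only the minority class gets synthetic samples; classes at or above the median are left untouched, which keeps the augmented-to-original ratio bounded.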
---
### Task 3.5: Generate Image Manifest Files

**Objective:** Create mapping files for the training pipeline.

**Actions:**

1. Create `scripts/phase3/generate_manifests.py` to produce:

   **CSV Format (for a CSV-driven PyTorch Dataset):**

   ```csv
   path,label,scientific_name,plant_id,source,is_augmented
   train/images/quercus_robur_001.jpg,42,Quercus robur,QR001,inaturalist,false
   train/images/quercus_robur_002_aug.jpg,42,Quercus robur,QR001,augmented,true
   ```

   **JSON Format (detailed metadata):**

   ```json
   {
     "train": [
       {
         "path": "train/images/quercus_robur_001.jpg",
         "label": 42,
         "scientific_name": "Quercus robur",
         "common_name": "English Oak",
         "plant_id": "QR001",
         "source": "inaturalist",
         "is_augmented": false,
         "original_path": null
       }
     ]
   }
   ```

2. Generate a label mapping file:

   ```json
   {
     "label_to_name": {
       "0": "Acer palmatum",
       "1": "Acer rubrum",
       ...
     },
     "name_to_label": {
       "Acer palmatum": 0,
       "Acer rubrum": 1,
       ...
     },
     "label_to_common": {
       "0": "Japanese Maple",
       ...
     }
   }
   ```

3. Create split statistics:
   - Total images per split
   - Classes per split
   - Images per class per split

**Output:**
- `datasets/processed/train_manifest.csv`
- `datasets/processed/val_manifest.csv`
- `datasets/processed/test_manifest.csv`
- `datasets/processed/label_mapping.json`
- `output/phase3/manifest_statistics.json`

**Validation:**
- [ ] All image paths in manifests exist on disk
- [ ] Labels are consecutive integers starting from 0
- [ ] No duplicate entries in manifests
- [ ] Split sizes match expected counts
- [ ] Label mapping covers all classes
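A sketch of the label-assignment logic behind `generate_manifests.py` (the records are illustrative): sorting the unique class names before numbering is what makes labels consecutive integers from 0 and stable across runs:

```python
import csv
import io

records = [
    {"path": "train/images/quercus_robur_001.jpg", "scientific_name": "Quercus robur",
     "plant_id": "QR001", "source": "inaturalist", "is_augmented": "false"},
    {"path": "train/images/acer_palmatum_001.jpg", "scientific_name": "Acer palmatum",
     "plant_id": "AP001", "source": "inaturalist", "is_augmented": "false"},
]

# Stable label assignment: sorted unique class names, numbered from 0
classes = sorted({r["scientific_name"] for r in records})
name_to_label = {name: i for i, name in enumerate(classes)}
label_to_name = {str(i): name for name, i in name_to_label.items()}

fields = ["path", "label", "scientific_name", "plant_id", "source", "is_augmented"]
buf = io.StringIO()  # the real script would write to the manifest files
writer = csv.DictWriter(buf, fieldnames=fields)
writer.writeheader()
for r in records:
    writer.writerow({**r, "label": name_to_label[r["scientific_name"]]})
manifest_csv = buf.getvalue()
```

The same `name_to_label`/`label_to_name` pair, serialized with `json.dump`, becomes `label_mapping.json`.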
---
### Task 3.6: Validate Dataset Integrity

**Objective:** Final verification of the processed dataset.

**Actions:**

1. Create `scripts/phase3/validate_dataset.py` to run comprehensive checks:

   **File Integrity:**
   - All manifest paths exist
   - All images load without error
   - All images have correct dimensions
   - File permissions allow read access

   **Label Consistency:**
   - Labels match between manifest and directory structure
   - All labels have corresponding class names
   - No orphaned images (in directory but not manifest)
   - No missing images (in manifest but not directory)

   **Dataset Statistics:**
   - Per-class image counts
   - Train/val/test split ratios
   - Augmented vs. original ratio
   - File size distribution

   **Sample Verification:**
   - Random sample of 100 images per split
   - Verify image content matches label (using a pretrained model)
   - Flag potential mislabels for review

2. Create `scripts/phase3/repair_dataset.py` for common fixes:
   - Remove entries with missing files
   - Fix incorrect labels (with confirmation)
   - Regenerate corrupted augmentations

**Output:**
- `output/phase3/validation_report.json` - Full validation results
- `output/phase3/validation_summary.md` - Human-readable summary
- `output/phase3/flagged_for_review.json` - Potential issues

**Validation:**
- [ ] 0 missing files
- [ ] 0 corrupted images
- [ ] 0 dimension mismatches
- [ ] <1% potential mislabels flagged
- [ ] All metadata fields populated
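The file-existence and duplicate checks reduce to a few set operations; a minimal sketch (`check_manifest` is an illustrative name, and the real script would also open each image to catch corruption):

```python
from pathlib import Path

def check_manifest(entries: list, root: Path) -> dict:
    """Report missing files and duplicate entries in a manifest."""
    paths = [e["path"] for e in entries]
    missing = [p for p in paths if not (root / p).is_file()]
    seen = set()
    duplicates = []
    for p in paths:
        if p in seen:
            duplicates.append(p)
        seen.add(p)
    return {"missing": missing, "duplicates": duplicates,
            "ok": not missing and not duplicates}
```

The returned dict slots straight into `validation_report.json`; the "0 missing files" and "no duplicate entries" criteria are just `report["ok"]`.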
---
## End-of-Phase Validation Checklist

Run `scripts/phase3/validate_phase3.py` to verify all criteria:

### Image Processing Validation

| # | Criterion | Target | Status |
|---|-----------|--------|--------|
| 1 | All images standardized to target size | 100% at 224x224 (or configured size) | [ ] |
| 2 | All images in RGB format | 100% RGB, 3 channels | [ ] |
| 3 | No corrupted images | 0 unreadable files | [ ] |
| 4 | Normalization applied correctly | Values in expected range | [ ] |

### Augmentation Validation

| # | Criterion | Target | Status |
|---|-----------|--------|--------|
| 5 | Augmentation pipeline functional | All transforms produce valid output | [ ] |
| 6 | Augmentation reproducible | Same seed = same output | [ ] |
| 7 | Augmentation performance | >100 images/sec on CPU | [ ] |
| 8 | Visual quality | Spot check passes (50 random samples) | [ ] |

### Class Balance Validation

| # | Criterion | Target | Status |
|---|-----------|--------|--------|
| 9 | Class imbalance ratio | < 10:1 (max:min) | [ ] |
| 10 | Minimum class size | ≥50 images per class in train | [ ] |
| 11 | Augmentation ratio | Augmented ≤ 4x original per class | [ ] |

### Manifest Validation

| # | Criterion | Target | Status |
|---|-----------|--------|--------|
| 12 | Manifest completeness | 100% images have manifest entries | [ ] |
| 13 | Path validity | 100% manifest paths exist | [ ] |
| 14 | Label consistency | Labels match directory structure | [ ] |
| 15 | No duplicates | 0 duplicate entries | [ ] |
| 16 | Label mapping complete | All labels have names | [ ] |

### Dataset Statistics

| Metric | Expected | Actual | Status |
|--------|----------|--------|--------|
| Total processed images | 50,000 - 200,000 | | [ ] |
| Training set size | ~70% of total | | [ ] |
| Validation set size | ~15% of total | | [ ] |
| Test set size | ~15% of total | | [ ] |
| Number of classes | 200 - 500 | | [ ] |
| Avg images per class (train) | 100 - 400 | | [ ] |
| Image file size (avg) | 30-100 KB | | [ ] |
| Total dataset size | 10-50 GB | | [ ] |
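As one concrete piece of `validate_phase3.py` (function names are illustrative), the split-size rows reduce to a tolerance check on the ratios:

```python
def split_ratios(split_counts: dict) -> dict:
    """Fraction of the total dataset in each split."""
    total = sum(split_counts.values())
    return {name: n / total for name, n in split_counts.items()}

def splits_ok(split_counts: dict, tol: float = 0.03) -> bool:
    """Check train/val/test against the ~70/15/15 targets."""
    expected = {"train": 0.70, "val": 0.15, "test": 0.15}
    ratios = split_ratios(split_counts)
    return all(abs(ratios[name] - target) <= tol
               for name, target in expected.items())
```

The tolerance absorbs the rounding that class-stratified splitting introduces; a 3-point deviation is illustrative, not a hard requirement.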
---
## Phase 3 Completion Checklist

- [ ] Task 3.1: Images standardized to target dimensions
- [ ] Task 3.2: Color channels normalized and formats unified
- [ ] Task 3.3: Augmentation pipeline implemented and tested
- [ ] Task 3.4: Class imbalance addressed through oversampling
- [ ] Task 3.5: Manifest files generated for all splits
- [ ] Task 3.6: Dataset integrity validated
- [ ] All 16 validation criteria pass
- [ ] Dataset statistics documented
- [ ] Augmentation config saved for reproducibility
- [ ] Ready for Phase 4 (Model Architecture Selection)

---
## Scripts Summary

| Script | Task | Input | Output |
|--------|------|-------|--------|
| `standardize_dimensions.py` | 3.1 | Raw images | Resized images |
| `normalize_images.py` | 3.2 | Resized images | Normalized images |
| `augmentation_pipeline.py` | 3.3 | Images | Transform classes |
| `analyze_class_balance.py` | 3.4 | Train manifest | Balance report |
| `oversample_minority.py` | 3.4 | Imbalanced set | Balanced set |
| `generate_manifests.py` | 3.5 | Processed images | CSV/JSON manifests |
| `validate_dataset.py` | 3.6 | Full dataset | Validation report |
| `validate_phase3.py` | Final | All outputs | Pass/Fail report |

---
## Dependencies

```
# requirements-phase3.txt
Pillow>=9.0.0
numpy>=1.24.0
albumentations>=1.3.0
torch>=2.0.0
torchvision>=0.15.0
opencv-python>=4.7.0
pandas>=2.0.0
tqdm>=4.65.0
matplotlib>=3.7.0
scikit-learn>=1.2.0
imagehash>=4.3.0
```

---
## Directory Structure After Phase 3

```
datasets/
├── raw/                    # Original downloaded images (Phase 2)
├── organized/              # Organized by species (Phase 2)
├── verified/               # Quality-checked (Phase 2)
├── train/                  # Train split (Phase 2)
├── val/                    # Validation split (Phase 2)
├── test/                   # Test split (Phase 2)
└── processed/              # Phase 3 output
    ├── 224x224/            # Standardized size
    │   ├── train/
    │   │   └── images/
    │   ├── val/
    │   │   └── images/
    │   └── test/
    │       └── images/
    ├── balanced/           # Class-balanced training
    │   └── train/
    │       └── images/
    ├── train_manifest.csv
    ├── val_manifest.csv
    ├── test_manifest.csv
    ├── label_mapping.json
    └── augmentation_config.json

output/phase3/
├── dimension_report.json
├── color_conversion_log.json
├── augmentation_config.json
├── augmentation_samples/
├── class_balance_before.json
├── class_balance_after.json
├── balance_histogram.png
├── manifest_statistics.json
├── validation_report.json
├── validation_summary.md
└── flagged_for_review.json
```

---
## Risk Mitigation

| Risk | Mitigation |
|------|------------|
| Disk space exhaustion | Monitor disk usage, compress images, delete raw after processing |
| Memory errors with large batches | Process in batches of 1000, use memory-mapped files |
| Augmentation too aggressive | Visual review, conservative defaults, configurable parameters |
| Class imbalance persists | Multiple oversampling strategies, weighted loss in training |
| Slow processing | Multiprocessing, GPU acceleration for transforms |
| Reproducibility issues | Save all configs, use fixed random seeds, version control |

---
## Performance Optimization Tips

1. **Batch Processing:** Process images in parallel using multiprocessing
2. **Memory Efficiency:** Use generators; don't load all images at once
3. **Disk I/O:** Use an SSD, batch writes, and memory-mapped files
4. **Image Loading:** Use Pillow-SIMD or OpenCV for speed
5. **Augmentation:** Apply on-the-fly during training (saves disk space)

---
## Notes

- Consider saving the augmentation config separately from applying the augmentations
- On-the-fly augmentation during training is often preferred over pre-generating
- Keep the original unaugmented test set for fair evaluation
- Document any excluded images and the reasons for exclusion
- Save random seeds for all operations
- Phase 4 will select the model architecture based on the processed dataset size