# Phase 3: Dataset Preprocessing & Augmentation - Implementation Plan

## Overview

**Goal:** Prepare images for training with consistent formatting and an augmentation pipeline.

**Prerequisites:** Phase 2 complete - `datasets/train/`, `datasets/val/`, and `datasets/test/` directories with manifests.

**Target Deliverable:** Training-ready dataset with standardized dimensions, normalized values, and an augmentation pipeline.
## Task Breakdown

### Task 3.1: Standardize Image Dimensions

**Objective:** Resize all images to consistent dimensions for model input.

**Actions:**
- Create `scripts/phase3/standardize_dimensions.py` to:
  - Load images from the train/val/test directories
  - Resize to the target dimension (224x224 for MobileNetV3, 299x299 for EfficientNet)
  - Preserve aspect ratio with center crop or letterboxing
  - Save resized images to a new directory structure
- Support multiple output sizes:

  ```python
  TARGET_SIZES = {
      "mobilenet": (224, 224),
      "efficientnet": (299, 299),
      "vit": (384, 384),
  }
  ```

- Implement resize strategies:
  - `center_crop`: crop to a square, then resize (preserves detail)
  - `letterbox`: pad to a square, then resize (preserves the full image)
  - `stretch`: direct resize (fastest, may distort)
- Output directory structure:

  ```
  datasets/
  ├── processed/
  │   └── 224x224/
  │       ├── train/
  │       ├── val/
  │       └── test/
  ```
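The two aspect-preserving strategies can be sketched with Pillow; the function names below are illustrative, not the script's actual API:

```python
from PIL import Image, ImageOps

def center_crop_resize(img: Image.Image, size: int) -> Image.Image:
    """Crop the largest centered square, then resize to size x size."""
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    square = img.crop((left, top, left + side, top + side))
    return square.resize((size, size), Image.BILINEAR)

def letterbox_resize(img: Image.Image, size: int, fill=(0, 0, 0)) -> Image.Image:
    """Pad to a square with the fill color, then resize to size x size."""
    return ImageOps.pad(img, (size, size), method=Image.BILINEAR, color=fill)
```

`stretch` is just `img.resize((size, size))` with no cropping or padding.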
**Output:**
- `datasets/processed/{size}/` directories
- `output/phase3/dimension_report.json` - processing statistics

**Validation:**
- All images in processed directory are exactly target dimensions
- No corrupt images (all readable by PIL)
- Image count matches source (no images lost)
- Processing time logged for performance baseline
### Task 3.2: Normalize Color Channels

**Objective:** Standardize pixel values and handle format variations.

**Actions:**
- Create `scripts/phase3/normalize_images.py` to:
  - Convert all images to RGB (handle RGBA, grayscale, CMYK)
  - Apply ImageNet normalization at load time (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
  - Handle various input formats (JPEG, PNG, WebP, HEIC)
  - Save as a consistent format (JPEG at quality 95, or PNG for lossless)
- Implement color normalization:

  ```python
  def normalize_image(image: np.ndarray) -> np.ndarray:
      """Normalize image for model input."""
      image = image.astype(np.float32) / 255.0
      mean = np.array([0.485, 0.456, 0.406])
      std = np.array([0.229, 0.224, 0.225])
      return (image - mean) / std
  ```
- Create a preprocessing pipeline class:

  ```python
  import numpy as np
  from PIL import Image

  class ImagePreprocessor:
      def __init__(self, target_size, normalize=True):
          self.target_size = target_size
          self.normalize = normalize

      def __call__(self, image_path: str) -> np.ndarray:
          # Load, convert to RGB, resize, then optionally normalize
          image = Image.open(image_path).convert("RGB")
          image = image.resize(self.target_size, Image.BILINEAR)
          array = np.asarray(image).astype(np.float32) / 255.0
          if self.normalize:
              mean = np.array([0.485, 0.456, 0.406])
              std = np.array([0.229, 0.224, 0.225])
              array = (array - mean) / std
          return array
  ```

- Handle edge cases:
  - Grayscale → convert to RGB by duplicating channels
  - RGBA → remove the alpha channel, composite on white
  - CMYK → convert to RGB color space
  - 16-bit images → convert to 8-bit
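A sketch of these edge-case rules with Pillow; the helper name `to_rgb` is an assumption, and the RGBA branch follows the composite-on-white rule above:

```python
from PIL import Image

def to_rgb(img: Image.Image) -> Image.Image:
    """Convert any supported mode (grayscale, RGBA, CMYK, 16-bit) to 8-bit RGB."""
    if img.mode in ("I", "I;16", "I;16B", "I;16L"):
        # Scale the 16-bit integer range down to 8-bit before conversion
        img = img.convert("I").point(lambda px: px * (1 / 256)).convert("L")
    if img.mode == "RGBA":
        # Composite onto a white background using the alpha channel as the mask
        background = Image.new("RGB", img.size, (255, 255, 255))
        background.paste(img, mask=img.getchannel("A"))
        return background
    # Grayscale duplication and CMYK -> RGB are both handled by convert()
    return img.convert("RGB")
```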
**Output:**
- Updated processed images with consistent color handling
- `output/phase3/color_conversion_log.json` - format conversion statistics

**Validation:**
- All images have exactly 3 color channels (RGB)
- Pixel values in expected range after normalization
- No format conversion errors
- Color fidelity maintained (visual spot check on 50 random images)
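The "expected range" criterion can be made concrete: after ImageNet normalization, each channel's values must lie in [(0 − mean)/std, (1 − mean)/std], roughly −2.2 to 2.7. A minimal sketch of that check:

```python
import numpy as np

MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

def in_expected_range(normalized: np.ndarray, tol: float = 1e-6) -> bool:
    """True if every channel value lies in the post-normalization range."""
    lo = (0.0 - MEAN) / STD
    hi = (1.0 - MEAN) / STD
    return bool(np.all(normalized >= lo - tol) and np.all(normalized <= hi + tol))
```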
### Task 3.3: Implement Data Augmentation Pipeline

**Objective:** Create augmentation transforms to increase training data variety.

**Actions:**
- Create `scripts/phase3/augmentation_pipeline.py` with transforms:

  **Geometric Transforms:**
  - Random rotation: -30° to +30°
  - Random horizontal flip: 50% probability
  - Random vertical flip: 10% probability (some plants are naturally upside-down)
  - Random crop: 80-100% of the image, then resize back
  - Random perspective: slight perspective distortion

  **Color Transforms:**
  - Random brightness: ±20%
  - Random contrast: ±20%
  - Random saturation: ±30%
  - Random hue shift: ±10%
  - Color jitter (combined)

  **Blur/Noise Transforms:**
  - Gaussian blur: kernel 3-7, 30% probability
  - Motion blur: 10% probability
  - Gaussian noise: σ=0.01-0.05, 20% probability

  **Occlusion Transforms:**
  - Random erasing (cutout): 10-30% of area, 20% probability
  - Grid dropout: 10% probability
- Implement using PyTorch or Albumentations:

  ```python
  import albumentations as A
  from albumentations.pytorch import ToTensorV2

  train_transform = A.Compose([
      A.RandomResizedCrop(224, 224, scale=(0.8, 1.0)),
      A.HorizontalFlip(p=0.5),
      A.Rotate(limit=30, p=0.5),
      A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.3, hue=0.1),
      A.GaussianBlur(blur_limit=(3, 7), p=0.3),
      A.CoarseDropout(max_holes=8, max_height=16, max_width=16, p=0.2),
      A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
      ToTensorV2(),
  ])

  val_transform = A.Compose([
      A.Resize(256, 256),
      A.CenterCrop(224, 224),
      A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
      ToTensorV2(),
  ])
  ```
- Create a visualization tool for augmentation preview:

  ```python
  import numpy as np
  import matplotlib.pyplot as plt
  from PIL import Image

  def visualize_augmentations(image_path, transform, n_samples=9):
      """Show a grid of augmented versions of the same image.

      Pass a transform without Normalize/ToTensorV2 so results display directly.
      """
      image = np.asarray(Image.open(image_path).convert("RGB"))
      side = int(n_samples ** 0.5)
      fig, axes = plt.subplots(side, side, figsize=(3 * side, 3 * side))
      for ax in axes.flat:
          ax.imshow(transform(image=image)["image"])
          ax.axis("off")
      plt.tight_layout()
      plt.show()
  ```

- Save the augmentation configuration to JSON for reproducibility
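One way to satisfy the reproducibility requirement is a plain JSON dump of the parameters; the keys below are illustrative, mirroring the Albumentations arguments used in the pipeline:

```python
import json

# Illustrative config mirroring the training transform's parameters
AUG_CONFIG = {
    "seed": 42,
    "random_resized_crop": {"size": [224, 224], "scale": [0.8, 1.0]},
    "horizontal_flip": {"p": 0.5},
    "rotate": {"limit": 30, "p": 0.5},
    "color_jitter": {"brightness": 0.2, "contrast": 0.2, "saturation": 0.3, "hue": 0.1},
    "gaussian_blur": {"blur_limit": [3, 7], "p": 0.3},
    "coarse_dropout": {"max_holes": 8, "max_height": 16, "max_width": 16, "p": 0.2},
}

def save_augmentation_config(config: dict, path: str) -> None:
    """Write the augmentation parameters to JSON with stable key order."""
    with open(path, "w") as f:
        json.dump(config, f, indent=2, sort_keys=True)
```

Recent Albumentations versions also ship their own serialization helpers (`A.save`/`A.load`), which may be preferable when the pipeline uses only built-in transforms.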
Output:
scripts/phase3/augmentation_pipeline.py- Reusable transform classesoutput/phase3/augmentation_config.json- Transform parametersoutput/phase3/augmentation_samples/- Visual examples
Validation:
- All augmentations produce valid images (no NaN, no corruption)
- Augmented images visually reasonable (not over-augmented)
- Transforms are deterministic when seeded
- Pipeline runs at >100 images/second on CPU
### Task 3.4: Balance Underrepresented Classes

**Objective:** Create augmented variants to address class imbalance.

**Actions:**
- Create `scripts/phase3/analyze_class_balance.py` to:
  - Count images per class in the training set
  - Calculate the imbalance ratio (max_class / min_class)
  - Identify underrepresented classes (below median - 1 std)
  - Visualize the class distribution
- Create `scripts/phase3/oversample_minority.py` to:
  - Define target samples per class (e.g., the median count)
  - Generate augmented copies for minority classes
  - Apply stronger augmentation to synthetic samples
  - Track original vs. augmented counts
- Implement oversampling strategies:

  ```python
  import numpy as np

  class BalancingStrategy:
      """Strategies for handling class imbalance."""

      @staticmethod
      def oversample_to_median(class_counts: dict) -> dict:
          """Oversample minority classes to the median count."""
          median = np.median(list(class_counts.values()))
          targets = {}
          for cls, count in class_counts.items():
              targets[cls] = max(int(median), count)
          return targets

      @staticmethod
      def oversample_to_max(class_counts: dict, cap_ratio=5) -> dict:
          """Oversample toward the max count, capped at cap_ratio times the original."""
          max_count = max(class_counts.values())
          targets = {}
          for cls, count in class_counts.items():
              targets[cls] = min(max_count, count * cap_ratio)
          return targets
  ```
- Generate a balanced training manifest:
  - Include the original images
  - Add paths to the augmented copies
  - Mark augmented images in the manifest (for analysis)
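The bookkeeping for these steps is small; a sketch, with class names and counts made up for illustration:

```python
def copies_needed(class_counts: dict, targets: dict) -> dict:
    """Augmented copies per class = target minus originals, never negative."""
    return {cls: max(0, targets[cls] - n) for cls, n in class_counts.items()}

counts = {"quercus_robur": 400, "acer_palmatum": 120, "rare_fern": 30}
targets = {"quercus_robur": 400, "acer_palmatum": 120, "rare_fern": 120}  # median target
needed = copies_needed(counts, targets)  # only rare_fern needs synthetic copies
```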
Output:
datasets/processed/balanced/train/- Balanced training setoutput/phase3/class_balance_before.json- Original distributionoutput/phase3/class_balance_after.json- Balanced distributionoutput/phase3/balance_histogram.png- Visual comparison
Validation:
- Imbalance ratio reduced to < 10:1 (max:min)
- No class has fewer than 50 training samples
- Augmented images are visually distinct from originals
- Total training set size documented
### Task 3.5: Generate Image Manifest Files

**Objective:** Create mapping files for the training pipeline.

**Actions:**
- Create `scripts/phase3/generate_manifests.py` to produce:

  **CSV format** (for a simple PyTorch `Dataset`):

  ```
  path,label,scientific_name,plant_id,source,is_augmented
  train/images/quercus_robur_001.jpg,42,Quercus robur,QR001,inaturalist,false
  train/images/quercus_robur_002_aug.jpg,42,Quercus robur,QR001,augmented,true
  ```

  **JSON format** (detailed metadata):

  ```json
  {
    "train": [
      {
        "path": "train/images/quercus_robur_001.jpg",
        "label": 42,
        "scientific_name": "Quercus robur",
        "common_name": "English Oak",
        "plant_id": "QR001",
        "source": "inaturalist",
        "is_augmented": false,
        "original_path": null
      }
    ]
  }
  ```
- Generate a label mapping file:

  ```json
  {
    "label_to_name": { "0": "Acer palmatum", "1": "Acer rubrum", ... },
    "name_to_label": { "Acer palmatum": 0, "Acer rubrum": 1, ... },
    "label_to_common": { "0": "Japanese Maple", ... }
  }
  ```
- Create split statistics:
  - Total images per split
  - Classes per split
  - Images per class per split
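Building the mapping from sorted class names guarantees the later validation requirement that labels be consecutive integers starting at 0. A sketch, assuming a hypothetical `build_label_mapping` helper:

```python
def build_label_mapping(class_names: list) -> dict:
    """Map sorted, deduplicated class names to consecutive integer labels from 0."""
    ordered = sorted(set(class_names))
    return {
        "label_to_name": {str(i): name for i, name in enumerate(ordered)},
        "name_to_label": {name: i for i, name in enumerate(ordered)},
    }
```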
Output:
datasets/processed/train_manifest.csvdatasets/processed/val_manifest.csvdatasets/processed/test_manifest.csvdatasets/processed/label_mapping.jsonoutput/phase3/manifest_statistics.json
Validation:
- All image paths in manifests exist on disk
- Labels are consecutive integers starting from 0
- No duplicate entries in manifests
- Split sizes match expected counts
- Label mapping covers all classes
### Task 3.6: Validate Dataset Integrity

**Objective:** Perform final verification of the processed dataset.

**Actions:**
- Create `scripts/phase3/validate_dataset.py` to run comprehensive checks:

  **File Integrity:**
  - All manifest paths exist
  - All images load without error
  - All images have the correct dimensions
  - File permissions allow read access

  **Label Consistency:**
  - Labels match between the manifest and directory structure
  - All labels have corresponding class names
  - No orphaned images (in a directory but not the manifest)
  - No missing images (in the manifest but not a directory)

  **Dataset Statistics:**
  - Per-class image counts
  - Train/val/test split ratios
  - Augmented vs. original ratio
  - File size distribution

  **Sample Verification:**
  - Random sample of 100 images per split
  - Verify image content matches the label (using a pretrained model)
  - Flag potential mislabels for review
- Create `scripts/phase3/repair_dataset.py` for common fixes:
  - Remove entries with missing files
  - Fix incorrect labels (with confirmation)
  - Regenerate corrupted augmentations
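Two of the file-integrity checks (paths exist, no duplicate entries) fit in a few lines. A sketch assuming the CSV columns defined in Task 3.5:

```python
import csv
from pathlib import Path

def check_manifest(manifest_path: str) -> dict:
    """Count missing files and duplicate paths in a manifest CSV."""
    with open(manifest_path, newline="") as f:
        rows = list(csv.DictReader(f))
    paths = [row["path"] for row in rows]
    missing = sum(1 for p in paths if not Path(p).exists())
    duplicates = len(paths) - len(set(paths))
    return {"total": len(rows), "missing": missing, "duplicates": duplicates}
```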
Output:
output/phase3/validation_report.json- Full validation resultsoutput/phase3/validation_summary.md- Human-readable summaryoutput/phase3/flagged_for_review.json- Potential issues
Validation:
- 0 missing files
- 0 corrupted images
- 0 dimension mismatches
- <1% potential mislabels flagged
- All metadata fields populated
## End-of-Phase Validation Checklist

Run `scripts/phase3/validate_phase3.py` to verify all criteria:

### Image Processing Validation
| # | Criterion | Target | Status |
|---|---|---|---|
| 1 | All images standardized to target size | 100% at 224x224 (or configured size) | [ ] |
| 2 | All images in RGB format | 100% RGB, 3 channels | [ ] |
| 3 | No corrupted images | 0 unreadable files | [ ] |
| 4 | Normalization applied correctly | Values in expected range | [ ] |
### Augmentation Validation
| # | Criterion | Target | Status |
|---|---|---|---|
| 5 | Augmentation pipeline functional | All transforms produce valid output | [ ] |
| 6 | Augmentation reproducible | Same seed = same output | [ ] |
| 7 | Augmentation performance | >100 images/sec on CPU | [ ] |
| 8 | Visual quality | Spot check passes (50 random samples) | [ ] |
### Class Balance Validation
| # | Criterion | Target | Status |
|---|---|---|---|
| 9 | Class imbalance ratio | < 10:1 (max:min) | [ ] |
| 10 | Minimum class size | ≥50 images per class in train | [ ] |
| 11 | Augmentation ratio | Augmented ≤ 4x original per class | [ ] |
### Manifest Validation
| # | Criterion | Target | Status |
|---|---|---|---|
| 12 | Manifest completeness | 100% images have manifest entries | [ ] |
| 13 | Path validity | 100% manifest paths exist | [ ] |
| 14 | Label consistency | Labels match directory structure | [ ] |
| 15 | No duplicates | 0 duplicate entries | [ ] |
| 16 | Label mapping complete | All labels have names | [ ] |
### Dataset Statistics

| Metric | Expected | Actual | Status |
|---|---|---|---|
| Total processed images | 50,000 - 200,000 | | [ ] |
| Training set size | ~70% of total | | [ ] |
| Validation set size | ~15% of total | | [ ] |
| Test set size | ~15% of total | | [ ] |
| Number of classes | 200 - 500 | | [ ] |
| Avg images per class (train) | 100 - 400 | | [ ] |
| Image file size (avg) | 30-100 KB | | [ ] |
| Total dataset size | 10-50 GB | | [ ] |
## Phase 3 Completion Checklist

- [ ] Task 3.1: Images standardized to target dimensions
- [ ] Task 3.2: Color channels normalized and formats unified
- [ ] Task 3.3: Augmentation pipeline implemented and tested
- [ ] Task 3.4: Class imbalance addressed through oversampling
- [ ] Task 3.5: Manifest files generated for all splits
- [ ] Task 3.6: Dataset integrity validated
- [ ] All 16 validation criteria pass
- [ ] Dataset statistics documented
- [ ] Augmentation config saved for reproducibility
- [ ] Ready for Phase 4 (Model Architecture Selection)
## Scripts Summary

| Script | Task | Input | Output |
|---|---|---|---|
| `standardize_dimensions.py` | 3.1 | Raw images | Resized images |
| `normalize_images.py` | 3.2 | Resized images | Normalized images |
| `augmentation_pipeline.py` | 3.3 | Images | Transform classes |
| `analyze_class_balance.py` | 3.4 | Train manifest | Balance report |
| `oversample_minority.py` | 3.4 | Imbalanced set | Balanced set |
| `generate_manifests.py` | 3.5 | Processed images | CSV/JSON manifests |
| `validate_dataset.py` | 3.6 | Full dataset | Validation report |
| `validate_phase3.py` | Final | All outputs | Pass/fail report |
## Dependencies

```
# requirements-phase3.txt
Pillow>=9.0.0
numpy>=1.24.0
albumentations>=1.3.0
torch>=2.0.0
torchvision>=0.15.0
opencv-python>=4.7.0
pandas>=2.0.0
tqdm>=4.65.0
matplotlib>=3.7.0
scikit-learn>=1.2.0
imagehash>=4.3.0
```
## Directory Structure After Phase 3

```
datasets/
├── raw/                     # Original downloaded images (Phase 2)
├── organized/               # Organized by species (Phase 2)
├── verified/                # Quality-checked (Phase 2)
├── train/                   # Train split (Phase 2)
├── val/                     # Validation split (Phase 2)
├── test/                    # Test split (Phase 2)
└── processed/               # Phase 3 output
    ├── 224x224/             # Standardized size
    │   ├── train/
    │   │   └── images/
    │   ├── val/
    │   │   └── images/
    │   └── test/
    │       └── images/
    ├── balanced/            # Class-balanced training
    │   └── train/
    │       └── images/
    ├── train_manifest.csv
    ├── val_manifest.csv
    ├── test_manifest.csv
    ├── label_mapping.json
    └── augmentation_config.json

output/phase3/
├── dimension_report.json
├── color_conversion_log.json
├── augmentation_config.json
├── augmentation_samples/
├── class_balance_before.json
├── class_balance_after.json
├── balance_histogram.png
├── manifest_statistics.json
├── validation_report.json
├── validation_summary.md
└── flagged_for_review.json
```
## Risk Mitigation
| Risk | Mitigation |
|---|---|
| Disk space exhaustion | Monitor disk usage, compress images, delete raw after processing |
| Memory errors with large batches | Process in batches of 1000, use memory-mapped files |
| Augmentation too aggressive | Visual review, conservative defaults, configurable parameters |
| Class imbalance persists | Multiple oversampling strategies, weighted loss in training |
| Slow processing | Multiprocessing, GPU acceleration for transforms |
| Reproducibility issues | Save all configs, use fixed random seeds, version control |
## Performance Optimization Tips

- **Batch Processing:** process images in parallel using multiprocessing
- **Memory Efficiency:** use generators; don't load all images at once
- **Disk I/O:** use an SSD, batch writes, memory-mapped files
- **Image Loading:** use Pillow-SIMD or OpenCV for faster decoding
- **Augmentation:** apply on the fly during training (saves disk space)
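The batch-processing tip can be sketched with the standard library. Threads suit I/O-bound loading; CPU-bound transforms would swap in `multiprocessing.Pool`. The `process_image` callable is an assumption (e.g., the preprocessor from Task 3.2):

```python
from multiprocessing.pool import ThreadPool

def process_all(paths, process_image, workers=8):
    """Apply process_image to every path in parallel; results keep input order."""
    with ThreadPool(workers) as pool:
        return pool.map(process_image, paths, chunksize=64)
```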
## Notes
- Consider saving augmentation config separately from applying augmentations
- On-the-fly augmentation during training is often preferred over pre-generating
- Keep original unaugmented test set for fair evaluation
- Document any images excluded and reasons
- Save random seeds for all operations
- Phase 4 will select model architecture based on processed dataset size