Phase 3: Dataset Preprocessing & Augmentation - Implementation Plan

Overview

Goal: Prepare images for training with consistent formatting and augmentation pipeline.

Prerequisites: Phase 2 complete - datasets/train/, datasets/val/, datasets/test/ directories with manifests

Target Deliverable: Training-ready dataset with standardized dimensions, normalized values, and augmentation pipeline


Task Breakdown

Task 3.1: Standardize Image Dimensions

Objective: Resize all images to consistent dimensions for model input.

Actions:

  1. Create scripts/phase3/standardize_dimensions.py to:

    • Load images from train/val/test directories
    • Resize to target dimension (224x224 for MobileNetV3, 299x299 for EfficientNet)
    • Preserve aspect ratio with center crop or letterboxing
    • Save resized images to new directory structure
  2. Support multiple output sizes:

    TARGET_SIZES = {
        "mobilenet": (224, 224),
        "efficientnet": (299, 299),
        "vit": (384, 384)
    }
    
  3. Implement resize strategies:

    • center_crop: Crop to square, then resize (preserves detail)
    • letterbox: Pad to square, then resize (preserves full image)
    • stretch: Direct resize (fastest, may distort)
  4. Output directory structure:

    datasets/
    ├── processed/
    │   └── 224x224/
    │       ├── train/
    │       ├── val/
    │       └── test/
    

Output:

  • datasets/processed/{size}/ directories
  • output/phase3/dimension_report.json - Processing statistics

Validation:

  • All images in processed directory are exactly target dimensions
  • No corrupt images (all readable by PIL)
  • Image count matches source (no images lost)
  • Processing time logged for performance baseline

Task 3.2: Normalize Color Channels

Objective: Standardize pixel values and handle format variations.

Actions:

  1. Create scripts/phase3/normalize_images.py to:

    • Convert all images to RGB (handle RGBA, grayscale, CMYK)
    • Handle various input formats (JPEG, PNG, WebP, HEIC)
    • Save as a consistent format (JPEG at quality 95, or PNG for lossless)
    • Apply ImageNet normalization (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) at load time; the normalized float values feed the model directly rather than being written back to 8-bit image files
  2. Implement color normalization:

    def normalize_image(image: np.ndarray) -> np.ndarray:
        """Normalize image for model input."""
        image = image.astype(np.float32) / 255.0
        mean = np.array([0.485, 0.456, 0.406])
        std = np.array([0.229, 0.224, 0.225])
        return (image - mean) / std
    
  3. Create preprocessing pipeline class:

    class ImagePreprocessor:
        def __init__(self, target_size, normalize=True):
            self.target_size = target_size
            self.normalize = normalize
    
        def __call__(self, image_path: str) -> np.ndarray:
            # Load, convert to RGB, resize, then optionally normalize
            # (reuses normalize_image defined above)
            image = Image.open(image_path).convert("RGB")
            image = image.resize(self.target_size, Image.Resampling.LANCZOS)
            array = np.asarray(image)
            return normalize_image(array) if self.normalize else array
    
  4. Handle edge cases:

    • Grayscale → convert to RGB by duplicating channels
    • RGBA → remove alpha channel, composite on white
    • CMYK → convert to RGB color space
    • 16-bit images → convert to 8-bit
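The edge cases in step 4 can be handled in one helper; `to_rgb` is an illustrative name, not the script's actual API:

```python
import numpy as np
from PIL import Image

def to_rgb(img: Image.Image) -> Image.Image:
    """Convert any common input mode to 8-bit RGB."""
    if img.mode == "RGBA":
        # Composite on white so transparent regions become white, not black
        background = Image.new("RGB", img.size, (255, 255, 255))
        background.paste(img, mask=img.split()[3])
        return background
    if img.mode in ("I", "I;16"):
        # 16/32-bit grayscale: rescale to 8-bit before the RGB conversion
        arr = np.asarray(img).astype(np.float32)
        arr = arr / max(arr.max(), 1) * 255
        img = Image.fromarray(arr.astype(np.uint8), mode="L")
    # Pillow converts L (grayscale), P (palette), and CMYK directly
    return img.convert("RGB")
```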

Output:

  • Updated processed images with consistent color handling
  • output/phase3/color_conversion_log.json - Format conversion statistics

Validation:

  • All images have exactly 3 color channels (RGB)
  • Pixel values in expected range after normalization
  • No format conversion errors
  • Color fidelity maintained (visual spot check on 50 random images)
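For the "expected range" check above: after ImageNet normalization, valid pixel values fall roughly in [-2.12, 2.64] (i.e. (0 - 0.485)/0.229 to (1 - 0.406)/0.225), so a loose bounds check catches un-normalized or corrupted arrays. A minimal sketch (function name illustrative):

```python
import numpy as np

# Bounds derived from the ImageNet mean/std above, with a small margin
NORM_MIN, NORM_MAX = -2.2, 2.7

def check_normalized_range(image: np.ndarray) -> bool:
    """True if values are finite and within the post-normalization range."""
    return bool(np.isfinite(image).all()
                and image.min() >= NORM_MIN
                and image.max() <= NORM_MAX)
```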

Task 3.3: Implement Data Augmentation Pipeline

Objective: Create augmentation transforms to increase training data variety.

Actions:

  1. Create scripts/phase3/augmentation_pipeline.py with transforms:

    Geometric Transforms:

    • Random rotation: -30° to +30°
    • Random horizontal flip: 50% probability
    • Random vertical flip: 10% probability (kept low, since plant photos rarely appear upside-down)
    • Random crop: 80-100% of image, then resize back
    • Random perspective: slight perspective distortion

    Color Transforms:

    • Random brightness: ±20%
    • Random contrast: ±20%
    • Random saturation: ±30%
    • Random hue shift: ±10%
    • Color jitter (combined)

    Blur/Noise Transforms:

    • Gaussian blur: kernel 3-7, 30% probability
    • Motion blur: 10% probability
    • Gaussian noise: σ=0.01-0.05, 20% probability

    Occlusion Transforms:

    • Random erasing (cutout): 10-30% area, 20% probability
    • Grid dropout: 10% probability
  2. Implement using PyTorch or Albumentations:

    import albumentations as A
    from albumentations.pytorch import ToTensorV2
    
    train_transform = A.Compose([
        A.RandomResizedCrop(224, 224, scale=(0.8, 1.0)),
        A.HorizontalFlip(p=0.5),
        A.Rotate(limit=30, p=0.5),
        A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.3, hue=0.1),
        A.GaussianBlur(blur_limit=(3, 7), p=0.3),
        A.CoarseDropout(max_holes=8, max_height=16, max_width=16, p=0.2),
        A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ToTensorV2(),
    ])
    
    val_transform = A.Compose([
        A.Resize(256, 256),
        A.CenterCrop(224, 224),
        A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ToTensorV2(),
    ])
    
  3. Create visualization tool for augmentation preview:

    def visualize_augmentations(image_path, transform, n_samples=9):
        """Show grid of augmented versions of same image."""
        pass
    
  4. Save augmentation configuration to JSON for reproducibility
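Step 4 can be as simple as a JSON round-trip of a parameter dict plus the random seed. The keys below are illustrative, mirroring the transforms listed above; albumentations also ships its own serialization helpers (`A.to_dict`/`A.from_dict`), which could be used instead:

```python
import json

# Illustrative parameter record mirroring train_transform above
AUG_CONFIG = {
    "seed": 42,
    "random_resized_crop": {"size": [224, 224], "scale": [0.8, 1.0]},
    "horizontal_flip": {"p": 0.5},
    "rotate": {"limit": 30, "p": 0.5},
    "color_jitter": {"brightness": 0.2, "contrast": 0.2,
                     "saturation": 0.3, "hue": 0.1},
    "gaussian_blur": {"blur_limit": [3, 7], "p": 0.3},
}

def save_aug_config(config: dict, path: str) -> None:
    with open(path, "w") as f:
        json.dump(config, f, indent=2, sort_keys=True)

def load_aug_config(path: str) -> dict:
    with open(path) as f:
        return json.load(f)
```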

Output:

  • scripts/phase3/augmentation_pipeline.py - Reusable transform classes
  • output/phase3/augmentation_config.json - Transform parameters
  • output/phase3/augmentation_samples/ - Visual examples

Validation:

  • All augmentations produce valid images (no NaN, no corruption)
  • Augmented images visually reasonable (not over-augmented)
  • Transforms are deterministic when seeded
  • Pipeline runs at >100 images/second on CPU

Task 3.4: Balance Underrepresented Classes

Objective: Create augmented variants to address class imbalance.

Actions:

  1. Create scripts/phase3/analyze_class_balance.py to:

    • Count images per class in training set
    • Calculate imbalance ratio (max_class / min_class)
    • Identify underrepresented classes (below median - 1 std)
    • Visualize class distribution
  2. Create scripts/phase3/oversample_minority.py to:

    • Define target samples per class (e.g., median count)
    • Generate augmented copies for minority classes
    • Apply stronger augmentation for synthetic samples
    • Track original vs augmented counts
  3. Implement oversampling strategies:

    class BalancingStrategy:
        """Strategies for handling class imbalance."""
    
        @staticmethod
        def oversample_to_median(class_counts: dict) -> dict:
            """Oversample minority classes to median count."""
            median = np.median(list(class_counts.values()))
            targets = {}
            for cls, count in class_counts.items():
                targets[cls] = max(int(median), count)
            return targets
    
        @staticmethod
        def oversample_to_max(class_counts: dict, cap_ratio=5) -> dict:
            """Oversample to max, capped at ratio times original."""
            max_count = max(class_counts.values())
            targets = {}
            for cls, count in class_counts.items():
                targets[cls] = min(max_count, count * cap_ratio)
            return targets
    
  4. Generate balanced training manifest:

    • Include original images
    • Add paths to augmented copies
    • Mark augmented images in manifest (for analysis)
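The oversampling loop in step 2 can be sketched as below. `plan_oversampling` is an illustrative helper that pairs with the `BalancingStrategy` targets above; the actual augmented copies would then be generated with the Task 3.3 pipeline:

```python
import random

def plan_oversampling(class_files: dict, targets: dict) -> dict:
    """Map each under-target class to the source files to augment."""
    plan = {}
    for cls, files in class_files.items():
        deficit = targets.get(cls, len(files)) - len(files)
        if deficit > 0:
            # Sample with replacement so a class can grow past 2x its size
            plan[cls] = [random.choice(files) for _ in range(deficit)]
    return plan
```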

Output:

  • datasets/processed/balanced/train/ - Balanced training set
  • output/phase3/class_balance_before.json - Original distribution
  • output/phase3/class_balance_after.json - Balanced distribution
  • output/phase3/balance_histogram.png - Visual comparison

Validation:

  • Imbalance ratio reduced to < 10:1 (max:min)
  • No class has fewer than 50 training samples
  • Augmented images are visually distinct from originals
  • Total training set size documented

Task 3.5: Generate Image Manifest Files

Objective: Create mapping files for training pipeline.

Actions:

  1. Create scripts/phase3/generate_manifests.py to produce:

    CSV Format (PyTorch ImageFolder compatible):

    path,label,scientific_name,plant_id,source,is_augmented
    train/images/quercus_robur_001.jpg,42,Quercus robur,QR001,inaturalist,false
    train/images/quercus_robur_002_aug.jpg,42,Quercus robur,QR001,augmented,true
    

    JSON Format (detailed metadata):

    {
      "train": [
        {
          "path": "train/images/quercus_robur_001.jpg",
          "label": 42,
          "scientific_name": "Quercus robur",
          "common_name": "English Oak",
          "plant_id": "QR001",
          "source": "inaturalist",
          "is_augmented": false,
          "original_path": null
        }
      ]
    }
    
  2. Generate label mapping file:

    {
      "label_to_name": {
        "0": "Acer palmatum",
        "1": "Acer rubrum",
        ...
      },
      "name_to_label": {
        "Acer palmatum": 0,
        "Acer rubrum": 1,
        ...
      },
      "label_to_common": {
        "0": "Japanese Maple",
        ...
      }
    }
    
  3. Create split statistics:

    • Total images per split
    • Classes per split
    • Images per class per split
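The label-mapping file in step 2 follows directly from a sorted class list; sorting makes the integer labels stable and reproducible across runs. A sketch (function name illustrative):

```python
def build_label_mapping(class_names: list, common_names: dict) -> dict:
    """Assign consecutive integer labels to alphabetically sorted classes."""
    ordered = sorted(class_names)
    return {
        "label_to_name": {str(i): name for i, name in enumerate(ordered)},
        "name_to_label": {name: i for i, name in enumerate(ordered)},
        "label_to_common": {str(i): common_names.get(name, "")
                            for i, name in enumerate(ordered)},
    }
```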

Output:

  • datasets/processed/train_manifest.csv
  • datasets/processed/val_manifest.csv
  • datasets/processed/test_manifest.csv
  • datasets/processed/label_mapping.json
  • output/phase3/manifest_statistics.json

Validation:

  • All image paths in manifests exist on disk
  • Labels are consecutive integers starting from 0
  • No duplicate entries in manifests
  • Split sizes match expected counts
  • Label mapping covers all classes

Task 3.6: Validate Dataset Integrity

Objective: Final verification of processed dataset.

Actions:

  1. Create scripts/phase3/validate_dataset.py to run comprehensive checks:

    File Integrity:

    • All manifest paths exist
    • All images load without error
    • All images have correct dimensions
    • File permissions allow read access

    Label Consistency:

    • Labels match between manifest and directory structure
    • All labels have corresponding class names
    • No orphaned images (in directory but not manifest)
    • No missing images (in manifest but not directory)

    Dataset Statistics:

    • Per-class image counts
    • Train/val/test split ratios
    • Augmented vs original ratio
    • File size distribution

    Sample Verification:

    • Random sample of 100 images per split
    • Verify image content matches label (using pretrained model)
    • Flag potential mislabels for review
  2. Create scripts/phase3/repair_dataset.py for common fixes:

    • Remove entries with missing files
    • Fix incorrect labels (with confirmation)
    • Regenerate corrupted augmentations
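A minimal sketch of the missing-file and duplicate checks from step 1, assuming the CSV manifest layout defined in Task 3.5 (function name illustrative):

```python
import csv
from pathlib import Path

def check_manifest(manifest_csv: str, root: str) -> dict:
    """Report manifest entries that are missing on disk or duplicated."""
    missing, seen, duplicates = [], set(), []
    with open(manifest_csv, newline="") as f:
        for row in csv.DictReader(f):
            rel_path = row["path"]
            if rel_path in seen:
                duplicates.append(rel_path)
            seen.add(rel_path)
            if not (Path(root) / rel_path).is_file():
                missing.append(rel_path)
    return {"missing": missing, "duplicates": duplicates}
```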

Output:

  • output/phase3/validation_report.json - Full validation results
  • output/phase3/validation_summary.md - Human-readable summary
  • output/phase3/flagged_for_review.json - Potential issues

Validation:

  • 0 missing files
  • 0 corrupted images
  • 0 dimension mismatches
  • <1% potential mislabels flagged
  • All metadata fields populated

End-of-Phase Validation Checklist

Run scripts/phase3/validate_phase3.py to verify all criteria:

Image Processing Validation

| # | Criterion | Target | Status |
|---|-----------|--------|--------|
| 1 | All images standardized to target size | 100% at 224x224 (or configured size) | [ ] |
| 2 | All images in RGB format | 100% RGB, 3 channels | [ ] |
| 3 | No corrupted images | 0 unreadable files | [ ] |
| 4 | Normalization applied correctly | Values in expected range | [ ] |

Augmentation Validation

| # | Criterion | Target | Status |
|---|-----------|--------|--------|
| 5 | Augmentation pipeline functional | All transforms produce valid output | [ ] |
| 6 | Augmentation reproducible | Same seed = same output | [ ] |
| 7 | Augmentation performance | >100 images/sec on CPU | [ ] |
| 8 | Visual quality | Spot check passes (50 random samples) | [ ] |

Class Balance Validation

| # | Criterion | Target | Status |
|---|-----------|--------|--------|
| 9 | Class imbalance ratio | < 10:1 (max:min) | [ ] |
| 10 | Minimum class size | ≥50 images per class in train | [ ] |
| 11 | Augmentation ratio | Augmented ≤ 4x original per class | [ ] |

Manifest Validation

| # | Criterion | Target | Status |
|---|-----------|--------|--------|
| 12 | Manifest completeness | 100% images have manifest entries | [ ] |
| 13 | Path validity | 100% manifest paths exist | [ ] |
| 14 | Label consistency | Labels match directory structure | [ ] |
| 15 | No duplicates | 0 duplicate entries | [ ] |
| 16 | Label mapping complete | All labels have names | [ ] |

Dataset Statistics

| Metric | Expected | Actual | Status |
|--------|----------|--------|--------|
| Total processed images | 50,000 - 200,000 | | [ ] |
| Training set size | ~70% of total | | [ ] |
| Validation set size | ~15% of total | | [ ] |
| Test set size | ~15% of total | | [ ] |
| Number of classes | 200 - 500 | | [ ] |
| Avg images per class (train) | 100 - 400 | | [ ] |
| Image file size (avg) | 30-100 KB | | [ ] |
| Total dataset size | 10-50 GB | | [ ] |

Phase 3 Completion Checklist

  • Task 3.1: Images standardized to target dimensions
  • Task 3.2: Color channels normalized and formats unified
  • Task 3.3: Augmentation pipeline implemented and tested
  • Task 3.4: Class imbalance addressed through oversampling
  • Task 3.5: Manifest files generated for all splits
  • Task 3.6: Dataset integrity validated
  • All 16 validation criteria pass
  • Dataset statistics documented
  • Augmentation config saved for reproducibility
  • Ready for Phase 4 (Model Architecture Selection)

Scripts Summary

| Script | Task | Input | Output |
|--------|------|-------|--------|
| standardize_dimensions.py | 3.1 | Raw images | Resized images |
| normalize_images.py | 3.2 | Resized images | Normalized images |
| augmentation_pipeline.py | 3.3 | Images | Transform classes |
| analyze_class_balance.py | 3.4 | Train manifest | Balance report |
| oversample_minority.py | 3.4 | Imbalanced set | Balanced set |
| generate_manifests.py | 3.5 | Processed images | CSV/JSON manifests |
| validate_dataset.py | 3.6 | Full dataset | Validation report |
| validate_phase3.py | Final | All outputs | Pass/Fail report |

Dependencies

# requirements-phase3.txt
Pillow>=9.0.0
numpy>=1.24.0
albumentations>=1.3.0
torch>=2.0.0
torchvision>=0.15.0
opencv-python>=4.7.0
pandas>=2.0.0
tqdm>=4.65.0
matplotlib>=3.7.0
scikit-learn>=1.2.0
imagehash>=4.3.0

Directory Structure After Phase 3

datasets/
├── raw/                          # Original downloaded images (Phase 2)
├── organized/                    # Organized by species (Phase 2)
├── verified/                     # Quality-checked (Phase 2)
├── train/                        # Train split (Phase 2)
├── val/                          # Validation split (Phase 2)
├── test/                         # Test split (Phase 2)
└── processed/                    # Phase 3 output
    ├── 224x224/                  # Standardized size
    │   ├── train/
    │   │   └── images/
    │   ├── val/
    │   │   └── images/
    │   └── test/
    │       └── images/
    ├── balanced/                 # Class-balanced training
    │   └── train/
    │       └── images/
    ├── train_manifest.csv
    ├── val_manifest.csv
    ├── test_manifest.csv
    ├── label_mapping.json
    └── augmentation_config.json

output/phase3/
├── dimension_report.json
├── color_conversion_log.json
├── augmentation_config.json
├── augmentation_samples/
├── class_balance_before.json
├── class_balance_after.json
├── balance_histogram.png
├── manifest_statistics.json
├── validation_report.json
├── validation_summary.md
└── flagged_for_review.json

Risk Mitigation

| Risk | Mitigation |
|------|------------|
| Disk space exhaustion | Monitor disk usage, compress images, delete raw after processing |
| Memory errors with large batches | Process in batches of 1000, use memory-mapped files |
| Augmentation too aggressive | Visual review, conservative defaults, configurable parameters |
| Class imbalance persists | Multiple oversampling strategies, weighted loss in training |
| Slow processing | Multiprocessing, GPU acceleration for transforms |
| Reproducibility issues | Save all configs, use fixed random seeds, version control |

Performance Optimization Tips

  1. Batch Processing: Process images in parallel using multiprocessing
  2. Memory Efficiency: Use generators, don't load all images at once
  3. Disk I/O: Use SSD, batch writes, memory-mapped files
  4. Image Loading: Use Pillow-SIMD or OpenCV for faster decoding
  5. Augmentation: Apply on-the-fly during training (save disk space)
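Tips 1-2 combine naturally: stream paths in fixed-size batches so the full dataset is never resident in memory, then hand each batch to a worker pool. A stdlib sketch of the batching half:

```python
from itertools import islice

def batched(iterable, n: int):
    """Yield lists of up to n items without loading the whole iterable."""
    it = iter(iterable)
    while batch := list(islice(it, n)):
        yield batch

# Each batch can then be dispatched to multiprocessing.Pool.map or a
# concurrent.futures.ProcessPoolExecutor for parallel image processing.
```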

Notes

  • Consider saving augmentation config separately from applying augmentations
  • On-the-fly augmentation during training is often preferred over pre-generating
  • Keep original unaugmented test set for fair evaluation
  • Document any images excluded and reasons
  • Save random seeds for all operations
  • Phase 4 will select model architecture based on processed dataset size