# Phase 3: Dataset Preprocessing & Augmentation - Implementation Plan

## Overview

**Goal:** Prepare images for training with consistent formatting and an augmentation pipeline.

**Prerequisites:** Phase 2 complete - `datasets/train/`, `datasets/val/`, and `datasets/test/` directories with manifests.

**Target Deliverable:** Training-ready dataset with standardized dimensions, normalized values, and an augmentation pipeline.
## Task Breakdown

### Task 3.1: Standardize Image Dimensions

**Objective:** Resize all images to consistent dimensions for model input.

**Actions:**
- Create `scripts/phase3/standardize_dimensions.py` to:
  - Load images from the train/val/test directories
  - Resize to the target dimension (224x224 for MobileNetV3, 299x299 for EfficientNet)
  - Preserve aspect ratio with center crop or letterboxing
  - Save resized images to a new directory structure
- Support multiple output sizes:

  ```python
  TARGET_SIZES = {
      "mobilenet": (224, 224),
      "efficientnet": (299, 299),
      "vit": (384, 384),
  }
  ```

- Implement resize strategies:
  - `center_crop`: crop to a square, then resize (preserves detail)
  - `letterbox`: pad to a square, then resize (preserves the full image)
  - `stretch`: direct resize (fastest, may distort)
- Output directory structure:

  ```
  datasets/
  ├── processed/
  │   └── 224x224/
  │       ├── train/
  │       ├── val/
  │       └── test/
  ```
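The two aspect-preserving strategies can be sketched with Pillow; the function names below are illustrative, not the script's actual API:

```python
from PIL import Image, ImageOps

def center_crop_resize(img: Image.Image, size: int) -> Image.Image:
    """Crop the largest centered square, then resize to size x size."""
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    square = img.crop((left, top, left + side, top + side))
    return square.resize((size, size), Image.BILINEAR)

def letterbox_resize(img: Image.Image, size: int, fill=(0, 0, 0)) -> Image.Image:
    """Pad to a square with the fill color, then resize to size x size."""
    return ImageOps.pad(img, (size, size), method=Image.BILINEAR, color=fill)
```

`stretch` is just `img.resize((size, size))` with no cropping or padding.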
**Output:**
- `datasets/processed/{size}/` directories
- `output/phase3/dimension_report.json` - processing statistics

**Validation:**
- All images in processed directory are exactly target dimensions
- No corrupt images (all readable by PIL)
- Image count matches source (no images lost)
- Processing time logged for performance baseline
### Task 3.2: Normalize Color Channels

**Objective:** Standardize pixel values and handle format variations.

**Actions:**
- Create `scripts/phase3/normalize_images.py` to:
  - Convert all images to RGB (handle RGBA, grayscale, CMYK)
  - Apply ImageNet normalization at load time (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
  - Handle various input formats (JPEG, PNG, WebP, HEIC)
  - Save as a consistent format (JPEG at quality 95, or PNG for lossless)
- Implement color normalization:

  ```python
  def normalize_image(image: np.ndarray) -> np.ndarray:
      """Normalize image for model input."""
      image = image.astype(np.float32) / 255.0
      mean = np.array([0.485, 0.456, 0.406])
      std = np.array([0.229, 0.224, 0.225])
      return (image - mean) / std
  ```
- Create a preprocessing pipeline class:

  ```python
  import numpy as np
  from PIL import Image

  class ImagePreprocessor:
      def __init__(self, target_size, normalize=True):
          self.target_size = target_size
          self.normalize = normalize

      def __call__(self, image_path: str) -> np.ndarray:
          # Load, convert to RGB, resize, then optionally normalize
          image = Image.open(image_path).convert("RGB")
          image = image.resize(self.target_size, Image.BILINEAR)
          array = np.asarray(image).astype(np.float32) / 255.0
          if self.normalize:
              mean = np.array([0.485, 0.456, 0.406])
              std = np.array([0.229, 0.224, 0.225])
              array = (array - mean) / std
          return array
  ```

- Handle edge cases:
  - Grayscale → convert to RGB by duplicating channels
  - RGBA → remove the alpha channel, composite on white
  - CMYK → convert to RGB color space
  - 16-bit images → convert to 8-bit
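A sketch of these edge-case rules with Pillow; the helper name `to_rgb` is an assumption, and the RGBA branch follows the composite-on-white rule above:

```python
from PIL import Image

def to_rgb(img: Image.Image) -> Image.Image:
    """Convert any supported mode (grayscale, RGBA, CMYK, 16-bit) to 8-bit RGB."""
    if img.mode in ("I", "I;16", "I;16B", "I;16L"):
        # Scale the 16-bit integer range down to 8-bit before conversion
        img = img.convert("I").point(lambda px: px * (1 / 256)).convert("L")
    if img.mode == "RGBA":
        # Composite onto a white background using the alpha channel as the mask
        background = Image.new("RGB", img.size, (255, 255, 255))
        background.paste(img, mask=img.getchannel("A"))
        return background
    # Grayscale duplication and CMYK -> RGB are both handled by convert()
    return img.convert("RGB")
```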
**Output:**
- Updated processed images with consistent color handling
- `output/phase3/color_conversion_log.json` - format conversion statistics

**Validation:**
- All images have exactly 3 color channels (RGB)
- Pixel values in expected range after normalization
- No format conversion errors
- Color fidelity maintained (visual spot check on 50 random images)
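The "expected range" criterion can be made concrete: after ImageNet normalization, each channel's values must lie in [(0 − mean)/std, (1 − mean)/std], roughly −2.2 to 2.7. A minimal sketch of that check:

```python
import numpy as np

MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])

def in_expected_range(normalized: np.ndarray, tol: float = 1e-6) -> bool:
    """True if every channel value lies in the post-normalization range."""
    lo = (0.0 - MEAN) / STD
    hi = (1.0 - MEAN) / STD
    return bool(np.all(normalized >= lo - tol) and np.all(normalized <= hi + tol))
```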
### Task 3.3: Implement Data Augmentation Pipeline

**Objective:** Create augmentation transforms to increase training data variety.

**Actions:**
- Create `scripts/phase3/augmentation_pipeline.py` with transforms:

  **Geometric Transforms:**
  - Random rotation: -30° to +30°
  - Random horizontal flip: 50% probability
  - Random vertical flip: 10% probability (some plants are naturally upside-down)
  - Random crop: 80-100% of the image, then resize back
  - Random perspective: slight perspective distortion

  **Color Transforms:**
  - Random brightness: ±20%
  - Random contrast: ±20%
  - Random saturation: ±30%
  - Random hue shift: ±10%
  - Color jitter (combined)

  **Blur/Noise Transforms:**
  - Gaussian blur: kernel 3-7, 30% probability
  - Motion blur: 10% probability
  - Gaussian noise: σ=0.01-0.05, 20% probability

  **Occlusion Transforms:**
  - Random erasing (cutout): 10-30% of area, 20% probability
  - Grid dropout: 10% probability
- Implement using PyTorch or Albumentations:

  ```python
  import albumentations as A
  from albumentations.pytorch import ToTensorV2

  train_transform = A.Compose([
      A.RandomResizedCrop(224, 224, scale=(0.8, 1.0)),
      A.HorizontalFlip(p=0.5),
      A.Rotate(limit=30, p=0.5),
      A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.3, hue=0.1),
      A.GaussianBlur(blur_limit=(3, 7), p=0.3),
      A.CoarseDropout(max_holes=8, max_height=16, max_width=16, p=0.2),
      A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
      ToTensorV2(),
  ])

  val_transform = A.Compose([
      A.Resize(256, 256),
      A.CenterCrop(224, 224),
      A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
      ToTensorV2(),
  ])
  ```
- Create a visualization tool for augmentation preview:

  ```python
  import numpy as np
  import matplotlib.pyplot as plt
  from PIL import Image

  def visualize_augmentations(image_path, transform, n_samples=9):
      """Show a grid of augmented versions of the same image.

      Pass a transform without Normalize/ToTensorV2 so results display directly.
      """
      image = np.asarray(Image.open(image_path).convert("RGB"))
      side = int(n_samples ** 0.5)
      fig, axes = plt.subplots(side, side, figsize=(3 * side, 3 * side))
      for ax in axes.flat:
          ax.imshow(transform(image=image)["image"])
          ax.axis("off")
      plt.tight_layout()
      plt.show()
  ```

- Save the augmentation configuration to JSON for reproducibility
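One way to satisfy the reproducibility requirement is a plain JSON dump of the parameters; the keys below are illustrative, mirroring the Albumentations arguments used in the pipeline:

```python
import json

# Illustrative config mirroring the training transform's parameters
AUG_CONFIG = {
    "seed": 42,
    "random_resized_crop": {"size": [224, 224], "scale": [0.8, 1.0]},
    "horizontal_flip": {"p": 0.5},
    "rotate": {"limit": 30, "p": 0.5},
    "color_jitter": {"brightness": 0.2, "contrast": 0.2, "saturation": 0.3, "hue": 0.1},
    "gaussian_blur": {"blur_limit": [3, 7], "p": 0.3},
    "coarse_dropout": {"max_holes": 8, "max_height": 16, "max_width": 16, "p": 0.2},
}

def save_augmentation_config(config: dict, path: str) -> None:
    """Write the augmentation parameters to JSON with stable key order."""
    with open(path, "w") as f:
        json.dump(config, f, indent=2, sort_keys=True)
```

Recent Albumentations versions also ship their own serialization helpers (`A.save`/`A.load`), which may be preferable when the pipeline uses only built-in transforms.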
Output:
scripts/phase3/augmentation_pipeline.py- Reusable transform classesoutput/phase3/augmentation_config.json- Transform parametersoutput/phase3/augmentation_samples/- Visual examples
Validation:
- All augmentations produce valid images (no NaN, no corruption)
- Augmented images visually reasonable (not over-augmented)
- Transforms are deterministic when seeded
- Pipeline runs at >100 images/second on CPU
### Task 3.4: Balance Underrepresented Classes

**Objective:** Create augmented variants to address class imbalance.

**Actions:**
- Create `scripts/phase3/analyze_class_balance.py` to:
  - Count images per class in the training set
  - Calculate the imbalance ratio (max_class / min_class)
  - Identify underrepresented classes (below median - 1 std)
  - Visualize the class distribution
- Create `scripts/phase3/oversample_minority.py` to:
  - Define target samples per class (e.g., the median count)
  - Generate augmented copies for minority classes
  - Apply stronger augmentation to synthetic samples
  - Track original vs. augmented counts
- Implement oversampling strategies:

  ```python
  import numpy as np

  class BalancingStrategy:
      """Strategies for handling class imbalance."""

      @staticmethod
      def oversample_to_median(class_counts: dict) -> dict:
          """Oversample minority classes to the median count."""
          median = np.median(list(class_counts.values()))
          targets = {}
          for cls, count in class_counts.items():
              targets[cls] = max(int(median), count)
          return targets

      @staticmethod
      def oversample_to_max(class_counts: dict, cap_ratio=5) -> dict:
          """Oversample toward the max count, capped at cap_ratio times the original."""
          max_count = max(class_counts.values())
          targets = {}
          for cls, count in class_counts.items():
              targets[cls] = min(max_count, count * cap_ratio)
          return targets
  ```
- Generate a balanced training manifest:
  - Include the original images
  - Add paths to the augmented copies
  - Mark augmented images in the manifest (for analysis)
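The bookkeeping for these steps is small; a sketch, with class names and counts made up for illustration:

```python
def copies_needed(class_counts: dict, targets: dict) -> dict:
    """Augmented copies per class = target minus originals, never negative."""
    return {cls: max(0, targets[cls] - n) for cls, n in class_counts.items()}

counts = {"quercus_robur": 400, "acer_palmatum": 120, "rare_fern": 30}
targets = {"quercus_robur": 400, "acer_palmatum": 120, "rare_fern": 120}  # median target
needed = copies_needed(counts, targets)  # only rare_fern needs synthetic copies
```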
Output:
datasets/processed/balanced/train/- Balanced training setoutput/phase3/class_balance_before.json- Original distributionoutput/phase3/class_balance_after.json- Balanced distributionoutput/phase3/balance_histogram.png- Visual comparison
Validation:
- Imbalance ratio reduced to < 10:1 (max:min)
- No class has fewer than 50 training samples
- Augmented images are visually distinct from originals
- Total training set size documented
### Task 3.5: Generate Image Manifest Files

**Objective:** Create mapping files for the training pipeline.

**Actions:**
- Create `scripts/phase3/generate_manifests.py` to produce:

  **CSV format** (for a simple PyTorch `Dataset`):

  ```
  path,label,scientific_name,plant_id,source,is_augmented
  train/images/quercus_robur_001.jpg,42,Quercus robur,QR001,inaturalist,false
  train/images/quercus_robur_002_aug.jpg,42,Quercus robur,QR001,augmented,true
  ```

  **JSON format** (detailed metadata):

  ```json
  {
    "train": [
      {
        "path": "train/images/quercus_robur_001.jpg",
        "label": 42,
        "scientific_name": "Quercus robur",
        "common_name": "English Oak",
        "plant_id": "QR001",
        "source": "inaturalist",
        "is_augmented": false,
        "original_path": null
      }
    ]
  }
  ```
- Generate a label mapping file:

  ```json
  {
    "label_to_name": { "0": "Acer palmatum", "1": "Acer rubrum", ... },
    "name_to_label": { "Acer palmatum": 0, "Acer rubrum": 1, ... },
    "label_to_common": { "0": "Japanese Maple", ... }
  }
  ```
- Create split statistics:
  - Total images per split
  - Classes per split
  - Images per class per split
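Building the mapping from sorted class names guarantees the later validation requirement that labels be consecutive integers starting at 0. A sketch, assuming a hypothetical `build_label_mapping` helper:

```python
def build_label_mapping(class_names: list) -> dict:
    """Map sorted, deduplicated class names to consecutive integer labels from 0."""
    ordered = sorted(set(class_names))
    return {
        "label_to_name": {str(i): name for i, name in enumerate(ordered)},
        "name_to_label": {name: i for i, name in enumerate(ordered)},
    }
```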
Output:
datasets/processed/train_manifest.csvdatasets/processed/val_manifest.csvdatasets/processed/test_manifest.csvdatasets/processed/label_mapping.jsonoutput/phase3/manifest_statistics.json
Validation:
- All image paths in manifests exist on disk
- Labels are consecutive integers starting from 0
- No duplicate entries in manifests
- Split sizes match expected counts
- Label mapping covers all classes
### Task 3.6: Validate Dataset Integrity

**Objective:** Perform final verification of the processed dataset.

**Actions:**
- Create `scripts/phase3/validate_dataset.py` to run comprehensive checks:

  **File Integrity:**
  - All manifest paths exist
  - All images load without error
  - All images have the correct dimensions
  - File permissions allow read access

  **Label Consistency:**
  - Labels match between the manifest and directory structure
  - All labels have corresponding class names
  - No orphaned images (in a directory but not the manifest)
  - No missing images (in the manifest but not a directory)

  **Dataset Statistics:**
  - Per-class image counts
  - Train/val/test split ratios
  - Augmented vs. original ratio
  - File size distribution

  **Sample Verification:**
  - Random sample of 100 images per split
  - Verify image content matches the label (using a pretrained model)
  - Flag potential mislabels for review
- Create `scripts/phase3/repair_dataset.py` for common fixes:
  - Remove entries with missing files
  - Fix incorrect labels (with confirmation)
  - Regenerate corrupted augmentations
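Two of the file-integrity checks (paths exist, no duplicate entries) fit in a few lines. A sketch assuming the CSV columns defined in Task 3.5:

```python
import csv
from pathlib import Path

def check_manifest(manifest_path: str) -> dict:
    """Count missing files and duplicate paths in a manifest CSV."""
    with open(manifest_path, newline="") as f:
        rows = list(csv.DictReader(f))
    paths = [row["path"] for row in rows]
    missing = sum(1 for p in paths if not Path(p).exists())
    duplicates = len(paths) - len(set(paths))
    return {"total": len(rows), "missing": missing, "duplicates": duplicates}
```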
Output:
output/phase3/validation_report.json- Full validation resultsoutput/phase3/validation_summary.md- Human-readable summaryoutput/phase3/flagged_for_review.json- Potential issues
Validation:
- 0 missing files
- 0 corrupted images
- 0 dimension mismatches
- <1% potential mislabels flagged
- All metadata fields populated
## End-of-Phase Validation Checklist

Run `scripts/phase3/validate_phase3.py` to verify all criteria:

### Image Processing Validation
| # | Criterion | Target | Status |
|---|---|---|---|
| 1 | All images standardized to target size | 100% at 224x224 (or configured size) | [ ] |
| 2 | All images in RGB format | 100% RGB, 3 channels | [ ] |
| 3 | No corrupted images | 0 unreadable files | [ ] |
| 4 | Normalization applied correctly | Values in expected range | [ ] |
### Augmentation Validation
| # | Criterion | Target | Status |
|---|---|---|---|
| 5 | Augmentation pipeline functional | All transforms produce valid output | [ ] |
| 6 | Augmentation reproducible | Same seed = same output | [ ] |
| 7 | Augmentation performance | >100 images/sec on CPU | [ ] |
| 8 | Visual quality | Spot check passes (50 random samples) | [ ] |
### Class Balance Validation
| # | Criterion | Target | Status |
|---|---|---|---|
| 9 | Class imbalance ratio | < 10:1 (max:min) | [ ] |
| 10 | Minimum class size | ≥50 images per class in train | [ ] |
| 11 | Augmentation ratio | Augmented ≤ 4x original per class | [ ] |
### Manifest Validation
| # | Criterion | Target | Status |
|---|---|---|---|
| 12 | Manifest completeness | 100% images have manifest entries | [ ] |
| 13 | Path validity | 100% manifest paths exist | [ ] |
| 14 | Label consistency | Labels match directory structure | [ ] |
| 15 | No duplicates | 0 duplicate entries | [ ] |
| 16 | Label mapping complete | All labels have names | [ ] |
### Dataset Statistics

| Metric | Expected | Actual | Status |
|---|---|---|---|
| Total processed images | 50,000 - 200,000 | | [ ] |
| Training set size | ~70% of total | | [ ] |
| Validation set size | ~15% of total | | [ ] |
| Test set size | ~15% of total | | [ ] |
| Number of classes | 200 - 500 | | [ ] |
| Avg images per class (train) | 100 - 400 | | [ ] |
| Image file size (avg) | 30-100 KB | | [ ] |
| Total dataset size | 10-50 GB | | [ ] |
## Phase 3 Completion Checklist

- [ ] Task 3.1: Images standardized to target dimensions
- [ ] Task 3.2: Color channels normalized and formats unified
- [ ] Task 3.3: Augmentation pipeline implemented and tested
- [ ] Task 3.4: Class imbalance addressed through oversampling
- [ ] Task 3.5: Manifest files generated for all splits
- [ ] Task 3.6: Dataset integrity validated
- [ ] All 16 validation criteria pass
- [ ] Dataset statistics documented
- [ ] Augmentation config saved for reproducibility
- [ ] Ready for Phase 4 (Model Architecture Selection)
## Scripts Summary

| Script | Task | Input | Output |
|---|---|---|---|
| `standardize_dimensions.py` | 3.1 | Raw images | Resized images |
| `normalize_images.py` | 3.2 | Resized images | Normalized images |
| `augmentation_pipeline.py` | 3.3 | Images | Transform classes |
| `analyze_class_balance.py` | 3.4 | Train manifest | Balance report |
| `oversample_minority.py` | 3.4 | Imbalanced set | Balanced set |
| `generate_manifests.py` | 3.5 | Processed images | CSV/JSON manifests |
| `validate_dataset.py` | 3.6 | Full dataset | Validation report |
| `validate_phase3.py` | Final | All outputs | Pass/fail report |
## Dependencies

```
# requirements-phase3.txt
Pillow>=9.0.0
numpy>=1.24.0
albumentations>=1.3.0
torch>=2.0.0
torchvision>=0.15.0
opencv-python>=4.7.0
pandas>=2.0.0
tqdm>=4.65.0
matplotlib>=3.7.0
scikit-learn>=1.2.0
imagehash>=4.3.0
```
## Directory Structure After Phase 3

```
datasets/
├── raw/                     # Original downloaded images (Phase 2)
├── organized/               # Organized by species (Phase 2)
├── verified/                # Quality-checked (Phase 2)
├── train/                   # Train split (Phase 2)
├── val/                     # Validation split (Phase 2)
├── test/                    # Test split (Phase 2)
└── processed/               # Phase 3 output
    ├── 224x224/             # Standardized size
    │   ├── train/
    │   │   └── images/
    │   ├── val/
    │   │   └── images/
    │   └── test/
    │       └── images/
    ├── balanced/            # Class-balanced training
    │   └── train/
    │       └── images/
    ├── train_manifest.csv
    ├── val_manifest.csv
    ├── test_manifest.csv
    ├── label_mapping.json
    └── augmentation_config.json

output/phase3/
├── dimension_report.json
├── color_conversion_log.json
├── augmentation_config.json
├── augmentation_samples/
├── class_balance_before.json
├── class_balance_after.json
├── balance_histogram.png
├── manifest_statistics.json
├── validation_report.json
├── validation_summary.md
└── flagged_for_review.json
```
## Risk Mitigation
| Risk | Mitigation |
|---|---|
| Disk space exhaustion | Monitor disk usage, compress images, delete raw after processing |
| Memory errors with large batches | Process in batches of 1000, use memory-mapped files |
| Augmentation too aggressive | Visual review, conservative defaults, configurable parameters |
| Class imbalance persists | Multiple oversampling strategies, weighted loss in training |
| Slow processing | Multiprocessing, GPU acceleration for transforms |
| Reproducibility issues | Save all configs, use fixed random seeds, version control |
## Performance Optimization Tips

- **Batch Processing:** process images in parallel using multiprocessing
- **Memory Efficiency:** use generators; don't load all images at once
- **Disk I/O:** use an SSD, batch writes, memory-mapped files
- **Image Loading:** use Pillow-SIMD or OpenCV for faster decoding
- **Augmentation:** apply on the fly during training (saves disk space)
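The batch-processing tip can be sketched with the standard library. Threads suit I/O-bound loading; CPU-bound transforms would swap in `multiprocessing.Pool`. The `process_image` callable is an assumption (e.g., the preprocessor from Task 3.2):

```python
from multiprocessing.pool import ThreadPool

def process_all(paths, process_image, workers=8):
    """Apply process_image to every path in parallel; results keep input order."""
    with ThreadPool(workers) as pool:
        return pool.map(process_image, paths, chunksize=64)
```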
## Notes
- Consider saving augmentation config separately from applying augmentations
- On-the-fly augmentation during training is often preferred over pre-generating
- Keep original unaugmented test set for fair evaluation
- Document any images excluded and reasons
- Save random seeds for all operations
- Phase 4 will select model architecture based on processed dataset size