# Phase 3: Dataset Preprocessing & Augmentation - Implementation Plan

## Overview

**Goal:** Prepare images for training with consistent formatting and an augmentation pipeline.

**Prerequisites:** Phase 2 complete - `datasets/train/`, `datasets/val/`, `datasets/test/` directories with manifests

**Target Deliverable:** Training-ready dataset with standardized dimensions, normalized values, and an augmentation pipeline

---

## Task Breakdown

### Task 3.1: Standardize Image Dimensions

**Objective:** Resize all images to consistent dimensions for model input.

**Actions:**

1. Create `scripts/phase3/standardize_dimensions.py` to:
   - Load images from the train/val/test directories
   - Resize to the target dimension (224x224 for MobileNetV3, 299x299 for EfficientNet)
   - Preserve aspect ratio with center crop or letterboxing
   - Save resized images to a new directory structure
2. Support multiple output sizes:
   ```python
   TARGET_SIZES = {
       "mobilenet": (224, 224),
       "efficientnet": (299, 299),
       "vit": (384, 384),
   }
   ```
3. Implement resize strategies:
   - **center_crop:** Crop to square, then resize (preserves detail)
   - **letterbox:** Pad to square, then resize (preserves full image)
   - **stretch:** Direct resize (fastest, may distort)
4. Output directory structure:
   ```
   datasets/
   ├── processed/
   │   └── 224x224/
   │       ├── train/
   │       ├── val/
   │       └── test/
   ```

**Output:**
- `datasets/processed/{size}/` directories
- `output/phase3/dimension_report.json` - Processing statistics

**Validation:**
- [ ] All images in the processed directory are exactly the target dimensions
- [ ] No corrupt images (all readable by PIL)
- [ ] Image count matches the source (no images lost)
- [ ] Processing time logged as a performance baseline

---

### Task 3.2: Normalize Color Channels

**Objective:** Standardize pixel values and handle format variations.

**Actions:**
1. Create `scripts/phase3/normalize_images.py` to:
   - Convert all images to RGB (handle RGBA, grayscale, CMYK)
   - Apply ImageNet normalization (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
   - Handle various input formats (JPEG, PNG, WebP, HEIC)
   - Save in a consistent format (JPEG at quality 95, or PNG for lossless)
2. Implement color normalization:
   ```python
   def normalize_image(image: np.ndarray) -> np.ndarray:
       """Normalize an RGB image for model input."""
       image = image.astype(np.float32) / 255.0
       mean = np.array([0.485, 0.456, 0.406])
       std = np.array([0.229, 0.224, 0.225])
       return (image - mean) / std
   ```
3. Create a preprocessing pipeline class:
   ```python
   class ImagePreprocessor:
       def __init__(self, target_size, normalize=True):
           self.target_size = target_size
           self.normalize = normalize

       def __call__(self, image_path: str) -> np.ndarray:
           # Load, convert to RGB, resize, and optionally normalize
           image = Image.open(image_path).convert("RGB")
           image = image.resize(self.target_size, Image.Resampling.BILINEAR)
           array = np.asarray(image)
           return normalize_image(array) if self.normalize else array
   ```
4. Handle edge cases:
   - Grayscale → convert to RGB by duplicating channels
   - RGBA → remove the alpha channel, compositing on white
   - CMYK → convert to the RGB color space
   - 16-bit images → convert to 8-bit

**Output:**
- Updated processed images with consistent color handling
- `output/phase3/color_conversion_log.json` - Format conversion statistics

**Validation:**
- [ ] All images have exactly 3 color channels (RGB)
- [ ] Pixel values in the expected range after normalization
- [ ] No format conversion errors
- [ ] Color fidelity maintained (visual spot check on 50 random images)

---

### Task 3.3: Implement Data Augmentation Pipeline

**Objective:** Create augmentation transforms to increase training data variety.

**Actions:**
1. Create `scripts/phase3/augmentation_pipeline.py` with transforms:

   **Geometric Transforms:**
   - Random rotation: -30° to +30°
   - Random horizontal flip: 50% probability
   - Random vertical flip: 10% probability (some plants are naturally upside-down)
   - Random crop: 80-100% of the image, then resize back
   - Random perspective: slight perspective distortion

   **Color Transforms:**
   - Random brightness: ±20%
   - Random contrast: ±20%
   - Random saturation: ±30%
   - Random hue shift: ±10%
   - Color jitter (combined)

   **Blur/Noise Transforms:**
   - Gaussian blur: kernel 3-7, 30% probability
   - Motion blur: 10% probability
   - Gaussian noise: σ=0.01-0.05, 20% probability

   **Occlusion Transforms:**
   - Random erasing (cutout): 10-30% area, 20% probability
   - Grid dropout: 10% probability

2. Implement using PyTorch or Albumentations:
   ```python
   import albumentations as A
   from albumentations.pytorch import ToTensorV2

   train_transform = A.Compose([
       A.RandomResizedCrop(224, 224, scale=(0.8, 1.0)),
       A.HorizontalFlip(p=0.5),
       A.Rotate(limit=30, p=0.5),
       A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.3, hue=0.1),
       A.GaussianBlur(blur_limit=(3, 7), p=0.3),
       A.CoarseDropout(max_holes=8, max_height=16, max_width=16, p=0.2),
       A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
       ToTensorV2(),
   ])

   val_transform = A.Compose([
       A.Resize(256, 256),
       A.CenterCrop(224, 224),
       A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
       ToTensorV2(),
   ])
   ```
3. Create a visualization tool for augmentation preview:
   ```python
   def visualize_augmentations(image_path, transform, n_samples=9):
       """Show a 3x3 grid (n_samples=9) of augmented versions of the same image."""
       # Pass a transform without Normalize/ToTensorV2 so outputs stay displayable
       image = np.array(Image.open(image_path).convert("RGB"))
       fig, axes = plt.subplots(3, 3, figsize=(9, 9))
       for ax in axes.flat:
           ax.imshow(transform(image=image)["image"])
           ax.axis("off")
   ```
4. Save the augmentation configuration to JSON for reproducibility

**Output:**
- `scripts/phase3/augmentation_pipeline.py` - Reusable transform classes
- `output/phase3/augmentation_config.json` - Transform parameters
- `output/phase3/augmentation_samples/` - Visual examples

**Validation:**
- [ ] All augmentations produce valid images (no NaN values, no corruption)
- [ ] Augmented images look visually reasonable (not over-augmented)
- [ ] Transforms are deterministic when seeded
- [ ] Pipeline runs at >100 images/second on CPU

---

### Task 3.4: Balance Underrepresented Classes

**Objective:** Create augmented variants to address class imbalance.

**Actions:**

1. Create `scripts/phase3/analyze_class_balance.py` to:
   - Count images per class in the training set
   - Calculate the imbalance ratio (max_class / min_class)
   - Identify underrepresented classes (below median - 1 std)
   - Visualize the class distribution
2. Create `scripts/phase3/oversample_minority.py` to:
   - Define target samples per class (e.g., the median count)
   - Generate augmented copies for minority classes
   - Apply stronger augmentation for synthetic samples
   - Track original vs. augmented counts
3. Implement oversampling strategies:
   ```python
   class BalancingStrategy:
       """Strategies for handling class imbalance."""

       @staticmethod
       def oversample_to_median(class_counts: dict) -> dict:
           """Oversample minority classes to the median count."""
           median = np.median(list(class_counts.values()))
           targets = {}
           for cls, count in class_counts.items():
               targets[cls] = max(int(median), count)
           return targets

       @staticmethod
       def oversample_to_max(class_counts: dict, cap_ratio=5) -> dict:
           """Oversample to the max count, capped at cap_ratio times the original."""
           max_count = max(class_counts.values())
           targets = {}
           for cls, count in class_counts.items():
               targets[cls] = min(max_count, count * cap_ratio)
           return targets
   ```
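Whichever strategy produces the targets, the number of augmented copies to generate per class is just the difference from the current count. A minimal sketch (the species names and counts here are illustrative, not from the dataset):

```python
import numpy as np

def augmentation_deficit(class_counts: dict, targets: dict) -> dict:
    """Number of augmented copies to generate for each class."""
    return {cls: targets[cls] - class_counts[cls] for cls in class_counts}

# Illustrative counts: oversample every class up to the median (300)
counts = {"quercus_robur": 500, "acer_palmatum": 300, "acer_rubrum": 100}
median = int(np.median(list(counts.values())))
targets = {cls: max(median, n) for cls, n in counts.items()}
print(augmentation_deficit(counts, targets))
# → {'quercus_robur': 0, 'acer_palmatum': 0, 'acer_rubrum': 200}
```

Classes already at or above the target get a deficit of 0, so only true minority classes receive synthetic samples.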
4. Generate a balanced training manifest:
   - Include original images
   - Add paths to augmented copies
   - Mark augmented images in the manifest (for analysis)

**Output:**
- `datasets/processed/balanced/train/` - Balanced training set
- `output/phase3/class_balance_before.json` - Original distribution
- `output/phase3/class_balance_after.json` - Balanced distribution
- `output/phase3/balance_histogram.png` - Visual comparison

**Validation:**
- [ ] Imbalance ratio reduced to < 10:1 (max:min)
- [ ] No class has fewer than 50 training samples
- [ ] Augmented images are visually distinct from the originals
- [ ] Total training set size documented

---

### Task 3.5: Generate Image Manifest Files

**Objective:** Create mapping files for the training pipeline.

**Actions:**

1. Create `scripts/phase3/generate_manifests.py` to produce:

   **CSV format (for a custom PyTorch dataset):**
   ```csv
   path,label,scientific_name,plant_id,source,is_augmented
   train/images/quercus_robur_001.jpg,42,Quercus robur,QR001,inaturalist,false
   train/images/quercus_robur_002_aug.jpg,42,Quercus robur,QR001,augmented,true
   ```

   **JSON format (detailed metadata):**
   ```json
   {
     "train": [
       {
         "path": "train/images/quercus_robur_001.jpg",
         "label": 42,
         "scientific_name": "Quercus robur",
         "common_name": "English Oak",
         "plant_id": "QR001",
         "source": "inaturalist",
         "is_augmented": false,
         "original_path": null
       }
     ]
   }
   ```
2. Generate a label mapping file:
   ```json
   {
     "label_to_name": {"0": "Acer palmatum", "1": "Acer rubrum", ...},
     "name_to_label": {"Acer palmatum": 0, "Acer rubrum": 1, ...},
     "label_to_common": {"0": "Japanese Maple", ...}
   }
   ```
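Building the label mapping from a sorted class list keeps labels consecutive and deterministic. A minimal sketch (the helper name and example species are illustrative):

```python
import json

def build_label_mapping(class_names: list, common_names: dict) -> dict:
    """Build the three lookup tables; labels are consecutive ints from 0."""
    ordered = sorted(class_names)  # sorting makes the mapping deterministic
    return {
        "label_to_name": {str(i): name for i, name in enumerate(ordered)},
        "name_to_label": {name: i for i, name in enumerate(ordered)},
        "label_to_common": {str(i): common_names.get(name, "") for i, name in enumerate(ordered)},
    }

mapping = build_label_mapping(
    ["Acer rubrum", "Acer palmatum"],
    {"Acer palmatum": "Japanese Maple", "Acer rubrum": "Red Maple"},
)
print(json.dumps(mapping, indent=2))
```

Note that JSON object keys must be strings, which is why `label_to_name` and `label_to_common` use `str(i)` while `name_to_label` can keep integer values.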
3. Create split statistics:
   - Total images per split
   - Classes per split
   - Images per class per split

**Output:**
- `datasets/processed/train_manifest.csv`
- `datasets/processed/val_manifest.csv`
- `datasets/processed/test_manifest.csv`
- `datasets/processed/label_mapping.json`
- `output/phase3/manifest_statistics.json`

**Validation:**
- [ ] All image paths in the manifests exist on disk
- [ ] Labels are consecutive integers starting from 0
- [ ] No duplicate entries in the manifests
- [ ] Split sizes match the expected counts
- [ ] Label mapping covers all classes

---

### Task 3.6: Validate Dataset Integrity

**Objective:** Final verification of the processed dataset.

**Actions:**

1. Create `scripts/phase3/validate_dataset.py` to run comprehensive checks:

   **File Integrity:**
   - All manifest paths exist
   - All images load without error
   - All images have the correct dimensions
   - File permissions allow read access

   **Label Consistency:**
   - Labels match between the manifest and directory structure
   - All labels have corresponding class names
   - No orphaned images (in the directory but not the manifest)
   - No missing images (in the manifest but not the directory)

   **Dataset Statistics:**
   - Per-class image counts
   - Train/val/test split ratios
   - Augmented vs. original ratio
   - File size distribution

   **Sample Verification:**
   - Random sample of 100 images per split
   - Verify image content matches the label (using a pretrained model)
   - Flag potential mislabels for review
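The file-integrity checks above can be sketched roughly as follows (a minimal version, assuming the CSV manifest format from Task 3.5; the function name and report shape are illustrative):

```python
import csv
from pathlib import Path

from PIL import Image

def check_manifest(manifest_path: str, root: str, expected_size=(224, 224)) -> list:
    """Return a list of problem strings; an empty list means all checks pass."""
    problems = []
    with open(manifest_path, newline="") as f:
        for row in csv.DictReader(f):
            path = Path(root) / row["path"]
            if not path.exists():
                problems.append(f"missing file: {path}")
                continue
            try:
                with Image.open(path) as img:
                    img.verify()  # catches truncated/corrupt files
                with Image.open(path) as img:  # reopen: verify() invalidates the handle
                    if img.size != expected_size:
                        problems.append(f"bad size {img.size}: {path}")
            except Exception as exc:
                problems.append(f"unreadable ({exc}): {path}")
    return problems
```

Pillow's `verify()` leaves the file handle unusable, hence the second `Image.open` for the dimension check.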
2. Create `scripts/phase3/repair_dataset.py` for common fixes:
   - Remove entries with missing files
   - Fix incorrect labels (with confirmation)
   - Regenerate corrupted augmentations

**Output:**
- `output/phase3/validation_report.json` - Full validation results
- `output/phase3/validation_summary.md` - Human-readable summary
- `output/phase3/flagged_for_review.json` - Potential issues

**Validation:**
- [ ] 0 missing files
- [ ] 0 corrupted images
- [ ] 0 dimension mismatches
- [ ] <1% potential mislabels flagged
- [ ] All metadata fields populated

---

## End-of-Phase Validation Checklist

Run `scripts/phase3/validate_phase3.py` to verify all criteria:

### Image Processing Validation

| # | Criterion | Target | Status |
|---|-----------|--------|--------|
| 1 | All images standardized to target size | 100% at 224x224 (or configured size) | [ ] |
| 2 | All images in RGB format | 100% RGB, 3 channels | [ ] |
| 3 | No corrupted images | 0 unreadable files | [ ] |
| 4 | Normalization applied correctly | Values in expected range | [ ] |

### Augmentation Validation

| # | Criterion | Target | Status |
|---|-----------|--------|--------|
| 5 | Augmentation pipeline functional | All transforms produce valid output | [ ] |
| 6 | Augmentation reproducible | Same seed = same output | [ ] |
| 7 | Augmentation performance | >100 images/sec on CPU | [ ] |
| 8 | Visual quality | Spot check passes (50 random samples) | [ ] |

### Class Balance Validation

| # | Criterion | Target | Status |
|---|-----------|--------|--------|
| 9 | Class imbalance ratio | < 10:1 (max:min) | [ ] |
| 10 | Minimum class size | ≥50 images per class in train | [ ] |
| 11 | Augmentation ratio | Augmented ≤ 4x original per class | [ ] |

### Manifest Validation

| # | Criterion | Target | Status |
|---|-----------|--------|--------|
| 12 | Manifest completeness | 100% of images have manifest entries | [ ] |
| 13 | Path validity | 100% of manifest paths exist | [ ] |
| 14 | Label consistency | Labels match directory structure | [ ] |
| 15 | No duplicates | 0 duplicate entries | [ ] |
| 16 | Label mapping complete | All labels have names | [ ] |

### Dataset Statistics

| Metric | Expected | Actual | Status |
|--------|----------|--------|--------|
| Total processed images | 50,000 - 200,000 | | [ ] |
| Training set size | ~70% of total | | [ ] |
| Validation set size | ~15% of total | | [ ] |
| Test set size | ~15% of total | | [ ] |
| Number of classes | 200 - 500 | | [ ] |
| Avg images per class (train) | 100 - 400 | | [ ] |
| Image file size (avg) | 30-100 KB | | [ ] |
| Total dataset size | 10-50 GB | | [ ] |

---

## Phase 3 Completion Checklist

- [ ] Task 3.1: Images standardized to target dimensions
- [ ] Task 3.2: Color channels normalized and formats unified
- [ ] Task 3.3: Augmentation pipeline implemented and tested
- [ ] Task 3.4: Class imbalance addressed through oversampling
- [ ] Task 3.5: Manifest files generated for all splits
- [ ] Task 3.6: Dataset integrity validated
- [ ] All 16 validation criteria pass
- [ ] Dataset statistics documented
- [ ] Augmentation config saved for reproducibility
- [ ] Ready for Phase 4 (Model Architecture Selection)

---

## Scripts Summary

| Script | Task | Input | Output |
|--------|------|-------|--------|
| `standardize_dimensions.py` | 3.1 | Raw images | Resized images |
| `normalize_images.py` | 3.2 | Resized images | Normalized images |
| `augmentation_pipeline.py` | 3.3 | Images | Transform classes |
| `analyze_class_balance.py` | 3.4 | Train manifest | Balance report |
| `oversample_minority.py` | 3.4 | Imbalanced set | Balanced set |
| `generate_manifests.py` | 3.5 | Processed images | CSV/JSON manifests |
| `validate_dataset.py` | 3.6 | Full dataset | Validation report |
| `validate_phase3.py` | Final | All outputs | Pass/Fail report |

---

## Dependencies

```
# requirements-phase3.txt
Pillow>=9.0.0
numpy>=1.24.0
albumentations>=1.3.0
torch>=2.0.0
torchvision>=0.15.0
opencv-python>=4.7.0
pandas>=2.0.0
tqdm>=4.65.0
matplotlib>=3.7.0
scikit-learn>=1.2.0
imagehash>=4.3.0
```

---

## Directory Structure After Phase 3

```
datasets/
├── raw/                      # Original downloaded images (Phase 2)
├── organized/                # Organized by species (Phase 2)
├── verified/                 # Quality-checked (Phase 2)
├── train/                    # Train split (Phase 2)
├── val/                      # Validation split (Phase 2)
├── test/                     # Test split (Phase 2)
└── processed/                # Phase 3 output
    ├── 224x224/              # Standardized size
    │   ├── train/
    │   │   └── images/
    │   ├── val/
    │   │   └── images/
    │   └── test/
    │       └── images/
    ├── balanced/             # Class-balanced training
    │   └── train/
    │       └── images/
    ├── train_manifest.csv
    ├── val_manifest.csv
    ├── test_manifest.csv
    ├── label_mapping.json
    └── augmentation_config.json

output/phase3/
├── dimension_report.json
├── color_conversion_log.json
├── augmentation_config.json
├── augmentation_samples/
├── class_balance_before.json
├── class_balance_after.json
├── balance_histogram.png
├── manifest_statistics.json
├── validation_report.json
├── validation_summary.md
└── flagged_for_review.json
```

---

## Risk Mitigation

| Risk | Mitigation |
|------|------------|
| Disk space exhaustion | Monitor disk usage, compress images, delete raw files after processing |
| Memory errors with large batches | Process in batches of 1000, use memory-mapped files |
| Augmentation too aggressive | Visual review, conservative defaults, configurable parameters |
| Class imbalance persists | Multiple oversampling strategies, weighted loss in training |
| Slow processing | Multiprocessing, GPU acceleration for transforms |
| Reproducibility issues | Save all configs, use fixed random seeds, version control |

---

## Performance Optimization Tips

1. **Batch Processing:** Process images in parallel using multiprocessing
2. **Memory Efficiency:** Use generators; don't load all images at once
3. **Disk I/O:** Use an SSD, batch writes, and memory-mapped files
4. **Image Loading:** Use Pillow-SIMD or OpenCV for speed
5. **Augmentation:** Apply on the fly during training (saves disk space)

---

## Notes

- Consider saving the augmentation config separately from applying augmentations
- On-the-fly augmentation during training is often preferred over pre-generating
- Keep the original, unaugmented test set for fair evaluation
- Document any excluded images and the reasons for exclusion
- Save random seeds for all operations
- Phase 4 will select the model architecture based on the processed dataset size
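The on-the-fly approach can be sketched as a map-style dataset that reads the Task 3.5 CSV manifest and applies the transform at load time. This is a minimal illustration (the class name is hypothetical; in practice it would subclass `torch.utils.data.Dataset` so a `DataLoader` can batch it):

```python
import csv

import numpy as np
from PIL import Image

class ManifestDataset:
    """Map-style dataset over the Task 3.5 CSV manifest; augments at load time.

    In real training code, subclass torch.utils.data.Dataset and wrap
    in a DataLoader; only __len__ and __getitem__ are needed here.
    """

    def __init__(self, manifest_path, root, transform=None):
        with open(manifest_path, newline="") as f:
            self.rows = list(csv.DictReader(f))
        self.root = root
        self.transform = transform  # e.g., an Albumentations Compose from Task 3.3

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        image = np.array(Image.open(f"{self.root}/{row['path']}").convert("RGB"))
        if self.transform is not None:
            # Albumentations-style call; returns a dict with the augmented image
            image = self.transform(image=image)["image"]
        return image, int(row["label"])
```

Because each epoch re-draws random augmentations, the model sees fresh variants of every training image without any extra files on disk; only the balanced oversampling from Task 3.4 needs pre-generated copies.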