# Phase 3: Dataset Preprocessing & Augmentation - Implementation Plan

## Overview

**Goal:** Prepare images for training with consistent formatting and an augmentation pipeline.

**Prerequisites:** Phase 2 complete - `datasets/train/`, `datasets/val/`, `datasets/test/` directories with manifests

**Target Deliverable:** Training-ready dataset with standardized dimensions, normalized values, and an augmentation pipeline

---

## Task Breakdown

### Task 3.1: Standardize Image Dimensions

**Objective:** Resize all images to consistent dimensions for model input.

**Actions:**

1. Create `scripts/phase3/standardize_dimensions.py` to:
   - Load images from the train/val/test directories
   - Resize to the target dimension (224x224 for MobileNetV3, 299x299 for EfficientNet)
   - Preserve aspect ratio with center crop or letterboxing
   - Save resized images to a new directory structure
2. Support multiple output sizes:
   ```python
   TARGET_SIZES = {
       "mobilenet": (224, 224),
       "efficientnet": (299, 299),
       "vit": (384, 384),
   }
   ```
3. Implement resize strategies:
   - **center_crop:** Crop to square, then resize (preserves detail)
   - **letterbox:** Pad to square, then resize (preserves full image)
   - **stretch:** Direct resize (fastest, may distort)
4. Output directory structure:
   ```
   datasets/
   ├── processed/
   │   └── 224x224/
   │       ├── train/
   │       ├── val/
   │       └── test/
   ```

**Output:**
- `datasets/processed/{size}/` directories
- `output/phase3/dimension_report.json` - Processing statistics

**Validation:**
- [ ] All images in the processed directory are exactly the target dimensions
- [ ] No corrupt images (all readable by PIL)
- [ ] Image count matches the source (no images lost)
- [ ] Processing time logged as a performance baseline

---

### Task 3.2: Normalize Color Channels

**Objective:** Standardize pixel values and handle format variations.

**Actions:**
1. Create `scripts/phase3/normalize_images.py` to:
   - Convert all images to RGB (handle RGBA, grayscale, CMYK)
   - Apply ImageNet normalization (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
   - Handle various input formats (JPEG, PNG, WebP, HEIC)
   - Save in a consistent format (JPEG at quality 95, or PNG for lossless)
2. Implement color normalization:
   ```python
   def normalize_image(image: np.ndarray) -> np.ndarray:
       """Normalize an RGB image for model input."""
       image = image.astype(np.float32) / 255.0
       mean = np.array([0.485, 0.456, 0.406])
       std = np.array([0.229, 0.224, 0.225])
       return (image - mean) / std
   ```
3. Create a preprocessing pipeline class:
   ```python
   class ImagePreprocessor:
       def __init__(self, target_size, normalize=True):
           self.target_size = target_size
           self.normalize = normalize

       def __call__(self, image_path: str) -> np.ndarray:
           # Load, convert to RGB, resize, and optionally normalize
           image = Image.open(image_path).convert("RGB")
           image = image.resize(self.target_size, Image.Resampling.BILINEAR)
           array = np.asarray(image)
           return normalize_image(array) if self.normalize else array
   ```
4. Handle edge cases:
   - Grayscale → convert to RGB by duplicating channels
   - RGBA → remove the alpha channel, compositing on white
   - CMYK → convert to the RGB color space
   - 16-bit images → convert to 8-bit

**Output:**
- Updated processed images with consistent color handling
- `output/phase3/color_conversion_log.json` - Format conversion statistics

**Validation:**
- [ ] All images have exactly 3 color channels (RGB)
- [ ] Pixel values in the expected range after normalization
- [ ] No format conversion errors
- [ ] Color fidelity maintained (visual spot check on 50 random images)

---

### Task 3.3: Implement Data Augmentation Pipeline

**Objective:** Create augmentation transforms to increase training data variety.

**Actions:**
1. Create `scripts/phase3/augmentation_pipeline.py` with transforms:

   **Geometric Transforms:**
   - Random rotation: -30° to +30°
   - Random horizontal flip: 50% probability
   - Random vertical flip: 10% probability (some plants are naturally upside-down)
   - Random crop: 80-100% of the image, then resize back
   - Random perspective: slight perspective distortion

   **Color Transforms:**
   - Random brightness: ±20%
   - Random contrast: ±20%
   - Random saturation: ±30%
   - Random hue shift: ±10%
   - Color jitter (combined)

   **Blur/Noise Transforms:**
   - Gaussian blur: kernel 3-7, 30% probability
   - Motion blur: 10% probability
   - Gaussian noise: σ=0.01-0.05, 20% probability

   **Occlusion Transforms:**
   - Random erasing (cutout): 10-30% area, 20% probability
   - Grid dropout: 10% probability

2. Implement using PyTorch or Albumentations:
   ```python
   import albumentations as A
   from albumentations.pytorch import ToTensorV2

   train_transform = A.Compose([
       A.RandomResizedCrop(224, 224, scale=(0.8, 1.0)),
       A.HorizontalFlip(p=0.5),
       A.Rotate(limit=30, p=0.5),
       A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.3, hue=0.1),
       A.GaussianBlur(blur_limit=(3, 7), p=0.3),
       A.CoarseDropout(max_holes=8, max_height=16, max_width=16, p=0.2),
       A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
       ToTensorV2(),
   ])

   val_transform = A.Compose([
       A.Resize(256, 256),
       A.CenterCrop(224, 224),
       A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
       ToTensorV2(),
   ])
   ```
3. Create a visualization tool for augmentation preview:
   ```python
   def visualize_augmentations(image_path, transform, n_samples=9):
       """Show a 3x3 grid (n_samples=9) of augmented versions of the same image."""
       # Pass a transform without Normalize/ToTensorV2 so outputs stay displayable
       image = np.array(Image.open(image_path).convert("RGB"))
       fig, axes = plt.subplots(3, 3, figsize=(9, 9))
       for ax in axes.flat:
           ax.imshow(transform(image=image)["image"])
           ax.axis("off")
   ```
4. Save the augmentation configuration to JSON for reproducibility

**Output:**
- `scripts/phase3/augmentation_pipeline.py` - Reusable transform classes
- `output/phase3/augmentation_config.json` - Transform parameters
- `output/phase3/augmentation_samples/` - Visual examples

**Validation:**
- [ ] All augmentations produce valid images (no NaN values, no corruption)
- [ ] Augmented images look visually reasonable (not over-augmented)
- [ ] Transforms are deterministic when seeded
- [ ] Pipeline runs at >100 images/second on CPU

---

### Task 3.4: Balance Underrepresented Classes

**Objective:** Create augmented variants to address class imbalance.

**Actions:**

1. Create `scripts/phase3/analyze_class_balance.py` to:
   - Count images per class in the training set
   - Calculate the imbalance ratio (max_class / min_class)
   - Identify underrepresented classes (below median - 1 std)
   - Visualize the class distribution
2. Create `scripts/phase3/oversample_minority.py` to:
   - Define target samples per class (e.g., the median count)
   - Generate augmented copies for minority classes
   - Apply stronger augmentation for synthetic samples
   - Track original vs. augmented counts
3. Implement oversampling strategies:
   ```python
   class BalancingStrategy:
       """Strategies for handling class imbalance."""

       @staticmethod
       def oversample_to_median(class_counts: dict) -> dict:
           """Oversample minority classes to the median count."""
           median = np.median(list(class_counts.values()))
           targets = {}
           for cls, count in class_counts.items():
               targets[cls] = max(int(median), count)
           return targets

       @staticmethod
       def oversample_to_max(class_counts: dict, cap_ratio=5) -> dict:
           """Oversample to the max count, capped at cap_ratio times the original."""
           max_count = max(class_counts.values())
           targets = {}
           for cls, count in class_counts.items():
               targets[cls] = min(max_count, count * cap_ratio)
           return targets
   ```
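Whichever strategy produces the targets, the number of augmented copies to generate per class is just the difference from the current count. A minimal sketch (the species names and counts here are illustrative, not from the dataset):

```python
import numpy as np

def augmentation_deficit(class_counts: dict, targets: dict) -> dict:
    """Number of augmented copies to generate for each class."""
    return {cls: targets[cls] - class_counts[cls] for cls in class_counts}

# Illustrative counts: oversample every class up to the median (300)
counts = {"quercus_robur": 500, "acer_palmatum": 300, "acer_rubrum": 100}
median = int(np.median(list(counts.values())))
targets = {cls: max(median, n) for cls, n in counts.items()}
print(augmentation_deficit(counts, targets))
# → {'quercus_robur': 0, 'acer_palmatum': 0, 'acer_rubrum': 200}
```

Classes already at or above the target get a deficit of 0, so only true minority classes receive synthetic samples.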
4. Generate a balanced training manifest:
   - Include original images
   - Add paths to augmented copies
   - Mark augmented images in the manifest (for analysis)

**Output:**
- `datasets/processed/balanced/train/` - Balanced training set
- `output/phase3/class_balance_before.json` - Original distribution
- `output/phase3/class_balance_after.json` - Balanced distribution
- `output/phase3/balance_histogram.png` - Visual comparison

**Validation:**
- [ ] Imbalance ratio reduced to < 10:1 (max:min)
- [ ] No class has fewer than 50 training samples
- [ ] Augmented images are visually distinct from the originals
- [ ] Total training set size documented

---

### Task 3.5: Generate Image Manifest Files

**Objective:** Create mapping files for the training pipeline.

**Actions:**

1. Create `scripts/phase3/generate_manifests.py` to produce:

   **CSV format (for a custom PyTorch dataset):**
   ```csv
   path,label,scientific_name,plant_id,source,is_augmented
   train/images/quercus_robur_001.jpg,42,Quercus robur,QR001,inaturalist,false
   train/images/quercus_robur_002_aug.jpg,42,Quercus robur,QR001,augmented,true
   ```

   **JSON format (detailed metadata):**
   ```json
   {
     "train": [
       {
         "path": "train/images/quercus_robur_001.jpg",
         "label": 42,
         "scientific_name": "Quercus robur",
         "common_name": "English Oak",
         "plant_id": "QR001",
         "source": "inaturalist",
         "is_augmented": false,
         "original_path": null
       }
     ]
   }
   ```
2. Generate a label mapping file:
   ```json
   {
     "label_to_name": {"0": "Acer palmatum", "1": "Acer rubrum", ...},
     "name_to_label": {"Acer palmatum": 0, "Acer rubrum": 1, ...},
     "label_to_common": {"0": "Japanese Maple", ...}
   }
   ```
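Building the label mapping from a sorted class list keeps labels consecutive and deterministic. A minimal sketch (the helper name and example species are illustrative):

```python
import json

def build_label_mapping(class_names: list, common_names: dict) -> dict:
    """Build the three lookup tables; labels are consecutive ints from 0."""
    ordered = sorted(class_names)  # sorting makes the mapping deterministic
    return {
        "label_to_name": {str(i): name for i, name in enumerate(ordered)},
        "name_to_label": {name: i for i, name in enumerate(ordered)},
        "label_to_common": {str(i): common_names.get(name, "") for i, name in enumerate(ordered)},
    }

mapping = build_label_mapping(
    ["Acer rubrum", "Acer palmatum"],
    {"Acer palmatum": "Japanese Maple", "Acer rubrum": "Red Maple"},
)
print(json.dumps(mapping, indent=2))
```

Note that JSON object keys must be strings, which is why `label_to_name` and `label_to_common` use `str(i)` while `name_to_label` can keep integer values.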
3. Create split statistics:
   - Total images per split
   - Classes per split
   - Images per class per split

**Output:**
- `datasets/processed/train_manifest.csv`
- `datasets/processed/val_manifest.csv`
- `datasets/processed/test_manifest.csv`
- `datasets/processed/label_mapping.json`
- `output/phase3/manifest_statistics.json`

**Validation:**
- [ ] All image paths in the manifests exist on disk
- [ ] Labels are consecutive integers starting from 0
- [ ] No duplicate entries in the manifests
- [ ] Split sizes match the expected counts
- [ ] Label mapping covers all classes

---

### Task 3.6: Validate Dataset Integrity

**Objective:** Final verification of the processed dataset.

**Actions:**

1. Create `scripts/phase3/validate_dataset.py` to run comprehensive checks:

   **File Integrity:**
   - All manifest paths exist
   - All images load without error
   - All images have the correct dimensions
   - File permissions allow read access

   **Label Consistency:**
   - Labels match between the manifest and directory structure
   - All labels have corresponding class names
   - No orphaned images (in the directory but not the manifest)
   - No missing images (in the manifest but not the directory)

   **Dataset Statistics:**
   - Per-class image counts
   - Train/val/test split ratios
   - Augmented vs. original ratio
   - File size distribution

   **Sample Verification:**
   - Random sample of 100 images per split
   - Verify image content matches the label (using a pretrained model)
   - Flag potential mislabels for review
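The file-integrity checks above can be sketched roughly as follows (a minimal version, assuming the CSV manifest format from Task 3.5; the function name and report shape are illustrative):

```python
import csv
from pathlib import Path

from PIL import Image

def check_manifest(manifest_path: str, root: str, expected_size=(224, 224)) -> list:
    """Return a list of problem strings; an empty list means all checks pass."""
    problems = []
    with open(manifest_path, newline="") as f:
        for row in csv.DictReader(f):
            path = Path(root) / row["path"]
            if not path.exists():
                problems.append(f"missing file: {path}")
                continue
            try:
                with Image.open(path) as img:
                    img.verify()  # catches truncated/corrupt files
                with Image.open(path) as img:  # reopen: verify() invalidates the handle
                    if img.size != expected_size:
                        problems.append(f"bad size {img.size}: {path}")
            except Exception as exc:
                problems.append(f"unreadable ({exc}): {path}")
    return problems
```

Pillow's `verify()` leaves the file handle unusable, hence the second `Image.open` for the dimension check.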
2. Create `scripts/phase3/repair_dataset.py` for common fixes:
   - Remove entries with missing files
   - Fix incorrect labels (with confirmation)
   - Regenerate corrupted augmentations

**Output:**
- `output/phase3/validation_report.json` - Full validation results
- `output/phase3/validation_summary.md` - Human-readable summary
- `output/phase3/flagged_for_review.json` - Potential issues

**Validation:**
- [ ] 0 missing files
- [ ] 0 corrupted images
- [ ] 0 dimension mismatches
- [ ] <1% potential mislabels flagged
- [ ] All metadata fields populated

---

## End-of-Phase Validation Checklist

Run `scripts/phase3/validate_phase3.py` to verify all criteria:

### Image Processing Validation

| # | Criterion | Target | Status |
|---|-----------|--------|--------|
| 1 | All images standardized to target size | 100% at 224x224 (or configured size) | [ ] |
| 2 | All images in RGB format | 100% RGB, 3 channels | [ ] |
| 3 | No corrupted images | 0 unreadable files | [ ] |
| 4 | Normalization applied correctly | Values in expected range | [ ] |

### Augmentation Validation

| # | Criterion | Target | Status |
|---|-----------|--------|--------|
| 5 | Augmentation pipeline functional | All transforms produce valid output | [ ] |
| 6 | Augmentation reproducible | Same seed = same output | [ ] |
| 7 | Augmentation performance | >100 images/sec on CPU | [ ] |
| 8 | Visual quality | Spot check passes (50 random samples) | [ ] |

### Class Balance Validation

| # | Criterion | Target | Status |
|---|-----------|--------|--------|
| 9 | Class imbalance ratio | < 10:1 (max:min) | [ ] |
| 10 | Minimum class size | ≥50 images per class in train | [ ] |
| 11 | Augmentation ratio | Augmented ≤ 4x original per class | [ ] |

### Manifest Validation

| # | Criterion | Target | Status |
|---|-----------|--------|--------|
| 12 | Manifest completeness | 100% of images have manifest entries | [ ] |
| 13 | Path validity | 100% of manifest paths exist | [ ] |
| 14 | Label consistency | Labels match directory structure | [ ] |
| 15 | No duplicates | 0 duplicate entries | [ ] |
| 16 | Label mapping complete | All labels have names | [ ] |

### Dataset Statistics

| Metric | Expected | Actual | Status |
|--------|----------|--------|--------|
| Total processed images | 50,000 - 200,000 | | [ ] |
| Training set size | ~70% of total | | [ ] |
| Validation set size | ~15% of total | | [ ] |
| Test set size | ~15% of total | | [ ] |
| Number of classes | 200 - 500 | | [ ] |
| Avg images per class (train) | 100 - 400 | | [ ] |
| Image file size (avg) | 30-100 KB | | [ ] |
| Total dataset size | 10-50 GB | | [ ] |

---

## Phase 3 Completion Checklist

- [ ] Task 3.1: Images standardized to target dimensions
- [ ] Task 3.2: Color channels normalized and formats unified
- [ ] Task 3.3: Augmentation pipeline implemented and tested
- [ ] Task 3.4: Class imbalance addressed through oversampling
- [ ] Task 3.5: Manifest files generated for all splits
- [ ] Task 3.6: Dataset integrity validated
- [ ] All 16 validation criteria pass
- [ ] Dataset statistics documented
- [ ] Augmentation config saved for reproducibility
- [ ] Ready for Phase 4 (Model Architecture Selection)

---

## Scripts Summary

| Script | Task | Input | Output |
|--------|------|-------|--------|
| `standardize_dimensions.py` | 3.1 | Raw images | Resized images |
| `normalize_images.py` | 3.2 | Resized images | Normalized images |
| `augmentation_pipeline.py` | 3.3 | Images | Transform classes |
| `analyze_class_balance.py` | 3.4 | Train manifest | Balance report |
| `oversample_minority.py` | 3.4 | Imbalanced set | Balanced set |
| `generate_manifests.py` | 3.5 | Processed images | CSV/JSON manifests |
| `validate_dataset.py` | 3.6 | Full dataset | Validation report |
| `validate_phase3.py` | Final | All outputs | Pass/Fail report |

---

## Dependencies

```
# requirements-phase3.txt
Pillow>=9.0.0
numpy>=1.24.0
albumentations>=1.3.0
torch>=2.0.0
torchvision>=0.15.0
opencv-python>=4.7.0
pandas>=2.0.0
tqdm>=4.65.0
matplotlib>=3.7.0
scikit-learn>=1.2.0
imagehash>=4.3.0
```

---

## Directory Structure After Phase 3

```
datasets/
├── raw/                      # Original downloaded images (Phase 2)
├── organized/                # Organized by species (Phase 2)
├── verified/                 # Quality-checked (Phase 2)
├── train/                    # Train split (Phase 2)
├── val/                      # Validation split (Phase 2)
├── test/                     # Test split (Phase 2)
└── processed/                # Phase 3 output
    ├── 224x224/              # Standardized size
    │   ├── train/
    │   │   └── images/
    │   ├── val/
    │   │   └── images/
    │   └── test/
    │       └── images/
    ├── balanced/             # Class-balanced training
    │   └── train/
    │       └── images/
    ├── train_manifest.csv
    ├── val_manifest.csv
    ├── test_manifest.csv
    ├── label_mapping.json
    └── augmentation_config.json

output/phase3/
├── dimension_report.json
├── color_conversion_log.json
├── augmentation_config.json
├── augmentation_samples/
├── class_balance_before.json
├── class_balance_after.json
├── balance_histogram.png
├── manifest_statistics.json
├── validation_report.json
├── validation_summary.md
└── flagged_for_review.json
```

---

## Risk Mitigation

| Risk | Mitigation |
|------|------------|
| Disk space exhaustion | Monitor disk usage, compress images, delete raw files after processing |
| Memory errors with large batches | Process in batches of 1000, use memory-mapped files |
| Augmentation too aggressive | Visual review, conservative defaults, configurable parameters |
| Class imbalance persists | Multiple oversampling strategies, weighted loss in training |
| Slow processing | Multiprocessing, GPU acceleration for transforms |
| Reproducibility issues | Save all configs, use fixed random seeds, version control |

---

## Performance Optimization Tips

1. **Batch Processing:** Process images in parallel using multiprocessing
2. **Memory Efficiency:** Use generators; don't load all images at once
3. **Disk I/O:** Use an SSD, batch writes, and memory-mapped files
4. **Image Loading:** Use Pillow-SIMD or OpenCV for speed
5. **Augmentation:** Apply on the fly during training (saves disk space)

---

## Notes

- Consider saving the augmentation config separately from applying augmentations
- On-the-fly augmentation during training is often preferred over pre-generating
- Keep the original, unaugmented test set for fair evaluation
- Document any excluded images and the reasons for exclusion
- Save random seeds for all operations
- Phase 4 will select the model architecture based on the processed dataset size
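The on-the-fly approach can be sketched as a map-style dataset that reads the Task 3.5 CSV manifest and applies the transform at load time. This is a minimal illustration (the class name is hypothetical; in practice it would subclass `torch.utils.data.Dataset` so a `DataLoader` can batch it):

```python
import csv

import numpy as np
from PIL import Image

class ManifestDataset:
    """Map-style dataset over the Task 3.5 CSV manifest; augments at load time.

    In real training code, subclass torch.utils.data.Dataset and wrap
    in a DataLoader; only __len__ and __getitem__ are needed here.
    """

    def __init__(self, manifest_path, root, transform=None):
        with open(manifest_path, newline="") as f:
            self.rows = list(csv.DictReader(f))
        self.root = root
        self.transform = transform  # e.g., an Albumentations Compose from Task 3.3

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        image = np.array(Image.open(f"{self.root}/{row['path']}").convert("RGB"))
        if self.transform is not None:
            # Albumentations-style call; returns a dict with the augmented image
            image = self.transform(image=image)["image"]
        return image, int(row["label"])
```

Because each epoch re-draws random augmentations, the model sees fresh variants of every training image without any extra files on disk; only the balanced oversampling from Task 3.4 needs pre-generated copies.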