Cleaning Invalid Images Before Training a Machine Learning Model


When training machine learning models on image data, one common issue that can halt your progress is invalid image files. A single corrupted file can raise an error mid-epoch, interrupting training and wasting valuable time. This post walks through a Python script that cleans up your dataset by identifying and handling invalid images before you start training your model.

The Problem

During the training of a machine learning model, if an invalid image is encountered, it can lead to errors like:

InvalidArgumentError: Graph execution error: jpeg::Uncompress failed. Invalid JPEG data or crop window.

Such errors can stop the training process, causing significant delays. To prevent this, we need to pre-process the dataset and remove or handle invalid images.
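You can reproduce this failure mode in isolation, outside of any training loop. The snippet below is a minimal sketch (the temp-file path and placeholder bytes are arbitrary): it writes non-JPEG bytes to a file with a .jpg extension, then attempts to decode it the same way a training pipeline would.

```python
import os
import tempfile
import tensorflow as tf

# Write non-JPEG bytes to a file with a .jpg extension to simulate a
# corrupt or mislabeled image in the dataset.
tmp_dir = tempfile.mkdtemp()
corrupt_path = os.path.join(tmp_dir, "corrupt.jpg")
tf.io.write_file(corrupt_path, b"this is not JPEG data")

try:
    tf.io.decode_jpeg(tf.io.read_file(corrupt_path))
except tf.errors.InvalidArgumentError as e:
    # This is the same class of error that aborts training mid-epoch.
    print(f"Decode failed: {e}")
```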

The Solution

The following Python script traverses a directory containing images, attempts to read each image, and handles invalid images based on the specified option. The script can either quarantine the invalid images by moving them to a separate folder, delete them, or simply log their file paths.

import os
import tensorflow as tf

def clean_image_filepaths(image_dir, removal_option="quarantine"):
    """Identifies invalid images in a directory and handles them.

    Args:
        image_dir: The path to the directory containing image files.
        removal_option: The method to handle invalid images. Options are:
            "quarantine": Move them to a separate folder ("invalid_images").
            "delete": Permanently delete them (use with caution!).
            "log": Only log their file paths and keep them in place.

    Returns:
        A tuple of (valid_image_paths, invalid_paths).
    """

    valid_image_paths = []
    invalid_paths = []  # Track invalid paths for logging or quarantine

    for dirpath, dirnames, filenames in os.walk(image_dir):
        # Skip the quarantine folder so moved files are not rescanned.
        dirnames[:] = [d for d in dirnames if d != "invalid_images"]
        print(f"Scanning directory: {dirpath}")

        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            try:
                img = tf.io.read_file(filepath)
                img = tf.io.decode_jpeg(img)  # Adjust for other formats if needed
                valid_image_paths.append(filepath)
            except Exception as e:
                invalid_paths.append(filepath)
                print(f"Invalid image: {filepath} ({e})")  # Include error message

                # Handle invalid images based on removal_option:
                if removal_option == "quarantine":
                    new_path = os.path.join(image_dir, "invalid_images", filename)
                    os.makedirs(os.path.dirname(new_path), exist_ok=True)
                    os.rename(filepath, new_path)
                elif removal_option == "delete":
                    os.remove(filepath)
                    print(f"Deleted invalid image: {filepath}")
                elif removal_option == "log":
                    # Just log the invalid path without taking action
                    pass
                else:
                    print(f"Invalid removal option: {removal_option}")

    return valid_image_paths, invalid_paths

# Usage example:
image_dir = "/path/to/your/dataset"
valid_paths, invalid_paths = clean_image_filepaths(image_dir, removal_option="quarantine")
print(f"Valid paths: {valid_paths}")
print(f"Invalid paths (quarantined): {invalid_paths}")

How It Works

  1. Scan the Directory: The script walks through the directory containing your images.

  2. Attempt to Read Each Image: For each image file, it attempts to read and decode the image using TensorFlow functions.

  3. Handle Invalid Images: If an image cannot be read, it is considered invalid, and the script will handle it based on the specified removal option:

    • Quarantine: Move the invalid images to a separate folder named "invalid_images".

    • Delete: Permanently delete the invalid images.

    • Log: Log the file paths of invalid images and keep them in place.

Usage

To use this script, specify the directory containing your images and choose a removal option. For instance, to delete invalid images:

image_dir = "/path/to/your/dataset"
valid_paths, invalid_paths = clean_image_filepaths(image_dir, removal_option="delete")
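Once you have the list of valid paths, you can feed it directly into a tf.data pipeline. The helper below is a sketch of one way to do this; the image size and batch size are illustrative defaults, not values from the script above.

```python
import tensorflow as tf

def make_dataset(valid_paths, img_size=(224, 224), batch_size=32):
    """Builds a tf.data pipeline from pre-validated image paths."""
    def load(path):
        img = tf.io.read_file(path)
        img = tf.io.decode_jpeg(img, channels=3)
        return tf.image.resize(img, img_size)

    ds = tf.data.Dataset.from_tensor_slices(valid_paths)
    ds = ds.map(load, num_parallel_calls=tf.data.AUTOTUNE)
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```

Because every path has already passed the decode check, the `map` step should not hit the `InvalidArgumentError` that interrupts training.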

Handling Errors During Training

If you still encounter errors during training, the cause may be something other than an undecodable file, such as an image saved in a different format than its extension suggests (a PNG renamed to .jpg, for example) or a problem elsewhere in the pipeline. Ensure that all images are in the format your decoder expects and verify the integrity of your dataset.
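As a lightweight integrity check that avoids decoding every file, you can inspect JPEG structural markers with the standard library alone. This is a quick pre-filter, not a replacement for the decode check above: it catches truncated or mislabeled files but will not detect every kind of corruption.

```python
import os

def is_plausible_jpeg(filepath):
    """Cheap structural check: a JPEG should begin with the SOI marker
    (0xFFD8) and end with the EOI marker (0xFFD9). Passing this check
    does not guarantee the file decodes, but failing it is a strong
    signal the file is truncated or not a JPEG at all."""
    try:
        with open(filepath, "rb") as f:
            start = f.read(2)
            f.seek(-2, os.SEEK_END)  # raises OSError on files < 2 bytes
            end = f.read(2)
    except OSError:
        return False
    return start == b"\xff\xd8" and end == b"\xff\xd9"
```

Running this filter first lets you reserve the slower TensorFlow decode pass for files that at least look like JPEGs.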

By cleaning your dataset before training, you can avoid interruptions and ensure a smoother training process. Happy training!