In machine learning tasks involving image data, it’s crucial to split your dataset into separate training, validation, and test sets. This ensures that your model is trained on one set of data, tuned and evaluated on a different set (validation), and finally tested on a completely unseen set (test). Manually splitting large image datasets is tedious and error-prone, especially when dealing with thousands or millions of images.
Fortunately, the Python library split-folders provides a convenient way to split image folders into training, validation, and test sets automatically, with customizable ratios. In this article, we’ll explore how to use this library and walk through the process with a practical example.
Split Image Folders into Train, Validation, and Test Sets with Python’s split-folders Library
Step 1: Setting up a Virtual Environment
If you’re using a virtual environment, activate it before proceeding. Assuming the environment lives in a folder named env (as the prompt shown below suggests), the usual activation commands are:
On Unix-based systems (e.g., macOS, Linux):
source env/bin/activate
On Windows:
env\Scripts\activate
You should see (env) preceding your command prompt, indicating that the virtual environment is active.
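If you prefer to confirm the active environment from Python rather than from the prompt, a minimal sketch like the following may help. It assumes the environment was created with venv (or a recent virtualenv); conda environments are not detected this way, and the function name is only illustrative.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtual_environment())
```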
Step 2: Installing split-folders
With your virtual environment activated, you can now install the split-folders library using pip, Python’s package installer:
pip install split-folders
Alternatively, if you’re using Anaconda or Miniconda, you can install the library via conda:
conda install -c conda-forge split-folders
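To check that the installation worked, you can ask Python whether the module can be found. Note that the package is installed as split-folders but imported as splitfolders. The helper below is a small sketch using only the standard library; its name is hypothetical.

```python
import importlib.util


def is_installed(module_name: str = "splitfolders") -> bool:
    """Return True if a top-level module with this name can be imported."""
    # find_spec returns None when no such top-level module is available.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("splitfolders available:", is_installed())
```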
Step 3: Understanding the split-folders Command
The split-folders command is structured as follows:
split-folders --output OUTPUT_DIR --ratio TRAIN_RATIO VAL_RATIO TEST_RATIO -- INPUT_DIR
Let’s break down the different components of this command:
--output OUTPUT_DIR: This argument specifies the directory where the split datasets will be stored. In the example below, we use --output dataset.
--ratio TRAIN_RATIO VAL_RATIO TEST_RATIO: This argument specifies the ratios for splitting the data into training, validation, and test sets, respectively. In the example below, we use --ratio .7 .1 .2, which means 70% of the data will be used for training, 10% for validation, and 20% for testing.
--: This double hyphen marks the end of the options, so the input directory path is not read as another --ratio value.
INPUT_DIR: This is the path to the directory containing the images you want to split. The library expects one subfolder per class inside this directory and splits each class separately.
In our example, we used the following command:
split-folders --output dataset --ratio .7 .1 .2 -- PlantVillage
This command splits the images in the PlantVillage directory into three sets: 70% for training, 10% for validation, and 20% for testing. The split datasets are stored in the dataset directory.
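The same split can also be run from a Python script, since the library exposes a splitfolders.ratio function. The sketch below wraps that call. The wrapper split_dataset and the check_ratio helper are illustrative names, not part of the library, and the seed of 1337 mirrors the default documented in the library's README.

```python
def check_ratio(ratio):
    """Hypothetical helper: validate a train/val or train/val/test ratio."""
    if len(ratio) not in (2, 3):
        raise ValueError("ratio needs 2 (train/val) or 3 (train/val/test) values")
    # Round before comparing so that 0.7 + 0.1 + 0.2 is accepted despite
    # floating-point error.
    if round(sum(ratio), 5) != 1:
        raise ValueError("ratio values must sum to 1")
    return tuple(ratio)


def split_dataset(input_dir, output_dir, ratio=(0.7, 0.1, 0.2), seed=1337):
    """Split input_dir into output_dir/train, /val and /test."""
    # Imported lazily so this snippet loads even if the package is missing.
    import splitfolders  # import name of the split-folders package

    splitfolders.ratio(input_dir, output=output_dir, seed=seed,
                       ratio=check_ratio(ratio))


# Example call mirroring the command above (left commented out so that
# loading this snippet does not copy any files):
# split_dataset("PlantVillage", "dataset")
```

Passing a fixed seed keeps the shuffle reproducible, so re-running the script should give the same assignment of files to sets.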
Step 4: Executing the Command
When you execute the split-folders command, you should see an output similar to the following:
Copying files: 2152 files [00:00, 2177.87 files/s]
This output indicates that the library is copying the files from the input directory (PlantVillage) into the specified sets. Because the files are copied rather than moved, the original PlantVillage folder is left unchanged.
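To get a feel for what happens during this copying step, here is a simplified sketch of a per-class ratio split written with the standard library only. It is not the library's actual implementation: the function name, the use of integer percentages, and the rounding-down of cut points are all assumptions made for illustration.

```python
import random
import shutil
from pathlib import Path


def split_class_folders(input_dir, output_dir, percents=(70, 10, 20), seed=1337):
    """Copy input_dir/<class>/* into output_dir/{train,val,test}/<class>/."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    rng = random.Random(seed)  # fixed seed keeps the shuffle reproducible
    for class_dir in sorted(p for p in input_dir.iterdir() if p.is_dir()):
        files = sorted(f for f in class_dir.iterdir() if f.is_file())
        rng.shuffle(files)
        # Integer arithmetic avoids floating-point surprises; cut points
        # are rounded down and the remainder goes to the test set.
        train_end = len(files) * percents[0] // 100
        val_end = len(files) * (percents[0] + percents[1]) // 100
        parts = {
            "train": files[:train_end],
            "val": files[train_end:val_end],
            "test": files[val_end:],
        }
        for split_name, split_files in parts.items():
            target = output_dir / split_name / class_dir.name
            target.mkdir(parents=True, exist_ok=True)
            for f in split_files:
                shutil.copy2(f, target / f.name)  # copy, never move
```

Each class folder is shuffled and cut on its own, which keeps the class proportions roughly equal across the three sets.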
Step 5: Exploring the Output
After the command has finished executing, you can explore the output directory (dataset in our example) and its subdirectories:
dataset
├── train
├── val
└── test
The train directory contains approximately 70% of the original images (in our example, around 1,506 files), the val directory contains approximately 10% (around 215 files), and the test directory contains approximately 20% (around 430 files).
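Rather than counting files by hand, you can verify these numbers with a few lines of Python. The sketch below assumes the train, val, and test folder names shown above; the helper names are hypothetical.

```python
from pathlib import Path


def count_files(output_dir, splits=("train", "val", "test")):
    """Count the files under each split folder, including class subfolders."""
    counts = {}
    for split in splits:
        split_dir = Path(output_dir) / split
        if split_dir.is_dir():
            counts[split] = sum(1 for p in split_dir.rglob("*") if p.is_file())
        else:
            counts[split] = 0  # a missing folder counts as empty
    return counts


def shares(counts):
    """Convert absolute counts into fractions of the total."""
    total = sum(counts.values())
    return {name: (n / total if total else 0.0) for name, n in counts.items()}


# Example for the dataset folder used in this article:
# counts = count_files("dataset")
# print(counts, shares(counts))
```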
Customizing the Command
The split-folders command is flexible: you can customize the ratios for splitting the data according to your needs. For example, if you want a 60-20-20 split for train, validation, and test sets, respectively, you would use the following command:
split-folders --output my_dataset --ratio .6 .2 .2 -- my_image_folder
This command will split the images in the my_image_folder directory into three sets: 60% for training (my_dataset/train), 20% for validation (my_dataset/val), and 20% for testing (my_dataset/test).
The --output argument sets the name of the output directory, which is my_dataset in this example.
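Before committing to a ratio, it can help to estimate how many files each set will receive. The following sketch does that arithmetic with integer percentages. It assumes cut points are rounded down and the remainder goes to the test set, which may differ slightly from what the tool produces, particularly because the tool splits each class folder separately.

```python
def expected_counts(n_files, percents=(60, 20, 20)):
    """Estimate (train, val, test) sizes for n_files and integer percentages."""
    if sum(percents) != 100:
        raise ValueError("percentages must add up to 100")
    # Integer division rounds the cut points down.
    train_end = n_files * percents[0] // 100
    val_end = n_files * (percents[0] + percents[1]) // 100
    return train_end, val_end - train_end, n_files - val_end


print(expected_counts(1000))                 # 60-20-20 split of 1,000 files
print(expected_counts(2152, (70, 10, 20)))   # 70-10-20 split of 2,152 files
```

For 1,000 files the 60-20-20 split yields 600, 200, and 200 files; for 2,152 files the 70-10-20 split yields 1,506, 215, and 431.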
Conclusion
The split-folders library is a practical tool for splitting image datasets into training, validation, and test sets with customizable ratios. By following the steps outlined in this article, you can easily split your image folders and prepare your data for machine learning tasks.
Explore the split-folders library further and consider using it in your own machine learning projects involving image data.