# Splits and subsets

Machine learning datasets are commonly organized in *splits* and they may also have *subsets* (also called *configurations*). These internal structures provide the scaffolding for building out a dataset, and determines how a dataset should be split and organized. Understanding a dataset's structure can help you create your own dataset, and know which subset of data you should use when during model training and evaluation.

![split-configs-server](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/split-configs-server.gif)

## Splits

Every processed and cleaned dataset contains *splits*, specific parts of the data reserved for specific needs. The most common splits are:

* `train`: data used to train a model; this data is exposed to the model
* `validation`: data reserved for evaluation and improving model hyperparameters; this data is hidden from the model
* `test`: data reserved for evaluation only; this data is completely hidden from the model and ourselves

The `validation` and `test` sets are especially important to ensure a model is actually learning instead of *overfitting*, or just memorizing the data.

## Subsets

A *subset* (also called *configuration*) is a higher-level internal structure than a split, and a subset contains splits. You can think of a subset as a sub-dataset contained within a larger dataset. It is a useful structure for adding additional layers of organization to a dataset. For example, if you take a look at the [Multilingual LibriSpeech (MLS)](https://huggingface.co/datasets/facebook/multilingual_librispeech) dataset, you'll notice there are eight different languages. While you can create a dataset containing all eight languages, it's probably neater to create a dataset with each language as a subset. This way, users can instantly load a dataset with their language of interest instead of preprocessing the dataset to filter for a specific language.

Subsets are flexible, and can be used to organize a dataset along whatever objective you'd like. For example, the [SceneParse150](https://huggingface.co/datasets/scene_parse_150) dataset uses subsets to organize the dataset by task. One subset is dedicated to segmenting the whole image, while the other subset is for instance segmentation.

