Since the concept of the dataset is essential in ML, let's look at it in detail, with a focus on how to create the required splits for building a complete and correct ML pipeline.
A dataset is nothing more than a collection of data. Formally, we can describe a dataset as a set of pairs, $(e_i, l_i)$, where $e_i$ is the i-th example and $l_i$ is its label, with a finite cardinality, $k$:
$$D = \{(e_i, l_i)\}_{i=1}^{k}$$
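To make the definition concrete, here is a minimal sketch of a dataset as a finite collection of (example, label) pairs. The feature values, the integer labels, and the names `dataset` and `cardinality` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical toy data: each example is a small feature vector,
# each label is an integer class identifier.
Example = Tuple[float, float]
Label = int

dataset: List[Tuple[Example, Label]] = [
    ((5.1, 3.5), 0),
    ((4.9, 3.0), 0),
    ((6.2, 2.9), 1),
    ((5.9, 3.0), 1),
]

# The dataset is finite, so its cardinality is simply its length.
cardinality = len(dataset)

# Access one pair. Python indexes from 0, so index 2 is the third pair.
example, label = dataset[2]
```

Every element carries both the example and its label, which is what makes the dataset usable for supervised learning.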
Because a dataset has a finite number of elements, our ML algorithm will loop over it several times, trying to learn the structure of the data, until it solves the task it is asked to address. As shown in Chapter 2, Neural Networks and Deep Learning, some algorithms will consider all the data at once, while other algorithms will iteratively look at a small subset of the data at each training iteration.
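The two ways of consuming a dataset described above can be sketched as follows. This is an illustrative sketch, not the method of any particular algorithm: the helper names `full_batch` and `mini_batches`, the toy data, and the batch size are all assumptions. One full pass over the dataset is commonly called an epoch.

```python
from typing import Iterator, List, Sequence, Tuple

Pair = Tuple[Tuple[float, float], int]  # (example, label)


def full_batch(dataset: Sequence[Pair]) -> Iterator[List[Pair]]:
    """Yield the whole dataset as a single batch (all the data at once)."""
    yield list(dataset)


def mini_batches(dataset: Sequence[Pair], batch_size: int) -> Iterator[List[Pair]]:
    """Yield consecutive subsets of at most batch_size pairs."""
    for start in range(0, len(dataset), batch_size):
        yield list(dataset[start:start + batch_size])


# Hypothetical toy dataset with five (example, label) pairs.
dataset: List[Pair] = [
    ((0.1, 0.2), 0),
    ((0.3, 0.1), 0),
    ((0.9, 0.8), 1),
    ((0.7, 0.9), 1),
    ((0.2, 0.4), 0),
]

num_epochs = 2  # number of full passes over the dataset
steps = 0       # number of training iterations performed

for epoch in range(num_epochs):
    for batch in mini_batches(dataset, batch_size=2):
        # A real algorithm would update its parameters here using `batch`.
        steps += 1
```

With five pairs and a batch size of 2, each pass produces three subsets (the last one holds the single leftover pair), so two passes perform six iterations. Replacing `mini_batches(dataset, batch_size=2)` with `full_batch(dataset)` gives one iteration per pass instead.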
A typical supervised learning task is the classification of the dataset. We train...