Splitting Data
You will learn more about splitting data in Chapter 7, The Generalization of Machine Learning Models, where we will cover the following:
- Simple data splits using train_test_split
- Multiple data splits using cross-validation
For now, you will learn how to split data using a function from sklearn called train_test_split.
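As a minimal sketch of what such a split looks like, the following snippet holds out 20% of a small array for validation. The toy arrays, the variable names, the 20% split size, and the random_state value are all assumptions chosen for illustration, not values taken from this chapter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 feature columns, and a binary target
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Hold out 20% of the rows for validation.
# Rows are shuffled before splitting (shuffle=True is the default);
# random_state makes the shuffle reproducible.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_val.shape)  # (8, 2) (2, 2)
```

Each row ends up in exactly one of the two sets, so the validation rows are never seen during training.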
It is very important that you do not use all of your data to train a model. You must set aside some data for validation, and this data must not have been used previously for training. When you train a model, it tries to generate an equation that fits your data. The longer you train, the more complex that equation can become, until it passes through as many of the training data points as possible, including points that are only noise. This is known as overfitting.
Shuffling the data and setting some of it aside for validation lets you check that the model has not overfitted the hypotheses you are trying to generate: a model that has merely memorized the training points will perform noticeably worse on the held-out data.
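The effect can be seen in a small experiment. The sketch below is illustrative only: the synthetic noisy sine data, the choice of a decision tree regressor, and the depth values are assumptions made for this example. It compares a shallow tree with a fully grown one and scores each on both the training and the validation set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data: a sine curve with added random noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

# Shuffle and hold out 25% of the rows for validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scores = {}
# max_depth=None lets the tree grow until it fits every training point
for depth in (3, None):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_r2 = model.score(X_train, y_train)  # R^2 on data seen in training
    val_r2 = model.score(X_val, y_val)        # R^2 on held-out data
    scores[depth] = (train_r2, val_r2)
    print(f"max_depth={depth}: train R^2={train_r2:.2f}, validation R^2={val_r2:.2f}")
```

The fully grown tree fits the training data almost perfectly, yet its validation score is lower than its training score. Without the held-out data, that gap would go unnoticed.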
Exercise 6.01: Importing and Splitting Data
In this exercise, you will import data from a...