Train-test-splitting your data

In machine learning, our goal is to create a program that is able to perform tasks it has never been explicitly taught to perform. The way we do that is to use data we have collected to train or fit a mathematical or statistical model. The data used to fit the model is referred to as training data. The resulting trained model is then used to predict future, previously-unseen data. In this way, the program is able to manage new situations without human intervention.

One of the major challenges for a machine learning practitioner is the danger of overfitting – creating a model that performs well on the training data but is not able to generalize to new, previously-unseen data. In order to combat the problem of overfitting, machine learning practitioners set aside a portion of the data, called test data, and use it only to assess the performance of the trained model, as opposed to including it as part of the training dataset. This careful setting aside of testing sets is key to training classifiers in cybersecurity, where overfitting is an omnipresent danger. One small oversight, such as using only benign data from one locale, can lead to a poor classifier.

There are various other ways to validate model performance, such as cross-validation. For simplicity, we will focus mainly on train-test splitting.

Getting ready

Preparation for this recipe consists of installing the scikit-learn and pandas packages in pip. The command for this is as follows:

pip install sklearn pandas

In addition, we have included the north_korea_missile_test_database.csv dataset for use in this recipe.

How to do it...

The following steps demonstrate how to take a dataset, consisting of features X and labels y, and split these into a training and testing subset:

Start by importing the train_test_split module and the pandas library, and read your features into X and labels into y:

from sklearn.model_selection import train_test_split
import pandas as pd

df = pd.read_csv("north_korea_missile_test_database.csv")
y = df["Missile Name"]
X = df.drop("Missile Name", axis=1)

Next, randomly split the dataset and its labels into a training set consisting 80% of the size of the original dataset and a testing set 20% of the size:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=31
)

We apply the train_test_split method once more, to obtain a validation set, X_val and y_val:

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=31
)

We end up with a training set that's 60% of the size of the original data, a validation set of 20%, and a testing set of 20%.

The following screenshot shows the output:

How it works...

We start by reading in our dataset, consisting of historical and continuing missile experiments in North Korea. We aim to predict the type of missile based on remaining features, such as facility and time of launch. This concludes step 1. In step 2, we apply scikit-learn's train_test_split method to subdivide X and y into a training set, X_train and y_train, and also a testing set, X_test and y_test. The test_size = 0.2 parameter means that the testing set consists of 20% of the original data, while the remainder is placed in the training set. The random_state parameter allows us to reproduce the same randomly generated split. Next, concerning step 3, it is important to note that, in applications, we often want to compare several different models. The danger of using the testing set to select the best model is that we may end up overfitting the testing set. This is similar to the statistical sin of data fishing. In order to combat this danger, we create an additional dataset, called the validation set. We train our models on the training set, use the validation set to compare them, and finally use the testing set to obtain an accurate indicator of the performance of the model we have chosen. So, in step 3, we choose our parameters so that, mathematically speaking, the end result consists of a training set of 60% of the original dataset, a validation set of 20%, and a testing set of 20%. Finally, we double-check our assumptions by employing the len function to compute the length of the arrays (step 4).