Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Supervised Machine Learning with Python

You're reading from   Supervised Machine Learning with Python Develop rich Python coding practices while exploring supervised machine learning

Arrow left icon
Product type Paperback
Published in May 2019
Publisher Packt
ISBN-13 9781838825669
Length 162 pages
Edition 1st Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Taylor Smith Taylor Smith
Author Profile Icon Taylor Smith
Taylor Smith
Arrow right icon
View More author details
Toc

Model evaluation and data splitting

In this chapter, we will define what it means to evaluate a model, best practices for gauging the advocacy of a model, how to split your data, and several considerations that you'll have to make when preparing your split.

It is important to understand some core best practices of machine learning. One of our primary tasks as ML practitioners is to create a model that is effective for making predictions on new data. But how do we know that a model is good? If you recall from the previous section, we defined supervised learning as simply a task that learns a function from labelled data such that we can approximate the target of the new data. Therefore, we can test our model's effectiveness. We can determine how it performs on data that is never seen—just like it's taking a test.

Out-of-sample versus in-sample evaluation

Let's say we are training a small machine which is a simple classification task. Here's some nomenclature you'll need: the in-sample data is the data the model learns from and the out-of-sample data is the data the model has never seen before. One of the pitfalls many new data scientists make is that they measure their model's effectiveness on the same data that the model learned from. What this ends up doing is rewarding the model's ability to memorize, rather than its ability to generalize, which is a huge difference.

If you take a look at the two examples here, the first presents a sample that the model learned from, and we can be reasonably confident that it's going to predict one, which would be correct. The second example presents a new sample, which appears to resemble more of the zero class. Of course, the model doesn't know that. But a good model should be able to recognize and generalize this pattern, shown as follows:

So, now the question is how we can ensure both in-sample and out-of-sample data for the model to prove its worth. Even more precisely, our out-of-sample data needs to be labeled. New or unlabeled data won't suffice because we have to know the actual answer in order to determine how correct the model is. So, one of the ways we can handle this in machine learning is to split our data into two parts: a training set and a testing set. The training set is what our model will learn on; the testing set is what our model will be evaluated on. How much data you have matters a lot. In fact, in the next sections, when we discuss the bias-variance trade-off, you'll see how some models require much more data to learn than others do.

Another thing to keep in mind is that if some of the distributions of your variables are highly skewed, or you have rare categorical levels embedded throughout, or even class imbalance in your y vector, you may end up getting a bad split. As an example, let's say you have a binary feature in your X matrix that indicates the presence of a very rare sensor for some event that occurs every 10,000 occurrences. If you randomly split your data and all of the positive sensor events are in your test set, then your model will learn from the training data that the sensor is never tripped and may deem that as an unimportant variable when, in reality, it could be hugely important, and hugely predictive. So, you can control these types of issues with stratification.

Splitting made easy

Here, we have a simple snippet that demonstrates how we can use the scikit-learn library to split our data into training and test sets. We're loading the data in from the datasets module and passing both X and y into the split function. We should be familiar with loading the data up. We have the train_test_split function from the model_selection submodule in sklearn. This is going to take any number of arrays. So, 20% is going to be test_size, and the remaining 80% of that data will be training. We define random_state, so that our split can be reproducible if we ever have to prove exactly how we got this split. There's also the stratify keyword, which we're not using here, which can be used to stratify a split for rare features or an imbalanced y vector:

from sklearn.datasets import load_boston

from sklearn.model_selection import train_test_split



boston_housing = load_boston() # load data

X, y = boston_housing.data, boston_housing.target # get X, y

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,

random_state=42)

# show num samples (there are no duplicates in either set!)
print("Num train samples: %i" % X_train.shape[0])

print("Num test samples: %i" % X_test.shape[0])

The output of the preceding code is as follows:

Num train samples: 404
Num test samples: 102

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image