Now that we have learned how to evaluate the model's accuracy more reliably using the ShuffleSplit cross-validation method, it is time to test our earlier hypothesis: would a smaller tree be more accurate?
Here is what we are going to do in the following subsections:
- Split the data into training and test sets.
- Keep the test set to one side for now.
- Limit the tree's growth using different values of max_depth.
- For each max_depth setting, we will use the ShuffleSplit cross-validation method on the training set to get an estimate of the classifier's accuracy.
- Once we decide which value to use for max_depth, we will train the algorithm one last time on the entire training set and predict on the test set.
Splitting the data
Here is the usual code for splitting the data into training and test sets:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.25)
x_train = df_train[iris.feature_names]
x_test = df_test[iris.feature_names]
y_train = df_train['target']
y_test = df_test['target']
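If you want the split to be reproducible and to keep the class proportions similar in both sets, one way to do it is sketched below; the random_state value of 42 is an arbitrary choice for illustration, not something the rest of the chapter relies on:

# A reproducible, class-stratified split; the random_state value is arbitrary
df_train, df_test = train_test_split(
    df, test_size=0.25, random_state=42, stratify=df['target']
)

Either way, the rest of this section only relies on the x_train, x_test, y_train, and y_test variables defined above.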
Trying different hyperparameter values
If we allowed our earlier tree to grow indefinitely, it would reach a depth of 4. You can check the depth of a tree by calling clf.get_depth() once it is trained. Since the tree never grows deeper than 4 on this data, it doesn't make sense to try any max_depth values above 4.
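If you want to verify that depth yourself, here is a minimal sketch of the check; it assumes the x_train and y_train variables prepared above, and the unrestricted_clf name is only for illustration:

from sklearn.tree import DecisionTreeClassifier

# Fit a tree with no depth limit, then ask how deep it actually grew
unrestricted_clf = DecisionTreeClassifier()
unrestricted_clf.fit(x_train, y_train)
# For the tree we grew earlier in this chapter, this printed 4
print(unrestricted_clf.get_depth())

Now, we are going to loop over the maximum depths from 1 to 4 and use ShuffleSplit to get the classifier's accuracy for each setting: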
import pandas as pd
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate

for max_depth in [1, 2, 3, 4]:
    # We initialize a new classifier each iteration with different max_depth
    clf = DecisionTreeClassifier(max_depth=max_depth)
    # We also initialize our shuffle splitter
    rs = ShuffleSplit(n_splits=20, test_size=0.25)
    cv_results = cross_validate(
        clf, x_train, y_train, cv=rs, scoring='accuracy'
    )
    accuracy_scores = pd.Series(cv_results['test_score'])
    print(
        '@ max_depth = {}: accuracy_scores: {}~{}'.format(
            max_depth,
            accuracy_scores.quantile(.1).round(3),
            accuracy_scores.quantile(.9).round(3)
        )
    )
We called the cross_validate() function as we did earlier, giving it the classifier's instance as well as the ShuffleSplit instance. We also set the evaluation metric to accuracy. Finally, we print the scores we get in each iteration. We will look at the printed values more closely in the next section.
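As a small aside, cross_validate() returns a plain dictionary; besides the test scores we used above, it also reports how long fitting and scoring took on each split, which you can inspect like this:

# With the default settings, this typically prints
# ['fit_time', 'score_time', 'test_score']
print(sorted(cv_results.keys()))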
Comparing the accuracy scores
Since we have a list of 20 scores for each max_depth value, we can calculate their mean or, as we do here, print their 10th and 90th percentiles to get an idea of the accuracy range for each max_depth setting.
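If you prefer a single summary number per setting, you could replace the print statement in the loop with something like the following sketch, which reports the mean and standard deviation of the 20 scores instead:

print(
    '@ max_depth = {}: accuracy: {} +/- {}'.format(
        max_depth,
        accuracy_scores.mean().round(3),
        accuracy_scores.std().round(3)
    )
)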
Running the preceding loop (the percentile version) gave me the following results:
@ max_depth = 1: accuracy_scores: 0.532~0.646
@ max_depth = 2: accuracy_scores: 0.925~1.0
@ max_depth = 3: accuracy_scores: 0.929~1.0
@ max_depth = 4: accuracy_scores: 0.929~1.0
One thing I am sure about now is that a single-level tree (usually called a decision stump) is not as accurate as deeper trees. In other words, making a single decision based on whether the petal width is less than 0.8 is not enough. Allowing the tree to grow further improves the accuracy, but I can't see much difference between trees of depths 2, 3, and 4. I'd conclude that, contrary to my earlier speculation, we shouldn't worry too much about overfitting here.
Finally, you can train your model once more on the entire training set using a max_depth value of, say, 3, and then use the trained model to predict the classes of the test set in order to evaluate your final model. I won't walk through that code in detail this time as you can easily write it yourself, but a minimal sketch is shown below in case you want to compare your version against it.
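The sketch assumes the x_train, y_train, x_test, and y_test variables from the split above, and uses accuracy as the final metric purely because it is the score we have tracked so far:

from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Train on the entire training set with the chosen depth
final_clf = DecisionTreeClassifier(max_depth=3)
final_clf.fit(x_train, y_train)

# Evaluate once on the held-out test set
y_pred = final_clf.predict(x_test)
print('Test accuracy:', round(accuracy_score(y_test, y_pred), 3))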
In addition to printing the classifier's decisions and descriptive statistics about its accuracy, it is also useful to see its decision boundaries visually. Mapping those boundaries against the data samples helps us understand why the classifier makes certain mistakes. In the next section, we are going to check the decision boundaries we got for the Iris dataset.