Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Arrow up icon
GO TO TOP
Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits

You're reading from   Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits A practical guide to implementing supervised and unsupervised machine learning algorithms in Python

Arrow left icon
Product type Paperback
Published in Jul 2020
Publisher Packt
ISBN-13 9781838826048
Length 384 pages
Edition 1st Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Tarek Amr Tarek Amr
Author Profile Icon Tarek Amr
Tarek Amr
Arrow right icon
View More author details
Toc

Table of Contents (18) Chapters Close

Preface 1. Section 1: Supervised Learning
2. Introduction to Machine Learning FREE CHAPTER 3. Making Decisions with Trees 4. Making Decisions with Linear Equations 5. Preparing Your Data 6. Image Processing with Nearest Neighbors 7. Classifying Text Using Naive Bayes 8. Section 2: Advanced Supervised Learning
9. Neural Networks – Here Comes Deep Learning 10. Ensembles – When One Model Is Not Enough 11. The Y is as Important as the X 12. Imbalanced Learning – Not Even 1% Win the Lottery 13. Section 3: Unsupervised Learning and More
14. Clustering – Making Sense of Unlabeled Data 15. Anomaly Detection – Finding Outliers in Data 16. Recommender System – Getting to Know Their Taste 17. Other Books You May Enjoy

Tuning the hyperparameters for higher accuracy

Now that we have learned how to evaluate the model's accuracy more reliably using the ShuffleSplit cross-validation method, it is time to test our earlier hypothesis: would a smaller tree be more accurate?

Here is what we are going to do in the following sub sections:

  1. Split the data into training and test sets.
  2. Keep the test side to one side now.
  3. Limit the tree's growth using different values of max_depth.
  4. For each max_depth setting, we will use the ShuffleSplit cross-validation method on the training set to get an estimation of the classifier's accuracy.
  5. Once we decide which value to use for max_depth, we will train the algorithm one last time on the entire training set and predict on the test set.

Splitting the data

Here is the usual code for splitting the data into training and test sets:

from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.25)

x_train = df_train[iris.feature_names]
x_test = df_test[iris.feature_names]

y_train = df_train['target']
y_test = df_test['target']

Trying different hyperparameter values

If we allowed our earlier treeto grow indefinitely, we would get a tree depth of 4. You can check the depth of a tree by callingclf.get_depth()once it is trained. So, it doesn't make sense to try any max_depth values above 4. Here, we are going to loop over the maximum depths from 1 to 4 and use ShuffleSplit to get the classifier's accuracy:

import pandas as pd
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate

for max_depth in [1, 2, 3, 4]:

# We initialize a new classifier each iteration with different max_depth
clf = DecisionTreeClassifier(max_depth=max_depth)
# We also initialize our shuffle splitter
rs = ShuffleSplit(n_splits=20, test_size=0.25)

cv_results = cross_validate(
clf, x_train, y_train, cv=rs, scoring='accuracy'
)
accuracy_scores = pd.Series(cv_results['test_score'])

print(
'@ max_depth = {}: accuracy_scores: {}~{}'.format(
max_depth,
accuracy_scores.quantile(.1).round(3),
accuracy_scores.quantile(.9).round(3)
)
)

We called the cross_validate() method as we did earlier, giving it the classifier's instance, as well as the ShuffleSplit instance. We also defined our evaluation score as accuracy. Finally, we print the scores we get with each iteration. We will look more at the printed values in the next section.

Comparing the accuracy scores

Since we have a list of scores for each iteration, we can calculate their mean, or, as we will do here, we will print their 10th and 90th percentiles to get an idea of the accuracy ranges versus each max_depthsetting.

Running the preceding code gave me the following results:

@ max_depth = 1: accuracy_scores: 0.532~0.646
@ max_depth = 2: accuracy_scores: 0.925~1.0
@ max_depth = 3: accuracy_scores: 0.929~1.0
@ max_depth = 4: accuracy_scores: 0.929~1.0

One thing I am sure about now is that a single-level tree (usually called a stub) is not as accurate as deeper trees. In other words, having a single decision based on whether the petal width is less than 0.8 is not enough. Allowing the tree to grow further improves the accuracy, but I can't see many differences between trees of depths 2, 3, and 4. I'd conclude that contrary to my earlier speculations, we shouldn't worry too much about overfitting here.

Here, we tried different values for a single parameter, max_depth. That's why a simple for loop over its different values was feasible. In later chapters, we will see what to do when we need to tune multiple hyperparameters at once to reach a combination that gives the best accuracy.

Finally, you can train your model once more using the entire training set and a max_depth value of, say, 3. Then, use the trained model to predict the classes for the test set in order to evaluate your final model. I won't bore you with the code for it this time as you can easily do it yourself.

In addition to printing the classifier's decision and descriptive statistics about its accuracy, it is useful to also see its decision boundaries visually. Mapping those boundaries versus the data samples helps us understand why the classifier made certain mistakes. In the next section, we are going to check the decision boundaries we got for the Iris dataset.

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image