Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Data Science Projects with Python

You're reading from   Data Science Projects with Python A case study approach to gaining valuable insights from real data with machine learning

Arrow left icon
Product type Paperback
Published in Jul 2021
Publisher Packt
ISBN-13 9781800564480
Length 432 pages
Edition 2nd Edition
Languages
Arrow right icon
Author (1):
Arrow left icon
Stephen Klosterman Stephen Klosterman
Author Profile Icon Stephen Klosterman
Stephen Klosterman
Arrow right icon
View More author details
Toc

Table of Contents (9) Chapters Close

Preface
1. Data Exploration and Cleaning 2. Introduction to Scikit-Learn and Model Evaluation FREE CHAPTER 3. Details of Logistic Regression and Feature Exploration 4. The Bias-Variance Trade-Off 5. Decision Trees and Random Forests 6. Gradient Boosting, XGBoost, and SHAP Values 7. Test Set Analysis, Financial Insights, and Delivery to the Client Appendix

Summary

In this chapter, we finished the initial exploration of the case study data by examining the response variable. Once we became confident in the completeness and correctness of the dataset, we were prepared to explore the relation between features and response and build models.

We spent much of this chapter getting used to model fitting in scikit-learn at the technical, coding level, and learning about metrics we could use with the binary classification problem of the case study. When trying different feature sets and different kinds of models, you will need some way to tell if one approach is working better than another. Consequently, you'll need to use model performance metrics like those we learned in this chapter.

While accuracy is a familiar and intuitive metric as the percentage of correct classifications, we learned why it may not give a useful assessment of the performance of a classifier. We learned how to use a majority-class null model to tell whether an accuracy rate is truly good, or no better than what would result from simply predicting the most common class for all samples. When the data is imbalanced, accuracy is usually not the best way to judge a classifier.

In order to have a more nuanced view of how a model is performing, it's necessary to separate the positive and negative classes and assess the accuracy of them independently. From the resulting counts of true and false positive and negative classifications, which can be summarized in a confusion matrix, we can derive several other metrics: true and false positive and negative rates. Combining true and false positives and negatives with the concept of predicted probabilities and a variable threshold of prediction, we can further characterize the usefulness of a classifier using the ROC curve, the precision-recall curve, and the areas under these curves.

With these tools, you are well equipped to answer general questions about the performance of a binary classifier in any domain you may be working in. Later in the book, we will learn about application-specific ways to assess model performance by attaching costs and benefits to true and false positives and negatives. Before that, starting in the next chapter, we will begin learning the details behind what is possibly the most popular and simplest classification model: logistic regression.

You have been reading a chapter from
Data Science Projects with Python - Second Edition
Published in: Jul 2021
Publisher: Packt
ISBN-13: 9781800564480
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €18.99/month. Cancel anytime