Debugging Machine Learning Models with Python

You're reading from Debugging Machine Learning Models with Python: Develop high-performance, low-bias, and explainable machine learning and deep learning models

Product type Paperback
Published in Sep 2023
Publisher Packt
ISBN-13 9781800208582
Length 344 pages
Edition 1st Edition
Author: Ali Madani

Flaws in data used for modeling

Data is one of the core components of machine learning modeling (Figure 1.1). Applications of machine learning across industries such as healthcare, finance, automotive, retail, and marketing depend on access to the data needed for training and testing models. Because data is fed into machine learning models for training (that is, identifying optimal model parameters) and testing, flaws in the data can cause problems in the models, such as low performance in training (for example, high bias), low generalizability (for example, high variance), or socioeconomic biases. Here, we will discuss examples of flaws and properties of data that need to be considered when designing a machine learning model.

Data format and structure

There could be issues with how data is stored, read, and moved through different functions and classes in your code or pipeline. You might need to work with structured (tabular) data or with unstructured data such as videos and text documents. This data could be stored in relational databases such as MySQL, in NoSQL (that is, non-relational) databases, in data warehouses and data lakes, or locally in file formats such as CSV. Either way, the structure and format that your code expects need to match those of the data it actually receives. For example, if your code expects a tab-separated file but the input file passed to the corresponding function is comma-separated, then all the columns could be lumped together into one. Luckily, most of the time, these kinds of issues result in errors in the code.
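As a minimal sketch of the delimiter mismatch described above, the following example reads comma-separated content with a tab separator. The file content, the column names, and the `check_columns` helper are hypothetical and only serve the illustration. Note that in this sketch the read itself raises no error; the mismatch only surfaces once something downstream asks for a column that does not exist, which is why an explicit check right after loading can help:

```python
import io

import pandas as pd

# Hypothetical comma-separated content standing in for an input file.
csv_text = "f1,f2,f3\n1,2,3\n4,5,6\n"

# Reading with the wrong separator: every line is parsed as a single column.
wrong = pd.read_csv(io.StringIO(csv_text), sep="\t")

# Reading with the separator that matches the file.
right = pd.read_csv(io.StringIO(csv_text), sep=",")


def check_columns(df, expected_columns):
    """Raise early if the parsed columns differ from what the pipeline expects."""
    if list(df.columns) != list(expected_columns):
        raise ValueError(
            f"Expected columns {list(expected_columns)}, got {list(df.columns)}"
        )
    return df


print(wrong.shape)  # all three columns lumped into one
print(right.shape)
```

Calling `check_columns(wrong, ["f1", "f2", "f3"])` raises a `ValueError` at load time instead of letting the lumped column travel further down the pipeline.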

There could also be mismatches between the provided and expected data that cause no errors at all if the code does not guard against them and not enough information is logged. For example, imagine a scikit-learn fit function that expects training data with 100 features and, at the same time, you have 100 data points. In this case, the input has the same 100 by 100 shape whether the features are in its rows or in its columns, so your code will not return any errors either way. Your code therefore needs to check whether each row of an input DataFrame contains the values of one feature across all data points or the feature values of one data point. The following figure shows how switching features with data points, for example by transposing a DataFrame so that rows and columns are swapped, could provide the wrong input but result in no error. In this figure, we have considered four columns and rows for simplicity. Here, F and D are used as abbreviations for feature and data point, respectively:

Figure 1.4 – Simplified example showcasing how the transpose of a DataFrame can be used by mistake in a scikit-learn fit function that expects four features
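The situation in Figure 1.4 can be sketched in a few lines. The example below assumes scikit-learn's `LinearRegression` as a stand-in estimator, random values for the data, and a hypothetical `check_orientation` helper; none of these come from the book. It shows that both the correct DataFrame and its transpose are accepted by `fit`, and that comparing column labels against the expected feature names is one possible defense:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Four data points (rows D1..D4) and four features (columns F1..F4), random values.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(4, 4)),
    index=["D1", "D2", "D3", "D4"],
    columns=["F1", "F2", "F3", "F4"],
)
y = [0.0, 1.0, 0.0, 1.0]  # made-up target values

# Both calls succeed: the transpose has the same 4 x 4 shape as the original.
model_ok = LinearRegression().fit(X, y)
model_transposed = LinearRegression().fit(X.T, y)


def check_orientation(df, expected_features):
    """Raise if the columns are not the expected feature names (hypothetical check)."""
    if list(df.columns) != list(expected_features):
        raise ValueError(
            f"Columns {list(df.columns)} do not match features {list(expected_features)}"
        )
    return df


# Each model reports four input features, so the shape alone reveals nothing.
print(model_ok.n_features_in_, model_transposed.n_features_in_)
```

A check such as `check_orientation(X.T, ["F1", "F2", "F3", "F4"])` fails immediately, whereas relying on the shape of the input would not. This only works when the DataFrame carries meaningful labels; for unlabeled arrays, the orientation has to be tracked some other way, such as logging.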


Data flaws are not restricted to structure and format issues. Some data characteristics need to be considered when you’re trying to build and improve a machine learning model.

Data quantity and quality

Machine learning is a concept more than half a century old, but the surge of excitement around it started in 2012. There were algorithmic advancements for image classification between 2010 and 2015, but it was the availability of 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest, together with the necessary computing power, that played a crucial role in the development of the first high-performance image classification models, such as AlexNet (Krizhevsky et al., 2012) and VGG (Simonyan and Zisserman, 2014).

In addition to data quantity, the quality of the data also plays a very important role. In some applications, such as clinical cancer settings, large amounts of high-quality data are not accessible. Quantity and quality can also become a tradeoff, as we could have access to more data but of lower quality. We can choose to use only high-quality data, use only low-quality data, or try to benefit from both if possible. Selecting the right approach is domain-specific and depends on the data and algorithm used for modeling.

Data biases

Machine learning models can have different kinds of biases, depending on the data we feed them. Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) is a famous example of a machine learning model with reported biases. COMPAS is designed to estimate the likelihood that a defendant will re-offend based on their responses to more than 100 survey questions, including questions such as whether one of the defendant’s parents was ever in prison. A summary of the responses results in a risk score. Although the tool has been successful in many examples, when its predictions have been wrong, the results for white and Black offenders were not the same. The company that developed COMPAS presented data that supports its algorithm’s findings. You can find articles and blog posts to read more about its current status, including whether it is still used and whether it still has biases.
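One way to make "not the same when wrong" concrete is to compare error rates per group rather than overall accuracy. The sketch below uses entirely made-up toy records with neutral group labels; it is not COMPAS data and does not reproduce any published analysis. The column names and the `error_rates_by_group` helper are assumptions for illustration:

```python
import pandas as pd

# Entirely made-up toy records for illustration; this is NOT COMPAS data.
records = pd.DataFrame(
    {
        "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
        "reoffended": [0, 0, 1, 1, 0, 0, 1, 1],
        "predicted_high_risk": [1, 0, 1, 1, 0, 0, 1, 0],
    }
)


def error_rates_by_group(df):
    """Return false positive and false negative rates for each group."""
    rows = {}
    for name, g in df.groupby("group"):
        negatives = g[g["reoffended"] == 0]  # did not re-offend
        positives = g[g["reoffended"] == 1]  # did re-offend
        rows[name] = {
            "false_positive_rate": (negatives["predicted_high_risk"] == 1).mean(),
            "false_negative_rate": (positives["predicted_high_risk"] == 0).mean(),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


rates = error_rates_by_group(records)
accuracy = (
    (records["reoffended"] == records["predicted_high_risk"])
    .groupby(records["group"])
    .mean()
)
print(rates)
print(accuracy)
```

In this toy data, both groups have the same accuracy, yet the mistakes differ in kind: one group receives only false positives and the other only false negatives. Overall accuracy alone would hide that difference.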

These were some examples of issues in data and their consequences in the resulting machine learning models. But there are other problems in models that do not originate from data.
