Introduction to Data Imbalance in Machine Learning
Machine learning algorithms have helped solve real-world problems as diverse as disease prediction and online shopping. However, many of the problems we would like to address with machine learning involve imbalanced datasets. In this chapter, we will define imbalanced datasets and explain how they differ from other types of datasets. We will demonstrate the ubiquity of imbalanced data with examples of common problems and scenarios. We will also go through the basics of machine learning, covering essentials such as loss functions, regularization, and feature engineering, and we will learn about common evaluation metrics, particularly those that are helpful for imbalanced datasets. We will then introduce the imbalanced-learn library.
In particular, we will learn about the following topics:
- Introduction to imbalanced datasets
- Machine learning 101
- Types of datasets and splits
- Common evaluation metrics
- Challenges and considerations when dealing with imbalanced data
- When can we have an imbalance in datasets?
- Why can imbalanced data be a challenge?
- When to not worry about data imbalance
- Introduction to the imbalanced-learn library
- General rules to follow