Class-Imbalanced Data
Class imbalance is one of the most common problems a data scientist encounters. Many real-world classification tasks involve data in which one or more classes are heavily over-represented relative to the others. This is called class imbalance. Common examples of class-imbalanced data arise in fraud detection, anti-money laundering, spam detection, and cancer detection.
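To see why imbalance is a problem, consider a baseline that ignores its input and always predicts the majority class. The following is a minimal sketch in plain Python; the class counts (950 versus 50) are hypothetical and chosen only for illustration, and the helper functions accuracy and recall are written here for the example rather than taken from any library.

```python
# Hypothetical label counts, chosen only for illustration (not from a real dataset).
n_majority = 950   # majority class, labelled -1
n_minority = 50    # minority class, labelled 1

y_true = [-1] * n_majority + [1] * n_minority

# A trivial baseline that always predicts the majority class.
y_pred = [-1] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    actual_positives = sum(t == positive for t in y_true)
    true_positives = sum(
        t == positive and p == positive for t, p in zip(y_true, y_pred)
    )
    return true_positives / actual_positives if actual_positives else 0.0


baseline_accuracy = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred)

print(f"accuracy: {baseline_accuracy:.2f}")      # accuracy: 0.95
print(f"minority recall: {minority_recall:.2f}")  # minority recall: 0.00
```

The baseline scores 95% accuracy while detecting none of the minority cases, which is why accuracy alone can be misleading on imbalanced data and why per-class metrics such as recall are worth checking.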
Exercise 47: Performing Classification on Imbalanced Data
For this exercise, we are going to use the mammography dataset from UCI. The dataset contains attributes of patients, which we will use to build a model that can predict whether a patient will have cancer (that is, a malignant outcome, indicated by 1) or not (that is, a benign outcome, indicated by −1). Roughly 98% of the dataset has benign outcomes; hence, it is a highly imbalanced dataset. In this exercise, we will observe how imbalanced data affects the performance of a model:
Import fetch_datasets, pandas, RandomForestClassifier, train_test_split...