To be able to compare the results of different variables, we need to take into account the different number of samples in each class. Let's suppose that we want to train a machine learning model to predict whether a given passenger would survive, based on age group, gender, and class. If we plot the distribution of values in the survived variable, we will see the following:
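A minimal sketch of how this count and plot can be produced is shown below. It assumes the data is available as a pandas DataFrame with a survived column; here it is loaded from seaborn's sample Titanic dataset, which may differ slightly from the dataset used elsewhere in this chapter:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Titanic data (here from seaborn's sample datasets; the loading
# step depends on how the data was obtained in your own workflow)
df = sns.load_dataset("titanic")

# Count the number of passengers in each class of the target variable
counts = df["survived"].value_counts()
print(counts)

# Plot the class distribution as a bar chart
counts.plot(kind="bar")
plt.xlabel("survived")
plt.ylabel("number of passengers")
plt.show()
```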
It is clear from the preceding diagram and table that there are nearly twice as many non-survivors as survivors. If we use this dataset as is, we introduce a bias that will affect the results: predicting 0 for the survived variable will be approximately twice as probable as predicting 1. An exception to this statement is decision trees and their related predictive models (such as random forests and XGBoost), which can correctly deal with...