For classification with the boosting model, we'll use the AdaBoostClassifier object. Here, we'll also use 50 estimators to combine the individual predictions, and we'll set the learning rate, which is another hyperparameter for this model, to 0.1.
The following screenshot shows the code and the confusion matrix:
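Since the screenshot itself isn't reproduced here, the following is a minimal sketch of what that code might look like, assuming the training and testing splits (X_train, y_train, X_test, y_test) prepared earlier in the chapter:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix

# Boosting model: 50 estimators combined with a learning rate of 0.1
boosting = AdaBoostClassifier(n_estimators=50, learning_rate=0.1, random_state=55)
boosting.fit(X_train, y_train)

# Confusion matrix for the test-set predictions
y_pred_boosting = boosting.predict(X_test)
print(confusion_matrix(y_true=y_test, y_pred=y_pred_boosting))
```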
Now, we will compare the four models as shown in the following screenshot:
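As a sketch of that comparison, assuming the four fitted models from this chapter are available under the hypothetical names logistic_regression, bagging, random_forest, and boosting, we can collect accuracy and recall on the test set in a single DataFrame:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical names for the four fitted models from this chapter
models = {'logistic_regression': logistic_regression, 'bagging': bagging,
          'random_forest': random_forest, 'boosting': boosting}

# Accuracy and recall for each model on the test set
metrics = pd.DataFrame(index=models.keys(), columns=['accuracy', 'recall'])
for name, model in models.items():
    y_pred = model.predict(X_test)
    metrics.loc[name, 'accuracy'] = accuracy_score(y_test, y_pred)
    metrics.loc[name, 'recall'] = recall_score(y_test, y_pred)

print(metrics)
```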
The preceding screenshot shows similar accuracies for the four models, but the most important metric for this particular application is the recall metric.
The following screenshot shows that the model with the best recall and accuracy is the random forest model:
The preceding screenshot shows that the random forest model performs better than the other models overall.
To see the relationship between precision, recall, and threshold, we can use the precision_recall_curve function from scikit-learn. Here, we pass the real observed values and the predicted probabilities, and the function returns the arrays of precision, recall, and threshold values that we need in order to plot the curve.
The following screenshot shows the code for the precision_recall_curve function from scikit-learn:
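A sketch of that call, reusing the fitted random_forest and logistic_regression models assumed above: the positive-class probabilities come from predict_proba, and the function returns one array each of precision, recall, and threshold values:

```python
from sklearn.metrics import precision_recall_curve

# Predicted probabilities of the positive (default) class for the test set
prob_rf = random_forest.predict_proba(X_test)[:, 1]
prob_lr = logistic_regression.predict_proba(X_test)[:, 1]

# Precision and recall values for every candidate threshold
precision_rf, recall_rf, thresholds_rf = precision_recall_curve(y_test, prob_rf)
precision_lr, recall_lr, thresholds_lr = precision_recall_curve(y_test, prob_lr)
```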
The following screenshot will now visualize the relationship between precision and recall when using the random forest model and the logistic regression model:
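In place of the screenshot, a plot along these lines, using matplotlib and the arrays computed above, draws the two curves together:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(recall_rf, precision_rf, label='Random forest')
ax.plot(recall_lr, precision_lr, label='Logistic regression')
ax.set_xlabel('Recall')
ax.set_ylabel('Precision')
ax.set_title('Precision-recall curves')
ax.legend()
plt.show()
```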
The preceding screenshot shows that the random forest model is better because its curve lies above the logistic regression curve. So, at a precision of 0.30, we get more recall with the random forest model than with the logistic regression model.
To tune the performance of the RandomForestClassifier method, we can change the classification threshold. For example, if we set a classification threshold of 0.12, we get a precision of 30% and a recall of 84%. This model will correctly identify 84% of the possible defaulters, which will be very useful for a financial institution. This shows that the random forest model is better than the logistic regression model for this task.
The following screenshot shows the code and the confusion matrix:
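A sketch of that adjustment, reusing the prob_rf probabilities computed above: instead of the default threshold of 0.5, we flag any customer whose predicted default probability exceeds 0.12:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Classify as a defaulter when the predicted probability exceeds 0.12
y_pred_threshold = (prob_rf > 0.12).astype(int)

print(confusion_matrix(y_test, y_pred_threshold))
print('Precision:', precision_score(y_test, y_pred_threshold))
print('Recall:', recall_score(y_test, y_pred_threshold))
```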
Feature importance is a very useful metric that we get when using a random forest model. The scikit-learn library calculates a feature importance score for each of the features that we use in our model; this internal calculation gives us a measure of how much each feature contributes to the predictions.
The following screenshot shows the visualization of these features, hence highlighting the importance of using a RandomForestClassifier method:
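As a sketch of that visualization, assuming X_train is a pandas DataFrame whose columns are the feature names, the feature_importances_ attribute of the fitted RandomForestClassifier can be plotted as a bar chart:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Importance score of each feature, as computed by the random forest
importances = pd.Series(random_forest.feature_importances_, index=X_train.columns)
importances.sort_values().plot(kind='barh', figsize=(8, 6), title='Feature importance')
plt.show()
```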
The most important feature for predicting whether the customer will default next month is pay_1, the repayment status from the month before. Here, we just have to verify whether the customer paid last month or not. The next most important features of this model are the bill amounts for two of the months, followed by age.
The features that are not important for predicting the target are gender, marital status, and the education level of the customer.
Overall, the random forest model has performed better than the logistic regression model for this problem.
According to the no free lunch theorem, there is no single model that works best for every problem on every dataset. This means that ensemble learning cannot always outperform simpler methods; sometimes simpler methods perform better than complex ones. So, for every machine learning problem, we should try simple methods first and then compare their performance with that of more complex methods to get the best results.