In Chapter 8, Identifying Credit Default with Machine Learning, we built an entire pipeline for predicting customer default, that is, a customer's inability to repay their debts. For the machine learning part, we used a decision tree classifier, one of the basic algorithms.
There are several ways to potentially improve the model's performance, including:
- Gathering more observations
- Adding extra features—either by gathering additional data or through feature engineering
- Using more advanced models
- Tuning the hyperparameters
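As a rough illustration of the last two items, the sketch below compares a baseline decision tree with a random forest whose hyperparameters are chosen by a small grid search. It is a minimal example under stated assumptions, not the chapter's pipeline: the data is a synthetic, imbalanced stand-in generated with `make_classification`, and the grid values and the choice of recall as the scoring metric are picked purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data with roughly 20% positives stands in for
# the credit default dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline: a decision tree with default settings.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

# A more advanced model (random forest) tuned with a small,
# illustrative hyperparameter grid.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="recall",
    cv=3,
)
grid_search.fit(X_train, y_train)

# Evaluate both models on the held-out test set.
tree_recall = recall_score(y_test, tree.predict(X_test))
forest_recall = recall_score(y_test, grid_search.predict(X_test))

print(f"Decision tree recall: {tree_recall:.3f}")
print(f"Tuned random forest recall: {forest_recall:.3f}")
print(f"Best hyperparameters: {grid_search.best_params_}")
```

Neither a more complex model nor a tuned one is guaranteed to beat the baseline, so both are evaluated on the same held-out test set.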
A common rule of thumb says that data scientists spend 80% of their time on a project gathering and cleaning data and only 20% on the actual modeling. In line with this, adding more data might greatly improve a model's performance, especially when dealing with imbalanced...