Finally, we will employ a random forest ensemble. Once again, we use validation curves to determine the optimal ensemble size. From the following graph, we conclude that 50 trees yield the lowest variance among the ensemble sizes we tested, so we proceed with an ensemble size of 50:
Validation curves for random forest
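As a rough illustration of how such validation curves can be computed, the following sketch uses scikit-learn's validation_curve function to score a random forest at several ensemble sizes. The synthetic dataset, the grid of candidate sizes, and the fold count are all assumptions made for this example. They are not the data or settings behind the graph above, so the numbers it prints will differ from ours:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic stand-in data (an assumption; not the chapter's dataset).
x, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate ensemble sizes to compare (a hypothetical grid).
param_range = [10, 25, 50, 100]

# Cross-validated train/validation scores for each ensemble size.
train_scores, test_scores = validation_curve(
    RandomForestClassifier(random_state=0),
    x, y,
    param_name="n_estimators",
    param_range=param_range,
    cv=5,
    scoring="accuracy",
)

# Mean and spread across folds; a small spread suggests low variance.
test_mean = test_scores.mean(axis=1)
test_std = test_scores.std(axis=1)

for size, mean, std in zip(param_range, test_mean, test_std):
    print(f"n_estimators={size:>3}: accuracy={mean:.3f} +/- {std:.3f}")
```

Plotting test_mean with a band of test_std around it for each ensemble size gives a curve of the kind shown above. A reasonable choice is the smallest size at which the validation score is high and its spread across folds is narrow.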
The following code loads the required libraries and data, then trains and evaluates the ensemble on both the original and the filtered datasets. We also report the performance achieved on each dataset. We first load the required libraries and data and create the train and test splits:
# --- SECTION 1 ---
# Libraries and data loading
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils...