Now that we have learned how to evaluate the model's accuracy more reliably using the ShuffleSplit cross-validation method, it is time to test our earlier hypothesis: would a smaller tree be more accurate?
Here is what we are going to do in the following subsections:
- Split the data into training and test sets.
- Keep the test set to one side for now.
- Limit the tree's growth using different values of max_depth.
- For each max_depth setting, we will use the ShuffleSplit cross-validation method on the training set to get an estimate of the classifier's accuracy.
- Once we decide which value to use for max_depth, we will train the algorithm one last time on the entire training set and predict on the test set.
Splitting the data
Here is the usual code for splitting the data into training and test sets:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.25)
x_train = df_train[iris.feature_names]
x_test = df_test[iris.feature_names]
y_train = df_train['target']
y_test = df_test['target']
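If you want the split to be reproducible and to keep the class proportions similar in both sets, one way to do it is sketched below; the random_state value of 42 is an arbitrary choice for illustration, not something the rest of the chapter relies on:

# A reproducible, class-stratified split; the random_state value is arbitrary
df_train, df_test = train_test_split(
    df, test_size=0.25, random_state=42, stratify=df['target']
)

Either way, the rest of this section only relies on the x_train, x_test, y_train, and y_test variables defined above.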
Trying different hyperparameter values
If we allowed our earlier tree to grow indefinitely, it would reach a depth of 4. You can check the depth of a tree by calling clf.get_depth() once it is trained. Since the tree never grows deeper than 4 on this data, it doesn't make sense to try any max_depth values above 4.
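If you want to verify that depth yourself, here is a minimal sketch of the check; it assumes the x_train and y_train variables prepared above, and the unrestricted_clf name is only for illustration:

from sklearn.tree import DecisionTreeClassifier

# Fit a tree with no depth limit, then ask how deep it actually grew
unrestricted_clf = DecisionTreeClassifier()
unrestricted_clf.fit(x_train, y_train)
# For the tree we grew earlier in this chapter, this printed 4
print(unrestricted_clf.get_depth())

Now, we are going to loop over the maximum depths from 1 to 4 and use ShuffleSplit to get the classifier's accuracy for each setting: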
import pandas as pd
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate

for max_depth in [1, 2, 3, 4]:
    # We initialize a new classifier each iteration with different max_depth
    clf = DecisionTreeClassifier(max_depth=max_depth)
    # We also initialize our shuffle splitter
    rs = ShuffleSplit(n_splits=20, test_size=0.25)
    cv_results = cross_validate(
        clf, x_train, y_train, cv=rs, scoring='accuracy'
    )
    accuracy_scores = pd.Series(cv_results['test_score'])
    print(
        '@ max_depth = {}: accuracy_scores: {}~{}'.format(
            max_depth,
            accuracy_scores.quantile(.1).round(3),
            accuracy_scores.quantile(.9).round(3)
        )
    )
We called the cross_validate() function as we did earlier, giving it the classifier's instance as well as the ShuffleSplit instance. We also set the evaluation metric to accuracy. Finally, we print the scores we get in each iteration. We will look at the printed values more closely in the next section.
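As a small aside, cross_validate() returns a plain dictionary; besides the test scores we used above, it also reports how long fitting and scoring took on each split, which you can inspect like this:

# With the default settings, this typically prints
# ['fit_time', 'score_time', 'test_score']
print(sorted(cv_results.keys()))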
Comparing the accuracy scores
Since we have a list of 20 scores for each max_depth value, we can calculate their mean or, as we do here, print their 10th and 90th percentiles to get an idea of the accuracy range for each max_depth setting.
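If you prefer a single summary number per setting, you could replace the print statement in the loop with something like the following sketch, which reports the mean and standard deviation of the 20 scores instead:

print(
    '@ max_depth = {}: accuracy: {} +/- {}'.format(
        max_depth,
        accuracy_scores.mean().round(3),
        accuracy_scores.std().round(3)
    )
)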
Running the preceding loop (the percentile version) gave me the following results:
@ max_depth = 1: accuracy_scores: 0.532~0.646
@ max_depth = 2: accuracy_scores: 0.925~1.0
@ max_depth = 3: accuracy_scores: 0.929~1.0
@ max_depth = 4: accuracy_scores: 0.929~1.0
One thing I am sure about now is that a single-level tree (usually called a decision stump) is not as accurate as deeper trees. In other words, making a single decision based on whether the petal width is less than 0.8 is not enough. Allowing the tree to grow further improves the accuracy, but I can't see much difference between trees of depths 2, 3, and 4. I'd conclude that, contrary to my earlier speculation, we shouldn't worry too much about overfitting here.
Finally, you can train your model once more on the entire training set using a max_depth value of, say, 3, and then use the trained model to predict the classes of the test set in order to evaluate your final model. I won't walk through that code in detail this time as you can easily write it yourself, but a minimal sketch is shown below in case you want to compare your version against it.
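The sketch assumes the x_train, y_train, x_test, and y_test variables from the split above, and uses accuracy as the final metric purely because it is the score we have tracked so far:

from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Train on the entire training set with the chosen depth
final_clf = DecisionTreeClassifier(max_depth=3)
final_clf.fit(x_train, y_train)

# Evaluate once on the held-out test set
y_pred = final_clf.predict(x_test)
print('Test accuracy:', round(accuracy_score(y_test, y_pred), 3))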
In addition to printing the classifier's decisions and descriptive statistics about its accuracy, it is also useful to see its decision boundaries visually. Mapping those boundaries against the data samples helps us understand why the classifier makes certain mistakes. In the next section, we are going to check the decision boundaries we got for the Iris dataset.