Classifying data with decision trees
One way to classify data is to follow a hierarchical tree of rules that finally places an instance into a bucket. This is essentially what decision trees do. Although they can work with any type of data, they are especially helpful in classifying nominal variables (discrete categories of data, such as the species attribute of the Iris dataset), where techniques designed for numerical data, such as K-Means clustering, don't work as well.
Decision trees have another handy feature. Unlike many data mining techniques, where the analysis is something of a black box, decision trees are easy to interpret. We can examine them and readily tell how and why they classify our data the way they do.
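To make the idea of a readable tree of rules concrete, here is a minimal Python sketch of a hand-written decision tree. It is not the tree this recipe builds: the attribute names (odor, spore_color), the rules, and the classify helper are all invented for illustration and must not be read as real guidance about mushrooms. It only shows that following the branches both assigns a class and records why.

```python
# A hand-written toy decision tree. The attributes and rules are invented
# for illustration; they are not learned from any real dataset.
TREE = {
    "attribute": "odor",
    "branches": {
        "foul": "poisonous",
        "none": {
            "attribute": "spore_color",
            "branches": {"green": "poisonous", "brown": "edible"},
        },
    },
}


def classify(tree, instance, path=None):
    """Walk the tree and return (label, list of rules followed)."""
    path = [] if path is None else path
    if isinstance(tree, str):  # a leaf holds the class label
        return tree, path
    value = instance[tree["attribute"]]
    path.append(f"{tree['attribute']} = {value}")
    return classify(tree["branches"][value], instance, path)


label, rules = classify(TREE, {"odor": "none", "spore_color": "brown"})
print(label, "because", " and ".join(rules))
```

The list of rules returned alongside the label is what makes the classification easy to inspect: every decision can be traced back to the attribute tests that produced it.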
In this recipe, we'll look at a dataset of mushrooms and create a decision tree to tell us whether a mushroom instance is edible or poisonous.
Getting ready
First, we'll need to use the dependencies that we specified in the project...