Using decision trees to explore the data
We are now ready to start exploring the data with the objective of finding rules for filtering it. With many annotations to explore (in our case we reduced their number, but in general there will be many), it can be daunting to know where to start; a blind fishing expedition rarely pays off. My preferred first approach is a machine learning technique called decision trees. A decision tree will suggest which annotations are the most important for separating correct calls from erroneous ones. Another advantage of decision trees is that, unlike many other machine learning techniques, they require barely any data preparation.
How to do it…
- We start with a few imports, most notably of scikit-learn:
```python
import gzip
import pickle
import numpy as np
import graphviz
from sklearn import tree
```
- Let's load the data and split it into inputs and outputs:
```python
balanced_fit = np.load(gzip.open('balanced_fit.npy.gz', 'rb'))
ordered_features...
```
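Although the loading code above is truncated, the overall flow of the recipe, fitting a shallow decision tree and inspecting which annotations drive the splits, can be sketched as follows. The annotation names, array shapes, and labels here are illustrative stand-ins, not the chapter's actual data:

```python
import numpy as np
from sklearn import tree

# Synthetic data standing in for the loaded annotations:
# rows are variant calls, columns are annotations,
# and the label marks whether a call is an error (hypothetical setup).
rng = np.random.default_rng(42)
features = ['QD', 'FS', 'MQ']  # hypothetical annotation names
X = rng.normal(size=(200, len(features)))
y = (X[:, 0] > 0).astype(int)  # synthetic "error" label

# A shallow tree keeps the resulting rules interpretable.
clf = tree.DecisionTreeClassifier(max_depth=3)
clf.fit(X, y)

# The most informative annotations appear near the root of the tree;
# feature_importances_ summarizes how much each annotation is used.
for name, importance in zip(features, clf.feature_importances_):
    print(f'{name}: {importance:.2f}')
```

The fitted tree can also be rendered with `graphviz` (via `tree.export_graphviz`), which is why the recipe imports it; the plotted tree makes the filtering thresholds easy to read off.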