In this chapter, we implemented an ML workflow or an ML pipeline. The pipeline combined several stages of data analysis into one workflow. We started by loading the data and from there on, we created training and test data, preprocessed the dataset, trained the RandomForestClassifier model, applied the Random Forest classifier to test data, evaluated the classifier, and computed a process that demonstrated the importance of each feature in the classification. We fulfilled the goal that we laid out early on in the Project overview – problem formulation section.
In the next chapter, we will analyze the Wisconsin Breast Cancer Data Set. This dataset has only categorical data. We will build another pipeline, but this time, we will set up the Hortonworks Development Platform Sandbox to develop and deploy a breast cancer prediction pipeline. Given a set of categorical feature variables, this pipeline will predict whether a given sample is benign or malignant. In the next and the last section of the current chapter, we will list a set of questions that will test your knowledge of what you have learned so far.Â