Business case
The business objective here is to see whether we can improve predictive performance on some of the cases we worked on in previous chapters. For regression, we will revisit the prostate cancer dataset from Chapter 4, Advanced Feature Selection in Linear Models. The baseline mean squared error to improve on is 0.444.
For classification, we will use both the breast cancer biopsy data from Chapter 3, Logistic Regression and Discriminant Analysis, and the Pima Indian Diabetes data from Chapter 5, More Classification Techniques - K-Nearest Neighbors and Support Vector Machines. On the breast cancer data, we achieved 97.6 percent predictive accuracy. On the diabetes data, we are seeking to improve on the 79.6 percent accuracy rate.
Both random forests and boosting will be applied to all three datasets. The simple tree method will only be used on the breast and prostate cancer sets from Chapter 4, Advanced Feature Selection in Linear...