Visualization and variable reduction
In the previous section, the housing data underwent a lot of analytical pre-processing, and we are now ready to analyze it further. We begin with visualization. Since we have a lot of variables, viewing the plots on the R graphics device is somewhat awkward. As in earlier chapters, where we visualized random forests and other large, complex structures, we will open a PDF device and store the graphs in it. In the housing dataset, the main variable is the housing price, so we will first name the output variable SalePrice. We need to visualize the data in a way that brings out the relationship between each of the numerous variables and SalePrice. The independent variables can be either numeric or categorical. If the variable is numeric, a scatterplot will indicate the kind of relationship between the variable and the SalePrice regressand. If the independent variable is categorical/factor, we will visualize the boxplot at each level of the...
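The plotting strategy described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the code used in this chapter: the data frame housing is a small synthetic stand-in, the columns LotArea and Quality and the output file name are hypothetical, and only SalePrice comes from the text.

```r
# Toy stand-in for the housing data (synthetic values, for illustration only).
# LotArea and Quality are hypothetical column names; SalePrice is the output variable.
set.seed(1)
housing <- data.frame(
  SalePrice = round(rnorm(50, mean = 180000, sd = 40000)),
  LotArea   = round(runif(50, min = 5000, max = 15000)),
  Quality   = factor(sample(c("Low", "Mid", "High"), 50, replace = TRUE))
)

# Open a PDF device so that every plot is written to a file, one plot per page.
out_file <- file.path(tempdir(), "saleprice_plots.pdf")
pdf(out_file)

# Every column other than SalePrice is treated as an independent variable.
predictors <- setdiff(names(housing), "SalePrice")

for (v in predictors) {
  x <- housing[[v]]
  if (is.numeric(x)) {
    # Numeric variable: scatterplot of SalePrice against the variable.
    plot(x, housing$SalePrice, xlab = v, ylab = "SalePrice",
         main = paste("SalePrice vs", v))
  } else {
    # Categorical/factor variable: boxplot of SalePrice at each level.
    boxplot(housing$SalePrice ~ x, xlab = v, ylab = "SalePrice",
            main = paste("SalePrice by", v))
  }
}

# Close the device so that the PDF file is flushed to disk.
invisible(dev.off())
```

Writing to a PDF device keeps each plot on its own page, which is easier to browse than cycling through many plots on the interactive device.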