Exploring the data with a few visualizations
We should start the data visualization portion of a project with univariate graphics, such as histograms and boxplots. Histograms show how each variable is distributed, which helps indicate which statistical tests may be appropriate, while boxplots reveal the presence of outliers in the data.
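To make the two roles concrete, here is a minimal sketch on synthetic data rather than the chapter's dataset. The vector `x` and the two injected extreme values are assumptions made purely for illustration. It draws both plots with base R and then uses boxplot.stats() to list the points that fall beyond the boxplot whiskers.

```r
# Sketch on synthetic data; `x` is a hypothetical stand-in for one numeric column.
set.seed(42)
x <- c(rexp(200, rate = 2), 15, 20)  # right-skewed values plus two injected extremes

# Histogram: shows the shape of the distribution (here, strongly right-skewed)
hist(x, col = "royalblue", main = "Histogram of x", xlab = "x")

# Boxplot: observations beyond the whiskers are drawn as individual points
boxplot(x, col = "royalblue", main = "Boxplot of x", horizontal = TRUE)

# boxplot.stats() returns those points in $out (default whisker rule: 1.5 * IQR)
outliers <- boxplot.stats(x)$out
length(outliers)
```

The histogram makes the skew visible at a glance, while the boxplot and `outliers` isolate the extreme observations that may need a closer look before modeling.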
Since there are more than 50 variables in this dataset, we will create a for loop to plot the histograms for all of them. The following code uses the hist() function from base R:
# Histograms
for (var in colnames(spam)[1:57]) {
  hist(unlist(spam[, var]), col = "royalblue",
       main = paste("Histogram of", var),
       xlab = var)
}
Notice that we only ran the loop for columns [1:57] since we know that the last one is the target variable. Next, we will see four graphics, as shown in Figure 13.6:
...