Packt+ | Advance your knowledge in tech

You're reading from Mastering Machine Learning with R Advanced machine learning techniques for building smart applications with R 3.5

Product type Paperback

Published in Jan 2019

Publisher

ISBN-13 9781789618006

Length 354 pages

Edition 3rd Edition

Languages

Concepts

Machine Learning

Author (1):

Cory Lesmeister

View More author details

Before moving on to dataset treatment, it's an easy task to eliminate features that have either one unique value (zero variance) or a high ratio of the most common value to the next most common value such that there're few unique values (near-zero variance). To do this, we'll lean on the caret package and the nearZeroVar() function. We get started by creating a dataframe and using the function's defaults except for saveMetrics = TRUE. We need to make that specification to return the dataframe:

feature_variance <- caret::nearZeroVar(gettysburg, saveMetrics = TRUE)

To understand the default settings of the nearZeroVar() function and determine how to customize it to your needs, just use the R help function by typing ?nearZeroVar in the Console.

The output is quite interesting, so let's peek at the first six rows of what we produced:

head(feature_variance)

The output of the preceding code is as follows:

                       freqRatio     percentUnique    zeroVar     nzv
          type         3.186047      0.5110733        FALSE     FALSE
         state         1.094118      5.1107325        FALSE     FALSE
regiment_or_battery    1.105263     46.8483816        FALSE     FALSE
         brigade       1.111111     21.1243612        FALSE     FALSE
         division      1.423077      6.4735945        FALSE     FALSE
          corps        1.080000      2.3850085        FALSE     FALSE

The two key columns are zeroVar and nzv. They act as an indicator of whether or not that feature is zero variance or near-zero variance; TRUE indicates yes and FALSE not so surprisingly indicates no. The other columns must be defined:

freqRatio: This is the ratio of the percentage frequency for the most common value over the second most common value.
percentUnique: This is the number of unique values divided by the total number of samples multiplied by 100.

Let me explain that with the data we're using. For the type feature, the most common value is Infantry, which is roughly three times more common than Artillery. For percentUnique, the lower the percentage, the lower the number of unique values. You can explore this dataframe and adjust the function to determine your relevant cut points. For this example, we'll see whether we have any zero variance features by running this code:

which(feature_variance$zeroVar == 'TRUE')

The output of the preceding code is as follows:

[1] 17

Alas, we see that row 17 (feature 17) has zero variance. Let's see what that could be:

row.names(feature_variance[17, ])

The output of the preceding code is as follows:

[1] "4.5inch_rifles"

This is quite strange to me. What it means is that I failed to record the number of the artillery piece in the one Confederate unit that brought them to the battle. An egregious error on my part discovered using an elegant function from the caret package. Oh well, let's create a new tibble with this filtered out for demonstration purposes:

gettysburg_fltrd <- gettysburg[, feature_variance$zeroVar == 'FALSE']

This code eliminates the zero variance feature. If we wanted also to eliminate near-zero variance as well, just run the code and substitute feature_variance$zerVar with feature_variance$nzv.

We're now ready to perform the real magic of this process and treat our data.