Creating dummies for categorical variables
When we have categorical variables (factors) but need to use them in analytical methods that require numeric input (for example, k-nearest neighbors (KNN) or linear regression), we need to create dummy variables.
Getting ready
Place the data-conversion.csv file in the working directory of your R environment. Install the dummies package, load it, and then read the data:
> install.packages("dummies")
> library(dummies)
> students <- read.csv("data-conversion.csv")
How to do it...
Create dummies for all factors in the data frame:
> students.new <- dummy.data.frame(students, sep = ".")
> names(students.new)
[1] "Age" "State.NJ" "State.NY" "State.TX" "State.VA"
[6] "Gender.F" "Gender.M" "Height" "Income"
The students.new data frame now contains the original non-factor variables (Age, Height, and Income) along with the newly created dummy variables, which replace the State and Gender factors. The dummy.data.frame() function has created dummy variables for all four levels of State and both levels of Gender. However, we will generally omit one dummy variable for State and one for Gender when we use machine learning techniques, because the omitted level is already implied when all the remaining dummies for that factor are zero.
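As a minimal sketch of omitting one dummy per factor, base R's model.matrix() can be used instead of the dummies package; with R's default treatment contrasts it leaves out the first level of each factor. The toy data frame below is an assumption for illustration: its State values mirror the dummy() output shown in this recipe, while the Gender values are made up.

```r
# Toy data (hypothetical): State values follow the dummy() output in this
# recipe; Gender values are invented purely for illustration.
toy <- data.frame(
  State  = factor(c("NJ", "NY", "NJ", "VA", "NY", "TX", "NJ", "VA", "TX", "VA")),
  Gender = factor(c("F", "M", "M", "F", "F", "M", "F", "M", "F", "M"))
)

# model.matrix() expands each factor into indicator columns. With the default
# treatment contrasts, the first level of each factor (NJ and F here) is
# treated as the reference level and gets no column of its own.
mm <- model.matrix(~ State + Gender, data = toy)

# Drop the intercept column so that only the dummy variables remain.
mm <- mm[, colnames(mm) != "(Intercept)", drop = FALSE]

colnames(mm)
# [1] "StateNY" "StateTX" "StateVA" "GenderM"
```

A row with zeros in StateNY, StateTX, and StateVA corresponds to the reference level NJ, so no information is lost by leaving that column out.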
We can use the optional argument all = FALSE to specify that the resulting data frame should contain only the generated dummy variables and none of the original variables.
How it works...
The dummy.data.frame() function creates dummies for all the factors in the data frame supplied. Internally, it uses another function, dummy(), which creates dummy variables for a single factor. The dummy() function creates one new variable for every level of the factor. It generates the name of each dummy variable by appending the factor level name to the variable name. We can use the sep argument to specify the character that separates the two parts; the default is an empty string:
> dummy(students$State, sep = ".")
      State.NJ State.NY State.TX State.VA
 [1,]        1        0        0        0
 [2,]        0        1        0        0
 [3,]        1        0        0        0
 [4,]        0        0        0        1
 [5,]        0        1        0        0
 [6,]        0        0        1        0
 [7,]        1        0        0        0
 [8,]        0        0        0        1
 [9,]        0        0        1        0
[10,]        0        0        0        1
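The naming and encoding scheme described above can be sketched in base R. This is not the dummies package's actual implementation; make_dummies() is a hypothetical helper written only to illustrate the idea of one indicator column per level, named as variable name, separator, level name. The State values are taken from the output above.

```r
# Hypothetical helper (not part of the dummies package): builds one 0/1
# indicator column per level of x and names it <name><sep><level>.
make_dummies <- function(x, name, sep = "") {
  f  <- as.factor(x)
  lv <- levels(f)
  # outer() compares every observation with every level, giving a logical
  # matrix with one row per observation and one column per level.
  m <- outer(as.character(f), lv, "==")
  storage.mode(m) <- "integer"   # TRUE/FALSE become 1/0
  colnames(m) <- paste(name, lv, sep = sep)
  m
}

# State values as they appear in the dummy() output shown in this recipe.
state <- c("NJ", "NY", "NJ", "VA", "NY", "TX", "NJ", "VA", "TX", "VA")

d <- make_dummies(state, "State", sep = ".")
colnames(d)
# [1] "State.NJ" "State.NY" "State.TX" "State.VA"
```

Each row of d contains exactly one 1, in the column for that observation's level, which matches the pattern in the dummy() output.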
There's more...
When a data frame has several factors and you plan to use only a subset of them, you can create dummies for just the chosen subset.
Choosing which variables to create dummies for
To create a dummy only for one variable or a subset of variables, we can use the names argument to specify the column names of the variables we want dummies for:
> students.new1 <- dummy.data.frame(students, names = c("State", "Gender"), sep = ".")