You're reading from R Data Analysis Cookbook, Second Edition Customizable R Recipes for data mining, data visualization and time series analysis

Product type Paperback

Published in Sep 2017

Publisher Packt

ISBN-13 9781787124479

Length 560 pages

Edition 2nd Edition


Authors (3):

Kuntal Ganguly

Shanthi Viswanathan

Viswa Viswanathan


Creating dummies for categorical variables

When we have categorical variables (factors) but need to use them in analytical methods that require numbers (for example, k-nearest neighbors (KNN) or linear regression), we need to create dummy variables.

Getting ready

Download the data-conversion.csv file and store it in the working directory of your R environment. Install the dummies package. Then read the data:

> install.packages("dummies") 
> library(dummies) 
> students <- read.csv("data-conversion.csv")
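The recipe refers to the categorical columns as factors. In R 4.0.0 and later, read.csv() leaves text columns as character vectors unless stringsAsFactors = TRUE is passed. The following is a minimal sketch of that option. The inline data and the students_demo name are hypothetical stand-ins, since the actual contents of data-conversion.csv are not shown here:

```r
# Hypothetical stand-in for data-conversion.csv; the values are made up
csv_text <- "Age,State,Gender,Height,Income
23,NJ,F,61,5000
13,NY,M,55,1000
36,NJ,M,66,3000"

# stringsAsFactors = TRUE converts the text columns to factors on read
students_demo <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Inspect the class of every column
sapply(students_demo, class)
```

With the real file, the same argument can be added to the read.csv("data-conversion.csv") call.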

How to do it...

Create dummies for all factors in the data frame:

> students.new <- dummy.data.frame(students, sep = ".") 
> names(students.new) 
 
[1] "Age"      "State.NJ" "State.NY" "State.TX" "State.VA" 
[6] "Gender.F" "Gender.M" "Height"   "Income"

The students.new data frame now contains the original non-factor variables (Age, Height, and Income) together with the newly created dummy variables, which replace the State and Gender columns. The dummy.data.frame() function has created one dummy variable for each of the four levels of State and each of the two levels of Gender. However, we will generally omit one of the dummy variables for State and one for Gender when we use machine learning techniques, because the omitted dummy is fully determined by the remaining ones.
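One way to get a reduced set of dummies, with one level per factor left out, is base R's model.matrix(). This is a sketch of an alternative technique, not the dummies package's method. The demo data frame is hypothetical: its State values follow the order shown in this recipe's output, and its Gender values are made up.

```r
# Hypothetical data: State follows the recipe's output, Gender is made up
demo <- data.frame(
  State  = factor(c("NJ", "NY", "NJ", "VA", "NY", "TX", "NJ", "VA", "TX", "VA")),
  Gender = factor(c("F", "M", "M", "F", "F", "M", "F", "M", "F", "M"))
)

# With the default treatment contrasts, model.matrix() encodes a factor
# with k levels as k - 1 columns; the first level acts as the baseline
mm <- model.matrix(~ State + Gender, data = demo)

# Drop the intercept column to keep only the dummy columns
reduced <- mm[, -1]

colnames(reduced)
# [1] "StateNY" "StateTX" "StateVA" "GenderM"
```

Here NJ and F are the baseline levels, so a row with zeros in all State columns is an NJ row.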

We can use the optional argument all = FALSE to specify that the resulting data frame should contain only the generated dummy variables and none of the original variables.

How it works...

The dummy.data.frame() function creates dummies for all the factors in the data frame supplied. Internally, it uses the dummy() function, which creates dummy variables for a single factor. The dummy() function creates one new variable for every level of the factor. It generates the name of each dummy variable by appending the factor level name to the variable name. We can use the sep argument to specify the character that separates the two parts; an empty string is the default:

> dummy(students$State, sep = ".") 
 
      State.NJ State.NY State.TX State.VA 
 [1,]        1        0        0        0 
 [2,]        0        1        0        0 
 [3,]        1        0        0        0 
 [4,]        0        0        0        1 
 [5,]        0        1        0        0 
 [6,]        0        0        1        0 
 [7,]        1        0        0        0 
 [8,]        0        0        0        1 
 [9,]        0        0        1        0 
[10,]        0        0        0        1
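The idea of one 0/1 column per level can be sketched in a few lines of base R. The make_dummies() helper below is hypothetical and written only for illustration; it is not the implementation used by the dummies package. The State values are taken from the output above.

```r
# State values in the order shown in the output above
state <- factor(c("NJ", "NY", "NJ", "VA", "NY", "TX", "NJ", "VA", "TX", "VA"))

# Hypothetical helper, not the dummies package's implementation
make_dummies <- function(x, name, sep = "") {
  x <- as.factor(x)
  # Build one 0/1 column per level: 1 where the value equals that level
  out <- sapply(levels(x), function(lv) as.integer(x == lv))
  # Name each column as <variable name><sep><level name>
  colnames(out) <- paste(name, levels(x), sep = sep)
  out
}

state_dummies <- make_dummies(state, "State", sep = ".")
head(state_dummies, 3)
```

Each row of the result contains exactly one 1, in the column for that row's level.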

There's more...

When a data frame has several factors and you plan to use only a subset of them, you can create dummies for just the chosen subset.

Choosing which variables to create dummies for

To create a dummy only for one variable or a subset of variables, we can use the names argument to specify the column names of the variables we want dummies for:

> students.new1 <- dummy.data.frame(students, names = c("State", "Gender"), sep = ".")
