You're reading from R Data Analysis Cookbook, Second Edition Customizable R Recipes for data mining, data visualization and time series analysis

Product type Paperback

Published in Sep 2017

Publisher Packt

ISBN-13 9781787124479

Length 560 pages

Edition 2nd Edition

Authors (3):

Kuntal Ganguly

Shanthi Viswanathan

Viswa Viswanathan

Binning numerical data

Sometimes, we need to convert numerical data to categorical data, that is, to a factor. For example, some implementations of Naive Bayes classification expect all variables (independent and dependent) to be categorical. In other situations, we may want to apply a classification method to a problem where the dependent variable is numeric and must first be converted to categories.

Getting ready

From the code files for this chapter, store the data-conversion.csv file in the working directory of your R environment. Then read the data:

> students <- read.csv("data-conversion.csv")
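If the CSV file is not at hand, the same ten rows can be typed in directly. The following is a minimal sketch that assumes the file holds exactly the five columns and ten rows shown in the printed output later in this recipe; the real file may differ.

```r
# Stand-in for data-conversion.csv. The values are copied from the
# output printed in this recipe (assumption: the file has no other columns).
students <- data.frame(
  Age    = c(23, 13, 36, 31, 58, 29, 39, 50, 23, 36),
  State  = c("NJ", "NY", "NJ", "VA", "NY", "TX", "NJ", "VA", "TX", "VA"),
  Gender = c("F", "M", "M", "F", "F", "F", "M", "M", "F", "M"),
  Height = c(61, 55, 66, 64, 70, 63, 67, 70, 61, 66),
  Income = c(5000, 1000, 3000, 4000, 30000, 10000, 50000, 55000, 2000, 20000)
)

# Confirm that Income is numeric before binning it
str(students)
```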

How to do it...

Income is a numeric variable, and you may want to create a categorical variable from it by creating bins. Suppose you want to label incomes of $10,000 or below as Low, incomes above $10,000 and up to $31,000 as Medium, and the rest as High. You can do so as follows:

Create a vector of break points:

> b <- c(-Inf, 10000, 31000, Inf)

Create a vector of names for the bins (one fewer than the number of break points):

> names <- c("Low", "Medium", "High")

Cut the vector using the break points:

> students$Income.cat <- cut(students$Income, breaks = b, labels = names) 
> students 
 
   Age State Gender Height Income Income.cat 
1   23    NJ      F     61   5000        Low 
2   13    NY      M     55   1000        Low 
3   36    NJ      M     66   3000        Low 
4   31    VA      F     64   4000        Low 
5   58    NY      F     70  30000     Medium 
6   29    TX      F     63  10000        Low 
7   39    NJ      M     67  50000       High 
8   50    VA      M     70  55000       High 
9   23    TX      F     61   2000        Low 
10  36    VA      M     66  20000     Medium
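A quick way to check the binning is to tabulate the new factor. The sketch below repeats the three steps on a plain vector of the incomes printed above; `bin_labels` and `income_cat` are hypothetical names chosen here (the former so that the base function `names()` is not masked).

```r
# Incomes copied from the printed data frame above
income <- c(5000, 1000, 3000, 4000, 30000, 10000, 50000, 55000, 2000, 20000)

b <- c(-Inf, 10000, 31000, Inf)            # break points
bin_labels <- c("Low", "Medium", "High")   # one label per bin

income_cat <- cut(income, breaks = b, labels = bin_labels)

# Count how many observations fall into each bin
counts <- table(income_cat)
print(counts)
```

The counts should agree with the Income.cat column in the printed data frame: six Low, two Medium, and two High.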

How it works...

The cut() function uses the ranges implied by the breaks argument to define the bins and names them with the strings provided in the labels argument. By default (right = TRUE), each interval excludes its lower bound and includes its upper bound. In our example, the function places incomes less than or equal to 10,000 in the first bin, incomes greater than 10,000 and less than or equal to 31,000 in the second bin, and incomes greater than 31,000 in the third bin. The number of bins is always one less than the number of elements in breaks. The strings in names become the levels of the resulting factor.
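The boundary behaviour is easiest to see on values that sit exactly on a break point. The following sketch compares the default with right = FALSE, which closes each interval on the left instead; the test values in `x` and the name `bin_labels` are made up for illustration.

```r
b <- c(-Inf, 10000, 31000, Inf)
bin_labels <- c("Low", "Medium", "High")

# Values on or next to the break points (hypothetical test data)
x <- c(9999, 10000, 10001, 31000)

# Default: intervals are (lower, upper], so 10000 falls in the first bin
right_closed <- cut(x, breaks = b, labels = bin_labels)

# right = FALSE: intervals are [lower, upper), so 10000 moves to the second bin
left_closed <- cut(x, breaks = b, labels = bin_labels, right = FALSE)

print(data.frame(x, right_closed, left_closed))

# The number of bins is one less than the number of break points
nlevels(right_closed) == length(b) - 1
```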

If we leave out the labels argument, cut() uses the numbers in breaks to construct interval names, as you can see here:

> b <- c(-Inf, 10000, 31000, Inf) 
> students$Income.cat1 <- cut(students$Income, breaks = b) 
> students 
 
   Age State Gender Height Income Income.cat     Income.cat1 
1   23    NJ      F     61   5000        Low    (-Inf,1e+04] 
2   13    NY      M     55   1000        Low    (-Inf,1e+04] 
3   36    NJ      M     66   3000        Low    (-Inf,1e+04] 
4   31    VA      F     64   4000        Low    (-Inf,1e+04] 
5   58    NY      F     70  30000     Medium (1e+04,3.1e+04] 
6   29    TX      F     63  10000        Low    (-Inf,1e+04] 
7   39    NJ      M     67  50000       High  (3.1e+04, Inf] 
8   50    VA      M     70  55000       High  (3.1e+04, Inf] 
9   23    TX      F     61   2000        Low    (-Inf,1e+04] 
10  36    VA      M     66  20000     Medium (1e+04,3.1e+04]
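The scientific notation in these generated labels comes from the default number of digits that cut() uses when formatting break points. As a sketch, the dig.lab argument of cut() can be raised so that the breaks print in full; the value 5 is chosen here only because the largest break, 31000, has five digits.

```r
# Incomes copied from the printed data frame above
income <- c(5000, 1000, 3000, 4000, 30000, 10000, 50000, 55000, 2000, 20000)
b <- c(-Inf, 10000, 31000, Inf)

# dig.lab sets how many digits are used when break points are formatted
income_cat1 <- cut(income, breaks = b, dig.lab = 5)

# Inspect the generated interval labels
print(levels(income_cat1))
```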

There's more...

You might not always be in a position to identify the breaks manually and may instead want to rely on R to do this automatically.

Creating a specified number of intervals automatically

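Rather than determining the breaks, and hence the intervals, manually, we can pass the number of bins we want, say n, as the breaks argument and let cut() compute the break points. In this case, cut() divides the range of the data into n intervals of equal width and then moves the two outermost limits outward by 0.1% of the range so that the minimum and maximum values fall inside a bin. The resulting intervals are therefore of approximately equal width: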

> students$Income.cat2 <- cut(students$Income, breaks = 4,
+     labels = c("Level1", "Level2", "Level3", "Level4"))
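To see what the automatic breaks produce for this data, the sketch below repeats the call on a plain vector of the incomes printed earlier and tabulates the result. The names `income`, `income_cat2`, `counts2`, and `width` are hypothetical.

```r
# Incomes copied from the printed data frame earlier in this recipe
income <- c(5000, 1000, 3000, 4000, 30000, 10000, 50000, 55000, 2000, 20000)

# Ask for four bins of (approximately) equal width
income_cat2 <- cut(income, breaks = 4,
                   labels = c("Level1", "Level2", "Level3", "Level4"))

# Width of each bin before the outer limits are nudged outward
width <- diff(range(income)) / 4
print(width)

# Number of observations per bin
counts2 <- table(income_cat2)
print(counts2)
```

Equal-width bins need not hold equal numbers of observations. Here the incomes are skewed toward the low end, so most of them land in Level1.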
