You're reading from R Data Analysis Cookbook, Second Edition Customizable R Recipes for data mining, data visualization and time series analysis

Product type Paperback

Published in Sep 2017

Publisher Packt

ISBN-13 9781787124479

Length 560 pages

Edition 2nd Edition

Authors (3):

Kuntal Ganguly

Viswanathan

Viswa Viswanathan

Normalizing or standardizing data in a data frame

Distance computations play a big role in many data analytics techniques. Variables measured on larger scales tend to dominate distance computations, so you may want to use the standardized (or z) values instead.
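The effect of scale on distances can be seen in a small sketch. The `people` data frame and its values below are made up for illustration and are not part of the recipe's data; the sketch only uses base R's `dist()` and `scale()`.

```r
# Hypothetical toy data: income in dollars, age in years
people <- data.frame(income = c(50000, 50500, 50800),
                     age    = c(25, 60, 26))

# Euclidean distances on the raw values: income differences swamp age
d.raw <- as.matrix(dist(people))

# Euclidean distances on the z values: both variables contribute comparably
d.z <- as.matrix(dist(scale(people)))

# Compare the distances from row 1 to rows 2 and 3 in each case
d.raw[1, 2:3]
d.z[1, 2:3]
```

In this toy data, row 1 is closest to row 2 on the raw values (similar income, very different age), but closest to row 3 once both variables are standardized.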

Getting ready

Download the BostonHousing.csv data file and store it in your R environment's working directory. Then read the data:

> housing <- read.csv("BostonHousing.csv")

How to do it...

To standardize all the variables in a data frame containing only numeric variables, use:

> housing.z <- scale(housing)

You can use the scale() function only on data frames in which every variable is numeric. Otherwise, you will get an error. Note that scale() returns a matrix rather than a data frame.
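As a rough sketch of what happens with mixed data, the following uses a made-up data frame `df` (not the housing data) with one character column. It shows the error and one possible workaround using `sapply()` with `is.numeric`; this is an illustration, not the recipe's own approach.

```r
# Hypothetical mixed data frame: two numeric columns and one character column
df <- data.frame(height = c(150, 160, 170),
                 weight = c(50, 60, 70),
                 city   = c("A", "B", "C"),
                 stringsAsFactors = FALSE)

# scale(df) stops with an error because 'city' is not numeric;
# try() captures the error so the script can continue
failed <- inherits(try(scale(df), silent = TRUE), "try-error")

# Find the numeric columns and standardize only those
num.cols <- sapply(df, is.numeric)
df.z <- df
df.z[num.cols] <- as.data.frame(scale(df[num.cols]))
```

Here `df.z` keeps the `city` column untouched, while `height` and `weight` are replaced by their z values.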

How it works...

When invoked in the preceding example, the scale() function computes the standard z score for each value (ignoring NAs) of each variable. That is, from each value it subtracts the mean and divides the result by the standard deviation of the associated variable.
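The z-score computation described above can be checked by hand. This is a minimal sketch on a made-up vector `x` that includes one NA; it compares a manual calculation with the result of `scale()`.

```r
# Hypothetical values with one missing entry
x <- c(2, 4, 4, NA, 10)

# Manual z scores: subtract the mean, divide by the standard deviation,
# ignoring the NA when computing both statistics
z.manual <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# scale() returns a one-column matrix; as.numeric() drops the attributes
z.scale <- as.numeric(scale(x))

# TRUE if the two agree (the NA stays NA in both)
same <- isTRUE(all.equal(z.manual, z.scale))
same
```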

The scale() function takes two optional arguments, center and scale, whose default values are TRUE. The following table shows the effect of these arguments:

Argument	Effect
`center = TRUE`, `scale = TRUE`	Default behavior described earlier
`center = TRUE`, `scale = FALSE`	From each value, subtract the mean of the associated variable
`center = FALSE`, `scale = TRUE`	Divide each value by the root mean square of the associated variable, where root mean square is sqrt(sum(x^2)/(n-1))
`center = FALSE`, `scale = FALSE`	Return the original values unchanged
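The non-default combinations in the table can be tried on a small example. The one-column matrix `x` below is made up for illustration; the root mean square is computed by hand using the formula from the table so that it can be compared with what `scale()` returns.

```r
# Hypothetical one-column matrix
x <- matrix(c(1, 2, 3, 4), ncol = 1)

# center = TRUE, scale = FALSE: subtract the mean (2.5) only
centered <- scale(x, center = TRUE, scale = FALSE)

# center = FALSE, scale = TRUE: divide by the root mean square,
# here sqrt(sum(x^2) / (n - 1)) = sqrt(30 / 3)
rms <- sqrt(sum(x^2) / (nrow(x) - 1))
rms.scaled <- scale(x, center = FALSE, scale = TRUE)

# center = FALSE, scale = FALSE: values come back unchanged
unchanged <- scale(x, center = FALSE, scale = FALSE)
```

Comparing `as.numeric(rms.scaled)` with `as.numeric(x) / rms` shows that the two agree.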

There's more...

When using distance-based techniques, you may need to rescale several variables. You may find it tedious to standardize one variable at a time.

Standardizing several variables simultaneously

If a data frame contains both numeric and non-numeric variables, or if you want to standardize only some of the variables in a fully numeric data frame, you can handle each variable separately, which is cumbersome, or use a function such as the following to handle a subset of variables:

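scale.many <- function(dat, column.nos) {
  # Original column names, used to build the names of the new columns
  nms <- names(dat)
  for(col in column.nos) {
    # New column name: original name with ".z" appended
    name <- paste(nms[col], ".z", sep = "")
    # Add the standardized values as a new column
    dat[name] <- scale(dat[,col])
  }
  # Report how many variables were scaled
  cat("Scaled", length(column.nos), "variable(s)\n")
  dat
}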

With this function, you can now do things like:

> housing <- read.csv("BostonHousing.csv") 
> housing <- scale.many(housing, c(1,3,5:7))

This will add the z values for variables 1, 3, 5, 6, and 7, with .z appended to the original column names:

> names(housing) 
 
[1] "CRIM"    "ZN"      "INDUS"   "CHAS"    "NOX"     "RM" 
[7] "AGE"     "DIS"     "RAD"     "TAX"     "PTRATIO" "B" 
[13] "LSTAT"   "MEDV"    "CRIM.z"  "INDUS.z" "NOX.z"   "RM.z" 
[19] "AGE.z"
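If the CSV file is not at hand, the function can be tried on a small made-up data frame. In the sketch below, `toy` and its columns are hypothetical, and `scale.many()` is repeated (with the newline in the message written as `\n`) only so that the block runs on its own.

```r
# scale.many() repeated here so this block is self-contained
scale.many <- function(dat, column.nos) {
  nms <- names(dat)
  for(col in column.nos) {
    name <- paste(nms[col], ".z", sep = "")
    dat[name] <- scale(dat[,col])
  }
  cat("Scaled", length(column.nos), "variable(s)\n")
  dat
}

# Hypothetical data frame with two numeric columns and one character column
toy <- data.frame(a = c(1, 2, 3, 4),
                  b = c("w", "x", "y", "z"),
                  c = c(10, 20, 40, 80),
                  stringsAsFactors = FALSE)

# Standardize columns 1 and 3 only
toy <- scale.many(toy, c(1, 3))

# Each new column should have mean 0 and standard deviation 1
sapply(toy[c("a.z", "c.z")], mean)
sapply(toy[c("a.z", "c.z")], sd)
```

The character column `b` is left alone, and the new columns `a.z` and `c.z` are appended at the end, just as in the housing example.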