Search icon CANCEL
Subscription
0
Cart icon
Cart
Close icon
You have no products in your basket yet
Save more on your purchases!
Savings automatically calculated. No voucher code required
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
R Data Analysis Cookbook - Second Edition

You're reading from  R Data Analysis Cookbook - Second Edition

Product type Book
Published in Sep 2017
Publisher Packt
ISBN-13 9781787124479
Pages 560 pages
Edition 2nd Edition
Languages
Authors (3):
Kuntal Ganguly Kuntal Ganguly
Profile icon Kuntal Ganguly
Shanthi Viswanathan Shanthi Viswanathan
Profile icon Shanthi Viswanathan
Viswa Viswanathan Viswa Viswanathan
Profile icon Viswa Viswanathan
View More author details
Toc

Table of Contents (14) Chapters close

Preface 1. Acquire and Prepare the Ingredients - Your Data 2. What's in There - Exploratory Data Analysis 3. Where Does It Belong? Classification 4. Give Me a Number - Regression 5. Can you Simplify That? Data Reduction Techniques 6. Lessons from History - Time Series Analysis 7. How does it look? - Advanced data visualization 8. This may also interest you - Building Recommendations 9. It's All About Your Connections - Social Network Analysis 10. Put Your Best Foot Forward - Document and Present Your Analysis 11. Work Smarter, Not Harder - Efficient and Elegant R Code 12. Where in the World? Geospatial Analysis 13. Playing Nice - Connecting to Other Systems

Correcting data

In practice, raw data is rarely tidy, and is much harder to work with as a result. It is often said that 80 percent of data analysis is spent on the process of cleaning and correcting the data.

In this recipe, you will learn the best way to correctly layout your data to serve two major purposes:

  • Making data suitable for software processing, whether that be mathematical functions, visualization, and others
  • Revealing information and insights

Getting ready

Download the files for this chapter and store the USArrests.csv file in your R working directory. You should also install the tidyr package using the following command.

> install.packages("tidyr")
> library(tidyr)
> crimeData <- read.csv("USArrests.csv",stringsAsFactors = FALSE)

How to do it...

Follow these steps to correct the data from your dataset.

  1. View some records of the dataset:
> View(crimeData)
  1. Add the column name state in the dataset:
> crimeData <- cbind(state = rownames(crimeData), crimeData)
  1. Gather all the variables between Murder and UrbanPop:
> crimeData.1 <- gather(crimeData,
key = "crime_type",
value = "arrest_estimate",
Murder:UrbanPop)

> crimeData.1
  1. Gather all the columns except the column state:
> crimeData.2 <- gather(crimeData,
key = "crime_type",
value = "arrest_estimate",
-state)

> crimeData.2
  1. Gather only the Murder and Assault columns:
> crimeData.3 <- gather(crimeData,
key = "crime_type",
value = "arrest_estimate",
Murder, Assault)

> crimeData.3
  1. Spread crimeData.2 to turn a pair of key:value (crime_typ:arrest_estimate) columns into a set of tidy columns
> crimeData.4 <- spread(crimeData.2,
key = "crime_type",
value = "arrest_estimate"
)
> crimeData.4

How it works...

Correct data format is crucial for facilitating the tasks of data analysis, including data manipulation, modeling, and visualization. The tidy data arranges values so that the relationships in the data parallel the structure of the data frame. Every tidy dataset is based on two basic principles:

  • Each variable is saved in its own column
  • Each observation is saved in its own row

In the crimeData dataframe, the row names were states, hence we used the function cbind() to add a column named state in the dataframe. The function gather() collapses multiple columns into key-value pairs. It makes wide data longer. The gather() function basically takes four arguments, data (dataframe), key (column name representing new variable), value (column name representing variable values), and names of the columns to gather (or not gather).

In the crimeData.2 data, all column names (except state) were collapsed into a single key column crime_type and their values were put into a value column arrest_estimate.

And, in the crimeData.3 data, the two columns Murder and Assault were collapsed and the remaining columns (state, UrbanPop, and Rape) were
duplicated.

The function spread() does the reverse of gather(). It takes two columns (key and value) and spreads them into multiple columns. It makes long data wider. The spread() function takes three arguments in general, data (dataframe), key (column values to convert to multiple columns), and value (single column value to convert to multiple columns' values).

There's more...

Beside the spread() and gather() functions, there are two more important functions in the tidyr package that help to make data tidy.

Combining multiple columns to single columns

The unite() function takes multiple columns and pastes them together into one column:

> crimeData.5 <- unite(crimeData,
col = "Murder_Assault",
Murder, Assault,
sep = "_")
> crimeData.5

We combine the columns Murder and Assault from the crimeData
data-frame to generate a new column Murder_Assault, having the values separated by _.

Splitting single column to multiple columns

The separate() function is the reverse of unite(). It takes values inside a single character column and separates them into multiple columns:

> crimeData.6 <- separate_(crimeData.5,
col = "Murder_Assault",
into = c("Murder", "Assault"),
sep = "_")
>crimeData.6
You have been reading a chapter from
R Data Analysis Cookbook - Second Edition
Published in: Sep 2017 Publisher: Packt ISBN-13: 9781787124479
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime