Tidying up data with tidyr
The tidyr package is available to clean up and tidy your dataset. tidyr rearranges your data so that:
- Each column is a variable
- Each row is an observation
When your data is arranged in this manner, it becomes much easier to analyze. Many published datasets mix variables and values across rows and columns, for example by storing values of a variable in the column names. You must rearrange such data before you can analyze it.
tidyr provides three functions for cleaning up your data:
- gather
- separate
- spread
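To give a feel for what each of the three functions does, here is a minimal sketch on a small made-up table. The sales table and its column names (store, q1_sales, q2_sales, period, amount, quarter, measure) are hypothetical and chosen only for illustration.

```r
library(tidyr)

# Hypothetical untidy table: the column names q1_sales and q2_sales
# carry values of a "quarter" variable.
sales <- data.frame(
  store    = c("A", "B"),
  q1_sales = c(10, 20),
  q2_sales = c(15, 25)
)

# gather(): collapse the two sales columns into key-value pairs.
long <- gather(sales, key = "period", value = "amount", q1_sales, q2_sales)

# separate(): split values such as "q1_sales" into two columns at the underscore.
parts <- separate(long, period, into = c("quarter", "measure"), sep = "_")

# spread(): turn the key-value pairs back into one column per quarter.
wide <- spread(parts, key = quarter, value = amount)

print(long)   # 4 rows: store, period, amount
print(parts)  # 4 rows: store, quarter, measure, amount
print(wide)   # 2 rows: store, measure, q1, q2
```

Note that gather() and spread() work in opposite directions, while separate() splits a single column into several. In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(), although both remain available.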
The gather() function takes your data and arranges it into key-value pairs, much like the key-value model used by Hadoop MapReduce. Let's use the standard example of stock prices by date:
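library(tidyr)
library(tibble)  # provides tibble(), which replaces the deprecated data_frame()
stocks <- tibble(
  time = as.Date('2017-08-05') + 0:9,  # ten consecutive dates
  X = rnorm(10, 20, 1),  # how many numbers, mean, std dev
  Y = rnorm(10, 20, 2),
  Z = rnorm(10, 20, 4)
)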
This generates a table with ten rows and four columns: time, X, Y, and Z.
Every row has a timestamp and the prices of the three stocks at that time.
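Because the prices come from rnorm(), the actual values differ on every run. The sketch below is one way to confirm the shape of the table without depending on particular values. It uses tibble(), the current replacement for data_frame(), and the set.seed() call is an addition made here for reproducibility, not part of the original example.

```r
library(tibble)

set.seed(42)  # assumption: fixed seed so repeated runs give the same prices

stocks <- tibble(
  time = as.Date("2017-08-05") + 0:9,
  X = rnorm(10, 20, 1),  # how many numbers, mean, std dev
  Y = rnorm(10, 20, 2),
  Z = rnorm(10, 20, 4)
)

dim(stocks)         # 10 rows, 4 columns
names(stocks)       # "time" "X" "Y" "Z"
range(stocks$time)  # first and last date in the table
```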
We first use gather...