Data analysis is a combination of art and science. The art part consists of data exploration and visualization, which is usually done best with better intuition and understanding of the data. The science part consists of statistical analysis, which relies on concrete knowledge of statistics and analytic skills. However, both parts of a serious research require proper tools and good skills to work with them.
R is exactly the proper tool to do data analysis with. In this article by Kun Ren, author of the book Learning R Programming, we will discuss how R and data.table package make it easy to transform data and, thus, greatly unleash our productivity.
(For more resources related to this topic, see here.)
The most basic data structures in R are atomic vectors, such as. numeric, logical, character, and complex vector, and list. An atomic vector stores elements of the same type while list is allowed to store different types of elements.
The most commonly used data structure in R to store real-world data is data frame. A data frame stores data in tabular form. In essence, a data frame is a list of vectors with equal length but maybe different types.
Most of the code in this article is based on a group of fictitious data about some products (you can download the data at https://gist.github.com/renkun-ken/ba2d33f21efded23db66a68240c20c92). We will use the readr package to load the data for better handling of column types. If you don't have this package installed, please run install.packages("readr").
library(readr)
product_info <- read_csv("data/product-info.csv")
product_info
## id name type class released
## 1 T01 SupCar toy vehicle yes
## 2 T02 SupPlane toy vehicle no
## 3 M01 JeepX model vehicle yes
## 4 M02 AircraftX model vehicle yes
## 5 M03 Runner model people yes
## 6 M04 Dancer model people no
Once the data is loaded into memory as a data frame, we can take a look at its column types, shown as follows:
sapply(product_info, class)
## id name type class released
## "character" "character" "character" "character" "character"
Although a data frame is essentially a list of vectors, we can access it like a matrix due to all column vectors being the same length. To select rows that meet certain conditions, we will supply a logical vector as the first argument of [] while the second is left empty.
For example, we can take out all rows of toy type, shown as follows:
product_info[product_info$type == "toy", ]
## id name type class released
## 1 T01 SupCar toy vehicle yes
## 2 T02 SupPlane toy vehicle no
Or, we can take out all rows that are not released.
product_info[product_info$released == "no", ]
## id name type class released
## 2 T02 SupPlane toy vehicle no
## 6 M04 Dancer model people no
To filter columns, we can supply a character vector as the second argument while the first is left empty, which is exactly the same with how we subset a matrix.
product_info[1:3, c("id", "name", "type")]
## id name type
## 1 T01 SupCar toy
## 2 T02 SupPlane toy
## 3 M01 JeepX model
Alternatively, we can filter the data frame by regarding it as a list. We can supply only one character vector of column names in [].
product_info[c("id", "name", "class")]
## id name class
## 1 T01 SupCar vehicle
## 2 T02 SupPlane vehicle
## 3 M01 JeepX vehicle
## 4 M02 AircraftX vehicle
## 5 M03 Runner people
## 6 M04 Dancer people
To filter a data frame by both row and column, we can supply a vector as the first argument to select rows and a vector as the second to select columns.
product_info[product_info$type == "toy", c("name", "class", "released")]
## name class released
## 1 SupCar vehicle yes
## 2 SupPlane vehicle no
If the row filtering condition is based on values of certain columns, the preceding code can be very redundant, especially when the condition gets more complicated. Another built-in function to simplify code is subset, as introduced previously.
subset(product_info,
subset = type == "model" & released == "yes",
select = name:class)
## name type class
## 3 JeepX model vehicle
## 4 AircraftX model vehicle
## 5 Runner model people
The subset function uses nonstandard evaluation so that we can directly use the columns of the data frame without typing product_info many times because the expressions are meant to be evaluated in the context of the data frame.
Similarly, we can use with to evaluate an expression in the context of the data frame, that is, the columns of the data frame can be used as symbols in the expression without repeatedly specifying the data frame.
with(product_info, name[released == "no"])
## [1] "SupPlane" "Dancer"
The expression can be more than a simple subsetting. We can summarize the data by counting the occurrences of each possible value of a vector. For example, we can create a table of occurrences of types of records that are released.
with(product_info, table(type[released == "yes"]))
##
## model toy
## 3 1
In addition to the table of product information, we also have a table of product statistics that describe some properties of each product.
product_stats <- read_csv("data/product-stats.csv")
product_stats
## id material size weight
## 1 T01 Metal 120 10.0
## 2 T02 Metal 350 45.0
## 3 M01 Plastics 50 NA
## 4 M02 Plastics 85 3.0
## 5 M03 Wood 15 NA
## 6 M04 Wood 16 0.6
Now, think of how we can get the names of products with the top three largest sizes? One way is to sort the records in product_stats by size in descending order, select id values of the top three records, and use these values to filter rows of product_info by id.
top_3_id <- product_stats[order(product_stats$size, decreasing = TRUE), "id"][1:3]
product_info[product_info$id %in% top_3_id, ]
## id name type class released
## 1 T01 SupCar toy vehicle yes
## 2 T02 SupPlane toy vehicle no
## 4 M02 AircraftX model vehicle yes
This approach looks quite redundant. Note that product_info and product_stats actually describe the same set of products in different perspectives. The connection between these two tables is the id column. Each id is unique and means the same product. To access both sets of information, we can put the two tables together into one data frame. The simplest way to do this is use merge:
product_table <- merge(product_info, product_stats, by = "id")
product_table
## id name type class released material size weight
## 1 M01 JeepX model vehicle yes Plastics 50 NA
## 2 M02 AircraftX model vehicle yes Plastics 85 3.0
## 3 M03 Runner model people yes Wood 15 NA
## 4 M04 Dancer model people no Wood 16 0.6
## 5 T01 SupCar toy vehicle yes Metal 120 10.0
## 6 T02 SupPlane toy vehicle no Metal 350 45.0
Now, we can create a new data frame that is a combined version of product_table and product_info with a shared id column. In fact, if you reorder the records in the second table, the two tables still can be correctly merged.
With the combined version, we can do things more easily. For example, with the merged version, we can sort the data frame with any column in one table we loaded without having to manually work with the other.
product_table[order(product_table$size), ]
## id name type class released material size weight
## 3 M03 Runner model people yes Wood 15 NA
## 4 M04 Dancer model people no Wood 16 0.6
## 1 M01 JeepX model vehicle yes Plastics 50 NA
## 2 M02 AircraftX model vehicle yes Plastics 85 3.0
## 5 T01 SupCar toy vehicle yes Metal 120 10.0
## 6 T02 SupPlane toy vehicle no Metal 350 45.0
To solve the problem, we can directly use the merged table and get the same answer.
product_table[order(product_table$size, decreasing = TRUE), "name"][1:3]
## [1] "SupPlane" "SupCar" "AircraftX"
The merged data frame allows us to sort the records by a column in one data frame and filter the records by a column in the other. For example, we can first sort the product records by weight in descending order and select all records of model type.
product_table[order(product_table$weight, decreasing = TRUE), ][
product_table$type == "model",]
## id name type class released material size weight
## 6 T02 SupPlane toy vehicle no Metal 350 45.0
## 5 T01 SupCar toy vehicle yes Metal 120 10.0
## 2 M02 AircraftX model vehicle yes Plastics 85 3.0
## 4 M04 Dancer model people no Wood 16 0.6
Sometimes, the column values are literal but can be converted to standard R data structures to better represent the data. For example, released column in product_info only takes yes and no, which can be better represented with a logical vector. We can use <- to modify the column values, as we learned previously. However, it is usually better to create a new data frame with the existing columns properly adjusted and new columns added without polluting the original data. To do this, we can use transform:
transform(product_table,
released = ifelse(released == "yes", TRUE, FALSE),
density = weight / size)
## id name type class released material size weight
## 1 M01 JeepX model vehicle TRUE Plastics 50 NA
## 2 M02 AircraftX model vehicle TRUE Plastics 85 3.0
## 3 M03 Runner model people TRUE Wood 15 NA
## 4 M04 Dancer model people FALSE Wood 16 0.6
## 5 T01 SupCar toy vehicle TRUE Metal 120 10.0
## 6 T02 SupPlane toy vehicle FALSE Metal 350 45.0
## density
## 1 NA
## 2 0.03529412
## 3 NA
## 4 0.03750000
## 5 0.08333333
## 6 0.12857143
The result is a new data frame with released converted to a logical vector and a new density column added. You can easily verify that product_table is not modified at all.
Additionally, note that transform is like subset, as both functions use nonstandard evaluation to allow direct use of data frame columns as symbols in the arguments so that we don't have to type product_table$ all the time.
Now, we will load another table into R. It is the test results of the quality, and durability of each product. We store the data in product_tests.
product_tests <- read_csv("data/product-tests.csv")
product_tests
## id quality durability waterproof
## 1 T01 NA 10 no
## 2 T02 10 9 no
## 3 M01 6 4 yes
## 4 M02 6 5 yes
## 5 M03 5 NA yes
## 6 M04 6 6 yes
Note that the values in both quality and durability contain missing values (NA). To exclude all rows with missing values, we can use na.omit():
na.omit(product_tests)
## id quality durability waterproof
## 2 T02 10 9 no
## 3 M01 6 4 yes
## 4 M02 6 5 yes
## 6 M04 6 6 yes
Another way is to use complete.cases() to get a logical vector indicating all complete rows, without any missing value,:
complete.cases(product_tests)
## [1] FALSE TRUE TRUE TRUE FALSE TRUE
Then, we can use this logical vector to filter the data frame. For example, we can get the id column of all complete rows as follows:
product_tests[complete.cases(product_tests), "id"]
## [1] "T02" "M01" "M02" "M04"
Or, we can get the id column of all incomplete rows:
product_tests[!complete.cases(product_tests), "id"]
## [1] "T01" "M03"
Note that product_info, product_stats and product_tests all share an id column, and we can merge them altogether. Unfortunately, there's no built-in function to merge an arbitrary number of data frames. We can only merge two existing data frames at a time, or we'll have to merge them recursively.
merge(product_table, product_tests, by = "id")
## id name type class released material size weight
## 1 M01 JeepX model vehicle yes Plastics 50 NA
## 2 M02 AircraftX model vehicle yes Plastics 85 3.0
## 3 M03 Runner model people yes Wood 15 NA
## 4 M04 Dancer model people no Wood 16 0.6
## 5 T01 SupCar toy vehicle yes Metal 120 10.0
## 6 T02 SupPlane toy vehicle no Metal 350 45.0
## quality durability waterproof
## 1 6 4 yes
## 2 6 5 yes
## 3 5 NA yes
## 4 6 6 yes
## 5 NA 10 no
## 6 10 9 no
In the previous section, we had an overview on how we can use built-in functions to work with data frames. Built-in functions work, but are usually verbose. In this section, let's use data.table, an enhanced version of data.frame, and see how it makes data manipulation much easier. Run install.packages("data.table") to install the package. As long as the package is ready, we can load the package and use fread() to read the data files as data.table objects.
library(data.table)
product_info <- fread("data/product-info.csv")
product_stats <- fread("data/product-stats.csv")
product_tests <- fread("data/product-tests.csv")
toy_tests <- fread("data/product-toy-tests.csv")
It is extremely easy to filter data in data.table. To select the first two rows, just use [1:2], which instead selects the first two columns for data.frame.
product_info[1:2]
## id name type class released
## 1: T01 SupCar toy vehicle yes
## 2: T02 SupPlane toy vehicle no
To filter by logical conditions, just directly type columns names as variables without quotation as the expression is evaluated within the context of product_info:
product_info[type == "model" & class == "people"]
## id name type class released
## 1: M03 Runner model people yes
## 2: M04 Dancer model people no
It is easy to select or transform columns.
product_stats[, .(id, material, density = size / weight)]
## id material density
## 1: T01 Metal 12.000000
## 2: T02 Metal 7.777778
## 3: M01 Plastics NA
## 4: M02 Plastics 28.333333
## 5: M03 Wood NA
## 6: M04 Wood 26.666667
The data.table object also supports using key for subsetting, which can be much faster than using ==. We can set a column as key for each data.table:
setkey(product_info, id)
setkey(product_stats, id)
setkey(product_tests, id)
Then, we can use a value to directly select rows.
product_info["M02"]
## id name type class released
## 1: M02 AircraftX model vehicle yes
We can also set multiple columns as key so as to use multiple values to subset it.
setkey(toy_tests, id, date)
toy_tests[.("T02", 20160303)]
## id date sample quality durability
## 1: T02 20160303 75 8 8
If two data.table objects share the same key, we can join them easily:
product_info[product_tests]
## id name type class released quality durability
## 1: M01 JeepX model vehicle yes 6 4
## 2: M02 AircraftX model vehicle yes 6 5
## 3: M03 Runner model people yes 5 NA
## 4: M04 Dancer model people no 6 6
## 5: T01 SupCar toy vehicle yes NA 10
## 6: T02 SupPlane toy vehicle no 10 9
## waterproof
## 1: yes
## 2: yes
## 3: yes
## 4: yes
## 5: no
## 6: no
Instead of creating new data.table, in-place modification is also supported. The := sets the values of a column in place without the overhead of making copies and, thus, is much faster than using <-.
product_info[, released := (released == "yes")]
## id name type class released
## 1: M01 JeepX model vehicle TRUE
## 2: M02 AircraftX model vehicle TRUE
## 3: M03 Runner model people TRUE
## 4: M04 Dancer model people FALSE
## 5: T01 SupCar toy vehicle TRUE
## 6: T02 SupPlane toy vehicle FALSE
product_info
## id name type class released
## 1: M01 JeepX model vehicle TRUE
## 2: M02 AircraftX model vehicle TRUE
## 3: M03 Runner model people TRUE
## 4: M04 Dancer model people FALSE
## 5: T01 SupCar toy vehicle TRUE
## 6: T02 SupPlane toy vehicle FALSE
Another important argument of subsetting a data.table is by, which is used to split the data into multiple parts and for each part the second argument (j) is evaluated. For example, the simplest usage of by is counting the records in each group. In the following code, we can count the number of both released and unreleased products:
product_info[, .N, by = released]
## released N
## 1: TRUE 4
## 2: FALSE 2
The group can be defined by more than one variable. For example, a tuple of type and class can be a group, and for each group, we can count the number of records, as follows:
product_info[, .N, by = .(type, class)]
## type class N
## 1: model vehicle 2
## 2: model people 2
## 3: toy vehicle 2
We can also perform the following statistical calculations for each group:
product_tests[, .(mean_quality = mean(quality, na.rm = TRUE)),
by = .(waterproof)]
## waterproof mean_quality
## 1: yes 5.75
## 2: no 10.00
We can chain multiple [] in turn. In the following example, we will first join product_info and product_tests by a shared key id and then calculate the mean value of quality and durability for each group of type and class of released products.
product_info[product_tests][released == TRUE,
.(mean_quality = mean(quality, na.rm = TRUE),
mean_durability = mean(durability, na.rm = TRUE)),
by = .(type, class)]
## type class mean_quality mean_durability
## 1: model vehicle 6 4.5
## 2: model people 5 NaN
## 3: toy vehicle NaN 10.0
Note that the values of the by columns will be unique in the resulted data.table; we can use keyby instead of by to ensure that it is automatically used as key by the resulted data.table.
product_info[product_tests][released == TRUE,
.(mean_quality = mean(quality, na.rm = TRUE),
mean_durability = mean(durability, na.rm = TRUE)),
keyby = .(type, class)]
## type class mean_quality mean_durability
## 1: model people 5 NaN
## 2: model vehicle 6 4.5
## 3: toy vehicle NaN 10.0
The data.table package also provides functions to perform superfast reshaping of data. For example, we can use dcast() to spread id values along the x-axis as columns and align quality values to all possible date values along the y-axis.
toy_quality <- dcast(toy_tests, date ~ id, value.var = "quality")
toy_quality
## date T01 T02
## 1: 20160201 9 7
## 2: 20160302 10 NA
## 3: 20160303 NA 8
## 4: 20160403 NA 9
## 5: 20160405 9 NA
## 6: 20160502 9 10
Although each month a test is conducted for each product, the dates may not exactly match with each other. This results in missing values if one product has a value on a day but the other has no corresponding value on exactly the same day.
One way to fix this is to use year-month data instead of exact date. In the following code, we will create a new ym column that is the first 6 characters of toy_tests. For example, substr(20160101, 1, 6) will result in 201601.
toy_tests[, ym := substr(toy_tests$date, 1, 6)]
## id date sample quality durability ym
## 1: T01 20160201 100 9 9 201602
## 2: T01 20160302 150 10 9 201603
## 3: T01 20160405 180 9 10 201604
## 4: T01 20160502 140 9 9 201605
## 5: T02 20160201 70 7 9 201602
## 6: T02 20160303 75 8 8 201603
## 7: T02 20160403 90 9 8 201604
## 8: T02 20160502 85 10 9 201605
toy_tests$ym
## [1] "201602" "201603" "201604" "201605" "201602" "201603"
## [7] "201604" "201605"
This time, we will use ym for alignment instead of date:
toy_quality <- dcast(toy_tests, ym ~ id, value.var = "quality")
toy_quality
## ym T01 T02
## 1: 201602 9 7
## 2: 201603 10 8
## 3: 201604 9 9
## 4: 201605 9 10
Now the missing values are gone, the quality scores of both products in each month are naturally presented.
Sometimes, we will need to combine a number of columns into one that indicates the measure and another that stores the value. For example, the following code uses melt() to combine the two measures (quality and durability) of the original data into a column named measure and a column of the measured value.
toy_tests2 <- melt(toy_tests, id.vars = c("id", "ym"),
measure.vars = c("quality", "durability"),
variable.name = "measure")
toy_tests2
## id ym measure value
## 1: T01 201602 quality 9
## 2: T01 201603 quality 10
## 3: T01 201604 quality 9
## 4: T01 201605 quality 9
## 5: T02 201602 quality 7
## 6: T02 201603 quality 8
## 7: T02 201604 quality 9
## 8: T02 201605 quality 10
## 9: T01 201602 durability 9
## 10: T01 201603 durability 9
## 11: T01 201604 durability 10
## 12: T01 201605 durability 9
## 13: T02 201602 durability 9
## 14: T02 201603 durability 8
## 15: T02 201604 durability 8
## 16: T02 201605 durability 9
The variable names are now contained in the data, which can be directly used by some packages. For example, we can use ggplot2 to plot data in such format. The following code is an example of a scatter plot with a facet grid of different combination of factors.
library(ggplot2)
ggplot(toy_tests2, aes(x = ym, y = value)) +
geom_point() +
facet_grid(id ~ measure)
The graph generated is shown as follows:
The plot can be easily manipulated because the grouping factor (measure) is contained as data rather than columns, which is easier to represent from the perspective of the ggplot2 package.
ggplot(toy_tests2, aes(x = ym, y = value, color = id)) +
geom_point() +
facet_grid(. ~ measure)
The graph generated is shown as follows:
In this article, we used both built-in functions and the data.table package to perform simple data manipulation tasks. Using built-in functions can be verbose while using data.table can be much easier and faster. However, the tasks in real-world data analysis can be much more complex than the examples we demonstrated, which also requires better R programming skills. It is helpful to have a good understanding on how nonstandard evaluation makes data.table so easy to work with, how environment works and scoping rules apply to make your code predictable, and so on. A universal and consistent understanding of how R basically works will certainly give you great confidence to write R code to work with data and enable you to learn packages very quickly.
Further resources on this subject: