Combining tables using join functions
Joining rectangular tables in data science is a powerful way to combine data from multiple sources, allowing for more complex and detailed analysis. The process of joining tables involves matching rows from one table with corresponding rows in another table, based on shared columns or keys. The ability to join tables allows data scientists to gather information from different sources and can also be used to clean and prepare data for analysis by eliminating duplicates or filling in missing values. Note that although the joining process is powerful and useful, it isn’t magic and is actually a common source of errors. The user must take care that the operation was successful in the way that they intended and that combining data doesn’t create unexpected combinations, especially empty cells and repeated rows.
The dplyr
package provides functions for manipulating and cleaning data, including a function called join()
that can be used to join tables based on one or more common columns. The join()
function supports several types of joins, including inner, left, right, and full outer joins. In this recipe, we’ll look at how each of these joins works.
Getting ready
We’ll need the dplyr
package and the rbioinfcookbook
package, which will give us a short gene expression dataset of just 10 Magnaporthe oryzae genes, and related annotation data of approximately 60,000 rows for the entire genome.
How to do it…
The process will begin with loading a data frame from the data package. The mo_gene_exp
, mo_go_acc
, and mo_go_evidence
objects are all available as data objects when you load the rbioinfcookbook
library, so we don’t have to try to load them from the file. You will have seen this behavior in numerous R tutorials before. For our work, this mimics the situation where you will already have gone through the process of loading in the data from a file on disk or received a data frame from an upstream function.
The following will help us to join tables together:
- Load the data and add terms to genes:
library(rbioinfcookbook)library(dplyr)x <- left_join(mo_gene_exp, mo_terms, by = c('gene_id' = 'Gene stable ID'))
- Add accession numbers:
y <- right_join(mo_go_acc, x, by = c( 'Gene stable ID' = 'gene_id' ) )
- Add evidence code:
z <- inner_join(y, mo_go_evidence, by = c('GO term accession' = 'GO term evidence code'))
- Compare the direction of joins:
a <- right_join(x, mo_go_acc, by = c( 'gene_id' = 'Gene stable ID') )
- Stack two data frames:
mol_func <- filter(mo_go_evidence, `GO domain` == 'molecular_function')cell_comp <- filter(mo_go_evidence, `GO domain` == 'cellular_component')bind_rows(mol_func, cell_comp)
- Put two data frames side by side:
small_mol_func <- head(mol_func, 15)small_cell_comp <- head(cell_comp, 15)bind_cols(small_mol_func, small_cell_comp)
And with that, we have joined data frames into one in most ways possible.
How it works…
The code joins different data frames in various ways. The mo_gene_exp
, mo_terms
, mo_go_acc
, and mo_go_evidence
objects are data frames, and they are loaded using the rbioinfcookbook
library. Then, the first operation is to add terms to genes using the left_join()
function. The left_join()
function joins the mo_gene_exp
and mo_terms
data frames on the gene_id
column of the mo_gene_exp
data frame and the Gene stable ID
column of the mo_terms
data frame. Note the increase in rows as well as columns because of the multiple matching rows.
By step 2, we’re adding accession numbers using the right_join()
function to join the mo_go_acc
data frame and the result of the first join (x
) on the Gene stable ID
column of the mo_go_acc
data frame and the gene_id
column of the x
data frame. Ordering the data frames this way minimizes the number of rows; see step 5 for how the converse goes. Note that the right_join()
function returns the full set of rows from the right data frame.
Step 3’s inner_join()
function demonstrates that only the rows shared are returned. The remaining steps create subsets of the mo_go_evidence
data frame based on the component to highlight how bind_rows()
does a name-unaware stacking and bind_cols()
does a blind left-right paste/concatenation of data frames. These last two functions are quick and easy but do not do anything clever, so be sure that the data can be properly joined this way.