You're reading from R Bioinformatics Cookbook Utilize R packages for bioinformatics, genomics, data science, and machine learning

Product type Paperback

Published in Oct 2023

Publisher Packt

ISBN-13 9781837634279

Length 396 pages

Edition 2nd Edition

Languages

Tools

ChatGPT

Concepts

Bioinformatics

Author (1):

Dan MacLean

View More author details

Table of Contents (16) Chapters

Preface

1. Chapter 1: Setting Up Your R Bioinformatics Working Environment

2. Chapter 2: Loading, Tidying, and Cleaning Data in the tidyverse FREE CHAPTER

3. Chapter 3: ggplot2 and Extensions for Publication Quality Plots

4. Chapter 4: Using Quarto to Make Data-Rich Reports, Presentations, and Websites

5. Chapter 5: Easily Performing Statistical Tests Using Linear Models

6. Chapter 6: Performing Quantitative RNA-seq

7. Chapter 7: Finding Genetic Variants with HTS Data

8. Chapter 8: Searching Gene and Protein Sequences for Domains and Motifs

9. Chapter 9: Phylogenetic Analysis and Visualization

10. Chapter 10: Analyzing Gene Annotations

11. Chapter 11: Machine Learning with mlr3

12. Chapter 12: Functional Programming with purrr and base R

13. Chapter 13: Turbo-Charging Development in R with ChatGPT

14. Index

Why subscribe?

15. Other Books You May Enjoy

Combining tables using join functions

Joining rectangular tables in data science is a powerful way to combine data from multiple sources, allowing for more complex and detailed analysis. The process of joining tables involves matching rows from one table with corresponding rows in another table, based on shared columns or keys. The ability to join tables allows data scientists to gather information from different sources and can also be used to clean and prepare data for analysis by eliminating duplicates or filling in missing values. Note that although the joining process is powerful and useful, it isn’t magic and is actually a common source of errors. The user must take care that the operation was successful in the way that they intended and that combining data doesn’t create unexpected combinations, especially empty cells and repeated rows.

The dplyr package provides functions for manipulating and cleaning data, including a function called join() that can be used to join tables based on one or more common columns. The join() function supports several types of joins, including inner, left, right, and full outer joins. In this recipe, we’ll look at how each of these joins works.

Getting ready

We’ll need the dplyr package and the rbioinfcookbook package, which will give us a short gene expression dataset of just 10 Magnaporthe oryzae genes, and related annotation data of approximately 60,000 rows for the entire genome.

How to do it…

The process will begin with loading a data frame from the data package. The mo_gene_exp, mo_go_acc, and mo_go_evidence objects are all available as data objects when you load the rbioinfcookbook library, so we don’t have to try to load them from the file. You will have seen this behavior in numerous R tutorials before. For our work, this mimics the situation where you will already have gone through the process of loading in the data from a file on disk or received a data frame from an upstream function.

The following will help us to join tables together:

Load the data and add terms to genes:

library(rbioinfcookbook)library(dplyr)x <- left_join(mo_gene_exp, mo_terms, by = c('gene_id' = 'Gene stable ID'))

Add accession numbers:

y <- right_join(mo_go_acc, x, by = c( 'Gene stable ID' = 'gene_id' ) )

Add evidence code:

z <- inner_join(y, mo_go_evidence, by = c('GO term accession' = 'GO term evidence code'))

Compare the direction of joins:

a <- right_join(x, mo_go_acc, by = c( 'gene_id' = 'Gene stable ID') )

Stack two data frames:

mol_func <- filter(mo_go_evidence, `GO domain` == 'molecular_function')cell_comp <- filter(mo_go_evidence, `GO domain` == 'cellular_component')bind_rows(mol_func, cell_comp)

Put two data frames side by side:

small_mol_func <- head(mol_func, 15)small_cell_comp <- head(cell_comp, 15)bind_cols(small_mol_func, small_cell_comp)

And with that, we have joined data frames into one in most ways possible.

How it works…

The code joins different data frames in various ways. The mo_gene_exp, mo_terms, mo_go_acc, and mo_go_evidence objects are data frames, and they are loaded using the rbioinfcookbook library. Then, the first operation is to add terms to genes using the left_join() function. The left_join() function joins the mo_gene_exp and mo_terms data frames on the gene_id column of the mo_gene_exp data frame and the Gene stable ID column of the mo_terms data frame. Note the increase in rows as well as columns because of the multiple matching rows.

By step 2, we’re adding accession numbers using the right_join() function to join the mo_go_acc data frame and the result of the first join (x) on the Gene stable ID column of the mo_go_acc data frame and the gene_id column of the x data frame. Ordering the data frames this way minimizes the number of rows; see step 5 for how the converse goes. Note that the right_join() function returns the full set of rows from the right data frame.

Step 3’s inner_join() function demonstrates that only the rows shared are returned. The remaining steps create subsets of the mo_go_evidence data frame based on the component to highlight how bind_rows() does a name-unaware stacking and bind_cols() does a blind left-right paste/concatenation of data frames. These last two functions are quick and easy but do not do anything clever, so be sure that the data can be properly joined this way.