Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
R Bioinformatics Cookbook
R Bioinformatics Cookbook

R Bioinformatics Cookbook: Use R and Bioconductor to perform RNAseq, genomics, data visualization, and bioinformatic analysis

Arrow left icon
Profile Icon Dr Dan Maclean Profile Icon MacLean
Arrow right icon
$59.99
Full star icon Full star icon Half star icon Empty star icon Empty star icon 2.7 (3 Ratings)
Paperback Oct 2019 316 pages 1st Edition
eBook
$9.99 $47.99
Paperback
$59.99
Subscription
Free Trial
Renews at $19.99p/m
Arrow left icon
Profile Icon Dr Dan Maclean Profile Icon MacLean
Arrow right icon
$59.99
Full star icon Full star icon Half star icon Empty star icon Empty star icon 2.7 (3 Ratings)
Paperback Oct 2019 316 pages 1st Edition
eBook
$9.99 $47.99
Paperback
$59.99
Subscription
Free Trial
Renews at $19.99p/m
eBook
$9.99 $47.99
Paperback
$59.99
Subscription
Free Trial
Renews at $19.99p/m

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Table of content icon View table of contents Preview book icon Preview Book

R Bioinformatics Cookbook

Performing Quantitative RNAseq

The technology of RNAseq has revolutionized the study of transcript abundances, bringing high-sensitivity detection and high-throughput analysis. Bioinformatic analysis pipelines using RNAseq data typically start with a read quality control step followed by either alignment to a reference or the assembly of sequence reads into longer transcripts de novo. After that, transcript abundances are estimated with read counting and statistical models and differential expression between samples is assessed. Naturally, there are many technologies available for all steps of this pipeline. The quality control and read alignment steps will usually take place outside of R, so analysis in R will begin with a file of transcript or gene annotations (such as GFF and BED files) and a file of aligned reads (such as BAM files).

The tools in R for performing analysis are powerful and flexible. Many of them are part of the Bioconductor suite and, as such, integrate together very nicely. The key question researchers wish to answer with RNAseq is usually: Which transcripts are differentially expressed? In this chapter, we'll look at some recipes for that in standard cases where we already know the genomic positions of genes we're interested in, and in cases where we need to find unannotated transcripts. We'll also look at other important recipes that help answer the questions How many replicates are enough? and Which allele is expressed more?

In this chapter, we will cover the following recipes:

  • Estimating differential expression with edgeR
  • Estimating differential expression with DESeq2
  • Power analysis with powsimR
  • Finding unannotated transcribed regions with GRanges objects
  • Finding regions showing high expression ab initio with bumphunter
  • Differential peak analysis
  • Estimating batch effects using SVA
  • Finding allele-specific expression with AllelicImbalance
  • Plotting and presenting RNAseq data

Technical requirements

The sample data you'll need is available from this book's GitHub repository: https://github.com/PacktPublishing/R-Bioinformatics_Cookbook. If you want to use the code examples as they are written, then you will need to make sure that this data is in a sub-directory of whatever your working directory is.

Here are the R packages that you'll need. Most of these will install with install.packages(); others are a little more complicated:

  • Bioconductor
    • AllelicImbalance
    • bumphunter
    • csaw
    • DESeq
    • edgeR
    • IRanges
    • Rsamtools
    • rtracklayer
    • sva
    • SummarizedExperiment
    • VariantAnnotation
  • dplyr
  • extRemes
  • forcats
  • magrittr
  • powsimR
  • readr

Bioconductor is huge and has its own installation manager. You can install it with the following code:

if (!requireNamespace("BiocManager"))
    install.packages("BiocManager")
BiocManager::install()
Further information is available at https://www.bioconductor.org/install/.

Normally, in R, a user will load a library and use the functions directly by name. This is great in interactive sessions but it can cause confusion when many packages are loaded. To clarify which package and function I'm using at a given moment, I will occasionally use the packageName::functionName() convention.

Sometimes, in the middle of a recipe, I'll interrupt the code so you can see some intermediate output or the structure of an object it's important to understand. Whenever that happens, you'll see a code block where each line begins with ## (double hash symbols). Consider the following command:

letters[1:5]

This will give us output as follows:

## a b c d e

Note that the output lines are prefixed with ##.

Estimating differential expression with edgeR

edgeR is a widely used and powerful package that implements negative binomial models suitable for sparse count data such as RNAseq data in a general linear model framework, which are powerful for describing and understanding count relationships and exact tests for multi-group experiments. It uses a weighted style normalization called TMM, which is the weighted mean of log ratio between sample and control, after removal of genes with high counts and outlying log ratios. The TMM value should be close to one, but can be used as a correction factor to be applied to overall library sizes

In this recipe, we'll look at some options from preparing read counts for annotated regions in some object to identifying the differentially expressed features in a genome. Usually, there is an upstream step requiring us to take high-throughput sequence reads, align them to a reference and produce files describing those alignments, such as .bam files. With those files prepared, we'd fire up R and start to analyze. So that we can concentrate on the differential expression analysis part of the process, we'll use a prepared dataset for which all of the data is ready. Chapter 8, Working with Databases and Remote Data Sources, shows you how to go from raw data to this stage if you're looking for how to do that step.

As there are many different tools and methods for getting those alignments of reads, we will look at starting the process with two common input object types. We'll use a count table, like that we would have if we were loading from a text file and we'll use an ExpressionSet (eset) object, which is an object type common in Bioconductor.

Our prepared dataset will be the modencodefly data from the NHGRI encyclopedia of DNA elements project for the model organism, Drosophila melanogaster. You can read about this project at www.modencode.org. The dataset contains 147 different samples for D. melanogaster, a fruit fly with an approximately 110 Mbp genome, annotated with about 15,000 gene features.

Getting ready

The data is provided as both a count matrix and an ExpressionSet object and you can see the Appendix at the end of this book for further information on these object types. The data is in this book's code and data repository at https://github.com/PacktPublishing/R_Bioinformatics_Cookbook under datasets/ch1/modencodefly_eset.RData, datasets/ch1/modencodefly_count_table.txt, and datasets/ch1/modencodelfy_phenodata.txt . We'll also use the edgeR (from Bioconductor), readr, and magrittr libraries.

How to do it...

We will see two ways of estimating differential expressions with edgeR.

Using edgeR from a count table

For estimating differential expressions with edgeR from a count table (for example, in a text file), we will use the following steps:

  1. Load the count data:
count_dataframe <- readr::read_tsv(file.path(getwd(), "datasets", "ch1", "modencodefly_count_table.txt" ))
genes <- count_dataframe[['gene']]
count_dataframe[['gene']] <- NULL
count_matrix <- as.matrix(count_dataframe)
rownames(count_matrix) <- genes
pheno_data <- readr::read_table2(file.path(getwd(), "datasets", "ch1", "modencodefly_phenodata.txt"))
  1. Specify experiments of interest:
experiments_of_interest <- c("L1Larvae", "L2Larvae")
columns_of_interest <- which( pheno_data[['stage']] %in% experiments_of_interest )
  1. Form the grouping factor:
library(magrittr)
grouping <- pheno_data[['stage']][columns_of_interest] %>%
forcats::as_factor()
  1. Form the subset of count data:
counts_of_interest <-  count_matrix[,columns_of_interest]
  1. Create the DGE object:
library(edgeR)
count_dge <- edgeR::DGEList(counts = counts_of_interest, group = grouping)
  1. Perform differential expression analysis:
design <- model.matrix(~ grouping)
eset_dge <- edgeR::estimateDisp(eset_dge, design)
fit <- edgeR::glmQLFit(eset_dge, design)
result <- edgeR::glmQLFTest(fit, coef=2)
topTags(result)

Using edgeR from an ExpressionSet object

Estimating using edgeR from our prepared eset object can be done using the following steps:

  1. Load the eset data:
load(file.path(getwd(), "datasets/ch1/modencodefly_eset.RData"))
  1. Specify experiments of interest:
experiments_of_interest <- c("L1Larvae", "L2Larvae")
columns_of_interest <- which( phenoData(modencodefly.eset)[['stage']] %in% experiments_of_interest )
  1. Form the grouping factor:
grouping <- droplevels(phenoData(modencodefly.eset)[['stage']][columns_of_interest] )
  1. Form the subset of count data:
counts_of_interest <- exprs(modencodefly.eset)[, columns_of_interest]
  1. Create the DGE object:
eset_dge <- edgeR::DGEList(
counts = counts_of_interest,
group = grouping
)
  1. Perform differential expression analysis:
design <- model.matrix(~ grouping)
eset_dge <- edgeR::estimateDisp(eset_dge, design)

fit <- edgeR::glmQLFit(eset_dge, design)
result <- edgeR::glmQLFTest(fit, coef=2)
topTags(result)

How it works...

We saw two ways of estimating differential expression with edgeR. In the first half of this recipe, we used edgeR starting with our data in a text file.

Using edgeR from a count table

In step 1, we use the read_tsv() function in the readr package to load the tab delimited text file of counts into a dataframe called count_dataframe. Then, from that, we extract the 'gene' column to a new variable, genes, and erase it from count_dataframe, by assigning NULL. This is all done so we can easily convert into the count_matrix matrix with the base as.matrix() function and add the gene information back as rownames. Finally, we load the phenotype data we'll need from file using the readr read_table2() function.

Step 2 is concerned with working out which columns in count_matrix we want to use. We define a variable, experiments_of_interest, which holds the column names we want and then use the %in% operator and which() functions to create a binary vector that matches the number of columns. If, say, the third column of the columns_of_interest vector is TRUE it indicates the name was in the experiments_of interest variable.

Step 3 begins with loading the magrittr package to get the %>% operator, which will allow piping. We then use R indexing with the binary columns_of_interest factor to select the names of columns we want and send it to the forcats as_factor() function to get a factor object for our grouping variable. Sample grouping information is basically a factor that tells us which samples are replications of the same thing and it's important for the experimental design description. We need to create a grouping vector, each index of which refers to a column in the counts table. So, in the following example, the first three columns in the data would be replicates of one sample, the second three columns in the counts table would be replicates of a different replicate, and so on. We can use any symbols in the grouping vector to represent the groups. The more complicated the grouping vector, the more complicated the experiment design can be. In the recipe here, we'll use a simple test/control design:

numeric_groups <- c(1,1,1,2,2,2)
letter_groups <- c("A","A","A", "B","B","B")

A simple vector like this will do, but you can also use a factor object. The factor is R's categorical data type and is implemented as a vector of integers that have associated name labels, called levels. When a factor is displayed, the name labels are taken instead of the integers. The factor object has a memory of sorts, and even when a subset of levels is used, all of the levels that could have been used are retained so that when, for example, the levels are used as categories, empty levels can still be displayed.

In Step 4, we use indexing to extract the columns of data we want to actually analyze.

By Step 5, our preparatory work is done and we can build the DGEList object we need to do differential analysis. To start, we load the edgeR library and use the DGEList() function on counts_of_interest and our grouping object.

In Step 6, with DGEList, we can go through the edgeR process. First, we create the experimental design descriptor design object with the base model.matrix() function. A model design is required to tell the functions how to compare samples; this is a common thing in R and so has a base function. We use the grouping variable we created. We must estimate the dispersions of each gene with the estimateDisp() function, then we can use that measure of variability in tests. Finally, a generalized linear model is fit and the quasi-likelihood F-test is applied with the two uses of glmQLFTest(), first with the dispersal estimates, eset_dge, then with the resulting fit object.

We can use the topTags() function to see the details of differentially expressed genes. We get the following output:

 ## Coefficient: groupingL2Larvae
## logFC logCPM F PValue FDR
## FBgn0027527 6.318665 11.14876 42854.72 1.132951e-41 1.684584e-37
## [ reached 'max' / getOption("max.print") -- omitted 9 rows ]

The columns show the gene name, the logFC value of the gene, the F value, the P value and the False Detection Rate (FDR). Usually, the column we want to make statistical conclusions from is FDR.

Using edgeR from an ExpressionSet object

In Step 1, we are looking at using edgeR from our prepared eset object. We first load that in, using the base R function as it is stored in a standard Rdata format file.

In Step 2, we prepare the vector of experiments of interest. This works as in step 2, except that we don't need to look at the pheno_data object we created from a file; instead, we can use the eset function, phenoData(), to extract the phenotype data straight from the eset object (note that this is one of the major differences between eset and the count matrix—see this book's Appendix for further information).

In Step 3, we create the grouping factor. Again, this can be done by using the phenoData() extraction function, but, as it returns a factor, we need to drop the levels that aren't selected using the droplevels() function (see the How it works... section in the Estimating differential expression with edgeR recipe, step 3 from the previous method, for a brief discussion of factor objects).

In step 4, we extract the data for the columns we are interested in into a standard matrix object. Again, we have a dedicated function, exprs(), for extracting the expression values from eset, and we can subset that using column indexing with column_names.

In Step 5, we use the DGEList() constructor function to build the data structure for edgeR and in step 6, carry out the analysis. This step is identical to Step 6 of the first method.

Estimating differential expression with DESeq2

The DESeq2 package is a method for differential analysis of count data, so it is ideal for RNAseq (and other count-style data such as ChIPSeq). It uses dispersion estimates and relative expression changes to strengthen estimates and modeling with an emphasis on improving gene ranking in results tables. DESeq2 differs from edgeR in that it uses a geometric style normalization in which the per lane scaling factor is computed as the median of the ratios of the gene count over its geometric mean ratio, whereas edgeR uses the weighted one. The two normalization strategies are not mutually exclusive and both make different assumptions about the data. As with any RNAseq or large scale experiment, there is never an "out-of-the-box" best answer. You'll end up testing these methods and maybe others and closely examining results from control genes and cross-validation experiments to see which performs best. The performance will depend greatly on the particular dataset at hand, so the flexible approach we learn here will give you a good idea of how to test the different solutions for yourself.

The process we'll look at in this recipe is somewhat similar to that for edgeR in the preceding Recipe 1. We can use both ExpressionSets and count tables as input to DESeq2 and, when we've prepared them, we have a different set of functions to use to get our data into a DESeqDataSet, not the DGEList as with edgeR.

Getting ready

As in Recipe 1, the data is provided as both a count matrix and an ExpressionSet object and you can see the Appendix at the end of this book for further information on these object types. The data is in this book's code and data repository at https://github.com/PacktPublishing/R_Bioinformatics_Cookbook under datasets/ch1/modencodefly_eset.RData , datasets/ch1/modencodefly_count_table.txt, and datasets/ch1/modencodelfy_phenodata.txt. Again, we'll use readr and magrittr and, from Bioconductor, SummarizedExperiement, and DESeq2.

How to do it...

Estimating differential expressions with DESeq2 can be done in two ways, as shown in the following section.

Using DESeq2 from a count matrix

Estimating differential expressions with DESeq2 from a count table (for example, in a text file), we will use the following steps:

  1. Load count data:
count_dataframe <- readr::read_tsv(file.path(getwd(), "datasets", "ch1", "modencodefly_count_table.txt" ))
genes <- count_dataframe[['gene']]
count_dataframe[['gene']] <- NULL
count_matrix <- as.matrix(count_dataframe)
rownames(count_matrix) <- genes
pheno_data <- readr::read_table2(file.path(getwd(), "datasets", "ch1", "modencodefly_phenodata.txt"))
  1. Specify experiments of interest:
experiments_of_interest <- c("L1Larvae", "L2Larvae")
columns_of_interest <- which( pheno_data[['stage']] %in% experiments_of_interest )
  1. Form the grouping factor:
library(magrittr)
grouping <- pheno_data[['stage']][columns_of_interest] %>%
forcats::as_factor()
  1. Form the subset of count data:
counts_of_interest <- count_matrix[,columns_of_interest]
  1. Build the DESeq object:
library("DESeq2")
dds <- DESeqDataSetFromMatrix(countData = counts_of_interest,
colData = grouping,
design = ~ stage)
  1. Carry out the analysis:
dds <- DESeq(dds)
  1. Extract the results:
res <- results(dds, contrast=c("stage","L2Larvae","L1Larvae"))

Using DESeq2 from an ExpressionSet object

To estimate differential expressions with DESeq2 from an ExpressionSet object, we will use the following steps:

  1. Load the eset data and convert into DESeqDataSet():
library(SummarizedExperiment)
load(file.path(getwd(), "datasets/ch1/modencodefly_eset.RData"))
summ_exp <- makeSummarizedExperimentFromExpressionSet(modencodefly.eset)
ddsSE <- DESeqDataSet(summ_exp, design= ~ stage)
  1. Carry out analysis and extract results:
ddsSE <- DESeq(ddsSE)
resSE <- results(ddsSE, contrast=c("stage","L2Larvae","L1Larvae"))

How it works...

In the first section of this recipe, we used DESeq1 starting with our data in a text file; as you'll notice steps 1 to 4 are identical to those in the previous section.

Using DESeq2 from a count matrix

In Step 1, we use the readr package's read_tsv() function to load the tab-delimited text file of counts into a dataframe called count_dataframe. Then, from that, we extract the 'gene' column to a new variable, genes, and erase it from count_dataframe, by assigning NULL. This is all done so we can easily convert into the count_matrix matrix with the base as.matrix() function and add the gene information back as rownames. Finally, we load the phenotype data we'll need from the file using the readr read_table2() function.

Step 2 is concerned with working out which columns in count_matrix we want to use. We define a variable, experiments_of_interest, that holds the column names we want and then use the %in% operator and which() functions to create a binary vector that matches the number of columns. If, say the third column of the columns_of_interest vector is 'TRUE', it indicates the name was in the experiments_of interest variable.

Step 3 begins with loading the magrittr package to get the %>% operator, which will allow piping. We then use R indexing with the binary columns_of_interest factor to select the names of columns we want and send it to the forcats as_factor() function to get a factor object for our grouping variable. Sample grouping information is basically a factor that tells us which samples are replications of the same thing and it's important for the experimental design description. You can see an expanded description of these grouping/factor objects in step 3 in Recipe 1.

In Step 4, we use indexing to extract the columns of data we want to actually analyze.

By Step 5, we are into the actual analysis section. First, we convert our matrix of counts into a DESeqDataSet object; this can be done with the conversion function, DESeqDataSetFromMatrix(), passing in the counts, the groups, and a design. The design is in the form of an R formula, hence, the ~ stage annotation.

In Step 6, we perform the actual analysis using the DESeq() function on the dds DESeqDataSet object and in Step 7, we get the results into the res variable using the results() function. The output has the following six columns:

baseMean log2FoldChange lfcSE stat pvalue padj

This shows the mean counts, the log2 fold change between samples for a gene, the standard error of the log2 fold change, the Wald statistic, and the raw and adjusted P value. The padj column for adjusted P values is the one most commonly used for concluding about significance.

Using DESeq2 from an ExpressionSet object

Steps 1 and 2 show how to do the same procedure starting from the eset object. It only takes two short steps because DESeq2 is set up to work a lot more nicely with Bioconductor objects than edgeR is. In step 8, we load the eset data with the load() function. Then we use the makeSummarizedExperimentFromExpressionSet() function from the SummarizedExperiment Bioconductor package to convert eset into SummarizedExperiment, which can be used directly in the DESeq() function in step 9. This step works exactly as steps 6 and 7.

Power analysis with powsimR

An important preliminary to any experiment is assessing the power of the experimental design to optimize statistical sensitivity. In essence, a power analysis can tell us the number of replicates required to determine an effect size of a given magnitude for a given amount of experimental variability.

We'll use the powsimR package, which is not part of Bioconductor, to perform two types of power analysis. Both of these will be with a small real dataset, but first, we'll do it with two treatments—a test and control—then, we'll do it with just one. With each, we'll estimate the number of replicates we need to spot differences in gene expression of a particular magnitude—if they're present. powsimR takes a simulation-based approach, effectively generating many datasets and evaluating the detection power in each to create a distribution of detection power. The first step, then, is to estimate some parameters for these simulations—for this, we'll need some sample or preliminary data. After that, we can run simulations and assess power.

Getting ready

The dataset for this recipe will be a test or control RNAseq experiment from Arabidopsis with three replicates each. These are available as a prepared count matrix in datasets/ch1/arabidopsis.RDS in this book's data repository. In this section, we'll use a set of counts in a simple test or control experiment from Arabidopsis thaliana. The matrix has six columns (three mock treatments and three hrcc treatments) and 26,222 rows, each a gene feature. We'll need the dplyr, extRemes, and powsimR packages for this code.

Our package of interest, powsimR, isn't on CRAN; it's hosted as a source on GitHub at https://github.com/bvieth/powsimR. You'll need to use devtools to install it, which can be done using the following code:

install.packages("devtools")
devtools::install_github("bvieth/powsimR")

If you do this, there is a chance that this package will still fail to install. It has a lot of dependencies and you might need to install those manually; there is further information on the package GitHub repository and you should check that for the latest information. At the time of writing, you'll need to do the following two big steps. First, create the ipak function outlined here, then run the three different package installation steps with the ipak function:

ipak <- function(pkg, repository = c("CRAN", "Bioconductor", "github")) {
    new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
    # new.pkg <- pkg
    if (length(new.pkg)) {
        if (repository == "CRAN") {
            install.packages(new.pkg, dependencies = TRUE)
        }
        if (repository == "Bioconductor") {
            if (strsplit(version[["version.string"]], " ")[[1]][3] > "3.5.0") {
                if (!requireNamespace("BiocManager")) {
                  install.packages("BiocManager")
                }
                BiocManager::install(new.pkg, dependencies = TRUE, ask = FALSE)
            }
            if (strsplit(version[["version.string"]], " ")[[1]][3] < "3.5.0") {
                source("https://bioconductor.org/biocLite.R")
                biocLite(new.pkg, dependencies = TRUE, ask = FALSE)
            }
        }
        if (repository == "github") {
            devtools::install_github(new.pkg, build_vignettes = FALSE, force = FALSE, 
                dependencies = TRUE)
        }
    }
}

# CRAN PACKAGES cranpackages <- c("broom", "cobs", "cowplot", "data.table", "devtools", "doParallel", "dplyr", "drc", "DrImpute", "fastICA", "fitdistrplus", "foreach", "gamlss.dist", "ggExtra", "ggplot2", "ggthemes", "grDevices", "glmnet", "grid", "gtools", "Hmisc", "kernlab", "MASS", "MBESS", "matrixStats", "mclust", "methods", "minpack.lm", "moments", "msir", "NBPSeq", "nonnest2", "parallel", "penalized", "plyr", "pscl", "reshape2", "Rmagic", "rsvd", "Rtsne", "scales", "Seurat", "snow", "stats", "tibble", "tidyr", "VGAM", "ZIM")
ipak(cranpackages, repository = "CRAN") # BIOCONDUCTOR biocpackages <- c("AnnotationDbi", "bayNorm", "baySeq", "Biobase", "BiocGenerics", "BiocParallel", "DEDS", "DESeq2", "EBSeq", "edgeR", "IHW", "iCOBRA", "limma", "Linnorm", "MAST", "monocle", "NOISeq", "qvalue", "ROTS", "RUVSeq", "S4Vectors", "scater", "scDD", "scde", "scone", "scran", "SCnorm", "SingleCellExperiment", "SummarizedExperiment", "zinbwave") ipak(biocpackages, repository = "Bioconductor") # GITHUB githubpackages <- c("nghiavtr/BPSC", "cz-ye/DECENT", "mohuangx/SAVER", "statOmics/zingeR") ipak(githubpackages, repository = "github")

When this is done, you should be able to install the package we're after with this code:

devtools::install_github("bvieth/powsimR", build_vignettes = TRUE, dependencies = FALSE)
library("powsimR")
At the moment, for this to work, you also need to manually load dplyr.

How to do it...

We will do the power analysis using the following steps:

  1. Estimate simulation parameter values:
arab_data <- readRDS(file.path(getwd(), "datasets", "ch1", "arabidopsis.RDS" ))
means_mock <- rowMeans(arab_data[, c("mock1", "mock2", "mock3")])
means_hrcc <- rowMeans(arab_data[, c("hrcc1", "hrcc2", "hrcc3")])
log2fc <- log2(means_hrcc / means_mock)
prop_de <- sum(abs(log2fc) > 2) / length(log2fc)
  1. Examine the distribution of the log2 fold change ratios:
finite_log2fc <-log2fc[is.finite(log2fc)]
plot(density(finite_log2fc))
extRemes::qqnorm(finite_log2fc )
  1. Set up parameter values for the simulation run:
library(powsimR)
library(dplyr)

params <- estimateParam(
countData = arab_data,
Distribution = "NB",
RNAseq = "bulk",
normalization = "TMM" # edgeR method, can be others
)

de_opts <- DESetup(ngenes=1000,
nsims=25,
p.DE = prop_de,
pLFC= finite_log2fc,
sim.seed = 58673
)

sim_opts <- SimSetup(
desetup = de_opts,
params = params
)

num_replicates <- c(2, 3, 5, 8, 12,15)
  1. Run the simulation:
 simDE <- simulateDE(n1 = num_replicates,
n2 = num_replicates,
sim.settings = sim_opts,
DEmethod = "edgeR-LRT",
normalization = "TMM",
verbose = FALSE)
  1. Run the evaluation of the simulation:
 evalDE <- evaluateDE(simRes = simDE,
alpha.type = 'adjusted',
MTC = 'BH',
alpha.nominal = 0.1,
stratify.by = 'mean',
filter.by = 'none',
strata.filtered = 1,
target.by = 'lfc',
delta = 0)
  1. Plot the evaluation:
 plotEvalDE(evalRes = evalDE,
rate='marginal',
quick=FALSE, annot=TRUE)

How it works...

Power analysis in powsimR requires us to do some pre-analysis so that we have estimates for some important parameters. To perform a simulation-based power analysis, we need to estimate the distribution of log fold changes between treatments and the proportion of features that are differentially expressed.

In step 1, we'll get the mean counts for each feature in the two treatments. After loading the expression data using the readRDS() function, we use the rowMeans() function on certain columns to get the mean expression counts of each gene in both the mock and hrcc1 treatments. We can then get the log2 ratio of those (by simply dividing the two vectors and, in the last line, use standard arithmetical operators to work out those that have a log2 fold change greater than 2). Inspecting the final prop_de variable gives the following output:

prop_de
## [1] 0.2001754

So, a proportion of about 0.2 of the features have counts changing by log2 twofold.

Step 2 looks at the distribution of the gene expression ratios. We first remove the non-finite ratios from the log2fc variable. We must do this because, when calculating ratios, we generate Inf values in R; this occurs when the denominator (the mock sample) has zero mean counts. We can remove them using indexing on the vector with the binary vector that comes from is.finite() function. With the Inf values removed, we can plot. First, we do a normal density plot using the density() function, which shows the distribution of ratios. Then, we use the qqnorm() function in the extRemes package, which plots the data against data sampled from an idealized normal distribution with the same mean. A strong, linear correlation indicates a normal distribution in the original data. We can see the output in the following screenshot:

They look pretty log-normally distributed, so we can assume a log-normal distribution.

The longest step here, step 3, is actually only four lines. We are basically setting up the parameters for the simulation, which requires us to specify a lot of values. The first set, params, which we create with the estimateParam() function needs the data source (countData), the distribution to use (we set Distribution = "NB", which selects the negative binomial); the type of RNAseq experiment—ours is a bulk RNAseq experiment (RNAseq = "bulk"), and normalization strategy—we use the edgeR style TMM (normalization = "TMM"). The second set, desetup, is created with the DESetup() function; in this, we choose the parameters relating to the number of genes for which to simulate differential expression. We set up 1,000 total gene simulations (ngenes) and 25 simulation runs (nsims). We set the proportion to be differentially expressed to that estimated in step 1 in prop_de. We use the vector of fold changes, finite_log2fc, as input for the pLFC parameter. Setting sim.seed is not necessary but will ensure reproducibility between runs. The third line uses the SimSetup() function to combine params and desetup into a single object, sim_opts. Finally, we create a num_replicates vector specifying the number of biological replicates (RNA samples) to simulate.

Step 4 is relatively straightforward: we run the differential expression simulation using the sim_opts parameters created in the previous steps, choosing "edgeR-LRT" as the differential expression method and "TMM" as the normalization. The simulation data is stored in the simDE variable.

In Step 5, we create an evaluation of the simulation—this analyzes and extracts various statistics. We pass the simDE simulation data to the evaluateDE() function along with values for things pertaining to grouping, filtering, and significance.

Finally, in Step 6, we can plot the evalDE object from Step 5 and see the results of the simulation. We get the following plot in which we can see the different powers at different replicate numbers. Note the x-axis indicates the number of replicate RNA samples used, and the metrics include FDR, False Negative/Positive Rate (FNR/FPR), and TNR/TPR (True Negative/Positive Rate):

There's more...

When we have only one sample (or maybe even just one replicate), we have a hard time estimating the log2 fold change distribution and the number of differentially expressed genes. In place of estimates, we can use a callback function to generate numbers as needed. The body of the function just needs to return numbers from a specified distribution with parameters you decide. Here, we'll build a function that returns numbers with a normal distribution of mean 0 and standard deviation 2. This reflects that we think the log fold change distribution is normal with these parameters. When we've built the function, it gets used in the DESetup() function in place of the vector of log2 fold changes. For the proportion of genes differentially expressed, we just have to guess or take an estimate from something we already know about the experimental system:

log2fc_func <- function(x){ rnorm(x, 0, 2)} 
prop_de = 0.1
de_opts <- DESetup(ngenes=1000,
nsims=25,
p.DE = prop_de,
pLFC= log2fc_func,
sim.seed = 58673
)

Finding unannotated transcribed regions

A common challenge is to find and count reads that have aligned outside of annotated regions. In an RNAseq experiment, these reads can represent non-annotated genes and novel transcripts. Essentially, we have some genes we know about and can see that they are transcribed as they have aligned read coverage, but other transcribed regions do not fall in any annotations and we want to know the locations of the alignments of the reads representing them. In this recipe, we'll look at a deceptively straightforward technique for finding such regions.

Getting ready

Our dataset will be a synthetic one that has a small 6,000 bp genome region and two gene features with reads and a third unannotated region with aligning reads, as shown in the following screenshot:

We'll need the Bioconductor csaw, IRanges, SummarizedExperiment, and rtracklayer libraries and some functions from other packages that are part of base Bioconductor. The data is in this book's data repository under datasets/ch1/windows.bam and datasets/ch1/genes.gff

How to do it...

Power analysis with powsimR can be done in the following steps:

  1. Set up a loading function:
get_annotated_regions_from_gff <- function(file_name) { 
gff <- rtracklayer::import.gff(file_name)
as(gff, "GRanges")
}
  1. Get counts in windows across the whole genome:
whole_genome <- csaw::windowCounts( 
    file.path(getwd(), "datasets", "ch1", "windows.bam"),
    bin = TRUE,
    filter = 0,
    width = 500,
    param = csaw::readParam(
        minq = 20,
        dedup = TRUE,
        pe = "both"
    )
)
colnames(whole_genome) <- c("small_data")

annotated_regions <- get_annotated_regions_from_gff(file.path(getwd(), "datasets", "ch1", "genes.gff"))
  1. Find overlaps between annotations and our windows, and subset the windows:
library(IRanges)
library(SummarizedExperiment)
windows_in_genes <-IRanges::overlapsAny( SummarizedExperiment::rowRanges(whole_genome), annotated_regions )
  1. Subset the windows into those in annotated and non-annotated regions:
annotated_window_counts <- whole_genome[windows_in_genes,] 
non_annotated_window_counts <- whole_genome[ ! windows_in_genes,]
  1. Get the data out to a count matrix:
assay(non_annotated_window_counts)

How it works...

In step 1, we create a function that will load gene region information in a GFF file (see this book's Appendix for a description of GFF) and convert it into a Bioconductor GRanges object using the rtracklayer package. This recipe works because GRanges objects can be subset, just like a regular R matrix or dataframe. They're an object that is "matrix-like" in that respect and although GRanges is much more complicated than a matrix, it behaves much the same. This allows for some easy manipulations and extractions. We use GRanges extensively throughout this recipe, along with the related class, RangedSummarizedExperiment.

In step 2, we use the csaw windowCounts() function to get counts across the whole genome in 500 bp windows. The width parameter defines the window size, and the param parameter determines what constitutes a passing read; here, we set minimum read quality (minq) to a PHRED score of 20, remove PCR duplicates (dedup = TRUE), and require that both of the pairs of a read are aligned (pe="both"). The returned whole_genome object is RangedSummarizedExperiment. We set the name of the single data column in whole_genome to small_data. Finally, we use the custom function, get_annotated_regions_from_gff(), to make a GRanges object, annotated_regions, of the genes represented in our GFF file.

With Step 3, we use the IRanges overlapsAny() function to check whether the window locations overlap at all with the gene regions. This function requires GRanges objects, so we extract that from the whole_genome variable using the SummarizedExperiment rowRanges() function and pass that along with the existing GRanges object's annotated_regions to overlapsAny(). This returns a binary vector that we can use to do subsetting.

In step 4, we simply use the binary vector, windows_in_genes, to subset the whole_genome object, thereby extracting the annotated windows (into annotated_window_counts) as a GRanges object. Then, we can get the non-annotated windows with the same code but by logically inverting the binary vector using the ! operator. This gives us non_annotated_window_counts.

Finally, in step 5, we can extract the actual counts from the GRanges object using the assay() function.

There's more...

We may need to get annotated regions from other file formats than GFF. rtracklayer supports various formats—here's a function for working with BED files:

get_annotated_regions_from_bed <- function(file_name){ 
bed <- rtracklayer::import.bed(file_name)
as(bed, "GRanges")
}

Finding regions showing high expression ab initio with bumphunter

Finding regions of read alignments that all come from the same, potentially unannotated, genomic feature is a common task. The aim here is to group read alignments together in such a way that we will be able to mark regions that have significant coverage and then go on to compare samples for differences in expression levels.

Getting ready...

We'll use the same windows dataset that had one experiment with three peaks into the function that we used in Recipe 4—so we know we're looking for three bumps. The data is in this book's data repository under datasets/ch1/windows.bam. We'll need the Rsamtools and bumphunter libraries.

How to do it...

  1. Load data and get per-position coverage:
library(Rsamtools) 
library(bumphunter)
pileup_df <- Rsamtools::pileup(file.path(getwd(), "datasets", "ch1", "windows.bam"))
  1. Find preliminary clusters:
clusters <- bumphunter::clusterMaker(pileup_df$seqnames, pileup_df$pos, maxGap = 100) 
  1. Find the bumps with a minimum cutoff:
bumphunter::regionFinder(pileup_df$count, pileup_df$seqnames, pileup_df$pos, clusters, cutoff=1)

How it works...

In step 1, we use Rsamtools pileup() function with default settings to get a per-base coverage dataframe. Each row represents a single nucleotide in the reference and the count column gives the depth of coverage at that point. The result is stored in the pileup_df dataframe.

In step 2, we use the bumphunter clusterMaker() function on pileup_df, which simply groups reads within a certain distance of each other into clusters. We give it the sequence names, positions, and a maximum distance parameter (maxGap). The function returns a vector of cluster numbers of equal length to the dataframe, indicating the cluster membership of each row in the dataframe. If we tabulate with table, we can see the cluster sizes (number of rows) in each cluster:

table(clusters)
## clusters ## 1 2 3 ## 1486 1552 1520

In step 3, we refine our approach; we use regionFinder(), which applies a read depth cutoff to ensure a minimum read depth for the clusters. We pass it similar data as in step 2, adding the cluster membership vector clusters and a minimum read cutoff—here, we set to 1 for use with this very small dataset. The result of step 3 is the regions that are clustered together, but in a useful table:

##    chr start  end     value  area cluster indexStart indexEnd    L
## 3 Chr1  4503 5500 10.401974 15811       3       3039     4558 1520
## 1 Chr1   502 1500  9.985868 14839       1          1     1486 1486
## 2 Chr1  2501 3500  8.657216 13436       2       1487     3038 1552

In these region predictions, we can clearly see the three regions containing reads that are in that data, give or take a nucleotide or two.

There's more...

If you have multiple experiments to analyze, try the bumphunter() function. This will operate over multiple data columns in a matrix and perform linear modeling to assess uncertainty about the position and existence from the replicates; it is very similar to regionFinder() in operation.

Differential peak analysis

When you've discovered unannotated transcripts you may want to see whether they are differentially expressed between experiments. We've already looked at how we might do that with edgeR and DESeq, but one problem is going from an object such as a RangedSummarizedExperiment, comprised of the data and a GRanges object that describes the peak regions, to the internal DESeq object. In this recipe, we'll look at how we can summarise the data in those objects and get them into the correct format.

Getting ready

For this recipe, you'll need the RangedSummarizedExperiment version of the Arabidopsis thaliana RNAseq in datasets/ch1/arabidopsis_rse.RDS in this book's repository. We'll use the DESeq and SummarizedExperiment Bioconductor packages we used earlier too.

How to do it...

  1. Load data and set up a function that creates region tags:
library(SummarizedExperiment) 
arab_rse <- readRDS(file.path(getwd(), "datasets", "ch1", "arabidopsis_rse.RDS") )

make_tag <- function(grange_obj){
paste0(
grange_obj@seqnames,
":",
grange_obj@ranges@start,
"-",
(grange_obj@ranges@start + grange_obj@ranges@width)
)
}
  1. Extract data and annotate rows:
counts <- assay(arab_rse)

if ( ! is.null(names(rowRanges(arab_rse))) ){
  rownames(counts) <- names(rowRanges(arab_rse))
} else {
  rownames(counts) <- make_tag(rowRanges(arab_rse))
}

How it works...

Step 1 starts by loading in our pre-prepared RangedSummarized experiment; note that the names slot of the GRanges object in there is not populated. We next create a custom function, make_tag(), which works by pasting together seqnames, starts and the computed end (start + width) from a passed GRanges object. Note the @ sign syntax: this is used because GRange is an S4 object and the slots are accessed with @ rather than the more familiar $.

In step 2, the code pulls out the actual data from RangedSummarizedExperiment using the assay() function. The matrix returned has no row names, which is unuseful, so we use the if clause to check the names slot—we use that as row names if it's available; if it, isn't we make a row name tag using the position information in the GRanges object in the make_tag() function we have created. This will give the following outputa count matrix that has the location tag as the row name that can be used in DESeq and edgeR as described in Recipes 1 and 2 in this chapter:

head(counts)
## mock1 mock2 mock3 hrcc1 hrcc2 hrcc3 ## Chr1:3631-5900 35 77 40 46 64 60 ## Chr1:5928-8738 43 45 32 43 39 49 ## Chr1:11649-13715 16 24 26 27 35 20 ## Chr1:23146-31228 72 43 64 66 25 90 ## Chr1:31170-33154 49 78 90 67 45 60 ## Chr1:33379-37872 0 15 2 0 21 8

Estimating batch effects using SVA

High throughput data such as RNAseq is often complicated by technical errors that are not explicitly modeled in the experimental design and can confound the detection of differential expression. Differences in counts from samples run on different days or different locations or on different machines are an example of a technical error that is very often present and which we should try to model in our experimental design. An approach to this is to build a surrogate variable into our experimental design that explains the batch effect and take it into account in the modeling and differential expression analysis stages. We'll use the SVA package to do this.

Getting ready

We'll need the SVA package from Bioconductor and our Arabidopsis count data in datasets/ch1/arabidopsis.RDS.

How to do it...

For estimating batch effects using SVA, perform the following steps:

  1. Load the libraries and data:
library(sva)
arab <- readRDS(file.path(getwd(), "datasets", "ch1", "arabidopsis.RDS"))
  1. Filter out rows with too few counts in some experiments:
keep <- apply(arab, 1, function(x) { length(x[x>3])>=2 } )
arab_filtered <- arab[keep,]
  1. Create the initial design:
groups <- as.factor(rep(c("mock", "hrcc"), each=3))
  1. Set up the test and null models and run SVA:
test_model <- model.matrix(~groups)
null_model <- test_model[,1]
svar <- svaseq(arab_filtered, test_model, null_model, n.sv=1)
  1. Extract the surrogate variables to a new design for downstream use:
design <- cbind(test_model, svar$sv)

How it works...

In step 1, the script begins by loading in a count matrix of the Arabidopsis RNAseq data, which you will recall is a simple experiment with three replicates of mock and three of hrcc treatment.

In step 2, we create a vector of row indices that we wish to retain, which we do by testing whether the row has at least two columns with a count of over 3this is done by using apply() and an anonymous function over the rows of the count matrix.

With step 3, we create a groups factor describing the experiment sample types.

Step 4 is the one that does the work; we use the groups factor in model.matrix() to create the model design we want to use. We also need a null model, which, in this experimental design, is equivalent to the first column, so we extract that from the test_model design object. We can then use the key svaseq() function to estimate the surrogate variable to add to our design. We add in test_model and null_model and select the number of surrogate variables to use with n.sv, which should be one for a simple design like this.

The final bit, step 5, is to add the surrogate variable to the design model, which we do by binding test_model and the sv column of svar (svsar$sv) together. The final design object can then be used in packages such as edgeR and DESeq2 as any other and those methods will use the surrogate variable to deal with batch effects.

Finding allele-specific expressions with AllelicImbalance

An allele-specific expression is a situation that occurs when there is a differential abundance of different allelic variants of a transcript. RNAseq can help to provide quantitative estimates of allele-specific expression for genes with transcribed polymorphismsthat is, variants in the transcript that may result in different proteins. In this recipe, we'll take a look at a method for determining which of the variants of a transcript may have preferential expressions in different samples. The reads will come from different .bam files and the variants must already be known. This implies that you have already carried out a read alignment and a variant call step and have per sample .bam and .vcf files. We'll use the AllelicImbalance and VariantAnnotation packages for this recipe.

Getting ready

You'll need AllelicImbalance and VariantAnnotation from Bioconductor. The AllelicImbalance package provides a small but informative dataset of three SNPs on Chromosome 17 of the hg19 build of the human genome. The files have been extracted into this book's data repository in datasets/ch1/allele_expression .

How to do it...

  1. Load libraries and set up an import folder:
library(AllelicImbalance)
library(VariantAnnotation)

region_of_interest <- GRanges(seqnames = c("17"), ranges = IRanges(79478301, 79478361)) 
bam_folder <- file.path(getwd(), "datasets", "ch1", "allele_expression")
  1. Load reads and variants in regions of interest:
reads <- impBamGAL(bam_folder, region_of_interest, verbose = FALSE)

vcf_file <-file.path( getwd(), "datasets", "ch1", "allele_expression","ERP000101.vcf" )
variant_positions <- granges(VariantAnnotation::readVcf(vcf_file), "hg19" )

allele_counts <- getAlleleCounts(reads, variant_positions, verbose=FALSE)
  1. Build the ASEset object:
ase.vcf <- ASEsetFromCountList(rowRanges = variant_positions, allele_counts)

reference_sequence <- file.path(getwd(), "datasets", "ch1", "allele_expression", "hg19.chr17.subset.fa")

ref(ase.vcf) <- refAllele(ase.vcf,fasta=reference_sequence)
alt(ase.vcf) <- inferAltAllele(ase.vcf)
  1. Run the test on all variants:
binom.test(ase.vcf, n="*")

How it works...

In step 1, the script begins by creating the familar GRanges object describing our region of interest and the folder holding the .bam files of reads.

In step 2, the impBamGAL() function loads in reads in the region of interest. The variant information is loaded into variant_positions—another GRanges object and the reads and variants are used to make allele counts with getAlleleCounts().

With this done, in step 3, we can build the ASESet object, ase.vcf (a class that inherits from RangedSummarizedExperiment), using the constructor function, ASEsetFromCountList(); we then use the setter functions, ref() and alt(), to apply the reference and alternative base identities.

Finally, in step 4, we can apply tests. binom.test() carries out binomial per position per sample (.bam file) tests for deviations from equality in counts in reference and alternative alleles. The parameter n tells the test which strand to consider—in this example, we haven't set up per-strand information, so we use "*" to ignore strandedness.

This will give the following output:

##               chr17_79478287 chr17_79478331 chr17_79478334
## ERR009113.bam          0.500   1.000000e+00   1.000000e+00
## ERR009115.bam         0.125   6.103516e-05   3.051758e-05

There's more...

The preceding analysis can be extended to carry out per strand and per phenotype tests if required. The script would need amending to introduce strand information in the ASESet object construction step. Doing so usually requires that the RNAseq experiment and alignment steps were performed with strandedness in mind and the bioinformatics pipeline up to here configured accordingly. Phenotype information can be added in the construction step using the colData parameter and a vector of phenotype or sample types for columns in the ASESet object.

Plotting and presenting RNAseq data

Plotting the RNAseq data en masse or for individual genes or features is an important step in QC and understanding. In this recipe, we'll see how to make gene count plots in samples of interest, how to create an MA plot that plots counts against fold change and allows us to spot expression-related sample bias, and how to create a volcano plot that plots significance against fold change and allows us to spot the most meaningful changes easily.

Getting ready

In this recipe, we'll use the DESeq2 package, the ggplot2 package, magrittr, and dplyr. We'll use the DESeqDataSet object we created for the modencodefly data in Recipe 2a saved version is in the datasets/ch1/modencode_dds.RDS file in this book's data repository.

How to do it...

  1. Load libraries and create a dataframe of RNAseq results:
library(DESeq2)
library(magrittr)
library(ggplot2)

dds <- readRDS("~/Desktop/r_book/datasets/ch1/modencode_dds.RDS")
  1. Create a boxplot of counts for a single gene, conditioned on "stage":
plotCounts(dds, gene="FBgn0000014", intgroup = "stage", returnData = TRUE) %>%
  ggplot() + aes(stage, count) + geom_boxplot(aes(fill=stage)) + scale_y_log10() + theme_bw()
  1. Create an MA plot with coloring conditioned on significance:
result_df <- results(dds, contrast=c("stage","L2Larvae","L1Larvae"), tidy= TRUE) %>%
  dplyr::mutate(is_significant=padj<0.05)

ggplot(result_df) + aes(baseMean, log2FoldChange) + geom_point(aes(colour=is_significant)) + scale_x_log10() + theme_bw()
  1. Create a volcano plot with coloring conditioned on significance:
ggplot(result_df) + aes(log2FoldChange, -1 * log10(pvalue))  + geom_point(aes(colour=is_significant)) + theme_bw()

How it works...

Step 1 is brief and loads the dataset and libraries we'll need.

In Step 2, we take advantage of a couple of useful parameters in the plotCounts() and results() functions from DESeq2. The returnData flag in plotCounts() will optionally return a tidy dataframe of count information for a given gene in a given condition, hence allowing us to send the data through ggplot() to make a boxplot for an individual gene. The magrittr %>% operator allows us to send the return value of plotCounts() straight to the first positional argument of ggplot() without saving in an intermediate variable.

In Step 3, we use the results() function from DESeq2 to get the results dataframe, which we pipe to dplyr mutate() in order to add a new column called is_significant containing TRUE if the value of the padj column is lower than 0.05. We then use the returned result_df dataframe in a ggplot() command to make a scatter plot of baseMean (count) against log2 fold change, with points colored by the is_significant variable, effectively colored by whether the P value is lower than 0.05 or not.

In Step 4, we use the same result_df dataframe to plot log2fold change against the negative log10 of the 'pvalue' to give a 'volcano' plot of the relationship between P and differential expression level:

The preceding three plots are the combined resultant output of these three ggplot() commands.

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Apply modern R packages to handle biological data using real-world examples
  • Represent biological data with advanced visualizations suitable for research and publications
  • Handle real-world problems in bioinformatics such as next-generation sequencing, metagenomics, and automating analyses

Description

Handling biological data effectively requires an in-depth knowledge of machine learning techniques and computational skills, along with an understanding of how to use tools such as edgeR and DESeq. With the R Bioinformatics Cookbook, you’ll explore all this and more, tackling common and not-so-common challenges in the bioinformatics domain using real-world examples. This book will use a recipe-based approach to show you how to perform practical research and analysis in computational biology with R. You will learn how to effectively analyze your data with the latest tools in Bioconductor, ggplot, and tidyverse. The book will guide you through the essential tools in Bioconductor to help you understand and carry out protocols in RNAseq, phylogenetics, genomics, and sequence analysis. As you progress, you will get up to speed with how machine learning techniques can be used in the bioinformatics domain. You will gradually develop key computational skills such as creating reusable workflows in R Markdown and packages for code reuse. By the end of this book, you’ll have gained a solid understanding of the most important and widely used techniques in bioinformatic analysis and the tools you need to work with real biological data.

Who is this book for?

This book is for bioinformaticians, data analysts, researchers, and R developers who want to address intermediate-to-advanced biological and bioinformatics problems by learning through a recipe-based approach. Working knowledge of R programming language and basic knowledge of bioinformatics are prerequisites.

What you will learn

  • Employ Bioconductor to determine differential expressions in RNAseq data
  • Run SAMtools and develop pipelines to find single nucleotide polymorphisms (SNPs) and Indels
  • Use ggplot to create and annotate a range of visualizations
  • Query external databases with Ensembl to find functional genomics information
  • Execute large-scale multiple sequence alignment with DECIPHER to perform comparative genomics
  • Use d3.js and Plotly to create dynamic and interactive web graphics
  • Use k-nearest neighbors, support vector machines and random forests to find groups and classify data
Estimated delivery fee Deliver to United States

Economy delivery 10 - 13 business days

Free $6.95

Premium delivery 6 - 9 business days

$21.95
(Includes tracking information)

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Oct 11, 2019
Length: 316 pages
Edition : 1st
Language : English
ISBN-13 : 9781789950694
Vendor :
RStudio
Languages :
Concepts :
Tools :

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Estimated delivery fee Deliver to United States

Economy delivery 10 - 13 business days

Free $6.95

Premium delivery 6 - 9 business days

$21.95
(Includes tracking information)

Product Details

Publication date : Oct 11, 2019
Length: 316 pages
Edition : 1st
Language : English
ISBN-13 : 9781789950694
Vendor :
RStudio
Languages :
Concepts :
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total $ 193.97
R Bioinformatics Cookbook
$59.99
Bioinformatics with Python Cookbook
$78.99
Python Machine Learning
$54.99
Total $ 193.97 Stars icon
Banner background image

Table of Contents

12 Chapters
Performing Quantitative RNAseq Chevron down icon Chevron up icon
Finding Genetic Variants with HTS Data Chevron down icon Chevron up icon
Searching Genes and Proteins for Domains and Motifs Chevron down icon Chevron up icon
Phylogenetic Analysis and Visualization Chevron down icon Chevron up icon
Metagenomics Chevron down icon Chevron up icon
Proteomics from Spectrum to Annotation Chevron down icon Chevron up icon
Producing Publication and Web-Ready Visualizations Chevron down icon Chevron up icon
Working with Databases and Remote Data Sources Chevron down icon Chevron up icon
Useful Statistical and Machine Learning Methods Chevron down icon Chevron up icon
Programming with Tidyverse and Bioconductor Chevron down icon Chevron up icon
Building Objects and Packages for Code Reuse Chevron down icon Chevron up icon
Other Books You May Enjoy Chevron down icon Chevron up icon

Customer reviews

Rating distribution
Full star icon Full star icon Half star icon Empty star icon Empty star icon 2.7
(3 Ratings)
5 star 33.3%
4 star 0%
3 star 0%
2 star 33.3%
1 star 33.3%
sreechakilam Nov 16, 2020
Full star icon Full star icon Full star icon Full star icon Full star icon 5
like, very useful
Amazon Verified review Amazon
Cliente Amazon Jul 26, 2023
Full star icon Full star icon Empty star icon Empty star icon Empty star icon 2
Even for begginers it is pretty poor. Any tutorial in the internet may be more useful for the topics included in the book. It may be useful to get a general idea, but not to be introduced in practical bioinformatics
Amazon Verified review Amazon
RUser Jan 04, 2021
Full star icon Empty star icon Empty star icon Empty star icon Empty star icon 1
Hallo, dieses Buch leidet unter der Verwendung von R-Libraries, die zumindes zu Anfang des Jahres 2021 nicht funktionieren. Das ist immer der Tod von Cook Books! Hier handelt es sich um die für viele "Rezepte" essentielle Library gmapR, die beispielsweise für das Betriebssystem Windows nicht verfügbar ist (Stand Anfang 2021). Unter den neuesten Versionen von Ubuntu und R hat es aber auch nicht funktioniert.Schade, aber das ist natürlich immer eine Gefahr bei kostenloser Software, die - wie in diesem Falle R mit Bioconductor - irgendwann zu komplex wird, um zuverlässig zu funktionieren.Da nutzt dann auch das schönste Kochbuch nichts, wenn man kein Feuer im Herd hat.Ein Update: Ich gab dem Buch eine zweite Chance. Vergebens, der Code der (Stand Januar 2021) verfügbar war, enthält viele kleine Fehler (fehlerhafte Verweise auf Dateiverzeichnisse), die nur lästig sind, und schwere inhaltliche Fehler. Man kann den Code eventuell selbst korrigieren, dann braucht man aber das Buch nicht mehr... Das Recipe3.R des dritten Kapitels ist eine Zumutung. Wieder wird eine Library verwendet, die noch in der Entwicklung ist, und deren Herunterladen aufgrund vieler Abhängigkeiten reine Glückssache ist. Ich habe es nach mehreren Stunden aufgegeben. Und der Code, der das Herunterladen vereinfachen soll - ja: Code, um das komplexe Herunterladen überhaupt organisieren zu können - den kann man abschreiben. Er ist nicht in den zum Buch verfügbaren Codes enthalten. Viel Spaß dabei.Das Buch hätte eine Lücke füllen können, so füllt es eine Lücke im Altpapier. Der Verlag bedauert...
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is the delivery time and cost of print book? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is custom duty/charge? Chevron down icon Chevron up icon

Customs duty are charges levied on goods when they cross international borders. It is a tax that is imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order? Chevron down icon Chevron up icon

The orders shipped to the countries that are listed under EU27 will not bear custom charges. They are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea:

A custom duty or localized taxes may be applicable on the shipment and would be charged by the recipient country outside of the EU27 which should be paid by the customer and these duties are not included in the shipping charges been charged on the order.

How do I know my custom duty charges? Chevron down icon Chevron up icon

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order? Chevron down icon Chevron up icon

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact customercare@packt.com with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at customercare@packt.com using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except for the cases described in our Return Policy (i.e. Packt Publishing agrees to replace your printed book because it arrives damaged or material defect in book), Packt Publishing will not accept returns.

What is your returns and refunds policy? Chevron down icon Chevron up icon

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on customercare@packt.com with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on customercare@packt.com within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on customercare@packt.com who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund of the problem items.(damaged, defective or incorrect)
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged, with book material defect, contact our Customer Relation Team on customercare@packt.com within 14 days of receipt of the book with appropriate evidence of damage and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner which is on a print-on-demand basis.

What tax is charged? Chevron down icon Chevron up icon

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use? Chevron down icon Chevron up icon

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal
What is the delivery time and cost of print books? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela