Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
R Bioinformatics Cookbook

You're reading from   R Bioinformatics Cookbook Use R and Bioconductor to perform RNAseq, genomics, data visualization, and bioinformatic analysis

Arrow left icon
Product type Paperback
Published in Oct 2019
Publisher Packt
ISBN-13 9781789950694
Length 316 pages
Edition 1st Edition
Languages
Tools
Arrow right icon
Authors (2):
Arrow left icon
Dr Dan Maclean Dr Dan Maclean
Author Profile Icon Dr Dan Maclean
Dr Dan Maclean
Dan MacLean Dan MacLean
Author Profile Icon Dan MacLean
Dan MacLean
Arrow right icon
View More author details
Toc

Table of Contents (13) Chapters Close

Preface 1. Performing Quantitative RNAseq 2. Finding Genetic Variants with HTS Data FREE CHAPTER 3. Searching Genes and Proteins for Domains and Motifs 4. Phylogenetic Analysis and Visualization 5. Metagenomics 6. Proteomics from Spectrum to Annotation 7. Producing Publication and Web-Ready Visualizations 8. Working with Databases and Remote Data Sources 9. Useful Statistical and Machine Learning Methods 10. Programming with Tidyverse and Bioconductor 11. Building Objects and Packages for Code Reuse 12. Other Books You May Enjoy

Finding regions showing high expression ab initio with bumphunter

Finding regions of read alignments that all come from the same, potentially unannotated, genomic feature is a common task. The aim here is to group read alignments together in such a way that we will be able to mark regions that have significant coverage and then go on to compare samples for differences in expression levels.

Getting ready...

We'll use the same windows dataset that had one experiment with three peaks into the function that we used in Recipe 4—so we know we're looking for three bumps. The data is in this book's data repository under datasets/ch1/windows.bam. We'll need the Rsamtools and bumphunter libraries.

How to do it...

  1. Load data and get per-position coverage:
library(Rsamtools) 
library(bumphunter)
pileup_df <- Rsamtools::pileup(file.path(getwd(), "datasets", "ch1", "windows.bam"))
  1. Find preliminary clusters:
clusters <- bumphunter::clusterMaker(pileup_df$seqnames, pileup_df$pos, maxGap = 100) 
  1. Find the bumps with a minimum cutoff:
bumphunter::regionFinder(pileup_df$count, pileup_df$seqnames, pileup_df$pos, clusters, cutoff=1)

How it works...

In step 1, we use Rsamtools pileup() function with default settings to get a per-base coverage dataframe. Each row represents a single nucleotide in the reference and the count column gives the depth of coverage at that point. The result is stored in the pileup_df dataframe.

In step 2, we use the bumphunter clusterMaker() function on pileup_df, which simply groups reads within a certain distance of each other into clusters. We give it the sequence names, positions, and a maximum distance parameter (maxGap). The function returns a vector of cluster numbers of equal length to the dataframe, indicating the cluster membership of each row in the dataframe. If we tabulate with table, we can see the cluster sizes (number of rows) in each cluster:

table(clusters)
## clusters ## 1 2 3 ## 1486 1552 1520

In step 3, we refine our approach; we use regionFinder(), which applies a read depth cutoff to ensure a minimum read depth for the clusters. We pass it similar data as in step 2, adding the cluster membership vector clusters and a minimum read cutoff—here, we set to 1 for use with this very small dataset. The result of step 3 is the regions that are clustered together, but in a useful table:

##    chr start  end     value  area cluster indexStart indexEnd    L
## 3 Chr1  4503 5500 10.401974 15811       3       3039     4558 1520
## 1 Chr1   502 1500  9.985868 14839       1          1     1486 1486
## 2 Chr1  2501 3500  8.657216 13436       2       1487     3038 1552

In these region predictions, we can clearly see the three regions containing reads that are in that data, give or take a nucleotide or two.

There's more...

If you have multiple experiments to analyze, try the bumphunter() function. This will operate over multiple data columns in a matrix and perform linear modeling to assess uncertainty about the position and existence from the replicates; it is very similar to regionFinder() in operation.

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime