Predicting open reading frames in long reference sequences
A draft genome assembly of a previously unsequenced organism can be a rich source of biological knowledge, but when genomics resources such as gene annotations aren’t available, it can be tricky to proceed. In this recipe, we’ll look at a first-stage pipeline for finding potential genes and genomic loci of interest de novo, using no information beyond the sequence itself. We’ll use a very simple set of rules to find open reading frames (ORFs): sequences that begin with a start codon and end with a stop codon. The tools for doing this are encapsulated within a single function in the systemPipeR Bioconductor package. We’ll end up with yet another GRanges object that we can integrate into downstream processes that allow us to cross-reference other data, such as RNA-Seq. As a final step, we’ll look at how we can use a genome simulation to assess which of the open reading frames are...
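
To make the ORF rule concrete, here is a minimal base R sketch of the idea: walk each forward reading frame codon by codon, open an ORF at the first ATG, and close it at the next in-frame stop codon. It is an illustration only and not the systemPipeR function the recipe relies on. The helper name find_orfs, its min_codons argument, and the toy sequence are assumptions made for this example. It scans the forward strand only and returns a plain data frame rather than a GRanges object.

```r
# Hypothetical helper for illustration (not part of systemPipeR):
# scan the three forward reading frames of a DNA string for stretches
# that start with ATG and end at the next in-frame stop codon.
find_orfs <- function(dna, min_codons = 2L,
                      stops = c("TAA", "TAG", "TGA")) {
  dna <- toupper(dna)
  hits <- list()
  for (frame in 0:2) {
    n_codons <- (nchar(dna) - frame) %/% 3
    if (n_codons < 1) next
    # 1-based start position of every codon in this frame
    starts <- frame + 3 * (seq_len(n_codons) - 1) + 1
    codons <- substring(dna, starts, starts + 2)
    open_at <- NA_integer_  # index of the codon that opened the current ORF
    for (i in seq_along(codons)) {
      if (is.na(open_at) && codons[i] == "ATG") {
        open_at <- i
      } else if (!is.na(open_at) && codons[i] %in% stops) {
        # keep the ORF only if it is long enough (start and stop included)
        if (i - open_at + 1 >= min_codons) {
          hits[[length(hits) + 1]] <- data.frame(
            frame = frame + 1L,
            start = starts[open_at],
            end   = starts[i] + 2
          )
        }
        open_at <- NA_integer_
      }
    }
  }
  if (length(hits) == 0) {
    return(data.frame(frame = integer(0), start = numeric(0), end = numeric(0)))
  }
  do.call(rbind, hits)
}

# Toy sequence (made up for this example)
orfs <- find_orfs("CCATGAAATAGGG")
print(orfs)
```

In the toy sequence, the only ORF is ATGAAATAG in the third reading frame, spanning positions 3 to 11. A real genome scan would also need to cover the reverse strand and apply a sensible minimum length.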