Introducing Machine Learning for Genomics
Machine learning (ML) is the field of science that deals with developing computer algorithms and models that can perform certain tasks without being explicitly programmed to do so. In other words, rather than being given hand-written “rules,” the machine “learns” patterns from the input data provided to it. The machine can then convert that learning into expertise or knowledge and use it to make predictions. ML is an important tool within artificial intelligence (AI), a subfield of computer science that aims to automate tasks that we, as humans, are naturally good at. ML has become an important part of many modern businesses and research fields. The adoption of ML for genomics applications has been growing recently because of the availability of large genomic datasets, improvements in algorithms, and, most importantly, superior computational power. More and more scientific research organizations and industries are expanding their use of ML across vast volumes of genomic data for predictive diagnostics, as well as to gain biological insights at the scale of population health.
Genomics, the study of the genetic constitution of organisms, holds promise for understanding and diagnosing human diseases and for improving our agriculture and livestock. The field of genomics has seen exponential growth in the last 15 years, mainly due to technological advances in high-throughput sequencing, also known as next-generation sequencing (NGS), which generate rapidly growing amounts of genomic data. It is estimated that between 100 million and as many as 2 billion human genomes could be sequenced by 2025 (https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195), representing an astounding growth of four to five orders of magnitude in 10 years and far exceeding the growth of many big data domains. The complexity and sheer volume of this data create roadblocks not only to its acquisition, storage, and distribution but also to its analysis. The current tools used in genomic analysis are built on deterministic approaches and rely on encoded rules to perform a particular task. To keep up with data growth, we need new and innovative approaches in genomics, such as ML, to enrich our understanding of basic biology and translate it into applied research.
In this chapter, we’ll learn what ML is, why ML is essential for genomics, and what value ML brings to the life sciences and biotechnology industries that leverage genome data to develop genomics-based products. By the end of this chapter, you will understand the limitations of current conventional algorithms for genomic data analysis, how solving problems with ML differs from conventional approaches, and how ML approaches can fill those gaps and make generating biological insights easier.
In this chapter, we’re going to cover the following main topics:
- What is machine learning?
- Why machine learning for genomics?
- Machine learning for genomics in life sciences and biotechnology