Machine learning for genomics in life sciences and biotechnology
Because ML has shown considerable promise for genomics applications such as drug discovery, diagnostics, precision medicine, agriculture, and biological research, a growing number of life science and biotech organizations are using ML to analyze genomic data for population health and predictive analytics. According to a market research study that takes into account technology, functionality, application, and region, the global AI in genomics market is forecast to grow from $202 million in 2020 to $1.671 billion by 2025 (https://www.marketsandmarkets.com/Market-Reports/artificial-intelligence-in-genomics-market-36649899.html). The main drivers of this growth are the need to control spiraling drug costs, increasing public and private investment, and, most importantly, the adoption of AI solutions in precision medicine. The COVID-19 pandemic has also accelerated the adoption of AI for genomics (https://www.jmir.org/2021/3/e22453/). Although the outlook for ML in genomics is exciting, there is a shortage of skilled professionals who can develop, manage, and apply these ML methodologies. Additionally, integrating ML systems into existing systems is a challenging task that requires a sound understanding of the underlying concepts and techniques. To stand out from the crowd and contribute to their company's data-driven decisions, researchers must have this skill set.
This book addresses the skill gap that currently exists in the market. It is a Swiss Army knife for any research professional, data scientist, or manager who is getting started with genomic data analysis using ML. It highlights the power of ML approaches for handling large-scale genomic data by introducing key concepts and working through real-life business examples, use cases, and best practices, helping to fill gaps in both the technical skill set and the general mindset within the field.
Exploring machine learning software
Before we start the tutorials, we will need some tools. So that you can follow along regardless of your operating system, we will use ML software that runs on Windows, macOS, and Linux. We will be using the Python programming language along with Python libraries such as Biopython for genomic data analysis, scikit-learn for building ML models, and Keras for training our DL models. Let’s take a closer look at these pieces of ML software.
Python programming language
We will be using the Python programming language throughout this book. Python is widely used by researchers because of its large ecosystem of packages that support all types of data analysis and because of its user-friendliness. More importantly, the ML, DL, and genomics communities routinely use Python for their own analysis needs. Throughout this book, we will use Python version 3.7 and look at a few ways of setting up Python and its packages using pip, Conda, and Anaconda.
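Before going further, it can be useful to confirm which Python interpreter you are running. The following is a minimal sketch that uses only the standard library; the helper name `meets_minimum` and the choice to treat 3.7 as a minimum (rather than an exact requirement) are our own assumptions for illustration.

```python
import sys

# Assumption: we treat the book's Python 3.7 as a minimum, not an exact pin.
MINIMUM_VERSION = (3, 7)


def meets_minimum(version_info, minimum=MINIMUM_VERSION):
    """Return True if the (major, minor, ...) tuple is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = sys.version_info
    print(f"Running Python {current.major}.{current.minor}.{current.micro}")
    if meets_minimum(current):
        print("This interpreter is recent enough for the examples.")
    else:
        print("Consider installing Python 3.7 or later.")
```

Comparing version tuples rather than version strings avoids mistakes such as treating "3.10" as older than "3.7".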
Visualization
We will be using the Matplotlib and Seaborn Python packages, two of the most popular visualization libraries in Python. They are quick to install, easy to use, and easy to import into a Python script. Both come with a variety of functions and methods for plotting data. Throughout this book, we will use Matplotlib version 3.5.1 and Seaborn version 0.11.2. We will look at a few ways of installing these libraries in the subsequent chapters.
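Once the libraries are installed, you may want to check whether your versions match the ones used here. The following is a rough sketch, not a required step; the `PINNED` dictionary and the helper names are hypothetical, and the check relies on each package exposing a `__version__` attribute, which both Matplotlib and Seaborn do.

```python
import importlib

# Versions named in this section; the dictionary itself is our own convenience.
PINNED = {"matplotlib": "3.5.1", "seaborn": "0.11.2"}


def installed_version(package_name):
    """Return the package's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(package_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def compare_to_pins(pins):
    """Map each package name to a (wanted, found, status) tuple."""
    report = {}
    for name, wanted in pins.items():
        found = installed_version(name)
        if found is None:
            status = "not installed"
        elif found == wanted:
            status = "matches"
        else:
            status = "differs"
        report[name] = (wanted, found, status)
    return report


if __name__ == "__main__":
    for name, (wanted, found, status) in compare_to_pins(PINNED).items():
        print(f"{name}: wanted {wanted}, found {found} ({status})")
```

A different version does not necessarily mean the examples will fail, but it is the first thing to check if a plot looks different from the one shown in the text.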
Biopython
We will also be using Biopython, a Python package that provides a collection of tools for processing genomic data. It offers high-quality, reusable modules and classes for analyzing complex genomic data, and it includes built-in modules for connecting to databases such as Swiss-Prot, NCBI, Ensembl, and so on. We will use Biopython version 1.78 and look at separate ways of installing Biopython using pip, Conda, and Anaconda.
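To give a first feel for Biopython, here is a minimal sketch that uses its `Seq` class, assuming Biopython is already installed. The short DNA sequence is a made-up example, and the GC fraction is computed by hand with `count` so that the snippet does not depend on helper functions whose names have changed between Biopython releases.

```python
from Bio.Seq import Seq

# Hypothetical 12-base coding sequence, used purely for illustration.
dna = Seq("ATGGCCATTGTA")

# Reverse complement of the strand.
reverse_complement = dna.reverse_complement()

# Translate the DNA into a protein using the standard genetic code.
protein = dna.translate()

# GC fraction computed manually: (number of G + number of C) / length.
gc_fraction = (dna.count("G") + dna.count("C")) / len(dna)

print("Sequence:          ", dna)
print("Reverse complement:", reverse_complement)
print("Protein:           ", protein)
print("GC fraction:       ", round(gc_fraction, 3))
```

A `Seq` object behaves much like a Python string (it supports slicing, `len`, and `count`) while adding biology-specific methods such as `reverse_complement` and `translate`.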
Scikit-learn
Scikit-learn is a Python package dedicated to ML and is one of the most popular ML libraries among data scientists. It has a rich collection of ML algorithms, extensive tutorials, good documentation, and, most importantly, an excellent user community. For this introductory chapter, we will use scikit-learn for developing ML models in Python. Wherever applicable, we will use scikit-learn version 1.0.2 and look at separate ways of installing scikit-learn in the subsequent chapters.
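As a small taste of what an ML model in scikit-learn looks like, the following sketch trains a classifier on DNA sequences represented as counts of overlapping 3-mers. This is only an illustration under our own assumptions: the sequences and labels are invented toy data (GC-rich versus AT-rich), and the choice of 3-mers and logistic regression is arbitrary rather than a recommended method.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: GC-rich sequences are labeled 1, AT-rich sequences 0.
sequences = [
    "GCGCGGCCGC", "CCGGGCGCGG", "GGCCGCGCCG",
    "ATATTAATAT", "TTAATATAAT", "AATTATATTA",
]
labels = [1, 1, 1, 0, 0, 0]

# Count overlapping 3-character substrings (3-mers) in each sequence,
# then fit a logistic regression model on those counts.
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(3, 3), lowercase=False),
    LogisticRegression(),
)
model.fit(sequences, labels)

# Predict the class of two unseen toy sequences.
new_sequences = ["GCCGCGGCGC", "TATATTAATA"]
predictions = model.predict(new_sequences)

for sequence, predicted in zip(new_sequences, predictions):
    print(sequence, "->", predicted)
```

Wrapping the feature extraction and the classifier in a single pipeline means the same 3-mer vocabulary learned during `fit` is reused automatically at prediction time.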