What is machine learning?
Before we talk about ML, let’s understand what AI is. In the simplest terms, AI is the ability of a machine to mimic human intelligence and iteratively improve itself based on the information it collects. The goal of AI is to build systems that perform tasks routinely done by humans, such as problem-solving, pattern matching, image recognition, and knowledge acquisition. ML, a subset of AI, is the process of training a model so that it learns and improves from experience. Deep learning (DL), in turn, is a subfield of ML in which we use artificial neural networks (ANNs), which are loosely inspired by the human brain, to find the nonlinear relationships between the input and output and generate predictions (Figure 1.1):
Figure 1.1 – AI versus ML versus DL – how they are related
In ML, a model is built from input data and an underlying algorithm to make useful predictions on real-world data. In a simplified view of ML, “features,” each of which represents an individual measurable property of the data, are provided as input, and “labels” are returned as the predictions. Suppose we want to predict whether a particular DNA sequence has a binding site for a transcription factor (TF) of interest. Using the traditional approach, we would use a position weight matrix (PWM) to scan the sequence and identify potential motif matches. Even though this works, it is difficult, largely manual, and does not scale well. Using an ML-based approach, we would instead give an ML model many DNA sequences that, based on experimental results, either have or don’t have binding sites (the labels), until the model learns the mathematical relationship between the features of those sequences and their labels. It then uses this knowledge to make informed predictions on new data. For example, we could give the ML model an unknown DNA sequence, and it would predict whether a binding site is present. This is one example of why ML is a good fit for genomics problems. Other ways in which ML can be used in genomics include identifying genetic disorders, predicting the type of cancer from genetic variants, and improving disease prognosis.
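To make the idea of features and labels concrete, here is a minimal sketch in plain Python. It rests on several assumptions that are not from the text above: the eight short sequences and their labels are invented for illustration (they are not experimental data), the "binding site" is simplified to the motif TATA at one fixed position, and the model is a basic perceptron chosen only because it fits in a few lines. The function names `one_hot`, `predict`, and `train_perceptron` are hypothetical helpers, not part of any library.

```python
NUCLEOTIDES = "ACGT"

def one_hot(seq):
    """Turn a DNA string into a flat list of 0/1 features (4 per position)."""
    features = []
    for base in seq:
        features.extend(1.0 if base == n else 0.0 for n in NUCLEOTIDES)
    return features

def predict(weights, bias, features):
    """Return label 1 (site present) if the weighted sum is positive, else 0."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else 0

def train_perceptron(X, y, epochs=1000):
    """Adjust weights whenever a training example is misclassified."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in zip(X, y):
            error = label - predict(weights, bias, features)
            if error != 0:
                mistakes += 1
                bias += error
                weights = [w + error * x for w, x in zip(weights, features)]
        if mistakes == 0:  # every training example is classified correctly
            break
    return weights, bias

# Toy training set (made up): label 1 means "TATA" sits at positions 3 to 6.
train_seqs = [
    "GCTATAGC", "AATATACG", "CGTATATT", "TTTATAGG",  # label 1
    "GCGCGCGC", "AACCGGTT", "CGTACATT", "TTGATAGG",  # label 0
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

X = [one_hot(s) for s in train_seqs]
weights, bias = train_perceptron(X, train_labels)

# Ask the trained model about a sequence it has not seen before.
new_seq = "GGTATACC"
print(new_seq, "->", predict(weights, bias, one_hot(new_seq)))
```

Here each sequence becomes 32 features (8 positions times 4 nucleotides), and the model only ever sees those numbers and the labels. Real TF binding data is far noisier and the motif position is not fixed, so in practice a larger dataset and a more expressive model would be needed.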