Use case – Model interpretability for genomics
In this hands-on exercise, we will build a convolutional neural network (CNN) similar to the one we built in Chapter 9, Building and Tuning Deep Learning Models. Unlike in Chapter 9, where the DNA sequences were 101 bases long, here we will use a simulated dataset of DNA sequences that are 50 bases long. In addition, the binding sites in this example are not limited to Transcription Factors (TFs); they can belong to any protein. The labels are 0 and 1, corresponding to negative and positive examples, respectively (0 = no binding site and 1 = binding site).
The goal is to train a CNN model to predict whether a sequence contains the protein’s DNA binding site and to visualize that site in the model’s predictions. Since these are artificial sequences, we have injected the AAAGAGGAAGTT motif into the positive sequences, but don’t worry: the CNN doesn’t know that.
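To make the setup concrete, here is a minimal sketch of how a dataset like this could be simulated. It is not the procedure used to produce the data for this chapter. The function names (simulate_sequence, simulate_dataset), the uniform base frequencies, and the random motif position are all assumptions made for illustration. Only the motif, the sequence length of 50, and the 0/1 labels come from the text.

```python
import random

MOTIF = "AAAGAGGAAGTT"  # motif named in the text
SEQ_LEN = 50            # sequence length named in the text


def simulate_sequence(has_motif, rng):
    """Return a random DNA sequence of SEQ_LEN bases.

    Assumption: bases are drawn uniformly from A, C, G, T, and the motif
    (if requested) overwrites the bases at a random offset.
    """
    bases = [rng.choice("ACGT") for _ in range(SEQ_LEN)]
    if has_motif:
        start = rng.randrange(SEQ_LEN - len(MOTIF) + 1)
        bases[start:start + len(MOTIF)] = MOTIF
    return "".join(bases)


def simulate_dataset(n_per_class, seed=0):
    """Return a shuffled list of (sequence, label) pairs.

    Label 1 = binding site present (motif injected), label 0 = no binding site.
    """
    rng = random.Random(seed)
    data = []
    for label in (0, 1):
        for _ in range(n_per_class):
            data.append((simulate_sequence(label == 1, rng), label))
    rng.shuffle(data)
    return data


if __name__ == "__main__":
    for seq, label in simulate_dataset(n_per_class=3):
        print(label, seq)
```

Because the negative sequences are fully random, one of them could in principle contain the motif by chance, although for a 12-base motif this is extremely unlikely. A production-quality simulator would check for and reject such cases.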
Data collection
For this hands-on tutorial, we will use the simulated data...