The deep learning approach to Natural Language Processing
I think it is safe to say that deep learning revolutionized machine learning, especially in fields such as computer vision, speech recognition, and, of course, NLP. Deep models created a wave of paradigm shifts in many fields of machine learning, because they learn rich features from raw data instead of relying on limited human-engineered features. This made the pesky and expensive task of feature engineering obsolete. Deep models also made the traditional workflow more efficient, as they perform feature learning and task learning simultaneously. Moreover, due to the massive number of parameters (that is, weights) in a deep model, it can encompass significantly more features than a human could have engineered. However, deep models are considered a black box due to their poor interpretability. For example, understanding how and which features are learned by a deep model for a given problem is still an active area of research, and the interpretability of deep learning models continues to attract a lot of attention.
A deep neural network is essentially an artificial neural network that has an input layer, many interconnected hidden layers in the middle, and finally, an output layer (for example, a classifier or a regressor). As you can see, this forms an end-to-end model from raw data to predictions. The hidden layers in the middle give deep models their power, as they are responsible for learning good features from raw data that eventually lead to success at the task at hand. Let’s now briefly go through the history of deep learning.
History of deep learning
Let’s briefly discuss the roots of deep learning and how the field evolved to be a very promising technique for machine learning. In 1960, Hubel and Wiesel performed an interesting experiment and discovered that a cat’s visual cortex is made of simple and complex cells, and that these cells are organized in a hierarchical form. Also, these cells react differently to different stimuli. For example, simple cells are activated by variously oriented edges, while complex cells are insensitive to spatial variations (for example, the orientation of an edge). This kindled the motivation for replicating a similar behavior in machines, giving rise to the concept of artificial neural networks.
In the years that followed, neural networks gained the attention of many researchers. In 1965, Ivakhnenko and others introduced a neural network trained by a method known as the Group Method of Data Handling (GMDH) and based on the famous Perceptron by Rosenblatt. Later, in 1979, Fukushima introduced the Neocognitron, which planted the seeds for one of the most famous variants of deep models, Convolutional Neural Networks (CNNs). Unlike perceptrons, which always took in a 1D input, a Neocognitron was able to process 2D inputs using convolution operations.
Artificial neural networks are trained by backpropagating an error signal to optimize the network parameters: the gradients of the weights of a given layer are computed with respect to the loss, and the weights are then updated by pushing them in the opposite direction of the gradient in order to minimize the loss. For layers further away from the output layer (that is, where the loss is computed), the algorithm uses the chain rule to compute the gradients. Applying the chain rule through many layers leads to a practical problem known as the vanishing gradients problem, which strictly limits the potential number of layers (the depth) of the neural network: the gradients of the layers closer to the inputs (that is, further away from the output layer) become very small, causing model training to stop prematurely and leading to an underfitted model.
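The following is a minimal Python sketch (not from the book) that illustrates the effect numerically: the derivative of the sigmoid is at most 0.25, so multiplying many such factors through the chain rule quickly shrinks the gradient. The 20-layer depth and the unit weights are arbitrary assumptions made purely for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

grad = 1.0          # gradient of the loss with respect to the topmost activation
activation = 0.5    # an arbitrary input value
for layer in range(1, 21):                       # 20 stacked sigmoid "layers" with weight 1 and bias 0
    activation = sigmoid(activation)             # forward pass through one layer
    grad *= activation * (1.0 - activation)      # chain rule: multiply by sigma'(z) = sigma(z) * (1 - sigma(z))
    if layer % 5 == 0:
        print(f"after {layer:2d} layers: gradient factor ~ {grad:.2e}")
```

After roughly 20 layers, the accumulated factor is already on the order of 1e-13, which is why the lower layers of early deep networks barely learned at all.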
Then, in 2006, it was found that pretraining a deep neural network by minimizing the reconstruction error (obtained by trying to compress the input to a lower dimensionality and then reconstructing it back into the original dimensionality) for each layer of the network provides a good initial starting point for the weights of the neural network; this allows a consistent flow of gradients from the output layer to the input layer. This essentially allowed neural network models to have more layers without the ill effects of the vanishing gradient. Also, these deeper models were able to surpass traditional machine learning models in many tasks, mostly in computer vision (for example, test accuracy on the MNIST handwritten digit dataset). With this breakthrough, deep learning became the buzzword in the machine learning community.
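To make the idea concrete, here is a hedged tf.keras sketch of pretraining a single layer as an autoencoder that minimizes reconstruction error; the 784/256 sizes and the commented-out `x_train` are assumptions for illustration, not the original 2006 setup.

```python
import tensorflow as tf

input_dim, hidden_dim = 784, 256                                     # assumed sizes (e.g., flattened MNIST images)

encoder = tf.keras.layers.Dense(hidden_dim, activation="sigmoid")    # compress to a lower dimensionality
decoder = tf.keras.layers.Dense(input_dim, activation="sigmoid")     # reconstruct the original dimensionality

inputs = tf.keras.Input(shape=(input_dim,))
autoencoder = tf.keras.Model(inputs, decoder(encoder(inputs)))
autoencoder.compile(optimizer="adam", loss="mse")                    # mean squared reconstruction error
# autoencoder.fit(x_train, x_train, epochs=5)                        # the input is also the target

# After pretraining, the encoder layer (with its learned weights) can be reused as a
# well-initialized hidden layer of a deeper supervised network, and the procedure
# repeated layer by layer.
```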
Things gained further momentum when, in 2012, AlexNet (a deep convolutional neural network created by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton) won the Large Scale Visual Recognition Challenge (LSVRC) 2012 with an error decrease of 10% from the previous best. During this time, advances were also made in speech recognition, where state-of-the-art speech recognition accuracies were reported using deep neural networks. Furthermore, people began to realize that Graphics Processing Units (GPUs) enable more parallelism than Central Processing Units (CPUs), allowing larger and deeper networks to be trained faster.
Deep models were further improved with better model initialization techniques (for example, Xavier initialization), making the time-consuming pretraining redundant. Better nonlinear activation functions, such as Rectified Linear Units (ReLUs), were also introduced, alleviating the adversities of the vanishing gradient in deeper models. Better optimization (or learning) techniques, such as the Adam optimizer, automatically tweak the individual learning rate of each of the millions of parameters in a neural network model. Together, these advancements rewrote state-of-the-art performance in many fields of machine learning, such as object classification and speech recognition, and allowed neural network models to have large numbers of hidden layers. The ability to increase the number of hidden layers (that is, to make the neural networks deep) is one of the primary contributors to the significantly better performance of neural network models compared with other machine learning models. Furthermore, better intermediate regularizers, such as batch normalization layers, have improved the performance of deep nets on many tasks.
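As an illustration, the following hedged tf.keras sketch (not the book’s code) wires these ingredients together: Glorot (Xavier) initialization, ReLU activations, batch normalization layers, and the Adam optimizer. The layer sizes, the 784-dimensional input, and the 10-class output are assumptions chosen for the example.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dense(256, kernel_initializer="glorot_uniform"),  # Xavier/Glorot initialization
    tf.keras.layers.BatchNormalization(),                             # intermediate regularizer
    tf.keras.layers.ReLU(),                                           # nonlinearity that eases vanishing gradients
    tf.keras.layers.Dense(256, kernel_initializer="glorot_uniform"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.ReLU(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # per-parameter adaptive learning rates
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```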
Later, even deeper models such as ResNets, Highway Nets, and Ladder Nets were introduced, which had hundreds of layers and many millions of parameters. Such an enormous number of layers was made possible by various empirically and theoretically inspired techniques. For example, ResNets use shortcut connections, or skip connections, to connect layers that are far apart, which mitigates the diminishing of gradients from layer to layer, as discussed earlier.
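Below is a minimal tf.keras sketch of such a skip connection (an assumption for illustration, not the original ResNet architecture): the block’s input is added back to its output, giving gradients a short path around the transformation. The 64-unit width and the two-block depth are arbitrary.

```python
import tensorflow as tf

def residual_block(x, units):
    """Two dense layers whose output is added back to the block's input (a skip connection)."""
    h = tf.keras.layers.Dense(units, activation="relu")(x)
    h = tf.keras.layers.Dense(units)(h)
    return tf.keras.layers.Activation("relu")(tf.keras.layers.Add()([x, h]))

inputs = tf.keras.Input(shape=(64,))
x = residual_block(inputs, 64)      # input and output widths must match for the addition
x = residual_block(x, 64)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
```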
The current state of deep learning and NLP
Many different deep models have seen the light of day since the early 2000s. Even though they share resemblances, such as all of them using nonlinear transformations of the inputs and parameters, the details can vary vastly. For example, a CNN can learn from two-dimensional data (for example, RGB images) as it is, while a multilayer perceptron model requires the input to be unwrapped into a one-dimensional vector, causing the loss of important spatial information.
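The following hedged tf.keras snippet (not from the book) makes the difference visible: a Conv2D layer consumes an image as a 2D grid and keeps its spatial arrangement, whereas a Dense layer needs the image flattened into a 1D vector first. The 28x28 grayscale input and the 16 filters/units are assumptions.

```python
import tensorflow as tf

image_input = tf.keras.Input(shape=(28, 28, 1))                  # e.g., a grayscale image

# A convolutional layer operates directly on the 2D grid
conv_features = tf.keras.layers.Conv2D(16, kernel_size=3, activation="relu")(image_input)

# A fully connected layer needs the image flattened first
flat = tf.keras.layers.Flatten()(image_input)                    # 784-dimensional vector
dense_features = tf.keras.layers.Dense(16, activation="relu")(flat)

print(conv_features.shape)    # (None, 26, 26, 16): spatial structure preserved
print(dense_features.shape)   # (None, 16): spatial structure gone
```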
When processing text, one of the most intuitive interpretations is to perceive it as a sequence of characters; therefore, the learning model should be able to do time-series modeling, which requires a memory of the past. To understand this, think of a language modeling task: the word predicted after the word “cat” should be different from the word predicted after the word “climbed.” One such popular model that encompasses this ability is known as a Recurrent Neural Network (RNN). We will see exactly how RNNs achieve this in Chapter 6, Recurrent Neural Networks, by going through interactive exercises.
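As a rough illustration (an assumption, not the book’s Chapter 6 code), here is a tiny tf.keras next-word prediction model: the recurrent layer carries a hidden state across time steps, so the prediction after “cat” can differ from the prediction after “climbed.” The vocabulary size, sequence length, and layer sizes are made up.

```python
import tensorflow as tf

vocab_size, seq_len = 1000, 10         # assumed vocabulary size and input length

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(seq_len,)),
    tf.keras.layers.Embedding(vocab_size, 32),                 # word IDs -> dense vectors
    tf.keras.layers.SimpleRNN(64),                             # hidden state summarizes the words seen so far
    tf.keras.layers.Dense(vocab_size, activation="softmax"),   # distribution over the next word
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```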
It should be noted that memory is not a trivial operation that is inherent to a learning model. Conversely, ways of persisting memory should be carefully designed.
Also, the term memory should not be confused with the learned weights of a non-sequential deep network, which only looks at the current input; a sequential model (for example, an RNN) uses both its learned weights and the previous elements of the sequence to predict the next output.
One prominent drawback of RNNs is that they cannot remember more than a few (approximately seven) time steps, thus lacking long-term memory. Long Short-Term Memory (LSTM) networks are an extension of RNNs that encapsulate long-term memory. Therefore, LSTMs are often preferred over standard RNNs nowadays. We will peek under the hood in Chapter 7, Understanding Long Short-Term Memory Networks, to understand them better.
Finally, a model known as the Transformer was introduced by Google fairly recently, and it has outperformed many of the previous state-of-the-art models, such as LSTMs, on a plethora of NLP tasks. Previously, both recurrent models (for example, LSTMs) and convolutional models (for example, CNNs) dominated the NLP domain. For example, CNNs have been used for sentence classification, machine translation, and sequence-to-sequence learning tasks. However, Transformers take an entirely different approach: they use neither recurrence nor convolution, but an attention mechanism. The attention mechanism allows the model to look at the entire sequence at once to produce a single output. For example, consider the sentence “The animal didn’t cross the road because it was tired.” While generating intermediate representations for the word “it,” it would be useful for the model to learn that “it” refers to the “animal.” The attention mechanism allows the Transformer model to learn such relationships. This capability cannot be replicated with standard recurrent models or convolutional models. We will investigate these models further in Chapter 10, Transformers, and Chapter 11, Image Captioning with Transformers.
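Here is a minimal numpy sketch of scaled dot-product self-attention (an assumption for illustration, not the exact Transformer implementation): every position scores its similarity to every other position, the scores are normalized with a softmax, and the output at each position is a weighted mixture of all the value vectors, so the whole sequence is considered at once. The sequence length of 10 and the dimensionality of 16 are arbitrary.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (seq_len, d_model)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # pairwise similarity between positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the whole sequence
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 10, 16                               # a short sequence of 10 token vectors (assumed)
Q = K = V = rng.normal(size=(seq_len, d_model))         # self-attention: queries, keys, and values from the same sequence
outputs, attention = scaled_dot_product_attention(Q, K, V)
print(attention.shape)    # (10, 10): how strongly each position attends to every other position
```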
In summary, we can mainly separate deep networks into three categories: non-sequential models that deal with only a single input at a time for both training and prediction (for example, image classification); sequential models that cope with sequences of inputs of arbitrary length (for example, text generation, where a single word is a single input); and finally, attention-based models that look at the whole sequence at once, such as the Transformer, as well as pretrained models based on the Transformer architecture, such as BERT and XLNet. We can categorize non-sequential (also called feed-forward) models into deep networks (roughly fewer than 20 layers) and very deep networks (which can have hundreds of layers). Sequential models are categorized into short-term memory models (for example, RNNs), which can only memorize short-term patterns, and long-term memory models, which can memorize longer patterns. In Figure 1.4, we outline this taxonomy. You don’t have to understand these different deep learning models fully at this point; the figure simply illustrates their diversity:
Figure 1.4: A general taxonomy of the most commonly used deep learning methods, categorized into several classes
Now, let’s take our first steps toward understanding the inner workings of a neural network.
Understanding a simple deep model – a fully connected neural network
Now, let’s have a closer look at a deep neural network in order to gain a better understanding. Although there are numerous different variants of deep models, let’s look at one of the earliest models (dating back to the 1950s and 1960s), known as a fully connected neural network (FCNN), sometimes called a multilayer perceptron. Figure 1.5 depicts a standard three-layered FCNN.
The goal of an FCNN is to map an input (for example, an image or a sentence) to a certain label or annotation (for example, the object category for images). This is achieved by using an input $x$ to compute $h$ – a hidden representation of $x$ – with a transformation such as $h = \sigma(Wx + b)$; here, $W$ and $b$ are the weights and bias of the FCNN, respectively, and $\sigma$ is the sigmoid activation function. Neural networks use nonlinear activation functions at every layer, and the sigmoid is one such activation. It is an element-wise transformation applied to the output of a layer, where the sigmoidal output of $x$ is given by $\sigma(x) = \frac{1}{1 + e^{-x}}$. Next, a classifier is placed on top of the FCNN, which gives it the ability to leverage the features learned in the hidden layers to classify inputs. The classifier is part of the FCNN and is yet another layer with some weights, $W_s$, and a bias, $b_s$. We can then calculate the final output of the FCNN as $\hat{y} = \mathrm{softmax}(W_s h + b_s)$. For example, a softmax classifier can be used for multi-class classification problems. It provides a normalized representation of the scores output by the classifier layer; that is, it produces a valid probability distribution over the classes in the classifier layer. The predicted label is the output node with the highest softmax value. With this, we can define a classification loss, calculated as the difference between the predicted output label and the actual output label. An example of such a loss function is the mean squared loss. You don’t have to worry if you don’t understand the actual intricacies of the loss function; we will discuss quite a few of them in later chapters. Next, the neural network parameters, $W$, $b$, $W_s$, and $b_s$, are optimized using a standard stochastic optimizer (for example, stochastic gradient descent) to reduce the classification loss over all the inputs. Figure 1.5 depicts the process explained in this paragraph for a three-layer FCNN. We will walk through the details of how to use such a model for NLP tasks, step by step, in Chapter 3, Word2vec – Learning Word Embeddings.
Figure 1.5: An example of a fully connected neural network (FCNN)
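To ground the computation above, here is a minimal numpy sketch (with made-up toy sizes and random weights) of the forward pass just described: a hidden layer $h = \sigma(Wx + b)$ followed by a softmax classifier $\hat{y} = \mathrm{softmax}(W_s h + b_s)$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
input_dim, hidden_dim, num_classes = 4, 5, 3                 # toy sizes, chosen only for illustration

x = rng.normal(size=(input_dim,))                            # a single input example
W, b = rng.normal(size=(hidden_dim, input_dim)), np.zeros(hidden_dim)
W_s, b_s = rng.normal(size=(num_classes, hidden_dim)), np.zeros(num_classes)

h = sigmoid(W @ x + b)                                       # hidden representation of x
y_hat = softmax(W_s @ h + b_s)                               # normalized scores: a valid probability distribution
print(y_hat, y_hat.argmax())                                 # predicted label = node with the highest softmax value
```

In practice, $W$, $b$, $W_s$, and $b_s$ would then be adjusted by a stochastic optimizer so that the classification loss computed from $\hat{y}$ and the true label decreases.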
Let’s look at an example of how to use a neural network for a sentiment analysis task. Consider that we have a dataset where the input is a sentence expressing a positive or negative opinion about a movie and a corresponding label saying if the sentence is actually positive (1) or negative (0). Then, we are given a test dataset, where we have single-sentence movie reviews, and our task is to classify these new sentences as positive or negative.
It is possible to use a neural network (which can be deep or shallow, depending on the difficulty of the task) for this task by adhering to the following workflow (a minimal end-to-end sketch follows the list):
- Tokenize the sentences by words.
- Convert the sentences into a fixed-sized numerical representation (for example, a Bag-of-Words representation). A fixed-sized representation is needed because fully connected neural networks require a fixed-sized input.
- Feed the numerical inputs to the neural network, predict the output (positive or negative), and compare that with the true target.
- Optimize the neural network using a desired loss function.
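The following is a hedged end-to-end sketch of this workflow, using a handful of made-up toy reviews and a hand-rolled Bag-of-Words representation; a real pipeline would use a proper tokenizer, a much larger dataset, and a separate test set.

```python
import numpy as np
import tensorflow as tf

reviews = ["a great movie", "boring and slow", "really great acting", "slow boring plot"]
labels = np.array([1, 0, 1, 0], dtype="float32")            # 1 = positive, 0 = negative

# Steps 1 and 2: tokenize by words and build a fixed-sized Bag-of-Words vector per review
vocab = sorted({word for review in reviews for word in review.split()})
word_to_id = {word: i for i, word in enumerate(vocab)}

def bag_of_words(review):
    vec = np.zeros(len(vocab), dtype="float32")
    for word in review.split():
        if word in word_to_id:
            vec[word_to_id[word]] += 1.0
    return vec

X = np.stack([bag_of_words(review) for review in reviews])

# Steps 3 and 4: feed the vectors to a small fully connected network and optimize a loss
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(len(vocab),)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),          # probability of the positive class
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=10, verbose=0)

# Classify a new, unseen review
print(model.predict(bag_of_words("great plot").reshape(1, -1), verbose=0))
```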
In this section, we looked at deep learning in more detail, covering its history and the current state of deep learning and NLP. Finally, we examined a fully connected neural network (a type of deep learning model) more closely.
Now that we’ve introduced NLP, its tasks, and how approaches to it have evolved over the years, let’s take a moment to look at the technical tools required for the rest of this book.