You're reading from Natural Language Processing with TensorFlow Teach language to machines using Python's deep learning library

Product type Paperback

Published in May 2018

Publisher Packt

ISBN-13 9781788478311

Length 472 pages

Edition 1st Edition

Languages

Processing

Tools

Processing

Concepts

Deep Learning

Authors (2):

Thushan Ganegedara

Motaz Saad

View More author details

Table of Contents (14) Chapters

Preface

1. Introduction to Natural Language Processing FREE CHAPTER

2. Understanding TensorFlow

3. Word2vec – Learning Word Embeddings

4. Advanced Word2vec

5. Sentence Classification with Convolutional Neural Networks

6. Recurrent Neural Networks

7. Long Short-Term Memory Networks

8. Applications of LSTM – Generating Text

9. Applications of LSTM – Image Caption Generation

10. Sequence-to-Sequence Learning – Neural Machine Translation

11. Current Trends and the Future of Natural Language Processing

A. Mathematical Foundations and Advanced TensorFlow

Index

The deep learning approach to Natural Language Processing

I think it is safe to assume that deep learning revolutionized machine learning, especially in fields such as computer vision, speech recognition, and of course, NLP. Deep models created a wave of paradigm shifts in many of the fields in machine learning, as deep models learned rich features from raw data instead of using limited human-engineered features. This consequentially caused the pesky and expensive feature engineering to be obsolete. With this, deep models made the traditional workflow more efficient, as deep models perform feature learning and task learning, simultaneously. Moreover, due to the massive number of parameters (that is, weights) in a deep model, it can encompass significantly more features than a human would've engineered. However, deep models are considered a black box due to the poor interpretability of the model. For example, understanding the "how" and "what" features learnt by deep models for a given problem still remains an open problem.

A deep model is essentially an artificial neural network that has an input layer, many interconnected hidden layers in the middle, and finally, an output layer (for example, a classifier or a regressor). As you can see, this forms an end-to-end model from raw data to predictions. These hidden layers in the middle give the power to deep models as they are responsible for learning the good features from raw data, eventually succeeding at the task at hand.

History of deep learning

Let's briefly discuss the roots of deep learning and how the field evolved to be a very promising technique for machine learning. In 1960, Hubel and Weisel performed an interesting experiment and discovered that a cat's visual cortex is made of simple and complex cells, and that these cells are organized in a hierarchical form. Also, these cells react differently to different stimuli. For example, simple cells are activated by variously oriented edges while complex cells are insensitive to spatial variations (for example, the orientation of the edge). This kindled the motivation for replicating a similar behavior in machines, giving rise to the concept of deep learning.

In the years that followed, neural networks gained the attention of many researchers. In 1965, a neural network trained by a method known as the Group Method of Data Handling (GMDH) and based on the famous Perceptron by Rosenblatt, was introduced by Ivakhnenko and others. Later, in 1979, Fukushima introduced the Neocognitron, which laid the base for one of the most famous variants of deep models—Convolution Neural Networks. Unlike the perceptrons, which always took in a 1D input, a neocognitron was able to process 2D inputs using convolution operations.

Artificial neural networks used to backpropagate the error signal to optimize the network parameters by computing a Jacobian matrix from one layer to the layer before it. Furthermore, the problem of vanishing gradients strictly limited the potential number of layers (depth) of the neural network. The gradients of layers closer to the inputs, being very small, is known as the vanishing gradients phenomenon. This transpired due to the application of the chain rule to compute gradients (the Jacobian matrix) of lower layer weights. This in turn limited the plausible maximum depth of classical neural networks.

Then in 2006, it was found that pretraining a deep neural network by minimizing the reconstruction error (obtained by trying to compress the input to a lower dimensionality and then reconstructing it back into the original dimensionality) for each layer of the network, provides a good initial starting point for the weight of the neural network; this allows a consistent flow of gradients from the output layer to the input layer. This essentially allowed neural network models to have more layers without the ill-effects of the vanishing gradient. Also, these deeper models were able to surpass traditional machine learning models in many tasks, mostly in computer vision (for example, test accuracy for the MNIST hand-written digit dataset). With this breakthrough, deep learning became the buzzword in the machine learning community.

Things started gaining a progressive momentum, when in 2012, AlexNet (a deep convolution neural network created by Alex Krizhevsky (http://www.cs.toronto.edu/~kriz/), Ilya Sutskever (http://www.cs.toronto.edu/~ilya/), and Geoff Hinton) won the Large Scale Visual Recognition Challenge (LSVRC) 2012 with an error decrease of 10% from the previous best. During this time, advances were made in speech recognition, wherein state-of-the-art speech recognition accuracies were reported using deep neural networks. Furthermore, people began realizing that Graphical Processing Units (GPUs) enable more parallelism, which allows for faster training of larger and deeper networks compared with Central Processing Units (CPUs).

Deep models were further improved with better model initialization techniques (for example, Xavier initialization), making the time-consuming pretraining redundant. Also, better nonlinear activation functions, such as Rectified Linear Units (ReLUs), were introduced, which alleviated the ill-effects of the vanishing gradient in deeper models. Better optimization (or learning) techniques, such as Adam, automatically tweaked individual learning rates of each parameter among the millions of parameters that we have in the neural network model, which rewrote the state-of-the-art performance in many different fields of machine learning, such as object classification and speech recognition. These advancements also allowed neural network models to have large numbers of hidden layers. The ability to increase the number of hidden layers (that is, to make the neural networks deep) is one of the primary contributors to the significantly better performance of neural network models compared with other machine learning models. Furthermore, better intermediate regularizers, such as batch normalization layers, have improved the performance of deep nets for many tasks.

Later, even deeper models such as ResNets, Highway Nets, and Ladder Nets were introduced, which had hundreds of layers and billions of parameters. It was possible to have such an enormous number of layers with the help of various empirically and theoretically inspired techniques. For example, ResNets use shortcut connections to connect layers that are far apart, which minimizes the diminishing of gradients, layer to layer, as discussed earlier.

The current state of deep learning and NLP

Many different deep models have seen the light since their inception in early 2000. Even though they share a resemblance, such as all of them using nonlinear transformation of the inputs and parameters, the details can vary vastly. For example, a Convolution Neural Network (CNN) can learn from two-dimensional data (for example, RGB images) as it is, while a multilayer perceptron model requires the input to be unwrapped to a one-dimensional vector, causing loss of important spatial information.

When processing text, as one of the most intuitive interpretations of text is to perceive it as a sequence of characters, the learning model should be able to do time-series modelling, thus requiring the memory of the past. To understand this, think of a language modelling task; the next word for the word cat should be different from the next word for the word climbed. One such popular model that encompasses this ability is known as a Recurrent Neural Network (RNN). We will see in Chapter 6, Recurrent Neural Networks how exactly RNNs achieve this by going through interactive exercises.

It should be noted that memory is not a trivial operation that is inherent to a learning model. Conversely, ways of persisting memory should be carefully designed. Also, the term memory should not be confused with the learned weights of a non-sequential deep network that only looks at the current input, where a sequential model (for example, RNN) will look at both the learned weights and the previous element of the sequence to predict the next output.

One prominent drawback of RNNs is that they cannot remember more than few (approximately 7) time steps, thus lacking long-term memory. Long Short-Term Memory (LSTM) networks are an extension of RNNs that encapsulate long-term memory. Therefore, often LSTMs are preferred over standard RNNs, nowadays. We will peek under the hood in Chapter 7, Long Short-Term Memory Networks to understand them better.

In summary, we can mainly separate deep networks into two categories: the non-sequential models that deal with only a single input at a time for both training and prediction (for example, image classification) and the sequential models that cope with sequences of inputs of arbitrary length (for example, text generation where a single word is a single input). Then we can categorize non-sequential (also called feed-forward) models into deep (approximately less than 20 layers) and very deep networks (can be greater than hundreds of layers). The sequential models are categorized into short-term memory models (for example, RNNs), which can only memorize short-term patterns and long-term memory models, which can memorize longer patterns. In Figure 1.4, we outline the discussed taxonomy. It is not expected that you understand these different deep learning models fully at this point, but it only illustrates the diversity of the deep learning models:

The current state of deep learning and NLP

Figure 1.4: A general taxonomy of the most commonly used deep learning methods, categorized into several classes

Understanding a simple deep model – a Fully-Connected Neural Network

Now let's have a closer look at a deep neural network in order to gain a better understanding. Although there are numerous different variants of deep models, let's look at one of the earliest models (dating back to 1950-60), known as a Fully-Connected Neural Network (FCNN), or sometimes called a multilayer perceptron. The Figure 1.5 depicts a standard three-layered FCNN.

The goal of a FCNN is to map an input (for example, an image or a sentence) to a certain label or annotation (for example, the object category for images). This is achieved by using an input x to compute h—a hidden representation of x—using a transformation such as h = sigma (W * x + b); here, W and b are the weights and bias of the FCNN, respectively, and sigma is the sigmoid activation function. Next, a classifier (for example, a softmax classifier) is placed on top of the FCNN that gives the ability to leverage the learned features in hidden layers to classify inputs. Classifier, essentially is a part of the FCNN and yet another hidden layer with some weights, W _s and a bias, b _s. Also, we can calculate the final output of the FCNN as, output = softmax (W _s * h + b _s ). For example, a softmax classifier provides a normalized representation of the scores output by the classifier layer; the label is considered to be the output node with the highest softmax value. Then, with this, we can define a classification loss that is calculated as the difference between the predicted output label and the actual output label. An example of such a loss function is the mean squared loss. You don't have to worry if you don't understand the actual intricacies of the loss function. We will discuss quite a few of them in later chapters. Next, the neural network parameters, W, b, W _s, and b _s, are optimized using a standard stochastic optimizer (for example, the stochastic gradient descent) to reduce the classification loss all the inputs. Figure 1.5 depicts the process explained in this paragraph for a three-layer FCNN. We will walk-through the details on how to use such a model for NLP tasks, step by step in Chapter 3, Word2vec – Learning Word Embeddings.

Understanding a simple deep model – a Fully-Connected Neural Network

Figure 1.5: An example of a Fully Connected Neural Network (FCNN)

Let's look at an example of how to use a neural network for a sentiment analysis task. Consider that we have a dataset where the input is a sentence expressing a positive or negative opinion about a movie and a corresponding label saying if the sentence is actually positive (1) or negative (0). Then, we are given a test data set, where we have single sentence movie reviews, and our task is to classify these new sentences as positive or negative.

It is possible to use a neural network (which can be deep or shallow, depending on the difficulty of the task) for this task by adhering to the following workflow:

Tokenize the sentence by words
Pad the sentences with a special token if necessary, to bring all sentences to a fixed length
Convert the sentences into a numerical representation (for example, Bag-of-Words representation)
Feed the numerical inputs to the neural network and predict the output (positive or negative)
Optimize the neural network using a desired loss function