Recurrent neural networks (RNNs)
We're now going to look at the last of our three artificial neural networks: recurrent neural networks, or RNNs.
RNNs are a family of networks suited to learning representations of sequential data, such as text in Natural Language Processing (NLP) or a stream of sensor data in instrumentation. While an MNIST data sample is not sequential in nature, it is not hard to imagine that every image can be interpreted as a sequence of rows or columns of pixels. Thus, a model based on RNNs can process each MNIST image as a sequence of 28-element input vectors with timesteps equal to 28. Figure 1.5.1 shows the RNN model.
Listing 1.5.1, rnn-mnist-1.5.1.py, shows the Keras code for MNIST digit classification using RNNs:

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense, Activation, SimpleRNN
    from keras.utils import to_categorical, plot_model
    from keras.datasets import mnist

    # load mnist dataset
    (x_train, y_train), (x_test, y_test) = mnist.load_data()

    # compute the number of labels
    num_labels = len(np.unique(y_train))

    # convert to one-hot vector
    y_train = to_categorical(y_train)
    y_test = to_categorical(y_test)

    # resize and normalize
    image_size = x_train.shape[1]
    x_train = np.reshape(x_train, [-1, image_size, image_size])
    x_test = np.reshape(x_test, [-1, image_size, image_size])
    x_train = x_train.astype('float32') / 255
    x_test = x_test.astype('float32') / 255

    # network parameters
    input_shape = (image_size, image_size)
    batch_size = 128
    units = 256
    dropout = 0.2

    # model is RNN with 256 units, input is 28-dim vector 28 timesteps
    model = Sequential()
    model.add(SimpleRNN(units=units,
                        dropout=dropout,
                        input_shape=input_shape))
    model.add(Dense(num_labels))
    model.add(Activation('softmax'))
    model.summary()
    plot_model(model, to_file='rnn-mnist.png', show_shapes=True)

    # loss function for one-hot vector
    # use of sgd optimizer
    # accuracy is good metric for classification tasks
    model.compile(loss='categorical_crossentropy',
                  optimizer='sgd',
                  metrics=['accuracy'])

    # train the network
    model.fit(x_train, y_train, epochs=20, batch_size=batch_size)

    loss, acc = model.evaluate(x_test, y_test, batch_size=batch_size)
    print("\nTest accuracy: %.1f%%" % (100.0 * acc))

There are two main differences between the RNN and the two previous models. The first is input_shape = (image_size, image_size), which is actually input_shape = (timesteps, input_dim): a sequence of timesteps vectors, each of dimension input_dim. The second is the use of a SimpleRNN layer to represent an RNN cell with units=256. The units variable represents the number of output units. Whereas the CNN is characterized by the convolution of a kernel across the input feature map, the RNN output is a function not only of the present input but also of the previous output, or hidden state. Since the previous output is itself a function of the input and output that came before it, the current output depends on the whole sequence seen so far. The SimpleRNN layer in Keras is a simplified version of the true RNN. The following equation describes the output of SimpleRNN:
$$h_t = \tanh(b + W h_{t-1} + U x_t) \tag{1.5.1}$$
In this equation, b is the bias, while W and U are called the recurrent kernel (weights for the previous output) and the kernel (weights for the current input), respectively. The subscript t indicates the position in the sequence. For a SimpleRNN layer with units=256, the total number of parameters is 256 + 256 × 256 + 256 × 28 = 72,960, corresponding to the b, W, and U contributions.
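To make the recurrence and the parameter count concrete, here is a minimal NumPy sketch of a single SimpleRNN step. It is an illustration rather than the Keras implementation: the weights are random, the function name simple_rnn_step is hypothetical, and vectors are treated as columns so that the code matches the equation term by term (Keras stores its kernels in the transposed layout).

```python
import numpy as np


def simple_rnn_step(h_prev, x_t, W, U, b):
    """Apply h_t = tanh(b + W h_{t-1} + U x_t) once (hypothetical helper)."""
    return np.tanh(b + W @ h_prev + U @ x_t)


units, input_dim = 256, 28

rng = np.random.default_rng(0)
W = rng.normal(scale=0.05, size=(units, units))      # recurrent kernel
U = rng.normal(scale=0.05, size=(units, input_dim))  # kernel
b = np.zeros(units)                                  # bias

h_prev = np.zeros(units)       # initial hidden state
x_t = rng.random(input_dim)    # stand-in for one 28-pixel image row

h_t = simple_rnn_step(h_prev, x_t, W, U, b)

num_params = b.size + W.size + U.size
print(h_t.shape)    # (256,)
print(num_params)   # 72960
```

The count depends only on units and input_dim, not on the number of timesteps.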
The following figure shows the diagrams of both the SimpleRNN and the RNN used in the MNIST digit classification. What makes SimpleRNN simpler than the RNN is the absence of the output values $o_t = V h_t + c$ before the softmax is computed.
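The output stage that the full RNN adds can be sketched in NumPy as follows. This is an assumption-laden illustration: the hidden state and the weights V and c are random stand-ins, and softmax is a small helper written here rather than a library call. In Listing 1.5.1, the Dense layer followed by the softmax activation appears to play this role on top of SimpleRNN.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


units, num_labels = 256, 10
rng = np.random.default_rng(0)

h_t = np.tanh(rng.normal(size=units))                  # stand-in hidden state
V = rng.normal(scale=0.05, size=(num_labels, units))   # output weights
c = np.zeros(num_labels)                               # output bias

o_t = V @ h_t + c      # output values, one per digit class
y_t = softmax(o_t)     # class probabilities

print(y_t.shape)       # (10,)
```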
RNNs might initially be harder to understand than MLPs or CNNs. In MLPs, the perceptron is the fundamental unit. Once the concept of the perceptron is understood, MLPs are just a network of perceptrons. In CNNs, the kernel is a patch or window that slides through the feature map to generate another feature map. In RNNs, the most important concept is the self-loop. There is in fact just one cell.
The illusion of multiple cells appears when the network is drawn unrolled, with one cell per timestep. In fact, it is the same cell reused at every timestep, so the underlying weights are shared across all of the cells in the unrolled diagram.
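The self-loop can be sketched as an ordinary for loop over the rows of an image. The sketch below assumes random weights and a random array standing in for a normalized MNIST image; run_rnn is a hypothetical name. The point is that a single set of weights (W, U, b) is applied at every timestep, whatever the sequence length.

```python
import numpy as np

units, timesteps, input_dim = 256, 28, 28
rng = np.random.default_rng(0)

# One set of weights: this is the single cell.
W = rng.normal(scale=0.05, size=(units, units))      # recurrent kernel
U = rng.normal(scale=0.05, size=(units, input_dim))  # kernel
b = np.zeros(units)                                  # bias


def run_rnn(sequence):
    """Feed a (timesteps, input_dim) array through the one cell (hypothetical helper)."""
    h = np.zeros(units)
    for x_t in sequence:                  # the self-loop: h feeds back into the same cell
        h = np.tanh(b + W @ h + U @ x_t)
    return h


image = rng.random((timesteps, input_dim))   # stand-in for one normalized image

h_last = run_rnn(image)         # 28 timesteps
h_short = run_rnn(image[:10])   # 10 timesteps, same weights

print(h_last.shape, h_short.shape)   # (256,) (256,)
```

Unrolling the loop by hand for two timesteps gives the same result as run_rnn(image[:2]), which is what the unrolled diagram depicts.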
The summary in Listing 1.5.2 indicates that using a SimpleRNN requires fewer parameters than the two previous models. Figure 1.5.3 shows the graphical description of the RNN MNIST digit classifier. The model is very concise. Table 1.5.1 shows that the SimpleRNN has the lowest accuracy among the networks presented.
Listing 1.5.2, RNN MNIST digit classifier summary:

    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    simple_rnn_1 (SimpleRNN)     (None, 256)               72960
    _________________________________________________________________
    dense_1 (Dense)              (None, 10)                2570
    _________________________________________________________________
    activation_1 (Activation)    (None, 10)                0
    =================================================================
    Total params: 75,530
    Trainable params: 75,530
    Non-trainable params: 0

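The figures in the summary can be checked by hand. The short sketch below redoes the arithmetic in plain Python, assuming the layer sizes from Listing 1.5.1 (256 units, 28-element input vectors, 10 labels); the variable names are illustrative only.

```python
units, input_dim, num_labels = 256, 28, 10

# SimpleRNN: bias + recurrent kernel + kernel
simple_rnn_params = units + units * units + units * input_dim

# Dense: weight matrix + bias
dense_params = units * num_labels + num_labels

# The Activation layer has no parameters.
total_params = simple_rnn_params + dense_params

print(simple_rnn_params, dense_params, total_params)   # 72960 2570 75530
```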
Units | Optimizer | Regularizer | Train Accuracy, % | Test Accuracy, % |
---|---|---|---|---|
256 | SGD | Dropout(0.2) | 97.26 | 98.00 |
256 | RMSprop | Dropout(0.2) | 96.72 | 97.60 |
256 | Adam | Dropout(0.2) | 96.79 | 97.40 |
512 | SGD | Dropout(0.2) | 97.88 | 98.30 |
Table 1.5.1: The different SimpleRNN network configurations and performance measures
In many deep neural networks, other members of the RNN family are more commonly used. For example, Long Short-Term Memory (LSTM) networks have been used in both machine translation and question answering problems. LSTM networks address the problem of long-term dependency, that is, remembering past information that is relevant to the present output.
Compared with the RNN or SimpleRNN, the internal structure of the LSTM cell is more complex. Figure 1.5.4 shows a diagram of the LSTM in the context of MNIST digit classification. The LSTM uses not only the present input and past outputs or hidden states; it also introduces a cell state, $s_t$, that carries information from one cell to the next. Information flow between cell states is controlled by three gates, $f_t$, $i_t$, and $q_t$. The three gates determine which information should be retained or replaced, and how much of the past information and the current input should contribute to the current cell state or output. We will not discuss the details of the internal structure of the LSTM cell in this book. However, an intuitive guide to LSTM can be found at: http://colah.github.io/posts/2015-08-Understanding-LSTMs.
The LSTM() layer can be used as a drop-in replacement for SimpleRNN(). If LSTM is overkill for the task at hand, a simpler version called the Gated Recurrent Unit (GRU) can be used. The GRU simplifies the LSTM by combining the cell state and hidden state. It also reduces the number of gates by one. The GRU() layer can also be used as a drop-in replacement for SimpleRNN().
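As a rough sketch of what drop-in replacement means in practice, the helper below builds the same classifier around whichever recurrent layer class it is given. The name build_classifier is hypothetical and not part of Listing 1.5.1. The sketch also assumes a Keras version that accepts an Input object as the first entry of a Sequential model, whereas Listing 1.5.1 passes input_shape to the layer instead.

```python
from keras.models import Sequential
from keras.layers import Input, Dense, Activation, SimpleRNN, LSTM, GRU


def build_classifier(recurrent_layer, units=256, dropout=0.2,
                     input_shape=(28, 28), num_labels=10):
    """Build the MNIST sequence classifier around a recurrent layer class.

    Hypothetical helper for illustration only.
    """
    model = Sequential()
    model.add(Input(shape=input_shape))                  # (timesteps, input_dim)
    model.add(recurrent_layer(units, dropout=dropout))   # the only line that varies
    model.add(Dense(num_labels))
    model.add(Activation('softmax'))
    return model


# Only the recurrent layer class changes between the three models.
models = {layer.__name__: build_classifier(layer)
          for layer in (SimpleRNN, LSTM, GRU)}

for name, model in models.items():
    print(name, model.count_params())
```

Compiling and training then proceed as in Listing 1.5.1. The printed counts show the price of the extra gates: the LSTM and GRU models carry several times the parameters of the SimpleRNN model.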
There are many other ways to configure RNNs. One way is to make the RNN model bidirectional. By default, RNNs are unidirectional in the sense that the current output is influenced only by the past states and the current input. In bidirectional RNNs, future states can also influence the present state and the past states by allowing information to flow backward. Past outputs are updated as needed depending on the new information received. RNNs can be made bidirectional by calling a wrapper function. For example, the implementation of a bidirectional LSTM is Bidirectional(LSTM()).
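A minimal sketch of the wrapper in use follows, assuming the same 28 × 28 input and 10-class output as the MNIST example and a Keras version that accepts an Input object in a Sequential model. The layer sizes are illustrative choices, not settings taken from the text.

```python
from keras.models import Sequential
from keras.layers import Input, Dense, Activation, LSTM, Bidirectional

units = 256

model = Sequential()
model.add(Input(shape=(28, 28)))          # (timesteps, input_dim)
# The wrapper runs one LSTM over the rows in order and a second LSTM,
# with its own weights, over the rows in reverse order.
model.add(Bidirectional(LSTM(units)))
model.add(Dense(10))
model.add(Activation('softmax'))
model.summary()
```

With the wrapper's default merge mode, the forward and backward outputs are concatenated, so the Dense layer receives 2 × 256 = 512 features rather than 256.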
For all types of RNNs, increasing the number of units also increases the capacity. Another way of increasing the capacity is to stack RNN layers. Note, though, that as a general rule of thumb, the capacity of the model should be increased only if needed. Excess capacity may contribute to overfitting, and it also lengthens training and slows prediction.
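Stacking can be sketched as follows, again assuming the MNIST input shape and a Keras version that accepts an Input object in a Sequential model. The two-layer depth and the sizes are illustrative. The one detail that is easy to miss is return_sequences=True on every recurrent layer except the last, so that the next recurrent layer receives a full sequence rather than only the final hidden state.

```python
from keras.models import Sequential
from keras.layers import Input, Dense, Activation, SimpleRNN

units = 256

model = Sequential()
model.add(Input(shape=(28, 28)))          # (timesteps, input_dim)
# First layer emits its hidden state at every timestep: (None, 28, 256)
model.add(SimpleRNN(units, dropout=0.2, return_sequences=True))
# Second layer consumes that sequence and returns only its last state: (None, 256)
model.add(SimpleRNN(units, dropout=0.2))
model.add(Dense(10))
model.add(Activation('softmax'))

print(model.count_params())
```

The second SimpleRNN layer sees 256-element inputs instead of 28-element rows, so it alone adds 256 + 256 × 256 + 256 × 256 = 131,328 parameters, which is the kind of capacity increase the rule of thumb above cautions against adding without need.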