Stacked RNNs
Now, let's look at another approach for extracting more power from RNNs. In all the models we've built in this chapter, the recurrent part has consisted of a single layer (plain RNN, LSTM, or GRU). Going deeper, that is, adding more layers, typically helps feedforward networks learn more complex patterns and features in the deeper layers, so there is merit in trying the same idea for recurrent networks. Indeed, stacked RNNs do tend to give us more predictive power.
The following figure illustrates a simple two-layer stacked LSTM model. Stacking RNNs simply means feeding the output of one RNN layer into another RNN layer. An RNN layer can output a full sequence (that is, an output at each time step), and this sequence can be fed, like any input sequence, into the subsequent RNN layer. In terms of implementation, stacking RNNs is as simple as having one layer return its sequence of outputs and providing that sequence as input to the next RNN layer, as sketched below.
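As a minimal sketch of this idea (assuming a Keras/TensorFlow setup; the vocabulary size, unit counts, sequence length, and classification head below are illustrative assumptions rather than values from the text), the key detail is setting return_sequences=True on every stacked recurrent layer except the last:

```python
import numpy as np
from tensorflow.keras import layers, models

# Two stacked LSTM layers: the first returns its output at every time step
# (return_sequences=True), so the second LSTM receives a full sequence,
# just as the first one did. All sizes here are illustrative assumptions.
model = models.Sequential([
    layers.Embedding(input_dim=10000, output_dim=64),  # hypothetical vocabulary of 10,000 tokens
    layers.LSTM(64, return_sequences=True),            # outputs shape (batch, time_steps, 64)
    layers.LSTM(32),                                    # outputs only the final hidden state, shape (batch, 32)
    layers.Dense(1, activation="sigmoid"),              # e.g. a binary classification head
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Quick shape check on a dummy batch of 2 sequences, each 20 tokens long.
dummy = np.random.randint(0, 10000, size=(2, 20))
print(model(dummy).shape)  # (2, 1)
```

If the first LSTM did not return its full sequence, the second LSTM would receive a single vector instead of a sequence and the model would fail to build; returning sequences is what makes the layers stackable.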