The Vanishing Gradient Problem
One of the biggest challenges in training standard deep feedforward neural networks is the vanishing gradient problem (as discussed in Chapter 2, Neural Networks). As the model grows deeper, backpropagating the error all the way back to the initial layers becomes increasingly difficult. Layers close to the output are updated at a healthy pace, but by the time the error signal reaches the initial layers, its magnitude has diminished so much that it has little or no effect on their parameters.
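The following is a minimal NumPy sketch of this effect, not tied to any particular framework; the layer count, width, and weight scale are arbitrary choices for illustration. Backpropagating through a stack of sigmoid layers repeatedly multiplies the gradient by derivatives of at most 0.25 and by modest weights, so its norm shrinks roughly exponentially with depth:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_layers, width = 20, 64
# Hypothetical Xavier-like random weights, for illustration only.
weights = [rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
           for _ in range(n_layers)]

# Forward pass, storing activations for the backward pass.
a = rng.normal(size=width)
activations = [a]
for W in weights:
    a = sigmoid(W @ a)
    activations.append(a)

# Backward pass: start from an arbitrary error signal at the output
# and push it back toward the first layer.
grad = np.ones(width)
for layer in reversed(range(n_layers)):
    a_out = activations[layer + 1]
    # Chain rule through the sigmoid (a_out * (1 - a_out)) and the weights.
    grad = weights[layer].T @ (grad * a_out * (1.0 - a_out))
    if layer % 5 == 0:
        print(f"layer {layer:2d}: ||grad|| = {np.linalg.norm(grad):.2e}")
```

Running this prints a gradient norm that drops by several orders of magnitude between the output-side layers and the initial ones, which is exactly why the early layers barely learn.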
With RNNs, this problem is compounded further, because the parameters need to be updated not only along the network's depth but also across time steps. If the inputs span one hundred time steps (which isn't uncommon, especially when working with text), the network needs to propagate the error calculated at the 100th time step all the way back to the first one. For plain RNNs, this task can be a bit...
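The sketch below shows the same effect for backpropagation through time in a plain (vanilla) RNN with a tanh activation, again using NumPy with hypothetical sizes and a random initialization chosen only for demonstration. The error is defined at the last of 100 time steps, and its gradient with respect to earlier hidden states shrinks rapidly as it is pushed back toward time step 1:

```python
import numpy as np

rng = np.random.default_rng(0)

hidden, steps = 64, 100
W_hh = rng.normal(scale=1.0 / np.sqrt(hidden), size=(hidden, hidden))
W_xh = rng.normal(scale=1.0 / np.sqrt(hidden), size=(hidden, hidden))

# Forward pass over 100 time steps with random inputs, storing hidden states.
h = np.zeros(hidden)
hs = [h]
for t in range(steps):
    x_t = rng.normal(size=hidden)
    h = np.tanh(W_xh @ x_t + W_hh @ h)
    hs.append(h)

# Backward pass: the error signal exists only at the final time step.
grad_h = np.ones(hidden)
for t in reversed(range(steps)):
    # d h_{t+1} / d h_t = diag(1 - h_{t+1}^2) @ W_hh, so the gradient is
    # repeatedly multiplied by W_hh^T and by the tanh derivative.
    grad_h = W_hh.T @ (grad_h * (1.0 - hs[t + 1] ** 2))
    if t % 20 == 0:
        print(f"time step {t:3d}: ||dL/dh_t|| = {np.linalg.norm(grad_h):.2e}")
```

By the time the gradient reaches the first few time steps, its norm is vanishingly small, so whatever happened early in the sequence contributes almost nothing to the parameter updates.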