Vanishing and exploding gradients
Just like traditional neural networks, training the RNN also involves backpropagation. The difference in this case is that since the parameters are shared by all time steps, the gradient at each output depends not only on the current time step, but also on the previous ones. This process is called backpropagation through time (BPTT) (for more information refer to the article: Learning Internal Representations by Backpropagating errors, by G. E. Hinton, D. E. Rumelhart, and R. J. Williams, Parallel Distributed Processing: Explorations in the Microstructure of Cognition 1, 1985):
Consider the small three layer RNN shown in the preceding diagram. During the forward propagation (shown by the solid lines), the network produces predictions that are compared to the labels to compute a loss Lt at each time step. During backpropagation (shown by dotted lines), the gradients of the loss with respect to the parameters U, V, and W are computed at each time step and...