In the backpropagation algorithm, the weights are adjusted in proportion to the gradient of the error, and because of the way these gradients are computed through the chain rule, two problems can arise:
- If the weights are small, the gradient signal can become so small that learning either slows down dramatically or stops working altogether. This is often referred to as vanishing gradients.
- If the weights are large, the gradient signal can become so large that learning diverges. This is often referred to as exploding gradients. The sketch after this list illustrates both cases.
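Below is a minimal NumPy sketch (not taken from the text; the depth, layer width, and weight scales are illustrative assumptions) showing how the scale of the weights alone can make the backpropagated gradient collapse or blow up as it passes through many layers:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 64                      # hypothetical network depth and layer width

def backprop_gradient_norm(weight_scale):
    """Push a gradient backwards through `depth` linear layers whose
    weights are drawn with standard deviation `weight_scale`."""
    grad = rng.normal(size=width)          # gradient arriving at the top layer
    for _ in range(depth):
        W = rng.normal(scale=weight_scale, size=(width, width))
        grad = W.T @ grad                  # chain rule: multiply by the layer Jacobian
    return np.linalg.norm(grad)

print("small weights ->", backprop_gradient_norm(0.01))  # norm shrinks toward 0: vanishing
print("large weights ->", backprop_gradient_norm(1.0))   # norm grows geometrically: exploding
```

With small weights, each backward step multiplies the gradient by a factor well below one, so after a few dozen layers it is numerically indistinguishable from zero; with large weights the same product grows geometrically instead.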
The vanishing/exploding gradient problem also afflicts RNNs. In fact, BPTT unrolls the RNN into a very deep feed-forward neural network. The inability of a plain RNN to maintain long-term context is due precisely to this phenomenon: if the gradient vanishes or explodes as it flows back through the unrolled time steps, the influence of distant inputs on the weight update is either lost or becomes unstable, and long-range dependencies cannot be learned.
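The same geometric effect can be made visible for the unrolled network. In this hedged sketch (the hidden size, number of time steps, and spectral radii are assumptions chosen for illustration), the gradient flowing back through many time steps is repeatedly multiplied by the transpose of the recurrent weight matrix, so its norm is governed by that matrix's spectral radius:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden, steps = 32, 100                        # illustrative hidden size and unroll length

for spectral_radius in (0.5, 1.5):             # contracting vs. expanding recurrent matrix
    W_hh = rng.normal(size=(hidden, hidden))
    W_hh *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W_hh)))
    grad = rng.normal(size=hidden)             # gradient at the last time step
    for _ in range(steps):
        grad = W_hh.T @ grad                   # one step of backpropagation through time
    print(f"spectral radius {spectral_radius}: gradient norm after {steps} steps "
          f"= {np.linalg.norm(grad):.3e}")
```

This linear toy model ignores the tanh nonlinearity, whose derivative is at most one and can therefore only make the vanishing case worse.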