In the previous section, on the issues with traditional RNNs, we saw that an RNN struggles when there is a long-term dependency in the input. For example, imagine the input sentence is as follows:
I live in India. I speak ____.
The blank in the preceding sentence can be filled by looking at the key word, India, which appears three time steps before the word we are trying to predict.
In a similar manner, when the key word is far away from the word we are trying to predict, the gradients that must flow back through the many intervening time steps either vanish or explode, and this problem has to be addressed before the network can learn such a long-range dependency.
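To get an intuition for why distance matters, here is a minimal, hypothetical sketch (not from the original text) in NumPy. It repeatedly multiplies a gradient by an assumed recurrent weight matrix `W`, which is what backpropagation through time does at each step, and prints how the gradient norm shrinks (or, with larger weights, blows up) as we move further back in time; the activation-function derivative is ignored for simplicity.

```python
import numpy as np

np.random.seed(0)
hidden_size = 4
# Assumed recurrent weight matrix; the 0.5 scale makes gradients shrink.
# A scale such as 1.5 would make them explode instead.
W = np.random.randn(hidden_size, hidden_size) * 0.5

grad = np.eye(hidden_size)  # gradient at the current time step
for step in range(1, 11):
    grad = W.T @ grad       # backpropagate one step through time
    print(f"steps back: {step:2d}  gradient norm: {np.linalg.norm(grad):.6f}")
```

Running this, the norm decays roughly geometrically with the number of steps, which is why the signal from a key word many time steps in the past contributes almost nothing to the weight updates of a plain RNN.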