While backpropagating through an RNN, we encountered a problem called vanishing gradients. Because of the vanishing gradient problem, we cannot train the network properly, and as a result the RNN cannot retain long sequences in memory. To understand what we mean by this, let's consider a small sentence:
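Before looking at the example, here is a minimal numerical sketch (not the RNN itself, just toy values) of why the gradient vanishes: backpropagation through time multiplies the gradient by the recurrent Jacobian at every time step, so when that Jacobian's norm is below 1, the gradient shrinks geometrically. The weight scale and the stand-in for the tanh derivative are assumptions chosen only for illustration:

```python
import numpy as np

np.random.seed(0)
hidden_size = 4
W_hh = np.random.randn(hidden_size, hidden_size) * 0.3   # small recurrent weights (toy values)
grad = np.ones(hidden_size)                               # gradient arriving at the last time step

for t in range(1, 31):
    # stand-in for the tanh derivative 1 - tanh(h)^2, which always lies in (0, 1]
    tanh_derivative = np.random.uniform(0.1, 1.0, hidden_size)
    # one step of backpropagation through time: multiply by the recurrent Jacobian
    grad = W_hh.T @ (tanh_derivative * grad)
    if t % 10 == 0:
        print(f"step {t:2d}: gradient norm = {np.linalg.norm(grad):.2e}")
```

Running this prints a gradient norm that collapses toward zero within a few dozen steps, which is why updates coming from distant time steps barely affect the weights.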
The sky is __.
An RNN can easily predict the blank as blue based on the information it has just seen, but it cannot capture long-term dependencies. What does that mean? Let's consider the following sentence to understand the problem better:
Archie lived in China for 13 years. He loves listening to good music. He is a fan of comics. He is fluent in ____.
Now, if we were asked to predict the missing word in the preceding sentence, we would predict it as Chinese, but how did we predict that? We simply remembered the earlier sentences and understood that Archie lived in China for 13 years, so he is most likely fluent in Chinese.