How LSTMs solve the vanishing gradient problem
As we discussed earlier, even though RNNs are theoretically sound, in practice they suffer from a serious drawback: when Backpropagation Through Time (BPTT) is used, the gradient diminishes quickly, so information can be propagated back through only a few time steps. Consequently, the network can retain information from only the most recent time steps, giving it only short-term memory. This in turn limits the usefulness of RNNs in real-world sequential tasks.
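To see why the gradient diminishes, here is a minimal NumPy sketch (not from the original text) of BPTT on a toy one-unit linear RNN with hidden state h_t = w * h_{t-1}. By the chain rule, the gradient of h_T with respect to h_0 is w^T, so it shrinks exponentially when |w| < 1. The weight value 0.5 and the 50-step horizon are arbitrary choices for illustration:

```python
import numpy as np

w = 0.5          # recurrent weight with |w| < 1
T = 50           # number of unrolled time steps

grad = 1.0       # d h_0 / d h_0
for t in range(1, T + 1):
    grad *= w    # chain rule: multiply by d h_t / d h_{t-1} = w at each step
    if t % 10 == 0:
        print(f"d h_{t} / d h_0 = {grad:.3e}")
```

After only 10 steps the gradient is already around 1e-3, and by 50 steps it is effectively zero, which is why plain RNNs struggle to learn dependencies spanning many time steps.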
Often, useful and interesting sequential tasks (such as stock market prediction or language modeling) require the ability to learn and store long-term dependencies. Consider the following next-word prediction example:
John is a talented student. He is an A-grade student and plays rugby and cricket. All the other students envy ______.
For us, this is a very easy task. The answer would be John. However, for an RNN, this is a difficult task. We are trying to predict...