We have seen, in the previous chapter, that recurrent neural networks provide decent performance when working with sequential data. One of the key advantages of LSTM networks is that they address the vanishing gradient problem, which makes network training difficult for long sequences of words or integers. Gradients are used to update the RNN parameters, and over a long sequence these gradients become smaller and smaller until, effectively, no network training can take place. LSTM networks help overcome this problem and make it possible to capture long-term dependencies between words or integers that are separated by a large distance in a sequence. For example, consider the following two sentences, where the first sentence is short and the second sentence is relatively longer:
- Sentence...