LSTM
In the last section, we learned about basic RNNs. In theory, simple RNNs should be able to retain even long-term memories. However, in practice, this approach often falls short because of the vanishing gradients problem.
Over the course of many timesteps, the network has a hard time keeping up meaningful gradients. While this is not the focus of this chapter, a more detailed exploration of why this happens can be read in the 1994 paper, Learning long-term dependencies with gradient descent is difficult, available at -https://ieeexplore.ieee.org/document/279181 - by Yoshua Bengio, Patrice Simard, and Paolo Frasconi.
In direct response to the vanishing gradients problem of simple RNNs, the Long Short-Term Memory (LSTM) layer was invented. This layer performs much better at longer time series. Yet, if relevant observations are a few hundred steps behind in the series, then even LSTM will struggle. This is why we manually included some lagged observations.
Before we dive into details, let...