Hochreiter and Schmidhuber studied the problems of vanishing and exploding gradients extensively and came up with a solution called Long Short-Term Memory (LSTM, https://www.bioinf.jku.at/publications/older/2604.pdf). LSTMs can handle long-term dependencies thanks to a specially crafted memory cell. In fact, they work so well that most of the current accomplishments in training RNNs on a variety of problems are due to the use of LSTMs. In this section, we'll explore how this memory cell works and how it solves the vanishing gradient problem.
The key idea of LSTM is the cell state, c_t (in addition to the hidden RNN state, h_t), to which information can only be explicitly written or removed, so that the state stays constant if there is no outside interference. The cell state can be modified only by specific gates, which are a way to let information pass through selectively. A gate is typically composed of a sigmoid function and an element-wise multiplication: the sigmoid outputs values between 0 and 1 that determine how much of the corresponding information is allowed through.
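To make the mechanics concrete, here is a minimal sketch of a single LSTM step in NumPy. The function name lstm_cell and the stacked parameter matrices W, U, and b are illustrative choices, not from the original text; the gate equations themselves follow the standard LSTM formulation.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the stacked parameters for the
    forget, input, and output gates and the candidate cell state."""
    # Compute all four pre-activations at once: shape (4 * hidden,)
    z = W @ x_t + U @ h_prev + b
    hidden = h_prev.shape[0]
    f = sigmoid(z[0 * hidden:1 * hidden])  # forget gate
    i = sigmoid(z[1 * hidden:2 * hidden])  # input gate
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate
    g = np.tanh(z[3 * hidden:4 * hidden])  # candidate cell state
    # The cell state is changed only by element-wise gated writes:
    c_t = f * c_prev + i * g
    # The hidden state is a gated read of the cell state:
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# Usage: random parameters for input size 3, hidden size 4
rng = np.random.default_rng(0)
input_size, hidden = 3, 4
W = rng.normal(size=(4 * hidden, input_size))
U = rng.normal(size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for x in rng.normal(size=(5, input_size)):  # a sequence of 5 steps
    h, c = lstm_cell(x, h, c, W, U, b)
print(h.shape, c.shape)  # (4,) (4,)
```

Note how c_t is updated only through element-wise gated writes (f * c_prev + i * g), not through a repeated matrix multiplication. When the forget gate stays close to 1, this creates a near-identity path through time along which gradients can flow largely undiminished, which is how the memory cell addresses the vanishing gradient problem described above.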