LSTMs, GRUs, and Other Variants
The idea behind plain RNNs is very powerful, and the architecture has shown tremendous promise. For this reason, researchers have experimented with variations of the RNN architecture to overcome its one major drawback, the vanishing gradient problem, while still exploiting the power of RNNs. This led to the development of long short-term memory networks (LSTMs) and gated recurrent units (GRUs), which have now practically replaced plain RNNs. Indeed, these days, when we refer to RNNs, we usually mean LSTMs, GRUs, or their variants.
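In practice, swapping a plain RNN for one of these variants is usually a one-line change in modern deep learning libraries. The following is only a minimal sketch, assuming TensorFlow/Keras and hypothetical values for the vocabulary size, embedding dimension, and number of recurrent units; it builds the same small sequence classifier with each of the three recurrent cells:

```python
# Minimal sketch: LSTM and GRU layers as drop-in replacements for a plain RNN.
# Assumes TensorFlow/Keras; the sizes below are illustrative, not prescriptive.
import tensorflow as tf

vocab_size = 10000   # hypothetical vocabulary size
embedding_dim = 64   # hypothetical embedding dimension
hidden_units = 128   # hypothetical number of recurrent units

def build_model(recurrent_layer):
    """Build the same sequence classifier around a chosen recurrent layer."""
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim),
        recurrent_layer,
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

# The three models differ only in the recurrent cell they use.
plain_rnn_model = build_model(tf.keras.layers.SimpleRNN(hidden_units))
lstm_model = build_model(tf.keras.layers.LSTM(hidden_units))
gru_model = build_model(tf.keras.layers.GRU(hidden_units))
```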
These variants are designed specifically to handle the vanishing gradient problem and learn long-range dependencies: they use gating mechanisms that control how much information is carried forward or forgotten at each step. Both have significantly outperformed plain RNNs on most sequence modeling tasks, and the gap is especially pronounced for long sequences. The paper titled Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (available at https://arxiv.org/abs/1406.1078) performs an empirical analysis of the performance...