Summary
In this chapter, we started by understanding why plain RNNs are impractical for very long sequences – the main culprit being the vanishing gradient problem, which makes it difficult to model long-range dependencies. We saw that the LSTM addresses this issue and performs extremely well on long sequences, but it is rather complicated and has a large number of parameters. The GRU is an excellent alternative that simplifies the LSTM and works well on smaller datasets.
Then, we looked at ways to extract more power from these RNNs by using bidirectional RNNs and stacking layers of RNNs. We also discussed attention mechanisms, a significant new approach that provides state-of-the-art results in translation and can also be applied to other sequence-processing tasks. All of these are extremely powerful models that have changed how several tasks are performed and form the basis of today's state-of-the-art models. With active research in...
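To tie these ideas together, here is a minimal sketch of a stacked, bidirectional LSTM classifier, assuming tf.keras is used as in the rest of the book; the vocabulary size, embedding width, and layer sizes are illustrative assumptions rather than values from the chapter, and swapping the LSTM layers for GRU layers gives the lighter variant discussed above:

```python
import tensorflow as tf

vocab_size = 10000   # assumed vocabulary size (illustrative)
embedding_dim = 64   # assumed embedding width (illustrative)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, mask_zero=True),
    # First recurrent layer: bidirectional, returning the full sequence
    # so that a second recurrent layer can be stacked on top of it.
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True)),
    # Second (stacked) bidirectional layer; replace LSTM with GRU here
    # to trade some capacity for fewer parameters.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()
```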