Introduction
In the last chapter, we studied Long Short-Term Memory (LSTM) units, which help combat the vanishing gradient problem. We also studied the gated recurrent unit (GRU) in detail, which has its own way of handling vanishing gradients. Although LSTMs and GRUs reduce this problem in comparison to simple recurrent neural networks, the vanishing gradient problem still manages to prevail in many practical cases. The issue essentially remains the same: longer sentences with complex structural dependencies are challenging for deep learning models to capture. Mitigating the effects of the vanishing gradient problem has therefore remained one of the community's most active research areas.
Attention mechanisms, in the last few years, have attempted to provide a solution to the vanishing gradient problem. The basic concept of an attention mechanism relies on having access to all parts of the input sentence when arriving at an output. This allows the model to place varying amounts of weight, or "attention," on different parts of the input when producing each part of the output.
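As a rough illustration of this idea, the following is a minimal sketch of dot-product attention in NumPy. It is not tied to any particular architecture from this book; the names `query`, `keys`, and `values` and the toy dimensions are chosen here purely for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dot_product_attention(query, keys, values):
    """Weight each input position by its similarity to the query.

    query:  (d,)    current output-side state
    values: (T, d)  one vector per input position, combined into the context
    keys:   (T, d)  vectors the query is compared against
    """
    scores = keys @ query      # similarity of the query to every input position, shape (T,)
    weights = softmax(scores)  # attention weights, non-negative and summing to 1
    context = weights @ values # weighted sum of the input representations, shape (d,)
    return context, weights

# Toy usage: 4 input positions, hidden size 3
rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 3))
values = rng.normal(size=(4, 3))
query = rng.normal(size=(3,))
context, weights = dot_product_attention(query, keys, values)
print(weights)  # larger weights mark the input positions the model "attends" to
```

Because the context vector is a direct weighted sum over all input positions, gradients can flow to every position without passing through a long chain of recurrent steps, which is the intuition behind how attention eases the vanishing gradient problem.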