GRU
GRUs are an evolution of the basic RNN architecture, specifically designed to address some of the challenges encountered with traditional RNNs, such as the vanishing gradient problem. The architecture of a GRU is illustrated in Figure 10.11:
Figure 10.11: GRU
Let us start discussing the GRU with the first activation function, annotated as A. At each timestep $t$, the GRU first calculates the hidden state using the tanh activation function, taking $x_t$ and $h_{t-1}$ as inputs. The calculation is no different from how the hidden state is determined in the original RNNs presented in the previous section. But there is an important difference: the output is a candidate hidden state, which is calculated using Eq. 10.6:

$$\tilde{h}_t = \tanh(W_x x_t + W_h h_{t-1} + b) \tag{10.6}$$

where $\tilde{h}_t$ is the candidate value of the hidden state.
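To make Eq. 10.6 concrete, here is a minimal NumPy sketch of the candidate hidden state computation. The weight names `W_x`, `W_h`, and `b`, as well as the toy dimensions, are illustrative assumptions rather than values taken from the text; the point is only that this step looks exactly like a vanilla RNN update.

```python
import numpy as np

def candidate_hidden_state(x_t, h_prev, W_x, W_h, b):
    """Eq. 10.6: candidate hidden state, computed like a vanilla RNN step."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Toy dimensions: input size 3, hidden size 4 (illustrative values only).
rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 3))   # input-to-hidden weights
W_h = rng.normal(size=(4, 4))   # hidden-to-hidden weights
b = np.zeros(4)                 # bias

x_t = rng.normal(size=3)        # input at timestep t
h_prev = np.zeros(4)            # previous hidden state h_{t-1}

h_tilde = candidate_hidden_state(x_t, h_prev, W_x, W_h, b)
print(h_tilde)                  # values lie in (-1, 1) because of tanh
```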
Now, instead of using the candidate hidden state straight away, the GRU takes a moment to decide whether to use it. Imagine someone pausing to think before making a decision. This pause-and-think step...