Backpropagating the model's errors in a deep neural network, however, comes with its own complexities. This holds equally true for RNNs, which face their own versions of the vanishing and exploding gradient problems. As we discussed earlier, the activation of the neurons at a given time step is given by the following equation:
a_t = tanh(W_ax · x_t + W_aa · a_(t-1) + b_a)
We saw how W_ax and W_aa are two separate weight matrices that the RNN layers share through time. These matrices are multiplied with the input at the current time step and with the activation from the previous time step, respectively. The two products are then summed, along with a bias term, and passed through a tanh activation function to compute the activation of the neurons at the current time step (t). We then used this activation matrix to compute the predicted output at the current time step (ŷ_t), before...
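To make this per-time-step computation concrete, here is a minimal NumPy sketch of one forward step of a vanilla RNN cell. The function name rnn_step, the toy dimensions, and the output parameters W_ya and b_y used to produce ŷ_t are illustrative assumptions, not the book's own implementation:

```python
import numpy as np

def rnn_step(x_t, a_prev, W_ax, W_aa, W_ya, b_a, b_y):
    """One forward step of a vanilla RNN cell."""
    # a_t = tanh(W_ax · x_t + W_aa · a_(t-1) + b_a)
    a_t = np.tanh(W_ax @ x_t + W_aa @ a_prev + b_a)
    # Predicted output ŷ_t from the current activation (softmax here, as an assumption)
    z_t = W_ya @ a_t + b_y
    y_hat_t = np.exp(z_t) / np.sum(np.exp(z_t))
    return a_t, y_hat_t

# Toy dimensions (hypothetical): 3 input features, 5 hidden units, 2 output classes
rng = np.random.default_rng(0)
n_x, n_a, n_y = 3, 5, 2
x_t = rng.standard_normal(n_x)        # input at the current time step
a_prev = np.zeros(n_a)                # activation from the previous time step
W_ax = rng.standard_normal((n_a, n_x))
W_aa = rng.standard_normal((n_a, n_a))
W_ya = rng.standard_normal((n_y, n_a))
b_a = np.zeros(n_a)
b_y = np.zeros(n_y)

a_t, y_hat_t = rnn_step(x_t, a_prev, W_ax, W_aa, W_ya, b_a, b_y)
```

Note that W_ax, W_aa, b_a (and here W_ya, b_y) are created once and reused at every time step; only x_t and a_prev change as the sequence is processed.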