As in the neural networks we have already encountered, RNNs update their parameters using backpropagation: we find the gradient of the error (loss) with respect to the weights. Here, however, it is referred to as Backpropagation Through Time (BPTT), because the network is unrolled across time steps and the error is propagated backward through each of them. I know the name sounds cool, but it has nothing to do with time travel; it's still just good old backpropagation with gradient descent for the parameter updates.
Using BPTT, we want to find out how much the hidden units and the output affect the total error, as well as how much changing the weights (U, V, W) affects the output, and through it the error. W, as we know, is shared across all time steps, so its gradient collects a contribution from every step, and we need to traverse all the way back to the initial time step to make an update to it.
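To make the idea concrete, here is a minimal NumPy sketch of BPTT for a tiny RNN. It is only an illustration under assumptions that are mine, not something fixed by the text above: U maps the input to the hidden state, W maps the previous hidden state to the current one, V maps the hidden state to the output, the hidden activation is tanh, the output is linear, and the loss is a summed squared error. The function names (`forward`, `loss`, `bptt`) and the layer sizes are hypothetical as well.

```python
import numpy as np

# Assumed roles (not fixed by the text): U = input-to-hidden,
# W = hidden-to-hidden, V = hidden-to-output.

def forward(xs, U, W, V):
    """Run the RNN over the sequence; return hidden states and outputs."""
    h = np.zeros(W.shape[0])
    hs, ys = [h], []                 # hs[0] is the initial hidden state
    for x in xs:
        h = np.tanh(U @ x + W @ h)   # same U and W reused at every step
        hs.append(h)
        ys.append(V @ h)             # linear output (an assumption)
    return hs, ys

def loss(xs, targets, U, W, V):
    """Summed squared error over all time steps (an assumed loss)."""
    _, ys = forward(xs, U, W, V)
    return 0.5 * sum(np.sum((y - t) ** 2) for y, t in zip(ys, targets))

def bptt(xs, targets, U, W, V):
    """Gradients of the loss w.r.t. U, W, V via backpropagation through time."""
    hs, ys = forward(xs, U, W, V)
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    dh_next = np.zeros(W.shape[0])   # gradient flowing in from step t + 1
    for t in reversed(range(len(xs))):        # walk back to the first step
        dy = ys[t] - targets[t]               # dLoss/dy_t
        dV += np.outer(dy, hs[t + 1])         # hs[t + 1] is h_t
        dh = V.T @ dy + dh_next               # error from output and from the future
        dz = (1.0 - hs[t + 1] ** 2) * dh      # chain rule through tanh
        dU += np.outer(dz, xs[t])
        dW += np.outer(dz, hs[t])             # hs[t] is h_{t-1}
        dh_next = W.T @ dz                    # pass the error one step back
    return dU, dW, dV

# Toy data with hypothetical sizes, just to exercise the functions.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 3, 4, 2, 5
U = rng.normal(scale=0.5, size=(n_hidden, n_in))
W = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.5, size=(n_out, n_hidden))
xs = [rng.normal(size=n_in) for _ in range(T)]
targets = [rng.normal(size=n_out) for _ in range(T)]

dU, dW, dV = bptt(xs, targets, U, W, V)

# Plain gradient descent update with an arbitrary learning rate.
lr = 0.01
U_new, W_new, V_new = U - lr * dU, W - lr * dW, V - lr * dV
```

Notice that `dW` and `dU` are accumulated inside the backward loop: because the same weights are reused at every time step, each step adds its own contribution, and `dh_next` is what carries the error from later steps back toward the first one.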
When backpropagating in RNNs, we again apply the chain rule. What makes training...