For RNNs, however, we not only backpropagate the error through the depth of the network, but also through time. First of all, we compute the total loss by summing the individual losses L<t> over all the time steps:

L = L<1> + L<2> + ... + L<T>
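As a small concrete sketch of this summation (the per-step squared-error loss below is an illustrative assumption, not something specified in the text):

```python
# Total loss of an RNN unrolled over T time steps:
#   L = L<1> + L<2> + ... + L<T>
# The squared-error per-step loss is an illustrative assumption.

def step_loss(y_pred, y_true):
    return (y_pred - y_true) ** 2

def total_loss(y_preds, y_trues):
    # Sum the individual losses over all time steps.
    return sum(step_loss(p, t) for p, t in zip(y_preds, y_trues))

print(total_loss([1.0, 2.0], [0.0, 0.0]))  # 1.0 + 4.0 = 5.0
```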
This means that we can compute the gradient for each time step separately. To greatly simplify the calculations, we will assume that tanh is the identity function (that is, we assume there is no activation function). For instance, at t = 4, we compute the gradient by applying the chain rule:

∂L<4>/∂Wrec = ∂L<4>/∂y<4> · ∂y<4>/∂h<4> · ∂h<4>/∂Wrec
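To make this concrete, here is a scalar sketch under the same tanh-as-identity assumption: it computes ∂h<4>/∂Wrec analytically via the product-rule recursion and checks it against a finite-difference estimate. All numeric values (inputs, weights, initial state) are illustrative assumptions.

```python
# Scalar RNN with identity activation: h<t> = w*h<t-1> + u*x<t> + b.
# Computes dh<4>/dw via the recursion
#   dh<t>/dw = h<t-1> + w * dh<t-1>/dw   (with dh<0>/dw = 0)
# and checks it against a finite-difference estimate.

def forward(w, u, b, xs, h0=0.0):
    hs = [h0]
    for x in xs:
        hs.append(w * hs[-1] + u * x + b)  # identity in place of tanh
    return hs

def grad_h_wrt_w(w, u, b, xs, h0=0.0):
    hs = forward(w, u, b, xs, h0)
    g = 0.0  # dh<0>/dw = 0: the initial state does not depend on w
    for t in range(1, len(hs)):
        g = hs[t - 1] + w * g  # product rule: h<t-1> + w * dh<t-1>/dw
    return g

w, u, b = 0.5, 1.0, 0.1
xs = [0.3, -0.2, 0.7, 0.4]  # four time steps, so the last state is h<4>

analytic = grad_h_wrt_w(w, u, b, xs)
eps = 1e-6
numeric = (forward(w + eps, u, b, xs)[-1] - forward(w - eps, u, b, xs)[-1]) / (2 * eps)
print(abs(analytic - numeric) < 1e-6)  # the two gradients agree
```

The recursion inside `grad_h_wrt_w` is exactly the complication discussed next: differentiating h<4> with respect to w requires the derivative of h<3>, which in turn requires h<2>, and so on back to the initial state.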
Here, we stumble upon a complexity: the third term on the right-hand side, ∂h<4>/∂Wrec, cannot be derived directly. To take the derivative of h<4> with respect to Wrec, the other terms in h<4> would have to be independent of Wrec. However, h<4> also depends on h<3>, and h<3> itself depends on Wrec, since h<3> = tanh(Wrec h<2> + Winput x<3> + b), and so on and so forth until we reach h<0>, which is entirely composed...