To reduce the long training time, we can compute the gradient every k1 time steps instead of at every step. This divides the number of gradient computations by k1, making the training of the network faster.
Instead of backpropagating through all the time steps, we can also limit the propagation to the last k2 steps. This mitigates the vanishing gradient problem, since the gradient depends on at most W^k2 (the recurrent weight matrix raised to the power k2). It also reduces the computation needed for each gradient. However, the network will be less likely to learn long-term temporal relations.
The combination of these two techniques is called truncated backpropagation through time, with its two parameters commonly referred to as k1 and k2. Both must be tuned to achieve a good trade-off between training speed and model performance.
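To make this concrete, here is a minimal sketch of truncated backpropagation in PyTorch, under the common simplification k1 = k2 = chunk_len (an assumption made for illustration, not stated in the text): the gradient is computed once per chunk, and the hidden state is detached at chunk boundaries so backpropagation never reaches further than chunk_len steps into the past. All sizes and the dummy data are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for illustration
seq_len, batch, n_in, n_hidden, chunk_len = 100, 8, 10, 32, 20

rnn = nn.RNN(n_in, n_hidden)
readout = nn.Linear(n_hidden, 1)
optimizer = torch.optim.SGD(
    list(rnn.parameters()) + list(readout.parameters()), lr=0.01
)

inputs = torch.randn(seq_len, batch, n_in)    # dummy input sequence
targets = torch.randn(seq_len, batch, 1)      # dummy targets

hidden = torch.zeros(1, batch, n_hidden)
for start in range(0, seq_len, chunk_len):
    chunk_x = inputs[start:start + chunk_len]
    chunk_y = targets[start:start + chunk_len]

    # Detach the hidden state: the backward pass stops here,
    # so gradients flow at most chunk_len steps into the past (k2 truncation)
    hidden = hidden.detach()

    output, hidden = rnn(chunk_x, hidden)
    loss = nn.functional.mse_loss(readout(output), chunk_y)

    optimizer.zero_grad()
    loss.backward()   # one gradient computation per chunk (k1 truncation)
    optimizer.step()
```

Choosing k1 = k2 keeps the implementation simple; decoupling the two parameters requires storing intermediate states and backpropagating over a sliding window, at the cost of extra bookkeeping.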
This technique—while powerful—remains a workaround for a fundamental RNN problem. In the next section, we will introduce a change of architecture...