Introduction
The overall motivation of the methods that we'll take a look at is to improve the stability of the policy update during training. Intuitively, there is a dilemma: on the one hand, we'd like to train as fast as we can, making large steps during the Stochastic Gradient Descent (SGD) update. On the other hand, a large update of the policy is usually a bad idea: the policy is a highly nonlinear function of its parameters, so a large step can ruin what we've just learned. Things can become even worse in the RL setting, because a single bad policy update isn't something that subsequent updates can simply recover from. Instead, the bad policy will bring us bad experience samples that we'll use in subsequent training steps, which could break our policy completely. Thus, we want to avoid making large updates at all costs.

One of the naive solutions would be to use a small learning rate and take baby steps during the SGD, but this would significantly slow down convergence.
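To make the dilemma a bit more concrete, here is a small, purely illustrative PyTorch sketch (the network, the fake batch, and the learning rates are my own assumptions, not code from this chapter). It applies the same vanilla policy-gradient step with a small and with a large learning rate and measures how far the action distribution moves, using the KL divergence between the old and new policies as the distance:

```python
# Illustrative sketch only: compare how much one SGD step shifts the policy
# distribution for a small versus a large learning rate.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
states = torch.randn(64, 4)           # fake batch of observations
actions = torch.randint(0, 2, (64,))  # fake actions "taken" in those states
advantages = torch.randn(64)          # fake advantage estimates


def update(net, lr):
    """Apply one vanilla policy-gradient step with the given learning rate."""
    net = copy.deepcopy(net)          # keep the original policy untouched
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    log_probs = F.log_softmax(net(states), dim=1)
    chosen = log_probs[torch.arange(len(actions)), actions]
    loss = -(advantages * chosen).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return net


def kl_from(old_net, new_net):
    """Mean KL(old || new) of the action distributions over the batch."""
    with torch.no_grad():
        old_lp = F.log_softmax(old_net(states), dim=1)
        new_lp = F.log_softmax(new_net(states), dim=1)
        return (old_lp.exp() * (old_lp - new_lp)).sum(dim=1).mean().item()


for lr in (1e-3, 1.0):
    new_policy = update(policy, lr)
    print(f"lr={lr}: KL(old || new) = {kl_from(policy, new_policy):.6f}")
```

With the tiny learning rate the distribution barely moves, while the large step can push the policy far away from its previous behavior, which is exactly the kind of jump the methods in this chapter try to keep under control.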