Trust Region Methods
In this chapter, we will look at approaches for improving the stability of the stochastic policy gradient method. Several attempts have been made to make policy improvement more stable, and we will focus on three of them:
- Proximal policy optimization (PPO)
- Trust region policy optimization (TRPO)
- Advantage actor-critic (A2C) using Kronecker-factored trust region (ACKTR)
In addition, we will compare these methods with a relatively new off-policy method called soft actor-critic (SAC), which is an evolution of the deep deterministic policy gradients (DDPG) method described in Chapter 15. To compare them to the A2C baseline, we will use several of the so-called locomotion gym environments shipped with Farama Gymnasium (using MuJoCo and PyBullet). We will also do a head-to-head comparison...
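To give a feel for the environments used in the comparison, here is a minimal sketch of creating one of Gymnasium's MuJoCo locomotion environments and stepping it with random actions. The choice of HalfCheetah-v4 is purely illustrative, and it assumes the MuJoCo extras are installed (for example, `pip install "gymnasium[mujoco]"`):

```python
import gymnasium as gym

# Illustrative only: HalfCheetah-v4 is one of the MuJoCo locomotion
# environments registered by Gymnasium.
env = gym.make("HalfCheetah-v4")

obs, info = env.reset(seed=0)
print("Observation shape:", env.observation_space.shape)
print("Action space:", env.action_space)

# Take a few random steps to exercise the continuous-control loop.
for _ in range(5):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()

env.close()
```

The continuous action spaces of these locomotion tasks are what make them a good stress test for the policy gradient variants covered in this chapter.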