Proximal Policy Optimization
Now we will look at another policy optimization algorithm called Proximal Policy Optimization (PPO). It is an improvement over TRPO and, because of its performance, has become the default RL algorithm of choice for solving many complex RL problems. It was proposed by researchers at OpenAI to overcome the shortcomings of TRPO. Recall the surrogate objective function of TRPO. It is a constrained optimization problem in which we impose the constraint that the average KL divergence between the old and new policy should be less than $\delta$. But the problem with TRPO is that it is computationally expensive: it uses the conjugate gradient method to solve this constrained optimization problem.
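For reference, the TRPO surrogate objective being recalled here can be written compactly as the following constrained problem (standard TRPO notation, where $\hat{\mathbb{E}}_t$ denotes the empirical average over sampled timesteps and $\hat{A}_t$ is the advantage estimate at time $t$):

$$
\max_{\theta} \; \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\,\hat{A}_t\right]
\quad \text{subject to} \quad
\hat{\mathbb{E}}_t\!\left[D_{\mathrm{KL}}\!\left(\pi_{\theta_{\text{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\right)\right] \le \delta
$$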
So, PPO modifies the objective function of TRPO by replacing the constraint with a penalty term, so that we no longer have to perform the conjugate gradient computation. Now let's see how PPO works. We define $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ as the probability ratio between the new and old policies. So, we can write our objective function as:

$$
L^{CPI}(\theta) = \hat{\mathbb{E}}_t\!\left[r_t(\theta)\,\hat{A}_t\right]
$$

$L^{CPI}$ denotes the conservative policy iteration objective.
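As a rough illustration, this probability ratio and the $L^{CPI}$ surrogate objective can be computed directly from per-timestep log-probabilities. The function name and the input arrays below (log_probs_new, log_probs_old, advantages, and the toy numbers) are assumptions made for this sketch, not part of the original text:

```python
import numpy as np

def cpi_surrogate_objective(log_probs_new, log_probs_old, advantages):
    """Sketch of the conservative policy iteration (CPI) surrogate objective.

    log_probs_new : log pi_theta(a_t | s_t) under the current policy
    log_probs_old : log pi_theta_old(a_t | s_t) under the old policy
    advantages    : advantage estimates A_hat_t
    All arguments are 1-D arrays of per-timestep values (hypothetical inputs).
    """
    # Probability ratio r_t(theta) = pi_theta / pi_theta_old,
    # computed in log space for numerical stability.
    ratio = np.exp(log_probs_new - log_probs_old)
    # L^CPI is the empirical mean of ratio * advantage over the batch.
    return np.mean(ratio * advantages)

# Toy usage with made-up numbers.
log_probs_old = np.log(np.array([0.20, 0.50, 0.30]))
log_probs_new = np.log(np.array([0.25, 0.45, 0.30]))
advantages = np.array([1.0, -0.5, 0.2])
print(cpi_surrogate_objective(log_probs_new, log_probs_old, advantages))
```

Working in log space and exponentiating the difference, as above, is the usual way the ratio is computed in practice, since policies typically expose log-probabilities.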