PPO
The PPO method came from the OpenAI team and was proposed after TRPO, which dates from 2015. However, we will start with PPO because it is much simpler than TRPO. It was first described in the 2017 paper Proximal Policy Optimization Algorithms by Schulman et al. [Sch+17].
The core improvement over the classic A2C method is a change in the formula used to estimate the policy gradients. Instead of using the gradient of the logarithm of the probability of the action taken, the PPO method uses a different objective: the ratio between the new and the old policy, scaled by the advantages.
In math form, the A2C objective can be written as

$$\nabla_\theta J = \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a|s)\, A(s,a)\big]$$

which means the gradient with respect to the model parameters $\theta$ is estimated from the logarithm of the policy $\pi$ multiplied by the advantage $A$.
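To make this concrete, here is a minimal PyTorch-style sketch of how such a policy loss is commonly computed in practice; the function name a2c_policy_loss and the tensor arguments are illustrative assumptions, not code from the paper or from a specific library:

```python
import torch

def a2c_policy_loss(logits: torch.Tensor, actions: torch.Tensor,
                    advantages: torch.Tensor) -> torch.Tensor:
    # log pi(a|s) for every action, then select the actions actually taken
    log_probs = torch.log_softmax(logits, dim=1)
    log_prob_actions = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # A2C objective: maximize E[log pi(a|s) * A]; advantages are assumed
    # to be precomputed and detached from the graph. We return the negative
    # mean so that gradient descent maximizes the objective.
    return -(log_prob_actions * advantages).mean()
```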
The new objective proposed in PPO is the following:

$$J_\theta = \mathbb{E}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\, A_t\right]$$
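As a sketch of how this ratio-based objective might be computed (assuming we have log probabilities of the taken actions under the new and the old policy, and precomputed advantages; the names below are illustrative):

```python
import torch

def ppo_surrogate_loss(new_log_probs: torch.Tensor,
                       old_log_probs: torch.Tensor,
                       advantages: torch.Tensor) -> torch.Tensor:
    # ratio pi_new(a|s) / pi_old(a|s), computed in log space for
    # numerical stability; the old policy is treated as a constant
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    # surrogate objective: maximize E[ratio * A];
    # return the negative mean so it can be minimized with gradient descent
    return -(ratio * advantages).mean()
```

Note that when the new and the old policies coincide, the ratio equals one and the objective reduces to scaling by the advantage alone.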
The reason for changing the objective is the same as with the cross-entropy method covered...