PPO is an extension of TRPO and was introduced in 2017 by researchers at OpenAI. Like TRPO, PPO is an on-policy algorithm, and it can be applied to both discrete and continuous action spaces. It uses the same ratio of new to old policy distributions as TRPO, but it drops the KL divergence constraint. Instead, PPO combines three loss functions into a single objective. We will now look at each of these three loss functions.
Learning PPO
PPO loss functions
The first of the three loss functions involved in PPO is called the clipped surrogate objective. Let $r_t(\theta)$ denote the ratio of the new to the old policy probability distributions:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$
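In code, this ratio is usually computed from log-probabilities for numerical stability, since exp(log π_new − log π_old) avoids underflow when probabilities are very small. The following is a minimal sketch (NumPy assumed; the function and argument names are illustrative, not from the original text):

import numpy as np

def probability_ratio(new_log_probs, old_log_probs):
    # r_t(theta) = pi_new(a_t | s_t) / pi_old(a_t | s_t), computed from
    # log-probabilities rather than raw probabilities for numerical stability.
    return np.exp(np.asarray(new_log_probs) - np.asarray(old_log_probs))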
The clipped surrogate objective is given by the following equation, where $A_t$ is the advantage function and $\epsilon$ is a small clipping hyperparameter (the PPO paper uses $\epsilon = 0.2$):

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) A_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_t\right)\right]$$
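Building on the ratio sketch above, here is a minimal, hedged implementation of this objective (NumPy assumed; the function name and default epsilon are illustrative, not from the original text):

import numpy as np

def clipped_surrogate_objective(ratio, advantages, epsilon=0.2):
    # L_CLIP = mean over timesteps of min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Taking the elementwise minimum makes this a pessimistic bound: the policy
    # gains nothing by pushing the ratio outside the [1 - eps, 1 + eps] interval.
    return np.mean(np.minimum(unclipped, clipped))

Since this is an objective to be maximized, a gradient-descent implementation would minimize its negative.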