Proximal Policy Optimization (PPO) is a policy gradient-based method that has proven to be both stable and scalable. In fact, PPO was the algorithm used by the OpenAI Five team of agents that played (and won) against professional human Dota 2 players, which we discussed in the previous chapter.
Proximal Policy Optimization
Core concept
In policy gradient methods, the algorithm performs rollouts to collect samples of transitions and (potentially) rewards, and updates the parameters of the policy using gradient ascent to maximize the objective function (the expected return). The idea is to keep updating the parameters until a good policy is obtained. To improve training stability, the Trust Region Policy Optimization (TRPO) algorithm constrains each policy update to a small trust region around the current policy; PPO achieves a similar effect with a simpler, clipped surrogate objective that discourages large policy updates.
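As a minimal sketch of the clipped surrogate objective from the PPO paper, the following function computes the loss from log-probabilities under the new and old policies and the estimated advantages. The function and argument names, the NumPy implementation, and the default clip value of 0.2 are illustrative assumptions, not taken from the original text:

```python
import numpy as np

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_epsilon=0.2):
    """Clipped surrogate objective (negated so it can be minimized).

    All arguments are arrays of shape (batch,): log pi_theta(a_t|s_t),
    log pi_theta_old(a_t|s_t), and advantage estimates A_t.
    """
    # Probability ratio r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
    ratio = np.exp(new_log_probs - old_log_probs)
    # Unclipped surrogate term and its clipped counterpart
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantages
    # PPO maximizes the element-wise minimum of the two terms;
    # returning the negative mean turns it into a loss for gradient descent
    return -np.mean(np.minimum(unclipped, clipped))
```

Taking the element-wise minimum means the objective only credits the policy for moving the probability ratio in the direction the advantage favors up to the clip boundary, which is what keeps each update close to the previous policy.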