Trust-region methods
One of the important developments in the world of policy-based methods has been the evolution of trust-region methods. In particular, the TRPO and PPO algorithms have led to significant improvements over algorithms such as A2C and A3C. For example, the famous Dota 2 AI agent, which reached expert-level performance in competitive play, was trained using PPO and GAE. In this section, we go into the details of these algorithms to help you gain a solid understanding of how they work.
Info
Prof. Sergey Levine, who co-authored the TRPO paper, goes deeper into the math behind these methods in his online lecture than we do in this section. The lecture is available at https://youtu.be/uR1Ubd2hAlE, and I highly recommend watching it to strengthen your theoretical understanding of these algorithms.
Without further ado, let's dive in!
Policy gradient as policy iteration
In the earlier chapters, we described how most of the RL algorithms...