Trust Region Policy Optimization (TRPO) is a popular on-policy algorithm introduced in 2015 by researchers at the University of California, Berkeley. There are many flavors of TRPO, but we will learn about the vanilla version from the paper Trust Region Policy Optimization, by John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel, arXiv:1502.05477.
TRPO solves a policy optimization problem subject to an additional constraint on the size of the policy update, where the size is measured by the KL divergence between the old and the new policy. We will see these equations now.
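To make the idea of a constrained update concrete, here is a minimal sketch in plain Python for a discrete-action policy. It is not the implementation from the paper: the function names (`surrogate_and_kl`, `within_trust_region`), the toy probabilities, and the threshold `delta=0.01` are all assumptions chosen for illustration. The sketch evaluates the importance-weighted surrogate objective and the mean KL divergence between an old and a candidate new policy, then checks whether the candidate stays inside the trust region.

```python
import math


def surrogate_and_kl(old_probs, new_probs, actions, advantages):
    """Return (surrogate objective, mean KL(old || new)) over a batch of states.

    old_probs, new_probs: per-state lists of action probabilities.
    actions: index of the action taken in each state.
    advantages: advantage estimate for each (state, action) pair.
    """
    n = len(actions)
    surrogate = 0.0
    kl = 0.0
    for p_old, p_new, a, adv in zip(old_probs, new_probs, actions, advantages):
        # Importance ratio pi_new(a|s) / pi_old(a|s), weighted by the advantage.
        surrogate += (p_new[a] / p_old[a]) * adv
        # KL divergence between the old and new action distributions in this state.
        kl += sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0)
    return surrogate / n, kl / n


def within_trust_region(mean_kl, delta=0.01):
    """True if the update satisfies the KL constraint (delta is an assumed value)."""
    return mean_kl <= delta


# Toy data (hypothetical): two states, two actions, uniform old policy.
old_probs = [[0.5, 0.5], [0.5, 0.5]]
actions = [0, 1]
advantages = [1.0, 1.0]

small_step = [[0.55, 0.45], [0.45, 0.55]]  # cautious update
large_step = [[0.90, 0.10], [0.10, 0.90]]  # aggressive update

sur_small, kl_small = surrogate_and_kl(old_probs, small_step, actions, advantages)
sur_large, kl_large = surrogate_and_kl(old_probs, large_step, actions, advantages)

print("small step:", sur_small, kl_small, within_trust_region(kl_small))
print("large step:", sur_large, kl_large, within_trust_region(kl_large))
```

The aggressive update scores higher on the surrogate objective but moves the policy much further in KL terms, so under the assumed threshold it would be rejected, while the cautious update would be accepted. This trade-off between improving the objective and limiting the change in the policy is what the constraint is meant to control.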