Trust region policy optimization
The trust region policy optimization (TRPO) algorithm was proposed to solve complex continuous control tasks in the following paper: J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. Trust Region Policy Optimization. In ICML, 2015.
Understanding why TRPO works requires some mathematical background. The main idea is to guarantee that the new policy $\pi'$, obtained after one optimization step, not only monotonically decreases the optimization loss function (and thus improves the policy), but also does not deviate too much from the previous policy $\pi$. This means there should be a constraint on the difference between $\pi'$ and $\pi$, for example $D(\pi', \pi) \le \delta$ for a certain constraint function $D$ and constant $\delta$.
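Concretely, when the policy is parameterized by $\theta$ and the constraint function $D$ is chosen to be the KL divergence, as in the TRPO paper, each update can be written roughly as the following constrained problem. The advantage function $A^{\pi_{\theta_{\text{old}}}}$ and the state distribution $\rho_{\theta_{\text{old}}}$ are taken from the paper's notation rather than from the text above, and are introduced formally later in the derivation:

\[
\begin{aligned}
\max_{\theta} \quad & \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}},\, a \sim \pi_{\theta_{\text{old}}}}
    \left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \, A^{\pi_{\theta_{\text{old}}}}(s, a) \right] \\
\text{subject to} \quad & \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}
    \left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta
\end{aligned}
\]

The inequality is the "trust region": as long as the averaged KL divergence between the old and new policies stays below the radius $\delta$, the surrogate objective can be maximized safely.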
Theory behind TRPO
Let's look at the mechanism behind TRPO. If you find this part hard to follow, you can skip it and go directly to running TRPO on the MuJoCo control tasks. Consider an infinite-horizon discounted Markov decision process (MDP) denoted by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $P(s' \mid s, a)$ is the transition probability distribution, $r(s)$ is the reward function, $\rho_0$ is the distribution of the initial state $s_0$, and $\gamma \in (0, 1)$ is the discount factor.
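For reference in the derivation that follows, the objective optimized under this MDP is the expected discounted return. The symbol $\eta$ used here follows the TRPO paper's notation and is an assumption on my part rather than something introduced in the text above:

\[
\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t) \right],
\qquad
s_0 \sim \rho_0,\quad a_t \sim \pi(a_t \mid s_t),\quad s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)
\]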