Trust region policy optimization
The trust region policy optimization (TRPO) algorithm was proposed to solve complex continuous control tasks in the following paper: J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel. Trust Region Policy Optimization. In ICML, 2015.
Understanding why TRPO works requires some mathematical background. The main idea is that it is better to guarantee that the new policy $\pi'$, obtained after one training step, not only monotonically decreases the optimization loss function (and thus improves the policy), but also does not deviate from the previous policy $\pi$ too much. This means that there should be a constraint on the difference between $\pi'$ and $\pi$, for example, $D(\pi', \pi) \le \delta$ for a certain constraint function $D$ and constant $\delta$.
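To make the constraint concrete, here is a minimal Python sketch, not the TRPO implementation itself, in which the constraint function $D$ is taken to be the average KL divergence between the old and candidate action distributions and $\delta$ is a small trust-region size; the categorical action probabilities below are made-up values for illustration only.

```python
import numpy as np

def mean_kl_categorical(p_old, p_new):
    """Average KL(p_old || p_new) over a batch of categorical action distributions."""
    return np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=-1).mean()

delta = 0.01  # trust-region size: the constant in D(pi', pi) <= delta

# Hypothetical action probabilities of the old policy pi and a candidate pi'
# for a small batch of states (3 actions per state).
pi_old = np.array([[0.50, 0.30, 0.20],
                   [0.60, 0.20, 0.20]])
pi_new = np.array([[0.48, 0.32, 0.20],
                   [0.55, 0.25, 0.20]])

kl = mean_kl_categorical(pi_old, pi_new)
if kl <= delta:
    print(f"update stays inside the trust region: mean KL {kl:.4f} <= {delta}")
else:
    print(f"update violates the constraint: mean KL {kl:.4f} > {delta}")
```

In TRPO proper, this kind of KL constraint is enforced during the policy update itself rather than checked after the fact, but the sketch shows what "not deviating too much from the previous policy" means in practice.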
Theory behind TRPO
Let's see the mechanism behind TRPO. If you feel that this part is hard to understand, you can skip it and go directly to how to run TRPO to solve MuJoCo control tasks. Consider an infinite-horizon discounted Markov decision process denoted by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma)$, where...