Trust region policy optimization (TRPO) is the first successful algorithm to use several approximations to compute the natural gradient, with the goal of training a deep neural network policy in a more controlled and stable way. From NPG, we saw that computing the inverse of the FIM is computationally infeasible for nonlinear functions with many parameters. TRPO overcomes these difficulties by building on top of NPG: it introduces a surrogate objective function and makes a series of approximations, which allows it to learn complex policies for walking, hopping, or playing Atari games from raw pixels.
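To make the ideas of a surrogate objective and a trust region more concrete before diving into the details, here is a minimal NumPy sketch. The array values and the threshold delta are illustrative assumptions, not part of TRPO itself: the sketch only shows the importance-weighted advantage that TRPO maximizes and a KL-divergence check between the old and new policies.

```python
import numpy as np

# Hypothetical per-timestep data collected with the old policy pi_old.
# log_prob_new / log_prob_old are log pi(a_t|s_t) under the new and old
# parameters; advantages are estimates of A^{pi_old}(s_t, a_t).
log_prob_new = np.array([-0.9, -1.2, -0.7])
log_prob_old = np.array([-1.0, -1.1, -0.8])
advantages   = np.array([ 0.5, -0.2,  1.3])

# Surrogate objective: the expected importance-weighted advantage,
# L(theta) = E[ pi_theta(a|s) / pi_theta_old(a|s) * A(s, a) ].
ratio = np.exp(log_prob_new - log_prob_old)
surrogate = np.mean(ratio * advantages)

# Trust-region constraint: the step is accepted only if the average KL
# divergence between the old and new policies stays below a threshold delta
# (0.01 here is an illustrative value, not a prescribed one).
delta = 0.01
mean_kl = np.mean(log_prob_old - log_prob_new)  # crude Monte Carlo KL estimate
update_ok = mean_kl <= delta

print(f"surrogate: {surrogate:.4f}, mean KL: {mean_kl:.4f}, accept: {update_ok}")
```

In the full algorithm, the new parameters are not chosen freely and then checked; the constrained problem is solved approximately with the natural gradient machinery discussed later in the chapter.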
TRPO is one of the most complex model-free algorithms, and although we have already learned the underlying principles of the natural gradient, some difficult parts remain. In this chapter, we'll only...