Trust Region Policy Optimization
The algorithm for Trust Region Policy Optimization (TRPO) is given as follows:
- Initialize the policy network parameter θ and the value network parameter φ
- Generate N trajectories by following the current policy
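- Compute the return (reward-to-go) R_t for every timestep t of each trajectory
- Compute the advantage value A_t, for example as the return R_t minus the value network's estimate for the state at timestep t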
- Compute the policy gradient g, the gradient of the surrogate objective with respect to the policy network parameter
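- Compute the search direction s = H^-1 g using the conjugate gradient method, where g is the policy gradient and H is the Hessian of the KL divergence between the old and new policies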
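- Update the policy network parameter θ using the update rule θ = θ + α^j sqrt(2δ / (s^T H s)) s, where s is the search direction from the conjugate gradient method, H is the Hessian of the KL divergence, δ is the trust region size, and α^j is the backtracking line search coefficient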
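- Compute the mean squared error of the value network, that is, the mean squared difference between the return R_t and the value predicted for the corresponding state
- Update the value network parameter using gradient descent on this error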
- Repeat steps 2 to 9 for several iterations
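
The numerical pieces of the steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a full TRPO implementation: the function names (reward_to_go, advantages, conjugate_gradient, max_step), the discount factor gamma, and the trust region size delta are illustrative choices, and a small fixed 2x2 matrix stands in for the Hessian of the KL divergence. A practical implementation would typically obtain Hessian-vector products through automatic differentiation instead of forming the matrix.

```python
import math

def reward_to_go(rewards, gamma=0.99):
    """Discounted return R_t = r_t + gamma * R_{t+1} for every timestep."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantages(returns, values):
    """Simple advantage estimate A_t = R_t - V(s_t)."""
    return [r - v for r, v in zip(returns, values)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H s = g using only products hvp(v) = H v."""
    s = [0.0] * len(g)
    r = list(g)  # residual g - H s, with s = 0 initially
    p = list(g)  # current conjugate direction
    rr = dot(r, r)
    for _ in range(iters):
        if rr < tol:
            break
        Hp = hvp(p)
        alpha = rr / dot(p, Hp)
        s = [si + alpha * pi for si, pi in zip(s, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, Hp)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return s

def max_step(s, hvp, delta=0.01):
    """Largest beta such that 0.5 * (beta*s)^T H (beta*s) <= delta."""
    return math.sqrt(2.0 * delta / dot(s, hvp(s)))

# Toy stand-in for the KL Hessian (assumption: symmetric positive definite).
H = [[4.0, 1.0], [1.0, 3.0]]

def toy_hvp(v):
    return [dot(row, v) for row in H]

g = [1.0, 2.0]                      # pretend policy gradient
s = conjugate_gradient(toy_hvp, g)  # search direction, approximately H^-1 g
beta = max_step(s, toy_hvp)         # step size before backtracking
theta = [0.0, 0.0]                  # pretend policy parameter
theta_new = [th + beta * si for th, si in zip(theta, s)]
```

In a full update, theta_new would only be accepted after a backtracking line search confirms that the surrogate objective improved and the KL constraint holds; otherwise the step is shrunk and retried.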