PPO-Clipped
The PPO-clipped algorithm proceeds as follows:
1. Initialize the policy network parameter and the value network parameter
2. Collect N trajectories by following the current policy
3. Compute the return (reward-to-go) R_t for each time step
4. Compute the gradient of the clipped objective function
5. Update the policy network parameter using gradient ascent
6. Compute the mean squared error loss of the value network
7. Compute the gradient of the value network loss
8. Update the value network parameter using gradient descent
9. Repeat steps 2 to 8 for several iterations
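The core computations in the steps above, the clipped surrogate objective for the policy and the mean squared error loss for the value network, can be sketched in NumPy as follows. This is a minimal illustration, not a full training loop; the function names and the clipping range eps=0.2 are illustrative choices, not taken from the text.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective (to be maximized by gradient ascent).

    ratio r_t is the probability of the action under the new policy divided
    by its probability under the old (data-collecting) policy; the objective
    takes the minimum of the unclipped and clipped terms, which removes the
    incentive to move the ratio outside [1 - eps, 1 + eps].
    """
    ratio = np.exp(logp_new - logp_old)                    # r_t = pi_new / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))

def value_loss(values, returns):
    """Mean squared error between predicted values and returns (reward-to-go),
    minimized by gradient descent on the value network parameters."""
    return np.mean((returns - values) ** 2)
```

With identical old and new log-probabilities the ratio is 1 and the objective reduces to the mean advantage; when the ratio drifts outside the clipping range, the clipped term caps the objective, which is what keeps the policy update conservative.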