PPO-Penalty
The algorithm for the PPO-penalty method is given as follows:
- Initialize the policy network parameter $\theta$ and the value network parameter $\phi$, and initialize the penalty coefficient $\beta_1$ and the target KL divergence $\delta$
- For iterations $i = 1, 2, \ldots$:
- Collect N trajectories following the policy $\pi_\theta$
- Compute the return (reward-to-go) $R_t$
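- Compute the penalized objective $L(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} A_t - \beta_i \, \mathrm{KL}\left[\pi_{\theta_{old}}(\cdot \mid s_t), \pi_\theta(\cdot \mid s_t)\right]\right]$, where $A_t$ is the advantage estimate (for example, $R_t - V_\phi(s_t)$) and $\beta_i$ is the current penalty coefficient
- Compute the gradient of the objective function, $\nabla_\theta L(\theta)$
- Update the policy network parameter $\theta$ using gradient ascent, $\theta = \theta + \alpha \nabla_\theta L(\theta)$, where $\alpha$ is the learning rate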
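- Let $d = \hat{\mathbb{E}}_t\left[\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot \mid s_t), \pi_\theta(\cdot \mid s_t)\right]\right]$ be the mean KL divergence between the old and the updated policy, and let $\delta$ be the target KL divergence. If $d$ is greater than or equal to $1.5\delta$, then we set $\beta_{i+1} = 2\beta_i$; if $d$ is less than or equal to $\delta / 1.5$, then we set $\beta_{i+1} = \beta_i / 2$; otherwise the penalty coefficient is left unchanged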
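- Compute the mean squared error of the value network: $J(\phi) = \hat{\mathbb{E}}_t\left[\left(R_t - V_\phi(s_t)\right)^2\right]$
- Compute the gradient of the value network loss, $\nabla_\phi J(\phi)$
- Update the value network parameter $\phi$ using gradient descent, $\phi = \phi - \alpha \nabla_\phi J(\phi)$, where $\alpha$ is the learning rate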
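
The two steps that set PPO-penalty apart from other policy gradient methods, the KL-penalized objective and the adaptive penalty coefficient, can be sketched in plain Python. This is a minimal illustration rather than a full implementation: it assumes a discrete action space where each policy is given as a list of per-state action probabilities, it omits the gradient steps (which would normally be handled by an automatic differentiation library), and all function names (`kl_divergence`, `penalized_objective`, `mean_kl`, `update_beta`) are hypothetical names chosen for this example.

```python
import math


def kl_divergence(p, q):
    """KL[p, q] for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def penalized_objective(old_probs, new_probs, actions, advantages, beta):
    """Sample estimate of the objective: mean of ratio * A_t - beta * KL[old, new]."""
    total = 0.0
    for p_old, p_new, a, adv in zip(old_probs, new_probs, actions, advantages):
        ratio = p_new[a] / p_old[a]  # probability ratio of the action taken
        total += ratio * adv - beta * kl_divergence(p_old, p_new)
    return total / len(actions)


def mean_kl(old_probs, new_probs):
    """Mean KL divergence d between the old and the updated policy."""
    kls = [kl_divergence(p, q) for p, q in zip(old_probs, new_probs)]
    return sum(kls) / len(kls)


def update_beta(beta, d, delta):
    """Adapt the penalty coefficient by comparing d with the target KL delta."""
    if d >= 1.5 * delta:
        return 2.0 * beta  # policy moved too far: penalize KL more strongly
    if d <= delta / 1.5:
        return beta / 2.0  # policy barely moved: relax the penalty
    return beta            # within the target band: leave beta unchanged


# Toy data (assumed for illustration): two visited states, two actions each.
old_probs = [[0.5, 0.5], [0.5, 0.5]]
new_probs = [[0.8, 0.2], [0.6, 0.4]]
actions = [0, 1]
advantages = [1.0, -0.5]
beta, delta = 1.0, 0.01

objective = penalized_objective(old_probs, new_probs, actions, advantages, beta)
d = mean_kl(old_probs, new_probs)
beta = update_beta(beta, d, delta)
```

In this toy example the updated policy has moved far from the old one relative to the target, so `update_beta` doubles the coefficient and the next iteration penalizes the KL term more heavily.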