REINFORCE with Baseline
The algorithm for REINFORCE with baseline is given as follows:
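1. Initialize the policy network parameter $\theta$ and the value network parameter $\phi$.
2. Generate $N$ trajectories $\{\tau^i\}_{i=1}^{N}$ by following the policy $\pi_\theta$.
3. Compute the return (reward-to-go) $R_t$ for every time step of each trajectory.
4. Compute the policy gradient, using the value network as the baseline: $\nabla_\theta J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, (R_t - V_\phi(s_t))$
5. Update the policy network parameter using gradient ascent: $\theta = \theta + \alpha \nabla_\theta J(\theta)$
6. Compute the mean squared error of the value network: $J(\phi) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} (R_t - V_\phi(s_t))^2$
7. Update the value network parameter using gradient descent: $\phi = \phi - \alpha \nabla_\phi J(\phi)$
8. Repeat steps 2 to 7 for several iterations.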
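The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the two networks are replaced by lookup tables (`theta` for a softmax policy, `v` for the value estimates), the environment is a hypothetical five-state corridor invented for this example, and the names `step`, `generate_trajectory`, `rewards_to_go`, `train`, `lr_policy`, and `lr_value` are illustrative only. The return is undiscounted by default (`gamma=1.0`).

```python
import math
import random

N_STATES = 5    # hypothetical corridor: states 0..4, state 4 is the goal
N_ACTIONS = 2   # 0 = move left, 1 = move right

def step(state, action):
    """Toy environment (an assumption): reward -1 per step, ends at the goal."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    return next_state, -1.0, next_state == N_STATES - 1

def softmax(prefs):
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def generate_trajectory(theta, rng, max_steps=50):
    """Roll out one episode, returning a list of (state, action, reward)."""
    state, traj = 0, []
    for _ in range(max_steps):
        probs = softmax(theta[state])
        action = rng.choices(range(N_ACTIONS), weights=probs)[0]
        next_state, reward, done = step(state, action)
        traj.append((state, action, reward))
        state = next_state
        if done:
            break
    return traj

def rewards_to_go(rewards, gamma=1.0):
    """R_t = r_t + gamma * R_{t+1}, computed backwards over the episode."""
    out, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def train(iterations=200, n_traj=10, lr_policy=0.01, lr_value=0.01, seed=0):
    rng = random.Random(seed)
    # Initialize the policy parameter (theta) and the value parameter (v).
    theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    v = [0.0] * N_STATES
    for _ in range(iterations):
        # Generate N trajectories following the current policy.
        trajs = [generate_trajectory(theta, rng) for _ in range(n_traj)]
        grad_theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
        grad_v = [0.0] * N_STATES
        for traj in trajs:
            returns = rewards_to_go([r for _, _, r in traj])
            for (s, a, _), ret in zip(traj, returns):
                advantage = ret - v[s]  # return minus the baseline
                probs = softmax(theta[s])
                for b in range(N_ACTIONS):
                    # For a softmax policy:
                    # d log pi(a|s) / d theta[s][b] = 1[a == b] - pi(b|s)
                    indicator = 1.0 if a == b else 0.0
                    grad_theta[s][b] += (indicator - probs[b]) * advantage / n_traj
                # d/dv[s] of (ret - v[s])^2 = -2 * (ret - v[s])
                grad_v[s] += -2.0 * advantage / n_traj
        for s in range(N_STATES):
            for b in range(N_ACTIONS):
                theta[s][b] += lr_policy * grad_theta[s][b]  # gradient ascent
            v[s] -= lr_value * grad_v[s]                     # gradient descent
    return theta, v

if __name__ == "__main__":
    theta, v = train()
    print("P(right) per state:", [round(softmax(row)[1], 3) for row in theta])
    print("Value estimates:", [round(x, 3) for x in v])
```

Both gradients are accumulated over the whole batch before either parameter is updated, so the policy gradient and the value loss are computed against the same baseline values. With neural networks in place of the tables, the hand-written derivatives would typically be replaced by automatic differentiation, but the order of operations stays the same.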