Policy Gradient with Reward-To-Go
The algorithm for policy gradient with reward-to-go is given as follows:
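1. Initialize the network parameters $\theta$ with random values
2. Generate N trajectories by following the policy $\pi_\theta$
3. Compute the return (reward-to-go) $R_t$ for every step t of each trajectory
4. Compute the gradients of the objective with respect to $\theta$, weighting the log-probability of each action by its reward-to-go $R_t$
5. Update the network parameters by gradient ascent: $\theta = \theta + \alpha \nabla_\theta J(\theta)$
6. Repeat steps 2 to 5 for several iterations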
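
The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the toy task (action 1 pays a reward of 1, action 0 pays 0, over a fixed horizon of 3 steps), the two-logit softmax policy standing in for the network, the discount factor gamma, and all function names are assumptions made for the example. The gradient used is the standard estimate (1/N) * sum over trajectories and steps of grad log pi(a_t | s_t) * R_t.

```python
import math
import random


def reward_to_go(rewards, gamma=1.0):
    """Return R_t = sum over t' >= t of gamma^(t'-t) * r_t' for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each R_t reuses the already computed R_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_trajectory(theta, rng, horizon=3):
    """Hypothetical toy task: action 1 pays reward 1, action 0 pays 0."""
    probs = softmax(theta)
    actions = [0 if rng.random() < probs[0] else 1 for _ in range(horizon)]
    rewards = [float(a) for a in actions]
    return actions, rewards


def train(iterations=200, num_trajectories=20, lr=0.1, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize the parameters with (small) random values.
    theta = [rng.uniform(-0.1, 0.1) for _ in range(2)]
    for _iteration in range(iterations):  # Step 6: repeat steps 2 to 5.
        grad = [0.0, 0.0]
        probs = softmax(theta)
        for _episode in range(num_trajectories):
            # Step 2: generate a trajectory by following the policy.
            actions, rewards = sample_trajectory(theta, rng)
            # Step 3: compute the reward-to-go for every step.
            returns = reward_to_go(rewards)
            # Step 4: accumulate grad log pi(a_t) * R_t.
            # For a softmax policy, grad log pi(a) = one_hot(a) - probs.
            for a, r_t in zip(actions, returns):
                for k in range(2):
                    one_hot = 1.0 if k == a else 0.0
                    grad[k] += (one_hot - probs[k]) * r_t
        grad = [g / num_trajectories for g in grad]
        # Step 5: gradient ascent update.
        theta = [th + lr * g for th, g in zip(theta, grad)]
    return theta


if __name__ == "__main__":
    print(reward_to_go([1, 2, 3]))
    print(softmax(train()))
```

In this sketch the reward-to-go at step t sums only the rewards from t onward, so an action is never credited with rewards collected before it was taken. With a real neural network policy, step 4 would typically be handled by automatic differentiation rather than the hand-written softmax gradient used here.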