Policy Gradient with Reward-To-Go
The algorithm for policy gradient with reward-to-go is given as follows:
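1. Initialize the network parameter $\theta$ with random values
2. Generate $N$ trajectories $\{\tau^i\}_{i=1}^{N}$ following the policy $\pi_\theta$
3. Compute the return (reward-to-go) $R_t$, the sum of rewards from step $t$ to the end of the trajectory
4. Compute the gradients: $\nabla_\theta J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) R_t$
5. Update the network parameter: $\theta = \theta + \alpha \nabla_\theta J(\theta)$, where $\alpha$ is the learning rate
6. Repeat steps 2 to 5 for several iterations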
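
To make the steps above concrete, here is a minimal NumPy sketch of the loop. It is an illustration under stated assumptions, not the reference implementation: the "network" is replaced by a tabular softmax policy, the environment is a hypothetical toy corridor in which action 1 pays a reward of 1 and action 0 pays nothing, and the names `reward_to_go`, `sample_trajectory`, `train`, the discount `gamma`, and the learning rate `alpha` are all introduced here for illustration.

```python
import numpy as np


def reward_to_go(rewards, gamma=1.0):
    """Return R_t = sum over t' >= t of gamma^(t'-t) * r_t' for every step t.

    gamma=1.0 gives the plain undiscounted reward-to-go; the discount is an
    optional extra assumed here for generality.
    """
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def softmax(logits):
    z = np.exp(logits - logits.max())  # subtract max for numerical stability
    return z / z.sum()


def sample_trajectory(theta, rng, horizon):
    """Hypothetical toy environment: states 0..horizon-1 are visited in order,
    and the reward at each step equals the chosen action (0 or 1)."""
    states, actions, rewards = [], [], []
    for s in range(horizon):
        a = rng.choice(2, p=softmax(theta[s]))
        states.append(s)
        actions.append(a)
        rewards.append(float(a))
    return states, actions, rewards


def train(n_iters=300, n_traj=10, alpha=0.1, gamma=1.0, horizon=5, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize the policy parameter with random values.
    theta = rng.normal(scale=0.1, size=(horizon, 2))
    for _ in range(n_iters):  # Step 6: repeat steps 2 to 5.
        grad = np.zeros_like(theta)
        for _ in range(n_traj):
            # Step 2: generate a trajectory following the current policy.
            states, actions, rewards = sample_trajectory(theta, rng, horizon)
            # Step 3: compute the reward-to-go for every step.
            returns = reward_to_go(rewards, gamma)
            # Step 4: accumulate grad log pi(a_t | s_t) * R_t.
            for s, a, r_t in zip(states, actions, returns):
                grad_log_pi = -softmax(theta[s])
                grad_log_pi[a] += 1.0  # gradient of log-softmax w.r.t. logits
                grad[s] += grad_log_pi * r_t
        # Step 5: gradient ascent step, averaged over the N trajectories.
        theta += alpha * grad / n_traj
    return theta
```

Calling `theta = train()` and then inspecting `softmax(theta[s])[1]` for each state `s` should show the policy shifting probability toward the rewarding action. Because each action is weighted only by the rewards that follow it, earlier rewards do not add noise to the gradient estimate for later steps.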