Chapter 10 – Policy Gradient Method
- In the value-based method, we extract the optimal policy from the optimal Q function (Q values).
- It is difficult to compute the optimal policy with a value-based method when the action space is continuous, so we use the policy-based method instead. In the policy-based method, we compute the optimal policy directly, without a Q function.
- In the policy gradient method, we select actions by sampling from the action probability distribution produced by the policy network. If the episode yields a high return, we increase the probabilities of all the actions taken during that episode; if it yields a low return, we decrease them (see the first sketch after this list).
- The policy gradient is computed as $\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^{T-1} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \, R(\tau) \right]$, where $R(\tau)$ is the return of the trajectory $\tau$.
- Reward-to-go is the return of the trajectory starting from the state $s_t$. It is computed as $R_t = \sum_{t'=t}^{T-1} r(s_{t'}, a_{t'})$ (see the rewards-to-go sketch after this list).
- The policy gradient with the baseline function is a policy gradient method that uses a baseline $b(s_t)$ to reduce the variance of the gradient estimates without introducing bias. The gradient becomes $\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^{T-1} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \, \big(R_t - b(s_t)\big) \right]$ (see the baseline sketch after this list).
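The sketches below illustrate these points in PyTorch. The network architecture, the dimensions (4 observations, 2 discrete actions), and the learning rate are illustrative assumptions, not the chapter's exact code. First, a minimal REINFORCE update that samples actions from the policy's distribution and weights every log-probability in an episode by the same whole-episode return $R(\tau)$:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Assumed toy policy network: 4-dim observations, 2 discrete actions.
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def select_action(state):
    """Sample an action from the policy's action probability distribution."""
    logits = policy(torch.as_tensor(state, dtype=torch.float32))
    return Categorical(logits=logits).sample().item()

def reinforce_update(states, actions, episode_return):
    """One gradient step: every action's log-probability in the episode
    is scaled by the same whole-episode return R(tau)."""
    logits = policy(torch.as_tensor(states, dtype=torch.float32))
    log_probs = Categorical(logits=logits).log_prob(torch.as_tensor(actions))
    loss = -(log_probs * episode_return).sum()  # gradient ascent on J(theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```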
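Rewards-to-go can be computed with a single backward pass over the episode's rewards. The optional discount factor `gamma` is an assumption beyond the undiscounted sum given above:

```python
def rewards_to_go(rewards, gamma=1.0):
    """Return R_t = sum over t' >= t of gamma^(t'-t) * r_t' for every step t."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

# Example: rewards [1, 1, 1] give rewards-to-go [3.0, 2.0, 1.0] with gamma=1.
```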
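Finally, a baseline sketch. Using the batch-mean return as the baseline is one simple, action-independent choice (a learned state-value function $V(s_t)$ is the more common alternative); subtracting it keeps the gradient unbiased while reducing its variance:

```python
import numpy as np

def advantages(returns):
    """Center the rewards-to-go around a constant baseline b before
    they weight the log-probabilities in the policy gradient."""
    returns = np.asarray(returns, dtype=np.float32)
    return returns - returns.mean()  # (R_t - b) with b = mean return
```

Passing a per-step tensor of these centered returns in place of the scalar `episode_return` in `reinforce_update` above turns the update into reward-to-go REINFORCE with a baseline.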