The REINFORCE method
The policy gradient formula that you have just seen is used by most policy-based methods, but the details can vary. One very important point is how exactly the gradient scales, Q(s,a), are calculated. In the cross-entropy method from Chapter 4, we played several episodes, calculated the total reward for each of them, and trained on transitions from episodes with a better-than-average reward. This training procedure is a policy gradient method with Q(s,a) = 1 for state and action pairs from good episodes (those with a large total reward) and Q(s,a) = 0 for state and action pairs from worse episodes.
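To make this connection concrete, here is a minimal PyTorch-style sketch of a policy gradient loss in which the scale is 1 for transitions from good episodes and 0 for the rest. The function name policy_gradient_loss and the tensor arguments (logits, actions, scales) are illustrative assumptions, not code from this chapter.

```python
import torch
import torch.nn.functional as F

def policy_gradient_loss(logits, actions, scales):
    """Generic policy-gradient objective: -E[ scale * log pi(a|s) ]."""
    log_probs = F.log_softmax(logits, dim=1)
    # pick the log-probability of the action actually taken in each state
    log_prob_actions = log_probs[torch.arange(len(actions)), actions]
    return -(scales * log_prob_actions).mean()

# Cross-entropy method as a special case: scale = 1 for transitions from
# elite episodes, scale = 0 for the rest, so the latter drop out of the
# gradient entirely.
logits = torch.randn(4, 2, requires_grad=True)     # fake policy outputs
actions = torch.tensor([0, 1, 1, 0])                # actions taken
elite_scales = torch.tensor([1.0, 1.0, 0.0, 0.0])   # good vs. worse episodes
loss = policy_gradient_loss(logits, actions, elite_scales)
loss.backward()
```

With such binary scales, the zero-scaled transitions contribute nothing to the gradient, which is exactly why the cross-entropy method trains only on the elite episodes.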
The cross-entropy method worked even with those simple assumptions, but an obvious improvement is to use Q(s,a) for training instead of just 0 and 1. Why should this help? The answer is a more fine-grained separation of episodes. For example, transitions from an episode with a total reward of 10 should contribute to the gradient more than transitions from an episode with a total reward of 1.
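One standard choice for these scales, used by REINFORCE, is the discounted total reward accumulated from each step to the end of its episode. The helper below is a sketch of that calculation under that assumption; the name calc_qvals and the default gamma are illustrative.

```python
def calc_qvals(rewards, gamma=0.99):
    """Discounted return for every step of a single episode.

    Walks the episode's local rewards backwards, so each step's value
    is r_t + gamma * (value of the following step).
    """
    res = []
    sum_r = 0.0
    for r in reversed(rewards):
        sum_r = r + gamma * sum_r
        res.append(sum_r)
    return list(reversed(res))

# For a short episode with rewards [1, 1, 1] and gamma = 0.99:
# calc_qvals([1, 1, 1]) -> [2.9701, 1.99, 1.0]
```

Feeding these values in as the scales (instead of the binary 0/1 labels above) lets every transition contribute in proportion to how much reward actually followed it.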