The REINFORCE method
The PG formula that we've just seen is used by most policy-based methods, but the details can vary. One very important point is how exactly the gradient scales, Q(s, a), are calculated. In the cross-entropy method from Chapter 4, The Cross-Entropy Method, we played several episodes, calculated the total reward for each of them, and trained on transitions from episodes with a better-than-average reward. This training procedure is the PG method with Q(s, a) = 1 for actions from good episodes (those with a large total reward) and Q(s, a) = 0 for actions from worse episodes.
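To make the connection concrete, here is a minimal NumPy sketch of the PG loss, -E[Q(s, a) · log π(a|s)], applied to a batch of transitions. The function name and the sample values of Q are illustrative, not from the book's code; with binary 0/1 scales the loss reduces to cross-entropy on the "good" transitions only, while real-valued scales weight each transition by its estimated quality.

```python
import numpy as np

def pg_loss(logits, actions, scales):
    """Policy-gradient loss: -mean(Q(s, a) * log pi(a|s)).

    logits  -- (N, n_actions) raw policy-network outputs
    actions -- (N,) indices of the actions actually taken
    scales  -- (N,) gradient scales Q(s, a), one per transition
    """
    # Log-softmax computed in a numerically stable way.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # Pick log pi(a|s) for the action taken in each transition.
    log_pi = log_probs[np.arange(len(actions)), actions]
    return -(scales * log_pi).mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 3))
actions = rng.integers(0, 3, size=6)

# Cross-entropy-style scales: 1 for transitions from good episodes, 0 otherwise.
binary_q = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
# Fine-grained scales: hypothetical Q(s, a) estimates per transition.
real_q = np.array([10.0, 9.0, 8.0, 1.0, 0.5, 0.2])

print(pg_loss(logits, actions, binary_q))
print(pg_loss(logits, actions, real_q))
```

Note that with binary scales, transitions from bad episodes contribute nothing to the gradient, exactly as in the cross-entropy training procedure described above.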
The cross-entropy method worked even with those simple assumptions, but an obvious improvement is to use Q(s, a) for training instead of just 0 and 1. So why should this help? The answer is a more fine-grained separation of episodes. For example, transitions of an episode with a total reward of 10 should contribute to the gradient more than transitions from an episode with...