In the REINFORCE algorithm, a Monte Carlo rollout plays out the whole trajectory of an episode, and the policy is updated only afterward. However, the stochastic policy may take different actions at the same state in different episodes. This can confuse training, since one sampled experience may push the probability of choosing an action up while another pushes it down. To reduce this high-variance problem in vanilla REINFORCE, we will develop a variant, REINFORCE with baseline, in this recipe.
In REINFORCE with baseline, we subtract the baseline, the state-value V(s), from the return, G. As a result, we use an advantage function, A, in the gradient update, which is described as follows:

\[
A(s_t, a_t) = G_t - V(s_t), \qquad
\theta \leftarrow \theta + \alpha \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, A(s_t, a_t)
\]
Here, V(s) is the value function, which estimates the value of a given state. Typically, we can use a linear function...
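To make the update concrete, here is a minimal PyTorch-style sketch of one REINFORCE-with-baseline update on a single (fabricated) episode. The names (policy_net, value_net), the network sizes, and the hyperparameters are illustrative assumptions rather than the recipe's final implementation; the baseline could equally be a simple linear value estimator.

```python
import torch
import torch.nn as nn

# Illustrative setup (not the recipe's final code): a softmax policy network
# and a separate state-value network acting as the baseline.
state_dim, n_actions, gamma = 4, 2, 0.99

policy_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                           nn.Linear(64, n_actions), nn.Softmax(dim=-1))
value_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                          nn.Linear(64, 1))
policy_opt = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

# Fake episode data standing in for states, actions, and rewards collected by
# the stochastic policy; replace these with real rollouts from an environment.
states = torch.randn(5, state_dim)
actions = torch.randint(0, n_actions, (5,))
rewards = [1.0, 1.0, 1.0, 1.0, 0.0]

# Compute the discounted return G_t for every step of the episode.
returns, G = [], 0.0
for r in reversed(rewards):
    G = r + gamma * G
    returns.insert(0, G)
returns = torch.tensor(returns)

# Baseline: the state values V(s_t) predicted by the value network.
values = value_net(states).squeeze(-1)

# Advantage A_t = G_t - V(s_t); detach the baseline so the policy loss
# does not backpropagate into the value network.
advantages = returns - values.detach()

# Policy loss: -log pi(a_t|s_t) * A_t, summed over the episode.
log_probs = torch.log(policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1))
policy_loss = -(log_probs * advantages).sum()

# Value loss: regress V(s_t) toward the observed return G_t.
value_loss = nn.functional.mse_loss(values, returns)

policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()
value_opt.zero_grad(); value_loss.backward(); value_opt.step()
```

Note that subtracting V(s) shrinks the scale of the term multiplying the log-probability gradient without biasing it, since the baseline does not depend on the action taken; this is what reduces the variance of the policy gradient estimate.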