3. REINFORCE with baseline method
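The REINFORCE algorithm can be generalized by subtracting a baseline from the return, $\delta = R_t - B(s_t)$. The baseline function, $B(s_t)$, can be any function as long as it does not depend on the action $a_t$. The baseline does not alter the expectation of the performance gradient:
$$\nabla J(\theta) = \mathbb{E}_\pi\left[(R_t - B(s_t))\, \nabla_\theta \ln \pi(a_t \mid s_t, \theta)\right] = \mathbb{E}_\pi\left[R_t\, \nabla_\theta \ln \pi(a_t \mid s_t, \theta)\right] \quad \text{(Equation 10.3.1)}$$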
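Equation 10.3.1 implies that $\mathbb{E}_\pi\left[B(s_t)\, \nabla_\theta \ln \pi(a_t \mid s_t, \theta)\right] = 0$, since $B(s_t)$ is not a function of $a_t$: the baseline factors out of the sum over actions, leaving $B(s_t)\, \nabla_\theta \sum_a \pi(a \mid s_t, \theta) = B(s_t)\, \nabla_\theta 1 = 0$. While the introduction of a baseline does not change the expectation, a well-chosen baseline reduces the variance of the gradient updates. The reduction in variance generally accelerates learning.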
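The claim that a baseline leaves the expected gradient unchanged while lowering its variance can be checked numerically. The following is a minimal sketch, not the chapter's implementation: it assumes a single-state problem with three actions, a softmax policy, and made-up mean rewards, and all names (`grad_samples`, `mean_rewards`, and so on) are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_samples(theta, mean_rewards, baseline, n, rng):
    """Draw n single-step gradient samples (R - B) * grad ln pi(a | theta)."""
    pi = softmax(theta)
    actions = rng.choice(len(theta), size=n, p=pi)
    # Noisy return for each sampled action (assumed unit-variance noise).
    returns = mean_rewards[actions] + rng.normal(0.0, 1.0, size=n)
    # For a softmax policy, grad ln pi(a) w.r.t. the logits is onehot(a) - pi.
    grad_log_pi = np.eye(len(theta))[actions] - pi
    return (returns - baseline)[:, None] * grad_log_pi

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.1, 0.4])            # policy logits (assumed)
mean_rewards = np.array([10.0, 11.0, 12.0])   # made-up expected returns
pi = softmax(theta)
state_value = float(pi @ mean_rewards)        # V(s) under the current policy

# Exact expected gradient for this toy problem: pi * (r - V).
exact_grad = pi * (mean_rewards - state_value)

g_plain = grad_samples(theta, mean_rewards, 0.0, 200_000, rng)
g_base = grad_samples(theta, mean_rewards, state_value, 200_000, rng)

mean_plain, mean_base = g_plain.mean(axis=0), g_base.mean(axis=0)
var_plain, var_base = g_plain.var(axis=0).sum(), g_base.var(axis=0).sum()

print("mean without baseline:", mean_plain)
print("mean with baseline:   ", mean_base)
print("total variance without / with baseline:", var_plain, var_base)
```

Both sample means should agree with `exact_grad` up to Monte Carlo error, while the total variance with the baseline should be far smaller, because the large constant offset in the returns no longer scales every gradient sample.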
In most cases, we use the value function, $B(s_t) = V(s_t)$, as the baseline. If the return is overestimated, the scaling factor is proportionally reduced by the value function, resulting in a lower variance. The value function is also parameterized, $V(s_t) \rightarrow V(s_t, \theta_v)$, and is jointly trained with the policy network. In continuous action spaces, the state value can be a linear function of state features:
$$v_t = V(s_t, \theta_v) = \phi(s_t)^\top \theta_v$$
where $\phi(s_t)$ denotes the vector of state features.
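A linear value baseline of this kind can be sketched in a few lines. The example below is an illustration under stated assumptions, not the chapter's code: the feature map `features` (a bias term plus the raw state), the learning rate, and the synthetic returns are all hypothetical, and the value weights are fitted with a plain stochastic gradient step on the squared error between the return and the predicted value.

```python
import numpy as np

def features(state):
    """Hypothetical feature map phi(s): a bias term followed by the raw state."""
    return np.concatenate(([1.0], state))

def value(state, theta_v):
    """Linear state value V(s, theta_v) = phi(s) . theta_v."""
    return float(features(state) @ theta_v)

def baseline_update(state, ret, theta_v, lr_v=0.05):
    """Return delta = R_t - V(s_t, theta_v) and the updated value weights."""
    delta = ret - value(state, theta_v)
    # For a linear model, the gradient of V w.r.t. theta_v is phi(s).
    new_theta_v = theta_v + lr_v * delta * features(state)
    return delta, new_theta_v

rng = np.random.default_rng(0)
true_w = np.array([1.0, 2.0, -0.5])   # made-up weights that generate the returns
theta_v = np.zeros(3)
abs_deltas = []

for step in range(2000):
    s = rng.uniform(-1.0, 1.0, size=2)                      # synthetic state
    R = float(features(s) @ true_w) + rng.normal(0.0, 0.1)  # synthetic return
    delta, theta_v = baseline_update(s, R, theta_v)
    abs_deltas.append(abs(delta))

print("learned value weights:", theta_v)
print("mean |delta| first 100 steps:", np.mean(abs_deltas[:100]))
print("mean |delta| last 100 steps: ", np.mean(abs_deltas[-100:]))
```

As the value weights improve, `delta` shrinks toward the noise level of the returns. In a full agent, the same `delta` would also scale the policy gradient step, which is what makes the two sets of parameters jointly trained.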
Algorithm 10.3.1 summarizes the REINFORCE...