Actor-critic methods
Actor-critic methods propose further remedies to the high variance problem in policy gradient algorithms. Just like REINFORCE and other policy gradient methods, actor-critic algorithms have been around for decades. Combining this approach with deep reinforcement learning, however, has enabled them to solve more realistic RL problems. We start this section by presenting the ideas behind the actor-critic approach, and later we define them in more detail.
Further reducing the variance in policy-based methods
Remember that earlier, to reduce the variance in gradient estimates, we replaced the reward sum obtained in a trajectory with a reward-to-go term. Although this was a step in the right direction, it is usually not enough on its own. We now introduce two more methods to further reduce this variance.
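As a concrete illustration, here is a minimal sketch (assuming NumPy and a discount factor `gamma`; the function name and example rewards are made up for illustration) of how the per-step reward-to-go differs from weighting every step by the single trajectory-wide reward sum:

```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    """Discounted reward-to-go for every step of a single trajectory."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    # Work backwards: reward-to-go at step t is r_t + gamma * reward-to-go at t+1.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

rewards = np.array([1.0, 0.0, 2.0, 1.0])   # rewards collected in one rollout
print(rewards.sum())                        # full-trajectory return, same weight for every step
print(rewards_to_go(rewards, gamma=0.9))   # per-step weights used in the reward-to-go formulation
```

In the reward-to-go formulation, the gradient term for step `t` is weighted by `rtg[t]` rather than by the full return, so rewards earned before an action no longer contribute noise to that action's gradient.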
Estimating the reward-to-go
The reward-to-go term, $\hat{R}_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$, obtained in a trajectory is an estimate of the action-value $Q^{\pi_\theta}(s_t, a_t)$ under the existing policy $\pi_\theta$.
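To spell out the relationship (in generic notation that may differ slightly from the symbols used elsewhere in this book): the action-value is the expected reward-to-go under the policy, so each trajectory yields one noisy sample of it,

$$
Q^{\pi_\theta}(s_t, a_t) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} \;\middle|\; s_t, a_t\right] \approx \hat{R}_t.
$$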
Info
Notice the difference...