The Actor-Critic Method
In Chapter 11, Policy Gradients—an Alternative, we started to investigate a policy-based alternative to the familiar family of value-based methods. In particular, we focused on the method called REINFORCE and its modification that uses the discounted reward to obtain the gradient of the policy (which gives us the direction in which to improve it). Both methods worked well for the small CartPole problem, but for the more complicated Pong environment, convergence was painfully slow.
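As a reminder, both methods estimate the policy gradient from sampled experience as

$$\nabla J \approx \mathbb{E}\left[ Q(s, a)\, \nabla \log \pi(a \mid s) \right],$$

where $Q(s, a)$ is the discounted return of the action taken: it scales the gradient of the action's log probability, pushing the policy toward actions that earned more reward.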
Next, we will discuss another extension to the vanilla policy gradient method that magically improves its stability and convergence speed. Despite the modification being only minor, the new method has its own name, actor-critic, and it is one of the most powerful methods in deep reinforcement learning (RL).
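The key to this improvement is the baseline: subtracting a value $b$ from $Q(s, a)$ leaves the expected gradient unchanged (because $\mathbb{E}[\nabla \log \pi] = 0$ under the policy's own action distribution) but can shrink its variance dramatically. The following is a minimal NumPy sketch with purely synthetic numbers, not code from the book, demonstrating the effect with the simplest possible baseline, the mean return:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Synthetic per-sample score terms; their mean is zero, mirroring the fact
# that E[grad log pi] = 0 under the policy's own action distribution,
# which is why a constant baseline introduces no bias.
grad_log_pi = rng.normal(loc=0.0, scale=1.0, size=n)
# Synthetic discounted returns Q(s, a), deliberately large and positive
q_values = rng.normal(loc=10.0, scale=2.0, size=n)

baseline = q_values.mean()                       # simplest possible baseline
raw = grad_log_pi * q_values                     # REINFORCE-style terms
centered = grad_log_pi * (q_values - baseline)   # baseline-subtracted terms

print(f"mean: raw={raw.mean():7.4f}  centered={centered.mean():7.4f}")
print(f"var:  raw={raw.var():8.3f}  centered={centered.var():8.3f}")
# Both means are ~0 (no bias), while the variance drops from roughly
# var(g) * E[Q^2] (about 104 here) to roughly var(g) * var(Q) (about 4).
```

In the actor-critic method, this constant baseline becomes state-dependent: a second network head, the critic, learns $V(s)$, and the difference $Q(s, a) - V(s)$, the advantage, scales the gradient instead.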
In this chapter, we will:
- Explore how the baseline impacts the statistics (notably the variance) and the convergence of gradients
- Cover an extension...