Actor-Critic Method: A2C and A3C
In Chapter 11, we started to investigate a policy-based alternative to the familiar family of value-based methods. In particular, we focused on the method called REINFORCE and its modification, which uses the discounted reward to obtain the gradient of the policy (which gives us the direction in which to improve the policy). Both methods worked well on the small CartPole problem, but on the more complicated Pong environment, we got no convergence.
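As a quick refresher, both variants estimate the same standard policy gradient; in the notation we have been using, it has the form

\nabla J \approx \mathbb{E}\big[ Q(s, a) \, \nabla_{\theta} \log \pi_{\theta}(a \mid s) \big]

where in REINFORCE the scale Q(s, a) is simply the discounted total reward obtained from the episode.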
Here, we will discuss another extension to the vanilla policy gradient method, one that magically improves its stability and convergence speed. Despite the modification being only minor, the new method has its own name, actor-critic, and it is one of the most powerful methods in deep reinforcement learning (RL).
In this chapter, we will:
- Explore how the baseline impacts statistics and the convergence of gradients (a small numerical sketch follows this list)
- Cover an extension...
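To preview the first point, here is a minimal NumPy sketch, using a hypothetical two-action toy policy rather than any code from the book's examples, showing the key statistical fact: subtracting a constant baseline from the return leaves the expected policy gradient unchanged while sharply reducing its variance.

```python
# A toy numerical check (NumPy only, hypothetical setup) of the baseline idea:
# the score-function gradient sample is  d/dtheta log pi(a) * (Q(a) - b),
# and the choice of the constant b does not change the mean, only the variance.
import numpy as np

rng = np.random.default_rng(0)

# Toy policy: Bernoulli over two actions, parameterized by theta via a sigmoid.
theta = 0.3
p = 1.0 / (1.0 + np.exp(-theta))            # probability of taking action 1

# Hypothetical per-action returns; the large common offset inflates variance.
returns = {0: 10.0, 1: 11.0}

def grad_samples(n, baseline=0.0):
    """Draw n samples of d/dtheta log pi(a|theta) * (Q(a) - baseline)."""
    actions = rng.binomial(1, p, size=n)
    # For a Bernoulli(sigmoid(theta)) policy:
    #   d/dtheta log pi(1|theta) = 1 - p,   d/dtheta log pi(0|theta) = -p
    score = np.where(actions == 1, 1.0 - p, -p)
    q = np.where(actions == 1, returns[1], returns[0])
    return score * (q - baseline)

no_base = grad_samples(100_000, baseline=0.0)
with_base = grad_samples(100_000, baseline=np.mean(list(returns.values())))

print(f"mean without baseline: {no_base.mean():+.4f}")
print(f"mean with    baseline: {with_base.mean():+.4f}")   # ~same mean
print(f"var  without baseline: {no_base.var():.4f}")
print(f"var  with    baseline: {with_base.var():.4f}")     # much smaller
```

Running this, both estimators agree on the mean (analytically p(1-p)(Q(1)-Q(0)), about 0.24 here), but the variance drops by several orders of magnitude once the baseline removes the common offset in the returns; this variance reduction is exactly what the chapter builds on.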