Policy Gradients
In this first chapter of Part 3 of the book, we will consider an alternative way to handle Markov decision process (MDP) problems: a whole family of methods called policy gradient methods. In some situations, these methods work better than value-based methods, so it is important to be familiar with them.
In this chapter, we will:
- Cover an overview of the methods, their motivations, and their strengths and weaknesses in comparison to the already familiar Q-learning
- Start with a simple policy gradient method called REINFORCE and try to apply it to our CartPole environment, comparing it with the deep Q-network (DQN) approach (a small preview sketch follows this list)
- Discuss problems with the vanilla REINFORCE method and ways to address them with the Policy Gradient (PG) method, which is a step toward the much more advanced A3C method that we'll take a look at in the next chapter
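To give a flavor of what is ahead, here is a minimal PyTorch sketch of the core REINFORCE update. The network size, learning rate, and the placeholder batch are illustrative assumptions standing in for data collected from a CartPole episode; the full, working version is developed later in the chapter:

```python
import torch
import torch.nn as nn

# Tiny policy network for CartPole: 4 observation values -> 2 action logits
policy = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=0.01)

# Placeholder batch standing in for one collected episode:
# the states visited, the actions taken, and the discounted returns G_t
states = torch.randn(16, 4)
actions = torch.randint(0, 2, (16,))
returns = torch.randn(16)

# Core of REINFORCE: minimize -mean_t[ G_t * log pi(a_t | s_t) ],
# which pushes up the probability of actions that led to high returns
log_probs = torch.log_softmax(policy(states), dim=1)
log_prob_actions = log_probs.gather(1, actions.unsqueeze(-1)).squeeze(-1)
loss = -(returns * log_prob_actions).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In the real method, the returns are computed by discounting the rewards actually observed during the episode, which requires playing episodes to completion; this and related issues are exactly the problems with vanilla REINFORCE discussed later in the chapter.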