Policy Gradients – an Alternative
In this first chapter of part three of the book, we will consider an alternative way to handle Markov decision process (MDP) problems: a whole family of methods known as policy gradient methods.
In this chapter, we will:
- Give an overview of the methods, their motivations, and their strengths and weaknesses compared with the already familiar Q-learning
- Start with a simple policy gradient method called REINFORCE and apply it to our CartPole environment, comparing the results with the deep Q-network (DQN) approach