In this chapter, we learned about a new class of reinforcement learning algorithms called policy gradient methods. They approach the RL problem differently from the value-function methods studied in the previous chapters.
The simplest version of the PG method is called REINFORCE, which we studied, implemented, and tested over the course of this chapter. We then added a baseline to REINFORCE in order to reduce the variance of the gradient estimate and improve the convergence of the algorithm. Actor-critic (AC) algorithms use a critic to remove the need for a full trajectory, so we then solved the same problem using the AC model.
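As a brief recap (in a common notation that may differ slightly from the chapter's own symbols, with $G_t$ denoting the return from step $t$, $b(s_t)$ the baseline, and $V_w$ the critic's value estimate), the gradient estimators discussed above can be written as

$$\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big)\right],$$

where plain REINFORCE corresponds to $b(s_t) = 0$, and the actor-critic variant replaces $G_t - b(s_t)$ with the one-step advantage estimate $r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$, so that the update no longer requires a complete trajectory.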
With a solid foundation in the classic policy gradient algorithms, we can now go further. In the next chapter, we'll look at some more complex, state-of-the-art policy gradient algorithms; namely, Trust Region Policy Optimization...