Summary
We started off with policy gradient methods, which directly optimize the policy without requiring a Q function. We learned about policy gradients by solving the Lunar Lander game, and then we looked at DDPG, an actor-critic method that combines the benefits of both policy gradients and Q functions.
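As a quick refresher, the policy gradient theorem lets us write the gradient of the expected return directly in terms of the policy, with no learned Q function required. The sketch below uses standard REINFORCE notation ($R_t$ denotes the return from time step $t$), which may differ from the symbols used earlier in the chapter:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R_t\right]$$

Ascending this gradient increases the log probability of actions that led to high returns, which is the core idea behind the Lunar Lander agent.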
Then we looked at policy optimization algorithms such as TRPO, which ensures monotonic policy improvement by enforcing the constraint that the KL divergence between the old and the new policy is not greater than $\delta$.
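Written out, the TRPO update solves a constrained optimization problem. This is a sketch in the notation of the original TRPO paper ($\hat{A}_t$ is an advantage estimate; the symbols may differ from those used earlier in the chapter):

$$\max_\theta\; \mathbb{E}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}\,\hat{A}_t\right] \quad \text{subject to} \quad \mathbb{E}_t\left[D_{\mathrm{KL}}\big(\pi_{\theta_\text{old}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big)\right] \le \delta$$

Keeping the average KL divergence below the threshold $\delta$ is what keeps each update inside the trust region.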
We also looked at proximal policy optimization (PPO), which changes the constraint into a penalty term that discourages large policy updates (sketched at the end of this summary). In the next chapter, Chapter 19, Capstone Project – Car Racing Using DQN, we will see how to build an agent that can win a car racing game.
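For reference, here is the penalized objective that the PPO variant discussed above optimizes in place of TRPO's constrained one. Again, this is a sketch in the notation of the PPO paper, with $\beta$ as the KL penalty coefficient:

$$\max_\theta\; \mathbb{E}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}\,\hat{A}_t - \beta\, D_{\mathrm{KL}}\big(\pi_{\theta_\text{old}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big)\right]$$

Because the KL term now simply lowers the objective when the new policy drifts too far from the old one, the problem can be solved with an ordinary unconstrained optimizer such as SGD or Adam.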