Vanilla policy gradient
We begin our discussion of policy-based methods with the most fundamental algorithm: the vanilla policy gradient. Although this algorithm is rarely useful in realistic problem settings, understanding it is essential for building the intuition and theoretical background needed for the more complex algorithms we will cover later.
The objective in policy gradient methods
In value-based methods, we focused on finding good estimates of action values, from which we then derived policies. Policy gradient methods, on the other hand, directly optimize the policy with respect to the reinforcement learning objective (although we will still make use of value estimates). In case you don't remember, this objective is the expected discounted return:
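One common way to write it, assuming a policy $\pi_\theta$ with parameters $\theta$, trajectories $\tau = (s_0, a_0, s_1, a_1, \dots)$ sampled by following that policy, rewards $r_t$ at each step, and a discount factor $\gamma$, is

$$J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$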
This is a slightly more rigorous way of writing this objective compared to how we wrote it before. Let's unpack what we have here:
- The objective...