A recent publication noted that policy gradient methods are becoming increasingly popular. Their learning goal is to optimize the probability distribution over actions so that, given a state, a more rewarding action is assigned a higher probability. In the first recipe of this chapter, we will talk about the REINFORCE algorithm, which is the foundation of more advanced policy gradient methods.
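The REINFORCE algorithm is also known as the Monte Carlo policy gradient, as it optimizes the policy based on Monte Carlo methods. Specifically, it collects trajectory samples from one episode using its current policy and uses them to update the policy parameters, $\theta$. The learning objective function for policy gradients, in its standard episodic form, is as follows:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^{t} r_{t+1}\right]$$

where $\tau$ is a trajectory sampled under the policy $\pi_\theta$, $r_{t+1}$ is the reward received after the action taken at step $t$, $\gamma$ is the discount factor, and $T$ is the episode length.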
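The sampling step can be illustrated with a minimal, standard-library-only sketch. Everything named here is an assumption made for illustration and does not come from the recipe: the three-step `toy_step` environment, the tabular logits `theta[state][action]`, and the helper names. It rolls out episodes with the current softmax policy and averages their discounted returns, which is a Monte Carlo estimate of the expected discounted return.

```python
import math
import random


def softmax(logits):
    """Turn a list of logits into action probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_step(state, action):
    """Hypothetical 3-step environment: action 1 pays 1.0, action 0 pays 0.0."""
    reward = 1.0 if action == 1 else 0.0
    next_state = state + 1
    done = next_state == 3
    return next_state, reward, done


def sample_episode(theta, rng):
    """Roll out one episode with the current policy.

    Returns a list of (state, action, reward) tuples.
    """
    state, done, trajectory = 0, False, []
    while not done:
        probs = softmax(theta[state])
        action = rng.choices(range(len(probs)), weights=probs)[0]
        next_state, reward, done = toy_step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory


def discounted_return(trajectory, gamma):
    """Sum of gamma**t * reward over one sampled trajectory."""
    return sum(gamma ** t * r for t, (_, _, r) in enumerate(trajectory))


def estimate_objective(theta, gamma=0.9, n_episodes=2000, seed=0):
    """Monte Carlo estimate of the expected discounted return under theta."""
    rng = random.Random(seed)
    total = sum(
        discounted_return(sample_episode(theta, rng), gamma)
        for _ in range(n_episodes)
    )
    return total / n_episodes
```

With all logits at zero the policy is uniform, so calling `estimate_objective([[0.0, 0.0] for _ in range(3)])` should land near 0.5 × (1 + 0.9 + 0.81) = 1.355, up to sampling noise.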
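Its gradient can be derived as follows:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^{t}\, G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$

Here, $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_{k+1}$ is the return, which is the cumulative discounted reward (with discount factor $\gamma$) collected from time $t$ onward...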
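To make the update concrete, here is a minimal sketch of one REINFORCE step for a tabular softmax policy, written with the standard library only rather than a deep learning framework. The names `compute_returns`, `reinforce_update`, `toy_step`, and `train`, the toy environment, and the hyperparameters are all assumptions made for illustration. The sketch treats the return as the discounted reward from each step onward and keeps the `gamma ** t` weight of the exact discounted gradient; many practical implementations drop that weight.

```python
import math
import random


def softmax(logits):
    """Turn a list of logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def compute_returns(rewards, gamma):
    """Return G_t for every step, using G_t = r_{t+1} + gamma * G_{t+1}."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def reinforce_update(theta, episode, gamma, lr):
    """One gradient-ascent step on tabular logits theta[state][action].

    For a softmax policy, d log pi(a|s) / d theta[s][b] = 1[a == b] - pi(b|s).
    """
    rewards = [r for _, _, r in episode]
    returns = compute_returns(rewards, gamma)
    # Accumulate the full episode gradient before touching theta.
    grads = [[0.0] * len(row) for row in theta]
    for t, ((state, action, _), g_t) in enumerate(zip(episode, returns)):
        probs = softmax(theta[state])
        for b, p in enumerate(probs):
            grad_log_pi = (1.0 if b == action else 0.0) - p
            grads[state][b] += (gamma ** t) * g_t * grad_log_pi
    for s, row in enumerate(grads):
        for b, g in enumerate(row):
            theta[s][b] += lr * g


def toy_step(state, action):
    """Hypothetical 3-step environment: action 1 pays 1.0, action 0 pays 0.0."""
    reward = 1.0 if action == 1 else 0.0
    next_state = state + 1
    return next_state, reward, next_state == 3


def train(n_episodes=2000, gamma=0.9, lr=0.1, seed=0):
    """Sample one episode per iteration and apply one REINFORCE update."""
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(3)]
    for _ in range(n_episodes):
        state, done, episode = 0, False, []
        while not done:
            probs = softmax(theta[state])
            action = rng.choices([0, 1], weights=probs)[0]
            next_state, reward, done = toy_step(state, action)
            episode.append((state, action, reward))
            state = next_state
        reinforce_update(theta, episode, gamma, lr)
    return theta
```

Calling `train()` and then `softmax(theta[s])` for each state shows how the probability of the rewarding action grows with training. No baseline is subtracted from the return here, so the updates are noisy; that choice keeps the sketch close to the plain algorithm.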