The first algorithm we will look at is known as REINFORCE. It introduces the concept of PG in a very elegant manner, especially in PyTorch, which hides much of the mathematical complexity of the implementation. REINFORCE also solves the optimization problem in reverse: instead of maximizing the objective with gradient ascent, it negates the objective so that the problem can be expressed as a loss function and minimized with gradient descent. The update equation now transforms to the following:
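In common textbook notation, the REINFORCE update with a baseline and its loss form can be sketched as below. The symbols are assumptions for illustration and not necessarily this chapter's own: $\theta$ are the policy parameters, $\alpha$ is the learning rate, $\pi_\theta(a \mid s)$ is the probability of action $a$ in state $s$, and $A(s, a)$ is the advantage, treated as a constant with respect to $\theta$.

```latex
\begin{aligned}
\theta &\leftarrow \theta + \alpha \, A(s, a) \, \nabla_\theta \log \pi_\theta(a \mid s)
  && \text{(gradient ascent on the objective)} \\
L(\theta) &= -A(s, a) \, \log \pi_\theta(a \mid s),
  \qquad \theta \leftarrow \theta - \alpha \, \nabla_\theta L(\theta)
  && \text{(the same step, as gradient descent on a loss)}
\end{aligned}
```

Because $\nabla_\theta L(\theta) = -A(s, a) \, \nabla_\theta \log \pi_\theta(a \mid s)$, subtracting $\alpha \nabla_\theta L$ gives exactly the ascent step on the first line.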
Here, we now assume the following:
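- The advantage term measures how much better the sampled return is than a baseline; we will get to the advantage function in more detail shortly.
- The gradient term, now expressed as a loss, is the gradient of the log-probability of the chosen action, $\nabla_\theta \log \pi_\theta(a \mid s)$. It is equivalent to $\nabla_\theta \pi_\theta(a \mid s) / \pi_\theta(a \mid s)$ by the chain rule, because the derivative of $\log x$ is $1/x$.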
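The two points above can be checked numerically with a minimal sketch in plain Python. It assumes a softmax policy over three actions with one logit per action. The logits, the sampled action, the advantage value, and all function names are hypothetical and chosen only for illustration. It is not the chapter's PyTorch implementation.

```python
import math

def softmax(logits):
    """Turn logits into action probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_loss(logits, action, advantage):
    """Loss form of the update: -advantage * log pi(action)."""
    return -advantage * math.log(softmax(logits)[action])

def reinforce_loss_grad(logits, action, advantage):
    """Analytic gradient of the loss with respect to each logit.

    For a softmax policy, d/dz_k log pi(action) = 1[k == action] - pi(k).
    """
    probs = softmax(logits)
    return [-advantage * ((1.0 if k == action else 0.0) - p)
            for k, p in enumerate(probs)]

def descent_step(logits, action, advantage, lr=0.1):
    """One gradient-descent step on the loss."""
    grad = reinforce_loss_grad(logits, action, advantage)
    return [z - lr * g for z, g in zip(logits, grad)]

def finite_diff(f, logits, k, eps=1e-6):
    """Central finite-difference estimate of df/dz_k."""
    up, down = list(logits), list(logits)
    up[k] += eps
    down[k] -= eps
    return (f(up) - f(down)) / (2 * eps)

# Hypothetical values, for illustration only.
logits = [0.2, -0.1, 0.5]
action, advantage = 1, 2.0

# Log-derivative identity: grad log pi == grad pi / pi, checked per logit.
pi_a = softmax(logits)[action]
for k in range(len(logits)):
    grad_log = finite_diff(lambda z: math.log(softmax(z)[action]), logits, k)
    grad_pi = finite_diff(lambda z: softmax(z)[action], logits, k)
    print(k, round(grad_log, 6), round(grad_pi / pi_a, 6))

# With a positive advantage, descending the loss raises pi(action).
new_logits = descent_step(logits, action, advantage)
print(pi_a < softmax(new_logits)[action])
```

In PyTorch, the analytic gradient would normally come from autograd, so only the loss itself would need to be written out.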
Essentially, we flip the equation using the chain rule and the fact that the derivative of log x is 1/x. Again...