Policy gradients with REINFORCE algorithms
The idea behind Policy Gradients (PG) / REINFORCE algorithms is very simple: it consists of re-using the classification loss function for reinforcement learning tasks.
Let's remember that the classification loss is given by the negative log-likelihood, and that minimizing it with gradient descent means following the derivative of the negative log-likelihood with respect to the network weights:
![Policy gradients with REINFORCE algorithms](https://static.packt-cdn.com/products/9781786465825/graphics/graphics/B05525_11_18.jpg)
Here, y is the selected action, p(y | X, θ) is the predicted probability of this action given the inputs X, and θ are the network weights.
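As a rough illustration, here is a minimal NumPy sketch (not the book's code; the variable names and values are made up) of this negative log-likelihood and its gradient for a softmax output layer:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the action/class scores
    e = np.exp(logits - logits.max())
    return e / e.sum()

def nll_and_grad(logits, y):
    """Return -log p(y | X, theta) and its gradient w.r.t. the logits."""
    p = softmax(logits)
    loss = -np.log(p[y])
    grad = p - np.eye(len(p))[y]   # d(-log p[y]) / d(logits) = p - one_hot(y)
    return loss, grad

logits = np.array([1.0, 2.0, 0.5])   # hypothetical network outputs for 3 classes
loss, grad = nll_and_grad(logits, y=1)
```

Backpropagating this gradient through the network gives the derivative with respect to the weights shown above.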
The REINFORCE theorem introduces the reinforcement learning equivalent, where r is the reward. The following derivative:
![Policy gradients with REINFORCE algorithms](https://static.packt-cdn.com/products/9781786465825/graphics/graphics/B05525_11_19.jpg)
represents an unbiased estimate of the derivative of the expected reward with respect to the network weights:
![Policy gradients with REINFORCE algorithms](https://static.packt-cdn.com/products/9781786465825/graphics/graphics/B05525_11_20.jpg)
So, following this derivative during training encourages the agent to maximize its reward.
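The following sketch (again plain NumPy with illustrative values, not the book's code) shows how the REINFORCE estimate is just the classification gradient weighted by the reward, once the action has been sampled from the policy:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
logits = np.array([1.0, 2.0, 0.5])   # policy network outputs for 3 legal actions

p = softmax(logits)
action = rng.choice(len(p), p=p)     # sample an action from the policy
reward = 1.0                         # reward returned by the environment (assumed value)

# Gradient of  -reward * log p(action)  w.r.t. the logits:
grad = reward * (p - np.eye(len(p))[action])
logits -= 0.1 * grad                 # descent step that increases the expected reward
```

Averaged over actions sampled from the policy, this reward-weighted gradient is the unbiased estimate of the derivative of the expected reward described above.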
Such a gradient descent enables us to optimize a policy network for our agents: a policy is a probability distribution over legal actions, to sample actions to...