Policy gradients with REINFORCE algorithms
The idea behind Policy Gradient (PG) / REINFORCE algorithms is very simple: it consists in reusing the classification loss function for reinforcement learning tasks.
Let's remember that the classification loss is given by the negative log-likelihood, and minimizing it with gradient descent follows the derivative of the negative log-likelihood with respect to the network weights:

$$\nabla_\theta \mathcal{L}(\theta) = -\nabla_\theta \log p(y \mid X, \theta)$$

Here, y is the selected action, and p(y | X, θ) is the predicted probability of this action given inputs X and weights θ.
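As a quick illustration, here is a minimal sketch of this loss and its gradient for a single example. The linear-softmax model, NumPy implementation, and all names are assumptions for illustration, not from the text:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()

def nll_and_grad(weights, x, y):
    """Negative log-likelihood -log p(y | X, theta) of the selected
    class y under a linear-softmax model, and its weight gradient."""
    probs = softmax(weights @ x)       # p(. | X, theta)
    loss = -np.log(probs[y])           # -log p(y | X, theta)
    # Gradient of -log p(y|X) w.r.t. the logits is (probs - onehot(y));
    # the chain rule through the linear layer gives the weight gradient.
    d_logits = probs.copy()
    d_logits[y] -= 1.0
    grad = np.outer(d_logits, x)
    return loss, grad

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4)) * 0.1      # 3 classes/actions, 4 input features
x = rng.normal(size=4)
loss, grad = nll_and_grad(W, x, y=1)
W -= 0.1 * grad                        # one gradient-descent step
```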
The REINFORCE theorem introduces the equivalent for reinforcement learning, where r is the reward. The following derivative:

$$r \, \nabla_\theta \log p(y \mid X, \theta)$$

represents an unbiased estimate of the derivative of the expected reward with respect to the network weights:

$$\nabla_\theta \, \mathbb{E}[r]$$
So, following this derivative with gradient ascent encourages the agent to maximize the reward.
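To make the parallel with the classification loss concrete, here is a minimal REINFORCE update sketch that weights the log-likelihood gradient by the reward and ascends it. The NumPy implementation, the environment's `step` function, and all other names are hypothetical:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_step(weights, x, step, lr=0.1):
    """One REINFORCE update: sample an action from the policy, observe a
    reward r, and follow r * grad log p(y | X, theta), the unbiased
    estimate of the expected-reward gradient."""
    probs = softmax(weights @ x)                  # policy p(. | X, theta)
    y = np.random.choice(len(probs), p=probs)     # sample an action
    r = step(y)                                   # reward from the environment
    # Gradient of log p(y|X) w.r.t. the logits is (onehot(y) - probs).
    d_logits = -probs
    d_logits[y] += 1.0
    grad_log_p = np.outer(d_logits, x)
    weights += lr * r * grad_log_p                # gradient *ascent* on reward
    return weights
```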
Such a gradient descent enables us to optimize a policy network for our agents: a policy is a probability distribution over legal actions, used to sample the actions to...