Policy gradients for learning policy functions
The problem policy gradients aim to solve is a more general version of the reinforcement learning problem: how to use backpropagation on a task that has no gradient from the reward back to our parameters. To make this concrete, suppose we have a neural network that produces the probability of taking an action a, given a state s and some parameters θ, which are the weights of our neural network:

π(a | s, θ)
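As a minimal sketch of what such a network might look like (not taken from the original text), here is a small two-layer policy in plain NumPy; the state size, hidden width, and action count are illustrative choices:

import numpy as np

def policy(state, theta):
    """Return action probabilities pi(a | s, theta) via a softmax output."""
    W1, b1, W2, b2 = theta
    h = np.tanh(state @ W1 + b1)          # hidden layer
    logits = h @ W2 + b2                  # one logit per action
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

# Illustrative sizes: a 4-dimensional state and 2 possible actions.
rng = np.random.default_rng(0)
theta = (rng.normal(size=(4, 16)), np.zeros(16),
         rng.normal(size=(16, 2)), np.zeros(2))
probs = policy(rng.normal(size=4), theta)
a = rng.choice(len(probs), p=probs)       # sample an action from the policy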
We also have our reward signal R. The actions we take affect the reward we receive, but there is no gradient linking R to the parameters. There is no equation into which we can plug R; it is just a value we obtain from our environment in response to a.
However, given that we know there is a link between the a we choose and R, there are a few things we could try. We could sample a range of values for θ from a Gaussian distribution and run each of them in the environment. We could then select a percentage of the most successful candidates, fit a new Gaussian to them, and repeat the process, as sketched below.
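This gradient-free loop is essentially a cross-entropy-style search. Here is a minimal sketch under some assumptions: run_episode is a hypothetical stand-in for however you evaluate a parameter vector in the environment and collect R, and the population size and elite fraction are illustrative defaults:

import numpy as np

def cross_entropy_search(run_episode, dim, iters=50,
                         pop_size=100, elite_frac=0.2):
    """Gradient-free search: sample theta from a Gaussian, keep the
    best-performing candidates, refit the Gaussian to them, repeat."""
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = int(pop_size * elite_frac)
    rng = np.random.default_rng(0)
    for _ in range(iters):
        # Draw a population of candidate parameter vectors.
        thetas = rng.normal(mean, std, size=(pop_size, dim))
        # Evaluate each candidate in the environment (hypothetical helper).
        rewards = np.array([run_episode(t) for t in thetas])
        # Keep the top elite_frac of candidates by reward...
        elite = thetas[np.argsort(rewards)[-n_elite:]]
        # ...and refit the Gaussian to them for the next round.
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

Note that this treats the environment as a black box: it never needs a gradient of R, only the ability to compare how well different parameter vectors perform.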