All of the RL algorithms we have discussed so far have tried to learn the state- or action-value functions. For example, in Q-learning we usually follow an ε-greedy policy, which has no parameters (OK, it has one parameter) and relies on the value function instead. In this section, we'll discuss something new: how to approximate the policy itself with the help of policy gradient methods. We'll follow an approach similar to the one in Chapter 8, Reinforcement Learning Theory, in the Value function approximation section.
There, we introduced a value approximation function, which is described by a set of parameters w (neural net weights). Here, we'll introduce a parameterized policy, π_θ, which is described by a set of parameters θ. As with value function approximation, θ could be the weights of a neural network.
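To make this concrete, the following is a minimal sketch (using PyTorch; the class name, layer sizes, and the toy state/action dimensions are illustrative assumptions, not code from this book) of a policy parameterized by neural network weights θ: the network maps a state to a probability distribution over actions, and we act by sampling from that distribution rather than by consulting a value function.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """A parameterized policy π_θ(a|s); the network weights play the role of θ."""
    def __init__(self, state_size, action_size, hidden_size=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, action_size),
            nn.Softmax(dim=-1),  # output one probability per action
        )

    def forward(self, state):
        # Returns π_θ(a|s): a probability distribution over the actions
        return self.net(state)

# Hypothetical usage: sample an action from the policy's distribution
policy = PolicyNetwork(state_size=4, action_size=2)
state = torch.rand(4)
action_probs = policy(state)
action = torch.distributions.Categorical(action_probs).sample()
```

Note that, unlike an ε-greedy policy, this policy is stochastic by construction: the softmax output defines the action probabilities directly, and the gradient of the expected return with respect to θ is what policy gradient methods will estimate.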
Recall that we use the notation...