In reinforcement learning, we cannot backpropagate an error through the network directly, because there is no ground truth for each step; feedback arrives only occasionally. This is why we need the policy gradient to propagate rewards back to the network. The rules that determine the best action are called a policy, and the network that learns this policy is called a policy network. It can be any type of network, for example a simple two-layer FNN or a CNN; the more complex the environment, the more you will benefit from a more complex network. When using a policy gradient, we sample an action from the output distribution of our policy network. Because the reward is not always directly available, we initially treat the sampled action as correct. Later, we use the discounted reward as a scalar weight and backpropagate it to the network weights.
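A minimal sketch of this idea, using only NumPy and a linear softmax policy in place of a full network (the function and variable names here are illustrative, not from any particular library): we sample actions from the policy's output distribution, compute discounted returns once rewards arrive, and scale each action's log-likelihood gradient by its return.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def discount(rewards, gamma=0.99):
    # Propagate sparse, delayed feedback back in time as discounted returns.
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

# Tiny linear "policy network": logits = W @ state (a stand-in for an FNN/CNN).
n_states, n_actions = 4, 2
W = rng.normal(scale=0.1, size=(n_actions, n_states))

def policy_gradient_step(states, actions, rewards, lr=0.01, gamma=0.99):
    """One REINFORCE-style update: treat each sampled action as 'correct',
    then weight its log-likelihood gradient by the discounted return."""
    global W
    returns = discount(rewards, gamma)
    grad = np.zeros_like(W)
    for s, a, G in zip(states, actions, returns):
        probs = softmax(W @ s)
        # Gradient of log pi(a|s) for a softmax-linear policy:
        dlog = -np.outer(probs, s)
        dlog[a] += s
        grad += G * dlog
    W += lr * grad  # gradient ascent on expected return

# One sampled episode: draw actions from the policy's output distribution.
states = [rng.normal(size=n_states) for _ in range(3)]
actions = [rng.choice(n_actions, p=softmax(W @ s)) for s in states]
rewards = [0.0, 0.0, 1.0]  # feedback only at the end
policy_gradient_step(states, actions, rewards)
```

Note the `discount` step: it is what turns a single terminal reward into a training signal for every earlier action, which is exactly how the policy gradient propagates sparse feedback back through the episode.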
In the following recipe, we...