Policy gradient
The policy gradient is a family of reinforcement learning (RL) algorithms in which we directly optimize the policy, parameterized by some parameter $\theta$. So far, we have used the Q function to find the optimal policy. Now we will see how to find the optimal policy without the Q function. First, let's define the policy function as $\pi(a \mid s)$, that is, the probability of taking an action a given the state s. We parameterize the policy via a parameter $\theta$ as $\pi(a \mid s; \theta)$, which allows us to determine the best action in a state.
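To make the idea of a parameterized policy concrete, here is a minimal sketch of one common choice, a softmax policy over linear action preferences. This form is an assumption chosen for illustration; the text does not prescribe it. The names `softmax`, `policy`, and `sample_action`, the layout of `theta` (one weight vector per action), and the example numbers are all hypothetical.

```python
import math
import random


def softmax(preferences):
    """Turn a list of real-valued preferences into probabilities."""
    m = max(preferences)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]


def policy(theta, state):
    """Return the probability of each action given the state and theta.

    theta: one weight vector per action (hypothetical layout).
    state: list of state features.
    """
    # Each action's preference is the dot product of its weights with the state.
    preferences = [
        sum(w * x for w, x in zip(weights, state)) for weights in theta
    ]
    return softmax(preferences)


def sample_action(theta, state, rng=random):
    """Draw an action index according to the policy's probabilities."""
    probs = policy(theta, state)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical example: 2 state features, 3 actions.
theta = [[0.0, 0.0], [1.0, -1.0], [-1.0, 1.0]]
state = [0.5, 0.2]

probs = policy(theta, state)          # probabilities over the 3 actions
action = sample_action(theta, state)  # an index in {0, 1, 2}
```

Changing `theta` changes the probabilities the policy assigns to each action, which is exactly the handle a policy gradient method adjusts. No Q function appears anywhere in this sketch.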
The policy gradient method has several advantages. In particular, it can handle continuous action spaces, where the number of actions and states is infinite. Say we are building a self-driving car. The car should be driven without hitting any other vehicle. We get a negative reward when the car hits a vehicle and a positive reward when it does not. We update our model parameters in such a way that we receive only a positive reward so that our car will...