Policy optimization methods are an alternative to Q-learning and value function approximation. Instead of learning Q-values for state/action pairs, these methods learn a policy π that maps a state directly to an action, updating the policy's parameters by gradient ascent on the expected return. Fundamentally, policy methods learn the correct behavior by adjusting a stochastic distribution over possible actions. Our network architecture therefore changes a bit: rather than estimating a value for each action, the network outputs the policy directly.
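As an illustration, here is a minimal sketch of such a network, assuming PyTorch and a discrete action space; the class name PolicyNetwork, the layer sizes, and the placeholder state are illustrative choices, not a prescribed architecture.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


class PolicyNetwork(nn.Module):
    """Maps a state vector to a probability distribution over discrete actions."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> Categorical:
        # A softmax over the logits defines the stochastic policy pi(a | s).
        logits = self.net(state)
        return Categorical(logits=logits)


# The policy is a distribution, not a value estimate: we sample actions from it.
policy = PolicyNetwork(state_dim=4, n_actions=2)
state = torch.randn(4)                 # placeholder state vector
dist = policy(state)
action = dist.sample()                 # stochastic action choice
log_prob = dist.log_prob(action)       # kept for the gradient update later
```

The key design difference from a Q-network is the output: probabilities over actions rather than one value per action, so exploration comes for free from sampling.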
Because every state maps to a distribution over possible actions rather than a single best action, the optimization problem becomes easier: we no longer have to compute exact reward estimates for specific actions. Recall that reinforcement learning methods rely on the concept of an episode. In the case of deep reinforcement learning, each episode represents a game or task, while trajectories represent plays...
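To make the episode/trajectory idea concrete, here is a small sketch of how a single trajectory's rewards and log-probabilities would feed a REINFORCE-style policy-gradient update; the discount factor, placeholder log-probabilities, and reward values are illustrative assumptions rather than quantities from the text.

```python
import torch

gamma = 0.99  # discount factor (illustrative value)

# A trajectory is the sequence of (log pi(a_t | s_t), r_t) pairs collected
# while playing one episode with the current policy.
log_probs = [torch.tensor(-0.7, requires_grad=True) for _ in range(3)]  # placeholders
rewards = [1.0, 0.0, 1.0]                                               # placeholders

# Discounted return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
returns, g = [], 0.0
for r in reversed(rewards):
    g = r + gamma * g
    returns.insert(0, g)
returns = torch.tensor(returns)

# REINFORCE objective: increase the log-probability of actions in proportion
# to the return that followed them; minimizing the negative sum does this.
loss = -(torch.stack(log_probs) * returns).sum()
loss.backward()   # gradients flow into the policy parameters
```

In practice the log-probabilities would come from the policy network during the rollout, and an optimizer step would follow the backward pass.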