The algorithms developed so far have been value-based; at their core, they learn a value function, V(s), or an action-value function, Q(s, a). A value function estimates the total reward that can be accumulated from a given state or state-action pair. An action can then be selected based on the estimated action (or state) values.
Therefore, a greedy policy can be defined as follows:

$$\pi(s) = \operatorname*{arg\,max}_{a} Q(s, a)$$
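As a minimal sketch of this idea in a tabular setting (the names `Q` and `greedy_policy`, and the random Q-values, are hypothetical and only for illustration), the greedy policy reduces to an argmax over the action dimension of a Q-table:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular action-value function: Q[s, a] holds the
# estimated return for taking action a in state s.
n_states, n_actions = 5, 3
Q = rng.random((n_states, n_actions))

def greedy_policy(state):
    """Select the action with the highest estimated value in `state`."""
    return int(np.argmax(Q[state]))

print(greedy_policy(0))  # index of the highest-valued action in state 0
```

With a small, discrete set of actions, this maximization is a trivial scan over a handful of entries.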
Value-based methods, when combined with deep neural networks, can learn very sophisticated policies to control agents operating in high-dimensional spaces. Despite these great qualities, they struggle with problems that have a large number of actions, or a continuous action space.
In such cases, the maximum operation is not feasible: computing an argmax over a continuous (or very large) set of actions would require solving an optimization problem at every step. Policy gradient (PG) algorithms exhibit incredible potential in such contexts...
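To make the contrast concrete, here is a minimal sketch of a parameterized stochastic policy for a continuous action space (all names, such as `theta` and `sample_action`, and the linear-Gaussian parameterization, are assumptions for illustration, not a specific PG algorithm). Actions are sampled from a distribution whose parameters the policy outputs, so no maximization over actions is ever required:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear-Gaussian policy for a continuous action space:
# the mean of the action distribution is a linear function of the
# state, and actions are sampled rather than chosen via argmax.
state_dim, action_dim = 4, 2
theta = rng.normal(scale=0.1, size=(state_dim, action_dim))  # policy parameters
log_std = np.zeros(action_dim)                               # log std dev of the policy

def sample_action(state):
    """Sample a continuous action from the Gaussian policy pi(a|s)."""
    mean = state @ theta
    return rng.normal(mean, np.exp(log_std))

state = rng.normal(size=state_dim)
print(sample_action(state))  # a continuous action vector; no max over actions
```

In a PG algorithm, the parameters `theta` and `log_std` would be adjusted directly to increase the expected return, which is what makes this family of methods a natural fit for continuous control.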