Revisiting Off-Policy Methods
One of the challenges with policy-based methods is that they are on-policy, which means new samples must be collected after every policy update. If collecting samples from the environment is costly, training on-policy methods can be very expensive. On the other hand, the value-based methods we covered in the previous chapter are off-policy, but they only work with discrete action spaces. Therefore, there is a need for a class of methods that are off-policy and work with continuous action spaces. In this section, we cover such algorithms, starting with the first one: Deep Deterministic Policy Gradient.
DDPG: Deep Deterministic Policy Gradient
DDPG, in some sense, is an extension of deep Q-learning to continuous action spaces. Remember that deep Q-learning methods learn a representation for the action values, Q(s, a). The best action in a given state s is then given by arg max_a Q(s, a). Now, if the action space is continuous, learning the action-value representation...
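To make the idea concrete, here is a minimal sketch of how DDPG sidesteps the arg max over a continuous action space. PyTorch, the network sizes, and names such as Actor and Critic are illustrative assumptions, not the exact implementation used later: with discrete actions the greedy action is a simple argmax over Q-values, whereas here a deterministic actor network is trained to output the action that (approximately) maximizes the learned Q-function.

```python
import torch
import torch.nn as nn

# Critic: approximates Q(s, a) for a continuous action, taking state and action together.
class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Actor: a deterministic policy mu(s) that outputs a continuous action in [-max_action, max_action].
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

state_dim, action_dim = 3, 1
critic, actor = Critic(state_dim, action_dim), Actor(state_dim, action_dim)
state = torch.randn(8, state_dim)  # a batch of example states

# Discrete case: the greedy action is a simple argmax over a finite set of Q-values.
# Continuous case: arg max_a Q(s, a) is itself an optimization problem, so DDPG instead
# trains the actor to output the (approximately) Q-maximizing action directly, by
# ascending the critic's value of the actor's action.
actor_loss = -critic(state, actor(state)).mean()
actor_loss.backward()  # gradients flow through the critic into the actor's parameters
```

In a full agent, this actor update would alternate with a Q-learning-style critic update computed from a replay buffer, which is what makes the method off-policy.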