In this chapter, we were introduced to our first continuous-action RL algorithm, DDPG, which also happens to be the first Actor-Critic algorithm in this book. DDPG is an off-policy algorithm, which is what allows it to learn from a replay buffer. We also covered the use of policy gradients to update the actor, and the use of an L2 (mean squared) loss between the critic's Q-value estimates and the TD targets to update the critic. Thus, we have two different neural networks: the actor learns the policy, and the critic learns to evaluate the actor's policy, thereby providing a learning signal to the actor. You saw how to compute the gradient of the state-action value, Q(s,a), with respect to the action, as well as the gradient of the policy with respect to its parameters; these are combined via the chain rule to obtain the deterministic policy gradient, which is then used to update the actor. We trained DDPG on the inverted pendulum problem, and the agent learned to solve it very well.
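To make the two updates concrete, here is a minimal sketch, assuming PyTorch; the network sizes, optimizer settings, and helper names are illustrative assumptions and not the chapter's actual code. The critic is fitted to TD targets with an L2 (mean squared) loss, while the actor is updated by ascending Q(s, μ(s)), which backpropagation expands into the gradient of Q with respect to the action chained with the gradient of the policy:

```python
# Minimal DDPG update sketch (illustrative only; frameworks and shapes assumed).
import torch
import torch.nn as nn

state_dim, action_dim = 3, 1   # e.g. the inverted pendulum task

# Hypothetical actor (policy) and critic (Q-value) networks.
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(states, actions, targets):
    """One DDPG update from a sampled replay-buffer batch.

    `targets` stands in for the TD targets r + gamma * Q'(s', mu'(s'))
    computed with target networks (omitted here for brevity).
    """
    # Critic: minimize the L2 (mean squared) error to the TD targets.
    q_values = critic(torch.cat([states, actions], dim=1))
    critic_loss = nn.functional.mse_loss(q_values, targets)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: maximize Q(s, mu(s)). Backpropagating through the critic
    # chains grad_a Q(s, a) with grad_theta mu(s) -- the deterministic
    # policy gradient described above.
    actor_loss = -critic(torch.cat([states, actor(states)], dim=1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

# Example call with a random batch of 32 transitions.
batch = 32
update(torch.randn(batch, state_dim),
       torch.randn(batch, action_dim),
       torch.randn(batch, 1))
```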
We have come a long way in this chapter...