Twin delayed DDPG
Now, we will look into another interesting actor-critic algorithm, known as twin delayed DDPG (TD3). TD3 is an improvement on, and effectively a successor to, the DDPG algorithm we just covered.
In the previous section, we learned how DDPG uses a deterministic policy to act in continuous action spaces. DDPG has several advantages and has been successfully applied in a variety of continuous action space environments.
We also learned that DDPG is an actor-critic method in which the actor is a policy network that finds the optimal policy, while the critic evaluates the policy produced by the actor by estimating the Q function with a DQN.
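To recap that structure concretely, the following is a minimal sketch of DDPG-style actor and critic networks. It assumes PyTorch, and the class names, layer sizes, and constructor parameters are illustrative choices rather than anything prescribed by the algorithm:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy network: maps a state to a continuous action."""
    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state):
        # tanh keeps the raw output in [-1, 1]; scale it to the action bounds
        return self.max_action * self.net(state)

class Critic(nn.Module):
    """Q network: estimates Q(s, a) for a state-action pair."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=1))
```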
One of the problems with DDPG is that the critic overestimates the target Q value. This overestimation causes several issues. Since the policy is improved based on the Q value given by the critic, an approximation error in that Q value destabilizes policy learning, and the policy may converge to a poor local optimum.
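To make the issue concrete, here is a small sketch (again assuming PyTorch; the function names and signatures are illustrative) of the bootstrapped target that DDPG's critic is trained toward, alongside the twin-critic variant that TD3 is known for, where taking the minimum of two independent Q estimates counteracts the overestimation bias. TD3's target policy smoothing noise is omitted here for brevity:

```python
import torch

def ddpg_target(reward, next_state, done, gamma,
                target_actor, target_critic):
    # Single-critic target: any overestimation in the critic is
    # propagated through this bootstrapped estimate at every update.
    with torch.no_grad():
        next_action = target_actor(next_state)
        next_q = target_critic(next_state, next_action)
        return reward + gamma * (1.0 - done) * next_q

def twin_critic_target(reward, next_state, done, gamma,
                       target_actor, target_critic_1, target_critic_2):
    # Twin-critic target: using the minimum of two Q estimates gives a
    # more conservative value and dampens the overestimation bias.
    with torch.no_grad():
        next_action = target_actor(next_state)
        q1 = target_critic_1(next_state, next_action)
        q2 = target_critic_2(next_state, next_action)
        return reward + gamma * (1.0 - done) * torch.min(q1, q2)
```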
Thus...