Chapter 12 – Learning DDPG, TD3, and SAC
- DDPG consists of an actor and a critic. The actor is a policy network that learns the optimal policy using the policy gradient method. The critic is essentially a DQN: it evaluates the action produced by the actor using the Q value it computes for that state-action pair (a minimal code sketch of both networks follows this list).
- The key features of TD3 include clipped double Q-learning, delayed policy updates, and target policy smoothing.
- In clipped double Q-learning, instead of a single critic we use two main critic networks to compute the Q values and two target critic networks to compute the target Q values. While computing the loss, we use the minimum of the two target Q values, which prevents overestimation of the target Q value (see the target-computation sketch after this list).
- The DDPG method...
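
The following is a minimal sketch of the DDPG actor and critic described above. It uses PyTorch; the framework choice, layer sizes, and names such as `Actor` and `Critic` are illustrative assumptions rather than the book's exact code. The last lines show the critic evaluating the action proposed by the actor.

```python
# Illustrative sketch of DDPG's actor and critic networks (assumed PyTorch,
# assumed layer sizes) -- not the book's exact implementation.
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: maps a state to a deterministic continuous action."""
    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state):
        # Tanh keeps the action in [-1, 1]; scale it to the action bounds
        return self.max_action * self.net(state)

class Critic(nn.Module):
    """Q network (DQN-style): maps a (state, action) pair to a scalar Q value."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=1))

# The critic evaluates the action produced by the actor:
state = torch.randn(32, 8)              # batch of 32 states with 8 features each
actor, critic = Actor(8, 2, 1.0), Critic(8, 2)
action = actor(state)                   # actor proposes continuous actions
q_value = critic(state, action)         # critic scores those actions with Q values
```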
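Below is a sketch of TD3's clipped double Q-learning target, reusing the hypothetical `Actor`/`Critic` classes from the previous sketch as target networks; the function name `compute_td3_target` and the hyperparameter values are assumptions for illustration. It also shows target policy smoothing, where clipped noise is added to the target action before the two target critics are queried and the smaller Q value is used.

```python
# Illustrative sketch of TD3's clipped double Q-learning target with
# target policy smoothing. Hyperparameter values are assumptions.
import torch

gamma, noise_std, noise_clip, max_action = 0.99, 0.2, 0.5, 1.0

def compute_td3_target(reward, next_state, done,
                       target_actor, target_critic_1, target_critic_2):
    with torch.no_grad():
        # Target policy smoothing: perturb the target action with clipped noise
        next_action = target_actor(next_state)
        noise = (torch.randn_like(next_action) * noise_std).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(-max_action, max_action)

        # Clipped double Q-learning: take the minimum of the two target critics
        # to avoid overestimating the target Q value
        q1 = target_critic_1(next_state, next_action)
        q2 = target_critic_2(next_state, next_action)
        target_q = reward + gamma * (1.0 - done) * torch.min(q1, q2)
    return target_q

# Example usage (with the hypothetical classes from the previous sketch):
# target = compute_td3_target(reward, next_state, done,
#                             target_actor, target_critic_1, target_critic_2)
```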