The double DQN
We have learned that in DQN, the target value is computed as:

$$y = r + \gamma \max_{a'} Q_{\theta'}(s', a')$$

where $\theta'$ denotes the parameters of the target network.
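As a concrete illustration, here is a minimal sketch of how this target is typically computed, assuming a PyTorch setup in which `target_net` is the target Q network and `reward`, `next_state`, and `done` are batched tensors (these names are assumptions for illustration, not the book's code):

```python
import torch

def dqn_target(reward, next_state, done, target_net, gamma=0.99):
    # Hypothetical helper: computes y = r + gamma * max_a' Q_theta'(s', a')
    with torch.no_grad():
        # Q values of all actions in the next state: shape (batch_size, num_actions)
        next_q = target_net(next_state)
        # The max operator: take the highest Q value over actions
        max_next_q = next_q.max(dim=1).values
        # No bootstrapping on terminal states
        return reward + gamma * (1.0 - done.float()) * max_next_q
```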
One of the problems with DQN is that it tends to overestimate the Q value of the next state-action pair in the target, that is, the term $\max_{a'} Q_{\theta'}(s', a')$.
This overestimation is due to the presence of the max operator. Let's see how this overestimation happens with an example. Suppose we are in a state $s$ and we have three actions $a_1$, $a_2$, and $a_3$. Assume $a_3$ is the optimal action in the state $s$. When we estimate the Q values of all the actions in the state $s$, the estimated Q values will have some noise and differ from the actual values. Say, due to this noise, action $a_2$ gets a higher Q value than the optimal action $a_3$.
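To see why the max operator is the culprit, here is a quick numerical sketch (the true Q values and noise scale below are made up for illustration): even zero-mean noise in the estimates biases the max upward.

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.array([1.0, 1.0, 1.2])  # a3 (index 2) is truly optimal
# Zero-mean estimation noise on each of the three action values
noise = rng.normal(0.0, 0.5, size=(100_000, 3))
estimates = true_q + noise

# The average of the max over noisy estimates sits well above
# the true optimum of 1.2: the max operator picks up positive noise.
print(estimates.max(axis=1).mean())  # noticeably larger than 1.2
print(true_q.max())                  # 1.2
```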
We know that the target value is computed as:

$$y = r + \gamma \max_{a'} Q_{\theta'}(s', a')$$

Now, if we select the best action as the one that has the maximum Q value, then we will end up selecting the action $a_2$ instead of the optimal action $a_3$:

$$\max_{a} Q_{\theta'}(s, a) = Q_{\theta'}(s, a_2)$$
So, how can we get rid of this overestimation? We can get rid of this overestimation by using two Q functions to decouple action selection from action evaluation: in the double DQN, we select the best next action using the main network parameterized by $\theta$, but we evaluate the Q value of that action using the target network parameterized by $\theta'$. The target value then becomes:

$$y = r + \gamma Q_{\theta'}\left(s', \arg\max_{a'} Q_{\theta}(s', a')\right)$$

Since the main network and the target network are unlikely to overestimate the same action in the same way, decoupling selection from evaluation reduces the overestimation.
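Under the same assumed PyTorch setup as before, a minimal sketch of the double DQN target might look like this, with `main_net` as the online network and `target_net` as the target network (again, illustrative names, not the book's code):

```python
import torch

def double_dqn_target(reward, next_state, done, main_net, target_net, gamma=0.99):
    # Hypothetical helper: y = r + gamma * Q_theta'(s', argmax_a' Q_theta(s', a'))
    with torch.no_grad():
        # Select the best next action using the main network...
        best_action = main_net(next_state).argmax(dim=1, keepdim=True)
        # ...but evaluate that action's Q value with the target network
        next_q = target_net(next_state).gather(1, best_action).squeeze(1)
        # No bootstrapping on terminal states
        return reward + gamma * (1.0 - done.float()) * next_q
```

The only change from the plain DQN target is which network performs the `argmax`: selection comes from $\theta$, evaluation from $\theta'$.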