The double DQN
We have learned that in DQN, the target value is computed as:
$$y = r + \gamma \max_{a'} Q_{\theta'}(s', a')$$
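To make this concrete, here is a minimal NumPy sketch of how this target might be computed from the target network's predictions for the next state. The array `next_q_values` and the numbers in it are illustrative assumptions, not values from the book:

```python
import numpy as np

# Q values predicted by the target network for every action in the next state s'
# (hypothetical values, for illustration only)
next_q_values = np.array([1.2, 0.8, 2.5])

reward, gamma, done = 1.0, 0.99, False

# DQN target: y = r + gamma * max_a' Q_theta'(s', a'); just r at terminal states
y = reward if done else reward + gamma * np.max(next_q_values)
print(y)  # 1.0 + 0.99 * 2.5 = 3.475
```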
One of the problems with a DQN is that it tends to overestimate the Q value of the next state-action pair in the target:
$$\max_{a'} Q_{\theta'}(s', a')$$
This overestimation is due to the presence of the max operator. Let's see how this overestimation happens with an example. Suppose we are in the next state s' and we have three actions a1, a2, and a3, and assume a3 is the optimal action in state s'. When we estimate the Q values of all the actions in state s', the estimates will contain some noise and differ from the actual values. Say that, due to the noise, action a2 gets a higher Q value than the optimal action a3.
We know that the target value is computed as:
$$y = r + \gamma \max_{a'} Q_{\theta'}(s', a')$$
Now, if we select the best action as the one that has the maximum Q value, then we will end up selecting action a2 instead of the optimal action a3, as shown here:

$$\max_{a'} Q_{\theta'}(s', a') = Q_{\theta'}(s', a_2)$$
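This effect is easy to verify numerically. The following sketch assumes some illustrative true Q values and mean-zero Gaussian noise (`true_q`, the noise scale, and the trial count are all hypothetical choices); it shows that the max over noisy estimates is biased upward and frequently lands on a suboptimal action such as a2:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# True Q values of a1, a2, a3 in state s'; a3 (index 2) is the optimal action
true_q = np.array([1.0, 1.5, 2.0])

trials = 100_000
noisy_q = true_q + rng.normal(0.0, 1.0, size=(trials, 3))  # noisy estimates

# On average, the max over noisy estimates exceeds the true maximum Q value
print(noisy_q.max(axis=1).mean() - true_q.max())   # > 0: overestimation

# And the max frequently picks a suboptimal action instead of a3
print((noisy_q.argmax(axis=1) != 2).mean())        # fraction of wrong picks
```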
So, how can we get rid of this overestimation? We can get rid of this overestimation by using two Q functions: one for selecting the action and another for evaluating it. This decoupling of action selection from action evaluation is the key idea behind the double DQN.
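As a rough sketch of this decoupling (the helper `double_dqn_target` and the example Q values are hypothetical, not code from the book), the online network picks the action and the target network evaluates it:

```python
import numpy as np

def double_dqn_target(reward, online_q, target_q, gamma=0.99, done=False):
    """Double DQN target: the online network selects the action,
    the target network evaluates it, which curbs overestimation."""
    if done:
        return reward
    best_action = np.argmax(online_q)               # selection: online network
    return reward + gamma * target_q[best_action]   # evaluation: target network

# Illustrative values: the networks disagree, so the target
# no longer takes the noisy max of a single network's estimates
online_q = np.array([1.1, 2.3, 1.9])   # Q_theta(s', .)
target_q = np.array([1.0, 1.4, 2.1])   # Q_theta'(s', .)
print(double_dqn_target(1.0, online_q, target_q))  # 1.0 + 0.99 * 1.4
```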