Another augmentation to the standard Q-learning model we just built is Double Q-learning, introduced by Hado van Hasselt (2010, 2015). The intuition behind it is quite simple. Recall that, so far, we have been estimating the target value for each state-action pair using the Bellman equation and then checking how far off the mark our prediction is for a given state, like so:
$$y_t = r_t + \gamma \max_{a} Q(s_{t+1}, a; \theta)$$
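As a concrete illustration, here is a minimal sketch of how these targets might be computed for a batch of transitions. It assumes a Keras-style model exposing a `predict` method that returns one Q-value per action; the names `q_network`, `rewards`, `next_states`, and `dones` are illustrative rather than taken from the model we built earlier.

```python
import numpy as np

def td_targets(rewards, next_states, dones, q_network, gamma=0.99):
    """Standard Q-learning targets: y_t = r_t + gamma * max_a Q(s_{t+1}, a)."""
    # Q-values for every action in each next state, shape (batch, n_actions)
    next_q = q_network.predict(next_states)
    # The same set of Q-values both selects the best action (argmax)
    # and evaluates it (max)
    max_next_q = np.max(next_q, axis=1)
    # Terminal transitions contribute only the immediate reward
    return rewards + gamma * max_next_q * (1.0 - dones)
```

Note that the single `max` over `next_q` performs both action selection and action evaluation with the same Q-values, which is exactly the coupling discussed next.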
However, a problem arises from estimating the maximum expected future reward in this manner. As you may have noticed, the max operator in the target equation (y_t) uses the same Q-values to evaluate an action as it uses to select that action for a sampled state. This introduces a propensity to overestimate Q-values, which can eventually spiral out of control. To compensate for this, van Hasselt et al. (2016) implemented...