Another augmentation to the standard Q-learning model we just built is Double Q-learning, introduced by Hado van Hasselt (2010) and later extended to deep Q-networks (van Hasselt et al., 2016). The intuition behind it is quite simple. Recall that, so far, we have been estimating our target values for each state-action pair using the Bellman equation and checking how far off the mark our predictions are at a given state, like so:
However, a problem arises from estimating the maximum expected future reward in this manner. As you may have noticed earlier, the max operator in the target equation (yt) uses the same Q-values both to select an action and to evaluate it. This introduces a propensity to overestimate Q-values, which can eventually spiral out of control. To mitigate this, Van Hasselt et al. (2016) implemented...
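The core idea behind Double Q-learning's remedy is to decouple action *selection* from action *evaluation*: one set of Q-values picks the argmax action for the next state, while a second set supplies the value of that action for the target. The sketch below illustrates this decoupled target computation in plain NumPy; the function name, array shapes, and the use of separate "online" and "target" value arrays are illustrative assumptions, not code from this chapter.

```python
import numpy as np

def double_q_targets(q_online_next, q_target_next, rewards, dones, gamma=0.99):
    """Compute Double Q-learning targets for a batch of transitions.

    q_online_next: (batch, n_actions) Q-values for the next states from the
                   online network (used only to SELECT the best action).
    q_target_next: (batch, n_actions) Q-values for the next states from the
                   target network (used only to EVALUATE the selected action).
    rewards, dones: (batch,) arrays; dones is 1.0 for terminal transitions.
    """
    # Selection: the online network's argmax picks the action...
    best_actions = np.argmax(q_online_next, axis=1)
    # ...evaluation: the target network's value for that action is used,
    # which breaks the self-reinforcing overestimation of a single max.
    evaluated = q_target_next[np.arange(len(best_actions)), best_actions]
    # Standard Bellman target; terminal states contribute only the reward.
    return rewards + gamma * (1.0 - dones) * evaluated
```

Contrast this with the standard target, which would take `np.max(q_target_next, axis=1)` directly: there, a single noisy overestimate in any action's Q-value is both selected and propagated, which is the bias Double Q-learning dampens.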