In the deep Q-learning algorithms we have developed so far, the same neural network is used to compute both the predicted values and the target values. Because the targets keep shifting as the network is updated, the predictions are always chasing a moving target, which can make training unstable or even cause it to diverge. In this recipe, we will develop a new algorithm that uses two neural networks instead of one.
In double DQNs, we use a separate network to estimate the targets, rather than the prediction network. The separate network has the same structure as the prediction network, but its weights are held fixed for T episodes at a time (T is a hyperparameter we can tune), meaning they are only updated every T episodes. The update is done simply by copying over the weights of the prediction network. In this way, the target function stays fixed for a while, which results in a more stable training process.
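The weight-copying scheme described above can be sketched as follows in PyTorch. The network sizes, the `target_update` interval, and the stand-in "training" step are all illustrative assumptions, not the recipe's actual values:

```python
import copy
import torch
import torch.nn as nn

# Hypothetical dimensions for illustration only.
n_state, n_action, n_hidden = 4, 2, 16

# Prediction network: updated at every training step.
model = nn.Sequential(
    nn.Linear(n_state, n_hidden),
    nn.ReLU(),
    nn.Linear(n_hidden, n_action),
)

# Target network: identical architecture, weights frozen between syncs.
model_target = copy.deepcopy(model)

def sync_target(model, model_target):
    """Copy the prediction network's weights into the target network."""
    model_target.load_state_dict(model.state_dict())

target_update = 10  # T: copy weights over every T episodes

for episode in range(30):
    # ... run the episode and take gradient steps on `model` ...
    # Here we just perturb the weights to stand in for training updates.
    with torch.no_grad():
        for p in model.parameters():
            p.add_(0.01 * torch.randn_like(p))

    # Every T episodes, refresh the target network's fixed weights.
    if episode % target_update == 0:
        sync_target(model, model_target)
```

Between syncs, `model_target` supplies stable target values while `model` continues to learn; copying via `load_state_dict` keeps the two networks structurally identical.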
Mathematically...