Double DQN
The next fruitful idea for improving the basic DQN came from DeepMind researchers in the paper titled Deep reinforcement learning with double Q-learning [VGS16]. In the paper, the authors demonstrated that the basic DQN tends to overestimate Q-values, which can harm training performance and sometimes lead to suboptimal policies. The root cause of this is the max operation in the Bellman equation; the strict proof is a bit involved (you can find the full explanation in the paper). As a solution to this problem, the authors proposed modifying the Bellman update slightly.
In the basic DQN, our target value for Q looked like this:

$$Q(s_t, a_t) = r_t + \gamma \max_a Q'(s_{t+1}, a)$$
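As a minimal sketch of how this target could be computed for a sampled batch (not the book's exact code), assume `tgt_net` is the target network, `next_states`, `rewards`, and `dones` are batch tensors, and `GAMMA` is the discount factor:

```python
import torch

GAMMA = 0.99


def basic_dqn_target(tgt_net, next_states, rewards, dones):
    with torch.no_grad():
        # max over actions of Q'(s_{t+1}, a), taken from the target network
        next_q = tgt_net(next_states).max(dim=1).values
        # terminal states contribute no future value
        next_q[dones] = 0.0
    return rewards + GAMMA * next_q
```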
Here, $Q'(s_{t+1}, a)$ are the Q-values calculated using our target network, the weights of which are copied from the trained network every n steps. The authors of the paper proposed choosing the action for the next state using the trained network, but taking its Q-value from the target network. So, the new target becomes:

$$Q(s_t, a_t) = r_t + \gamma Q'\!\left(s_{t+1}, \arg\max_a Q(s_{t+1}, a)\right)$$
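A sketch of the corresponding change, under the same assumptions as above and with `net` denoting the trained (online) network, could look like this; the only difference from the basic target is how the next-state action is selected:

```python
def double_dqn_target(net, tgt_net, next_states, rewards, dones):
    with torch.no_grad():
        # choose a* = argmax_a Q(s_{t+1}, a) with the trained network...
        next_actions = net(next_states).argmax(dim=1)
        # ...but evaluate it with the target network: Q'(s_{t+1}, a*)
        next_q = tgt_net(next_states).gather(
            1, next_actions.unsqueeze(-1)).squeeze(-1)
        next_q[dones] = 0.0
    return rewards + GAMMA * next_q
```

Decoupling action selection from action evaluation in this way is what reduces the overestimation introduced by the max operation.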