We discussed the Deep Q-Network (DQN) algorithm in the previous chapter, coded it in Python and TensorFlow, and trained it to play Atari Breakout. In DQN, the same Q-network was used both to select and to evaluate an action. This coupling, unfortunately, is known to overestimate the Q values, resulting in over-optimistic value estimates. To mitigate this, DeepMind released another paper proposing the decoupling of action selection from action evaluation. This is the crux of the Double DQN (DDQN) architecture, which we will investigate in this chapter.
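To make the decoupling concrete, here is a minimal sketch of how the DDQN training target can be computed: the online network selects the greedy next action, and the target network evaluates it. The function name, arguments, and the assumption that `online_net` and `target_net` return Q values of shape `(batch_size, n_actions)` are illustrative, not the exact code from the previous chapter.

```python
import tensorflow as tf

def ddqn_targets(rewards, next_states, dones, online_net, target_net, gamma=0.99):
    # Select the greedy next action with the online network ...
    next_actions = tf.argmax(online_net(next_states), axis=1)
    # ... but evaluate that action with the target network.
    next_q = tf.gather(target_net(next_states), next_actions, batch_dims=1)
    # Terminal transitions contribute only the immediate reward.
    return rewards + gamma * next_q * (1.0 - dones)
```

Compare this with vanilla DQN, where the target network is used for both the argmax and the evaluation; that is the source of the overestimation bias DDQN addresses.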
Later still, DeepMind released another paper proposing a Q-network architecture with two output streams, one representing the state value, V(s), and the other the advantage of taking an action in that state, A(s, a). DeepMind then combined these two to compute the Q values, Q(s, a).
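The sketch below shows one way such a two-stream (dueling) head can be wired up in Keras; the convolutional backbone and layer sizes here are illustrative assumptions, not the exact architecture from the paper. The key step is the final combination, Q(s, a) = V(s) + A(s, a) - mean_a A(s, a), where subtracting the mean advantage keeps V and A identifiable.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_dueling_q_network(n_actions, input_shape=(84, 84, 4)):
    inputs = tf.keras.Input(shape=input_shape)
    x = layers.Conv2D(32, 8, strides=4, activation="relu")(inputs)
    x = layers.Conv2D(64, 4, strides=2, activation="relu")(x)
    x = layers.Flatten()(x)
    # Two separate streams: the state value V(s) and the advantages A(s, a).
    v = layers.Dense(256, activation="relu")(x)
    v = layers.Dense(1)(v)
    a = layers.Dense(256, activation="relu")(x)
    a = layers.Dense(n_actions)(a)
    # Combine the streams: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).
    q = layers.Lambda(
        lambda t: t[0] + t[1] - tf.reduce_mean(t[1], axis=1, keepdims=True)
    )([v, a])
    return tf.keras.Model(inputs=inputs, outputs=q)
```

The rest of the training loop is unchanged: the dueling network is a drop-in replacement for the plain Q-network, and it can be combined with the DDQN target computation shown earlier.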