Deep Q learning is pretty cool, right? It has shown that a single network architecture can learn to play a wide range of Atari games. But DQN has a problem: it tends to overestimate Q values. This happens because of the max operator in the Q learning target, which uses the same value for both selecting and evaluating an action. What do I mean by that? Let's suppose we are in a state s and we have five actions, a1 to a5, and let's say a3 is the best action. When we estimate the Q values of all these actions in the state s, the estimates will be noisy and will differ from the actual values. Due to this noise, a suboptimal action such as a2 may get a higher estimated value than the optimal action a3. If we then select the best action as the one with the maximum estimated value, we end up selecting the suboptimal action a2 instead of the optimal action a3.
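To see this in action, here is a quick simulation. It is a minimal sketch with made-up numbers (five actions where a3 is truly the best, plus Gaussian estimation noise), not anything from a real DQN run:

```python
import numpy as np

rng = np.random.default_rng(0)

# True action values for a1..a5; a3 (index 2) is genuinely the best action.
true_q = np.array([0.0, 0.1, 0.5, 0.1, 0.0])
n_trials = 10_000
noise_std = 1.0  # estimation noise in the Q values

# Each row is one noisy set of Q estimates for state s.
noisy_q = true_q + rng.normal(0.0, noise_std, size=(n_trials, true_q.size))

max_estimate = noisy_q.max(axis=1)   # what the max operator reports
picked = noisy_q.argmax(axis=1)      # which action we would select

print(f"true max Q value: {true_q.max():.2f}")
print(f"mean of max over noisy estimates: {max_estimate.mean():.2f}")  # well above 0.50
print(f"fraction of trials picking a suboptimal action: {(picked != 2).mean():.2f}")
```

The mean of the max over the noisy estimates comes out well above the true max of 0.5: the expected value of the max of several noisy estimates exceeds the max of their expected values, and that is exactly the upward bias the max operator introduces.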
We can solve this problem by having two separate Q functions, each learning independently. We use one Q function to select the best action (the one with the maximum Q value) and the other Q function to evaluate that action. Because the two estimators have independent noise, an action that happens to be overestimated by one is unlikely to also be overestimated by the other, so the upward bias shrinks. This idea is known as double Q learning, and DQN combined with it is called double DQN.
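Here is a minimal sketch of how this decoupling looks when computing the target. In the deep RL setting the common instantiation (double DQN) selects the next action with the online network and evaluates it with the target network; the tabular arrays and names below are illustrative stand-ins for those two networks, not a full implementation:

```python
import numpy as np

def double_q_target(q_online, q_target, reward, next_state, done, gamma=0.99):
    """Double DQN target: select the next action with the online Q values,
    but evaluate that action with the target Q values.

    q_online, q_target: arrays of shape (n_states, n_actions), standing in
    for the online and target networks.
    """
    best_action = np.argmax(q_online[next_state])    # selection: online estimates
    evaluated_q = q_target[next_state, best_action]  # evaluation: target estimates
    return reward + gamma * (1.0 - done) * evaluated_q

# Usage with toy numbers: 3 states, 5 actions.
rng = np.random.default_rng(1)
q_online = rng.normal(size=(3, 5))
q_target = rng.normal(size=(3, 5))
print(double_q_target(q_online, q_target, reward=1.0, next_state=2, done=0.0))
```

Contrast this with the vanilla DQN target, which would use `q_target[next_state].max()` for both selection and evaluation, letting the same noisy estimate be chosen precisely because it is overestimated.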