Double DQN
Deep Q learning is pretty cool, right? The same architecture and hyperparameters learned to play many different Atari games. But the problem with DQN is that it tends to overestimate Q values. This is because of the max operator in the Q learning equation: the max operator uses the same values both to select an action and to evaluate it. What do I mean by that? Let's suppose we are in a state s and we have five actions, a1 to a5. Let's say a3 is the best action. When we estimate the Q values of these actions in the state s, the estimates will be noisy and will differ from the actual values. Due to this noise, the action a2 might get a higher estimated value than the optimal action a3. Now, if we select the best action as the one with the maximum estimated value, we will end up selecting the suboptimal action a2 instead of the optimal action a3.
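To see where the overestimation comes from, it helps to write the Q learning target out. The notation here is the standard one, added for illustration since the section hasn't introduced symbols yet: s' is the next state, a' ranges over the next actions, and γ is the discount factor:

$$y = r + \gamma \max_{a'} Q(s', a')$$

The same noisy estimates Q(s', a') are used both to pick the maximizing action a' and to supply the value of that pick, so positive noise is systematically selected rather than averaged out.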
We can solve this problem by having two separate Q functions, each learning independently. One Q function is used to select an action and the other is used to evaluate it. In practice, we don't need to train two brand-new networks: the DQN setup already maintains a target network, so we can let the main network select the best next action and let the target network evaluate that action's value.
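With selection and evaluation split across the two networks, the target becomes (again in standard notation, which is my addition, not the section's: θ denotes the main network's parameters and θ' the target network's):

$$y = r + \gamma\, Q\bigl(s', \arg\max_{a'} Q(s', a'; \theta);\; \theta'\bigr)$$

Here is a minimal NumPy sketch of computing this target for a batch of transitions. The online_q and target_q functions are hypothetical stand-ins for the two networks' forward passes; any deep learning framework would slot in the same way:

```python
import numpy as np

def double_dqn_targets(rewards, next_states, dones, gamma, online_q, target_q):
    """Compute Double DQN targets for a batch of transitions.

    online_q / target_q: hypothetical functions mapping a batch of
    states to Q-values of shape (batch_size, num_actions).
    """
    # The main (online) network SELECTS the best next action...
    best_actions = np.argmax(online_q(next_states), axis=1)
    # ...while the target network EVALUATES that chosen action.
    batch_idx = np.arange(len(best_actions))
    next_values = target_q(next_states)[batch_idx, best_actions]
    # Terminal transitions bootstrap nothing.
    return rewards + gamma * (1.0 - dones) * next_values
```

If we instead took np.max(target_q(next_states), axis=1), we would be back to the ordinary DQN target, with a single network both choosing and scoring the action, and the overestimation would return.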