Q-learning was an early RL breakthrough, developed by Chris Watkins for his PhD thesis in 1989 (http://www.cs.rhul.ac.uk/~chrisw/thesis.html). It introduces incremental dynamic programming to control an MDP without knowing or modeling the transition and reward matrices that we used for value and policy iteration in the previous section. Watkins and Dayan published a convergence proof three years later (http://www.gatsby.ucl.ac.uk/~dayan/papers/wd92.html).
Q-learning directly optimizes the action-value function, q, to approximate q*. Learning proceeds off-policy, that is, the algorithm does not need to select actions based on the policy implied by the value function alone. However, convergence requires that all state-action pairs continue to be updated throughout the training process. A straightforward way to ensure this is to use an ε-greedy policy.
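The following minimal sketch illustrates the tabular update on a hypothetical five-state toy MDP (the environment, its step function, and the hyperparameter values are illustrative assumptions, not taken from the text): the agent behaves ε-greedily so that every state-action pair keeps being visited, while each update bootstraps from the greedy max over the next state's action values, which is what makes the method off-policy.

import numpy as np

# Hypothetical toy MDP: five states in a row, actions 0 (left) / 1 (right);
# reaching the rightmost state yields reward 1 and ends the episode.
n_states, n_actions = 5, 2

def step(state, action):
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    done = next_state == n_states - 1
    return next_state, reward, done

Q = np.zeros((n_states, n_actions))        # tabular action-value estimates
alpha, gamma, epsilon = 0.1, 0.99, 0.1     # learning rate, discount factor, exploration rate

for episode in range(500):
    state, done = 0, False
    while not done:
        # ε-greedy behavior policy keeps all state-action pairs updated
        if np.random.rand() < epsilon:
            action = np.random.randint(n_actions)
        else:
            action = Q[state].argmax()
        next_state, reward, done = step(state, action)
        # Off-policy update: bootstrap from the greedy (max) next action value
        td_target = reward + gamma * Q[next_state].max() * (not done)
        Q[state, action] += alpha * (td_target - Q[state, action])
        state = next_state

Because the update target uses max over the next state's action values rather than the action the behavior policy actually takes next, the estimates converge toward q* even though exploration never stops.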