Q-learning
This algorithm was proposed by Watkins (in Watkins C.J.C.H., Learning from Delayed Rewards, Ph.D. Thesis, University of Cambridge, 1989, and further analyzed in Watkins C.J.C.H., Dayan P., Technical Note: Q-Learning, Machine Learning 8, 1992) as a more efficient alternative to SARSA. The main feature of Q-learning is that the TD update rule is immediately greedy with respect to the Q(st+1, a) function (assuming that the agent received the reward rt after performing the action at while in the state st):
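In standard notation, with α denoting the learning rate and γ the discount factor (symbols assumed here, as they are not defined in this excerpt), the rule can be written as:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
$$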
The key idea is to compare the current Q(st, at) value with the maximum Q value achievable when the agent is in the successor state. Assuming the TD error is defined as the difference between the target rt + γ maxa Q(st+1, a) and the current estimate Q(st, at), the previous equation can be transformed into a TD error structure:
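A possible way to write this, introducing δt as a symbol for the TD error (an assumed notation, not taken from the text):

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,\delta_t, \qquad \text{with} \quad \delta_t = r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)
$$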
The first term is the current reward, the second is the discounted maximum reward that the agent can theoretically achieve using its current knowledge, and the last one is the current estimate of the Q function. As the policy must be GLIE, the convergence...
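As a minimal sketch of how this update might look in code, the snippet below runs tabular Q-learning on a tiny chain environment. The environment, the ε-greedy behavior policy, and all hyperparameter values are illustrative assumptions rather than anything taken from the text:

```python
import numpy as np

# Hypothetical toy environment: a 1-D chain of states. Taking "right" in the
# last state yields reward 1 and ends the episode.
N_STATES, N_ACTIONS = 5, 2              # actions: 0 = left, 1 = right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # learning rate, discount, exploration rate (assumed)

def step(state, action):
    """Return (next_state, reward, done) for the toy chain."""
    if action == 1 and state == N_STATES - 1:
        return state, 1.0, True
    move = 1 if action == 1 else -1
    return min(max(state + move, 0), N_STATES - 1), 0.0, False

Q = np.zeros((N_STATES, N_ACTIONS))
rng = np.random.default_rng(0)

for episode in range(500):
    s, done = 0, False
    for _ in range(200):                # step cap keeps every episode finite
        # epsilon-greedy behavior policy (ties broken at random)
        if rng.random() < EPSILON:
            a = int(rng.integers(N_ACTIONS))
        else:
            a = int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))
        s_next, r, done = step(s, a)
        # Q-learning target: greedy with respect to Q(s_next, .), independently
        # of the action the behavior policy will take next (off-policy update)
        td_target = r + (0.0 if done else GAMMA * np.max(Q[s_next]))
        Q[s, a] += ALPHA * (td_target - Q[s, a])    # TD-error update
        s = s_next
        if done:
            break

print(np.round(Q, 3))
```

Note that the max over Q(s_next, ·) is taken inside the update regardless of which action is actually executed in the next step; this is exactly the "immediately greedy" property that distinguishes Q-learning from SARSA.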