In the previous recipe, we developed a value estimator based on linear regression. We will now employ this estimator in Q-learning as part of our FA journey.
As we have seen, Q-learning is an off-policy learning algorithm and it updates the Q-function based on the following equation:

$$Q(s, a) := Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$$
Here, s' is the resulting state after taking action a in state s; r is the associated reward; α is the learning rate; and γ is the discount factor. The term $\max_{a'} Q(s', a')$ means that the target policy is greedy: the highest Q-value among the actions available in state s' is selected to form the learning target. Actions themselves, however, are taken on the basis of the epsilon-greedy behavior policy, which is what makes Q-learning off-policy. Similarly, Q-learning with FA has the following error term:

$$\delta = r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta)$$

Here, θ denotes the weights of the function approximator developed in the previous recipe.
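To see how this error term drives learning, treat $\delta^2 / 2$ as a per-sample loss and take a gradient step with respect to the weights θ while holding the TD target fixed. This yields the standard semi-gradient update (shown here as a reference derivation; the symbol $x(s, a)$ for the feature vector is our notation, not necessarily the previous recipe's):

$$\theta \leftarrow \theta + \alpha \, \delta \, \nabla_\theta Q(s, a; \theta) = \theta + \alpha \, \delta \, x(s, a)$$

The last equality holds for a linear estimator, where $Q(s, a; \theta) = \theta^\top x(s, a)$, so each update simply moves the weights along the feature vector in proportion to the error.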
Our learning goal is to minimize the error term to zero, which means the estimated Q(s, a) should satisfy the following equation:

$$Q(s, a; \theta) = r + \gamma \max_{a'} Q(s', a'; \theta)$$
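To make the whole update loop concrete, the following is a minimal sketch of a single Q-learning step with a linear estimator. The `LinearEstimator` class, its `predict`/`update` methods, and the `env.step` interface are hypothetical stand-ins chosen for illustration, not the exact classes from the previous recipe; for brevity, the state vector itself is treated as the feature vector:

```python
import numpy as np

class LinearEstimator:
    """Q(s, a; theta) ~= w_a . s, with one weight vector per action."""
    def __init__(self, n_features, n_actions, lr=0.01):
        self.w = np.zeros((n_actions, n_features))
        self.lr = lr

    def predict(self, s):
        # Q-values for every action in state s
        return self.w @ s

    def update(self, s, a, target):
        # Semi-gradient step: move Q(s, a) toward the TD target.
        # delta is the error term defined above; for linear features,
        # the gradient of Q(s, a) with respect to w_a is simply s.
        delta = target - self.w[a] @ s
        self.w[a] += self.lr * delta * s

def epsilon_greedy(q_values, epsilon):
    # Behavior policy: explore with probability epsilon, else act greedily
    if np.random.random() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def q_learning_step(estimator, env, s, epsilon=0.1, gamma=0.99):
    # Take an action under the epsilon-greedy behavior policy ...
    a = epsilon_greedy(estimator.predict(s), epsilon)
    s_next, r, done = env.step(a)  # hypothetical env interface
    # ... but bootstrap from the greedy target policy via the max
    target = r if done else r + gamma * np.max(estimator.predict(s_next))
    estimator.update(s, a, target)
    return s_next, done
```

Note how the max over `estimator.predict(s_next)` implements the greedy target policy from the update equation, while `epsilon_greedy` governs the behavior policy that collects the learning data.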