TD control
In TD prediction, we estimated the value function. In TD control, we optimize the value function. For TD control, we use two kinds of control algorithms:
- Off-policy learning algorithm: Q learning
- On-policy learning algorithm: SARSA
Q learning
We will now look at a very popular off-policy TD control algorithm called Q learning. Q learning is a simple and widely used TD algorithm. In control algorithms, we don't care about the state value alone; here, in Q learning, our concern is the state-action value pair, that is, the effect of performing an action A in the state S.
We will update the Q value based on the following equation:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$$
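As a concrete illustration of this update rule, here is a minimal sketch that applies it to a tabular Q function stored as a NumPy array. The state and action counts and the values of alpha (learning rate) and gamma (discount factor) are illustrative assumptions, not values from the text:

```python
import numpy as np

n_states, n_actions = 16, 4              # illustrative sizes for a small grid world
alpha, gamma = 0.1, 0.9                  # assumed learning rate and discount factor
Q = np.zeros((n_states, n_actions))      # tabular Q: one value per (state, action) pair

def q_update(s, a, r, s_next):
    """One Q-learning update: move Q(s, a) toward the TD target."""
    td_target = r + gamma * np.max(Q[s_next])    # bootstrap with the best next action
    Q[s, a] += alpha * (td_target - Q[s, a])     # step toward the target by alpha

q_update(s=0, a=1, r=0.0, s_next=4)              # example: one transition's update
```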
The preceding equation is similar to the TD prediction update rule, with a small difference: instead of updating the state value toward the value of the next state, we update the state-action value, and the TD target uses the maximum Q value over the next state's actions. We will see this in detail step by step. The steps involved in Q learning are as follows:
- First, we initialize the Q function to some arbitrary values
- We take an action from a state using the epsilon-greedy policy (ε) and move to the new state
- We update the Q value of the previous state using the preceding update rule
- We repeat these steps until we reach a terminal state; a complete sketch of this loop follows the list
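The sketch below puts these steps together in a full Q-learning loop. The ChainEnv environment is a hypothetical toy example (not from the text): five states in a row, where action 1 moves right, action 0 moves left, and reaching the rightmost state gives a reward of 1 and ends the episode. The hyperparameters alpha, gamma, and epsilon are likewise illustrative assumptions:

```python
import random
import numpy as np

class ChainEnv:
    """Hypothetical 5-state chain: reach the rightmost state for reward 1."""
    n_states, n_actions = 5, 2

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done

env = ChainEnv()
alpha, gamma, epsilon = 0.1, 0.9, 0.1         # assumed hyperparameters
Q = np.zeros((env.n_states, env.n_actions))   # step 1: arbitrary (zero) initial Q values

def epsilon_greedy(state):
    """Explore with probability epsilon, otherwise take the greedy action."""
    if random.random() < epsilon:
        return random.randrange(env.n_actions)
    return int(np.argmax(Q[state]))

for episode in range(500):
    state = env.reset()
    done = False
    while not done:
        action = epsilon_greedy(state)                # step 2: act epsilon-greedily
        next_state, reward, done = env.step(action)   # move to the new state
        # step 3: update Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state
```

Note that the update always bootstraps from the greedy next action (the max over Q[next_state]), even though the behavior policy is epsilon-greedy; this mismatch between the behavior policy and the learned policy is what makes Q learning off-policy.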