The Q value
An important question is this: if the RL problem is to find $\pi^*$, how does the agent learn it by interacting with the environment? Equation 9.1.3 does not explicitly indicate which action to try and which succeeding state to use in computing the return. In RL, we find that it is easier to learn $\pi^*$ by using the Q value:
$$\pi^* = \underset{a}{\arg\max}\, Q(s, a)$$ (Equation 9.2.1)
Where:
$$V^*(s) = \max_{a} Q(s, a)$$ (Equation 9.2.2)
In other words, instead of finding the policy that maximizes the value for all states, Equation 9.2.1 looks for the action that maximizes the quality (Q) value for all states. After finding the Q value function, $V^*$ and hence $\pi^*$ are determined by Equations 9.2.2 and 9.1.3, respectively.
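To make the relationship concrete, here is a minimal Python sketch of how $V^*$ and $\pi^*$ fall out of a tabular Q function; the table shape and the entries in it are illustrative assumptions, not values from the text:

```python
import numpy as np

# Hypothetical tabular Q function: rows are states, columns are actions.
# The entries below are illustrative values only.
q_table = np.array([
    [0.0, 1.0, 0.5],   # Q(s=0, a) for actions a = 0, 1, 2
    [0.2, 0.1, 0.9],   # Q(s=1, a)
])

# Equation 9.2.2: V*(s) = max_a Q(s, a)
v_star = q_table.max(axis=1)       # [1.0, 0.9]

# Equation 9.2.1: pi*(s) = argmax_a Q(s, a)
pi_star = q_table.argmax(axis=1)   # [1, 2]
```

Once the Q function is known, both quantities are read off with a single max or argmax per state; no further search over policies is needed.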
If, for every action, the reward and the next state can be observed, we can formulate the following iterative, trial-and-error algorithm to learn the Q value:
$$Q(s, a) = r + \gamma \underset{a'}{\max}\, Q(s', a')$$ (Equation 9.2.3)
For notational simplicity, $s'$ and $a'$ denote the next state and the next action, respectively. Equation 9.2.3 is known as the Bellman Equation, which is the core of the Q-learning algorithm.
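Below is a minimal sketch of Equation 9.2.3 as a tabular update, assuming a small discrete environment and an illustrative discount factor $\gamma = 0.9$; neither the environment nor the specific transition comes from the text:

```python
import numpy as np

def bellman_update(q_table, s, a, r, s_prime, gamma=0.9):
    # Equation 9.2.3: Q(s, a) = r + gamma * max_a' Q(s', a')
    q_table[s, a] = r + gamma * q_table[s_prime].max()

# Hypothetical 2-state, 2-action Q table, initialized to zero.
q_table = np.zeros((2, 2))

# Suppose the agent takes action 1 in state 0, observes reward 1.0,
# and lands in state 1. One application of the update:
bellman_update(q_table, s=0, a=1, r=1.0, s_prime=1)
print(q_table)  # Q(0, 1) is now 1.0; every other entry is still 0.0
```

Each observed transition $(s, a, r, s')$ triggers one such assignment; repeating it as the agent explores gradually fills in the Q table.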