2. The Q value
If the RL problem is to find the optimal policy $\pi^*$, how does the agent learn by interacting with the environment? Equation 9.1.3 does not explicitly indicate the action to try and the succeeding state to compute the return. In RL, it is easier to learn by using the Q value:

$$\pi^* = \underset{a}{\operatorname{argmax}}\, Q(s, a) \qquad \text{(Equation 9.2.1)}$$

where:

$$Q(s, a) = \mathbb{E}\left[R_t \mid s_t = s, a_t = a\right]$$

In other words, instead of finding the policy that maximizes the value for all states, Equation 9.2.1 looks for the action that maximizes the quality (Q) value for all states. After finding the Q value function, $V^*$ and hence $\pi^*$ are determined by Equation 9.2.2 and Equation 9.1.3, respectively:

$$V^*(s) = \underset{a}{\max}\, Q(s, a) \qquad \text{(Equation 9.2.2)}$$
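As a minimal sketch of how Equations 9.2.1 and 9.2.2 are applied in practice, the following uses a small hypothetical Q-table (the states, actions, and values are illustrative, not from the text) to recover both the optimal state values and the optimal policy:

```python
import numpy as np

# Hypothetical Q-table for a toy problem with 3 states and 2 actions.
# Each entry Q[s, a] estimates the return of taking action a in state s.
Q = np.array([[1.0, 3.0],
              [2.5, 0.5],
              [0.0, 4.0]])

# Equation 9.2.2: the optimal value of each state is the max Q over actions.
V_star = Q.max(axis=1)

# Equation 9.2.1: the optimal policy picks the action with the highest Q.
pi_star = Q.argmax(axis=1)

print(V_star)   # -> [3.  2.5 4. ]
print(pi_star)  # -> [1 0 1]
```

Once the Q value function is known, both quantities fall out of simple max/argmax operations over the action dimension; no separate search over policies is needed.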
If, for every action, the reward and the next state can be observed, we can formulate the following iterative or trial-and-error algorithm to learn the Q value:

$$Q(s, a) = r + \gamma \underset{a'}{\max}\, Q(s', a') \qquad \text{(Equation 9.2.3)}$$

For notational simplicity, $s'$ and $a'$ are the next state and action, respectively. Equation 9.2.3 is known as the Bellman equation, which is the core of the Q-learning algorithm. Q-learning...
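The trial-and-error update of Equation 9.2.3 can be sketched as a single function applied to a tabular Q estimate. The table shape, discount factor, and observed transition below are assumptions chosen for illustration only:

```python
import numpy as np

GAMMA = 0.9  # discount factor; an assumed value for this sketch

def bellman_update(Q, s, a, r, s_next):
    """One trial-and-error step of Equation 9.2.3:
    Q(s, a) <- r + gamma * max over a' of Q(s', a')."""
    Q[s, a] = r + GAMMA * Q[s_next].max()
    return Q

# Toy Q-table with 2 states and 2 actions, initialized to zero.
Q = np.zeros((2, 2))

# Hypothetical observed transition: in state 0, action 1
# yields reward 1.0 and lands the agent in state 1.
Q = bellman_update(Q, s=0, a=1, r=1.0, s_next=1)
print(Q[0, 1])  # -> 1.0, since Q[1] is still all zeros
```

Each interaction with the environment supplies one $(s, a, r, s')$ tuple, and repeating this update over many such tuples gradually propagates reward information backward through the table.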