If you walk up to a wall, there are not many actions you can perform. You will likely respond to this state in your environment by choosing the action of turning around, followed by asking yourself why you walked up to a wall in the first place. Similarly, we would like our agent to leverage a sense of goodness for the different actions available in the states it finds itself in while following a policy. We can achieve this using a Q-value function. This function simply denotes the expected cumulative reward from taking a specific action, in a specific state, while following a policy. In other words, it denotes the quality of a state-action pair for a given policy. Mathematically, we can denote the Q^π(s, a) relation as follows:
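In standard notation, assuming a discount factor γ ∈ [0, 1] and a reward r_{t+k+1} received k+1 time steps after time t, this definition can be written as:

$$
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s,\ a_t = a\right]
$$

Here the expectation is taken over the trajectories generated by taking action a in state s and following the policy π thereafter.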
The Q^π(s, a) function allows us to represent the expected cumulative reward from following...