The Bellman equation, named after Richard Bellman, American mathematician, helps us to solve MDP. It is omnipresent in RL. When we say solve the MDP, it actually means finding the optimal policies and value functions. There can be many different value functions according to different policies. The optimal value function is the one which yields maximum value compared to all the other value functions:
Similarly, the optimal policy is the one which results in an optimal value function.
Since the optimal value function is the one that has a higher value compared to all other value functions (that is, maximum return), it will be the maximum of the Q function. So, the optimal value function can easily be computed by taking the maximum of the Q function as follows:
The Bellman equation for the value function can be represented as, (we...