The Bellman equation
The Bellman equation, named after Richard Bellman, helps us solve the Markov decision process (MDP). By solving the MDP, we mean finding the optimal policy.
As stated in the introduction to the chapter, the Bellman equation is ubiquitous in reinforcement learning and is widely used to find the optimal value and Q functions recursively. Computing these functions is important because once we have the optimal value function or the optimal Q function, we can use it to derive the optimal policy.
In this section, we'll learn what exactly the Bellman equation is and how we can use it to find the optimal value and Q functions.
The Bellman equation of the value function
The Bellman equation states that the value of a state can be obtained as the sum of the immediate reward and the discounted value of the next state. Say we perform an action a in state s, move to the next state s', and obtain a reward r; then the value of state s is the reward r plus the discounted value of the next state s'.
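In symbols, this one-step relationship takes the familiar form below, where \gamma is the standard discount factor; this notation is the usual convention rather than anything introduced so far:

V(s) = r + \gamma V(s')

To make the recursion concrete, here is a minimal Python sketch of this one-step backup. It is not code from the chapter; the dictionaries pi (policy), P (transition probabilities), and R (rewards), along with the tiny two-state MDP at the end, are hypothetical placeholders chosen only for illustration:

def bellman_backup(V, s, pi, P, R, gamma=0.9):
    """Estimate V(s) as the expected immediate reward plus the
    discounted value of the next state, averaged over the policy's
    actions and the environment's transitions."""
    value = 0.0
    for a, p_a in pi[s].items():                  # probability of taking action a in s
        for s_next, p_next in P[(s, a)].items():  # probability of landing in s_next
            value += p_a * p_next * (R[(s, a)] + gamma * V[s_next])
    return value

# Hypothetical two-state example: in state 0 we can 'stay' or 'go' to state 1.
V = {0: 0.0, 1: 0.0}
pi = {0: {'stay': 0.5, 'go': 0.5}, 1: {'stay': 1.0}}
P = {(0, 'stay'): {0: 1.0}, (0, 'go'): {1: 1.0}, (1, 'stay'): {1: 1.0}}
R = {(0, 'stay'): 0.0, (0, 'go'): 1.0, (1, 'stay'): 0.0}

print(bellman_backup(V, 0, pi, P, R))  # 0.5*(0 + 0.9*0) + 0.5*(1 + 0.9*0) = 0.5

Note that the backup averages over both the policy's action choices and the environment's transitions; when both are deterministic, it reduces to the simple form V(s) = r + \gamma V(s') given above.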