The Bellman equation, proposed by the American mathematician Richard Bellman, is one of the workhorse equations behind deep Q-learning. It essentially allows us to solve the Markov decision process we formalized earlier. Intuitively, the Bellman equation rests on one simple idea: the maximum future reward for an action taken in a given state is the immediate reward plus the maximum future reward for the next state. To draw a parallel to the marshmallow experiments, the maximum possible reward of two marshmallows is attained by the agent through the act of abstaining at the first time step (with a reward of zero marshmallows) and then collecting (with a reward of two marshmallows) at the second time step.
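The marshmallow example can be worked through numerically. The following is only a minimal sketch under stated assumptions: the state and action names are hypothetical, the option of eating one marshmallow immediately (reward 1) is assumed for contrast, and no discounting is applied to future rewards.

```python
# Hypothetical two-step marshmallow problem. The names and the "eat" option
# (reward 1) are illustrative assumptions; future rewards are not discounted.
# TRANSITIONS[state][action] = (immediate_reward, next_state); None ends the episode.
TRANSITIONS = {
    "first_step": {"eat": (1, None), "abstain": (0, "second_step")},
    "second_step": {"collect": (2, None)},
}


def max_future_reward(state):
    """Best total reward obtainable from `state` onward (0 once the episode ends)."""
    if state is None:
        return 0
    return max(q_value(state, action) for action in TRANSITIONS[state])


def q_value(state, action):
    """Immediate reward plus the maximum future reward of the next state."""
    reward, next_state = TRANSITIONS[state][action]
    return reward + max_future_reward(next_state)


q_eat = q_value("first_step", "eat")          # 1 + 0
q_abstain = q_value("first_step", "abstain")  # 0 + 2
print(q_eat, q_abstain)  # prints "1 2"
```

Under these assumptions, abstaining at the first time step has the higher value even though its immediate reward is zero, because the value of an action also counts the best reward reachable from the state that follows.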
In other words, given any state-action pair, the quality (Q) of performing an action (a) at the given...