By now, you understand the intuition behind using a neural network to approximate the optimal action-value function Q*(s, a), which tells us the best possible action at any given state. It follows that the optimal sequence of actions, over a sequence of states, generates the optimal sequence of rewards. Hence, our neural network is trying to estimate a function that scores every possible action at a given state, so that acting on those scores yields the optimal reward over the episode as a whole. As you will also recall, the optimal quality function Q*(s, a) that we need to estimate must satisfy the Bellman equation. The Bellman equation models the maximum possible future reward as the reward received at the current time step, plus the discounted maximum reward achievable from the immediately following time step:
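$$Q^*(s, a) = \mathbb{E}\!\left[\, r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a \right]$$

Here, $r$ is the immediate reward, $s'$ is the state reached at the next time step, and $\gamma$ is the discount factor that weights future rewards relative to immediate ones.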
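To make this concrete, here is a minimal Python sketch of how the Bellman equation turns a single transition (s, a, r, s') into a training target for the network. Everything here is illustrative rather than taken from the text: the linear `q_values` stand-in, the `gamma` value, and the function names are all assumptions made for the example.

```python
import numpy as np

def q_values(state):
    # Toy stand-in for the Q-network: a fixed linear map from a
    # 3-dimensional state to one Q-value per action (2 actions here).
    weights = np.array([[0.1, -0.2],
                        [0.3,  0.4],
                        [-0.5, 0.2]])
    return state @ weights  # shape: (num_actions,)

gamma = 0.99  # discount factor applied to future rewards

def bellman_target(reward, next_state, done):
    """Bellman target for one transition: r + gamma * max_a' Q(s', a')."""
    if done:
        # Terminal transition: there is no future reward to bootstrap from.
        return reward
    return reward + gamma * np.max(q_values(next_state))

# One hypothetical transition: the target below is the value that the
# network's prediction Q(s, a) would be regressed toward during training.
next_state = np.array([0.5, -1.0, 0.25])
print(bellman_target(reward=1.0, next_state=next_state, done=False))
```

The key point the sketch illustrates is that the target itself is built from the network's own estimate at the next state, which is exactly the self-consistency condition the Bellman equation imposes.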
Hence, we need to ensure that the conditions set forth by the Bellman equation are maintained when we aim...