Dynamic Programming in RL
DP plays an important role in RL because the number of possible decision sequences grows far too quickly to enumerate. Consider whether a robot should turn left or right given the current state of the environment: evaluating the outcome of every possible sequence of choices by brute force is infeasible. With DP, however, we can evaluate states efficiently by reusing solutions to subproblems, using the methods we learned in the previous section.
We have seen the Bellman equation in previous chapters. Let's reiterate the basics and see how it exhibits both properties required for DP: optimal substructure and overlapping subproblems.
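As a reminder, the state-value form of the Bellman equation for a policy π can be written as follows (standard finite-MDP notation, which may differ slightly in symbols from the version shown in earlier chapters):

v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]

The value of a state s is defined recursively in terms of the values of its successor states s' (optimal substructure), and the same successor values are reused across many states and many updates (overlapping subproblems), which is exactly the structure DP exploits.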
Assuming the environment is a finite Markov Decision Process (MDP), we describe it with a finite set of states, S, where each state captures the configuration of the environment, for instance the current position of the robot; a finite set of actions, A, which gives the action space; and a finite set of rewards, R. We denote the discounting rate by γ, a value between 0 and 1 that controls how strongly future rewards are discounted relative to immediate ones.
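To make these definitions concrete, here is a minimal sketch of such a finite MDP in Python, using a toy "robot turning" example. All names here (the states, actions, transition table, and GAMMA value) are illustrative assumptions, not part of the original example; the transition table maps a state-action pair to a list of (probability, next state, reward) outcomes, and the backup function applies one Bellman expectation update.

```python
# A minimal sketch of a finite MDP for a toy "robot turning" example.
# All names and values below are illustrative assumptions.

STATES = ["corridor", "junction", "goal"]   # finite state set S
ACTIONS = ["left", "right"]                 # finite action set A
GAMMA = 0.9                                 # discounting rate, between 0 and 1

# transitions[(s, a)] -> list of (probability, next_state, reward)
transitions = {
    ("corridor", "left"):  [(1.0, "junction", 0.0)],
    ("corridor", "right"): [(1.0, "corridor", -1.0)],
    ("junction", "left"):  [(0.8, "goal", 10.0), (0.2, "corridor", 0.0)],
    ("junction", "right"): [(1.0, "corridor", -1.0)],
    ("goal", "left"):      [(1.0, "goal", 0.0)],   # terminal state: self-loop
    ("goal", "right"):     [(1.0, "goal", 0.0)],
}

def bellman_backup(V, s, policy):
    """One Bellman expectation backup for state s.

    policy[(s, a)] is the probability of taking action a in state s.
    """
    return sum(
        policy[(s, a)] * sum(p * (r + GAMMA * V[s2])
                             for p, s2, r in transitions[(s, a)])
        for a in ACTIONS
    )

# Example: one sweep of backups for a uniform random policy.
uniform = {(s, a): 1.0 / len(ACTIONS) for s in STATES for a in ACTIONS}
V = {s: 0.0 for s in STATES}
V = {s: bellman_backup(V, s, uniform) for s in STATES}
print(V)
```

Repeating such sweeps until the values stop changing is the essence of the DP-based evaluation methods discussed in this chapter.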