Introducing the reward: Markov reward process
In our robot example so far, we have not identified any situation/state as "good" or "bad." In any system, though, there are states we want to be in and others we want to avoid. In this section, we attach rewards to states/transitions, which gives us a Markov reward process (MRP). We then assess the "value" of each state.
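To give a preview of what "value" means here: in an MRP, the value of a state is standardly defined as the expected discounted sum of the rewards collected from that state onward. With a discount factor $\gamma \in [0, 1]$, this is:

$$ v(s) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s\right] = \mathbb{E}\left[R_{t+1} + \gamma\, v(S_{t+1}) \,\middle|\, S_t = s\right] $$

We will come back to this definition when we assess the state values of our example.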
Attaching rewards to the grid world example
Remember the version of the robot example in which the robot could not bounce back to the cell it was in when it hit a wall, but instead crashed irrecoverably. From now on, we will work with that version and attach rewards to the process. Now, let's build this example:
- We modify the transition probability matrix: we add a "crashed" state, reassign each state's self-transition probability (the probability of hitting a wall) as a transition into the crashed state, and give the crashed state a self-transition probability of 1:
import numpy as np

# m2 = m ** 2 is the number of cells in the 3x3 grid, and get_P builds the
# original m2 x m2 transition matrix; both come from the earlier example.
P = np.zeros((m2 + 1, m2 + 1))
P[:m2, :m2] = get_P(3, 0.2, 0.3, 0.25, 0.25)
# Move each state's self-transition probability (hitting a wall) to the
# new "crashed" state, and make the crashed state absorbing.
for i in range(m2):
    P[i, m2] = P[i, i]
    P[i, i] = 0
P[m2, m2] = 1
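As a quick sanity check, you can verify that the modified matrix is still a valid transition probability matrix and that the crashed state is absorbing. A minimal check, assuming P and m2 are defined as above:

# Every row of a transition probability matrix must sum to 1.
assert np.allclose(P.sum(axis=1), 1)
# The crashed state (last index) transitions only to itself.
assert P[m2, m2] == 1 and P[m2, :m2].sum() == 0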