Why distributional reinforcement learning?
Say we are in state s with two possible actions, up and down. How do we decide which action to perform in this state? We compute the Q value of each action and select the action that has the maximum Q value; that is, we compute Q(s, up) and Q(s, down) and pick whichever is larger.
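As a minimal sketch, this greedy selection rule might look like the following in Python (the Q values here are made-up numbers, purely for illustration):

```python
# Hypothetical Q values for the two actions in state s.
Q = {"up": 1.5, "down": 0.9}

# Greedy selection: pick the action with the maximum Q value.
best_action = max(Q, key=Q.get)
print(best_action)  # up, since Q(s, up) > Q(s, down)
```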
We learned that the Q value is the expected return an agent would obtain when starting from state s, performing action a, and then following the policy $\pi$:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\middle|\, s_0 = s, a_0 = a \right]$$
But there is a small problem with computing the Q value in this manner: the Q value is just an expectation of the return, and an expectation does not capture the intrinsic randomness in the return. Let's understand exactly what this means with an example.
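To see how an expectation can hide randomness, consider a quick sketch (the two return distributions below are invented for illustration): both actions have the same expected return, yet their actual returns behave very differently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical return distributions with the same expectation.
# Action 1: always yields a return of 5.
returns_a = np.full(10_000, 5.0)
# Action 2: yields 0 or 10 with equal probability.
returns_b = rng.choice([0.0, 10.0], size=10_000)

# Both Q values (expected returns) come out the same...
print(returns_a.mean(), returns_b.mean())   # 5.0 and roughly 5.0
# ...but the spread of the returns is completely different.
print(returns_a.std(), returns_b.std())     # 0.0 and roughly 5.0
```

If we look only at the Q values, the two actions are indistinguishable, even though one is perfectly predictable and the other is highly variable.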
Let's suppose we want to drive home from work and we have two routes, A and B. Now, we have to decide which route is better, that is, which route helps us to reach...