At first glance, this may appear quite simple. We already saw how the cabby can be incentivized by awarding it +20 points for a correct dropoff, -10 for an incorrect one, and -1 for each time step it takes to complete the episode. Logically, then, the total reward collected by an agent over an episode is simply the sum of the individual rewards the agent receives at each time step. We can denote this mathematically and represent the total reward in an episode as follows:

$$R = r_1 + r_2 + r_3 + \dots + r_n = \sum_{t=1}^{n} r_t$$
Here, n simply denotes the number of time steps in the episode, and r_t is the reward received at time step t. This seems intuitive enough. We can now ask our agent to maximize the total reward in a given episode. But there's a problem. Just like our own reality, the environment faced by our agent may be governed by largely random events. Hence, there may be no guarantee...
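To make this concrete, here is a minimal sketch that accumulates the per-step rewards over one episode of the taxi problem. It assumes the `gymnasium` package and its `Taxi-v3` environment (which implements this same +20/-10/-1 reward scheme), and it uses a random policy purely for illustration:

```python
import gymnasium as gym

# Assumes the `gymnasium` package is installed; Taxi-v3 implements
# the +20 (correct dropoff), -10 (illegal action), -1 (per step) rewards.
env = gym.make("Taxi-v3")
obs, info = env.reset(seed=0)

total_reward = 0  # R = r_1 + r_2 + ... + r_n
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()  # random policy, for illustration only
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward  # accumulate the reward for this time step

print(f"Total reward for this episode: {total_reward}")
env.close()
```

Running this a few times with different seeds shows episode totals that vary from run to run, which foreshadows the randomness discussed above.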