Let's talk about the self-driving taxi agent that we'll be building. Recall that the Taxi-v2 environment has 500 states, and 6 possible actions that can be taken from each state.
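To see where the 500 comes from, it can help to count the pieces of a state: 25 taxi positions on the 5x5 grid, 5 passenger locations (four landmarks plus "in the taxi"), and 4 possible destinations. The sketch below is only an illustration that makes no use of the environment itself; the helper names `encode_state` and `decode_state` are hypothetical, and the packing order is an assumption about how a state index could be laid out.

```python
# Illustrative sketch only: helper names and packing order are assumptions.
GRID_ROWS, GRID_COLS = 5, 5     # taxi position on the 5x5 grid
PASSENGER_LOCATIONS = 5         # four landmarks, or riding in the taxi
DESTINATIONS = 4                # one of the four landmarks
NUM_ACTIONS = 6                 # south, north, east, west, pickup, dropoff

NUM_STATES = GRID_ROWS * GRID_COLS * PASSENGER_LOCATIONS * DESTINATIONS


def encode_state(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four state components into a single integer index."""
    index = taxi_row
    index = index * GRID_COLS + taxi_col
    index = index * PASSENGER_LOCATIONS + passenger_loc
    index = index * DESTINATIONS + destination
    return index


def decode_state(index):
    """Unpack an integer index back into its four components."""
    index, destination = divmod(index, DESTINATIONS)
    index, passenger_loc = divmod(index, PASSENGER_LOCATIONS)
    taxi_row, taxi_col = divmod(index, GRID_COLS)
    return taxi_row, taxi_col, passenger_loc, destination


if __name__ == "__main__":
    print(NUM_STATES)                           # 25 * 5 * 4
    print(decode_state(encode_state(2, 3, 1, 0)))
```

Every index from 0 up to `NUM_STATES - 1` corresponds to exactly one combination of taxi position, passenger location, and destination, which is why one integer is enough to identify a state.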
Your objective in the taxi environment is to pick up a passenger at one location, and drop them off at their desired destination in as few timesteps as possible.
You receive points for a successful drop-off, and lose points for every timestep it takes to complete the task. You also lose points for incorrect actions, such as dropping a passenger off at the wrong location.
Because your goal is to get to both the pickup and drop-off locations as quickly as possible, you lose one point for every timestep you take.
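As a rough worked example of how these points add up, the sketch below totals the reward for one episode. It assumes the reward values commonly documented for the Taxi environment (+20 for a successful drop-off, -1 for an ordinary timestep, -10 for an illegal pickup or drop-off); the constant names and the `episode_return` helper are hypothetical and not part of the environment.

```python
# Illustrative sketch only: reward values are assumed, names are hypothetical.
STEP_PENALTY = -1             # every ordinary timestep
ILLEGAL_ACTION_PENALTY = -10  # pickup or drop-off attempted in the wrong place
DROPOFF_REWARD = 20           # the final, successful drop-off


def episode_return(ordinary_steps, illegal_actions=0):
    """Total reward for an episode that ends with a successful drop-off.

    ordinary_steps: timesteps spent moving or picking up the passenger
    illegal_actions: timesteps spent on an illegal pickup or drop-off
    """
    return (
        ordinary_steps * STEP_PENALTY
        + illegal_actions * ILLEGAL_ACTION_PENALTY
        + DROPOFF_REWARD
    )


if __name__ == "__main__":
    print(episode_return(12))      # 12 ordinary steps, then the drop-off
    print(episode_return(12, 1))   # same route with one illegal action
```

Under these assumed values, a single illegal action costs as much as ten extra timesteps, so a shorter route with no mistakes always scores higher.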
Your agent's goal in solving this problem is to find the optimal policy...