In reinforcement learning, we are trying to solve the problem of crediting immediate actions for the delayed rewards they eventually yield. These rewards are simply sparse, time-delayed labels used to control the agent's behavior. So far, we have discussed how an agent may act upon different states of an environment. We also saw how these interactions generate rewards for the agent and unlock new states of the environment. From there, the agent keeps interacting with the environment until the end of an episode. It's about time we mathematically formalized these relations between agent and environment for the purpose of goal optimization. To do this, we will call upon a framework named after the Russian mathematician Andrey Markov: the Markov decision process (MDP).
This mathematical framework allows us to model our agent's interactions with the environment.
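To make the idea concrete, here is a minimal sketch of an MDP in Python. The state names, actions, transition probabilities, and reward values below are illustrative assumptions, not values from the text; the point is only that the next state and reward depend solely on the current state and the chosen action, which is the Markov property at the heart of the framework.

```python
import random

# A toy MDP (all names and numbers are illustrative assumptions).
# transitions[state][action] -> list of (probability, next_state, reward)
transitions = {
    "start": {
        "left":  [(1.0, "trap", -1.0)],
        "right": [(0.8, "goal", +1.0), (0.2, "trap", -1.0)],
    },
    # "goal" and "trap" are terminal states with no outgoing actions.
}
terminal_states = {"goal", "trap"}

def step(state, action):
    """Sample the next state and reward from the transition model."""
    outcomes = transitions[state][action]
    probs = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=probs)[0]
    return next_state, reward

# Run one episode with a uniformly random policy.
state, total_reward = "start", 0.0
while state not in terminal_states:
    action = random.choice(list(transitions[state]))
    state, reward = step(state, action)
    total_reward += reward

print(f"Episode ended in '{state}' with total reward {total_reward}")
```

Note that `step()` consults only the current state and action, never the history of earlier states; encoding the model as a transition table makes this memorylessness explicit.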