So, how can we compensate for this divergence? One way is to discount future rewards, thereby weighting current rewards more heavily than rewards from future time steps. We can achieve this by applying a discount factor to the reward generated at each time step when we calculate the total reward for a given episode. The purpose of this discount factor is to dampen future rewards relative to current ones. In the short term, we are more certain of collecting rewards from the corresponding state-action pairs; in the long run, that certainty erodes because of the cumulative effect of the random events that populate the environment. Hence, to incentivize the agent to focus on relatively certain rewards, we can modify our earlier formulation for total reward to include this discount factor, like so:
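One standard way to write this, sketched here with conventional symbols since the earlier notation isn't reproduced (gamma, a value between 0 and 1, for the discount factor; r_t for the reward at time step t; and T for the final time step of the episode), is:

R = \sum_{t=0}^{T} \gamma^{t} r_{t} = r_0 + \gamma r_1 + \gamma^{2} r_2 + \dots + \gamma^{T} r_T

With gamma close to 0, the agent becomes myopic and cares almost exclusively about immediate rewards; with gamma close to 1, it weighs distant rewards nearly as heavily as current ones.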
In our new total reward formulation...