As discussed in the introduction, we have an environment described by a state s (s∈S, where S is the set of all possible states) and an agent that can perform an action a (a∈A, where A is the set of all possible actions), resulting in the movement of the agent from one state to another. The agent is rewarded for its action, and the goal of the agent is to maximize the reward. In Q-learning, the agent learns which action to take (the policy, π) by estimating the value Q of each state-action combination and choosing the action that maximizes the reward (R). In choosing an action, the agent takes into account not only the present reward but also discounted future rewards:
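One standard way to write this objective, assuming the conventional discount factor γ (with 0 ≤ γ < 1) and a reward r_t received at step t (symbols introduced here for illustration rather than taken from the text above), is the discounted return:

$$
R_t = r_t + \gamma\, r_{t+1} + \gamma^2\, r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k}
$$

The Q value of a state-action pair can then be read as an estimate of this return when the agent starts in state s, takes action a, and follows its policy thereafter.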
The agent starts with some arbitrary initial value of Q and, as it selects an action a and receives a reward r, it updates the state s' (which depends on the past...