We assume a simple environment with four states. In every state, there are two possible actions. For simplicity, the transitions are deterministic, meaning the next state is determined entirely by the current state and the action taken by the agent. Here is a diagram illustrating the MDP we are discussing:
One state is the initial state, and another is the goal. Every state other than the goal offers the two possible actions. The numbers on the edges represent the rewards obtained from the corresponding transitions. We are going to construct an agent that updates its action-value function using the Q-learning algorithm.
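To make this concrete, below is a minimal sketch of tabular Q-learning on a small deterministic MDP of this shape. Since the diagram is not reproduced here, the state numbering, transition table, and rewards in the code are illustrative placeholders rather than the exact values from the figure. The update applied inside the loop is the standard Q-learning rule, $Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$.

```python
import numpy as np

# Hypothetical stand-in for the diagram's MDP: 4 states (0..3),
# with state 0 as the initial state and state 3 as the goal, and
# 2 actions per non-goal state. The transitions and rewards below
# are illustrative placeholders, not the ones from the figure.
N_STATES, N_ACTIONS = 4, 2
GOAL = 3

# next_state[s][a] and reward[s][a] for each non-goal state s.
next_state = {0: [1, 2], 1: [0, 3], 2: [0, 3]}
reward     = {0: [0, 0], 1: [0, 1], 2: [0, 1]}

alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration
Q = np.zeros((N_STATES, N_ACTIONS))     # tabular action-value function
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0                               # start each episode at the initial state
    while s != GOAL:
        # Epsilon-greedy action selection over the current Q estimates.
        if rng.random() < epsilon:
            a = int(rng.integers(N_ACTIONS))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r = next_state[s][a], reward[s][a]
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(Q)   # learned action values; argmax per row gives the greedy policy
```

After enough episodes, taking the greedy action `np.argmax(Q[s])` in each state traces the highest-value path from the initial state to the goal under this placeholder layout.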