Building on the Markov chain, an MDP adds an agent and a decision-making process. Let's go ahead and develop an MDP, then calculate the value function under the optimal policy.
Besides a set of possible states, S = {s0, s1, ... , sm}, an MDP is defined by a set of actions, A = {a0, a1, ... , an}; a transition model, T(s, a, s'); a reward function, R(s); and a discount factor, 𝝲. The transition matrix, T(s, a, s'), contains the probability of landing in state s' after taking action a in state s. The discount factor, 𝝲, controls the tradeoff between future rewards and immediate ones.
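These ingredients are tied together by the value function under the optimal policy, V*(s), which satisfies the standard Bellman optimality equation, V*(s) = R(s) + 𝝲 max_a Σ_s' T(s, a, s') V*(s'): the value of a state is its immediate reward plus the discounted value of the best action's expected successor state.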
To make our MDP slightly more complicated, we extend the study-and-sleep process with one more state, s2 (play games). Let's say we have two actions, a0 (work) and a1 (slack). The 3 × 2 × 3 transition matrix T(s, a, s') is as follows:
...
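As a rough sketch of how this can be wired up in code, the following snippet evaluates a deterministic policy by solving the Bellman expectation equation V = R + 𝝲 · P_π · V in closed form; comparing the resulting values across candidate policies is one way to find the optimal one. Note that the transition probabilities, rewards, and discount factor below are illustrative placeholders rather than the values from our example, and the use of PyTorch tensors is just one possible choice.

```python
import torch

# Placeholder MDP, not the actual numbers from the text:
# T[s, a, s'] = probability of landing in s' after taking action a in state s.
T = torch.tensor([[[0.8, 0.1, 0.1],    # s0, a0 (work)
                   [0.1, 0.6, 0.3]],   # s0, a1 (slack)
                  [[0.7, 0.2, 0.1],    # s1, a0
                   [0.1, 0.8, 0.1]],   # s1, a1
                  [[0.6, 0.2, 0.2],    # s2, a0
                   [0.1, 0.4, 0.5]]])  # s2, a1
R = torch.tensor([1.0, 0.0, -1.0])     # illustrative reward for s0, s1, s2
gamma = 0.5                            # illustrative discount factor

def policy_value(policy, T, R, gamma):
    """Evaluate a deterministic policy (a list mapping state index -> action index)
    by solving V = R + gamma * P_pi @ V, where P_pi[s, s'] = T[s, policy[s], s']."""
    n = T.shape[0]
    P_pi = torch.stack([T[s, policy[s]] for s in range(n)])
    return torch.inverse(torch.eye(n) - gamma * P_pi) @ R

# Value of always working (a0) versus always slacking (a1) in every state:
print(policy_value([0, 0, 0], T, R, gamma))
print(policy_value([1, 1, 1], T, R, gamma))
```

Solving the linear system directly is practical here because the state space is tiny; for larger MDPs, iterative methods such as value iteration are typically used instead.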