An MDP expresses the problem of sequential decision-making, in which actions influence the next state and the resulting rewards. MDPs are general and flexible enough to formalize the problem of learning to achieve a goal through interaction, the same problem addressed by RL. We can therefore express and reason about RL problems in terms of MDPs.
An MDP is a four-tuple (S, A, P, R), sketched in code after the list below:
- S is the state space, a finite set of states.
- A is the action space, a finite set of actions.
- P is the transition function, which defines the probability of reaching state s′ from state s by taking action a: P(s′, s, a) = p(s′ | s, a), the conditional probability of s′ given s and a.
- R is the reward function, which determines the reward received for transitioning to state s′ after taking action a in state s.
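To make the tuple concrete, the following minimal sketch encodes a hypothetical two-state MDP in Python. The state names, action names, probabilities, and rewards are illustrative assumptions, not an example taken from the text.

```python
import random

# A minimal sketch of the (S, A, P, R) tuple for a hypothetical two-state MDP.
# All names and numbers below are illustrative assumptions.

S = ["s0", "s1"]          # finite state space
A = ["stay", "move"]      # finite action space

# Transition function P: maps a state s and action a to a distribution over
# next states s', i.e. P[s][a][s'] = p(s' | s, a). Each distribution sums to 1.
P = {
    "s0": {"stay": {"s0": 0.9, "s1": 0.1},
           "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.1, "s1": 0.9},
           "move": {"s0": 0.8, "s1": 0.2}},
}

# Reward function R: the reward received for transitioning to s'
# after taking action a in state s.
R = {
    "s0": {"stay": {"s0": 0.0, "s1": 1.0},
           "move": {"s0": 0.0, "s1": 1.0}},
    "s1": {"stay": {"s0": 0.0, "s1": 0.0},
           "move": {"s0": 0.0, "s1": 0.0}},
}

def step(s, a):
    """Sample a next state s' from p(. | s, a) and return (s', reward)."""
    next_states = list(P[s][a].keys())
    probs = list(P[s][a].values())
    s_next = random.choices(next_states, weights=probs, k=1)[0]
    return s_next, R[s][a][s_next]

if __name__ == "__main__":
    s = "s0"
    for _ in range(5):
        a = random.choice(A)           # a random policy, for illustration only
        s_next, r = step(s, a)
        print(f"s={s}, a={a} -> s'={s_next}, r={r}")
        s = s_next
```

Representing P and R as nested dictionaries keeps the finite-state, finite-action structure explicit; an agent interacting with this MDP only needs the `step` function, which hides the transition probabilities just as the environment does in RL.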
An illustration...