An RL problem can be formalized as an MDP, which provides an abstract framework for modeling goal-based learning problems. An MDP is defined by a set of states, actions, rewards, and transition probabilities, and solving an MDP means finding a policy that maximizes the expected return from each state. The Markov property is intrinsic to the MDP and ensures that future states depend only on the current state, not on the history of states that preceded it.
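To make these components concrete, the sketch below encodes a small, hypothetical two-state MDP as plain Python dictionaries. The states, actions, probabilities, and rewards here are invented purely for illustration and are not taken from any specific problem in the text.

```python
# A minimal, hypothetical two-state MDP encoded as
# (states, actions, transition probabilities, rewards).
# All names and numbers are illustrative assumptions.

states = ["s0", "s1"]
actions = ["stay", "move"]

# P[s][a] is a list of (probability, next_state, reward) tuples,
# i.e. the transition dynamics and reward for taking action a in state s.
P = {
    "s0": {
        "stay": [(1.0, "s0", 0.0)],
        "move": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    },
    "s1": {
        "stay": [(1.0, "s1", 0.0)],
        "move": [(0.9, "s0", 2.0), (0.1, "s1", 0.0)],
    },
}
```

Note how the Markov property is built into this representation: the distribution over next states and rewards is indexed only by the current state and action, never by earlier states.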
Using the definition of an MDP, we then formulated the concepts of policy, return function, expected return, action-value function, and value function. The latter two can be expressed in terms of the values of the subsequent states; the resulting recursive equations are called Bellman equations. These equations are useful because they provide a way to compute value functions iteratively. The optimal value functions can then be used to derive the optimal policy.
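As an illustration of this iterative use of the Bellman equations, here is a minimal value-iteration sketch that repeatedly applies the Bellman optimality backup on the same kind of hypothetical two-state MDP as above. The discount factor, convergence threshold, and the MDP itself are all assumptions made for the example.

```python
# Value iteration: repeatedly apply the Bellman optimality backup
#   V(s) <- max_a sum_{s'} P(s', r | s, a) * (r + gamma * V(s'))
# until the value function stops changing, then extract a greedy policy.
# The MDP, gamma, and theta below are illustrative assumptions.

gamma = 0.9   # discount factor (assumed)
theta = 1e-6  # convergence threshold (assumed)

states = ["s0", "s1"]
actions = ["stay", "move"]
P = {  # P[s][a] -> list of (probability, next_state, reward)
    "s0": {"stay": [(1.0, "s0", 0.0)],
           "move": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 0.0)],
           "move": [(0.9, "s0", 2.0), (0.1, "s1", 0.0)]},
}

V = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        # Action values computed from the current value estimates.
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
             for a in actions}
        new_v = max(q.values())
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < theta:
        break

# The optimal policy is greedy with respect to the optimal value function.
policy = {
    s: max(actions,
           key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
    for s in states
}
print(V, policy)
```

The final step mirrors the last sentence above: once the optimal value function has converged, the optimal policy is simply the action that is greedy with respect to it in each state.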
RL algorithms...