Finite MDPs are a simple yet fundamental framework. We introduce the trajectories of rewards that the agent aims to optimize, define the policy and value functions used to formulate the optimization problem, and present the Bellman equations that form the basis for the solution methods.
Dynamic programming – Value and Policy iteration
Finite MDPs
MDPs frame the agent-environment interaction as a sequential decision problem over a series of time steps t = 1, ..., T that constitute an episode. Time steps are assumed to be discrete, but the framework can be extended to continuous time.
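To make this framing concrete, the following minimal sketch represents a small finite MDP as a transition table and rolls out one episode of discrete time steps under a random policy. The two states, two actions, transition probabilities, and rewards are purely hypothetical assumptions for illustration, not values taken from the text.

```python
# Minimal sketch of a finite MDP (all names and numbers are illustrative assumptions).
import numpy as np

states = ['low', 'high']       # hypothetical state labels
actions = ['hold', 'trade']    # hypothetical action labels

# P[state][action] -> list of (next_state, probability, reward)
P = {
    'low':  {'hold':  [('low', 0.9, 0.0), ('high', 0.1, 1.0)],
             'trade': [('low', 0.5, -0.2), ('high', 0.5, 1.0)]},
    'high': {'hold':  [('high', 0.8, 1.0), ('low', 0.2, 0.0)],
             'trade': [('high', 0.6, 1.5), ('low', 0.4, -0.5)]},
}

def run_episode(T=5, seed=0):
    """Roll out one episode of T discrete time steps under a uniform random policy."""
    rng = np.random.default_rng(seed)
    state, trajectory = 'low', []
    for t in range(T):
        action = rng.choice(actions)
        next_states, probs, rewards = zip(*P[state][action])
        i = rng.choice(len(next_states), p=probs)   # sample the next state
        trajectory.append((t, state, action, rewards[i]))
        state = next_states[i]
    return trajectory

print(run_episode())
```

The episode is simply the resulting trajectory of (time step, state, action, reward) tuples, which is the object the agent's return is computed over.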
The abstraction afforded by MDPs makes the framework easily adaptable to many contexts. The time steps can be at arbitrary intervals...