Policy iteration
In this section, we are going to analyze a strategy to find an optimal policy based on complete knowledge of the environment (in terms of transition probabilities and expected returns). The first step is to define a method that can be employed to build a greedy policy. Let's suppose we're working with a finite MDP and a generic policy π; we can define the intrinsic value of a state s_t as the expected discounted return obtained by the agent starting from s_t and following the stochastic policy π:

V^{\pi}(s_t) = E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\Big|\; s_t \right]
In this case, we are assuming that, as the agent will follow π, the state s_a is more useful than s_b if the expected return starting from s_a is greater than the one obtained starting from s_b. Unfortunately, trying to directly find the value of each state using the previous definition is almost impossible, because it would require averaging the return over every possible trajectory generated by π. However, this is a problem that can be solved using dynamic programming (for further information, please refer to R. A. Howard, Dynamic Programming and...).
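To make the dynamic programming idea concrete, the following is a minimal sketch (not taken from the text) of iterative policy evaluation with NumPy: it repeatedly applies the Bellman expectation backup until the value estimates stop changing, yielding V^π(s) for every state. The tensors P and R, the policy pi, the function name, and the toy numbers are all illustrative assumptions.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Return V[s], the expected discounted return when following pi from s.

    Assumed (hypothetical) conventions:
      P[s, a, s']  transition probability
      R[s, a, s']  expected immediate reward
      pi[s, a]     probability of picking action a in state s
    """
    nS = P.shape[0]
    V = np.zeros(nS)
    while True:
        # Bellman expectation backup:
        # V(s) <- sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,a,s') + gamma * V(s')]
        V_new = np.einsum('sa,sap,sap->s', pi, P, R + gamma * V[None, None, :])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Toy 2-state, 2-action MDP (hypothetical numbers, for illustration only)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[s, a, s']
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 0.0]]])   # R[s, a, s']
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])                # uniform random policy

print(policy_evaluation(P, R, pi))
```

This evaluation step is the first half of policy iteration: once V^π has converged, the values can be used to build a new greedy policy, and the two steps alternate until the policy stops changing.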