Building policy iteration

For us to determine the best policy, we first need a method to evaluate a given policy for a state. We can evaluate a policy by sweeping through all of the states of an MDP and, within each state, evaluating every action. This gives us a value function for each state that we can then refine through successive, iterative updates. Mathematically, we can take the previous Bellman optimality equation and derive a new update to the state value function, as shown here:
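In a standard form (writing $v_k$ for the value function at iteration $k$, which is assumed to match the notation described below), the update is:

$$v_{k+1}(s) = \mathbb{E}_{\pi}\left[R_{t+1} + \gamma\, v_k(S_{t+1}) \mid S_t = s\right] = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_k(s')\right]$$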

In the preceding equation, the symbol $\mathbb{E}_{\pi}$ represents an expectation under the policy, and $v_{k+1}(s)$ denotes the expected state value in the new, updated value function. Inside this expectation, we can see that the update depends on the returned reward plus the discounted value of the next state under the previous value function, given an already chosen action. This means that our algorithm will iterate over every state and action, evaluating a new state value using the preceding update equation. This process is called backing up, or planning, and it is helpful to visualize how the algorithm works using backup diagrams. The following is an example of the backup diagrams for action value and state value backups:



Backup diagrams for action value and state value backups

Diagram (a), for the action value $q_{\pi}(s, a)$, is the part of the backup or evaluation that tries each action and hence provides us with action values. The second part of the evaluation comes from the update and is shown in diagram (b), for the state value $v_{\pi}(s)$. Recall that the update evaluates the forward states by evaluating each of the state's actions. The diagrams represent the point of evaluation with a filled-in solid circle. Notice how the action value backup only looks at the forward actions, while the state value backup looks at the action values for each forward state. Of course, it will be helpful to see how this comes together in code. Before we get to that, however, we want to do some housekeeping in the next section.
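For reference, the two backups correspond to the following standard relationships between action values and state values (writing $q_{\pi}$ for the action value and $v_{\pi}$ for the state value, as labeled above):

$$q_{\pi}(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_{\pi}(s')\right], \qquad v_{\pi}(s) = \sum_{a} \pi(a \mid s)\, q_{\pi}(s, a)$$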

Installing OpenAI Gym

To help encourage research and development in RL, the OpenAI group provides an open source platform for RL training called Gym. Gym comes with plenty of sample test environments that we can work through over the course of this book. Other RL developers have also built environments that use the same standard interface as Gym. Hence, by learning to use Gym, we will also be able to explore other cutting-edge RL environments later in this book.

Installing Gym is quite simple, but we want to avoid any small mistakes that may cause frustration later. Therefore, it is best to use the following instructions to set up and install an RL environment for development.

It is highly recommended that you use Anaconda for Python development with this book. Anaconda is a free open source cross-platform tool that can significantly increase your ease of development. Please stick with Anaconda unless you consider yourself an experienced Python developer. Google python anaconda to download and install it.

Follow the exercise to set up and install a Python environment with Gym:

  1. Open a new Anaconda Prompt or Python shell. Do this as an admin, or be sure to execute the commands as an admin if required.
  2. From the command line, run the following:
conda create -n chapter2 python=3.6
  3. This will create a new virtual environment for your development. A virtual environment allows you to isolate dependencies and control your versioning. If you are not using Anaconda, you can use a standard Python virtual environment to create a new environment instead. You should also notice that we are forcing the environment to use Python 3.6. Again, this makes sure we know which version of Python we are using.
  4. After the installation, we activate the environment with the following:
activate chapter2
  5. Next, we install Gym with the following command:
pip install gym
  6. Gym will install several dependencies along with the various sample environments we will train on later. A quick way to confirm the installation worked is sketched after this list.
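As a quick sanity check (a minimal sketch, assuming the chapter2 environment created above is active), you can confirm that Gym imports and reports a version before moving on:

# Quick check that Gym installed correctly in the chapter2 environment
import gym

# Print the installed Gym version; any version installed by pip should import cleanly here
print(gym.__version__)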

Before we get too far ahead though, let's now test our Gym installation with code in the next section.

Testing Gym

In the next exercise, we will write code to test Gym and an environment called FrozenLake, which also happens to be our test environment for this chapter. Open the Chapter_2_4.py code example and follow the exercise:

  1. For reference, the code is shown as follows:
from os import system, name
import time
import gym
import numpy as np

env = gym.make('FrozenLake-v0')
env.reset()

def clear():
    # Clear the console so each render replaces the previous one
    if name == 'nt':
        _ = system('cls')
    else:
        _ = system('clear')

for _ in range(1000):
    clear()
    env.render()
    time.sleep(.5)
    env.step(env.action_space.sample())  # take a random action
env.close()
  2. At the top, we have the imports that load the system modules, as well as gym, time, and numpy. numpy is a helper library we use to construct tensors. Tensors are a math/programming concept that can describe single values or multidimensional arrays of numbers.
  3. Next, we build and reset the environment with the following code:
env = gym.make('FrozenLake-v0')
env.reset()
  4. After that, we have a clear function, which we use to clear the console between renders; it is not critical to this example, and the code should be self-explanatory.
  5. This brings us to the for loop, where all of the action, so to speak, happens. The most important line is shown as follows:
env.step(env.action_space.sample())
  6. The env variable represents the environment, and in this line we are letting the agent take a random action on every step, or iteration. In this example, the agent learns nothing and just moves at random, for now.
  7. Run the code as you normally would and pay attention to the output. An example of the output screen is shown in the following screenshot:


Example render from the FrozenLake environment

Since the algorithm/agent moves randomly, it is quite likely to hit a hole, denoted by H, and just stay there. For reference, the legend for FrozenLake is given here (a short sketch that uses this information follows the legend):

  • S = start: This is where the agent starts when reset is called.
  • F = frozen: This allows the agent to move across this area.
  • H = hole: This is a hole in the ice; if the agent moves here, it falls in.
  • G = goal: This is the goal the agent wants to reach, and, when it does, it receives a reward of 1.0.
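As a minimal sketch (assuming the classic Gym step API, which returns an observation, reward, done flag, and info dictionary, and the default 4x4 FrozenLake map), we can make the random agent stop and reset whenever it falls in a hole or reaches the goal, rather than stepping forever:

import gym

# Default 4x4 map renders roughly as SFFF / FHFH / FFFH / HFFG
env = gym.make('FrozenLake-v0')
state = env.reset()

for _ in range(100):
    env.render()
    # Classic Gym step API: observation, reward, done flag, and info dict
    state, reward, done, info = env.step(env.action_space.sample())
    if done:
        # done is True when the agent falls in a hole (reward 0.0)
        # or reaches the goal (reward 1.0)
        print('Episode finished with reward', reward)
        state = env.reset()
env.close()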

Now that we have Gym set up, we can move to evaluate the policy in the next section.

Policy evaluation

Unlike the trial-and-error learning you have already been introduced to, DP methods work as a form of static learning, or what we may call planning. Planning is an appropriate term here since the algorithm evaluates the entire MDP, and hence all states and actions, beforehand. These methods therefore require full knowledge of the environment, including all of its finite states and actions. While this works for known finite environments, such as the one we are playing with in this chapter, these methods do not scale to real-world physical problems. We will, of course, solve real-world problems later in this book. For now, though, let's look at how to evaluate a policy with the previous update equations in code. Open Chapter_2_5.py and follow the exercise:

  1. For reference, the entire block of code, Chapter_2_5.py, is shown as follows:
from os import system, name
import time
import gym
import numpy as np

env = gym.make('FrozenLake-v0')
env.reset()

def clear():
    if name == 'nt':
        _ = system('cls')
    else:
        _ = system('clear')

def act(V, env, gamma, policy, state, v):
    # Apply the policy evaluation update for a single state
    for action, action_prob in enumerate(policy[state]):
        for state_prob, next_state, reward, end in env.P[state][action]:
            v += action_prob * state_prob * (reward + gamma * V[next_state])
    V[state] = v

def eval_policy(policy, env, gamma=1.0, theta=1e-9, terms=1e9):
    V = np.zeros(env.nS)
    delta = 0
    for i in range(int(terms)):
        for state in range(env.nS):
            act(V, env, gamma, policy, state, v=0.0)
        clear()
        print(V)
        time.sleep(1)
        v = np.sum(V)
        if v - delta < theta:
            return V
        else:
            delta = v
    return V

# Start from a uniform random policy: every action equally likely in every state
policy = np.ones([env.env.nS, env.env.nA]) / env.env.nA
V = eval_policy(policy, env.env)

print(policy, V)
  2. At the beginning of the code, we perform the same initial steps as in our test example: we load the imports, create and reset the environment, and then define the clear function.
  3. Next, move to the very end of the code and notice how we initialize the policy by using numpy (np) to fill a tensor of shape states x actions. We then divide the tensor by the number of actions in a state (4, in this case). This gives us a uniform probability of 0.25 per action. Remember that the action probabilities for a state need to sum up to 1.0, or 100% (a short sketch illustrating this initialization appears at the end of this section).
  4. Now, move up to the eval_policy function and focus on the double loop, as shown in the following code block:
for i in range(int(terms)):
    for state in range(env.nS):
        act(V, env, gamma, policy, state, v=0.0)
    clear()
    print(V)
    time.sleep(1)
    v = np.sum(V)
    if v - delta < theta:
        return V
    else:
        delta = v
return V
  5. The first for loop runs for the number of terms, or iterations, before termination. We set a limit here to prevent endless looping. In the inner loop, all of the states in the environment are iterated through and acted on using the act function. After that, we use our previous render code to show the updated values. At the end of each pass of the outer loop, we check whether the calculated total change in the value, v, is less than a particular threshold, theta. If the change in value is less than the threshold, the function returns the calculated value function, V.
  6. At the core of the algorithm is the act function, where the update equation operates; the inside of this function is shown as follows:
for action, action_prob in enumerate(policy[state]):
    for state_prob, next_state, reward, end in env.P[state][action]:
        v += action_prob * state_prob * (reward + gamma * V[next_state]) #update
V[state] = v
  7. The first for loop iterates through all of the actions in the policy for the given state. Recall that we start by initializing the policy to 0.25 for every action, so action_prob = 0.25. Then, we loop through every transition for the state and action and apply the update, shown on the line marked #update. Finally, the value function, V, for the current state is updated to v. (The transitions come from env.P, the environment's transition model; see the sketch at the end of this section.)
  8. Run the code and observe the output. Notice how the value function is continually updated. At the end of the run, you should see something similar to the following screenshot:
Running example Chapter_2_5.py

If it seems odd that the policy is not updated, that is actually okay for now. The important part here is to see how we update the value function. In the next section, we will look at how we can improve the policy.
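Before moving on, here is a minimal sketch (assuming the same FrozenLake-v0 environment as above, which has 16 states and 4 actions) of the two pieces of data that eval_policy works with: the uniform starting policy and the transition model stored in env.P:

import gym
import numpy as np

env = gym.make('FrozenLake-v0')

# The uniform starting policy: a (states x actions) tensor where every
# row sums to 1.0 and each of the 4 actions has probability 0.25
policy = np.ones([env.env.nS, env.env.nA]) / env.env.nA
print(policy.shape)    # (16, 4)
print(policy[0])       # [0.25 0.25 0.25 0.25]

# env.P is the transition model the act function loops over:
# env.P[state][action] is a list of (probability, next_state, reward, done)
# tuples describing where each action can lead
print(env.env.P[0][0])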

Policy improvement

With policy evaluation under our belt, it is time to move on to improving the policy by looking ahead. Recall we do this by looking at one state ahead of the current state and then evaluating all possible actions. Let's look at how this works in code. Open up the Chapter_2_6.py example and follow the exercise:

  1. For brevity, the following code excerpt from Chapter_2_6.py shows just the new sections of code that were added to the last example:
def evaluate(V, action_values, env, gamma, state):
    for action in range(env.nA):
        for prob, next_state, reward, terminated in env.P[state][action]:
            action_values[action] += prob * (reward + gamma * V[next_state])
    return action_values

def lookahead(env, state, V, gamma):
    # One-step lookahead: action values for every action from the given state
    action_values = np.zeros(env.nA)
    return evaluate(V, action_values, env, gamma, state)

def improve_policy(env, gamma=1.0, terms=1e9):
    # Evaluate the current policy, then make each state greedy with respect to it
    policy = np.ones([env.nS, env.nA]) / env.nA
    evals = 1
    for i in range(int(terms)):
        stable = True
        V = eval_policy(policy, env, gamma=gamma)
        for state in range(env.nS):
            current_action = np.argmax(policy[state])
            action_value = lookahead(env, state, V, gamma)
            best_action = np.argmax(action_value)
            if current_action != best_action:
                stable = False
                policy[state] = np.eye(env.nA)[best_action]
            evals += 1
        if stable:
            return policy, V

#replaced bottom code from previous sample with
policy, V = improve_policy(env.env)
print(policy, V)
  2. Added to the last example are three new functions: improve_policy, lookahead, and evaluate. improve_policy uses a limited loop to iterate through the states in the current environment; before looping through the states, it calls eval_policy to update the value function, passing in the current policy, the environment, and the gamma (discount factor) parameter. Then, for each state, it calls the lookahead function, which internally calls the evaluate function to update the action values for the state. evaluate is a modified version of the act function.
  3. While both functions, eval_policy and improve_policy, use limited for loops to prevent endless looping, they still use very large limits; in the example, the default is terms=1e9. Therefore, we still want a condition that stops the loop much earlier than the terms limit. In policy evaluation, we controlled this by observing the change, or delta, in the value function. In policy improvement, we now look to improve the actual policy, and to do this, we assume a greedy policy. In other words, we want to improve our policy so that it always picks the highest-value action (written out as an equation after this exercise), as shown in the following code:
action_value = lookahead(env, state, V, gamma)
best_action = np.argmax(action_value)
if current_action != best_action:
    stable = False
    policy[state] = np.eye(env.nA)[best_action]
evals += 1

if stable:
    return policy, V
  4. The preceding block of code first applies the numpy function np.argmax to the list of action values returned from the lookahead function. This returns the index of the maximum, best_action, or in other words, the greedy action. We then check whether current_action is equal to best_action; if it is not, we mark the policy as not stable by setting stable to False. Since the current action is not the best, we also update policy with a row of the identity tensor, built using np.eye for the given shape. This step simply assigns the policy a value of 1.0 for the best/greedy action and 0.0 for all others (a tiny example of this one-hot assignment follows the output discussion below).
  5. At the end of the code, you can see that we now just call improve_policy and print the resulting policy and value function.
  6. Run the code as you normally would and observe the output, as shown in the following screenshot:
Example output for Chapter_2_6.py
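For reference, the greedy improvement step implemented by the preceding code can be written in a standard form (using the same $\gamma$, $v$, and transition probabilities as in the earlier update equation):

$$\pi'(s) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_{\pi}(s')\right]$$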

This sample will take a while longer to run, and you should see the value function improve as it runs. When the sample completes, it prints the policy and value function. You can now see how the policy clearly indicates the best action for each state with a value of 1.0. The reason some states still have a value of 0.25 for all actions is that the algorithm sees no need to evaluate or improve the policy in those states; they were likely holes or states outside the optimal path.
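As a tiny illustration of the one-hot assignment mentioned above (a minimal sketch, using the 4 actions of FrozenLake), np.eye builds an identity matrix whose rows serve as ready-made greedy policies for a state:

import numpy as np

# np.eye(4) is the 4 x 4 identity matrix; indexing a row with the
# best action gives a one-hot policy for that state
best_action = 2
print(np.eye(4)[best_action])   # [0. 0. 1. 0.]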

Policy evaluation and improvement is one method we can use for planning with DP, but, in the next section, we will look at a second method called value iteration.
