Unlike trial-and-error learning, the DP methods you have already been introduced to work as a form of static learning, or what we may call planning. Planning is an appropriate term here since the algorithm evaluates the entire MDP, and hence all states and actions, beforehand. These methods therefore require full knowledge of the environment, including all of its finite states and actions. While this works for known finite environments such as the one we are playing with in this chapter, these methods are not practical for real-world physical problems. We will, of course, solve real-world problems later in this book. For now, though, let's look at how to turn the previous update equations into code that evaluates a policy. Open Chapter_2_5.py and follow the exercise:
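Before opening the code, it may help to restate the iterative policy-evaluation update that the listing implements. This is the standard form of the update equations referred to previously, where $\pi(a \mid s)$ is the policy's probability of choosing action $a$ in state $s$, $p(s' \mid s, a)$ is the transition probability, $r$ is the reward for that transition, and $\gamma$ is the discount factor:

$$
V(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \bigl[ r + \gamma V(s') \bigr]
$$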
- For reference, the entire block of code, Chapter_2_5.py, is shown as follows:
```python
from os import system, name
import time
import gym
import numpy as np

env = gym.make('FrozenLake-v0')
env.reset()

def clear():
    # clear the console on Windows ('nt') or Linux/macOS
    if name == 'nt':
        _ = system('cls')
    else:
        _ = system('clear')

def act(V, env, gamma, policy, state, v):
    # apply the update equation for a single state
    for action, action_prob in enumerate(policy[state]):
        for state_prob, next_state, reward, end in env.P[state][action]:
            v += action_prob * state_prob * (reward + gamma * V[next_state])
    V[state] = v

def eval_policy(policy, env, gamma=1.0, theta=1e-9, terms=1e9):
    V = np.zeros(env.nS)
    delta = 0
    for i in range(int(terms)):
        # sweep over every state in the environment
        for state in range(env.nS):
            act(V, env, gamma, policy, state, v=0.0)
        clear()
        print(V)
        time.sleep(1)
        # stop when the total value increases by less than theta since the last sweep
        v = np.sum(V)
        if v - delta < theta:
            return V
        else:
            delta = v
    return V

# uniform random policy: 0.25 probability for each of the 4 actions
policy = np.ones([env.env.nS, env.env.nA]) / env.env.nA
V = eval_policy(policy, env.env)
print(policy, V)
```
- At the beginning of the code, we perform the same initial steps as in our test example. We add the import statements, create and reset the environment, and then define the clear function.
- Next, move to the very end of the code and notice how we initialize the policy by using numpy (imported as np) to fill a tensor of size state x action, matching the environment. We then divide the tensor by the number of actions in a state (4, in this case), which gives each action a uniform probability of 0.25. Remember that the action probabilities of a policy in a given state need to sum to 1.0, or 100%, as confirmed by the short check that follows.
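As a quick sanity check, the minimal sketch below (which assumes the same env object created in the main listing) prints the first row of the uniform policy and confirms that each state's action probabilities sum to 1.0:

```python
import numpy as np

# Assumes env = gym.make('FrozenLake-v0') from the main listing.
n_states, n_actions = env.env.nS, env.env.nA   # 16 states, 4 actions on the 4x4 map
policy = np.ones([n_states, n_actions]) / n_actions

print(policy[0])           # [0.25 0.25 0.25 0.25]
print(policy.sum(axis=1))  # each row sums to 1.0
```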
- Now, move up to the eval_policy function and focus on the double loop, as shown in the following code block:
```python
for i in range(int(terms)):
    for state in range(env.nS):
        act(V, env, gamma, policy, state, v=0.0)
    clear()
    print(V)
    time.sleep(1)
    v = np.sum(V)
    if v - delta < theta:
        return V
    else:
        delta = v
return V
```
- The first for loop runs for at most terms iterations before termination; we set this limit to prevent endless looping. In the inner loop, every state in the environment is iterated through and acted on using the act function. After that, we use our earlier render code to show the updated values. At the end of each pass of the outer loop, we check whether the change in the total value, v, since the last pass is less than a small threshold, theta. If it is, the function returns the calculated value function, V. A more common per-state convergence test is sketched below for comparison.
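Note that the listing's stopping test compares the sum of all state values between sweeps. A more common convergence test in policy evaluation tracks the largest change in any single state's value instead. The following is a minimal sketch of that alternative (it is not the book's code, but it evaluates the same policy and can be swapped in for comparison):

```python
def eval_policy_max_delta(policy, env, gamma=1.0, theta=1e-9, terms=1e9):
    # Alternative stopping rule: stop when no state's value changes
    # by more than theta during a full sweep (sketch for comparison only).
    V = np.zeros(env.nS)
    for _ in range(int(terms)):
        delta = 0.0
        for state in range(env.nS):
            v = 0.0
            for action, action_prob in enumerate(policy[state]):
                for state_prob, next_state, reward, end in env.P[state][action]:
                    v += action_prob * state_prob * (reward + gamma * V[next_state])
            delta = max(delta, abs(v - V[state]))  # largest change this sweep
            V[state] = v
        if delta < theta:
            break
    return V

# Usage (assumes policy and env from the main listing):
# V = eval_policy_max_delta(policy, env.env)
```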
- At the core of the algorithm is the act function, which is where the update equation operates; the inside of this function is shown as follows:
```python
for action, action_prob in enumerate(policy[state]):
    for state_prob, next_state, reward, end in env.P[state][action]:
        v += action_prob * state_prob * (reward + gamma * V[next_state]) #update
V[state] = v
```
- The first for loop iterates through all of the actions in the policy for the given state. Recall that we start by initializing the policy to 0.25 for every action, so action_prob = 0.25. Then, we loop through every transition for that state and action and apply the update, shown on the line marked with the #update comment. Finally, the value function, V, for the current state is set to v. If you are curious about what those transition tuples look like, see the short sketch that follows.
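To see exactly what the inner loop is unpacking, the minimal sketch below (assuming the env object from the main listing) prints the transition tuples for state 0 and action 0. Each entry is a (probability, next_state, reward, done) tuple, and on the slippery FrozenLake-v0 map an action typically has three possible outcomes:

```python
# Inspect the environment's transition model used by act().
# Assumes env = gym.make('FrozenLake-v0') from the main listing.
for state_prob, next_state, reward, end in env.env.P[0][0]:
    print(state_prob, next_state, reward, end)
# Each printed line is one possible outcome of taking action 0 in state 0.
```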
- Run the code and observe the output. Notice how the value function is continually updated. At the end of the run, you should see something similar to the following screenshot:
Running example Chapter_2_5.py
If it seems odd that the policy is never updated, that is okay for now. The important part here is seeing how we update the value function. In the next section, we will look at how we can improve the policy.