Now, it is time to implement the policy gradient algorithm with PyTorch:
- As before, import the necessary packages, create an environment instance, and obtain the dimensions of the observation and action space:
>>> import gym
>>> import torch
>>> env = gym.make('CartPole-v0')
>>> n_state = env.observation_space.shape[0]
>>> n_action = env.action_space.n
- We define the run_episode function, which simulates an episode given the input weight and returns the total reward and the gradients computed. More specifically, it does the following tasks in each step:
- Calculates the probabilities, probs, for both actions based on the current state and input weight
- Samples an action, action, based on the resulting probabilities
- Computes the derivatives, d_softmax, of the softmax function with the probabilities as input
- Divides the resulting derivatives, d_softmax, by the probabilities, probs, to get the derivatives, d_log, of the log term with respect to the policy
- Applies the chain rule to compute the gradient, grad, of the weights
- Records the resulting gradient, grad
- Performs the action, accumulates the reward, and updates the state
Putting all of this into code, we have the following:
>>> def run_episode(env, weight):
... state = env.reset()
... grads = []
... total_reward = 0
... is_done = False
... while not is_done:
... state = torch.from_numpy(state).float()
... z = torch.matmul(state, weight)
... probs = torch.nn.Softmax()(z)
... action = int(torch.bernoulli(probs[1]).item())
... d_softmax = torch.diag(probs) -
probs.view(-1, 1) * probs
... d_log = d_softmax[action] / probs[action]
... grad = state.view(-1, 1) * d_log
... grads.append(grad)
... state, reward, is_done, _ = env.step(action)
... total_reward += reward
... if is_done:
... break
... return total_reward, grads
After an episode finishes, it returns the total reward obtained in this episode and the gradients computed for the individual steps. These two outputs will be used to update the weight.
- Let's make it 1,000 episodes for now:
>>> n_episode = 1000
This means we will run run_episode and n_episodetimes.
- Initiate the weight:
>>> weight = torch.rand(n_state, n_action)
We will also record the total reward for every episode:
>>> total_rewards = []
- At the end of each episode, we need to update the weight using the computed gradients. For every step of the episode, the weight moves by learning rate * gradient calculated in this step * total reward in the remaining steps. Here, we choose 0.001 as the learning rate:
>>> learning_rate = 0.001
Now, we can run n_episodeepisodes:
>>> for episode in range(n_episode):
... total_reward, gradients = run_episode(env, weight)
... print('Episode {}: {}'.format(episode + 1, total_reward))
... for i, gradient in enumerate(gradients):
... weight += learning_rate * gradient * (total_reward - i)
... total_rewards.append(total_reward)
……
……
Episode 101: 200.0
Episode 102: 200.0
Episode 103: 200.0
Episode 104: 190.0
Episode 105: 133.0
……
……
Episode 996: 200.0
Episode 997: 200.0
Episode 998: 200.0
Episode 999: 200.0
Episode 1000: 200.0
- Now, we calculate the average total reward achieved by the policy gradient algorithm:
>>> print('Average total reward over {} episode: {}'.format(
n_episode, sum(total_rewards) / n_episode))
Average total reward over 1000 episode: 179.728
- We also plot the total reward for every episode as follows:
>>> import matplotlib.pyplot as plt
>>> plt.plot(total_rewards)
>>> plt.xlabel('Episode')
>>> plt.ylabel('Reward')
>>> plt.show()
In the resulting plot, we can see a clear upward trend before it stays at the maximum value:
We can also see that the rewards oscillate even after it converges. This is because the policy gradient algorithm is a stochastic policy.
- Now, let's see how the learned policy performs on 100 new episodes:
>>> n_episode_eval = 100
>>> total_rewards_eval = []
>>> for episode in range(n_episode_eval):
... total_reward, _ = run_episode(env, weight)
... print('Episode {}: {}'.format(episode+1, total_reward))
... total_rewards_eval.append(total_reward)
...
Episode 1: 200.0
Episode 2: 200.0
Episode 3: 200.0
Episode 4: 200.0
Episode 5: 200.0
……
……
Episode 96: 200.0
Episode 97: 200.0
Episode 98: 200.0
Episode 99: 200.0
Episode 100: 200.0
Let's see the average performance:
>>> print('Average total reward over {} episode: {}'.format(n_episode, sum(total_rewards) / n_episode))
Average total reward over 1000 episode: 199.78
The average reward for the testing episodes is close to the maximum value of 200 for the learned policy. You can re-run the evaluation multiple times. The results are pretty consistent.