Let's go ahead and implement the hill-climbing algorithm with PyTorch:
- As before, import the necessary packages, create an environment instance, and obtain the dimensions of the observation and action space:
>>> import gym
>>> import torch
>>> env = gym.make('CartPole-v0')
>>> n_state = env.observation_space.shape[0]
>>> n_action = env.action_space.n
- We will reuse the run_episode function we defined in the previous recipe, so we will not repeat it here. Again, given the input weight, it simulates an episode and returns the total reward.
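In case you don't have the previous recipe at hand, the following is a minimal sketch of such a function, assuming the same linear-mapping policy where the action is the index of the larger value in the product of the state and the weight (your exact definition may differ slightly):
>>> def run_episode(env, weight):
...     # Simulate one episode under the linear-mapping policy and
...     # return its total reward (a sketch; prefer your own version
...     # from the previous recipe if it differs).
...     state = env.reset()
...     total_reward = 0
...     is_done = False
...     while not is_done:
...         state = torch.from_numpy(state).float()
...         action = torch.argmax(torch.matmul(state, weight))
...         state, reward, is_done, _ = env.step(action.item())
...         total_reward += reward
...     return total_reward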
- Let's set the number of training episodes to 1,000 for now:
>>> n_episode = 1000
- We need to keep track of the best total reward on the fly, as well as the corresponding weight. So, let's specify their starting values:
>>> best_total_reward = 0
>>> best_weight = torch.rand(n_state, n_action)
We will also record the total reward for every episode:
>>> total_rewards = []
- As mentioned, we will add some noise to the weight in each episode. To keep the noise from overwhelming the weight, we scale it down; here, we choose 0.01 as the noise scale:
>>> noise_scale = 0.01
- Now, we can run the training for n_episode episodes. After we randomly pick an initial weight, for each episode, we do the following:
- Add random noise to the weight
- Let the agent take actions according to the linear mapping
- Run the episode until it terminates and obtain the total reward
- If the current reward is greater than the best one obtained so far, update the best reward and the weight
- Otherwise, the best reward and the weight remain unchanged
- Also, keep a record of the total reward
Put this into code as follows:
>>> for episode in range(n_episode):
...     weight = best_weight + noise_scale * torch.rand(n_state, n_action)
...     total_reward = run_episode(env, weight)
...     if total_reward >= best_total_reward:
...         best_total_reward = total_reward
...         best_weight = weight
...     total_rewards.append(total_reward)
...     print('Episode {}: {}'.format(episode + 1, total_reward))
...
Episode 1: 56.0
Episode 2: 52.0
Episode 3: 85.0
Episode 4: 106.0
Episode 5: 41.0
……
……
Episode 996: 39.0
Episode 997: 51.0
Episode 998: 49.0
Episode 999: 54.0
Episode 1000: 41.0
We also calculate the average total reward achieved by the hill-climbing version of the linear-mapping policy:
>>> print('Average total reward over {} episodes: {}'.format(
...     n_episode, sum(total_rewards) / n_episode))
Average total reward over 1000 episodes: 50.024
- To assess the training using the hill-climbing algorithm, we repeat the training process several times (by re-running the code from Step 4 to Step 6). We observe that the average total reward fluctuates a lot. The following are the results we got from 10 runs:
Average total reward over 1000 episodes: 9.261
Average total reward over 1000 episodes: 88.565
Average total reward over 1000 episodes: 51.796
Average total reward over 1000 episodes: 9.41
Average total reward over 1000 episodes: 109.758
Average total reward over 1000 episodes: 55.787
Average total reward over 1000 episodes: 189.251
Average total reward over 1000 episodes: 177.624
Average total reward over 1000 episodes: 9.146
Average total reward over 1000 episodes: 102.311
What could cause such variance? It turns out that if the initial weight is bad, adding noise at a small scale has little effect on improving the performance, which leads to poor convergence. On the other hand, if the initial weight is good, adding noise at a large scale might move the weight away from the optimal weight and jeopardize the performance. How can we make the training of the hill-climbing model more stable and reliable? We can make the noise scale adaptive to the performance, just like the adaptive learning rate in gradient descent. Let's see Step 8 for more details.
- To make the noise adaptive, we do the following:
- Specify a starting noise scale.
- If the performance in an episode improves, decrease the noise scale. In our case, we take half of the scale, but set 0.0001 as the lower bound.
- If the performance in an episode drops, increase the noise scale. In our case, we double the scale, but set 2 as the upper bound.
Put this into code:
>>> noise_scale = 0.01
>>> best_total_reward = 0
>>> total_rewards = []
>>> for episode in range(n_episode):
...     weight = best_weight + noise_scale * torch.rand(n_state, n_action)
...     total_reward = run_episode(env, weight)
...     if total_reward >= best_total_reward:
...         best_total_reward = total_reward
...         best_weight = weight
...         noise_scale = max(noise_scale / 2, 1e-4)
...     else:
...         noise_scale = min(noise_scale * 2, 2)
...     print('Episode {}: {}'.format(episode + 1, total_reward))
...     total_rewards.append(total_reward)
...
Episode 1: 9.0
Episode 2: 9.0
Episode 3: 9.0
Episode 4: 10.0
Episode 5: 10.0
……
……
Episode 996: 200.0
Episode 997: 200.0
Episode 998: 200.0
Episode 999: 200.0
Episode 1000: 200.0
The reward is increasing as the episodes progress. It reaches the maximum of 200 within the first 100 episodes and stays there. The average total reward also looks promising:
>>> print('Average total reward over {} episodes: {}'.format(
...     n_episode, sum(total_rewards) / n_episode))
Average total reward over 1000 episodes: 186.11
We also plot the total reward for every episode as follows:
>>> import matplotlib.pyplot as plt
>>> plt.plot(total_rewards)
>>> plt.xlabel('Episode')
>>> plt.ylabel('Reward')
>>> plt.show()
In the resulting plot, we can see a clear upward trend before the reward plateaus at the maximum value.
Feel free to run the new training process a few times. The results are very stable compared to learning with a constant noise scale.
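If you would like to automate this comparison, here is one possible sketch; it wraps the Step 8 loop in a hypothetical train_adaptive helper (not part of the recipe) and repeats it a few times to compare the average rewards:
>>> def train_adaptive(env, n_episode=1000, noise_scale=0.01):
...     # Hypothetical helper: the adaptive-noise hill-climbing loop
...     # from Step 8, returning the per-episode total rewards.
...     best_weight = torch.rand(n_state, n_action)
...     best_total_reward = 0
...     rewards = []
...     for episode in range(n_episode):
...         weight = best_weight + noise_scale * torch.rand(n_state, n_action)
...         total_reward = run_episode(env, weight)
...         if total_reward >= best_total_reward:
...             best_total_reward = total_reward
...             best_weight = weight
...             noise_scale = max(noise_scale / 2, 1e-4)
...         else:
...             noise_scale = min(noise_scale * 2, 2)
...         rewards.append(total_reward)
...     return rewards
>>> for run in range(3):
...     rewards = train_adaptive(env)
...     print('Run {}: average total reward {}'.format(
...         run + 1, sum(rewards) / len(rewards)))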
- Now, let's see how the learned policy performs on 100 new episodes:
>>> n_episode_eval = 100
>>> total_rewards_eval = []
>>> for episode in range(n_episode_eval):
...     total_reward = run_episode(env, best_weight)
...     print('Episode {}: {}'.format(episode + 1, total_reward))
...     total_rewards_eval.append(total_reward)
...
Episode 1: 200.0
Episode 2: 200.0
Episode 3: 200.0
Episode 4: 200.0
Episode 5: 200.0
……
……
Episode 96: 200.0
Episode 97: 200.0
Episode 98: 200.0
Episode 99: 200.0
Episode 100: 200.0
Let's see the average performance:
>>> print('Average total reward over {} episodes: {}'.format(
...     n_episode_eval, sum(total_rewards_eval) / n_episode_eval))
Average total reward over 100 episodes: 199.94
The average reward for the testing episodes is close to the maximum of 200 that we obtained with the learned policy. You can re-run the evaluation multiple times. The results are pretty consistent.
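For instance, one quick way to repeat the evaluation (a sketch using a hypothetical evaluate helper) is as follows:
>>> def evaluate(env, weight, n_episode_eval=100):
...     # Hypothetical helper: average total reward of a fixed weight
...     # over a number of evaluation episodes.
...     rewards = [run_episode(env, weight) for _ in range(n_episode_eval)]
...     return sum(rewards) / n_episode_eval
>>> for run in range(5):
...     print('Evaluation run {}: {}'.format(
...         run + 1, evaluate(env, best_weight)))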