[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Mastering TensorFlow 1.x written by Armando Fandango. In this book, you will learn advanced features of TensorFlow1.x, such as distributed TensorFlow with TF Clusters, deploy production models with TensorFlow Serving, and more. [/box]
Today, we will help you understand OpenAI Gym and how to apply its basics to the CartPole game.
OpenAI Gym is a Python-based toolkit for the research and development of reinforcement learning algorithms. At the time of writing, OpenAI Gym provides more than 700 open source, contributed environments. With OpenAI Gym, you can also create your own environment. The biggest advantage is that OpenAI Gym provides a unified interface for working with these environments, and takes care of running the simulation while you focus on the reinforcement learning algorithms.
Note: The research paper describing OpenAI Gym is available here: http://arxiv.org/abs/1606.01540
You can install OpenAI Gym using the following command:
pip3 install gym
Note: If the above command does not work, then you can find further help with installation at the following link: https://github.com/openai/gym#installation
Let us print the number of environments available in Gym:

import gym

all_env = list(gym.envs.registry.all())
print('Total Environments in Gym version {} : {}'
      .format(gym.__version__, len(all_env)))

The output is as follows:

Total Environments in Gym version 0.9.4 : 777

To print the list of all the environments, use the following code:

for e in list(all_env):
    print(e)
The partial list from the output is as follows:
EnvSpec(Carnival-ramNoFrameskip-v0)
EnvSpec(EnduroDeterministic-v0)
EnvSpec(FrostbiteNoFrameskip-v4)
EnvSpec(Taxi-v2)
EnvSpec(Pooyan-ram-v0)
EnvSpec(Solaris-ram-v4)
EnvSpec(Breakout-ramDeterministic-v0)
EnvSpec(Kangaroo-ram-v4)
EnvSpec(StarGunner-ram-v4)
EnvSpec(Enduro-ramNoFrameskip-v4)
EnvSpec(DemonAttack-ramDeterministic-v0)
EnvSpec(TimePilot-ramNoFrameskip-v0)
EnvSpec(Amidar-v4)
Each environment, represented by an env object, has a standardized interface. For example, an env object can be created with the gym.make(<game-id-string>) function by passing the id string.
Each env object contains the following main functions:
The step() function takes an action object as an argument and returns four objects:
observation: An object implemented by the environment, representing the observation of the environment.
reward: A signed float value indicating the gain (or loss) from the previous action.
done: A Boolean value representing if the scenario is finished.
info: A Python dictionary object representing the diagnostic information.
The render() function creates a visual representation of the environment.
The reset() function resets the environment to the original state.
Each env object comes with well-defined actions and observations, represented by action_space and observation_space.
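As a quick illustration, here is a minimal sketch of this interface. It uses the CartPole-v0 environment that we work with later in this article and runs a single random step:

import gym

env = gym.make('CartPole-v0')       # create the env object from its id string
observation = env.reset()           # reset the environment to the original state
action = env.action_space.sample()  # sample a valid random action
observation, reward, done, info = env.step(action)  # apply the action
print(observation, reward, done, info)
env.close()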
One of the most popular games in the gym to learn reinforcement learning is CartPole. In this game, a pole attached to a cart has to be balanced so that it doesn't fall. The game ends if either the pole tilts by more than 15 degrees or the cart moves by more than 2.4 units from the center. The home page of OpenAI.com emphasizes the game in these words:
The small size and simplicity of this environment make it possible to run very quick experiments, which is essential when learning the basics.
The game has only four observations and two actions. The actions are to move a cart by applying a force of +1 or -1. The observations are the position of the cart, the velocity of the cart, the angle of the pole, and the rotation rate of the pole. However, knowledge of the semantics of observation is not necessary to learn to maximize the rewards of the game.
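If you want to verify these spaces yourself, they can be inspected directly. The following is a small sketch; the values in the comments are what CartPole-v0 reports with the classic Gym API:

import gym

env = gym.make('CartPole-v0')
print(env.action_space)             # Discrete(2): push the cart left (0) or right (1)
print(env.observation_space)        # Box(4,): cart position, cart velocity, pole angle, pole rotation rate
print(env.observation_space.low)    # lower bounds of the four observations
print(env.observation_space.high)   # upper bounds of the four observations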
Now let us load a popular game environment, CartPole-v0, and play it with stochastic control:
env = gym.make('CartPole-v0')
n_episodes = 1
env_vis = []
At the beginning of every episode, reset the environment using env.reset().
At the beginning of every timestep, capture the visualization using env.render().
for i_episode in range(n_episodes):
    observation = env.reset()
    for t in range(100):
        env_vis.append(env.render(mode='rgb_array'))
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished at t{}".format(t+1))
            break
env_render(env_vis)
The env_render() function renders the captured frames as an animation. Define it, along with its imports, before running the loop above (the imports shown here assume matplotlib and the JSAnimation package for inline display in a Jupyter Notebook):

import matplotlib.pyplot as plt
import matplotlib.animation as anm
from IPython.display import display
from JSAnimation.IPython_display import display_animation

def env_render(env_vis):
    plt.figure()
    plot = plt.imshow(env_vis[0])
    plt.axis('off')

    def animate(i):
        # update the image with the i-th captured frame
        plot.set_data(env_vis[i])

    anim = anm.FuncAnimation(plt.gcf(), animate, frames=len(env_vis),
                             interval=20, repeat=True, repeat_delay=20)
    display(display_animation(anim, default_mode='loop'))
We get the following output when we run this example:
[-0.00666995 -0.03699492 -0.00972623  0.00287713]
[-0.00740985  0.15826516 -0.00966868 -0.29285861]
[-0.00424454 -0.03671761 -0.01552586 -0.00324067]
[-0.0049789  -0.2316135  -0.01559067  0.28450351]
[-0.00961117 -0.42650966 -0.0099006   0.57222875]
[-0.01814136 -0.23125029  0.00154398  0.27644332]
[-0.02276636 -0.0361504   0.00707284 -0.01575223]
[-0.02348937  0.1588694   0.0067578  -0.30619523]
[-0.02031198 -0.03634819  0.00063389 -0.01138875]
[-0.02103895  0.15876466  0.00040612 -0.3038716 ]
[-0.01786366  0.35388083 -0.00567131 -0.59642642]
[-0.01078604  0.54908168 -0.01759984 -0.89089036]
[ 1.95594914e-04  7.44437934e-01 -3.54176495e-02 -1.18905344e+00]
[ 0.01508435  0.54979251 -0.05919872 -0.90767902]
[ 0.0260802   0.35551978 -0.0773523  -0.63417465]
[ 0.0331906   0.55163065 -0.09003579 -0.95018025]
[ 0.04422321  0.74784161 -0.1090394  -1.26973934]
[ 0.05918004  0.55426764 -0.13443418 -1.01309691]
[ 0.0702654   0.36117014 -0.15469612 -0.76546874]
[ 0.0774888   0.16847818 -0.1700055  -0.52518186]
[ 0.08085836  0.3655333  -0.18050913 -0.86624457]
[ 0.08816903  0.56259197 -0.19783403 -1.20981195]
Episode finished at t22
It took 22 time-steps for the pole to become unbalanced. At every run, we get a different time-step value because we picked the action stochastically by using env.action_space.sample().
Since the game results in a loss so quickly, randomly picking an action and applying it is probably not the best strategy. There are many algorithms you can use to keep the pole balanced for a larger number of time-steps, such as Hill Climbing, Random Search, and Policy Gradient; a minimal hill-climbing sketch follows the links below.
Note: Some of the algorithms for solving the Cartpole game are available at the following links:
https://openai.com/requests-for-research/#cartpole
http://kvfrans.com/simple-algoritms-for-solving-cartpole/
https://github.com/kvfrans/openai-cartpole
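To give a flavour of how simple some of these approaches can be, here is a minimal hill-climbing sketch. It is an illustration only, not code from the book, and it anticipates the parameterized linear policy (a weight vector theta multiplied by the observation) that we build later in this article: start from random parameters and keep a small random perturbation only when it improves the episode reward.

import gym
import numpy as np

def run_episode(env, theta, max_steps=200):
    # Run one episode with a linear policy: push right (1) if theta . obs >= 0, else left (0)
    obs = env.reset()
    total_reward = 0
    for _ in range(max_steps):
        action = 0 if np.matmul(theta, obs) < 0 else 1
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

def hill_climbing(n_iterations=100, noise_scale=0.1):
    env = gym.make('CartPole-v0')
    theta_best = np.random.rand(4) * 2 - 1
    reward_best = run_episode(env, theta_best)
    for _ in range(n_iterations):
        # Perturb the best parameters slightly and keep the perturbation
        # only if it improves the episode reward
        theta_new = theta_best + noise_scale * (np.random.rand(4) * 2 - 1)
        reward_new = run_episode(env, theta_new)
        if reward_new > reward_best:
            theta_best, reward_best = theta_new, reward_new
    return theta_best, reward_best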
So far, we have randomly picked an action and applied it. Now let us apply some logic to picking the action instead of random chance. The third observation refers to the angle. If the angle is greater than zero, that means the pole is tilting right, thus we move the cart to the right (1). Otherwise, we move the cart to the left (0). Let us look at an example:
import numpy as np

def policy_logic(env, obs):
    return 1 if obs[2] > 0 else 0

def policy_random(env, obs):
    return env.action_space.sample()

def experiment(policy, n_episodes, rewards_max):
    rewards = np.empty(shape=(n_episodes))
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        obs = env.reset()
        done = False
        episode_reward = 0
        while not done:
            action = policy(env, obs)
            obs, reward, done, info = env.step(action)
            episode_reward += reward
            if episode_reward > rewards_max:
                break
        rewards[i] = episode_reward
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__, np.min(rewards), np.max(rewards),
                  np.mean(rewards)))

n_episodes = 100
rewards_max = 10000
experiment(policy_random, n_episodes, rewards_max)
experiment(policy_logic, n_episodes, rewards_max)
We can see that the logically selected actions do better than the randomly selected ones, but not that much better:
Policy:policy_random, Min reward:9.0, Max reward:63.0, Average reward:20.26
Policy:policy_logic, Min reward:24.0, Max reward:66.0, Average reward:42.81
Now let us modify the process of selecting the action further, to be based on parameters. The parameters will be multiplied by the observations, and the action will be chosen based on whether the result of this multiplication is negative or non-negative: move left (0) if it is negative, otherwise move right (1). This is the random search method, in which we initialize the parameters randomly for every episode. The code looks as follows:
def policy_logic(theta, obs):
    # just ignore theta
    return 1 if obs[2] > 0 else 0

def policy_random(theta, obs):
    return 0 if np.matmul(theta, obs) < 0 else 1

def episode(env, policy, rewards_max):
    obs = env.reset()
    done = False
    episode_reward = 0
    if policy.__name__ in ['policy_random']:
        theta = np.random.rand(4) * 2 - 1
    else:
        theta = None
    while not done:
        action = policy(theta, obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
        if episode_reward > rewards_max:
            break
    return episode_reward

def experiment(policy, n_episodes, rewards_max):
    rewards = np.empty(shape=(n_episodes))
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        rewards[i] = episode(env, policy, rewards_max)
        #print("Episode finished at t{}".format(reward))
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__, np.min(rewards), np.max(rewards),
                  np.mean(rewards)))

n_episodes = 100
rewards_max = 10000
experiment(policy_random, n_episodes, rewards_max)
experiment(policy_logic, n_episodes, rewards_max)
We can see that random search does improve the results:
Policy:policy_random, Min reward:8.0, Max reward:200.0, Average reward:40.04
Policy:policy_logic, Min reward:25.0, Max reward:62.0, Average reward:43.03
With random search, we have improved our results to reach the maximum reward of 200. On average, the rewards for random search are lower because random search tries various bad parameters that bring the overall results down. However, we can select the best parameters from all the runs and then use those best parameters in production. Let us modify the code to train the parameters first:
def policy_logic(theta, obs):
    # just ignore theta
    return 1 if obs[2] > 0 else 0

def policy_random(theta, obs):
    return 0 if np.matmul(theta, obs) < 0 else 1

def episode(env, policy, rewards_max, theta):
    obs = env.reset()
    done = False
    episode_reward = 0
    while not done:
        action = policy(theta, obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
        if episode_reward > rewards_max:
            break
    return episode_reward

def train(policy, n_episodes, rewards_max):
    env = gym.make('CartPole-v0')
    theta_best = np.empty(shape=[4])
    reward_best = 0
    for i in range(n_episodes):
        if policy.__name__ in ['policy_random']:
            theta = np.random.rand(4) * 2 - 1
        else:
            theta = None
        reward_episode = episode(env, policy, rewards_max, theta)
        if reward_episode > reward_best:
            reward_best = reward_episode
            theta_best = theta.copy()
    return reward_best, theta_best

def experiment(policy, n_episodes, rewards_max, theta=None):
    rewards = np.empty(shape=[n_episodes])
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        rewards[i] = episode(env, policy, rewards_max, theta)
        #print("Episode finished at t{}".format(reward))
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__, np.min(rewards), np.max(rewards),
                  np.mean(rewards)))
We train for 100 episodes and then use the best parameters to run the experiment for the random search policy:
n_episodes = 100
rewards_max = 10000
reward, theta = train(policy_random, n_episodes, rewards_max)
print('trained theta: {}, rewards: {}'.format(theta, reward))
experiment(policy_random, n_episodes, rewards_max, theta)
experiment(policy_logic, n_episodes, rewards_max)
We find that the trained parameters give us the best result of 200:
trained theta: [-0.14779543 0.93269603 0.70896423 0.84632461], rewards: 200.0
Policy:policy_random, Min reward:200.0, Max reward:200.0, Average reward:200.0
Policy:policy_logic, Min reward:24.0, Max reward:63.0, Average reward:41.94
We may optimize the training code to continue training until we reach a maximum reward.
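One way to do this is sketched below. The function name train_until_solved, the target_reward argument, and the max_iterations safeguard are illustrative assumptions rather than code from the book; it reuses the episode() function and the policy functions defined above, and keeps sampling random parameters until an episode reaches the target reward:

def train_until_solved(policy, rewards_max, target_reward=200.0, max_iterations=10000):
    # Keep sampling random parameters until one episode reaches target_reward
    # (200.0 is the per-episode cap of CartPole-v0) or max_iterations is exhausted.
    env = gym.make('CartPole-v0')
    theta_best = np.empty(shape=[4])
    reward_best = 0
    for i in range(max_iterations):
        theta = np.random.rand(4) * 2 - 1
        reward_episode = episode(env, policy, rewards_max, theta)
        if reward_episode > reward_best:
            reward_best = reward_episode
            theta_best = theta.copy()
        if reward_best >= target_reward:
            break
    return reward_best, theta_best

reward, theta = train_until_solved(policy_random, rewards_max)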
To summarize, we learned the basics of OpenAI Gym and applied them to the CartPole game.
If you found this post useful, do check out this book Mastering TensorFlow 1.x to build, scale, and deploy deep neural network models using star libraries in Python.