Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
PyTorch 1.x Reinforcement Learning Cookbook
PyTorch 1.x Reinforcement Learning Cookbook

PyTorch 1.x Reinforcement Learning Cookbook: Over 60 recipes to design, develop, and deploy self-learning AI models using Python

eBook
$20.98 $29.99
Paperback
$43.99
Subscription
Free Trial
Renews at $19.99p/m

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
Table of content icon View table of contents Preview book icon Preview Book

PyTorch 1.x Reinforcement Learning Cookbook

Getting Started with Reinforcement Learning and PyTorch

We kick off our journey of practical reinforcement learning and PyTorch with the basic, yet important, reinforcement learning algorithms, including random search, hill climbing, and policy gradient. We will start by setting up the working environment and OpenAI Gym, and you will become familiar with reinforcement learning environments through the Atari and CartPole playgrounds. We will also demonstrate how to develop algorithms to solve the CartPole problem step by step. Also, we will review the essentials of PyTorch and prepare for the upcoming learning examples and projects.

This chapter contains the following recipes:

  • Setting up the working environment
  • Installing OpenAI Gym
  • Simulating Atari environments
  • Simulating the CartPole environment
  • Reviewing the fundamentals of PyTorch
  • Implementing and evaluating a random search policy
  • Developing the hill-climbing algorithm
  • Developing a policy gradient algorithm

Setting up the working environment

Let's get started with setting up the working environment, including the correct versions of Python and Anaconda, and PyTorch as the main framework that is used throughout the book.

Python is the language we use to implement all reinforcement learning algorithms and techniques throughout the book. In this book, we will be using Python 3, or more specifically, 3.6 or above. If you are a Python 2 user, now is the best time for you to switch to Python 3, as Python 2 will no longer be supported after 2020. The transition is very smooth, though, so don't panic.

Anaconda is an open source Python distribution (www.anaconda.com/distribution/) for data science and machine learning. We will be using Anaconda's package manager, conda, to install Python packages, along with pip.

PyTorch (https://pytorch.org/), primarily developed by the Facebook AI Research (FAIR) Group, is a trendy machine learning library based on Torch (http://torch.ch/). Tensors in PyTorch replace NumPy's ndarrays, which provides more flexibility and compatibility with GPUs. Because of the powerful computational graphs and the simple and friendly interface, the PyTorch community is expanding on a daily basis, and it has seen heavy adoption by more and more tech giants.

Let's see how to properly set up all of these components.

How to do it...

We will begin by installing Anaconda. You can skip this if you already have Anaconda for Python 3.6 or 3.7 running on your system. Otherwise, you can follow the instructions at https://docs.anaconda.com/anaconda/install/ for your operating system, as follows:

Feel free to play around with PyTorch once the setup is done. To verify that you have the right setup of Anaconda and Python, you can enter the following line in your Terminal in Linux/Mac or Command Prompt in Windows (from now on, we will just call it Terminal):

python

It will display your Python Anaconda environment. You should see something similar to the following screenshot:

If Anaconda and Python 3.x are not mentioned, please check the system path or the path Python is running from.

The next thing to do is to install PyTorch. First, go to https://pytorch.org/get-started/locally/ and pick the description of your environment from the following table:

Here, we use Mac, Conda, Python 3.7, and running locally (no CUDA) as an example, and enter the resulting command line in the Terminal:

conda install pytorch torchvision -c pytorch

To confirm PyTorch is installed correctly, run the following lines of code in Python:

>>> import torch
>>> x = torch.empty(3, 4)
>>> print(x)
tensor([[ 0.0000e+00, 2.0000e+00, -1.2750e+16, -2.0005e+00],
[ 9.8742e-37, 1.4013e-45, 9.9222e-37, 1.4013e-45],
[ 9.9220e-37, 1.4013e-45, 9.9225e-37, 2.7551e-40]])

If a 3 x 4 matrix is displayed, that means PyTorch is installed correctly.

Now we have successfully set up the working environment.

How it works...

We have just created a tensor of size 3 x 4 in PyTorch. It is an empty matrix. By saying empty, this doesn't mean all elements are of the value Null. Instead, they are a bunch of meaningless floats that are considered placeholders. Users are required to set all the values later. This is very similar to NumPy's empty array.

There's more...

Some of you may question the necessity of installing Anaconda and using conda to manage packages since it is easy to install packages with pip. In fact, conda is a better packaging tool than pip. We mainly use conda for the following four reasons:

  • It handles library dependencies nicely: Installing a package with conda will automatically download all of its dependencies. However, doing so with pip will lead to a warning, and installation will be aborted.
  • It solves conflicts of packages gracefully: If installing a package requires another package of a specific version (let's say 2.3 or after, for example), conda will update the version of the other package automatically.
  • It creates a virtual environment easily: A virtual environment is a self-contained package directory tree. Different applications or projects can use different virtual environments. All virtual environments are isolated from each other. It is recommended to use virtual environments so that whatever we do for one application doesn't affect our system environment or any other environment.
  • It is also compatible with pip: We can still use pip in conda with the following command:
conda install pip

See also

If you are interested in learning more about conda, feel free to check out the following resources:

If you want to get more familiar with PyTorch, you can go through the Getting Started section in the official tutorial at https://pytorch.org/tutorials/#getting-started. We recommend you at least finish the following:

Installing OpenAI Gym

After setting up the working environment, we can now install OpenAI Gym. You can't work on reinforcement learning without using OpenAI Gym, which gives you a variety of environments in which to develop your learning algorithms.

OpenAI (https://openai.com/) is a non-profit research company that is focused on building safe artificial general intelligence (AGI) and ensuring that it benefits humans. OpenAI Gym is a powerful and open source toolkit for developing and comparing reinforcement learning algorithms. It provides an interface to varieties of reinforcement learning simulations and tasks, from walking to moon landing, from car racing to playing Atari games. See https://gym.openai.com/envs/ for the full list of environments.We can write agents to interact with OpenAI Gym environments using any numerical computation library, such as PyTorch, TensorFlow, or Keras.

How to do it...

There are two ways to install Gym. The first one is to use pip, as follows:

pip install gym

For conda users, remember to install pip first in conda using the following command before installing Gym using pip:

conda install pip

This is because Gym is not officially available in conda as of early 2019.

Another approach is to build from source:

  1. First, clone the package directly from its Git repository:
git clone https://github.com/openai/gym
  1. Go to the downloaded folder and install Gym from there:
cd gym
pip install -e .

And now you are good to go. Feel free to play around with gym.

  1. You can also check the available gym environment by typing the following lines of code:
>>> from gym import envs
>>> print(envs.registry.all())
dict_values([EnvSpec(Copy-v0), EnvSpec(RepeatCopy-v0), EnvSpec(ReversedAddition-v0), EnvSpec(ReversedAddition3-v0), EnvSpec(DuplicatedInput-v0), EnvSpec(Reverse-v0), EnvSpec(CartPole-v0), EnvSpec(CartPole-v1), EnvSpec(MountainCar-v0), EnvSpec(MountainCarContinuous-v0), EnvSpec(Pendulum-v0), EnvSpec(Acrobot-v1), EnvSpec(LunarLander-v2), EnvSpec(LunarLanderContinuous-v2), EnvSpec(BipedalWalker-v2), EnvSpec(BipedalWalkerHardcore-v2), EnvSpec(CarRacing-v0), EnvSpec(Blackjack-v0)
...
...

This will give you a long list of environments if you installed Gym properly. We will play around with some of them in the next recipe, Simulating Atari environments.

How it works...

Compared to the simple pip approach for installing Gym, the second approach provides more flexibility if you want to add new environments and modify Gym itself.

There's more...

You may wonder why we need to test reinforcement learning algorithms on Gym's environments since the actual environments we work in can be a lot different. You will recall that reinforcement learning doesn't make many assumptions about the environment, but it gets to know more about the environment by interacting with it. Also, when comparing the performance of different algorithms, we need to apply them to standardized environments. Gym is a perfect benchmark, covering many versatile and easy-to-use environments. This is similar to the datasets that we often use as benchmarks in supervised and unsupervised learning, such as MNIST, Imagenet, MovieLens, and Thomson Reuters News.

See also

Simulating Atari environments

To get started with Gym, let's play some Atari games with it.

The Atari environments (https://gym.openai.com/envs/#atari) are a variety of Atari 2600 video games, such as Alien, AirRaid, Pong, and Space Race. If you have ever played Atari games, this recipe should be fun for you, as you will play an Atari game, Space Invaders. However, an agent will act on your behalf.

How to do it...

Let's simulate the Atari environments by following these steps:

  1. To run any atari environment for the first time, we need to install the atari dependencies by running this command in the Terminal:
pip install gym[atari]

Alternatively, if you used the second approach in the previous recipe to install gym, you can run the following command instead:

pip install -e '.[atari]'
  1. After installing the Atari dependencies, we import the gym library in Python:
>>> import gym
  1. Create an instance of the SpaceInvaders environment:
>>> env = gym.make('SpaceInvaders-v0')
  1. Reset the environment:
>>> env.reset()
array([[[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, 0],
...,
...,
[80, 89, 22],
[80, 89, 22],
[80, 89, 22]]], dtype=uint8)

As you can see, this also returns the initial state of the environment.

  1. Render the environment:
>>> env.render()
True

You will see a small window popping up, as follows:

As you can see from the game window, the spaceship starts with three lives (the red spaceships).

  1. Randomly pick one possible move and execute the action:
>>> action = env.action_space.sample()
>>> new_state, reward, is_done, info = env.step(action)

The step() method returns what happens after an action is taken, including the following:

  • New state: The new observation.
  • Reward: The reward associated with that action in that state.
  • Is done: A flag indicating whether the game ends. In a SpaceInvaders environment, this will be True if the spaceship has no more lives left or all the aliens are gone; otherwise, it will remain False.
  • Info: Extra information pertaining to the environment. This is about the number of lives left in this case. This is useful for debugging.

Let’s take a look at the is_done flag and info:

>>> print(is_done)
False
>>> print(info)
{'ale.lives': 3}

Now we render the environment:

>>> env.render()
True

The game window becomes the following:

You won't notice much difference in the game window, because the spaceship just made a move.

  1. Now, let's make a while loop and let the agent perform as many actions as it can:
>>> is_done = False
>>> while not is_done:
... action = env.action_space.sample()
... new_state, reward, is_done, info = env.step(action)
... print(info)
... env.render()
{'ale.lives': 3}
True
{'ale.lives': 3}
True
……
……
{'ale.lives': 2}
True
{'ale.lives': 2}
True
……
……
{'ale.lives': 1}
True
{'ale.lives': 1}
True

Meanwhile, you will see that the game is running, and the spaceship keeps moving and shooting, and so do the aliens. And it is pretty fun to watch, too. At the end, when the game ends, the window looks like the following:

As you can see, we scored 150 points in this game. You may get a higher or lower score than this because the actions the agent performs are all randomly selected.

We also confirm that no lives are left with the last piece of info:

>>> print(info)
{'ale.lives': 0}

How it works...

Using Gym, we can easily create an environment instance by calling the make() method with the name of the environment as the parameter.

As you may have noticed, the actions that the agent performs are randomly chosen using the sample() method.

Note that, normally, we would have a more sophisticated agent guided by reinforcement learning algorithms. Here, we just demonstrated how to simulate an environment, and how an agent takes actions regardless of the outcome.

Run this a few times and see what we get:

>>> env.action_space.sample()
0
>>> env.action_space.sample()
3
>>> env.action_space.sample()
0
>>> env.action_space.sample()
4
>>> env.action_space.sample()
2
>>> env.action_space.sample()
1
>>> env.action_space.sample()
4
>>> env.action_space.sample()
5
>>> env.action_space.sample()
1
>>> env.action_space.sample()
0

There are six possible actions in total. We can also see this by running the following command:

>>> env.action_space
Discrete(6)

Actions from 0 to 5 stand for No Operation, Fire, Up, Right, Left, and Down, respectively, which are all the moves the spaceship in the game can do.

The step() method will let the agent take the action that is specified as its parameter. The render() method will update the display window based on the latest observation of the environment.

The observation of the environment, new_state, is represented by a 210 x 160 x 3 matrix, as follows:

>>> print(new_state.shape)
(210, 160, 3)

This means that each frame of the display screen is an RGB image of size 210 x 160.

There's more...

You may wonder why we need to install Atari dependencies. In fact, there are a few more environments that do not accompany the installation of gym, such as Box2d, Classic control, MuJoCo, and Robotics.

Take the Box2d environments, for example; we need to install the Box2d dependencies before we first run the environments. Again, two installation approaches are as follows:

pip install gym[box2d]
pip install -e '.[box2d]'

After that, we can play around with the LunarLander environment, as follows:

>>> env = gym.make('LunarLander-v2')
>>> env.reset()
array([-5.0468446e-04, 1.4135642e+00, -5.1140346e-02, 1.1751971e-01,
5.9164839e-04, 1.1584054e-02, 0.0000000e+00, 0.0000000e+00],
dtype=float32)
>>> env.render()

A game window will pop up:

See also

If you are looking to simulate an environment but are not sure of the name you should use in the make() method, you can find it in the table of environments at https://github.com/openai/gym/wiki/Table-of-environments. Besides the name used to call an environment, the table also shows the size of the observation matrix and the number of possible actions. Have fun playing around with the environments.

Simulating the CartPole environment

In this recipe, we will work on simulating one more environment in order to get more familiar with Gym. The CartPole environment is a classic one in reinforcement learning research.

CartPole is a traditional reinforcement learning task in which a pole is placed upright on top of a cart. The agent moves the cart either to the left or to the right by 1 unit in a timestep. The goal is to balance the pole and prevent it from falling over. The pole is considered to have fallen if it is more than 12 degrees from the vertical, or the cart moves 2.4 units away from the origin. An episode terminates when any of the following occurs:

  • The pole falls over
  • The number of timesteps reaches 200

How to do it...

Let's simulate the CartPole environment by following these steps:

  1. To run the CartPole environment, let's first search for its name in the table of environments at https://github.com/openai/gym/wiki/Table-of-environments. We get 'CartPole-v0' and also learn that the observation space is represented in a 4-dimensional array, and that there are two possible actions (which makes sense).
  2. We import the Gym library and create an instance of the CartPole environment:
 >>> import gym
>>> env = gym.make('CartPole-v0')
  1. Reset the environment:
 >>> env.reset()
array([-0.00153354, 0.01961605, -0.03912845, -0.01850426])

As you can see, this also returns the initial state represented by an array of four floats.

  1. Render the environment:
 >>> env.render()
True

You will see a small window popping up, as follows:

  1. Now, let's make a while loop and let the agent perform as many random actions as it can:
 >>> is_done = False
>>> while not is_done:
... action = env.action_space.sample()
... new_state, reward, is_done, info = env.step(action)
... print(new_state)
... env.render()
...
[-0.00114122 -0.17492355 -0.03949854 0.26158095]
True
[-0.00463969 -0.36946006 -0.03426692 0.54154857]
True
……
……
[-0.11973207 -0.41075106 0.19355244 1.11780626]
True
[-0.12794709 -0.21862176 0.21590856 0.89154351]
True

Meanwhile, you will see that the cart and pole are moving. At the end, you will see they both stop. The window looks like the following:

The episode only lasts several steps because the left or right actions are chosen randomly. Can we record the whole process so we can replay it afterward? We can do so with just two lines of code in Gym, as shown in Step 7. If you are using a Mac or Linux system, you need to complete Step 6 first; otherwise, you can jump to Step 7.

  1. To record video, we need to install the ffmpeg package. For Mac, it can be installed via the following command:
brew install ffmpeg

For Linux, the following command should do it:

sudo apt-get install ffmpeg
  1. After creating the CartPole instance, add these two lines:
>>> video_dir = './cartpole_video/'
>>> env = gym.wrappers.Monitor(env, video_dir)

This will record what is displayed in the window and store it in the specified directory.

Now re-run the codes from Step 3 to Step 5. After an episode terminates, we can see that an .mp4 file is created in the video_dir folder. The video is quite short; it may last 1 second or so.

How it works...

In this recipe, we print out the state array for every step. But what does each float in the array mean? We can find more information about CartPole on Gym's GitHub wiki page: https://github.com/openai/gym/wiki/CartPole-v0. It turns out that those four floats represent the following:

  • Cart position: This ranges from -2.4 to 2.4, and any position beyond this range will trigger episode termination.
  • Cart velocity.
  • Pole angle: Any value less than -0.209 (-12 degrees) or greater than 0.209 (12 degrees) will trigger episode termination.
  • Pole velocity at the tip.

In terms of the action, it is either 0 or 1, which corresponds to pushing the cart to the left and to the right, respectively.

The reward in this environment is +1 for every timestep before the episode terminates. We can also verify this by printing out the reward for every step. And the total reward is simply the number of timesteps.

There's more...

So far, we've run only one episode. In order to assess how well the agent performs, we can simulate many episodes and then average the total rewards for an individual episode. The average total reward will tell us about the performance of the agent that takes random actions.

Let’s set 10,000 episodes:

 >>> n_episode = 10000

In each episode, we compute the total reward by accumulating the reward in every step:

 >>> total_rewards = []
>>> for episode in range(n_episode):
... state = env.reset()
... total_reward = 0
... is_done = False
... while not is_done:
... action = env.action_space.sample()
... state, reward, is_done, _ = env.step(action)
... total_reward += reward
... total_rewards.append(total_reward)

Finally, we calculate the average total reward:

 >>> print('Average total reward over {} episodes: {}'.format(
n_episode, sum(total_rewards) / n_episode))
Average total reward over 10000 episodes: 22.2473

On average, taking a random action scores 22.25.

We all know that taking random actions is not sophisticated enough, and we will implement an advanced policy in upcoming recipes. But for the next recipe, let's take a break and review the basics of PyTorch.

Reviewing the fundamentals of PyTorch

As we’ve already mentioned, PyTorch is the numerical computation library we use to implement reinforcement learning algorithms in this book.

PyTorch is a trendy scientific computing and machine learning (including deep learning) library developed by Facebook. Tensor is the core data structure in PyTorch, which is similar to NumPy's ndarrays. PyTorch and NumPy are comparable in scientific computing. However, PyTorch is faster than NumPy in array operations and array traversing. This is mainly due to the fact that array element access is faster in PyTorch. Hence, more and more people believe PyTorch will replace NumPy.

How to do it...

Let's do a quick review of the basic programming in PyTorch to get more familiar with it:

  1. We created an uninitialized matrix in an earlier recipe. How about a randomly initialized one? See the following commands:
 >>> import torch
>>> x = torch.rand(3, 4)
>>> print(x)
tensor([[0.8052, 0.3370, 0.7676, 0.2442],
[0.7073, 0.4468, 0.1277, 0.6842],
[0.6688, 0.2107, 0.0527, 0.4391]])

Random floats from a uniform distribution in the interval (0, 1) are generated.

  1. We can specify the desired data type of the returned tensor. For example, a tensor of the double type (float64) is returned as follows:
 >>> x = torch.rand(3, 4, dtype=torch.double)
>>> print(x)
tensor([[0.6848, 0.3155, 0.8413, 0.5387],
[0.9517, 0.1657, 0.6056, 0.5794],
[0.0351, 0.3801, 0.7837, 0.4883]], dtype=torch.float64)

By default, float is the returned data type.

  1. Next, let's create a matrix full of zeros and a matrix full of ones:
    >>> x = torch.zeros(3, 4)
>>> print(x)
tensor([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]])
>>> x = torch.ones(3, 4)
>>> print(x)
tensor([[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.]])
  1. To get the size of a tensor, use this code:
 >>> print(x.size())
torch.Size([3, 4])

torch.Size is actually a tuple.

  1. To reshape a tensor, we can use the view() method:
 >>> x_reshaped = x.view(2, 6)
>>> print(x_reshaped)
tensor([[1., 1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1., 1.]])
  1. We can create a tensor directly from data, including a single value, a list, and a nested list:
 >>> x1 = torch.tensor(3)
>>> print(x1)
tensor(3)
>>> x2 = torch.tensor([14.2, 3, 4])
>>> print(x2)
tensor([14.2000, 3.0000, 4.0000])
>>> x3 = torch.tensor([[3, 4, 6], [2, 1.0, 5]])
>>> print(x3)
tensor([[3., 4., 6.],
[2., 1., 5.]])
  1. To access the elements in a tensor of more than one element, we can use indexing in a similar way to NumPy:
 >>> print(x2[1])
tensor(3.)
>>> print(x3[1, 0])
tensor(2.)
>>> print(x3[:, 1])
tensor([4., 1.])
>>> print(x3[:, 1:])
tensor([[4., 6.],
[1., 5.]])

As with a one-element tensor, we do so by using the item() method:

 >>> print(x1.item())
3
  1. Tensor and NumPy arrays are mutually convertible. Convert a tensor to a NumPy array using the numpy() method:
 >>> x3.numpy()
array([[3., 4., 6.],
[2., 1., 5.]], dtype=float32)

Convert a NumPy array to a tensor with from_numpy():

>>> import numpy as np
>>> x_np = np.ones(3)
>>> x_torch = torch.from_numpy(x_np)
>>> print(x_torch)
tensor([1., 1., 1.], dtype=torch.float64)
Note that if the input NumPy array is of the float data type, the output tensor will be of the double type. Typecasting may occasionally be needed.

Take a look at the following example, where a tensor of the double type is converted to a float:

 >>> print(x_torch.float())
tensor([1., 1., 1.])
  1. Operations in PyTorch are similar to NumPy as well. Take addition as an example; we can simply do the following:
>>> x4 = torch.tensor([[1, 0, 0], [0, 1.0, 0]])
>>> print(x3 + x4)
tensor([[4., 4., 6.],
[2., 2., 5.]])

Or we can use the add() method as follows:

 >>> print(torch.add(x3, x4))
tensor([[4., 4., 6.],
[2., 2., 5.]])
  1. PyTorch supports in-place operations, which mutate the tensor object. For example, let's run this command:
 >>> x3.add_(x4)
tensor([[4., 4., 6.],
[2., 2., 5.]])

You will see that x3 is changed to the result of the original x3plus x4:

 >>> print(x3)
tensor([[4., 4., 6.],
[2., 2., 5.]])

There's more...

Any method with _ indicates that it is an in-place operation, which updates the tensor with the resulting value.

See also

For the full list of tensor operations in PyTorch, please go to the official docs at https://pytorch.org/docs/stable/torch.html. This is the best place to search for information if you get stuck on a PyTorch programming problem.

Implementing and evaluating a random search policy

After some practice with PyTorch programming, starting from this recipe, we will be working on more sophisticated policies to solve the CartPole problem than purely random actions. We start with the random search policy in this recipe.

A simple, yet effective, approach is to map an observation to a vector of two numbers representing two actions. The action with the higher value will be picked. The linear mapping is depicted by a weight matrix whose size is 4 x 2 since the observations are 4-dimensional in this case. In each episode, the weight is randomly generated and is used to compute the action for every step in this episode. The total reward is then calculated. This process repeats for many episodes and, in the end, the weight that enables the highest total reward will become the learned policy. This approach is called random search because the weight is randomly picked in each trial with the hope that the best weight will be found with a large number of trials.

How to do it...

Let's go ahead and implement a random search algorithm with PyTorch:

  1. Import the Gym and PyTorch packages and create an environment instance:
>>> import gym
>>> import torch
>>> env = gym.make('CartPole-v0')
  1. Obtain the dimensions of the observation and action space:
>>> n_state = env.observation_space.shape[0]
>>> n_state
4
>>> n_action = env.action_space.n
>>> n_action
2

These will be used when we define the tensor for the weight matrix, which is size 4 x 2 in size.

  1. Define a function that simulates an episode given the input weight and returns the total reward:
 >>> def run_episode(env, weight):
... state = env.reset()
... total_reward = 0
... is_done = False
... while not is_done:
... state = torch.from_numpy(state).float()
... action = torch.argmax(torch.matmul(state, weight))
... state, reward, is_done, _ = env.step(action.item())
... total_reward += reward
... return total_reward

Here, we convert the state array to a tensor of the float type because we need to compute the multiplication of the state and weight tensor, torch.matmul(state, weight), for linear mapping. The action with the higher value is selected using the torch.argmax() operation. And don't forget to take the value of the resulting action tensor using .item() because it is a one-element tensor.

  1. Specify the number of episodes:
>>> n_episode = 1000
  1. We need to keep track of the best total reward on the fly, as well as the corresponding weight. So, we specify their starting values:
>>> best_total_reward = 0
>>> best_weight = None

We will also record the total reward for every episode:

>>> total_rewards = []
  1. Now, we can run n_episode. For each episode, we do the following:
  • Randomly pick the weight
  • Let the agent take actions according to the linear mapping
  • An episode terminates and returns the total reward
  • Update the best total reward and the best weight if necessary
  • Also, keep a record of the total reward

Put this into code as follows:

 >>> for episode in range(n_episode):
... weight = torch.rand(n_state, n_action)
... total_reward = run_episode(env, weight)
... print('Episode {}: {}'.format(episode+1, total_reward))
... if total_reward > best_total_reward:
... best_weight = weight
... best_total_reward = total_reward
... total_rewards.append(total_reward)
...
Episode 1: 10.0
Episode 2: 73.0
Episode 3: 86.0
Episode 4: 10.0
Episode 5: 11.0
……
……
Episode 996: 200.0
Episode 997: 11.0
Episode 998: 200.0
Episode 999: 200.0
Episode 1000: 9.0

We have obtained the best policy through 1,000 random searches. The best policy is parameterized by best_weight.

  1. Before we test out the best policy in the testing episodes, we can calculate the average total reward achieved by random linear mapping:
 >>> print('Average total reward over {} episode: {}'.format(
n_episode, sum(total_rewards) / n_episode))
Average total reward over 1000 episode: 47.197

This is more than twice what we got from the random action policy (22.25).

  1. Now, let's see how the learned policy performs on 100 new episodes:
 >>> n_episode_eval = 100
>>> total_rewards_eval = []
>>> for episode in range(n_episode_eval):
... total_reward = run_episode(env, best_weight)
... print('Episode {}: {}'.format(episode+1, total_reward))
... total_rewards_eval.append(total_reward)
...
Episode 1: 200.0
Episode 2: 200.0
Episode 3: 200.0
Episode 4: 200.0
Episode 5: 200.0
……
……
Episode 96: 200.0
Episode 97: 188.0
Episode 98: 200.0
Episode 99: 200.0
Episode 100: 200.0
>>> print('Average total reward over {} episode: {}'.format(
n_episode, sum(total_rewards_eval) / n_episode_eval))
Average total reward over 1000 episode: 196.72

Surprisingly, the average reward for the testing episodes is close to the maximum of 200 steps with the learned policy. Be aware that this value may vary a lot. It could be anywhere from 160 to 200.

How it works...

The random search algorithm works so well mainly because of the simplicity of our CartPole environment. Its observation state is composed of only four variables. You will recall that the observation in the Atari Space Invaders game is more than 100,000 (which is 210 * 160 * 3) . The number of dimensions of the action state in CartPole is a third of that in Space Invaders. In general, simple algorithms work well for simple problems. In our case, we simply search for the best linear mapping from the observation to the action from a random pool.

Another interesting thing we've noticed is that before we select and deploy the best policy (the best linear mapping), random search also outperforms random action. This is because random linear mapping does take the observations into consideration. With more information from the environment, the decisions made in the random search policy are more intelligent than completely random ones.

There's more...

We can also plot the total reward for every episode in the training phase:

>>> import matplotlib.pyplot as plt
>>> plt.plot(total_rewards)
>>> plt.xlabel('Episode')
>>> plt.ylabel('Reward')
>>> plt.show()

This will generate the following plot:

If you have not installed matplotlib, you can do so via the following command:

conda install matplotlib

We can see that the reward for each episode is pretty random, and that there is no trend of improvement as we go through the episodes. This is basically what we expected.

In the plot of reward versus episodes, we can see that there are some episodes in which the reward reaches 200. We can end the training phase whenever this occurs since there is no room to improve. Incorporating this change, we now have the following for the training phase:

 >>> n_episode = 1000
>>> best_total_reward = 0
>>> best_weight = None
>>> total_rewards = []
>>> for episode in range(n_episode):
... weight = torch.rand(n_state, n_action)
... total_reward = run_episode(env, weight)
... print('Episode {}: {}'.format(episode+1, total_reward))
... if total_reward > best_total_reward:
... best_weight = weight
... best_total_reward = total_reward
... total_rewards.append(total_reward)
... if best_total_reward == 200:
... break
Episode 1: 9.0
Episode 2: 8.0
Episode 3: 10.0
Episode 4: 10.0
Episode 5: 10.0
Episode 6: 9.0
Episode 7: 17.0
Episode 8: 10.0
Episode 9: 43.0
Episode 10: 10.0
Episode 11: 10.0
Episode 12: 106.0
Episode 13: 8.0
Episode 14: 32.0
Episode 15: 98.0
Episode 16: 10.0
Episode 17: 200.0

The policy achieving the maximal reward is found in episode 17. Again, this may vary a lot because the weights are generated randomly for each episode. To compute the expectation of training episodes needed, we can repeat the preceding training process 1,000 times and take the average of the training episodes:

 >>> n_training = 1000
>>> n_episode_training = []
>>> for _ in range(n_training):
... for episode in range(n_episode):
... weight = torch.rand(n_state, n_action)
... total_reward = run_episode(env, weight)
... if total_reward == 200:
... n_episode_training.append(episode+1)
... break
>>> print('Expectation of training episodes needed: ',
sum(n_episode_training) / n_training)
Expectation of training episodes needed: 13.442

On average, we expect that it takes around 13 episodes to find the best policy.

Developing the hill-climbing algorithm

As we can see in the random search policy, each episode is independent. In fact, all episodes in random search can be run in parallel, and the weight that achieves the best performance will eventually be selected. We've also verified this with the plot of reward versus episode, where there is no upward trend. In this recipe, we will develop a different algorithm, a hill-climbing algorithm, to transfer the knowledge acquired in one episode to the next episode.

In the hill-climbing algorithm, we also start with a randomly chosen weight. But here, for every episode, we add some noise to the weight. If the total reward improves, we update the weight with the new one; otherwise, we keep the old weight. In this approach, the weight is gradually improved as we progress through the episodes, instead of jumping around in each episode.

How to do it...

Let's go ahead and implement the hill-climbing algorithm with PyTorch:

  1. As before, import the necessary packages, create an environment instance, and obtain the dimensions of the observation and action space:
>>> import gym
>>> import torch
>>> env = gym.make('CartPole-v0')
>>> n_state = env.observation_space.shape[0]
>>> n_action = env.action_space.n
  1. We will reuse the run_episode function we defined in the previous recipe, so we will not repeat it here. Again, given the input weight, it simulates an episode and returns the total reward.
  2. Let's make it 1,000 episodes for now:

>>> n_episode = 1000
  1. We need to keep track of the best total reward on the fly, as well as the corresponding weight. So, let's specify their starting values:
>>> best_total_reward = 0
>>> best_weight = torch.rand(n_state, n_action)

We will also record the total reward for every episode:

>>> total_rewards = []
  1. As we mentioned, we will add some noise to the weight for each episode. In fact, we will apply a scale to the noise so that the noise won't overwhelm the weight. Here, we will choose 0.01 as the noise scale:
>>> noise_scale = 0.01
  1. Now, we can run the n_episode function. After we randomly pick an initial weight, for each episode, we do the following:
  • Add random noise to the weight
  • Let the agent take actions according to the linear mapping
  • An episode terminates and returns the total reward
  • If the current reward is greater than the best one obtained so far, update the best reward and the weight
  • Otherwise, the best reward and the weight remain unchanged
  • Also, keep a record of the total reward

Put this into code as follows:

 >>> for episode in range(n_episode):
... weight = best_weight +
noise_scale * torch.rand(n_state, n_action)
... total_reward = run_episode(env, weight)
... if total_reward >= best_total_reward:
... best_total_reward = total_reward
... best_weight = weight
... total_rewards.append(total_reward)
... print('Episode {}: {}'.format(episode + 1, total_reward))
...
Episode 1: 56.0
Episode 2: 52.0
Episode 3: 85.0
Episode 4: 106.0
Episode 5: 41.0
……
……
Episode 996: 39.0
Episode 997: 51.0
Episode 998: 49.0
Episode 999: 54.0
Episode 1000: 41.0

We also calculate the average total reward achieved by the hill-climbing version of linear mapping:

 >>> print('Average total reward over {} episode: {}'.format(
n_episode, sum(total_rewards) / n_episode))
Average total reward over 1000 episode: 50.024
  1. To assess the training using the hill-climbing algorithm, we repeat the training process multiple times (by running the code from Step 4 to Step 6 multiple times). We observe that the average total reward fluctuates a lot. The following are the results we got when running it 10 times:
Average total reward over 1000 episode: 9.261   
Average total reward over 1000 episode: 88.565
Average total reward over 1000 episode: 51.796
Average total reward over 1000 episode: 9.41
Average total reward over 1000 episode: 109.758
Average total reward over 1000 episode: 55.787
Average total reward over 1000 episode: 189.251
Average total reward over 1000 episode: 177.624
Average total reward over 1000 episode: 9.146
Average total reward over 1000 episode: 102.311

What could cause such variance? It turns out that if the initial weight is bad, adding noise at a small scale will have little effect on improving the performance. This will cause poor convergence. On the other hand, if the initial weight is good, adding noise at a big scale might move the weight away from the optimal weight and jeopardize the performance. How can we make the training of the hill-climbing model more stable and reliable? We can actually make the noise scale adaptive to the performance, just like the adaptive learning rate in gradient descent. Let's see Step 8 for more details.

  1. To make the noise adaptive, we do the following:
  • Specify a starting noise scale.
  • If the performance in an episode improves, decrease the noise scale. In our case, we take half of the scale, but set 0.0001 as the lower bound.
  • If the performance in an episode drops, increase the noise scale. In our case, we double the scale, but set 2 as the upper bound.

Put this into code:

 >>> noise_scale = 0.01
>>> best_total_reward = 0
>>> total_rewards = []
>>> for episode in range(n_episode):
... weight = best_weight +
noise_scale * torch.rand(n_state, n_action)
... total_reward = run_episode(env, weight)
... if total_reward >= best_total_reward:
... best_total_reward = total_reward
... best_weight = weight
... noise_scale = max(noise_scale / 2, 1e-4)
... else:
... noise_scale = min(noise_scale * 2, 2)
... print('Episode {}: {}'.format(episode + 1, total_reward))
... total_rewards.append(total_reward)
...
Episode 1: 9.0
Episode 2: 9.0
Episode 3: 9.0
Episode 4: 10.0
Episode 5: 10.0
……
……
Episode 996: 200.0
Episode 997: 200.0
Episode 998: 200.0
Episode 999: 200.0
Episode 1000: 200.0

The reward is increasing as the episodes progress. It reaches the maximum of 200 within the first 100 episodes and stays there. The average total reward also looks promising:

 >>> print('Average total reward over {} episode: {}'.format(
n_episode, sum(total_rewards) / n_episode))
Average total reward over 1000 episode: 186.11

We also plot the total reward for every episode as follows:

>>> import matplotlib.pyplot as plt
>>> plt.plot(total_rewards)
>>> plt.xlabel('Episode')
>>> plt.ylabel('Reward')
>>> plt.show()

In the resulting plot, we can see a clear upward trend before it plateaus at the maximum value:

Feel free to run the new training process a few times. The results are very stable compared to learning with a constant noise scale.

  1. Now, let's see how the learned policy performs on 100 new episodes:
 >>> n_episode_eval = 100
>>> total_rewards_eval = []
>>> for episode in range(n_episode_eval):
... total_reward = run_episode(env, best_weight)
... print('Episode {}: {}'.format(episode+1, total_reward))
... total_rewards_eval.append(total_reward)
...
Episode 1: 200.0
Episode 2: 200.0
Episode 3: 200.0
Episode 4: 200.0
Episode 5: 200.0
……
……
Episode 96: 200.0
Episode 97: 200.0
Episode 98: 200.0
Episode 99: 200.0
Episode 100: 200.0

Let's see the average performance:

>>> print('Average total reward over {} episode: {}'.format(n_episode, sum(total_rewards) / n_episode))
Average total reward over 1000 episode: 199.94

The average reward for the testing episodes is close to the maximum of 200 that we obtained with the learned policy. You can re-run the evaluation multiple times. The results are pretty consistent.

How it works...

We are able to achieve much better performance with the hill-climbing algorithm than with random search by simply adding adaptive noise to each episode. We can think of it as a special kind of gradient descent without a target variable. The additional noise is the gradient, albeit in a random way. The noise scale is the learning rate, and it is adaptive to the reward from the previous episode. The target variable in hill climbing becomes achieving the highest reward. In summary, rather than isolating each episode, the agent in the hill-climbing algorithm makes use of the knowledge learned from each episode and performs a more reliable action in the next episode. As the name hill climbing implies, the reward moves upwards through the episodes as the weight gradually moves towards the optimum value.

There's more...

We can observe that the reward can reach the maximum value within the first 100 episodes. Can we just stop training when the reward reaches 200, as we did with the random search policy? That might not be a good idea. Remember that the agent is making continuous improvements in hill climbing. Even if it finds a weight that generates the maximum reward, it can still search around this weight for the optimal point. Here, we define the optimal policy as the one that can solve the CartPole problem. According to the following wiki page, https://github.com/openai/gym/wiki/CartPole-v0, "solved" means the average reward over 100 consecutive episodes is no less than 195.

We refine the stopping criterion accordingly:

 >>> noise_scale = 0.01
>>> best_total_reward = 0
>>> total_rewards = []
>>> for episode in range(n_episode):
... weight = best_weight + noise_scale * torch.rand(n_state, n_action)
... total_reward = run_episode(env, weight)
... if total_reward >= best_total_reward:
... best_total_reward = total_reward
... best_weight = weight
... noise_scale = max(noise_scale / 2, 1e-4)
... else:
... noise_scale = min(noise_scale * 2, 2)
... print('Episode {}: {}'.format(episode + 1, total_reward))
... total_rewards.append(total_reward)
... if episode >= 99 and sum(total_rewards[-100:]) >= 19500:
... break
...
Episode 1: 9.0
Episode 2: 9.0
Episode 3: 10.0
Episode 4: 10.0
Episode 5: 9.0
……
……
Episode 133: 200.0
Episode 134: 200.0
Episode 135: 200.0
Episode 136: 200.0
Episode 137: 200.0

At episode 137, the problem is considered solved.

See also

Developing a policy gradient algorithm

The last recipe of the first chapter is about solving the CartPole environment with a policy gradient algorithm. This may be more complicated than we need for this simple problem, in which the random search and hill-climbing algorithms suffice. However, it is a great algorithm to learn, and we will use it in more complicated environments later in the book.

In the policy gradient algorithm, the model weight moves in the direction of the gradient at the end of each episode. We will explain the computation of gradients in the next section. Also, in each step, it samples an action from the policy based on the probabilities computed using the state and weight. It no longer takes an action with certainty, in contrast with random search and hill climbing (by taking the action achieving the higher score). Hence, the policy switches from deterministic to stochastic.

How to do it...

Now, it is time to implement the policy gradient algorithm with PyTorch:

  1. As before, import the necessary packages, create an environment instance, and obtain the dimensions of the observation and action space:
>>> import gym
>>> import torch
>>> env = gym.make('CartPole-v0')
>>> n_state = env.observation_space.shape[0]
>>> n_action = env.action_space.n
  1. We define the run_episode function, which simulates an episode given the input weight and returns the total reward and the gradients computed. More specifically, it does the following tasks in each step:
  • Calculates the probabilities, probs, for both actions based on the current state and input weight
  • Samples an action, action, based on the resulting probabilities
  • Computes the derivatives, d_softmax, of the softmax function with the probabilities as input
  • Divides the resulting derivatives, d_softmax, by the probabilities, probs, to get the derivatives, d_log, of the log term with respect to the policy
  • Applies the chain rule to compute the gradient, grad, of the weights
  • Records the resulting gradient, grad
  • Performs the action, accumulates the reward, and updates the state

Putting all of this into code, we have the following:

 >>> def run_episode(env, weight):
... state = env.reset()
... grads = []
... total_reward = 0
... is_done = False
... while not is_done:
... state = torch.from_numpy(state).float()
... z = torch.matmul(state, weight)
... probs = torch.nn.Softmax()(z)
... action = int(torch.bernoulli(probs[1]).item())
... d_softmax = torch.diag(probs) -
probs.view(-1, 1) * probs
... d_log = d_softmax[action] / probs[action]
... grad = state.view(-1, 1) * d_log
... grads.append(grad)
... state, reward, is_done, _ = env.step(action)
... total_reward += reward
... if is_done:
... break
... return total_reward, grads

After an episode finishes, it returns the total reward obtained in this episode and the gradients computed for the individual steps. These two outputs will be used to update the weight.

  1. Let's make it 1,000 episodes for now:
>>> n_episode = 1000

This means we will run run_episode and n_episodetimes.

  1. Initiate the weight:
>>> weight = torch.rand(n_state, n_action)

We will also record the total reward for every episode:

>>> total_rewards = []
  1. At the end of each episode, we need to update the weight using the computed gradients. For every step of the episode, the weight moves by learning rate * gradient calculated in this step * total reward in the remaining steps. Here, we choose 0.001 as the learning rate:
>>> learning_rate = 0.001

Now, we can run n_episodeepisodes:

 >>> for episode in range(n_episode):
... total_reward, gradients = run_episode(env, weight)
... print('Episode {}: {}'.format(episode + 1, total_reward))
... for i, gradient in enumerate(gradients):
... weight += learning_rate * gradient * (total_reward - i)
... total_rewards.append(total_reward)
……
……
Episode 101: 200.0
Episode 102: 200.0
Episode 103: 200.0
Episode 104: 190.0
Episode 105: 133.0
……
……
Episode 996: 200.0
Episode 997: 200.0
Episode 998: 200.0
Episode 999: 200.0
Episode 1000: 200.0
  1. Now, we calculate the average total reward achieved by the policy gradient algorithm:
 >>> print('Average total reward over {} episode: {}'.format(
n_episode, sum(total_rewards) / n_episode))
Average total reward over 1000 episode: 179.728
  1. We also plot the total reward for every episode as follows:
 >>> import matplotlib.pyplot as plt
>>> plt.plot(total_rewards)
>>> plt.xlabel('Episode')
>>> plt.ylabel('Reward')
>>> plt.show()

In the resulting plot, we can see a clear upward trend before it stays at the maximum value:

We can also see that the rewards oscillate even after it converges. This is because the policy gradient algorithm is a stochastic policy.

  1. Now, let's see how the learned policy performs on 100 new episodes:
 >>> n_episode_eval = 100
>>> total_rewards_eval = []
>>> for episode in range(n_episode_eval):
... total_reward, _ = run_episode(env, weight)
... print('Episode {}: {}'.format(episode+1, total_reward))
... total_rewards_eval.append(total_reward)
...
Episode 1: 200.0
Episode 2: 200.0
Episode 3: 200.0
Episode 4: 200.0
Episode 5: 200.0
……
……
Episode 96: 200.0
Episode 97: 200.0
Episode 98: 200.0
Episode 99: 200.0
Episode 100: 200.0

Let's see the average performance:

>>> print('Average total reward over {} episode: {}'.format(n_episode, sum(total_rewards) / n_episode))
Average total reward over 1000 episode: 199.78

The average reward for the testing episodes is close to the maximum value of 200 for the learned policy. You can re-run the evaluation multiple times. The results are pretty consistent.

How it works...

The policy gradient algorithm trains an agent by taking small steps and updating the weight based on the rewards associated with those steps at the end of an episode. The technique of having the agent run through an entire episode and then updating the policy based on the rewards obtained is called Monte Carlo policy gradient.

The action is selected based on the probability distribution computed based on the current state and the model’s weight. For example, if the probabilities for the left and right actions are [0.6, 0.4], this means the left action is selected 60% of the time; it doesn't mean the left action is chosen, as in the random search and hill-climbing algorithms.

We know that the reward is 1 for each step before an episode terminates. Hence, the future reward we use to calculate the policy gradient at each step is the number of steps remaining. After each episode, we feed the gradient history multiplied by the future rewards to update the weight using the stochastic gradient ascent method. In this way, the longer an episode is, the bigger the update of the weight. This will eventually increase the chance of getting a larger total reward.

As we mentioned at the start of this section, the policy gradient algorithm might be overkill for a simple environment such as CartPole, but it should get us ready for more complicated problems.

There's more...

If we examine the reward/episode plot, it seems that we can also stop early during training when it has been solved – the average reward over 100 consecutive episodes is no less than 195. We just add the following lines of code to the training session:

 >>> if episode >= 99 and sum(total_rewards[-100:]) >= 19500:
... break

Re-run the training session. You should get something similar to the following, which stops after several hundred episodes:

Episode 1: 10.0
Episode 2: 27.0
Episode 3: 28.0
Episode 4: 15.0
Episode 5: 12.0
……
……
Episode 549: 200.0
Episode 550: 200.0
Episode 551: 200.0
Episode 552: 200.0
Episode 553: 200.0

See also

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Use PyTorch 1.x to design and build self-learning artificial intelligence (AI) models
  • Implement RL algorithms to solve control and optimization challenges faced by data scientists today
  • Apply modern RL libraries to simulate a controlled environment for your projects

Description

Reinforcement learning (RL) is a branch of machine learning that has gained popularity in recent times. It allows you to train AI models that learn from their own actions and optimize their behavior. PyTorch has also emerged as the preferred tool for training RL models because of its efficiency and ease of use. With this book, you'll explore the important RL concepts and the implementation of algorithms in PyTorch 1.x. The recipes in the book, along with real-world examples, will help you master various RL techniques, such as dynamic programming, Monte Carlo simulations, temporal difference, and Q-learning. You'll also gain insights into industry-specific applications of these techniques. Later chapters will guide you through solving problems such as the multi-armed bandit problem and the cartpole problem using the multi-armed bandit algorithm and function approximation. You'll also learn how to use Deep Q-Networks to complete Atari games, along with how to effectively implement policy gradients. Finally, you'll discover how RL techniques are applied to Blackjack, Gridworld environments, internet advertising, and the Flappy Bird game. By the end of this book, you'll have developed the skills you need to implement popular RL algorithms and use RL techniques to solve real-world problems.

Who is this book for?

Machine learning engineers, data scientists and AI researchers looking for quick solutions to different reinforcement learning problems will find this book useful. Although prior knowledge of machine learning concepts is required, experience with PyTorch will be useful but not necessary.

What you will learn

  • Use Q-learning and the state–action–reward–state–action (SARSA) algorithm to solve various Gridworld problems
  • Develop a multi-armed bandit algorithm to optimize display advertising
  • Scale up learning and control processes using Deep Q-Networks
  • Simulate Markov Decision Processes, OpenAI Gym environments, and other common control problems
  • Select and build RL models, evaluate their performance, and optimize and deploy them
  • Use policy gradient methods to solve continuous RL problems
Estimated delivery fee Deliver to Egypt

Standard delivery 10 - 13 business days

$12.95

Premium delivery 3 - 6 business days

$34.95
(Includes tracking information)

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Oct 31, 2019
Length: 340 pages
Edition : 1st
Language : English
ISBN-13 : 9781838551964
Category :
Languages :
Tools :

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
Estimated delivery fee Deliver to Egypt

Standard delivery 10 - 13 business days

$12.95

Premium delivery 3 - 6 business days

$34.95
(Includes tracking information)

Product Details

Publication date : Oct 31, 2019
Length: 340 pages
Edition : 1st
Language : English
ISBN-13 : 9781838551964
Category :
Languages :
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total $ 131.97
Advanced Deep Learning with Python
$48.99
PyTorch 1.x Reinforcement Learning Cookbook
$43.99
Reinforcement Learning Algorithms with Python
$38.99
Total $ 131.97 Stars icon

Table of Contents

10 Chapters
Getting Started with Reinforcement Learning and PyTorch Chevron down icon Chevron up icon
Markov Decision Processes and Dynamic Programming Chevron down icon Chevron up icon
Monte Carlo Methods for Making Numerical Estimations Chevron down icon Chevron up icon
Temporal Difference and Q-Learning Chevron down icon Chevron up icon
Solving Multi-armed Bandit Problems Chevron down icon Chevron up icon
Scaling Up Learning with Function Approximation Chevron down icon Chevron up icon
Deep Q-Networks in Action Chevron down icon Chevron up icon
Implementing Policy Gradients and Policy Optimization Chevron down icon Chevron up icon
Capstone Project – Playing Flappy Bird with DQN Chevron down icon Chevron up icon
Other Books You May Enjoy Chevron down icon Chevron up icon

Customer reviews

Rating distribution
Full star icon Full star icon Full star icon Full star icon Half star icon 4.3
(3 Ratings)
5 star 33.3%
4 star 66.7%
3 star 0%
2 star 0%
1 star 0%
AustinSF Jan 24, 2020
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Got this book with the recommendation from one of my friend, this book provides good guidance for the people who wish to dive into the career in machine learning, with easy to understand explanations and hands-on code examples. It also teaches you how to think and tackle challenges like an ML engineer using reinforcement learning, which I think it's more important for a newbie like me.
Amazon Verified review Amazon
kepler man Oct 28, 2020
Full star icon Full star icon Full star icon Full star icon Empty star icon 4
DQNの数式をOpenAIのGymのゲームでPyTorchで組んで具現化するレシピ集です。理論的な事は別の専門書に委ねるとして、数式を実際に組んでみるとこの様にできるという数多くの例を段階的に教示してあり、実際に動くと楽しくなります。deepmind社もAtariのゲームで発表していますので、このレシピ集は教材として充分でしょう。サンプルを試す前にgymでインストールしておくのはatariの他にbox2dやclassic_controlが必要で、3DゲームのMuJoCo(お試しライセンス付き)のサンプルは載っていません。あと最後の方にpygameのインストールが必要なDQNも載っています。対応マシンは、Windows、MacOS、Linuxのi686,x86_64マシンで、Pythonの3.6又は3.7です。aarch64系のJetson NANOでも少し工夫しましたが、ほぼサンプルは動かせました。その時のメモ:「sudo apt-get install -y cmake zlib1g-dev libjpeg-dev xvfb ffmpeg xorg-dev libboost-all-dev libsdl2-dev swiggitからのgymのsetup.pyをi686,x86_64用のopencv-pythonを外して編集し、'atari':['atari_py==0.2.0','Pillow'] だけにします。代替えに sudo apt-get install python3-opencv で、旧い3系を入れ、pip3 install -e '.[atari, box2d, classic_control]'この時に、--userで.localに入れるとenv.render()をかけるとライブラリーの関係でエラーになるのでそのまま普通の場所にインストール。env設定でメモリーに一旦ロードされると誤記でもしてSyntaxErrorをわざと起こさない限りクリアーでリセットがし難いとかの裏技発見。gym 0.17.3が依存するpyglet<=1.5.0,>=1.4.0の1.5.0はバグ、範囲外だがpyglet-1.5.7の最新版の方が調子が良い。」
Amazon Verified review Amazon
Matthew Emerick Jul 21, 2020
Full star icon Full star icon Full star icon Full star icon Empty star icon 4
About This BookThis is a cookbook style (obviously) technical book that takes the intermediate machine learning developer into the world of PyTorch reinforcement leaning. It covers everything from setting up PyTorch and the OpenAI Gym environment to a final capstone project, the later of which I've never seen in a technical cookbook before.Who is This For?As per the preface, this book is written for a developer who has experience with machine learning. It doesn't teach you PyTorch, per se, but you can pick it up. Knowing PyTorch beforehand would probably let you get more out of this book, though.Why Was This Written?There is a massive market for books on the various aspects of artificial intelligence. Reinforcement learning is one of the areas that has seen a lot of interest lately, and rightfully so. I think that this is the kind of book that we need: it pulls in a specific and popular technology and looks at a single subfield of machine learning.OrganizationThis book sticks to a simple organization which makes it easier to find the section that you are looking for. The first chapter is all about setting up your system with PyTorch and the OpenAI Gym environment. Then it touches on a few simpler algorithms to introduce the reader to both technologies. The next seven chapters discuss reinforcement algorithms of increasing complexity. The book then finishes with a chapter with a capstone project, which I have never seen in a cookbook before. I think I like it, as it brings everything together.Did This Book Succeed?This book did everything that it tries to do. It checks all of the boxes on its todo list and gives the reader a lot of new knowledge of reinforcement learning and PyTorch. I agree with the supposition that the reader should have a good understanding of machine learning already, as the book doesn't go into much detail, nor should it; this is an intermediate book.Rating and Final ThoughtsI like this book and I think it's a good addition to anyone's collection. The capstone project is a nice touch. I just don't think it's exceptional. The author clearly has technical expertise, but writing is not their strongest suit. I hope the author keeps writing, though, as there is a lot to learn from him. All said and done, I give this book a 4 out of 5.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is the delivery time and cost of print book? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is custom duty/charge? Chevron down icon Chevron up icon

Customs duty are charges levied on goods when they cross international borders. It is a tax that is imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order? Chevron down icon Chevron up icon

The orders shipped to the countries that are listed under EU27 will not bear custom charges. They are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea:

A custom duty or localized taxes may be applicable on the shipment and would be charged by the recipient country outside of the EU27 which should be paid by the customer and these duties are not included in the shipping charges been charged on the order.

How do I know my custom duty charges? Chevron down icon Chevron up icon

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order? Chevron down icon Chevron up icon

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact customercare@packt.com with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at customercare@packt.com using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except for the cases described in our Return Policy (i.e. Packt Publishing agrees to replace your printed book because it arrives damaged or material defect in book), Packt Publishing will not accept returns.

What is your returns and refunds policy? Chevron down icon Chevron up icon

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on customercare@packt.com with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on customercare@packt.com within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on customercare@packt.com who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund of the problem items.(damaged, defective or incorrect)
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged, with book material defect, contact our Customer Relation Team on customercare@packt.com within 14 days of receipt of the book with appropriate evidence of damage and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner which is on a print-on-demand basis.

What tax is charged? Chevron down icon Chevron up icon

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use? Chevron down icon Chevron up icon

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal
What is the delivery time and cost of print books? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela