You're reading from Python Reinforcement Learning Projects Eight hands-on projects exploring reinforcement learning algorithms using TensorFlow

Product type Paperback

Published in Sep 2018

Publisher Packt

ISBN-13 9781788991612

Length 296 pages

Edition 1st Edition

Languages

Python

Tools

Keras

Concepts

Reinforcement Learning

Authors (3):

Sean Saito

Rajalingappaa Shanmugamani

Yang Wenzhuo

View More author details

What is reinforcement learning?

Our journey begins with understanding what reinforcement learning is about. Those who are familiar with machine learning may be aware of several learning paradigms, namely supervised learning and unsupervised learning. In supervised learning, a machine learning model has a supervisor that gives the ground truth for every data point. The model learns by minimizing the distance between its own prediction and the ground truth. The dataset is thus required to have an annotation for each data point, for example, each image of a dog and a cat would have its respective label. In unsupervised learning, the model does not have access to the ground truths of the data and thus has to learn about the distribution and patterns of the data without them.

In reinforcement learning, the agent refers to the model/algorithm that learns to complete a particular task. The agent learns primarily by receiving reward signals, which is a scalar indication of how well the agent is performing a task.

Suppose we have an agent that is tasked with controlling a robot's walking movement; the agent would receive positive rewards for successfully walking toward a destination and negative rewards for falling/failing to make progress.

Moreover, unlike in supervised learning, these reward signals are not given to the model immediately; rather, they are returned as a consequence of a sequence of actions that the agent makes. Actions are simply the things an agent can do within its environment. The environment refers to the world in which the agent resides and is primarily responsible for returning reward signals to the agent. An agent's actions are usually conditioned on what the agent perceives from the environment. What the agent perceives is referred to as the observation or the state of the environment. What further distinguishes reinforcement learning from other paradigms is that the actions of the agent can alter the environment and its subsequent responses.

For example, suppose an agent is tasked with playing Space Invaders, the popular Atari 2600 arcade game. The environment is the game itself, along with the logic upon which it runs. During the game, the agent queries the environment to make an observation. The observation is simply an array of the (210, 160, 3) shape, which is the screen of the game that displays the agent's ship, the enemies, the score, and any projectiles. Based on this observation, the agent makes some actions, which can include moving left or right, shooting a laser, or doing nothing. The environment receives the agent's action as input and makes any necessary updates to the state.

For instance, if a laser touches an enemy ship, it is removed from the game. If the agent decides to simply move to the left, the game updates the agent's coordinates accordingly. This process repeats until a terminal state, a state that represents the end of the sequence, is reached. In Space Invaders, the terminal state corresponds to when the agent's ship is destroyed, and the game subsequently returns the score that it keeps track of, a value that is calculated based on the number of enemy ships the agent successfully destroys.

Note that some environments do not have terminal states, such as the stock market. These environments keep running for as long as they exist.

Let's recap the terms we have learned about so far:

Term	Description	Examples
Agent	A model/algorithm that is tasked with learning to accomplish a task.	Self-driving cars, walking robots, video game players
Environment	The world in which the agent acts. It is responsible for controlling what the agent perceives and providing feedback on how well the agent is performing a particular task.	The road on which a car drives, a video game, the stock market
Action	A decision the agent makes in an environment, usually dependent on what the agent perceives.	Steering a car, buying or selling a particular stock, shooting a laser from the spaceship the agent is controlling
Reward signal	A scalar indication of how well the agent is performing a particular task.	Space Invaders score, return on investment for some stock, distance covered by a robot learning to walk
Observation/state	A description of the environment as can be perceived by the agent.	Video from a dashboard camera, the screen of the game, stock market statistics
Terminal state	A state at which no further actions can be made by the agent.	Reaching the end of a maze, the ship in Space Invaders getting destroyed

Put formally, at a given timestep, t, the following happens for an agent, P, and environment, E:

- P queries E for some observation 
- P decides to take action  based on observation 
- E receives  and returns reward  based on the action
- P receives 
- E updates  to  based on  and other factors

How does the environment computeand ? The environment usually has its own algorithm that computes these values based on numerous input/factors, including what action the agent takes.

Sometimes, the environment is composed of multiple agents that try to maximize their own rewards. The way gravity acts upon a ball that we drop from a height is a good representation of how the environment works; just like how our surroundings obey the laws of physics, the environment has some internal mechanism for computing rewards and the next state. This internal mechanism is usually hidden to the agent, and thus our job is to build agents that can learn to do a good job at their respective tasks, despite this uncertainty.

In the following sections, we will discuss in more detail the main protagonist of every reinforcement learning problem—the agent.

The agent

The goal of a reinforcement learning agent is to learn to perform a task well in an environment. Mathematically, this means to maximize the cumulative reward, R, which can be expressed in the following equation:

We are simply calculating a weighted sum of the reward received at each timestep.is called the discount factor, which is a scalar value between 0 and 1. The idea is that the later a reward comes, the less valuable it becomes. This reflects our perspectives on rewards as well; that we'd rather receive $100 now rather than a year later shows how the same reward signal can be valued differently based on its proximity to the present.

Because the mechanics of the environment are not fully observable or known to the agent, it must gain information by performing an action and observing how the environment reacts to it. This is much like how humans learn to perform certain tasks as well.

Suppose we are learning to play chess. While we don't have all the possible moves committed to memory or know exactly how an opponent will play, we are able to improve our proficiency over time. In particular, we are able to become proficient in the following:

Learning how to react to a move made by the opponent
Assessing how good of a position we are in to win the game
Predicting what the opponent will do next and using that prediction to decide on a move
Understanding how others would play in a similar situation

In fact, reinforcement learning agents can learn to do similar things. In particular, an agent can be composed of multiple functions and models to assist its decision-making. There are three main components that an agent can have: the policy, the value function, and the model.

Policy

A policy is an algorithm or a set of rules that describe how an agent makes its decisions. An example policy can be the strategy an investor uses to trade stocks, where the investor buys a stock when its price goes down and sells the stock when the price goes up.

More formally, a policy is a function, usually denoted as , that maps a state, , to an action, :

This means that an agent decides its action given its current state. This function can represent anything, as long as it can receive a state as input and output an action, be it a table, graph, or machine learning classifier.

For example, suppose we have an agent that is supposed to navigate a maze. We shall further assume that the agent knows what the maze looks like; the following is how the agent's policy can be represented:

Figure 1: A maze where each arrow indicates where an agent would go next

Each white square in this maze represents a state the agent can be in. Each blue arrow refers to the action an agent would take in the corresponding square. This essentially represents the agent's policy for this maze. Moreover, this can also be regarded as a deterministic policy, for the mapping from the state to the action is deterministic. This is in contrast to a stochastic policy, where a policy would output a probability distribution over the possible actions given some state:

Here,is a normalized probability vector over all the possible actions, as shown in the following example:

Figure 2: A policy mapping the game state (the screen) to actions (probabilities)

The agent playing the game of Breakout has a policy that takes the screen of the game as input and returns a probability for each possible action.

Value function

The second component an agent can have is called the value function. As mentioned previously, it is useful to assess your position, good or bad, in a given state. In a game of chess, a player would like to know the likelihood that they are going to win in a board state. An agent navigating a maze would like to know how close it is to the destination. The value function serves this purpose; it predicts the expected future reward an agent would receive in a given state. In other words, it measures whether a given state is desirable for the agent. More formally, the value function takes a state and a policy as input and returns a scalar value representing the expected cumulative reward:

Take our maze example, and suppose the agent receives a reward of -1 for every step it takes. The agent's goal is to finish the maze in the smallest number of steps possible. The value of each state can be represented as follows:

Figure 3: A maze where each square indicates the value of being in the state

Each square basically represents the number of steps it takes to get to the end of the maze. As you can see, the smallest number of steps required to reach the goal is 15.

How can the value function help an agent perform a task well, other than informing us of how desirable a given state is? As we will see in the following sections, value functions play an integral role in predicting how well a sequence of actions will do even before the agent performs them. This is similar to chess players imagining how well a sequence of future actions will do in improving his or her chances of winning. To do this, the agent also needs to have an understanding of how the environment works. This is where the third component of an agent, the model, becomes relevant.

Model

In the previous sections, we discussed how the environment is not fully known to the agent. In other words, the agent usually does not have an idea of how the internal algorithm of the environment looks. The agent thus needs to interact with it to gain information and learn how to maximize its expected cumulative reward. However, it is possible for the agent to have an internal replica, or a model, of the environment. The agent can use the model to predict how the environment would react to some action in a given state. A model of the stock market, for example, is tasked with predicting what the prices will look like in the future. If the model is accurate, the agent can then use its value function to assess how desirable future states look. More formally, a model can be denoted as a function, , that predicts the probability of the next state given the current state and an action:

In other scenarios, the model of the environment can be used to enumerate possible future states. This is commonly used in turn-based games, such as chess and tic-tac-toe, where the rules and scope of possible actions are clearly defined. Trees are often used to illustrate the possible sequence of actions and states in turn-based games:

Figure 4: A model using its value function to assess possible moves

In the preceding example of the tic-tac-toe game,denotes the possible states that taking theaction (represented as the shaded circle) could yield in a given state, . Moreover, we can calculate the value of each state using the agent's value function. The middle and bottom states would yield a high value since the agent would be one step away from victory, whereas the top state would yield a medium value since the agent needs to prevent the opponent from winning.

Let's review the terms we have covered so far:

Term	Description	What does it output?
Policy	The algorithm or function that outputs decisions the agent makes	A scalar/single decision (deterministic policy) or a vector of probabilities over possible actions (stochastic policy)
Value Function	The function that describes how good or bad a given state is	A scalar value representing the expected cumulative reward
Model	An agent's representation of the environment, which predicts how the environment will react to the agent's actions	The probability of the next state given an action and current state, or an enumeration of possible states given the rules of the environment

In the following sections, we will use these concepts to learn about one of the most fundamental frameworks in reinforcement learning: the Markov decision process.

Markov decision process (MDP)

A Markov decision process is a framework used to represent the environment of a reinforcement learning problem. It is a graphical model with directed edges (meaning that one node of the graph points to another node). Each node represents a possible state in the environment, and each edge pointing out of a state represents an action that can be taken in the given state. For example, consider the following MDP:

Figure 5: A sample Markov Decision Process

The preceding MDP represents what a typical day of a programmer could look like. Each circle represents a particular state the programmer can be in, where the blue state (Wake Up) is the initial state (or the state the agent is in at t=0), and the orange state (Publish Code) denotes the terminal state. Each arrow represents the transitions that the programmer can make between states. Each state has a reward that is associated with it, and the higher the reward, the more desirable the state is.

We can tabulate the rewards as an adjacency matrix as well:

State\action	Wake Up	Netflix	Code and debug	Nap	Deploy	Sleep
Wake Up	N/A	-2	-3	N/A	N/A	N/A
Netflix	N/A	-2	N/A	N/A	N/A	N/A
Code and debug	N/A	N/A	N/A	1	10	3
Nap	0	N/A	N/A	N/A	N/A	N/A
Deploy	N/A	N/A	N/A	N/A	N/A	3
Sleep	N/A	N/A	N/A	N/A	N/A	N/A

The left column represents the possible states and the top row represents the possible actions. N/A means that the action is not performable from the given state. This system basically represents the decisions that a programmer can make throughout their day.

When the programmer wakes up, they can either decide to work (code and debug the code) or watch Netflix. Notice that the reward for watching Netflix is higher than that of coding and debugging. For the programmer in question, watching Netflix seems like a more rewarding activity, while coding and debugging is perhaps a chore (which, I hope, is not the case for the reader!). However, both actions yield negative rewards, even though our objective is to maximize our cumulative reward. If the programmer chooses to watch Netflix, they will be stuck in an endless loop of binge-watching, which continuously lowers the reward. Rather, more rewarding states will become available to the programmer if they decide to code diligently. Let's look at the possible trajectories, which are the sequence of actions, the programmer can take:

Wake Up | Netflix | Netflix | ...
Wake Up | Code and debug | Nap | Wake Up | Code and debug | Nap | ...
Wake Up | Code and debug | Sleep
Wake Up | Code and debug | Deploy | Sleep

Both the first and second trajectories represent infinite loops. Let's calculate the cumulative reward for each, where we set :

It is easy to see that both the first and second trajectories, despite not reaching a terminal state, will never return positive rewards. The fourth trajectory yields the highest reward (successfully deploying code is a highly rewarding accomplishment!).

What we have calculated are the value functions for four policies that a programmer can take to go through their day. Recall that the value function is the expected cumulative reward starting from a given state and following a policy. We have observed four possible policies and have evaluated how each leads to a different cumulative reward; this exercise is also called policy evaluation. Moreover, the equations we have applied to calculate the expected rewards are also known as Bellman expectation equations. The Bellman equations are a set of equations used to evaluate and improve policies and value functions to help a reinforcement learning agent learn better. Though a thorough introduction to Bellman equations is outside the scope of this book, they are foundational to building a theoretical understanding of reinforcement learning. We encourage the reader to look into this further.

While we will not cover Bellman equations in depth, we highly recommend the reader to do so in order to build a solid understanding of reinforcement learning. For more information, refer to Reinforcement Learning: An Introduction, by Richard S. Sutton and Andrew Barto (reference at the end of this chapter).

Now that you have learned about some the key terms and concepts of reinforcement learning, you may be wondering how we teach a reinforcement learning agent to maximize its reward, or in other words, find that the fourth trajectory is the best. In this book, you will be working on solving this question for numerous tasks and problems, all using deep learning. While we encourage you to be familiar with the basics of deep learning, the following sections will serve as a light refresher to the field.