Markov Decision Process

An MDP is an extension of the Markov chain. It provides a mathematical framework for modeling decision-making situations. Almost all reinforcement learning problems can be modeled as an MDP.

An MDP is represented by five important elements (a minimal code sketch follows this list):

  • A set of states (S) the agent can be in.
  • A set of actions (A) that can be performed by an agent, for moving from one state to another.
  • A transition probability (P^a_{ss'}), which is the probability of moving from one state s to another state s' by performing some action a.
  • A reward probability (R^a_{ss'}), which is the probability of a reward acquired by the agent for moving from one state s to another state s' by performing some action a.
  • A discount factor (γ), which controls the importance of immediate and future rewards. We will discuss this in detail in the upcoming sections.
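To make these five elements concrete, here is a minimal Python sketch of how a tiny MDP could be stored in plain dictionaries. The two states, two actions, transition probabilities, and rewards below are made up purely for illustration and are not taken from any environment used in this book.

```python
# A toy two-state, two-action MDP.
# transition_probs[s][a][s_next] is the probability P(s'|s, a) of landing
# in s_next when action a is performed in state s.
# rewards[s][a][s_next] is the reward received for that transition.

states = ["s1", "s2"]
actions = ["a1", "a2"]

transition_probs = {
    "s1": {"a1": {"s1": 0.7, "s2": 0.3},
           "a2": {"s1": 0.2, "s2": 0.8}},
    "s2": {"a1": {"s1": 0.9, "s2": 0.1},
           "a2": {"s1": 0.4, "s2": 0.6}},
}

rewards = {
    "s1": {"a1": {"s1": 0.0, "s2": 1.0},
           "a2": {"s1": 0.0, "s2": 2.0}},
    "s2": {"a1": {"s1": -1.0, "s2": 0.0},
           "a2": {"s1": 0.5, "s2": 0.0}},
}

gamma = 0.9  # discount factor
```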

Rewards and returns

As we have learned, in an RL environment, an agent interacts with the environment by performing an action and moves from one state to another. Based on the action it performs, it receives a reward. A reward is nothing but a numerical value, say, +1 for a good action and -1 for a bad action. How do we decide if an action is good or bad? In a maze game, a good action is where the agent makes a move so that it doesn't hit a maze wall, whereas a bad action is where the agent moves and hits the maze wall.

An agent tries to maximize the total amount of rewards (cumulative rewards) it receives from the environment instead of just the immediate rewards. The total amount of rewards the agent receives from the environment is called the return. So, we can formulate the total amount of rewards (the return) received by the agent as follows:

R_t = r_{t+1} + r_{t+2} + r_{t+3} + ... + r_T

r_{t+1} is the reward received by the agent at time step t+1 while performing an action to move from one state to another, r_{t+2} is the reward received at time step t+2, and similarly, r_T is the reward received at the final time step T.
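As a quick sketch of this formula, the return of an episode is just the sum of the rewards collected at each time step. The reward values below are arbitrary.

```python
# Rewards received at time steps t+1, t+2, ..., T (arbitrary example values).
episode_rewards = [1, 0, -1, 2, 5]

# For an episodic task, the return R_t is simply their sum.
return_t = sum(episode_rewards)
print(return_t)  # 7
```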

Episodic and continuous tasks

Episodic tasks are tasks that have a terminal state (an end). In RL, an episode is the agent-environment interaction from the initial state to the final state.

For example, in a car racing video game, you start the game (initial state) and play the game until it is over (final state). This is called an episode. Once the game is over, you start the next episode by restarting the game, and you will begin from the initial state irrespective of the position you were in the previous game. So, each episode is independent of the other.

In a continuous task, there is no terminal state; continuous tasks never end. For example, a personal assistance robot does not have a terminal state.

Discount factor

We have seen that an agent's goal is to maximize the return. For an episodic task, we can define the return as R_t = r_{t+1} + r_{t+2} + ... + r_T, where T is the final time step of the episode, and we try to maximize the return R_t.

Since we don't have a final time step for a continuous task, we can define the return for continuous tasks as R_t = r_{t+1} + r_{t+2} + ..., a sum that runs on to infinity. But how can we maximize the return if it never stops growing?

That's why we introduce the notion of a discount factor, γ. We can redefine our return with the discount factor as follows:

R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ...    ---(1)

R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}    ---(2)

The discount factor decides how much importance we give to future rewards compared to immediate rewards. The value of the discount factor lies between 0 and 1. A discount factor of 0 means that only immediate rewards matter, while a discount factor of 1 means that future rewards are weighted just as heavily as immediate rewards.

An agent with a discount factor of 0 will never learn anything beyond the immediate reward; similarly, an agent with a discount factor of 1 keeps chasing future rewards, and for a never-ending task the return can grow to infinity. So the optimal value of the discount factor typically lies between 0.2 and 0.8.
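The following sketch shows how the discounted return from equation (2) can be computed for a finite list of rewards; the reward values and the discount factors are arbitrary, chosen only to show how γ shifts the weight between immediate and future rewards.

```python
def discounted_return(rewards, gamma):
    """Compute R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

episode_rewards = [1, 0, -1, 2, 5]

print(discounted_return(episode_rewards, gamma=0.0))  # 1.0: only the immediate reward counts
print(discounted_return(episode_rewards, gamma=0.9))  # ~4.93: later rewards count, with shrinking weight
```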

We give importance to immediate rewards or future rewards depending on the use case. In some cases, future rewards are more desirable than immediate rewards, and vice versa. In a chess game, the goal is to defeat the opponent's king. If we give importance only to the immediate reward, acquired by actions like our pawn capturing one of the opponent's pieces, the agent will learn to pursue this sub-goal instead of learning to reach the actual goal. So, in this case, we give importance to future rewards, whereas in some cases we prefer immediate rewards over future rewards. (Say, would you prefer chocolates if I gave them to you today or 13 months later?)

The policy function

We have learned about the policy function in Chapter 1, Introduction to Reinforcement Learning; it maps states to actions and is denoted by π.

The policy function can be represented as π: S → A, indicating a mapping from states to actions. So, basically, a policy function tells us what action to perform in each state. Our ultimate goal is to find the optimal policy, which specifies the correct action to perform in each state and thereby maximizes the reward.
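Since a policy is simply a mapping from states to actions, a deterministic policy can be sketched as a plain dictionary. The state and action names here are hypothetical placeholders.

```python
# A deterministic policy: exactly one action per state.
policy = {"s1": "a2", "s2": "a1"}

def select_action(state):
    """Return the action the policy prescribes for the given state."""
    return policy[state]

print(select_action("s1"))  # a2
```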

State value function

A state value function is also called simply a value function. It specifies how good it is for an agent to be in a particular state under a policy π. The value function is often denoted by V(s); it denotes the value of a state when the agent follows the policy.

We can define a state value function as follows:

V^π(s) = E_π[R_t | s_t = s]

This specifies the expected return starting from state s according to policy π. We can substitute the value of R_t in the value function from (2) as follows:

V^π(s) = E_π[Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s]

Note that the state value function depends on the policy: its values vary depending on the policy we choose.

We can view value functions in a table. Let us say we have two states, and in both of them the agent follows the policy π. Based on the values of these two states, we can tell how good it is for our agent to be in each state while following that policy. The greater the value, the better the state is:

State     Value
State 1   0.3
State 2   0.9

Based on the preceding table, we can tell that it is good to be in state 2, as it has the higher value. We will see how to estimate these values intuitively in the upcoming sections.
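One way to read "expected return" is as an average: if we could sample many returns starting from a state while following the policy π, their average would estimate V(s). The sampled returns below are made up so that the averages match the preceding table; this is only an illustrative sketch, not an estimation method introduced in this chapter.

```python
# Hypothetical returns sampled by starting in each state and following
# the policy until the episode ends (values are made up).
sampled_returns = {
    "state 1": [0.2, 0.4, 0.3],
    "state 2": [1.0, 0.8, 0.9],
}

# The value of a state is the expected (here: average) return from it.
value_table = {s: sum(r) / len(r) for s, r in sampled_returns.items()}
print(value_table)  # roughly {'state 1': 0.3, 'state 2': 0.9}, matching the table above
```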

State-action value function (Q function)

A state-action value function is also called the Q function. It specifies how good it is for an agent to perform a particular action in a state under a policy π. The Q function is denoted by Q(s, a). It denotes the value of taking an action a in a state s while following the policy π.

We can define the Q function as follows:

Q^π(s, a) = E_π[R_t | s_t = s, a_t = a]

This specifies the expected return starting from state s, taking the action a, and following policy π thereafter. We can substitute the value of R_t in the Q function from (2) as follows:

Q^π(s, a) = E_π[Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a]

The difference between the value function and the Q function is that the value function specifies the goodness of a state, while a Q function specifies the goodness of an action in a state.

Like the state value function, the Q function can be viewed in a table, which is also called a Q table. Let us say we have two states and two actions; our Q table looks like the following:

State     Action     Value
State 1   Action 1   0.03
State 1   Action 2   0.02
State 2   Action 1   0.5
State 2   Action 2   0.9

Thus, the Q table shows the values of all possible state-action pairs. So, by looking at this table, we can conclude that performing action 1 in state 1 and action 2 in state 2 is the better option, as these pairs have the higher values.

Whenever we say the value function V(s) or the Q function Q(s, a), it actually means the value table and the Q table, as shown previously.
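As a final sketch, the preceding Q table can be stored as a nested dictionary, and the best action in a state can be read off by picking the action with the highest Q value (a greedy choice). The numbers are the same illustrative values as in the table above.

```python
# The Q table from above as a nested dictionary: q_table[state][action] -> value.
q_table = {
    "state 1": {"action 1": 0.03, "action 2": 0.02},
    "state 2": {"action 1": 0.5,  "action 2": 0.9},
}

def greedy_action(state):
    """Pick the action with the highest Q value in the given state."""
    return max(q_table[state], key=q_table[state].get)

print(greedy_action("state 1"))  # action 1
print(greedy_action("state 2"))  # action 2
```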
