Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

Implement Reinforcement learning using Markov Decision Process [Tutorial]

Save for later
  • 11 min read
  • 09 Jul 2018

article-image

The Markov decision process, better known as MDP, is an approach in reinforcement learning to take decisions in a gridworld environment. A gridworld environment consists of states in the form of grids.


The MDP tries to capture a world in the form of a grid by dividing it into states, actions, models/transition models, and rewards. The solution to an MDP is called a policy and the objective is to find the optimal policy for that MDP task.

Thus, any reinforcement learning task composed of a set of states, actions, and rewards that follows the Markov property would be considered an MDP.

In this tutorial, we will dig deep into MDPs, states, actions, rewards, policies, and how to solve them using Bellman equations.

This article is a reinforcement learning tutorial taken from the book, Reinforcement learning with TensorFlow.

Markov decision processes


MDP is defined as the collection of the following:

  • States: S
  • Actions: A(s), A
  • Transition model: T(s,a,s') ~ P(s'|s,a)
  • Rewards: R(s), R(s,a), R(s,a,s')
  • Policyreinforcement-learning-mdp-markov-decision-process-tutorial-img-0 reinforcement-learning-mdp-markov-decision-process-tutorial-img-1is the optimal policy


In the case of an MDP, the environment is fully observable, that is, whatever observation the agent makes at any point in time is enough to make an optimal decision. In case of a partially observable environment, the agent needs a memory to store the past observations to make the best possible decisions.

Let's try to break this into different lego blocks to understand what this overall process means.

The Markov property


In short, as per the Markov property, in order to know the information of near future (say, at time t+1) the present information at time t matters.

Given a sequence, reinforcement-learning-mdp-markov-decision-process-tutorial-img-2, the first order of Markov says,
reinforcement-learning-mdp-markov-decision-process-tutorial-img-3, that is,
reinforcement-learning-mdp-markov-decision-process-tutorial-img-4 depends only on reinforcement-learning-mdp-markov-decision-process-tutorial-img-5. Therefore,
reinforcement-learning-mdp-markov-decision-process-tutorial-img-6 will depend only on reinforcement-learning-mdp-markov-decision-process-tutorial-img-7. The second order of Markov says, reinforcement-learning-mdp-markov-decision-process-tutorial-img-8, that is, reinforcement-learning-mdp-markov-decision-process-tutorial-img-9 depends only on reinforcement-learning-mdp-markov-decision-process-tutorial-img-10 and reinforcement-learning-mdp-markov-decision-process-tutorial-img-11

In our context, we will follow the first order of the Markov property from now on. Therefore, we can convert any process to a Markov property if the probability of the new state, say reinforcement-learning-mdp-markov-decision-process-tutorial-img-12, depends only on the current state, reinforcement-learning-mdp-markov-decision-process-tutorial-img-13, such that the current state captures and remembers the property and knowledge from the past. Thus, as per the Markov property, the world (that is, the environment) is considered to be stationary, that is, the rules in the world are fixed.

The S state set


The S state set is a set of different states, represented as s, which constitute the environment. States are the feature representation of the data obtained from the environment. Thus, any input from the agent's sensors can play an important role in state formation. State spaces can be either discrete or continuous. The starts from start state and has to reach the goal state in the most optimized path without ending up in bad states (like the red colored state shown in the diagram below).

Consider the following gridworld as having 12 discrete states, where the green-colored grid is the goal state, red is the state to avoid, and black is a wall that you'll bounce back from if you hit it head on:

reinforcement-learning-mdp-markov-decision-process-tutorial-img-14

The states can be represented as 1, 2,....., 12 or by coordinates, (1,1),(1,2),.....(3,4).

Actions


The actions are the things an agent can perform or execute in a particular state. In other words, actions are sets of things an agent is allowed to do in the given environment. Like states, actions can also be either discrete or continuous.

Consider the following gridworld example having 12 discrete states and 4 discrete actions (UPDOWN, RIGHT, and LEFT):

reinforcement-learning-mdp-markov-decision-process-tutorial-img-15

The preceding example shows the action space to be a discrete set space, that is, reinforcement-learning-mdp-markov-decision-process-tutorial-img-16 A where, A = {UP, DOWN, RIGHT, and LEFT}. It can also be treated as a function of state, that is, a = A(s), where depending on the state function, it decides which action is possible.


Transition model


The transition model T(s, a, s') is a function of three variables, which are the current state (s), action (a), and the new state (s'), and defines the rules to play the game in the environment. It gives probability P(s'|s, a), that is, the probability of landing up in the new s' state given that the agent takes an action, a, in given state, s.

The transition model plays the crucial role in a stochastic world, unlike the case of a deterministic world where the probability for any landing state other than the determined one will have zero probability.

Let's consider the following environment (world) and consider different cases, determined and stochastic:

reinforcement-learning-mdp-markov-decision-process-tutorial-img-17

Since the actions reinforcement-learning-mdp-markov-decision-process-tutorial-img-18 A where, A = {UP, DOWN, RIGHT, and LEFT}.

The behavior of these two cases depends on certain factors:

  • Determined environment: In a determined environment, if you take a certain action, say UP, you will certainly perform that action with probability 1.
  • Stochastic environment: In a stochastic environment, if you take the same action, say UP, there will certain probability say 0.8 to actually perform the given action and there is 0.1 probability it can perform an action (either LEFT or RIGHT) perpendicular to the given action, UP. Here, for the s state and the UP action transition model, T(s',UP, s) = P(s'| s,UP) = 0.8.


Since T(s,a,s') ~ P(s'|s,a), where the probability of new state depends on the current state and action only, and none of the past states. Thus, the transition model follows the first order Markov property.

We can also say that our universe is also a stochastic environment, since the universe is composed of atoms that are in different states defined by position and velocity. Actions performed by each atom change their states and cause changes in the universe.

Rewards


The reward of the state quantifies the usefulness of entering into a state. There are three different forms to represent the reward namely, R(s)R(s, a) and R(s, a, s'), but they are all equivalent.

For a particular environment, the domain knowledge plays an important role in the assignment of rewards for different states as minor changes in the reward do matter for finding the optimal solution to an MDP problem.

There are two approaches we reward our agent for when taking a certain action. They are:

  • Credit assignment problem: We look at the past and check which actions led to the present reward, that is, which action gets the credit
  • Delayed rewards: In contrast, in the present state, we check which action to take that will lead us to potential rewards


Delayed rewards form the idea of foresight planning. Therefore, this concept is being used to calculate the expected reward for different states. We will discuss this in the later sections.

Policy


Until now, we have covered the blocks that create an MDP problem, that is, states, actions, transition models, and rewards, now comes the solution. The policy is the solution to an MDP problem.

reinforcement-learning-mdp-markov-decision-process-tutorial-img-19

The policy is a function that takes the state as an input and outputs the action to be taken. Therefore, the policy is a command that the agent has to obey.

reinforcement-learning-mdp-markov-decision-process-tutorial-img-20 is called the optimal policy, which maximizes the expected reward. Among all the policies taken, the optimal policy is the one that optimizes to maximize the amount of reward received or expected to receive over a lifetime. For an MDP, there's no end of the lifetime and you have to decide the end time.

Thus, the policy is nothing but a guide telling which action to take for a given state. It is not a plan but uncovers the underlying plan of the environment by returning the actions to take for each state.

The Bellman equations


Since the optimal reinforcement-learning-mdp-markov-decision-process-tutorial-img-21 policy is the policy that maximizes the expected rewards, therefore,

reinforcement-learning-mdp-markov-decision-process-tutorial-img-22,

where reinforcement-learning-mdp-markov-decision-process-tutorial-img-23 means the expected value of the rewards obtained from the sequence of states agent observes if it follows the reinforcement-learning-mdp-markov-decision-process-tutorial-img-24 policy. Thus, reinforcement-learning-mdp-markov-decision-process-tutorial-img-25 outputs the reinforcement-learning-mdp-markov-decision-process-tutorial-img-26 policy that has the highest expected reward.

Similarly, we can also calculate the utility of the policy of a state, that is, if we are at the s state, given a reinforcement-learning-mdp-markov-decision-process-tutorial-img-27 policy, then, the utility of the reinforcement-learning-mdp-markov-decision-process-tutorial-img-28 policy for the s state, that is, reinforcement-learning-mdp-markov-decision-process-tutorial-img-29 would be the expected rewards from that state onward:

reinforcement-learning-mdp-markov-decision-process-tutorial-img-30

The immediate reward of the state, that is, reinforcement-learning-mdp-markov-decision-process-tutorial-img-31 is different than the utility of the reinforcement-learning-mdp-markov-decision-process-tutorial-img-32 state (that is, the utility of the optimal policy of the reinforcement-learning-mdp-markov-decision-process-tutorial-img-33 state) because of the concept of delayed rewards. From now onward, the utility of the reinforcement-learning-mdp-markov-decision-process-tutorial-img-34 state will refer to the utility of the optimal policy of the state, that is, the reinforcement-learning-mdp-markov-decision-process-tutorial-img-35 state.

Moreover, the optimal policy can also be regarded as the policy that maximizes the expected utility. Therefore,

reinforcement-learning-mdp-markov-decision-process-tutorial-img-36

where, T(s,a,s') is the transition probability, that is, P(s'|s,a) and U(s') is the utility of the new landing state after the a action is taken on the s state.

reinforcement-learning-mdp-markov-decision-process-tutorial-img-37 refers to the summation of all possible new state outcomes for a particular action taken, then whichever action gives the maximum value of reinforcement-learning-mdp-markov-decision-process-tutorial-img-38 that is considered to be the part of the optimal policy and thereby, the utility of the 's' state is given by the following Bellman equation,

reinforcement-learning-mdp-markov-decision-process-tutorial-img-39

where, reinforcement-learning-mdp-markov-decision-process-tutorial-img-40 is the immediate reward and reinforcement-learning-mdp-markov-decision-process-tutorial-img-41 is the reward from future, that is, the discounted utilities of the 's' state where the agent can reach from the given s state if the action, a, is taken.

Solving the Bellman equation to find policies


Say we have some n states in the given environment and if we see the Bellman equation,

reinforcement-learning-mdp-markov-decision-process-tutorial-img-42

we find out that n states are given; therefore, we will have n equations and n unknown but the reinforcement-learning-mdp-markov-decision-process-tutorial-img-43 function makes it non-linear. Thus, we cannot solve them as linear equations.

Therefore, in order to solve:

  • Start with an arbitrary utility
  • Update the utilities based on the neighborhood until convergence, that is, update the utility of the state using the Bellman equation based on the utilities of the landing states from the given state


Iterate this multiple times to lead to the true value of the states. This process of iterating to convergence towards the true value of the state is called value iteration.

For the terminal states where the game ends, the utility of those terminal state equals the immediate reward the agent receives while entering the terminal state.

Let's try to understand this by implementing an example.

An example of value iteration using the Bellman equation


Consider the following environment and the given information:

reinforcement-learning-mdp-markov-decision-process-tutorial-img-44

Given information:

  • AC, and X are the names of some states.
  • The green-colored state is the goal state, G, with a reward of +1.
  • The red-colored state is the bad state, B, with a reward of -1, try to prevent your agent from entering this state
  • Thus, the green and red states are the terminal states, enter either and the game is over. If the agent encounters the green state, that is, the goal state, the agent wins, while if they enter the red state, then the agent loses the game.
  • reinforcement-learning-mdp-markov-decision-process-tutorial-img-45reinforcement-learning-mdp-markov-decision-process-tutorial-img-46 (that is, reward for all states except the G and B states is -0.04), reinforcement-learning-mdp-markov-decision-process-tutorial-img-47 (that is, the utility at the first time step is 0, except the G and B states).
  • Transition probability T(s,a,s') equals 0.8 if going in the desired direction; otherwise, 0.1 each if going perpendicular to the desired direction. For example, if the action is UP then with 0.8 probability, the agent goes UP but with 0.1 probability it goes RIGHT and 0.1 to the LEFT.


Questions:

  1.  Find reinforcement-learning-mdp-markov-decision-process-tutorial-img-48, the utility of the X state at time step 1, that is, the agent will go through one iteration
  2. Similarly, find reinforcement-learning-mdp-markov-decision-process-tutorial-img-49

Solution:


reinforcement-learning-mdp-markov-decision-process-tutorial-img-50

reinforcement-learning-mdp-markov-decision-process-tutorial-img-51

R(X) = -0.04

Action as'reinforcement-learning-mdp-markov-decision-process-tutorial-img-52reinforcement-learning-mdp-markov-decision-process-tutorial-img-53reinforcement-learning-mdp-markov-decision-process-tutorial-img-54RIGHT

Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at €18.99/month. Cancel anytime

G


0.8+10.8 x 1 = 0.8RIGHTC0.100.1 x 0 = 0RIGHTX0.100.1 x 0 = 0

Thus, for action a = RIGHT,

reinforcement-learning-mdp-markov-decision-process-tutorial-img-55Action as'reinforcement-learning-mdp-markov-decision-process-tutorial-img-56reinforcement-learning-mdp-markov-decision-process-tutorial-img-57reinforcement-learning-mdp-markov-decision-process-tutorial-img-58DOWN

C


0.800.8 x 0 = 0DOWNG0.1+10.1 x 1 = 0.1DOWNA0.100.1 x 0 = 0

Thus, for action a = DOWN,

reinforcement-learning-mdp-markov-decision-process-tutorial-img-59Action as'reinforcement-learning-mdp-markov-decision-process-tutorial-img-60reinforcement-learning-mdp-markov-decision-process-tutorial-img-61reinforcement-learning-mdp-markov-decision-process-tutorial-img-62UP

X


0.800.8 x 0 = 0UPG0.1+10.1 x 1 = 0.1UPA0.100.1 x 0 = 0

Thus, for action a = UP,

reinforcement-learning-mdp-markov-decision-process-tutorial-img-63Action as'reinforcement-learning-mdp-markov-decision-process-tutorial-img-64reinforcement-learning-mdp-markov-decision-process-tutorial-img-65reinforcement-learning-mdp-markov-decision-process-tutorial-img-66LEFT

A


0.800.8 x 0 = 0LEFTX0.100.1 x 0 = 0LEFTC0.100.1 x 0 = 0

Thus, for action a = LEFT,

reinforcement-learning-mdp-markov-decision-process-tutorial-img-67

Therefore, among all actions,

reinforcement-learning-mdp-markov-decision-process-tutorial-img-68

Therefore, reinforcement-learning-mdp-markov-decision-process-tutorial-img-69, where reinforcement-learning-mdp-markov-decision-process-tutorial-img-70 and reinforcement-learning-mdp-markov-decision-process-tutorial-img-71

Similarly, calculate reinforcement-learning-mdp-markov-decision-process-tutorial-img-72 and reinforcement-learning-mdp-markov-decision-process-tutorial-img-73 and we get reinforcement-learning-mdp-markov-decision-process-tutorial-img-74 and reinforcement-learning-mdp-markov-decision-process-tutorial-img-75


Since, reinforcement-learning-mdp-markov-decision-process-tutorial-img-76, and,

reinforcement-learning-mdp-markov-decision-process-tutorial-img-77

R(X) = -0.04

Action as'reinforcement-learning-mdp-markov-decision-process-tutorial-img-78reinforcement-learning-mdp-markov-decision-process-tutorial-img-79reinforcement-learning-mdp-markov-decision-process-tutorial-img-80RIGHT

G


0.8+10.8 x 1 = 0.8RIGHTC0.1-0.040.1 x -0.04 = -0.004RIGHTX0.10.360.1 x 0.36 = 0.036

Thus, for action a = RIGHT,

reinforcement-learning-mdp-markov-decision-process-tutorial-img-81Action as'reinforcement-learning-mdp-markov-decision-process-tutorial-img-82reinforcement-learning-mdp-markov-decision-process-tutorial-img-83reinforcement-learning-mdp-markov-decision-process-tutorial-img-84DOWN

C


0.8-0.040.8 x -0.04 = -0.032DOWNG0.1+10.1 x 1 = 0.1DOWNA0.1-0.040.1 x -0.04 = -0.004

Thus, for action a = DOWN,

reinforcement-learning-mdp-markov-decision-process-tutorial-img-85Action as'reinforcement-learning-mdp-markov-decision-process-tutorial-img-86reinforcement-learning-mdp-markov-decision-process-tutorial-img-87reinforcement-learning-mdp-markov-decision-process-tutorial-img-88UP

X


0.80.360.8 x 0.36 = 0.288UPG0.1+10.1 x 1 = 0.1UPA0.1-0.040.1 x -0.04 = -0.004

Thus, for action a = UP,

reinforcement-learning-mdp-markov-decision-process-tutorial-img-89Action as'reinforcement-learning-mdp-markov-decision-process-tutorial-img-90reinforcement-learning-mdp-markov-decision-process-tutorial-img-91reinforcement-learning-mdp-markov-decision-process-tutorial-img-92LEFT

A


0.8-0.040.8 x -0.04 = -0.032LEFTX0.10.360.1 x 0.36 = 0.036LEFTC0.1-0.040.1 x -0.04 = -0.004

Thus, for action a = LEFT,

reinforcement-learning-mdp-markov-decision-process-tutorial-img-93

Therefore, among all actions,

reinforcement-learning-mdp-markov-decision-process-tutorial-img-94


Therefore, reinforcement-learning-mdp-markov-decision-process-tutorial-img-95, where reinforcement-learning-mdp-markov-decision-process-tutorial-img-96 and reinforcement-learning-mdp-markov-decision-process-tutorial-img-97

Therefore, the answers to the preceding questions are:

  1.  reinforcement-learning-mdp-markov-decision-process-tutorial-img-98
  2.  reinforcement-learning-mdp-markov-decision-process-tutorial-img-99

Policy iteration


The process of obtaining optimal utility by iterating over the policy and updating the policy itself instead of value until the policy converges to the optimum is called policy iteration. The process of policy iteration is as follows:

  • Start with a random policy, reinforcement-learning-mdp-markov-decision-process-tutorial-img-100
  • For the given reinforcement-learning-mdp-markov-decision-process-tutorial-img-101 policy at iteration step t, calculate reinforcement-learning-mdp-markov-decision-process-tutorial-img-102 by using the following formula:reinforcement-learning-mdp-markov-decision-process-tutorial-img-103
  • Improve the reinforcement-learning-mdp-markov-decision-process-tutorial-img-104 policy by
    reinforcement-learning-mdp-markov-decision-process-tutorial-img-105


This ends an interesting reinforcement learning tutorial. Want to implement state-of-the-art Reinforcement Learning algorithms from scratch? Get this best-selling title, Reinforcement Learning with TensorFlow.

How Reinforcement Learning works

Convolutional Neural Networks with Reinforcement Learning

Getting started with Q-learning using TensorFlow