The Markov decision process, better known as MDP, is an approach in reinforcement learning for making decisions in a gridworld environment. A gridworld environment consists of states in the form of grids.
The MDP tries to capture a world in the form of a grid by dividing it into states, actions, models/transition models, and rewards. The solution to an MDP is called a policy, and the objective is to find the optimal policy for that MDP task.
Thus, any reinforcement learning task composed of a set of states, actions, and rewards that follows the Markov property would be considered an MDP.
In this tutorial, we will dig deep into MDPs, states, actions, rewards, policies, and how to solve them using Bellman equations.
This article is a reinforcement learning tutorial taken from the book Reinforcement Learning with TensorFlow.
An MDP is defined as the collection of the following:

- States: S
- Actions: A(s), A
- Transition model: T(s, a, s') ~ P(s'|s, a)
- Rewards: R(s), R(s, a), R(s, a, s')
- Discount factor: γ
In the case of an MDP, the environment is fully observable, that is, whatever observation the agent makes at any point in time is enough to make an optimal decision. In the case of a partially observable environment, the agent needs a memory to store the past observations in order to make the best possible decisions.
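To make this collection concrete, here is a minimal sketch of an MDP container in Python; the field names and type aliases are illustrative assumptions, not code from the book.

```python
# A minimal sketch of the MDP tuple; field names are illustrative assumptions.
from typing import Callable, List, NamedTuple

State = str    # could equally be an int or a (row, column) pair
Action = str

class MDP(NamedTuple):
    states: List[State]                                   # S: the set of states
    actions: Callable[[State], List[Action]]              # A(s): actions allowed in s
    transition: Callable[[State, Action, State], float]   # T(s, a, s') = P(s'|s, a)
    reward: Callable[[State], float]                      # R(s): immediate reward
    gamma: float                                          # discount factor
```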
Let's try to break this down into different Lego blocks to understand what this overall process means.
In short, as per the Markov property, in order to know the information of the near future (say, at time t+1), only the present information at time t matters.
Given a sequence, s_1, s_2, ..., s_t, the first order of Markov says that

P(s_t | s_{t-1}, s_{t-2}, ..., s_1) = P(s_t | s_{t-1})

that is, s_t depends only on s_{t-1}. Therefore, s_{t+1} will depend only on s_t. The second order of Markov says that

P(s_t | s_{t-1}, s_{t-2}, ..., s_1) = P(s_t | s_{t-1}, s_{t-2})

that is, s_t depends only on s_{t-1} and s_{t-2}.
In our context, we will follow the first order of the Markov property from now on. Therefore, we can convert any process to a Markov property if the probability of the new state, say s_{t+1}, depends only on the current state, s_t, such that the current state captures and remembers the property and knowledge from the past. Thus, as per the Markov property, the world (that is, the environment) is considered to be stationary, that is, the rules of the world are fixed.
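As a quick illustration of the first-order Markov property, the sketch below samples the next state from a distribution that depends only on the current state; the two states and their probabilities are made up for the example.

```python
import random

# P[s][s'] = P(s_{t+1} = s' | s_t = s); made-up numbers for illustration.
P = {
    "s1": {"s1": 0.8, "s2": 0.2},
    "s2": {"s1": 0.4, "s2": 0.6},
}

def next_state(current: str) -> str:
    """Sample s_{t+1} using only s_t, never the earlier history."""
    states, probs = zip(*P[current].items())
    return random.choices(states, weights=probs)[0]

history = ["s1"]
for _ in range(5):
    history.append(next_state(history[-1]))   # only the latest state is used
print(history)
```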
The S state set is a set of different states, represented as s, which constitute the environment. States are the feature representation of the data obtained from the environment. Thus, any input from the agent's sensors can play an important role in state formation. State spaces can be either discrete or continuous. The agent starts from the start state and has to reach the goal state along the most optimal path without ending up in bad states (like the red-colored state shown in the diagram below).
Consider the following gridworld as having 12 discrete states, where the green-colored grid is the goal state, red is the state to avoid, and black is a wall that you'll bounce back from if you hit it head on:
The states can be represented as 1, 2, ..., 12 or by coordinates, (1,1), (1,2), ..., (3,4).
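For reference, a short snippet showing both representations of the 12 discrete states; the 3 x 4 grid dimensions follow the diagram above.

```python
# The 12 gridworld states, either as ids 1..12 or as (row, column) coordinates.
ROWS, COLS = 3, 4

states_as_ids = list(range(1, ROWS * COLS + 1))             # 1, 2, ..., 12
states_as_coords = [(r, c) for r in range(1, ROWS + 1)
                           for c in range(1, COLS + 1)]     # (1,1), (1,2), ..., (3,4)
```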
The actions are the things an agent can perform or execute in a particular state. In other words, actions are sets of things an agent is allowed to do in the given environment. Like states, actions can also be either discrete or continuous.
Consider the following gridworld example having 12 discrete states and 4 discrete actions (UP, DOWN, RIGHT, and LEFT):
The preceding example shows the action space to be a discrete set, that is, a ∈ A, where A = {UP, DOWN, RIGHT, LEFT}. It can also be treated as a function of state, that is, a = A(s), where A(s) returns the set of actions possible in the given s state.
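Here is a small sketch of the discrete action space and of an A(s)-style function; which cells count as the goal or the wall is an assumption made for illustration, not something fixed by the text.

```python
# The discrete action set A and a state-dependent A(s).
A = ["UP", "DOWN", "RIGHT", "LEFT"]

GOAL_STATES = {12}   # assumed id of the green goal state
WALL_STATES = {6}    # assumed id of the black wall cell

def available_actions(s: int):
    """A(s): no actions are available in the goal state or inside a wall."""
    if s in GOAL_STATES or s in WALL_STATES:
        return []
    return list(A)
```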
The transition model T(s, a, s') is a function of three variables, which are the current state (s), the action (a), and the new state (s'), and it defines the rules of the game in the environment. It gives the probability P(s'|s, a), that is, the probability of landing in the new state s' given that the agent takes the action a in the given state s.
The transition model plays a crucial role in a stochastic world, unlike in a deterministic world, where any landing state other than the determined one has zero probability.
Let's consider the following environment (world) and examine the different cases, deterministic and stochastic:
The actions are a ∈ A, where A = {UP, DOWN, RIGHT, LEFT}.
The behavior of these two cases depends on certain factors:
Since T(s,a,s') ~ P(s'|s,a), the probability of the new state depends only on the current state and action, and on none of the past states. Thus, the transition model follows the first-order Markov property.
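The sketch below implements one possible stochastic transition model for the 3 x 4 gridworld: probability 0.8 for the intended move and 0.1 for each perpendicular move, bouncing back off walls and grid edges. The 0.8/0.1/0.1 split matches the worked example later in this article; the wall location is an assumption.

```python
# A stochastic gridworld transition model T(s, a, s') ~ P(s'|s, a).
ROWS, COLS = 3, 4
WALLS = {(2, 2)}   # assumed location of the black wall cell

MOVES = {"UP": (-1, 0), "DOWN": (1, 0), "RIGHT": (0, 1), "LEFT": (0, -1)}
PERPENDICULAR = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
                 "RIGHT": ("UP", "DOWN"), "LEFT": ("UP", "DOWN")}

def move(s, direction):
    """Deterministic move; bounce back if the target is a wall or off the grid."""
    r, c = s
    dr, dc = MOVES[direction]
    target = (r + dr, c + dc)
    if target in WALLS or not (1 <= target[0] <= ROWS and 1 <= target[1] <= COLS):
        return s
    return target

def transition(s, a):
    """Return {s': P(s'|s, a)}: 0.8 intended, 0.1 for each perpendicular drift."""
    out = {}
    for direction, p in [(a, 0.8), (PERPENDICULAR[a][0], 0.1), (PERPENDICULAR[a][1], 0.1)]:
        s_prime = move(s, direction)
        out[s_prime] = out.get(s_prime, 0.0) + p
    return out

print(transition((3, 1), "UP"))   # e.g. {(2, 1): 0.8, (3, 1): 0.1, (3, 2): 0.1}
```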
We can also say that our universe is also a stochastic environment, since the universe is composed of atoms that are in different states defined by position and velocity. Actions performed by each atom change their states and cause changes in the universe.
The reward of the state quantifies the usefulness of entering a state. There are three different forms to represent the reward, namely R(s), R(s, a), and R(s, a, s'), but they are all equivalent.
For a particular environment, domain knowledge plays an important role in the assignment of rewards for different states, as even minor changes in the reward can matter a great deal for finding the optimal solution to an MDP problem.
There are two approaches to rewarding our agent for taking a certain action. They are:
Delayed rewards form the idea of foresight planning; this concept is used to calculate the expected reward for different states. We will discuss this in the later sections.
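As a small illustration, here is an R(s)-style reward function for the gridworld; the -0.04 step cost and +1 goal reward match the worked example later in this article, while the -1 penalty for the red state and the cell coordinates are assumptions.

```python
GOAL_STATE = (1, 4)   # assumed coordinates of the green goal state
BAD_STATE = (2, 4)    # assumed coordinates of the red state to avoid

def R(s):
    """Reward for entering state s."""
    if s == GOAL_STATE:
        return 1.0
    if s == BAD_STATE:
        return -1.0
    return -0.04      # a small step cost nudges the agent toward the goal
```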
Until now, we have covered the blocks that make up an MDP problem, that is, states, actions, transition models, and rewards. Now comes the solution: the policy is the solution to an MDP problem.
The policy is a function that takes the state as an input and outputs the action to be taken. Therefore, the policy is a command that the agent has to obey.
π* is called the optimal policy, which maximizes the expected reward. Among all the policies, the optimal policy is the one that maximizes the amount of reward received, or expected to be received, over a lifetime. For an MDP, the lifetime has no predefined end, so you have to decide the end time.
Thus, the policy is nothing but a guide telling the agent which action to take in a given state. It is not a plan, but it uncovers the underlying plan of the environment by returning the action to take for each state.
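Concretely, a policy for a small discrete gridworld can be nothing more than a lookup table from states to actions; the entries below are illustrative only.

```python
# A policy maps every (non-terminal) state to the action to take there.
policy = {
    (3, 1): "UP",
    (2, 1): "UP",
    (1, 1): "RIGHT",
    (1, 2): "RIGHT",
    (1, 3): "RIGHT",   # and so on for the remaining states
}

def act(state):
    """The agent simply obeys the policy for its current state."""
    return policy[state]
```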
Since the optimal policy is the policy that maximizes the expected rewards, therefore,

π* = argmax_π E[Σ_t R(s_t) | π]

where E[Σ_t R(s_t) | π] means the expected value of the rewards obtained from the sequence of states the agent observes if it follows the π policy. Thus, argmax_π outputs the π policy that has the highest expected reward.
Similarly, we can also calculate the utility of a policy for a state: if we are in the s state, given a π policy, then the utility of the π policy for the s state, that is, U^π(s), would be the expected rewards from that state onward:

U^π(s) = E[Σ_t R(s_t) | π, s_0 = s]
The immediate reward of the state, that is, R(s), is different from the utility of the state (that is, the utility of the optimal policy of the state) because of the concept of delayed rewards. From now onward, the utility of the state will refer to the utility of the optimal policy of the state, that is, U(s).
Moreover, the optimal policy can also be regarded as the policy that maximizes the expected utility. Therefore,

π*(s) = argmax_a Σ_s' T(s, a, s') U(s')

where T(s, a, s') is the transition probability, that is, P(s'|s, a), and U(s') is the utility of the new landing state after the a action is taken in the s state.
Σ_s' T(s, a, s') U(s') refers to the summation over all possible new-state outcomes for a particular action taken; whichever action gives the maximum value of Σ_s' T(s, a, s') U(s') is considered to be part of the optimal policy, and thereby the utility of the s state is given by the following Bellman equation:

U(s) = R(s) + γ max_a Σ_s' T(s, a, s') U(s')

where R(s) is the immediate reward and γ max_a Σ_s' T(s, a, s') U(s') is the reward from the future, that is, the discounted utilities of the s' states that the agent can reach from the given s state if the a action is taken.
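Written as code, a single Bellman backup for one non-terminal state looks like the sketch below; the transition function is assumed to return a {s': P(s'|s, a)} dictionary, as in the earlier transition-model sketch.

```python
def bellman_backup(s, U, transition, R, actions, gamma):
    """U(s) = R(s) + gamma * max_a sum_{s'} T(s, a, s') * U(s') for a non-terminal s."""
    best = max(sum(p * U[s_prime] for s_prime, p in transition(s, a).items())
               for a in actions(s))
    return R(s) + gamma * best
```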
Say we have n states in the given environment. If we look at the Bellman equation,

U(s) = R(s) + γ max_a Σ_s' T(s, a, s') U(s')

we find that the n states give us n equations with n unknowns, but the max_a function makes them non-linear. Thus, we cannot solve them as a system of linear equations.
Therefore, in order to solve them:

- Start with an arbitrary utility for every state (for example, zero).
- Update the utility of each state from the utilities of its neighboring (landing) states, using the Bellman update U(s) ← R(s) + γ max_a Σ_s' T(s, a, s') U(s').

Iterate this multiple times to converge towards the true value of the states. This process of iterating to convergence towards the true value of the states is called value iteration.
For the terminal states, where the game ends, the utility of a terminal state equals the immediate reward the agent receives on entering it.
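Before walking through the example by hand, here is a minimal value-iteration sketch for the 3 x 4 gridworld, assuming the 0.8/0.1/0.1 transition model, the -0.04 step reward, and the γ = 0.5 discount factor used in the worked example below; the exact grid layout (goal, red state, wall) is an assumption based on the diagram.

```python
# Value iteration on an assumed 3 x 4 gridworld.
ROWS, COLS = 3, 4
WALLS = {(2, 2)}                            # assumed wall cell
TERMINALS = {(1, 4): 1.0, (2, 4): -1.0}     # assumed goal (+1) and red state (-1)
STEP_REWARD = -0.04
GAMMA = 0.5

MOVES = {"UP": (-1, 0), "DOWN": (1, 0), "RIGHT": (0, 1), "LEFT": (0, -1)}
PERP = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
        "RIGHT": ("UP", "DOWN"), "LEFT": ("UP", "DOWN")}

states = [(r, c) for r in range(1, ROWS + 1) for c in range(1, COLS + 1)
          if (r, c) not in WALLS]

def move(s, d):
    """Deterministic move; bounce back off walls and grid edges."""
    r, c = s
    dr, dc = MOVES[d]
    t = (r + dr, c + dc)
    return s if t in WALLS or not (1 <= t[0] <= ROWS and 1 <= t[1] <= COLS) else t

def transitions(s, a):
    """{s': P(s'|s, a)} with 0.8 intended and 0.1 for each perpendicular drift."""
    out = {}
    for d, p in [(a, 0.8), (PERP[a][0], 0.1), (PERP[a][1], 0.1)]:
        t = move(s, d)
        out[t] = out.get(t, 0.0) + p
    return out

def value_iteration(eps=1e-4):
    # Terminal utilities equal their immediate rewards; everything else starts at 0.
    U = {s: TERMINALS.get(s, 0.0) for s in states}
    while True:
        new_U, delta = dict(U), 0.0
        for s in states:
            if s in TERMINALS:
                continue
            best = max(sum(p * U[t] for t, p in transitions(s, a).items())
                       for a in MOVES)
            new_U[s] = STEP_REWARD + GAMMA * best
            delta = max(delta, abs(new_U[s] - U[s]))
        U = new_U
        if delta < eps:
            return U

for s, u in sorted(value_iteration().items()):
    print(s, round(u, 3))
```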
Let's try to understand this by working through an example by hand.
Consider the following environment and the given information. The environment is a small gridworld in which the goal state, G, lies to the RIGHT of the X state, the A state lies to its LEFT, the C state lies below it, and a wall (or the grid boundary) lies above it, so an UP move from X bounces back.

Given information:

- Transition model: the agent moves in the intended direction with probability 0.8 and drifts perpendicular to it with probability 0.1 on each side; a move into a wall or off the grid bounces the agent back to its current state.
- Rewards: R(s) = -0.04 for every non-terminal state; the terminal goal state, G, has a reward of +1.
- Discount factor: γ = 0.5.
- The utilities of all non-terminal states are initialized to 0, and the utility of a terminal state equals its immediate reward.
Questions:

- Which action should the agent take from the X state?
- What is the utility of the X state after the first and second iterations of value iteration?
Solution:
For the first iteration, R(X) = -0.04 and the initial utilities are U(G) = +1 and U(A) = U(C) = U(X) = 0:
| Action a | s' | T(s, a, s') | U(s') | T(s, a, s') × U(s') |
|---|---|---|---|---|
| RIGHT | G | 0.8 | +1 | 0.8 × 1 = 0.8 |
| RIGHT | C | 0.1 | 0 | 0.1 × 0 = 0 |
| RIGHT | X | 0.1 | 0 | 0.1 × 0 = 0 |
Thus, for action a = RIGHT, Σ_s' T(s, a, s') U(s') = 0.8 + 0 + 0 = 0.8.
| Action a | s' | T(s, a, s') | U(s') | T(s, a, s') × U(s') |
|---|---|---|---|---|
| DOWN | C | 0.8 | 0 | 0.8 × 0 = 0 |
| DOWN | G | 0.1 | +1 | 0.1 × 1 = 0.1 |
| DOWN | A | 0.1 | 0 | 0.1 × 0 = 0 |
Thus, for action a = DOWN, Σ_s' T(s, a, s') U(s') = 0 + 0.1 + 0 = 0.1.
| Action a | s' | T(s, a, s') | U(s') | T(s, a, s') × U(s') |
|---|---|---|---|---|
| UP | X | 0.8 | 0 | 0.8 × 0 = 0 |
| UP | G | 0.1 | +1 | 0.1 × 1 = 0.1 |
| UP | A | 0.1 | 0 | 0.1 × 0 = 0 |
Thus, for action a = UP, Σ_s' T(s, a, s') U(s') = 0 + 0.1 + 0 = 0.1.
| Action a | s' | T(s, a, s') | U(s') | T(s, a, s') × U(s') |
|---|---|---|---|---|
| LEFT | A | 0.8 | 0 | 0.8 × 0 = 0 |
| LEFT | X | 0.1 | 0 | 0.1 × 0 = 0 |
| LEFT | C | 0.1 | 0 | 0.1 × 0 = 0 |
Thus, for action a = LEFT, Σ_s' T(s, a, s') U(s') = 0 + 0 + 0 = 0.
Therefore, among all actions, max_a Σ_s' T(s, a, s') U(s') = 0.8, for the RIGHT action.
Therefore, U(X) = R(X) + γ max_a Σ_s' T(s, a, s') U(s') = -0.04 + 0.5 × 0.8 = 0.36, where R(X) = -0.04 and γ = 0.5.
Similarly, calculate U(C) and U(A), and we get U(C) = -0.04 and U(A) = -0.04.
Since U(X) = 0.36, U(C) = -0.04, and U(A) = -0.04 after the first iteration, we use these utilities for the second iteration.
For the second iteration, R(X) = -0.04 and the utilities are those obtained above:
| Action a | s' | T(s, a, s') | U(s') | T(s, a, s') × U(s') |
|---|---|---|---|---|
| RIGHT | G | 0.8 | +1 | 0.8 × 1 = 0.8 |
| RIGHT | C | 0.1 | -0.04 | 0.1 × -0.04 = -0.004 |
| RIGHT | X | 0.1 | 0.36 | 0.1 × 0.36 = 0.036 |
Thus, for action a = RIGHT, Σ_s' T(s, a, s') U(s') = 0.8 - 0.004 + 0.036 = 0.832.
| Action a | s' | T(s, a, s') | U(s') | T(s, a, s') × U(s') |
|---|---|---|---|---|
| DOWN | C | 0.8 | -0.04 | 0.8 × -0.04 = -0.032 |
| DOWN | G | 0.1 | +1 | 0.1 × 1 = 0.1 |
| DOWN | A | 0.1 | -0.04 | 0.1 × -0.04 = -0.004 |
Thus, for action a = DOWN, Σ_s' T(s, a, s') U(s') = -0.032 + 0.1 - 0.004 = 0.064.
| Action a | s' | T(s, a, s') | U(s') | T(s, a, s') × U(s') |
|---|---|---|---|---|
| UP | X | 0.8 | 0.36 | 0.8 × 0.36 = 0.288 |
| UP | G | 0.1 | +1 | 0.1 × 1 = 0.1 |
| UP | A | 0.1 | -0.04 | 0.1 × -0.04 = -0.004 |
Thus, for action a = UP, Σ_s' T(s, a, s') U(s') = 0.288 + 0.1 - 0.004 = 0.384.
| Action a | s' | T(s, a, s') | U(s') | T(s, a, s') × U(s') |
|---|---|---|---|---|
| LEFT | A | 0.8 | -0.04 | 0.8 × -0.04 = -0.032 |
| LEFT | X | 0.1 | 0.36 | 0.1 × 0.36 = 0.036 |
| LEFT | C | 0.1 | -0.04 | 0.1 × -0.04 = -0.004 |
Thus, for action a = LEFT, Σ_s' T(s, a, s') U(s') = -0.032 + 0.036 - 0.004 = 0.
Therefore, among all actions, max_a Σ_s' T(s, a, s') U(s') = 0.832, again for the RIGHT action.
Therefore, U(X) = R(X) + γ max_a Σ_s' T(s, a, s') U(s') = -0.04 + 0.5 × 0.832 = 0.376, where R(X) = -0.04 and γ = 0.5.
Therefore, the answers to the preceding questions are:

- The agent should take the RIGHT action from the X state, since RIGHT maximizes the expected utility in both iterations.
- The utility of the X state is 0.36 after the first iteration and 0.376 after the second iteration.
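The hand calculation above can be checked with a few lines of Python that apply the same Bellman backup to the X state; the neighbour layout (G to the RIGHT, A to the LEFT, C below, bounce-back above) is taken from the transition tables.

```python
GAMMA, R_X = 0.5, -0.04

def backup_X(U):
    """One Bellman backup for the X state; U holds the utilities of G, C, A, X."""
    q = {
        "RIGHT": 0.8 * U["G"] + 0.1 * U["C"] + 0.1 * U["X"],
        "DOWN":  0.8 * U["C"] + 0.1 * U["G"] + 0.1 * U["A"],
        "UP":    0.8 * U["X"] + 0.1 * U["G"] + 0.1 * U["A"],
        "LEFT":  0.8 * U["A"] + 0.1 * U["X"] + 0.1 * U["C"],
    }
    best = max(q, key=q.get)
    return best, R_X + GAMMA * q[best]

print(backup_X({"G": 1.0, "C": 0.0, "A": 0.0, "X": 0.0}))       # first iteration: ('RIGHT', ~0.36)
print(backup_X({"G": 1.0, "C": -0.04, "A": -0.04, "X": 0.36}))  # second iteration: ('RIGHT', ~0.376)
```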
The process of obtaining the optimal utility by iterating over the policy, and updating the policy itself instead of the value, until the policy converges to the optimum is called policy iteration. The process of policy iteration is as follows:

1. Start with a random policy, π.
2. Policy evaluation: for the current π policy, compute the utility U^π(s) of every state.
3. Policy improvement: update the policy greedily with respect to those utilities, that is, π(s) = argmax_a Σ_s' T(s, a, s') U^π(s').
4. Repeat steps 2 and 3 until the policy no longer changes.
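A minimal, generic policy-iteration sketch following these steps is given below; the function signatures are assumptions (T(s, a) returning a {s': probability} dictionary, R(s) returning the immediate reward), and terminal-state handling is omitted for brevity.

```python
import random

def policy_iteration(states, actions, T, R, gamma, eval_sweeps=50):
    """T(s, a) -> {s': P(s'|s, a)}; R(s) -> immediate reward."""
    policy = {s: random.choice(actions) for s in states}   # 1. random initial policy
    U = {s: 0.0 for s in states}
    while True:
        # 2. Policy evaluation: U(s) = R(s) + gamma * sum_{s'} T(s, pi(s), s') U(s')
        for _ in range(eval_sweeps):
            U = {s: R(s) + gamma * sum(p * U[t] for t, p in T(s, policy[s]).items())
                 for s in states}
        # 3. Policy improvement: act greedily with respect to the evaluated utilities
        new_policy = {
            s: max(actions, key=lambda a: sum(p * U[t] for t, p in T(s, a).items()))
            for s in states
        }
        # 4. Stop once the policy no longer changes
        if new_policy == policy:
            return policy, U
        policy = new_policy
```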
This ends an interesting reinforcement learning tutorial. Want to implement state-of-the-art Reinforcement Learning algorithms from scratch? Get this best-selling title, Reinforcement Learning with TensorFlow.