Reinforcement learning is a branch of machine learning that enables an agent to maximize some form of cumulative reward in a given environment by choosing its actions. Reinforcement learning differs from both supervised and unsupervised learning. It is used extensively in game theory, control systems, robotics, and other emerging areas of artificial intelligence. The following diagram illustrates the interaction between an agent and an environment in a reinforcement learning problem:
Reinforcement learning
Q-learning
We will now look at a popular reinforcement learning algorithm called Q-learning. Q-learning is used to determine an optimal action-selection policy for a given finite Markov decision process. A Markov decision process is defined by a state space, S; an action space, A; a set of immediate rewards, R; a transition probability for the next state, s(t+1), given the current state, s(t), and the current action, a(t), written as P(s(t+1) | s(t), a(t)); and a discount factor, γ. The following diagram illustrates a Markov decision process, where the next state depends on the current state and the action taken in the current state:
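To make these ingredients concrete, here is a minimal Python sketch of a toy finite Markov decision process; the state names, actions, transition probabilities, and rewards are made-up values for illustration only, not details from the text:

```python
# Toy finite MDP with made-up states, actions, probabilities, and rewards
states = ['s0', 's1', 's2']
actions = ['left', 'right']
gamma = 0.9  # discount factor

# P[(s, a)] maps each possible next state to its probability P(s' | s, a)
P = {
    ('s0', 'right'): {'s1': 0.8, 's0': 0.2},
    ('s0', 'left'):  {'s0': 1.0},
    ('s1', 'right'): {'s2': 1.0},
    ('s1', 'left'):  {'s0': 1.0},
}

# R[(s, a)] is the immediate reward for taking action a in state s
R = {
    ('s0', 'right'): 0.0,
    ('s0', 'left'):  0.0,
    ('s1', 'right'): 1.0,
    ('s1', 'left'):  0.0,
}
```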
Let's suppose that we have a sequence of states, actions, and corresponding rewards, as follows:

s(t), a(t), r(t), s(t+1), a(t+1), r(t+1), ..., s(T), a(T), r(T)
If we consider the long-term reward, R(t), at step t, it is equal to the sum of the immediate rewards at each step, from t until the end, as follows:

R(t) = r(t) + r(t+1) + r(t+2) + ... + r(T)
Now, a Markov decision process is a stochastic process, and it is not possible to get the same next state, s(t+1), based on s(t) and a(t) every time; so, we apply a discount factor, γ, to future rewards. This means that the long-term reward can be better represented as follows:

R(t) = r(t) + γ·r(t+1) + γ²·r(t+2) + ... + γ^(T-t)·r(T)    (1)
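As a quick check of this discounted sum, here is a small Python sketch; the reward values and the discount factor are made-up numbers used only for illustration:

```python
# Discounted long-term reward for an assumed sequence of immediate rewards
rewards = [1.0, 0.0, 2.0, 1.0]   # r(t), r(t+1), r(t+2), r(t+3): made-up values
gamma = 0.9                      # discount factor

long_term_reward = sum(gamma ** k * r for k, r in enumerate(rewards))
print(long_term_reward)          # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```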
Since, at time step t, the immediate reward, r(t), is already realized, to maximize the long-term reward we need to maximize the long-term reward from time step t+1 onward (that is, R(t+1)) by choosing an optimal action. The maximum long-term reward expected at a state s(t) by taking an action a(t) is represented by the following Q-function:

Q(s(t), a(t)) = E[r(t) + γ·max_a Q(s(t+1), a)]

Here, the expectation is taken over the possible next states, s(t+1).
At each state, s ∈ S, the agent in Q-learning tries to take an action, a, that maximizes its long-term reward. The Q-learning algorithm is an iterative process, the update rule of which is as follows:

Q(s(t), a(t)) ← (1 - α)·Q(s(t), a(t)) + α·[r(t) + γ·max_a Q(s(t+1), a)]

Here, α is the learning rate.
As you can see, the algorithm is inspired by the notion of a long-term reward, as expressed in (1).
The overall cumulative reward, Q(s(t), a(t)), of taking action a(t) in state s(t) depends on the immediate reward, r(t), and the maximum long-term reward that we can hope for in the new state, s(t+1). In a Markov decision process, the new state, s(t+1), is stochastically dependent on the current state, s(t), and the action taken, a(t), through a probability mass/density function of the form P(s(t+1) | s(t), a(t)).
The algorithm keeps updating the expected long-term cumulative reward by taking a weighted average of the old expectation and the new long-term reward estimate, based on the value of the learning rate, α.
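To make the update rule concrete, here is a minimal sketch of a tabular Q-learning update in Python. The table sizes, hyperparameter values, and the example transition are assumptions made for illustration, not details from the text:

```python
import numpy as np

alpha = 0.1   # learning rate (assumed value)
gamma = 0.9   # discount factor (assumed value)

n_states, n_actions = 10, 4           # assumed sizes of a small, finite MDP
Q = np.zeros((n_states, n_actions))   # Q-table, initialized to zero

def q_update(s, a, r, s_next):
    """One Q-learning update for the observed transition (s, a, r, s_next)."""
    # New long-term reward estimate: immediate reward plus discounted best future Q-value
    target = r + gamma * np.max(Q[s_next])
    # Weighted average of the old expectation and the new estimate
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

# Example: in state 3, action 1 gave reward 1.0 and led to state 7 (made-up numbers)
q_update(3, 1, 1.0, 7)
```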
Once we have built the Q(s, a) function through this iterative algorithm, while playing the game we can take the best action, a*, in a given state, s, as the policy that maximizes the Q-function:

a* = argmax_a Q(s, a)
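Extracting this greedy policy from a Q-table such as the one in the previous sketch amounts to a single argmax; the function below is a hypothetical helper written for illustration:

```python
import numpy as np

def best_action(Q, s):
    """Greedy policy: pick the action with the highest Q-value in state s."""
    return int(np.argmax(Q[s]))
```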
Deep Q-learning
In Q-learning, we generally work with a finite set of states and actions; this means that tables suffice to hold the Q-values and rewards. However, in practical applications, the number of states and applicable actions is often too large to enumerate, if not infinite, and better Q-function approximators are needed to represent and learn the Q-function. This is where deep neural networks come to the rescue, since they are universal function approximators. We can represent the Q-function with a neural network that takes the state and action as input and provides the corresponding Q-value as output. Alternatively, we can train a neural network using only the state as input, with the output being the Q-values corresponding to all of the actions. Both of these scenarios are illustrated in the following diagram. Since the Q-values are rewards, we are dealing with regression in these networks:
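As an illustration of the second scenario (state in, one Q-value per action out), here is a minimal sketch of such a network using tf.keras. The state dimension, number of actions, layer sizes, and the choice of Keras itself are assumptions made for this example rather than details from the text; the mean squared error loss reflects the regression framing mentioned above:

```python
import tensorflow as tf

state_dim = 4    # assumed size of the state vector
n_actions = 3    # assumed number of discrete actions

# A small fully connected network: the state goes in, one Q-value per action comes out
q_network = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(state_dim,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(n_actions)  # linear output layer, since Q-values are regressed
])

# Regression loss on the Q-values, as described above
q_network.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss='mse')
```

A greedy action can then be obtained by passing a state through the network and taking the argmax over the output Q-values.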
In this book, we will use reinforcement learning to train a race car to drive by itself through deep Q-learning.