Getting started with Q-learning using TensorFlow

[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Mastering TensorFlow 1.x written by Armando Fandango. This book will help you master advanced concepts of deep learning such as transfer learning, reinforcement learning, generative models and more, using TensorFlow and Keras.[/box]

In this tutorial, we will learn about Q-learning and how to implement it using deep reinforcement learning.

Q-Learning is a model-free method of finding the optimal policy that can maximize the reward of an agent. During initial gameplay, the agent learns a Q value for each pair of (state, action), also known as the exploration strategy. Once the Q values are learned, then the optimal policy will be to select an action with the largest Q-value in every state, also known as the exploitation strategy. The learning algorithm may end in locally optimal solutions, hence we keep using the exploration policy by setting an exploration_rate parameter.

The Q-Learning algorithm is as follows:

initialize  Q(shape=[#s,#a])  to  random  values  or  zeroes

Repeat  (for  each  episode) observe  current  state  s Repeat

select  an  action  a  (apply  explore  or  exploit  strategy)

observe  state  s_next  as  a  result  of  action  a update  the  Q-Table  using  bellman's  equation set  current  state  s  =  s_next

until  the  episode  ends  or  a  max  reward  /  max  steps  condition  is  reached

Until  a  number  of  episodes  or  a  condition  is  reached

(such  as  max  consecutive  wins)

Q(s, a) in the preceding algorithm represents the Q function. The values of this function are used for selecting the action instead of the rewards, thus this function represents the reward or discounted rewards. The values for the Q-function are updated using the values of the Q function in the future state. The well- known bellman equation captures this update:

This basically means that at time step t, in state s, for action a, the maximum future reward (Q) is equal to the reward from the current state plus the max future reward from the next state.

Q(s,a) can be implemented as a Q-Table or as a neural network known as a Q-Network. In both cases, the task of the Q-Table or the Q-Network is to provide the best possible action based on the Q value of the given input. The Q-Table-based approach generally becomes intractable as the Q-Table becomes large, thus making neural networks the best candidate for approximating the Q-function through Q-Network. Let us look at both of these approaches in action.

Initializing and discretizing for Q-Learning

The observations returned by the pole-cart environment involves the state of the environment. The state of pole-cart is represented by continuous values that we need to discretize.

If we discretize these values into small state-space, then the agent gets trained faster, but with the caveat of risking the convergence to the optimal policy.

We use the following helper function to discretize the state-space of the pole-cart environment:

#  discretize  the  value  to  a  state  space def  discretize(val,bounds,n_states):

discrete_val  =  0

if  val  <=  bounds[0]:

discrete_val  =  0 elif  val  >=  bounds[1]:

discrete_val  =  n_states-1 else:

                      discrete_val  =  int(round(  (n_states-1)  * ((val-bounds[0])/
                                                                                        (bounds[1]-bounds[0]))

return  discrete_val

def  discretize_state(vals,s_bounds,n_s):

discrete_vals  =  []

for  i  in  range(len(n_s)):

discrete_vals.append(discretize(vals[i],s_bounds[i],n_s[i]))

return  np.array(discrete_vals,dtype=np.int)

We discretize the space into 10 units for each of the observation dimensions. You may want to try out different discretization spaces. After the discretization, we find the upper and lower bounds of the observations, and change the bounds of velocity and angular velocity to be between -1 and +1, instead of -Inf and +Inf. The code is as follows:

env  =  gym.make('CartPole-v0')

n_a  =  env.action_space.n

#  number  of  discrete  states  for  each  observation  dimension

n_s  =  np.array([10,10,10,10])                  #  position,  velocity,  angle,  angular velocity

s_bounds  =  np.array(list(zip(env.observation_space.low, env.observation_space.high)))

#  the  velocity  and  angular  velocity  bounds  are

#  too  high  so  we  bound  between  -1,  +1

s_bounds[1]  =  (-1.0,1.0)

s_bounds[3]  =  (-1.0,1.0)

Q-Learning with Q-Table

Since our discretised space is of the dimensions [10,10,10,10], our Q-Table is of [10,10,10,10,2] dimensions:

#  create  a  Q-Table  of  shape  (10,10,10,10,  2)  representing  S  X  A  ->  R

q_table  =  np.zeros(shape  =  np.append(n_s,n_a))

We define a Q-Table policy that exploits or explores based on the exploration_rate:

def  policy_q_table(state,  env):

#  Exploration  strategy  -  Select  a  random  action if  np.random.random()  <  explore_rate:

action  =  env.action_space.sample()

#  Exploitation  strategy  -  Select  the  action  with  the  highest  q else:

action  =  np.argmax(q_table[tuple(state)])

return  action

Define the episode() function that runs a single episode as follows:

Start with initializing the variables and the first state:

obs  =  env.reset()

state_prev  =  discretize_state(obs,s_bounds,n_s)

episode_reward  =  0 done  =  False

t  =  0

Select the action and observe the next state:

action  =  policy(state_prev,  env)

obs,  reward,  done,  info  =  env.step(action)

state_new  =  discretize_state(obs,s_bounds,n_s)

Update the Q-Table:

best_q  =  np.amax(q_table[tuple(state_new)]) bellman_q  =  reward  +  discount_rate  *  best_q indices  =  tuple(np.append(state_prev,action))

q_table[indices]  +=  learning_rate*(  bellman_q  -  q_table[indices])

Set the next state as the previous state and add the rewards to the episode's rewards:

state_prev  =  state_new episode_reward  +=  reward

The experiment() function calls the episode function and accumulates the rewards for reporting. You may want to modify the function to check for consecutive wins and other logic specific to your play or games:

#  collect  observations  and  rewards  for  each  episode

def  experiment(env,  policy,  n_episodes,r_max=0,  t_max=0):

rewards=np.empty(shape=[n_episodes])

for  i  in  range(n_episodes):

val  =  episode(env,  policy,  r_max,  t_max)

rewards[i]=val

print('Policy:{},  Min  reward:{},  Max  reward:{},  Average  reward:{}'

.format(policy.     name     , np.min(rewards), np.max(rewards), np.mean(rewards)))

Now, all we have to do is define the parameters, such as learning_rate, discount_rate, and explore_rate, and run the experiment() function as follows:

learning_rate  =  0.8 discount_rate  =  0.9 explore_rate  =  0.2 n_episodes  =  1000

experiment(env,  policy_q_table,  n_episodes)

For 1000 episodes, the Q-Table-based policy's maximum reward is 180 based on our simple implementation:

Policy:policy_q_table,  Min  reward:8.0,  Max  reward:180.0,  Average reward:17.592

Our implementation of the algorithm is very simple to explain. However, you can modify the code to set the explore rate high initially and then decay as the time-steps pass.

Similarly, you can also implement the decay logic for the learning and discount rates. Let us see if we can get a higher reward with fewer episodes as our Q function learns faster.

Q-Learning with Q-Network or Deep Q Network (DQN)

In the DQN, we replace the Q-Table with a neural network (Q-Network) that will learn to respond with the optimal action as we train it continuously with the explored states and their Q-Values. Thus, for training the network we need a place to store the game memory:

Implement the game memory using a deque of size 1000:

memory  =  deque(maxlen=1000)

Next, build a simple hidden layer neural network model, q_nn:

from  keras.models  import  Sequential from  keras.layers  import  Dense

model  =  Sequential()

model.add(Dense(8,input_dim=4,  activation='relu')) model.add(Dense(2,  activation='linear')) model.compile(loss='mse',optimizer='adam') model.summary()

q_nn  =  model



The Q-Network looks like this:

Layer  (type)                                           Output  Shape                                 Param  #

=================================================================

dense_1  (Dense)                                 (None,  8)                                         40

dense_2  (Dense)                                 (None,  2)                                         18

=================================================================

Total  params:  58

Trainable  params:  58

Non-trainable  params:  0

The episode() function that executes one episode of the game, incorporates the following changes for the Q-Network-based algorithm:

After generating the next state, add the states, action, and rewards to the game memory:

action  =  policy(state_prev,  env)

obs,  reward,  done,  info  =  env.step(action)

state_next  =  discretize_state(obs,s_bounds,n_s)

#  add  the  state_prev,  action,  reward,  state_new,  done  to  memory memory.append([state_prev,action,reward,state_next,done])

Generate and update the q_values with the maximum future rewards using the bellman function:

states  =  np.array([x[0]  for  x  in  memory])

states_next  =  np.array([np.zeros(4)  if  x[4]  else  x[3]  for  x  in memory])

q_values  =  q_nn.predict(states)

q_values_next  =  q_nn.predict(states_next)

for  i  in  range(len(memory)): state_prev,action,reward,state_next,done  =  memory[i] if  done:

q_values[i,action]  =  reward else:

best_q  =  np.amax(q_values_next[i])

bellman_q  =  reward  +  discount_rate  *  best_q q_values[i,action]  =  bellman_q

Train the q_nn with the states and the q_values we received from memory:

q_nn.fit(states,q_values,epochs=1,batch_size=50,verbose=0)

The process of saving gameplay in memory and using it to train the model is also known as memory replay in deep reinforcement learning literature. Let us run our DQN-based gameplay as follows:

learning_rate  =  0.8 discount_rate  =  0.9 explore_rate  =  0.2 n_episodes  =  100

experiment(env,  policy_q_nn,  n_episodes)

We get a max reward of 150 that you can improve upon with hyper-parameter tuning, network tuning, and by using rate decay for the discount rate and explore rate:

Policy:policy_q_nn,  Min  reward:8.0,  Max  reward:150.0,  Average  reward:41.27

To summarize, we calculated and trained the model at every step. One can change the code to discard the memory replay and retrain the model for the episodes that return smaller rewards. However, implement this option with caution as it may slow down your learning as initial gameplay would generate smaller rewards more often.

Do check out the book Mastering TensorFlow 1.x to explore advanced features of TensorFlow 1.x and gain insight into TensorFlow Core, Keras, TF Estimators, TFLearn, TF Slim, Pretty Tensor, and Sonnet.