[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Mastering TensorFlow 1.x written by Armando Fandango. This book will help you master advanced concepts of deep learning such as transfer learning, reinforcement learning, generative models and more, using TensorFlow and Keras.[/box]
In this tutorial, we will learn about Q-learning and how to implement it using deep reinforcement learning.
Q-Learning is a model-free method of finding the optimal policy that can maximize the reward of an agent. During initial gameplay, the agent learns a Q value for each pair of (state, action), also known as the exploration strategy. Once the Q values are learned, then the optimal policy will be to select an action with the largest Q-value in every state, also known as the exploitation strategy. The learning algorithm may end in locally optimal solutions, hence we keep using the exploration policy by setting an exploration_rate parameter.
The Q-Learning algorithm is as follows:
initialize Q(shape=[#s,#a]) to random values or zeroes
Repeat (for each episode) observe current state s Repeat
select an action a (apply explore or exploit strategy)
observe state s_next as a result of action a update the Q-Table using bellman's equation set current state s = s_next
until the episode ends or a max reward / max steps condition is reached
Until a number of episodes or a condition is reached
(such as max consecutive wins)
Q(s, a) in the preceding algorithm represents the Q function. The values of this function are used for selecting the action instead of the rewards, thus this function represents the reward or discounted rewards. The values for the Q-function are updated using the values of the Q function in the future state. The well- known bellman equation captures this update:
This basically means that at time step t, in state s, for action a, the maximum future reward (Q) is equal to the reward from the current state plus the max future reward from the next state.
Q(s,a) can be implemented as a Q-Table or as a neural network known as a Q-Network. In both cases, the task of the Q-Table or the Q-Network is to provide the best possible action based on the Q value of the given input. The Q-Table-based approach generally becomes intractable as the Q-Table becomes large, thus making neural networks the best candidate for approximating the Q-function through Q-Network. Let us look at both of these approaches in action.
Initializing and discretizing for Q-Learning
The observations returned by the pole-cart environment involves the state of the environment. The state of pole-cart is represented by continuous values that we need to discretize.
If we discretize these values into small state-space, then the agent gets trained faster, but with the caveat of risking the convergence to the optimal policy.
We use the following helper function to discretize the state-space of the pole-cart environment:
# discretize the value to a state space def discretize(val,bounds,n_states):
discrete_val = 0
if val <= bounds[0]:
discrete_val = 0 elif val >= bounds[1]:
discrete_val = n_states-1 else:
discrete_val = int(round( (n_states-1) * ((val-bounds[0])/
(bounds[1]-bounds[0]))
return discrete_val
def discretize_state(vals,s_bounds,n_s):
discrete_vals = []
for i in range(len(n_s)):
discrete_vals.append(discretize(vals[i],s_bounds[i],n_s[i]))
return np.array(discrete_vals,dtype=np.int)
We discretize the space into 10 units for each of the observation dimensions. You may want to try out different discretization spaces. After the discretization, we find the upper and lower bounds of the observations, and change the bounds of velocity and angular velocity to be between -1 and +1, instead of -Inf and +Inf. The code is as follows:
env = gym.make('CartPole-v0')
n_a = env.action_space.n
# number of discrete states for each observation dimension
n_s = np.array([10,10,10,10]) # position, velocity, angle, angular velocity
s_bounds = np.array(list(zip(env.observation_space.low, env.observation_space.high)))
# the velocity and angular velocity bounds are
# too high so we bound between -1, +1
s_bounds[1] = (-1.0,1.0)
s_bounds[3] = (-1.0,1.0)
Q-Learning with Q-Table
Since our discretised space is of the dimensions [10,10,10,10], our Q-Table is of [10,10,10,10,2] dimensions:
# create a Q-Table of shape (10,10,10,10, 2) representing S X A -> R
q_table = np.zeros(shape = np.append(n_s,n_a))
We define a Q-Table policy that exploits or explores based on the exploration_rate:
def policy_q_table(state, env):
# Exploration strategy - Select a random action if np.random.random() < explore_rate:
action = env.action_space.sample()
# Exploitation strategy - Select the action with the highest q else:
action = np.argmax(q_table[tuple(state)])
return action
Define the episode() function that runs a single episode as follows:
obs = env.reset()
state_prev = discretize_state(obs,s_bounds,n_s)
episode_reward = 0 done = False
t = 0
action = policy(state_prev, env)
obs, reward, done, info = env.step(action)
state_new = discretize_state(obs,s_bounds,n_s)
best_q = np.amax(q_table[tuple(state_new)]) bellman_q = reward + discount_rate * best_q indices = tuple(np.append(state_prev,action))
q_table[indices] += learning_rate*( bellman_q - q_table[indices])
state_prev = state_new episode_reward += reward
The experiment() function calls the episode function and accumulates the rewards for reporting. You may want to modify the function to check for consecutive wins and other logic specific to your play or games:
# collect observations and rewards for each episode
def experiment(env, policy, n_episodes,r_max=0, t_max=0):
rewards=np.empty(shape=[n_episodes])
for i in range(n_episodes):
val = episode(env, policy, r_max, t_max)
rewards[i]=val
print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
.format(policy. name , np.min(rewards), np.max(rewards), np.mean(rewards)))
Now, all we have to do is define the parameters, such as learning_rate, discount_rate, and explore_rate, and run the experiment() function as follows:
learning_rate = 0.8 discount_rate = 0.9 explore_rate = 0.2 n_episodes = 1000
experiment(env, policy_q_table, n_episodes)
For 1000 episodes, the Q-Table-based policy's maximum reward is 180 based on our simple implementation:
Policy:policy_q_table, Min reward:8.0, Max reward:180.0, Average reward:17.592
Our implementation of the algorithm is very simple to explain. However, you can modify the code to set the explore rate high initially and then decay as the time-steps pass.
Similarly, you can also implement the decay logic for the learning and discount rates. Let us see if we can get a higher reward with fewer episodes as our Q function learns faster.
Q-Learning with Q-Network or Deep Q Network (DQN)
In the DQN, we replace the Q-Table with a neural network (Q-Network) that will learn to respond with the optimal action as we train it continuously with the explored states and their Q-Values. Thus, for training the network we need a place to store the game memory:
memory = deque(maxlen=1000)
from keras.models import Sequential from keras.layers import Dense
model = Sequential()
model.add(Dense(8,input_dim=4, activation='relu')) model.add(Dense(2, activation='linear')) model.compile(loss='mse',optimizer='adam') model.summary()
q_nn = model
The Q-Network looks like this:
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 8) 40
dense_2 (Dense) (None, 2) 18
=================================================================
Total params: 58
Trainable params: 58
Non-trainable params: 0
The episode() function that executes one episode of the game, incorporates the following changes for the Q-Network-based algorithm:
action = policy(state_prev, env)
obs, reward, done, info = env.step(action)
state_next = discretize_state(obs,s_bounds,n_s)
# add the state_prev, action, reward, state_new, done to memory memory.append([state_prev,action,reward,state_next,done])
states = np.array([x[0] for x in memory])
states_next = np.array([np.zeros(4) if x[4] else x[3] for x in memory])
q_values = q_nn.predict(states)
q_values_next = q_nn.predict(states_next)
for i in range(len(memory)): state_prev,action,reward,state_next,done = memory[i] if done:
q_values[i,action] = reward else:
best_q = np.amax(q_values_next[i])
bellman_q = reward + discount_rate * best_q q_values[i,action] = bellman_q
q_nn.fit(states,q_values,epochs=1,batch_size=50,verbose=0)
The process of saving gameplay in memory and using it to train the model is also known as memory replay in deep reinforcement learning literature. Let us run our DQN-based gameplay as follows:
learning_rate = 0.8 discount_rate = 0.9 explore_rate = 0.2 n_episodes = 100
experiment(env, policy_q_nn, n_episodes)
We get a max reward of 150 that you can improve upon with hyper-parameter tuning, network tuning, and by using rate decay for the discount rate and explore rate:
Policy:policy_q_nn, Min reward:8.0, Max reward:150.0, Average reward:41.27
To summarize, we calculated and trained the model at every step. One can change the code to discard the memory replay and retrain the model for the episodes that return smaller rewards. However, implement this option with caution as it may slow down your learning as initial gameplay would generate smaller rewards more often.
Do check out the book Mastering TensorFlow 1.x to explore advanced features of TensorFlow 1.x and gain insight into TensorFlow Core, Keras, TF Estimators, TFLearn, TF Slim, Pretty Tensor, and Sonnet.