Asynchronous one-step SARSA
The architecture of asynchronous one-step SARSA is almost identical to that of asynchronous one-step Q-learning; the only difference is how the target network computes the target state-action value of the current state. Instead of using the maximum Q-value of the next state s' given by the target network, SARSA uses ε-greedy to choose the action a' for the next state s', and the Q-value of that next state-action pair, Q(s', a'; θ⁻), is used to calculate the target state-action value of the current state.
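To make this difference concrete, here is a minimal sketch contrasting the one-step Q-learning target with the one-step SARSA target. The names epsilon_greedy and one_step_targets are illustrative, and for simplicity a' is chosen ε-greedily from the target network's own Q-values; this is an assumption for the sketch, not the book's code.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def one_step_targets(r, done, q_next_target, epsilon, gamma, rng):
    """Contrast the two one-step targets for a single transition (s, a, r, s').

    q_next_target holds Q(s', .; θ⁻), the next-state values from the target network.
    """
    if done:                                # no bootstrapping from a terminal state
        return r, r
    # Q-learning: bootstrap with the maximum target-network Q-value of s'
    q_learning_target = r + gamma * float(np.max(q_next_target))
    # SARSA: choose a' with ε-greedy, then bootstrap with Q(s', a'; θ⁻)
    a_prime = epsilon_greedy(q_next_target, epsilon, rng)
    sarsa_target = r + gamma * float(q_next_target[a_prime])
    return q_learning_target, sarsa_target
```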
The pseudo-code for asynchronous one-step SARSA is shown below. Here, the following are the global parameters:
- θ: the parameters (weights and biases) of the policy network
- θ⁻: the parameters (weights and biases) of the target network
- T: the overall time step counter
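Before turning to the pseudo-code, here is a minimal sketch of how these globally shared quantities might be held so that all learner threads can read and update them. The SharedState class, the linear Q-function, and the lock are illustrative assumptions for the sketch, not part of the original.

```python
import threading

import numpy as np

class SharedState:
    """Globally shared state used by all learner threads (illustrative sketch).

    theta       : parameters of the policy network, initialized arbitrarily
    theta_minus : parameters of the target network, copied from theta
    T           : overall (global) time step counter, initialized to 0
    """
    def __init__(self, n_features, n_actions, seed=0):
        rng = np.random.default_rng(seed)
        # A linear Q-function, Q(s, .; θ) = s @ theta, keeps the sketch small.
        self.theta = rng.normal(scale=0.01, size=(n_features, n_actions))
        self.theta_minus = self.theta.copy()
        self.T = 0
        self.lock = threading.Lock()  # guards asynchronous updates to theta and T
```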
// Globally shared parameters θ, θ⁻, and T
// θ is initialized arbitrarily
// T is initialized to 0

Pseudo-code for each learner running in parallel in each of the threads:

Initialize thread-level time step counter t = 0
...
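As a complement to the pseudo-code, the following is a hedged Python sketch of one learner thread of asynchronous one-step SARSA. It reuses the epsilon_greedy helper and SharedState object sketched earlier, assumes a linear Q-function and a hypothetical environment exposing reset() and step(), and applies the update at every step for brevity; it is an illustration of the idea, not the book's implementation.

```python
import numpy as np

def sarsa_worker(shared, env, gamma=0.99, alpha=1e-3, epsilon=0.1,
                 T_max=1_000_000, I_target=10_000, seed=0):
    """One learner thread of asynchronous one-step SARSA (illustrative sketch).

    shared is the SharedState object from above; env is a hypothetical
    environment whose step() returns (next_state, reward, done).
    """
    rng = np.random.default_rng(seed)
    t = 0                                                  # thread-level time step counter
    s = env.reset()
    a = epsilon_greedy(s @ shared.theta, epsilon, rng)     # behaviour policy: ε-greedy on θ
    while shared.T < T_max:
        s_next, r, done = env.step(a)
        # Choose a' ε-greedily for the next state s'
        a_next = epsilon_greedy(s_next @ shared.theta, epsilon, rng)
        # One-step SARSA target: bootstrap with Q(s', a'; θ⁻) from the target network
        y = r if done else r + gamma * float((s_next @ shared.theta_minus)[a_next])
        # Semi-gradient step on (y - Q(s, a; θ))² for the linear Q-function
        td_error = y - float((s @ shared.theta)[a])
        with shared.lock:
            shared.theta[:, a] += alpha * td_error * s     # asynchronous update of θ
            shared.T += 1
            if shared.T % I_target == 0:
                shared.theta_minus = shared.theta.copy()   # periodic target-network sync
        t += 1
        if done:
            s = env.reset()
            a = epsilon_greedy(s @ shared.theta, epsilon, rng)
        else:
            s, a = s_next, a_next
```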