The value iteration method
In the simple example we just saw, to calculate the values of states and actions, we exploited the structure of the environment: we had no loops in the transitions, so we could start from the terminal states, calculate their values, and then proceed to the central state. However, just one loop in the environment creates an obstacle to this approach. Let's consider such an environment with two states:
We start from state s1, and the only action we can take leads us to state s2. We get the reward r=1, and the only transition from s2 is an action that brings us back to s1. So, the life of our agent is an infinite sequence of states [s1, s2, s1, s2, ...]. To deal with this infinite loop, we can use a discount factor γ < 1. Now, the question is, what are the values of both states?
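To see why the discount factor makes these values finite, here is a small numeric sketch (our own illustration, not from the source; the value gamma = 0.9 and the reward of 2 on the back transition are taken from the surrounding text):

```python
# With a discount factor gamma < 1, the infinite reward sequence
# [1, 2, 1, 2, ...] has a finite discounted sum, so state values
# are well defined despite the loop.
GAMma = None  # placeholder removed below

GAMMA = 0.9

def discounted_return(rewards, gamma=GAMMA):
    """Sum of gamma**t * r_t over the given reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# Truncating the infinite sequence: each extra pair of steps
# contributes a factor gamma**2 less, so the sum converges.
print(discounted_return([1, 2] * 1000))
```

With a long enough truncation, the result approaches the closed-form limit (1 + 2γ) / (1 − γ²).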
The answer is not very complicated, though. Every transition from s1 to s2 gives us a reward of 1, and every transition from s2 back to s1 gives us a reward of 2. So, our sequence...
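The values of the two states can also be found numerically by repeatedly applying the Bellman update until the estimates stop changing, which is the essence of the value iteration method this section is named after. A minimal sketch, assuming gamma = 0.9 (the state names and function names are ours):

```python
# Value iteration on the two-state loop environment described above.
# Each state has exactly one action, so the Bellman update has no max.
GAMMA = 0.9

def value_iteration(gamma=GAMMA, eps=1e-10):
    v1, v2 = 0.0, 0.0  # initial value estimates for s1 and s2
    while True:
        new_v1 = 1.0 + gamma * v2  # s1 -> s2 yields reward 1
        new_v2 = 2.0 + gamma * v1  # s2 -> s1 yields reward 2
        if max(abs(new_v1 - v1), abs(new_v2 - v2)) < eps:
            return new_v1, new_v2
        v1, v2 = new_v1, new_v2

v1, v2 = value_iteration()
print(v1, v2)
```

The iterates converge to the closed-form solution of the pair of Bellman equations: V(s1) = (1 + 2γ) / (1 − γ²) and V(s2) = (2 + γ) / (1 − γ²).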