Packt+ | Advance your knowledge in tech

You're reading from Deep Reinforcement Learning Hands-On Apply modern RL methods, with deep Q-networks, value iteration, policy gradients, TRPO, AlphaGo Zero and more

Product type Paperback

Published in Jun 2018

Publisher Packt

ISBN-13 9781788834247

Length 546 pages

Edition 1st Edition

Languages

Python

Tools

Deep Reinforcement Learning

Concepts

Deep Reinforcement Learning

Author (1):

Maxim Lapan

View More author details

Table of Contents (21) Chapters

Preface

1. What is Reinforcement Learning? FREE CHAPTER

2. OpenAI Gym

3. Deep Learning with PyTorch

4. The Cross-Entropy Method

5. Tabular Learning and the Bellman Equation

6. Deep Q-Networks

7. DQN Extensions

8. Stocks Trading Using RL

9. Policy Gradients – An Alternative

10. The Actor-Critic Method

11. Asynchronous Advantage Actor-Critic

12. Chatbots Training with RL

13. Web Navigation

14. Continuous Action Space

15. Trust Regions – TRPO, PPO, and ACKTR

16. Black-Box Optimization in RL

17. Beyond Model-Free – Imagination

18. AlphaGo Zero

Other Books You May Enjoy

Leave a review - let other readers know what you think

Index

Taxonomy of RL methods

The cross-entropy method falls into the model-free and policy-based category of methods. These notions are new, so let's spend some time exploring them. All methods in RL can be classified into various aspects:

Model-free or model-based
Value-based or policy-based
On-policy or off-policy

There are other ways that you can taxonomize RL methods, but for now we're interested in the preceding three. Let's define them, as your problem specifics can influence your decision on a particular method.

The term model-free means that the method doesn't build a model of the environment or reward; it just directly connects observations to actions (or values that are related to actions). In other words, the agent takes current observations and does some computations on them, and the result is the action that it should take. In contrast, model-based methods try to predict what the next observation and/or reward will be. Based on this prediction, the agent is trying to choose the best possible...