The REINFORCE method
The policy gradient formula that you have just seen is used by most policy-based methods, but the details can vary. One very important point is how exactly the gradient scales, Q(s,a), are calculated. In the cross-entropy method from Chapter 4, we played several episodes, calculated the total reward for each of them, and trained on transitions from episodes with a better-than-average reward. This training procedure is a policy gradient method with Q(s,a) = 1 for state and action pairs from good episodes (those with a large total reward) and Q(s,a) = 0 for state and action pairs from worse episodes.
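To make this connection concrete, here is a minimal PyTorch-style sketch of a policy gradient loss in which the scale is 1 for transitions from good episodes and 0 for the rest. The function name policy_gradient_loss and the tensor arguments (logits, actions, scales) are illustrative assumptions, not code from this chapter.

```python
import torch
import torch.nn.functional as F

def policy_gradient_loss(logits, actions, scales):
    """Generic policy-gradient objective: -E[ scale * log pi(a|s) ]."""
    log_probs = F.log_softmax(logits, dim=1)
    # pick the log-probability of the action actually taken in each state
    log_prob_actions = log_probs[torch.arange(len(actions)), actions]
    return -(scales * log_prob_actions).mean()

# Cross-entropy method as a special case: scale = 1 for transitions from
# elite episodes, scale = 0 for the rest, so the latter drop out of the
# gradient entirely.
logits = torch.randn(4, 2, requires_grad=True)     # fake policy outputs
actions = torch.tensor([0, 1, 1, 0])                # actions taken
elite_scales = torch.tensor([1.0, 1.0, 0.0, 0.0])   # good vs. worse episodes
loss = policy_gradient_loss(logits, actions, elite_scales)
loss.backward()
```

With such binary scales, the zero-scaled transitions contribute nothing to the gradient, which is exactly why the cross-entropy method trains only on the elite episodes.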
The cross-entropy method worked even with those simple assumptions, but an obvious improvement is to use Q(s,a) for training instead of just 0 and 1. Why should this help? The answer is a more fine-grained separation of episodes. For example, transitions from an episode with a total reward of 10 should contribute to the gradient more than transitions from an episode with a total reward of 1.
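One standard choice for these scales, used by REINFORCE, is the discounted total reward accumulated from each step to the end of its episode. The helper below is a sketch of that calculation under that assumption; the name calc_qvals and the default gamma are illustrative.

```python
def calc_qvals(rewards, gamma=0.99):
    """Discounted return for every step of a single episode.

    Walks the episode's local rewards backwards, so each step's value
    is r_t + gamma * (value of the following step).
    """
    res = []
    sum_r = 0.0
    for r in reversed(rewards):
        sum_r = r + gamma * sum_r
        res.append(sum_r)
    return list(reversed(res))

# For a short episode with rewards [1, 1, 1] and gamma = 0.99:
# calc_qvals([1, 1, 1]) -> [2.9701, 1.99, 1.0]
```

Feeding these values in as the scales (instead of the binary 0/1 labels above) lets every transition contribute in proportion to how much reward actually followed it.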