Python Reinforcement Learning: Solve complex real-world problems by mastering reinforcement learning algorithms using OpenAI Gym and TensorFlow

Product type: Course

Published: Apr 2019

Publisher: Packt

ISBN-13: 9781838649777

Length: 496 pages

Edition: 1st Edition

Languages: Python

Tools: OpenAI Gym

Concepts: Reinforcement Learning

Authors (4):

Yang Wenzhuo

Sean Saito

Sudharsan Ravichandiran

Rajalingappaa Shanmugamani

Table of Contents (27 chapters)

Title Page

About Packt

Contributors

Preface

1. Introduction to Reinforcement Learning

2. Getting Started with OpenAI and TensorFlow

3. The Markov Decision Process and Dynamic Programming

4. Gaming with Monte Carlo Methods

5. Temporal Difference Learning

6. Multi-Armed Bandit Problem

7. Playing Atari Games

8. Atari Games with Deep Q Network

9. Playing Doom with a Deep Recurrent Q Network

10. The Asynchronous Advantage Actor Critic Network

11. Policy Gradients and Optimization

12. Balancing CartPole

13. Simulating Control Tasks

14. Building Virtual Worlds in Minecraft

15. Learning to Play Go

16. Creating a Chatbot

17. Generating a Deep Learning Image Classifier

18. Predicting Future Stock Prices

19. Capstone Project - Car Racing Using DQN

20. Looking Ahead

1. Assessments

2. Other Books You May Enjoy

Index

A

Acrobot
- settings / The classic control tasks
agent
- about / Agent
- training, to play Doom / Training an agent to play Doom
agent environment interface / Agent environment interface
algorithmic tasks / Algorithmic tasks
AlphaGo
- about / AlphaGo
- supervised learning policy networks / Supervised learning policy networks
- reinforcement learning policy networks / Reinforcement learning policy networks
- value network / Value network
- neural networks and MCTS, combining / Combining neural networks and MCTS
AlphaGo Zero
- about / AlphaGo Zero, Putting everything together
- training / Training AlphaGo Zero
- implementing / Implementing AlphaGo Zero
- policy and value networks / Policy and value networks
- preprocessing.py module / preprocessing.py
- features.py module / features.py
- network.py module / network.py
- alphagozero_agent.py / alphagozero_agent.py
- controller.py / controller.py
- train.py / train.py
- features.py module / Helper methods
Anaconda
- download link / Installing Anaconda
applications, RL
- about / Applications of RL
- education / Education
- medicine and healthcare / Medicine and healthcare
- manufacturing sector / Manufacturing
- inventory management / Inventory management
- finance / Finance
- Natural Language Processing (NLP) / Natural Language Processing and Computer Vision
- Computer Vision (CV) / Natural Language Processing and Computer Vision
architecture, Deep Q Networks (DQN)
- convolutional network / Convolutional network
- experience replay / Experience replay
- target network / Target network
- rewards, clipping / Clipping rewards
- algorithm / Understanding the algorithm
asynchronous advantage actor-critic (A3C) algorithm
- about / Asynchronous advantage actor-critic algorithm
- implementing / Implementation of A3C
- experiments / Experiments
Asynchronous Advantage Actor Critic (A3C) algorithm
- about / The Asynchronous Advantage Actor Critic, The three As
- architecture / The architecture of A3C
- working / How A3C works
- mountain car example / Driving up a mountain with A3C
- network, visualizing in TensorBoard / Visualization in TensorBoard
Atari 2600 games
- references / Introduction to Atari games
- unsolved issues / Demonstrating basic Q-learning algorithm
Atari emulator
- building / Building an Atari emulator, Getting started
- implementation / Implementation of the Atari emulator
- implementing, gym used / Atari simulator using gym
Atari game
- playing, by building agent / Building an agent to play Atari games
Atari games
- about / Introduction to Atari games
- data preparation / Data preparation
- playing / Atari

B

backpropagation / Update
backpropagation through time (BPTT) / DPG algorithm
basic elements, reinforcement learning
- state / Basic elements of reinforcement learning
- reward function / Basic elements of reinforcement learning
- policy function / Basic elements of reinforcement learning
- value function / Basic elements of reinforcement learning
basic Q-learning algorithm
- demonstrating / Demonstrating basic Q-learning algorithm
Bellman equation
- about / The Bellman equation and optimality
- deriving, for value and Q function / Deriving the Bellman equation for value and Q functions
- solving / Solving the Bellman equation
/ Demonstrating basic Q-learning algorithm
Bellman equation, solving
- dynamic programming (DP) technique, using / Dynamic programming
Blackjack game
- playing, with Monte Carlo / Let's play Blackjack with Monte Carlo
board state / Go and other board games

C

car racing game
- dueling DQN, using in / Car racing
CartPole
- about / Running an environment, CartPole
- specifications / The classic control tasks
chatbot
- background, issues / The background problem
- dataset / Dataset
- step-by-step guide / Step-by-step guide
- data parser / Data parser
- data reader / Data reader
- helper methods / Helper methods
- model / Chatbot model
- data, training / Training the data
- testing / Testing and results
- results / Testing and results
classic control tasks / The classic control tasks
conjugate gradient method
- URL / Trust Region Policy Optimization
constants / Constants
contextual bandits
- about / Contextual bandits
- reference / Further reading
continuous environment / Continuous environment
control tasks
- about / Introduction to control tasks, Getting started
- classic control tasks / The classic control tasks

D

data preparation, Atari games / Data preparation
Deep Attention Recurrent Q Network (DARQN)
- about / DARQN
- architecture / Architecture of DARQN
- attention layer / Architecture of DARQN
deep deterministic policy gradient (DDPG)
- about / Deep deterministic policy gradient
- used, for swinging pendulum / Swinging a pendulum
DeepMind Lab / DeepMind Lab
deep Q-learning
- about / Deep Q-learning
- basic elements, of reinforcement learning / Basic elements of reinforcement learning
- basic Q-learning algorithm, demonstrating / Demonstrating basic Q-learning algorithm
deep Q-learning algorithm (DQN)
- about / Deep Q-learning
- implementing / Implementation of DQN
- experiments / Experiments
Deep Q Network (DQN)
- about / What is a Deep Q Network?
- architecture / Architecture of DQN
- dueling network architecture / Dueling network architecture
Deep Recurrent Q Network (DRQN)
- about / DRQN
- architecture / Architecture of DRQN
- Doom / Doom with DRQN
deterministic environment / Deterministic environment
deterministic policy gradient (DPG)
- about / Deterministic policy gradient
- actor-critic architecture / Deterministic policy gradient
- theory / The theory behind policy gradient
- algorithm / DPG algorithm
- implementing / Implementation of DDPG
- experiments / Experiments
discrete environment / Discrete environment
Docker
- installing / Installing Docker
- download link / Installing Docker
Doom
- playing, by training agent / Training an agent to play Doom
- about / Basic Doom game
- with Deep Recurrent Q Network (DRQN) / Doom with DRQN
Double DQN / Double DQN
dueling network
- architecture / Dueling network architecture
- building / Dueling network
dynamic programming (DP) technique
- about / Dynamic programming, Monte Carlo prediction
- value iteration algorithm, using / Value iteration
- policy iteration algorithm, using / Policy iteration

E

elements, reinforcement learning (RL)
- agent / Agent
- policy function / Policy function
- value function / Value function
- model / Model
environment wrapper functions / Environment wrapper functions
episodic environment / Episodic and non-episodic environment
epsilon-greedy policy / The epsilon-greedy policy
experience replay / Experience replay

F

financial market
- background, issues / Background problem
- data used / Data used
- step-by-step guide / Step-by-step guide
- actor script / Actor script
- critic script / Critic script
- agent script / Agent script
- helper script / Helper script
- data, training / Training the data
- final result / Final result
frame-skipping technique / Data preparation
frozen lake problem
- solving / Solving the frozen lake problem
- value iteration algorithm, using / Value iteration
frozen lake problem, solving
- value iteration algorithm, using / Value iteration
- policy iteration algorithm, using / Policy iteration
fully observable environment / Fully observable environment

G

Go
- about / A brief introduction to Go
- and other board games / Go and other board games
- and AI research / Go and AI research
GridWorld game
- reference / Experiments

H

hard attention / Architecture of DARQN
Hidden Markov model / Markov models

J

Jupyter notebook
- URL / Lunar Lander using policy gradients

K

Kullback–Leibler (KL) divergence / Trust Region Policy Optimization

M

Markov chain / The Markov chain and Markov process
Markov Decision Process (MDP)
- about / Markov Decision Process
- rewards and returns / Rewards and returns
- episodic tasks / Episodic and continuous tasks
- continuous tasks / Episodic and continuous tasks
- discount factor / Discount factor
- policy function / The policy function
- state value function / State value function
- state-action value function (Q function) / State-action value function (Q function)
- reference / Questions
Markov models
- about / Markov models
- CartPole / CartPole
Markov process / The Markov chain and Markov process
Massively Multiplayer Online Role-Playing Games (MMORPGs) / Multi-agent reinforcement learning
MC-ES algorithm / Monte Carlo exploration starts
mean-squared error (MSE) / Value network
Minecraft environment
- about / Introduction to the Minecraft environment
- data preparation / Data preparation
model / Model
model-free / Demonstrating basic Q-learning algorithm
Monte Carlo
- methods / Monte Carlo methods
- used, for pi value estimation / Estimating the value of pi using Monte Carlo
- prediction algorithm / Monte Carlo prediction
- about / First visit Monte Carlo
- Blackjack game, playing with / Let's play Blackjack with Monte Carlo
Monte Carlo control
- about / Monte Carlo control
- exploration / Monte Carlo exploration starts
- on-policy Monte Carlo control / On-policy Monte Carlo control
- Monte Carlo control / On-policy Monte Carlo control
- off-policy Monte Carlo control / Off-policy Monte Carlo control
Monte Carlo exploring starts concept / Monte Carlo exploration starts
Monte Carlo prediction algorithm
- about / Monte Carlo prediction
- first visit / First visit Monte Carlo
- every visit / Every visit Monte Carlo
Monte Carlo tree search
- about / Monte Carlo tree search
- selection / Selection
- expansion / Expansion
- simulation / Simulation
- update step / Update
- mcts.py / mcts.py
MuJoCo
- about / MuJoCo
- reference / Introduction to control tasks
multi-agent environment / Single and multi-agent environment
multi-agent reinforcement learning / Multi-agent reinforcement learning
multi-armed bandit (MAB)
- applications / Applications of MAB
- used, for identifying advertisement banner / Identifying the right advertisement banner using MAB
- reference / Questions
multi-armed bandit (MAB) problem
- about / The MAB problem
- epsilon-greedy policy / The epsilon-greedy policy
- softmax exploration algorithm / The softmax exploration algorithm
- upper confidence bound (UCB) algorithm / The upper confidence bound algorithm
- Thompson sampling (TS) algorithm / The Thompson sampling algorithm

N

NAS, implementing
- about / Implementing NAS
- child_network.py module / child_network.py
- cifar10_processor.py / cifar10_processor.py
- controller.py module / controller.py
- controller generating, ways / Method for generating the Controller
- child network generating, controller used / Generating a child network using the Controller
- train_controller method / train_controller method
- ChildCNN, testing / Testing ChildCNN
- config.py module / config.py
- train.py module / train.py
- exercises / Additional exercises
- advantages / Advantages of NAS
network
- training / Training the network
neural architecture search
- about / Neural Architecture Search
- child networks, generating / Generating and training child networks
- child networks, training / Generating and training child networks
- controller, training / Training the Controller
- algorithm, training / Training algorithm
non-episodic environment / Episodic and non-episodic environment
nonusable ace / Let's play Blackjack with Monte Carlo
no operation (NOOP) action / Data preparation

O

OpenAI
- reference / Further reading
- about / OpenAI Gym
- Gym / Gym
OpenAI Five / Multi-agent reinforcement learning
OpenAI Gym
- about / OpenAI Gym and Universe, OpenAI Gym
- error fixes / Common error fixes
- basic cart pole environment, simulating / Basic simulations
- robot, training to walk / Training a robot to walk
- installation / Installation
- environment, running / Running an environment
- Atari / Atari
- algorithmic tasks / Algorithmic tasks
- MuJoCo / MuJoCo
- Robotics / Robotics
- reference / Introduction to control tasks
OpenAI Universe
- about / OpenAI Gym and Universe, OpenAI Universe
- video game bot, building / Building a video game bot
optimal value / The Bellman equation and optimality

P

partially observable environment / Partially observable environment
partially observable Markov Decision Process (POMDP) / DRQN
Pendulum
- specifications / The classic control tasks
pi value
- estimating, with Monte Carlo method / Estimating the value of pi using Monte Carlo
placeholders / Placeholders
playout / Simulation
policy function / Policy function, Building a video game bot, The policy function
policy gradient
- about / Policy gradient
- using, for Lunar Lander / Lunar Lander using policy gradients
- URL / Lunar Lander using policy gradients
PolicyValueNetwork
- and MCTS, combining / Combining PolicyValueNetwork and MCTS
- alphagozero_agent.py / alphagozero_agent.py
prioritized experience replay / Prioritized experience replay
Project Malmo / Project Malmo
proportional prioritization / Prioritized experience replay
Proximal Policy Optimization (PPO) / Proximal Policy Optimization

Q

Q learning, TD control
- about / Q learning
- used, for solving taxi problem / Solving the taxi problem using Q learning
- and SARSA algorithm, differentiating / The difference between Q learning and SARSA

R

rectifier nonlinearity (ReLU) / Demonstrating basic Q-learning algorithm
recurrent deterministic policy gradient algorithm (RDPG) / DPG algorithm
reinforcement learning
- basic elements / Basic elements of reinforcement learning
- shortcomings / The shortcomings of reinforcement learning
- resource efficiency / Resource efficiency
- reproducibility / Reproducibility
- explainability/accountability / Explainability/accountability
- attacks, susceptibility to / Susceptibility to attacks
- limitations, addressing / Addressing the limitations
reinforcement learning (RL)
- about / What is RL?, Policy gradient
- algorithm / RL algorithm
- comparing, with ML paradigms / How RL differs from other ML paradigms
- elements / Elements of RL
reinforcement learning, developments
- about / Upcoming developments in reinforcement learning
- transfer learning / Transfer learning
- multi-agent reinforcement learning / Multi-agent reinforcement learning
REINFORCE method / Neural Architecture Search, Training the Controller
replay buffer
- building / Replay memory
RL environments
- types / Types of RL environment
- deterministic environment / Deterministic environment
- stochastic environment / Stochastic environment
- fully observable environment / Fully observable environment
- partially observable environment / Partially observable environment
- discrete environment / Discrete environment
- continuous environment / Continuous environment
- episodic and non-episodic environment / Episodic and non-episodic environment
- single and multi-agent environment / Single and multi-agent environment
RL platforms
- about / RL platforms
- OpenAI Universe / OpenAI Gym and Universe
- OpenAI Gym / OpenAI Gym and Universe
- DeepMind Lab / DeepMind Lab
- RL-Glue / RL-Glue
- Project Malmo / Project Malmo
- ViZDoom / ViZDoom
Robotics / Robotics
rollout / Simulation

S

SARSA algorithm, TD control
- about / SARSA
- used, for solving taxi problem / Solving the taxi problem using SARSA
- and Q learning, differentiating / The difference between Q learning and SARSA
sequential environment / Episodic and non-episodic environment
SGF (Smart Game Format) / alphagozero_agent.py
single-agent environment / Single and multi-agent environment
soft attention / Architecture of DARQN
softmax exploration algorithm / The softmax exploration algorithm
state-action value function (Q function) / State-action value function (Q function)
state value function / State value function
stochastic environment / Stochastic environment
system, setting up
- about / Setting up your machine
- Anaconda, installing / Installing Anaconda
- Docker, installing / Installing Docker
- OpenAI Universe, installing / Installing OpenAI Gym and Universe
- OpenAI Gym, installing / Installing OpenAI Gym and Universe

T

TD control
- about / TD control
- off-policy learning algorithm / TD control
- on-policy learning algorithm / TD control
- Q learning / Q learning
- State-Action-Reward-State-Action (SARSA) algorithm / SARSA
temporal-difference (TD) learning / TD learning
temporal-difference (TD) prediction / TD prediction
TensorBoard
- about / TensorBoard
- scope, adding / Adding scope
- network visualization / Visualization in TensorBoard
TensorFlow
- variables / Variables, constants, and placeholders
- placeholders / Variables, constants, and placeholders
- constants / Variables, constants, and placeholders
- computation graph / Computation graph
- sessions / Sessions
- TensorBoard / TensorBoard
- reference / Further reading
Thompson sampling (TS) algorithm / The Thompson sampling algorithm
TMUX
- about / Implementation of A3C
- reference / Implementation of A3C
Trust Region Policy Optimization (TRPO) / Trust Region Policy Optimization
trust region policy optimization (TRPO) algorithm
- about / Trust region policy optimization, TRPO algorithm
- theory / Theory behind TRPO
- experiments, on MuJoCo tasks / Experiments on MuJoCo tasks
types, attention layer
- soft attention / Architecture of DARQN
- hard attention / Architecture of DARQN

U

upper confidence bound (UCB) algorithm / The upper confidence bound algorithm
Upper Confidence Bound 1 Applied to Trees (UCT) / Selection
usable ace / Let's play Blackjack with Monte Carlo

V

value function / Value function
variables / Variables
video game bot
- building / Building a video game bot
ViZDoom / ViZDoom
