Python Reinforcement Learning: Solve complex real-world problems by mastering reinforcement learning algorithms using OpenAI Gym and TensorFlow

Product type Course
Published in Apr 2019
Publisher Packt
ISBN-13 9781838649777
Length 496 pages
Edition 1st Edition
Authors (4):
Yang Wenzhuo
Sean Saito
Sudharsan Ravichandiran
Rajalingappaa Shanmugamani

Table of Contents (27 chapters)

Title Page
About Packt
Contributors
Preface
1. Introduction to Reinforcement Learning
2. Getting Started with OpenAI and TensorFlow
3. The Markov Decision Process and Dynamic Programming
4. Gaming with Monte Carlo Methods
5. Temporal Difference Learning
6. Multi-Armed Bandit Problem
7. Playing Atari Games
8. Atari Games with Deep Q Network
9. Playing Doom with a Deep Recurrent Q Network
10. The Asynchronous Advantage Actor Critic Network
11. Policy Gradients and Optimization
12. Balancing CartPole
13. Simulating Control Tasks
14. Building Virtual Worlds in Minecraft
15. Learning to Play Go
16. Creating a Chatbot
17. Generating a Deep Learning Image Classifier
18. Predicting Future Stock Prices
19. Capstone Project - Car Racing Using DQN
20. Looking Ahead
Assessments
Other Books You May Enjoy
Index

Index

A

  • Acrobot
    • settings / The classic control tasks
  • agent
    • about / Agent
    • training, to play Doom / Training an agent to play Doom 
  • agent environment interface / Agent environment interface
  • algorithmic tasks / Algorithmic tasks
  • AlphaGo
    • about / AlphaGo
    • supervised learning policy networks / Supervised learning policy networks
    • reinforcement learning policy networks / Reinforcement learning policy networks
    • value network / Value network
    • neural networks and MCTS, combining / Combining neural networks and MCTS
  • AlphaGo Zero
    • about / AlphaGo Zero, Putting everything together
    • training / Training AlphaGo Zero
    • implementing / Implementing AlphaGo Zero
    • policy and value networks / Policy and value networks
    • preprocessing.py module / preprocessing.py
    • features.py module / features.py
    • network.py module / network.py
    • alphagozero_agent.py / alphagozero_agent.py
    • controller.py / controller.py
    • train.py / train.py
    • features.py module / Helper methods
  • Anaconda
    • download link / Installing Anaconda
  • applications, RL
    • about / Applications of RL
    • education / Education
    • medicine and healthcare / Medicine and healthcare
    • manufacturing sector / Manufacturing
    • inventory management / Inventory management
    • finance / Finance
    • Natural Language Processing (NLP) / Natural Language Processing and Computer Vision
    • Computer Vision (CV) / Natural Language Processing and Computer Vision
  • architecture, Deep Q Networks (DQN)
    • convolutional network / Convolutional network
    • experience replay / Experience replay
    • target network / Target network
    • rewards, clipping / Clipping rewards
    • algorithm / Understanding the algorithm
  • asynchronous advantage actor-critic (A3C) algorithm
    • about / Asynchronous advantage actor-critic algorithm
    • implementing / Implementation of A3C
    • experiments / Experiments
  • Asynchronous Advantage Actor Critic (A3C) algorithm
    • about / The Asynchronous Advantage Actor Critic, The three As
    • architecture / The architecture of A3C
    • working / How A3C works
    • mountain car example / Driving up a mountain with A3C
    • network, visualizing in TensorBoard / Visualization in TensorBoard
  • Atari 2600 games
    • references / Introduction to Atari games
    • unsolved issues / Demonstrating basic Q-learning algorithm
  • Atari emulator
    • building / Building an Atari emulator, Getting started
    • implementation / Implementation of the Atari emulator
    • implementing, gym used / Atari simulator using gym
  • Atari game
    • playing, by building agent / Building an agent to play Atari games
  • Atari games
    • about / Introduction to Atari games
    • data preparation / Data preparation
    • playing / Atari

B

  • backpropagation / Update
  • backpropagation through time (BPTT) / DPG algorithm
  • basic elements, reinforcement learning
    • state / Basic elements of reinforcement learning
    • reward function / Basic elements of reinforcement learning
    • policy function / Basic elements of reinforcement learning
    • value function / Basic elements of reinforcement learning
  • basic Q-learning algorithm
    • demonstrating / Demonstrating basic Q-learning algorithm
  • Bellman equation
    • about / The Bellman equation and optimality
    • deriving, for value and Q function / Deriving the Bellman equation for value and Q functions
    • solving / Solving the Bellman equation
    / Demonstrating basic Q-learning algorithm
  • Bellman equation, solving
    • dynamic programming (DP) technique, using / Dynamic programming
  • Blackjack game
    • playing, with Monte Carlo / Let's play Blackjack with Monte Carlo
  • board state / Go and other board games

C

  • car racing game
    • dueling DQN, using in / Car racing
  • CartPole
    • about / Running an environment, CartPole
    • specifications / The classic control tasks
  • chatbot
    • background, issues / The background problem
    • dataset / Dataset
    • step-by-step guide / Step-by-step guide
    • data parser / Data parser
    • data reader / Data reader
    • helper methods / Helper methods
    • model / Chatbot model
    • data, training / Training the data
    • testing / Testing and results
    • results / Testing and results
  • classic control tasks / The classic control tasks
  • conjugate gradient method
    • URL / Trust Region Policy Optimization
  • constants / Constants
  • contextual bandits
    • about / Contextual bandits
    • reference / Further reading
  • continuous environment / Continuous environment
  • control tasks
    • about / Introduction to control tasks, Getting started
    • classic control tasks / The classic control tasks

D

  • data preparation, Atari games / Data preparation
  • Deep Attention Recurrent Q Network (DARQN)
    • about / DARQN
    • architecture / Architecture of DARQN
    • attention layer / Architecture of DARQN
  • deep deterministic policy gradient (DDPG)
    • about / Deep deterministic policy gradient
    • used, for swinging pendulum / Swinging a pendulum
  • DeepMind Lab / DeepMind Lab
  • deep Q-learning
    • about / Deep Q-learning
    • basic elements, of reinforcement learning / Basic elements of reinforcement learning
    • basic Q-learning algorithm, demonstrating / Demonstrating basic Q-learning algorithm
  • deep Q-learning algorithm (DQN)
    • about / Deep Q-learning
    • implementing / Implementation of DQN
    • experiments / Experiments
  • Deep Q Network (DQN)
    • about / What is a Deep Q Network?
    • architecture / Architecture of DQN
    • dueling network architecture / Dueling network architecture
  • Deep Recurrent Q Network (DRQN)
    • about / DRQN
    • architecture / Architecture of DRQN
    • Doom / Doom with DRQN
  • deterministic environment / Deterministic environment
  • deterministic policy gradient (DPG)
    • about / Deterministic policy gradient
    • actor-critic architecture / Deterministic policy gradient
    • theory / The theory behind policy gradient
    • algorithm / DPG algorithm
    • implementing / Implementation of DDPG
    • experiments / Experiments
  • discrete environment / Discrete environment
  • Docker
    • installing / Installing Docker
    • download link / Installing Docker
  • Doom
    • playing, by training agent / Training an agent to play Doom 
    • about / Basic Doom game
    • with Deep Recurrent Q Network (DRQN) / Doom with DRQN
  • Double DQN / Double DQN
  • dueling network
    • architecture / Dueling network architecture
    • building / Dueling network
  • dynamic programming (DP) technique
    • about / Dynamic programming, Monte Carlo prediction
    • value iteration algorithm, using / Value iteration
    • policy iteration algorithm, using / Policy iteration

E

  • elements, reinforcement learning (RL)
    • agent / Agent
    • policy function / Policy function
    • value function / Value function
    • model / Model
  • environment wrapper functions / Environment wrapper functions
  • episodic environment / Episodic and non-episodic environment
  • epsilon-greedy policy / The epsilon-greedy policy
  • experience replay / Experience replay

F

  • financial market
    • background, issues / Background problem
    • data used / Data used
    • step-by-step guide / Step-by-step guide
    • actor script / Actor script
    • critic script / Critic script
    • agent script / Agent script
    • helper script / Helper script
    • data, training / Training the data
    • final result / Final result
  • frame-skipping technique / Data preparation
  • frozen lake problem
    • solving / Solving the frozen lake problem
    • value iteration algorithm, using / Value iteration
  • frozen lake problem, solving
    • value iteration algorithm, using / Value iteration
    • policy iteration algorithm, using / Policy iteration
  • fully observable environment / Fully observable environment

G

  • Go
    • about / A brief introduction to Go
    • and other board games / Go and other board games
    • and AI research / Go and AI research
  • GridWorld game
    • reference / Experiments

H

  • hard attention / Architecture of DARQN
  • Hidden Markov model / Markov models

J

  • Jupyter notebook
    • URL / Lunar Lander using policy gradients

K

  • Kullback–Leibler (KL) / Trust Region Policy Optimization

M

  • Markov chain / The Markov chain and Markov process
  • Markov Decision Process (MDP)
    • about / Markov Decision Process
    • rewards and returns / Rewards and returns
    • episodic tasks / Episodic and continuous tasks
    • continuous tasks / Episodic and continuous tasks
    • discount factor / Discount factor
    • policy function / The policy function
    • state value function / State value function
    • state-action value function (Q function) / State-action value function (Q function)
    • reference / Questions
  • Markov models
    • about / Markov models
    • CartPole / CartPole
  • Markov process / The Markov chain and Markov process
  • Massively Multiplayer Online Role Playing Game (MMORPGs) / Multi-agent reinforcement learning
  • MC-ES algorithm / Monte Carlo exploration starts
  • mean-squared error (MSE) / Value network
  • Minecraft environment
    • about / Introduction to the Minecraft environment
    • data preparation / Data preparation
  • model / Model
  • model-free / Demonstrating basic Q-learning algorithm
  • Monte Carlo
    • methods / Monte Carlo methods
    • used, for pi value estimation / Estimating the value of pi using Monte Carlo
    • prediction algorithm / Monte Carlo prediction
    • about / First visit Monte Carlo
    • Blackjack game, playing with / Let's play Blackjack with Monte Carlo
  • Monte Carlo control
    • about / Monte Carlo control
    • exploration / Monte Carlo exploration starts
    • on-policy Monte Carlo control / On-policy Monte Carlo control
    • Monte Carlo control / On-policy Monte Carlo control
    • off-policy Monte Carlo control / Off-policy Monte Carlo control
  • Monte Carlo exploring starts concept / Monte Carlo exploration starts
  • Monte Carlo prediction algorithm
    • about / Monte Carlo prediction
    • first visit / First visit Monte Carlo
    • every visit / Every visit Monte Carlo
  • Monte Carlo tree search
    • about / Monte Carlo tree search
    • selection / Selection
    • expansion / Expansion
    • simulation / Simulation
    • update step / Update
    • mcts.py / mcts.py
  • MuJoCo
    • about / MuJoCo
    • reference / Introduction to control tasks
  • multi-agent environment / Single and multi-agent environment
  • multi-agent reinforcement learning / Multi-agent reinforcement learning
  • multi-armed bandit (MAB)
    • applications / Applications of MAB
    • used, for identifying advertisement banner / Identifying the right advertisement banner using MAB
    • reference / Questions
  • multi-armed bandit (MAB) problem
    • about / The MAB problem
    • epsilon-greedy policy / The epsilon-greedy policy
    • softmax exploration algorithm / The softmax exploration algorithm
    • upper confidence bound (UCB) algorithm / The upper confidence bound algorithm
    • Thompson sampling (TS) algorithm / The Thompson sampling algorithm

N

  • NAS, implementing
    • about / Implementing NAS
    • child_network.py module / child_network.py
    • cifar10_processor.py / cifar10_processor.py
    • controller.py module / controller.py
    • controller generating, ways / Method for generating the Controller
    • child network generating, controller used / Generating a child network using the Controller
    • train_controller method / train_controller method
    • ChildCNN, testing / Testing ChildCNN
    • config.py module / config.py
    • train.py module / train.py
    • exercises / Additional exercises
    • advantages / Advantages of NAS
  • network
    • training / Training the network
  • neural architecture search
    • about / Neural Architecture Search
    • child networks, generating / Generating and training child networks
    • child networks, training / Generating and training child networks
    • controller, training / Training the Controller
    • algorithm, training / Training algorithm
  • non-episodic environment / Episodic and non-episodic environment
  • nonusable ace / Let's play Blackjack with Monte Carlo
  • no operation (NOOP) action / Data preparation

O

  • OpenAI
    • reference / Further reading
    • about / OpenAI Gym
    • Gym / Gym
  • OpenAI Five / Multi-agent reinforcement learning
  • OpenAI Gym
    • about / OpenAI Gym and Universe, OpenAI Gym
    • error fixes / Common error fixes
    • basic cart pole environment, simulating / Basic simulations
    • robot, training to walk / Training a robot to walk
    • installation / Installation 
    • environment, running / Running an environment
    • Atari / Atari
    • algorithmic tasks / Algorithmic tasks
    • MuJoCo / MuJoCo
    • Robotics / Robotics
    • reference / Introduction to control tasks
  • OpenAI Universe
    • about / OpenAI Gym and Universe, OpenAI Universe
    • video game bot, building / Building a video game bot
  • optimal value / The Bellman equation and optimality

P

  • partially observable environment / Partially observable environment
  • partially observable Markov Decision Process (POMDP) / DRQN
  • Pendulum
    • specifications / The classic control tasks
  • pi value
    • estimating, with Monte Carlo method / Estimating the value of pi using Monte Carlo
  • placeholders / Placeholders
  • playout / Simulation
  • policy function / Policy function, Building a video game bot, The policy function
  • policy gradient
    • about / Policy gradient
    • using, for Lunar Lander / Lunar Lander using policy gradients
    • URL / Lunar Lander using policy gradients
  • PolicyValueNetwork
    • and MCTS, combining / Combining PolicyValueNetwork and MCTS
    • alphagozero_agent.py / alphagozero_agent.py
  • prioritized experience replay / Prioritized experience replay
  • Project Malmo / Project Malmo
  • proportional prioritization / Prioritized experience replay
  • Proximal Policy Optimization (PPO) / Proximal Policy Optimization

Q

  • Q learning, TD control
    • about / Q learning
    • used, for solving taxi problem / Solving the taxi problem using Q learning
    • and SARSA algorithm, differentiating / The difference between Q learning and SARSA

R

  • rectifier nonlinearity (RELU) / Demonstrating basic Q-learning algorithm
  • recurrent deterministic policy gradient algorithm (RDPG) / DPG algorithm
  • reinforcement learning
    • basic elements / Basic elements of reinforcement learning
    • shortcomings / The shortcomings of reinforcement learning
    • resource efficiency / Resource efficiency
    • reproducibility / Reproducibility
    • explainability/accountability / Explainability/accountability
    • attacks, susceptibility to / Susceptibility to attacks
    • limitations, addressing / Addressing the limitations
  • reinforcement learning (RL)
    • about / What is RL?, Policy gradient
    • algorithm / RL algorithm
    • comparing, with ML paradigms / How RL differs from other ML paradigms
    • elements / Elements of RL
  • reinforcement learning, developments
    • about / Upcoming developments in reinforcement learning
    • transfer learning / Transfer learning
    • multi-agent reinforcement learning / Multi-agent reinforcement learning
  • REINFORCE method / Neural Architecture Search, Training the Controller
  • replay buffer
    • building / Replay memory
  • RL environments
    • types / Types of RL environment
    • deterministic environment / Deterministic environment
    • stochastic environment / Stochastic environment
    • fully observable environment / Fully observable environment
    • partially observable environment / Partially observable environment
    • discrete environment / Discrete environment
    • continuous environment / Continuous environment
    • episodic and non-episodic environment / Episodic and non-episodic environment
    • single and multi-agent environment / Single and multi-agent environment
  • RL platforms
    • about / RL platforms
    • OpenAI Universe / OpenAI Gym and Universe
    • OpenAI Gym / OpenAI Gym and Universe
    • DeepMind Lab / DeepMind Lab
    • RL-Glue / RL-Glue
    • Project Malmo / Project Malmo
    • ViZDoom / ViZDoom
  • Robotics / Robotics
  • rollout / Simulation

S

  • SARSA algorithm, TD control
    • about / SARSA
    • used, for solving taxi problem / Solving the taxi problem using SARSA
    • and Q learning, differentiating / The difference between Q learning and SARSA
  • sequential environment / Episodic and non-episodic environment
  • SGF (Smart Game Format) / alphagozero_agent.py
  • single-agent environment / Single and multi-agent environment
  • soft attention / Architecture of DARQN
  • softmax exploration algorithm / The softmax exploration algorithm
  • state-action value function (Q function) / State-action value function (Q function)
  • state value function / State value function
  • stochastic environment / Stochastic environment
  • system, setting up
    • about / Setting up your machine
    • Anaconda, installing / Installing Anaconda
    • Docker, installing / Installing Docker
    • OpenAI Universe, installing / Installing OpenAI Gym and Universe
    • OpenAI Gym, installing / Installing OpenAI Gym and Universe

T

  • TD control
    • about / TD control
    • off-policy learning algorithm / TD control
    • on-policy learning algorithm / TD control
    • Q learning / Q learning
    • State-Action-Reward-State-Action (SARSA) algorithm / SARSA
  • temporal-difference (TD) learning / TD learning
  • temporal-difference (TD) prediction / TD prediction
  • TensorBoard
    • about / TensorBoard
    • scope, adding / Adding scope
    • network visualization / Visualization in TensorBoard
  • TensorFlow
    • variables / Variables, constants, and placeholders
    • placeholders / Variables, constants, and placeholders
    • constants / Variables, constants, and placeholders
    • computation graph / Computation graph
    • sessions / Sessions
    • TensorBoard / TensorBoard
    • reference / Further reading
  • Thompson sampling (TS) algorithm / The Thompson sampling algorithm
  • TMUX
    • about / Implementation of A3C
    • reference / Implementation of A3C
  • Trust Region Policy Optimization (TRPO) / Trust Region Policy Optimization
  • trust region policy optimization (TRPO) algorithm
    • about / Trust region policy optimization, TRPO algorithm
    • theory / Theory behind TRPO
    • experiments, on MuJoCo tasks / Experiments on MuJoCo tasks
  • types, attention layer
    • soft attention / Architecture of DARQN
    • hard attention / Architecture of DARQN

U

  • upper confidence bound (UCB) algorithm / The upper confidence bound algorithm
  • Upper Confidence Bound 1 Applied to Trees (UCT) / Selection
  • usable ace / Let's play Blackjack with Monte Carlo

V

  • value function / Value function
  • variables / Variables
  • video game bot
    • building / Building a video game bot
  • ViZDoom / ViZDoom