Index
A
- Acrobot
- settings / The classic control tasks
- agent
- about / Agent
- training, to play Doom / Training an agent to play Doom
- agent environment interface / Agent environment interface
- algorithmic tasks / Algorithmic tasks
- AlphaGo
- about / AlphaGo
- supervised learning policy networks / Supervised learning policy networks
- reinforcement learning policy networks / Reinforcement learning policy networks
- value network / Value network
- neural networks and MCTS, combining / Combining neural networks and MCTS
- AlphaGo Zero
- about / AlphaGo Zero, Putting everything together
- training / Training AlphaGo Zero
- implementing / Implementing AlphaGo Zero
- policy and value networks / Policy and value networks
- preprocessing.py module / preprocessing.py
- features.py module / features.py
- network.py module / network.py
- alphagozero_agent.py / alphagozero_agent.py
- controller.py / controller.py
- train.py / train.py
- features.py module / Helper methods
- Anaconda
- download link / Installing Anaconda
- applications, RL
- about / Applications of RL
- education / Education
- medicine and healthcare / Medicine and healthcare
- manufacturing sector / Manufacturing
- inventory management / Inventory management
- finance / Finance
- Natural Language Processing (NLP) / Natural Language Processing and Computer Vision
- Computer Vision (CV) / Natural Language Processing and Computer Vision
- architecture, Deep Q Networks (DQN)
- convolutional network / Convolutional network
- experience replay / Experience replay
- target network / Target network
- rewards, clipping / Clipping rewards
- algorithm / Understanding the algorithm
- asynchronous advantage actor-critic (A3C) algorithm
- about / Asynchronous advantage actor-critic algorithm
- implementing / Implementation of A3C
- experiments / Experiments
- Asynchronous Advantage Actor Critic (A3C) algorithm
- about / The Asynchronous Advantage Actor Critic, The three As
- architecture / The architecture of A3C
- working / How A3C works
- mountain car example / Driving up a mountain with A3C
- network, visualizing in TensorBoard / Visualization in TensorBoard
- Atari 2600 games
- references / Introduction to Atari games
- unsolved issues / Demonstrating basic Q-learning algorithm
- Atari emulator
- building / Building an Atari emulator, Getting started
- implementation / Implementation of the Atari emulator
- implementing, gym used / Atari simulator using gym
- Atari game
- playing, by building agent / Building an agent to play Atari games
- Atari games
- about / Introduction to Atari games
- data preparation / Data preparation
- playing / Atari
B
- backpropagation / Update
- backpropagation through time (BPTT) / DPG algorithm
- basic elements, reinforcement learning
- state / Basic elements of reinforcement learning
- reward function / Basic elements of reinforcement learning
- policy function / Basic elements of reinforcement learning
- value function / Basic elements of reinforcement learning
- basic Q-learning algorithm
- demonstrating / Demonstrating basic Q-learning algorithm
- Bellman equation
- about / The Bellman equation and optimality
- deriving, for value and Q function / Deriving the Bellman equation for value and Q functions
- solving / Solving the Bellman equation
- Bellman equation, solving
- dynamic programming (DP) technique, using / Dynamic programming
- Blackjack game
- playing, with Monte Carlo / Let's play Blackjack with Monte Carlo
- board state / Go and other board games
C
- car racing game
- dueling DQN, using in / Car racing
- CartPole
- about / Running an environment, CartPole
- specifications / The classic control tasks
- chatbot
- background, issues / The background problem
- dataset / Dataset
- step-by-step guide / Step-by-step guide
- data parser / Data parser
- data reader / Data reader
- helper methods / Helper methods
- model / Chatbot model
- data, training / Training the data
- testing / Testing and results
- results / Testing and results
- classic control tasks / The classic control tasks
- conjugate gradient method
- URL / Trust Region Policy Optimization
- constants / Constants
- contextual bandits
- about / Contextual bandits
- reference / Further reading
- continuous environment / Continuous environment
- control tasks
- about / Introduction to control tasks, Getting started
- classic control tasks / The classic control tasks
D
- data preparation, Atari games / Data preparation
- Deep Attention Recurrent Q Network (DARQN)
- about / DARQN
- architecture / Architecture of DARQN
- attention layer / Architecture of DARQN
- deep deterministic policy gradient (DDPG)
- about / Deep deterministic policy gradient
- used, for swinging pendulum / Swinging a pendulum
- DeepMind Lab / DeepMind Lab
- deep Q-learning
- about / Deep Q-learning
- basic elements, of reinforcement learning / Basic elements of reinforcement learning
- basic Q-learning algorithm, demonstrating / Demonstrating basic Q-learning algorithm
- deep Q-learning algorithm (DQN)
- about / Deep Q-learning
- implementing / Implementation of DQN
- experiments / Experiments
- Deep Q Network (DQN)
- about / What is a Deep Q Network?
- architecture / Architecture of DQN
- dueling network architecture / Dueling network architecture
- Deep Recurrent Q Network (DRQN)
- about / DRQN
- architecture / Architecture of DRQN
- Doom / Doom with DRQN
- deterministic environment / Deterministic environment
- deterministic policy gradient (DPG)
- about / Deterministic policy gradient
- actor-critic architecture / Deterministic policy gradient
- theory / The theory behind policy gradient
- algorithm / DPG algorithm
- implementing / Implementation of DDPG
- experiments / Experiments
- discrete environment / Discrete environment
- Docker
- installing / Installing Docker
- download link / Installing Docker
- Doom
- playing, by training agent / Training an agent to play Doom
- about / Basic Doom game
- with Deep Recurrent Q Network (DRQN) / Doom with DRQN
- Double DQN / Double DQN
- dueling network
- architecture / Dueling network architecture
- building / Dueling network
- dynamic programming (DP) technique
- about / Dynamic programming, Monte Carlo prediction
- value iteration algorithm, using / Value iteration
- policy iteration algorithm, using / Policy iteration
E
- elements, reinforcement learning (RL)
- agent / Agent
- policy function / Policy function
- value function / Value function
- model / Model
- environment wrapper functions / Environment wrapper functions
- episodic environment / Episodic and non-episodic environment
- epsilon-greedy policy / The epsilon-greedy policy
- experience replay / Experience replay
F
- financial market
- background, issues / Background problem
- data used / Data used
- step-by-step guide / Step-by-step guide
- actor script / Actor script
- critic script / Critic script
- agent script / Agent script
- helper script / Helper script
- data, training / Training the data
- final result / Final result
- frame-skipping technique / Data preparation
- frozen lake problem
- solving / Solving the frozen lake problem
- value iteration algorithm, using / Value iteration
- frozen lake problem, solving
- value iteration algorithm, using / Value iteration
- policy iteration algorithm, using / Policy iteration
- fully observable environment / Fully observable environment
G
- Go
- about / A brief introduction to Go
- and other board games / Go and other board games
- and AI research / Go and AI research
- GridWorld game
- reference / Experiments
H
- hard attention / Architecture of DARQN
- Hidden Markov model / Markov models
J
- Jupyter notebook
- URL / Lunar Lander using policy gradients
K
- Kullback–Leibler (KL) / Trust Region Policy Optimization
M
- Markov chain / The Markov chain and Markov process
- Markov Decision Process (MDP)
- about / Markov Decision Process
- rewards and returns / Rewards and returns
- episodic tasks / Episodic and continuous tasks
- continuous tasks / Episodic and continuous tasks
- discount factor / Discount factor
- policy function / The policy function
- state value function / State value function
- state-action value function (Q function) / State-action value function (Q function)
- reference / Questions
- Markov models
- about / Markov models
- CartPole / CartPole
- Markov process / The Markov chain and Markov process
- Massively Multiplayer Online Role Playing Game (MMORPGs) / Multi-agent reinforcement learning
- MC-ES algorithm / Monte Carlo exploration starts
- mean-squared error (MSE) / Value network
- Minecraft environment
- about / Introduction to the Minecraft environment
- data preparation / Data preparation
- model / Model
- model-free / Demonstrating basic Q-learning algorithm
- Monte Carlo
- methods / Monte Carlo methods
- used, for pi value estimation / Estimating the value of pi using Monte Carlo
- prediction algorithm / Monte Carlo prediction
- about / First visit Monte Carlo
- Blackjack game, playing with / Let's play Blackjack with Monte Carlo
- Monte Carlo control
- about / Monte Carlo control
- exploration / Monte Carlo exploration starts
- on-policy Monte Carlo control / On-policy Monte Carlo control
- Monte Carlo control / On-policy Monte Carlo control
- off-policy Monte Carlo control / Off-policy Monte Carlo control
- Monte Carlo exploring starts concept / Monte Carlo exploration starts
- Monte Carlo prediction algorithm
- about / Monte Carlo prediction
- first visit / First visit Monte Carlo
- every visit / Every visit Monte Carlo
- Monte Carlo tree search
- about / Monte Carlo tree search
- selection / Selection
- expansion / Expansion
- simulation / Simulation
- update step / Update
- mcts.py / mcts.py
- MuJoCo
- about / MuJoCo
- reference / Introduction to control tasks
- multi-agent environment / Single and multi-agent environment
- multi-agent reinforcement learning / Multi-agent reinforcement learning
- multi-armed bandit (MAB)
- applications / Applications of MAB
- used, for identifying advertisement banner / Identifying the right advertisement banner using MAB
- reference / Questions
- multi-armed bandit (MAB) problem
- about / The MAB problem
- epsilon-greedy policy / The epsilon-greedy policy
- softmax exploration algorithm / The softmax exploration algorithm
- upper confidence bound (UCB) algorithm / The upper confidence bound algorithm
- Thompson sampling (TS) algorithm / The Thompson sampling algorithm
N
- NAS, implementing
- about / Implementing NAS
- child_network.py module / child_network.py
- cifar10_processor.py / cifar10_processor.py
- controller.py module / controller.py
- controller generating, ways / Method for generating the Controller
- child network generating, controller used / Generating a child network using the Controller
- train_controller method / train_controller method
- ChildCNN, testing / Testing ChildCNN
- config.py module / config.py
- train.py module / train.py
- exercises / Additional exercises
- advantages / Advantages of NAS
- network
- training / Training the network
- neural architecture search
- about / Neural Architecture Search
- child networks, generating / Generating and training child networks
- child networks, training / Generating and training child networks
- controller, training / Training the Controller
- algorithm, training / Training algorithm
- non-episodic environment / Episodic and non-episodic environment
- nonusable ace / Let's play Blackjack with Monte Carlo
- no operation (NOOP) action / Data preparation
O
- OpenAI
- reference / Further reading
- about / OpenAI Gym
- Gym / Gym
- OpenAI Five / Multi-agent reinforcement learning
- OpenAI Gym
- about / OpenAI Gym and Universe, OpenAI Gym
- error fixes / Common error fixes
- basic cart pole environment, simulating / Basic simulations
- robot, training to walk / Training a robot to walk
- installation / Installation
- environment, running / Running an environment
- Atari / Atari
- algorithmic tasks / Algorithmic tasks
- MuJoCo / MuJoCo
- Robotics / Robotics
- reference / Introduction to control tasks
- OpenAI Universe
- about / OpenAI Gym and Universe, OpenAI Universe
- video game bot, building / Building a video game bot
- optimal value / The Bellman equation and optimality
P
- partially observable environment / Partially observable environment
- partially observable Markov Decision Process (POMDP) / DRQN
- Pendulum
- specifications / The classic control tasks
- pi value
- estimating, with Monte Carlo method / Estimating the value of pi using Monte Carlo
- placeholders / Placeholders
- playout / Simulation
- policy function / Policy function, Building a video game bot, The policy function
- policy gradient
- about / Policy gradient
- using, for Lunar Lander / Lunar Lander using policy gradients
- URL / Lunar Lander using policy gradients
- PolicyValueNetwork
- and MCTS, combining / Combining PolicyValueNetwork and MCTS
- alphagozero_agent.py / alphagozero_agent.py
- prioritized experience replay / Prioritized experience replay
- Project Malmo / Project Malmo
- proportional prioritization / Prioritized experience replay
- Proximal Policy Optimization (PPO) / Proximal Policy Optimization
Q
- Q learning, TD control
- about / Q learning
- used, for solving taxi problem / Solving the taxi problem using Q learning
- and SARSA algorithm, differentiating / The difference between Q learning and SARSA
R
- rectifier nonlinearity (ReLU) / Demonstrating basic Q-learning algorithm
- recurrent deterministic policy gradient algorithm (RDPG) / DPG algorithm
- reinforcement learning
- basic elements / Basic elements of reinforcement learning
- shortcomings / The shortcomings of reinforcement learning
- resource efficiency / Resource efficiency
- reproducibility / Reproducibility
- explainability/accountability / Explainability/accountability
- attacks, susceptibility to / Susceptibility to attacks
- limitations, addressing / Addressing the limitations
- reinforcement learning (RL)
- about / What is RL?, Policy gradient
- algorithm / RL algorithm
- comparing, with ML paradigms / How RL differs from other ML paradigms
- elements / Elements of RL
- reinforcement learning, developments
- about / Upcoming developments in reinforcement learning
- transfer learning / Transfer learning
- multi-agent reinforcement learning / Multi-agent reinforcement learning
- REINFORCE method / Neural Architecture Search, Training the Controller
- replay buffer
- building / Replay memory
- RL environments
- types / Types of RL environment
- deterministic environment / Deterministic environment
- stochastic environment / Stochastic environment
- fully observable environment / Fully observable environment
- partially observable environment / Partially observable environment
- discrete environment / Discrete environment
- continuous environment / Continuous environment
- episodic and non-episodic environment / Episodic and non-episodic environment
- single and multi-agent environment / Single and multi-agent environment
- RL platforms
- about / RL platforms
- OpenAI Universe / OpenAI Gym and Universe
- OpenAI Gym / OpenAI Gym and Universe
- DeepMind Lab / DeepMind Lab
- RL-Glue / RL-Glue
- Project Malmo / Project Malmo
- ViZDoom / ViZDoom
- Robotics / Robotics
- rollout / Simulation
S
- SARSA algorithm, TD control
- about / SARSA
- used, for solving taxi problem / Solving the taxi problem using SARSA
- and Q learning, differentiating / The difference between Q learning and SARSA
- sequential environment / Episodic and non-episodic environment
- SGF (Smart Game Format) / alphagozero_agent.py
- single-agent environment / Single and multi-agent environment
- soft attention / Architecture of DARQN
- softmax exploration algorithm / The softmax exploration algorithm
- state-action value function (Q function) / State-action value function (Q function)
- state value function / State value function
- stochastic environment / Stochastic environment
- system, setting up
- about / Setting up your machine
- Anaconda, installing / Installing Anaconda
- Docker, installing / Installing Docker
- OpenAI Universe, installing / Installing OpenAI Gym and Universe
- OpenAI Gym, installing / Installing OpenAI Gym and Universe
T
- TD control
- about / TD control
- off-policy learning algorithm / TD control
- on-policy learning algorithm / TD control
- Q learning / Q learning
- State-Action-Reward-State-Action (SARSA) algorithm / SARSA
- temporal-difference (TD) learning / TD learning
- temporal-difference (TD) prediction / TD prediction
- TensorBoard
- about / TensorBoard
- scope, adding / Adding scope
- network visualization / Visualization in TensorBoard
- TensorFlow
- variables / Variables, constants, and placeholders
- placeholders / Variables, constants, and placeholders
- constants / Variables, constants, and placeholders
- computation graph / Computation graph
- sessions / Sessions
- TensorBoard / TensorBoard
- reference / Further reading
- Thompson sampling (TS) algorithm / The Thompson sampling algorithm
- TMUX
- about / Implementation of A3C
- reference / Implementation of A3C
- Trust Region Policy Optimization (TRPO) / Trust Region Policy Optimization
- trust region policy optimization (TRPO) algorithm
- about / Trust region policy optimization, TRPO algorithm
- theory / Theory behind TRPO
- experiments, on MuJoCo tasks / Experiments on MuJoCo tasks
- types, attention layer
- soft attention / Architecture of DARQN
- hard attention / Architecture of DARQN
U
- upper confidence bound (UCB) algorithm / The upper confidence bound algorithm
- Upper Confidence Bound 1 Applied to Trees (UCT) / Selection
- usable ace / Let's play Blackjack with Monte Carlo
V
- value function / Value function
- variables / Variables
- video game bot
- building / Building a video game bot
- ViZDoom / ViZDoom