Chapter 9 – Deep Q Network and Its Variants
- When the environment has a large number of states and actions, computing the Q value of every possible state-action pair exhaustively becomes very expensive. So, we use a deep Q network (DQN) to approximate the Q function; a minimal network of this kind is sketched after this list.
- We use a buffer called the replay buffer to store the agent's experience, and we train the network on minibatches sampled from this buffer. The replay buffer is usually implemented as a queue (first in, first out) rather than a plain list, so when the buffer is full and a new experience arrives, the oldest experience is removed to make room for the new one; see the replay buffer sketch after this list.
- When the target and predicted values depend on the same parameter θ, the target shifts with every update, which makes the mean squared error unstable, causes a lot of divergence during training, and makes the network learn poorly. So, we use a separate target network to compute the target value, as shown in the target network sketch after this list.
- Unlike with DQNs, in double DQNs, we compute the target value using two Q functions: the main network selects the best action for the next state, and the target network evaluates the Q value of that selected action. This decoupling reduces the overestimation of Q values; the target computation is sketched at the end of this list.
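A deep Q network is simply a function approximator that maps a state to one Q value per action. The sketch below is a minimal example, assuming a 4-dimensional state and 2 discrete actions (both hypothetical values chosen for illustration), not the exact architecture used in the chapter.

```python
import torch
import torch.nn as nn

# Minimal Q-network sketch: maps a state vector to one Q value per action.
# The state dimension (4) and number of actions (2) are assumed for illustration.
class QNetwork(nn.Module):
    def __init__(self, state_dim=4, num_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork()
state = torch.randn(1, 4)         # a dummy state
q_values = q_net(state)           # Q(s, a) for every action in one forward pass
action = q_values.argmax(dim=1)   # greedy action
```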
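A minimal replay buffer sketch using Python's `collections.deque` with a fixed `maxlen`, so that once the buffer is full the oldest experience is evicted automatically (first in, first out). The transition format (state, action, reward, next state, done flag) and the capacity are assumptions for illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        # A deque with maxlen gives FIFO behaviour: when the buffer is full,
        # appending a new transition silently removes the oldest one.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Draw a random minibatch of past transitions for training.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```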
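The following is a minimal sketch of how a separate target network keeps the target value from shifting with every update of the main network's parameter θ. The network sizes, discount factor, learning rate, and update frequency are illustrative assumptions, not values prescribed by the chapter.

```python
import copy
import torch
import torch.nn as nn

gamma = 0.99  # discount factor (assumed)
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)  # target network: a frozen copy of the main network
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def train_step(states, actions, rewards, next_states, dones):
    # states/next_states: float tensors of shape (batch, 4)
    # actions: int64 tensor of chosen action indices, shape (batch,)
    # rewards/dones: float tensors of shape (batch,)
    # Predicted value comes from the main network (parameter theta).
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Target value comes from the target network, so it does not move
    # every time theta is updated.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        y = rewards + gamma * next_q * (1 - dones)
    loss = nn.functional.mse_loss(q_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Periodically (every few thousand steps, for example) copy the main
# network's weights into the target network:
# target_net.load_state_dict(q_net.state_dict())
```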
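For the last point, a hedged sketch of only the double DQN target computation: the main network picks the best next action and the target network evaluates it. The networks, discount factor, and tensor shapes are assumptions carried over from the previous sketch.

```python
import torch

def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        # 1. The main network selects the best action for the next state.
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        # 2. The target network evaluates the Q value of that selected action.
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        # Target value: reward plus the discounted value of the selected action.
        return rewards + gamma * next_q * (1 - dones)
```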