Summary
We started the chapter by learning what deep Q networks are and how they are used to approximate the Q value. We learned that in a DQN, we use a buffer called the replay buffer to store the agent's experience. Then, we randomly sample a minibatch of transitions from the replay buffer and train the network by minimizing the mean squared error (MSE) between the target value and the predicted Q value. Moving on, we looked at the DQN algorithm in more detail, and then we learned how to implement DQN to play Atari games.
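To make the recap concrete, here is a minimal sketch of such a replay buffer in Python; the class name, the capacity, and the batch size are arbitrary choices for illustration rather than the exact code used in the chapter:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        # oldest transitions are discarded once the buffer is full
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        # store one transition of the agent's experience
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # randomly sample a minibatch of transitions for training
        return random.sample(self.buffer, batch_size)
```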
Following this, we learned that the DQN overestimates the target value due to the max operator. So, we used double DQN, where the target value computation uses two Q functions: one Q function, parameterized by the main network parameters, is used for action selection, and the other Q function, parameterized by the target network parameters, is used to compute the Q value of the selected action.
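The double DQN target can be sketched as follows, assuming hypothetical Q-value estimates for the next state; the array values and the names q_main and q_target are made up purely for illustration:

```python
import numpy as np

# hypothetical Q-value estimates for the next state s' over 4 actions
q_main   = np.array([1.2, 3.5, 2.1, 0.7])   # main network, Q(s', a; theta)
q_target = np.array([1.0, 2.8, 2.4, 0.9])   # target network, Q(s', a; theta')

reward, gamma, done = 1.0, 0.99, False

# select the action with the main network ...
best_action = np.argmax(q_main)
# ... but evaluate that action with the target network
y = reward + (1 - done) * gamma * q_target[best_action]
```

Separating action selection from action evaluation in this way is what reduces the overestimation caused by the max operator.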
Going ahead, we learned about DQN with prioritized experience replay, where transitions are prioritized based on their TD error. We explored...
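As a rough sketch of prioritizing transitions by their TD error, assuming the proportional scheme with illustrative values for the TD errors, the exponent, and the small constant that keeps priorities non-zero:

```python
import numpy as np

td_errors = np.array([0.5, 0.1, 2.0, 0.05])   # hypothetical TD errors
alpha, eps = 0.6, 1e-5                        # illustrative hyperparameters

# priority is proportional to the magnitude of the TD error
priorities = (np.abs(td_errors) + eps) ** alpha
probs = priorities / priorities.sum()

# transitions with a larger TD error are sampled more often
batch_idx = np.random.choice(len(td_errors), size=2, p=probs, replace=False)
```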