Prioritized replay buffer
The next very useful idea for improving DQN training was proposed in 2015 in the paper Prioritized experience replay [Sch+15]. This method improves the sample efficiency of the replay buffer by prioritizing samples according to the training loss.
The basic DQN uses the replay buffer to break the correlation between consecutive transitions in our episodes. As we discussed in Chapter 6, the examples we experience during an episode are highly correlated, because most of the time the environment is "smooth" and doesn't change much in response to our actions. However, the stochastic gradient descent (SGD) method assumes that the training data is independent and identically distributed (i.i.d.). To solve this problem, the classic DQN method uses a large buffer of transitions, sampled uniformly at random to form the next training batch.
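To make the contrast concrete, below is a minimal sketch of a proportional prioritized buffer. The class name, the alpha value, and the use of NumPy are illustrative assumptions rather than the book's own code, and the importance-sampling weight correction that the full method also requires is omitted for brevity.

import numpy as np

class PrioritizedReplayBuffer:
    """Minimal proportional prioritized replay buffer (sketch).

    New transitions get the current maximum priority so they are
    sampled at least once; priorities are later updated from the
    training loss (for example, the absolute TD error).
    """
    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha          # how strongly priorities skew sampling (0 = uniform)
        self.buffer = []
        self.priorities = np.zeros(capacity, dtype=np.float32)
        self.pos = 0

    def append(self, transition):
        max_prio = self.priorities.max() if self.buffer else 1.0
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            self.buffer[self.pos] = transition
        self.priorities[self.pos] = max_prio
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        prios = self.priorities[:len(self.buffer)]
        probs = prios ** self.alpha
        probs /= probs.sum()
        # Uniform sampling would be np.random.choice(len(self.buffer), batch_size);
        # here, samples with a larger recorded loss are drawn more often.
        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        return [self.buffer[idx] for idx in indices], indices

    def update_priorities(self, indices, losses):
        # Called after the training step with the per-sample loss values.
        for idx, loss in zip(indices, losses):
            self.priorities[idx] = abs(loss) + 1e-5   # small epsilon keeps every sample reachable

After each training step, the per-sample losses of the batch are fed back via update_priorities, so transitions that the network currently predicts poorly become more likely to be replayed.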
The authors of the paper questioned this uniform random sampling policy and proved that...