Play and train in separate processes
At a high level, our training loop repeats the following steps:
- Ask the current network to choose actions and execute them in our array of environments
- Put the resulting transitions into the replay buffer
- Randomly sample a training batch from the replay buffer
- Train on this batch
The purpose of the first two steps is to populate the replay buffer with samples from the environment (each a tuple of observation, action, reward, and next observation). The last two steps train our network on that data.
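To make this structure concrete, here is a minimal sketch of the sequential loop. The CartPole environment, the tiny Q-network, and the hyperparameters are placeholders chosen for illustration, not the code used in this chapter:

```python
# A minimal sketch of the sequential play-and-train loop (illustrative only).
import collections
import random

import numpy as np
import torch
import torch.nn as nn
import gymnasium as gym

Transition = collections.namedtuple(
    "Transition", ["obs", "action", "reward", "done", "next_obs"])

env = gym.make("CartPole-v1")                      # stand-in environment
obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n

net = nn.Sequential(nn.Linear(obs_size, 64), nn.ReLU(),
                    nn.Linear(64, n_actions))      # stand-in Q-network
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
buffer = collections.deque(maxlen=10_000)          # replay buffer

obs, _ = env.reset()
for step in range(1_000):
    # 1. Ask the current network to choose an action and execute it
    with torch.no_grad():
        q_values = net(torch.as_tensor(obs, dtype=torch.float32))
    action = int(q_values.argmax()) if random.random() > 0.1 \
        else env.action_space.sample()
    next_obs, reward, terminated, truncated, _ = env.step(action)

    # 2. Put the resulting transition into the replay buffer
    buffer.append(Transition(obs, action, reward, terminated, next_obs))
    obs, _ = env.reset() if (terminated or truncated) else (next_obs, None)

    if len(buffer) < 128:
        continue

    # 3. Randomly sample a training batch from the replay buffer
    batch = random.sample(buffer, 128)
    obs_b = torch.as_tensor(np.array([t.obs for t in batch]), dtype=torch.float32)
    act_b = torch.as_tensor([t.action for t in batch])
    rew_b = torch.as_tensor([t.reward for t in batch], dtype=torch.float32)
    done_b = torch.as_tensor([t.done for t in batch], dtype=torch.bool)
    next_b = torch.as_tensor(np.array([t.next_obs for t in batch]), dtype=torch.float32)

    # 4. Train on this batch (one-step Q-learning target)
    q_pred = net(obs_b).gather(1, act_b.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = rew_b + 0.99 * net(next_b).max(1).values * (~done_b)
    loss = nn.functional.mse_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Written this way, every step waits for the previous one, which is exactly the serialization the rest of this section sets out to break apart.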
The following figure illustrates these steps and makes the potential for parallelism more obvious. The training flow is shown on the left; its steps use the environments, the replay buffer, and our NN. Solid lines show the flow of data and control.
Dotted lines represent usage of the NN for training and inference.
Figure 9.6: A sequential diagram of the training process
As you can see, the top two steps...