Playing and training in separate processes
At a high level, our training loop repeats the following steps:
- Ask the current network to choose actions and execute them in our array of environments.
- Put the observations into the replay buffer.
- Randomly sample a training batch from the replay buffer.
- Train on that batch.
The purpose of the first two steps is to populate the replay buffer with experience samples from the environment (tuples of observation, action, reward, and next observation). The last two steps train our network on that experience. A minimal sketch of this loop follows.
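For reference, here is the serial version of the loop in plain Python. The helpers `play_step` and `train_on_batch`, the `Transition` tuple, and the constants are illustrative placeholders, not code from this chapter:

```python
import random
from collections import deque, namedtuple

# One replay buffer entry: observation, action, reward, next observation.
Transition = namedtuple("Transition", ["obs", "action", "reward", "next_obs"])

REPLAY_SIZE = 100_000   # illustrative buffer capacity
BATCH_SIZE = 32         # illustrative batch size


def play_step(net, envs):
    """Step 1: ask the network for actions and execute them in every
    environment, returning one Transition per environment (stub)."""
    raise NotImplementedError


def train_on_batch(net, batch):
    """Step 4: perform one optimization step on the sampled batch (stub)."""
    raise NotImplementedError


def training_loop(net, envs, iterations):
    replay_buffer = deque(maxlen=REPLAY_SIZE)
    for _ in range(iterations):
        # Steps 1 and 2: play and store the resulting transitions.
        replay_buffer.extend(play_step(net, envs))
        if len(replay_buffer) < BATCH_SIZE:
            continue
        # Steps 3 and 4: sample a random batch and train on it.
        batch = random.sample(replay_buffer, BATCH_SIZE)
        train_on_batch(net, batch)
```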
The following illustration of the preceding steps makes the potential parallelism a bit more obvious. On the left, the training flow is shown. The training steps use environments, the replay buffer, and...
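Anticipating the section title, one way to exploit this structure (assuming the playing steps need only the environments and the network, while sampling and training also need the replay buffer) is to run the playing steps in a child process that feeds transitions to the training process through a queue. The sketch below reuses `play_step`, `train_on_batch`, `REPLAY_SIZE`, and `BATCH_SIZE` from the previous listing; `build_net` and `make_envs` are additional placeholders, so this is an outline of the idea rather than the chapter's implementation:

```python
import torch.multiprocessing as mp


def play_proc(net, exp_queue):
    """Child process: steps 1 and 2 run here, pushing transitions forever."""
    envs = make_envs()          # environments are created inside the child
    while True:
        for transition in play_step(net, envs):
            exp_queue.put(transition)


if __name__ == "__main__":
    mp.set_start_method("spawn")
    net = build_net()            # illustrative network constructor
    net.share_memory()           # let the child see weight updates from the trainer
    exp_queue = mp.Queue(maxsize=1000)

    proc = mp.Process(target=play_proc, args=(net, exp_queue))
    proc.start()

    TRAIN_ITERATIONS = 10_000    # illustrative
    replay_buffer = deque(maxlen=REPLAY_SIZE)
    for _ in range(TRAIN_ITERATIONS):
        # Step 2 (consumer side): drain freshly played transitions.
        while not exp_queue.empty():
            replay_buffer.append(exp_queue.get())
        if len(replay_buffer) < BATCH_SIZE:
            continue
        # Steps 3 and 4: sample a random batch and train on it, as before.
        batch = random.sample(replay_buffer, BATCH_SIZE)
        train_on_batch(net, batch)

    proc.terminate()
    proc.join()
```

The design choice here is that the queue carries only experience, while the shared network weights flow the other way implicitly through `share_memory()`, so the playing process always acts with reasonably fresh weights without any explicit synchronization.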