The experiment results
Unfortunately, the paper provided no details about several important aspects of the method, such as the training hyperparameters, how deeply the cubes were scrambled during training, and the convergence obtained. To fill in the blanks, I experimented with various hyperparameter values (the .ini files are available in the GitHub repo), but my results are still very different from those published in the paper. I observed that the training convergence of the original method is very unstable. Even with a small learning rate and a large batch size, training eventually diverges, with the value loss component growing exponentially. Examples of this behavior are shown in Figure 21.5 and Figure 21.6 (obtained from the 2 × 2 environment):
Figure 21.5: Values predicted by the value head during training with the paper’s method
Figure 21.6: The policy loss (left) and value loss ...
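The divergence described above, with the value loss growing exponentially, can also be flagged automatically during a run. The following is a minimal sketch and is not taken from the repository: the function names `loss_growth_rate` and `is_diverging`, the window size, and the threshold are all illustrative assumptions. It relies on the fact that an exponentially growing loss is a straight line in log space, so a persistently positive slope of log(loss) over recent steps is a warning sign.

```python
import math


def loss_growth_rate(losses, window=50):
    """Least-squares slope of log(loss) over the last `window` values.

    A slope of r means the loss is multiplied by roughly exp(r) per step.
    (Hypothetical helper, not part of the book's code.)
    """
    # Clamp to a tiny positive value so log() is always defined
    tail = [math.log(max(v, 1e-12)) for v in losses[-window:]]
    n = len(tail)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2.0
    mean_y = sum(tail) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(tail))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def is_diverging(losses, window=50, threshold=0.05):
    """True if a full window is available and log-loss grows faster than `threshold`.

    The default threshold is an arbitrary assumption and needs tuning.
    """
    return len(losses) >= window and loss_growth_rate(losses, window) > threshold
```

In a training loop, the value loss of every batch could be appended to a list and `is_diverging` checked periodically to stop a run early. A suitable threshold depends on how noisy the loss is, so the default here should be treated as a starting point only.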