MuZero results
I ran the training for 15 hours, during which it played 3,400 episodes (so the training is not particularly fast). The policy and value losses are shown in Figure 20.7. As often happens with self-play training, the charts show no obvious trend:
Figure 20.7: Policy (left) and value (right) losses for the MuZero training
During the training, almost 200 "current best" models were stored, which I evaluated in tournament mode using the play-mu.py script. Here are the top 10 models:
saves/mu-t5-6/best_010_00210.dat: w=339, l=41, d=0
saves/mu-t5-6/best_015_00260.dat: w=298, l=82, d=0
saves/mu-t5-6/best_155_02510.dat: w=287, l=93, d=0
saves/mu-t5-6/best_150_02460.dat: w=273, l=107, d=0
saves/mu-t5-6/best_140_02360.dat: w=267, l=113, d=0
saves/mu-t5-6/best_145_02410.dat: w=266, l=114, d=0
saves/mu-t5-6/best_165_02640.dat: w=253, l=127, d=0
saves/mu-t5-6/best_005_00100...
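The ranking above can be reproduced with a simple round-robin tally: every model plays every other model as both the first and the second player, and models are sorted by total wins. The following is a minimal sketch of that logic; the `play_match` callback is a hypothetical stand-in for the actual game played by play-mu.py, and the function names are my own, not from the book's code.

```python
from collections import Counter
from itertools import permutations


def rank_models(models, play_match):
    """Run a round-robin tournament and rank models by wins.

    play_match(a, b) is assumed to play one game with model a moving
    first and model b second, returning "w" if a wins, "l" if a loses,
    or "d" on a draw.
    """
    wins, losses, draws = Counter(), Counter(), Counter()
    # permutations gives every ordered pair, so each model plays
    # every opponent twice: once as the first player, once as the second
    for a, b in permutations(models, 2):
        result = play_match(a, b)
        if result == "w":
            wins[a] += 1
            losses[b] += 1
        elif result == "l":
            losses[a] += 1
            wins[b] += 1
        else:
            draws[a] += 1
            draws[b] += 1
    # sort by win count, best model first
    table = sorted(models, key=lambda m: wins[m], reverse=True)
    for m in table:
        print(f"{m}: w={wins[m]}, l={losses[m]}, d={draws[m]}")
    return table
```

With roughly 190 stored models, each model plays about 380 games in such a tournament, which matches the win/loss totals in the list above (for example, 339 + 41 = 380).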