Connect4 results
To make training fast, the hyperparameters of the training process were intentionally kept small. For example, at every step of the self-play process, only 10 MCTS searches were performed, each with a minibatch size of eight. Combined with the efficient minibatch MCTS and the fast game engine, this made training very quick. After just one hour of training and 2,500 games played in self-play mode, the resulting model was sophisticated enough to be enjoyable to play against. Of course, its level of play was still well below even a child's, but it showed some rudimentary strategies and made mistakes only on every other move, which was good progress.
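To put the figures above in perspective, the following minimal sketch works out the search budget per move and the early self-play throughput. The constant and function names are hypothetical and chosen only for illustration; the numeric values are the ones quoted in the text. The per-move figure is an upper bound, on the assumption that each search gathers at most a full minibatch of leaves.

```python
# Hypothetical names for illustration; the values come from the text.
MCTS_SEARCHES = 10     # MCTS searches performed per self-play step
MCTS_BATCH_SIZE = 8    # minibatch size used by each search


def leaf_budget_per_move(searches: int, batch_size: int) -> int:
    """Upper bound on tree leaves evaluated by the network for one move.

    Assumes every search collects at most `batch_size` leaves.
    """
    return searches * batch_size


def games_per_minute(games: int, hours: float) -> float:
    """Average self-play throughput over a training period."""
    return games / (hours * 60)


budget = leaf_budget_per_move(MCTS_SEARCHES, MCTS_BATCH_SIZE)
early_rate = games_per_minute(2_500, 1.0)

print(f"At most {budget} leaf evaluations per move")
print(f"About {early_rate:.1f} self-play games per minute in the first hour")
```

A budget this small is a deliberate trade of search quality for speed, which is consistent with the modest playing strength described above.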
The training was left running for a day, which resulted in 55k games played by the best model and a total of 102 best-model rotations. The training dynamics are shown in the following charts:
The tournament verification was complicated by the number of different models, as...