The model application
Okay, imagine that we have trained the model using the process just described. How should we use it to solve a scrambled cube? Given the network's structure, an obvious (but, as we will see, not very successful) approach suggests itself:
- Feed the model the current state of the cube that we want to solve.
- From the policy head, take the action with the highest probability (or sample an action from the resulting distribution).
- Apply the action to the cube.
- Repeat the process until the solved state has been reached.
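The loop above can be sketched in a few lines. This is a minimal illustration, not the book's actual code: the environment interface (`reset_scrambled`, `is_solved`, `apply`) and the `policy` callable are hypothetical names, and a toy "distance from solved" environment stands in for a real cube so the loop can run end to end.

```python
import random  # used by the commented-out sampling variant

def greedy_solve(env, policy, max_steps=100):
    """Greedy loop from the text: feed the state to the policy head,
    take the highest-rated action, apply it, repeat until solved
    or the move budget runs out."""
    state = env.reset_scrambled()
    for step in range(max_steps):
        if env.is_solved(state):
            return step                      # solved in `step` moves
        probs = policy(state)                # policy head output: action probabilities
        action = max(range(len(probs)), key=probs.__getitem__)  # greedy argmax
        # sampling variant instead of argmax:
        # action = random.choices(range(len(probs)), weights=probs)[0]
        state = env.apply(state, action)
    return None                              # no solution within the budget

# Toy stand-in environment (NOT a real cube): the state is the distance
# from the solved state; action 0 moves closer, action 1 moves further.
class ToyCube:
    def reset_scrambled(self):
        return 5
    def is_solved(self, state):
        return state == 0
    def apply(self, state, action):
        return state - 1 if action == 0 else state + 1

toy_policy = lambda state: [0.9, 0.1]        # always prefers the correct move
print(greedy_solve(ToyCube(), toy_policy))   # -> 5
```

With a perfect policy the loop terminates quickly, as it does here; the text below explains why a real trained network falls short of this ideal.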
On paper, this method should work, but in practice, it has one serious issue: it doesn't! The main reason is the quality of our model: due to the size of the state space and the nature of neural networks, it just isn't possible to train...