The model application
Okay, imagine that we have trained the model using the process just described. How should we use it to solve a scrambled cube? The network's structure suggests an obvious, but not very successful, approach:
- Feed the model the current state of the cube that we want to solve
- From the policy head's output, take the action with the largest probability (or sample an action from the resulting distribution)
- Apply the action to the cube
- Repeat the process until the solved state has been reached (a code sketch of this loop follows the list)
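
To make the loop concrete, here is a minimal sketch of it in PyTorch. The names are assumptions for illustration rather than the book's actual code: `model` is any trained network that returns `(policy_logits, value)` for an encoded state, `cube` is any environment exposing `is_solved()` and `apply(action)`, and `encode_fn` converts the cube state into the model's input tensor.

```python
import torch

def greedy_solve(model, cube, encode_fn, max_steps=100, sample=False):
    """Repeatedly apply the policy head's preferred action until solved.

    Assumed interfaces: `model(x)` -> (policy_logits, value),
    `cube.is_solved()`, `cube.apply(action)`, and `encode_fn(cube)`
    producing the model's input tensor.
    """
    model.eval()
    for _ in range(max_steps):
        if cube.is_solved():
            return True
        with torch.no_grad():
            policy_logits, _value = model(encode_fn(cube))
        if sample:
            # sample an action from the policy distribution
            probs = torch.softmax(policy_logits, dim=-1)
            action = torch.multinomial(probs, num_samples=1).item()
        else:
            # greedily take the action with the largest logit
            action = policy_logits.argmax(dim=-1).item()
        cube.apply(action)
    return cube.is_solved()
```

The `sample` flag switches between the two options mentioned in the second step: greedily taking the argmax of the policy head, or sampling an action from the distribution it defines.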
On paper, this method should work, but in practice, it has one serious issue: it doesn't! The main reason is the model's quality. Due to the size of the state space and the nature of NNs, it just isn't possible to train an NN to return the exact optimal action for every input state, every time. Rather than telling us exactly what to do to reach the solved state, our model points us toward promising directions to explore. Those directions could bring...