The training process
Now that you know how the state of the cube is encoded in a 20 × 24 tensor, let’s explore the NN architecture and understand how it is trained.
The NN architecture
Figure 21.2, from the paper by McAleer et al., shows the network architecture:
Figure 21.2: The NN architecture transforming the observation (top) to the action and value (bottom)
As input, the network accepts the already familiar 20 × 24 tensor representation of the cube state and produces two outputs:
- The policy, which is a vector of 12 numbers, representing the probability distribution over our actions.
- The value, a single scalar estimating the “goodness” of the state passed. The concrete meaning of a value will be discussed in the next section.
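To make the input and output shapes concrete, here is a minimal PyTorch sketch of a network with a shared body and two heads. Only the 20 × 24 input, the 12-number policy output, and the scalar value output come from the description above. The class name `TwoHeadedNet`, the single hidden layer, its size of 128, and the ELU activation are assumptions chosen for illustration, not the layer layout from the paper or from my implementation. The sketch also assumes that the policy head emits raw scores (logits) that are turned into a probability distribution with softmax.

```python
import torch
import torch.nn as nn


class TwoHeadedNet(nn.Module):
    """Illustrative two-headed network (hypothetical layer sizes)."""

    def __init__(self, input_shape=(20, 24), n_actions=12, hidden=128):
        super().__init__()
        in_features = input_shape[0] * input_shape[1]  # 20 * 24 = 480
        # Shared body: flatten the encoded cube state and extract features.
        self.body = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_features, hidden),
            nn.ELU(),
        )
        # Policy head: one raw score (logit) per action.
        self.policy = nn.Linear(hidden, n_actions)
        # Value head: a single scalar per state.
        self.value = nn.Linear(hidden, 1)

    def forward(self, x):
        features = self.body(x)
        return self.policy(features), self.value(features)


net = TwoHeadedNet()
batch = torch.zeros(4, 20, 24)        # a dummy batch of four encoded states
logits, value = net(batch)
probs = torch.softmax(logits, dim=1)  # probability distribution over 12 actions
```

For a batch of four states, `logits` has shape (4, 12) and `value` has shape (4, 1). Sharing the body means both outputs are computed from the same features, so a single forward pass yields both the action distribution and the state estimate.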
In my implementation, the architecture is exactly the...