The AlphaZero algorithm consists of three main components:
- A deep convolutional neural network, which takes the board position (or state) as input and outputs a value as the predicted game result from the position and a policy that is a list of move probabilities for each possible action from the input board state.
- A general-purpose reinforcement learning algorithm, which learns via self-play from scratch with no specific domain knowledge except the game rules. The deep neural network's parameters are learned by self-play reinforcement learning to minimize the loss between the predicted value and the actual self-play game result, and maximize the similarity between the predicted policy and the search probabilities, which come from the following algorithm.
- A general-purpose (domain-independent) Monte-Carlo Tree Search (MCTS) algorithm...