In the previous recipe, we predicted the value of a policy where the agent holds once its score reaches 18. This is a simple policy that anyone can easily come up with, although it is obviously not the optimal one. In this recipe, we will search for the optimal policy to play Blackjack, using on-policy Monte Carlo control.
Monte Carlo prediction is used to evaluate the value of a given policy, while Monte Carlo control (MC control) is for finding the optimal policy when such a policy is not given. There are basically two categories of MC control: on-policy and off-policy. On-policy methods learn about the optimal policy by executing the policy and evaluating and improving it, while off-policy methods learn about the optimal policy using data generated by another policy. The way on-policy MC control works is quite similar to policy iteration in...
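Before applying it to Blackjack, the evaluate-then-improve loop of on-policy MC control can be sketched on a toy environment. The corridor MDP, the `run_episode` and `mc_control` helpers, and all hyperparameters below are illustrative assumptions, not the Blackjack code of this recipe; the point is the mechanism: sample episodes with an epsilon-greedy policy derived from Q (evaluation), then let the updated Q reshape that same policy (improvement).

```python
import random
from collections import defaultdict

def run_episode(Q, epsilon, n_states=5, actions=(0, 1), max_steps=20):
    # One episode in a toy corridor MDP under the epsilon-greedy policy
    # derived from Q: action 1 moves right, action 0 moves left, and
    # reaching the rightmost state ends the episode with reward +1.
    state = random.randrange(n_states - 1)  # random start aids exploration
    trajectory = []
    for _ in range(max_steps):
        if random.random() < epsilon:
            action = random.choice(actions)          # explore
        else:
            action = max(actions, key=lambda a: Q[(state, a)])  # exploit
        next_state = min(max(state + (1 if action == 1 else -1), 0),
                         n_states - 1)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        trajectory.append((state, action, reward))
        state = next_state
        if reward == 1.0:
            break
    return trajectory

def mc_control(n_episodes=5000, gamma=0.9, epsilon=0.3, seed=0):
    # On-policy first-visit MC control: each sampled episode updates the
    # running average of first-visit returns for the (state, action)
    # pairs it contains; because actions are chosen greedily from Q,
    # improving Q implicitly improves the behavior policy as well.
    random.seed(seed)
    Q = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(n_episodes):
        episode = run_episode(Q, epsilon)
        G = 0.0
        first_visit_return = {}
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            # Overwriting in the reversed pass keeps the return of the
            # earliest visit to each (state, action) pair.
            first_visit_return[(state, action)] = G
        for sa, ret in first_visit_return.items():
            counts[sa] += 1
            Q[sa] += (ret - Q[sa]) / counts[sa]  # incremental average
    # Read out the greedy policy for the non-terminal states.
    policy = {s: max((0, 1), key=lambda a: Q[(s, a)]) for s in range(4)}
    return Q, policy
```

After enough episodes, the greedy policy moves right in every non-terminal state, which is optimal for this corridor. The same loop, with Blackjack's states and hit/hold actions in place of the corridor, is what this recipe builds.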