Solving the Blackjack problem with the Q-learning algorithm
Q-learning is also a model-free learning algorithm. It updates the Q-function for every step in an episode. We will demonstrate how Q-learning is used to solve the Blackjack environment.
Introducing the Q-learning algorithm
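Q-learning is an off-policy learning algorithm: it optimizes the Q-values using data generated by a behavior policy that is separate from the policy being learned. The behavior policy is a greedy policy that takes the action with the highest estimated return for a given state (in practice, usually an epsilon-greedy variant so that the agent keeps exploring). The behavior policy generates the learning data, and the target policy (the policy we attempt to optimize) updates the Q-values based on the following equation:
$$Q(s, a) := Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$$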
Here, $s'$ is the resulting state after taking action $a$ from state $s$, and $r$ is the associated reward. $\max_{a'} Q(s', a')$ is the highest Q-value over all actions $a'$ available in state $s'$, that is, the value of the greedy action there. Hyperparameters $\alpha$ and $\gamma$ are the learning rate and discount factor, respectively. The Q-learning equation updates the Q-value (estimated...
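To make the update rule concrete, here is a minimal sketch of a single Q-learning step in plain Python. The function name `q_learning_update`, the `done` flag (which treats the value of a terminal state as zero), the default hyperparameter values, and the sample Blackjack-style states are all assumptions made for illustration. They are not taken from the text.

```python
from collections import defaultdict


def q_learning_update(Q, state, action, reward, next_state, done,
                      alpha=0.1, gamma=1.0):
    """Apply one Q-learning update to Q in place and return the TD error."""
    # Highest Q-value in the next state; assumed to be zero once the
    # episode has ended (an illustrative convention, not from the text).
    best_next = 0.0 if done else max(Q[next_state])
    td_target = reward + gamma * best_next
    td_error = td_target - Q[state][action]
    # Move the current estimate a fraction alpha toward the target.
    Q[state][action] += alpha * td_error
    return td_error


# Hypothetical setup: two actions (0 = stick, 1 = hit) and states given as
# (player sum, dealer card, usable ace) tuples.
n_action = 2
Q = defaultdict(lambda: [0.0] * n_action)

s = (13, 10, False)
s_next = (18, 10, False)
Q[s_next] = [0.5, -0.2]  # made-up estimates for the next state

q_learning_update(Q, s, action=1, reward=0.0, next_state=s_next,
                  done=False, alpha=0.1, gamma=1.0)
print(Q[s])
```

With these made-up numbers, the target is 0 + 1.0 × 0.5 = 0.5, so the Q-value for hitting in state `s` moves from 0 to 0.1 × 0.5 = 0.05. The update uses the maximum over next actions regardless of which action the agent actually takes next, which is what makes the method off-policy.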