Solving the Blackjack problem with the Q-learning algorithm

Q-learning is also a model-free learning algorithm. It updates the Q-function for every step in an episode. We will demonstrate how Q-learning is used to solve the Blackjack environment.

Introducing the Q-learning algorithm

Q-learning is an off-policy learning algorithm that optimizes the Q-values based on data generated by a behavior policy. The behavior policy is exploratory (epsilon-greedy, as we will see shortly): it interacts with the environment and generates the learning data. The target policy (the policy we attempt to optimize) is greedy with respect to the Q-values, which are updated based on the following equation:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Here, $s'$ is the resulting state after taking action $a$ in state $s$, and $r$ is the associated reward. $\max_{a'} Q(s', a')$ means that the greedy target policy picks the highest Q-value attainable in state $s'$. The hyperparameters $\alpha$ and $\gamma$ are the learning rate and discount factor, respectively. The Q-learning equation updates the Q-value (estimated future reward) of a state-action pair based on the current Q-value, the immediate reward, and the potential future reward the agent can expect by taking the best possible action in the next state.
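To see a single update in action, here is a minimal worked example in plain Python, with made-up numbers rather than values taken from the book:

    >>> alpha, gamma = 0.1, 1.0       # learning rate and discount factor
    >>> Q_s = {0: 0.5, 1: 0.2}        # Q(s, a) for both actions in state s
    >>> Q_s_next = {0: 0.4, 1: 0.7}   # Q(s', a') for both actions in state s'
    >>> a, r = 0, 1.0                 # action taken in s and reward received
    >>> delta = r + gamma * max(Q_s_next.values()) - Q_s[a]   # 1.0 + 0.7 - 0.5 = 1.2
    >>> Q_s[a] += alpha * delta       # 0.5 + 0.1 * 1.2 = 0.62

The estimate for $Q(s, a)$ moves a fraction $\alpha$ of the way toward the one-step greedy target $r + \gamma \max_{a'} Q(s', a')$.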

Learning from the experience generated by a different policy enables Q-learning to optimize its Q-values at every single step of an episode. We take the transition produced by the behavior policy and use the greedy target to update the Q-values right away.

One more thing to note is that the behavior policy is epsilon-greedy, meaning it takes a random action with a probability of $\epsilon$ (a value from 0 to 1) and takes the greedy action with a probability of $1 - \epsilon$. The epsilon-greedy policy combines exploitation and exploration: it exploits the best-known action most of the time while still exploring other actions.

Developing the Q-learning algorithm

Now it is time to develop the Q-learning algorithm to solve the Blackjack environment.
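This excerpt assumes that the Blackjack environment, env, and the required imports were set up earlier in the chapter. As a reminder, a minimal setup (assuming Gymnasium's Blackjack-v1 environment) might look like this:

    >>> from collections import defaultdict
    >>> import gymnasium as gym
    >>> import torch
    >>> import matplotlib.pyplot as plt
    >>> env = gym.make('Blackjack-v1')

With the environment in place, we proceed step by step: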

  1. We start by defining the epsilon-greedy policy:
    >>> def epsilon_greedy_policy(n_action, epsilon, state, Q):
    ...     probs = torch.ones(n_action) * epsilon / n_action
    ...     best_action = torch.argmax(Q[state]).item()
    ...     probs[best_action] += 1.0 - epsilon
    ...     action = torch.multinomial(probs, 1).item()
    ...     return action
    

Given $n$ possible actions, each action is taken with a probability of $\epsilon / n$, and the action with the highest state-action value is chosen with an additional probability of $1 - \epsilon$.
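As a quick sanity check (our own illustration with a hypothetical Q_demo table, not code from the book), you can verify this behavior by sampling from the policy many times:

    >>> Q_demo = defaultdict(lambda: torch.zeros(2))
    >>> Q_demo['state'][1] = 1.0   # make action 1 the greedy choice
    >>> samples = [epsilon_greedy_policy(2, 0.2, 'state', Q_demo)
    ...            for _ in range(1000)]
    >>> sum(samples) / len(samples)   # close to 0.9 = (1 - 0.2) + 0.2 / 2

With $\epsilon = 0.2$ and two actions, the greedy action is sampled with probability $0.8 + 0.1 = 0.9$.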

  2. We will start with a large exploration factor, $\epsilon = 1$, and reduce it over time until it reaches 0.1. We define the starting and final values of $\epsilon$ as follows:
    >>> epsilon = 1.0
    >>> final_epsilon = 0.1
    
  3. Next, we implement the Q-learning algorithm:
    >>> def q_learning(env, gamma, n_episode, alpha, epsilon, final_epsilon):
    ...     n_action = env.action_space.n
    ...     Q = defaultdict(lambda: torch.zeros(n_action))
    ...     # Anneal epsilon linearly toward its floor over roughly
    ...     # the first half of training
    ...     epsilon_decay = epsilon / (n_episode / 2)
    ...     for episode in range(n_episode):
    ...         state, _ = env.reset()
    ...         is_done = False
    ...         epsilon = max(final_epsilon, epsilon - epsilon_decay)
    ...         while not is_done:
    ...             action = epsilon_greedy_policy(n_action, epsilon, state, Q)
    ...             next_state, reward, terminated, truncated, info = env.step(action)
    ...             is_done = terminated or truncated
    ...             # Off-policy update toward the greedy one-step target
    ...             delta = reward + gamma * torch.max(Q[next_state]) - Q[state][action]
    ...             Q[state][action] += alpha * delta
    ...             total_reward_episode[episode] += reward
    ...             if is_done:
    ...                 break
    ...             state = next_state
    ...     # Extract the greedy policy from the learned Q-function
    ...     policy = {}
    ...     for state, actions in Q.items():
    ...         policy[state] = torch.argmax(actions).item()
    ...     return Q, policy
    

We initialize the action-value function Q and calculate the epsilon decay rate for the epsilon-greedy exploration strategy. For each episode, we let the agent take actions following the epsilon-greedy policy, and we update the Q-function at each step based on the off-policy learning equation. The exploration factor epsilon also decreases linearly over time until it reaches the final_epsilon floor. We run n_episode episodes and finally extract the optimal policy from the learned action-value function Q by selecting the action with the maximum value for each state.
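For instance (our own quick check, not from the book), with the settings used below, epsilon = 1.0 and n_episode = 10000, the decay works out to 0.0002 per episode, so epsilon reaches the final_epsilon floor of 0.1 after 4,500 episodes, a bit under halfway through training:

    >>> epsilon_decay = 1.0 / (10000 / 2)
    >>> epsilon_decay
    0.0002
    >>> (1.0 - 0.1) / epsilon_decay   # episodes until epsilon hits the floor
    4500.0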

  4. We then initialize a variable to store the performance of each of the 10,000 episodes, measured by the episode reward:
    >>> n_episode = 10000
    >>> total_reward_episode = [0] * n_episode
    
  5. Next, we perform Q-learning to obtain the optimal policy for the Blackjack problem with the following hyperparameters:
    >>> gamma = 1
    >>> alpha = 0.003
    >>> optimal_Q, optimal_policy = q_learning(env, gamma, n_episode, alpha,
                                               epsilon, final_epsilon)
    

Here, the discount rate is $\gamma = 1$ and the learning rate is $\alpha = 0.003$; the exploration factor $\epsilon$ starts at 1 for more exploration early in training.

  6. After 10,000 episodes of learning, we compute and plot the running average of the reward over the episodes as follows:
    >>> rolling_avg_reward = [total_reward_episode[0]]
    >>> for i, reward in enumerate(total_reward_episode[1:], 1):
    ...     rolling_avg_reward.append((rolling_avg_reward[-1]*i + reward)/(i+1))
    >>> plt.plot(rolling_avg_reward)
    >>> plt.title('Average reward over time')
    >>> plt.xlabel('Episode')
    >>> plt.ylabel('Average reward')
    >>> plt.ylim([-1, 1])
    >>> plt.show()
    

Refer to the following plot for the end result:

Figure 15.5: Average reward over episodes

The average reward rises steadily throughout training and eventually converges, indicating that the agent becomes proficient at the game as training progresses.

  7. Finally, we simulate 100,000 episodes with the optimal policy we obtained using Q-learning and calculate the winning chance (simulate_episode was defined earlier in the chapter; a sketch is shown after this step):
    >>> n_episode = 100000
    >>> n_win_opt = 0
    >>> for _ in range(n_episode):
    ...     reward = simulate_episode(env, optimal_policy)
    ...     if reward == 1:
    ...         n_win_opt += 1
    >>> print(f'Winning probability under the optimal policy: {n_win_opt/n_episode}')
    Winning probability under the optimal policy: 0.42398
    

Playing under the learned optimal policy gives a 42% chance of winning.
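The simulate_episode function is not shown in this excerpt; a sketch consistent with how it is called here (our assumption, not the book's exact code) might look like this:

    >>> def simulate_episode(env, policy):
    ...     state, _ = env.reset()
    ...     is_done = False
    ...     while not is_done:
    ...         # Follow the learned policy; default to action 0 (stick)
    ...         # for any state the policy never visited during training
    ...         action = policy.get(state, 0)
    ...         state, reward, terminated, truncated, _ = env.step(action)
    ...         is_done = terminated or truncated
    ...     return reward   # +1 for a win, -1 for a loss, 0 for a draw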

In this section, we solved the Blackjack problem with off-policy Q-learning. The algorithm optimizes the Q-values at every single step by learning from the experience generated by the epsilon-greedy behavior policy.
