Solving the Blackjack problem with the Q-learning algorithm
Q-learning is also a model-free learning algorithm. It updates the Q-function after every single step in an episode, rather than waiting until the episode finishes. We will demonstrate how Q-learning is used to solve the Blackjack environment.
Introducing the Q-learning algorithm
Q-learning is an off-policy learning algorithm: it optimizes the Q-values using data generated by a behavior policy. The behavior policy is an exploratory (epsilon-greedy) policy that collects the learning data, while the target policy (the policy we attempt to optimize) is greedy with respect to the current Q-values. The Q-values are updated according to the following equation:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Here, $s'$ is the resulting state after taking action $a$ in state $s$, and $r$ is the associated reward. The term $\max_{a'} Q(s', a')$ means that the target policy acts greedily, taking the action with the highest Q-value in state $s'$. The hyperparameters $\alpha$ and $\gamma$ are the learning rate and the discount factor, respectively. In short, the Q-learning update adjusts the Q-value (estimated future reward) of a state-action pair based on its current value, the immediate reward, and the potential future reward the agent can expect by taking the best possible action in the next state.
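To see the update in action, here is a quick worked example with made-up numbers; the current Q-value, reward, learning rate, and discount factor below are purely illustrative:

>>> alpha, gamma = 0.1, 1.0     # assumed learning rate and discount factor
>>> q_sa = 0.5                  # assumed current estimate of Q(s, a)
>>> r = 0.0                     # assumed immediate reward
>>> max_q_next = 0.8            # assumed max over a' of Q(s', a')
>>> q_sa += alpha * (r + gamma * max_q_next - q_sa)
>>> print(round(q_sa, 2))
0.53

The estimate moves a fraction $\alpha$ of the way from its old value toward the one-step target $r + \gamma \max_{a'} Q(s', a')$.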
Learning from experience generated by a different policy is what enables Q-learning to optimize its Q-values at every single step of an episode: we take the information gathered by the exploratory behavior policy and use it to improve the greedy target policy right away.
One more thing to note is that the behavior policy is epsilon-greedy, meaning it takes a random action with a probability of $\epsilon$ (a value from 0 to 1) and takes the greedy action with a probability of $1 - \epsilon$. The epsilon-greedy policy combines exploitation and exploration: it exploits the best known action while still exploring other actions.
Developing the Q-learning algorithm
Now it is time to develop the Q-learning algorithm to solve the Blackjack environment:
- We start by defining the epsilon-greedy policy:
>>> def epsilon_greedy_policy(n_action, epsilon, state, Q):
...     probs = torch.ones(n_action) * epsilon / n_action
...     best_action = torch.argmax(Q[state]).item()
...     probs[best_action] += 1.0 - epsilon
...     action = torch.multinomial(probs, 1).item()
...     return action
Given $|A|$ possible actions, each action is taken with a probability of $\epsilon / |A|$, and the action with the highest state-action value is chosen with an additional probability of $1 - \epsilon$.
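As a quick sanity check, the following snippet builds this probability vector by hand for an assumed two-action case with $\epsilon = 0.1$; the values here are purely illustrative:

>>> import torch
>>> epsilon, n_action = 0.1, 2
>>> probs = torch.ones(n_action) * epsilon / n_action   # each action starts at 0.05
>>> best_action = 1      # suppose action 1 currently has the highest Q-value
>>> probs[best_action] += 1.0 - epsilon                  # the greedy action gets the extra 0.9
>>> print(probs)
tensor([0.0500, 0.9500])

The two probabilities sum to 1: the greedy action keeps most of the probability mass, while the other action remains possible.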
- We will start with a large exploration factor, $\epsilon = 1$, and reduce it over time until it reaches 0.1. We define the starting and final values of $\epsilon$ as follows:

>>> epsilon = 1.0
>>> final_epsilon = 0.1
- Next, we implement the Q-learning algorithm:
>>> def q_learning(env, gamma, n_episode, alpha, epsilon, final_epsilon):
        n_action = env.action_space.n
        Q = defaultdict(lambda: torch.zeros(n_action))
        epsilon_decay = epsilon / (n_episode / 2)
        for episode in range(n_episode):
            state, _ = env.reset()
            is_done = False
            epsilon = max(final_epsilon, epsilon - epsilon_decay)
            while not is_done:
                action = epsilon_greedy_policy(n_action, epsilon, state, Q)
                next_state, reward, terminated, truncated, info = env.step(action)
                is_done = terminated or truncated
                delta = reward + gamma * torch.max(Q[next_state]) - Q[state][action]
                Q[state][action] += alpha * delta
                total_reward_episode[episode] += reward
                if is_done:
                    break
                state = next_state
        policy = {}
        for state, actions in Q.items():
            policy[state] = torch.argmax(actions).item()
        return Q, policy
We initialize the action-value function Q and compute the epsilon decay rate for the epsilon-greedy exploration strategy. In each episode, we let the agent take actions following the epsilon-greedy policy and update the Q-function at every step based on the off-policy learning equation, while the exploration factor decreases over time (a small sketch of this decay schedule follows this step). We run n_episode episodes and finally extract the optimal policy from the learned action-value function Q by selecting the action with the maximum value for each state.
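Here is a small standalone sketch of that decay schedule, simply replaying the epsilon updates performed inside q_learning. The starting value of 1.0, the final value of 0.1, and the 10,000 episodes match the settings we use below; the sketch uses its own variable names (eps, eps_decay) so the epsilon defined in the previous step is not overwritten:

>>> eps, n_episodes_demo = 1.0, 10000          # demo copies of the real settings
>>> eps_decay = eps / (n_episodes_demo / 2)    # 0.0002 subtracted per episode
>>> schedule = []
>>> for episode in range(n_episodes_demo):
...     eps = max(0.1, eps - eps_decay)
...     schedule.append(eps)
>>> print(round(schedule[0], 4), round(schedule[-1], 4))
0.9998 0.1

With these settings, the exploration factor decreases linearly and hits the 0.1 floor at around episode 4,500, so most of the later episodes are spent exploiting the learned Q-values.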
- We then initialize a list to store the performance of each of the 10,000 training episodes, measured by the episode reward (it is updated inside q_learning):
>>> n_episode = 10000
>>> total_reward_episode = [0] * n_episode
- Now, we perform Q-learning to obtain the optimal policy for the Blackjack problem with the following hyperparameters:
>>> gamma = 1
>>> alpha = 0.003
>>> optimal_Q, optimal_policy = q_learning(env, gamma, n_episode, alpha, epsilon, final_epsilon)
Here, the discount rate is $\gamma = 1$, the learning rate is $\alpha = 0.003$, and the exploration factor $\epsilon$ starts at 1.0 for more exploration.
- After 10,000 episodes of learning, we plot the rolling average rewards over the episodes as follows:
>>> rolling_avg_reward = [total_reward_episode[0]]
>>> for i, reward in enumerate(total_reward_episode[1:], 1):
...     rolling_avg_reward.append((rolling_avg_reward[-1]*i + reward)/(i+1))
>>> plt.plot(rolling_avg_reward)
>>> plt.title('Average reward over time')
>>> plt.xlabel('Episode')
>>> plt.ylabel('Average reward')
>>> plt.ylim([-1, 1])
>>> plt.show()
Refer to the following screenshot for the end result:
Figure 15.5: Average reward over episodes
The average reward steadily rises throughout the training process, eventually converging. This indicates that the model becomes proficient in solving the problem after training.
- Finally, we simulate 100,000 episodes for the optimal policy we obtained using Q-learning and calculate the winning chance:
>>> n_episode = 100000
>>> n_win_opt = 0
>>> for _ in range(n_episode):
...     reward = simulate_episode(env, optimal_policy)
...     if reward == 1:
...         n_win_opt += 1
>>> print(f'Winning probability under the optimal policy: {n_win_opt/n_episode}')
Winning probability under the optimal policy: 0.42398
Playing under the learned optimal policy gives us roughly a 42% chance of winning.
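Note that simulate_episode is the evaluation helper defined earlier. If you are following along with only this section, a minimal version compatible with the call above might look like the following sketch; this is an assumption rather than the exact earlier implementation, and the fallback to action 0 for states the learned policy never visited is a choice made purely for illustration:

>>> def simulate_episode(env, policy):
...     state, _ = env.reset()
...     while True:
...         # Use the learned action if available; otherwise fall back to action 0
...         action = policy.get(state, 0)
...         state, reward, terminated, truncated, _ = env.step(action)
...         if terminated or truncated:
...             return reward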
In this section, we solved the Blackjack problem with off-policy Q-learning. The algorithm optimizes the Q-values at every single step by learning from the experience generated by the epsilon-greedy behavior policy.