Applied Unsupervised Learning with Python: Discover hidden patterns and relationships in unstructured data with Python

Product type Paperback

Published in May 2019

Publisher

ISBN-13 9781789952292

Length 482 pages

Edition 1st Edition

Languages

Python

Tools

Scikit-learn

Concepts

Machine Learning

Authors (3):

Benjamin Johnston

Christopher Kruger

Aaron Jones

Chapter 8: Market Basket Analysis

Activity 18: Loading and Preparing Full Online Retail Data

Solution:

Load the online retail dataset file:

```
import matplotlib.pyplot as plt
import mlxtend.frequent_patterns
import mlxtend.preprocessing
import numpy
import pandas

online = pandas.read_excel(
    io="Online Retail.xlsx",
    sheet_name="Online Retail",
    header=0
)
```

Clean and prep the data for modeling, including turning the cleaned data into a list of lists:

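```
# Flag invoice numbers that contain 'C' (cancellations in this dataset).
online['IsCPresent'] = (
    online['InvoiceNo']
    .astype(str)
    .apply(lambda x: 1 if 'C' in x else 0)
)

# Keep rows with a positive quantity from unflagged invoices, retain only
# the invoice number and item description, and drop rows with missing values.
online1 = (
    online
    .loc[online["Quantity"] > 0]
    .loc[online['IsCPresent'] != 1]
    .loc[:, ["InvoiceNo", "Description"]]
    .dropna()
)

# Build one list of item descriptions per invoice (a list of lists).
invoice_item_list = []
for num in list(set(online1.InvoiceNo.tolist())):
    tmp_df = online1.loc[online1['InvoiceNo'] == num]
    tmp_items = tmp_df.Description.tolist()
    invoice_item_list.append(tmp_items)
```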
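
The loop above filters the DataFrame once per invoice, which can be slow on the full dataset. As a rough sketch of the same grouping done in a single pass, the following uses plain Python on `(InvoiceNo, Description)` pairs. The sample rows and the `group_items_by_invoice` helper are hypothetical and are not part of the book's solution.

```python
from collections import defaultdict

# Hypothetical (InvoiceNo, Description) rows standing in for the cleaned data.
rows = [
    ("1001", "ITEM A"),
    ("1001", "ITEM B"),
    ("1002", "ITEM C"),
    ("1001", "ITEM D"),
]


def group_items_by_invoice(rows):
    """Collect item descriptions per invoice in one pass over the rows."""
    grouped = defaultdict(list)
    for invoice_no, description in rows:
        grouped[invoice_no].append(description)
    # Dicts preserve insertion order, so invoices come out in first-seen order.
    return list(grouped.values())


invoice_item_list_demo = group_items_by_invoice(rows)
print(invoice_item_list_demo)
```

With pandas, `online1.groupby("InvoiceNo")["Description"].apply(list).tolist()` should produce an equivalent list of lists, ordered by invoice number rather than by the arbitrary order of a `set`.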

Encode the data and recast it as a DataFrame:

```
online_encoder = mlxtend.preprocessing.TransactionEncoder()
online_encoder_array = online_encoder.fit_transform(invoice_item_list)

online_encoder_df = pandas.DataFrame(
    online_encoder_array,
    columns=online_encoder.columns_
)

online_encoder_df.loc[
    20125:20135,
    online_encoder_df.columns.tolist()[100:110]
]
```

The output is as follows:

Figure 8.35: A subset of the cleaned, encoded, and recast DataFrame built from the complete online retail dataset
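
To make the encoding step concrete, here is a minimal hand-rolled sketch of one-hot encoding a list of transactions: one column per unique item and one boolean row per transaction. It only illustrates the idea behind the encoded DataFrame; the `one_hot_encode` helper, the toy transactions, and the sorted column order are assumptions of this sketch, not the `TransactionEncoder` implementation.

```python
def one_hot_encode(transactions):
    """Return (columns, rows): sorted unique items and one boolean row per transaction."""
    columns = sorted({item for transaction in transactions for item in transaction})
    rows = []
    for transaction in transactions:
        present = set(transaction)
        # True where the transaction contains the column's item.
        rows.append([item in present for item in columns])
    return columns, rows


# Hypothetical toy transactions.
transactions = [["bread", "milk"], ["milk"], ["bread", "eggs", "milk"]]
columns, rows = one_hot_encode(transactions)
print(columns)
for row in rows:
    print(row)
```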

Activity 19: Apriori on the Complete Online Retail Dataset

Solution:

Run the Apriori algorithm on the full data with reasonable parameter settings:
```
mod_colnames_minsupport = mlxtend.frequent_patterns.apriori(
    online_encoder_df, 
    min_support=0.01,
    use_colnames=True
)
mod_colnames_minsupport.loc[0:6]
```
The output is as follows:
Figure 8.36: The Apriori algorithm results using the complete online retail dataset
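
For intuition about what the `min_support` threshold does, the following is a small level-wise search in the spirit of Apriori, written in plain Python. It is a sketch under simplifying assumptions (tiny in-memory data, no performance tuning) and is not the mlxtend implementation. The `frequent_itemsets` function and the toy transactions are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {frozenset: support} for every item set meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item in the item set.
        return sum(itemset <= basket for basket in baskets) / n

    # Level 1: frequent single items.
    items = {item for basket in baskets for item in basket}
    current = {frozenset([i]): support(frozenset([i])) for i in items}
    current = {s: v for s, v in current.items() if v >= min_support}
    result = dict(current)

    k = 2
    while current:
        # Join step: unions of frequent (k-1)-item sets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-item subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c: support(c) for c in candidates}
        current = {c: v for c, v in current.items() if v >= min_support}
        result.update(current)
        k += 1
    return result


# Hypothetical toy transactions.
transactions = [
    ["bread", "milk"],
    ["bread", "eggs"],
    ["bread", "milk", "eggs"],
    ["milk"],
]
itemsets = frequent_itemsets(transactions, min_support=0.5)
for itemset, value in sorted(itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), value)
```

Raising `min_support` shrinks the result, because an item set can never be more frequent than any of its subsets.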
Filter the results down to the item set containing 10 COLOUR SPACEBOY PEN. Compare the support value with that under Exercise 44, Executing the Apriori algorithm:
```
mod_colnames_minsupport[
    mod_colnames_minsupport['itemsets'] == frozenset(
        {'10 COLOUR SPACEBOY PEN'}
    )
]
```
The output is as follows:
Figure 8.37: Result of item set containing 10 COLOUR SPACEBOY PEN
The support value does change. When the dataset is expanded to include all transactions, the support for this item set increases from 0.015 to 0.015793. That is, in the reduced dataset used for the exercises, this item set appears in 1.5% of the transactions, while in the full dataset, it appears in approximately 1.6% of transactions.
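
The change in support follows directly from its definition: the number of transactions containing the item set divided by the total number of transactions. The sketch below shows, on invented toy data, how the same item set can have a different support in a subset than in the full list. The `support` helper and the transactions are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


# Hypothetical data: `reduced` is the first half of `full`.
full = [
    ["pen"], ["ink"], ["paper"], ["ink", "paper"],
    ["pen", "ink"], ["pen", "paper"], ["ink"], ["paper"],
]
reduced = full[:4]

reduced_support = support({"pen"}, reduced)  # 1 of 4 transactions
full_support = support({"pen"}, full)        # 3 of 8 transactions
print(reduced_support, full_support)
```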
Add another column containing the item set length. Then, filter down to those item sets whose length is two and whose support is in the range [0.02, 0.021]. Are the item sets the same as those found in Exercise 44, Executing the Apriori algorithm, Step 6?
```
mod_colnames_minsupport['length'] = (
    mod_colnames_minsupport['itemsets'].apply(len)
)

mod_colnames_minsupport[
    (mod_colnames_minsupport['length'] == 2) & 
    (mod_colnames_minsupport['support'] >= 0.02) &
    (mod_colnames_minsupport['support'] < 0.021)
]
```
Figure 8.38: The section of the results of filtering based on length and support
The results did change. Before even looking at the particular item sets and their support values, we see that this filtered DataFrame has fewer item sets than the DataFrame in the previous exercise. When we use the full dataset, there are fewer item sets that match the filtering criteria; that is, only 14 item sets contain 2 items and have a support value greater than or equal to 0.02, and less than 0.021. In the previous exercise, 17 item sets met these criteria.

Plot the support values:

```
mod_colnames_minsupport.hist("support", grid=False, bins=30)
plt.title("Support")
```

Figure 8.39: The distribution of support values

This plot shows the distribution of support values for the full transaction dataset. As you might have assumed, the distribution is right skewed; that is, most of the item sets have lower support values and there is a long tail of support values on the higher end of the spectrum. Given how many unique item sets exist, it is not surprising that no single item set appears in a high percentage of the transactions. With this information, we could tell management that even the most prominent item set only appears in approximately 10% of the transactions, and that the vast majority of item sets appear in less than 2% of transactions. These results may not support changes in store layout, but could very well inform pricing and discounting strategies. We would gain more information on how to build these strategies by formalizing some association rules.

Activity 20: Finding the Association Rules on the Complete Online Retail Dataset

Solution:

Fit the association rule model on the full dataset. Use confidence as the metric, with a minimum threshold of 0.6:
```
rules = mlxtend.frequent_patterns.association_rules(
    mod_colnames_minsupport, 
    metric="confidence",
    min_threshold=0.6, 
    support_only=False
)
rules.loc[0:6]
```
The output is as follows:
Figure 8.40: The association rules based on the complete online retail dataset
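
As a reminder of what the confidence threshold filters on, the confidence of a rule A → C is the support of A and C together divided by the support of A. The sketch below computes it from raw transactions and keeps only rules at or above 0.6. The `confidence` helper, the toy transactions, and the candidate rules are hypothetical. Returning 0.0 when the antecedent never occurs is a choice made for this sketch.

```python
def confidence(antecedent, consequent, transactions):
    """Share of transactions containing the antecedent that also contain the consequent."""
    baskets = [set(t) for t in transactions]
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(antecedent <= b for b in baskets)
    if n_antecedent == 0:
        return 0.0  # sketch-level choice: undefined confidence is treated as 0
    n_both = sum((antecedent | consequent) <= b for b in baskets)
    return n_both / n_antecedent


# Hypothetical toy transactions and candidate rules (antecedent, consequent).
transactions = [
    ["bread", "milk"],
    ["bread", "eggs"],
    ["bread", "milk", "eggs"],
    ["milk"],
]
candidate_rules = [
    ({"bread"}, {"milk"}),
    ({"milk"}, {"eggs"}),
    ({"eggs"}, {"bread"}),
]

kept = [
    (a, c) for a, c in candidate_rules
    if confidence(a, c, transactions) >= 0.6
]
print(kept)
```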
Count the number of association rules. Is the number different to that found in Exercise 45, Deriving Association Rules, Step 1?
```
print("Number of Associations: {}".format(rules.shape[0]))
```
There are 498 association rules.
Plot confidence against support:
```
rules.plot.scatter("support", "confidence", alpha=0.5, marker="*")
plt.xlabel("Support")
plt.ylabel("Confidence")
plt.title("Association Rules")
plt.show()
```
The output is as follows:
Figure 8.41: The plot of confidence against support
The plot reveals that there are some association rules featuring relatively high support and confidence values for this dataset.
Look at the distributions of lift, leverage, and conviction:
```
rules.hist("lift", grid=False, bins=30)
plt.title("Lift")
```
The output is as follows:
Figure 8.42: The distribution of lift values
```
rules.hist("leverage", grid=False, bins=30)
plt.title("Leverage")
```
The output is as follows:
Figure 8.43: The distribution of leverage values
```
plt.hist(
    rules[numpy.isfinite(rules['conviction'])].conviction.values, 
    bins=30
)
plt.title("Conviction")
```
The output is as follows:
Figure 8.44: The distribution of conviction values
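
The three metrics plotted above can all be derived from the supports of the antecedent A, the consequent C, and the two together. The sketch below uses the standard definitions: lift is confidence divided by support(C), leverage is support(A and C) minus support(A) × support(C), and conviction is (1 − support(C)) / (1 − confidence). Conviction is infinite when confidence equals 1, which is why the conviction histogram first filters with `numpy.isfinite`. The `rule_metrics` helper and the input values are hypothetical.

```python
import math


def rule_metrics(support_a, support_c, support_ac):
    """Compute confidence, lift, leverage, and conviction from three support values."""
    confidence = support_ac / support_a
    lift = confidence / support_c
    leverage = support_ac - support_a * support_c
    if confidence == 1:
        # The denominator (1 - confidence) is zero, so conviction is infinite.
        conviction = math.inf
    else:
        conviction = (1 - support_c) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Hypothetical support values for two rules.
ordinary = rule_metrics(support_a=0.5, support_c=0.25, support_ac=0.25)
certain = rule_metrics(support_a=0.5, support_c=0.5, support_ac=0.5)
print(ordinary)
print(certain)
```

A lift above 1 and a positive leverage both indicate that the antecedent and consequent co-occur more often than they would if they were independent.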

Having derived association rules, we can return to management with additional information, the most important of which is that roughly seven rules have reasonably high values for both support and confidence. In the scatterplot of confidence against support, these seven rules sit apart from all the others. They also have high lift values, as can be seen in the lift histogram. It seems that we have identified some actionable association rules that we can use to drive business decisions.
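
One way to turn that visual inspection into a shortlist is to filter on both metrics at once. The sketch below does this in plain Python. The rule records, the thresholds, and the `select_rules` helper are all invented for illustration and are not taken from the results above.

```python
# Hypothetical rule records; the values are invented for illustration only.
rules_demo = [
    {"rule": "A -> B", "support": 0.030, "confidence": 0.85},
    {"rule": "C -> D", "support": 0.011, "confidence": 0.90},
    {"rule": "E -> F", "support": 0.028, "confidence": 0.62},
    {"rule": "G -> H", "support": 0.012, "confidence": 0.61},
]


def select_rules(rules, min_support, min_confidence):
    """Keep rules meeting both thresholds, highest confidence first."""
    kept = [
        r for r in rules
        if r["support"] >= min_support and r["confidence"] >= min_confidence
    ]
    return sorted(kept, key=lambda r: r["confidence"], reverse=True)


# Thresholds are assumptions; in practice they would be read off the scatterplot.
shortlist = select_rules(rules_demo, min_support=0.025, min_confidence=0.8)
print([r["rule"] for r in shortlist])
```

On the `rules` DataFrame itself, the same selection can be written as a boolean mask, for example `rules[(rules["support"] >= s) & (rules["confidence"] >= c)]` with thresholds `s` and `c` of your choosing.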