Applied Unsupervised Learning with Python

Discover hidden patterns and relationships in unstructured data with Python

Product type: Paperback
Published in: May 2019
ISBN-13: 9781789952292
Length: 482 pages
Edition: 1st Edition
Authors (3): Benjamin Johnston, Christopher Kruger, Aaron Jones


Chapter 8: Market Basket Analysis


Activity 18: Loading and Preparing Full Online Retail Data

Solution:

  1. Load the online retail dataset file:

    import matplotlib.pyplot as plt
    import mlxtend.frequent_patterns
    import mlxtend.preprocessing
    import numpy
    import pandas
    
    online = pandas.read_excel(
        io="Online Retail.xlsx", 
        sheet_name="Online Retail", 
        header=0
    )

  2. Clean and prepare the data for modeling, including turning the cleaned data into a list of lists:

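    # Flag cancelled invoices: their invoice numbers contain a 'C'.
    online['IsCPresent'] = (
        online['InvoiceNo']
        .astype(str)
        .apply(lambda x: 1 if x.find('C') != -1 else 0)
    )
    
    # Keep rows with a positive quantity that are not cancellations,
    # retain only the two columns needed, and drop missing values.
    online1 = (
        online
        .loc[online["Quantity"] > 0]
        .loc[online['IsCPresent'] != 1]
        .loc[:, ["InvoiceNo", "Description"]]
        .dropna()
    )
    
    # Build one list of item descriptions per invoice.
    invoice_item_list = []
    for num in list(set(online1.InvoiceNo.tolist())):
        tmp_df = online1.loc[online1['InvoiceNo'] == num]
        tmp_items = tmp_df.Description.tolist()
        invoice_item_list.append(tmp_items)

  3. Encode the data and recast it as a DataFrame: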

    online_encoder = mlxtend.preprocessing.TransactionEncoder()
    online_encoder_array = online_encoder.fit_transform(invoice_item_list)
    
    online_encoder_df = pandas.DataFrame(
        online_encoder_array, 
        columns=online_encoder.columns_
    )
    
    online_encoder_df.loc[
        20125:20135, 
        online_encoder_df.columns.tolist()[100:110]
    ]

    The output is as follows:

    Figure 8.35: A subset of the cleaned, encoded, and recast DataFrame built from the complete online retail dataset
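
The loop in step 2 rescans the whole DataFrame once per invoice, which can be slow on the full dataset. The following is a minimal sketch of an alternative that groups the rows by invoice instead, shown on a small made-up DataFrame (the `toy` rows and all `toy_*` names are hypothetical, not taken from the online retail data). The hand-rolled one-hot step only illustrates the shape of the encoded result, with one row per invoice and one Boolean column per item; it is not mlxtend's implementation.

```python
import pandas

# Hypothetical toy data standing in for the cleaned `online1` DataFrame.
toy = pandas.DataFrame({
    "InvoiceNo": [536365, 536365, 536366, 536367, 536367],
    "Description": ["PEN", "MUG", "PEN", "MUG", "LAMP"],
})

# Collect the item descriptions of each invoice into one list per invoice.
toy_item_list = toy.groupby("InvoiceNo")["Description"].apply(list).tolist()

# Manual one-hot encoding: one row per invoice, one Boolean column per item.
toy_items = sorted({item for basket in toy_item_list for item in basket})
toy_encoded = pandas.DataFrame(
    [[item in basket for item in toy_items] for basket in toy_item_list],
    columns=toy_items,
)

print(toy_item_list)
print(toy_encoded)
```

Note that grouping returns the invoices in sorted order, whereas iterating over a `set` does not guarantee any order, so row positions in the encoded DataFrame may differ between the two approaches.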

Activity 19: Apriori on the Complete Online Retail Dataset

Solution:

  1. Run the Apriori algorithm on the full data with reasonable parameter settings:

    mod_colnames_minsupport = mlxtend.frequent_patterns.apriori(
        online_encoder_df, 
        min_support=0.01,
        use_colnames=True
    )
    mod_colnames_minsupport.loc[0:6]

    The output is as follows:

    Figure 8.36: The Apriori algorithm results using the complete online retail dataset

  2. Filter the results down to the item set containing 10 COLOUR SPACEBOY PEN. Compare the support value with that under Exercise 44, Executing the Apriori algorithm:

    mod_colnames_minsupport[
        mod_colnames_minsupport['itemsets'] == frozenset(
            {'10 COLOUR SPACEBOY PEN'}
        )
    ]

    The output is as follows:

    Figure 8.37: Result of item set containing 10 COLOUR SPACEBOY PEN

    The support value does change, though only slightly. When the dataset is expanded to include all transactions, the support for this item set increases from 0.015 to 0.015793. That is, in the reduced dataset used for the exercises, this item set appears in 1.5% of the transactions, while in the full dataset, it appears in approximately 1.6% of transactions.

  3. Add another column containing the item set length. Then, filter down to those item sets whose length is two and whose support is in the range [0.02, 0.021), that is, greater than or equal to 0.02 and less than 0.021. Are the item sets the same as those found in Exercise 44, Executing the Apriori algorithm, Step 6?

    mod_colnames_minsupport['length'] = (
        mod_colnames_minsupport['itemsets'].apply(len)
    )
    
    mod_colnames_minsupport[
        (mod_colnames_minsupport['length'] == 2) & 
        (mod_colnames_minsupport['support'] >= 0.02) &
        (mod_colnames_minsupport['support'] < 0.021)
    ]

    Figure 8.38: The section of the results of filtering based on length and support

    The results did change. Before even looking at the particular item sets and their support values, we see that fewer item sets match the filtering criteria when the full dataset is used: only 14 item sets contain 2 items and have a support value greater than or equal to 0.02 and less than 0.021. In the previous exercise, 17 item sets met these criteria.

  4. Plot the support values:

    mod_colnames_minsupport.hist("support", grid=False, bins=30)
    plt.title("Support")

    Figure 8.39: The distribution of support values

This plot shows the distribution of support values for the full transaction dataset. As you might have assumed, the distribution is right-skewed; that is, most item sets have low support values, with a long tail toward the higher end of the spectrum. Given how many unique item sets exist, it is not surprising that no single item set appears in a high percentage of the transactions. With this information, we could tell management that even the most prominent item set appears in only approximately 10% of the transactions, and that the vast majority of item sets appear in less than 2% of transactions. These results may not support changes in store layout, but they could very well inform pricing and discounting strategies. Formalizing some association rules would tell us more about how to build these strategies.
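
To make the support values discussed in this activity concrete, here is a minimal sketch of how support is defined: the fraction of transactions that contain every item in an item set. The `transactions` list and the `support` helper are hypothetical, written for illustration only; they are not part of mlxtend.

```python
# Hypothetical toy transactions (not taken from the online retail data).
transactions = [
    {"PEN", "MUG"},
    {"PEN"},
    {"MUG", "LAMP"},
    {"PEN", "MUG", "LAMP"},
]

def support(itemset, transactions):
    """Return the fraction of transactions containing every item in `itemset`."""
    itemset = frozenset(itemset)
    # `itemset <= basket` is True when the item set is a subset of the basket.
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)

print(support({"PEN"}, transactions))         # PEN is in 3 of the 4 baskets
print(support({"PEN", "MUG"}, transactions))  # both items are in 2 of the 4 baskets
```

Read this way, a support of 0.015793 means the item set appears in roughly 1.6% of all invoices.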

Activity 20: Finding the Association Rules on the Complete Online Retail Dataset

Solution:

  1. Fit the association rule model on the full dataset. Use confidence as the metric, with a minimum threshold of 0.6:

    rules = mlxtend.frequent_patterns.association_rules(
        mod_colnames_minsupport, 
        metric="confidence",
        min_threshold=0.6, 
        support_only=False
    )
    rules.loc[0:6]

    The output is as follows:

    Figure 8.40: The association rules based on the complete online retail dataset

  2. Count the number of association rules. Is the number different from that found in Exercise 45, Deriving Association Rules, Step 1?

    print("Number of Associations: {}".format(rules.shape[0]))

    There are 498 association rules.

  3. Plot confidence against support:

    rules.plot.scatter("support", "confidence", alpha=0.5, marker="*")
    plt.xlabel("Support")
    plt.ylabel("Confidence")
    plt.title("Association Rules")
    plt.show()

    The output is as follows:

    Figure 8.41: The plot of confidence against support

    The plot reveals that there are some association rules featuring relatively high support and confidence values for this dataset.

  4. Look at the distributions of lift, leverage, and conviction:

    rules.hist("lift", grid=False, bins=30)
    plt.title("Lift")

    The output is as follows:

    Figure 8.42: The distribution of lift values

    rules.hist("leverage", grid=False, bins=30)
    plt.title("Leverage")

    The output is as follows:

    Figure 8.43: The distribution of leverage values

    # Conviction is infinite for rules whose confidence is 1,
    # so plot only the finite values.
    plt.hist(
        rules[numpy.isfinite(rules['conviction'])].conviction.values, 
        bins=30
    )
    plt.title("Conviction")

    The output is as follows:

    Figure 8.44: The distribution of conviction values
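
The metrics plotted in step 4 can all be computed from support values. The following is a minimal sketch using the standard definitions for a rule A -> C; the function name `rule_metrics` and the example numbers are hypothetical, chosen only for illustration.

```python
import math

def rule_metrics(support_a, support_c, support_ac):
    """Compute rule metrics for A -> C from three support values.

    support_a:  support of the antecedent A
    support_c:  support of the consequent C
    support_ac: support of A and C appearing together
    """
    confidence = support_ac / support_a
    lift = confidence / support_c
    leverage = support_ac - support_a * support_c
    # Conviction divides by (1 - confidence), so it is infinite
    # when the rule always holds (confidence of 1).
    if confidence == 1:
        conviction = math.inf
    else:
        conviction = (1 - support_c) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }

print(rule_metrics(0.5, 0.4, 0.3))
print(rule_metrics(0.2, 0.5, 0.2))  # confidence of 1, so conviction is infinite
```

The second call shows why the conviction histogram filters with `numpy.isfinite`: a rule that holds in every transaction containing its antecedent has no finite conviction value to plot.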

Having derived association rules, we can return to management with additional information, the most important of which is that roughly seven rules have reasonably high values for both support and confidence. In the scatterplot of confidence against support, these seven rules are clearly separated from all the others. They also have high lift values, as can be seen in the lift histogram. It seems that we have identified some actionable association rules that we can use to drive business decisions.
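
One way to pull such standout rules out of the results, rather than reading them off the plot, is to filter on support and confidence and sort by lift. This is a minimal sketch on a made-up DataFrame: `toy_rules`, its values, and the cutoffs `min_support` and `min_confidence` are all hypothetical, and suitable cutoffs for the real `rules` DataFrame would need to be chosen from the scatterplot.

```python
import pandas

# Hypothetical stand-in for the `rules` DataFrame; values are invented.
toy_rules = pandas.DataFrame({
    "antecedents": [frozenset({"A"}), frozenset({"B"}), frozenset({"C"})],
    "consequents": [frozenset({"B"}), frozenset({"C"}), frozenset({"A"})],
    "support": [0.030, 0.012, 0.025],
    "confidence": [0.85, 0.62, 0.70],
    "lift": [20.0, 8.0, 15.0],
})

# Assumed cutoffs for illustration only.
min_support = 0.02
min_confidence = 0.8

# Keep rules that clear both cutoffs, strongest lift first.
standout = (
    toy_rules
    .loc[
        (toy_rules["support"] >= min_support)
        & (toy_rules["confidence"] >= min_confidence)
    ]
    .sort_values("lift", ascending=False)
)

print(standout)
```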
