Chapter 8: Market Basket Analysis
Activity 18: Loading and Preparing Full Online Retail Data
Solution:
Load the online retail dataset file:
import matplotlib.pyplot as plt
import mlxtend.frequent_patterns
import mlxtend.preprocessing
import numpy
import pandas

online = pandas.read_excel(
    io="Online Retail.xlsx",
    sheet_name="Online Retail",
    header=0
)
Clean and prep the data for modeling, including turning the cleaned data into a list of lists:
# Flag invoices whose number contains a 'C'
online['IsCPresent'] = (
    online['InvoiceNo']
    .astype(str)
    .apply(lambda x: 1 if x.find('C') != -1 else 0)
)

# Keep positive quantities on unflagged invoices, then drop missing values
online1 = (
    online
    .loc[online["Quantity"] > 0]
    .loc[online['IsCPresent'] != 1]
    .loc[:, ["InvoiceNo", "Description"]]
    .dropna()
)

# Build one list of item descriptions per invoice
invoice_item_list = []
for num in list(set(online1.InvoiceNo.tolist())):
    tmp_df = online1.loc[online1['InvoiceNo'] == num]
    tmp_items = tmp_df.Description.tolist()
    invoice_item_list.append(tmp_items)
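The loop above filters the whole DataFrame once per invoice, which can be slow on the full dataset. As a minimal sketch of an alternative, the same list of lists can be built in a single pass. The helper name group_items_by_invoice and the toy rows are hypothetical and not part of the original solution:

```python
from collections import defaultdict


def group_items_by_invoice(rows):
    """Group (invoice_no, description) pairs into one item list per invoice."""
    baskets = defaultdict(list)
    for invoice_no, description in rows:
        baskets[invoice_no].append(description)
    # Invoices come out in order of first appearance
    return list(baskets.values())


# Hypothetical toy rows standing in for the InvoiceNo/Description columns
toy_rows = [
    (536365, "WHITE METAL LANTERN"),
    (536366, "HAND WARMER UNION JACK"),
    (536365, "CREAM CUPID HEARTS COAT HANGER"),
]
toy_baskets = group_items_by_invoice(toy_rows)
```

With the data from this activity, the call would presumably be group_items_by_invoice(zip(online1.InvoiceNo, online1.Description)). The invoices may come out in a different order than in the loop above, so the row positions in the encoded DataFrame, and therefore any subset selected by row label, may differ.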
Encode the data and recast it as a DataFrame:
online_encoder = mlxtend.preprocessing.TransactionEncoder()
online_encoder_array = online_encoder.fit_transform(invoice_item_list)

online_encoder_df = pandas.DataFrame(
    online_encoder_array,
    columns=online_encoder.columns_
)

online_encoder_df.loc[
    20125:20135,
    online_encoder_df.columns.tolist()[100:110]
]
The output is as follows:
Figure 8.35: A subset of the cleaned, encoded, and recast DataFrame built from the complete online retail dataset
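To make the encoding step concrete, the following pure-Python sketch mimics the idea behind the encoder on a toy list of lists: one column per unique item, one row per transaction, and True wherever the transaction contains the item. The function one_hot_encode and the toy baskets are hypothetical illustrations, not the mlxtend implementation:

```python
def one_hot_encode(transactions):
    """Return (columns, rows) for a one-hot encoding of the transactions."""
    # One column per unique item, in sorted order
    columns = sorted({item for basket in transactions for item in basket})
    rows = []
    for basket in transactions:
        basket_set = set(basket)
        rows.append([item in basket_set for item in columns])
    return columns, rows


# Hypothetical toy transactions
toy_transactions = [["bread", "milk"], ["milk"], ["bread", "eggs"]]
toy_columns, toy_rows = one_hot_encode(toy_transactions)
```

Each row of toy_rows plays the role of one row in online_encoder_df, and toy_columns plays the role of online_encoder.columns_.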
Activity 19: Apriori on the Complete Online Retail Dataset
Solution:
Run the Apriori algorithm on the full data with reasonable parameter settings:
mod_colnames_minsupport = mlxtend.frequent_patterns.apriori(
    online_encoder_df,
    min_support=0.01,
    use_colnames=True
)
mod_colnames_minsupport.loc[0:6]
The output is as follows:
Figure 8.36: The Apriori algorithm results using the complete online retail dataset
Filter the results down to the item set containing 10 COLOUR SPACEBOY PEN. Compare the support value with that under Exercise 44, Executing the Apriori algorithm:
mod_colnames_minsupport[
    mod_colnames_minsupport['itemsets'] == frozenset(
        {'10 COLOUR SPACEBOY PEN'}
    )
]
The output is as follows:
Figure 8.37: Result of item set containing 10 COLOUR SPACEBOY PEN
The support value does change. When the dataset is expanded to include all transactions, the support for this item set increases from 0.015 to 0.015793. That is, in the reduced dataset used for the exercises, this item set appears in 1.5% of the transactions, while in the full dataset, it appears in approximately 1.6% of transactions.
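As a reminder of what these support values measure, the sketch below computes support directly as the fraction of transactions that contain every item in an item set. The function name support and the toy transactions are hypothetical and are not taken from the retail data:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= set(basket))
    return hits / len(transactions)


# Hypothetical toy transactions
toy_transactions = [
    ["pen", "ruler"],
    ["pen"],
    ["ruler", "eraser"],
    ["eraser"],
]
pen_support = support({"pen"}, toy_transactions)             # 2 of 4 baskets
pair_support = support({"pen", "ruler"}, toy_transactions)   # 1 of 4 baskets
```

Because the denominator is the total number of transactions, adding transactions to the dataset can move an item set's support up or down, which is why the value for this item set differs between the reduced and the full data.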
Add another column containing the item set length. Then, filter down to those item sets whose length is two and whose support is in the range [0.02, 0.021), that is, at least 0.02 and strictly less than 0.021. Are the item sets the same as those found in Exercise 44, Executing the Apriori algorithm, Step 6?
mod_colnames_minsupport['length'] = (
    mod_colnames_minsupport['itemsets'].apply(lambda x: len(x))
)

mod_colnames_minsupport[
    (mod_colnames_minsupport['length'] == 2) &
    (mod_colnames_minsupport['support'] >= 0.02) &
    (mod_colnames_minsupport['support'] < 0.021)
]
Figure 8.38: The section of the results of filtering based on length and support
The results did change. Before even looking at the particular item sets and their support values, we see that this filtered DataFrame has fewer item sets than the DataFrame in the previous exercise. When we use the full dataset, there are fewer item sets that match the filtering criteria; that is, only 14 item sets contain 2 items and have a support value greater than or equal to 0.02, and less than 0.021. In the previous exercise, 17 item sets met these criteria.
Plot the support values:
mod_colnames_minsupport.hist("support", grid=False, bins=30)
plt.title("Support")
Figure 8.39: The distribution of support values
This plot shows the distribution of support values for the full transaction dataset. As you might have assumed, the distribution is right skewed; that is, most of the item sets have lower support values and there is a long tail of support values on the higher end of the spectrum. Given how many unique item sets exist, it is not surprising that no single item set appears in a high percentage of the transactions. With this information, we could tell management that even the most prominent item set only appears in approximately 10% of the transactions, and that the vast majority of item sets appear in less than 2% of transactions. These results may not support changes in store layout, but could very well inform pricing and discounting strategies. We would gain more information on how to build these strategies by formalizing some association rules.
Activity 20: Finding the Association Rules on the Complete Online Retail Dataset
Solution:
Fit the association rule model on the full dataset. Use metric confidence and a minimum threshold of 0.6:
rules = mlxtend.frequent_patterns.association_rules(
    mod_colnames_minsupport,
    metric="confidence",
    min_threshold=0.6,
    support_only=False
)
rules.loc[0:6]
The output is as follows:
Figure 8.40: The association rules based on the complete online retail dataset
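The confidence threshold used above can be illustrated on a toy example. Confidence for a rule A -> C is the number of transactions containing both A and C divided by the number containing A. The function confidence, the toy transactions, and the item names are hypothetical; only the 0.6 threshold comes from this activity:

```python
def confidence(antecedent, consequent, transactions):
    """Share of transactions containing the antecedent that also contain the consequent."""
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    n_antecedent = sum(1 for b in transactions if antecedent <= set(b))
    n_both = sum(1 for b in transactions if both <= set(b))
    return n_both / n_antecedent


# Hypothetical toy transactions
toy_transactions = [
    ["tea", "cup"],
    ["tea", "cup"],
    ["tea"],
    ["cup"],
    ["tea", "cup", "spoon"],
]
tea_to_cup = confidence({"tea"}, {"cup"}, toy_transactions)      # 3 of 4 tea baskets
cup_to_spoon = confidence({"cup"}, {"spoon"}, toy_transactions)  # 1 of 4 cup baskets

# Keep only rules that clear the 0.6 threshold used in this activity
kept = [name for name, value in
        [("tea -> cup", tea_to_cup), ("cup -> spoon", cup_to_spoon)]
        if value >= 0.6]
```

In this sketch only the rule tea -> cup survives the threshold, which mirrors how min_threshold=0.6 prunes the candidate rules.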
Count the number of association rules. Is the number different from that found in Exercise 45, Deriving Association Rules, Step 1?
print("Number of Associations: {}".format(rules.shape[0]))
There are 498 association rules.
Plot confidence against support:
rules.plot.scatter("support", "confidence", alpha=0.5, marker="*")
plt.xlabel("Support")
plt.ylabel("Confidence")
plt.title("Association Rules")
plt.show()
The output is as follows:
Figure 8.41: The plot of confidence against support
The plot reveals that there are some association rules featuring relatively high support and confidence values for this dataset.
Look at the distributions of lift, leverage, and conviction:
rules.hist("lift", grid=False, bins=30)
plt.title("Lift")
The output is as follows:
Figure 8.42: The distribution of lift values
rules.hist("leverage", grid=False, bins=30)
plt.title("Leverage")
The output is as follows:
Figure 8.43: The distribution of leverage values
plt.hist(
    rules[numpy.isfinite(rules['conviction'])].conviction.values,
    bins=30
)
plt.title("Conviction")
The output is as follows:
Figure 8.44: The distribution of conviction values
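The three metrics plotted above can all be derived from the support values of a rule's antecedent, its consequent, and their union. The sketch below uses the standard definitions. The function rule_metrics and the input values are hypothetical and are not taken from the retail results. It also shows why the conviction plot filters with numpy.isfinite: when confidence is exactly 1, the conviction denominator is zero and the value is treated as infinite.

```python
import math


def rule_metrics(support_a, support_c, support_ac):
    """Confidence, lift, leverage, and conviction for a rule A -> C."""
    confidence = support_ac / support_a
    lift = confidence / support_c
    leverage = support_ac - support_a * support_c
    if confidence == 1:
        # Denominator (1 - confidence) is zero; treat conviction as infinite
        conviction = math.inf
    else:
        conviction = (1 - support_c) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Hypothetical support values for two rules
typical_rule = rule_metrics(support_a=0.5, support_c=0.5, support_ac=0.375)
perfect_rule = rule_metrics(support_a=0.25, support_c=0.5, support_ac=0.25)
```

A lift above 1 and a leverage above 0 both indicate that the antecedent and consequent occur together more often than they would if they were independent.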
Having derived association rules, we can return to management with additional information, the most important of which would be that there are roughly seven association rules that have reasonably high values for both support and confidence. Look at the scatterplot of confidence against support to see the seven points that are separated from all the others. These seven rules also have high lift values, as can be seen in the lift histogram. It seems that we have identified some actionable association rules, that is, rules that we can use to drive business decisions.