Chapter 8: Market Basket Analysis
Activity 18: Loading and Preparing Full Online Retail Data
Solution:
Load the online retail dataset file:
import matplotlib.pyplot as plt
import mlxtend.frequent_patterns
import mlxtend.preprocessing
import numpy
import pandas

online = pandas.read_excel(
    io="Online Retail.xlsx",
    sheet_name="Online Retail",
    header=0
)
Clean and prep the data for modeling, including turning the cleaned data into a list of lists:
# Flag cancelled invoices, whose invoice numbers contain the letter "C"
online['IsCPresent'] = (
    online['InvoiceNo']
    .astype(str)
    .apply(lambda x: 1 if x.find('C') != -1 else 0)
)

# Keep positive-quantity, non-cancelled rows and drop rows with missing values
online1 = (
    online
    .loc[online["Quantity"] > 0]
    .loc[online['IsCPresent'] != 1]
    .loc[:, ["InvoiceNo", "Description"]]
    .dropna()
)

# Build one list of item descriptions per invoice
invoice_item_list = []
for num in list(set(online1.InvoiceNo.tolist())):
    tmp_df = online1.loc[online1['InvoiceNo'] == num]
    tmp_items = tmp_df.Description.tolist()
    invoice_item_list.append(tmp_items)
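To make the cleaning rules concrete, here is a minimal pure-Python sketch of the same logic on a handful of made-up rows. The row values and the helper name build_transactions are hypothetical and are not part of the activity's solution; the sketch only illustrates which rows are dropped and how the remaining items are grouped by invoice.

```python
from collections import defaultdict

# Hypothetical toy rows: (InvoiceNo, Description, Quantity).
rows = [
    ("1001", "pen", 2),
    ("1001", "mug", 1),
    ("C1002", "pen", -2),  # dropped: invoice number contains "C"
    ("1003", None, 1),     # dropped: missing description
    ("1004", "mug", 0),    # dropped: quantity is not positive
    ("1005", "pen", 3),
]


def build_transactions(rows):
    """Group item descriptions by invoice after applying the cleaning rules."""
    baskets = defaultdict(list)
    for invoice, description, quantity in rows:
        if "C" in str(invoice):
            continue
        if quantity <= 0:
            continue
        if description is None:
            continue
        baskets[invoice].append(description)
    # One inner list per surviving invoice, in order of first appearance.
    return list(baskets.values())


transactions = build_transactions(rows)
print(transactions)
```

Only invoices 1001 and 1005 survive the filters, so the result is a list of two item lists.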
Encode the data and recast it as a DataFrame:
online_encoder = mlxtend.preprocessing.TransactionEncoder()
online_encoder_array = online_encoder.fit_transform(invoice_item_list)

online_encoder_df = pandas.DataFrame(
    online_encoder_array,
    columns=online_encoder.columns_
)

online_encoder_df.loc[
    20125:20135,
    online_encoder_df.columns.tolist()[100:110]
]
The output is as follows:
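For intuition about what the encoded DataFrame contains, the following sketch builds a comparable Boolean table by hand on three made-up baskets. It is not how TransactionEncoder is implemented; the function name one_hot_encode and the toy baskets are assumptions introduced purely for illustration.

```python
def one_hot_encode(transactions):
    """Return (sorted item names, rows of booleans) for a list of item lists."""
    columns = sorted({item for basket in transactions for item in basket})
    rows = []
    for basket in transactions:
        present = set(basket)
        # One Boolean per column: True if the basket contains that item.
        rows.append([item in present for item in columns])
    return columns, rows


# Hypothetical toy baskets.
transactions = [["pen", "mug"], ["mug"], ["bag", "pen"]]
columns, encoded = one_hot_encode(transactions)

print(columns)
for row in encoded:
    print(row)
```

Each row is one transaction and each column is one distinct item, which is the shape the Apriori step expects.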
Activity 19: Apriori on the Complete Online Retail Dataset
Solution:
Run the Apriori algorithm on the full data with reasonable parameter settings:
mod_colnames_minsupport = mlxtend.frequent_patterns.apriori(
    online_encoder_df,
    min_support=0.01,
    use_colnames=True
)

mod_colnames_minsupport.loc[0:6]
The output is as follows:
Filter the results down to the item set containing 10 COLOUR SPACEBOY PEN. Compare the support value with that under Exercise 44, Executing the Apriori algorithm:
mod_colnames_minsupport[
    mod_colnames_minsupport['itemsets'] == frozenset(
        {'10 COLOUR SPACEBOY PEN'}
    )
]
The output is as follows:
The support value does change. When the dataset is expanded to include all transactions, the support for this item set increases from 0.015 to 0.015793. That is, in the reduced dataset used for the exercises, this item set appears in 1.5% of the transactions, while in the full dataset, it appears in approximately 1.6% of transactions.
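Support is simply the share of transactions that contain every item in an item set, so it can shift whenever transactions are added or removed. A minimal sketch of that calculation follows, using made-up baskets and a hypothetical helper named support:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for basket in transactions if itemset <= set(basket))
    return hits / len(transactions)


# Hypothetical toy baskets.
transactions = [
    ["pen", "mug"],
    ["mug"],
    ["bag", "pen"],
    ["pen", "mug", "bag"],
]

pen_support = support({"pen"}, transactions)             # 3 of 4 baskets
pen_mug_support = support({"pen", "mug"}, transactions)  # 2 of 4 baskets
print(pen_support, pen_mug_support)
```

With min_support=0.01, an item set is kept only if this fraction is at least 0.01.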
Add another column containing the item set length. Then, filter down to those item sets whose length is two and whose support is in the range [0.02, 0.021]. Are the item sets the same as those found in Exercise 44, Executing the Apriori algorithm, Step 6?
mod_colnames_minsupport['length'] = (
    mod_colnames_minsupport['itemsets'].apply(lambda x: len(x))
)

mod_colnames_minsupport[
    (mod_colnames_minsupport['length'] == 2) &
    (mod_colnames_minsupport['support'] >= 0.02) &
    (mod_colnames_minsupport['support'] < 0.021)
]
The results did change. Before even looking at the particular item sets and their support values, we see that this filtered DataFrame has fewer item sets than the DataFrame in the previous exercise. When we use the full dataset, there are fewer item sets that match the filtering criteria; that is, only 14 item sets contain 2 items and have a support value greater than or equal to 0.02, and less than 0.021. In the previous exercise, 17 item sets met these criteria.
Plot the support values:
mod_colnames_minsupport.hist("support", grid=False, bins=30)
plt.title("Support")
This plot shows the distribution of support values for the full transaction dataset. As you might have assumed, the distribution is right-skewed; that is, most of the item sets have lower support values and there is a long tail of support values on the higher end of the spectrum. Given how many unique item sets exist, it is not surprising that no single item set appears in a high percentage of the transactions. With this information, we could tell management that even the most prominent item set appears in only approximately 10% of the transactions, and that the vast majority of item sets appear in less than 2% of transactions. These results may not support changes in store layout, but could very well inform pricing and discounting strategies. We would gain more information on how to build these strategies by formalizing some association rules.
Activity 20: Finding the Association Rules on the Complete Online Retail Dataset
Solution:
Fit the association rule model on the full dataset. Use confidence as the metric, with a minimum threshold of 0.6:
rules = mlxtend.frequent_patterns.association_rules(
    mod_colnames_minsupport,
    metric="confidence",
    min_threshold=0.6,
    support_only=False
)

rules.loc[0:6]
The output is as follows:
Count the number of association rules. Is the number different from that found in Exercise 45, Deriving Association Rules, Step 1?
print("Number of Associations: {}".format(rules.shape[0]))
There are 498 association rules.
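The rule count depends directly on the confidence threshold. The confidence of a rule A -> C is the support of A and C together divided by the support of A. The sketch below computes it on made-up baskets and applies the same 0.6 cutoff; the helper name confidence and the toy data are assumptions for illustration only.

```python
def confidence(antecedent, consequent, transactions):
    """Share of baskets containing the antecedent that also contain the consequent."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    n_a = sum(1 for basket in transactions if a <= set(basket))
    n_both = sum(1 for basket in transactions if both <= set(basket))
    if n_a == 0:
        raise ValueError("antecedent never occurs in the transactions")
    return n_both / n_a


# Hypothetical toy baskets.
transactions = [
    ["pen", "mug"],
    ["mug"],
    ["bag", "pen"],
    ["pen", "mug", "bag"],
]

min_threshold = 0.6
candidates = [({"pen"}, {"mug"}), ({"mug"}, {"bag"})]

# Keep only the candidate rules whose confidence meets the threshold.
kept = [
    (a, c) for a, c in candidates
    if confidence(a, c, transactions) >= min_threshold
]
print(kept)
```

Here pen -> mug has a confidence of 2/3 and is kept, while mug -> bag has a confidence of 1/3 and is dropped. Raising the threshold can only shrink the set of rules.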
Plot confidence against support:
rules.plot.scatter("support", "confidence", alpha=0.5, marker="*")
plt.xlabel("Support")
plt.ylabel("Confidence")
plt.title("Association Rules")
plt.show()
The output is as follows:
The plot reveals that there are some association rules featuring relatively high support and confidence values for this dataset.
Look at the distributions of lift, leverage, and conviction:
rules.hist("lift", grid=False, bins=30)
plt.title("Lift")
The output is as follows:
rules.hist("leverage", grid=False, bins=30)
plt.title("Leverage")
The output is as follows:
plt.hist(
    rules[numpy.isfinite(rules['conviction'])].conviction.values,
    bins=30
)
plt.title("Conviction")
The output is as follows:
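As a reminder of what these three histograms measure, the sketch below computes lift, leverage, and conviction from three support values using the standard definitions. The helper name rule_metrics and the input numbers are hypothetical. Note that conviction is infinite when confidence equals 1, which is why the conviction histogram above is restricted to finite values.

```python
import math


def rule_metrics(support_a, support_c, support_ac):
    """Metrics for a rule A -> C from the supports of A, C, and A-and-C together."""
    confidence = support_ac / support_a
    # Lift: how much more often A and C co-occur than if they were independent.
    lift = confidence / support_c
    # Leverage: observed joint support minus the joint support under independence.
    leverage = support_ac - support_a * support_c
    # Conviction: undefined division when confidence is 1, so report infinity.
    if confidence == 1:
        conviction = math.inf
    else:
        conviction = (1 - support_c) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Hypothetical support values.
weak_rule = rule_metrics(support_a=0.75, support_c=0.75, support_ac=0.5)
perfect_rule = rule_metrics(support_a=0.5, support_c=0.75, support_ac=0.5)

print(weak_rule)
print(perfect_rule)
```

A lift above 1 and a positive leverage both indicate that the antecedent and consequent appear together more often than independence would predict.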
Having derived association rules, we can return to management with additional information. Most importantly, roughly seven association rules have reasonably high values for both support and confidence. In the scatterplot of confidence against support, these seven rules sit apart from all the others. They also have high lift values, as can be seen in the lift histogram. It seems that we have identified some actionable association rules that we can use to drive business decisions.