There are several algorithmic implementations for association rule mining. Key among them is the apriori algorithm by Rakesh Agrawal and Ramakrishnan Srikanth, introduced in their paper, Fast Algorithms for Mining Association Rules. Going forward, we will use both the apriori algorithm and the association rule mining algorithm interchangeably.
Apriori is a parametric algorithm. It requires two parameters named support and confidence from the user. Support is needed to generate the frequent itemsets and the confidence parameter is required to filter the induced rules from the frequent itemsets. Support and confidence are broadly called interest measures. There are a lot of other interest measures in addition to support and confidence.
We will explain the association rule mining algorithm and the effect of the interest measures on the algorithm as we write our R code. We will halt our code writing in the required places to get a deeper understanding of how the algorithm works, the algorithm terminology such as itemsets, and how to leverage the interest measures to our benefit to support the cross-selling campaign.
We will use the arules package, version 1.5-0, to help us perform association mining on this dataset.
Type SessionInfo() into your R terminal to get information about the packages, including the version of the packages loaded in the current R session.
library(arules)
transactions.obj <- read.transactions(file = data.path, format = "single",
sep = ",",
cols = c("order_id", "product_id"),
rm.duplicates = FALSE,
quote = "", skip = 0,
encoding = "unknown")
We begin with reading our transactions stored in the text file and create an arules data structure called transactions. Let's look at the parameters of read.transactions, the function used to create the transactions object. For the first parameter, file, we pass our text file where we have the transactions from the retailer. The second parameter, format, can take any of two values, single or basket, depending on how the input data is organized. In our case, we have a tabular format with two columns--one column representing the unique identifier for our transaction and the other column for a unique identifier representing the product present in our transaction. This format is named single by arules. Refer to the arules documentation for a detailed description of all the parameters.
On inspecting the newly created transactions object transaction.obj:
transactions.obj
transactions in sparse format with
6988 transactions (rows) and
16793 items (columns)
We can see that there are 6988 transactions and 16793 products. They match the previous count values from the dplyr output.
Let's explore this transaction object a little bit more. Can we see the most frequent items, that is, the items that are present in most of the transactions and vice versa—the least frequent items and the items present in many fewer transactions?
The itemFrequency function in the arules package comes to our rescue. This function takes a transaction object as input and produces the frequency count (the number of transactions containing this product) of the individual products:
data.frame(head(sort(itemFrequency(transactions.obj, type = "absolute")
, decreasing = TRUE), 10)) # Most frequent
Banana 1456
Bag of Organic Bananas 1135
Organic Strawberries 908
Organic Baby Spinach 808
Organic Hass Avocado 729
Organic Avocado 599
Large Lemon 534
Limes 524
Organic Raspberries 475
Organic Garlic 432
head(sort(itemFrequency(transactions.obj, type = "absolute")
, decreasing = FALSE), 10) # Least frequent
0% Fat Black Cherry Greek Yogurt y 1
0% Fat Blueberry Greek Yogurt 1
0% Fat Peach Greek Yogurt 1
0% Fat Strawberry Greek Yogurt 1
1 % Lowfat Milk 1
1 Mg Melatonin Sublingual Orange Tablets 1
Razor Handle and 2 Freesia Scented Razor Refills Premium BladeRazor System 1
1 to 1 Gluten Free Baking Flour 1
1% Low Fat Cottage Cheese 1
1% Lowfat Chocolate Milk 1
In the preceding code, we print the most and the least frequent items in our database using the itemFrequency function. The itemFrequency function produces all the items with their corresponding frequency, and the number of transactions in which they appear. We wrap the sort function over itemFrequency to sort this output; the sorting order is decided by the decreasing parameter. When set to TRUE, it sorts the items in descending order based on their transaction frequency. We finally wrap the sort function using the head function to get the top 10 most/least frequent items.
The Banana product is the most frequently occurring across 1,456 transactions. The itemFrequency method can also return the percentage of transactions rather than an absolute number if we set the type parameter to relative instead of absolute.
Another convenient way to inspect the frequency distribution of the items is to plot them visually as a histogram. The arules package provides the itemFrequencyPlot function to visualize the item frequency:
itemFrequencyPlot(transactions.obj,topN = 25)
In the preceding code, we plot the top 25 items by their relative frequency, as shown in the following diagram:
As per the figure, Banana is the most frequent item, present in 20 percent of the transactions. This chart can give us a clue about defining the support value for our algorithm, a concept we will quickly delve into in the subsequent paragraphs.
Now that we have successfully created the transaction object, let's proceed to apply the apriori algorithm to this transaction object.
The apriori algorithm works in two phases. Finding the frequent itemsets is the first phase of the association rule mining algorithm. A group of product IDs is called an itemset. The algorithm makes multiple passes into the database; in the first pass, it finds out the transaction frequency of all the individual items. These are itemsets of order 1. Let's introduce the first interest measure, Support, here:
- Support: As said previously, support is a parameter that we pass to this algorithm—a value between zero and one. Let's say we set the value to 0.1. We now say an itemset is considered frequent, and it should be used in the subsequent phases if—and only if—it appears in at least 10 percent of the transactions.
Now, in the first pass, the algorithm calculates the transaction frequency for each product. At this stage, we have order 1 itemsets. We will discard all those itemsets that fall below our support threshold. The assumption here is that items with a high transaction frequency are more interesting than the ones with a very low frequency. Items with very low support are not going to make for interesting rules further down the pipeline. Using the most frequent items, we can construct the itemsets as having two products and find their transaction frequency, that is, the number of transactions in which both the items are present. Once again, we discard all the two product itemsets (itemsets of order 2) that are below the given support threshold. We continue this way until we have exhausted them.
Let's see a quick illustration:
Pass 1 :
Support = 0.1
Product, transaction frequency
{item5}, 0.4
{item6}, 0.3
{item9}, 0.2
{item11}, 0.1
item11 will be discarded in this phase, as its transaction frequency is below the support threshold.
Pass 2:
{item5, item6}
{item5, item9}
{item6, item9}
As you can see, we have constructed itemsets of order 2 using the filtered items from pass 1. We proceed to find their transaction frequency, discard the itemsets falling below our minimum support threshold, and step on to pass 3, where once again we create itemsets of order 3, calculate the transaction frequency, and perform filtering and move on to pass 4. In one of the subsequent passes, we reach a stage where we cannot create higher order itemsets. That is when we stop:
# Interest Measures
support <- 0.01
# Frequent item sets
parameters = list(
support = support,
minlen = 2, # Minimal number of items per item set
maxlen = 10, # Maximal number of items per item set
target = "frequent itemsets")
freq.items <- apriori(transactions.obj, parameter = parameters)
The apriori method is used in arules to get the most frequent items. This method takes two parameters, the transaction.obj and the second parameter, which is a named list. We create a named list called parameters. Inside the named list, we have an entry for our support threshold. We have set our support threshold to 0.01, namely, one percent of the transaction. We settled at this value by looking at the histogram we plotted earlier. By setting the value of the target parameter to frequent itemsets, we specify that we expect the method to return the final frequent itemsets. Minlen and maxlen set lower and upper cut off on how many items we expect in our itemsets. By setting our minlen to 2, we say we don't want itemsets of order 1. While explaining the apriori in phase 1, we said that the algorithm can do many passes into the database, and each subsequent pass creates itemsets that are of order 1, greater than the previous pass. We also said apriori ends when no higher order itemsets can be found. We don't want our method to run till the end, hence by using maxlen, we say that if we reach itemsets of order 10, we stop. The apriori function returns an object of type itemsets.
It's good practice to examine the created object, itemset in this case. A closer look at the itemset object should shed light on how we ended up using its properties to create our data frame of itemsets:
str(freq.items)
Formal class 'itemsets' [package "arules"] with 4 slots
..@ items :Formal class 'itemMatrix' [package "arules"] with 3 slots
.. .. ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
.. .. .. .. ..@ i : int [1:141] 1018 4107 4767 11508 4767 6543 4767 11187 4767 10322 ...
.. .. .. .. ..@ p : int [1:71] 0 2 4 6 8 10 12 14 16 18 ...
.. .. .. .. ..@ Dim : int [1:2] 14286 70
.. .. .. .. ..@ Dimnames:List of 2
.. .. .. .. .. ..$ : NULL
.. .. .. .. .. ..$ : NULL
.. .. .. .. ..@ factors : list()
.. .. ..@ itemInfo :'data.frame': 14286 obs. of 1 variable:
.. .. .. ..$ labels: chr [1:14286] "10" "1000" "10006" "10008" ...
.. .. ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables
..@ tidLists: NULL
..@ quality :'data.frame': 70 obs. of 1 variable:
.. ..$ support: num [1:70] 0.0108 0.0124 0.0124 0.0154 0.0122 ...
..@ info :List of 4
.. ..$ data : symbol transactions.obj
.. ..$ ntransactions: int 4997
.. ..$ support : num 0.01
.. ..$ confidence : num 1
To create our freq.items.df data frame, we used the third slot of the quality freq.items object. It contains the support value for all the itemsets generated, 70 in this case. By calling the function label and passing the freq.items object, we retrieve the item names:
# Let us examine our freq item sites
freq.items.df <- data.frame(item_set = labels(freq.items)
, support = freq.items@quality)
head(freq.items.df)
item_set support
1 {Banana,Red Vine Tomato} 0.01030338
2 {Bag of Organic Bananas,Organic D'Anjou Pears} 0.01001717
3 {Bag of Organic Bananas,Organic Kiwi} 0.01016027
4 {Banana,Organic Gala Apples} 0.01073268
5 {Banana,Yellow Onions} 0.01302232
tail(freq.items.df)
item_set support
79 {Organic Baby Spinach,Organic Strawberries} 0.02575844
80 {Bag of Organic Bananas,Organic Baby Spinach} 0.02690326
81 {Banana,Organic Baby Spinach} 0.03048082
82 {Bag of Organic Bananas,Organic Strawberries} 0.03577562
83 {Banana,Organic Strawberries} 0.03305667
We create our data frame using these two lists of values. Finally, we use the head and tail functions to quickly look at the itemsets present at the top and bottom of our data frame.
Before we move on to phase two of the association mining algorithm, let's stop for a moment to investigate a quick feature present in the apriori method. If you had noticed, our itemsets consist mostly of the high frequency itemsets. However, we can ask the apriori method to ignore some of the items:
exclusion.items <- c('Banana','Bag of Organic Bananas')freq.items <- apriori(transactions.obj, parameter = parameters,
appearance = list(none = exclusion.items,
default = "both"))
freq.items.df <- data.frame(item_set = labels(freq.items)
, support = freq.items@quality)
We can create a vector of items to be excluded, and pass it as an appearance parameter to the apriori method. This will ensure that the items in our list are excluded from the generated itemsets:
head(freq.items.df,10)
item_set support.support support.confidence
1 {Organic Large Extra Fancy Fuji Apple} => {Organic Strawberries} 0.01030338 0.2727273
2 {Organic Cilantro} => {Limes} 0.01187750 0.2985612
3 {Organic Blueberries} => {Organic Strawberries} 0.01302232 0.3063973
4 {Cucumber Kirby} => {Organic Baby Spinach} 0.01001717 0.2089552
5 {Organic Grape Tomatoes} => {Organic Baby Spinach} 0.01016027 0.2232704
6 {Organic Grape Tomatoes} => {Organic Strawberries} 0.01144820 0.2515723
7 {Organic Lemon} => {Organic Hass Avocado} 0.01016027 0.2184615
8 {Organic Cucumber} => {Organic Hass Avocado} 0.01244991 0.2660550
9 {Organic Cucumber} => {Organic Baby Spinach} 0.01144820 0.2446483
10 {Organic Cucumber} => {Organic Strawberries} 0.01073268 0.2293578
As you can see, we have excluded Banana and Bag of Organic Bananas from our itemsets.
Congratulations, you have successfully finished implementing the first phase of the apriori algorithm in R!
Let's move on to phase two, where we will induce rules from these itemsets. It's time to introduce our second interest measure, confidence. Let's take an itemset from the list given to us from phase one of the algorithm, {Banana, Red Vine Tomato}.
We have two possible rules here:
- Banana => Red Vine Tomato: The presence of Banana in a transaction strongly suggests that Red Vine Tomato will also be there in the same transaction.
- Red Vine Tomato => Banana: The presence of Red Vine Tomato in a transaction strongly suggests that Banana will also be there in the same transaction.
How often are these two rules found to be true in our database? The confidence score, our next interest measure, will help us measure this:
- Confidence: For the rule Banana => Red Vine Tomato, the confidence is measured as the ratio of support for the itemset {Banana,Red Vine Tomato} and the support for the itemset {Banana}. Let's decipher this formula. Let's say that item Banana has occurred in 20 percent of the transactions, and it has occurred in 10 percent of transactions along with Red Vine Tomato, so the support for the rule is 50 percent, 0.1/0.2.
Similar to support, the confidence threshold is also provided by the user. Only those rules whose confidence is greater than or equal to the user-provided confidence will be included by the algorithm.
Let's say we have a rule, A => B, induced from itemset <A, B>. The support for the itemset <A,B> is the number of transactions where A and B occur divided by the total number of the transactions. Alternatively, it's the probability that a transaction contains <A, B>.
Now, confidence for A => B is P (B | A); which is the conditional probability that a transaction containing B also has A? P(B | A) = P ( A U B) / P (A) = Support ( A U B) / Support (A).
Remember from probability theory that when two events are independent of each other, the joint probability is calculated by multiplying the individual probability. We will use this shortly in our next interest measure lift.
With this knowledge, let's go ahead and implement phase two of our apriori algorithm in R:
> confidence <- 0.4 # Interest Measure
parameters = list(
support = support,
confidence = confidence,
minlen = 2, # Minimal number of items per item set
maxlen = 10, # Maximal number of items per item set
target = "rules"
)
rules <- apriori(transactions.obj, parameter = parameters)
rules.df <- data.frame(rules = labels(rules)
,rules@quality)
Once again, we use the apriori method; however, we set the target parameter in our parameters named list to rules. Additionally, we also provide the confidence threshold. After calling the method apriori using the returned object rules, we finally build our data frame, rules.df, to explore/view our rules conveniently. Let's look at our output data frame, rules.df. For the given confidence threshold, we can see the set of rules thrown out by the algorithm:
head(rules.df)
rules support confidence lift
1 {Red Vine Tomato} => {Banana} 0.01030338 0.4067797 1.952319
2 {Honeycrisp Apple} => {Banana} 0.01617058 0.4248120 2.038864
3 {Organic Fuji Apple} => {Banana} 0.01817401 0.4110032 1.972590
The last column titled lift is yet another interest measure.
- Lift: Often, we may have rules with high support and confidence but that are, as yet, of no use. This occurs when the item at the right-hand side of the rule has more probability of being selected alone than with the associated item. Let's go back to a grocery example. Say there are 1,000 transactions. 900 of those transactions contain milk. 700 of them contain strawberries. 300 of them have both strawberries and milk. A typical rule that can come out of such a scenario is strawberry => milk, with a confidence of 66 percent. The support (strawberry, milk) / support ( strawberry). This is not accurate and it's not going to help the retailer at all, as the probability of people buying milk is 90 percent, much higher than the 66 percent given by this rule.
This is where lift comes to our rescue. For two products, A and B, lift measures how many times A and B occur together, more often than expected if they were statistically independent. Let's calculate the lift value for this hypothetical example:
lift ( strawberry => milk ) = support ( strawberry, milk) / support( strawberry) * support (milk)
= 0.3 / (0.9)(0.7)
= 0.47
Lift provides an increased probability of the presence of milk in a transaction containing strawberry. Rules with lift values greater than one are considered to be more effective. In this case, we have a lift value of less than one, indicating that the presence of a strawberry is not going to guarantee the presence of milk in the transaction. Rather, it's the opposite—people who buy strawberries will rarely buy milk in the same transaction.
There are tons of other interest measures available in arules. They can be obtained by calling the interestMeasure function. We show how we can quickly populate these measures into our rules data frame:
interestMeasure(rules, transactions = transactions.obj)
rules.df <- cbind(rules.df, data.frame(interestMeasure(rules,
transactions = transactions.obj)))
We will not go through all of them here. There is a ton of literature available to discuss these measures and their use in filtering most useful rules.
Alright, we have successfully implemented our association rule mining algorithm; we went under the hood to understand how the algorithm works in two phases to generate rules. We have examined three interest measures: support, confidence, and lift. Finally, we know that lift can be leveraged to make cross-selling recommendations to our retail customers. Before we proceed further, let's look at some more functions in the arules package that will allow us to do some sanity checks on the rules. Let us start with redundant rules:
is.redundant(rules) # Any redundant rules ?
To illustrate redundant rules, let's take an example:
Rule 1: {Red Vine Tomato, Ginger} => {Banana}, confidence score 0.32
Rule 2: {Red Vine Tomato} => {Banana}, confidence score 0.45
Rules 1 and Rule 2 share the same right-hand side, Banana. The left-hand side of Rule 2 has a subset of items from Rule 1. The confidence score of Rule 2 is higher than Rule 1. In this case, Rule 2 is considered a more general rule than Rule 1 and hence Rule 2 will be considered as a redundant rule and removed. Another sanity check is to find duplicated rules:
duplicated(rules) # Any duplicate rules ?
To remove rules with the same left- and right-hand side, the duplicated function comes in handy. Let's say we create two rules objects, using two sets of support and confidence thresholds. If we combine these two rules objects, we may have a lot of duplicate rules. The duplicated function can be used to filter these duplicate rules. Let us now look at the significance of a rule:
is.significant(rules, transactions.obj, method = "fisher")
Given the rule A => B, we explained that lift calculates how many times A and B occur together more often than expected. There are other ways of testing this independence, like a chi-square test or a Fisher's test. The arules package provides the is.significant method to do a Fisher or a chi-square test of independence. The parameter method can either take the value of fisher or chisq depending on the test we wish to perform. Refer to the arules documentation for more details. Finally let us see the list of transactions where these rules are supported:
>as(supportingTransactions(rules, transactions.obj), "list")
Here is the complete R script we have used up until now to demonstrate how to leverage arules to extract frequent itemsets, induce rules from those itemsets, and filter the rules based on interest measures.
Here is the following code:
###########################################################################
#
# R Data Analysis Projects
#
# Chapter 1
#
# Building Recommender System
# A step step approach to build Association Rule Mining
#
# Script:
#
# RScript to explain how to use arules package
# to mine Association rules
#
# Gopi Subramanian
###########################################################################
data.path = '../../data/data.csv' ## Path to data file
data = read.csv(data.path) ## Read the data file
head(data)
library(dplyr)
data %>% group_by('order_id') %>% summarize(order.count = n_distinct(order_id))
data %>% group_by('product_id') %>% summarize(product.count = n_distinct(product_id)
library(arules)
transactions.obj <- read.transactions(file = data.path, format = "single",
sep = ",",
cols = c("order_id", "product_id"),
rm.duplicates = FALSE,
quote = "", skip = 0,
encoding = "unknown")
transactions.obj
# Item frequency
head(sort(itemFrequency(transactions.obj, type = "absolute")
, decreasing = TRUE), 10) # Most frequent
head(sort(itemFrequency(transactions.obj, type = "absolute")
, decreasing = FALSE), 10) # Least frequent
itemFrequencyPlot(transactions.obj,topN = 25)
# Interest Measures
support <- 0.01
# Frequent item sets
parameters = list(
support = support,
minlen = 2, # Minimal number of items per item set
maxlen = 10, # Maximal number of items per item set
target = "frequent itemsets"
)
freq.items <- apriori(transactions.obj, parameter = parameters)
# Let us examine our freq item sites
freq.items.df <- data.frame(item_set = labels(freq.items)
, support = freq.items@quality)
head(freq.items.df, 5)
tail(freq.items.df, 5)
# Let us now examine the rules
confidence <- 0.4 # Interest Measure
parameters = list(
support = support,
confidence = confidence,
minlen = 2, # Minimal number of items per item set
maxlen = 10, # Maximal number of items per item set
target = "rules"
)
rules <- apriori(transactions.obj, parameter = parameters)
rules.df <- data.frame(rules = labels(rules)
,rules@quality)
interestMeasure(rules, transactions = transactions.obj)
rules.df <- cbind(rules.df, data.frame(interestMeasure(rules, transactions = transactions.obj)))
rules.df$coverage <- as(coverage(rules,
transactions = transactions.obj), "list")
## Some sanity checks
duplicated(rules) # Any duplicate rules ?
is.significant(rules, transactions.obj)
## Transactions which support the rule.
as(supportingTransactions(rules, transactions.obj), "list")