Packt+ | Advance your knowledge in tech

You're reading from R Data Analysis Projects Build end to end analytics systems to get deeper insights from your data

Product type Paperback

Published in Nov 2017

Publisher Packt

ISBN-13 9781788621878

Length 366 pages

Edition 1st Edition

Languages

Tools

ggplot

Concepts

Data Analysis

Author (1):

Gopi Subramanian

View More author details

In the previous section, we saw that weighted association rule mining leads to recommendations that can increase the profitability of the retailer. Intuitively, weighted association rule mining is superior to vanilla association rule mining as the generated rules are sensitive to transactions, weights. Instead of running the plain version of the algorithm, can we run the weighted association algorithm? We were lucky enough to get transaction weights. What if the retailer did not provide us with the transaction weights? Can we infer weights from transactions? When our transaction data does not come with preassigned weights, we need some way to assign importance to those transactions. For instance, we can say that a transaction with a lot of items should be considered more important than a transaction with a single item.

The arules package provides a method called HITS to help us do the exact same thing—infer transaction weights. HITS stands for Hyperlink-induced Topic Search—a link analysis algorithm used to rate web pages developed by John Kleinberg. According to HITS, hubs are pages with large numbers of out degrees and authority are pages with large numbers of in degrees.

According to graph theory, a graph is formed by a set of vertices and edges. Edges connect the vertices. Graphs with directed edges are called directed graphs. The in degree is the number of inward directed edges from a given vertex in a directed graph. Similarly, the out degree is the number of outward directed edges from a given vertex in a directed graph.

The rationale is that if a lot of pages link to a particular page, then that page is considered an authority. If a page has links to a lot of other pages, it is considered a hub.

The paper, Mining Weighted Association Rules without Preassigned Weights by Ke Sun and Fengshan Bai, details the adaptation of the HITS algorithm for transaction databases.

The basic idea behind using the HITS algorithm for association rule mining is that frequent items may not be as important as they seem. The paper presents an algorithm that applies the HITS methodology to bipartite graphs.

In the mathematical field of graph theory, a bipartite graph (or bigraph) is a graph whose vertices can be divided into two disjoint and independent sets U and V such that every edge connects a vertex in U to one in V. Vertex sets are usually called the parts of the graph. - Wikipedia (https://en.wikipedia.org/wiki/Bipartite_graph)

According to the HITS modified for the transactions database, the transactions and products are treated as a bipartite graph, with an arc going from the transaction to the product if the product is present in the transaction. The following diagram is reproduced from the paper, Mining Weighted Association Rules without Preassigned Weights by Ke Sun and Fengshan Bai, to illustrate how transactions in a database are converted to a bipartite graph:

In this representation, the transactions can be ranked using the HITS algorithm. In this kind of representation, the support of an item is proportional to its degree. Consider item A. Its absolute support is 4; in the graph, the in degree of A is four. As you can see, considering only the support, we totally ignore transaction importance, unless the importance of the transactions is explicitly provided. How do we get the transaction weights in this scenario? Again, a way to get the weights intuitively is from a good transaction, which should be highly weighted and should contain many good items. A good item should be contained by many good transactions. By treating transactions as hubs and the products as authorities, the algorithm invokes HITS on this bipartite graph.

The arules package provides the method (HITS), which implements the algorithm that we described earlier:

weights.vector <- hits( transactions.obj, type = "relative")
weights.df <- data.frame(transactionID = labels(weights.vector), weight = weights.vector)

We invoke the algorithm using the HITS method. We have described the intuition behind using the HITS algorithm in transaction databases to give weights to our transactions. We will briefly describe how the HITS algorithm functions. Curious readers can refer to Authoritative Sources in a Hyperlinked Environment, J.M. Kleinberg, J. ACM, vol. 46, no. 5, pp. 604-632, 1999, for a better understanding of the HITS algorithm.

The HITS algorithm, to begin with, initializes the weight of all nodes to one; in our case, the items and the transactions are set to one. That is, we maintain two arrays, one for hub weights and the other one for authority weights.

It proceeds to do the following three steps in an iterative manner, that is, until our hub and authority arrays stabilize or don't change with subsequent updates:

Authority node score update: Modify the authority score of each node to the sum of the hub scores of each node that points to it.
Hub node score update: Change the hub score of each node to the sum of the authority scores of each node that it points to.
Unit normalize the hub and authority scores: Continue with the authority node score update until the hub and authority value stabilizes.

At the end of the algorithm, every item has an authority score, which is the sum of the hub scores of all the transactions that contain this item. Every transaction has a hub score that is the sum of the authority score of all the items in that transaction. Using the weights created using the HITS algorithm, we create a weights.df data frame:

head(weights.df)
                transactionID weight
1000431 1000431 1.893931e-04
100082 100082 1.409053e-04
1000928 1000928 2.608214e-05
1001517 1001517 1.735461e-04
1001650 1001650 1.184581e-04
1001934 1001934 2.310465e-04
Pass weights.df our transactions object. We can now generate the weighted association rules:

transactions.obj@itemsetInfo <- weights.df

support <- 0.01
 parameters = list(
 support = support,
 minlen = 2, # Minimal number of items per item set
 maxlen = 10, # Maximal number of items per item set
 target = "frequent itemsets"
 
 )
 weclat.itemsets <- weclat(transactions.obj, parameter = parameters)
 weclat.itemsets.df <-data.frame(weclat.itemsets = labels(weclat.itemsets)
 , weclat.itemsets@quality)

weclat.rules <- ruleInduction(weclat.itemsets, transactions.obj, confidence = 0.1)
 weclat.rules.df <-data.frame(weclat.rules = labels(weclat.rules)
 , weclat.rules@quality)
We can look into the output data frames created for frequent itemsets and rules:
head(weclat.itemsets.df)
  weclat.itemsets support
1 {Banana,Russet Potato} 0.01074109
2 {Banana,Total 0% Nonfat Greek Yogurt} 0.01198206
3 {100% Raw Coconut Water,Bag of Organic Bananas} 0.01024201
4 {Organic Roasted Turkey Breast,Organic Strawberries} 0.01124278
5 {Banana,Roma Tomato} 0.01089124
6 {Banana,Bartlett Pears} 0.01345293
tail(weclat.itemsets.df)
  weclat.itemsets support
540 {Bag of Organic Bananas,Organic Baby Spinach,Organic Strawberries} 0.02142840
541 {Banana,Organic Baby Spinach,Organic Strawberries} 0.02446832
542 {Bag of Organic Bananas,Organic Baby Spinach} 0.06536606
543 {Banana,Organic Baby Spinach} 0.07685530
544 {Bag of Organic Bananas,Organic Strawberries} 0.08640422
545 {Banana,Organic Strawberries} 0.08226264
head(weclat.rules.df)
weclat.rules support confidence lift itemset
weclat.rules support confidence lift itemset
1 {Russet Potato} => {Banana} 0.005580996 0.3714286 1.782653 1
3 {Total 0% Nonfat Greek Yogurt} => {Banana} 0.005580996 0.4148936 1.991261 2
6 {100% Raw Coconut Water} => {Bag of Organic Bananas} 0.004865484 0.3238095 1.993640 3
8 {Organic Roasted Turkey Breast} => {Organic Strawberries} 0.004579279 0.3440860 2.648098 4
9 {Roma Tomato} => {Banana} 0.006010303 0.3181818 1.527098 5
11 {Bartlett Pears} => {Banana} 0.007870635 0.4545455 2.181568 6

Based on the weights generated using the HITS algorithm, we can now order the items by their authority score. This is an alternate way of ranking the items in addition to ranking them by their frequency. We can leverage the itemFrequency function to generate the item scores:

freq.weights <- head(sort(itemFrequency(transactions.obj, weighted = TRUE),decreasing = TRUE),20)
freq.nweights <- head(sort(itemFrequency(transactions.obj, weighted = FALSE),decreasing = TRUE),20)

compare.df <- data.frame("items" = names(freq.weights),
"score" = freq.weights,
"items.nw" = names(freq.nweights),
"score.nw" = freq.nweights)
row.names(compare.df) <- NULL

Let's look at the compare.dfdata frame:

The column score gives the relative transaction frequency. The score.new column is the authority score from the hits algorithm. You can see that Limes and Large Lemon have interchanged places. Strawberries has gone further up the order compared to the original transaction frequency score.

The code is as follows:

###############################################################################
 #
 # R Data Analysis Projects
 #
 # Chapter 1
 #
 # Building Recommender System
 # A step step approach to build Association Rule Mining
 #
 # Script:
 #
 # RScript to explain application of hits to
 # transaction database.
 #
 # Gopi Subramanian
 ###############################################################################

library(arules)

get.txn <- function(data.path, columns){
 # Get transaction object for a given data file
 #
 # Args:
 # data.path: data file name location
 # columns: transaction id and item id columns.
 #
 # Returns:
 # transaction object
 transactions.obj <- read.transactions(file = data.path, format = "single",
 sep = ",",
 cols = columns,
 rm.duplicates = FALSE,
 quote = "", skip = 0,
 encoding = "unknown")
 return(transactions.obj)
 }

## Create txn object
 columns <- c("order_id", "product_id") ## columns of interest in data file
 data.path = '../../data/data.csv' ## Path to data file
 transactions.obj <- get.txn(data.path, columns) ## create txn object

## Generate weight vector using hits
 weights.vector <- hits( transactions.obj, type = "relative")
 weights.df <- data.frame(transactionID = labels(weights.vector), weight = weights.vector)

head(weights.df)

transactions.obj@itemsetInfo <- weights.df

## Frequent item sets generation
 support <- 0.01
 parameters = list(
 support = support,
 minlen = 2, # Minimal number of items per item set
 maxlen = 10, # Maximal number of items per item set
 target = "frequent itemsets"
 
 )
 weclat.itemsets <- weclat(transactions.obj, parameter = parameters)
 weclat.itemsets.df <-data.frame(weclat.itemsets = labels(weclat.itemsets)
 , weclat.itemsets@quality)

head(weclat.itemsets.df)
 tail(weclat.itemsets.df)

## Rule induction
 weclat.rules <- ruleInduction(weclat.itemsets, transactions.obj, confidence = 0.3)
 weclat.rules.df <-data.frame(weclat.rules = labels(weclat.rules)
 , weclat.rules@quality)

head(weclat.rules.df)

freq.weights <- head(sort(itemFrequency(transactions.obj, weighted = TRUE),decreasing = TRUE),20)
 freq.nweights <- head(sort(itemFrequency(transactions.obj, weighted = FALSE),decreasing = TRUE),20)

compare.df <- data.frame("items" = names(freq.weights),
 "score" = freq.weights,
 "items.nw" = names(freq.nweights),
 "score.nw" = freq.nweights)
 row.names(compare.df) <- NULL