A retailer has approached us with a problem. In the coming months, he wants to boost his sales. He is planning a marketing campaign on a large scale to promote sales. One aspect of his campaign is the cross-selling strategy. Cross-selling is the practice of selling additional products to customers. In order to do that, he wants to know what items/products tend to go together. Equipped with this information, he can now design his cross-selling strategy. He expects us to provide him with a recommendation of top N product associations so that he can pick and choose among them for inclusion in his campaign.
He has provided us with his historical transaction data. The data includes his past transactions, where each transaction is uniquely identified by an order_id integer, and the list of products present in the transaction called product_id. Remember the binary matrix representation of transactions we described in the introduction section? The dataset provided by our retailer is in exactly the same format.
Let's start by reading the data provided by the retailer. The code for this chapter was written in RStudio Version 0.99.491. It uses R version 3.3.1. As we work through our example, we will introduce the arules R package that we will be using. In our description, we will be using the terms order/transaction, user/customer, and item/product interchangeably. We will not be describing the installation of R packages used through the code. It's assumed that the reader is well aware of the method for installing an R package and has installed those packages before using it.
This data can be downloaded from:
data.path = '../../data/data.csv'
data = read.csv(data.path)
head(data)
order_id product_id
1 837080 Unsweetened Almondmilk
2 837080 Fat Free Milk dairy
3 837080 Turkey
4 837080 Caramel Corn Rice Cakes
5 837080 Guacamole Singles
6 837080 HUMMUS 10OZ WHITE BEAN EAT WELL
The given data is in a tabular format. Every row is a tuple of order_id, representing the transaction, product_id, the item included in that transaction, and the department_id, that is the department to which the item belongs to. This is our binary data representation, which is absolutely tenable for the classical association rule mining algorithm. This algorithm is also called market basket analysis, as we are analyzing the customer's basket—the transactions. Given a large database of customer transactions where each transaction consists of items purchased by a customer during a visit, the association rule mining algorithm generates all significant association rules between the items in the database.
Let's quickly explore our data. We can count the number of unique transactions and the number of unique products:
library(dplyr)
data %>%
group_by('order_id') %>%
summarize(order.count = n_distinct(order_id))
data %>%
group_by('product_id') %>%
summarize(product.count = n_distinct(product_id))
# A tibble: 1 <U+00D7> 2
`"order_id"` order.count
<chr> <int>
1 order_id 6988
# A tibble: 1 <U+00D7> 2
`"product_id"` product.count
<chr> <int>
1 product_id 16793
We have 6988 transactions and 16793 individual products. There is no information about the quantity of products purchased in a transaction. We have used the dplyr library to perform these aggregate calculations, which is a library used to perform efficient data wrangling on data frames.
http://tidyverse.org/
http://dplyr.tidyverse.org/
In the next section, we will introduce the association rule mining algorithm. We will explain how this method can be leveraged to generate the top N product recommendations requested by the retailer for his cross-selling campaign.