Packt+ | Advance your knowledge in tech

You're reading from R Data Analysis Projects Build end to end analytics systems to get deeper insights from your data

Product type Paperback

Published in Nov 2017

Publisher Packt

ISBN-13 9781788621878

Length 366 pages

Edition 1st Edition

Languages

Tools

ggplot

Concepts

Data Analysis

Author (1):

Gopi Subramanian

View More author details

A retailer has approached us with a problem. In the coming months, he wants to boost his sales. He is planning a marketing campaign on a large scale to promote sales. One aspect of his campaign is the cross-selling strategy. Cross-selling is the practice of selling additional products to customers. In order to do that, he wants to know what items/products tend to go together. Equipped with this information, he can now design his cross-selling strategy. He expects us to provide him with a recommendation of top N product associations so that he can pick and choose among them for inclusion in his campaign.

He has provided us with his historical transaction data. The data includes his past transactions, where each transaction is uniquely identified by an order_id integer, and the list of products present in the transaction called product_id. Remember the binary matrix representation of transactions we described in the introduction section? The dataset provided by our retailer is in exactly the same format.

Let's start by reading the data provided by the retailer. The code for this chapter was written in RStudio Version 0.99.491. It uses R version 3.3.1. As we work through our example, we will introduce the arules R package that we will be using. In our description, we will be using the terms order/transaction, user/customer, and item/product interchangeably. We will not be describing the installation of R packages used through the code. It's assumed that the reader is well aware of the method for installing an R package and has installed those packages before using it.

This data can be downloaded from:

data.path = '../../data/data.csv'
 data = read.csv(data.path)
 head(data)
order_id product_id
 1 837080 Unsweetened Almondmilk
 2 837080 Fat Free Milk dairy
 3 837080 Turkey
 4 837080 Caramel Corn Rice Cakes
 5 837080 Guacamole Singles
 6 837080 HUMMUS 10OZ WHITE BEAN EAT WELL

The given data is in a tabular format. Every row is a tuple of order_id, representing the transaction, product_id, the item included in that transaction, and the department_id, that is the department to which the item belongs to. This is our binary data representation, which is absolutely tenable for the classical association rule mining algorithm. This algorithm is also called market basket analysis, as we are analyzing the customer's basket—the transactions. Given a large database of customer transactions where each transaction consists of items purchased by a customer during a visit, the association rule mining algorithm generates all significant association rules between the items in the database.

What is an association rule? An example from a grocery transaction would be that the association rule is a recommendation of the form {peanut butter, jelly} => { bread }. It says that, based on the transactions, it's expected that bread will most likely be present in a transaction that contains peanut butter and jelly. It's a recommendation to the retailer that there is enough evidence in the database to say that customers who buy peanut butter and jelly will most likely buy bread.

Let's quickly explore our data. We can count the number of unique transactions and the number of unique products:

library(dplyr)
data %>%
 group_by('order_id') %>%
 summarize(order.count = n_distinct(order_id))
 
 data %>%
 group_by('product_id') %>%
 summarize(product.count = n_distinct(product_id))
# A tibble: 1 <U+00D7> 2
 `"order_id"` order.count
 <chr> <int>
 1 order_id 6988
# A tibble: 1 <U+00D7> 2
 `"product_id"` product.count
 <chr> <int>
 1 product_id 16793

We have 6988 transactions and 16793 individual products. There is no information about the quantity of products purchased in a transaction. We have used the dplyr library to perform these aggregate calculations, which is a library used to perform efficient data wrangling on data frames.

dplyr is part of tidyverse, a collection of R packages designed around a common philosophy. dplyr is a grammar of data manipulation, and provides a consistent set of methods to help solve the most common data manipulation challenges. To learn more about dplyr, refer to the following links:
http://tidyverse.org/
http://dplyr.tidyverse.org/

In the next section, we will introduce the association rule mining algorithm. We will explain how this method can be leveraged to generate the top N product recommendations requested by the retailer for his cross-selling campaign.