Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
R Data Analysis Projects

You're reading from   R Data Analysis Projects Build end to end analytics systems to get deeper insights from your data

Arrow left icon
Product type Paperback
Published in Nov 2017
Publisher Packt
ISBN-13 9781788621878
Length 366 pages
Edition 1st Edition
Languages
Tools
Arrow right icon
Author (1):
Arrow left icon
Gopi Subramanian Gopi Subramanian
Author Profile Icon Gopi Subramanian
Gopi Subramanian
Arrow right icon
View More author details
Toc

Table of Contents (9) Chapters Close

Preface 1. Association Rule Mining FREE CHAPTER 2. Fuzzy Logic Induced Content-Based Recommendation 3. Collaborative Filtering 4. Taming Time Series Data Using Deep Neural Networks 5. Twitter Text Sentiment Classification Using Kernel Density Estimates 6. Record Linkage - Stochastic and Machine Learning Approaches 7. Streaming Data Clustering Analysis in R 8. Analyze and Understand Networks Using R

Retailer use case and data

A retailer has approached us with a problem. In the coming months, he wants to boost his sales. He is planning a marketing campaign on a large scale to promote sales. One aspect of his campaign is the cross-selling strategy. Cross-selling is the practice of selling additional products to customers. In order to do that, he wants to know what items/products tend to go together. Equipped with this information, he can now design his cross-selling strategy. He expects us to provide him with a recommendation of top N product associations so that he can pick and choose among them for inclusion in his campaign.

He has provided us with his historical transaction data. The data includes his past transactions, where each transaction is uniquely identified by an order_id integer, and the list of products present in the transaction called product_id. Remember the binary matrix representation of transactions we described in the introduction section? The dataset provided by our retailer is in exactly the same format.

Let's start by reading the data provided by the retailer. The code for this chapter was written in RStudio Version 0.99.491. It uses R version 3.3.1. As we work through our example, we will introduce the arules R package that we will be using. In our description, we will be using the terms order/transaction, user/customer, and item/product interchangeably. We will not be describing the installation of R packages used through the code. It's assumed that the reader is well aware of the method for installing an R package and has installed those packages before using it.

This data can be downloaded from:

data.path = '../../data/data.csv'
data = read.csv(data.path)
head(data)
order_id product_id
1 837080 Unsweetened Almondmilk
2 837080 Fat Free Milk dairy
3 837080 Turkey
4 837080 Caramel Corn Rice Cakes
5 837080 Guacamole Singles
6 837080 HUMMUS 10OZ WHITE BEAN EAT WELL

The given data is in a tabular format. Every row is a tuple of order_id, representing the transaction, product_id, the item included in that transaction, and the department_id, that is the department to which the item belongs to. This is our binary data representation, which is absolutely tenable for the classical association rule mining algorithm. This algorithm is also called market basket analysis, as we are analyzing the customer's basket—the transactions. Given a large database of customer transactions where each transaction consists of items purchased by a customer during a visit, the association rule mining algorithm generates all significant association rules between the items in the database.

What is an association rule?  An example from a grocery transaction would be that the association rule is a recommendation of the form {peanut butter, jelly} => { bread }. It says that, based on the transactions, it's expected that bread will most likely be present in a transaction that contains peanut butter and jelly. It's a recommendation to the retailer that there is enough evidence in the database to say that customers who buy peanut butter and jelly will most likely buy bread.

Let's quickly explore our data. We can count the number of unique transactions and the number of unique products:

library(dplyr)
data %>%
group_by('order_id') %>%
summarize(order.count = n_distinct(order_id))

data %>%
group_by('product_id') %>%
summarize(product.count = n_distinct(product_id))
# A tibble: 1 <U+00D7> 2
`"order_id"` order.count
<chr> <int>
1 order_id 6988
# A tibble: 1 <U+00D7> 2
`"product_id"` product.count
<chr> <int>
1 product_id 16793

We have 6988 transactions and 16793 individual products. There is no information about the quantity of products purchased in a transaction. We have used the dplyr library to perform these aggregate calculations, which is a library used to perform efficient data wrangling on data frames.

dplyr is part of tidyverse, a collection of R packages designed around a common philosophy. dplyr is a grammar of data manipulation, and provides a consistent set of methods to help solve the most common data manipulation challenges. To learn more about dplyr, refer to the following links:
http://tidyverse.org/
http://dplyr.tidyverse.org/

In the next section, we will introduce the association rule mining algorithm. We will explain how this method can be leveraged to generate the top N product recommendations requested by the retailer for his cross-selling campaign.

You have been reading a chapter from
R Data Analysis Projects
Published in: Nov 2017
Publisher: Packt
ISBN-13: 9781788621878
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime