R for Data Science: Learn and explore the fundamentals of data science with R

Product type Paperback

Published in Dec 2014

ISBN-13 9781784390860

Length 364 pages

Edition 1st Edition

Author (1):

Dan Toomey

Association rules

Association rules describe associations between items that occur together in a dataset. They are most commonly used in market basket analysis: given a set of transactions, each containing several different items (a shopping bag), how are the sales of those items associated? The most common measures of association for a rule "A implies B" are as follows:

Support: This is the percentage of transactions that contain both A and B.
Confidence: This is the percentage of transactions containing A that also contain B, that is, how often the rule is correct.
Lift: This is the ratio of confidence to the percentage of transactions containing B. If lift is 1, A and B are independent; a lift above 1 means that A and B occur together more often than independence would predict.
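
The three measures can be computed by hand on a small example. The following is a minimal sketch in base R; the baskets and the helper `item_support` are hypothetical and are not taken from the groceries data used later.

```r
# Toy transactions (hypothetical data for illustration only)
baskets <- list(
  c("bread", "butter"),
  c("bread", "butter", "milk"),
  c("bread", "milk"),
  c("butter", "milk"),
  c("bread", "butter")
)

# Fraction of baskets that contain every item in `items`
item_support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

A <- "bread"
B <- "butter"

supp_AB    <- item_support(c(A, B))        # share of baskets with both A and B
confidence <- supp_AB / item_support(A)    # of baskets with A, share that also have B
lift       <- confidence / item_support(B) # a value of 1 would indicate independence

print(c(support = supp_AB, confidence = confidence, lift = lift))
```

In this toy data, the lift comes out below 1, so bread and butter appear together slightly less often than independence would predict.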

Mine for associations

The most widely used tool in R for mining association rules is the apriori function.

Usage

The apriori function, provided by the arules package, can be called as follows:

apriori(data, parameter = NULL, appearance = NULL, control = NULL)

The arguments of the apriori function are explained in the following table:

Parameter	Description
`data`	This is the transaction data.
`parameter`	This is a list of mining parameters. The defaults are `support` as 0.1, `confidence` as 0.8, and `maxlen` as 10. You can override any of these values.
`appearance`	This is used to restrict items that appear in rules.
`control`	This is used to adjust the performance of the algorithm used.
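
As a rough illustration of how these arguments fit together, the sketch below calls apriori on a small in-memory set of transactions. The baskets, the thresholds, and the names `rules_all` and `rules_butter` are assumptions made for this example; it also assumes the arules package is already installed.

```r
library(arules)

# Toy transactions (hypothetical data for illustration only)
baskets <- list(
  c("bread", "butter"),
  c("bread", "butter", "milk"),
  c("bread", "milk"),
  c("butter", "milk"),
  c("bread", "butter")
)
trans <- as(baskets, "transactions")

# parameter: override the default thresholds; minlen = 2 skips rules with an empty left-hand side
# control: verbose = FALSE suppresses the progress log
rules_all <- apriori(
  trans,
  parameter = list(support = 0.5, confidence = 0.7, minlen = 2),
  control   = list(verbose = FALSE)
)

# appearance: only allow "butter" on the right-hand side of a rule
rules_butter <- apriori(
  trans,
  parameter  = list(support = 0.5, confidence = 0.7, minlen = 2),
  appearance = list(rhs = "butter", default = "lhs"),
  control    = list(verbose = FALSE)
)

inspect(rules_all)
inspect(rules_butter)
```

Restricting the right-hand side through `appearance` is a convenient way to ask a focused question such as "what leads to butter being bought?" without filtering the full rule set afterwards.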

Example

You will need to install and load the arules package as follows:

> install.packages("arules")
> library(arules)

The market basket data can be loaded as follows:

> data <- read.csv("http://www.salemmarafi.com/wp-content/uploads/2014/03/groceries.csv")

Then, we can generate rules from the data as follows:

> rules <- apriori(data) 

parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target
        0.8    0.1    1 none FALSE            TRUE     0.1      1     10  rules
   ext
 FALSE

algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[655 item(s), 15295 transaction(s)] done [0.00s].
sorting and recoding items ... [3 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [5 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].

There are several points to highlight in the results:

As you can see from the display, we are using the default settings (confidence 0.8, support 0.1, and so on).
The data contains 15,295 transactions and 655 distinct items; only three of those items meet the minimum support threshold.
We generated five rules.

We can examine the rules that were generated as follows:

> rules

set of 5 rules 
> inspect(rules)

  lhs                       rhs              support confidence     lift
1 {semi.finished.bread=} => {margarine=}   0.2278522          1 2.501226
2 {semi.finished.bread=} => {ready.soups=} 0.2278522          1 1.861385
3 {margarine=}           => {ready.soups=} 0.3998039          1 1.861385
4 {semi.finished.bread=,                                                
   margarine=}           => {ready.soups=} 0.2278522          1 1.861385
5 {semi.finished.bread=,                                                
   ready.soups=}         => {margarine=}   0.2278522          1 2.501226

The output has been slightly reformatted for readability.
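
The reported lift values can be checked against the definition of lift (confidence divided by the support of the right-hand side). The sketch below uses only numbers copied from the output above; the variable names are made up for this example.

```r
# Rule 3: {margarine=} => {ready.soups=}, values copied from the inspect() output
support_rule3    <- 0.3998039
confidence_rule3 <- 1

# confidence = support(A and B) / support(A), so support(A) = support(A and B) / confidence
support_margarine <- support_rule3 / confidence_rule3

# Rule 1: {semi.finished.bread=} => {margarine=} has confidence 1,
# so its lift should be confidence / support(margarine=)
confidence_rule1 <- 1
lift_rule1 <- confidence_rule1 / support_margarine

print(lift_rule1)  # compare with the lift reported for rule 1
```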

Looking over the rules, there is a clear connection between buying bread, soup, and margarine—at least in the market where and when the data was gathered.
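
The item labels end in `=` most likely because read.csv returns a data frame, which apriori coerces into `variable=value` items, one per column. As an alternative sketch, not the approach used in the text, the arules package also provides read.transactions, which treats each line of a file as one basket. The example below assumes a small temporary file with made-up contents rather than the groceries file.

```r
library(arules)

# Write a tiny basket-format file: one transaction per line, items separated by commas
# (hypothetical contents for illustration only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("bread,butter", "bread,butter,milk", "milk"), tmp)

# format = "basket" means each line lists the items of one transaction
trans <- read.transactions(tmp, format = "basket", sep = ",")

print(length(trans))      # number of transactions
print(itemLabels(trans))  # distinct items, without a "=" suffix
print(size(trans))        # number of items in each transaction
```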

If we change the parameters (thresholds) used in the calculation, we get a different set of rules. For example, check the following code:

> rules <- apriori(data, parameter = list(supp = 0.001, conf = 0.8))

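This code generates over 500 rules, but many of them have questionable meaning because the minimum support is now only 0.001: a rule can be built from items that appear together in as few as 0.1 percent of the transactions.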
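
To see how weak a support threshold of 0.001 is, it helps to convert it into a transaction count. The following base R sketch uses the transaction total from the apriori log above; rounding up with `ceiling` is an assumption about how a fractional count would be treated, not a statement about the internals of arules.

```r
n_transactions <- 15295  # taken from the apriori log above

# Minimum support under the default settings and under the lowered threshold
min_support <- c(default = 0.1, lowered = 0.001)

# Smallest number of transactions an itemset must appear in (assumed rounding)
min_count <- ceiling(min_support * n_transactions)

print(min_count)
```

With the lowered threshold, an itemset needs to occur in only a handful of transactions, so many of the resulting rules may reflect chance co-occurrence rather than a real buying pattern.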