Merging the results back into the original data
We will want to retain the number of total items for each invoice on the original data frame. That will involve joining the number of items contained in each invoice back to the original transactions, using the merge()
function, and specifying Invoicenum
as the key.
If you count the number of distinct invoices before and after the merge, you can see that the invoice count is lower than prior to the merge:
#first take a 'before' snapshot nrow(OnlineRetail) > [1] 541909 #count the number of distinct invoices sqldf("select count(distinct InvoiceNo) from OnlineRetail")
The output shows a total of 25900 distinct invoices:
> count(distinct InvoiceNo) > 1 25900
Now merge the counts back with the original data:
OnlineRetail <- merge(OnlineRetail, x2, by = "InvoiceNo")
Check the new number of rows, and the new count of distinct invoices (20059 versus 25900). Note these counts compared to the original. The reduction...