Creating the test and training datasets
Now that we are finished with our transformations, we will create the training and test data frames. We will perform a 50/50 split between training and test:
# Take a sample of full vector
nrow(OnlineRetail)
> [1] 536068
pctx <- round(0.5 * nrow(OnlineRetail))
set.seed(1)
# randomize rows
df <- OnlineRetail[sample(nrow(OnlineRetail)), ]
rows <- nrow(df)
OnlineRetail <- df[1:pctx, ] # training set
OnlineRetail.test <- df[(pctx + 1):rows, ] # test set
rm(df)
# Display the number of rows in the training and test datasets.
nrow(OnlineRetail)
> [1] 268034
nrow(OnlineRetail.test)
> [1] 268034
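The same shuffle-then-split idea can be tried on a small data frame before running it on the full data. The following is only a sketch: split_data is a hypothetical helper that does not appear in the original code, and the toy data frame holds made-up values standing in for OnlineRetail.

```r
# Hypothetical helper (an assumption, not part of the original analysis):
# shuffle the rows once, then cut the shuffled frame into two parts.
split_data <- function(data, train_frac = 0.5, seed = 1) {
  set.seed(seed)  # fix the random shuffle so the split is reproducible
  shuffled <- data[sample(nrow(data)), , drop = FALSE]
  n_train <- round(train_frac * nrow(shuffled))
  n_test <- nrow(shuffled) - n_train
  list(
    train = shuffled[seq_len(n_train), , drop = FALSE],
    # seq_len() yields an empty index when n_test is 0,
    # whereas (n_train + 1):nrow(shuffled) would count backwards
    test = shuffled[seq_len(n_test) + n_train, , drop = FALSE]
  )
}

# Toy stand-in for OnlineRetail (made-up values)
toy <- data.frame(InvoiceNo = 1:10,
                  Quantity = c(5, 3, 8, 1, 2, 9, 4, 7, 6, 10))

parts <- split_data(toy)
nrow(parts$train)
nrow(parts$test)

# Number of invoices that landed in both sets; a clean split gives 0
length(intersect(parts$train$InvoiceNo, parts$test$InvoiceNo))
```

Because every row is shuffled exactly once and then assigned by position, each row ends up in exactly one of the two sets, and rerunning with the same seed reproduces the same split.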
Saving the results
It is a good idea to periodically save your data frames, so that you can pick up your analysis from various checkpoints.
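As a minimal sketch of what such a checkpoint can look like, the snippet below writes a data frame to disk and reads it back using base R's saveRDS() and readRDS(). This is one possible approach and not necessarily the one used in this chapter. The toy data frame and the temporary file location are assumptions made for illustration only.

```r
# Toy stand-in for a data frame worth checkpointing (made-up values)
train <- data.frame(InvoiceNo = c(3L, 1L, 2L),
                    Quantity = c(6, 12, 4))

# Assumption: a temporary file is used here; substitute your own project path
checkpoint <- file.path(tempdir(), "train_checkpoint.rds")

saveRDS(train, checkpoint)       # write the object to disk
restored <- readRDS(checkpoint)  # read it back, possibly in a later session

# Check that the restored copy matches what was saved
isTRUE(all.equal(train, restored))
```

Saving one object per file in this way lets you name the restored object whatever you like when you read it back.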
In this example, I will first sort them both by InvoiceNo, and then save the test and training datasets to disk, where I can always load them back into memory as needed:
setwd("C:/PracticalPredictiveAnalytics...