Modeling
Training
Now that the new dataset has been created, the next step is to replace 1 with is_spam and 0 with not_spam, so that the random forest algorithm treats the target variable as categorical rather than numeric and fits a classification model. We can do this by using the recode() function within a mutate() call:
# Replace the binary 1 (spam) and 0 (not_spam) with text labels
spam_for_model <- spam_for_model %>%
  mutate(
    spam = recode(spam, '1' = 'is_spam', '0' = 'not_spam')
  )
Now, it is time to separate the data into train and test subsets. The train subset presents the model with the patterns and the labels associated with them, so that it can learn how to classify each observation according to the patterns that occur. The test set is like a school test: data the model has not seen before is presented to the trained model so that we can measure how accurate it is, that is, how much it has learned.
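To make the idea of a train/test split concrete, here is a minimal base R sketch. It is not necessarily how the split is done in this chapter: the toy data frame, its word_count column, the seed, and the 80/20 ratio are all assumptions chosen for illustration.

```r
# Toy stand-in for spam_for_model (hypothetical values, for illustration only)
toy <- data.frame(
  word_count = c(12, 40, 7, 55, 23, 31, 9, 60, 18, 44),
  spam = rep(c("is_spam", "not_spam"), times = 5)
)

set.seed(42)  # arbitrary seed so the random split is reproducible

# Randomly pick 80% of the row indices for training (the ratio is an assumption)
train_idx <- sample(seq_len(nrow(toy)), size = round(0.8 * nrow(toy)))

train <- toy[train_idx, ]   # rows the model learns from
test  <- toy[-train_idx, ]  # held-out rows, used only to evaluate the model

nrow(train)  # 8
nrow(test)   # 2
```

Because the test rows are excluded from training by index, no observation appears in both subsets, which is what makes the test score a fair measure of what the model has learned.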
As we learned during the...