Selecting the best variables
At this point, selecting the best variables should be straightforward, since exploring the data has already given us the answer we’re looking for. When we checked the boxplots and tested which words and characters affect the classification the most, as well as the effect of uppercase letters, we were already making a variable selection. We should use the variables that differ the most between the two groups, so that it’s easier for the algorithm to find a clear separation between them. As we have seen, the difference is largest for 23 words, the number of uppercase letters, and the presence of too many symbols.
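The idea of keeping the variables with the widest gap between the two groups can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the `emails` data frame, its column names, and the cutoff of three variables are all hypothetical and are not taken from the dataset used in this chapter. Standardizing with `scale()` before comparing group means is one simple choice among several for measuring separation.

```r
# Toy data with hypothetical column names (not the chapter's dataset).
emails <- data.frame(
  is_spam     = c(1, 1, 1, 0, 0, 0),
  exclamation = c(5, 7, 6, 0, 1, 0),
  dollar      = c(2, 3, 1, 0, 0, 1),
  uppercase   = c(40, 55, 60, 10, 12, 8),
  semicolon   = c(1, 0, 1, 1, 0, 1)
)

predictors <- setdiff(names(emails), "is_spam")

# Standardize each predictor so that gaps are comparable across scales.
scaled <- as.data.frame(scale(emails[predictors]))

# Absolute difference between the spam and non-spam group means.
mean_gap <- sapply(scaled, function(x) {
  abs(mean(x[emails$is_spam == 1]) - mean(x[emails$is_spam == 0]))
})

# Rank the variables from widest to narrowest separation.
ranked <- sort(mean_gap, decreasing = TRUE)
print(ranked)

# Keep the variables with the widest separation (top 3 is an arbitrary cutoff).
selected <- names(ranked)[1:3]
print(selected)
```

In this toy example, the `semicolon` column has the same mean in both groups, so it falls to the bottom of the ranking and is left out, which mirrors the reasoning we applied when reading the boxplots.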
In this section, we will take the top_words vector, which gathers the 23 words that have the most impact on the spam classification, together with the exclamation mark, parenthesis, dollar sign, and hashtag characters and the uppercase variables, and transform the dataset into a seven-variable Tibble, with six explanatory variables and...