In this section, we look at a technique for detecting YouTube comment spam using a bag-of-words representation and random forests. The dataset is straightforward: about 2,000 comments from popular YouTube videos (https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection). Each row holds a comment followed by a label, 1 for spam or 0 for not spam.
First, we import a single file. The dataset is actually split into five files, one per video; our set of comments comes from the PSY-Gangnam Style video.
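The loading step can be sketched with pandas as follows. This is a minimal sketch, not the original listing: the file name `Youtube01-Psy.csv` and the column layout (`COMMENT_ID`, `AUTHOR`, `DATE`, `CONTENT`, `CLASS`) are assumptions based on the UCI download, `load_comments` is a hypothetical helper, and the three rows are invented stand-ins so that the example runs without the file.

```python
import io

import pandas as pd

# Stand-in for Youtube01-Psy.csv: the header row is assumed to match the UCI
# files, but the three data rows are invented for illustration only.
SAMPLE_CSV = io.StringIO(
    "COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS\n"
    "id-1,alice,2013-11-07T06:20:48,Check out my channel for free gifts,1\n"
    "id-2,bob,2013-11-08T10:01:12,This song never gets old,0\n"
    "id-3,carol,2013-11-09T08:28:43,Subscribe to me and I will subscribe back,1\n"
)


def load_comments(source):
    """Read one comments file (a path or a file-like object) into a DataFrame."""
    return pd.read_csv(source)


# With the real download, this would be load_comments("Youtube01-Psy.csv").
d = load_comments(SAMPLE_CSV)
print(len(d), "comments loaded")
```

Keeping the reader in a small function makes it easy to call it once per file later, if the remaining videos are added to the training set.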
Then we print a few comments.
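One way to preview the rows is `DataFrame.head`, sketched below. The frame here is built inline from invented rows so that the snippet runs on its own; the column names are an assumption about the real file, and with the actual data `d` would be the frame read from the CSV.

```python
import pandas as pd

# Invented rows that mimic the assumed column layout of the comments file.
d = pd.DataFrame(
    {
        "COMMENT_ID": ["id-1", "id-2", "id-3"],
        "AUTHOR": ["alice", "bob", "carol"],
        "DATE": ["2013-11-07", "2013-11-08", "2013-11-09"],
        "CONTENT": [
            "Check out my channel for free gifts",
            "This song never gets old",
            "Subscribe to me and I will subscribe back",
        ],
        "CLASS": [1, 0, 1],
    }
)

# head(n) returns the first n rows; tail(n) would return the last n instead.
first_two = d.head(2)
print(first_two)
```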
The output shows more than two columns, but we only need the content and class columns. The content column contains the comments and the class column contains the values 1 or...