As a first example, we will look at the problem of identifying spam in YouTube video comments. The complete Jupyter Notebook for this example is available at Chapter05/02_example.ipynb in this book's code repository. The data consists of comments with binary labels that specify whether each comment is genuine or spam. The following code loads the comments from CSV files into a single pandas DataFrame and shuffles the rows:
import pandas as pd

# One CSV file of labeled comments per YouTube video
comments_df_list = []
comments_file = ['data/Youtube01-Psy.csv', 'data/Youtube02-KatyPerry.csv', 'data/Youtube03-LMFAO.csv',
                 'data/Youtube04-Eminem.csv', 'data/Youtube05-Shakira.csv']
for f in comments_file:
    df = pd.read_csv(f, header=0)  # the first row holds the column names
    comments_df_list.append(df)
# Combine the per-video frames, then shuffle the rows
comments_df = pd.concat(comments_df_list)
comments_df = comments_df.sample(frac=1.0)
print(comments_df.shape...