Getting the dataset and evaluation ready
In this recipe, we will load a dataset, prepare it for processing, and create an evaluation baseline. This recipe builds on some of the recipes from Chapter 3, where we used different tools to represent text in a computer-readable form.
Getting ready
For this recipe, we will use the Rotten Tomatoes reviews dataset, available through Hugging Face. This dataset consists of movie reviews, each labeled as positive or negative. We will prepare the dataset for machine learning classification. In this case, preparation involves loading the reviews, filtering out non-English ones, tokenizing the text into words, and removing stopwords. Before the machine learning algorithm can run, the text reviews need to be transformed into vectors. This transformation process is described in detail in Chapter 3.
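The preparation steps described above can be sketched roughly as follows. This is a minimal illustration rather than the recipe's actual code: the helper names (load_reviews, tokenize, looks_english, prepare), the tiny stopword list, and the stopword-ratio language check are all assumptions made for the example. A real run would more likely rely on a dedicated language detector and a full stopword list. The load_reviews function assumes the Hugging Face datasets package is installed and that the dataset is published under the name "rotten_tomatoes" with "text" and "label" fields; it is not called here, so the rest of the sketch runs offline on a three-review sample.

```python
import re

# Tiny illustrative stopword list (an assumption for this sketch);
# a real pipeline would use a fuller list from an NLP library.
STOPWORDS = {"a", "an", "and", "the", "is", "it", "of", "this", "to", "was", "in", "but"}


def load_reviews(split="train"):
    """Load (text, label) pairs from Hugging Face.

    Requires the `datasets` package and network access, so it is
    imported lazily and not called in the offline demo below.
    """
    from datasets import load_dataset

    dataset = load_dataset("rotten_tomatoes", split=split)
    return list(zip(dataset["text"], dataset["label"]))


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def looks_english(tokens, min_ratio=0.2):
    """Crude stand-in for a language detector (hypothetical heuristic):
    treat a review as English if enough tokens are English stopwords."""
    if not tokens:
        return False
    hits = sum(1 for token in tokens if token in STOPWORDS)
    return hits / len(tokens) >= min_ratio


def prepare(reviews):
    """Filter out non-English reviews, tokenize, and remove stopwords."""
    prepared = []
    for text, label in reviews:
        tokens = tokenize(text)
        if not looks_english(tokens):
            continue
        content_tokens = [t for t in tokens if t not in STOPWORDS]
        prepared.append((content_tokens, label))
    return prepared


# Made-up sample reviews; labels assume 1 = positive, 0 = negative.
sample = [
    ("This movie is a joy to watch and the cast is great.", 1),
    ("La película fue aburrida y demasiado larga.", 0),
    ("It was dull, and the plot of this film was a mess.", 0),
]
prepared = prepare(sample)
for tokens, label in prepared:
    print(label, tokens)
```

On this sample, the Spanish review is dropped by the language check, and the two English reviews are reduced to their content words. The resulting token lists are what would then be turned into vectors using the techniques from Chapter 3.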
The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook...