Creating a simple classifier
The reason why we need to represent text as vectors is to make it into a computer-readable form. Computers can’t understand words but are good at manipulating numbers. One of the main NLP tasks is the classification of texts, and we are going to create a classifier for movie reviews. We will use the same classifier code but with different methods of creating vectors from text.
In this section, we will create the classifier that will assign either negative or positive sentiment to Rotten Tomatoes reviews, a dataset available through Hugging Face, a large repository of open source models and datasets. We will then use a baseline method, where we encode the text by counting the number of different parts of speech present in it (verbs, nouns, proper nouns, adjectives, adverbs, auxiliary verbs, pronouns, numbers, and punctuation).
By the end of this recipe, we will have created a separate file with functions that create the dataset and train the...