Performing rule-based text classification using keywords
In this recipe, we will use the vocabulary of the text to classify the Rotten Tomatoes reviews. We will create a simple classifier that has a vectorizer for each class. Each vectorizer will include the words characteristic of that class. To classify a review, we vectorize its text with each of the vectorizers and assign the class whose vectorizer matches more words.
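The decision rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the recipe's implementation: plain Python sets stand in for the per-class vectorizers, a simple regular expression stands in for the tokenizer, and the vocabularies, the helper names (tokenize, count_class_words, classify), and the tie-breaking behavior are all hypothetical.

```python
import re

# Hypothetical per-class vocabularies; in the recipe these would come from
# the vectorizers built for each class.
CLASS_VOCABULARIES = {
    "positive": {"great", "moving", "funny", "brilliant"},
    "negative": {"dull", "boring", "mess", "tedious"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_class_words(text, vocabulary):
    """Count how many tokens of the text appear in the given vocabulary."""
    return sum(1 for token in tokenize(text) if token in vocabulary)


def classify(text, vocabularies=CLASS_VOCABULARIES):
    """Return the class whose vocabulary matches the most tokens.

    Assumption: ties go to the class listed first in `vocabularies`.
    """
    scores = {label: count_class_words(text, vocab)
              for label, vocab in vocabularies.items()}
    return max(scores, key=scores.get)


print(classify("A funny and moving film with a brilliant cast"))
print(classify("A dull, tedious mess"))
```

The first review matches three words from the hypothetical positive vocabulary and none from the negative one, so it is labeled positive; the second is labeled negative for the same reason in reverse. How ties should be resolved is a design choice that the text above does not specify.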
Getting ready
We will use the CountVectorizer class and the classification_report function from sklearn, as well as the word_tokenize function from NLTK. All of these are included in the poetry environment.
The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter04/4.2_rule_based.ipynb.
How to do it…
In this recipe, we will create a separate vectorizer for each class. We will then use those vectorizers to count how many words from each class appear in each review to...