Representing texts with TF-IDF
We can go one step further and weight words and n-grams with the TF-IDF algorithm instead of simply counting them. TF-IDF stands for term frequency-inverse document frequency. It gives more weight to terms that are frequent within a document but rare across the corpus, and less weight to terms that appear in most documents. As a result, the highest-weighted terms are the ones most characteristic of a particular document.
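To make the weighting concrete, here is a minimal sketch that computes the textbook form of TF-IDF (term frequency multiplied by log(N / document frequency)) by hand. The three-sentence corpus is made up for illustration, and the formula is the unsmoothed textbook variant; scikit-learn's default uses a smoothed IDF and normalizes each row, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the movie was great",
    "the movie was terrible",
    "the plot was great fun",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))


def tf_idf(tokens):
    """Return {term: tf * idf} for one tokenized document (textbook variant)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }


weights = [tf_idf(tokens) for tokens in tokenized]

# Print the terms of the second document, highest weight first.
for term, weight in sorted(weights[1].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.4f}")
```

In this sketch, "the" and "was" occur in every document, so their weight is zero, while "terrible" occurs in only one document and receives the highest weight in it.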
In this recipe, we will use a different type of vectorizer that applies the TF-IDF algorithm to the input text, and then use its output to build a small classifier.
Getting ready
We will use the TfidfVectorizer class from the sklearn package. The features of the TfidfVectorizer class should be familiar from the two previous recipes, Putting documents into a bag of words and Constructing an N-gram model. We will again use the Rotten Tomatoes review dataset from Hugging Face.
How to do it…
Here are the steps to build and use the TF-IDF vectorizer:
- Run the small...