Representing texts with TF-IDF
We can go one step further and use the TF-IDF algorithm to count words and n-grams in incoming documents. TF-IDF stands for term frequency-inverse document frequency. Instead of raw counts, it gives more weight to words that are characteristic of a particular document, and less weight to words that are frequent but appear throughout most documents. You can find out more at https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting.
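To see the weighting idea in action, here is a minimal, illustrative sketch (not the recipe's code, and the three toy documents are made up) showing that a word unique to one document receives a higher TF-IDF weight in that document than a word shared by every document:

from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy documents: "cat" appears in every one, "quantum" in only one
docs = ["the cat sat", "the cat ran", "the quantum cat"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# In the third document, "quantum" (unique to it) gets a higher weight
# than "cat" (present in all documents)
vocab = vectorizer.vocabulary_
print(matrix[2, vocab["quantum"]], matrix[2, vocab["cat"]])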
In this recipe, we will use a different type of vectorizer that can apply the TF-IDF algorithm to the input text. Like the CountVectorizer class, it has an analyzer that we will use to show the representations of new sentences.
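As a rough preview of how the analyzer is obtained (the exact steps follow in the How to do it… section, and the ngram_range setting here is only an illustrative choice):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
# build_analyzer() returns the callable that splits a raw sentence
# into the tokens and n-grams the vectorizer will count
analyzer = vectorizer.build_analyzer()
print(analyzer("This is a new sentence."))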
Getting ready
We will be using the TfidfVectorizer class from the sklearn package. We will also be using the stopwords list from Chapter 1, Learning NLP Basics.
How to do it…
The TfidfVectorizer class allows for...