How to compute cosine similarity with scikit-learn
In this section, I will use the cosine_similarity() function in scikit-learn to show how the computation works.
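Before turning to the song, a minimal sketch of the function call may help. It assumes two toy count vectors, a and b, whose values are made up for illustration and are not taken from the lyrics. The sketch compares the result of cosine_similarity() with the textbook definition, the dot product divided by the product of the vector lengths:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two toy count vectors (hypothetical values, chosen only for illustration).
# cosine_similarity() expects 2-D input of shape (n_samples, n_features).
a = np.array([[1, 1, 0]])
b = np.array([[1, 0, 1]])

# The function returns a matrix of pairwise similarities; here it is 1 x 1.
sim = cosine_similarity(a, b)[0, 0]

# Manual check: dot product divided by the product of the Euclidean norms.
manual = (a[0] @ b[0]) / (np.linalg.norm(a[0]) * np.linalg.norm(b[0]))

print(sim, manual)
```

The two vectors share one of their two nonzero terms, so both values come out at about 0.5. The same call accepts a whole document-term matrix and then returns one similarity score per pair of rows.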
First, let’s use the same song, “New York, New York” by Frank Sinatra, that was used in Chapter 2, Text Representation:
doc_list = [
    "Start spreading the news",
    "You're leaving today (tell him friend)",
    "I want to be a part of it, New York, New York",
    "Your vagabond shoes, they are longing to stray",
    "And steps around the heart of it, New York, New York",
]
This document has five sentences. Let’s create the bag of words for this document with the CountVectorizer() function in scikit-learn:
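from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

cv = CountVectorizer()
# Learn the vocabulary and build the sentence-by-word count matrix
cv_fit = cv.fit_transform(doc_list)
# Vocabulary terms, in the same order as the matrix columns
word_list = cv.get_feature_names_out()
# Total count of each word across all five sentences
count_list = cv_fit.toarray().sum(axis=0)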
Then let’s print out the result...