Performing word embedding with BoW and TF-IDF
Let’s start with BoW and TF-IDF, which we learned to prepare in Chapter 2, Text Representation. BoW represents a document by the counts of the words it contains, while its variation, TF-IDF, is designed to reflect how important a word is to a document within a corpus.
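To make the difference between the two concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. The corpus, the variable names, and the textbook weighting tf × log(N/df) are assumptions for illustration. Libraries such as Gensim apply their own variants (for example, different log bases and normalization), so the exact numbers they produce will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is already tokenized.
toy_docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "chased", "the", "dog"],
]

# BoW: raw term counts per document.
bow_counts = [Counter(doc) for doc in toy_docs]

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for doc in toy_docs for term in set(doc))
n_docs = len(toy_docs)


def tfidf(counts):
    """Weight each count by log(N / df), one textbook TF-IDF variant."""
    return {
        term: tf * math.log(n_docs / doc_freq[term])
        for term, tf in counts.items()
    }


tfidf_weights = [tfidf(counts) for counts in bow_counts]

print(bow_counts[2])     # counts for the third document
print(tfidf_weights[2])  # TF-IDF weights for the third document
```

In the third document, "the" has the highest count but a TF-IDF weight of zero because it occurs in every document, while "chased", which occurs in only one document, receives the largest weight.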
We will first use the Dictionary class to build and manage a dictionary of terms (words or tokens). It creates a mapping between the unique terms in a corpus and their integer IDs. This mapping is the vocabulary on which the BoW is built:
from gensim.corpora import Dictionary
gensim_dictionary = Dictionary()
Let’s examine the dictionary object, gensim_dictionary. How many unique words are in it? Let’s check its length to get the number of words:
len(gensim_dictionary)
We get the following output:
40360
So, there are 40,360 unique words!
Now, we will create the BoW.
BoW
We create the BoW by using the .doc2bow() method:
bow_corpus = [gensim_dictionary.doc2bow...