In this chapter, we expanded on the ideas introduced in Chapter 4, Transforming Text into Data Structures. Instead of relying on the syntactic aspects of a document, we focused on capturing the semantics of words in a sentence. Properties such as the co-occurrence of words help in understanding the context of a word, and we leveraged this to build vector representations of text using the Word2vec algorithm. We explored the pretrained Word2vec model developed by Google and looked at a few of the relationships it can capture. We followed this up by learning about the architecture of a Word2vec model, and then trained a few Word2vec models from scratch. We also discussed the limitations of and biases in the Word2vec model, along with some of its applications. Finally, we looked at how the Word Mover's Distance (WMD) algorithm uses word vectors to measure the distance between documents.
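As a compact recap of the chapter's three practical threads, here is a minimal sketch, assuming gensim 4.x with its downloader API and the POT package (needed by gensim's WMD solver). It loads Google's pretrained vectors, trains a small Word2vec model from scratch, and computes WMD between two short documents; the toy corpus and example queries are illustrative, not taken from the chapter.

```python
import gensim.downloader as api
from gensim.models import Word2Vec

# 1. Load Google's pretrained Word2vec vectors (a large one-time download).
pretrained = api.load("word2vec-google-news-300")

# The classic analogy the chapter explored: king - man + woman ~ queen.
print(pretrained.most_similar(positive=["king", "woman"],
                              negative=["man"], topn=1))

# 2. Train a tiny Word2vec model from scratch on a toy, pre-tokenized corpus
#    (sg=1 selects the skip-gram architecture; sg=0 would be CBOW).
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]
model = Word2Vec(sentences=corpus, vector_size=50, window=3,
                 min_count=1, sg=1)
print(model.wv["cat"].shape)  # a 50-dimensional vector for "cat"

# 3. Word Mover's Distance between two tokenized documents; out-of-vocabulary
#    tokens are ignored, and a smaller distance means more similar documents.
d = pretrained.wmdistance(["obama", "speaks", "media", "illinois"],
                          ["president", "greets", "press", "chicago"])
print(f"WMD: {d:.4f}")
```

Note that WMD treats each document as a weighted cloud of word vectors and finds the minimum cumulative distance the words of one document must "travel" to match the other, which is why it can relate the two sentences above despite them sharing no words.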
In the next chapter, we will take this idea further and build vectors for documents, sentences...