As we mentioned in the chapter's introduction, there is an extension of word2vec that encodes entire documents rather than individual words. Here, a document is whatever you make of it: a sentence, a paragraph, an article, an essay, and so on. Not surprisingly, the paper describing this extension came out after the original word2vec paper and was coauthored by Quoc Le and Tomas Mikolov. Even though MLlib has yet to introduce doc2vec into its stable of algorithms, we feel a data science practitioner should know about this extension of word2vec, given its promising results on supervised learning and information retrieval tasks.
Like word2vec, doc2vec (sometimes referred to as paragraph vectors) relies on an unsupervised learning task to learn distributed representations of documents from their contextual words. Doc2vec is also...