From Word2Vec to Doc2Vec
Word2Vec, pioneered by Mikolov et al. in 2013 [1], generates vector representations for individual words in a text corpus. It represents words as continuous vectors in a high-dimensional space that capture their semantic meaning. Doc2Vec, introduced by Le and Mikolov [2], extends the idea of Word2Vec to generate vector representations for entire documents or paragraphs. It represents documents as continuous vectors in a similar vector space, where documents with similar content or meaning lie closer together. Doc2Vec has a wide range of applications in document-level tasks such as document similarity, content recommendation, document clustering, and text summarization. The document in Doc2Vec can be a sentence, a paragraph, or an entire article. Le and Mikolov [2] refer to Doc2Vec as Paragraph Vector (PV) to emphasize that it transforms a paragraph into a vector. In Word2Vec, each word has a unique ID. In Doc2Vec, each paragraph...