Text classification requires combining multiple word embeddings into a single document representation. A common approach is to average the embedding vectors of all words in the document. This uses information from every embedding and, via vector addition and scaling, places the document at a new point in the embedding space. However, all information about the order of words is lost.
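A minimal sketch of this averaging scheme, using NumPy; the embedding values below are made up for illustration (in practice they would come from a trained model such as Word2vec):

```python
import numpy as np

# Toy embedding lookup table; real vectors would come from a trained model.
embeddings = {
    "great":   np.array([0.8, 0.1, 0.3]),
    "product": np.array([0.2, 0.7, 0.5]),
    "quality": np.array([0.6, 0.4, 0.2]),
}

def document_vector(tokens, embeddings):
    """Average the embeddings of all in-vocabulary tokens.

    Word order is ignored: 'great product' and 'product great'
    map to the same point in the embedding space.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:  # no known words: fall back to the zero vector
        return np.zeros_like(next(iter(embeddings.values())))
    return np.mean(vectors, axis=0)

doc_vec = document_vector(["great", "product", "quality"], embeddings)
```

Note that swapping two words leaves the result unchanged, which is exactly the loss of order information described above.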
By contrast, the state-of-the-art approach to generating embeddings for longer pieces of text, such as a paragraph or a product review, is the document-embedding model Doc2vec. It was developed shortly after the original Word2vec publication by Mikolov, one of the Word2vec authors, together with Quoc Le.
Similar to Word2vec, there are also two flavors of Doc2vec:
- The distributed bag of words (DBOW) model corresponds to the Word2vec skip-gram model. The document vectors result from training a network on the synthetic task of predicting the words contained in the document...