Tips on building a good Doc2Vec model
I would like to offer two perspectives:
- Train Doc2Vec on a large corpus to get stable results. Longer paragraphs or documents give the model more context to learn from; a document with fewer than five words is hard to distinguish from any other.
- Lemmatization or stemming does not necessarily improve the results, so train one model with lemmatization and one without, then compare them. The same applies to the choice of architecture: the authors [2] report that “PV-DM is consistently better than PV-DBOW,” and PV-DM is accordingly the default model in the Gensim class.
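The two tips above can be sketched in a few lines of Python. This is a minimal illustration, not a tuned pipeline: the helper names `filter_short_documents` and `train_pv_variants`, the toy corpus, and every hyperparameter value are assumptions made for the example. The five-token threshold comes from the first tip, and `dm=1` / `dm=0` are the Gensim settings for PV-DM and PV-DBOW.

```python
from typing import Dict, List

MIN_TOKENS = 5  # threshold from the tip above; adjust for your own corpus


def filter_short_documents(
    docs: List[List[str]], min_tokens: int = MIN_TOKENS
) -> List[List[str]]:
    """Keep only tokenized documents with at least `min_tokens` tokens."""
    return [tokens for tokens in docs if len(tokens) >= min_tokens]


def train_pv_variants(docs: List[List[str]]) -> Dict[str, object]:
    """Train one PV-DM and one PV-DBOW model on the same corpus."""
    # Imported lazily so the filtering helper works without gensim installed.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    tagged = [
        TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(docs)
    ]
    # Illustrative hyperparameters (assumptions, not recommendations).
    shared = dict(vector_size=50, window=5, min_count=1, epochs=20, seed=1, workers=1)
    return {
        "pv_dm": Doc2Vec(documents=tagged, dm=1, **shared),    # Gensim default
        "pv_dbow": Doc2Vec(documents=tagged, dm=0, **shared),  # alternative
    }


# Toy corpus (hypothetical): the second document is too short to be useful.
corpus = [
    "the quick brown fox jumps over the lazy dog".split(),
    "too short".split(),
    "document vectors need enough context words to be distinctive".split(),
]
kept = filter_short_documents(corpus)
```

Calling `train_pv_variants(kept)` on a realistically sized corpus returns both models, which you can then score on your own downstream task. The same side-by-side pattern works for the lemmatization question: build one token list with lemmatization and one without, train a model on each, and compare.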
That completes our discussion in this chapter.