Using a pretrained model for semantic search
Gensim is probably the most popular library for Word2Vec modeling. A famous pretrained Word2Vec model is the Google News Word2Vec model, which was trained on a Google News corpus of roughly 100 billion words and contains vectors for 3 million words and phrases. Each word is represented by a vector of 300 dimensions (300 floating-point numbers). This pretrained model has been released to the public. It can be downloaded from the Google Word2Vec Archive (https://code.google.com/archive/p/word2vec/) or Kaggle (https://www.kaggle.com/datasets/leadbest/googlenewsvectorsnegative300).
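As an alternative to the manual download, Gensim's downloader API can fetch the same vectors by their registered name. Here is a minimal sketch (note that the full model is a download of well over a gigabyte):
import gensim.downloader as api

# Fetch the Google News vectors from Gensim's downloader catalog;
# this returns a KeyedVectors instance ready for querying.
wv = api.load("word2vec-google-news-300")
print(wv["king"][:5])  # first 5 of the 300 dimensions for "king"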
Gensim has developed a standalone module called KeyedVectors that can query word vectors built by different models, such as Word2Vec or FastText. Let’s import the module:
import gensim
from gensim.models import Word2Vec, KeyedVectors
I use the load_word2vec_format() method of KeyedVectors to load the Google News Word2Vec model. The limit=10000 parameter loads only the first 10,000 word vectors in the file. If we do not specify it, the default (limit=None) loads all 3 million word vectors, which takes far more time and memory.
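Here is a minimal sketch of the call, assuming the downloaded binary file is named GoogleNews-vectors-negative300.bin and sits in the working directory (adjust the path to your download location):
from gensim.models import KeyedVectors

# Load only the first 10,000 word vectors from the binary Google News file.
# The file name below is an assumption; point it at your downloaded copy.
word_vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin",
    binary=True,
    limit=10000,
)

# A quick semantic-search check: the five words closest to "king",
# ranked by cosine similarity of their 300-dimensional vectors.
print(word_vectors.most_similar("king", topn=5))
Note that with limit=10000, most_similar() searches only among those 10,000 words, so the neighbors it returns can differ from what the full model would produce.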