Creating your own embeddings using Gensim
We will create an embedding using Gensim and a small text corpus called text8.
Gensim is an open-source Python library designed to extract semantic meaning from text documents. One of its features is an excellent implementation of the Word2Vec algorithm, with an easy-to-use API that allows you to train and query your own Word2Vec model. To learn more about Gensim, see https://radimrehurek.com/gensim/index.html. To install Gensim, please follow the instructions at https://radimrehurek.com/gensim/install.html.
The text8 dataset is the first 10^8 bytes of the Large Text Compression Benchmark, which in turn consists of the first 10^9 bytes of English Wikipedia [7]. The text8 dataset is accessible from within the Gensim API as an iterable of tokenized sentences, essentially a list of lists of word tokens. To download the text8 corpus, create a Word2Vec model from it, and save it for later use, run the following few lines of code (available in create_embedding_with_text8...
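A minimal sketch of those three steps is shown below, using Gensim's downloader module and the Word2Vec class; the output filename text8-word2vec.bin is an arbitrary choice for illustration, and the model is trained with Gensim's default parameters:

    import gensim.downloader as api
    from gensim.models import Word2Vec

    # download the text8 corpus as an iterable of tokenized sentences
    dataset = api.load("text8")

    # train a Word2Vec model on the corpus (default parameters)
    model = Word2Vec(dataset)

    # save the trained model for later use (filename is illustrative)
    model.save("text8-word2vec.bin")

The saved model can later be reloaded with Word2Vec.load() and queried through its wv attribute, for example to look up word vectors or find the most similar words to a given term.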