In this section, we'll implement a short pipeline for preprocessing text sequences and training a word2vec model with the processed data. We'll also implement another example to visualize embedding vectors and check some of their interesting properties.
The code in this section requires the following Python packages:
- Gensim (version 3.8.0, https://radimrehurek.com/gensim/) is an open source Python library for unsupervised topic modeling and NLP. It can train word2vec and fastText models and load pretrained GloVe vectors, so it covers all three models that we have discussed so far.
- The Natural Language Toolkit (NLTK, version 3.4.4, https://www.nltk.org/) is a Python suite of libraries and programs for symbolic and statistical NLP.
- Scikit-learn (version 0.19.1, https://scikit-learn.org/) is an open source Python ML library with various classification, regression, and clustering algorithms.
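Before running the examples, it may help to confirm which versions of these packages are installed. The following is a minimal sketch, not part of the original pipeline: it assumes Python 3.8 or later (for `importlib.metadata`) and that the packages are published under the distribution names `gensim`, `nltk`, and `scikit-learn`. The helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as assumed to be published on PyPI.
PACKAGES = ["gensim", "nltk", "scikit-learn"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment.
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

If a reported version differs from the one listed above, some API calls in this section may behave differently, since these libraries have changed between major releases.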