Word embeddings
Tokenizers are one way of extracting features from text. They are powerful, can be trained to create complex tokens, and capture statistical dependencies between words. However, they are limited by the fact that they are trained in a completely unsupervised way and do not capture the meaning of words or the relationships between them. This means that tokenizers are great at providing input to neural network models such as BERT, but sometimes we would like features that are more closely aligned with a specific task.
This is where word embeddings come to the rescue. The following code shows how to instantiate a word embedding model, which is imported from the gensim library. First, we need to prepare the dataset:
from gensim.models import word2vec

# now, we need to prepare a dataset
# in our case, let's just read a dataset that is the code of a program
# in this example, I use a file from an open source component - Azure NetX
# the actual part is not that important, as long as we have...
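To give a feel for the whole flow, here is a minimal sketch of how the dataset preparation and training could continue. The file name program.c is only a placeholder, the whitespace split is a simplification of real tokenization, and the training parameters are illustrative gensim 4.x defaults rather than values taken from the original example:

from gensim.models import word2vec

# read the source file line by line (program.c is a placeholder name)
with open('program.c', 'r', encoding='utf-8') as f:
    lines = f.readlines()

# word2vec expects a list of sentences, where each sentence is a list of tokens;
# a simple whitespace split is enough for this illustration
sentences = [line.split() for line in lines if line.strip()]

# instantiate and train the model (gensim 4.x parameter names)
model = word2vec.Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of the embedding vectors
    window=5,         # context window size
    min_count=1,      # keep even tokens that appear only once
    workers=4         # number of training threads
)

# the trained embeddings can then be queried, for example the tokens most
# similar to a given token (assuming 'if' appears in the file)
print(model.wv.most_similar('if', topn=5))

In practice, the quality of the embeddings depends heavily on the size of the corpus, so a single source file is only enough to demonstrate the mechanics.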