A log-bilinear model computes the probability of the next word given the previous ones, and GloVe belongs to this family of models. In a log-bilinear model, this probability can be calculated in the following way:

$$p(w_n \mid w_1, \ldots, w_{n-1}) = \frac{\exp(c^\top w_n)}{\sum_{j=1}^{V} \exp(c^\top w_j)}$$

Here, let's take a look at the following terms used in the preceding formula:

- $w_n$ is the embedding vector of the word being predicted
- $c$ is a context vector that summarizes the previous words
- $V$ is the size of the vocabulary

c is computed as follows:

$$c = \sum_{i=1}^{n-1} \lambda_i w_i$$

That is, c is a weighted combination of the vectors of the previous words, with weights $\lambda_i$.
GloVe is, essentially, a log-bilinear model with a weighted least-squares objective, which means that the overall solution minimizes the sum of the squared residuals produced by every single equation. The ratios of word-word co-occurrence probabilities have the ability to encode meaning.
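In the notation of the original GloVe paper, the weighted least-squares objective that is minimized is:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $X_{ij}$ counts how often word $j$ occurs in the context of word $i$, $w_i$ and $\tilde{w}_j$ are the word and context vectors with biases $b_i$ and $\tilde{b}_j$, and $f$ is a weighting function that caps the influence of very frequent co-occurrences and satisfies $f(0) = 0$.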
We can take an example from the GloVe website (https://nlp.stanford.edu/projects/glove/) and consider the probability that the two words, ice and steam, occur together with various probe words k from the vocabulary. The following are the co-occurrence probabilities from a corpus of around 6 billion words:

                          k = solid     k = gas       k = water     k = fashion
P(k | ice)                1.9 × 10⁻⁴    6.6 × 10⁻⁵    3.0 × 10⁻³    1.7 × 10⁻⁵
P(k | steam)              2.2 × 10⁻⁵    7.8 × 10⁻⁴    2.2 × 10⁻³    1.8 × 10⁻⁵
P(k | ice)/P(k | steam)   8.9           8.5 × 10⁻²    1.36          0.96
Looking at these conditional probabilities, we can see that the word ice co-occurs more frequently with solid than with gas, whereas steam co-occurs less frequently with solid than with gas. Both steam and ice co-occur frequently with water, since both are states of water, and both co-occur infrequently with the unrelated word fashion.
Noise from non-discriminative words, such as water and fashion, cancels out in the ratio of probabilities, so that values much greater than 1 correlate with features specific to ice and values much smaller than 1 correlate with features specific to steam. In this way, the ratio of probabilities encodes the abstract concept of thermodynamic phase.
GloVe's goal is to create vectors that represent words in such a way that the dot product of two vectors equals the logarithm of the words' probability of co-occurrence. As we know, on a logarithmic scale a ratio becomes the difference of the logarithms of its two elements, so the logarithm of a ratio of co-occurrence probabilities translates, in the vector space, into the difference between two word vectors. Because of this property, these ratios conveniently encode meaning as vector differences, which makes it possible to obtain analogies such as the example we saw in Word2vec.
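To see why, note that the model learns $w_i^\top \tilde{w}_k + b_i + \tilde{b}_k \approx \log X_{ik}$; subtracting the relation for steam from the one for ice, for any probe word $k$, gives:

$$(w_{ice} - w_{steam})^\top \tilde{w}_k \approx \log \frac{P(k \mid ice)}{P(k \mid steam)}$$

up to the bias terms, which absorb the overall word frequencies. The ratios from the table above therefore become differences of word vectors, which is exactly the structure that analogy arithmetic exploits.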
Now let's see how it's possible to run GloVe. First of all, we need to install it using the following commands:
- To compile GloVe, we need gcc, a C compiler. On macOS, execute the following commands:
conda install -c psi4 gcc-6
pip install glove_python
- Alternatively, it's possible to install gcc with Homebrew:
brew install gcc
and then export gcc into CC, like so (the exact path depends on the installed version):
export CC=/usr/local/Cellar/gcc/6.3.0_1/bin/g++-6
or, if gcc-6 is linked into /usr/local/bin:
export CC="/usr/local/bin/gcc-6"
export CFLAGS="-Wa,-q"
and then install the package:
pip install glove_python
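- Once the package is installed, a quick sanity check is simply importing the modules we will use next (this one-liner is just an illustrative check, not part of the original example):
python -c "from glove import Corpus, Glove"
If the command produces no errors, the compilation and installation succeeded.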
Test GloVe with some Python code. We will use an example from https://textminingonline.com:
- Import the main libraries as follows:
import itertools
from gensim.models.word2vec import Text8Corpus
from glove import Corpus, Glove
- We need gensim just to use its Text8Corpus:
# Read the text8 corpus (the text8 file must be in the working directory)
sentences = list(itertools.islice(Text8Corpus('text8'), None))
# Build the word-word co-occurrence matrix with a window of 10 words
corpus = Corpus()
corpus.fit(sentences, window=10)
# Train 100-dimensional vectors for 30 epochs on 4 threads
glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
Observe the training of the model:
Performing 30 training epochs with 4 threads
Epoch 0
Epoch 1
Epoch 2
...
Epoch 27
Epoch 28
Epoch 29
- Add the corpus dictionary to the glove object, so that words can be looked up by string:
glove.add_dictionary(corpus.dictionary)
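- Optionally, persist the trained model so that training doesn't have to be repeated; glove_python exposes pickle-based save and load methods for this (a minimal sketch, assuming the standard glove_python API):
glove.save('glove.model')
glove = Glove.load('glove.model')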
- Check the similarity among words:
glove.most_similar('man')
Out[10]:
[(u'terc', 0.82866443231836828),
(u'woman', 0.81587362007162523),
(u'girl', 0.79950702967210407),
(u'young', 0.78944050406331179)]
glove.most_similar('man', number=10)
Out[12]:
[(u'terc', 0.82866443231836828),
(u'woman', 0.81587362007162523),
(u'girl', 0.79950702967210407),
(u'young', 0.78944050406331179),
(u'spider', 0.78827287082192377),
(u'wise', 0.7662819233076561),
(u'men', 0.70576506880860157),
(u'beautiful', 0.69492684203254429),
(u'evil', 0.6887102864856347)]
glove.most_similar('frog', number=10)
Out[13]:
[(u'shark', 0.75775974484778419),
(u'giant', 0.71914687122031595),
(u'dodo', 0.70756087345768237),
(u'dome', 0.70536309001812902),
(u'serpent', 0.69089042980042681),
(u'vicious', 0.68885819147237815),
(u'blonde', 0.68574786672123234),
(u'panda', 0.6832336174432142),
(u'penny', 0.68202780165909405)]
glove.most_similar('girl', number=10)
Out[14]:
[(u'man', 0.79950702967210407),
(u'woman', 0.79380171669979771),
(u'baby', 0.77935645649673957),
(u'beautiful', 0.77447992804057431),
(u'young', 0.77355323458632896),
(u'wise', 0.76219894067614957),
(u'handsome', 0.74155095749823707),
(u'girls', 0.72011371864695584),
(u'atelocynus', 0.71560826080222384)]
glove.most_similar('car', number=10)
Out[15]:
[(u'driver', 0.88683873415652947),
(u'race', 0.84554581794165884),
(u'crash', 0.76818020141393994),
(u'cars', 0.76308628267402701),
(u'taxi', 0.76197230282808859),
(u'racing', 0.7384645880932772),
(u'touring', 0.73836030272284159),
(u'accident', 0.69000847113708996),
(u'manufacturer', 0.67263805153963518)]
glove.most_similar('queen', number=10)
Out[16]:
[(u'elizabeth', 0.91700558183820069),
(u'victoria', 0.87533970402870487),
(u'mary', 0.85515424257738148),
(u'anne', 0.78273531080737502),
(u'prince', 0.76833451608330772),
(u'lady', 0.75227426771795192),
(u'princess', 0.73927079922218319),
(u'catherine', 0.73538567181156611),
(u'tudor', 0.73028985404704971)]
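- Finally, we can check the analogy property discussed earlier. glove_python only ships most_similar, so nearest below is a hypothetical helper built on the word_vectors, dictionary, and inverse_dictionary attributes the model exposes after add_dictionary (a sketch under those assumptions, not part of the library API):
import numpy as np

def nearest(model, vec, number=5):
    # Cosine similarity between vec and every word vector in the model
    vectors = model.word_vectors
    sims = vectors.dot(vec) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(vec))
    best = np.argsort(-sims)[:number]
    return [(model.inverse_dictionary[i], float(sims[i])) for i in best]

# king - man + woman should land near queen
v = (glove.word_vectors[glove.dictionary['king']]
     - glove.word_vectors[glove.dictionary['man']]
     + glove.word_vectors[glove.dictionary['woman']])
nearest(glove, v, number=5)
Note that the query words themselves often dominate the top results; in practice they are filtered out before reading off the analogy.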