Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Arrow up icon
GO TO TOP
Hands-On Neural Networks

You're reading from   Hands-On Neural Networks Learn how to build and train your first neural network model using Python

Arrow left icon
Product type Paperback
Published in May 2019
Publisher Packt
ISBN-13 9781788992596
Length 280 pages
Edition 1st Edition
Languages
Tools
Arrow right icon
Authors (2):
Arrow left icon
Leonardo De Marchi Leonardo De Marchi
Author Profile Icon Leonardo De Marchi
Leonardo De Marchi
Laura Mitchell Laura Mitchell
Author Profile Icon Laura Mitchell
Laura Mitchell
Arrow right icon
View More author details
Toc

Table of Contents (16) Chapters Close

Preface 1. Section 1: Getting Started FREE CHAPTER
2. Getting Started with Supervised Learning 3. Neural Network Fundamentals 4. Section 2: Deep Learning Applications
5. Convolutional Neural Networks for Image Processing 6. Exploiting Text Embedding 7. Working with RNNs 8. Reusing Neural Networks with Transfer Learning 9. Section 3: Advanced Applications
10. Working with Generative Algorithms 11. Implementing Autoencoders 12. Deep Belief Networks 13. Reinforcement Learning 14. Whats Next? 15. Other Books You May Enjoy

GloVe

GloVe stands for Global Vectors and is a model to produce a distributed word representation. It's an unsupervised learning method that finds a useful word representation in a vector space using statistics on the co-occurrence words from a corpus.

It combines two methods: global matrix factorization and local context windows. We will now explain these two models in more detail and show an example of how to use it.

Global matrix factorization

Matrix factorization, also known as matrix decomposition, is the decomposition of a matrix into a product of multiple matrices. There are many ways to decompose a matrix depending on the class of problems we aim to solve.

Matrix factorization is intended as a set of algorithms, commonly used in recommender systems. In this case, the system goal is to represent users and items in a lower-dimensional space. This space is also called latent and is where latent features, or variables, lie. A latent variable is a variable that is not observable in the inputs but is inferred using a mathematical model, usually called a latent variable model.

The main reason to use latent variables is that it's possible to reduce the dimensionality of the data. In recommender systems, data can be really sparse. For example, Amazon Marketplace recommendations deal with millions of users and objects, most of them with little to no interaction.

These types of problems are quite similar to our text task, for which there are many different words, some of them with more than one meaning, of which most won't interact with each other.

With text, we usually have interchangeable words, and, as we saw before, it's possible to infer this by how many times they co-occur in the same context. Words are also represented as term-document frequency, which gives us the frequency of the word in each document in the corpus. In this case, words represent the columns, while the rows represent the documents.

Latent semantic analysis (LSA) is only one of the models that use correlations between words to infer the meaning. Others include the following:

  • Hyperspace Analogue to Language (HAL)
  • Syntax- or dependency-based models
  • Semantic folding
  • Topic models, such as LDA

The factorization of a generic m-by-n matrix M into a product U Σ V*, where U is m-by-m and unitary, Σ is an m-by-n rectangular diagonal matrix (the non-zero entries of which are known as the singular values of M), and V is n-by-n and unitary, which looks like this:

We will now see how it’s possible to implement in Python the matrix factorization that we just saw:

from numpy import array
from numpy import diag
from numpy import zeros
from numpy import linalg
# define a matrix that we want to
A = array([
[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]
])
print('Initial matrix')
print(A)
# Applying singular-value decomposition
# VT is already the vector we are looking in
# as the formula return it transposed
# while we are interested in the normal form
U, s, VT = linalg.svd(A)
# creating a m x n Sigma matrix
Sigma = zeros((A.shape[0], A.shape[1]))
# populate Sigma with n x n diagonal matrix
Sigma[:A.shape[0], :A.shape[0]] = diag(s)
# select only two elements
n_elements = 2
Sigma = Sigma[:, :n_elements]
VT = VT[:n_elements, :]
# reconstruct
A_reconstructed = U.dot(Sigma.dot(VT))
print(A_reconstructed)
# Calculate the result
# By the dot product
# Between the U and sigma
# In python 3 it's possible to
# calculate the dot product using @
T = U @ Sigma
# for python 2 should be
# T = U.dot(Sigma)
print('dot product between U and Sigma')
print(T)
print('dot product between A and V')
T_ = A @ VT.T
print(T_)

print('Are the dot product similar? ',
'Yes' if np.isclose(X, X_a).all() else 'no')

Using the GloVe model

GloVe computes the probability of the next word given the previous one. In a log-bilinear model, this can be calculated in the following way:

Here, let's take a look at the following terms used in the preceding formula:

c is computed as follows:

GloVe is, essentially, a log-bilinear model with a weighted least-squares objective, which means that the overall solution minimizes the sum of the squares of the residuals created in the results of every single equation. The probabilities of the ratios of word-word occurring together, or simultaneously, has the ability to encode some meaning.

We can take an example from the GloVe website (https://nlp.stanford.edu/projects/glove/) and consider the probability that the two words, ice and steam, occur together. This is done by probing with the help of some words from the vocabulary. The following are some probabilities from a word corpus of around 6 billion:

Looking at these conditional probabilities, we can see that the word ice occurs together more frequently near the word solid than it does with gas, whereas steam occurs together less frequently with solid compared to gas. Steam and gas co-occur with the word water frequently, as they are states that water can appear as. On the other hand, they both occur together with the word fashion less frequently.

Noise from non-discriminative words, such as water and fashion, cancels out in the ratio of probabilities in a way that any value greater than 1 can correlate with the features specific to that of ice and any value smaller than 1 correlates well with the features that are specific to that of steam. Thus, the ratio of probabilities correlates with the non-realistic concept of thermodynamics.

GloVe's goal is to create vectors that represent words in a way that their dot product will equal the logarithm of the probability words and their co-occurrence. As we know, in the logarithmic scale a ratio is equivalent to the difference of the logarithm of the two elements considered. Because of this, the ratio of the logarithms of the probability of the elements will be translated in the vector space in the difference between two words. Because of this property, it's convenient to use these ratios to encode the meaning in a vector, and this will make it possible to use it for differences and obtain analogies such as the example we saw in Word2vec.

Now let's see how it's possible to run GloVe. First of all, we need to install it using the following commands:

  • To compile GloVe we need gcc, a c compiler. On macOS, execute the following commands:
conda install -c psi4 gcc-6
pip install glove_python
  • Alternatively, it's possible to execute the following commands:
export CC="/usr/local/bin/gcc-6"
export CFLAGS="-Wa,-q"
pip install glove_python
  • On macOS using brew:
brew install gcc
and then export gcc into CC like:
export CC=/usr/local/Cellar/gcc/6.3.0_1/bin/g++-6

Test GloVe with some Python code. We will use an example from https://textminingonline.com:

  1. Import the main libraries as follows:
import itertools
from gensim.models.word2vec import Text8Corpus
from glove import Corpus, Glove
  1. We need gensim just to use their Text8Corpus:
sentences = list(itertools.islice(Text8Corpus('text8'),None))

corpus = Corpus()

corpus.fit(sentences, window=10)
glove = Glove(no_components=100, learning_rate=0.05)

glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)

Observe the training of the model:

Performing 30 training epochs with 4 threads
Epoch 0
Epoch 1
Epoch 2
...
Epoch 27
Epoch 28
Epoch 29
  1. Add the dictionary to glove:
glove.add_dictionary(corpus.dictionary)
  1. Check the similarity among words:
glove.most_similar('man')
Out[10]:
[(u'terc', 0.82866443231836828),
(u'woman', 0.81587362007162523),
(u'girl', 0.79950702967210407),
(u'young', 0.78944050406331179)]

glove.most_similar('man', number=10)
Out[12]:
[(u'terc', 0.82866443231836828),
(u'woman', 0.81587362007162523),
(u'girl', 0.79950702967210407),
(u'young', 0.78944050406331179),
(u'spider', 0.78827287082192377),
(u'wise', 0.7662819233076561),
(u'men', 0.70576506880860157),
(u'beautiful', 0.69492684203254429),
(u'evil', 0.6887102864856347)]

glove.most_similar('frog', number=10)
Out[13]:
[(u'shark', 0.75775974484778419),
(u'giant', 0.71914687122031595),
(u'dodo', 0.70756087345768237),
(u'dome', 0.70536309001812902),
(u'serpent', 0.69089042980042681),
(u'vicious', 0.68885819147237815),
(u'blonde', 0.68574786672123234),
(u'panda', 0.6832336174432142),
(u'penny', 0.68202780165909405)]

glove.most_similar('girl', number=10)
Out[14]:
[(u'man', 0.79950702967210407),
(u'woman', 0.79380171669979771),
(u'baby', 0.77935645649673957),
(u'beautiful', 0.77447992804057431),
(u'young', 0.77355323458632896),
(u'wise', 0.76219894067614957),
(u'handsome', 0.74155095749823707),
(u'girls', 0.72011371864695584),
(u'atelocynus', 0.71560826080222384)]

glove.most_similar('car', number=10)
Out[15]:
[(u'driver', 0.88683873415652947),
(u'race', 0.84554581794165884),
(u'crash', 0.76818020141393994),
(u'cars', 0.76308628267402701),
(u'taxi', 0.76197230282808859),
(u'racing', 0.7384645880932772),
(u'touring', 0.73836030272284159),
(u'accident', 0.69000847113708996),
(u'manufacturer', 0.67263805153963518)]

glove.most_similar('queen', number=10)
Out[16]:
[(u'elizabeth', 0.91700558183820069),
(u'victoria', 0.87533970402870487),
(u'mary', 0.85515424257738148),
(u'anne', 0.78273531080737502),
(u'prince', 0.76833451608330772),
(u'lady', 0.75227426771795192),
(u'princess', 0.73927079922218319),
(u'catherine', 0.73538567181156611),
(u'tudor', 0.73028985404704971)]

Text classification with GloVe

Now we can see how it's possible to use these vectorized representations to tackle some text classification tasks. This tutorial is a modification of a python tutorial from Robert Guthrie.

After downloading the embedding from GloVe's website (https://nlp.stanford.edu/projects/glove/) we will need to decide which representation we will be using. There are four choices based on the length of the vector (50, 100, 200, 300). We will try the representation with 50 values for each vector:

possible_word_vectors = (50, 100, 200, 300)
word_vectors = possible_word_vectors[0]
file_name = f'glove.6B.{word_vectors}d.txt'
filepath = '../data/'
pretrained_embedding = os.path.join(filepath, file_name)

Now we will need to create a better structure for the association word/index, we want to have a dictionary where each word is the key and the vectorized representation is the vector. This will be handy after to quickly transform each word into a vector.

We will then use a class that follows the APIs of scikit-learn to transform our document into an average of all its embedding vectors:

class EmbeddingVectorizer(object):
"""
Follows the scikit-learn API
Transform each document in the average
of the embedding of the words in it
"""
def __init__(self, word2vec):
self.word2vec = word2vec
self.dim = 50
def fit(self, X, y):
return self
def transform(self, X):
"""
Find the embedding vector for each word in the dictionary
and take the mean for each document
"""
# Renaming it just to make it more understandable
documents = X
embedded_docs = []
for document in documents:
# For each document
# Consider the mean of all the embeddings
embedded_document = []
for words in document:
for w in words:
if w in self.word2vec:
embedded_word = self.word2vec[w]
else:
embedded_word = np.zeros(self.dim)
embedded_document.append(embedded_word)
embedded_docs.append(np.mean(embedded_document, axis=0))
return embedded_docs

Now we can finally create the embeddings as follows:

# Creating the embedding
e = EmbeddingVectorizer(embeddings_index)
X_train_embedded = e.transform(X_train)

With those it's now possible to train our classifier and test it on unseen data:

# Train the classifier
rf = RandomForestClassifier(n_estimators=50, n_jobs=-1) rf.fit(X_train_embedded, y_train)
X_test_embedded = e.transform(X_test)
predictions = rf.predict(X_test_embedded)

We then check the predictions' AUC and the confusion matrix to evaluate the performances:

print('AUC score: ', roc_auc_score(predictions, y_test))
confusion_matrix(predictions, y_test)
The performances are acceptable, but they could be improved.
AUC score: 0.7390774760383386
array([[224, 89],
[ 95, 305]])
lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime