Deep Learning for Natural Language Processing

Introduction to Natural Language Processing

Learning Objectives

By the end of this chapter, you will be able to:

Describe natural language processing and its applications
Explain different text preprocessing techniques
Perform text preprocessing on text corpora
Explain the functioning of Word2Vec and GloVe word embeddings
Generate word embeddings using Word2Vec and GloVe
Use the NLTK, Gensim, and Glove-Python libraries for text preprocessing and generating word embeddings

This chapter aims to equip you with knowledge of the basics of natural language processing and experience with the various text preprocessing techniques used in Deep Learning.

The Basics of Natural Language Processing

To understand what natural language processing is, let's break the term into two:

Natural language is a form of written and spoken communication that has developed organically and naturally.
Processing means analyzing and making sense of input data with computers.

Figure 1.1: Natural language processing

Therefore, natural language processing is the machine-based processing of human communication. It aims to teach machines how to process and understand the language of humans, thereby allowing an easy channel of communication between humans and machines.

For example, the personal voice assistants found in our phones and smart speakers, such as Alexa and Siri, are a result of natural language processing. They have been created in such a manner that they are able to not only understand what we say to them but also to act upon what we say and respond with feedback. Natural language processing algorithms aid these technologies in communicating with humans.

The key thing to consider in the mentioned definition of natural language processing is that the communication needs to occur in the natural language of humans. We've been communicating with machines for decades now by creating programs to perform certain tasks and executing them. However, these programs are written in languages that are not natural languages, because they are not forms of spoken communication and they haven't developed naturally or organically. These languages, such as Java, Python, C, and C++, were created with machines in mind and the consideration always being, "what will the machine be able to understand and process easily?"

While Python is a more user-friendly language and so is easier for humans to learn and be able to write code in, the basic point remains the same – to communicate with a machine, humans must learn a language that the machine is able to understand.

Figure 1.2: Venn diagram for natural language processing

The purpose of natural language processing is the opposite of this. Rather than having humans conform to the ways of a machine and learn how to effectively communicate with them, natural language processing enables machines to conform to humans and learn their way of communication. This makes more sense since the aim of technology is to make our lives easier.

To clarify this with an example, your first ever program was probably a piece of code that asked the machine to print 'hello world'. This was you conforming to the machine and asking it to execute a task in a language that it understood. Asking your voice assistant to say 'hello world' by voicing this command to it, and having it say 'hello world' back to you, is an example of the application of natural language processing, because you are communicating with a machine in your natural language (in this case, English). The machine is conforming to your form of communication, understanding what you're saying, processing what you're asking it to do, and then executing the task.

Importance of natural language processing

The following figure illustrates the various sections of the field of artificial intelligence:

Figure 1.3: Artificial intelligence and some of its subfields

Along with machine learning and deep learning, natural language processing is a subfield of artificial intelligence, and because it deals with natural language, it's actually at the intersection of artificial intelligence and linguistics.

As mentioned, natural language processing is what enables machines to understand the language of humans, thus allowing an efficient channel of communication between the two. However, there is another reason natural language processing is necessary: like machines, machine learning and deep learning models work best with numerical data. Numerical data is hard for humans to produce naturally; imagine us talking in numbers rather than words. So, natural language processing works with textual data and converts it into numerical data, enabling machine learning and deep learning models to be fitted on it. Thus, it exists to bridge the communication gap between humans and machines by taking the spoken and written forms of language from humans and converting them into data that can be understood by machines. Thanks to natural language processing, the machine is able to make sense of, answer questions based on, solve problems using, and communicate in a natural language, among other things.

Applications of Natural Language Processing

The following figure depicts the general application areas of natural language processing:

Figure 1.4: Application areas of natural language processing

Automatic text summarization
This involves processing corpora to provide a summary.
Translation
This entails translation tools that translate text to and from different languages, for example, Google Translate.
Sentiment analysis
This is also known as emotional artificial intelligence or opinion mining, and it is the process of identifying, extracting, and quantifying emotions and affective states from corpora, both written and spoken. Sentiment analysis tools are used to process things such as customer reviews and social media posts to understand emotional responses to and opinions regarding particular things, such as the quality of food at a new restaurant.
Information extraction
This is the process of identifying and extracting important terms from corpora, known as entities. Named entity recognition falls under this category and is a process that will be explained in the next chapter.
Relationship extraction
Relationship extraction involves extracting semantic relationships from corpora. Semantic relationships occur between two or more entities (such as people, organizations, and things) and fall into one of the many semantic categories. For example, if a relationship extraction tool was given a paragraph about Sundar Pichai and how he is the CEO of Google, the tool would be able to produce "Sundar Pichai works for Google" as output, with Sundar Pichai and Google being the two entities, and 'works for' being the semantic category that defines their relationship.
Chatbot
Chatbots are forms of artificial intelligence that are designed to converse with humans via speech and text. The majority of them mimic humans and make it feel as though you are speaking to another human being. Chatbots are being used in the health industry to help people who suffer from depression and anxiety.
Social media analysis
Social media applications such as Twitter and Facebook have hashtags and trends that are tracked and monitored using natural language processing to understand what is being talked about around the world. Additionally, natural language processing aids the process of moderation by filtering out negative, offensive, and inappropriate comments and posts.
Personal voice assistants
Siri, Alexa, Google Assistant, and Cortana are all personal voice assistants that leverage natural language processing techniques to understand and respond to what we say.
Grammar checking
Grammar-checking software automatically checks and corrects your grammar, punctuation, and typing errors.

Text Preprocessing

When answering questions on a comprehension passage, the questions are specific to different parts of the passage, and so while some words and sentences are important to you, others are irrelevant. The trick is to identify key words from the questions and match them to the passage to find the correct answer.

Text preprocessing works in a similar fashion – the machine doesn't need the irrelevant parts of the corpora; it just needs the important words and phrases required to execute the task at hand. Thus, text preprocessing techniques involve prepping the corpora for proper analysis and for the machine learning and deep learning models. Text preprocessing is basically telling the machine what it needs to take into consideration and what it can disregard.

Each corpus requires different text preprocessing techniques depending on the task that needs to be executed, and once you've learned the different preprocessing techniques, you'll understand where to use what and why. The order in which the techniques have been explained is usually the order in which they are performed.

We will be using the NLTK Python library in the following exercises, but feel free to use different libraries while doing the activities. NLTK stands for Natural Language Toolkit and is the simplest and one of the most popular Python libraries for natural language processing, which is why we will be using it to understand the basic concepts of natural language processing.

Note

For further information on NLTK, go to https://www.nltk.org/.

Text Preprocessing Techniques

The following are the most popular text preprocessing techniques in natural language processing:

Lowercasing/uppercasing
Noise removal
Text normalization
Stemming
Lemmatization
Tokenization
Removing stop words

Let's look at each technique one by one.

Lowercasing/Uppercasing

This is one of the simplest and most effective preprocessing techniques, and one that people often forget to use. It either converts all the existing uppercase characters into lowercase ones so that the entire corpus is in lowercase, or it converts all the lowercase characters present in the corpus into uppercase ones so that the entire corpus is in uppercase.

This method is especially useful when the size of the corpus isn't too large and the task involves identifying terms or outputs that could be recognized differently due to the case of the characters, since a machine inherently processes uppercase and lowercase letters as separate entities – 'A' is different from 'a.' This kind of variation in the input capitalization could result in incorrect output or no output at all.

An example of this would be a corpus that contains both 'India' and 'india.' Without applying lowercasing, the machine would recognize these as two separate terms, when in reality they're both different forms of the same word and correspond to the same country. After lowercasing, there would exist only one instance of the term "India," which would be 'india,' simplifying the task of finding all the places where India has been mentioned in the corpus.
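The effect described above can be seen by counting terms before and after lowercasing. The following is a minimal sketch using only Python's standard library; the word list is a made-up example and is assumed to be already split into words.

```python
from collections import Counter

# Hypothetical mini-corpus, already split into words for simplicity.
words = ["India", "india", "is", "large", "INDIA", "is", "diverse"]

# Without lowercasing, each casing variant is counted as its own term.
raw_counts = Counter(words)

# After lowercasing, all variants collapse into a single term.
lower_counts = Counter(word.lower() for word in words)

print(raw_counts["india"])    # occurrences of the exact string 'india' only
print(lower_counts["india"])  # occurrences of every casing variant combined
```

Before lowercasing, a search for 'india' misses the other casings; afterwards, all mentions are counted under one term.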

Note

All exercises and activities will be primarily developed on Jupyter Notebook. You will need to have Python 3.6 and NLTK installed on your system.

Exercises 1 – 6 can be done within the same Jupyter notebook.

Exercise 1: Performing Lowercasing on a Sentence

In this exercise, we will take an input sentence with both uppercase and lowercase characters and convert them all into lowercase characters. The following steps will help you with the solution:

Open cmd or another terminal depending on your operating system.
Navigate to the desired path and use the following command to initiate a Jupyter notebook:
jupyter notebook
Store an input sentence in an 's' variable, as shown:
s = "The cities I like most in India are Mumbai, Bangalore, Dharamsala and Allahabad."
Apply the lower() function to convert the capital letters into lowercase characters and then print the new string, as shown:
s = s.lower()
print(s)
Expected output:
Figure 1.5: Output for lowercasing with mixed casing in a sentence
Create an array of words with capitalized characters, as shown:
words = ['indiA', 'India', 'india', 'iNDia']
Using list comprehension, apply the lower() function on each element of the words array and then print the new array, as follows:
words = [word.lower() for word in words]
print(words)
Expected output:

Figure 1.6: Output for lowercasing with mixed casing of words

Noise Removal

Noise is a very general term and can mean different things with respect to different corpora and different tasks. What is considered noise for one task may be what is considered important for another, and thus this is a very domain-specific preprocessing technique. For example, when analyzing tweets, hashtags might be important to recognize trends and understand what's being spoken about around the globe, but hashtags may not be important when analyzing a news article, and so hashtags would be considered noise in the latter's case.

Noise doesn't include only words, but can also include symbols and punctuation marks (such as <, >, *, ?, and .), HTML markup, numbers, whitespace, stop words, particular terms, text matching particular regular expressions, non-ASCII characters, non-word characters and digits (matched by the regular expression \W|\d+), and parse terms.

Removing noise is crucial so that only the important parts of the corpora are fed into the models, ensuring accurate results. It also helps by bringing words into their root or standard form. Consider the following example:

Figure 1.7: Output for noise removal

After removing all the symbols and punctuation marks, all the instances of sleepy correspond to the one form of the word, enabling more efficient prediction and analysis of the corpus.

Exercise 2: Removing Noise from Words

In this exercise, we will take an input array containing words with noise attached (such as punctuation marks and HTML markup) and convert these words into their clean, noise-free forms. To do this, we will need to make use of Python's regular expression library. This library has several functions that allow us to filter through input data and remove the unnecessary parts, which is exactly what the process of noise removal aims to do.

Note

To learn more about 're,' click on https://docs.python.org/3/library/re.html.

In the same Jupyter notebook, import the regular expression library, as shown:
import re
Create a function called 'clean_words', which will contain methods to remove different types of noise from the words, as follows:
def clean_words(text):
    # remove HTML markup such as <a> and </a>
    text = re.sub(r"(<.*?>)", "", text)
    # replace non-word characters and digits with a space
    text = re.sub(r"(\W|\d+)", " ", text)
    # strip leading and trailing whitespace
    text = text.strip()
    return text
Create an array of raw words with noise, as demonstrated:
raw = ['..sleepy', 'sleepy!!', '#sleepy', '>>>>>sleepy>>>>', '<a>sleepy</a>']
Apply the clean_words() function on the words in the raw array and then print the array of clean words, as shown:
clean = [clean_words(r) for r in raw]
print(clean)
Expected output:

Figure 1.8: Output for noise removal

Text Normalization

This is the process of converting a raw corpus into a canonical and standard form, which is basically to ensure that the textual input is guaranteed to be consistent before it is analyzed, processed, and operated upon.

Examples of text normalization would be mapping an abbreviation to its full form, converting several spellings of the same word to one spelling of the word, and so on.

The following are examples for canonical forms of incorrect spellings and abbreviations:

Figure 1.9: Canonical form for incorrect spellings

Figure 1.10: Canonical form for abbreviations

There is no standard way to go about normalization since it is very dependent on the corpus and the task at hand. The most common way to go about it is with dictionary mapping, which involves manually creating a dictionary that maps all the various forms of one word to that one word, and then replaces each of those words with one standard form of the word.
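Dictionary mapping can be sketched in a few lines of plain Python. The mapping below is a made-up example (it is not taken from the figures above), and the sketch assumes words are separated by whitespace and carry no attached punctuation.

```python
# Hypothetical mapping from variant forms to one canonical form.
# In practice this dictionary is built by hand for the corpus at hand.
canonical = {
    "b4": "before",
    "gr8": "great",
    "tmrw": "tomorrow",
    "tomorow": "tomorrow",
    "ttyl": "talk to you later",
}

def normalize(text):
    """Replace each whitespace-separated word with its canonical form, if one is known."""
    return " ".join(canonical.get(word, word) for word in text.split())

print(normalize("see you tmrw b4 lunch"))
```

Words that are not in the dictionary pass through unchanged, so the mapping can be grown gradually as new variants are found in the corpus.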

Stemming

Stemming is performed on a corpus to reduce words to their stem or root form. The reason for saying "stem or root form" is that the process of stemming doesn't always reduce the word to its root but sometimes just to its canonical form.

The words that undergo stemming are known as inflected words. These words are in a form that is different from the root form of the word, to imply an attribute such as the number or gender. For example, "journalists" is the plural form of "journalist." Thus, stemming would cut off the 's', bringing "journalists" to its root form:

Figure 1.11: Output for stemming

Stemming is beneficial when building search applications due to the fact that when searching for something in particular, you might also want to find instances of that thing even if they're spelled differently. For example, if you're searching for exercises in this book, you might also want 'Exercise' to show up in your search.

However, stemming doesn't always provide the desired stem, since it works by chopping off the ends of the words. It's possible for the stemmer to reduce 'troubling' to 'troubl' instead of 'trouble' and this won't really help in problem solving, and so stemming isn't a method that's used too often. When it is used, Porter's stemming algorithm is the most common algorithm for stemming.

Exercise 3: Performing Stemming on Words

In this exercise, we will take an input array containing various forms of one word and convert these words into their stem forms.

In the same Jupyter notebook, import the nltk and pandas libraries as well as Porter Stemmer, as shown:
import nltk
import pandas as pd
from nltk.stem import PorterStemmer as ps
Create an instance of stemmer, as follows:
stemmer = ps()
Create an array of different forms of the same word, as shown:
words=['annoying', 'annoys', 'annoyed', 'annoy']
Apply the stemmer to each of the words in the words array and store them in a new array, as given:
stems =[stemmer.stem(word = word) for word in words]
Print the raw words and their stems in the form of a DataFrame, as shown:
sdf = pd.DataFrame({'raw word': words,'stem': stems})
sdf
Expected output:

Figure 1.12: Output of stemming

Lemmatization

Lemmatization is a process similar to stemming: its purpose is to reduce a word to its root form. What makes it different is that it doesn't just chop the ends of words off to obtain this root form, but instead follows a process, abides by rules, and often uses WordNet for mappings to return words to their root forms. (WordNet is an English language database that consists of words and their definitions along with synonyms and antonyms. It is considered to be an amalgamation of a dictionary and a thesaurus.) For example, lemmatization is capable of transforming the word 'better' into its root form 'good', since 'better' is just the comparative form of 'good.'

While this quality of lemmatization makes it highly appealing and more accurate when compared with stemming, the drawback is that since lemmatization follows such an organized procedure, it takes a lot more time than stemming does. Hence, lemmatization is not recommended when you're working with a large corpus.
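The difference between chopping off word endings and looking words up can be illustrated with a toy sketch. Neither function below is NLTK's algorithm: the suffix list and the lookup table are hypothetical stand-ins for a real stemmer and for a resource such as WordNet.

```python
# Toy illustration only: neither function is NLTK's algorithm.

SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical suffix list

def toy_stem(word):
    """Chop off the first matching suffix, with no knowledge of the language."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a resource such as WordNet.
LEMMAS = {
    "better": "good",
    "troubling": "trouble",
    "troubled": "trouble",
    "troubles": "trouble",
}

def toy_lemmatize(word):
    """Look the word up in the table; fall back to the word itself."""
    return LEMMAS.get(word, word)

for word in ["troubling", "troubles", "better"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The suffix stripper leaves 'better' untouched and cuts 'troubling' down to a non-word, while the lookup returns real words at the cost of needing a table to consult.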

Exercise 4: Performing Lemmatization on Words

In this exercise, we will take an input array containing various forms of one word and convert these words into their root form.

In the same Jupyter notebook as the previous exercise, import WordNetLemmatizer and download WordNet, as shown:
from nltk.stem import WordNetLemmatizer as wnl
nltk.download('wordnet')
Create an instance of lemmatizer, as follows:
lemmatizer = wnl()
Create an array of different forms of the same word, as demonstrated:
words = ['troubling', 'troubled', 'troubles', 'trouble']
Apply lemmatizer to each of the words in the words array and store them in a new array, as follows. The word parameter provides the lemmatize function with the word it is supposed to lemmatize. The pos parameter is the part of speech you want the lemma to be. 'v' stands for verb and thus the lemmatizer will reduce the word to its closest verb form:
# v denotes verb in "pos"
lemmatized = [lemmatizer.lemmatize(word = word, pos = 'v') for word in words]
Print the raw words and their root forms in the form of a DataFrame, as shown:
ldf = pd.DataFrame({'raw word': words,'lemmatized': lemmatized})
ldf = ldf[['raw word','lemmatized']]
ldf
Expected output:

Figure 1.13: Output of lemmatization

Tokenization

Tokenization is the process of breaking down a corpus into individual tokens. Tokens are most commonly words, so this process usually breaks down a corpus into individual words, but tokens can also include punctuation marks and spaces, among other things.

This technique is one of the most important ones since it is a prerequisite for a lot of applications of natural language processing that we will be learning about in the next chapter, such as part-of-speech (PoS) tagging. These algorithms take tokens as input and can't function with strings or paragraphs of text as input.

Tokenization can be performed to obtain individual words as well as individual sentences as tokens. Let's try both of these out in the following exercises.

Exercise 5: Tokenizing Words

In this exercise, we will take an input sentence and produce individual words as tokens from it.

In the same Jupyter notebook, import nltk:
import nltk
Download the punkt tokenizer models and import word_tokenize from nltk, as shown:
nltk.download('punkt')
from nltk import word_tokenize
Store words in a variable and apply word_tokenize() on it, then print the results, as follows:
s = "hi! my name is john."
tokens = word_tokenize(s)
tokens
Expected output:

Figure 1.14: Output for the tokenization of words

As you can see, even the punctuation marks are tokenized and considered as individual tokens.
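To see roughly what a word tokenizer does, the behavior can be approximated with a regular expression from Python's standard library. This is only a sketch and not NLTK's algorithm; the function name simple_tokenize is made up for illustration, and a real tokenizer handles many more cases (contractions, abbreviations, and so on).

```python
import re

def simple_tokenize(text):
    """Return runs of word characters and single punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("hi! my name is john."))
```

Each punctuation mark comes out as a token of its own, while whitespace is discarded.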

Now let's see how we can tokenize sentences.

Exercise 6: Tokenizing Sentences

In this exercise, we will take an input string containing two sentences and produce individual sentences as tokens from it.

In the same Jupyter notebook, import sent_tokenize, as shown:
from nltk import sent_tokenize
Store two sentences in a variable (the string from the previous exercise actually contained two sentences, so we can use a similar one to see the difference between word and sentence tokenization) and apply sent_tokenize() on it, then print the results, as follows:
s = "hi! my name is shubhangi."
tokens = sent_tokenize(s)
tokens
Expected output:

Figure 1.15: Output for tokenizing sentences

As you can see, the two sentences have formed two individual tokens.

Additional Techniques

There are several ways to perform text preprocessing, including the usage of a variety of Python libraries such as BeautifulSoup to strip away HTML markup. The previous exercises serve the purpose of introducing some techniques to you. Depending on the task at hand, you may need to use just one or two or all of them, including the modifications made to them. For example, at the noise removal stage, you may find it necessary to remove words such as 'the,' 'and,' 'this,' and 'it.' So, you will need to create an array containing these words and pass the corpus through a for loop to store only the words that are not a part of that array, removing the noisy words from the corpus. Another way of doing this is given later in this chapter and is done after tokenization has been performed.
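The for loop approach described above can be sketched as follows. The list of noisy words and the sample corpus are made-up examples, and the sketch assumes the corpus is lowercase and can be split on whitespace.

```python
# Hypothetical list of words treated as noise for this task.
noisy_words = ["the", "and", "this", "it"]

corpus = "the cat sat on the mat and it purred"

# Keep only the words that are not in the noise list.
kept = []
for word in corpus.split():
    if word not in noisy_words:
        kept.append(word)

cleaned = " ".join(kept)
print(cleaned)
```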

Exercise 7: Removing Stop Words

In this exercise, we will take an input sentence and remove the stop words from it.

In the same Jupyter notebook (or a new one in which nltk and word_tokenize have been imported as in Exercise 5), download 'stopwords' using the following line of code:
nltk.download('stopwords')
Store a sentence in a variable, as shown:
s = "the weather is really hot and i want to go for a swim"
Import stopwords and create a set of the English stop words, as follows:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
Tokenize the sentence using word_tokenize, and then store those tokens that do not occur in stop_words in an array. Then, print that array:
tokens = word_tokenize(s)
tokens = [word for word in tokens if word not in stop_words]
print(tokens)
Expected output:

Figure 1.16: Output after removing stopwords

Additionally, you may need to convert numbers into their word forms. This is also a method you can add to the noise removal function. Furthermore, you might need to make use of the contractions library, which serves the purpose of expanding the existing contractions in the text. For example, the contractions library will convert 'you're' into 'you are,' and if this is necessary for your task, then it is recommended to install this library and use it.
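Both ideas can be sketched with small lookup tables. The tables and the expand function below are hypothetical stand-ins written for illustration; they do not show the contractions library's own API, and a real solution would need far larger tables and a way to handle ambiguous contractions.

```python
# Hypothetical, deliberately tiny lookup tables.
CONTRACTIONS = {"you're": "you are", "don't": "do not", "it's": "it is"}
NUMBER_WORDS = {"1": "one", "2": "two", "3": "three"}

def expand(text):
    """Lowercase the text, expand known contractions, and spell out known digits."""
    words = []
    for word in text.lower().split():
        word = CONTRACTIONS.get(word, word)
        word = NUMBER_WORDS.get(word, word)
        words.append(word)
    return " ".join(words)

print(expand("You're late and it's 3 already"))
```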

Text preprocessing techniques go beyond the ones that have been discussed in this chapter and can include anything and everything that is required for a task or a corpus. In some instances, some words may be important, while in others they won't be.

Word Embeddings

As mentioned in the earlier sections of this chapter, natural language processing prepares textual data for machine learning and deep learning models. The models perform most efficiently when provided with numerical data as input, and thus a key role of natural language processing is to transform preprocessed textual data into numerical data, which is a numerical representation of the textual data.

This is what word embeddings are: numerical representations of text in the form of real-valued vectors. Words that have similar meanings map to similar vectors and thus have similar representations. This aids the machine in learning the meaning and context of different words. Since word embeddings are vectors mapping to individual words, word embeddings can only be generated once tokenization has been performed on the corpus.

Figure 1.17: Example for word embeddings

Word embeddings encompass a variety of techniques used to create a learned numerical representation and are the most popular way to represent a document's vocabulary. The beneficial aspect of word embeddings is that they are able to capture contextual, semantic, and syntactic similarities, and the relations of a word with other words, to effectively train the machine to comprehend natural language. This is the main aim of word embeddings – to form clusters of similar vectors that correspond to words with similar meanings.

The reason for using word embeddings is to make machines understand synonyms the same way we do. Consider an example of online restaurant reviews – they consist of adjectives describing food, ambience, and the overall experience. They are either positive or negative, and comprehending which reviews fall into which of these two categories is important. The automatic categorization of these reviews can provide a restaurant with quick insights as to what areas they need to improve on, what people liked about their restaurant, and so on.

There exist a variety of adjectives that can be classified as positive, and the same goes with negative adjectives. Thus, not only does the machine need to be able to differentiate between negative and positive, it also needs to learn and understand that multiple words can relate to the same category because they ultimately mean the same thing. This is where word embeddings are helpful.

Consider the example of restaurant reviews received on a food service application. The following two sentences are from two separate restaurant reviews:

Sentence A – The food here was great.
Sentence B – The food here was good.

The machine needs to be able to comprehend that both these reviews are positive and mean a similar thing, despite the adjective in both sentences being different. This is done by creating word embeddings, because the two words 'good' and 'great' map to two separate but similar real-valued vectors and, thus, can be clustered together.
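One common way to measure how similar two vectors are is cosine similarity. The sketch below uses made-up three-dimensional vectors chosen purely for illustration; real embeddings are learned from a corpus, have many more dimensions, and will not have these values.

```python
import math

# Made-up 3-dimensional vectors; real embeddings are learned and much longer.
embeddings = {
    "good": [0.9, 0.8, 0.1],
    "great": [0.95, 0.75, 0.15],
    "terrible": [-0.8, -0.9, 0.2],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 means they point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["good"], embeddings["great"]))
print(cosine_similarity(embeddings["good"], embeddings["terrible"]))
```

With these illustrative values, 'good' and 'great' score close to 1, while 'good' and 'terrible' score well below that, which is the kind of separation a classifier of reviews can exploit.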

The Generation of Word Embeddings

We've understood what word embeddings are and their importance; now we need to understand how they're generated. The process of transforming words into their real-valued vectors is known as vectorization and is done by word embedding techniques. There are many word embedding techniques available, but in this chapter, we will be discussing the two main ones: Word2Vec and GloVe. Once word embeddings (vectors) have been created, they combine to form a vector space, which is an algebraic model consisting of vectors that follow the rules of vector addition and scalar multiplication. If you don't remember your linear algebra, this might be a good time to quickly review it.

Word2Vec

As mentioned earlier, Word2Vec is one of the word embedding techniques used to generate vectors from words – something you can probably understand from the name itself.

Word2Vec is a shallow neural network – it has only two layers – and thus does not qualify as a deep learning model. The input is a text corpus, which it uses to generate vectors as the output. These vectors are known as feature vectors for the words present in the input corpus. It transforms a corpus into numerical data that can be understood by a deep neural network.

The aim of Word2Vec is to understand the probability of two or more words occurring together and thus to group words with similar meanings together to form a cluster in a vector space. Like any other machine learning or deep learning model, Word2Vec becomes more and more efficient by learning from past data and past occurrences of words. Thus, if provided with enough data and context, it can accurately guess a word's meaning based on past occurrences and context, similar to how we understand language.

For example, we are able to create a connection between the words 'boy' and 'man', and 'girl' and 'woman,' once we have heard and read about them and understood what they mean. Likewise, Word2Vec can also form this connection and generate vectors for these words that lie close together in the same cluster so as to ensure that the machine is aware that these words mean similar things.

Once Word2Vec has been given a corpus, it produces a vocabulary wherein each word has a vector of its own attached to it, which is known as its neural word embedding, and simply put, this neural word embedding is a word written in numbers.

Functioning of Word2Vec

Word2Vec trains a word against words that neighbor the word in the input corpus, and there are two methods of doing so:

Continuous Bag of Words (CBOW)
This method predicts the current word based on the context. Thus, it takes the word's surrounding words as input to produce the word as output, and it chooses this word based on the probability that this is indeed the word that is a part of the sentence.
For example, if the algorithm is provided with the words "the food was" and needs to predict the adjective after it, it is most likely to output the word "good" rather than output the word "delightful," since there would be more instances where the word "good" was used, and thus it has learned that "good" has a higher probability than "delightful." CBOW is said to be faster than skip-gram and has a higher accuracy with more frequent words.

Figure 1.18: The CBOW algorithm

Skip-gram
This method predicts the words surrounding a word by taking the word as input, understanding the meaning of the word, and assigning it to a context. For example, if the algorithm was given the word "delightful," it would have to understand its meaning and learn from past context to predict that the probability that the surrounding words are "the food was" is highest. Skip-gram is said to work best with a small corpus.

Figure 1.19: The skip-gram algorithm

While the two methods seem to work in opposite directions, both predict words based on the context of local (nearby) words; they use a window of context around a word rather than the whole corpus. The size of this window is a configurable parameter.
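The two training setups can be made concrete by listing the input and output pairs each one would see for a given window size. This is a simplified sketch of how such pairs could be formed, not Gensim's implementation; the function names are made up for illustration.

```python
def cbow_pairs(tokens, window):
    """Return (context_words, target_word) pairs: context in, target out."""
    pairs = []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of the target, excluding the target.
        context = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
        pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window):
    """Return (target_word, context_word) pairs: target in, each neighbor out."""
    pairs = []
    for context, target in cbow_pairs(tokens, window):
        for neighbor in context:
            pairs.append((target, neighbor))
    return pairs

tokens = ["the", "food", "was", "good"]
print(cbow_pairs(tokens, window=1))
print(skipgram_pairs(tokens, window=1))
```

Increasing the window adds more distant neighbors to each context, so each word is trained against a wider span of the sentence.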

Which algorithm to use depends on the corpus at hand. CBOW works on the basis of probability and thus chooses the word that has the highest probability of occurring given a specific context. This means it will usually predict only common and frequent words, since those have the highest probabilities, and rare and infrequent words will seldom be produced by CBOW. Skip-gram, on the other hand, predicts context, and thus when given a word, it will take it as a new observation rather than comparing it to an existing word with a similar meaning. Due to this, rare words will not be avoided or overlooked. However, this also means that skip-gram is slower to train than CBOW. Thus, the decision to use either algorithm should be made based on the training data and corpus at hand.

Essentially, both algorithms, and thus the model as a whole, require an intense learning phase where they are trained over thousands and millions of words to better understand context and meaning. Based on this, they are able to assign vectors to words and thus aid the machine in learning and predicting natural language. To understand Word2Vec better, let's do an exercise using Gensim's Word2Vec model.

Gensim is an open source library for unsupervised topic modeling and natural language processing using statistical machine learning. Gensim's Word2Vec algorithm takes as input a sequence of sentences, each given as a list of individual words (tokens).

Also, we can use the min_count parameter. It sets the minimum number of times a word must occur in the corpus for it to be considered important and included when generating word embeddings. In a real-life scenario, when dealing with millions of words, a word that occurs only once or twice may not be important at all and thus can be ignored. However, right now, we are training our model on only three sentences, each with only a handful of words. Thus, min_count is set to 1 since a word is important to us even if it occurs only once.
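The effect of min_count can be sketched without Gensim by counting tokens and applying the threshold by hand. The helper build_vocabulary is a hypothetical name for this illustration, and the sentences are the three used in the exercise that follows, already split into tokens.

```python
from collections import Counter

# The three sentences from the exercise below, pre-tokenized by hand.
sentences = [
    ["Ariana", "Grande", "is", "a", "singer"],
    ["She", "has", "been", "a", "singer", "for", "many", "years"],
    ["Ariana", "is", "a", "great", "singer"],
]


def build_vocabulary(tokenized_sentences, min_count):
    """Keep only the words that occur at least `min_count` times in the corpus."""
    counts = Counter(token for sentence in tokenized_sentences for token in sentence)
    return {word for word, count in counts.items() if count >= min_count}


print(len(build_vocabulary(sentences, min_count=1)))     # 12
print(sorted(build_vocabulary(sentences, min_count=2)))  # ['Ariana', 'a', 'is', 'singer']
print(build_vocabulary(sentences, min_count=5))          # set()
```

With a threshold of 5, no word in this tiny corpus survives, which is why the exercise lowers min_count to 1.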

Exercise 8: Generating Word Embeddings Using Word2Vec

In this exercise, we will be using Gensim's Word2Vec algorithm to generate word embeddings post tokenization.

Note

You will need to have gensim installed on your system for the following exercise. You can use the following command to install it, if it is not already installed:

pip install --upgrade gensim

For further information, click on https://radimrehurek.com/gensim/models/word2vec.html.

The following steps will help you with the solution:

Open a new Jupyter notebook.
Import the Word2Vec model from gensim, and import word_tokenize from nltk, as shown:
from gensim.models import Word2Vec as wtv
from nltk import word_tokenize
Store three strings with some common words into three separate variables, and then tokenize each sentence and store all the tokens in an array, as shown:
s1 = "Ariana Grande is a singer"
s2 = "She has been a singer for many years"
s3 = "Ariana is a great singer"
sentences = [word_tokenize(s1), word_tokenize(s2), word_tokenize(s3)]
You can print the array of sentences to view the tokens.
Train the model, as follows:
model = wtv(sentences, min_count = 1)
Word2Vec's default value for min_count is 5.
Summarize the model, as demonstrated:
print('this is the summary of the model: ')
print(model)
Your output will look something like this:
Figure 1.20: Output for model summary
Vocab = 12 signifies that there are 12 different words present in the sentences that were input to the model.
Let's find out what words are present in the vocabulary by summarizing it, as shown:
words = list(model.wv.key_to_index)  # Gensim 4.0+; older releases use model.wv.vocab
print('this is the vocabulary for our corpus: ')
print(words)
Your output will look something like this:

Figure 1.21: Output for the vocabulary of the corpus

Let's see what the vector (word embedding) for the word 'singer' is:

print("the vector for the word singer: ")

print(model.wv['singer'])

Expected output:

Figure 1.22: Vector for the word 'singer'

Our Word2Vec model has been trained on these three sentences, and thus its vocabulary only includes the words present in these sentences. If we were to find words that are similar to a particular input word from our Word2Vec model, we wouldn't get words that actually make sense since the vocabulary is so small. Consider the following examples:

#lookup top 6 similar words to great

w1 = ["great"]

model.wv.most_similar(positive=w1, topn=6)

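The positive parameter takes the list of words that contribute positively to the similarity; its counterpart, negative, takes words that contribute negatively. It does not restrict the output to positive values.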

The top six similar words to 'great' would be:

Figure 1.23: Word vectors similar to the word 'great'

Similarly, for the word 'singer', it could be as follows:

#lookup top 6 similar words to singer

w1 = ["singer"]

model.wv.most_similar(positive=w1, topn=6)

Figure 1.24: Word vectors similar to the word 'singer'

We know that these words are not actually similar in meaning to our input words at all, and that also shows up in the low similarity score beside them. However, they show up because these are the only words that exist in our vocabulary.
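The score that most_similar reports beside each word is a cosine similarity between word vectors. The following is a minimal sketch of that ranking, using hand-picked three-dimensional vectors that are purely hypothetical (real Word2Vec vectors are learned and have many more dimensions); the function names here are local helpers, not Gensim's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, chosen by hand for illustration only.
toy_vectors = {
    "great": [1.0, 0.0, 0.0],
    "good": [1.0, 1.0, 0.0],
    "years": [0.0, 0.0, 1.0],
}


def most_similar(word, vectors, topn):
    """Rank every other word by its cosine similarity to `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vector))
        for other, vector in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


ranking = most_similar("great", toy_vectors, topn=2)
for other, score in ranking:
    print(f"{other}: {score:.3f}")
```

Note that the ranking always returns something, even when every score is low; with a tiny vocabulary, the "most similar" words are just the least dissimilar ones available.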

Another important parameter of the Gensim Word2Vec model is the size parameter (renamed vector_size in Gensim 4.0). Its default value is 100, and it sets the dimensionality of the word vectors, that is, the size of the hidden layer of the neural network used to train the model. This corresponds to the amount of freedom the training algorithm has. A larger size requires more data but can also lead to higher accuracy.

Note

For more information on Gensim's Word2Vec model, click on

https://rare-technologies.com/word2vec-tutorial/.

GloVe

GloVe, an abbreviation of "global vectors," is a word embedding technique that was developed at Stanford. It is an unsupervised learning algorithm that builds on Word2Vec. While Word2Vec is quite successful in generating word embeddings, the issue with it is that it has a small window through which it focuses on local words and local context to predict words. This means that it is unable to learn from the frequency of words present globally, that is, in the entire corpus. GloVe, as its name suggests, looks at all the words present in a corpus.

While Word2Vec is a predictive model as it learns vectors to improve its predictive abilities, GloVe is a count-based model. What this means is that GloVe learns its vectors by performing dimensionality reduction on a co-occurrence counts matrix. The connections that GloVe is able to make are along the lines of this:

king - man + woman ≈ queen

This means it's able to understand that "king" and "queen" share a relationship that is similar to that between "man" and "woman".
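The arithmetic behind this analogy can be sketched with toy vectors. The two-dimensional vectors below are hypothetical and hand-picked so that one axis loosely stands for "royalty" and the other for "gender"; learned embeddings are not this tidy. The helper analogy is a local illustration whose positive and negative arguments loosely mirror the parameter names of Gensim's most_similar.

```python
import math

# Hypothetical vectors: [royalty, gender]. Chosen by hand, not learned.
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, return the nearest other word."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Exclude the query words themselves from the candidates.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))


answer = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(answer)  # queen
```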

These are complicated terms, so let's understand them one by one. All of these concepts come from statistics and linear algebra, so if you already know what's going on, you can skip to the activity!

When dealing with a corpus, there exist algorithms to construct matrices based on term frequencies. In these matrices, the rows are the words that occur in the corpus, and the columns are either paragraphs or separate documents. The elements of the matrices represent the frequency with which the words occur in the documents. Naturally, with a large corpus, this matrix will be huge. Processing such a large matrix will take a lot of time and memory, so we perform dimensionality reduction. This is the process of reducing the size of the matrix so that it is possible to perform further operations on it.
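As a small illustration of such a matrix, the sketch below builds a term-document count matrix for three short documents. The documents and the function name term_document_matrix are assumptions made for this example; with a real corpus, the matrix would have one row per vocabulary word and would quickly become very large and mostly zeros.

```python
from collections import Counter

# Hypothetical mini-corpus: each string is one document.
documents = [
    "the food was good",
    "the service was good",
    "the food was delightful",
]


def term_document_matrix(docs):
    """Rows are words (sorted), columns are documents, cells are occurrence counts."""
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    matrix = [[doc_counts[word] for doc_counts in counts] for word in vocabulary]
    return vocabulary, matrix


vocabulary, matrix = term_document_matrix(documents)
for word, row in zip(vocabulary, matrix):
    print(f"{word:<12}{row}")
```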

In the case of GloVe, the matrix is known as a co-occurrence counts matrix, which contains information on how many times a word has occurred in a particular context in a corpus. The rows are the words and the columns are the contexts. This matrix is then factorized in order to reduce the dimensions, and the new matrix has a vector representation for each word.
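A co-occurrence counts matrix can be sketched in a similar way. The version below stores raw, unweighted counts in nested dictionaries and uses a symmetric window; the function name cooccurrence_counts and the toy sentences are assumptions for illustration, and actual GloVe implementations may weight pairs by their distance and then factorize the resulting matrix, which this sketch does not do.

```python
from collections import defaultdict


def cooccurrence_counts(tokenized_sentences, window):
    """Count how often each word appears within `window` positions of another word."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in tokenized_sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # a word is not its own context
                    counts[word][tokens[j]] += 1
    return counts


# Hypothetical toy sentences, already tokenized.
sentences = [
    "the food was good".split(),
    "the food was delightful".split(),
]

counts = cooccurrence_counts(sentences, window=1)
print(dict(counts["food"]))  # {'the': 2, 'was': 2}
print(dict(counts["was"]))   # {'food': 2, 'good': 1, 'delightful': 1}
```

Each row of this structure describes a word through the company it keeps across the whole corpus, which is the global information that a purely window-by-window predictive model does not accumulate explicitly.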

GloVe also provides pretrained word vectors that can be used if the semantics match the corpus and task at hand. The following exercise guides you through the process of implementing GloVe in Python. Try it out!

Exercise 9: Generating Word Embeddings Using GloVe

In this exercise, we will be generating word embeddings using Glove-Python.

Note

To install Glove-Python on your platform, go to https://pypi.org/project/glove/#files.

Download the Text8Corpus from http://mattmahoney.net/dc/text8.zip.

Extract the file and store it with your Jupyter notebook.

Import itertools:
import itertools
We need a corpus to generate word embeddings for, and the gensim.models.word2vec module, luckily, provides a class called Text8Corpus that reads the text8 file we downloaded. Import this along with two classes from the Glove-Python library:
from gensim.models.word2vec import Text8Corpus
from glove import Corpus, Glove
Convert the corpus into sentences in the form of a list using itertools:
sentences = list(itertools.islice(Text8Corpus('text8'), None))
Initiate the Corpus() model and fit it on to the sentences:
corpus = Corpus()
corpus.fit(sentences, window=10)
The window parameter controls how many neighboring words are considered.
Now that we have prepared our corpus, we need to train the embeddings. Initiate the Glove() model:
glove = Glove(no_components=100, learning_rate=0.05)
Fit the glove model on the co-occurrence matrix that was generated when the corpus was fitted:
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
The model has been trained!
Add the dictionary of the corpus:
glove.add_dictionary(corpus.dictionary)
Use the following command to see which words are similar to your choice of word based on the word embeddings generated:
glove.most_similar('man')
Expected output:

Figure 1.25: Output of word embeddings for 'man'

You can try this out for several different words to see which words neighbor them and are the most similar to them:

glove.most_similar('queen', number=10)

Expected output:

Figure 1.26: Output of word embeddings for 'queen'

Note

To learn more about GloVe, go to https://nlp.stanford.edu/projects/glove/.

Activity 1: Generating Word Embeddings from a Corpus Using Word2Vec

You have been given the task of training a Word2Vec model on a particular corpus – the Text8Corpus, in this case – to determine which words are similar to each other. The following steps will help you with the solution.

Note

You can find the text corpus file at http://mattmahoney.net/dc/text8.zip.

Upload the text corpus from the link given previously.
Import word2vec from gensim models.
Store the corpus in a variable.
Fit the word2vec model on the corpus.
Find the most similar word to 'man'.
'Father' is to 'girl' as 'x' is to 'boy'. Find the top 3 words for 'x'.
Note
The solution for the activity can be found on page 296.
Expected Outputs:

Figure 1.27: Output for similar word embeddings

Top three words for 'x' could be:

Figure 1.28: Output for top three words for 'x'
