Executing data preprocessing
During the tf-idf discussion, we mentioned that articles often do not help convey the critical information in a document. What about words such as but, for, or by? Indeed, they are ubiquitous in English texts but probably not very useful for our task. This section focuses on four techniques that help us remove noise from the data and reduce the problem’s complexity. These techniques constitute an integral part of the data preprocessing phase, which is crucial before applying more sophisticated methods to the text data. The first technique involves splitting an input text into meaningful chunks, while the second teaches us how to remove low informational value words from the text—the last two focus on mapping each word to a root form.
Tokenizing the input
So far, we have used the term token with the implicit assumption that it always refers to a word (or an n-gram), independently of the underlying NLP task. Tokenization is a more general process where we split textual data into smaller components called tokens. These can be words, phrases, symbols, or other meaningful elements. We perform this task using the nltk toolkit and the word_tokenize method in the following code:
# Import the toolkit.
import nltk
nltk.download('punkt')

# Tokenize the input text.
wordTokens = nltk.word_tokenize("a friend to all is a friend to none")
print(wordTokens)
>> ['a', 'friend', 'to', 'all', 'is', 'a', 'friend', 'to', 'none']
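Note that tokens are not restricted to plain words; punctuation marks and parts of contractions can also appear in the output. The following short sketch (the sample sentence is our own, and the exact splits can vary with the nltk version) illustrates this behavior:

# Tokenize a sentence with a contraction and punctuation.
wordTokens = nltk.word_tokenize("A friend's advice isn't heavy, is it?")
print(wordTokens)
# The tokenizer typically splits contractions and isolates punctuation:
# ['A', 'friend', "'s", 'advice', 'is', "n't", 'heavy', ',', 'is', 'it', '?']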
As words are the tokens of a sentence, sentences are the tokens of a paragraph. For the latter, we can use another method in nltk called sent_tokenize and tokenize a paragraph with three sentences:
# Tokenize the input paragraph.
sentenceTokens = nltk.sent_tokenize("A friend to all is a friend to none. A friend to none is a friend to all. A friend is a friend.")
print(sentenceTokens)
>> ['A friend to all is a friend to none.', 'A friend to none is a friend to all.', 'A friend is a friend.']
This method uses the full stop as a delimiter (that is, a character that separates text strings), and the output in our example is a list with three elements. Notice that using the full stop as a delimiter is not always the best solution. For example, the text can contain abbreviations; thus, more sophisticated solutions are required to handle such cases.
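To see why the abbreviation case matters, the following sketch (with an illustrative sentence of our own) contrasts a naive split on the full stop with sent_tokenize. The exact grouping returned by sent_tokenize depends on the trained punkt model, which is aware of many common abbreviations:

text = "Dr. Doe is a friend to all. A friend to all is a friend to none."

# A naive split on the full stop breaks the abbreviation apart.
print(text.split('.'))
>> ['Dr', ' Doe is a friend to all', ' A friend to all is a friend to none', '']

# The punkt model usually keeps 'Dr.' inside the first sentence.
print(nltk.sent_tokenize(text))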
In the Using token count encoding section, we saw how CountVectorizer used a pattern to split the input into multiple tokens, and we promised to demystify its syntax later in the chapter. So, it’s time to introduce regular expressions (regexp), which can assist with the creation of a tokenizer. These expressions are used to find a string in a document, replace part of the text with something else, or examine the conformity of some textual input. We can devise very sophisticated matching patterns, but mastering this skill demands time and effort. Recall that the unstructured nature of text data means that it requires preprocessing before it can be used for analysis, so regexps are a powerful tool for this task. The following table shows a few typical examples:
Table 2.5 – Various examples of regular expressions
A pattern using square brackets ([]) matches character ranges. For example, the [A-Z] regexp matches Q because it is part of the range of capital letters from A to Z. Conversely, the same character in lowercase is not matched. Quantifiers inside curly braces match repetitions of a pattern. In this case, the [A-Z]{3} regexp matches the sequence BCD. The ^ and $ characters match a pattern at the beginning and end of a string, respectively. For example, the ^[0-9] regexp matches the string 4ever, as it starts with the number four. The + symbol matches one or more repetitions of the pattern, while * matches zero or more repetitions. A dot, ., is a wildcard for any character.
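As a quick illustration, the following sketch applies the same patterns with Python’s built-in re module; the input strings are our own examples and mirror the cases described above:

import re

# Character ranges: match any capital letter.
print(re.findall('[A-Z]', 'aQb'))
>> ['Q']

# Quantifiers: match exactly three consecutive capital letters.
print(re.findall('[A-Z]{3}', 'aBCDe'))
>> ['BCD']

# Anchors: match a digit only at the beginning of the string.
print(re.findall('^[0-9]', '4ever'))
>> ['4']

# Wildcard with repetition: 'a' followed by zero or more characters.
print(re.findall('a.*', 'abc'))
>> ['abc']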
We can go a step further and analyze a more challenging regexp. Most of us have already used web forms that request an email address as input. When the provided email is not valid, an error message is displayed. How does the web form recognize this problem? Obviously, by using a regexp! The general format of an email address contains a local-part, followed by an @ symbol, and then a domain – for example, local-part@domain. Figure 2.7 analyzes a regexp that can match this format:
Figure 2.7 – A regexp for checking the validity of an email address
This expression might seem overwhelming and challenging to understand, but things become apparent if you examine each part separately. Escaping the dot character is necessary to remove its special meaning in the context of a regexp and ensure that it is used literally. Specifically, the . regexp matches any single character, whereas \. matches only a literal full stop.
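A minimal sketch with the re module makes the difference concrete (the input string is illustrative):

import re

# An unescaped dot is a wildcard that matches any single character.
print(re.findall('.', 'a.b'))
>> ['a', '.', 'b']

# An escaped dot matches only the literal full stop.
print(re.findall('\.', 'a.b'))
>> ['.']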
To put things into action, we tokenize a valid and an invalid email address using the regexp from Figure 2.7:
# Create the Regexp tokenizer.
tokenizer = nltk.tokenize.RegexpTokenizer(pattern='^([a-z0-9_\.-]+)@([a-z0-9_\.-]+)\.([a-z\.]{2,6})$')

# Tokenize a valid email address.
tokens = tokenizer.tokenize("john@doe.com")
print(tokens)
>> [('john', 'doe', 'com')]
The output tokens for the invalid email are as follows:
# Tokenize a non-valid email address.
tokens = tokenizer.tokenize("john-AT-doe.com")
print(tokens)
>> []
In the first case, the input, john@doe.com, is parsed as expected, as the address’s local-part, domain, and suffix are provided. Conversely, the second input does not comply with the pattern (it lacks the @ symbol), and consequently, nothing is printed in the output.
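For completeness, here is a hedged sketch of how a web form’s back end might perform the same check with the standard re module, reusing the pattern from Figure 2.7 to obtain a simple valid/invalid answer (the variable names are our own):

import re

emailPattern = '([a-z0-9_\.-]+)@([a-z0-9_\.-]+)\.([a-z\.]{2,6})'

# fullmatch returns a match object for a valid address and None otherwise.
print(re.fullmatch(emailPattern, 'john@doe.com') is not None)
>> True

print(re.fullmatch(emailPattern, 'john-AT-doe.com') is not None)
>> False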
There are many other situations where we need to craft particular regexps for identifying patterns in a document, such as HTML tags, URLs, telephone numbers, and punctuation marks. However, that’s the scope of another book!
Removing stop words
A typical task during the preprocessing phase is removing all the words that presumably add little value, which helps us focus on the most important information in the text. These are called stop words, and there is no universal list for English or any other language. Examples of stop words include determiners (such as another and the), conjunctions (such as but and or), and prepositions (such as before and in). Many online sources provide lists of stop words, and it’s not uncommon to adapt their content according to the problem under study. In the following example, we remove all the stop words from a spam text using a built-in set from the wordcloud module named STOPWORDS. We also include three more words in the set to demonstrate its functionality:
from wordcloud import WordCloud, STOPWORDS

# Read the text from the file spam.txt.
text = open('./data/spam.txt').read()

# Get all stopwords and update with a few others.
sw = set(STOPWORDS)
sw.update(["dear", "virus", "mr"])

# Create and configure the word cloud object.
wc = WordCloud(background_color="white", stopwords=sw, max_words=2000)
Next, we generate the word cloud plot:
# Import matplotlib for plotting, in case it is not already imported.
import matplotlib.pyplot as plt

# Generate the word cloud image from the text.
wordcloud = wc.generate(text.lower())

# Display the generated image.
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
The output is illustrated in Figure 2.8:
Figure 2.8 – A word cloud of the spam email after removing the stop words
Take a moment to compare it with the one in Figure 2.2. For example, the word virus is missing in the new version, as this word was part of the list of stop words.
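Note that STOPWORDS from wordcloud is only one of many available lists; nltk ships its own stop word corpus as well. The following sketch (with an illustrative sentence) shows how we could filter a token list with it; the exact contents of the list depend on the corpus version:

# Download and load the English stop word list from nltk.
from nltk.corpus import stopwords
nltk.download('stopwords')
sw = set(stopwords.words('english'))

# Keep only the tokens that are not stop words.
tokens = nltk.word_tokenize("a friend to all is a friend to none")
filtered = [t for t in tokens if t not in sw]
print(filtered)
# Likely output: ['friend', 'friend', 'none']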
The following section will cover another typical step of the preprocessing phase.
Stemming the words
Removing stop words is, in essence, a way to extract the juice out of the corpus. But we can squeeze the lemon even more! Undoubtedly, every different word form encapsulates a special meaning that adds richness and linguistic diversity to a language. These variances, however, result in data redundancy that can lead to ineffective ML models. In many practical applications, we can map words with the same core meaning to a central word or symbol and thus reduce the input dimension for the model. This reduction can be beneficial to the performance of the ML or NLP application.
This section introduces a technique called stemming that maps a word to its root form. Stemming is the process of cutting off the end (suffix) or the beginning (prefix) of an inflected word and ending up with its stem (the root word). So, for example, the stem of the word plays is play. The most common algorithm in English for performing stemming is the Porter stemmer, which consists of five sets of rules (https://tartarus.org/martin/PorterStemmer/) applied sequentially to the word. For example, one rule is to remove the “-ed” suffix from a word to obtain its stem only if the remainder contains at least one vowel. Based on this rule, the stem of played is play, but the stem for led is still led.
Using the PorterStemmer class from nltk in the following example, we observe that all three forms of play have the same stem:
# Import the Porter stemmer.
from nltk.stem import PorterStemmer

# Create the stemmer.
stemmer = PorterStemmer()

# Stem the words 'playing', 'plays', 'played'.
stemmer.stem('playing')
>> 'play'
Let’s take the next word:
stemmer.stem('plays')
>> 'play'
Now, check played:
stemmer.stem('played')
>> 'play'
Notice that the output of stemming doesn’t need to be a valid word:
# Stem the word 'bravery'.
stemmer.stem('bravery')
>> 'braveri'
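In practice, we rarely stem a single word in isolation; instead, we apply the stemmer to every token of an already tokenized text. A minimal sketch combining word_tokenize with the Porter stemmer follows (the input string simply reuses the words stemmed above):

# Stem every token of the input text.
tokens = nltk.word_tokenize("playing plays played bravery")
stems = [stemmer.stem(t) for t in tokens]
print(stems)
>> ['play', 'play', 'play', 'braveri']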
We can even create our own stemmer using regexps and the RegexpStemmer class from nltk. In the following example, we search for words with the ed suffix:
# Import the regexp stemmer.
from nltk.stem import RegexpStemmer

# Create the stemmer matching words ending with 'ed'.
stemmer = RegexpStemmer('ed')

# Stem the verbs 'playing', 'plays', 'played'.
stemmer.stem('playing')
>> 'playing'
Let’s check the next word:
stemmer.stem('plays')
>> 'plays'
Now, take another word:
stemmer.stem('played')
>> 'play'
The regexp in the preceding code matches played; therefore, the stemmer outputs play. The two other words remain unmatched, and for that reason, no stemming is applied. The following section introduces a more powerful technique to achieve similar functionality.
Lemmatizing the words
Lemmatization is another sophisticated approach for reducing the inflectional forms of a word to a base root. The method performs morphological analysis of the word and obtains its proper lemma (the base form under which it appears in a dictionary). For example, the lemma of goes is go. Lemmatization differs from stemming, as it requires detailed dictionaries to look up a word. For this reason, it’s slower but more accurate than stemming and more complex to implement.
WordNet (https://wordnet.princeton.edu/) is a lexical database for the English language created by Princeton University and is part of the nltk corpus. Superficially, it resembles a thesaurus in that it groups words based on their meanings. WordNet is one way to use lemmatization inside nltk. In the example that follows, we extract the lemmas of three English words:
# Import the WordNet Lemmatizer.
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

# Create the lemmatizer.
lemmatizer = WordNetLemmatizer()

# Lemmatize the verb 'played'.
lemmatizer.lemmatize('played', pos='v')
>> 'play'
Observe that the lemma for played is the same as its stem, play. On the other hand, the lemma and stem differ for led (lead versus led, respectively):
# Lemmatize the verb 'led'.
lemmatizer.lemmatize('led', pos='v')
>> 'lead'
There are also situations where the same lemma corresponds to words with different stems. The following code shows an example of this case, where good and better have the same lemma but not the same stem:
# Lemmatize the adjective 'better'.
lemmatizer.lemmatize('better', pos='a')
>> 'good'
The differences between lemmatization and stemming should be apparent from the previous examples. Remember that we apply one of the two methods to a given dataset, not both simultaneously.
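To make the contrast tangible, the following sketch runs both tools over the examples we just saw. The lemmas are those shown above, while the stems are what the Porter rules are expected to produce, since they perform no dictionary lookup for irregular forms:

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# The stemmer applies suffix-stripping rules, whereas the lemmatizer
# performs a dictionary lookup for the given part of speech.
print(stemmer.stem('led'), lemmatizer.lemmatize('led', pos='v'))
>> led lead

print(stemmer.stem('better'), lemmatizer.lemmatize('better', pos='a'))
>> better good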
The focus of this section has been on four typical techniques for preprocessing text data. In the case of word representations, the way we apply this step impacts the model’s performance. In many similar situations, identifying which technique works better is a matter of experimentation. The following section presents how to implement classifiers using an open source corpus for spam detection.