Chapter 8. Applying Machine Learning to Sentiment Analysis
In this age of the Internet and social media, people's opinions, reviews, and recommendations have become a valuable resource for political science and businesses. Thanks to modern technologies, we are now able to collect and analyze such data very efficiently. In this chapter, we will delve into a subfield of natural language processing (NLP) called sentiment analysis and learn how to use machine learning algorithms to classify documents based on their polarity: the attitude of the writer. The topics that we will cover in the following sections include:
- Cleaning and preparing text data
- Building feature vectors from text documents
- Training a machine learning model to classify positive and negative movie reviews
- Working with large text datasets using out-of-core learning
Obtaining the IMDb movie review dataset
Sentiment analysis, sometimes also called opinion mining, is a popular sub-discipline of the broader field of NLP; it analyzes the polarity of documents. A popular task in sentiment analysis is the classification of documents based on the expressed opinions or emotions of the authors with regard to a particular topic.
In this chapter, we will be working with a large dataset of movie reviews from the Internet Movie Database (IMDb) that has been collected by Maas et al. (A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning Word Vectors for Sentiment Analysis. In the proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics). The movie review dataset consists of 50,000 polar movie reviews that are labeled as either positive or negative; here, positive means that a movie was rated with more than six stars on IMDb, and negative means that a movie was rated with fewer than five stars on IMDb. In the following sections, we will learn how to extract meaningful information from a subset of these movie reviews to build a machine learning model that can predict whether a certain reviewer liked or disliked a movie.
A compressed archive of the movie review dataset (84.1 MB) can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/ as a gzip-compressed tarball archive:
- If you are working with Linux or Mac OS X, you can open a new terminal window, use cd to go into the download directory, and execute tar -zxf aclImdb_v1.tar.gz to decompress the dataset
- If you are working with Windows, you can download a free archiver such as 7-Zip (http://www.7-zip.org) to extract the files from the download archive
Having successfully extracted the dataset, we will now assemble the individual text documents from the decompressed download archive into a single CSV file. In the following code section, we will be reading the movie reviews into a pandas DataFrame object, which can take up to 10 minutes on a standard desktop computer. To visualize the progress and estimated time until completion, we will use the PyPrind (Python Progress Indicator, https://pypi.python.org/pypi/PyPrind/) package that I developed several years ago for such purposes. PyPrind can be installed by executing the command pip install pyprind.
>>> import pyprind
>>> import pandas as pd
>>> import os
>>> pbar = pyprind.ProgBar(50000)
>>> labels = {'pos': 1, 'neg': 0}
>>> df = pd.DataFrame()
>>> for s in ('test', 'train'):
...     for l in ('pos', 'neg'):
...         path = './aclImdb/%s/%s' % (s, l)
...         for file in os.listdir(path):
...             with open(os.path.join(path, file), 'r') as infile:
...                 txt = infile.read()
...             df = df.append([[txt, labels[l]]], ignore_index=True)
...             pbar.update()
>>> df.columns = ['review', 'sentiment']
0%                          100%
[##############################] | ETA[sec]: 0.000
Total time elapsed: 725.001 sec
Executing the preceding code, we first initialized a new progress bar object pbar with 50,000 iterations, which is the number of documents we were going to read in. Using the nested for loops, we iterated over the train and test subdirectories in the main aclImdb directory and read the individual text files from the pos and neg subdirectories, which we eventually appended to the DataFrame df, together with an integer class label (1 = positive and 0 = negative).
Since the class labels in the assembled dataset are sorted, we will now shuffle the DataFrame using the permutation function from the np.random submodule; this will be useful for splitting the dataset into training and test sets in later sections, when we will stream the data directly from our local drive. For our own convenience, we will also store the assembled and shuffled movie review dataset as a CSV file:
>>> import numpy as np
>>> np.random.seed(0)
>>> df = df.reindex(np.random.permutation(df.index))
>>> df.to_csv('./movie_data.csv', index=False)
Since we are going to use this dataset later in this chapter, let us quickly confirm that we successfully saved the data in the right format by reading in the CSV and printing an excerpt of the first three samples:
>>> df = pd.read_csv('./movie_data.csv')
>>> df.head(3)
If you are running the code examples in IPython Notebook, you should now see a table with the first three samples of the dataset, each showing the review text in the review column and the corresponding class label in the sentiment column.
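Optionally, we could also run a quick sanity check of our own here (this is not part of the original workflow) to verify that all 50,000 reviews were written to the CSV file and that the two classes are balanced:

>>> df.shape          # expect 50,000 rows and the two columns review and sentiment
(50000, 2)
>>> df['sentiment'].value_counts()   # expect 25,000 reviews per class label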
Introducing the bag-of-words model
We remember from Chapter 4, Building Good Training Sets – Data Preprocessing, that we have to convert categorical data, such as text or words, into a numerical form before we can pass it on to a machine learning algorithm. In this section, we will introduce the bag-of-words model that allows us to represent text as numerical feature vectors. The idea behind the bag-of-words model is quite simple and can be summarized as follows:
- We create a vocabulary of unique tokens—for example, words—from the entire set of documents.
- We construct a feature vector from each document that contains the counts of how often each word occurs in the particular document.
Since the unique words in each document represent only a small subset of all the words in the bag-of-words vocabulary, the feature vectors will consist of mostly zeros, which is why we call them sparse. Do not worry if this sounds too abstract; in the following subsections, we will walk through the process of creating a simple bag-of-words model step-by-step.
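Before we let scikit-learn do this work for us in the next section, the following short, pure-Python sketch (our own illustration, not library code) makes the two steps above concrete: we build a vocabulary of unique tokens from two example sentences and then turn each sentence into a vector of word counts:

>>> example_docs = ['The sun is shining', 'The weather is sweet']
>>> vocab = sorted({word for doc in example_docs
...                 for word in doc.lower().split()})
>>> vocab
['is', 'shining', 'sun', 'sweet', 'the', 'weather']
>>> [[doc.lower().split().count(word) for word in vocab]
...  for doc in example_docs]
[[1, 1, 1, 0, 1, 0], [1, 0, 0, 1, 1, 1]]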
Transforming words into feature vectors
To construct a bag-of-words model based on the word counts in the respective documents, we can use the CountVectorizer class implemented in scikit-learn. As we will see in the following code section, CountVectorizer takes an array of text data, which can be documents or just sentences, and constructs the bag-of-words model for us:
>>> import numpy as np
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> count = CountVectorizer()
>>> docs = np.array([
...     'The sun is shining',
...     'The weather is sweet',
...     'The sun is shining and the weather is sweet'])
>>> bag = count.fit_transform(docs)
By calling the fit_transform method on CountVectorizer, we just constructed the vocabulary of the bag-of-words model and transformed the following three sentences into sparse feature vectors:
The sun is shining
The weather is sweet
The sun is shining and the weather is sweet
Now let us print the contents of the vocabulary to get a better understanding of the underlying concepts:
>>> print(count.vocabulary_)
{'the': 5, 'shining': 2, 'weather': 6, 'sun': 3, 'is': 1, 'sweet': 4, 'and': 0}
As we can see from executing the preceding command, the vocabulary is stored in a Python dictionary that maps the unique words to integer indices. Next, let us print the feature vectors that we just created:
>>> print(bag.toarray())
[[0 1 1 1 0 1 0]
 [0 1 0 0 1 1 1]
 [1 2 1 1 1 2 1]]
Each index position in the feature vectors shown here corresponds to the integer values that are stored as dictionary items in the CountVectorizer vocabulary. For example, the first feature at index position 0 represents the count of the word and, which only occurs in the last document, and the word is at index position 1 (the second feature in the document vectors) occurs in all three sentences. Those values in the feature vectors are also called the raw term frequencies: tf(t, d), the number of times a term t occurs in a document d.
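As a quick cross-check of these raw term frequencies, we could also count the tokens of the third sentence ourselves, for example with Python's collections.Counter (just an illustration; this is not how CountVectorizer computes the counts internally):

>>> from collections import Counter
>>> tf = Counter('The sun is shining and the weather is sweet'.lower().split())
>>> tf['is'], tf['the'], tf['sun']
(2, 2, 1)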
Note
The sequence of items in the bag-of-words model that we just created is also called the 1-gram or unigram model: each item or token in the vocabulary represents a single word. More generally, a contiguous sequence of items in NLP (words, letters, or symbols) is called an n-gram. The choice of the number n in the n-gram model depends on the particular application; for example, a study by Kanaris et al. revealed that n-grams of size 3 and 4 yield good performance in anti-spam filtering of e-mail messages (Ioannis Kanaris, Konstantinos Kanaris, Ioannis Houvardas, and Efstathios Stamatatos. Words vs Character N-Grams for Anti-Spam Filtering. International Journal on Artificial Intelligence Tools, 16(06):1047–1067, 2007). To summarize the concept of the n-gram representation, the 1-gram and 2-gram representations of our first document "the sun is shining" would be constructed as follows:
- 1-gram: "the", "sun", "is", "shining"
- 2-gram: "the sun", "sun is", "is shining"
The CountVectorizer class in scikit-learn allows us to use different n-gram models via its ngram_range parameter. While a 1-gram representation is used by default, we could switch to a 2-gram representation by initializing a new CountVectorizer instance with ngram_range=(2,2).
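For example, the following small sketch (using the same CountVectorizer class, with our own variable names) shows the vocabulary that a 2-gram model would build from our first example sentence:

>>> count_2gram = CountVectorizer(ngram_range=(2, 2))
>>> bag_2gram = count_2gram.fit_transform(['The sun is shining'])
>>> sorted(count_2gram.vocabulary_)
['is shining', 'sun is', 'the sun']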
Assessing word relevancy via term frequency-inverse document frequency
When we are analyzing text data, we often encounter words that occur across multiple documents from both classes. Those frequently occurring words typically don't contain useful or discriminatory information. In this subsection, we will learn about a useful technique called term frequency-inverse document frequency (tf-idf) that can be used to downweight those frequently occurring words in the feature vectors. The tf-idf can be defined as the product of the term frequency and the inverse document frequency:

tf-idf(t, d) = tf(t, d) \times idf(t, d)

Here the tf(t, d) is the term frequency that we introduced in the previous section, and the inverse document frequency idf(t, d) can be calculated as:

idf(t, d) = \log \frac{n_d}{1 + df(d, t)}

where n_d is the total number of documents, and df(d, t) is the number of documents d that contain the term t. Note that adding the constant 1 to the denominator is optional and serves the purpose of assigning a non-zero value to terms that occur in all training samples; the log is used to ensure that low document frequencies are not given too much weight.
Scikit-learn implements yet another transformer, the TfidfTransformer, which takes the raw term frequencies from CountVectorizer as input and transforms them into tf-idfs:
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> tfidf = TfidfTransformer()
>>> np.set_printoptions(precision=2)
>>> print(tfidf.fit_transform(count.fit_transform(docs)).toarray())
[[ 0.    0.43  0.56  0.56  0.    0.43  0.  ]
 [ 0.    0.43  0.    0.    0.56  0.43  0.56]
 [ 0.4   0.48  0.31  0.31  0.31  0.48  0.31]]
As we saw in the previous subsection, the word is had the largest term frequency in the 3rd document (tied with the word the). However, after transforming the same feature vector into tf-idfs, we see that the word is is now associated with a comparatively small tf-idf (0.48) in document 3 given its high raw count, since it is also contained in documents 1 and 2 and is thus unlikely to contain any useful, discriminatory information.
However, if we'd manually calculated the tf-idfs of the individual terms in our feature vectors, we'd have noticed that the TfidfTransformer calculates the tf-idfs slightly differently compared to the standard textbook equations that we defined earlier. The idf equation that is implemented in scikit-learn is:

idf(t, d) = \log \frac{1 + n_d}{1 + df(d, t)}

The tf-idf equation that is implemented in scikit-learn is as follows:

tf-idf(t, d) = tf(t, d) \times (idf(t, d) + 1)
While it is also more typical to normalize the raw term frequencies before calculating the tf-idfs, the TfidfTransformer normalizes the tf-idfs directly. By default (norm='l2'), scikit-learn's TfidfTransformer applies L2-normalization, which returns a vector of length 1 by dividing an un-normalized feature vector v by its L2-norm:

v_{norm} = \frac{v}{\lVert v \rVert_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}}
To make sure that we understand how TfidfTransformer works, let us walk through an example and calculate the tf-idf of the word is in the 3rd document. The word is has a term frequency of 2 (tf = 2) in document 3, and the document frequency of this term is 3 since the term is occurs in all three documents (df = 3). Thus, using the scikit-learn equations, we can calculate the idf as follows:

idf("is", d3) = \log \frac{1 + 3}{1 + 3} = 0

Now, in order to calculate the tf-idf, we simply need to add 1 to the inverse document frequency and multiply it by the term frequency:

tf-idf("is", d3) = 2 \times (0 + 1) = 2
If we repeated these calculations for all terms in the 3rd document, we'd obtain the following tf-idf vector: [1.69, 2.00, 1.29, 1.29, 1.29, 2.00, 1.29]. However, we notice that the values in this feature vector are different from the values that we obtained from the TfidfTransformer that we used previously. The final step that we are missing in this tf-idf calculation is the L2-normalization, which can be applied as follows:

tf-idf(d3)_{norm} = \frac{[1.69, 2.00, 1.29, 1.29, 1.29, 2.00, 1.29]}{\sqrt{1.69^2 + 2.00^2 + 1.29^2 + 1.29^2 + 1.29^2 + 2.00^2 + 1.29^2}} = [0.40, 0.48, 0.31, 0.31, 0.31, 0.48, 0.31]

tf-idf("is", d3) = 0.48
As we can see, the results now match the results returned by scikit-learn's TfidfTransformer. Since we now understand how tf-idfs are calculated, let us proceed to the next sections and apply those concepts to the movie review dataset.
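If we want to double-check the whole third row at once, the following optional NumPy sketch (our own, using the scikit-learn equations above) recomputes the raw and the L2-normalized tf-idfs of document 3 from its term counts and document frequencies; the values in the comments are rounded to two decimal places:

>>> import numpy as np
>>> tf_d3 = np.array([1., 2., 1., 1., 1., 2., 1.])   # raw term counts of document 3
>>> df_t = np.array([1., 3., 2., 2., 2., 3., 2.])    # document frequency of each term
>>> n_docs = 3
>>> idf = np.log((1 + n_docs) / (1 + df_t))          # scikit-learn's smoothed idf
>>> tfidf_raw = tf_d3 * (idf + 1)
>>> print(tfidf_raw.round(2))                        # roughly [1.69 2. 1.29 1.29 1.29 2. 1.29]
>>> tfidf_l2 = tfidf_raw / np.sqrt(np.sum(tfidf_raw ** 2))
>>> print(tfidf_l2.round(2))                         # roughly [0.4 0.48 0.31 0.31 0.31 0.48 0.31]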
Cleaning text data
In the previous subsections, we learned about the bag-of-words model, term frequencies, and tf-idfs. However, the first important step—before we build our bag-of-words model—is to clean the text data by stripping it of all unwanted characters. To illustrate why this is important, let us display the last 50 characters from the first document in the reshuffled movie review dataset:
>>> df.loc[0, 'review'][-50:]
'is seven.<br /><br />Title (Brazil): Not Available'
As we can see here, the text contains HTML markup as well as punctuation and other non-letter characters. While HTML markup does not contain much useful semantics, punctuation marks can represent useful, additional information in certain NLP contexts. However, for simplicity, we will now remove all punctuation marks but only keep emoticon characters such as ":)" since those are certainly useful for sentiment analysis. To accomplish this task, we will use Python's regular expression (regex) library, re, as shown here:
>>> import re
>>> def preprocessor(text):
...     text = re.sub('<[^>]*>', '', text)
...     emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
...     text = re.sub('[\W]+', ' ', text.lower()) + \
...            ' '.join(emoticons).replace('-', '')
...     return text
Via the first regex, <[^>]*>, in the preceding code section, we tried to remove the entire HTML markup that was contained in the movie reviews. Although many programmers generally advise against the use of regex to parse HTML, this regex should be sufficient to clean this particular dataset. After we removed the HTML markup, we used a slightly more complex regex to find emoticons, which we temporarily stored as emoticons. Next, we removed all non-word characters from the text via the regex [\W]+, converted the text into lowercase characters, and eventually added the temporarily stored emoticons to the end of the processed document string. Additionally, we removed the nose character (-) from the emoticons for consistency.
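To see what the emoticon regex captures on its own, we can run it against a small made-up example string (the input sentence is ours):

>>> re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', 'Great movie :-) but a sad ending :(')
[':-)', ':(']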
Note
Although regular expressions offer an efficient and convenient approach to searching for characters in a string, they also come with a steep learning curve. Unfortunately, an in-depth discussion of regular expressions is beyond the scope of this book. However, you can find a great tutorial on the Google Developers portal at https://developers.google.com/edu/python/regular-expressions or check out the official documentation of Python's re module at https://docs.python.org/3.4/library/re.html.
Although the addition of the emoticon characters to the end of the cleaned document strings may not look like the most elegant approach, the order of the words doesn't matter in our bag-of-words model if our vocabulary only consists of 1-word tokens. But before we talk more about splitting documents into individual terms, words, or tokens, let us confirm that our preprocessor works correctly:
>>> preprocessor(df.loc[0, 'review'][-50:])
'is seven title brazil not available'
>>> preprocessor("</a>This :) is :( a test :-)!")
'this is a test :) :( :)'
Lastly, since we will make use of the cleaned text data over and over again during the next sections, let us now apply our preprocessor function to all the movie reviews in our DataFrame:
>>> df['review'] = df['review'].apply(preprocessor)
Processing documents into tokens
Having successfully prepared the movie review dataset, we now need to think about how to split the text corpora into individual elements. One way to tokenize documents is to split them into individual words by splitting the cleaned document at its whitespace characters:
>>> def tokenizer(text):
...     return text.split()
>>> tokenizer('runners like running and thus they run')
['runners', 'like', 'running', 'and', 'thus', 'they', 'run']
In the context of tokenization, another useful technique is word stemming, which is the process of transforming a word into its root form and allows us to map related words to the same stem. The original stemming algorithm was developed by Martin F. Porter in 1979 and is hence known as the Porter stemmer algorithm (Martin F. Porter. An algorithm for suffix stripping. Program: electronic library and information systems, 14(3):130–137, 1980). The Natural Language Toolkit for Python (NLTK, http://www.nltk.org) implements the Porter stemming algorithm, which we will use in the following code section. In order to install NLTK, you can simply execute pip install nltk.
>>> from nltk.stem.porter import PorterStemmer
>>> porter = PorterStemmer()
>>> def tokenizer_porter(text):
...     return [porter.stem(word) for word in text.split()]
>>> tokenizer_porter('runners like running and thus they run')
['runner', 'like', 'run', 'and', 'thu', 'they', 'run']
Note
Although NLTK is not the focus of this chapter, I highly recommend that you visit the NLTK website as well as the official NLTK book, which is freely available at http://www.nltk.org/book/, if you are interested in more advanced applications in NLP.
Using PorterStemmer from the nltk package, we modified our tokenizer function to reduce words to their root form, which was illustrated by the previous simple example where the word running was stemmed to its root form run.
Note
The Porter stemming algorithm is probably the oldest and simplest stemming algorithm. Other popular stemming algorithms include the newer Snowball stemmer (Porter2 or "English" stemmer) or the Lancaster stemmer (Paice-Husk stemmer), which is faster but also more aggressive than the Porter stemmer. Those alternative stemming algorithms are also available through the NLTK package (http://www.nltk.org/api/nltk.stem.html).
While stemming can create non-real words, such as thu (from thus), as shown in the previous example, a technique called lemmatization aims to obtain the canonical (grammatically correct) forms of individual words, the so-called lemmas. However, lemmatization is computationally more difficult and expensive compared to stemming and, in practice, it has been observed that stemming and lemmatization have little impact on the performance of text classification (Michal Toman, Roman Tesar, and Karel Jezek. Influence of word normalization on text classification. Proceedings of InSciT, pages 354–358, 2006).
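For comparison, here is a minimal lemmatization sketch using NLTK's WordNetLemmatizer; it is not part of the pipeline we build in this chapter and assumes that the WordNet corpus has been downloaded via nltk.download('wordnet'):

>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> lemmatizer.lemmatize('running', pos='v')   # pos='v' treats the word as a verb
'run'
>>> porter.stem('thus'), lemmatizer.lemmatize('thus', pos='v')
('thu', 'thus')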
Before we jump into the next section, where we will train a machine learning model using the bag-of-words model, let us briefly talk about another useful topic called stop-word removal. Stop-words are simply those words that are extremely common in all sorts of texts and likely bear no (or only a little) useful information that can be used to distinguish between different classes of documents. Examples of stop-words are is, and, has, and the like. Removing stop-words can be useful if we are working with raw or normalized term frequencies rather than tf-idfs, which already downweight frequently occurring words.
In order to remove stop-words from the movie reviews, we will use the set of 127 English stop-words that is available from the NLTK library, which can be obtained by calling the nltk.download function:
>>> import nltk
>>> nltk.download('stopwords')
After we have downloaded the stop-words set, we can load and apply the English stop-word set as follows:
>>> from nltk.corpus import stopwords
>>> stop = stopwords.words('english')
>>> [w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:]
...  if w not in stop]
['runner', 'like', 'run', 'run', 'lot']
Transforming words into feature vectors
To construct a bag-of-words model based on the word counts in the respective documents, we can use the CountVectorizer
class implemented in scikit-learn. As we will see in the following code section, the CountVectorizer
class takes an array of text data, which can be documents or just sentences, and constructs the bag-of-words model for us:
>>> import numpy as np >>> from sklearn.feature_extraction.text import CountVectorizer >>> count = CountVectorizer() >>> docs = np.array([ ... 'The sun is shining', ... 'The weather is sweet', ... 'The sun is shining and the weather is sweet']) >>> bag = count.fit_transform(docs)
By calling the fit_transform
method on CountVectorizer
, we just constructed the vocabulary of the bag-of-words model and transformed the following three sentences into sparse feature vectors:
The sun is shining
The weather is sweet
The sun is shining and the weather is sweet
Now let us print the contents of the vocabulary to get a better understanding of the underlying concepts:
>>> print(count.vocabulary_) {'the': 5, 'shining': 2, 'weather': 6, 'sun': 3, 'is': 1, 'sweet': 4, 'and': 0}
As we can see from executing the preceding command, the vocabulary is stored in a Python dictionary, which maps the unique words that are mapped to integer indices. Next let us print the feature vectors that we just created:
>>> print(bag.toarray()) [[0 1 1 1 0 1 0] [0 1 0 0 1 1 1] [1 2 1 1 1 2 1]]
Each index position in the feature vectors shown here corresponds to the integer values that are stored as dictionary items in the CountVectorizer
vocabulary. For example, the first feature at index position 0
resembles the count of the word and
, which only occurs in the last document, and the word is
at index position 1
(the 2nd feature in the document vectors) occurs in all three sentences. Those values in the feature vectors are also called the raw term frequencies: tf (t,d)—the number of times a term t occurs in a document d.
Note
The sequence of items in the bag-of-words model that we just created is also called the 1-gram or unigram model—each item or token in the vocabulary represents a single word. More generally, the contiguous sequences of items in NLP—words, letters, or symbols—is also called an n-gram. The choice of the number n in the n-gram model depends on the particular application; for example, a study by Kanaris et al. revealed that n-grams of size 3 and 4 yield good performances in anti-spam filtering of e-mail messages (Ioannis Kanaris, Konstantinos Kanaris, Ioannis Houvardas, and Efstathios Stamatatos. Words vs Character N-Grams for Anti-Spam Filtering. International Journal on Artificial Intelligence Tools, 16(06):1047–1067, 2007). To summarize the concept of the n-gram representation, the 1-gram and 2-gram representations of our first document "the sun is shining" would be constructed as follows:
- 1-gram: "the", "sun", "is", "shining"
- 2-gram: "the sun", "sun is", "is shining"
The CountVectorizer
class in scikit-learn allows us to use different n-gram models via its ngram_range
parameter. While a 1-gram representation is used by default, we could switch to a 2-gram representation by initializing a new CountVectorizer
instance with ngram_range=(2,2)
.
Assessing word relevancy via term frequency-inverse document frequency
When we are analyzing text data, we often encounter words that occur across multiple documents from both classes. Those frequently occurring words typically don't contain useful or discriminatory information. In this subsection, we will learn about a useful technique called term frequency-inverse document frequency (tf-idf) that can be used to downweight those frequently occurring words in the feature vectors. The tf-idf can be defined as the product of the term frequency and the inverse document frequency:
Here the tf(t, d) is the term frequency that we introduced in the previous section, and the inverse document frequency idf(t, d) can be calculated as:
where is the total number of documents, and df(d, t) is the number of documents d that contain the term t. Note that adding the constant 1 to the denominator is optional and serves the purpose of assigning a non-zero value to terms that occur in all training samples; the log is used to ensure that low document frequencies are not given too much weight.
Scikit-learn implements yet another transformer, the TfidfTransformer
, that takes the raw term frequencies from CountVectorizer
as input and transforms them into tf-idfs:
>>> from sklearn.feature_extraction.text import TfidfTransformer >>> tfidf = TfidfTransformer() >>> np.set_printoptions(precision=2) >>> print(tfidf.fit_transform(count.fit_transform(docs)).toarray()) [[ 0. 0.43 0.56 0.56 0. 0.43 0. ] [ 0. 0.43 0. 0. 0.56 0.43 0.56] [ 0.4 0.48 0.31 0.31 0.31 0.48 0.31]]
As we saw in the previous subsection, the word is
had the largest term frequency in the 3rd document, being the most frequently occurring word. However, after transforming the same feature vector into tf-idfs, we see that the word is
is now associated with a relatively small tf-idf (0.31
) in document 3 since it is also contained in documents 1 and 2 and thus is unlikely to contain any useful, discriminatory information.
However, if we'd manually calculated the tf-idfs of the individual terms in our feature vectors, we'd have noticed that the TfidfTransformer
calculates the tf-idfs slightly differently compared to the standard textbook equations that we defined earlier. The equations for the idf and tf-idf that were implemented in scikit-learn are:
The tf-idf equation that was implemented in scikit-learn is as follows:
While it is also more typical to normalize the raw term frequencies before calculating the tf-idfs, the TfidfTransformer
normalizes the tf-idfs directly. By default (norm='l2'
), scikit-learn's TfidfTransformer
applies the L2-normalization, which returns a vector of length 1 by dividing an un-normalized feature vector v by its L2-norm:
To make sure that we understand how TfidfTransformer
works, let us walk through an example and calculate the tf-idf of the word is
in the 3rd document.
The word is
has a term frequency of 2 (tf = 2) in document 3, and the document frequency of this term is 3 since the term is
occurs in all three documents (df = 3). Thus, we can calculate the idf as follows:
Now in order to calculate the tf-idf, we simply need to add 1 to the inverse document frequency and multiply it by the term frequency:
If we repeated these calculations for all terms in the 3rd document, we'd obtain the following tf-idf vectors: [1.69, 2.00, 1.29, 1.29, 1.29, 2.00, and 1.29]. However, we notice that the values in this feature vector are different from the values that we obtained from the TfidfTransformer
that we used previously. The final step that we are missing in this tf-idf calculation is the L2-normalization, which can be applied as follows:
As we can see, the results now match the results returned by scikit-learn's TfidfTransformer
. Since we now understand how tf-idfs are calculated, let us proceed to the next sections and apply those concepts to the movie review dataset.
Cleaning text data
In the previous subsections, we learned about the bag-of-words model, term frequencies, and tf-idfs. However, the first important step—before we build our bag-of-words model—is to clean the text data by stripping it of all unwanted characters. To illustrate why this is important, let us display the last 50 characters from the first document in the reshuffled movie review dataset:
>>> df.loc[0, 'review'][-50:] 'is seven.<br /><br />Title (Brazil): Not Available'
As we can see here, the text contains HTML markup as well as punctuation and other non-letter characters. While HTML markup does not contain much useful semantics, punctuation marks can represent useful, additional information in certain NLP contexts. However, for simplicity, we will now remove all punctuation marks but only keep emoticon characters such as ":)" since those are certainly useful for sentiment analysis. To accomplish this task, we will use Python's regular expression (regex) library, re, as shown here:
>>> import re >>> def preprocessor(text): ... text = re.sub('<[^>]*>', '', text) ... emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text) ... text = re.sub('[\W]+', ' ', text.lower()) + \ '.join(emoticons).replace('-', '') ... return text
Via the first regex <[^>]*>
in the preceding code section, we tried to remove the entire HTML markup that was contained in the movie reviews. Although many programmers generally advise against the use of regex to parse HTML, this regex should be sufficient to clean this particular dataset. After we removed the HTML markup, we used a slightly more complex regex to find emoticons, which we temporarily stored as emoticons
. Next we removed all non-word characters from the text via the regex [\W]+
, converted the text into lowercase characters, and eventually added the temporarily stored emoticons
to the end of the processed document string. Additionally, we removed the nose character (-
) from the emoticons for consistency.
Note
Although regular expressions offer an efficient and convenient approach to searching for characters in a string, they also come with a steep learning curve. Unfortunately, an in-depth discussion of regular expressions is beyond the scope of this book. However, you can find a great tutorial on the Google Developers portal at https://developers.google.com/edu/python/regular-expressions or check out the official documentation of Python's re module at https://docs.python.org/3.4/library/re.html.
Although the addition of the emoticon characters to the end of the cleaned document strings may not look like the most elegant approach, the order of the words doesn't matter in our bag-of-words model if our vocabulary only consists of 1-word tokens. But before we talk more about splitting documents into individual terms, words, or tokens, let us confirm that our preprocessor works correctly:
>>> preprocessor(df.loc[0, 'review'][-50:]) 'is seven title brazil not available' >>> preprocessor("</a>This :) is :( a test :-)!") 'this is a test :) :( :)'
Lastly, since we will make use of the cleaned text data over and over again during the next sections, let us now apply our preprocessor
function to all movie reviews in our DataFrame
:
>>> df['review'] = df['review'].apply(preprocessor)
Processing documents into tokens
Having successfully prepared the movie review dataset, we now need to think about how to split the text corpora into individual elements. One way to tokenize documents is to split them into individual words by splitting the cleaned document at its whitespace characters:
>>> def tokenizer(text): ... return text.split() >>> tokenizer('runners like running and thus they run') ['runners', 'like', 'running', 'and', 'thus', 'they', 'run']
In the context of tokenization, another useful technique is word stemming, which is the process of transforming a word into its root form that allows us to map related words to the same stem. The original stemming algorithm was developed by Martin F. Porter in 1979 and is hence known as the Porter stemmer algorithm (Martin F. Porter. An algorithm for suffix stripping. Program: electronic library and information systems, 14(3):130–137, 1980). The Natural Language Toolkit for Python (NLTK, http://www.nltk.org) implements the Porter stemming algorithm, which we will use in the following code section. In order to install the NLTK, you can simply execute pip install nltk
.
>>> from nltk.stem.porter import PorterStemmer >>> porter = PorterStemmer() >>> def tokenizer_porter(text): ... return [porter.stem(word) for word in text.split()] >>> tokenizer_porter('runners like running and thus they run') ['runner', 'like', 'run', 'and', 'thu', 'they', 'run']
Note
Although NLTK is not the focus of the chapter, I highly recommend you to visit the NLTK website as well as the official NLTK book, which is freely available at http://www.nltk.org/book/, if you are interested in more advanced applications in NLP.
Using PorterStemmer
from the nltk
package, we modified our tokenizer
function to reduce words to their root form, which was illustrated by the previous simple example where the word running
was stemmed to its root form run
.
Note
The Porter stemming algorithm is probably the oldest and simplest stemming algorithm. Other popular stemming algorithms include the newer Snowball stemmer (Porter2 or "English" stemmer) or the Lancaster stemmer (Paice-Husk stemmer), which is faster but also more aggressive than the Porter stemmer. Those alternative stemming algorithms are also available through the NLTK package (http://www.nltk.org/api/nltk.stem.html).
While stemming can create non-real words, such as thu
, (from thus
) as shown in the previous example, a technique called lemmatization aims to obtain the canonical (grammatically correct) forms of individual words—the so-called lemmas. However, lemmatization is computationally more difficult and expensive compared to stemming and, in practice, it has been observed that stemming and lemmatization have little impact on the performance of text classification (Michal Toman, Roman Tesar, and Karel Jezek. Influence of word normalization on text classification. Proceedings of InSciT, pages 354–358, 2006).
Before we jump into the next section where will train a machine learning model using the bag-of-words model, let us briefly talk about another useful topic called stop-word removal. Stop-words are simply those words that are extremely common in all sorts of texts and likely bear no (or only little) useful information that can be used to distinguish between different classes of documents. Examples of stop-words are is, and, has, and the like. Removing stop-words can be useful if we are working with raw or normalized term frequencies rather than tf-idfs, which are already downweighting frequently occurring words.
In order to remove stop-words from the movie reviews, we will use the set of 127 English stop-words that is available from the NLTK library, which can be obtained by calling the nltk.download
function:
>>> import nltk >>> nltk.download('stopwords')
After we have downloaded the stop-words set, we can load and apply the English stop-word set as follows:
>>> from nltk.corpus import stopwords >>> stop = stopwords.words('english') >>> [w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:] if w not in stop] ['runner', 'like', 'run', 'run', 'lot']
Assessing word relevancy via term frequency-inverse document frequency
When we are analyzing text data, we often encounter words that occur across multiple documents from both classes. Those frequently occurring words typically don't contain useful or discriminatory information. In this subsection, we will learn about a useful technique called term frequency-inverse document frequency (tf-idf) that can be used to downweight those frequently occurring words in the feature vectors. The tf-idf can be defined as the product of the term frequency and the inverse document frequency:
Here the tf(t, d) is the term frequency that we introduced in the previous section, and the inverse document frequency idf(t, d) can be calculated as:
where is the total number of documents, and df(d, t) is the number of documents d that contain the term t. Note that adding the constant 1 to the denominator is optional and serves the purpose of assigning a non-zero value to terms that occur in all training samples; the log is used to ensure that low document frequencies are not given too much weight.
Scikit-learn implements yet another transformer, the TfidfTransformer
, that takes the raw term frequencies from CountVectorizer
as input and transforms them into tf-idfs:
>>> from sklearn.feature_extraction.text import TfidfTransformer >>> tfidf = TfidfTransformer() >>> np.set_printoptions(precision=2) >>> print(tfidf.fit_transform(count.fit_transform(docs)).toarray()) [[ 0. 0.43 0.56 0.56 0. 0.43 0. ] [ 0. 0.43 0. 0. 0.56 0.43 0.56] [ 0.4 0.48 0.31 0.31 0.31 0.48 0.31]]
As we saw in the previous subsection, the word is
had the largest term frequency in the 3rd document, being the most frequently occurring word. However, after transforming the same feature vector into tf-idfs, we see that the word is
is now associated with a relatively small tf-idf (0.31
) in document 3 since it is also contained in documents 1 and 2 and thus is unlikely to contain any useful, discriminatory information.
However, if we'd manually calculated the tf-idfs of the individual terms in our feature vectors, we'd have noticed that the TfidfTransformer
calculates the tf-idfs slightly differently compared to the standard textbook equations that we defined earlier. The equations for the idf and tf-idf that were implemented in scikit-learn are:
The tf-idf equation that was implemented in scikit-learn is as follows:
While it is also more typical to normalize the raw term frequencies before calculating the tf-idfs, the TfidfTransformer
normalizes the tf-idfs directly. By default (norm='l2'
), scikit-learn's TfidfTransformer
applies the L2-normalization, which returns a vector of length 1 by dividing an un-normalized feature vector v by its L2-norm:
To make sure that we understand how TfidfTransformer
works, let us walk through an example and calculate the tf-idf of the word is
in the 3rd document.
The word is
has a term frequency of 2 (tf = 2) in document 3, and the document frequency of this term is 3 since the term is
occurs in all three documents (df = 3). Thus, we can calculate the idf as follows:
Now in order to calculate the tf-idf, we simply need to add 1 to the inverse document frequency and multiply it by the term frequency:
If we repeated these calculations for all terms in the 3rd document, we'd obtain the following tf-idf vectors: [1.69, 2.00, 1.29, 1.29, 1.29, 2.00, and 1.29]. However, we notice that the values in this feature vector are different from the values that we obtained from the TfidfTransformer
that we used previously. The final step that we are missing in this tf-idf calculation is the L2-normalization, which can be applied as follows:
As we can see, the results now match the results returned by scikit-learn's TfidfTransformer
. Since we now understand how tf-idfs are calculated, let us proceed to the next sections and apply those concepts to the movie review dataset.
Cleaning text data
In the previous subsections, we learned about the bag-of-words model, term frequencies, and tf-idfs. However, the first important step—before we build our bag-of-words model—is to clean the text data by stripping it of all unwanted characters. To illustrate why this is important, let us display the last 50 characters from the first document in the reshuffled movie review dataset:
>>> df.loc[0, 'review'][-50:] 'is seven.<br /><br />Title (Brazil): Not Available'
As we can see here, the text contains HTML markup as well as punctuation and other non-letter characters. While HTML markup does not contain much useful semantics, punctuation marks can represent useful, additional information in certain NLP contexts. However, for simplicity, we will now remove all punctuation marks but only keep emoticon characters such as ":)" since those are certainly useful for sentiment analysis. To accomplish this task, we will use Python's regular expression (regex) library, re, as shown here:
>>> import re >>> def preprocessor(text): ... text = re.sub('<[^>]*>', '', text) ... emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text) ... text = re.sub('[\W]+', ' ', text.lower()) + \ '.join(emoticons).replace('-', '') ... return text
Via the first regex <[^>]*>
in the preceding code section, we tried to remove the entire HTML markup that was contained in the movie reviews. Although many programmers generally advise against the use of regex to parse HTML, this regex should be sufficient to clean this particular dataset. After we removed the HTML markup, we used a slightly more complex regex to find emoticons, which we temporarily stored as emoticons
. Next we removed all non-word characters from the text via the regex [\W]+
, converted the text into lowercase characters, and eventually added the temporarily stored emoticons
to the end of the processed document string. Additionally, we removed the nose character (-
) from the emoticons for consistency.
Note
Although regular expressions offer an efficient and convenient approach to searching for characters in a string, they also come with a steep learning curve. Unfortunately, an in-depth discussion of regular expressions is beyond the scope of this book. However, you can find a great tutorial on the Google Developers portal at https://developers.google.com/edu/python/regular-expressions or check out the official documentation of Python's re module at https://docs.python.org/3.4/library/re.html.
Although the addition of the emoticon characters to the end of the cleaned document strings may not look like the most elegant approach, the order of the words doesn't matter in our bag-of-words model if our vocabulary only consists of 1-word tokens. But before we talk more about splitting documents into individual terms, words, or tokens, let us confirm that our preprocessor works correctly:
>>> preprocessor(df.loc[0, 'review'][-50:]) 'is seven title brazil not available' >>> preprocessor("</a>This :) is :( a test :-)!") 'this is a test :) :( :)'
Lastly, since we will make use of the cleaned text data over and over again during the next sections, let us now apply our preprocessor
function to all movie reviews in our DataFrame
:
>>> df['review'] = df['review'].apply(preprocessor)
Processing documents into tokens
Having successfully prepared the movie review dataset, we now need to think about how to split the text corpora into individual elements. One way to tokenize documents is to split them into individual words by splitting the cleaned document at its whitespace characters:
>>> def tokenizer(text): ... return text.split() >>> tokenizer('runners like running and thus they run') ['runners', 'like', 'running', 'and', 'thus', 'they', 'run']
In the context of tokenization, another useful technique is word stemming, which is the process of transforming a word into its root form that allows us to map related words to the same stem. The original stemming algorithm was developed by Martin F. Porter in 1979 and is hence known as the Porter stemmer algorithm (Martin F. Porter. An algorithm for suffix stripping. Program: electronic library and information systems, 14(3):130–137, 1980). The Natural Language Toolkit for Python (NLTK, http://www.nltk.org) implements the Porter stemming algorithm, which we will use in the following code section. In order to install the NLTK, you can simply execute pip install nltk
.
>>> from nltk.stem.porter import PorterStemmer >>> porter = PorterStemmer() >>> def tokenizer_porter(text): ... return [porter.stem(word) for word in text.split()] >>> tokenizer_porter('runners like running and thus they run') ['runner', 'like', 'run', 'and', 'thu', 'they', 'run']
Note
Although NLTK is not the focus of the chapter, I highly recommend you to visit the NLTK website as well as the official NLTK book, which is freely available at http://www.nltk.org/book/, if you are interested in more advanced applications in NLP.
Using PorterStemmer
from the nltk
package, we modified our tokenizer
function to reduce words to their root form, which was illustrated by the previous simple example where the word running
was stemmed to its root form run
.
Note
The Porter stemming algorithm is probably the oldest and simplest stemming algorithm. Other popular stemming algorithms include the newer Snowball stemmer (Porter2 or "English" stemmer) or the Lancaster stemmer (Paice-Husk stemmer), which is faster but also more aggressive than the Porter stemmer. Those alternative stemming algorithms are also available through the NLTK package (http://www.nltk.org/api/nltk.stem.html).
While stemming can create non-real words, such as thu
, (from thus
) as shown in the previous example, a technique called lemmatization aims to obtain the canonical (grammatically correct) forms of individual words—the so-called lemmas. However, lemmatization is computationally more difficult and expensive compared to stemming and, in practice, it has been observed that stemming and lemmatization have little impact on the performance of text classification (Michal Toman, Roman Tesar, and Karel Jezek. Influence of word normalization on text classification. Proceedings of InSciT, pages 354–358, 2006).
Before we jump into the next section where will train a machine learning model using the bag-of-words model, let us briefly talk about another useful topic called stop-word removal. Stop-words are simply those words that are extremely common in all sorts of texts and likely bear no (or only little) useful information that can be used to distinguish between different classes of documents. Examples of stop-words are is, and, has, and the like. Removing stop-words can be useful if we are working with raw or normalized term frequencies rather than tf-idfs, which are already downweighting frequently occurring words.
In order to remove stop-words from the movie reviews, we will use the set of 127 English stop-words that is available from the NLTK library, which can be obtained by calling the nltk.download
function:
>>> import nltk >>> nltk.download('stopwords')
After we have downloaded the stop-words set, we can load and apply the English stop-word set as follows:
>>> from nltk.corpus import stopwords >>> stop = stopwords.words('english') >>> [w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:] if w not in stop] ['runner', 'like', 'run', 'run', 'lot']
Cleaning text data
In the previous subsections, we learned about the bag-of-words model, term frequencies, and tf-idfs. However, the first important step—before we build our bag-of-words model—is to clean the text data by stripping it of all unwanted characters. To illustrate why this is important, let us display the last 50 characters from the first document in the reshuffled movie review dataset:
>>> df.loc[0, 'review'][-50:] 'is seven.<br /><br />Title (Brazil): Not Available'
As we can see here, the text contains HTML markup as well as punctuation and other non-letter characters. While HTML markup does not contain much useful semantics, punctuation marks can represent useful, additional information in certain NLP contexts. However, for simplicity, we will now remove all punctuation marks but only keep emoticon characters such as ":)" since those are certainly useful for sentiment analysis. To accomplish this task, we will use Python's regular expression (regex) library, re, as shown here:
>>> import re >>> def preprocessor(text): ... text = re.sub('<[^>]*>', '', text) ... emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text) ... text = re.sub('[\W]+', ' ', text.lower()) + \ '.join(emoticons).replace('-', '') ... return text
Via the first regex <[^>]*>
in the preceding code section, we tried to remove the entire HTML markup that was contained in the movie reviews. Although many programmers generally advise against the use of regex to parse HTML, this regex should be sufficient to clean this particular dataset. After we removed the HTML markup, we used a slightly more complex regex to find emoticons, which we temporarily stored as emoticons
. Next we removed all non-word characters from the text via the regex [\W]+
, converted the text into lowercase characters, and eventually added the temporarily stored emoticons
to the end of the processed document string. Additionally, we removed the nose character (-
) from the emoticons for consistency.
Note
Although regular expressions offer an efficient and convenient approach to searching for characters in a string, they also come with a steep learning curve. Unfortunately, an in-depth discussion of regular expressions is beyond the scope of this book. However, you can find a great tutorial on the Google Developers portal at https://developers.google.com/edu/python/regular-expressions or check out the official documentation of Python's re module at https://docs.python.org/3.4/library/re.html.
Although the addition of the emoticon characters to the end of the cleaned document strings may not look like the most elegant approach, the order of the words doesn't matter in our bag-of-words model if our vocabulary only consists of 1-word tokens. But before we talk more about splitting documents into individual terms, words, or tokens, let us confirm that our preprocessor works correctly:
>>> preprocessor(df.loc[0, 'review'][-50:]) 'is seven title brazil not available' >>> preprocessor("</a>This :) is :( a test :-)!") 'this is a test :) :( :)'
Lastly, since we will make use of the cleaned text data over and over again during the next sections, let us now apply our preprocessor
function to all movie reviews in our DataFrame
:
>>> df['review'] = df['review'].apply(preprocessor)
Processing documents into tokens
Having successfully prepared the movie review dataset, we now need to think about how to split the text corpora into individual elements. One way to tokenize documents is to split them into individual words by splitting the cleaned document at its whitespace characters:
>>> def tokenizer(text): ... return text.split() >>> tokenizer('runners like running and thus they run') ['runners', 'like', 'running', 'and', 'thus', 'they', 'run']
In the context of tokenization, another useful technique is word stemming, which is the process of transforming a word into its root form that allows us to map related words to the same stem. The original stemming algorithm was developed by Martin F. Porter in 1979 and is hence known as the Porter stemmer algorithm (Martin F. Porter. An algorithm for suffix stripping. Program: electronic library and information systems, 14(3):130–137, 1980). The Natural Language Toolkit for Python (NLTK, http://www.nltk.org) implements the Porter stemming algorithm, which we will use in the following code section. In order to install the NLTK, you can simply execute pip install nltk
.
>>> from nltk.stem.porter import PorterStemmer >>> porter = PorterStemmer() >>> def tokenizer_porter(text): ... return [porter.stem(word) for word in text.split()] >>> tokenizer_porter('runners like running and thus they run') ['runner', 'like', 'run', 'and', 'thu', 'they', 'run']
Note
Although NLTK is not the focus of the chapter, I highly recommend you to visit the NLTK website as well as the official NLTK book, which is freely available at http://www.nltk.org/book/, if you are interested in more advanced applications in NLP.
Using PorterStemmer
from the nltk
package, we modified our tokenizer
function to reduce words to their root form, which was illustrated by the previous simple example where the word running
was stemmed to its root form run
.
Note
The Porter stemming algorithm is probably the oldest and simplest stemming algorithm. Other popular stemming algorithms include the newer Snowball stemmer (Porter2 or "English" stemmer) or the Lancaster stemmer (Paice-Husk stemmer), which is faster but also more aggressive than the Porter stemmer. Those alternative stemming algorithms are also available through the NLTK package (http://www.nltk.org/api/nltk.stem.html).
While stemming can create non-real words, such as thu
, (from thus
) as shown in the previous example, a technique called lemmatization aims to obtain the canonical (grammatically correct) forms of individual words—the so-called lemmas. However, lemmatization is computationally more difficult and expensive compared to stemming and, in practice, it has been observed that stemming and lemmatization have little impact on the performance of text classification (Michal Toman, Roman Tesar, and Karel Jezek. Influence of word normalization on text classification. Proceedings of InSciT, pages 354–358, 2006).
Before we jump into the next section where will train a machine learning model using the bag-of-words model, let us briefly talk about another useful topic called stop-word removal. Stop-words are simply those words that are extremely common in all sorts of texts and likely bear no (or only little) useful information that can be used to distinguish between different classes of documents. Examples of stop-words are is, and, has, and the like. Removing stop-words can be useful if we are working with raw or normalized term frequencies rather than tf-idfs, which are already downweighting frequently occurring words.
In order to remove stop-words from the movie reviews, we will use the set of 127 English stop-words that is available from the NLTK library, which can be obtained by calling the nltk.download
function:
>>> import nltk >>> nltk.download('stopwords')
After we have downloaded the stop-words set, we can load and apply the English stop-word set as follows:
>>> from nltk.corpus import stopwords >>> stop = stopwords.words('english') >>> [w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:] if w not in stop] ['runner', 'like', 'run', 'run', 'lot']
Processing documents into tokens
Having successfully prepared the movie review dataset, we now need to think about how to split the text corpora into individual elements. One way to tokenize documents is to split them into individual words by splitting the cleaned document at its whitespace characters:
>>> def tokenizer(text): ... return text.split() >>> tokenizer('runners like running and thus they run') ['runners', 'like', 'running', 'and', 'thus', 'they', 'run']
In the context of tokenization, another useful technique is word stemming, which is the process of transforming a word into its root form that allows us to map related words to the same stem. The original stemming algorithm was developed by Martin F. Porter in 1979 and is hence known as the Porter stemmer algorithm (Martin F. Porter. An algorithm for suffix stripping. Program: electronic library and information systems, 14(3):130–137, 1980). The Natural Language Toolkit for Python (NLTK, http://www.nltk.org) implements the Porter stemming algorithm, which we will use in the following code section. In order to install the NLTK, you can simply execute pip install nltk
.
>>> from nltk.stem.porter import PorterStemmer >>> porter = PorterStemmer() >>> def tokenizer_porter(text): ... return [porter.stem(word) for word in text.split()] >>> tokenizer_porter('runners like running and thus they run') ['runner', 'like', 'run', 'and', 'thu', 'they', 'run']
Note
Although NLTK is not the focus of the chapter, I highly recommend you to visit the NLTK website as well as the official NLTK book, which is freely available at http://www.nltk.org/book/, if you are interested in more advanced applications in NLP.
Using PorterStemmer
from the nltk
package, we modified our tokenizer
function to reduce words to their root form, which was illustrated by the previous simple example where the word running
was stemmed to its root form run
.
Note
The Porter stemming algorithm is probably the oldest and simplest stemming algorithm. Other popular stemming algorithms include the newer Snowball stemmer (Porter2 or "English" stemmer) or the Lancaster stemmer (Paice-Husk stemmer), which is faster but also more aggressive than the Porter stemmer. Those alternative stemming algorithms are also available through the NLTK package (http://www.nltk.org/api/nltk.stem.html).
While stemming can create non-real words, such as thu
, (from thus
) as shown in the previous example, a technique called lemmatization aims to obtain the canonical (grammatically correct) forms of individual words—the so-called lemmas. However, lemmatization is computationally more difficult and expensive compared to stemming and, in practice, it has been observed that stemming and lemmatization have little impact on the performance of text classification (Michal Toman, Roman Tesar, and Karel Jezek. Influence of word normalization on text classification. Proceedings of InSciT, pages 354–358, 2006).
Before we jump into the next section, where we will train a machine learning model using the bag-of-words model, let us briefly talk about another useful topic called stop-word removal. Stop-words are simply those words that are extremely common in all sorts of texts and likely bear no (or only little) useful information that can be used to distinguish between different classes of documents. Examples of stop-words are is, and, has, and the like. Removing stop-words can be useful if we are working with raw or normalized term frequencies rather than tf-idfs, which already downweight frequently occurring words.
In order to remove stop-words from the movie reviews, we will use the set of 127 English stop-words that is available from the NLTK library, which can be obtained by calling the nltk.download function:
>>> import nltk
>>> nltk.download('stopwords')
After we have downloaded the stop-words, we can load and apply the English stop-word set as follows:
>>> from nltk.corpus import stopwords
>>> stop = stopwords.words('english')
>>> [w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:]
...  if w not in stop]
['runner', 'like', 'run', 'run', 'lot']
Training a logistic regression model for document classification
In this section, we will train a logistic regression model to classify the movie reviews into positive and negative reviews. First, we will divide the DataFrame of cleaned text documents into 25,000 documents for training and 25,000 documents for testing:
>>> # note that .loc slicing is inclusive of both endpoints, so we use
>>> # :24999 to obtain exactly 25,000 training documents that do not
>>> # overlap with the test split
>>> X_train = df.loc[:24999, 'review'].values
>>> y_train = df.loc[:24999, 'sentiment'].values
>>> X_test = df.loc[25000:, 'review'].values
>>> y_test = df.loc[25000:, 'sentiment'].values
Next, we will use a GridSearchCV object to find the optimal set of parameters for our logistic regression model using 5-fold stratified cross-validation:
>>> from sklearn.grid_search import GridSearchCV
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> tfidf = TfidfVectorizer(strip_accents=None,
...                         lowercase=False,
...                         preprocessor=None)
>>> param_grid = [{'vect__ngram_range': [(1, 1)],
...                'vect__stop_words': [stop, None],
...                'vect__tokenizer': [tokenizer,
...                                    tokenizer_porter],
...                'clf__penalty': ['l1', 'l2'],
...                'clf__C': [1.0, 10.0, 100.0]},
...               {'vect__ngram_range': [(1, 1)],
...                'vect__stop_words': [stop, None],
...                'vect__tokenizer': [tokenizer,
...                                    tokenizer_porter],
...                'vect__use_idf': [False],
...                'vect__norm': [None],
...                'clf__penalty': ['l1', 'l2'],
...                'clf__C': [1.0, 10.0, 100.0]}]
>>> lr_tfidf = Pipeline([('vect', tfidf),
...                      ('clf', LogisticRegression(random_state=0))])
>>> gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
...                            scoring='accuracy',
...                            cv=5, verbose=1,
...                            n_jobs=-1)
>>> gs_lr_tfidf.fit(X_train, y_train)
When we initialized the GridSearchCV object and its parameter grid using the preceding code, we restricted ourselves to a limited number of parameter combinations, since the number of feature vectors, as well as the large vocabulary, can make the grid search computationally quite expensive; on a standard desktop computer, our grid search may take up to 40 minutes to complete.
In the previous code example, we replaced the CountVectorizer and TfidfTransformer from the previous subsection with the TfidfVectorizer, which combines the latter transformer objects. Our param_grid consisted of two parameter dictionaries. In the first dictionary, we used the TfidfVectorizer with its default settings (use_idf=True, smooth_idf=True, and norm='l2') to calculate the tf-idfs; in the second dictionary, we set those parameters to use_idf=False, smooth_idf=False, and norm=None in order to train a model based on raw term frequencies. Furthermore, for the logistic regression classifier itself, we trained models using L2 and L1 regularization via the penalty parameter and compared different regularization strengths by defining a range of values for the inverse-regularization parameter C.
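As a side note (not part of the grid search itself), the following small sketch on two toy documents illustrates that, with default settings, the TfidfVectorizer should produce the same tf-idf matrix as chaining a CountVectorizer with a TfidfTransformer:
>>> import numpy as np
>>> from sklearn.feature_extraction.text import (CountVectorizer,
...                                              TfidfTransformer,
...                                              TfidfVectorizer)
>>> docs = ['the sun is shining', 'the weather is sweet']
>>> counts = CountVectorizer().fit_transform(docs)
>>> chained = TfidfTransformer().fit_transform(counts)
>>> combined = TfidfVectorizer().fit_transform(docs)
>>> np.allclose(chained.toarray(), combined.toarray())  # expected: True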
After the grid search has finished, we can print the best parameter set:
>>> print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
Best parameter set: {'clf__C': 10.0, 'vect__stop_words': None,
'clf__penalty': 'l2', 'vect__tokenizer': <function tokenizer at 0x7f6c704948c8>,
'vect__ngram_range': (1, 1)}
As we can see here, we obtained the best grid search results using the regular tokenizer without Porter stemming, no stop-word library, and tf-idfs in combination with a logistic regression classifier that uses L2 regularization with the regularization strength C=10.0.
Using the best model from this grid search, let us print the average 5-fold cross-validation accuracy on the training set and the classification accuracy on the test dataset:
>>> print('CV Accuracy: %.3f'
...       % gs_lr_tfidf.best_score_)
CV Accuracy: 0.897
>>> clf = gs_lr_tfidf.best_estimator_
>>> print('Test Accuracy: %.3f'
...       % clf.score(X_test, y_test))
Test Accuracy: 0.899
The results reveal that our machine learning model can predict whether a movie review is positive or negative with approximately 90 percent accuracy.
Note
A still very popular classifier for text classification is the Naïve Bayes classifier, which gained popularity in applications of e-mail spam filtering. Naïve Bayes classifiers are easy to implement, computationally efficient, and tend to perform particularly well on relatively small datasets compared to other algorithms. Although we don't discuss Naïve Bayes classifiers in this book, the interested reader can find my freely available article about Naïve Bayes and text classification on arXiv (S. Raschka. Naive Bayes and Text Classification I - Introduction and Theory. Computing Research Repository (CoRR), abs/1410.5329, 2014. http://arxiv.org/pdf/1410.5329v3.pdf).
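As a rough sketch (not part of this chapter's workflow), a Naïve Bayes baseline could be trained on the same split; the snippet below reuses the X_train, y_train, X_test, and y_test arrays defined earlier in this section:
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> nb_tfidf = Pipeline([('vect', TfidfVectorizer()),
...                      ('clf', MultinomialNB())])
>>> nb_tfidf = nb_tfidf.fit(X_train, y_train)
>>> # the resulting accuracy will differ from the tuned logistic
>>> # regression model above
>>> print('Test Accuracy: %.3f' % nb_tfidf.score(X_test, y_test))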
Working with bigger data – online algorithms and out-of-core learning
If you executed the code examples in the previous section, you may have noticed that it can be computationally quite expensive to construct the feature vectors for the 50,000-document movie review dataset during grid search. In many real-world applications, it is not uncommon to work with even larger datasets that exceed our computer's memory. Since not everyone has access to supercomputer facilities, we will now apply a technique called out-of-core learning that allows us to work with such large datasets.
Back in Chapter 2, Training Machine Learning Algorithms for Classification, we introduced the concept of stochastic gradient descent, which is an optimization algorithm that updates the model's weights using one sample at a time. In this section, we will make use of the partial_fit function of the SGDClassifier in scikit-learn to stream the documents directly from our local drive and train a logistic regression model using small minibatches of documents.
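To illustrate the partial_fit interface before we apply it to the movie reviews, here is a tiny sketch on purely hypothetical toy arrays; note that the full set of class labels only has to be provided on the first call:
>>> import numpy as np
>>> from sklearn.linear_model import SGDClassifier
>>> toy_clf = SGDClassifier(loss='log', random_state=1)
>>> X_batch1, y_batch1 = np.array([[0.0], [1.0]]), np.array([0, 1])
>>> X_batch2, y_batch2 = np.array([[0.2], [0.9]]), np.array([0, 1])
>>> # the classes argument is required for the first minibatch only
>>> toy_clf = toy_clf.partial_fit(X_batch1, y_batch1,
...                               classes=np.array([0, 1]))
>>> # subsequent minibatches update the same model incrementally
>>> toy_clf = toy_clf.partial_fit(X_batch2, y_batch2)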
First, we define a tokenizer function that cleans the unprocessed text data from the movie_data.csv file that we constructed at the beginning of this chapter and separates it into word tokens while removing stop-words.
>>> import numpy as np
>>> import re
>>> from nltk.corpus import stopwords
>>> stop = stopwords.words('english')
>>> def tokenizer(text):
...     text = re.sub('<[^>]*>', '', text)
...     emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)',
...                            text.lower())
...     text = re.sub('[\W]+', ' ', text.lower()) \
...            + ' '.join(emoticons).replace('-', '')
...     tokenized = [w for w in text.split() if w not in stop]
...     return tokenized
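As a quick sanity check of this tokenizer, we can apply it to a short example string; the stop-words a and and are removed while the remaining words are left unstemmed:
>>> tokenizer('a runner likes running and runs a lot')
['runner', 'likes', 'running', 'runs', 'lot']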
Next, we define a generator function, stream_docs, that reads in and returns one document at a time:
>>> def stream_docs(path):
...     with open(path, 'r') as csv:
...         next(csv)  # skip header
...         for line in csv:
...             text, label = line[:-3], int(line[-2])
...             yield text, label
To verify that our stream_docs function works correctly, let us read in the first document from the movie_data.csv file, which should return a tuple consisting of the review text as well as the corresponding class label:
>>> next(stream_docs(path='./movie_data.csv'))
('"In 1974, the teenager Martha Moxley ... ',1)
We will now define a function, get_minibatch, that will take a document stream from the stream_docs function and return a particular number of documents specified by the size parameter:
>>> def get_minibatch(doc_stream, size):
...     docs, y = [], []
...     try:
...         for _ in range(size):
...             text, label = next(doc_stream)
...             docs.append(text)
...             y.append(label)
...     except StopIteration:
...         return None, None
...     return docs, y
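As a quick check (assuming the movie_data.csv file constructed earlier is present), we can request a small minibatch from a fresh document stream:
>>> docs, y = get_minibatch(stream_docs(path='./movie_data.csv'), size=3)
>>> len(docs), len(y)
(3, 3)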
Unfortunately, we can't use the CountVectorizer for out-of-core learning, since it requires holding the complete vocabulary in memory. Also, the TfidfVectorizer needs to keep all the feature vectors of the training dataset in memory to calculate the inverse document frequencies. However, another useful vectorizer for text processing implemented in scikit-learn is the HashingVectorizer. The HashingVectorizer is data-independent and makes use of the hashing trick via the 32-bit MurmurHash3 algorithm by Austin Appleby (https://sites.google.com/site/murmurhash/).
>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> from sklearn.linear_model import SGDClassifier
>>> vect = HashingVectorizer(decode_error='ignore',
...                          n_features=2**21,
...                          preprocessor=None,
...                          tokenizer=tokenizer)
>>> clf = SGDClassifier(loss='log', random_state=1, n_iter=1)
>>> doc_stream = stream_docs(path='./movie_data.csv')
Using the preceding code, we initialized the HashingVectorizer with our tokenizer function and set the number of features to 2**21. Furthermore, we reinitialized a logistic regression classifier by setting the loss parameter of the SGDClassifier to log. Note that, by choosing a large number of features in the HashingVectorizer, we reduce the chance of hash collisions, but we also increase the number of coefficients in our logistic regression model.
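The following small illustration (using the vect object we just created) shows that the HashingVectorizer is stateless: transform can be called without a prior fit step, and every document is mapped into the same fixed 2**21-dimensional feature space:
>>> example = vect.transform(['this movie was great'])
>>> example.shape
(1, 2097152)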
Now comes the really interesting part. Having set up all the complementary functions, we can now start the out-of-core learning using the following code:
>>> import pyprind
>>> pbar = pyprind.ProgBar(45)
>>> classes = np.array([0, 1])
>>> for _ in range(45):
...     X_train, y_train = get_minibatch(doc_stream, size=1000)
...     if not X_train:
...         break
...     X_train = vect.transform(X_train)
...     clf.partial_fit(X_train, y_train, classes=classes)
...     pbar.update()
0%                          100%
[##############################] | ETA[sec]: 0.000
Total time elapsed: 50.063 sec
Again, we made use of the PyPrind package to track the progress of our learning algorithm. We initialized the progress bar object with 45 iterations and, in the following for loop, we iterated over 45 minibatches of 1,000 documents each; since the dataset contains 50,000 reviews, this leaves the last 5,000 documents for evaluation.
Having completed the incremental learning process, we will use the last 5,000 documents to evaluate the performance of our model:
>>> X_test, y_test = get_minibatch(doc_stream, size=5000)
>>> X_test = vect.transform(X_test)
>>> print('Accuracy: %.3f' % clf.score(X_test, y_test))
Accuracy: 0.868
As we can see, the accuracy of the model is approximately 87 percent, slightly below the accuracy that we achieved in the previous section using grid search for hyperparameter tuning. However, out-of-core learning is very memory-efficient and took less than a minute to complete. Finally, we can use the last 5,000 documents to update our model:
>>> clf = clf.partial_fit(X_test, y_test)
If you are planning to continue directly with Chapter 9, Embedding a Machine Learning Model into a Web Application, I recommend keeping the current Python session open. In the next chapter, we will learn how to save the model that we just trained to disk for later use and embed it into a web application.
Note
Although the bag-of-words model is still the most commonly used model for text classification, it does not consider sentence structure and grammar. A popular extension of the bag-of-words model is Latent Dirichlet allocation, which is a topic model that considers the latent semantics of words (D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003).
A more modern alternative to the bag-of-words model is word2vec, an algorithm that Google released in 2013 (T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781, 2013). The word2vec algorithm is an unsupervised learning algorithm based on neural networks that attempts to automatically learn the relationships between words. The idea behind word2vec is to place words that have similar meanings into similar clusters; via clever vector spacing, the model can reproduce certain words using simple vector math, for example, king – man + woman = queen.
The original C-implementation, with useful links to the relevant papers and alternative implementations, can be found at https://code.google.com/p/word2vec/.
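If you want to experiment with word2vec from Python, the third-party gensim package (pip install gensim) provides one such alternative implementation; the following is only a rough sketch on a hypothetical toy corpus, and parameter names can vary slightly between gensim versions:
>>> from gensim.models import Word2Vec
>>> # a real application would train on millions of tokenized sentences;
>>> # this tiny toy corpus only illustrates the API
>>> sentences = [['the', 'king', 'rules', 'the', 'kingdom'],
...              ['the', 'queen', 'rules', 'the', 'kingdom'],
...              ['a', 'man', 'and', 'a', 'woman', 'walk']]
>>> model = Word2Vec(sentences, min_count=1)
>>> # analogy queries such as king - man + woman ~ queen only become
>>> # meaningful with much more training data
>>> model.wv.most_similar(positive=['woman', 'king'], negative=['man'])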
Summary
In this chapter, we learned how to use machine learning algorithms to classify text documents based on their polarity, which is a basic task in sentiment analysis in the field of natural language processing. Not only did we learn how to encode a document as a feature vector using the bag-of-words model, but we also learned how to weight the term frequency by relevance using term frequency-inverse document frequency.
Working with text data can be computationally quite expensive due to the large feature vectors that are created during this process; in the last section, we learned how to utilize out-of-core or incremental learning to train a machine learning algorithm without loading the whole dataset into a computer's memory.
In the next chapter, we will use our document classifier and learn how to embed it into a web application.