[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Big Data Analytics with Java written by Rajat Mehta. This book will help you learn to perform big data analytics tasks using machine learning concepts such as clustering, recommending products, data segmentation and more. [/box]
In this post, you will learn what sentiment analysis is and how it is used to analyze the emotions associated with a piece of text. You will also learn key NLP concepts such as tokenization and stemming, among others, and how they are used for sentiment analysis.
What is sentiment analysis?
One form of text analysis is sentiment analysis. As the name suggests, this technique is used to figure out the sentiment or emotion associated with the underlying text. So if you have a piece of text and you want to understand what kind of emotion it conveys, for example, anger, love, hate, positive, negative, and so on, you can use the technique of sentiment analysis. Sentiment analysis is used in various places, for example:
To analyze the reviews of a product and check whether they are positive or negative
This can be especially useful for predicting how successful your new product will be by analyzing user feedback
To analyze the reviews of a movie to check whether it is a hit or a flop
To detect the use of bad language (such as heated language, negative remarks, and so on) in forums, emails, and social media
To analyze the content of tweets or posts on other social media to check whether a political party's campaign was successful or not
Thus, sentiment analysis is a useful technique, but before we see the code for our sample sentiment analysis example, let's understand some of the concepts needed to solve this problem.
[box type="shadow" align="" class="" width=""]For working on a sentimental analysis problem we will be using some techniques from natural language processing and we will be explaining some of those concepts.[/box]
Concepts for sentiment analysis
Before we dive into the full-fledged problem of analyzing the sentiment behind text, we must understand some concepts from the NLP (natural language processing) perspective. We will explain these concepts now.
Tokenization
From the machine learning perspective, some of the most important tasks are feature extraction and feature selection. When the data is plain text, we need some way to extract information from it. We use a technique called tokenization, where the text content is split and tokens, or words, are extracted from it. A token can be a single word or a group of words too. There are various ways to extract tokens, as follows:
By using regular expressions: Regular expressions can be applied to textual content to extract words or tokens from it.
By using a pre-trained model: Apache Spark ships with a pre-trained model (machine learning model) that is trained to pull tokens from a text. You can apply this model to a piece of text and it will return the predicted results as a set of tokens.
To understand the tokenizer with an example, let's look at the following simple sentence:
Sentence: "The movie was awesome with nice songs"
Once you extract tokens from it you will get an array of strings as follows:
Tokens: ['The', 'movie', 'was', 'awesome', 'with', 'nice', 'songs']
[box type="shadow" align="" class="" width=""]The type of tokens you extract depends on the type of tokens you are interested in. Here we extracted single tokens, but tokens can also be a group of words, for example, 'very nice', 'not good', 'too bad', and so on.[/box]
Stop words removal
Not all the words present in the text are important. Some of them are common words of the English language that are needed to keep the grammar correct, but from the perspective of conveying information or emotion they might not be important at all, for example, common words such as is, was, were, and the. To remove these words there are again some common techniques from natural language processing that you can use, such as:
Store stop words in a file or dictionary and compare your extracted tokens with the words in this dictionary or file. If they match simply ignore them.
Use a pre-trained machine learning model that has been taught to remove stop words. Apache Spark ships with one such model in the Spark feature package.
Let's try to understand stop words removal using an example:
Sentence: "The movie was awesome"
From the sentence we can see that common words with no special meaning to convey are the and was. So after applying the stop words removal program to this data you will get:
After stop words removal: [ 'movie', 'awesome', 'nice', 'songs']
[box type="shadow" align="" class="" width=""]In the preceding sentence, the stop words the, was, and with are removed.[/box]
Stemming
Stemming is the process of reducing a word to its base or root form. For example, look at the set of words shown here:
car, cars, car's, cars'
From the perspective of sentiment analysis, we are only interested in the main word that all of these forms refer to. The reason for this is that the underlying meaning of the word is the same in every case. So whether we pick car's or cars, we are referring to a car only. Hence the stem or root word for the previous set of words will be:
car, cars, car's, cars' => car (stem or root word)
For English words, you can again use a pre-trained model and apply it to a set of data to figure out the stem words. Of course, there are more complex and better ways (for example, you can retrain the model with more data), and you will have to use a different model or technique altogether if you are dealing with languages other than English. Diving into stemming in detail is beyond the scope of this book, and we encourage readers to check out the documentation on natural language processing on Wikipedia and the Stanford NLP website.
[box type="shadow" align="" class="" width=""]To keep the sentimental analysis example in this book simple we will not be doing stemming of our tokens, but we will urge the readers to try the same to get better predictive results.[/box]
N-grams
Sometimes a single word conveys the meaning of a context, while at other times a group of words conveys a better meaning. For example, 'happy' is a word that in itself conveys happiness, but 'not happy' changes the picture completely and is the exact opposite of 'happy'. If we extract only single words, then in the 'not happy' example, 'not' and 'happy' would be two separate tokens and the entire sentence might be classified as positive by the classifier. However, if the classifier picks bi-grams (that is, two words in one token), it would be trained with 'not happy' and would classify similar sentences containing 'not happy' as 'negative'. Therefore, for training our models we can use uni-grams, bi-grams where we have two words per token, or, as the name suggests, n-grams where we have 'n' words per token; it all depends upon which token set trains our model well and improves its predictive accuracy. To see examples of n-grams, refer to the following table:
Sentence: "The movie was awesome with nice songs"
Uni-grams: ['The', 'movie', 'was', 'awesome', 'with', 'nice', 'songs']
Bi-grams: ['The movie', 'movie was', 'was awesome', 'awesome with', 'with nice', 'nice songs']
Tri-grams: ['The movie was', 'movie was awesome', 'was awesome with', 'awesome with nice', 'with nice songs']
For the purpose of this case study we will be looking only at uni-grams to keep our example simple.
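Although this case study sticks to uni-grams, the following minimal sketch shows how bi-grams could be generated with Spark's NGram transformer; setN(2) controls the 'n' in n-gram, and all class and column names here are illustrative rather than taken from the book.

```java
import java.util.Arrays;

import org.apache.spark.ml.feature.NGram;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class NGramExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("NGramExample").master("local[*]").getOrCreate();

        Dataset<Row> sentences = spark.createDataFrame(
                Arrays.asList(RowFactory.create("The movie was awesome with nice songs")),
                new StructType().add("sentence", DataTypes.StringType));

        Dataset<Row> tokenized = new Tokenizer()
                .setInputCol("sentence").setOutputCol("words").transform(sentences);

        // setN(2) produces bi-grams; setN(3) would produce tri-grams
        Dataset<Row> bigrams = new NGram().setN(2)
                .setInputCol("words").setOutputCol("bigrams").transform(tokenized);

        // Prints: [the movie, movie was, was awesome, awesome with, with nice, nice songs]
        bigrams.select("bigrams").show(false);

        spark.stop();
    }
}
```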
By now we know how to extract words from text and remove the unwanted ones, but how do we measure the importance of words or the sentiment that originates from them? There are a few popular approaches for this, and we will now discuss two of them.
Term presence and term frequency
Term presence just means that if the term is present we mark the value as 1, or else 0. We then build a matrix out of it where the rows represent the words and the columns represent each sentence. This matrix is later used for text analysis by feeding its content to a classifier.
Term frequency, as the name suggests, depicts the count or number of occurrences of a word or token within the document. Let's refer to the example in the following table, where we find the term frequency:
Sentence: "The movie was awesome with nice songs and nice dialogues."
Tokens (uni-grams only for now): ['The', 'movie', 'was', 'awesome', 'with', 'nice', 'songs', 'and', 'nice', 'dialogues']
Term frequency: ['The = 1', 'movie = 1', 'was = 1', 'awesome = 1', 'with = 1', 'nice = 2', 'songs = 1', 'and = 1', 'dialogues = 1']
As seen in the preceding table, the word 'nice' occurs twice in the sentence and hence it gets more weight in determining the opinion expressed by the sentence.
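To make the counting step concrete, here is a small, self-contained sketch in plain Java (deliberately not tied to Spark) that computes the term frequency of the tokens from the preceding table; the class name is purely illustrative.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermFrequencyExample {
    public static void main(String[] args) {
        // Tokens of the sample sentence from the preceding table
        List<String> tokens = Arrays.asList(
                "The", "movie", "was", "awesome", "with",
                "nice", "songs", "and", "nice", "dialogues");

        // Count how many times each token occurs in the document
        Map<String, Integer> termFrequency = new HashMap<>();
        for (String token : tokens) {
            termFrequency.merge(token, 1, Integer::sum);
        }

        // Prints nice = 2 and a count of 1 for every other token
        termFrequency.forEach((term, count) ->
                System.out.println(term + " = " + count));
    }
}
```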
Plain term frequency is not a precise approach for the following reasons:
There could be redundant, irrelevant words, for example, the, it, and they, that have a high frequency or count and might skew the training of the model
There could be rare but important words that convey the sentiment of the document, yet their frequency is low, so they might not have enough impact on the training of the model
For these reasons, a better approach, TF-IDF, is chosen, as shown in the next section.
TF-IDF
TF-IDF stands for Term Frequency-Inverse Document Frequency, and in simple terms it measures the importance of a term to a document. It works in two simple steps, as follows:
It counts the number of times a term occurs in the document, so the more often a term occurs, the greater its importance to that document.
Counting just the frequency of words in a document is not a very precise way to find the importance of the words. The simple reason for this is that there could be too many stop words, and since their count is high their importance might get elevated above the importance of the genuinely informative words. To fix this, TF-IDF checks for the occurrence of these words in other documents as well. If the words appear in other documents in large numbers too, that means they are probably grammatical words such as they, for, is, and so on, and TF-IDF decreases the importance or weight of such stop words.
Let's try to understand TF-IDF with an example. Suppose doc-1, doc-2, and so on are the documents from which we extract the tokens or words, and from those words we calculate the TF-IDF scores. Words that are stop words or regular words, such as for, is, and so on, get low TF-IDF scores, while words that are rare, such as 'awesome movie', get higher TF-IDF scores.
TF-IDF is the product of Term Frequency and Inverse Document Frequency. Both of them are explained here:
Term Frequency: This is nothing but the count of the occurrences of a term in the document. There are other ways of measuring this, but the simplistic approach is to just count the occurrences of the token. The simple formula for its calculation is:
TF(t, d) = frequency count of the term t in the document d
Inverse Document Frequency: This is the measure of how much information the word provides. It scales up the weight of words that are rare and scales down the weight of words that occur in many documents. The formula for inverse document frequency is:
IDF(t) = log(N / n(t)), where N is the total number of documents and n(t) is the number of documents that contain the term t
TF-IDF: TF-IDF is a simple multiplication of the Term Frequency and the Inverse Document Frequency. Hence:
TF-IDF(t, d) = TF(t, d) × IDF(t)
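As a rough sketch of how these two quantities can be computed with Spark's feature package, the following example pairs HashingTF (term frequency) with IDF (inverse document frequency); the IDF estimator is fit on the whole set of documents so that it knows how many documents contain each term. The sample sentences, class, and column names below are illustrative only.

```java
import java.util.Arrays;

import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.IDF;
import org.apache.spark.ml.feature.IDFModel;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class TfIdfExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("TfIdfExample").master("local[*]").getOrCreate();

        // Two tiny sample documents (illustrative only)
        Dataset<Row> docs = spark.createDataFrame(Arrays.asList(
                RowFactory.create("The movie was awesome with nice songs"),
                RowFactory.create("The movie was boring and the songs were bad")),
                new StructType().add("sentence", DataTypes.StringType));

        Dataset<Row> tokenized = new Tokenizer()
                .setInputCol("sentence").setOutputCol("words").transform(docs);

        // Term frequency: hash each token into a fixed-size vector of counts
        Dataset<Row> tf = new HashingTF()
                .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(1000)
                .transform(tokenized);

        // Inverse document frequency: fit on all documents, then rescale the counts
        IDFModel idfModel = new IDF()
                .setInputCol("rawFeatures").setOutputCol("features").fit(tf);
        Dataset<Row> tfidf = idfModel.transform(tf);

        // Common words such as 'the' end up with low weights, rare words with higher ones
        tfidf.select("features").show(false);

        spark.stop();
    }
}
```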
This simple technique is very popular and is used in a lot of places for text analysis. Next, let's look into another simple approach, called bag of words, that is also used in text analytics.
Bag of words
As the name suggests, bag of words uses a simple approach whereby we first extract the words or tokens from the text and then push them into a bag (an imaginary set). The main point is that the words are stored in the bag without any particular order; the mere presence of a word in the bag is what matters, while the order in which the words occur in the sentence, as well as their grammatical context, carries no value. Since bag of words gives no importance to the order of the words, you can take the TF-IDFs of all the words in the bag, put them in a vector, and later train a classifier (naïve Bayes or any other model) with it. Once trained, the model can be fed with vectors of new data to predict their sentiment.
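The following minimal sketch illustrates the bag of words idea with Spark's CountVectorizer: it builds a vocabulary from the bags of tokens and turns each bag into an unordered vector of word counts, which could then be fed (optionally after TF-IDF rescaling) to a classifier such as naïve Bayes. All names below are illustrative and not the book's own code.

```java
import java.util.Arrays;

import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.ml.feature.CountVectorizerModel;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class BagOfWordsExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("BagOfWordsExample").master("local[*]").getOrCreate();

        Dataset<Row> docs = spark.createDataFrame(Arrays.asList(
                RowFactory.create("The movie was awesome with nice songs"),
                RowFactory.create("The movie was boring with bad songs")),
                new StructType().add("sentence", DataTypes.StringType));

        Dataset<Row> tokenized = new Tokenizer()
                .setInputCol("sentence").setOutputCol("words").transform(docs);

        // Build a vocabulary from the bags of tokens and produce unordered count vectors
        CountVectorizerModel bagOfWords = new CountVectorizer()
                .setInputCol("words").setOutputCol("features").setVocabSize(1000)
                .fit(tokenized);

        // Each row is now a sparse vector of word counts, with word order discarded
        bagOfWords.transform(tokenized).select("features").show(false);

        spark.stop();
    }
}
```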
Summing it up, we have covered the key NLP concepts and sentiment analysis techniques you need in order to apply sentiment analysis.
If you want to implement machine learning algorithms to carry out predictive analytics and real-time streaming analytics you can refer to the book Big Data Analytics with Java.