For this article, we will take a look at former President Obama's State of the Union speeches. We will be performing Sentiment Analysis with R on Obama's State of the Union speeches. The two main analytical goals are to build topic models on the six State of the Union speeches and then compare the first speech in 2010 and the last in January 2016 for sentence-based textual measures, such as sentiment and dispersion.
This tutorial is taken from the book Mastering Machine Learning with R - Second Edition by Cory Lesmeister. In this book, you will master machine learning techniques with R to deliver insights in complex projects.
The primary package that we will use is tm, the text mining package. We will also need SnowballC for the stemming of the words, RColorBrewer for the color palettes in wordclouds, and the wordcloud package. Please ensure that you have these packages installed before attempting to load them:
> library(tm)
> library(wordcloud) > library(RColorBrewer)
The data files are available for download in https://github.com/datameister66/data. Please ensure you put the text files into a separate directory because it will all go into our corpus for analysis.
Download the seven .txt files, for example, sou2012.txt, into your working R directory. You can identify your current working directory and set it with these functions:
> getwd()
> setwd(".../data")
We can now begin to create the corpus by first creating an object with the path to the speeches and then seeing how many files are in this directory and what they are named:
> name <- file.path(".../text")
> length(dir(name)) [1] 7 > dir(name) [1] "sou2010.txt" "sou2011.txt" "sou2012.txt" "sou2013.txt" [5] "sou2014.txt" "sou2015.txt" "sou2016.txt"
We will name our corpus docs and create it with the Corpus() function, wrapped around the directory source function, DirSource(), which is also part of the tm package:
> docs <- Corpus(DirSource(name))
> docs <<VCorpus>> Metadata: corpus specific: 0, document level (indexed): 0 Content: documents: 7
We can now begin the text transformations using the tm_map() function from the tm package. These will be the transformations that we discussed previously--lowercase letters, remove numbers, remove punctuation, remove stop words, strip out the whitespace, and stem the words:
> docs <- tm_map(docs, tolower)
> docs <- tm_map(docs, removeNumbers) > docs <- tm_map(docs, removePunctuation) > docs <- tm_map(docs, removeWords, stopwords("english")) > docs <- tm_map(docs, stripWhitespace)
At this point, it is a good idea to eliminate unnecessary words. For example, during the speeches, when Congress applauds a statement, you will find (Applause) in the text. This must be removed:
> docs <- tm_map(docs, removeWords, c("applause", "can", "cant", "will", "that", "weve", "dont", "wont", "youll", "youre"))
After completing the transformations and removal of other words, make sure that your documents are plain text, put it in a document-term matrix, and check the dimensions:
> docs = tm_map(docs, PlainTextDocument)
> dtm = DocumentTermMatrix(docs) > dim(dtm) [1] 7 4738
The six speeches contain 4738 words. It is optional, but one can remove the sparse terms with the removeSparseTerms() function. You will need to specify a number between zero and one where the higher the number, the higher the percentage of sparsity in the matrix. Sparsity is the relative frequency of a term in the documents. So, if your sparsity threshold is 0.75, only terms with sparsity greater than 0.75 are removed. For us that would be (1 - 0.75) * 7, which is equal to 1.75. Therefore, any term in fewer than two documents would be removed:
> dtm <- removeSparseTerms(dtm, 0.75)
> dim(dtm) [1] 7 2254
As we don't have the metadata on the documents, it is important to name the rows of the matrix so that we know which document is which:
> rownames(dtm) <- c("2010", "2011", "2012", "2013", "2014", "2015", "2016")
Using the inspect() function, you can examine the matrix. Here, we will look at the seven rows and the first five columns:
> inspect(dtm[1:7, 1:5]) Terms Docs abandon ability able abroad absolutely 2010 0 1 1 2 2 2011 1 0 4 3 0 2012 0 0 3 1 1 2013 0 3 3 2 1 2014 0 0 1 4 0 2015 1 0 1 1 0 2016 0 0 1 0 0
It appears that our data is ready for analysis, starting with looking at the word frequency counts. Let me point out that the output demonstrates why I've been trained to not favor wholesale stemming. You may be thinking that 'ability' and 'able' could be combined. If you stemmed the document you would end up with 'abl'. How does that help the analysis? I think you lose context, at least in the initial analysis. Again, I recommend applying stemming thoughtfully and judiciously.
Modeling will be broken into two distinct parts. The first will focus on word frequency and correlation and culminate in the building of a topic model. In the next portion, we will examine many different quantitative techniques by utilizing the power of the qdap package in order to compare two different speeches.
As we have everything set up in the document-term matrix, we can move on to exploring word frequencies by creating an object with the column sums, sorted in descending order. It is necessary to use as.matrix() in the code to sum the columns. The default order is ascending, so putting - in front of freq will change it to descending:
> freq <- colSums(as.matrix(dtm))
> ord <- order(-freq) We will examine the head and tail of the object with the following code:
> freq[head(ord)] new america people jobs now years 193 174 168 163 157 148
> freq[tail(ord)] wright written yearold youngest youngstown zero 2 2 2 2 2 2
The most frequent word is new and, as you might expect, the president mentions america frequently. Also, notice how important employment is with the frequency of jobs. I find it interesting that he mentions Youngstown, for Youngstown, OH, a couple of times.
To look at the frequency of the word frequency, you can create tables, as follows:
> head(table(freq)) freq 2 3 4 5 6 7 596 354 230 141 137 89
> tail(table(freq)) freq 148 157 163 168 174 193 1 1 1 1 1 1
What these tables show is the number of words with that specific frequency. So 354 words occurred three times; and one word, new in our case, occurred 193 times.
Using findFreqTerms(), we can see which words occurred at least 125 times:
> findFreqTerms(dtm, 125) [1] "america" "american" "americans" "jobs" "make" "new" [7] "now" "people" "work" "year" "years"
You can find associations with words by correlation with the findAssocs() function. Let's look at jobs as two examples using 0.85 as the correlation cutoff:
> findAssocs(dtm, "jobs", corlimit = 0.85) $jobs colleges serve market shouldnt defense put tax came 0.97 0.91 0.89 0.88 0.87 0.87 0.87 0.86
For visual portrayal, we can produce wordclouds and a bar chart. We will do two wordclouds to show the different ways to produce them: one with a minimum frequency and the other by specifying the maximum number of words to include. The first one with a minimum frequency also includes code to specify the color. The scale syntax determines the minimum and maximum word size by frequency; in this case, the minimum frequency is 70:
> wordcloud(names(freq), freq, min.freq = 70, scale = c(3, .5), colors = brewer.pal(6, "Dark2"))
The output of the preceding command is as follows:
One can forgo all the fancy graphics, as we will in the following image, capturing the 25 most frequent words:
> wordcloud(names(freq), freq, max.words = 25)
The output of the preceding command is as follows:
To produce a bar chart, the code can get a bit complicated, whether you use base R, ggplot2, or lattice. The following code will show you how to produce a bar chart for the 10 most frequent words in base R:
> freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE) > wf <- data.frame(word = names(freq), freq = freq) > wf <- wf[1:10, ] > barplot(wf$freq, names = wf$word, main = "Word Frequency", xlab = "Words", ylab = "Counts", ylim = c(0, 250))
The output of the preceding command is as follows:
We will now move on to building topic models using the topicmodels package, which offers the LDA() function. The question now is how many topics to create. It seems logical to solve for three topics (k=3). Certainly, I encourage you to try other numbers of topics:
> library(topicmodels)
> set.seed(123) > lda3 <- LDA(dtm, k = 3, method = "Gibbs") > topics(lda3) 2010 2011 2012 2013 2014 2015 2016 2 1 1 1 3 3 2
We can see an interesting transition over time. The first and last addresses have the same topic grouping, almost as if he opened and closed his tenure with the same themes.
Using the terms() function produces a list of an ordered word frequency for each topic. The list of words is specified in the function, so let's look at the top 20 per topic:
> terms(lda3, 25) Topic 1 Topic 2 Topic 3 [1,] "jobs" "people" "america" [2,] "now" "one" "new" [3,] "get" "work" "every" [4,] "tonight" "just" "years" [5,] "last" "year" "like" [6,] "energy" "know" "make" [7,] "tax" "economy" "time" [8,] "right" "americans" "need" [9,] "also" "businesses" "american" [10,] "government" "even" "world" [11,] "home" "give" "help" [12,] "well" "many" "lets" [13,] "american" "security" "want" [14,] "two" "better" "states" [15,] "congress" "come" "first" [16,] "country" "still" "country" [17,] "reform" "workers" "together" [18,] "must" "change" "keep" [19,] "deficit" "take" "back" [20,] "support" "health" "americans" [21,] "business" "care" "way" [22,] "education" "families" "hard" [23,] "companies" "made" "today" [24,] "million" "future" "working" [25,] "nation" "small" "good"
Topic 2 covers the first and last speeches. Nothing really stands out as compelling in that topic as the others. It will be interesting to see how the next analysis can yield insights into those speeches.
Topic 1 covers the next three speeches. Here, the message transitions to "jobs", "energy", "reform", and the "deficit", not to mention the comments about "education" and as we saw above, the correlation of "jobs" and "colleges".
Topic 3 brings us to the next two speeches. The focus seems to really shift on to the economy and business with mentions to "security" and healthcare.
In the next section, we can dig into the exact speech content further, along with comparing and contrasting the first and last State of the Union addresses.
This portion of the analysis will focus on the power of the qdap package. It allows you to compare multiple documents over a wide array of measures. Our effort will be on comparing the 2010 and 2016 speeches. For starters, we will need to turn the text into data frames, perform sentence splitting, and then combine them into one data frame with a variable created that specifies the year of the speech. We will use this as our grouping variable in the analyses. Dealing with text data, even in R, can be tricky. The code that follows seemed to work the best, in this case, to get the data loaded and ready for analysis. We first load the qdap package. Then, to bring in the data from a text file, we will use the readLines() function from base R, collapsing the results to eliminate unnecessary whitespace. I also recommend putting your text encoding to ASCII, otherwise, you may run into some bizarre text that will mess up your analysis. That is done with the iconv() function:
> library(qdap) > speech16 <- paste(readLines("sou2016.txt"), collapse=" ") Warning message: In readLines("sou2016.txt") : incomplete final line found on 'sou2016.txt'
> speech16 <- iconv(speech16, "latin1", "ASCII", "")
The warning message is not an issue as it is just telling us that the final line of text is not the same length as the other lines in the .txt file. We now apply the qprep() function from qdap.
This function is a wrapper for a number of other replacement functions and using it will speed pre-processing, but it should be used with caution if a more detailed analysis is required. The functions it passes through are as follows:
> prep16 <- qprep(speech16)
The other pre-processing we should do is to replace contractions (can't to cannot), remove stopwords, in our case the top 100, and remove unwanted characters, with the exception of periods and question marks. They will come in handy shortly:
> prep16 <- replace_contraction(prep16)
> prep16 <- rm_stopwords(prep16, Top100Words, separate = F) > prep16 <- strip(prep16, char.keep = c("?", "."))
Critical to this analysis is to now split it into sentences and add what will be the grouping variable, the year of the speech. This also creates the tot variable, which stands for Turn of Talk, serving as an indicator of sentence order. This is especially helpful in a situation where you are analyzing dialogue, say in a debate or question and answer session:
> sent16 <- data.frame(speech = prep16)
> sent16 <- sentSplit(sent16, "speech") > sent16$year <- "2016"
Repeat the steps for the 2010 speech:
> speech10 <- paste(readLines("sou2010.txt"), collapse=" ") > speech10 <- iconv(speech10, "latin1", "ASCII", "") > speech10 <- gsub("(Applause.)", "", speech10) > prep10 <- qprep(speech10) > prep10 <- replace_contraction(prep10) > prep10 <- rm_stopwords(prep10, Top100Words, separate = F) > prep10 <- strip(prep10, char.keep = c("?", ".")) > sent10 <- data.frame(speech = prep10) > sent10 <- sentSplit(sent10, "speech")
> sent10$year <- "2010"
Concatenate the separate years into one dataframe:
> sentences <- data.frame(rbind(sent10, sent16))
One of the great things about the qdap package is that it facilitates basic text exploration, as we did before. Let's see a plot of frequent terms:
> plot(freq_terms(sentences$speech))
The output of the preceding command is as follows:
You can create a word frequency matrix that provides the counts for each word by speech:
> wordMat <- wfm(sentences$speech, sentences$year)
> head(wordMat[order(wordMat[, 1], wordMat[, 2],decreasing = TRUE),]) 2010 2016 our 120 85 us 33 33 year 29 17 americans 28 15 why 27 10 jobs 23 8
This can also be converted into a document-term matrix with the function as.dtm() should you so desire. Let's next build wordclouds, by year with qdap functionality:
> trans_cloud(sentences$speech, sentences$year, min.freq = 10)
The preceding command produces the following two images:
Comprehensive word statistics are available. Here is a plot of the stats available in the package. The plot loses some of its visual appeal with just two speeches but is revealing nonetheless. A complete explanation of the stats is available under ?word_stats:
> ws <- word_stats(sentences$speech, sentences$year, rm.incomplete = T)
> plot(ws, label = T, lab.digits = 2)
The output of the preceding command is as follows:
Notice that the 2016 speech was much shorter, with over a hundred fewer sentences and almost a thousand fewer words. Also, there seems to be the use of asking questions as a rhetorical device in 2016 versus 2010 (n.quest 10 versus n.quest 4).
To compare the polarity (sentiment scores), use the polarity() function, specifying the text and grouping variables:
> pol = polarity(sentences$speech, sentences$year)
> pol year total.sentences total.words ave.polarity sd.polarity stan.mean.polarity 1 2010 435 3900 0.052 0.432 0.121 2 2016 299 2982 0.105 0.395 0.267
The stan.mean.polarity value represents the standardized mean polarity, which is the average polarity divided by the standard deviation. We see that 2015 was slightly higher (0.267) than 2010 (0.121). This is in line with what we would expect, wanting to end on a more positive note. You can also plot the data. The plot produces two charts. The first shows the polarity by sentences over time and the second shows the distribution of the polarity:
> plot(pol)
The output of the preceding command is as follows:
This plot may be a challenge to read in this text, but let me do my best to interpret it. The 2010 speech starts out with a strong negative sentiment and is slightly more negative than 2016. We can identify the most negative sentiment sentence by creating a dataframe of the pol object, find the sentence number, and produce it:
> pol.df <- pol$all
> which.min(pol.df$polarity) [1] 12 > pol.df$text.var[12] [1] "One year ago, I took office amid two wars, an economy rocked by a severe recession, a financial system on the verge of collapse, and a government deeply in debt.
Now that is negative sentiment! Ironically, the government is even more in debt today. We will look at the readability index next:
> ari <- automated_readability_index(sentences$speech, sentences$year)
> ari$Readability year word.count sentence.count character.count 1 2010 3900 435 23859 2 2016 2982 299 17957 Automated_Readability_Index 1 11.86709 2 11.91929
I think it is no surprise that they are basically the same. Formality analysis is next. This takes a couple of minutes to run in R:
> form <- formality(sentences$speech, sentences$year)
> form year word.count formality 1 2016 2983 65.61 2 2010 3900 63.88
This looks to be very similar. We can examine the proportion of the parts of the speech. A plot is available, but adds nothing to the analysis, in this instance:
> form$form.prop.by year word.count noun adj prep articles pronoun 1 2010 3900 44.18 15.95 3.67 0 4.51 2 2016 2982 43.46 17.37 4.49 0 4.96 verb adverb interj other 1 23.49 7.77 0.05 0.38 2 21.73 7.41 0.00 0.57
Now, the diversity measures are produced. Again, they are nearly identical. A plot is also available, (plot(div)), but being so similar, it once again adds no value. It is important to note that Obama's speechwriter for 2010 was Jon Favreau, and in 2016, it was Cody Keenan:
> div <- diversity(sentences$speech, sentences$year)
> div year wc simpson shannon collision berger_parker brillouin 1 2010 3900 0.998 6.825 5.970 0.031 6.326 2 2015 2982 0.998 6.824 6.008 0.029 6.248
One of my favorite plots is the dispersion plot. This shows the dispersion of a word throughout the text. Let's examine the dispersion of "jobs", "families", and "economy":
> dispersion_plot(sentences$speech, rm.vars = sentences$year, c("security", "jobs", "economy"), color = "black", bg.color = "white")
The output of the preceding command is as follows:
This completes our analysis of the two speeches. The analysis showed that, although the speeches had a similar style, the core messages changed over time as the political landscape changed. This extract is taken from the book Mastering Machine Learning with R - Second Edition. Read the book to know more advanced prediction, algorithms, and learning methods with R.
Understanding Sentiment Analysis and other key NLP concepts
Twitter Sentiment Analysis
Sentiment Analysis of the 2017 US elections on Twitter