Chapter 12. Bringing It All Together
While we have demonstrated many aspects of using Java to support data science tasks, these techniques still need to be combined and used in an integrated manner. It is one thing to use the techniques in isolation and another to use them in a cohesive fashion. In this chapter, we will provide you with additional experience with these technologies and insights into how they can be used together.
Specifically, we will create a console-based application that analyzes tweets related to a user-defined topic. Using a console-based application allows us to focus on data-science-specific technologies and avoids having to choose a specific GUI technology that may not be relevant to us. It provides a common base from which a GUI implementation can be created if needed.
The application performs and illustrates the following high-level tasks:
- Data acquisition
- Data cleaning, including:
  - Removing stop words
  - Cleaning the text
- Sentiment analysis
- Basic data statistic collection
- Display of results
More than one type of analysis can be used with many of these steps. We will show the more relevant approaches and allude to other possibilities as appropriate. We will use Java 8's features whenever possible.
Defining the purpose and scope of our application
The application will prompt the user for a set of selection criteria, which include topic and sub-topic areas, and the number of tweets to process. The analysis performed will simply compute and display the number of positive and negative tweets for a topic and sub-topic. We used a generic sentiment analysis model, which will affect the quality of the sentiment analysis. However, other models and more analysis can be added.
We will use a Java 8 stream to structure the processing of tweet data. It is a stream of TweetHandler objects, as we will describe shortly.
We use several classes in this application. They are summarized here:
- TweetHandler: This class holds the raw tweet text and the specific fields needed for processing, including the actual tweet, username, and similar attributes.
- TwitterStream: This is used to acquire the application's data. Using a specific class separates the acquisition of the data from its processing. The class possesses a few fields that control how the data is acquired.
- ApplicationDriver: This contains the main method, user prompts, and the TweetHandler stream that controls the analysis.
Each of these classes will be detailed in later sections. However, we will present ApplicationDriver next to provide an overview of the analysis process and how the user interacts with the application.
Understanding the application's architecture
Every application has its own unique structure, or architecture. This architecture provides the overarching organization or framework for the application. For this application, we combine the three classes using a Java 8 stream in the ApplicationDriver class. This class consists of a constructor and two methods:
- ApplicationDriver: The constructor, which handles the application's user input
- performAnalysis: Performs the analysis
- main: Creates the ApplicationDriver instance
The class structure is shown next. The three instance variables are used to control the processing:
    public class ApplicationDriver {
        private String topic;
        private String subTopic;
        private int numberOfTweets;

        public ApplicationDriver() { ... }
        public void performAnalysis() { ... }

        public static void main(String[] args) {
            new ApplicationDriver();
        }
    }
The ApplicationDriver constructor follows. A Scanner instance is created and the sentiment analysis model is built:
    public ApplicationDriver() {
        Scanner scanner = new Scanner(System.in);
        TweetHandler swt = new TweetHandler();
        swt.buildSentimentAnalysisModel();
        ...
    }
The remainder of the constructor prompts the user for input and then calls the performAnalysis method:
    out.println("Welcome to the Tweet Analysis Application");
    out.print("Enter a topic: ");
    this.topic = scanner.nextLine();
    out.print("Enter a sub-topic: ");
    this.subTopic = scanner.nextLine().toLowerCase();
    out.print("Enter number of tweets: ");
    this.numberOfTweets = scanner.nextInt();
    performAnalysis();
The performAnalysis method uses a Java 8 Stream instance obtained from the TwitterStream instance. The TwitterStream class constructor uses the number of tweets and topic as input. This class is discussed in the Data acquisition using Twitter section:
    public void performAnalysis() {
        Stream<TweetHandler> stream = new TwitterStream(
            this.numberOfTweets, this.topic).stream();
        ...
    }
The stream uses a series of map and filter methods followed by a forEach method to perform the processing. The map method modifies the stream's elements. The filter methods remove elements from the stream. The forEach method is the terminal operation that ends the stream and generates the output.
The individual methods of the stream are executed in order. When acquired from a public Twitter stream, the Twitter information arrives as a JSON document, which we process first. This allows us to extract relevant tweet information and assign it to fields of the TweetHandler instance. Next, the text of the tweet is converted to lowercase. Only English tweets are kept, stop words are removed, and only those tweets that contain the sub-topic continue on to sentiment analysis. The last step computes the statistics and prints each tweet:
    stream
        .map(s -> s.processJSON())
        .map(s -> s.toLowerCase())
        .filter(s -> s.isEnglish())
        .map(s -> s.removeStopWords())
        .filter(s -> s.containsCharacter(this.subTopic))
        .map(s -> s.performSentimentAnalysis())
        .forEach((TweetHandler s) -> {
            s.computeStats();
            out.println(s);
        });
The results of the processing are then displayed:
    out.println();
    out.println("Positive Reviews: "
        + TweetHandler.getNumberOfPositiveReviews());
    out.println("Negative Reviews: "
        + TweetHandler.getNumberOfNegativeReviews());
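The fluent style used in the pipeline, where each transforming method returns the same object so that it can be passed to map, can be sketched with the standard library alone. MiniTweet and PipelineSketch are hypothetical stand-ins introduced only for illustration; they are not the chapter's TweetHandler, and the sample tweets are made up:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class PipelineSketch {
    // Runs map -> filter -> filter and collects the surviving texts.
    static List<String> run(Stream<MiniTweet> tweets, String subTopic) {
        return tweets
                .map(MiniTweet::toLowerCase)
                .filter(MiniTweet::isEnglish)
                .filter(t -> t.containsSubTopic(subTopic))
                .map(MiniTweet::getText)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Stream<MiniTweet> tweets = Stream.of(
                new MiniTweet("Great FOOTBALL game tonight ", "en"),
                new MiniTweet("Gran partido de futbol", "es"),
                new MiniTweet("Dinner was fine", "en"));
        // Only the first tweet is English and mentions the sub-topic.
        run(tweets, "football").forEach(System.out::println);
    }
}

// Hypothetical stand-in for TweetHandler: the transforming step mutates
// the object and returns `this`, so it can be used inside Stream.map().
class MiniTweet {
    private String text;
    private final String language;

    MiniTweet(String text, String language) {
        this.text = text;
        this.language = language;
    }

    MiniTweet toLowerCase() {
        this.text = this.text.toLowerCase().trim();
        return this;
    }

    boolean isEnglish() {
        return "en".equalsIgnoreCase(this.language);
    }

    boolean containsSubTopic(String subTopic) {
        return this.text.contains(subTopic);
    }

    String getText() {
        return this.text;
    }
}
```

Because each element flows through the whole chain before the next one starts, the order of the map and filter calls determines what each later step sees.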
We tested our application on a Monday night during a Monday-night football game and used the topic #MNF. The # symbol is called a hashtag and is used to categorize tweets. By selecting a popular category of tweets, we ensured that we would have plenty of Twitter data to work with. For simplicity, we chose the football subtopic. We also chose to only analyze 50 tweets for this example. The following is an abbreviated sample of our prompts, input, and output:
    Building Sentiment Model
    Welcome to the Tweet Analysis Application
    Enter a topic: #MNF
    Enter a sub-topic: football
    Enter number of tweets: 50
    Creating Twitter Stream
    51 messages processed!

    Text: rt @ bleacherreport : touchdown , broncos ! c . j . anderson punches ! lead , 7 - 6 # mnf # denvshou
    Date: Mon Oct 24 20:28:20 CDT 2016
    Category: neg
    ...
    Text: i cannot emphasize enough how big td drive . @ broncos offense . needed confidence booster & amp ; just got . # mnf # denvshou
    Date: Mon Oct 24 20:28:52 CDT 2016
    Category: pos

    Text: least touchdown game . # mnf
    Date: Mon Oct 24 20:28:52 CDT 2016
    Category: neg

    Positive Reviews: 13
    Negative Reviews: 27
We print out the text of each tweet, along with a timestamp and category. Notice that the text of the tweet does not always make sense. This may be due to the abbreviated nature of Twitter data, but it is partially due to the fact that this text has been cleaned and stop words have been removed. We should still see our topic, #MNF, although it will be lowercase due to our text cleaning. At the end, we print out the total number of tweets classified as positive and negative.
The classification of tweets is done by the performSentimentAnalysis method. Notice that classification using sentiment analysis is not always precise. The following tweet mentions a touchdown by a Denver Broncos player. This tweet could be construed as positive or negative depending on an individual's personal feelings about that team, but our model classified it as positive:
    Text: cj anderson td run @ broncos . broncos now lead 7 - 6 . # mnf
    Date: Mon Oct 24 20:28:42 CDT 2016
    Category: pos
Additionally, some tweets may have a neutral tone, such as the one shown next, but still be classified as either positive or negative. The following tweet is a retweet of a popular sports news Twitter handle, @bleacherreport:
    Text: rt @ bleacherreport : touchdown , broncos ! c . j . anderson punches ! lead , 7 - 6 # mnf # denvshou
    Date: Mon Oct 24 20:28:37 CDT 2016
    Category: neg
This tweet has been classified as negative but perhaps could be considered neutral. The contents of the tweet simply provide information about a score in a football game. Whether this is a positive or negative event will depend upon which team a person may be rooting for. When we examine the entire set of tweet data analysed, we notice that this same @bleacherreport tweet has been retweeted a number of times and classified as negative each time. This could skew our analysis when we consider that we may have a large number of improperly classified tweets. Using incorrect data decreases the accuracy of the results.
One option, depending on the purpose of the analysis, may be to exclude tweets by news outlets or other popular Twitter users. Additionally, we could exclude tweets that begin with RT, an abbreviation denoting that the tweet is a retweet of another user.
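As a rough illustration of the retweet exclusion just described, the following sketch filters on the text alone. RetweetFilterSketch and its methods are hypothetical names, not part of the application, and the pattern assumes that a retweet starts with "RT @", possibly with spaces around the "@" after tokenization as in the sample output:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class RetweetFilterSketch {
    // Matches "RT @user" at the start of the text, allowing optional
    // whitespace before "rt" and around "@". Case-insensitive.
    private static final Pattern RETWEET =
            Pattern.compile("^\\s*rt\\s*@", Pattern.CASE_INSENSITIVE);

    static boolean isRetweet(String text) {
        return RETWEET.matcher(text).find();
    }

    // Keeps only the texts that do not look like retweets.
    static List<String> withoutRetweets(List<String> texts) {
        return texts.stream()
                .filter(t -> !isRetweet(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList(
                "rt @ bleacherreport : touchdown , broncos !",
                "least touchdown game . # mnf");
        // Prints only the second text.
        withoutRetweets(texts).forEach(System.out::println);
    }
}
```

In the application, a check like this could be added as one more filter step in the stream, under the assumption that retweets should not count toward the totals.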
There are additional issues to consider when performing this type of analysis, including the sub-topic used. If we were to analyze the popularity of a Star Wars character, then we would need to be careful which names we use. For example, when choosing a character name such as Han Solo, the tweet may use an alias. Aliases for Han Solo include Vykk Draygo, Rysto, Jenos Idanian, Solo Jaxal, Master Marksman, and Jobekk Jonn, to mention a few (http://starwars.wikia.com/wiki/Category:Han_Solo_aliases). The actor's name, Harrison Ford in the case of Han Solo, may be used instead of the character's name. We may also want to consider the actor's nickname, such as Harry for Harrison.
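One way to account for aliases is to match the tweet against a list of names rather than a single sub-topic string. The sketch below is only an illustration: AliasMatchSketch, mentionsAny, and the shortened alias list are assumptions, and the names are stored in lowercase because the tweet text is lowercased earlier in the pipeline:

```java
import java.util.Arrays;
import java.util.List;

class AliasMatchSketch {
    // Illustrative, incomplete list of names that refer to one character.
    static final List<String> HAN_SOLO_ALIASES = Arrays.asList(
            "han solo", "vykk draygo", "jenos idanian", "harrison ford");

    // True if the (already lowercased) text mentions any of the names.
    static boolean mentionsAny(String lowerCaseText, List<String> names) {
        return names.stream().anyMatch(lowerCaseText::contains);
    }

    public static void main(String[] args) {
        System.out.println(mentionsAny(
                "harrison ford was great in that scene", HAN_SOLO_ALIASES));
        System.out.println(mentionsAny(
                "chewbacca steals the show", HAN_SOLO_ALIASES));
    }
}
```

Short nicknames such as Harry would need more care, since a plain substring match on a common first name is likely to produce false positives.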
Data acquisition using Twitter
The Twitter API is used in conjunction with HBC's HTTP client to acquire tweets, as previously illustrated in the Handling Twitter section of Chapter 2, Data Acquisition. This process involves using the public stream API at the default access level to pull a sample of public tweets currently streaming on Twitter. We will refine the data based on user-selected keywords.
To begin, we declare the TwitterStream class. It consists of two instance variables (numberOfTweets and topic), two constructors, and a stream method. The numberOfTweets variable contains the number of tweets to select and process, and topic allows the user to search for tweets related to a specific topic. We have set our default constructor to pull 100 tweets related to Star Wars:
    public class TwitterStream {
        private int numberOfTweets;
        private String topic;

        public TwitterStream() {
            this(100, "Star Wars");
        }

        public TwitterStream(int numberOfTweets, String topic) { ... }
    }
The heart of our TwitterStream class is the stream method. We start by performing authentication using the information provided by Twitter when we created our Twitter application. We then create a BlockingQueue object to hold our streaming data. In this example, we will set a default capacity of 1000. We use our topic variable in the trackTerms method to specify the types of tweets we are searching for. Finally, we specify our endpoint and turn off stall warnings:
    String myKey = "mySecretKey";
    String mySecret = "mySecret";
    String myToken = "myToKen";
    String myAccess = "myAccess";

    out.println("Creating Twitter Stream");
    BlockingQueue<String> statusQueue =
        new LinkedBlockingQueue<>(1000);
    StatusesFilterEndpoint endpoint = new StatusesFilterEndpoint();
    endpoint.trackTerms(Lists.newArrayList("twitterapi", this.topic));
    endpoint.stallWarnings(false);
Now we can create an Authentication object using OAuth1, a variation of the OAuth class. This allows us to build our connection client and complete the HTTP connection:
    Authentication twitterAuth =
        new OAuth1(myKey, mySecret, myToken, myAccess);
    BasicClient twitterClient = new ClientBuilder()
        .name("Twitter client")
        .hosts(Constants.STREAM_HOST)
        .endpoint(endpoint)
        .authentication(twitterAuth)
        .processor(new StringDelimitedProcessor(statusQueue))
        .build();
    twitterClient.connect();
Next, we create two ArrayList instances: list, to hold our TweetHandler objects, and twitterList, to hold the JSON data streamed from Twitter. We will discuss the TweetHandler object in the next section. We use the drainTo method in place of the poll method demonstrated in Chapter 2, Data Acquisition, because it can be more efficient for large amounts of data:
    List<TweetHandler> list = new ArrayList<>();
    List<String> twitterList = new ArrayList<>();
Next, we call drainTo to move any messages already in the queue into twitterList, and then we loop numberOfTweets times. In each iteration, we call the take method, which blocks until a message is available and removes it from the BlockingQueue instance. We then create a new TweetHandler object using the message and place it in our list. After we have handled all of our messages and the for loop completes, we stop the HTTP client, display the number of messages, and return our stream of TweetHandler objects:
        statusQueue.drainTo(twitterList);
        for (int i = 0; i < numberOfTweets; i++) {
            String message;
            try {
                message = statusQueue.take();
                list.add(new TweetHandler(message));
            } catch (InterruptedException ex) {
                ex.printStackTrace();
            }
        }
        twitterClient.stop();
        out.printf("%d messages processed!\n",
            twitterClient.getStatsTracker().getNumMessages());
        return list.stream();
    }
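The difference between the bulk drainTo call and the one-at-a-time calls can be seen in isolation with a plain LinkedBlockingQueue. The following is a minimal sketch, not the application's code: QueueSketch and collect are hypothetical names, and a timed poll is used in place of take so that the example cannot block forever when the queue runs dry:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class QueueSketch {
    // Collects up to `limit` messages: first in bulk with drainTo, then
    // one at a time with a timed poll until the limit or a timeout.
    static List<String> collect(BlockingQueue<String> queue, int limit,
                                long timeoutMillis) {
        List<String> messages = new ArrayList<>();
        queue.drainTo(messages, limit);      // non-blocking bulk transfer
        while (messages.size() < limit) {
            String next;
            try {
                next = queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt(); // keep interrupt status
                break;
            }
            if (next == null) {
                break;                       // timed out: no more data
            }
            messages.add(next);
        }
        return messages;
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(1000);
        queue.add("tweet 1");
        queue.add("tweet 2");
        queue.add("tweet 3");
        System.out.println(collect(queue, 2, 100)); // [tweet 1, tweet 2]
        System.out.println(collect(queue, 5, 100)); // [tweet 3]
    }
}
```

drainTo never waits for new elements, so it only helps with messages that have already arrived; a blocking or timed call is still needed when the requested number of tweets has not yet been received.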
We are now ready to clean and analyze our data.
Understanding the TweetHandler class
The TweetHandler class holds information about a specific tweet. It takes the raw JSON tweet and extracts those parts that are relevant to the application's needs. It also possesses the methods to process the tweet's text, such as converting the text to lowercase and removing tweets that are not relevant. The first part of the class is shown next:
    public class TweetHandler {
        private String jsonText;
        private String text;
        private Date date;
        private String language;
        private String category;
        private String userName;
        ...
        public TweetHandler processJSON() { ... }
        public TweetHandler toLowerCase() { ... }
        public TweetHandler removeStopWords() { ... }
        public boolean isEnglish() { ... }
        public boolean containsCharacter(String character) { ... }
        public void computeStats() { ... }
        public void buildSentimentAnalysisModel() { ... }
        public TweetHandler performSentimentAnalysis() { ... }
    }
The instance variables show the type of data retrieved from a tweet and processed, as detailed here:
- jsonText: The raw JSON text
- text: The text of the processed tweet
- date: The date of the tweet
- language: The language of the tweet
- category: The tweet classification, which is positive or negative
- userName: The name of the Twitter user
There are several other variables used by the class. The following static variables are used to create and use a sentiment analysis model. The classifier static variable refers to the model:
    private static String[] labels = {"neg", "pos"};
    private static int nGramSize = 8;
    private static DynamicLMClassifier<NGramProcessLM> classifier =
        DynamicLMClassifier.createNGramProcess(labels, nGramSize);
The default constructor is used to provide an instance to build the sentiment model. The single-argument constructor creates a TweetHandler object using the raw JSON text:
    public TweetHandler() {
        this.jsonText = "";
    }

    public TweetHandler(String jsonText) {
        this.jsonText = jsonText;
    }
The remainder of the methods are discussed in the following sections.
Extracting data for a sentiment analysis model
In Chapter 9, Text Analysis, we used DL4J to perform sentiment analysis. We will use LingPipe in this example as an alternative to our previous approach. Because we want to classify Twitter data, we chose a dataset with pre-classified tweets, available at http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip. We must complete a one-time process of extracting this data into a format we can use with our model before we continue with our application development.
This dataset exists in a large .csv file with one tweet and classification per line. The tweets are classified as either 0 (negative) or 1 (positive). The following is an example of one line of this data file:
95,0,Sentiment140, - Longest night ever.. ugh! http://tumblr.com/xwp1yxhi6
The first element represents a unique ID number, which is part of the original data set and which we will use for the filename. The second element is the classification, the third is a data set label (effectively ignored for the purposes of this project), and the last element is the actual tweet text. Before we can use this data with our LingPipe model, we must write each tweet into an individual file. To do this, we created three string variables. The filename variable will be assigned either pos or neg depending on each tweet's classification and will be used in the write operation. We also use the file variable to hold the name of the individual tweet file and the text variable to hold the individual tweet text. Next, we use the readAllLines method with the Paths class's get method to store our data in a List object. We need to specify the charset, StandardCharsets.ISO_8859_1, as well:
    try {
        String filename;
        String file;
        String text;
        List<String> lines = Files.readAllLines(
            Paths.get("\\path-to-file\\SentimentAnalysisDataset.csv"),
            StandardCharsets.ISO_8859_1);
        ...
    } catch (IOException ex) {
        // Handle exceptions
    }
Now we can loop through our list and use the split method to store our .csv data in a string array. We convert the element at position 1 to an integer and determine whether it is a 1. Tweets classified with a 1 are considered positive tweets and we set filename to pos. All other tweets set the filename to neg. We extract the output filename from the element at position 0 and the text from element 3. We ignore the label in position 2 for the purposes of this project. Finally, we write out our data:
    for (String s : lines) {
        // A limit of 4 keeps commas inside the tweet text in oneLine[3]
        String[] oneLine = s.split(",", 4);
        if (Integer.parseInt(oneLine[1]) == 1) {
            filename = "pos";
        } else {
            filename = "neg";
        }
        file = oneLine[0] + ".txt";
        text = oneLine[3];
        Files.write(Paths.get(
            "\\path-to-file\\txt_sentoken\\" + filename + "\\" + file),
            text.getBytes());
    }
Notice that we created the neg and pos directories within the txt_sentoken directory. This location is important when we read the files to build our model.
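The mapping from one dataset line to a label directory and an output file can be sketched without touching the file system. CsvLineSketch and its methods are hypothetical names used only for illustration, the base directory passed in main is a placeholder, and the split with a limit of 4 is one assumed way to keep commas that occur inside the tweet text; it does not handle quoted CSV fields:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class CsvLineSketch {
    // Splits one dataset line into at most four fields so that commas
    // inside the tweet text stay part of the last field.
    static String[] fields(String line) {
        return line.split(",", 4);
    }

    // Maps the classification column (1 or 0) to a directory name.
    static String label(String[] fields) {
        return Integer.parseInt(fields[1].trim()) == 1 ? "pos" : "neg";
    }

    // Builds <baseDir>/<label>/<id>.txt for the given line.
    static Path target(Path baseDir, String[] fields) {
        return baseDir.resolve(label(fields))
                      .resolve(fields[0].trim() + ".txt");
    }

    public static void main(String[] args) {
        String line = "95,0,Sentiment140, - Longest night ever.. ugh! "
                + "http://tumblr.com/xwp1yxhi6";
        String[] f = fields(line);
        // The separator in the printed path depends on the platform.
        System.out.println(target(Paths.get("txt_sentoken"), f));
        System.out.println(f[3]);
    }
}
```

Building the path with resolve rather than string concatenation avoids hard-coding a separator, which matters if the extraction is run on more than one operating system.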
Building the sentiment model
Now we are ready to build our model. We loop through the labels array, which contains pos and neg, and for each label we create a new Classification object. We then create a new File object using this label and use the listFiles method to create an array of files. Next, we will traverse these files using a for loop:
    public void buildSentimentAnalysisModel() {
        out.println("Building Sentiment Model");
        File trainingDir = new File("\\path to file\\txt_sentoken");
        for (int i = 0; i < labels.length; i++) {
            Classification classification =
                new Classification(labels[i]);
            File file = new File(trainingDir, labels[i]);
            File[] trainingFiles = file.listFiles();
            ...
        }
    }
Within the for loop, we extract the tweet data and store it in our string, review. We then create a new Classified object using review and classification. Finally, we call the handle method to train the classifier on this particular text:
    for (int j = 0; j < trainingFiles.length; j++) {
        try {
            // Files here is LingPipe's com.aliasi.util.Files
            String review = Files.readFromFile(
                trainingFiles[j], "ISO-8859-1");
            Classified<CharSequence> classified =
                new Classified<>(review, classification);
            classifier.handle(classified);
        } catch (IOException ex) {
            // Handle exceptions
        }
    }
For the dataset discussed in the previous section, this process may take a substantial amount of time. However, we consider this time trade-off to be worth the quality of analysis made possible by this training data.
Processing the JSON input
The Twitter data is retrieved in JSON format. We will use Twitter4J (http://twitter4j.org) to extract the relevant parts of the tweet and store them in the corresponding fields of the TweetHandler class.
The TweetHandler class's processJSON method does the actual data extraction. An instance of JSONObject is created based on the JSON text. The class possesses several methods to extract specific types of data from an object. We use the getString method to get the fields we need.
The start of the processJSON method is shown next, where we start by obtaining the JSONObject instance, which we will use to extract the relevant parts of the tweet:
    public TweetHandler processJSON() {
        try {
            JSONObject jsonObject = new JSONObject(this.jsonText);
            ...
        } catch (JSONException ex) {
            // Handle exceptions
        }
        return this;
    }
First, we extract the tweet's text as shown here:
this.text = jsonObject.getString("text");
Next, we extract the tweet's date. We use the SimpleDateFormat class to convert the date string to a Date object. Its constructor is passed a string that specifies the format of the date string. We used the string "EEE MMM d HH:mm:ss Z yyyy", whose parts are detailed next. The order of the string elements corresponds to the order found in the JSON entity:
- EEE: Day of the week, specified using three characters
- MMM: Month, using three characters
- d: Day of the month
- HH:mm:ss: Hours, minutes, and seconds
- Z: Time zone
- yyyy: Year
The code follows:
    SimpleDateFormat sdf = new SimpleDateFormat(
        "EEE MMM d HH:mm:ss Z yyyy");
    try {
        this.date = sdf.parse(jsonObject.getString("created_at"));
    } catch (ParseException ex) {
        // Handle exceptions
    }
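The date pattern can be tried on its own, outside of the JSON handling. In this sketch, CreatedAtSketch is a hypothetical class and the timestamp string is a made-up example in the layout the pattern describes. Passing Locale.ENGLISH is an addition not present in the code above; it keeps the day and month abbreviations parseable when the default locale is not English:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

class CreatedAtSketch {
    // Parses a timestamp such as "Tue Oct 25 01:28:20 +0000 2016".
    // A parse failure is rethrown as an unchecked exception.
    static Date parse(String createdAt) {
        SimpleDateFormat sdf = new SimpleDateFormat(
                "EEE MMM d HH:mm:ss Z yyyy", Locale.ENGLISH);
        try {
            return sdf.parse(createdAt);
        } catch (ParseException ex) {
            throw new IllegalArgumentException(
                    "Unparseable date: " + createdAt, ex);
        }
    }

    public static void main(String[] args) {
        Date date = parse("Tue Oct 25 01:28:20 +0000 2016");
        // toInstant() shows the value in UTC, independent of the
        // machine's time zone: 2016-10-25T01:28:20Z
        System.out.println(date.toInstant());
    }
}
```

Printing the Date directly, as the application's toString method does, renders it in the local time zone, which is why the sample output earlier in the chapter shows CDT.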
The remaining fields are extracted as shown next. We had to extract an intermediate JSON object to extract the name field:
    this.language = jsonObject.getString("lang");
    JSONObject user = jsonObject.getJSONObject("user");
    this.userName = user.getString("name");
Having acquired and extracted the text, we are now ready to perform the important task of cleaning the data.
Cleaning data to improve our results
Data cleaning is a critical step in most data science problems. Data that is not properly cleaned may have errors such as misspellings, inconsistent representation of elements such as dates, and extraneous words.
There are numerous data cleaning options that we can apply to Twitter data. For this application, we perform simple cleaning. In addition, we will filter out certain tweets.
The conversion of the text to lowercase letters is easily achieved as shown here:
public TweetHandler toLowerCase() { this.text = this.text.toLowerCase().trim(); return this; }
Part of the process is to remove certain tweets that are not needed. For example, the following code illustrates how to detect whether the tweet is in English and whether it contains a sub-topic of interest to the user. The boolean return value is used by the filter method in the Java 8 stream, which performs the actual removal:
    public boolean isEnglish() {
        return this.language.equalsIgnoreCase("en");
    }

    public boolean containsCharacter(String character) {
        return this.text.contains(character);
    }
Numerous other cleaning operations can be easily added to the process such as removing leading and trailing white space, replacing tabs, and validating dates and email addresses.
Removing stop words
Stop words are those words that do not contribute to the understanding or processing of data. Typical stop words include the, and, a, and or. When they do not contribute to the data process, they can be removed to simplify processing and make it more efficient.
There are several techniques for removing stop words, as discussed in Chapter 9, Text Analysis. For this application, we will use LingPipe (http://alias-i.com/lingpipe/) to remove stop words. We use the EnglishStopTokenizerFactory class to obtain a model for our stop words based on an IndoEuropeanTokenizerFactory instance:
    public TweetHandler removeStopWords() {
        TokenizerFactory tokenizerFactory =
            IndoEuropeanTokenizerFactory.INSTANCE;
        tokenizerFactory =
            new EnglishStopTokenizerFactory(tokenizerFactory);
        ...
        return this;
    }
A series of tokens that do not contain stop words is extracted, and a StringBuilder instance is used to create a string to replace the original text:
    Tokenizer tokens = tokenizerFactory.tokenizer(
        this.text.toCharArray(), 0, this.text.length());
    StringBuilder buffer = new StringBuilder();
    for (String word : tokens) {
        buffer.append(word).append(' ');
    }
    this.text = buffer.toString();
The LingPipe model we used may not be the best suited for all tweets. In addition, it has been suggested that removing stop words from tweets may not be productive (http://oro.open.ac.uk/40666/). Options to select various stop words and whether stop words should even be removed can be added to the stream process.
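To experiment with a different stop word list, or with none at all, the removal step can be approximated with the standard library. This is a swapped-in technique, not LingPipe's tokenizer: StopWordSketch is a hypothetical class, its six-word list is purely illustrative, and splitting on whitespace will not separate punctuation the way the LingPipe tokenizer does:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

class StopWordSketch {
    // A tiny illustrative stop word list; real lists are much longer.
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("the", "and", "a", "or", "of", "to"));

    // Splits on whitespace, drops stop words, and rejoins with spaces.
    static String removeStopWords(String lowerCaseText) {
        return Arrays.stream(lowerCaseText.trim().split("\\s+"))
                .filter(word -> !word.isEmpty())
                .filter(word -> !STOP_WORDS.contains(word))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Prints "broncos texans"
        System.out.println(removeStopWords("the broncos and the texans"));
    }
}
```

Keeping the word list in a Set makes it straightforward to swap lists or to pass an empty set when stop words should be retained.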
Performing sentiment analysis
We can now perform sentiment analysis using the model built in the Building the sentiment model section of this chapter. We create a new Classification object by passing our cleaned text to the classify method. We then use the bestCategory method to classify our text as either positive or negative. Finally, we set category to the result and return the TweetHandler object:
    public TweetHandler performSentimentAnalysis() {
        Classification classification =
            classifier.classify(this.text);
        String bestCategory = classification.bestCategory();
        this.category = bestCategory;
        return this;
    }
We are now ready to analyze the results of our application.
Analysing the results
The analysis performed in this application is fairly simple. Once the tweets have been classified as either positive or negative, a total is computed. We used two static variables for this purpose:
private static int numberOfPositiveReviews = 0; private static int numberOfNegativeReviews = 0;
The computeStats method is called from the Java 8 stream and increments the appropriate variable:
    public void computeStats() {
        if (this.category.equalsIgnoreCase("pos")) {
            numberOfPositiveReviews++;
        } else {
            numberOfNegativeReviews++;
        }
    }
Two static methods provide access to the number of reviews:
    public static int getNumberOfPositiveReviews() {
        return numberOfPositiveReviews;
    }

    public static int getNumberOfNegativeReviews() {
        return numberOfNegativeReviews;
    }
In addition, a simple toString method is provided to display basic tweet information:
    public String toString() {
        return "\nText: " + this.text
            + "\nDate: " + this.date
            + "\nCategory: " + this.category;
    }
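As an alternative to static counters, the totals can be computed by the stream itself with a grouping collector. The sketch below is one possible design, not what the application does: CategoryCountSketch is a hypothetical class, and it works on plain category strings rather than TweetHandler objects:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class CategoryCountSketch {
    // Counts how often each category label occurs.
    static Map<String, Long> count(List<String> categories) {
        return categories.stream()
                .collect(Collectors.groupingBy(
                        Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts =
                count(Arrays.asList("neg", "pos", "neg", "neg"));
        System.out.println("Positive: " + counts.getOrDefault("pos", 0L));
        System.out.println("Negative: " + counts.getOrDefault("neg", 0L));
    }
}
```

Collecting the counts this way avoids shared mutable state, which would matter if the stream were ever run in parallel or the analysis were run more than once in the same JVM.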
More sophisticated analysis can be added as required. The intent of this application was to demonstrate a technique for combining the various data processing tasks.
Extracting data for a sentiment analysis model
In Chapter 9, Text Analysis, we used DL4J to perform sentiment analysis. We will use LingPipe in this example as an alternative to our previous approach. Because we want to classify Twitter data, we chose a dataset with pre-classified tweets, available at http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip. We must complete a one-time process of extracting this data into a format we can use with our model before we continue with our application development.
This dataset exists in a large .csv
file with one tweet and classification per line. The tweets are classified as either 0
(negative) or 1
(positive). The following is an example of one line of this data file:
95,0,Sentiment140, - Longest night ever.. ugh! http://tumblr.com/xwp1yxhi6
The first element represents a unique ID number which is part of the original data set and which we will use for the filename. The second element is the classification, the third is a data set label (effectively ignored for the purposes of this project), and the last element is the actual tweet text. Before we can use this data with our LingPipe model, we must write each tweet into an individual file. To do this, we created three string variables. The filename
variable will be assigned either pos
or neg
depending on each tweet's classification and will be used in the write operation. We also use the file
variable to hold the name of the individual tweet file and the text
variable to hold the individual tweet text. Next, we use the readAllLines
method with the Paths
class's get
method to store our data in a List
object. We need to specify the charset, StandardCharsets.ISO_8859_1
, as well:
try { String filename; String file; String text; List<String> lines = Files.readAllLines( Paths.get("\\path-to-file\\SentimentAnalysisDataset.csv"), StandardCharsets.ISO_8859_1); ... } catch (IOException ex) { // Handle exceptions }
Now we can loop through our list and use the split
method to store our .csv
data in a string array. We convert the element at position 1
to an integer and determine whether it is a 1
. Tweets classified with a 1
are considered positive tweets and we set filename
to pos
. All other tweets set the filename
to neg
. We extract the output filename from the element at position 0
and the text from element 3
. We ignore the label in position 2
for the purposes of this project. Finally, we write out our data:
for(String s : lines) { String[] oneLine = s.split(","); if(Integer.parseInt(oneLine[1])==1) { filename = "pos"; } else { filename = "neg"; } file = oneLine[0]+".txt"; text = oneLine[3]; Files.write(Paths.get( path-to-file\\txt_sentoken"+filename+""+file), text.getBytes()); }
Notice that we created the neg
and pos
directories within the txt_sentoken
directory. This location is important when we read the files to build our model.
Building the sentiment model
Now we are ready to build our model. We loop through the labels
array, which contains pos
and neg
, and for each label we create a new Classification
object. We then create a new file using this label and use the listFiles
method to create an array of filenames. Next, we will traverse these filenames using a for
loop:
public void buildSentimentAnalysisModel() { out.println("Building Sentiment Model"); File trainingDir = new File("\\path to file\\txt_sentoken"); for (int i = 0; i < labels.length; i++) { Classification classification = new Classification(labels[i]); File file = new File(trainingDir, labels[i]); File[] trainingFiles = file.listFiles(); ... } }
Within the for
loop, we extract the tweet data and store it in our string, review
. We then create a new Classified
object using review
and classification
. Finally we can call the handle
method to classify this particular text:
for (int j = 0; j < trainingFiles.length; j++) { try { String review = Files.readFromFile(trainingFiles[j], "ISO-8859-1"); Classified<CharSequence> classified = new Classified<>(review, classification); classifier.handle(classified); } catch (IOException ex) { // Handle exceptions } }
For the dataset discussed in the previous section, this process may take a substantial amount of time. However, we consider this time trade-off to be worth the quality of analysis made possible by this training data.
Processing the JSON input
The Twitter data is retrieved using JSON format. We will use Twitter4J (http://twitter4j.org) to extract the relevant parts of the tweet and store in the corresponding field of the TweetHandler
class.
The TweetHandler
class's processJSON
method does the actual data extraction. An instance of the JSONObject
is created based on the JSON text. The class possesses several methods to extract specific types of data from an object. We use the getString
method to get the fields we need.
The start of the processJSON
method is shown next, where we start by obtaining the JSONObject
instance, which we will use to extract the relevant parts of the tweet:
public TweetHandler processJSON() { try { JSONObject jsonObject = new JSONObject(this.jsonText); ... } catch (JSONException ex) { // Handle exceptions } return this; }
First, we extract the tweet's text as shown here:
this.text = jsonObject.getString("text");
Next, we extract the tweet's date. We use the SimpleDateFormat class to convert the date string to a Date object. Its constructor is passed a string that specifies the format of the date string. We used the string "EEE MMM d HH:mm:ss Z yyyy", whose parts are detailed next. The order of the string elements corresponds to the order found in the JSON entity:
- EEE: Day of the week specified using three characters
- MMM: Month, using three characters
- d: Day of the month
- HH:mm:ss: Hours, minutes, and seconds
- Z: Time zone
- yyyy: Year
The code follows:
SimpleDateFormat sdf = new SimpleDateFormat("EEE MMM d HH:mm:ss Z yyyy");
try {
    this.date = sdf.parse(jsonObject.getString("created_at"));
} catch (ParseException ex) {
    // Handle exceptions
}
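One detail worth noting about this pattern is the locale. When no locale is supplied, SimpleDateFormat interprets day and month names such as Wed and Aug using the machine's default locale, so parsing may fail on a system configured for another language. The following is a minimal sketch under that assumption, not the application's code; the TweetDateParser class and the sample timestamp are made up for illustration:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical helper, not part of the application's TweetHandler class.
final class TweetDateParser {

    // Same pattern as in the text; the explicit Locale is our addition.
    private static final String PATTERN = "EEE MMM d HH:mm:ss Z yyyy";

    private TweetDateParser() {
    }

    // Parses a created_at style string, or returns null when it does not match.
    static Date parse(String createdAt) {
        // SimpleDateFormat is not thread-safe, so we create one per call.
        SimpleDateFormat sdf = new SimpleDateFormat(PATTERN, Locale.ENGLISH);
        try {
            return sdf.parse(createdAt);
        } catch (ParseException ex) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Sample value invented for this sketch.
        Date date = parse("Wed Aug 27 13:08:45 +0000 2008");
        System.out.println(date);
    }
}
```

Returning null on failure is only one option; a real application might prefer to log the problem or skip the tweet.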
The remaining fields are extracted as shown next. We had to extract an intermediate JSON object to extract the name field:
this.language = jsonObject.getString("lang");
JSONObject user = jsonObject.getJSONObject("user");
this.userName = user.getString("name");
Having acquired and extracted the text, we are now ready to perform the important task of cleaning the data.
Cleaning data to improve our results
Data cleaning is a critical step in most data science problems. Data that is not properly cleaned may have errors such as misspellings, inconsistent representation of elements such as dates, and extraneous words.
There are numerous data cleaning options that we can apply to Twitter data. For this application, we perform simple cleaning. In addition, we will filter out certain tweets.
The conversion of the text to lowercase letters is easily achieved as shown here:
public TweetHandler toLowerCase() {
    this.text = this.text.toLowerCase().trim();
    return this;
}
Part of the process is to remove certain tweets that are not needed. For example, the following code illustrates how to detect whether the tweet is in English and whether it contains a sub-topic of interest to the user. The boolean return value is used by the filter method in the Java 8 stream, which performs the actual removal:
public boolean isEnglish() {
    return this.language.equalsIgnoreCase("en");
}

public boolean containsCharacter(String character) {
    return this.text.contains(character);
}
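To make the role of these boolean methods more concrete, the following sketch shows how they might be plugged into the filter method of a stream. It is an illustration only: SimpleTweet is a hypothetical stand-in for TweetHandler that carries just the two fields the filters need, and the sample tweets are invented:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for TweetHandler with only the fields the filters need.
final class SimpleTweet {
    private final String language;
    private final String text;

    SimpleTweet(String language, String text) {
        this.language = language;
        this.text = text;
    }

    boolean isEnglish() {
        return this.language.equalsIgnoreCase("en");
    }

    boolean containsCharacter(String character) {
        return this.text.contains(character);
    }

    String getText() {
        return this.text;
    }

    // Keeps only English tweets that mention the given sub-topic.
    static List<String> select(List<SimpleTweet> tweets, String subTopic) {
        return tweets.stream()
                .filter(SimpleTweet::isEnglish)
                .filter(t -> t.containsCharacter(subTopic))
                .map(SimpleTweet::getText)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<SimpleTweet> tweets = Arrays.asList(
                new SimpleTweet("en", "luke skywalker returns"),
                new SimpleTweet("fr", "luke skywalker revient"),
                new SimpleTweet("en", "a tweet about something else"));
        System.out.println(select(tweets, "luke"));
    }
}
```

Because String.contains is case-sensitive, a filter like this works best when the text has already been converted to lowercase and the sub-topic is supplied in lowercase as well.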
Numerous other cleaning operations can be easily added to the process such as removing leading and trailing white space, replacing tabs, and validating dates and email addresses.
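As one possible example of such an operation, the sketch below collapses tabs, newlines, and repeated spaces into single spaces while lowercasing the text. The TextCleaner class is hypothetical and not part of the application. It passes Locale.ROOT to toLowerCase, since lowercasing with the default locale can give unexpected results for some letters on some systems:

```java
import java.util.Locale;

// Hypothetical helper illustrating extra cleaning steps; not from the application.
final class TextCleaner {

    private TextCleaner() {
    }

    // Lowercases the text and collapses tabs, newlines, and repeated spaces.
    static String clean(String text) {
        return text.toLowerCase(Locale.ROOT)
                .replaceAll("\\s+", " ")
                .trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("  Great\tMOVIE,\n  loved   it  "));
    }
}
```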
Removing stop words
Stop words are those words that do not contribute to the understanding or processing of data. Typical stop words include the, and, a, and or. When they do not contribute to the processing of the data, they can be removed to simplify processing and make it more efficient.
There are several techniques for removing stop words, as discussed in Chapter 9, Text Analysis. For this application, we will use LingPipe (http://alias-i.com/lingpipe/) to remove stop words. We use the EnglishStopTokenizerFactory class to obtain a model for our stop words based on an IndoEuropeanTokenizerFactory instance:
public TweetHandler removeStopWords() {
    TokenizerFactory tokenizerFactory = IndoEuropeanTokenizerFactory.INSTANCE;
    tokenizerFactory = new EnglishStopTokenizerFactory(tokenizerFactory);
    ...
    return this;
}
A series of tokens that does not contain stop words is extracted, and a StringBuilder instance is used to create a string to replace the original text:
Tokenizer tokens = tokenizerFactory.tokenizer(
    this.text.toCharArray(), 0, this.text.length());
StringBuilder buffer = new StringBuilder();
for (String word : tokens) {
    buffer.append(word + " ");
}
this.text = buffer.toString();
The LingPipe model we used may not be the best suited for all tweets. In addition, it has been suggested that removing stop words from tweets may not be productive (http://oro.open.ac.uk/40666/). Options to select various stop words and whether stop words should even be removed can be added to the stream process.
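To illustrate how an alternative stop word list could be swapped in, here is a minimal plain-Java sketch. It does not use LingPipe: the SimpleStopWordFilter class and its short word list are assumptions made for this example, and it presumes the text has already been converted to lowercase:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

// Plain-Java sketch; this is not LingPipe's EnglishStopTokenizerFactory.
final class SimpleStopWordFilter {

    // Tiny illustrative list; LingPipe uses its own list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "and", "a", "or", "of", "to"));

    private SimpleStopWordFilter() {
    }

    // Splits on whitespace, drops stop words, and rejoins with single spaces.
    static String removeStopWords(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                .filter(word -> !word.isEmpty())
                .filter(word -> !STOP_WORDS.contains(word))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords("the rise and fall of a hero"));
    }
}
```

Using Collectors.joining avoids the trailing space that appending word + " " in a loop leaves at the end of the string. Splitting on whitespace is cruder than a real tokenizer, so punctuation stays attached to words.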
Performing sentiment analysis
We can now perform sentiment analysis using the model built in the Building the sentiment model section of this chapter. We obtain a Classification object by passing our cleaned text to the classify method. We then use the bestCategory method to determine whether our text is classified as positive or negative. Finally, we set category to the result and return the TweetHandler object:
public TweetHandler performSentimentAnalysis() {
    Classification classification = classifier.classify(this.text);
    String bestCategory = classification.bestCategory();
    this.category = bestCategory;
    return this;
}
We are now ready to analyze the results of our application.
Analyzing the results
The analysis performed in this application is fairly simple. Once the tweets have been classified as either positive or negative, a total is computed. We used two static variables for this purpose:
private static int numberOfPositiveReviews = 0;
private static int numberOfNegativeReviews = 0;
The computeStats method is called from the Java 8 stream and increments the appropriate variable:
public void computeStats() {
    if (this.category.equalsIgnoreCase("pos")) {
        numberOfPositiveReviews++;
    } else {
        numberOfNegativeReviews++;
    }
}
Two static methods provide access to the number of reviews:
public static int getNumberOfPositiveReviews() {
    return numberOfPositiveReviews;
}

public static int getNumberOfNegativeReviews() {
    return numberOfNegativeReviews;
}
In addition, a simple toString method is provided to display basic tweet information:
public String toString() {
    return "\nText: " + this.text
        + "\nDate: " + this.date
        + "\nCategory: " + this.category;
}
More sophisticated analysis can be added as required. The intent of this application was to demonstrate a technique for combining the various data processing tasks.
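One limitation of static counters is that they are shared by every stream that runs, are never reset between runs, and are not safe to increment from a parallel stream. As a hedged alternative, the totals could be produced by the stream itself with a collector. The sketch below is not the application's code: SentimentTally is a hypothetical class, and it works on a plain list of category labels rather than TweetHandler objects:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical alternative to the static counters; not the application's code.
final class SentimentTally {

    private SentimentTally() {
    }

    // Counts how many times each category label occurs.
    static Map<String, Long> tally(List<String> categories) {
        return categories.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        // Labels invented for this sketch.
        Map<String, Long> counts = tally(Arrays.asList("pos", "neg", "pos", "pos"));
        System.out.println("Positive: " + counts.getOrDefault("pos", 0L));
        System.out.println("Negative: " + counts.getOrDefault("neg", 0L));
    }
}
```

A category that never occurs is simply absent from the map, which is why the usage shown reads the totals with getOrDefault.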
Other optional enhancements
There are numerous improvements that can be made to the application. Many of these are user preferences and others relate to improving the results of the application. A GUI would be useful in many situations. Among the user options, we may want to add support for:
- Displaying individual tweets
- Allowing null sub-topics
- Processing other tweet fields
- Providing a list of topics or sub-topics the user can choose from
- Generating additional statistics and supporting charts
With regard to process result improvements, the following should be considered:
- Correct misspellings in user entries
- Remove spacing around punctuation
- Use alternate stop word removal techniques
- Use alternate sentiment analysis techniques
The details of many of these enhancements depend on the GUI used and on the purpose and scope of the application.
Summary
The intent of this chapter was to illustrate how various data science tasks can be integrated into an application. We chose an application that processes tweets because Twitter is a popular social medium and tweets allow us to apply many of the techniques discussed in earlier chapters.
A simple console-based interface was used to avoid cluttering the discussion with specific but possibly irrelevant GUI details. The application prompted the user for a Twitter topic, a sub-topic, and the number of tweets to process. The analysis consisted of determining the sentiments of the tweets, with simple statistics regarding the positive or negative nature of the tweets.
The first step in the process was to build a sentiment model. We used LingPipe classes to build a model and perform the analysis. A Java 8 stream was used and supported a fluent style of programming where the individual processing steps could be easily added and removed.
Once the stream was created, the raw JSON text was processed and used to initialize instances of the TweetHandler class. These instances were subsequently modified and filtered, which included converting the text to lowercase, removing non-English tweets, removing stop words, and selecting only those tweets that contain the sub-topic. Sentiment analysis was then performed, followed by the computation of the statistics.
Data science is a broad topic that utilizes a wide range of statistical and computer science topics. In this book, we provided a brief introduction to many of these topics and how they are supported by Java.