Twitter sentiment analysis using Hive
Twitter is a valuable data source for gauging public sentiment on a wide range of topics. In this recipe, we will take a look at how to perform sentiment analysis on Twitter data using Hive.
Getting ready
To perform this recipe, you should have a running Hadoop cluster as well as the latest version of Hive installed on it. Here, I am using Hive 1.2.1.
How to do it...
First of all, we need a dataset to perform this recipe. We will be using a dataset that can be found at http://s3.amazonaws.com/hw-sandbox/tutorial13/SentimentFiles.zip.
Next, we will unzip this data and upload it to HDFS. The zip contains three folders: the first for raw Twitter data, the second for a dictionary, and the third for a time zone map:
hadoop fs -mkdir /data
hadoop fs -put tweets_raw /data
hadoop fs -put time_zone_map /data
hadoop fs -put dictionary /data
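The upload steps above can also be wrapped in a small shell function, which makes it easier to check each command before touching the cluster. This is only a sketch under a few assumptions: the function name upload_sentiment_data and the HADOOP_CMD variable are hypothetical helpers introduced here, the three folders are assumed to sit directly under the directory you pass in, and -mkdir -p is used instead of a plain -mkdir so that an existing /data directory does not cause an error.

```shell
# HADOOP_CMD is a hypothetical override (not part of Hadoop) so the upload
# can be previewed with HADOOP_CMD=echo on a machine without a cluster.
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

upload_sentiment_data() {
    # $1: local directory holding the three unzipped folders (assumed layout)
    # $2: target HDFS directory, /data in this recipe
    src_dir="$1"
    hdfs_dir="$2"

    # Create the target directory; -p tolerates an existing directory
    "$HADOOP_CMD" fs -mkdir -p "$hdfs_dir" || return 1

    # Upload the folders named in the recipe, stopping at the first failure
    for folder in tweets_raw time_zone_map dictionary; do
        "$HADOOP_CMD" fs -put "$src_dir/$folder" "$hdfs_dir" || return 1
    done

    # List the target directory so a missing folder is easy to spot
    "$HADOOP_CMD" fs -ls "$hdfs_dir"
}
```

Calling upload_sentiment_data . /data from the directory that holds the unzipped folders runs the same four commands as above and then lists /data. Setting HADOOP_CMD=echo first prints the commands instead of running them.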
We use a JSON SerDe JAR so that Hive can read the Twitter data. Add it to the Hive session, as shown here:
ADD JAR json-serde-1.1.9.9-Hive1.2-jar-with...