Data preparation
In Chapter 11, Sentiment Analysis of Twitter Data, we explored how to create a bag of words from the Sentiment140 tweets dataset. In this chapter, we will extend that example by using MongoDB. First, we will prepare the dataset by transforming it from CSV to JSON so that we can add it to a MongoDB collection.
Tip
We can download the Sentiment140 training and test data from http://help.sentiment140.com/for-students.
We will download and open the test data. Its columns represent sentiment, id, date, via, user, and text. The first five records look like this:
4,1,Mon May 11 03:21:41 UTC 2009,kindle2,yamarama,@mikefish Fair enough. But i have the Kindle2 and I think it's perfect :)
4,2,Mon May 11 03:26:10 UTC 2009, jquery,dcostalis,Jquery is my new best friend.
4,3,Mon May 11 03:27:15 UTC 2009,twitter,PJ_King,Loves twitter
4,4,Mon May 11 03:29:20 UTC 2009,obama,mandanicole,how can you not love Obama? he makes jokes about himself.
4,5,Mon May 11 05:22:12 UTC 2009,lebron...
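As a rough illustration of the CSV-to-JSON step, the following minimal sketch maps each record to a dictionary keyed by the six column names listed above and serializes it as one JSON document per record. The helper name rows_to_documents and the in-memory sample are assumptions made for this sketch, not part of the dataset or of MongoDB. All values are kept as strings, and any extra cells are rejoined in case an unquoted tweet contains commas.

```python
import csv
import io
import json

# Column names as described in the text above.
FIELDS = ["sentiment", "id", "date", "via", "user", "text"]


def rows_to_documents(csv_file):
    """Yield one dict per CSV record, keyed by FIELDS (hypothetical helper)."""
    for row in csv.reader(csv_file):
        if len(row) < len(FIELDS):
            continue  # skip blank or incomplete lines
        # Rejoin any extra cells so commas inside an unquoted tweet survive.
        head, text = row[:5], ",".join(row[5:])
        yield dict(zip(FIELDS, head + [text]))


# Two records copied from the sample above stand in for the real file.
sample = io.StringIO(
    "4,1,Mon May 11 03:21:41 UTC 2009,kindle2,yamarama,"
    "@mikefish Fair enough. But i have the Kindle2 and I think it's perfect :)\n"
    "4,3,Mon May 11 03:27:15 UTC 2009,twitter,PJ_King,Loves twitter\n"
)

documents = list(rows_to_documents(sample))

# One JSON object per line, a layout that is convenient for bulk loading.
json_lines = [json.dumps(doc) for doc in documents]
print(json_lines[1])
```

To process the downloaded file instead of the sample, we could pass an open file handle to rows_to_documents and write each element of json_lines to an output file.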