Data preparation
In Chapter 11, Working with Twitter Data, we explored how to create a bag of words from the Sentiment140 tweets dataset. In this chapter, we will extend that example using MongoDB. First, we will transform the dataset from CSV into JSON format so that we can load it into a MongoDB collection.
Tip
We can download the Sentiment140 training and test data from http://help.sentiment140.com/for-students.
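As a rough illustration of the CSV-to-JSON transformation described above, the following minimal sketch uses only the Python standard library. It assumes six columns named sentiment, id, date, via, user, and text, in that order. The helper name rows_to_documents is hypothetical and is not taken from the book's code. All values are kept as strings, and any unquoted commas inside the tweet text are joined back into the text field.

```python
import csv
import io
import json

# Column names assumed from the description in this section.
FIELDS = ["sentiment", "id", "date", "via", "user", "text"]


def rows_to_documents(lines):
    """Convert an iterable of CSV lines into a list of dicts (hypothetical helper)."""
    documents = []
    for row in csv.reader(lines):
        if len(row) < len(FIELDS):
            continue  # skip blank or incomplete lines
        # Re-join any unquoted commas that split the tweet text into extra fields.
        values = row[:5] + [",".join(row[5:])]
        documents.append(dict(zip(FIELDS, values)))
    return documents


# One record in the same layout as the test data, held in memory for illustration.
sample = io.StringIO(
    "4,3,Mon May 11 03:27:15 UTC 2009,twitter,PJ_King,Loves twitter\n"
)
docs = rows_to_documents(sample)
print(json.dumps(docs, indent=2))
```

For a real file, the same function can be given an open file object instead of the in-memory sample, and the resulting list can be written out with json.dump before it is loaded into MongoDB.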
We will download and open the test data; the columns represent sentiment, id, date, via, user, and text. The first five records will look similar to this:
4,1,Mon May 11 03:21:41 UTC 2009,kindle2,yamarama,@mikefish Fair enough. But i have the Kindle2 and I think it's perfect :)
4,2,Mon May 11 03:26:10 UTC 2009, jquery,dcostalis,Jquery is my new best friend.
4,3,Mon May 11 03:27:15 UTC 2009,twitter,PJ_King,Loves twitter
4,4,Mon May 11 03:29:20 UTC 2009,obama,mandanicole,how can you not love Obama? he makes jokes about himself.
4,5,Mon May 11 05:22:12 UTC 2009,lebron...