Case study - AlphaGo tweets analytics
Now that we have a good understanding of GraphX, let's apply our newly gained knowledge to analyze a retweet network. As with any big data project, the first task is to define a pipeline: identify the data elements and their source, then work out the transformations, mapping, and processing.
Data pipeline
For this case study, I collected Twitter data pertaining to the AlphaGo project.
While the full mechanics of collecting data from Twitter are out of scope, I will quickly mention the main steps:
- Using Python and the tweepy library, download the tweets mentioning the hashtag #alphago. On the first run, pull all the tweets that Twitter will return; on later runs, pass the ID of the newest tweet already collected as the since ID so that only newer tweets are fetched.
- Use application authentication for a higher rate limit. Twitter implements rate limiting, so the number of tweets one can get without a firehose subscription is limited. Even so, I collected approximately 300K tweets and 2 GB worth of data.
- Store the data in MongoDB...
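The incremental collection described in the first step can be sketched as follows. This is a minimal illustration, not the code I used for the case study: `fetch_page`, `collect_new_tweets`, `newest_id`, and `make_fake_fetch` are hypothetical names, and the sketch assumes the search call treats the since ID as exclusive and a max ID as inclusive, returning tweets newest first. In a real run, `fetch_page` would wrap the tweepy search call and the returned tweets would be written to the database; here a fake in-memory source stands in so the logic can be run offline.

```python
from typing import Any, Callable, Dict, List, Optional

Tweet = Dict[str, Any]
# Hypothetical signature: fetch_page(since_id, max_id) -> one page of tweets,
# newest first. Assumes since_id is exclusive and max_id is inclusive.
FetchPage = Callable[[Optional[int], Optional[int]], List[Tweet]]


def collect_new_tweets(fetch_page: FetchPage,
                       since_id: Optional[int] = None) -> List[Tweet]:
    """Return every tweet newer than since_id, paging backwards in time."""
    collected: List[Tweet] = []
    max_id: Optional[int] = None
    while True:
        page = fetch_page(since_id, max_id)
        if not page:
            break
        collected.extend(page)
        # Ask next for tweets strictly older than the oldest one seen so far.
        max_id = min(int(t["id"]) for t in page) - 1
    return collected


def newest_id(tweets: List[Tweet],
              previous: Optional[int] = None) -> Optional[int]:
    """Bookmark to pass as since_id on the next run."""
    ids = [int(t["id"]) for t in tweets]
    if previous is not None:
        ids.append(previous)
    return max(ids) if ids else None


def make_fake_fetch(all_ids: List[int], page_size: int = 2) -> FetchPage:
    """Offline stand-in for the real search call (illustration only)."""
    def fetch_page(since_id: Optional[int],
                   max_id: Optional[int]) -> List[Tweet]:
        ids = sorted(
            (i for i in all_ids
             if (since_id is None or i > since_id)
             and (max_id is None or i <= max_id)),
            reverse=True,
        )
        return [{"id": i} for i in ids[:page_size]]
    return fetch_page
```

The first run calls `collect_new_tweets(fetch_page)` with no since ID and saves `newest_id(...)` of the result. Each later run passes that saved value as `since_id`, so only tweets posted after the previous run are downloaded, which helps stay within the rate limit.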