Building a tweet analysis capability
In earlier chapters, we used various implementations of Twitter data analysis to illustrate several concepts. Now we'll take this capability to a deeper level and approach it as a major case study.
In this chapter, we will build a data ingest pipeline: a production-ready dataflow designed with reliability and future evolution in mind.
We'll build out the pipeline incrementally throughout the chapter. At each stage, we'll highlight what has changed; including full listings every time would treble the size of the chapter. The source code for this chapter, however, contains every iteration in its full glory.
Getting the tweet data
The first thing we need to do is get the actual tweet data. As in previous examples, we can pass the -j and -n arguments to stream.py to dump JSON tweets to stdout:
$ stream.py -j -n 10000 > tweets.json
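Each line of tweets.json should now contain a single JSON-encoded tweet. Before wiring anything into a pipeline, it's worth a quick sanity check of the file. Here is a minimal sketch, assuming stream.py emits newline-delimited JSON (one tweet object per line); the filename and field inspection are only illustrative:

import json

# Count the tweets in the sample and peek at the fields available to us.
# Assumes newline-delimited JSON: one tweet object per line.
with open("tweets.json") as f:
    tweets = [json.loads(line) for line in f if line.strip()]

print(f"loaded {len(tweets)} tweets")
print(f"sample fields: {sorted(tweets[0].keys())}")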
Since we have this tool that can create a batch of sample tweets on demand, we could start our ingest...