Part 1 – Acquiring the data with Spark Structured Streaming
To acquire the data, we use Tweepy, which provides an elegant Python client library for accessing the Twitter APIs. Tweepy covers a very extensive set of APIs, and describing them in detail is beyond the scope of this book, but you can find the complete API reference on the official Tweepy website: http://tweepy.readthedocs.io/en/v3.6.0/cursor_tutorial.html.
You can install the Tweepy library directly from PyPI using the pip install command. The following command shows how to install it from a Notebook using the ! directive:
!pip install tweepy
Note: The Tweepy version used here is 3.6.0. Do not forget to restart the kernel after installing the library.
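After restarting the kernel, it can be useful to confirm which Tweepy version the kernel actually sees. The following is a minimal sketch, assuming Python 3.8 or later (required for importlib.metadata); the installed_version helper is a hypothetical name introduced here for illustration and is not part of Tweepy:

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent.

    Hypothetical helper for illustration; it only reads package metadata
    and does not import the package itself.
    """
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report what the current kernel's environment contains.
version = installed_version("tweepy")
if version is None:
    print("tweepy is not installed in this kernel's environment")
else:
    print("tweepy version:", version)
```

If the reported version differs from 3.6.0, some of the calls used later may behave differently, because the Tweepy API has changed between major releases.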
Architecture diagram for the data pipeline
Before diving into each component of the data pipeline, it helps to look at its overall architecture and understand the computation flow.
As shown in the following diagram, we start by creating a Tweepy stream that...