There are several ways to gather Twitter data, from web scraping to using custom libraries, each with its own advantages and disadvantages. Because our implementation also needs sentiment labels, we will use the Sentiment140 dataset (http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip). We do not collect our own data mainly because of the time it would take to label it. In the last section of this chapter, we will see how to collect our own data and analyze it in real time. The dataset consists of 1.6 million tweets, each with the following six fields:
- The tweet's polarity
- A numeric ID
- The date it was tweeted
- The query used to record the tweet
- The user's name
- The tweet's text content
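
As a minimal sketch of how these six fields could be read, the following snippet parses rows in the field order listed above and keeps only the text and polarity. It uses only the standard library. The column names in `FIELDS`, the helper `load_text_and_polarity`, and the two sample rows are our own illustrative assumptions, not part of the dataset.

```python
import csv
import io

# Our own labels for the six fields, in the order listed above.
# We assume the file has no header row.
FIELDS = ["polarity", "id", "date", "query", "user", "text"]


def load_text_and_polarity(source):
    """Yield (text, polarity) pairs from an open CSV file-like object."""
    for row in csv.reader(source):
        if len(row) != len(FIELDS):
            continue  # skip rows that do not have exactly six fields
        record = dict(zip(FIELDS, row))
        yield record["text"], int(record["polarity"])


# Two made-up rows that mimic the dataset's layout (not real tweets).
sample = io.StringIO(
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","feeling sad today"\n'
    '"4","2","Mon Apr 06 22:20:00 PDT 2009","NO_QUERY","user_b","what a great day"\n'
)

pairs = list(load_text_and_polarity(sample))
print(pairs)
```

To run this on the real data, pass an open file handle instead of `sample`. The downloaded file will likely need a non-UTF-8 encoding such as `latin-1` when opened; treat that as an assumption to verify against your copy.
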
For our models, we will only need the tweet's text and polarity. As can be seen in the following graph, there...