Getting data from Twitter with the twitteR package
Twitter is an unbeatable source of data for nearly every kind of data-driven problem.
If my words are not enough to convince you, you can always run a quick Google search, for instance, for text analytics with Twitter, and browse the more than 30 million results to be sure.
This should not surprise you: Twitter's huge, worldwide user base, together with the relatively structured content and rich metadata available on the platform, makes this social network a go-to source for data analysis projects, especially those involving sentiment analysis and customer segmentation.
R comes with a really well-developed package named twitteR, written by Jeff Gentry, which offers a function for nearly every feature made available by Twitter through the API. The following recipe covers the typical use of the package: getting tweets related to a topic.
Getting ready
First of all, we have to install and load the twitteR package by running the following code:
install.packages("twitteR")
library(twitteR)
How to do it…
- As seen with the general procedure, in order to access the Twitter API, you will need to create a new application. This link (assuming you are already logged in to Twitter) will do the job: https://apps.twitter.com/app/new.
Feel free to give your app whatever name, description, and website you want. The callback URL can also be left blank.
After creating the app, you will have access to an API key and an API secret, namely Consumer Key and Consumer Secret, in the Keys and Access Tokens tab in your app settings.
Below the section containing these tokens, you will find a section called Your Access Token. These tokens are required to let the app perform actions on your account's behalf. For instance, you may want to send direct messages to all new followers and could therefore write an app to do that automatically.
Keep a note of these tokens as well, since you will need them to set up your connection within R.
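A quick note on handling these credentials: rather than hard-coding them in your scripts, you can keep them in environment variables (for example, in your .Renviron file) and read them at runtime. The variable names below are purely illustrative choices, not anything the twitteR package requires:
# Illustrative only: in practice, put these values in ~/.Renviron as NAME=value
# pairs, one per line, and restart R so they are picked up automatically.
Sys.setenv(TWITTER_CONSUMER_KEY    = "your_consumer_key",
           TWITTER_CONSUMER_SECRET = "your_consumer_secret",
           TWITTER_ACCESS_TOKEN    = "your_access_token",
           TWITTER_ACCESS_SECRET   = "your_access_secret")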
- Then, we will get access to the API from R. In order to authenticate your app and use it to retrieve data from Twitter, you just need to run one line of code, specifically, the setup_twitter_oauth() function, passing the following arguments:
consumer_key
consumer_secret
access_token
access_secret
You can get these tokens from your app settings:
setup_twitter_oauth(consumer_key = "consumer_key",
                    consumer_secret = "consumer_secret",
                    access_token = "access_token",
                    access_secret = "access_secret")
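If you stored the tokens in environment variables as sketched earlier, the same call can be written without any literal secrets. This is just a variant of the call above, assuming the illustrative variable names used before:
setup_twitter_oauth(consumer_key    = Sys.getenv("TWITTER_CONSUMER_KEY"),
                    consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET"),
                    access_token    = Sys.getenv("TWITTER_ACCESS_TOKEN"),
                    access_secret   = Sys.getenv("TWITTER_ACCESS_SECRET"))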
- Now, we will query Twitter and store the resulting data. We are finally ready for the core part: getting data from Twitter. Since we are looking for tweets pertaining to a specific topic, we are going to use the searchTwitter() function. This function allows you to specify a good number of parameters besides the search string. You can define the following (a combined example is sketched after this list):
n: This is the maximum number of tweets to be downloaded.
lang: This is the language, specified with its ISO 639-1 code. You can find a partial list of these codes at https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes.
since – until: These are time parameters that define a range of time, where dates are expressed as YYYY-MM-DD, for instance, 2012-05-12.
geocode: This restricts results to a geographical area, expressed as latitude, longitude, and radius, the latter in miles or kilometers, for example, 38.481157,-130.500342,1mi.
sinceID – maxID: These define a range of tweet (status) IDs to search within.
resultType: This is used to filter results based on popularity. Possible values are 'mixed', 'recent', and 'popular'.
retryOnRateLimit: This is the number of times the query will be retried if the API rate limit is reached.
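To see how these parameters fit together, here is a small sketch; the search string, dates, and counts are arbitrary illustrations rather than values required by the recipe:
# Example only: English tweets about "data science" in a given date window,
# preferring recent results and retrying a few times if the rate limit is hit
recent_tweets <- searchTwitter("data science",
                               n = 100,
                               lang = "en",
                               since = "2015-10-01",
                               until = "2015-10-15",
                               resultType = "recent",
                               retryOnRateLimit = 5)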
Supposing that we are interested in tweets regarding data science with R, we run the following function:
tweet_list <- searchTwitter('data science with R', n = 450)
Tip
Performing an exact-phrase search with twitteR
Searching Twitter for an exact sequence of characters is possible by surrounding the query with double quotes, for instance, "data science with R". Consequently, if you want to retrieve only tweets containing that exact phrase, you have to run a line of code similar to the following:
tweet_list <- searchTwitter('"data science with R"', n = 450)
tweet_list will be a list of the first 450 tweets resulting from the given query. Be aware that since n is the maximum number of tweets to retrieve, you may get fewer tweets if the given query matches fewer than n results. Each element of the list exposes the following attributes (a short access example follows the list):
text
favorited
favoriteCount
replyToSN
created
truncated
replyToSID
id
replyToUID
statusSource
screenName
retweetCount
isRetweet
retweeted
longitude
latitude
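To give a feel for how these attributes are used, the following minimal sketch reads a few fields from the first element of the list; it assumes the query above returned at least one tweet:
# Each element of tweet_list is a status object whose fields are read with $
first_tweet <- tweet_list[[1]]
first_tweet$text          # the tweet text
first_tweet$screenName    # who posted it
first_tweet$created      # when it was posted
first_tweet$retweetCount  # how many times it was retweeted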
In order to let you work on this data more easily, a specific function is provided to transform this list into a more convenient data.frame, namely, the twListToDF() function. After this, we can run the following line of code:
tweet_df <- twListToDF(tweet_list)
This will result in a tweet_df object that has the following structure:
> str(tweet_df)
'data.frame': 20 obs. of 16 variables:
 $ text         : chr "95% off Applied Data Science with R -
 $ favorited    : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ favoriteCount: num 0 2 0 2 0 0 0 0 0 1 ...
 $ replyToSN    : logi NA NA NA NA NA NA ...
 $ created      : POSIXct, format: "2015-10-16 09:03:32" "2015-10-15 17:40:33" "2015-10-15 11:33:37" "2015-10-15 05:17:59" ...
 $ truncated    : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ replyToSID   : logi NA NA NA NA NA NA ...
 $ id           : chr "654945762384740352" "654713487097135104" "654621142179819520" "654526612688375808" ...
 $ replyToUID   : logi NA NA NA NA NA NA ...
 $ statusSource : chr "<a href=\"http://learnviral.com/\" rel=\"nofollow\">Learn Viral</a>" "<a href=\"https://about.twitter.com/products/tweetdeck\" rel=\"nofollow\">TweetDeck</a>" "<a href=\"http://not.yet/\" rel=\"nofollow\">final one kk</a>" "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>" ...
 $ screenName   : chr "Learn_Viral" "WinVectorLLC" "retweetjava" "verystrongjoe" ...
 $ retweetCount : num 0 0 1 1 0 0 0 2 2 2 ...
 $ isRetweet    : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
 $ retweeted    : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ longitude    : logi NA NA NA NA NA NA ...
 $ latitude     : logi NA NA NA NA NA NA ...
Referring you to the data visualization section for more advanced techniques, we will now quickly visualize the retweet distribution of our tweets, leveraging the base R hist() function:
hist(tweet_df$retweetCount)
This code will produce a histogram with the number of retweets on the x axis and the frequency of those counts on the y axis.
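If you want the plot to be a little more self-explanatory, hist() accepts the usual base graphics labelling arguments; the titles below are illustrative choices:
# Same histogram with an explicit title and axis labels
hist(tweet_df$retweetCount,
     main = "Retweet distribution",
     xlab = "Number of retweets",
     ylab = "Frequency")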
There's more...
As stated in the official Twitter documentation, particularly at https://dev.twitter.com/rest/public/rate-limits, there is a limit to the number of tweets you can retrieve within a certain period of time, and this limit is set to 450 every 15 minutes.
However, what if you are engaged in a more demanding job and you want to base your work on a significant number of tweets? Should you set the n argument of searchTwitter() to 450 and wait for 15 everlasting minutes between queries? Not quite: the twitteR package provides a convenient way to work around this limit through the register_db_backend(), register_sqlite_backend(), and register_mysql_backend() functions. These functions let you create a connection with the corresponding type of database, passing the database name, host (or file path), username, and password as arguments, as you can see in the following example:
register_mysql_backend("db_name", "host", "user", "password")
You can now leverage the search_twitter_and_store() function, which stores the search results in the connected database. The main feature of this function is the retryOnRateLimit argument, which lets you specify how many times the query should be retried once the API rate limit is reached. Setting this argument to a suitably high value will likely let you ride out the 15-minute interval:
tweets_db = search_twitter_and_store("data science R", retryOnRateLimit = 20)
Retrieving stored data will now just require you to run the following code:
from_db = load_tweets_db()
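Putting the pieces together, a minimal end-to-end sketch with the SQLite backend might look like the following. The file name and query are illustrative, and the exact arguments of register_sqlite_backend() may vary slightly across package versions, so check ?register_sqlite_backend before relying on this:
# Illustrative flow: persist search results to SQLite, then reload them
register_sqlite_backend("tweets.sqlite")        # file-based backend, no server needed
search_twitter_and_store("data science R",
                         retryOnRateLimit = 20) # keep retrying past the rate limit
stored_tweets <- load_tweets_db()               # returns a list of status objects
stored_df <- twListToDF(stored_tweets)          # back to a convenient data.frame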