Getting data from Twitter with the twitteR package
Twitter is an unbeatable source of data for nearly every kind of data-driven problem.
If my words are not enough to convince you, you can always run a quick Google search, for instance, for text analytics with Twitter, and browse the more than 30 million results to be sure.
This should not surprise you: Twitter's huge, worldwide user base, together with the relatively structured content and rich metadata available on the platform, makes this social network a go-to source for data analysis projects, especially those involving sentiment analysis and customer segmentation.
R comes with a really well-developed package named twitteR, written by Jeff Gentry, which offers a function for nearly every feature made available by Twitter through the API. The following recipe covers the typical use of the package: getting tweets related to a topic.
Getting ready
First of all, we have to install and load the twitteR package by running the following code:
install.packages("twitteR")
library(twitteR)
How to do it…
- As seen with the general procedure, in order to access the Twitter API, you will need to create a new application. This link (assuming you are already logged in to Twitter) will do the job: https://apps.twitter.com/app/new.
Feel free to give your app whatever name, description, and website you want. The callback URL can also be left blank.
After creating the app, you will have access to an API key and an API secret, namely Consumer Key and Consumer Secret, in the Keys and Access Tokens tab in your app settings.
Below the section containing these tokens, you will find a section called Your Access Token. These tokens are required to let the app perform actions on your account's behalf. For instance, you may want to send direct messages to all new followers and could therefore write an app to do that automatically.
Keep a note of these tokens as well, since you will need them to set up your connection within R.
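A quick note on handling these credentials: rather than hard-coding them in your scripts, you can keep them in environment variables (for example, in your .Renviron file) and read them at runtime. The variable names below are purely illustrative choices, not anything the twitteR package requires:
# Illustrative only: in practice, put these values in ~/.Renviron as NAME=value
# pairs, one per line, and restart R so they are picked up automatically.
Sys.setenv(TWITTER_CONSUMER_KEY    = "your_consumer_key",
           TWITTER_CONSUMER_SECRET = "your_consumer_secret",
           TWITTER_ACCESS_TOKEN    = "your_access_token",
           TWITTER_ACCESS_SECRET   = "your_access_secret")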
- Then, we will get access to the API from R. In order to authenticate your app and use it to retrieve data from Twitter, you just need to run one line of code, specifically, the setup_twitter_oauth() function, passing the following arguments:
consumer_key
consumer_secret
access_token
access_secret
You can get these tokens from your app settings:
setup_twitter_oauth(consumer_key = "consumer_key",
                    consumer_secret = "consumer_secret",
                    access_token = "access_token",
                    access_secret = "access_secret")
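If you stored the tokens in environment variables as sketched earlier, the same call can be written without any literal secrets. This is just a variant of the call above, assuming the illustrative variable names used before:
setup_twitter_oauth(consumer_key    = Sys.getenv("TWITTER_CONSUMER_KEY"),
                    consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET"),
                    access_token    = Sys.getenv("TWITTER_ACCESS_TOKEN"),
                    access_secret   = Sys.getenv("TWITTER_ACCESS_SECRET"))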
- Now, we will query Twitter and store the resulting data. We are finally ready for the core part: getting data from Twitter. Since we are looking for tweets pertaining to a specific topic, we are going to use the searchTwitter() function. This function allows you to specify a good number of parameters besides the search string. You can define the following (a combined example is sketched after this list):
n: This is the maximum number of tweets to be downloaded.
lang: This is the language, specified with its ISO 639-1 code. You can find a partial list of these codes at https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes.
since – until: These are time parameters that define a range of time, where dates are expressed as YYYY-MM-DD, for instance, 2012-05-12.
geocode: This restricts results to a geographical area, expressed as latitude, longitude, and radius, the latter in miles or kilometers, for example, 38.481157,-130.500342,1mi.
sinceID – maxID: These define a range of tweet (status) IDs to search within.
resultType: This is used to filter results based on popularity. Possible values are 'mixed', 'recent', and 'popular'.
retryOnRateLimit: This is the number of times the query will be retried if the API rate limit is reached.
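To see how these parameters fit together, here is a small sketch; the search string, dates, and counts are arbitrary illustrations rather than values required by the recipe:
# Example only: English tweets about "data science" in a given date window,
# preferring recent results and retrying a few times if the rate limit is hit
recent_tweets <- searchTwitter("data science",
                               n = 100,
                               lang = "en",
                               since = "2015-10-01",
                               until = "2015-10-15",
                               resultType = "recent",
                               retryOnRateLimit = 5)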
Supposing that we are interested in tweets regarding data science with R, we run the following function:
tweet_list <- searchTwitter('data science with R', n = 450)
Tip
Performing an exact-phrase search with twitteR
Searching Twitter for an exact sequence of characters is possible by surrounding the query with double quotes, for instance, "data science with R". Consequently, if you want to retrieve only tweets containing that exact phrase, you have to run a line of code similar to the following:
tweet_list <- searchTwitter('"data science with R"', n = 450)
tweet_list will be a list of the first 450 tweets resulting from the given query. Be aware that since n is the maximum number of tweets to retrieve, you may get fewer tweets if the given query matches fewer than n results. Each element of the list exposes the following attributes (a short access example follows the list):
text
favorited
favoriteCount
replyToSN
created
truncated
replyToSID
id
replyToUID
statusSource
screenName
retweetCount
isRetweet
retweeted
longitude
latitude
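To give a feel for how these attributes are used, the following minimal sketch reads a few fields from the first element of the list; it assumes the query above returned at least one tweet:
# Each element of tweet_list is a status object whose fields are read with $
first_tweet <- tweet_list[[1]]
first_tweet$text          # the tweet text
first_tweet$screenName    # who posted it
first_tweet$created      # when it was posted
first_tweet$retweetCount  # how many times it was retweeted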
In order to let you work on this data more easily, a specific function is provided to transform this list into a more convenient data.frame, namely, the twListToDF() function. After this, we can run the following line of code:
tweet_df <- twListToDF(tweet_list)
This will result in a tweet_df object that has the following structure:
> str(tweet_df)
'data.frame': 20 obs. of 16 variables:
 $ text         : chr "95% off Applied Data Science with R -
 $ favorited    : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ favoriteCount: num 0 2 0 2 0 0 0 0 0 1 ...
 $ replyToSN    : logi NA NA NA NA NA NA ...
 $ created      : POSIXct, format: "2015-10-16 09:03:32" "2015-10-15 17:40:33" "2015-10-15 11:33:37" "2015-10-15 05:17:59" ...
 $ truncated    : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ replyToSID   : logi NA NA NA NA NA NA ...
 $ id           : chr "654945762384740352" "654713487097135104" "654621142179819520" "654526612688375808" ...
 $ replyToUID   : logi NA NA NA NA NA NA ...
 $ statusSource : chr "<a href=\"http://learnviral.com/\" rel=\"nofollow\">Learn Viral</a>" "<a href=\"https://about.twitter.com/products/tweetdeck\" rel=\"nofollow\">TweetDeck</a>" "<a href=\"http://not.yet/\" rel=\"nofollow\">final one kk</a>" "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>" ...
 $ screenName   : chr "Learn_Viral" "WinVectorLLC" "retweetjava" "verystrongjoe" ...
 $ retweetCount : num 0 0 1 1 0 0 0 2 2 2 ...
 $ isRetweet    : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
 $ retweeted    : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ longitude    : logi NA NA NA NA NA NA ...
 $ latitude     : logi NA NA NA NA NA NA ...
Referring you to the data visualization section for more advanced techniques, we will now quickly visualize the retweet distribution of our tweets, leveraging the base R hist() function:
hist(tweet_df$retweetCount)
This code will produce a histogram with the number of retweets on the x axis and the frequency of those counts on the y axis.
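If you want the plot to be a little more self-explanatory, hist() accepts the usual base graphics labelling arguments; the titles below are illustrative choices:
# Same histogram with an explicit title and axis labels
hist(tweet_df$retweetCount,
     main = "Retweet distribution",
     xlab = "Number of retweets",
     ylab = "Frequency")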
There's more...
As stated in the official Twitter documentation, particularly at https://dev.twitter.com/rest/public/rate-limits, there is a limit to the number of tweets you can retrieve within a certain period of time, and this limit is set to 450 every 15 minutes.
However, what if you are engaged in a more demanding job and you want to base your work on a significant number of tweets? Should you set the n argument of searchTwitter() to 450 and wait for 15 everlasting minutes between queries? Not quite: the twitteR package provides a convenient way to work around this limit through the register_db_backend(), register_sqlite_backend(), and register_mysql_backend() functions. These functions let you create a connection with the corresponding type of database, passing the database name, host (or file path), username, and password as arguments, as you can see in the following example:
register_mysql_backend("db_name", "host", "user", "password")
You can now leverage the search_twitter_and_store() function, which stores the search results in the connected database. The main feature of this function is the retryOnRateLimit argument, which lets you specify how many times the query should be retried once the API rate limit is reached. Setting this argument to a suitably high value will likely let you ride out the 15-minute interval:
tweets_db = search_twitter_and_store("data science R", retryOnRateLimit = 20)
Retrieving stored data will now just require you to run the following code:
from_db = load_tweets_db()
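Putting the pieces together, a minimal end-to-end sketch with the SQLite backend might look like the following. The file name and query are illustrative, and the exact arguments of register_sqlite_backend() may vary slightly across package versions, so check ?register_sqlite_backend before relying on this:
# Illustrative flow: persist search results to SQLite, then reload them
register_sqlite_backend("tweets.sqlite")        # file-based backend, no server needed
search_twitter_and_store("data science R",
                         retryOnRateLimit = 20) # keep retrying past the rate limit
stored_tweets <- load_tweets_db()               # returns a list of status objects
stored_df <- twListToDF(stored_tweets)          # back to a convenient data.frame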