R provides several packages that support web scraping, including XML, RCurl, and rjson/RJSONIO/jsonlite. The XML package parses XML and HTML documents and supports XPath queries for searching them.
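As a minimal sketch of XPath searching with the XML package, the snippet below parses a small HTML fragment held in a string and extracts the text of selected nodes. The fragment and its `pkg` class name are made up for illustration; a real page would be fetched from a URL first.

```r
library(XML)

# Hypothetical HTML fragment, used here instead of a live web page
html <- paste0(
  "<html><body><ul>",
  "<li class='pkg'>rvest</li>",
  "<li class='pkg'>RSelenium</li>",
  "<li>something else</li>",
  "</ul></body></html>"
)

# Parse the string as HTML (asText = TRUE: the input is content, not a file name)
doc <- htmlParse(html, asText = TRUE)

# XPath query: every <li> whose class attribute is "pkg"; xmlValue returns its text
pkgs <- xpathSApply(doc, "//li[@class='pkg']", xmlValue)
print(pkgs)
```

The XPath expression is where the selection logic lives, so switching to other elements or attributes usually means changing only that string.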
The RCurl package transfers data over a variety of protocols: it can compose general HTTP requests, retrieve URLs, submit forms, and so on. It does all of this through the libcurl library. JSON stands for JavaScript Object Notation and is one of the most common data formats used on the web. The rjson, RJSONIO, and jsonlite packages convert R objects to JSON and JSON back into R objects.
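The conversion between R objects and JSON can be sketched with jsonlite as follows. The data frame and its column names are invented for illustration; rjson and RJSONIO offer similar functions with slightly different defaults.

```r
library(jsonlite)

# Hypothetical data frame to serialise
df <- data.frame(
  package = c("rvest", "RSelenium"),
  uses_browser = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

# R object -> JSON text (data frames become an array of row objects by default)
json <- toJSON(df)
print(json)

# JSON text -> R object (an array of objects is simplified back into a data frame)
back <- fromJSON(json)
print(back)
```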
Web scraping is the collection of unstructured data, mostly text, from the web. Sources such as blogs, online newspapers, and social networking platforms provide a large amount of text data. This is especially valuable for researchers working in fields such as the social sciences and linguistics. Companies like Google, Facebook, Twitter, and Amazon provide APIs that allow analysts to retrieve data.
You can access these APIs from R and collect data. For Google services, the RGoogleStorage and RgoogleMaps packages are available. The twitteR and streamR packages are used to retrieve data from Twitter.
For Amazon services, there is the AWS.tools package, which provides access to Amazon Web Services (EC2/S3), and the MTurkR package, which provides access to the Amazon Mechanical Turk Requester API. To access news content, the GuardianR package can be used. This package provides an interface to the Content API of the Guardian Media Group's Open Platform.
In the same vein, the RNYTimes package provides broad access to New York Times web services, including articles, metadata, and user-generated content.
R also has packages dedicated to web scraping itself. In this book, we will look at the two that are best known and most widely used: rvest and RSelenium.
The rvest package, inspired by Python's Beautiful Soup library, simplifies scraping data from HTML web pages. It is designed to work with the pipe operator from the magrittr package. This makes it easy and practical to build web scraping scripts out of simple, easy-to-understand parts.
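The pipe-based style described above can be sketched as follows. This is a minimal example that assumes rvest 1.0.0 or later (for `html_elements()`) and parses a made-up HTML string rather than a live page; the `pkg` class name is hypothetical.

```r
library(rvest)

# Hypothetical HTML, standing in for a page you would normally fetch by URL
html <- paste0(
  "<html><body><h1>Packages</h1><ul>",
  "<li class='pkg'> rvest </li>",
  "<li class='pkg'> RSelenium </li>",
  "</ul></body></html>"
)

# Each step is one small part, chained with the magrittr pipe:
# parse the document, select nodes with a CSS selector, extract their text
pkgs <- read_html(html) %>%
  html_elements("li.pkg") %>%
  html_text(trim = TRUE)

print(pkgs)
```

To scrape a real site, the string passed to `read_html()` would be replaced by the page's URL, and the CSS selector adjusted to that page's markup.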
Selenium is a web automation tool that was originally developed for testing web applications, not specifically for scraping. However, you can also use Selenium to develop web scraping scripts. Selenium works by driving a real web browser. Because every page must be rendered in the browser, the data collection process can be slow.
Headless browsers such as PhantomJS can speed up this process. The RSelenium package allows you to connect to a Selenium Server from R. It supports unit testing and regression testing of web apps and web pages across a variety of browsers and operating systems.