You're reading from R Web Scraping Quick Start Guide: Techniques and tools to crawl and scrape data from websites

Product type Paperback

Published in Oct 2018

Publisher Packt

ISBN-13 9781789138733

Length 114 pages

Edition 1st Edition

Concepts

Data Mining

Author (1):

Olgun Aydin

Web scraping techniques

Web scraping techniques open a new world for researchers by automatically extracting structured datasets from human-readable web content. A web scraper accesses web pages, finds the specified data items on each page, extracts them, transforms them into other formats if necessary, and finally saves the result as a structured dataset.

In effect, a scraper imitates a web browser: it requests web pages and saves them to local storage. Researchers then clean and organize this content before analyzing it.

A web scraper automates the otherwise manual process of gathering data from many web pages and assembling structured datasets from complex, unstructured text that spans thousands, or even millions, of individual pages. Web scraping discussions often bring with them questions about legality and fair use.

In theory, web scraping is the practice of collecting data in any way other than a program interacting with an API. This is usually accomplished by writing an automated program that queries a web server, requests data, and then parses that data to extract the necessary information.
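
The request, parse, extract, transform, and save steps described above can be sketched in a few lines. The following is only an illustration in Python using the standard library, not the approach used later in this book: the page content, the column names, and the helper names (`TableParser`, `scrape`, `to_csv`) are all made up for the example, and the download step is replaced by an in-memory string so that the sketch runs without network access.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content. A real scraper would download this from a
# web server instead of hard-coding it.
SAMPLE_PAGE = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Alpha</td><td>1,200</td></tr>
  <tr><td>Beta</td><td>3,400</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = None        # cells of the row currently being read
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape(html):
    """Find the table cells, then transform each row into a typed record."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [
        {header[0]: name.strip(), header[1]: int(count.replace(",", ""))}
        for name, count in body
    ]


def to_csv(rows):
    """Save step: serialize the records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


records = scrape(SAMPLE_PAGE)
csv_text = to_csv(records)
print(csv_text)
```

The transform step here is deliberately small (stripping whitespace and converting "1,200" to an integer); real pages usually need more cleaning before the data is usable.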

There are many different web scraping techniques. This section describes the most widely used ones.

Traditional copy and paste

Occasionally, manual examination followed by copy and paste is the best, and sometimes the only workable, web scraping technique. However, it is error-prone, tedious, and tiresome when large amounts of data need to be scraped (Web scraping, 2015).

Text grabbing and regular expression

This is a simple and powerful approach to obtaining information from web pages. It relies on UNIX text-processing commands such as grep, or on the regular expression matching facilities of a programming language.
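
As a minimal sketch of this technique, the following Python snippet pulls email addresses and link targets out of raw page text with regular expressions. The sample text and both patterns are assumptions made for illustration; the patterns are intentionally simple and would miss many valid addresses and attribute styles found on real pages.

```python
import re

# Hypothetical page fragment used only for this example.
SAMPLE_TEXT = """
<p>Contact: alice@example.com or bob@example.org</p>
<a href="https://example.com/report.pdf">Report</a>
<a href='/about'>About</a>
"""

# Simplified email pattern: local part, "@", domain, dot, top-level domain.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Capture the value of an href attribute in single or double quotes.
HREF_RE = re.compile(r"""href=["']([^"']+)["']""")

emails = EMAIL_RE.findall(SAMPLE_TEXT)
links = HREF_RE.findall(SAMPLE_TEXT)

print(emails)
print(links)
```

Regular expressions work well for flat, predictable patterns like these, but they know nothing about the nesting of HTML elements, which is where the DOM-based technique becomes more suitable.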

Document Object Model (DOM)

By embedding a full web browser control, such as the Internet Explorer or Mozilla browser control, programs can retrieve dynamic content that has been generated by client-side scripts. These browser controls also parse web pages into a DOM tree, from which programs can retrieve sections of the pages.
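
The tree navigation half of this technique can be sketched with Python's standard `xml.dom.minidom` module, which exposes a DOM interface. This is a hedged illustration only: it assumes well-formed markup (real-world HTML usually needs a more lenient parser), it does not execute client-side scripts as a browser control would, and the sample document and the `find_by_id` helper are made up for the example.

```python
from xml.dom import minidom

# Hypothetical, well-formed page used only for this example.
SAMPLE_XHTML = """<html>
  <body>
    <div id="news">
      <h2>Headlines</h2>
      <ul>
        <li><a href="/a">First story</a></li>
        <li><a href="/b">Second story</a></li>
      </ul>
    </div>
    <div id="footer"><a href="/contact">Contact</a></div>
  </body>
</html>"""

# Parse the markup into a DOM tree.
document = minidom.parseString(SAMPLE_XHTML)


def find_by_id(node, wanted):
    """Depth-first search for the element whose id attribute equals `wanted`."""
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            if child.getAttribute("id") == wanted:
                return child
            found = find_by_id(child, wanted)
            if found is not None:
                return found
    return None


# Take only the "news" section of the page and read the links inside it.
news = find_by_id(document, "news")
stories = [
    (a.firstChild.data, a.getAttribute("href"))
    for a in news.getElementsByTagName("a")
]
print(stories)
```

Because the search is restricted to the subtree under the "news" element, the link in the footer is not returned.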

Semantic annotation recognition

Pages that need to be scraped may contain metadata, semantic markup, or annotations that can be used to locate specific data snippets. If the annotations are embedded in the pages, as with microformats, this technique can be viewed as a special case of DOM parsing. Alternatively, the annotations can be organized into a semantic layer that is stored and managed separately from the web pages, so the scraper can retrieve the data schema and instructions from this layer before scraping the pages.
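
For the embedded case, a rough Python sketch follows. It assumes microformats-style class names, where a `p-` prefix marks a plain-text property and a `u-` prefix marks a URL property; the sample markup and the `HCardParser` class are hypothetical, and the parser ignores nested elements and other details that a full microformats parser would handle.

```python
from html.parser import HTMLParser

# Hypothetical annotated fragment used only for this example.
SAMPLE_CARD = """
<div class="h-card">
  <span class="p-name">Jane Doe</span>
  <a class="u-url" href="https://example.com/jane">Profile</a>
  <span class="p-org">Example Corp</span>
</div>
"""


class HCardParser(HTMLParser):
    """Collect p-* properties from element text and u-* properties from href."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._current = None  # name of the p-* property whose text is being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        for cls in (attrs.get("class") or "").split():
            if cls.startswith("p-"):
                self._current = cls[2:]
                self.properties[self._current] = ""
            elif cls.startswith("u-") and attrs.get("href"):
                self.properties[cls[2:]] = attrs["href"]

    def handle_data(self, data):
        if self._current is not None:
            self.properties[self._current] += data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current text property.
        self._current = None


card_parser = HCardParser()
card_parser.feed(SAMPLE_CARD)
card = card_parser.properties
print(card)
```

The scraper never needs to know where on the page the name or URL appears; the annotations themselves identify each data item.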

Web scraping tools

Many software tools can be used to build customized web scraping solutions. Some of them try to recognize the data structure of a page automatically or provide a recording interface that removes the need to write scraping code by hand; others provide scripting functions and database interfaces for extracting, transforming, and storing the content. Some of these tools are listed below:

Diffbot: This is a tool that uses computer vision and machine learning algorithms to collect data from web pages automatically, in the way a human would.
Heritrix: This is a web crawler that was designed for web archiving.
HTTrack: This is a free and open source web crawler and offline browser that was designed to download websites for offline viewing.
Selenium (software): This is a framework for testing web applications.
OutWit Hub: This is a web scraping application with built-in data, image, and document extractors and editors that are used for automatic search and extraction.
Wget: This is a computer program that retrieves content from web servers and supports the HTTP, HTTPS, and FTP protocols.
WSO2 Mashup Server: This tool lets you acquire web-based information from different sources, such as web services.
Yahoo! Query Language (YQL): This is an SQL-like query language that lets you query, filter, and join data across web services.

JavaScript tools

It is also possible to use JavaScript for web scraping tasks. The most commonly used JavaScript tools are as follows:

Node.js: Node.js is an open source, cross-platform JavaScript environment that allows JavaScript code to run without the need for a web browser.
PhantomJS: PhantomJS is a scriptable, headless browser that's used to automate interaction with web pages through the JavaScript API that it provides.
jQuery: jQuery is a feature-rich, cross-platform JavaScript library. With jQuery, which is easy to use and learn, it is possible to develop Ajax applications and select elements in the DOM tree.

Web crawling frameworks

The following can be utilized to build web scrapers:

Scrapy: Scrapy is a free and open source web crawling framework written in Python. It was originally designed for web scraping, and it can also be used to extract data through APIs or as a general-purpose web crawler.
rvest: rvest is an R package written by Hadley Wickham that allows simple data collection from HTML web pages.
RSelenium: RSelenium is designed to make it easy to connect to a Selenium Server/Remote Selenium Server. RSelenium allows connections from the R environment to the Selenium WebDriver API.

Web crawling environment in R

R provides various packages to assist in web scraping operations. These include XML, RCurl, and rjson/RJSONIO/jsonlite. The XML package helps to parse XML and HTML, and provides XPath support for searching XML.

The RCurl package transfers data over various protocols and can be used to compose general HTTP requests, retrieve URLs, submit forms, and so on. It does this through the libcurl library. JSON is an abbreviation of JavaScript Object Notation and is one of the most common data formats used on the web. The rjson, RJSONIO, and jsonlite packages convert between R objects and JSON.
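
To make the idea of converting between in-memory objects and JSON concrete, here is a language-neutral illustration using Python's standard `json` module rather than the R packages named above. The record and its field names are invented for the example.

```python
import json

# Hypothetical record, as a scraper might hold it in memory.
record = {
    "title": "Example page",
    "links": ["/a", "/b"],
    "visited": True,
    "score": None,
}

# Object -> JSON text (True becomes true, None becomes null).
encoded = json.dumps(record)

# JSON text -> object; the round trip gives back an equal structure.
decoded = json.loads(encoded)

print(encoded)
```

The same round trip applies when a web service returns JSON: the response text is decoded into native data structures, which can then be filtered and reshaped like any other data.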

Web scraping largely deals with unstructured data, mostly text, collected from the web. Sources such as blogs, online newspapers, and social networking platforms provide large amounts of text data. This is especially important for researchers in areas such as the social sciences and linguistics. Companies such as Google, Facebook, Twitter, and Amazon provide APIs that allow analysts to retrieve data.

You can access these APIs from R and collect data. For Google services, the RGoogleStorage and RgoogleMaps packages are available. The twitteR and streamR packages are used to retrieve data from Twitter.

For Amazon services, there is the AWS.tools package, which provides access to Amazon Web Services (EC2/S3), and the MTurkR package, which provides access to the Amazon Mechanical Turk Requester API. To access news content, the GuardianR package can be used. This package provides an interface to the Content API of the Guardian Media Group's Open Platform.

Similarly, the RNYTimes package provides broad access to New York Times web services, including article search, metadata, and user-generated content.

Some R packages also provide a complete web scraping environment. This book focuses on the two that are best known and most widely used: rvest and RSelenium.

rvest, which is inspired by Python's Beautiful Soup library, is a package that simplifies scraping data from HTML web pages. It is designed to work with the magrittr package, so it is easy and practical to build scraping scripts as pipelines of simple, easy-to-understand parts.

Selenium is a web automation tool that was originally developed for testing web applications. However, with Selenium, you can also develop web scraping scripts. Selenium drives real web browsers, so all content must be rendered in the browser, which can slow down the data collection process.

Headless browsers such as PhantomJS can speed up this process. The RSelenium package allows you to connect to a Selenium Server from R. It supports unit testing and regression testing of web apps and web pages across a variety of browsers and operating systems.