You're reading from R Web Scraping Quick Start Guide Techniques and tools to crawl and scrape data from websites

Product type Paperback

Published in Oct 2018

Publisher Packt

ISBN-13 9781789138733

Length 114 pages

Edition 1st Edition

Concepts

Data Mining

Author (1):

Olgun Aydin


Data is an essential part of any research, whether academic, marketing, or scientific. The World Wide Web (WWW) contains all kinds of information from different sources, including social, financial, security, and academic resources, all accessible via the internet.

People may want to collect and analyse data from multiple websites. Websites, even those that belong to the same category, display information in different formats. Even on a single website, you may not be able to see all the data at once; it may be spread across multiple pages under various sections.

Most websites do not offer a way to save a copy of the data to your local storage. Often the only option is to manually copy and paste the data shown by the website into a local file on your computer. This is a tedious process that can take a lot of time.

Web scraping is a technique for extracting data from multiple websites into a single spreadsheet or database, so that the data becomes easier to analyse or visualize. It transforms unstructured data on the web into a centralized local database.
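To make the idea concrete, here is a minimal sketch in Python using only the standard library. It turns an HTML table into spreadsheet-style rows. The `SAMPLE_HTML` fragment, the `TableParser` class, and the `html_table_to_csv` helper are all hypothetical names introduced for illustration; a real scraper would first download the page and would need to cope with messier markup.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download this HTML first.
SAMPLE_HTML = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Laptop</td><td>899</td></tr>
  <tr><td>Phone</td><td>499</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = None        # row currently being filled, if any
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Text outside a cell (whitespace between tags) is ignored.
        if self._in_cell:
            self._row[-1] += data.strip()


def html_table_to_csv(html):
    """Return the table rows and the same rows rendered as CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return parser.rows, buffer.getvalue()


rows, csv_text = html_table_to_csv(SAMPLE_HTML)
print(csv_text)
```

The CSV text could then be written to a file or loaded into a database, which is the "centralized local database" step described above.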

Many well-known organizations, including Google, Amazon, Wikipedia, and Facebook, provide APIs (Application Programming Interfaces), which expose functions and data structures that other software can interact with directly. Where an API is available, data collection from those websites is fast and can be performed without any web scraping software.
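As a rough illustration of why an API makes collection easier, the sketch below parses a JSON body of the kind an API might return. The payload, its field names (`items`, `title`, `pages`), and the `extract_records` helper are invented for this example and do not correspond to any particular provider's API.

```python
import json

# Hypothetical JSON body; real APIs define their own fields and structure.
SAMPLE_RESPONSE = """
{
  "items": [
    {"title": "Example A", "pages": 120},
    {"title": "Example B", "pages": 80}
  ]
}
"""


def extract_records(body):
    """Turn the JSON text into a list of (title, pages) tuples."""
    payload = json.loads(body)
    return [(item["title"], item["pages"]) for item in payload["items"]]


records = extract_records(SAMPLE_RESPONSE)
print(records)
```

Because the data already arrives in a structured form, no HTML parsing is needed.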

Web scraping commonly relies on the fact that web pages are semi-structured and can naturally be represented as labeled, rooted trees. In these trees, the labels are the tags of the HTML markup language, and the tree hierarchy represents the different nesting levels of the elements that make up the web page. The representation of a web page as an ordered, labeled, rooted tree is referred to as the DOM (Document Object Model), which was standardized largely by the World Wide Web Consortium (W3C).
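The tree view of a page can be sketched in a few lines of Python with the standard library's `html.parser`. The `Node` and `TreeBuilder` classes, the `outline` function, and the `SAMPLE_PAGE` string are hypothetical names for illustration. The sketch assumes that every tag in the input is explicitly closed; real browsers build the DOM with far more elaborate rules.

```python
from html.parser import HTMLParser

# Hypothetical, fully closed HTML page used only for illustration.
SAMPLE_PAGE = (
    "<html><head><title>Demo</title></head>"
    "<body><h1>Hello</h1><p>Some <b>bold</b> text</p></body></html>"
)


class Node:
    """One element in the tree: a tag label, a parent, and ordered children."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Build a labeled rooted tree; assumes every tag is explicitly closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        # A start tag opens a new child of the current node and descends into it.
        node = Node(tag, self._current)
        self._current.children.append(node)
        self._current = node

    def handle_endtag(self, tag):
        # An end tag moves back up one nesting level.
        if self._current.parent is not None:
            self._current = self._current.parent


def outline(node, depth=0):
    """Return one line per node, indented by its nesting level."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


builder = TreeBuilder()
builder.feed(SAMPLE_PAGE)
builder.close()
tree_lines = outline(builder.root)
print("\n".join(tree_lines))
```

Each level of indentation in the printed outline corresponds to one nesting level of the HTML elements.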

The general idea behind the DOM starts from the fact that HTML web pages are plain text annotated with HTML tags, which are keywords defined by the markup language. The browser interprets these tags to render web-specific items. HTML tags nest in a hierarchical structure, and the DOM captures this hierarchy as a document tree whose nodes represent the HTML tags. We will take a look at DOM structures while we focus on XPath rules.
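As a small preview of how path expressions walk this hierarchy, the sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath and requires well-formed markup. The `SAMPLE_PAGE` content and the `article` and `footer` class names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page; ElementTree cannot parse sloppy real-world HTML.
SAMPLE_PAGE = """
<html>
  <body>
    <div class="article">
      <h1>Title</h1>
      <p>First paragraph</p>
      <p>Second paragraph</p>
    </div>
    <div class="footer"><p>Contact</p></div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE_PAGE.strip())

# Select <p> elements that are direct children of the "article" <div>.
article_paragraphs = [p.text for p in root.findall(".//div[@class='article']/p")]

# Select every <p> element anywhere below the root, in document order.
all_paragraphs = [p.text for p in root.findall(".//p")]

print(article_paragraphs)
print(all_paragraphs)
```

The first expression follows the parent-child structure of the tree, while the second ignores nesting depth and matches by tag name alone.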