Extracting data from web pages using CSS selectors
In R, the easiest-to-use package for web scraping is rvest. Run the following code to install the package from CRAN:
install.packages("rvest")
First, we load the package and use read_html() to read data/single-table.html and try to extract the table from the web page:
library(rvest)
## Loading required package: xml2
single_table_page <- read_html("data/single-table.html")
single_table_page
## {xml_document}
## <html>
## [1] <head>\n <title>Single table</title>\n</head>
## [2] <body>\n <p>The following is a table</p>\n <table i ...
Note that single_table_page is a parsed HTML document, which is a nested data structure of HTML nodes.
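To make the idea of a nested structure concrete, the following sketch walks one level at a time through a parsed document. It assumes a small inline HTML string as a stand-in for data/single-table.html (the actual file contents are not shown here, so the table body is hypothetical) and uses xml_children() and xml_name() from the xml2 package, on which rvest is built:

```r
library(rvest)

# Hypothetical inline page standing in for data/single-table.html;
# the table contents are an assumption made for illustration.
demo_page <- read_html('<html>
<head><title>Single table</title></head>
<body>
<p>The following is a table</p>
<table id="table1">
<tr><th>Name</th><th>Age</th></tr>
<tr><td>Jenny</td><td>18</td></tr>
</table>
</body>
</html>')

# The children of the root <html> node
top_level <- xml2::xml_name(xml2::xml_children(demo_page))
top_level

# One level deeper: the element children of <body>
body_node <- xml2::xml_children(demo_page)[[2]]
body_level <- xml2::xml_name(xml2::xml_children(body_node))
body_level
```

Each call to xml_children() descends one level in the tree, which is tedious for deeply nested pages. Selectors exist to avoid this kind of manual navigation.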
A typical process for scraping information from such a web page using rvest functions is as follows. First, locate the HTML nodes from which we need to extract data. Then, use either a CSS selector or an XPath expression to filter the HTML...
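The locate-then-extract process might look like the following minimal sketch. It again assumes a hypothetical inline page rather than the real data/single-table.html, and it uses html_nodes(), html_table(), and html_text() from rvest (newer rvest versions also offer html_elements() as an equivalent of html_nodes()):

```r
library(rvest)

# Hypothetical page; the names and values are made up for illustration.
sample_page <- read_html('<html><body>
<p>The following is a table</p>
<table id="table1">
<tr><th>Name</th><th>Age</th></tr>
<tr><td>Jenny</td><td>18</td></tr>
<tr><td>James</td><td>19</td></tr>
</table>
</body></html>')

# Step 1: locate the nodes of interest with a CSS selector
table_nodes <- html_nodes(sample_page, "table")

# Step 2: extract data from the selected nodes;
# html_table() returns one data frame per <table> node
tables <- html_table(table_nodes)
first_table <- tables[[1]]
first_table

# A more specific selector: every <td> cell inside a <table>
cells <- html_text(html_nodes(sample_page, "table td"))
cells
```

The selector string is the only part that changes from page to page. The extraction functions stay the same whether the selector matches one node or many.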