Analysing HTML code and extracting data
In the previous sections, we learned the basics of HTML, CSS, and XPath. To scrape real-world web pages, the problem now becomesa question of writing the proper CSS or XPath selectors. In this section, we introduce some simple ways to figure out working selectors.
Suppose we want to scrape all available R packages at https://cran.rstudio.com/web/packages/available_packages_by_name.html. The web page looks simple. To figure out the selector expression, right-click on the table and select Inspect Element in the context menu, which should be available in most modern web browsers:
Then the inspector panel shows up and we can see the underlying HTML of the web page. In Firefox and Chrome, the selected node is highlighted so it can be located more easily:
The HTML contains a unique <table>
so we can directly use table
to select it and use html_table()
to extract it out as a data frame:
page <- read_html("https://cran.rstudio.com/web/packages...