In the previous chapters, we worked with whole web pages, which is rarely practical for a web scraper. Although it is nice to have all of the content from a web page, most of the time you need only small pieces of information from each page. To extract this information, you must learn to parse the standard formats of the web, the most common of which is HTML.
This chapter will cover the following topics:
- What the HTML format is
- Searching using the strings package
- Searching using the regexp package
- Searching using XPath queries
- Searching using Cascading Style Sheets (CSS) selectors