Collecting Data by Scraping Web Pages
The basic building block of any web page is HTML (HyperText Markup Language), a markup language that specifies the structure of your content. HTML is written as a series of tags, which may enclose content. The tags, rather than the content they enclose, tell the browser how to structure and display that content: they can make words bold or italic, add hyperlinks to the text, and embed images. Additional information can be attached to an element through attributes placed inside its opening tag. A web page can therefore be thought of as a document written in HTML, so we need to know the basics of HTML to scrape web pages effectively.
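To make the relationship between tags, attributes, and content concrete, here is a minimal sketch that walks through a small HTML fragment using Python's standard-library html.parser module. The fragment in SAMPLE_HTML, the TagCollector class, and the collect_events helper are hypothetical names chosen for illustration and are not taken from this chapter; a real scraping workflow would more likely rely on a dedicated library.

```python
from html.parser import HTMLParser

# Hypothetical fragment used only for illustration: a paragraph containing
# a hyperlink (with two attributes) that in turn contains bold text.
SAMPLE_HTML = (
    '<p>Visit <a href="https://example.com" title="Example">'
    'our <b>site</b></a></p>'
)


class TagCollector(HTMLParser):
    """Records every opening tag, closing tag, and piece of text it sees."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs from the opening tag.
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag, {}))

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        text = data.strip()
        if text:
            self.events.append(("text", text, {}))


def collect_events(html):
    """Parse an HTML string and return the recorded events in order."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.events


events = collect_events(SAMPLE_HTML)
for kind, value, attributes in events:
    print(kind, value, attributes)
```

Each opening tag is reported together with its attributes, the enclosed text is reported separately, and every opening tag is eventually matched by a closing tag. Extracting exactly these pieces is what scraping a page amounts to.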
The following figure depicts the parts that make up an HTML element:
As you can see in the preceding figure, we can easily identify the different parts of an HTML element. The basic HTML structure and commonly used tags are shown and explained as...