Understanding HTML documents
To extract data from the fetched web pages, we need to isolate and manipulate the structural elements that contain the desired information. That's why a basic understanding of the general structure of web pages is helpful when web scraping. If you've done web scraping before, perhaps in a different programming language, or if you already know enough about HTML documents, feel free to skip this section. On the other hand, if you're new to this or just need a quick refresher, please read on.
Hypertext Markup Language (HTML) is the standard markup language for creating web pages and web applications. HTML goes hand in hand with HTTP, the protocol used to transmit HTML documents over the internet.
The building blocks of HTML pages are HTML elements. They provide both the content and the structure of a web page. Elements can be nested inside one another, which defines the relationships between them (parent, child, sibling, ancestor, and so on). HTML elements...