11.8 Reading HTML documents
A great deal of content on the web is presented using HTML. A browser renders this content for a human reader, but we can also write applications to extract content from HTML pages.
Parsing HTML involves two complications:
Ancient HTML dialects that are distinct from modern XML
Browsers that tolerate incorrect HTML and still create a proper display
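Both complications can be seen in a minimal sketch that uses only the Python standard library. The sample markup, the TextCollector class, and the strict_parse_ok() function are hypothetical names made up for this illustration. The sketch assumes that a strict XML parser stands in for "modern XML" rules and that html.parser stands in for a browser-like tolerant reader. It is not a full HTML extraction tool.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical sample: <p> and <li> are never closed, and <br> has no end tag.
# A browser would display this without complaint.
SLOPPY_HTML = """<html><body>
<p>First paragraph<br>
<ul><li>One<li>Two</ul>
</body></html>"""


class TextCollector(HTMLParser):
    """Collect the non-blank text found between tags, ignoring structure."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def strict_parse_ok(source):
    """Return True if the source is well-formed XML, False otherwise."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False


# The tolerant parser reports text even though the tags are unbalanced.
collector = TextCollector()
collector.feed(SLOPPY_HTML)
collector.close()
print(collector.chunks)

# The strict XML parser rejects the same document.
print(strict_parse_ok(SLOPPY_HTML))
```

The tolerant parser yields the text content, while the XML parser refuses the document because of the mismatched tags. This is why an XML parser alone is often not enough for HTML found in the wild.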
The first complication is the history of HTML and XML. Modern HTML can be written as a specific document type of XML. Historically, however, HTML started with its own document type definitions, based on the older SGML. The original SGML/HTML concepts were later revised and simplified to create a new language, XML. During the transition from legacy HTML to XML-based HTML, web servers provided content using a variety of transitional document type definitions. Most modern web servers use a <!DOCTYPE html> preamble to state that the document is properly structured XML syntax, using...