In the previous examples for parsing HTML documents, we treated HTML simply as searchable text, in which you find information by looking for specific strings. Fortunately, HTML documents also have a structure. Each pair of opening and closing tags can be viewed as an object, called a node, which can in turn contain more nodes. This creates a hierarchy of root, parent, and child nodes. HTML documents are very similar to XML documents, although they are not fully XML-compliant. Because of this XML-like structure, we can search for content in a page using XPath queries.
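As a minimal sketch of this idea, the Python snippet below parses a small page into a tree of nodes and selects its links with XPath-style queries. The sample markup and its URLs are hypothetical and deliberately well-formed, because the standard library's `xml.etree.ElementTree` accepts only valid XML and implements only a subset of XPath. Real-world HTML usually needs a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (an assumption for illustration only).
PAGE = """
<html>
  <body>
    <div id="nav">
      <a href="/home">Home</a>
      <a href="/about">About</a>
    </div>
    <p>See the <a href="https://example.com/docs">docs</a>.</p>
  </body>
</html>
"""

# Parsing yields the root node (<html>); its children are nodes as well.
root = ET.fromstring(PAGE)

# ".//a" selects every <a> element at any depth below the root.
links = [a.get("href") for a in root.findall(".//a")]

# A narrower query: only <a> elements that are direct children of
# the <div> whose id attribute is "nav".
nav_links = [a.get("href") for a in root.findall(".//div[@id='nav']/a")]

print(len(links), links)
print(nav_links)
```

The query describes where in the hierarchy the wanted nodes sit, so there is no need to scan the text for the string `<a` and then pick the attribute out by hand.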
XPath queries define a way to traverse the hierarchy of nodes in an XML document and return the elements that match. In our previous examples, where we were looking for <a> tags in order to count and retrieve links, we needed...