Parsing and modifying an HTML page with dom.d
My dom.d module is an HTML and XML parser that can understand much of the tag soup found on the Web. Once it parses a document, it provides a JavaScript-style DOM API for easy inspection and manipulation of the document tree.
Here, we'll use the library to extract some meta-information and text from an HTML page, then modify the page and save a local copy. Along the way, we'll explore the library's features and its implementation, which uses several of the techniques we've learned in this book.
Getting ready
Download dom.d and characterencodings.d from my GitHub repository. The module has no other dependencies, so you do not need to download any additional files or libraries.
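As a quick check that both files are in place, a minimal sketch like the following should build once the downloads sit next to it. The file name app.d and the sample markup are assumptions made up for illustration; the calls used (parseGarbage, title, querySelector, innerText) are the ones I expect to find in arsd.dom, but check them against the copy of dom.d you downloaded.

```d
// app.d (hypothetical file name): smoke test for the downloaded files
import arsd.dom;
import std.stdio;

void main() {
    // Sample markup invented for this check; it is not from a real page.
    string html = "<html><head><title>Test page</title></head>"
        ~ "<body><p>Hello</p></body></html>";

    // Parse the string in the lenient mode.
    auto document = new Document();
    document.parseGarbage(html);

    // Read the page title through the Document object.
    writeln(document.title);

    // querySelector may return null when nothing matches, so check first.
    auto paragraph = document.querySelector("p");
    if (paragraph !is null)
        writeln(paragraph.innerText);
}
```

Passing all three files to the compiler together, for example dmd app.d dom.d characterencodings.d, should be enough, since dom.d declares its own module name.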
How to do it…
Let's execute the following steps to parse and modify an HTML page:
Import arsd.dom.
Create an instance of the Document class.
Pass an unvalidated HTML string to the parseGarbage method, or if you want strict checks on case and well-formedness, use parseStrict. It will throw exceptions when it encounters bad...