Using boilerpipe to extract text from HTML
Several libraries are available for extracting text from HTML documents. We will demonstrate how to use boilerpipe (https://code.google.com/p/boilerpipe/) for this task. It is a flexible API that can extract the entire text of an HTML document as well as selected parts of it, such as its title and individual text blocks. We will use the HTML page at http://en.wikipedia.org/wiki/Berlin to illustrate the use of boilerpipe. Part of this page is shown in the following screenshot:
In order to use boilerpipe, you will need to download the binary for the Xerces Parser, which can be found at http://xerces.apache.org/index.html.
We start by creating a URL object that represents this page. We will use two classes to extract text. The first is the HTMLDocument class that represents the HTML document. The second is the TextDocument class that represents the text within an HTML document. It consists of one or more...
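The first step described above, creating the URL object for the sample page, can be sketched with the standard library alone. The class name BerlinPageUrl and the helper createPageUrl are hypothetical names chosen for this illustration, and the sketch assumes only java.net; it does not call boilerpipe and opens no network connection.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

public class BerlinPageUrl {
    // Address of the sample page used in this section.
    static final String PAGE_ADDRESS = "http://en.wikipedia.org/wiki/Berlin";

    // Hypothetical helper: builds the URL object for the sample page.
    // Constructing a URL only parses the address; nothing is downloaded.
    static URL createPageUrl() {
        try {
            return URI.create(PAGE_ADDRESS).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalStateException("Invalid page address: " + PAGE_ADDRESS, e);
        }
    }

    public static void main(String[] args) {
        URL url = createPageUrl();
        System.out.println(url.getProtocol()); // prints "http"
        System.out.println(url.getHost());     // prints "en.wikipedia.org"
        System.out.println(url.getPath());     // prints "/wiki/Berlin"
    }
}
```

A URL object built this way is what would then be handed to boilerpipe to fetch the page and produce the HTMLDocument; that step requires the boilerpipe and Xerces JARs on the classpath.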