Preparing data
Text extraction is an early step in most NLP tasks. Here, we briefly cover how to extract text from HTML, Word, and PDF documents. Although several APIs support these tasks, we will use:
Boilerpipe (https://code.google.com/p/boilerpipe/) for HTML
POI (http://poi.apache.org/index.html) for Word
PDFBox (http://pdfbox.apache.org/) for PDF
Some APIs support the use of XML for input and output. For example, the Stanford XMLUtils class provides support for reading XML files and manipulating XML data. LingPipe's XMLParser class will parse XML text.
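As a rough illustration of pulling text out of XML in Java, the following sketch uses the DOM parser that ships with the JDK rather than the Stanford or LingPipe classes mentioned above, whose APIs are not shown here. The class name XmlTextSketch, the method textOf, and the sample <sentence> elements are hypothetical and chosen only for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the Stanford or LingPipe APIs.
public class XmlTextSketch {

    // Returns the trimmed text content of every element named tagName.
    public static List<String> textOf(String xml, String tagName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            NodeList nodes = doc.getElementsByTagName(tagName);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            // Malformed XML or parser configuration problems end up here.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample input; real documents would be read from a file.
        String xml = "<doc><sentence>Text extraction comes first.</sentence>"
                   + "<sentence>Tokenization follows.</sentence></doc>";
        for (String sentence : textOf(xml, "sentence")) {
            System.out.println(sentence);
        }
    }
}
```

The extracted strings can then be handed to later steps in the NLP pipeline. A DOM parser loads the whole document into memory, so for very large files a streaming parser may be a better fit.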
Organizations store their data in many forms and frequently it is not in simple text files. Presentations are stored in PowerPoint slides, specifications are created using Word documents, and companies provide marketing and other materials in PDF documents. Most organizations have an Internet presence, which means that much useful information is found in HTML documents. Due to the widespread nature of...