Using PDFBox to extract text from PDF documents
The Apache PDFBox (http://pdfbox.apache.org/) project is an API for processing PDF documents. It supports text extraction as well as other tasks, such as document merging, form filling, and PDF creation. We will only illustrate the text extraction process. To demonstrate the use of PDFBox, we will use a file called TestDocument.pdf. This file was created by saving the TestDocument.docx file, used in the Using POI to extract text from Word documents section, as a PDF document.
The process is straightforward. A File object is created for the PDF document. The PDDocument class represents the document, and the PDFTextStripper class performs the actual text extraction using its getText method, as shown here:
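// Imports assume PDFBox 2.x package names
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// getResourcePath() is assumed to return the path to TestDocument.pdf
File file = new File(getResourcePath());
// try-with-resources closes the document once the text has been extracted
try (PDDocument pd = PDDocument.load(file)) {
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(pd);
    System.out.println(text);
}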
The output is as follows:
Jump to navigation Jump to search...