In this recipe, we will learn how to extract text and images from a PDF document. Extracting metadata is covered in the next recipe, Extracting metadata from a PDF document.
We will use the Apache PDFBox API to illustrate this process. The API is fairly complex: it provides a series of classes and methods to identify and manipulate the structure and contents of PDF documents. We will only show how to extract text and images, although more detailed information can also be extracted. To create a sample PDF document, we will use Microsoft Word, though there are other ways of creating PDF documents, including PDFBox itself.