Metadata found in a PDF document can be extracted using the Apache PDFBox API. We will demonstrate how this is done using the PDF document created in the recipe, Extracting text from a PDF document. While numerous properties are available, we will only illustrate how a small subset can be obtained.
Extracting metadata from a PDF document
Getting ready
To prepare for this recipe, we need to do the following:
- Create a new Maven project.
- Add the following dependency to the project's POM file:
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.13</version>
</dependency>
- Use the PDF document created in the recipe...
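
With the dependency in place, reading a few properties might look like the following. This is a minimal sketch against the PDFBox 2.0.x API declared above, not the recipe's own listing: the class name PdfMetadataSketch, the helper methods, and the default file name example.pdf are hypothetical and chosen only for illustration. Properties that are absent from the document are reported as "(not set)".

```java
import java.io.File;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;

// Hypothetical class name, used only for this sketch.
class PdfMetadataSketch {

    // Collects a small subset of the document information properties
    // into a map that preserves insertion order.
    static Map<String, String> describe(PDDocumentInformation info) {
        Map<String, String> properties = new LinkedHashMap<>();
        properties.put("Title", orNotSet(info.getTitle()));
        properties.put("Author", orNotSet(info.getAuthor()));
        properties.put("Subject", orNotSet(info.getSubject()));
        properties.put("Keywords", orNotSet(info.getKeywords()));
        properties.put("Creator", orNotSet(info.getCreator()));
        properties.put("Producer", orNotSet(info.getProducer()));
        properties.put("Created", formatDate(info.getCreationDate()));
        return properties;
    }

    // Metadata entries are optional, so a getter may return null.
    static String orNotSet(String value) {
        return value == null ? "(not set)" : value;
    }

    // Formats a date in the time zone stored with the Calendar value.
    static String formatDate(Calendar date) {
        if (date == null) {
            return "(not set)";
        }
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        format.setTimeZone(date.getTimeZone());
        return format.format(date.getTime());
    }

    public static void main(String[] args) throws IOException {
        // "example.pdf" is a placeholder; pass the real path as an argument.
        String path = args.length > 0 ? args[0] : "example.pdf";

        // PDDocument is Closeable, so try-with-resources releases the file.
        try (PDDocument document = PDDocument.load(new File(path))) {
            PDDocumentInformation info = document.getDocumentInformation();
            describe(info).forEach(
                    (name, value) -> System.out.println(name + ": " + value));
        }
    }
}
```

Keeping the property lookup in describe, separate from the file loading in main, makes it easy to add or remove properties later. Note that PDDocument.load belongs to the 2.0.x line used here; later major versions of PDFBox load documents differently.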